jmlr jmlr2005 jmlr2005-10 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Marcus Hutter, Jan Poland
Abstract: When applying aggregating strategies to Prediction with Expert Advice (PEA), the learning rate must be adaptively tuned. The natural choice of √(complexity/current loss) renders the analysis of Weighted Majority (WM) derivatives quite complicated. In particular, for arbitrary weights there have been no results proven so far. The analysis of the alternative Follow the Perturbed Leader (FPL) algorithm from Kalai and Vempala (2003) based on Hannan’s algorithm is easier. We derive loss bounds for adaptive learning rate and both finite expert classes with uniform weights and countable expert classes with arbitrary weights. For the former setup, our loss bounds match the best known results so far, while for the latter our results are new. Keywords: prediction with expert advice, follow the perturbed leader, general weights, adaptive learning rate, adaptive adversary, hierarchy of experts, expected and high probability bounds, general alphabet and loss, online sequential prediction
Reference: text
sentIndex sentText sentNum sentScore
1 IDSIA, Galleria 2, CH-6928 Manno-Lugano, Switzerland. Editor: Manfred Warmuth. Abstract: When applying aggregating strategies to Prediction with Expert Advice (PEA), the learning rate must be adaptively tuned. [sent-3, score-0.089]
2 The natural choice of √(complexity/current loss) renders the analysis of Weighted Majority (WM) derivatives quite complicated. [sent-4, score-0.12]
3 In particular, for arbitrary weights there have been no results proven so far. [sent-5, score-0.075]
4 We derive loss bounds for adaptive learning rate and both finite expert classes with uniform weights and countable expert classes with arbitrary weights. [sent-7, score-1.063]
5 For the former setup, our loss bounds match the best known results so far, while for the latter our results are new. [sent-8, score-0.181]
6 Keywords: prediction with expert advice, follow the perturbed leader, general weights, adaptive learning rate, adaptive adversary, hierarchy of experts, expected and high probability bounds, general alphabet and loss, online sequential prediction 1. [sent-9, score-0.77]
7 The goal of the master algorithm is to perform nearly as well as the best expert in the class, on any sequence of outcomes. [sent-12, score-0.407]
8 A little later, incrementally adaptive algorithms were developed by Auer and Gentile (2000); Auer et al. [sent-19, score-0.095]
9 Unfortunately, the loss bound proofs for the incrementally adaptive WM variants are quite complex and technical, despite the typically simple and elegant proofs for a static learning rate. [sent-23, score-0.437]
10 While for the original WM algorithm, assertions are proven for countable classes of experts with arbitrary weights, the modern variants usually restrict to finite classes with uniform weights (an exception being Gentile, 2003). [sent-25, score-0.241]
11 Furthermore, most authors have concentrated on predicting binary sequences, often with the 0/1 loss for {0, 1}-valued and the absolute loss for [0, 1]-valued predictions. [sent-32, score-0.324]
12 Nevertheless, it is easy to abstract completely from the predictions and consider the resulting losses only. [sent-34, score-0.11]
13 Instead of predicting according to a “weighted majority” in each time step, one chooses a single expert with a probability depending on its past cumulative loss. [sent-35, score-0.369]
14 A major advantage we will discover in this work is that its analysis remains easy for an adaptive learning rate, in contrast to the WM derivatives. [sent-43, score-0.072]
15 Moreover, it generalizes to online decision problems other than PEA. [sent-44, score-0.073]
16 Bounds on the cumulative regret of the standard form √(kL) (where k is the complexity and L is the cumulative loss of the best expert in hindsight) are shown for countable expert classes with arbitrary weights, adaptive learning rate, and arbitrary losses. [sent-47, score-1.101]
17 Regarding the adaptive learning rate, we obtain proofs that are simpler and more elegant than for the corresponding WM algorithms. [sent-48, score-0.114]
18 ) Further, we prove the loss bounds for arbitrary weights first and adaptive learning rate. [sent-50, score-0.328]
19 In order to obtain the optimal √(kL) bound in this case, we will need to introduce a hierarchical version of FPL, while without hierarchy we show a worse bound k√L. [sent-51, score-0.115]
20 (For self-confident learning rate together with uniform weights and arbitrary losses, one can prove corresponding results for a variant of WM by adapting an argument by Auer et al. [sent-52, score-0.121]
21 ) PEA usually refers to an online worst case setting: n experts that deliver sequential predictions over a time range t = 1, . [sent-54, score-0.236]
22 The goal is to give a prediction such that the overall loss after T steps is “not much worse” than the best expert’s loss on any sequence of outcomes. [sent-59, score-0.319]
23 If the prediction is deterministic, then an adversary could choose a sequence which provokes maximal loss. [sent-60, score-0.111]
24 Consequently, we ask for a prediction strategy such that the expected loss on any sequence is small. [sent-62, score-0.199]
25 While Kalai and Vempala consider general online decision problems in finite-dimensional spaces, we focus on online prediction tasks based on a countable number of experts. [sent-65, score-0.182]
26 Like Kalai and Vempala (2003) we exploit the infeasible FPL predictor (IFPL) in our analysis. [sent-66, score-0.143]
27 The upper and lower bounds on IFPL are combined to derive various regret bounds on FPL in Section 5. [sent-76, score-0.296]
28 Bounds for static and dynamic learning rate in terms of the sequence length follow straightforwardly. [sent-77, score-0.285]
29 The proof of our main bound in terms of the loss is much more elegant than the analysis of previous comparable results. [sent-78, score-0.197]
30 In particular, we show that the derived bounds also hold for an adaptive adversary. [sent-82, score-0.133]
31 Section 9 treats some additional issues, including bounds with high probability, computational aspects, deterministic predictors, and the absolute loss. [sent-83, score-0.122]
32 We are asked to perform sequential predictions yt ∈ Y at times t = 1, 2, . [sent-88, score-0.075]
33 At each time step t, we have access to the predictions (yti )1≤i≤n of n experts {e1 , . [sent-92, score-0.163]
34 , en }, where the size of the expert pool is n ∈ ℕ ∪ {∞}. [sent-95, score-0.322]
35 It is convenient to use the same notation for finite (n ∈ ℕ) and countably infinite (n = ∞) expert pools. [sent-96, score-0.348]
36 the loss might be 1 if the expert made an erroneous prediction and 0 otherwise. [sent-100, score-0.499]
37 ) Our goal is to achieve a total loss “not much worse” than the best expert, after t time steps. [sent-102, score-0.12]
38 We admit n ∈ ℕ ∪ {∞} experts, each of which is assigned a known complexity k^i ≥ 0. [sent-103, score-0.198]
39 Usually we require ∑_i e^(−k^i) ≤ 1, which implies that the k^i are valid lengths of prefix code words, for instance k^i = ln n if n < ∞ or k^i = 1/2 + 2 ln i if n = ∞. [sent-104, score-0.656]
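As a quick numerical sketch (not from the paper; the function names below are made up for illustration), one can check that both example complexity assignments indeed satisfy ∑_i e^(−k^i) ≤ 1:

```python
import math

def kraft_sum_finite(n):
    # Uniform complexities k^i = ln n for a finite class of n experts.
    return sum(math.exp(-math.log(n)) for _ in range(n))  # = n * (1/n) = 1

def kraft_sum_countable(terms=1_000_000):
    # k^i = 1/2 + 2 ln i for i = 1, 2, ...; the sum is
    # e^{-1/2} * sum_i 1/i^2 -> e^{-1/2} * pi^2 / 6, roughly 0.998 <= 1.
    return sum(math.exp(-(0.5 + 2.0 * math.log(i))) for i in range(1, terms + 1))

print(kraft_sum_finite(10))    # 1.0 up to rounding
print(kraft_sum_countable())   # ~0.998
```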
40 If n is finite, then usually one sets ki = ln n for all i; this is the case of uniform complexities/weights. [sent-107, score-0.229]
41 If the set of experts is countably infinite (n = ∞), uniform complexities are not possible. [sent-108, score-0.203]
42 At each time t, each expert i suffers a loss s_t^i = Loss(x_t, y_t^i) ∈ [0, 1], and s_t = (s_t^i)_{1≤i≤n} is the vector of all losses at time t. [sent-110, score-0.61]
43 I_t := argmin_{i∈E}{s^i_{<t} + (k^i − q^i)/η_t} = randomized prediction of FPL. [sent-123, score-0.095]
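A minimal sketch of this choice rule (illustrative only; it assumes the perturbations q^i are drawn i.i.d. from an exponential distribution with rate 1, which is consistent with the tail bound P[q^i ≥ a] = e^(−a) used below, and the variable names are hypothetical):

```python
import math
import random

def fpl_choose(cum_loss, k, eta_t, rng=random):
    """One FPL decision: argmin_i { s^i_{<t} + (k^i - q^i) / eta_t }.

    cum_loss[i] -- past cumulative loss s^i_{<t} of expert i
    k[i]        -- complexity k^i of expert i
    eta_t       -- current learning rate (> 0)
    """
    best_i, best_val = 0, float("inf")
    for i in range(len(cum_loss)):
        q = rng.expovariate(1.0)                 # perturbation q^i ~ Exp(1)
        val = cum_loss[i] + (k[i] - q) / eta_t   # perturbed score of expert i
        if val < best_val:
            best_i, best_val = i, val
    return best_i

# Example: three experts with uniform complexities ln 3 and learning rate 0.5.
print(fpl_choose([1.0, 2.0, 0.5], [math.log(3)] * 3, 0.5))
```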
44 The bound on P[max] for any a ∈ ℝ (including negative a) follows from P[max_i{q^i − k^i} ≥ a] = P[∃i : q^i − k^i ≥ a] ≤ ∑_{i=1}^{n} P[q^i − k^i ≥ a] ≤ ∑_{i=1}^{n} e^(−a−k^i) = u·e^(−a), where the first inequality is the union bound. [sent-125, score-0.746]
45 Using E[z] ≤ E[max{0,z}] = ∫_0^∞ P[max{0,z} ≥ y] dy = ∫_0^∞ P[z ≥ y] dy (valid for any real-valued random variable z) for z = max_i{q^i − k^i} − ln u, this implies E[max_i{q^i − k^i} − ln u] ≤ ∫_0^∞ P[max_i{q^i − k^i} ≥ y + ln u] dy ≤ ∫_0^∞ e^(−y) dy = 1, which proves the bound on E[max]. [sent-126, score-0.137]
46 If n is finite, a lower bound E[max_i q^i] ≥ 0.57721 + ln n can be derived, showing that the upper bound on E[max] is quite tight (at least) for k^i = 0 ∀i. [sent-127, score-0.188]
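Both bounds can be checked empirically with a short Monte Carlo sketch (illustrative only; the parameters are arbitrary and the exponential perturbations are the assumption stated above):

```python
import math
import random

n = 50
k = [math.log(n)] * n                  # uniform complexities, so u = sum_i e^{-k^i} = 1
u = sum(math.exp(-ki) for ki in k)

samples = 100_000
est = sum(max(random.expovariate(1.0) - ki for ki in k)
          for _ in range(samples)) / samples

# Upper bound: E[max_i {q^i - k^i}] <= ln u + 1 = 1.  For k^i = ln n the estimate
# comes out near Euler's constant 0.5772, which matches the stated lower bound
# E[max_i q^i] >= 0.57721 + ln n after subtracting ln n.
print(est, "<=", math.log(u) + 1.0)
```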
48 3) to arbitrary weights, establishing a relation between IFPL and the best expert in hindsight. [sent-130, score-0.345]
49 Theorem 2 (IFPL bounded by BEH) Let D ⊆ ℝ^n, s_t ∈ ℝ^n for 1 ≤ t ≤ T (both D and s may even have negative components, but we assume that all required extrema are attained), and q, k ∈ ℝ^n. [sent-131, score-0.107]
50 If η_t > 0 is decreasing in t, then the loss of the infeasible FPL, knowing s_t at time t in advance (l.h.s.), can be bounded in terms of the best predictor in hindsight (first term on r.h.s.). [sent-132, score-0.335]
52 Note that if D = E (or D = ∆) and s_t ≥ 0, then all extrema in the theorem are attained almost surely. [sent-139, score-0.143]
53 Consider the losses s̃_t = s_t + (k − q)(1/η_t − 1/η_{t−1}) for the moment. [sent-143, score-0.196]
54 We first show by induction on T that the infeasible predictor M(s̃_{1:t}) has zero regret for any loss s̃, i.e. [sent-144, score-0.461]
55 For the induction step from T−1 to T we need to show M(s̃_{1:T}) ◦ s̃_T ≤ M(s̃_{1:T}) ◦ s̃_{1:T} − M(s̃_{1:T−1}) ◦ s̃_{1:T−1}. ... For η_t > 0, the expected loss of the infeasible FPL exceeds the loss of expert i by at most k^i/η_T: r_{1:T} ≤ s^i_{1:T} + k^i/η_T ∀i. [sent-148, score-0.87]
56 Theorem 2 can be generalized to expert dependent factorizable η_t^i = η_t·η^i by scaling k^i → k^i/η^i and q^i → q^i/η^i. [sent-149, score-1.028]
57 Using E[max_i{(q^i − k^i)/η^i}] ≤ E[max_i{q^i − k^i}]/min_i{η^i}, Corollary 3 generalizes to E[∑_{t=1}^T M(s_{1:t} + (k−q)/η_t^i) ◦ s_t] ≤ s^i_{1:T} + k^i/η^i_T + 1/η_min ∀i, where η_min := min_i{η^i_T}. [sent-150, score-0.097]
58 For example, for η_t^i = √(k^i/t) we get the desired bound s^i_{1:T} + √(T·(k^i + 4)). [sent-151, score-0.309]
59 Recall that ℓ_t = E[M(s_{<t} + (k−q)/η_t) ◦ s_t] is at most a factor e^(η_t) larger than for the infeasible FPL: ℓ_t ≤ e^(η_t)·r_t; furthermore, if η_t ≤ 1, then also ... , which implies ℓ_{1:T} − r_{1:T} ≤ ∑_{t=1}^T η_t·ℓ_t. [sent-156, score-0.12]
60 Since this value has to be chosen in advance, a static choice of ηt requires certain prior information and therefore is not practical in many cases. [sent-166, score-0.145]
61 However, the static bounds are very easy to derive, and they provide a good means to compare different PEA algorithms. [sent-167, score-0.206]
62 If, on the other hand, the algorithm shall be applied without appropriate prior knowledge, a dynamic choice of η_t, depending only on t and/or past observations, is necessary. [sent-168, score-0.072]
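For illustration only (the specific schedules below are common choices consistent with this discussion, not a transcription of the paper's tuning), the contrast between a static and a dynamic choice might look like:

```python
import math

def eta_static(k, T):
    # Static rate: requires the horizon T (and complexity k) in advance.
    return math.sqrt(k / T)

def eta_dynamic(k, t):
    # Dynamic rate: uses only quantities available at time t.
    return math.sqrt(k / t)

def eta_self_confident(k, best_cum_loss):
    # "Self-confident" style rate driven by the current loss estimate,
    # in the spirit of the sqrt(complexity / current loss) tuning.
    return math.sqrt(k / max(best_cum_loss, 1.0))
```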
63 FPL_K makes the randomized prediction I_t^K := argmin_{i∈E_K}{s^i_{<t} + (k^i − q^i)/η_t^K} with η_t^K := √(K/2t) and suffers loss ... ; assume η_t > 0 is a decreasing sequence. [sent-170, score-0.241]
64 Then the loss of FPL for uniform complexities (l.h.s.) can be lower-bounded in terms of the best predictor in hindsight (first term on r.h.s.). [sent-171, score-0.176]
66 Randomizing independently for each t as described in the previous section, the actual loss is u_t = M(s_{<t} + (k−q)/η_t) ◦ s_t ... < τ^i. [sent-181, score-0.12]
67 Selecting τi = ki implies bounds for FPL with entering times similar to the ones we derived here. [sent-183, score-0.259]
68 Another use of wt from the second last paragraph is the following: If the decision space is D =∆, then FPL may make a deterministic decision d =wt ∈∆ at time t with bounds now holding for sure, instead of selecting ei with probability wti . [sent-186, score-0.158]
69 For example, for the absolute loss s_t^i = |x_t − y_t^i| with observation x_t ∈ [0,1] and predictions y_t^i ∈ [0,1], a master algorithm predicting deterministically w_t ◦ y_t ∈ [0,1] suffers absolute loss |x_t − w_t ◦ y_t| ≤ ∑_i w_t^i·|x_t − y_t^i| = ℓ_t, and hence has the same (or better) performance guarantees as FPL. [sent-187, score-0.73]
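A sketch of such a deterministic master (illustrative; estimating the weights w_t^i by repeatedly sampling the perturbed leader is an assumption of this sketch, not the paper's prescription, and all names are hypothetical):

```python
import random

def perturbed_leader(cum_loss, k, eta_t):
    # One FPL draw: perturb the complexities with Exp(1) noise, take the argmin.
    scores = [cum_loss[i] + (k[i] - random.expovariate(1.0)) / eta_t
              for i in range(len(cum_loss))]
    return min(range(len(scores)), key=scores.__getitem__)

def fpl_weights(cum_loss, k, eta_t, samples=2000):
    # Monte Carlo estimate of w_t^i = P[expert i is the perturbed leader].
    counts = [0] * len(cum_loss)
    for _ in range(samples):
        counts[perturbed_leader(cum_loss, k, eta_t)] += 1
    return [c / samples for c in counts]

def deterministic_prediction(expert_preds, cum_loss, k, eta_t):
    # For the absolute loss with predictions in [0,1], the w_t-weighted average
    # prediction has absolute loss <= sum_i w_t^i |x_t - y_t^i|, i.e. it is never
    # worse than the expected loss of sampling a single expert.
    w = fpl_weights(cum_loss, k, eta_t)
    return sum(wi * yi for wi, yi in zip(w, expert_preds))
```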
70 In general, masters can be chosen deterministic if prediction space Y and loss-function Loss(x,y) are convex. [sent-188, score-0.081]
71 For xt ,yti ∈ {0,1}, the absolute loss |xt − pt | of a master deterministically predicting pt ∈ [0,1] actually coincides with the pt -expected 0/1 loss of a master predicting 1 with probability pt . [sent-189, score-0.686]
72 Hence a regret bound for the absolute loss also implies the same regret for the 0/1 loss. [sent-190, score-0.54]
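The identity behind this remark can be written out in one line (a simple check, not quoted from the paper's text):

```latex
% For x_t in {0,1}, the p_t-expected 0/1 loss of predicting 1 with probability p_t:
\mathbb{E}[\text{0/1 loss}]
  = p_t\,[x_t \neq 1] + (1-p_t)\,[x_t \neq 0]
  = p_t(1-x_t) + (1-p_t)\,x_t
  = |x_t - p_t| .
```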
73 Discussion and Open Problems How does FPL compare with other expert advice algorithms? [sent-192, score-0.394]
74 Here the coefficient of the regret term √(KL), referred to as the leading constant in the sequel, is 2 for FPL (Theorem 5). [sent-195, score-0.199]
75 It is thus a factor of √2 worse than the Hedge bound for arbitrary loss by Freund and Schapire (1997), which is sharp in some sense (Vovk, 1995). [sent-196, score-0.178]
76 For special loss functions, the bounds can sometimes be improved, e. [sent-199, score-0.181]
77 to a leading constant of 1 in the static (randomized) WM case with 0/1 loss (Cesa-Bianchi et al. [sent-201, score-0.29]
78 Not knowing the right learning rate in advance usually costs a factor of 2. [sent-205, score-0.111]
79 Also for binary prediction with uniform complexities and 0/1 loss, this result has been established recently – Yaroshinsky et al. [sent-207, score-0.113]
80 (2004) show a dynamic regret bound with leading constant √2·(1+ε). [sent-208, score-0.306]
81 Remarkably, the best dynamic bound for a WM variant proven by Auer et al. [sent-209, score-0.107]
82 Considering the difference in the static case, we therefore conjecture that a bound with leading constant of 2 holds for a dynamic Hedge algorithm. [sent-211, score-0.305]
83 While there are several dynamic bounds for uniform weights, the only previous result for non-uniform weights we know of is (Gentile, 2003, Cor. [sent-213, score-0.185]
84 16), which gives the dynamic bound ℓ^Gentile_{1:T} ≤ s^i_{1:T} + i + O(√((s^i_{1:T}+i)·ln(s^i_{1:T}+i))) for a p-norm algorithm for the absolute loss. [sent-214, score-0.144]
85 This is comparable to our bound for rapidly decaying weights w^i = exp(−i), i.e. [sent-215, score-0.087]
86 Our hierarchical FPL bound in Theorem 9 (b) generalizes this to arbitrary weights and losses and strengthens it, since both the asymptotic order and the leading constant are smaller. [sent-218, score-0.236]
87 It seems that the analysis of all experts algorithms, including Weighted Majority variants and FPL, gets more complicated for general weights together with adaptive learning rate, because the [sent-219, score-0.245]
88 While FPL and Hedge and WMR (Littlestone and Warmuth, 1994) can sample an expert without knowing its prediction, Cesa-Bianchi et al. [sent-220, score-0.349]
89 Note also that for many (smooth) loss functions like the quadratic loss, finite regret can be achieved (Vovk, 1990). [sent-222, score-0.174]
90 [Table: leading constants of the regret term for η ∈ {static, dynamic} and loss ∈ {0/1, any}, including a conjectured value.] [sent-223, score-0.462]
91 choice of the learning rate must account for both the weight of the best expert (in hindsight) and its loss. [sent-228, score-0.368]
92 Neither quantity is known in advance, but they may have a different impact on the learning rate: while increasing the current loss estimate always decreases η_t, the optimal learning rate for an expert with higher complexity would be larger. [sent-229, score-0.488]
93 Nevertheless we conjecture that the bounds ∝ √(T·k^i) and ∝ √(s^i_{1:T}·k^i) also hold without the hierarchy trick, probably by using an expert dependent learning rate η_t^i. [sent-231, score-0.974]
94 We can also compare the worst-case bounds for FPL obtained in this work to similar bounds for Bayesian sequence prediction. [sent-233, score-0.144]
95 Then it is known that the Bayes optimal predictor based on the e^(−k^i)-weighted mixture of the ν_i's has an expected total loss of at most L_µ + 2√(L_µ·k_µ) + 2k_µ, where L_µ is the expected total loss of the Bayes optimal predictor based on µ (Hutter, 2003a, Thm. [sent-235, score-0.354]
96 Using FPL, we obtained the same bound except for the leading order constant, but for any sequence independently of the assumption that it is generated by µ. [sent-239, score-0.082]
97 Optimality of universal Bayesian prediction for general loss and alphabet. [sent-311, score-0.177]
98 Prediction with expert advice by following the perturbed leader for general weights. [sent-337, score-0.563]
99 Online geometric optimization in the bandit setting against an adaptive adversary. [sent-373, score-0.104]
100 Master algorithms for active experts problems based on increasing loss values. [sent-379, score-0.241]
wordName wordTfidf (topN-words)
[('fpl', 0.561), ('expert', 0.322), ('hutter', 0.232), ('ki', 0.198), ('regret', 0.174), ('wm', 0.157), ('kalai', 0.146), ('static', 0.145), ('pea', 0.135), ('experts', 0.121), ('loss', 0.12), ('qi', 0.117), ('ifpl', 0.116), ('auer', 0.115), ('hindsight', 0.113), ('vempala', 0.105), ('hannan', 0.097), ('hedge', 0.097), ('leader', 0.097), ('oland', 0.097), ('utter', 0.097), ('vovk', 0.097), ('infeasible', 0.086), ('yti', 0.081), ('daptive', 0.077), ('eader', 0.077), ('erturbed', 0.077), ('ollowing', 0.077), ('poland', 0.077), ('yaroshinsky', 0.077), ('si', 0.076), ('advice', 0.072), ('perturbed', 0.072), ('adaptive', 0.072), ('dynamic', 0.072), ('gentile', 0.072), ('losses', 0.068), ('nline', 0.065), ('rediction', 0.065), ('st', 0.064), ('master', 0.063), ('bounds', 0.061), ('idsia', 0.058), ('prediction', 0.057), ('url', 0.057), ('predictor', 0.057), ('complexities', 0.056), ('weights', 0.052), ('sti', 0.049), ('predicting', 0.047), ('rate', 0.046), ('hierarchy', 0.045), ('countable', 0.045), ('dent', 0.045), ('aggregating', 0.043), ('extrema', 0.043), ('jan', 0.043), ('predictions', 0.042), ('elegant', 0.042), ('littlestone', 0.042), ('online', 0.04), ('xt', 0.04), ('maxi', 0.04), ('argmini', 0.039), ('lnn', 0.039), ('wti', 0.039), ('majority', 0.038), ('randomized', 0.038), ('advance', 0.038), ('absolute', 0.037), ('freund', 0.037), ('attained', 0.036), ('dy', 0.036), ('bound', 0.035), ('rt', 0.034), ('wt', 0.034), ('sequential', 0.033), ('generalizes', 0.033), ('bandit', 0.032), ('adversary', 0.032), ('corrections', 0.032), ('ln', 0.031), ('pt', 0.03), ('deterministically', 0.029), ('marcus', 0.029), ('conjecture', 0.028), ('weighted', 0.028), ('bayes', 0.027), ('knowing', 0.027), ('countably', 0.026), ('suffers', 0.026), ('leading', 0.025), ('induction', 0.024), ('ch', 0.024), ('deterministic', 0.024), ('warmuth', 0.023), ('arbitrary', 0.023), ('incrementally', 0.023), ('ti', 0.022), ('sequence', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 10 jmlr-2005-Adaptive Online Prediction by Following the Perturbed Leader
Author: Marcus Hutter, Jan Poland
Abstract: When applying aggregating strategies to Prediction with Expert Advice (PEA), the learning rate must be adaptively tuned. The natural choice of √(complexity/current loss) renders the analysis of Weighted Majority (WM) derivatives quite complicated. In particular, for arbitrary weights there have been no results proven so far. The analysis of the alternative Follow the Perturbed Leader (FPL) algorithm from Kalai and Vempala (2003) based on Hannan’s algorithm is easier. We derive loss bounds for adaptive learning rate and both finite expert classes with uniform weights and countable expert classes with arbitrary weights. For the former setup, our loss bounds match the best known results so far, while for the latter our results are new. Keywords: prediction with expert advice, follow the perturbed leader, general weights, adaptive learning rate, adaptive adversary, hierarchy of experts, expected and high probability bounds, general alphabet and loss, online sequential prediction
2 0.06775818 66 jmlr-2005-Smooth ε-Insensitive Regression by Loss Symmetrization
Author: Ofer Dekel, Shai Shalev-Shwartz, Yoram Singer
Abstract: We describe new loss functions for regression problems along with an accompanying algorithmic framework which utilizes these functions. These loss functions are derived by symmetrization of margin-based losses commonly used in boosting algorithms, namely, the logistic loss and the exponential loss. The resulting symmetric logistic loss can be viewed as a smooth approximation to the ε-insensitive hinge loss used in support vector regression. We describe and analyze two parametric families of batch learning algorithms for minimizing these symmetric losses. The first family employs an iterative log-additive update which can be viewed as a regression counterpart to recent boosting algorithms. The second family utilizes an iterative additive update step. We also describe and analyze online gradient descent (GD) and exponentiated gradient (EG) algorithms for the symmetric logistic loss. A byproduct of our work is a new simple form of regularization for boosting-based classification and regression algorithms. Our regression framework also has implications on classification algorithms, namely, a new additive update boosting algorithm for classification. We demonstrate the merits of our algorithms in a series of experiments.
3 0.046948992 55 jmlr-2005-Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection
Author: Koji Tsuda, Gunnar Rätsch, Manfred K. Warmuth
Abstract: We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: on-line learning with a simple square loss, and finding a symmetric positive definite matrix subject to linear constraints. The updates generalize the exponentiated gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. Most importantly, we show how the derivation and the analyses of the original EG update and AdaBoost generalize to the non-diagonal case. We apply the resulting matrix exponentiated gradient (MEG) update and DefiniteBoost to the problem of learning a kernel matrix from distance measurements.
4 0.040139757 40 jmlr-2005-Inner Product Spaces for Bayesian Networks
Author: Atsuyoshi Nakamura, Michael Schmitt, Niels Schmitt, Hans Ulrich Simon
Abstract: Bayesian networks have become one of the major models used for statistical inference. We study the question whether the decisions computed by a Bayesian network can be represented within a low-dimensional inner product space. We focus on two-label classification tasks over the Boolean domain. As main results we establish upper and lower bounds on the dimension of the inner product space for Bayesian networks with an explicitly given (full or reduced) parameter collection. In particular, these bounds are tight up to a factor of 2. For some nontrivial cases of Bayesian networks we even determine the exact values of this dimension. We further consider logistic autoregressive Bayesian networks and show that every sufficiently expressive inner product space must have dimension at least Ω(n2 ), where n is the number of network nodes. We also derive the bound 2Ω(n) for an artificial variant of this network, thereby demonstrating the limits of our approach and raising an interesting open question. As a major technical contribution, this work reveals combinatorial and algebraic structures within Bayesian networks such that known methods for the derivation of lower bounds on the dimension of inner product spaces can be brought into play. Keywords: Bayesian network, inner product space, embedding, linear arrangement, Euclidean dimension
5 0.038480289 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification
Author: John Langford
Abstract: We discuss basic prediction theory and its impact on classification success evaluation, implications for learning algorithm design, and uses in learning algorithm execution. This tutorial is meant to be a comprehensive compilation of results which are both theoretically rigorous and quantitatively useful. There are two important implications of the results presented here. The first is that common practices for reporting results in classification should change to use the test set bound. The second is that train set bounds can sometimes be used to directly motivate learning algorithms. Keywords: sample complexity bounds, classification, quantitative bounds
6 0.037676856 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
7 0.037181523 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
8 0.036219243 17 jmlr-2005-Change Point Problems in Linear Dynamical Systems
10 0.029357482 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
11 0.025285468 41 jmlr-2005-Kernel Methods for Measuring Independence
12 0.024517855 6 jmlr-2005-A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs
13 0.024031496 54 jmlr-2005-Managing Diversity in Regression Ensembles
14 0.023308864 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach
15 0.023151971 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors
16 0.023019742 15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks
17 0.02291017 11 jmlr-2005-Algorithmic Stability and Meta-Learning
18 0.022840604 71 jmlr-2005-Variational Message Passing
19 0.022690957 67 jmlr-2005-Stability of Randomized Learning Algorithms
20 0.021614918 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
topicId topicWeight
[(0, 0.137), (1, -0.005), (2, 0.081), (3, 0.014), (4, -0.177), (5, -0.07), (6, 0.061), (7, -0.018), (8, 0.025), (9, 0.04), (10, 0.002), (11, 0.038), (12, 0.06), (13, 0.029), (14, -0.016), (15, 0.088), (16, 0.149), (17, -0.09), (18, -0.124), (19, -0.112), (20, 0.205), (21, 0.413), (22, -0.048), (23, -0.094), (24, -0.103), (25, -0.161), (26, 0.188), (27, -0.239), (28, 0.191), (29, -0.228), (30, 0.121), (31, 0.303), (32, 0.091), (33, 0.117), (34, 0.116), (35, -0.094), (36, -0.028), (37, -0.098), (38, -0.081), (39, -0.088), (40, 0.03), (41, 0.122), (42, -0.177), (43, 0.112), (44, 0.044), (45, 0.013), (46, -0.021), (47, 0.087), (48, -0.181), (49, -0.276)]
simIndex simValue paperId paperTitle
same-paper 1 0.97091651 10 jmlr-2005-Adaptive Online Prediction by Following the Perturbed Leader
Author: Marcus Hutter, Jan Poland
Abstract: When applying aggregating strategies to Prediction with Expert Advice (PEA), the learning rate must be adaptively tuned. The natural choice of √(complexity/current loss) renders the analysis of Weighted Majority (WM) derivatives quite complicated. In particular, for arbitrary weights there have been no results proven so far. The analysis of the alternative Follow the Perturbed Leader (FPL) algorithm from Kalai and Vempala (2003) based on Hannan’s algorithm is easier. We derive loss bounds for adaptive learning rate and both finite expert classes with uniform weights and countable expert classes with arbitrary weights. For the former setup, our loss bounds match the best known results so far, while for the latter our results are new. Keywords: prediction with expert advice, follow the perturbed leader, general weights, adaptive learning rate, adaptive adversary, hierarchy of experts, expected and high probability bounds, general alphabet and loss, online sequential prediction
2 0.22028947 66 jmlr-2005-Smooth ε-Insensitive Regression by Loss Symmetrization
Author: Ofer Dekel, Shai Shalev-Shwartz, Yoram Singer
Abstract: We describe new loss functions for regression problems along with an accompanying algorithmic framework which utilizes these functions. These loss functions are derived by symmetrization of margin-based losses commonly used in boosting algorithms, namely, the logistic loss and the exponential loss. The resulting symmetric logistic loss can be viewed as a smooth approximation to the ε-insensitive hinge loss used in support vector regression. We describe and analyze two parametric families of batch learning algorithms for minimizing these symmetric losses. The first family employs an iterative log-additive update which can be viewed as a regression counterpart to recent boosting algorithms. The second family utilizes an iterative additive update step. We also describe and analyze online gradient descent (GD) and exponentiated gradient (EG) algorithms for the symmetric logistic loss. A byproduct of our work is a new simple form of regularization for boosting-based classification and regression algorithms. Our regression framework also has implications on classification algorithms, namely, a new additive update boosting algorithm for classification. We demonstrate the merits of our algorithms in a series of experiments.
Author: Savina Andonova Jaeger
Abstract: A unified approach is taken for deriving new generalization data dependent bounds for several classes of algorithms explored in the existing literature by different approaches. This unified approach is based on an extension of Vapnik’s inequality for VC classes of sets to random classes of sets - that is, classes depending on the random data, invariant under permutation of the data and possessing the increasing property. Generalization bounds are derived for convex combinations of functions from random classes with certain properties. Algorithms, such as SVMs (support vector machines), boosting with decision stumps, radial basis function networks, some hierarchies of kernel machines or convex combinations of indicator functions over sets with finite VC dimension, generate classifier functions that fall into the above category. We also explore the individual complexities of the classifiers, such as sparsity of weights and weighted variance over clusters from the convex combination introduced by Koltchinskii and Panchenko (2004), and show sparsity-type and cluster-variance-type generalization bounds for random classes. Keywords: complexities of classifiers, generalization bounds, SVM, voting classifiers, random classes
4 0.14635548 55 jmlr-2005-Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection
Author: Koji Tsuda, Gunnar Rätsch, Manfred K. Warmuth
Abstract: We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: on-line learning with a simple square loss, and finding a symmetric positive definite matrix subject to linear constraints. The updates generalize the exponentiated gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. Most importantly, we show how the derivation and the analyses of the original EG update and AdaBoost generalize to the non-diagonal case. We apply the resulting matrix exponentiated gradient (MEG) update and DefiniteBoost to the problem of learning a kernel matrix from distance measurements.
5 0.14612798 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
Author: Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, Yasemin Altun
Abstract: Learning general functional dependencies between arbitrary input and output spaces is one of the key challenges in computational intelligence. While recent progress in machine learning has mainly focused on designing flexible and powerful input representations, this paper addresses the complementary issue of designing classification algorithms that can deal with more complex outputs, such as trees, sequences, or sets. More generally, we consider problems involving multiple dependent output variables, structured output spaces, and classification problems with class attributes. In order to accomplish this, we propose to appropriately generalize the well-known notion of a separation margin and derive a corresponding maximum-margin formulation. While this leads to a quadratic program with a potentially prohibitive, i.e. exponential, number of constraints, we present a cutting plane algorithm that solves the optimization problem in polynomial time for a large class of problems. The proposed method has important applications in areas such as computational biology, natural language processing, information retrieval/extraction, and optical character recognition. Experiments from various domains involving different types of output spaces emphasize the breadth and generality of our approach.
6 0.12641384 17 jmlr-2005-Change Point Problems in Linear Dynamical Systems
7 0.12390662 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification
8 0.11669263 67 jmlr-2005-Stability of Randomized Learning Algorithms
9 0.11602968 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
10 0.11292914 12 jmlr-2005-An MDP-Based Recommender System
11 0.11222677 41 jmlr-2005-Kernel Methods for Measuring Independence
12 0.10236564 11 jmlr-2005-Algorithmic Stability and Meta-Learning
13 0.091092572 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
14 0.090225771 6 jmlr-2005-A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs
15 0.09013363 40 jmlr-2005-Inner Product Spaces for Bayesian Networks
16 0.075286716 71 jmlr-2005-Variational Message Passing
17 0.074538037 38 jmlr-2005-Generalization Bounds for the Area Under the ROC Curve
18 0.070706405 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models
19 0.067987144 25 jmlr-2005-Denoising Source Separation
20 0.067514367 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach
topicId topicWeight
[(13, 0.023), (19, 0.039), (29, 0.487), (36, 0.032), (37, 0.016), (42, 0.011), (43, 0.048), (52, 0.058), (59, 0.017), (70, 0.028), (88, 0.075), (90, 0.046), (94, 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.74046808 10 jmlr-2005-Adaptive Online Prediction by Following the Perturbed Leader
Author: Marcus Hutter, Jan Poland
Abstract: When applying aggregating strategies to Prediction with Expert Advice (PEA), the learning rate must be adaptively tuned. The natural choice of √(complexity/current loss) renders the analysis of Weighted Majority (WM) derivatives quite complicated. In particular, for arbitrary weights there have been no results proven so far. The analysis of the alternative Follow the Perturbed Leader (FPL) algorithm from Kalai and Vempala (2003) based on Hannan’s algorithm is easier. We derive loss bounds for adaptive learning rate and both finite expert classes with uniform weights and countable expert classes with arbitrary weights. For the former setup, our loss bounds match the best known results so far, while for the latter our results are new. Keywords: prediction with expert advice, follow the perturbed leader, general weights, adaptive learning rate, adaptive adversary, hierarchy of experts, expected and high probability bounds, general alphabet and loss, online sequential prediction
2 0.69526792 23 jmlr-2005-Convergence Theorems for Generalized Alternating Minimization Procedures
Author: Asela Gunawardana, William Byrne
Abstract: The EM algorithm is widely used to develop iterative parameter estimation procedures for statistical models. In cases where these procedures strictly follow the EM formulation, the convergence properties of the estimation procedures are well understood. In some instances there are practical reasons to develop procedures that do not strictly fall within the EM framework. We study EM variants in which the E-step is not performed exactly, either to obtain improved rates of convergence, or due to approximations needed to compute statistics under a model family over which E-steps cannot be realized. Since these variants are not EM procedures, the standard (G)EM convergence results do not apply to them. We present an information geometric framework for describing such algorithms and analyzing their convergence properties. We apply this framework to analyze the convergence properties of incremental EM and variational EM. For incremental EM, we discuss conditions under these algorithms converge in likelihood. For variational EM, we show how the E-step approximation prevents convergence to local maxima in likelihood. Keywords: EM, variational EM, incremental EM, convergence, information geometry
3 0.23829208 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
Author: Cheng Soon Ong, Alexander J. Smola, Robert C. Williamson
Abstract: This paper addresses the problem of choosing a kernel suitable for estimation with a support vector machine, hence further automating machine learning. This goal is achieved by defining a reproducing kernel Hilbert space on the space of kernels itself. Such a formulation leads to a statistical estimation problem similar to the problem of minimizing a regularized risk functional. We state the equivalent representer theorem for the choice of kernels and present a semidefinite programming formulation of the resulting optimization problem. Several recipes for constructing hyperkernels are provided, as well as the details of common machine learning problems. Experimental results for classification, regression and novelty detection on UCI data show the feasibility of our approach. Keywords: learning the kernel, capacity control, kernel methods, support vector machines, representer theorem, semidefinite programming
4 0.23662582 41 jmlr-2005-Kernel Methods for Measuring Independence
Author: Arthur Gretton, Ralf Herbrich, Alexander Smola, Olivier Bousquet, Bernhard Schölkopf
Abstract: We introduce two new functionals, the constrained covariance and the kernel mutual information, to measure the degree of independence of random variables. These quantities are both based on the covariance between functions of the random variables in reproducing kernel Hilbert spaces (RKHSs). We prove that when the RKHSs are universal, both functionals are zero if and only if the random variables are pairwise independent. We also show that the kernel mutual information is an upper bound near independence on the Parzen window estimate of the mutual information. Analogous results apply for two correlation-based dependence functionals introduced earlier: we show the kernel canonical correlation and the kernel generalised variance to be independence measures for universal kernels, and prove the latter to be an upper bound on the mutual information near independence. The performance of the kernel dependence functionals in measuring independence is verified in the context of independent component analysis. Keywords: independence, covariance operator, mutual information, kernel, Parzen window estimate, independent component analysis
5 0.23574501 64 jmlr-2005-Semigroup Kernels on Measures
Author: Marco Cuturi, Kenji Fukumizu, Jean-Philippe Vert
Abstract: We present a family of positive definite kernels on measures, characterized by the fact that the value of the kernel between two measures is a function of their sum. These kernels can be used to derive kernels on structured objects, such as images and texts, by representing these objects as sets of components, such as pixels or words, or more generally as measures on the space of components. Several kernels studied in this work make use of common quantities defined on measures such as entropy or generalized variance to detect similarities. Given an a priori kernel on the space of components itself, the approach is further extended by restating the previous results in a more efficient and flexible framework using the “kernel trick”. Finally, a constructive approach to such positive definite kernels through an integral representation theorem is proved, before presenting experimental results on a benchmark experiment of handwritten digits classification to illustrate the validity of the approach. Keywords: kernels on measures, semigroup theory, Jensen divergence, generalized variance, reproducing kernel Hilbert space
6 0.22716579 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
7 0.22704321 3 jmlr-2005-A Classification Framework for Anomaly Detection
8 0.22649314 11 jmlr-2005-Algorithmic Stability and Meta-Learning
9 0.22632161 39 jmlr-2005-Information Bottleneck for Gaussian Variables
10 0.22630714 20 jmlr-2005-Clustering with Bregman Divergences
11 0.22624071 71 jmlr-2005-Variational Message Passing
12 0.22594571 48 jmlr-2005-Learning the Kernel Function via Regularization
13 0.22427127 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
14 0.22330415 36 jmlr-2005-Gaussian Processes for Ordinal Regression
15 0.22243738 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning
16 0.22225004 66 jmlr-2005-Smooth ε-Insensitive Regression by Loss Symmetrization
17 0.22090159 46 jmlr-2005-Learning a Mahalanobis Metric from Equivalence Constraints
18 0.22000676 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
19 0.21967965 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
20 0.21961243 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels