nips nips2007 nips2007-159 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jean-yves Audibert
Abstract: We consider the learning task consisting in predicting as well as the best function in a finite reference set G up to the smallest possible additive term. If R(g) denotes the generalization error of a prediction function g, under reasonable assumptions on the loss function (typically satisfied by the least square loss when the output is bounded), it is known that the progressive mixture rule ĝ satisfies E R(ĝ) ≤ min_{g∈G} R(g) + Cst (log |G|)/n, (1) where n denotes the size of the training set, and E denotes the expectation w.r.t. the training set distribution. This work shows that, surprisingly, for appropriate reference sets G, the deviation convergence rate of the progressive mixture rule is no better than Cst/√n: it fails to achieve the expected Cst/n. We also provide an algorithm which does not suffer from this drawback, and which is optimal in both deviation and expectation convergence rates. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Progressive mixture rules are deviation suboptimal Jean-Yves Audibert Willow Project - Certis Lab ParisTech, Ecole des Ponts 77455 Marne-la-Vallée, France audibert@certis.fr [sent-1, score-0.314]
2 Abstract We consider the learning task consisting in predicting as well as the best function in a finite reference set G up to the smallest possible additive term. [sent-3, score-0.081]
3 This work shows that, surprisingly, for appropriate reference sets G, the deviation convergence rate of the progressive mixture rule is no better than Cst/√n: it fails to achieve the expected Cst/n. [sent-8, score-1.041]
4 We also provide an algorithm which does not suffer from this drawback, and which is optimal in both deviation and expectation convergence rates. [sent-9, score-0.177]
5 In several application fields of learning algorithms, these fluctuations play a key role: in finance for instance, the bigger the losses can be, the more money the bank needs to freeze in order to alleviate these possible losses. [sent-12, score-0.06]
6 In this case, a “good” algorithm is an algorithm having not only low expected risk but also small deviations. [sent-13, score-0.131]
7 Why are we interested in the learning task of doing as well as the best prediction function of a given finite set? [sent-14, score-0.135]
8 This scheme is very powerful since it leads to theoretical results, which, in most situations, would be very hard to prove without it. [sent-16, score-0.056]
9 An input point can then be seen as the vector containing the prediction of each candidate. [sent-19, score-0.135]
10 The problem is what to do when the dimensionality d of the input data (equivalently the number of prediction functions) is much higher than the number of training points n. [sent-20, score-0.135]
11 In this setting, one cannot use linear regression and its variants in order to predict as well as the best candidate up to a small additive term. [sent-21, score-0.052]
12 Besides, (penalized) empirical risk minimization is doomed to be suboptimal (see the second part of Theorem 2 and also [1]). [sent-22, score-0.131]
13 As far as the expected risk is concerned, the only known correct way of predicting as well as the best prediction function is to use the progressive mixture rule or its variants. [sent-23, score-1.138]
14 In this work we prove that they do not work well as far as risk deviations are concerned (see the second part of Theorem 3). [sent-25, score-0.228]
15 2 The progressive mixture rule and its variants We assume that we observe n pairs of input-output denoted Z1 = (X1, Y1), …, Zn = (Xn, Yn). [sent-27, score-0.872]
16 The input and output spaces are denoted respectively X and Y, so that P is a probability distribution on the product space Z = X × Y. [sent-31, score-0.08]
17 The quality of a (prediction) function g : X → Y is measured by the risk (or generalization error): R(g) = E_{(X,Y)∼P} ℓ[Y, g(X)], where ℓ[Y, g(X)] denotes the loss (possibly infinite) incurred by predicting g(X) when the true output is Y. [sent-32, score-0.288]
18 We work under the following assumptions for the data space and the loss function ℓ : Y × Y → R ∪ {+∞}. [sent-33, score-0.11]
19 The loss function ℓ is • uniformly exp-concave: there exists λ > 0 such that for any y ∈ Y, the set {y′ ∈ R : ℓ(y, y′) < +∞} is an interval containing a on which the function y′ → e^{−λ ℓ(y, y′)} is concave. [sent-42, score-0.11]
20 • symmetrical: for any y1, y2 ∈ Y, ℓ(y1, y2) = ℓ(2a − y1, 2a − y2), • admissible: for any y, y′ ∈ Y ∩ ]a; +∞[, ℓ(y, 2a − y′) > ℓ(y, y′), • well behaved at center: for any y ∈ Y ∩ ]a; +∞[, the function ℓ_y : y′ → ℓ(y, y′) is twice continuously differentiable on a neighborhood of a and ℓ′_y(a) < 0. [sent-43, score-0.058]
21 As a consequence, the risk R is also a convex function (on the convex set of prediction functions for which it is finite). [sent-46, score-0.324]
22 The assumptions were motivated by the fact that they are satisfied in the following settings: • least square loss with bounded outputs: Y = [ymin; ymax] and ℓ(y1, y2) = (y1 − y2)². [sent-47, score-0.388]
23 Then we have a = (ymin + ymax)/2 and may take λ = 1/[2(ymax − ymin)²]. [sent-48, score-0.318]
24 • entropy loss: Y = [0; 1] and ℓ(y1, y2) = y1 log(y1/y2) + (1 − y1) log((1 − y1)/(1 − y2)). [sent-49, score-0.212]
25 • exponential (or AdaBoost) loss: Y = [−ymax; ymax] and ℓ(y1, y2) = e^{−y1 y2}. [sent-52, score-0.221]
26 • logit loss: Y = [−ymax; ymax] and ℓ(y1, y2) = log(1 + e^{−y1 y2}). [sent-54, score-0.295]
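The four loss settings above can be written down concretely. The Python sketch below is my own illustration (the helper name make_losses and the dictionary layout are hypothetical, and the symmetry centers a = 1/2 for the entropy loss and a = 0 for the exponential and logit losses are my reading of the symmetry property rather than values stated in the extracted text); for the square loss it also records the exp-concavity parameter λ quoted above.

```python
import math

def make_losses(ymin=0.0, ymax=1.0):
    """Hypothetical helper collecting the loss functions listed above.
    Each entry maps (y1, y2) -> loss; 'a' is the symmetry center and, for the
    square loss, 'lam' is the exp-concavity parameter quoted in the text."""
    return {
        "square": {"loss": lambda y1, y2: (y1 - y2) ** 2,
                   "a": (ymin + ymax) / 2.0,
                   "lam": 1.0 / (2.0 * (ymax - ymin) ** 2)},
        "entropy": {"loss": lambda y1, y2: (y1 * math.log(y1 / y2)
                    + (1 - y1) * math.log((1 - y1) / (1 - y2))),
                    "a": 0.5},
        "exponential": {"loss": lambda y1, y2: math.exp(-y1 * y2), "a": 0.0},
        "logit": {"loss": lambda y1, y2: math.log(1 + math.exp(-y1 * y2)), "a": 0.0},
    }

# Example: the square loss on Y = [0, 1] has center a = 0.5 and lam = 0.5.
cfg = make_losses(0.0, 1.0)["square"]
print(cfg["a"], cfg["lam"], cfg["loss"](0.2, 0.9))
```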
27 Let G be a finite reference set of prediction functions. [sent-57, score-0.165]
28 Under the previous assumptions, the only known algorithms satisfying (1) are the progressive indirect mixture rules defined below. [sent-58, score-1.078]
29 This distribution π̂i concentrates on functions having low cumulative loss up to time i. [sent-66, score-0.134]
30 For any i ∈ {0, …, n}, let ĥi be a prediction function such that ∀ (x, y) ∈ Z, ℓ[y, ĥi(x)] ≤ −(1/λ) log E_{g∼π̂i} e^{−λ ℓ[y, g(x)]}. (2) [sent-70, score-0.588]
31 The progressive indirect mixture rule produces the prediction function ĝpim = (1/(n+1)) Σ_{i=0}^n ĥi. [sent-71, score-1.568]
32 From the uniform exp-concavity assumption and Jensen's inequality, ĥi does exist, since one may take ĥi = E_{g∼π̂i} g. [sent-72, score-0.16]
33 This particular choice leads to the progressive mixture rule, for which the predicted output for any x ∈ X is ĝpm(x) = (1/(n+1)) Σ_{i=0}^n Σ_{g∈G} [e^{−λ Σ_i(g)} / Σ_{g′∈G} e^{−λ Σ_i(g′)}] g(x). [sent-73, score-0.916]
34 Consequently, any result that holds for any progressive indirect mixture rule in particular holds for the progressive mixture rule. [sent-74, score-1.841]
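As a concrete illustration of the rule ĝpm defined just above, here is a short Python sketch (entirely my own, not the paper's code) that averages the Gibbs-weighted posterior-mean predictors ĥ0, …, ĥn for the bounded square loss; the function name and the assumption that G is a list of callables are mine.

```python
import numpy as np

def progressive_mixture(X_train, Y_train, X_test, G, lam):
    """Sketch of the progressive mixture rule g_pm for a finite set G of
    prediction functions, with the square loss. G is assumed to be a list of
    callables g(x); lam is the exp-concavity parameter, e.g.
    1 / (2 * (ymax - ymin) ** 2) for outputs bounded in [ymin, ymax]."""
    n = len(X_train)
    test_preds = np.array([[g(x) for x in X_test] for g in G])    # |G| x n_test
    train_preds = np.array([[g(x) for x in X_train] for g in G])  # |G| x n
    cum_loss = np.zeros(len(G))          # Sigma_0(g) = 0 for every g
    out = np.zeros(len(X_test))
    for i in range(n + 1):
        # Gibbs weights pi_i(g) proportional to exp(-lam * Sigma_i(g));
        # subtracting the min leaves the normalized weights unchanged.
        w = np.exp(-lam * (cum_loss - cum_loss.min()))
        w /= w.sum()
        # h_i = E_{g ~ pi_i} g: weighted average of the candidates' predictions
        out += w @ test_preds
        if i < n:
            # Sigma_{i+1}(g) = Sigma_i(g) + square loss on the (i+1)-th pair
            cum_loss += (Y_train[i] - train_preds[:, i]) ** 2
    return out / (n + 1)                 # average of h_0, ..., h_n
```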
35 The idea of a progressive mean of estimators has been introduced by Barron ([2]) in the context of density estimation with Kullback-Leibler loss. [sent-75, score-0.599]
36 The procedure was studied in density estimation and least squares regression in [5, 6, 7, 8]. [sent-78, score-0.057]
37 Results for general losses can be found in [9, 10]. [sent-79, score-0.06]
38 Finally, the progressive indirect mixture rule is inspired by the work of Vovk, Haussler, Kivinen and Warmuth [11, 12, 13] on sequential prediction and was studied in the “batch” setting in [10]. [sent-80, score-1.277]
39 The largest integer smaller than or equal to the logarithm in base 2 of x is denoted by ⌊log2 x⌋. [sent-87, score-0.05]
40 Theorem 1 Any progressive indirect mixture rule satisfies E R(ĝpim) ≤ min_{g∈G} R(g) + log |G| / [λ(n+1)]. [sent-91, score-1.214]
41 y∈Y The second part of Theorem 1 has the same (log |G|)/n rate as the lower bounds obtained in sequential prediction ([12]). [sent-94, score-0.327]
42 From the link between sequential prediction and our “batch” setting with i.i.d. data (see e.g. [sent-95, score-0.066]
43 [10, Lemma 3]), upper bounds for sequential prediction lead to upper bounds for the i.i.d. setting. [sent-100, score-0.295]
44 The converse of this last assertion is not true, so that the second part of Theorem 1 is not a consequence of the lower bounds of [12]. [sent-107, score-0.089]
45 This last point explains the interest we have in progressive mixture rules. [sent-110, score-0.765]
46 Theorem 2 If B = sup_{y,y′,y″∈Y} [ℓ(y, y′) − ℓ(y, y″)] < +∞, then any empirical risk minimizer, which produces a prediction function ĝerm in argmin_{g∈G} Σ_n(g), satisfies: E R(ĝerm) ≤ min_{g∈G} R(g) + B √(2 log |G| / n). [sent-111, score-0.502]
47 There exists a set G of d prediction functions such that: for any learning algorithm producing a prediction function in G (e.g. [sent-113, score-0.343]
48 ĝerm) there exists a probability distribution generating the data for which • the output marginal is supported by 2a − ỹ1 and ỹ1: P(Y ∈ {2a − ỹ1; ỹ1}) = 1, • E R(ĝ) ≥ min_{g∈G} R(g) + (δ/8) (√(⌊log2 |G|⌋/n) ∧ 2), with δ = ℓ(ỹ1, 2a − ỹ1) − ℓ(ỹ1, ỹ1) > 0. [sent-115, score-0.276]
49 The lower bound of Theorem 2 also says that one should not use cross-validation. [sent-116, score-0.075]
50 This holds for the loss functions considered in this work, and not for, e. [sent-117, score-0.079]
51 Theorem 3 If B = sup_{y,y′,y″∈Y} [ℓ(y, y′) − ℓ(y, y″)] < +∞, then any progressive indirect mixture rule satisfies: for any ε > 0, with probability at least 1 − ε w.r.t. [sent-123, score-1.137]
52 the training set distribution, we have R(ĝpim) ≤ min_{g∈G} R(g) + B √(2 log(2 ε⁻¹)/(n+1)) + log |G| / [λ(n+1)]. Let y1 and ỹ1 in Y ∩ ]a; +∞[ be such that ℓ_{y1} is twice continuously differentiable on [a; ỹ1], ℓ′_{y1}(ỹ1) ≤ 0 and ℓ″_{y1}(ỹ1) > 0. [sent-126, score-0.163]
53 Consider the prediction functions g1 ≡ y1 and g2 ≡ 2a − y1 . [sent-127, score-0.135]
54 This result is quite surprising since it gives an example of an algorithm which is optimal in terms of expectation convergence rate but whose deviation convergence rate is (significantly) worse. [sent-130, score-0.427]
55 In fact, despite their popularity based on their unique expectation convergence rate, the progressive mixture rules are not good algorithms since a long argument essentially based on convexity shows that the following algorithm has both expectation and deviation convergence rate of order 1/n. [sent-131, score-1.213]
56 Let ĝerm be the minimizer of the empirical risk among functions in G. [sent-132, score-0.291]
57 Let g̃ be the minimizer of the empirical risk in the star G̃ = ∪_{g∈G} [g; ĝerm]. [sent-133, score-0.16]
58 The algorithm producing g̃ satisfies, for some C > 0: for any ε > 0, with probability at least 1 − ε w.r.t. the training set distribution, R(g̃) ≤ min_{g∈G} R(g) + C log(|G| ε⁻¹)/n. [sent-134, score-0.234]
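The two-step construction just described can be sketched as follows; this is a rough illustration under my own conventions (G a list of callables, square loss, and a grid over α as a crude stand-in for the exact one-dimensional convex minimization on each segment [g; ĝerm]), not the author's implementation.

```python
import numpy as np

def star_estimator(X, Y, G, n_alpha=101):
    """Sketch of the two-step 'star' procedure: (1) the empirical risk
    minimizer g_erm over the finite set G, (2) empirical risk minimization
    over the segments [g; g_erm], g in G, here approximated on a grid."""
    preds = np.array([[g(x) for x in X] for g in G])        # |G| x n predictions
    emp_risk = np.mean((np.asarray(Y) - preds) ** 2, axis=1)
    erm = int(np.argmin(emp_risk))                          # step 1: ERM over G
    best_risk, best_j, best_alpha = emp_risk[erm], erm, 0.0
    for j in range(len(G)):                                 # step 2: search the star
        for alpha in np.linspace(0.0, 1.0, n_alpha):
            mix = alpha * preds[j] + (1.0 - alpha) * preds[erm]
            r = np.mean((np.asarray(Y) - mix) ** 2)
            if r < best_risk:
                best_risk, best_j, best_alpha = r, j, alpha
    # the returned predictor lives in the convex hull of G but is built from
    # at most two of its elements
    return lambda x: best_alpha * G[best_j](x) + (1.0 - best_alpha) * G[erm](x)
```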
59 On the contrary, in practice, one will have recourse to cross-validation to tune the parameter λ of the progressive mixture rule. [sent-139, score-0.765]
60 To summarize, to predict as well as the best prediction function in a given set G, one should not restrict the algorithm to producing its prediction function within the set G. [sent-140, score-0.297]
61 The progressive mixture rules satisfy this principle since they produce a prediction function in the convex hull of G. [sent-141, score-1.062]
62 This allows them to achieve (log |G|)/n convergence rates in expectation. [sent-142, score-0.063]
63 Future work might look at whether one can transpose this algorithm to the sequential prediction setting, in which, up to now, the algorithms to predict as well as the best expert were dominated by algorithms producing a mixture expert inside the convex hull of the set of experts. [sent-144, score-0.586]
64 Now from [15, Theorem 1] (see also [16, Proposition 1]), for any ε > 0, with probability at least 1 − ε, we have (1/(n+1)) Σ_{i=0}^n R(ĥi) ≤ (1/(n+1)) Σ_{i=0}^n ℓ[Yi+1, ĥi(Xi+1)] + B √(log(ε⁻¹)/(2(n+1))). (4) Using [12, Theorem 3. [sent-151, score-0.061]
65 8] and the exp-concavity assumption, we have Σ_{i=0}^n ℓ[Yi+1, ĥi(Xi+1)] ≤ min_{g∈G} Σ_{i=0}^n ℓ[Yi+1, g(Xi+1)] + log |G| / λ. (5) Let g̃ ∈ argmin_G R. [sent-152, score-0.138]
66 By Hoeffding's inequality, with probability at least 1 − ε, we have (1/(n+1)) Σ_{i=0}^n ℓ[Yi+1, g̃(Xi+1)] ≤ R(g̃) + B √(log(ε⁻¹)/(2(n+1))). (6) Merging (3), (4), (5) and (6), with probability at least 1 − 2ε, we get R(ĝpim) ≤ (1/(n+1)) Σ_{i=0}^n ℓ[Yi+1, g̃(Xi+1)] + B √(log(ε⁻¹)/(2(n+1))) + log |G| / [λ(n+1)] [sent-153, score-0.122]
67 ≤ R(g̃) + B √(2 log(ε⁻¹)/(n+1)) + log |G| / [λ(n+1)]. [sent-154, score-0.212]
68 Sketch of the proof of the lower bound We cannot use standard tools like Assouad's argument (see e.g. [sent-155, score-0.126]
69 6]) because if it were possible, it would mean that the lower bound would hold for any algorithm, in particular for g̃, and this is false. [sent-158, score-0.075]
70 To prove that any progressive indirect mixture rule has no fast exponential deviation inequality, we will show that on some event with not too small probability, for most of the i in {0, …, n}, the distribution π̂i concentrates on the wrong prediction function. [sent-159, score-1.26]
71 First we define the probability distribution for which we will prove that the progressive indirect mixture rules cannot have fast deviation convergence rates. [sent-164, score-1.262]
72 Then we define the event on which the progressive indirect mixture rules do not perform well. [sent-165, score-1.168]
73 We lower bound the probability of this excursion event. [sent-166, score-0.323]
74 Finally we conclude by lower bounding R(ĝpim) on the excursion event. [sent-167, score-0.263]
75 We consider a distribution generating the data such that the output distribution satisfies for any x ∈ X P (Y = y1 |X = x) = (1 + γ)/2 = 1 − P (Y = y2 |X = x), where y2 = 2a − y1 . [sent-174, score-0.088]
76 Therefore g1 is the best prediction function in {g1, g2} for the distribution we have chosen. [sent-179, score-0.135]
77 An excursion event on which the progressive indirect mixture rules will not perform well. [sent-187, score-1.389]
78 Let Eτ be the event: for any i ∈ {τ, …, n}, Si ≤ −τ, with τ the smallest integer larger than (log n)/(λδ) such that n − τ is even (for convenience). [sent-191, score-0.076]
79 The event Eτ can be seen as an excursion event of the random walk defined through the random variables Wj = 1_{Yj=y1} − 1_{Yj=y2}, j ∈ {1, … [sent-193, score-0.09]
80 , n}, which are equal to +1 with probability (1 + γ)/2 and −1 with probability (1 − γ)/2. [sent-196, score-0.054]
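To build intuition for this excursion event, the following Monte Carlo sketch (my own construction with hypothetical parameter names; it ignores the parity adjustment on τ, and the paper lower-bounds this probability analytically rather than by simulation) draws the ±1 walk just defined and checks whether it stays at or below −τ from time τ onward.

```python
import numpy as np

rng = np.random.default_rng(0)

def excursion_probability(n, gamma, lam_delta, n_trials=20000):
    """Crude Monte Carlo estimate of an excursion event in the spirit of
    E_tau: the walk S_i built from the +/-1 steps W_j (equal to +1 with
    probability (1 + gamma)/2) stays at or below -tau for every i in
    {tau, ..., n}, with tau roughly log(n) / lam_delta."""
    tau = int(np.ceil(np.log(n) / lam_delta))
    hits = 0
    for _ in range(n_trials):
        steps = rng.choice([1, -1], size=n, p=[(1 + gamma) / 2, (1 - gamma) / 2])
        walk = np.cumsum(steps)
        if np.all(walk[tau - 1:] <= -tau):   # S_i <= -tau for i = tau, ..., n
            hits += 1
    return hits / n_trials

# Illustration of the shifted vs non-shifted comparison: a mild positive shift
# gamma makes the excursion rarer but does not wipe it out (both values are
# crude Monte Carlo estimates on toy parameters).
print(excursion_probability(100, gamma=0.0, lam_delta=2.0))
print(excursion_probability(100, gamma=0.1, lam_delta=2.0))
```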
81 This means that π_{−λΣi} concentrates on the wrong function, i.e. on g2. [sent-201, score-0.078]
82 3 Lower bound of the probability of the excursion event. [sent-206, score-0.281]
83 This requires looking at the probability that a slightly shifted random walk in the integer space has a very long excursion above a certain threshold. [sent-207, score-0.377]
84 To lower bound this probability, we will first look at the non-shifted random walk. [sent-208, score-0.075]
85 Then we will see that, for a small enough shift parameter, the probabilities of shifted random walk events are close to those associated with the non-shifted random walk. [sent-209, score-0.114]
86 We start with the following lemma for sums of Rademacher variables (proof omitted). [sent-216, score-0.055]
87 These random variables satisfy the following key lemma (proof omitted). Lemma 2 relates, for any set A of sign vectors (σ1, … [sent-222, score-0.055]
88 , σN), the probability that (σ1, …, σN) ∈ A under the shifted walk to the corresponding probability under the non-shifted walk. We may now lower bound the probability of the excursion event Eτ. [sent-231, score-0.027]
89 This probability can be lower bounded, and after some computations, we obtain P(Eτ) ≥ [(1 − γ)/2]^{2τ} [(1 − γ)/(1 + γ)]^{M/2} (1 − γ²)^{N/2} [P(sN = τ) − P(sN = M)], (14) where we recall that τ has the order of log n, N = n − 2τ has the order of n, and that γ > 0 and M ≥ τ have to be appropriately chosen. [sent-249, score-0.175]
90 These computations and (14) lead us to take M as the smallest integer larger than √n such that n − M is even. [sent-255, score-0.151]
91 We obtain the following lower bound on the excursion probability. Lemma 3 If γ = C0 √((log n)/n) with C0 a positive constant, then for any large enough n, P(Eτ) ≥ 1/n^{C4} for some constant C4 > 0. [sent-261, score-0.358]
92 Behavior of the progressive indirect mixture rule on the excursion event. [sent-264, score-1.297]
93 On the event Eτ , for any x ∈ X and any i ∈ {τ, . [sent-268, score-0.09]
94 From the convexity of the function y → ℓ(y2, y) and by Jensen's inequality, we obtain ℓ[y2, ĝpim(x)] − ℓ(y2, y2) ≤ (1/(n+1)) Σ_{i=0}^n (ℓ[y2, ĥi(x)] − ℓ(y2, y2)) ≤ τ δ̃/(n+1) + C n^{−1} < C1 (log n)/n for some constant C1 > 0 independent from γ. [sent-272, score-0.5]
95 Let us now prove that for n large enough, we have y2 ≤ ĝpim(x) ≤ y2 + C̃ √((log n)/n) ≤ ỹ1, (19) with C̃ > 0 independent from γ. [sent-273, score-0.333]
96 We may take γ = (2C2/δ) √((log n)/n) and obtain: for n large enough, on the event Eτ, we have R(ĝpim) − R(g1) ≥ C √((log n)/n). [sent-275, score-0.219]
97 From Lemma 3, this inequality holds with probability at least 1/n^{C4} for some C4 > 0. [sent-276, score-0.1]
98 with probability at least ε, R(ĝpim) − R(g1) ≥ c √(log(e ε⁻¹)/n), [sent-279, score-0.061]
99 where c is a positive constant depending only on the loss function ℓ, the symmetry parameter a and the output values y1 and ỹ1. [sent-280, score-0.169]
100 Sequential prediction of individual sequences under general loss functions. [sent-358, score-0.214]
wordName wordTfidf (topN-words)
[('progressive', 0.599), ('pim', 0.295), ('indirect', 0.229), ('excursion', 0.221), ('ymax', 0.221), ('gpim', 0.197), ('mixture', 0.166), ('hi', 0.16), ('prediction', 0.135), ('risk', 0.131), ('sn', 0.116), ('log', 0.106), ('germ', 0.098), ('symmetrical', 0.098), ('wj', 0.09), ('event', 0.09), ('rules', 0.084), ('rule', 0.082), ('loss', 0.079), ('cst', 0.074), ('logit', 0.074), ('ymin', 0.074), ('rademacher', 0.073), ('sequential', 0.066), ('theorem', 0.065), ('deviation', 0.064), ('convergence', 0.063), ('minimizer', 0.062), ('losses', 0.06), ('eg', 0.059), ('ming', 0.059), ('lemma', 0.055), ('concentrates', 0.055), ('preprint', 0.055), ('output', 0.053), ('proof', 0.051), ('expectation', 0.05), ('integer', 0.05), ('classication', 0.049), ('gpm', 0.049), ('hull', 0.049), ('si', 0.049), ('bounds', 0.047), ('argming', 0.043), ('audibert', 0.043), ('lower', 0.042), ('shifted', 0.042), ('producing', 0.042), ('omitted', 0.042), ('satis', 0.041), ('jensen', 0.041), ('barron', 0.039), ('inequality', 0.039), ('penalized', 0.039), ('walk', 0.037), ('rate', 0.037), ('convexity', 0.037), ('symmetry', 0.037), ('kivinen', 0.036), ('surely', 0.036), ('annual', 0.036), ('expert', 0.036), ('zn', 0.036), ('enough', 0.035), ('concerned', 0.035), ('generating', 0.035), ('haussler', 0.034), ('aggregating', 0.034), ('least', 0.034), ('er', 0.034), ('bound', 0.033), ('ey', 0.033), ('behaved', 0.033), ('deviations', 0.032), ('min', 0.032), ('assumptions', 0.031), ('exists', 0.031), ('uctuations', 0.03), ('reference', 0.03), ('yj', 0.03), ('prove', 0.03), ('excess', 0.029), ('convex', 0.029), ('yi', 0.028), ('nn', 0.028), ('paris', 0.027), ('cn', 0.027), ('predict', 0.027), ('probability', 0.027), ('let', 0.027), ('leads', 0.026), ('computations', 0.026), ('smallest', 0.026), ('batch', 0.026), ('continuously', 0.025), ('predicting', 0.025), ('variants', 0.025), ('square', 0.023), ('wrong', 0.023), ('take', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 159 nips-2007-Progressive mixture rules are deviation suboptimal
Author: Jean-yves Audibert
Abstract: We consider the learning task consisting in predicting as well as the best function in a finite reference set G up to the smallest possible additive term. If R(g) denotes the generalization error of a prediction function g, under reasonable assumptions on the loss function (typically satisfied by the least square loss when the output is bounded), it is known that the progressive mixture rule ĝ satisfies E R(ĝ) ≤ min_{g∈G} R(g) + Cst (log |G|)/n, (1) where n denotes the size of the training set, and E denotes the expectation w.r.t. the training set distribution. This work shows that, surprisingly, for appropriate reference sets G, the deviation convergence rate of the progressive mixture rule is no better than Cst/√n: it fails to achieve the expected Cst/n. We also provide an algorithm which does not suffer from this drawback, and which is optimal in both deviation and expectation convergence rates. 1
2 0.074033387 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators
Author: Kristiaan Pelckmans, Johan Suykens, Bart D. Moor
Abstract: This paper explores the use of a Maximal Average Margin (MAM) optimality principle for the design of learning algorithms. It is shown that the application of this risk minimization principle results in a class of (computationally) simple learning machines similar to the classical Parzen window classifier. A direct relation with the Rademacher complexities is established, as such facilitating analysis and providing a notion of certainty of prediction. This analysis is related to Support Vector Machines by means of a margin transformation. The power of the MAM principle is illustrated further by application to ordinal regression tasks, resulting in an O(n) algorithm able to process large datasets in reasonable time. 1
3 0.067641608 110 nips-2007-Learning Bounds for Domain Adaptation
Author: John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, Jennifer Wortman
Abstract: Empirical risk minimization offers well-known learning guarantees when training and test data come from the same domain. In the real world, though, we often wish to adapt a classifier from a source domain with a large amount of training data to different target domain with very little training data. In this work we give uniform convergence bounds for algorithms that minimize a convex combination of source and target empirical risk. The bounds explicitly model the inherent trade-off between training on a large but inaccurate source data set and a small but accurate target training set. Our theory also gives results when we have multiple source domains, each of which may have a different number of instances, and we exhibit cases in which minimizing a non-uniform combination of source risks can achieve much lower target error than standard empirical risk minimization. 1
4 0.065618619 207 nips-2007-Transfer Learning using Kolmogorov Complexity: Basic Theory and Empirical Evaluations
Author: M. M. Mahmud, Sylvian Ray
Abstract: In transfer learning we aim to solve new problems using fewer examples using information gained from solving related problems. Transfer learning has been successful in practice, and extensive PAC analysis of these methods has been developed. However it is not yet clear how to define relatedness between tasks. This is considered as a major problem as it is conceptually troubling and it makes it unclear how much information to transfer and when and how to transfer it. In this paper we propose to measure the amount of information one task contains about another using conditional Kolmogorov complexity between the tasks. We show how existing theory neatly solves the problem of measuring relatedness and transferring the ‘right’ amount of information in sequential transfer learning in a Bayesian setting. The theory also suggests that, in a very formal and precise sense, no other reasonable transfer method can do much better than our Kolmogorov Complexity theoretic transfer method, and that sequential transfer is always justified. We also develop a practical approximation to the method and use it to transfer information between 8 arbitrarily chosen databases from the UCI ML repository. 1
Author: Lars Buesing, Wolfgang Maass
Abstract: We show that under suitable assumptions (primarily linearization) a simple and perspicuous online learning rule for Information Bottleneck optimization with spiking neurons can be derived. This rule performs on common benchmark tasks as well as a rather complex rule that has previously been proposed [1]. Furthermore, the transparency of this new learning rule makes a theoretical analysis of its convergence properties feasible. A variation of this learning rule (with sign changes) provides a theoretically founded method for performing Principal Component Analysis (PCA) with spiking neurons. By applying this rule to an ensemble of neurons, different principal components of the input can be extracted. In addition, it is possible to preferentially extract those principal components from incoming signals X that are related or are not related to some additional target signal YT . In a biological interpretation, this target signal YT (also called relevance variable) could represent proprioceptive feedback, input from other sensory modalities, or top-down signals. 1
6 0.06350033 186 nips-2007-Statistical Analysis of Semi-Supervised Regression
7 0.062979922 44 nips-2007-Catching Up Faster in Bayesian Model Selection and Model Averaging
8 0.06056425 40 nips-2007-Bundle Methods for Machine Learning
9 0.059069913 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
10 0.057071913 63 nips-2007-Convex Relaxations of Latent Variable Training
11 0.056464747 20 nips-2007-Adaptive Embedded Subgraph Algorithms using Walk-Sum Analysis
12 0.054649036 61 nips-2007-Convex Clustering with Exemplar-Based Models
13 0.053999819 82 nips-2007-Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization
14 0.053663038 187 nips-2007-Structured Learning with Approximate Inference
15 0.052925628 200 nips-2007-The Tradeoffs of Large Scale Learning
16 0.050968084 184 nips-2007-Stability Bounds for Non-i.i.d. Processes
17 0.048325263 62 nips-2007-Convex Learning with Invariances
18 0.043423142 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning
19 0.042713705 15 nips-2007-A general agnostic active learning algorithm
20 0.041540988 199 nips-2007-The Price of Bandit Information for Online Optimization
topicId topicWeight
[(0, -0.155), (1, -0.015), (2, -0.044), (3, 0.062), (4, 0.021), (5, -0.022), (6, 0.025), (7, -0.029), (8, -0.041), (9, -0.001), (10, 0.032), (11, 0.027), (12, 0.026), (13, -0.06), (14, 0.03), (15, -0.055), (16, 0.059), (17, 0.058), (18, -0.038), (19, -0.048), (20, -0.017), (21, -0.054), (22, 0.037), (23, -0.101), (24, 0.022), (25, 0.033), (26, -0.035), (27, -0.041), (28, 0.006), (29, -0.098), (30, -0.024), (31, 0.086), (32, -0.137), (33, 0.086), (34, -0.041), (35, 0.206), (36, 0.006), (37, -0.134), (38, -0.011), (39, -0.036), (40, -0.034), (41, 0.066), (42, -0.066), (43, 0.08), (44, -0.09), (45, -0.074), (46, 0.016), (47, -0.029), (48, 0.073), (49, 0.046)]
simIndex simValue paperId paperTitle
same-paper 1 0.93071789 159 nips-2007-Progressive mixture rules are deviation suboptimal
Author: Jean-yves Audibert
Abstract: We consider the learning task consisting in predicting as well as the best function in a finite reference set G up to the smallest possible additive term. If R(g) denotes the generalization error of a prediction function g, under reasonable assumptions on the loss function (typically satisfied by the least square loss when the output is bounded), it is known that the progressive mixture rule ĝ satisfies E R(ĝ) ≤ min_{g∈G} R(g) + Cst (log |G|)/n, (1) where n denotes the size of the training set, and E denotes the expectation w.r.t. the training set distribution. This work shows that, surprisingly, for appropriate reference sets G, the deviation convergence rate of the progressive mixture rule is no better than Cst/√n: it fails to achieve the expected Cst/n. We also provide an algorithm which does not suffer from this drawback, and which is optimal in both deviation and expectation convergence rates. 1
2 0.59140569 82 nips-2007-Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization
Author: Xuanlong Nguyen, Martin J. Wainwright, Michael I. Jordan
Abstract: We develop and analyze an algorithm for nonparametric estimation of divergence functionals and the density ratio of two probability distributions. Our method is based on a variational characterization of f -divergences, which turns the estimation into a penalized convex risk minimization problem. We present a derivation of our kernel-based estimation algorithm and an analysis of convergence rates for the estimator. Our simulation results demonstrate the convergence behavior of the method, which compares favorably with existing methods in the literature. 1
3 0.56686252 44 nips-2007-Catching Up Faster in Bayesian Model Selection and Model Averaging
Author: Tim V. Erven, Steven D. Rooij, Peter Grünwald
Abstract: Bayesian model averaging, model selection and their approximations such as BIC are generally statistically consistent, but sometimes achieve slower rates of convergence than other methods such as AIC and leave-one-out cross-validation. On the other hand, these other methods can be inconsistent. We identify the catch-up phenomenon as a novel explanation for the slow convergence of Bayesian methods. Based on this analysis we define the switch-distribution, a modification of the Bayesian model averaging distribution. We prove that in many situations model selection and prediction based on the switch-distribution is both consistent and achieves optimal convergence rates, thereby resolving the AIC-BIC dilemma. The method is practical; we give an efficient algorithm. 1
4 0.51559788 15 nips-2007-A general agnostic active learning algorithm
Author: Sanjoy Dasgupta, Claire Monteleoni, Daniel J. Hsu
Abstract: We present an agnostic active learning algorithm for any hypothesis class of bounded VC dimension under arbitrary data distributions. Most previous work on active learning either makes strong distributional assumptions, or else is computationally prohibitive. Our algorithm extends the simple scheme of Cohn, Atlas, and Ladner [1] to the agnostic setting, using reductions to supervised learning that harness generalization bounds in a simple but subtle manner. We provide a fall-back guarantee that bounds the algorithm’s label complexity by the agnostic PAC sample complexity. Our analysis yields asymptotic label complexity improvements for certain hypothesis classes and distributions. We also demonstrate improvements experimentally. 1
5 0.49546376 40 nips-2007-Bundle Methods for Machine Learning
Author: Quoc V. Le, Alex J. Smola, S.v.n. Vishwanathan
Abstract: We present a globally convergent method for regularized risk minimization problems. Our method applies to Support Vector estimation, regression, Gaussian Processes, and any other regularized risk minimization setting which leads to a convex optimization problem. SVMPerf can be shown to be a special case of our approach. In addition to the unified framework we present tight convergence bounds, which show that our algorithm converges in O(1/ε) steps to precision ε for general convex problems and in O(log(1/ε)) steps for continuously differentiable problems. We demonstrate in experiments the performance of our approach. 1
6 0.48862788 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators
7 0.48787001 20 nips-2007-Adaptive Embedded Subgraph Algorithms using Walk-Sum Analysis
8 0.47417021 142 nips-2007-Non-parametric Modeling of Partially Ranked Data
9 0.47333953 38 nips-2007-Boosting Algorithms for Maximizing the Soft Margin
10 0.47284144 184 nips-2007-Stability Bounds for Non-i.i.d. Processes
11 0.45635638 110 nips-2007-Learning Bounds for Domain Adaptation
12 0.44002548 101 nips-2007-How SVMs can estimate quantiles and the median
13 0.43556175 43 nips-2007-Catching Change-points with Lasso
14 0.40186399 196 nips-2007-The Infinite Gamma-Poisson Feature Model
15 0.39842466 46 nips-2007-Cluster Stability for Finite Samples
16 0.39385563 78 nips-2007-Efficient Principled Learning of Thin Junction Trees
17 0.39155528 66 nips-2007-Density Estimation under Independent Similarly Distributed Sampling Assumptions
18 0.37687835 53 nips-2007-Compressed Regression
19 0.37649813 179 nips-2007-SpAM: Sparse Additive Models
20 0.37474799 200 nips-2007-The Tradeoffs of Large Scale Learning
topicId topicWeight
[(5, 0.019), (13, 0.021), (16, 0.01), (21, 0.04), (34, 0.013), (35, 0.015), (47, 0.043), (49, 0.022), (83, 0.677), (90, 0.041)]
simIndex simValue paperId paperTitle
same-paper 1 0.99745375 159 nips-2007-Progressive mixture rules are deviation suboptimal
Author: Jean-yves Audibert
Abstract: We consider the learning task consisting in predicting as well as the best function in a finite reference set G up to the smallest possible additive term. If R(g) denotes the generalization error of a prediction function g, under reasonable assumptions on the loss function (typically satisfied by the least square loss when the output is bounded), it is known that the progressive mixture rule ĝ satisfies E R(ĝ) ≤ min_{g∈G} R(g) + Cst (log |G|)/n, (1) where n denotes the size of the training set, and E denotes the expectation w.r.t. the training set distribution. This work shows that, surprisingly, for appropriate reference sets G, the deviation convergence rate of the progressive mixture rule is no better than Cst/√n: it fails to achieve the expected Cst/n. We also provide an algorithm which does not suffer from this drawback, and which is optimal in both deviation and expectation convergence rates. 1
2 0.99382526 109 nips-2007-Kernels on Attributed Pointsets with Applications
Author: Mehul Parsana, Sourangshu Bhattacharya, Chiru Bhattacharya, K. Ramakrishnan
Abstract: This paper introduces kernels on attributed pointsets, which are sets of vectors embedded in an euclidean space. The embedding gives the notion of neighborhood, which is used to define positive semidefinite kernels on pointsets. Two novel kernels on neighborhoods are proposed, one evaluating the attribute similarity and the other evaluating shape similarity. Shape similarity function is motivated from spectral graph matching techniques. The kernels are tested on three real life applications: face recognition, photo album tagging, and shot annotation in video sequences, with encouraging results. 1
3 0.99334204 110 nips-2007-Learning Bounds for Domain Adaptation
Author: John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, Jennifer Wortman
Abstract: Empirical risk minimization offers well-known learning guarantees when training and test data come from the same domain. In the real world, though, we often wish to adapt a classifier from a source domain with a large amount of training data to different target domain with very little training data. In this work we give uniform convergence bounds for algorithms that minimize a convex combination of source and target empirical risk. The bounds explicitly model the inherent trade-off between training on a large but inaccurate source data set and a small but accurate target training set. Our theory also gives results when we have multiple source domains, each of which may have a different number of instances, and we exhibit cases in which minimizing a non-uniform combination of source risks can achieve much lower target error than standard empirical risk minimization. 1
4 0.99332893 132 nips-2007-Modeling image patches with a directed hierarchy of Markov random fields
Author: Simon Osindero, Geoffrey E. Hinton
Abstract: We describe an efficient learning procedure for multilayer generative models that combine the best aspects of Markov random fields and deep, directed belief nets. The generative models can be learned one layer at a time and when learning is complete they have a very fast inference procedure for computing a good approximation to the posterior distribution in all of the hidden layers. Each hidden layer has its own MRF whose energy function is modulated by the top-down directed connections from the layer above. To generate from the model, each layer in turn must settle to equilibrium given its top-down input. We show that this type of model is good at capturing the statistics of patches of natural images. 1
5 0.9916324 61 nips-2007-Convex Clustering with Exemplar-Based Models
Author: Danial Lashkari, Polina Golland
Abstract: Clustering is often formulated as the maximum likelihood estimation of a mixture model that explains the data. The EM algorithm widely used to solve the resulting optimization problem is inherently a gradient-descent method and is sensitive to initialization. The resulting solution is a local optimum in the neighborhood of the initial guess. This sensitivity to initialization presents a significant challenge in clustering large data sets into many clusters. In this paper, we present a different approach to approximate mixture fitting for clustering. We introduce an exemplar-based likelihood function that approximates the exact likelihood. This formulation leads to a convex minimization problem and an efficient algorithm with guaranteed convergence to the globally optimal solution. The resulting clustering can be thought of as a probabilistic mapping of the data points to the set of exemplars that minimizes the average distance and the information-theoretic cost of mapping. We present experimental results illustrating the performance of our algorithm and its comparison with the conventional approach to mixture model clustering. 1
6 0.98881394 20 nips-2007-Adaptive Embedded Subgraph Algorithms using Walk-Sum Analysis
7 0.92855954 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels
8 0.91639984 38 nips-2007-Boosting Algorithms for Maximizing the Soft Margin
9 0.90166157 54 nips-2007-Computational Equivalence of Fixed Points and No Regret Algorithms, and Convergence to Equilibria
10 0.89965689 40 nips-2007-Bundle Methods for Machine Learning
11 0.89676708 21 nips-2007-Adaptive Online Gradient Descent
12 0.89584953 120 nips-2007-Linear programming analysis of loopy belief propagation for weighted matching
13 0.89554346 116 nips-2007-Learning the structure of manifolds using random projections
14 0.88885701 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering
15 0.88499779 101 nips-2007-How SVMs can estimate quantiles and the median
16 0.88209784 178 nips-2007-Simulated Annealing: Rigorous finite-time guarantees for optimization on continuous domains
17 0.88068742 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs
18 0.88036335 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning
19 0.87961698 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs
20 0.8774271 46 nips-2007-Cluster Stability for Finite Samples