nips nips2011 nips2011-282 knowledge-graph by maker-knowledge-mining

282 nips-2011-The Fast Convergence of Boosting


Source: pdf

Author: Matus J. Telgarsky

Abstract: This manuscript considers the convergence rate of boosting under a large class of losses, including the exponential and logistic losses, where the best previous rate of convergence was O(exp(1/ε²)). First, it is established that the setting of weak learnability aids the entire class, granting a rate O(ln(1/ε)). Next, the (disjoint) conditions under which the infimal empirical risk is attainable are characterized in terms of the sample and weak learning class, and a new proof is given for the known rate O(ln(1/ε)). Finally, it is established that any instance can be decomposed into two smaller instances resembling the two preceding special cases, yielding a rate O(1/ε), with a matching lower bound for the logistic loss. The principal technical hurdle throughout this work is the potential unattainability of the infimal empirical risk; the technique for overcoming this barrier may be of general interest.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 This manuscript considers the convergence rate of boosting under a large class of losses, including the exponential and logistic losses, where the best previous rate of convergence was O(exp(1/ε²)). [sent-3, score-0.939]

2 First, it is established that the setting of weak learnability aids the entire class, granting a rate O(ln(1/ε)). [sent-4, score-0.656]

3 Next, the (disjoint) conditions under which the infimal empirical risk is attainable are characterized in terms of the sample and weak learning class, and a new proof is given for the known rate O(ln(1/ε)). [sent-5, score-0.577]

4 Boosting is the task of converting inaccurate weak learners into a single accurate predictor. [sent-8, score-0.448]

5 The existence of any such method was unknown until the breakthrough result of Schapire [1]: under a weak learning assumption, it is possible to combine many carefully chosen weak learners into a majority of majorities with arbitrarily low training error. [sent-9, score-0.82]

6 Finally, their combined effort produced AdaBoost, which attains the optimal convergence rate (under the weak learning assumption), and has an astonishingly simple implementation [3]. [sent-11, score-0.597]

7 Aiming to alleviate perceived deficiencies in the algorithm, other loss functions were proposed, foremost amongst these being the logistic loss [5]. [sent-13, score-0.265]

8 Given the wide practical success of boosting with the logistic loss, it is perhaps surprising that no convergence rate better than O(exp(1/ε²)) was known, even under the weak learning assumption [6]. [sent-14, score-0.989]

9 This reliance is carried over from convex optimization, where the assumption of attainability is generally made, either directly, or through stronger conditions like compact level sets or strong convexity [7]. [sent-16, score-0.326]

10 The contribution of this manuscript is to provide a tight convergence theory for a large class of losses, including the exponential and logistic losses, which has heretofore resisted analysis. [sent-19, score-0.386]

11 Section 2 provides a few pieces of background: how to encode the weak learning class and sample as a matrix, boosting as coordinate descent, and the primal objective function. [sent-22, score-0.906]

12 Section 3 then gives the dual problem, max entropy. [sent-23, score-0.217]

13 Given these tools, section 4 shows how to adjust the weak learning rate to a quantity which is useful without any assumptions. [sent-24, score-0.486]

14 The first step towards convergence rates is then taken in section 5, which demonstrates that the weak learning rate is in fact a mechanism to convert between the primal and dual problems. [sent-25, score-0.965]

15 The first is the original rate of O(ln(1/ε)) for the exponential loss under the weak learning assumption [3]. [sent-35, score-0.612]

16 [9] showed, for a class of losses similar to those considered here, a rate of O(ln(1/ε)) when the loss minimizer is attainable. [sent-37, score-0.224]

17 [10] established the general convergence under the exponential loss, with a rate of Θ(1/ε). [sent-40, score-0.25]

18 Specifically, once it was revealed that boosting tries not only to be correct but also to have large margins [12], much work was invested into methods which explicitly maximized the margin [13], or penalized variants focused on the inseparable case [14, 15]. [sent-43, score-0.474]

19 These methods generally impose some form of regularization [15], which grants attainability of the risk minimizer, and allows standard techniques to grant general convergence rates. [sent-44, score-0.404]

20 A view of boosting, which pervades this manuscript, is that the action of the weak learning class upon the sample can be encoded as a matrix [9, 15]. [sent-46, score-0.372]

21 Let a sample S := {(x_i, y_i)}_{i=1}^m ⊆ (X × Y)^m and a weak learning class H be given. [sent-47, score-0.409]

22 Let a_i denote the ith row of A, corresponding to the example (x_i, y_i), and let {h_j}_{j=1}^n index the set of weak learners corresponding to columns of A. [sent-50, score-0.525]

23 As another example, if the weak learners are binary, and H has VC dimension d, then Sauer's lemma grants that A has at most (m + 1)^d columns. [sent-54, score-0.505]

24 This matrix view of boosting is thus similar to the interpretation of boosting as performing descent in function space, but the class complexity and finite sample have been used to reduce the function class to a finite object [16, 5]. [sent-55, score-0.598]
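To make the matrix encoding concrete, the sketch below is an illustration only (not code from the paper; the one-dimensional data and threshold stumps are hypothetical). It builds A with entries A[i, j] = y_i · h_j(x_i), so that row a_i collects the label-signed responses of every weak learner on example (x_i, y_i).

```python
import numpy as np

def boosting_matrix(X, y, weak_learners):
    """Encode the action of the weak learning class on the sample as a matrix.

    A[i, j] = y[i] * h_j(X[i]); row i is the vector a_i described above.
    """
    m, n = len(X), len(weak_learners)
    A = np.empty((m, n))
    for j, h in enumerate(weak_learners):
        A[:, j] = y * np.array([h(x) for x in X])
    return A

# Hypothetical 1-d sample with three threshold stumps as the weak learners.
X = np.array([-2.0, -0.5, 0.5, 2.0])
y = np.array([-1.0, -1.0, 1.0, 1.0])
stumps = [lambda x, t=t: 1.0 if x > t else -1.0 for t in (-1.0, 0.0, 1.0)]
A = boosting_matrix(X, y, stumps)
print(A)  # shape (m, n) = (4, 3); entries in {-1, +1}
```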

25 Crucially, the exponential loss exp(−x) from AdaBoost and the logistic loss ln(1 + exp(−x)) are in G0 (and the eventual G). [sent-78, score-0.367]

26 Boosting determines some weighting λ ∈ ℝ^n of the columns of A, which correspond to weak learners in H. [sent-79, score-0.448]

27 The (unnormalized) margin of example i is thus ⟨a_i, λ⟩ = e_i^⊤Aλ, where e_i is an indicator vector. [sent-80, score-0.22]

28 As such, boosting solves the minimization problem inf_{λ∈ℝ^n} Σ_{i=1}^m g(⟨a_i, λ⟩) = inf_{λ∈ℝ^n} Σ_{i=1}^m g(e_i^⊤Aλ) = inf_{λ∈ℝ^n} f(Aλ) = inf_{λ∈ℝ^n} (f ∘ A)(λ) =: f̄_A, (2. [sent-82, score-1.067]
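As a sanity check on the notation, the objective f(Aλ) = Σ_i g(⟨a_i, λ⟩) is straightforward to evaluate; the snippet below is an illustrative sketch (the matrix A and weighting λ are hypothetical, not taken from the paper) for the exponential and logistic losses just mentioned.

```python
import numpy as np

def exp_loss(z):        # g(z) = exp(-z)
    return np.exp(-z)

def logistic_loss(z):   # g(z) = ln(1 + exp(-z)), evaluated stably
    return np.logaddexp(0.0, -z)

def objective(A, lam, g):
    """f(A lam) = sum_i g(<a_i, lam>): the empirical risk boosting minimizes."""
    margins = A @ lam   # unnormalized margins <a_i, lam>
    return float(np.sum(g(margins)))

# Hypothetical encoding of 4 examples by 3 weak learners (rows a_i).
A = np.array([[ 1.,  1., -1.],
              [ 1., -1., -1.],
              [-1.,  1.,  1.],
              [ 1.,  1.,  1.]])
lam = np.array([0.5, 0.5, 0.5])
print(objective(A, lam, exp_loss), objective(A, lam, logistic_loss))
```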

29 1 will show that this is equivalent to the weak learning assumption). [sent-90, score-0.372]

30 Then 0 ≤ inf_{λ∈ℝ^n} f(Aλ) ≤ inf{f(Aλ) : λ ∈ …}. On the other hand, for any … [sent-91, score-0.491]

31 Thus the infimum is never attainable when weak learnability holds. The template boosting algorithm appears in fig. [sent-94, score-0.734]

32 To interpret the gradient terms, note that (∇(f ∘ A)(λ))_j = (A^⊤∇f(Aλ))_j = Σ_{i=1}^m g′(⟨a_i, λ⟩) h_j(x_i) y_i, which is the expected correlation of h_j with the target labels according to an unnormalized distribution with weights g′(⟨a_i, λ⟩). [sent-96, score-0.344]

33 The stopping condition ∇(f ∘ A)(λ) = 0_m means: either the distribution is degenerate (it is exactly zero), or every weak learner is uncorrelated with the target. [sent-97, score-0.415]

34 In the case of the exponential loss with binary weak learners, the line search step has a convenient closed form; but for other losses, or even for the exponential loss but with confidence-rated predictors, there may not be a closed form. [sent-104, score-0.634]
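For reference, the closed form alluded to above is the familiar AdaBoost update (a standard fact stated here under the usual conventions, not quoted from this paper): writing r_t for the weighted correlation (edge) of the chosen binary weak learner and ε_t = (1 − r_t)/2 for its weighted error, the exact line search along that coordinate gives

```latex
% Standard AdaBoost step for the exponential loss with h_j(x) in {-1, +1}.
\alpha_t \;=\; \tfrac{1}{2}\ln\frac{1 + r_t}{1 - r_t}
        \;=\; \tfrac{1}{2}\ln\frac{1 - \varepsilon_t}{\varepsilon_t}.
```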

35 To produce the eventual convergence rates, this manuscript utilizes a step size minimizing an upper bounding quadratic (which is guaranteed to exist); if instead a standard iterative line search guarantee were used, rates would only degrade by a constant factor [17, section 9. [sent-106, score-0.336]
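The template algorithm described in the surrounding sentences can be written in a few lines. The code below is an illustrative reconstruction rather than the paper's implementation: it picks the coordinate (weak learner) with the largest entry of ∇(f ∘ A)(λ) = A^⊤∇f(Aλ) in absolute value, and substitutes a generic bounded scalar line search for the quadratic-upper-bound step size used in the manuscript.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def template_boost(A, g, g_prime, T=100, tol=1e-12):
    """Boosting as coordinate descent on lambda -> f(A @ lambda) (sketch).

    Column choice follows (grad (f o A)(lam))_j = (A^T grad f(A lam))_j;
    the step comes from a bounded 1-d line search, a stand-in for the
    upper-bounding-quadratic step of the manuscript.
    """
    lam = np.zeros(A.shape[1])
    for _ in range(T):
        margins = A @ lam
        grad = A.T @ g_prime(margins)      # gradient in weak-learner coordinates
        j = int(np.argmax(np.abs(grad)))
        if abs(grad[j]) < tol:             # stopping condition
            break
        line = lambda a: float(np.sum(g(margins + a * A[:, j])))
        alpha = minimize_scalar(line, bounds=(-20.0, 20.0), method="bounded").x
        lam[j] += alpha
    return lam

# Exponential loss and its derivative; hypothetical 4x3 encoding matrix.
g, g_prime = (lambda z: np.exp(-z)), (lambda z: -np.exp(-z))
A = np.array([[ 1.,  1., -1.],
              [ 1., -1., -1.],
              [-1.,  1.,  1.],
              [ 1.,  1.,  1.]])
print(template_boost(A, g, g_prime, T=50))
```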

36 Sometimes such a halfspace may not exist, and g applies a smoothly increasing penalty to points that are farther and farther outside it. [sent-111, score-0.303]

37 This section provides the convex dual to eq. [sent-112, score-0.251]

38 The relevance of the dual to convergence rates is as follows. [sent-115, score-0.382]

39 First, although the primal optimum may not be attainable, the dual optimum is always attainable—this suggests a strategy of mapping the convergence analysis to the dual, where there exists a clear notion of progress to the optimum. [sent-116, score-0.601]

40 Second, this section determines the dual feasible set—the space of dual variables or what the boosting literature typically calls unnormalized weights. [sent-117, score-0.834]

41 Understanding this set is key to relating weak learnability, attainability, and general instances. [sent-118, score-0.372]

42 Before proceeding, note that the dual formulation will make use of the Fenchel conjugate h*(φ) = sup_{x ∈ dom(h)} ⟨x, φ⟩ − h(x), a concept taking a central place in convex analysis [18, 19]. [sent-119, score-0.251]

43 Interestingly, the Fenchel conjugates to the exponential and logistic losses are respectively the Boltzmann–Shannon and Fermi–Dirac entropies [19, Commentary, section 3. [sent-120, score-0.246]

44 3], and thus the dual is explicitly performing entropy maximization (cf. [sent-121, score-0.217]

45 For any A ∈ ℝ^{m×n} and g ∈ G0 with f(x) = Σ_i g((x)_i), inf{f(Aλ) : λ ∈ ℝ^n} = sup{−f*(−φ) : φ ∈ Φ_A}, where Φ_A := Ker(A^⊤) ∩ ℝ^m_+ is the dual feasible set. [sent-127, score-0.365]

46 The dual optimum … Lastly, f*(φ) = Σ_{i=1}^m g*((φ)_i). [sent-128, score-0.263]
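To make the entropy connection concrete, here is a standard conjugate computation (worked out here, not quoted from the paper, and written under the sign convention of the dual above, i.e. the conjugate is evaluated at −φ with φ ≥ 0 and the convention 0 ln 0 = 0): the exponential loss yields the Boltzmann–Shannon entropy and the logistic loss yields the Fermi–Dirac entropy.

```latex
\begin{align*}
  g(x) = e^{-x}          &: \quad g^*(-\phi) = \phi\ln\phi - \phi,
      \qquad \phi \ge 0 \quad \text{(Boltzmann--Shannon)},\\
  g(x) = \ln(1 + e^{-x}) &: \quad g^*(-\phi) = \phi\ln\phi + (1-\phi)\ln(1-\phi),
      \qquad \phi \in [0, 1] \quad \text{(Fermi--Dirac)}.
\end{align*}
```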

47 The dual feasible set Φ_A = Ker(A^⊤) ∩ ℝ^m_+ has a strong interpretation. [sent-131, score-0.273]

48 That is to say, every nonzero feasible dual vector provides an (unnormalized) distribution upon which every weak learner is uncorrelated! [sent-133, score-0.688]

49 Furthermore, recall that the weak learning assumption states that under any weighting of the input, there exists a correlated weak learner; as such, weak learnability necessitates that the dual feasible set contains only the zero vector. [sent-134, score-1.657]

50 As such, the constrained dual problem is aiming to write the origin as a high entropy convex combination of the points {a_i}_{i=1}^m. [sent-138, score-0.341]

51 The weak learning rate was critical to the original convergence analysis of AdaBoost, providing a handle on the progress of the algorithm. [sent-139, score-0.610]

52 Recall that the quantity appeared in the denominator of the convergence rate, and a weak learning assumption critically provided that this quantity is nonzero. [sent-140, score-0.573]

53 This section will generalize the weak learning rate to a quantity which is always positive, without any assumptions. [sent-141, score-0.486]

54 Note briefly that this manuscript will differ slightly from the norm in that weak learning will be a purely sample-specific concept. [sent-142, score-0.471]

55 The usual weak learning assumption states that there exists no uncorrelating distribution over the input space. [sent-145, score-0.438]

56 This of course implies that any training sample S used by the algorithm will also have this property; however, it suffices that there is no distribution over the input sample S which uncorrelates the weak learners from the target. [sent-146, score-0.448]

57 Returning to task, the weak learning assumption posits the existence of a constant, the weak learning rate γ, which lower bounds the correlation of the best weak learner with the target for any distribution. [sent-147, score-1.268]

58 Stated in terms of the matrix A, 0 < γ = inf_{φ ∈ ℝ^m_+, ‖φ‖_1 = 1} max_{j ∈ [n]} Σ_{i=1}^m (φ)_i y_i h_j(x_i) = inf_{φ ∈ ℝ^m_+ \ {0_m}} ‖A^⊤φ‖_∞ / ‖φ‖_1 = inf_{φ ∈ ℝ^m_+ \ {0_m}} ‖A^⊤φ‖_∞ / D^1_{{0_m}}(φ). (4.1)

59 The only way this quantity can be positive is if no nonzero φ ∈ ℝ^m_+ lies in Ker(A^⊤), meaning the dual feasible set Φ_A = Ker(A^⊤) ∩ ℝ^m_+ is exactly {0_m}. [sent-150, score-0.305]

60 As such, one candidate adjustment is to simply replace {0_m} with the dual feasible set: γ_0 := inf_{φ ∉ Φ_A} ‖A^⊤φ‖_∞ / D^1_{Φ_A}(φ).
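As an aside, the classical rate in (4.1) can be computed exactly by a small linear program, since by positive homogeneity the infimum over ℝ^m_+ \ {0_m} may be restricted to the probability simplex. The sketch below is illustrative only (it computes the unmodified γ, not the generalized γ(A, S) of this manuscript): it minimizes ‖A^⊤φ‖_∞ over the simplex; a strictly positive optimum certifies weak learnability, while a zero optimum means the optimal φ itself is a nonzero element of the dual feasible set.

```python
import numpy as np
from scipy.optimize import linprog

def classical_weak_learning_rate(A):
    """gamma = min over the simplex {phi >= 0, sum phi = 1} of ||A^T phi||_inf.

    LP variables (phi_1..phi_m, t): minimize t subject to
    -t <= (A^T phi)_j <= t for every column j, phi >= 0, sum_i phi_i = 1.
    """
    m, n = A.shape
    c = np.r_[np.zeros(m), 1.0]                    # objective: minimize t
    A_ub = np.block([[ A.T, -np.ones((n, 1))],     #  (A^T phi)_j - t <= 0
                     [-A.T, -np.ones((n, 1))]])    # -(A^T phi)_j - t <= 0
    b_ub = np.zeros(2 * n)
    A_eq = np.r_[np.ones(m), 0.0].reshape(1, -1)   # sum_i phi_i = 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.fun

# Hypothetical separable encoding: every weak learner agrees with the labels
# on a majority of examples, so gamma comes out strictly positive.
A = np.array([[ 1.,  1., -1.],
              [ 1., -1.,  1.],
              [-1.,  1.,  1.]])
print(classical_weak_learning_rate(A))
```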

61 The pieces are in place to finally sketch how the convergence rates may be proved. [sent-164, score-0.232]

62 This section identifies how the weak learning rate γ(A, S) can be used to convert the standard gradient guarantees into something which can be used in the presence of no attainable minimum. [sent-165, score-0.537]

63 However, outside the weak learnability case, the other terms in the bounds here will also incur a penalty of the form e^m for the exponential loss, and there is some evidence that this is unavoidable (see the lower bounds in Mukherjee et al. [sent-177, score-0.663]

64 Next, note how the standard guarantee for coordinate descent methods can lead to guarantees on the progress of the algorithm in terms of dual distances, thanks to γ(A, S). [sent-180, score-0.36]

65 For any t, A ≠ 0_{m×n}, S ⊇ {−∇f(Aλ_t)} with γ(A, S) > 0, and g ∈ G, f(Aλ_{t+1}) − f̄_A ≤ f(Aλ_t) − f̄_A − γ(A, S)² D^1_{S∩Ker(A^⊤)}(−∇f(Aλ_t))² / (2η f(Aλ_t)). Proof. [sent-183, score-0.216]

66 The stopping condition grants −∇f(Aλ_t) ∉ Ker(A^⊤), and γ(A, S) = inf_{φ ∈ S \ Ker(A^⊤)} ‖A^⊤φ‖_∞ / D^1_{S∩Ker(A^⊤)}(φ). [sent-184, score-0.365]

67 Thus, by definition of γ(A, S), γ(A, S) · D^1_{S∩Ker(A^⊤)}(−∇f(Aλ_t)) ≤ ‖A^⊤∇f(Aλ_t)‖_∞. [sent-186, score-0.216]

68 Figure 2, panel (a): Weak learnability. [sent-187, score-0.216]

69 Figure 2: Viewing the rows {a_i}_{i=1}^m of A as points in ℝ^n, boosting seeks a homogeneous halfspace, parameterized by a normal λ ∈ ℝ^n, which contains all m points. [sent-190, score-0.311]

70 The convergence rate and dynamics of this process are controlled by A, which dictates one of the three above scenarios. [sent-192, score-0.223]

71 …2), f(Aλ_t) − f(Aλ_{t+1}) ≥ ‖A^⊤∇f(Aλ_t)‖_∞² / (2η f(Aλ_t)) ≥ γ(A, S)² D^1_{S∩Ker(A^⊤)}(−∇f(Aλ_t))² / (2η f(Aλ_t)). [sent-195, score-0.432]

72 Recall the interpretation of boosting closing section 2: boosting seeks a halfspace, parameterized by λ ∈ ℝ^n, which contains the points {a_i}_{i=1}^m. [sent-197, score-0.625]

73 …2 will be divided into three cases, each distinguished by the kind of halfspace which boosting can reach. [sent-199, score-0.459]

74 The first case is weak learnability: positive margins can be attained on each example, meaning a halfspace exists which strictly contains all points. [sent-202, score-0.692]

75 Boosting races to push all these margins unboundedly large, and has a convergence rate O(ln(1/ε)). [sent-203, score-0.431]

76 Next is the case that no halfspace contains the points within its interior: either any such halfspace has the points on its boundary, or no such halfspace exists at all (the degenerate choice λ = 0_n). [sent-204, score-0.643]

77 This is the case of attainability: boosting races towards finite margins at the rate O(ln(1/ε)). [sent-205, score-0.5]

78 The final situation is a mix of the two: there exists a halfspace with some points on the boundary, some within its interior. [sent-206, score-0.251]

79 The equivalence means the presence of any of these properties suffices to indicate weak learnability. [sent-220, score-0.372]

80 The last two statements encode the usual distributional version of the weak learning assumption. [sent-221, score-0.406]

81 The first encodes the fact that there exists a homogeneous halfspace containing all points within its interior; this encodes separability, since removing the factor y_i from the definition of a_i will place all negative points outside the halfspace. [sent-222, score-0.459]

82 …1, ℝ^m_+ ∩ Ker(A^⊤) = Φ_A = {0_m}, so D^1_{Φ_A}(−∇f(Aλ_t)) = inf_{ψ ∈ Φ_A} ‖−∇f(Aλ_t) − ψ‖_1 = ‖∇f(Aλ_t)‖_1, which combined with g ≤ … gives … ≥ f(Aλ_t)/…. [sent-229, score-0.582]

83 …(4.1) along with the polyhedron ℝ^m_+ ⊇ −∇f(ℝ^m) (whereby γ(A, ℝ^m_+) > 0 by proposition 4. [sent-231, score-0.316]

84 Since the present setting is weak learnability, note by (4. [sent-234, score-0.372]

85 1) that the choice of polyhedron ℝ^m_+ grants that γ(A, ℝ^m_+) is exactly the original weak learning rate. [sent-235, score-0.482]

86 While it is unclear if this analysis of γ and η was tight, note that it is plausible that the logistic loss is slower than the exponential loss in this scenario, as it works less in initial phases to correct minor margin violations. [sent-239, score-0.39]

87 5), attainability entails that the dual has fully interior points, and furthermore that the dual optimum is interior. [sent-249, score-0.723]

88 4) provided that the dual optimum has zeros at every coordinate. [sent-252, score-0.263]

89 As will be made clear in section 8, the primal and dual weights have the following dichotomy: either the margin ⟨a_i, λ⟩ goes to infinity and (φ̄_A)_i goes to zero, or the margin stays finite and (φ̄_A)_i goes to some positive value. [sent-253, score-0.601]

90 Finally, γ(A, −∇f(C)) > 0, and for all t, f(Aλ_t) − f̄_A ≤ (f(0_m) − f̄_A)(1 − c·γ(A, −∇f(C))² …)^t. [sent-260, score-0.432]

91 The appearance of a modulus of strong convexity c (i. [sent-262, score-0.3]

92 When the infimum is attainable, every margin ⟨a_i, λ⟩ converges to some finite value. [sent-265, score-0.22]

93 2) provides that no halfspace contains all points, so if one margin becomes positive and large, another becomes negative and large, giving a terrible objective value. [sent-267, score-0.247]

94 , quadratic lower bounds in the primal) grants quadratic upper bounds in the dual, which can be used to bound the dual distance in proposition 5. [sent-271, score-0.321]

95 This approach fails under weak learnability—some primal weights grow unboundedly, all dual weights shrink to zero, and no compact set contains all margins. [sent-273, score-0.713]

96 …2) (inf_{λ∈ℝ^n} f(Aλ) = inf_{λ∈ℝ^n} f(A_+λ)) ∧ (inf_{λ∈ℝ^n} f(A_0λ) = 0) ∧ (f ∘ A_+ has minimizers), with A = [A_0; A_+], A_0 ∈ ℝ^{z×n}, A_+ ∈ ℝ^{p×n}, and (Φ_{A_0} = {0_z}) ∧ (Φ_{A_+} ∩ ℝ^p_{++} ≠ ∅) ∧ (Φ_A = Φ_{A_0} × Φ_{A_+}). [sent-282, score-0.591]

97 Set w := sup_t ‖∇f(A_+λ_t) + P^1_{Φ_{A_+}}(−∇f(A_+λ_t))‖_1. [sent-304, score-0.274]

98 Then γ(A, ℝ^z_+ × −∇f(C_+)) > 0, and for all t, f(Aλ_t) − f̄_A ≤ 2 f(Aλ_0) / ((t + 1) · min{1, γ(A, ℝ^z_+ × −∇f(C_+))² / ((… + w/(2c))² η)}). [sent-306, score-0.432]

99 On the convergence of the coordinate descent method for convex differentiable minimization. [sent-347, score-0.241]

100 On the equivalence of weak learnability and linear separability: New relaxations and efficient boosting algorithms. [sent-378, score-0.853]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('weak', 0.372), ('ker', 0.306), ('boosting', 0.279), ('dual', 0.217), ('rf', 0.216), ('rm', 0.212), ('learnability', 0.202), ('attainability', 0.197), ('infn', 0.197), ('halfspace', 0.18), ('hai', 0.153), ('fa', 0.139), ('logistic', 0.119), ('convergence', 0.11), ('ka', 0.105), ('margins', 0.101), ('ln', 0.099), ('manuscript', 0.099), ('primal', 0.097), ('inf', 0.092), ('schapire', 0.09), ('attainable', 0.083), ('rate', 0.082), ('robert', 0.08), ('learners', 0.076), ('loss', 0.073), ('adaboost', 0.07), ('losses', 0.069), ('pieces', 0.067), ('margin', 0.067), ('mukherjee', 0.066), ('unnormalized', 0.065), ('hj', 0.063), ('tsch', 0.06), ('rn', 0.059), ('exponential', 0.058), ('gunnar', 0.058), ('krf', 0.058), ('unboundedly', 0.058), ('grants', 0.057), ('coordinate', 0.057), ('feasible', 0.056), ('mum', 0.056), ('rates', 0.055), ('polyhedron', 0.053), ('separability', 0.05), ('rz', 0.05), ('proposition', 0.047), ('interior', 0.046), ('optimum', 0.046), ('progress', 0.046), ('yoav', 0.045), ('ejt', 0.044), ('eventual', 0.044), ('indraneel', 0.044), ('modulus', 0.043), ('manfred', 0.043), ('learner', 0.043), ('rp', 0.042), ('yoram', 0.042), ('push', 0.042), ('theorem', 0.042), ('convexity', 0.041), ('descent', 0.04), ('risk', 0.04), ('ai', 0.04), ('exists', 0.039), ('races', 0.038), ('mal', 0.038), ('yi', 0.037), ('stitching', 0.035), ('closing', 0.035), ('fenchel', 0.035), ('convex', 0.034), ('scenarios', 0.034), ('encodes', 0.034), ('encode', 0.034), ('attains', 0.033), ('mechanism', 0.032), ('points', 0.032), ('quantity', 0.032), ('dictates', 0.031), ('outside', 0.031), ('origin', 0.031), ('company', 0.03), ('nish', 0.03), ('publishing', 0.03), ('farther', 0.03), ('colt', 0.03), ('freund', 0.029), ('pp', 0.028), ('jonathan', 0.028), ('degrade', 0.028), ('tightest', 0.028), ('remark', 0.028), ('assumption', 0.027), ('aiming', 0.027), ('minimizers', 0.027), ('compact', 0.027), ('revealed', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 282 nips-2011-The Fast Convergence of Boosting

Author: Matus J. Telgarsky

Abstract: This manuscript considers the convergence rate of boosting under a large class of losses, including the exponential and logistic losses, where the best previous rate of convergence was O(exp(1/ε²)). First, it is established that the setting of weak learnability aids the entire class, granting a rate O(ln(1/ε)). Next, the (disjoint) conditions under which the infimal empirical risk is attainable are characterized in terms of the sample and weak learning class, and a new proof is given for the known rate O(ln(1/ε)). Finally, it is established that any instance can be decomposed into two smaller instances resembling the two preceding special cases, yielding a rate O(1/ε), with a matching lower bound for the logistic loss. The principal technical hurdle throughout this work is the potential unattainability of the infimal empirical risk; the technique for overcoming this barrier may be of general interest.

2 0.31003037 29 nips-2011-Algorithms and hardness results for parallel large margin learning

Author: Phil Long, Rocco Servedio

Abstract: We study the fundamental problem of learning an unknown large-margin halfspace in the context of parallel computation. Our main positive result is a parallel algorithm for learning a large-margin halfspace that is based on interior point methods from convex optimization and fast parallel algorithms for matrix computations. We show that this algorithm learns an unknown γ-margin halfspace over n dimensions using poly(n, 1/γ) processors and runs in time Õ(1/γ) + O(log n). In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have Ω(1/γ²) runtime dependence on γ. Our main negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We give an information-theoretic proof that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized: the ability to call the weak learner multiple times in parallel within a single boosting stage does not reduce the overall number of successive stages of boosting that are required.

3 0.25935742 178 nips-2011-Multiclass Boosting: Theory and Algorithms

Author: Mohammad J. Saberian, Nuno Vasconcelos

Abstract: The problem of multi-class boosting is considered. A new framework, based on multi-dimensional codewords and predictors is introduced. The optimal set of codewords is derived, and a margin enforcing loss proposed. The resulting risk is minimized by gradient descent on a multidimensional functional space. Two algorithms are proposed: 1) CD-MCBoost, based on coordinate descent, updates one predictor component at a time, 2) GD-MCBoost, based on gradient descent, updates all components jointly. The algorithms differ in the weak learners that they support but are both shown to be 1) Bayes consistent, 2) margin enforcing, and 3) convergent to the global minimum of the risk. They also reduce to AdaBoost when there are only two classes. Experiments show that both methods outperform previous multiclass boosting approaches on a number of datasets. 1

4 0.19114654 153 nips-2011-Learning large-margin halfspaces with more malicious noise

Author: Phil Long, Rocco Servedio

Abstract: We describe a simple algorithm that runs in time poly(n, 1/γ, 1/ε) and learns an unknown n-dimensional γ-margin halfspace to accuracy 1 − ε in the presence of malicious noise, when the noise rate is allowed to be as high as Θ(εγ log(1/γ)). Previous efficient algorithms could only learn to accuracy ε in the presence of malicious noise of rate at most Θ(εγ). Our algorithm does not work by optimizing a convex loss function. We show that no algorithm for learning γ-margin halfspaces that minimizes a convex proxy for misclassification error can tolerate malicious noise at a rate greater than Θ(εγ); this may partially explain why previous algorithms could not achieve the higher noise tolerance of our new algorithm. 1

5 0.15962084 299 nips-2011-Variance Penalizing AdaBoost

Author: Pannagadatta K. Shivaswamy, Tony Jebara

Abstract: This paper proposes a novel boosting algorithm called VadaBoost which is motivated by recent empirical Bernstein bounds. VadaBoost iteratively minimizes a cost function that balances the sample mean and the sample variance of the exponential loss. Each step of the proposed algorithm minimizes the cost efficiently by providing weighted data to a weak learner rather than requiring a brute force evaluation of all possible weak learners. Thus, the proposed algorithm solves a key limitation of previous empirical Bernstein boosting methods which required brute force enumeration of all possible weak learners. Experimental results confirm that the new algorithm achieves the performance improvements of EBBoost yet goes beyond decision stumps to handle any weak learner. Significant performance gains are obtained over AdaBoost for arbitrary weak learners including decision trees (CART). 1

6 0.12925301 49 nips-2011-Boosting with Maximum Adaptive Sampling

7 0.091441482 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time

8 0.089622669 202 nips-2011-On the Universality of Online Mirror Descent

9 0.083929807 114 nips-2011-Hierarchical Multitask Structured Output Learning for Large-scale Sequence Segmentation

10 0.081381522 203 nips-2011-On the accuracy of l1-filtering of signals with block-sparse structure

11 0.079368964 294 nips-2011-Unifying Framework for Fast Learning Rate of Non-Sparse Multiple Kernel Learning

12 0.076189019 305 nips-2011-k-NN Regression Adapts to Local Intrinsic Dimension

13 0.069915533 25 nips-2011-Adaptive Hedge

14 0.069205582 187 nips-2011-Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning

15 0.067941174 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity

16 0.06736514 59 nips-2011-Composite Multiclass Losses

17 0.064142741 63 nips-2011-Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization

18 0.063970454 72 nips-2011-Distributed Delayed Stochastic Optimization

19 0.062300839 198 nips-2011-On U-processes and clustering performance

20 0.061970398 4 nips-2011-A Convergence Analysis of Log-Linear Training


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.198), (1, -0.079), (2, -0.083), (3, -0.129), (4, -0.001), (5, 0.134), (6, 0.026), (7, -0.099), (8, -0.332), (9, 0.061), (10, -0.056), (11, 0.004), (12, -0.001), (13, -0.113), (14, 0.052), (15, -0.04), (16, 0.063), (17, -0.151), (18, -0.188), (19, -0.173), (20, -0.014), (21, -0.076), (22, -0.057), (23, 0.097), (24, 0.058), (25, -0.084), (26, 0.032), (27, -0.067), (28, 0.058), (29, 0.004), (30, -0.092), (31, 0.027), (32, -0.033), (33, 0.024), (34, 0.019), (35, 0.036), (36, 0.0), (37, -0.019), (38, -0.012), (39, -0.03), (40, -0.008), (41, 0.011), (42, -0.047), (43, -0.05), (44, -0.012), (45, 0.056), (46, -0.056), (47, 0.0), (48, 0.047), (49, -0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95926291 282 nips-2011-The Fast Convergence of Boosting

Author: Matus J. Telgarsky

Abstract: This manuscript considers the convergence rate of boosting under a large class of losses, including the exponential and logistic losses, where the best previous rate of convergence was O(exp(1/ε²)). First, it is established that the setting of weak learnability aids the entire class, granting a rate O(ln(1/ε)). Next, the (disjoint) conditions under which the infimal empirical risk is attainable are characterized in terms of the sample and weak learning class, and a new proof is given for the known rate O(ln(1/ε)). Finally, it is established that any instance can be decomposed into two smaller instances resembling the two preceding special cases, yielding a rate O(1/ε), with a matching lower bound for the logistic loss. The principal technical hurdle throughout this work is the potential unattainability of the infimal empirical risk; the technique for overcoming this barrier may be of general interest.

2 0.86033016 29 nips-2011-Algorithms and hardness results for parallel large margin learning

Author: Phil Long, Rocco Servedio

Abstract: We study the fundamental problem of learning an unknown large-margin halfspace in the context of parallel computation. Our main positive result is a parallel algorithm for learning a large-margin halfspace that is based on interior point methods from convex optimization and fast parallel algorithms for matrix computations. We show that this algorithm learns an unknown γ-margin halfspace over n dimensions using poly(n, 1/γ) processors and runs in time Õ(1/γ) + O(log n). In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have Ω(1/γ²) runtime dependence on γ. Our main negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We give an information-theoretic proof that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized: the ability to call the weak learner multiple times in parallel within a single boosting stage does not reduce the overall number of successive stages of boosting that are required.

3 0.79871839 153 nips-2011-Learning large-margin halfspaces with more malicious noise

Author: Phil Long, Rocco Servedio

Abstract: We describe a simple algorithm that runs in time poly(n, 1/γ, 1/ε) and learns an unknown n-dimensional γ-margin halfspace to accuracy 1 − ε in the presence of malicious noise, when the noise rate is allowed to be as high as Θ(εγ log(1/γ)). Previous efficient algorithms could only learn to accuracy ε in the presence of malicious noise of rate at most Θ(εγ). Our algorithm does not work by optimizing a convex loss function. We show that no algorithm for learning γ-margin halfspaces that minimizes a convex proxy for misclassification error can tolerate malicious noise at a rate greater than Θ(εγ); this may partially explain why previous algorithms could not achieve the higher noise tolerance of our new algorithm. 1

4 0.78617764 178 nips-2011-Multiclass Boosting: Theory and Algorithms

Author: Mohammad J. Saberian, Nuno Vasconcelos

Abstract: The problem of multi-class boosting is considered. A new framework, based on multi-dimensional codewords and predictors is introduced. The optimal set of codewords is derived, and a margin enforcing loss proposed. The resulting risk is minimized by gradient descent on a multidimensional functional space. Two algorithms are proposed: 1) CD-MCBoost, based on coordinate descent, updates one predictor component at a time, 2) GD-MCBoost, based on gradient descent, updates all components jointly. The algorithms differ in the weak learners that they support but are both shown to be 1) Bayes consistent, 2) margin enforcing, and 3) convergent to the global minimum of the risk. They also reduce to AdaBoost when there are only two classes. Experiments show that both methods outperform previous multiclass boosting approaches on a number of datasets. 1

5 0.78533077 299 nips-2011-Variance Penalizing AdaBoost

Author: Pannagadatta K. Shivaswamy, Tony Jebara

Abstract: This paper proposes a novel boosting algorithm called VadaBoost which is motivated by recent empirical Bernstein bounds. VadaBoost iteratively minimizes a cost function that balances the sample mean and the sample variance of the exponential loss. Each step of the proposed algorithm minimizes the cost efficiently by providing weighted data to a weak learner rather than requiring a brute force evaluation of all possible weak learners. Thus, the proposed algorithm solves a key limitation of previous empirical Bernstein boosting methods which required brute force enumeration of all possible weak learners. Experimental results confirm that the new algorithm achieves the performance improvements of EBBoost yet goes beyond decision stumps to handle any weak learner. Significant performance gains are obtained over AdaBoost for arbitrary weak learners including decision trees (CART). 1

6 0.54779065 49 nips-2011-Boosting with Maximum Adaptive Sampling

7 0.45467648 252 nips-2011-ShareBoost: Efficient multiclass learning with feature sharing

8 0.40106654 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time

9 0.39562535 59 nips-2011-Composite Multiclass Losses

10 0.37471685 241 nips-2011-Scalable Training of Mixture Models via Coresets

11 0.36411059 77 nips-2011-Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression

12 0.36035922 3 nips-2011-A Collaborative Mechanism for Crowdsourcing Prediction Problems

13 0.35650328 202 nips-2011-On the Universality of Online Mirror Descent

14 0.35237309 46 nips-2011-Better Mini-Batch Algorithms via Accelerated Gradient Methods

15 0.35144734 103 nips-2011-Generalization Bounds and Consistency for Latent Structural Probit and Ramp Loss

16 0.33945063 305 nips-2011-k-NN Regression Adapts to Local Intrinsic Dimension

17 0.33196393 121 nips-2011-Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent

18 0.32262725 33 nips-2011-An Exact Algorithm for F-Measure Maximization

19 0.31925473 95 nips-2011-Fast and Accurate k-means For Large Datasets

20 0.31715372 4 nips-2011-A Convergence Analysis of Log-Linear Training


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.028), (4, 0.036), (20, 0.037), (26, 0.028), (31, 0.042), (43, 0.536), (45, 0.092), (57, 0.02), (74, 0.05), (83, 0.027), (99, 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.992643 26 nips-2011-Additive Gaussian Processes

Author: David K. Duvenaud, Hannes Nickisch, Carl E. Rasmussen

Abstract: We introduce a Gaussian process model of functions which are additive. An additive function is one which decomposes into a sum of low-dimensional functions, each depending on only a subset of the input variables. Additive GPs generalize both Generalized Additive Models, and the standard GP models which use squared-exponential kernels. Hyperparameter learning in this model can be seen as Bayesian Hierarchical Kernel Learning (HKL). We introduce an expressive but tractable parameterization of the kernel function, which allows efficient evaluation of all input interaction terms, whose number is exponential in the input dimension. The additional structure discoverable by this model results in increased interpretability, as well as state-of-the-art predictive power in regression tasks. 1

2 0.95855743 146 nips-2011-Learning Higher-Order Graph Structure with Features by Structure Penalty

Author: Shilin Ding, Grace Wahba, Xiaojin Zhu

Abstract: In discrete undirected graphical models, the conditional independence of node labels Y is specified by the graph structure. We study the case where there is another input random vector X (e.g. observed features) such that the distribution P (Y | X) is determined by functions of X that characterize the (higher-order) interactions among the Y ’s. The main contribution of this paper is to learn the graph structure and the functions conditioned on X at the same time. We prove that discrete undirected graphical models with feature X are equivalent to multivariate discrete models. The reparameterization of the potential functions in graphical models by conditional log odds ratios of the latter offers advantages in representation of the conditional independence structure. The functional spaces can be flexibly determined by kernels. Additionally, we impose a Structure Lasso (SLasso) penalty on groups of functions to learn the graph structure. These groups with overlaps are designed to enforce hierarchical function selection. In this way, we are able to shrink higher order interactions to obtain a sparse graph structure. 1

same-paper 3 0.95527422 282 nips-2011-The Fast Convergence of Boosting

Author: Matus J. Telgarsky

Abstract: This manuscript considers the convergence rate of boosting under a large class of losses, including the exponential and logistic losses, where the best previous rate of convergence was O(exp(1/ε²)). First, it is established that the setting of weak learnability aids the entire class, granting a rate O(ln(1/ε)). Next, the (disjoint) conditions under which the infimal empirical risk is attainable are characterized in terms of the sample and weak learning class, and a new proof is given for the known rate O(ln(1/ε)). Finally, it is established that any instance can be decomposed into two smaller instances resembling the two preceding special cases, yielding a rate O(1/ε), with a matching lower bound for the logistic loss. The principal technical hurdle throughout this work is the potential unattainability of the infimal empirical risk; the technique for overcoming this barrier may be of general interest.

4 0.93558562 44 nips-2011-Bayesian Spike-Triggered Covariance Analysis

Author: Jonathan W. Pillow, Il M. Park

Abstract: Neurons typically respond to a restricted number of stimulus features within the high-dimensional space of natural stimuli. Here we describe an explicit modelbased interpretation of traditional estimators for a neuron’s multi-dimensional feature space, which allows for several important generalizations and extensions. First, we show that traditional estimators based on the spike-triggered average (STA) and spike-triggered covariance (STC) can be formalized in terms of the “expected log-likelihood” of a Linear-Nonlinear-Poisson (LNP) model with Gaussian stimuli. This model-based formulation allows us to define maximum-likelihood and Bayesian estimators that are statistically consistent and efficient in a wider variety of settings, such as with naturalistic (non-Gaussian) stimuli. It also allows us to employ Bayesian methods for regularization, smoothing, sparsification, and model comparison, and provides Bayesian confidence intervals on model parameters. We describe an empirical Bayes method for selecting the number of features, and extend the model to accommodate an arbitrary elliptical nonlinear response function, which results in a more powerful and more flexible model for feature space inference. We validate these methods using neural data recorded extracellularly from macaque primary visual cortex. 1

5 0.93202823 264 nips-2011-Sparse Recovery with Brownian Sensing

Author: Alexandra Carpentier, Odalric-ambrym Maillard, Rémi Munos

Abstract: We consider the problem of recovering the parameter α ∈ R^K of a sparse function f (i.e. the number of non-zero entries of α is small compared to the number K of features) given noisy evaluations of f at a set of well-chosen sampling points. We introduce an additional randomization process, called Brownian sensing, based on the computation of stochastic integrals, which produces a Gaussian sensing matrix, for which good recovery properties are proven, independently of the number of sampling points N, even when the features are arbitrarily non-orthogonal. Under the assumption that f is Hölder continuous with exponent at least 1/2, we provide an estimate α̂ of the parameter such that ‖α̂ − α‖_2 = O(‖η‖_2 / √N), where η is the observation noise. The method uses a set of sampling points uniformly distributed along a one-dimensional curve selected according to the features. We report numerical experiments illustrating our method.

6 0.91917926 288 nips-2011-Thinning Measurement Models and Questionnaire Design

7 0.75911796 117 nips-2011-High-Dimensional Graphical Model Selection: Tractable Graph Families and Necessary Conditions

8 0.73298943 203 nips-2011-On the accuracy of l1-filtering of signals with block-sparse structure

9 0.71352041 281 nips-2011-The Doubly Correlated Nonparametric Topic Model

10 0.70965713 123 nips-2011-How biased are maximum entropy models?

11 0.70901453 195 nips-2011-On Learning Discrete Graphical Models using Greedy Methods

12 0.68953705 294 nips-2011-Unifying Framework for Fast Learning Rate of Non-Sparse Multiple Kernel Learning

13 0.68616033 132 nips-2011-Inferring Interaction Networks using the IBP applied to microRNA Target Prediction

14 0.6793741 296 nips-2011-Uniqueness of Belief Propagation on Signed Graphs

15 0.67030424 24 nips-2011-Active learning of neural response functions with Gaussian processes

16 0.66829455 152 nips-2011-Learning in Hilbert vs. Banach Spaces: A Measure Embedding Viewpoint

17 0.66639459 183 nips-2011-Neural Reconstruction with Approximate Message Passing (NeuRAMP)

18 0.65908486 306 nips-2011-t-divergence Based Approximate Inference

19 0.65878183 82 nips-2011-Efficient coding of natural images with a population of noisy Linear-Nonlinear neurons

20 0.65556312 273 nips-2011-Structural equations and divisive normalization for energy-dependent component analysis