nips nips2007 nips2007-40 knowledge-graph by maker-knowledge-mining

40 nips-2007-Bundle Methods for Machine Learning


Source: pdf

Author: Quoc V. Le, Alex J. Smola, S.v.n. Vishwanathan

Abstract: We present a globally convergent method for regularized risk minimization problems. Our method applies to Support Vector estimation, regression, Gaussian Processes, and any other regularized risk minimization setting which leads to a convex optimization problem. SVMPerf can be shown to be a special case of our approach. In addition to the unified framework we present tight convergence bounds, which show that our algorithm converges in O(1/ε) steps to precision ε for general convex problems and in O(log(1/ε)) steps for continuously differentiable problems. We demonstrate in experiments the performance of our approach.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We present a globally convergent method for regularized risk minimization problems. [sent-12, score-0.225]

2 Our method applies to Support Vector estimation, regression, Gaussian Processes, and any other regularized risk minimization setting which leads to a convex optimization problem. [sent-13, score-0.319]

3 In addition to the unified framework we present tight convergence bounds, which show that our algorithm converges in O(1/ε) steps to precision ε for general convex problems and in O(log(1/ε)) steps for continuously differentiable problems. [sent-15, score-0.39]

4 Starting from the active set methods described by Vapnik [17] increasingly sophisticated algorithms for solving regularized risk minimization problems have been developed. [sent-18, score-0.195]

5 Some of the most exciting recent developments are SVMPerf [5] and the Pegasos gradient descent solver [12]. [sent-19, score-0.201]

6 In addition to this, we present a formulation which does not require the solution of a quadratic program whilst in practice enjoying the same rate of convergence as algorithms of the SVMPerf family. [sent-25, score-0.246]

7 Our error analysis shows that the rates achieved by this algorithm are considerably better than what was previously known for SVMPerf: the algorithm enjoys O(1/ε) convergence in general, and O(log(1/ε)) convergence whenever the loss is sufficiently smooth. [sent-26, score-0.28]

8 2 Problem Setting Denote by x ∈ X and y ∈ Y patterns and labels respectively and let l(x, y, w) be a loss function which is convex in w ∈ W, where either W = Rd (linear classifier), or W is a Reproducing Kernel Hilbert Space for kernel methods. [sent-31, score-0.134]

9 Given a set of m training patterns {(x_i, y_i)}_{i=1}^m, the regularized risk functional which many estimation methods strive to minimize can be written as J(w) := Remp(w) + λΩ(w), where Remp(w) := (1/m) Σ_{i=1}^m l(x_i, y_i, w). [sent-32, score-0.234]

10 (1) Ω(w) is a smooth convex regularizer such as ½‖w‖², and λ > 0 is a regularization constant. [sent-33, score-0.203]

11 Typically Ω is cheap to compute and to minimize, whereas the empirical risk term Remp(w) is computationally expensive to deal with. [sent-34, score-0.131]
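As a concrete toy instance of the functional J(w) := Remp(w) + λΩ(w), the following sketch evaluates it directly with the binary hinge loss and a quadratic regularizer (the data, the value of `lam`, and all names are invented for illustration, not taken from the paper):

```python
# Sketch of J(w) = Remp(w) + lam * Omega(w) with the binary hinge loss
# l(x, y, w) = max(0, 1 - y * <w, x>) and Omega(w) = 0.5 * ||w||^2.
lam = 0.1  # hypothetical regularization constant

def hinge(x, y, w):
    # Binary hinge loss for a single training pattern (x, y), y in {-1, +1}.
    return max(0.0, 1.0 - y * sum(wi * xi for wi, xi in zip(w, x)))

def J(w, data):
    remp = sum(hinge(x, y, w) for x, y in data) / len(data)  # empirical risk
    omega = 0.5 * sum(wi * wi for wi in w)                   # regularizer
    return remp + lam * omega

data = [([1.0, 0.0], 1), ([-1.0, 0.0], -1), ([0.5, 0.5], 1)]  # toy patterns
# At w = 0 every hinge equals 1 and Omega = 0, so J = 1.0.
print(J([0.0, 0.0], data))
```

The asymmetry the text describes is visible even here: Ω is a single dot product, while Remp loops over every training pattern.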

12 We assume that the empirical risk Remp (w) is nonnegative. [sent-37, score-0.106]

13 In such cases one has to resort to bundle methods [3], which are based on the following elementary observation: for convex functions a first order Taylor approximation is a lower bound. [sent-40, score-0.26]
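This elementary observation is easy to check numerically; a minimal sketch with the hypothetical convex function f(w) = w²:

```python
# For a convex f, the first-order Taylor approximation at any point w0,
# f(w0) + f'(w0) * (w - w0), lower-bounds f everywhere.
def f(w):
    return w * w

def df(w):
    return 2.0 * w

def taylor(w0, w):
    return f(w0) + df(w0) * (w - w0)

w0 = 1.5
grid = [i / 10.0 for i in range(-50, 51)]
# The tangent never exceeds f, and touches it exactly at w0.
assert all(taylor(w0, w) <= f(w) + 1e-12 for w in grid)
```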

14 The idea is to replace Remp[w] by these lower bounds and to optimize the latter in conjunction with Ω(w). [sent-43, score-0.142]

15 Figure 1 (a lower bound on the convex empirical risk Remp(w)) gives geometric intuition. [sent-44, score-0.184]

16 Bundle Methods, Subdifferential and Subgradient: Before we describe the bundle method, it is necessary to clarify a key technical point. [sent-52, score-0.158]

17 The subgradient is a generalization of gradients appropriate for convex functions, including those which are not necessarily smooth. [sent-53, score-0.24]

18 Then a subgradient is the normal vector of any tangential supporting hyperplane of F at w. [sent-55, score-0.109]

19 Formally, µ is called a subgradient of F at w if, and only if, F(w') ≥ F(w) + ⟨w' − w, µ⟩ for all w'. [sent-56, score-0.109]
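The defining inequality can be verified numerically on the classic nonsmooth example F(x) = |x|, whose subdifferential at 0 is the whole interval [−1, 1] (a sketch; the example is standard, not from the paper):

```python
# Every mu in [-1, 1] is a subgradient of F(x) = |x| at x = 0:
# F(x') >= F(0) + mu * (x' - 0) must hold for all x'.
def F(x):
    return abs(x)

points = [i / 7.0 for i in range(-21, 22)]
for mu in [-1.0, -0.5, 0.0, 0.5, 1.0]:
    assert all(F(xp) >= F(0.0) + mu * xp - 1e-12 for xp in points)

# mu = 1.5 is not a subgradient: the inequality fails at x' = 1.
assert not F(1.0) >= F(0.0) + 1.5 * 1.0
```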

20 The Algorithm: Denote by w_t ∈ W the values of w obtained by successive steps of our method. Let a_t ∈ W, b_t ∈ R, and set w_0 = 0, a_0 = 0, b_0 = 0. [sent-61, score-0.396]

21 Then, the Taylor expansion coefficients of Remp[w_t] can be written as a_{t+1} := ∂_w Remp(w_t) and b_{t+1} := Remp(w_t) − ⟨a_{t+1}, w_t⟩. [sent-62, score-0.314]

22 (3) Note that we do not require Remp to be differentiable: if Remp is not differentiable at wt we simply choose any element of the subdifferential as at+1 . [sent-63, score-0.452]

23 Since each Taylor approximation is a lower bound, we may take their maximum to obtain that Remp(w) ≥ max_{t'≤t} ⟨a_{t'}, w⟩ + b_{t'}. [sent-64, score-0.14]

24 Algorithm 1 Bundle Method(ε): Initialize t = 0, w_0 = 0, a_0 = 0, b_0 = 0, and J_0(w) = λΩ(w); repeat: find the minimizer w_t := argmin_w J_t(w); compute gradient a_{t+1} and offset b_{t+1}; until ε_t ≤ ε. [sent-65, score-0.393]

25 Moreover, by virtue of the fact that Remp is a non-negative function, we can write the following lower bounds on Remp and J respectively: R_t(w) := max_{t'≤t} ⟨a_{t'}, w⟩ + b_{t'} and J_t(w) := λΩ(w) + R_t(w). [sent-67, score-0.156]
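A toy sketch of Algorithm 1 in one dimension (all values hypothetical: a single hinge Remp(w) = max(0, 1 − 2w), Ω(w) = ½w², λ = 1). Each iteration adds a cutting plane (a_{t+1}, b_{t+1}) and minimizes the model J_t; the model minimization is done here by brute-force grid search rather than the quadratic program the paper uses:

```python
# Bundle-method sketch: J(w) = Remp(w) + lam * 0.5 * w^2 with one hinge.
lam = 1.0

def remp(w):
    return max(0.0, 1.0 - 2.0 * w)

def remp_subgrad(w):
    # Any element of the subdifferential; at the kink we pick -2.0.
    return -2.0 if 1.0 - 2.0 * w >= 0.0 else 0.0

def J(w):
    return remp(w) + 0.5 * lam * w * w

planes = []  # cutting planes (a_t, b_t) with Remp(w) >= a_t * w + b_t
w = 0.0
grid = [i / 1000.0 for i in range(-2000, 2001)]
for t in range(6):
    a = remp_subgrad(w)
    b = remp(w) - a * w          # offset b_{t+1} = Remp(w_t) - a_{t+1} * w_t
    planes.append((a, b))
    # Minimize the model J_t(w) = max_t (a_t * w + b_t) + 0.5 * lam * w^2.
    w = min(grid, key=lambda u: max(ai * u + bi for ai, bi in planes)
            + 0.5 * lam * u * u)

# The true minimizer of this toy J is w* = 0.5 with J(w*) = 0.125.
print(w, J(w))
```

After two iterations the model is already tight at the kink and the iterate stops moving, illustrating how a few cutting planes can pin down a piecewise-linear risk.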

26 Define w* := argmin_w J(w), γ_t := J_{t+1}(w_t) − J_t(w_t), w_t := argmin_w J_t(w), and ε_t := min_{t'≤t} J_{t'+1}(w_{t'}) − J_t(w_t). [sent-69, score-0.337]

27 Dual Problem: Optimization is often considerably easier in the dual space. [sent-75, score-0.163]

28 In fact, we will show that we need not know Ω(w) at all; instead it is sufficient to work with its Fenchel-Legendre dual Ω*(µ) := sup_w ⟨w, µ⟩ − Ω(w). [sent-76, score-0.163]

29 The dual problem of minimize_w J_t(w) := max_{t'≤t} ⟨a_{t'}, w⟩ + b_{t'} + λΩ(w) (5) is maximize_α J_t*(α) := −λΩ*(−λ^{-1} A α) + α^T b subject to α ≥ 0 and α^T 1 = 1. [sent-89, score-0.245]

30 (6) Furthermore, the optimal w_t and α_t are related by the dual connection w_t = ∂Ω*(−λ^{-1} A α_t). [sent-90, score-0.749]

31 Recall that for Ω(w) = ½‖w‖² the Fenchel-Legendre dual is given by Ω*(µ) = ½‖µ‖². [sent-91, score-0.163]
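This identity is easy to sanity-check numerically in one dimension; in the sketch below the sup is approximated by a grid search (an assumption of the example, not part of the paper):

```python
# Fenchel-Legendre dual of Omega(w) = 0.5 * w^2:
# Omega*(mu) = sup_w (w * mu - Omega(w)), attained at w = mu, so Omega*(mu) = 0.5 * mu^2.
def omega(w):
    return 0.5 * w * w

def omega_star(mu):
    grid = [i / 1000.0 for i in range(-5000, 5001)]
    return max(w * mu - omega(w) for w in grid)

for mu in [-2.0, -0.5, 0.0, 1.0, 3.0]:
    assert abs(omega_star(mu) - 0.5 * mu * mu) < 1e-5
```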

32 For minimize_w max(0, max_{t'≤t} ⟨a_{t'}, w⟩ + b_{t'}) + (λ/2)‖w‖² the dual becomes maximize_α −(1/(2λ)) α^T Q α + α^T b subject to α ≥ 0 and α^T 1 = 1. [sent-98, score-0.279]

33 (7) This means that for quadratic regularization the dual optimization problem is a quadratic program where the number of variables equals the number of gradients computed previously. [sent-99, score-0.46]

34 All that is required to define the QP are the inner products between gradient vectors, ⟨a_u, a_v⟩. [sent-102, score-0.118]

35 Later in this section we propose a variant which does away with the quadratic program altogether while preserving most of the appealing convergence properties of Algorithm 1. [sent-103, score-0.27]

36 Examples: Structured Estimation. Many estimation problems [14, 16] can be written in terms of a piecewise linear loss function l(x, y, w) = max_{y'∈Y} ⟨φ(x, y') − φ(x, y), w⟩ + ∆(y, y') (8) for some suitable joint feature map φ and a loss function ∆(y, y'). [sent-105, score-0.165]
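A minimal sketch of the loss in (8) with an invented three-label problem; the feature map, label set, and ∆ here are all hypothetical choices made only to make the maximization concrete:

```python
# Structured loss l(x, y, w) = max_{y'} <phi(x, y') - phi(x, y), w> + Delta(y, y')
# with Y = {0, 1, 2}, phi(x, y) = x * one_hot(y), and a 0/1 task loss Delta.
Y = [0, 1, 2]

def phi(x, y):
    return [x if i == y else 0.0 for i in Y]

def delta(y, yp):
    return 0.0 if y == yp else 1.0

def loss(x, y, w):
    def score(yp):
        diff = [p - q for p, q in zip(phi(x, yp), phi(x, y))]
        return sum(wi * di for wi, di in zip(w, diff)) + delta(y, yp)
    return max(score(yp) for yp in Y)

# w = [2, 0, 0] separates label 0 with margin >= 1, so the loss vanishes;
# w = 0 separates nothing, so the loss equals the task loss 1.
print(loss(1.0, 0, [2.0, 0.0, 0.0]), loss(1.0, 0, [0.0, 0.0, 0.0]))
```

The inner `max` over y' is exactly the "maximal constraint violator" search that Algorithm 1 performs per training pair.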

37 Since Remp is defined as a summation of loss terms, this allows us to apply Algorithm 1 directly for risk minimization: at every iteration t we find all maximal constraint violators for each (x_i, y_i) pair and compute the composite gradient vector. [sent-108, score-0.264]

38 This vector is then added to the convex program we have so far. [sent-109, score-0.109]

39 Effectively, by defining a joint feature map as the sum over individual feature maps and by defining a joint loss ∆ as the sum over individual losses, SVMPerf performs exactly the same operations as we described above. [sent-111, score-0.134]

40 Hence, for losses of type (8) our algorithm is a direct extension of SVMPerf to structured estimation. [sent-112, score-0.105]

41 Exponential Families One of the advantages of our setting is that it applies to any convex loss function, as long as there is an efficient way of computing the gradient. [sent-113, score-0.157]

42 It also shows that adding a new model becomes a matter of defining a new loss function and its corresponding gradient, rather than having to build a full solver from scratch. [sent-120, score-0.159]

43 Towards this end, we first observe that by strong duality the values of the primal and dual problems (5) and (6) are equal at optimality. [sent-127, score-0.257]

44 Next, we observe that the solution of the dual problem (6) at iteration t, denoted by αt , forms a feasible set of parameters for the dual problem (6) at iteration t + 1 by means of the parameterization (αt , 0), i. [sent-129, score-0.37]

45 To obtain a lower bound on the improvement due to J_{t+1}(w_{t+1}) we perform a line search along ((1−η)α_t, η) in (6). [sent-133, score-0.172]
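For the quadratic regularizer the dual objective is concave and the search over η ∈ [0, 1] is one-dimensional; a sketch on an invented two-plane instance (gradients a = (−2, 0), offsets b = (1, 0), λ = 1, all hypothetical):

```python
# Dual objective D(alpha) = -(1/(2*lam)) * alpha^T Q alpha + alpha^T b,
# searched along the dual-feasible segment alpha(eta) = ((1-eta)*alpha_t, eta).
lam = 1.0
a = [-2.0, 0.0]                 # two previously collected (1-d) gradients
b = [1.0, 0.0]                  # corresponding offsets
Q = [[au * av for av in a] for au in a]   # Q[u][v] = <a_u, a_v>

def D(alpha):
    quad = sum(alpha[u] * Q[u][v] * alpha[v] for u in range(2) for v in range(2))
    return -quad / (2.0 * lam) + sum(alpha[u] * b[u] for u in range(2))

alpha_t = [1.0, 0.0]            # feasible point carried over from iteration t
eta = max((i / 10000.0 for i in range(10001)),
          key=lambda e: D([(1.0 - e) * alpha_t[0], e]))
alpha = [(1.0 - eta) * alpha_t[0], eta]
w = -sum(a[u] * alpha[u] for u in range(2)) / lam   # dual connection for 0.5*||w||^2
print(eta, D(alpha), w)
```

Because every α(η) with η ∈ [0, 1] keeps α ≥ 0 and the entries summing to one, dual feasibility is preserved automatically, as the text notes.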

46 The constraint η ∈ [0, 1] ensures dual feasibility. [sent-134, score-0.163]

47 Note that, in general, solving the dual problem (6) results in an increase which is larger than that obtained via the line search. [sent-136, score-0.22]

48 The line search is employed in the analysis only for analytic tractability. [sent-137, score-0.108]

49 Depending on J(w) we will be able to prove two different convergence results. [sent-139, score-0.135]

50 (a) For regularizers Ω(w) for which ‖∂²_µ Ω*(µ)‖ ≤ H* we first experience a regime of progress linear in γ_t and a subsequent slowdown to improvements which are quadratic in γ_t. [sent-140, score-0.147]

51 (b) Whenever the Hessian of J is bounded, we have linear convergence throughout. [sent-143, score-0.109]

52 We first derive lower bounds on the improvement J_{t+1}(w_{t+1}) − J_t(w_t), and then show that for case (b) the bounds are better. [sent-144, score-0.144]

53 Finally we prove the convergence rates by solving the difference equation in ε_t. [sent-145, score-0.177]

54 This reasoning leads to the following theorem: Theorem 4 Assume that ‖∂_w Remp(w)‖ ≤ G for all w ∈ W, where W is some domain of interest containing all w_{t'} for t' ≤ t. [sent-146, score-0.293]

55 (Case conditions of (11) and (12): γ_t > 4G²H*/λ; 4G²H*/λ ≥ γ_t ≥ H/2; otherwise.) Note that the error keeps on halving initially and settles for a somewhat slower rate of convergence after that, whenever the Hessian of the overall risk is bounded from above. [sent-151, score-0.309]

56 The reason for the difference in the convergence bound for differentiable and non-differentiable losses is that in the former case the gradient of the risk converges to 0 as we approach optimality, whereas in the latter case no such guarantees hold (e. [sent-152, score-0.529]

57 Two facts are worthwhile noting: a) The dual of many regularizers, e. [sent-155, score-0.163]

58 Applying the previous result we obtain a convergence theorem for bundle methods. [sent-163, score-0.297]

59 Under the assumptions of Theorem 4 we can give the following convergence guarantee for Algorithm 1. [sent-165, score-0.109]

60 In addition to that, the convergence is O(log(1/ε)) for continuously differentiable problems. [sent-173, score-0.216]

61 Note that whenever Remp(w) is the average over many piecewise linear functions, Remp(w) behaves essentially like a function with bounded Hessian, as long as we are taking large enough steps not to “notice” the fact that the term is actually nonsmooth. [sent-174, score-0.176]

62 Remark 6: For Ω(w) = ½‖w‖² the dual Hessian is exactly H* = 1. [sent-175, score-0.163]

63 Effectively the rate of convergence of the algorithm is governed by upper bounds on the primal and dual curvature of the objective function. [sent-177, score-0.497]

64 This acts like a condition number of the problem: for Ω(w) = ½ w^T Q w the dual is Ω*(z) = ½ z^T Q^{-1} z, hence the largest eigenvalues of Q and Q^{-1} would have a significant influence on the convergence. [sent-178, score-0.185]

65 In terms of λ, the number of iterations needed for convergence is O(λ^{-1}). [sent-179, score-0.138]

66 This is likely due to the fact that the empirical risk Remp (w) is typically rather smooth and has a certain inherent curvature which acts as a natural regularizer in addition to the regularization afforded by λΩ[w]. [sent-181, score-0.292]

67 A Linesearch Variant: The convergence analysis in Theorem 4 relied on a one-dimensional line search. [sent-182, score-0.166]

68 Algorithm 1, however, uses a more complex quadratic program to solve the problem. [sent-183, score-0.11]

69 Since even the simple updates promise good rates of convergence it is tempting to replace the corresponding step in the bundle update. [sent-184, score-0.309]

70 This can lead to considerable savings in particular for smaller problems, where the time spent in the quadratic programming solver is a substantial fraction of the total runtime. [sent-185, score-0.182]

71 To keep matters simple, we only consider quadratic regularization Ω(w) := ½‖w‖². [sent-186, score-0.141]

72 Hence a line search only needs to determine the first and second derivatives, as done in the proof of Theorem 4. [sent-188, score-0.108]

73 Experiments show that this simplified algorithm has essentially the same convergence properties. [sent-196, score-0.144]

74 For a fair comparison with existing solvers we use the quadratic regularizer Ω(w) := (λ/2)‖w‖² and the binary hinge loss. [sent-199, score-0.217]

75 In our first experiment, we address the rate of convergence and its dependence on the value of λ. [sent-200, score-0.136]

76 In Figure 2 we plot ε_t as a function of iterations for various values of λ, using the QP solver at every iteration to solve the dual problem (6) to optimality. [sent-201, score-0.317]

77 Surprisingly, even though theory predicts sub-linear speed of convergence for non-differentiable losses like the binary hinge loss (see (11)), our solver exhibits linear rates of convergence predicted only for differentiable functions (see (12)). [sent-203, score-0.627]

78 We conjecture that the average over many piecewise linear functions, Remp (w), behaves essentially like a smooth function. [sent-204, score-0.116]

79 As predicted, the convergence speed is inversely proportional to the value of λ. [sent-205, score-0.133]

80 In our second experiment, we compare the convergence speed of two variants of the bundle method, namely, with a QP solver in the inner loop (which essentially boils down to SVMPerf) and the line search variant which we described in Section 5. [sent-214, score-0.566]

81 Figure 3 depicts the evolution of the primal objective function value as a function of both CPU time as well as the number of iterations. [sent-217, score-0.109]

82 Nevertheless, both variants of the bundle method converge to this value even faster (line search is slightly slower than Pegasos on astro-ph, but this is not always the case for many other large datasets we tested on). [sent-222, score-0.237]

83 Note that both line search and Pegasos converge to within 0. [sent-223, score-0.136]

84 7 Related Research Our work is closely related to Shalev-Shwartz and Singer [11] who prove mistake bounds for online algorithms by lower bounding the progress in the dual. [sent-225, score-0.16]

85 Although not stated explicitly, essentially the same technique of lower bounding the dual improvement was used by Tsochantaridis et al. [sent-226, score-0.277]

86 [16] to show polynomial time convergence of the SVMStruct algorithm. [sent-227, score-0.109]

87 [16] only work with a quadratic objective function while the framework proposed by Shalev-Shwartz and Singer [11] can handle arbitrary convex functions. [sent-229, score-0.192]

88 In both cases, a weaker analysis led to O(1/ε²) rates of convergence for nonsmooth loss functions. [sent-230, score-0.258]

89 On the other hand, our results establish an O(1/ε) rate for nonsmooth loss functions and O(log(1/ε)) rates for smooth loss functions under mild technical assumptions. [sent-231, score-0.26]

90 But, as our experiments show, performing an exact line search in the dual leads to a faster decrease in the value of primal objective. [sent-237, score-0.345]

91 This, however, only applies whenever the empirical risk decomposes into individual loss terms (e. [sent-239, score-0.239]

92 The third related strand of research considers gradient descent in the primal with a line search to choose the optimal step size, see e. [sent-242, score-0.28]

93 Under assumptions of smoothness and strong convexity – that is, the objective function can be upper and lower bounded by quadratic functions – it can be shown that gradient descent with line search will converge to an accuracy of ε in O(log(1/ε)) steps. [sent-247, score-0.407]
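A sketch of this classical fact on a strongly convex quadratic; the function, curvatures, and starting point below are invented for illustration:

```python
# Steepest descent with exact line search on f(w) = 0.5 * (w1^2 + 10 * w2^2):
# the suboptimality shrinks by a constant factor per step, so reaching
# accuracy eps takes O(log(1/eps)) iterations.
H = [1.0, 10.0]   # diagonal Hessian; condition number kappa = 10

def f(w):
    return 0.5 * sum(h * wi * wi for h, wi in zip(H, w))

w, steps, eps = [10.0, 1.0], 0, 1e-9
while f(w) > eps and steps < 500:
    g = [h * wi for h, wi in zip(H, w)]
    # Exact line search for a quadratic: step size s = g^T g / g^T H g.
    s = sum(gi * gi for gi in g) / sum(h * gi * gi for h, gi in zip(H, g))
    w = [wi - s * gi for wi, gi in zip(w, g)]
    steps += 1

print(steps, f(w))
```

Here the line search is available in closed form; the point made in the surrounding text is that for regularized risk minimization evaluating J (and hence such a primal line search) is itself expensive, while the dual line search is cheap.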

94 The problem here is the line search in the primal, since evaluating the regularized risk functional might be as expensive as computing its gradient, thus rendering a line search in the primal unattractive. [sent-248, score-0.455]

95 On the other hand, the dual objective is relatively simple to evaluate, thus making the line search in the dual, as performed by our algorithm, computationally feasible. [sent-249, score-0.306]

96 Finally, we would like to point out connections to subgradient methods [7]. [sent-250, score-0.109]

97 These algorithms are designed for nonsmooth functions, and essentially choose an arbitrary element of the subgradient set to perform a gradient descent like update. [sent-251, score-0.293]

98 By applying the analysis of Nedich and Bertsekas [7] to the regularized risk minimization problem with Ω(w) := (λ/2)‖w‖², Ratliff et al. [sent-253, score-0.23]

99 [9] showed that subgradient descent with a fixed, but sufficiently small, stepsize will converge linearly to B(w*, G/λ). [sent-254, score-0.126]

100 A scalable modular convex solver for regularized risk minimization. [sent-348, score-0.346]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('remp', 0.596), ('jt', 0.426), ('wt', 0.293), ('svmperf', 0.204), ('pegasos', 0.186), ('dual', 0.163), ('bundle', 0.158), ('convergence', 0.109), ('subgradient', 0.109), ('risk', 0.106), ('solver', 0.103), ('differentiable', 0.085), ('bt', 0.082), ('quadratic', 0.079), ('convex', 0.078), ('losses', 0.078), ('tsochantaridis', 0.075), ('subdifferential', 0.074), ('primal', 0.074), ('regularizer', 0.065), ('taylor', 0.065), ('rt', 0.065), ('regularized', 0.059), ('line', 0.057), ('loss', 0.056), ('gradient', 0.056), ('gradients', 0.053), ('solvers', 0.052), ('nonsmooth', 0.051), ('search', 0.051), ('qp', 0.05), ('bounds', 0.05), ('hessian', 0.049), ('sub', 0.045), ('nedich', 0.043), ('descent', 0.042), ('rates', 0.042), ('enjoys', 0.041), ('curvature', 0.039), ('joachims', 0.039), ('progress', 0.036), ('smola', 0.035), ('essentially', 0.035), ('et', 0.035), ('bounded', 0.035), ('objective', 0.035), ('maxt', 0.034), ('vishwanathan', 0.034), ('singer', 0.033), ('piecewise', 0.032), ('whenever', 0.032), ('av', 0.032), ('ratliff', 0.032), ('regularizers', 0.032), ('regularization', 0.032), ('precision', 0.031), ('program', 0.031), ('minimization', 0.03), ('theorem', 0.03), ('matters', 0.03), ('au', 0.03), ('variant', 0.029), ('iterations', 0.029), ('converge', 0.028), ('xi', 0.028), ('smooth', 0.028), ('structured', 0.027), ('rate', 0.027), ('prove', 0.026), ('former', 0.026), ('cheap', 0.025), ('scores', 0.025), ('lower', 0.024), ('speed', 0.024), ('online', 0.024), ('yi', 0.024), ('conjunction', 0.024), ('optimality', 0.024), ('applies', 0.023), ('ball', 0.023), ('kdd', 0.023), ('converges', 0.023), ('optimization', 0.023), ('away', 0.022), ('decomposes', 0.022), ('offset', 0.022), ('iteration', 0.022), ('continuously', 0.022), ('argmin', 0.022), ('minimizer', 0.022), ('acts', 0.022), ('written', 0.021), ('furthermore', 0.021), ('steps', 0.021), ('hinge', 0.021), ('behaves', 0.021), ('jmlr', 0.021), ('improvement', 0.02), ('duality', 0.02), ('bound', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 40 nips-2007-Bundle Methods for Machine Learning

Author: Quoc V. Le, Alex J. Smola, S.v.n. Vishwanathan

Abstract: We present a globally convergent method for regularized risk minimization problems. Our method applies to Support Vector estimation, regression, Gaussian Processes, and any other regularized risk minimization setting which leads to a convex optimization problem. SVMPerf can be shown to be a special case of our approach. In addition to the unified framework we present tight convergence bounds, which show that our algorithm converges in O(1/ε) steps to precision ε for general convex problems and in O(log(1/ε)) steps for continuously differentiable problems. We demonstrate in experiments the performance of our approach.

2 0.19081676 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators

Author: Kristiaan Pelckmans, Johan Suykens, Bart D. Moor

Abstract: This paper1 explores the use of a Maximal Average Margin (MAM) optimality principle for the design of learning algorithms. It is shown that the application of this risk minimization principle results in a class of (computationally) simple learning machines similar to the classical Parzen window classifier. A direct relation with the Rademacher complexities is established, as such facilitating analysis and providing a notion of certainty of prediction. This analysis is related to Support Vector Machines by means of a margin transformation. The power of the MAM principle is illustrated further by application to ordinal regression tasks, resulting in an O(n) algorithm able to process large datasets in reasonable time. 1

3 0.12284797 185 nips-2007-Stable Dual Dynamic Programming

Author: Tao Wang, Michael Bowling, Dale Schuurmans, Daniel J. Lizotte

Abstract: Recently, we have introduced a novel approach to dynamic programming and reinforcement learning that is based on maintaining explicit representations of stationary distributions instead of value functions. In this paper, we investigate the convergence properties of these dual algorithms both theoretically and empirically, and show how they can be scaled up by incorporating function approximation. 1

4 0.11448964 102 nips-2007-Incremental Natural Actor-Critic Algorithms

Author: Shalabh Bhatnagar, Mohammad Ghavamzadeh, Mark Lee, Richard S. Sutton

Abstract: We present four new reinforcement learning algorithms based on actor-critic and natural-gradient ideas, and provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients, and they extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms. 1

5 0.11252587 62 nips-2007-Convex Learning with Invariances

Author: Choon H. Teo, Amir Globerson, Sam T. Roweis, Alex J. Smola

Abstract: Incorporating invariances into a learning algorithm is a common problem in machine learning. We provide a convex formulation which can deal with arbitrary loss functions and arbitrary losses. In addition, it is a drop-in replacement for most optimization algorithms for kernels, including solvers of the SVMStruct family. The advantage of our setting is that it relies on column generation instead of modifying the underlying optimization problem directly. 1

6 0.09218163 12 nips-2007-A Spectral Regularization Framework for Multi-Task Structure Learning

7 0.088465847 203 nips-2007-The rat as particle filter

8 0.087979145 69 nips-2007-Discriminative Batch Mode Active Learning

9 0.077027448 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning

10 0.07404153 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels

11 0.071601465 166 nips-2007-Regularized Boost for Semi-Supervised Learning

12 0.066077143 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models

13 0.062297191 200 nips-2007-The Tradeoffs of Large Scale Learning

14 0.061322477 92 nips-2007-Fixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations

15 0.06056425 159 nips-2007-Progressive mixture rules are deviation suboptimal

16 0.056742795 186 nips-2007-Statistical Analysis of Semi-Supervised Regression

17 0.056582779 6 nips-2007-A General Boosting Method and its Application to Learning Ranking Functions for Web Search

18 0.054854725 41 nips-2007-COFI RANK - Maximum Margin Matrix Factorization for Collaborative Ranking

19 0.054443505 21 nips-2007-Adaptive Online Gradient Descent

20 0.054145515 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.175), (1, -0.06), (2, -0.088), (3, 0.11), (4, -0.024), (5, -0.006), (6, 0.069), (7, -0.037), (8, -0.025), (9, 0.06), (10, 0.132), (11, -0.087), (12, 0.033), (13, 0.01), (14, 0.029), (15, -0.014), (16, 0.009), (17, 0.084), (18, -0.002), (19, 0.061), (20, -0.063), (21, -0.126), (22, 0.04), (23, -0.045), (24, 0.009), (25, -0.018), (26, -0.254), (27, 0.024), (28, 0.045), (29, -0.106), (30, -0.057), (31, -0.097), (32, -0.193), (33, -0.082), (34, -0.048), (35, 0.016), (36, 0.077), (37, -0.068), (38, 0.165), (39, -0.011), (40, 0.014), (41, 0.074), (42, 0.072), (43, 0.096), (44, 0.103), (45, -0.04), (46, 0.117), (47, -0.091), (48, 0.025), (49, 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94657511 40 nips-2007-Bundle Methods for Machine Learning

Author: Quoc V. Le, Alex J. Smola, S.v.n. Vishwanathan

Abstract: We present a globally convergent method for regularized risk minimization problems. Our method applies to Support Vector estimation, regression, Gaussian Processes, and any other regularized risk minimization setting which leads to a convex optimization problem. SVMPerf can be shown to be a special case of our approach. In addition to the unified framework we present tight convergence bounds, which show that our algorithm converges in O(1/ε) steps to precision ε for general convex problems and in O(log(1/ε)) steps for continuously differentiable problems. We demonstrate in experiments the performance of our approach.

2 0.69494468 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators

Author: Kristiaan Pelckmans, Johan Suykens, Bart D. Moor

Abstract: This paper1 explores the use of a Maximal Average Margin (MAM) optimality principle for the design of learning algorithms. It is shown that the application of this risk minimization principle results in a class of (computationally) simple learning machines similar to the classical Parzen window classifier. A direct relation with the Rademacher complexities is established, as such facilitating analysis and providing a notion of certainty of prediction. This analysis is related to Support Vector Machines by means of a margin transformation. The power of the MAM principle is illustrated further by application to ordinal regression tasks, resulting in an O(n) algorithm able to process large datasets in reasonable time. 1

3 0.52339476 12 nips-2007-A Spectral Regularization Framework for Multi-Task Structure Learning

Author: Andreas Argyriou, Massimiliano Pontil, Yiming Ying, Charles A. Micchelli

Abstract: Learning the common structure shared by a set of supervised tasks is an important practical and theoretical problem. Knowledge of this structure may lead to better generalization performance on the tasks and may also facilitate learning new tasks. We propose a framework for solving this problem, which is based on regularization with spectral functions of matrices. This class of regularization problems exhibits appealing computational properties and can be optimized efficiently by an alternating minimization algorithm. In addition, we provide a necessary and sufficient condition for convexity of the regularizer. We analyze concrete examples of the framework, which are equivalent to regularization with Lp matrix norms. Experiments on two real data sets indicate that the algorithm scales well with the number of tasks and improves on state of the art statistical performance. 1

4 0.51282936 62 nips-2007-Convex Learning with Invariances

Author: Choon H. Teo, Amir Globerson, Sam T. Roweis, Alex J. Smola

Abstract: Incorporating invariances into a learning algorithm is a common problem in machine learning. We provide a convex formulation which can deal with arbitrary loss functions and arbitrary losses. In addition, it is a drop-in replacement for most optimization algorithms for kernels, including solvers of the SVMStruct family. The advantage of our setting is that it relies on column generation instead of modifying the underlying optimization problem directly. 1

5 0.50250882 185 nips-2007-Stable Dual Dynamic Programming

Author: Tao Wang, Michael Bowling, Dale Schuurmans, Daniel J. Lizotte

Abstract: Recently, we have introduced a novel approach to dynamic programming and reinforcement learning that is based on maintaining explicit representations of stationary distributions instead of value functions. In this paper, we investigate the convergence properties of these dual algorithms both theoretically and empirically, and show how they can be scaled up by incorporating function approximation. 1

6 0.45986253 200 nips-2007-The Tradeoffs of Large Scale Learning

7 0.45361662 159 nips-2007-Progressive mixture rules are deviation suboptimal

8 0.44203752 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning

9 0.41213727 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models

10 0.40502307 152 nips-2007-Parallelizing Support Vector Machines on Distributed Computers

11 0.40376228 206 nips-2007-Topmoumoute Online Natural Gradient Algorithm

12 0.38365221 38 nips-2007-Boosting Algorithms for Maximizing the Soft Margin

13 0.37967199 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine

14 0.3794964 87 nips-2007-Fast Variational Inference for Large-scale Internet Diagnosis

15 0.3785159 82 nips-2007-Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization

16 0.36977294 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels

17 0.36710337 24 nips-2007-An Analysis of Inference with the Universum

18 0.36478433 41 nips-2007-COFI RANK - Maximum Margin Matrix Factorization for Collaborative Ranking

19 0.36185265 203 nips-2007-The rat as particle filter

20 0.34946153 102 nips-2007-Incremental Natural Actor-Critic Algorithms


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.024), (13, 0.103), (16, 0.043), (19, 0.023), (21, 0.049), (31, 0.014), (34, 0.014), (35, 0.039), (47, 0.079), (49, 0.057), (54, 0.206), (83, 0.164), (85, 0.027), (90, 0.064)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.81358701 141 nips-2007-New Outer Bounds on the Marginal Polytope

Author: David Sontag, Tommi S. Jaakkola

Abstract: We give a new class of outer bounds on the marginal polytope, and propose a cutting-plane algorithm for efficiently optimizing over these constraints. When combined with a concave upper bound on the entropy, this gives a new variational inference algorithm for probabilistic inference in discrete Markov Random Fields (MRFs). Valid constraints on the marginal polytope are derived through a series of projections onto the cut polytope. As a result, we obtain tighter upper bounds on the log-partition function. We also show empirically that the approximations of the marginals are significantly more accurate when using the tighter outer bounds. Finally, we demonstrate the advantage of the new constraints for finding the MAP assignment in protein structure prediction. 1

same-paper 2 0.81063402 40 nips-2007-Bundle Methods for Machine Learning

Author: Quoc V. Le, Alex J. Smola, S.v.n. Vishwanathan

Abstract: We present a globally convergent method for regularized risk minimization problems. Our method applies to Support Vector estimation, regression, Gaussian Processes, and any other regularized risk minimization setting which leads to a convex optimization problem. SVMPerf can be shown to be a special case of our approach. In addition to the unified framework we present tight convergence bounds, which show that our algorithm converges in O(1/ε) steps to precision ε for general convex problems and in O(log(1/ε)) steps for continuously differentiable problems. We demonstrate in experiments the performance of our approach.
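The abstract above describes the bundle (cutting-plane) scheme only at a high level. A minimal one-dimensional sketch of the idea, assuming toy hinge-loss data and a crude grid search for the regularized subproblem (both my own simplifications, not the paper's actual solver), looks like:

```python
# Bundle-method sketch for regularized risk minimization (1-D toy).
# Objective: J(w) = (lam/2) w^2 + R(w), with R(w) a sum of hinge losses.
# Each iteration adds the linearization R(w_t) + g_t (w - w_t) as a cut
# and minimizes the regularized piecewise-linear lower bound.

lam = 1.0
data = [(1.0, 1), (2.0, 1), (-1.5, -1)]  # hypothetical (x, y) pairs

def risk(w):
    # Empirical hinge-loss risk R(w).
    return sum(max(0.0, 1.0 - y * w * x) for x, y in data)

def subgradient(w):
    # A subgradient of R at w (active hinges contribute -y*x).
    return sum(-y * x for x, y in data if 1.0 - y * w * x > 0.0)

def argmin_bundle(cuts):
    # Crude subproblem solve: grid search over the regularized lower bound.
    best_w, best_val = 0.0, float("inf")
    for i in range(-400, 401):
        u = i / 100.0
        val = 0.5 * lam * u * u + max(a * u + b for a, b in cuts)
        if val < best_val:
            best_w, best_val = u, val
    return best_w, best_val

w, cuts, best_upper = 0.0, [], float("inf")
for t in range(30):
    r, g = risk(w), subgradient(w)
    cuts.append((g, r - g * w))                  # cut: a*u + b <= R(u)
    best_upper = min(best_upper, 0.5 * lam * w * w + r)
    w, lower = argmin_bundle(cuts)               # minimize the lower bound
    gap = best_upper - lower                     # duality-style gap
    if gap < 1e-6:
        break
```

On this toy problem the gap closes after four iterations at w = 1; the paper's bounds say the gap shrinks in O(1/ε) iterations for general convex losses and O(log(1/ε)) for continuously differentiable ones.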

3 0.80782318 2 nips-2007-A Bayesian LDA-based model for semi-supervised part-of-speech tagging

Author: Kristina Toutanova, Mark Johnson

Abstract: We present a novel Bayesian model for semi-supervised part-of-speech tagging. Our model extends the Latent Dirichlet Allocation model and incorporates the intuition that words’ distributions over tags, p(t|w), are sparse. In addition we introduce a model for determining the set of possible tags of a word which captures important dependencies in the ambiguity classes of words. Our model outperforms the best previously proposed model for this task on a standard dataset.

4 0.71257657 102 nips-2007-Incremental Natural Actor-Critic Algorithms

Author: Shalabh Bhatnagar, Mohammad Ghavamzadeh, Mark Lee, Richard S. Sutton

Abstract: We present four new reinforcement learning algorithms based on actor-critic and natural-gradient ideas, and provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients, and they extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms.
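As a companion to the abstract above, here is a drastically simplified actor-critic sketch on a two-armed bandit (a toy setup of my own; the paper's algorithms add natural gradients, eligibility traces, and function approximation on top of this basic critic-baseline pattern):

```python
import math
import random

random.seed(0)

# Two-armed bandit: arm 1 pays 1.0, arm 0 pays 0.2 (made-up rewards).
def pull(arm):
    return 1.0 if arm == 1 else 0.2

theta = [0.0, 0.0]   # actor: softmax action preferences
v = 0.0              # critic: running value estimate, used as baseline
alpha, beta = 0.1, 0.1

def policy():
    # Softmax over the actor's preferences.
    z = [math.exp(p) for p in theta]
    s = sum(z)
    return [p / s for p in z]

for _ in range(2000):
    probs = policy()
    arm = 0 if random.random() < probs[0] else 1
    r = pull(arm)
    delta = r - v            # TD-style error against the critic's estimate
    v += beta * delta        # critic update
    for a in range(2):       # actor: policy-gradient step scaled by delta
        grad = (1.0 if a == arm else 0.0) - probs[a]
        theta[a] += alpha * delta * grad
```

After training, the softmax policy concentrates on the higher-paying arm; the critic's baseline is what keeps the variance of the gradient estimates down, as the abstract notes.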

5 0.70844281 84 nips-2007-Expectation Maximization and Posterior Constraints

Author: Kuzman Ganchev, Ben Taskar, João Gama

Abstract: The expectation maximization (EM) algorithm is a widely used maximum likelihood estimation procedure for statistical models when the values of some of the variables in the model are not observed. Very often, however, our aim is primarily to find a model that assigns values to the latent variables that have intended meaning for our data and maximizing expected likelihood only sometimes accomplishes this. Unfortunately, it is typically difficult to add even simple a-priori information about latent variables in graphical models without making the models overly complex or intractable. In this paper, we present an efficient, principled way to inject rich constraints on the posteriors of latent variables into the EM algorithm. Our method can be used to learn tractable graphical models that satisfy additional, otherwise intractable constraints. Focusing on clustering and the alignment problem for statistical machine translation, we show that simple, intuitive posterior constraints can greatly improve the performance over standard baselines and be competitive with more complex, intractable models.
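For readers unfamiliar with the EM baseline this abstract builds on, a plain EM run on a two-coin mixture (made-up head counts, no posterior constraints) can be sketched as:

```python
# Two-coin mixture EM (toy). Each trial is 10 tosses of one of two
# coins with unknown heads probabilities pa, pb; we observe only the
# number of heads per trial, not which coin was used.
trials = [9, 8, 2, 1, 9, 2]  # made-up heads counts
n = 10

def em_step(pa, pb):
    # E-step: posterior responsibility of coin A for each trial,
    # then M-step: re-estimate each coin's heads probability.
    ha = hb = ta = tb = 0.0
    for h in trials:
        la = pa**h * (1 - pa)**(n - h)
        lb = pb**h * (1 - pb)**(n - h)
        ra = la / (la + lb)          # P(coin A | trial)
        ha += ra * h
        ta += ra * n
        hb += (1 - ra) * h
        tb += (1 - ra) * n
    return ha / ta, hb / tb

pa, pb = 0.6, 0.4                    # asymmetric init to break symmetry
for _ in range(50):
    pa, pb = em_step(pa, pb)
```

The paper's method would additionally project the E-step responsibilities `ra` onto a constraint set before the M-step, which is how it injects prior knowledge about the latent variables.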

6 0.70457339 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons

7 0.70428067 63 nips-2007-Convex Relaxations of Latent Variable Training

8 0.70197171 185 nips-2007-Stable Dual Dynamic Programming

9 0.70154583 24 nips-2007-An Analysis of Inference with the Universum

10 0.69934285 62 nips-2007-Convex Learning with Invariances

11 0.69788647 134 nips-2007-Multi-Task Learning via Conic Programming

12 0.69627619 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine

13 0.69384474 22 nips-2007-Agreement-Based Learning

14 0.69257617 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models

15 0.69254839 200 nips-2007-The Tradeoffs of Large Scale Learning

16 0.69225079 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels

17 0.69156128 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering

18 0.69144475 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

19 0.68779087 7 nips-2007-A Kernel Statistical Test of Independence

20 0.68398875 21 nips-2007-Adaptive Online Gradient Descent