nips nips2010 nips2010-163 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Xinhua Zhang, Ankan Saha, S.v.n. Vishwanathan
Abstract: In a recent paper Joachims [1] presented SVM-Perf, a cutting plane method (CPM) for training linear Support Vector Machines (SVMs) which converges to an ε-accurate solution in O(1/ε²) iterations. By tightening the analysis, Teo et al. [2] showed that O(1/ε) iterations suffice. Given the impressive convergence speed of CPM on a number of practical problems, it was conjectured that these rates could be further improved. In this paper we disprove this conjecture. We present counter examples which are not only applicable for training linear SVMs with hinge loss, but also hold for support vector methods which optimize a multivariate performance score. However, surprisingly, these problems are not inherently hard. By exploiting the structure of the objective function we can devise an algorithm that converges in O(1/√ε) iterations.
Reference: text
sentIndex sentText sentNum sentScore
1 In a recent paper Joachims [1] presented SVM-Perf, a cutting plane method (CPM) for training linear Support Vector Machines (SVMs) which converges to an ε-accurate solution in O(1/ε²) iterations. [sent-13, score-0.158]
2 Given the impressive convergence speed of CPM on a number of practical problems, it was conjectured that these rates could be further improved. [sent-16, score-0.102]
3 We present counter examples which are not only applicable for training linear SVMs with hinge loss, but also hold for support vector methods which optimize a multivariate performance score. [sent-18, score-0.103]
4 At the heart of SVMs is the following regularized risk minimization problem: $\min_w J(w) := \frac{\lambda}{2}\|w\|^2 + R_{emp}(w)$ with the empirical risk $R_{emp}(w) := \frac{1}{n}\sum_{i=1}^{n}\max(0, 1 - y_i\langle w, x_i\rangle)$, (1) where the first term is the regularizer. [sent-23, score-0.315]
5 Here we assume access to a training set of n labeled examples $\{(x_i, y_i)\}_{i=1}^{n}$ where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$, and use the squared Euclidean norm $\|w\|^2 = \sum_i w_i^2$ as the regularizer. [sent-24, score-0.458]
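To make the objective in (1) concrete, here is a minimal numerical sketch; the toy data, the λ value, and the function name are made up for illustration:

```python
import numpy as np

def regularized_risk(w, X, y, lam):
    """J(w) = (lam/2)*||w||^2 + (1/n) * sum_i max(0, 1 - y_i <w, x_i>)."""
    margins = y * (X @ w)                         # y_i <w, x_i> for each example
    emp_risk = np.maximum(0.0, 1.0 - margins).mean()
    return 0.5 * lam * np.dot(w, w) + emp_risk

# Hypothetical toy data: 4 points in R^2, labels in {-1, +1}.
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(regularized_risk(np.zeros(2), X, y, lam=0.1))  # at w = 0 every hinge term is 1, so J = 1.0
```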
6 There has been significant research devoted to developing specialized optimizers which minimize J(w) efficiently. [sent-26, score-0.072]
7 In an award winning paper, Joachims [1] presented a cutting plane method (CPM)1 , SVM-Perf, which was shown to converge to an ε-accurate solution of (1) in O(1/ε²) iterations, with each iteration requiring O(nd) effort. [sent-27, score-0.25]
8 Given the correct labels $y = (y_1, \ldots, y_n)$ and a candidate labeling $\bar{y}$ (to be concretized later), the $R_{emp}$ for the multivariate measure is formulated by [3] as follows. (Footnote 1: In this paper we use the term cutting plane methods to denote specialized solvers employed in machine learning.) [sent-35, score-0.234]
9 While clearly related, they must not be confused with cutting plane methods used in optimization. [sent-36, score-0.119]
$R_{emp}(w) = \max_{\bar{y}\in\{-1,1\}^n}\left[\Delta(\bar{y}, y) + \frac{1}{n}\sum_{i=1}^{n}\langle w, x_i\rangle(\bar{y}_i - y_i)\right]$. (2) [sent-37, score-0.188]
In another award winning paper by Joachims [3], the regularized risk minimization problems corresponding to these measures are optimized by using a CPM. [sent-38, score-0.127]
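For tiny n, the max over labelings in (2) can be evaluated by brute force; this sketch uses a normalized Hamming distance for ∆ purely as an illustrative stand-in (the paper's multivariate scores, such as F₁, come from contingency tables instead):

```python
import itertools
import numpy as np

def multivariate_remp(w, X, y):
    """Remp(w) = max over ybar of Delta(ybar, y) + (1/n) sum_i <w, x_i>(ybar_i - y_i).
    Delta here is the normalized Hamming distance -- an illustrative choice only."""
    n = len(y)
    scores = X @ w                                  # <w, x_i> for each example
    best = -np.inf
    for ybar in itertools.product([-1, 1], repeat=n):
        ybar = np.array(ybar)
        delta = np.mean(ybar != y)                  # stand-in for the true Delta
        best = max(best, delta + np.dot(scores, ybar - y) / n)
    return best

X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
y = np.array([1, -1, -1])
r = multivariate_remp(np.zeros(2), X, y)   # at w = 0 only Delta matters, so the max is 1.0
```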
12 Given the widespread use of CPM in machine learning, it is important to understand their convergence guarantees in terms of the upper and lower bounds on the number of iterations needed to converge to an ε-accurate solution. [sent-39, score-0.376]
13 The tightest upper bound on the convergence speed of CPM, O(1/ε), is due to Teo et al. [sent-40, score-0.188]
14 Therefore, it had been conjectured that the upper bounds might be further tightened via a more refined analysis. [sent-43, score-0.225]
15 In this paper we construct counter examples for both decomposable $R_{emp}$ as in equation (1) and non-decomposable $R_{emp}$ as in equation (2), on which CPM requires Ω(1/ε) iterations to converge, thus disproving this conjecture. [sent-44, score-0.092]
16 Our results lead to the following natural question: Do the lower bounds hold because regularized risk minimization problems are fundamentally hard, or is it an inherent limitation of CPM? [sent-48, score-0.269]
17 To understand our contribution one needs to understand the two standard assumptions that are made when proving convergence rates: • A1: The data points $x_i$ lie inside an $L_2$ (Euclidean) ball of radius R, that is, $\|x_i\| \le R$. [sent-51, score-0.152]
18 Finding a fast optimizer under assumption A2 remains an open problem. [sent-57, score-0.14]
19 Lower-case bold letters (e.g., w, µ) denote vectors, $w_i$ denotes the i-th component of w, 0 refers to the vector with all zero components, $e_i$ is the i-th coordinate vector (all 0's except 1 at the i-th coordinate), and $\Delta_k$ refers to the k-dimensional simplex. [sent-60, score-0.142]
20 Unless specified otherwise, $\langle\cdot, \cdot\rangle$ denotes the Euclidean dot product $\langle x, w\rangle = \sum_i x_i w_i$, and $\|\cdot\|$ refers to the Euclidean norm $\|w\| := \langle w, w\rangle^{1/2}$. [sent-61, score-0.134]
21 Two types of lower bounds are subsequently defined in Section 3, and Section 4 contains descriptions of various counter examples that we construct. [sent-68, score-0.222]
22 In Section 5 we describe an algorithm which provably converges to an ε-accurate solution of (1) in O(1/√ε) iterations under assumption A1. [sent-69, score-0.108]
2 BMRM. At every iteration, BMRM replaces $R_{emp}$ by a piecewise linear lower bound $R_k^{cp}$ and optimizes [2] $\min_w J_k(w) := \frac{\lambda}{2}\|w\|^2 + R_k^{cp}(w)$, where $R_k^{cp}(w) := \max_{1\le i\le k}\langle w, a_i\rangle + b_i$, (3) to obtain the next iterate $w_k$. [sent-72, score-0.761]
Here $a_i \in \partial R_{emp}(w_{i-1})$ denotes an arbitrary subgradient of $R_{emp}$ at $w_{i-1}$ and $b_i = R_{emp}(w_{i-1}) - \langle w_{i-1}, a_i\rangle$. [sent-73, score-0.257]
The piecewise linear lower bound is successively tightened until the gap $\epsilon_k := \min_{0\le t\le k} J(w_t) - J_k(w_k)$ (4) falls below a predefined tolerance $\epsilon$. [sent-74, score-0.188]
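The loop behind (3)-(4) can be sketched in one dimension; the grid-search inner solver and all constants here are illustrative stand-ins for the dual QP that real implementations solve:

```python
import numpy as np

def bmrm_1d(remp, remp_grad, lam, eps, w0=0.0, grid=np.linspace(-10, 10, 20001)):
    """Sketch of BMRM: build piecewise-linear lower bounds of Remp and stop
    when the gap eps_k = min_t J(w_t) - J_k(w_k) drops below eps."""
    J = lambda u: 0.5 * lam * u * u + remp(u)
    cuts = []                          # list of (a_i, b_i): Remp(w) >= a_i*w + b_i
    w, best_J = w0, J(w0)
    for k in range(1, 1000):
        a = remp_grad(w)               # a_k: a subgradient of Remp at w_{k-1}
        b = remp(w) - a * w            # intercept b_k = Remp(w) - <w, a_k>
        cuts.append((a, b))
        Jk = 0.5 * lam * grid**2 + np.max([a * grid + b for a, b in cuts], axis=0)
        w = grid[np.argmin(Jk)]        # w_k = argmin of the model J_k (grid search)
        best_J = min(best_J, J(w))
        if best_J - Jk.min() <= eps:   # the gap criterion (4)
            return w, k
    return w, k

# Hinge loss of a single example (x, y) = (1, 1): Remp(w) = max(0, 1 - w)
remp = lambda u: max(0.0, 1.0 - u)
grad = lambda u: -1.0 if u < 1.0 else 0.0
w_opt, iters = bmrm_1d(remp, grad, lam=0.1, eps=1e-3)
```

On this toy hinge loss the piecewise-linear model reproduces the loss exactly after two cuts, so the loop stops very quickly near the true minimizer w = 1.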
Algorithm 1 (excerpt): Require previous subgradients $\{a_i\}_{i=1}^{k}$ and intercepts $\{b_i\}_{i=1}^{k}$; step 2: $\alpha_k \leftarrow \arg\max_\alpha -\frac{1}{2\lambda}\|A_k\alpha\|^2 + \langle\alpha, b_k\rangle$. [sent-87, score-0.183]
27 Note that Ak and bk in (5) are defined in Algorithm 1. [sent-102, score-0.089]
28 Note that at iteration k the dual Dk (α) is a QP with k variables. [sent-105, score-0.078]
Then for any $\epsilon < 4G^2/\lambda$, both ls-bmrm and qp-bmrm converge to an ε-accurate solution of (1) as measured by (4) after at most $\log_2\frac{\lambda J(0)}{G^2} + \frac{8G^2}{\lambda\epsilon} - 1$ steps. [sent-112, score-0.076]
3 Upper and Lower Bounds. Since most rates of convergence discussed in the machine learning community are upper bounds, it is important to rigorously define the meaning of a lower bound with respect to ε, and to study its relationship with the upper bounds. [sent-118, score-0.247]
31 Instead of minimizing the objective function J(w) defined in (1), if we minimize a scaled version cJ(w) this scales the approximation gap (4) by c. [sent-120, score-0.087]
Define $T(\epsilon; f, A)$ as the first step index k at which $w_k$ becomes an ε-accurate solution: $T(\epsilon; f, A) = \min\{k : f(w_k) - \min_w f(w) \le \epsilon\}$. (6) [sent-123, score-0.567]
Upper and lower bounds are both properties of a pair (F, A). [sent-124, score-0.171]
A function $g(\epsilon)$ is called an upper bound of (F, A) if for all functions $f \in F$ and all $\epsilon > 0$, it takes at most order $g(\epsilon)$ steps for A to reduce the gap to less than $\epsilon$. (7) [sent-125, score-0.139]
On the other hand, lower bounds can be defined in two different ways depending on how the above two universal quantifiers are flipped to existential quantifiers. [sent-128, score-0.171]
Table 1: Summary of the known upper bounds and our lower bounds. [sent-132, score-0.366]
              ls-bmrm   qp-bmrm   Nesterov
Assuming A1:
  UB          O(1/ε)    O(1/ε)    O(1/√ε)
  SLB         Ω(1/ε)    open      Ω(1/√ε)
  WLB         Ω(1/ε)    open      Ω(1/√ε)
Assuming A2:
  UB          O(1/ε)    O(1/ε)    n/a
  SLB         Ω(1/ε)    open      n/a
  WLB         Ω(1/ε)    Ω(1/ε)    n/a
• Strong lower bound (SLB): $h(\epsilon)$ is called a SLB of (F, A) if there exists a function $\tilde{f} \in F$ such that for all $\epsilon > 0$ it takes at least $h(\epsilon)$ steps for A to find an ε-accurate solution of $\tilde{f}$: (SLB) $\exists \tilde{f} \in F$, s.t. $\forall \epsilon > 0$, $T(\epsilon; \tilde{f}, A) \ge h(\epsilon)$. (8) [sent-136, score-0.244]
• Weak lower bound (WLB): $h(\epsilon)$ is called a WLB of (F, A) if for any $\epsilon > 0$ there exists a function $f_\epsilon \in F$ depending on $\epsilon$, such that it takes at least $h(\epsilon)$ steps for A to find an ε-accurate solution of $f_\epsilon$: (WLB) $\forall \epsilon > 0$, $\exists f_\epsilon \in F$, s.t. $T(\epsilon; f_\epsilon, A) \ge h(\epsilon)$. (9) [sent-139, score-0.144]
39 Fortunately, WLBs are sufficient to refute upper bounds or to establish their tightness. [sent-144, score-0.154]
40 The size of the function class F affects the upper and lower bounds in opposite ways. [sent-145, score-0.225]
Proving upper (resp. lower) bounds on (F, A) is usually easier (resp. harder) when the function class F is smaller. [sent-148, score-0.1]
4 Constructing Lower Bounds. Letting the minimizer of J(w) be $w^*$, we are interested in bounding the primal gap of the iterates $w_k$: $J(w_k) - J(w^*)$. [sent-151, score-0.611]
We will explicitly construct datasets whose resulting objective J(w) will be shown to attain the lower bounds of the algorithms. [sent-152, score-0.207]
4.1 Strong Lower Bounds for Solving Linear SVMs using ls-bmrm. We first prove the Ω(1/ε) lower bound for ls-bmrm on SVM problems under assumption A1. [sent-156, score-0.099]
Setting λ = 1/16, the regularized risk (1) can be written as (using w instead of the vector w, as it is now a scalar): $\min_{w\in\mathbb{R}} J(w) = \frac{1}{32}w^2 + \frac{1}{2}\left[1 - \frac{w}{2}\right]_+ + \frac{1}{2}\left[1 - w\right]_+$. [sent-158, score-0.127]
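Assuming the reconstruction above (λ = 1/16 and the two hinge terms), a quick numerical check confirms that the minimizer of this one-dimensional objective sits at w* = 2 with J(w*) = 1/8, consistent with the iterates w_k approaching 2:

```python
import numpy as np

# J(w) = w^2/32 + 0.5*max(0, 1 - w/2) + 0.5*max(0, 1 - w), minimized at w = 2
w = np.linspace(-4.0, 6.0, 100001)
J = w**2 / 32 + 0.5 * np.maximum(0, 1 - w / 2) + 0.5 * np.maximum(0, 1 - w)
print(w[np.argmin(J)], J.min())   # minimizer near 2.0, minimum value near 1/8
```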
These recursive relations allow us to derive the convergence rate of $\alpha_{2k-1,1}$ and $w_k$ (see proof in [7, Appendix C]): Lemma 4. $\lim_{k\to\infty} k\,\alpha_{2k-1,1} = 1$. [sent-169, score-0.529]
Combining with (11), we get $\lim_{k\to\infty} k\,|2 - w_k| = 2$. [sent-170, score-0.442]
Now that $w_k$ approaches 2 at the rate of O(1/k), it is finally straightforward to translate this into the rate at which $J(w_k)$ approaches $J(w^*)$. [sent-171, score-0.494]
4.2 Weak Lower Bounds for Solving Linear SVMs using qp-bmrm. Theorem 1 gives an upper bound on the convergence rate of qp-bmrm, assuming that $R_{emp}$ satisfies the assumption A2. [sent-174, score-0.142]
In this section we further demonstrate that this O(1/ε) rate is also a WLB (hence tight) even when $R_{emp}$ is specialized to SVM objectives satisfying A2. [sent-175, score-0.091]
Given $\epsilon > 0$, define $n = 1/\epsilon$ and construct a dataset $\{(x_i, y_i)\}_{i=1}^{n}$ with $y_i = (-1)^i$ and $x_i = (-1)^i(\sqrt{n}\,e_1 + n\,e_{i+1}) \in \mathbb{R}^{n+1}$. [sent-176, score-0.331]
Then the corresponding objective function (1) is $J(w) = \frac{\|w\|^2}{2} + R_{emp}(w)$, where $R_{emp}(w) = \frac{1}{n}\sum_{i=1}^{n}[1 - y_i\langle w, x_i\rangle]_+ = \frac{1}{n}\sum_{i=1}^{n}[1 - \sqrt{n}\,w_1 - n\,w_{i+1}]_+$. (13) [sent-177, score-0.081]
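A small sanity check of this construction as reconstructed here (the exact scaling of x_i and of the candidate minimizer is an assumption recovered from the surrounding derivation):

```python
import numpy as np

n = 8
e = np.eye(n + 1)
# x_i = (-1)^i (sqrt(n) e_1 + n e_{i+1}) with y_i = (-1)^i, for i = 1..n
X = np.array([(-1) ** i * (np.sqrt(n) * e[0] + n * e[i]) for i in range(1, n + 1)])
y = np.array([(-1) ** i for i in range(1, n + 1)], dtype=float)
# Candidate minimizer w* = (1/(2 sqrt(n)), 1/(2n), ..., 1/(2n))
w_star = np.concatenate(([1 / (2 * np.sqrt(n))], np.full(n, 1 / (2 * n))))
margins = y * (X @ w_star)          # every margin y_i <w*, x_i> equals exactly 1
J_star = 0.5 * w_star @ w_star      # Remp(w*) = 0, so J(w*) = ||w*||^2 / 2 = 1/(4n)
```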
The minimizer is $w^* = \frac{1}{2}\left(\frac{1}{\sqrt{n}}, \frac{1}{n}, \ldots, \frac{1}{n}\right)$ with $J(w^*) = \frac{1}{4n}$: it is easy to check that $y_i\langle w^*, x_i\rangle = 1$, so $0 \in \partial J(w^*) = \left\{w^* - \left(\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\alpha_i, \alpha_1, \ldots, \alpha_n\right) : \alpha_i \in [0, 1]\right\}$ by setting all $\alpha_i = \frac{1}{2n}$. [sent-181, score-0.188]
Suppose running qp-bmrm on the objective function (13) produces iterates $w_1, \ldots, w_k$. [sent-191, score-0.078]
Then it takes qp-bmrm at least $\frac{2n}{3}$ steps to find an ε-accurate solution. [sent-198, score-0.073]
56 2k 4n 3 i∈[k] i∈[k] Indeed, after taking n steps, wn will cut a subgradient an+1 = 0 and bn+1 = 0, and then the minimizer of Jn+1 (w) gives exactly w∗ . [sent-200, score-0.144]
Proof. Since $R_{emp}(w_0) = 0$ and $\partial R_{emp}(w_0) = \left\{-\frac{1}{n}\sum_{i=1}^{n}\alpha_i y_i x_i : \alpha_i \in [0, 1]\right\}$, we can choose $a_1 = -\frac{1}{n}y_1 x_1$ and $b_1 = R_{emp}(w_0) - \langle a_1, w_0\rangle = 0 + \frac{1}{n} = \frac{1}{n}$, so that $w_1 = \arg\min_w \frac{1}{2}\|w\|^2 - \frac{1}{\sqrt{n}}w_1 - w_2 + \frac{1}{n} = \left(\frac{1}{\sqrt{n}}, 1, 0, \ldots, 0\right)$. [sent-204, score-0.188]
58 2 n n n In general, we claim that the k-th iterate wk produced by qp-bmrm is given by w1 = argmin w k copies 1 1 1 √ , , . [sent-208, score-0.579]
For $i = 1, \ldots, k$ it is then easy to check that $R_{emp}(w_k) = 0$ and $\partial R_{emp}(w_k) = \left\{-\frac{1}{n}\sum_{i=k+1}^{n}\alpha_i y_i x_i : \alpha_i \in [0, 1]\right\}$. [sent-219, score-0.188]
Thus we can again choose $a_{k+1} = -\frac{1}{n}y_{k+1}x_{k+1}$ and $b_{k+1} = R_{emp}(w_k) - \langle a_{k+1}, w_k\rangle = \frac{1}{n}$, so $w_{k+1} = \arg\min_w \frac{1}{2}\|w\|^2 + \max_{1\le i\le k+1}\{\langle a_i, w\rangle + b_i\} = \left(\frac{1}{\sqrt{n}}, \underbrace{\tfrac{1}{k+1}, \ldots, \tfrac{1}{k+1}}_{k+1\ \text{copies}}, 0, \ldots, 0\right)$. [sent-220, score-1.109]
This can be verified via $0 \in \partial J_{k+1}(w_{k+1}) = \left\{w_{k+1} + \sum_{i\in[k+1]}\alpha_i a_i : \alpha \in \Delta_{k+1}\right\}$ with all $\alpha_i = \frac{1}{k+1}$; hence $J(w_k) = \frac{1}{2k} + \frac{1}{2n}$ while $J(w^*) = \frac{1}{4n}$, from which the claimed bound follows. As an aside, the subgradient of the $R_{emp}$ in (13) does have Euclidean norm $\sqrt{2n}$ at $w = 0$. [sent-228, score-0.143]
62 In fact, having a bounded subgradient of Remp at all wk is sufficient for qp-bmrm to converge at the rate in Theorem 1. [sent-235, score-0.576]
However, when we assume A1, which is more restrictive than A2, it remains an open question whether the O(1/ε) rates are optimal for qp-bmrm on SVM objectives. [sent-236, score-0.133]
Given $\epsilon > 0$, define $n = 1/\epsilon + 1$ and construct a dataset $\{(x_i, y_i)\}_{i=1}^{n}$ as follows: $x_i = -\frac{n}{2\sqrt{3}}e_1 - \frac{n}{2}e_{i+1} \in \mathbb{R}^{n+1}$ with $y_i = -1$ for all $i \in [n-1]$, and $x_n = \frac{\sqrt{3}\,n}{2}e_1 + \frac{n}{2}e_{n+1} \in \mathbb{R}^{n+1}$ with $y_n = +1$. [sent-240, score-0.359]
Then the corresponding objective function is $J(w) = \frac{1}{2}\|w\|^2 + \max_{\bar{y}}\left[1 - F_1(\bar{y}, y) + \frac{1}{n}\sum_{i=1}^{n} y_i\langle w, x_i\rangle(\bar{y}_i y_i - 1)\right]$. (14) [sent-242, score-0.367]
Moreover, qp-bmrm takes at least $\frac{1}{3\epsilon}$ steps to find an ε-accurate solution of (14). [sent-245, score-0.073]
Then at step k + 1 we have $w_k = \left(\frac{1}{\sqrt{3}}, \underbrace{\tfrac{1}{k}, \ldots, \tfrac{1}{k}}_{k\ \text{copies}}, 0, \ldots, 0\right)$ (15), with $y_i\langle w_k, x_i\rangle = n\left(\frac{1}{6} + \frac{1}{2k}\right)$ if $i \in [k]$, $= \frac{n}{6}$ if $k+1 \le i \le n-1$, and $= \frac{n}{2}$ if $i = n$ (16). [sent-260, score-0.442]
For convenience, define the term in the max in (14) as $\Upsilon_k(\bar{y}) := 1 - F_1(\bar{y}, y) + \frac{1}{n}\sum_{i=1}^{n} y_i\langle w_k, x_i\rangle(\bar{y}_i y_i - 1)$. [sent-261, score-1.215]
If $\bar{y}$ misclassifies the positive training example, then $F_1(\bar{y}, y) = 0$ and by (16) we have $\Upsilon_k(\bar{y}) = 1 - 0 + \frac{1}{n}\sum_{i=1}^{n-1} y_i\langle w_k, x_i\rangle(\bar{y}_i y_i - 1) + \frac{1}{2}(-1-1) = \frac{k+3}{6k}\sum_{i=1}^{k}(\bar{y}_i y_i - 1) + \frac{1}{6}\sum_{i=k+1}^{n-1}(\bar{y}_i y_i - 1) \le 0$. [sent-270, score-1.059]
Otherwise, suppose $\bar{y}$ correctly classifies the positive example but misclassifies $t_1$ examples among $i \in [k]$ and $t_2$ among $k+1 \le i \le n-1$. Then $F_1(\bar{y}, y) = \frac{2}{2+t_1+t_2}$, and $\Upsilon_k(\bar{y}) = 1 - \frac{2}{2+t_1+t_2} - \left(\frac{1}{3} + \frac{1}{k}\right)t_1 - \frac{1}{3}t_2 \le \frac{t}{2+t} - \frac{t}{3} = \frac{t - t^2}{3(2+t)} \le 0$, where $t := t_1 + t_2$. [sent-278, score-0.286]
So we can pick $\bar{y} = (\underbrace{-1, \ldots, -1}_{k\ \text{copies}}, +1, \underbrace{-1, \ldots, -1}_{n-k-2\ \text{copies}}, +1)$, which only misclassifies $x_{k+1}$. [sent-279, score-0.15]
We then get $a_{k+1} = -\frac{2}{n}y_{k+1}x_{k+1} = -\frac{1}{\sqrt{3}}e_1 - e_{k+2}$ and $b_{k+1} = R_{emp}(w_k) - \langle a_{k+1}, w_k\rangle = 0 + \frac{1}{3} = \frac{1}{3}$. [sent-285, score-0.478]
Hence $w_{k+1} = \arg\min_w \left\{J_{k+1}(w) := \frac{1}{2}\|w\|^2 + \max_{i\in[k+1]}\{\langle a_i, w\rangle + b_i\}\right\} = \left(\frac{1}{\sqrt{3}}, \underbrace{\tfrac{1}{k+1}, \ldots, \tfrac{1}{k+1}}_{k+1\ \text{copies}}, 0, \ldots, 0\right)$, which can be verified by $0 \in \partial J_{k+1}(w_{k+1}) = \left\{w_{k+1} + \sum_{i\in[k+1]}\alpha_i a_i : \alpha \in \Delta_{k+1}\right\}$ by setting all $\alpha_i = \frac{1}{k+1}$. [sent-286, score-0.189]
All that remains is to observe that $J(w_k) = \frac{1}{2}\left(\frac{1}{3} + \frac{1}{k}\right)$ while $\min_w J(w) \le J(w_{n-1}) = \frac{1}{2}\left(\frac{1}{3} + \frac{1}{n-1}\right)$, from which it follows that $J(w_k) - \min_w J(w) \ge \frac{1}{2}\left(\frac{1}{k} - \frac{1}{n-1}\right)$ as claimed in Theorem 6. [sent-295, score-0.139]
5 An O(nd/√ε) Algorithm for Training Binary Linear SVMs. The lower bounds we proved above show that CPM such as BMRM require Ω(1/ε) iterations to converge. [sent-296, score-0.212]
To demonstrate this, we will show that one can devise an algorithm for problems (1) and (2) which will converge in O(1/√ε) iterations. [sent-298, score-0.072]
However, thanks to [7, Theorem 7 in Appendix A], the Fenchel dual of (1) is a convex smooth function with a Lipschitz continuous gradient, which is easy to optimize. [sent-300, score-0.088]
To formalize the idea of using the Fenchel dual, we can abstract from the objectives (1) and (2) a composite form of objective functions used in machine learning with linear models: $\min_{w\in Q_1} J(w) = f(w) + g^\star(Aw)$, where $Q_1$ is a closed convex set. (17) [sent-301, score-0.161]
Here, f(w) is a strongly convex function corresponding to the regularizer, Aw stands for the output of a linear model, and $g^\star$ encodes the empirical risk measuring the discrepancy between the correct labels and the output of the linear model. [sent-302, score-0.1]
Setting $X = (x_1, \ldots, x_n)$, $f(w) = \frac{\lambda}{2}\|w\|^2$, and $g^\star(u) = \min_{b\in\mathbb{R}} \frac{1}{n}\sum_{i=1}^{n}[1 + u_i - y_i b]_+$, which corresponds to $g(\alpha) = -\sum_i \alpha_i$. [sent-316, score-0.143]
Then the adjoint form turns out to be the well-known SVM dual objective function: $D(\alpha) = \sum_i \alpha_i - \frac{1}{2\lambda}\,\alpha^\top Y X^\top X Y \alpha$, over $\alpha \in Q_2 = \{\alpha \in [0, n^{-1}]^n : \sum_i y_i\alpha_i = 0\}$. [sent-317, score-0.288]
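The adjoint form can be evaluated directly; a small sketch with made-up data and a feasible α (the function name and the numbers are ours):

```python
import numpy as np

def svm_dual(alpha, X, y, lam):
    """D(alpha) = sum_i alpha_i - (1/(2*lam)) * ||X^T Y alpha||^2, the SVM dual above."""
    v = X.T @ (y * alpha)            # the vector sum_i alpha_i y_i x_i in R^d
    return alpha.sum() - v @ v / (2 * lam)

X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5], [0.2, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.full(4, 0.1)              # feasible: sum_i y_i alpha_i = 0 and 0 <= alpha_i <= 1/n
d = svm_dual(alpha, X, y, lam=1.0)
```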
Denote A as a $2^n$-by-$d$ matrix whose $\bar{y}$-th row is $\frac{1}{n}\sum_{i=1}^{n} x_i^\top(\bar{y}_i - y_i)$ for each $\bar{y} \in \{-1, +1\}^n$; with $f(w) = \frac{\lambda}{2}\|w\|^2$ and $g^\star(u) = \max_{\bar{y}}\,\Delta(\bar{y}, y) + u_{\bar{y}}$, which corresponds to $g(\alpha) = -n\sum_{\bar{y}}\Delta(\bar{y}, y)\alpha_{\bar{y}}$, we recover the primal objective (2) for multivariate performance measures. (20) [sent-319, score-0.264]
In a series of papers [6, 9, 10], Nesterov developed optimal gradient-based methods for minimizing the composite objectives with primal (17) and adjoint (18). [sent-321, score-0.157]
A sequence of $w_k$ and $\alpha_k$ is produced such that under assumption A1 the duality gap $J(w_k) - D(\alpha_k)$ is reduced to less than $\epsilon$ after at most $k = O(1/\sqrt{\epsilon})$ steps. [sent-322, score-0.521]
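The accelerated scheme itself is standard; as a hedged illustration (not the paper's exact algorithm), here is Nesterov's method on a smooth quadratic, whose O(1/k²) error decay is what yields the O(1/√ε) step count:

```python
import numpy as np

def nesterov(grad, L, x0, steps):
    """Nesterov's accelerated gradient: f(x_k) - f* decays as O(1/k^2)."""
    x, v = x0.copy(), x0.copy()      # v holds the previous iterate
    for k in range(1, steps + 1):
        yk = x + (k - 1) / (k + 2) * (x - v)   # momentum extrapolation
        v = x
        x = yk - grad(yk) / L                  # gradient step with step size 1/L
    return x

A = np.diag([1.0, 10.0])
b = np.array([1.0, 1.0])
x = nesterov(lambda z: A @ z - b, L=10.0, x0=np.zeros(2), steps=2000)
# x approaches the minimizer A^{-1} b = [1.0, 0.1]
```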
Given a differentiable convex function F on $Q_2$, a point $\bar{\alpha}$, and a direction g, we can define the Bregman projection as: $V(\bar{\alpha}, g) := \arg\min_{\alpha\in Q_2} F(\alpha) - F(\bar{\alpha}) - \langle g, \alpha - \bar{\alpha}\rangle$. [sent-336, score-0.098]
Then the application of the algorithm in [9] will induce a distribution over all possible labelings: $p(\bar{y}; w) \propto \exp\left(c\,\Delta(\bar{y}, y) + \sum_i a_i\langle x_i, w\rangle\bar{y}_i\right)$, where c and the $a_i$ are constant scalars. (21) [sent-338, score-0.332]
The solver will request the expectation $E_{\bar{y}}\left[\sum_i a_i x_i \bar{y}_i\right]$, which in turn requires the marginal distributions $p(\bar{y}_i)$. [sent-339, score-0.288]
Fortunately, for multivariate scores defined by contingency tables, it is possible to compute the marginals in $O(n^2)$ time by using dynamic programming, and this cost is similar to the algorithm proposed by [3]. [sent-341, score-0.115]
89 While upper bounds on their rates of convergence were known, lower bounds were not studied before. [sent-345, score-0.393]
In this paper we set out to fill this gap by exhibiting counter examples in binary classification on which CPM require Ω(1/ε) iterations. [sent-346, score-0.102]
The Ω(1/ε) lower bound is a fundamental limitation of these algorithms and not an artifact of the problem. [sent-348, score-0.096]
Devising an O(1/√ε) algorithm under the less restrictive assumption A2 remains an open problem. [sent-351, score-0.127]
While the dependence on $\epsilon$ is still Ω(1/ε) or worse [18], one gets bounds independent of n. [sent-361, score-0.1]
94 On the other hand, one can employ coordinate descent in the dual as is done in the Sequential Minimal Optimization (SMO) algorithm of [19]. [sent-363, score-0.079]
However, as [20] show, if the kernel matrix $X^\top X$, obtained by stacking the $x_i$ into a matrix X, is not strictly positive definite, then SMO requires O(n/ε) iterations with each iteration costing O(nd) effort. [sent-364, score-0.112]
A method for unconstrained convex minimization problem with the rate of convergence $O(1/k^2)$. [sent-407, score-0.096]
97 Lower bounds on rate of convergence of cutting plane methods (long version). [sent-415, score-0.279]
98 An algorithm for singly constrained class of quadratic programs subject to upper and lower bounds. [sent-472, score-0.157]
99 New algorithms for singly linearly constrained quadratic programs subject to lower and upper bounds. [sent-478, score-0.157]
100 Information-theoretic lower bounds on the oracle complexity of convex optimization. [sent-505, score-0.207]
wordName wordTfidf (topN-words)
[('remp', 0.523), ('wk', 0.442), ('cpm', 0.294), ('bmrm', 0.258), ('slb', 0.21), ('wlb', 0.168), ('yi', 0.143), ('ak', 0.134), ('bounds', 0.1), ('teo', 0.095), ('svms', 0.089), ('bk', 0.089), ('jk', 0.077), ('copies', 0.075), ('ai', 0.072), ('subgradient', 0.071), ('lower', 0.071), ('cutting', 0.067), ('risk', 0.064), ('ub', 0.063), ('ankan', 0.063), ('contingency', 0.063), ('wi', 0.063), ('joachims', 0.057), ('nesterov', 0.057), ('adjoint', 0.057), ('minw', 0.057), ('xinhua', 0.055), ('upper', 0.054), ('dual', 0.052), ('multivariate', 0.052), ('plane', 0.052), ('counter', 0.051), ('gap', 0.051), ('saha', 0.051), ('quali', 0.051), ('euclidean', 0.049), ('dk', 0.049), ('xk', 0.048), ('limk', 0.048), ('open', 0.047), ('xi', 0.045), ('appendix', 0.043), ('iterates', 0.042), ('intercepts', 0.042), ('bi', 0.042), ('iterations', 0.041), ('optimizer', 0.04), ('primal', 0.04), ('accurate', 0.039), ('wn', 0.037), ('qp', 0.037), ('converge', 0.037), ('misclassify', 0.037), ('optimizers', 0.037), ('tightened', 0.037), ('objective', 0.036), ('argmin', 0.036), ('minimizer', 0.036), ('convex', 0.036), ('cp', 0.035), ('devise', 0.035), ('specialized', 0.035), ('convergence', 0.034), ('steps', 0.034), ('rates', 0.034), ('outlook', 0.034), ('devising', 0.034), ('conjectured', 0.034), ('regularized', 0.034), ('bregman', 0.032), ('singly', 0.032), ('objectives', 0.03), ('composite', 0.03), ('misclassi', 0.029), ('winning', 0.029), ('nemirovski', 0.029), ('min', 0.029), ('yn', 0.028), ('solver', 0.028), ('assumption', 0.028), ('proving', 0.028), ('fenchel', 0.028), ('coordinate', 0.027), ('argmax', 0.027), ('recursive', 0.027), ('restrictive', 0.027), ('refers', 0.026), ('projection', 0.026), ('rate', 0.026), ('aw', 0.026), ('iteration', 0.026), ('claim', 0.026), ('artifact', 0.025), ('subgradients', 0.025), ('smo', 0.025), ('remains', 0.025), ('bundle', 0.024), ('projections', 0.024), ('programming', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods
Author: Xinhua Zhang, Ankan Saha, S.v.n. Vishwanathan
Abstract: In a recent paper Joachims [1] presented SVM-Perf, a cutting plane method (CPM) for training linear Support Vector Machines (SVMs) which converges to an ε-accurate solution in O(1/ε²) iterations. By tightening the analysis, Teo et al. [2] showed that O(1/ε) iterations suffice. Given the impressive convergence speed of CPM on a number of practical problems, it was conjectured that these rates could be further improved. In this paper we disprove this conjecture. We present counter examples which are not only applicable for training linear SVMs with hinge loss, but also hold for support vector methods which optimize a multivariate performance score. However, surprisingly, these problems are not inherently hard. By exploiting the structure of the objective function we can devise an algorithm that converges in O(1/√ε) iterations.
2 0.11076254 62 nips-2010-Discriminative Clustering by Regularized Information Maximization
Author: Andreas Krause, Pietro Perona, Ryan G. Gomes
Abstract: Is there a principled way to learn a probabilistic discriminative classifier from an unlabeled data set? We present a framework that simultaneously clusters the data and trains a discriminative classifier. We call it Regularized Information Maximization (RIM). RIM optimizes an intuitive information-theoretic objective function which balances class separation, class balance and classifier complexity. The approach can flexibly incorporate different likelihood functions, express prior assumptions about the relative size of different classes and incorporate partial labels for semi-supervised learning. In particular, we instantiate the framework to unsupervised, multi-class kernelized logistic regression. Our empirical evaluation indicates that RIM outperforms existing methods on several real data sets, and demonstrates that RIM is an effective model selection method. 1
3 0.10394668 210 nips-2010-Practical Large-Scale Optimization for Max-norm Regularization
Author: Jason Lee, Ben Recht, Nathan Srebro, Joel Tropp, Ruslan Salakhutdinov
Abstract: The max-norm was proposed as a convex matrix regularizer in [1] and was shown to be empirically superior to the trace-norm for collaborative filtering problems. Although the max-norm can be computed in polynomial time, there are currently no practical algorithms for solving large-scale optimization problems that incorporate the max-norm. The present work uses a factorization technique of Burer and Monteiro [2] to devise scalable first-order algorithms for convex programs involving the max-norm. These algorithms are applied to solve huge collaborative filtering, graph cut, and clustering problems. Empirically, the new methods outperform mature techniques from all three areas. 1
4 0.10355358 63 nips-2010-Distributed Dual Averaging In Networks
Author: Alekh Agarwal, Martin J. Wainwright, John C. Duchi
Abstract: The goal of decentralized optimization over a network is to optimize a global objective formed by a sum of local (possibly nonsmooth) convex functions using only local computation and communication. We develop and analyze distributed algorithms based on dual averaging of subgradients, and provide sharp bounds on their convergence rates as a function of the network size and topology. Our analysis clearly separates the convergence of the optimization algorithm itself from the effects of communication constraints arising from the network structure. We show that the number of iterations required by our algorithm scales inversely in the spectral gap of the network. The sharpness of this prediction is confirmed both by theoretical lower bounds and simulations for various networks. 1
5 0.094590589 29 nips-2010-An Approximate Inference Approach to Temporal Optimization in Optimal Control
Author: Konrad Rawlik, Marc Toussaint, Sethu Vijayakumar
Abstract: Algorithms based on iterative local approximations present a practical approach to optimal control in robotic systems. However, they generally require the temporal parameters (for e.g. the movement duration or the time point of reaching an intermediate goal) to be specified a priori. Here, we present a methodology that is capable of jointly optimizing the temporal parameters in addition to the control command profiles. The presented approach is based on a Bayesian canonical time formulation of the optimal control problem, with the temporal mapping from canonical to real time parametrised by an additional control variable. An approximate EM algorithm is derived that efficiently optimizes both the movement duration and control commands offering, for the first time, a practical approach to tackling generic via point problems in a systematic way under the optimal control framework. The proposed approach, which is applicable to plants with non-linear dynamics as well as arbitrary state dependent and quadratic control costs, is evaluated on realistic simulations of a redundant robotic plant.
6 0.094421141 169 nips-2010-More data means less inference: A pseudo-max approach to structured learning
7 0.094110243 243 nips-2010-Smoothness, Low Noise and Fast Rates
8 0.088671163 290 nips-2010-t-logistic regression
9 0.087608568 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
10 0.081012063 13 nips-2010-A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction
11 0.080698922 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
12 0.076074265 176 nips-2010-Multiple Kernel Learning and the SMO Algorithm
13 0.076013967 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks
14 0.07381352 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
15 0.071635671 232 nips-2010-Sample Complexity of Testing the Manifold Hypothesis
16 0.071354054 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
17 0.066816323 228 nips-2010-Reverse Multi-Label Learning
18 0.066614017 257 nips-2010-Structured Determinantal Point Processes
19 0.06587448 162 nips-2010-Link Discovery using Graph Feature Tracking
20 0.064132325 22 nips-2010-Active Estimation of F-Measures
topicId topicWeight
[(0, 0.18), (1, 0.041), (2, 0.142), (3, -0.01), (4, 0.049), (5, 0.004), (6, -0.026), (7, 0.004), (8, 0.018), (9, -0.019), (10, -0.027), (11, -0.068), (12, 0.064), (13, 0.056), (14, -0.024), (15, -0.018), (16, 0.012), (17, 0.025), (18, 0.033), (19, -0.042), (20, 0.079), (21, -0.061), (22, 0.082), (23, -0.001), (24, 0.015), (25, -0.02), (26, -0.109), (27, -0.012), (28, 0.087), (29, -0.076), (30, 0.006), (31, 0.011), (32, 0.053), (33, -0.052), (34, -0.088), (35, 0.1), (36, 0.044), (37, -0.073), (38, -0.012), (39, 0.033), (40, 0.056), (41, -0.068), (42, 0.063), (43, 0.027), (44, 0.08), (45, 0.053), (46, 0.077), (47, -0.078), (48, -0.075), (49, -0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.92735529 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods
Author: Xinhua Zhang, Ankan Saha, S.v.n. Vishwanathan
Abstract: In a recent paper Joachims [1] presented SVM-Perf, a cutting plane method (CPM) for training linear Support Vector Machines (SVMs) which converges to an ε-accurate solution in O(1/ε²) iterations. By tightening the analysis, Teo et al. [2] showed that O(1/ε) iterations suffice. Given the impressive convergence speed of CPM on a number of practical problems, it was conjectured that these rates could be further improved. In this paper we disprove this conjecture. We present counter examples which are not only applicable for training linear SVMs with hinge loss, but also hold for support vector methods which optimize a multivariate performance score. However, surprisingly, these problems are not inherently hard. By exploiting the structure of the objective function we can devise an algorithm that converges in O(1/√ε) iterations.
2 0.72937721 248 nips-2010-Sparse Inverse Covariance Selection via Alternating Linearization Methods
Author: Katya Scheinberg, Shiqian Ma, Donald Goldfarb
Abstract: Gaussian graphical models are of great interest in statistical learning. Because the conditional independencies between different nodes correspond to zero entries in the inverse covariance matrix of the Gaussian distribution, one can learn the structure of the graph by estimating a sparse inverse covariance matrix from sample data, by solving a convex maximum likelihood problem with an ℓ1 -regularization term. In this paper, we propose a first-order method based on an alternating linearization technique that exploits the problem’s special structure; in particular, the subproblems solved in each iteration have closed-form solutions. Moreover, our algorithm obtains an ϵ-optimal solution in O(1/ϵ) iterations. Numerical experiments on both synthetic and real data from gene association networks show that a practical version of this algorithm outperforms other competitive algorithms. 1
3 0.65429163 13 nips-2010-A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction
Author: Tamir Hazan, Raquel Urtasun
Abstract: In this paper we propose an approximated structured prediction framework for large scale graphical models and derive message-passing algorithms for learning their parameters efficiently. We first relate CRFs and structured SVMs and show that in CRFs a variant of the log-partition function, known as the soft-max, smoothly approximates the hinge loss function of structured SVMs. We then propose an intuitive approximation for the structured prediction problem, using duality, based on a local entropy approximation and derive an efficient messagepassing algorithm that is guaranteed to converge. Unlike existing approaches, this allows us to learn efficiently graphical models with cycles and very large number of parameters. 1
4 0.62421882 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
Author: Alekh Agarwal, Sahand Negahban, Martin J. Wainwright
Abstract: Many statistical M -estimators are based on convex optimization problems formed by the weighted sum of a loss function with a norm-based regularizer. We analyze the convergence rates of first-order gradient methods for solving such problems within a high-dimensional framework that allows the data dimension d to grow with (and possibly exceed) the sample size n. This high-dimensional structure precludes the usual global assumptions— namely, strong convexity and smoothness conditions—that underlie classical optimization analysis. We define appropriately restricted versions of these conditions, and show that they are satisfied with high probability for various statistical models. Under these conditions, our theory guarantees that Nesterov’s first-order method [12] has a globally geometric rate of convergence up to the statistical precision of the model, meaning the typical Euclidean distance between the true unknown parameter θ∗ and the optimal solution θ. This globally linear rate is substantially faster than previous analyses of global convergence for specific methods that yielded only sublinear rates. Our analysis applies to a wide range of M -estimators and statistical models, including sparse linear regression using Lasso ( 1 regularized regression), group Lasso, block sparsity, and low-rank matrix recovery using nuclear norm regularization. Overall, this result reveals an interesting connection between statistical precision and computational efficiency in high-dimensional estimation. 1
5 0.6176793 29 nips-2010-An Approximate Inference Approach to Temporal Optimization in Optimal Control
Author: Konrad Rawlik, Marc Toussaint, Sethu Vijayakumar
Abstract: Algorithms based on iterative local approximations present a practical approach to optimal control in robotic systems. However, they generally require the temporal parameters (e.g., the movement duration or the time point of reaching an intermediate goal) to be specified a priori. Here, we present a methodology that is capable of jointly optimizing the temporal parameters in addition to the control command profiles. The presented approach is based on a Bayesian canonical time formulation of the optimal control problem, with the temporal mapping from canonical to real time parametrised by an additional control variable. An approximate EM algorithm is derived that efficiently optimizes both the movement duration and control commands, offering, for the first time, a practical approach to tackling generic via-point problems in a systematic way under the optimal control framework. The proposed approach, which is applicable to plants with non-linear dynamics as well as arbitrary state-dependent and quadratic control costs, is evaluated on realistic simulations of a redundant robotic plant.
6 0.60724688 202 nips-2010-Parallelized Stochastic Gradient Descent
7 0.58938116 210 nips-2010-Practical Large-Scale Optimization for Max-norm Regularization
8 0.58298934 58 nips-2010-Decomposing Isotonic Regression for Efficiently Solving Large Problems
9 0.56535923 243 nips-2010-Smoothness, Low Noise and Fast Rates
10 0.56055516 63 nips-2010-Distributed Dual Averaging In Networks
11 0.55048901 191 nips-2010-On the Theory of Learning with Privileged Information
12 0.54459745 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
13 0.54196882 255 nips-2010-Static Analysis of Binary Executables Using Structural SVMs
14 0.53970915 169 nips-2010-More data means less inference: A pseudo-max approach to structured learning
15 0.53051013 290 nips-2010-t-logistic regression
16 0.51594037 232 nips-2010-Sample Complexity of Testing the Manifold Hypothesis
17 0.50341731 142 nips-2010-Learning Bounds for Importance Weighting
18 0.49317533 118 nips-2010-Implicit Differentiation by Perturbation
19 0.47906071 257 nips-2010-Structured Determinantal Point Processes
20 0.47783241 235 nips-2010-Self-Paced Learning for Latent Variable Models
topicId topicWeight
[(13, 0.052), (17, 0.019), (27, 0.037), (30, 0.093), (35, 0.032), (45, 0.197), (50, 0.041), (52, 0.031), (53, 0.247), (60, 0.079), (77, 0.034), (78, 0.027), (90, 0.029)]
simIndex simValue paperId paperTitle
1 0.84903669 226 nips-2010-Repeated Games against Budgeted Adversaries
Author: Jacob D. Abernethy, Manfred K. Warmuth
Abstract: We study repeated zero-sum games against an adversary on a budget. Given that an adversary has some constraint on the sequence of actions that he plays, we consider what ought to be the player’s best mixed strategy with knowledge of this budget. We show that, for a general class of normal-form games, the minimax strategy is indeed efficiently computable and relies on a “random playout” technique. We give three diverse applications of this new algorithmic template: a cost-sensitive “Hedge” setting, a particular problem in Metrical Task Systems, and the design of combinatorial prediction markets.
2 0.80462784 2 nips-2010-A Bayesian Approach to Concept Drift
Author: Stephen Bach, Mark Maloof
Abstract: To cope with concept drift, we placed a probability distribution over the location of the most-recent drift point. We used Bayesian model comparison to update this distribution from the predictions of models trained on blocks of consecutive observations and pruned potential drift points with low probability. We compare our approach to a non-probabilistic method for drift and a probabilistic method for change-point detection. In our experiments, our approach generally yielded improved accuracy and/or speed over these other methods.
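A stripped-down version of the idea — a posterior over a single drift point obtained by Bayesian model comparison — can be written in a few lines. This sketch assumes a binary stream with independent Beta(1,1) priors on the Bernoulli rate before and after the drift; it is an illustration of the mechanism, not the paper's exact model.

```python
import numpy as np
from scipy.special import betaln

def drift_posterior(x):
    # Posterior over the location of a single drift point in a binary
    # stream: for each split k, compare the marginal likelihoods of the
    # two segments under Beta(1,1)-Bernoulli models.
    x = np.asarray(x)
    n = len(x)
    def log_marginal(seg):
        s = seg.sum()
        return betaln(s + 1, len(seg) - s + 1) - betaln(1, 1)
    logp = np.array([log_marginal(x[:k]) + log_marginal(x[k:])
                     for k in range(1, n)])
    logp -= logp.max()              # stabilize before exponentiating
    p = np.exp(logp)
    return p / p.sum()              # p[k-1] = P(drift just after position k)

stream = [0] * 20 + [1] * 20
post = drift_posterior(stream)
```

On this deterministic stream the posterior mass concentrates on the true split after position 20.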
same-paper 3 0.79793102 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods
Author: Xinhua Zhang, Ankan Saha, S.v.n. Vishwanathan
Abstract: In a recent paper Joachims [1] presented SVM-Perf, a cutting plane method (CPM) for training linear Support Vector Machines (SVMs) which converges to an accurate solution in O(1/ε²) iterations. By tightening the analysis, Teo et al. [2] showed that O(1/ε) iterations suffice. Given the impressive convergence speed of CPM on a number of practical problems, it was conjectured that these rates could be further improved. In this paper we disprove this conjecture. We present counter examples which are not only applicable for training linear SVMs with hinge loss, but also hold for support vector methods which optimize a multivariate performance score. However, surprisingly, these problems are not inherently hard. By exploiting the structure of the objective function we can devise an algorithm that converges in O(1/√ε) iterations.
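The cutting plane method being analyzed maintains a piecewise-linear lower bound on the empirical risk built from subgradients, and at each step minimizes the regularized model. Below is a minimal one-dimensional sketch (my own toy instance with a single hinge term, not one of the paper's counterexamples), with the model minimized by grid search for simplicity.

```python
import numpy as np

def cutting_plane_1d(lam=2.0, iters=30):
    # J(w) = lam/2 * w^2 + R(w),  with  R(w) = max(0, 1 - w)
    # (the hinge loss on a single training example, as in Eq. (1))
    risk = lambda w: max(0.0, 1.0 - w)
    subgrad = lambda w: -1.0 if w < 1.0 else 0.0
    planes = []                        # (slope a, offset b): R(w) >= a*w + b
    w = 0.0
    grid = np.linspace(-2.0, 2.0, 4001)
    for _ in range(iters):
        a = subgrad(w)
        planes.append((a, risk(w) - a * w))   # tangent plane at current w
        model = 0.5 * lam * grid ** 2 + np.max(
            [a * grid + b for a, b in planes], axis=0)
        w = float(grid[np.argmin(model)])     # minimize the CPM model
    return w

w_star = cutting_plane_1d()
```

For `lam = 2` the exact minimizer of J is `w = 0.5`, and on this easy instance the method reaches it almost immediately; the paper's point is that adversarially constructed instances force the worst-case O(1/ε) behavior.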
4 0.77960932 90 nips-2010-Fast Large-scale Mixture Modeling with Component-specific Data Partitions
Author: Bo Thiesson, Chong Wang
Abstract: Remarkably easy implementation and guaranteed convergence has made the EM algorithm one of the most used algorithms for mixture modeling. On the downside, the E-step is linear in both the sample size and the number of mixture components, making it impractical for large-scale data. Based on the variational EM framework, we propose a fast alternative that uses component-specific data partitions to obtain a sub-linear E-step in sample size, while the algorithm still maintains provable convergence. Our approach builds on previous work, but is significantly faster and scales much better in the number of mixture components. We demonstrate this speedup by experiments on large-scale synthetic and real data.
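To make the bottleneck concrete, here is the standard (unpartitioned) EM baseline for a 1-D two-component Gaussian mixture; the E-step touches every (point, component) pair, i.e. O(nK) work, which is exactly the cost the paper's component-specific partitions aim to reduce. This sketch is the plain baseline, not the paper's sub-linear variant.

```python
import numpy as np

def em_gmm_1d(x, iters=50):
    # standard EM for a 2-component 1-D Gaussian mixture
    n = len(x)
    weights = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()])
    var = np.array([x.var(), x.var()])
    for _ in range(iters):
        # E-step: responsibilities r[i, k] -- O(n*K) work
        log_r = (np.log(weights)
                 - 0.5 * np.log(2 * np.pi * var)
                 - 0.5 * (x[:, None] - mu) ** 2 / var)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: closed-form updates
        nk = r.sum(axis=0)
        weights = nk / n
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return weights, mu, var

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-3, 1, 300), rng.normal(3, 1, 300)])
weights, mu, var = em_gmm_1d(x)
```

On this well-separated mixture the component means recover ±3; the paper's contribution is making the E-step above sub-linear in n while keeping the convergence guarantee.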
5 0.7768746 208 nips-2010-Policy gradients in linearly-solvable MDPs
Author: Emanuel Todorov
Abstract: We present policy gradient results within the framework of linearly-solvable MDPs. For the first time, compatible function approximators and natural policy gradients are obtained by estimating the cost-to-go function, rather than the (much larger) state-action advantage function as is necessary in traditional MDPs. We also develop the first compatible function approximators and natural policy gradients for continuous-time stochastic systems.
6 0.70007163 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
7 0.696729 222 nips-2010-Random Walk Approach to Regret Minimization
8 0.69079512 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework
9 0.68916768 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
10 0.68852997 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
11 0.68749195 243 nips-2010-Smoothness, Low Noise and Fast Rates
12 0.68729144 4 nips-2010-A Computational Decision Theory for Interactive Assistants
13 0.68638742 88 nips-2010-Extensions of Generalized Binary Search to Group Identification and Exponential Costs
14 0.6855849 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
15 0.68499541 265 nips-2010-The LASSO risk: asymptotic results and real world examples
16 0.6849001 63 nips-2010-Distributed Dual Averaging In Networks
17 0.68403101 75 nips-2010-Empirical Risk Minimization with Approximations of Probabilistic Grammars
18 0.68362731 180 nips-2010-Near-Optimal Bayesian Active Learning with Noisy Observations
19 0.68268067 78 nips-2010-Error Propagation for Approximate Policy and Value Iteration
20 0.68103188 280 nips-2010-Unsupervised Kernel Dimension Reduction