nips nips2008 nips2008-85 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Karthik Sridharan, Shai Shalev-shwartz, Nathan Srebro
Abstract: We study convergence properties of empirical minimization of a stochastic strongly convex objective, where the stochastic component is linear. We show that the value attained by the empirical minimizer converges to the optimal value with rate 1/n. The result applies, in particular, to the SVM objective. Thus, we obtain a rate of 1/n on the convergence of the SVM objective (with fixed regularization parameter) to its infinite data limit. We demonstrate how this is essential for obtaining certain type of oracle inequalities for SVMs. The results extend also to approximate minimization as well as to strong convexity with respect to an arbitrary norm, and so also to objectives regularized using other p norms. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Fast Rates for Regularized Objectives Karthik Sridharan, Nathan Srebro, Shai Shalev-Shwartz Toyota Technological Institute — Chicago Abstract We study convergence properties of empirical minimization of a stochastic strongly convex objective, where the stochastic component is linear. [sent-1, score-0.341]
2 We show that the value attained by the empirical minimizer converges to the optimal value with rate 1/n. [sent-2, score-0.209]
3 Thus, we obtain a rate of 1/n on the convergence of the SVM objective (with fixed regularization parameter) to its infinite data limit. [sent-4, score-0.218]
4 We demonstrate how this is essential for obtaining certain type of oracle inequalities for SVMs. [sent-5, score-0.199]
5 The results extend also to approximate minimization as well as to strong convexity with respect to an arbitrary norm, and so also to objectives regularized using other p norms. [sent-6, score-0.422]
6 1 Introduction We consider the problem of (approximately) minimizing a stochastic objective F (w) = Eθ [f (w; θ)] (1) where the optimization is with respect to w ∈ W, based on an i. [sent-7, score-0.115]
7 (2) The relevant special case is regularized linear prediction, where θ = (x, y), ( w, φ(x) , y) is the loss of predicting w, φ(x) when the true target is y, and r(w) is a regularizer. [sent-15, score-0.25]
8 It is well known that when the domain W and the mapping φ(·) are bounded, and the function (z; θ) is Lipschitz continuous in z, the empirical averages n ˆ ˆ F (w) = E [f (w; θ)] = 1 n f (w; θi ) (3) i=1 converge uniformly to their expectations F (w) with rate 1/n. [sent-16, score-0.15]
9 This justifies using the empirical minimizer ˆ ˆ w = arg min F (w), (4) w∈W ˆ and we can then establish convergence of F (w) to the population optimum F (w ) = min F (w) w∈W with a rate of (5) 1/n. [sent-17, score-0.294]
10 Recently, Hazan et al [1] studied an online analogue to this problem, and established that if f (w; θ) is strongly convex in w, the average online regret diminishes with a much faster rate, namely (log n)/n. [sent-18, score-0.39]
11 The function f (w; θ) becomes strongly convex when, for example, we have 2 r(w) = λ w as in SVMs and other regularized learning settings. [sent-19, score-0.322]
12 2 In this paper we present an analogous “fast rate” for empirical minimization of a strongly convex stochastic objective. [sent-20, score-0.293]
13 In fact, we do not need to assume that we perform the empirical minimization 1 exactly: we provide uniform (over all w ∈ W) guarantees on the population sub-optimality F (w)− ˆ ˆ ˆ F (w ) in terms of the empirical sub-optimality F (w) − F (w) with a rate of 1/n. [sent-21, score-0.243]
14 Specifically, consider f (w; θ) as in (2), where (z; θ) is convex and LLipschitz in z, the norm of φ(θ) is bounded by B, and r is λ-strongly convex. [sent-24, score-0.253]
15 It might not be surprising that requiring strong convexity yields a rate of 1/n. [sent-27, score-0.363]
16 Indeed, the connection between strong convexity, variance bounds, and rates of 1/n, is well known. [sent-28, score-0.113]
17 In particular, we do not require any “low noise” conditions, nor that the loss function is strongly convex (it need only be weakly convex). [sent-30, score-0.344]
18 This 1/n rate on the SVM objective is always valid, and does not depend on any low-noise conditions or on specific properties of the kernel√ function. [sent-33, score-0.195]
19 Such a “fast” rate might seem surprising at a first glance to the reader familiar with the 1/ n rate on the expected loss of the SVM optimum. [sent-34, score-0.376]
20 There is no contradiction here—what we √ establish is that although the loss might converge at a rate of 1/ n, the SVM objective (regularized loss) always converges at a rate of 1/n. [sent-35, score-0.451]
21 √ In fact, in Section 3 we see how a rate of 1/n on the objective corresponds to a rate of 1/ n on the loss. [sent-36, score-0.253]
22 Specifically, we perform an oracle analysis of the optimum of the SVM objective (rather than of empirical minimization subject to a norm constraint, as in other oracle analyses of regularized linear learning), based on the existence of some (unknown) low-norm, low-error predictor w. [sent-37, score-0.739]
23 Strong convexity is a concept that depends on a choice of norm. [sent-38, score-0.173]
24 We state our results in a general form, for any choice of norm · . [sent-39, score-0.112]
25 Strong convexity of r(w) must hold with respect to the chosen norm · , and the data φ(θ) must be bounded with respect to the dual norm · ∗ , i. [sent-40, score-0.456]
26 This allows us to apply our results also to more general forms of regularizers, 2 including squared p norm regularizers, r(w) = λ w p , for p < 1 ≤ 2 (see Corollary 2). [sent-43, score-0.132]
27 However, 2 the reader may choose to read the paper always thinking of the norm w , and so also its dual norm w ∗ , as the standard 2 -norm. [sent-44, score-0.279]
28 2 Main Result We consider a generalized linear function f : W × Θ → R, that can be written as in (2), defined over a closed convex subset W of a Banach space equipped with norm · . [sent-45, score-0.227]
29 Lipschitz continuity and boundedness We require that the mapping φ(·) is bounded by B, i. [sent-46, score-0.108]
30 Strong Convexity We require that F (w) is λ-strongly convex w. [sent-49, score-0.133]
31 Recalling that w = arg minw F (w), this ensures (see for example [2, Lemma 13]): F (w) ≥ F (w ) + λ 2 w−w 2 (7) We require only that the expectation F (w) = E [f (w; θ)] is strongly convex. [sent-54, score-0.161]
32 Of course, requiring that f (w; θ) is λ-strongly convex for all θ (with respect to w) is enough to ensure the condition. [sent-55, score-0.185]
33 2 In particular, for a generalized linear function of the form (2) it is enough to require that (z; y) is convex in z and that r(w) is λ-strongly convex (w. [sent-56, score-0.267]
34 We now provide a faster convergence rate using the above conditions. [sent-60, score-0.148]
35 Let W be a closed convex subset of a Banach space with norm · and dual norm · ∗ and consider f (w; θ) = ( w, φ(θ) ; θ) + r(w) satisfying the Lipschitz continuity, boundedness, ˆ ˆ and strong convexity requirements with parameters B, L, and λ. [sent-62, score-0.619]
36 λn 2 It is particularly interesting to consider regularizers of the form r(w) = λ w p , which are (p−1)λ2 strongly convex w. [sent-65, score-0.24]
37 Consider an p norm and its dual q , with 1 < p ≤ 2, 1 + p = 1, and the objective q 2 f (w; θ) = ( w, φ(θ) ; θ) + λ w p , where φ(θ) q ≤ B and (z; y) is convex and L-Lipschitz 2 in z. [sent-70, score-0.349]
38 p -regularized λ 2 w p (8) 2 where (z, y) is some convex loss function and x q ≤ B. [sent-74, score-0.242]
39 For example, in SVMs we use the 2 norm, and so bound x 2 ≤ B, and the hinge loss (z, y) = [1 − yz]+ , which is 1-Lipschitz. [sent-75, score-0.198]
40 the regularized empirical loss, or in other words, how quickly do we converge to the infinite-data optimum of the objective. [sent-78, score-0.226]
41 f (w; x, y) = ( w, x , y) + We see, then, that the SVM objective converges to its optimum value at a fast rate of 1/n, without ˆ ˆ any special assumptions. [sent-79, score-0.282]
42 This still doesn’t mean that the expected loss L(w) = E [ ( w, x , y)] converges at this rate. [sent-80, score-0.234]
43 For ˆ each data set size we plot the excess expected loss L(w) − L(w ) and the sub-optimality of the ˆ ˆ − F (w ) (recall that F (w) = L(w) + λ w 2 ). [sent-82, score-0.237]
44 Although the ˆ ˆ regularized expected loss F (w) 2 regularized expected loss converges to its infinite data limit, i. [sent-83, score-0.67]
45 to the population minimizer, with ˆ rate roughly 1/n, the expected loss L(w) converges at a slower rate of roughly 1/n. [sent-85, score-0.426]
46 Studying the convergence rate of the SVM objective allows us to better understand and appreciate analysis of computational optimization approaches for this objective, as well as obtain oracle ˆ inequalities on the generalization loss of w, as we do in the following Section. [sent-86, score-0.536]
47 An example of a regularizer that is strongly convex with respect to the 1 norm is the (unnormalized) entropy regud 2 larizer [3]: r(w) = i=1 |wi | log(|wi |). [sent-93, score-0.331]
48 Consider a function f (w; θ) = ( w, φ(θ) ; θ) + i=1 |wi | log(|wi |), where φ(θ) ∞ ≤ B and (z; y) is convex and L-Lipschitz in z. [sent-98, score-0.115]
49 Right: Excess expected loss L(wλ ) − minw L(wo ), relative to the overall optimal wo = arg minw L(w), with λn = 300/n. [sent-101, score-0.669]
50 We assume, as an oracle assumption, that there exists a good predictor wo with low norm wo ˆ and which attains low expected loss L(wo ). [sent-109, score-1.282]
51 Using the results of Section 2, we can ˆ ˜ guaranteed to find w such that F ˜ translate this approximate optimality of the empirical objective to an approximate optimality of the expected objective Fλ (w) = E [fλ (w)]. [sent-111, score-0.384]
52 The optimal choice for λ is λ(n) = c B log(1/δ) √ , wo n (13) for some constant c. [sent-116, score-0.401]
53 We can now formally state our oracle inequality, which is obtained by substituting (13) into (12): Corollary 4. [sent-117, score-0.169]
54 For any wo and any δ > 0, with probability 2 ˆ ˆ ˜ ˜ at least 1−δ over a sample of size n, we have that for all w s. [sent-119, score-0.422]
55 Fλ(n) (w) ≤ min Fλ(n) (w)+O( B ), λn where λ(n) chosen as in (13), the following holds: 2 B 2 wo log(1/δ) ˜ L(w) ≤ L(wo ) + O n Corollary 4 is demonstrated empirically on the right plot of Figure 1. [sent-121, score-0.423]
56 The way we set λ(n) in Corollary 4 depends on wo . [sent-122, score-0.401]
57 Fλ(n) (w) ≤ B2 ˆ min Fλ(n) (w) + O( λn ), the following holds: 4 B 2 ( wo + 1) log(1/δ) ˜ L(w) ≤ inf L(wo ) + O wo n The price we pay here is that the bound of Corollary 5 is larger by a factor of wo relative to the bound of Corollary 4. [sent-127, score-1.397]
58 Nevertheless, this bound allows us to converge with a rate of 1/n to the expected loss of any fixed predictor. [sent-128, score-0.37]
59 We note that a more sophisticated analysis has been carried out by Steinwart et al [4], who showed that rates faster than √ 1/ n are possible under certain conditions on noise and complexity of kernel class. [sent-131, score-0.25]
60 In Steinwart’s et al analyses the estimation rates (i. [sent-132, score-0.142]
61 rates for expected regularized risk) are given in terms of the approximation error quantity λ w 2 + L(w ) − L∗ where L∗ is the Bayes risk. [sent-134, score-0.244]
62 In our re2 sult we consider the estimation rate for regularized objective independent of the approximation error. [sent-135, score-0.294]
63 For each w, we define gw (θ) = f (w; θ) − f (w ; θ), and so our goal is to bound the expectation of gw in terms of its empirical average. [sent-137, score-1.352]
64 Since our desired bound is not exactly uniform, and we would like to pay different attention to functions depending on their expected sub-optimality, we will instead consider the following reweighted class. [sent-139, score-0.164]
65 For any r > 0 define r gw = Gr = gw 4k(w) : w ∈ W, k(w) = min{k ∈ Z+ : E [gw ] ≤ r4k } where Z+ is the set of non-negative integers. [sent-140, score-1.24]
66 In other words, r gw ∈ G and the scaling factor ensures that E [gw ] ≤ r. [sent-141, score-0.64]
67 r gw (17) ∈ Gr is just a scaled version of We will begin by bounding the variation between expected and empirical average values of g r ∈ Gr . [sent-142, score-0.748]
68 Define: hr = w Hr = hw 4k(w) : w ∈ W, k(w) = min{k ∈ Z+ : E [gw ] ≤ r4k } (18) where hw (θ) = gw (θ) − (r(w) − r(w )) = ( w, φ(θ) ; θ) − ( w∗ , φ(θ) ; θ). [sent-147, score-1.186]
69 (19) r r That is, hw (θ) is the data dependent component of gw , dropping the (scaled) regularization terms. [sent-148, score-0.736]
70 r ˆ ˆ r With this definition we have E [gw ] − E [gw ] = E [hr ] − E [hr ] (the regularization terms on the left w w hand side cancel out), and so it is enough to bound the deviation of the empirical means in Hr . [sent-149, score-0.156]
71 This can be done in terms of the Rademacher Complexity of the class, R(Hr ) [6, Theorem 5]: For any δ > 0, with probability at least 1 − δ, ˆ sup E [hr ] − E [hr ] ≤ 2R(Hr ) + hr ∈Hr log 1/δ 2n . [sent-150, score-0.48]
72 sup |hr (θ)| hr ∈Hr ,θ (20) We will now proceed to bounding the two terms on the right hand side: Lemma 6. [sent-151, score-0.457]
73 From the definition of hr given in (18)–(19), the Lipschitz continuity of (·; θ), and the w bound φ(θ) ∗ ≤ B, we have for all w, θ: |hr (θ)| ≤ w |hw (θ)| 4k(w) ≤ LB w − w /4k(w) (21) We now use the strong convexity of F (w), and in particular eq. [sent-153, score-0.717]
74 (7), as well as the definitions of gw and k(w), and finally note that 4k(w) ≥ 1, to get: w−w ≤ 2 λ (F (w) 2 λ E [gw ] − F (w )) = ≤ 2 k(w) r λ4 ≤ 2 k(w) r λ 16 (22) Substituting (22) in (21) yields the desired bound. [sent-154, score-0.643]
75 We will use the following generic bound on the Rademacher complexity of linear functionals [7, Theorem 1]: for any t(w) which is λ-strongly convex (w. [sent-157, score-0.226]
76 t a norm with dual norm · ∗ ), R({φ → w, φ | t(w) ≤ a}) ≤ (sup φ ∗ ) 2a λn . [sent-159, score-0.257]
77 Rearranging terms yields: r E [gw ] ≤ 1 √ ˆ E [gw ] 1−4D/ r Combining the two cases (25) and (26) (and requiring r ≥ (4D)2 so that have: √ 1 ˆ E [gw ] ≤ 1−4D/√r E [gw ] + rD + (26) 1 √ 1−4D/ r ≥ 1), we always (27) 1 Setting r = (1 + a )2 (4D)2 yields the bound in Theorem 1. [sent-173, score-0.124]
78 5 Comparison with Previous “Fast Rate” Guarantees √ Rates faster than 1/ n for estimation have been previously explored under various conditions, where strong convexity has played a significant role. [sent-174, score-0.272]
79 Lee et al [8] showed faster rates for squared loss, exploiting the strong convexity of this loss function, but only under finite pseudodimensionality assumption, which do not hold in SVM-like settings. [sent-175, score-0.561]
80 Most methods for deriving fast rates first bound the variance of the functions in the class by some monotone function of their expectations. [sent-179, score-0.181]
81 Then, using methods as in Bartlett et al [5], one √ can get bounds that have a localized complexity term and additional terms of order faster than 1/ n. [sent-180, score-0.233]
82 However, it is important to note that the localized complexity term typically dominates the rate and still needs to be controlled. [sent-181, score-0.159]
83 For example, Bartlett et al [12] show that strict convexity of the loss function implies a variance bound, and provide a general result that can enable obtaining faster rates as long as the complexity term is low. [sent-182, score-0.545]
84 Thus we see that even for a strictly convex loss function, such as the squared loss, additional conditions are necessary in order to obtain “fast” rates. [sent-184, score-0.286]
85 In this work we show that strong convexity not only implies a variance bound but in fact can be used to bound the localized complexity. [sent-185, score-0.407]
86 An important distinction is that we require strong convexity of the function F (w) with respect to the norm w . [sent-186, score-0.358]
87 This is rather different than requiring the loss function z → (z, y) be strongly convex on the reals. [sent-187, score-0.356]
88 In particular, the loss of a linear predictor, w → ( w, x , y) can never be strongly convex in a multi-dimensional space, even if is strongly convex, since it is flat in directions orthogonal to x. [sent-188, score-0.41]
89 As mentioned, f (w; x, y) = ( w, x , y) can never be strongly convex in a high-dimensional space. [sent-189, score-0.199]
90 However, we actually only require the strong convexity of the expected loss F (w). [sent-190, score-0.436]
91 If the loss function (z, y) is λ-strongly convex in z, and the eigenvalues of the covariance of x are bounded away from zero, strong convexity of F (w) can be ensured. [sent-191, score-0.521]
92 This enables us to use Theorem 7 1 to obtain rates of 1/n on the expected loss itself. [sent-193, score-0.248]
93 An interesting observation about our proof technique is that the only concentration inequality we invoked was McDiarmid’s Inequality (in [6, Theorem 5] to obtain (20)—a bound on the deviations in terms of the Rademacher complexity). [sent-195, score-0.126]
94 This was possible because we could make a localization argument for the ∞ norm of the functions in our function class in terms of their expectation. [sent-196, score-0.112]
95 6 Summary We believe this is the first demonstration that, without any additional requirements, the SVM objective converges to its infinite data limit with a rate of O(1/n). [sent-197, score-0.215]
96 Although the quantity that is ultimately of interest to us is the expected loss, and not the regularized expected loss, it is still important to understand the statistical behavior of the regularized expected loss. [sent-200, score-0.455]
97 A better understanding of its behavior can allow us to both theoretically explore the behavior of regularized learning methods, to better understand empirical behavior observed in practice, and to appreciate guarantees of stochastic optimization approaches for such regularized objectives. [sent-204, score-0.433]
98 As we saw in Section 3, deriving such fast rates is also essential for obtaining simple and general oracle inequalities, that also helps us guide our choice of regularization parameters. [sent-205, score-0.288]
99 Covering number bounds of certain regularized linear function classes. [sent-219, score-0.151]
100 The importance of convexity in learning with squared loss. [sent-264, score-0.193]
wordName wordTfidf (topN-words)
[('gw', 0.62), ('wo', 0.401), ('hr', 0.384), ('convexity', 0.173), ('oracle', 0.134), ('loss', 0.127), ('regularized', 0.123), ('convex', 0.115), ('norm', 0.112), ('corollary', 0.107), ('bw', 0.1), ('hw', 0.091), ('objective', 0.089), ('strongly', 0.084), ('lb', 0.084), ('rate', 0.082), ('svm', 0.079), ('bound', 0.071), ('gr', 0.067), ('expected', 0.063), ('al', 0.063), ('rademacher', 0.059), ('rates', 0.058), ('steinwart', 0.056), ('strong', 0.055), ('bartlett', 0.055), ('lipschitz', 0.054), ('excess', 0.047), ('bousquet', 0.047), ('banach', 0.047), ('inequalities', 0.046), ('log', 0.045), ('converges', 0.044), ('faster', 0.044), ('predictor', 0.044), ('opt', 0.042), ('minimizer', 0.042), ('empirical', 0.041), ('regularizers', 0.041), ('complexity', 0.04), ('minw', 0.039), ('localized', 0.037), ('appreciate', 0.036), ('optimum', 0.035), ('substituting', 0.035), ('continuity', 0.034), ('hazan', 0.033), ('sridharan', 0.033), ('dual', 0.033), ('risk', 0.032), ('fast', 0.032), ('inequality', 0.03), ('theorem', 0.03), ('svms', 0.03), ('requiring', 0.03), ('sup', 0.03), ('boundedness', 0.03), ('pay', 0.03), ('population', 0.028), ('bounds', 0.028), ('wi', 0.027), ('minimization', 0.027), ('converge', 0.027), ('bounded', 0.026), ('stochastic', 0.026), ('regularization', 0.025), ('rd', 0.025), ('objectives', 0.025), ('away', 0.025), ('concentration', 0.025), ('bounding', 0.024), ('guarantees', 0.024), ('conditions', 0.024), ('yields', 0.023), ('convergence', 0.022), ('min', 0.022), ('optimality', 0.022), ('reader', 0.022), ('applies', 0.021), ('least', 0.021), ('ensure', 0.021), ('regret', 0.021), ('norms', 0.021), ('online', 0.021), ('et', 0.021), ('translate', 0.02), ('regularizer', 0.02), ('ensures', 0.02), ('squared', 0.02), ('deriving', 0.02), ('behavior', 0.02), ('obtaining', 0.019), ('requirements', 0.019), ('enough', 0.019), ('approximate', 0.019), ('proceed', 0.019), ('require', 0.018), ('diverges', 0.018), ('algorithmically', 0.018), ('hides', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 85 nips-2008-Fast Rates for Regularized Objectives
Author: Karthik Sridharan, Shai Shalev-shwartz, Nathan Srebro
Abstract: We study convergence properties of empirical minimization of a stochastic strongly convex objective, where the stochastic component is linear. We show that the value attained by the empirical minimizer converges to the optimal value with rate 1/n. The result applies, in particular, to the SVM objective. Thus, we obtain a rate of 1/n on the convergence of the SVM objective (with fixed regularization parameter) to its infinite data limit. We demonstrate how this is essential for obtaining certain type of oracle inequalities for SVMs. The results extend also to approximate minimization as well as to strong convexity with respect to an arbitrary norm, and so also to objectives regularized using other p norms. 1
2 0.1613111 161 nips-2008-On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization
Author: Sham M. Kakade, Karthik Sridharan, Ambuj Tewari
Abstract: This work characterizes the generalization ability of algorithms whose predictions are linear in the input vector. To this end, we provide sharp bounds for Rademacher and Gaussian complexities of (constrained) linear classes, which directly lead to a number of generalization bounds. This derivation provides simplified proofs of a number of corollaries including: risk bounds for linear prediction (including settings where the weight vectors are constrained by either L2 or L1 constraints), margin bounds (including both L2 and L1 margins, along with more general notions based on relative entropy), a proof of the PAC-Bayes theorem, and upper bounds on L2 covering numbers (with Lp norm constraints and relative entropy constraints). In addition to providing a unified analysis, the results herein provide some of the sharpest risk and margin bounds. Interestingly, our results show that the uniform convergence rates of empirical risk minimization algorithms tightly match the regret bounds of online learning algorithms for linear prediction, up to a constant factor of 2. 1
3 0.13269225 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization
Author: Shai Shalev-shwartz, Sham M. Kakade
Abstract: We describe a primal-dual framework for the design and analysis of online strongly convex optimization algorithms. Our framework yields the tightest known logarithmic regret bounds for Follow-The-Leader and for the gradient descent algorithm proposed in Hazan et al. [2006]. We then show that one can interpolate between these two extreme cases. In particular, we derive a new algorithm that shares the computational simplicity of gradient descent but achieves lower regret in many practical situations. Finally, we further extend our framework for generalized strongly convex functions. 1
4 0.096820801 189 nips-2008-Rademacher Complexity Bounds for Non-I.I.D. Processes
Author: Mehryar Mohri, Afshin Rostamizadeh
Abstract: This paper presents the first Rademacher complexity-based error bounds for noni.i.d. settings, a generalization of similar existing bounds derived for the i.i.d. case. Our bounds hold in the scenario of dependent samples generated by a stationary β-mixing process, which is commonly adopted in many previous studies of noni.i.d. settings. They benefit from the crucial advantages of Rademacher complexity over other measures of the complexity of hypothesis classes. In particular, they are data-dependent and measure the complexity of a class of hypotheses based on the training sample. The empirical Rademacher complexity can be estimated from such finite samples and lead to tighter generalization bounds. We also present the first margin bounds for kernel-based classification in this non-i.i.d. setting and briefly study their convergence.
5 0.093439259 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms
Author: Sham M. Kakade, Ambuj Tewari
Abstract: This paper examines the generalization properties of online convex programming algorithms when the loss function is Lipschitz and strongly convex. Our main result is a sharp bound, that holds with high probability, on the excess risk of the output of an online algorithm in terms of the average regret. This allows one to use recent algorithms with logarithmic cumulative regret guarantees to achieve fast convergence rates for the excess risk with high probability. As a corollary, we characterize the convergence rate of P EGASOS (with high probability), a recently proposed method for solving the SVM optimization problem. 1
6 0.092805482 239 nips-2008-Tighter Bounds for Structured Estimation
7 0.09105207 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost
8 0.081212468 228 nips-2008-Support Vector Machines with a Reject Option
9 0.076206528 193 nips-2008-Regularized Co-Clustering with Dual Supervision
10 0.074012823 145 nips-2008-Multi-stage Convex Relaxation for Learning with Sparse Regularization
11 0.065828905 47 nips-2008-Clustered Multi-Task Learning: A Convex Formulation
12 0.065003559 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE
13 0.060823262 250 nips-2008-Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning
14 0.057749923 123 nips-2008-Linear Classification and Selective Sampling Under Low Noise Conditions
15 0.057027794 2 nips-2008-A Convex Upper Bound on the Log-Partition Function for Binary Distributions
16 0.056037594 202 nips-2008-Robust Regression and Lasso
17 0.055062424 62 nips-2008-Differentiable Sparse Coding
18 0.054401796 214 nips-2008-Sparse Online Learning via Truncated Gradient
19 0.051656015 40 nips-2008-Bounds on marginal probability distributions
20 0.05093481 65 nips-2008-Domain Adaptation with Multiple Sources
topicId topicWeight
[(0, -0.15), (1, -0.003), (2, -0.176), (3, 0.066), (4, -0.097), (5, 0.039), (6, -0.02), (7, -0.038), (8, -0.05), (9, 0.034), (10, 0.014), (11, 0.106), (12, 0.037), (13, 0.031), (14, 0.002), (15, 0.08), (16, 0.026), (17, 0.004), (18, -0.029), (19, 0.023), (20, -0.125), (21, 0.055), (22, -0.007), (23, -0.024), (24, 0.001), (25, 0.062), (26, -0.029), (27, 0.046), (28, -0.102), (29, -0.056), (30, -0.015), (31, 0.025), (32, -0.012), (33, -0.015), (34, -0.032), (35, 0.026), (36, -0.016), (37, 0.048), (38, -0.103), (39, 0.004), (40, 0.064), (41, -0.005), (42, 0.08), (43, 0.04), (44, 0.064), (45, 0.051), (46, 0.059), (47, 0.114), (48, -0.02), (49, -0.107)]
simIndex simValue paperId paperTitle
same-paper 1 0.94460595 85 nips-2008-Fast Rates for Regularized Objectives
Author: Karthik Sridharan, Shai Shalev-shwartz, Nathan Srebro
Abstract: We study convergence properties of empirical minimization of a stochastic strongly convex objective, where the stochastic component is linear. We show that the value attained by the empirical minimizer converges to the optimal value with rate 1/n. The result applies, in particular, to the SVM objective. Thus, we obtain a rate of 1/n on the convergence of the SVM objective (with fixed regularization parameter) to its infinite data limit. We demonstrate how this is essential for obtaining certain type of oracle inequalities for SVMs. The results extend also to approximate minimization as well as to strong convexity with respect to an arbitrary norm, and so also to objectives regularized using other p norms. 1
2 0.77713919 161 nips-2008-On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization
Author: Sham M. Kakade, Karthik Sridharan, Ambuj Tewari
Abstract: This work characterizes the generalization ability of algorithms whose predictions are linear in the input vector. To this end, we provide sharp bounds for Rademacher and Gaussian complexities of (constrained) linear classes, which directly lead to a number of generalization bounds. This derivation provides simplified proofs of a number of corollaries including: risk bounds for linear prediction (including settings where the weight vectors are constrained by either L2 or L1 constraints), margin bounds (including both L2 and L1 margins, along with more general notions based on relative entropy), a proof of the PAC-Bayes theorem, and upper bounds on L2 covering numbers (with Lp norm constraints and relative entropy constraints). In addition to providing a unified analysis, the results herein provide some of the sharpest risk and margin bounds. Interestingly, our results show that the uniform convergence rates of empirical risk minimization algorithms tightly match the regret bounds of online learning algorithms for linear prediction, up to a constant factor of 2. 1
3 0.66466492 239 nips-2008-Tighter Bounds for Structured Estimation
Author: Olivier Chapelle, Chuong B. Do, Choon H. Teo, Quoc V. Le, Alex J. Smola
Abstract: Large-margin structured estimation methods minimize a convex upper bound of loss functions. While they allow for efficient optimization algorithms, these convex formulations are not tight and sacrifice the ability to accurately model the true loss. We present tighter non-convex bounds based on generalizing the notion of a ramp loss from binary classification to structured estimation. We show that a small modification of existing optimization algorithms suffices to solve this modified problem. On structured prediction tasks such as protein sequence alignment and web page ranking, our algorithm leads to improved accuracy. 1
4 0.65992182 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization
Author: Shai Shalev-shwartz, Sham M. Kakade
Abstract: We describe a primal-dual framework for the design and analysis of online strongly convex optimization algorithms. Our framework yields the tightest known logarithmic regret bounds for Follow-The-Leader and for the gradient descent algorithm proposed in Hazan et al. [2006]. We then show that one can interpolate between these two extreme cases. In particular, we derive a new algorithm that shares the computational simplicity of gradient descent but achieves lower regret in many practical situations. Finally, we further extend our framework for generalized strongly convex functions. 1
5 0.60478753 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost
Author: Hamed Masnadi-shirazi, Nuno Vasconcelos
Abstract: The machine learning problem of classifier design is studied from the perspective of probability elicitation, in statistics. This shows that the standard approach of proceeding from the specification of a loss, to the minimization of conditional risk is overly restrictive. It is shown that a better alternative is to start from the specification of a functional form for the minimum conditional risk, and derive the loss function. This has various consequences of practical interest, such as showing that 1) the widely adopted practice of relying on convex loss functions is unnecessary, and 2) many new losses can be derived for classification problems. These points are illustrated by the derivation of a new loss which is not convex, but does not compromise the computational tractability of classifier design, and is robust to the contamination of data with outliers. A new boosting algorithm, SavageBoost, is derived for the minimization of this loss. Experimental results show that it is indeed less sensitive to outliers than conventional methods, such as Ada, Real, or LogitBoost, and converges in fewer iterations. 1
6 0.57128251 189 nips-2008-Rademacher Complexity Bounds for Non-I.I.D. Processes
7 0.56480163 250 nips-2008-Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning
8 0.53459436 163 nips-2008-On the Efficient Minimization of Classification Calibrated Surrogates
9 0.52473187 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms
10 0.50451475 185 nips-2008-Privacy-preserving logistic regression
11 0.49803272 2 nips-2008-A Convex Upper Bound on the Log-Partition Function for Binary Distributions
12 0.49771586 228 nips-2008-Support Vector Machines with a Reject Option
13 0.49156401 196 nips-2008-Relative Margin Machines
14 0.44253024 145 nips-2008-Multi-stage Convex Relaxation for Learning with Sparse Regularization
15 0.43972152 47 nips-2008-Clustered Multi-Task Learning: A Convex Formulation
16 0.40550143 5 nips-2008-A Transductive Bound for the Voted Classifier with an Application to Semi-supervised Learning
17 0.38771516 176 nips-2008-Partially Observed Maximum Entropy Discrimination Markov Networks
18 0.36340296 193 nips-2008-Regularized Co-Clustering with Dual Supervision
19 0.3591218 48 nips-2008-Clustering via LP-based Stabilities
20 0.34883541 123 nips-2008-Linear Classification and Selective Sampling Under Low Noise Conditions
topicId topicWeight
[(6, 0.109), (7, 0.072), (12, 0.042), (28, 0.158), (40, 0.245), (57, 0.046), (59, 0.012), (63, 0.041), (71, 0.075), (77, 0.037), (83, 0.049), (84, 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.7902596 85 nips-2008-Fast Rates for Regularized Objectives
Author: Karthik Sridharan, Shai Shalev-shwartz, Nathan Srebro
Abstract: We study convergence properties of empirical minimization of a stochastic strongly convex objective, where the stochastic component is linear. We show that the value attained by the empirical minimizer converges to the optimal value with rate 1/n. The result applies, in particular, to the SVM objective. Thus, we obtain a rate of 1/n on the convergence of the SVM objective (with fixed regularization parameter) to its infinite data limit. We demonstrate how this is essential for obtaining certain type of oracle inequalities for SVMs. The results extend also to approximate minimization as well as to strong convexity with respect to an arbitrary norm, and so also to objectives regularized using other p norms. 1
2 0.71853054 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms
Author: Sham M. Kakade, Ambuj Tewari
Abstract: This paper examines the generalization properties of online convex programming algorithms when the loss function is Lipschitz and strongly convex. Our main result is a sharp bound, that holds with high probability, on the excess risk of the output of an online algorithm in terms of the average regret. This allows one to use recent algorithms with logarithmic cumulative regret guarantees to achieve fast convergence rates for the excess risk with high probability. As a corollary, we characterize the convergence rate of P EGASOS (with high probability), a recently proposed method for solving the SVM optimization problem. 1
3 0.66802716 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization
Author: Shai Shalev-shwartz, Sham M. Kakade
Abstract: We describe a primal-dual framework for the design and analysis of online strongly convex optimization algorithms. Our framework yields the tightest known logarithmic regret bounds for Follow-The-Leader and for the gradient descent algorithm proposed in Hazan et al. [2006]. We then show that one can interpolate between these two extreme cases. In particular, we derive a new algorithm that shares the computational simplicity of gradient descent but achieves lower regret in many practical situations. Finally, we further extend our framework for generalized strongly convex functions. 1
4 0.6628089 202 nips-2008-Robust Regression and Lasso
Author: Huan Xu, Constantine Caramanis, Shie Mannor
Abstract: We consider robust least-squares regression with feature-wise disturbance. We show that this formulation leads to tractable convex optimization problems, and we exhibit a particular uncertainty set for which the robust problem is equivalent to 1 regularized regression (Lasso). This provides an interpretation of Lasso from a robust optimization perspective. We generalize this robust formulation to consider more general uncertainty sets, which all lead to tractable convex optimization problems. Therefore, we provide a new methodology for designing regression algorithms, which generalize known formulations. The advantage is that robustness to disturbance is a physical property that can be exploited: in addition to obtaining new formulations, we use it directly to show sparsity properties of Lasso, as well as to prove a general consistency result for robust regression problems, including Lasso, from a unified robustness perspective. 1
5 0.66085637 62 nips-2008-Differentiable Sparse Coding
Author: J. A. Bagnell, David M. Bradley
Abstract: Prior work has shown that features which appear to be biologically plausible as well as empirically useful can be found by sparse coding with a prior such as a laplacian (L1 ) that promotes sparsity. We show how smoother priors can preserve the benefits of these sparse priors while adding stability to the Maximum A-Posteriori (MAP) estimate that makes it more useful for prediction problems. Additionally, we show how to calculate the derivative of the MAP estimate efficiently with implicit differentiation. One prior that can be differentiated this way is KL-regularization. We demonstrate its effectiveness on a wide variety of applications, and find that online optimization of the parameters of the KL-regularized model can significantly improve prediction performance. 1
6 0.65711844 96 nips-2008-Hebbian Learning of Bayes Optimal Decisions
7 0.65388966 75 nips-2008-Estimating vector fields using sparse basis field expansions
8 0.65248954 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost
9 0.65081012 196 nips-2008-Relative Margin Machines
10 0.65051287 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
11 0.6483236 189 nips-2008-Rademacher Complexity Bounds for Non-I.I.D. Processes
12 0.6481207 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
13 0.64615464 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models
14 0.64610636 143 nips-2008-Multi-label Multiple Kernel Learning
15 0.64544606 16 nips-2008-Adaptive Template Matching with Shift-Invariant Semi-NMF
16 0.64473391 228 nips-2008-Support Vector Machines with a Reject Option
17 0.64421266 91 nips-2008-Generative and Discriminative Learning with Unknown Labeling Bias
18 0.64416903 179 nips-2008-Phase transitions for high-dimensional joint support recovery
19 0.64339513 239 nips-2008-Tighter Bounds for Structured Estimation
20 0.64243877 145 nips-2008-Multi-stage Convex Relaxation for Learning with Sparse Regularization