nips nips2008 nips2008-164 knowledge-graph by maker-knowledge-mining

164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms


Source: pdf

Author: Sham M. Kakade, Ambuj Tewari

Abstract: This paper examines the generalization properties of online convex programming algorithms when the loss function is Lipschitz and strongly convex. Our main result is a sharp bound, that holds with high probability, on the excess risk of the output of an online algorithm in terms of the average regret. This allows one to use recent algorithms with logarithmic cumulative regret guarantees to achieve fast convergence rates for the excess risk with high probability. As a corollary, we characterize the convergence rate of Pegasos (with high probability), a recently proposed method for solving the SVM optimization problem. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract This paper examines the generalization properties of online convex programming algorithms when the loss function is Lipschitz and strongly convex. [sent-4, score-0.592]

2 Our main result is a sharp bound, that holds with high probability, on the excess risk of the output of an online algorithm in terms of the average regret. [sent-5, score-0.228]

3 This allows one to use recent algorithms with logarithmic cumulative regret guarantees to achieve fast convergence rates for the excess risk with high probability. [sent-6, score-0.365]

4 As a corollary, we characterize the convergence rate of Pegasos (with high probability), a recently proposed method for solving the SVM optimization problem. [sent-7, score-0.025]

5 1 Introduction Online regret minimizing algorithms provide some of the most successful algorithms for many machine learning problems, both in terms of the speed of optimization and the quality of generalization. [sent-8, score-0.294]

6 Notable examples include efficient learning algorithms for structured prediction [Collins, 2002] (an algorithm now widely used) and for ranking problems [Crammer et al. [sent-9, score-0.118]

7 Online convex optimization is a sequential paradigm in which, at each round, the learner predicts a vector w_t ∈ S ⊂ R^n, nature responds with a convex loss function ℓ_t, and the learner suffers loss ℓ_t(w_t). [sent-11, score-1.008]

8 In this setting, the goal of the learner is to minimize the regret Σ_{t=1}^T ℓ_t(w_t) − min_{w∈S} Σ_{t=1}^T ℓ_t(w), which is the difference between his cumulative loss and the cumulative loss of the optimal fixed vector. [sent-12, score-0.413]
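
To make the protocol above concrete, here is a minimal sketch of the online convex optimization loop and the regret statistic. It is an illustration only: the L2-ball choice of S, the fixed step size, and every function name are assumptions, not details taken from the paper.

import numpy as np

def project_l2_ball(w, radius=1.0):
    # Projection onto S = {w : ||w||_2 <= radius}, one convenient choice of the feasible set.
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def online_convex_optimization(loss_fns, grad_fns, dim, eta=0.1):
    # Generic loop: predict w_t, observe the convex loss l_t, suffer l_t(w_t), update.
    w = np.zeros(dim)
    iterates, cumulative_loss = [], 0.0
    for loss, grad in zip(loss_fns, grad_fns):
        iterates.append(w.copy())
        cumulative_loss += loss(w)
        w = project_l2_ball(w - eta * grad(w))  # plain (sub)gradient step
    return iterates, cumulative_loss

def regret_against(loss_fns, iterates, comparator):
    # Regret of the iterates against one fixed comparator in S; the regret defined in the
    # paper takes the minimum of the comparator's cumulative loss over all w in S.
    learner = sum(loss(w_t) for loss, w_t in zip(loss_fns, iterates))
    fixed = sum(loss(comparator) for loss in loss_fns)
    return learner - fixed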

9 Typically, these algorithms are used to train a learning algorithm incrementally, by sequentially feeding the algorithm a data sequence, (X1 , Y1 ), . [sent-13, score-0.047]

10 we would like R(w) − min_{w∈S} R(w) to be small, where R(w) := E[ℓ(w; (X, Y))] is the risk. [sent-22, score-0.021]

11 Intuitively, it seems plausible that low regret on an i.i.d. sequence should translate into good generalization, and a number of online-to-batch conversions make this intuition precise. [sent-23, score-0.215]

12 These include cases which are tailored to general convex functions [Cesa-Bianchi et al. [sent-28, score-0.202]

13 , 2004] (whose regret is O(√T)) and mistake bound settings [Cesa-Bianchi and Gentile, 2008] (where the regret could be O(1) under separability assumptions). [sent-29, score-0.497]

14 In these conversions, we typically choose w to be the average of the wt produced by our online algorithm. [sent-30, score-0.558]
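
The averaging step of the conversion is simple; a sketch (the function name is ours, and the iterates are those returned by a loop like the one above):

import numpy as np

def online_to_batch_average(iterates):
    # Online-to-batch conversion by averaging: w_bar = (1/T) * sum_t w_t.
    return np.mean(np.stack(iterates), axis=0)

# e.g. w_bar = online_to_batch_average(iterates)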

15 Recently, there has been a growing body of work providing online algorithms for strongly convex loss functions (i.e. [sent-31, score-0.52]

16 ℓ_t is strongly convex), with regret guarantees that are merely O(ln T). [sent-33, score-0.324]
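
For intuition, the following sketch shows the style of update associated with logarithmic regret for strongly convex losses, in the spirit of the online gradient descent of Hazan et al. [2007]: the step size decays as 1/(νt). This is a schematic under that assumption, not the paper's own algorithm; the projection callable is supplied by the caller (for example project_l2_ball from the sketch above).

import numpy as np

def ogd_strongly_convex(grad_fns, dim, nu, project):
    # Online gradient descent with step size 1/(nu * t); for nu-strongly convex losses
    # this schedule is associated with O(ln T) regret (Hazan et al., 2007).
    w = np.zeros(dim)
    iterates = []
    for t, grad in enumerate(grad_fns, start=1):
        iterates.append(w.copy())
        eta_t = 1.0 / (nu * t)
        w = project(w - eta_t * grad(w))
    return iterates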

17 Such algorithms have the potential to be highly applicable since many machine learning optimization problems are in fact strongly convex — either with strongly convex loss functions (e.g. [sent-34, score-0.627]

18 log loss, square loss) or, indirectly, via strongly convex regularizers (e.g. [sent-36, score-0.245]

19 Note that in the latter case, the loss function itself may only be just convex, but a strongly convex regularizer effectively makes this a strongly convex optimization problem; e.g. [sent-39, score-0.7]

20 the SVM optimization problem uses the hinge loss with L2 regularization. [sent-41, score-0.128]
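
For concreteness, the SVM objective alluded to here can be written in the usual regularized empirical risk form; the symbols λ, m, and (x_i, y_i) are ours, not taken from the excerpt:

F(w) \;=\; \frac{\lambda}{2}\,\lVert w \rVert_2^2 \;+\; \frac{1}{m}\sum_{i=1}^{m} \max\bigl(0,\; 1 - y_i \langle w, x_i \rangle\bigr).

The hinge term is convex but not strongly convex; the quadratic regularizer makes F λ-strongly convex, which is the indirect route to strong convexity described above.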

21 In fact, for this case, the Pegasos algorithm of Shalev-Shwartz et al. [sent-42, score-0.065]

22 [2007] — based on the online strongly convex programming algorithm of Hazan et al. [sent-43, score-0.468]

23 Ratliff et al. [2007] provide a similar subgradient method for max-margin based structured prediction, which also has favorable empirical performance. [sent-46, score-0.026]

24 The aim of this paper is to examine the generalization properties of online convex programming algorithms when the loss function is strongly convex (where strong convexity can be defined in a general sense, with respect to some arbitrary norm || · ||). [sent-47, score-0.755]

25 Suppose we have an online algorithm which has some guaranteed cumulative regret bound Reg_T (e.g. [sent-48, score-0.462]

26 Then a corollary of our main result shows that with probability greater than 1 − δ ln T, we obtain a parameter w from our online algorithm such that R(w) − min_{w∈S} R(w) ≤ Reg_T/T + O( √(Reg_T ln(1/δ))/T + ln(1/δ)/T ). [sent-51, score-1.433]

27 Here, the constants hidden in the O-notation are determined by the Lipschitz constant and the strong convexity parameter of the loss ℓ. [sent-52, score-0.142]

28 Importantly, note that the correction term is of lower order than the regret — if the regret is ln T then the additional penalty is O(√(ln T)/T). [sent-53, score-1.277]

29 If one naively uses the Hoeffding-Azuma methods in Cesa-Bianchi et al. [sent-54, score-0.083]

30 [2004], one would obtain a significantly worse penalty of O(1/√T). [sent-55, score-0.019]

31 This result solves an open problem in Shalev-Shwartz et al. [sent-56, score-0.065]

32 Pegasos is an online strongly convex programming algorithm for the SVM objective function — it repeatedly (and randomly) subsamples the training set in order to minimize the empirical SVM objective function. [sent-58, score-0.443]
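
A compressed sketch of a Pegasos-style update follows: single-example subsampling, step size 1/(λt), and an optional projection onto a ball of radius 1/√λ. The specific choices here (batch size of one, the projection step, all names) are assumptions for illustration, not the reference implementation.

import numpy as np

def pegasos_sketch(X, y, lam, num_steps, seed=0):
    # Pegasos-style stochastic subgradient descent on the regularized hinge loss.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, num_steps + 1):
        i = rng.integers(n)                 # random subsample of the training set
        eta = 1.0 / (lam * t)               # step size tied to the strong convexity lam
        margin = y[i] * X[i].dot(w)
        subgrad = lam * w - (y[i] * X[i] if margin < 1.0 else 0.0)
        w = w - eta * subgrad
        radius = 1.0 / np.sqrt(lam)         # optional projection step used in Pegasos
        norm = np.linalg.norm(w)
        if norm > radius:
            w = w * (radius / norm)
    return w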

33 A corollary to this work essentially shows that the convergence rate of Pegasos (as a randomized optimization algorithm) is concentrated rather sharply. [sent-59, score-0.084]

34 [2007] also provide an online algorithm (based on Hazan et al. [sent-61, score-0.205]

35 Our results are also directly applicable in providing a sharper concentration result in their setting (in particular, see the regret bound in Equation 15, to which our results can be applied). [sent-63, score-0.404]

36 This paper continues the line of research initiated by several researchers [Littlestone, 1989, Cesa-Bianchi et al. [sent-64, score-0.089]

37 , 2004, Zhang, 2005, Cesa-Bianchi and Gentile, 2008] which looks at how to convert online algorithms into batch algorithms with provable guarantees. [sent-65, score-0.276]

38 Cesa-Bianchi and Gentile [2008] prove faster rates in the case when the cumulative loss of the online algorithm is small. [sent-66, score-0.302]

39 Here, we are interested in the case where the cumulative regret is small. [sent-67, score-0.274]

40 Zhang [2005] explicitly goes via the exponential moment method to derive sharper concentration results. [sent-69, score-0.099]

41 In particular, for the regression problem with squared loss, Zhang [2005] gives a result similar to ours (see Theorem 8 therein). [sent-70, score-0.017]

42 The present work can also be seen as generalizing his result to the case where we have strong convexity with respect to a general norm. [sent-71, score-0.039]

43 Coupled with recent advances in low regret algorithms in this setting, we are able to provide a result that holds more generally. [sent-72, score-0.266]

44 Our key technical tool is a probabilistic inequality due to Freedman [Freedman, 1975]. [sent-73, score-0.055]

45 This, combined with a variance bound (Lemma 1) that follows from our assumptions about the loss function, allows us to derive our main result (Theorem 2). [sent-74, score-0.153]

46 We then apply it to statistical learning with bounded loss, and to Pegasos in Section 4. [sent-75, score-0.022]

47 2 Setting Fix a compact convex subset S of some space equipped with a norm ‖·‖. [sent-76, score-0.143]

48 Let ‖·‖_* be the dual norm defined by ‖v‖_* := sup_{w : ‖w‖ ≤ 1} v · w. [sent-77, score-0.025]
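
Two standard instances of the dual norm, added here only as a reminder (they are not in the excerpt):

\lVert\cdot\rVert = \lVert\cdot\rVert_2 \;\Rightarrow\; \lVert\cdot\rVert_* = \lVert\cdot\rVert_2,
\qquad
\lVert\cdot\rVert = \lVert\cdot\rVert_1 \;\Rightarrow\; \lVert\cdot\rVert_* = \lVert\cdot\rVert_\infty .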

49 Our goal is to minimize F (w) := E [f (w; Z)] over w ∈ S. [sent-79, score-0.018]

50 (LIpschitz and STrongly convex assumption) For all z ∈ Z, the function fz (w) = f (w; z) is convex in w and satisfies: 1. [sent-82, score-0.496]

51 ∀w ∈ S, ∀λ ∈ ∂f_z(w) (∂f_z denotes the subdifferential of f_z), ‖λ‖_* ≤ L. [sent-88, score-0.279]

52 Note that this assumption implies ∀w, w′ ∈ S, |f_z(w) − f_z(w′)| ≤ L ‖w − w′‖. [sent-89, score-0.299]

53 ∀θ ∈ [0, 1], ∀w, w′ ∈ S, f_z(θw + (1 − θ)w′) ≤ θ f_z(w) + (1 − θ) f_z(w′) − (ν/2) θ(1 − θ) ‖w − w′‖². [sent-96, score-0.26]
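
As a quick sanity check of part 2 (an added example, with the ν/2 placement as reconstructed above): for the quadratic f_z(w) = (ν/2)‖w‖² the inequality holds with equality, since

\theta\,\tfrac{\nu}{2}\lVert w\rVert^2 + (1-\theta)\,\tfrac{\nu}{2}\lVert w'\rVert^2
\;-\; \tfrac{\nu}{2}\bigl\lVert \theta w + (1-\theta) w'\bigr\rVert^2
\;=\; \tfrac{\nu}{2}\,\theta(1-\theta)\,\lVert w - w'\rVert^2 .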

54 We consider an online setting in which independent (but not necessarily identically distributed) random variables Z1 , . [sent-98, score-0.16]

55 Now consider an algorithm that starts out with some w1 and at time t, having seen Zt , updates the parameter wt to wt+1 . [sent-103, score-0.418]

56 Define the statistics Reg_T := Σ_{t=1}^T f(w_t; Z_t) − min_{w∈S} Σ_{t=1}^T f(w; Z_t) and Diff_T := Σ_{t=1}^T (F(w_t) − F(w*)) = Σ_{t=1}^T F(w_t) − T · F(w*), where w* denotes a minimizer of F over S. [sent-118, score-0.021]

57 Define the sequence of random variables ξ_t := F(w_t) − F(w*) − (f(w_t; Z_t) − f(w*; Z_t)). (1) [sent-119, score-0.025]

58 Since E_{t−1}[f(w_t; Z_t)] = F(w_t) and E_{t−1}[f(w*; Z_t)] = F(w*), ξ_t is a martingale difference sequence. [sent-120, score-0.095]

59 This definition needs some explanation, as it is important to look at the right martingale difference sequence to derive the results we want. [sent-121, score-0.12]

60 Even under assumption LIST, (1/T) Σ_t f(w_t; Z_t) and (1/T) Σ_t f(w*; Z_t) will not be concentrated around (1/T) Σ_t F(w_t) and F(w*), respectively, at a rate better than O(1/√T) in general. [sent-122, score-0.068]

61 But if we look at the difference, we are able to get sharper concentration. [sent-123, score-0.067]

62 3 A General Online to Batch Conversion The following simple lemma is crucial for us. [sent-124, score-0.098]

63 It says that under assumption LIST, the variance of the increment in the regret, f(w_t; Z_t) − f(w*; Z_t), is bounded by its (conditional) expectation F(w_t) − F(w*). [sent-125, score-0.315]

64 Such a control on the variance is often the main ingredient in obtaining sharper concentration results. [sent-126, score-0.12]

65 Suppose assumption LIST holds and let ξt be the martingale difference sequence defined in (1). [sent-128, score-0.183]

66 Let Var_{t−1} ξ_t := E_{t−1}[ξ_t²] be the conditional variance of ξ_t given Z_1, . . . , Z_{t−1}. [sent-129, score-0.021]

67 Then, under assumption LIST, we have Var_{t−1} ξ_t ≤ (4L²/ν) (F(w_t) − F(w*)). [sent-133, score-0.068]

68 The variance bound given by the above lemma allows us to prove our main theorem. [sent-134, score-0.148]

69 Under assumption LIST, we have, with probability at least 1 − 4 ln(T) δ, (1/T) Σ_{t=1}^T F(w_t) − F(w*) ≤ Reg_T/T + 4 √(L² ln(1/δ)/ν) · √(Reg_T)/T + max{16L²/ν, 6B} · ln(1/δ)/T. [sent-136, score-0.039]

70 Further, using Jensen's inequality, (1/T) Σ_t F(w_t) can be replaced by F(w̄), where w̄ := (1/T) Σ_t w_t. [sent-137, score-0.459]
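
A small helper that evaluates the right-hand side of the bound as reconstructed above; the example values below are illustrative assumptions only.

import math

def excess_risk_bound(reg_T, T, L, nu, B, delta):
    # Right-hand side of the (reconstructed) theorem bound, holding with
    # probability at least 1 - 4*ln(T)*delta under assumption LIST.
    log_term = math.log(1.0 / delta)
    average_regret = reg_T / T
    cross_term = 4.0 * math.sqrt(L * L * log_term / nu) * math.sqrt(reg_T) / T
    tail_term = max(16.0 * L * L / nu, 6.0 * B) * log_term / T
    return average_regret + cross_term + tail_term

# With logarithmic regret, every term is roughly ln(T)/T up to square-root factors.
print(excess_risk_bound(reg_T=math.log(1e6), T=int(1e6), L=1.0, nu=1.0, B=1.0, delta=1e-3))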

71 We have Var_{t−1} ξ_t ≤ E_{t−1}[(f(w_t; Z_t) − f(w*; Z_t))²] ≤ E_{t−1}[L² ‖w_t − w*‖²] [Assumption LIST, part 1] = L² ‖w_t − w*‖². (2) [sent-139, score-0.025]

72 On the other hand, using part 2 of assumption LIST, we also have for any w, w′ ∈ S, (f(w; Z) + f(w′; Z))/2 ≥ f((w + w′)/2; Z) + (ν/8) ‖w − w′‖². Taking expectations, this gives, for any w, w′ ∈ S, (F(w) + F(w′))/2 ≥ F((w + w′)/2) + (ν/8) ‖w − w′‖². [sent-140, score-0.039]

73 Now using this with w = w_t, w′ = w*, we get (F(w_t) + F(w*))/2 ≥ F((w_t + w*)/2) + (ν/8) ‖w_t − w*‖² ≥ F(w*) + (ν/8) ‖w_t − w*‖², since w* minimizes F over S. [sent-142, score-1.672]

74 The proof of this lemma can be found in the appendix. [sent-145, score-0.132]

75 , X_T is a martingale difference sequence with |X_t| ≤ b. [sent-150, score-0.12]

76 Further, let σ := √( Σ_{t=1}^T Var_{t−1} X_t ). Then we have, for any δ < 1/e and T ≥ 3, Prob( Σ_{t=1}^T X_t > max{2σ, 3b √(ln(1/δ))} · √(ln(1/δ)) ) ≤ 4 ln(T) δ. [sent-156, score-0.041]

77 Therefore, Lemma 3 gives us that with probability at least 1 − 4 ln(T) δ, we have Σ_{t=1}^T ξ_t ≤ max{2σ, 6B √(ln(1/δ))} · √(ln(1/δ)). By definition of Reg_T, [sent-161, score-0.058]

78 Diff_T − Reg_T ≤ Σ_{t=1}^T ξ_t, and therefore, with probability 1 − 4 ln(T) δ, we have Diff_T − Reg_T ≤ max{4 √((L²/ν) Diff_T), 6B √(ln(1/δ))} · √(ln(1/δ)). [sent-162, score-0.041]

79 Using Lemma 4 below to solve the above quadratic inequality for Diff_T gives (1/T) Σ_{t=1}^T F(w_t) − F(w*) ≤ Reg_T/T + 4 √(L² ln(1/δ)/ν) · √(Reg_T)/T + max{16L²/ν, 6B} · ln(1/δ)/T. The following elementary lemma was required to solve a recursive inequality in the proof of the above theorem. [sent-163, score-0.3]

80 A loss function ℓ : D × Y → [0, 1] measures quality of predictions. [sent-177, score-0.103]

81 Fix a convex set S of some normed space and a function h : X × S → D. [sent-178, score-0.142]

82 Let our hypotheses class be {x → h(x; w) | w ∈ S}. [sent-179, score-0.023]

83 On input x, the hypothesis parameterized by w predicts h(x; w) and incurs loss ℓ(h(x; w), y) if the correct prediction is y. [sent-180, score-0.123]

84 The risk of w is defined by R(w) := E[ℓ(h(X; w), Y)], and let w* := arg min_{w∈S} R(w) denote the (parameter for) the hypothesis with minimum risk. [sent-181, score-0.033]

85 It is easy to see that this setting falls under the general framework given above by thinking of the pair (X, Y) as Z and setting f(w; Z) = f(w; (X, Y)) to be ℓ(h(X; w), Y). [sent-182, score-0.04]

86 The range of f is [0, 1] by our assumption about the loss functions so B = 1. [sent-184, score-0.142]

87 Suppose we run an online algorithm on our data that generates a sequence of hypotheses w0 , . [sent-185, score-0.188]

88 If we now choose α_0 = bc ln(1/δ) then α_j ≥ bc ln(1/δ) for all j, and hence every term in the above summation is bounded by exp( −c² ln(1/δ) / (2 + 2/3) ), which is less than δ if we choose c = 5/3. [sent-197, score-0.082]
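
The arithmetic behind the choice c = 5/3, spelled out under the reconstruction above:

\frac{c^2}{2 + 2/3} = \frac{(5/3)^2}{8/3} = \frac{25}{24} > 1
\quad\Longrightarrow\quad
\exp\!\Bigl(-\tfrac{c^2 \ln(1/\delta)}{2 + 2/3}\Bigr) \le \exp\bigl(-\ln(1/\delta)\bigr) = \delta .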

89 The assumption of the lemma implies that one of the following inequalities holds: s − r ≤ 6b∆², or s − r ≤ 4 √(ds) ∆. [sent-203, score-0.137]

90 This gives us s = (√s)² ≤ ( 2 √d ∆ + √(4d∆² + r) )² [using √(x + y) ≤ √x + √y]. Combining (7) and (8) finishes the proof. [sent-205, score-0.017]
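
For completeness, the remaining algebra expands the square and applies √(x + y) ≤ √x + √y once more; this is our reconstruction, as the intermediate steps are not in the excerpt:

s \;\le\; \bigl(2\sqrt{d}\,\Delta + \sqrt{4d\Delta^2 + r}\bigr)^2
 \;=\; r + 8d\Delta^2 + 4\sqrt{d}\,\Delta\,\sqrt{4d\Delta^2 + r}
 \;\le\; r + 4\sqrt{dr}\,\Delta + 16\,d\,\Delta^2 ,

which, together with the first case s − r ≤ 6b∆², gives s ≤ r + 4√(dr)∆ + max{16d, 6b}∆²; substituting d = L²/ν, b = B, and ∆ = √(ln(1/δ)) recovers the form used in the proof of the theorem.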


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('wt', 0.418), ('ln', 0.414), ('regt', 0.352), ('zt', 0.3), ('fz', 0.26), ('regret', 0.215), ('vart', 0.189), ('egasos', 0.162), ('xt', 0.162), ('prob', 0.152), ('online', 0.14), ('freedman', 0.118), ('convex', 0.118), ('strongly', 0.109), ('loss', 0.103), ('lemma', 0.098), ('conversions', 0.081), ('martingale', 0.077), ('chicago', 0.07), ('sharper', 0.067), ('list', 0.067), ('et', 0.065), ('batch', 0.062), ('gentile', 0.061), ('di', 0.059), ('cumulative', 0.059), ('yt', 0.058), ('lipschitz', 0.056), ('inequality', 0.055), ('cesabianchi', 0.054), ('dr', 0.046), ('tti', 0.043), ('hazan', 0.043), ('ratliff', 0.043), ('max', 0.041), ('tewari', 0.041), ('generalization', 0.04), ('convexity', 0.039), ('svm', 0.039), ('assumption', 0.039), ('zhang', 0.037), ('sham', 0.037), ('programming', 0.036), ('suppose', 0.035), ('minw', 0.034), ('proof', 0.034), ('risk', 0.033), ('conversion', 0.033), ('concentration', 0.032), ('learner', 0.032), ('il', 0.032), ('excess', 0.031), ('corollary', 0.03), ('bc', 0.03), ('bound', 0.029), ('concentrated', 0.029), ('ds', 0.029), ('fix', 0.028), ('algorithms', 0.027), ('measurable', 0.026), ('structured', 0.026), ('norm', 0.025), ('optimization', 0.025), ('sequence', 0.025), ('holds', 0.024), ('martingales', 0.024), ('nishes', 0.024), ('initiated', 0.024), ('normed', 0.024), ('theorem', 0.024), ('hypotheses', 0.023), ('providing', 0.023), ('bounded', 0.022), ('subsamples', 0.022), ('ambuj', 0.022), ('paradigm', 0.021), ('variance', 0.021), ('min', 0.021), ('provable', 0.02), ('separability', 0.02), ('feeding', 0.02), ('predicts', 0.02), ('setting', 0.02), ('tailored', 0.019), ('subdifferential', 0.019), ('littlestone', 0.019), ('examines', 0.019), ('guaranteed', 0.019), ('penalty', 0.019), ('minimize', 0.018), ('naively', 0.018), ('increment', 0.018), ('responds', 0.018), ('applicable', 0.018), ('difference', 0.018), ('mistake', 0.018), ('crammer', 0.018), ('regularizers', 0.018), ('gives', 0.017), ('kakade', 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms

Author: Sham M. Kakade, Ambuj Tewari

Abstract: This paper examines the generalization properties of online convex programming algorithms when the loss function is Lipschitz and strongly convex. Our main result is a sharp bound, that holds with high probability, on the excess risk of the output of an online algorithm in terms of the average regret. This allows one to use recent algorithms with logarithmic cumulative regret guarantees to achieve fast convergence rates for the excess risk with high probability. As a corollary, we characterize the convergence rate of Pegasos (with high probability), a recently proposed method for solving the SVM optimization problem. 1

2 0.31414804 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization

Author: Shai Shalev-shwartz, Sham M. Kakade

Abstract: We describe a primal-dual framework for the design and analysis of online strongly convex optimization algorithms. Our framework yields the tightest known logarithmic regret bounds for Follow-The-Leader and for the gradient descent algorithm proposed in Hazan et al. [2006]. We then show that one can interpolate between these two extreme cases. In particular, we derive a new algorithm that shares the computational simplicity of gradient descent but achieves lower regret in many practical situations. Finally, we further extend our framework for generalized strongly convex functions. 1

3 0.25092933 123 nips-2008-Linear Classification and Selective Sampling Under Low Noise Conditions

Author: Giovanni Cavallanti, Nicolò Cesa-bianchi, Claudio Gentile

Abstract: We provide a new analysis of an efficient margin-based algorithm for selective sampling in classification problems. Using the so-called Tsybakov low noise condition to parametrize the instance distribution, we show bounds on the convergence rate to the Bayes risk of both the fully supervised and the selective sampling versions of the basic algorithm. Our analysis reveals that, excluding logarithmic factors, the average risk of the selective sampler converges to the Bayes risk at rate N^{−(1+α)(2+α)/2(3+α)} where N denotes the number of queried labels, and α > 0 is the exponent in the low noise condition. For all α > √3 − 1 ≈ 0.73 this convergence rate is asymptotically faster than the rate N^{−(1+α)/(2+α)} achieved by the fully supervised version of the same classifier, which queries all labels, and for α → ∞ the two rates exhibit an exponential gap. Experiments on textual data reveal that simple variants of the proposed selective sampler perform much better than popular and similarly efficient competitors. 1

4 0.22535114 182 nips-2008-Posterior Consistency of the Silverman g-prior in Bayesian Model Choice

Author: Zhihua Zhang, Michael I. Jordan, Dit-Yan Yeung

Abstract: Kernel supervised learning methods can be unified by utilizing the tools from regularization theory. The duality between regularization and prior leads to interpreting regularization methods in terms of maximum a posteriori estimation and has motivated Bayesian interpretations of kernel methods. In this paper we pursue a Bayesian interpretation of sparsity in the kernel setting by making use of a mixture of a point-mass distribution and prior that we refer to as “Silverman’s g-prior.” We provide a theoretical analysis of the posterior consistency of a Bayesian model choice procedure based on this prior. We also establish the asymptotic relationship between this procedure and the Bayesian information criterion. 1

5 0.19634962 154 nips-2008-Nonparametric Bayesian Learning of Switching Linear Dynamical Systems

Author: Alan S. Willsky, Erik B. Sudderth, Michael I. Jordan, Emily B. Fox

Abstract: Many nonlinear dynamical phenomena can be effectively modeled by a system that switches among a set of conditionally linear dynamical modes. We consider two such models: the switching linear dynamical system (SLDS) and the switching vector autoregressive (VAR) process. Our nonparametric Bayesian approach utilizes a hierarchical Dirichlet process prior to learn an unknown number of persistent, smooth dynamical modes. We develop a sampling algorithm that combines a truncated approximation to the Dirichlet process with efficient joint sampling of the mode and state sequences. The utility and flexibility of our model are demonstrated on synthetic data, sequences of dancing honey bees, and the IBOVESPA stock index.

6 0.16331376 170 nips-2008-Online Optimization in X-Armed Bandits

7 0.15652716 161 nips-2008-On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization

8 0.15014872 88 nips-2008-From Online to Batch Learning with Cutoff-Averaging

9 0.14704333 214 nips-2008-Sparse Online Learning via Truncated Gradient

10 0.12909667 145 nips-2008-Multi-stage Convex Relaxation for Learning with Sparse Regularization

11 0.12560901 168 nips-2008-Online Metric Learning and Fast Similarity Search

12 0.11334088 22 nips-2008-An Online Algorithm for Maximizing Submodular Functions

13 0.10899372 177 nips-2008-Particle Filter-based Policy Gradient in POMDPs

14 0.10519888 112 nips-2008-Kernel Measures of Independence for non-iid Data

15 0.10274141 242 nips-2008-Translated Learning: Transfer Learning across Different Feature Spaces

16 0.10043177 120 nips-2008-Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text

17 0.096649192 195 nips-2008-Regularized Policy Iteration

18 0.095524073 199 nips-2008-Risk Bounds for Randomized Sample Compressed Classifiers

19 0.093439259 85 nips-2008-Fast Rates for Regularized Objectives

20 0.092758328 15 nips-2008-Adaptive Martingale Boosting


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.195), (1, 0.076), (2, -0.298), (3, -0.011), (4, -0.308), (5, 0.211), (6, -0.11), (7, 0.095), (8, 0.038), (9, -0.096), (10, -0.114), (11, 0.068), (12, 0.065), (13, 0.05), (14, 0.16), (15, -0.075), (16, -0.075), (17, 0.049), (18, -0.09), (19, -0.051), (20, -0.16), (21, 0.017), (22, 0.075), (23, 0.087), (24, -0.121), (25, -0.093), (26, -0.019), (27, 0.022), (28, 0.116), (29, 0.013), (30, -0.049), (31, -0.078), (32, 0.141), (33, -0.008), (34, -0.048), (35, 0.068), (36, 0.055), (37, 0.071), (38, -0.05), (39, 0.089), (40, 0.02), (41, -0.038), (42, -0.053), (43, -0.018), (44, 0.009), (45, -0.066), (46, 0.096), (47, 0.082), (48, 0.074), (49, -0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97538352 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms

Author: Sham M. Kakade, Ambuj Tewari

Abstract: This paper examines the generalization properties of online convex programming algorithms when the loss function is Lipschitz and strongly convex. Our main result is a sharp bound, that holds with high probability, on the excess risk of the output of an online algorithm in terms of the average regret. This allows one to use recent algorithms with logarithmic cumulative regret guarantees to achieve fast convergence rates for the excess risk with high probability. As a corollary, we characterize the convergence rate of Pegasos (with high probability), a recently proposed method for solving the SVM optimization problem. 1

2 0.72924984 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization

Author: Shai Shalev-shwartz, Sham M. Kakade

Abstract: We describe a primal-dual framework for the design and analysis of online strongly convex optimization algorithms. Our framework yields the tightest known logarithmic regret bounds for Follow-The-Leader and for the gradient descent algorithm proposed in Hazan et al. [2006]. We then show that one can interpolate between these two extreme cases. In particular, we derive a new algorithm that shares the computational simplicity of gradient descent but achieves lower regret in many practical situations. Finally, we further extend our framework for generalized strongly convex functions. 1

3 0.69068348 182 nips-2008-Posterior Consistency of the Silverman g-prior in Bayesian Model Choice

Author: Zhihua Zhang, Michael I. Jordan, Dit-Yan Yeung

Abstract: Kernel supervised learning methods can be unified by utilizing the tools from regularization theory. The duality between regularization and prior leads to interpreting regularization methods in terms of maximum a posteriori estimation and has motivated Bayesian interpretations of kernel methods. In this paper we pursue a Bayesian interpretation of sparsity in the kernel setting by making use of a mixture of a point-mass distribution and prior that we refer to as “Silverman’s g-prior.” We provide a theoretical analysis of the posterior consistency of a Bayesian model choice procedure based on this prior. We also establish the asymptotic relationship between this procedure and the Bayesian information criterion. 1

4 0.67629623 123 nips-2008-Linear Classification and Selective Sampling Under Low Noise Conditions

Author: Giovanni Cavallanti, Nicolò Cesa-bianchi, Claudio Gentile

Abstract: We provide a new analysis of an efficient margin-based algorithm for selective sampling in classification problems. Using the so-called Tsybakov low noise condition to parametrize the instance distribution, we show bounds on the convergence rate to the Bayes risk of both the fully supervised and the selective sampling versions of the basic algorithm. Our analysis reveals that, excluding logarithmic factors, the average risk of the selective sampler converges to the Bayes risk at rate N^{−(1+α)(2+α)/2(3+α)} where N denotes the number of queried labels, and α > 0 is the exponent in the low noise condition. For all α > √3 − 1 ≈ 0.73 this convergence rate is asymptotically faster than the rate N^{−(1+α)/(2+α)} achieved by the fully supervised version of the same classifier, which queries all labels, and for α → ∞ the two rates exhibit an exponential gap. Experiments on textual data reveal that simple variants of the proposed selective sampler perform much better than popular and similarly efficient competitors. 1

5 0.59845948 22 nips-2008-An Online Algorithm for Maximizing Submodular Functions

Author: Matthew Streeter, Daniel Golovin

Abstract: We present an algorithm for solving a broad class of online resource allocation problems. Our online algorithm can be applied in environments where abstract jobs arrive one at a time, and one can complete the jobs by investing time in a number of abstract activities, according to some schedule. We assume that the fraction of jobs completed by a schedule is a monotone, submodular function of a set of pairs (v, τ ), where τ is the time invested in activity v. Under this assumption, our online algorithm performs near-optimally according to two natural metrics: (i) the fraction of jobs completed within time T , for some fixed deadline T > 0, and (ii) the average time required to complete each job. We evaluate our algorithm experimentally by using it to learn, online, a schedule for allocating CPU time among solvers entered in the 2007 SAT solver competition. 1

6 0.54585129 168 nips-2008-Online Metric Learning and Fast Similarity Search

7 0.53935486 88 nips-2008-From Online to Batch Learning with Cutoff-Averaging

8 0.46963319 85 nips-2008-Fast Rates for Regularized Objectives

9 0.46005452 170 nips-2008-Online Optimization in X-Armed Bandits

10 0.44499069 154 nips-2008-Nonparametric Bayesian Learning of Switching Linear Dynamical Systems

11 0.43124327 214 nips-2008-Sparse Online Learning via Truncated Gradient

12 0.40716398 161 nips-2008-On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization

13 0.345155 15 nips-2008-Adaptive Martingale Boosting

14 0.32884523 145 nips-2008-Multi-stage Convex Relaxation for Learning with Sparse Regularization

15 0.30270964 13 nips-2008-Adapting to a Market Shock: Optimal Sequential Market-Making

16 0.30217108 239 nips-2008-Tighter Bounds for Structured Estimation

17 0.29576528 163 nips-2008-On the Efficient Minimization of Classification Calibrated Surrogates

18 0.27918103 177 nips-2008-Particle Filter-based Policy Gradient in POMDPs

19 0.26706651 112 nips-2008-Kernel Measures of Independence for non-iid Data

20 0.26237509 150 nips-2008-Near-optimal Regret Bounds for Reinforcement Learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(6, 0.216), (7, 0.052), (12, 0.077), (28, 0.126), (40, 0.123), (57, 0.029), (63, 0.031), (66, 0.102), (71, 0.068), (77, 0.024), (83, 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.85821885 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms

Author: Sham M. Kakade, Ambuj Tewari

Abstract: This paper examines the generalization properties of online convex programming algorithms when the loss function is Lipschitz and strongly convex. Our main result is a sharp bound, that holds with high probability, on the excess risk of the output of an online algorithm in terms of the average regret. This allows one to use recent algorithms with logarithmic cumulative regret guarantees to achieve fast convergence rates for the excess risk with high probability. As a corollary, we characterize the convergence rate of Pegasos (with high probability), a recently proposed method for solving the SVM optimization problem. 1

2 0.82246518 43 nips-2008-Cell Assemblies in Large Sparse Inhibitory Networks of Biologically Realistic Spiking Neurons

Author: Adam Ponzi, Jeff Wickens

Abstract: Cell assemblies exhibiting episodes of recurrent coherent activity have been observed in several brain regions including the striatum[1] and hippocampus CA3[2]. Here we address the question of how coherent dynamically switching assemblies appear in large networks of biologically realistic spiking neurons interacting deterministically. We show by numerical simulations of large asymmetric inhibitory networks with fixed external excitatory drive that if the network has intermediate to sparse connectivity, the individual cells are in the vicinity of a bifurcation between a quiescent and firing state and the network inhibition varies slowly on the spiking timescale, then cells form assemblies whose members show strong positive correlation, while members of different assemblies show strong negative correlation. We show that cells and assemblies switch between firing and quiescent states with time durations consistent with a power-law. Our results are in good qualitative agreement with the experimental studies. The deterministic dynamical behaviour is related to winner-less competition[3], shown in small closed loop inhibitory networks with heteroclinic cycles connecting saddle-points. 1

3 0.81960487 214 nips-2008-Sparse Online Learning via Truncated Gradient

Author: John Langford, Lihong Li, Tong Zhang

Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of online-learning algorithms with convex loss. This method has several essential properties. First, the degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. Second, the approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. We prove small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. Finally, the approach works well empirically. We apply it to several datasets and find for datasets with large numbers of features, substantial sparsity is discoverable. 1

4 0.81662142 74 nips-2008-Estimating the Location and Orientation of Complex, Correlated Neural Activity using MEG

Author: Julia Owen, Hagai T. Attias, Kensuke Sekihara, Srikantan S. Nagarajan, David P. Wipf

Abstract: The synchronous brain activity measured via MEG (or EEG) can be interpreted as arising from a collection (possibly large) of current dipoles or sources located throughout the cortex. Estimating the number, location, and orientation of these sources remains a challenging task, one that is significantly compounded by the effects of source correlations and the presence of interference from spontaneous brain activity, sensor noise, and other artifacts. This paper derives an empirical Bayesian method for addressing each of these issues in a principled fashion. The resulting algorithm guarantees descent of a cost function uniquely designed to handle unknown orientations and arbitrary correlations. Robust interference suppression is also easily incorporated. In a restricted setting, the proposed method is shown to have theoretically zero bias estimating both the location and orientation of multi-component dipoles even in the presence of correlations, unlike a variety of existing Bayesian localization methods or common signal processing techniques such as beamforming and sLORETA. Empirical results on both simulated and real data sets verify the efficacy of this approach. 1

5 0.80739486 65 nips-2008-Domain Adaptation with Multiple Sources

Author: Yishay Mansour, Mehryar Mohri, Afshin Rostamizadeh

Abstract: This paper presents a theoretical analysis of the problem of domain adaptation with multiple sources. For each source domain, the distribution over the input points as well as a hypothesis with error at most ǫ are given. The problem consists of combining these hypotheses to derive a hypothesis with small error with respect to the target domain. We present several theoretical results relating to this problem. In particular, we prove that standard convex combinations of the source hypotheses may in fact perform very poorly and that, instead, combinations weighted by the source distributions benefit from favorable theoretical guarantees. Our main result shows that, remarkably, for any fixed target function, there exists a distribution weighted combining rule that has a loss of at most ǫ with respect to any target mixture of the source distributions. We further generalize the setting from a single target function to multiple consistent target functions and show the existence of a combining rule with error at most 3ǫ. Finally, we report empirical results for a multiple source adaptation problem with a real-world dataset.

6 0.80073619 178 nips-2008-Performance analysis for L\ 2 kernel classification

7 0.79318744 85 nips-2008-Fast Rates for Regularized Objectives

8 0.77651858 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization

9 0.77183682 199 nips-2008-Risk Bounds for Randomized Sample Compressed Classifiers

10 0.73583198 245 nips-2008-Unlabeled data: Now it helps, now it doesn't

11 0.72525007 202 nips-2008-Robust Regression and Lasso

12 0.72005111 75 nips-2008-Estimating vector fields using sparse basis field expansions

13 0.7077136 88 nips-2008-From Online to Batch Learning with Cutoff-Averaging

14 0.7070756 62 nips-2008-Differentiable Sparse Coding

15 0.70018917 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost

16 0.69856417 228 nips-2008-Support Vector Machines with a Reject Option

17 0.69774282 182 nips-2008-Posterior Consistency of the Silverman g-prior in Bayesian Model Choice

18 0.69257319 145 nips-2008-Multi-stage Convex Relaxation for Learning with Sparse Regularization

19 0.69021863 91 nips-2008-Generative and Discriminative Learning with Unknown Labeling Bias

20 0.68556732 189 nips-2008-Rademacher Complexity Bounds for Non-I.I.D. Processes