nips nips2009 nips2009-22 knowledge-graph by maker-knowledge-mining

22 nips-2009-Accelerated Gradient Methods for Stochastic Optimization and Online Learning


Source: pdf

Author: Chonghai Hu, Weike Pan, James T. Kwok

Abstract: Regularized risk minimization often involves non-smooth optimization, either because of the loss function (e.g., hinge loss) or the regularizer (e.g., ℓ1 -regularizer). Gradient methods, though highly scalable and easy to implement, are known to converge slowly. In this paper, we develop a novel accelerated gradient method for stochastic optimization while still preserving their computational simplicity and scalability. The proposed algorithm, called SAGE (Stochastic Accelerated GradiEnt), exhibits fast convergence rates on stochastic composite optimization with convex or strongly convex objectives. Experimental results show that SAGE is faster than recent (sub)gradient methods including FOLOS, SMIDAS and SCD. Moreover, SAGE can also be extended for online learning, resulting in a simple algorithm but with the best regret bounds currently known for these problems. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract. Regularized risk minimization often involves non-smooth optimization, either because of the loss function (e.g., hinge loss) or the regularizer (e.g., ℓ1-regularizer). [sent-6, score-0.076]

2 In this paper, we develop a novel accelerated gradient method for stochastic optimization while still preserving the computational simplicity and scalability of gradient methods. [sent-12, score-0.301]

3 The proposed algorithm, called SAGE (Stochastic Accelerated GradiEnt), exhibits fast convergence rates on stochastic composite optimization with convex or strongly convex objectives. [sent-13, score-0.349]

4 Moreover, SAGE can also be extended for online learning, resulting in a simple algorithm but with the best regret bounds currently known for these problems. [sent-15, score-0.118]

5 This leads to the minimization of the regularized risk minw (1/m) Σ_{i=1}^{m} ℓ(w; xi, yi) + λΩ(w), (1) where λ is a regularization parameter. [sent-26, score-0.071]
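To make (1) concrete, here is a minimal Python sketch (ours, not from the paper) that evaluates the regularized risk with the square loss and the ℓ1-regularizer, the combination used in the experiments of Section 5; the function and variable names are our own.

import numpy as np

def regularized_risk(w, X, Y, lam):
    # Objective (1): (1/m) * sum_i loss(w; x_i, y_i) + lam * Omega(w),
    # instantiated with the square loss and Omega(w) = ||w||_1.
    residuals = X.dot(w) - Y
    return 0.5 * np.mean(residuals ** 2) + lam * np.sum(np.abs(w))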

6 In optimization terminology, the deterministic optimization problem in (1) can be considered as a sample average approximation (SAA) of the corresponding stochastic optimization problem: minw EXY[ℓ(w; X, Y)] + λΩ(w). (2) [sent-27, score-0.19]

7 Since both ℓ(·, ·) and Ω(·) are typically convex, (1) is a convex optimization problem which can be conveniently solved even with standard off-the-shelf optimization packages. [sent-28, score-0.134]

8 Indeed, even tailor-made software for specific models, such as the sequential minimal optimization (SMO) method for the SVM, has superlinear computational complexity and is thus not feasible for large data sets. [sent-31, score-0.069]

9 In light of this, the use of stochastic methods has recently drawn a lot of interest, and many of these are highly successful. [sent-32, score-0.076]

10 Most are based on (variants of) stochastic gradient descent (SGD). [sent-33, score-0.181]

11 Examples include Pegasos [1], SGD-QN [2], FOLOS [3], and stochastic coordinate descent (SCD) [4]. [sent-34, score-0.092]

12 While standard gradient schemes have a slow convergence rate, they can often be “accelerated”. [sent-38, score-0.134]

13 This stems from the pioneering work of Nesterov in 1983 [7], a deterministic accelerated method for smooth optimization. [sent-39, score-0.067]

14 Recently, it has also been extended to composite optimization, where the objective has a smooth component and a non-smooth component [8, 9]. [sent-40, score-0.15]

15 This is particularly relevant to machine learning since the loss ℓ and regularizer Ω in (2) may be non-smooth. [sent-41, score-0.061]

16 Examples include loss functions such as the commonly used hinge loss in the SVM, and regularizers such as the popular ℓ1 penalty in the Lasso [10] and basis pursuit. [sent-42, score-0.102]

17 These accelerated gradient methods have also been successfully applied in the optimization problems of multiple kernel learning [11] and trace norm minimization [12]. [sent-43, score-0.245]

18 Very recently, Lan [13] made an initial attempt to further extend this to stochastic composite optimization, and obtained the convergence rate O(L/N² + (M + σ)/√N). (3) [sent-44, score-0.186]

19 Here, N is the number of iterations performed by the algorithm, L is the Lipschitz parameter of the gradient of the smooth term in the objective, M is the Lipschitz parameter of the non-smooth term, and σ is the variance of the stochastic subgradient. [sent-45, score-0.244]

20 Moreover, note that the first term of (3) is related to the smooth component in the objective while the second term is related to the nonsmooth component. [sent-46, score-0.088]

21 Complexity results [14, 13] show that (3) is the optimal convergence rate for any iterative algorithm solving stochastic (general) convex composite optimization. [sent-47, score-0.256]

22 However, as pointed out in [15], a very useful property that can improve the convergence rates in machine learning optimization problems is strong convexity. [sent-48, score-0.086]

23 For example, (2) can be strongly convex because of the strong convexity of either ℓ or Ω. [sent-49, score-0.124]

24 On the other hand, [13] is more interested in general convex optimization problems and so strong convexity is not utilized. [sent-54, score-0.142]

25 Moreover, though theoretically interesting, [13] may be of limited practical use as (1) the stepsize in its update rule depends on the often unknown σ; and (2) the number of iterations performed by the algorithm has to be fixed in advance. [sent-55, score-0.076]

26 Inspired by the successes of Nesterov’s method, we develop in this paper a novel accelerated subgradient scheme for stochastic composite optimization. [sent-56, score-0.292]

27 It achieves the optimal convergence rate of O(L/N² + σ/√N) for general convex objectives, and O((L + µ)/N² + σ/(µN)) for µ-strongly convex objectives. [sent-57, score-0.16]

28 Finally, we also extend the accelerated gradient scheme to online learning. [sent-59, score-0.213]

29 We obtain O(√N) regret for general convex problems and O(log N) regret for strongly convex problems, which are the best regret bounds currently known for these problems. [sent-60, score-0.328]

30 2 Setting and Mathematical Background. First, we recapitulate a few notions in convex analysis. [sent-61, score-0.058]

31 [14] The gradient of a differentiable function f(x) is Lipschitz continuous with Lipschitz parameter L if, for any x and y, f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖x − y‖². (4) [sent-64, score-0.103]

32 (Strong convexity) A function φ(x) is µ-strongly convex if φ(y) ≥ φ(x) + ⟨g(x), y − x⟩ + (µ/2)‖y − x‖² for any x, y and subgradient g(x) ∈ ∂φ(x). [sent-65, score-0.11]

33 [14] Let φ(x) be µ-strongly convex and x∗ = arg minx φ(x). Then φ(x) ≥ φ(x∗) + (µ/2)‖x − x∗‖². (5) [sent-67, score-0.114]

34 We consider the following stochastic convex optimization problem with a composite objective function: minx {φ(x) ≡ E[F(x, ξ)] + ψ(x)}, (6) where ξ is a random vector, f(x) ≡ E[F(x, ξ)] is convex and differentiable, and ψ(x) is convex but non-smooth. [sent-69, score-0.442]

35 Moreover, we assume that the gradient of f (x) is L-Lipschitz and φ(x) is µ-strongly convex (with µ ≥ 0). [sent-71, score-0.147]

36 Recall that in smooth optimization, the gradient update xt+1 = xt − λ∇f (xt ) on a function f (x) can be seen as proximal regularization of the linearized f at the current iterate xt [16]. [sent-73, score-0.673]

37 In other words, xt+1 = arg minx (⟨∇f(xt), x − xt⟩ + (1/(2λ))‖x − xt‖²). [sent-74, score-0.552]
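As a quick sanity check (ours, not from the paper), setting the gradient of this proximal objective to zero recovers the plain gradient step:

\[ \nabla f(x_t) + \tfrac{1}{\lambda}\,(x_{t+1} - x_t) = 0 \;\Longrightarrow\; x_{t+1} = x_t - \lambda\,\nabla f(x_t). \]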

38 (Gradient mapping) [8] In minimizing f(x) + ψ(x), where f is convex and differentiable and ψ is convex and non-smooth, xt+1 = arg minx ⟨∇f(xt), x − xt⟩ + (1/(2λ))‖x − xt‖² + ψ(x) (7) is called the generalized gradient update, and δ = (1/λ)(xt − xt+1) is the (generalized) gradient mapping. [sent-76, score-0.842]

39 It can be shown that the gradient mapping is analogous to the gradient in smooth convex optimization [14, 8]. [sent-78, score-0.318]

40 This is also a common construct used in recent stochastic subgradient methods [3, 17]. [sent-79, score-0.128]
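As a concrete illustration (ours, under the assumption ψ(x) = λ‖x‖₁), the generalized gradient update (7) separates across coordinates and reduces to soft thresholding of a plain gradient step; the names soft_threshold and generalized_gradient_update below are our own.

import numpy as np

def soft_threshold(v, tau):
    # Closed-form solution of min_x 0.5*||x - v||^2 + tau*||x||_1.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def generalized_gradient_update(x_t, grad, step, lam):
    # Generalized gradient update (7) with psi(x) = lam*||x||_1:
    #   argmin_x <grad, x - x_t> + (1/(2*step))*||x - x_t||^2 + lam*||x||_1,
    # which is soft thresholding applied to the gradient step x_t - step*grad.
    return soft_threshold(x_t - step * grad, step * lam)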

41 3 Accelerated Gradient Method for Stochastic Learning. Let G(xt, ξt) ≡ ∇x F(x, ξt)|x=xt be the stochastic gradient of F(x, ξt). [sent-80, score-0.165]

42 We assume that it is an unbiased estimator of the gradient ∇f(x), i.e., Eξ[G(x, ξ)] = ∇f(x). [sent-81, score-0.089]

43 Note that yt is the generalized gradient update, and xt+1 is a convex combination of yt and zt . [sent-86, score-1.569]

44 Note that the only expensive step of Algorithm 1 is the computation of the generalized gradient update yt , which is analogous to the subgradient computation in other subgradient-based methods. [sent-90, score-0.641]

45 As will be discussed in Section 3.3, this can often be efficiently obtained in many regularized risk minimization problems. [sent-93, score-0.071]

46 yt = arg minx ⟨G(xt, ξt), x − xt⟩ + (Lt/2)‖x − xt‖² + ψ(x). [sent-99, score-0.987]

47 zt = zt−1 − (Lt αt + µ)⁻¹[Lt(xt − yt) + µ(zt−1 − xt)]. [sent-100, score-1.215]
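Putting the updates together, here is a minimal, self-contained sketch of Algorithm 1 for the square-loss/ℓ1 setting used later in the experiments; the schedules for Lt and αt below are placeholder assumptions (the paper's theorems prescribe specific sequences not reproduced here), and all names are ours.

import numpy as np

def sage_l1(X, Y, lam, L, mu=0.0, n_iters=1000, seed=0):
    # Sketch of SAGE (Algorithm 1) for f = E[square loss], psi = lam*||.||_1.
    rng = np.random.default_rng(seed)
    m, d = X.shape
    y = np.zeros(d)  # y_{t-1}
    z = np.zeros(d)  # z_{t-1}
    for t in range(n_iters):
        L_t = L + (t + 1) ** 1.5  # assumed schedule, kept above L
        a_t = 2.0 / (t + 2)       # assumed schedule, in (0, 1]
        x = (1.0 - a_t) * y + a_t * z            # x_t: convex combination
        i = rng.integers(m)                      # draw xi_t: one random example
        g = (X[i].dot(x) - Y[i]) * X[i]          # stochastic gradient G(x_t, xi_t)
        v = x - g / L_t                          # y_t: generalized gradient update,
        y = np.sign(v) * np.maximum(np.abs(v) - lam / L_t, 0.0)  # i.e. soft thresholding
        z = z - (L_t * a_t + mu) ** -1 * (L_t * (x - y) + mu * (z - x))  # z_t update
    return y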

48 Let δt ≡ Lt (xt − yt ) be the gradient mapping involved in updating yt . [sent-106, score-0.972]

49 Lemma 3. For t ≥ 0, φ(x) is quadratically bounded from below as φ(x) ≥ φ(yt) + ⟨δt, x − xt⟩ + (µ/2)‖x − xt‖² + ((2Lt − L)/(2Lt²))‖δt‖² + ⟨∆t, yt − x⟩. [sent-109, score-0.718]

50 Proposition 1. [sent-110, score-0.248]

51 Assume that for each t ≥ 0, ‖∆t‖∗ ≤ σ and Lt > L. Then φ(yt) − φ(x) + ((Lt αt² + µαt)/2)‖x − zt‖² ≤ (1 − αt)[φ(yt−1) − φ(x)] + (Lt αt²/2)‖x − zt−1‖² + σ²/(2(Lt − L)) + αt⟨∆t, x − zt−1⟩. [sent-111, score-0.532]

52 Define Vt(x) = ⟨δt, x − xt⟩ + (µ/2)‖x − xt‖² + (Lt αt/2)‖x − zt−1‖². [sent-113, score-0.496]

53 It is easy to see that zt = arg minx∈Rd Vt(x). [sent-114, score-0.55]

54 Hence, on applying Lemmas 2 and 3, we obtain that for any x, Vt(zt) ≤ Vt(x) − ((Lt αt + µ)/2)‖x − zt‖² = ⟨δt, x − xt⟩ + (µ/2)‖x − xt‖² + (Lt αt/2)‖x − zt−1‖² − ((Lt αt + µ)/2)‖x − zt‖² ≤ φ(x) − φ(yt) − ((2Lt − L)/(2Lt²))‖δt‖² + (Lt αt/2)‖x − zt−1‖² − ((Lt αt + µ)/2)‖x − zt‖² + ⟨∆t, x − yt⟩. [sent-116, score-1.56]

55 Then, φ(yt) can be bounded from above as φ(yt) ≤ φ(x) + ⟨δt, xt − zt⟩ − ((2Lt − L)/(2Lt²))‖δt‖² − (Lt αt/2)‖zt − zt−1‖² + (Lt αt/2)‖x − zt−1‖² − ((Lt αt + µ)/2)‖x − zt‖² + ⟨∆t, x − yt⟩, (9) where the non-positive term −(µ/2)‖zt − xt‖² has been dropped from its right-hand side. [sent-117, score-3.093]

56 On the other hand, by applying Lemma 3 with x = yt−1, we get φ(yt) − φ(yt−1) ≤ ⟨δt, xt − yt−1⟩ + ⟨∆t, yt−1 − yt⟩ − ((2Lt − L)/(2Lt²))‖δt‖², (10) where the non-positive term −(µ/2)‖yt−1 − xt‖² has also been dropped from the right-hand side. [sent-118, score-0.946]

57 First, by using the update rule of xt in Algorithm 1 and Young's inequality, we have φ(yt) − φ(x) ≤ (1 − αt)[φ(yt−1) − φ(x)] + A + B plus the quadratic terms above, where A = ⟨δt, αt(xt − zt−1) + (1 − αt)(xt − yt−1)⟩ + αt⟨δt, zt−1 − zt⟩ = αt⟨δt, zt−1 − zt⟩ ≤ (Lt αt²/2)‖zt − zt−1‖² + ‖δt‖²/(2Lt). (12) [sent-121, score-1.891]

58 On the other hand, B can be bounded as B = ⟨∆t, αt x + (1 − αt)yt−1 − xt⟩ + ⟨∆t, xt − yt⟩ = αt⟨∆t, x − zt−1⟩ + ⟨∆t, δt/Lt⟩ ≤ αt⟨∆t, x − zt−1⟩ + (σ/Lt)‖δt‖, (13) where the second equality is due to the update rule of xt, and the last step is from the Cauchy-Schwarz inequality and the boundedness of ∆t. [sent-122, score-1.272]

59 Thus, Eξ[t]⟨∆t, x∗ − zt−1⟩ = Eξ[t−1][Eξt⟨∆t, x∗ − zt−1⟩ | ξ[t−1]] = Eξ[t−1]⟨x∗ − zt−1, Eξt[∆t]⟩ = 0, where the first equality uses Ex[h(x)] = Ey Ex[h(x)|y], and the last equality is from our assumption that the stochastic gradient G(x, ξ) is unbiased. [sent-132, score-0.191]

60 ((Lt αt² + µαt)/2) E[‖x∗ − zt‖²] ≤ (Lt αt²/2) E[‖x∗ − zt−1‖²] + σ²/(2(Lt − L)). [sent-135, score-0.532]

61 As is shown in the following theorem, the convergence rate can be further improved by assuming strong convexity. [sent-144, score-0.059]

62 The resulting bound (17) is of order O((L + µ)/N² + σ/(µN)); in comparison, FOLOS only converges as O(log(N)/N) for strongly convex objectives. [sent-150, score-0.078]

63 3.2 Remarks. As in recent studies on stochastic composite optimization [13], the error bounds in (15) and (17) consist of two terms: a faster term related to the smooth component and a slower term related to the non-smooth component. [sent-152, score-0.25]

64 SAGE benefits from using the structure of the problem and accelerates the convergence of the smooth component. [sent-153, score-0.077]

65 On the other hand, many stochastic (sub)gradient-based algorithms like FOLOS do not separate the smooth from the non-smooth part, but simply treat the whole objective as non-smooth. [sent-154, score-0.132]

66 Consequently, convergence of the smooth component is also slowed down to O(1/√N). [sent-155, score-0.091]

67 As can be seen from (15) and (17), the convergence of SAGE is essentially encumbered by the variance of the stochastic subgradient. [sent-156, score-0.109]

68 This is because in SAGE, the output yt is obtained from a generalized gradient update. [sent-164, score-0.544]

69 3.3 Efficient Computation of yt. The computational efficiency of Algorithm 1 hinges on the efficient computation of yt. [sent-169, score-0.87]

70 Recall that yt is just the generalized gradient update, and so is not significantly more expensive than the gradient update in traditional algorithms. [sent-170, score-0.678]

71 Indeed, the generalized gradient update is often a central component in various optimization and machine learning algorithms. [sent-171, score-0.194]

72 4 Accelerated Gradient Method for Online Learning. In this section, we extend the proposed accelerated gradient scheme to the online learning of (2). [sent-174, score-0.213]

73 The algorithm, shown in Algorithm 2, is similar to the stochastic version in Algorithm 1. [sent-175, score-0.076]

74 Output yt = arg minx ⟨∇ft−1(xt), x − xt⟩ + (Lt/2)‖x − xt‖² + ψ(x). [sent-180, score-0.987]

75 zt = zt−1 − αt(Lt + µαt)⁻¹[Lt(xt − yt) + µ(zt−1 − xt)]. [sent-181, score-1.215]
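For contrast, a sketch (ours) of one round of Algorithm 2; the only differences from Algorithm 1 are that the gradient of the previously revealed loss ft−1 replaces the stochastic gradient, and the zt update carries the extra αt factor shown above. Here prox_psi stands for the generalized gradient update of ψ (e.g., the soft thresholding sketched earlier); all names are our own.

def online_sage_step(y_prev, z_prev, grad_f, prox_psi, L_t, a_t, mu=0.0):
    # One round of Algorithm 2 (illustrative, not the authors' implementation).
    # grad_f: gradient of the previously revealed smooth loss f_{t-1}.
    # prox_psi(v, step): generalized gradient update for psi at point v.
    x = (1.0 - a_t) * y_prev + a_t * z_prev          # x_t
    y = prox_psi(x - grad_f(x) / L_t, 1.0 / L_t)     # y_t
    z = z_prev - a_t / (L_t + mu * a_t) * (L_t * (x - y) + mu * (z_prev - x))  # z_t
    return x, y, z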

76 First, we introduce the following lemma, which plays a role similar to that of its stochastic counterpart, Lemma 3. [sent-182, score-0.088]

77 Moreover, let δt ≡ Lt (xt − yt ) be the gradient mapping related to the updating of yt . [sent-183, score-0.972]

78 For t > 1, φt−1(x) can be quadratically bounded from below as φt−1(x) ≥ φt−1(yt) + ⟨δt, x − xt⟩ + (µ/2)‖x − xt‖² + ((2Lt − L)/(2Lt²))‖δt‖². [sent-185, score-0.531]

79 Then for Algorithm 2, φt−1(yt−1) − φt−1(x) ≤ Q̂²/(2(1 − αt)(Lt − L)) + (Lt/(2αt))‖x − zt−1‖² − ((Lt + µαt)/(2αt))‖x − zt‖² + (((1 − αt)Lt − αt(1 − αt)L)/2)‖yt−1 − zt−1‖² − (Lt/2)‖zt − yt‖². (18) [sent-188, score-1.064]

80 From the update rule of zt, one can check that zt = arg minx Vt(x) ≡ ⟨δt, x − xt⟩ + (µ/2)‖x − xt‖² + (τt/2)‖x − zt−1‖², (19) where τt ≡ Lt/αt. [sent-191, score-1.625]

81 On the other hand, ⟨δt, xt − zt⟩ − ‖δt‖²/(2Lt) = (Lt/2)(‖zt − xt‖² − ‖zt − yt‖²) ≤ (Lt/(2αt))‖zt − zt−1‖² + (Lt(1 − αt)/2)‖zt−1 − yt−1‖² − (Lt/2)‖zt − yt‖², (20) on using the convexity of ‖·‖². [sent-193, score-4.057]

82 Using (20), the inequality (19) becomes φt−1(yt) − φt−1(x) ≤ (Lt(1 − αt)/2)‖zt−1 − yt−1‖² − (Lt/2)‖zt − yt‖² − ((Lt − L)/(2Lt²))‖δt‖² + (τt/2)‖x − zt−1‖² − ((τt + µ)/2)‖x − zt‖². (21) [sent-194, score-1.078]

83 On the other hand, by the convexity of φt−1(x) and Young's inequality, we have φt−1(yt−1) − φt−1(yt) ≤ ⟨∇ft−1(yt−1) + gt−1(yt−1), yt−1 − yt⟩ ≤ Q̂²/(2(1 − αt)(Lt − L)) + ((1 − αt)(Lt − L)/2)‖yt−1 − yt‖². (22) [sent-195, score-0.901]

84 Moreover, by using the update rule of xt and the convexity of ‖·‖², we have ‖yt−1 − yt‖² = ‖(yt−1 − xt) + (xt − yt)‖² = ‖αt(yt−1 − zt−1) + (xt − yt)‖² ≤ αt‖yt−1 − zt−1‖² + (1 − αt)⁻¹‖xt − yt‖² = αt‖yt−1 − zt−1‖² + ‖δt‖²/((1 − αt)Lt²). (23) [sent-196, score-2.562]

85 Then the regret of Algorithm 2 can be bounded as Σ_{t=1}^{N} [φt(yt) − φt(x∗)] ≤ LD² + (Q̂² + LD²)√N. [sent-202, score-0.075]

86 Then the regret of Algorithm 2 can be bounded as Σ_{t=1}^{N} [φt(yt) − φt(x∗)] ≤ ((2a + a⁻¹)µ + L)D² + Q̂² log(N + 1). [sent-206, score-0.075]

87 5 Experiments. In this section, we perform experiments on the stochastic optimization of (2). [sent-208, score-0.114]

88 We choose the square loss for ℓ(·, ·) and the ℓ1 regularizer for Ω(·) in (2). [sent-211, score-0.061]

89 As discussed in Section 3.3 and [3], the generalized gradient update can be efficiently computed by soft thresholding in this case. [sent-213, score-0.142]

90 This is used in all the algorithms except SCD, since SCD is based on coordinate descent and is quite different from the other stochastic subgradient algorithms. [sent-235, score-0.144]

91 Moreover, the additional cost of maintaining xt and zt is small, and the most expensive step in each SAGE iteration is computing the generalized gradient update. [sent-244, score-0.901]

92 Hence, its per-iteration complexity is comparable with the other (sub)gradient schemes, and its convergence in terms of the number of data access operations is still the fastest (Figures 1(b), 1(c), 1(f) and 1(g)). [sent-245, score-0.06]

[Figure 1 residue: panels plotting the ℓ1-regularized loss, the error (%), and the density of w against the number of data accesses (×10⁶) for SAGE, FOLOS, SMIDAS, and SCD.]

95 6 Conclusion. In this paper, we developed a novel accelerated gradient method (SAGE) for stochastic convex composite optimization. [sent-273, score-0.387]

96 Moreover, SAGE can also be extended to online learning, obtaining the best regret bounds currently known. [sent-276, score-0.106]

97 Hence, SCD is not shown in the plots of the regularized loss versus the number of iterations. [sent-279, score-0.071]

98 A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). [sent-315, score-0.089]

99 Mind the duality gap: Logarithmic regret algorithms for online optimization. [sent-362, score-0.082]

100 Mirror descent and nonlinear projected subgradient methods for convex optimization. [sent-368, score-0.126]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('zt', 0.532), ('lt', 0.474), ('yt', 0.435), ('sage', 0.303), ('xt', 0.248), ('scd', 0.152), ('folos', 0.144), ('smidas', 0.139), ('accelerated', 0.098), ('gradient', 0.089), ('stochastic', 0.076), ('composite', 0.066), ('convex', 0.058), ('regret', 0.056), ('subgradient', 0.052), ('sub', 0.052), ('accesses', 0.051), ('smooth', 0.044), ('loss', 0.038), ('minx', 0.038), ('optimization', 0.038), ('vt', 0.036), ('pegasos', 0.034), ('pcmac', 0.033), ('convergence', 0.033), ('regularized', 0.033), ('update', 0.033), ('moreover', 0.031), ('convexity', 0.031), ('lipschitz', 0.03), ('online', 0.026), ('accessesx', 0.025), ('exy', 0.025), ('nesterov', 0.024), ('regularizer', 0.023), ('kong', 0.023), ('young', 0.023), ('hong', 0.022), ('strongly', 0.02), ('lemma', 0.02), ('ft', 0.02), ('minimization', 0.02), ('generalized', 0.02), ('bounded', 0.019), ('risk', 0.018), ('arg', 0.018), ('nonsmooth', 0.018), ('duchi', 0.017), ('figures', 0.017), ('iterations', 0.017), ('descent', 0.016), ('beck', 0.016), ('quadratically', 0.016), ('yn', 0.016), ('strong', 0.015), ('dropped', 0.015), ('montreal', 0.015), ('operations', 0.015), ('regularizers', 0.014), ('inequality', 0.014), ('component', 0.014), ('differentiable', 0.014), ('rule', 0.014), ('sequences', 0.014), ('hand', 0.013), ('ji', 0.013), ('updating', 0.013), ('equality', 0.013), ('access', 0.012), ('hinge', 0.012), ('algorithm', 0.012), ('bounds', 0.012), ('hence', 0.012), ('objective', 0.012), ('slow', 0.012), ('expensive', 0.012), ('ex', 0.012), ('loop', 0.012), ('currently', 0.012), ('rate', 0.011), ('proliferation', 0.011), ('administrative', 0.011), ('bordes', 0.011), ('catholic', 0.011), ('cauchyschwartz', 0.011), ('corvalis', 0.011), ('doklady', 0.011), ('hangzhou', 0.011), ('kowloon', 0.011), ('kwok', 0.011), ('las', 0.011), ('linearized', 0.011), ('nevada', 0.011), ('nowadays', 0.011), ('pioneering', 0.011), ('quebec', 0.011), ('seattle', 0.011), ('softwares', 0.011), ('trillions', 0.011), ('vegas', 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 22 nips-2009-Accelerated Gradient Methods for Stochastic Optimization and Online Learning

Author: Chonghai Hu, Weike Pan, James T. Kwok

Abstract: Regularized risk minimization often involves non-smooth optimization, either because of the loss function (e.g., hinge loss) or the regularizer (e.g., ℓ1 -regularizer). Gradient methods, though highly scalable and easy to implement, are known to converge slowly. In this paper, we develop a novel accelerated gradient method for stochastic optimization while still preserving their computational simplicity and scalability. The proposed algorithm, called SAGE (Stochastic Accelerated GradiEnt), exhibits fast convergence rates on stochastic composite optimization with convex or strongly convex objectives. Experimental results show that SAGE is faster than recent (sub)gradient methods including FOLOS, SMIDAS and SCD. Moreover, SAGE can also be extended for online learning, resulting in a simple algorithm but with the best regret bounds currently known for these problems. 1

2 0.2483559 257 nips-2009-White Functionals for Anomaly Detection in Dynamical Systems

Author: Marco Cuturi, Jean-philippe Vert, Alexandre D'aspremont

Abstract: We propose new methodologies to detect anomalies in discrete-time processes taking values in a probability space. These methods are based on the inference of functionals whose evaluations on successive states visited by the process are stationary and have low autocorrelations. Deviations from this behavior are used to flag anomalies. The candidate functionals are estimated in a subspace of a reproducing kernel Hilbert space associated with the original probability space considered. We provide experimental results on simulated datasets which show that these techniques compare favorably with other algorithms.

3 0.24435893 177 nips-2009-On Learning Rotations

Author: Raman Arora

Abstract: An algorithm is presented for online learning of rotations. The proposed algorithm involves matrix exponentiated gradient updates and is motivated by the von Neumann divergence. The multiplicative updates are exponentiated skew-symmetric matrices which comprise the Lie algebra of the rotation group. The orthonormality and unit determinant of the matrix parameter are preserved using matrix logarithms and exponentials and the algorithm lends itself to intuitive interpretation in terms of the differential geometry of the manifold associated with the rotation group. A complexity reduction result is presented that exploits the eigenstructure of the matrix updates to simplify matrix exponentiation to a quadratic form. 1

4 0.21854854 215 nips-2009-Sensitivity analysis in HMMs with application to likelihood maximization

Author: Pierre-arnaud Coquelin, Romain Deguest, Rémi Munos

Abstract: This paper considers a sensitivity analysis in Hidden Markov Models with continuous state and observation spaces. We propose an Infinitesimal Perturbation Analysis (IPA) on the filtering distribution with respect to some parameters of the model. We describe a methodology for using any algorithm that estimates the filtering density, such as Sequential Monte Carlo methods, to design an algorithm that estimates its gradient. The resulting IPA estimator is proven to be asymptotically unbiased, consistent and has computational complexity linear in the number of particles. We consider an application of this analysis to the problem of identifying unknown parameters of the model given a sequence of observations. We derive an IPA estimator for the gradient of the log-likelihood, which may be used in a gradient method for the purpose of likelihood maximization. We illustrate the method with several numerical experiments.

5 0.19367459 217 nips-2009-Sharing Features among Dynamical Systems with Beta Processes

Author: Alan S. Willsky, Erik B. Sudderth, Michael I. Jordan, Emily B. Fox

Abstract: We propose a Bayesian nonparametric approach to the problem of modeling related time series. Using a beta process prior, our approach is based on the discovery of a set of latent dynamical behaviors that are shared among multiple time series. The size of the set and the sharing pattern are both inferred from data. We develop an efficient Markov chain Monte Carlo inference method that is based on the Indian buffet process representation of the predictive distribution of the beta process. In particular, our approach uses the sum-product algorithm to efficiently compute Metropolis-Hastings acceptance probabilities, and explores new dynamical behaviors via birth/death proposals. We validate our sampling algorithm using several synthetic datasets, and also demonstrate promising results on unsupervised segmentation of visual motion capture data.

6 0.18028453 202 nips-2009-Regularized Distance Metric Learning:Theory and Algorithm

7 0.17239231 27 nips-2009-Adaptive Regularization of Weight Vectors

8 0.16518956 178 nips-2009-On Stochastic and Worst-case Models for Investing

9 0.15591118 220 nips-2009-Slow Learners are Fast

10 0.12785064 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization

11 0.11002272 57 nips-2009-Conditional Random Fields with High-Order Features for Sequence Labeling

12 0.099710561 154 nips-2009-Modeling the spacing effect in sequential category learning

13 0.084817186 45 nips-2009-Beyond Convexity: Online Submodular Minimization

14 0.065452255 228 nips-2009-Speeding up Magnetic Resonance Image Acquisition by Bayesian Multi-Slice Adaptive Compressed Sensing

15 0.065279149 172 nips-2009-Nonparametric Bayesian Texture Learning and Synthesis

16 0.060839504 76 nips-2009-Efficient Learning using Forward-Backward Splitting

17 0.060300998 63 nips-2009-DUOL: A Double Updating Approach for Online Learning

18 0.055362675 101 nips-2009-Generalization Errors and Learning Curves for Regression with Multi-task Gaussian Processes

19 0.054271393 33 nips-2009-Analysis of SVM with Indefinite Kernels

20 0.053465545 11 nips-2009-A General Projection Property for Distribution Families


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.143), (1, 0.197), (2, 0.109), (3, -0.082), (4, 0.31), (5, 0.181), (6, 0.069), (7, 0.099), (8, -0.018), (9, -0.025), (10, 0.155), (11, 0.107), (12, -0.062), (13, -0.03), (14, -0.12), (15, -0.23), (16, -0.084), (17, -0.086), (18, 0.055), (19, 0.045), (20, 0.109), (21, 0.062), (22, -0.084), (23, 0.092), (24, 0.011), (25, 0.073), (26, 0.008), (27, -0.068), (28, 0.045), (29, 0.005), (30, 0.031), (31, -0.04), (32, -0.104), (33, 0.15), (34, 0.013), (35, -0.156), (36, 0.11), (37, -0.063), (38, 0.035), (39, 0.011), (40, 0.038), (41, -0.115), (42, -0.017), (43, -0.093), (44, -0.025), (45, -0.025), (46, -0.103), (47, 0.025), (48, 0.015), (49, 0.06)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97272736 22 nips-2009-Accelerated Gradient Methods for Stochastic Optimization and Online Learning

Author: Chonghai Hu, Weike Pan, James T. Kwok

Abstract: Regularized risk minimization often involves non-smooth optimization, either because of the loss function (e.g., hinge loss) or the regularizer (e.g., ℓ1 -regularizer). Gradient methods, though highly scalable and easy to implement, are known to converge slowly. In this paper, we develop a novel accelerated gradient method for stochastic optimization while still preserving their computational simplicity and scalability. The proposed algorithm, called SAGE (Stochastic Accelerated GradiEnt), exhibits fast convergence rates on stochastic composite optimization with convex or strongly convex objectives. Experimental results show that SAGE is faster than recent (sub)gradient methods including FOLOS, SMIDAS and SCD. Moreover, SAGE can also be extended for online learning, resulting in a simple algorithm but with the best regret bounds currently known for these problems. 1

2 0.69358307 177 nips-2009-On Learning Rotations

Author: Raman Arora

Abstract: An algorithm is presented for online learning of rotations. The proposed algorithm involves matrix exponentiated gradient updates and is motivated by the von Neumann divergence. The multiplicative updates are exponentiated skew-symmetric matrices which comprise the Lie algebra of the rotation group. The orthonormality and unit determinant of the matrix parameter are preserved using matrix logarithms and exponentials and the algorithm lends itself to intuitive interpretation in terms of the differential geometry of the manifold associated with the rotation group. A complexity reduction result is presented that exploits the eigenstructure of the matrix updates to simplify matrix exponentiation to a quadratic form. 1

3 0.6912185 27 nips-2009-Adaptive Regularization of Weight Vectors

Author: Koby Crammer, Alex Kulesza, Mark Dredze

Abstract: We present AROW, a new online learning algorithm that combines several useful properties: large margin training, confidence weighting, and the capacity to handle non-separable data. AROW performs adaptive regularization of the prediction function upon seeing each new instance, allowing it to perform especially well in the presence of label noise. We derive a mistake bound, similar in form to the second order perceptron bound, that does not assume separability. We also relate our algorithm to recent confidence-weighted online learning techniques and show empirically that AROW achieves state-of-the-art performance and notable robustness in the case of non-separable data. 1

4 0.67313063 215 nips-2009-Sensitivity analysis in HMMs with application to likelihood maximization

Author: Pierre-arnaud Coquelin, Romain Deguest, Rémi Munos

Abstract: This paper considers a sensitivity analysis in Hidden Markov Models with continuous state and observation spaces. We propose an Infinitesimal Perturbation Analysis (IPA) on the filtering distribution with respect to some parameters of the model. We describe a methodology for using any algorithm that estimates the filtering density, such as Sequential Monte Carlo methods, to design an algorithm that estimates its gradient. The resulting IPA estimator is proven to be asymptotically unbiased, consistent and has computational complexity linear in the number of particles. We consider an application of this analysis to the problem of identifying unknown parameters of the model given a sequence of observations. We derive an IPA estimator for the gradient of the log-likelihood, which may be used in a gradient method for the purpose of likelihood maximization. We illustrate the method with several numerical experiments.

5 0.62926197 257 nips-2009-White Functionals for Anomaly Detection in Dynamical Systems

Author: Marco Cuturi, Jean-philippe Vert, Alexandre D'aspremont

Abstract: We propose new methodologies to detect anomalies in discrete-time processes taking values in a probability space. These methods are based on the inference of functionals whose evaluations on successive states visited by the process are stationary and have low autocorrelations. Deviations from this behavior are used to flag anomalies. The candidate functionals are estimated in a subspace of a reproducing kernel Hilbert space associated with the original probability space considered. We provide experimental results on simulated datasets which show that these techniques compare favorably with other algorithms.

6 0.52772468 178 nips-2009-On Stochastic and Worst-case Models for Investing

7 0.51829374 220 nips-2009-Slow Learners are Fast

8 0.42907408 217 nips-2009-Sharing Features among Dynamical Systems with Beta Processes

9 0.36807895 154 nips-2009-Modeling the spacing effect in sequential category learning

10 0.3659977 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization

11 0.34271282 202 nips-2009-Regularized Distance Metric Learning:Theory and Algorithm

12 0.30167919 63 nips-2009-DUOL: A Double Updating Approach for Online Learning

13 0.29870766 172 nips-2009-Nonparametric Bayesian Texture Learning and Synthesis

14 0.29749179 57 nips-2009-Conditional Random Fields with High-Order Features for Sequence Labeling

15 0.2881189 146 nips-2009-Manifold Regularization for SIR with Rate Root-n Convergence

16 0.28752396 73 nips-2009-Dual Averaging Method for Regularized Stochastic Learning and Online Optimization

17 0.27237681 76 nips-2009-Efficient Learning using Forward-Backward Splitting

18 0.26168579 94 nips-2009-Fast Learning from Non-i.i.d. Observations

19 0.26067498 33 nips-2009-Analysis of SVM with Indefinite Kernels

20 0.24778754 45 nips-2009-Beyond Convexity: Online Submodular Minimization


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(24, 0.043), (25, 0.049), (35, 0.03), (36, 0.141), (37, 0.218), (39, 0.038), (51, 0.013), (53, 0.013), (58, 0.154), (61, 0.04), (71, 0.072), (86, 0.044), (91, 0.01)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.86501169 235 nips-2009-Structural inference affects depth perception in the context of potential occlusion

Author: Ian Stevenson, Konrad Koerding

Abstract: In many domains, humans appear to combine perceptual cues in a near-optimal, probabilistic fashion: two noisy pieces of information tend to be combined linearly with weights proportional to the precision of each cue. Here we present a case where structural information plays an important role. The presence of a background cue gives rise to the possibility of occlusion, and places a soft constraint on the location of a target - in effect propelling it forward. We present an ideal observer model of depth estimation for this situation where structural or ordinal information is important and then fit the model to human data from a stereo-matching task. To test whether subjects are truly using ordinal cues in a probabilistic manner we then vary the uncertainty of the task. We find that the model accurately predicts shifts in subject’s behavior. Our results indicate that the nervous system estimates depth ordering in a probabilistic fashion and estimates the structure of the visual scene during depth perception. 1

same-paper 2 0.85073966 22 nips-2009-Accelerated Gradient Methods for Stochastic Optimization and Online Learning

Author: Chonghai Hu, Weike Pan, James T. Kwok

Abstract: Regularized risk minimization often involves non-smooth optimization, either because of the loss function (e.g., hinge loss) or the regularizer (e.g., ℓ1 -regularizer). Gradient methods, though highly scalable and easy to implement, are known to converge slowly. In this paper, we develop a novel accelerated gradient method for stochastic optimization while still preserving their computational simplicity and scalability. The proposed algorithm, called SAGE (Stochastic Accelerated GradiEnt), exhibits fast convergence rates on stochastic composite optimization with convex or strongly convex objectives. Experimental results show that SAGE is faster than recent (sub)gradient methods including FOLOS, SMIDAS and SCD. Moreover, SAGE can also be extended for online learning, resulting in a simple algorithm but with the best regret bounds currently known for these problems. 1

3 0.83979464 234 nips-2009-Streaming k-means approximation

Author: Nir Ailon, Ragesh Jaiswal, Claire Monteleoni

Abstract: We provide a clustering algorithm that approximately optimizes the k-means objective, in the one-pass streaming setting. We make no assumptions about the data, and our algorithm is very light-weight in terms of memory, and computation. This setting is applicable to unsupervised learning on massive data sets, or resource-constrained devices. The two main ingredients of our theoretical work are: a derivation of an extremely simple pseudo-approximation batch algorithm for k-means (based on the recent k-means++), in which the algorithm is allowed to output more than k centers, and a streaming clustering algorithm in which batch clustering algorithms are performed on small inputs (fitting in memory) and combined in a hierarchical manner. Empirical evaluations on real and simulated data reveal the practical utility of our method. 1

4 0.82104945 12 nips-2009-A Generalized Natural Actor-Critic Algorithm

Author: Tetsuro Morimura, Eiji Uchibe, Junichiro Yoshimoto, Kenji Doya

Abstract: Policy gradient Reinforcement Learning (RL) algorithms have received substantial attention, seeking stochastic policies that maximize the average (or discounted cumulative) reward. In addition, extensions based on the concept of the Natural Gradient (NG) show promising learning efficiency because these regard metrics for the task. Though there are two candidate metrics, Kakade’s Fisher Information Matrix (FIM) for the policy (action) distribution and Morimura’s FIM for the stateaction joint distribution, but all RL algorithms with NG have followed Kakade’s approach. In this paper, we describe a generalized Natural Gradient (gNG) that linearly interpolates the two FIMs and propose an efficient implementation for the gNG learning based on a theory of the estimating function, the generalized Natural Actor-Critic (gNAC) algorithm. The gNAC algorithm involves a near optimal auxiliary function to reduce the variance of the gNG estimates. Interestingly, the gNAC can be regarded as a natural extension of the current state-of-the-art NAC algorithm [1], as long as the interpolating parameter is appropriately selected. Numerical experiments showed that the proposed gNAC algorithm can estimate gNG efficiently and outperformed the NAC algorithm.

5 0.78666568 231 nips-2009-Statistical Models of Linear and Nonlinear Contextual Interactions in Early Visual Processing

Author: Ruben Coen-cagli, Peter Dayan, Odelia Schwartz

Abstract: A central hypothesis about early visual processing is that it represents inputs in a coordinate system matched to the statistics of natural scenes. Simple versions of this lead to Gabor–like receptive fields and divisive gain modulation from local surrounds; these have led to influential neural and psychological models of visual processing. However, these accounts are based on an incomplete view of the visual context surrounding each point. Here, we consider an approximate model of linear and non–linear correlations between the responses of spatially distributed Gaborlike receptive fields, which, when trained on an ensemble of natural scenes, unifies a range of spatial context effects. The full model accounts for neural surround data in primary visual cortex (V1), provides a statistical foundation for perceptual phenomena associated with Li’s (2002) hypothesis that V1 builds a saliency map, and fits data on the tilt illusion. 1

6 0.71097368 126 nips-2009-Learning Bregman Distance Functions and Its Application for Semi-Supervised Clustering

7 0.70941985 36 nips-2009-Asymptotic Analysis of MAP Estimation via the Replica Method and Compressed Sensing

8 0.70025259 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models

9 0.69979346 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms

10 0.69436723 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

11 0.69184476 254 nips-2009-Variational Gaussian-process factor analysis for modeling spatio-temporal data

12 0.6911782 202 nips-2009-Regularized Distance Metric Learning:Theory and Algorithm

13 0.68930703 76 nips-2009-Efficient Learning using Forward-Backward Splitting

14 0.68923461 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output

15 0.68837142 73 nips-2009-Dual Averaging Method for Regularized Stochastic Learning and Online Optimization

16 0.6849246 252 nips-2009-Unsupervised Feature Selection for the $k$-means Clustering Problem

17 0.68487704 79 nips-2009-Efficient Recovery of Jointly Sparse Vectors

18 0.68278348 215 nips-2009-Sensitivity analysis in HMMs with application to likelihood maximization

19 0.68070763 217 nips-2009-Sharing Features among Dynamical Systems with Beta Processes

20 0.68016213 180 nips-2009-On the Convergence of the Concave-Convex Procedure