nips nips2012 nips2012-324 knowledge-graph by maker-knowledge-mining

324 nips-2012-Stochastic Gradient Descent with Only One Projection


Source: pdf

Author: Mehrdad Mahdavi, Tianbao Yang, Rong Jin, Shenghuo Zhu, Jinfeng Yi

Abstract: Although many variants of stochastic gradient descent have been proposed for large-scale convex optimization, most of them require projecting the solution at each iteration to ensure that the obtained solution stays within the feasible domain. For complex domains (e.g., positive semidefinite cone), the projection step can be computationally expensive, making stochastic gradient descent unattractive for large-scale optimization problems. We address this limitation by developing novel stochastic optimization algorithms that do not need intermediate projections. Instead, only one projection at the last iteration is needed to obtain a feasible solution in the given domain. Our theoretical analysis shows that with a high probability, the proposed algorithms achieve an O(1/√T) convergence rate for general convex optimization, and an O(ln T /T ) rate for strongly convex optimization under mild conditions about the domain and the objective function.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Although many variants of stochastic gradient descent have been proposed for large-scale convex optimization, most of them require projecting the solution at each iteration to ensure that the obtained solution stays within the feasible domain. [sent-5, score-0.658]

2 For complex domains (e.g., positive semidefinite cone), the projection step can be computationally expensive, making stochastic gradient descent unattractive for large-scale optimization problems. [sent-8, score-0.54]

3 We address this limitation by developing novel stochastic optimization algorithms that do not need intermediate projections. [sent-9, score-0.232]

4 Instead, only one projection at the last iteration is needed to obtain a feasible solution in the given domain. [sent-10, score-0.259]

5 Our theoretical analysis shows that with a high probability, the proposed algorithms achieve an O(1/√T) convergence rate for general convex optimization, and an O(ln T /T ) rate for strongly convex optimization under mild conditions about the domain and the objective function. [sent-11, score-0.83]

6 To find a solution within the domain K that optimizes the given objective function f (x), SGD computes an unbiased estimate of the gradient of f (x), and updates the solution by moving it in the opposite direction of the estimated gradient. [sent-15, score-0.313]

7 To ensure that the solution stays within the domain K, SGD has to project the updated solution back into K at every iteration. [sent-16, score-0.172]

8 Efficient algorithms have been developed for projecting solutions onto certain special domains (e.g., the simplex). [sent-17, score-0.137]

9 The central theme of this paper is to develop an SGD-based method that does not require a projection at each iteration. [sent-21, score-0.191]

10 But one main shortcoming of the algorithm proposed in [10] is that it has a slower convergence rate (i.e., O(T −1/3 )). [sent-23, score-0.163]

11 In this work, we demonstrate that a properly modified SGD algorithm can achieve the optimal convergence rate of O(T −1/2 ) using only ONE projection for general stochastic convex optimization problems. [sent-28, score-0.697]

12 We further develop an SGD-based algorithm for strongly convex optimization that achieves a convergence rate of O(ln T /T ), which is only a logarithmic factor worse than the optimal rate [9]. [sent-29, score-0.615]

13 The key idea of both algorithms is to appropriately penalize the intermediate solutions when they are outside the domain. [sent-30, score-0.14]

14 With an appropriate design of the penalization mechanism, the average solution x̄T obtained by SGD after T iterations will be very close to the domain K, even without intermediate projections. [sent-31, score-0.191]

15 As a result, the final feasible solution x̃T can be obtained by projecting x̄T into the domain K, the only projection that is needed for the entire algorithm. [sent-32, score-0.352]

16 We note that our approach is very different from previous efforts in developing projection-free convex optimization algorithms (see [8, 12, 11] and references therein), where the key idea is to develop appropriate updating procedures to restore the feasibility of solutions at every iteration. [sent-33, score-0.578]

17 The proposed algorithm achieves the optimal convergence rate of O(1/√T) with only one projection; • We propose a stochastic gradient descent algorithm for strongly convex optimization that constructs the penalty function using a smoothing technique. [sent-35, score-0.868]

18 This algorithm attains an O(ln T /T ) convergence rate with only one projection. [sent-36, score-0.162]

19 2 Related Works. Generally, the computational complexity of the projection step in SGD has seldom been taken into account in the literature. [sent-37, score-0.156]

20 Here, we briefly review previous work on projection-free convex optimization, which is closely related to the theme of this study. [sent-38, score-0.375]

21 For some specific domains, efficient algorithms have been developed to circumvent the high computational cost caused by the projection step at each iteration of gradient descent methods. [sent-39, score-0.355]

22 Clarkson [5] proposed a sparse greedy approximation algorithm for convex optimization over a simplex domain, which is a generalization of an old algorithm by Frank and Wolfe [7] (a.k.a. the conditional gradient method). [sent-41, score-0.35]

23 Zhang [21] introduced a similar sequential greedy approximation algorithm for certain convex optimization problems over a domain given by a convex hull. [sent-44, score-0.536]

24 Hazan [8] devised an algorithm for approximately maximizing a concave function over a trace norm bounded PSD cone, which only needs to compute the maximum eigenvalue and the corresponding eigenvector of a symmetric matrix. [sent-45, score-0.168]

25 Recently, Jaggi [11] put these ideas into a general framework for convex optimization with a general convex domain. [sent-48, score-0.415]

26 Instead of projecting the intermediate solution into a complex convex domain, Jaggi’s algorithm solves a linearized problem over the same domain. [sent-49, score-0.344]

27 It is a projection-free online learning algorithm, built on the assumption that it is possible to efficiently minimize a linear function over the complex domain. [sent-53, score-0.263]

28 One main shortcoming of the OFW algorithm is that its convergence rate for general stochastic optimization is O(T −1/3 ), significantly slower than that of a standard stochastic gradient descent algorithm (i.e., O(T −1/2 )). [sent-54, score-0.681]

29 It achieves a convergence rate of O(T −1/2 ) only when the objective function is smooth, which unfortunately does not hold for many machine learning problems where either a non-smooth regularizer or a non-smooth loss function is used. [sent-57, score-0.178]

30 Another limitation of OFW is that it assumes a linear optimization problem over the domain K can be solved efficiently. [sent-58, score-0.155]

31 3 Preliminaries. Throughout this paper, we consider the following convex optimization problem: min_{x∈K} f(x), (1) where K is a bounded convex domain. [sent-62, score-0.486]

32 We assume that K can be characterized by an inequality constraint and, without loss of generality, is bounded by the unit ball, i.e., [sent-63, score-0.135]

33 K = {x ∈ R^d : g(x) ≤ 0} ⊆ B = {x ∈ R^d : ‖x‖₂ ≤ 1}, (2) where g(x) is a convex constraint function. [sent-65, score-0.203]

34 Note that when a domain is characterized by multiple convex constraint functions, say gi(x) ≤ 0, i = 1, . . . , m, they can be absorbed into a single constraint g(x) = max_{1≤i≤m} gi(x), as sketched below. [sent-71, score-0.294]
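
The reduction above in code (a minimal sketch; the helper name g_combined and the callable-list interface are ours, not the paper's):

```python
def g_combined(x, gs):
    """Absorb m constraints g_i(x) <= 0 into the single constraint
    g(x) = max_i g_i(x): the max of convex functions is convex, and a
    subgradient of g at x is a subgradient of any g_i attaining the max."""
    return max(gi(x) for gi in gs)
```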

35 To solve the optimization problem in (1), we assume that the only information available to the algorithm is through a stochastic oracle that provides unbiased estimates of the gradient of f (x). [sent-75, score-0.381]

36 At each iteration t, given the solution xt, the oracle returns ∇f(xt; ξt), an unbiased estimate of the true gradient ∇f(xt), i.e., Eξt[∇f(xt; ξt)] = ∇f(xt). [sent-82, score-0.901]
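
For concreteness, a minimal stochastic oracle of this kind, using least squares f(x) = ‖Ax − b‖²/(2n) as an illustrative objective (our choice of f, not one from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 20))   # synthetic data, for illustration only
b = rng.normal(size=1000)

def grad_f(x):
    """Unbiased stochastic gradient of f(x) = ||Ax - b||^2 / (2n):
    sample one row index i (the randomness xi_t) and return
    a_i * (a_i^T x - b_i); its expectation over i is the full gradient."""
    i = rng.integers(len(b))
    return A[i] * (A[i] @ x - b[i])
```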

37 Before proceeding, we recall a few definitions from convex analysis [17]. [sent-86, score-0.164]

38 In particular, a convex function f(x) with a bounded (sub)gradient, ‖∂f(x)‖∗ ≤ G, is G-Lipschitz continuous, where ‖·‖∗ is the dual norm to ‖·‖. [sent-91, score-0.277]

39 In the sequel, we use the standard Euclidean norm to define Lipschitz and strongly convex functions. [sent-97, score-0.282]

40 The stochastic gradient descent method is an iterative algorithm that produces a sequence of solutions xt, t = 1, . . . , T, by stepping along the negative stochastic gradient ∇f(xt; ξt) and projecting the result back onto K. [sent-98, score-0.921]
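
As a baseline, a minimal sketch of this standard projected SGD (the 1/√t step-size decay and the iterate averaging are our illustrative choices):

```python
import numpy as np

def projected_sgd(grad_f, project_K, x0, T, eta=0.1):
    """Standard SGD: x_{t+1} = Pi_K(x_t - eta_t * grad_f(x_t; xi_t)).
    project_K is invoked at EVERY iteration; this per-step projection
    is exactly the cost the paper's algorithms remove."""
    x = x0.copy()
    x_sum = np.zeros_like(x0)
    for t in range(1, T + 1):
        x = project_K(x - (eta / np.sqrt(t)) * grad_f(x))
        x_sum += x
    return x_sum / T    # average iterate
```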

41 For general convex optimization, stochastic gradient descent methods can obtain an O(1/√T) convergence rate, in expectation or with high probability, provided (5) holds [16]. [sent-102, score-0.578]

42 As we mentioned in the Introduction section, SGD methods are computationally efficient only when the projection ΠK (x) can be carried out efficiently. [sent-103, score-0.156]

43 The objective of this work is to develop computationally efficient stochastic optimization algorithms that are able to yield the same performance guarantee as the standard SGD algorithm but with only ONE projection when applied to the problem in (1). [sent-104, score-0.394]

44 4 Algorithms and Main Results. We now turn to extending the SGD method to the setting where only one projection is allowed for the entire sequence of updates. [sent-105, score-0.177]

45 The main idea is to incorporate the constraint function g(x) into the objective function to penalize the intermediate solutions that are outside the domain. [sent-106, score-0.201]

46 A projection is performed at the end of the iterations to restore the feasibility of the average solution. [sent-108, score-0.247]

47 Algorithm 1: for t = 1, . . . , T do
4: Compute x′t+1 = xt − ηt (∇f(xt, ξt) + λt ∇g(xt))
5: Update xt+1 = x′t+1 / max(‖x′t+1‖₂, 1)
6: Update λt+1 = [(1 − γηt)λt + ηt g(xt)]+
7: end for
8: Output: x̃T = ΠK(x̄T), where x̄T = Σ_{t=1}^T xt / T. [sent-112, score-1.36]
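
A runnable sketch of Algorithm 1 (the signature is ours; the paper sets the constant step size ηt and the parameter γ from problem constants such as G2, which we leave to the caller):

```python
import numpy as np

def sgd_one_projection(grad_f, g, grad_g, project_K, x0, T, eta, gamma):
    """Primal-dual SGD with a single projection (a sketch of Algorithm 1).
    grad_f(x) returns a stochastic gradient of f; g and grad_g define the
    constraint g(x) <= 0; project_K is called exactly once, at the end."""
    x = x0.copy()
    lam = 0.0
    x_sum = np.zeros_like(x0)
    for _ in range(T):
        gx = g(x)
        # Step 4: gradient step on the penalized objective f + lambda * g
        x_next = x - eta * (grad_f(x) + lam * grad_g(x))
        # Step 5: cheap rescaling keeps the iterate inside the unit ball B
        x_next /= max(np.linalg.norm(x_next), 1.0)
        # Step 6: dual ascent on lambda with gamma-shrinkage, clipped at 0
        lam = max((1.0 - gamma * eta) * lam + eta * gx, 0.0)
        x = x_next
        x_sum += x
    # Step 8: the only projection, applied once to the average iterate
    return project_K(x_sum / T)
```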

48 The key ingredient of the proposed algorithms is to replace the projection step with the gradient computation of the constraint function defining the domain K, which is significantly cheaper than the projection step. [sent-113, score-0.551]

49 For the PSD cone, i.e., X ⪰ 0 where X is a symmetric matrix, the corresponding inequality constraint is g(X) = λmax(−X) ≤ 0, where λmax(X) computes the largest eigenvalue of X and is a convex function. [sent-116, score-0.291]

50 In this case, computing ∇g(X) only requires the eigenvector associated with the minimum eigenvalue of X, which is cheaper than the full eigenspectrum computation required at each iteration of the standard SGD algorithm to restore feasibility. [sent-117, score-0.164]
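
A small numpy sketch of this constraint and one of its subgradients (numpy.linalg.eigh computes the full spectrum for simplicity of exposition; in practice a single extreme eigenpair, e.g. from a Lanczos-type solver, suffices):

```python
import numpy as np

def g_psd(X):
    """g(X) = lambda_max(-X) = -lambda_min(X); g(X) <= 0 iff X is PSD."""
    return -np.linalg.eigvalsh(X)[0]   # eigvalsh returns ascending order

def grad_g_psd(X):
    """A subgradient of g at X: with v the eigenvector of X for its
    minimum eigenvalue, -v v^T is a subgradient of lambda_max(-X)."""
    _, V = np.linalg.eigh(X)
    v = V[:, 0]
    return -np.outer(v, v)
```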

51 Below, we state a few assumptions about f(x) and g(x) often made in stochastic optimization: A1 ‖∇f(x)‖₂ ≤ G1, ‖∇g(x)‖₂ ≤ G2, |g(x)| ≤ C2, ∀x ∈ B; A2 Eξt[exp(‖∇f(x, ξt) − ∇f(x)‖₂² / σ²)] ≤ exp(1), ∀x ∈ B. [sent-118, score-0.221]

52 We also make the following mild assumption about the boundary of the convex domain K: A3 there exists a constant ρ > 0 such that min_{g(x)=0} ‖∇g(x)‖₂ ≥ ρ. [sent-119, score-0.333]

53 The purpose of introducing assumption A3 is to ensure that the optimal dual variable for the constrained optimization problem in (1) is well bounded from above, a key factor in our analysis. [sent-121, score-0.214]

54 To see this, we write the problem in (1) as a convex-concave optimization problem: min_{x∈B} max_{λ≥0} f(x) + λg(x). [sent-122, score-0.135]

55 Let (x∗, λ∗) be the optimal solution to the above convex-concave optimization problem. [sent-123, score-0.152]

56 We propose two different ways of incorporating the constraint function into the objective function, which result in two algorithms, one for general convex and the other for strongly convex functions. [sent-130, score-0.477]

57 Instead of solving the constrained optimization problem in (1), we try to solve the following convex-concave optimization problem: min_{x∈B} max_{λ≥0} L(x, λ). [sent-135, score-0.222]
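
The extracted sentences never spell out L(x, λ), but a form consistent with the dual update in step 6 of Algorithm 1, which is a projected gradient-ascent step in λ, would be the γ-regularized Lagrangian below (our reconstruction, not a quote from the paper):

```latex
\mathcal{L}(x,\lambda) \;=\; f(x) \;+\; \lambda\, g(x) \;-\; \frac{\gamma}{2}\,\lambda^{2},
\qquad x \in B,\ \lambda \ge 0 .
```

Indeed, λ + ηt ∂L/∂λ = λ + ηt (g(x) − γλ) = (1 − γηt)λ + ηt g(x), which matches step 6 after clipping at zero.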

58 It differs from existing stochastic gradient descent methods in that it updates both the primal variable x (steps 4 and 5) and the dual variable λ (step 6), which share the same step sizes. [sent-137, score-0.372]

59 Notably, a similar primal-dual update is explored in [15] to avoid projections in online learning. [sent-139, score-0.228]

60 Our work differs from [15] in that their algorithm and analysis only lead to a bound for the regret and the violation of the constraints in the long run, which does not necessarily guarantee the feasibility of the final solution. [sent-140, score-0.16]

61 Our proof techniques also differ from [16], where the convergence rate is obtained for the saddle point; our goal, in contrast, is to attain a bound on the convergence of the primal feasible solution. [sent-141, score-0.345]

62 This is because, in order to obtain a regret of O(√T), we need to set γ = Ω(√T), which unfortunately will lead to a blowup of the gradients and consequently a poor regret bound. [sent-145, score-0.194]

63 Using a primal-dual updating schema allows us to adjust the penalization term more carefully to obtain an O(1/√T) convergence rate. [sent-146, score-0.154]

64 4.2 SGD with One Projection for Strongly Convex Optimization. We first emphasize that it is difficult to extend Algorithm 1 to achieve an O(ln T /T ) convergence rate for strongly convex optimization. [sent-150, score-0.389]

65 This is because although −L(x, λ) is strongly convex in λ, its modulus for strong convexity is γ, which is too small to obtain an O(ln T ) regret bound. [sent-151, score-0.404]

66 To achieve a faster convergence rate for strongly convex optimization, we change assumptions A1 and A2 to A4 ‖∇f(x, ξt)‖₂ ≤ G1, ‖∇g(x)‖₂ ≤ G2, ∀x ∈ B, where we slightly abuse the same notation G1. [sent-152, score-0.419]

67 Note that A1 only requires that ‖∇f(x)‖₂ is bounded and A2 assumes a mild condition on the stochastic gradient. [sent-153, score-0.178]

68 In contrast, for strongly convex optimization we need to assume a bound on the stochastic gradient ‖∇f(x, ξt)‖₂. [sent-154, score-0.567]

69 According to the discussion in the last subsection, we know that the optimal dual variable λ∗ is upper bounded by G1 /ρ, and consequently is upper bounded by λ0 . [sent-157, score-0.152]

70 Similar to the last approach, we write the optimization problem (1) as an equivalent convex-concave optimization problem: min_{x∈K} f(x) = min_{x∈B} max_{0≤λ≤λ0} f(x) + λg(x) = min_{x∈B} f(x) + λ0 [g(x)]+. [sent-158, score-0.274]

71 Algorithm 2: for t = 1, . . . , T do
4: Compute x′t+1 = xt − ηt (∇f(xt, ξt) + λ0 ∇g(xt) · exp(λ0 g(xt)/γ) / (1 + exp(λ0 g(xt)/γ)))
5: Update xt+1 = x′t+1 / max(‖x′t+1‖₂, 1)
6: end for
7: Output: x̃T = ΠK(x̄T), where x̄T = Σ_{t=1}^T xt / T. [sent-164, score-1.367]
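
A runnable sketch of Algorithm 2 (the sigmoid weight exp(z)/(1 + exp(z)) is evaluated in a numerically safer form, and the clipping of z is our addition):

```python
import numpy as np

def sgd_one_projection_strongly_convex(grad_f, g, grad_g, project_K,
                                       x0, T, beta, gamma, lam0):
    """One-projection SGD for beta-strongly convex f (sketch of Algorithm 2).
    Only the primal variable is updated; the constraint enters through the
    smoothed penalty gamma * ln(1 + exp(lam0 * g(x) / gamma))."""
    x = x0.copy()
    x_sum = np.zeros_like(x0)
    for t in range(1, T + 1):
        eta = 1.0 / (2.0 * beta * t)                # eta_t = 1/(2 beta t)
        z = np.clip(lam0 * g(x) / gamma, -50, 50)   # avoid overflow in exp
        w = 1.0 / (1.0 + np.exp(-z))                # = exp(z) / (1 + exp(z))
        x = x - eta * (grad_f(x) + lam0 * w * grad_g(x))
        x /= max(np.linalg.norm(x), 1.0)            # stay in the unit ball B
        x_sum += x
    return project_K(x_sum / T)                     # the single projection
```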

72 Unlike Algorithm 1, only the primal variable x is updated in each iteration using the stochastic gradient computed in (11). [sent-167, score-0.289]

73 F(x) = f(x) + γ ln(1 + exp(λ0 g(x)/γ)). The following theorem shows that Algorithm 2 achieves an O(ln T /T ) convergence rate if the cost functions are strongly convex. [sent-168, score-0.275]
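
Written out, this smoothed penalty sandwiches the exact penalty λ0[g(x)]+ to within an additive γ ln 2; these are precisely the two bounds the proof below uses around equations (15) and (16):

```latex
\lambda_0\,[g(x)]_{+}
\;\le\;
\gamma \ln\!\bigl(1 + \exp(\lambda_0\, g(x)/\gamma)\bigr)
\;\le\;
\lambda_0\,[g(x)]_{+} \;+\; \gamma \ln 2 .
```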

74 For any β-strongly convex function f (x), if we set ηt = 1/(2βt), t = 1, . [sent-170, score-0.164]

75 . . . , T, γ = ln T /T, and λ0 > G1/ρ in Algorithm 2, then under assumptions A3 and A4 we have, with probability at least 1 − δ, f(x̃T) ≤ min_{x∈K} f(x) + O(ln T /T), where O(·) suppresses polynomial factors that depend on ln(1/δ), 1/β, G1, G2, ρ, and λ0. [sent-173, score-0.826]

76 It is well known that the optimal convergence rate of SGD for strongly convex optimization is O(1/T ) [9], which has been proven to be tight in the stochastic optimization setting [1]. [sent-174, score-0.691]

77 According to Theorem 2, Algorithm 2 achieves an almost optimal convergence rate except for the factor of ln T . [sent-175, score-0.531]

78 It is worth mentioning that although it is not explicitly given in Theorem 2, the detailed expression for the convergence rate of Algorithm 2 exhibits a tradeoff in setting λ0 (more can be found in the proof of Theorem 2). [sent-176, score-0.164]

79 Finally, under assumptions A1-A3, Algorithm 2 can achieve an O(1/√T) convergence rate for general convex functions, similar to Algorithm 1. [sent-177, score-0.331]

80 The lemma below states two key inequalities, which follow the standard analysis of gradient descent. [sent-183, score-0.156]

81 For any general convex function f(x), if we set ηt = γ/(2G2²), t = 1, · · · , T, we have Σ_{t=1}^T (f(xt) − f(x∗)) + [Σ_{t=1}^T g(xt)]² / (2(γT + 2G2²/γ)) ≤ (G1² + (G2 + C2)²) γT / (2G2²), where x∗ = arg min_{x∈K} f(x). [sent-188, score-0.164]

82 Recalling the definition of x̄T = Σ_{t=1}^T xt /T and using the convexity of f(x) and g(x), we have f(x̄T) − f(x∗) + √T [g(x̄T)]² ≤ O(1/√T). [sent-195, score-0.71]

83 If we only update the primal variable using the penalized objective in (10), whose gradient depends on 1/γ, it will cause a blowup of order (1/γ + γT + T /γ) in the regret bound, which is non-convergent. [sent-208, score-0.323]

84 Proof of Theorem 2. Our proof of Theorem 2 for the convergence rate of Algorithm 2 when applied to strongly convex functions starts with the following lemma, by analogy with Lemma 2. [sent-210, score-0.47]

85 For any β-strongly convex function f(x), if we set ηt = 1/(2βt), we have Σ_{t=1}^T (F(xt) − F(x∗)) ≤ (G1² + λ0² G2²)(1 + ln T)/(2β) + Σ_{t=1}^T ζt(x∗) − (β/4) Σ_{t=1}^T ‖x∗ − xt‖₂², where x∗ = arg min_{x∈K} f(x). [sent-212, score-1.184]

86 We have Pr( DT ≤ 4/T or ΛT ≤ 4G1 √(DT ln(m/δ)) + 4G1 ln(m/δ) ) ≥ 1 − δ, where DT = Σ_{t=1}^T ‖xt − x‖₂² and ΛT = Σ_{t=1}^T ζt(x). [sent-216, score-1.371]

87 As a result, in the first case (DT ≤ 4/T) we have Σ_{t=1}^T ζt(x∗) = Σ_{t=1}^T (∇f(xt) − ∇f(xt, ξt))ᵀ(x∗ − xt) ≤ 2G1 √(T DT) ≤ 4G1, which together with the inequality in Lemma 3 leads to the bound Σ_{t=1}^T (F(xt) − F(x∗)) ≤ 4G1 + (G1² + λ0² G2²)(1 + ln T)/(2β). [sent-221, score-1.093]

88 In the second case, we assume Σ_{t=1}^T ζt(x∗) ≤ 4G1 √(DT ln(m/δ)) + 4G1 ln(m/δ) ≤ (β/4) DT + (16G1²/β) ln(m/δ) + 4G1 ln(m/δ), where the last step uses the fact 2ab ≤ a² + b². [sent-222, score-0.702]

89 We thus have Σ_{t=1}^T (F(xt) − F(x∗)) ≤ (16G1²/β) ln(m/δ) + 4G1 ln(m/δ) + (G1² + λ0² G2²)(1 + ln T)/(2β). [sent-223, score-1.053]

90 Combining the results of the two cases, we have, with probability 1 − δ, Σ_{t=1}^T (F(xt) − F(x∗)) ≤ (16G1²/β) ln(m/δ) + 4G1 ln(m/δ) + 4G1 + (G1² + λ0² G2²)(1 + ln T)/(2β). [sent-224, score-0.702]

91 Noting that x∗ ∈ K, g(x∗ ) ≤ 0, we have F (x∗ ) ≤ f (x∗ ) + γ ln 2. [sent-226, score-0.351]

92 On the other hand, F(x̄T) = f(x̄T) + γ ln(1 + exp(λ0 g(x̄T)/γ)) ≥ f(x̄T) + max(0, λ0 g(x̄T)). [sent-227, score-0.402]

93 Therefore, with the value γ = ln T /T, we have f(x̄T) ≤ f(x∗) + O(ln T /T), (15) and f(x̄T) + λ0 g(x̄T) ≤ f(x∗) + O(ln T /T). [sent-228, score-1.053]

94 (16) Applying the inequalities (13) and (14) to (16), and noting that γ = ln T /T, we have λ0 ρ ‖x̃T − x̄T‖₂ ≤ G1 ‖x̃T − x̄T‖₂ + O(ln T /T). [sent-229, score-0.751]

95 Therefore f(x̃T) ≤ f(x̃T) − f(x̄T) + f(x̄T) ≤ G1 ‖x̃T − x̄T‖₂ + f(x∗) + O(ln T /T) ≤ f(x∗) + O(ln T /T), where in the second inequality we use inequality (15). [sent-231, score-0.804]

96 6 Conclusions. In the present paper, we made progress towards making the SGD method efficient by proposing a framework in which it is possible to exclude the projection steps from the SGD algorithm. [sent-232, score-0.156]

97 We have proposed two novel algorithms to overcome the computational bottleneck of the projection step in applying SGD to optimization problems with complex domains. [sent-233, score-0.265]

98 Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. [sent-245, score-0.292]

99 Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. [sent-286, score-0.227]

100 Trading regret for efficiency: online convex optimization with long term constraints. [sent-318, score-0.37]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('xt', 0.669), ('sgd', 0.4), ('ln', 0.351), ('convex', 0.164), ('projection', 0.156), ('stochastic', 0.104), ('gradient', 0.102), ('strongly', 0.088), ('optimization', 0.087), ('regret', 0.074), ('descent', 0.071), ('rate', 0.071), ('psd', 0.069), ('domain', 0.068), ('ofw', 0.068), ('convergence', 0.066), ('cone', 0.063), ('dt', 0.058), ('primal', 0.057), ('jaggi', 0.056), ('lemma', 0.054), ('domains', 0.053), ('restore', 0.052), ('projecting', 0.051), ('inequality', 0.051), ('lagrangian', 0.05), ('inequalities', 0.049), ('suppresses', 0.048), ('ming', 0.046), ('blowup', 0.046), ('online', 0.045), ('hazan', 0.045), ('bounded', 0.045), ('solution', 0.041), ('intermediate', 0.041), ('convexity', 0.041), ('minx', 0.041), ('penalization', 0.041), ('penalize', 0.041), ('mahdavi', 0.04), ('unbiased', 0.039), ('constraint', 0.039), ('feasibility', 0.039), ('dual', 0.038), ('modulus', 0.037), ('eigenvalue', 0.037), ('feasible', 0.036), ('absorb', 0.035), ('theme', 0.035), ('solutions', 0.033), ('ying', 0.032), ('eigenvector', 0.031), ('theorem', 0.031), ('cheaper', 0.03), ('norm', 0.03), ('assumptions', 0.03), ('mild', 0.029), ('exp', 0.029), ('greedy', 0.028), ('martingale', 0.028), ('proof', 0.027), ('updating', 0.027), ('semide', 0.026), ('jin', 0.026), ('min', 0.026), ('ball', 0.026), ('shortcoming', 0.026), ('iteration', 0.026), ('boundary', 0.026), ('algorithm', 0.025), ('outside', 0.025), ('optimal', 0.024), ('oracle', 0.024), ('gi', 0.023), ('projections', 0.023), ('smoothing', 0.022), ('remark', 0.022), ('bound', 0.022), ('max', 0.022), ('stays', 0.022), ('complex', 0.022), ('objective', 0.022), ('simplex', 0.021), ('sequence', 0.021), ('icml', 0.02), ('frank', 0.02), ('unattractive', 0.02), ('coresets', 0.02), ('latin', 0.02), ('sulovsk', 0.02), ('zsh', 0.02), ('urgent', 0.02), ('schema', 0.02), ('combing', 0.02), ('jinfeng', 0.02), ('assumption', 0.02), ('summation', 0.02), ('free', 0.02), ('polynomial', 0.02), ('achieves', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 324 nips-2012-Stochastic Gradient Descent with Only One Projection

Author: Mehrdad Mahdavi, Tianbao Yang, Rong Jin, Shenghuo Zhu, Jinfeng Yi

Abstract: Although many variants of stochastic gradient descent have been proposed for large-scale convex optimization, most of them require projecting the solution at each iteration to ensure that the obtained solution stays within the feasible domain. For complex domains (e.g., positive semidefinite cone), the projection step can be computationally expensive, making stochastic gradient descent unattractive for large-scale optimization problems. We address this limitation by developing novel stochastic optimization algorithms that do not need intermediate projections. Instead, only one projection at the last iteration is needed to obtain a feasible solution in the given domain. Our theoretical analysis shows that with a high probability, the proposed algorithms achieve an O(1/√T) convergence rate for general convex optimization, and an O(ln T /T ) rate for strongly convex optimization under mild conditions about the domain and the objective function. 1

2 0.41432828 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems

Author: Marc Deisenroth, Shakir Mohamed

Abstract: Rich and complex time-series data, such as those generated from engineering systems, financial markets, videos, or neural recordings are now a common feature of modern data analysis. Explaining the phenomena underlying these diverse data sets requires flexible and accurate models. In this paper, we promote Gaussian process dynamical systems as a rich model class that is appropriate for such an analysis. We present a new approximate message-passing algorithm for Bayesian state estimation and inference in Gaussian process dynamical systems, a nonparametric probabilistic generalization of commonly used state-space models. We derive our message-passing algorithm using Expectation Propagation and provide a unifying perspective on message passing in general state-space models. We show that existing Gaussian filters and smoothers appear as special cases within our inference framework, and that these existing approaches can be improved upon using iterated message passing. Using both synthetic and real-world data, we demonstrate that iterated message passing can improve inference in a wide range of tasks in Bayesian state estimation, thus leading to improved predictions and more effective decision making. 1

3 0.31300741 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions

Author: Mathieu Sinn, Bei Chen

Abstract: Conditional Markov Chains (also known as Linear-Chain Conditional Random Fields in the literature) are a versatile class of discriminative models for the distribution of a sequence of hidden states conditional on a sequence of observable variables. Large-sample properties of Conditional Markov Chains have been first studied in [1]. The paper extends this work in two directions: first, mixing properties of models with unbounded feature functions are being established; second, necessary conditions for model identifiability and the uniqueness of maximum likelihood estimates are being given. 1

4 0.25663713 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function

Author: Pedro Ortega, Jordi Grau-moya, Tim Genewein, David Balduzzi, Daniel Braun

Abstract: We propose a novel Bayesian approach to solve stochastic optimization problems that involve finding extrema of noisy, nonlinear functions. Previous work has focused on representing possible functions explicitly, which leads to a two-step procedure of first, doing inference over the function space and second, finding the extrema of these functions. Here we skip the representation step and directly model the distribution over extrema. To this end, we devise a non-parametric conjugate prior based on a kernel regressor. The resulting posterior distribution directly captures the uncertainty over the maximum of the unknown function. Given t observations of the function, the posterior can be evaluated efficiently in time O(t2 ) up to a multiplicative constant. Finally, we show how to apply our model to optimize a noisy, non-convex, high-dimensional objective function.

5 0.25134164 293 nips-2012-Relax and Randomize : From Value to Algorithms

Author: Sasha Rakhlin, Ohad Shamir, Karthik Sridharan

Abstract: We show a principled way of deriving online learning algorithms from a minimax analysis. Various upper bounds on the minimax value, previously thought to be non-constructive, are shown to yield algorithms. This allows us to seamlessly recover known methods and to derive new ones, also capturing such “unorthodox” methods as Follow the Perturbed Leader and the R2 forecaster. Understanding the inherent complexity of the learning problem thus leads to the development of algorithms. To illustrate our approach, we present several new algorithms, including a family of randomized methods that use the idea of a “random playout”. New versions of the Follow-the-Perturbed-Leader algorithms are presented, as well as methods based on the Littlestone’s dimension, efficient methods for matrix completion with trace norm, and algorithms for the problems of transductive learning and prediction with static experts. 1

6 0.1897943 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

7 0.18946961 292 nips-2012-Regularized Off-Policy TD-Learning

8 0.18019599 170 nips-2012-Large Scale Distributed Deep Networks

9 0.17642699 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)

10 0.17195731 305 nips-2012-Selective Labeling via Error Bound Minimization

11 0.16828088 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization

12 0.16807932 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions

13 0.13882756 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization

14 0.13585158 68 nips-2012-Clustering Aggregation as Maximum-Weight Independent Set

15 0.13255389 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods

16 0.13144076 195 nips-2012-Learning visual motion in recurrent neural networks

17 0.13136895 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation

18 0.13046628 199 nips-2012-Link Prediction in Graphs with Autoregressive Features

19 0.11776807 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity

20 0.11225751 138 nips-2012-Fully Bayesian inference for neural models with negative-binomial spiking


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.245), (1, -0.011), (2, 0.289), (3, 0.283), (4, 0.192), (5, -0.053), (6, -0.031), (7, -0.096), (8, 0.003), (9, -0.077), (10, 0.102), (11, -0.004), (12, -0.092), (13, 0.117), (14, 0.011), (15, 0.092), (16, -0.085), (17, 0.02), (18, 0.029), (19, 0.092), (20, 0.014), (21, -0.126), (22, 0.063), (23, -0.201), (24, -0.096), (25, 0.005), (26, -0.014), (27, 0.199), (28, 0.016), (29, -0.047), (30, -0.074), (31, -0.126), (32, 0.1), (33, -0.187), (34, -0.116), (35, 0.014), (36, -0.041), (37, -0.05), (38, 0.073), (39, 0.023), (40, -0.129), (41, -0.05), (42, -0.041), (43, -0.085), (44, -0.03), (45, -0.057), (46, -0.091), (47, 0.023), (48, 0.025), (49, -0.058)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98951542 324 nips-2012-Stochastic Gradient Descent with Only One Projection

Author: Mehrdad Mahdavi, Tianbao Yang, Rong Jin, Shenghuo Zhu, Jinfeng Yi

Abstract: Although many variants of stochastic gradient descent have been proposed for large-scale convex optimization, most of them require projecting the solution at each iteration to ensure that the obtained solution stays within the feasible domain. For complex domains (e.g., positive semidefinite cone), the projection step can be computationally expensive, making stochastic gradient descent unattractive for large-scale optimization problems. We address this limitation by developing novel stochastic optimization algorithms that do not need intermediate projections. Instead, only one projection at the last iteration is needed to obtain a feasible solution in the given domain. Our theoretical analysis shows that with a high probability, the proposed algorithms achieve an O(1/√T) convergence rate for general convex optimization, and an O(ln T /T ) rate for strongly convex optimization under mild conditions about the domain and the objective function. 1

2 0.84915501 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems

Author: Marc Deisenroth, Shakir Mohamed

Abstract: Rich and complex time-series data, such as those generated from engineering systems, financial markets, videos, or neural recordings are now a common feature of modern data analysis. Explaining the phenomena underlying these diverse data sets requires flexible and accurate models. In this paper, we promote Gaussian process dynamical systems as a rich model class that is appropriate for such an analysis. We present a new approximate message-passing algorithm for Bayesian state estimation and inference in Gaussian process dynamical systems, a nonparametric probabilistic generalization of commonly used state-space models. We derive our message-passing algorithm using Expectation Propagation and provide a unifying perspective on message passing in general state-space models. We show that existing Gaussian filters and smoothers appear as special cases within our inference framework, and that these existing approaches can be improved upon using iterated message passing. Using both synthetic and real-world data, we demonstrate that iterated message passing can improve inference in a wide range of tasks in Bayesian state estimation, thus leading to improved predictions and more effective decision making. 1

3 0.82290435 293 nips-2012-Relax and Randomize : From Value to Algorithms

Author: Sasha Rakhlin, Ohad Shamir, Karthik Sridharan

Abstract: We show a principled way of deriving online learning algorithms from a minimax analysis. Various upper bounds on the minimax value, previously thought to be non-constructive, are shown to yield algorithms. This allows us to seamlessly recover known methods and to derive new ones, also capturing such “unorthodox” methods as Follow the Perturbed Leader and the R2 forecaster. Understanding the inherent complexity of the learning problem thus leads to the development of algorithms. To illustrate our approach, we present several new algorithms, including a family of randomized methods that use the idea of a “random playout”. New versions of the Follow-the-Perturbed-Leader algorithms are presented, as well as methods based on the Littlestone’s dimension, efficient methods for matrix completion with trace norm, and algorithms for the problems of transductive learning and prediction with static experts. 1

4 0.72916514 292 nips-2012-Regularized Off-Policy TD-Learning

Author: Bo Liu, Sridhar Mahadevan, Ji Liu

Abstract: We present a novel l1 regularized off-policy convergent TD-learning method (termed RO-TD), which is able to learn sparse representations of value functions with low computational complexity. The algorithmic framework underlying ROTD integrates two key ideas: off-policy convergent gradient TD methods, such as TDC, and a convex-concave saddle-point formulation of non-smooth convex optimization, which enables first-order solvers and feature selection using online convex regularization. A detailed theoretical and experimental analysis of RO-TD is presented. A variety of experiments are presented to illustrate the off-policy convergence, sparse feature selection capability and low computational cost of the RO-TD algorithm. 1

5 0.7268694 305 nips-2012-Selective Labeling via Error Bound Minimization

Author: Quanquan Gu, Tong Zhang, Jiawei Han, Chris H. Ding

Abstract: In many practical machine learning problems, the acquisition of labeled data is often expensive and/or time consuming. This motivates us to study a problem as follows: given a label budget, how to select data points to label such that the learning performance is optimized. We propose a selective labeling method by analyzing the out-of-sample error of Laplacian regularized Least Squares (LapRLS). In particular, we derive a deterministic out-of-sample error bound for LapRLS trained on subsampled data, and propose to select a subset of data points to label by minimizing this upper bound. Since the minimization is a combinational problem, we relax it into continuous domain and solve it by projected gradient descent. Experiments on benchmark datasets show that the proposed method outperforms the state-of-the-art methods.

6 0.67655009 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization

7 0.65018046 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization

8 0.58806753 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions

9 0.5812946 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function

10 0.53742844 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions

11 0.53465396 285 nips-2012-Query Complexity of Derivative-Free Optimization

12 0.5175485 309 nips-2012-Semi-supervised Eigenvectors for Locally-biased Learning

13 0.51672137 80 nips-2012-Confusion-Based Online Learning and a Passive-Aggressive Scheme

14 0.51367301 11 nips-2012-A Marginalized Particle Gaussian Process Regression

15 0.50900209 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods

16 0.49402121 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

17 0.48171306 66 nips-2012-Causal discovery with scale-mixture model for spatiotemporal variance dependencies

18 0.45595545 195 nips-2012-Learning visual motion in recurrent neural networks

19 0.44843012 23 nips-2012-A lattice filter model of the visual pathway

20 0.43616262 258 nips-2012-Online L1-Dictionary Learning with Application to Novel Document Detection


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.024), (21, 0.021), (23, 0.128), (38, 0.216), (42, 0.056), (53, 0.01), (54, 0.023), (68, 0.023), (74, 0.027), (76, 0.148), (80, 0.165), (92, 0.057)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.95686376 214 nips-2012-Minimizing Sparse High-Order Energies by Submodular Vertex-Cover

Author: Andrew Delong, Olga Veksler, Anton Osokin, Yuri Boykov

Abstract: Inference in high-order graphical models has become important in recent years. Several approaches are based, for example, on generalized message-passing, or on transformation to a pairwise model with extra ‘auxiliary’ variables. We focus on a special case where a much more efficient transformation is possible. Instead of adding variables, we transform the original problem into a comparatively small instance of submodular vertex-cover. These vertex-cover instances can then be attacked by existing algorithms (e.g. belief propagation, QPBO), where they often run 4–15 times faster and find better solutions than when applied to the original problem. We evaluate our approach on synthetic data, then we show applications within a fast hierarchical clustering and model-fitting framework. 1

2 0.93933845 132 nips-2012-Fiedler Random Fields: A Large-Scale Spectral Approach to Statistical Network Modeling

Author: Antonino Freno, Mikaela Keller, Marc Tommasi

Abstract: Statistical models for networks have been typically committed to strong prior assumptions concerning the form of the modeled distributions. Moreover, the vast majority of currently available models are explicitly designed for capturing some specific graph properties (such as power-law degree distributions), which makes them unsuitable for application to domains where the behavior of the target quantities is not known a priori. The key contribution of this paper is twofold. First, we introduce the Fiedler delta statistic, based on the Laplacian spectrum of graphs, which allows to dispense with any parametric assumption concerning the modeled network properties. Second, we use the defined statistic to develop the Fiedler random field model, which allows for efficient estimation of edge distributions over large-scale random networks. After analyzing the dependence structure involved in Fiedler random fields, we estimate them over several real-world networks, showing that they achieve a much higher modeling accuracy than other well-known statistical approaches.

same-paper 3 0.92954814 324 nips-2012-Stochastic Gradient Descent with Only One Projection

Author: Mehrdad Mahdavi, Tianbao Yang, Rong Jin, Shenghuo Zhu, Jinfeng Yi

Abstract: Although many variants of stochastic gradient descent have been proposed for large-scale convex optimization, most of them require projecting the solution at each iteration to ensure that the obtained solution stays within the feasible domain. For complex domains (e.g., positive semidefinite cone), the projection step can be computationally expensive, making stochastic gradient descent unattractive for large-scale optimization problems. We address this limitation by developing novel stochastic optimization algorithms that do not need intermediate projections. Instead, only one projection at the last iteration is needed to obtain a feasible solution in the given domain. Our theoretical analysis shows that with a high probability, the proposed algorithms achieve an O(1/√T) convergence rate for general convex optimization, and an O(ln T /T ) rate for strongly convex optimization under mild conditions about the domain and the objective function. 1

4 0.91734457 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders

Author: Sanjeev Arora, Rong Ge, Ankur Moitra, Sushant Sachdeva

Abstract: We present a new algorithm for Independent Component Analysis (ICA) which has provable performance guarantees. In particular, suppose we are given samples of the form y = Ax + η where A is an unknown n × n matrix and x is a random variable whose components are independent and have a fourth moment strictly less than that of a standard Gaussian random variable and η is an n-dimensional Gaussian random variable with unknown covariance Σ: We give an algorithm that provable recovers A and Σ up to an additive and whose running time and sample complexity are polynomial in n and 1/ . To accomplish this, we introduce a novel “quasi-whitening” step that may be useful in other contexts in which the covariance of Gaussian noise is not known in advance. We also give a general framework for finding all local optima of a function (given an oracle for approximately finding just one) and this is a crucial step in our algorithm, one that has been overlooked in previous attempts, and allows us to control the accumulation of error when we find the columns of A one by one via local search. 1

5 0.91473746 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)

Author: Nicolò Cesa-bianchi, Pierre Gaillard, Gabor Lugosi, Gilles Stoltz

Abstract: Mirror descent with an entropic regularizer is known to achieve shifting regret bounds that are logarithmic in the dimension. This is done using either a carefully designed projection or by a weight sharing technique. Via a novel unified analysis, we show that these two approaches deliver essentially equivalent bounds on a notion of regret generalizing shifting, adaptive, discounted, and other related regrets. Our analysis also captures and extends the generalized weight sharing technique of Bousquet and Warmuth, and can be refined in several ways, including improvements for small losses and adaptive tuning of parameters. 1

6 0.91271996 293 nips-2012-Relax and Randomize : From Value to Algorithms

7 0.91258824 227 nips-2012-Multiclass Learning with Simplex Coding

8 0.90963984 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

9 0.9071945 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization

10 0.90638846 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification

11 0.90454578 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

12 0.9038409 292 nips-2012-Regularized Off-Policy TD-Learning

13 0.90350521 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization

14 0.90323156 358 nips-2012-Value Pursuit Iteration

15 0.90276146 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

16 0.90233803 330 nips-2012-Supervised Learning with Similarity Functions

17 0.90028334 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions

18 0.90016431 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound

19 0.89960086 230 nips-2012-Multiple Choice Learning: Learning to Produce Multiple Structured Outputs

20 0.89930987 200 nips-2012-Local Supervised Learning through Space Partitioning