nips nips2012 nips2012-263 knowledge-graph by maker-knowledge-mining

263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization


Source: pdf

Author: Xi Chen, Qihang Lin, Javier Pena

Abstract: This paper considers a wide spectrum of regularized stochastic optimization problems where both the loss function and regularizer can be non-smooth. We develop a novel algorithm based on the regularized dual averaging (RDA) method, that can simultaneously achieve the optimal convergence rates for both convex and strongly convex loss. In particular, for strongly convex loss, it achieves the opti1 1 mal rate of O( N + N 2 ) for N iterations, which improves the rate O( log N ) for preN vious regularized dual averaging algorithms. In addition, our method constructs the final solution directly from the proximal mapping instead of averaging of all previous iterates. For widely used sparsity-inducing regularizers (e.g., 1 -norm), it has the advantage of encouraging sparser solutions. We further develop a multistage extension using the proposed algorithm as a subroutine, which achieves the 1 uniformly-optimal rate O( N + exp{−N }) for strongly convex loss. 1

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract This paper considers a wide spectrum of regularized stochastic optimization problems where both the loss function and regularizer can be non-smooth. [sent-5, score-0.229]

2 We develop a novel algorithm based on the regularized dual averaging (RDA) method, that can simultaneously achieve the optimal convergence rates for both convex and strongly convex loss. [sent-6, score-0.75]

3 In particular, for strongly convex loss, it achieves the opti1 1 mal rate of O( N + N 2 ) for N iterations, which improves the rate O( log N ) for preN vious regularized dual averaging algorithms. [sent-7, score-0.74]

4 In addition, our method constructs the final solution directly from the proximal mapping instead of averaging of all previous iterates. [sent-8, score-0.247]

5 We further develop a multistage extension using the proposed algorithm as a subroutine, which achieves the 1 uniformly-optimal rate O( N + exp{−N }) for strongly convex loss. [sent-12, score-0.41]

6 1 Introduction Many risk minimization problems in machine learning can be formulated into a regularized stochastic optimization problem of the following form: minx∈X {φ(x) := f (x) + h(x)}. [sent-13, score-0.185]

7 (1) Here, the set of feasible solutions X is a convex set in Rn , which is endowed with a norm · and the dual norm · ∗ . [sent-14, score-0.207]

8 If µ > 0, f (x) is strongly convex and µ is the so-called strong convexity parameter. [sent-29, score-0.298]

9 In the past few years, many stochastic (sub)gradient methods [6, 5, 8, 12, 14, 10, 9, 11, 7, 18] have been developed to directly solve the stochastic optimization problem in Eq. [sent-41, score-0.298]

10 Recently, Xiao [21] proposed the regularized dual averaging (RDA) method and its accelerated version (AC-RDA) based on Nesterov’s primal-dual method [17]. [sent-45, score-0.266]

11 Instead of only utilizing a single stochastic subgradient G(xt , ξt ) of the current iteration, it updates the parameter vector using the average of all past stochastic subgradients {G(xi , ξi )}t and hence leads to improved empirical performances. [sent-46, score-0.442]

12 i=1 In this paper, we propose a novel regularized dual averaging method, called optimal RDA or ORDA, which achieves the optimal expected convergence rate of E[φ(x) − φ(x∗ )], where x is the solution from ORDA and x∗ is the optimal solution of Eq. [sent-47, score-0.556]

13 As compared to previous dual averaging methods, it has three main advantages: 1. [sent-49, score-0.194]

14 ORDA is a self-adaptive and optimal algorithm for solving both convex and strongly convex f (x) with the strong convexity parameter µ as an input. [sent-53, score-0.455]

15 When µ = 0, ORDA reduces to a variant of AC-RDA in [21] with the optimal rate for solving convex f (x). [sent-54, score-0.255]

16 For strongly convex f (x) with µ > 0, our algorithm achieves the optimal rate of 1 µN while AC-RDA does not utilize the advantage of strong convexity. [sent-56, score-0.443]

17 Existing RDA methods [21] and many other stochastic gradient methods (e. [sent-58, score-0.158]

18 , [14, 10]) N N can only show the convergence rate for the averaged iterates: xN = t=1 t xt / t=1 t , ¯ where the { t } are nonnegative weights. [sent-60, score-0.218]

19 For example, when h(x) is a sparsity-inducing regularizer ( 1 -norm), although xt computed from proximal mapping will be sparse as t goes large, the averaged solution could be non-sparse. [sent-64, score-0.242]

20 In contrast, our method directly generates the final solution from the proximal mapping, which leads to sparser solutions. [sent-65, score-0.145]

21 Furthermore, using ORDA as a subroutine, we develop the multi-stage ORDA which obtains the 2 +M 2 convergence rate of O σ µN + exp{− µ/LN } for strongly convex f (x). [sent-68, score-0.413]

22 Recall that ORDA has the rate O σ 2 +M 2 µN + L N2 for strongly convex f (x). [sent-69, score-0.362]

23 The rate of muli-stage ORDA improves L the second term in the rate of ORDA from O N 2 to O exp{− µ/LN } and achieves the socalled “uniformly-optimal ” rate [15]. [sent-70, score-0.338]

24 Although the improvement is on the non-dominating term, multi-stage ORDA is an optimal algorithm for both stochastic and deterministic optimization. [sent-71, score-0.194]

25 In particular, for deterministic strongly convex and smooth f (x) (M = 0), one can use the same algorithm but only replaces the stochastic subgradient G(x, ξ) by the deterministic gradient f (x). [sent-72, score-0.625]

26 2 +M 2 Then, the variance of the stochastic subgradient σ = 0. [sent-73, score-0.215]

27 Sample ξt from the distribution P (ξ) and compute the stochastic subgradient G(yt , ξt ). [sent-83, score-0.215]

28 , optimal with respect to both stochastic and deterministic optimization. [sent-90, score-0.194]

29 2 Preliminary and Notations In the framework of first-order stochastic optimization, the only available information of f (x) is the stochastic subgradient. [sent-91, score-0.242]

30 Formally speaking, stochastic subgradient of f (x) at x, G(x, ξ), is a vectorvalued function such that Eξ G(x, ξ) = f (x) ∈ ∂f (x). [sent-92, score-0.215]

31 ∗ (3) A key updating step in dual averaging methods, the proximal mapping, utilizes the Bregman divergence. [sent-94, score-0.298]

32 Let ω(x) : X → R be a strongly convex and differentiable function, the Bregman divergence associated with ω(x) is defined as: V (x, y) := ω(x) − ω(y) − 1 2 ω(y), x − y . [sent-95, score-0.264]

33 3 Optimal Regularized Dual Averaging Method In dual averaging methods [17, 21], the key proximal mapping step utilizes the average of all past stochastic subgradients to update the parameter vector. [sent-103, score-0.538]

34 t 2 log For strongly convex f (x), the current dual averaging methods achieve a rate of O( σ µN N ), which is suboptimal. [sent-105, score-0.574]

35 In this section, we propose a new dual averaging algorithm which adapts to both strongly and non-strongly convex f (x) via the strong convexity parameter µ and achieves optimal rates in both cases. [sent-106, score-0.583]

36 In addition, for previous dual averaging methods, to guarantee the convergence, N the final solution takes the form: x = N1 t=0 zt and hence is not sparse in nature for sparsity+1 inducing regularizers. [sent-107, score-0.28]

37 Instead of taking the average, we introduce another proximal mapping and generate the final solution directly from the second proximal mapping. [sent-108, score-0.23]

38 However, according to [13], the 3 convergence of φ(zN ) to the optimal φ(x∗ ) is established only under a more restrictive assumption that x∗ is a strong local minimizer of φ relative to the optimal manifold and the convergence rate is quite slow. [sent-113, score-0.328]

39 The parameter c is set to achieve the optimal rates for both convex and strongly convex loss. [sent-119, score-0.463]

40 In Step 1, the intermediate point yt is a convex combination of xt and zt and when µ = 0, yt = (1 − θt )xt + θt zt . [sent-123, score-0.492]

41 Therefore, gt in Step 3 is a convex combination of {G(yi , ξi )}i=0 . [sent-126, score-0.183]

42 As compared to RDA which uses the average of past subgradients, gt in ORDA is a weighted average of all past stochastic subgradients and the subgradient from the larger iteration has a larger weight (i. [sent-127, score-0.398]

43 In practice, instead of storing all past stochastic subgradients, gt−1 θt−1 νt−1 gt could be simply updated based on gt−1 : gt = θt νt + G(yt ,ξt ) νt . [sent-130, score-0.292]

44 We also note that since the error in the stochastic subgradient G(yt , ξt ) will affect the sparsity of xt+1 via the second proximal mapping, to obtain stable sparsity recovery performances, it would be better to construct the stochastic subgradient with a small batch of samples [21, 1]. [sent-131, score-0.622]

45 1 Convergence Rate We present the convergence rate for ORDA. [sent-134, score-0.149]

46 , smooth empirical loss), one only needs 4 to change the stochastic subgradient G(yt , ξt ) to f (yt ). [sent-152, score-0.236]

47 The resulting algorithm, which reduces to ∗ Algorithm 3 in [20], is an optimal deterministic first-order method with the rate O( LV (x 2,x0 ) ). [sent-153, score-0.171]

48 N We note that the quadratic growth assumption of V (x, y) is not necessary for convex f (x). [sent-154, score-0.192]

49 If one does not assume this assumption and replaces the last step in ORDA by xt+1 = µ arg minx∈X x, G(yt , ξt ) + h(x) + 2θ2 + γt x − yt 2 , we can achieve the same rate as in 2 t Eq. [sent-155, score-0.236]

50 But the quadratic growth assumption is indeed required for showing the convergence for strongly convex f (x) as in the next corollary. [sent-157, score-0.386]

51 Corollary 2 For strongly convex f (x) with µ > 0, we set c = 0 and Γ = L and obtain that: Eφ(xN +1 ) − φ(x∗ ) ≤ 4τ LV (x∗ , x0 ) 4τ (σ 2 + M 2 ) + . [sent-158, score-0.264]

52 (8), O µN , is optimal and better than the O log N rate for previous µN dual averaging methods. [sent-160, score-0.328]

53 In particular, for deterministic smooth and strongly convex L f (x) (i. [sent-162, score-0.322]

54 , empirical loss with σ = M = 0), ORDA only achieves the rate of O( N 2 ) while the µ optimal deterministic rate should be O exp(− L N ) [16]. [sent-164, score-0.323]

55 2 High Probability Bounds For stochastic optimization problems, another important evaluation criterion is the confidence level of the objective value. [sent-167, score-0.171]

56 (3), namely Eξ ( G(x, ξ) − f (x) 2 ) ≤ σ 2 , and according to ∗ √ Corollary 1 and 2, (N, δ) = O (σ+M ) V (x∗ ,x0 ) √ Nδ for convex f (x), and (N, δ) = O σ 2 +M 2 µN δ for strongly convex f (x). [sent-173, score-0.385]

57 To obtain tighter bounds, we strengthen the basic assumption of the stochastic subgradient in Eq. [sent-175, score-0.253]

58 By further making the ∗ boundedness assumption ( x∗ − zt ≤ D) and utilizing a technical lemma from [3], we obtain a √ much tighter high probability bound with (N, δ) = O ln(1/δ)Dσ √ N for both convex and strongly convex f (x). [sent-178, score-0.532]

59 1, for convex f (x), ORDA achieves the uniformly-optimal rate. [sent-181, score-0.152]

60 However, for strongly convex f (x), although the dominating term of the convergence rate in Eq. [sent-182, score-0.444]

61 Inspired by the multi-stage stochastic approximation methods [7, 9, 11], we propose the multi-stage extension of ORDA in Algorithm 2 for stochastic strongly convex optimization. [sent-184, score-0.523]

62 The multi-stage ORDA has achieved uniformly-optimal convergence rate as shown in Theorem 2 with the proof in Appendix. [sent-188, score-0.149]

63 µ (9) Related Works In the last few years, a number of stochastic gradient methods [6, 5, 8, 12, 14, 21, 10, 11, 7, 4, 3] have been developed to solve Eq. [sent-200, score-0.158]

64 In Table 1, we compare the proposed ORDA and its multi-stage extension with some widely used stochastic gradient methods using the following metrics. [sent-202, score-0.175]

65 The convergence rate for solving (non-strongly) convex f (x) and whether this rate has achieved the uniformly-optimal (Uni-opt) rate. [sent-205, score-0.368]

66 The convergence rate for solving strongly convex f (x) and whether (1) the dominating term of rate is optimal, i. [sent-207, score-0.542]

67 Whether the final solution x, on which the results of convergence are built, is generated from the weighted average of previous iterates (Avg) or from the proximal mapping (Prox). [sent-211, score-0.216]

68 For sparsity-inducing regularizers, the solution directly from the proximal mapping is often sparser than the averaged solution. [sent-212, score-0.177]

69 2 In Table 1, the algorithms in the first 7 rows are stochastic approximation algorithms where only the current stochastic gradient is used at each iteration. [sent-215, score-0.279]

70 The last 4 rows are dual averaging methods where all past subgradients are used. [sent-216, score-0.281]

71 Some algorithms in Table 1 make a more restrictive assumption on the stochastic gradient: ∃G > 0, E G(x, ξ) 2 ≤ G2 , ∀x ∈ X . [sent-217, score-0.146]

72 As we can see from Table 1, the proposed ORDA possesses all good properties except that the convergence rate for strongly convex f (x) is not uniformly-optimal. [sent-220, score-0.413]

73 In particular, SAGE [8] achieves a nearly optimal rate since the parameter D in the convergence rate is chosen such that E xt − x∗ 2 ≤ D for all t ≥ 0 2 and it could be much larger than V ≡ V (x∗ , x0 ). [sent-222, score-0.396]

74 As compared to AC-SA [10] and multi-stage AC-SA [11], our methods do not require the final averaging step; and as shown in our experiments, ORDA has better empirical performances due to the usage of all past stochastic subgradients. [sent-224, score-0.277]

75 Furthermore, we improve the rates of RDA and extend AC-RDA to an optimal algorithm for both convex and strongly convex f (x). [sent-225, score-0.445]

76 [9] proposed multi-stage algorithms to achieve the optimal strongly convex rate based on non-accelerated dual averaging methods. [sent-228, score-0.61]

77 2 G Recently, the paper [18] develops another stochastic gradient method which achieves the rate O( µN ) for strongly convex f (x). [sent-235, score-0.551]

78 However, for non-smooth f (x), it requires the averaging of the last a few iterates and this rate is not uniformly-optimal. [sent-236, score-0.232]

79 We compare our ORDA and M ORDA (only for strongly convex loss) with several state-of-the-art stochastic gradient methods, including RDA and AC-RDA [21], AC-SA [10], FOBOS [6] and SAGE [8]. [sent-238, score-0.422]

80 , c in ORDA for convex loss) within an appropriate range and choose the one that leads to the minimum objective value. [sent-242, score-0.149]

81 We set n = 100 and create a large pool of samples for generating stochastic gradients and evaluating objective values. [sent-247, score-0.149]

82 Since we focus on stochastic optimization instead of online learning, we could randomly draw samples from an underlying distribution. [sent-249, score-0.182]

83 So we construct the stochastic gradient using the mini-batch strategy [2, 1] with the batch size 50. [sent-250, score-0.171]

84 We note that for dual averaging methods, the solution generated from the (first) proximal mapping (e. [sent-262, score-0.333]

85 Objective Then we set ρ = 1 to test algorithms for solving strongly convex f (x). [sent-299, score-0.264]

86 This observation is consistent 22 with our theoretical analysis since the improvement of the convergence rate only appears on 20 the non-dominating term. [sent-303, score-0.149]

87 In addition, ORDA, 100 200 300 400 500 Iteration M ORDA, AC-SA and SAGE with the conver1 gence rate O( µN ) achieve lower objective valFigure 3: ORDA v. [sent-304, score-0.156]

88 7 Conclusions and Future Works In this paper, we propose a new dual averaging method which achieves the optimal rates for solving stochastic regularized problems with both convex and strongly convex loss functions. [sent-312, score-0.856]

89 We further propose a multi-stage extension to achieve the uniformly-optimal convergence rate for strongly convex loss. [sent-313, score-0.448]

90 Although we study stochastic optimization problems in this paper, our algorithms can be easily converted into online optimization approaches, where a sequence of decisions {xt }N are generated t=1 according to Algorithm 1 or 2. [sent-314, score-0.191]

91 Given the expected convergence rate in Corollary 1 and 2, the expected regret can be easily derived. [sent-316, score-0.172]

92 For N N 1 ∗ example, for strongly convex f (x): ERN (x∗ ) ≤ t=1 (E(φ(xt )) − φ(x )) ≤ t=1 O( t ) = O(ln N ). [sent-317, score-0.264]

93 Adaptive subgradient methods for online learning and stochastic optimization. [sent-348, score-0.241]

94 Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. [sent-365, score-0.18]

95 Accelerated gradient methods for stochastic optimization and online learning. [sent-372, score-0.206]

96 Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, part i: a generic algorithmic framework. [sent-382, score-0.524]

97 Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, part ii: shrinking procedures and optimal algorithms. [sent-387, score-0.56]

98 Manifold identification of dual averaging methods for regularized stochastic online learning. [sent-399, score-0.383]

99 making stochastic gradient descent optimal for strongly convex problems. [sent-426, score-0.458]

100 Dual averaging methods for regularized stochastic learning and online optimization. [sent-442, score-0.297]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('orda', 0.858), ('yes', 0.164), ('rda', 0.164), ('strongly', 0.143), ('convex', 0.121), ('stochastic', 0.121), ('averaging', 0.108), ('rate', 0.098), ('sage', 0.097), ('subgradient', 0.094), ('proximal', 0.091), ('dual', 0.086), ('yt', 0.081), ('na', 0.07), ('zt', 0.07), ('xt', 0.069), ('lv', 0.062), ('gt', 0.062), ('fobos', 0.062), ('avg', 0.057), ('prox', 0.057), ('subgradients', 0.053), ('convergence', 0.051), ('minx', 0.05), ('obj', 0.05), ('ac', 0.042), ('xn', 0.042), ('regularized', 0.042), ('sparser', 0.038), ('deterministic', 0.037), ('gradient', 0.037), ('optimal', 0.036), ('nk', 0.036), ('past', 0.034), ('bregman', 0.033), ('sparsity', 0.033), ('growth', 0.033), ('mapping', 0.032), ('achieves', 0.031), ('dominating', 0.031), ('accelerated', 0.03), ('objective', 0.028), ('duchi', 0.028), ('juditsky', 0.026), ('iterates', 0.026), ('online', 0.026), ('lan', 0.025), ('assumption', 0.025), ('rates', 0.024), ('shamir', 0.024), ('loss', 0.023), ('regret', 0.023), ('recovery', 0.022), ('optimization', 0.022), ('regularizer', 0.021), ('corollary', 0.021), ('smooth', 0.021), ('notations', 0.021), ('boundedness', 0.02), ('florida', 0.02), ('iterations', 0.02), ('nal', 0.02), ('convexity', 0.02), ('iterate', 0.019), ('utilizing', 0.019), ('zn', 0.019), ('subroutine', 0.019), ('exp', 0.018), ('achieve', 0.018), ('composite', 0.018), ('xk', 0.018), ('sa', 0.018), ('extension', 0.017), ('lipschitz', 0.017), ('manifold', 0.017), ('ld', 0.017), ('nemirovski', 0.017), ('subdifferential', 0.016), ('solution', 0.016), ('regularizers', 0.016), ('solver', 0.015), ('colt', 0.015), ('replaces', 0.014), ('table', 0.014), ('hazan', 0.014), ('strong', 0.014), ('performances', 0.014), ('stage', 0.013), ('batch', 0.013), ('tighter', 0.013), ('improves', 0.013), ('quadratic', 0.013), ('utilizes', 0.013), ('could', 0.013), ('mellon', 0.013), ('carnegie', 0.013), ('pena', 0.012), ('cotter', 0.012), ('gence', 0.012), ('inspired', 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization

Author: Xi Chen, Qihang Lin, Javier Pena

Abstract: This paper considers a wide spectrum of regularized stochastic optimization problems where both the loss function and regularizer can be non-smooth. We develop a novel algorithm based on the regularized dual averaging (RDA) method, that can simultaneously achieve the optimal convergence rates for both convex and strongly convex loss. In particular, for strongly convex loss, it achieves the opti1 1 mal rate of O( N + N 2 ) for N iterations, which improves the rate O( log N ) for preN vious regularized dual averaging algorithms. In addition, our method constructs the final solution directly from the proximal mapping instead of averaging of all previous iterates. For widely used sparsity-inducing regularizers (e.g., 1 -norm), it has the advantage of encouraging sparser solutions. We further develop a multistage extension using the proposed algorithm as a subroutine, which achieves the 1 uniformly-optimal rate O( N + exp{−N }) for strongly convex loss. 1

2 0.14952594 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions

Author: Alekh Agarwal, Sahand Negahban, Martin J. Wainwright

Abstract: We develop and analyze stochastic optimization algorithms for problems in which the expected loss is strongly convex, and the optimum is (approximately) sparse. Previous approaches are able to exploit only one of these two structures, yielding a O(d/T ) convergence rate for strongly convex objectives in d dimensions and O( s(log d)/T ) convergence rate when the optimum is s-sparse. Our algorithm is based on successively solving a series of ℓ1 -regularized optimization problems using Nesterov’s dual averaging algorithm. We establish that the error of our solution after T iterations is at most O(s(log d)/T ), with natural extensions to approximate sparsity. Our results apply to locally Lipschitz losses including the logistic, exponential, hinge and least-squares losses. By recourse to statistical minimax results, we show that our convergence rates are optimal up to constants. The effectiveness of our approach is also confirmed in numerical simulations where we compare to several baselines on a least-squares regression problem.

3 0.13882756 324 nips-2012-Stochastic Gradient Descent with Only One Projection

Author: Mehrdad Mahdavi, Tianbao Yang, Rong Jin, Shenghuo Zhu, Jinfeng Yi

Abstract: Although many variants of stochastic gradient descent have been proposed for large-scale convex optimization, most of them require projecting the solution at each iteration to ensure that the obtained solution stays within the feasible domain. For complex domains (e.g., positive semidefinite cone), the projection step can be computationally expensive, making stochastic gradient descent unattractive for large-scale optimization problems. We address this limitation by developing novel stochastic optimization algorithms that do not need intermediate projections. Instead, only one projection at the last iteration is needed to obtain a feasible solution in the given domain. Our theoretical analysis shows that with a high probability, √ the proposed algorithms achieve an O(1/ T ) convergence rate for general convex optimization, and an O(ln T /T ) rate for strongly convex optimization under mild conditions about the domain and the objective function. 1

4 0.11081002 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods

Author: Andre Wibisono, Martin J. Wainwright, Michael I. Jordan, John C. Duchi

Abstract: We consider derivative-free algorithms for stochastic optimization problems that use only noisy function values rather than gradients, analyzing their finite-sample convergence rates. We show that if pairs of function values are available, algorithms that √ gradient estimates based on random perturbations suffer a factor use of at most d in convergence rate over traditional stochastic gradient methods, where d is the problem dimension. We complement our algorithmic development with information-theoretic lower bounds on the minimax convergence rate of such problems, which show that our bounds are sharp with respect to all problemdependent quantities: they cannot be improved by more than constant factors. 1

5 0.10015871 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions

Author: Mathieu Sinn, Bei Chen

Abstract: Conditional Markov Chains (also known as Linear-Chain Conditional Random Fields in the literature) are a versatile class of discriminative models for the distribution of a sequence of hidden states conditional on a sequence of observable variables. Large-sample properties of Conditional Markov Chains have been first studied in [1]. The paper extends this work in two directions: first, mixing properties of models with unbounded feature functions are being established; second, necessary conditions for model identifiability and the uniqueness of maximum likelihood estimates are being given. 1

6 0.098668456 6 nips-2012-A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation

7 0.088307686 143 nips-2012-Globally Convergent Dual MAP LP Relaxation Solvers using Fenchel-Young Margins

8 0.087902114 282 nips-2012-Proximal Newton-type methods for convex optimization

9 0.087597854 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

10 0.083695441 292 nips-2012-Regularized Off-Policy TD-Learning

11 0.080852263 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function

12 0.080825016 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation

13 0.071120791 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization

14 0.066982985 84 nips-2012-Convergence Rate Analysis of MAP Coordinate Minimization Algorithms

15 0.063069396 293 nips-2012-Relax and Randomize : From Value to Algorithms

16 0.061779197 29 nips-2012-Accelerated Training for Matrix-norm Regularization: A Boosting Approach

17 0.059087403 20 nips-2012-A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets

18 0.057835191 300 nips-2012-Scalable nonconvex inexact proximal splitting

19 0.054757841 16 nips-2012-A Polynomial-time Form of Robust Regression

20 0.053846061 314 nips-2012-Slice Normalized Dynamic Markov Logic Networks


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.132), (1, 0.004), (2, 0.139), (3, 0.074), (4, 0.09), (5, 0.025), (6, -0.02), (7, -0.068), (8, 0.06), (9, 0.009), (10, 0.025), (11, 0.054), (12, -0.017), (13, 0.052), (14, 0.04), (15, 0.043), (16, -0.081), (17, -0.048), (18, 0.04), (19, 0.01), (20, 0.081), (21, -0.027), (22, 0.083), (23, -0.001), (24, -0.029), (25, -0.072), (26, -0.022), (27, -0.008), (28, 0.003), (29, 0.019), (30, 0.069), (31, -0.065), (32, -0.043), (33, -0.078), (34, -0.024), (35, 0.026), (36, 0.082), (37, -0.007), (38, 0.069), (39, -0.01), (40, -0.02), (41, -0.052), (42, 0.052), (43, 0.005), (44, -0.029), (45, -0.026), (46, -0.028), (47, 0.085), (48, 0.091), (49, 0.035)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94439334 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization

Author: Xi Chen, Qihang Lin, Javier Pena

Abstract: This paper considers a wide spectrum of regularized stochastic optimization problems where both the loss function and regularizer can be non-smooth. We develop a novel algorithm based on the regularized dual averaging (RDA) method, that can simultaneously achieve the optimal convergence rates for both convex and strongly convex loss. In particular, for strongly convex loss, it achieves the opti1 1 mal rate of O( N + N 2 ) for N iterations, which improves the rate O( log N ) for preN vious regularized dual averaging algorithms. In addition, our method constructs the final solution directly from the proximal mapping instead of averaging of all previous iterates. For widely used sparsity-inducing regularizers (e.g., 1 -norm), it has the advantage of encouraging sparser solutions. We further develop a multistage extension using the proposed algorithm as a subroutine, which achieves the 1 uniformly-optimal rate O( N + exp{−N }) for strongly convex loss. 1

2 0.76518279 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions

Author: Alekh Agarwal, Sahand Negahban, Martin J. Wainwright

Abstract: We develop and analyze stochastic optimization algorithms for problems in which the expected loss is strongly convex, and the optimum is (approximately) sparse. Previous approaches are able to exploit only one of these two structures, yielding a O(d/T ) convergence rate for strongly convex objectives in d dimensions and O( s(log d)/T ) convergence rate when the optimum is s-sparse. Our algorithm is based on successively solving a series of ℓ1 -regularized optimization problems using Nesterov’s dual averaging algorithm. We establish that the error of our solution after T iterations is at most O(s(log d)/T ), with natural extensions to approximate sparsity. Our results apply to locally Lipschitz losses including the logistic, exponential, hinge and least-squares losses. By recourse to statistical minimax results, we show that our convergence rates are optimal up to constants. The effectiveness of our approach is also confirmed in numerical simulations where we compare to several baselines on a least-squares regression problem.

3 0.74914789 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods

Author: Andre Wibisono, Martin J. Wainwright, Michael I. Jordan, John C. Duchi

Abstract: We consider derivative-free algorithms for stochastic optimization problems that use only noisy function values rather than gradients, analyzing their finite-sample convergence rates. We show that if pairs of function values are available, algorithms that √ gradient estimates based on random perturbations suffer a factor use of at most d in convergence rate over traditional stochastic gradient methods, where d is the problem dimension. We complement our algorithmic development with information-theoretic lower bounds on the minimax convergence rate of such problems, which show that our bounds are sharp with respect to all problemdependent quantities: they cannot be improved by more than constant factors. 1

4 0.71594477 20 nips-2012-A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets

Author: Nicolas L. Roux, Mark Schmidt, Francis R. Bach

Abstract: We propose a new stochastic gradient method for optimizing the sum of a finite set of smooth functions, where the sum is strongly convex. While standard stochastic gradient methods converge at sublinear rates for this problem, the proposed method incorporates a memory of previous gradient values in order to achieve a linear convergence rate. In a machine learning context, numerical experiments indicate that the new algorithm can dramatically outperform standard algorithms, both in terms of optimizing the training error and reducing the test error quickly. 1

5 0.65965456 285 nips-2012-Query Complexity of Derivative-Free Optimization

Author: Ben Recht, Kevin G. Jamieson, Robert Nowak

Abstract: This paper provides lower bounds on the convergence rate of Derivative Free Optimization (DFO) with noisy function evaluations, exposing a fundamental and unavoidable gap between the performance of algorithms with access to gradients and those with access to only function evaluations. However, there are situations in which DFO is unavoidable, and for such situations we propose a new DFO algorithm that is proved to be near optimal for the class of strongly convex objective functions. A distinctive feature of the algorithm is that it uses only Boolean-valued function comparisons, rather than function evaluations. This makes the algorithm useful in an even wider range of applications, such as optimization based on paired comparisons from human subjects, for example. We also show that regardless of whether DFO is based on noisy function evaluations or Boolean-valued function comparisons, the convergence rate is the same. 1

6 0.63919288 324 nips-2012-Stochastic Gradient Descent with Only One Projection

7 0.61301851 292 nips-2012-Regularized Off-Policy TD-Learning

8 0.54396141 76 nips-2012-Communication-Efficient Algorithms for Statistical Optimization

9 0.54166579 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization

10 0.53597993 131 nips-2012-Feature Clustering for Accelerating Parallel Coordinate Descent

11 0.53095198 84 nips-2012-Convergence Rate Analysis of MAP Coordinate Minimization Algorithms

12 0.51723284 293 nips-2012-Relax and Randomize : From Value to Algorithms

13 0.51510614 161 nips-2012-Interpreting prediction markets: a stochastic approach

14 0.49625653 217 nips-2012-Mixability in Statistical Learning

15 0.49123684 143 nips-2012-Globally Convergent Dual MAP LP Relaxation Solvers using Fenchel-Young Margins

16 0.48905447 6 nips-2012-A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation

17 0.48834056 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation

18 0.48683479 29 nips-2012-Accelerated Training for Matrix-norm Regularization: A Boosting Approach

19 0.46422523 282 nips-2012-Proximal Newton-type methods for convex optimization

20 0.46074834 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.061), (17, 0.012), (21, 0.026), (32, 0.193), (38, 0.175), (39, 0.012), (42, 0.078), (54, 0.015), (55, 0.029), (68, 0.013), (74, 0.026), (76, 0.123), (80, 0.085), (92, 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.89043146 23 nips-2012-A lattice filter model of the visual pathway

Author: Karol Gregor, Dmitri B. Chklovskii

Abstract: Early stages of visual processing are thought to decorrelate, or whiten, the incoming temporally varying signals. Motivated by the cascade structure of the visual pathway (retina → lateral geniculate nucelus (LGN) → primary visual cortex, V1) we propose to model its function using lattice filters - signal processing devices for stage-wise decorrelation of temporal signals. Lattice filter models predict neuronal responses consistent with physiological recordings in cats and primates. In particular, they predict temporal receptive fields of two different types resembling so-called lagged and non-lagged cells in the LGN. Moreover, connection weights in the lattice filter can be learned using Hebbian rules in a stage-wise sequential manner reminiscent of the neuro-developmental sequence in mammals. In addition, lattice filters can model visual processing in insects. Therefore, lattice filter is a useful abstraction that captures temporal aspects of visual processing. Our sensory organs face an ongoing barrage of stimuli from the world and must transmit as much information about them as possible to the rest of the brain [1]. This is a formidable task because, in sensory modalities such as vision, the dynamic range of natural stimuli (more than three orders of magnitude) greatly exceeds the dynamic range of relay neurons (less than two orders of magnitude) [2]. The reason why high fidelity transmission is possible at all is that the continuity of objects in the physical world leads to correlations in natural stimuli, which imply redundancy. In turn, such redundancy can be eliminated by compression performed by the front end of the visual system leading to the reduction of the dynamic range [3, 4]. A compression strategy appropriate for redundant natural stimuli is called predictive coding [5, 6, 7]. In predictive coding, a prediction of the incoming signal value is computed from past values delayed in the circuit. This prediction is subtracted from the actual signal value and only the prediction error is transmitted. In the absence of transmission noise such compression is lossless as the original signal could be decoded on the receiving end by inverting the encoder. If predictions are accurate, the dynamic range of the error is much smaller than that of the natural stimuli. Therefore, minimizing dynamic range using predictive coding reduces to optimizing prediction. Experimental support for viewing the front end of the visual system as a predictive encoder comes from the measurements of receptive fields [6, 7]. In particular, predictive coding suggests that, for natural stimuli, the temporal receptive fields should be biphasic and the spatial receptive fields center-surround. These predictions are born out by experimental measurements in retinal ganglion cells, [8], lateral geniculate nucleus (LGN) neurons [9] and fly second order visual neurons called large monopolar cells (LMCs) [2]. In addition, the experimentally measured receptive fields vary with signal-to-noise ratio as would be expected from optimal prediction theory [6]. Furthermore, experimentally observed whitening of the transmitted signal [10] is consistent with removing correlated components from the incoming signals [11]. As natural stimuli contain correlations on time scales greater than hundred milliseconds, experimentally measured receptive fields of LGN neurons are equally long [12]. Decorrelation over such long time scales requires equally long delays. How can such extended receptive field be produced by 1 biological neurons and synapses whose time constants are typically less than hundred milliseconds [13]? The field of signal processing offers a solution to this problem in the form of a device called a lattice filter, which decorrelates signals in stages, sequentially adding longer and longer delays [14, 15, 16, 17]. Motivated by the cascade structure of visual systems [18], we propose to model decorrelation in them by lattice filters. Naturally, visual systems are more complex than lattice filters and perform many other operations. However, we show that the lattice filter model explains several existing observations in vertebrate and invertebrate visual systems and makes testable predictions. Therefore, we believe that lattice filters provide a convenient abstraction for modeling temporal aspects of visual processing. This paper is organized as follows. First, we briefly summarize relevant results from linear prediction theory. Second, we explain the operation of the lattice filter in discrete and continuous time. Third, we compare lattice filter predictions with physiological measurements. 1 Linear prediction theory Despite the non-linear nature of neurons and synapses, the operation of some neural circuits in vertebrates [19] and invertebrates [20] can be described by a linear systems theory. The advantage of linear systems is that optimal circuit parameters may be obtained analytically and the results are often intuitively clear. Perhaps not surprisingly, the field of signal processing relies heavily on the linear prediction theory, offering a convenient framework [15, 16, 17]. Below, we summarize the results from linear prediction that will be used to explain the operation of the lattice filter. Consider a scalar sequence y = {yt } where time t = 1, . . . , n. Suppose that yt at each time point depends on side information provided by vector zt . Our goal is to generate a series of linear predictions, yt from the vector zt , yt = w · zt . We define a prediction error as: ˆ ˆ et = yt − yt = yt − w · zt ˆ (1) and look for values of w that minimize mean squared error: e2 = 1 nt e2 = t t 1 nt (yt − w · zt )2 . (2) t The weight vector, w is optimal for prediction of sequence y from sequence z if and only if the prediction error sequence e = y − w · z is orthogonal to each component of vector z: ez = 0. (3) When the whole series y is given in advance, i.e. in the offline setting, these so-called normal equations can be solved for w, for example, by Gaussian elimination [21]. However, in signal processing and neuroscience applications, another setting called online is more relevant: At every time step t, prediction yt must be made using only current values of zt and w. Furthermore, after a ˆ prediction is made, w is updated based on the prediction yt and observed yt , zt . ˆ In the online setting, an algorithm called stochastic gradient descent is often used, where, at each time step, w is updated in the direction of negative gradient of e2 : t w →w−η w (yt − w · zt ) 2 . (4) This leads to the following weight update, known as least mean square (LMS) [15], for predicting sequence y from sequence z: w → w + ηet zt , (5) where η is the learning rate. The value of η represents the relative influence of more recent observations compared to more distant ones. The larger the learning rate the faster the system adapts to recent observations and less past it remembers. In this paper, we are interested in predicting a current value xt of sequence x from its past values xt−1 , . . . , xt−k restricted by the prediction order k > 0: xt = wk · (xt−1 , . . . , xt−k )T . ˆ 2 (6) This problem is a special case of the online linear prediction framework above, where yt = xt , zt = (xt−1 , . . . , xt−k )T . Then the gradient update is given by: w → wk + ηet (xt−1 , . . . , xt−k )T . (7) While the LMS algorithm can find the weights that optimize linear prediction (6), the filter wk has a long temporal extent making it difficult to implement with neurons and synapses. 2 Lattice filters One way to generate long receptive fields in circuits of biological neurons is to use a cascade architecture, known as the lattice filter, which calculates optimal linear predictions for temporal sequences and transmits prediction errors [14, 15, 16, 17]. In this section, we explain the operation of a discrete-time lattice filter, then adapt it to continuous-time operation. 2.1 Discrete-time implementation The first stage of the lattice filter, Figure 1, calculates the error of the first order optimal prediction (i.e. only using the preceding element of the sequence), the second stage uses the output of the first stage and calculates the error of the second order optimal prediction (i.e. using only two previous values) etc. To make such stage-wise error computations possible the lattice filter calculates at every stage not only the error of optimal prediction of xt from past values xt−1 , . . . , xt−k , called forward error, ftk = xt − wk · (xt−1 , . . . , xt−k )T , (8) but, perhaps non-intuitively, also the error of optimal prediction of a past value xt−k from the more recent values xt−k+1 , . . . , xt , called backward error: bk = xt−k − w k · (xt−k+1 , . . . , xt )T , t k where w and w k (9) are the weights of the optimal prediction. For example, the first stage of the filter calculates the forward error ft1 of optimal prediction of xt from xt−1 : ft1 = xt − u1 xt−1 as well as the backward error b1 of optimal prediction of xt−1 from t xt : b1 = xt−1 − v 1 xt , Figure 1. Here, we assume that coefficients u1 and v 1 that give optimal linear t prediction are known and return to learning them below. Each following stage of the lattice filter performs a stereotypic operation on its inputs, Figure 1. The k-th stage (k > 1) receives forward, ftk−1 , and backward, bk−1 , errors from the previous stage, t delays backward error by one time step and computes a forward error: ftk = ftk−1 − uk bk−1 t−1 (10) of the optimal linear prediction of ftk−1 from bk−1 . In addition, each stage computes a backward t−1 error k−1 k bt = bt−1 − v k ftk−1 (11) of the optimal linear prediction of bk−1 from ftk−1 . t−1 As can be seen in Figure 1, the lattice filter contains forward prediction error (top) and backward prediction error (bottom) branches, which interact at every stage via cross-links. Operation of the lattice filter can be characterized by the linear filters acting on the input, x, to compute forward or backward errors of consecutive order, so called prediction-error filters (blue bars in Figure 1). Because of delays in the backward error branch the temporal extent of the filters grows from stage to stage. In the next section, we will argue that prediction-error filters correspond to the measurements of temporal receptive fields in neurons. For detailed comparison with physiological measurements we will use the result that, for bi-phasic prediction-error filters, such as the ones in Figure 1, the first bar of the forward prediction-error filter has larger weight, by absolute value, than the combined weights of the remaining coefficients of the corresponding filter. Similarly, in backward predictionerror filters, the last bar has greater weight than the rest of them combined. This fact arises from the observation that forward prediction-error filters are minimum phase, while backward predictionerror filters are maximum phase [16, 17]. 3 Figure 1: Discrete-time lattice filter performs stage-wise computation of forward and backward prediction errors. In the first stage, the optimal prediction of xt from xt−1 is computed by delaying the input by one time step and multiplying it by u1 . The upper summation unit subtracts the predicted xt from the actual value and outputs prediction error ft1 . Similarly, the optimal prediction of xt−1 from xt is computed by multiplying the input by v 1 . The lower summation unit subtracts the optimal prediction from the actual value and outputs backward error b1 . In each following stage t k, the optimal prediction of ftk−1 from bk−1 is computed by delaying bk−1 by one time step and t t multiplying it by uk . The upper summation unit subtracts the prediction from the actual ftk−1 and outputs prediction error ftk . Similarly, the optimal prediction of bk−1 from ftk−1 is computed by t−1 multiplying it by uk . The lower summation unit subtracts the optimal prediction from the actual value and outputs backward error bk . Black connections have unitary weights and red connections t have learnable negative weights. One can view forward and backward error calculations as applications of so-called prediction-error filters (blue) to the input sequence. Note that the temporal extent of the filters gets longer from stage to stage. Next, we derive a learning rule for finding optimal coefficients u and v in the online setting. The uk is used for predicting ftk−1 from bk−1 to obtain error ftk . By substituting yt = ftk−1 , zt = bk−1 and t−1 t−1 et = ftk into (5) the update of uk becomes uk → uk + ηftk bk−1 . t−1 (12) Similarly, v k is updated by v k → v k + ηbk ftk−1 . (13) t Interestingly, the updates of the weights are given by the product of the activities of outgoing and incoming nodes of the corresponding cross-links. Such updates are known as Hebbian learning rules thought to be used by biological neurons [22, 23]. Finally, we give a simple proof that, in the offline setting when the entire sequence x is known, f k and bk , given by equations (10, 11), are indeed errors of optimal k-th order linear prediction. Let D be one step time delay operator (Dx)t = xt−1 . The induction statement at k is that f k and bk are k-th order forward and backward errors of optimal linear prediction which is equivalent to f k and bk k k being of the form f k = x−w1 Dx−. . .−wk Dk x and bk = Dk x−w1k Dk−1 x−. . .−wkk x and, from k i normal equations (3), satisfying f D x = 0 and Dbk Di x = bk Di−1 x = 0 for i = 1, . . . , k. That this is true for k = 1 directly follows from the definition of f 1 and b1 . Now we assume that this is true for k − 1 ≥ 1 and show it is true for k. It is easy to see from the forms of f k−1 and bk−1 k k and from f k = f k−1 − uk Dbk−1 that f k has the correct form f k = x − w1 Dx − . . . − wk Dk x. k i k−1 k k−1 Regarding orthogonality for i = 1, . . . , k − 1 we have f D x = (f − u Db )Di x = f k−1 Di x − uk (Dbk−1 )Di x = 0 using the induction assumptions of orhogonality at k − 1. For the remaining i = k we note that f k is the error of the optimal linear prediction of f k−1 from Dbk−1 k−1 and therefore 0 = f k Dbk−1 = f k (Dk x − w1k−1 Dk−1 x − . . . + wk−1 Dx) = f k Dk x as desired. The bk case can be proven similarly. 2.2 Continuous-time implementation The last hurdle remaining for modeling neuronal circuits which operate in continuous time with a lattice filter is its discrete-time operation. To obtain a continuous-time implementation of the lattice 4 filter we cannot simply take the time step size to zero as prediction-error filters would become infinitesimally short. Here, we adapt the discrete-time lattice filter to continous-time operation in two steps. First, we introduce a discrete-time Laguerre lattice filter [24, 17] which uses Laguerre polynomials rather than the shift operator to generate its basis functions, Figure 2. The input signal passes through a leaky integrator whose leakage constant α defines a time-scale distinct from the time step (14). A delay, D, at every stage is replaced by an all-pass filter, L, (15) with the same constant α, which preserves the magnitude of every Fourier component of the input but shifts its phase in a frequency dependent manner. Such all-pass filter reduces to a single time-step delay when α = 0. The optimality of a general discrete-time Laguerre lattice filter can be proven similarly to that for the discrete-time filter, simply by replacing operator D with L in the proof of section 2.1. Figure 2: Continuous-time lattice filter using Laguerre polynomials. Compared to the discretetime version, it contains a leaky integrator, L0 ,(16) and replaces delays with all-pass filters, L, (17). Second, we obtain a continuous-time formulation of the lattice filter by replacing t − 1 → t − δt, defining the inverse time scale γ = (1 − α)/δt and taking the limit δt → 0 while keeping γ fixed. As a result L0 and L are given by: Discrete time L0 (x)t L(x)t Continuous time = αL0 (x)t−1 + xt (14) = α(L(x)t−1 − xt ) + xt−1 (15) dL0 (x)/dt = −γL0 (x) + x L(x) = x − 2γL0 (x) (16) (17) Representative impulse responses of the continuous Laguerre filter are shown in Figure 2. Note that, similarly to the discrete-time case, the area under the first (peak) phase is greater than the area under the second (rebound) phase in the forward branch and the opposite is true in the backward branch. Moreover, the temporal extent of the rebound is greater than that of the peak not just in the forward branch like in the basic discrete-time implementation but also in the backward branch. As will be seen in the next section, these predictions are confirmed by physiological recordings. 3 Experimental evidence for the lattice filter in visual pathways In this section we demonstrate that physiological measurements from visual pathways in vertebrates and invertebrates are consistent with the predictions of the lattice filter model. For the purpose of modeling visual pathways, we identify summation units of the lattice filter with neurons and propose that neural activity represents forward and backward errors. In the fly visual pathway neuronal activity is represented by continuously varying graded potentials. In the vertebrate visual system, all neurons starting with ganglion cells are spiking and we identify their firing rate with the activity in the lattice filter. 3.1 Mammalian visual pathway In mammals, visual processing is performed in stages. In the retina, photoreceptors synapse onto bipolar cells, which in turn synapse onto retinal ganglion cells (RGCs). RGCs send axons to the LGN, where they synapse onto LGN relay neurons projecting to the primary visual cortex, V1. In addition to this feedforward pathway, at each stage there are local circuits involving (usually inhibitory) inter-neurons such as horizontal and amacrine cells in the retina. Neurons of each class 5 come in many types, which differ in their connectivity, morphology and physiological response. The bewildering complexity of these circuits has posed a major challenge to visual neuroscience. Alonso et al. • Connections between LGN and Cortex J. Neurosci., June 1, 2001, 21(11):4002–4015 4009 Temporal Filter 1 0.5 0 -0.5 -1 RGC LGN 0 100 Time (ms) 200 Figure 7. Distribution of geniculate cells and simple cells with respect to the timing of their responses. The distribution of three parameters derived from impulse responses of geniculate and cortical neurons is shown. A, Peak time. B, Zero-crossing time. C, Rebound index. Peak time is the time with the strongest response in the first phase of the impulse response. Zero-crossing time is the time between the first and second phases. Rebound index is the area of the impulse response after the zero crossing divided by the area before the zero crossing. Only impulse responses with good signal to noise were included (Ͼ5 SD above baseline; n ϭ 169). Figure 3: Electrophysiologically measured temporal receptive fields get progressively longer along the cat visual pathway. Left: A cat LGN cell (red) has a longer receptive field than a corresponding RGC cell (blue) (adapted from [12] which also reports population data). Right (A,B): Extent of the temporal receptive fields of simple cells in cat V1 is greater than that of corresponding LGN cells as quantified by the peak (A) and zero-crossing (B) times. Right (C): In the temporal receptive fields of cat LGN and V1 cells the peak can be stronger or weaker than the rebound (adapted from [25]). simple cells and geniculate cells differed for all temporal parameters measured, there was considerable overlap between the distributions (Fig. 7). This overlap raises the following question: does connectivity depend on how well geniculate and cortical responses are matched with respect to time? For instance, do simple cells with fast subregions (early times to peak and early zero crossings) receive input mostly from geniculate cells with fast centers? Figure 8 illustrates the visual responses from a geniculate cell and a simple cell that were monosynaptically connected. A strong positive peak was observed in the correlogram (shown with a 10 msec time window to emphasize its short latency and fast rise time). In this case, an ON central subregion was well overlapped with an ON geniculate center (precisely at the peak of the subregion). Moreover, the timings of the visual responses from the overlapped subregion and the geniculate center were very similar (same onset, ϳ0 –25 msec; same peak, ϳ25–50 msec). It is worth noting that the two central subregions of the simple cell were faster and stronger than the two lateral subregions. The responses of the central subregions matched the timing of the geniculate center. In contrast, the timing of the lateral subregions resembled more closely the timing of the geniculate surround (both peaked at 25–50 msec). Unlike the example shown in Figure 8, a considerable number of geniculocortical pairs produced responses with different timing. For example, Figure 9 illustrates a case in which a geniculate center fully overlapped a strong simple-cell subregion of the same sign, but with slower timing (LGN onset, ϳ0 –25 msec; peak, ϳ25–50 msec; simple-cell onset, ϳ25–50 msec; peak, ϳ50 –75 msec). The cross-correlogram between this pair of neurons was flat, which indicates the absence of a monosynaptic connection (Fig. 9, top right). To examine the role of timing in geniculocortical connectivity, we measured the response time course from all cell pairs that met two criteria. First, the geniculate center overlapped a simple-cell subregion of the same sign (n ϭ 104). Second, the geniculate center overlapped the cortical subregion in a near-optimal position (relative overlap Ͼ 50%, n ϭ 47; see Materials and Methods; Fig. 5A). All these cell pairs had a high probability of being monosynaptically connected because of the precise match in receptive-field position and sign (31 of 47 were connected). The distributions of peak time, zero-crossing time, and rebound index from these cell pairs were very similar to the distributions from the entire sample (Fig. 7; see also Fig. 10 legend). The selected cell pairs included both presumed directional (predicted DI Ͼ 0.3, see Materials and Methods; 12/20 connected) and nondirectional (19/27 connected) simple cells. Most geniculate cells had small receptive fields (less than two simple-cell subregion widths; see Receptive-field sign), although five cells with larger receptive fields were also included (three connected). From the 47 cell pairs used in this analysis, those with similar response time courses had a higher probability of being connected (Fig. 10). In particular, cell pairs that had both similar peak time and zero-crossing time were all connected (n ϭ 12; Fig. 10 A). Directionally selective simple cells were included in all timing groups. For example, in Figure 10 A there were four, five, two, and one directionally selective cells in the time groups Ͻ20, 40, 60, and Ͼ60 msec, respectively. Similar results were obtained if we restricted our sample to geniculate centers overlapped with the dominant subregion of the simple cell (n ϭ 31). Interestingly, the efficacy and contributions of the connections seemed to depend little on the relative timing of the visual responses (Fig. 10, right). Although our sample of them was quite small, lagged cells are of considerable interest and therefore deserve comment. We recorded from 13 potentially lagged LGN cells whose centers were superimposed with a simple-cell subregion (eight with rebound indices between 1.2 and 1.5; five with rebound indices Ͼ1.9). Only seven of these pairs could be used for timing comparisons (in one pair the baseline of the correlogram had insufficient spikes; in three pairs the geniculate receptive fields were Here, we point out several experimental observations related to temporal processing in the visual system consistent with the lattice filter model. First, measurements of temporal receptive fields demonstrate that they get progressively longer at each consecutive stage: i) LGN neurons have longer receptive fields than corresponding pre-synaptic ganglion cells [12], Figure 3left; ii) simple cells in V1 have longer receptive fields than corresponding pre-synaptic LGN neurons [25], Figure 3rightA,B. These observation are consistent with the progressively greater temporal extent of the prediction-error filters (blue plots in Figure 2). Second, the weight of the peak (integrated area under the curve) may be either greater or less than that of the rebound both in LGN relay cells [26] and simple cells of V1 [25], Figure 3right(C). Neurons with peak weight exceeding that of rebound are often referred to as non-lagged while the others are known as lagged found both in cat [27, 28, 29] and monkey [30]. The reason for this becomes clear from the response to a step stimulus, Figure 4(top). By comparing experimentally measured receptive fields with those of the continuous lattice filter, Figure 4, we identify non-lagged neurons with the forward branch and lagged neurons with the backward branch. Another way to characterize step-stimulus response is whether the sign of the transient is the same (non-lagged) or different (lagged) relative to sustained response. Third, measurements of cross-correlation between RGCs and LGN cell spikes in lagged and nonlagged neurons reveals a difference of the transfer function indicative of the difference in underlying circuitry [30]. This is consistent with backward pathway circuit of the Laguerre lattice filter, Figure 2, being more complex then that of the forward path (which results in different transfer function). ” (or replacing ”more complex” with ”different”) Third, measurements of cross-correlation between RGCs and LGN cell spikes in lagged and nonlagged neurons reveals a difference of the transfer function indicative of the difference in underlying circuitry [31]. This is consistent with the backward branch circuit of the Laguerre lattice filter, Figure 2, being different then that of the forward branch (which results in different transfer function). In particular, a combination of different glutamate receptors such as AMPA and NMDA, as well as GABA receptors are thought to be responsible for observed responses in lagged cells [32]. However, further investigation of the corresponding circuitry, perhaps using connectomics technology, is desirable. Fourth, the cross-link weights of the lattice filter can be learned using Hebbian rules, (12,13) which are biologically plausible [22, 23]. Interestingly, if these weights are learned sequentially, starting from the first stage, they do not need to be re-learned when additional stages are added or learned. This property maps naturally on the fact that in the course of mammalian development the visual pathway matures in a stage-wise fashion - starting with the retina, then LGN, then V1 - and implying that the more peripheral structures do not need to adapt to the maturation of the downstream ones. 6 Figure 4: Comparison of electrophysiologically measured responses of cat LGN cells with the continuous-time lattice filter model. Top: Experimentally measured temporal receptive fields and step-stimulus responses of LGN cells (adapted from [26]). Bottom: Typical examples of responses in the continuous-time lattice filter model. Lattice filter coefficients were u1 = v 1 = 0.4, u2 = v 2 = 0.2 and 1/γ = 50ms to model the non-lagged cell and u1 = v 1 = u2 = v 2 = 0.2 and 1/γ = 60ms to model the lagged cell. To model photoreceptor contribution to the responses, an additional leaky integrator L0 was added to the circuit of Figure 2. While Hebbian rules are biologically plausible, one may get an impression from Figure 2 that they must apply to inhibitory cross-links. We point out that this circuit is meant to represent only the computation performed rather than the specific implementation in terms of neurons. As the same linear computation can be performed by circuits with a different arrangement of the same components, there are multiple implementations of the lattice filter. For example, activity of non-lagged OFF cells may be seen as representing minus forward error. Then the cross-links between the non-lagged OFF pathway and the lagged ON pathway would be excitatory. In general, classification of cells into lagged and non-lagged seems independent of their ON/OFF and X/Y classification [31, 28, 29], but see[33]. 3.2 Insect visual pathway In insects, two cell types, L1 and L2, both post-synaptic to photoreceptors play an important role in visual processing. Physiological responses of L1 and L2 indicate that they decorrelate visual signals by subtracting their predictable parts. In fact, receptive fields of these neurons were used as the first examples of predictive coding in neuroscience [6]. Yet, as the numbers of synapses from photoreceptors to L1 and L2 are the same [34] and their physiological properties are similar, it has been a mystery why insects, have not just one but a pair of such seemingly redundant neurons per facet. Previously, it was suggested that L1 and L2 provide inputs to the two pathways that map onto ON and OFF pathways in the vertebrate retina [35, 36]. Here, we put forward a hypothesis that the role of L1 and L2 in visual processing is similar to that of the two branches of the lattice filter. We do not incorporate the ON/OFF distinction in the effectively linear lattice filter model but anticipate that such combined description will materialize in the future. As was argued in Section 2, in forward prediction-error filters, the peak has greater weight than the rebound, while in backward prediction-error filters the opposite is true. Such difference implies that in response to a step-stimulus the signs of sustained responses compared to initial transients are different between the branches. Indeed, Ca2+ imaging shows that responses of L1 and L2 to step-stimulus are different as predicted by the lattice filter model [35], Figure 5b. Interestingly, the activity of L1 seems to represent minus forward error and L2 - plus backward error, suggesting that the lattice filter cross-links are excitatory. To summarize, the predictions of the lattice filter model seem to be consistent with the physiological measurements in the fly visual system and may help understand its operation. 7 Stimulus 1 0.5 0 0 5 10 15 20 5 10 15 20 5 10 time 15 20 − Forward Error 1 0 −1 0 Backward Error 1 0 −1 0 Figure 5: Response of the lattice filter and fruit fly LMCs to a step-stimulus. Left: Responses of the first order discrete-time lattice filter to a step stimulus. Right: Responses of fly L1 and L2 cells to a moving step stimulus (adapted from [35]). Predicted and the experimentally measured responses have qualitatively the same shape: a transient followed by sustained response, which has the same sign for the forward error and L1 and the opposite sign for the backward error and L2. 4 Discussion Motivated by the cascade structure of the visual pathway, we propose to model its operation with the lattice filter. We demonstrate that the predictions of the continuous-time lattice filter model are consistent with the course of neural development and the physiological measurement in the LGN, V1 of cat and monkey, as well as fly LMC neurons. Therefore, lattice filters may offer a useful abstraction for understanding aspects of temporal processing in visual systems of vertebrates and invertebrates. Previously, [11] proposed that lagged and non-lagged cells could be a result of rectification by spiking neurons. Although we agree with [11] that LGN performs temporal decorrelation, our explanation does not rely on non-linear processing but rather on the cascade architecture and, hence, is fundamentally different. Our model generates the following predictions that are not obvious in [11]: i) Not only are LGN receptive fields longer than RGC but also V1 receptive fields are longer than LGN; ii) Even a linear model can generate a difference in the peak/rebound ratio; iii) The circuit from RGC to LGN should be different for lagged and non-lagged cells consistent with [31]; iv) The lattice filter circuit can self-organize using Hebbian rules, which gives a mechanistic explanation of receptive fields beyond the normative framework of [11]. In light of the redundancy reduction arguments given in the introduction, we note that, if the only goal of the system were to compress incoming signals using a given number of lattice filter stages, then after the compression is peformed only one kind of prediction errors, forward or backward needs to be transmitted. Therefore, having two channels, in the absence of noise, may seem redundant. However, transmitting both forward and backward errors gives one the flexibility to continue decorrelation further by adding stages performing relatively simple operations. We are grateful to D.A. Butts, E. Callaway, M. Carandini, D.A. Clark, J.A. Hirsch, T. Hu, S.B. Laughlin, D.N. Mastronarde, R.C. Reid, H. Rouault, A. Saul, L. Scheffer, F.T. Sommer, X. Wang for helpful discussions. References [1] F. Rieke, D. Warland, R.R. van Steveninck, and W. Bialek. Spikes: exploring the neural code. MIT press, 1999. [2] S.B. Laughlin. Matching coding, circuits, cells, and molecules to signals: general principles of retinal design in the fly’s eye. Progress in retinal and eye research, 13(1):165–196, 1994. [3] F. Attneave. Some informational aspects of visual perception. Psychological review, 61(3):183, 1954. [4] H. Barlow. Redundancy reduction revisited. Network: Comp in Neural Systems, 12(3):241–253, 2001. [5] R.M. Gray. Linear Predictive Coding and the Internet Protocol. Now Publishers, 2010. [6] MV Srinivasan, SB Laughlin, and A. Dubs. Predictive coding: a fresh view of inhibition in the retina. Proceedings of the Royal Society of London. Series B. Biological Sciences, 216(1205):427–459, 1982. [7] T. Hosoya, S.A. Baccus, and M. Meister. Dynamic predictive coding by the retina. Nature, 436:71, 2005. 8 [8] HK Hartline, H.G. Wagner, and EF MacNichol Jr. The peripheral origin of nervous activity in the visual system. Studies on excitation and inhibition in the retina: a collection of papers from the laboratories of H. Keffer Hartline, page 99, 1974. [9] N.A. Lesica, J. Jin, C. Weng, C.I. Yeh, D.A. Butts, G.B. Stanley, and J.M. Alonso. Adaptation to stimulus contrast and correlations during natural visual stimulation. Neuron, 55(3):479–491, 2007. [10] Y. Dan, J.J. Atick, and R.C. Reid. Efficient coding of natural scenes in the lateral geniculate nucleus: experimental test of a computational theory. The Journal of Neuroscience, 16(10):3351–3362, 1996. [11] D.W. Dong and J.J. Atick. Statistics of natural time-varying images. Network: Computation in Neural Systems, 6(3):345–358, 1995. [12] X. Wang, J.A. Hirsch, and F.T. Sommer. Recoding of sensory information across the retinothalamic synapse. The Journal of Neuroscience, 30(41):13567–13577, 2010. [13] C. Koch. Biophysics of computation: information processing in single neurons. Oxford Univ Press, 2005. [14] F. Itakura and S. Saito. On the optimum quantization of feature parameters in the parcor speech synthesizer. In Conference Record, 1972 International Conference on Speech Communication and Processing, Boston, MA, pages 434–437, 1972. [15] B. Widrow and S.D. Stearns. Adaptive signal processing. Prentice-Hall, Inc. Englewood Cliffs, NJ, 1985. [16] S. Haykin. Adaptive filter theory. Prentice-Hall, Englewood-Cliffs, NJ, 2003. [17] A.H. Sayed. Fundamentals of adaptive filtering. Wiley-IEEE Press, 2003. [18] D.J. Felleman and D.C. Van Essen. Distributed hierarchical processing in the primate cerebral cortex. Cerebral cortex, 1(1):1–47, 1991. [19] X. Wang, F.T. Sommer, and J.A. Hirsch. Inhibitory circuits for visual processing in thalamus. Current Opinion in Neurobiology, 2011. [20] SB Laughlin, J. Howard, and B. Blakeslee. Synaptic limitations to contrast coding in the retina of the blowfly calliphora. Proceedings of the Royal society of London. Series B. Biological sciences, 231(1265):437–467, 1987. [21] D.C. Lay. Linear Algebra and Its Applications. Addison-Wesley/Longman, New York/London, 2000. [22] D.O. Hebb. The organization of behavior: A neuropsychological theory. Lawrence Erlbaum, 2002. [23] O. Paulsen and T.J. Sejnowski. Natural patterns of activity and long-term synaptic plasticity. Current opinion in neurobiology, 10(2):172–180, 2000. [24] Z. Fejzo and H. Lev-Ari. Adaptive laguerre-lattice filters. Signal Processing, IEEE Transactions on, 45(12):3006–3016, 1997. [25] J.M. Alonso, W.M. Usrey, and R.C. Reid. Rules of connectivity between geniculate cells and simple cells in cat primary visual cortex. The Journal of Neuroscience, 21(11):4002–4015, 2001. [26] D. Cai, G.C. Deangelis, and R.D. Freeman. Spatiotemporal receptive field organization in the lateral geniculate nucleus of cats and kittens. Journal of Neurophysiology, 78(2):1045–1061, 1997. [27] D.N. Mastronarde. Two classes of single-input x-cells in cat lateral geniculate nucleus. i. receptive-field properties and classification of cells. Journal of Neurophysiology, 57(2):357–380, 1987. [28] J. Wolfe and L.A. Palmer. Temporal diversity in the lateral geniculate nucleus of cat. Visual neuroscience, 15(04):653–675, 1998. [29] AB Saul and AL Humphrey. Spatial and temporal response properties of lagged and nonlagged cells in cat lateral geniculate nucleus. Journal of Neurophysiology, 64(1):206–224, 1990. [30] A.B. Saul. Lagged cells in alert monkey lateral geniculate nucleus. Visual neurosci, 25:647–659, 2008. [31] D.N. Mastronarde. Two classes of single-input x-cells in cat lateral geniculate nucleus. ii. retinal inputs and the generation of receptive-field properties. Journal of Neurophysiology, 57(2):381–413, 1987. [32] P. Heggelund and E. Hartveit. Neurotransmitter receptors mediating excitatory input to cells in the cat lateral geniculate nucleus. i. lagged cells. Journal of neurophysiology, 63(6):1347–1360, 1990. [33] J. Jin, Y. Wang, R. Lashgari, H.A. Swadlow, and J.M. Alonso. Faster thalamocortical processing for dark than light visual targets. The Journal of Neuroscience, 31(48):17471–17479, 2011. [34] M. Rivera-Alba, S.N. Vitaladevuni, Y. Mischenko, Z. Lu, S. Takemura, L. Scheffer, I.A. Meinertzhagen, D.B. Chklovskii, and G.G. de Polavieja. Wiring economy and volume exclusion determine neuronal placement in the drosophila brain. Current Biology, 21(23):2000–5, 2011. [35] D.A. Clark, L. Bursztyn, M.A. Horowitz, M.J. Schnitzer, and T.R. Clandinin. Defining the computational structure of the motion detector in drosophila. Neuron, 70(6):1165–1177, 2011. [36] M. Joesch, B. Schnell, S.V. Raghu, D.F. Reiff, and A. Borst. On and off pathways in drosophila motion vision. Nature, 468(7321):300–304, 2010. 9

2 0.87664717 129 nips-2012-Fast Variational Inference in the Conjugate Exponential Family

Author: James Hensman, Magnus Rattray, Neil D. Lawrence

Abstract: We present a general method for deriving collapsed variational inference algorithms for probabilistic models in the conjugate exponential family. Our method unifies many existing approaches to collapsed variational inference. Our collapsed variational inference leads to a new lower bound on the marginal likelihood. We exploit the information geometry of the bound to derive much faster optimization methods based on conjugate gradients for these models. Our approach is very general and is easily applied to any model where the mean field update equations have been derived. Empirically we show significant speed-ups for probabilistic inference using our bound. 1

3 0.87270236 309 nips-2012-Semi-supervised Eigenvectors for Locally-biased Learning

Author: Toke Hansen, Michael W. Mahoney

Abstract: In many applications, one has side information, e.g., labels that are provided in a semi-supervised manner, about a specific target region of a large data set, and one wants to perform machine learning and data analysis tasks “nearby” that pre-specified target region. Locally-biased problems of this sort are particularly challenging for popular eigenvector-based machine learning and data analysis tools. At root, the reason is that eigenvectors are inherently global quantities. In this paper, we address this issue by providing a methodology to construct semi-supervised eigenvectors of a graph Laplacian, and we illustrate how these locally-biased eigenvectors can be used to perform locally-biased machine learning. These semi-supervised eigenvectors capture successively-orthogonalized directions of maximum variance, conditioned on being well-correlated with an input seed set of nodes that is assumed to be provided in a semi-supervised manner. We also provide several empirical examples demonstrating how these semi-supervised eigenvectors can be used to perform locally-biased learning. 1

same-paper 4 0.83883858 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization

Author: Xi Chen, Qihang Lin, Javier Pena

Abstract: This paper considers a wide spectrum of regularized stochastic optimization problems where both the loss function and regularizer can be non-smooth. We develop a novel algorithm based on the regularized dual averaging (RDA) method, that can simultaneously achieve the optimal convergence rates for both convex and strongly convex loss. In particular, for strongly convex loss, it achieves the opti1 1 mal rate of O( N + N 2 ) for N iterations, which improves the rate O( log N ) for preN vious regularized dual averaging algorithms. In addition, our method constructs the final solution directly from the proximal mapping instead of averaging of all previous iterates. For widely used sparsity-inducing regularizers (e.g., 1 -norm), it has the advantage of encouraging sparser solutions. We further develop a multistage extension using the proposed algorithm as a subroutine, which achieves the 1 uniformly-optimal rate O( N + exp{−N }) for strongly convex loss. 1

5 0.83014363 212 nips-2012-Minimax Multi-Task Learning and a Generalized Loss-Compositional Paradigm for MTL

Author: Nishant Mehta, Dongryeol Lee, Alexander G. Gray

Abstract: Since its inception, the modus operandi of multi-task learning (MTL) has been to minimize the task-wise mean of the empirical risks. We introduce a generalized loss-compositional paradigm for MTL that includes a spectrum of formulations as a subfamily. One endpoint of this spectrum is minimax MTL: a new MTL formulation that minimizes the maximum of the tasks’ empirical risks. Via a certain relaxation of minimax MTL, we obtain a continuum of MTL formulations spanning minimax MTL and classical MTL. The full paradigm itself is loss-compositional, operating on the vector of empirical risks. It incorporates minimax MTL, its relaxations, and many new MTL formulations as special cases. We show theoretically that minimax MTL tends to avoid worst case outcomes on newly drawn test tasks in the learning to learn (LTL) test setting. The results of several MTL formulations on synthetic and real problems in the MTL and LTL test settings are encouraging. 1

6 0.81529588 144 nips-2012-Gradient-based kernel method for feature extraction and variable selection

7 0.76966506 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound

8 0.76315719 243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method

9 0.76155281 358 nips-2012-Value Pursuit Iteration

10 0.76151806 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification

11 0.76125646 27 nips-2012-A quasi-Newton proximal splitting method

12 0.76107323 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions

13 0.76097453 285 nips-2012-Query Complexity of Derivative-Free Optimization

14 0.76067954 227 nips-2012-Multiclass Learning with Simplex Coding

15 0.7600795 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes

16 0.75980163 275 nips-2012-Privacy Aware Learning

17 0.75917411 86 nips-2012-Convex Multi-view Subspace Learning

18 0.75910866 324 nips-2012-Stochastic Gradient Descent with Only One Projection

19 0.75829661 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods

20 0.75771534 246 nips-2012-Nonparametric Max-Margin Matrix Factorization for Collaborative Prediction