nips nips2009 nips2009-76 knowledge-graph by maker-knowledge-mining

76 nips-2009-Efficient Learning using Forward-Backward Splitting


Source: pdf

Author: Yoram Singer, John C. Duchi

Abstract: We describe, analyze, and experiment with a new framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This yields a simple yet effective algorithm for both batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1. We derive concrete and very simple algorithms for minimization of loss functions with ℓ1, ℓ2, ℓ2², and ℓ∞ regularization. We also show how to construct efficient algorithms for mixed-norm ℓ1/ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in experiments with synthetic and natural datasets.

Reference: text


Summary: the most important sentences generated by the tf-idf model

sentIndex sentText sentNum sentScore

1 We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. [sent-7, score-0.161]

2 This yields a simple yet effective algorithm for both batch penalized risk minimization and online learning. [sent-8, score-0.123]

3 Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1 . [sent-9, score-0.158]

4 We derive concrete and very simple algorithms for minimization of loss functions with ℓ1, ℓ2, ℓ2², and ℓ∞ regularization. [sent-10, score-0.095]

5 The focus of this paper is an algorithmic framework for regularized convex programming to minimize the following sum of two functions: f(w) + r(w), (1) where both f and r are convex functions bounded below (so without loss of generality we assume they map into R+). [sent-20, score-0.249]

6 Often, the function f is an empirical loss and takes the form Σ_{i∈S} ℓ_i(w) for a sequence of loss functions ℓ_i : R^n → R_+, and r(w) is a regularization term that penalizes excessively complex vectors, for instance r(w) = λ‖w‖_p. [sent-21, score-0.233]

7 This paper addresses Eq. (1), focusing especially on derivations for, and the use of, non-differentiable regularization functions. [sent-24, score-0.101]

8 One of the most general is the subgradient method [1], which is elegant and very simple. [sent-27, score-0.196]

9 Let ∂f(w) denote the subgradient set of f at w, namely, ∂f(w) = {g | ∀v : f(v) ≥ f(w) + ⟨g, v − w⟩}. [sent-28, score-0.196]

10 Subgradient procedures then minimize the function f(w) by iteratively updating the parameter vector w according to the update rule w_{t+1} = w_t − η_t g_t^f, where η_t is a constant or diminishing step size and g_t^f ∈ ∂f(w_t) is an arbitrary vector from the subgradient set of f evaluated at w_t. [sent-29, score-1.543]

11 A slightly more general method than the above is the projected gradient method, which iterates w_{t+1} = Π_Ω(w_t − η_t g_t^f) = argmin_{w∈Ω} ½‖w − (w_t − η_t g_t^f)‖², where Π_Ω(w) is the Euclidean projection of w onto the set Ω. [sent-30, score-0.787]
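
The projected (sub)gradient iteration above is easy to state in code. Below is a minimal NumPy sketch; the choice of Ω as a Euclidean ball of radius R and the hinge-loss subgradient are assumptions made for illustration, not the paper's setup.

```python
import numpy as np

def project_l2_ball(w, radius):
    """Euclidean projection of w onto Omega = {v : ||v||_2 <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else (radius / norm) * w

def projected_subgradient_step(w, g, eta, radius):
    """One iterate w_{t+1} = Pi_Omega(w_t - eta_t * g_t^f)."""
    return project_l2_ball(w - eta * g, radius)

# Example: one step on the hinge loss f(w) = max(0, 1 - y * <x, w>)
x, y = np.array([1.0, -2.0, 0.5]), 1.0
w = np.zeros(3)
g = -y * x if 1.0 - y * x.dot(w) > 0 else np.zeros_like(w)  # a subgradient of f at w
w = projected_subgradient_step(w, g, eta=0.1, radius=5.0)
```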

12 Standard results [1] show that the (projected) subgradient method converges at a rate of O(1/ε²), or equivalently that the error f(w) − f(w⋆) = O(1/√T), given some simple assumptions on the boundedness of the subdifferential set and Ω (we have omitted constants dependent on ∂f or dim(Ω)). [sent-31, score-0.239]

13 Applying the subgradient method directly to Eq. (1) gives simple iterates of the form w_{t+1} = w_t − η_t g_t^f − η_t g_t^r, where g_t^r ∈ ∂r(w_t). [sent-33, score-0.643]

14 A common problem in subgradient methods is that if r or f is non-differentiable, the iterates of the subgradient method are very rarely at the points of non-differentiability. [sent-34, score-0.427]

15 In the case of regularization functions such as r(w) = ‖w‖₁, however, these points (zeros in the case of the ℓ1-norm) are often the true minima of the function. [sent-35, score-0.117]

16 [5] give proofs of convergence for forward-backward splitting in Hilbert spaces, though without establishing strong rates of convergence. [sent-41, score-0.112]

17 Similar projected-gradient methods, when the regularization function r is no longer part of the objective function but rather cast as a constraint so that r(w) ≤ λ, are also well known [1]. [sent-43, score-0.145]

18 [6] give a general and efficient projected gradient method for ℓ1 -constrained problems. [sent-44, score-0.104]

19 There is also a body of literature on regret analysis for online learning and online convex programming with convex constraints upon which we build [7, 8]. [sent-45, score-0.353]

20 Learning sparse models generally is of great interest in the statistics literature, specifically in the context of consistency and recovery of sparsity patterns through ℓ1 or mixed-norm regularization across multiple tasks [2, 3, 9]. [sent-46, score-0.236]

21 In this paper, we describe a general gradient-based framework, which we call F OBOS, and analyze it in batch and online learning settings. [sent-47, score-0.102]

22 We then provide bounds for online convex programming and give a convergence rate for stochastic gradient descent. [sent-52, score-0.261]

23 We conclude in Sec. 6 with experiments examining various aspects of the proposed framework, in particular the runtime and sparsity selection performance of the derived algorithms. [sent-58, score-0.094]

24 2 Forward-Looking Subgradients and Forward-Backward Splitting In this section we introduce our algorithm, laying the framework for its strategy for online or batch convex programming. [sent-59, score-0.15]

25 F OBOS is motivated by the desire to have the iterates wt attain points of non-differentiability of the function r. [sent-63, score-0.668]

26 The method alleviates the problems of non-differentiability in cases such as ℓ1 -regularization by taking analytical minimization steps interleaved with subgradient steps. [sent-64, score-0.217]

27 Put informally, F OBOS is analogous to the projected subgradient method, but replaces or augments the projection step with an instantaneous minimization problem for which it is possible to derive a closed form solution. [sent-65, score-0.3]

28 F OBOS is succinct as each iteration consists of the following two steps: w_{t+1/2} = w_t − η_t g_t^f and w_{t+1} = argmin_w { ½‖w − w_{t+1/2}‖² + η_{t+1/2} r(w) }. (2) [sent-66, score-0.639]
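
The two steps above translate almost directly into code once the second (backward) step has a closed-form solution for the chosen r. The sketch below assumes such a solver is passed in as prox_r (an illustrative name, not from the paper), with ℓ1 soft thresholding as one concrete choice.

```python
import numpy as np

def fobos_step(w, subgrad_f, eta_t, eta_half, prox_r):
    """One FOBOS iteration:
    forward step:  w_{t+1/2} = w_t - eta_t * g_t^f
    backward step: w_{t+1}   = argmin_w 0.5 * ||w - w_{t+1/2}||^2 + eta_{t+1/2} * r(w)
    """
    w_half = w - eta_t * subgrad_f(w)   # unconstrained (sub)gradient step on f
    return prox_r(w_half, eta_half)     # closed-form minimization involving r

# One concrete choice: r(w) = lam * ||w||_1, whose backward step is soft thresholding.
lam = 0.1
prox_l1 = lambda v, eta: np.sign(v) * np.maximum(np.abs(v) - eta * lam, 0.0)

# A single step on the toy loss f(w) = 0.5 * ||w - b||^2.
b = np.array([1.0, -0.5, 0.0])
w = np.zeros_like(b)
w = fobos_step(w, subgrad_f=lambda u: u - b, eta_t=0.5, eta_half=0.5, prox_r=prox_l1)
```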

29 The first step thus simply amounts to an unconstrained subgradient step with respect to the function f . [sent-69, score-0.251]

30 In the second step we find a new vector that interpolates between two goals: (i) stay close to the interim vector w_{t+1/2}, and (ii) attain a low complexity value as expressed by r. [sent-70, score-0.134]

31 Note that the regularization function is scaled by an interim step size, denoted η_{t+1/2}. [sent-71, score-0.151]

32 Namely, the zero vector must belong to the subgradient set of the objective at the optimum w_{t+1}, that is, 0 ∈ ∂{ ½‖w − w_{t+1/2}‖² + η_{t+1/2} r(w) } evaluated at w = w_{t+1}. [sent-75, score-0.238]

33 Since w_{t+1/2} = w_t − η_t g_t^f, the above property amounts to 0 ∈ w_{t+1} − w_t + η_t g_t^f + η_{t+1/2} ∂r(w_{t+1}). [sent-76, score-1.233]

34 From Eq. (3), we are guaranteed to obtain a vector g_{t+1}^r ∈ ∂r(w_{t+1}) such that 0 = w_{t+1} − w_t + η_t g_t^f + η_{t+1/2} g_{t+1}^r. [sent-78, score-0.628]

35 We can understand this as an update scheme where the new weight vector w_{t+1} is a linear combination of the previous weight vector w_t, a vector from the subgradient set of f at w_t, and a vector from the subgradient of r evaluated at the yet to be determined w_{t+1}. [sent-79, score-1.777]

36 To recap, we can write w_{t+1} as w_{t+1} = w_t − η_t g_t^f − η_{t+1/2} g_{t+1}^r, (4) where g_t^f ∈ ∂f(w_t) and g_{t+1}^r ∈ ∂r(w_{t+1}). [sent-80, score-0.608]

37 Second, the forward-looking gradient allows us to build on existing analyses and show that the resulting framework enjoys the formal convergence properties of many existing gradient-based and online convex programming algorithms. [sent-84, score-0.262]

38 3 Convergence and Regret Analysis of F OBOS In this section we build on known results while using the forward-looking property of F OBOS to provide convergence rate and regret analysis. [sent-85, score-0.159]

39 As we show in the sequel, it is sufficient to set η_{t+1/2} to η_t or η_{t+1}, depending on whether we are doing online or batch optimization, in order to obtain convergence and low regret bounds. [sent-87, score-0.225]

40 Our analysis relies on Eq. (4) and moderately straightforward arguments with convexity and subgradient calculus. [sent-90, score-0.196]

41 We begin by deriving convergence results under the fairly general assumption [10, 11] that the subgradients are bounded as follows: ‖∂f(w)‖² ≤ A f(w) + G², ‖∂r(w)‖² ≤ A r(w) + G². (5) [sent-94, score-0.095]

42 For example, any Lipschitz loss (such as the logistic or hinge/SVM) satisfies the above with A = 0 and G equal to the Lipschitz constant; least squares satisfies Eq. (5) with G = 0 and A depending on the norm of the data. [sent-95, score-0.114]

43 Assume the following hold: (i) the norm of any subgradient from ∂f and the norm of any subgradient from ∂r are bounded as in Eq. (5). [sent-98, score-0.444]

44 min_{t∈{1,...,T}} [f(w_t) + r(w_t)] ≤ (1/T) Σ_{t=1}^{T} [f(w_t) + r(w_t)] ≤ f(w⋆) + r(w⋆) + O(DG/√T) (up to constants involving c, A, and D). Bounds of the form we present above, where the point minimizing f(w_t) + r(w_t) converges rather than the last point w_T, are standard in subgradient optimization. [sent-110, score-0.196]

45 We next derive regret bounds for F OBOS in online settings in which we are given a sequence of functions f_t : R^n → R. [sent-116, score-0.256]

46 The goal is for the sequence of predictions wt to attain low regret when compared to a single optimal predictor w⋆ . [sent-117, score-0.746]

47 Formally, let f_t(w) denote the loss suffered on the t-th input when using a predictor w. [sent-118, score-0.205]

48 The regret of an online algorithm which uses w1 , . [sent-119, score-0.154]

49 a fixed predictor w⋆ while using a regularization function r is R_{f+r}(T) = Σ_{t=1}^{T} [f_t(w_t) + r(w_t) − (f_t(w⋆) + r(w⋆))]. [sent-127, score-0.124]

50 To achieve an online bound for a sequence of convex functions ft , we modify arguments of [7]. [sent-129, score-0.194]

51 Assume that ‖w_t − w⋆‖ ≤ D for all iterations and that the norms of the subgradient sets ∂f_t and ∂r are bounded above by G. [sent-133, score-0.83]

52 Then F OBOS with η_t = c/√t satisfies R_{f+r}(T) ≤ GD + (D²/(2c) + 7G²c)√T. For slightly technical reasons, the assumption on the boundedness of w_t and the subgradients is not actually restrictive (see Appendix A for details). [sent-136, score-0.695]

53 It is possible to obtain an O(log T ) regret bound for F OBOS when the sequence of loss functions ft (·) or the function r(·) is strongly convex, similar to [8], by using the curvature of ft or r. [sent-137, score-0.296]

54 Using the regret analysis for online learning, we can also give convergence rates for stochastic F OBOS, which are O(1/√T). [sent-139, score-0.214]

55 4 Derived Algorithms We now give a few variants of F OBOS by considering different regularization functions. [sent-141, score-0.128]

56 The emphasis of the section is on non-differentiable regularization functions that lead to sparse solutions. [sent-142, score-0.142]

57 We also give simple extensions to apply F OBOS to mixed-norm regularization [9] that build on the first part of this section. [sent-143, score-0.146]

58 To simplify our derivations, we denote by v the vector w_{t+1/2} = w_t − η_t g_t^f and let λ̃ denote η_{t+1/2} · λ. [sent-147, score-0.628]

59 The objective is decomposable into a sum of 1-dimensional convex problems of the form min_w ½(w − v)² + λ̃|w|. [sent-152, score-0.144]

60 The resulting update of Eq. (6) gives a simple online and offline method for minimizing a convex f with ℓ1 regularization. [sent-155, score-0.112]
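
Each one-dimensional problem min_w ½(w − v)² + λ̃|w| is solved by the standard soft-thresholding operator w⋆ = sign(v)·[|v| − λ̃]_+. A minimal coordinate-wise sketch of the resulting ℓ1 update follows; it illustrates the closed form rather than reproducing the paper's pseudocode.

```python
import numpy as np

def fobos_l1_update(w, g, eta, lam):
    """FOBOS step with r(w) = lam * ||w||_1: gradient step, then coordinate-wise soft thresholding."""
    v = w - eta * g                       # interim vector w_{t+1/2}
    lam_tilde = eta * lam                 # effective threshold eta_{t+1/2} * lambda
    return np.sign(v) * np.maximum(np.abs(v) - lam_tilde, 0.0)

w = np.array([0.3, -0.05, 1.2])
g = np.array([0.1, 0.0, -0.4])
w_next = fobos_l1_update(w, g, eta=0.5, lam=0.2)  # small entries are driven exactly to zero
```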

61 We return to this in Sec. 5, where we describe a unified view that facilitates an efficient implementation for all the regularization functions discussed in this paper. [sent-158, score-0.117]

62 Differentiating the objective and setting the result equal to zero, we have w⋆ − v + λ̃w⋆ = 0, which, using the original notation, yields the update w_{t+1} = (1/(1 + λ̃)) (w_t − η_t g_t^f). [sent-160, score-0.667]

63 F OBOS with ℓ2 regularization: A lesser-used regularization function is the ℓ2 norm of the weight vector. [sent-162, score-0.153]

64 The resulting second step of the F OBOS update with ℓ2 regularization amounts to w_{t+1} = [1 − λ̃/‖w_{t+1/2}‖]_+ w_{t+1/2} = [1 − λ̃/‖w_t − η_t g_t^f‖]_+ (w_t − η_t g_t^f). [sent-165, score-0.782]

65 ℓ2-regularization results in a zero weight vector under the condition that ‖w_t − η_t g_t^f‖ ≤ λ̃. [sent-166, score-0.654]

66 This condition is rather more stringent for sparsity than the condition for ℓ1, so it is unlikely to hold in high dimensions. [sent-167, score-0.094]
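
For reference, the ℓ2 shrinkage step above has the following minimal sketch; applying the same operator row by row is one common way to realize the ℓ1/ℓ2 mixed-norm update discussed later (the row-wise usage is an assumption based on the standard block soft-thresholding form, not a quotation of the paper's derivation).

```python
import numpy as np

def l2_shrink(v, lam_tilde):
    """Returns [1 - lam_tilde/||v||]_+ * v; the whole vector is zeroed when ||v|| <= lam_tilde."""
    norm = np.linalg.norm(v)
    if norm <= lam_tilde:
        return np.zeros_like(v)
    return (1.0 - lam_tilde / norm) * v

def fobos_mixed_l1_l2_step(W, G, eta, lam):
    """Row-wise shrinkage on a weight matrix W after a gradient step with gradient G."""
    V = W - eta * G
    return np.vstack([l2_shrink(row, eta * lam) for row in V])

W = np.array([[0.2, -0.1], [1.0, 0.5]])
G = np.zeros_like(W)
W_next = fobos_mixed_l1_l2_step(W, G, eta=0.5, lam=0.5)  # the small first row is zeroed entirely
```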

67 F OBOS with ℓ∞ regularization: We now turn to a less explored regularization function, the ℓ∞ norm of w. [sent-169, score-0.127]

68 Since all the weight vectors operate over the same instance space, it may be beneficial to tie the weights corresponding to the same input feature: we would like to zero the row of weights w_j^1, . . . , w_j^k simultaneously. [sent-189, score-0.093]

69 Then the i-th row contains the weight of the i-th feature for each class. [sent-194, score-0.095]

70 5 Efficient implementation in high dimensions: In many settings, especially online learning, the weight vector w_t and the gradients g_t^f reside in a very high-dimensional space, but only a relatively small number of the components of g_t^f are non-zero. [sent-209, score-0.718]

71 The need to cope with gradient sparsity becomes further pronounced in mixed-norm problems, as a single component of the gradient may correspond to an entire row of W . [sent-211, score-0.266]

72 The algorithmic consequence of Proposition 4 is that it is possible to perform a lazy update on each iteration by omitting the terms of w_t (or whole rows of the matrix W_t when using mixed norms) that are outside the support of g_t^f, the gradient of the loss at iteration t. [sent-225, score-0.799]

73 Let Λ_t denote the sum of the step sizes times regularization multipliers, λη_s, used from round 1 through t. [sent-227, score-0.12]

74 The advantage of the lazy evaluation is pronounced when using mixed-norm regularization as it lets us avoid updating entire rows so long as the row index corresponds to a zero entry of the gradient g^f. [sent-229, score-0.243]
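
A sketch of the lazy-update idea for ℓ1 regularization is given below: each coordinate remembers the cumulative penalty Λ at its last update, and the shrinkage it skipped is applied only when the coordinate reappears in the support of the gradient. The dictionary-based bookkeeping and variable names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

class LazyFobosL1:
    """Sparse FOBOS with r(w) = lam * ||w||_1: only coordinates in the gradient's support are touched."""

    def __init__(self, lam):
        self.lam = lam
        self.w = {}          # non-zero weights, keyed by coordinate index
        self.last_cum = {}   # value of Lambda at the last update of each coordinate
        self.cum = 0.0       # Lambda_t = sum over rounds of eta_s * lam

    def step(self, sparse_grad, eta):
        """sparse_grad: dict {coordinate: gradient value} over the support of g_t^f."""
        self.cum += eta * self.lam
        for j, g in sparse_grad.items():
            w_j = self.w.get(j, 0.0)
            owed = (self.cum - eta * self.lam) - self.last_cum.get(j, 0.0)
            w_j = np.sign(w_j) * max(abs(w_j) - owed, 0.0)       # catch up on the skipped shrinkage
            v = w_j - eta * g                                    # forward step on this coordinate
            self.w[j] = np.sign(v) * max(abs(v) - eta * self.lam, 0.0)
            self.last_cum[j] = self.cum

opt = LazyFobosL1(lam=0.01)
opt.step({3: 0.5, 10: -1.2}, eta=0.1)   # a sparse gradient touching only two coordinates
```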

75 Pegasos was originally implemented and evaluated on SVM-like problems by using the hinge loss as the empirical loss function along with an ℓ2² regularization term, but it can be straightforwardly extended to the binary logistic loss function. [sent-238, score-0.273]

76 The plots in Fig. 1 show (on a log-scale) the regularized empirical loss of the algorithms minus the optimal value of the objective function. [sent-247, score-0.117]

77 The results in Fig. 1 show that F OBOS performs comparably to Pegasos on the logistic loss (left figure) and hinge (SVM) loss (middle figure). [sent-251, score-0.172]

78 For the rightmost plot in Fig. 1, we eliminated this additional projection step and ran the algorithms with the logistic loss. [sent-255, score-0.095]

79 We compared F OBOS to a simple subgradient method, where the subgradient of the λ‖w‖₁ term is simply λ sign(w), and a fast interior point (IP) method which was designed specifically for solving ℓ1-regularized logistic regression [15]. [sent-261, score-0.448]

80 In Fig. 2 we show the objective function (empirical loss plus the ℓ1 regularization term) obtained by each of the algorithms minus the optimal objective value. [sent-263, score-0.222]

81 The standard subgradient method is clearly much slower than the other two methods even though we chose the initial step size for which the subgradient method converged the fastest. [sent-266, score-0.428]

82 Furthermore, the subgradient method does not achieve any sparsity along its entire run. [sent-267, score-0.29]

83 In order to obtain a weight vector w_t such that f(w_t) − f(w⋆) ≤ 10⁻², F OBOS works very well, though the IP method enjoys a faster convergence rate when the weight vector is very close to the optimal solution. [sent-269, score-0.796]

84 However, the IP algorithm was specifically designed to minimize empirical logistic loss with ℓ1 regularization whereas F OBOS enjoys a broad range of applicable settings. [sent-270, score-0.262]

85 Fig. 2 also shows the sparsity levels (fraction of non-zero weights) achieved by F OBOS as a function of the number of iterations of the algorithm. [sent-272, score-0.094]

86 Each line represents a different synthetic experiment as λ is modified to give more or less sparsity to the solution vector w⋆ . [sent-273, score-0.156]

87 The results show that F OBOS quickly selects the sparsity pattern of w⋆ , and the level of sparsity persists throughout its execution. [sent-274, score-0.206]

88 We found this sparsity pattern common to non-stochastic versions of F OBOS we tested. [sent-275, score-0.094]

89 Mixed-norm experiments: Our experiments with mixed-norm regularization (ℓ1 /ℓ2 and ℓ1 /ℓ∞ ) focus mostly on sparsity rather than on the speed of minimizing the objective. [sent-276, score-0.195]

90 Our experiments compared multiclass classification with ℓ1 , ℓ1 /ℓ2 , and ℓ1 /ℓ∞ regularization on the MNIST handwritten digit database and the StatLog Landsat Satellite dataset [16]. [sent-280, score-0.121]

91 Figure 2: Left: Performance of F OBOS, a subgradient method, and an interior point method on ℓ1-regularized logistic regression. [sent-283, score-0.27]

92 Right: sparsity level achieved by F OBOS along its run. [sent-284, score-0.094]

93 Figure 3: Left: F OBOS sparsity and test error for LandSat dataset with ℓ1-regularization. [sent-311, score-0.117]

94 Right: F OBOS sparsity and test error for MNIST dataset with ℓ1 /ℓ2 -regularization. [sent-312, score-0.117]

95 In Fig. 3, we show the test set error and row sparsity in W as a function of training time (number of single-example gradient calculations) for the ℓ1-regularized multiclass logistic loss with 720 training examples. [sent-318, score-0.336]

96 However, the deterministic version increases the level of sparsity throughout its run, while the stochastic-gradient version has highly variable sparsity levels and does not give solutions as sparse as the deterministic counterpart. [sent-324, score-0.256]

97 For comparison of the different regularization approaches, we report in Table 1 the test error as a function of row sparsity of the learned matrix W . [sent-326, score-0.253]

98 However, on the MNIST data, ℓ1 regularization and ℓ1/ℓ2 regularization achieve comparable performance for each level of structural sparsity. [sent-328, score-0.122]

99 The performance on the different datasets might indicate that structural sparsity is effective only when the set of parameters indeed exhibits natural grouping. [sent-330, score-0.115]

100 Table 1: LandSat (left) and MNIST (right) classification error versus sparsity. [sent-355, score-0.094]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('obos', 0.657), ('wt', 0.608), ('subgradient', 0.196), ('regularization', 0.101), ('sparsity', 0.094), ('pegasos', 0.093), ('regret', 0.09), ('folos', 0.08), ('landsat', 0.067), ('ft', 0.066), ('online', 0.064), ('loss', 0.058), ('logistic', 0.056), ('minw', 0.054), ('gradient', 0.05), ('mnist', 0.048), ('convex', 0.048), ('subgradients', 0.046), ('nnz', 0.046), ('ip', 0.042), ('batch', 0.038), ('update', 0.037), ('row', 0.035), ('iterates', 0.035), ('splitting', 0.035), ('appendix', 0.034), ('convergence', 0.033), ('wj', 0.032), ('duchi', 0.031), ('argmin', 0.031), ('cad', 0.031), ('interim', 0.031), ('sign', 0.03), ('enjoys', 0.028), ('projected', 0.027), ('give', 0.027), ('weight', 0.026), ('norm', 0.026), ('attain', 0.025), ('sparse', 0.025), ('zeros', 0.025), ('boundedness', 0.025), ('lazy', 0.025), ('argminw', 0.023), ('satellite', 0.023), ('test', 0.023), ('predictor', 0.023), ('entries', 0.022), ('singer', 0.022), ('cast', 0.022), ('objective', 0.022), ('programming', 0.021), ('algorithmic', 0.021), ('structural', 0.021), ('regularizer', 0.021), ('minimization', 0.021), ('cope', 0.021), ('prevalent', 0.021), ('ws', 0.021), ('rf', 0.021), ('vector', 0.02), ('projection', 0.02), ('multiclass', 0.02), ('settings', 0.02), ('informally', 0.02), ('decomposable', 0.02), ('minus', 0.019), ('stems', 0.019), ('step', 0.019), ('plot', 0.019), ('minimize', 0.019), ('stay', 0.019), ('rightmost', 0.019), ('minimizer', 0.018), ('build', 0.018), ('operations', 0.018), ('quickly', 0.018), ('sequel', 0.018), ('merits', 0.018), ('mixed', 0.018), ('rate', 0.018), ('regularized', 0.018), ('instantaneous', 0.017), ('reconstruction', 0.017), ('though', 0.017), ('amounts', 0.017), ('ith', 0.017), ('pronounced', 0.016), ('solver', 0.016), ('functions', 0.016), ('recovery', 0.016), ('shorthand', 0.016), ('begin', 0.016), ('solutions', 0.016), ('proposition', 0.016), ('slightly', 0.016), ('updating', 0.016), ('solution', 0.015), ('normally', 0.015), ('dual', 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999946 76 nips-2009-Efficient Learning using Forward-Backward Splitting

Author: Yoram Singer, John C. Duchi

Abstract: We describe, analyze, and experiment with a new framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This yields a simple yet effective algorithm for both batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1. We derive concrete and very simple algorithms for minimization of loss functions with ℓ1, ℓ2, ℓ2², and ℓ∞ regularization. We also show how to construct efficient algorithms for mixed-norm ℓ1/ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in experiments with synthetic and natural datasets.

2 0.092336766 45 nips-2009-Beyond Convexity: Online Submodular Minimization

Author: Elad Hazan, Satyen Kale

Abstract: We consider an online decision problem over a discrete space in which the loss function is submodular. We give algorithms which are computationally efficient and are Hannan-consistent in both the full information and bandit settings. 1

3 0.090608567 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms

Author: Novi Quadrianto, John Lim, Dale Schuurmans, Tibério S. Caetano

Abstract: We develop a convex relaxation of maximum a posteriori estimation of a mixture of regression models. Although our relaxation involves a semidefinite matrix variable, we reformulate the problem to eliminate the need for general semidefinite programming. In particular, we provide two reformulations that admit fast algorithms. The first is a max-min spectral reformulation exploiting quasi-Newton descent. The second is a min-min reformulation consisting of fast alternating steps of closed-form updates. We evaluate the methods against Expectation-Maximization in a real problem of motion segmentation from video data. 1

4 0.089042217 220 nips-2009-Slow Learners are Fast

Author: Martin Zinkevich, John Langford, Alex J. Smola

Abstract: Online learning algorithms have impressive convergence properties when it comes to risk minimization and convex games on very large problems. However, they are inherently sequential in their design which prevents them from taking advantage of modern multi-core architectures. In this paper we prove that online learning with delayed updates converges well, thereby facilitating parallel online learning. 1

5 0.077977635 246 nips-2009-Time-Varying Dynamic Bayesian Networks

Author: Le Song, Mladen Kolar, Eric P. Xing

Abstract: Directed graphical models such as Bayesian networks are a favored formalism for modeling the dependency structures in complex multivariate systems such as those encountered in biology and neural science. When a system is undergoing dynamic transformation, temporally rewiring networks are needed for capturing the dynamic causal influences between covariates. In this paper, we propose time-varying dynamic Bayesian networks (TV-DBN) for modeling the structurally varying directed dependency structures underlying non-stationary biological/neural time series. This is a challenging problem due the non-stationarity and sample scarcity of time series data. We present a kernel reweighted 1 -regularized auto-regressive procedure for this problem which enjoys nice properties such as computational efficiency and provable asymptotic consistency. To our knowledge, this is the first practical and statistically sound method for structure learning of TVDBNs. We applied TV-DBNs to time series measurements during yeast cell cycle and brain response to visual stimuli. In both cases, TV-DBNs reveal interesting dynamics underlying the respective biological systems. 1

6 0.076283351 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

7 0.074197993 73 nips-2009-Dual Averaging Method for Regularized Stochastic Learning and Online Optimization

8 0.066383995 178 nips-2009-On Stochastic and Worst-case Models for Investing

9 0.06411723 157 nips-2009-Multi-Label Prediction via Compressed Sensing

10 0.061393064 14 nips-2009-A Parameter-free Hedging Algorithm

11 0.060839504 22 nips-2009-Accelerated Gradient Methods for Stochastic Optimization and Online Learning

12 0.060631584 193 nips-2009-Potential-Based Agnostic Boosting

13 0.05439724 203 nips-2009-Replacing supervised classification learning by Slow Feature Analysis in spiking neural networks

14 0.05319906 215 nips-2009-Sensitivity analysis in HMMs with application to likelihood maximization

15 0.051414471 222 nips-2009-Sparse Estimation Using General Likelihoods and Non-Factorial Priors

16 0.050005451 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization

17 0.045531813 63 nips-2009-DUOL: A Double Updating Approach for Online Learning

18 0.045072623 179 nips-2009-On the Algorithmics and Applications of a Mixed-norm based Kernel Learning Formulation

19 0.044595022 192 nips-2009-Posterior vs Parameter Sparsity in Latent Variable Models

20 0.044370379 208 nips-2009-Robust Principal Component Analysis: Exact Recovery of Corrupted Low-Rank Matrices via Convex Optimization


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.145), (1, 0.119), (2, 0.033), (3, 0.002), (4, 0.067), (5, 0.072), (6, -0.008), (7, -0.022), (8, 0.044), (9, 0.053), (10, -0.017), (11, -0.012), (12, -0.062), (13, 0.014), (14, -0.06), (15, 0.059), (16, 0.015), (17, -0.051), (18, -0.029), (19, 0.034), (20, -0.0), (21, -0.047), (22, -0.006), (23, 0.016), (24, 0.044), (25, 0.001), (26, 0.019), (27, 0.079), (28, 0.053), (29, -0.014), (30, 0.031), (31, -0.069), (32, 0.05), (33, -0.032), (34, -0.015), (35, 0.051), (36, -0.109), (37, -0.087), (38, 0.003), (39, -0.013), (40, 0.016), (41, -0.108), (42, -0.018), (43, -0.079), (44, -0.008), (45, 0.093), (46, 0.009), (47, 0.006), (48, -0.046), (49, 0.106)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92728072 76 nips-2009-Efficient Learning using Forward-Backward Splitting

Author: Yoram Singer, John C. Duchi

Abstract: We describe, analyze, and experiment with a new framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This yields a simple yet effective algorithm for both batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1. We derive concrete and very simple algorithms for minimization of loss functions with ℓ1, ℓ2, ℓ2², and ℓ∞ regularization. We also show how to construct efficient algorithms for mixed-norm ℓ1/ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in experiments with synthetic and natural datasets.

2 0.82536459 73 nips-2009-Dual Averaging Method for Regularized Stochastic Learning and Online Optimization

Author: Lin Xiao

Abstract: We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as the ℓ1-norm for promoting sparsity. We develop a new online algorithm, the regularized dual averaging (RDA) method, that can explicitly exploit the regularization structure in an online setting. In particular, at each iteration, the learning variables are adjusted by solving a simple optimization problem that involves the running average of all past subgradients of the loss functions and the whole regularization term, not just its subgradient. Computational experiments show that the RDA method can be very effective for sparse online learning with ℓ1-regularization.

3 0.65455306 220 nips-2009-Slow Learners are Fast

Author: Martin Zinkevich, John Langford, Alex J. Smola

Abstract: Online learning algorithms have impressive convergence properties when it comes to risk minimization and convex games on very large problems. However, they are inherently sequential in their design which prevents them from taking advantage of modern multi-core architectures. In this paper we prove that online learning with delayed updates converges well, thereby facilitating parallel online learning. 1

4 0.58395672 189 nips-2009-Periodic Step Size Adaptation for Single Pass On-line Learning

Author: Chun-nan Hsu, Yu-ming Chang, Hanshen Huang, Yuh-jye Lee

Abstract: It has been established that the second-order stochastic gradient descent (2SGD) method can potentially achieve generalization performance as well as empirical optimum in a single pass (i.e., epoch) through the training examples. However, 2SGD requires computing the inverse of the Hessian matrix of the loss function, which is prohibitively expensive. This paper presents Periodic Step-size Adaptation (PSA), which approximates the Jacobian matrix of the mapping function and explores a linear relation between the Jacobian and Hessian to approximate the Hessian periodically and achieve near-optimal results in experiments on a wide variety of models and tasks. 1

5 0.50318992 222 nips-2009-Sparse Estimation Using General Likelihoods and Non-Factorial Priors

Author: David P. Wipf, Srikantan S. Nagarajan

Abstract: Finding maximally sparse representations from overcomplete feature dictionaries frequently involves minimizing a cost function composed of a likelihood (or data fit) term and a prior (or penalty function) that favors sparsity. While typically the prior is factorial, here we examine non-factorial alternatives that have a number of desirable properties relevant to sparse estimation and are easily implemented using an efficient and globally-convergent, reweighted 1 -norm minimization procedure. The first method under consideration arises from the sparse Bayesian learning (SBL) framework. Although based on a highly non-convex underlying cost function, in the context of canonical sparse estimation problems, we prove uniform superiority of this method over the Lasso in that, (i) it can never do worse, and (ii) for any dictionary and sparsity profile, there will always exist cases where it does better. These results challenge the prevailing reliance on strictly convex penalty functions for finding sparse solutions. We then derive a new non-factorial variant with similar properties that exhibits further performance improvements in some empirical tests. For both of these methods, as well as traditional factorial analogs, we demonstrate the effectiveness of reweighted 1 -norm algorithms in handling more general sparse estimation problems involving classification, group feature selection, and non-negativity constraints. As a byproduct of this development, a rigorous reformulation of sparse Bayesian classification (e.g., the relevance vector machine) is derived that, unlike the original, involves no approximation steps and descends a well-defined objective function. 1

6 0.49122041 75 nips-2009-Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models

7 0.4770577 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms

8 0.46528119 79 nips-2009-Efficient Recovery of Jointly Sparse Vectors

9 0.46391693 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization

10 0.45722535 33 nips-2009-Analysis of SVM with Indefinite Kernels

11 0.45313984 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

12 0.44958571 45 nips-2009-Beyond Convexity: Online Submodular Minimization

13 0.44938579 14 nips-2009-A Parameter-free Hedging Algorithm

14 0.439347 192 nips-2009-Posterior vs Parameter Sparsity in Latent Variable Models

15 0.42462096 178 nips-2009-On Stochastic and Worst-case Models for Investing

16 0.42359164 63 nips-2009-DUOL: A Double Updating Approach for Online Learning

17 0.42273796 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models

18 0.42107964 46 nips-2009-Bilinear classifiers for visual recognition

19 0.41336298 181 nips-2009-Online Learning of Assignments

20 0.41048867 160 nips-2009-Multiple Incremental Decremental Learning of Support Vector Machines


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.012), (24, 0.065), (25, 0.066), (35, 0.047), (36, 0.14), (39, 0.046), (51, 0.013), (58, 0.105), (61, 0.033), (71, 0.047), (72, 0.223), (81, 0.021), (86, 0.059), (91, 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.92969692 249 nips-2009-Tracking Dynamic Sources of Malicious Activity at Internet Scale

Author: Shobha Venkataraman, Avrim Blum, Dawn Song, Subhabrata Sen, Oliver Spatscheck

Abstract: We formulate and address the problem of discovering dynamic malicious regions on the Internet. We model this problem as one of adaptively pruning a known decision tree, but with additional challenges: (1) severe space requirements, since the underlying decision tree has over 4 billion leaves, and (2) a changing target function, since malicious activity on the Internet is dynamic. We present a novel algorithm that addresses this problem, by putting together a number of different “experts” algorithms and online paging algorithms. We prove guarantees on our algorithm’s performance as a function of the best possible pruning of a similar size, and our experiments show that our algorithm achieves high accuracy on large real-world data sets, with significant improvements over existing approaches.

same-paper 2 0.83480096 76 nips-2009-Efficient Learning using Forward-Backward Splitting

Author: Yoram Singer, John C. Duchi

Abstract: We describe, analyze, and experiment with a new framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This yields a simple yet effective algorithm for both batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1. We derive concrete and very simple algorithms for minimization of loss functions with ℓ1, ℓ2, ℓ2², and ℓ∞ regularization. We also show how to construct efficient algorithms for mixed-norm ℓ1/ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in experiments with synthetic and natural datasets.

3 0.8125087 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions

Author: Kenji Fukumizu, Arthur Gretton, Gert R. Lanckriet, Bernhard Schölkopf, Bharath K. Sriperumbudur

Abstract: Embeddings of probability measures into reproducing kernel Hilbert spaces have been proposed as a straightforward and practical means of representing and comparing probabilities. In particular, the distance between embeddings (the maximum mean discrepancy, or MMD) has several key advantages over many classical metrics on distributions, namely easy computability, fast convergence and low bias of finite sample estimates. An important requirement of the embedding RKHS is that it be characteristic: in this case, the MMD between two distributions is zero if and only if the distributions coincide. Three new results on the MMD are introduced in the present study. First, it is established that MMD corresponds to the optimal risk of a kernel classifier, thus forming a natural link between the distance between distributions and their ease of classification. An important consequence is that a kernel must be characteristic to guarantee classifiability between distributions in the RKHS. Second, the class of characteristic kernels is broadened to incorporate all strictly positive definite kernels: these include non-translation invariant kernels and kernels on non-compact domains. Third, a generalization of the MMD is proposed for families of kernels, as the supremum over MMDs on a class of kernels (for instance the Gaussian kernels with different bandwidths). This extension is necessary to obtain a single distance measure if a large selection or class of characteristic kernels is potentially appropriate. This generalization is reasonable, given that it corresponds to the problem of learning the kernel by minimizing the risk of the corresponding kernel classifier. The generalized MMD is shown to have consistent finite sample estimates, and its performance is demonstrated on a homogeneity testing example. 1

4 0.73481625 155 nips-2009-Modelling Relational Data using Bayesian Clustered Tensor Factorization

Author: Ilya Sutskever, Joshua B. Tenenbaum, Ruslan Salakhutdinov

Abstract: We consider the problem of learning probabilistic models for complex relational structures between various types of objects. A model can help us “understand” a dataset of relational facts in at least two ways, by finding interpretable structure in the data, and by supporting predictions, or inferences about whether particular unobserved relations are likely to be true. Often there is a tradeoff between these two aims: cluster-based models yield more easily interpretable representations, while factorization-based approaches have given better predictive performance on large data sets. We introduce the Bayesian Clustered Tensor Factorization (BCTF) model, which embeds a factorized representation of relations in a nonparametric Bayesian clustering framework. Inference is fully Bayesian but scales well to large data sets. The model simultaneously discovers interpretable clusters and yields predictive performance that matches or beats previous probabilistic models for relational data.

5 0.68884516 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms

Author: Novi Quadrianto, John Lim, Dale Schuurmans, Tibério S. Caetano

Abstract: We develop a convex relaxation of maximum a posteriori estimation of a mixture of regression models. Although our relaxation involves a semidefinite matrix variable, we reformulate the problem to eliminate the need for general semidefinite programming. In particular, we provide two reformulations that admit fast algorithms. The first is a max-min spectral reformulation exploiting quasi-Newton descent. The second is a min-min reformulation consisting of fast alternating steps of closed-form updates. We evaluate the methods against Expectation-Maximization in a real problem of motion segmentation from video data. 1

6 0.68863624 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output

7 0.68565965 22 nips-2009-Accelerated Gradient Methods for Stochastic Optimization and Online Learning

8 0.68407327 169 nips-2009-Nonlinear Learning using Local Coordinate Coding

9 0.6834228 72 nips-2009-Distribution Matching for Transduction

10 0.68305844 30 nips-2009-An Integer Projected Fixed Point Method for Graph Matching and MAP Inference

11 0.68275625 73 nips-2009-Dual Averaging Method for Regularized Stochastic Learning and Online Optimization

12 0.68234664 180 nips-2009-On the Convergence of the Concave-Convex Procedure

13 0.6814785 129 nips-2009-Learning a Small Mixture of Trees

14 0.68097204 174 nips-2009-Nonparametric Latent Feature Models for Link Prediction

15 0.68093628 217 nips-2009-Sharing Features among Dynamical Systems with Beta Processes

16 0.68030071 79 nips-2009-Efficient Recovery of Jointly Sparse Vectors

17 0.68008471 254 nips-2009-Variational Gaussian-process factor analysis for modeling spatio-temporal data

18 0.68004578 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

19 0.67914402 173 nips-2009-Nonparametric Greedy Algorithms for the Sparse Learning Problem

20 0.67904681 157 nips-2009-Multi-Label Prediction via Compressed Sensing