nips nips2008 nips2008-214 knowledge-graph by maker-knowledge-mining

214 nips-2008-Sparse Online Learning via Truncated Gradient


Source: pdf

Author: John Langford, Lihong Li, Tong Zhang

Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of online-learning algorithms with convex loss. This method has several essential properties. First, the degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. Second, the approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. We prove small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. Finally, the approach works well empirically. We apply it to several datasets and find for datasets with large numbers of features, substantial sparsity is discoverable. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of online-learning algorithms with convex loss. [sent-7, score-0.806]

2 Second, the approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. [sent-10, score-0.441]

3 We apply it to several datasets and find for datasets with large numbers of features, substantial sparsity is discoverable. [sent-13, score-0.34]

4 As an example, the largest dataset we use in this paper has over 10^7 sparse examples and 10^9 features, using about 10^11 bytes. [sent-15, score-0.219]

5 This paper addresses the problem of inducing sparsity in learned weights while using an online-learning algorithm. [sent-27, score-0.292]

6 For example, simply adding L1 regularization to the gradient of an online weight update and simply rounding small weights to zero are both problematic. [sent-29, score-0.97]

7 This algorithm does not work automatically in an online fashion. [sent-35, score-0.228]

8 Consider a loss function L(w, z_i) which is convex in w, where z_i = (x_i, y_i) is an input–output pair. [sent-37, score-0.398]

9 The convex-constraint formulation has a simple online version using the projection idea in [14]. [sent-41, score-0.255]

10 It requires projecting the weight vector w onto an L1 ball at every online step. [sent-42, score-0.277]

11 This operation is difficult to implement efficiently for large-scale data with many features even if all features are sparse, although important progress was made recently so the complexity is logarithmic in the number of features [5]. [sent-43, score-0.339]

12 In contrast, the soft-regularization formulation (2) is efficient in the batch setting [8], so we pursue it here in an online setting, where it has complexity independent of the number of features. [sent-44, score-0.363]

13 It operates by decaying the weights on previous examples and then rounding these weights to zero when they become small. [sent-47, score-0.465]

14 The Forgetron is stated for kernelized online algorithms, while we are concerned with the simple linear setting. [sent-48, score-0.228]

15 Consequently, we rely on experiments in section 5 to show truncated gradient achieves good sparsity in practice. [sent-57, score-0.722]

16 We compare truncated gradient to a few other methods on small datasets, including the Lasso, online rounding of coefficients to zero, and L1-regularized subgradient descent. [sent-58, score-1.181]

17 We make a prediction ŷ_i based on the current weights w_i = [w_i^1, …, w_i^d]. [sent-66, score-0.42]

18 We observe y_i, let z_i = (x_i, y_i), and incur some known loss L(w_i, z_i) that is convex in w_i. [sent-71, score-0.734]

19 We want an update rule f that allows us to bound the sum of losses, Σ_{i=1}^t L(w_i, z_i), as well as achieve sparsity. [sent-74, score-0.258]

20 For this purpose, we start with the standard stochastic gradient descent (SGD) rule, which is of the form: f(w_i) = w_i − η ∇_1 L(w_i, z_i), (3) where ∇_1 L(a, b) is a subgradient of L(a, b) with respect to the first variable a. [sent-75, score-0.853]
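
As a minimal sketch (not the authors' code), rule (3) can be written as follows; grad_loss is a hypothetical callable that returns a subgradient ∇_1 L(w, z) as a numpy array.

```python
import numpy as np

def sgd_step(w, grad_loss, z, eta):
    """Rule (3): w <- w - eta * (subgradient of L(w, z) with respect to w)."""
    return w - eta * grad_loss(w, z)

# Example subgradient for square loss L(w, z) = (w.x - y)^2 with z = (x, y):
square_loss_grad = lambda w, z: 2 * (np.dot(w, z[0]) - z[1]) * z[0]
```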

21 The above method has been widely used in online learning (e. [sent-81, score-0.228]

22 Moreover, it is argued to be efficient even for solving batch problems, where we repeatedly run the online algorithm over the training data multiple times. [sent-84, score-0.336]

23 However, the learning rule (3) itself does not achieve sparsity in the weights, which we address in this paper. [sent-87, score-0.249]

24 Note that variants of SGD exist in the literature, such as exponentiated gradient descent (EG) [6]. [sent-88, score-0.299]

25 3 Sparse Online Learning. In this section, we first examine three methods for achieving sparsity in online learning, including a novel algorithm called truncated gradient. [sent-91, score-0.752]

26 3.1 Simple Coefficient Rounding. In order to achieve sparsity, the most natural method is to round small coefficients (whose magnitudes are below a threshold θ > 0) to zero after every K online steps. [sent-95, score-0.288]

27 That is, if i/K is not an integer, we use the standard SGD rule (3); if i/K is an integer, we modify the rule as: f(w_i) = T_0(w_i − η ∇_1 L(w_i, z_i), θ), (4) where θ ≥ 0 is a threshold and T_0(v, θ) = [T_0(v_1, θ), …, T_0(v_d, θ)]. [sent-96, score-0.265]

28 Here v = [v_1, …, v_d] ∈ R^d, T_0(v_j, θ) = v_j I(|v_j| ≥ θ), and I(·) is the set-indicator function, so components with magnitude below θ are set to zero. [sent-102, score-0.385]
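
A sketch of the rounding rule (4), following the reconstruction above (components with magnitude below θ are rounded to zero every K-th step); grad_loss is the same hypothetical subgradient callable as before.

```python
import numpy as np

def t0(v, theta):
    """Operator T_0: zero out components with |v_j| < theta, keep the rest."""
    return np.where(np.abs(v) >= theta, v, 0.0)

def rounding_step(w, grad_loss, z, eta, theta, i, K):
    """Rule (4): a standard SGD step, followed by rounding every K-th step."""
    w = w - eta * grad_loss(w, z)
    if i % K == 0:               # i/K is an integer
        w = t0(w, theta)
    return w
```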

29 In other words, we first perform a standard stochastic gradient descent step, and then round small coefficients to zero. [sent-103, score-0.279]

30 In general, we should not take K = 1, especially when η is small, since in each step wi is modified by only a small amount. [sent-105, score-0.333]

31 If a coefficient is zero, it remains small after one online update, and the rounding operation pulls it back to zero. [sent-106, score-0.513]

32 Consequently, rounding can be done only after every K steps (with a reasonably large K); in this case, nonzero coefficients have sufficient time to go above the threshold θ. [sent-107, score-0.435]

33 However, if K is too large, then in the training stage, we must keep many more nonzero features in the intermediate steps before they are rounded to zero. [sent-108, score-0.231]

34 The sensitivity in choosing appropriate K is a main drawback of this method; another drawback is the lack of theoretical guarantee for its online performance. [sent-110, score-0.228]

35 3.2 L1-Regularized Subgradient. In the experiments, we also combined rounding at the end of training with a simple online subgradient method for L1 regularization with a regularization parameter g > 0: f(w_i) = w_i − η ∇_1 L(w_i, z_i) − ηg sgn(w_i), (5) where for a vector v = [v_1, …, v_d], sgn(v) = [sgn(v_1), …, sgn(v_d)]. [sent-113, score-0.899]

36 Here sgn(v_j) = 1 if v_j > 0, sgn(v_j) = −1 if v_j < 0, and sgn(v_j) = 0 if v_j = 0. [sent-119, score-0.588]
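
A sketch of the L1-regularized subgradient rule (5); as noted below, this rarely yields exactly zero weights online, so in the experiments the weights are additionally rounded at the end of training.

```python
import numpy as np

def l1_subgradient_step(w, grad_loss, z, eta, g):
    """Rule (5): an SGD step plus the L1 subgradient term -eta*g*sgn(w)."""
    return w - eta * grad_loss(w, z) - eta * g * np.sign(w)
```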

37 In the experiments, the online method (5) with rounding in the end is used as a simple baseline. [sent-120, score-0.513]

38 Notice this method does not produce sparse weights online, simply because only in very rare cases do two floats add up to exactly zero. [sent-121, score-0.373]

39 3.3 Truncated Gradient. In order to obtain an online version of the simple rounding rule in (4), we observe that direct rounding to zero is too aggressive. [sent-124, score-0.865]

40 We call this idea truncated gradient, where the amount of shrinkage is controlled by a gravity parameter g_i > 0: f(w_i) = T_1(w_i − η ∇_1 L(w_i, z_i), ηg_i, θ), (6) where for a vector v = [v_1, …, v_d], T_1(v, α, θ) = [T_1(v_1, α, θ), …, T_1(v_d, α, θ)]. [sent-126, score-0.908]

41 Here T_1(v_j, α, θ) = max(0, v_j − α) if v_j ∈ [0, θ]; T_1(v_j, α, θ) = min(0, v_j + α) if v_j ∈ [−θ, 0]; and T_1(v_j, α, θ) = v_j otherwise. [sent-132, score-0.784]

42 Again, the truncation can be performed every K online steps. [sent-133, score-0.478]
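
A sketch of the truncated-gradient update (6), including the every-K-steps gravity schedule described in the next sentence (g_i = Kg when i/K is an integer, and 0 otherwise); this follows the reconstruction of T_1 above and is illustrative rather than the authors' implementation.

```python
import numpy as np

def t1(v, alpha, theta):
    """Operator T_1: shrink components with |v_j| <= theta toward zero by alpha
    (never crossing zero); leave components with |v_j| > theta untouched."""
    shrunk = np.sign(v) * np.maximum(0.0, np.abs(v) - alpha)
    return np.where(np.abs(v) <= theta, shrunk, v)

def truncated_gradient_step(w, grad_loss, z, eta, g, theta, i, K):
    """Rule (6): an SGD step followed by truncation with gravity g_i."""
    g_i = K * g if i % K == 0 else 0.0   # truncate only every K-th step
    return t1(w - eta * grad_loss(w, z), eta * g_i, theta)
```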

43 That is, if i/K is not an integer, we let gi = 0; if i/K is an integer, we let gi = Kg for a gravity parameter g > 0. [sent-134, score-0.6]

44 Since ηKg ≪ θ when ηK is small, the truncation operation is less aggressive than the rounding in (4). [sent-143, score-0.385]

45 Another important special case of (6) is setting θ = ∞, in which all weight components shrink in every online step. [sent-147, score-0.308]

46 The method is a modification of the L1 -regularized subgradient descent rule (5). [sent-148, score-0.24]

47 The parameter gi ≥ 0 controls the sparsity achieved with the algorithm, and setting gi = 0 gives exactly the standard SGD rule (3). [sent-149, score-0.703]

48 As discussed below, this special case of truncated gradient can be regarded as an online counterpart of L1 regularization, since it approximately solves an L1-regularization problem in the limit η → 0. [sent-151, score-1.079]

49 We also show the prediction performance of truncated gradient, measured by total loss, is comparable to standard stochastic gradient descent while introducing sparse weight vectors. [sent-152, score-0.829]

50 Theorem (Sparse Online Regret). Consider the sparse online update rule (6) with w_1 = [0, …, 0]. [sent-164, score-0.388]

51 Then for all w̄ ∈ R^d, (1 − 0.5Aη)/T · Σ_{i=1}^T [ L(w_i, z_i) + g_i/(1 − 0.5Aη) · ‖w_{i+1} · I(|w_{i+1}| ≤ θ)‖_1 ] ≤ (η/2)·B + ‖w̄‖²/(2ηT) + (1/T) Σ_{i=1}^T [ L(w̄, z_i) + g_i ‖w̄ · I(|w_{i+1}| ≤ θ)‖_1 ], where A and B are constants associated with the loss L in the theorem's assumption. [sent-170, score-0.372]

52 Here ‖v · I(|v'| ≤ θ)‖_1 = Σ_{j=1}^d |v_j| I(|v'_j| ≤ θ) for vectors v = [v_1, …, v_d] and v' = [v'_1, …, v'_d]. [sent-171, score-0.372]

53 In the above theorem, the right-hand side involves a term g_i ‖w̄ · I(|w_{i+1}| ≤ θ)‖_1 that depends on w_{i+1}, which is not easily estimated. [sent-183, score-0.284]

54 To remove this dependency, a trivial upper bound of θ = ∞ can be used, leading to the L1 penalty g_i ‖w̄‖_1. [sent-184, score-0.303]

55 In the general case of θ < ∞, we cannot remove the w_{i+1} dependency, because the effective regularization condition (as shown on the left-hand side) is the non-convex penalty g_i ‖w · I(|w| ≤ θ)‖_1. [sent-185, score-0.343]

56 Solving such a non-convex formulation is hard both in the online and batch settings. [sent-186, score-0.363]

57 Without a good characterization of the local minimum, it is not possible for us to replace g_i ‖w̄ · I(|w_{i+1}| ≤ θ)‖_1 on the right-hand side by g_i ‖w̄ · I(|w̄| ≤ θ)‖_1, because such a formulation would imply we could efficiently solve a non-convex problem with a simple online update rule. [sent-188, score-0.811]

58 Still, when θ < ∞, one naturally expects the right-hand-side penalty g_i ‖w̄ · I(|w_{i+1}| ≤ θ)‖_1 to be much smaller than the corresponding L1 penalty g_i ‖w̄‖_1, especially when w̄ has many components close to 0. [sent-189, score-0.622]

59 The theorem above also implies a tradeoff between sparsity and regret performance. [sent-192, score-0.313]

60 We may simply consider the case where gi = g is a constant. [sent-193, score-0.268]

61 When g is small, we have less sparsity, but the regret term g ‖w̄ · I(|w_{i+1}| ≤ θ)‖_1 ≤ g ‖w̄‖_1 on the right-hand side is also small. [sent-194, score-0.327]

62 When g is large, we are able to achieve more sparsity, but the regret term g ‖w̄ · I(|w_{i+1}| ≤ θ)‖_1 on the right-hand side also becomes large. [sent-195, score-0.355]

63 Such a tradeoff between sparsity and prediction accuracy is empirically studied in section 5, where we achieve significant sparsity with only a small g (and thus small decrease of performance). [sent-196, score-0.393]

64 When T → ∞, if we let η → 0 and ηT → ∞, then (1/T) Σ_{i=1}^T [ L(w_i, z_i) + g ‖w_i‖_1 ] ≤ inf_{w̄ ∈ R^d} { (1/T) Σ_{i=1}^T L(w̄, z_i) + 2g ‖w̄‖_1 } + o(1). [sent-198, score-0.592]

65 This implies truncated gradient can be regarded as the online counterpart of L1 -regularization methods. [sent-202, score-0.901]

66 In the stochastic setting where the examples are drawn iid from some underlying distribution, the sparse online gradient method proposed in this paper solves the L1 regularization problem. [sent-203, score-0.646]

67 In this setting, we can go through training examples one-by-one in an online fashion, and repeat multiple times over the training data. [sent-206, score-0.26]

68 For simplicity, we only consider the case with θ = ∞ and constant gravity gi = g. [sent-211, score-0.359]

69 Let ŵ_1 = w_1 = 0, and define recursively for t ≥ 1: w_{t+1} = T(w_t − η ∇_1 L(w_t, z_{i_t}), gη), ŵ_{t+1} = ŵ_t + (w_{t+1} − ŵ_t)/(t + 1), where each i_t is drawn from {1, …, n}. [sent-219, score-0.226]

70 In other words, on average ŵ_T approximately solves the batch L1-regularization problem inf_w { (1/n) Σ_{i=1}^n L(w, z_i) + g ‖w‖_1 } when T is large. [sent-231, score-0.307]
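
A minimal sketch of this randomized procedure under the same simplification (θ = ∞, constant gravity g); grad_loss is the same hypothetical subgradient callable as in the earlier sketches, and the dataset Z (a list of examples z_i), the dimension d, and the iteration count T are placeholders.

```python
import numpy as np

def averaged_truncated_sgd(Z, grad_loss, eta, g, d, T, seed=0):
    """Run update (6) with theta = infinity (shrink every weight toward zero by
    g*eta each step) on examples drawn at random from Z, and return the running
    average w_hat of the iterates."""
    rng = np.random.default_rng(seed)
    w = np.zeros(d)
    w_hat = np.zeros(d)
    for t in range(1, T + 1):
        z = Z[rng.integers(len(Z))]                   # draw index i_t from {1, ..., n}
        u = w - eta * grad_loss(w, z)                 # plain SGD step
        w = np.sign(u) * np.maximum(0.0, np.abs(u) - g * eta)   # truncation T(u, g*eta)
        w_hat = w_hat + (w - w_hat) / (t + 1)         # running average of the iterates
    return w_hat
```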

71 Since L1 regularization is often used to achieve sparsity in the batch learning setting, the connection of truncated gradient to L1 regularization can be regarded as an alternative justification for the sparsity ability of this algorithm. [sent-234, score-1.221]

72 4 Efficient Implementation of Truncated Gradient for Square Loss. The truncated descent update rule (6) can be applied to least-squares regression using square loss, leading to f(w_i) = T_1(w_i + 2η(y_i − ŷ_i)x_i, ηg_i, θ), where the prediction is given by ŷ_i = Σ_j w_i^j x_i^j. [sent-235, score-1.0]

73 We altered an efficient SGD implementation, Vowpal Wabbit [7], for least-squares regression according to truncated gradient. [sent-236, score-0.399]

74 So the memory footprint is essentially just the number of nonzero weights, even when the total numbers of data and features are astronomically large. [sent-239, score-0.26]

75 In many online-learning situations such as web applications, only a small subset of the features have nonzero values for any example x. [sent-240, score-0.231]

76 It is thus desirable to deal with sparsity only in this small subset rather than in all features, while simultaneously inducing sparsity on all feature weights. [sent-241, score-0.343]

77 During online learning, at each step i, we only go through the nonzero features j of example i, and calculate the un-performed shrinkage of w^j between τ_j (the time stamp recording when feature j was last updated) and the current time i. [sent-244, score-0.539]

78 This lazy-update idea of delaying the shrinkage calculation until needed is the key to efficient implementation of truncated gradient. [sent-246, score-0.449]
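
A sketch of the lazy-update idea for square loss with sparse examples, assuming θ = ∞ and constant gravity g for simplicity; weights and per-feature time stamps are kept in dictionaries keyed by feature index, shrink applies all the truncations skipped since a feature was last seen, and the whole routine is an illustration of the bookkeeping rather than Vowpal Wabbit's actual code.

```python
def shrink(value, alpha, steps):
    """Apply `steps` pending shrink-by-alpha operations toward zero (theta = infinity)."""
    if value > 0:
        return max(0.0, value - alpha * steps)
    return min(0.0, value + alpha * steps)

def lazy_truncated_step(w, last_seen, x, y, eta, g, K, i):
    """One truncated-gradient step for square loss on a sparse example x = {j: x_j}.
    Shrinkage for feature j is applied only when feature j next appears."""
    # Catch up on the truncations missed since each active feature was last updated.
    for j in x:
        pending = (i - last_seen.get(j, 0)) // K      # number of skipped truncation steps
        w[j] = shrink(w.get(j, 0.0), eta * K * g, pending)
        last_seen[j] = i
    # Standard SGD step for square loss, restricted to the nonzero features.
    y_hat = sum(w[j] * x_j for j, x_j in x.items())   # inactive features contribute zero
    for j, x_j in x.items():
        w[j] = w[j] + 2 * eta * (y - y_hat) * x_j
    return w, last_seen
```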

79 The UCI datasets we used do not have many features, and it seems a large fraction of these features are useful for making predictions. [sent-251, score-0.246]

80 For each dataset, we performed 10-fold cross validation over the training set to identify the best set of parameters, including the learning rate η, the gravity g, the number of passes over the training set, and the decay of the learning rate across these passes. [sent-258, score-0.222]

81 For UCI datasets with randomly added features, truncated gradient was able to reduce the number of features by more than 90%, except for the ad dataset, in which only a 71% reduction was observed. [sent-264, score-0.965]

82 [Figure 1 residue: axis tick values and dataset labels (ad, crx, housing, krvskp, magic04, rcv1, shroom, spam, wbc, wdbc, wpbc, zoo, Big_Ads); y-axis: Fraction Left; series: Base data, 1000 extra.] [sent-273, score-0.63]

83 Figure 1: Left: the amount of features left after sparsification for each dataset, with no more than 1% performance loss. [sent-275, score-0.237]

84 3% decrease in accuracy (instead of 1%) during cross validation, truncated gradient was able to achieve 91. [sent-278, score-0.636]

85 For classification tasks, we also studied how truncated gradient affects AUC (Area Under the ROC Curve), a standard metric for classification. [sent-282, score-0.568]

86 The next set of experiments compares truncated gradient to other algorithms regarding their ability to trade off feature sparsification against performance. [sent-291, score-0.597]

87 We have chosen K = 10 since it worked better than K = 1, and this choice was especially important for the coefficient rounding algorithm. [sent-294, score-0.316]

88 On all datasets, truncated gradient (with θ = ∞) is consistently competitive with the other online algorithms and significantly outperformed them in some problems, implying truncated gradient is generally effective. [sent-299, score-1.364]

89 Moreover, truncated gradient with θ = g behaves similarly to rounding (and sometimes better). [sent-300, score-0.853]

90 This was expected as truncated gradient with θ = g can be regarded as a principled variant of rounding with valid theoretical justification. [sent-301, score-0.924]

91 It is also interesting to observe the qualitative behavior of truncated gradient was often similar to LASSO, especially when very sparse weight vectors were allowed (the left sides in the graphs). [sent-302, score-0.71]

92 In this case, LASSO seemed to overfit while truncated gradient was more robust to overfitting. [sent-306, score-0.568]

93 The robustness of online learning is often attributed to early stopping, which has been extensively studied (e. [sent-307, score-0.228]

94 For large datasets with sparse features, only truncated gradient and the ad hoc coefficient rounding algorithm are applicable. [sent-311, score-1.084]

95 6 Conclusion. This paper covers the first efficient sparsification technique for large-scale online learning with strong theoretical guarantees. [sent-335, score-0.228]

96 The algorithm, truncated gradient, is the natural extension of Lasso-style regression to the online-learning setting. [sent-336, score-0.399]

97 The sparse-online-regret theorem proves the technique is sound: it never harms performance much compared to standard stochastic gradient descent in adversarial situations. [sent-338, score-0.32]

98 Furthermore, we show the asymptotic solution of one instance of the algorithm is essentially equivalent to Lasso regression, thus justifying the algorithm’s ability to produce sparse weight vectors when the number of features is intractably large. [sent-339, score-0.224]

99 Worst-case quadratic loss bounds for prediction using linear functions and gradient descent. [sent-353, score-0.3]

100 Solving large scale linear prediction problems using stochastic gradient descent algorithms. [sent-432, score-0.348]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('truncated', 0.37), ('wi', 0.302), ('rounding', 0.285), ('gi', 0.241), ('online', 0.228), ('sparsi', 0.206), ('gradient', 0.198), ('vj', 0.196), ('vd', 0.189), ('sparsity', 0.154), ('lasso', 0.132), ('zi', 0.131), ('regret', 0.13), ('gravity', 0.118), ('nonzero', 0.118), ('auc', 0.114), ('features', 0.113), ('wt', 0.113), ('batch', 0.108), ('sgd', 0.106), ('subgradient', 0.1), ('datasets', 0.093), ('sgn', 0.089), ('uci', 0.087), ('forgetron', 0.083), ('kg', 0.083), ('ad', 0.076), ('loss', 0.074), ('descent', 0.073), ('regarded', 0.071), ('regularization', 0.069), ('rule', 0.067), ('sparse', 0.062), ('rd', 0.057), ('weights', 0.056), ('truncation', 0.054), ('coef', 0.053), ('weight', 0.049), ('stochastic', 0.049), ('shrinkage', 0.048), ('crx', 0.047), ('krvskp', 0.047), ('onlinelearning', 0.047), ('shroom', 0.047), ('vowpal', 0.047), ('wabbit', 0.047), ('aggressive', 0.046), ('dataset', 0.044), ('side', 0.043), ('wpbc', 0.041), ('lihong', 0.041), ('wbc', 0.041), ('theorem', 0.041), ('solves', 0.04), ('decaying', 0.04), ('fraction', 0.04), ('cross', 0.04), ('cation', 0.039), ('tunable', 0.038), ('wdbc', 0.035), ('eg', 0.035), ('advance', 0.035), ('inducing', 0.035), ('formulations', 0.035), ('yi', 0.034), ('ef', 0.034), ('counterpart', 0.034), ('ads', 0.034), ('extra', 0.033), ('integer', 0.033), ('cients', 0.033), ('penalty', 0.033), ('go', 0.032), ('rate', 0.032), ('square', 0.032), ('rutgers', 0.032), ('round', 0.032), ('update', 0.031), ('implementation', 0.031), ('shrink', 0.031), ('sub', 0.031), ('langford', 0.031), ('py', 0.031), ('reduction', 0.031), ('especially', 0.031), ('enjoys', 0.029), ('ows', 0.029), ('memory', 0.029), ('tradeoff', 0.029), ('bound', 0.029), ('regression', 0.029), ('exponentiated', 0.028), ('spam', 0.028), ('prediction', 0.028), ('inf', 0.028), ('convex', 0.028), ('achieve', 0.028), ('operates', 0.028), ('formulation', 0.027), ('simply', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 214 nips-2008-Sparse Online Learning via Truncated Gradient

Author: John Langford, Lihong Li, Tong Zhang

Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of online-learning algorithms with convex loss. This method has several essential properties. First, the degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. Second, the approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. We prove small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. Finally, the approach works well empirically. We apply it to several datasets and find for datasets with large numbers of features, substantial sparsity is discoverable. 1

2 0.21260934 145 nips-2008-Multi-stage Convex Relaxation for Learning with Sparse Regularization

Author: Tong Zhang

Abstract: We study learning formulations with non-convex regularizaton that are natural for sparse linear models. There are two approaches to this problem: • Heuristic methods such as gradient descent that only find a local minimum. A drawback of this approach is the lack of theoretical guarantee showing that the local minimum gives a good solution. • Convex relaxation such as L1 -regularization that solves the problem under some conditions. However it often leads to sub-optimal sparsity in reality. This paper tries to remedy the above gap between theory and practice. In particular, we investigate a multi-stage convex relaxation scheme for solving problems with non-convex regularization. Theoretically, we analyze the behavior of a resulting two-stage relaxation scheme for the capped-L1 regularization. Our performance bound shows that the procedure is superior to the standard L1 convex relaxation for learning sparse targets. Experiments confirm the effectiveness of this method on some simulation and real data. 1

3 0.19997661 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization

Author: Shai Shalev-shwartz, Sham M. Kakade

Abstract: We describe a primal-dual framework for the design and analysis of online strongly convex optimization algorithms. Our framework yields the tightest known logarithmic regret bounds for Follow-The-Leader and for the gradient descent algorithm proposed in Hazan et al. [2006]. We then show that one can interpolate between these two extreme cases. In particular, we derive a new algorithm that shares the computational simplicity of gradient descent but achieves lower regret in many practical situations. Finally, we further extend our framework for generalized strongly convex functions. 1

4 0.16669987 88 nips-2008-From Online to Batch Learning with Cutoff-Averaging

Author: Ofer Dekel

Abstract: We present cutoff averaging, a technique for converting any conservative online learning algorithm into a batch learning algorithm. Most online-to-batch conversion techniques work well with certain types of online learning algorithms and not with others, whereas cutoff averaging explicitly tries to adapt to the characteristics of the online algorithm being converted. An attractive property of our technique is that it preserves the efficiency of the original online algorithm, making it appropriate for large-scale learning problems. We provide a statistical analysis of our technique and back our theoretical claims with experimental results. 1

5 0.15937367 202 nips-2008-Robust Regression and Lasso

Author: Huan Xu, Constantine Caramanis, Shie Mannor

Abstract: We consider robust least-squares regression with feature-wise disturbance. We show that this formulation leads to tractable convex optimization problems, and we exhibit a particular uncertainty set for which the robust problem is equivalent to 1 regularized regression (Lasso). This provides an interpretation of Lasso from a robust optimization perspective. We generalize this robust formulation to consider more general uncertainty sets, which all lead to tractable convex optimization problems. Therefore, we provide a new methodology for designing regression algorithms, which generalize known formulations. The advantage is that robustness to disturbance is a physical property that can be exploited: in addition to obtaining new formulations, we use it directly to show sparsity properties of Lasso, as well as to prove a general consistency result for robust regression problems, including Lasso, from a unified robustness perspective. 1

6 0.15833604 104 nips-2008-Improved Moves for Truncated Convex Models

7 0.15815097 96 nips-2008-Hebbian Learning of Bayes Optimal Decisions

8 0.15593839 62 nips-2008-Differentiable Sparse Coding

9 0.14704333 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms

10 0.12028393 21 nips-2008-An Homotopy Algorithm for the Lasso with Online Observations

11 0.11616081 25 nips-2008-An interior-point stochastic approximation method and an L1-regularized delta rule

12 0.11173663 161 nips-2008-On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization

13 0.10955317 194 nips-2008-Regularized Learning with Networks of Features

14 0.10502633 123 nips-2008-Linear Classification and Selective Sampling Under Low Noise Conditions

15 0.10134763 179 nips-2008-Phase transitions for high-dimensional joint support recovery

16 0.10062741 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models

17 0.099633709 47 nips-2008-Clustered Multi-Task Learning: A Convex Formulation

18 0.094231062 199 nips-2008-Risk Bounds for Randomized Sample Compressed Classifiers

19 0.091701619 168 nips-2008-Online Metric Learning and Fast Similarity Search

20 0.090496756 22 nips-2008-An Online Algorithm for Maximizing Submodular Functions


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.276), (1, -0.008), (2, -0.225), (3, 0.062), (4, -0.048), (5, 0.064), (6, -0.123), (7, -0.039), (8, -0.118), (9, 0.083), (10, -0.155), (11, 0.11), (12, -0.069), (13, 0.004), (14, 0.163), (15, -0.152), (16, -0.054), (17, -0.016), (18, -0.096), (19, -0.006), (20, -0.007), (21, 0.021), (22, -0.003), (23, 0.077), (24, -0.052), (25, -0.032), (26, 0.025), (27, -0.07), (28, -0.051), (29, 0.043), (30, -0.123), (31, 0.061), (32, 0.025), (33, 0.03), (34, -0.02), (35, -0.062), (36, -0.006), (37, -0.043), (38, 0.059), (39, -0.003), (40, -0.037), (41, 0.04), (42, -0.032), (43, -0.034), (44, -0.147), (45, -0.033), (46, -0.088), (47, 0.012), (48, -0.082), (49, 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96812743 214 nips-2008-Sparse Online Learning via Truncated Gradient

Author: John Langford, Lihong Li, Tong Zhang

Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of online-learning algorithms with convex loss. This method has several essential properties. First, the degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. Second, the approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. We prove small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. Finally, the approach works well empirically. We apply it to several datasets and find for datasets with large numbers of features, substantial sparsity is discoverable. 1

2 0.71776229 145 nips-2008-Multi-stage Convex Relaxation for Learning with Sparse Regularization

Author: Tong Zhang

Abstract: We study learning formulations with non-convex regularizaton that are natural for sparse linear models. There are two approaches to this problem: • Heuristic methods such as gradient descent that only find a local minimum. A drawback of this approach is the lack of theoretical guarantee showing that the local minimum gives a good solution. • Convex relaxation such as L1 -regularization that solves the problem under some conditions. However it often leads to sub-optimal sparsity in reality. This paper tries to remedy the above gap between theory and practice. In particular, we investigate a multi-stage convex relaxation scheme for solving problems with non-convex regularization. Theoretically, we analyze the behavior of a resulting two-stage relaxation scheme for the capped-L1 regularization. Our performance bound shows that the procedure is superior to the standard L1 convex relaxation for learning sparse targets. Experiments confirm the effectiveness of this method on some simulation and real data. 1

3 0.66023695 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models

Author: Tong Zhang

Abstract: Consider linear prediction models where the target function is a sparse linear combination of a set of basis functions. We are interested in the problem of identifying those basis functions with non-zero coefficients and reconstructing the target function from noisy observations. Two heuristics that are widely used in practice are forward and backward greedy algorithms. First, we show that neither idea is adequate. Second, we propose a novel combination that is based on the forward greedy algorithm but takes backward steps adaptively whenever beneficial. We prove strong theoretical results showing that this procedure is effective in learning sparse representations. Experimental results support our theory. 1

4 0.65621698 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization

Author: Shai Shalev-shwartz, Sham M. Kakade

Abstract: We describe a primal-dual framework for the design and analysis of online strongly convex optimization algorithms. Our framework yields the tightest known logarithmic regret bounds for Follow-The-Leader and for the gradient descent algorithm proposed in Hazan et al. [2006]. We then show that one can interpolate between these two extreme cases. In particular, we derive a new algorithm that shares the computational simplicity of gradient descent but achieves lower regret in many practical situations. Finally, we further extend our framework for generalized strongly convex functions. 1

5 0.65170956 88 nips-2008-From Online to Batch Learning with Cutoff-Averaging

Author: Ofer Dekel

Abstract: We present cutoff averaging, a technique for converting any conservative online learning algorithm into a batch learning algorithm. Most online-to-batch conversion techniques work well with certain types of online learning algorithms and not with others, whereas cutoff averaging explicitly tries to adapt to the characteristics of the online algorithm being converted. An attractive property of our technique is that it preserves the efficiency of the original online algorithm, making it appropriate for large-scale learning problems. We provide a statistical analysis of our technique and back our theoretical claims with experimental results. 1

6 0.64677423 22 nips-2008-An Online Algorithm for Maximizing Submodular Functions

7 0.62240952 168 nips-2008-Online Metric Learning and Fast Similarity Search

8 0.5781939 25 nips-2008-An interior-point stochastic approximation method and an L1-regularized delta rule

9 0.57078838 202 nips-2008-Robust Regression and Lasso

10 0.56146961 62 nips-2008-Differentiable Sparse Coding

11 0.54701173 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms

12 0.52659154 96 nips-2008-Hebbian Learning of Bayes Optimal Decisions

13 0.48584366 169 nips-2008-Online Models for Content Optimization

14 0.4852207 47 nips-2008-Clustered Multi-Task Learning: A Convex Formulation

15 0.47911251 21 nips-2008-An Homotopy Algorithm for the Lasso with Online Observations

16 0.46359718 125 nips-2008-Local Gaussian Process Regression for Real Time Online Model Learning

17 0.46317002 104 nips-2008-Improved Moves for Truncated Convex Models

18 0.46081755 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning

19 0.43813616 194 nips-2008-Regularized Learning with Networks of Features

20 0.42460537 56 nips-2008-Deep Learning with Kernel Regularization for Visual Recognition


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(6, 0.523), (7, 0.053), (12, 0.032), (16, 0.018), (28, 0.133), (57, 0.036), (63, 0.031), (71, 0.012), (77, 0.025), (78, 0.012), (83, 0.044)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.98067939 65 nips-2008-Domain Adaptation with Multiple Sources

Author: Yishay Mansour, Mehryar Mohri, Afshin Rostamizadeh

Abstract: This paper presents a theoretical analysis of the problem of domain adaptation with multiple sources. For each source domain, the distribution over the input points as well as a hypothesis with error at most ǫ are given. The problem consists of combining these hypotheses to derive a hypothesis with small error with respect to the target domain. We present several theoretical results relating to this problem. In particular, we prove that standard convex combinations of the source hypotheses may in fact perform very poorly and that, instead, combinations weighted by the source distributions benefit from favorable theoretical guarantees. Our main result shows that, remarkably, for any fixed target function, there exists a distribution weighted combining rule that has a loss of at most ǫ with respect to any target mixture of the source distributions. We further generalize the setting from a single target function to multiple consistent target functions and show the existence of a combining rule with error at most 3ǫ. Finally, we report empirical results for a multiple source adaptation problem with a real-world dataset.

2 0.95640582 43 nips-2008-Cell Assemblies in Large Sparse Inhibitory Networks of Biologically Realistic Spiking Neurons

Author: Adam Ponzi, Jeff Wickens

Abstract: Cell assemblies exhibiting episodes of recurrent coherent activity have been observed in several brain regions including the striatum[1] and hippocampus CA3[2]. Here we address the question of how coherent dynamically switching assemblies appear in large networks of biologically realistic spiking neurons interacting deterministically. We show by numerical simulations of large asymmetric inhibitory networks with fixed external excitatory drive that if the network has intermediate to sparse connectivity, the individual cells are in the vicinity of a bifurcation between a quiescent and firing state and the network inhibition varies slowly on the spiking timescale, then cells form assemblies whose members show strong positive correlation, while members of different assemblies show strong negative correlation. We show that cells and assemblies switch between firing and quiescent states with time durations consistent with a power-law. Our results are in good qualitative agreement with the experimental studies. The deterministic dynamical behaviour is related to winner-less competition[3], shown in small closed loop inhibitory networks with heteroclinic cycles connecting saddle-points. 1

3 0.94252235 74 nips-2008-Estimating the Location and Orientation of Complex, Correlated Neural Activity using MEG

Author: Julia Owen, Hagai T. Attias, Kensuke Sekihara, Srikantan S. Nagarajan, David P. Wipf

Abstract: The synchronous brain activity measured via MEG (or EEG) can be interpreted as arising from a collection (possibly large) of current dipoles or sources located throughout the cortex. Estimating the number, location, and orientation of these sources remains a challenging task, one that is significantly compounded by the effects of source correlations and the presence of interference from spontaneous brain activity, sensor noise, and other artifacts. This paper derives an empirical Bayesian method for addressing each of these issues in a principled fashion. The resulting algorithm guarantees descent of a cost function uniquely designed to handle unknown orientations and arbitrary correlations. Robust interference suppression is also easily incorporated. In a restricted setting, the proposed method is shown to have theoretically zero bias estimating both the location and orientation of multi-component dipoles even in the presence of correlations, unlike a variety of existing Bayesian localization methods or common signal processing techniques such as beamforming and sLORETA. Empirical results on both simulated and real data sets verify the efficacy of this approach. 1

same-paper 4 0.93035245 214 nips-2008-Sparse Online Learning via Truncated Gradient

Author: John Langford, Lihong Li, Tong Zhang

Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of online-learning algorithms with convex loss. This method has several essential properties. First, the degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. Second, the approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. We prove small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. Finally, the approach works well empirically. We apply it to several datasets and find for datasets with large numbers of features, substantial sparsity is discoverable. 1

5 0.87112665 178 nips-2008-Performance analysis for L\ 2 kernel classification

Author: Jooseuk Kim, Clayton Scott

Abstract: We provide statistical performance guarantees for a recently introduced kernel classifier that optimizes the L2 or integrated squared error (ISE) of a difference of densities. The classifier is similar to a support vector machine (SVM) in that it is the solution of a quadratic program and yields a sparse classifier. Unlike SVMs, however, the L2 kernel classifier does not involve a regularization parameter. We prove a distribution free concentration inequality for a cross-validation based estimate of the ISE, and apply this result to deduce an oracle inequality and consistency of the classifier on the sense of both ISE and probability of error. Our results also specialize to give performance guarantees for an existing method of L2 kernel density estimation. 1

6 0.78383988 199 nips-2008-Risk Bounds for Randomized Sample Compressed Classifiers

7 0.77970737 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms

8 0.72782475 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization

9 0.67611349 202 nips-2008-Robust Regression and Lasso

10 0.66196626 75 nips-2008-Estimating vector fields using sparse basis field expansions

11 0.64607173 182 nips-2008-Posterior Consistency of the Silverman g-prior in Bayesian Model Choice

12 0.64444703 62 nips-2008-Differentiable Sparse Coding

13 0.64395571 85 nips-2008-Fast Rates for Regularized Objectives

14 0.64329129 145 nips-2008-Multi-stage Convex Relaxation for Learning with Sparse Regularization

15 0.64324492 88 nips-2008-From Online to Batch Learning with Cutoff-Averaging

16 0.63819999 91 nips-2008-Generative and Discriminative Learning with Unknown Labeling Bias

17 0.63550073 19 nips-2008-An Empirical Analysis of Domain Adaptation Algorithms for Genomic Sequence Analysis

18 0.63469833 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost

19 0.63310093 228 nips-2008-Support Vector Machines with a Reject Option

20 0.61638594 210 nips-2008-Signal-to-Noise Ratio Analysis of Policy Gradient Algorithms