nips nips2004 nips2004-70 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Saharon Rosset
Abstract: Regularization plays a central role in the analysis of modern data, where non-regularized fitting is likely to lead to over-fitted models, useless for both prediction and interpretation. We consider the design of incremental algorithms which follow paths of regularized solutions, as the regularization varies. These approaches often result in methods which are both efficient and highly flexible. We suggest a general path-following algorithm based on second-order approximations, prove that under mild conditions it remains “very close” to the path of optimal solutions and illustrate it with examples.
Reference: text
sentIndex sentText sentNum sentScore
1 We consider the design of incremental algorithms which follow paths of regularized solutions, as the regularization varies. [sent-6, score-0.644]
2 We suggest a general path-following algorithm based on second-order approximations, prove that under mild conditions it remains “very close” to the path of optimal solutions and illustrate it with examples. [sent-8, score-0.717]
3 1 Many commonly used supervised learning methods can be cast in this form, including regularized 1-norm and 2-norm support vector machines [13, 4], regularized linear and logistic regression (i. [sent-10, score-0.945]
4 Ridge regression, lasso and their logistic equivalents) and more. [sent-12, score-0.342]
5 In [8] we show that boosting can also be described as approximate regularized optimization, with an l1 -norm penalty. [sent-13, score-0.443]
6 Detailed discussion of the considerations in selecting penalty and loss functions for regularized fitting is outside the scope of this paper. [sent-14, score-0.882]
7 Computational considerations: we should be able to solve the problems we pose with the computational resources at our disposal. [sent-18, score-0.113]
8 Kernel methods and boosting are examples of computational tricks that allow us to solve very high dimensional problems – exactly or approximately – with a relatively small cost. [sent-19, score-0.257]
9 In this paper we suggest a new computational approach. [sent-20, score-0.071]
10 Once we have settled on a loss and penalty, we are still faced with the problem of selecting a “good” regularization parameter λ, in terms of prediction performance. [sent-21, score-0.259]
11 A common approach is to solve (1) for several values of λ, then use holdout data (or theoretical approaches, like AIC or SRM) to select a good value. [sent-22, score-0.035]
12 However, if we view the regularized optimization problem as a family of problems, parameterized by the regularization parameˆ ter λ, it allows us to define the “path” of optimal solutions {β(λ) : 0 ≤ λ ≤ ∞}, which is a p 1-dimensional curve through R . [sent-23, score-0.649]
13 Path following methods attempt to utilize the mathematical properties of this curve to devise efficient procedures for “following” it and generating the full set of regularized solutions with a (relatively) small computational cost. [sent-24, score-0.487]
14 As it turns out, there is a family of well known and interesting regularized problems for which efficient exact path following algorithms can be devised. [sent-25, score-0.742]
15 These include the lasso [3], 1- and 2-norm support vector machines [13, 4] and many others [9]. [sent-26, score-0.161]
16 The main property of these problems which makes them amenable to such methods is the piecewise linearity of the regularized solution path in Rp . [sent-27, score-0.853]
17 See [9] for detailed exposition of these properties and the resulting algorithms. [sent-28, score-0.039]
18 However, the path following idea can stretch beyond these exact piecewise linear algorithms. [sent-29, score-0.394]
19 In [8] we have described boosting as an approximate gradient-based algorithm for following l1 -norm regularized solution paths. [sent-31, score-0.593]
20 [6] suggest a gradient descent algorithm for finding an optimal solution for a fixed value of λ and are seemingly unaware that the path they are going through is of independent interest as it consists of approximate (alas very approximate) solutions to l1 -regularized problems. [sent-32, score-0.748]
21 Gradient-based methods, however, can only follow regularized paths under strict and non-testable conditions, and theoretical “closeness” results to the optimal path are extremely difficult to prove for them (see [8] for details). [sent-33, score-0.933]
22 In this paper, we suggest a general second-order algorithm for following “curved” regularized solution paths (i. [sent-34, score-0.693]
23 It consists of iteratively changing the regularization parameter, while making a single Newton step at every iteration towards the optimal penalized solution, for the current value of λ. [sent-37, score-0.303]
24 Section 4 is devoted to a formal statement and proof outline of our main result. [sent-39, score-0.064]
25 We discuss possible extensions and future work in section 5. [sent-40, score-0.033]
26 2 Path following algorithm We assume throughout that the loss function C is twice differentiable. [sent-41, score-0.278]
27 Assume for now also that the penalty J is twice differentiable (this assumption does not apply to the l1 norm penalty which is of great interest and we address this point later). [sent-42, score-0.857]
28 The key to our method are the normal equations for (1): ˆ ˆ C(β(λ)) + λ J(β(λ)) = 0 (2) ( ) Our algorithm iteratively constructs an approximate solution βt by taking “small” Newton-Raphson steps trying to maintain (2) as the regularization changes. [sent-43, score-0.291]
29 Our main result in this paper is to show, both empirically and theoretically, that for small , the dif( ) ˆ ference βt − β(λ0 + · t) is small, and thus that our method successfully tracks the path of optimal solutions to (1). [sent-44, score-0.56]
30 Algorithm 1 gives a formal description of our quadratic tracking method. [sent-45, score-0.079]
31 We start from a ˆ solution to (1) for some fixed λ0 (e. [sent-46, score-0.11]
32 At each iteration we increase λ by and take a single Newton-Raphson step towards the solution to (2) with the new λ value in step 2(b). [sent-49, score-0.144]
33 Algorithm 1 Approximate incremental quadratic algorithm for regularized optimization ( ) ˆ 1. [sent-50, score-0.469]
34 1 The l1 -norm penalty The l1 -norm penalty, J(β) = β 1 , is of special interest because of its favorable statistical properties (e. [sent-54, score-0.334]
35 [2]) and its widespread use in popular methods, such as the lasso [10] and 1-norm SVM [13]. [sent-56, score-0.161]
36 However it is not differentiable and so our algorithm does not apply to l1 -penalized problems directly. [sent-57, score-0.199]
37 To understand how we can generalize Algorithm 1 to this situation, we need to consider the Karush-Kuhn-Tucker (KKT) conditions for optimality of the optimization problem implied by (1). [sent-58, score-0.272]
38 All variables with smaller generalized correlation have 0 coefficient at the optimal penalized solution for this λ. [sent-60, score-0.287]
39 Note that the l1 -norm penalty is twice differentiable everywhere except at 0. [sent-61, score-0.523]
40 So if we carefully manage the set of non-0 coefficients according to these KKT conditions, we can still apply our algorithm in the lower-dimensional subspace spanned by non-0 coefficients only. [sent-62, score-0.04]
41 Thus we get Algorithm 2, which employs the Newton approach of Algorithm 1 for twice differentiable penalty, limited to the sub-space of “active” coefficients denoted by A. [sent-63, score-0.189]
42 It adds to Algorithm 1 updates for the “add variable to active set” and “remove variable from active set” events, when a variable becomes “highly correlated” as defined in (4) and when a coefficient hits 0 , respectively. [sent-64, score-0.142]
43 2 Algorithm 2 Approximate incremental quadratic algorithm for regularized optimization with lasso penalty ( ) ˆ ˆ 1. [sent-65, score-0.964]
44 2 Computational considerations For a fixed λ0 and λmax , Algorithms 1 and 2 take O(1/ ) steps. [sent-69, score-0.072]
45 Since we implicitly assume throughout that n ≥ p, we get overall complexity of O(n · p2 / ). [sent-71, score-0.094]
46 The choice of represents a tradeoff between computational complexity and accuracy (in section 4 we present theoretical results on the relationship between and the accuracy of the path approximation we get). [sent-72, score-0.426]
47 In practice, our algorithm is practical for problems with up to several hundred predictors and several thousand observations. [sent-73, score-0.136]
48 It is interesting to compare this calculation to the obvious alternative, which is to solve O(1/ ) regularized problems (1) separately, using a Newton-Raphson approach, resulting in the same complexity (assuming the number of Newton-Raphson iterations for finding each solution is bounded). [sent-75, score-0.625]
49 Our algorithm guarantees we stay very close to the optimal solution path with a single Newton step at each new value of λ. [sent-77, score-0.565]
50 • Empirically we observe that in some cases our algorithm is able to follow the path while direct solution for some values of λ fails to converge. [sent-78, score-0.482]
51 We assume this is related to various numeric properties of the specific problems being solved. [sent-79, score-0.04]
52 • For the interesting case of l1 -norm penalty and a “curved” loss function (like logistic log-likelihood), there is no direct Newton-Raphson algorithm. [sent-80, score-0.675]
53 Re-formulating the problem into differentiable form requires doubling the dimensionality. [sent-81, score-0.119]
54 Using our Algorithm 2, we can still utilize the same Newton method, with significant computational savings when many coefficients are 0 and we work in a lowerdimensional subspace. [sent-82, score-0.076]
55 2 When a coefficient hits 0 it not only hits a non-differentiability point in the penalty, it also ceases to be maximally correlated as defined in (4). [sent-83, score-0.16]
56 3 Connection to path following methods from numerical analysis There is extensive literature on path-following methods for solution paths of general parametric problems. [sent-86, score-0.643]
57 That is, the corrector step starts from the previous approximate solution. [sent-89, score-0.049]
58 These methods are recognized as attractive options when the functions defining the path (in our case, the combination of loss and penalty) are “smooth” and “far from linear”. [sent-90, score-0.462]
59 These conditions for efficacy of our approach are reflected in the regularity conditions for the closeness result in Section 4. [sent-91, score-0.503]
60 3 Example: l2 - and l1 -penalized logistic regression Regularized logistic regression has been successfully used as a classification and probability estimation approach [11, 12]. [sent-92, score-0.594]
61 We first illustrate applying our quadratic method to this regularized problem using a small subset of the “spam” data-set, available from the UCI repository (http://www. [sent-93, score-0.382]
62 Next, we apply it to the full “spam” data-set, to demonstrate its time complexity on bigger problems. [sent-98, score-0.056]
63 They represent the “optimality gap”: ( ) et = C(βt ) ( ) J(βt ) + ·t where the division is done componentwise (and so the five curves in each plot correspond ˆ to the five variables we are using). [sent-104, score-0.054]
64 Note that the optimal solution β(t ) is uniquely defined by the fact that (2) holds and therefore the “optimality gap” is equal to zero componentwise ˆ at β(t ). [sent-105, score-0.214]
65 By convexity and regularity of the loss and the penalty, there is a correspondence ˆ between small values of e and small distance β ( ) (t)− β(t ) . [sent-106, score-0.354]
66 In our example we observe that the components of e seem to be bounded in a small region around 0 for both paths (note the small scale of the y axis in both plots — the maximal error is under 10−3 ). [sent-107, score-0.201]
67 We conclude that on this simple example our method tracks the optimal solution paths well, both for the l1 - and l2 -regularized problems. [sent-108, score-0.408]
68 The plots on the left show the actual coefficient paths — the curve in R5 is shown as five coefficient traces in R, each corresponding to one variable, with the non-regularized solution (identical for both problems) on the extreme left. [sent-109, score-0.311]
69 Next, we run our algorithm on the full “spam” data-set, containing p = 57 predictors and n = 4601 observations. [sent-110, score-0.096]
70 For both the l1 - and l2 -penalized paths we used −4 x 10 2. [sent-111, score-0.201]
71 5 0 10 20 λ 30 40 −2 0 Figure 1: Solution paths (left) and optimality criterion (right) for l1 penalized logistic regression (top) and l2 penalized logistic regression (bottom). [sent-119, score-1.197]
72 02 and starting from the non-regularized logistic regression solution (i. [sent-121, score-0.407]
73 02, and the whole path was generated in under 5 minutes using a Matlab implementation on an IBM T-30 Laptop. [sent-124, score-0.332]
74 Like in the small scale example, the “optimality criterion” was uniformly small throughout the two paths, with none of its 57 components exceeding 10−3 at any point. [sent-125, score-0.038]
75 4 Theoretical closeness result In this section we prove that our algorithm can track the path of true solutions to (1). [sent-126, score-0.618]
76 We show that under regularity conditions on the loss and penalty (which hold for all the candidates we have examined), if we run Algorithm 1 with a specific step size , then we remain within O( 2 ) of the true path of optimal regularized solutions. [sent-127, score-1.505]
77 Theorem 1 Assume λ0 > 0, then for the derivatives of C and J, small enough and under regularity conditions on ∀0 < c < λmax − λ0 , ˆ β ( ) (c/ ) − β(λ0 + c) = O( 2 ) So there is a uniform bound O( 2 ) on the error which does not depend on c. [sent-128, score-0.452]
78 Proof We give the details of the proof in Appendix A of [7]. [sent-129, score-0.064]
79 Similar to section 3 we define the “optimality gap”: ( ) (7) ( C(βt ) ( ) J(βt ) )j + λt = etj Also define a “regularity constant” M , which depends on λ0 and the first, second and third derivatives of the loss and penalty. [sent-131, score-0.181]
80 The proof is presented as a succession of lemmas: Lemma 2 Let u1 = M · p · 2 , ut = M (ut−1 + √ p · )2 , then: et 2 ≤ ut This lemma gives a recursive expression bounding the error in the optimality gap (7) as the algorithm proceeds. [sent-132, score-0.693]
81 The proof is based on separate Taylor expansions of the numerator and denominator of the ratio C in the optimality gap and some tedious algebra. [sent-133, score-0.44]
82 J Lemma 3 If √ p M ≤ 1/4 then ut 1 2M − √ p· − √ √ 1−4 p· M 2M = O( 2 ) , ∀t This lemma shows that the recursive bound translates to a uniform O( 2 ) bound, if is small enough. [sent-134, score-0.361]
83 The proof consists of analytically finding the fixed point of the increasing series ut . [sent-135, score-0.157]
84 1 Required regularity conditions Regularity in the loss and the penalty is required in the definition of the regularity constant M and in the translation of the O( 2 ) bound on the “optimality gap” into one on the distance from the path in lemma 4. [sent-139, score-1.481]
85 The exact derivation of the regularity conditions is highly technical and lengthy. [sent-140, score-0.312]
86 Assuming that λ0 > 0 and λmax < ∞ these conditions hold for every interesting example we have encountered, including: • Ridge regression and the lasso (that is, l2 - and l1 - regularized squared error loss). [sent-144, score-0.742]
87 We observe in figure 1 that the tracking algorithm indeed suffers the biggest inaccuracy for the small values of λ, but manages to “self correct” as λ increases. [sent-149, score-0.077]
88 Basis expansions and Kernel methods Our approach obviously applies, as is, to models that are linear in basis expansions of the original variables (like wavelets or kernel methods) as long as p < n is preserved. [sent-152, score-0.328]
89 However, the method can easily be applied to high (including infinite) dimensional kernel versions of regularized models where RKHS theory applies. [sent-153, score-0.372]
90 We know that the solution path is fully within the span of the representer functions, that is the columns of the Kernel matrix. [sent-154, score-0.442]
91 , kn and the standard l2 -norm penalty, the regularized problem becomes: α(λ) = arg min ˆ C(yi , α ki ) + λα Kα α i so the penalty now also contains the Kernel matrix, but this poses no complications in using Algorithm 1. [sent-158, score-0.683]
92 The only consideration we need to keep in mind is the computational one, as our complexity is O(n3 / ). [sent-159, score-0.161]
93 So our method is fully applicable and practical for kernel methods, as long as the number of observations, and the resulting kernel matrix, are not too large (up to several hundreds). [sent-160, score-0.126]
94 Unsupervised learning There is no reason to limit the applicability of this approach to supervised learning. [sent-161, score-0.067]
95 Thus, for example, adaptive density estimation using negative log-likelihood as a loss can be regularized and the solution path be tracked using our algorithm. [sent-162, score-0.881]
96 Computational tricks The computational complexity of our algorithm limits its applicability to large problems. [sent-163, score-0.23]
97 To improve its scalability we primarily need to reduce the effort in the Hessian calculation and inversion. [sent-164, score-0.045]
98 The obvious suggestion here would be to keep the Hessian part of step 2(b) in Algorithm 1 fixed for many iterations and change the gradient part only, then update the Hessian occasionally. [sent-165, score-0.065]
99 The author thanks Noureddine El Karoui for help with the proof and Jerome Friedman, Giles Hooker, Trevor Hastie and Ji Zhu for helpful discussions. [sent-169, score-0.064]
100 Boosting as a regularized path to a maximum margin classifier. [sent-217, score-0.641]
wordName wordTfidf (topN-words)
[('penalty', 0.334), ('path', 0.332), ('regularized', 0.309), ('regularity', 0.224), ('paths', 0.201), ('logistic', 0.181), ('lasso', 0.161), ('optimality', 0.148), ('rosset', 0.134), ('gap', 0.134), ('loss', 0.13), ('penalized', 0.127), ('differentiable', 0.119), ('regression', 0.116), ('tibshirani', 0.112), ('solution', 0.11), ('hastie', 0.105), ('closeness', 0.103), ('solutions', 0.102), ('spam', 0.101), ('expansions', 0.094), ('ut', 0.093), ('regularization', 0.092), ('zhu', 0.091), ('lemma', 0.09), ('conditions', 0.088), ('boosting', 0.085), ('newton', 0.082), ('hits', 0.08), ('coef', 0.079), ('curved', 0.075), ('considerations', 0.072), ('twice', 0.07), ('proof', 0.064), ('kernel', 0.063), ('piecewise', 0.062), ('hessians', 0.059), ('tricks', 0.059), ('bound', 0.059), ('translates', 0.058), ('hessian', 0.058), ('complexity', 0.056), ('predictors', 0.056), ('componentwise', 0.054), ('yi', 0.053), ('derivatives', 0.051), ('johnstone', 0.05), ('optimal', 0.05), ('approximate', 0.049), ('shrinkage', 0.047), ('tracks', 0.047), ('calculation', 0.045), ('cients', 0.045), ('ridge', 0.045), ('wavelets', 0.045), ('kim', 0.043), ('quadratic', 0.042), ('incremental', 0.042), ('ibm', 0.041), ('prove', 0.041), ('arg', 0.04), ('tting', 0.04), ('ve', 0.04), ('problems', 0.04), ('algorithm', 0.04), ('kkt', 0.04), ('detailed', 0.039), ('max', 0.039), ('utilize', 0.038), ('computational', 0.038), ('hold', 0.038), ('throughout', 0.038), ('applicability', 0.037), ('selecting', 0.037), ('tracking', 0.037), ('optimization', 0.036), ('rp', 0.035), ('solve', 0.035), ('mind', 0.034), ('friedman', 0.034), ('iteration', 0.034), ('appendix', 0.033), ('stay', 0.033), ('extensions', 0.033), ('keep', 0.033), ('suggest', 0.033), ('gradient', 0.032), ('obviously', 0.032), ('family', 0.031), ('recursive', 0.031), ('illustrate', 0.031), ('active', 0.031), ('interesting', 0.03), ('supervised', 0.03), ('uniform', 0.03), ('aug', 0.029), ('ference', 0.029), ('gen', 0.029), ('aic', 0.029), ('ter', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 70 nips-2004-Following Curved Regularized Optimization Solution Paths
Author: Saharon Rosset
Abstract: Regularization plays a central role in the analysis of modern data, where non-regularized fitting is likely to lead to over-fitted models, useless for both prediction and interpretation. We consider the design of incremental algorithms which follow paths of regularized solutions, as the regularization varies. These approaches often result in methods which are both efficient and highly flexible. We suggest a general path-following algorithm based on second-order approximations, prove that under mild conditions it remains “very close” to the path of optimal solutions and illustrate it with examples.
2 0.33860159 42 nips-2004-Computing regularization paths for learning multiple kernels
Author: Francis R. Bach, Romain Thibaux, Michael I. Jordan
Abstract: The problem of learning a sparse conic combination of kernel functions or kernel matrices for classification or regression can be achieved via the regularization by a block 1-norm [1]. In this paper, we present an algorithm that computes the entire regularization path for these problems. The path is obtained by using numerical continuation techniques, and involves a running time complexity that is a constant times the complexity of solving the problem for one value of the regularization parameter. Working in the setting of kernel linear regression and kernel logistic regression, we show empirically that the effect of the block 1-norm regularization differs notably from the (non-block) 1-norm regularization commonly used for variable selection, and that the regularization path is of particular value in the block case. 1
3 0.33555362 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
Author: Saharon Rosset, Robert Tibshirani, Ji Zhu, Trevor J. Hastie
Abstract: In this paper we argue that the choice of the SVM cost parameter can be critical. We then derive an algorithm that can fit the entire path of SVM solutions for every value of the cost parameter, with essentially the same computational cost as fitting one SVM model. 1
4 0.12540922 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning
Author: Saharon Rosset, Ji Zhu, Hui Zou, Trevor J. Hastie
Abstract: We consider the situation in semi-supervised learning, where the “label sampling” mechanism stochastically depends on the true response (as well as potentially on the features). We suggest a method of moments for estimating this stochastic dependence using the unlabeled data. This is potentially useful for two distinct purposes: a. As an input to a supervised learning procedure which can be used to “de-bias” its results using labeled data only and b. As a potentially interesting learning task in itself. We present several examples to illustrate the practical usefulness of our method.
5 0.11350536 32 nips-2004-Boosting on Manifolds: Adaptive Regularization of Base Classifiers
Author: Ligen Wang, Balázs Kégl
Abstract: In this paper we propose to combine two powerful ideas, boosting and manifold learning. On the one hand, we improve A DA B OOST by incorporating knowledge on the structure of the data into base classifier design and selection. On the other hand, we use A DA B OOST’s efficient learning mechanism to significantly improve supervised and semi-supervised algorithms proposed in the context of manifold learning. Beside the specific manifold-based penalization, the resulting algorithm also accommodates the boosting of a large family of regularized learning algorithms. 1
6 0.11120496 96 nips-2004-Learning, Regularization and Ill-Posed Inverse Problems
7 0.11087743 138 nips-2004-Online Bounds for Bayesian Algorithms
8 0.10795794 164 nips-2004-Semi-supervised Learning by Entropy Minimization
9 0.093481265 139 nips-2004-Optimal Aggregation of Classifiers and Boosting Maps in Functional Magnetic Resonance Imaging
10 0.091278151 137 nips-2004-On the Adaptive Properties of Decision Trees
11 0.090431042 178 nips-2004-Support Vector Classification with Input Data Uncertainty
12 0.090374351 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition
13 0.090040118 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification
14 0.087372914 27 nips-2004-Bayesian Regularization and Nonnegative Deconvolution for Time Delay Estimation
15 0.084026232 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform
16 0.082298979 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants
17 0.0820636 91 nips-2004-Joint Tracking of Pose, Expression, and Texture using Conditionally Gaussian Filters
18 0.080990687 82 nips-2004-Incremental Algorithms for Hierarchical Classification
19 0.079106994 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning
20 0.078971669 92 nips-2004-Kernel Methods for Implicit Surface Modeling
topicId topicWeight
[(0, -0.258), (1, 0.129), (2, 0.023), (3, 0.226), (4, -0.057), (5, -0.052), (6, 0.075), (7, -0.117), (8, -0.088), (9, 0.04), (10, 0.209), (11, -0.061), (12, 0.084), (13, 0.07), (14, -0.064), (15, 0.017), (16, 0.12), (17, -0.147), (18, -0.183), (19, -0.173), (20, 0.162), (21, -0.228), (22, 0.057), (23, -0.054), (24, 0.124), (25, 0.119), (26, -0.1), (27, -0.087), (28, 0.056), (29, 0.061), (30, 0.022), (31, 0.002), (32, -0.059), (33, -0.028), (34, 0.067), (35, 0.04), (36, -0.045), (37, -0.088), (38, 0.02), (39, 0.012), (40, -0.063), (41, -0.072), (42, 0.093), (43, -0.051), (44, 0.03), (45, 0.116), (46, 0.055), (47, 0.079), (48, 0.074), (49, 0.0)]
simIndex simValue paperId paperTitle
same-paper 1 0.97469813 70 nips-2004-Following Curved Regularized Optimization Solution Paths
Author: Saharon Rosset
Abstract: Regularization plays a central role in the analysis of modern data, where non-regularized fitting is likely to lead to over-fitted models, useless for both prediction and interpretation. We consider the design of incremental algorithms which follow paths of regularized solutions, as the regularization varies. These approaches often result in methods which are both efficient and highly flexible. We suggest a general path-following algorithm based on second-order approximations, prove that under mild conditions it remains “very close” to the path of optimal solutions and illustrate it with examples.
2 0.85045451 42 nips-2004-Computing regularization paths for learning multiple kernels
Author: Francis R. Bach, Romain Thibaux, Michael I. Jordan
Abstract: The problem of learning a sparse conic combination of kernel functions or kernel matrices for classification or regression can be achieved via the regularization by a block 1-norm [1]. In this paper, we present an algorithm that computes the entire regularization path for these problems. The path is obtained by using numerical continuation techniques, and involves a running time complexity that is a constant times the complexity of solving the problem for one value of the regularization parameter. Working in the setting of kernel linear regression and kernel logistic regression, we show empirically that the effect of the block 1-norm regularization differs notably from the (non-block) 1-norm regularization commonly used for variable selection, and that the regularization path is of particular value in the block case. 1
3 0.67531896 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
Author: Saharon Rosset, Robert Tibshirani, Ji Zhu, Trevor J. Hastie
Abstract: In this paper we argue that the choice of the SVM cost parameter can be critical. We then derive an algorithm that can fit the entire path of SVM solutions for every value of the cost parameter, with essentially the same computational cost as fitting one SVM model. 1
4 0.58466798 96 nips-2004-Learning, Regularization and Ill-Posed Inverse Problems
Author: Lorenzo Rosasco, Andrea Caponnetto, Ernesto D. Vito, Francesca Odone, Umberto D. Giovannini
Abstract: Many works have shown that strong connections relate learning from examples to regularization techniques for ill-posed inverse problems. Nevertheless by now there was no formal evidence neither that learning from examples could be seen as an inverse problem nor that theoretical results in learning theory could be independently derived using tools from regularization theory. In this paper we provide a positive answer to both questions. Indeed, considering the square loss, we translate the learning problem in the language of regularization theory and show that consistency results and optimal regularization parameter choice can be derived by the discretization of the corresponding inverse problem. 1
5 0.46002746 27 nips-2004-Bayesian Regularization and Nonnegative Deconvolution for Time Delay Estimation
Author: Yuanqing Lin, Daniel D. Lee
Abstract: Bayesian Regularization and Nonnegative Deconvolution (BRAND) is proposed for estimating time delays of acoustic signals in reverberant environments. Sparsity of the nonnegative filter coefficients is enforced using an L1 -norm regularization. A probabilistic generative model is used to simultaneously estimate the regularization parameters and filter coefficients from the signal data. Iterative update rules are derived under a Bayesian framework using the Expectation-Maximization procedure. The resulting time delay estimation algorithm is demonstrated on noisy acoustic data.
6 0.4465777 138 nips-2004-Online Bounds for Bayesian Algorithms
7 0.42355537 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition
8 0.42151377 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform
9 0.37645602 207 nips-2004-ℓ₀-norm Minimization for Basis Selection
10 0.37349036 54 nips-2004-Distributed Information Regularization on Graphs
11 0.35153902 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning
12 0.35127714 1 nips-2004-A Cost-Shaping LP for Bellman Error Minimization with Performance Guarantees
13 0.33216697 126 nips-2004-Nearly Tight Bounds for the Continuum-Armed Bandit Problem
14 0.33104676 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound
15 0.33011234 82 nips-2004-Incremental Algorithms for Hierarchical Classification
16 0.329036 94 nips-2004-Kernels for Multi--task Learning
17 0.32556814 12 nips-2004-A Temporal Kernel-Based Model for Tracking Hand Movements from Neural Activities
18 0.30913633 196 nips-2004-Triangle Fixing Algorithms for the Metric Nearness Problem
19 0.30128774 137 nips-2004-On the Adaptive Properties of Decision Trees
20 0.30128631 136 nips-2004-On Semi-Supervised Classification
topicId topicWeight
[(13, 0.15), (15, 0.115), (26, 0.058), (31, 0.054), (33, 0.153), (35, 0.021), (39, 0.094), (50, 0.037), (59, 0.031), (71, 0.017), (81, 0.192)]
simIndex simValue paperId paperTitle
1 0.94407964 184 nips-2004-The Cerebellum Chip: an Analog VLSI Implementation of a Cerebellar Model of Classical Conditioning
Author: Constanze Hofstoetter, Manuel Gil, Kynan Eng, Giacomo Indiveri, Matti Mintz, Jörg Kramer, Paul F. Verschure
Abstract: We present a biophysically constrained cerebellar model of classical conditioning, implemented using a neuromorphic analog VLSI (aVLSI) chip. Like its biological counterpart, our cerebellar model is able to control adaptive behavior by predicting the precise timing of events. Here we describe the functionality of the chip and present its learning performance, as evaluated in simulated conditioning experiments at the circuit level and in behavioral experiments using a mobile robot. We show that this aVLSI model supports the acquisition and extinction of adaptively timed conditioned responses under real-world conditions with ultra-low power consumption. I n tro d u cti o n 1 The association of two correlated stimuli, an initially neutral conditioned stimulus (CS) which predicts a meaningful unconditioned stimulus (US), leading to the acquisition of an adaptive conditioned response (CR), is one of the most essential forms of learning. Pavlov introduced the classical conditioning paradigm in the early 20th century to study associative learning (Pavlov 1927). In classical conditioning training an animal is repeatedly exposed to a CS followed by a US after a certain inter-stimulus interval (ISI). The animal learns to elicit a CR matched to the ISI, reflecting its knowledge about an association between the CS, US, and their temporal relationship. Our earlier software implementation of a * Jörg Kramer designed the cerebellum chip that was first tested at the 2002 Telluride Neuromorphic Engineering Workshop. Tragically, he died soon afterwards while hiking on Telescope Peak on 24 July, 2002. biophysically constrained model of the cerebellar circuit underlying classical conditioning (Verschure and Mintz 2001; Hofstötter et al. 2002) provided an explanation of this phenomenon by assuming a negative feedback loop between the cerebellar cortex, deep nucleus and inferior olive. It could acquire and extinguish correctly timed CRs over a range of ISIs in simulated classical conditioning experiments, as well as in associative obstacle avoidance tasks using a mobile robot. In this paper we present the analog VLSI (aVLSI) implementation of this cerebellum model – the cerebellum chip – and the results of chip-level and behavioral robot experiments. 2 T h e mo d el ci r cu i t a n d a VL S I i mp l eme n ta ti o n Figure 1: Anatomy of the cerebellar model circuit (left) and the block diagram of the corresponding chip (right). The model (Figure 1) is based on the identified cerebellar pathways of CS, US and CR (Kim and Thompson 1997) and includes four key hypotheses which were implemented in the earlier software model (Hofstötter et al. 2002): 1. CS related parallel fiber (pf) and US related climbing fiber (cf) signals converge at Purkinje cells (PU) in the cerebellum (Steinmetz et al. 1989). The direction of the synaptic changes at the pf-PU-synapse depends on the temporal coincidence of pf and cf activity. Long-term depression (LTD) is induced by pf activity followed by cf activity within a certain time interval, while pf activity alone induces long-term potentiation (LTP) (Hansel et al. 2001). 2. A prolonged second messenger response to pf stimulation in the dendrites of PU constitutes an eligibility trace from the CS pathway (Sutton and Barto 1990) that bridges the ISI (Fiala et al. 1996). 3. A microcircuit (Ito 1984) comprising PU, deep nucleus (DN) and inferior olive (IO) forms a negative feedback loop. Shunting inhibition of IO by DN blocks the reinforcement pathway (Thompson et al. 1998), thus controlling the induction of LTD and LTP at the pf-PU-synapse. 4. DN activity triggers behavioral CRs (McCormick and Thompson 1984). The inhibitory PU controls DN activity by a mechanism called rebound excitation (Hesslow 1994): When DN cells are disinhibited from PU input, their membrane potential slowly repolarises and spikes are emitted if a certain threshold is reached. Thereby, the correct timing of CRs results from the adaptation of a pause in PU spiking following the CS. In summary, in the model the expression of a CR is triggered by DN rebound excitation upon release from PU inhibition. The precise timing of a CR is dependent on the duration of an acquired pause in PU spiking following a CS. The PU response is regulated by LTD and LTP at the pf-PU-synapse under the control of a negative feedback loop comprising DN, PU and IO. We implemented an analog VLSI version of the cerebellar model using a standard 1.6µm CMOS technology, and occupying an area of approximately 0.25 mm2. A block diagram of the hardware model is shown in Figure 1. The CS block receives the conditioned stimulus and generates two signals: an analog long-lasting, slowly decaying trace (cs_out) and an equally long binary pulse (cs_wind). Similarly, the US block receives an unconditioned stimulus and generates a fast pulse (us_out). The two pulses cs_wind and us_out are sent to the LT-ISI block that is responsible for perfoming LTP and LTD, upregulating or downregulating the synaptic weight signal w. This signal determines the gain by which the cs_out trace is multiplied in the MU block. The output of the multiplier MU is sent on to the PU block, together with the us_out signal. It is a linear integrate-and-fire neuron (the axon-hillock circuit) connected to a constant current source that produces regular spontaneous activity. The current source is gated by the digital cf_wind signal, such that the spontaneous activity is shut off for the duration of the cs_out trace. The chip allowed one of three learning rules to be connected. Experiments showed that an ISI-dependent learning rule with short ISIs resulting in the strongest LTD was the most useful (Kramer and Hofstötter 2002). Two elements were added to adapt the model circuit for real-world robot experiments. Firstly, to prevent the expression of a CR after a US had already been triggered, an inhibitory connection from IO to CRpathway was added. Secondly, the transduction delay (TD) from the aVLSI circuit to any effectors (e.g. motor controls of a robot) had to be taken into account, which was done by adding a delay from DN to IO of 500ms. The chip’s power consumption is conservatively estimated at around 100 W (excluding off-chip interfacing), based on measurements from similar integrateand-fire neuron circuits (Indiveri 2003). This figure is an order of magnitude lower than what could be achieved using conventional microcontrollers (typically 1-10 mW), and could be improved further by optimising the circuit design. 3 S i mu l a ted co n d i ti o n i n g ex p eri men ts The aim of the “in vitro” simulated conditioning experiments was to understand the learning performance of the chip. To obtain a meaningful evaluation of the performance of the learning system for both the simulated conditioning experiments and the robot experiments, the measure of effective CRs was used. In acquisition experiments CS-US pairs are presented with a fixed ISI. Whenever a CR occurs that precedes the US, the US signal is not propagated to PU due to the inhibitory connection from DN to IO. Thus in the context of acquisition experiments a CR is defined as effective if it prevents the occurrence of a US spike at PU. In contrast, in robot experiments an effective CR is defined at the behavioral level, including only CRs that prevent the US from occurring. Figure 2: Learning related response changes in the cerebellar aVLSI chip. The most relevant neural responses to a CS-US pair (ISI of 3s, ITI of 12s) are presented for a trial before (naive) significant learning occurred and when a correctly timed CR is expressed (trained). US-related pf and CS/CR-related cf signals are indicated by vertical lines passing through the subplots. A CS-related pf-signal evokes a prolonged response in the pf-PU-synapse, the CS-trace (Trace subplot). While an active CS-trace is present, an inhibitory element (I) is active which inactivates an element representing the spontaneous activity of PU (Hofstötter et al. 2002). (A) The US-related cf input occurs while there is an active CS-trace (Trace subplot), in this case following the CS with an ISI of 3s. LTD predominates over LTP under these conditions (Weight subplot). Because the PU membrane potential (PU) remains above spiking threshold, PU is active and supplies constant inhibition to DN (DN) while in the CS-mode. Thus, DN cannot repolarize and remains inactive so that no CR is triggered. (B) Later in the experiment, the synaptic weight of the pf-PU-synapse (Weight) has been reduced due to previous LTD. As a result, following a CS-related pf input, the PU potential (PU subplot) falls below the spiking threshold, which leads to a pause in PU spiking. The DN membrane potential repolarises, so that rebound spikes are emitted (DN subplot). This rebound excitation triggers a CR. DN inhibition of IO prevents US related cfactivity. Thus, although a US signal is still presented to the circuit, the reinforcing US pathway is blocked. These conditions induce only LTP, raising the synaptic weight of the pf-PU-synapse (Weight subplot). The results we obtained were broadly consistent with those reported in the biological literature (Ito 1984; Kim and Thompson 1997). The correct operation of the circuit can be seen in the cell traces illustrating the properties of the aVLSI circuit components before significant learning (Figure 2 A), and after a CR is expressed (Figure 2B). Long-term acquisition experiments (25 blocks of 10 trials each over 50 minutes) showed that chip functions remained stable over a long time period. In each trial the CS was followed by a US with a fixed ISI of 3s; the inter trial interval (ITI) was 12s. The number of effective CRs shows an initial fast learning phase followed by a stable phase with higher percentages of effective CRs (Figure 3B). In the stable phase the percentage of effective CRs per block fluctuates around 80-90%. There are fluctuations of up to 500ms in the CR latency caused by the interaction of LTD and LTP in the stable phase, but the average CR latency remains fairly constant. Figure 4 shows the average of five acquisition experiments (5 blocks of 10 trials per experiment) for ISIs of 2.5s, 3s and 3.5s. The curves are similar in shape to the ones in the long-term experiment. The CR latency quickly adjusts to match the ISI and remains stable thereafter (Figure 4A). The effect of the ISI-dependent learning rule can be seen in two ways: firstly, the shorter the ISI, the faster the stable phase is reached, denoting faster learning. Secondly, the shorter the ISI, the better the performance in terms of percentage of effective CRs (Figure 4B). The parameters of the chip were tuned to optimally encode short ISIs in the range of 1.75s to 4.5s. Separate experiments showed that the chip could also adapt rapidly to changes in the ISI within this range after initial learning. (Error bar = 1 std. dev.) Figure 3: Long-term changes in CR latency (A) and % effective CRs (B) per block of 10 CSs during acquisition. Experiment length = 50min., ISI = 3s, ITI = 12s. (Error bar = 1 std. dev.) Figure 4: Average of five acquisition experiments per block of 10 CSs for ISIs of 2.5s ( ), 3s (*) and 3.5s ( ). (A) Avg. CR latency. (B) Avg. % effective CRs. 4 Ro b o t a s s o ci a ti v e l ea rn i n g ex p eri men t s The “in vivo” learning capability of the chip was evaluated by interfacing it to a robot and observing its behavior in an unsupervised obstacle avoidance task. Experiments were performed using a Khepera microrobot (K-team, Lausanne, Switzerland, Figure 5A) in a circular arena with striped walls (Figure 5C). The robot was equipped with 6 proximal infra-red (IR) sensors (Figure 5B). Activation of these sensors (US) due to a collision triggered a turn of ~110° in the opposite direction (UR). A line camera (64 pixels x 256 gray-levels) constituted the distal sensor, with detection of a certain spatial frequency (~0.14 periods/degree) signalling the CS. Visual CSs and collision USs were conveyed to CSpathway and USpathway on the chip. The activation of CRpathway triggered a motor CR: a 1s long regression followed by a turn of ~180°. Communication between the chip and the robot was performed using Matlab on a PC. The control program could be downloaded to the robot's processor, allowing the robot to act fully autonomously. In each experiment, the robot was placed in the circular arena exploring its environment with a constant speed of ~4 cm/s. A spatial frequency CS was detected at some distance when the robot approached the wall, followed by a collision with the wall, stimulating the IR sensors and thus triggering a US. Consequently the CS was correlated with the US, predicting it. The ISIs of these stimuli were variable, due to noise in sensor sampling, and variations in the angle at which the robot approached the wall. Figure 5: (A) Khepera microrobot with aVLSI chip mounted on top. (B) Only the forward sensors were used during the experiments. (C) The environment: a 60cm diameter circular arena surrounded by a 15cm high wall. A pattern of vertical, equally sized black and white bars was placed on the wall. Associative learning mediated by the cerebellum chip significantly altered the robot's behavior in the obstacle avoidance task (Figure 6) over the course of each experiment. In the initial learning phase, the behavior was UR driven: the robot drove forwards until it collided with the wall, only then performing a turn (Figure 6A1). In the trained phase, the robot usually turned just before it collided with the wall (Figure 6A2), reducing the number of collisions. The positions of the robot when a CS, US or CR event occurred in these two phases are shown in Figure 6B1 and B2. The CRs were not expressed immediately after the CSs, but rather with a CR latency adjusted to just prevent collisions (USs). Not all USs were avoided in the trained phase due to some excessively short ISIs (Figure 7) and normal extinction processes over many unreinforced trials. After the learning phase the percentage of effective CRs fluctuated between 70% and 100% (Figure 7). Figure 6: Learning performance of the robot. (Top row) Trajectories of the robot. The white circle with the black dot in the center indicates the beginning of trajectories. (Bottom row) The same periods of the experiment examined at the circuit level: = CS, * = US, = CR. (A1, B1) Beginning of the experiment (CS 3-15). (A2, B2) Later in the experiment (CS 32-44). Figure 7: Trends in learning behavior (average of 5 experiments, 25 min. each). 90 CSs were presented in each experiment. Error bars indicate one standard deviation. (A) Average percentage of effective CRs over 9 blocks of 10 CSs. (B) Number of CS occurrences ( ), US occurrences (*) and CR occurrences ( ). 5 Di s cu s s i o n We have presented one of the first examples of a biologically constrained model of learning implemented in hardware. Our aVLSI cerebellum chip supports the acquisition and extinction of adaptively timed responses under noisy, real world conditions. These results provide further evidence for the role of the cerebellar circuit embedded in a synaptic feedback loop in the learning of adaptive behavior, and pave the way for the creation of artefacts with embedded ultra low-power learning capabilities. 6 Ref eren ces Fiala, J. C., Grossberg, S. and Bullock, D. (1996). Metabotropic glutamate receptor activation in cerebellar Purkinje cells as substrate for adaptive timing of the classical conditioned eye-blink response. Journal of Neuroscience 16: 3760-3774. Hansel, C., Linden, D. J. and D'Angelo, E. (2001). Beyond parallel fiber LTD, the diversity of synaptic and nonsynaptic plasticity in the cerebellum. Nature Neuroscience 4: 467-475. Hesslow, G. (1994). Inhibition of classical conditioned eyeblink response by stimulation of the cerebellar cortex in decerebrate cat. Journal of Physiology 476: 245-256. Hofstötter, C., Mintz, M. and Verschure, P. F. M. J. (2002). The cerebellum in action: a simulation and robotics study. European Journal of Neuroscience 16: 1361-1376. Indiveri, G. (2003). A low-power adaptive integrate-and-fire neuron circuit. IEEE International Symposium on Circuits and Systems, Bangkok, Thailand, 4: 820-823. Ito, M. (1984). The modifiable neuronal network of the cerebellum. Japanese Journal of Physiology 5: 781-792. Kim, J. J. and Thompson, R. F. (1997). Cerebellar circuits and synaptic mechanisms involved in classical eyeblink conditioning. Trends in the Neurosciences 20(4): 177-181. Kim, J. J. and Thompson, R. F. (1997). Cerebellar circuits and synaptic mechanisms involved in classical eyeblink conditioning. Trend. Neurosci. 20: 177-181. Kramer, J. and Hofstötter, C. (2002). An aVLSI model of cerebellar mediated associative learning. Telluride Workshop, CO, USA. McCormick, D. A. and Thompson, R. F. (1984). Neuronal response of the rabbit cerebellum during acquisition and performance of a classical conditioned nictitating membrane-eyelid response. J. Neurosci. 4: 2811-2822. Pavlov, I. P. (1927). Conditioned Reflexes, Oxford University Press. Steinmetz, J. E., Lavond, D. G. and Thompson, R. F. (1989). Classical conditioning in rabbits using pontine nucleus stimulation as a conditioned stimulus and inferior olive stimulation as an unconditioned stimulus. Synapse 3: 225-233. Sutton, R. S. and Barto, A. G. (1990). Time derivate models of Pavlovian Reinforcement Learning and Computational Neuroscience: Foundations of Adaptive Networks., MIT press: chapter 12, 497-537. Thompson, R. F., Thompson, J. K., Kim, J. J. and Shinkman, P. G. (1998). The nature of reinforcement in cerebellar learning. Neurobiology of Learning and Memory 70: 150-176. Verschure, P. F. M. J. and Mintz, M. (2001). A real-time model of the cerebellar circuitry underlying classical conditioning: A combined simulation and robotics study. Neurocomputing 38-40: 1019-1024.
2 0.90362036 180 nips-2004-Synchronization of neural networks by mutual learning and its application to cryptography
Author: Einat Klein, Rachel Mislovaty, Ido Kanter, Andreas Ruttor, Wolfgang Kinzel
Abstract: Two neural networks that are trained on their mutual output synchronize to an identical time dependant weight vector. This novel phenomenon can be used for creation of a secure cryptographic secret-key using a public channel. Several models for this cryptographic system have been suggested, and have been tested for their security under different sophisticated attack strategies. The most promising models are networks that involve chaos synchronization. The synchronization process of mutual learning is described analytically using statistical physics methods.
same-paper 3 0.86256021 70 nips-2004-Following Curved Regularized Optimization Solution Paths
Author: Saharon Rosset
Abstract: Regularization plays a central role in the analysis of modern data, where non-regularized fitting is likely to lead to over-fitted models, useless for both prediction and interpretation. We consider the design of incremental algorithms which follow paths of regularized solutions, as the regularization varies. These approaches often result in methods which are both efficient and highly flexible. We suggest a general path-following algorithm based on second-order approximations, prove that under mild conditions it remains “very close” to the path of optimal solutions and illustrate it with examples.
4 0.78376222 163 nips-2004-Semi-parametric Exponential Family PCA
Author: Sajama Sajama, Alon Orlitsky
Abstract: We present a semi-parametric latent variable model based technique for density modelling, dimensionality reduction and visualization. Unlike previous methods, we estimate the latent distribution non-parametrically which enables us to model data generated by an underlying low dimensional, multimodal distribution. In addition, we allow the components of latent variable models to be drawn from the exponential family which makes the method suitable for special data types, for example binary or count data. Simulations on real valued, binary and count data show favorable comparison to other related schemes both in terms of separating different populations and generalization to unseen samples. 1
5 0.77466613 42 nips-2004-Computing regularization paths for learning multiple kernels
Author: Francis R. Bach, Romain Thibaux, Michael I. Jordan
Abstract: The problem of learning a sparse conic combination of kernel functions or kernel matrices for classification or regression can be achieved via the regularization by a block 1-norm [1]. In this paper, we present an algorithm that computes the entire regularization path for these problems. The path is obtained by using numerical continuation techniques, and involves a running time complexity that is a constant times the complexity of solving the problem for one value of the regularization parameter. Working in the setting of kernel linear regression and kernel logistic regression, we show empirically that the effect of the block 1-norm regularization differs notably from the (non-block) 1-norm regularization commonly used for variable selection, and that the regularization path is of particular value in the block case. 1
6 0.77116209 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons
7 0.76559925 131 nips-2004-Non-Local Manifold Tangent Learning
8 0.76247501 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
9 0.75992376 198 nips-2004-Unsupervised Variational Bayesian Learning of Nonlinear Models
10 0.75517482 124 nips-2004-Multiple Alignment of Continuous Time Series
11 0.75428301 113 nips-2004-Maximum-Margin Matrix Factorization
12 0.75239837 28 nips-2004-Bayesian inference in spiking neurons
13 0.75043041 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition
14 0.74911058 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
15 0.7486105 71 nips-2004-Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices
16 0.74851298 102 nips-2004-Learning first-order Markov models for control
17 0.74583906 5 nips-2004-A Harmonic Excitation State-Space Approach to Blind Separation of Speech
18 0.74566334 36 nips-2004-Class-size Independent Generalization Analsysis of Some Discriminative Multi-Category Classification
19 0.74560928 22 nips-2004-An Investigation of Practical Approximate Nearest Neighbor Algorithms
20 0.74485636 178 nips-2004-Support Vector Classification with Input Data Uncertainty