nips nips2004 nips2004-42 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Francis R. Bach, Romain Thibaux, Michael I. Jordan
Abstract: The problem of learning a sparse conic combination of kernel functions or kernel matrices for classification or regression can be achieved via the regularization by a block 1-norm [1]. In this paper, we present an algorithm that computes the entire regularization path for these problems. The path is obtained by using numerical continuation techniques, and involves a running time complexity that is a constant times the complexity of solving the problem for one value of the regularization parameter. Working in the setting of kernel linear regression and kernel logistic regression, we show empirically that the effect of the block 1-norm regularization differs notably from the (non-block) 1-norm regularization commonly used for variable selection, and that the regularization path is of particular value in the block case. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Computing regularization paths for learning multiple kernels Francis R. [sent-1, score-0.603]
2 Abstract. The problem of learning a sparse conic combination of kernel functions or kernel matrices for classification or regression can be achieved via the regularization by a block 1-norm [1]. [sent-7, score-1.253]
3 In this paper, we present an algorithm that computes the entire regularization path for these problems. [sent-8, score-0.778]
4 The path is obtained by using numerical continuation techniques, and involves a running time complexity that is a constant times the complexity of solving the problem for one value of the regularization parameter. [sent-9, score-1.042]
5 Given a learning problem, two major tasks faced by practitioners are to find an appropriate kernel and to understand how regularization affects the solution and its performance. [sent-12, score-0.525]
6 This paper addresses both of these issues within the supervised learning setting by combining three themes from recent statistical machine learning research, namely multiple kernel learning [2, 3, 1], computation of regularization paths [4, 5], and the use of path following methods [6]. [sent-13, score-0.944]
7 The problem of learning the kernel from data has recently received substantial attention, and several formulations have been proposed that involve optimization over the conic structure of the space of kernels [2, 1, 3]. [sent-14, score-0.69]
8 In this paper we follow the specific formulation of [1], who showed that learning a conic combination of basis kernels is equivalent to regularizing the original supervised learning problem by a weighted block 1-norm (see Section 2. [sent-15, score-0.745]
9 Thus, by solving a single convex optimization problem, the coefficients of the conic combination of kernels and the values of the parameters (the dual variables) are obtained. [sent-17, score-0.779]
10 Given the basis kernels and their coefficients, there is one free parameter remaining—the regularization parameter. [sent-18, score-0.572]
11 Kernel methods are nonparametric methods, and thus regularization plays a crucial role in their behavior. [sent-19, score-0.384]
12 In order to understand a nonparametric method, in particular complex nonparametric methods such as those considered in this paper, it is useful to be able to consider the entire path of regularization, that is, the set of solutions for all values of the regularization parameter [7, 4]. [sent-20, score-0.829]
13 Moreover, if it is relatively cheap computationally to compute this path, then it may be of practical value to compute the path as standard practice in fitting a model. [sent-21, score-0.387]
14 This would seem particularly advisable in cases in which performance can display local minima along the regularization path. [sent-22, score-0.371]
15 For least-squares regression with a 1-norm penalty or for the support vector machine, there exist efficient computational techniques to explore the regularization path [4, 5]. [sent-24, score-0.891]
16 These techniques exploit the fact that for these problems the path is piecewise linear. [sent-25, score-0.54]
17 In this paper we consider the extension of these techniques to the multiple kernel learning problem. [sent-26, score-0.211]
18 As we will show (in Section 3), in this setting the path is no longer piecewise linear. [sent-27, score-0.494]
19 It is, however, piecewise smooth, and we are able to follow it by using numerical continuation techniques [8, 6]. [sent-28, score-0.38]
20 To do this in a computationally efficient way, we invoke logarithmic barrier techniques analogous to those used in interior point methods for convex optimization (see Section 3. [sent-29, score-0.229]
21 The empirical complexity of our algorithm is thus a constant times the complexity of solving the problem using interior point methods for one value of the regularization parameter (see Section 3. [sent-35, score-0.639]
22 In Section 4, we present simulation experiments for classification and regression problems, using a large set of basis kernels based on the most widely used kernels (linear, polynomial, Gaussian). [sent-37, score-0.545]
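To make the experimental setup concrete, here is a minimal sketch of how such a bank of basis kernel matrices (linear, polynomial, Gaussian with several widths) can be assembled and normalized to unit trace, as assumed later in Section 4; the particular degrees and widths below are illustrative placeholders, not the grid actually used in the paper.

```python
import numpy as np

def kernel_bank(X, poly_degrees=(2, 3), gauss_widths=(0.5, 1.0, 2.0)):
    """Build a list of unit-trace basis kernel matrices from data X (n x d).

    The degrees and widths here are illustrative; the paper uses a larger
    grid (110 kernels for Boston, 54 for Liver).
    """
    kernels = []
    # Linear kernel
    kernels.append(X @ X.T)
    # Polynomial kernels (1 + x^T y)^degree
    for deg in poly_degrees:
        kernels.append((1.0 + X @ X.T) ** deg)
    # Gaussian kernels exp(-||x - y||^2 / (2 sigma^2))
    sq_dists = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    for sigma in gauss_widths:
        kernels.append(np.exp(-sq_dists / (2.0 * sigma**2)))
    # Normalize each kernel matrix to unit trace, as assumed in Section 4.
    return [K / np.trace(K) for K in kernels]

# Example: 5 points in R^3 give 1 + 2 + 3 = 6 basis kernels.
X = np.random.randn(5, 3)
Ks = kernel_bank(X)
print(len(Ks), [round(np.trace(K), 3) for K in Ks])
```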
23 In particular, we show empirically that the number of kernels in the conic combination is not a monotonic function of the amount of regularization. [sent-38, score-0.616]
24 Thus the need to compute full regularization paths is particularly acute in our more complex (block 1-norm regularization) case. [sent-42, score-0.392]
25 2 Block 1-norm regularization. In this section we review the block 1-norm regularization framework of [1] as it applies to differentiable loss functions. [sent-43, score-0.855]
26 2.1 Classical 2-norm regularization. Primal formulation. We consider the general regularized learning optimization problem [7], where the data xi, i = 1, . [sent-46, score-0.462]
27 The kernel associated with this feature map is denoted k(x, y) = Φ(x)⊤Φ(y). [sent-54, score-0.189]
28 The optimization problem is the following¹: min_{w∈R^p} Σ_{i=1}^n ℓ(y_i, w⊤Φ(x_i)) + (λ/2)||w||², (1) where λ > 0 is a regularization parameter and ||w|| is the 2-norm of w, defined as ||w|| = (w⊤w)^{1/2}. [sent-55, score-0.413]
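As a concrete instance of problem (1), the sketch below evaluates the regularized objective for the logistic loss ℓ(y, t) = log(1 + exp(−yt)) with labels in {−1, +1}; the explicit feature matrix Phi and the label convention are assumptions made for illustration only.

```python
import numpy as np

def logistic_loss(y, t):
    # log(1 + exp(-y * t)), computed stably via logaddexp
    return np.logaddexp(0.0, -y * t)

def primal_objective(w, Phi, y, lam):
    """Eq. (1): sum_i loss(y_i, w^T Phi(x_i)) + (lam / 2) ||w||^2."""
    preds = Phi @ w                      # n predictions w^T Phi(x_i)
    return logistic_loss(y, preds).sum() + 0.5 * lam * w @ w

# Toy example with an explicit feature map Phi (n x p).
rng = np.random.default_rng(0)
Phi = rng.standard_normal((20, 4))
y = np.sign(rng.standard_normal(20))
print(primal_objective(np.zeros(4), Phi, y, lam=0.1))  # = 20 * log(2) at w = 0
```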
29 In this paper, we focus on loss functions that are strictly convex and twice continuously differentiable in their second argument. [sent-57, score-0.217]
30 is strictly convex and differentiable, the maximum defining ψi(v) is attained at a unique point equal to ψ′i(v) (possibly equal to +∞ or −∞). [sent-60, score-0.196]
31 The function ψi (v) is then strictly convex and twice differentiable in its domain. [sent-61, score-0.217]
32 Dual formulation and optimality conditions. The Lagrangian for problem (1) is L(w, u, α) = Σ_i φ_i(u_i) + (λ/2)||w||² − λ Σ_i α_i(u_i − w⊤Φ(x_i)) and is minimized with respect to u and w with w = −Σ_i α_i Φ(x_i). [sent-63, score-0.18]
33 The dual problem is then max_{α∈R^n} −Σ_i ψ_i(λα_i) − (λ/2) α⊤Kα, (2) where K ∈ R^{n×n} is the kernel matrix of the points, i. [sent-64, score-0.324]
34 The optimality condition for the dual variable α is then: ∀i, (Kα)_i + ψ′_i(λα_i) = 0. (3) [sent-67, score-0.211]
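The optimality condition (3) can be solved numerically by Newton's method once ψ′ and ψ″ are available. The sketch below does this for a generic conjugate; for the check it assumes the squared-loss convention ψ′_i(v) = v + y_i (consistent, up to constants, with the minimizer β_i = −y_i quoted in Section 2.3), which recovers the familiar closed form α = −(K + λI)⁻¹y.

```python
import numpy as np

def solve_dual_optimality(K, lam, psi_prime, psi_second, alpha0, tol=1e-10):
    """Newton's method on F(alpha) = K alpha + psi'(lam * alpha) = 0 (Eq. 3)."""
    alpha = alpha0.copy()
    for _ in range(100):
        F = K @ alpha + psi_prime(lam * alpha)
        if np.linalg.norm(F) < tol:
            break
        # Jacobian of F: K + lam * diag(psi''(lam * alpha))
        J = K + lam * np.diag(psi_second(lam * alpha))
        alpha -= np.linalg.solve(J, F)
    return alpha

# Squared-loss convention (assumed up to constants): psi_i'(v) = v + y_i.
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3))
K = X @ X.T
y = rng.standard_normal(10)
lam = 0.5
alpha = solve_dual_optimality(
    K, lam,
    psi_prime=lambda v: v + y,
    psi_second=lambda v: np.ones_like(v),
    alpha0=np.zeros(10),
)
# For this quadratic case Newton converges in one step to the closed form.
print(np.allclose(alpha, -np.linalg.solve(K + lam * np.eye(10), y)))
```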
35 2.2 Block 1-norm regularization. In this paper, we map the input space X to m different feature spaces F1, . [sent-68, score-0.387]
36 We will see in Section 4 how these should be linked to the rank of the kernel matrices. [sent-90, score-0.234]
37 Following [1], we consider the following problem with weighted block 1-norm regularization² (where ||w_j|| = (w_j⊤w_j)^{1/2} still denotes the 2-norm of w_j): min_{w∈F_1×···×F_m} Σ_i φ_i(w⊤Φ(x_i)) + λ Σ_j d_j ||w_j||. (4) [sent-91, score-0.419]
38 In order to derive optimality conditions, we can reformulate it with conic constraints and derive the following dual problem (we omit details for brevity) [9, 1]: max_α −Σ_i ψ_i(λα_i) such that ∀j, α⊤K_jα ≤ d_j², (5) where Kj is the kernel matrix associated with kernel kj, i. [sent-93, score-1.319]
39 (5) can be obtained by using only the kernel matrices Kj (i. [sent-100, score-0.206]
40 , this is indeed a kernel machine) and that the optimal solution of the block 1-norm … Footnote 2: In [1], the square of the block 1-norm was used. [sent-102, score-0.39]
41 However, when the entire regularization path is sought, it is easy to show that the two problems are equivalent. [sent-103, score-0.778]
42 The advantage of the current formulation is that when the blocks are of size one the problem reduces to classical 1-norm regularization [4]. [sent-104, score-0.477]
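A minimal sketch of the weighted block 1-norm penalty in Eq. (4), with w stored as a list of per-block vectors w_j; when every block has size one it collapses to a weighted 1-norm, which is the reduction to classical 1-norm regularization mentioned above.

```python
import numpy as np

def block_l1_penalty(w_blocks, d):
    """Regularizer of Eq. (4) without lambda: sum_j d_j ||w_j||_2."""
    return sum(dj * np.linalg.norm(wj) for wj, dj in zip(w_blocks, d))

# Blocks of size one: the block 1-norm is just a weighted 1-norm.
w_scalar_blocks = [np.array([1.5]), np.array([-2.0]), np.array([0.0])]
print(block_l1_penalty(w_scalar_blocks, [1.0, 1.0, 1.0]))   # 3.5 = |1.5| + |-2| + |0|

# A genuine block of size > 1 is penalized through its Euclidean norm.
w_blocks = [np.array([3.0, 4.0]), np.array([-1.0])]
print(block_l1_penalty(w_blocks, [1.0, 2.0]))                # 5.0 + 2.0 = 7.0
```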
43 [Figure 1; panel labels: path, target β/λ, predictor step, corrector steps, (α0, σ0), (α1, σ1).] Figure 1: (Left) Geometric interpretation of the dual problem in Eq. [sent-105, score-0.635]
44 Thus, with this formulation, we learn the coefficients of the conic combination of kernels as well as the dual variables α [1]. [sent-112, score-0.692]
45 As shown in [1], the conic combination is sparse, i. [sent-113, score-0.299]
46 2.3 Geometric interpretation of the dual problem. Each function ψi is strictly convex, with a strict minimum at βi defined by ψ′i(βi) = 0 (for least-squares regression we have βi = −yi, and for logistic regression we have βi = −yi/2). [sent-117, score-0.54]
47 The negated dual objective Σ_i ψ_i(λα_i) is thus a metric between α and β/λ (for least-squares regression, this is simply the squared distance, while for logistic regression, this is an entropy distance). [sent-118, score-0.241]
48 When computing the regularization path from λ = +∞ to λ = 0, the target goes from 0 to ∞ in the direction β (see Figure 1). [sent-120, score-0.803]
49 We thus initialize the path following technique with λ = max_j (β⊤K_jβ / d_j²)^{1/2} and α = β/λ. [sent-122, score-0.411]
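This initialization can be computed directly from the kernel matrices. The sketch below does so, taking β = −y as for least-squares regression (β = −y/2 for logistic regression); the toy kernels and data are placeholders.

```python
import numpy as np

def initialize_path(Ks, d, beta):
    """Start of the path: lambda_max = max_j (beta^T K_j beta / d_j^2)^(1/2),
    alpha = beta / lambda_max, so that every constraint alpha^T K_j alpha <= d_j^2
    holds, with at least one of them active."""
    lam_max = max(np.sqrt(beta @ K @ beta) / dj for K, dj in zip(Ks, d))
    return lam_max, beta / lam_max

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 2))
Ks = [X @ X.T, np.exp(-np.sum((X[:, None] - X[None]) ** 2, -1))]
d = [1.0, 1.0]
y = rng.standard_normal(8)
lam_max, alpha = initialize_path(Ks, d, beta=-y)   # least-squares: beta = -y
# The largest normalized constraint value equals 1 at initialization.
print(lam_max, max(alpha @ K @ alpha / dj**2 for K, dj in zip(Ks, d)))
```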
50 3 Building the regularization path. In this section, the goal is to vary λ from +∞ (no regularization) to 0 (full regularization) and obtain a representation of the path of solutions (α(λ), η(λ)). [sent-123, score-1.109]
51 We will essentially approximate the path by a piecewise linear function of σ = log(λ). [sent-124, score-0.494]
52 3.1 Active set method. For the dual formulation Eq. [sent-126, score-0.182]
53 (6), if the set of active kernels J (α) is known, i. [sent-128, score-0.249]
54 , the set of kernels that are such that α⊤K_jα = d_j², then the optimality conditions become ∀j ∈ J, α⊤K_jα = d_j² and ∀i, (Σ_{j∈J} η_j K_jα)_i + ψ′_i(λα_i) = 0, (7) and they are valid as long as ∀j ∉ J, α⊤K_jα ≤ d_j² and ∀j ∈ J, η_j ≥ 0. [sent-130, score-0.318]
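A sketch of how the validity conditions around Eq. (7) can be monitored for a candidate active set J: a constraint violated by some j ∉ J, or a negative weight η_j for some j ∈ J, signals a point where the active set must be updated. The function name and tolerance are illustrative choices.

```python
import numpy as np

def active_set_valid(alpha, eta, Ks, d, J, tol=1e-9):
    """Return (is_valid, joining, leaving) for the current active set J.

    joining: indices j not in J whose constraint alpha^T K_j alpha <= d_j^2
             is violated and should enter J.
    leaving: indices j in J whose multiplier eta_j has become negative and
             should leave J.
    """
    joining = [j for j in range(len(Ks))
               if j not in J and alpha @ Ks[j] @ alpha > d[j] ** 2 + tol]
    leaving = [j for j in J if eta[j] < -tol]
    return (not joining and not leaving), joining, leaving

# Toy check with two rank-one kernels.
K1 = np.outer([1.0, 0.0], [1.0, 0.0])
K2 = np.outer([0.0, 1.0], [0.0, 1.0])
alpha = np.array([1.0, 2.0])
eta = {0: 0.3}                      # only kernel 0 currently active
ok, joining, leaving = active_set_valid(alpha, eta, [K1, K2], d=[1.0, 1.0], J={0})
print(ok, joining, leaving)          # kernel 1 violates its constraint: it joins J
```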
55 The path is thus piecewise smooth, with “kinks” at each point where the active set J changes. [sent-131, score-0.556]
56 On each of the smooth sections, only those kernels with index belonging to J are used to define α and η, through Eq. [sent-132, score-0.294]
57 When all blocks have size one, or equivalently when all kernel matrices have rank one, then the path is provably linear in 1/λ between each kink [4] and is thus easy to follow. [sent-134, score-0.745]
58 However, when the kernel matrices have higher rank, this is not the case and additional numerical techniques are needed, which we now present. [sent-135, score-0.317]
59 3.2 Following a smooth path using numerical continuation techniques. In this section, we provide a brief review of path following, focusing on predictor-corrector methods [8]. [sent-139, score-1.07]
60 In order to follow the curve α(σ), the most effective numerical method is the predictor-corrector method, which works as follows (see Figure 1): • predictor step: from (α0, σ0) predict where α(σ0 + h) should be using the first-order expansion, i. [sent-143, score-0.248]
61 • corrector steps : (α1 , σ1 ) might not satisfy J(α1 , σ1 ) = 0, i. [sent-146, score-0.22]
62 Predictor-corrector methods approximate the path by a sequence of points on that path, which can be joined to provide a piecewise linear approximation. [sent-152, score-0.494]
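A generic predictor-corrector sketch for a curve defined by J(α, σ) = 0, as in Figure 1: the predictor moves along the tangent dα/dσ = −(∂J/∂α)⁻¹ ∂J/∂σ given by the implicit function theorem, and the corrector runs Newton steps in α at fixed σ. The concrete J used in the example (a ridge-type system with a known solution) is only a stand-in for the actual path equations.

```python
import numpy as np

def predictor_corrector_step(alpha, sigma, h, J, J_alpha, J_sigma, tol=1e-10):
    """One step of the predictor-corrector method of Figure 1."""
    # Predictor: first-order expansion along the tangent of the curve J = 0.
    tangent = -np.linalg.solve(J_alpha(alpha, sigma), J_sigma(alpha, sigma))
    alpha, sigma = alpha + h * tangent, sigma + h
    # Corrector: Newton steps in alpha at fixed sigma until J(alpha, sigma) = 0.
    for _ in range(50):
        r = J(alpha, sigma)
        if np.linalg.norm(r) < tol:
            break
        alpha -= np.linalg.solve(J_alpha(alpha, sigma), r)
    return alpha, sigma

# Stand-in curve: (K + e^sigma I) alpha + y = 0, whose solution is known exactly.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5)); K = A @ A.T
y = rng.standard_normal(5)
J = lambda a, s: (K + np.exp(s) * np.eye(5)) @ a + y
J_alpha = lambda a, s: K + np.exp(s) * np.eye(5)
J_sigma = lambda a, s: np.exp(s) * a
alpha, sigma = -np.linalg.solve(K + np.eye(5), y), 0.0   # start on the curve
alpha, sigma = predictor_corrector_step(alpha, sigma, 0.5, J, J_alpha, J_sigma)
print(np.allclose(alpha, -np.linalg.solve(K + np.exp(sigma) * np.eye(5), y)))
```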
63 At first glance, in order to follow the piecewise smooth path all that is needed is to follow each piece and detect when the active set changes, i. [sent-153, score-0.715]
64 We instead prefer to use a numerical regularization technique that will (a) make the entire path smooth, (b) make sure that the Newton steps are globally convergent, and (c) still enable us to use only a subset of the kernels to define the path locally. [sent-156, score-1.53]
65 3.3 Numerical regularization. We borrow a classical regularization method from interior point methods, in which a constrained problem is made unconstrained by using a convex log-barrier [9]. [sent-158, score-0.884]
66 [Figure 2 plots η against −log(λ).] Figure 2: Examples of variation of η along the regularization path for linear regression (left) and logistic regression (right). [sent-163, score-1.051]
67 Any dual-feasible variables η and α (not necessarily linked through a functional relationship) define primal-dual variables, and the quantity η_j(d_j² − α⊤K_jα) is exactly the duality gap [9], i. [sent-164, score-0.202]
68 We can then use the techniques of Section 3.2 to follow the path for a fixed µ, for the variables α only, since η is now a function of α. [sent-170, score-0.497]
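The sketch below illustrates one common log-barrier convention, stated here as an assumption rather than the paper's exact scaling: adding µ Σ_j log(d_j² − α⊤K_jα) to the dual objective makes each weight a function of α, roughly η_j(α) = µ/(d_j² − α⊤K_jα), so that each product η_j(d_j² − α⊤K_jα) is constant and the total duality gap shrinks linearly with µ.

```python
import numpy as np

def barrier_eta(alpha, Ks, d, mu):
    """Barrier-induced kernel weights: eta_j = mu / (d_j^2 - alpha^T K_j alpha).

    One standard interior-point convention; the exact scaling of mu used in
    the paper may differ by a constant factor.
    """
    slacks = np.array([dj ** 2 - alpha @ K @ alpha for K, dj in zip(Ks, d)])
    if np.any(slacks <= 0):
        raise ValueError("alpha must be strictly feasible for the barrier")
    return mu / slacks

def duality_gap(alpha, eta, Ks, d):
    """sum_j eta_j (d_j^2 - alpha^T K_j alpha), as discussed above."""
    return sum(e * (dj ** 2 - alpha @ K @ alpha) for e, K, dj in zip(eta, Ks, d))

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2))
Ks = [X @ X.T / np.trace(X @ X.T), np.eye(6) / 6.0]
d = [1.0, 1.0]
alpha = 0.1 * rng.standard_normal(6)           # strictly feasible for small alpha
for mu in (1.0, 0.1, 0.01):
    eta = barrier_eta(alpha, Ks, d, mu)
    print(mu, duality_gap(alpha, eta, Ks, d))   # gap = mu * number of kernels
```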
69 The corrector steps are not only Newton steps for solving a system of nonlinear equations; they are also Newton-Raphson steps to minimize a strictly convex function, and are thus globally convergent [9]. [sent-171, score-0.526]
70 3.4 Path following algorithm. Our path following algorithm is simply a succession of predictor-corrector steps, described in Section 3. [sent-173, score-0.387]
71 In Figure 2, we show simple examples of the values of the kernel weights η along the path for a toy problem with a small number of kernels, for kernel linear regression and kernel logistic regression. [sent-178, score-1.145]
72 It is worth noting that the weights are not even approximately monotonic functions of λ; also the behavior of those weights as λ approaches zero (or σ grows unbounded) is very specific: they become constant for linear regression and they grow to infinity for logistic regression. [sent-179, score-0.428]
73 In Section 4, we show (a) why these behaviors occur and (b) what the consequences are regarding the performance of the multiple kernel learning problem. [sent-180, score-0.165]
74 Step size selection. A major issue in path following methods is the choice of the step h: if h is too big, the predictor will end up very far from the path and many Newton steps have to be performed, while if h is too small, progress is too slow. [sent-182, score-0.951]
75 We chose a simple adaptive scheme where at each predictor step we select the biggest h so that the predictor step stays in the domain |J(α, σ)| ≤ ε. [sent-183, score-0.176]
76 The precision parameter ε is itself adapted at each iteration: if the number of corrector steps at the previous iteration is greater than 8, then ε is decreased, whereas if this number is less than 4, it is increased. [sent-184, score-0.246]
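A sketch of this adaptive scheme: the predictor step size h is found here by simple backtracking (an implementation choice) so that the predicted point stays in the region |J(α, σ)| ≤ ε, and ε itself is adjusted from the number of corrector steps, with the thresholds 8 and 4 taken from the text and the halving/doubling factor assumed.

```python
import numpy as np

def choose_step(alpha, sigma, tangent, J, eps, h_max=1.0, shrink=0.5):
    """Largest h (found by backtracking) such that the predicted point stays
    in the region ||J(alpha, sigma)|| <= eps."""
    h = h_max
    while h > 1e-8:
        if np.linalg.norm(J(alpha + h * tangent, sigma + h)) <= eps:
            return h
        h *= shrink
    return h

def adapt_precision(eps, n_corrector_steps):
    """Tighten eps after hard steps (> 8 Newton iterations), relax it after
    easy ones (< 4); the factor 2 is an assumed choice."""
    if n_corrector_steps > 8:
        return eps / 2.0
    if n_corrector_steps < 4:
        return eps * 2.0
    return eps

# Usage on the same stand-in curve as the predictor-corrector sketch above.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5)); K = A @ A.T
y = rng.standard_normal(5)
J = lambda a, s: (K + np.exp(s) * np.eye(5)) @ a + y
alpha = -np.linalg.solve(K + np.eye(5), y)          # point on the curve at sigma = 0
tangent = -np.linalg.solve(K + np.eye(5), alpha)    # -(dJ/dalpha)^{-1} dJ/dsigma
print(choose_step(alpha, 0.0, tangent, J, eps=0.1))
```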
77 Running time complexity. Between each kink, the path is smooth, thus there is a bounded number of steps [8, 9]. [sent-185, score-0.551]
78 We have observed empirically that the overall number of those steps is O(m), thus the total empirical complexity is O(mn³ + m²n²). [sent-187, score-0.216]
79 The difference is that every point along our path is meaningful, not just the destination. [sent-191, score-0.387]
80 [Figure 3 panels are plotted against −log(λ).] Figure 3: Varying the weights (dj): (left) classification on the Liver dataset, (right) regression on the Boston dataset; for each dataset, two different values of γ, (left) γ = 0 and (right) γ = 1. [sent-212, score-0.19]
81 (Top) training set accuracy in bold, testing set accuracy in dashed, (bottom) number of kernels in the conic combination. [sent-213, score-0.506]
82 We thus would have to use all kernels when computing the various derivatives. [sent-215, score-0.235]
83 We circumvent this by truncating those ηj that are close to their lower bound to zero: we thus only use the kernels that are numerically present in the combination. [sent-216, score-0.275]
84 Second-order predictor step. The implicit function theorem also allows us to compute higher-order derivatives of the path. [sent-217, score-0.508]
85 This makes 110 kernels for the Boston dataset and 54 for the Liver dataset. [sent-220, score-0.246]
86 Intuitively, the regularization weight dj for kernel Kj should be an increasing function of the rank of Kj , i. [sent-222, score-0.67]
87 In order to explore the effect of dj on performance, we set dj as follows: we compute the number pj of eigenvalues of Kj that are greater than 1/(2n) (remember that because of the unit trace constraint, these n eigenvalues sum to 1), and we take d_j = p_j^γ. [sent-225, score-0.384]
88 If γ = 0, then all dj's are equal to one, and when γ increases, kernel matrices of high rank such as the identity matrix have relatively higher weights, noting that a higher weight implies a heavier regularization. [sent-226, score-0.408]
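The rule just described can be written down directly; the sketch below assumes each K_j has already been rescaled to unit trace, as stated above.

```python
import numpy as np

def kernel_weights(Ks, gamma):
    """d_j = p_j^gamma, where p_j is the number of eigenvalues of the
    unit-trace matrix K_j exceeding 1 / (2n)."""
    n = Ks[0].shape[0]
    d = []
    for K in Ks:
        eigvals = np.linalg.eigvalsh(K)          # eigenvalues sum to 1 (unit trace)
        p = int(np.sum(eigvals > 1.0 / (2 * n)))
        d.append(p ** gamma)
    return d

n = 6
rng = np.random.default_rng(0)
x = rng.standard_normal((n, 1))
K_lin = x @ x.T / np.trace(x @ x.T)              # rank one: p = 1
K_id = np.eye(n) / n                             # full rank: p = n
print(kernel_weights([K_lin, K_id], gamma=0.0))  # gamma = 0: all weights equal 1
print(kernel_weights([K_lin, K_id], gamma=1.0))  # identity gets the larger weight
```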
89 In Figure 3, for the Boston and Liver datasets, we plot the number of kernels in the conic combination as well as the training and testing errors, for γ = 0 and γ = 1. [sent-227, score-0.617]
90 We can make the following simple observations: Number of kernels. The number of kernels present in the sparse conic combination is a non-monotonic function of the regularization parameter. [sent-228, score-1.133]
91 When the blocks are one-dimensional, a situation equivalent to variable selection with a 1-norm penalty, this number is usually a nearly monotonic function of the regularization parameter [4]. [sent-229, score-0.474]
92 Local minima. Validation set performance may exhibit local minima, and thus algorithms based on hill-climbing might perform poorly by being trapped in a local minimum, whereas our approach, which computes the entire path, avoids this. [sent-230, score-0.503]
93 Behavior for small λ. For all values of γ, as λ goes to zero, the number of kernels remains the same, the training error goes to zero, and the testing error remains constant. [sent-231, score-0.406]
94 It is thus crucial to use weights dj that grow with the “size” of the kernel, rather than constant weights. [sent-233, score-0.211]
95 5 Conclusion We have presented an algorithm to compute entire regularization paths for the problem of multiple kernel learning. [sent-236, score-0.638]
96 Empirical results using this algorithm have provided us with insight into the effect of regularization for such problems. [sent-237, score-0.335]
97 In particular we showed that the behavior of the block 1-norm regularization differs notably from traditional (non-block) 1-norm regularization. [sent-238, score-0.495]
98 Indeed, the main computational burden (for both predictor and corrector steps) is the inversion of a Hessian. [sent-240, score-0.219]
99 Multiple kernel learning, conic duality, and the SMO algorithm. [sent-251, score-0.427]
100 The entire regularization path for the support vector machine. [sent-287, score-0.778]
wordName wordTfidf (topN-words)
[('kj', 0.466), ('path', 0.387), ('regularization', 0.335), ('conic', 0.262), ('kernels', 0.211), ('kernel', 0.165), ('vyi', 0.151), ('dual', 0.134), ('corrector', 0.131), ('dj', 0.128), ('regression', 0.123), ('piecewise', 0.107), ('block', 0.1), ('continuation', 0.1), ('steps', 0.089), ('predictor', 0.088), ('wj', 0.083), ('logistic', 0.083), ('goes', 0.081), ('convex', 0.08), ('optimality', 0.077), ('newton', 0.077), ('monotonic', 0.077), ('interior', 0.076), ('kinks', 0.075), ('liver', 0.074), ('boston', 0.067), ('numerical', 0.065), ('follow', 0.062), ('differentiable', 0.059), ('smooth', 0.059), ('paths', 0.057), ('entire', 0.056), ('log', 0.054), ('strictly', 0.052), ('complexity', 0.051), ('kink', 0.05), ('duality', 0.05), ('variables', 0.048), ('formulation', 0.048), ('bach', 0.048), ('techniques', 0.046), ('yi', 0.045), ('rank', 0.042), ('matrices', 0.041), ('numerically', 0.04), ('xb', 0.04), ('minw', 0.04), ('hastie', 0.039), ('active', 0.038), ('combination', 0.037), ('francis', 0.037), ('blocks', 0.036), ('minima', 0.036), ('dataset', 0.035), ('lanckriet', 0.035), ('fm', 0.035), ('classical', 0.033), ('convergent', 0.033), ('xa', 0.033), ('tibshirani', 0.033), ('notably', 0.033), ('implicit', 0.033), ('curve', 0.033), ('testing', 0.033), ('widths', 0.032), ('weights', 0.032), ('equal', 0.032), ('ui', 0.031), ('rd', 0.031), ('quadratically', 0.031), ('conditions', 0.03), ('empirically', 0.029), ('gap', 0.029), ('rn', 0.028), ('spaces', 0.028), ('primal', 0.028), ('solving', 0.028), ('xi', 0.027), ('behavior', 0.027), ('linked', 0.027), ('grow', 0.027), ('cients', 0.027), ('optimization', 0.027), ('zero', 0.027), ('twice', 0.026), ('review', 0.026), ('parameter', 0.026), ('geometric', 0.025), ('omit', 0.025), ('nonparametric', 0.025), ('solution', 0.025), ('problem', 0.025), ('simulations', 0.024), ('thus', 0.024), ('feature', 0.024), ('index', 0.024), ('coef', 0.024), ('leaves', 0.024), ('empirical', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 42 nips-2004-Computing regularization paths for learning multiple kernels
Author: Francis R. Bach, Romain Thibaux, Michael I. Jordan
Abstract: The problem of learning a sparse conic combination of kernel functions or kernel matrices for classification or regression can be achieved via the regularization by a block 1-norm [1]. In this paper, we present an algorithm that computes the entire regularization path for these problems. The path is obtained by using numerical continuation techniques, and involves a running time complexity that is a constant times the complexity of solving the problem for one value of the regularization parameter. Working in the setting of kernel linear regression and kernel logistic regression, we show empirically that the effect of the block 1-norm regularization differs notably from the (non-block) 1-norm regularization commonly used for variable selection, and that the regularization path is of particular value in the block case. 1
2 0.33860159 70 nips-2004-Following Curved Regularized Optimization Solution Paths
Author: Saharon Rosset
Abstract: Regularization plays a central role in the analysis of modern data, where non-regularized fitting is likely to lead to over-fitted models, useless for both prediction and interpretation. We consider the design of incremental algorithms which follow paths of regularized solutions, as the regularization varies. These approaches often result in methods which are both efficient and highly flexible. We suggest a general path-following algorithm based on second-order approximations, prove that under mild conditions it remains “very close” to the path of optimal solutions and illustrate it with examples.
3 0.27058789 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
Author: Saharon Rosset, Robert Tibshirani, Ji Zhu, Trevor J. Hastie
Abstract: In this paper we argue that the choice of the SVM cost parameter can be critical. We then derive an algorithm that can fit the entire path of SVM solutions for every value of the cost parameter, with essentially the same computational cost as fitting one SVM model. 1
4 0.1923573 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning
Author: Xiaojin Zhu, Jaz Kandola, Zoubin Ghahramani, John D. Lafferty
Abstract: We present an algorithm based on convex optimization for constructing kernels for semi-supervised learning. The kernel matrices are derived from the spectral decomposition of graph Laplacians, and combine labeled and unlabeled data in a systematic fashion. Unlike previous work using diffusion kernels and Gaussian random field kernels, a nonparametric kernel approach is presented that incorporates order constraints during optimization. This results in flexible kernels and avoids the need to choose among different parametric forms. Our approach relies on a quadratically constrained quadratic program (QCQP), and is computationally feasible for large datasets. We evaluate the kernels on real datasets using support vector machines, with encouraging results. 1
5 0.13398644 168 nips-2004-Semigroup Kernels on Finite Sets
Author: Marco Cuturi, Jean-philippe Vert
Abstract: Complex objects can often be conveniently represented by finite sets of simpler components, such as images by sets of patches or texts by bags of words. We study the class of positive definite (p.d.) kernels for two such objects that can be expressed as a function of the merger of their respective sets of components. We prove a general integral representation of such kernels and present two particular examples. One of them leads to a kernel for sets of points living in a space endowed itself with a positive definite kernel. We provide experimental results on a benchmark experiment of handwritten digits image classification which illustrate the validity of the approach. 1
6 0.12417467 54 nips-2004-Distributed Information Regularization on Graphs
7 0.12061504 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound
8 0.1137949 27 nips-2004-Bayesian Regularization and Nonnegative Deconvolution for Time Delay Estimation
9 0.11106063 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition
10 0.11013333 92 nips-2004-Kernel Methods for Implicit Surface Modeling
11 0.10667056 96 nips-2004-Learning, Regularization and Ill-Posed Inverse Problems
12 0.097061165 94 nips-2004-Kernels for Multi--task Learning
13 0.095536828 30 nips-2004-Binet-Cauchy Kernels
14 0.095527574 178 nips-2004-Support Vector Classification with Input Data Uncertainty
15 0.09303233 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification
16 0.0902844 177 nips-2004-Supervised Graph Inference
17 0.090042159 164 nips-2004-Semi-supervised Learning by Entropy Minimization
18 0.083532207 113 nips-2004-Maximum-Margin Matrix Factorization
19 0.081682891 136 nips-2004-On Semi-Supervised Classification
20 0.079693958 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants
topicId topicWeight
[(0, -0.259), (1, 0.129), (2, -0.053), (3, 0.238), (4, -0.136), (5, -0.073), (6, -0.004), (7, -0.217), (8, -0.056), (9, -0.059), (10, 0.145), (11, -0.033), (12, 0.027), (13, 0.107), (14, -0.024), (15, 0.076), (16, 0.153), (17, -0.17), (18, -0.092), (19, -0.125), (20, 0.202), (21, -0.168), (22, 0.044), (23, -0.072), (24, 0.086), (25, 0.077), (26, -0.071), (27, -0.096), (28, 0.021), (29, 0.058), (30, -0.028), (31, -0.021), (32, -0.025), (33, 0.013), (34, 0.067), (35, -0.023), (36, -0.016), (37, -0.073), (38, -0.04), (39, -0.001), (40, -0.029), (41, -0.094), (42, 0.13), (43, -0.059), (44, -0.022), (45, 0.044), (46, 0.014), (47, 0.012), (48, -0.001), (49, -0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.9716813 42 nips-2004-Computing regularization paths for learning multiple kernels
Author: Francis R. Bach, Romain Thibaux, Michael I. Jordan
Abstract: The problem of learning a sparse conic combination of kernel functions or kernel matrices for classification or regression can be achieved via the regularization by a block 1-norm [1]. In this paper, we present an algorithm that computes the entire regularization path for these problems. The path is obtained by using numerical continuation techniques, and involves a running time complexity that is a constant times the complexity of solving the problem for one value of the regularization parameter. Working in the setting of kernel linear regression and kernel logistic regression, we show empirically that the effect of the block 1-norm regularization differs notably from the (non-block) 1-norm regularization commonly used for variable selection, and that the regularization path is of particular value in the block case. 1
2 0.89160299 70 nips-2004-Following Curved Regularized Optimization Solution Paths
Author: Saharon Rosset
Abstract: Regularization plays a central role in the analysis of modern data, where non-regularized fitting is likely to lead to over-fitted models, useless for both prediction and interpretation. We consider the design of incremental algorithms which follow paths of regularized solutions, as the regularization varies. These approaches often result in methods which are both efficient and highly flexible. We suggest a general path-following algorithm based on second-order approximations, prove that under mild conditions it remains “very close” to the path of optimal solutions and illustrate it with examples.
3 0.68764192 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
Author: Saharon Rosset, Robert Tibshirani, Ji Zhu, Trevor J. Hastie
Abstract: In this paper we argue that the choice of the SVM cost parameter can be critical. We then derive an algorithm that can fit the entire path of SVM solutions for every value of the cost parameter, with essentially the same computational cost as fitting one SVM model. 1
4 0.58715886 96 nips-2004-Learning, Regularization and Ill-Posed Inverse Problems
Author: Lorenzo Rosasco, Andrea Caponnetto, Ernesto D. Vito, Francesca Odone, Umberto D. Giovannini
Abstract: Many works have shown that strong connections relate learning from examples to regularization techniques for ill-posed inverse problems. Nevertheless by now there was no formal evidence neither that learning from examples could be seen as an inverse problem nor that theoretical results in learning theory could be independently derived using tools from regularization theory. In this paper we provide a positive answer to both questions. Indeed, considering the square loss, we translate the learning problem in the language of regularization theory and show that consistency results and optimal regularization parameter choice can be derived by the discretization of the corresponding inverse problem. 1
5 0.56823862 94 nips-2004-Kernels for Multi--task Learning
Author: Charles A. Micchelli, Massimiliano Pontil
Abstract: This paper provides a foundation for multi–task learning using reproducing kernel Hilbert spaces of vector–valued functions. In this setting, the kernel is a matrix–valued function. Some explicit examples will be described which go beyond our earlier results in [7]. In particular, we characterize classes of matrix– valued kernels which are linear and are of the dot product or the translation invariant type. We discuss how these kernels can be used to model relations between the tasks and present linear multi–task learning algorithms. Finally, we present a novel proof of the representer theorem for a minimizer of a regularization functional which is based on the notion of minimal norm interpolation. 1
6 0.51099002 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition
7 0.50860798 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning
8 0.49488118 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform
9 0.43654916 54 nips-2004-Distributed Information Regularization on Graphs
10 0.43474182 177 nips-2004-Supervised Graph Inference
11 0.41959545 27 nips-2004-Bayesian Regularization and Nonnegative Deconvolution for Time Delay Estimation
12 0.41781905 168 nips-2004-Semigroup Kernels on Finite Sets
13 0.4130404 30 nips-2004-Binet-Cauchy Kernels
14 0.3942185 12 nips-2004-A Temporal Kernel-Based Model for Tracking Hand Movements from Neural Activities
15 0.3846809 201 nips-2004-Using the Equivalent Kernel to Understand Gaussian Process Regression
16 0.38337436 97 nips-2004-Learning Efficient Auditory Codes Using Spikes Predicts Cochlear Filters
17 0.35843757 207 nips-2004-ℓ₀-norm Minimization for Basis Selection
18 0.35745701 92 nips-2004-Kernel Methods for Implicit Surface Modeling
19 0.35434824 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound
20 0.34778216 79 nips-2004-Hierarchical Eigensolver for Transition Matrices in Spectral Methods
topicId topicWeight
[(13, 0.12), (15, 0.14), (26, 0.038), (31, 0.029), (33, 0.159), (39, 0.367), (50, 0.038), (81, 0.013)]
simIndex simValue paperId paperTitle
1 0.88838822 198 nips-2004-Unsupervised Variational Bayesian Learning of Nonlinear Models
Author: Antti Honkela, Harri Valpola
Abstract: In this paper we present a framework for using multi-layer perceptron (MLP) networks in nonlinear generative models trained by variational Bayesian learning. The nonlinearity is handled by linearizing it using a Gauss–Hermite quadrature at the hidden neurons. This yields an accurate approximation for cases of large posterior variance. The method can be used to derive nonlinear counterparts for linear algorithms such as factor analysis, independent component/factor analysis and state-space models. This is demonstrated with a nonlinear factor analysis experiment in which even 20 sources can be estimated from a real world speech data set. 1
2 0.88800514 113 nips-2004-Maximum-Margin Matrix Factorization
Author: Nathan Srebro, Jason Rennie, Tommi S. Jaakkola
Abstract: We present a novel approach to collaborative prediction, using low-norm instead of low-rank factorizations. The approach is inspired by, and has strong connections to, large-margin linear discrimination. We show how to learn low-norm factorizations by solving a semi-definite program, and discuss generalization error bounds for them. 1
same-paper 3 0.85253483 42 nips-2004-Computing regularization paths for learning multiple kernels
Author: Francis R. Bach, Romain Thibaux, Michael I. Jordan
Abstract: The problem of learning a sparse conic combination of kernel functions or kernel matrices for classification or regression can be achieved via the regularization by a block 1-norm [1]. In this paper, we present an algorithm that computes the entire regularization path for these problems. The path is obtained by using numerical continuation techniques, and involves a running time complexity that is a constant times the complexity of solving the problem for one value of the regularization parameter. Working in the setting of kernel linear regression and kernel logistic regression, we show empirically that the effect of the block 1-norm regularization differs notably from the (non-block) 1-norm regularization commonly used for variable selection, and that the regularization path is of particular value in the block case. 1
4 0.7202583 163 nips-2004-Semi-parametric Exponential Family PCA
Author: Sajama Sajama, Alon Orlitsky
Abstract: We present a semi-parametric latent variable model based technique for density modelling, dimensionality reduction and visualization. Unlike previous methods, we estimate the latent distribution non-parametrically which enables us to model data generated by an underlying low dimensional, multimodal distribution. In addition, we allow the components of latent variable models to be drawn from the exponential family which makes the method suitable for special data types, for example binary or count data. Simulations on real valued, binary and count data show favorable comparison to other related schemes both in terms of separating different populations and generalization to unseen samples. 1
5 0.68697309 70 nips-2004-Following Curved Regularized Optimization Solution Paths
Author: Saharon Rosset
Abstract: Regularization plays a central role in the analysis of modern data, where non-regularized fitting is likely to lead to over-fitted models, useless for both prediction and interpretation. We consider the design of incremental algorithms which follow paths of regularized solutions, as the regularization varies. These approaches often result in methods which are both efficient and highly flexible. We suggest a general path-following algorithm based on second-order approximations, prove that under mild conditions it remains “very close” to the path of optimal solutions and illustrate it with examples.
6 0.66339839 124 nips-2004-Multiple Alignment of Continuous Time Series
7 0.65809077 71 nips-2004-Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices
8 0.6473555 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
9 0.64053601 201 nips-2004-Using the Equivalent Kernel to Understand Gaussian Process Regression
10 0.63945866 2 nips-2004-A Direct Formulation for Sparse PCA Using Semidefinite Programming
11 0.63877219 68 nips-2004-Face Detection --- Efficient and Rank Deficient
12 0.63852018 121 nips-2004-Modeling Nonlinear Dependencies in Natural Images using Mixture of Laplacian Distribution
13 0.63556278 66 nips-2004-Exponential Family Harmoniums with an Application to Information Retrieval
14 0.63373756 125 nips-2004-Multiple Relational Embedding
15 0.63151246 178 nips-2004-Support Vector Classification with Input Data Uncertainty
16 0.628093 131 nips-2004-Non-Local Manifold Tangent Learning
17 0.6280508 18 nips-2004-Algebraic Set Kernels with Application to Inference Over Local Image Representations
18 0.62378031 90 nips-2004-Joint Probabilistic Curve Clustering and Alignment
19 0.61947858 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons
20 0.61508161 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection