nips nips2004 nips2004-96 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Lorenzo Rosasco, Andrea Caponnetto, Ernesto D. Vito, Francesca Odone, Umberto D. Giovannini
Abstract: Many works have shown that strong connections relate learning from examples to regularization techniques for ill-posed inverse problems. Nevertheless, until now there has been no formal evidence either that learning from examples can be seen as an inverse problem or that theoretical results in learning theory can be independently derived using tools from regularization theory. In this paper we provide a positive answer to both questions. Indeed, considering the square loss, we translate the learning problem into the language of regularization theory and show that consistency results and the optimal regularization parameter choice can be derived by the discretization of the corresponding inverse problem.
Reference: text
sentIndex sentText sentNum sentScore
1 Learning, Regularization and Ill-Posed Inverse Problems. Lorenzo Rosasco, DISI, Università di Genova, Genova, Italy, rosasco@disi.it [sent-1, score-0.068]
2 Ernesto De Vito, Dipartimento di Matematica, Università di Modena, and INFN, Sezione di Genova, Genova, Italy, devito@unimo.it [sent-3, score-0.204]
3 Andrea Caponnetto, DISI, Università di Genova, Genova, Italy, caponnetto@disi.it [sent-4, score-0.068]
4 Umberto De Giovannini, DISI, Università di Genova, Genova, Italy, umberto. [sent-6, score-0.068]
5 Francesca Odone, DISI, Università di Genova, Genova, Italy, odone@disi.it [sent-8, score-0.068]
6 Abstract Many works have shown that strong connections relate learning from examples to regularization techniques for ill-posed inverse problems. [sent-10, score-0.48]
7 Nevertheless, until now there has been no formal evidence either that learning from examples can be seen as an inverse problem or that theoretical results in learning theory can be independently derived using tools from regularization theory. [sent-11, score-0.545]
8 Indeed, considering the square loss, we translate the learning problem into the language of regularization theory and show that consistency results and the optimal regularization parameter choice can be derived by the discretization of the corresponding inverse problem. [sent-13, score-1.031]
9 1 Introduction The main goal of learning from examples is to infer an estimator, given a finite sample of data drawn according to a fixed but unknown probabilistic input-output relation. [sent-14, score-0.058]
10 The desired property of the selected estimator is to perform well on new data, i.e., to generalize. [sent-15, score-0.078]
11 The fundamental works of Vapnik and further developments [16], [8], [5] show that the key to obtaining a meaningful solution to the above problem is to control the complexity of the solution space. [sent-18, score-0.156]
12 Interestingly, as noted by [12], [8], [2], this is the idea underlying regularization techniques for ill-posed inverse problems [15], [7]. [sent-19, score-0.493]
13 In such a context, to avoid undesired oscillating behavior of the solution, we have to restrict the solution space. [sent-20, score-0.13]
14 However, a careful analysis shows that a rigorous connection between learning and regularization for inverse problems is not straightforward. [sent-22, score-0.471]
15 In this paper we consider the square loss and show that the problem of learning can be translated into a convenient inverse problem and that consistency results can be derived in a general setting. [sent-23, score-0.601]
16 When a generic loss is considered, the analysis immediately becomes more complicated. [sent-24, score-0.065]
17 Some weaker results in the same spirit as those presented in this paper can be found in [13], where, however, the connection with inverse problems is not discussed. [sent-26, score-0.425]
18 Finally, our analysis is close to the idea of stochastic inverse problems discussed in [16]. [sent-27, score-0.291]
19 After recalling the main concepts and notation of learning and inverse problems, in section 4 we develop a formal connection between the two theories. [sent-29, score-0.339]
20 2 Learning from examples We briefly recall some basic concepts of learning theory [16], [8]. [sent-32, score-0.081]
21 The relation between the input x ∈ X and the output y ∈ Y is described by a probability distribution ρ(x, y) = ν(x)ρ(y|x) on X × Y . [sent-34, score-0.029]
22 The distribution ρ is known only through a sample z = (x, y) = ((x_1, y_1), ..., (x_ℓ, y_ℓ)) of ℓ examples. [sent-35, score-0.033]
23 The goal of learning is, given the sample z, to find a function fz : X → R such that fz (x) is an estimate of the output y when the new input x is given. [sent-42, score-1.063]
24 The function fz is called the estimator, and the rule that, given a sample z, provides us with fz is called the learning algorithm. [sent-43, score-1.141]
25 Given a measurable function f : X → R, the ability of f to describe the distribution ρ is measured by its expected risk, defined as I[f] = ∫_{X×Y} (f(x) − y)^2 dρ(x, y). [sent-44, score-0.153]
26 The regression function g(x) = ∫_Y y dρ(y|x) is the minimizer of the expected risk over the set of all measurable functions and always exists since Y is compact. [sent-45, score-0.224]
27 To overcome this problem, in the regularized least squares algorithm a hypothesis space H is fixed and, given λ > 0, an estimator f_z^λ is defined as the solution of the regularized least squares problem, min_{f∈H} { (1/ℓ) ∑_{i=1}^{ℓ} (f(x_i) − y_i)^2 + λ ‖f‖_H^2 }. (1) [sent-47, score-1.093]
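For illustration (a sketch added here, not taken from the paper), the minimizer of (1) has a closed form via the representer theorem: f_z^λ(x) = ∑_j c_j K(x, x_j) with (K + ℓλI)c = y, where K_ij = K(x_i, x_j) is the ℓ×ℓ kernel matrix. A minimal Python sketch follows; the Gaussian kernel, the synthetic data and all parameter values are illustrative choices.

    import numpy as np

    def gaussian_kernel(X1, X2, sigma=0.5):
        # K[i, j] = exp(-||X1[i] - X2[j]||^2 / (2 sigma^2))
        d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))

    def rls_fit(X, y, lam, sigma=0.5):
        # Representer coefficients of the minimizer of
        # (1/n) sum_i (f(x_i) - y_i)^2 + lam ||f||_H^2, i.e. (K + n lam I) c = y
        n = X.shape[0]
        K = gaussian_kernel(X, X, sigma)
        return np.linalg.solve(K + n * lam * np.eye(n), y)

    def rls_predict(c, X_train, X_new, sigma=0.5):
        # f_z^lambda(x) = sum_j c_j K(x, x_j)
        return gaussian_kernel(X_new, X_train, sigma) @ c

    # toy usage on synthetic data
    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(50, 1))
    y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(50)
    c = rls_fit(X, y, lam=1e-2)
    print(rls_predict(c, X, X[:5]))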
28 The regularization parameter λ has to be chosen depending on the available data, λ = λ(ℓ, z), in such a way that, for every ε > 0, lim_{ℓ→+∞} P{ I[f_z^{λ(ℓ,z)}] − inf_{f∈H} I[f] ≥ ε } = 0. (2) [sent-48, score-0.427]
29 We note that in general inf_{f∈H} I[f] is larger than I[g] and represents a sort of irreducible error associated with the choice of the space H. [sent-49, score-0.185]
30 The above convergence in probability is usually called consistency of the algorithm [16] [14]. [sent-50, score-0.215]
31 3 Ill-Posed Inverse Problems and Regularization In this section we give a very brief account of linear inverse problems and regularization theory [15], [7]. [sent-51, score-0.541]
32 Let H and K be two Hilbert spaces and A : H → K a bounded linear operator. [sent-52, score-0.03]
33 Consider the equation Af = g_δ (3) where g_δ, g ∈ K and ‖g − g_δ‖_K ≤ δ. [sent-53, score-0.028]
34 Finding the function f satisfying the above equation, given A and g_δ, is the linear inverse problem associated to Eq. (3). [sent-55, score-0.269]
35 The above problem is, in general, ill-posed. Uniqueness can be restored by introducing the Moore–Penrose generalized inverse f† = A†g, defined as the minimum norm solution of the problem min_{f∈H} ‖Af − g‖_K^2. [sent-57, score-0.398]
36 The similarity between the regularized least squares algorithm (1) and Tikhonov regularization (5) is apparent. [sent-60, score-0.386]
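Equation (5) is not reproduced in this extraction; for reference, the standard Tikhonov regularized solution is recalled here (the textbook form, e.g. [7], [15], stated as an assumption rather than quoted from the paper):

    \[
    f_\delta^\lambda \;=\; \arg\min_{f \in H}\Big\{\, \|Af - g_\delta\|_K^2 + \lambda \|f\|_H^2 \,\Big\}
    \;=\; (A^*A + \lambda I)^{-1} A^* g_\delta ,
    \]

i.e. the same data-fit-plus-penalty structure as the regularized least squares functional (1).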
37 First, to treat the problem of learning in the setting of ill-posed inverse problems we have to define a direct problem by means of a suitable operator A. [sent-62, score-0.503]
38 Second, in the context of learning, the nature of the noise δ is not clear. [sent-63, score-0.1]
39 Finally we have to clarify the relation between consistency (2) and the kind of convergence expressed by (7). [sent-64, score-0.276]
40 In the following sections we will show a possible way to tackle these problems. [sent-65, score-0.042]
41 4 Learning as an Inverse Problem We can now show how the problem of learning can be rephrased in a framework close to the one presented in the previous section. [sent-66, score-0.026]
42 We assume that the hypothesis space H is a reproducing kernel Hilbert space [1] with a continuous kernel K : X × X → R. [sent-67, score-0.072]
43 If x ∈ X, we let K_x(s) = K(s, x) and, if ν is the marginal distribution of ρ on X, we define the bounded linear operator A : H → L^2(X, ν) as (Af)(x) = ⟨f, K_x⟩_H = f(x). (Footnote 1: In the framework of inverse problems, many other regularization procedures are introduced [7].) [sent-68, score-0.64]
44 In particular, for all f ∈ H, the expected risk becomes I[f] = ‖Af − g‖_{L^2(X,ν)}^2 + I[g], where g is the regression function [2]. [sent-71, score-0.122]
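A short derivation of this decomposition, added for clarity; it uses only that (Af)(x) = f(x) and that g(x) is the conditional mean of y:

    \begin{align*}
    I[f] &= \int_{X\times Y}\big(f(x)-g(x)+g(x)-y\big)^2\, d\rho(x,y)\\
         &= \int_X \big(f(x)-g(x)\big)^2\, d\nu(x) \;+\; \int_{X\times Y}\big(g(x)-y\big)^2\, d\rho(x,y)\\
         &= \|Af-g\|_{L^2(X,\nu)}^2 + I[g],
    \end{align*}

the cross term vanishing because \int_Y (g(x)-y)\, d\rho(y|x) = 0 for every x.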
45 The above equation clarifies that if the expected risk admits a minimizer f_H on the hypothesis space H, then it is exactly the generalized solution (see Footnote 2) f† = A†g of the problem Af = g. (8) [sent-72, score-0.351]
46 Moreover, given a training set z = (x, y), we get a discretized version A_x : H → E of A, that is, (A_x f)_i = ⟨f, K_{x_i}⟩_H = f(x_i), where E = R^ℓ is the finite dimensional Euclidean space endowed with the scalar product ⟨y, y'⟩_E = (1/ℓ) ∑_{i=1}^{ℓ} y_i y'_i. [sent-73, score-0.121]
47 i=1 It is straightforward to check that 1 i=1 2 E (f (xi ) − yi )2 = Ax f − y , so that the estimator fz λ given by the regularized least squares algorithm is the regularized solution of the discrete problem Ax f = y. [sent-74, score-1.044]
48 First, in learning from examples we are not interested in finding an approximation of the generalized solution of the discretized problem (9), but we want to find a stable approximation of the solution of the exact problem (8) (compare with [9]). [sent-76, score-0.382]
49 Second, we notice that in learning theory the consistency property (2) involves the control of the quantity I[f_z^λ] − inf_{f∈H} I[f] = ‖A f_z^λ − g‖_{L^2(X,ν)}^2 − inf_{f∈H} ‖Af − g‖_{L^2(X,ν)}^2. (10) [sent-77, score-0.439]
50 If P is the projection on the closure of the range of A, the definition of P gives I[f_z^λ] − inf_{f∈H} I[f] = ‖A f_z^λ − Pg‖_{L^2(X,ν)}^2 (11) (the above equality strongly depends on the fact that the loss function is the square loss). [sent-78, score-0.254]
51 In the inverse problem setting, the square root of the above quantity is called the residue of the solution f_z^λ. [sent-79, score-1.0]
52 Hence, consistency is controlled by the residue of the estimator, instead of the reconstruction error ‖f_z^λ − f†‖_H (as in inverse problems). [sent-80, score-1.065]
53 In particular, consistency is a weaker condition than the one required by (7) and does not require the existence of the generalized solution f_H. [sent-81, score-0.298]
54 Moreover, the noisy data y ∈ E and the exact data g ∈ L2 (X, ν) belong to different spaces, so that the notion of noise has to be modified. [sent-83, score-0.131]
55 Given the above premise, our derivation of consistency results is developed in two steps: we first study the residue of the solution by means of a measure of the noise due to discretization, and then we show a possible way to give a probabilistic evaluation of the noise previously introduced. [sent-84, score-0.639]
56 (Footnote 2) The fact that f_H is the minimal norm solution of (4) is ensured by the assumption that the support of the measure ν is X, since in this case the operator A is injective. [sent-85, score-0.248]
57 1 Bounding the Residue of the Regularized Solution We recall that the regularized solutions of problems (9) and (8) are given by f_z^λ = (A_x^* A_x + λI)^{-1} A_x^* y and f^λ = (A^* A + λI)^{-1} A^* g. [sent-87, score-0.714]
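As a consistency check added here (not part of the paper): with the (1/ℓ)-weighted inner product on E one has A_x^* y = (1/ℓ) ∑_i y_i K_{x_i}, so writing f_z^λ = ∑_j c_j K_{x_j} and assuming the kernel sections K_{x_i} are linearly independent, matching coefficients turns the discretized operator equation into the familiar finite linear system

    \[
    (A_x^* A_x + \lambda I)\, f_z^\lambda = A_x^* y
    \quad\Longleftrightarrow\quad
    \tfrac{1}{\ell}\,\mathbf{K}c + \lambda c = \tfrac{1}{\ell}\, y
    \quad\Longleftrightarrow\quad
    (\mathbf{K} + \ell\lambda I)\, c = y,
    \qquad \mathbf{K}_{ij} = K(x_i, x_j),
    \]

which is exactly the system solved in the code sketch after (1).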
58 The above equations show that f_z^λ and f^λ depend only on A_x^* A_x and A^* A, which are operators from H into H, and on A_x^* y and A^* g, which are elements of H, so that the space E disappears. [sent-88, score-0.515]
59 This observation suggests that the noise levels could be ‖A_x^* A_x − A^* A‖_{L(H)} and ‖A_x^* y − A^* g‖_H, where ‖·‖_{L(H)} is the uniform operator norm. [sent-89, score-0.268]
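Purely as an illustration (again not from the paper), the second noise level can be probed numerically: A_x^* y = (1/ℓ) ∑_i y_i K_{x_i} is a finite kernel expansion, and A^* g can be replaced by an independent large-sample Monte Carlo surrogate, so that the H-norm of the difference is computable from Gram matrices. The kernel, the data-generating distribution and the surrogate size are illustrative choices, and the result is only a rough empirical probe of δ_2, not the quantity bounded in the theorem.

    import numpy as np

    def gaussian_kernel(X1, X2, sigma=0.5):
        d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))

    def sample(n, rng):
        X = rng.uniform(-1, 1, size=(n, 1))
        y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(n)
        return X, y

    rng = np.random.default_rng(1)
    Xs, ys = sample(50, rng)        # the training sample of cardinality ell
    Xb, yb = sample(2000, rng)      # large surrogate sample standing in for the exact data g

    a = ys / len(ys)                # coefficients of A_x^* y = (1/ell) sum_i y_i K_{x_i}
    b = yb / len(yb)                # coefficients of the Monte Carlo surrogate for A^* g

    # ||A_x^* y - A^* g||_H^2 via kernel (Gram-matrix) inner products
    diff_sq = (a @ gaussian_kernel(Xs, Xs) @ a
               - 2 * a @ gaussian_kernel(Xs, Xb) @ b
               + b @ gaussian_kernel(Xb, Xb) @ b)
    print("rough estimate of delta_2:", np.sqrt(max(diff_sq, 0.0)))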
60 The next theorem is the central result of the paper. [sent-92, score-0.03]
61 Briefly, the idea is to note that ‖A f_z^λ − Pg‖_{L^2(X,ν)} ≤ ‖A f_z^λ − A f^λ‖_{L^2(X,ν)} + ‖A f^λ − Pg‖_{L^2(X,ν)}, where ‖A f_z^λ − A f^λ‖_{L^2(X,ν)} = ‖(A^* A)^{1/2} (f_z^λ − f^λ)‖_H, the last equality following from the polar decomposition of the operator A. [sent-97, score-0.163]
62 Moreover, a simple algebraic computation gives f_z^λ − f^λ = (A^* A + λI)^{-1} (A^* A − A_x^* A_x)(A_x^* A_x + λI)^{-1} A_x^* y + (A^* A + λI)^{-1} (A_x^* y − A^* g), where the quantities relevant to the definition of the noise appear. [sent-98, score-0.615]
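For completeness (a verification added here, not a quotation), the identity follows from the resolvent-type relation B^{-1} − C^{-1} = C^{-1}(C − B)B^{-1} applied with B = A_x^* A_x + λI and C = A^* A + λI:

    \begin{align*}
    f_z^\lambda - f^\lambda
    &= B^{-1}A_x^* y - C^{-1}A^* g
     = \big(B^{-1}-C^{-1}\big)A_x^* y + C^{-1}\big(A_x^* y - A^* g\big)\\
    &= (A^*A+\lambda I)^{-1}\big(A^*A - A_x^*A_x\big)\big(A_x^*A_x+\lambda I\big)^{-1}A_x^* y
     + (A^*A+\lambda I)^{-1}\big(A_x^* y - A^* g\big),
    \end{align*}

so the residue is controlled precisely by the two discrepancies measured by the noise levels δ_1 and δ_2.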
63 The first item in the above proposition quantifies the difference between the residues of the regularized solutions of the exact and discretized problems in terms of the noise level δ = (δ1 , δ2 ). [sent-99, score-0.456]
64 As mentioned before this is exactly the kind of result needed to derive consistency. [sent-100, score-0.032]
65 On the other hand, the last part of the proposition gives sufficient conditions on the parameter λ to ensure convergence of the residue to zero as the noise level decreases. [sent-101, score-0.352]
66 The above results were obtained by introducing the collection U_δ of training sets compatible with a certain noise level δ. [sent-102, score-0.132]
67 It is left to quantify the noise level corresponding to a training set of cardinality ℓ. [sent-103, score-0.132]
68 This will be achieved in a probabilistic setting in the next section. [sent-104, score-0.025]
69 2 Stochastic Evaluation of the Noise In this section we estimate the discretization noise δ = (δ1 , δ2 ). [sent-106, score-0.169]
70 Other estimates of the noise δ can be given using, for example, union bounds and Hoeffding’s inequality. [sent-108, score-0.1]
71 If we choose the regularization parameter so that λ = λ(ℓ, z) = O(ℓ^{-b}), with 0 < b < 1/2, the sample error converges in probability to zero with order O(ℓ^{-(1−2b)/2}) when ℓ → ∞. [sent-114, score-0.296]
72 On the other hand, the second term represents the approximation error, and it is possible to show, using standard results from spectral theory, that it vanishes as λ goes to zero [7]. [sent-115, score-0.083]
73 Finally, the last term represents the minimum attainable risk once the hypothesis space H has been chosen. [sent-116, score-0.136]
74 From the above observations it is clear that consistency is ensured once the parameter λ is chosen according to the aforementioned conditions. [sent-117, score-0.25]
75 Nonetheless to provide convergence rates it is necessary to control the convergence rate of the approximation error. [sent-118, score-0.126]
76 It can be shown that if the explicit dependence of the approximation error on λ is not available, we cannot determine [sent-120, score-0.084]
77 an optimal a priori (data independent) dependency λ = λ(ℓ) for the regularization parameter. [sent-121, score-0.202]
78 Nevertheless, a posteriori (data dependent) choices λ = λ(ℓ, z) can be considered to automatically achieve the optimal convergence rate [5], [6]. [sent-122, score-0.078]
79 With respect to this last fact, we notice that the set of samples such that inequality (14) holds depends on ℓ and η, but does not depend on λ, so that we can consider a posteriori parameter choices without any further effort (compare with [4], [5]). [sent-123, score-0.132]
80 6 Conclusions In this paper we defined a direct and inverse problem suitable for the learning problem and derived consistency results for the regularized least squares algorithm. [sent-125, score-0.647]
81 Though our analysis formally explains the connections between learning theory and linear inverse problems, its main limitation is that we considered only the square loss. [sent-126, score-0.365]
82 We briefly sketch how the arguments presented in the paper extend to general loss functions. [sent-127, score-0.065]
83 For the sake of simplicity, we consider a differentiable loss function V. [sent-128, score-0.101]
84 It is easy to see that the minimizer f_H of the expected risk satisfies the following equation: S f_H = 0 (16), where S = L_K ∘ O, L_K is the integral operator with kernel K, that is, (L_K f)(x) = ∫_X K(x, s) f(s) dν(s), and O is the operator defined by (Of)(x) = ∫_Y V'(y, f(x)) dρ(y|x), with V' the derivative of V with respect to its second argument. [sent-129, score-0.491]
85 If we consider a generic differentiable loss, the operator O, and hence S, is nonlinear, and estimating f_H is an ill-posed nonlinear inverse problem. [sent-130, score-0.535]
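A sketch of why the stationarity condition takes the form (16), added for clarity and assuming enough regularity to differentiate under the integral sign (here I[f] = ∫_{X×Y} V(y, f(x)) dρ(x, y) denotes the expected risk for the loss V): for any direction h ∈ H,

    \[
    \frac{d}{d\varepsilon} I[f + \varepsilon h]\Big|_{\varepsilon = 0}
    = \int_{X\times Y} V'(y, f(x))\, h(x)\, d\rho(x,y)
    = \int_X (Of)(x)\,\langle h, K_x\rangle_H\, d\nu(x)
    = \big\langle (L_K \circ O)f,\; h\big\rangle_H ,
    \]

so the gradient of the expected risk in H is Sf = L_K(Of), and any minimizer f_H must satisfy S f_H = 0.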
86 It is well known that the theory for this kind of problem is much less developed than the corresponding theory for linear problems. [sent-131, score-0.176]
87 Moreover, since, in general, I[f] does not define a metric, the relation between the expected risk and the residue is not so clear. [sent-132, score-0.151]
88 It appears evident that the attempt to extend our results to a wider class of loss functions is not straightforward. [sent-133, score-0.065]
89 A possible way to tackle the problem, further developing our analysis, might pass through the exploitation of a natural convexity assumption on the loss function. [sent-134, score-0.107]
90 Future work also aims to derive tighter probabilistic bounds on the noise using recently proposed data dependent techniques. [sent-135, score-0.125]
91 Model selection for the regularized least-squares algorithm in learning theory. [sent-176, score-0.123]
92 On the adaptive selection of the parameter in regularization of ill-posed problems. [sent-188, score-0.236]
93 Regularization of inverse problems, volume 375 of Mathematics and its Applications. [sent-192, score-0.243]
94 Statistical learning: Stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. [sent-216, score-0.264]
95 A different type of convergence for statistical learning algorithms. [sent-224, score-0.047]
96 Consistency of support vector machines and other regularized kernel machines. [sent-228, score-0.123]
wordName wordTfidf (topN-words)
[('fz', 0.515), ('genova', 0.303), ('ax', 0.285), ('inverse', 0.243), ('regularization', 0.202), ('afz', 0.182), ('consistency', 0.168), ('af', 0.162), ('operator', 0.135), ('fh', 0.132), ('tikhonov', 0.132), ('regularized', 0.123), ('disi', 0.121), ('residue', 0.112), ('caponnetto', 0.105), ('noise', 0.1), ('inf', 0.097), ('risk', 0.096), ('lim', 0.094), ('vito', 0.091), ('universit', 0.079), ('estimator', 0.078), ('anyway', 0.072), ('minimizer', 0.071), ('discretization', 0.069), ('di', 0.068), ('discretized', 0.067), ('loss', 0.065), ('solution', 0.065), ('squares', 0.061), ('irreducible', 0.061), ('odone', 0.061), ('rosasco', 0.061), ('supz', 0.061), ('poggio', 0.058), ('lk', 0.058), ('kx', 0.053), ('closure', 0.053), ('sup', 0.052), ('ensured', 0.048), ('problems', 0.048), ('theory', 0.048), ('convergence', 0.047), ('tackle', 0.042), ('sons', 0.041), ('hypothesis', 0.04), ('square', 0.039), ('inequality', 0.038), ('generalized', 0.038), ('recalling', 0.037), ('brie', 0.036), ('differentiable', 0.036), ('connections', 0.035), ('moreover', 0.035), ('translated', 0.034), ('parameter', 0.034), ('concepts', 0.033), ('sample', 0.033), ('levels', 0.033), ('kind', 0.032), ('reproducing', 0.032), ('approximation', 0.032), ('level', 0.032), ('posteriori', 0.031), ('exact', 0.031), ('measurable', 0.031), ('bounded', 0.03), ('procedures', 0.03), ('theorem', 0.03), ('relation', 0.029), ('notice', 0.029), ('hilbert', 0.028), ('non', 0.028), ('equation', 0.028), ('solutions', 0.028), ('error', 0.027), ('yi', 0.027), ('john', 0.027), ('weaker', 0.027), ('proposition', 0.027), ('mcdiarmid', 0.026), ('colin', 0.026), ('mol', 0.026), ('dipartimento', 0.026), ('lorenzo', 0.026), ('admits', 0.026), ('andreas', 0.026), ('pg', 0.026), ('steve', 0.026), ('formal', 0.026), ('check', 0.026), ('technical', 0.026), ('expected', 0.026), ('problem', 0.026), ('probabilistic', 0.025), ('mathematics', 0.025), ('treat', 0.025), ('dependence', 0.025), ('goes', 0.024), ('transaction', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 96 nips-2004-Learning, Regularization and Ill-Posed Inverse Problems
Author: Lorenzo Rosasco, Andrea Caponnetto, Ernesto D. Vito, Francesca Odone, Umberto D. Giovannini
Abstract: Many works have shown that strong connections relate learning from examples to regularization techniques for ill-posed inverse problems. Nevertheless, until now there has been no formal evidence either that learning from examples can be seen as an inverse problem or that theoretical results in learning theory can be independently derived using tools from regularization theory. In this paper we provide a positive answer to both questions. Indeed, considering the square loss, we translate the learning problem into the language of regularization theory and show that consistency results and the optimal regularization parameter choice can be derived by the discretization of the corresponding inverse problem.
2 0.11305958 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition
Author: Laurent Zwald, Gilles Blanchard, Pascal Massart, Régis Vert
Abstract: This paper investigates the effect of Kernel Principal Component Analysis (KPCA) within the classification framework, essentially the regularization properties of this dimensionality reduction method. KPCA has been previously used as a pre-processing step before applying an SVM but we point out that this method is somewhat redundant from a regularization point of view and we propose a new algorithm called Kernel Projection Machine to avoid this redundancy, based on an analogy with the statistical framework of regression for a Gaussian white noise model. Preliminary experimental results show that this algorithm reaches the same performances as an SVM. 1
3 0.11120496 70 nips-2004-Following Curved Regularized Optimization Solution Paths
Author: Saharon Rosset
Abstract: Regularization plays a central role in the analysis of modern data, where non-regularized fitting is likely to lead to over-fitted models, useless for both prediction and interpretation. We consider the design of incremental algorithms which follow paths of regularized solutions, as the regularization varies. These approaches often result in methods which are both efficient and highly flexible. We suggest a general path-following algorithm based on second-order approximations, prove that under mild conditions it remains “very close” to the path of optimal solutions and illustrate it with examples.
4 0.10667056 42 nips-2004-Computing regularization paths for learning multiple kernels
Author: Francis R. Bach, Romain Thibaux, Michael I. Jordan
Abstract: The problem of learning a sparse conic combination of kernel functions or kernel matrices for classification or regression can be achieved via the regularization by a block 1-norm [1]. In this paper, we present an algorithm that computes the entire regularization path for these problems. The path is obtained by using numerical continuation techniques, and involves a running time complexity that is a constant times the complexity of solving the problem for one value of the regularization parameter. Working in the setting of kernel linear regression and kernel logistic regression, we show empirically that the effect of the block 1-norm regularization differs notably from the (non-block) 1-norm regularization commonly used for variable selection, and that the regularization path is of particular value in the block case. 1
5 0.084742315 69 nips-2004-Fast Rates to Bayes for Kernel Machines
Author: Ingo Steinwart, Clint Scovel
Abstract: We establish learning rates to the Bayes risk for support vector machines (SVMs) with hinge loss. In particular, for SVMs with Gaussian RBF kernels we propose a geometric condition for distributions which can be used to determine approximation properties of these kernels. Finally, we compare our methods with a recent paper of G. Blanchard et al.. 1
6 0.081209965 103 nips-2004-Limits of Spectral Clustering
7 0.080768228 178 nips-2004-Support Vector Classification with Input Data Uncertainty
8 0.078972876 54 nips-2004-Distributed Information Regularization on Graphs
9 0.078238703 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
10 0.075934805 27 nips-2004-Bayesian Regularization and Nonnegative Deconvolution for Time Delay Estimation
11 0.071007028 168 nips-2004-Semigroup Kernels on Finite Sets
12 0.068470299 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning
13 0.068143368 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss
14 0.066895381 105 nips-2004-Log-concavity Results on Gaussian Process Methods for Supervised and Unsupervised Learning
15 0.063922524 49 nips-2004-Density Level Detection is Classification
16 0.060060851 161 nips-2004-Self-Tuning Spectral Clustering
17 0.059844188 79 nips-2004-Hierarchical Eigensolver for Transition Matrices in Spectral Methods
18 0.058458928 137 nips-2004-On the Adaptive Properties of Decision Trees
19 0.057798624 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach
20 0.057704844 164 nips-2004-Semi-supervised Learning by Entropy Minimization
topicId topicWeight
[(0, -0.185), (1, 0.069), (2, -0.002), (3, 0.106), (4, -0.079), (5, -0.053), (6, -0.007), (7, -0.032), (8, -0.015), (9, -0.05), (10, 0.018), (11, -0.035), (12, 0.088), (13, -0.006), (14, -0.081), (15, -0.076), (16, 0.038), (17, -0.032), (18, -0.077), (19, -0.023), (20, 0.021), (21, -0.075), (22, 0.057), (23, -0.067), (24, 0.03), (25, 0.128), (26, 0.027), (27, -0.09), (28, -0.043), (29, 0.043), (30, -0.068), (31, 0.063), (32, -0.088), (33, 0.176), (34, -0.044), (35, -0.006), (36, 0.032), (37, 0.0), (38, -0.036), (39, -0.01), (40, 0.007), (41, -0.039), (42, -0.041), (43, 0.021), (44, 0.152), (45, 0.043), (46, -0.043), (47, -0.062), (48, 0.006), (49, 0.071)]
simIndex simValue paperId paperTitle
same-paper 1 0.94066763 96 nips-2004-Learning, Regularization and Ill-Posed Inverse Problems
Author: Lorenzo Rosasco, Andrea Caponnetto, Ernesto D. Vito, Francesca Odone, Umberto D. Giovannini
Abstract: Many works have shown that strong connections relate learning from examples to regularization techniques for ill-posed inverse problems. Nevertheless, until now there has been no formal evidence either that learning from examples can be seen as an inverse problem or that theoretical results in learning theory can be independently derived using tools from regularization theory. In this paper we provide a positive answer to both questions. Indeed, considering the square loss, we translate the learning problem into the language of regularization theory and show that consistency results and the optimal regularization parameter choice can be derived by the discretization of the corresponding inverse problem.
2 0.5737071 70 nips-2004-Following Curved Regularized Optimization Solution Paths
Author: Saharon Rosset
Abstract: Regularization plays a central role in the analysis of modern data, where non-regularized fitting is likely to lead to over-fitted models, useless for both prediction and interpretation. We consider the design of incremental algorithms which follow paths of regularized solutions, as the regularization varies. These approaches often result in methods which are both efficient and highly flexible. We suggest a general path-following algorithm based on second-order approximations, prove that under mild conditions it remains “very close” to the path of optimal solutions and illustrate it with examples.
3 0.56048 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition
Author: Laurent Zwald, Gilles Blanchard, Pascal Massart, Régis Vert
Abstract: This paper investigates the effect of Kernel Principal Component Analysis (KPCA) within the classification framework, essentially the regularization properties of this dimensionality reduction method. KPCA has been previously used as a pre-processing step before applying an SVM but we point out that this method is somewhat redundant from a regularization point of view and we propose a new algorithm called Kernel Projection Machine to avoid this redundancy, based on an analogy with the statistical framework of regression for a Gaussian white noise model. Preliminary experimental results show that this algorithm reaches the same performances as an SVM. 1
4 0.55149168 42 nips-2004-Computing regularization paths for learning multiple kernels
Author: Francis R. Bach, Romain Thibaux, Michael I. Jordan
Abstract: The problem of learning a sparse conic combination of kernel functions or kernel matrices for classification or regression can be achieved via the regularization by a block 1-norm [1]. In this paper, we present an algorithm that computes the entire regularization path for these problems. The path is obtained by using numerical continuation techniques, and involves a running time complexity that is a constant times the complexity of solving the problem for one value of the regularization parameter. Working in the setting of kernel linear regression and kernel logistic regression, we show empirically that the effect of the block 1-norm regularization differs notably from the (non-block) 1-norm regularization commonly used for variable selection, and that the regularization path is of particular value in the block case. 1
5 0.53106189 185 nips-2004-The Convergence of Contrastive Divergences
Author: Alan L. Yuille
Abstract: This paper analyses the Contrastive Divergence algorithm for learning statistical parameters. We relate the algorithm to the stochastic approximation literature. This enables us to specify conditions under which the algorithm is guaranteed to converge to the optimal solution (with probability 1). This includes necessary and sufficient conditions for the solution to be unbiased.
6 0.53030145 27 nips-2004-Bayesian Regularization and Nonnegative Deconvolution for Time Delay Estimation
7 0.51145899 207 nips-2004-ℓ₀-norm Minimization for Basis Selection
8 0.49471056 190 nips-2004-The Rescorla-Wagner Algorithm and Maximum Likelihood Estimation of Causal Parameters
9 0.4902468 137 nips-2004-On the Adaptive Properties of Decision Trees
10 0.47355273 69 nips-2004-Fast Rates to Bayes for Kernel Machines
11 0.43670729 54 nips-2004-Distributed Information Regularization on Graphs
12 0.41040289 79 nips-2004-Hierarchical Eigensolver for Transition Matrices in Spectral Methods
13 0.39814216 94 nips-2004-Kernels for Multi--task Learning
14 0.39588651 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss
15 0.38925889 49 nips-2004-Density Level Detection is Classification
16 0.38319132 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning
17 0.37927058 61 nips-2004-Efficient Out-of-Sample Extension of Dominant-Set Clusters
18 0.37899724 114 nips-2004-Maximum Likelihood Estimation of Intrinsic Dimension
19 0.37717617 103 nips-2004-Limits of Spectral Clustering
20 0.37238666 138 nips-2004-Online Bounds for Bayesian Algorithms
topicId topicWeight
[(13, 0.143), (15, 0.129), (26, 0.075), (31, 0.027), (33, 0.162), (35, 0.012), (39, 0.022), (50, 0.045), (94, 0.293)]
simIndex simValue paperId paperTitle
1 0.90805805 152 nips-2004-Real-Time Pitch Determination of One or More Voices by Nonnegative Matrix Factorization
Author: Fei Sha, Lawrence K. Saul
Abstract: An auditory “scene”, composed of overlapping acoustic sources, can be viewed as a complex object whose constituent parts are the individual sources. Pitch is known to be an important cue for auditory scene analysis. In this paper, with the goal of building agents that operate in human environments, we describe a real-time system to identify the presence of one or more voices and compute their pitch. The signal processing in the front end is based on instantaneous frequency estimation, a method for tracking the partials of voiced speech, while the pattern-matching in the back end is based on nonnegative matrix factorization, an unsupervised algorithm for learning the parts of complex objects. While supporting a framework to analyze complicated auditory scenes, our system maintains real-time operability and state-of-the-art performance in clean speech.
same-paper 2 0.81822002 96 nips-2004-Learning, Regularization and Ill-Posed Inverse Problems
Author: Lorenzo Rosasco, Andrea Caponnetto, Ernesto D. Vito, Francesca Odone, Umberto D. Giovannini
Abstract: Many works have shown that strong connections relate learning from examples to regularization techniques for ill-posed inverse problems. Nevertheless, until now there has been no formal evidence either that learning from examples can be seen as an inverse problem or that theoretical results in learning theory can be independently derived using tools from regularization theory. In this paper we provide a positive answer to both questions. Indeed, considering the square loss, we translate the learning problem into the language of regularization theory and show that consistency results and the optimal regularization parameter choice can be derived by the discretization of the corresponding inverse problem.
3 0.77937782 19 nips-2004-An Application of Boosting to Graph Classification
Author: Taku Kudo, Eisaku Maeda, Yuji Matsumoto
Abstract: This paper presents an application of Boosting for classifying labeled graphs, general structures for modeling a number of real-world data, such as chemical compounds, natural language texts, and bio sequences. The proposal consists of i) decision stumps that use subgraph as features, and ii) a Boosting algorithm in which subgraph-based decision stumps are used as weak learners. We also discuss the relation between our algorithm and SVMs with convolution kernels. Two experiments using natural language data and chemical compounds show that our method achieves comparable or even better performance than SVMs with convolution kernels as well as improves the testing efficiency. 1
4 0.76242828 145 nips-2004-Parametric Embedding for Class Visualization
Author: Tomoharu Iwata, Kazumi Saito, Naonori Ueda, Sean Stromsten, Thomas L. Griffiths, Joshua B. Tenenbaum
Abstract: In this paper, we propose a new method, Parametric Embedding (PE), for visualizing the posteriors estimated over a mixture model. PE simultaneously embeds both objects and their classes in a low-dimensional space. PE takes as input a set of class posterior vectors for given data points, and tries to preserve the posterior structure in an embedding space by minimizing a sum of Kullback-Leibler divergences, under the assumption that samples are generated by a Gaussian mixture with equal covariances in the embedding space. PE has many potential uses depending on the source of the input data, providing insight into the classifier’s behavior in supervised, semi-supervised and unsupervised settings. The PE algorithm has a computational advantage over conventional embedding methods based on pairwise object relations since its complexity scales with the product of the number of objects and the number of classes. We demonstrate PE by visualizing supervised categorization of web pages, semi-supervised categorization of digits, and the relations of words and latent topics found by an unsupervised algorithm, Latent Dirichlet Allocation. 1
5 0.68970942 5 nips-2004-A Harmonic Excitation State-Space Approach to Blind Separation of Speech
Author: Rasmus K. Olsson, Lars K. Hansen
Abstract: We discuss an identification framework for noisy speech mixtures. A block-based generative model is formulated that explicitly incorporates the time-varying harmonic plus noise (H+N) model for a number of latent sources observed through noisy convolutive mixtures. All parameters including the pitches of the source signals, the amplitudes and phases of the sources, the mixing filters and the noise statistics are estimated by maximum likelihood, using an EM-algorithm. Exact averaging over the hidden sources is obtained using the Kalman smoother. We show that pitch estimation and source separation can be performed simultaneously. The pitch estimates are compared to laryngograph (EGG) measurements. Artificial and real room mixtures are used to demonstrate the viability of the approach. Intelligible speech signals are re-synthesized from the estimated H+N models.
6 0.66033858 131 nips-2004-Non-Local Manifold Tangent Learning
7 0.66001421 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
8 0.65612644 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach
9 0.65524316 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons
10 0.65230274 116 nips-2004-Message Errors in Belief Propagation
11 0.65144885 28 nips-2004-Bayesian inference in spiking neurons
12 0.65140361 102 nips-2004-Learning first-order Markov models for control
13 0.65073979 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms
14 0.64953446 70 nips-2004-Following Curved Regularized Optimization Solution Paths
15 0.64852268 69 nips-2004-Fast Rates to Bayes for Kernel Machines
16 0.64841044 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill
17 0.64819139 163 nips-2004-Semi-parametric Exponential Family PCA
18 0.64667505 178 nips-2004-Support Vector Classification with Input Data Uncertainty
19 0.64667451 22 nips-2004-An Investigation of Practical Approximate Nearest Neighbor Algorithms
20 0.64490294 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data