nips nips2005 nips2005-190 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yoshua Bengio, Olivier Delalleau, Nicolas L. Roux
Abstract: We present a series of theoretical arguments supporting the claim that a large class of modern learning algorithms that rely solely on the smoothness prior – with similarity between examples expressed with a local kernel – are sensitive to the curse of dimensionality, or more precisely to the variability of the target. Our discussion covers supervised, semisupervised and unsupervised learning algorithms. These algorithms are found to be local in the sense that crucial properties of the learned function at x depend mostly on the neighbors of x in the training set. This makes them sensitive to the curse of dimensionality, well studied for classical non-parametric statistical learning. We show in the case of the Gaussian kernel that when the function to be learned has many variations, these algorithms require a number of training examples proportional to the number of variations, which could be large even though there may exist short descriptions of the target function, i.e. their Kolmogorov complexity may be low. This suggests that there exist non-local learning algorithms that at least have the potential to learn about such structured but apparently complex functions (because locally they have many variations), while not using very specific prior domain knowledge.
Reference: text
Additional prior knowledge is expressed by choosing the space of the data and the particular notion of similarity between examples (typically expressed as a kernel function). More recently, there has also been much interest in non-parametric semi-supervised learning algorithms, such as (Zhu, Ghahramani and Lafferty, 2003; Zhou et al., 2004; Belkin, Matveeva and Niyogi, 2004; Delalleau, Bengio and Le Roux, 2005), which also fall in this category and share many ideas with manifold learning algorithms. Since this is a very large class of algorithms that is attracting much attention, it is worthwhile to investigate its limitations, and this is the main goal of this paper.

In this paper, we focus on algorithms in which the learned function is expressed as a linear combination of kernel functions applied on the training examples:
$$f(x) = b + \sum_{i=1}^{n} \alpha_i K_D(x, x_i) \quad (1)$$
where optionally a bias term $b$ is added, and $D = \{z_1, \ldots, z_n\}$ are the training examples ($z_i = x_i$ for unsupervised learning, $z_i = (x_i, y_i)$ for supervised learning, and $y_i$ can take a special "missing" value for semi-supervised learning). The $\alpha_i$'s are scalars chosen by the learning algorithm using $D$, and $K_D(\cdot, \cdot)$ is the kernel function, a symmetric function (sometimes expected to be positive definite), which may be chosen by taking into account all the $x_i$'s. A typical kernel function is the Gaussian kernel,
$$K_\sigma(u, v) = e^{-\frac{1}{\sigma^2}\|u - v\|^2}, \quad (2)$$
with the width $\sigma$ controlling how local the kernel is.
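As a concrete illustration of eqs. 1–2, the following minimal Python sketch (ours, not from the paper; the data, width and coefficients are arbitrary illustrative choices) evaluates such a Gaussian kernel expansion; in practice the $\alpha_i$'s would be chosen by the learning algorithm.

```python
# Minimal sketch of the kernel expansion of eq. 1 with the Gaussian kernel
# of eq. 2. Names, shapes and coefficient values are illustrative only.
import numpy as np

def gaussian_kernel(u, v, sigma):
    """K_sigma(u, v) = exp(-||u - v||^2 / sigma^2), as in eq. 2."""
    diff = u - v
    return np.exp(-np.dot(diff, diff) / sigma**2)

def kernel_machine(x, centers, alphas, sigma, b=0.0):
    """f(x) = b + sum_i alpha_i K_sigma(x, x_i), as in eq. 1."""
    return b + sum(a * gaussian_kernel(x, xi, sigma)
                   for a, xi in zip(alphas, centers))

# Example: 5 training points in R^2 with arbitrary coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
alphas = rng.normal(size=5)
print(kernel_machine(np.zeros(2), X, alphas, sigma=1.0))
```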
Previous work (2004) shows that LLE, Isomap, Laplacian eigenmaps and other spectral manifold learning algorithms such as spectral clustering can be generalized to be written as in eq. 1. One obtains consistency of classical non-parametric estimators by appropriately varying the hyper-parameter that controls the locality of the estimator as $n$ increases. For a wide class of kernel regression estimators, the unconditional variance and squared bias can be shown to be written as follows (Härdle et al., 2004):
$$\frac{C_1}{n\sigma^d} + C_2\sigma^4,$$
with $C_1$ and $C_2$ not depending on $n$ nor on the dimension $d$. Consider for example the increase in the number of examples required to get the same level of error in 1 dimension versus $d$ dimensions. If $n_1$ is the number of examples required to get an error level $e$ in one dimension, then getting the same error level in $d$ dimensions requires on the order of $n_1^{(4+d)/5}$ examples, i.e. the required number of examples is exponential in $d$.
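A quick back-of-the-envelope computation (ours) makes the $n_1^{(4+d)/5}$ scaling concrete:

```python
# Growth of the required sample size implied by the n_1^((4+d)/5) rate above,
# assuming (hypothetically) that n_1 = 100 examples suffice in one dimension.
n1 = 100
for d in (1, 2, 5, 10, 20):
    print(d, round(n1 ** ((4 + d) / 5)))
# d = 1 gives 100, d = 10 already ~4e5, and d = 20 ~4e9 examples.
```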
For the k-nearest neighbor classifier, a similar result is obtained (Snapp and Venkatesh, 1998):
$$\text{expected error} = E_\infty + \sum_{j=2}^{\infty} c_j\, n^{-j/d},$$
where $E_\infty$ is the asymptotic error, $d$ is the dimension and $n$ the number of examples. Note however that, if the data distribution is concentrated on a lower dimensional manifold, it is the manifold dimension that matters.
2.1 Result for Supervised Learning

The following theorem informs us about the number of sign changes that a Gaussian kernel machine can achieve when it has $k$ bases (i.e. $k$ support vectors, or at least $k$ training examples).

Theorem 2.1. Let $f : \mathbb{R} \to \mathbb{R}$ be computed by a Gaussian kernel machine (eq. 1) with $k$ bases. Then $f$ changes sign at most $2k$ times.

We would like to say something about kernel machines in $\mathbb{R}^d$, and we can do this simply by considering a straight line in $\mathbb{R}^d$ and the number of sign changes that the solution function $f$ can achieve along that line.

Theorem 2.2. Suppose that the learning problem is such that, in order to achieve a given error level for samples from a distribution $P$ with a Gaussian kernel machine (eq. 1), $f$ must change sign at least $2k$ times along some straight line (i.e., in the case of a classifier, the decision surface must be crossed at least $2k$ times by that straight line). Then the kernel machine must have at least $k$ bases (non-zero $\alpha_i$'s).
Proof (of Theorem 2.2). Let the straight line be parameterized by $x(t) = u + tw$, with $t \in \mathbb{R}$ and $\|w\| = 1$ without loss of generality. If $f$ is a Gaussian kernel classifier with $\tilde{k}$ bases, then its restriction to the line, $g(t) = f(u + tw)$, can be written
$$g(t) = b + \sum_{i=1}^{\tilde{k}} \beta_i \exp\left(-\frac{(t - t_i)^2}{\sigma^2}\right)$$
where $u + t_i w$ is the projection of $x_i$ on the line $D_{u,w} = \{u + tw,\ t \in \mathbb{R}\}$, and $\beta_i \neq 0$. The number of distinct bases of $g$ is $k' \leq \tilde{k}$, as there may exist $x_i \neq x_j$ such that $t_i = t_j$. Since $g$ must change sign at least $2k$ times, thanks to Theorem 2.1 we can conclude that $g$ has at least $k$ bases, i.e. $k \leq k' \leq \tilde{k}$, so $f$ has at least $k$ bases.
The above theorem tells us that if we are trying to represent a function that locally varies a lot (in the sense that its sign along a straight line changes many times), then we need many training examples to do so with a Gaussian kernel machine. Note that it says nothing about the dimensionality of the space, but we might expect to have to learn functions that vary more when the data is high-dimensional.
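A quick numerical illustration of Theorem 2.1 (ours): a Gaussian kernel expansion with $k$ bases changes sign at most $2k$ times along the real line; the particular centers, coefficients and width below are arbitrary.

```python
# Count sign changes of a 1-D Gaussian kernel machine with k bases along a
# dense grid; by Theorem 2.1 the count can never exceed 2k.
import numpy as np

def count_sign_changes(values):
    signs = np.sign(values)
    signs = signs[signs != 0]                # ignore exact zeros
    return int(np.sum(signs[1:] != signs[:-1]))

k, sigma = 5, 0.3
centers = np.linspace(0.0, 4.0, k)
alphas = np.array([1.0, -1.0, 1.0, -1.0, 1.0])

t = np.linspace(-1.0, 5.0, 20001)
f = sum(a * np.exp(-(t - c) ** 2 / sigma ** 2) for a, c in zip(alphas, centers))
print(count_sign_changes(f), "<=", 2 * k)    # here 4 <= 10
```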
The next theorem confirms this suspicion in the special case of the $d$-bits parity function:
$$\mathrm{parity}: (b_1, \ldots, b_d) \in \{0,1\}^d \mapsto \begin{cases} 1 & \text{if } \sum_{i=1}^{d} b_i \text{ is even} \\ -1 & \text{otherwise.} \end{cases}$$
We will show that learning this apparently simple function with Gaussians centered on points in $\{0,1\}^d$ is difficult, in the sense that it requires a number of Gaussians exponential in $d$ (for a fixed Gaussian width). Note that Theorem 2.2 does not apply to the $d$-bits parity function, so it represents another type of local variation (not along a line).
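For reference in the sketches below, here is the parity function in Python together with a quick check (ours) of how variable it is: flipping any single bit flips the label, so every edge of the hypercube crosses the decision boundary.

```python
# d-bits parity and a count of label flips across hypercube edges.
import itertools
import numpy as np

def parity(bits):
    """+1 if the number of ones is even, -1 otherwise."""
    return 1 if sum(bits) % 2 == 0 else -1

d = 4
corners = list(itertools.product([0, 1], repeat=d))
edges_crossing = sum(
    parity(c) != parity(tuple(np.bitwise_xor(c, e)))
    for c in corners
    for e in np.eye(d, dtype=int)
)
print(edges_crossing)   # d * 2**d directed edges, every one of them changes label
```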
In the following, $X_d$ denotes the set of all $2^d$ points in $\{0,1\}^d$, and $H_d^1 = \{(b_1, \ldots, b_d) \in X_d \mid b_d = 1\}$ (4) denotes the subset of points whose last bit is 1 (similarly $H_d^0$ for last bit 0). We say that a decision function $f : \mathbb{R}^d \to \mathbb{R}$ solves the parity problem if $\mathrm{sign}(f(x_i)) = \mathrm{parity}(x_i)$ for all $i$ in $\{1, \ldots, 2^d\}$.

Lemma 2.3. Let $f(x) = \sum_{i=1}^{2^d} \alpha_i K_\sigma(x_i, x)$ be a linear combination of Gaussians with the same width $\sigma$, centered on the points $x_i \in X_d$. If $f$ solves the parity problem, then $\alpha_i\,\mathrm{parity}(x_i) > 0$ for all $i$.
Proof (by induction on $d$). We denote by $x_i^0$ the points in $H_d^0$ and by $\alpha_i^0$ their coefficients in the expansion of $f$. For $x_i^0 \in H_d^0$, we denote by $x_i^1 \in H_d^1$ its projection on $H_d^1$ (obtained by setting its last bit to 1), whose coefficient in $f$ is $\alpha_i^1$. Since $H_d^0$ is isomorphic to $X_{d-1}$, the restriction of $f$ to $H_d^0$ implicitly defines a function over $X_{d-1}$ that solves the parity problem (because the last bit in $H_d^0$ is 0, the parity is not modified). Using our induction hypothesis, we have that for all $x_i^0 \in H_d^0$:
$$\left(\alpha_i^0 + \gamma\alpha_i^1\right)\mathrm{parity}(x_i^0) > 0,$$
where $\gamma = K_\sigma(x_i^0, x_i^1) \in (0,1)$ is the kernel value between a point and its projection (they differ only in their last bit). One has to be careful that the parity is modified between $H_d^1$ and its mapping to $X_{d-1}$ (because the last bit in $H_d^1$ is 1). Thus we obtain that the restriction of $(-f)$ to $H_d^1$ defines a function over $X_{d-1}$ that solves the parity problem, and the induction hypothesis tells us that for all $x_j^1 \in H_d^1$:
$$-\left(\alpha_j^1 + \gamma\alpha_j^0\right)\left(-\mathrm{parity}(x_j^1)\right) > 0,$$
i.e. $\left(\alpha_j^1 + \gamma\alpha_j^0\right)\mathrm{parity}(x_j^1) > 0$. Combining the two inequalities (and using $0 < \gamma < 1$) gives $\alpha_i^0\,\mathrm{parity}(x_i^0) > 0$ and $\alpha_i^1\,\mathrm{parity}(x_i^1) > 0$; since this is true for all $x_i^0$ in $H_d^0$, we have proved Lemma 2.3.
Theorem 2.4. Let $f(x) = b + \sum_{i=1}^{2^d} \alpha_i K_\sigma(x_i, x)$ be an affine combination of Gaussians with the same width $\sigma$, centered on points $x_i \in X_d$. If $f$ solves the parity problem, then there are at least $2^{d-1}$ non-zero coefficients $\alpha_i$.
Proof. First, given any $x_i \in X_d$, the number of points in $X_d$ that differ from $x_i$ by exactly $k$ bits is $\binom{d}{k}$. Thus
$$\sum_{x_j \in X_d} K_\sigma(x_i, x_j) = \sum_{k=0}^{d} \binom{d}{k} \exp\left(-\frac{k}{\sigma^2}\right) = c_\sigma,$$
a constant that does not depend on $i$. Now consider a linear combination of Gaussians (i.e. without bias) of the form
$$g(x) = \sum_{x_j \in X_d} (\alpha_j + \beta_j)\, K_\sigma(x_j, x). \quad (8)$$
$g$ verifies $g(x_i) = f(x_i)$ for all $x_i \in X_d$ iff $\sum_{x_j \in X_d} \beta_j K_\sigma(x_j, x_i) = b$ for all $i$, i.e. the vector $\beta$ satisfies the linear system $M_\sigma \beta = b\mathbf{1}$, where $M_\sigma$ is the kernel matrix whose element $(i,j)$ is $K_\sigma(x_i, x_j)$ and $\mathbf{1}$ is a vector of ones. It is well known that $M_\sigma$ is invertible as long as the $x_i$ are all different, which is the case here (Micchelli, 1986). By contradiction, suppose $f$ solves the parity problem with fewer than $2^{d-1}$ non-zero coefficients $\alpha_i$. Then there exist two points $x_s$ and $x_t$ in $X_d$ such that $\alpha_s = \alpha_t = 0$ and $\mathrm{parity}(x_s) = 1 = -\mathrm{parity}(x_t)$ (since there are $2^{d-1}$ points of each parity). Since $g(x_i) = f(x_i)$ for all $x_i \in X_d$, $g$ solves the parity problem with a linear combination of Gaussians centered on points in $X_d$, so Lemma 2.3 applied to $g$ gives $(\alpha_s + \beta_s)\,\mathrm{parity}(x_s) = \beta_s > 0$ and $(\alpha_t + \beta_t)\,\mathrm{parity}(x_t) = -\beta_t > 0$, i.e. $\beta_s \beta_t < 0$. On the other hand, since every row of $M_\sigma$ sums to $c_\sigma$, $\mathbf{1}$ is an eigenvector of $M_\sigma$ with eigenvalue $c_\sigma$; consequently, $\mathbf{1}$ is also an eigenvector of $M_\sigma^{-1}$ with eigenvalue $c_\sigma^{-1} > 0$, and $\beta = bM_\sigma^{-1}\mathbf{1} = bc_\sigma^{-1}\mathbf{1}$, which is in contradiction with $\beta_s\beta_t < 0$: $f$ must therefore have at least $2^{d-1}$ non-zero coefficients.
The bound of Theorem 2.4 is tight, since it is possible to solve the parity problem with exactly $2^{d-1}$ Gaussians and a bias, for instance by using a negative bias and putting a positive weight on each example satisfying $\mathrm{parity}(x_i) = 1$. Note that if the centers of the Gaussians are not restricted anymore to be points in $X_d$, it is possible to solve the parity problem with only $d + 1$ Gaussians and no bias (Bengio, Delalleau and Le Roux, 2005).
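A small numerical check (our construction, following the tightness remark above; the specific width and unit weights are arbitrary) confirms that $2^{d-1}$ Gaussians with a negative bias solve parity on $X_d$ for a small $d$:

```python
# Solve d-bits parity with 2^(d-1) Gaussian bases: unit positive weights on
# the parity-(+1) corners and a bias (negative for this width) chosen
# between the scores of the two classes.
import itertools
import numpy as np

def parity(bits):
    return 1 if sum(bits) % 2 == 0 else -1

d, sigma = 4, 0.5
X = np.array(list(itertools.product([0, 1], repeat=d)), dtype=float)
y = np.array([parity(x.astype(int)) for x in X])

centers = X[y == 1]                                       # 2^(d-1) centers
K = np.exp(-((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1) / sigma**2)
scores = K.sum(axis=1)                                    # unit weights

b = -0.5 * (scores[y == 1].min() + scores[y == -1].max())
print(np.all(np.sign(scores + b) == y))                   # True for this width
```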
One may argue that parity is a simple discrete toy problem of little interest. But even if we have to restrict the analysis to discrete samples in $\{0,1\}^d$ for mathematical reasons, the parity function can be extended to a smooth function on the $[0,1]^d$ hypercube depending only on the continuous sum $b_1 + \cdots + b_d$. Theorem 2.4 is thus a basis to argue that the number of Gaussians needed to learn a function with many variations in a continuous space may scale linearly with these variations, and thus possibly exponentially in the dimension.
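One concrete choice of such a smooth extension (our illustration; the paper only requires that some extension depending on $b_1 + \cdots + b_d$ exists) is $f(b) = \cos(\pi \sum_i b_i)$, which agrees with parity on the hypercube corners and already varies $d$ times along the main diagonal:

```python
# cos(pi * sum(b)) agrees with parity at every corner of {0,1}^d and, along
# the diagonal b = (t, ..., t), equals cos(pi * d * t), which changes sign d
# times on [0, 1]: a function with many variations along a single line.
import itertools
import numpy as np

def smooth_parity(b):
    return np.cos(np.pi * np.sum(b))

d = 3
corners = itertools.product([0, 1], repeat=d)
assert all(np.isclose(smooth_parity(np.array(c)), 1 if sum(c) % 2 == 0 else -1)
           for c in corners)

t = np.linspace(0.0, 1.0, 2001)
diag = np.cos(np.pi * d * t)
signs = np.sign(diag)
print(int(np.sum(signs[1:] != signs[:-1])))   # d sign changes along the diagonal
```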
2.2 Results for Semi-Supervised Learning

In this section we focus on algorithms of the type described in recent papers (Zhu, Ghahramani and Lafferty, 2003; Zhou et al., 2004; Belkin, Matveeva and Niyogi, 2004; Delalleau, Bengio and Le Roux, 2005), which are graph-based non-parametric semi-supervised learning algorithms. The graph-based algorithms we consider here can be seen as minimizing the following cost function, as shown in (Delalleau, Bengio and Le Roux, 2005):
$$C(\hat{Y}) = \|\hat{Y}_l - Y_l\|^2 + \mu\, \hat{Y}^\top L \hat{Y} + \mu\epsilon\,\|\hat{Y}\|^2 \quad (9)$$
with $\hat{Y} = (\hat{y}_1, \ldots, \hat{y}_n)$ the estimated labels on both labeled and unlabeled data, and $L$ the (un-normalized) graph Laplacian derived from a similarity function $W$ between points, such that $W_{ij} = W(x_i, x_j)$ corresponds to the weight of the edge between $x_i$ and $x_j$ in the graph. Here $\hat{Y}_l = (\hat{y}_1, \ldots, \hat{y}_l)$ is the vector of estimated labels on the $l$ labeled examples, whose known labels are given by $Y_l = (y_1, \ldots, y_l)$, and one may constrain $\hat{Y}_l = Y_l$ as in (Zhu, Ghahramani and Lafferty, 2003) by letting $\mu \to 0$.
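Since eq. 9 is quadratic in $\hat{Y}$, its minimizer is the solution of a linear system. The sketch below (ours, with toy one-dimensional data and arbitrary hyper-parameters, and assuming the small ridge term $\mu\epsilon\|\hat{Y}\|^2$ as written above) makes this explicit:

```python
# Closed-form minimization of eq. 9: setting the gradient to zero gives
# (S + mu*L + mu*eps*I) y_hat = S y, with S a 0/1 diagonal matrix selecting
# the labeled points and y the known labels padded with zeros.
import numpy as np

rng = np.random.default_rng(0)
n, l = 30, 4                                   # n points, first l labeled
X = rng.uniform(0.0, 1.0, size=n)
W = np.exp(-(X[:, None] - X[None, :]) ** 2 / 0.01)
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W                 # un-normalized graph Laplacian

y = np.zeros(n)
y[:l] = np.sign(rng.normal(size=l))            # known labels on labeled points
S = np.diag((np.arange(n) < l).astype(float))

mu, eps = 1e-2, 1e-3
y_hat = np.linalg.solve(S + mu * L + mu * eps * np.eye(n), S @ y)
print(np.sign(y_hat))                          # propagated labels on all points
```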
We define a region with constant label as a connected subset of the graph where all nodes $x_i$ have the same estimated label (sign of $\hat{y}_i$), and such that no other node can be added while keeping these properties. After running a label propagation algorithm minimizing the cost of eq. 9, the number of regions with constant estimated label is less than (or equal to) the number of labeled examples.
By contradiction, if this proposition is false, then there exists a region with constant estimated label that does not contain any labeled example. Denote by $\hat{y}_{l+1}, \ldots, \hat{y}_{l+q}$ the estimated labels in this region, and assume without loss of generality that they are all strictly positive. Consider the part of the cost of eq. 9 that depends on these labels: its second term (the one coming from $\mu\hat{Y}^\top L\hat{Y}$) is strictly positive, and because the region we consider is maximal (by definition), all samples $x_j$ outside of the region such that $W_{ij} > 0$ verify $\hat{y}_j < 0$ (for $x_i$ a sample in the region). Since all $\hat{y}_i$ are strictly positive for $i \in \{l+1, \ldots, l+q\}$, this second term can be strictly decreased by setting all the $\hat{y}_i$ to 0 for $i \in \{l+1, \ldots, l+q\}$ (their minimum), showing that the labels $\hat{y}_i$ are not optimal, which conflicts with their definition as the labels minimizing $C$.
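A toy experiment (ours) on a chain graph illustrates the proposition: with only two labeled points, the propagated labeling forms at most two constant-sign regions, so a target labeling with many alternations needs many labeled examples.

```python
# Chain graph with two labeled nodes; after minimizing eq. 9 the propagated
# labels form at most two constant-sign regions. Graph, labels and
# hyper-parameters are toy choices.
import numpy as np

def propagate(W, y_known, labeled, mu=1e-2, eps=1e-3):
    n = len(y_known)
    L = np.diag(W.sum(axis=1)) - W
    S = np.diag(labeled.astype(float))
    return np.linalg.solve(S + mu * L + mu * eps * np.eye(n), S @ y_known)

n = 40
W = np.zeros((n, n))
idx = np.arange(n - 1)
W[idx, idx + 1] = W[idx + 1, idx] = 1.0        # chain graph

labeled = np.zeros(n, dtype=bool)
labeled[[0, 20]] = True
y_known = np.zeros(n)
y_known[0], y_known[20] = 1.0, -1.0

signs = np.sign(propagate(W, y_known, labeled))
regions = 1 + int(np.sum(signs[1:] != signs[:-1]))
print(regions, "<= number of labeled examples =", 2)
```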
But this number could grow exponentially with the dimension of the manifold(s) on which the data lie, for instance in the case of a labeling function varying highly along each dimension, even if the label variations are "simple" in a non-local sense. Thus we conjecture that the same kind of result holds for dense weight matrices, if the weighting function is local in the sense that it is close to zero when applied to a pair of examples far from each other.
We find that when the underlying manifold varies a lot, in the sense of having high curvature in many places, then a large number of examples is required. Note that the tangent plane is defined by the derivatives of the kernel machine function $f$, for such algorithms. The core result is that the manifold tangent plane at $x$ is mostly defined by the near neighbors of $x$ in the training set (more precisely, it is constrained to be in the span of the vectors $x - x_i$, with $x_i$ a neighbor of $x$). Hence one needs to cover the manifold with small enough linear patches with at least $d + 1$ examples per patch (where $d$ is the dimension of the manifold). In the same paper, we present a conjecture that generalizes the results presented here for Gaussian kernel classifiers to a larger class of local kernels, using the same notion of locality of the derivative summarized above for manifold learning algorithms. In that case the derivative of $f$ represents the normal of the decision surface, and we find that at $x$ it mostly depends on the neighbors of $x$ in the training set.
It could be argued that if a function has many local variations (hence is not very smooth), then it is not learnable unless strong prior knowledge is at hand. The only prior we need in order to quickly learn such functions (in terms of the number of examples needed) is that functions that are simple to express in that language (e.g. that have a short description, i.e. low Kolmogorov complexity) should be preferred. Of course, if additional domain knowledge about the task is available, it should be used, but without abandoning research on learning algorithms that can address a wider scope of problems. We hope that this paper will stimulate more research into such learning algorithms, since we expect local learning algorithms (that only rely on the smoothness prior) to be insufficient to make significant progress on complex problems such as those raised by research on Artificial Intelligence.
References:
Schölkopf, Smola and Müller. Nonlinear component analysis as a kernel eigenvalue problem.
Snapp and Venkatesh (1998). Asymptotic derivation of the finite-sample risk of the k nearest neighbor classifier.