nips nips2005 nips2005-138 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yoshua Bengio, Hugo Larochelle, Pascal Vincent
Abstract: To escape from the curse of dimensionality, we claim that one can learn non-local functions, in the sense that the value and shape of the learned function at x must be inferred using examples that may be far from x. With this objective, we present a non-local non-parametric density estimator. It builds upon previously proposed Gaussian mixture models with regularized covariance matrices to take into account the local shape of the manifold. It also builds upon recent work on non-local estimators of the tangent plane of a manifold, which are able to generalize in places with little training data, unlike traditional, local, non-parametric models.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract To escape from the curse of dimensionality, we claim that one can learn non-local functions, in the sense that the value and shape of the learned function at x must be inferred using examples that may be far from x. [sent-6, score-0.201]
2 With this objective, we present a non-local non-parametric density estimator. [sent-7, score-0.112]
3 It builds upon previously proposed Gaussian mixture models with regularized covariance matrices to take into account the local shape of the manifold. [sent-8, score-0.275]
4 It also builds upon recent work on non-local estimators of the tangent plane of a manifold, which are able to generalize in places with little training data, unlike traditional, local, non-parametric models. [sent-9, score-0.47]
5 A central issue in obtaining generalization is how information from the training examples can be used to make predictions about new examples without strong prior assumptions. [sent-11, score-0.144]
6 (Bengio, Delalleau and Le Roux, 2005) and (Bengio and Monperrus, 2005) present several arguments illustrating some fundamental limitations of modern kernel methods due to the curse of dimensionality, when the kernel is local (like the Gaussian kernel). [sent-14, score-0.158]
7 The limitation, in essence, is that very important information about the predicted function at x is derived mostly from the near neighbors of x in the training set. [sent-17, score-0.288]
8 This analysis has been applied to supervised learning algorithms such as SVMs as well as to unsupervised manifold learning algorithms and graph-based semi-supervised learning. [sent-18, score-0.408]
9 This strongly suggests investigating non-local learning methods, which can in principle generalize at x using information gathered at training points xi that are far from x. [sent-20, score-0.411]
10 We present here such a non-local learning algorithm, in the realm of density estimation. [sent-21, score-0.112]
11 The local covariance matrix characterizing the density in the immediate neighborhood of a data point is learned as a function of that data point, with global parameters. [sent-23, score-0.351]
12 This makes it possible to generalize in places with little or no training data, unlike traditional, local, non-parametric models. [sent-24, score-0.118]
13 Here, the implicit assumption is that there is some kind of regularity in the shape of the density, such that learning about its shape in one region could be informative of the shape in another region that is not adjacent. [sent-25, score-0.132]
14 A related approach is that of (Goldberger et al., 2005), which learns a global covariance matrix for use in the Mahalanobis distance within a non-parametric classifier. [sent-28, score-0.131]
15 Here we generalize this global matrix to one that is a function of the datum x. [sent-29, score-0.088]
16 2 Manifold Parzen Windows In the Parzen Windows estimator, one puts a spherical (isotropic) Gaussian around each training point xi , with a single shared variance hyper-parameter. [sent-30, score-0.443]
17 If the data concentrates in certain directions around xi , we want that covariance matrix to be “flat” (near zero variance) in the orthogonal directions. [sent-32, score-0.522]
18 One way to achieve this is to parametrize each of these covariance matrices in terms of “principal directions” (which correspond to the tangent vectors of the manifold, if the data concentrates on a manifold). [sent-33, score-0.335]
19 In (Vincent and Bengio, 2003), µ(xi) = 0, and σ_noise²(xi) = σ0² is a global hyper-parameter, while (λj(xi), vj(xi)) = (s_j²(xi) + σ_noise²(xi), vj(xi)) are the leading (eigenvalue, eigenvector) pairs from the eigen-decomposition of a locally weighted covariance matrix (e.g. [sent-38, score-0.381]
20 the empirical covariance of the vectors xl − xi, with xl a near neighbor of xi). [sent-40, score-0.792]
21 The "noise level" hyper-parameter σ0² must be chosen such that the principal eigenvalues are all greater than σ0². [sent-41, score-0.109]
22 Another hyper-parameter is the number d of principal components to keep. [sent-42, score-0.109]
23 This very simple model was found to be consistently better than the ordinary Parzen density estimator in numerical experiments in which all hyper-parameters are chosen by cross-validation. [sent-44, score-0.215]
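To make the construction concrete, here is a minimal sketch of a Manifold Parzen style estimator as described above (our own illustration, not the paper's implementation; the function names, the hard k-nearest-neighbor weighting, and the default values of d, k and sigma0_sq are assumptions):

```python
import numpy as np
from scipy.stats import multivariate_normal

def manifold_parzen_covariances(X, d=1, k=10, sigma0_sq=0.1):
    """For each training point x_i, build S_i = sigma0^2 I + sum_j (lambda_j - sigma0^2) v_j v_j^T,
    where (lambda_j, v_j) are the top-d eigenpairs of the covariance of neighbor offsets x_l - x_i."""
    n, D = X.shape
    covs = []
    for i in range(n):
        dists = np.linalg.norm(X - X[i], axis=1)
        neigh = np.argsort(dists)[1:k + 1]            # k nearest neighbors, excluding x_i itself
        diffs = X[neigh] - X[i]
        C = diffs.T @ diffs / k                        # locally weighted covariance (hard weights)
        eigval, eigvec = np.linalg.eigh(C)             # eigenvalues in ascending order
        top = np.argsort(eigval)[::-1][:d]             # keep the d leading directions
        S = sigma0_sq * np.eye(D)
        for j in top:
            S += max(eigval[j] - sigma0_sq, 0.0) * np.outer(eigvec[:, j], eigvec[:, j])
        covs.append(S)
    return covs

def manifold_parzen_density(x, X, covs):
    """Mixture with one Gaussian per training point, centered on that point."""
    return float(np.mean([multivariate_normal.pdf(x, mean=xi, cov=S) for xi, S in zip(X, covs)]))
```

The σ0² floor keeps each covariance well conditioned while the leading eigen-directions stay "fat" along the estimated manifold directions, which is the property exploited throughout the rest of the paper.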
24 3 Non-Local Manifold Tangent Learning In (Bengio and Monperrus, 2005) a manifold learning algorithm was introduced in which the tangent plane of a d-dimensional manifold at x is learned as a function of x ∈ R^D, using globally estimated parameters. [sent-45, score-1.121]
25 The output of the predictor function F (x) is a d × D matrix whose d rows are the d (possibly non-orthogonal) vectors that span the tangent plane. [sent-46, score-0.288]
26 The training information about the tangent plane is obtained by considering pairs of near neighbors xi and xj in the training set. [sent-47, score-0.907]
27 Consider the predicted tangent plane of the manifold at xi , characterized by the rows of F (xi ). [sent-48, score-1.004]
28 For a good predictor we expect the vector (xi − xj) to be close to its projection on the tangent plane, with local coordinates w ∈ R^d. [sent-49, score-0.296]
29 The training criterion chosen in (Bengio and Monperrus, 2005) then minimizes the sum, over such pairs (xi, xj), of the sine of the projection angle. [sent-51, score-0.273]
30 It is a heuristic criterion, which will be replaced in our new algorithm by one derived from the maximum likelihood criterion, considering that F (xi ) indirectly provides the principal eigenvectors of the local covariance matrix at xi . [sent-54, score-0.526]
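As an illustration of the projection-based criterion just described, the following sketch (our own; the function name and the least-squares computation of the local coordinates w are assumptions) computes the squared sine of the angle between xj − xi and the predicted tangent plane spanned by the rows of F(xi):

```python
import numpy as np

def squared_sin_projection_angle(F_xi, xi, xj, eps=1e-12):
    """F_xi: (d, D) matrix whose rows span the predicted tangent plane at x_i.
    Returns ||(x_j - x_i) - proj(x_j - x_i)||^2 / ||x_j - x_i||^2."""
    delta = xj - xi
    # local coordinates w minimizing ||F_xi^T w - delta|| (least squares)
    w, *_ = np.linalg.lstsq(F_xi.T, delta, rcond=None)
    residual = delta - F_xi.T @ w
    return float(residual @ residual / (delta @ delta + eps))
```

Averaging this quantity over pairs of near neighbors gives a criterion that is small when the predicted plane aligns with the local directions of variation in the data.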
31 Both criteria gave similar results experimentally, but the model proposed here yields a complete density estimator. [sent-55, score-0.112]
32 In both cases F (xi ) can be interpreted as specifying the directions in which one expects to see the most variations when going from xi to one of its near neighbors in a finite sample. [sent-56, score-0.545]
33 Figure 1: Illustration of the local parametrization of local or Non-Local Manifold Parzen. [sent-57, score-0.621]
34 The examples around training point xi are modeled by a Gaussian. [sent-58, score-0.414]
35 µ(xi) specifies the offset of that Gaussian's center from xi, which should be non-zero when xi is off the manifold. [sent-59, score-0.271]
36 The vk's are the principal directions of the Gaussian and are tangent vectors of the manifold. [sent-60, score-0.395]
37 4 Proposed Algorithm: Non-Local Manifold Parzen Windows In equations (1) and (2) we wrote µ(xi ) and S(xi ) as if they were functions of xi rather than simply using indices µi and Si . [sent-62, score-0.271]
38 This is because we introduce here a non-local version of Manifold Parzen Windows, inspired by the non-local manifold tangent learning algorithm, i.e. [sent-63, score-0.585]
39 one in which we can share information about the density across different regions of space. [sent-65, score-0.112]
40 In our experiments we use a neural network with nhid hidden neurons, taking xi as input, to predict µ(xi), σ_noise²(xi), and the s_j²(xi) and vj(xi). [sent-66, score-0.464]
41 We denote by F(xi) the matrix whose rows are the vectors output by the neural network. [sent-69, score-0.089]
42 From it we obtain the s_j²(xi) and vj(xi) by performing a singular value decomposition of F(xi). [sent-70, score-0.115]
43 Moreover, to make sure σ_noise² does not get too small, which could make the optimization unstable, we impose σ_noise²(xi) = s_noise²(xi) + σ0², where s_noise(·) is an output of the neural network and σ0² is a fixed constant. [sent-73, score-0.108]
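Concretely, one plausible reading of this parametrization is sketched below (assuming the local covariance has the form S(x) = σ_noise²(x) I + F(x)ᵀ F(x); the function name, shapes and default sigma0_sq are our own):

```python
import numpy as np

def local_gaussian_params(raw_F, raw_mu, raw_s_noise, sigma0_sq=0.01):
    """raw_F: (d, D) unconstrained network output whose rows will span the tangent plane.
    raw_mu: (D,) predicted offset mu(x).  raw_s_noise: scalar network output s_noise(x)."""
    d, D = raw_F.shape
    sigma_noise_sq = raw_s_noise ** 2 + sigma0_sq        # noise floor keeps S well conditioned
    # SVD of F: rows of Vt are the principal directions v_j, singular values give the spreads s_j
    _, s, Vt = np.linalg.svd(raw_F, full_matrices=False)
    S = sigma_noise_sq * np.eye(D) + raw_F.T @ raw_F     # = sigma_noise^2 I + sum_j s_j^2 v_j v_j^T
    principal_vars = s ** 2 + sigma_noise_sq             # variance along each tangent direction
    return raw_mu, S, Vt, principal_vars
```

Under this reading, F(x)ᵀF(x) contributes exactly the s_j² v_j v_jᵀ terms, so the directional variances along the tangent directions come out as s_j²(x) + σ_noise²(x), matching the description that follows.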
44 The Gaussian centered near xi tells us how neighbors of xi are expected to differ from xi . [sent-76, score-1.014]
45 Its "principal" vectors vj(xi) span the tangent plane of the manifold near xi. [sent-77, score-1.069]
46 The Gaussian center variation µ(xi) tells us how xi is located with respect to its projection on the manifold. [sent-78, score-0.325]
47 The noise variance σ_noise²(xi) tells us how far from the manifold to expect neighbors, and the directional variances s_j²(xi) + σ_noise²(xi) tell us how far to expect neighbors along the different local axes of the manifold, near xi's projection on the manifold. [sent-79, score-1.079]
48 In this way, the globally shared parameters can potentially discover shared structure among the different covariance matrices in the mixture. [sent-83, score-0.099]
49 (1) For each xi ∈ X: (2) Collect the max(k, kµ) nearest neighbors of xi. [sent-85, score-0.494]
50 Below, call yj one of the k nearest neighbors, and yj^µ one of the kµ nearest neighbors. [sent-86, score-0.354]
51 (3) Perform a stochastic gradient step on parameters of S(·) and µ(·), using the negative log-likelihood error signal on the yj , with a Gaussian of mean xi + µ(xi ) and of covariance matrix S(xi ). [sent-87, score-0.469]
52 The approximate gradients are: ∂C(yj^µ, xi)/∂µ(xi) = −(1/n_kµ(yj^µ)) S(xi)^-1 (yj^µ − xi − µ(xi)). [sent-88, score-0.347]
53 ∂C(yj, xi)/∂σ_noise²(xi) = 0.5 (1/n_k(yj)) [ Tr(S(xi)^-1) − ||(yj − xi − µ(xi))′ S(xi)^-1||² ]; ∂C(yj, xi)/∂F(xi) = (1/n_k(yj)) F(xi) S(xi)^-1 [ I − (yj − xi − µ(xi))(yj − xi − µ(xi))′ S(xi)^-1 ], where n_k(y) = |N_k(y)| is the number of points in the training set that have y among their k nearest neighbors. [sent-89, score-1.208]
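A sketch of these gradient formulas in code (again assuming S(xi) = σ_noise²(xi) I + F(xi)ᵀ F(xi); C is the weighted negative log-likelihood of a single neighbor, and using one weight nk_yj for all three gradients is a simplification of the k / kµ distinction above):

```python
import numpy as np

def nlmp_gradients(F, mu, sigma_noise_sq, xi, yj, nk_yj):
    """Gradients of C(y_j, x_i) = (1 / n_k(y_j)) * NLL of y_j under N(x_i + mu, S),
    with S = sigma_noise^2 I + F^T F, w.r.t. mu, sigma_noise^2 and F."""
    d, D = F.shape
    S = sigma_noise_sq * np.eye(D) + F.T @ F
    S_inv = np.linalg.inv(S)
    r = yj - xi - mu                                   # residual of the neighbor
    grad_mu = -(1.0 / nk_yj) * (S_inv @ r)
    grad_sigma = 0.5 / nk_yj * (np.trace(S_inv) - r @ S_inv @ S_inv @ r)
    grad_F = (1.0 / nk_yj) * F @ S_inv @ (np.eye(D) - np.outer(r, r) @ S_inv)
    return grad_mu, grad_sigma, grad_F

# In the full algorithm these quantities are back-propagated through the network
# that outputs mu(x_i), s_noise(x_i) and F(x_i), one stochastic step per neighbor.
```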
54 Training iterates until the average NLL of NLMP density estimation on a validation set stops decreasing. Result: trained µ(·) and S(·) functions, with corresponding σ0². [sent-92, score-0.153]
55 However, there may be a different and cheaper way to compute an estimate of the density at x. [sent-99, score-0.112]
56 We build here on an idea suggested in (Vincent, 2003), which yields an estimator that does not exactly integrate to one, but this is not an issue if the estimator is to be used for applications such as classification. [sent-100, score-0.206]
57 We use a "hard" neighborhood here (i.e., a local weighting kernel that assigns a weight of 1 to the k nearest neighbors and 0 to the rest), but it could easily be generalized to "soft" weighting, as in (Vincent, 2003). [sent-103, score-0.259]
58 Let us decompose the true density at x as: p(x) = p(x | x ∈ Bk(x)) P(Bk(x)), where Bk(x) represents the spherical ball centered on x and containing the k nearest neighbors of x (i.e. [sent-104, score-0.36]
59 the ball with radius ||x − Nk(x)||, where Nk(x) is the k-th nearest neighbor of x in the training set). [sent-106, score-0.135]
60 It can be shown that the above NLMP learning procedure looks for functions µ(·) and S(·) that best characterize the distribution of the k training-set nearest neighbors of x as the normal N (·; x + µ(x), S(x)). [sent-107, score-0.192]
61 This yields the estimator p(x) ≈ (k/n) N(x; x + µ(x), S(x)), which requires only O(d·D) computation at the test point. [sent-111, score-0.103]
62 We call this estimator Test-centric NLMP, since it considers only the Gaussian predicted at the test point, rather than a mixture of all the Gaussians obtained at the training points. [sent-113, score-0.249]
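A sketch of the Test-centric evaluation under this reading (the k/n factor for P(Bk(x)) follows from the ball decomposition above; mu_fn and S_fn stand for the trained predictors and are our own names):

```python
import numpy as np
from scipy.stats import multivariate_normal

def test_centric_nlmp(x, mu_fn, S_fn, k, n):
    """p(x) ~= P(B_k(x)) * N(x; x + mu(x), S(x)), with P(B_k(x)) ~= k / n."""
    mu_x = mu_fn(x)            # predicted offset of the local Gaussian center
    S_x = S_fn(x)              # predicted local covariance matrix
    return (k / n) * multivariate_normal.pdf(x, mean=x + mu_x, cov=S_x)

# Unlike the full NLMP mixture, this needs one forward pass of the predictors and one
# Gaussian evaluation per test point, instead of one Gaussian per training point.
```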
63 6 Experimental Results We have performed comparative experiments on both toy and real-world data, on density estimation and classification tasks. [sent-114, score-0.147]
64 The sinus data set includes examples sampled around a sine curve. [sent-118, score-0.287]
65 In the spiral data set examples are sampled near a spiral. [sent-119, score-0.145]
66 The hyper-parameters are the number of principal directions (i.e. [sent-122, score-0.182]
67 the dimension of the manifold), the numbers of nearest neighbors k and kµ, the minimum constant noise variance σ0², and the number of hidden units of the neural network. [sent-124, score-0.281]
68 • Gaussian mixture with full but regularized covariance matrices. [sent-125, score-0.129]
69 • Parzen Windows density estimator, with a spherical Gaussian kernel. [sent-129, score-0.148]
70 The hyper-parameters are the number of principal components, the number k of the nearest neighbor kernel, and the minimum eigenvalue σ0². [sent-132, score-0.248]
71 Note that, for these experiments, the number of principal directions (or components) was fixed to 1 for both NLMP and Manifold Parzen. [sent-133, score-0.182]
72 To help understand why Non-Local Manifold Parzen works well on these data, figure 2 illustrates the learned densities for the sinus and spiral data. [sent-135, score-0.229]
73 Basically, it works better here because it yields an estimator that is less sensitive to the specific samples around each test point, thanks to its ability to share structure. [sent-136, score-0.246]
(Table 1: average negative log-likelihood of Non-Local MP, Manifold Parzen, Gauss Mix Full, and Parzen Windows on the sinus and spiral data.)
74 Table 2: Average Negative Log-Likelihood on the digit rotation experiment, when testing on a digit class (1's) not used during training, for Non-Local Manifold Parzen, Manifold Parzen, and Parzen Windows. [sent-153, score-0.323]
75 Figure 2: Illustration of the learned densities (sinus on top, spiral on bottom) for four compared models. [sent-156, score-0.092]
76 The images were rotated by up to 0.2 radians and all these images were used as training data. [sent-165, score-0.142]
77 We used NLMP for training, and for testing we formed an augmented mixture with Gaussians centered not only on the training examples, but also on the original unrotated 1 digits. [sent-166, score-0.19]
78 We tested our estimator on the rotated versions of each of the 1 digits. [sent-167, score-0.217]
79 We compared this to Manifold Parzen trained on the training data containing both the original and rotated images of the training class digits and the unrotated 1 digits. [sent-168, score-0.364]
80 The objective of the experiment was to see if the model was able to infer the density correctly around the original unrotated images, i.e. [sent-169, score-0.198]
81 to predict a high probability for the rotated versions of these images. [sent-171, score-0.114]
82 In table 2 we see quantitatively that the non-local estimator predicts the rotated images much better. [sent-172, score-0.267]
83 As qualitative evidence, we used small steps in the principal direction predicted by Test-centric NLMP to rotate an image of the digit 1. [sent-173, score-0.281]
84 To make this task even more illustrative of the generalization potential of non-local learning, we followed the tangent in the direction opposite to the rotations of the training set. [sent-174, score-0.306]
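A sketch of this qualitative test (our own illustration; dir_fn stands for a function returning the leading predicted tangent vector v1(x), and the step size and count are hypothetical):

```python
import numpy as np

def follow_tangent(x0, dir_fn, step=0.05, n_steps=20, sign=-1.0):
    """Walk along the leading predicted tangent direction, re-estimating it at every step.
    sign=-1.0 follows the direction opposite to the rotations seen during training."""
    x = np.asarray(x0, dtype=float).copy()
    v_prev = None
    for _ in range(n_steps):
        v = dir_fn(x)
        v = v / np.linalg.norm(v)
        if v_prev is not None and float(v @ v_prev) < 0:
            v = -v                      # keep a consistent orientation between steps
        x = x + sign * step * v
        v_prev = v
    return x
```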
85 It can be seen in figure 3 that the rotated digit obtained is quite similar to the same digit analytically rotated. [sent-175, score-0.373] [sent-178, score-0.264]
86 Figure 3: From left to right: original image of a digit 1; rotated analytically by −0.2 radians; rotation predicted using Non-Local MP; rotation predicted using MP. [sent-176, score-0.131]
87 Rotations are obtained by following the tangent vector in small steps. [sent-177, score-0.177]
89 For comparison, we tried to apply the same rotation technique to that digit, but by using the principal direction, computed by Manifold Parzen, of its nearest neighbor’s Gaussian component in the training set. [sent-179, score-0.341]
90 In this experiment, to make sure that NLMP focusses on the tangent plane of the rotation manifold, we fixed the number of principal directions d = 1 and the number of nearest neighbors k = 1, and also imposed µ(·) = 0. [sent-181, score-0.726]
91 The original training set (7291) was split into a training (first 6291) and validation set (last 1000), used to tune hyper-parameters. [sent-185, score-0.209]
92 One density estimator for each of the 10 digit classes is estimated. [sent-186, score-0.334]
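For the classification experiments, a minimal sketch of the standard construction implied here (one trained density estimator per class combined by Bayes' rule; density_fns and priors stand in for the per-class NLMP estimators and the class frequencies):

```python
import numpy as np

def classify_by_density(x, density_fns, priors):
    """density_fns: one callable p_hat(x | class c) per class; priors: class frequencies.
    Returns the class index maximizing p_hat(x | c) * P(c)."""
    scores = [p_c(x) * prior for p_c, prior in zip(density_fns, priors)]
    return int(np.argmax(scores))
```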
93 7 Conclusion We have proposed a non-parametric density estimator that, unlike its predecessors, is able to generalize far from the training examples by capturing global structural features of the density. [sent-214, score-0.451]
94 Figure 4: Transformations learned by Non-local MP. The top row shows digits taken from the USPS training set, and the two following rows display the results of steps taken along one of the 7 principal directions learned by Non-local MP, the third row corresponding to more steps than the second. [sent-215, score-0.331]
95 It does so by learning a function with global parameters that successfully predicts the local shape of the density, i.e. [sent-217, score-0.113]
96 the tangent plane of the manifold along which the density concentrates. [sent-219, score-0.786]
97 Three types of experiments showed that this idea works, yielding improved density estimation and reduced classification error compared to its local predecessors. [sent-220, score-0.154]
98 Technical Report 1258, Département d'informatique et recherche opérationnelle, Université de Montréal. [sent-228, score-0.09]
99 Technical report, Département d'informatique et recherche opérationnelle, Université de Montréal. [sent-233, score-0.09]
100 PhD thesis, Université de Montréal, Département d'informatique et recherche opérationnelle, Montréal, Qc. [sent-262, score-0.09]
wordName wordTfidf (topN-words)
[('parzen', 0.507), ('manifold', 0.408), ('xi', 0.271), ('nlmp', 0.247), ('tangent', 0.177), ('bengio', 0.173), ('mp', 0.171), ('windows', 0.151), ('vincent', 0.119), ('digit', 0.119), ('vj', 0.115), ('sinus', 0.114), ('rotated', 0.114), ('density', 0.112), ('principal', 0.109), ('neighbors', 0.109), ('estimator', 0.103), ('monperrus', 0.095), ('yj', 0.094), ('plane', 0.089), ('training', 0.084), ('nearest', 0.083), ('bk', 0.081), ('covariance', 0.077), ('larochelle', 0.076), ('nk', 0.076), ('directions', 0.073), ('curse', 0.066), ('noise', 0.066), ('rotation', 0.065), ('near', 0.062), ('montr', 0.06), ('roux', 0.06), ('usps', 0.06), ('nhid', 0.057), ('unrotated', 0.057), ('delalleau', 0.056), ('spiral', 0.053), ('rationnelle', 0.05), ('concentrates', 0.045), ('informatique', 0.045), ('recherche', 0.045), ('partement', 0.045), ('gaussian', 0.045), ('universit', 0.045), ('shape', 0.044), ('local', 0.042), ('validation', 0.041), ('learned', 0.039), ('vectors', 0.036), ('spherical', 0.036), ('op', 0.035), ('toy', 0.035), ('upon', 0.034), ('generalize', 0.034), ('le', 0.033), ('radians', 0.033), ('decoste', 0.033), ('predicted', 0.033), ('neighbor', 0.031), ('xj', 0.031), ('examples', 0.03), ('tells', 0.03), ('variations', 0.03), ('around', 0.029), ('mixture', 0.029), ('scholkopf', 0.028), ('goldberger', 0.028), ('matrix', 0.027), ('neighborhood', 0.027), ('global', 0.027), ('rows', 0.026), ('estimators', 0.026), ('analytically', 0.026), ('builds', 0.026), ('kernel', 0.025), ('rotations', 0.025), ('montreal', 0.025), ('quantitatively', 0.025), ('rd', 0.025), ('images', 0.025), ('projection', 0.024), ('classi', 0.024), ('dimensionality', 0.024), ('variance', 0.023), ('understand', 0.023), ('regularized', 0.023), ('svm', 0.022), ('discover', 0.022), ('predictor', 0.022), ('xl', 0.022), ('far', 0.022), ('network', 0.021), ('sure', 0.021), ('bottou', 0.021), ('criterion', 0.02), ('testing', 0.02), ('ball', 0.02), ('direction', 0.02), ('locally', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 138 nips-2005-Non-Local Manifold Parzen Windows
Author: Yoshua Bengio, Hugo Larochelle, Pascal Vincent
Abstract: To escape from the curse of dimensionality, we claim that one can learn non-local functions, in the sense that the value and shape of the learned function at x must be inferred using examples that may be far from x. With this objective, we present a non-local non-parametric density estimator. It builds upon previously proposed Gaussian mixture models with regularized covariance matrices to take into account the local shape of the manifold. It also builds upon recent work on non-local estimators of the tangent plane of a manifold, which are able to generalize in places with little training data, unlike traditional, local, non-parametric models.
2 0.22721998 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines
Author: Yoshua Bengio, Olivier Delalleau, Nicolas L. Roux
Abstract: We present a series of theoretical arguments supporting the claim that a large class of modern learning algorithms that rely solely on the smoothness prior – with similarity between examples expressed with a local kernel – are sensitive to the curse of dimensionality, or more precisely to the variability of the target. Our discussion covers supervised, semisupervised and unsupervised learning algorithms. These algorithms are found to be local in the sense that crucial properties of the learned function at x depend mostly on the neighbors of x in the training set. This makes them sensitive to the curse of dimensionality, well studied for classical non-parametric statistical learning. We show in the case of the Gaussian kernel that when the function to be learned has many variations, these algorithms require a number of training examples proportional to the number of variations, which could be large even though there may exist short descriptions of the target function, i.e. their Kolmogorov complexity may be low. This suggests that there exist non-local learning algorithms that at least have the potential to learn about such structured but apparently complex functions (because locally they have many variations), while not using very specific prior domain knowledge. 1
3 0.15702115 172 nips-2005-Selecting Landmark Points for Sparse Manifold Learning
Author: Jorge Silva, Jorge Marques, João Lemos
Abstract: There has been a surge of interest in learning non-linear manifold models to approximate high-dimensional data. Both for computational complexity reasons and for generalization capability, sparsity is a desired feature in such models. This usually means dimensionality reduction, which naturally implies estimating the intrinsic dimension, but it can also mean selecting a subset of the data to use as landmarks, which is especially important because many existing algorithms have quadratic complexity in the number of observations. This paper presents an algorithm for selecting landmarks, based on LASSO regression, which is well known to favor sparse approximations because it uses regularization with an l1 norm. As an added benefit, a continuous manifold parameterization, based on the landmarks, is also found. Experimental results with synthetic and real data illustrate the algorithm. 1
4 0.11883169 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
Author: Tong Zhang, Rie Kubota Ando
Abstract: We consider a framework for semi-supervised learning using spectral decomposition based un-supervised kernel design. This approach subsumes a class of previously proposed semi-supervised learning methods on data graphs. We examine various theoretical properties of such methods. In particular, we derive a generalization performance bound, and obtain the optimal kernel design by minimizing the bound. Based on the theoretical analysis, we are able to demonstrate why spectral kernel design based methods can often improve the predictive performance. Experiments are used to illustrate the main consequences of our analysis.
5 0.11592957 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification
Author: Kilian Q. Weinberger, John Blitzer, Lawrence K. Saul
Abstract: We show how to learn a Mahanalobis distance metric for k-nearest neighbor (kNN) classification by semidefinite programming. The metric is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. On seven data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification—for example, achieving a test error rate of 1.3% on the MNIST handwritten digits. As in support vector machines (SVMs), the learning problem reduces to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our framework requires no modification or extension for problems in multiway (as opposed to binary) classification. 1
6 0.10665233 66 nips-2005-Estimation of Intrinsic Dimensionality Using High-Rate Vector Quantization
7 0.10644811 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning
8 0.10335619 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning
9 0.10170499 38 nips-2005-Beyond Gaussian Processes: On the Distributions of Infinite Networks
10 0.08640369 50 nips-2005-Convex Neural Networks
11 0.080835819 116 nips-2005-Learning Topology with the Generative Gaussian Graph and the EM Algorithm
12 0.078536406 189 nips-2005-Tensor Subspace Analysis
13 0.075085789 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators
14 0.074718051 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification
15 0.072366878 97 nips-2005-Inferring Motor Programs from Images of Handwritten Digits
16 0.071433 47 nips-2005-Consistency of one-class SVM and related algorithms
17 0.071177103 140 nips-2005-Nonparametric inference of prior probabilities from Bayes-optimal behavior
18 0.069923483 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables
19 0.063798331 201 nips-2005-Variational Bayesian Stochastic Complexity of Mixture Models
20 0.059295427 44 nips-2005-Computing the Solution Path for the Regularized Support Vector Regression
topicId topicWeight
[(0, 0.208), (1, 0.103), (2, -0.097), (3, -0.04), (4, -0.002), (5, 0.042), (6, 0.118), (7, -0.132), (8, 0.17), (9, -0.051), (10, 0.104), (11, 0.015), (12, 0.057), (13, 0.068), (14, -0.061), (15, 0.096), (16, -0.116), (17, 0.214), (18, -0.073), (19, 0.15), (20, 0.029), (21, 0.066), (22, -0.131), (23, 0.16), (24, 0.008), (25, -0.044), (26, -0.273), (27, 0.015), (28, -0.085), (29, 0.133), (30, -0.038), (31, 0.108), (32, 0.017), (33, 0.003), (34, -0.07), (35, -0.041), (36, 0.091), (37, 0.022), (38, 0.037), (39, -0.096), (40, 0.061), (41, -0.089), (42, 0.019), (43, -0.016), (44, -0.051), (45, -0.04), (46, 0.014), (47, 0.014), (48, -0.029), (49, -0.047)]
simIndex simValue paperId paperTitle
same-paper 1 0.96090382 138 nips-2005-Non-Local Manifold Parzen Windows
Author: Yoshua Bengio, Hugo Larochelle, Pascal Vincent
Abstract: To escape from the curse of dimensionality, we claim that one can learn non-local functions, in the sense that the value and shape of the learned function at x must be inferred using examples that may be far from x. With this objective, we present a non-local non-parametric density estimator. It builds upon previously proposed Gaussian mixture models with regularized covariance matrices to take into account the local shape of the manifold. It also builds upon recent work on non-local estimators of the tangent plane of a manifold, which are able to generalize in places with little training data, unlike traditional, local, non-parametric models.
2 0.78192627 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines
Author: Yoshua Bengio, Olivier Delalleau, Nicolas L. Roux
Abstract: We present a series of theoretical arguments supporting the claim that a large class of modern learning algorithms that rely solely on the smoothness prior – with similarity between examples expressed with a local kernel – are sensitive to the curse of dimensionality, or more precisely to the variability of the target. Our discussion covers supervised, semisupervised and unsupervised learning algorithms. These algorithms are found to be local in the sense that crucial properties of the learned function at x depend mostly on the neighbors of x in the training set. This makes them sensitive to the curse of dimensionality, well studied for classical non-parametric statistical learning. We show in the case of the Gaussian kernel that when the function to be learned has many variations, these algorithms require a number of training examples proportional to the number of variations, which could be large even though there may exist short descriptions of the target function, i.e. their Kolmogorov complexity may be low. This suggests that there exist non-local learning algorithms that at least have the potential to learn about such structured but apparently complex functions (because locally they have many variations), while not using very specific prior domain knowledge. 1
3 0.63353997 66 nips-2005-Estimation of Intrinsic Dimensionality Using High-Rate Vector Quantization
Author: Maxim Raginsky, Svetlana Lazebnik
Abstract: We introduce a technique for dimensionality estimation based on the notion of quantization dimension, which connects the asymptotic optimal quantization error for a probability distribution on a manifold to its intrinsic dimension. The definition of quantization dimension yields a family of estimation algorithms, whose limiting case is equivalent to a recent method based on packing numbers. Using the formalism of high-rate vector quantization, we address issues of statistical consistency and analyze the behavior of our scheme in the presence of noise.
4 0.60010624 172 nips-2005-Selecting Landmark Points for Sparse Manifold Learning
Author: Jorge Silva, Jorge Marques, João Lemos
Abstract: There has been a surge of interest in learning non-linear manifold models to approximate high-dimensional data. Both for computational complexity reasons and for generalization capability, sparsity is a desired feature in such models. This usually means dimensionality reduction, which naturally implies estimating the intrinsic dimension, but it can also mean selecting a subset of the data to use as landmarks, which is especially important because many existing algorithms have quadratic complexity in the number of observations. This paper presents an algorithm for selecting landmarks, based on LASSO regression, which is well known to favor sparse approximations because it uses regularization with an l1 norm. As an added benefit, a continuous manifold parameterization, based on the landmarks, is also found. Experimental results with synthetic and real data illustrate the algorithm. 1
5 0.52023315 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning
Author: Zhenyue Zhang, Hongyuan Zha
Abstract: We propose a fast manifold learning algorithm based on the methodology of domain decomposition. Starting with the set of sample points partitioned into two subdomains, we develop the solution of the interface problem that can glue the embeddings on the two subdomains into an embedding on the whole domain. We provide a detailed analysis to assess the errors produced by the gluing process using matrix perturbation theory. Numerical examples are given to illustrate the efficiency and effectiveness of the proposed methods. 1
6 0.49370828 116 nips-2005-Learning Topology with the Generative Gaussian Graph and the EM Algorithm
7 0.49308822 189 nips-2005-Tensor Subspace Analysis
8 0.44845375 38 nips-2005-Beyond Gaussian Processes: On the Distributions of Infinite Networks
9 0.4398413 44 nips-2005-Computing the Solution Path for the Regularized Support Vector Regression
10 0.41058379 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction
11 0.39926007 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
12 0.38959742 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification
13 0.35105628 16 nips-2005-A matching pursuit approach to sparse Gaussian process regression
14 0.34684443 69 nips-2005-Fast Gaussian Process Regression using KD-Trees
15 0.34285042 127 nips-2005-Mixture Modeling by Affinity Propagation
16 0.33668885 12 nips-2005-A PAC-Bayes approach to the Set Covering Machine
17 0.33628136 168 nips-2005-Rodeo: Sparse Nonparametric Regression in High Dimensions
18 0.31786311 71 nips-2005-Fast Krylov Methods for N-Body Learning
19 0.31280306 195 nips-2005-Transfer learning for text classification
20 0.29878688 140 nips-2005-Nonparametric inference of prior probabilities from Bayes-optimal behavior
topicId topicWeight
[(3, 0.04), (9, 0.014), (10, 0.04), (27, 0.027), (31, 0.049), (34, 0.131), (50, 0.015), (55, 0.026), (60, 0.283), (69, 0.076), (73, 0.044), (88, 0.094), (91, 0.045)]
simIndex simValue paperId paperTitle
1 0.94505328 134 nips-2005-Neural mechanisms of contrast dependent receptive field size in V1
Author: Jim Wielaard, Paul Sajda
Abstract: Based on a large scale spiking neuron model of the input layers 4Cα and β of macaque, we identify neural mechanisms for the observed contrast dependent receptive field size of V1 cells. We observe a rich variety of mechanisms for the phenomenon and analyze them based on the relative gain of excitatory and inhibitory synaptic inputs. We observe an average growth in the spatial extent of excitation and inhibition for low contrast, as predicted from phenomenological models. However, contrary to phenomenological models, our simulation results suggest this is neither sufficient nor necessary to explain the phenomenon.
2 0.92038965 29 nips-2005-Analyzing Coupled Brain Sources: Distinguishing True from Spurious Interaction
Author: Guido Nolte, Andreas Ziehe, Frank Meinecke, Klaus-Robert Müller
Abstract: When trying to understand the brain, it is of fundamental importance to analyse (e.g. from EEG/MEG measurements) what parts of the cortex interact with each other in order to infer more accurate models of brain activity. Common techniques like Blind Source Separation (BSS) can estimate brain sources and single out artifacts by using the underlying assumption of source signal independence. However, physiologically interesting brain sources typically interact, so BSS will—by construction— fail to characterize them properly. Noting that there are truly interacting sources and signals that only seemingly interact due to effects of volume conduction, this work aims to contribute by distinguishing these effects. For this a new BSS technique is proposed that uses anti-symmetrized cross-correlation matrices and subsequent diagonalization. The resulting decomposition consists of the truly interacting brain sources and suppresses any spurious interaction stemming from volume conduction. Our new concept of interacting source analysis (ISA) is successfully demonstrated on MEG data. 1
3 0.91266841 26 nips-2005-An exploration-exploitation model based on norepinepherine and dopamine activity
Author: Samuel M. McClure, Mark S. Gilzenrat, Jonathan D. Cohen
Abstract: We propose a model by which dopamine (DA) and norepinepherine (NE) combine to alternate behavior between relatively exploratory and exploitative modes. The model is developed for a target detection task for which there is extant single neuron recording data available from locus coeruleus (LC) NE neurons. An exploration-exploitation trade-off is elicited by regularly switching which of the two stimuli are rewarded. DA functions within the model to change synaptic weights according to a reinforcement learning algorithm. Exploration is mediated by the state of LC firing, with higher tonic and lower phasic activity producing greater response variability. The opposite state of LC function, with lower baseline firing rate and greater phasic responses, favors exploitative behavior. Changes in LC firing mode result from combined measures of response conflict and reward rate, where response conflict is monitored using models of anterior cingulate cortex (ACC). Increased long-term response conflict and decreased reward rate, which occurs following reward contingency switch, favors the higher tonic state of LC function and NE release. This increases exploration, and facilitates discovery of the new target. 1 In t rod u ct i on A central problem in reinforcement learning is determining how to adaptively move between exploitative and exploratory behaviors in changing environments. We propose a set of neurophysiologic mechanisms whose interaction may mediate this behavioral shift. Empirical work on the midbrain dopamine (DA) system has suggested that this system is particularly well suited for guiding exploitative behaviors. This hypothesis has been reified by a number of studies showing that a temporal difference (TD) learning algorithm accounts for activity in these neurons in a wide variety of behavioral tasks [1,2]. DA release is believed to encode a reward prediction error signal that acts to change synaptic weights relevant for producing behaviors [3]. Through learning, this allows neural pathways to predict future expected reward through the relative strength of their synaptic connections [1]. Decision-making procedures based on these value estimates are necessarily greedy. Including reward bonuses for exploratory choices supports non-greedy actions [4] and accounts for additional data derived from DA neurons [5]. We show that combining a DA learning algorithm with models of response conflict detection [6] and NE function [7] produces an effective annealing procedure for alternating between exploration and exploitation. NE neurons within the LC alternate between two firing modes [8]. In the first mode, known as the phasic mode, NE neurons fire at a low baseline rate but have relatively robust phasic responses to behaviorally salient stimuli. The second mode, called the tonic mode, is associated with a higher baseline firing and absent or attenuated phasic responses. The effects of NE on efferent areas are modulatory in nature, and are well captured as a change in the gain of efferent inputs so that neuronal responses are potentiated in the presence of NE [9]. Thus, in phasic mode, the LC provides transient facilitation in processing, time-locked to the presence of behaviorally salient information in motor or decision areas. Conversely, in tonic mode, higher overall LC discharge rate increases gain generally and hence increases the probability of arbitrary responding. Consistent with this account, for periods when NE neurons are in the phasic mode, monkey performance is nearly perfect. 
However, when NE neurons are in the tonic mode, performance is more erratic, with increased response times and error rate [8]. These findings have led to a recent characterization of the LC as a dynamic temporal filter, adjusting the system's relative responsivity to salient and irrelevant information [8]. In this way, the LC is ideally positioned to mediate the shift between exploitative and exploratory behavior. The parameters that underlie changes in LC firing mode remain largely unexplored. Based on data from a target detection task by Aston-Jones and colleagues [10], we propose that LC firing mode is determined in part by measures of response conflict and reward rate as calculated by the ACC and OFC, respectively [8]. Together, the ACC and OFC are the principle sources of cortical input to the LC [8]. Activity in the ACC is known, largely through human neuroimaging experiments, to change in accord with response conflict [6]. In brief, relatively equal activity in competing behavioral responses (reflecting uncertainty) produces high conflict. Low conflict results when one behavioral response predominates. We propose that increased long-term response conflict biases the LC towards a tonic firing mode. Increased conflict necessarily follows changes in reward contingency. As the previously rewarded target no longer produces reward, there will be a relative increase in response ambiguity and hence conflict. This relationship between conflict and LC firing is analogous to other modeling work [11], which proposes that increased tonic firing reflects increased environmental uncertainty. As a final component to our model, we hypothesize that the OFC maintains an ongoing estimate in reward rate, and that this estimate of reward rate also influences LC firing mode. As reward rate increases, we assume that the OFC tends to bias the LC in favor of phasic firing to target stimuli. We have aimed to fix model parameters based on previous work using simpler networks. We use parameters derived primarily from a previous model of the LC by Gilzenrat and colleagues [7]. Integration of response conflict by the ACC and its influence on LC firing was borrowed from unpublished work by Gilzenrat and colleagues in which they fit human behavioral data in a diminishing utilities task. Given this approach, we interpret our observed improvement in model performance with combined NE and DA function as validation of a mechanism for automatically switching between exploitative and exploratory action selection. 2 G o- No- G o Task and Core Mod el We have modeled an experiment in which monkeys performed a target detection task [10]. In the task, monkeys were shown either a vertical bar or a horizontal bar and were required to make or omit a motor response appropriately. Initially, the vertical bar was the target stimulus and correctly responding was rewarded with a squirt of fruit juice (r=1 in the model). Responding to the non-target horizontal stimulus resulted in time out punishment (r=-.1; Figure 1A). No responses to either the target or non-target gave zero reward. After the monkeys had fully acquired the task, the experimenters periodically switched the reward contingency such that the previously rewarded stimulus (target) became the distractor, and vice versa. Following such reversals, LC neurons were observed to change from emitting phasic bursts of firing to the target, to tonic firing following the switch, and slowly back to phasic firing for the new target as the new response criteria was obtained [10]. 
Figure 1: Task and model design. (A) Responses were required for targets in order to obtain reward. Responses to distractors resulted in a minor punishment. No responses gave zero reward. (B) In the model, vertical and horizontal bar inputs (I1 and I 2 ) fed to integrator neurons (X1 and X2 ) which then drove response units (Y1 and Y2 ). Responses were made if Y 1 or Y2 crossed a threshold while input units were active. We have previously modeled this task [7,12] with a three-layer connectionist network in which two input units, I1 and I 2 , corresponding to the vertical and horizontal bars, drive two mutually inhibitory integrator units, X1 and X2 . The integrator units subsequently feed two response units, Y1 and Y2 (Figure 1B). Responses are made whenever output from Y1 or Y2 crosses a threshold level of activity, θ. Relatively weak cross connections from each input unit to the opposite integrator unit (I1 to X2 and I 2 to X1 ) are intended to model stimulus similarity. Both the integrator and response units were modeled as noisy, leaky accumulators: ˙ X i =
same-paper 4 0.82291806 138 nips-2005-Non-Local Manifold Parzen Windows
Author: Yoshua Bengio, Hugo Larochelle, Pascal Vincent
Abstract: To escape from the curse of dimensionality, we claim that one can learn non-local functions, in the sense that the value and shape of the learned function at x must be inferred using examples that may be far from x. With this objective, we present a non-local non-parametric density estimator. It builds upon previously proposed Gaussian mixture models with regularized covariance matrices to take into account the local shape of the manifold. It also builds upon recent work on non-local estimators of the tangent plane of a manifold, which are able to generalize in places with little training data, unlike traditional, local, non-parametric models.
5 0.64069176 28 nips-2005-Analyzing Auditory Neurons by Learning Distance Functions
Author: Inna Weiner, Tomer Hertz, Israel Nelken, Daphna Weinshall
Abstract: We present a novel approach to the characterization of complex sensory neurons. One of the main goals of characterizing sensory neurons is to characterize dimensions in stimulus space to which the neurons are highly sensitive (causing large gradients in the neural responses) or alternatively dimensions in stimulus space to which the neuronal response are invariant (defining iso-response manifolds). We formulate this problem as that of learning a geometry on stimulus space that is compatible with the neural responses: the distance between stimuli should be large when the responses they evoke are very different, and small when the responses they evoke are similar. Here we show how to successfully train such distance functions using rather limited amount of information. The data consisted of the responses of neurons in primary auditory cortex (A1) of anesthetized cats to 32 stimuli derived from natural sounds. For each neuron, a subset of all pairs of stimuli was selected such that the responses of the two stimuli in a pair were either very similar or very dissimilar. The distance function was trained to fit these constraints. The resulting distance functions generalized to predict the distances between the responses of a test stimulus and the trained stimuli. 1
6 0.63023907 94 nips-2005-Identifying Distributed Object Representations in Human Extrastriate Visual Cortex
7 0.61034697 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines
8 0.60980302 109 nips-2005-Learning Cue-Invariant Visual Responses
9 0.58622831 181 nips-2005-Spiking Inputs to a Winner-take-all Network
10 0.58265692 50 nips-2005-Convex Neural Networks
11 0.58177334 150 nips-2005-Optimizing spatio-temporal filters for improving Brain-Computer Interfacing
12 0.57464719 157 nips-2005-Principles of real-time computing with feedback applied to cortical microcircuit models
13 0.56880832 183 nips-2005-Stimulus Evoked Independent Factor Analysis of MEG Data with Large Background Activity
14 0.56216717 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
15 0.56131989 184 nips-2005-Structured Prediction via the Extragradient Method
16 0.55959135 114 nips-2005-Learning Rankings via Convex Hull Separation
17 0.55950302 105 nips-2005-Large-Scale Multiclass Transduction
18 0.55749619 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
19 0.55625886 129 nips-2005-Modeling Neural Population Spiking Activity with Gibbs Distributions
20 0.5543077 67 nips-2005-Extracting Dynamical Structure Embedded in Neural Activity