nips2004-131: Non-Local Manifold Tangent Learning (knowledge graph by maker-knowledge-mining)
Source: pdf
Author: Yoshua Bengio, Martin Monperrus
Abstract: We claim and present arguments to the effect that a large class of manifold learning algorithms that are essentially local and can be framed as kernel learning algorithms will suffer from the curse of dimensionality at the dimension of the true underlying manifold. This observation suggests exploring non-local manifold learning algorithms which attempt to discover shared structure in the tangent planes at different positions. A criterion for such an algorithm is proposed and experiments estimating a tangent plane prediction function are presented, showing its advantages with respect to local manifold learning algorithms: it is able to generalize very far from the training data (on learning handwritten character image rotations), where a local non-parametric method fails.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: We claim and present arguments to the effect that a large class of manifold learning algorithms that are essentially local and can be framed as kernel learning algorithms will suffer from the curse of dimensionality at the dimension of the true underlying manifold. [sent-6, score-0.84]
2 This observation suggests exploring non-local manifold learning algorithms which attempt to discover shared structure in the tangent planes at different positions. [sent-7, score-1.204]
3 , 2004), and which represent the manifold on the basis of local neighborhood relations, very often constructed using the nearest neighbors graph (the graph with one vertex per observed example, and arcs between near neighbors). [sent-14, score-1.0]
4 The above methods characterize the manifold through an embedding which associates each training example (an input object) with a low-dimensional coordinate vector (the coordinates on the manifold). [sent-15, score-0.842]
5 Other closely related methods characterize the manifold as well as “noise” around it. [sent-16, score-0.699]
6 mixtures of factor analyzers (Ghahramani and Hinton, 1996), Manifold Parzen windows (Vincent and Bengio, 2003), and other local PCA models such as mixtures of probabilistic PCA (Tipping and Bishop, 1999). [sent-19, score-0.219]
7 In this paper we claim that there is a fundamental weakness with such kernel methods, due to the locality of learning: we show that the local tangent plane of the manifold at a point x is defined based mostly on the near neighbors of x according to some possibly data-dependent kernel K_D. [sent-21, score-1.782]
8 For example, if the manifold curves quickly around x, neighbors need to be closer for a locally linear approximation to be meaningful, which means that more data are needed. [sent-23, score-0.849]
9 The dimensionality of the manifold compounds that problem, because the amount of data needed grows exponentially with it. [sent-24, score-0.629]
10 Saying that y is “far” from x means that y is not well represented by its projection on the tangent plane at x. [sent-25, score-0.717]
11 In this paper we explore one way to address that problem, based on estimating the tangent planes of the manifolds as a function of x, with parameters that can be estimated not only from the data around x but from the whole dataset. [sent-26, score-0.735]
12 Note that there can be more than one manifold (e. [sent-27, score-0.629]
13 in vision, one may imagine a different manifold for each “class” of object), but the structure of these manifolds may be related, something that many previous manifold learning methods did not take advantage of. [sent-29, score-1.349]
14 We present experiments on a variety of tasks illustrating the weaknesses of the local manifold learning algorithms enumerated above. [sent-30, score-0.732]
15 2 Local Manifold Learning By “local manifold learning”, we mean a method that derives information about the local structure of the manifold (i. [sent-34, score-1.361]
16 implicitly its tangent directions) at x based mostly on the training examples “around” x. [sent-36, score-0.598]
17 There is a large group of manifold learning methods (as well as the spectral clustering methods) that share several characteristics, and can be seen as data-dependent kernel PCA (Bengio et al. [sent-37, score-0.707]
18 The embedding of the training set is obtained directly from the principal eigenvectors vk of M (the i-th element of vk gives the k-th coordinate of xi ’s embedding, possibly scaled by λk , i. [sent-44, score-0.33]
19 $e_k(x_i) = v_{ik}$) and the embedding for a new example can be estimated using the Nyström formula (Bengio et al. [sent-46, score-0.136]
20 , 2004): $e_k(x) = \frac{1}{\lambda_k} \sum_{i=1}^{n} v_{ki} K_D(x, x_i)$ for the k-th coordinate of x, where $\lambda_k$ is the k-th eigenvalue of M (the optional scaling by $\lambda_k$ would also apply). [sent-47, score-0.148]
21 The above equation says that the embedding for a new example x is a local interpolation of the manifold coordinates of its neighbors $x_i$, with interpolating weights given by $\frac{K_D(x, x_i)}{\lambda_k}$. [sent-48, score-0.982]
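To make the out-of-sample formula above concrete, here is a minimal numpy sketch (the function and argument names are illustrative, not from the paper; it assumes the principal eigenvectors and eigenvalues of the Gram matrix M of the data-dependent kernel have already been computed):

```python
import numpy as np

def nystrom_embedding(x, X_train, kernel, eigvecs, eigvals):
    """Out-of-sample embedding e_k(x) = (1/lambda_k) * sum_i v_{ki} K_D(x, x_i).

    X_train : (n, D) array of training inputs x_i
    kernel  : callable kernel(x, xi) -> float, the data-dependent kernel K_D
    eigvecs : (n, m) principal eigenvectors of the Gram matrix M (column k is v_k)
    eigvals : (m,)   corresponding eigenvalues lambda_k
    returns : (m,)   estimated embedding coordinates of x
    """
    k_vec = np.array([kernel(x, xi) for xi in X_train])  # K_D(x, x_i) for all i
    return (eigvecs.T @ k_vec) / eigvals                 # one coordinate per eigenvector
```

Each coordinate is a kernel-weighted sum over the training points, which is exactly the local-interpolation reading given above: for a local kernel only the near neighbors contribute appreciably.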
22 To see more clearly how the tangent plane may depend only on the neighbors of x, consider the relation between the tangent plane and the embedding function: the tangent plane at x is simply the subspace spanned by the vectors $\frac{\partial e_k(x)}{\partial x}$. [sent-49, score-2.265]
23 In the case of very “local” kernels like that of LLE, spectral clustering with Gaussian kernel, Laplacian Eigenmaps or kernel PCA with Gaussian kernel, that derivative only depends significantly on the near neighbors of x. [sent-50, score-0.126]
24 Consider for example kernel PCA with a Gaussian kernel: then $\frac{\partial e_k(x)}{\partial x}$ can be closely approximated by a linear combination of the difference vectors $(x - x_j)$ for $x_j$ near x. [sent-51, score-0.214]
25 The weights of that combination may depend on the whole data set, but if the ambient space has many more dimensions than the number of such “near” neighbors of x, this is a very strong locally determined constraint on the shape of the manifold. [sent-52, score-0.175]
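A small numeric check of this locality claim (everything here is synthetic: the "eigenvector" v and eigenvalue are random stand-ins rather than an actual kernel PCA solution, since the point is only that the gradient is a neighbor-weighted combination of difference vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))       # toy training set
sigma = 0.5
v = rng.normal(size=200)
v /= np.linalg.norm(v)              # stand-in for one eigenvector v_k of M
lam = 1.0                           # stand-in for its eigenvalue lambda_k

def grad_embedding(x, X, v, lam, sigma):
    """d e_k(x)/dx for e_k(x) = (1/lam) sum_i v_i exp(-||x - x_i||^2 / (2 sigma^2)):
    a combination of the (x_i - x) vectors, with weights decaying with distance."""
    diff = X - x                                        # rows are (x_i - x)
    K = np.exp(-np.sum(diff**2, axis=1) / (2 * sigma**2))
    return (v * K / (lam * sigma**2)) @ diff

x = rng.normal(size=5)
full = grad_embedding(x, X, v, lam, sigma)

# restrict the sum to the k nearest neighbors of x: they dominate the gradient
k = 20
nn = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
local = grad_embedding(x, X[nn], v[nn], lam, sigma)
print(np.linalg.norm(full - local) / np.linalg.norm(full))  # small relative difference
```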
26 , 2004), the corresponding data-dependent kernel can be defined as $K_D(x, x_i) = -\frac{1}{2}\big(D(x, x_i)^2 - \frac{1}{n}\sum_j D(x, x_j)^2 - \bar{D}_i + \bar{D}\big)$, where $\bar{D}_i = \frac{1}{n}\sum_j D(x_i, x_j)^2$ and $\bar{D} = \frac{1}{n}\sum_j \bar{D}_j$. [sent-56, score-0.301]
27 Let $N(x, x_i)$ denote the index j of the training set example $x_j$ that is a neighbor of x that minimizes $\|x - x_j\| + D(x_j, x_i)$. [sent-57, score-0.364]
28 Then $\frac{\partial e_k(x)}{\partial x} = \frac{1}{\lambda_k}\sum_i v_{ki}\Big(\frac{1}{n}\sum_j D(x, x_j)\,\frac{x - x_{N(x,x_j)}}{\|x - x_{N(x,x_j)}\|} - D(x, x_i)\,\frac{x - x_{N(x,x_i)}}{\|x - x_{N(x,x_i)}\|}\Big)$ (1), which is a linear combination of vectors $(x - x_k)$, where $x_k$ is a neighbor of x. [sent-58, score-0.297]
29 This clearly shows that the tangent plane at x associated with Isomap is also included in the subspace spanned by the vectors (x − xk ) where xk is a neighbor of x. [sent-59, score-0.845]
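A short sketch of this data-dependent kernel, assuming the distances D (geodesic distances in the case of Isomap) are already available (the helper name is illustrative):

```python
import numpy as np

def isomap_style_kernel(d_x, D_train):
    """K_D(x, x_i) from double-centered squared distances.

    d_x     : (n,)   distances D(x, x_j) from the new point x to the n training points
    D_train : (n, n) pairwise distances D(x_i, x_j) between training points
    returns : (n,)   K_D(x, x_i) for i = 1..n
    """
    Dbar_i = np.mean(D_train**2, axis=1)   # (1/n) sum_j D(x_i, x_j)^2
    Dbar = np.mean(Dbar_i)                 # (1/n) sum_i Dbar_i
    return -0.5 * (d_x**2 - np.mean(d_x**2) - Dbar_i + Dbar)
```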
30 There are also a variety of local manifold learning algorithms which can be classified as “mixtures of pancakes” (Ghahramani and Hinton, 1996; Tipping and Bishop, 1999; Vincent and Bengio, 2003; Teh and Roweis, 2003; Brand, 2003). [sent-60, score-0.732]
31 For these methods the local tangent directions directly correspond to the principal eigenvectors of the local covariance matrices. [sent-63, score-0.851]
32 Learning is also local since it is mostly the examples around the Gaussian center that determine its covariance structure. [sent-64, score-0.234]
33 local principal directions) are estimated mostly based on local data. [sent-68, score-0.318]
34 where to set the Gaussian centers to allocate them properly where there is data, and optionally how to orient the principal directions so as to obtain a globally coherent coordinate system for embedding the data. [sent-71, score-0.22]
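For reference, the local baseline described here can be sketched in a few lines: the tangent basis at x is taken to be the top principal directions of the covariance of its k nearest neighbors (the values of k and d below are illustrative defaults):

```python
import numpy as np

def local_pca_tangent(x, X, k=10, d=2):
    """Estimate a d-dimensional tangent basis at x from its k nearest training points,
    as the leading principal directions of the local covariance (the 'pancake' view)."""
    idx = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    centered = X[idx] - X[idx].mean(axis=0)
    # principal eigenvectors of the local covariance = leading right singular vectors
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return Vt[:d]   # (d, n_dims): rows span the estimated tangent plane at x
```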
35 1 Where Local Manifold Learning Would Fail It is easy to imagine at least four failure causes for local manifold learning methods, and combining them will create even greater problems: • Noise around the manifold: data are not exactly lying on the manifold. [sent-73, score-0.777]
36 In the case of non-linear manifolds, the presence of noise means that more data around each pancake region will be needed to properly estimate the tangent directions of the manifold in that region. [sent-74, score-1.267]
37 Local manifold learning methods basically approximate the manifold by the union of many locally linear patches. [sent-76, score-1.296]
38 With a high-curvature manifold, more (and smaller) patches will be needed, and the number of required patches will grow exponentially with the dimensionality of the manifold. [sent-78, score-0.206]
39 Consider for example the manifold of translations of a high-contrast image. [sent-79, score-0.653]
40 The tangent direction corresponds to the change in image due to a small translation, i. [sent-80, score-0.52]
41 We have already seen that high manifold dimensionality d is hurtful because O(d) examples are required in each patch and $O(r^d)$ patches (for some r depending on curvature and noise) are necessary to span the manifold. [sent-86, score-0.839]
42 In the translation example, if the image resolution is increased, then many more training images will be needed to capture the curvature around the translation manifold with locally linear patches. [sent-87, score-0.958]
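The translation example can be made concrete with a finite-difference sketch: the tangent of the translation manifold at an image is its directional spatial derivative, and the sharper or higher-resolution the image, the faster this tangent changes along the manifold (the toy image below is only for illustration):

```python
import numpy as np

def translation_tangent(img, dx=1.0, dy=0.0):
    """Tangent vector of the translation manifold at an image: the directional spatial
    derivative, approximated with centered finite differences (flattened to a vector)."""
    gx = (np.roll(img, -1, axis=1) - np.roll(img, 1, axis=1)) / 2.0  # d img / d column
    gy = (np.roll(img, -1, axis=0) - np.roll(img, 1, axis=0)) / 2.0  # d img / d row
    return (dx * gx + dy * gy).ravel()

img = np.zeros((16, 16))
img[4:12, 4:12] = 1.0                 # a high-contrast square
tangent = translation_tangent(img)    # direction of change under a small horizontal shift
```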
43 In many real-world contexts there is not just one global manifold but a large number of manifolds which nonetheless share something about their structure. [sent-90, score-0.743]
44 A simple example is the manifold of transformations (view-point, position, lighting,. [sent-91, score-0.629]
45 There is one manifold per object instance (corresponding to the successive application of small amounts of all of these transformations). [sent-95, score-0.629]
46 If there are only a few examples for each such class then it is almost impossible to learn the manifold structures using only local manifold learning. [sent-96, score-1.388]
47 However, if the manifold structures are generated by a common underlying phenomenon then a non-local manifold learning method could potentially learn all of these manifolds and even generalize to manifolds for which a single instance is observed, as demonstrated in the experiments below. [sent-97, score-1.471]
48 3 Non-Local Manifold Tangent Learning Here we choose to characterize the manifolds in the data distribution through a matrix-valued function F(x) that predicts at $x \in R^n$ a basis for the tangent plane of the manifold near x, hence $F(x) \in R^{d \times n}$ for a d-dimensional manifold. [sent-98, score-1.488]
49 Basically, F (x) specifies “where” (in which directions) one expects to find near neighbors of x. [sent-99, score-0.208]
50 As with Isomap, we consider that the vectors (x − xi ) with xi a near neighbor of x span a noisy estimate of the manifold tangent space. [sent-101, score-1.371]
51 In our experiments we simply collected the k nearest neighbors of each example x, but better selection criteria might be devised. [sent-103, score-0.197]
52 Points on the predicted tangent subspace can be written $F(x)^T w$ with $w \in R^d$ being local coordinates in the basis specified by F(x). [sent-104, score-0.664]
53 Several criteria are possible to match the neighbor differences with the subspace defined by F(x). [sent-105, score-0.171]
54 One that yields simple analytic calculations is to minimize the distance between the $x - x_j$ vectors and their projection on the subspace defined by F(x). [sent-106, score-0.23]
55 The low-dimensional local coordinate vector $w_{tj} \in R^d$ that matches neighbor $x_j$ of example $x_t$ is thus an extra free parameter that has to be optimized, but it is obtained analytically. [sent-107, score-0.655]
56 The overall training criterion involves a double optimization over the function F and the local coordinates $w_{tj}$ of what we call the relative projection error: $\min_{F,\{w_{tj}\}} \sum_t \sum_{j \in N(x_t)} \frac{\|F(x_t)^T w_{tj} - (x_t - x_j)\|^2}{\|x_t - x_j\|^2}$ (2), where N(x) denotes the selected set of near neighbors of x. [sent-108, score-0.835]
57 The normalization by $\|x_t - x_j\|^2$ is to avoid giving more weight to the neighbors that are further away. [sent-109, score-0.225]
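The criterion for a single example x_t can be written down directly; the sketch below solves for the local coordinates w with an ordinary least-squares call (the paper uses an SVD-based solve, shown further below), and all names are illustrative:

```python
import numpy as np

def relative_projection_error(F, xt, neighbors):
    """Criterion (2) for one example: squared distance between each (x_t - x_j) and its
    projection on the row space of F = F(x_t), normalized by ||x_t - x_j||^2."""
    err = 0.0
    for xj in neighbors:
        delta = xt - xj
        w, *_ = np.linalg.lstsq(F.T, delta, rcond=None)   # local coordinates w_tj
        err += np.sum((F.T @ w - delta) ** 2) / np.sum(delta ** 2)
    return err
```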
58 The solution for $w_{tj}$ is obtained by solving the linear system $F(x_t) F(x_t)^T w_{tj} = F(x_t)\,\frac{x_t - x_j}{\|x_t - x_j\|^2}$ (4). [sent-115, score-0.295]
59 In our implementation this is done robustly through a singular value decomposition $F(x_t)^T = U S V^T$ and $w_{tj} = B(x_t - x_j)$, where B can be precomputed for all the neighbors of $x_t$: $B = \big(\sum_{k=1}^{d} 1_{S_k > \epsilon}\, V_{\cdot k} V_{\cdot k}^T / S_k^2\big) F(x_t)$ for a small threshold $\epsilon$. [sent-116, score-0.652]
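A sketch of this robust solve (the thresholding constant and the exact factorization of B below are illustrative; the paper precomputes an equivalent matrix B for all the neighbors of x_t):

```python
import numpy as np

def local_coordinates(F, deltas, eps=1e-8):
    """Solve F F^T w = F delta / ||delta||^2 robustly for each neighbor difference delta,
    discarding directions whose singular values fall below eps."""
    U, S, Vt = np.linalg.svd(F, full_matrices=False)  # F = F(x_t) has shape (d, n)
    keep = S > eps
    B = (U[:, keep] / S[keep]) @ Vt[keep]             # (d, n) pseudo-inverse-style map
    return [B @ delta / np.dot(delta, delta) for delta in deltas]
```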
60 The gradient of the criterion with respect to the i-th row of $F(x_t)$, holding $w_{tj}$ fixed, is simply $2 \sum_j \frac{w_{tji}\,\big(F(x_t)^T w_{tj} - (x_t - x_j)\big)}{\|x_t - x_j\|^2}$ (5), where $w_{tji}$ is the i-th element of $w_{tj}$. [sent-119, score-0.69]
61 In practice, it is not necessary to store more than one wtj vector at a time. [sent-120, score-0.207]
62 It is trained by stochastic gradient descent, one example xt at a time. [sent-122, score-0.132]
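A stochastic-gradient sketch of the whole training loop. It replaces the paper's neural network by a simple linear parameterization F(x)[i] = W[i] x + b[i] and obtains w with a least-squares solve; the hyper-parameters and the parameterization are assumptions made only for this illustration:

```python
import numpy as np

def train_tangent_predictor(X, d=1, k=4, lr=0.01, epochs=20, seed=0):
    """Minimize the relative projection error (2) by stochastic gradient descent,
    one example x_t at a time, using per-neighbor terms of the gradient (5)
    with the local coordinates w_tj held fixed."""
    rng = np.random.default_rng(seed)
    n_pts, n_dim = X.shape
    W = 0.01 * rng.normal(size=(d, n_dim, n_dim))   # F(x)[i] = W[i] @ x + b[i]
    b = 0.01 * rng.normal(size=(d, n_dim))

    dists = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    nn = np.argsort(dists, axis=1)[:, :k]           # k nearest neighbors of each point

    for _ in range(epochs):
        for t in rng.permutation(n_pts):
            xt = X[t]
            F = W @ xt + b                          # (d, n_dim) predicted tangent basis
            for j in nn[t]:
                delta = xt - X[j]
                nrm2 = delta @ delta
                w, *_ = np.linalg.lstsq(F.T, delta, rcond=None)  # local coordinates
                resid = F.T @ w - delta
                g = 2.0 * np.outer(w, resid) / nrm2              # dLoss/dF(x_t)
                W -= lr * g[:, :, None] * xt[None, None, :]      # chain rule through W @ x
                b -= lr * g
    return W, b
```

With this linear parameterization the predicted tangent plane is an affine function of x; the neural network used in the paper is what allows genuinely non-linear sharing of tangent structure across the input space.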
63 However, once the tangent plane function is trained, there are ways to use it to obtain all of the above. [sent-124, score-0.672]
64 The simplest method is to apply existing algorithms that provide both an embedding and a density function based on a Gaussian mixture with pancake-like covariances. [sent-125, score-0.131]
65 For example one could use (Teh and Roweis, 2003) or (Brand, 2003), the local covariance matrix around x being constructed from $F(x)^T \mathrm{diag}(\sigma^2(x))\, F(x)$, where $\sigma_i^2(x)$ should estimate $\mathrm{Var}(w_i)$ around x. [sent-126, score-0.238]
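For instance, the pancake-like covariance around x could be assembled as follows (the added isotropic noise term and the variance estimates are illustrative, not specified in the text):

```python
import numpy as np

def pancake_covariance(F, w_var, noise_var=1e-2):
    """Local covariance around x: F(x)^T diag(sigma^2(x)) F(x), plus a small isotropic
    term so that the Gaussian also has some variance off the predicted tangent plane."""
    n = F.shape[1]
    return F.T @ np.diag(w_var) @ F + noise_var * np.eye(n)
```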
66 1 Previous Work on Non-Local Manifold Learning The non-local manifold learning algorithm presented here (find F (·) which minimizes the criterion in eq. [sent-128, score-0.659]
67 That group defines a one-dimensional manifold generated by following the orbit $x(t) = e^{Gt} x(0)$, where G is an n × n matrix and t is a scalar manifold coordinate. [sent-130, score-1.258]
68 A multi-dimensional manifold can be obtained by replacing Gt above by a linear combination of multiple generating matrices. [sent-131, score-0.629]
69 In (Rao and Ruderman, 1999) the matrix exponential is approximated to first order by (I + Gt), and the authors estimate G for a simple signal undergoing translations, using as a criterion the minimization of $\sum_{x,\tilde{x}} \min_t \|(I + Gt)x - \tilde{x}\|^2$, where $\tilde{x}$ is a neighbor of x in the data. [sent-132, score-0.133]
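In the spirit of that first-order approximation, a generator G can be estimated by least squares from (x, x̃) pairs; the sketch below fixes the scalar t to 1 (absorbed into G), which simplifies Rao and Ruderman's criterion where t is also minimized over:

```python
import numpy as np

def estimate_generator(X, X_shifted):
    """Estimate G minimizing sum ||(I + G) x - x_tilde||^2 over the given pairs.
    Columns of X are the signals x, columns of X_shifted the transformed neighbors."""
    Delta = X_shifted - X                              # want G @ X ~= Delta
    G_T, *_ = np.linalg.lstsq(X.T, Delta.T, rcond=None)
    return G_T.T

# toy usage: signals undergoing a circular shift by one sample
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 200))                         # 200 signals of length 32
G = estimate_generator(X, np.roll(X, 1, axis=0))       # G is close to (shift matrix - I)
```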
70 Note that in x this model the tangent plane is a linear function of x, i. [sent-133, score-0.672]
71 Our proposal extends this approach to multiple dimensions and non-linear relations between x and the tangent planes. [sent-137, score-0.498]
72 Note also the earlier work on Tangent Distance (Simard, LeCun and Denker, 1993), in which the tangent planes are not learned but used to build a nearest neighbor classifier that is based on the distance between the tangent subspaces around two examples to be compared. [sent-138, score-1.286]
73 The main advantage of the approach proposed here over local manifold learning is that the parameters of the tangent plane predictor can be estimated using data from very different regions of space, thus in principle allowing it to be less sensitive to all four of the problems described in 2. [sent-139, score-1.458]
74 4 Experimental Results The objective of the experiments is to validate the proposed algorithm: does it estimate well the true tangent planes? [sent-141, score-0.52]
75 does it learn better than a local manifold learning algorithm? [sent-142, score-0.732]
76 Error Measurement In addition to visualizing the results for the low-dimensional data, we measure performance by considering how well the algorithm learns the local tangent distance, as measured by the normalized projection error of nearest neighbors (eq. [sent-143, score-0.843]
77 1), (d) Local PCA (using the d principal components of the k nearest neighbors of x in the training set). [sent-146, score-0.286]
78 Figure 1: Task 1, 2-D data with 1-D sinusoidal manifolds: the method indeed captures the tangent planes. [sent-158, score-0.498]
79 k, for compared methods (from lowest to highest at k=1: analytic, tangent learning, local PCA, Isomap). [sent-164, score-0.601]
80 Each manifold is composed of 4 near points obtained from a random sinusoid, i. [sent-168, score-0.7]
81 Four neighbors were used for training both the Tangent Learning algorithm and the benchmark local nonparametric estimator (local PCA of the 4 neighbors). [sent-172, score-0.277]
82 Figure 1 shows the training set and the tangent planes recovered, both on the training examples and generalizing away from the data. [sent-173, score-0.676]
83 This problem is particularly difficult for local manifold learning, which does very poorly here: the out-of-sample relative prediction errors are respectively 0. [sent-175, score-0.732]
84 Task 2 This is a higher dimensional manifold learning problem, with 41 dimensions. [sent-179, score-0.629]
85 Note that the tangent vectors are not linear in x. [sent-185, score-0.498]
86 The manifold coordinates are t1 and t2 , sampled uniformly, respectively from (−1, 1) and (. [sent-186, score-0.658]
87 First, the error decreases because of the effect of noise (near noisy neighbors may form a high angle with the tangent plane). [sent-194, score-0.662]
88 Then, it increases because of the curvature of the manifold (neighbors further away form a larger angle). [sent-195, score-0.881]
89 There is one rotation manifold for each instance of digit from the database, but only two examples for each manifold: one real image from the MNIST dataset and one slightly rotated image. [sent-197, score-0.678]
90 In this context we use only the k = 1 nearest neighbor and the manifold dimension is 1. [sent-199, score-0.77]
91 The average relative projection error for the nearest neighbor is 0. [sent-200, score-0.186]
92 An even more interesting experiment consists in applying the above trained predictor on novel images that come from a very different distribution but one that shares the same manifold structure: it was applied to images of other characters that are not digits. [sent-210, score-0.683]
93 We have used the predicted tangent planes to follow the manifold by small steps (this is very easy to do in the case of a one-dimensional manifold). [sent-211, score-1.204]
94 Figure 3 shows for example on a letter ’M’ image the effect of a few such steps and a larger number of steps, both for the neural network predictor and for the local PCA predictor. [sent-212, score-0.155]
95 This example illustrates the crucial point that non-local tangent plane learning is able to generalize to truly novel cases, where local manifold learning fails. [sent-213, score-1.435]
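The manifold-following procedure used in these experiments amounts to repeatedly stepping in the predicted tangent direction; a sketch for the one-dimensional case (the step size and the orientation fix are illustrative choices):

```python
import numpy as np

def follow_manifold(x0, tangent_fn, step=0.1, n_steps=50):
    """Walk along a 1-D manifold: tangent_fn(x) returns the predicted basis F(x) of
    shape (1, n); each step moves a small distance along the (re-oriented) tangent."""
    x = np.array(x0, dtype=float)
    path = [x.copy()]
    prev = None
    for _ in range(n_steps):
        t = tangent_fn(x).ravel()
        t /= np.linalg.norm(t) + 1e-12
        if prev is not None and np.dot(t, prev) < 0:
            t = -t                       # keep a consistent direction between steps
        x = x + step * t
        prev = t
        path.append(x.copy())
    return np.array(path)
```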
96 In this spirit we have proposed a simple learning algorithm based on predicting the tangent plane at x with a function F (x) whose parameters are estimated based on the whole data set. [sent-218, score-0.696]
97 as in (Szummer and Jaakkola, 2002; Chapelle, Weston and Scholkopf, 2003; Belkin and Niyogi, 2003; Zhu, Ghahramani and Lafferty, 2003)), which rely on proper estimation of the manifold in order to propagate label information. [sent-221, score-0.629]
98 by following the manifold (using the local tangent estimates), to estimate a manifold-following path between pairs of neighboring examples. [sent-224, score-1.252]
99 The algorithm can also be extended in a straightforward way to obtain a Gaussian mixture or a mixture of factor analyzers (with the factors or the principal eigenvectors of the Gaussian centered at x obtained from F (x)). [sent-225, score-0.168]
100 This view can also provide an alternative criterion to optimize F (x) (the local log-likelihood of such a Gaussian). [sent-226, score-0.133]
Related papers (most similar by content, from the same mining index):
nips-2004-17 Adaptive Manifold Learning (Jing Wang, Zhenyue Zhang, Hongyuan Zha)
nips-2004-160 Seeing through water (Alexei Efros, Volkan Isler, Jianbo Shi, Mirkó Visontai)
nips-2004-114 Maximum Likelihood Estimation of Intrinsic Dimension (Elizaveta Levina, Peter J. Bickel)
nips-2004-150 Proximity Graphs for Clustering and Manifold Learning (Richard S. Zemel, Miguel Á. Carreira-Perpiñán)
nips-2004-182 Synergistic Face Detection and Pose Estimation with Energy-Based Models
nips-2004-163 Semi-parametric Exponential Family PCA
nips-2004-164 Semi-supervised Learning by Entropy Minimization
nips-2004-134 Object Classification from a Single Example Utilizing Class Relevance Metrics
nips-2004-62 Euclidean Embedding of Co-Occurrence Data
nips-2004-32 Boosting on Manifolds: Adaptive Regularization of Base Classifiers
nips-2004-137 On the Adaptive Properties of Decision Trees
nips-2004-188 The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space
nips-2004-78 Hierarchical Distributed Representations for Statistical Language Modeling
nips-2004-189 The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
nips-2004-125 Multiple Relational Embedding
nips-2004-168 Semigroup Kernels on Finite Sets
nips-2004-206 Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms
nips-2004-133 Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning
nips-2004-83 Incremental Learning for Visual Tracking
simIndex simValue paperId paperTitle
same-paper 1 0.96015102 131 nips-2004-Non-Local Manifold Tangent Learning
Author: Yoshua Bengio, Martin Monperrus
Abstract: We claim and present arguments to the effect that a large class of manifold learning algorithms that are essentially local and can be framed as kernel learning algorithms will suffer from the curse of dimensionality, at the dimension of the true underlying manifold. This observation suggests to explore non-local manifold learning algorithms which attempt to discover shared structure in the tangent planes at different positions. A criterion for such an algorithm is proposed and experiments estimating a tangent plane prediction function are presented, showing its advantages with respect to local manifold learning algorithms: it is able to generalize very far from training data (on learning handwritten character image rotations), where a local non-parametric method fails. 1
2 0.66276133 17 nips-2004-Adaptive Manifold Learning
Author: Jing Wang, Zhenyue Zhang, Hongyuan Zha
Abstract: Recently, there have been several advances in the machine learning and pattern recognition communities for developing manifold learning algorithms to construct nonlinear low-dimensional manifolds from sample data points embedded in high-dimensional spaces. In this paper, we develop algorithms that address two key issues in manifold learning: 1) the adaptive selection of the neighborhood sizes; and 2) better fitting the local geometric structure to account for the variations in the curvature of the manifold and its interplay with the sampling density of the data set. We also illustrate the effectiveness of our methods on some synthetic data sets. 1
3 0.6391142 150 nips-2004-Proximity Graphs for Clustering and Manifold Learning
Author: Richard S. Zemel, Miguel Á. Carreira-Perpiñán
Abstract: Many machine learning algorithms for clustering or dimensionality reduction take as input a cloud of points in Euclidean space, and construct a graph with the input data points as vertices. This graph is then partitioned (clustering) or used to redefine metric information (dimensionality reduction). There has been much recent work on new methods for graph-based clustering and dimensionality reduction, but not much on constructing the graph itself. Graphs typically used include the fullyconnected graph, a local fixed-grid graph (for image segmentation) or a nearest-neighbor graph. We suggest that the graph should adapt locally to the structure of the data. This can be achieved by a graph ensemble that combines multiple minimum spanning trees, each fit to a perturbed version of the data set. We show that such a graph ensemble usually produces a better representation of the data manifold than standard methods; and that it provides robustness to a subsequent clustering or dimensionality reduction algorithm based on the graph. 1
4 0.61129177 160 nips-2004-Seeing through water
Author: Alexei Efros, Volkan Isler, Jianbo Shi, Mirkó Visontai
Abstract: We consider the problem of recovering an underwater image distorted by surface waves. A large amount of video data of the distorted image is acquired. The problem is posed in terms of finding an undistorted image patch at each spatial location. This challenging reconstruction task can be formulated as a manifold learning problem, such that the center of the manifold is the image of the undistorted patch. To compute the center, we present a new technique to estimate global distances on the manifold. Our technique achieves robustness through convex flow computations and solves the “leakage” problem inherent in recent manifold embedding techniques. 1
5 0.60246712 114 nips-2004-Maximum Likelihood Estimation of Intrinsic Dimension
Author: Elizaveta Levina, Peter J. Bickel
Abstract: We propose a new method for estimating intrinsic dimension of a dataset derived by applying the principle of maximum likelihood to the distances between close neighbors. We derive the estimator by a Poisson process approximation, assess its bias and variance theoretically and by simulations, and apply it to a number of simulated and real datasets. We also show it has the best overall performance compared with two other intrinsic dimension estimators. 1
6 0.38777161 182 nips-2004-Synergistic Face Detection and Pose Estimation with Energy-Based Models
7 0.29632199 62 nips-2004-Euclidean Embedding of Co-Occurrence Data
8 0.27485546 22 nips-2004-An Investigation of Practical Approximate Nearest Neighbor Algorithms
9 0.27387181 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling
10 0.27303281 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition
11 0.2720446 55 nips-2004-Distributed Occlusion Reasoning for Tracking with Nonparametric Belief Propagation
12 0.26725757 68 nips-2004-Face Detection --- Efficient and Rank Deficient
13 0.26119089 137 nips-2004-On the Adaptive Properties of Decision Trees
14 0.25973147 163 nips-2004-Semi-parametric Exponential Family PCA
15 0.25806171 188 nips-2004-The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space
16 0.25795823 29 nips-2004-Beat Tracking the Graphical Model Way
17 0.24674931 38 nips-2004-Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms
18 0.24447677 127 nips-2004-Neighbourhood Components Analysis
19 0.23055041 130 nips-2004-Newscast EM
20 0.2302032 125 nips-2004-Multiple Relational Embedding
topicId topicWeight
[(13, 0.195), (15, 0.13), (18, 0.148), (26, 0.063), (31, 0.031), (33, 0.149), (35, 0.04), (39, 0.039), (50, 0.046), (56, 0.016), (76, 0.011), (87, 0.014)]
simIndex simValue paperId paperTitle
1 0.94925857 75 nips-2004-Heuristics for Ordering Cue Search in Decision Making
Author: Peter M. Todd, Anja Dieckmann
Abstract: Simple lexicographic decision heuristics that consider cues one at a time in a particular order and stop searching for cues as soon as a decision can be made have been shown to be both accurate and frugal in their use of information. But much of the simplicity and success of these heuristics comes from using an appropriate cue order. For instance, the Take The Best heuristic uses validity order for cues, which requires considerable computation, potentially undermining the computational advantages of the simple decision mechanism. But many cue orders can achieve good decision performance, and studies of sequential search for data records have proposed a number of simple ordering rules that may be of use in constructing appropriate decision cue orders as well. Here we consider a range of simple cue ordering mechanisms, including tallying, swapping, and move-to-front rules, and show that they can find cue orders that lead to reasonable accuracy and considerable frugality when used with lexicographic decision heuristics. 1 O ne -Re ason De c i si on M aki ng and O r de r e d Se ar c h How do we know what information to consider when making a decision? Imagine the problem of deciding which of two objects or options is greater along some criterion, such as which of two cities is larger. We may know various facts about each city, such as whether they have a major sports team or a university or airport. To decide between them, we could weight and sum all the cues we know, or we could use a simpler lexicographic rule to look at one cue at a time in a particular order until we find a cue that discriminates between the options and indicates a choice [1]. Such lexicographic rules are used by people in a variety of decision tasks [2]-[4], and have been shown to be both accurate in their inferences and frugal in the amount of information they consider before making a decision. For instance, Gigerenzer and colleagues [5] demonstrated the surprising performance of several decision heuristics that stop information search as soon as one discriminating cue is found; because only that cue is used to make the decision, and no integration of information is involved, they called these heuristics “one-reason” decision mechanisms. Given some set of cues that can be looked up to make the decision, these heuristics differ mainly in the search rule that determines the order in which the information is searched. But then the question of what information to consider becomes, how are these search orders determined? Particular cue orders make a difference, as has been shown in research on the Take The Best heuristic (TTB) [6], [7]. TTB consists of three building blocks. (1) Search rule: Search through cues in the order of their validity, a measure of accuracy equal to the proportion of correct decisions made by a cue out of all the times that cue discriminates between pairs of options. (2) Stopping rule: Stop search as soon as one cue is found that discriminates between the two options. (3) Decision rule: Select the option to which the discriminating cue points, that is, the option that has the cue value associated with higher criterion values. The performance of TTB has been tested on several real-world data sets, ranging from professors’ salaries to fish fertility [8], in cross-validation comparisons with other more complex strategies. 
Across 20 data sets, TTB used on average only a third of the available cues (2.4 out of 7.7), yet still outperformed multiple linear regression in generalization accuracy (71% vs. 68%). The even simpler Minimalist heuristic, which searches through available cues in a random order, was more frugal (using 2.2 cues on average), yet still achieved 65% accuracy. But the fact that the accuracy of Minimalist lagged behind TTB by 6 percentage points indicates that part of the secret of TTB’s success lies in its ordered search. Moreover, in laboratory experiments [3], [4], [9], people using lexicographic decision strategies have been shown to employ cue orders based on the cues’ validities or a combination of validity and discrimination rate (proportion of decision pairs on which a cue discriminates between the two options). Thus, the cue order used by a lexicographic decision mechanism can make a considerable difference in accuracy; the same holds true for frugality, as we will see. But constructing an exact validity order, as used by Take The Best, takes considerable information and computation [10]. If there are N known objects to make decisions over, and C cues known for each object, then each of the C cues must be evaluated for whether it discriminates correctly (counting up R right decisions), incorrectly (W wrong decisions), or does not discriminate between each of the N·(N-1)/2 possible object pairs, yielding C·N·(N-1)/2 checks to perform to gather the information needed to compute cue validities (v = R/(R+W)) in this domain. But a decision maker typically does not know all of the objects to be decided upon, nor even all the cue values for those objects, ahead of time—is there any simpler way to find an accurate and frugal cue order? In this paper, we address this question through simulation-based comparison of a variety of simple cue-order-learning rules. Hope comes from two directions: first, there are many cue orders besides the exact validity ordering that can yield good performance; and second, research in computer science has demonstrated the efficacy of a range of simple ordering rules for a closely related search problem. Consequently, we find that simple mechanisms at the cue-order-learning stage can enable simple mechanisms at the decision stage, such as lexicographic one-reason decision heuristics, to perform well. 2 Si mpl e appr oac he s to c onstr uc ti ng c ue s e ar c h or de r s To compare different cue ordering rules, we evaluate the performance of different cue orders when used by a one-reason decision heuristic within a particular well-studied sample domain: large German cities, compared on the criterion of population size using 9 cues ranging from having a university to the presence of an intercity train line [6], [7]. Examining this domain makes it clear that there are many good possible cue orders. When used with one-reason stopping and decision building blocks, the mean accuracy of the 362,880 (9!) cue orders is 70%, equivalent to the performance expected from Minimalist. The accuracy of the validity order, 74.2%, falls toward the upper end of the accuracy range (62-75.8%), but there are still 7421 cue orders that do better than the validity order. The frugality of the search orders ranges from 2.53 cues per decision to 4.67, with a mean of 3.34 corresponding to using Minimalist; TTB has a frugality of 4.23, implying that most orders are more frugal. 
Thus, there are many accurate and frugal cue orders that could be found—a satisficing decision maker not requiring optimal performance need only land on one. An ordering problem of this kind has been studied in computer science for nearly four decades, and can provide us with a set of potential heuristics to test. Consider the case of a set of data records arranged in a list, each of which will be required during a set of retrievals with a particular probability pi. On each retrieval, a key is given (e.g. a record’s title) and the list is searched from the front to the end until the desired record, matching that key, is found. The goal is to minimize the mean search time for accessing the records in this list, for which the optimal ordering is in decreasing order of pi. But if these retrieval probabilities are not known ahead of time, how can the list be ordered after each successive retrieval to achieve fast access? This is the problem of self-organizing sequential search [11], [12]. A variety of simple sequential search heuristics have been proposed for this problem, centering on three main approaches: (1) transpose, in which a retrieved record is moved one position closer to the front of the list (i.e., swapping with the record in front of it); (2) move-to-front (MTF), in which a retrieved record is put at the front of the list, and all other records remain in the same relative order; and (3) count, in which a tally is kept of the number of times each record is retrieved, and the list is reordered in decreasing order of this tally after each retrieval. Because count rules require storing additional information, more attention has focused on the memory-free transposition and MTF rules. Analytic and simulation results (reviewed in [12]) have shown that while transposition rules can come closer to the optimal order asymptotically, in the short run MTF rules converge more quickly (as can count rules). This may make MTF (and count) rules more appealing as models of cue order learning by humans facing small numbers of decision trials. Furthermore, MTF rules are more responsive to local structure in the environment (e.g., clumped retrievals over time of a few records), and transposition can result in very poor performance under some circumstances (e.g., when neighboring pairs of “popular” records get trapped at the end of the list by repeatedly swapping places). It is important to note that there are important differences between the selforganizing sequential search problem and the cue-ordering problem we address here. In particular, when a record is sought that matches a particular key, search proceeds until the correct record is found. In contrast, when a decision is made lexicographically and the list of cues is searched through, there is no one “correct” cue to find—each cue may or may not discriminate (allow a decision to be made). Furthermore, once a discriminating cue is found, it may not even make the right decision. Thus, given feedback about whether a decision was right or wrong, a discriminating cue could potentially be moved up or down in the ordered list. This dissociation between making a decision or not (based on the cue discrimination rates), and making a right or wrong decision (based on the cue validities), means that there are two ordering criteria in this problem—frugality and accuracy—as opposed to the single order—search time—for records based on their retrieval probability pi . 
Because record search time corresponds to cue frugality, the heuristics that work well for the self-organizing sequential search task are likely to produce orders that emphasize frugality (reflecting cue discrimination rates) over accuracy in the cue-ordering task. Nonetheless, these heuristics offer a useful starting point for exploring cue-ordering rules. 2.1 The cue-ordering rules We focus on search order construction processes that are psychologically plausible by being frugal both in terms of information storage and in terms of computation. The decision situation we explore is different from the one assumed by Juslin and Persson [10] who strongly differentiate learning about objects from later making decisions about them. Instead we assume a learning-while-doing situation, consisting of tasks that have to be done repeatedly with feedback after each trial about the adequacy of one’s decision. For instance, we can observe on multiple occasions which of two supermarket checkout lines, the one we have chosen or (more likely) another one, is faster, and associate this outcome with cues including the lines’ lengths and the ages of their respective cashiers. In such situations, decision makers can learn about the differential usefulness of cues for solving the task via the feedback received over time. We compare several explicitly defined ordering rules that construct cue orders for use by lexicographic decision mechanisms applied to a particular probabilistic inference task: forced choice paired comparison, in which a decision maker has to infer which of two objects, each described by a set of binary cues, is “bigger” on a criterion—just the task for which TTB was formulated. After an inference has been made, feedback is given about whether a decision was right or wrong. Therefore, the order-learning algorithm has information about which cues were looked up, whether a cue discriminated, and whether a discriminating cue led to the right or wrong decision. The rules we propose differ in which pieces of information they use and how they use them. We classify the learning rules based on their memory requirement—high versus low—and their computational requirements in terms of full or partial reordering (see Table 1). Table 1: Learning rules classified by memory and computational requirements High memory load, complete reordering High memory load, local reordering Low memory load, local reordering Validity: reorders cues based on their current validity Tally swap: moves cue up (down) one position if it has made a correct (incorrect) decision if its tally of correct minus incorrect decisions is ( ) than that of next higher (lower) cue Simple swap: moves cue up one position after correct decision, and down after an incorrect decision Tally: reorders cues by number of correct minus incorrect decisions made so far Associative/delta rule: reorders cues by learned association strength Move-to-front (2 forms): Take The Last (TTL): moves discriminating cue to front TTL-correct: moves cue to front only if it correctly discriminates The validity rule, a type of count rule, is the most demanding of the rules we consider in terms of both memory requirements and computational complexity. It keeps a count of all discriminations made by a cue so far (in all the times that the cue was looked up) and a separate count of all the correct discriminations. Therefore, memory load is comparatively high. 
The validity of each cue is determined by dividing its current correct discrimination count by its total discrimination count. Based on these values computed after each decision, the rule reorders the whole set of cues from highest to lowest validity. The tally rule only keeps one count per cue, storing the number of correct decisions made by that cue so far minus the number of incorrect decisions. If a cue discriminates correctly on a given trial, one point is added to its tally, if it leads to an incorrect decision, one point is subtracted. The tally rule is less demanding in terms of memory and computation: Only one count is kept, no division is required. The simple swap rule uses the transposition rather than count approach. This rule has no memory of cue performance other than an ordered list of all cues, and just moves a cue up one position in this list whenever it leads to a correct decision, and down if it leads to an incorrect decision. In other words, a correctly deciding cue swaps positions with its nearest neighbor upwards in the cue order, and an incorrectly deciding cue swaps positions with its nearest neighbor downwards. The tally swap rule is a hybrid of the simple swap rule and the tally rule. It keeps a tally of correct minus incorrect discriminations per cue so far (so memory load is high) but only locally swaps cues: When a cue makes a correct decision and its tally is greater than or equal to that of its upward neighbor, the two cues swap positions. When a cue makes an incorrect decision and its tally is smaller than or equal to that of its downward neighbor, the two cues also swap positions. We also evaluate two types of move-to-front rules. First, the Take The Last (TTL) rule moves the last discriminating cue (that is, whichever cue was found to discriminate for the current decision) to the front of the order. This is equivalent to the Take The Last heuristic [6], [7], which uses a memory of cues that discriminated in the past to determine cue search order for subsequent decisions. Second, TTLcorrect moves the last discriminating cue to the front of the order only if it correctly discriminated; otherwise, the cue order remains unchanged. This rule thus takes accuracy as well as frugality into account. Finally, we include an associative learning rule that uses the delta rule to update cue weights according to whether they make correct or incorrect discriminations, and then reorders all cues in decreasing order of this weight after each decision. This corresponds to a simple network with nine input units encoding the difference in cue value between the two objects (A and B) being decided on (i.e., ini = -1 if cuei(A) cuei(B), and 0 if cuei(A)=cuei(B) or cuei was not checked) and with one output unit whose target value encodes the correct decision (t = 1 if criterion(A)>criterion(B), otherwise -1), and with the weights between inputs and output updated according to wi = lr · (t - ini·wi) · ini with learning rate lr = 0.1. We expect this rule to behave similarly to Oliver’s rule initially (moving a cue to the front of the list by giving it the largest weight when weights are small) and to swap later on (moving cues only a short distance once weights are larger). 
3 Si mul ati on Study of Si mpl e O r de r i ng Rul e s To test the performance of these order learning rules, we use the German cities data set [6], [7], consisting of the 83 largest-population German cities (those with more than 100,000 inhabitants), described on 9 cues that give some information about population size. Discrimination rate and validity of the cues are negatively correlated (r = -.47). We present results averaged over 10,000 learning trials for each rule, starting from random initial cue orders. Each trial consisted of 100 decisions between randomly selected decision pairs. For each decision, the current cue order was used to look up cues until a discriminating cue was found, which was used to make the decision (employing a onereason or lexicographic decision strategy). After each decision, the cue order was updated using the particular order-learning rule. We start by considering the cumulative accuracies (i.e., online or amortized performance—[12]) of the rules, defined as the total percentage of correct decisions made so far at any point in the learning process. The contrasting measure of offline accuracy—how well the current learned cue order would do if it were applied to the entire test set—will be subsequently reported (see Figure 1). For all but the move-to-front rules, cumulative accuracies soon rise above that of the Minimalist heuristic (proportion correct = .70) which looks up cues in random order and thus serves as a lower benchmark. However, at least throughout the first 100 decisions, cumulative accuracies stay well below the (offline) accuracy that would be achieved by using TTB for all decisions (proportion correct = .74), looking up cues in the true order of their ecological validities. Except for the move-to-front rules, whose cumulative accuracies are very close to Minimalist (mean proportion correct in 100 decisions: TTL: .701; TTL-correct: .704), all learning rules perform on a surprisingly similar level, with less than one percentage point difference in favor of the most demanding rule (i.e., delta rule: .719) compared to the least (i.e., simple swap: .711; for comparison: tally swap: .715; tally: .716; validity learning rule: .719). Offline accuracies are slightly higher, again with the exception of the move to front rules (TTL: .699; TTL-correct: .702; simple swap: .714; tally swap: .719; tally: .721; validity learning rule: .724; delta rule: .725; see Figure 1). In longer runs (10,000 decisions) the validity learning rule is able to converge on TTB’s accuracy, but the tally rule’s performance changes little (to .73). Figure 1: Mean offline accuracy of order learning rules Figure 2: Mean offline frugality of order learning rules All learning rules are, however, more frugal than TTB, and even more frugal than Minimalist, both in terms of online as well as offline frugality. Let us focus on their offline frugality (see Figure 2): On average, the rules look up fewer cues than Minimalist before reaching a decision. There is little difference between the associative rule, the tallying rules and the swapping rules (mean number of cues looked up in 100 decisions: delta rule: 3.20; validity learning rule: 3.21; tally: 3.01; tally swap: 3.04; simple swap: 3.13). Most frugal are the two move-to front rules (TTL-correct: 2.87; TTL: 2.83). 
Consistent with this finding, all of the learning rules lead to cue orders that show positive correlations with the discrimination rate cue order (reaching the following values after 100 decisions: validity learning rule: r = .18; tally: r = .29; tally swap: r = .24; simple swap: r = .18; TTL-correct: r = .48; TTL: r = .56). This means that cues that often lead to discriminations are more likely to end up in the first positions of the order. This is especially true for the move-to-front rules. In contrast, the cue orders resulting from all learning rules but the validity learning rule do not correlate or correlate negatively with the validity cue order, and even the correlations of the cue orders resulting from the validity learning rule after 100 decisions only reach an average r = .12. But why would the discrimination rates of cues exert more of a pull on cue order than validity, even when the validity learning rule is applied? As mentioned earlier, this is what we would expect for the move-to-front rules, but it was unexpected for the other rules. Part of the explanation comes from the fact that in the city data set we used for the simulations, validity and discrimination rate of cues are negatively correlated. Having a low discrimination rate means that a cue has little chance to be used and hence to demonstrate its high validity. Whatever learning rule is used, if such a cue is displaced downward to the lower end of the order by other cues, it may have few chances to escape to the higher ranks where it belongs. The problem is that when a decision pair is finally encountered for which that cue would lead to a correct decision, it is unlikely to be checked because other, more discriminating although less valid, cues are looked up before and already bring about a decision. Thus, because one-reason decision making is intertwined with the learning mechanism and so influences which cues can be learned about, what mainly makes a cue come early in the order is producing a high number of correct decisions and not so much a high ratio of correct discriminations to total discriminations regardless of base rates. This argument indicates that performance may differ in environments where cue validities and discrimination rates correlate positively. We tested the learning rules on one such data set (r=.52) of mammal species life expectancies, predicted from 9 cues. It also differs from the cities environment with a greater difference between TTB’s and Minimalist’s performance (6.5 vs. 4 percentage points). In terms of offline accuracy, the validity learning rule now indeed more closely approaches TTB’s accuracy after 100 decisions (.773 vs. .782)., The tally rule, in contrast, behaves very much as in the cities environment, reaching an accuracy of .752, halfway between TTB and Minimalist (accuracy =.716). Thus only some learning rules can profit from the positive correlation. 4 D i s c u s s i on Most of the simpler cue order learning rules we have proposed do not fall far behind a validity learning rule in accuracy, and although the move-to-front rules cannot beat the accuracy achieved if cues were selected randomly, they compensate for this failure by being highly frugal. Interestingly, the rules that do achieve higher accuracy than Minimalist also beat random cue selection in terms of frugality. On the other hand, all rules, even the delta rule and the validity learning rule, stay below TTB’s accuracy across a relatively high number of decisions. 
But often it is necessary to make good decisions without much experience. Learning rules that quickly lead to orders with good performance should therefore be preferred. The relatively complex rules with higher memory requirements, i.e., the delta rule and the validity learning rule, but also the tally rule, rise in accuracy more quickly than the rules with lower requirements. The tally rule in particular thus represents a good compromise between cost, correctness, and psychological plausibility. Remember that the rules based on tallies assume full memory of all correct minus incorrect decisions made by a cue so far. This does not make them implausible, at least from a psychological perspective, even though computer scientists have been reluctant to adopt such counting approaches because of their extra memory requirements. There is considerable evidence that people are actually very good at remembering the frequencies of events. Hasher and Zacks [13] conclude from a wide range of studies that frequencies are encoded in an automatic way, implying that people are sensitive to this information without intention or special effort. Estes [14] pointed out the role frequencies play in decision making as a shortcut for probabilities. Further, the tally rule and the tally swap rule are comparatively simple, not having to keep track of base rates or perform divisions as the validity rule does. On the other hand, the simple swap and move-to-front rules may not be much simpler, because storing a cue order may be about as demanding as storing a set of tallies. We have run experiments (reported elsewhere) in which the tally swap rule indeed best accounts for people's actual processes of ordering cues.

Our goal in this paper was to explore how well simple cue-ordering rules could work in conjunction with lexicographic decision strategies. This is important because the set-up costs of a heuristic must be taken into account, in addition to its application costs, when judging the mechanism's overall simplicity. As the example of TTB's validity search order shows, what is easy to apply may not be so easy to set up. But simple rules can also be at work in the construction of a heuristic's building blocks, and we have proposed such rules for the construction of one building block, the search order. Simple learning rules inspired by research in computer science can enable a one-reason decision heuristic to perform only slightly worse than if it had full knowledge of cue validities from the very beginning. Giving up the assumption of full a priori knowledge in exchange for a slight decrease in accuracy seems like a reasonable bargain: through the addition of learning rules, one-reason decision heuristics might lose some of their appeal to decision theorists who were surprised by the performance of such simple mechanisms compared to more complex algorithms, but they gain psychological plausibility and so become more attractive as explanations for human decision behavior.
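To make the memory contrast discussed above concrete, the following sketch caricatures the bookkeeping behind a tally-based order and a validity-based order: a single running count of correct minus incorrect decisions per cue versus two counters per cue plus a division. The class names and exact update forms are illustrative readings of the descriptions above, not the paper's own definitions of the rules, which are given in an earlier section.

from collections import defaultdict

class TallyOrder:
    # Orders cues by a running tally of (correct - incorrect) decisions each
    # cue has produced; one integer counter per cue is the only memory needed.
    def __init__(self, n_cues):
        self.tally = defaultdict(int)
        self.n_cues = n_cues

    def update(self, deciding_cue, was_correct):
        self.tally[deciding_cue] += 1 if was_correct else -1

    def order(self):
        return sorted(range(self.n_cues), key=lambda c: self.tally[c], reverse=True)

class ValidityOrder:
    # Orders cues by estimated validity = correct / (correct + incorrect),
    # which needs two counters per cue plus a division: the base-rate
    # bookkeeping the tally rules avoid.
    def __init__(self, n_cues):
        self.correct = defaultdict(int)
        self.used = defaultdict(int)
        self.n_cues = n_cues

    def update(self, deciding_cue, was_correct):
        self.used[deciding_cue] += 1
        self.correct[deciding_cue] += int(was_correct)

    def order(self):
        return sorted(range(self.n_cues),
                      key=lambda c: self.correct[c] / self.used[c] if self.used[c] else 0.0,
                      reverse=True)

A move-to-front rule, by contrast, would only splice the deciding cue to the front of the stored order, which is the sense in which storing an order may be about as demanding as storing a set of tallies.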
References

[1] Fishburn, P.C. (1974). Lexicographic orders, utilities and decision rules: A survey. Management Science, 20, 1442-1471.
[2] Payne, J.W., Bettman, J.R., & Johnson, E.J. (1993). The adaptive decision maker. New York: Cambridge University Press.
[3] Bröder, A. (2000). Assessing the empirical validity of the “Take-The-Best” heuristic as a model of human probabilistic inference. Journal of Experimental Psychology: Learning, Memory, and Cognition, 26(5), 1332-1346.
[4] Bröder, A. (2003). Decision making with the “adaptive toolbox”: Influence of environmental structure, intelligence, and working memory load. Journal of Experimental Psychology: Learning, Memory, and Cognition, 29, 611-625.
[5] Gigerenzer, G., Todd, P.M., & The ABC Research Group (1999). Simple heuristics that make us smart. New York: Oxford University Press.
[6] Gigerenzer, G., & Goldstein, D.G. (1996). Reasoning the fast and frugal way: Models of bounded rationality. Psychological Review, 103(4), 650-669.
[7] Gigerenzer, G., & Goldstein, D.G. (1999). Betting on one good reason: The Take The Best heuristic. In G. Gigerenzer, P.M. Todd & The ABC Research Group, Simple heuristics that make us smart. New York: Oxford University Press.
[8] Czerlinski, J., Gigerenzer, G., & Goldstein, D.G. (1999). How good are simple heuristics? In G. Gigerenzer, P.M. Todd & The ABC Research Group, Simple heuristics that make us smart. New York: Oxford University Press.
[9] Newell, B.R., & Shanks, D.R. (2003). Take the best or look at the rest? Factors influencing ‘one-reason’ decision making. Journal of Experimental Psychology: Learning, Memory, and Cognition, 29, 53-65.
[10] Juslin, P., & Persson, M. (2002). PROBabilities from EXemplars (PROBEX): A “lazy” algorithm for probabilistic inference from generic knowledge. Cognitive Science, 26, 563-607.
[11] Rivest, R. (1976). On self-organizing sequential search heuristics. Communications of the ACM, 19(2), 63-67.
[12] Bentley, J.L., & McGeoch, C.C. (1985). Amortized analyses of self-organizing sequential search heuristics. Communications of the ACM, 28(4), 404-411.
[13] Hasher, L., & Zacks, R.T. (1984). Automatic processing of fundamental information: The case of frequency of occurrence. American Psychologist, 39, 1372-1388.
[14] Estes, W.K. (1976). The cognitive side of probability learning. Psychological Review, 83, 37-64.
same-paper 2 0.89194757 131 nips-2004-Non-Local Manifold Tangent Learning
Author: Yoshua Bengio, Martin Monperrus
Abstract: We claim and present arguments to the effect that a large class of manifold learning algorithms that are essentially local and can be framed as kernel learning algorithms will suffer from the curse of dimensionality, at the dimension of the true underlying manifold. This observation suggests to explore non-local manifold learning algorithms which attempt to discover shared structure in the tangent planes at different positions. A criterion for such an algorithm is proposed and experiments estimating a tangent plane prediction function are presented, showing its advantages with respect to local manifold learning algorithms: it is able to generalize very far from training data (on learning handwritten character image rotations), where a local non-parametric method fails. 1
3 0.87583339 117 nips-2004-Methods Towards Invasive Human Brain Computer Interfaces
Author: Thomas N. Lal, Thilo Hinterberger, Guido Widman, Michael Schröder, N. J. Hill, Wolfgang Rosenstiel, Christian E. Elger, Niels Birbaumer, Bernhard Schölkopf
Abstract: During the last ten years there has been growing interest in the development of Brain Computer Interfaces (BCIs). The field has mainly been driven by the needs of completely paralyzed patients to communicate. With a few exceptions, most human BCIs are based on extracranial electroencephalography (EEG). However, reported bit rates are still low. One reason for this is the low signal-to-noise ratio of the EEG [16]. We are currently investigating if BCIs based on electrocorticography (ECoG) are a viable alternative. In this paper we present the method and examples of intracranial EEG recordings of three epilepsy patients with electrode grids placed on the motor cortex. The patients were asked to repeatedly imagine movements of two kinds, e.g., tongue or finger movements. We analyze the classifiability of the data using Support Vector Machines (SVMs) [18, 21] and Recursive Channel Elimination (RCE) [11]. 1
4 0.86568594 36 nips-2004-Class-size Independent Generalization Analsysis of Some Discriminative Multi-Category Classification
Author: Tong Zhang
Abstract: We consider the problem of deriving class-size independent generalization bounds for some regularized discriminative multi-category classification methods. In particular, we obtain an expected generalization bound for a standard formulation of multi-category support vector machines. Based on the theoretical result, we argue that the formulation over-penalizes misclassification error, which in theory may lead to poor generalization performance. A remedy, based on a generalization of multi-category logistic regression (conditional maximum entropy), is then proposed, and its theoretical properties are examined. 1
5 0.8623181 5 nips-2004-A Harmonic Excitation State-Space Approach to Blind Separation of Speech
Author: Rasmus K. Olsson, Lars K. Hansen
Abstract: We discuss an identification framework for noisy speech mixtures. A block-based generative model is formulated that explicitly incorporates the time-varying harmonic plus noise (H+N) model for a number of latent sources observed through noisy convolutive mixtures. All parameters including the pitches of the source signals, the amplitudes and phases of the sources, the mixing filters and the noise statistics are estimated by maximum likelihood, using an EM-algorithm. Exact averaging over the hidden sources is obtained using the Kalman smoother. We show that pitch estimation and source separation can be performed simultaneously. The pitch estimates are compared to laryngograph (EGG) measurements. Artificial and real room mixtures are used to demonstrate the viability of the approach. Intelligible speech signals are re-synthesized from the estimated H+N models.
6 0.86062604 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons
7 0.84707367 114 nips-2004-Maximum Likelihood Estimation of Intrinsic Dimension
8 0.84251022 28 nips-2004-Bayesian inference in spiking neurons
9 0.83874857 138 nips-2004-Online Bounds for Bayesian Algorithms
10 0.83263797 163 nips-2004-Semi-parametric Exponential Family PCA
11 0.82777619 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
12 0.82637691 58 nips-2004-Edge of Chaos Computation in Mixed-Mode VLSI - A Hard Liquid
13 0.82595581 70 nips-2004-Following Curved Regularized Optimization Solution Paths
14 0.82527632 176 nips-2004-Sub-Microwatt Analog VLSI Support Vector Machine for Pattern Classification and Sequence Estimation
15 0.8236469 22 nips-2004-An Investigation of Practical Approximate Nearest Neighbor Algorithms
16 0.82217646 102 nips-2004-Learning first-order Markov models for control
17 0.82167733 116 nips-2004-Message Errors in Belief Propagation
18 0.8148917 151 nips-2004-Rate- and Phase-coded Autoassociative Memory
19 0.81440514 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition
20 0.81279123 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss