nips nips2004 nips2004-150 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Richard S. Zemel, Miguel Á. Carreira-Perpiñán
Abstract: Many machine learning algorithms for clustering or dimensionality reduction take as input a cloud of points in Euclidean space, and construct a graph with the input data points as vertices. This graph is then partitioned (clustering) or used to redefine metric information (dimensionality reduction). There has been much recent work on new methods for graph-based clustering and dimensionality reduction, but not much on constructing the graph itself. Graphs typically used include the fully-connected graph, a local fixed-grid graph (for image segmentation) or a nearest-neighbor graph. We suggest that the graph should adapt locally to the structure of the data. This can be achieved by a graph ensemble that combines multiple minimum spanning trees, each fit to a perturbed version of the data set. We show that such a graph ensemble usually produces a better representation of the data manifold than standard methods; and that it provides robustness to a subsequent clustering or dimensionality reduction algorithm based on the graph. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Proximity graphs for clustering and manifold learning Miguel Á. Carreira-Perpiñán [sent-1, score-0.501]
2 Abstract Many machine learning algorithms for clustering or dimensionality reduction take as input a cloud of points in Euclidean space, and construct a graph with the input data points as vertices. [sent-7, score-0.564]
3 This graph is then partitioned (clustering) or used to redefine metric information (dimensionality reduction). [sent-8, score-0.21]
4 There has been much recent work on new methods for graph-based clustering and dimensionality reduction, but not much on constructing the graph itself. [sent-9, score-0.419]
5 Graphs typically used include the fully-connected graph, a local fixed-grid graph (for image segmentation) or a nearest-neighbor graph. [sent-10, score-0.25]
6 We suggest that the graph should adapt locally to the structure of the data. [sent-11, score-0.23]
7 This can be achieved by a graph ensemble that combines multiple minimum spanning trees, each fit to a perturbed version of the data set. [sent-12, score-0.597]
8 We show that such a graph ensemble usually produces a better representation of the data manifold than standard methods; and that it provides robustness to a subsequent clustering or dimensionality reduction algorithm based on the graph. [sent-13, score-0.886]
9 1 Introduction Graph-based algorithms have long been popular, and have received even more attention recently, for two of the fundamental problems in machine learning: clustering [1–4] and manifold learning [5–8]. [sent-14, score-0.347]
10 Relatively little attention has been paid to the properties and construction methods for the graphs that these algorithms depend on. [sent-15, score-0.211]
11 In the applications considered here, the graphs are an intermediate form of representation, and therefore their utility to some extent depends on the algorithms that they will ultimately be used for. [sent-17, score-0.173]
12 However, in the case of both clustering and manifold learning, the data points are assumed to lie on some small number of manifolds. [sent-18, score-0.399]
13 Intuitively, the graph should represent these underlying manifolds well: it should avoid shortcuts that travel outside a manifold, avoid gaps that erroneously disconnect regions of a manifold, and be dense within the manifold and clusters. [sent-19, score-0.517]
14 Figure 1 (panel labels: Dataset, MST, k-NNG, ε-ball graph, Delaunay triangulation, Perturbed dataset): Sensitivity to noise of proximity graphs. [sent-21, score-0.585]
15 Top row: several proximity graphs constructed on a noisy sample of points lying on a circle. [sent-22, score-0.262]
16 Bottom row: the same graphs constructed on a different sample; specifically, we added to each point Gaussian noise of standard deviation equal to the length of the small segment shown in the centre of the dataset (top row), built the graph, and drew it on the original dataset. [sent-23, score-0.225]
17 A fully-connected graph is used for example in spectral clustering and multidimensional scaling, while a fixed grid, with each point connecting to some small fixed number of neighbors in a pre-defined grid of locations, is generally used in image segmentation. [sent-26, score-0.592]
18 The ε-ball graph (which connects each point to all points within distance ε) or the k-nearest-neighbor graph (k-NNG) provide an improvement over the fully-connected graph or fixed grid (clustering: [3, 9]; manifold learning: [5, 7]). [sent-28, score-0.452]
19 In this paper we propose a different method of graph construction, one based on minimum spanning trees (MSTs). [sent-34, score-0.272]
20 Our method involves an ensemble of trees, each built on a perturbed version of the data. [sent-35, score-0.358]
21 We first discuss the motivation for this new type of graph, and then examine its robustness properties and its utility for subsequent clustering and dimensionality reduction methods. [sent-36, score-0.271]
22 2 Two new types of proximity graphs A minimum spanning tree is a tree subgraph that contains all the vertices and has a minimum sum of edge weights. [sent-37, score-0.291]
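As a point of reference, here is a minimal sketch of computing a single MST on a Euclidean point cloud; the function name and the use of SciPy are illustrative choices, not taken from the paper.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_edges(X):
    """Return the N-1 MST edges (i, j, d_ij) of an N x d point cloud X."""
    D = squareform(pdist(X))              # dense Euclidean distance matrix
    T = minimum_spanning_tree(D).tocoo()  # each tree edge stored once
    return list(zip(T.row.tolist(), T.col.tolist(), T.data.tolist()))

X = np.random.default_rng(0).normal(size=(100, 2))
print(len(mst_edges(X)))                  # 99 edges for 100 points
```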
23 One way to flesh it out and attain robustness to noise is to form an MST ensemble that combines multiple MSTs; we give two different algorithms for this. [sent-41, score-0.315]
24 The locality of the noise model allows points to move more or less depending on the local data structure around them and to connect to different numbers of neighbors at different distances—in effect we achieve a variable k and ε. [sent-46, score-0.202]
25 To build the PMST ensemble, we generate T > 1 perturbed copies of the entire data set according to the local noise model and fit an MST to each. [sent-47, score-0.206]
26 The PMST ensemble assigns a weight eij ∈ [0, 1] to the edge between points xi and xj equal to the average number of times that edge appears on the trees. [sent-48, score-0.469]
27 For T = 1 this gives the MST of the unperturbed data set; for T → ∞ it gives a stochastic graph where eij is the probability (in the Laplace sense) of that edge under the noise model. [sent-49, score-0.423]
28 The PMST ensemble contains at most T (N − 1) edges (usually much less). [sent-50, score-0.311]
29 Although the algorithm is randomized, the PMST ensemble for large T is essentially deterministic, and insensitive to noise by construction. [sent-51, score-0.294]
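A minimal sketch of the PMST ensemble described above; since the exact local noise model is not spelled out in this excerpt, the noise scale below (a fraction r of each point's nearest-neighbour distance) is an assumption, as are the function and parameter names.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def pmst_ensemble(X, T=20, r=0.3, rng=None):
    """Average the MST edge indicators over T locally perturbed copies of X."""
    rng = np.random.default_rng(rng)
    N = X.shape[0]
    D = squareform(pdist(X))
    np.fill_diagonal(D, np.inf)
    scale = r * D.min(axis=1)                  # assumed local noise scale per point
    E = np.zeros((N, N))
    for _ in range(T):
        Xp = X + scale[:, None] * rng.standard_normal(X.shape)
        mst = minimum_spanning_tree(squareform(pdist(Xp))).toarray()
        E += (mst + mst.T) > 0                 # symmetric 0/1 edge indicators
    return E / T                               # e_ij in [0, 1], the ensemble edge weights
```

With r = 0 this reduces to the single MST of the unperturbed data.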
30 2.2 Disjoint MSTs (DMSTs) Here we build a graph that is a deterministic collection of t MSTs that satisfies the property that the nth tree (for n = 1, . . . , t) is a minimum spanning tree over the edges not used by the previous n − 1 trees. [sent-54, score-0.21]
31 Specifically, we sort the list of N(N−1)/2 edges eij by increasing distance dij and visit each available edge in turn, removing it if it merges two clusters (or equivalently does not create a cycle); whenever we have removed N − 1 edges, we go back to the start of the list. [sent-62, score-0.331]
32 The DMST ensemble consists of the union of all removed edges and contains t(N − 1) edges each of weight 1. [sent-64, score-0.399]
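A sketch of the DMST construction as just described: repeated Kruskal passes over the sorted edge list, with each pass restricted to the edges not used by earlier trees (function and variable names are illustrative).

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def _find(parent, a):
    while parent[a] != a:
        parent[a] = parent[parent[a]]   # path compression
        a = parent[a]
    return a

def dmst_ensemble(X, t=3):
    """Union of t edge-disjoint minimum spanning trees (each edge gets weight 1)."""
    N = X.shape[0]
    D = squareform(pdist(X))
    edges = sorted((D[i, j], i, j) for i in range(N) for j in range(i + 1, N))
    A = np.zeros((N, N), dtype=int)
    for _ in range(t):
        parent = list(range(N))          # fresh union-find for each tree
        remaining, used = [], 0
        for d, i, j in edges:
            ri, rj = _find(parent, i), _find(parent, j)
            if used < N - 1 and ri != rj:       # edge merges two clusters: use it
                parent[ri] = rj
                A[i, j] = A[j, i] = 1
                used += 1
            else:                               # keep it available for the next tree
                remaining.append((d, i, j))
        edges = remaining
    return A
```

For t = 1 this is just the MST; larger t densifies the graph while keeping it connected.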
33 The t parameter controls the overall density of the graph, which is always connected; unlike ε or k (for the ε-ball or k-NNG), t is not a parameter that depends locally on the data, and again points may connect to different numbers of neighbors at different distances. [sent-65, score-0.219]
34 In both cases the resulting graphs are sparse (the number of edges is linear in the number of points N). [sent-73, score-0.317]
35 When the allowed edges are restricted a priori (e.g. to an 8-connected grid in image segmentation), the edge list is much shorter, so the graph construction is faster. [sent-76, score-0.439]
36 For the perturbed MST ensemble, the perturbation of the data set results in a partially disordered edge list, which one should be able to sort efficiently. [sent-77, score-0.236]
37 Overall, the real computational bottleneck is the graph postprocessing, typically O(N^3) in spectral methods (for clustering or manifold learning). [sent-79, score-0.648]
38 This can be sped up to O(cN^2) by using sparsity (limiting a priori the edges allowed, thus approximating the true solution), but then the graph construction is likewise sped up. [sent-80, score-0.505]
39 Thus, even if our graphs are slightly more costly to construct than the ε-ball or k-NNG, the computational savings are very large if we avoid having to run the spectral technique multiple times in search of a good ε or k. [sent-81, score-0.239]
40 3 Experiments We present two sets of experiments on the application of the graphs to clustering and manifold learning, respectively. [sent-82, score-0.501]
41 3.1 Clustering In affinity-based clustering, our data is an N × N affinity matrix W that defines a graph (where nonzeros define edge weights) and we seek a partition of the graph that optimizes a cost function, such as mincut [1] or normalized cut [2]. [sent-84, score-0.531]
42 This graph partitioning problem is generally NP-complete, so approximations are necessary, such as spectral clustering algorithms [2]. [sent-86, score-0.441]
43 In spectral clustering we seek to cluster in the leading eigenvectors of the normalized affinity matrix N = D^{-1/2} W D^{-1/2} (where D = diag(∑_i wij), and discarding a constant eigenvector associated with eigenvalue 1). [sent-87, score-0.334]
44 Spectral clustering succeeds only for a range of values of σ where N displays the natural cluster structure of the data; if σ is too small W is approximately diagonal and if σ is too large W is approximately a matrix of ones. [sent-88, score-0.264]
45 Fig. 2 shows segmentation results for a grayscale image from [11], where the objective is to segment the occluder from the underlying background, a hard task given the intensity gradients. [sent-91, score-0.196]
46 The other method uses the PMST or DMST ensemble (constrained to contain only edges in the 8-connected grid) under different values of the r, t parameters; the graph has between 44% and 98% of the number of edges in the 8-grid, depending on the parameter value. [sent-94, score-0.643]
47 We define the affinity matrix as wij = eij exp(−(dij/σ)^2/2) (where eij ∈ [0, 1] are the edge values). [sent-95, score-0.259]
48 In both methods we apply the spectral clustering algorithm of [2]. [sent-96, score-0.231]
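A sketch of the clustering pipeline just described, putting together the ensemble edge weights eij, the affinity wij = eij exp(−(dij/σ)^2/2) and the normalized spectral step; the final k-means stage and all names are assumed choices, and [2] should be consulted for the exact algorithm.

```python
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.spatial.distance import pdist, squareform

def ensemble_spectral_clustering(X, E, sigma, K):
    """E: graph edge weights e_ij in [0, 1]; K: number of clusters."""
    d = squareform(pdist(X))
    W = E * np.exp(-0.5 * (d / sigma) ** 2)         # w_ij = e_ij exp(-(d_ij/sigma)^2 / 2)
    np.fill_diagonal(W, 0.0)
    deg = W.sum(axis=1)
    Dm = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    Nmat = Dm @ W @ Dm                              # N = D^{-1/2} W D^{-1/2}
    _, vecs = np.linalg.eigh(Nmat)                  # eigenvalues in ascending order
    V = vecs[:, -(K + 1):-1]                        # leading eigenvectors, dropping the top one
    _, labels = kmeans2(V, K, minit='++')           # cluster the rows of V
    return labels
```

The eij returned by pmst_ensemble, or the 0/1 adjacency returned by dmst_ensemble, can be passed directly as E.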
49 The plot shows the clustering error (mismatched occluder area) for a range of scales. [sent-97, score-0.349]
50 The 8-connected grid succeeds in segmenting the occluder only over a limited range of σ. [sent-98, score-0.215]
51 The reason our ensemble graphs succeed even for very high σ is that they lack many edges around the occluder, so those affinities are zero no matter how high the scale is. [sent-100, score-0.362]
52 In other words, for clustering, our graphs enhance the inside of clusters with respect to the bridges between clusters, and so ease the graph partitioning. [sent-101, score-0.393]
53 Figure 2 labels: first 5 eigenvectors (except the constant one); 8-grid at σ = ∞; PMST ensemble; 8-grid. [sent-106, score-0.251]
54 We consider segmenting the greyscale image at the bottom (an occluder over a background) with spectral clustering, asking for K = 5 clusters. [sent-109, score-0.217]
55 The color diagrams represent the segmentation (column 1) and first 5 eigenvectors of the affinity matrix (except the constant eigenvector, columns 2–4) obtained with spectral clustering, using a PMST ensemble with r = 0.4 (upper row) or an 8-connectivity graph (lower row), for 3 different scales σ1, σ2, σ3. [sent-110, score-0.362]
57 The PMST ensemble succeeds at all scales (note how several eigenvectors are constant over the occluder), while the 8-connectivity graph progressively deteriorates as σ increases to give a partition of equal-sized clusters for large scale. [sent-114, score-0.534]
58 In the bottom part of the figure we show: the PMST ensemble graph in 3D space; the clustering error vs σ (where the right end is σ = ∞) for the 8-connectivity graph (thick blue line) and for various other PMST and DMST ensembles under various parameters (thin lines). [sent-115, score-0.925]
59 3.2 Manifold learning Isomap estimates the geodesic distances (i.e., along the manifold) ĝij between pairs of points in the data set as the shortest-path lengths in a graph learned from the data. [sent-117, score-0.262]
60 Then we apply multidimensional scaling to these distances to obtain a collection of N low-dimensional points {yi} that optimally preserves the estimated geodesic distances. [sent-118, score-0.313]
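A rough sketch of this Isomap step, taking the weighted proximity graph as input and using classical MDS (double-centring of the squared graph distances); the names and the 2-D target dimension are assumptions.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap_embed(W, dim=2):
    """W: weighted adjacency of the proximity graph (0 = no edge)."""
    G = shortest_path(W, method='D', directed=False)  # estimated geodesics; inf if disconnected
    N = G.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N               # centring matrix
    B = -0.5 * H @ (G ** 2) @ H                       # double-centred squared distances
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:dim]                # largest eigenpairs
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))   # low-dimensional points y_i
```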
61 In Fig. 3 we show the results of applying Isomap using different graphs to two data sets (ellipse and Swiss roll) for which we know the true geodesic distances gij. [sent-120, score-0.417]
62 In a real application, since the true geodesic distances are unknown, error and variance cannot be computed; an estimated residual variance has been proposed [5] to determine the optimal graph parameter. [sent-121, score-0.653]
63 For the perturbed MST ensemble, we binarize the edge values by setting to 1 any eij > 0. [sent-122, score-0.277]
64 (It is often possible to refine the graph by zeroing edges with small eij , since this removes shortcuts that may have arisen by chance, particularly if T is large; but it is difficult to estimate the right threshold reliably. [sent-123, score-0.48]
65 ) The plots show 3 curves as a function of the graph parameter: the average error E in the geodesic distances; Isomap's estimated residual variance V̂; and the true residual variance V. [sent-124, score-0.651]
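If the residual variance is taken, as in [5], to be one minus the squared linear correlation between two inter-point distance matrices, it can be sketched as follows (names are illustrative); the estimated version compares graph distances with embedding distances, while the true version uses the known geodesics gij.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def residual_variance(D_geodesic, Y):
    """D_geodesic: matrix of (estimated or true) geodesic distances; Y: Isomap embedding."""
    D_embed = squareform(pdist(Y))
    r = np.corrcoef(D_geodesic.ravel(), D_embed.ravel())[0, 1]
    return 1.0 - r ** 2
```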
66 From the plots we can see that V̂ correlates well with V (though it underestimates it) and also with E for the Swiss roll, but not for the ellipse; this can make the optimal graph parameter difficult to determine in a real application. [sent-125, score-0.27]
67 Given this, the fact that our graphs work well over a larger region of their parameter space than the ε-ball or k-NNG graphs makes them particularly attractive. [sent-126, score-0.365]
68 The plots for the Swiss roll show that, while for the low noise case the ε-ball or k-NNG graphs work well over a reasonable region of their parameter space, for the high noise case this region shrinks considerably, almost vanishing for the ε-ball. [sent-127, score-0.517]
69 This is because for low values of the parameter the graph is disconnected, while for high values it has multiple shortcuts; the difficulty of the task is compounded by the small number of points used, N = 500 (an unavoidable fact in high dimensions). [sent-128, score-0.353]
70 For very low r or t = 1 the graph is the single MST, thus the large errors. [sent-130, score-0.229]
71 It is also important to realize that the range of the r parameter of the PMST ensemble does not depend on the data, while the range for ε and k does. [sent-131, score-0.325]
72 The range of the t parameter of the DMST ensemble does depend on the data, but we have found empirically that t = 2–4 gives very good results with all data sets we have tried. [sent-132, score-0.291]
73 4 Discussion One main contribution of this paper is to highlight the relatively understudied problem of converting a data set into a graph, which forms an intermediate representation for many clustering and manifold learning algorithms. [sent-133, score-0.366]
74 A second contribution is novel construction algorithms, which are: easy to implement, not expensive to compute, robust across many noise levels and parameter settings, and useful for clustering and manifold learning. [sent-134, score-0.509]
75 In general, a careful selection of the graph construction algorithm makes the results of these machine learning methods robust, and avoids or limits the required parameter search. [sent-135, score-0.324]
76 Finally, the combination of many graphs, formed from perturbed versions of the data, into an ensemble of graphs, is a novel approach to the construction problem. [sent-136, score-0.415]
77 Our idea of MST ensembles is an extension to graphs of the well-known technique of combining predictors by averaging (regression) or voting (classification), as is the regularizing effect of training with noise [12]. [sent-137, score-0.308]
78 An ensemble of predictors improves generalization to unseen data if the individual predictors are independent of each other and disagree with each other; this improvement can be explained by the bias-variance tradeoff. [sent-138, score-0.285]
79 Unlike regression or classification, unsupervised graph learning lacks at present an error function, so it seems difficult to apply the bias-variance framework here. [sent-139, score-0.275]
80 However, we have conducted a wide range of empirical tests to understand the properties of the ensemble MSTs, and to compare them to the other graph construction methods, in terms of the error in the geodesic distances (if known a priori). [sent-140, score-0.758]
81 Figure 3 (panel labels and legend): Ellipse, high noise; Swiss roll, low noise; Swiss roll, high noise; each panel plots, against the graph parameter (r, t, ε or k), the average error in the geodesic distances, the estimated residual variance and the true residual variance for the PMST ensemble, the DMST ensemble, the ε-ball graph and the k-NNG. [sent-145, score-0.422]
94 Where the curves for the ε-ball and k-NNG are missing, the graph was disconnected. [sent-269, score-0.21]
95 In summary, we have found that the variance of the error for the geodesic distances decreases for the ensemble when the individual graphs are sparse. [sent-141, score-0.45]
96 The typical cut [9, 13] is a clustering criterion that is based on the probability pij that points xi and xj are in the same cluster over all possible partitions (under the Boltzmann distribution for the mincut cost function). [sent-273, score-0.32]
97 A better perturbation scheme is to move points more strongly in directions likely to lie within the manifold and less strongly in directions away from it, using a method such as k nearest neighbors to estimate appropriate directions. [sent-278, score-0.348]
98 Preliminary experiments with such a manifold-aligned model are very promising, particularly when the data is very noisy or its distribution on the manifold is not uniform. [sent-279, score-0.182]
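A sketch of one way such a manifold-aligned noise model could look, estimating local tangent directions by PCA on the k nearest neighbours; the scales, the assumed manifold dimension and all names are illustrative, not the authors' preliminary model.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def perturb_along_manifold(X, k=10, manifold_dim=2, s_tan=0.5, s_normal=0.05, rng=None):
    """Perturb each point more strongly along its estimated local tangent directions."""
    rng = np.random.default_rng(rng)
    N, d = X.shape
    D = squareform(pdist(X))
    Xp = X.copy()
    for i in range(N):
        nn = np.argsort(D[i])[1:k + 1]          # indices of the k nearest neighbours
        C = np.cov(X[nn].T)                     # local covariance (d x d)
        vals, vecs = np.linalg.eigh(C)          # ascending eigenvalues
        scale = np.full(d, s_normal)
        scale[-manifold_dim:] = s_tan           # largest-variance directions ~ tangent plane
        z = rng.standard_normal(d)
        Xp[i] = X[i] + vecs @ (scale * np.sqrt(np.maximum(vals, 0.0)) * z)
    return Xp
```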
99 An optimal graph theoretic approach to data clustering: Theory and its application to image segmentation. [sent-283, score-0.25]
100 Self organization in vision: Stochastic clustering for image segmentation, perceptual grouping, and image database organization. [sent-339, score-0.245]
wordName wordTfidf (topN-words)
[('resvar', 0.444), ('pmst', 0.314), ('mst', 0.277), ('ensemble', 0.223), ('psfrag', 0.221), ('replacements', 0.211), ('graph', 0.21), ('gd', 0.193), ('manifold', 0.182), ('avg', 0.176), ('msts', 0.166), ('clustering', 0.165), ('graphs', 0.154), ('perturbed', 0.135), ('dmst', 0.129), ('geodesic', 0.116), ('occluder', 0.111), ('shortcuts', 0.092), ('eij', 0.09), ('edges', 0.088), ('distances', 0.079), ('roll', 0.077), ('ellipse', 0.074), ('swiss', 0.074), ('noise', 0.071), ('estimated', 0.066), ('spectral', 0.066), ('grid', 0.06), ('construction', 0.057), ('proximity', 0.056), ('dmsts', 0.055), ('pmsts', 0.055), ('isomap', 0.054), ('edge', 0.052), ('points', 0.052), ('ensembles', 0.052), ('dij', 0.052), ('neighbors', 0.051), ('residual', 0.051), ('af', 0.049), ('perturbation', 0.049), ('segmentation', 0.045), ('succeeds', 0.044), ('dimensionality', 0.044), ('true', 0.042), ('reduction', 0.041), ('image', 0.04), ('error', 0.039), ('delaunay', 0.037), ('mincut', 0.037), ('triangulation', 0.037), ('dy', 0.035), ('range', 0.034), ('parameter', 0.034), ('trees', 0.033), ('manifolds', 0.033), ('nity', 0.032), ('disconnected', 0.032), ('predictors', 0.031), ('clusters', 0.029), ('sped', 0.029), ('spanning', 0.029), ('sparsity', 0.029), ('connect', 0.028), ('eigenvectors', 0.028), ('eigenvector', 0.027), ('wij', 0.027), ('plots', 0.026), ('blue', 0.026), ('gij', 0.026), ('lacks', 0.026), ('row', 0.025), ('variance', 0.025), ('bottleneck', 0.025), ('branches', 0.025), ('nities', 0.025), ('pij', 0.023), ('region', 0.023), ('axis', 0.023), ('sparse', 0.023), ('richard', 0.023), ('november', 0.023), ('careful', 0.023), ('cut', 0.022), ('strongly', 0.022), ('cluster', 0.021), ('robustness', 0.021), ('priori', 0.021), ('locally', 0.02), ('december', 0.02), ('connected', 0.02), ('list', 0.02), ('randomized', 0.02), ('high', 0.019), ('nearest', 0.019), ('costly', 0.019), ('intermediate', 0.019), ('lawrence', 0.019), ('low', 0.019), ('euclidean', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 150 nips-2004-Proximity Graphs for Clustering and Manifold Learning
Author: Richard S. Zemel, Miguel Á. Carreira-Perpiñán
Abstract: Many machine learning algorithms for clustering or dimensionality reduction take as input a cloud of points in Euclidean space, and construct a graph with the input data points as vertices. This graph is then partitioned (clustering) or used to redefine metric information (dimensionality reduction). There has been much recent work on new methods for graph-based clustering and dimensionality reduction, but not much on constructing the graph itself. Graphs typically used include the fullyconnected graph, a local fixed-grid graph (for image segmentation) or a nearest-neighbor graph. We suggest that the graph should adapt locally to the structure of the data. This can be achieved by a graph ensemble that combines multiple minimum spanning trees, each fit to a perturbed version of the data set. We show that such a graph ensemble usually produces a better representation of the data manifold than standard methods; and that it provides robustness to a subsequent clustering or dimensionality reduction algorithm based on the graph. 1
2 0.16535737 131 nips-2004-Non-Local Manifold Tangent Learning
Author: Yoshua Bengio, Martin Monperrus
Abstract: We claim and present arguments to the effect that a large class of manifold learning algorithms that are essentially local and can be framed as kernel learning algorithms will suffer from the curse of dimensionality, at the dimension of the true underlying manifold. This observation suggests to explore non-local manifold learning algorithms which attempt to discover shared structure in the tangent planes at different positions. A criterion for such an algorithm is proposed and experiments estimating a tangent plane prediction function are presented, showing its advantages with respect to local manifold learning algorithms: it is able to generalize very far from training data (on learning handwritten character image rotations), where a local non-parametric method fails. 1
3 0.12194566 161 nips-2004-Self-Tuning Spectral Clustering
Author: Lihi Zelnik-manor, Pietro Perona
Abstract: We study a number of open issues in spectral clustering: (i) Selecting the appropriate scale of analysis, (ii) Handling multi-scale data, (iii) Clustering with irregular background clutter, and, (iv) Finding automatically the number of groups. We first propose that a ‘local’ scale should be used to compute the affinity between each pair of points. This local scaling leads to better clustering especially when the data includes multiple scales and when the clusters are placed within a cluttered background. We further suggest exploiting the structure of the eigenvectors to infer automatically the number of groups. This leads to a new algorithm in which the final randomly initialized k-means stage is eliminated. 1
4 0.099710651 115 nips-2004-Maximum Margin Clustering
Author: Linli Xu, James Neufeld, Bryce Larson, Dale Schuurmans
Abstract: We propose a new method for clustering based on finding maximum margin hyperplanes through data. By reformulating the problem in terms of the implied equivalence relation matrix, we can pose the problem as a convex integer program. Although this still yields a difficult computational problem, the hard-clustering constraints can be relaxed to a soft-clustering formulation which can be feasibly solved with a semidefinite program. Since our clustering technique only depends on the data through the kernel matrix, we can easily achieve nonlinear clusterings in the same manner as spectral clustering. Experimental results show that our maximum margin clustering technique often obtains more accurate results than conventional clustering methods. The real benefit of our approach, however, is that it leads naturally to a semi-supervised training method for support vector machines. By maximizing the margin simultaneously on labeled and unlabeled training data, we achieve state of the art performance by using a single, integrated learning principle. 1
5 0.099084057 114 nips-2004-Maximum Likelihood Estimation of Intrinsic Dimension
Author: Elizaveta Levina, Peter J. Bickel
Abstract: We propose a new method for estimating intrinsic dimension of a dataset derived by applying the principle of maximum likelihood to the distances between close neighbors. We derive the estimator by a Poisson process approximation, assess its bias and variance theoretically and by simulations, and apply it to a number of simulated and real datasets. We also show it has the best overall performance compared with two other intrinsic dimension estimators. 1
6 0.095462784 103 nips-2004-Limits of Spectral Clustering
7 0.090191461 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning
8 0.089949556 160 nips-2004-Seeing through water
9 0.08798752 79 nips-2004-Hierarchical Eigensolver for Transition Matrices in Spectral Methods
10 0.081518203 13 nips-2004-A Three Tiered Approach for Articulated Object Action Modeling and Recognition
11 0.079293437 61 nips-2004-Efficient Out-of-Sample Extension of Dominant-Set Clusters
12 0.07900776 177 nips-2004-Supervised Graph Inference
13 0.0688693 196 nips-2004-Triangle Fixing Algorithms for the Metric Nearness Problem
14 0.068556763 165 nips-2004-Semi-supervised Learning on Directed Graphs
15 0.067470931 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach
16 0.067378655 167 nips-2004-Semi-supervised Learning with Penalized Probabilistic Clustering
17 0.067195617 203 nips-2004-Validity Estimates for Loopy Belief Propagation on Binary Real-world Networks
18 0.062427789 19 nips-2004-An Application of Boosting to Graph Classification
19 0.059992962 77 nips-2004-Hierarchical Clustering of a Mixture Model
20 0.059862535 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
topicId topicWeight
[(0, -0.174), (1, 0.061), (2, -0.084), (3, -0.072), (4, -0.033), (5, -0.032), (6, -0.141), (7, 0.066), (8, -0.163), (9, -0.04), (10, -0.134), (11, -0.128), (12, -0.021), (13, -0.156), (14, -0.079), (15, 0.066), (16, -0.014), (17, -0.048), (18, -0.059), (19, 0.125), (20, -0.004), (21, 0.062), (22, -0.089), (23, 0.003), (24, 0.004), (25, 0.142), (26, 0.043), (27, 0.019), (28, 0.099), (29, -0.092), (30, 0.028), (31, 0.021), (32, 0.116), (33, -0.013), (34, 0.006), (35, -0.025), (36, 0.018), (37, -0.088), (38, 0.122), (39, 0.018), (40, -0.054), (41, 0.052), (42, 0.023), (43, -0.033), (44, 0.007), (45, 0.048), (46, 0.072), (47, 0.055), (48, -0.031), (49, -0.041)]
simIndex simValue paperId paperTitle
same-paper 1 0.94909161 150 nips-2004-Proximity Graphs for Clustering and Manifold Learning
Author: Richard S. Zemel, Miguel Á. Carreira-Perpiñán
Abstract: Many machine learning algorithms for clustering or dimensionality reduction take as input a cloud of points in Euclidean space, and construct a graph with the input data points as vertices. This graph is then partitioned (clustering) or used to redefine metric information (dimensionality reduction). There has been much recent work on new methods for graph-based clustering and dimensionality reduction, but not much on constructing the graph itself. Graphs typically used include the fullyconnected graph, a local fixed-grid graph (for image segmentation) or a nearest-neighbor graph. We suggest that the graph should adapt locally to the structure of the data. This can be achieved by a graph ensemble that combines multiple minimum spanning trees, each fit to a perturbed version of the data set. We show that such a graph ensemble usually produces a better representation of the data manifold than standard methods; and that it provides robustness to a subsequent clustering or dimensionality reduction algorithm based on the graph. 1
2 0.69598472 131 nips-2004-Non-Local Manifold Tangent Learning
Author: Yoshua Bengio, Martin Monperrus
Abstract: We claim and present arguments to the effect that a large class of manifold learning algorithms that are essentially local and can be framed as kernel learning algorithms will suffer from the curse of dimensionality, at the dimension of the true underlying manifold. This observation suggests to explore non-local manifold learning algorithms which attempt to discover shared structure in the tangent planes at different positions. A criterion for such an algorithm is proposed and experiments estimating a tangent plane prediction function are presented, showing its advantages with respect to local manifold learning algorithms: it is able to generalize very far from training data (on learning handwritten character image rotations), where a local non-parametric method fails. 1
3 0.53225732 61 nips-2004-Efficient Out-of-Sample Extension of Dominant-Set Clusters
Author: Massimiliano Pavan, Marcello Pelillo
Abstract: Dominant sets are a new graph-theoretic concept that has proven to be relevant in pairwise data clustering problems, such as image segmentation. They generalize the notion of a maximal clique to edgeweighted graphs and have intriguing, non-trivial connections to continuous quadratic optimization and spectral-based grouping. We address the problem of grouping out-of-sample examples after the clustering process has taken place. This may serve either to drastically reduce the computational burden associated to the processing of very large data sets, or to efficiently deal with dynamic situations whereby data sets need to be updated continually. We show that the very notion of a dominant set offers a simple and efficient way of doing this. Numerical experiments on various grouping problems show the effectiveness of the approach. 1
4 0.5139249 160 nips-2004-Seeing through water
Author: Alexei Efros, Volkan Isler, Jianbo Shi, Mirkó Visontai
Abstract: We consider the problem of recovering an underwater image distorted by surface waves. A large amount of video data of the distorted image is acquired. The problem is posed in terms of finding an undistorted image patch at each spatial location. This challenging reconstruction task can be formulated as a manifold learning problem, such that the center of the manifold is the image of the undistorted patch. To compute the center, we present a new technique to estimate global distances on the manifold. Our technique achieves robustness through convex flow computations and solves the “leakage” problem inherent in recent manifold embedding techniques. 1
5 0.50158173 161 nips-2004-Self-Tuning Spectral Clustering
Author: Lihi Zelnik-manor, Pietro Perona
Abstract: We study a number of open issues in spectral clustering: (i) Selecting the appropriate scale of analysis, (ii) Handling multi-scale data, (iii) Clustering with irregular background clutter, and, (iv) Finding automatically the number of groups. We first propose that a ‘local’ scale should be used to compute the affinity between each pair of points. This local scaling leads to better clustering especially when the data includes multiple scales and when the clusters are placed within a cluttered background. We further suggest exploiting the structure of the eigenvectors to infer automatically the number of groups. This leads to a new algorithm in which the final randomly initialized k-means stage is eliminated. 1
6 0.44984904 114 nips-2004-Maximum Likelihood Estimation of Intrinsic Dimension
7 0.42794144 115 nips-2004-Maximum Margin Clustering
8 0.42229849 103 nips-2004-Limits of Spectral Clustering
9 0.39614108 17 nips-2004-Adaptive Manifold Learning
10 0.39138803 79 nips-2004-Hierarchical Eigensolver for Transition Matrices in Spectral Methods
11 0.38708794 141 nips-2004-Optimal sub-graphical models
12 0.38478357 167 nips-2004-Semi-supervised Learning with Penalized Probabilistic Clustering
13 0.37626788 165 nips-2004-Semi-supervised Learning on Directed Graphs
14 0.37060997 196 nips-2004-Triangle Fixing Algorithms for the Metric Nearness Problem
15 0.35472804 13 nips-2004-A Three Tiered Approach for Articulated Object Action Modeling and Recognition
16 0.34002167 177 nips-2004-Supervised Graph Inference
17 0.33985496 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
18 0.32722116 182 nips-2004-Synergistic Face Detection and Pose Estimation with Energy-Based Models
19 0.32318226 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning
20 0.32203335 57 nips-2004-Economic Properties of Social Networks
topicId topicWeight
[(13, 0.11), (15, 0.142), (26, 0.039), (31, 0.02), (33, 0.174), (35, 0.021), (39, 0.014), (50, 0.035), (54, 0.019), (90, 0.32)]
simIndex simValue paperId paperTitle
1 0.80185699 108 nips-2004-Markov Networks for Detecting Overalpping Elements in Sequence Data
Author: Mark Craven, Joseph Bockhorst
Abstract: Many sequential prediction tasks involve locating instances of patterns in sequences. Generative probabilistic language models, such as hidden Markov models (HMMs), have been successfully applied to many of these tasks. A limitation of these models however, is that they cannot naturally handle cases in which pattern instances overlap in arbitrary ways. We present an alternative approach, based on conditional Markov networks, that can naturally represent arbitrarily overlapping elements. We show how to efficiently train and perform inference with these models. Experimental results from a genomics domain show that our models are more accurate at locating instances of overlapping patterns than are baseline models based on HMMs. 1
same-paper 2 0.79393834 150 nips-2004-Proximity Graphs for Clustering and Manifold Learning
Author: Richard S. Zemel, Miguel Á. Carreira-Perpiñán
Abstract: Many machine learning algorithms for clustering or dimensionality reduction take as input a cloud of points in Euclidean space, and construct a graph with the input data points as vertices. This graph is then partitioned (clustering) or used to redefine metric information (dimensionality reduction). There has been much recent work on new methods for graph-based clustering and dimensionality reduction, but not much on constructing the graph itself. Graphs typically used include the fullyconnected graph, a local fixed-grid graph (for image segmentation) or a nearest-neighbor graph. We suggest that the graph should adapt locally to the structure of the data. This can be achieved by a graph ensemble that combines multiple minimum spanning trees, each fit to a perturbed version of the data set. We show that such a graph ensemble usually produces a better representation of the data manifold than standard methods; and that it provides robustness to a subsequent clustering or dimensionality reduction algorithm based on the graph. 1
3 0.75020176 28 nips-2004-Bayesian inference in spiking neurons
Author: Sophie Deneve
Abstract: We propose a new interpretation of spiking neurons as Bayesian integrators accumulating evidence over time about events in the external world or the body, and communicating to other neurons their certainties about these events. In this model, spikes signal the occurrence of new information, i.e. what cannot be predicted from the past activity. As a result, firing statistics are close to Poisson, albeit providing a deterministic representation of probabilities. We proceed to develop a theory of Bayesian inference in spiking neural networks, recurrent interactions implementing a variant of belief propagation. Many perceptual and motor tasks performed by the central nervous system are probabilistic, and can be described in a Bayesian framework [4, 3]. A few important but hidden properties, such as direction of motion, or appropriate motor commands, are inferred from many noisy, local and ambiguous sensory cues. These evidences are combined with priors about the sensory world and body. Importantly, because most of these inferences should lead to quick and irreversible decisions in a perpetually changing world, noisy cues have to be integrated on-line, but in a way that takes into account unpredictable events, such as a sudden change in motion direction or the appearance of a new stimulus. This raises the question of how this temporal integration can be performed at the neural level. It has been proposed that single neurons in sensory cortices represent and compute the log probability that a sensory variable takes on a certain value (eg Is visual motion in the neuron’s preferred direction?) [9, 7]. Alternatively, to avoid normalization issues and provide an appropriate signal for decision making, neurons could represent the log probability ratio of a particular hypothesis (eg is motion more likely to be towards the right than towards the left) [7, 6]. Log probabilities are convenient here, since under some assumptions, independent noisy cues simply combine linearly. Moreover, there are physiological evidence for the neural representation of log probabilities and log probability ratios [9, 6, 7]. However, these models assume that neurons represent probabilities in their firing rates. We argue that it is important to study how probabilistic information are encoded in spikes. Indeed, it seems spurious to marry the idea of an exquisite on-line integration of noisy cues with an underlying rate code that requires averaging on large populations of noisy neurons and long periods of time. In particular, most natural tasks require this integration to take place on the time scale of inter-spike intervals. Spikes are more efficiently signaling events ∗ Institute of Cognitive Science, 69645 Bron, France than analog quantities. In addition, a neural theory of inference with spikes will bring us closer to the physiological level and generate more easily testable predictions. Thus, we propose a new theory of neural processing in which spike trains provide a deterministic, online representation of a log-probability ratio. Spikes signals events, eg that the log-probability ratio has exceeded what could be predicted from previous spikes. This form of coding was loosely inspired by the idea of ”energy landscape” coding proposed by Hinton and Brown [2]. 
However, contrary to [2] and other theories using rate-based representation of probabilities, this model is self-consistent and does not require different models for encoding and decoding: As output spikes provide new, unpredictable, temporally independent evidence, they can be used directly as an input to other Bayesian neurons. Finally, we show that these neurons can be used as building blocks in a theory of approximate Bayesian inference in recurrent spiking networks. Connections between neurons implement an underlying Bayesian network, consisting of coupled hidden Markov models. Propagation of spikes is a form of belief propagation in this underlying graphical model. Our theory provides computational explanations of some general physiological properties of cortical neurons, such as spike frequency adaptation, Poisson statistics of spike trains, the existence of strong local inhibition in cortical columns, and the maintenance of a tight balance between excitation and inhibition. Finally, we discuss the implications of this model for the debate about temporal versus rate-based neural coding. 1 Spikes and log posterior odds 1.1 Synaptic integration seen as inference in a hidden Markov chain We propose that each neuron codes for an underlying ”hidden” binary variable, xt , whose state evolves over time. We assume that xt depends only on the state at the previous time step, xt−dt , and is conditionally independent of other past states. The state xt can switch 1 from 0 to 1 with a constant rate ron = dt limdt→0 P (xt = 1|xt−dt = 0), and from 1 to 0 with a constant rate roff . For example, these transition rates could represent how often motion in a preferred direction appears the receptive field and how long it is likely to stay there. The neuron infers the state of its hidden variable from N noisy synaptic inputs, considered to be observations of the hidden state. In this initial version of the model, we assume that these inputs are conditionally independent homogeneous Poisson processes, synapse i i emitting a spike between time t and t + dt (si = 1) with constant probability qon dt if t i xt = 1, and another constant probability qoff dt if xt = 0. The synaptic spikes are assumed to be otherwise independent of previous synaptic spikes, previous states and spikes at other synapses. The resulting generative model is a hidden Markov chain (figure 1-A). However, rather than estimating the state of its hidden variable and communicating this estimate to other neurons (for example by emitting a spike when sensory evidence for xt = 1 goes above a threshold) the neuron reports and communicates its certainty that the current state is 1. This certainty takes the form of the log of the ratio of the probability that the hidden state is 1, and the probability that the state is 0, given all the synaptic inputs P (xt =1|s0→t ) received so far: Lt = log P (xt =0|s0→t ) . We use s0→t as a short hand notation for the N synaptic inputs received at present and in the past. We will refer to it as the log odds ratio. Thanks to the conditional independencies assumed in the generative model, we can compute this Log odds ratio iteratively. Taking the limit as dt goes to zero, we get the following differential equation: ˙ L = ron 1 + e−L − roff 1 + eL + i wi δ(si − 1) − θ t B. A. xt ron .roff dt qon , qoff st xt ron .roff i t st dt s qon , qoff qon , qoff st dt xt j st Ot It Gt Ot Lt t t dt C. E. 2 0 -2 -4 D. 
500 1000 1500 2000 2500 2 3000 Count Log odds 4 20 Lt 0 -2 0 500 1000 1500 2000 2500 Time Ot 3000 0 200 400 600 ISI Figure 1: A. Generative model for the synaptic input. B. Schematic representation of log odds ratio encoding and decoding. The dashed circle represents both eventual downstream elements and the self-prediction taking place inside the model neuron. A spike is fired only when Lt exceeds Gt . C. One example trial, where the state switches from 0 to 1 (shaded area) and back to 0. plain: Lt , dotted: Gt . Black stripes at the top: corresponding spikes train. D. Mean Log odds ratio (dark line) and mean output firing rate (clear line). E. Output spike raster plot (1 line per trial) and ISI distribution for the neuron shown is C. and D. Clear line: ISI distribution for a poisson neuron with the same rate. wi , the synaptic weight, describe how informative synapse i is about the state of the hidden i qon variable, e.g. wi = log qi . Each synaptic spike (si = 1) gives an impulse to the log t off odds ratio, which is positive if this synapse is more active when the hidden state if 1 (i.e it increases the neuron’s confidence that the state is 1), and negative if this synapse is more active when xt = 0 (i.e it decreases the neuron’s confidence that the state is 1). The bias, θ, is determined by how informative it is not to receive any spike, e.g. θ = i i i qon − qoff . By convention, we will consider that the ”bias” is positive or zero (if not, we need simply to invert the status of the state x). 1.2 Generation of output spikes The spike train should convey a sparse representation of Lt , so that each spike reports new information about the state xt that is not redundant with that reported by other, preceding, spikes. This proposition is based on three arguments: First, spikes, being metabolically expensive, should be kept to a minimum. Second, spikes conveying redundant information would require a decoding of the entire spike train, whereas independent spike can be taken into account individually. And finally, we seek a self consistent model, with the spiking output having a similar semantics to its spiking input. To maximize the independence of the spikes (conditioned on xt ), we propose that the neuron fires only when the difference between its log odds ratio Lt and a prediction Gt of this log odds ratio based on the output spikes emitted so far reaches a certain threshold. Indeed, supposing that downstream elements predicts Lt as best as they can, the neuron only needs to fire when it expects that prediction to be too inaccurate (figure 1-B). In practice, this will happen when the neuron receives new evidence for xt = 1. Gt should thereby follow the same dynamics as Lt when spikes are not received. The equation for Gt and the output Ot (Ot = 1 when an output spike is fired) are given by: ˙ G = Ot = ron 1 + e−L − roff 1 + eL + go δ(Ot − 1) go 1. when Lt > Gt + , 0 otherwise, 2 (1) (2) Here go , a positive constant, is the only free parameter, the other parameters being constrained by the statistics of the synaptic input. 1.3 Results Figure 1-C plots a typical trial, showing the behavior of L, G and O before, during and after presentation of the stimulus. As random synaptic inputs are integrated, L fluctuates and eventually exceeds G + 0.5, leading to an output spike. Immediately after a spike, G jumps to G + go , which prevents (except in very rare cases) a second spike from immediately following the first. Thus, this ”jump” implements a relative refractory period. 
However, ron G decays as it tends to converge back to its stable level gstable = log roff . Thus L eventually exceeds G again, leading to a new spike. This threshold crossing happens more often during stimulation (xt = 1) as the net synaptic input alters to create a higher overall level of certainty, Lt . Mean Log odds ratio and output firing rate ¯ The mean firing rate Ot of the Bayesian neuron during presentation of its preferred stimulus (i.e. when xt switches from 0 to 1 and back to 0) is plotted in figure 1-D, together with the ¯ mean log posterior ratio Lt , both averaged over trials. Not surprisingly, the log-posterior ratio reflects the leaky integration of synaptic evidence, with an effective time constant that depends on the transition probabilities ron , roff . If the state is very stable (ron = roff ∼ 0), synaptic evidence is integrated over almost infinite time periods, the mean log posterior ratio tending to either increase or decrease linearly with time. In the example in figure 1D, the state is less stable, so ”old” synaptic evidence are discounted and Lt saturates. ¯ In contrast, the mean output firing rate Ot tracks the state of xt almost perfectly. This is because, as a form of predictive coding, the output spikes reflect the new synaptic i evidence, It = i δ(st − 1) − θ, rather than the log posterior ratio itself. In particular, the mean output firing rate is a rectified linear function of the mean input, e. g. + ¯ ¯ wi q i −θ . O= 1I= go i on(off) Analogy with a leaky integrate and fire neuron We can get an interesting insight into the computation performed by this neuron by linearizing L and G around their mean levels over trials. Here we reduce the analysis to prolonged, statistically stable periods when the state is constant (either ON or OFF). In this case, the ¯ ¯ mean level of certainty L and its output prediction G are also constant over time. We make the rough approximation that the post spike jump, go , and the input fluctuations are small ¯ compared to the mean level of certainty L. Rewriting Vt = Lt − Gt + go 2 as the ”membrane potential” of the Bayesian neuron: ˙ V = −kL V + It − ∆go − go Ot ¯ ¯ ¯ where kL = ron e−L + roff eL , the ”leak” of the membrane potential, depends on the overall ¯ level of certainty. ∆go is positive and a monotonic increasing function of go . A. s t1 dt s t1 s t1 dt B. C. x t1 x t3 dt x t3 x t3 dt x t1 x t1 x t1 x t2 x t3 x t1 … x tn x t3 x t2 … x tn … dt dt Lx2 D. x t2 dt s t2 dt x t2 s t2 x t2 dt s t2 dt Log odds 10 No inh -0.5 -1 -1 -1.5 -2 5 Feedback 500 1000 1500 2000 Tiger Stripes 0 -5 -10 500 1000 1500 2000 2500 Time Figure 2: A. Bayesian causal network for yt (tiger), x1 (stripes) and x2 (paws). B. A nett t work feedforward computing the log posterior for x1 . C. A recurrent network computing t the log posterior odds for all variables. D. Log odds ratio in a simulated trial with the net2 1 1 work in C (see text). Thick line: Lx , thin line: Lx , dash-dotted: Lx without inhibition. t t t 2 Insert: Lx averaged over trials, showing the effect of feedback. t The linearized Bayesian neuron thus acts in its stable regime as a leaky integrate and fire (LIF) neuron. The membrane potential Vt integrates its input, Jt = It − ∆go , with a leak kL . The neuron fires when its membrane potential reaches a constant threshold go . After ¯ each spikes, Vt is reset to 0. Interestingly, for appropriately chosen compression factor go , the mean input to the lin¯ ¯ earized neuron J = I − ∆go ≈ 0 1 . 
This means that the membrane potential is purely driven to its threshold by input fluctuations, or a random walk in membrane potential. As a consequence, the neuron’s firing will be memoryless, and close to a Poisson process. In particular, we found Fano factor close to 1 and quasi-exponential ISI distribution (figure 1E) on the entire range of parameters tested. Indeed, LIF neurons with balanced inputs have been proposed as a model to reproduce the statistics of real cortical neurons [8]. This balance is implemented in our model by the neuron’s effective self-inhibition, even when the synaptic input itself is not balanced. Decoding As we previously said, downstream elements could predict the log odds ratio Lt by computing Gt from the output spikes (Eq 1, fig 1-B). Of course, this requires an estimate of the transition probabilities ron , roff , that could be learned from the observed spike trains. However, we show next that explicit decoding is not necessary to perform bayesian inference in spiking networks. Intuitively, this is because the quantity that our model neurons receive and transmit, eg new information, is exactly what probabilistic inference algorithm propagate between connected statistical elements. 1 ¯ Even if go is not chosen optimally, the influence of the drift J is usually negligible compared to the large fluctuations in membrane potential. 2 Bayesian inference in cortical networks The model neurons, having the same input and output semantics, can be used as building blocks to implement more complex generative models consisting of coupled Markov chains. Consider, for example, the example in figure 2-A. Here, a ”parent” variable x1 t (the presence of a tiger) can cause the state of n other ”children” variables ([xk ]k=2...n ), t of whom two are represented (the presence of stripes,x2 , and motion, x3 ). The ”chilt t dren” variables are Bayesian neurons identical to those described previously. The resulting bayesian network consist of n + 1 coupled hidden Markov chains. Inference in this architecture corresponds to computing the log posterior odds ratio for the tiger, x1 , and the log t posterior of observing stripes or motion, ([xk ]k=2...n ), given the synaptic inputs received t by the entire network so far, i.e. s2 , . . . , sk . 0→t 0→t Unfortunately, inference and learning in this network (and in general in coupled Markov chains) requires very expensive computations, and cannot be performed by simply propagating messages over time and among the variable nodes. In particular, the state of a child k variable xt depends on xk , sk , x1 and the state of all other children at the previous t t t−dt time step, [xj ]2
4 0.61963737 131 nips-2004-Non-Local Manifold Tangent Learning
Author: Yoshua Bengio, Martin Monperrus
Abstract: We claim and present arguments to the effect that a large class of manifold learning algorithms that are essentially local and can be framed as kernel learning algorithms will suffer from the curse of dimensionality, at the dimension of the true underlying manifold. This observation suggests to explore non-local manifold learning algorithms which attempt to discover shared structure in the tangent planes at different positions. A criterion for such an algorithm is proposed and experiments estimating a tangent plane prediction function are presented, showing its advantages with respect to local manifold learning algorithms: it is able to generalize very far from training data (on learning handwritten character image rotations), where a local non-parametric method fails. 1
5 0.61889404 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer
Abstract: Prediction suffix trees (PST) provide a popular and effective tool for tasks such as compression, classification, and language modeling. In this paper we take a decision theoretic view of PSTs for the task of sequence prediction. Generalizing the notion of margin to PSTs, we present an online PST learning algorithm and derive a loss bound for it. The depth of the PST generated by this algorithm scales linearly with the length of the input. We then describe a self-bounded enhancement of our learning algorithm which automatically grows a bounded-depth PST. We also prove an analogous mistake-bound for the self-bounded algorithm. The result is an efficient algorithm that neither relies on a-priori assumptions on the shape or maximal depth of the target PST nor does it require any parameters. To our knowledge, this is the first provably-correct PST learning algorithm which generates a bounded-depth PST while being competitive with any fixed PST determined in hindsight. 1
6 0.61826563 178 nips-2004-Support Vector Classification with Input Data Uncertainty
7 0.61474073 102 nips-2004-Learning first-order Markov models for control
8 0.61228973 125 nips-2004-Multiple Relational Embedding
9 0.61149657 127 nips-2004-Neighbourhood Components Analysis
10 0.61053598 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach
11 0.6104573 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill
12 0.61031079 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning
13 0.61027277 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images
14 0.61026382 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform
15 0.60999876 16 nips-2004-Adaptive Discriminative Generative Model and Its Applications
16 0.6091283 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
17 0.60878551 116 nips-2004-Message Errors in Belief Propagation
18 0.60865891 61 nips-2004-Efficient Out-of-Sample Extension of Dominant-Set Clusters
19 0.60823119 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection
20 0.60817158 26 nips-2004-At the Edge of Chaos: Real-time Computations and Self-Organized Criticality in Recurrent Neural Networks