nips nips2004 nips2004-150 knowledge-graph by maker-knowledge-mining

150 nips-2004-Proximity Graphs for Clustering and Manifold Learning


Source: pdf

Author: Richard S. Zemel, Miguel Á. Carreira-Perpiñán

Abstract: Many machine learning algorithms for clustering or dimensionality reduction take as input a cloud of points in Euclidean space, and construct a graph with the input data points as vertices. This graph is then partitioned (clustering) or used to redefine metric information (dimensionality reduction). There has been much recent work on new methods for graph-based clustering and dimensionality reduction, but not much on constructing the graph itself. Graphs typically used include the fully-connected graph, a local fixed-grid graph (for image segmentation) or a nearest-neighbor graph. We suggest that the graph should adapt locally to the structure of the data. This can be achieved by a graph ensemble that combines multiple minimum spanning trees, each fit to a perturbed version of the data set. We show that such a graph ensemble usually produces a better representation of the data manifold than standard methods; and that it provides robustness to a subsequent clustering or dimensionality reduction algorithm based on the graph. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Proximity graphs for clustering and manifold learning. Miguel Á. [sent-1, score-0.501]

2 Abstract: Many machine learning algorithms for clustering or dimensionality reduction take as input a cloud of points in Euclidean space, and construct a graph with the input data points as vertices. [sent-7, score-0.564]

3 This graph is then partitioned (clustering) or used to redefine metric information (dimensionality reduction). [sent-8, score-0.21]

4 There has been much recent work on new methods for graph-based clustering and dimensionality reduction, but not much on constructing the graph itself. [sent-9, score-0.419]

5 Graphs typically used include the fully-connected graph, a local fixed-grid graph (for image segmentation) or a nearest-neighbor graph. [sent-10, score-0.25]

6 We suggest that the graph should adapt locally to the structure of the data. [sent-11, score-0.23]

7 This can be achieved by a graph ensemble that combines multiple minimum spanning trees, each fit to a perturbed version of the data set. [sent-12, score-0.597]

8 We show that such a graph ensemble usually produces a better representation of the data manifold than standard methods; and that it provides robustness to a subsequent clustering or dimensionality reduction algorithm based on the graph. [sent-13, score-0.886]

9 1 Introduction. Graph-based algorithms have long been popular, and have received even more attention recently, for two of the fundamental problems in machine learning: clustering [1–4] and manifold learning [5–8]. [sent-14, score-0.347]

10 Relatively little attention has been paid to the properties and construction methods for the graphs that these algorithms depend on. [sent-15, score-0.211]

11 In the applications considered here, the graphs are an intermediate form of representation, and therefore their utility to some extent depends on the algorithms that they will ultimately be used for. [sent-17, score-0.173]

12 However, in the case of both clustering and manifold learning, the data points are assumed to lie on some small number of manifolds. [sent-18, score-0.399]

13 Intuitively, the graph should represent these underlying manifolds well: it should avoid shortcuts that travel outside a manifold, avoid gaps that erroneously disconnect regions of a manifold, and be dense within the manifold and clusters. [sent-19, score-0.517]

14 [Figure 1 panel labels: dataset, MST, k-NNG, ε-ball graph, Delaunay triangulation, perturbed dataset] Figure 1: Sensitivity to noise of proximity graphs. [sent-21, score-0.585]

15 Top row: several proximity graphs constructed on a noisy sample of points lying on a circle. [sent-22, score-0.262]

16 Bottom row: the same graphs constructed on a different sample; specifically, we added to each point Gaussian noise of standard deviation equal to the length of the small segment shown in the centre of the dataset (top row), built the graph, and drew it on the original dataset. [sent-23, score-0.225]

17 A fully-connected graph is used for example in spectral clustering and multidimensional scaling, while a fixed grid, with each point connecting to some small fixed number of neighbors in a pre-defined grid of locations, is generally used in image segmentation. [sent-26, score-0.592]

18 The ε-ball or k-NNG provide an improvement over the fully-connected graph or fixed grid (clustering: [3, 9]; manifold learning: [5, 7]). [sent-28, score-0.452]

19 In this paper we propose a different method of graph construction, one based on minimum spanning trees (MSTs). [sent-34, score-0.272]

20 Our method involves an ensemble of trees, each built on a perturbed version of the data. [sent-35, score-0.358]

21 We first discuss the motivation for this new type of graph, then examine its robustness properties and its utility for subsequent clustering and dimensionality reduction methods. [sent-36, score-0.271]

22 2 Two new types of proximity graphs. A minimum spanning tree (MST) is a tree subgraph that contains all the vertices and has a minimum sum of edge weights. [sent-37, score-0.291]

23 One way to flesh it out and attain robustness to noise is to form an MST ensemble that combines multiple MSTs; we give two different algorithms for this. [sent-41, score-0.315]

24 The locality of the noise model allows points to move more or less depending on the local data structure around them and to connect to different numbers of neighbors at different distances; in effect we achieve a variable k and ε. [sent-46, score-0.202]

25 To build the PMST ensemble, we generate T > 1 perturbed copies of the entire data set according to the local noise model and fit an MST to each. [sent-47, score-0.206]

26 The PMST ensemble assigns a weight eij ∈ [0, 1] to the edge between points xi and xj, equal to the fraction of the T trees in which that edge appears. [sent-48, score-0.469]

27 For T = 1 this gives the MST of the unperturbed data set; for T → ∞ it gives a stochastic graph where eij is the probability (in the Laplace sense) of that edge under the noise model. [sent-49, score-0.423]

28 The PMST ensemble contains at most T (N − 1) edges (usually much less). [sent-50, score-0.311]

29 Although the algorithm is randomized, the PMST ensemble for large T is essentially deterministic, and insensitive to noise by construction. [sent-51, score-0.294]
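
To make the PMST ensemble construction above concrete, here is a minimal Python sketch (NumPy/SciPy assumed). It substitutes a simple isotropic Gaussian perturbation of scale r for the paper's local noise model, and the function name pmst_ensemble is ours, so treat it as an illustration rather than the authors' implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import minimum_spanning_tree

def pmst_ensemble(X, T=20, r=0.1, seed=0):
    """PMST-ensemble sketch: average MST adjacency over T perturbed copies of X.

    X is an (N, D) array.  Here r scales an isotropic Gaussian perturbation,
    a simplification of the local noise model.  Returns an (N, N) symmetric
    array of edge values e_ij in [0, 1] (fraction of trees containing each edge)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    E = np.zeros((N, N))
    for _ in range(T):
        Xp = X + r * rng.standard_normal(X.shape)       # perturbed copy of the data
        D = cdist(Xp, Xp)                               # pairwise Euclidean distances
        tree = minimum_spanning_tree(D).toarray() > 0   # MST edges of this copy
        E += tree | tree.T                              # symmetrize and accumulate
    return E / T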

30 2 Disjoint MSTs (DMSTs). Here we build a graph that is a deterministic collection of t MSTs satisfying the property that the nth tree (for n = 1, . [sent-54, score-0.21]

31 Specifically, we sort the list of N(N−1)/2 edges eij by increasing distance dij and visit each available edge in turn, removing it if it merges two clusters (equivalently, if it does not create a cycle); whenever we have removed N − 1 edges, we go back to the start of the list. [sent-62, score-0.331]

32 The DMST ensemble consists of the union of all removed edges and contains t(N − 1) edges each of weight 1. [sent-64, score-0.399]
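
The DMST sweep described in sentences 31–32 can be sketched as repeated Kruskal-style passes over a single sorted edge list with a union-find structure; the code below is an illustrative reading of that description (names are ours), not the authors' code, and assumes the remaining edges stay sufficient to complete each tree.

```python
import numpy as np
from scipy.spatial.distance import cdist

class DisjointSet:
    """Minimal union-find used by the Kruskal-style sweep below."""
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]   # path halving
            i = self.parent[i]
        return i
    def union(self, i, j):
        ri, rj = self.find(i), self.find(j)
        if ri == rj:
            return False
        self.parent[ri] = rj
        return True

def dmst_ensemble(X, t=3):
    """DMST-ensemble sketch: t edge-disjoint spanning trees.

    Sorts the N(N-1)/2 edges once by distance, then makes t passes; each pass
    greedily keeps the shortest available edges that merge two clusters (i.e.
    do not create a cycle) and removes them from the pool for later passes.
    Returns an (N, N) 0/1 adjacency matrix with t*(N-1) edges."""
    N = X.shape[0]
    D = cdist(X, X)
    edges = sorted((D[i, j], i, j) for i in range(N) for j in range(i + 1, N))
    used = set()
    A = np.zeros((N, N))
    for _ in range(t):
        ds = DisjointSet(N)
        added = 0
        for d, i, j in edges:
            if (i, j) in used:
                continue
            if ds.union(i, j):            # edge merges two clusters: keep it
                used.add((i, j))
                A[i, j] = A[j, i] = 1
                added += 1
                if added == N - 1:        # a full spanning tree; start next pass
                    break
    return A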

33 The t parameter controls the overall density of the graph, which is always connected; unlike ε or k (for the ε-ball or k-NNG), t is not a parameter that depends locally on the data, and again points may connect to different numbers of neighbors at different distances. [sent-65, score-0.219]

34 In both cases the resulting graphs are sparse (the number of edges is linear in the number of points N). [sent-73, score-0.317]

35 …an 8-connected grid in image segmentation) the edge list is much shorter, so the graph construction is faster. [sent-76, score-0.173]

36 For the perturbed MST ensemble, the perturbation of the data set results in a partially disordered edge list, which one should be able to sort efficiently. [sent-77, score-0.236]

37 Overall, the real computational bottleneck is the graph postprocessing, typically O(N^3) in spectral methods (for clustering or manifold learning). [sent-79, score-0.648]

38 This can be sped up to O(cN^2) by using sparsity (limiting a priori the edges allowed, thus approximating the true solution), but then the graph construction is likewise sped up. [sent-80, score-0.505]

39 Thus, even if our graphs are slightly more costly to construct than the ε-ball or k-NNG, the computational savings are very large if we avoid having to run the spectral technique multiple times in search of a good ε or k. [sent-81, score-0.239]

40 3 Experiments. We present two sets of experiments on the application of the graphs to clustering and manifold learning, respectively. [sent-82, score-0.501]

41 3.1 Clustering. In affinity-based clustering, our data is an N × N affinity matrix W that defines a graph (where nonzeros define edge weights), and we seek a partition of the graph that optimizes a cost function such as mincut [1] or normalized cut [2]. [sent-84, score-0.531]

42 This graph partitioning problem is generally NP-complete, so approximations are necessary, such as spectral clustering algorithms [2]. [sent-86, score-0.441]

43 In spectral clustering we seek to cluster in the leading eigenvectors of the normalized affinity matrix N = D^{-1/2} W D^{-1/2} (where D = diag(∑_i wij)), discarding the constant eigenvector associated with eigenvalue 1. [sent-87, score-0.334]
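
As a worked illustration of sentence 43 (a sketch, not the exact recipe of [2]), the spectral step can be written as follows; the function name spectral_embedding is ours, and we simply drop the top eigenvector of the symmetrically normalized affinity, which plays the role of the constant eigenvector mentioned above.

```python
import numpy as np

def spectral_embedding(W, k):
    """Leading eigenvectors of N = D^{-1/2} W D^{-1/2}, dropping the top one.

    W is an (N, N) symmetric affinity matrix; returns an (N, k) embedding
    that can then be clustered (e.g. with k-means)."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))   # guard against isolated points
    Nmat = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(Nmat)                  # ascending eigenvalues
    return vecs[:, -(k + 1):-1][:, ::-1]               # the k eigenvectors below the top one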

44 Spectral clustering succeeds only for a range of values of σ where N displays the natural cluster structure of the data; if σ is too small W is approximately diagonal and if σ is too large W is approximately a matrix of ones. [sent-88, score-0.264]

45 Fig. 2 shows segmentation results for a grayscale image from [11] where the objective is to segment the occluder from the underlying background, a hard task given the intensity gradients. [sent-91, score-0.196]

46 The other method uses the PMST or DMST ensemble (constrained to contain only edges in the 8-connected grid) under different values of the r, t parameters; the graph has between 44% and 98% of the number of edges in the 8-grid, depending on the parameter value. [sent-94, score-0.643]

47 We define the affinity matrix as wij = eij exp(−(1/2)(dij/σ)^2) (where eij ∈ [0, 1] are the edge values). [sent-95, score-0.259]
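
In code, the affinity definition in sentence 47 is just an elementwise product of the graph's edge values with a Gaussian kernel on the distances; a minimal sketch (the helper name is ours):

```python
import numpy as np

def affinities(E, D, sigma):
    """w_ij = e_ij * exp(-0.5 * (d_ij / sigma)**2), with E the (N, N) edge
    values in [0, 1] and D the matching matrix of pairwise distances."""
    return E * np.exp(-0.5 * (D / sigma) ** 2)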

48 In both methods we apply the spectral clustering algorithm of [2]. [sent-96, score-0.231]

49 The plot shows the clustering error (mismatched occluder area) for a range of scales. [sent-97, score-0.349]

50 The 8-connected grid succeeds in segmenting the occluder for σ ∈ [0. [sent-98, score-0.215]

51 The reason for this success even for such high σ is that the graph lacks many edges around the occluder, so those affinities are zero no matter how high the scale is. [sent-100, score-0.362]

52 In other words, for clustering, our graphs enhance the inside of clusters with respect to the bridges between clusters, and so ease the graph partitioning. [sent-101, score-0.393]

53 … along the manifold) ĝij … [Figure 2 panel labels: first 5 eigenvectors (except the constant one); 8-grid, σ = ∞; PMST ensemble; 8-grid, σ2] [sent-106, score-0.251]

54 We consider segmenting the greyscale image at the bottom (an occluder over a background) with spectral clustering, asking for K = 5 clusters. [sent-109, score-0.217]

55–56 The color diagrams represent the segmentation (column 1) and the first 5 eigenvectors of the affinity matrix (except the constant eigenvector, columns 2–4) obtained with spectral clustering, using a PMST ensemble with r = 0.4 (upper row) or an 8-connectivity graph (lower row), for 3 different scales: σ1 = 0.… [sent-110, score-0.362] [sent-111, score-0.21]

57 The PMST ensemble succeeds at all scales (note how several eigenvectors are constant over the occluder), while the 8-connectivity graph progressively deteriorates as σ increases to give a partition of equal-sized clusters for large scale. [sent-114, score-0.534]

58 In the bottom part of the figure we show: the PMST ensemble graph in 3D space; the clustering error vs σ (where the right end is σ = ∞) for the 8-connectivity graph (thick blue line) and for various other PMST and DMST ensembles under various parameters (thin lines). [sent-115, score-0.925]

59 …between pairs of points in the data set as the shortest-path lengths in a graph learned from the data. [sent-117, score-0.262]

60 Then we apply multidimensional scaling to these distances to obtain a collection of low-dimensional points {yi}, i = 1, …, N, that optimally preserves the estimated geodesic distances. [sent-118, score-0.313]
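
Sentences 59–60 describe the two Isomap steps (graph shortest paths, then metric MDS); below is a compact sketch, assuming SciPy and a dense matrix of edge lengths with 0 for absent edges, and assuming the graph is connected. The function name is ours. For the PMST ensemble one would first binarize the edge values as in sentence 63, e.g. L = np.where(E > 0, D, 0.0), and pass L as the edge-length matrix.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap_embed(L, dim=2):
    """L: (N, N) matrix of graph edge lengths (0 where there is no edge).
    Returns an (N, dim) embedding from classical MDS on the shortest-path
    (geodesic) distances."""
    G = shortest_path(L, method="D", directed=False)   # estimated geodesic distances
    N = G.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N                # centering matrix
    B = -0.5 * J @ (G ** 2) @ J                        # double-centred Gram matrix
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:dim]
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))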

61 In Fig. 3 we show the results of applying Isomap using different graphs to two data sets (ellipse and Swiss roll) for which we know the true geodesic distances gij. [sent-120, score-0.417]

62 In a real application, since the true geodesic distances are unknown, error and variance cannot be computed; an estimated residual variance has been proposed [5] to determine the optimal graph parameter. [sent-121, score-0.653]

63 For the perturbed MST ensemble, we binarize the edge values by setting to 1 any eij > 0. [sent-122, score-0.277]

64 (It is often possible to refine the graph by zeroing edges with small eij , since this removes shortcuts that may have arisen by chance, particularly if T is large; but it is difficult to estimate the right threshold reliably. [sent-123, score-0.48]

65 ) The plots show 3 curves as a function of the graph parameter: the average error E in the geodesic distances; Isomap's estimated residual variance V̂; and the true residual variance V. [sent-124, score-0.651]
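
For reference, the residual-variance curves follow the usual Isomap definition [5]: one minus the squared correlation between a set of distances and the embedding's Euclidean distances. A small sketch (helper name ours):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def residual_variance(G, Y):
    """G: (N, N) matrix of (geodesic) distances; Y: (N, dim) embedding.
    Returns 1 - r^2 between the two sets of pairwise distances."""
    g = squareform(G, checks=False)   # condensed upper-triangle distances
    d = pdist(Y)                      # embedding distances
    r = np.corrcoef(g, d)[0, 1]
    return 1.0 - r * r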

66 From the plots we can see that V̂ correlates well with V (though it underestimates it) and also with E for the Swiss roll, but not for the ellipse; this can make the optimal graph parameter difficult to determine in a real application. [sent-125, score-0.27]

67 Given this, the fact that our graphs work well over a larger region of their parameter space than the ε-ball or k-NNG graphs makes them particularly attractive. [sent-126, score-0.365]

68 The plots for the Swiss roll show that, while for the low noise case the ε-ball or k-NNG graphs work well over a reasonable region of their parameter space, for the high noise case this region shrinks considerably, almost vanishing for the ε-ball. [sent-127, score-0.517]

69 This is because for low values of the parameter the graph is disconnected, while for high values it has multiple shortcuts; the difficulty of the task is compounded by the small number of points used, N = 500 (an unavoidable fact in high dimensions). [sent-128, score-0.353]

70 For very low r, or for t = 1, the graph is the single MST, hence the large errors. [sent-130, score-0.229]

71 It is also important to realize that the range of the r parameter of the PMST ensemble does not depend on the data, while the range for ε and k does. [sent-131, score-0.325]

72 The range of the t parameter of the DMST ensemble does depend on the data, but we have found empirically that t = 2–4 gives very good results with all data sets we have tried. [sent-132, score-0.291]

73 4 Discussion. One main contribution of this paper is to highlight the relatively understudied problem of converting a data set into a graph, which forms an intermediate representation for many clustering and manifold learning algorithms. [sent-133, score-0.366]

74 A second contribution is novel construction algorithms, which are: easy to implement, not expensive to compute, robust across many noise levels and parameter settings, and useful for clustering and manifold learning. [sent-134, score-0.509]

75 In general, a careful selection of the graph construction algorithm makes the results of these machine learning methods robust, and avoids or limits the required parameter search. [sent-135, score-0.324]

76 Finally, the combination of many graphs, formed from perturbed versions of the data, into an ensemble of graphs, is a novel approach to the construction problem. [sent-136, score-0.415]

77 Our idea of MST ensembles is an extension to graphs of the well-known technique of combining predictors by averaging (regression) or voting (classification), as is the regularizing effect of training with noise [12]. [sent-137, score-0.308]

78 An ensemble of predictors improves generalization to unseen data if the individual predictors are independent of each other and disagree with each other; this can be explained by the bias-variance tradeoff. [sent-138, score-0.285]

79 Unlike regression or classification, unsupervised graph learning lacks at present an error function, so it seems difficult to apply the bias-variance framework here. [sent-139, score-0.275]

80 However, we have conducted a wide range of empirical tests to understand the properties of the ensemble MSTs, and to compare them to the other graph construction methods, in terms of the error in the geodesic distances (if known a priori). [sent-140, score-0.758]

81 In summary, we have found that the variance of the error for the geodesic … [Figure 3 panel titles: ellipse, high noise; Swiss roll, low noise; Swiss roll, high noise] [sent-141, score-0.45]

82–93 [Figure 3 plot residue: x-axes ε, k, r, t; legend entries "avg error in gd", "resvar (estimated)", "resvar (true)"; panels for the ε-ball, k-NNG, PMST-ensemble and DMST-ensemble graphs.]

94 Where the curves for the ε-ball and k-NNG are missing, the graph was disconnected. [sent-269, score-0.21]

95 …distances decreases for the ensemble when the individual graphs are sparse (e. [sent-270, score-0.479]

96 The typical cut [9, 13] is a clustering criterion that is based on the probability pij that points xi and xj are in the same cluster over all possible partitions (under the Boltzmann distribution for the mincut cost function). [sent-273, score-0.32]

97 A better way is to perturb points more strongly in directions likely to lie within the manifold and less strongly in directions away from the manifold, using a method such as k nearest neighbors to estimate appropriate directions. [sent-278, score-0.348]

98 Preliminary experiments with such a manifold-aligned model are very promising, particularly when the data is very noisy or its distribution on the manifold is not uniform. [sent-279, score-0.182]

99 An optimal graph theoretic approach to data clustering: Theory and its application to image segmentation. [sent-283, score-0.25]

100 Self organization in vision: Stochastic clustering for image segmentation, perceptual grouping, and image database organization. [sent-339, score-0.245]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('resvar', 0.444), ('pmst', 0.314), ('mst', 0.277), ('ensemble', 0.223), ('psfrag', 0.221), ('replacements', 0.211), ('graph', 0.21), ('gd', 0.193), ('manifold', 0.182), ('avg', 0.176), ('msts', 0.166), ('clustering', 0.165), ('graphs', 0.154), ('perturbed', 0.135), ('dmst', 0.129), ('geodesic', 0.116), ('occluder', 0.111), ('shortcuts', 0.092), ('eij', 0.09), ('edges', 0.088), ('distances', 0.079), ('roll', 0.077), ('ellipse', 0.074), ('swiss', 0.074), ('noise', 0.071), ('estimated', 0.066), ('spectral', 0.066), ('grid', 0.06), ('construction', 0.057), ('proximity', 0.056), ('dmsts', 0.055), ('pmsts', 0.055), ('isomap', 0.054), ('edge', 0.052), ('points', 0.052), ('ensembles', 0.052), ('dij', 0.052), ('neighbors', 0.051), ('residual', 0.051), ('af', 0.049), ('perturbation', 0.049), ('segmentation', 0.045), ('succeeds', 0.044), ('dimensionality', 0.044), ('true', 0.042), ('reduction', 0.041), ('image', 0.04), ('error', 0.039), ('delaunay', 0.037), ('mincut', 0.037), ('triangulation', 0.037), ('dy', 0.035), ('range', 0.034), ('parameter', 0.034), ('trees', 0.033), ('manifolds', 0.033), ('nity', 0.032), ('disconnected', 0.032), ('predictors', 0.031), ('clusters', 0.029), ('sped', 0.029), ('spanning', 0.029), ('sparsity', 0.029), ('connect', 0.028), ('eigenvectors', 0.028), ('eigenvector', 0.027), ('wij', 0.027), ('plots', 0.026), ('blue', 0.026), ('gij', 0.026), ('lacks', 0.026), ('row', 0.025), ('variance', 0.025), ('bottleneck', 0.025), ('branches', 0.025), ('nities', 0.025), ('pij', 0.023), ('region', 0.023), ('axis', 0.023), ('sparse', 0.023), ('richard', 0.023), ('november', 0.023), ('careful', 0.023), ('cut', 0.022), ('strongly', 0.022), ('cluster', 0.021), ('robustness', 0.021), ('priori', 0.021), ('locally', 0.02), ('december', 0.02), ('connected', 0.02), ('list', 0.02), ('randomized', 0.02), ('high', 0.019), ('nearest', 0.019), ('costly', 0.019), ('intermediate', 0.019), ('lawrence', 0.019), ('low', 0.019), ('euclidean', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 150 nips-2004-Proximity Graphs for Clustering and Manifold Learning

Author: Richard S. Zemel, Miguel Á. Carreira-Perpiñán

Abstract: Many machine learning algorithms for clustering or dimensionality reduction take as input a cloud of points in Euclidean space, and construct a graph with the input data points as vertices. This graph is then partitioned (clustering) or used to redefine metric information (dimensionality reduction). There has been much recent work on new methods for graph-based clustering and dimensionality reduction, but not much on constructing the graph itself. Graphs typically used include the fully-connected graph, a local fixed-grid graph (for image segmentation) or a nearest-neighbor graph. We suggest that the graph should adapt locally to the structure of the data. This can be achieved by a graph ensemble that combines multiple minimum spanning trees, each fit to a perturbed version of the data set. We show that such a graph ensemble usually produces a better representation of the data manifold than standard methods; and that it provides robustness to a subsequent clustering or dimensionality reduction algorithm based on the graph. 1

2 0.16535737 131 nips-2004-Non-Local Manifold Tangent Learning

Author: Yoshua Bengio, Martin Monperrus

Abstract: We claim and present arguments to the effect that a large class of manifold learning algorithms that are essentially local and can be framed as kernel learning algorithms will suffer from the curse of dimensionality, at the dimension of the true underlying manifold. This observation suggests to explore non-local manifold learning algorithms which attempt to discover shared structure in the tangent planes at different positions. A criterion for such an algorithm is proposed and experiments estimating a tangent plane prediction function are presented, showing its advantages with respect to local manifold learning algorithms: it is able to generalize very far from training data (on learning handwritten character image rotations), where a local non-parametric method fails. 1

3 0.12194566 161 nips-2004-Self-Tuning Spectral Clustering

Author: Lihi Zelnik-manor, Pietro Perona

Abstract: We study a number of open issues in spectral clustering: (i) Selecting the appropriate scale of analysis, (ii) Handling multi-scale data, (iii) Clustering with irregular background clutter, and, (iv) Finding automatically the number of groups. We first propose that a ‘local’ scale should be used to compute the affinity between each pair of points. This local scaling leads to better clustering especially when the data includes multiple scales and when the clusters are placed within a cluttered background. We further suggest exploiting the structure of the eigenvectors to infer automatically the number of groups. This leads to a new algorithm in which the final randomly initialized k-means stage is eliminated. 1

4 0.099710651 115 nips-2004-Maximum Margin Clustering

Author: Linli Xu, James Neufeld, Bryce Larson, Dale Schuurmans

Abstract: We propose a new method for clustering based on finding maximum margin hyperplanes through data. By reformulating the problem in terms of the implied equivalence relation matrix, we can pose the problem as a convex integer program. Although this still yields a difficult computational problem, the hard-clustering constraints can be relaxed to a soft-clustering formulation which can be feasibly solved with a semidefinite program. Since our clustering technique only depends on the data through the kernel matrix, we can easily achieve nonlinear clusterings in the same manner as spectral clustering. Experimental results show that our maximum margin clustering technique often obtains more accurate results than conventional clustering methods. The real benefit of our approach, however, is that it leads naturally to a semi-supervised training method for support vector machines. By maximizing the margin simultaneously on labeled and unlabeled training data, we achieve state of the art performance by using a single, integrated learning principle. 1

5 0.099084057 114 nips-2004-Maximum Likelihood Estimation of Intrinsic Dimension

Author: Elizaveta Levina, Peter J. Bickel

Abstract: We propose a new method for estimating intrinsic dimension of a dataset derived by applying the principle of maximum likelihood to the distances between close neighbors. We derive the estimator by a Poisson process approximation, assess its bias and variance theoretically and by simulations, and apply it to a number of simulated and real datasets. We also show it has the best overall performance compared with two other intrinsic dimension estimators. 1

6 0.095462784 103 nips-2004-Limits of Spectral Clustering

7 0.090191461 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning

8 0.089949556 160 nips-2004-Seeing through water

9 0.08798752 79 nips-2004-Hierarchical Eigensolver for Transition Matrices in Spectral Methods

10 0.081518203 13 nips-2004-A Three Tiered Approach for Articulated Object Action Modeling and Recognition

11 0.079293437 61 nips-2004-Efficient Out-of-Sample Extension of Dominant-Set Clusters

12 0.07900776 177 nips-2004-Supervised Graph Inference

13 0.0688693 196 nips-2004-Triangle Fixing Algorithms for the Metric Nearness Problem

14 0.068556763 165 nips-2004-Semi-supervised Learning on Directed Graphs

15 0.067470931 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach

16 0.067378655 167 nips-2004-Semi-supervised Learning with Penalized Probabilistic Clustering

17 0.067195617 203 nips-2004-Validity Estimates for Loopy Belief Propagation on Binary Real-world Networks

18 0.062427789 19 nips-2004-An Application of Boosting to Graph Classification

19 0.059992962 77 nips-2004-Hierarchical Clustering of a Mixture Model

20 0.059862535 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.174), (1, 0.061), (2, -0.084), (3, -0.072), (4, -0.033), (5, -0.032), (6, -0.141), (7, 0.066), (8, -0.163), (9, -0.04), (10, -0.134), (11, -0.128), (12, -0.021), (13, -0.156), (14, -0.079), (15, 0.066), (16, -0.014), (17, -0.048), (18, -0.059), (19, 0.125), (20, -0.004), (21, 0.062), (22, -0.089), (23, 0.003), (24, 0.004), (25, 0.142), (26, 0.043), (27, 0.019), (28, 0.099), (29, -0.092), (30, 0.028), (31, 0.021), (32, 0.116), (33, -0.013), (34, 0.006), (35, -0.025), (36, 0.018), (37, -0.088), (38, 0.122), (39, 0.018), (40, -0.054), (41, 0.052), (42, 0.023), (43, -0.033), (44, 0.007), (45, 0.048), (46, 0.072), (47, 0.055), (48, -0.031), (49, -0.041)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94909161 150 nips-2004-Proximity Graphs for Clustering and Manifold Learning

Author: Richard S. Zemel, Miguel Á. Carreira-Perpiñán

Abstract: Many machine learning algorithms for clustering or dimensionality reduction take as input a cloud of points in Euclidean space, and construct a graph with the input data points as vertices. This graph is then partitioned (clustering) or used to redefine metric information (dimensionality reduction). There has been much recent work on new methods for graph-based clustering and dimensionality reduction, but not much on constructing the graph itself. Graphs typically used include the fully-connected graph, a local fixed-grid graph (for image segmentation) or a nearest-neighbor graph. We suggest that the graph should adapt locally to the structure of the data. This can be achieved by a graph ensemble that combines multiple minimum spanning trees, each fit to a perturbed version of the data set. We show that such a graph ensemble usually produces a better representation of the data manifold than standard methods; and that it provides robustness to a subsequent clustering or dimensionality reduction algorithm based on the graph. 1

2 0.69598472 131 nips-2004-Non-Local Manifold Tangent Learning

Author: Yoshua Bengio, Martin Monperrus

Abstract: We claim and present arguments to the effect that a large class of manifold learning algorithms that are essentially local and can be framed as kernel learning algorithms will suffer from the curse of dimensionality, at the dimension of the true underlying manifold. This observation suggests to explore non-local manifold learning algorithms which attempt to discover shared structure in the tangent planes at different positions. A criterion for such an algorithm is proposed and experiments estimating a tangent plane prediction function are presented, showing its advantages with respect to local manifold learning algorithms: it is able to generalize very far from training data (on learning handwritten character image rotations), where a local non-parametric method fails. 1

3 0.53225732 61 nips-2004-Efficient Out-of-Sample Extension of Dominant-Set Clusters

Author: Massimiliano Pavan, Marcello Pelillo

Abstract: Dominant sets are a new graph-theoretic concept that has proven to be relevant in pairwise data clustering problems, such as image segmentation. They generalize the notion of a maximal clique to edgeweighted graphs and have intriguing, non-trivial connections to continuous quadratic optimization and spectral-based grouping. We address the problem of grouping out-of-sample examples after the clustering process has taken place. This may serve either to drastically reduce the computational burden associated to the processing of very large data sets, or to efficiently deal with dynamic situations whereby data sets need to be updated continually. We show that the very notion of a dominant set offers a simple and efficient way of doing this. Numerical experiments on various grouping problems show the effectiveness of the approach. 1

4 0.5139249 160 nips-2004-Seeing through water

Author: Alexei Efros, Volkan Isler, Jianbo Shi, Mirkó Visontai

Abstract: We consider the problem of recovering an underwater image distorted by surface waves. A large amount of video data of the distorted image is acquired. The problem is posed in terms of finding an undistorted image patch at each spatial location. This challenging reconstruction task can be formulated as a manifold learning problem, such that the center of the manifold is the image of the undistorted patch. To compute the center, we present a new technique to estimate global distances on the manifold. Our technique achieves robustness through convex flow computations and solves the “leakage” problem inherent in recent manifold embedding techniques. 1

5 0.50158173 161 nips-2004-Self-Tuning Spectral Clustering

Author: Lihi Zelnik-manor, Pietro Perona

Abstract: We study a number of open issues in spectral clustering: (i) Selecting the appropriate scale of analysis, (ii) Handling multi-scale data, (iii) Clustering with irregular background clutter, and, (iv) Finding automatically the number of groups. We first propose that a ‘local’ scale should be used to compute the affinity between each pair of points. This local scaling leads to better clustering especially when the data includes multiple scales and when the clusters are placed within a cluttered background. We further suggest exploiting the structure of the eigenvectors to infer automatically the number of groups. This leads to a new algorithm in which the final randomly initialized k-means stage is eliminated. 1

6 0.44984904 114 nips-2004-Maximum Likelihood Estimation of Intrinsic Dimension

7 0.42794144 115 nips-2004-Maximum Margin Clustering

8 0.42229849 103 nips-2004-Limits of Spectral Clustering

9 0.39614108 17 nips-2004-Adaptive Manifold Learning

10 0.39138803 79 nips-2004-Hierarchical Eigensolver for Transition Matrices in Spectral Methods

11 0.38708794 141 nips-2004-Optimal sub-graphical models

12 0.38478357 167 nips-2004-Semi-supervised Learning with Penalized Probabilistic Clustering

13 0.37626788 165 nips-2004-Semi-supervised Learning on Directed Graphs

14 0.37060997 196 nips-2004-Triangle Fixing Algorithms for the Metric Nearness Problem

15 0.35472804 13 nips-2004-A Three Tiered Approach for Articulated Object Action Modeling and Recognition

16 0.34002167 177 nips-2004-Supervised Graph Inference

17 0.33985496 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

18 0.32722116 182 nips-2004-Synergistic Face Detection and Pose Estimation with Energy-Based Models

19 0.32318226 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning

20 0.32203335 57 nips-2004-Economic Properties of Social Networks


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.11), (15, 0.142), (26, 0.039), (31, 0.02), (33, 0.174), (35, 0.021), (39, 0.014), (50, 0.035), (54, 0.019), (90, 0.32)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.80185699 108 nips-2004-Markov Networks for Detecting Overalpping Elements in Sequence Data

Author: Mark Craven, Joseph Bockhorst

Abstract: Many sequential prediction tasks involve locating instances of patterns in sequences. Generative probabilistic language models, such as hidden Markov models (HMMs), have been successfully applied to many of these tasks. A limitation of these models however, is that they cannot naturally handle cases in which pattern instances overlap in arbitrary ways. We present an alternative approach, based on conditional Markov networks, that can naturally represent arbitrarily overlapping elements. We show how to efficiently train and perform inference with these models. Experimental results from a genomics domain show that our models are more accurate at locating instances of overlapping patterns than are baseline models based on HMMs. 1

same-paper 2 0.79393834 150 nips-2004-Proximity Graphs for Clustering and Manifold Learning

Author: Richard S. Zemel, Miguel Á. Carreira-Perpiñán

Abstract: Many machine learning algorithms for clustering or dimensionality reduction take as input a cloud of points in Euclidean space, and construct a graph with the input data points as vertices. This graph is then partitioned (clustering) or used to redefine metric information (dimensionality reduction). There has been much recent work on new methods for graph-based clustering and dimensionality reduction, but not much on constructing the graph itself. Graphs typically used include the fully-connected graph, a local fixed-grid graph (for image segmentation) or a nearest-neighbor graph. We suggest that the graph should adapt locally to the structure of the data. This can be achieved by a graph ensemble that combines multiple minimum spanning trees, each fit to a perturbed version of the data set. We show that such a graph ensemble usually produces a better representation of the data manifold than standard methods; and that it provides robustness to a subsequent clustering or dimensionality reduction algorithm based on the graph. 1

3 0.75020176 28 nips-2004-Bayesian inference in spiking neurons

Author: Sophie Deneve

Abstract: We propose a new interpretation of spiking neurons as Bayesian integrators accumulating evidence over time about events in the external world or the body, and communicating to other neurons their certainties about these events. In this model, spikes signal the occurrence of new information, i.e. what cannot be predicted from the past activity. As a result, firing statistics are close to Poisson, albeit providing a deterministic representation of probabilities. We proceed to develop a theory of Bayesian inference in spiking neural networks, recurrent interactions implementing a variant of belief propagation.

4 0.61963737 131 nips-2004-Non-Local Manifold Tangent Learning

Author: Yoshua Bengio, Martin Monperrus

Abstract: We claim and present arguments to the effect that a large class of manifold learning algorithms that are essentially local and can be framed as kernel learning algorithms will suffer from the curse of dimensionality, at the dimension of the true underlying manifold. This observation suggests to explore non-local manifold learning algorithms which attempt to discover shared structure in the tangent planes at different positions. A criterion for such an algorithm is proposed and experiments estimating a tangent plane prediction function are presented, showing its advantages with respect to local manifold learning algorithms: it is able to generalize very far from training data (on learning handwritten character image rotations), where a local non-parametric method fails. 1

5 0.61889404 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer

Abstract: Prediction suffix trees (PST) provide a popular and effective tool for tasks such as compression, classification, and language modeling. In this paper we take a decision theoretic view of PSTs for the task of sequence prediction. Generalizing the notion of margin to PSTs, we present an online PST learning algorithm and derive a loss bound for it. The depth of the PST generated by this algorithm scales linearly with the length of the input. We then describe a self-bounded enhancement of our learning algorithm which automatically grows a bounded-depth PST. We also prove an analogous mistake-bound for the self-bounded algorithm. The result is an efficient algorithm that neither relies on a-priori assumptions on the shape or maximal depth of the target PST nor does it require any parameters. To our knowledge, this is the first provably-correct PST learning algorithm which generates a bounded-depth PST while being competitive with any fixed PST determined in hindsight. 1

6 0.61826563 178 nips-2004-Support Vector Classification with Input Data Uncertainty

7 0.61474073 102 nips-2004-Learning first-order Markov models for control

8 0.61228973 125 nips-2004-Multiple Relational Embedding

9 0.61149657 127 nips-2004-Neighbourhood Components Analysis

10 0.61053598 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach

11 0.6104573 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill

12 0.61031079 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning

13 0.61027277 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images

14 0.61026382 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform

15 0.60999876 16 nips-2004-Adaptive Discriminative Generative Model and Its Applications

16 0.6091283 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

17 0.60878551 116 nips-2004-Message Errors in Belief Propagation

18 0.60865891 61 nips-2004-Efficient Out-of-Sample Extension of Dominant-Set Clusters

19 0.60823119 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection

20 0.60817158 26 nips-2004-At the Edge of Chaos: Real-time Computations and Self-Organized Criticality in Recurrent Neural Networks