jmlr jmlr2008 jmlr2008-96 knowledge-graph by maker-knowledge-mining

96 jmlr-2008-Visualizing Data using t-SNE


Source: pdf

Author: Laurens van der Maaten, Geoffrey Hinton

Abstract: We present a new technique called “t-SNE” that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map. The technique is a variation of Stochastic Neighbor Embedding (Hinton and Roweis, 2002) that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map. t-SNE is better than existing techniques at creating a single map that reveals structure at many different scales. This is particularly important for high-dimensional data that lie on several different, but related, low-dimensional manifolds, such as images of objects from multiple classes seen from multiple viewpoints. For visualizing the structure of very large data sets, we show how t-SNE can use random walks on neighborhood graphs to allow the implicit structure of all of the data to influence the way in which a subset of the data is displayed. We illustrate the performance of t-SNE on a wide variety of data sets and compare it with many other non-parametric visualization techniques, including Sammon mapping, Isomap, and Locally Linear Embedding. The visualizations produced by t-SNE are significantly better than those produced by the other techniques on almost all of the data sets. Keywords: visualization, dimensionality reduction, manifold learning, embedding algorithms, multidimensional scaling

Reference: text


Summary: the most important sentences generated by the tf-idf model

sentIndex sentText sentNum sentScore

1 In contrast to the visualization techniques discussed above, dimensionality reduction methods convert the high-dimensional data set X = {x1 , x2 , . [sent-23, score-0.331]

2 In the paper, we refer to the low-dimensional data representation Y as a map, and to the low-dimensional representations y i of individual datapoints as map points. [sent-30, score-0.547]

3 Traditional dimensionality reduction techniques such as Principal Components Analysis (PCA; Hotelling, 1933) and classical multidimensional scaling (MDS; Torgerson, 1952) are linear techniques that focus on keeping the low-dimensional representations of dissimilar datapoints far apart. [sent-33, score-0.675]

4 For high-dimensional data that lies on or near a low-dimensional, non-linear manifold it is usually more important to keep the low-dimensional representations of very similar datapoints close together, which is typically not possible with a linear mapping. [sent-34, score-0.438]

5 Stochastic Neighbor Embedding Stochastic Neighbor Embedding (SNE) starts by converting the high-dimensional Euclidean distances between datapoints into conditional probabilities that represent similarities. [sent-55, score-0.515]

6 1 The similarity of datapoint x j to datapoint xi is the conditional probability, p j|i , that xi would pick x j as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at x i . [sent-56, score-0.329]

7 For the low-dimensional counterparts yi and y j of the high-dimensional datapoints xi and x j , it is possible to compute a similar conditional probability, which we denote by q j|i . [sent-61, score-0.414]

8 Hence, we model the similarity of map point y_j to map point y_i by q_{j|i} = exp(−||y_i − y_j||^2) / ∑_{k≠i} exp(−||y_i − y_k||^2). [sent-63, score-0.462]
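
The two conditional distributions above can be sketched in a few lines of NumPy. This is not the authors' code: the function name, the squared-distance input, and the per-point bandwidths are our own choices, and the same routine with a single fixed bandwidth gives the low-dimensional q_{j|i}.

import numpy as np

def conditional_probabilities(sq_dists, sigmas):
    """Sketch of the Gaussian conditional probabilities p_{j|i} (Equation 1).

    sq_dists: (n, n) matrix of squared Euclidean distances between datapoints.
    sigmas:   per-point bandwidths sigma_i; a single fixed bandwidth for the
              map points gives the q_{j|i} defined above.
    """
    P = np.exp(-sq_dists / (2.0 * sigmas[:, None] ** 2))
    np.fill_diagonal(P, 0.0)                  # a point is not its own neighbor
    return P / P.sum(axis=1, keepdims=True)   # each row sums to one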

9 If the map points yi and y j correctly model the similarity between the high-dimensional datapoints xi and x j , the conditional probabilities p j|i and q j|i will be equal. [sent-65, score-0.68]

10 SNE minimizes the sum of Kullback-Leibler divergences over all datapoints using a gradient descent method. [sent-68, score-0.436]

11 Because the Kullback-Leibler divergence is not symmetric, different types of error in the pairwise distances in the low-dimensional map are not weighted equally. [sent-70, score-0.474]

12 In particular, there is a large cost for using widely separated map points to represent nearby datapoints. [sent-71, score-0.788]

13 That is, there is a large cost for using a small q_{j|i} to model a large p_{j|i}, but there is only a small cost for using nearby map points to represent widely separated datapoints. [sent-79, score-0.423]

14 In other words, the SNE cost function focuses on retaining the local structure of the data in the map (for reasonable values of the variance of the Gaussian in the high-dimensional space, σi ). [sent-81, score-0.274]

15 It is not likely that there is a single value of σi that is optimal for all datapoints in the data set because the density of the data is likely to vary. [sent-83, score-0.365]
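
The usual remedy, reflected by the perplexity parameter Perp in Algorithm 1 further below, is to pick each sigma_i by binary search so that the conditional distribution p_{.|i} has a fixed perplexity. The sketch below assumes this standard procedure; the search bounds, tolerance, and function name are our own placeholder choices.

import numpy as np

def sigma_for_perplexity(sq_dists_i, perplexity, tol=1e-5, max_iter=100):
    """Binary search for sigma_i so that p_{.|i} has the requested perplexity.

    sq_dists_i: squared distances from x_i to all other datapoints (self excluded).
    Perplexity is 2**H(p_{.|i}), so we match the Shannon entropy in bits.
    """
    lo, hi = 1e-3, 1e3                         # assumed search range for sigma_i
    target = np.log2(perplexity)
    sigma = 1.0
    for _ in range(max_iter):
        sigma = 0.5 * (lo + hi)
        # shift by the smallest distance for numerical stability; the
        # normalized distribution is unchanged
        p = np.exp(-(sq_dists_i - sq_dists_i.min()) / (2.0 * sigma ** 2))
        p /= p.sum()
        entropy = -np.sum(p * np.log2(p + 1e-12))
        if abs(entropy - target) < tol:
            break
        if entropy > target:                   # distribution too flat: shrink sigma
            hi = sigma
        else:                                  # distribution too peaked: grow sigma
            lo = sigma
    return sigma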

16 Physically, the gradient may be interpreted as the resultant force created by a set of springs between the map point y_i and all other map points y_j. [sent-93, score-0.566]

17 The spring between yi and y j repels or attracts the map points depending on whether the distance between the two in the map is too small or too large to represent the similarities between the two high-dimensional datapoints. [sent-95, score-0.564]

18 The force exerted by the spring between y i and y j is proportional to its length, and also proportional to its stiffness, which is the mismatch (p j|i − q j|i + pi| j − qi| j ) between the pairwise similarities of the data points and the map points. [sent-96, score-0.436]

19 The gradient descent is initialized by sampling map points randomly from an isotropic Gaussian with small variance that is centered around the origin. [sent-97, score-0.308]

20 In other words, the current gradient is added to an exponentially decaying sum of previous gradients in order to determine the changes in the coordinates of the map points at each iteration of the gradient search. [sent-99, score-0.379]

21 Picking the best map after several runs as a visualization of the data is not nearly as problematic as picking the model that does best on a test set during supervised learning. [sent-129, score-0.329]

22 For such an outlier, the values of pi j are extremely small for all j, so the location of its low-dimensional map point yi has very little effect on the cost function. [sent-134, score-0.418]

23 As a result, the position of the map point is not well determined by the positions of the other map points. [sent-135, score-0.364]

24 This ensures that ∑_j p_{ij} > 1/(2n) for all datapoints x_i, as a result of which each datapoint x_i makes a significant contribution to the cost function. [sent-137, score-0.697]
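
A sketch of the symmetrization that gives this guarantee (the same step appears in Algorithm 1 below); the function name is ours.

import numpy as np

def symmetrize(P_cond):
    """Joint probabilities p_ij = (p_{j|i} + p_{i|j}) / (2n).

    Since each column of the conditional matrix sums to a non-negative value
    and each row sums to one, every row of the result sums to more than 1/(2n),
    which is exactly the guarantee discussed above.
    """
    n = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2.0 * n)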

25 2 The Crowding Problem Consider a set of datapoints that lie on a two-dimensional curved manifold which is approximately linear on a small scale, and which is embedded within a higher-dimensional space. [sent-143, score-0.438]

26 It is possible to model the small pairwise distances between datapoints fairly well in a two-dimensional map, which is often illustrated on toy examples such as the “Swiss roll” data set. [sent-144, score-0.62]

27 There are several reasons why the pairwise distances in a two-dimensional map cannot faithfully model distances between points on the ten-dimensional manifold. [sent-146, score-0.613]

28 For instance, in ten dimensions, it is possible to have 11 datapoints that are mutually equidistant and there is no way to model this faithfully in a two-dimensional map. [sent-147, score-0.397]

29 In SNE, the spring connecting datapoint i to each of these too-distant map points will thus exert a very small attractive force. [sent-154, score-0.382]

30 So however far apart two map points are, q_{ij} can never fall below 2ρ / (n(n − 1)) (because the uniform background distribution is over n(n − 1)/2 pairs). [sent-160, score-0.269]

31 As a result, for datapoints that are far apart in the high-dimensional space, q i j will always be larger than pi j , leading to a slight repulsion. [sent-161, score-0.52]

32 Optimizing the UNI-SNE cost function directly does not work because two map points that are far apart will get almost all of their qi j from the uniform background. [sent-168, score-0.445]

33 This allows a moderate distance in the high-dimensional space to be faithfully modeled by a much larger distance in the map and, as a result, it eliminates the unwanted attractive forces between map points that represent moderately dissimilar datapoints. [sent-175, score-0.54]

34 Using this distribution, the joint probabilities q_{ij} are defined as q_{ij} = (1 + ||y_i − y_j||^2)^{-1} / ∑_{k≠l} (1 + ||y_k − y_l||^2)^{-1}. [sent-177, score-0.302]
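
A sketch of Equation 4 in NumPy; returning the unnormalized kernel as well is our own convenience so it can be reused in the gradient sketch that follows Equation 5 below.

import numpy as np

def student_t_q(Y):
    """Low-dimensional joint probabilities q_ij of Equation 4 (sketch).

    Y: (n, 2) or (n, 3) array of map points.  The heavy-tailed kernel
    (1 + ||y_i - y_j||^2)^{-1} is a Student t-distribution with one degree
    of freedom; Q is normalized over all pairs with k != l.
    """
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    num = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(num, 0.0)          # exclude k = l pairs from the sum
    Q = num / num.sum()
    return Q, num                       # num is reused in the gradient below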

35 (4) We use a Student t-distribution with a single degree of freedom because it has the particularly nice property that (1 + ||y_i − y_j||^2)^{-1} approaches an inverse square law for large pairwise distances ||y_i − y_j|| in the low-dimensional map. [sent-178, score-0.353]

36 This makes the map’s representation of joint probabilities (almost) invariant to changes in the scale of the map for map points that are far apart. [sent-179, score-0.448]

37 Figure 1: Gradients of three types of SNE as a function of the pairwise Euclidean distance between two points in the high-dimensional and the pairwise distance between the points in the low-dimensional data representation. [sent-186, score-0.44]

38 The gradient of the Kullback-Leibler divergence between P and the Student-t based joint probability distribution Q (computed using Equation 4) is derived in Appendix A, and is given by δC/δy_i = 4 ∑_j (p_{ij} − q_{ij})(y_i − y_j)(1 + ||y_i − y_j||^2)^{-1}. [sent-189, score-0.269]
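
Equation 5 vectorizes neatly; the sketch below reuses the unnormalized kernel num from the q_ij sketch above. The decomposition is our own, not the paper's code.

import numpy as np

def tsne_gradient(P, Q, num, Y):
    """Gradient of Equation 5 (sketch):
    dC/dy_i = 4 * sum_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i - y_j||^2)^{-1}."""
    W = (P - Q) * num                            # (p_ij - q_ij)(1 + d_ij^2)^{-1}
    # sum_j W_ij * (y_i - y_j) written as a matrix product; W is symmetric
    return 4.0 * (np.diag(W.sum(axis=1)) - W) @ Y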

39 (5) In Figure 1(a) to 1(c), we show the gradients between two low-dimensional datapoints y i and y j as a function of their pairwise Euclidean distances in the high-dimensional and the low-dimensional space (i. [sent-190, score-0.62]

40 In the figures, positive values of the gradient represent an attraction between the low-dimensional datapoints y_i and y_j, whereas negative values represent a repulsion between the two datapoints. [sent-193, score-0.57]

41 First, the t-SNE gradient strongly repels dissimilar datapoints that are modeled by a small pairwise distance in the low-dimensional representation. [sent-195, score-0.66]

42 SNE has such a repulsion as well, but its effect is minimal compared to the strong attractions elsewhere in the gradient (the largest attraction in our graphical representation of the gradient is approximately 19, whereas the largest repulsion is approximately 1). [sent-196, score-0.312]

43 Second, although t-SNE introduces strong repulsions between dissimilar datapoints that are modeled by small pairwise distances, these repulsions do not go to infinity. [sent-198, score-0.558]

44 In this respect, t-SNE differs from UNI-SNE in how the strength of the repulsion between very dissimilar datapoints behaves. Algorithm 1: Simple version of t-Distributed Stochastic Neighbor Embedding. [sent-199, score-0.509]

45 Data: data set X = {x1, x2, ..., xn}; cost function parameters: perplexity Perp; optimization parameters: number of iterations T, learning rate η, momentum α(t). [sent-203, score-0.277]

46 begin: compute pairwise affinities p_{j|i} with perplexity Perp (using Equation 1); set p_{ij} = (p_{j|i} + p_{i|j}) / (2n); sample initial solution Y^(0) = {y1, y2, ...}. [sent-208, score-0.364]
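
Putting the pieces together, a minimal optimization loop in the spirit of Algorithm 1 might look as follows. It reuses the helper sketches given earlier (conditional_probabilities, sigma_for_perplexity, symmetrize, student_t_q, tsne_gradient), which are our own illustrative functions; the learning rate and momentum schedule are placeholder values, and refinements such as early compression and early exaggeration are omitted.

import numpy as np

def tsne(sq_dists, perplexity=30.0, n_iter=1000, eta=100.0,
         alpha=lambda t: 0.5 if t < 250 else 0.8, n_components=2, seed=0):
    """Bare-bones t-SNE loop following Algorithm 1 (a sketch, not the reference code)."""
    n = sq_dists.shape[0]
    rng = np.random.default_rng(seed)
    # per-point bandwidths chosen to hit the requested perplexity
    sigmas = np.array([sigma_for_perplexity(np.delete(sq_dists[i], i), perplexity)
                       for i in range(n)])
    P = symmetrize(conditional_probabilities(sq_dists, sigmas))
    # initialize from an isotropic Gaussian with small variance around the origin
    Y = 1e-4 * rng.standard_normal((n, n_components))
    velocity = np.zeros_like(Y)           # exponentially decaying sum of past gradients
    for t in range(n_iter):
        Q, num = student_t_q(Y)
        grad = tsne_gradient(P, Q, num, Y)
        velocity = alpha(t) * velocity - eta * grad
        Y = Y + velocity                  # gradient step with momentum
    return Y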

47 Taken together, t-SNE puts emphasis on (1) modeling dissimilar datapoints by means of large pairwise distances, and (2) modeling similar datapoints by means of small pairwise distances. [sent-212, score-1.117]

48 Specifically, t-SNE introduces long-range forces in the low-dimensional map that can pull back together two (clusters of) similar points that get separated early on in the optimization. [sent-214, score-0.291]

49 This simple procedure uses a momentum term to reduce the number of iterations required and it works best if the momentum term is small until the map points have become moderately well organized. [sent-219, score-0.381]

50 When the distances between map points are small, it is easy for clusters to move through one another so it is much easier to explore the space of possible global organizations of the data. [sent-224, score-0.411]

51 Early compression is implemented by adding an additional L2-penalty to the cost function that is proportional to the sum of squared distances of the map points from the origin. [sent-225, score-0.422]
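
Inside a loop like the one sketched above, early compression amounts to adding the gradient of this penalty during the first iterations. The weight and cutoff below are illustrative placeholders, not the paper's values.

# Early compression (sketch): a penalty beta * sum_i ||y_i||^2 added to the cost
# for the first iterations contributes 2 * beta * Y to the gradient and keeps the
# map points clustered near the origin while the global organization forms.
beta, compression_iters = 1e-2, 50        # placeholder values
if t < compression_iters:
    grad = grad + 2.0 * beta * Y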

52 This speeds up the computation of pairwise distances between the datapoints and suppresses some noise without severely distorting the interpoint distances. [sent-281, score-0.62]
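
A sketch of that preprocessing step: project the data onto its leading 30 principal components before computing pairwise distances. The SVD-based implementation is our own choice.

import numpy as np

def pca_reduce(X, n_components=30):
    """Project the data onto its leading principal components (sketch)."""
    Xc = X - X.mean(axis=0)                      # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T              # scores in the top subspace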

53 We then use each of the dimensionality reduction techniques to convert the 30-dimensional representation to a two-dimensional map and we show the resulting map as a scatterplot. [sent-282, score-0.548]

54 In the experiments with Isomap and LLE, we only visualize datapoints that correspond to vertices in the largest connected component of the neighborhood graph. [sent-288, score-0.418]

55 The map produced by t-SNE contains some points that are clustered with the wrong class, but most of these points correspond to distorted digits, many of which are difficult to identify. [sent-299, score-0.34]

56 The map constructed by Sammon mapping is significantly better, since it models many of the members of each class fairly close together, but none of the classes are clearly separated in the Sammon map. [sent-315, score-0.271]

57 Obviously, it is possible to pick a random subset of the datapoints and display them using t-SNE, but such an approach fails to make use of the information that the undisplayed datapoints provide. Figure 6: An illustration of the advantage of the random walk version of t-SNE over a standard landmark approach. [sent-334, score-0.738]

58 The shaded points A, B, and C are three (almost) equidistant landmark points, whereas the non-shaded datapoints are non-landmark points. [sent-335, score-0.663]

59 In a standard landmark approach, the pairwise affinity between A and B is approximately equal to the pairwise affinity between A and C. [sent-337, score-0.479]

60 In the random walk version of t-SNE, the pairwise affinity between A and B is much larger than the pairwise affinity between A and C, and therefore, it reflects the structure of the data much better. [sent-338, score-0.43]

61 make use of the information that the undisplayed datapoints provide about the underlying manifolds. [sent-339, score-0.365]

62 If there are many undisplayed datapoints between A and B and none between A and C, it is much more likely that A and B are part of the same cluster than A and C. [sent-341, score-0.365]

63 In this section, we show how t-SNE can be modified to display a random subset of the datapoints (so-called landmark points) in a way that uses information from the entire (possibly very large) data set. [sent-343, score-0.576]

64 Then, for each of the landmark points, we define a random walk starting at that landmark point and terminating as soon as it lands on another landmark point. [sent-346, score-0.795]

65 We define p j|i to be the fraction of random walks starting at landmark point xi that terminate at landmark point x j . [sent-348, score-0.539]

66 The most obvious way to compute the random walk-based similarities p_{j|i} is to explicitly perform the random walks on the neighborhood graph, which works very well in practice, given that one can easily perform one million random walks per second. [sent-355, score-0.352]
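
A sketch of that explicit simulation. The graph representation, the number of walks, and the handling of the starting landmark (the duplication trick mentioned in sentence 100 means a walk may also terminate back at its own landmark) are our own simplifications.

import numpy as np

def random_walk_affinities(neighbors, landmarks, walks_per_landmark=1000, seed=0):
    """Estimate p_{j|i} between landmarks by simulating random walks (sketch).

    neighbors: list where neighbors[v] is an array of the nearest neighbors of node v.
    landmarks: array of node indices selected as landmark points.
    p_{j|i} is the fraction of walks started at landmark i that terminate at
    landmark j; a walk terminates as soon as it lands on any landmark.
    Assumes every connected component contains a landmark (see sentence 99).
    """
    rng = np.random.default_rng(seed)
    is_landmark = np.zeros(len(neighbors), dtype=bool)
    is_landmark[landmarks] = True
    col = {v: j for j, v in enumerate(landmarks)}
    counts = np.zeros((len(landmarks), len(landmarks)))
    for i, start in enumerate(landmarks):
        for _ in range(walks_per_landmark):
            v = rng.choice(neighbors[start])     # first step leaves the landmark
            while not is_landmark[v]:
                v = rng.choice(neighbors[v])
            counts[i, col[v]] += 1               # walk absorbed at landmark v
    return counts / counts.sum(axis=1, keepdims=True)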

67 However, for very large data sets in which the landmark points are very sparse, the analytical solution may be more appropriate. [sent-360, score-0.299]

68 Figure 7 shows the results of an experiment, in which we applied the random walk version of t-SNE to 6,000 randomly selected digits from the MNIST data set, using all 60,000 digits to compute the pairwise affinities p j|i . [sent-361, score-0.392]

69 Whereas the generalization error (measured using 10-fold cross validation) of a 1-nearest neighbor classifier trained on the original 784-dimensional datapoints is 5. [sent-367, score-0.404]

70 The computational requirements of random walk t-SNE are reasonable: it took only one hour of CPU time to construct the map in Figure 7. [sent-370, score-0.344]

71 A linear method such as classical scaling is not good at modeling curved manifolds and it focuses on preserving the distances between widely separated datapoints rather than on preserving the distances between nearby datapoints. [sent-379, score-0.811]

72 The main weakness of the Sammon cost function is that the importance of retaining small pairwise distances in the map is largely dependent on small differences in these pairwise distances. [sent-385, score-0.698]

73 Since all small pairwise distances constitute the local structure of the data, it seems more appropriate to aim to assign approximately equal importance to all small pairwise distances. [sent-387, score-0.389]

74 The strong performance of t-SNE compared to LLE is mainly due to a basic weakness of LLE: the only thing that prevents all datapoints from collapsing onto a single point is a constraint on the covariance of the low-dimensional representation. [sent-392, score-0.4]

75 In practice, this constraint is often satisfied by placing most of the map points near the center of the map and using a few widely scattered points to create large covariance (see Figure 3(b) and 4(d)). [sent-393, score-0.507]

76 For neighborhood graphs that are almost disconnected, the covariance constraint can also be satisfied by a “curdled” map in which there are a few widely separated, collapsed subsets corresponding to the almost disconnected components. [sent-394, score-0.268]

77 Like Isomap and LLE, the random walk version of t-SNE employs neighborhood graphs, but it does not suffer from short-circuiting problems because the pairwise similarities between the high-dimensional datapoints are computed by integrating over all paths through the neighborhood graph. [sent-397, score-0.86]

78 Because of the diffusion-based interpretation of the conditional probabilities underlying the random walk version of t-SNE, it is useful to compare t-SNE to diffusion maps. [sent-398, score-0.295]

79 The diffusion map is formed by the principal non-trivial eigenvectors of the Markov matrix of the random walks of length t. [sent-401, score-0.403]

80 It can be shown that when all (n − 1) non-trivial eigenvectors are employed, the Euclidean distances in the diffusion map are equal to the diffusion distances in the high-dimensional data representation (Lafon and Lee, 2006). [sent-402, score-0.632]

81 As a result, diffusion maps are susceptible to the same problems as classical scaling: they assign much higher importance to modeling the large pairwise diffusion distances than the small ones and as a result, they are not good at retaining the local structure of the data. [sent-404, score-0.566]

82 Moreover, in contrast to the random walk version of t-SNE, diffusion maps do not have a natural way of selecting the length, t, of the random walks. [sent-405, score-0.311]

83 Moreover, within the range λ, CCA suffers from the same weakness as Sammon mapping: it assigns extremely high importance to modeling the distance between two datapoints that are extremely close. [sent-410, score-0.461]

84 Also, MVU makes no attempt to model longer range structure: It simply pulls the map points as far apart as possible subject to the hard constraints so, unlike t-SNE, it cannot be expected to produce sensible large-scale structure in the map. [sent-416, score-0.269]

85 In data sets with a high intrinsic dimensionality and an underlying manifold that is highly varying, the local linearity assumption on the manifold that t-SNE implicitly makes (by employing Euclidean distances between near neighbors) may be violated. [sent-431, score-0.418]

86 Such deep-layer architectures can represent complex nonlinear functions in a much simpler way, and as a result, require fewer datapoints to learn an appropriate solution (as is illustrated for a d-bits parity task by Bengio 2007). [sent-438, score-0.398]

87 A nice property of most state-of-the-art dimensionality reduction techniques (such as classical scaling, Isomap, LLE, and diffusion maps) is the convexity of their cost functions. [sent-442, score-0.352]

88 A local optimum of a cost function that accurately captures what we want in a visualization is often preferable to the global optimum of a cost function that fails to capture important aspects of what we want. [sent-447, score-0.275]

89 We will also investigate the extension of t-SNE to models in which each high-dimensional datapoint is modeled by several low-dimensional map points as in Cook et al. [sent-464, score-0.382]

90 Derivation of the t-SNE gradient: t-SNE minimizes the Kullback-Leibler divergence between the joint probabilities p_{ij} in the high-dimensional space and the joint probabilities q_{ij} in the low-dimensional space. [sent-473, score-0.306]

91 The Kullback-Leibler divergence between the two joint probability distributions P and Q is given by C = KL(P||Q) = ∑_i ∑_j p_{ij} log(p_{ij} / q_{ij}) = ∑_i ∑_j (p_{ij} log p_{ij} − p_{ij} log q_{ij}). [sent-476, score-0.876]

92 Note that if y_i changes, the only pairwise distances that change are d_{ij} and d_{ji} for all j. [sent-478, score-0.304]

93 δC/δd_{ij} = −∑_{k≠l} p_{kl} δ(log q_{kl}) / δd_{ij} = −∑_{k≠l} p_{kl} δ(log q_{kl}Z − log Z) / δd_{ij} = −∑_{k≠l} p_{kl} ( (1 / (q_{kl}Z)) δ((1 + d_{kl}^2)^{-1}) / δd_{ij} − (1/Z) δZ/δd_{ij} ). The gradient δ((1 + d_{kl}^2)^{-1}) / δd_{ij} is only nonzero when k = i and l = j. [sent-481, score-0.326]

94 Noting that ∑_{k≠l} p_{kl} = 1, we see that the gradient simplifies to δC/δd_{ij} = 2p_{ij}(1 + d_{ij}^2)^{-1} − 2q_{ij}(1 + d_{ij}^2)^{-1} = 2(p_{ij} − q_{ij})(1 + d_{ij}^2)^{-1}. [sent-483, score-0.348]

95 Analytical Solution to Random Walk Probabilities Below, we describe the analytical solution to the random walk probabilities that are employed in the random walk version of t-SNE (see Section 5). [sent-486, score-0.415]

96 Without loss of generality, we may reorder the datapoints such that the landmark points come first. [sent-494, score-0.532]

97 (8) Please note that in this linear system, BT is a matrix containing the columns from the graph Laplacian L that correspond to the landmark points (excluding the rows that correspond to landmark points). [sent-497, score-0.51]

98 After normalization of the solutions to the systems XN , the column vectors of XN contain the probability that a random walk initiated from a non-landmark point terminates in a landmark point. [sent-498, score-0.373]
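
The exact linear system of Equation 8 is not reproduced in the excerpts above, so the sketch below follows the standard random-walker formulation (Grady) that matches the description of B^T as the Laplacian columns of the landmark points restricted to the non-landmark rows. Treat it as an approximation of the idea, not the paper's implementation.

import numpy as np

def landmark_termination_probs(W, landmark_idx):
    """Analytical random-walk termination probabilities (sketch).

    W: symmetric affinity matrix of the neighborhood graph.
    landmark_idx: indices of the landmark points.
    After row normalization, entry (u, l) is the probability that a walk
    started at non-landmark node u terminates at landmark l.
    """
    n = W.shape[0]
    landmark_idx = np.asarray(landmark_idx)
    non_landmark = np.setdiff1d(np.arange(n), landmark_idx)
    L = np.diag(W.sum(axis=1)) - W                # combinatorial graph Laplacian
    L_uu = L[np.ix_(non_landmark, non_landmark)]  # non-landmark block
    B_t = L[np.ix_(non_landmark, landmark_idx)]   # landmark columns, non-landmark rows
    X = np.linalg.solve(L_uu, -B_t)               # one linear system per landmark column
    return X / X.sum(axis=1, keepdims=True)       # normalize rows to probabilities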

99 One should note that the linear system in Equation 8 is only nonsingular if the graph is completely connected, or if each connected component in the graph contains at least one landmark point (Biggs, 1974). [sent-499, score-0.277]

100 Because we are interested in the probability of a random walk initiated from a landmark point terminating at another landmark point, we duplicate all landmark points in the neighborhood graph, and initiate the random walks from the duplicate landmarks. [sent-500, score-1.092]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('sne', 0.427), ('datapoints', 0.365), ('sammon', 0.224), ('landmark', 0.211), ('map', 0.182), ('lle', 0.163), ('walk', 0.162), ('isomap', 0.149), ('visualization', 0.147), ('datapoint', 0.145), ('aaten', 0.139), ('inton', 0.139), ('isualizing', 0.139), ('pairwise', 0.134), ('pi', 0.123), ('distances', 0.121), ('walks', 0.117), ('qi', 0.112), ('dimensionality', 0.11), ('perplexity', 0.107), ('der', 0.105), ('diffusion', 0.104), ('visualizations', 0.091), ('crowding', 0.085), ('repulsion', 0.085), ('mvu', 0.075), ('manifold', 0.073), ('momentum', 0.072), ('gradient', 0.071), ('van', 0.069), ('similarities', 0.065), ('hinton', 0.065), ('cost', 0.064), ('dissimilar', 0.059), ('laplacian', 0.058), ('faces', 0.055), ('points', 0.055), ('separated', 0.054), ('weinberger', 0.054), ('clusters', 0.053), ('neighborhood', 0.053), ('images', 0.053), ('maaten', 0.053), ('olivetti', 0.053), ('perp', 0.053), ('pkl', 0.053), ('supplemental', 0.053), ('manifolds', 0.052), ('visualizing', 0.052), ('student', 0.052), ('di', 0.051), ('mnist', 0.049), ('yi', 0.049), ('eigenmaps', 0.049), ('digits', 0.048), ('maps', 0.045), ('embedding', 0.044), ('exaggeration', 0.043), ('intrinsic', 0.041), ('cook', 0.041), ('cca', 0.041), ('reduction', 0.039), ('neighbor', 0.039), ('divergence', 0.037), ('material', 0.037), ('duplicate', 0.036), ('nearby', 0.035), ('techniques', 0.035), ('mapping', 0.035), ('weakness', 0.035), ('optimization', 0.034), ('af', 0.034), ('freedom', 0.034), ('roweis', 0.034), ('lee', 0.034), ('widely', 0.033), ('analytical', 0.033), ('graph', 0.033), ('nonlinear', 0.033), ('multidimensional', 0.032), ('lafon', 0.032), ('demartines', 0.032), ('equidistant', 0.032), ('laurens', 0.032), ('qkl', 0.032), ('tsne', 0.032), ('apart', 0.032), ('distance', 0.031), ('modeling', 0.03), ('probabilities', 0.029), ('tails', 0.029), ('employed', 0.029), ('gaussian', 0.028), ('retaining', 0.028), ('highdimensional', 0.028), ('mathematically', 0.028), ('dirichlet', 0.028), ('symmetrized', 0.027), ('springs', 0.027), ('grady', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999958 96 jmlr-2008-Visualizing Data using t-SNE

Author: Laurens van der Maaten, Geoffrey Hinton

Abstract: We present a new technique called “t-SNE” that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map. The technique is a variation of Stochastic Neighbor Embedding (Hinton and Roweis, 2002) that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map. t-SNE is better than existing techniques at creating a single map that reveals structure at many different scales. This is particularly important for high-dimensional data that lie on several different, but related, low-dimensional manifolds, such as images of objects from multiple classes seen from multiple viewpoints. For visualizing the structure of very large data sets, we show how t-SNE can use random walks on neighborhood graphs to allow the implicit structure of all of the data to influence the way in which a subset of the data is displayed. We illustrate the performance of t-SNE on a wide variety of data sets and compare it with many other non-parametric visualization techniques, including Sammon mapping, Isomap, and Locally Linear Embedding. The visualizations produced by t-SNE are significantly better than those produced by the other techniques on almost all of the data sets. Keywords: visualization, dimensionality reduction, manifold learning, embedding algorithms, multidimensional scaling

2 0.11799368 57 jmlr-2008-Manifold Learning: The Price of Normalization

Author: Yair Goldberg, Alon Zakai, Dan Kushnir, Ya'acov Ritov

Abstract: We analyze the performance of a class of manifold-learning algorithms that find their output by minimizing a quadratic form under some normalization constraints. This class consists of Locally Linear Embedding (LLE), Laplacian Eigenmap, Local Tangent Space Alignment (LTSA), Hessian Eigenmaps (HLLE), and Diffusion maps. We present and prove conditions on the manifold that are necessary for the success of the algorithms. Both the finite sample case and the limit case are analyzed. We show that there are simple manifolds in which the necessary conditions are violated, and hence the algorithms cannot recover the underlying manifolds. Finally, we present numerical results that demonstrate our claims. Keywords: dimensionality reduction, manifold learning, Laplacian eigenmap, diffusion maps, locally linear embedding, local tangent space alignment, Hessian eigenmap

3 0.11074711 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes

Author: Arthur D. Szlam, Mauro Maggioni, Ronald R. Coifman

Abstract: Harmonic analysis and diffusion on discrete data has been shown to lead to state-of-the-art algorithms for machine learning tasks, especially in the context of semi-supervised and transductive learning. The success of these algorithms rests on the assumption that the function(s) to be studied (learned, interpolated, etc.) are smooth with respect to the geometry of the data. In this paper we present a method for modifying the given geometry so the function(s) to be studied are smoother with respect to the modified geometry, and thus more amenable to treatment using harmonic analysis methods. Among the many possible applications, we consider the problems of image denoising and transductive classification. In both settings, our approach improves on standard diffusion based methods. Keywords: diffusion processes, diffusion geometry, spectral graph theory, image denoising, transductive learning, semi-supervised learning

4 0.066515081 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers

Author: Andreas Maurer

Abstract: A method is introduced to learn and represent similarity with linear operators in kernel induced Hilbert spaces. Transferring error bounds for vector valued large-margin classifiers to the setting of Hilbert-Schmidt operators leads to dimension free bounds on a risk functional for linear representations and motivates a regularized objective functional. Minimization of this objective is effected by a simple technique of stochastic gradient descent. The resulting representations are tested on transfer problems in image processing, involving plane and spatial geometric invariants, handwritten characters and face recognition. Keywords: learning similarity, similarity, transfer learning

5 0.056674913 12 jmlr-2008-Algorithms for Sparse Linear Classifiers in the Massive Data Setting

Author: Suhrid Balakrishnan, David Madigan

Abstract: Classifiers favoring sparse solutions, such as support vector machines, relevance vector machines, LASSO-regression based classifiers, etc., provide competitive methods for classification problems in high dimensions. However, current algorithms for training sparse classifiers typically scale quite unfavorably with respect to the number of training examples. This paper proposes online and multipass algorithms for training sparse linear classifiers for high dimensional data. These algorithms have computational complexity and memory requirements that make learning on massive data sets feasible. The central idea that makes this possible is a straightforward quadratic approximation to the likelihood function. Keywords: Laplace approximation, expectation propagation, LASSO

6 0.05332632 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces

7 0.052437175 58 jmlr-2008-Max-margin Classification of Data with Absent Features

8 0.052176218 55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data

9 0.048590079 17 jmlr-2008-Automatic PCA Dimension Selection for High Dimensional Data and Small Sample Sizes

10 0.045222715 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods

11 0.03812496 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis

12 0.037692301 76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines

13 0.037112687 9 jmlr-2008-Active Learning by Spherical Subdivision

14 0.036692008 77 jmlr-2008-Probabilistic Characterization of Random Decision Trees

15 0.036550514 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model

16 0.036328413 16 jmlr-2008-Approximations for Binary Gaussian Process Classification

17 0.0360719 92 jmlr-2008-Universal Multi-Task Kernels

18 0.035221726 15 jmlr-2008-An Information Criterion for Variable Selection in Support Vector Machines    (Special Topic on Model Selection)

19 0.032957587 52 jmlr-2008-Learning from Multiple Sources

20 0.03283627 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.19), (1, -0.055), (2, -0.012), (3, 0.02), (4, -0.019), (5, 0.008), (6, 0.066), (7, 0.048), (8, 0.008), (9, 0.023), (10, -0.116), (11, 0.253), (12, 0.003), (13, 0.113), (14, 0.191), (15, 0.245), (16, -0.073), (17, -0.199), (18, 0.072), (19, 0.113), (20, -0.146), (21, -0.178), (22, 0.132), (23, 0.085), (24, 0.079), (25, 0.065), (26, -0.13), (27, 0.088), (28, 0.168), (29, 0.103), (30, 0.126), (31, -0.051), (32, 0.01), (33, -0.019), (34, 0.059), (35, -0.004), (36, 0.067), (37, 0.005), (38, -0.024), (39, -0.019), (40, -0.027), (41, -0.038), (42, 0.17), (43, 0.123), (44, -0.008), (45, -0.003), (46, 0.053), (47, 0.011), (48, 0.056), (49, 0.057)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94322771 96 jmlr-2008-Visualizing Data using t-SNE

Author: Laurens van der Maaten, Geoffrey Hinton

Abstract: We present a new technique called “t-SNE” that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map. The technique is a variation of Stochastic Neighbor Embedding (Hinton and Roweis, 2002) that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map. t-SNE is better than existing techniques at creating a single map that reveals structure at many different scales. This is particularly important for high-dimensional data that lie on several different, but related, low-dimensional manifolds, such as images of objects from multiple classes seen from multiple viewpoints. For visualizing the structure of very large data sets, we show how t-SNE can use random walks on neighborhood graphs to allow the implicit structure of all of the data to influence the way in which a subset of the data is displayed. We illustrate the performance of t-SNE on a wide variety of data sets and compare it with many other non-parametric visualization techniques, including Sammon mapping, Isomap, and Locally Linear Embedding. The visualizations produced by t-SNE are significantly better than those produced by the other techniques on almost all of the data sets. Keywords: visualization, dimensionality reduction, manifold learning, embedding algorithms, multidimensional scaling

2 0.70380312 57 jmlr-2008-Manifold Learning: The Price of Normalization

Author: Yair Goldberg, Alon Zakai, Dan Kushnir, Ya'acov Ritov

Abstract: We analyze the performance of a class of manifold-learning algorithms that find their output by minimizing a quadratic form under some normalization constraints. This class consists of Locally Linear Embedding (LLE), Laplacian Eigenmap, Local Tangent Space Alignment (LTSA), Hessian Eigenmaps (HLLE), and Diffusion maps. We present and prove conditions on the manifold that are necessary for the success of the algorithms. Both the finite sample case and the limit case are analyzed. We show that there are simple manifolds in which the necessary conditions are violated, and hence the algorithms cannot recover the underlying manifolds. Finally, we present numerical results that demonstrate our claims. Keywords: dimensionality reduction, manifold learning, Laplacian eigenmap, diffusion maps, locally linear embedding, local tangent space alignment, Hessian eigenmap

3 0.56515914 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes

Author: Arthur D. Szlam, Mauro Maggioni, Ronald R. Coifman

Abstract: Harmonic analysis and diffusion on discrete data has been shown to lead to state-of-the-art algorithms for machine learning tasks, especially in the context of semi-supervised and transductive learning. The success of these algorithms rests on the assumption that the function(s) to be studied (learned, interpolated, etc.) are smooth with respect to the geometry of the data. In this paper we present a method for modifying the given geometry so the function(s) to be studied are smoother with respect to the modified geometry, and thus more amenable to treatment using harmonic analysis methods. Among the many possible applications, we consider the problems of image denoising and transductive classification. In both settings, our approach improves on standard diffusion based methods. Keywords: diffusion processes, diffusion geometry, spectral graph theory, image denoising, transductive learning, semi-supervised learning

4 0.29767025 12 jmlr-2008-Algorithms for Sparse Linear Classifiers in the Massive Data Setting

Author: Suhrid Balakrishnan, David Madigan

Abstract: Classifiers favoring sparse solutions, such as support vector machines, relevance vector machines, LASSO-regression based classifiers, etc., provide competitive methods for classification problems in high dimensions. However, current algorithms for training sparse classifiers typically scale quite unfavorably with respect to the number of training examples. This paper proposes online and multipass algorithms for training sparse linear classifiers for high dimensional data. These algorithms have computational complexity and memory requirements that make learning on massive data sets feasible. The central idea that makes this possible is a straightforward quadratic approximation to the likelihood function. Keywords: Laplace approximation, expectation propagation, LASSO

5 0.27805537 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers

Author: Andreas Maurer

Abstract: A method is introduced to learn and represent similarity with linear operators in kernel induced Hilbert spaces. Transferring error bounds for vector valued large-margin classifiers to the setting of Hilbert-Schmidt operators leads to dimension free bounds on a risk functional for linear representations and motivates a regularized objective functional. Minimization of this objective is effected by a simple technique of stochastic gradient descent. The resulting representations are tested on transfer problems in image processing, involving plane and spatial geometric invariants, handwritten characters and face recognition. Keywords: learning similarity, similarity, transfer learning

6 0.25761721 55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data

7 0.23039496 77 jmlr-2008-Probabilistic Characterization of Random Decision Trees

8 0.21805485 15 jmlr-2008-An Information Criterion for Variable Selection in Support Vector Machines    (Special Topic on Model Selection)

9 0.2158857 14 jmlr-2008-An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons

10 0.20091309 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods

11 0.19629169 9 jmlr-2008-Active Learning by Spherical Subdivision

12 0.18743941 61 jmlr-2008-Mixed Membership Stochastic Blockmodels

13 0.18506221 59 jmlr-2008-Maximal Causes for Non-linear Component Extraction

14 0.17938861 17 jmlr-2008-Automatic PCA Dimension Selection for High Dimensional Data and Small Sample Sizes

15 0.17735517 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces

16 0.16706544 58 jmlr-2008-Max-margin Classification of Data with Absent Features

17 0.16585857 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction

18 0.16415988 93 jmlr-2008-Using Markov Blankets for Causal Structure Learning    (Special Topic on Causality)

19 0.15848169 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection

20 0.15782258 16 jmlr-2008-Approximations for Binary Gaussian Process Classification


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.018), (1, 0.016), (5, 0.033), (9, 0.011), (40, 0.04), (54, 0.033), (58, 0.036), (66, 0.036), (73, 0.019), (76, 0.025), (88, 0.521), (92, 0.024), (94, 0.066), (99, 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.99165297 25 jmlr-2008-Consistency of Random Forests and Other Averaging Classifiers

Author: Gérard Biau, Luc Devroye, Gábor Lugosi

Abstract: In the last years of his life, Leo Breiman promoted random forests for use in classification. He suggested using averaging as a means of obtaining good discrimination rules. The base classifiers used for averaging are simple and randomized, often based on random samples from the data. He left a few questions unanswered regarding the consistency of such rules. In this paper, we give a number of theorems that establish the universal consistency of averaging rules. We also show that some popular classifiers, including one suggested by Breiman, are not universally consistent. Keywords: random forests, classification trees, consistency, bagging This paper is dedicated to the memory of Leo Breiman.

2 0.98815793 49 jmlr-2008-Learning Control Knowledge for Forward Search Planning

Author: Sungwook Yoon, Alan Fern, Robert Givan

Abstract: A number of today’s state-of-the-art planners are based on forward state-space search. The impressive performance can be attributed to progress in computing domain independent heuristics that perform well across many domains. However, it is easy to find domains where such heuristics provide poor guidance, leading to planning failure. Motivated by such failures, the focus of this paper is to investigate mechanisms for learning domain-specific knowledge to better control forward search in a given domain. While there has been a large body of work on inductive learning of control knowledge for AI planning, there is a void of work aimed at forward-state-space search. One reason for this may be that it is challenging to specify a knowledge representation for compactly representing important concepts across a wide range of domains. One of the main contributions of this work is to introduce a novel feature space for representing such control knowledge. The key idea is to define features in terms of information computed via relaxed plan extraction, which has been a major source of success for non-learning planners. This gives a new way of leveraging relaxed planning techniques in the context of learning. Using this feature space, we describe three forms of control knowledge—reactive policies (decision list rules and measures of progress) and linear heuristics—and show how to learn them and incorporate them into forward state-space search. Our empirical results show that our approaches are able to surpass state-of-the-art nonlearning planners across a wide range of planning competition domains. Keywords: planning, machine learning, knowledge representation, search

3 0.96548051 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine

Author: Ja-Yong Koo, Yoonkyung Lee, Yuwon Kim, Changyi Park

Abstract: The support vector machine has been successful in a variety of applications. Also on the theoretical front, statistical properties of the support vector machine have been studied quite extensively with a particular attention to its Bayes risk consistency under some conditions. In this paper, we study somewhat basic statistical properties of the support vector machine yet to be investigated, namely the asymptotic behavior of the coefficients of the linear support vector machine. A Bahadur type representation of the coefficients is established under appropriate conditions, and their asymptotic normality and statistical variability are derived on the basis of the representation. These asymptotic results do not only help further our understanding of the support vector machine, but also they can be useful for related statistical inferences. Keywords: asymptotic normality, Bahadur representation, classification, convexity lemma, Radon transform

same-paper 4 0.95642662 96 jmlr-2008-Visualizing Data using t-SNE

Author: Laurens van der Maaten, Geoffrey Hinton

Abstract: We present a new technique called “t-SNE” that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map. The technique is a variation of Stochastic Neighbor Embedding (Hinton and Roweis, 2002) that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map. t-SNE is better than existing techniques at creating a single map that reveals structure at many different scales. This is particularly important for high-dimensional data that lie on several different, but related, low-dimensional manifolds, such as images of objects from multiple classes seen from multiple viewpoints. For visualizing the structure of very large data sets, we show how t-SNE can use random walks on neighborhood graphs to allow the implicit structure of all of the data to influence the way in which a subset of the data is displayed. We illustrate the performance of t-SNE on a wide variety of data sets and compare it with many other non-parametric visualization techniques, including Sammon mapping, Isomap, and Locally Linear Embedding. The visualizations produced by t-SNE are significantly better than those produced by the other techniques on almost all of the data sets. Keywords: visualization, dimensionality reduction, manifold learning, embedding algorithms, multidimensional scaling

5 0.74727154 53 jmlr-2008-Learning to Combine Motor Primitives Via Greedy Additive Regression

Author: Manu Chhabra, Robert A. Jacobs

Abstract: The computational complexities arising in motor control can be ameliorated through the use of a library of motor synergies. We present a new model, referred to as the Greedy Additive Regression (GAR) model, for learning a library of torque sequences, and for learning the coefficients of a linear combination of sequences minimizing a cost function. From the perspective of numerical optimization, the GAR model is interesting because it creates a library of “local features”—each sequence in the library is a solution to a single training task—and learns to combine these sequences using a local optimization procedure, namely, additive regression. We speculate that learners with local representational primitives and local optimization procedures will show good performance on nonlinear tasks. The GAR model is also interesting from the perspective of motor control because it outperforms several competing models. Results using a simulated two-joint arm suggest that the GAR model consistently shows excellent performance in the sense that it rapidly learns to perform novel, complex motor tasks. Moreover, its library is overcomplete and sparse, meaning that only a small fraction of the stored torque sequences are used when learning a new movement. The library is also robust in the sense that, after an initial training period, nearly all novel movements can be learned as additive combinations of sequences in the library, and in the sense that it shows good generalization when an arm’s dynamics are altered between training and test conditions, such as when a payload is added to the arm. Lastly, the GAR model works well regardless of whether motor tasks are specified in joint space or Cartesian space. We conclude that learning techniques using local primitives and optimization procedures are viable and potentially important methods for motor control and possibly other domains, and that these techniques deserve further examination by the artificial intelligence and cognitive science

6 0.7197209 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning

7 0.69815254 9 jmlr-2008-Active Learning by Spherical Subdivision

8 0.69358128 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection

9 0.67942619 57 jmlr-2008-Manifold Learning: The Price of Normalization

10 0.66678363 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss

11 0.66602063 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers

12 0.64567053 8 jmlr-2008-Accelerated Neural Evolution through Cooperatively Coevolved Synapses

13 0.63980055 68 jmlr-2008-Nearly Uniform Validation Improves Compression-Based Error Bounds

14 0.63773453 41 jmlr-2008-Graphical Models for Structured Classification, with an Application to Interpreting Images of Protein Subcellular Location Patterns

15 0.63611841 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers

16 0.63566971 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction

17 0.63532013 19 jmlr-2008-Bouligand Derivatives and Robustness of Support Vector Machines for Regression

18 0.63500494 50 jmlr-2008-Learning Reliable Classifiers From Small or Incomplete Data Sets: The Naive Credal Classifier 2

19 0.6279909 14 jmlr-2008-An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons

20 0.62250972 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks