nips nips2002 nips2002-190 knowledge-graph by maker-knowledge-mining

190 nips-2002-Stochastic Neighbor Embedding


Source: pdf

Author: Geoffrey E. Hinton, Sam T. Roweis

Abstract: We describe a probabilistic approach to the task of placing objects, described by high-dimensional vectors or by pairwise dissimilarities, in a low-dimensional space in a way that preserves neighbor identities. A Gaussian is centered on each object in the high-dimensional space and the densities under this Gaussian (or the given dissimilarities) are used to define a probability distribution over all the potential neighbors of the object. The aim of the embedding is to approximate this distribution as well as possible when the same operation is performed on the low-dimensional “images” of the objects. A natural cost function is a sum of Kullback-Leibler divergences, one per object, which leads to a simple gradient for adjusting the positions of the low-dimensional images. Unlike other dimensionality reduction methods, this probabilistic framework makes it easy to represent each object by a mixture of widely separated low-dimensional images. This allows ambiguous objects, like the document count vector for the word “bank”, to have versions close to the images of both “river” and “finance” without forcing the images of outdoor concepts to be located close to those of corporate concepts.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We describe a probabilistic approach to the task of placing objects, described by high-dimensional vectors or by pairwise dissimilarities, in a low-dimensional space in a way that preserves neighbor identities. [sent-3, score-0.101]

2 A Gaussian is centered on each object in the high-dimensional space and the densities under this Gaussian (or the given dissimilarities) are used to define a probability distribution over all the potential neighbors of the object. [sent-4, score-0.184]
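As an illustration of the sentence above (a sketch, not the authors' code), the neighbor distribution can be formed by placing a Gaussian of per-object width on each object and normalizing over all the other objects; the function name `neighbor_probs` and the array layout are my own choices.

```python
import numpy as np

def neighbor_probs(d2, sigma):
    """d2[i, j]: squared high-dimensional distance (or given dissimilarity) between
    objects i and j; sigma[i]: Gaussian width for object i.
    Returns p[i, j], the probability that object i picks j as its neighbor."""
    logits = -d2 / (2.0 * sigma[:, None] ** 2)
    np.fill_diagonal(logits, -np.inf)             # an object never picks itself
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)
```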

3 The aim of the embedding is to approximate this distribution as well as possible when the same operation is performed on the low-dimensional “images” of the objects. [sent-5, score-0.118]

4 A natural cost function is a sum of Kullback-Leibler divergences, one per object, which leads to a simple gradient for adjusting the positions of the low-dimensional images. [sent-6, score-0.098]

5 Unlike other dimensionality reduction methods, this probabilistic framework makes it easy to represent each object by a mixture of widely separated low-dimensional images. [sent-7, score-0.282]

6 This allows ambiguous objects, like the document count vector for the word “bank”, to have versions close to the images of both “river” and “finance” without forcing the images of outdoor concepts to be located close to those of corporate concepts. [sent-8, score-0.229]

7 1 Introduction Automatic dimensionality reduction is an important “toolkit” operation in machine learning, both as a preprocessing step for other algorithms (e.g. …). [sent-9, score-0.039]

8 Multidimensional scaling methods[1] preserve dissimilarities between items, as measured either by Euclidean distance, some nonlinear squashing of distances, or shortest graph paths as with Isomap[2, 3]. [sent-13, score-0.134]

9 Principal components analysis (PCA) finds a linear projection of the original data which captures as much variance as possible. [sent-14, score-0.03]

10 … LLE[4]) or associate high-dimensional points with a fixed grid of points in the low-dimensional space (e.g. …). [sent-17, score-0.084]

11 All of these methods, however, require each high-dimensional object to be associated with only a single location in the low-dimensional space. [sent-20, score-0.138]

12 This makes it difficult to unfold “many-to-one” mappings in which a single ambiguous object really belongs in several disparate locations in the low-dimensional space. [sent-21, score-0.195]

13 In this paper we define a new notion of embedding based on probable neighbors. [sent-22, score-0.118]

14 Our algorithm, Stochastic Neighbor Embedding (SNE), tries to place the objects in a low-dimensional space so as to optimally preserve neighborhood identity, and can be naturally extended to allow multiple different low-d images of each object. [sent-23, score-0.184]

15 Here, the effective number of local neighbors, or “perplexity”, is chosen by hand. [sent-26, score-0.113]
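Since the perplexity is hand-chosen while the Gaussian widths vary per object, a common way to match them is a bisection search on each width; the sketch below assumes the usual convention that perplexity is two raised to the Shannon entropy of the neighbor distribution, and all names are illustrative rather than from the paper.

```python
import numpy as np

def sigma_for_perplexity(d2_row, target, tol=1e-4, max_iter=50):
    """Bisection on the Gaussian width for one object so that the perplexity of its
    neighbor distribution matches the target. d2_row: squared distances from this
    object to all the others (self excluded)."""
    lo, hi = 1e-10, 1e10
    for _ in range(max_iter):
        sigma = 0.5 * (lo + hi)
        logits = -d2_row / (2.0 * sigma ** 2)
        logits -= logits.max()
        p = np.exp(logits)
        p /= p.sum()
        perplexity = 2.0 ** (-np.sum(p * np.log2(p + 1e-12)))
        if abs(perplexity - target) < tol:
            break
        if perplexity > target:    # too many effective neighbors: shrink the width
            hi = sigma
        else:
            lo = sigma
    return sigma
```

Running `sigma_for_perplexity` once per object gives the `sigma` vector that the `neighbor_probs` sketch above expects.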

16 In this respect, SNE is an improvement over methods like LLE [4] or SOM [5] in which widely separated data-points can be “collapsed” as near neighbors in the low-dimensional space. [sent-30, score-0.201]

17 The intuition is that while SNE emphasizes local distances, its cost function cleanly enforces both keeping the images of nearby objects nearby and keeping the images of widely separated objects relatively far apart. [sent-31, score-0.579]

18 The gradient of the cost with respect to each low-dimensional image $\mathbf{y}_i$ is $\partial C / \partial \mathbf{y}_i = 2 \sum_j (\mathbf{y}_i - \mathbf{y}_j)(p_{ij} - q_{ij} + p_{ji} - q_{ji})$ (5), which has the nice interpretation of a sum of forces pulling $\mathbf{y}_i$ toward $\mathbf{y}_j$ or pushing it away, depending on whether $j$ is observed to be a neighbor more or less often than desired. [sent-33, score-0.101]
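A minimal sketch of this push-pull gradient, assuming `P` and `Q` hold the high- and low-dimensional neighbor probabilities with `P[i, j]` playing the role of p_ij; it mirrors the formula as written above rather than any released code.

```python
import numpy as np

def sne_grad(Y, P, Q):
    """Gradient of the summed KL cost with respect to the low-dimensional images Y
    (one row per object), in the push-pull form of Eq. 5."""
    coeff = P - Q + P.T - Q.T                      # p_ij - q_ij + p_ji - q_ji
    grad = np.zeros_like(Y)
    for i in range(Y.shape[0]):
        grad[i] = 2.0 * (coeff[i, :, None] * (Y[i] - Y)).sum(axis=0)
    return grad
```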

19 Steepest descent in which all of the points are adjusted in parallel is inefficient and can get stuck in poor local optima. [sent-35, score-0.11]

20 Adding random jitter that decreases with time finds much better local optima and is the method we used for the examples in this paper, even though it is still quite slow. [sent-36, score-0.186]

21 We initialize the embedding by putting all the low-dimensional images in random locations very close to the origin. [sent-37, score-0.262]
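Putting the pieces together, here is a hedged sketch of the procedure described in the last few sentences: random initialization in a tiny ball around the origin, steepest descent on the KL cost, and additive jitter that decays over time. It reuses `sne_grad` from the sketch above; the step size, the jitter schedule and the unit-width Gaussian used for the low-dimensional probabilities are assumptions, not the paper's settings.

```python
import numpy as np

def low_dim_probs(Y):
    """Neighbor probabilities among the low-dimensional images (fixed-width Gaussian)."""
    d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    logits = -d2
    np.fill_diagonal(logits, -np.inf)
    logits -= logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

def fit_sne(P, n_dims=2, n_iter=1000, step=0.1, jitter0=0.3, seed=0):
    rng = np.random.default_rng(seed)
    Y = 1e-4 * rng.standard_normal((P.shape[0], n_dims))   # start very close to the origin
    for t in range(n_iter):
        Q = low_dim_probs(Y)
        Y = Y - step * sne_grad(Y, P, Q)                   # steepest descent on the cost
        Y += jitter0 * (1.0 - t / n_iter) * rng.standard_normal(Y.shape)  # decaying jitter
    return Y
```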

22 Several other minimization methods, including annealing the perplexity, are discussed in sections 5 and 6. [sent-38, score-0.037]

23 Both of these datasets are likely to have intrinsic structure in many fewer dimensions than their raw dimensionalities: 256 for the handwritten digits and 13679 for the author-word counts. [sent-40, score-0.157]

24 To begin, we used a set of digit bitmaps from the UPS database[7] with examples from each of the five classes 0,1,2,3,4. [sent-41, score-0.147]

25 The variance of the Gaussian around each point in the 256-dimensional raw pixel image space was set to achieve a perplexity of 15 in the distribution over high-dimensional neighbors. [sent-42, score-0.13]

26 SNE was initialized by putting all the low-dimensional images in random locations very close to the origin and then was trained using gradient descent with annealed noise. [sent-43, score-0.197]

27 Although SNE was given no information about class labels, it quite cleanly separates the digit groups as shown in figure 1. [sent-44, score-0.146]

28 Furthermore, within each region of the low-dimensional space, SNE has arranged the data so that properties like orientation, skew and stroke-thickness tend to vary smoothly. [sent-45, score-0.043]

29 For the embedding shown, the SNE cost function in Eq. [sent-46, score-0.177]

30 4 has a value of … nats; with a uniform distribution across low-dimensional neighbors, the cost is … nats. [sent-47, score-0.086]
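For concreteness about the nats comparison (the specific values are elided above), a sketch of the summed KL cost and of its value when each image treats all the other n-1 images as equally likely neighbors; the function names are mine.

```python
import numpy as np

def kl_cost(P, Q, eps=1e-12):
    """Sum over objects i of KL(P_i || Q_i), measured in nats."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))

def uniform_baseline_cost(P):
    """Cost when each low-dimensional image treats all others as equally likely."""
    n = P.shape[0]
    Q = np.full((n, n), 1.0 / (n - 1))
    np.fill_diagonal(Q, 0.0)
    return kl_cost(P, Q)
```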

31 In this experiment, we used digit classes that do not have very similar pairs like 3 and 5 or 7 and 9. [sent-49, score-0.109]

32 When there are more classes and only two available dimensions, SNE does not as cleanly separate very similar pairs. [sent-50, score-0.091]

33 We have also applied SNE to word-document and word-author matrices calculated from the OCRed text of NIPS volumes 0-12 papers[9]. [sent-51, score-0.031]

34 Figure 2 shows a map locating NIPS authors in two dimensions. [sent-52, score-0.051]

35 Each of the 676 authors who published more than one paper in NIPS vols. [sent-53, score-0.096]

36 0-12 is shown by a dot at the position found by SNE; larger red dots and corresponding last names are authors who published six or more papers in that period. [sent-54, score-0.244]

37 Distances were computed as the norm of the difference between log aggregate author word counts, summed across all NIPS papers. [sent-55, score-0.09]

38 Co-authored papers gave fractional counts evenly to all authors. [sent-56, score-0.165]
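A small sketch of how such an author-word count matrix could be assembled, with each paper's word counts split evenly among its authors; the input format (a list of (author_list, word_count_dict) pairs) is an assumption, not the paper's actual pipeline.

```python
import numpy as np

def author_word_counts(papers, authors, vocab):
    """papers: iterable of (author_list, {word: count}). Returns a matrix with one
    row per author and one column per vocabulary word, with fractional credit."""
    A = np.zeros((len(authors), len(vocab)))
    a_idx = {a: i for i, a in enumerate(authors)}
    w_idx = {w: j for j, w in enumerate(vocab)}
    for author_list, counts in papers:
        share = 1.0 / len(author_list)            # split the paper evenly among authors
        for a in author_list:
            for w, c in counts.items():
                if w in w_idx:
                    A[a_idx[a], w_idx[w]] += share * c
    return A
```

Dissimilarities between authors would then be taken between rows of something like `np.log(1 + A)`; the exact log transform applied to the aggregate counts is not specified in the sentences above.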

39 All words occurring in six or more documents were included, except for stopwords, giving a vocabulary size of 13649. [sent-57, score-0.108]

40 (The bow toolkit[10] was used for part of the pre-processing of the data. [sent-58, score-0.032]

41 ) The Gaussian variances were set to achieve a local perplexity of … neighbors. [sent-59, score-0.158]

42 SNE seems to have grouped authors by broad NIPS field: generative models, support vector machines, neuroscience, reinforcement learning and VLSI all have distinguishable localized regions. [sent-60, score-0.051]

43 4 A full mixture version of SNE. The clean probabilistic formulation of SNE makes it easy to modify the cost function so that instead of a single image, each high-dimensional object can have several different versions of its low-dimensional image. [sent-61, score-0.3]

44 These alternative versions have mixing proportions that sum to 1. [sent-62, score-0.18]

45 Each image-version of an object has its own location and mixing proportion. [sent-63, score-0.26]
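The extracted sentences do not spell out how the low-dimensional neighbor probabilities combine across image versions, so the following is only one plausible formulation, not the paper's definition: each object's density is a mixture over its versions, and the affinity between two objects mixes over all pairs of their versions. All names and the unit-width kernel are assumptions.

```python
import numpy as np

def mixture_low_dim_probs(Y, pi):
    """Y: (n, m, d) array of m image versions per object; pi: (n, m) mixing proportions
    (each row sums to 1). Returns q[i, j], mixture-weighted neighbor probabilities."""
    diff = Y[:, :, None, None, :] - Y[None, None, :, :, :]   # (n, m, n, m, d)
    w = np.exp(-(diff ** 2).sum(-1))                          # version-to-version affinities
    affinity = np.einsum('ia,jb,iajb->ij', pi, pi, w)         # mix over version pairs
    np.fill_diagonal(affinity, 0.0)
    return affinity / affinity.sum(axis=1, keepdims=True)
```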

46 In this multiple-image model, the derivatives with respect to the image locations are straightforward; the derivatives w.r.t. [sent-65, score-0.048]

47 the mixing proportions are most easily expressed … Figure 1: The result of running the SNE algorithm on 256-dimensional grayscale images of handwritten digits. [sent-67, score-0.262]

48 Pictures of the original data vectors (scans of handwritten digits) are shown at the location corresponding to their low-dimensional images as found by SNE. [sent-68, score-0.153]

49 The classes are quite well separated even though SNE had no information about class labels. [sent-69, score-0.071]

50 Furthermore, within each class, properties like orientation, skew and stroke-thickness tend to vary smoothly across the space. [sent-70, score-0.07]

51 Not all points are shown: to produce this display, digits are chosen in random order and a digit is displayed only if a fixed-size region of the display centered on its 2-D location in the embedding does not overlap any of the regions for digits that have already been displayed. [sent-71, score-0.405]
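The display rule just described is easy to sketch: visit the digits in random order and keep one only if its square region in the embedding does not overlap any region already kept. The region size is a parameter here because the actual number is elided above.

```python
import numpy as np

def select_for_display(Y, box=1.0, seed=0):
    """Greedy selection of non-overlapping thumbnails. Y: (n, 2) embedding positions;
    box: side length of the square region reserved around each displayed digit."""
    rng = np.random.default_rng(seed)
    chosen = []
    for i in rng.permutation(len(Y)):
        # two axis-aligned boxes of side `box` overlap iff both coordinate gaps are < box
        if all(np.max(np.abs(Y[i] - Y[j])) >= box for j in chosen):
            chosen.append(i)
    return chosen
```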

52 (SNE was initialized by putting all the low-dimensional images in random locations very close to the origin and then was trained using batch gradient descent (see Eq. [sent-72, score-0.165]

53 The jitter was then reduced to … for a further … iterations. [sent-77, score-0.103]

54 Each of the 676 authors who published more than one paper in NIPS vols. [sent-79, score-0.096]

55 0-12 is shown by a dot at the location found by the SNE algorithm. [sent-80, score-0.039]

56 Larger red dots and corresponding last names are authors who published six or more papers in that period. [sent-81, score-0.244]

57 Dissimilarities between authors were computed based on squared Euclidean distance between vectors of log aggregate author word counts. [sent-83, score-0.114]

58 Co-authored papers gave fractional counts evenly to all authors. [sent-84, score-0.165]

59 All words occurring in six or more documents were included, except for stopwords, giving a vocabulary size of 13649. [sent-85, score-0.108]

60 As a “proof-of-concept”, we recently implemented a simplified mixture version in which every object is represented in the low-dimensional space by exactly two components that are constrained to have equal mixing proportions. [sent-94, score-0.387]

61 The two components are pulled together by a force which increases linearly up to a threshold separation. [sent-95, score-0.123]
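A hedged sketch of that attractive term: the force on one component points toward its partner and grows linearly with their separation until it saturates at a threshold. The constants and the exact saturation behaviour are assumptions.

```python
import numpy as np

def pairing_force(y_a, y_b, strength=0.1, cap=2.0):
    """Force on component a of an object, pulling it toward its partner component b."""
    delta = y_b - y_a
    dist = np.linalg.norm(delta) + 1e-12
    magnitude = strength * min(dist, cap)   # linear in separation up to the threshold
    return (magnitude / dist) * delta       # scaled unit vector toward the partner
```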

62 We ran two experiments with this simplified mixture version of SNE. [sent-97, score-0.038]

63 We took a dataset containing pictures of each of the digits 2,3,4 and added hybrid digit-pictures that were each constructed by picking new examples of two of the classes and taking each pixel at random from one of these two “parents”. [sent-98, score-0.153]

64 After minimization, … of the hybrids and only … of the non-hybrids had significantly different locations for their two mixture components. [sent-99, score-0.12]

65 Moreover, the mixture components of each hybrid always lay in the regions of the space devoted to the classes of its two parents and never in the region devoted to the third class. [sent-100, score-0.217]

66 For this example we used a perplexity of … in defining the local neighborhoods, a step size for each position update of … times the gradient, and a constant jitter of …. [sent-101, score-0.261]

67 Our very simple mixture version of SNE also makes it possible to map a circle onto a line without losing any near neighbor relationships or introducing any new ones. [sent-102, score-0.311]

68 Points near one “cut point” on the circle can be mapped to a mixture of two points, one near one end of the line and one near the other end. [sent-103, score-0.26]

69 Obviously, the location of the cut on the two-dimensional circle gets decided by which pairs of mixture components split first during the stochastic optimization. [sent-104, score-0.266]

70 For certain optimization parameters that control the ease with which two mixture components can be pulled apart, only a single cut in the circle is made. [sent-105, score-0.232]

71 For other parameter settings, however, the circle may fragment into two or more smaller line-segments, each of which is topologically correct but which may not be linked to each other. [sent-106, score-0.056]

72 The example with hybrid digits demonstrates that even the most primitive mixture version of SNE can deal with ambiguous high-dimensional objects that need to be mapped to two widely separated regions of the low-dimensional space. [sent-108, score-0.414]

73 5 Practical optimization strategies. Our current method of reducing the SNE cost is to use steepest descent with added jitter that is slowly reduced. [sent-114, score-0.234]

74 This produces quite good embeddings, which demonstrates that the SNE cost function is worth minimizing, but it takes several hours to find a good embedding for just … datapoints, so we clearly need a better search algorithm. [sent-115, score-0.177]

75 The time per iteration could be reduced considerably by ignoring pairs of points for which all four of the relevant neighbor probabilities (the p and q values in both directions) are small. [sent-116, score-0.042]

76 Since the matrix of high-dimensional neighbor probabilities is fixed during the learning, it is natural to sparsify it by replacing all entries below a certain threshold with zero and renormalizing. [sent-117, score-0.029]

77 Then pairs for which both p entries are zero can be ignored in the gradient calculations whenever both corresponding q values are small. [sent-118, score-0.039]
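A sketch of the sparsification described in the last two sentences: threshold and renormalize the fixed matrix of p values once, then skip pairs whose p entries are zero and whose q values are also negligible when accumulating the gradient. The threshold values are illustrative.

```python
import numpy as np

def sparsify(P, threshold=1e-3):
    """Zero out small entries of the fixed neighbor-probability matrix and renormalize rows."""
    P = np.where(P < threshold, 0.0, P)
    row_sums = P.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0.0] = 1.0              # guard against an all-zero row
    return P / row_sums

def sparse_grad(Y, P, Q, q_eps=1e-3):
    """Gradient that skips pairs with zero p (both directions) and tiny q (both directions)."""
    grad = np.zeros_like(Y)
    n = Y.shape[0]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if P[i, j] == 0.0 and P[j, i] == 0.0 and Q[i, j] < q_eps and Q[j, i] < q_eps:
                continue                          # negligible push-pull contribution
            c = P[i, j] - Q[i, j] + P[j, i] - Q[j, i]
            grad[i] += 2.0 * c * (Y[i] - Y[j])
    return grad
```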

78 In the mixture version of SNE there appears to be an interesting way of avoiding local optima that does not involve annealing the jitter. [sent-121, score-0.23]

79 Consider two components in the mixture for an object that are far apart in the low-dimensional space. [sent-122, score-0.201]

80 By raising the mixing proportion of one and lowering the mixing proportion of the other, we can move probability mass from one part of the space to another without it ever appearing at intermediate locations. [sent-123, score-0.244]

81 This type of “probability wormhole” seems like a good way to avoid local optima that arise because a cluster of low-dimensional points must move through a bad region of the space in order to reach a better one. [sent-124, score-0.125]

82 Yet another search method, which we have used with some success on toy problems, is to provide extra dimensions in the low-dimensional space but to penalize non-zero values on these dimensions. [sent-125, score-0.039]

83 During the search, SNE will use the extra dimensions to go around lower-dimensional barriers but as the penalty on using these dimensions is increased, they will cease to be used, effectively constraining the embedding to the original dimensionality. [sent-126, score-0.196]
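One way the extra-dimension idea could be realized, with the penalty form being my assumption (the sentences above do not specify it): optimize in d plus a few extra coordinates and add a quadratic penalty on the extra ones whose weight is increased over the course of the search.

```python
import numpy as np

def penalized_step(Y, grad_kl, step, penalty, n_real_dims):
    """One descent step in an augmented embedding. The first n_real_dims columns of Y
    are the target coordinates; the remaining columns are penalized toward zero."""
    grad = grad_kl.copy()
    grad[:, n_real_dims:] += 2.0 * penalty * Y[:, n_real_dims:]   # gradient of penalty * y^2
    return Y - step * grad
```

Raising `penalty` across iterations drives the extra coordinates to zero, effectively constraining the final embedding to the original dimensionality.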

84 6 Discussion and Conclusions Preliminary experiments show that we can find good optima by first annealing the perplexities (using high jitter) and only reducing the jitter after the final perplexity has been reached. [sent-127, score-0.325]

85 This raises the question of what SNE is doing when the variance of the Gaussian centered on each high-dimensional point is very big, so that the distribution across neighbors is almost uniform. [sent-128, score-0.149]

86 It is clear that in the high-variance limit, the contribution of each pair to the SNE cost function is just as important for distant neighbors as for close ones. [sent-129, score-0.144]

87 [Eqs. 10-12] This mismatch is very similar to “stress” functions used in nonmetric versions of MDS, and enables us to understand the large-variance limit of SNE as a particular variant of such procedures. [sent-132, score-0.032]

88 LRE learns an N-dimensional vector for each object and an NxN-dimensional matrix for each relation. [sent-138, score-0.099]

89 Its predictive distribution for the third term is then determined by the relative densities of all known objects under this Gaussian. [sent-140, score-0.092]

90 SNE is just a degenerate version of LRE in which the only relationship is “near” and the matrix representing this relationship is the identity. [sent-141, score-0.038]

91 In summary, we have presented a new criterion, Stochastic Neighbor Embedding, for mapping high-dimensional points into a low-dimensional space based on stochastic selection of similar neighbors. [sent-142, score-0.071]

92 Unlike self-organizing maps, in which the low-dimensional coordinates are fixed to a grid and the high-dimensional ends are free to move, in SNE the high-dimensional coordinates are fixed to the data and the low-dimensional points move. [sent-143, score-0.042]

93 Our method can also be applied to arbitrary pairwise dissimilarities between objects if such are available instead of (or in addition to) high-dimensional observations. [sent-144, score-0.192]

94 The gradient of the SNE cost function has an appealing “push-pull” property in which the forces acting on each low-dimensional image bring it closer to points it is under-selecting and push it further from points it is over-selecting as its neighbor. [sent-145, score-0.182]

95 We have shown results of applying this algorithm to image and document collections for which it sensibly placed similar objects nearby in a low-dimensional space while keeping dissimilar objects well separated. [sent-146, score-0.25]

96 Most importantly, because of its probabilistic formulation, SNE has the ability to be extended to mixtures in which ambiguous high-dimensional objects (such as the word “bank”) can have several widely-separated images in the low-dimensional space. [sent-147, score-0.231]

97 Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. [sent-202, score-0.087]

98 Learning distributed representations of concepts from relational data using linear relational embedding. [sent-211, score-0.096]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('sne', 0.8), ('perplexity', 0.13), ('embedding', 0.118), ('lre', 0.108), ('jitter', 0.103), ('neighbor', 0.101), ('dissimilarities', 0.1), ('object', 0.099), ('objects', 0.092), ('neighbors', 0.085), ('digit', 0.082), ('mixing', 0.079), ('nips', 0.074), ('mixture', 0.072), ('proportions', 0.069), ('cleanly', 0.064), ('digits', 0.062), ('cost', 0.059), ('images', 0.058), ('toolkit', 0.056), ('circle', 0.056), ('handwritten', 0.056), ('optima', 0.055), ('papers', 0.053), ('authors', 0.051), ('counts', 0.049), ('relational', 0.048), ('locations', 0.048), ('ambiguous', 0.048), ('published', 0.045), ('near', 0.044), ('separated', 0.044), ('abu', 0.043), ('mostafa', 0.043), ('pomerleau', 0.043), ('skew', 0.043), ('sollich', 0.043), ('distances', 0.043), ('proportion', 0.043), ('qp', 0.043), ('points', 0.042), ('six', 0.041), ('descent', 0.04), ('cut', 0.04), ('location', 0.039), ('gradient', 0.039), ('dimensions', 0.039), ('dimensionality', 0.039), ('version', 0.038), ('putting', 0.038), ('nats', 0.038), ('sensibly', 0.038), ('bitmaps', 0.038), ('kailath', 0.038), ('stopwords', 0.038), ('big', 0.037), ('annealing', 0.037), ('pulled', 0.034), ('movellan', 0.034), ('lippmann', 0.034), ('yann', 0.034), ('smyth', 0.034), ('fractional', 0.034), ('pictures', 0.034), ('baldi', 0.034), ('tenenbaum', 0.034), ('preserve', 0.034), ('word', 0.033), ('gtm', 0.032), ('annealed', 0.032), ('steepest', 0.032), ('cottrell', 0.032), ('barber', 0.032), ('divergences', 0.032), ('bow', 0.032), ('versions', 0.032), ('text', 0.031), ('hybrid', 0.03), ('neighborhoods', 0.03), ('picks', 0.03), ('atkeson', 0.03), ('aggregate', 0.03), ('components', 0.03), ('force', 0.03), ('principal', 0.029), ('documents', 0.029), ('stochastic', 0.029), ('threshold', 0.029), ('evenly', 0.029), ('warmuth', 0.029), ('devoted', 0.029), ('widely', 0.028), ('local', 0.028), ('nearby', 0.028), ('red', 0.027), ('dots', 0.027), ('mds', 0.027), ('lle', 0.027), ('classes', 0.027), ('across', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 190 nips-2002-Stochastic Neighbor Embedding

Author: Geoffrey E. Hinton, Sam T. Roweis

Abstract: We describe a probabilistic approach to the task of placing objects, described by high-dimensional vectors or by pairwise dissimilarities, in a low-dimensional space in a way that preserves neighbor identities. A Gaussian is centered on each object in the high-dimensional space and the densities under this Gaussian (or the given dissimilarities) are used to define a probability distribution over all the potential neighbors of the object. The aim of the embedding is to approximate this distribution as well as possible when the same operation is performed on the low-dimensional “images” of the objects. A natural cost function is a sum of Kullback-Leibler divergences, one per object, which leads to a simple gradient for adjusting the positions of the low-dimensional images. Unlike other dimensionality reduction methods, this probabilistic framework makes it easy to represent each object by a mixture of widely separated low-dimensional images. This allows ambiguous objects, like the document count vector for the word “bank”, to have versions close to the images of both “river” and “finance” without forcing the images of outdoor concepts to be located close to those of corporate concepts.

2 0.10144277 36 nips-2002-Automatic Alignment of Local Representations

Author: Yee W. Teh, Sam T. Roweis

Abstract: We present an automatic alignment procedure which maps the disparate internal representations learned by several local dimensionality reduction experts into a single, coherent global coordinate system for the original data space. Our algorithm can be applied to any set of experts, each of which produces a low-dimensional local representation of a highdimensional input. Unlike recent efforts to coordinate such models by modifying their objective functions [1, 2], our algorithm is invoked after training and applies an efficient eigensolver to post-process the trained models. The post-processing has no local optima and the size of the system it must solve scales with the number of local models rather than the number of original data points, making it more efficient than model-free algorithms such as Isomap [3] or LLE [4]. 1 Introduction: Local vs. Global Dimensionality Reduction Beyond density modelling, an important goal of unsupervised learning is to discover compact, informative representations of high-dimensional data. If the data lie on a smooth low dimensional manifold, then an excellent encoding is the coordinates internal to that manifold. The process of determining such coordinates is dimensionality reduction. Linear dimensionality reduction methods such as principal component analysis and factor analysis are easy to train but cannot capture the structure of curved manifolds. Mixtures of these simple unsupervised models [5, 6, 7, 8] have been used to perform local dimensionality reduction, and can provide good density models for curved manifolds, but unfortunately such mixtures cannot do dimensionality reduction. They do not describe a single, coherent low-dimensional coordinate system for the data since there is no pressure for the local coordinates of each component to agree. Roweis et al [1] recently proposed a model which performs global coordination of local coordinate systems in a mixture of factor analyzers (MFA). Their model is trained by maximizing the likelihood of the data, with an additional variational penalty term to encourage the internal coordinates of the factor analyzers to agree. While their model can trade off modelling the data and having consistent local coordinate systems, it requires a user given trade-off parameter, training is quite inefficient (although [2] describes an improved training algorithm for a more constrained model), and it has quite serious local minima problems (methods like LLE [4] or Isomap [3] have to be used for initialization). In this paper we describe a novel, automatic way to align the hidden representations used by each component of a mixture of dimensionality reducers into a single global representation of the data throughout space. Given an already trained mixture, the alignment is achieved by applying an eigensolver to a matrix constructed from the internal representations of the mixture components. Our method is efficient, simple to implement, and has no local optima in its optimization nor any learning rates or annealing schedules. 2 The Locally Linear Coordination Algorithm H 9¥ EI¡ CD66B9 ©9B 766 % G F 5 #

3 0.09363845 97 nips-2002-Global Versus Local Methods in Nonlinear Dimensionality Reduction

Author: Vin D. Silva, Joshua B. Tenenbaum

Abstract: Recently proposed algorithms for nonlinear dimensionality reduction fall broadly into two categories which have different advantages and disadvantages: global (Isomap [1]), and local (Locally Linear Embedding [2], Laplacian Eigenmaps [3]). We present two variants of Isomap which combine the advantages of the global approach with what have previously been exclusive advantages of local methods: computational sparsity and the ability to invert conformal maps.

4 0.084850214 122 nips-2002-Learning About Multiple Objects in Images: Factorial Learning without Factorial Search

Author: Christopher Williams, Michalis K. Titsias

Abstract: We consider data which are images containing views of multiple objects. Our task is to learn about each of the objects present in the images. This task can be approached as a factorial learning problem, where each image must be explained by instantiating a model for each of the objects present with the correct instantiation parameters. A major problem with learning a factorial model is that as the number of objects increases, there is a combinatorial explosion of the number of configurations that need to be considered. We develop a method to extract object models sequentially from the data by making use of a robust statistical method, thus avoiding the combinatorial explosion, and present results showing successful extraction of objects from real images.

5 0.082615577 98 nips-2002-Going Metric: Denoising Pairwise Data

Author: Volker Roth, Julian Laub, Klaus-Robert Müller, Joachim M. Buhmann

Abstract: Pairwise data in empirical sciences typically violate metricity, either due to noise or due to fallible estimates, and therefore are hard to analyze by conventional machine learning technology. In this paper we therefore study ways to work around this problem. First, we present an alternative embedding to multi-dimensional scaling (MDS) that allows us to apply a variety of classical machine learning and signal processing algorithms. The class of pairwise grouping algorithms which share the shift-invariance property is statistically invariant under this embedding procedure, leading to identical assignments of objects to clusters. Based on this new vectorial representation, denoising methods are applied in a second step. Both steps provide a theoretically well controlled setup to translate from pairwise data to the respective denoised metric representation. We demonstrate the practical usefulness of our theoretical reasoning by discovering structure in protein sequence data bases, visibly improving performance upon existing automatic methods. 1

6 0.078419499 49 nips-2002-Charting a Manifold

7 0.076736286 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

8 0.069216132 119 nips-2002-Kernel Dependency Estimation

9 0.068466723 19 nips-2002-Adapting Codes and Embeddings for Polychotomies

10 0.065322883 10 nips-2002-A Model for Learning Variance Components of Natural Images

11 0.063386336 52 nips-2002-Cluster Kernels for Semi-Supervised Learning

12 0.062399324 115 nips-2002-Informed Projections

13 0.059762839 57 nips-2002-Concurrent Object Recognition and Segmentation by Graph Partitioning

14 0.056931816 34 nips-2002-Artefactual Structure from Least-Squares Multidimensional Scaling

15 0.053081553 70 nips-2002-Distance Metric Learning with Application to Clustering with Side-Information

16 0.053059783 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs

17 0.051101658 2 nips-2002-A Bilinear Model for Sparse Coding

18 0.051061414 138 nips-2002-Manifold Parzen Windows

19 0.049204163 158 nips-2002-One-Class LP Classifiers for Dissimilarity Representations

20 0.048458476 107 nips-2002-Identity Uncertainty and Citation Matching


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.16), (1, -0.044), (2, -0.006), (3, 0.088), (4, -0.055), (5, 0.043), (6, 0.025), (7, -0.084), (8, -0.056), (9, 0.117), (10, 0.013), (11, 0.024), (12, -0.043), (13, -0.075), (14, 0.057), (15, 0.129), (16, -0.049), (17, 0.085), (18, 0.057), (19, -0.036), (20, -0.03), (21, 0.016), (22, 0.012), (23, -0.024), (24, -0.023), (25, -0.075), (26, 0.02), (27, -0.072), (28, -0.086), (29, 0.029), (30, -0.017), (31, 0.084), (32, -0.106), (33, 0.005), (34, -0.08), (35, -0.03), (36, -0.045), (37, 0.147), (38, 0.073), (39, 0.043), (40, -0.022), (41, 0.084), (42, 0.033), (43, -0.044), (44, 0.093), (45, 0.006), (46, -0.067), (47, 0.029), (48, -0.022), (49, -0.082)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93120474 190 nips-2002-Stochastic Neighbor Embedding

Author: Geoffrey E. Hinton, Sam T. Roweis

Abstract: We describe a probabilistic approach to the task of placing objects, described by high-dimensional vectors or by pairwise dissimilarities, in a low-dimensional space in a way that preserves neighbor identities. A Gaussian is centered on each object in the high-dimensional space and the densities under this Gaussian (or the given dissimilarities) are used to define a probability distribution over all the potential neighbors of the object. The aim of the embedding is to approximate this distribution as well as possible when the same operation is performed on the low-dimensional “images” of the objects. A natural cost function is a sum of Kullback-Leibler divergences, one per object, which leads to a simple gradient for adjusting the positions of the low-dimensional images. Unlike other dimensionality reduction methods, this probabilistic framework makes it easy to represent each object by a mixture of widely separated low-dimensional images. This allows ambiguous objects, like the document count vector for the word “bank”, to have versions close to the images of both “river” and “finance” without forcing the images of outdoor concepts to be located close to those of corporate concepts.

2 0.6014834 122 nips-2002-Learning About Multiple Objects in Images: Factorial Learning without Factorial Search

Author: Christopher Williams, Michalis K. Titsias

Abstract: We consider data which are images containing views of multiple objects. Our task is to learn about each of the objects present in the images. This task can be approached as a factorial learning problem, where each image must be explained by instantiating a model for each of the objects present with the correct instantiation parameters. A major problem with learning a factorial model is that as the number of objects increases, there is a combinatorial explosion of the number of configurations that need to be considered. We develop a method to extract object models sequentially from the data by making use of a robust statistical method, thus avoiding the combinatorial explosion, and present results showing successful extraction of objects from real images.

3 0.59604347 36 nips-2002-Automatic Alignment of Local Representations

Author: Yee W. Teh, Sam T. Roweis

Abstract: We present an automatic alignment procedure which maps the disparate internal representations learned by several local dimensionality reduction experts into a single, coherent global coordinate system for the original data space. Our algorithm can be applied to any set of experts, each of which produces a low-dimensional local representation of a highdimensional input. Unlike recent efforts to coordinate such models by modifying their objective functions [1, 2], our algorithm is invoked after training and applies an efficient eigensolver to post-process the trained models. The post-processing has no local optima and the size of the system it must solve scales with the number of local models rather than the number of original data points, making it more efficient than model-free algorithms such as Isomap [3] or LLE [4]. 1 Introduction: Local vs. Global Dimensionality Reduction Beyond density modelling, an important goal of unsupervised learning is to discover compact, informative representations of high-dimensional data. If the data lie on a smooth low dimensional manifold, then an excellent encoding is the coordinates internal to that manifold. The process of determining such coordinates is dimensionality reduction. Linear dimensionality reduction methods such as principal component analysis and factor analysis are easy to train but cannot capture the structure of curved manifolds. Mixtures of these simple unsupervised models [5, 6, 7, 8] have been used to perform local dimensionality reduction, and can provide good density models for curved manifolds, but unfortunately such mixtures cannot do dimensionality reduction. They do not describe a single, coherent low-dimensional coordinate system for the data since there is no pressure for the local coordinates of each component to agree. Roweis et al [1] recently proposed a model which performs global coordination of local coordinate systems in a mixture of factor analyzers (MFA). Their model is trained by maximizing the likelihood of the data, with an additional variational penalty term to encourage the internal coordinates of the factor analyzers to agree. While their model can trade off modelling the data and having consistent local coordinate systems, it requires a user given trade-off parameter, training is quite inefficient (although [2] describes an improved training algorithm for a more constrained model), and it has quite serious local minima problems (methods like LLE [4] or Isomap [3] have to be used for initialization). In this paper we describe a novel, automatic way to align the hidden representations used by each component of a mixture of dimensionality reducers into a single global representation of the data throughout space. Given an already trained mixture, the alignment is achieved by applying an eigensolver to a matrix constructed from the internal representations of the mixture components. Our method is efficient, simple to implement, and has no local optima in its optimization nor any learning rates or annealing schedules. 2 The Locally Linear Coordination Algorithm H 9¥ EI¡ CD66B9 ©9B 766 % G F 5 #

4 0.59359103 158 nips-2002-One-Class LP Classifiers for Dissimilarity Representations

Author: Elzbieta Pekalska, David Tax, Robert Duin

Abstract: Problems in which abnormal or novel situations should be detected can be approached by describing the domain of the class of typical examples. These applications come from the areas of machine diagnostics, fault detection, illness identification or, in principle, refer to any problem where little knowledge is available outside the typical class. In this paper we explain why proximities are natural representations for domain descriptors and we propose a simple one-class classifier for dissimilarity representations. By the use of linear programming an efficient one-class description can be found, based on a small number of prototype objects. This classifier can be made (1) more robust by transforming the dissimilarities and (2) cheaper to compute by using a reduced representation set. Finally, a comparison to a comparable one-class classifier by Campbell and Bennett is given.

5 0.58623636 97 nips-2002-Global Versus Local Methods in Nonlinear Dimensionality Reduction

Author: Vin D. Silva, Joshua B. Tenenbaum

Abstract: Recently proposed algorithms for nonlinear dimensionality reduction fall broadly into two categories which have different advantages and disadvantages: global (Isomap [1]), and local (Locally Linear Embedding [2], Laplacian Eigenmaps [3]). We present two variants of Isomap which combine the advantages of the global approach with what have previously been exclusive advantages of local methods: computational sparsity and the ability to invert conformal maps.

6 0.52034634 49 nips-2002-Charting a Manifold

7 0.51214397 107 nips-2002-Identity Uncertainty and Citation Matching

8 0.4818638 98 nips-2002-Going Metric: Denoising Pairwise Data

9 0.47987169 138 nips-2002-Manifold Parzen Windows

10 0.4692308 117 nips-2002-Intrinsic Dimension Estimation Using Packing Numbers

11 0.46854019 131 nips-2002-Learning to Classify Galaxy Shapes Using the EM Algorithm

12 0.44741225 34 nips-2002-Artefactual Structure from Least-Squares Multidimensional Scaling

13 0.42329082 115 nips-2002-Informed Projections

14 0.3991169 150 nips-2002-Multiple Cause Vector Quantization

15 0.36893162 57 nips-2002-Concurrent Object Recognition and Segmentation by Graph Partitioning

16 0.36706063 87 nips-2002-Fast Transformation-Invariant Factor Analysis

17 0.36110649 119 nips-2002-Kernel Dependency Estimation

18 0.35978797 109 nips-2002-Improving a Page Classifier with Anchor Extraction and Link Analysis

19 0.35059792 162 nips-2002-Parametric Mixture Models for Multi-Labeled Text

20 0.34692562 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(11, 0.037), (14, 0.024), (23, 0.019), (42, 0.109), (51, 0.012), (54, 0.177), (55, 0.046), (57, 0.018), (63, 0.034), (64, 0.013), (67, 0.029), (68, 0.03), (74, 0.087), (75, 0.135), (92, 0.034), (98, 0.081)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.96695775 15 nips-2002-A Probabilistic Model for Learning Concatenative Morphology

Author: Matthew G. Snover, Michael R. Brent

Abstract: This paper describes a system for the unsupervised learning of morphological suffixes and stems from word lists. The system is composed of a generative probability model and hill-climbing and directed search algorithms. By extracting and examining morphologically rich subsets of an input lexicon, the directed search identifies highly productive paradigms. The hill-climbing algorithm then further maximizes the probability of the hypothesis. Quantitative results are shown by measuring the accuracy of the morphological relations identified. Experiments in English and Polish, as well as comparisons with another recent unsupervised morphology learning algorithm demonstrate the effectiveness of this technique.

2 0.96433395 32 nips-2002-Approximate Inference and Protein-Folding

Author: Chen Yanover, Yair Weiss

Abstract: Side-chain prediction is an important subtask in the protein-folding problem. We show that finding a minimal energy side-chain configuration is equivalent to performing inference in an undirected graphical model. The graphical model is relatively sparse yet has many cycles. We used this equivalence to assess the performance of approximate inference algorithms in a real-world setting. Specifically we compared belief propagation (BP), generalized BP (GBP) and naive mean field (MF). In cases where exact inference was possible, max-product BP always found the global minimum of the energy (except in few cases where it failed to converge), while other approximation algorithms of similar complexity did not. In the full protein data set, maxproduct BP always found a lower energy configuration than the other algorithms, including a widely used protein-folding software (SCWRL). 1

same-paper 3 0.91706532 190 nips-2002-Stochastic Neighbor Embedding

Author: Geoffrey E. Hinton, Sam T. Roweis

Abstract: We describe a probabilistic approach to the task of placing objects, described by high-dimensional vectors or by pairwise dissimilarities, in a low-dimensional space in a way that preserves neighbor identities. A Gaussian is centered on each object in the high-dimensional space and the densities under this Gaussian (or the given dissimilarities) are used to define a probability distribution over all the potential neighbors of the object. The aim of the embedding is to approximate this distribution as well as possible when the same operation is performed on the low-dimensional “images” of the objects. A natural cost function is a sum of Kullback-Leibler divergences, one per object, which leads to a simple gradient for adjusting the positions of the low-dimensional images. Unlike other dimensionality reduction methods, this probabilistic framework makes it easy to represent each object by a mixture of widely separated low-dimensional images. This allows ambiguous objects, like the document count vector for the word “bank”, to have versions close to the images of both “river” and “finance” without forcing the images of outdoor concepts to be located close to those of corporate concepts.

4 0.87769783 141 nips-2002-Maximally Informative Dimensions: Analyzing Neural Responses to Natural Signals

Author: Tatyana Sharpee, Nicole C. Rust, William Bialek

Abstract: unkown-abstract

5 0.84654284 169 nips-2002-Real-Time Particle Filters

Author: Cody Kwok, Dieter Fox, Marina Meila

Abstract: Particle filters estimate the state of dynamical systems from sensor information. In many real time applications of particle filters, however, sensor information arrives at a significantly higher rate than the update rate of the filter. The prevalent approach to dealing with such situations is to update the particle filter as often as possible and to discard sensor information that cannot be processed in time. In this paper we present real-time particle filters, which make use of all sensor information even when the filter update rate is below the update rate of the sensors. This is achieved by representing posteriors as mixtures of sample sets, where each mixture component integrates one observation arriving during a filter update. The weights of the mixture components are set so as to minimize the approximation error introduced by the mixture representation. Thereby, our approach focuses computational resources (samples) on valuable sensor information. Experiments using data collected with a mobile robot show that our approach yields strong improvements over other approaches.

6 0.84529299 52 nips-2002-Cluster Kernels for Semi-Supervised Learning

7 0.84311438 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs

8 0.83919305 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions

9 0.83789057 188 nips-2002-Stability-Based Model Selection

10 0.83565235 3 nips-2002-A Convergent Form of Approximate Policy Iteration

11 0.83196527 14 nips-2002-A Probabilistic Approach to Single Channel Blind Signal Separation

12 0.83193111 119 nips-2002-Kernel Dependency Estimation

13 0.83150274 53 nips-2002-Clustering with the Fisher Score

14 0.83124483 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

15 0.82823384 21 nips-2002-Adaptive Classification by Variational Kalman Filtering

16 0.82750309 124 nips-2002-Learning Graphical Models with Mercer Kernels

17 0.82578826 96 nips-2002-Generalized² Linear² Models

18 0.82417244 70 nips-2002-Distance Metric Learning with Application to Clustering with Side-Information

19 0.82415497 137 nips-2002-Location Estimation with a Differential Update Network

20 0.82363153 46 nips-2002-Boosting Density Estimation