nips nips2001 nips2001-84 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sam T. Roweis, Lawrence K. Saul, Geoffrey E. Hinton
Abstract: High dimensional data that lies on or near a low dimensional manifold can be described by a collection of local linear models. Such a description, however, does not provide a global parameterization of the manifold—arguably an important goal of unsupervised learning. In this paper, we show how to learn a collection of local linear models that solves this more difficult problem. Our local linear models are represented by a mixture of factor analyzers, and the “global coordination” of these models is achieved by adding a regularizing term to the standard maximum likelihood objective function. The regularizer breaks a degeneracy in the mixture model’s parameter space, favoring models whose internal coordinate systems are aligned in a consistent way. As a result, the internal coordinates change smoothly and continuously as one traverses a connected path on the manifold—even when the path crosses the domains of many different local models. The regularizer takes the form of a Kullback-Leibler divergence and illustrates an unexpected application of variational methods: not to perform approximate inference in intractable probabilistic models, but to learn more useful internal representations in tractable ones. 1 Manifold Learning Consider an ensemble of images, each of which contains a face against a neutral background. Each image can be represented by a point in the high dimensional vector space of pixel intensities. This representation, however, does not exploit the strong correlations between pixels of the same image, nor does it support many useful operations for reasoning about faces. If, for example, we select two images with faces in widely different locations and then average their pixel intensities, we do not obtain an image of a face at their average location. Images of faces lie on or near a low-dimensional, curved manifold, and we can represent them more usefully by the coordinates on this manifold than by pixel intensities. Using these “intrinsic coordinates”, the average of two faces is another face with the average of their locations, poses and expressions. To analyze and manipulate faces, it is helpful to imagine a “magic black box” with levers or dials corresponding to the intrinsic coordinates on this manifold. Given a setting of the levers and dials, the box generates an image of a face. Given an image of a face, the box deduces the appropriate setting of the levers and dials. In this paper, we describe a fairly general way to construct such a box automatically from an ensemble of high-dimensional vectors. We assume only that there exists an underlying manifold of low dimensionality and that the relationship between the raw data and the manifold coordinates is locally linear and smoothly varying. Thus our method applies not only to images of faces, but also to many other forms of highly distributed perceptual and scientific data (e.g., spectrograms of speech, robotic sensors, gene expression arrays, document collections). 2 Local Linear Models The global structure of perceptual manifolds (such as images of faces) tends to be highly nonlinear. Fortunately, despite their complicated global structure, we can usually characterize these manifolds as locally linear. Thus, to a good approximation, they can be represented by collections of simpler models, each of which describes a locally linear neighborhood[3, 6, 8]. For unsupervised learning tasks, a probabilistic model that nicely captures this intuition is a mixture of factor analyzers (MFA)[5]. 
The model is used to describe high dimensional data that lies on or near a lower dimensional manifold. MFAs parameterize a joint distribution over observed and hidden variables, P(x, s, z) = P(x | s, z) P(z | s) P(s) (1), where the observed variable, x, represents the high dimensional data; the discrete hidden variable, s, indexes different neighborhoods on the manifold; and the continuous hidden variables, z, represent low dimensional local coordinates. The model assumes that data is sampled from different neighborhoods on the manifold with prior probabilities P(s), and that within each neighborhood, the data's local coordinates are normally distributed as a zero-mean, unit-variance Gaussian, P(z | s) = N(z; 0, I).
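As a concrete illustration of the generative model just described, the following minimal NumPy sketch samples from a mixture of factor analyzers. The parameter names (mixture prior pi, loadings Lambda, means mu, diagonal noise variances Psi) are illustrative assumptions rather than the paper's own notation.

import numpy as np

def sample_mfa(pi, Lambda, mu, Psi, rng=None):
    """Draw one sample (x, s, z) from a mixture of factor analyzers.

    pi     : (S,)       mixture prior P(s)
    Lambda : (S, D, d)  factor loadings for each of the S components
    mu     : (S, D)     component means
    Psi    : (S, D)     diagonal observation-noise variances
    """
    rng = np.random.default_rng() if rng is None else rng
    s = rng.choice(len(pi), p=pi)             # pick a neighborhood, s ~ P(s)
    z = rng.standard_normal(Lambda.shape[2])  # local coordinates, z | s ~ N(0, I)
    noise = rng.normal(0.0, np.sqrt(Psi[s]))  # axis-aligned Gaussian noise
    x = Lambda[s] @ z + mu[s] + noise         # x | s, z ~ N(Lambda_s z + mu_s, Psi_s)
    return x, s, z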
Reference: text
sentIndex sentText sentNum sentScore
1 Hinton, Department of Computer Science, University of Toronto; Department of Computer and Information Science, University of Pennsylvania. Abstract: High dimensional data that lies on or near a low dimensional manifold can be described by a collection of local linear models. [sent-3, score-0.65]
2 Such a description, however, does not provide a global parameterization of the manifold—arguably an important goal of unsupervised learning. [sent-4, score-0.392]
3 In this paper, we show how to learn a collection of local linear models that solves this more difficult problem. [sent-5, score-0.217]
4 Our local linear models are represented by a mixture of factor analyzers, and the “global coordination” of these models is achieved by adding a regularizing term to the standard maximum likelihood objective function. [sent-6, score-0.636]
5 The regularizer breaks a degeneracy in the mixture model’s parameter space, favoring models whose internal coordinate systems are aligned in a consistent way. [sent-7, score-0.551]
6 As a result, the internal coordinates change smoothly and continuously as one traverses a connected path on the manifold—even when the path crosses the domains of many different local models. [sent-8, score-0.661]
7 The regularizer takes the form of a Kullback-Leibler divergence and illustrates an unexpected application of variational methods: not to perform approximate inference in intractable probabilistic models, but to learn more useful internal representations in tractable ones. [sent-9, score-0.519]
8 Each image can be represented by a point in the high dimensional vector space of pixel intensities. [sent-11, score-0.244]
9 If, for example, we select two images with faces in widely different locations and then average their pixel intensities, we do not obtain an image of a face at their average location. [sent-13, score-0.483]
10 Images of faces lie on or near a low-dimensional, curved manifold, and we can represent them more usefully by the coordinates on this manifold than by pixel intensities. [sent-14, score-0.788]
11 Using these “intrinsic coordinates”, the average of two faces is another face with the average of their locations, poses and expressions. [sent-15, score-0.255]
12 To analyze and manipulate faces, it is helpful to imagine a “magic black box” with levers or dials corresponding to the intrinsic coordinates on this manifold. [sent-16, score-0.442]
13 Given a setting of the levers and dials, the box generates an image of a face. [sent-17, score-0.226]
14 Given an image of a face, the box deduces the appropriate setting of the levers and dials. [sent-18, score-0.226]
15 We assume only that there exists an underlying manifold of low dimensionality and that the relationship between the raw data and the manifold coordinates is locally linear and smoothly varying. [sent-20, score-0.905]
16 Thus our method applies not only to images of faces, but also to many other forms of highly distributed perceptual and scientific data (e.g., spectrograms of speech, robotic sensors, gene expression arrays, document collections). [sent-21, score-0.126]
17 2 Local Linear Models The global structure of perceptual manifolds (such as images of faces) tends to be highly nonlinear. [sent-24, score-0.555]
18 Fortunately, despite their complicated global structure, we can usually characterize these manifolds as locally linear. [sent-25, score-0.492]
19 Thus, to a good approximation, they can be represented by collections of simpler models, each of which describes a locally linear neighborhood[3, 6, 8]. [sent-26, score-0.134]
20 For unsupervised learning tasks, a probabilistic model that nicely captures this intuition is a mixture of factor analyzers (MFA)[5]. [sent-27, score-0.277]
21 The model is used to describe high dimensional data that lies on or near a lower dimensional manifold. [sent-28, score-0.259]
22 The transformations correspond to arbitrary rotations and reflections of the local coordinates in each linear model. [sent-36, score-0.514]
23 The objective function for the EM algorithm is unchanged by these transformations. [sent-37, score-0.182]
24 Thus, maximum likelihood estimation in MFAs does not favor any particular alignment; instead, it produces models whose internal representations change unpredictably as one traverses connected paths on the manifold. [sent-38, score-0.281]
25 Can we encourage models whose local coordinate systems are aligned in a consistent way? [sent-39, score-0.462]
26 3 Global Coordination: Suppose the data lie near a smooth manifold with a locally flat (developable) structure. [sent-40, score-0.377]
27 Figure 1: Graphical model for globally coordinated MFAs, with hidden variables s and z. [sent-42, score-0.339]
28 Although global coordinates are unobserved, they affect the learning through a regularization term. [sent-43, score-0.596]
29 After learning, inferences about the global variables are made by computing posterior distributions. [sent-44, score-0.39]
30 All these operations, including computing the conditional distribution, are particularly tractable due to the conditional independencies of the model. [sent-46, score-0.133]
31 (Figure 1 node labels: data, hidden variables, global coordinates.) [sent-47, score-0.596]
32 Furthermore, to a good approximation, these global coordinates can be related to the local coordinates of different neighborhoods (in their region of validity) by linear transformations, eq. (5). What does it mean to say that the coordinates provide a global parameterization of the manifold? [sent-48, score-1.741]
33 Intuitively, if a data point belongs to overlapping neighborhoods, then the global coordinates computed from their local coordinate systems, given by eq. (5), should agree. [sent-49, score-0.9]
34 The globally coordinated MFA is represented by the graphical model in Fig. 1. [sent-52, score-0.307]
35 To enforce this criterion of agreement, we need to penalize models whose posterior distributions, given by eq. (8), are multimodal. [sent-57, score-0.238]
36 Multiple modes arise only when different mixture components give rise to inconsistent global coordinates. [sent-58, score-0.399]
37 We introduce a family of unimodal distributions over both the global coordinates and the mixture components, and encourage the true posteriors to be close to some member of this family. [sent-60, score-0.274]
38 Developing this idea further, we introduce a new objective function for unsupervised learning in MFAs. [sent-61, score-0.274]
39 The new objective function incorporates a regularizer to encourage the global consistency of local models: eq. (9), the log-probability of the data minus a weighted sum of Kullback-Leibler divergences (a code sketch of this objective follows the sentence list). [sent-62, score-0.777]
40 The first term in this objective function computes the log-probability of the data. [sent-63, score-0.182]
41 The second term computes a sum of Kullback-Leibler (KL) divergences; these are designed to penalize MFAs whose posterior distributions over global coordinates are not unimodal. (Footnote 2: Without loss of generality, the matrices can be taken to be symmetric and positive-definite, by exploiting the polar factorization and absorbing reflection and rotation into the local coordinate systems.) [sent-64, score-0.394]
42 (In practice, though, it may be easier to optimize the objective function without constraining the matrices to be of this form.) [sent-65, score-0.272]
43 Together, then, the coordination matrices and vectors account for an axis-aligned scaling and uniform translation between the global and local coordinate systems. [sent-67, score-0.952]
44 These KL divergences penalize MFAs whose posterior distributions over global coordinates are not unimodal. [sent-68, score-0.778]
45 The twin goals of density estimation and manifold learning in MFAs are pursued by attempting to balance these terms in the objective function. [sent-69, score-0.55]
46 The regularization factor controls the tradeoff between density modeling and global coordination: in the limit, only strict invariances (which do not affect likelihood) are exploited in order to achieve submodel agreement. [sent-70, score-0.368]
47 Note that the distribution in eq. (10) factorizes over the global coordinates and the mixture components, implying that, according to this family of models, the global coordinate is independent of the mixture component given the data point. [sent-73, score-0.542]
48 At each iteration of learning, the means, covariance matrices, and mixture weights of the unimodal distributions are determined separately for each data point, so as to maximize the objective function in eq. (9). [sent-76, score-0.371]
49 This amounts to computing the unimodal distributions best matched to the true posterior distributions. [sent-77, score-0.181]
50 Note how the goal of developing more useful internal representations has changed the learning problem in a fundamental way. [sent-79, score-0.184]
51 All of these parameters, as well as the MFA model parameters, must be chosen to “stitch together” the local coordinate systems in a smooth way and to learn internal representations easily coordinated by the local-to-global mapping in eq. (5). [sent-82, score-0.835]
52 Optimization of the objective function in eq. (9) is carried out by coordinate ascent, in the same general style as EM. [sent-84, score-0.182]
53 Our objective function illustrates an unexpected application of such variational methods: not to perform approximate inference in intractable probabilistic models, but to learn more useful internal representations in tractable ones. [sent-88, score-0.589]
54 We introduce the unimodal and factorized distributions to regularize the multimodal posterior distributions. [sent-89, score-0.376]
55 Penalizing the KL divergence between these distributions lifts a degeneracy in the model’s parameter space and favors local linear models that can be globally aligned. [sent-90, score-0.473]
56 Computing and optimizing the objective function: Evaluating the objective function in eq. [sent-92, score-0.364]
57 (9) requires a sum and integral over the latent variables of the model. [sent-93, score-0.106]
58 These operations are simplified by rewriting the objective function as in eq. (11). [sent-94, score-0.219]
59 (Eq. 14.) Here we have introduced simplifying notation for the vector differences and the local precision matrices. [sent-98, score-0.251]
60 Iteratively maximizing the objective function by coordinate ascent now leads to a learning algorithm of the same general style as EM. [sent-99, score-0.325]
61 The objective function is completely invariant to translation and rescaling of the global coordinates, together with the corresponding offsets and diagonal matrices. [sent-105, score-0.225]
62 To remove this degeneracy, after solving the equations above we further constrain the global coordinates to have mean zero and unit variance in each direction. [sent-106, score-0.596]
63 These constraints are enforced without changing the value of the objective function by simply translating the offsets and rescaling the diagonal matrices. [sent-107, score-0.366]
64 M-step: The M-step consists of maximizing the objective function, eq. (9), with respect to the MFA and coordination parameters. [sent-109, score-0.182]
65 Figure 2: Global coordination of local linear models. [sent-117, score-0.419]
66 (left) A model trained using maximum likelihood, with the arrows indicating the direction of increase for each factor analyzer’s local coordinate system. [sent-118, score-0.304]
67 (right) A coordinated model; arrows indicate the direction in the data space corresponding to increasing the global coordinate as inferred by the algorithm. [sent-119, score-0.691]
68 5 Experiments: We have tested our model on simple synthetic manifolds whose structure is known, as well as on collections of images of handwritten digits and faces. [sent-121, score-0.498]
69 In the coordinated model, the global coordinate always points in the same direction along the data manifold, as defined by the composition of the local and coordination transformations. [sent-123, score-0.703]
70 In the model trained with maximum likelihood, the density is well captured but each local latent variable has a random orientation along the manifold. [sent-124, score-0.299]
71 We also applied the algorithm to collections of images of handwritten digits and of faces. [sent-125, score-0.369]
72 The representation was an unprocessed vector of raw 8-bit grayscale pixel intensities for each image (of dimensionality 256 for the digits and 560 for the faces). [sent-126, score-0.327]
73 The MFAs had 64 local models and the global coordinates were two dimensional. [sent-127, score-0.813]
74 After training, the coordinated MFAs had learned a smooth, continuous mapping from the plane to images of digits or of faces. [sent-128, score-0.437]
75 This allows us both to infer a two-dimensional location given any image, by computing the posterior over global coordinates, and to generate new images from any point in the plane. [sent-129, score-0.211]
76 In general, both of these conditional distributions have the form of a mixture of Gaussians. [sent-131, score-0.181]
77 Figure 3 shows the inferred global coordinates (the means of the unimodal distributions) of the training points after the last iteration of training, as well as examples of new images from the generative model, created by evaluating the mean of the predictive distribution along straight line paths in the global coordinate space. [sent-134, score-0.817]
78 In the case of digits, it seems as though our models have captured tilt/shape and identity and represented them as the two axes of the space; in the case of the faces the axes seem to capture pose and expression. [sent-135, score-0.285]
79 (For the faces, the final space was rotated by hand to align interpretable directions with the coordinate axes.) [sent-136, score-0.184]
80 As with all EM algorithms, the coordinated MFA learning procedure is susceptible to local optima. [sent-139, score-0.364]
81 We clamped the global coordinates equal to the embedding coordinates provided by LLE, with a small value for the associated variance, and trained until convergence (typically 30-100 iterations). [sent-141, score-0.257]
82 6 Discussion: Mixture models provide a simple way to approximate the density of high dimensional data that lies on or near a low dimensional manifold. [sent-144, score-0.383]
83 In this paper, we have shown how to learn global coordinates that can act as an encapsulating interface, so that other parts of a learning system do not need to interact with the individual components of a mixture. [sent-146, score-0.596]
84 This should improve generalization as well as facilitate the propagation and exchange of information when these models are incorporated into a larger system. Figure 3: Automatically constructed two dimensional global parameterizations of manifolds of digits and faces. [sent-147, score-0.698]
85 Each plot shows the global coordinate space discovered by the unsupervised algorithm; points indicate the inferred means for each training item at the end of learning. [sent-148, score-0.58]
86 The image stacks on the borders are not from the training set but are generated from the model itself and represent the mean of the predictive distribution at the corresponding open circles (sampled along the straight lines in the global space). [sent-149, score-0.385]
87 The models provide both a two degree-of-freedom generator for complex images (via the conditional distribution over images given global coordinates) and a pose/slant recognition system (via the posterior over global coordinates given an image). [sent-150, score-0.182]
88 For the handwritten digits, the training set consisted of 1100 examples of the digit “2” (shown as crosses above) mixed with 1100 examples of “3”s (shown as triangles). [sent-151, score-0.103]
89 The digits are from the NIST dataset, digitized at 16x16 pixels. [sent-152, score-0.166]
90 For the faces, we used 2000 images of a single person with various poses and expressions taken from consecutive frames of a video digitized at 20x20 pixels. [sent-153, score-0.221]
91 The first is to use an embedding algorithm (such as LLE or Isomap) not only as an initialization step but to provide clamped values for the global coordinates. [sent-157, score-0.456]
92 While this supervised approach may work in practice, unsupervised coordination makes clear the objective function that is being optimized. Figure 4: A situation in which an un-coordinated mixture model, trained to do density estimation, cannot be “postcoordinated”. [sent-158, score-0.699]
93 Noise has caused one of the local density models to orient orthogonal to the manifold. [sent-159, score-0.285]
94 In globally coordinated learning, there is an additional pressure to align with neighbouring models which would force the local model to lie in the correct subspace. [sent-160, score-0.566]
95 Optimizing this objective unifies the goals of manifold learning and density estimation. [sent-161, score-0.333]
96 Another variant is to train an unsupervised mixture model (such as an MFA) using a traditional maximum likelihood objective function and then to “post-coordinate” its parameters by applying local reflections/rotations and translations to create global coordinates. [sent-162, score-0.873]
97 When both density estimation and coordination are optimized simultaneously there is extra pressure for local experts to fit the global structure of the manifold. [sent-164, score-0.83]
98 In the first are efforts at learning the global structure of nonlinear manifolds [1, 4, 9, 10]; in the second are efforts at developing probabilistic graphical models for reasoning under uncertainty[5, 6, 7]. [sent-166, score-0.725]
99 Our work proposes to model the global coordinates on manifolds as latent variables, thus attempting to combine the representational advantages of both frameworks. [sent-167, score-0.83]
100 It differs from embedding by providing a fully probabilistic model valid away from the training set, and from work in generative topographic mapping[2] by not requiring a uniform discretized gridding of the latent space. [sent-168, score-0.266]
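Entries 22-25 above describe the degeneracy that motivates the method: maximum likelihood cannot fix the orientation of each factor analyzer's local coordinate system, because rotating or reflecting it leaves the likelihood unchanged. The short Python sketch below, with made-up dimensions chosen purely for illustration, checks this invariance numerically: an orthogonal transform of the loadings leaves the component's marginal covariance, and hence its likelihood, untouched.

import numpy as np

rng = np.random.default_rng(0)
D, d = 5, 2
Lambda = rng.standard_normal((D, d))           # loadings of one factor analyzer
Psi = np.diag(rng.uniform(0.1, 0.5, size=D))   # diagonal noise covariance

# Any orthogonal d x d matrix R (a rotation or reflection of the local coordinates)
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

cov_before = Lambda @ Lambda.T + Psi
cov_after = (Lambda @ R) @ (Lambda @ R).T + Psi   # rotated local coordinates

# The marginal covariance of the data, and therefore the likelihood, is identical:
assert np.allclose(cov_before, cov_after)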
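Entries 37-46 above describe the regularized objective: for each data point, the log-likelihood minus a KL divergence between a unimodal, factorized distribution over the mixture component and global coordinate and the model's true posterior. Because the variational distribution factorizes, this KL splits into a discrete term plus component-wise Gaussian KLs. The sketch below is one way to compute that penalty under Gaussian assumptions; the interface (responsibilities, per-component posterior means and covariances, a tradeoff weight lam) is an assumption for illustration and does not reproduce the paper's exact parameterization.

import numpy as np

def gaussian_kl(m0, S0, m1, S1):
    """KL( N(m0, S0) || N(m1, S1) ) for full covariance matrices."""
    d = len(m0)
    S1_inv = np.linalg.inv(S1)
    diff = m1 - m0
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - d
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

def coordination_penalty(q_s, q_mean, q_cov, post_s, post_means, post_covs):
    """KL divergence between a factorized, unimodal Q(s, g) = Q(s) N(g; q_mean, q_cov)
    and a posterior of the form P(s, g | x) = P(s | x) N(g; m_s, C_s), over the joint (s, g).
    Assumes post_s[s] > 0 wherever q_s[s] > 0."""
    kl = 0.0
    for s in range(len(q_s)):
        if q_s[s] > 0:
            kl += q_s[s] * (np.log(q_s[s] / post_s[s])
                            + gaussian_kl(q_mean, q_cov, post_means[s], post_covs[s]))
    return kl

def regularized_objective(log_lik, kl_terms, lam=1.0):
    """Sum over data points of log P(x_n) minus a weighted coordination penalty."""
    return np.sum(log_lik) - lam * np.sum(kl_terms)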
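Entry 32 relates local to global coordinates by a per-neighborhood linear transformation, and entries 61-63 remove the residual translation and rescaling degeneracy by standardizing the inferred global coordinates, absorbing the change into the offsets and diagonal scales. The sketch below assumes coordination matrices A_s mapping local to global coordinates and offsets kappa_s; the shapes and names are illustrative, not taken from the paper.

import numpy as np

def local_to_global(z, s, A, kappa):
    """Map local coordinates z of component s to global coordinates, g = A_s z + kappa_s."""
    return A[s] @ z + kappa[s]

def standardize_global(G, A, kappa):
    """Constrain inferred global coordinates G (N, d) to zero mean and unit variance
    per dimension, absorbing the change into A and kappa so that every mapped point
    is reproduced in the new coordinate system and the objective value is unchanged."""
    mean, std = G.mean(axis=0), G.std(axis=0)
    G_std = (G - mean) / std
    A_new = A / std[None, :, None]      # rescale each global dimension of every A_s
    kappa_new = (kappa - mean) / std    # translate and rescale the offsets
    return G_std, A_new, kappa_new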
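Entries 80-81 and 91 describe initializing the learning procedure by clamping the variational means to an embedding such as LLE or Isomap, with a small variance. A minimal sketch follows; embed_fn is a hypothetical stand-in for any off-the-shelf embedding routine (for instance scikit-learn's LocallyLinearEmbedding), and clamp_var is an illustrative default rather than a value from the paper.

import numpy as np

def init_from_embedding(X, embed_fn, n_components=2, clamp_var=1e-4):
    """Initialize the per-point unimodal distributions over global coordinates by
    clamping their means to an embedding of the data and giving them a small,
    fixed isotropic variance before running the coordinate-ascent learning."""
    G0 = np.asarray(embed_fn(X, n_components))   # (N, d) embedding coordinates
    means = G0.copy()                            # clamped means
    covs = np.tile(clamp_var * np.eye(n_components), (len(X), 1, 1))
    return means, covs

# Example with a hypothetical LLE wrapper:
# from sklearn.manifold import LocallyLinearEmbedding
# lle = lambda X, d: LocallyLinearEmbedding(n_neighbors=10, n_components=d).fit_transform(X)
# means, covs = init_from_embedding(X, lle)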
wordName wordTfidf (topN-words)
[('global', 0.3), ('coordinates', 0.296), ('coordination', 0.258), ('mfas', 0.233), ('manifold', 0.23), ('coordinated', 0.203), ('objective', 0.182), ('local', 0.161), ('faces', 0.159), ('mfa', 0.146), ('coordinate', 0.143), ('manifolds', 0.129), ('unimodal', 0.127), ('images', 0.126), ('digits', 0.108), ('dimensional', 0.105), ('mixture', 0.099), ('gd', 0.096), ('neighborhoods', 0.092), ('unsupervised', 0.092), ('matrices', 0.09), ('levers', 0.088), ('image', 0.085), ('distributions', 0.082), ('internal', 0.078), ('lle', 0.076), ('embedding', 0.073), ('collections', 0.071), ('latent', 0.07), ('regularizer', 0.069), ('degeneracy', 0.069), ('variational', 0.068), ('density', 0.068), ('encourage', 0.065), ('handwritten', 0.064), ('locally', 0.063), ('globally', 0.062), ('representations', 0.062), ('face', 0.059), ('dials', 0.058), ('digitized', 0.058), ('independencies', 0.058), ('isomap', 0.058), ('magic', 0.058), ('transformations', 0.057), ('models', 0.056), ('pixel', 0.054), ('posterior', 0.054), ('box', 0.053), ('revow', 0.051), ('offsets', 0.051), ('near', 0.049), ('toronto', 0.048), ('traverses', 0.046), ('multimodal', 0.046), ('penalize', 0.046), ('penalizing', 0.046), ('inferred', 0.045), ('em', 0.045), ('dimensionality', 0.045), ('developing', 0.044), ('probabilistic', 0.043), ('rescaling', 0.043), ('regularizing', 0.043), ('analyzers', 0.043), ('pressure', 0.043), ('divergence', 0.043), ('intractable', 0.043), ('graphical', 0.042), ('ghahramani', 0.042), ('initialization', 0.042), ('kl', 0.041), ('hinton', 0.041), ('neighborhood', 0.041), ('unexpected', 0.041), ('smoothly', 0.041), ('clamped', 0.041), ('align', 0.041), ('topographic', 0.041), ('likelihood', 0.039), ('generative', 0.039), ('parameterize', 0.039), ('crosses', 0.039), ('factorized', 0.039), ('roweis', 0.039), ('hidden', 0.038), ('tractable', 0.038), ('operations', 0.037), ('efforts', 0.037), ('poses', 0.037), ('aligned', 0.037), ('nonlinear', 0.037), ('variables', 0.036), ('iterating', 0.035), ('intensities', 0.035), ('goals', 0.035), ('axes', 0.035), ('attempting', 0.035), ('smooth', 0.035), ('illustrates', 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 84 nips-2001-Global Coordination of Local Linear Models
Author: Sam T. Roweis, Lawrence K. Saul, Geoffrey E. Hinton
2 0.16672152 153 nips-2001-Product Analysis: Learning to Model Observations as Products of Hidden Variables
Author: Brendan J. Frey, Anitha Kannan, Nebojsa Jojic
Abstract: Factor analysis and principal components analysis can be used to model linear relationships between observed variables and linearly map high-dimensional data to a lower-dimensional hidden space. In factor analysis, the observations are modeled as a linear combination of normally distributed hidden variables. We describe a nonlinear generalization of factor analysis , called
3 0.16139355 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding
Author: Marzia Polito, Pietro Perona
Abstract: Locally Linear Embedding (LLE) is an elegant nonlinear dimensionality-reduction technique recently introduced by Roweis and Saul [2]. It fails when the data is divided into separate groups. We study a variant of LLE that can simultaneously group the data and calculate local embedding of each group. An estimate for the upper bound on the intrinsic dimension of the data set is obtained automatically. 1
4 0.14951542 129 nips-2001-Multiplicative Updates for Classification by Mixture Models
Author: Lawrence K. Saul, Daniel D. Lee
Abstract: We investigate a learning algorithm for the classification of nonnegative data by mixture models. Multiplicative update rules are derived that directly optimize the performance of these models as classifiers. The update rules have a simple closed form and an intuitive appeal. Our algorithm retains the main virtues of the Expectation-Maximization (EM) algorithm—its guarantee of monotonic improvement, and its absence of tuning parameters—with the added advantage of optimizing a discriminative objective function. The algorithm reduces as a special case to the method of generalized iterative scaling for log-linear models. The learning rate of the algorithm is controlled by the sparseness of the training data. We use the method of nonnegative matrix factorization (NMF) to discover sparse distributed representations of the data. This form of feature selection greatly accelerates learning and makes the algorithm practical on large problems. Experiments show that discriminatively trained mixture models lead to much better classification than comparably sized models trained by EM. 1
5 0.13615569 75 nips-2001-Fast, Large-Scale Transformation-Invariant Clustering
Author: Brendan J. Frey, Nebojsa Jojic
Abstract: In previous work on “transformed mixtures of Gaussians” and “transformed hidden Markov models”, we showed how the EM algorithm in a discrete latent variable model can be used to jointly normalize data (e.g., center images, pitch-normalize spectrograms) and learn a mixture model of the normalized data. The only input to the algorithm is the data, a list of possible transformations, and the number of clusters to find. The main criticism of this work was that the exhaustive computation of the posterior probabilities over transformations would make scaling up to large feature vectors and large sets of transformations intractable. Here, we describe how a tremendous speed-up is acheived through the use of a variational technique for decoupling transformations, and a fast Fourier transform method for computing posterior probabilities. For N ×N images, learning C clusters under N rotations, N scales, N x-translations and N y-translations takes only (C + 2 log N )N 2 scalar operations per iteration. In contrast, the original algorithm takes CN 6 operations to account for these transformations. We give results on learning a 4-component mixture model from a video sequence with frames of size 320 ×240. The model accounts for 360 rotations and 76,800 translations. Each iteration of EM takes only 10 seconds per frame in MATLAB, which is over 5 million times faster than the original algorithm. 1
6 0.13115445 46 nips-2001-Categorization by Learning and Combining Object Parts
7 0.12864316 106 nips-2001-Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering
8 0.10216574 128 nips-2001-Multiagent Planning with Factored MDPs
9 0.097156212 164 nips-2001-Sampling Techniques for Kernel Methods
10 0.09333314 191 nips-2001-Transform-invariant Image Decomposition with Similarity Templates
11 0.09317106 43 nips-2001-Bayesian time series classification
12 0.091740787 21 nips-2001-A Variational Approach to Learning Curves
13 0.08999937 144 nips-2001-Partially labeled classification with Markov random walks
14 0.087664373 41 nips-2001-Bayesian Predictive Profiles With Applications to Retail Transaction Data
15 0.084082864 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family
16 0.081354715 101 nips-2001-K-Local Hyperplane and Convex Distance Nearest Neighbor Algorithms
17 0.078646742 58 nips-2001-Covariance Kernels from Bayesian Generative Models
18 0.078021809 155 nips-2001-Quantizing Density Estimators
19 0.076285154 74 nips-2001-Face Recognition Using Kernel Methods
20 0.076136149 107 nips-2001-Latent Dirichlet Allocation
topicId topicWeight
[(0, -0.258), (1, 0.026), (2, -0.025), (3, -0.1), (4, -0.095), (5, -0.032), (6, -0.11), (7, -0.038), (8, 0.059), (9, -0.024), (10, 0.087), (11, 0.029), (12, 0.165), (13, -0.185), (14, -0.07), (15, -0.123), (16, -0.112), (17, -0.064), (18, -0.064), (19, 0.068), (20, -0.058), (21, 0.013), (22, 0.129), (23, -0.074), (24, -0.145), (25, -0.071), (26, 0.132), (27, 0.049), (28, -0.087), (29, -0.085), (30, -0.007), (31, 0.002), (32, -0.033), (33, 0.087), (34, -0.064), (35, -0.172), (36, 0.022), (37, 0.031), (38, -0.054), (39, 0.028), (40, 0.049), (41, -0.028), (42, 0.02), (43, 0.102), (44, -0.188), (45, -0.019), (46, 0.073), (47, -0.121), (48, 0.105), (49, 0.01)]
simIndex simValue paperId paperTitle
same-paper 1 0.96452582 84 nips-2001-Global Coordination of Local Linear Models
Author: Sam T. Roweis, Lawrence K. Saul, Geoffrey E. Hinton
2 0.65701139 153 nips-2001-Product Analysis: Learning to Model Observations as Products of Hidden Variables
Author: Brendan J. Frey, Anitha Kannan, Nebojsa Jojic
Abstract: Factor analysis and principal components analysis can be used to model linear relationships between observed variables and linearly map high-dimensional data to a lower-dimensional hidden space. In factor analysis, the observations are modeled as a linear combination of normally distributed hidden variables. We describe a nonlinear generalization of factor analysis , called
3 0.6530121 75 nips-2001-Fast, Large-Scale Transformation-Invariant Clustering
Author: Brendan J. Frey, Nebojsa Jojic
Abstract: In previous work on “transformed mixtures of Gaussians” and “transformed hidden Markov models”, we showed how the EM algorithm in a discrete latent variable model can be used to jointly normalize data (e.g., center images, pitch-normalize spectrograms) and learn a mixture model of the normalized data. The only input to the algorithm is the data, a list of possible transformations, and the number of clusters to find. The main criticism of this work was that the exhaustive computation of the posterior probabilities over transformations would make scaling up to large feature vectors and large sets of transformations intractable. Here, we describe how a tremendous speed-up is acheived through the use of a variational technique for decoupling transformations, and a fast Fourier transform method for computing posterior probabilities. For N ×N images, learning C clusters under N rotations, N scales, N x-translations and N y-translations takes only (C + 2 log N )N 2 scalar operations per iteration. In contrast, the original algorithm takes CN 6 operations to account for these transformations. We give results on learning a 4-component mixture model from a video sequence with frames of size 320 ×240. The model accounts for 360 rotations and 76,800 translations. Each iteration of EM takes only 10 seconds per frame in MATLAB, which is over 5 million times faster than the original algorithm. 1
4 0.58865106 106 nips-2001-Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering
Author: Mikhail Belkin, Partha Niyogi
Abstract: Drawing on the correspondence between the graph Laplacian, the Laplace-Beltrami operator on a manifold, and the connections to the heat equation, we propose a geometrically motivated algorithm for constructing a representation for data sampled from a low dimensional manifold embedded in a higher dimensional space. The algorithm provides a computationally efficient approach to nonlinear dimensionality reduction that has locality preserving properties and a natural connection to clustering. Several applications are considered. In many areas of artificial intelligence, information retrieval and data mining, one is often confronted with intrinsically low dimensional data lying in a very high dimensional space. For example, gray scale n x n images of a fixed object taken with a moving camera yield data points in R^{n^2}. However, the intrinsic dimensionality of the space of all images of the same object is the number of degrees of freedom of the camera - in fact the space has the natural structure of a manifold embedded in R^{n^2}. While there is a large body of work on dimensionality reduction in general, most existing approaches do not explicitly take into account the structure of the manifold on which the data may possibly reside. Recently, there has been some interest (Tenenbaum et al, 2000; Roweis and Saul, 2000) in the problem of developing low dimensional representations of data in this particular context. In this paper, we present a new algorithm and an accompanying framework of analysis for geometrically motivated dimensionality reduction. The core algorithm is very simple, has a few local computations and one sparse eigenvalue problem. The solution reflects the intrinsic geometric structure of the manifold. The justification comes from the role of the Laplacian operator in providing an optimal embedding. The Laplacian of the graph obtained from the data points may be viewed as an approximation to the Laplace-Beltrami operator defined on the manifold. The embedding maps for the data come from approximations to a natural map that is defined on the entire manifold. The framework of analysis presented here makes this connection explicit. While this connection is known to geometers and specialists in spectral graph theory (for example, see [1, 2]) to the best of our knowledge we do not know of any application to data representation yet. The connection of the Laplacian to the heat kernel enables us to choose the weights of the graph in a principled manner. The locality preserving character of the Laplacian Eigenmap algorithm makes it relatively insensitive to outliers and noise. A byproduct of this is that the algorithm implicitly emphasizes the natural clusters in the data. Connections to spectral clustering algorithms developed in learning and computer vision (see Shi and Malik, 1997) become very clear. Following the discussion of Roweis and Saul (2000), and Tenenbaum et al (2000), we note that the biological perceptual apparatus is confronted with high dimensional stimuli from which it must recover low dimensional structure. One might argue that if the approach to recovering such low-dimensional structure is inherently local, then a natural clustering will emerge and thus might serve as the basis for the development of categories in biological perception. 1 The Algorithm Given k points X1, ..., Xk in R^l, we construct a weighted graph with k nodes, one for each point, and the set of edges connecting neighboring points to each other. 1. Step 1.
[Constructing the Graph] We put an edge between nodes i and j if Xi and Xj are
5 0.57353699 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding
Author: Marzia Polito, Pietro Perona
Abstract: Locally Linear Embedding (LLE) is an elegant nonlinear dimensionality-reduction technique recently introduced by Roweis and Saul [2]. It fails when the data is divided into separate groups. We study a variant of LLE that can simultaneously group the data and calculate local embedding of each group. An estimate for the upper bound on the intrinsic dimension of the data set is obtained automatically. 1
6 0.55460501 129 nips-2001-Multiplicative Updates for Classification by Mixture Models
7 0.49223441 149 nips-2001-Probabilistic Abstraction Hierarchies
8 0.47532108 155 nips-2001-Quantizing Density Estimators
9 0.45886379 41 nips-2001-Bayesian Predictive Profiles With Applications to Retail Transaction Data
10 0.4354881 101 nips-2001-K-Local Hyperplane and Convex Distance Nearest Neighbor Algorithms
11 0.40903705 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family
12 0.39653209 107 nips-2001-Latent Dirichlet Allocation
13 0.39190701 43 nips-2001-Bayesian time series classification
14 0.38429454 108 nips-2001-Learning Body Pose via Specialized Maps
15 0.38218221 83 nips-2001-Geometrical Singularities in the Neuromanifold of Multilayer Perceptrons
16 0.37851349 110 nips-2001-Learning Hierarchical Structures with Linear Relational Embedding
17 0.377978 128 nips-2001-Multiagent Planning with Factored MDPs
18 0.37425777 190 nips-2001-Thin Junction Trees
19 0.37387249 19 nips-2001-A Rotation and Translation Invariant Discrete Saliency Network
20 0.36831635 70 nips-2001-Estimating Car Insurance Premia: a Case Study in High-Dimensional Data Inference
topicId topicWeight
[(10, 0.215), (14, 0.03), (17, 0.014), (19, 0.019), (27, 0.171), (30, 0.066), (38, 0.033), (49, 0.025), (59, 0.063), (72, 0.079), (79, 0.039), (83, 0.017), (88, 0.01), (91, 0.156)]
simIndex simValue paperId paperTitle
1 0.91500455 32 nips-2001-An Efficient, Exact Algorithm for Solving Tree-Structured Graphical Games
Author: Michael L. Littman, Michael J. Kearns, Satinder P. Singh
Abstract: We describe a new algorithm for computing a Nash equilibrium in graphical games, a compact representation for multi-agent systems that we introduced in previous work. The algorithm is the first to compute equilibria both efficiently and exactly for a non-trivial class of graphical games. 1
same-paper 2 0.8826822 84 nips-2001-Global Coordination of Local Linear Models
Author: Sam T. Roweis, Lawrence K. Saul, Geoffrey E. Hinton
3 0.86897063 58 nips-2001-Covariance Kernels from Bayesian Generative Models
Author: Matthias Seeger
Abstract: We propose the framework of mutual information kernels for learning covariance kernels, as used in Support Vector machines and Gaussian process classifiers, from unlabeled task data using Bayesian techniques. We describe an implementation of this framework which uses variational Bayesian mixtures of factor analyzers in order to attack classification problems in high-dimensional spaces where labeled data is sparse, but unlabeled data is abundant. 1
4 0.75675732 127 nips-2001-Multi Dimensional ICA to Separate Correlated Sources
Author: Roland Vollgraf, Klaus Obermayer
Abstract: We present a new method for the blind separation of sources, which do not fulfill the independence assumption. In contrast to standard methods we consider groups of neighboring samples (
5 0.75488937 190 nips-2001-Thin Junction Trees
Author: Francis R. Bach, Michael I. Jordan
Abstract: We present an algorithm that induces a class of models with thin junction trees—models that are characterized by an upper bound on the size of the maximal cliques of their triangulated graph. By ensuring that the junction tree is thin, inference in our models remains tractable throughout the learning process. This allows both an efficient implementation of an iterative scaling parameter estimation algorithm and also ensures that inference can be performed efficiently with the final model. We illustrate the approach with applications in handwritten digit recognition and DNA splice site detection.
6 0.75359517 153 nips-2001-Product Analysis: Learning to Model Observations as Products of Hidden Variables
7 0.75321937 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding
8 0.74956429 95 nips-2001-Infinite Mixtures of Gaussian Process Experts
9 0.74849063 135 nips-2001-On Spectral Clustering: Analysis and an algorithm
10 0.74842417 131 nips-2001-Neural Implementation of Bayesian Inference in Population Codes
11 0.7483027 8 nips-2001-A General Greedy Approximation Algorithm with Applications
12 0.74792981 57 nips-2001-Correlation Codes in Neuronal Populations
14 0.74589127 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family
15 0.74546182 150 nips-2001-Probabilistic Inference of Hand Motion from Neural Activity in Motor Cortex
16 0.74530548 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
17 0.74477434 27 nips-2001-Activity Driven Adaptive Stochastic Resonance
18 0.74454051 44 nips-2001-Blind Source Separation via Multinode Sparse Representation
19 0.74350178 132 nips-2001-Novel iteration schemes for the Cluster Variation Method
20 0.74233991 13 nips-2001-A Natural Policy Gradient