nips nips2001 nips2001-75 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Brendan J. Frey, Nebojsa Jojic
Abstract: In previous work on “transformed mixtures of Gaussians” and “transformed hidden Markov models”, we showed how the EM algorithm in a discrete latent variable model can be used to jointly normalize data (e.g., center images, pitch-normalize spectrograms) and learn a mixture model of the normalized data. The only input to the algorithm is the data, a list of possible transformations, and the number of clusters to find. The main criticism of this work was that the exhaustive computation of the posterior probabilities over transformations would make scaling up to large feature vectors and large sets of transformations intractable. Here, we describe how a tremendous speed-up is achieved through the use of a variational technique for decoupling transformations, and a fast Fourier transform method for computing posterior probabilities. For N × N images, learning C clusters under N rotations, N scales, N x-translations and N y-translations takes only (C + 2 log N)N^2 scalar operations per iteration. In contrast, the original algorithm takes CN^6 operations to account for these transformations. We give results on learning a 4-component mixture model from a video sequence with frames of size 320 × 240. The model accounts for 360 rotations and 76,800 translations. Each iteration of EM takes only 10 seconds per frame in MATLAB, which is over 5 million times faster than the original algorithm.
Reference: text
sentIndex sentText sentNum sentScore
1 In previous work on “transformed mixtures of Gaussians” and “transformed hidden Markov models”, we showed how the EM algorithm in a discrete latent variable model can be used to jointly normalize data (e.g., [sent-8, score-0.179]
2 center images, pitch-normalize spectrograms) and learn a mixture model of the normalized data. [sent-10, score-0.05]
3 The only input to the algorithm is the data, a list of possible transformations, and the number of clusters to find. [sent-11, score-0.063]
4 The main criticism of this work was that the exhaustive computation of the posterior probabilities over transformations would make scaling up to large feature vectors and large sets of transformations intractable. [sent-12, score-0.41]
5 Here, we describe how a tremendous speed-up is achieved through the use of a variational technique for decoupling transformations, and a fast Fourier transform method for computing posterior probabilities. [sent-13, score-0.548]
6 For N × N images, learning C clusters under N rotations, N scales, N x-translations and N y-translations takes only (C + 2 log N)N^2 scalar operations per iteration. [sent-14, score-0.346]
7 In contrast, the original algorithm takes CN^6 operations to account for these transformations. [sent-15, score-0.185]
8 We give results on learning a 4-component mixture model from a video sequence with frames of size 320 × 240. [sent-16, score-0.231]
9 The model accounts for 360 rotations and 76,800 translations. [sent-17, score-0.148]
10 Each iteration of EM takes only 10 seconds per frame in MATLAB, which is over 5 million times faster than the original algorithm. [sent-18, score-0.206]
11 1 Introduction The task of clustering raw data such as video frames and speech spectrograms is often obfuscated by the presence of random, but well-understood transformations in the data. [sent-19, score-0.477]
12 Examples of these transformations include object motion and camera motion in video sequences and pitch modulation in spectrograms. [sent-20, score-0.371]
13 The machine learning community has proposed a variety of sophisticated techniques for pattern analysis and pattern classification, but these techniques have mostly assumed the data is already normalized. [sent-21, score-0.078]
14 Linear approximations to the transformation manifold have been used to significantly improve the performance of feedforward discriminative classifiers such as nearest neighbors and multilayer perceptrons (Simard, LeCun and Denker 1993). [sent-24, score-0.193]
15 Linear generative models (factor analyzers, mixtures of factor analyzers) have also been modified using linear approximations to the transformation manifold to build in some degree of transformation invariance (Hinton, Dayan and Revow 1997). [sent-25, score-0.428]
16 A multi-resolution approach can be used to extend the usefulness of linear approximations (Vasconcelos and Lippman 1998), but this approach is susceptible to local minima. [sent-26, score-0.034]
17 For significant levels of transformation, linear approximations are far from exact and better results can be obtained by explicitly considering transformed versions of the input. [sent-29, score-0.122]
18 This approach has been used to design “convolutional neural networks” that are invariant to translations of parts of the input (LeCun et al.). [sent-30, score-0.096]
19 In previous work on “transformed mixtures of Gaussians” (Frey and Jojic 2001) and “transformed hidden Markov models” (Jojic et al. 2000), [sent-32, score-0.052]
20 we showed how the EM algorithm in a discrete latent variable model can be used to jointly normalize data (e.g., [sent-33, score-0.127]
21 center video frames, pitch-normalize spectrograms) and learn a mixture model of the normalized data. [sent-35, score-0.204]
22 We found “that the algorithm is reasonably fast (it learns in minutes or hours) and very effective at transformation-invariant density modeling.” [sent-36, score-0.115]
23 Those results were for 44 × 28 images, but realistic applications such as home video summarization require near-real-time processing of medium-quality video at resolutions near 320 × 240. [sent-37, score-0.345]
24 In this paper, we show how a variational technique and a fast Fourier method for computing posterior probabilities can be used to achieve this goal. [sent-38, score-0.408]
25 2 Background In (Frey and Jojic 2001), we introduced a single discrete variable that enumerates a discrete set of possible transformations that can occur in the input. [sent-39, score-0.235]
26 Here, we break the transformation into a sequence of transformations. [sent-40, score-0.159]
27 Tk is the random variable for the transformation matrix at step k. [sent-41, score-0.218]
28 So, Tk denotes the set of possible transformation matrices corresponding to the type of transformation at step k (e.g., image rotation). [sent-42, score-0.346]
29 The class c ∈ {1, . . . , C} is parameterized by p(c) = πc and the untransformed latent image has conditional density p(z0 | c) = N(z0; µc, Φc), (2) where N() is the normal distribution, µc is the mean image for class c and Φc is the diagonal noise covariance matrix for class c. [sent-54, score-0.351]
30 Notice that the noise modeled by Φc gets transformed, so Φc can model noise sources that depend on the transformations, such as background clutter and object deformations in images. [sent-55, score-0.08]
31 Figure 1: (a) The Bayesian network for a generative model that draws an image z0 from class c, applies a randomly drawn transformation matrix T1 of type 1 (e.g., [sent-56, score-0.371]
32 image rotation) to obtain z1, and so on, until a randomly drawn transformation matrix TK of type K (e.g., [sent-58, score-0.321]
33 image translation) is applied to obtain the observed image zK. [sent-60, score-0.262]
34 (b) The Bayesian network for a factorized variational approximation to the posterior distribution, given zK . [sent-61, score-0.368]
35 (c) When an image is measured on a discrete, radial 2-D grid, a scale and rotation correspond to a shift in the radial and angular coordinates. [sent-62, score-0.395]
36 The probability of transformation matrix Tk at step k is p(Tk ) = λk,Tk . [sent-63, score-0.218]
37 At each step, we assume a small amount of noise with diagonal covariance matrix Ψ is added to the image, so p(zk | zk−1, Tk) = N(zk; Tk zk−1, Ψ). (3) [sent-65, score-0.059]
38 Tk operates on zk−1 to produce a transformed image. [sent-66, score-0.088]
39 In fact, Tk can be viewed as a permutation matrix that rearranges the pixels in zk−1 . [sent-67, score-0.031]
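For reference, these pieces combine into the joint distribution below. This is a restatement of the model in Fig. 1a assembled from the class prior πc, the class-conditional density (2), the transformation priors λk,Tk and the noise model (3); it is not an equation quoted from the paper.

```latex
p(c, z_0, T_1, z_1, \ldots, T_K, z_K)
  \;=\; \pi_c \,\mathcal{N}(z_0;\, \mu_c, \Phi_c)
  \prod_{k=1}^{K} \lambda_{k,T_k}\, \mathcal{N}(z_k;\, T_k z_{k-1}, \Psi).
```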
40 In (Frey and Jojic 2001), an exact EM algorithm for learning this model is described. [sent-70, score-0.026]
41 The sufficient statistics for πc, µc and Φc are computed by averaging the derivatives of ln(πc N(z0; µc, Φc)) over the posterior distribution p(z0 | c, T1, . . . , TK, zK). [sent-71, score-0.138]
42 This posterior is Gaussian and its mean and covariance are computed using linear algebra. [sent-87, score-0.036]
43 The problem with this direct approach is that the number of scalar operations in (4) is very large for large feature vectors and large sets of transformations. [sent-92, score-0.185]
44 For N × N images, learning C clusters under N rotations, N scales, N x-translations and N y-translations leads to N^4 terms in the summation. [sent-93, score-0.037]
45 Since there are N^2 pixels, each term is computed using N^2 scalar operations. [sent-94, score-0.137]
46 So, each iteration of EM takes CN^6 scalar operations per training case. [sent-95, score-0.295]
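Spelling out the operation count stated in the last three sentences (a restatement of the arithmetic, nothing new):

```latex
\underbrace{N^4}_{\text{rotations}\times\text{scales}\times x\text{-}\times y\text{-translations}}
\;\times\;
\underbrace{N^2}_{\text{pixels per term}}
\;\times\;
\underbrace{C}_{\text{classes}}
\;=\; C N^6 \ \text{scalar operations per training case.}
```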
47 For 10 classes and images of size 256 × 256, [sent-96, score-0.189]
48 the direct approach takes 2.8 × 10^15 scalar operations per image for each iteration of EM. [sent-97, score-0.101]
49 We now describe how a variational technique for decoupling transformations, and a fast Fourier transform method for computing posterior probabilities, can reduce the above number to (C + 2 log N)N^2 scalar operations. [sent-98, score-0.647]
50 For 10 classes and images of size 256 × 256, the new method takes 2,752,512 scalar operations per image for each iteration of EM. [sent-99, score-0.566]
51 3 Factorized variational technique To simplify the computation of the required posterior in (4), we use a variational approximation (Jordan et al. 1998). [sent-100, score-0.577]
52 As shown in Fig. 1b, our variational approximation is a completely factorized approximation to the true posterior p(c, z0, T1, z1, . . . , TK | zK): q(c, z0, T1, . . . , TK) = q(c) ∏k q(zk) ∏k q(Tk). (5) [sent-103, score-0.296]
53 The q-distributions are parameterized and these variational parameters are varied to make the approximation a good one. [sent-110, score-0.23]
54 The variational parameters are q(c) = ρc , q(Tk ) = γk,Tk , q(zk ) = N (zk ; η k , Ωk ). [sent-112, score-0.23]
55 The generalized EM algorithm (Neal and Hinton 1998) maximizes a lower bound B on the log-likelihood of the observed image zK: B = Σc ΣT1,...,TK ∫ q(c, z0, T1, . . . , TK) ln[ p(c, z0, T1, z1, . . . , TK, zK) / q(c, z0, T1, . . . , TK) ] dz0 · · · dzK−1 ≤ ln p(zK). [sent-113, score-0.157]
56 In the E step, the variational parameters are adjusted to maximize B and in the M step, the model parameters are adjusted to maximize B. [sent-123, score-0.23]
57 Each time the ρc's are updated, they should be normalized, and similarly for the γk,Tk's. [sent-125, score-0.026]
58 One or more iterations of the above updates are applied for each training case and the variational parameters are stored for use in the M-step, and as the initial conditions for the next E-step. [sent-126, score-0.27]
59 The derivatives of B with respect to the model parameters produce the following M-step updates: πc ← ⟨ρc⟩, µc ← ⟨ρc η0⟩/⟨ρc⟩, Φc ← ⟨ρc (Ω0 + diag((η0 − µc)(η0 − µc)′))⟩/⟨ρc⟩, (10) where ⟨·⟩ indicates an average over the training set. [sent-127, score-0.03]
60 This factorized variational inference technique is quite greedy, since at each step, the method approximates the posterior with one Gaussian. [sent-128, score-0.413]
61 4 Inference using fast Fourier transforms The M-step updates described above take very few computations, but the E-step updates can be computationally burdensome. [sent-130, score-0.169]
62 The dominant culprits are the computation of distances of the form dT = (g − Th)′(g − Th) (11) in (9), for all possible transformations T, and the computation of sums of the form ΣT γT Th (12) in (7) and (8). [sent-131, score-0.169]
63 Since the variational approximation is more accurate when the transformations are broken into fewer steps, it is a good idea to pack as many transformations into each step as possible. [sent-132, score-0.596]
64 In our experiments, x-y translations are applied in one step, and rotations are applied in another step. [sent-133, score-0.208]
65 However, the number of possible x-y translations in a 320 × 240 image is 76,800. [sent-134, score-0.191]
66 So, 76,800 dT ’s must be computed and the computation of each dT uses a vector norm of size 76,800. [sent-135, score-0.036]
67 It turns out that if the data is defined on a coordinate system where the effect of a transformation is a shift, the above quantities can be computed very quickly using fast Fourier transforms (FFTs). [sent-136, score-0.336]
68 For images measured on rectangular grids, an x-y translation corresponds to a shift in the coordinate system. [sent-137, score-0.298]
69 For images measured on a radial grid, such as the one shown in Fig. 1c, [sent-138, score-0.173]
70 a scale and rotation correspond to a shift in the coordinate system (Wolberg and Zokai 2000). [sent-139, score-0.226]
71 When updating the variational parameters, it is straightforward to convert them to the appropriate coordinate system, apply the FFT method and convert them back. [sent-140, score-0.34]
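For concreteness, the sketch below resamples an image onto a log-polar grid so that rotation and scaling become shifts, in the spirit of Fig. 1c and (Wolberg and Zokai 2000). It assumes SciPy is available; the function name to_log_polar and the grid sizes n_r, n_theta are illustrative choices, and the paper's actual resampler may differ.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def to_log_polar(img, n_r=128, n_theta=360):
    """Resample an image onto a (log-radius, angle) grid, so that scalings and
    rotations about the image centre become shifts along the two axes.
    An illustrative sketch only, not the paper's implementation."""
    cy, cx = (np.asarray(img.shape, dtype=float) - 1.0) / 2.0
    max_r = min(cy, cx)
    # Log-spaced radii (1 .. max_r) and uniformly spaced angles (0 .. 2*pi).
    r = max_r ** (np.arange(n_r) / (n_r - 1.0))
    theta = 2.0 * np.pi * np.arange(n_theta) / n_theta
    R, TH = np.meshgrid(r, theta, indexing="ij")
    ys = cy + R * np.sin(TH)
    xs = cx + R * np.cos(TH)
    # Bilinear interpolation at the warped sample positions.
    return map_coordinates(img, [ys, xs], order=1, mode="nearest")
```

On this grid a rotation of the original image is (approximately) a cyclic shift along the angle axis and a uniform scaling is a shift along the log-radius axis, so the same FFT machinery used for translations applies.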
72 The image is measured on a discrete grid and x is the x-y coordinate of a pixel in the image (x is a 2-vector). [sent-142, score-0.466]
73 The images g and h in (11) and (12) are written as functions of x: g(x), h(x). [sent-143, score-0.116]
74 In this representation, T is an integer 2-vector, corresponding to a shift in x. [sent-144, score-0.071]
75 The common form is the correlation f(T) = Σx g(x)h(x + T). (15) For an N × N grid, computing the correlation directly for all T takes N^4 scalar operations. [sent-146, score-0.15]
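To see why both (11) and (12) reduce to this correlation form, one can expand the distance under the cyclic-shift convention (Th)(x) = h(x + T); the expansion below is a clarifying step added here, not text from the paper:

```latex
d_T \;=\; \sum_x \big(g(x) - h(x+T)\big)^2
    \;=\; \sum_x g(x)^2 \;-\; 2\underbrace{\sum_x g(x)\,h(x+T)}_{f(T)\ \text{in (15)}} \;+\; \sum_x h(x)^2 .
```

The first and last sums do not depend on T (the last because a cyclic shift merely permutes pixels), so all N^2 distances dT need only the single correlation f(T). Likewise, the quantity in (12) becomes ΣT γT h(x + T) as a function of x, i.e. a correlation between the weights γ and the image h.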
76 The FFT can be used to compute the correlation in N^2 log N time. [sent-147, score-0.041]
77 The FFTs G(ω) and H(ω) of g and h are computed in N^2 log N time. [sent-148, score-0.077]
78 Then, the FFT F(ω) of f is computed in N^2 time as follows: F(ω) = G(ω)∗ H(ω), (16) where “∗” denotes complex conjugate. [sent-149, score-0.036]
79 Then the inverse FFT f(T) of F(ω) is computed in N^2 log N time. [sent-150, score-0.077]
80 Using this method, the posterior and sufficient statistics for all N^2 shifts in an N × N grid can be computed in N^2 log N time. [sent-151, score-0.272]
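A minimal NumPy sketch of this procedure, mirroring (16): three FFTs give the correlation for all N^2 cyclic shifts, and from it every distance dT at once. The function name, the toy images and the noise-variance symbol psi are illustrative assumptions, not code from the paper (whose implementation was in MATLAB).

```python
import numpy as np

def all_shift_distances(g, h):
    """d_T = sum_x (g(x) - h(x + T))^2 for every cyclic shift T of an N x N image,
    computed with FFTs: form F(w) = conj(G(w)) * H(w), then take an inverse FFT."""
    gg = np.sum(g ** 2)  # constant in T
    hh = np.sum(h ** 2)  # constant in T (a cyclic shift only permutes pixels)
    G = np.fft.fft2(g)
    H = np.fft.fft2(h)
    corr = np.real(np.fft.ifft2(np.conj(G) * H))  # corr[T] = sum_x g(x) h(x + T)
    return gg + hh - 2.0 * corr  # N x N array of distances, indexed by the shift T

# Toy check: hide a shifted, noisy copy of g in h and recover the shift by
# minimizing d_T (equivalently, maximizing a shift posterior proportional to
# exp(-d_T / (2 * psi)) under isotropic noise with variance psi).
rng = np.random.default_rng(0)
g = rng.standard_normal((64, 64))
h = np.roll(g, shift=(5, -3), axis=(0, 1)) + 0.1 * rng.standard_normal((64, 64))
d = all_shift_distances(g, h)
row, col = np.unravel_index(np.argmin(d), d.shape)
print(int(row), int(col))  # should print the applied shift: 5 61, i.e. (5, -3) mod 64
```

In the full algorithm the same three-FFT pattern evaluates (11) and (12) for all translations at once and, after the coordinate change of Fig. 1c, for all rotations and scales.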
81 Using this method along with the variational technique, C classes, N scales, N rotations, N x-translations and N y-translations can be accounted for using (C + 2 log N)N^2 scalar operations. [sent-152, score-0.372]
82 5 Results In order to compare our new learning algorithm with the previously published result, we repeated the experiment on clustering head poses in 200 44x28 frames. [sent-153, score-0.068]
83 We achieved essentially the same result, but in only 10 seconds, as opposed to the 40 minutes the original algorithm needed to complete the task. [sent-154, score-0.153]
84 It should be noted that the original algorithm tested only for 9 vertical and 9 horizontal shifts (81 combinations), while the new algorithm dealt with all 1232 possible discrete shifts. [sent-156, score-0.174]
85 This makes the new algorithm 600 times faster on low resolution data. [sent-157, score-0.072]
86 The speed-up is even more drastic at higher resolutions, and when rotations and scales are added, since the complexity of the original algorithm is CN^6, where C is the number of classes and the images have N × N pixels. [sent-158, score-0.278]
87 The speed-up promised in the abstract is based on our computations, but obviously we were not able to run the original algorithm on full 320 × 240 resolution data. [sent-159, score-0.098]
88 In Fig. 2 we show the results of training an ordinary Gaussian model, a shift-invariant model and finally the scale-, rotation- and shift-invariant model on the sequence. [sent-162, score-0.231]
89 We also show three frames from the sequence stabilized using the variational inference. [sent-163, score-0.283]
90 6 Conclusions We described how a tremendous speed-up in training a transformation-invariant generative model can be achieved through the use of a variational technique for decoupling transformations, and a fast Fourier transform method for computing posterior probabilities. [sent-164, score-0.598]
91 For N × N images, learning C clusters under N rotations, N scales, N x-translations and N y-translations takes only (C + 2 log N)N^2 scalar operations per iteration. [sent-165, score-0.346]
92 In contrast, the original algorithm takes CN^6 operations to account for these transformations. [sent-166, score-0.185]
93 In this way, we were able to reduce the computation to only seconds per frame for images at 320 × 240 resolution using a simple MATLAB implementation. [sent-167, score-0.266]
94 This opens the door for generative models of pixel intensities in video to be efficiently used for transformation-invariant video summary and search. [sent-168, score-0.423]
95 As opposed to most techniques used in computer vision today, the generative modeling approach provides a likelihood model useful for search and retrieval, automatic clustering of the data, and extensibility through the addition of new hidden variables. [sent-169, score-0.17]
96 Figure 2: Learning a rotation-, scale- and translation-invariant model on 320 × 240 video. [sent-191, score-0.251]
97 Jojic et al. 2000. Transformed hidden Markov models: Estimating mixture models of images and inferring spatial transformations in video sequences. [sent-205, score-0.489]
98 Neal and Hinton 1998. A view of the EM algorithm that justifies incremental, sparse, and other variants. [sent-232, score-0.026]
wordName wordTfidf (topN-words)
[('tk', 0.572), ('zk', 0.47), ('variational', 0.23), ('transformations', 0.169), ('transformation', 0.159), ('video', 0.154), ('rotations', 0.148), ('jojic', 0.138), ('image', 0.131), ('images', 0.116), ('scalar', 0.101), ('transformed', 0.088), ('frey', 0.087), ('fourier', 0.087), ('fft', 0.086), ('operations', 0.084), ('cn', 0.079), ('rotation', 0.077), ('lecun', 0.073), ('posterior', 0.072), ('shift', 0.071), ('factorized', 0.066), ('decoupling', 0.065), ('shifts', 0.063), ('fast', 0.061), ('em', 0.061), ('grid', 0.06), ('translations', 0.06), ('spectrograms', 0.059), ('scales', 0.054), ('frames', 0.053), ('coordinate', 0.052), ('lippman', 0.05), ('vasconcelos', 0.05), ('zokai', 0.05), ('generative', 0.05), ('takes', 0.049), ('hinton', 0.046), ('resolution', 0.046), ('seconds', 0.046), ('technique', 0.045), ('jordan', 0.045), ('ln', 0.043), ('ffts', 0.043), ('revow', 0.043), ('tremendous', 0.043), ('wolberg', 0.043), ('clustering', 0.042), ('log', 0.041), ('updates', 0.04), ('jointly', 0.038), ('clusters', 0.037), ('analyzers', 0.037), ('denker', 0.037), ('norwell', 0.037), ('resolutions', 0.037), ('simard', 0.037), ('computed', 0.036), ('invariant', 0.036), ('translation', 0.035), ('pixel', 0.035), ('per', 0.034), ('approximations', 0.034), ('dt', 0.033), ('radial', 0.033), ('discrete', 0.033), ('transform', 0.032), ('matrix', 0.031), ('derivatives', 0.03), ('intensities', 0.03), ('picking', 0.03), ('latent', 0.03), ('neal', 0.029), ('convert', 0.029), ('dayan', 0.028), ('minutes', 0.028), ('transforms', 0.028), ('step', 0.028), ('noise', 0.028), ('iteration', 0.027), ('su', 0.027), ('opposed', 0.027), ('graphical', 0.027), ('scale', 0.026), ('algorithm', 0.026), ('hidden', 0.026), ('original', 0.026), ('pattern', 0.026), ('normalized', 0.026), ('mixtures', 0.026), ('matlab', 0.026), ('kluwer', 0.025), ('publishers', 0.025), ('vision', 0.025), ('measured', 0.024), ('classes', 0.024), ('frame', 0.024), ('mixture', 0.024), ('motion', 0.024), ('modeled', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 75 nips-2001-Fast, Large-Scale Transformation-Invariant Clustering
Author: Brendan J. Frey, Nebojsa Jojic
Abstract: In previous work on “transformed mixtures of Gaussians” and “transformed hidden Markov models”, we showed how the EM algorithm in a discrete latent variable model can be used to jointly normalize data (e.g., center images, pitch-normalize spectrograms) and learn a mixture model of the normalized data. The only input to the algorithm is the data, a list of possible transformations, and the number of clusters to find. The main criticism of this work was that the exhaustive computation of the posterior probabilities over transformations would make scaling up to large feature vectors and large sets of transformations intractable. Here, we describe how a tremendous speed-up is achieved through the use of a variational technique for decoupling transformations, and a fast Fourier transform method for computing posterior probabilities. For N × N images, learning C clusters under N rotations, N scales, N x-translations and N y-translations takes only (C + 2 log N)N^2 scalar operations per iteration. In contrast, the original algorithm takes CN^6 operations to account for these transformations. We give results on learning a 4-component mixture model from a video sequence with frames of size 320 × 240. The model accounts for 360 rotations and 76,800 translations. Each iteration of EM takes only 10 seconds per frame in MATLAB, which is over 5 million times faster than the original algorithm.
2 0.18520324 153 nips-2001-Product Analysis: Learning to Model Observations as Products of Hidden Variables
Author: Brendan J. Frey, Anitha Kannan, Nebojsa Jojic
Abstract: Factor analysis and principal components analysis can be used to model linear relationships between observed variables and linearly map high-dimensional data to a lower-dimensional hidden space. In factor analysis, the observations are modeled as a linear combination of normally distributed hidden variables. We describe a nonlinear generalization of factor analysis, called “product analysis”.
3 0.14944419 179 nips-2001-Tempo tracking and rhythm quantization by sequential Monte Carlo
Author: Ali Taylan Cemgil, Bert Kappen
Abstract: We present a probabilistic generative model for timing deviations in expressive music performance. The structure of the proposed model is equivalent to a switching state space model. We formulate two well known music recognition problems, namely tempo tracking and automatic transcription (rhythm quantization) as filtering and maximum a posteriori (MAP) state estimation tasks. The inferences are carried out using sequential Monte Carlo integration (particle filtering) techniques. For this purpose, we have derived a novel Viterbi algorithm for Rao-Blackwellized particle filters, where a subset of the hidden variables is integrated out. The resulting model is suitable for realtime tempo tracking and transcription and hence useful in a number of music applications such as adaptive automatic accompaniment and score typesetting.
4 0.13615569 84 nips-2001-Global Coordination of Local Linear Models
Author: Sam T. Roweis, Lawrence K. Saul, Geoffrey E. Hinton
Abstract: High dimensional data that lies on or near a low dimensional manifold can be described by a collection of local linear models. Such a description, however, does not provide a global parameterization of the manifold—arguably an important goal of unsupervised learning. In this paper, we show how to learn a collection of local linear models that solves this more difficult problem. Our local linear models are represented by a mixture of factor analyzers, and the “global coordination” of these models is achieved by adding a regularizing term to the standard maximum likelihood objective function. The regularizer breaks a degeneracy in the mixture model’s parameter space, favoring models whose internal coordinate systems are aligned in a consistent way. As a result, the internal coordinates change smoothly and continuously as one traverses a connected path on the manifold—even when the path crosses the domains of many different local models. The regularizer takes the form of a Kullback-Leibler divergence and illustrates an unexpected application of variational methods: not to perform approximate inference in intractable probabilistic models, but to learn more useful internal representations in tractable ones. 1 Manifold Learning Consider an ensemble of images, each of which contains a face against a neutral background. Each image can be represented by a point in the high dimensional vector space of pixel intensities. This representation, however, does not exploit the strong correlations between pixels of the same image, nor does it support many useful operations for reasoning about faces. If, for example, we select two images with faces in widely different locations and then average their pixel intensities, we do not obtain an image of a face at their average location. Images of faces lie on or near a low-dimensional, curved manifold, and we can represent them more usefully by the coordinates on this manifold than by pixel intensities. Using these “intrinsic coordinates”, the average of two faces is another face with the average of their locations, poses and expressions. To analyze and manipulate faces, it is helpful to imagine a “magic black box” with levers or dials corresponding to the intrinsic coordinates on this manifold. Given a setting of the levers and dials, the box generates an image of a face. Given an image of a face, the box deduces the appropriate setting of the levers and dials. In this paper, we describe a fairly general way to construct such a box automatically from an ensemble of high-dimensional vectors. We assume only that there exists an underlying manifold of low dimensionality and that the relationship between the raw data and the manifold coordinates is locally linear and smoothly varying. Thus our method applies not only to images of faces, but also to many other forms of highly distributed perceptual and scientific data (e.g., spectrograms of speech, robotic sensors, gene expression arrays, document collections). 2 Local Linear Models The global structure of perceptual manifolds (such as images of faces) tends to be highly nonlinear. Fortunately, despite their complicated global structure, we can usually characterize these manifolds as locally linear. Thus, to a good approximation, they can be represented by collections of simpler models, each of which describes a locally linear neighborhood[3, 6, 8]. For unsupervised learning tasks, a probabilistic model that nicely captures this intuition is a mixture of factor analyzers (MFA)[5]. 
The model is used to describe high dimensional data that lies on or near a lower dimensional manifold. MFAs parameterize a joint distribution over observed and hidden variables, where the observed variable represents the high dimensional data, the discrete hidden variable indexes different neighborhoods on the manifold, and the continuous hidden variables represent low dimensional local coordinates. The model assumes that data is sampled from different neighborhoods on the manifold with given prior probabilities, and that within each neighborhood, the data's local coordinates are normally distributed.
5 0.12131529 21 nips-2001-A Variational Approach to Learning Curves
Author: Dörthe Malzahn, Manfred Opper
Abstract: We combine the replica approach from statistical physics with a variational approach to analyze learning curves analytically. We apply the method to Gaussian process regression. As a main result we derive approximative relations between empirical error measures, the generalization error and the posterior variance.
6 0.097595379 19 nips-2001-A Rotation and Translation Invariant Discrete Saliency Network
7 0.096251577 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines
8 0.092619538 39 nips-2001-Audio-Visual Sound Separation Via Hidden Markov Models
9 0.085068293 191 nips-2001-Transform-invariant Image Decomposition with Similarity Templates
10 0.084321693 129 nips-2001-Multiplicative Updates for Classification by Mixture Models
11 0.078816295 58 nips-2001-Covariance Kernels from Bayesian Generative Models
12 0.072589159 4 nips-2001-ALGONQUIN - Learning Dynamic Noise Models From Noisy Speech for Robust Speech Recognition
13 0.072116867 46 nips-2001-Categorization by Learning and Combining Object Parts
14 0.069090091 79 nips-2001-Gaussian Process Regression with Mismatched Models
15 0.068217427 162 nips-2001-Relative Density Nets: A New Way to Combine Backpropagation with HMM's
16 0.065512799 196 nips-2001-Very loopy belief propagation for unwrapping phase images
17 0.064072601 107 nips-2001-Latent Dirichlet Allocation
18 0.063489065 189 nips-2001-The g Factor: Relating Distributions on Features to Distributions on Images
19 0.058867477 164 nips-2001-Sampling Techniques for Kernel Methods
20 0.058395065 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines
topicId topicWeight
[(0, -0.181), (1, 0.017), (2, -0.059), (3, -0.108), (4, -0.144), (5, -0.03), (6, -0.017), (7, -0.017), (8, 0.077), (9, -0.029), (10, 0.08), (11, 0.044), (12, 0.181), (13, -0.139), (14, -0.069), (15, 0.085), (16, -0.052), (17, -0.009), (18, -0.094), (19, 0.016), (20, -0.073), (21, 0.035), (22, 0.077), (23, -0.105), (24, -0.01), (25, -0.087), (26, 0.19), (27, -0.007), (28, -0.274), (29, -0.051), (30, 0.103), (31, -0.1), (32, -0.134), (33, 0.107), (34, 0.04), (35, -0.134), (36, -0.033), (37, -0.058), (38, 0.05), (39, -0.128), (40, 0.082), (41, 0.117), (42, 0.036), (43, -0.141), (44, -0.07), (45, 0.009), (46, 0.055), (47, -0.044), (48, -0.089), (49, 0.005)]
simIndex simValue paperId paperTitle
same-paper 1 0.96565288 75 nips-2001-Fast, Large-Scale Transformation-Invariant Clustering
Author: Brendan J. Frey, Nebojsa Jojic
Abstract: In previous work on “transformed mixtures of Gaussians” and “transformed hidden Markov models”, we showed how the EM algorithm in a discrete latent variable model can be used to jointly normalize data (e.g., center images, pitch-normalize spectrograms) and learn a mixture model of the normalized data. The only input to the algorithm is the data, a list of possible transformations, and the number of clusters to find. The main criticism of this work was that the exhaustive computation of the posterior probabilities over transformations would make scaling up to large feature vectors and large sets of transformations intractable. Here, we describe how a tremendous speed-up is achieved through the use of a variational technique for decoupling transformations, and a fast Fourier transform method for computing posterior probabilities. For N × N images, learning C clusters under N rotations, N scales, N x-translations and N y-translations takes only (C + 2 log N)N^2 scalar operations per iteration. In contrast, the original algorithm takes CN^6 operations to account for these transformations. We give results on learning a 4-component mixture model from a video sequence with frames of size 320 × 240. The model accounts for 360 rotations and 76,800 translations. Each iteration of EM takes only 10 seconds per frame in MATLAB, which is over 5 million times faster than the original algorithm.
2 0.73952979 153 nips-2001-Product Analysis: Learning to Model Observations as Products of Hidden Variables
Author: Brendan J. Frey, Anitha Kannan, Nebojsa Jojic
Abstract: Factor analysis and principal components analysis can be used to model linear relationships between observed variables and linearly map high-dimensional data to a lower-dimensional hidden space. In factor analysis, the observations are modeled as a linear combination of normally distributed hidden variables. We describe a nonlinear generalization of factor analysis, called “product analysis”.
3 0.57231218 84 nips-2001-Global Coordination of Local Linear Models
Author: Sam T. Roweis, Lawrence K. Saul, Geoffrey E. Hinton
Abstract: High dimensional data that lies on or near a low dimensional manifold can be described by a collection of local linear models. Such a description, however, does not provide a global parameterization of the manifold—arguably an important goal of unsupervised learning. In this paper, we show how to learn a collection of local linear models that solves this more difficult problem. Our local linear models are represented by a mixture of factor analyzers, and the “global coordination” of these models is achieved by adding a regularizing term to the standard maximum likelihood objective function. The regularizer breaks a degeneracy in the mixture model’s parameter space, favoring models whose internal coordinate systems are aligned in a consistent way. As a result, the internal coordinates change smoothly and continuously as one traverses a connected path on the manifold—even when the path crosses the domains of many different local models. The regularizer takes the form of a Kullback-Leibler divergence and illustrates an unexpected application of variational methods: not to perform approximate inference in intractable probabilistic models, but to learn more useful internal representations in tractable ones. 1 Manifold Learning Consider an ensemble of images, each of which contains a face against a neutral background. Each image can be represented by a point in the high dimensional vector space of pixel intensities. This representation, however, does not exploit the strong correlations between pixels of the same image, nor does it support many useful operations for reasoning about faces. If, for example, we select two images with faces in widely different locations and then average their pixel intensities, we do not obtain an image of a face at their average location. Images of faces lie on or near a low-dimensional, curved manifold, and we can represent them more usefully by the coordinates on this manifold than by pixel intensities. Using these “intrinsic coordinates”, the average of two faces is another face with the average of their locations, poses and expressions. To analyze and manipulate faces, it is helpful to imagine a “magic black box” with levers or dials corresponding to the intrinsic coordinates on this manifold. Given a setting of the levers and dials, the box generates an image of a face. Given an image of a face, the box deduces the appropriate setting of the levers and dials. In this paper, we describe a fairly general way to construct such a box automatically from an ensemble of high-dimensional vectors. We assume only that there exists an underlying manifold of low dimensionality and that the relationship between the raw data and the manifold coordinates is locally linear and smoothly varying. Thus our method applies not only to images of faces, but also to many other forms of highly distributed perceptual and scientific data (e.g., spectrograms of speech, robotic sensors, gene expression arrays, document collections). 2 Local Linear Models The global structure of perceptual manifolds (such as images of faces) tends to be highly nonlinear. Fortunately, despite their complicated global structure, we can usually characterize these manifolds as locally linear. Thus, to a good approximation, they can be represented by collections of simpler models, each of which describes a locally linear neighborhood[3, 6, 8]. For unsupervised learning tasks, a probabilistic model that nicely captures this intuition is a mixture of factor analyzers (MFA)[5]. 
The model is used to describe high dimensional data that lies on or near a lower dimensional manifold. MFAs parameterize a joint distribution over observed and hidden variables, where the observed variable represents the high dimensional data, the discrete hidden variable indexes different neighborhoods on the manifold, and the continuous hidden variables represent low dimensional local coordinates. The model assumes that data is sampled from different neighborhoods on the manifold with given prior probabilities, and that within each neighborhood, the data's local coordinates are normally distributed.
4 0.49775094 179 nips-2001-Tempo tracking and rhythm quantization by sequential Monte Carlo
Author: Ali Taylan Cemgil, Bert Kappen
Abstract: We present a probabilistic generative model for timing deviations in expressive music performance. The structure of the proposed model is equivalent to a switching state space model. We formulate two well known music recognition problems, namely tempo tracking and automatic transcription (rhythm quantization) as filtering and maximum a posteriori (MAP) state estimation tasks. The inferences are carried out using sequential Monte Carlo integration (particle filtering) techniques. For this purpose, we have derived a novel Viterbi algorithm for Rao-Blackwellized particle filters, where a subset of the hidden variables is integrated out. The resulting model is suitable for realtime tempo tracking and transcription and hence useful in a number of music applications such as adaptive automatic accompaniment and score typesetting.
5 0.45419282 191 nips-2001-Transform-invariant Image Decomposition with Similarity Templates
Author: Chris Stauffer, Erik Miller, Kinh Tieu
Abstract: Recent work has shown impressive transform-invariant modeling and clustering for sets of images of objects with similar appearance. We seek to expand these capabilities to sets of images of an object class that show considerable variation across individual instances (e.g. pedestrian images) using a representation based on pixel-wise similarities, similarity templates. Because of its invariance to the colors of particular components of an object, this representation enables detection of instances of an object class and enables alignment of those instances. Further, this model implicitly represents the regions of color regularity in the class-specific image set enabling a decomposition of that object class into component regions. 1
6 0.37655261 21 nips-2001-A Variational Approach to Learning Curves
7 0.36887005 158 nips-2001-Receptive field structure of flow detectors for heading perception
8 0.35501656 182 nips-2001-The Fidelity of Local Ordinal Encoding
9 0.34446159 108 nips-2001-Learning Body Pose via Specialized Maps
10 0.3420926 19 nips-2001-A Rotation and Translation Invariant Discrete Saliency Network
11 0.32614964 129 nips-2001-Multiplicative Updates for Classification by Mixture Models
12 0.32591495 6 nips-2001-A Bayesian Network for Real-Time Musical Accompaniment
13 0.32440099 79 nips-2001-Gaussian Process Regression with Mismatched Models
14 0.30680981 107 nips-2001-Latent Dirichlet Allocation
15 0.30436978 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines
16 0.28761277 189 nips-2001-The g Factor: Relating Distributions on Features to Distributions on Images
17 0.28169352 196 nips-2001-Very loopy belief propagation for unwrapping phase images
18 0.27281815 4 nips-2001-ALGONQUIN - Learning Dynamic Noise Models From Noisy Speech for Robust Speech Recognition
19 0.26501611 43 nips-2001-Bayesian time series classification
20 0.25629422 83 nips-2001-Geometrical Singularities in the Neuromanifold of Multilayer Perceptrons
topicId topicWeight
[(14, 0.017), (17, 0.031), (19, 0.028), (27, 0.127), (30, 0.07), (38, 0.054), (49, 0.024), (59, 0.038), (72, 0.072), (79, 0.028), (83, 0.024), (88, 0.278), (91, 0.129)]
simIndex simValue paperId paperTitle
1 0.89859903 45 nips-2001-Boosting and Maximum Likelihood for Exponential Models
Author: Guy Lebanon, John D. Lafferty
Abstract: We derive an equivalence between AdaBoost and the dual of a convex optimization problem, showing that the only difference between minimizing the exponential loss used by AdaBoost and maximum likelihood for exponential models is that the latter requires the model to be normalized to form a conditional probability distribution over labels. In addition to establishing a simple and easily understood connection between the two methods, this framework enables us to derive new regularization procedures for boosting that directly correspond to penalized maximum likelihood. Experiments on UCI datasets support our theoretical analysis and give additional insight into the relationship between boosting and logistic regression.
same-paper 2 0.85459983 75 nips-2001-Fast, Large-Scale Transformation-Invariant Clustering
Author: Brendan J. Frey, Nebojsa Jojic
Abstract: In previous work on “transformed mixtures of Gaussians” and “transformed hidden Markov models”, we showed how the EM algorithm in a discrete latent variable model can be used to jointly normalize data (e.g., center images, pitch-normalize spectrograms) and learn a mixture model of the normalized data. The only input to the algorithm is the data, a list of possible transformations, and the number of clusters to find. The main criticism of this work was that the exhaustive computation of the posterior probabilities over transformations would make scaling up to large feature vectors and large sets of transformations intractable. Here, we describe how a tremendous speed-up is achieved through the use of a variational technique for decoupling transformations, and a fast Fourier transform method for computing posterior probabilities. For N × N images, learning C clusters under N rotations, N scales, N x-translations and N y-translations takes only (C + 2 log N)N^2 scalar operations per iteration. In contrast, the original algorithm takes CN^6 operations to account for these transformations. We give results on learning a 4-component mixture model from a video sequence with frames of size 320 × 240. The model accounts for 360 rotations and 76,800 translations. Each iteration of EM takes only 10 seconds per frame in MATLAB, which is over 5 million times faster than the original algorithm.
3 0.76568037 190 nips-2001-Thin Junction Trees
Author: Francis R. Bach, Michael I. Jordan
Abstract: We present an algorithm that induces a class of models with thin junction trees—models that are characterized by an upper bound on the size of the maximal cliques of their triangulated graph. By ensuring that the junction tree is thin, inference in our models remains tractable throughout the learning process. This allows both an efficient implementation of an iterative scaling parameter estimation algorithm and also ensures that inference can be performed efficiently with the final model. We illustrate the approach with applications in handwritten digit recognition and DNA splice site detection.
4 0.68025887 98 nips-2001-Information Geometrical Framework for Analyzing Belief Propagation Decoder
Author: Shiro Ikeda, Toshiyuki Tanaka, Shun-ichi Amari
Abstract: The mystery of belief propagation (BP) decoder, especially of the turbo decoding, is studied from information geometrical viewpoint. The loopy belief network (BN) of turbo codes makes it difficult to obtain the true “belief” by BP, and the characteristics of the algorithm and its equilibrium are not clearly understood. Our study gives an intuitive understanding of the mechanism, and a new framework for the analysis. Based on the framework, we reveal basic properties of the turbo decoding.
5 0.62210363 153 nips-2001-Product Analysis: Learning to Model Observations as Products of Hidden Variables
Author: Brendan J. Frey, Anitha Kannan, Nebojsa Jojic
Abstract: Factor analysis and principal components analysis can be used to model linear relationships between observed variables and linearly map high-dimensional data to a lower-dimensional hidden space. In factor analysis, the observations are modeled as a linear combination of normally distributed hidden variables. We describe a nonlinear generalization of factor analysis, called “product analysis”.
6 0.61976618 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade
7 0.61400121 16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems
9 0.60672188 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
10 0.60369951 57 nips-2001-Correlation Codes in Neuronal Populations
11 0.60297143 27 nips-2001-Activity Driven Adaptive Stochastic Resonance
12 0.60110366 131 nips-2001-Neural Implementation of Bayesian Inference in Population Codes
13 0.60105067 13 nips-2001-A Natural Policy Gradient
14 0.59847409 150 nips-2001-Probabilistic Inference of Hand Motion from Neural Activity in Motor Cortex
15 0.59749711 84 nips-2001-Global Coordination of Local Linear Models
16 0.59621572 46 nips-2001-Categorization by Learning and Combining Object Parts
17 0.59618914 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding
18 0.5944683 95 nips-2001-Infinite Mixtures of Gaussian Process Experts
19 0.59357285 89 nips-2001-Grouping with Bias
20 0.59316343 121 nips-2001-Model-Free Least-Squares Policy Iteration