nips nips2004 nips2004-163 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sajama, Alon Orlitsky
Abstract: We present a semi-parametric latent variable model based technique for density modelling, dimensionality reduction and visualization. Unlike previous methods, we estimate the latent distribution non-parametrically which enables us to model data generated by an underlying low dimensional, multimodal distribution. In addition, we allow the components of latent variable models to be drawn from the exponential family which makes the method suitable for special data types, for example binary or count data. Simulations on real valued, binary and count data show favorable comparison to other related schemes both in terms of separating different populations and generalization to unseen samples. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Semi-parametric exponential family PCA Sajama Alon Orlitsky Department of Electrical and Computer Engineering University of California at San Diego, La Jolla, CA 92093 sajama@ucsd.edu [sent-1, score-0.327]
2 Abstract We present a semi-parametric latent variable model based technique for density modelling, dimensionality reduction and visualization. [sent-4, score-0.721]
3 Unlike previous methods, we estimate the latent distribution non-parametrically which enables us to model data generated by an underlying low dimensional, multimodal distribution. [sent-5, score-0.57]
4 In addition, we allow the components of latent variable models to be drawn from the exponential family which makes the method suitable for special data types, for example binary or count data. [sent-6, score-1.069]
5 Simulations on real valued, binary and count data show favorable comparison to other related schemes both in terms of separating different populations and generalization to unseen samples. [sent-7, score-0.271]
6 1 Introduction Principal component analysis (PCA) is widely used for dimensionality reduction with applications ranging from pattern recognition and time series prediction to visualization. [sent-8, score-0.172]
7 A latent variable model assumes that the distribution of data is determined by a latent or mixing distribution P (θ) and a conditional or component distribution P (x|θ), i.e. [sent-12, score-1.266]
8 A key feature of this probabilistic model is that the latent distribution P (θ) is also assumed to be Gaussian since it leads to simple and fast model estimation, i.e. [sent-16, score-0.59]
9 the density of x is approximated by a Gaussian distribution whose covariance matrix is aligned along a lower dimensional subspace. [sent-18, score-0.319]
10 A mixture model with Gaussian latent distribution would not be able to capture this information. [sent-21, score-0.643]
11 The projection obtained using a Gaussian latent distribution tends to be skewed toward the center [1] and hence the distinction between nearby sub-populations may be lost in the visualization space. [sent-22, score-0.61]
12 For these reasons, it is important not to make restrictive assumptions about the latent distribution. [sent-23, score-0.442]
13 Several recently proposed dimension reduction methods can, like PPCA, be thought of as special cases of latent variable modelling which differ in the specific assumptions they make about the latent and conditional distributions. [sent-24, score-1.05]
14 We present an alternative probabilistic formulation, called semi-parametric PCA (SPPCA), where no assumptions are made about the distribution of the latent random variable θ. [sent-25, score-0.596]
15 Non-parametric latent distribution estimation allows us to approximate data density better than previous schemes and hence gives better low dimensional representations. [sent-26, score-0.908]
16 In particular, multi-modality of the high dimensional density is better preserved in the projected space. [sent-27, score-0.292]
17 To make our method suitable for special data types, we allow the conditional distribution P (x|θ) to be any member of the exponential family of distributions. [sent-29, score-0.516]
18 Use of exponential family distributions for P (x|θ) is common in statistics where it is known as latent trait analysis and they have also been used in several recently proposed dimensionality reduction schemes [3, 4]. [sent-30, score-1.029]
19 We use Lindsay’s non-parametric maximum likelihood estimation theorem to reduce the estimation problem to one with a large enough discrete prior. [sent-31, score-0.204]
20 It turns out that this choice gives us a prior which is ‘conjugate’ to all exponential family distributions, allowing us to give a unified algorithm for all data types. [sent-32, score-0.365]
21 This choice also makes it possible to efficiently estimate the model even in the case when different components of the data vector are of different types. [sent-33, score-0.173]
22 2 The constrained mixture model We assume that the d-dimensional observation vectors x1 , . . . , xn [sent-34, score-0.225]
23 are outcomes of iid draws of a random variable whose distribution P (x) = ∫ P (θ)P (x|θ) dθ is determined by the latent distribution P (θ) and the conditional distribution P (x|θ). [sent-37, score-0.716]
24 This can also be viewed as a mixture density with P (θ) being the mixing distribution, the mixture components labelled by θ and P (x|θ) being the component distribution corresponding to θ. [sent-38, score-0.669]
25 The latent distribution is used to model the interdependencies among the components of x and the conditional distribution to model ‘noise’. [sent-39, score-0.732]
26 For example in the case of a collection of documents we can think of the ‘content’ of the document as a latent variable since it cannot be measured. [sent-40, score-0.531]
27 Conditional distribution P (x|θ): We assume that P (θ) adequately models the dependencies among the components of x and hence that the components of x are independent when conditioned upon θ, i.e. [sent-42, score-0.299]
28 As noted in the introduction, using Gaussian means and constraining them to a lower dimensional subspace of the data space is equivalent to using Euclidean distance as a measure of similarity. [sent-45, score-0.191]
29 This Gaussian model may not be appropriate for other data types, for instance the Bernoulli distribution may be better for binary data and Poisson for integer data. [sent-46, score-0.217]
30 These three distributions, along with several others, belong to a family of distributions known as the exponential family [5]. [sent-47, score-0.641]
31 Any member of this family can be written in the form log P (x|θ) = log P0 (x) + xθ − G(θ) where θ is called the natural parameter and G(θ) is a function that ensures that the probabilities sum to one. [sent-48, score-0.278]
32 An important property of this family is that the mean µ of a distribution and its natural parameter θ are related through a monotone invertible, nonlinear function µ = G′(θ) = g(θ). [sent-49, score-0.334]
33 It can be shown that the negative log-likelihoods of exponential family distributions can be written as Bregman distances (ignoring constants) which are a family of generalized metrics associated with convex functions [4]. [sent-50, score-0.556]
34 Note that by using different distributions for the various components of x, we can model mixed data types. [sent-51, score-0.238]
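To make the exponential-family machinery concrete, here is a small plain-Python sketch (ours, not from the paper) of the log-partition function G(θ), the mean link g(θ) = G′(θ) and the resulting log-likelihood for the three members named in the text: a unit-variance Gaussian, the Bernoulli and the Poisson. The unit-variance choice for the Gaussian is our simplification.

    import math

    # Log-partition G(theta), mean link g(theta) = G'(theta), and carrier term
    # log P0(x) for three exponential-family members: unit-variance Gaussian,
    # Bernoulli, and Poisson.
    FAMILIES = {
        "gaussian": {                                   # x real-valued
            "G": lambda t: 0.5 * t * t,
            "g": lambda t: t,
            "logP0": lambda x: -0.5 * (x * x + math.log(2.0 * math.pi)),
        },
        "bernoulli": {                                  # x in {0, 1}
            "G": lambda t: math.log1p(math.exp(t)),
            "g": lambda t: 1.0 / (1.0 + math.exp(-t)),
            "logP0": lambda x: 0.0,
        },
        "poisson": {                                    # x in {0, 1, 2, ...}
            "G": lambda t: math.exp(t),
            "g": lambda t: math.exp(t),
            "logP0": lambda x: -math.lgamma(x + 1.0),
        },
    }

    def log_lik(x, theta, family):
        """log P(x | theta) = log P0(x) + x * theta - G(theta)."""
        f = FAMILIES[family]
        return f["logP0"](x) + x * theta - f["G"](theta)

    # Example: a Poisson count of 3 under natural parameter theta = log(2.5)
    print(log_lik(3, math.log(2.5), "poisson"))

Minimizing the negative of log_lik over θ is, up to terms that do not depend on θ, the same as minimizing the corresponding Bregman distance mentioned above.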
35 Latent distribution P (θ): Like previous latent variable methods, including PCA, we constrain the latent variable θ to an ℓ-dimensional Euclidean subspace of R^d to model the belief that the intrinsic dimensionality of the data is smaller than d. [sent-52, score-1.177]
36 Hence any distribution PΘ (θ) satisfying the low dimensional constraints can be represented using a triple (P (a), V, b), where P (a) is a distribution over R^ℓ. [sent-54, score-0.26]
37 Lindsay’s mixture nonparametric maximum likelihood estimation (NPMLE) theorem states that for fixed (V, b), the maximum likelihood (ML) estimate of P (a) exists and is a discrete distribution with no more than n distinct points of support [6]. [sent-55, score-0.389]
38 Hence if ML is the chosen parameter estimation technique, the SP-PCA model can be assumed (without loss of generality) to be a constrained finite mixture model with at most n mixture components. [sent-56, score-0.54]
39 The number of mixture components in the model, n, grows with the amount of data and we propose to use pruning to reduce the number of components during model estimation to help both in computational speed and model generalization. [sent-57, score-0.59]
40 Finally, we note that instead of the natural parameter, any of its invertible transformations could have been constrained to a lower dimensional space. [sent-58, score-0.203]
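As an illustration of the resulting constrained mixture, a discrete latent distribution with support points a_k and weights π_k mapped into natural-parameter space through the affine map θ_k = a_k V + b used later in Section 3, the sketch below (ours; all names are hypothetical) evaluates the model log-density for the Bernoulli case, with the components of x conditionally independent given θ as assumed earlier.

    import numpy as np

    def logsumexp_rows(M):
        """Stable log-sum-exp over the rows of a 2-D array."""
        m = M.max(axis=1, keepdims=True)
        return (m + np.log(np.exp(M - m).sum(axis=1, keepdims=True))).ravel()

    def sppca_log_density(X, A, V, b, pi):
        """Log-density under the constrained Bernoulli mixture.
        X : (n, d) binary observations     A  : (c, l) latent support points a_k
        V : (l, d) subspace basis          b  : (d,) offset      pi : (c,) weights
        Returns log P(x_i) = log sum_k pi_k * prod_j P(x_ij | theta_kj)."""
        Theta = A @ V + b                                    # (c, d) natural parameters
        # Bernoulli: log P(x | theta) = x * theta - log(1 + exp(theta)), summed over j
        log_px_k = X @ Theta.T - np.logaddexp(0.0, Theta).sum(axis=1)   # (n, c)
        return logsumexp_rows(np.log(pi) + log_px_k)

    # Example with random (hypothetical) parameters: 5 components on a 2-D subspace
    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(10, 20)).astype(float)
    A, V, b = rng.normal(size=(5, 2)), rng.normal(size=(2, 20)), rng.normal(size=20)
    print(sppca_log_density(X, A, V, b, np.full(5, 0.2)))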
41 Low dimensional representation: There are several ways in which low dimensional representations can be obtained using the constrained mixture model. [sent-60, score-0.444]
42 This representation has been used in other latent variable methods to get meaningful low dimensional views [1, 3]. [sent-65, score-0.613]
43 This representation is a generalization of the standard Euclidean projection and was used in [4]. [sent-67, score-0.172]
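One of the representations referred to above, the posterior mean of the latent coordinates E[a | x] = Σ_k p(k | x) a_k (the posterior mean is mentioned again below), can be computed as in the following minimal sketch; the Bernoulli parameterization and all names are ours.

    import numpy as np

    def posterior_mean_projection(X, A, V, b, pi):
        """Project each row of X to E[a | x] = sum_k p(k | x) a_k under the
        constrained Bernoulli mixture; returns an (n, l) array of coordinates."""
        Theta = A @ V + b                                            # (c, d)
        log_px_k = X @ Theta.T - np.logaddexp(0.0, Theta).sum(axis=1)   # (n, c)
        log_post = np.log(pi) + log_px_k
        log_post -= log_post.max(axis=1, keepdims=True)              # stabilise
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)                      # p(k | x_i)
        return post @ A                                              # (n, l)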
44 The Gaussian case: When the exponential family distribution chosen is Gaussian, the model is a mixture of n spherical Gaussians all of whose means lie on a hyperplane in the data space. [sent-68, score-0.644]
45 That is, the Gaussian case of SP-PCA is related to PCA in the same manner as a Gaussian mixture model is related to K-means. [sent-71, score-0.184]
46 The use of an arbitrary mixing distribution over the plane allows us to approximate an arbitrary spread of data along the hyperplane. [sent-72, score-0.252]
47 Use of fixed variance spherical Gaussians ensures that like PCA, the direction perpendicular to the plane (V, b) is irrelevant in any metric involving relative values of likelihoods P (x|θ k ), including the posterior mean. [sent-73, score-0.189]
48 Consider the case when the data density P (x) belongs to our model space, i.e. [sent-74, score-0.176]
49 it is specified by {A, V, b, Π, σ}, and let D be any direction parallel to the plane (V, b) along which the latent distribution P (θ) has non-zero variance. [sent-76, score-0.565]
50 Since Gaussian noise with variance σ is added to this latent distribution to obtain P (x), variance of P (x) along D will be greater than σ. [sent-77, score-0.588]
51 3 Model estimation Algorithm for ML estimation: We present an EM algorithm for estimating parameters of a finite mixture model with the components constrained to an ℓ-dimensional Euclidean subspace. [sent-81, score-0.39]
52 Let x1 , . . . , xn be iid samples drawn from a d-dimensional density P (x), c be the number of mixture components and let the mixing density be Π = (π1 , . . . , πc ). [sent-88, score-0.596]
53 Associated with each mixture component (indexed by k) are parameter vectors θ k and ak which are related by θ k = ak V + b. [sent-92, score-0.352]
54 In this section we will work with the assumption that all components of x correspond to the same exponential family for ease of notation. [sent-93, score-0.428]
55 For each observed xi there is an unobserved ‘missing’ variable zi which is a c-dimensional binary vector whose k’th component is one if the k’th mixture component was the outcome in the i’th random draw and zero otherwise. [sent-94, score-0.377]
56 To measure the nearness of components we use the ∞-norm of the difference between probabilities the components assign to observations since we do not want to lose mixture components that are distinguished with respect to a small number of observation vectors. [sent-100, score-0.453]
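The update equations themselves are not reproduced in this extract, so the following is only a sketch of one way the E- and M-steps can be instantiated for the Gaussian case with fixed spherical variance: the E-step computes the responsibilities E[z_ik | x_i], and the M-step reduces to a responsibility-weighted PCA of the unconstrained component means, which re-fits (V, b, {a_k}) and the mixing weights. The pruning of near-duplicate components described above is omitted, and all names are ours.

    import numpy as np

    def em_step_gaussian(X, A, V, b, pi, sigma):
        """One EM iteration for the constrained mixture of spherical Gaussians
        (component means theta_k = a_k V + b, fixed noise std sigma).
        X:(n,d)  A:(c,l)  V:(l,d)  b:(d,)  pi:(c,)"""
        n = X.shape[0]
        l = A.shape[1]
        Theta = A @ V + b                                    # component means (c, d)
        # E-step: responsibilities gamma[i, k] = E[z_ik | x_i]
        sq = ((X[:, None, :] - Theta[None, :, :]) ** 2).sum(axis=-1)   # (n, c)
        logp = np.log(pi) - 0.5 * sq / sigma ** 2
        logp -= logp.max(axis=1, keepdims=True)
        gamma = np.exp(logp)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: mixing weights and unconstrained (responsibility-weighted) means
        w = gamma.sum(axis=0)                                # effective counts (c,)
        pi_new = w / n
        M = (gamma.T @ X) / w[:, None]                       # (c, d)
        # Re-fit the l-dimensional affine subspace by weighted PCA of the means
        b_new = (w @ M) / w.sum()
        S = ((M - b_new).T * w) @ (M - b_new)                # weighted scatter (d, d)
        evals, evecs = np.linalg.eigh(S)
        V_new = evecs[:, -l:].T                              # top-l directions (l, d)
        A_new = (M - b_new) @ V_new.T                        # latent coordinates (c, l)
        return A_new, V_new, b_new, pi_new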
57 Convergence of the EM iterations and computational complexity: It is easy to verify that the SP-PCA model satisfies the continuity assumptions of Theorem 2, [8], and hence we can conclude that any limit point of the EM iterations is a stationary point of the log likelihood function. [sent-103, score-0.156]
58 The k-d tree data structure is often used to identify relevant mixture components to speed up this step. [sent-107, score-0.289]
59 For the Gaussian case, a fast method to pick ℓ would be to plot the variance of data along the principal directions (found using PCA) and look for the dimension at which there is a ‘knee’ or a sudden drop in variance or where the total residual variance falls below a chosen threshold. [sent-109, score-0.308]
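That heuristic for picking ℓ can be written down directly; in the sketch below the residual-variance threshold of 5% is an arbitrary choice of ours.

    import numpy as np

    def choose_latent_dim(X, residual_threshold=0.05):
        """Smallest l whose top-l principal directions leave less than
        `residual_threshold` of the total variance unexplained."""
        Xc = X - X.mean(axis=0)
        evals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]   # descending
        residual = 1.0 - np.cumsum(evals) / evals.sum()
        for l in range(1, len(evals) + 1):
            if residual[l - 1] <= residual_threshold:
                return l
        return len(evals)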
60 Consistency of the Maximum Likelihood estimator: We propose to use the ML estimator to find the latent space (V, b) and the latent distribution P (a). [sent-110, score-0.935]
61 Usually a parametric form is assumed for P (a) and the consistency of the ML estimate is well known for this task where the parameter space is a subset of a finite dimensional Euclidean space. [sent-111, score-0.22]
62 Theorem If P (a) is assumed to be zero outside a bounded subset of R^ℓ, the ML estimator of the parameter (V, b, P (a)) is strongly consistent for Gaussian, Binary and Poisson conditional distributions. [sent-125, score-0.188]
63 4 Relationship to past work SP-PCA is a factor model that makes fewer assumptions about the latent distribution than PPCA [1]. [sent-140, score-0.533]
64 Mixtures of probabilistic principal component analyzers (also known as mixtures of factor analyzers) are a generalization of PPCA which overcomes the limitation of global linearity of PCA via local dimensionality reduction. [sent-141, score-0.383]
65 [4] proposed a generalization of PCA using exponential family distributions. [sent-145, score-0.388]
66 Note that this generalization is not associated with a probability density model for the data. [sent-146, score-0.199]
67 SP-PCA can be thought of as a ‘soft’ version of this generalization of PCA, in the same manner as Gaussian mixtures are a soft version of K-means. [sent-147, score-0.2]
68 Generative topographic mapping (GTM) is a probabilistic alternative to the self-organizing map which aims at finding a nonlinear lower dimensional manifold passing close to data points. [sent-148, score-0.177]
69 An extension of GTM using exponential family distributions to deal with binary and count data is described in [3]. [sent-149, score-0.48]
70 Apart from the fact that GTM is a non-linear dimensionality reduction technique while SP-PCA is globally linear like PCA, one main feature that distinguishes the two is the choice of latent distribution. [sent-150, score-0.518]
71 GTM assumes that the latent distribution is uniform over a finite and discrete grid of points. [sent-151, score-0.551]
72 Tibshirani [10] used a semi-parametric latent variable model for estimation of principal curves. [sent-153, score-0.565]
73 Discussion of these and other dimensionality reduction schemes based on latent trait and latent class models can be found in [7]. [sent-154, score-1.037]
74 In the factor analysis literature, it is commonly believed that the choice of prior distribution is unimportant for low dimensional data summarization (see [2], Sections 2. [sent-156, score-0.241]
75 Through the examples below we argue that estimating the prior instead of assuming it arbitrarily can make a difference when latent variable models are used for density approximation, data analysis and visualization. [sent-160, score-0.609]
76 Use of SP-PCA as a low dimensional density model: The Tobamovirus data which consists of 38 18-dimensional examples was used in [1] to illustrate properties of PPCA. [sent-161, score-0.288]
77 The complexity of these densities increases with and is controlled by the value of ℓ (the projected space dimension), starting with the zero dimensional model of an isotropic Gaussian. [sent-163, score-0.222]
78 This indicates that SP-PCA combines flexible density estimation and excellent generalization even when trained on a small amount of data. [sent-167, score-0.229]
79 Simulation results on discrete datasets: We present experiments on the 20 Newsgroups dataset comparing SP-PCA to PCA, exponential family GTM [3] and Exponential family PCA [4]. [sent-168, score-0.556]
80 A dictionary size of 150 words was chosen and the words in the dictionary were picked to be those which have maximum mutual information with class labels. [sent-178, score-0.272]
81 200 documents were drawn from each of the three newsgroups to form the training data. [sent-179, score-0.204]
82 In the projection obtained using PCA, Exponential family PCA and Bernoulli GTM, the classes comp. [sent-182, score-0.306]
83 One way to quantify the separation of dissimilar groups in the two-dimensional projections is to use the training set classification error of projected data using SVM. [sent-194, score-0.161]
84 The accuracy of the best SVM classifier (we tried a range of SVM parameter values and picked the best for each projected data set) was 75% for the Bernoulli GTM projection and 82.3% [sent-195, score-0.445]
85 for the SP-PCA projection (the difference corresponds to 44 data points while the total number of data points is 600). [sent-196, score-0.187]
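The separation measure used here, the training-set accuracy of the best SVM over a range of parameter settings applied to each two-dimensional projection, could be reproduced along the following lines. The sketch uses scikit-learn and an RBF kernel; both are our choices, since this extract does not specify the SVM implementation or kernel.

    from sklearn.svm import SVC

    def projection_separation(Z, y, Cs=(0.1, 1, 10, 100), gammas=(0.01, 0.1, 1, 10)):
        """Training-set accuracy of the best RBF-SVM on a 2-D projection Z with
        class labels y; higher accuracy indicates better-separated groups."""
        best = 0.0
        for C in Cs:
            for gamma in gammas:
                clf = SVC(C=C, gamma=gamma, kernel="rbf").fit(Z, y)
                best = max(best, clf.score(Z, y))    # accuracy on the training data
        return best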
86 hardware have overlap in projection using Bernoulli GTM is that the prior is assumed to be over a pre-specified grid in latent space and the spacing between grid points happened to be large in the parameter space close to the two news groups. [sent-204, score-0.696]
87 In contrast to this, in SP-PCA there is no grid and the latent distribution is allowed to adapt to the given data set. [sent-205, score-0.555]
88 Note that a standard clustering algorithm could be used on the data projected using SP-PCA to conclude that the data consisted of three kinds of documents. [sent-206, score-0.157]
89 8 1 −60 −40 −20 0 20 40 60 80 100 (d) SP-PCA Figure 1: Projection by various methods of binary data from 200 documents each from comp. [sent-233, score-0.183]
90 A dictionary size of 100 words was chosen and again the words in the dictionary were picked to be those which have maximum mutual information with class labels. [sent-249, score-0.272]
91 100 documents were drawn from each of the newsgroups to form the training data and 100 more to form the test data. [sent-250, score-0.242]
92 Note that while the four newsgroups are bunched together in the projection obtained using Exponential family PCA [4] (Fig. 2(b)), [sent-253, score-0.391]
93 we can still detect the presence of four groups from this projection and in this sense this projection is better than the PCA projection. [sent-254, score-0.264]
94 We conjecture that the reason the four groups are not well separated in this projection is that a conjugate prior has to be used in its estimation for computational purposes [4] and the form and parameters of this prior are considered fixed and given inputs to the algorithm. [sent-256, score-0.251]
95 To measure generalization of these methods, we use a K-nearest neighbors based non-parametric estimate of the density of the projected training data. [sent-261, score-0.246]
96 The percentage difference between the log-likelihoods of training and test data with respect to this density was 9. [sent-262, score-0.179]
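The generalization measure described above, a K-nearest-neighbors density estimate fitted to the projected training data and evaluated on training and test points, can be sketched as follows. The value of K, the two-dimensional ball-volume formula and the exact normalization of the percentage difference are our assumptions; they are left unspecified in this extract.

    import numpy as np

    def knn_log_density(Z_train, Z_query, k=10):
        """K-NN density estimate in 2-D: p(z) ~ k / (n * pi * r_k(z)^2), where
        r_k(z) is the distance from z to its k-th nearest training point.
        (When querying the training points themselves, each point counts among
        its own neighbours; use k+1 for a leave-one-out estimate.)"""
        n = len(Z_train)
        d2 = ((Z_query[:, None, :] - Z_train[None, :, :]) ** 2).sum(axis=-1)
        r_k = np.sqrt(np.sort(d2, axis=1)[:, k - 1])
        return np.log(k) - np.log(n) - np.log(np.pi) - 2.0 * np.log(r_k)

    def generalization_gap(Z_train, Z_test, k=10):
        """Percentage difference between mean train and test log-likelihoods
        under the density estimated from the projected training data."""
        ll_train = knn_log_density(Z_train, Z_train, k).mean()
        ll_test = knn_log_density(Z_train, Z_test, k).mean()
        return 100.0 * (ll_train - ll_test) / abs(ll_train)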
(f) Test data - GTM. Figure 2: Projection by various methods of binary data from 100 documents each from sci. [sent-304, score-0.221]
98 A combined latent class and trait model for the analysis and visualization of discrete data. [sent-323, score-0.53]
99 A generalization of principal components analysis to the exponential family. [sent-330, score-0.351]
100 Semi-parametric exponential family PCA : Reducing dimensions via non-parametric latent distribution estimation. [sent-346, score-0.786]
wordName wordTfidf (topN-words)
[('latent', 0.402), ('gtm', 0.402), ('pca', 0.319), ('ppca', 0.213), ('family', 0.195), ('zik', 0.15), ('mixture', 0.15), ('bernoulli', 0.135), ('exponential', 0.132), ('grs', 0.115), ('sajama', 0.115), ('projection', 0.111), ('dimensional', 0.107), ('density', 0.104), ('components', 0.101), ('newsgroups', 0.085), ('th', 0.085), ('ml', 0.083), ('projected', 0.081), ('dictionary', 0.08), ('estimator', 0.074), ('dimensionality', 0.068), ('pruning', 0.068), ('variable', 0.065), ('documents', 0.064), ('estimation', 0.064), ('gaussian', 0.063), ('generalization', 0.061), ('trait', 0.06), ('mixtures', 0.059), ('grid', 0.058), ('gri', 0.057), ('irls', 0.057), ('kiefer', 0.057), ('lindsay', 0.057), ('perpendicular', 0.057), ('starved', 0.057), ('principal', 0.057), ('bs', 0.057), ('distribution', 0.057), ('schemes', 0.057), ('euclidean', 0.057), ('component', 0.056), ('ak', 0.055), ('plane', 0.055), ('drawn', 0.055), ('invertible', 0.055), ('mixing', 0.051), ('along', 0.051), ('bootstrap', 0.051), ('analyzers', 0.05), ('binary', 0.05), ('reduction', 0.048), ('member', 0.047), ('conditional', 0.047), ('thought', 0.046), ('monotone', 0.046), ('iq', 0.046), ('ks', 0.046), ('please', 0.046), ('subspace', 0.046), ('consistency', 0.046), ('drop', 0.045), ('picked', 0.044), ('em', 0.043), ('hs', 0.043), ('alon', 0.043), ('vs', 0.042), ('groups', 0.042), ('likelihood', 0.042), ('constrained', 0.041), ('dasgupta', 0.04), ('bregman', 0.04), ('assumptions', 0.04), ('hence', 0.04), ('low', 0.039), ('variance', 0.039), ('data', 0.038), ('spherical', 0.038), ('av', 0.038), ('simulations', 0.037), ('percentage', 0.037), ('parameter', 0.036), ('annals', 0.036), ('model', 0.034), ('soft', 0.034), ('belong', 0.034), ('discrete', 0.034), ('populations', 0.034), ('conjecture', 0.034), ('words', 0.034), ('distributions', 0.034), ('statistics', 0.033), ('estimators', 0.033), ('probabilistic', 0.032), ('count', 0.031), ('various', 0.031), ('assumed', 0.031), ('iid', 0.031), ('collins', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 163 nips-2004-Semi-parametric Exponential Family PCA
Author: Sajama, Alon Orlitsky
Abstract: We present a semi-parametric latent variable model based technique for density modelling, dimensionality reduction and visualization. Unlike previous methods, we estimate the latent distribution non-parametrically which enables us to model data generated by an underlying low dimensional, multimodal distribution. In addition, we allow the components of latent variable models to be drawn from the exponential family which makes the method suitable for special data types, for example binary or count data. Simulations on real valued, binary and count data show favorable comparison to other related schemes both in terms of separating different populations and generalization to unseen samples. 1
2 0.21698442 125 nips-2004-Multiple Relational Embedding
Author: Roland Memisevic, Geoffrey E. Hinton
Abstract: We describe a way of using multiple different types of similarity relationship to learn a low-dimensional embedding of a dataset. Our method chooses different, possibly overlapping representations of similarity by individually reweighting the dimensions of a common underlying latent space. When applied to a single similarity relation that is based on Euclidean distances between the input data points, the method reduces to simple dimensionality reduction. If additional information is available about the dataset or about subsets of it, we can use this information to clean up or otherwise improve the embedding. We demonstrate the potential usefulness of this form of semi-supervised dimensionality reduction on some simple examples. 1
3 0.17444342 124 nips-2004-Multiple Alignment of Continuous Time Series
Author: Jennifer Listgarten, Radford M. Neal, Sam T. Roweis, Andrew Emili
Abstract: Multiple realizations of continuous-valued time series from a stochastic process often contain systematic variations in rate and amplitude. To leverage the information contained in such noisy replicate sets, we need to align them in an appropriate way (for example, to allow the data to be properly combined by adaptive averaging). We present the Continuous Profile Model (CPM), a generative model in which each observed time series is a non-uniformly subsampled version of a single latent trace, to which local rescaling and additive noise are applied. After unsupervised training, the learned trace represents a canonical, high resolution fusion of all the replicates. As well, an alignment in time and scale of each observation to this trace can be found by inference in the model. We apply CPM to successfully align speech signals from multiple speakers and sets of Liquid Chromatography-Mass Spectrometry proteomic data. 1 A Profile Model for Continuous Data When observing multiple time series generated by a noisy, stochastic process, large systematic sources of variability are often present. For example, within a set of nominally replicate time series, the time axes can be variously shifted, compressed and expanded, in complex, non-linear ways. Additionally, in some circumstances, the scale of the measured data can vary systematically from one replicate to the next, and even within a given replicate. We propose a Continuous Profile Model (CPM) for simultaneously analyzing a set of such time series. In this model, each time series is generated as a noisy transformation of a single latent trace. The latent trace is an underlying, noiseless representation of the set of replicated, observable time series. Output time series are generated from this model by moving through a sequence of hidden states in a Markovian manner and emitting an observable value at each step, as in an HMM. Each hidden state corresponds to a particular location in the latent trace, and the emitted value from the state depends on the value of the latent trace at that position. To account for changes in the amplitude of the signals across and within replicates, the latent time states are augmented by a set of scale states, which control how the emission signal will be scaled relative to the value of the latent trace. During training, the latent trace is learned, as well as the transition probabilities controlling the Markovian evolution of the scale and time states and the overall noise level of the observed data. After training, the latent trace learned by the model represents a higher resolution ’fusion’ of the experimental replicates. Figure 1 illustrate the model in action. Unaligned, Linear Warp Alignment and CPM Alignment Amplitude 40 30 20 10 0 50 Amplitude 40 30 20 10 Amplitude 0 30 20 10 0 Time a) b) Figure 1: a) Top: ten replicated speech energy signals as described in Section 4), Middle: same signals, aligned using a linear warp with an offset, Bottom: aligned with CPM (the learned latent trace is also shown in cyan). b) Speech waveforms corresponding to energy signals in a), Top: unaligned originals, Bottom: aligned using CPM. 2 Defining the Continuous Profile Model (CPM) The CPM is generative model for a set of K time series, xk = (xk , xk , ..., xk k ). The 1 2 N temporal sampling rate within each xk need not be uniform, nor must it be the same across the different xk . Constraints on the variability of the sampling rate are discussed at the end of this section. 
For notational convenience, we henceforth assume N k = N for all k, but this is not a requirement of the model. The CPM is set up as follows: We assume that there is a latent trace, z = (z1 , z2 , ..., zM ), a canonical representation of the set of noisy input replicate time series. Any given observed time series in the set is modeled as a non-uniformly subsampled version of the latent trace to which local scale transformations have been applied. Ideally, M would be infinite, or at least very large relative to N so that any experimental data could be mapped precisely to the correct underlying trace point. Aside from the computational impracticalities this would pose, great care to avoid overfitting would have to be taken. Thus in practice, we have used M = (2 + )N (double the resolution, plus some slack on each end) in our experiments and found this to be sufficient with < 0.2. Because the resolution of the latent trace is higher than that of the observed time series, experimental time can be made effectively to speed up or slow down by advancing along the latent trace in larger or smaller jumps. The subsampling and local scaling used during the generation of each observed time series are determined by a sequence of hidden state variables. Let the state sequence for observation k be π k . Each state in the state sequence maps to a time state/scale state pair: k πi → {τik , φk }. Time states belong to the integer set (1..M ); scale states belong to an i ordered set (φ1 ..φQ ). (In our experiments we have used Q=7, evenly spaced scales in k logarithmic space). States, πi , and observation values, xk , are related by the emission i k probability distribution: Aπi (xk |z) ≡ p(xk |πi , z, σ, uk ) ≡ N (xk ; zτik φk uk , σ), where σ k i i i i is the noise level of the observed data, N (a; b, c) denotes a Gaussian probability density for a with mean b and standard deviation c. The uk are real-valued scale parameters, one per observed time series, that correct for any overall scale difference between time series k and the latent trace. To fully specify our model we also need to define the state transition probabilities. We define the transitions between time states and between scale states separately, so that k Tπi−1 ,πi ≡ p(πi |πi−1 ) = p(φi |φi−1 )pk (τi |τi−1 ). The constraint that time must move forward, cannot stand still, and that it can jump ahead no more than Jτ time states is enforced. (In our experiments we used Jτ = 3.) As well, we only allow scale state transitions between neighbouring scale states so that the local scale cannot jump arbitrarily. These constraints keep the number of legal transitions to a tractable computational size and work well in practice. Each observed time series has its own time transition probability distribution to account for experiment-specific patterns. Both the time and scale transition probability distributions are given by multinomials: dk , if a − b = 1 1 k d2 , if a − b = 2 k . p (τi = a|τi−1 = b) = . . k d , if a − b = J τ Jτ 0, otherwise p(φi = a|φi−1 s0 , if D(a, b) = 0 s1 , if D(a, b) = 1 = b) = s1 , if D(a, b) = −1 0, otherwise where D(a, b) = 1 means that a is one scale state larger than b, and D(a, b) = −1 means that a is one scale state smaller than b, and D(a, b) = 0 means that a = b. The distributions Jτ are constrained by: i=1 dk = 1 and 2s1 + s0 = 1. i Jτ determines the maximum allowable instantaneous speedup of one portion of a time series relative to another portion, within the same series or across different series. 
However, the length of time for which any series can move so rapidly is constrained by the length of the latent trace; thus the maximum overall ratio in speeds achievable by the model between any two entire time series is given by min(Jτ , M ). N After training, one may examine either the latent trace or the alignment of each observable time series to the latent trace. Such alignments can be achieved by several methods, including use of the Viterbi algorithm to find the highest likelihood path through the hidden states [1], or sampling from the posterior over hidden state sequences. We found Viterbi alignments to work well in the experiments below; samples from the posterior looked quite similar. 3 Training with the Expectation-Maximization (EM) Algorithm As with HMMs, training with the EM algorithm (often referred to as Baum-Welch in the context of HMMs [1]), is a natural choice. In our model the E-Step is computed exactly using the Forward-Backward algorithm [1], which provides the posterior probability over k states for each time point of every observed time series, γs (i) ≡ p(πi = s|x) and also the pairwise state posteriors, ξs,t (i) ≡ p(πi−1 = s, πi = t|xk ). The algorithm is modified only in that the emission probabilities depend on the latent trace as described in Section 2. The M-Step consists of a series of analytical updates to the various parameters as detailed below. Given the latent trace (and the emission and state transition probabilities), the complete log likelihood of K observed time series, xk , is given by Lp ≡ L + P. L is the likelihood term arising in a (conditional) HMM model, and can be obtained from the Forward-Backward algorithm. It is composed of the emission and state transition terms. P is the log prior (or penalty term), regularizing various aspects of the model parameters as explained below. These two terms are: K N N L≡ log Aπi (xk |z) + i log p(π1 ) + τ −1 K (zj+1 − zj )2 + P ≡ −λ (1) i=2 i=1 k=1 k log Tπi−1 ,πi j=1 k log D(dk |{ηv }) + log D(sv |{ηv }), v (2) k=1 where p(π1 ) are priors over the initial states. The first term in Equation 2 is a smoothing k penalty on the latent trace, with λ controlling the amount of smoothing. ηv and ηv are Dirichlet hyperprior parameters for the time and scale state transition probability distributions respectively. These ensure that all non-zero transition probabilities remain non-zero. k For the time state transitions, v ∈ {1, Jτ } and ηv corresponds to the pseudo-count data for k the parameters d1 , d2 . . . dJτ . For the scale state transitions, v ∈ {0, 1} and ηv corresponds to the pseudo-count data for the parameters s0 and s1 . Letting S be the total number of possible states, that is, the number of elements in the cross-product of possible time states and possible scale states, the expected complete log likelihood is: K S K p k k γs (1) log T0,s
4 0.14817195 66 nips-2004-Exponential Family Harmoniums with an Application to Information Retrieval
Author: Max Welling, Michal Rosen-zvi, Geoffrey E. Hinton
Abstract: Directed graphical models with one layer of observed random variables and one or more layers of hidden random variables have been the dominant modelling paradigm in many research fields. Although this approach has met with considerable success, the causal semantics of these models can make it difficult to infer the posterior distribution over the hidden variables. In this paper we propose an alternative two-layer model based on exponential family distributions and the semantics of undirected models. Inference in these “exponential family harmoniums” is fast while learning is performed by minimizing contrastive divergence. A member of this family is then studied as an alternative probabilistic model for latent semantic indexing. In experiments it is shown that they perform well on document retrieval tasks and provide an elegant solution to searching with keywords.
5 0.12596481 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling
Author: John Blitzer, Fernando Pereira, Kilian Q. Weinberger, Lawrence K. Saul
Abstract: Statistical language models estimate the probability of a word occurring in a given context. The most common language models rely on a discrete enumeration of predictive contexts (e.g., n-grams) and consequently fail to capture and exploit statistical regularities across these contexts. In this paper, we show how to learn hierarchical, distributed representations of word contexts that maximize the predictive value of a statistical language model. The representations are initialized by unsupervised algorithms for linear and nonlinear dimensionality reduction [14], then fed as input into a hierarchical mixture of experts, where each expert is a multinomial distribution over predicted words [12]. While the distributed representations in our model are inspired by the neural probabilistic language model of Bengio et al. [2, 3], our particular architecture enables us to work with significantly larger vocabularies and training corpora. For example, on a large-scale bigram modeling task involving a sixty thousand word vocabulary and a training corpus of three million sentences, we demonstrate consistent improvement over class-based bigram models [10, 13]. We also discuss extensions of our approach to longer multiword contexts. 1
6 0.12225169 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
7 0.11893507 2 nips-2004-A Direct Formulation for Sparse PCA Using Semidefinite Programming
8 0.11403067 105 nips-2004-Log-concavity Results on Gaussian Process Methods for Supervised and Unsupervised Learning
9 0.10928021 114 nips-2004-Maximum Likelihood Estimation of Intrinsic Dimension
10 0.10068669 131 nips-2004-Non-Local Manifold Tangent Learning
11 0.092077948 169 nips-2004-Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes
12 0.088107586 170 nips-2004-Similarity and Discrimination in Classical Conditioning: A Latent Variable Account
13 0.087677136 164 nips-2004-Semi-supervised Learning by Entropy Minimization
14 0.086650126 127 nips-2004-Neighbourhood Components Analysis
15 0.085804164 16 nips-2004-Adaptive Discriminative Generative Model and Its Applications
16 0.085119016 77 nips-2004-Hierarchical Clustering of a Mixture Model
17 0.084912866 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants
18 0.084518239 145 nips-2004-Parametric Embedding for Class Visualization
19 0.083086506 168 nips-2004-Semigroup Kernels on Finite Sets
20 0.083036207 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition
topicId topicWeight
[(0, -0.265), (1, 0.064), (2, -0.069), (3, -0.122), (4, -0.022), (5, 0.017), (6, -0.156), (7, 0.093), (8, 0.233), (9, 0.095), (10, 0.162), (11, -0.1), (12, 0.057), (13, -0.071), (14, 0.1), (15, -0.043), (16, 0.036), (17, 0.079), (18, -0.034), (19, -0.058), (20, -0.132), (21, 0.016), (22, 0.094), (23, 0.02), (24, 0.008), (25, 0.078), (26, -0.033), (27, -0.038), (28, -0.034), (29, -0.07), (30, 0.039), (31, 0.057), (32, -0.101), (33, -0.026), (34, -0.128), (35, 0.044), (36, 0.106), (37, 0.011), (38, 0.076), (39, -0.004), (40, -0.136), (41, -0.161), (42, -0.08), (43, -0.085), (44, -0.077), (45, 0.036), (46, 0.038), (47, 0.09), (48, -0.043), (49, 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.96211201 163 nips-2004-Semi-parametric Exponential Family PCA
Author: Sajama, Alon Orlitsky
Abstract: We present a semi-parametric latent variable model based technique for density modelling, dimensionality reduction and visualization. Unlike previous methods, we estimate the latent distribution non-parametrically which enables us to model data generated by an underlying low dimensional, multimodal distribution. In addition, we allow the components of latent variable models to be drawn from the exponential family which makes the method suitable for special data types, for example binary or count data. Simulations on real valued, binary and count data show favorable comparison to other related schemes both in terms of separating different populations and generalization to unseen samples. 1
2 0.74009567 125 nips-2004-Multiple Relational Embedding
Author: Roland Memisevic, Geoffrey E. Hinton
Abstract: We describe a way of using multiple different types of similarity relationship to learn a low-dimensional embedding of a dataset. Our method chooses different, possibly overlapping representations of similarity by individually reweighting the dimensions of a common underlying latent space. When applied to a single similarity relation that is based on Euclidean distances between the input data points, the method reduces to simple dimensionality reduction. If additional information is available about the dataset or about subsets of it, we can use this information to clean up or otherwise improve the embedding. We demonstrate the potential usefulness of this form of semi-supervised dimensionality reduction on some simple examples. 1
3 0.69507843 66 nips-2004-Exponential Family Harmoniums with an Application to Information Retrieval
Author: Max Welling, Michal Rosen-zvi, Geoffrey E. Hinton
Abstract: Directed graphical models with one layer of observed random variables and one or more layers of hidden random variables have been the dominant modelling paradigm in many research fields. Although this approach has met with considerable success, the causal semantics of these models can make it difficult to infer the posterior distribution over the hidden variables. In this paper we propose an alternative two-layer model based on exponential family distributions and the semantics of undirected models. Inference in these “exponential family harmoniums” is fast while learning is performed by minimizing contrastive divergence. A member of this family is then studied as an alternative probabilistic model for latent semantic indexing. In experiments it is shown that they perform well on document retrieval tasks and provide an elegant solution to searching with keywords.
4 0.67153269 170 nips-2004-Similarity and Discrimination in Classical Conditioning: A Latent Variable Account
Author: Aaron C. Courville, Nathaniel D. Daw, David S. Touretzky
Abstract: We propose a probabilistic, generative account of configural learning phenomena in classical conditioning. Configural learning experiments probe how animals discriminate and generalize between patterns of simultaneously presented stimuli (such as tones and lights) that are differentially predictive of reinforcement. Previous models of these issues have been successful more on a phenomenological than an explanatory level: they reproduce experimental findings but, lacking formal foundations, provide scant basis for understanding why animals behave as they do. We present a theory that clarifies seemingly arbitrary aspects of previous models while also capturing a broader set of data. Key patterns of data, e.g. concerning animals’ readiness to distinguish patterns with varying degrees of overlap, are shown to follow from statistical inference.
5 0.64774787 124 nips-2004-Multiple Alignment of Continuous Time Series
Author: Jennifer Listgarten, Radford M. Neal, Sam T. Roweis, Andrew Emili
Abstract: Multiple realizations of continuous-valued time series from a stochastic process often contain systematic variations in rate and amplitude. To leverage the information contained in such noisy replicate sets, we need to align them in an appropriate way (for example, to allow the data to be properly combined by adaptive averaging). We present the Continuous Profile Model (CPM), a generative model in which each observed time series is a non-uniformly subsampled version of a single latent trace, to which local rescaling and additive noise are applied. After unsupervised training, the learned trace represents a canonical, high resolution fusion of all the replicates. As well, an alignment in time and scale of each observation to this trace can be found by inference in the model. We apply CPM to successfully align speech signals from multiple speakers and sets of Liquid Chromatography-Mass Spectrometry proteomic data. 1 A Profile Model for Continuous Data When observing multiple time series generated by a noisy, stochastic process, large systematic sources of variability are often present. For example, within a set of nominally replicate time series, the time axes can be variously shifted, compressed and expanded, in complex, non-linear ways. Additionally, in some circumstances, the scale of the measured data can vary systematically from one replicate to the next, and even within a given replicate. We propose a Continuous Profile Model (CPM) for simultaneously analyzing a set of such time series. In this model, each time series is generated as a noisy transformation of a single latent trace. The latent trace is an underlying, noiseless representation of the set of replicated, observable time series. Output time series are generated from this model by moving through a sequence of hidden states in a Markovian manner and emitting an observable value at each step, as in an HMM. Each hidden state corresponds to a particular location in the latent trace, and the emitted value from the state depends on the value of the latent trace at that position. To account for changes in the amplitude of the signals across and within replicates, the latent time states are augmented by a set of scale states, which control how the emission signal will be scaled relative to the value of the latent trace. During training, the latent trace is learned, as well as the transition probabilities controlling the Markovian evolution of the scale and time states and the overall noise level of the observed data. After training, the latent trace learned by the model represents a higher resolution ’fusion’ of the experimental replicates. Figure 1 illustrate the model in action. Unaligned, Linear Warp Alignment and CPM Alignment Amplitude 40 30 20 10 0 50 Amplitude 40 30 20 10 Amplitude 0 30 20 10 0 Time a) b) Figure 1: a) Top: ten replicated speech energy signals as described in Section 4), Middle: same signals, aligned using a linear warp with an offset, Bottom: aligned with CPM (the learned latent trace is also shown in cyan). b) Speech waveforms corresponding to energy signals in a), Top: unaligned originals, Bottom: aligned using CPM. 2 Defining the Continuous Profile Model (CPM) The CPM is generative model for a set of K time series, xk = (xk , xk , ..., xk k ). The 1 2 N temporal sampling rate within each xk need not be uniform, nor must it be the same across the different xk . Constraints on the variability of the sampling rate are discussed at the end of this section. 
For notational convenience, we henceforth assume N k = N for all k, but this is not a requirement of the model. The CPM is set up as follows: We assume that there is a latent trace, z = (z1 , z2 , ..., zM ), a canonical representation of the set of noisy input replicate time series. Any given observed time series in the set is modeled as a non-uniformly subsampled version of the latent trace to which local scale transformations have been applied. Ideally, M would be infinite, or at least very large relative to N so that any experimental data could be mapped precisely to the correct underlying trace point. Aside from the computational impracticalities this would pose, great care to avoid overfitting would have to be taken. Thus in practice, we have used M = (2 + )N (double the resolution, plus some slack on each end) in our experiments and found this to be sufficient with < 0.2. Because the resolution of the latent trace is higher than that of the observed time series, experimental time can be made effectively to speed up or slow down by advancing along the latent trace in larger or smaller jumps. The subsampling and local scaling used during the generation of each observed time series are determined by a sequence of hidden state variables. Let the state sequence for observation k be π k . Each state in the state sequence maps to a time state/scale state pair: k πi → {τik , φk }. Time states belong to the integer set (1..M ); scale states belong to an i ordered set (φ1 ..φQ ). (In our experiments we have used Q=7, evenly spaced scales in k logarithmic space). States, πi , and observation values, xk , are related by the emission i k probability distribution: Aπi (xk |z) ≡ p(xk |πi , z, σ, uk ) ≡ N (xk ; zτik φk uk , σ), where σ k i i i i is the noise level of the observed data, N (a; b, c) denotes a Gaussian probability density for a with mean b and standard deviation c. The uk are real-valued scale parameters, one per observed time series, that correct for any overall scale difference between time series k and the latent trace. To fully specify our model we also need to define the state transition probabilities. We define the transitions between time states and between scale states separately, so that k Tπi−1 ,πi ≡ p(πi |πi−1 ) = p(φi |φi−1 )pk (τi |τi−1 ). The constraint that time must move forward, cannot stand still, and that it can jump ahead no more than Jτ time states is enforced. (In our experiments we used Jτ = 3.) As well, we only allow scale state transitions between neighbouring scale states so that the local scale cannot jump arbitrarily. These constraints keep the number of legal transitions to a tractable computational size and work well in practice. Each observed time series has its own time transition probability distribution to account for experiment-specific patterns. Both the time and scale transition probability distributions are given by multinomials: dk , if a − b = 1 1 k d2 , if a − b = 2 k . p (τi = a|τi−1 = b) = . . k d , if a − b = J τ Jτ 0, otherwise p(φi = a|φi−1 s0 , if D(a, b) = 0 s1 , if D(a, b) = 1 = b) = s1 , if D(a, b) = −1 0, otherwise where D(a, b) = 1 means that a is one scale state larger than b, and D(a, b) = −1 means that a is one scale state smaller than b, and D(a, b) = 0 means that a = b. The distributions Jτ are constrained by: i=1 dk = 1 and 2s1 + s0 = 1. i Jτ determines the maximum allowable instantaneous speedup of one portion of a time series relative to another portion, within the same series or across different series. 
However, the length of time for which any series can move so rapidly is constrained by the length of the latent trace; thus the maximum overall ratio in speeds achievable by the model between any two entire time series is given by min(Jτ , M ). N After training, one may examine either the latent trace or the alignment of each observable time series to the latent trace. Such alignments can be achieved by several methods, including use of the Viterbi algorithm to find the highest likelihood path through the hidden states [1], or sampling from the posterior over hidden state sequences. We found Viterbi alignments to work well in the experiments below; samples from the posterior looked quite similar. 3 Training with the Expectation-Maximization (EM) Algorithm As with HMMs, training with the EM algorithm (often referred to as Baum-Welch in the context of HMMs [1]), is a natural choice. In our model the E-Step is computed exactly using the Forward-Backward algorithm [1], which provides the posterior probability over k states for each time point of every observed time series, γs (i) ≡ p(πi = s|x) and also the pairwise state posteriors, ξs,t (i) ≡ p(πi−1 = s, πi = t|xk ). The algorithm is modified only in that the emission probabilities depend on the latent trace as described in Section 2. The M-Step consists of a series of analytical updates to the various parameters as detailed below. Given the latent trace (and the emission and state transition probabilities), the complete log likelihood of K observed time series, xk , is given by Lp ≡ L + P. L is the likelihood term arising in a (conditional) HMM model, and can be obtained from the Forward-Backward algorithm. It is composed of the emission and state transition terms. P is the log prior (or penalty term), regularizing various aspects of the model parameters as explained below. These two terms are: K N N L≡ log Aπi (xk |z) + i log p(π1 ) + τ −1 K (zj+1 − zj )2 + P ≡ −λ (1) i=2 i=1 k=1 k log Tπi−1 ,πi j=1 k log D(dk |{ηv }) + log D(sv |{ηv }), v (2) k=1 where p(π1 ) are priors over the initial states. The first term in Equation 2 is a smoothing k penalty on the latent trace, with λ controlling the amount of smoothing. ηv and ηv are Dirichlet hyperprior parameters for the time and scale state transition probability distributions respectively. These ensure that all non-zero transition probabilities remain non-zero. k For the time state transitions, v ∈ {1, Jτ } and ηv corresponds to the pseudo-count data for k the parameters d1 , d2 . . . dJτ . For the scale state transitions, v ∈ {0, 1} and ηv corresponds to the pseudo-count data for the parameters s0 and s1 . Letting S be the total number of possible states, that is, the number of elements in the cross-product of possible time states and possible scale states, the expected complete log likelihood is: K S K p k k γs (1) log T0,s
6 0.6351887 158 nips-2004-Sampling Methods for Unsupervised Learning
7 0.5246768 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling
8 0.46331987 114 nips-2004-Maximum Likelihood Estimation of Intrinsic Dimension
9 0.45655343 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
10 0.44650841 169 nips-2004-Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes
11 0.43969154 2 nips-2004-A Direct Formulation for Sparse PCA Using Semidefinite Programming
12 0.43599907 127 nips-2004-Neighbourhood Components Analysis
13 0.42247123 15 nips-2004-Active Learning for Anomaly and Rare-Category Detection
14 0.40459684 198 nips-2004-Unsupervised Variational Bayesian Learning of Nonlinear Models
15 0.39768815 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants
16 0.38121852 89 nips-2004-Joint MRI Bias Removal Using Entropy Minimization Across Images
17 0.37245736 105 nips-2004-Log-concavity Results on Gaussian Process Methods for Supervised and Unsupervised Learning
18 0.37023842 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition
19 0.36169657 46 nips-2004-Constraining a Bayesian Model of Human Visual Speed Perception
20 0.35779437 149 nips-2004-Probabilistic Inference of Alternative Splicing Events in Microarray Data
topicId topicWeight
[(13, 0.143), (15, 0.136), (26, 0.044), (31, 0.056), (33, 0.167), (35, 0.04), (39, 0.124), (45, 0.111), (50, 0.042), (71, 0.011), (76, 0.017), (87, 0.01)]
simIndex simValue paperId paperTitle
same-paper 1 0.91851544 163 nips-2004-Semi-parametric Exponential Family PCA
Author: Sajama, Alon Orlitsky
Abstract: We present a semi-parametric latent variable model based technique for density modelling, dimensionality reduction and visualization. Unlike previous methods, we estimate the latent distribution non-parametrically which enables us to model data generated by an underlying low dimensional, multimodal distribution. In addition, we allow the components of latent variable models to be drawn from the exponential family which makes the method suitable for special data types, for example binary or count data. Simulations on real valued, binary and count data show favorable comparison to other related schemes both in terms of separating different populations and generalization to unseen samples. 1
2 0.90276217 42 nips-2004-Computing regularization paths for learning multiple kernels
Author: Francis R. Bach, Romain Thibaux, Michael I. Jordan
Abstract: The problem of learning a sparse conic combination of kernel functions or kernel matrices for classification or regression can be achieved via the regularization by a block 1-norm [1]. In this paper, we present an algorithm that computes the entire regularization path for these problems. The path is obtained by using numerical continuation techniques, and involves a running time complexity that is a constant times the complexity of solving the problem for one value of the regularization parameter. Working in the setting of kernel linear regression and kernel logistic regression, we show empirically that the effect of the block 1-norm regularization differs notably from the (non-block) 1-norm regularization commonly used for variable selection, and that the regularization path is of particular value in the block case. 1
3 0.89591211 198 nips-2004-Unsupervised Variational Bayesian Learning of Nonlinear Models
Author: Antti Honkela, Harri Valpola
Abstract: In this paper we present a framework for using multi-layer perceptron (MLP) networks in nonlinear generative models trained by variational Bayesian learning. The nonlinearity is handled by linearizing it using a Gauss–Hermite quadrature at the hidden neurons. This yields an accurate approximation for cases of large posterior variance. The method can be used to derive nonlinear counterparts for linear algorithms such as factor analysis, independent component/factor analysis and state-space models. This is demonstrated with a nonlinear factor analysis experiment in which even 20 sources can be estimated from a real world speech data set. 1
4 0.89401877 113 nips-2004-Maximum-Margin Matrix Factorization
Author: Nathan Srebro, Jason Rennie, Tommi S. Jaakkola
Abstract: We present a novel approach to collaborative prediction, using low-norm instead of low-rank factorizations. The approach is inspired by, and has strong connections to, large-margin linear discrimination. We show how to learn low-norm factorizations by solving a semi-definite program, and discuss generalization error bounds for them. 1
5 0.8756547 70 nips-2004-Following Curved Regularized Optimization Solution Paths
Author: Saharon Rosset
Abstract: Regularization plays a central role in the analysis of modern data, where non-regularized fitting is likely to lead to over-fitted models, useless for both prediction and interpretation. We consider the design of incremental algorithms which follow paths of regularized solutions, as the regularization varies. These approaches often result in methods which are both efficient and highly flexible. We suggest a general path-following algorithm based on second-order approximations, prove that under mild conditions it remains “very close” to the path of optimal solutions and illustrate it with examples.
6 0.86985803 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons
7 0.86745548 131 nips-2004-Non-Local Manifold Tangent Learning
8 0.85746175 124 nips-2004-Multiple Alignment of Continuous Time Series
9 0.85128438 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
10 0.85115272 178 nips-2004-Support Vector Classification with Input Data Uncertainty
11 0.85002035 71 nips-2004-Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices
12 0.84918475 190 nips-2004-The Rescorla-Wagner Algorithm and Maximum Likelihood Estimation of Causal Parameters
13 0.84739673 121 nips-2004-Modeling Nonlinear Dependencies in Natural Images using Mixture of Laplacian Distribution
14 0.84728456 28 nips-2004-Bayesian inference in spiking neurons
15 0.84574628 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
16 0.84335041 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform
17 0.84236491 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition
18 0.84217435 102 nips-2004-Learning first-order Markov models for control
19 0.84069306 22 nips-2004-An Investigation of Practical Approximate Nearest Neighbor Algorithms
20 0.83795017 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling