nips nips2004 nips2004-124 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jennifer Listgarten, Radford M. Neal, Sam T. Roweis, Andrew Emili
Abstract: Multiple realizations of continuous-valued time series from a stochastic process often contain systematic variations in rate and amplitude. To leverage the information contained in such noisy replicate sets, we need to align them in an appropriate way (for example, to allow the data to be properly combined by adaptive averaging). We present the Continuous Profile Model (CPM), a generative model in which each observed time series is a non-uniformly subsampled version of a single latent trace, to which local rescaling and additive noise are applied. After unsupervised training, the learned trace represents a canonical, high resolution fusion of all the replicates. As well, an alignment in time and scale of each observation to this trace can be found by inference in the model. We apply CPM to successfully align speech signals from multiple speakers and sets of Liquid Chromatography-Mass Spectrometry proteomic data. 1 A Profile Model for Continuous Data When observing multiple time series generated by a noisy, stochastic process, large systematic sources of variability are often present. For example, within a set of nominally replicate time series, the time axes can be variously shifted, compressed and expanded, in complex, non-linear ways. Additionally, in some circumstances, the scale of the measured data can vary systematically from one replicate to the next, and even within a given replicate. We propose a Continuous Profile Model (CPM) for simultaneously analyzing a set of such time series. In this model, each time series is generated as a noisy transformation of a single latent trace. The latent trace is an underlying, noiseless representation of the set of replicated, observable time series. Output time series are generated from this model by moving through a sequence of hidden states in a Markovian manner and emitting an observable value at each step, as in an HMM. Each hidden state corresponds to a particular location in the latent trace, and the emitted value from the state depends on the value of the latent trace at that position. To account for changes in the amplitude of the signals across and within replicates, the latent time states are augmented by a set of scale states, which control how the emission signal will be scaled relative to the value of the latent trace. During training, the latent trace is learned, as well as the transition probabilities controlling the Markovian evolution of the scale and time states and the overall noise level of the observed data. After training, the latent trace learned by the model represents a higher resolution ’fusion’ of the experimental replicates. Figure 1 illustrate the model in action. Unaligned, Linear Warp Alignment and CPM Alignment Amplitude 40 30 20 10 0 50 Amplitude 40 30 20 10 Amplitude 0 30 20 10 0 Time a) b) Figure 1: a) Top: ten replicated speech energy signals as described in Section 4), Middle: same signals, aligned using a linear warp with an offset, Bottom: aligned with CPM (the learned latent trace is also shown in cyan). b) Speech waveforms corresponding to energy signals in a), Top: unaligned originals, Bottom: aligned using CPM. 2 Defining the Continuous Profile Model (CPM) The CPM is generative model for a set of K time series, xk = (xk , xk , ..., xk k ). The 1 2 N temporal sampling rate within each xk need not be uniform, nor must it be the same across the different xk . Constraints on the variability of the sampling rate are discussed at the end of this section. 
For notational convenience, we henceforth assume $N^k = N$ for all $k$, but this is not a requirement of the model. The CPM is set up as follows: We assume that there is a latent trace, $z = (z_1, z_2, \ldots, z_M)$, a canonical representation of the set of noisy input replicate time series. Any given observed time series in the set is modeled as a non-uniformly subsampled version of the latent trace to which local scale transformations have been applied. Ideally, $M$ would be infinite, or at least very large relative to $N$, so that any experimental data could be mapped precisely to the correct underlying trace point. Aside from the computational impracticalities this would pose, great care to avoid overfitting would have to be taken. Thus in practice we have used $M = (2 + \epsilon)N$ (double the resolution, plus some slack on each end) in our experiments and found this to be sufficient with $\epsilon < 0.2$. Because the resolution of the latent trace is higher than that of the observed time series, experimental time can be made effectively to speed up or slow down by advancing along the latent trace in larger or smaller jumps.

The subsampling and local scaling used during the generation of each observed time series are determined by a sequence of hidden state variables. Let the state sequence for observation $k$ be $\pi^k$. Each state in the state sequence maps to a time state/scale state pair: $\pi^k_i \rightarrow \{\tau^k_i, \phi^k_i\}$. Time states belong to the integer set $(1 \ldots M)$; scale states belong to an ordered set $(\phi^1 \ldots \phi^Q)$. (In our experiments we have used $Q = 7$, evenly spaced scales in logarithmic space.) States $\pi^k_i$ and observation values $x^k_i$ are related by the emission probability distribution: $A_{\pi^k_i}(x^k_i \mid z) \equiv p(x^k_i \mid \pi^k_i, z, \sigma, u^k) \equiv \mathcal{N}(x^k_i;\, z_{\tau^k_i} \phi^k_i u^k,\, \sigma)$, where $\sigma$ is the noise level of the observed data and $\mathcal{N}(a; b, c)$ denotes a Gaussian probability density for $a$ with mean $b$ and standard deviation $c$. The $u^k$ are real-valued scale parameters, one per observed time series, that correct for any overall scale difference between time series $k$ and the latent trace.

To fully specify our model we also need to define the state transition probabilities. We define the transitions between time states and between scale states separately, so that $T^k_{\pi_{i-1}, \pi_i} \equiv p(\pi_i \mid \pi_{i-1}) = p(\phi_i \mid \phi_{i-1})\, p^k(\tau_i \mid \tau_{i-1})$. The constraint that time must move forward, cannot stand still, and can jump ahead no more than $J_\tau$ time states is enforced. (In our experiments we used $J_\tau = 3$.) As well, we only allow scale state transitions between neighbouring scale states, so that the local scale cannot jump arbitrarily. These constraints keep the number of legal transitions to a tractable computational size and work well in practice. Each observed time series has its own time transition probability distribution to account for experiment-specific patterns. Both the time and scale transition probability distributions are given by multinomials:

$$p^k(\tau_i = a \mid \tau_{i-1} = b) = \begin{cases} d^k_1, & \text{if } a - b = 1 \\ d^k_2, & \text{if } a - b = 2 \\ \;\;\vdots & \\ d^k_{J_\tau}, & \text{if } a - b = J_\tau \\ 0, & \text{otherwise} \end{cases}$$

$$p(\phi_i = a \mid \phi_{i-1} = b) = \begin{cases} s_0, & \text{if } D(a, b) = 0 \\ s_1, & \text{if } D(a, b) = 1 \\ s_1, & \text{if } D(a, b) = -1 \\ 0, & \text{otherwise} \end{cases}$$

where $D(a, b) = 1$ means that $a$ is one scale state larger than $b$, $D(a, b) = -1$ means that $a$ is one scale state smaller than $b$, and $D(a, b) = 0$ means that $a = b$. The distributions are constrained by $\sum_{i=1}^{J_\tau} d^k_i = 1$ and $2 s_1 + s_0 = 1$. $J_\tau$ determines the maximum allowable instantaneous speedup of one portion of a time series relative to another portion, within the same series or across different series.
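As a quick numerical check of the emission and transition definitions above, the following sketch evaluates the log emission density and the constrained transition log-probabilities for a single step. The helper names and the example parameter values are assumptions made for illustration, not part of the paper.

```python
import numpy as np

def log_emission(x, z, tau, phi_q, u, sigma):
    """log A_pi(x | z) = log N(x; z[tau] * phi_q * u, sigma)."""
    mean = z[tau] * phi_q * u
    return -0.5 * ((x - mean) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

def log_time_transition(a, b, d):
    """log p^k(tau_i = a | tau_{i-1} = b): only forward jumps of 1..J_tau are legal."""
    jump = a - b
    return np.log(d[jump - 1]) if 1 <= jump <= len(d) else -np.inf

def log_scale_transition(a, b, s0, s1):
    """log p(phi_i = a | phi_{i-1} = b): stay put, or move to a neighbouring scale state."""
    if a == b:
        return np.log(s0)
    if abs(a - b) == 1:
        return np.log(s1)
    return -np.inf

# Example parameters respecting the stated constraints.
d = np.array([0.7, 0.2, 0.1])      # J_tau = 3 jump probabilities, sum to 1
s0, s1 = 0.8, 0.1                  # 2*s1 + s0 = 1
assert np.isclose(d.sum(), 1.0) and np.isclose(2 * s1 + s0, 1.0)

# One full state transition factorizes as T^k_{pi_{i-1}, pi_i} = p(phi|phi') * p^k(tau|tau'):
log_T = log_time_transition(12, 10, d) + log_scale_transition(4, 3, s0, s1)
```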
However, the length of time for which any series can move so rapidly is constrained by the length of the latent trace; thus the maximum overall ratio in speeds achievable by the model between any two entire time series is given by $\min(J_\tau, \frac{M}{N})$.

After training, one may examine either the latent trace or the alignment of each observable time series to the latent trace. Such alignments can be achieved by several methods, including use of the Viterbi algorithm to find the highest-likelihood path through the hidden states [1], or sampling from the posterior over hidden state sequences. We found Viterbi alignments to work well in the experiments below; samples from the posterior looked quite similar.

3 Training with the Expectation-Maximization (EM) Algorithm

As with HMMs, training with the EM algorithm (often referred to as Baum-Welch in the context of HMMs [1]) is a natural choice. In our model the E-Step is computed exactly using the Forward-Backward algorithm [1], which provides the posterior probability over states for each time point of every observed time series, $\gamma^k_s(i) \equiv p(\pi_i = s \mid x^k)$, and also the pairwise state posteriors, $\xi^k_{s,t}(i) \equiv p(\pi_{i-1} = s, \pi_i = t \mid x^k)$. The algorithm is modified only in that the emission probabilities depend on the latent trace as described in Section 2. The M-Step consists of a series of analytical updates to the various parameters as detailed below.

Given the latent trace (and the emission and state transition probabilities), the complete log likelihood of the $K$ observed time series, $x^k$, is given by $L_p \equiv L + P$. $L$ is the likelihood term arising in a (conditional) HMM model, and can be obtained from the Forward-Backward algorithm. It is composed of the emission and state transition terms. $P$ is the log prior (or penalty term), regularizing various aspects of the model parameters as explained below. These two terms are:

$$L \equiv \sum_{k=1}^{K}\left[\sum_{i=1}^{N} \log A_{\pi^k_i}(x^k_i \mid z) + \log p(\pi^k_1) + \sum_{i=2}^{N} \log T^k_{\pi_{i-1}, \pi_i}\right] \quad (1)$$

$$P \equiv -\lambda \sum_{j=1}^{M-1} (z_{j+1} - z_j)^2 + \sum_{k=1}^{K} \log \mathcal{D}(d^k \mid \{\eta^k_v\}) + \log \mathcal{D}(s \mid \{\eta_v\}) \quad (2)$$

where $p(\pi^k_1)$ are priors over the initial states. The first term in Equation 2 is a smoothing penalty on the latent trace, with $\lambda$ controlling the amount of smoothing. $\eta^k_v$ and $\eta_v$ are Dirichlet hyperprior parameters for the time and scale state transition probability distributions respectively. These ensure that all non-zero transition probabilities remain non-zero. For the time state transitions, $v \in \{1, \ldots, J_\tau\}$ and $\eta^k_v$ corresponds to the pseudo-count data for the parameters $d^k_1, d^k_2, \ldots, d^k_{J_\tau}$. For the scale state transitions, $v \in \{0, 1\}$ and $\eta_v$ corresponds to the pseudo-count data for the parameters $s_0$ and $s_1$. Letting $S$ be the total number of possible states, that is, the number of elements in the cross-product of possible time states and possible scale states, the expected complete log likelihood is:

$$\langle L_p \rangle = \sum_{k=1}^{K} \sum_{s=1}^{S} \gamma^k_s(1) \log T_{0,s} + \cdots$$
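For readers who want to see Equations 1 and 2 operationally, the sketch below scores a fixed set of hidden-state paths under the complete-data log likelihood plus the smoothing penalty on the latent trace. It is illustrative Python, not the authors' implementation; it omits the initial-state prior and the Dirichlet terms of Equation 2, and it assumes the supplied paths already respect the transition constraints.

```python
import numpy as np

def penalized_complete_log_likelihood(xs, paths, z, phi, u, sigma, d, s0, s1, lam):
    """Score L + (smoothing part of P) for given state paths.
    xs[k]    : k-th observed series (length N)
    paths[k] : list of (tau, q) hidden-state pairs, one per observation
    d[k]     : per-series time-jump distribution d^k over jumps 1..J_tau
    Assumes every path uses only legal jumps and neighbouring scale moves."""
    L = 0.0
    for k, (x, path) in enumerate(zip(xs, paths)):
        for i, (tau, q) in enumerate(path):
            mean = z[tau] * phi[q] * u[k]                 # emission mean z_tau * phi_q * u^k
            L += -0.5 * ((x[i] - mean) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
            if i > 0:
                tau_prev, q_prev = path[i - 1]
                L += np.log(d[k][tau - tau_prev - 1])     # time jump of 1..J_tau
                L += np.log(s0 if q == q_prev else s1)    # stay or move one scale state
    penalty = -lam * np.sum(np.diff(z) ** 2)              # -lambda * sum_j (z_{j+1} - z_j)^2
    return L + penalty

# Toy usage with two short series aligned to a common trace (values are illustrative).
z = np.linspace(0.0, 1.0, 12)
phi = np.exp(np.linspace(-0.3, 0.3, 7))
xs = [np.array([0.1, 0.3, 0.55]), np.array([0.12, 0.28, 0.6])]
paths = [[(0, 3), (2, 3), (5, 4)], [(1, 3), (3, 3), (6, 3)]]
score = penalized_complete_log_likelihood(
    xs, paths, z, phi, u=[1.0, 1.0], sigma=0.1,
    d=[np.array([0.7, 0.2, 0.1])] * 2, s0=0.8, s1=0.1, lam=1.0)
```

In EM proper, the same emission and transition terms appear weighted by the posteriors $\gamma$ and $\xi$ from the E-Step, and the M-Step maximizes that expectation analytically.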
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 124 nips-2004-Multiple Alignment of Continuous Time Series
2 0.17444342 163 nips-2004-Semi-parametric Exponential Family PCA
Author: Sajama Sajama, Alon Orlitsky
Abstract: We present a semi-parametric latent variable model based technique for density modelling, dimensionality reduction and visualization. Unlike previous methods, we estimate the latent distribution non-parametrically which enables us to model data generated by an underlying low dimensional, multimodal distribution. In addition, we allow the components of latent variable models to be drawn from the exponential family which makes the method suitable for special data types, for example binary or count data. Simulations on real valued, binary and count data show favorable comparison to other related schemes both in terms of separating different populations and generalization to unseen samples.
3 0.16866222 125 nips-2004-Multiple Relational Embedding
Author: Roland Memisevic, Geoffrey E. Hinton
Abstract: We describe a way of using multiple different types of similarity relationship to learn a low-dimensional embedding of a dataset. Our method chooses different, possibly overlapping representations of similarity by individually reweighting the dimensions of a common underlying latent space. When applied to a single similarity relation that is based on Euclidean distances between the input data points, the method reduces to simple dimensionality reduction. If additional information is available about the dataset or about subsets of it, we can use this information to clean up or otherwise improve the embedding. We demonstrate the potential usefulness of this form of semi-supervised dimensionality reduction on some simple examples.
4 0.12363402 113 nips-2004-Maximum-Margin Matrix Factorization
Author: Nathan Srebro, Jason Rennie, Tommi S. Jaakkola
Abstract: We present a novel approach to collaborative prediction, using low-norm instead of low-rank factorizations. The approach is inspired by, and has strong connections to, large-margin linear discrimination. We show how to learn low-norm factorizations by solving a semi-definite program, and discuss generalization error bounds for them.
5 0.11426613 52 nips-2004-Discrete profile alignment via constrained information bottleneck
Author: Sean O'rourke, Gal Chechik, Robin Friedman, Eleazar Eskin
Abstract: Amino acid profiles, which capture position-specific mutation probabilities, are a richer encoding of biological sequences than the individual sequences themselves. However, profile comparisons are much more computationally expensive than discrete symbol comparisons, making profiles impractical for many large datasets. Furthermore, because they are such a rich representation, profiles can be difficult to visualize. To overcome these problems, we propose a discretization for profiles using an expanded alphabet representing not just individual amino acids, but common profiles. By using an extension of information bottleneck (IB) incorporating constraints and priors on the class distributions, we find an informationally optimal alphabet. This discretization yields a concise, informative textual representation for profile sequences. Also, alignments between these sequences, while nearly as accurate as the full profile-profile alignments, can be computed almost as quickly as those between individual or consensus sequences. A full pairwise alignment of SwissProt would take years using profiles, but less than 3 days using a discrete IB encoding, illustrating how discrete encoding can expand the range of sequence problems to which profile information can be applied.
6 0.10770315 50 nips-2004-Dependent Gaussian Processes
7 0.09360303 90 nips-2004-Joint Probabilistic Curve Clustering and Alignment
8 0.08825814 66 nips-2004-Exponential Family Harmoniums with an Application to Information Retrieval
9 0.075335532 56 nips-2004-Dynamic Bayesian Networks for Brain-Computer Interfaces
10 0.072279498 5 nips-2004-A Harmonic Excitation State-Space Approach to Blind Separation of Speech
11 0.070910811 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach
12 0.06897556 197 nips-2004-Two-Dimensional Linear Discriminant Analysis
13 0.068578266 170 nips-2004-Similarity and Discrimination in Classical Conditioning: A Latent Variable Account
14 0.065315969 79 nips-2004-Hierarchical Eigensolver for Transition Matrices in Spectral Methods
15 0.064133018 28 nips-2004-Bayesian inference in spiking neurons
16 0.062906057 6 nips-2004-A Hidden Markov Model for de Novo Peptide Sequencing
17 0.059410449 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning
18 0.059381004 64 nips-2004-Experts in a Markov Decision Process
19 0.059052192 107 nips-2004-Making Latin Manuscripts Searchable using gHMM's
20 0.058166437 7 nips-2004-A Large Deviation Bound for the Area Under the ROC Curve
topicId topicWeight
[(0, -0.179), (1, -0.019), (2, 0.012), (3, -0.147), (4, -0.088), (5, -0.002), (6, -0.068), (7, 0.059), (8, 0.194), (9, -0.01), (10, 0.083), (11, -0.112), (12, 0.033), (13, 0.15), (14, 0.205), (15, -0.032), (16, -0.033), (17, -0.042), (18, -0.093), (19, -0.158), (20, -0.144), (21, 0.088), (22, -0.069), (23, 0.009), (24, 0.129), (25, 0.102), (26, 0.009), (27, -0.093), (28, -0.008), (29, -0.128), (30, 0.103), (31, 0.11), (32, -0.189), (33, -0.18), (34, -0.02), (35, -0.021), (36, 0.085), (37, -0.02), (38, -0.031), (39, 0.026), (40, -0.092), (41, -0.1), (42, 0.03), (43, -0.124), (44, -0.027), (45, -0.002), (46, -0.045), (47, -0.021), (48, 0.052), (49, -0.01)]
simIndex simValue paperId paperTitle
same-paper 1 0.97974986 124 nips-2004-Multiple Alignment of Continuous Time Series
Author: Jennifer Listgarten, Radford M. Neal, Sam T. Roweis, Andrew Emili
Abstract: Multiple realizations of continuous-valued time series from a stochastic process often contain systematic variations in rate and amplitude. To leverage the information contained in such noisy replicate sets, we need to align them in an appropriate way (for example, to allow the data to be properly combined by adaptive averaging). We present the Continuous Profile Model (CPM), a generative model in which each observed time series is a non-uniformly subsampled version of a single latent trace, to which local rescaling and additive noise are applied. After unsupervised training, the learned trace represents a canonical, high resolution fusion of all the replicates. As well, an alignment in time and scale of each observation to this trace can be found by inference in the model. We apply CPM to successfully align speech signals from multiple speakers and sets of Liquid Chromatography-Mass Spectrometry proteomic data. 1 A Profile Model for Continuous Data When observing multiple time series generated by a noisy, stochastic process, large systematic sources of variability are often present. For example, within a set of nominally replicate time series, the time axes can be variously shifted, compressed and expanded, in complex, non-linear ways. Additionally, in some circumstances, the scale of the measured data can vary systematically from one replicate to the next, and even within a given replicate. We propose a Continuous Profile Model (CPM) for simultaneously analyzing a set of such time series. In this model, each time series is generated as a noisy transformation of a single latent trace. The latent trace is an underlying, noiseless representation of the set of replicated, observable time series. Output time series are generated from this model by moving through a sequence of hidden states in a Markovian manner and emitting an observable value at each step, as in an HMM. Each hidden state corresponds to a particular location in the latent trace, and the emitted value from the state depends on the value of the latent trace at that position. To account for changes in the amplitude of the signals across and within replicates, the latent time states are augmented by a set of scale states, which control how the emission signal will be scaled relative to the value of the latent trace. During training, the latent trace is learned, as well as the transition probabilities controlling the Markovian evolution of the scale and time states and the overall noise level of the observed data. After training, the latent trace learned by the model represents a higher resolution ’fusion’ of the experimental replicates. Figure 1 illustrate the model in action. Unaligned, Linear Warp Alignment and CPM Alignment Amplitude 40 30 20 10 0 50 Amplitude 40 30 20 10 Amplitude 0 30 20 10 0 Time a) b) Figure 1: a) Top: ten replicated speech energy signals as described in Section 4), Middle: same signals, aligned using a linear warp with an offset, Bottom: aligned with CPM (the learned latent trace is also shown in cyan). b) Speech waveforms corresponding to energy signals in a), Top: unaligned originals, Bottom: aligned using CPM. 2 Defining the Continuous Profile Model (CPM) The CPM is generative model for a set of K time series, xk = (xk , xk , ..., xk k ). The 1 2 N temporal sampling rate within each xk need not be uniform, nor must it be the same across the different xk . Constraints on the variability of the sampling rate are discussed at the end of this section. 
For notational convenience, we henceforth assume N k = N for all k, but this is not a requirement of the model. The CPM is set up as follows: We assume that there is a latent trace, z = (z1 , z2 , ..., zM ), a canonical representation of the set of noisy input replicate time series. Any given observed time series in the set is modeled as a non-uniformly subsampled version of the latent trace to which local scale transformations have been applied. Ideally, M would be infinite, or at least very large relative to N so that any experimental data could be mapped precisely to the correct underlying trace point. Aside from the computational impracticalities this would pose, great care to avoid overfitting would have to be taken. Thus in practice, we have used M = (2 + )N (double the resolution, plus some slack on each end) in our experiments and found this to be sufficient with < 0.2. Because the resolution of the latent trace is higher than that of the observed time series, experimental time can be made effectively to speed up or slow down by advancing along the latent trace in larger or smaller jumps. The subsampling and local scaling used during the generation of each observed time series are determined by a sequence of hidden state variables. Let the state sequence for observation k be π k . Each state in the state sequence maps to a time state/scale state pair: k πi → {τik , φk }. Time states belong to the integer set (1..M ); scale states belong to an i ordered set (φ1 ..φQ ). (In our experiments we have used Q=7, evenly spaced scales in k logarithmic space). States, πi , and observation values, xk , are related by the emission i k probability distribution: Aπi (xk |z) ≡ p(xk |πi , z, σ, uk ) ≡ N (xk ; zτik φk uk , σ), where σ k i i i i is the noise level of the observed data, N (a; b, c) denotes a Gaussian probability density for a with mean b and standard deviation c. The uk are real-valued scale parameters, one per observed time series, that correct for any overall scale difference between time series k and the latent trace. To fully specify our model we also need to define the state transition probabilities. We define the transitions between time states and between scale states separately, so that k Tπi−1 ,πi ≡ p(πi |πi−1 ) = p(φi |φi−1 )pk (τi |τi−1 ). The constraint that time must move forward, cannot stand still, and that it can jump ahead no more than Jτ time states is enforced. (In our experiments we used Jτ = 3.) As well, we only allow scale state transitions between neighbouring scale states so that the local scale cannot jump arbitrarily. These constraints keep the number of legal transitions to a tractable computational size and work well in practice. Each observed time series has its own time transition probability distribution to account for experiment-specific patterns. Both the time and scale transition probability distributions are given by multinomials: dk , if a − b = 1 1 k d2 , if a − b = 2 k . p (τi = a|τi−1 = b) = . . k d , if a − b = J τ Jτ 0, otherwise p(φi = a|φi−1 s0 , if D(a, b) = 0 s1 , if D(a, b) = 1 = b) = s1 , if D(a, b) = −1 0, otherwise where D(a, b) = 1 means that a is one scale state larger than b, and D(a, b) = −1 means that a is one scale state smaller than b, and D(a, b) = 0 means that a = b. The distributions Jτ are constrained by: i=1 dk = 1 and 2s1 + s0 = 1. i Jτ determines the maximum allowable instantaneous speedup of one portion of a time series relative to another portion, within the same series or across different series. 
However, the length of time for which any series can move so rapidly is constrained by the length of the latent trace; thus the maximum overall ratio in speeds achievable by the model between any two entire time series is given by min(Jτ , M ). N After training, one may examine either the latent trace or the alignment of each observable time series to the latent trace. Such alignments can be achieved by several methods, including use of the Viterbi algorithm to find the highest likelihood path through the hidden states [1], or sampling from the posterior over hidden state sequences. We found Viterbi alignments to work well in the experiments below; samples from the posterior looked quite similar. 3 Training with the Expectation-Maximization (EM) Algorithm As with HMMs, training with the EM algorithm (often referred to as Baum-Welch in the context of HMMs [1]), is a natural choice. In our model the E-Step is computed exactly using the Forward-Backward algorithm [1], which provides the posterior probability over k states for each time point of every observed time series, γs (i) ≡ p(πi = s|x) and also the pairwise state posteriors, ξs,t (i) ≡ p(πi−1 = s, πi = t|xk ). The algorithm is modified only in that the emission probabilities depend on the latent trace as described in Section 2. The M-Step consists of a series of analytical updates to the various parameters as detailed below. Given the latent trace (and the emission and state transition probabilities), the complete log likelihood of K observed time series, xk , is given by Lp ≡ L + P. L is the likelihood term arising in a (conditional) HMM model, and can be obtained from the Forward-Backward algorithm. It is composed of the emission and state transition terms. P is the log prior (or penalty term), regularizing various aspects of the model parameters as explained below. These two terms are: K N N L≡ log Aπi (xk |z) + i log p(π1 ) + τ −1 K (zj+1 − zj )2 + P ≡ −λ (1) i=2 i=1 k=1 k log Tπi−1 ,πi j=1 k log D(dk |{ηv }) + log D(sv |{ηv }), v (2) k=1 where p(π1 ) are priors over the initial states. The first term in Equation 2 is a smoothing k penalty on the latent trace, with λ controlling the amount of smoothing. ηv and ηv are Dirichlet hyperprior parameters for the time and scale state transition probability distributions respectively. These ensure that all non-zero transition probabilities remain non-zero. k For the time state transitions, v ∈ {1, Jτ } and ηv corresponds to the pseudo-count data for k the parameters d1 , d2 . . . dJτ . For the scale state transitions, v ∈ {0, 1} and ηv corresponds to the pseudo-count data for the parameters s0 and s1 . Letting S be the total number of possible states, that is, the number of elements in the cross-product of possible time states and possible scale states, the expected complete log likelihood is: K S K p k k γs (1) log T0,s
2 0.59283859 66 nips-2004-Exponential Family Harmoniums with an Application to Information Retrieval
Author: Max Welling, Michal Rosen-zvi, Geoffrey E. Hinton
Abstract: Directed graphical models with one layer of observed random variables and one or more layers of hidden random variables have been the dominant modelling paradigm in many research fields. Although this approach has met with considerable success, the causal semantics of these models can make it difficult to infer the posterior distribution over the hidden variables. In this paper we propose an alternative two-layer model based on exponential family distributions and the semantics of undirected models. Inference in these “exponential family harmoniums” is fast while learning is performed by minimizing contrastive divergence. A member of this family is then studied as an alternative probabilistic model for latent semantic indexing. In experiments it is shown that they perform well on document retrieval tasks and provide an elegant solution to searching with keywords.
3 0.58400351 163 nips-2004-Semi-parametric Exponential Family PCA
Author: Sajama Sajama, Alon Orlitsky
Abstract: We present a semi-parametric latent variable model based technique for density modelling, dimensionality reduction and visualization. Unlike previous methods, we estimate the latent distribution non-parametrically which enables us to model data generated by an underlying low dimensional, multimodal distribution. In addition, we allow the components of latent variable models to be drawn from the exponential family which makes the method suitable for special data types, for example binary or count data. Simulations on real valued, binary and count data show favorable comparison to other related schemes both in terms of separating different populations and generalization to unseen samples. 1
4 0.57752228 125 nips-2004-Multiple Relational Embedding
Author: Roland Memisevic, Geoffrey E. Hinton
Abstract: We describe a way of using multiple different types of similarity relationship to learn a low-dimensional embedding of a dataset. Our method chooses different, possibly overlapping representations of similarity by individually reweighting the dimensions of a common underlying latent space. When applied to a single similarity relation that is based on Euclidean distances between the input data points, the method reduces to simple dimensionality reduction. If additional information is available about the dataset or about subsets of it, we can use this information to clean up or otherwise improve the embedding. We demonstrate the potential usefulness of this form of semi-supervised dimensionality reduction on some simple examples. 1
5 0.53259474 170 nips-2004-Similarity and Discrimination in Classical Conditioning: A Latent Variable Account
Author: Aaron C. Courville, Nathaniel D. Daw, David S. Touretzky
Abstract: We propose a probabilistic, generative account of configural learning phenomena in classical conditioning. Configural learning experiments probe how animals discriminate and generalize between patterns of simultaneously presented stimuli (such as tones and lights) that are differentially predictive of reinforcement. Previous models of these issues have been successful more on a phenomenological than an explanatory level: they reproduce experimental findings but, lacking formal foundations, provide scant basis for understanding why animals behave as they do. We present a theory that clarifies seemingly arbitrary aspects of previous models while also capturing a broader set of data. Key patterns of data, e.g. concerning animals’ readiness to distinguish patterns with varying degrees of overlap, are shown to follow from statistical inference.
6 0.46914965 6 nips-2004-A Hidden Markov Model for de Novo Peptide Sequencing
7 0.42546085 52 nips-2004-Discrete profile alignment via constrained information bottleneck
8 0.41967756 74 nips-2004-Harmonising Chorales by Probabilistic Inference
9 0.378748 90 nips-2004-Joint Probabilistic Curve Clustering and Alignment
10 0.36367041 50 nips-2004-Dependent Gaussian Processes
11 0.36128578 147 nips-2004-Planning for Markov Decision Processes with Sparse Stochasticity
12 0.34347728 56 nips-2004-Dynamic Bayesian Networks for Brain-Computer Interfaces
13 0.33706886 159 nips-2004-Schema Learning: Experience-Based Construction of Predictive Action Models
14 0.32928136 95 nips-2004-Large-Scale Prediction of Disulphide Bond Connectivity
15 0.32402009 149 nips-2004-Probabilistic Inference of Alternative Splicing Events in Microarray Data
16 0.31646669 185 nips-2004-The Convergence of Contrastive Divergences
17 0.30677092 120 nips-2004-Modeling Conversational Dynamics as a Mixed-Memory Markov Process
18 0.30156425 180 nips-2004-Synchronization of neural networks by mutual learning and its application to cryptography
19 0.29058194 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
20 0.28504047 193 nips-2004-Theories of Access Consciousness
topicId topicWeight
[(13, 0.126), (15, 0.131), (25, 0.03), (26, 0.045), (31, 0.03), (33, 0.174), (35, 0.014), (39, 0.083), (50, 0.03), (71, 0.013), (76, 0.015), (84, 0.2)]
simIndex simValue paperId paperTitle
same-paper 1 0.85904729 124 nips-2004-Multiple Alignment of Continuous Time Series
Author: Jennifer Listgarten, Radford M. Neal, Sam T. Roweis, Andrew Emili
Abstract: Multiple realizations of continuous-valued time series from a stochastic process often contain systematic variations in rate and amplitude. To leverage the information contained in such noisy replicate sets, we need to align them in an appropriate way (for example, to allow the data to be properly combined by adaptive averaging). We present the Continuous Profile Model (CPM), a generative model in which each observed time series is a non-uniformly subsampled version of a single latent trace, to which local rescaling and additive noise are applied. After unsupervised training, the learned trace represents a canonical, high resolution fusion of all the replicates. As well, an alignment in time and scale of each observation to this trace can be found by inference in the model. We apply CPM to successfully align speech signals from multiple speakers and sets of Liquid Chromatography-Mass Spectrometry proteomic data. 1 A Profile Model for Continuous Data When observing multiple time series generated by a noisy, stochastic process, large systematic sources of variability are often present. For example, within a set of nominally replicate time series, the time axes can be variously shifted, compressed and expanded, in complex, non-linear ways. Additionally, in some circumstances, the scale of the measured data can vary systematically from one replicate to the next, and even within a given replicate. We propose a Continuous Profile Model (CPM) for simultaneously analyzing a set of such time series. In this model, each time series is generated as a noisy transformation of a single latent trace. The latent trace is an underlying, noiseless representation of the set of replicated, observable time series. Output time series are generated from this model by moving through a sequence of hidden states in a Markovian manner and emitting an observable value at each step, as in an HMM. Each hidden state corresponds to a particular location in the latent trace, and the emitted value from the state depends on the value of the latent trace at that position. To account for changes in the amplitude of the signals across and within replicates, the latent time states are augmented by a set of scale states, which control how the emission signal will be scaled relative to the value of the latent trace. During training, the latent trace is learned, as well as the transition probabilities controlling the Markovian evolution of the scale and time states and the overall noise level of the observed data. After training, the latent trace learned by the model represents a higher resolution ’fusion’ of the experimental replicates. Figure 1 illustrate the model in action. Unaligned, Linear Warp Alignment and CPM Alignment Amplitude 40 30 20 10 0 50 Amplitude 40 30 20 10 Amplitude 0 30 20 10 0 Time a) b) Figure 1: a) Top: ten replicated speech energy signals as described in Section 4), Middle: same signals, aligned using a linear warp with an offset, Bottom: aligned with CPM (the learned latent trace is also shown in cyan). b) Speech waveforms corresponding to energy signals in a), Top: unaligned originals, Bottom: aligned using CPM. 2 Defining the Continuous Profile Model (CPM) The CPM is generative model for a set of K time series, xk = (xk , xk , ..., xk k ). The 1 2 N temporal sampling rate within each xk need not be uniform, nor must it be the same across the different xk . Constraints on the variability of the sampling rate are discussed at the end of this section. 
For notational convenience, we henceforth assume N k = N for all k, but this is not a requirement of the model. The CPM is set up as follows: We assume that there is a latent trace, z = (z1 , z2 , ..., zM ), a canonical representation of the set of noisy input replicate time series. Any given observed time series in the set is modeled as a non-uniformly subsampled version of the latent trace to which local scale transformations have been applied. Ideally, M would be infinite, or at least very large relative to N so that any experimental data could be mapped precisely to the correct underlying trace point. Aside from the computational impracticalities this would pose, great care to avoid overfitting would have to be taken. Thus in practice, we have used M = (2 + )N (double the resolution, plus some slack on each end) in our experiments and found this to be sufficient with < 0.2. Because the resolution of the latent trace is higher than that of the observed time series, experimental time can be made effectively to speed up or slow down by advancing along the latent trace in larger or smaller jumps. The subsampling and local scaling used during the generation of each observed time series are determined by a sequence of hidden state variables. Let the state sequence for observation k be π k . Each state in the state sequence maps to a time state/scale state pair: k πi → {τik , φk }. Time states belong to the integer set (1..M ); scale states belong to an i ordered set (φ1 ..φQ ). (In our experiments we have used Q=7, evenly spaced scales in k logarithmic space). States, πi , and observation values, xk , are related by the emission i k probability distribution: Aπi (xk |z) ≡ p(xk |πi , z, σ, uk ) ≡ N (xk ; zτik φk uk , σ), where σ k i i i i is the noise level of the observed data, N (a; b, c) denotes a Gaussian probability density for a with mean b and standard deviation c. The uk are real-valued scale parameters, one per observed time series, that correct for any overall scale difference between time series k and the latent trace. To fully specify our model we also need to define the state transition probabilities. We define the transitions between time states and between scale states separately, so that k Tπi−1 ,πi ≡ p(πi |πi−1 ) = p(φi |φi−1 )pk (τi |τi−1 ). The constraint that time must move forward, cannot stand still, and that it can jump ahead no more than Jτ time states is enforced. (In our experiments we used Jτ = 3.) As well, we only allow scale state transitions between neighbouring scale states so that the local scale cannot jump arbitrarily. These constraints keep the number of legal transitions to a tractable computational size and work well in practice. Each observed time series has its own time transition probability distribution to account for experiment-specific patterns. Both the time and scale transition probability distributions are given by multinomials: dk , if a − b = 1 1 k d2 , if a − b = 2 k . p (τi = a|τi−1 = b) = . . k d , if a − b = J τ Jτ 0, otherwise p(φi = a|φi−1 s0 , if D(a, b) = 0 s1 , if D(a, b) = 1 = b) = s1 , if D(a, b) = −1 0, otherwise where D(a, b) = 1 means that a is one scale state larger than b, and D(a, b) = −1 means that a is one scale state smaller than b, and D(a, b) = 0 means that a = b. The distributions Jτ are constrained by: i=1 dk = 1 and 2s1 + s0 = 1. i Jτ determines the maximum allowable instantaneous speedup of one portion of a time series relative to another portion, within the same series or across different series. 
However, the length of time for which any series can move so rapidly is constrained by the length of the latent trace; thus the maximum overall ratio in speeds achievable by the model between any two entire time series is given by $\min(J_\tau, \frac{M}{N})$.

After training, one may examine either the latent trace or the alignment of each observable time series to the latent trace. Such alignments can be achieved by several methods, including use of the Viterbi algorithm to find the highest likelihood path through the hidden states [1], or sampling from the posterior over hidden state sequences. We found Viterbi alignments to work well in the experiments below; samples from the posterior looked quite similar.

3 Training with the Expectation-Maximization (EM) Algorithm

As with HMMs, training with the EM algorithm (often referred to as Baum-Welch in the context of HMMs [1]) is a natural choice. In our model the E-step is computed exactly using the Forward-Backward algorithm [1], which provides the posterior probability over states for each time point of every observed time series, $\gamma^k_s(i) \equiv p(\pi_i = s \mid x^k)$, and also the pairwise state posteriors, $\xi^k_{s,t}(i) \equiv p(\pi_{i-1} = s, \pi_i = t \mid x^k)$. The algorithm is modified only in that the emission probabilities depend on the latent trace as described in Section 2. The M-step consists of a series of analytical updates to the various parameters as detailed below.

Given the latent trace (and the emission and state transition probabilities), the complete log likelihood of the $K$ observed time series, $x^k$, is given by $L_p \equiv L + P$. $L$ is the likelihood term arising in a (conditional) HMM model, and can be obtained from the Forward-Backward algorithm. It is composed of the emission and state transition terms. $P$ is the log prior (or penalty term), regularizing various aspects of the model parameters as explained below. These two terms are:

$$L \equiv \sum_{k=1}^{K}\left[\sum_{i=1}^{N}\log A_{\pi_i^k}(x_i^k|z) + \log p(\pi_1^k) + \sum_{i=2}^{N}\log T^k_{\pi_{i-1},\pi_i}\right] \qquad (1)$$

$$P \equiv -\lambda\sum_{j=1}^{M-1}(z_{j+1}-z_j)^2 + \sum_{k=1}^{K}\log D(d^k|\{\eta_v^k\}) + \log D(s|\{\eta_v\}) \qquad (2)$$

where $p(\pi_1^k)$ are priors over the initial states. The first term in Equation 2 is a smoothing penalty on the latent trace, with $\lambda$ controlling the amount of smoothing. $\eta_v^k$ and $\eta_v$ are Dirichlet hyperprior parameters for the time and scale state transition probability distributions respectively. These ensure that all non-zero transition probabilities remain non-zero. For the time state transitions, $v \in \{1, \ldots, J_\tau\}$ and $\eta_v^k$ corresponds to the pseudo-count data for the parameters $d^k_1, d^k_2, \ldots, d^k_{J_\tau}$. For the scale state transitions, $v \in \{0, 1\}$ and $\eta_v$ corresponds to the pseudo-count data for the parameters $s_0$ and $s_1$. Letting $S$ be the total number of possible states, that is, the number of elements in the cross-product of possible time states and possible scale states, the expected complete log likelihood is:

$$\langle L_p \rangle = \sum_{k=1}^{K}\sum_{s=1}^{S}\gamma^k_s(1)\log T^k_{0,s} + \cdots$$
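As a rough illustration of how the penalized objective $L_p = L + P$ is assembled, the sketch below computes the smoothing penalty and Dirichlet log-priors of Equation 2 and adds them to an HMM likelihood term assumed to come from a standard Forward-Backward pass. All names are hypothetical, and the Dirichlet normalizing constant is dropped; this is a sketch under those assumptions rather than the authors' implementation.

```python
import numpy as np

def smoothing_penalty(z, lam):
    """First term of P: -lambda * sum_j (z_{j+1} - z_j)^2, a smoothness prior on the latent trace."""
    return -lam * np.sum(np.diff(z) ** 2)

def dirichlet_logprior(probs, pseudo_counts):
    """log D(probs | eta) up to a constant: keeps transition probabilities away from zero."""
    eta = np.asarray(pseudo_counts, dtype=float)
    return np.sum((eta - 1.0) * np.log(probs))

def penalized_loglik(loglik_hmm, z, lam, d_k_list, s_probs, eta_time, eta_scale):
    """L_p = L + P, where L is the likelihood obtained from Forward-Backward on each series."""
    P = smoothing_penalty(z, lam)
    P += sum(dirichlet_logprior(d_k, eta_time) for d_k in d_k_list)  # time transitions, per series
    P += dirichlet_logprior(s_probs, eta_scale)                       # scale transitions, shared
    return loglik_hmm + P
```

In an EM loop, a quantity of this form would be monitored for convergence, with the E-step posteriors feeding analytical M-step updates of $z$, $u^k$, $\sigma$, $d^k$, and $(s_0, s_1)$.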
2 0.79100937 163 nips-2004-Semi-parametric Exponential Family PCA
Author: Sajama Sajama, Alon Orlitsky
Abstract: We present a semi-parametric latent variable model based technique for density modelling, dimensionality reduction and visualization. Unlike previous methods, we estimate the latent distribution non-parametrically which enables us to model data generated by an underlying low dimensional, multimodal distribution. In addition, we allow the components of latent variable models to be drawn from the exponential family which makes the method suitable for special data types, for example binary or count data. Simulations on real valued, binary and count data show favorable comparison to other related schemes both in terms of separating different populations and generalization to unseen samples.
3 0.78607941 42 nips-2004-Computing regularization paths for learning multiple kernels
Author: Francis R. Bach, Romain Thibaux, Michael I. Jordan
Abstract: The problem of learning a sparse conic combination of kernel functions or kernel matrices for classification or regression can be achieved via the regularization by a block 1-norm [1]. In this paper, we present an algorithm that computes the entire regularization path for these problems. The path is obtained by using numerical continuation techniques, and involves a running time complexity that is a constant times the complexity of solving the problem for one value of the regularization parameter. Working in the setting of kernel linear regression and kernel logistic regression, we show empirically that the effect of the block 1-norm regularization differs notably from the (non-block) 1-norm regularization commonly used for variable selection, and that the regularization path is of particular value in the block case.
4 0.78051895 70 nips-2004-Following Curved Regularized Optimization Solution Paths
Author: Saharon Rosset
Abstract: Regularization plays a central role in the analysis of modern data, where non-regularized fitting is likely to lead to over-fitted models, useless for both prediction and interpretation. We consider the design of incremental algorithms which follow paths of regularized solutions, as the regularization varies. These approaches often result in methods which are both efficient and highly flexible. We suggest a general path-following algorithm based on second-order approximations, prove that under mild conditions it remains “very close” to the path of optimal solutions and illustrate it with examples.
5 0.77558351 131 nips-2004-Non-Local Manifold Tangent Learning
Author: Yoshua Bengio, Martin Monperrus
Abstract: We claim and present arguments to the effect that a large class of manifold learning algorithms that are essentially local and can be framed as kernel learning algorithms will suffer from the curse of dimensionality, at the dimension of the true underlying manifold. This observation suggests to explore non-local manifold learning algorithms which attempt to discover shared structure in the tangent planes at different positions. A criterion for such an algorithm is proposed and experiments estimating a tangent plane prediction function are presented, showing its advantages with respect to local manifold learning algorithms: it is able to generalize very far from training data (on learning handwritten character image rotations), where a local non-parametric method fails.
6 0.77080286 198 nips-2004-Unsupervised Variational Bayesian Learning of Nonlinear Models
7 0.76896626 178 nips-2004-Support Vector Classification with Input Data Uncertainty
8 0.76741964 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons
9 0.7662192 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
10 0.76572347 102 nips-2004-Learning first-order Markov models for control
11 0.76460856 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms
12 0.76441193 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform
13 0.76391309 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition
14 0.76317823 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
15 0.76312834 113 nips-2004-Maximum-Margin Matrix Factorization
16 0.76258528 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning
17 0.75928187 71 nips-2004-Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices
18 0.7580716 22 nips-2004-An Investigation of Practical Approximate Nearest Neighbor Algorithms
19 0.75793689 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
20 0.75734037 10 nips-2004-A Probabilistic Model for Online Document Clustering with Application to Novelty Detection