nips nips2002 nips2002-25 knowledge-graph by maker-knowledge-mining

25 nips-2002-An Asynchronous Hidden Markov Model for Audio-Visual Speech Recognition


Source: pdf

Author: Samy Bengio

Abstract: This paper presents a novel Hidden Markov Model architecture to model the joint probability of pairs of asynchronous sequences describing the same event. It is based on two other Markovian models, namely Asynchronous Input/ Output Hidden Markov Models and Pair Hidden Markov Models. An EM algorithm to train the model is presented, as well as a Viterbi decoder that can be used to obtain the optimal state sequence as well as the alignment between the two sequences. The model has been tested on an audio-visual speech recognition task using the M2VTS database and yielded robust performances under various noise conditions. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract This paper presents a novel Hidden Markov Model architecture to model the joint probability of pairs of asynchronous sequences describing the same event. [sent-5, score-0.421]

2 It is based on two other Markovian models, namely Asynchronous Input/ Output Hidden Markov Models and Pair Hidden Markov Models. [sent-6, score-0.024]

3 An EM algorithm to train the model is presented, as well as a Viterbi decoder that can be used to obtain the optimal state sequence as well as the alignment between the two sequences. [sent-7, score-0.319]

4 The model has been tested on an audio-visual speech recognition task using the M2VTS database and yielded robust performances under various noise conditions. [sent-8, score-0.412]

5 1 Introduction Hidden Markov Models (HMMs) are statistical tools that have been used successfully in the last 30 years to model difficult tasks such as speech recognition [6] or biological sequence analysis [4]. [sent-9, score-0.352]

6 They are very well suited to handle discrete or continuous sequences of varying sizes. [sent-10, score-0.105]

7 Moreover, an efficient training algorithm (EM) is available, as well as an efficient decoding algorithm (Viterbi), which provides the optimal sequence of states (and the corresponding sequence of high level events) associated with a given sequence of low-level data. [sent-11, score-0.583]

8 On the other hand, multimodal information processing is currently a very challenging application framework, including multimodal person authentication, multimodal speech recognition, multimodal event analyzers, etc. [sent-12, score-0.731]

9 In that framework, the same sequence of events is represented not only by a single sequence of data but by a series of data sequences, each of them possibly coming from a different modality: video streams with various viewpoints, audio stream(s), etc. [sent-13, score-1.173]

10 One such task, which will be presented in this paper, is multimodal speech recognition using both a microphone and a camera recording a speaker simultaneously while he (she) speaks. [sent-14, score-0.412]

11 It is indeed well known that seeing the speaker's face in addition to hearing his (her) voice can often improve speech intelligibility, particularly in noisy environments [7], mainly thanks to the complementarity of the visual and acoustic signals. [sent-15, score-0.196]

12 While in the former solution, the alignment between the two sequences is decided a priori, in the latter, there is no explicit learning of the joint probability of the two sequences. [sent-17, score-0.282]

13 An example of late integration is presented in [3], where the authors present a multistream approach in which each stream is modeled by a different HMM, while decoding is done on a combined HMM (with various combination approaches proposed). [sent-18, score-0.468]

14 In this paper, we present a novel Asynchronous Hidden Markov Model (AHMM) that can learn the joint probability of pairs of sequences of data representing the same sequence of events, even when the events are not synchronized between the sequences. [sent-19, score-0.361]

15 In fact, the model allows the streams to be desynchronized by temporarily stretching one of them in order to obtain a better match between the corresponding frames. [sent-20, score-0.134]

16 The model can thus be directly applied to the problem of audio-visual speech recognition where, for instance, the lips sometimes start to move before any sound is heard. [sent-21, score-0.246]

17 The paper is organized as follows: in the next section, the AHMM model is presented, followed by the corresponding EM training and Viterbi decoding algorithms. [sent-22, score-0.262]

18 Related models are then presented and implementation issues are discussed. [sent-23, score-0.024]

19 Finally, experiments on an audio-visual speech recognition task based on the M2VTS database are presented, followed by a conclusion. [sent-24, score-0.306]

20 2 The Asynchronous Hidden Markov Model For the sake of simplicity, let us present here the case where one is interested in modeling the joint probability of 2 asynchronous sequences, denoted x_1^T and y_1^S, with S ≤ T without loss of generality. [sent-25, score-0.265]

21 We are thus interested in modeling p(x_1^T, y_1^S). [sent-27, score-0.037]

22 As it is intractable to model this directly by considering all possible combinations, we introduce a hidden variable q which represents the state, as in the classical HMM formulation, and which is synchronized with the longer sequence. [sent-28, score-0.22]

23 Moreover, in the model presented here, we always emit x_t at time t and sometimes emit y_s at time t. [sent-30, score-0.335]

24 Let us first define ε(i, t) = P(τ_t = s | τ_{t-1} = s - 1, q_t = i, x_1^t, y_1^s) as the probability that the system emits the next observation of sequence y at time t while in state i. [sent-31, score-0.585]

25 The additional hidden variable τ_t = s can be seen as the alignment between y and q (and x, which is aligned with q). [sent-32, score-0.204]

26 If this is not the case, a straightforward extension of the proposed model is then necessary. [sent-36, score-0.051]

27 α(i, s, t) = ε(i, t) p(x_t, y_s | q_t = i) Σ_{j=1}^N P(q_t = i | q_{t-1} = j) α(j, s - 1, t - 1) + (1 - ε(i, t)) p(x_t | q_t = i) Σ_{j=1}^N P(q_t = i | q_{t-1} = j) α(j, s, t - 1), which is very similar to the corresponding α variable used in normal HMMs. [sent-37, score-0.045]

28 It can then be used to compute the joint likelihood of the two sequences as follows: p(x_1^T, y_1^S) = Σ_{i=1}^N p(q_T = i, τ_T = S, x_1^T, y_1^S) = Σ_{i=1}^N α(i, S, T). (2) [sent-38, score-0.156]
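
To make the recursion concrete, the following is a minimal NumPy sketch of this forward pass. It is not the paper's implementation: the emission densities p_xy and p_x, the emission probabilities eps and the transition matrix A are placeholder inputs assumed to be supplied by the caller.

```python
import numpy as np

def ahmm_forward(p_xy, p_x, eps, A, pi):
    """Forward pass of the asynchronous HMM (illustrative sketch, not the authors' code).

    p_xy[i, s, t] : p(x_t, y_s | q_t = i), joint emission density
    p_x[i, t]     : p(x_t | q_t = i), emission of x alone
    eps[i, t]     : epsilon(i, t), probability of also emitting the next y
    A[j, i]       : P(q_t = i | q_{t-1} = j), transition matrix
    pi[i]         : initial state distribution
    Returns alpha[i, s, t] (s = number of y's emitted so far) and p(x_1^T, y_1^S).
    """
    N, S, T = p_xy.shape
    alpha = np.zeros((N, S + 1, T))

    # t = 0: either emit x_0 alone (s stays 0) or emit x_0 and y_0 jointly (s becomes 1)
    alpha[:, 0, 0] = pi * (1.0 - eps[:, 0]) * p_x[:, 0]
    alpha[:, 1, 0] = pi * eps[:, 0] * p_xy[:, 0, 0]

    for t in range(1, T):
        # trans[i, s] = sum_j P(q_t = i | q_{t-1} = j) * alpha(j, s, t-1)
        trans = A.T @ alpha[:, :, t - 1]
        for s in range(S + 1):
            # emit x_t only: the y position s does not advance
            alpha[:, s, t] = (1.0 - eps[:, t]) * p_x[:, t] * trans[:, s]
            if s >= 1:
                # emit x_t and y_s jointly: the y position advances from s-1 to s
                alpha[:, s, t] += eps[:, t] * p_xy[:, s - 1, t] * trans[:, s - 1]

    # equation (2): p(x_1^T, y_1^S) = sum_i alpha(i, S, T)
    return alpha, alpha[:, S, T - 1].sum()
```

In practice such a recursion would be carried out in the log domain to avoid numerical underflow on long sequences.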

29 2 Viterbi Decoding. Using the same technique and replacing all the sums by max operators, a Viterbi decoding algorithm can be derived in order to obtain the most probable path along the sequence of states and alignments between x and y, yielding a lattice variable V(i, s, t) analogous to α(i, s, t). [sent-40, score-0.494]
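
As a rough illustration of that substitution, one Viterbi time step under the same hypothetical conventions as the forward sketch above could look as follows; backpointers are kept so the best state sequence and alignment can be recovered by backtracking.

```python
import numpy as np

def ahmm_viterbi_step(V_prev, A, p_xy_t, p_x_t, eps_t):
    """One Viterbi time step for the AHMM sketch above (illustrative, not the paper's code).

    V_prev[j, s] : V(j, s, t-1), best-path score with s observations of y emitted so far
    A[j, i]      : P(q_t = i | q_{t-1} = j)
    p_xy_t[i, s] : p(x_t, y_{s+1} | q_t = i) for each candidate next y
    p_x_t[i]     : p(x_t | q_t = i)
    eps_t[i]     : epsilon(i, t)
    Returns V(., ., t) and backpointers (previous state, previous y position).
    """
    N, Sp1 = V_prev.shape
    V_t = np.zeros((N, Sp1))
    back = np.zeros((N, Sp1, 2), dtype=int)

    # scores[i, j, s] = P(q_t = i | q_{t-1} = j) * V(j, s, t-1); best predecessor per (i, s)
    scores = A.T[:, :, None] * V_prev[None, :, :]
    best = scores.max(axis=1)
    best_j = scores.argmax(axis=1)

    for s in range(Sp1):
        stay = (1.0 - eps_t) * p_x_t * best[:, s]                               # emit x_t only
        move = eps_t * p_xy_t[:, s - 1] * best[:, s - 1] if s >= 1 else np.zeros(N)
        V_t[:, s] = np.maximum(stay, move)
        back[:, s, 0] = np.where(stay >= move, best_j[:, s], best_j[:, s - 1])
        back[:, s, 1] = np.where(stay >= move, s, s - 1)
    return V_t, back
```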

30 3 An EM Training Algorithm An EM training algorithm can also be derived in the same fashion as in classical HMMs. [sent-43, score-0.047]

31 The full derivations are not given in this paper but can be found in the appendix of [1]. [sent-46, score-0.024]

32 In the case where one is only interested in the best state sequence (no matter the alignment), the solution is then to marginalize over all the alignments during decoding (essentially keeping the sums on the alignments and the max on the state space). [sent-47, score-0.697]

33 E-Step: Using both the forward and backward variables, one can compute the posterior probabilities of the hidden variables of the system, namely the posterior on the state when it emits on both sequences, the posterior on the state when it emits on x only, and the posterior on transitions. [sent-50, score-0.628]

34 Then the posterior on state i when it emits joint observations of sequences x and y is: [sent-52, score-0.354]

35 P(q_t = i, τ_t = s, τ_{t-1} = s - 1 | x_1^T, y_1^S) = α^1(i, s, t) β(i, s, t) / p(x_1^T, y_1^S), (7) and the posterior on state i when it emits the next observation of sequence x only is [sent-53, score-0.304]

36 P(q_t = i, τ_t = s, τ_{t-1} = s | x_1^T, y_1^S) = α^0(i, s, t) β(i, s, t) / p(x_1^T, y_1^S). (8) The posterior on the transition between states i and j is P(q_t = i, q_{t-1} = j, τ_t = s | x_1^T, y_1^S) = P(q_t = i | q_{t-1} = j) / p(x_1^T, y_1^S) · [ ε(i, t) p(x_t, y_s | q_t = i) α(j, s - 1, t - 1) + (1 - ε(i, t)) p(x_t | q_t = i) α(j, s, t - 1) ] β(i, s, t). (9) [sent-54, score-0.062]
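
As a hedged sketch with illustrative variable names (not the paper's), the per-frame posteriors of equations (7) and (8) could be assembled from forward variables split by emission type (alpha1, alpha0), a backward lattice beta, and the joint likelihood, all assumed precomputed:

```python
import numpy as np

def ahmm_estep_posteriors(alpha1, alpha0, beta, p_joint):
    """E-step posteriors for the AHMM sketch (illustrative names, not the authors' code).

    alpha1[i, s, t] : forward mass where state i emits x_t AND y_s   (numerator of eq. 7)
    alpha0[i, s, t] : forward mass where state i emits x_t only      (numerator of eq. 8)
    beta[i, s, t]   : backward variable
    p_joint         : p(x_1^T, y_1^S), e.g. the likelihood returned by the forward pass
    """
    # posterior on state i emitting on both sequences at alignment point (s, t)
    gamma_xy = alpha1 * beta / p_joint
    # posterior on state i emitting on x only at (s, t)
    gamma_x = alpha0 * beta / p_joint
    # per-state occupation posterior, marginalized over the alignment s
    gamma = (gamma_xy + gamma_x).sum(axis=1)
    return gamma_xy, gamma_x, gamma
```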

37 3 Related Models The present AHMM model is related to the Pair HMM model [4], which was proposed to search for the best alignment between two DNA sequences. [sent-57, score-0.202]

38 Moreover, the architecture of the Pair HMM model is such that a given state is designed to always emit either exactly one or exactly two vectors, while in the proposed AHMM model, each state can emit either one or two vectors, depending on ε(i, t), which is learned. [sent-59, score-0.487]

39 In fact, when ε(i, t) is deterministic and solely depends on i, we can indeed recover the Pair HMM model by slightly transforming the architecture. [sent-60, score-0.025]

40 It is also very similar to the asynchronous version of Input/ Output HMMs [2], which was proposed for speech recognition applications. [sent-61, score-0.461]

41 The main difference here is that in AHMMs both sequences are considered as output, while in Asynchronous IOHMMs one of the sequences (the shorter one, the output) is conditioned on the other one (the input). [sent-62, score-0.233]

42 The resulting Viterbi decoding algorithm is thus different since in Asynchronous IOHMMs one of the sequence, the input, is known during decoding, which is not the case in AHMMs. [sent-63, score-0.189]

43 1 Implementation Issues: Time and Space Complexity. The proposed algorithms (either training or decoding) have a complexity of O(N^2 ST), where N is the number of states (assuming the worst case of ergodic connectivity), S is the length of sequence y, and T is the length of sequence x. [sent-65, score-0.286]

44 In that case, the complexity (both in time and space) becomes O(N^2 Tk), which is k times the usual HMM training/decoding complexity. [sent-68, score-0.189]
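
One simple way to obtain this O(N^2 Tk) behaviour, sketched below with hypothetical names, is to update only the y positions s that lie within a band of width k around the proportional alignment s ≈ tS/T:

```python
def band_limits(t, T, S, k):
    """Alignment band of width k around the proportional position s ~ t*S/T (a sketch)."""
    center = (t * S) // T
    return max(0, center - k), min(S, center + k)

# inside the forward or Viterbi time loop, only this band of s values is updated:
# lo, hi = band_limits(t, T, S, k)
# for s in range(lo, hi + 1):
#     ...  # same update as before, O(N^2) work per (s, t) cell
```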

45 In the experiments described later in the paper, we have chosen the latter implementation, with no sharing except during initialization; • ε(i, t) = P(τ_t = s | τ_{t-1} = s - 1, q_t = i, x_1^t, y_1^s): the probability of emitting on sequence y at time t in state i. [sent-73, score-0.311]

46 With various assumptions, this probability could be represented as independent of i, independent of s, or independent of x_t and y_s. [sent-74, score-0.039]

47 5 Experiments Audio-visual speech recognition experiments were performed using the M2VTS database [5], which contains 185 recordings of 37 subjects, each containing acoustic and video signals of the subject pronouncing the French digits from zero to nine. [sent-76, score-0.612]

48 The video consisted of 286x360-pixel color images with a 25 Hz frame rate, while the audio was recorded at 48 kHz using 16-bit PCM coding. [sent-77, score-0.748]

49 Although the M2VTS database is one of the largest databases of its type, it is still relatively small compared to reference audio databases used in speech recognition. [sent-78, score-0.648]

50 Hence, in order to increase the significance level of the experimental results, a 5-fold cross-validation method was used. [sent-79, score-0.028]

51 Note that all the subjects always pronounced the same sequence of words, but this information was not used during recognition. [sent-80, score-0.183]

52 The audio data was down-sampled to 8 kHz and, every 10 ms, a vector of 16 MFCC coefficients and their first derivatives, as well as the derivative of the log energy, was computed, for a total of 33 features. [sent-81, score-0.395]
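
A rough librosa-based sketch of a comparable front end is given below; the exact window lengths and energy definition of the original experiments are not specified in this excerpt, so the 33-dimensional vector is only approximated (function and parameter choices are assumptions, not the original code):

```python
import numpy as np
import librosa  # assumes a recent librosa where feature.rms is available

def audio_features(wav_path, sr=8000, hop_ms=10):
    """~33-dim frame features: 16 MFCCs + their deltas + delta of the log energy.
    An approximation of the described front end, not the original experiments' code."""
    y, sr = librosa.load(wav_path, sr=sr)
    hop = int(sr * hop_ms / 1000)                        # 10 ms frame shift
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=16, hop_length=hop)
    d_mfcc = librosa.feature.delta(mfcc)                 # first derivatives of the MFCCs
    log_e = np.log(librosa.feature.rms(y=y, hop_length=hop) + 1e-10)
    d_log_e = librosa.feature.delta(log_e)               # derivative of the log energy
    return np.vstack([mfcc, d_mfcc, d_log_e]).T          # shape (frames, 33)
```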

53 Each image of the video stream was coded using 12 shape features and 12 intensity features, as described in [3]. [sent-82, score-0.417]

54 The first derivative of each of these features was also computed, for a total of 48 features. [sent-83, score-0.022]

55 The HMM topology was as follows: we used left-to-right HMMs for each instance of the vocabulary, which consisted of the following 11 words: zero, un, deux, trois, quatre, cinq, six, sept, huit, neuf, silence. [sent-84, score-0.04]

56 Each model had between 3 and 9 states, including non-emitting begin and end states. [sent-85, score-0.05]

57 However, in order to keep the computational time tractable, a constraint was imposed on the alignment between the audio and video streams: we did not consider alignments where audio and video information were farther apart than 0. [sent-88, score-1.578]

58 Comparisons were made between the AHMM (taking into account audio and video), and a normal HMM taking into account either the audio or the video only. [sent-90, score-1.093]

59 We also compared the model with a normal HMM trained on both audio and video streams manually synchronized (each frame of the video stream was repeated in multiple copies in order to reach the same rate as the audio stream). [sent-91, score-1.736]
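
For that baseline, a minimal sketch of the frame-repetition synchronization (assuming one feature vector per row in each stream, with the 25 Hz video features upsampled to the audio frame rate by copying frames):

```python
import numpy as np

def synchronize_by_repetition(audio_feats, video_feats):
    """Repeat each video frame so both streams have one vector per audio frame,
    then concatenate them (the 'manually synchronized' baseline; a sketch)."""
    n_audio, n_video = len(audio_feats), len(video_feats)
    # map each audio frame to the video frame covering the same span of time
    idx = np.minimum((np.arange(n_audio) * n_video) // n_audio, n_video - 1)
    return np.hstack([audio_feats, video_feats[idx]])
```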

60 Moreover, in order to demonstrate the value of robust multimodal speech recognition, we injected various levels of noise into the audio stream during decoding (training was always done using clean audio). [sent-92, score-1.14]

61 The noise was taken from the Noisex database [9], and was injected in order to reach signal-to-noise ratios of 10 dB, 5 dB and 0 dB. [sent-93, score-0.178]

62 They were taken from a previously trained model on a different task, Numbers'95. [sent-95, score-0.025]

63 As can be seen, the AHMM yielded better results as soon as the noise level was significant (for clean data, the performance using the audio stream only was almost perfect, hence no enhancement was expected). [sent-97, score-0.644]

64 Moreover, it never deteriorated significantly (using a 95% confidence interval) below the level of the video stream, no matter the level of noise in the audio stream. [sent-98, score-0.837]

65 Nevertheless, it can be argued that transitions between words could have been learned using the training data. [sent-99, score-0.023]

66 The proposed model is the AHMM using both audio and video streams. [sent-101, score-0.726]

67 [Table 1 fragment] Observations: audio / audio+video / audio+video; Model: HMM / HMM / AHMM; WER columns from 15 dB to 0 dB (values truncated). [sent-102, score-0.977]

68 Table 1: Word Error Rates (WER, in percent, the lower the better) and corresponding Confidence Intervals (CI, in parentheses) of various systems under various noise conditions during decoding (from 15 to 0 dB additive noise). [sent-126, score-0.332]

69 The proposed model is the AHMM using both audio and video streams. [sent-127, score-0.726]

70 An HMM using the clean video data only obtains 39. [sent-128, score-0.34]

71 An interesting side effect of the model is to provide an optimal alignment between the audio and the video streams. [sent-131, score-0.826]

72 Figure 2 shows the alignment obtained while decoding sequence cd01 on data corrupted with 10 dB Noisex noise. [sent-132, score-0.447]

73 It shows that the rate between video and audio is far from being constant (it would have followed the stepped line) and hence computing the joint probability using the AHMM appears more informative than using a naive alignment and a normal HMM. [sent-133, score-0.969]

74 6 Conclusion In this paper, we have presented a novel asynchronous HMM architecture to handle multiple sequences of data representing the same sequence of events. [sent-134, score-0.475]

75 The model was inspired by two other well-known models, namely Pair HMMs and Asynchronous IOHMMs. [sent-135, score-0.049]

76 An EM training algorithm was derived as well as a Viterbi decoding algorithm, and speech recognition experiments were performed on a multimodal database, yielding significant improvements on noisy audio data. [sent-136, score-0.97]

77 Various propositions were made to implement the model but only the simplest ones were tested in this paper. [sent-137, score-0.025]

78 Moreover, other applications of the model should also be investigated, such as multimodal authentication. [sent-139, score-0.166]

79 Figure 2: Alignment obtained by the model between the video and audio streams on sequence cd01 corrupted with 10 dB Noisex noise. [sent-140, score-0.904]

80 Acknowledgments This research has been partially carried out in the framework of the European project LAVA, funded by the Swiss OFES project number 01. [sent-143, score-0.085]

81 The Swiss NCCR project IM2 has also partly funded this research. [sent-145, score-0.056]

82 An asynchronous hidden markov model for audio-visual speech recognition. [sent-149, score-0.534]

83 An EM algorithm for asynchronous input/ output hidden markov models. [sent-154, score-0.342]

84 In Proceedings of the First International Conference on Audio- and Video-based Biometric Person Authentication (AVBPA), 1997. [sent-173, score-0.027]

85 A tutorial on hidden markov models and selected applications in speech recognition. [sent-176, score-0.295]

86 Journal of the Acoustical Society of America, 26:212-215, 1954. [sent-183, score-0.027]

87 The noisex-92 study on the effect of additive noise on automatic speech recognition. [sent-196, score-0.232]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('audio', 0.373), ('qt', 0.318), ('video', 0.302), ('ahmm', 0.259), ('asynchronous', 0.214), ('iqt', 0.212), ('hmm', 0.203), ('decoding', 0.189), ('xt', 0.185), ('ys', 0.168), ('speech', 0.167), ('emit', 0.143), ('multimodal', 0.141), ('alignment', 0.126), ('viterbi', 0.122), ('xtlqt', 0.118), ('stream', 0.115), ('sequence', 0.106), ('sequences', 0.105), ('alignments', 0.102), ('tt', 0.101), ('emits', 0.099), ('hmms', 0.097), ('hidden', 0.078), ('streams', 0.072), ('em', 0.071), ('noisex', 0.071), ('wer', 0.071), ('yslqt', 0.071), ('emission', 0.066), ('state', 0.062), ('database', 0.06), ('synchronized', 0.056), ('recognition', 0.054), ('joint', 0.051), ('markov', 0.05), ('dupont', 0.047), ('idiap', 0.047), ('intelligibility', 0.047), ('stepped', 0.047), ('yf', 0.047), ('bengio', 0.047), ('db', 0.046), ('normal', 0.045), ('xl', 0.044), ('events', 0.043), ('authentication', 0.041), ('iohmms', 0.041), ('noise', 0.041), ('reach', 0.04), ('consisted', 0.04), ('various', 0.039), ('clean', 0.038), ('moreover', 0.038), ('stretching', 0.037), ('confidence', 0.037), ('injected', 0.037), ('yr', 0.037), ('modeled', 0.037), ('gaussians', 0.037), ('posterior', 0.037), ('ao', 0.035), ('late', 0.035), ('frame', 0.033), ('backward', 0.032), ('swiss', 0.031), ('percent', 0.03), ('integration', 0.029), ('acoustic', 0.029), ('project', 0.029), ('matter', 0.028), ('level', 0.028), ('th', 0.027), ('eventually', 0.027), ('funded', 0.027), ('proposed', 0.026), ('architecture', 0.026), ('path', 0.026), ('yielded', 0.026), ('speaker', 0.026), ('corrupted', 0.026), ('followed', 0.025), ('states', 0.025), ('model', 0.025), ('databases', 0.024), ('forward', 0.024), ('namely', 0.024), ('appendix', 0.024), ('classical', 0.024), ('additive', 0.024), ('presented', 0.024), ('significant', 0.023), ('subjects', 0.023), ('training', 0.023), ('max', 0.023), ('sums', 0.023), ('ci', 0.023), ('derivative', 0.022), ('conditioned', 0.022), ('pair', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 25 nips-2002-An Asynchronous Hidden Markov Model for Audio-Visual Speech Recognition

Author: Samy Bengio

Abstract: This paper presents a novel Hidden Markov Model architecture to model the joint probability of pairs of asynchronous sequences describing the same event. It is based on two other Markovian models, namely Asynchronous Input/ Output Hidden Markov Models and Pair Hidden Markov Models. An EM algorithm to train the model is presented, as well as a Viterbi decoder that can be used to obtain the optimal state sequence as well as the alignment between the two sequences. The model has been tested on an audio-visual speech recognition task using the M2VTS database and yielded robust performances under various noise conditions. 1

2 0.22022472 137 nips-2002-Location Estimation with a Differential Update Network

Author: Ali Rahimi, Trevor Darrell

Abstract: Given a set of hidden variables with an a-priori Markov structure, we derive an online algorithm which approximately updates the posterior as pairwise measurements between the hidden variables become available. The update is performed using Assumed Density Filtering: to incorporate each pairwise measurement, we compute the optimal Markov structure which represents the true posterior and use it as a prior for incorporating the next measurement. We demonstrate the resulting algorithm by calculating globally consistent trajectories of a robot as it navigates along a 2D trajectory. To update a trajectory of length t, the update takes O(t). When all conditional distributions are linear-Gaussian, the algorithm can be thought of as a Kalman Filter which simplifies the state covariance matrix after incorporating each measurement.

3 0.20826063 38 nips-2002-Bayesian Estimation of Time-Frequency Coefficients for Audio Signal Enhancement

Author: Patrick J. Wolfe, Simon J. Godsill

Abstract: The Bayesian paradigm provides a natural and effective means of exploiting prior knowledge concerning the time-frequency structure of sound signals such as speech and music—something which has often been overlooked in traditional audio signal processing approaches. Here, after constructing a Bayesian model and prior distributions capable of taking into account the time-frequency characteristics of typical audio waveforms, we apply Markov chain Monte Carlo methods in order to sample from the resultant posterior distribution of interest. We present speech enhancement results which compare favourably in objective terms with standard time-varying filtering techniques (and in several cases yield superior performance, both objectively and subjectively); moreover, in contrast to such methods, our results are obtained without an assumption of prior knowledge of the noise power.

4 0.12483677 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition

Author: Shinji Watanabe, Yasuhiro Minami, Atsushi Nakamura, Naonori Ueda

Abstract: In this paper, we propose a Bayesian framework, which constructs shared-state triphone HMMs based on a variational Bayesian approach, and recognizes speech based on the Bayesian prediction classification; variational Bayesian estimation and clustering for speech recognition (VBEC). An appropriate model structure with high recognition performance can be found within a VBEC framework. Unlike conventional methods, including BIC or MDL criterion based on the maximum likelihood approach, the proposed model selection is valid in principle, even when there are insufficient amounts of data, because it does not use an asymptotic assumption. In isolated word recognition experiments, we show the advantage of VBEC over conventional methods, especially when dealing with small amounts of data.

5 0.1169566 73 nips-2002-Dynamic Bayesian Networks with Deterministic Latent Tables

Author: David Barber

Abstract: The application of latent/hidden variable Dynamic Bayesian Networks is constrained by the complexity of marginalising over latent variables. For this reason either small latent dimensions or Gaussian latent conditional tables linearly dependent on past states are typically considered in order that inference is tractable. We suggest an alternative approach in which the latent variables are modelled using deterministic conditional probability tables. This specialisation has the advantage of tractable inference even for highly complex non-linear/non-Gaussian visible conditional probability tables. This approach enables the consideration of highly complex latent dynamics whilst retaining the benefits of a tractable probabilistic model. 1

6 0.11267658 87 nips-2002-Fast Transformation-Invariant Factor Analysis

7 0.11086109 93 nips-2002-Forward-Decoding Kernel-Based Phone Recognition

8 0.098054044 147 nips-2002-Monaural Speech Separation

9 0.090930529 69 nips-2002-Discriminative Learning for Label Sequences via Boosting

10 0.082290895 16 nips-2002-A Prototype for Automatic Recognition of Spontaneous Facial Actions

11 0.073881336 170 nips-2002-Real Time Voice Processing with Audiovisual Feedback: Toward Autonomous Agents with Perfect Pitch

12 0.065340638 153 nips-2002-Neural Decoding of Cursor Motion Using a Kalman Filter

13 0.062131625 140 nips-2002-Margin Analysis of the LVQ Algorithm

14 0.060965113 145 nips-2002-Mismatch String Kernels for SVM Protein Classification

15 0.058708325 14 nips-2002-A Probabilistic Approach to Single Channel Blind Signal Separation

16 0.053864796 53 nips-2002-Clustering with the Fisher Score

17 0.052423928 98 nips-2002-Going Metric: Denoising Pairwise Data

18 0.052262697 183 nips-2002-Source Separation with a Sensor Array using Graphical Models and Subband Filtering

19 0.047526665 164 nips-2002-Prediction of Protein Topologies Using Generalized IOHMMs and RNNs

20 0.047285527 51 nips-2002-Classifying Patterns of Visual Motion - a Neuromorphic Approach


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.164), (1, 0.002), (2, -0.038), (3, 0.06), (4, -0.051), (5, 0.065), (6, -0.093), (7, 0.039), (8, 0.241), (9, -0.05), (10, 0.135), (11, 0.038), (12, -0.188), (13, -0.067), (14, -0.19), (15, -0.014), (16, -0.072), (17, 0.142), (18, -0.012), (19, 0.042), (20, 0.046), (21, -0.104), (22, 0.042), (23, 0.011), (24, -0.171), (25, 0.213), (26, -0.067), (27, -0.091), (28, 0.142), (29, 0.041), (30, 0.072), (31, 0.005), (32, 0.071), (33, 0.053), (34, -0.15), (35, -0.068), (36, -0.093), (37, -0.117), (38, 0.061), (39, 0.109), (40, -0.007), (41, -0.157), (42, -0.063), (43, 0.035), (44, 0.058), (45, 0.045), (46, 0.034), (47, 0.028), (48, -0.06), (49, -0.136)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95691997 25 nips-2002-An Asynchronous Hidden Markov Model for Audio-Visual Speech Recognition

Author: Samy Bengio

Abstract: This paper presents a novel Hidden Markov Model architecture to model the joint probability of pairs of asynchronous sequences describing the same event. It is based on two other Markovian models, namely Asynchronous Input/ Output Hidden Markov Models and Pair Hidden Markov Models. An EM algorithm to train the model is presented, as well as a Viterbi decoder that can be used to obtain the optimal state sequence as well as the alignment between the two sequences. The model has been tested on an audio-visual speech recognition task using the M2VTS database and yielded robust performances under various noise conditions. 1

2 0.68838394 137 nips-2002-Location Estimation with a Differential Update Network

Author: Ali Rahimi, Trevor Darrell

Abstract: Given a set of hidden variables with an a-priori Markov structure, we derive an online algorithm which approximately updates the posterior as pairwise measurements between the hidden variables become available. The update is performed using Assumed Density Filtering: to incorporate each pairwise measurement, we compute the optimal Markov structure which represents the true posterior and use it as a prior for incorporating the next measurement. We demonstrate the resulting algorithm by calculating globally consistent trajectories of a robot as it navigates along a 2D trajectory. To update a trajectory of length t, the update takes O(t). When all conditional distributions are linear-Gaussian, the algorithm can be thought of as a Kalman Filter which simplifies the state covariance matrix after incorporating each measurement.

3 0.57310349 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition

Author: Shinji Watanabe, Yasuhiro Minami, Atsushi Nakamura, Naonori Ueda

Abstract: In this paper, we propose a Bayesian framework, which constructs shared-state triphone HMMs based on a variational Bayesian approach, and recognizes speech based on the Bayesian prediction classification; variational Bayesian estimation and clustering for speech recognition (VBEC). An appropriate model structure with high recognition performance can be found within a VBEC framework. Unlike conventional methods, including BIC or MDL criterion based on the maximum likelihood approach, the proposed model selection is valid in principle, even when there are insufficient amounts of data, because it does not use an asymptotic assumption. In isolated word recognition experiments, we show the advantage of VBEC over conventional methods, especially when dealing with small amounts of data.

4 0.54218465 38 nips-2002-Bayesian Estimation of Time-Frequency Coefficients for Audio Signal Enhancement

Author: Patrick J. Wolfe, Simon J. Godsill

Abstract: The Bayesian paradigm provides a natural and effective means of exploiting prior knowledge concerning the time-frequency structure of sound signals such as speech and music—something which has often been overlooked in traditional audio signal processing approaches. Here, after constructing a Bayesian model and prior distributions capable of taking into account the time-frequency characteristics of typical audio waveforms, we apply Markov chain Monte Carlo methods in order to sample from the resultant posterior distribution of interest. We present speech enhancement results which compare favourably in objective terms with standard time-varying filtering techniques (and in several cases yield superior performance, both objectively and subjectively); moreover, in contrast to such methods, our results are obtained without an assumption of prior knowledge of the noise power.

5 0.49240106 7 nips-2002-A Hierarchical Bayesian Markovian Model for Motifs in Biopolymer Sequences

Author: Eric P. Xing, Michael I. Jordan, Richard M. Karp, Stuart Russell

Abstract: We propose a dynamic Bayesian model for motifs in biopolymer sequences which captures rich biological prior knowledge and positional dependencies in motif structure in a principled way. Our model posits that the position-specific multinomial parameters for monomer distribution are distributed as a latent Dirichlet-mixture random variable, and the position-specific Dirichlet component is determined by a hidden Markov process. Model parameters can be fit on training motifs using a variational EM algorithm within an empirical Bayesian framework. Variational inference is also used for detecting hidden motifs. Our model improves over previous models that ignore biological priors and positional dependence. It has much higher sensitivity to motifs during detection and a notable ability to distinguish genuine motifs from false recurring patterns.

6 0.415353 93 nips-2002-Forward-Decoding Kernel-Based Phone Recognition

7 0.41325635 16 nips-2002-A Prototype for Automatic Recognition of Spontaneous Facial Actions

8 0.40060323 87 nips-2002-Fast Transformation-Invariant Factor Analysis

9 0.37549943 73 nips-2002-Dynamic Bayesian Networks with Deterministic Latent Tables

10 0.3217946 147 nips-2002-Monaural Speech Separation

11 0.31104323 69 nips-2002-Discriminative Learning for Label Sequences via Boosting

12 0.27696222 170 nips-2002-Real Time Voice Processing with Audiovisual Feedback: Toward Autonomous Agents with Perfect Pitch

13 0.25848445 1 nips-2002-"Name That Song!" A Probabilistic Approach to Querying on Music and Text

14 0.25044036 29 nips-2002-Analysis of Information in Speech Based on MANOVA

15 0.24851656 185 nips-2002-Speeding up the Parti-Game Algorithm

16 0.24810709 140 nips-2002-Margin Analysis of the LVQ Algorithm

17 0.24230595 98 nips-2002-Going Metric: Denoising Pairwise Data

18 0.2376103 54 nips-2002-Combining Dimensions and Features in Similarity-Based Representations

19 0.23590825 199 nips-2002-Timing and Partial Observability in the Dopamine System

20 0.23291819 164 nips-2002-Prediction of Protein Topologies Using Generalized IOHMMs and RNNs


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.011), (11, 0.039), (23, 0.015), (32, 0.392), (42, 0.063), (54, 0.096), (55, 0.023), (64, 0.013), (67, 0.017), (68, 0.033), (74, 0.082), (79, 0.01), (92, 0.032), (98, 0.082)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.7645697 25 nips-2002-An Asynchronous Hidden Markov Model for Audio-Visual Speech Recognition

Author: Samy Bengio

Abstract: This paper presents a novel Hidden Markov Model architecture to model the joint probability of pairs of asynchronous sequences describing the same event. It is based on two other Markovian models, namely Asynchronous Input/ Output Hidden Markov Models and Pair Hidden Markov Models. An EM algorithm to train the model is presented, as well as a Viterbi decoder that can be used to obtain the optimal state sequence as well as the alignment between the two sequences. The model has been tested on an audio-visual speech recognition task using the M2VTS database and yielded robust performances under various noise conditions. 1

2 0.62886268 163 nips-2002-Prediction and Semantic Association

Author: Thomas L. Griffiths, Mark Steyvers

Abstract: We explore the consequences of viewing semantic association as the result of attempting to predict the concepts likely to arise in a particular context. We argue that the success of existing accounts of semantic representation comes as a result of indirectly addressing this problem, and show that a closer correspondence to human data can be obtained by taking a probabilistic approach that explicitly models the generative structure of language. 1

3 0.40069801 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions

Author: Max Welling, Simon Osindero, Geoffrey E. Hinton

Abstract: We propose a model for natural images in which the probability of an image is proportional to the product of the probabilities of some filter outputs. We encourage the system to find sparse features by using a Studentt distribution to model each filter output. If the t-distribution is used to model the combined outputs of sets of neurally adjacent filters, the system learns a topographic map in which the orientation, spatial frequency and location of the filters change smoothly across the map. Even though maximum likelihood learning is intractable in our model, the product form allows a relatively efficient learning procedure that works well even for highly overcomplete sets of filters. Once the model has been learned it can be used as a prior to derive the “iterated Wiener filter” for the purpose of denoising images.

4 0.39418358 52 nips-2002-Cluster Kernels for Semi-Supervised Learning

Author: Olivier Chapelle, Jason Weston, Bernhard SchĂślkopf

Abstract: We propose a framework to incorporate unlabeled data in kernel classifier, based on the idea that two points in the same cluster are more likely to have the same label. This is achieved by modifying the eigenspectrum of the kernel matrix. Experimental results assess the validity of this approach. 1

5 0.3921268 3 nips-2002-A Convergent Form of Approximate Policy Iteration

Author: Theodore J. Perkins, Doina Precup

Abstract: We study a new, model-free form of approximate policy iteration which uses Sarsa updates with linear state-action value function approximation for policy evaluation, and a “policy improvement operator” to generate a new policy based on the learned state-action values. We prove that if the policy improvement operator produces -soft policies and is Lipschitz continuous in the action values, with a constant that is not too large, then the approximate policy iteration algorithm converges to a unique solution from any initial policy. To our knowledge, this is the first convergence result for any form of approximate policy iteration under similar computational-resource assumptions.

6 0.39076784 124 nips-2002-Learning Graphical Models with Mercer Kernels

7 0.38971946 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

8 0.38961971 53 nips-2002-Clustering with the Fisher Score

9 0.38957137 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation

10 0.38865727 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks

11 0.38830149 41 nips-2002-Bayesian Monte Carlo

12 0.38799632 169 nips-2002-Real-Time Particle Filters

13 0.38793427 10 nips-2002-A Model for Learning Variance Components of Natural Images

14 0.38761923 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture

15 0.38740671 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs

16 0.38740098 74 nips-2002-Dynamic Structure Super-Resolution

17 0.3872579 21 nips-2002-Adaptive Classification by Variational Kalman Filtering

18 0.38622302 2 nips-2002-A Bilinear Model for Sparse Coding

19 0.38598827 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition

20 0.38584393 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond