nips nips2007 nips2007-18 knowledge-graph by maker-knowledge-mining

18 nips-2007-A probabilistic model for generating realistic lip movements from speech


Source: pdf

Author: Gwenn Englebienne, Tim Cootes, Magnus Rattray

Abstract: The present work aims to model the correspondence between facial motion and speech. The face and sound are modelled separately, with phonemes being the link between both. We propose a sequential model and evaluate its suitability for the generation of the facial animation from a sequence of phonemes, which we obtain from speech. We evaluate the results both by computing the error between generated sequences and real video, as well as with a rigorous double-blind test with human subjects. Experiments show that our model compares favourably to other existing methods and that the sequences generated are comparable to real video sequences. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 The face and sound are modelled separately, with phonemes being the link between both. [sent-12, score-0.43]

2 We propose a sequential model and evaluate its suitability for the generation of the facial animation from a sequence of phonemes, which we obtain from speech. [sent-13, score-0.31]

3 We evaluate the results both by computing the error between generated sequences and real video, as well as with a rigorous double-blind test with human subjects. [sent-14, score-0.404]

4 Experiments show that our model compares favourably to other existing methods and that the sequences generated are comparable to real video sequences. [sent-15, score-0.672]

5 1 Introduction. Generative systems that model the relationship between face and speech offer a wide range of exciting prospects. [sent-16, score-0.388]

6 Models combining speech and face information have been shown to improve automatic speech recognition [4]. [sent-17, score-0.566]

7 Conversely, generating video-realistic animated faces from speech has immediate applications to the games and movie industries. [sent-18, score-0.298]

8 There is a strong correlation between lip movements and speech [7,10], and there have been multiple attempts at generating an animated face to match some given speech realistically [2,3,9,13]. [sent-19, score-0.679]

9 Studies have indicated that speech might be informative not only of lip movement but also of movement in the upper regions of the face [3]. [sent-20, score-0.445]

10 Incorporating speech therefore seems crucial to the generation of true-to-life animated faces. [sent-21, score-0.33]

11 Our goal is to build a generative probabilistic model, capable of generating realistic facial animations in real time, given speech. [sent-22, score-0.22]

12 We first use an Active Appearance Model (AAM [6]) to extract features from the video frames. [sent-23, score-0.398]

13 We then use a Hidden Markov Model (HMM [12]) to align phoneme labels to the audio stream of video sequences, and use this information to label the corresponding video frames. [sent-25, score-1.013]

14 We propose a model which, when trained on these labelled video frames, is capable of generating new, realistic video from unseen phoneme sequences. [sent-26, score-0.971]

15 Our model is a modification of Switching Linear Dynamical Systems (SLDS [1,15]) and we show that it performs better at generation than other existing models. [sent-27, score-0.17]

16 We compare its performance to two previously proposed models by comparing the sequences they generate to a gold standard (features from real video sequences), and by asking volunteers to select the “real” video in a forced-choice test. [sent-28, score-1.168]

17 The results of human evaluation of our generated sequences are extremely encouraging. [sent-29, score-0.301]

18 Our system performs well with any speech, and since it can easily handle real-time generation of the facial animation, it brings a realistic-looking, talking avatar within reach. [sent-30, score-0.255]

19 2 The Data. We used sequences from the freely available on-line news broadcast Democracy Now! [sent-31, score-0.25]

20 The text transcripts are available on-line, thus greatly facilitating the training of a speech recognition system. [sent-33, score-0.209]

21 We manually extracted short video sequences of the news presenter talking (removing any inserts, telephone interviews, etc.). [sent-34, score-0.655]

22 The sequences are all of the same person, albeit on different days within a period of slightly more than a month. [sent-37, score-0.221]

23 There was no reason to restrict the data to a single person, other than the difficulty of obtaining sequences of similar quality from other sources. [sent-38, score-0.221]

24 All usable sequences were extracted from the data, that is, those where the face of the speaker was visible and the sound was not corrupted by external sound sources. [sent-39, score-0.762]

25 The sequences do include hesitations, corrections, incomplete words, noticeable fatigue, breath, swallowing, etc. [sent-40, score-0.221]

26 In total, 1 hour and 7 minutes of video were extracted and annotated. [sent-42, score-0.605]

27 The data was split into independent training and test sets for 10-fold cross-validation, based on the number of sequences in each set (rather than the total amount of data). [sent-43, score-0.255]
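
A minimal sketch of such a split-by-sequence fold assignment, with illustrative names and seeding (not the authors' code):

```python
# A minimal sketch of splitting whole sequences (not frames) into folds,
# so that test sequences never contribute frames to training.
import numpy as np

def sequence_folds(n_sequences, n_folds=10, seed=0):
    idx = np.random.default_rng(seed).permutation(n_sequences)
    return np.array_split(idx, n_folds)  # each entry: one fold's test indices
```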

28 The sequences are split into an audio and a video stream, which are treated separately (see Figure 1). [sent-47, score-0.594]

29 We train an HMM on these MFCC features, and use it to align phonetic labels to the sound. [sent-49, score-0.17]

30 This is an easier task than unrestricted speech recognition, and is done satisfactorily by a simple HMM with monophones as hidden states, where mixtures of Gaussian distributions model the emission densities. [sent-50, score-0.323]

31 The sound samples are labelled with the Viterbi path through the HMM that was “unrolled” with the phonetic transcription of the text. [Figure 1: Combining sound and face.] [sent-51, score-0.559]
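
For illustration, a minimal Viterbi pass over a left-to-right HMM unrolled from a transcription might look as follows; it assumes precomputed per-frame log emission densities and scalar stay/advance log-transition probabilities, and is not the authors' implementation:

```python
# A sketch of forced alignment: the path must start in the first unrolled
# phone state and end in the last, moving only by staying or advancing.
import numpy as np

def forced_align(log_emis, log_stay, log_adv):
    """log_emis: (T, K) log-density of each MFCC frame under each of the K
    unrolled phone states; returns the Viterbi state path of length T."""
    T, K = log_emis.shape
    delta = np.full((T, K), -np.inf)
    psi = np.zeros((T, K), dtype=int)
    delta[0, 0] = log_emis[0, 0]              # must start in the first phone
    for t in range(1, T):
        for k in range(K):
            stay = delta[t - 1, k] + log_stay
            adv = delta[t - 1, k - 1] + log_adv if k > 0 else -np.inf
            if stay >= adv:
                delta[t, k], psi[t, k] = stay + log_emis[t, k], k
            else:
                delta[t, k], psi[t, k] = adv + log_emis[t, k], k - 1
    path = [K - 1]                            # must end in the last phone
    for t in range(T - 1, 0, -1):
        path.append(psi[t, path[-1]])
    return path[::-1]                         # state index per sound frame
```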

32 The labels obtained from the sound stream are then used to label the corresponding video frames. [sent-52, score-0.649]

33 The difference in rate (the video is processed at 29.97 frames per second while MFCC coefficients are computed at 100 Hz) is handled by simple voting: each video frame is labelled with the phoneme that labels most of the corresponding sound frames. [sent-54, score-0.824]
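
The voting step is straightforward to sketch; the fragment below maps 100 Hz phoneme labels onto 29.97 fps frames, assuming each sound label covers a 10 ms interval. Names and interfaces are illustrative, not from the paper:

```python
# A sketch of the rate conversion: each video frame takes the phoneme label
# that covers most of its time span among the overlapping sound frames.
from collections import Counter

def label_video_frames(sound_labels, fps=29.97, sound_rate=100.0):
    n_frames = int(len(sound_labels) / sound_rate * fps)
    frame_labels = []
    for f in range(n_frames):
        lo = int(round(f / fps * sound_rate))
        hi = max(lo + 1, int(round((f + 1) / fps * sound_rate)))
        votes = Counter(sound_labels[lo:hi])
        frame_labels.append(votes.most_common(1)[0][0])
    return frame_labels
```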

35 The feature extraction for the video was done using an Active Appearance Model (AAM [6]). [sent-56, score-0.383]

36 The shape of the lower part of the face is represented by the location of 23 points on key features on the eyes, mouth and jaw-line (see Figure 2). [sent-58, score-0.296]

37 Given the position of the points in a set of training images, we align them to a common co-ordinate frame and apply PCA to learn a low-dimensional linear model capturing the shape change [5]. [sent-59, score-0.171]
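
As an illustration of this shape-model step, here is a minimal PCA sketch over aligned landmark vectors (the 23 points flattened to 46 coordinates); it is a generic sketch, not the authors' AAM code:

```python
# A sketch of the linear shape model: PCA on aligned, flattened landmarks
# yields a low-dimensional parameterisation of shape change.
import numpy as np

def fit_shape_pca(shapes, n_components):
    """shapes: (N, 46) array of aligned landmark coordinates."""
    mean = shapes.mean(axis=0)
    u, s, vt = np.linalg.svd(shapes - mean, full_matrices=False)
    basis = vt[:n_components]            # principal modes of shape variation
    return mean, basis

def encode(shape, mean, basis):
    return basis @ (shape - mean)        # shape parameters b

def decode(b, mean, basis):
    return mean + basis.T @ b            # reconstructed landmark vector
```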

38 By combining the shape and intensity models, a wide range of convincing synthetic faces can be generated [6]. [sent-63, score-0.157]

39 3 Modelling the dynamics of the face. We model the face using only phoneme labels to capture the shared information between speech and face. [sent-76, score-0.744]

40 We use 41 distinct phoneme labels, two of which are reserved for breath and silence, the rest being the generally accepted phonemes in the English language. [sent-77, score-0.207]

41 Most earlier techniques that use discrete labels to generate synthetic video sequences use some form of smooth interpolation between key frames [2, 9]. [sent-78, score-0.754]

42 Since the distribution is fitted to both the features and the difference between features, the resulting “distribution” cannot be sampled, as this would result in a nonsensical mismatch between features and delta features. [sent-81, score-0.148]

43 It is therefore not genuinely generative and obtaining new sequences from the model requires solving an optimisation problem. [sent-82, score-0.321]

44 Under Brand’s approach, new sequences are obtained by finding the most likely sequence of observations for a set of labels. [sent-83, score-0.29]

45 This is done by setting the first derivative of the likelihood with respect to the observations to zero, resulting in a set of linear equations involving, at each time $t$, the observation $y_t^s$ and the previous observation $y_{t-1}^s$. [sent-84, score-0.266]

46 This requires the storage of $O(d^2 T)$ elements and $O(d^3 T)$ time to solve, where $d$ is twice the dimensionality of the face features and $T$ is the number of frames in a sequence. [sent-86, score-0.327]

47 This becomes non-trivial for sequences exceeding a few tens of seconds. [sent-87, score-0.221]

48 More important, however, is that this cannot be done in real time, as the last label of the sequence must be known before the first observation can be computed. [sent-88, score-0.154]

49 These models are shown to outperform Brand’s approach for the generation of realistic sequences. [sent-90, score-0.183]

50 We have a set of $S$ video sequences, which we index with $s \in [1 \ldots S]$. [sent-93, score-0.34]

51 The feature vector of the frame at time $t$ in video sequence $s$ is denoted $y_t^s \in \mathbb{R}^d$, and the complete set of feature vectors for that sequence is denoted $\{y\}_1^{T_s}$, where $T_s$ is the length of the sequence. [sent-97, score-0.73]

52 Continuous hidden variables are indicated as $x$ and discrete state labels are indicated with $\pi$, where $\pi \in [1 \ldots \Pi]$. [sent-98, score-0.186]

53 In an SLDS, the sequence of observations $\{y\}_1^{T_s}$ is modelled as a noisy version of a hidden sequence $\{x\}_1^{T_s}$ which depends on a sequence of discrete labels $\{\pi\}_1^{T_s}$. [sent-102, score-0.38]

54 Each state $\pi$ is associated with a transition matrix $A_\pi$ and with distributions for the output noise $v$ and the process noise $w$, such that $y_t^s = B_{\pi_t^s} x_t^s + v_t^s$, $x_1^s \sim \mathcal{N}(\mu_{\pi_1^s}, \Sigma_{\pi_1^s})$ and $x_t^s = A_{\pi_t^s} x_{t-1}^s + \nu_{\pi_t^s} + w_t^s$ for $2 \le t \le T_s$. [sent-103, score-1.377]

55 Both the output noise $v_t^s$ and the process noise $w_t^s$ are normally distributed with zero mean: $v_t^s \sim \mathcal{N}(0, R_{\pi_t^s})$ and $w_t^s \sim \mathcal{N}(0, Q_{\pi_t^s})$. [sent-104, score-0.218]
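
To make the generative process above concrete, here is a minimal ancestral-sampling sketch for the SLDS with a known label sequence; the parameter containers and names are illustrative assumptions, not the paper's implementation:

```python
# A sketch of ancestral sampling from the SLDS given labels pi, as in this
# paper where labels come from the audio.  Parameters are dicts keyed by
# phoneme label: A, nu (dynamics), B (output map), Q, R (noise covariances),
# mu1, Sigma1 (initial state distribution).
import numpy as np

def sample_slds(pi, A, nu, B, Q, R, mu1, Sigma1, rng=np.random.default_rng()):
    x = rng.multivariate_normal(mu1[pi[0]], Sigma1[pi[0]])   # x_1
    ys = []
    for t, k in enumerate(pi):
        if t > 0:                                # x_t = A x_{t-1} + nu + w_t
            x = A[k] @ x + nu[k] + rng.multivariate_normal(
                np.zeros(len(x)), Q[k])
        y = B[k] @ x + rng.multivariate_normal(  # y_t = B x_t + v_t
            np.zeros(B[k].shape[0]), R[k])
        ys.append(y)
    return np.array(ys)
```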

56 In our case, however, the state label $\pi_t^s$ of each frame is known from the sound, and the likelihood can be computed with the same algorithm as for a standard Linear Dynamical System (LDS), which is linear in $T$. [sent-122, score-0.22]

57 Also note that neither SLDS nor LDS is commonly described with the explicit state bias $\nu_{\pi_t^s}$, as this can easily be emulated by augmenting each latent vector $x_t^s$ with a 1 and incorporating $\nu_{\pi_t^s}$ into $A_{\pi_t^s}$. [sent-124, score-0.283]

58 We trained an SLDS by maximum likelihood and used the model to generate new sequences of face observations for given sequences of labels. [sent-128, score-0.765]

59 An in-depth evaluation of the trained SLDS model, when used to generate new video sequences, is given in section 4. [sent-130, score-0.406]

60 If we set the output noise $v_t$ of the SLDS to zero, leaving only process noise, we obtain the autoregressive hidden Markov model [11]. [sent-134, score-0.156]
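
In the autoregressive case the latent states coincide with the observations, so the per-phoneme dynamics have a closed-form least-squares fit. The sketch below illustrates this, together with the bias-augmentation trick mentioned above, under the assumption of a single concatenated, labelled feature sequence (names are illustrative):

```python
# A sketch of per-phoneme least-squares estimation of A_k and nu_k when the
# latent states are observed (zero output noise).  Augmenting the previous
# frame with a constant 1 folds the bias nu_k into the regression.
import numpy as np

def fit_ar_dynamics(ys, labels, n_phones):
    """ys: (T, d) frame features; labels: (T,) integer phoneme labels."""
    d = ys.shape[1]
    A, nu = [], []
    for k in range(n_phones):
        # frames labelled k (a real implementation would also exclude each
        # sequence's first frame)
        idx = np.where(labels[1:] == k)[0] + 1
        X = np.hstack([ys[idx - 1], np.ones((len(idx), 1))])  # augment with 1
        W, *_ = np.linalg.lstsq(X, ys[idx], rcond=None)       # solve X W = Y
        A.append(W[:d].T)
        nu.append(W[d])
    return A, nu
```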

61 This model has the advantage that it can be trained using an EM algorithm when the state labels are unknown, but we find that it performs very poorly at data generation. [sent-135, score-0.183]

62 The complete hidden sequence $\{x\}_1^T$ is then determined exactly by the labels $\{\pi\}_1^T$. [sent-137, score-0.203]

63 The log-likelihood $p(\{y\} \mid \{\pi\})$ is given by
$$\log p(\{y\} \mid \{x\}) = -\frac{1}{2} \sum_{s=1}^{S} \Big[ \log|\Sigma_{\pi_1^s}| + (y_1^s - x_1^s)^\top \Sigma_{\pi_1^s}^{-1} (y_1^s - x_1^s) + \sum_{t=2}^{T_s} \big( \log|R_{\pi_t^s}| + (y_t^s - x_t^s)^\top R_{\pi_t^s}^{-1} (y_t^s - x_t^s) \big) + d\,T_s \log 2\pi \Big] \quad (1)$$
where $x_1^s = \mu_{\pi_1^s}$ and $x_t^s = A_{\pi_t^s} x_{t-1}^s + \nu_{\pi_t^s}$ for $t > 1$. [sent-138, score-1.981]
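
As a concrete check of equation (1), the sketch below computes this log-likelihood for a single sequence, with the deterministic latent states obtained by the forward iteration just defined. Parameter containers and names are illustrative, not from the paper:

```python
# A sketch of equation (1) for one sequence: x_t is propagated forward as
# x_t = A_{pi_t} x_{t-1} + nu_{pi_t}, and each frame contributes a Gaussian
# log-density term.
import numpy as np

def dpds_loglik(y, pi, A, nu, mu1, Sigma1, R):
    """y: (T, d) observed features; pi: (T,) labels; parameters are dicts."""
    T, d = y.shape
    x = mu1[pi[0]]
    r = y[0] - x
    ll = -0.5 * (np.linalg.slogdet(Sigma1[pi[0]])[1]
                 + r @ np.linalg.solve(Sigma1[pi[0]], r))
    for t in range(1, T):
        k = pi[t]
        x = A[k] @ x + nu[k]                 # deterministic latent state
        r = y[t] - x
        ll += -0.5 * (np.linalg.slogdet(R[k])[1]
                      + r @ np.linalg.solve(R[k], r))
    ll += -0.5 * d * T * np.log(2 * np.pi)
    return ll
```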

64 Deriving a closed-form solution for the ML estimates of the parameters, however, results in solving polynomial equations of order $T_s$, because $x_t^s = f(A_{\pi_2^s} \cdots A_{\pi_t^s})$. [sent-147, score-0.283]

65 The log-likelihood of a sequence is a sum of scaled quadratic terms in $(y_t^s - x_t^s)$, where $x_t^s = f(\{\pi\}_1^t)$. [sent-149, score-0.404]

66 The log-likelihood must thus be computed by a forward iteration over all time steps $t$, using $x_{t-1}^s$ to compute $x_t^s$. [sent-150, score-0.566]

67 In order to evaluate the performance of the models and compare it to Brand's model, it is however useful to generate the most likely sequence of observation features for a sequence of labels and compare it with the features of the corresponding real video sequence. [sent-157, score-0.776]

68 For both the SLDS (when $B_{\pi_t^s} = I$) and the DPDS, the mean for a given sequence of labels $\{\pi\}_1^T$ is found by a forward iteration starting with $\hat{y}_1^s = \mu_{\pi_1^s}$ and iterating for $t > 1$ with $\hat{y}_t^s = A_{\pi_t^s} \hat{y}_{t-1}^s + \nu_{\pi_t^s}$. [sent-158, score-0.339]

69 In setups where artificial speech is generated, the video sequence can therefore be generated at the same time as the audio sequence and without length limitations, with $O(d)$ space and $O(dT)$ time complexity, where $d$ is the dimensionality of the face features (without delta features). [sent-160, score-0.996]
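
The streaming nature of this generation is easy to illustrate: the sketch below emits the most likely feature vector for each incoming label with constant memory. This is an assumed interface, not the authors' code:

```python
# A sketch of real-time mean-sequence generation for DPDS (or SLDS with
# B = I): y_1 = mu_{pi_1}, then y_t = A_{pi_t} y_{t-1} + nu_{pi_t}.
def generate_mean_sequence(pi_stream, A, nu, mu1):
    y = None
    for k in pi_stream:                 # labels can arrive online
        y = mu1[k] if y is None else A[k] @ y + nu[k]
        yield y                         # e.g. feed straight to an AAM renderer
```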

70 4 Evaluation against real video. We evaluated the models in two ways: (1) by computing the error between generated face features and a ground truth (the features of real video), and (2) by asking human subjects to rate how they perceived the sequences. [sent-161, score-0.921]

71 Both tests were done on the same real-world data, but partitioned differently: the comparison to the ground truth used 10-fold cross-validation, while the test on humans used a single partitioning, due to the limited availability of unbiased test subjects. [sent-162, score-0.278]

72 In order to test the models against the ground truth, we use the sound to align the labels to the video and generate the corresponding face features. [sent-164, score-0.892]

73 The error is computed between the features generated for the test sound sequences and the face features extracted from the real video. [sent-169, score-0.824]

74 We compared the sequences generated by DPDS, Brand’s model and SLDS to the most likely observations under a standard HMM. [sent-170, score-0.29]

75 This last model just generates the mean face for each phoneme, hence resulting in very unnatural sequences. [sent-171, score-0.219]

76 It illustrates how an obviously incorrect model nevertheless performs very similarly to the other models in terms of generation error. [sent-172, score-0.198]

77 We can see that, except for the SLDS, which performs worse than the other methods in terms of $L_1$, RMS and $L_\infty$ error, the generation error for the models considered is, under all metrics, consistently not statistically significantly different. [sent-174, score-0.194]

78 The model with the highest likelihood generates the sequences with the largest error. [sent-176, score-0.29]

79 These results notwithstanding, great differences can be seen in the quality of the generated video sequences, and the models giving the lowest error or the highest likelihood are far from generating the most realistic sequences. [sent-178, score-0.556]

80 For this experiment, we trained the models on a training set of 642 sequences of an average of 5 seconds each. [sent-181, score-0.288]

81 We then labelled the sequences in our test set, which consists of 80 sequences and 436 seconds of video, with phonemes obtained from the sound. [sent-182, score-1.01]

82 These are substantial amounts of data, showing the face in a wide variety of positions. [sent-183, score-0.188]

83 We set up a web-based test, where 33 volunteers compared 12 pairs of video sequences. [sent-184, score-0.423]

84 All video sequences had original sound, but the video stream was generated by any one of four methods: (1) from the face features extracted from the corresponding real video, (2) from SLDS, (3) from Brand’s model and (4) from DPDS. [sent-185, score-1.385]

85 A pool of 80 sequences was generated from previously unseen videos. [sent-186, score-0.259]

86 The 12 pairs were chosen such that each generation method was pitted against each other generation method twice (once on each side, left or right, in order to eliminate bias towards a particular side) in random order. [sent-187, score-0.222]
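
The pairing design follows from simple counting: 4 methods give 6 unordered pairings, each shown once per side. A sketch of generating the 12 trials, with illustrative method names:

```python
# A sketch of the forced-choice trial design: C(4,2) = 6 method pairings,
# each presented once on each side, in random order.
import itertools
import random

def make_trials(methods=("real", "slds", "brand", "dpds"), seed=0):
    trials = []
    for a, b in itertools.combinations(methods, 2):
        trials.extend([(a, b), (b, a)])     # counterbalance left/right
    random.Random(seed).shuffle(trials)     # present in random order
    return trials                           # 12 (left, right) pairs
```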

87 For each pair, corresponding sequences were chosen from the respective pools at random. [sent-188, score-0.221]

88 The volunteers were only told that the sequences were either real or artificial, and were asked to either select the real video or to indicate that they could not decide. [sent-189, score-0.728]

89 The tabulated results show that, when comparing Brand's model with the DPDS, people thought the sequence generated with the former was real in 5 cases, could not make up their mind in 7 cases, and thought the sequence generated with the DPDS was real in 54 instances. [sent-198, score-0.36]

90 Despite the strong down-voting of Brand’s model in this test, the sequences generated with that model do not look all that bad. [sent-201, score-0.321]

91 In order to correlate human judgement with the generation errors discussed at the start of this section, we computed the same error measures on the data as partitioned for the psychophysical test. [sent-205, score-0.25]

92 These confirmed the earlier conclusions: the SLDS, which humans like least, gives the highest likelihood and the worst generation errors while DPDS and Brand’s model do not give significantly different errors. [sent-206, score-0.206]

93 5 Conclusion. In this work we have proposed a truly generative model, which allows real-time generation of talking faces given speech. [sent-207, score-0.226]

94 This is a trade-off for the very fast generation and visually much more appealing face animation. [sent-211, score-0.299]

95 In future work, we plan to investigate different error measures, especially on the more directly interpretable video frames rather than on the extracted features. [sent-215, score-0.492]

96 See me, hear me: Integrating automatic speech recognition and lipreading. [sent-264, score-0.209]

97 Linear predictive hidden markov models and the speech signal. [sent-287, score-0.246]

98 A tutorial on hidden markov models and selected applications in speech recognition. [sent-295, score-0.246]

99–100 [Appendix algorithm fragment: for each state $n \in \{1 \ldots \Pi\}$, a forward pass over the sequences starting in state $n$ accumulates the coefficient matrices $D^{\mu\mu}$, $D^{\mu\nu}$, $D^{\nu\nu}$ and vectors $b^\mu$ used in the parameter estimation.] [sent-348, score-0.299] [sent-354, score-0.185]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('slds', 0.416), ('video', 0.34), ('dpds', 0.291), ('xs', 0.283), ('brand', 0.247), ('sequences', 0.221), ('face', 0.188), ('yt', 0.185), ('speech', 0.169), ('sound', 0.141), ('di', 0.129), ('erent', 0.111), ('generation', 0.111), ('aam', 0.104), ('coe', 0.093), ('labels', 0.085), ('volunteers', 0.083), ('phoneme', 0.083), ('stream', 0.083), ('frames', 0.081), ('ts', 0.077), ('hmm', 0.072), ('sequence', 0.069), ('facial', 0.066), ('breath', 0.062), ('cootes', 0.062), ('lip', 0.062), ('mfcc', 0.062), ('nx', 0.062), ('phonemes', 0.062), ('dynamical', 0.059), ('features', 0.058), ('reality', 0.053), ('labelled', 0.053), ('xt', 0.052), ('shape', 0.05), ('talking', 0.05), ('animated', 0.05), ('align', 0.049), ('hidden', 0.049), ('appearance', 0.048), ('siggraph', 0.046), ('vt', 0.046), ('realistic', 0.044), ('psychophysical', 0.044), ('manchester', 0.044), ('extracted', 0.044), ('done', 0.043), ('human', 0.042), ('genuinely', 0.042), ('real', 0.042), ('frame', 0.041), ('generating', 0.041), ('recognition', 0.04), ('intensities', 0.04), ('modelled', 0.039), ('trained', 0.039), ('likelihood', 0.038), ('generated', 0.038), ('faces', 0.038), ('ort', 0.036), ('phonetic', 0.036), ('temporary', 0.036), ('cients', 0.035), ('texture', 0.035), ('rms', 0.035), ('test', 0.034), ('lds', 0.033), ('animation', 0.033), ('edwards', 0.033), ('wt', 0.033), ('switching', 0.033), ('audio', 0.033), ('cm', 0.032), ('delta', 0.032), ('active', 0.032), ('model', 0.031), ('dn', 0.031), ('unrestricted', 0.031), ('diagonal', 0.031), ('noise', 0.03), ('truth', 0.029), ('asking', 0.029), ('erence', 0.029), ('broadcast', 0.029), ('metrics', 0.029), ('models', 0.028), ('performs', 0.028), ('graphics', 0.028), ('em', 0.028), ('generative', 0.027), ('generate', 0.027), ('error', 0.027), ('speaker', 0.027), ('indicated', 0.026), ('humans', 0.026), ('gradients', 0.026), ('bn', 0.026), ('interactive', 0.026), ('partitioned', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999994 18 nips-2007-A probabilistic model for generating realistic lip movements from speech

Author: Gwenn Englebienne, Tim Cootes, Magnus Rattray

Abstract: The present work aims to model the correspondence between facial motion and speech. The face and sound are modelled separately, with phonemes being the link between both. We propose a sequential model and evaluate its suitability for the generation of the facial animation from a sequence of phonemes, which we obtain from speech. We evaluate the results both by computing the error between generated sequences and real video, as well as with a rigorous double-blind test with human subjects. Experiments show that our model compares favourably to other existing methods and that the sequences generated are comparable to real video sequences. 1

2 0.17393941 123 nips-2007-Loop Series and Bethe Variational Bounds in Attractive Graphical Models

Author: Alan S. Willsky, Erik B. Sudderth, Martin J. Wainwright

Abstract: Variational methods are frequently used to approximate or bound the partition or likelihood function of a Markov random field. Methods based on mean field theory are guaranteed to provide lower bounds, whereas certain types of convex relaxations provide upper bounds. In general, loopy belief propagation (BP) provides often accurate approximations, but not bounds. We prove that for a class of attractive binary models, the so–called Bethe approximation associated with any fixed point of loopy BP always lower bounds the true likelihood. Empirically, this bound is much tighter than the naive mean field bound, and requires no further work than running BP. We establish these lower bounds using a loop series expansion due to Chertkov and Chernyak, which we show can be derived as a consequence of the tree reparameterization characterization of BP fixed points. 1

3 0.13918617 4 nips-2007-A Constraint Generation Approach to Learning Stable Linear Dynamical Systems

Author: Byron Boots, Geoffrey J. Gordon, Sajid M. Siddiqi

Abstract: Stability is a desirable characteristic for linear dynamical systems, but it is often ignored by algorithms that learn these systems from data. We propose a novel method for learning stable linear dynamical systems: we formulate an approximation of the problem as a convex program, start with a solution to a relaxed version of the program, and incrementally add constraints to improve stability. Rather than continuing to generate constraints until we reach a feasible solution, we test stability at each step; because the convex program is only an approximation of the desired problem, this early stopping rule can yield a higher-quality solution. We apply our algorithm to the task of learning dynamic textures from image sequences as well as to modeling biosurveillance drug-sales data. The constraint generation approach leads to noticeable improvement in the quality of simulated sequences. We compare our method to those of Lacy and Bernstein [1, 2], with positive results in terms of accuracy, quality of simulated sequences, and efficiency. 1

4 0.10643865 109 nips-2007-Kernels on Attributed Pointsets with Applications

Author: Mehul Parsana, Sourangshu Bhattacharya, Chiru Bhattacharya, K. Ramakrishnan

Abstract: This paper introduces kernels on attributed pointsets, which are sets of vectors embedded in an euclidean space. The embedding gives the notion of neighborhood, which is used to define positive semidefinite kernels on pointsets. Two novel kernels on neighborhoods are proposed, one evaluating the attribute similarity and the other evaluating shape similarity. Shape similarity function is motivated from spectral graph matching techniques. The kernels are tested on three real life applications: face recognition, photo album tagging, and shot annotation in video sequences, with encouraging results. 1

5 0.099604875 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons

Author: Lars Buesing, Wolfgang Maass

Abstract: We show that under suitable assumptions (primarily linearization) a simple and perspicuous online learning rule for Information Bottleneck optimization with spiking neurons can be derived. This rule performs on common benchmark tasks as well as a rather complex rule that has previously been proposed [1]. Furthermore, the transparency of this new learning rule makes a theoretical analysis of its convergence properties feasible. A variation of this learning rule (with sign changes) provides a theoretically founded method for performing Principal Component Analysis (PCA) with spiking neurons. By applying this rule to an ensemble of neurons, different principal components of the input can be extracted. In addition, it is possible to preferentially extract those principal components from incoming signals X that are related or are not related to some additional target signal YT . In a biological interpretation, this target signal YT (also called relevance variable) could represent proprioceptive feedback, input from other sensory modalities, or top-down signals. 1

6 0.095205724 130 nips-2007-Modeling Natural Sounds with Modulation Cascade Processes

7 0.09410838 155 nips-2007-Predicting human gaze using low-level saliency combined with face detection

8 0.092307344 188 nips-2007-Subspace-Based Face Recognition in Analog VLSI

9 0.090602413 146 nips-2007-On higher-order perceptron algorithms

10 0.081526138 71 nips-2007-Discriminative Keyword Selection Using Support Vector Machines

11 0.066768147 57 nips-2007-Congruence between model and human attention reveals unique signatures of critical visual events

12 0.065569252 107 nips-2007-Iterative Non-linear Dimensionality Reduction with Manifold Sculpting

13 0.06325046 210 nips-2007-Unconstrained On-line Handwriting Recognition with Recurrent Neural Networks

14 0.062892839 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

15 0.061211582 154 nips-2007-Predicting Brain States from fMRI Data: Incremental Functional Principal Component Regression

16 0.061107907 140 nips-2007-Neural characterization in partially observed populations of spiking neurons

17 0.05986489 183 nips-2007-Spatial Latent Dirichlet Allocation

18 0.058879983 21 nips-2007-Adaptive Online Gradient Descent

19 0.05813254 50 nips-2007-Combined discriminative and generative articulated pose and non-rigid shape estimation

20 0.057142653 17 nips-2007-A neural network implementing optimal state estimation based on dynamic spike train decoding


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.191), (1, 0.024), (2, 0.018), (3, -0.05), (4, 0.044), (5, 0.078), (6, 0.008), (7, 0.091), (8, -0.062), (9, -0.045), (10, 0.057), (11, 0.094), (12, -0.081), (13, 0.079), (14, -0.165), (15, -0.012), (16, -0.208), (17, -0.031), (18, -0.031), (19, 0.048), (20, 0.027), (21, 0.006), (22, 0.091), (23, 0.059), (24, -0.044), (25, 0.158), (26, 0.118), (27, 0.069), (28, 0.024), (29, -0.072), (30, -0.073), (31, -0.059), (32, -0.112), (33, 0.125), (34, 0.031), (35, 0.048), (36, 0.043), (37, -0.188), (38, -0.115), (39, 0.215), (40, 0.002), (41, 0.015), (42, 0.077), (43, -0.035), (44, 0.051), (45, -0.143), (46, 0.136), (47, 0.077), (48, 0.034), (49, -0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95289505 18 nips-2007-A probabilistic model for generating realistic lip movements from speech

Author: Gwenn Englebienne, Tim Cootes, Magnus Rattray

Abstract: The present work aims to model the correspondence between facial motion and speech. The face and sound are modelled separately, with phonemes being the link between both. We propose a sequential model and evaluate its suitability for the generation of the facial animation from a sequence of phonemes, which we obtain from speech. We evaluate the results both by computing the error between generated sequences and real video, as well as with a rigorous double-blind test with human subjects. Experiments show that our model compares favourably to other existing methods and that the sequences generated are comparable to real video sequences. 1

2 0.64813793 4 nips-2007-A Constraint Generation Approach to Learning Stable Linear Dynamical Systems

Author: Byron Boots, Geoffrey J. Gordon, Sajid M. Siddiqi

Abstract: Stability is a desirable characteristic for linear dynamical systems, but it is often ignored by algorithms that learn these systems from data. We propose a novel method for learning stable linear dynamical systems: we formulate an approximation of the problem as a convex program, start with a solution to a relaxed version of the program, and incrementally add constraints to improve stability. Rather than continuing to generate constraints until we reach a feasible solution, we test stability at each step; because the convex program is only an approximation of the desired problem, this early stopping rule can yield a higher-quality solution. We apply our algorithm to the task of learning dynamic textures from image sequences as well as to modeling biosurveillance drug-sales data. The constraint generation approach leads to noticeable improvement in the quality of simulated sequences. We compare our method to those of Lacy and Bernstein [1, 2], with positive results in terms of accuracy, quality of simulated sequences, and efficiency. 1

3 0.50846046 188 nips-2007-Subspace-Based Face Recognition in Analog VLSI

Author: Gonzalo Carvajal, Waldo Valenzuela, Miguel Figueroa

Abstract: We describe an analog-VLSI neural network for face recognition based on subspace methods. The system uses a dimensionality-reduction network whose coefficients can be either programmed or learned on-chip to perform PCA, or programmed to perform LDA. A second network with userprogrammed coefficients performs classification with Manhattan distances. The system uses on-chip compensation techniques to reduce the effects of device mismatch. Using the ORL database with 12x12-pixel images, our circuit achieves up to 85% classification performance (98% of an equivalent software implementation). 1

4 0.49878347 123 nips-2007-Loop Series and Bethe Variational Bounds in Attractive Graphical Models

Author: Alan S. Willsky, Erik B. Sudderth, Martin J. Wainwright

Abstract: Variational methods are frequently used to approximate or bound the partition or likelihood function of a Markov random field. Methods based on mean field theory are guaranteed to provide lower bounds, whereas certain types of convex relaxations provide upper bounds. In general, loopy belief propagation (BP) provides often accurate approximations, but not bounds. We prove that for a class of attractive binary models, the so–called Bethe approximation associated with any fixed point of loopy BP always lower bounds the true likelihood. Empirically, this bound is much tighter than the naive mean field bound, and requires no further work than running BP. We establish these lower bounds using a loop series expansion due to Chertkov and Chernyak, which we show can be derived as a consequence of the tree reparameterization characterization of BP fixed points. 1

5 0.47460434 130 nips-2007-Modeling Natural Sounds with Modulation Cascade Processes

Author: Richard Turner, Maneesh Sahani

Abstract: Natural sounds are structured on many time-scales. A typical segment of speech, for example, contains features that span four orders of magnitude: Sentences (∼ 1 s); phonemes (∼ 10−1 s); glottal pulses (∼ 10−2 s); and formants ( 10−3 s). The auditory system uses information from each of these time-scales to solve complicated tasks such as auditory scene analysis [1]. One route toward understanding how auditory processing accomplishes this analysis is to build neuroscienceinspired algorithms which solve similar tasks and to compare the properties of these algorithms with properties of auditory processing. There is however a discord: Current machine-audition algorithms largely concentrate on the shorter time-scale structures in sounds, and the longer structures are ignored. The reason for this is two-fold. Firstly, it is a difficult technical problem to construct an algorithm that utilises both sorts of information. Secondly, it is computationally demanding to simultaneously process data both at high resolution (to extract short temporal information) and for long duration (to extract long temporal information). The contribution of this work is to develop a new statistical model for natural sounds that captures structure across a wide range of time-scales, and to provide efficient learning and inference algorithms. We demonstrate the success of this approach on a missing data task. 1

6 0.45599875 28 nips-2007-Augmented Functional Time Series Representation and Forecasting with Gaussian Processes

7 0.43766207 210 nips-2007-Unconstrained On-line Handwriting Recognition with Recurrent Neural Networks

8 0.41700834 109 nips-2007-Kernels on Attributed Pointsets with Applications

9 0.40801993 146 nips-2007-On higher-order perceptron algorithms

10 0.38146675 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons

11 0.36393899 57 nips-2007-Congruence between model and human attention reveals unique signatures of critical visual events

12 0.35871923 155 nips-2007-Predicting human gaze using low-level saliency combined with face detection

13 0.35173753 68 nips-2007-Discovering Weakly-Interacting Factors in a Complex Stochastic Process

14 0.35102794 71 nips-2007-Discriminative Keyword Selection Using Support Vector Machines

15 0.32677516 137 nips-2007-Multiple-Instance Pruning For Learning Efficient Cascade Detectors

16 0.32359689 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data

17 0.30717695 174 nips-2007-Selecting Observations against Adversarial Objectives

18 0.29322386 150 nips-2007-Optimal models of sound localization by barn owls

19 0.28792244 153 nips-2007-People Tracking with the Laplacian Eigenmaps Latent Variable Model

20 0.28788537 15 nips-2007-A general agnostic active learning algorithm


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.046), (13, 0.063), (14, 0.226), (16, 0.028), (18, 0.022), (19, 0.015), (21, 0.083), (31, 0.02), (34, 0.028), (35, 0.037), (47, 0.093), (49, 0.017), (83, 0.132), (85, 0.021), (87, 0.044), (90, 0.05)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.94120729 26 nips-2007-An online Hebbian learning rule that performs Independent Component Analysis

Author: Claudia Clopath, André Longtin, Wulfram Gerstner

Abstract: Independent component analysis (ICA) is a powerful method to decouple signals. Most of the algorithms performing ICA do not consider the temporal correlations of the signal, but only higher moments of its amplitude distribution. Moreover, they require some preprocessing of the data (whitening) so as to remove second order correlations. In this paper, we are interested in understanding the neural mechanism responsible for solving ICA. We present an online learning rule that exploits delayed correlations in the input. This rule performs ICA by detecting joint variations in the firing rates of pre- and postsynaptic neurons, similar to a local rate-based Hebbian learning rule. 1

same-paper 2 0.80210662 18 nips-2007-A probabilistic model for generating realistic lip movements from speech

Author: Gwenn Englebienne, Tim Cootes, Magnus Rattray

Abstract: The present work aims to model the correspondence between facial motion and speech. The face and sound are modelled separately, with phonemes being the link between both. We propose a sequential model and evaluate its suitability for the generation of the facial animation from a sequence of phonemes, which we obtain from speech. We evaluate the results both by computing the error between generated sequences and real video, as well as with a rigorous double-blind test with human subjects. Experiments show that our model compares favourably to other existing methods and that the sequences generated are comparable to real video sequences. 1

3 0.65990579 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning

Author: Kai Yu, Wei Chu

Abstract: This paper aims to model relational data on edges of networks. We describe appropriate Gaussian Processes (GPs) for directed, undirected, and bipartite networks. The inter-dependencies of edges can be effectively modeled by adapting the GP hyper-parameters. The framework suggests an intimate connection between link prediction and transfer learning, which were traditionally two separate research topics. We develop an efficient learning algorithm that can handle a large number of observations. The experimental results on several real-world data sets verify superior learning capacity. 1

4 0.65982693 73 nips-2007-Distributed Inference for Latent Dirichlet Allocation

Author: David Newman, Padhraic Smyth, Max Welling, Arthur U. Asuncion

Abstract: We investigate the problem of learning a widely-used latent-variable model – the Latent Dirichlet Allocation (LDA) or “topic” model – using distributed computation, where each of processors only sees of the total data set. We propose two distributed inference schemes that are motivated from different perspectives. The first scheme uses local Gibbs sampling on each processor with periodic updates—it is simple to implement and can be viewed as an approximation to a single processor implementation of Gibbs sampling. The second scheme relies on a hierarchical Bayesian extension of the standard LDA model to directly account for the fact that data are distributed across processors—it has a theoretical guarantee of convergence but is more complex to implement than the approximate method. Using five real-world text corpora we show that distributed learning works very well for LDA models, i.e., perplexity and precision-recall scores for distributed learning are indistinguishable from those obtained with single-processor learning. Our extensive experimental results include large-scale distributed computation on 1000 virtual processors; and speedup experiments of learning topics in a 100-million word corpus using 16 processors. ¢ ¤ ¦¥£ ¢ ¢

5 0.6593256 63 nips-2007-Convex Relaxations of Latent Variable Training

Author: Yuhong Guo, Dale Schuurmans

Abstract: We investigate a new, convex relaxation of an expectation-maximization (EM) variant that approximates a standard objective while eliminating local minima. First, a cautionary result is presented, showing that any convex relaxation of EM over hidden variables must give trivial results if any dependence on the missing values is retained. Although this appears to be a strong negative outcome, we then demonstrate how the problem can be bypassed by using equivalence relations instead of value assignments over hidden variables. In particular, we develop new algorithms for estimating exponential conditional models that only require equivalence relation information over the variable values. This reformulation leads to an exact expression for EM variants in a wide range of problems. We then develop a semidefinite relaxation that yields global training by eliminating local minima. 1

6 0.65755016 86 nips-2007-Exponential Family Predictive Representations of State

7 0.65639162 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data

8 0.6558134 153 nips-2007-People Tracking with the Laplacian Eigenmaps Latent Variable Model

9 0.65552181 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images

10 0.65454966 69 nips-2007-Discriminative Batch Mode Active Learning

11 0.65220219 180 nips-2007-Sparse Feature Learning for Deep Belief Networks

12 0.65207797 115 nips-2007-Learning the 2-D Topology of Images

13 0.65204978 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons

14 0.65187937 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC

15 0.65104419 189 nips-2007-Supervised Topic Models

16 0.65038347 24 nips-2007-An Analysis of Inference with the Universum

17 0.65006036 84 nips-2007-Expectation Maximization and Posterior Constraints

18 0.64902997 172 nips-2007-Scene Segmentation with CRFs Learned from Partially Labeled Images

19 0.64800131 2 nips-2007-A Bayesian LDA-based model for semi-supervised part-of-speech tagging

20 0.64791268 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)