cvpr cvpr2013 cvpr2013-159 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Robert Anderson, Björn Stenger, Vincent Wan, Roberto Cipolla
Abstract: This paper presents a complete system for expressive visual text-to-speech (VTTS), which is capable of producing expressive output, in the form of a ‘talking head’, given an input text and a set of continuous expression weights. The face is modeled using an active appearance model (AAM), and several extensions are proposed which make it more applicable to the task of VTTS. The model allows for normalization with respect to both pose and blink state which significantly reduces artifacts in the resulting synthesized sequences. We demonstrate quantitative improvements in terms of reconstruction error over a million frames, as well as in large-scale user studies, comparing the output of different systems.
Reference: text
sentIndex sentText sentNum sentScore
1 The face is modeled using an active appearance model (AAM), and several extensions are proposed which make it more applicable to the task of VTTS. [sent-2, score-0.23]
2 Introduction This paper presents a system for expressive visual text-to-speech (VTTS) that generates near-videorealistic output. [sent-6, score-0.18]
3 Expressive VTTS allows the text to be annotated with emotion labels which modulate the expression of the generated output. [sent-8, score-0.211]
4 Creating and animating talking face models with a high degree of realism has been a long-standing goal, as it has significant potential for digital content creation and enabling new types of user interfaces [16, 19, 27]. [sent-9, score-0.322]
5 It is becoming increasingly clear that in order to achieve this aim, one needs to draw on methods from different areas, including computer graphics, speech processing, and computer vision. [sent-10, score-0.255]
6 While systems exist that produce high quality animations for neutral speech [6, 16, 25], adding controllable, realistic facial expressions is still challenging [1, 5]. [sent-11, score-0.574]
7 Currently the most realistic data-driven VTTS systems are based on unit selection, splitting up the video into short sections and subsequently concatenating and blending these sections at the synthesis stage. [sent-12, score-0.32]
8 Due to the high degree of variation in appearance during expressive speech, the number of units required to allow realistic animation becomes excessive. [sent-15, score-0.322]
9 Concatenating the HMMs and sampling from them produces a set of parameters which can then be synthesized into a speech signal. [sent-18, score-0.286]
10 In this paper we propose using the established active appearance model (AAM) to model face shape and appearance [7]. [sent-20, score-0.272]
11 While AAMs have been used in VTTS systems for neutral speech in the past [10, 23], there are a number of difficulties when applying standard AAMs to the task of expressive face modeling. [sent-21, score-0.655]
12 The most significant problem is that AAMs capture a mixture of expression, mouth shape and head pose within each mode, making it impossible to model these effects independently. [sent-22, score-0.225]
13 Due to the large variation of pose and expression in expressive VTTS this leads to artifacts in synthesis as spurious correlations are learned. [sent-23, score-0.531]
14 In this paper we propose a number of extensions that allow AAMs to be used for synthesis tasks with a higher degree of realism. [sent-25, score-0.218]
15 1. a complete visual text-to-speech system allowing synthesis with a continuous range of emotions, introduced in section 4, [sent-27, score-0.214]
16 2. extensions to the standard AAM that allow the separation of modes for global and local shape and appearance deformations, detailed in section 3, and 3. [sent-28, score-0.442]
17 The experiments demonstrate a clear improvement in synthesis quality in expressive VTTS. [sent-30, score-0.328]
18 Physics-based methods model face movement based on simulating the effects of muscle interaction, thereby allowing anatomically plausible animation [1, 20]. [sent-33, score-0.186]
19 Unit selection methods allow videorealistic synthesis as they concatenate examples actually seen in a training set [16, 25]. [sent-35, score-0.305]
20 Statistical modeling approaches use a training set to build models of the speech generation process. [sent-39, score-0.292]
21 The main advantages of these methods are the flexibility they provide in dealing with coarticulation and their ability to handle expression variation in a principled manner. [sent-42, score-0.172]
22 Face models for VTTS A number of different face models have been proposed for videorealistic VTTS systems. [sent-45, score-0.164]
23 The resulting appearance is realistic, but this technique limits the synthesis method to unit selection [16, 26]. [sent-47, score-0.234]
24 Their main advantages are their invariance to 3D pose changes and their ability to render with an arbitrary pose and lighting at synthesis time. [sent-49, score-0.295]
25 While computer vision techniques continue to drive progress in this area [3, 17], until now only relatively small training sets have been acquired, insufficient in size to generate realistic expressive models [5, 24]. [sent-51, score-0.225]
26 Good results have been achieved animating 3D models that do not attempt to appear videorealistic; this avoids the uncanny valley and produces visually appealing synthesis such as that in [21]. [sent-52, score-0.208]
27 Active appearance models In this paper we use AAMs as they produce good results for neutral speech while the low-dimensional parametric representation enables their combination with standard TTS methods. [sent-58, score-0.511]
28 The specific requirements for our system are that the model must be able to track robustly and quickly over a very large corpus of expressive training data and that it must be possible to synthesize videorealistic renderings from statistical models of its parameters. [sent-60, score-0.33]
29 There has been extensive work on tracking expressive data, for example the work of De la Torre and Black [9] in which several independent AAMs representing different regions of the face are created by hand and linked together by a shared affine warp. [sent-61, score-0.251]
30 Modifications for convincing synthesis from AAMs, on the other hand, are much less well explored. [sent-62, score-0.181]
31 When AAMs have been used for VTTS in the past, small head pose variations have been removed by subtracting the mean AAM parameters for each sentence from all frames within that sentence [10]; however, this approach works for small rotations only and leads to a loss of expressiveness. [sent-63, score-0.177]
32 Another approach is that of [11], in which canonical discriminant analysis is used to find semantically meaningful modes and a least squares approach is used to remove the contributions of these modes from training samples. [sent-66, score-0.635]
33 However, this approach is not well suited to modeling local deformations such as blinking, and the least squares approach to removing the learned modes from training samples can give disproportionate weighting to the appearance component. [sent-67, score-0.503]
34 Extending AAMs for Expressive Faces This section first briefly introduces the standard AAM with its notation and then details the proposed extensions to improve its performance in the expressive VTTS setting. [sent-69, score-0.184]
35 Throughout this paper we assume that the number of shape and appearance modes is equal but the techniques are equally applicable if this is not the case; modes with zero magnitude can be inserted to ensure this. Figure 1: Pose invariant AAM modes. [sent-72, score-0.704]
36 The first two modes of a standard AAM (left) encode a mixture of pose, mouth shape and expression variation. [sent-73, score-0.522]
37 (right) The first two modes of a pose invariant AAM encode only rotation, allowing head pose to be decoupled from expression and mouth shape. [sent-74, score-0.526]
38 s = s0 + Σi ci si (1), where s0 is the mean shape of the model, si is the ith mode of M linear shape modes and ci is its corresponding parameter. [sent-84, score-0.442]
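As a concrete illustration of the shape model in (1), the following is a minimal numpy sketch. The array layout (each shape stored as a flattened 2V-vector of vertex coordinates) and the orthonormality of the PCA modes are assumptions, and the function names are illustrative rather than taken from the paper.

```python
import numpy as np

def reconstruct_shape(s0, S, c):
    """AAM shape model: mean shape plus a weighted sum of linear shape modes.

    s0 : (2V,)   mean shape (x/y coordinates of V mesh vertices, flattened)
    S  : (M, 2V) rows are the M linear shape modes s_i
    c  : (M,)    shape parameters c_i
    """
    return s0 + c @ S

def shape_parameters(s, s0, S):
    """Recover parameters for an observed shape; with orthonormal PCA modes the
    least-squares projection reduces to this matrix-vector product."""
    return S @ (s - s0)
```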
39 Pose invariant AAM modes The global nature of AAMs leads to some of the modes handling variation which is due to both 3D pose change and local deformation, see figure 1 (left). [sent-97, score-0.655]
40 Here we propose a method for finding AAM modes that correspond purely to head rotation or to other physically meaningful motions. [sent-98, score-0.347]
41 More formally, we would like to express a face shape s as a combination of pose components and deformation components: s = s0 + Σi ci^pose si^pose + Σi ci^deform si^deform. [sent-99, score-0.23]
42 We first find the shape components that model pose, {si^pose} for i = 1, . . . , K, by recording a short training sequence of head rotation with a fixed neutral expression and applying PCA to the observed mean normalized shapes s − s0. [sent-109, score-0.426]
43 (4) Having found these weights, we remove the pose component from each training shape to obtain a pose normalized training shape s∗ = s − Σi ci^pose si^pose. [sent-111, score-0.294]
44 If shape and appearance were indeed independent then we could find the deformation components by principal component analysis (PCA) of a training set of shape samples normalized as in (5), ensuring that only modes orthogonal to the pose modes are found, in the same way as [11]. [sent-116, score-0.894]
45 (6) The model is then constructed by using these weights in (5) and finding the deformation modes from samples of the complete training set. [sent-123, score-0.379]
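The pose-normalization pipeline described above can be sketched as follows. This is a hedged illustration only: the exact criteria behind equations (4)–(6) are not reproduced in this extraction, so plain PCA and ordinary least squares are assumed, and all names are illustrative.

```python
import numpy as np

def learn_pose_modes(rotation_shapes, s0, K=2):
    """PCA on a short head-rotation sequence (fixed neutral expression)
    gives the pose shape modes {s_i^pose}."""
    X = rotation_shapes - s0                                   # mean-normalized shapes, (N, 2V)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:K]                                              # (K, 2V)

def remove_pose(shapes, s0, S_pose):
    """Fit per-frame pose weights by least squares on the mean-normalized
    shapes, then subtract the pose component from each training shape."""
    X = shapes - s0                                            # (N, 2V)
    C_pose, *_ = np.linalg.lstsq(S_pose.T, X.T, rcond=None)    # (K, N)
    return shapes - C_pose.T @ S_pose                          # pose-normalized shapes s*

def learn_deformation_modes(pose_normalized, s0, n_modes):
    """PCA of the pose-normalized shapes (about the mean shape) yields the
    deformation modes."""
    _, _, Vt = np.linalg.svd(pose_normalized - s0, full_matrices=False)
    return Vt[:n_modes]
```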
46 Local deformation modes In this section we propose a method to obtain modes for local deformations such as eye blinking. [sent-127, score-0.641]
47 Firstly shape and appearance modes which model blinking are learned from a video containing blinking with no other head motion. [sent-129, score-0.718]
48 Using the approach of the previous section to remove these blinking modes from the training set introduces artifacts. [sent-131, score-0.45]
49 The reason for this is apparent when considering the shape mode associated with blinking in which the majority of the movement is in the eyelid. [sent-132, score-0.231]
50 This means that if the eyes are in a different position relative to the centroid of the face (for example if the mouth is open, lowering the centroid) then the eyelid is moved toward the mean eyelid position, even if this artificially opens or closes the eye. [sent-133, score-0.214]
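The remedy the authors adopt is not fully reproduced in this extraction; the sketch below shows one plausible variant under that caveat, fitting the blink-mode weight on eye-region vertices only so that the fit is not dragged around by the face centroid. The function name and `eye_idx` are hypothetical.

```python
import numpy as np

def remove_blink(shape, s0, s_blink, eye_idx):
    """Estimate the blink-mode weight from eye-region vertices only, then
    subtract that component from the whole shape.

    eye_idx : indices into the flattened shape vector covering the eye region.
    Restricting the fit to the eyes avoids artificially opening or closing
    them when the mouth (and hence the centroid) moves.
    """
    d = shape - s0
    num = float(np.dot(s_blink[eye_idx], d[eye_idx]))
    den = float(np.dot(s_blink[eye_idx], s_blink[eye_idx]))
    w = num / den
    return shape - w * s_blink
```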
51 Segmenting AAMs into regions Different regions of the face can be moved nearly independently, a fact that has previously been exploited by segmenting the face into regions, which are modeled separately and blended at their boundaries [2, 9, 22]. [sent-142, score-0.208]
52 The modes for each region are learned by only considering a subset of the model’s vertices according to manually selected boundaries marked in the mean shape. [sent-159, score-0.299]
53 The advantage of this is that changes in mouth shape during synthesis cannot lead to artifacts in the upper half of the face. [sent-164, score-0.344]
54 Since global modes are used to model pose there is no risk of the upper and lower halves of the face having a different pose. [sent-165, score-0.433]
55 Assuming that the number of training samples N is larger than the number of modes M, the new shape modes can be obtained as the least-squares solution. [sent-194, score-0.688]
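Read literally, this least-squares step amounts to refitting full-face shape modes to the training shapes given the (fixed) per-sample parameters obtained from the region models. A minimal sketch under that reading, with illustrative names and an assumed normalization of the training shapes:

```python
import numpy as np

def refit_shape_modes(X, C):
    """Solve X ≈ C S for the new shape modes S in the least-squares sense.

    X : (N, 2V) training shapes (after any normalization used by the model)
    C : (N, M)  fixed per-sample parameters, with N > M
    returns S : (M, 2V)
    """
    S, *_ = np.linalg.lstsq(C, X, rcond=None)
    return S
```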
56 Adding regions with static texture Since the teeth and tongue are occluded in many of the training examples, the synthesis of these regions contains significant artifacts when modeled using a standard AAM. [sent-198, score-0.349]
57 Synthesis framework Our synthesis model takes advantage of an existing TTS approach known as cluster adaptive training (CAT). [sent-204, score-0.266]
58 The audio and video data are modeled using separate streams within a CAT model, a brief overview of which is given next. [sent-206, score-0.167]
59 HMM TTS is a parametric approach to speech synthesis [29] which models quinphones using HMMs with five emitting states. [sent-210, score-0.515]
60 Typically, a decision tree is used to cluster the quinphones to handle sparseness in the training data. [sent-212, score-0.164]
61 By interpolating between λexpr1 and λexpr2 we can synthesize speech with an expression between two of the originally recorded expressions. [sent-221, score-0.358]
62 From the data 300 sentences were held out as a test set and the remaining data was used to train the speech model. [sent-226, score-0.378]
63 The speech data was parameterized using a standard feature set consisting of 45-dimensional Mel-frequency cepstral coefficients, log-F0 (pitch) and 25 band aperiodicities, together with the first and second time derivatives of these features. [sent-227, score-0.255]
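For the derivative features, a minimal sketch is given below; the delta-window coefficients used by the authors are not stated in this extraction, so simple central differences are assumed here in place of the usual regression windows.

```python
import numpy as np

def add_delta_features(features):
    """Append first and second time derivatives to a (T, D) feature matrix,
    giving a (T, 3D) matrix of statics, deltas and delta-deltas."""
    d1 = np.gradient(features, axis=0)   # first time derivative
    d2 = np.gradient(d1, axis=0)         # second time derivative
    return np.concatenate([features, d1, d2], axis=1)
```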
64 Each cluster is represented by a decision tree and defines a basis in expression space. [sent-260, score-0.178]
65 Given expression weights λ = [λ1, . . . , λP], the properties of the HMMs to use for synthesis can be found as a linear sum of the cluster properties. [sent-264, score-0.229]
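The two CAT operations used at synthesis time, forming model statistics as a weighted sum over clusters and interpolating between recorded expressions as mentioned above, can be sketched as follows. The decision-tree clustering and any bias cluster are omitted, and the names are illustrative.

```python
import numpy as np

def cat_combine(cluster_means, lam):
    """Linear sum of per-cluster mean vectors weighted by the expression
    weights lambda = [lam_1, ..., lam_P].

    cluster_means : (P, D) contribution of each cluster for a given state
    lam           : (P,)   expression weight vector
    """
    return np.asarray(lam) @ np.asarray(cluster_means)

def interpolate_expressions(lam_expr1, lam_expr2, alpha):
    """Blend two recorded expressions; alpha in [0, 1] moves from expr1 to expr2."""
    lam1, lam2 = np.asarray(lam_expr1), np.asarray(lam_expr2)
    return (1.0 - alpha) * lam1 + alpha * lam2
```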
66 The second model, AAMdecomp, separates both 3D head rotation (modeled by two modes) and blinking (modeled by one mode) from the deformation modes as described in sections 3. [sent-270, score-0.504]
67 The third model, AAMregions, is built in the same way as AAMdecomp except that 8 modes are used to model the lower half of the face and 6 to model the upper half, see section 3. [sent-273, score-0.376]
68 It can be seen that, while AAMbase has the lowest reconstruction error when few modes are used, the difference in error decreases as the number of modes increases. [sent-283, score-0.299]
69 In other words, the flexibility that semantically meaningful modes provide does not come at the expense of reduced tracking accuracy. [sent-284, score-0.299]
70 It can be seen that the average errors of all models converge as the number of modes increases. [sent-289, score-0.299]
71 (c) An example of tracking failure for AAMbase since this combination of mouth shape and expression did not appear in the training set. [sent-292, score-0.26]
72 For each preference test 10 sentences in each of the six emotions were generated with two models rendered side by side. [sent-302, score-0.369]
73 Each pair of AAMs was evaluated by 10 users who were asked to select between the left model, right model or having no preference (the order of our model renderings was switched between experiments to avoid bias), resulting in a total of 600 pairwise comparisons per preference test. [sent-303, score-0.215]
74 In this experiment the videos were shown without audio in order to focus on the quality of the face model. [sent-304, score-0.18]
75 This preference is most pronounced for expressions such as angry, where there is a large amount of head motion, and less so for emotions such as neutral and tender, which do not involve significant movement of the head. [sent-306, score-0.611]
76 This demonstrates that the proposed extensions are particularly beneficial to expressive VTTS. [sent-307, score-0.184]
77 2 Comparison with other talking heads In order to compare the output of different VTTS systems, users were asked to rate the realism of sample synthesized sentences on a scale of 1 to 5, with 5 corresponding to ‘completely real’ and 1 to ‘completely unreal’. [sent-310, score-0.369]
78 Sample sentences that were publicly available were chosen for the evaluation, and scaled to a face region height of approximately 200 pixels. [sent-311, score-0.2]
79 The degree of expressiveness of the systems ranges from neutral speech only to highly expressive. [sent-312, score-0.431]
80 was rated most realistic among the systems for neutral speech and with a small degree of expressiveness. [sent-314, score-0.51]
81 The proposed system performs comparably to other methods in the neutral speech category, while for larger ranges of expression it achieved a significantly higher score than the system by Cao et al. [sent-315, score-0.6]
82 Users were presented either with video or audio clips of a single sentence from the test set and were asked to identify the emotion expressed by the speaker, selecting from a list of six emotions. [sent-320, score-0.284]
83 We also compared with versions of synthetic video only and synthetic audio only, as well as cropped versions of the actual video footage. [sent-322, score-0.295]
84 In each case 10 sentences in each of the six emotions were evaluated by 20 people, resulting in a total sample size of 1200. [sent-323, score-0.292]
85 The average recognition rates are 73% for the captured footage, 77% for our generated video (with audio), 52% for the synthetic video only and 68% for the synthetic audio only. [sent-325, score-0.295]
86 There is a preference for the refined models in the average score over all emotions; this is mostly due to the emotions with a large amount of movement, such as angry. [sent-334, score-0.246]
87 The preference for the proposed model over other AAMs is particularly clear for emotions with significant head motion, such as angry shown in the right table. [sent-335, score-0.365]
88 4) Users rated the realism of sample sentences generated using different VTTS systems where higher values correspond to more realistic output. [sent-362, score-0.259]
89 Tender and neutral expressions are most easily confused in all cases. [sent-368, score-0.221]
90 While some emotions are better recognized from audio only, the overall recognition rate is higher when using both cues. [sent-369, score-0.272]
91 Conclusions and future work In this paper we have demonstrated a complete visual text-to-speech system which is capable of creating nearvideorealistic synthesis of expressive text. [sent-371, score-0.396]
92 To improve the performance of our system we have adapted active appearance models to reduce the main artifacts resulting from using a person-specific active appearance model for rendering. [sent-373, score-0.254]
93 We are grateful to all researchers in the Speech Technology Group at Toshiba Research Europe for their work on the speech synthesis side of the model. [sent-376, score-0.436]
94 Mixed feelings: expression of non-basic emotions in a muscle-based talking head. [sent-383, score-0.395]
95 Towards perceptually realistic talking heads: models, methods and McGurk. [sent-427, score-0.164]
96 Figure 5: Emotion recognition for (a) real video cropped to face, (b) synthetic audio and video, (c) synthetic video only and (d) synthetic audio only. [sent-440, score-0.457]
97 In each case 10 sentences in each emotion were evaluated by 20 different people. [sent-441, score-0.231]
98 MikeTalk: A talking facial display based on morphing visemes. [sent-455, score-0.18]
99 Realistic facial expression synthesis for an image-based talking head. [sent-487, score-0.464]
100 Photo-real lips synthesis with trajectory-guided sample selection. [sent-557, score-0.207]
wordName wordTfidf (topN-words)
[('aam', 0.51), ('modes', 0.299), ('aams', 0.282), ('vtts', 0.26), ('speech', 0.255), ('synthesis', 0.181), ('neutral', 0.176), ('emotions', 0.169), ('expressive', 0.147), ('talking', 0.123), ('sentences', 0.123), ('blinking', 0.114), ('emotion', 0.108), ('aamregions', 0.104), ('audio', 0.103), ('expression', 0.103), ('hmms', 0.09), ('videorealistic', 0.087), ('face', 0.077), ('preference', 0.077), ('angry', 0.071), ('aambase', 0.069), ('aamdecomp', 0.069), ('coarticulation', 0.069), ('tender', 0.069), ('cat', 0.068), ('mouth', 0.067), ('teeth', 0.061), ('synthetic', 0.059), ('facial', 0.057), ('realism', 0.057), ('pose', 0.057), ('torre', 0.056), ('tts', 0.054), ('animation', 0.053), ('appearance', 0.053), ('shape', 0.053), ('cisipose', 0.052), ('controllable', 0.052), ('decomp', 0.052), ('expr', 0.052), ('quinphones', 0.052), ('cluster', 0.048), ('head', 0.048), ('expressions', 0.045), ('artifacts', 0.043), ('deformation', 0.043), ('realistic', 0.041), ('cao', 0.041), ('rated', 0.038), ('user', 0.038), ('mode', 0.037), ('video', 0.037), ('training', 0.037), ('extensions', 0.037), ('active', 0.036), ('speaker', 0.036), ('edwards', 0.036), ('sentence', 0.036), ('users', 0.035), ('aamfull', 0.035), ('afraid', 0.035), ('deena', 0.035), ('emxpr', 0.035), ('eyelid', 0.035), ('ezzat', 0.035), ('gales', 0.035), ('knill', 0.035), ('mmms', 0.035), ('nearvideorealistic', 0.035), ('sipose', 0.035), ('toshiba', 0.035), ('blending', 0.034), ('system', 0.033), ('interspeech', 0.031), ('theobald', 0.031), ('phoneme', 0.031), ('synthesized', 0.031), ('studies', 0.029), ('mesh', 0.029), ('muscle', 0.029), ('zen', 0.029), ('units', 0.028), ('modifications', 0.028), ('concatenating', 0.027), ('la', 0.027), ('decision', 0.027), ('currently', 0.027), ('modeled', 0.027), ('blended', 0.027), ('animating', 0.027), ('morphable', 0.027), ('happy', 0.027), ('movement', 0.027), ('parametric', 0.027), ('europe', 0.026), ('sca', 0.026), ('lips', 0.026), ('renderings', 0.026), ('tog', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 159 cvpr-2013-Expressive Visual Text-to-Speech Using Active Appearance Models
Author: Robert Anderson, Björn Stenger, Vincent Wan, Roberto Cipolla
Abstract: This paper presents a complete system for expressive visual text-to-speech (VTTS), which is capable of producing expressive output, in the form of a ‘talking head’, given an input text and a set of continuous expression weights. The face is modeled using an active appearance model (AAM), and several extensions are proposed which make it more applicable to the task of VTTS. The model allows for normalization with respect to both pose and blink state which significantly reduces artifacts in the resulting synthesized sequences. We demonstrate quantitative improvements in terms of reconstruction error over a million frames, as well as in large-scale user studies, comparing the output of different systems.
Author: Yue Wu, Zuoguan Wang, Qiang Ji
Abstract: Facial feature tracking is an active area in computer vision due to its relevance to many applications. It is a nontrivial task, sincefaces may have varyingfacial expressions, poses or occlusions. In this paper, we address this problem by proposing a face shape prior model that is constructed based on the Restricted Boltzmann Machines (RBM) and their variants. Specifically, we first construct a model based on Deep Belief Networks to capture the face shape variations due to varying facial expressions for near-frontal view. To handle pose variations, the frontal face shape prior model is incorporated into a 3-way RBM model that could capture the relationship between frontal face shapes and non-frontal face shapes. Finally, we introduce methods to systematically combine the face shape prior models with image measurements of facial feature points. Experiments on benchmark databases show that with the proposed method, facial feature points can be tracked robustly and accurately even if faces have significant facial expressions and poses.
3 0.14728212 276 cvpr-2013-MKPLS: Manifold Kernel Partial Least Squares for Lipreading and Speaker Identification
Author: Amr Bakry, Ahmed Elgammal
Abstract: Visual speech recognition is a challenging problem, due to confusion between visual speech features. The speaker identification problem is usually coupled with speech recognition. Moreover, speaker identification is important to several applications, such as automatic access control, biometrics, authentication, and personal privacy issues. In this paper, we propose a novel approach for lipreading and speaker identification. Wepropose a new approachfor manifold parameterization in a low-dimensional latent space, where each manifold is represented as a point in that space. We initially parameterize each instance manifold using a nonlinear mapping from a unified manifold representation. We then factorize the parameter space using Kernel Partial Least Squares (KPLS) to achieve a low-dimension manifold latent space. We use two-way projections to achieve two manifold latent spaces, one for the speech content and one for the speaker. We apply our approach on two public databases: AVLetters and OuluVS. We show the results for three different settings of lipreading: speaker independent, speaker dependent, and speaker semi-dependent. Our approach outperforms for the speaker semi-dependent setting by at least 15% of the baseline, and competes in the other two settings.
4 0.13958135 277 cvpr-2013-MODEC: Multimodal Decomposable Models for Human Pose Estimation
Author: Ben Sapp, Ben Taskar
Abstract: We propose a multimodal, decomposable model for articulated human pose estimation in monocular images. A typical approach to this problem is to use a linear structured model, which struggles to capture the wide range of appearance present in realistic, unconstrained images. In this paper, we instead propose a model of human pose that explicitly captures a variety of pose modes. Unlike other multimodal models, our approach includes both global and local pose cues and uses a convex objective and joint training for mode selection and pose estimation. We also employ a cascaded mode selection step which controls the trade-off between speed and accuracy, yielding a 5x speedup in inference and learning. Our model outperforms state-of-theart approaches across the accuracy-speed trade-off curve for several pose datasets. This includes our newly-collected dataset of people in movies, FLIC, which contains an order of magnitude more labeled data for training and testing than existing datasets. The new dataset and code are avail- able online. 1
5 0.10186164 359 cvpr-2013-Robust Discriminative Response Map Fitting with Constrained Local Models
Author: Akshay Asthana, Stefanos Zafeiriou, Shiyang Cheng, Maja Pantic
Abstract: We present a novel discriminative regression based approach for the Constrained Local Models (CLMs) framework, referred to as the Discriminative Response Map Fitting (DRMF) method, which shows impressive performance in the generic face fitting scenario. The motivation behind this approach is that, unlike the holistic texture based features used in the discriminative AAM approaches, the response map can be represented by a small set of parameters and these parameters can be very efficiently used for reconstructing unseen response maps. Furthermore, we show that by adopting very simple off-the-shelf regression techniques, it is possible to learn robust functions from response maps to the shape parameters updates. The experiments, conducted on Multi-PIE, XM2VTS and LFPW database, show that the proposed DRMF method outperforms stateof-the-art algorithms for the task of generic face fitting. Moreover, the DRMF method is computationally very efficient and is real-time capable. The current MATLAB implementation takes 1second per image. To facilitate future comparisons, we release the MATLAB code1 and the pretrained models for research purposes.
6 0.097215891 77 cvpr-2013-Capturing Complex Spatio-temporal Relations among Facial Muscles for Facial Expression Recognition
7 0.087328285 420 cvpr-2013-Supervised Descent Method and Its Applications to Face Alignment
8 0.08280988 438 cvpr-2013-Towards Pose Robust Face Recognition
9 0.082164377 152 cvpr-2013-Exemplar-Based Face Parsing
10 0.07863833 119 cvpr-2013-Detecting and Aligning Faces by Image Retrieval
11 0.071866244 338 cvpr-2013-Probabilistic Elastic Matching for Pose Variant Face Verification
12 0.071283482 82 cvpr-2013-Class Generative Models Based on Feature Regression for Pose Estimation of Object Categories
13 0.070901513 25 cvpr-2013-A Sentence Is Worth a Thousand Pixels
14 0.062673517 399 cvpr-2013-Single-Sample Face Recognition with Image Corruption and Misalignment via Sparse Illumination Transfer
15 0.060826335 182 cvpr-2013-Fusing Robust Face Region Descriptors via Multiple Metric Learning for Face Recognition in the Wild
16 0.059950948 92 cvpr-2013-Constrained Clustering and Its Application to Face Clustering in Videos
17 0.059659064 459 cvpr-2013-Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots
18 0.059153587 73 cvpr-2013-Bringing Semantics into Focus Using Visual Abstraction
19 0.05865572 160 cvpr-2013-Face Recognition in Movie Trailers via Mean Sequence Sparse Representation-Based Classification
20 0.058524534 353 cvpr-2013-Relative Hidden Markov Models for Evaluating Motion Skill
topicId topicWeight
[(0, 0.138), (1, -0.005), (2, -0.031), (3, -0.016), (4, 0.016), (5, -0.009), (6, -0.008), (7, -0.052), (8, 0.119), (9, -0.087), (10, 0.048), (11, 0.023), (12, -0.014), (13, 0.037), (14, 0.024), (15, 0.047), (16, 0.019), (17, 0.023), (18, 0.007), (19, 0.016), (20, -0.053), (21, -0.005), (22, -0.011), (23, 0.002), (24, -0.031), (25, 0.02), (26, 0.034), (27, -0.007), (28, 0.059), (29, 0.006), (30, -0.01), (31, -0.064), (32, -0.073), (33, 0.002), (34, -0.044), (35, -0.01), (36, -0.032), (37, 0.049), (38, -0.064), (39, -0.006), (40, -0.005), (41, 0.015), (42, -0.093), (43, 0.089), (44, -0.044), (45, 0.032), (46, 0.004), (47, -0.021), (48, 0.017), (49, -0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.89507413 159 cvpr-2013-Expressive Visual Text-to-Speech Using Active Appearance Models
Author: Robert Anderson, Björn Stenger, Vincent Wan, Roberto Cipolla
Abstract: This paper presents a complete system for expressive visual text-to-speech (VTTS), which is capable of producing expressive output, in the form of a ‘talking head’, given an input text and a set of continuous expression weights. The face is modeled using an active appearance model (AAM), and several extensions are proposed which make it more applicable to the task of VTTS. The model allows for normalization with respect to both pose and blink state which significantly reduces artifacts in the resulting synthesized sequences. We demonstrate quantitative improvements in terms of reconstruction error over a million frames, as well as in large-scale user studies, comparing the output of different systems.
Author: Yue Wu, Zuoguan Wang, Qiang Ji
Abstract: Facial feature tracking is an active area in computer vision due to its relevance to many applications. It is a nontrivial task, sincefaces may have varyingfacial expressions, poses or occlusions. In this paper, we address this problem by proposing a face shape prior model that is constructed based on the Restricted Boltzmann Machines (RBM) and their variants. Specifically, we first construct a model based on Deep Belief Networks to capture the face shape variations due to varying facial expressions for near-frontal view. To handle pose variations, the frontal face shape prior model is incorporated into a 3-way RBM model that could capture the relationship between frontal face shapes and non-frontal face shapes. Finally, we introduce methods to systematically combine the face shape prior models with image measurements of facial feature points. Experiments on benchmark databases show that with the proposed method, facial feature points can be tracked robustly and accurately even if faces have significant facial expressions and poses.
3 0.70571673 420 cvpr-2013-Supervised Descent Method and Its Applications to Face Alignment
Author: Xuehan Xiong, Fernando De_la_Torre
Abstract: Many computer vision problems (e.g., camera calibration, image alignment, structure from motion) are solved through a nonlinear optimization method. It is generally accepted that 2nd order descent methods are the most robust, fast and reliable approaches for nonlinear optimization ofa general smoothfunction. However, in the context of computer vision, 2nd order descent methods have two main drawbacks: (1) The function might not be analytically differentiable and numerical approximations are impractical. (2) The Hessian might be large and not positive definite. To address these issues, thispaperproposes a Supervised Descent Method (SDM) for minimizing a Non-linear Least Squares (NLS) function. During training, the SDM learns a sequence of descent directions that minimizes the mean of NLS functions sampled at different points. In testing, SDM minimizes the NLS objective using the learned descent directions without computing the Jacobian nor the Hessian. We illustrate the benefits of our approach in synthetic and real examples, and show how SDM achieves state-ofthe-art performance in the problem of facial feature detec- tion. The code is available at www. .human sen sin g. . cs . cmu . edu/in t ra fa ce.
4 0.69962162 77 cvpr-2013-Capturing Complex Spatio-temporal Relations among Facial Muscles for Facial Expression Recognition
Author: Ziheng Wang, Shangfei Wang, Qiang Ji
Abstract: Spatial-temporal relations among facial muscles carry crucial information about facial expressions yet have not been thoroughly exploited. One contributing factor for this is the limited ability of the current dynamic models in capturing complex spatial and temporal relations. Existing dynamic models can only capture simple local temporal relations among sequential events, or lack the ability for incorporating uncertainties. To overcome these limitations and take full advantage of the spatio-temporal information, we propose to model the facial expression as a complex activity that consists of temporally overlapping or sequential primitive facial events. We further propose the Interval Temporal Bayesian Network to capture these complex temporal relations among primitive facial events for facial expression modeling and recognition. Experimental results on benchmark databases demonstrate the feasibility of the proposed approach in recognizing facial expressions based purely on spatio-temporal relations among facial muscles, as well as its advantage over the existing methods.
5 0.68857747 385 cvpr-2013-Selective Transfer Machine for Personalized Facial Action Unit Detection
Author: Wen-Sheng Chu, Fernando De La Torre, Jeffery F. Cohn
Abstract: Automatic facial action unit (AFA) detection from video is a long-standing problem in facial expression analysis. Most approaches emphasize choices of features and classifiers. They neglect individual differences in target persons. People vary markedly in facial morphology (e.g., heavy versus delicate brows, smooth versus deeply etched wrinkles) and behavior. Individual differences can dramatically influence how well generic classifiers generalize to previously unseen persons. While a possible solution would be to train person-specific classifiers, that often is neither feasible nor theoretically compelling. The alternative that we propose is to personalize a generic classifier in an unsupervised manner (no additional labels for the test subjects are required). We introduce a transductive learning method, which we refer to Selective Transfer Machine (STM), to personalize a generic classifier by attenuating person-specific biases. STM achieves this effect by simultaneously learning a classifier and re-weighting the training samples that are most relevant to the test subject. To evaluate the effectiveness of STM, we compared STM to generic classifiers and to cross-domain learning methods in three major databases: CK+ [20], GEMEP-FERA [32] and RU-FACS [2]. STM outperformed generic classifiers in all.
6 0.68565756 359 cvpr-2013-Robust Discriminative Response Map Fitting with Constrained Local Models
7 0.63759255 415 cvpr-2013-Structured Face Hallucination
8 0.61370552 463 cvpr-2013-What's in a Name? First Names as Facial Attributes
9 0.60832506 118 cvpr-2013-Detecting Pulse from Head Motions in Video
10 0.59517175 104 cvpr-2013-Deep Convolutional Network Cascade for Facial Point Detection
11 0.5826878 438 cvpr-2013-Towards Pose Robust Face Recognition
12 0.56049567 214 cvpr-2013-Image Understanding from Experts' Eyes by Modeling Perceptual Skill of Diagnostic Reasoning Processes
13 0.55123174 358 cvpr-2013-Robust Canonical Time Warping for the Alignment of Grossly Corrupted Sequences
14 0.54064757 152 cvpr-2013-Exemplar-Based Face Parsing
15 0.52125198 353 cvpr-2013-Relative Hidden Markov Models for Evaluating Motion Skill
16 0.51650691 399 cvpr-2013-Single-Sample Face Recognition with Image Corruption and Misalignment via Sparse Illumination Transfer
17 0.50097358 321 cvpr-2013-PDM-ENLOR: Learning Ensemble of Local PDM-Based Regressions
18 0.49243948 454 cvpr-2013-Video Enhancement of People Wearing Polarized Glasses: Darkening Reversal and Reflection Reduction
19 0.48396948 73 cvpr-2013-Bringing Semantics into Focus Using Visual Abstraction
20 0.48238805 119 cvpr-2013-Detecting and Aligning Faces by Image Retrieval
topicId topicWeight
[(10, 0.101), (16, 0.019), (26, 0.066), (33, 0.204), (55, 0.328), (67, 0.07), (69, 0.037), (87, 0.072)]
simIndex simValue paperId paperTitle
same-paper 1 0.73875123 159 cvpr-2013-Expressive Visual Text-to-Speech Using Active Appearance Models
Author: Robert Anderson, Björn Stenger, Vincent Wan, Roberto Cipolla
Abstract: This paper presents a complete system for expressive visual text-to-speech (VTTS), which is capable of producing expressive output, in the form of a ‘talking head’, given an input text and a set of continuous expression weights. The face is modeled using an active appearance model (AAM), and several extensions are proposed which make it more applicable to the task of VTTS. The model allows for normalization with respect to both pose and blink state which significantly reduces artifacts in the resulting synthesized sequences. We demonstrate quantitative improvements in terms of reconstruction error over a million frames, as well as in large-scale user studies, comparing the output of different systems.
2 0.71409726 26 cvpr-2013-A Statistical Model for Recreational Trails in Aerial Images
Author: Andrew Predoehl, Scott Morris, Kobus Barnard
Abstract: unkown-abstract
3 0.69638073 445 cvpr-2013-Understanding Bayesian Rooms Using Composite 3D Object Models
Author: Luca Del_Pero, Joshua Bowdish, Bonnie Kermgard, Emily Hartley, Kobus Barnard
Abstract: We develop a comprehensive Bayesian generative model for understanding indoor scenes. While it is common in this domain to approximate objects with 3D bounding boxes, we propose using strong representations with finer granularity. For example, we model a chair as a set of four legs, a seat and a backrest. We find that modeling detailed geometry improves recognition and reconstruction, and enables more refined use of appearance for scene understanding. We demonstrate this with a new likelihood function that re- wards 3D object hypotheses whose 2D projection is more uniform in color distribution. Such a measure would be confused by background pixels if we used a bounding box to represent a concave object like a chair. Complex objects are modeled using a set or re-usable 3D parts, and we show that this representation captures much of the variation among object instances with relatively few parameters. We also designed specific data-driven inference mechanismsfor eachpart that are shared by all objects containing that part, which helps make inference transparent to the modeler. Further, we show how to exploit contextual relationships to detect more objects, by, for example, proposing chairs around and underneath tables. We present results showing the benefits of each of these innovations. The performance of our approach often exceeds that of state-of-the-art methods on the two tasks of room layout estimation and object recognition, as evaluated on two bench mark data sets used in this domain. work. 1) Detailed geometric models, such as tables with legs and top (bottom left), provide better reconstructions than plain boxes (top right), when supported by image features such as geometric context [5] (top middle), or an approach to using color introduced here. 2) Non convex models allow for complex configurations, such as a chair under a table (bottom middle). 3) 3D contextual relationships, such as chairs being around a table, allow identifying objects supported by little image evidence, like the chair behind the table (bottom right). Best viewed in color.
4 0.66220719 420 cvpr-2013-Supervised Descent Method and Its Applications to Face Alignment
Author: Xuehan Xiong, Fernando De_la_Torre
Abstract: Many computer vision problems (e.g., camera calibration, image alignment, structure from motion) are solved through a nonlinear optimization method. It is generally accepted that 2nd order descent methods are the most robust, fast and reliable approaches for nonlinear optimization ofa general smoothfunction. However, in the context of computer vision, 2nd order descent methods have two main drawbacks: (1) The function might not be analytically differentiable and numerical approximations are impractical. (2) The Hessian might be large and not positive definite. To address these issues, thispaperproposes a Supervised Descent Method (SDM) for minimizing a Non-linear Least Squares (NLS) function. During training, the SDM learns a sequence of descent directions that minimizes the mean of NLS functions sampled at different points. In testing, SDM minimizes the NLS objective using the learned descent directions without computing the Jacobian nor the Hessian. We illustrate the benefits of our approach in synthetic and real examples, and show how SDM achieves state-ofthe-art performance in the problem of facial feature detec- tion. The code is available at www. .human sen sin g. . cs . cmu . edu/in t ra fa ce.
5 0.61077583 248 cvpr-2013-Learning Collections of Part Models for Object Recognition
Author: Ian Endres, Kevin J. Shih, Johnston Jiaa, Derek Hoiem
Abstract: We propose a method to learn a diverse collection of discriminative parts from object bounding box annotations. Part detectors can be trained and applied individually, which simplifies learning and extension to new features or categories. We apply the parts to object category detection, pooling part detections within bottom-up proposed regions and using a boosted classifier with proposed sigmoid weak learners for scoring. On PASCAL VOC 2010, we evaluate the part detectors ’ ability to discriminate and localize annotated keypoints. Our detection system is competitive with the best-existing systems, outperforming other HOG-based detectors on the more deformable categories.
6 0.60934478 311 cvpr-2013-Occlusion Patterns for Object Class Detection
7 0.60546148 414 cvpr-2013-Structure Preserving Object Tracking
8 0.60403568 365 cvpr-2013-Robust Real-Time Tracking of Multiple Objects by Volumetric Mass Densities
9 0.60329145 104 cvpr-2013-Deep Convolutional Network Cascade for Facial Point Detection
10 0.6031121 325 cvpr-2013-Part Discovery from Partial Correspondence
11 0.60240704 225 cvpr-2013-Integrating Grammar and Segmentation for Human Pose Estimation
12 0.60156578 381 cvpr-2013-Scene Parsing by Integrating Function, Geometry and Appearance Models
13 0.60120934 285 cvpr-2013-Minimum Uncertainty Gap for Robust Visual Tracking
14 0.60116315 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection
15 0.6006614 331 cvpr-2013-Physically Plausible 3D Scene Tracking: The Single Actor Hypothesis
16 0.60061508 277 cvpr-2013-MODEC: Multimodal Decomposable Models for Human Pose Estimation
17 0.60059267 254 cvpr-2013-Learning SURF Cascade for Fast and Accurate Object Detection
18 0.60056567 119 cvpr-2013-Detecting and Aligning Faces by Image Retrieval
19 0.60056299 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases
20 0.60018009 440 cvpr-2013-Tracking People and Their Objects