cvpr cvpr2013 cvpr2013-104 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yi Sun, Xiaogang Wang, Xiaoou Tang
Abstract: We propose a new approach for estimating the positions of facial keypoints with three levels of carefully designed convolutional networks. At each level, the outputs of multiple networks are fused for robust and accurate estimation. Thanks to the deep structures of the convolutional networks, global high-level features are extracted over the whole face region at the initialization stage, which helps to locate keypoints with high accuracy. This has two advantages. First, the texture context information over the entire face is utilized to locate each keypoint. Second, since the networks are trained to predict all the keypoints simultaneously, the geometric constraints among keypoints are implicitly encoded. The method can therefore avoid local minima caused by ambiguity and data corruption in difficult image samples due to occlusions, large pose variations, and extreme lighting. The networks at the following two levels are trained to locally refine the initial predictions, and their inputs are limited to small regions around those predictions. Several network structures critical for accurate and robust facial point detection are investigated. Extensive experiments show that our approach outperforms state-of-the-art methods in both detection accuracy and reliability.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We propose a new approach for estimating the positions of facial keypoints with three levels of carefully designed convolutional networks. [sent-5, score-1.05]
2 At each level, the outputs of multiple networks are fused for robust and accurate estimation. [sent-6, score-0.525]
3 Thanks to the deep structures of convolutional networks, global high-level features are extracted over the whole face region at the initialization stage, which helps to locate keypoints with high accuracy. [sent-7, score-0.956]
4 First, the texture context information over the entire face is utilized to locate each keypoint. [sent-9, score-0.165]
5 Second, since the networks are trained to predict all the keypoints simultaneously, the geometric constraints among keypoints are implicitly encoded. [sent-10, score-0.753]
6 The networks at the following two levels are trained to locally refine initial predictions, and their inputs are limited to small regions around those predictions. [sent-12, score-0.857]
7 Several network structures critical for accurate and robust facial point detection are investigated. [sent-13, score-0.549]
8 Introduction. Facial keypoint detection is critical for face recognition and analysis, and has been studied extensively in recent years [3, 4, 5, 8, 9, 11, 20, 21, 23, 25, 26, 27, 28]. [sent-16, score-0.26]
9 This problem is challenging when face images are taken with extreme poses, lighting, expressions, and occlusions, as shown in Figure 1. [sent-17, score-0.132]
10 Existing approaches can be generally divided into two categories: classifying search windows [3, 4, 11, 20, 28] or directly predicting keypoint positions (or shape parameters) [5, 8, 9, 21, 25, 26]. [sent-18, score-0.191]
11 First row: initial detection with our first level of convolutional networks. [sent-30, score-0.663]
12 It achieves good estimation with global context information even if some facial components are invisible or ambiguous in appearance. [sent-31, score-0.4]
13 Second row: finely tuned results with our second and third levels of networks. [sent-32, score-0.143]
14 In the first category, a classifier called a component detector is trained for each keypoint, and the decision is made based on local regions. [sent-36, score-0.131]
15 Since local features can be ambiguous or corrupted, multiple candidate regions that all look like the facial point may be found, or no suitable candidate region may be found at all. [sent-37, score-0.458]
16 In that case, an optimal configuration of facial points is estimated with shape constraints [3, 4, 11, 20, 23, 28]. [sent-38, score-0.333]
17 Compared with component detectors, directly predicting keypoint positions (or shape parameters) is more efficient, since it does not require scanning. [sent-39, score-0.228]
18 Regressors are often used as the predictor, based on local patches close to the facial point [9, 26], or the whole image region [5, 25]. [sent-40, score-0.396]
19 Many approaches [5, 8, 11, 20, 21, 23, 25, 26] update the positions of facial points iteratively, and good initializations are critical. [sent-42, score-0.365]
20 In addition, many approaches face the problem that the visual features extracted are not discriminative or not reliable enough to predict facial points, and context information becomes important. [sent-44, score-0.491]
21 It is desirable to directly extract texture context information over the whole face region, since it contains rich information. [sent-46, score-0.161]
22 To solve these problems, we propose a cascaded regression approach for facial point detection with three levels of convolutional networks. [sent-48, score-1.054]
23 Unlike existing approaches, which only roughly estimate the initial positions of facial points, our convolutional networks make accurate predictions at the first level, even in very challenging cases, as shown in Figure 1. [sent-49, score-1.513]
24 Our convolutional networks are trained to predict all the keypoints simultaneously, and the constraints among keypoints are implicitly encoded. [sent-52, score-1.321]
25 The remaining two levels of convolutional networks refine the initial estimation of keypoints. [sent-53, score-1.158]
26 Different from existing methods [5, 25, 26], which apply the same regressor at different cascade stages, we design different convolutional networks for the different stages. [sent-54, score-0.639]
27 The network structures at these two levels are shallower, since their tasks are low-level and their input is limited to small local regions around the initial positions. [sent-55, score-0.392]
28 At each level, multiple convolutional networks are fused to improve the accuracy and reliability of estimation. [sent-56, score-1.096]
29 Through detailed empirical investigation, we find that several factors regarding the network structures are critical for achieving good performance in facial point detection. [sent-57, score-0.519]
30 Related Work. Significant progress on facial keypoint detection has been achieved in recent years. [sent-60, score-0.428]
31 Shape constraints are important to refine component detection results, and much research has focused on this. [sent-62, score-0.13]
32 Dantone et al. [9] and Valstar et al. [26] predicted facial points from local patches with random forests and support vector regressors, respectively. [sent-69, score-0.47]
33 Valstar et al. [26] modelled the spatial relations of facial points with a Markov random field, and Dantone et al. [sent-71, score-0.304]
34 [9] fused many predictions from patches densely sampled within the face region. [sent-72, score-0.335]
35 [5] used the whole face region as input and random ferns as the regressor. [sent-76, score-0.225]
36 Convolutional networks and other deep models have been successfully used in vision tasks such as face detection and pose estimation [24], face parsing [22], image classification [6, 17], and scene parsing [10]. [sent-78, score-0.874]
37 Research on convolutional networks mainly focuses on two aspects: network structures and feature learning algorithms. [sent-79, score-1.191]
38 [7] analyzed the performance of single-layer networks with different filter strides, filter sizes, and the numbers of feature maps. [sent-81, score-0.439]
39 [14] introduced strong nonlinearities after convolution, including absolute value rectification and local contrast normalization, and also compared different combinations of nonlinearities and pooling strategies. [sent-83, score-0.348]
40 Only recently has the true potential of convolutional networks been discovered, as they have become big (with hundreds of maps per layer) and deep (with up to five convolutional stages). [sent-84, score-1.762]
41 An even larger convolutional network was introduced in [17], and it significantly improved image classification accuracy on ImageNet. [sent-87, score-0.699]
42 Examples of recently proposed feature learning algorithms include convolutional sparse coding [16] and topographic independent component analysis [18]. [sent-88, score-0.635]
43 Cascaded convolutional networks. In this paper, we focus on the structural design of individual networks and their combining strategies. [sent-90, score-1.446]
44 There are five facial points to be detected: left eye center (LE), right eye center (RE), nose tip (N), left mouth corner (LM), and right mouth corner (RM). [sent-92, score-0.678]
45 We cascade three levels of convolutional networks to make coarse-to-fine prediction. [sent-93, score-1.195]
46 At the first level, we employ three deep convolutional networks, F1, EN1, and NM1, whose input regions cover the whole face (F1), the eyes and nose (EN1), and the nose and mouth (NM1). [sent-94, score-1.215]
47 Each network simultaneously predicts multiple facial points; the networks at level 1 are denoted F1, EN1, and NM1. [sent-95, score-0.671]
48 Green square is the face bounding box given by the face detector. [sent-99, score-0.236]
49 Red dots are the final predictions at each level. [sent-101, score-0.169]
50 Dots in other colors are predictions given by individual networks. [sent-102, score-0.141]
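As a concrete illustration of the level-1 setup described above, the following is a minimal Python sketch of how the three input regions might be cut from the detected face bounding box; the vertical split fractions are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def crop(img, x0, y0, x1, y1):
    """Cut a sub-window, clipped to the image borders."""
    h, w = img.shape[:2]
    return img[max(0, y0):min(h, y1), max(0, x0):min(w, x1)]

def level1_inputs(img, bbox):
    """Cut the three level-1 input regions from a face bounding box.

    bbox is (x, y, w, h) from the face detector. F1 sees the whole face,
    EN1 the eyes and nose, NM1 the nose and mouth. The 0.65 / 0.35 split
    fractions below are assumptions for illustration.
    """
    x, y, w, h = bbox
    f1_in = crop(img, x, y, x + w, y + h)
    en1_in = crop(img, x, y, x + w, y + int(0.65 * h))
    nm1_in = crop(img, x, y + int(0.35 * h), x + w, y + h)
    return f1_in, en1_in, nm1_in
```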
51 Figure 3: The structure of deep convolutional network F1. [sent-103, score-0.842]
52 Sizes of input, convolution, and max pooling layers are illustrated by cuboids whose length, width, and height denote the number of maps, and the size of each map. [sent-104, score-0.235]
53 Local receptive fields of neurons in different layers are illustrated by small squares in the cuboids. [sent-105, score-0.323]
54 For each facial point, the predictions of multiple networks are averaged to reduce the variance. [sent-107, score-0.884]
55 Figure 3 illustrates the deep structure of F1, which contains four convolutional layers followed by max pooling, and two fully connected layers. [sent-108, score-0.838]
56 EN1 and NM1 take the same deep structure, but with different sizes at each layer since the sizes of their input regions are different. [sent-109, score-0.407]
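To make this structure concrete, here is a minimal PyTorch sketch of an F1-like network: four convolutional stages (the first three followed by max pooling) and two fully connected layers, ending in ten outputs, one (x, y) pair per facial point. The input size, map counts, and kernel sizes are assumptions chosen so the dimensions work out, and weights are shared globally here rather than locally as in the paper.

```python
import torch
import torch.nn as nn

class AbsTanh(nn.Module):
    """Hyperbolic tangent followed by absolute value rectification."""
    def forward(self, x):
        return torch.abs(torch.tanh(x))

# Sizes below are illustrative assumptions, not the paper's exact numbers.
f1 = nn.Sequential(
    nn.Conv2d(1, 20, kernel_size=4), AbsTanh(), nn.MaxPool2d(2),   # 39 -> 36 -> 18
    nn.Conv2d(20, 40, kernel_size=3), AbsTanh(), nn.MaxPool2d(2),  # 18 -> 16 -> 8
    nn.Conv2d(40, 60, kernel_size=3), AbsTanh(), nn.MaxPool2d(2),  #  8 ->  6 -> 3
    nn.Conv2d(60, 80, kernel_size=2), AbsTanh(),                   #  3 ->  2
    nn.Flatten(),
    nn.Linear(80 * 2 * 2, 120), nn.Tanh(),
    nn.Linear(120, 10),  # (x, y) for each of the five facial points
)

out = f1(torch.randn(1, 1, 39, 39))  # out.shape == (1, 10)
```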
57 Networks at the second and third levels take local patches centered at the predicted positions of facial points from previous levels as input and are only allowed to make small changes to previous predictions. [sent-110, score-0.714]
58 Predictions at the last two levels are strictly restricted because local appearance is sometimes ambiguous and unreliable. [sent-112, score-0.182]
59 The predicted position of each point at the last two levels is given by the average of the two networks with different patch sizes. [sent-113, score-0.636]
60 While networks at the first level aim to estimate keypoint positions robustly with few large errors, networks at the last two levels are designed to achieve high accuracy. [sent-114, score-1.215]
61 All the networks at the last two levels share a common shallower structure since their tasks are low-level. [sent-115, score-0.62]
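Below is a minimal sketch of one such refinement step for a single point, assuming a hypothetical net(patch) -> (dx, dy) interface for the shallow networks and a clamp radius max_shift standing in for the strict restriction described above.

```python
import numpy as np

def refine_point(img, pred, nets, patch_sizes, max_shift):
    """Refine one facial point at level 2 or 3.

    pred        : (x, y) prediction from the previous level
    nets        : two shallow networks, one per patch size, each mapping a
                  local patch to a (dx, dy) adjustment (hypothetical interface)
    patch_sizes : the two patch side lengths
    max_shift   : only small changes to the previous prediction are allowed
    """
    x, y = int(pred[0]), int(pred[1])
    adjustments = []
    for net, s in zip(nets, patch_sizes):
        patch = img[y - s // 2:y + s // 2, x - s // 2:x + s // 2]
        adjustments.append(net(patch))
    dx, dy = np.mean(adjustments, axis=0)  # average the two networks
    dx = float(np.clip(dx, -max_shift, max_shift))
    dy = float(np.clip(dy, -max_shift, max_shift))
    return pred[0] + dx, pred[1] + dy
```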
62 Network structure selection. We analyze three important factors in the choice of network structures. [sent-118, score-0.131]
63 The discussions are limited to networks at the first level, which are the hardest to train. [sent-119, score-0.467]
64 First, convolutional networks at the first level should be deep. [sent-120, score-0.504]
65 Predicting keypoints from large input regions is a high-level task. [sent-121, score-0.208]
66 Deeper structures help to form high-level features, which are global while features extracted by neurons at lower layers are local due to local receptive fields. [sent-122, score-0.376]
67 By combining spatially nearby features extracted at lower layers, neurons at higher layers can extract features from larger regions. [sent-123, score-0.28]
68 Adding layers increases the nonlinearity of the mapping from input to output and makes it possible to represent their complex relationship. [sent-125, score-0.197]
69 Second, for neurons in the convolutional layers, absolute value rectification after the hyperbolic tangent activation function (see details in Section 4) can effectively improve the performance. [sent-126, score-0.843]
70 This modification over traditional convolutional networks was proposed in [14], where improvement on Caltech-101 was observed. [sent-127, score-1.007]
71 Third, locally sharing weights of neurons on the same map improves the performance. [sent-129, score-0.31]
72 Traditional convolutional networks share weights of all the neurons on the same map based on two considerations. [sent-130, score-1.204]
73 Second, weight sharing helps to prevent gradient diffusion when back-propagating through many layers, since gradients of shared weights are aggregated, which makes supervised learning on deep structures easier. [sent-133, score-0.34]
74 So for networks whose inputs contain different semantic regions, locally sharing weights at high layers is more effective for learning different high-level features. [sent-138, score-0.749]
75 The idea of locally sharing weights was originally proposed for convolutional deep belief networks for face recognition [12]. [sent-141, score-0.973]
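As a sketch of the idea (not necessarily the paper's exact scheme), a locally shared convolutional layer can be built by splitting each map into a p-by-q grid and giving each cell its own convolution; p = q = 1 recovers the standard globally shared layer, matching the special case noted in Section 4. Boundary handling here is deliberately naive.

```python
import torch
import torch.nn as nn

class LocallySharedConv2d(nn.Module):
    """Convolution whose weights are shared only within each of p x q regions."""

    def __init__(self, in_ch, out_ch, kernel_size, p, q):
        super().__init__()
        self.p, self.q = p, q
        # one independent set of kernels (and bias) per region
        self.convs = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, kernel_size) for _ in range(p * q)
        )

    def forward(self, x):
        h, w = x.shape[2], x.shape[3]
        rows = []
        for i in range(self.p):
            cols = []
            for j in range(self.q):
                cell = x[:, :, i * h // self.p:(i + 1) * h // self.p,
                               j * w // self.q:(j + 1) * w // self.q]
                cols.append(self.convs[i * self.q + j](cell))
            rows.append(torch.cat(cols, dim=3))
        return torch.cat(rows, dim=2)

y = LocallySharedConv2d(1, 20, kernel_size=4, p=2, q=2)(torch.randn(1, 1, 40, 40))
```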
76 Multi-level regression. We find several effective ways to combine multiple convolutional networks. [sent-144, score-0.568]
77 The face bounding box is the only prior knowledge for networks at the first level. [sent-146, score-0.544]
78 The relative position of a facial point with respect to the bounding box can vary over a large range due to large pose variations and the instability of face detectors. [sent-147, score-0.47]
79 So the input regions of networks at the first level should be large in order to cover many possible predictions. [sent-148, score-0.595]
80 The outputs of networks at the first level provide a strong prior for the following detections, i.e., the true position of a facial point should lie within a small region around the prediction at the first level. [sent-152, score-0.398]
82 So the second level detection can be done within a small region, where the disruption from other areas is reduced significantly, and this process repeats. [sent-153, score-0.152]
83 However, without context information, appearance of local regions is ambiguous and the prediction is unreliable. [sent-154, score-0.179]
84 To avoid drifting, we should not cascade too many levels or trust the following levels too much. [sent-155, score-0.305]
85 These networks are only allowed to adjust the initial prediction within a very small range. [sent-156, score-0.466]
86 To further improve detection accuracy and reliability, we propose to jointly predict the position of each point with multiple networks at each level. [sent-157, score-0.554]
87 The final predicted position of a facial point can be formally expressed as

$x = \frac{x_1^{(1)} + \cdots + x_{l_1}^{(1)}}{l_1} + \sum_{i=2}^{n} \frac{\Delta x_1^{(i)} + \cdots + \Delta x_{l_i}^{(i)}}{l_i}$  (1)

for an $n$-level cascade with $l_i$ predictions at level $i$. [sent-160, score-0.277]
89 Note that predictions at the first level are absolute positions while predictions at the following levels are adjustments. [sent-161, score-0.562]
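A direct transcription of Eq. (1) in NumPy, assuming the predictions for one facial point have already been collected from every network at every level:

```python
import numpy as np

def combine_predictions(level1_preds, adjustment_preds):
    """Combine multi-level predictions for one facial point, as in Eq. (1).

    level1_preds     : array of shape (l_1, 2), absolute (x, y) positions
                       from the l_1 level-1 networks covering this point
    adjustment_preds : list over levels i = 2..n, each of shape (l_i, 2),
                       holding the adjustments predicted at that level
    """
    x = np.mean(np.asarray(level1_preds), axis=0)    # average absolute positions
    for deltas in adjustment_preds:
        x = x + np.mean(np.asarray(deltas), axis=0)  # add averaged adjustments
    return x
```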
90 Implementation details. The input layer is denoted by I(h, w), where h and w are the height and width of the input region. [sent-163, score-0.223]
91 A convolutional layer is denoted by CR(s, n, p, q) if absolute value rectification is used, and by C(s, n, p, q) otherwise. [sent-165, score-0.223]
92 n is the number of maps in the convolutional layer. [sent-167, score-0.612]
93 Each map in the convolutional layer is evenly divided into p by q regions, and weights are locally shared in each region. [sent-169, score-0.778]
94 A traditional convolutional network can be viewed as the special case p = q = 1. [sent-170, score-0.699]
95 The m maps in the previous layer are correlated with m kernels of size s × s. [sent-209, score-0.135]
96 The resulting maps, together with a bias, are accumulated and passed through the tanh nonlinearity, forming one of the n maps in the convolutional layer. [sent-210, score-0.68]
97 For different output maps and different regions in the maps, the set of kernels and the bias are different. [sent-211, score-0.139]
98 Max pooling is used, with non-overlapping pooling regions. [sent-215, score-0.272]
99 Pooling results are multiplied by a gain coefficient g and shifted by a bias b, followed by a tanh non-linearity. [sent-216, score-0.151]
100 The gain and bias coefficients are shared in a similar way to the weights of the preceding convolutional layer. [sent-217, score-0.682]
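A one-function sketch of this pooling stage, with a single scalar gain and bias for simplicity (the paper shares them locally, in the same way as the convolution weights):

```python
import torch
import torch.nn.functional as F

def gated_max_pool(x, k, g, b):
    # non-overlapping max pooling, then gain, bias, and tanh non-linearity
    return torch.tanh(g * F.max_pool2d(x, kernel_size=k, stride=k) + b)
```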
wordName wordTfidf (topN-words)
[('convolutional', 0.568), ('networks', 0.439), ('facial', 0.304), ('neurons', 0.153), ('deep', 0.143), ('predictions', 0.141), ('network', 0.131), ('layers', 0.127), ('levels', 0.117), ('keypoints', 0.117), ('tanh', 0.112), ('pooling', 0.108), ('face', 0.105), ('keypoint', 0.094), ('layer', 0.091), ('regressors', 0.086), ('mouth', 0.079), ('nose', 0.078), ('nonlinearities', 0.072), ('cascade', 0.071), ('sharing', 0.069), ('cuhk', 0.068), ('ambiguous', 0.065), ('level', 0.065), ('shallower', 0.064), ('positions', 0.061), ('rectification', 0.059), ('dantone', 0.059), ('regions', 0.056), ('fused', 0.055), ('valstar', 0.053), ('structures', 0.053), ('predict', 0.051), ('eyes', 0.048), ('predicted', 0.046), ('convolution', 0.046), ('chinese', 0.045), ('weights', 0.044), ('maps', 0.044), ('locally', 0.044), ('receptive', 0.043), ('sizes', 0.041), ('eye', 0.04), ('bias', 0.039), ('component', 0.037), ('absolute', 0.037), ('denoted', 0.036), ('cr', 0.036), ('predicting', 0.036), ('xl', 0.036), ('input', 0.035), ('cascaded', 0.035), ('reliability', 0.034), ('position', 0.034), ('hong', 0.034), ('patches', 0.034), ('refine', 0.034), ('kong', 0.033), ('region', 0.033), ('amberg', 0.032), ('ciresan', 0.032), ('disruption', 0.032), ('multaneously', 0.032), ('xtang', 0.032), ('critical', 0.031), ('context', 0.031), ('corrupted', 0.031), ('shared', 0.031), ('outputs', 0.031), ('detection', 0.03), ('jarrett', 0.03), ('strides', 0.03), ('topographic', 0.03), ('corner', 0.029), ('locate', 0.029), ('constraints', 0.029), ('dots', 0.028), ('patrick', 0.028), ('lightings', 0.028), ('hardest', 0.028), ('avn', 0.028), ('xiaoou', 0.028), ('extreme', 0.027), ('prediction', 0.027), ('inaccuracy', 0.027), ('instability', 0.027), ('thnde', 0.027), ('webpage', 0.027), ('abs', 0.027), ('ferns', 0.027), ('parsing', 0.026), ('inputs', 0.026), ('ie', 0.026), ('square', 0.026), ('width', 0.026), ('hyperbolic', 0.026), ('finely', 0.026), ('xgwang', 0.026), ('areas', 0.025), ('whole', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 104 cvpr-2013-Deep Convolutional Network Cascade for Facial Point Detection
Author: Yi Sun, Xiaogang Wang, Xiaoou Tang
Author: Yue Wu, Zuoguan Wang, Qiang Ji
Abstract: Facial feature tracking is an active area in computer vision due to its relevance to many applications. It is a nontrivial task, since faces may have varying facial expressions, poses or occlusions. In this paper, we address this problem by proposing a face shape prior model that is constructed based on the Restricted Boltzmann Machines (RBM) and their variants. Specifically, we first construct a model based on Deep Belief Networks to capture the face shape variations due to varying facial expressions for near-frontal view. To handle pose variations, the frontal face shape prior model is incorporated into a 3-way RBM model that could capture the relationship between frontal face shapes and non-frontal face shapes. Finally, we introduce methods to systematically combine the face shape prior models with image measurements of facial feature points. Experiments on benchmark databases show that with the proposed method, facial feature points can be tracked robustly and accurately even if faces have significant facial expressions and poses.
3 0.25121817 328 cvpr-2013-Pedestrian Detection with Unsupervised Multi-stage Feature Learning
Author: Pierre Sermanet, Koray Kavukcuoglu, Soumith Chintala, Yann Lecun
Abstract: Pedestrian detection is a problem of considerable practical interest. Adding to the list of successful applications of deep learning methods to vision, we report state-of-the-art and competitive results on all major pedestrian datasets with a convolutional network model. The model uses a few new twists, such as multi-stage features, connections that skip layers to integrate global shape information with local distinctive motif information, and an unsupervised method based on convolutional sparse coding to pre-train the filters at each stage.
4 0.23468696 77 cvpr-2013-Capturing Complex Spatio-temporal Relations among Facial Muscles for Facial Expression Recognition
Author: Ziheng Wang, Shangfei Wang, Qiang Ji
Abstract: Spatial-temporal relations among facial muscles carry crucial information about facial expressions yet have not been thoroughly exploited. One contributing factor for this is the limited ability of the current dynamic models in capturing complex spatial and temporal relations. Existing dynamic models can only capture simple local temporal relations among sequential events, or lack the ability for incorporating uncertainties. To overcome these limitations and take full advantage of the spatio-temporal information, we propose to model the facial expression as a complex activity that consists of temporally overlapping or sequential primitive facial events. We further propose the Interval Temporal Bayesian Network to capture these complex temporal relations among primitive facial events for facial expression modeling and recognition. Experimental results on benchmark databases demonstrate the feasibility of the proposed approach in recognizing facial expressions based purely on spatio-temporal relations among facial muscles, as well as its advantage over the existing methods.
5 0.202177 164 cvpr-2013-Fast Convolutional Sparse Coding
Author: Hilton Bristow, Anders Eriksson, Simon Lucey
Abstract: Sparse coding has become an increasingly popular method in learning and vision for a variety of classification, reconstruction and coding tasks. The canonical approach intrinsically assumes independence between observations during learning. For many natural signals however, sparse coding is applied to sub-elements (i.e. patches) of the signal, where such an assumption is invalid. Convolutional sparse coding explicitly models local interactions through the convolution operator, however the resulting optimization problem is considerably more complex than traditional sparse coding. In this paper, we draw upon ideas from signal processing and Augmented Lagrange Methods (ALMs) to produce a fast algorithm with globally optimal subproblems and super-linear convergence.
6 0.16164485 388 cvpr-2013-Semi-supervised Learning of Feature Hierarchies for Object Detection in a Video
7 0.14707474 304 cvpr-2013-Multipath Sparse Coding Using Hierarchical Matching Pursuit
9 0.14141086 152 cvpr-2013-Exemplar-Based Face Parsing
10 0.12466323 421 cvpr-2013-Supervised Kernel Descriptors for Visual Recognition
11 0.12238339 119 cvpr-2013-Detecting and Aligning Faces by Image Retrieval
12 0.1107472 255 cvpr-2013-Learning Separable Filters
13 0.10625817 438 cvpr-2013-Towards Pose Robust Face Recognition
14 0.099975795 415 cvpr-2013-Structured Face Hallucination
15 0.099591188 248 cvpr-2013-Learning Collections of Part Models for Object Recognition
16 0.095329352 92 cvpr-2013-Constrained Clustering and Its Application to Face Clustering in Videos
17 0.094791651 64 cvpr-2013-Blessing of Dimensionality: High-Dimensional Feature and Its Efficient Compression for Face Verification
18 0.094212897 182 cvpr-2013-Fusing Robust Face Region Descriptors via Multiple Metric Learning for Face Recognition in the Wild
19 0.090578347 156 cvpr-2013-Exploring Compositional High Order Pattern Potentials for Structured Output Learning
20 0.08724764 296 cvpr-2013-Multi-level Discriminative Dictionary Learning towards Hierarchical Visual Categorization
topicId topicWeight
[(0, 0.179), (1, -0.055), (2, -0.039), (3, 0.027), (4, 0.055), (5, 0.033), (6, 0.036), (7, -0.018), (8, 0.161), (9, -0.167), (10, 0.043), (11, -0.058), (12, 0.07), (13, -0.025), (14, 0.081), (15, 0.169), (16, -0.065), (17, 0.11), (18, 0.162), (19, 0.083), (20, 0.004), (21, -0.077), (22, -0.005), (23, -0.084), (24, -0.036), (25, 0.157), (26, 0.071), (27, -0.034), (28, 0.067), (29, 0.105), (30, -0.099), (31, -0.107), (32, -0.06), (33, -0.036), (34, -0.118), (35, 0.019), (36, 0.027), (37, 0.081), (38, 0.159), (39, -0.002), (40, -0.056), (41, -0.097), (42, 0.051), (43, 0.102), (44, -0.087), (45, 0.023), (46, 0.079), (47, 0.003), (48, 0.092), (49, -0.039)]
simIndex simValue paperId paperTitle
same-paper 1 0.96137726 104 cvpr-2013-Deep Convolutional Network Cascade for Facial Point Detection
Author: Yi Sun, Xiaogang Wang, Xiaoou Tang
Author: Yue Wu, Zuoguan Wang, Qiang Ji
3 0.69108981 77 cvpr-2013-Capturing Complex Spatio-temporal Relations among Facial Muscles for Facial Expression Recognition
Author: Ziheng Wang, Shangfei Wang, Qiang Ji
4 0.61272579 385 cvpr-2013-Selective Transfer Machine for Personalized Facial Action Unit Detection
Author: Wen-Sheng Chu, Fernando De La Torre, Jeffery F. Cohn
Abstract: Automatic facial action unit (AFA) detection from video is a long-standing problem in facial expression analysis. Most approaches emphasize choices of features and classifiers. They neglect individual differences in target persons. People vary markedly in facial morphology (e.g., heavy versus delicate brows, smooth versus deeply etched wrinkles) and behavior. Individual differences can dramatically influence how well generic classifiers generalize to previously unseen persons. While a possible solution would be to train person-specific classifiers, that often is neither feasible nor theoretically compelling. The alternative that we propose is to personalize a generic classifier in an unsupervised manner (no additional labels for the test subjects are required). We introduce a transductive learning method, which we refer to as Selective Transfer Machine (STM), to personalize a generic classifier by attenuating person-specific biases. STM achieves this effect by simultaneously learning a classifier and re-weighting the training samples that are most relevant to the test subject. To evaluate the effectiveness of STM, we compared STM to generic classifiers and to cross-domain learning methods in three major databases: CK+ [20], GEMEP-FERA [32] and RU-FACS [2]. STM outperformed generic classifiers in all.
5 0.57150006 328 cvpr-2013-Pedestrian Detection with Unsupervised Multi-stage Feature Learning
Author: Pierre Sermanet, Koray Kavukcuoglu, Soumith Chintala, Yann Lecun
6 0.54861361 304 cvpr-2013-Multipath Sparse Coding Using Hierarchical Matching Pursuit
7 0.54641175 164 cvpr-2013-Fast Convolutional Sparse Coding
8 0.54167706 420 cvpr-2013-Supervised Descent Method and Its Applications to Face Alignment
9 0.50285542 415 cvpr-2013-Structured Face Hallucination
10 0.49425489 359 cvpr-2013-Robust Discriminative Response Map Fitting with Constrained Local Models
11 0.49375519 371 cvpr-2013-SCaLE: Supervised and Cascaded Laplacian Eigenmaps for Visual Object Recognition Based on Nearest Neighbors
12 0.48977786 255 cvpr-2013-Learning Separable Filters
13 0.48973083 159 cvpr-2013-Expressive Visual Text-to-Speech Using Active Appearance Models
14 0.45615995 463 cvpr-2013-What's in a Name? First Names as Facial Attributes
15 0.45309317 388 cvpr-2013-Semi-supervised Learning of Feature Hierarchies for Object Detection in a Video
16 0.44057631 369 cvpr-2013-Rotation, Scaling and Deformation Invariant Scattering for Texture Discrimination
17 0.43774396 346 cvpr-2013-Real-Time No-Reference Image Quality Assessment Based on Filter Learning
18 0.43135574 32 cvpr-2013-Action Recognition by Hierarchical Sequence Summarization
19 0.41336501 64 cvpr-2013-Blessing of Dimensionality: High-Dimensional Feature and Its Efficient Compression for Face Verification
20 0.38782567 105 cvpr-2013-Deep Learning Shape Priors for Object Segmentation
topicId topicWeight
[(10, 0.126), (16, 0.042), (26, 0.098), (28, 0.013), (33, 0.252), (67, 0.098), (69, 0.05), (86, 0.166), (87, 0.061)]
simIndex simValue paperId paperTitle
same-paper 1 0.89376062 104 cvpr-2013-Deep Convolutional Network Cascade for Facial Point Detection
Author: Yi Sun, Xiaogang Wang, Xiaoou Tang
2 0.88663095 441 cvpr-2013-Tracking Sports Players with Context-Conditioned Motion Models
Author: Jingchen Liu, Peter Carr, Robert T. Collins, Yanxi Liu
Abstract: We employ hierarchical data association to track players in team sports. Player movements are often complex and highly correlated with both nearby and distant players. A single model would require many degrees of freedom to represent the full motion diversity and could be difficult to use in practice. Instead, we introduce a set of Game Context Features extracted from noisy detections to describe the current state of the match, such as how the players are spatially distributed. Our assumption is that players react to the current situation in only a finite number of ways. As a result, we are able to select an appropriate simplified affinity model for each player and time instant using a random decision forest based on current track and game context features. Our context-conditioned motion models implicitly incorporate complex inter-object correlations while remaining tractable. We demonstrate significant performance improvements over existing multi-target tracking algorithms on basketball and field hockey sequences several minutes in duration and containing 10 and 20 players respectively.
3 0.8741039 50 cvpr-2013-Augmenting CRFs with Boltzmann Machine Shape Priors for Image Labeling
Author: Andrew Kae, Kihyuk Sohn, Honglak Lee, Erik Learned-Miller
Abstract: Conditional random fields (CRFs) provide powerful tools for building models to label image segments. They are particularly well-suited to modeling local interactions among adjacent regions (e.g., superpixels). However, CRFs are limited in dealing with complex, global (long-range) interactions between regions. Complementary to this, restricted Boltzmann machines (RBMs) can be used to model global shapes produced by segmentation models. In this work, we present a new model that uses the combined power of these two network types to build a state-of-the-art labeler. Although the CRF is a good baseline labeler, we show how an RBM can be added to the architecture to provide a global shape bias that complements the local modeling provided by the CRF. We demonstrate its labeling performance for the parts of complex face images from the Labeled Faces in the Wild data set. This hybrid model produces results that are both quantitatively and qualitatively better than the CRF alone. In addition to high-quality labeling results, we demonstrate that the hidden units in the RBM portion of our model can be interpreted as face attributes that have been learned without any attribute-level supervision.
4 0.87324345 122 cvpr-2013-Detection Evolution with Multi-order Contextual Co-occurrence
Author: Guang Chen, Yuanyuan Ding, Jing Xiao, Tony X. Han
Abstract: Context has been playing an increasingly important role to improve the object detection performance. In this paper we propose an effective representation, Multi-Order Contextual co-Occurrence (MOCO), to implicitly model the high level context using solely detection responses from a baseline object detector. The so-called (1st-order) context feature is computed as a set of randomized binary comparisons on the response map of the baseline object detector. The statistics of the 1st-order binary context features are further calculated to construct a high order co-occurrence descriptor. Combining the MOCO feature with the original image feature, we can evolve the baseline object detector to a stronger context aware detector. With the updated detector, we can continue the evolution till the contextual improvements saturate. Using the successful deformable part model detector [13] as the baseline detector, we test the proposed MOCO evolution framework on the PASCAL VOC 2007 dataset [8] and Caltech pedestrian dataset [7]: The proposed MOCO detector outperforms all known state-of-the-art approaches, contextually boosting deformable part models (ver.5) [13] by 3.3% in mean average precision on the PASCAL 2007 dataset. For the Caltech pedestrian dataset, our method further reduces the log-average miss rate from 48% to 46% and the miss rate at 1 FPPI from 25% to 23%, compared with the best prior art [6].
5 0.87163514 311 cvpr-2013-Occlusion Patterns for Object Class Detection
Author: Bojan Pepikj, Michael Stark, Peter Gehler, Bernt Schiele
Abstract: Despite the success of recent object class recognition systems, the long-standing problem of partial occlusion remains a major challenge, and a principled solution is yet to be found. In this paper we leave the beaten path of methods that treat occlusion as just another source of noise; instead, we include the occluder itself in the modelling, by mining distinctive, reoccurring occlusion patterns from annotated training data. These patterns are then used as training data for dedicated detectors of varying sophistication. In particular, we evaluate and compare models that range from standard object class detectors to hierarchical, part-based representations of occluder/occludee pairs. In an extensive evaluation we derive insights that can aid further developments in tackling the occlusion challenge.
6 0.87025434 155 cvpr-2013-Exploiting the Power of Stereo Confidences
7 0.86132067 440 cvpr-2013-Tracking People and Their Objects
8 0.86060488 248 cvpr-2013-Learning Collections of Part Models for Object Recognition
9 0.85930181 152 cvpr-2013-Exemplar-Based Face Parsing
10 0.85725009 414 cvpr-2013-Structure Preserving Object Tracking
11 0.85575885 254 cvpr-2013-Learning SURF Cascade for Fast and Accurate Object Detection
12 0.85567135 325 cvpr-2013-Part Discovery from Partial Correspondence
13 0.85348493 119 cvpr-2013-Detecting and Aligning Faces by Image Retrieval
14 0.85259765 225 cvpr-2013-Integrating Grammar and Segmentation for Human Pose Estimation
15 0.85214531 4 cvpr-2013-3D Visual Proxemics: Recognizing Human Interactions in 3D from a Single Image
16 0.84938705 285 cvpr-2013-Minimum Uncertainty Gap for Robust Visual Tracking
17 0.84923846 221 cvpr-2013-Incorporating Structural Alternatives and Sharing into Hierarchy for Multiclass Object Recognition and Detection
18 0.84904587 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases
19 0.84771186 277 cvpr-2013-MODEC: Multimodal Decomposable Models for Human Pose Estimation
20 0.84764338 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection