182 nips-2004-Synergistic Face Detection and Pose Estimation with Energy-Based Models


Source: pdf

Author: Margarita Osadchy, Matthew L. Miller, Yann LeCun

Abstract: We describe a novel method for real-time, simultaneous multi-view face detection and facial pose estimation. The method employs a convolutional network to map face images to points on a manifold, parametrized by pose, and non-face images to points far from that manifold. This network is trained by optimizing a loss function of three variables: image, pose, and face/non-face label. We test the resulting system, in a single configuration, on three standard data sets – one for frontal pose, one for rotated faces, and one for profiles – and find that its performance on each set is comparable to previous multi-view face detectors that can only handle one form of pose variation. We also show experimentally that the system’s accuracy on both face detection and pose estimation is improved by training for the two tasks together.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 We describe a novel method for real-time, simultaneous multi-view face detection and facial pose estimation. [sent-6, score-0.824]

2 The method employs a convolutional network to map face images to points on a manifold, parametrized by pose, and non-face images to points far from that manifold. [sent-7, score-0.612]

3 This network is trained by optimizing a loss function of three variables: image, pose, and face/non-face label. [sent-8, score-0.154]

4 We test the resulting system, in a single configuration, on three standard data sets – one for frontal pose, one for rotated faces, and one for profiles – and find that its performance on each set is comparable to previous multi-view face detectors that can only handle one form of pose variation. [sent-9, score-0.752]

5 We also show experimentally that the system’s accuracy on both face detection and pose estimation is improved by training for the two tasks together. [sent-10, score-0.867]

6 1 Introduction: The detection of human faces in natural images and videos is a key component in a wide variety of applications of human-computer interaction, search and indexing, security, and surveillance. [sent-11, score-0.468]

7 Many real-world applications would profit from multi-view detectors that can detect faces under a wide range of poses: looking left or right (yaw axis), up or down (pitch axis), or tilting left or right (roll axis). [sent-12, score-0.312]

8 In this paper we describe a novel method that not only detects faces independently of their poses, but simultaneously estimates those poses. [sent-13, score-0.283]

9 The system is highly reliable, runs at near real time (5 frames per second on standard hardware), and is robust against variations in yaw (±90°), roll (±45°), and pitch (±60°). [sent-14, score-0.46]

10 The method is motivated by the idea that multi-view face detection and pose estimation are so closely related that they should not be performed separately. [sent-15, score-0.801]

11 The tasks are related in the sense that they must be robust against the same sorts of variation: skin color, glasses, facial hair, lighting, scale, expressions, etc. [sent-16, score-0.109]

12 To exploit the synergy between these two tasks, we train a convolutional network to map face images to points on a face manifold, and non-face images to points far away from that manifold. [sent-18, score-0.983]

13 Conceptually, we can view the pose parameter as a latent variable that can be inferred through an energy-minimization process [4]. [sent-20, score-0.327]

14 To train the machine we derive a new type of discriminative loss function that is tailored to such detection tasks. [sent-21, score-0.254]

15 Previous Work: Learning-based approaches to face detection abound, including real-time methods [16], and approaches based on convolutional networks [15, 3]. [sent-22, score-0.571]

16 Most multi-view systems take a view-based approach, which involves building separate detectors for different views and either applying them in parallel [10, 14, 13, 7] or using a pose estimator to select a detector [5]. [sent-23, score-0.502]

17 Another approach is to estimate and correct in-plane rotations before applying a single pose-specific detector [12]. [sent-24, score-0.121]

18 Closer to our approach is that of [8], in which a number of Support Vector Regressors are trained to approximate smooth functions, each of which has a maximum for a face at a particular pose. [sent-25, score-0.317]

19 Another machine is trained to convert the resulting values to estimates of poses, and a third is trained to convert the values into a face/non-face score. [sent-26, score-0.172]

20 2 Integrating face detection and pose estimation: To exploit the posited synergy between face detection and pose estimation, we must design a system that integrates the solutions to the two problems. [sent-28, score-1.722]

21 Our approach is to build a trainable system that can map raw images X to points in a low-dimensional space. [sent-31, score-0.166]

22 In that space, we pre-define a face manifold F (Z) that we parameterize by the pose Z. [sent-32, score-0.727]

23 We train the system to map face images with known poses to the corresponding points on the manifold. [sent-33, score-0.516]

24 We also train it to map non-face images to points far away from the manifold. [sent-34, score-0.11]

25 Proximity to the manifold then tells us whether or not an image is a face, and projection to the manifold yields an estimate of the pose. [sent-35, score-0.334]

26 Parameterizing the Face Manifold: We will now describe the details of the parameterizations of the face manifold. [sent-36, score-0.295]

27 Let’s start with the simplest case of one pose parameter Z = θ, representing, say, yaw. [sent-37, score-0.327]

28 If we want to preserve the natural topology and geometry of the problem, the face manifold under yaw variations in the interval [−90°, 90°] should be a half circle (with constant curvature). [sent-38, score-0.593]

29 The same idea can be applied to any number of pose parameters. [sent-41, score-0.327]

30 Let us consider the set of all faces with yaw in [−90, 90] and roll in [−45, 45]. [sent-42, score-0.471]

31 Consequently, we encode the pose with the product of the cosines of the two angles: $F_{ij}(\theta, \phi) = \cos(\theta - \alpha_i)\cos(\phi - \beta_j);\; i, j = 1, 2, 3$ (3). For convenience, we rescale the roll angles to the range [−90, 90]. [sent-44, score-0.432]

32 With these parameterizations, the manifold has constant curvature, which ensures that the effect of errors will be the same regardless of pose. [sent-45, score-0.134]

33 Given nine components of the network’s output $G_{ij}(X)$, we compute the corresponding pose angles as follows: $cc = \sum_{ij} G_{ij}(X)\cos(\alpha_i)\cos(\beta_j)$; $cs = \sum_{ij} G_{ij}(X)\cos(\alpha_i)\sin(\beta_j)$; $sc = \sum_{ij} G_{ij}(X)\sin(\alpha_i)\cos(\beta_j)$; $ss = \sum_{ij} G_{ij}(X)\sin(\alpha_i)\sin(\beta_j)$ (4) [sent-46, score-0.673]

34 $\theta = 0.5\,(\mathrm{atan2}(cs + sc,\, cc - ss) + \mathrm{atan2}(sc - cs,\, cc + ss))$ [sent-47, score-0.144]

35 $\phi = 0.5\,(\mathrm{atan2}(cs + sc,\, cc - ss) - \mathrm{atan2}(sc - cs,\, cc + ss))$. Note that the dimension of the face manifold is much lower than that of the embedding space. [sent-48, score-0.544]
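
The encoding in Eq. (3) and the decoding in Eq. (4) can be checked with a small round trip. The following is a minimal sketch, not the authors' implementation. The basis angles `ALPHAS` and `BETAS` (−60°, 0°, 60°) are an assumption, since this excerpt does not list the values of α_i and β_j, and the function names are hypothetical. The second angle is the roll after rescaling to [−90, 90].

```python
import math

# Assumed basis angles in degrees; the excerpt does not give alpha_i, beta_j.
ALPHAS = (-60.0, 0.0, 60.0)
BETAS = (-60.0, 0.0, 60.0)


def encode_pose(theta_deg, phi_deg):
    """Eq. (3): F_ij = cos(theta - alpha_i) * cos(phi - beta_j), as a 3x3 list."""
    return [
        [
            math.cos(math.radians(theta_deg - a)) * math.cos(math.radians(phi_deg - b))
            for b in BETAS
        ]
        for a in ALPHAS
    ]


def decode_pose(g):
    """Eq. (4): recover (theta, phi) in degrees from a 3x3 network output g."""
    cc = cs = sc = ss = 0.0
    for i, a in enumerate(ALPHAS):
        ca, sa = math.cos(math.radians(a)), math.sin(math.radians(a))
        for j, b in enumerate(BETAS):
            cb, sb = math.cos(math.radians(b)), math.sin(math.radians(b))
            cc += g[i][j] * ca * cb
            cs += g[i][j] * ca * sb
            sc += g[i][j] * sa * cb
            ss += g[i][j] * sa * sb
    angle_sum = math.atan2(cs + sc, cc - ss)   # equals theta + phi on the manifold
    angle_diff = math.atan2(sc - cs, cc + ss)  # equals theta - phi on the manifold
    theta = 0.5 * (angle_sum + angle_diff)
    phi = 0.5 * (angle_sum - angle_diff)
    return math.degrees(theta), math.degrees(phi)
```

With these assumed basis angles, the four sums are proportional to cos θ cos φ, cos θ sin φ, sin θ cos φ, and sin θ sin φ, so the two `atan2` calls return θ + φ and θ − φ, and a point generated by `encode_pose` decodes back to its pose.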

36 If X is a face with pose Z, then we want: $E_W(1, Z, X) \ll E_W(0, Z', X)$ for any pose $Z'$, and $E_W(1, Z', X) \gg E_W(1, Z, X)$ for any pose $Z' \neq Z$. [sent-53, score-1.247]

37 Operating the machine consists of clamping X to the observed value (the image) and finding the values of Z and Y that minimize $E_W(Y, Z, X)$: $(\bar{Y}, \bar{Z}) = \arg\min_{Y \in \{Y\},\, Z \in \{Z\}} E_W(Y, Z, X)$ (5), where $\{Y\} = \{0, 1\}$ and $\{Z\} = [-90, 90] \times [-45, 45]$ for yaw and roll variables. [sent-54, score-0.234]

38 The complete energy function is: $E_W(Y, Z, X) = Y\,\|G_W(X) - F(Z)\| + (1 - Y)\,T$ (7). The architecture of the machine is depicted in Figure 1. [sent-61, score-0.143]

39 Operating this machine (finding the output label and pose with the smallest energy) comes down to first finding $\bar{Z} = \arg\min_{Z \in \{Z\}} \|G_W(X) - F(Z)\|$, and then comparing this minimum distance, $\|G_W(X) - F(\bar{Z})\|$, to the threshold T. [sent-62, score-0.327]
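
The two-step inference described above (find the closest manifold point, then compare the distance to T) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: it uses a brute-force grid search over poses, the basis angles (−60°, 0°, 60°) are assumed because the excerpt does not list them, the roll axis is searched in its rescaled range [−90, 90], and the names `manifold_point` and `infer` are hypothetical.

```python
import math

ALPHAS = BETAS = (-60.0, 0.0, 60.0)  # assumed basis angles, in degrees


def manifold_point(theta, phi):
    """Flattened 9-vector F(Z) for Z = (theta, phi) in degrees, following Eq. (3)."""
    return [
        math.cos(math.radians(theta - a)) * math.cos(math.radians(phi - b))
        for a in ALPHAS
        for b in BETAS
    ]


def infer(g, threshold, step=5):
    """Return (label, pose, distance) minimizing the energy for a 9-vector output g."""
    best_pose, best_dist = None, float("inf")
    for theta in range(-90, 91, step):
        for phi in range(-90, 91, step):  # roll, rescaled to [-90, 90]
            dist = math.dist(g, manifold_point(theta, phi))
            if dist < best_dist:
                best_pose, best_dist = (theta, phi), dist
    # Label Y = 1 costs the distance to the manifold, Y = 0 costs T;
    # the lower-energy label wins.
    label = 1 if best_dist < threshold else 0
    return label, best_pose, best_dist
```

A grid search is used here only because it mirrors the argmin in the text directly; a closed-form pose estimate followed by a single distance computation would give the same decision at lower cost.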

40 Convolutional networks [6] are “end-to-end” trainable systems that can operate on raw pixel images and learn low-level features and high-level representations in an integrated fashion. [sent-66, score-0.142]

41 We employ a network architecture similar to LeNet5 [6]. [sent-69, score-0.135]

42 In our architecture we have 8 feature maps in the bottom convolutional and subsampling layers and 20 maps in the next two layers. [sent-71, score-0.226]

43 The last layer has 9 outputs to encode two pose parameters. [sent-72, score-0.327]

44 where $S_1$ is the set of training faces, $S_0$ the set of non-faces, and $L_1(W, Z^i, X^i)$ and $L_0(W, X^i)$ are loss functions for a face sample (with a known pose) and a non-face sample, respectively (footnote 1). [sent-74, score-0.34]

45 To cause the machine to achieve the desired behavior, we need the parameter update to decrease the difference between the energy of the desired label and the energy of the undesired label. [sent-80, score-0.173]

46 For a face example (X, Z, 1), we must have: $E_W^{new}(1, Z, X) < E_W(1, Z, X)$. For a non-face example (X, 0), we must have: $E_W^{new}(1, \bar{Z}, X) > E_W(1, \bar{Z}, X)$. We choose the following forms for $L_1$ and $L_0$: $L_1(W, 1, Z, X) = E_W(1, Z, X)^2$; $L_0(W, 0, X) = K \exp[-E_W(1, \bar{Z}, X)]$ (9), where K is a positive constant. [sent-82, score-0.266]
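
The two per-sample losses in Eq. (9) and the direction in which each one moves the energy can be written out in a few lines. This is a minimal sketch with hypothetical function names. It treats the energy as a plain number, and says nothing about how the losses are combined over the training set or back-propagated through the network.

```python
import math


def face_loss(energy):
    """L1 from Eq. (9): squared energy of a face at its annotated pose."""
    return energy ** 2


def nonface_loss(energy, k=1.0):
    """L0 from Eq. (9): K * exp(-E), with E taken at the best-matching pose."""
    return k * math.exp(-energy)


def face_loss_grad(energy):
    """dL1/dE = 2E >= 0, so gradient descent lowers the energy of faces."""
    return 2.0 * energy


def nonface_loss_grad(energy, k=1.0):
    """dL0/dE = -K * exp(-E) < 0, so gradient descent raises the energy of non-faces."""
    return -k * math.exp(-energy)
```

The opposite signs of the two derivatives are what make the losses consistent with the stated goal: a descent step pulls face outputs toward the manifold and pushes non-face outputs away from it.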

47 Footnote 1: Although face samples whose pose is unknown can easily be accommodated, we will not discuss this possibility here. [sent-89, score-0.593]

48 Running the Machine: Our detection system works on grayscale images and applies the network to each image at a range of scales, stepping by a factor of $\sqrt{2}$. [sent-95, score-0.456]

49 The network is replicated over the image at each scale, stepping by 4 pixels in x and y (this step size is a consequence of having two 2x2 subsampling layers). [sent-96, score-0.219]
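
The scanning pattern described above (scales stepping by √2, positions stepping by 4 pixels) can be made concrete by enumerating window origins. This is a sketch under assumptions: the 32-pixel window is taken from the 32x32 training faces mentioned in the training description, `num_scales` is an arbitrary illustrative limit, and the function name is hypothetical. A replicated convolutional network evaluates all positions in one pass, so the explicit loop is for illustration only.

```python
import math


def scan_windows(width, height, window=32, stride=4, num_scales=8):
    """List (scale_factor, x, y) for every window origin over a sqrt(2) pyramid.

    x and y are in the coordinates of the downscaled image at that scale.
    window=32 and num_scales=8 are illustrative assumptions.
    """
    positions = []
    for level in range(num_scales):
        factor = math.sqrt(2) ** level
        w, h = int(width / factor), int(height / factor)
        if w < window or h < window:
            break  # the image is now smaller than one network input
        for y in range(0, h - window + 1, stride):
            for x in range(0, w - window + 1, stride):
                positions.append((factor, x, y))
    return positions
```

Multiplying a window's origin and size by its `scale_factor` maps it back to the coordinates of the original image.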

50 At each scale and location, the network outputs are compared to the closest point on the manifold, and the system collects a list of all instances closer than our detection threshold. [sent-97, score-0.316]

51 The system can detect, locate, and estimate the pose of faces that are between 40 and 250 pixels high in a 640 × 480 image at roughly 5 frames per second on a 2. [sent-101, score-0.741]

52 4 Experiments and results: Using the above architecture, we built a detector to locate faces and estimate two pose parameters: yaw from left to right profile, and in-plane rotation from −45 to 45 degrees. [sent-103, score-0.898]

53 The machine was trained to be robust against pitch variation. [sent-104, score-0.131]

54 The first set of experiments tests whether training for the two tasks together improves performance on both. [sent-106, score-0.116]

55 Training: Our training set consisted of 52,850 32x32-pixel faces from natural images collected at NEC Labs and hand-annotated with appropriate facial poses (see [9] for a description of how the annotation was done). [sent-108, score-0.504]

56 These faces were selected from a much larger annotated set to yield a roughly uniform distribution of poses from left profile to right profile, with as much variation in pitch as we could obtain. [sent-109, score-0.403]

57 Our initial negative training data consisted of 52,850 image patches chosen randomly from non-face areas of a variety of images. [sent-110, score-0.105]

58 For our second set of tests, we replaced half of these with image patches obtained by running the initial version of the detector on our training images and collecting false detections. [sent-111, score-0.324]

59 Each training image was used 5 times during training, with random variations [...] [Figure 3 plots. Axes: percentage of faces detected; percentage of poses correctly estimated. Legend: Frontal, Rotated in plane, Profile.] [sent-112, score-0.528]

60 Left: ROC curves for our detector on the three data sets. [sent-118, score-0.1]

61 The x axis is the average number of false positives per image over all three sets, so each point corresponds to a single detection threshold. [sent-119, score-0.407]

62 Right: frequency with which yaw and roll are estimated within various error tolerances. [sent-120, score-0.259]

63 At the end of training, the network had converged to an equal error rate of 5% on the training data and 6% on a separate test set of 90,000 images. [sent-126, score-0.107]

64 Synergy tests: The goal of the synergy test was to verify that both face detection and pose estimation benefit from learning and running in parallel. [sent-127, score-0.88]

65 The first one was trained for simultaneous face detection and pose estimation (combined), the second was trained for detection only and the third for pose estimation only. [sent-129, score-1.438]

66 The “pose only” network was identical to the combined network, but trained on faces only (no negative examples). [sent-131, score-0.356]

67 In both these graphs, we see that the pose-plus-detection network had better performance, confirming that training for each task benefits the other. [sent-133, score-0.107]

68 Standard data sets: There is no standard data set that tests all the poses our system is designed to detect. [sent-134, score-0.215]

69 There are, however, data sets that have been used to test more restricted face detectors, each set focusing on a particular variation in pose. [sent-135, score-0.29]

70 The details of these sets are described below: • MIT+CMU [14, 11] – 130 images for testing frontal face detectors. [sent-138, score-0.389]

71 We count 517 faces in this set, but the standard tests only use a subset of 507 faces, because 10 faces are in the wrong pose or otherwise not suitable for the test. [sent-139, score-0.851]

72 (Note: about 2% of the faces in the standard subset are badly drawn cartoons, which we do not intend our system to detect.) [sent-140, score-0.292]

73 • TILTED [12] – 50 images of frontal faces with in-plane rotations. [sent-142, score-0.336]

74 (Note: about 20% of the faces in the standard subset are outside of the ±45° rotation range for which our system is designed.) [sent-144, score-0.343]

75 There seems to be some disagreement about the number of faces in the standard set of annotations: [13] reports using 347 faces of the 462 that we found, [5] reports using 355, and we found 353 annotations. [sent-147, score-0.474]

76 We counted a face as being detected if 1) at least one detection lay within a circle centered on the midpoint between the eyes, with a radius equal to 1.25 [sent-149, score-0.539]

77 times the distance from that point to the midpoint of the mouth, and 2) that detection came at a scale within a factor of [...] Figure 4: Some example face detections. [sent-150, score-0.469]
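
The first condition for counting a face as detected (a detection inside a circle around the eye midpoint, with radius 1.25 times the distance to the mouth midpoint) is simple enough to state in code. This is a sketch with hypothetical names and point conventions. It covers only the position test, because the scale tolerance in the second condition is truncated in this excerpt.

```python
import math


def center_hit(detection, left_eye, right_eye, mouth_mid, radius_factor=1.25):
    """True if a detection center lies inside the acceptance circle of a face.

    All points are (x, y) tuples in image coordinates. The circle is centered
    on the midpoint between the eyes; its radius is radius_factor times the
    distance from that midpoint to the midpoint of the mouth.
    """
    eye_mid = (
        (left_eye[0] + right_eye[0]) / 2.0,
        (left_eye[1] + right_eye[1]) / 2.0,
    )
    radius = radius_factor * math.dist(eye_mid, mouth_mid)
    return math.dist(detection, eye_mid) <= radius
```

For example, with eyes at (10, 10) and (30, 10) and the mouth midpoint at (20, 30), the acceptance radius is 25 pixels around (20, 10).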

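78 [Table 1, flattened in extraction; numeric entries not recoverable. Header: data set → false positives per image. Rows: our detector; Jones & Viola [5] (tilted); Jones & Viola [5] (profile); Rowley et al. [11]; Schneiderman & Kanade [13]. First column: TILTED.] [sent-154, score-0.247]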

79 Each column shows the detection rates for a given average number of false positives per image (these rates correspond to those for which other authors have reported results). [sent-161, score-0.377]

80 Note that ours is the only single detector that can be tested on all data sets simultaneously. [sent-163, score-0.124]

81 We counted a detection as a false positive if it did not lie within this range for any of the faces in the image, including those faces not in the standard subset. [sent-165, score-0.733]

82 Table 1 shows our detection rates compared against other systems for which results were given on these data sets. [sent-168, score-0.171]

83 Those detectors, however, are not designed to handle all variations in pose, and do not yield pose estimates. [sent-170, score-0.387]

84 The right side of Figure 3 shows our performance at pose estimation. [sent-171, score-0.327]

85 To make this graph, we fixed the detection threshold at a value that resulted in about 0.5 [sent-172, score-0.171]

86 false positives per image over all three data sets. [sent-173, score-0.206]

87 We then compared the pose estimates for all detected faces (including those not in the standard subsets) against our manual pose annotations. [sent-174, score-0.977]

88 Note that this test is more difficult than typical tests of pose estimation systems, where faces are first localized by hand. [sent-175, score-0.651]

89 5 Conclusion: The system we have presented here integrates detection and pose estimation by training a convolutional network to map faces to points on a manifold, parameterized by pose, and non-faces to points far from the manifold. [sent-177, score-1.115]

90 The network is trained by optimizing a loss function of three variables – image, pose, and face/non-face label. [sent-178, score-0.154]

91 When the three variables match, the energy function is trained to have a small value; when they do not match, it is trained to have a large value. [sent-179, score-0.178]

92 This system has several desirable properties: • The use of a convolutional network makes it fast. [sent-180, score-0.257]

93 • It is robust to a wide range of poses, including variations in yaw up to ±90°, in-plane rotation up to ±45°, and pitch up to ±60°. [sent-183, score-0.324]

94 This has been verified with tests on three standard data sets, each designed to test robustness against a single dimension of pose variation. [sent-184, score-0.402]

95 On the standard data sets, the estimates of yaw and in-plane rotation are within 15° of manual estimates over 80% and 95% of the time, respectively. [sent-186, score-0.278]

96 We have shown experimentally that our system’s accuracy at both pose estimation and face detection is increased by training for the two tasks together. [sent-187, score-0.867]

97 A neural architecture for fast and robust face detection. [sent-202, score-0.355]

98 Support vector regression and classification based multi-view face detection and recognition. [sent-237, score-0.437]

99 A statistical method for 3d object detection applied to faces and cars. [sent-268, score-0.408]

100 Rapid object detection using a boosted cascade of simple features. [sent-284, score-0.171]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ew', 0.663), ('pose', 0.327), ('face', 0.266), ('faces', 0.237), ('detection', 0.171), ('yaw', 0.158), ('convolutional', 0.134), ('manifold', 0.134), ('gw', 0.109), ('detector', 0.1), ('poses', 0.085), ('synergy', 0.079), ('energy', 0.076), ('roll', 0.076), ('detectors', 0.075), ('cos', 0.073), ('profile', 0.073), ('cc', 0.072), ('network', 0.068), ('architecture', 0.067), ('image', 0.066), ('gij', 0.063), ('tilted', 0.063), ('facial', 0.06), ('images', 0.06), ('false', 0.059), ('pro', 0.058), ('ss', 0.058), ('pitch', 0.058), ('sc', 0.055), ('system', 0.055), ('yaws', 0.054), ('rotation', 0.051), ('trained', 0.051), ('rowley', 0.051), ('tests', 0.05), ('sin', 0.05), ('positives', 0.05), ('cs', 0.048), ('detected', 0.041), ('nec', 0.04), ('jones', 0.04), ('le', 0.04), ('training', 0.039), ('frontal', 0.039), ('viola', 0.039), ('estimation', 0.037), ('lush', 0.036), ('stepping', 0.036), ('labs', 0.036), ('loss', 0.035), ('variations', 0.035), ('pentium', 0.033), ('kanade', 0.032), ('midpoint', 0.032), ('yann', 0.032), ('percentage', 0.031), ('per', 0.031), ('axis', 0.03), ('angles', 0.029), ('parameterizations', 0.029), ('courant', 0.029), ('counted', 0.029), ('schneiderman', 0.029), ('trainable', 0.027), ('tasks', 0.027), ('roc', 0.026), ('train', 0.026), ('locate', 0.025), ('subsampling', 0.025), ('baluja', 0.025), ('estimated', 0.025), ('frames', 0.025), ('designed', 0.025), ('sets', 0.024), ('estimates', 0.024), ('curvature', 0.024), ('bottou', 0.024), ('replicated', 0.024), ('cmu', 0.024), ('map', 0.024), ('integrates', 0.023), ('convert', 0.023), ('annotated', 0.023), ('detections', 0.023), ('discriminative', 0.022), ('closest', 0.022), ('america', 0.022), ('tolerance', 0.022), ('detects', 0.022), ('princeton', 0.022), ('box', 0.022), ('robust', 0.022), ('manual', 0.021), ('rotated', 0.021), ('pami', 0.021), ('rotations', 0.021), ('ij', 0.021), ('recognition', 0.021), ('update', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999994 182 nips-2004-Synergistic Face Detection and Pose Estimation with Energy-Based Models

2 0.12937868 68 nips-2004-Face Detection --- Efficient and Rank Deficient

Author: Wolf Kienzle, Matthias O. Franz, Bernhard Schölkopf, Gökhan H. Bakir

Abstract: This paper proposes a method for computing fast approximations to support vector decision functions in the field of object detection. In the present approach we are building on an existing algorithm where the set of support vectors is replaced by a smaller, so-called reduced set of synthesized input space points. In contrast to the existing method that finds the reduced set via unconstrained optimization, we impose a structural constraint on the synthetic points such that the resulting approximations can be evaluated via separable filters. For applications that require scanning large images, this decreases the computational complexity by a significant amount. Experimental results show that in face detection, rank deficient approximations are 4 to 6 times faster than unconstrained reduced set systems.

3 0.12103352 192 nips-2004-The power of feature clustering: An application to object detection

Author: Shai Avidan, Moshe Butman

Abstract: We give a fast rejection scheme that is based on image segments and demonstrate it on the canonical example of face detection. However, instead of focusing on the detection step we focus on the rejection step and show that our method is simple and fast to be learned, thus making it an excellent pre-processing step to accelerate standard machine learning classifiers, such as neural-networks, Bayes classifiers or SVM. We decompose a collection of face images into regions of pixels with similar behavior over the image set. The relationships between the mean and variance of image segments are used to form a cascade of rejectors that can reject over 99.8% of image patches, thus only a small fraction of the image patches must be passed to a full-scale classifier. Moreover, the training time for our method is much less than an hour, on a standard PC. The shape of the features (i.e. image segments) we use is data-driven, they are very cheap to compute and they form a very low dimensional feature space in which exhaustive search for the best features is tractable.

4 0.11018725 205 nips-2004-Who's In the Picture

Author: Tamara L. Berg, Alexander C. Berg, Jaety Edwards, David A. Forsyth

Abstract: The context in which a name appears in a caption provides powerful cues as to who is depicted in the associated image. We obtain 44,773 face images, using a face detector, from approximately half a million captioned news images and automatically link names, obtained using a named entity recognizer, with these faces. A simple clustering method can produce fair results. We improve these results significantly by combining the clustering process with a model of the probability that an individual is depicted given its context. Once the labeling procedure is over, we have an accurately labeled set of faces, an appearance model for each individual depicted, and a natural language model that can produce accurate results on captions in isolation.

5 0.10957612 131 nips-2004-Non-Local Manifold Tangent Learning

Author: Yoshua Bengio, Martin Monperrus

Abstract: We claim and present arguments to the effect that a large class of manifold learning algorithms that are essentially local and can be framed as kernel learning algorithms will suffer from the curse of dimensionality, at the dimension of the true underlying manifold. This observation suggests to explore non-local manifold learning algorithms which attempt to discover shared structure in the tangent planes at different positions. A criterion for such an algorithm is proposed and experiments estimating a tangent plane prediction function are presented, showing its advantages with respect to local manifold learning algorithms: it is able to generalize very far from training data (on learning handwritten character image rotations), where a local non-parametric method fails.

6 0.10444531 162 nips-2004-Semi-Markov Conditional Random Fields for Information Extraction

7 0.10054053 40 nips-2004-Common-Frame Model for Object Recognition

8 0.095425889 91 nips-2004-Joint Tracking of Pose, Expression, and Texture using Conditionally Gaussian Filters

9 0.08594881 83 nips-2004-Incremental Learning for Visual Tracking

10 0.083639659 191 nips-2004-The Variational Ising Classifier (VIC) Algorithm for Coherently Contaminated Data

11 0.074595265 125 nips-2004-Multiple Relational Embedding

12 0.074032046 114 nips-2004-Maximum Likelihood Estimation of Intrinsic Dimension

13 0.067358211 52 nips-2004-Discrete profile alignment via constrained information bottleneck

14 0.063000306 44 nips-2004-Conditional Random Fields for Object Recognition

15 0.062144291 179 nips-2004-Surface Reconstruction using Learned Shape Models

16 0.062070943 160 nips-2004-Seeing through water

17 0.059522584 99 nips-2004-Learning Hyper-Features for Visual Identification

18 0.059366271 127 nips-2004-Neighbourhood Components Analysis

19 0.056907095 106 nips-2004-Machine Learning Applied to Perception: Decision Images for Gender Classification

20 0.054772682 89 nips-2004-Joint MRI Bias Removal Using Entropy Minimization Across Images


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.161), (1, 0.034), (2, -0.053), (3, -0.168), (4, 0.123), (5, 0.029), (6, 0.054), (7, -0.106), (8, 0.002), (9, 0.019), (10, -0.041), (11, -0.086), (12, -0.04), (13, -0.003), (14, -0.097), (15, -0.035), (16, -0.072), (17, 0.045), (18, 0.046), (19, -0.029), (20, -0.041), (21, 0.028), (22, -0.103), (23, -0.102), (24, 0.154), (25, 0.129), (26, 0.004), (27, 0.003), (28, -0.074), (29, -0.015), (30, 0.16), (31, -0.082), (32, 0.209), (33, 0.031), (34, 0.021), (35, -0.146), (36, 0.07), (37, 0.094), (38, 0.038), (39, -0.038), (40, -0.035), (41, 0.101), (42, 0.026), (43, -0.018), (44, 0.002), (45, -0.069), (46, 0.041), (47, 0.127), (48, 0.109), (49, 0.059)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96007377 182 nips-2004-Synergistic Face Detection and Pose Estimation with Energy-Based Models

2 0.5571391 192 nips-2004-The power of feature clustering: An application to object detection

Author: Shai Avidan, Moshe Butman

Abstract: We give a fast rejection scheme that is based on image segments and demonstrate it on the canonical example of face detection. However, instead of focusing on the detection step we focus on the rejection step and show that our method is simple and fast to be learned, thus making it an excellent pre-processing step to accelerate standard machine learning classifiers, such as neural-networks, Bayes classifiers or SVM. We decompose a collection of face images into regions of pixels with similar behavior over the image set. The relationships between the mean and variance of image segments are used to form a cascade of rejectors that can reject over 99.8% of image patches, thus only a small fraction of the image patches must be passed to a full-scale classifier. Moreover, the training time for our method is much less than an hour, on a standard PC. The shape of the features (i.e. image segments) we use is data-driven, they are very cheap to compute and they form a very low dimensional feature space in which exhaustive search for the best features is tractable.

3 0.5426302 68 nips-2004-Face Detection --- Efficient and Rank Deficient

Author: Wolf Kienzle, Matthias O. Franz, Bernhard Schölkopf, Gökhan H. Bakir

Abstract: This paper proposes a method for computing fast approximations to support vector decision functions in the field of object detection. In the present approach we are building on an existing algorithm where the set of support vectors is replaced by a smaller, so-called reduced set of synthesized input space points. In contrast to the existing method that finds the reduced set via unconstrained optimization, we impose a structural constraint on the synthetic points such that the resulting approximations can be evaluated via separable filters. For applications that require scanning large images, this decreases the computational complexity by a significant amount. Experimental results show that in face detection, rank deficient approximations are 4 to 6 times faster than unconstrained reduced set systems.

4 0.53829283 191 nips-2004-The Variational Ising Classifier (VIC) Algorithm for Coherently Contaminated Data

Author: Oliver Williams, Andrew Blake, Roberto Cipolla

Abstract: There has been substantial progress in the past decade in the development of object classifiers for images, for example of faces, humans and vehicles. Here we address the problem of contaminations (e.g. occlusion, shadows) in test images which have not explicitly been encountered in training data. The Variational Ising Classifier (VIC) algorithm models contamination as a mask (a field of binary variables) with a strong spatial coherence prior. Variational inference is used to marginalize over contamination and obtain robust classification. In this way the VIC approach can turn a kernel classifier for clean data into one that can tolerate contamination, without any specific training on contaminated positives.

5 0.52522653 205 nips-2004-Who's In the Picture

Author: Tamara L. Berg, Alexander C. Berg, Jaety Edwards, David A. Forsyth

Abstract: The context in which a name appears in a caption provides powerful cues as to who is depicted in the associated image. We obtain 44,773 face images, using a face detector, from approximately half a million captioned news images and automatically link names, obtained using a named entity recognizer, with these faces. A simple clustering method can produce fair results. We improve these results significantly by combining the clustering process with a model of the probability that an individual is depicted given its context. Once the labeling procedure is over, we have an accurately labeled set of faces, an appearance model for each individual depicted, and a natural language model that can produce accurate results on captions in isolation.

6 0.47301337 199 nips-2004-Using Machine Learning to Break Visual Human Interaction Proofs (HIPs)

7 0.4161005 106 nips-2004-Machine Learning Applied to Perception: Decision Images for Gender Classification

8 0.40701175 40 nips-2004-Common-Frame Model for Object Recognition

9 0.39649856 162 nips-2004-Semi-Markov Conditional Random Fields for Information Extraction

10 0.38689721 131 nips-2004-Non-Local Manifold Tangent Learning

11 0.3496649 114 nips-2004-Maximum Likelihood Estimation of Intrinsic Dimension

12 0.33441088 160 nips-2004-Seeing through water

13 0.33233798 91 nips-2004-Joint Tracking of Pose, Expression, and Texture using Conditionally Gaussian Filters

14 0.32929954 120 nips-2004-Modeling Conversational Dynamics as a Mixed-Memory Markov Process

15 0.30569023 47 nips-2004-Contextual Models for Object Detection Using Boosted Random Fields

16 0.29918391 150 nips-2004-Proximity Graphs for Clustering and Manifold Learning

17 0.29586032 99 nips-2004-Learning Hyper-Features for Visual Identification

18 0.2931397 73 nips-2004-Generative Affine Localisation and Tracking

19 0.2683537 52 nips-2004-Discrete profile alignment via constrained information bottleneck

20 0.26296547 127 nips-2004-Neighbourhood Components Analysis


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.239), (13, 0.101), (15, 0.132), (17, 0.033), (25, 0.014), (26, 0.059), (31, 0.021), (33, 0.179), (35, 0.025), (50, 0.034), (51, 0.013), (76, 0.022)]
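
The list above is a sparse topic vector: each pair is a (topicId, topicWeight) entry, and topics that are not listed are simply omitted. As a minimal sketch (the helper names `parse_topics` and `top_topics` are hypothetical and are not part of the tool that produced this listing), the pairs can be parsed and ranked by weight to see which topics dominate this paper:

```python
import ast

# Sparse (topicId, topicWeight) pairs copied verbatim from the listing above.
raw = ("[(2, 0.239), (13, 0.101), (15, 0.132), (17, 0.033), (25, 0.014), "
       "(26, 0.059), (31, 0.021), (33, 0.179), (35, 0.025), (50, 0.034), "
       "(51, 0.013), (76, 0.022)]")


def parse_topics(text):
    """Parse the printed list of (topicId, topicWeight) pairs into a dict."""
    return {int(topic): float(weight) for topic, weight in ast.literal_eval(text)}


def top_topics(topics, k=3):
    """Return the k (topicId, topicWeight) pairs with the largest weights."""
    return sorted(topics.items(), key=lambda kv: kv[1], reverse=True)[:k]


topics = parse_topics(raw)
dominant = top_topics(topics)       # heaviest topics first
listed_mass = sum(topics.values())  # total weight of the listed topics

print(dominant)
print(round(listed_mass, 3))
```

The listing does not say how simValue is derived from these weights, so the sketch stops at inspecting the vector and does not try to reproduce the similarity scores.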

similar papers list:

simIndex simValue paperId paperTitle

1 0.88239551 169 nips-2004-Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes

Author: Yee W. Teh, Michael I. Jordan, Matthew J. Beal, David M. Blei

Abstract: We propose the hierarchical Dirichlet process (HDP), a nonparametric Bayesian model for clustering problems involving multiple groups of data. Each group of data is modeled with a mixture, with the number of components being open-ended and inferred automatically by the model. Further, components can be shared across groups, allowing dependencies across groups to be modeled effectively as well as conferring generalization to new groups. Such grouped clustering problems occur often in practice, e.g. in the problem of topic discovery in document corpora. We report experimental results on three text corpora showing the effective and superior performance of the HDP over previous models.

2 0.86463052 159 nips-2004-Schema Learning: Experience-Based Construction of Predictive Action Models

Author: Michael P. Holmes, Charles Jr.

Abstract: Schema learning is a way to discover probabilistic, constructivist, predictive action models (schemas) from experience. It includes methods for finding and using hidden state to make predictions more accurate. We extend the original schema mechanism [1] to handle arbitrary discrete-valued sensors, improve the original learning criteria to handle POMDP domains, and better maintain hidden state by using schema predictions. These extensions show large improvement over the original schema mechanism in several rewardless POMDPs, and achieve very low prediction error in a difficult speech modeling task. Further, we compare extended schema learning to the recently introduced predictive state representations [2], and find their predictions of next-step action effects to be approximately equal in accuracy. This work lays the foundation for a schema-based system of integrated learning and planning.

same-paper 3 0.84066415 182 nips-2004-Synergistic Face Detection and Pose Estimation with Energy-Based Models

Author: Margarita Osadchy, Matthew L. Miller, Yann LeCun

Abstract: We describe a novel method for real-time, simultaneous multi-view face detection and facial pose estimation. The method employs a convolutional network to map face images to points on a manifold, parametrized by pose, and non-face images to points far from that manifold. This network is trained by optimizing a loss function of three variables: image, pose, and face/non-face label. We test the resulting system, in a single configuration, on three standard data sets – one for frontal pose, one for rotated faces, and one for profiles – and find that its performance on each set is comparable to previous multi-view face detectors that can only handle one form of pose variation. We also show experimentally that the system’s accuracy on both face detection and pose estimation is improved by training for the two tasks together.

4 0.72420579 10 nips-2004-A Probabilistic Model for Online Document Clustering with Application to Novelty Detection

Author: Jian Zhang, Zoubin Ghahramani, Yiming Yang

Abstract: In this paper we propose a probabilistic model for online document clustering. We use a non-parametric Dirichlet process prior to model the growing number of clusters, and use a general English language model as the base distribution to handle the generation of novel clusters. Furthermore, cluster uncertainty is modeled with a Bayesian Dirichlet-multinomial distribution. We use an empirical Bayes method to estimate hyperparameters based on a historical dataset. Our probabilistic model is applied to the novelty detection task in Topic Detection and Tracking (TDT) and compared with existing approaches in the literature.

5 0.71677196 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer

Abstract: Prediction suffix trees (PST) provide a popular and effective tool for tasks such as compression, classification, and language modeling. In this paper we take a decision theoretic view of PSTs for the task of sequence prediction. Generalizing the notion of margin to PSTs, we present an online PST learning algorithm and derive a loss bound for it. The depth of the PST generated by this algorithm scales linearly with the length of the input. We then describe a self-bounded enhancement of our learning algorithm which automatically grows a bounded-depth PST. We also prove an analogous mistake-bound for the self-bounded algorithm. The result is an efficient algorithm that neither relies on a-priori assumptions on the shape or maximal depth of the target PST nor does it require any parameters. To our knowledge, this is the first provably-correct PST learning algorithm which generates a bounded-depth PST while being competitive with any fixed PST determined in hindsight.

6 0.7154026 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform

7 0.71427435 131 nips-2004-Non-Local Manifold Tangent Learning

8 0.71414542 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants

9 0.71328449 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms

10 0.71101487 102 nips-2004-Learning first-order Markov models for control

11 0.70949173 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images

12 0.70915681 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill

13 0.70903617 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

14 0.70897549 178 nips-2004-Support Vector Classification with Input Data Uncertainty

15 0.70779556 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning

16 0.7072646 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach

17 0.70705074 130 nips-2004-Newscast EM

18 0.70656985 127 nips-2004-Neighbourhood Components Analysis

19 0.70651042 116 nips-2004-Message Errors in Belief Propagation

20 0.70641857 68 nips-2004-Face Detection --- Efficient and Rank Deficient