220 iccv-2013-Joint Deep Learning for Pedestrian Detection


Source: pdf

Author: Wanli Ouyang, Xiaogang Wang

Abstract: Feature extraction, deformation handling, occlusion handling, and classification are four important components in pedestrian detection. Existing methods learn or design these components either individually or sequentially. The interaction among these components is not yet well explored. This paper proposes that they should be jointly learned in order to maximize their strengths through cooperation. We formulate these four components into a joint deep learning framework and propose a new deep network architecture. By establishing automatic, mutual interaction among components, the deep model achieves a 9% reduction in the average miss rate compared with the current best-performing pedestrian detection approaches on the largest Caltech benchmark dataset.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: Feature extraction, deformation handling, occlusion handling, and classification [...] [sent-4, score-0.425]

2 We formulate these four components into a joint deep learning framework and propose a new deep network architecture. [sent-9, score-0.7]

3 By establishing automatic, mutual interaction among components, the deep model achieves a 9% reduction in the average miss rate compared with the current best-performing pedestrian detection approaches on the largest Caltech benchmark dataset. [sent-10, score-0.901]

4 Second, deformation models should handle the articulation of human parts such as torso, head, and legs. [sent-18, score-0.44]

5 Third, occlusion handling approaches [13, 51, 19] seek to identify the occluded regions and avoid their use when determining the existence of a pedestrian in a window. [sent-20, score-0.377]

6 [...] first learned or designed individually or sequentially, and then put together in a pipeline. [sent-31, score-0.267]

7 [...] fixing HOG features and deformable models, occlusion handling models are learned in [34, 36], using the part-detection scores as input. [sent-43, score-0.288]

8 We hope that jointly learned components, like members with team spirit, can create synergy through close interaction, and generate performance that is greater than individually learned components. [sent-46, score-0.174]

9 The deep model is especially appropriate for this task because it can organize these components into different layers and jointly optimize them through back-propagation. [sent-50, score-0.516]

10 [...] unified deep model for jointly learning feature extraction, a part deformation model, an occlusion model and classification [...] [sent-54, score-0.943]

11 With the deep model, these components interact with each other in the learning process, which allows each component to maximize its strength when cooperating with others. [sent-56, score-0.392]

12 We enrich the operation in deep models by incorporating the deformation layer into the convolutional neural networks (CNN) [26]. [sent-58, score-1.033]

13 With this layer, various deformation handling approaches can be applied to our deep model. [sent-59, score-0.746]

14 The features are learned from pixels through interaction with deformation and occlusion handling models. [sent-61, score-0.634]

15 Related Work: It has been proved that deep models are potentially more capable than shallow models in handling complex tasks [3]. [sent-64, score-0.401]

16 Deep models for pedestrian detection focus on feature learning [44, 33], contextual information learning [57], and occlusion handling [34]. [sent-66, score-0.556]

17 First-order color features like color histograms [11], second-order color features like color-self-similarity (CSS) [50] and co-occurrence features [43] are also used for pedestrian detection. [sent-69, score-0.282]

18 However, these approaches do not learn the variable deformation properties of body parts. [sent-76, score-0.345]

19 Since pedestrians have non-rigid deformation, the ability to handle deformation improves detection performance. [sent-78, score-0.476]

20 In order to handle occlusion, many approaches have been proposed for estimating the visibility of parts [13, 51, 54, 53, 45, 27]. [sent-82, score-0.26]

21 Some of them use the detection scores of blocks or parts [51, 34, 13, 54] as input for visibility estimation. [sent-83, score-0.352]

22 However, all these approaches learn the occlusion modeling separately from feature extraction and part models. [sent-85, score-0.205]

23 This paper takes a global view of these components and is an important step towards joint learning of them for pedestrian detection. [sent-96, score-0.288]

24 Overview of the proposed deep model: An overview of our proposed deep model is shown in Fig. [sent-100, score-0.616]

25 This layer convolves the 3-channel input image data with 9 × 9 × 3 filters [...] [sent-105, score-0.246]

26 Part detection maps are obtained from the second convolutional layer. [sent-121, score-0.268]

27 This layer convolves the feature maps with 20 part filters [...] [sent-122, score-0.393]

28 [...] of different sizes and outputs 20 part detection maps. [sent-123, score-0.298]

29 Part scores are obtained from the 20 part detection maps using a deformation handling layer. [sent-127, score-0.647]

30 The visibility reasoning of 20 parts is used for estimating the label y; that is, whether a given window encloses a pedestrian or not. [sent-132, score-0.52]

31 [Garbled figure-caption text: the extracted maps are processed by the second convolutional layer and the deformation layer [...]] [sent-148, score-0.258]

32 [Garbled figure-caption text, continued: [...] to obtain 20 part scores.] [sent-149, score-0.265]

33 visibility reasoning model is used to estimate the detection label y. [sent-151, score-0.32]

34 [...] first channel is an 84 × 28 Y-channel image after the image is [...] [sent-157, score-0.206]

35 [...] first convolutional layer and its following average pooling layer use the standard CNN settings. [sent-168, score-0.788]

36 There are six small parts at level 1, seven medium-sized parts at level 2, and seven [...] [Figure 3 labels: Level 3, Level 2, Level 1; panels (a) and (b)] [sent-181, score-0.22]

37 A part at an upper level is composed of parts at the lower level. [sent-188, score-0.172]

38 [...] figure, the head-shoulder part appears twice (representing occlusion status at the top level and part at the middle level respectively) because this body part itself can generate an occlusion status. [sent-193, score-0.444]

39 The deformation layer In order to learn the deformation constraints of different parts, we propose the deformation handling layer (deformation layer for short) for the CNNs. [sent-211, score-1.737]

40 The deformation layer takes the P part detection maps as input and outputs P part scores s = {s_1, ..., s_P}. [sent-212, score-0.819]

41 The deformation layer treats the detection maps separately and produces the pth part score s_p from the pth part detection map, denoted by M_p. [sent-217, score-0.458]

42 A 2D summed map, denoted by B_p, is obtained by summing up the part detection map M_p and the deformation maps as follows: $B_p = M_p + \sum_{n=1}^{N} c_{n,p} D_{n,p}$. (1) [sent-218, score-0.619]

43 D_{n,p} denotes the nth deformation map for the pth part, c_{n,p} denotes the weight for D_{n,p}, and N denotes the number of deformation maps. [sent-223, score-0.797]

44 At the training stage, only the value at location (x, y) of Bp is used for learning the deformation parameters. [sent-227, score-0.398]

45 The cn,p and Dn,p in (1) are the key for designing different deformation models. [sent-228, score-0.345]

46 Suppose N = 1, c1,p = 1 and the deformation map D1,p is to be learned. [sent-232, score-0.37]

47 In this case, the discrete locations of the pth part are treated as bins and the deformation cost for each bin is learned. [sent-233, score-0.489]

48 The approach in [39] treats deformation as bins of locations. [sent-235, score-0.345]

49 If $d_{1,p}^{(x,y)}$ is the same for any (x, y), then there is no deformation cost. [sent-240, score-0.345]

50 Part detection map and deformation maps are summed up with weights cn,p for n = 1, 2, 3, 4 to obtain the summed map Bp. [sent-245, score-0.649]

51 Global max pooling is then performed on the summed map to obtain the score sp for the pth part. [sent-246, score-0.292]
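The summed map and the global max pooling described in sentences 42 to 51 can be sketched in a few lines. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the function names `summed_map` and `global_max_pool`, the nested-list layout of the maps, and the toy values are all hypothetical.

```python
def summed_map(part_map, deformation_maps, weights):
    """Return B = M + sum_n c_n * D_n, computed element-wise.

    part_map         -- 2D list, the part detection map M (rows indexed by y)
    deformation_maps -- list of 2D lists D_n with the same shape as M
    weights          -- list of scalars c_n, one per deformation map
    """
    summed = [row[:] for row in part_map]  # copy so M is not modified
    for weight, d_map in zip(weights, deformation_maps):
        for y, row in enumerate(d_map):
            for x, value in enumerate(row):
                summed[y][x] += weight * value
    return summed


def global_max_pool(summed):
    """Return (score, (x, y)): the maximum of the summed map and its location."""
    return max(
        (value, (x, y))
        for y, row in enumerate(summed)
        for x, value in enumerate(row)
    )


# Toy example: the strongest raw detection sits at (x=1, y=0),
# but a deformation penalty at that location moves the maximum elsewhere.
M = [[0.1, 0.9],
     [0.4, 0.2]]
D1 = [[0.0, -1.0],
      [0.0, 0.0]]
B = summed_map(M, [D1], [1.0])
score, location = global_max_pool(B)
```

In this toy case the part score is taken at the location that best trades off detection response against deformation cost, which is the role the text assigns to the deformation layer.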

52 The disadvantage of max-pooling is that the hand-tuned local region does not adapt to different deformation properties of different parts. [sent-248, score-0.345]

53 The deformation layer can represent the widely used quadratic constraint of deformation in [17]. [sent-250, score-0.548]

54 The quadratic constraint of deformation can be represented as follows: $b^{(x,y)} = m^{(x,y)} + c_1\left(x - a_x + \frac{c_3}{2c_1}\right)^2 + c_2\left(y - a_y + \frac{c_4}{2c_2}\right)^2$, (4) where $m^{(x,y)}$ is the (x, y)th element of the part detection map M, $(a_x, a_y)$ is the predefined [...] [sent-253, score-0.497]

55 Fig. 4 illustrates this example, which is used as the deformation layer in this work. [sent-268, score-0.548]
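Taking the quadratic constraint in (4) with the signs as they appear in the extracted text, expanding the two squares gives four location-dependent terms plus a constant, which is consistent with the four weighted deformation maps (n = 1, 2, 3, 4) mentioned in sentence 50. The sketch below checks that algebra numerically. The function names and the sample coefficients are hypothetical, and the decomposition is worked out here from the formula rather than quoted from the paper.

```python
import math


def quadratic_score(m, x, y, ax, ay, c1, c2, c3, c4):
    """Score at (x, y) in the completed-square form of equation (4)."""
    return (m
            + c1 * (x - ax + c3 / (2 * c1)) ** 2
            + c2 * (y - ay + c4 / (2 * c2)) ** 2)


def expanded_score(m, x, y, ax, ay, c1, c2, c3, c4):
    """Same score written as m + sum of four weighted maps + a constant.

    Assumed maps: D1 = (x - ax)^2, D2 = (y - ay)^2, D3 = x - ax, D4 = y - ay.
    """
    dx, dy = x - ax, y - ay
    constant = c3 ** 2 / (4 * c1) + c4 ** 2 / (4 * c2)
    return m + c1 * dx ** 2 + c2 * dy ** 2 + c3 * dx + c4 * dy + constant


# Compare both forms on a small grid with made-up coefficients.
params = dict(ax=2, ay=3, c1=-0.5, c2=-0.25, c3=0.2, c4=-0.1)
forms_agree = all(
    math.isclose(quadratic_score(0.7, x, y, **params),
                 expanded_score(0.7, x, y, **params),
                 rel_tol=1e-9, abs_tol=1e-12)
    for x in range(6)
    for y in range(6)
)
```

Both forms require c1 and c2 to be non-zero, since they appear in denominators.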

56 For example, $h_1^1$ indicates the visibility of the left-head-shoulder part. [sent-285, score-0.199]

57 Fig. 5 shows the model for the visibility reasoning and classification [...] [sent-291, score-0.255]

58 Denote the score and visibility of the jth part at level l as $s_j^l$ and $h_j^l$, respectively. [sent-294, score-0.31]

59 Denote the visibility of $P_l$ parts at level l by $h^l = [h_1^l$ [...] [sent-295, score-0.343]

60 The visibility of one part is correlated with the visibility of other parts at the same level through shared parents. [sent-303, score-0.57]

61 The differences between the deep model in this paper and the approach in [34] are as follows: 1. [sent-305, score-0.308]

62 The approach in [34] only learns the visibility relationship from part scores. [sent-317, score-0.302]

63 Both HOG features and the parameters for the deformation model are fixed [...] [sent-318, score-0.371]

64 In this paper, features, deformable models, and visibility relationships are jointly learned. [sent-320, score-0.29]

65 In order to learn the parameters in the two convolutional layers and the deformation layer in Fig. [sent-321, score-0.785]

66 In order to train this deep architecture, we adopt a multi-stage training strategy. [sent-339, score-0.333]

67 We add one more layer at each stage; the layers trained in the previous stage are used for initialization, and then all the layers at the current stage are jointly optimized with BP. [sent-344, score-0.518]

68 Approximately 60,000 training samples that are not pruned by the detector are used for training the deep model. [sent-348, score-0.383]

69 At the testing stage, the execution time required by our deep model is less than 10% of the execution time required by the HOG+CSS+SVM detector, which has [...] [sent-349, score-0.356]

70 As in [12], the log-average miss rate is used to summarize the detector performance, and is computed by averaging the miss rate at nine FPPI rates that are evenly spaced in the log-space in the range from 10^-2 to 10^0. [sent-357, score-0.561]
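The log-average miss rate in sentence 70 can be illustrated as follows. This is a sketch under assumptions: `miss_rate_at` is a hypothetical callable that returns a detector's miss rate at a given FPPI (false positives per image), and the option to average in the log domain (a geometric mean) is an assumption added here, since the sentence only says "averaging". The plain arithmetic mean is available through `geometric=False`.

```python
import math


def log_spaced_fppi(low_exp=-2.0, high_exp=0.0, count=9):
    """Return `count` FPPI values evenly spaced in log-space over [10^low, 10^high]."""
    step = (high_exp - low_exp) / (count - 1)
    return [10.0 ** (low_exp + i * step) for i in range(count)]


def log_average_miss_rate(miss_rate_at, geometric=True):
    """Average the miss rate over the nine reference FPPI points.

    miss_rate_at -- hypothetical callable: FPPI -> miss rate in [0, 1]
    geometric    -- if True, average the logs of the miss rates (assumption);
                    if False, take the plain arithmetic mean
    """
    rates = [miss_rate_at(fppi) for fppi in log_spaced_fppi()]
    if geometric:
        # Clamp to avoid log(0) when a detector reaches a zero miss rate.
        logs = [math.log(max(rate, 1e-10)) for rate in rates]
        return math.exp(sum(logs) / len(logs))
    return sum(rates) / len(rates)


# Toy curve: a constant miss rate averages to itself under either mean.
reference_points = log_spaced_fppi()
flat_geometric = log_average_miss_rate(lambda fppi: 0.39)
flat_arithmetic = log_average_miss_rate(lambda fppi: 0.39, geometric=False)
```

In practice the miss rate at each reference FPPI would be read off a measured miss-rate-versus-FPPI curve, which is outside the scope of this sketch.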

71 Different deep models are used by ConvNet-U-MS and DN-HOG. [sent-364, score-0.308]

72 The current best performing approaches on the Caltech-Test are the MultiResC [37] and the contextual boost [8], both of which have an average miss rate of 48%. [sent-374, score-0.296]

73 Our approach reduces the average miss rate by 9%. [sent-375, score-0.268]

74 Since Caltech-Test is the largest among commonly used datasets, we investigate different designs of deep models on this dataset. [sent-376, score-0.353]

75 7(a)) is constructed by convolving the extracted feature maps with another convolutional layer and another pooling layer. [sent-384, score-0.504]

76 Adding more convolutional and pooling layers on the top of the two-layer CNN does not improve the performance. [sent-385, score-0.305]

77 [...] first convolutional layer and pooling layer of UDN, but do not have the deformation layer or the visibility estimation layer. [sent-387, score-1.535]

78 This experiment shows that the usage of deformation and visibility layers outperforms CNNs. [sent-388, score-0.633]

79 The inclusion of the second channel of color images with a lower resolution reduces the miss rate by 5%. [sent-398, score-0.268]

80 Including the third channel of edge maps reduces the miss rate by a further 3%. [sent-399, score-0.363]

81 [...] first convolutional and pooling layers of UDN correspond to the feature extraction step. [sent-404, score-0.534]

82 • LatSvm-V2 [17], with a miss rate of 63%, manually designs the HOG feature, and then learns the deformation model. [sent-406, score-0.313]

83 Results of various designs of the deep model on the Caltech-Test dataset. [sent-433, score-0.353]

84 [...] is whether deformation and visibility models are jointly learned. [sent-434, score-0.544]

85 Compared with UDN-HOG, the extra CSS feature reduces the miss rate by 3%. [sent-438, score-0.298]

86 UDN-CNNFeat [...] first learns the feature extraction layers using CNN-1layer in Fig. [sent-440, score-0.296]

87 [...] fixes these layers, and then jointly learns the deformation and visibility. [sent-442, score-0.449]

88 In this case, the feature extraction is not jointly learned with the deformation and visibility. [sent-443, score-0.505]

89 Compared with UDN-HOGCSS, UDN-CNNFeat reduces the miss rate by 3% by using the features learned from CNN-1layer. [sent-444, score-0.328]

90 UDN-DefLayer, with a miss rate of 41%, jointly learns features and deformation. [sent-445, score-0.398]

91 The interaction between deformation, visibility, and feature learning clearly improves the detection ability of the model. [sent-450, score-0.179]

92 [...], [24, 41]) have found that deep models favor large-scale training data. [sent-469, score-0.333]

93 Therefore, the difference of miss rates between UDN and existing approaches is smaller than that on Caltech-Test. [sent-471, score-0.201]

94 Area under curve [44] is another measurement commonly used to evaluate the performance of pedestrian detection. [sent-472, score-0.204]

95 Fig. 9 shows the average miss rate computed from AUC, which indicates that UDN also outperforms other state-of-the-art methods under AUC. [sent-474, score-0.268]

96 [...] unified deep model that jointly learns four components – feature extraction, deformation handling, occlusion handling and classification [...] [sent-478, score-1.043]

97 We enrich the deep model by introducing the deformation layer, which has great [...] [sent-482, score-0.682]

98 Hierarchical face parsing via deep learning. [sent-688, score-0.308]

99 Stacks of convolutional restricted Boltzmann machines for shift-invariant feature learning. [sent-709, score-0.178]

100 A discriminative deep model for pedestrian detection with occlusion handling. [sent-714, score-0.657]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('classi', 0.397), ('deformation', 0.345), ('deep', 0.308), ('pedestrian', 0.204), ('layer', 0.203), ('miss', 0.201), ('visibility', 0.199), ('lters', 0.171), ('rst', 0.166), ('udn', 0.15), ('convolutional', 0.148), ('cation', 0.127), ('cnn', 0.123), ('gjl', 0.107), ('css', 0.099), ('multiftr', 0.095), ('handling', 0.093), ('layers', 0.089), ('hog', 0.089), ('multiresc', 0.088), ('ygnd', 0.086), ('pth', 0.082), ('occlusion', 0.08), ('xed', 0.079), ('ers', 0.079), ('lter', 0.076), ('bp', 0.075), ('ouyang', 0.073), ('pooling', 0.068), ('rate', 0.067), ('summed', 0.067), ('pedestrians', 0.066), ('detection', 0.065), ('prede', 0.064), ('jointly', 0.063), ('part', 0.062), ('parts', 0.061), ('ay', 0.057), ('xes', 0.057), ('reasoning', 0.056), ('components', 0.056), ('interaction', 0.056), ('maps', 0.055), ('ranzato', 0.055), ('uni', 0.053), ('doll', 0.051), ('sp', 0.05), ('level', 0.049), ('auc', 0.049), ('designs', 0.045), ('individually', 0.043), ('cjl', 0.043), ('conlvaoyleur', 0.043), ('convolves', 0.043), ('gure', 0.043), ('wcls', 0.043), ('cuhk', 0.042), ('eth', 0.042), ('learns', 0.041), ('ax', 0.041), ('channel', 0.04), ('featsynth', 0.038), ('hil', 0.038), ('ccs', 0.038), ('ltered', 0.038), ('shapelet', 0.038), ('stage', 0.037), ('ftrmine', 0.035), ('hl', 0.034), ('articulation', 0.034), ('interdependent', 0.034), ('svm', 0.034), ('learned', 0.034), ('hoglbp', 0.033), ('extraction', 0.033), ('caltech', 0.031), ('crosstalk', 0.03), ('kavukcuoglu', 0.03), ('feature', 0.03), ('hidden', 0.03), ('enrich', 0.029), ('levels', 0.029), ('leibe', 0.029), ('inria', 0.028), ('learning', 0.028), ('deformable', 0.028), ('contextual', 0.028), ('er', 0.027), ('ned', 0.027), ('ed', 0.027), ('boureau', 0.027), ('scores', 0.027), ('features', 0.026), ('training', 0.025), ('map', 0.025), ('detector', 0.025), ('execution', 0.024), ('hinton', 0.024), ('mp', 0.024), ('designed', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000007 220 iccv-2013-Joint Deep Learning for Pedestrian Detection

Author: Wanli Ouyang, Xiaogang Wang

Abstract: Feature extraction, deformation handling, occlusion handling, and classification are four important components in pedestrian detection. Existing methods learn or design these components either individually or sequentially. The interaction among these components is not yet well explored. This paper proposes that they should be jointly learned in order to maximize their strengths through cooperation. We formulate these four components into a joint deep learning framework and propose a new deep network architecture. By establishing automatic, mutual interaction among components, the deep model achieves a 9% reduction in the average miss rate compared with the current best-performing pedestrian detection approaches on the largest Caltech benchmark dataset.

2 0.43762809 279 iccv-2013-Multi-stage Contextual Deep Learning for Pedestrian Detection

Author: Xingyu Zeng, Wanli Ouyang, Xiaogang Wang

Abstract: Cascaded classifiers have been widely used in pedestrian detection and achieved great success. These classifiers are trained sequentially without joint optimization. In this paper, we propose a new deep model that can jointly train multi-stage classifiers through several stages of backpropagation. It keeps the score map output by a classifier within a local region and uses it as contextual information to support the decision at the next stage. Through a specific design of the training strategy, this deep architecture is able to simulate the cascaded classifiers by mining hard samples to train the network stage-by-stage. Each classifier handles samples at a different difficulty level. Unsupervised pre-training and specifically designed stage-wise supervised training are used to regularize the optimization problem. Both theoretical analysis and experimental results show that the training strategy helps to avoid overfitting. Experimental results on three datasets (Caltech, ETH and TUD-Brussels) show that our approach outperforms the state-of-the-art approaches.

3 0.23913664 257 iccv-2013-Log-Euclidean Kernels for Sparse Representation and Dictionary Learning

Author: Peihua Li, Qilong Wang, Wangmeng Zuo, Lei Zhang

Abstract: The symmetric positive definite (SPD) matrices have been widely used in image and vision problems. Recently there are growing interests in studying sparse representation (SR) of SPD matrices, motivated by the great success of SR for vector data. Though the space of SPD matrices is well-known to form a Lie group that is a Riemannian manifold, existing work fails to take full advantage of its geometric structure. This paper attempts to tackle this problem by proposing a kernel based method for SR and dictionary learning (DL) of SPD matrices. We disclose that the space of SPD matrices, with the operations of logarithmic multiplication and scalar logarithmic multiplication defined in the Log-Euclidean framework, is a complete inner product space. We can thus develop a broad family of kernels that satisfies Mercer’s condition. These kernels characterize the geodesic distance and can be computed efficiently. We also consider the geometric structure in the DL process by updating atom matrices in the Riemannian space instead of in the Euclidean space. The proposed method is evaluated with various vision problems and shows notable performance gains over state-of-the-arts.

4 0.23187597 196 iccv-2013-Hierarchical Data-Driven Descent for Efficient Optimal Deformation Estimation

Author: Yuandong Tian, Srinivasa G. Narasimhan

Abstract: Real-world surfaces such as clothing, water and human body deform in complex ways. The image distortions observed are high-dimensional and non-linear, making it hard to estimate these deformations accurately. The recent data-driven descent approach [17] applies Nearest Neighbor estimators iteratively on a particular distribution of training samples to obtain a globally optimal and dense deformation field between a template and a distorted image. In this work, we develop a hierarchical structure for the Nearest Neighbor estimators, each of which can have only a local image support. We demonstrate in both theory and practice that this algorithm has several advantages over the non-hierarchical version: it guarantees global optimality with significantly fewer training samples, is several orders faster, provides a metric to decide whether a given image is “hard” (or “easy”) requiring more (or less) samples, and can handle more complex scenes that include both global motion and local deformation. The proposed algorithm successfully tracks a broad range of non-rigid scenes including water, clothing, and medical images, and compares favorably against several other deformation estimation and tracking approaches that do not provide optimality guarantees.

5 0.20694886 311 iccv-2013-Pedestrian Parsing via Deep Decompositional Network

Author: Ping Luo, Xiaogang Wang, Xiaoou Tang

Abstract: We propose a new Deep Decompositional Network (DDN) for parsing pedestrian images into semantic regions, such as hair, head, body, arms, and legs, where the pedestrians can be heavily occluded. Unlike existing methods based on template matching or Bayesian inference, our approach directly maps low-level visual features to the label maps of body parts with DDN, which is able to accurately estimate complex pose variations with good robustness to occlusions and background clutters. DDN jointly estimates occluded regions and segments body parts by stacking three types of hidden layers: occlusion estimation layers, completion layers, and decomposition layers. The occlusion estimation layers estimate a binary mask, indicating which part of a pedestrian is invisible. The completion layers synthesize low-level features of the invisible part from the original features and the occlusion mask. The decomposition layers directly transform the synthesized visual features to label maps. We devise a new strategy to pre-train these hidden layers, and then fine-tune the entire network using the stochastic gradient descent. Experimental results show that our approach achieves better segmentation accuracy than the state-of-the-art methods on pedestrian images with or without occlusions. Another important contribution of this paper is that it provides a large scale benchmark human parsing dataset that includes 3,673 annotated samples collected from 171 surveillance videos. It is 20 times larger than existing public datasets.

6 0.16054064 25 iccv-2013-A Novel Earth Mover's Distance Methodology for Image Matching with Gaussian Mixture Models

7 0.14999044 206 iccv-2013-Hybrid Deep Learning for Face Verification

8 0.13878842 336 iccv-2013-Random Forests of Local Experts for Pedestrian Detection

9 0.13334158 106 iccv-2013-Deep Learning Identity-Preserving Face Space

10 0.12422026 190 iccv-2013-Handling Occlusions with Franken-Classifiers

11 0.12193178 7 iccv-2013-A Deep Sum-Product Architecture for Robust Facial Attributes Analysis

12 0.11692247 105 iccv-2013-DeepFlow: Large Displacement Optical Flow with Deep Matching

13 0.11478984 16 iccv-2013-A Generic Deformation Model for Dense Non-rigid Surface Registration: A Higher-Order MRF-Based Approach

14 0.099526152 420 iccv-2013-Topology-Constrained Layered Tracking with Latent Flow

15 0.097957231 269 iccv-2013-Modeling Occlusion by Discriminative AND-OR Structures

16 0.089063413 136 iccv-2013-Efficient Pedestrian Detection by Directly Optimizing the Partial Area under the ROC Curve

17 0.086961091 242 iccv-2013-Learning People Detectors for Tracking in Crowded Scenes

18 0.084251449 75 iccv-2013-CoDeL: A Human Co-detection and Labeling Framework

19 0.083026305 107 iccv-2013-Deformable Part Descriptors for Fine-Grained Recognition and Attribute Prediction

20 0.082946002 204 iccv-2013-Human Attribute Recognition by Rich Appearance Dictionary


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.182), (1, 0.013), (2, -0.018), (3, -0.056), (4, 0.063), (5, -0.082), (6, 0.021), (7, 0.08), (8, -0.058), (9, -0.1), (10, -0.024), (11, 0.01), (12, 0.076), (13, -0.081), (14, 0.065), (15, 0.032), (16, 0.02), (17, 0.118), (18, 0.196), (19, 0.227), (20, -0.068), (21, 0.069), (22, 0.048), (23, -0.024), (24, -0.268), (25, -0.052), (26, -0.019), (27, 0.127), (28, -0.179), (29, 0.186), (30, -0.118), (31, 0.017), (32, 0.113), (33, -0.056), (34, 0.159), (35, 0.065), (36, -0.034), (37, 0.18), (38, 0.031), (39, 0.1), (40, -0.007), (41, -0.057), (42, 0.045), (43, -0.03), (44, -0.091), (45, 0.035), (46, 0.027), (47, -0.018), (48, 0.022), (49, 0.138)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9322874 220 iccv-2013-Joint Deep Learning for Pedestrian Detection

Author: Wanli Ouyang, Xiaogang Wang

Abstract: Feature extraction, deformation handling, occlusion handling, and classification are four important components in pedestrian detection. Existing methods learn or design these components either individually or sequentially. The interaction among these components is not yet well explored. This paper proposes that they should be jointly learned in order to maximize their strengths through cooperation. We formulate these four components into a joint deep learning framework and propose a new deep network architecture. By establishing automatic, mutual interaction among components, the deep model achieves a 9% reduction in the average miss rate compared with the current best-performing pedestrian detection approaches on the largest Caltech benchmark dataset.

2 0.78320152 279 iccv-2013-Multi-stage Contextual Deep Learning for Pedestrian Detection

Author: Xingyu Zeng, Wanli Ouyang, Xiaogang Wang

Abstract: Cascaded classifiers have been widely used in pedestrian detection and achieved great success. These classifiers are trained sequentially without joint optimization. In this paper, we propose a new deep model that can jointly train multi-stage classifiers through several stages of backpropagation. It keeps the score map output by a classifier within a local region and uses it as contextual information to support the decision at the next stage. Through a specific design of the training strategy, this deep architecture is able to simulate the cascaded classifiers by mining hard samples to train the network stage-by-stage. Each classifier handles samples at a different difficulty level. Unsupervised pre-training and specifically designed stage-wise supervised training are used to regularize the optimization problem. Both theoretical analysis and experimental results show that the training strategy helps to avoid overfitting. Experimental results on three datasets (Caltech, ETH and TUD-Brussels) show that our approach outperforms the state-of-the-art approaches.

3 0.77079797 311 iccv-2013-Pedestrian Parsing via Deep Decompositional Network

Author: Ping Luo, Xiaogang Wang, Xiaoou Tang

Abstract: We propose a new Deep Decompositional Network (DDN) for parsing pedestrian images into semantic regions, such as hair, head, body, arms, and legs, where the pedestrians can be heavily occluded. Unlike existing methods based on template matching or Bayesian inference, our approach directly maps low-level visual features to the label maps of body parts with DDN, which is able to accurately estimate complex pose variations with good robustness to occlusions and background clutters. DDN jointly estimates occluded regions and segments body parts by stacking three types of hidden layers: occlusion estimation layers, completion layers, and decomposition layers. The occlusion estimation layers estimate a binary mask, indicating which part of a pedestrian is invisible. The completion layers synthesize low-level features of the invisible part from the original features and the occlusion mask. The decomposition layers directly transform the synthesized visual features to label maps. We devise a new strategy to pre-train these hidden layers, and then fine-tune the entire network using the stochastic gradient descent. Experimental results show that our approach achieves better segmentation accuracy than the state-of-the-art methods on pedestrian images with or without occlusions. Another important contribution of this paper is that it provides a large scale benchmark human parsing dataset that includes 3,673 annotated samples collected from 171 surveillance videos. It is 20 times larger than existing public datasets.

4 0.543962 196 iccv-2013-Hierarchical Data-Driven Descent for Efficient Optimal Deformation Estimation

Author: Yuandong Tian, Srinivasa G. Narasimhan

Abstract: Real-world surfaces such as clothing, water and human body deform in complex ways. The image distortions observed are high-dimensional and non-linear, making it hard to estimate these deformations accurately. The recent data-driven descent approach [17] applies Nearest Neighbor estimators iteratively on a particular distribution of training samples to obtain a globally optimal and dense deformation field between a template and a distorted image. In this work, we develop a hierarchical structure for the Nearest Neighbor estimators, each of which can have only a local image support. We demonstrate in both theory and practice that this algorithm has several advantages over the non-hierarchical version: it guarantees global optimality with significantly fewer training samples, is several orders faster, provides a metric to decide whether a given image is “hard” (or “easy”) requiring more (or less) samples, and can handle more complex scenes that include both global motion and local deformation. The proposed algorithm successfully tracks a broad range of non-rigid scenes including water, clothing, and medical images, and compares favorably against several other deformation estimation and tracking approaches that do not provide optimality guarantees.

5 0.51354772 206 iccv-2013-Hybrid Deep Learning for Face Verification

Author: Yi Sun, Xiaogang Wang, Xiaoou Tang

Abstract: This paper proposes a hybrid convolutional network (ConvNet)-Restricted Boltzmann Machine (RBM) model for face verification in wild conditions. A key contribution of this work is to directly learn relational visual features, which indicate identity similarities, from raw pixels of face pairs with a hybrid deep network. The deep ConvNets in our model mimic the primary visual cortex to jointly extract local relational visual features from two face images compared with the learned filter pairs. These relational features are further processed through multiple layers to extract high-level and global features. Multiple groups of ConvNets are constructed in order to achieve robustness and characterize face similarities from different aspects. The top-layer RBM performs inference from complementary high-level features extracted from different ConvNet groups with a two-level average pooling hierarchy. The entire hybrid deep network is jointly fine-tuned to optimize for the task of face verification. Our model achieves competitive face verification performance on the LFW dataset.

6 0.44756442 257 iccv-2013-Log-Euclidean Kernels for Sparse Representation and Dictionary Learning

7 0.44175586 211 iccv-2013-Image Segmentation with Cascaded Hierarchical Models and Logistic Disjunctive Normal Networks

8 0.42657536 61 iccv-2013-Beyond Hard Negative Mining: Efficient Detector Learning via Block-Circulant Decomposition

9 0.42564639 7 iccv-2013-A Deep Sum-Product Architecture for Robust Facial Attributes Analysis

10 0.41417912 136 iccv-2013-Efficient Pedestrian Detection by Directly Optimizing the Partial Area under the ROC Curve

11 0.40882623 151 iccv-2013-Exploiting Reflection Change for Automatic Reflection Removal

12 0.4071821 420 iccv-2013-Topology-Constrained Layered Tracking with Latent Flow

13 0.40406784 106 iccv-2013-Deep Learning Identity-Preserving Face Space

14 0.40225339 25 iccv-2013-A Novel Earth Mover's Distance Methodology for Image Matching with Gaussian Mixture Models

15 0.39971167 193 iccv-2013-Heterogeneous Auto-similarities of Characteristics (HASC): Exploiting Relational Information for Classification

16 0.39963245 390 iccv-2013-Shufflets: Shared Mid-level Parts for Fast Object Detection

17 0.39904904 190 iccv-2013-Handling Occlusions with Franken-Classifiers

18 0.39808929 336 iccv-2013-Random Forests of Local Experts for Pedestrian Detection

19 0.39149627 351 iccv-2013-Restoring an Image Taken through a Window Covered with Dirt or Rain

20 0.38052088 426 iccv-2013-Training Deformable Part Models with Decorrelated Features


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.048), (4, 0.016), (7, 0.022), (11, 0.176), (12, 0.015), (26, 0.066), (31, 0.034), (34, 0.013), (41, 0.019), (42, 0.099), (48, 0.072), (64, 0.066), (67, 0.07), (73, 0.048), (78, 0.016), (89, 0.126)]
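
The list above is a sparse topic vector: each pair is a topic id and the weight of that topic for paper 220. As a minimal sketch of how two such vectors could be compared, the Python below computes a cosine similarity over the shared topic ids. The function name `cosine_similarity` and the choice of cosine as the measure are assumptions made for illustration; the page does not state how its simValue column is derived, and this sketch is not claimed to reproduce those scores.

```python
import math

# Sparse topic vector copied from the listing above: (topicId, topicWeight) pairs.
paper_220 = [
    (2, 0.048), (4, 0.016), (7, 0.022), (11, 0.176), (12, 0.015),
    (26, 0.066), (31, 0.034), (34, 0.013), (41, 0.019), (42, 0.099),
    (48, 0.072), (64, 0.066), (67, 0.07), (73, 0.048), (78, 0.016),
    (89, 0.126),
]


def cosine_similarity(a, b):
    """Cosine similarity between two sparse (topicId, weight) lists.

    Hypothetical helper: cosine is an assumed measure, not the one
    documented by the source page.
    """
    da, db = dict(a), dict(b)
    # Only topics present in both vectors contribute to the dot product.
    dot = sum(w * db[t] for t, w in da.items() if t in db)
    norm_a = math.sqrt(sum(w * w for w in da.values()))
    norm_b = math.sqrt(sum(w * w for w in db.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# The highest-weighted topic for this paper.
dominant_topic = max(paper_220, key=lambda pair: pair[1])[0]
```

Topic vectors for the other listed papers are not given on this page, so comparing paper 220 against them would require those weights from elsewhere.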

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.78031909 220 iccv-2013-Joint Deep Learning for Pedestrian Detection

Author: Wanli Ouyang, Xiaogang Wang

Abstract: Feature extraction, deformation handling, occlusion handling, and classification are four important components in pedestrian detection. Existing methods learn or design these components either individually or sequentially. The interaction among these components is not yet well explored. This paper proposes that they should be jointly learned in order to maximize their strengths through cooperation. We formulate these four components into a joint deep learning framework and propose a new deep network architecture. By establishing automatic, mutual interaction among components, the deep model achieves a 9% reduction in the average miss rate compared with the current best-performing pedestrian detection approaches on the largest Caltech benchmark dataset.

2 0.74602205 362 iccv-2013-Robust Tucker Tensor Decomposition for Effective Image Representation

Author: Miao Zhang, Chris Ding

Abstract: Many tensor based algorithms have been proposed for the study of high dimensional data in a large variety of computer vision and machine learning applications. However, most of the existing tensor analysis approaches are based on Frobenius norm, which makes them sensitive to outliers, because they minimize the sum of squared errors and enlarge the influence of both outliers and large feature noises. In this paper, we propose a robust Tucker tensor decomposition model (RTD) to suppress the influence of outliers, which uses L1-norm loss function. Yet, the optimization on L1-norm based tensor analysis is much harder than standard tensor decomposition. In this paper, we propose a simple and efficient algorithm to solve our RTD model. Moreover, tensor factorization-based image storage needs much less space than PCA based methods. We carry out extensive experiments to evaluate the proposed algorithm, and verify the robustness against image occlusions. Both numerical and visual results show that our RTD model is consistently better against the existence of outliers than previous tensor and PCA methods.

3 0.73782933 359 iccv-2013-Robust Object Tracking with Online Multi-lifespan Dictionary Learning

Author: Junliang Xing, Jin Gao, Bing Li, Weiming Hu, Shuicheng Yan

Abstract: Recently, sparse representation has been introduced for robust object tracking. By representing the object sparsely, i.e., using only a few templates via ℓ1-norm minimization, these so-called ℓ1-trackers exhibit promising tracking results. In this work, we address the object template building and updating problem in these ℓ1-tracking approaches, which has not been fully studied. We propose to perform template updating, in a new perspective, as an online incremental dictionary learning problem, which is efficiently solved through an online optimization procedure. To guarantee the robustness and adaptability of the tracking algorithm, we also propose to build a multi-lifespan dictionary model. By building target dictionaries of different lifespans, effective object observations can be obtained to deal with the well-known drifting problem in tracking and thus improve the tracking accuracy. We derive effective observation models both generatively and discriminatively based on the online multi-lifespan dictionary learning model and deploy them to the Bayesian sequential estimation framework to perform tracking. The proposed approach has been extensively evaluated on ten challenging video sequences. Experimental results demonstrate the effectiveness of the online learned templates, as well as the state-of-the-art tracking performance of the proposed approach.

4 0.72201359 320 iccv-2013-Pose-Configurable Generic Tracking of Elongated Objects

Author: Daniel Wesierski, Patrick Horain

Abstract: Elongated objects have various shapes and can shift, rotate, change scale, and be rigid or deform by flexing, articulating, and vibrating, with examples as varied as a glass bottle, a robotic arm, a surgical suture, a finger pair, a tram, and a guitar string. This generally makes tracking of poses of elongated objects very challenging. We describe a unified, configurable framework for tracking the pose of elongated objects, which move in the image plane and extend over the image region. Our method strives for simplicity, versatility, and efficiency. The object is decomposed into a chained assembly of segments of multiple parts that are arranged under a hierarchy of tailored spatio-temporal constraints. In this hierarchy, segments can rescale independently while their elasticity is controlled with global orientations and local distances. While the trend in tracking is to design complex, structure-free algorithms that update object appearance online, we show that our tracker, with the novel but remarkably simple, structured organization of parts with constant appearance, reaches or improves state-of-the-art performance. Most importantly, our model can be easily configured to track exact pose of arbitrary, elongated objects in the image plane. The tracker can run up to 100 fps on a desktop PC, yet the computation time scales linearly with the number of object parts. To our knowledge, this is the first approach to generic tracking of elongated objects.

5 0.71602505 279 iccv-2013-Multi-stage Contextual Deep Learning for Pedestrian Detection

Author: Xingyu Zeng, Wanli Ouyang, Xiaogang Wang

Abstract: Cascaded classifiers have been widely used in pedestrian detection and achieved great success. These classifiers are trained sequentially without joint optimization. In this paper, we propose a new deep model that can jointly train multi-stage classifiers through several stages of backpropagation. It keeps the score map output by a classifier within a local region and uses it as contextual information to support the decision at the next stage. Through a specific design of the training strategy, this deep architecture is able to simulate the cascaded classifiers by mining hard samples to train the network stage-by-stage. Each classifier handles samples at a different difficulty level. Unsupervised pre-training and specifically designed stage-wise supervised training are used to regularize the optimization problem. Both theoretical analysis and experimental results show that the training strategy helps to avoid overfitting. Experimental results on three datasets (Caltech, ETH and TUD-Brussels) show that our approach outperforms the state-of-the-art approaches.

6 0.71551818 354 iccv-2013-Robust Dictionary Learning by Error Source Decomposition

7 0.7145763 63 iccv-2013-Bounded Labeling Function for Global Segmentation of Multi-part Objects with Geometric Constraints

8 0.7087763 311 iccv-2013-Pedestrian Parsing via Deep Decompositional Network

9 0.69276679 207 iccv-2013-Illuminant Chromaticity from Image Sequences

10 0.68902212 331 iccv-2013-Pyramid Coding for Functional Scene Element Recognition in Video Scenes

11 0.6845966 370 iccv-2013-Saliency Detection in Large Point Sets

12 0.68296528 7 iccv-2013-A Deep Sum-Product Architecture for Robust Facial Attributes Analysis

13 0.68263412 206 iccv-2013-Hybrid Deep Learning for Face Verification

14 0.68227375 338 iccv-2013-Randomized Ensemble Tracking

15 0.68112755 95 iccv-2013-Cosegmentation and Cosketch by Unsupervised Learning

16 0.68100464 150 iccv-2013-Exemplar Cut

17 0.68011028 188 iccv-2013-Group Sparsity and Geometry Constrained Dictionary Learning for Action Recognition from Depth Maps

18 0.67895591 208 iccv-2013-Image Co-segmentation via Consistent Functional Maps

19 0.67890751 45 iccv-2013-Affine-Constrained Group Sparse Coding and Its Application to Image-Based Classifications

20 0.67754698 106 iccv-2013-Deep Learning Identity-Preserving Face Space