iccv iccv2013 iccv2013-311 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ping Luo, Xiaogang Wang, Xiaoou Tang
Abstract: We propose a new Deep Decompositional Network (DDN) for parsing pedestrian images into semantic regions, such as hair, head, body, arms, and legs, where the pedestrians can be heavily occluded. Unlike existing methods based on template matching or Bayesian inference, our approach directly maps low-level visual features to the label maps of body parts with DDN, which is able to accurately estimate complex pose variations with good robustness to occlusions and background clutters. DDN jointly estimates occluded regions and segments body parts by stacking three types of hidden layers: occlusion estimation layers, completion layers, and decomposition layers. The occlusion estimation layers estimate a binary mask, indicating which part of a pedestrian is invisible. The completion layers synthesize low-level features of the invisible part from the original features and the occlusion mask. The decomposition layers directly transform the synthesized visual features to label maps. We devise a new strategy to pre-train these hidden layers, and then fine-tune the entire network using stochastic gradient descent. Experimental results show that our approach achieves better segmentation accuracy than state-of-the-art methods on pedestrian images with or without occlusions. Another important contribution of this paper is that it provides a large-scale benchmark human parsing dataset1 that includes 3,673 annotated samples collected from 171 surveillance videos. It is 20 times larger than existing public datasets.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: We propose a new Deep Decompositional Network (DDN) for parsing pedestrian images into semantic regions, such as hair, head, body, arms, and legs, where the pedestrians can be heavily occluded. [sent-9, score-0.311]
2 Unlike existing methods based on template matching or Bayesian inference, our approach directly maps low-level visual features to the label maps of body parts with DDN, which is able to accurately estimate complex pose variations with good robustness to occlusions and background clutters. [sent-10, score-0.292]
3 DDN jointly estimates occluded regions and segments body parts by stacking three types of hidden layers: occlusion estimation layers, completion layers, and decomposition layers. [sent-11, score-0.56]
4 The occlusion estimation layers estimate a binary mask, indicating which part of a pedestrian is invisible. [sent-12, score-0.593]
5 The completion layers synthesize low-level features of the invisible part from the original features and the occlusion mask. [sent-13, score-0.606]
6 The decomposition layers directly transform the synthesized visual features to label maps. [sent-14, score-0.39]
7 We devise a new strategy to pre-train these hidden layers, and then fine-tune the entire network using stochastic gradient descent. [sent-15, score-0.214]
8 Experimental results show that our approach achieves better segmentation accuracy than the state-of-the-art methods on pedestrian images with or without occlusions. [sent-16, score-0.164]
9 Another important contribution of this paper is that it provides a large-scale benchmark human parsing dataset1 that includes 3,673 annotated samples collected from 171 surveillance videos. [sent-17, score-0.168]
10 Introduction: Pedestrian analysis is an important topic in computer vision, including pedestrian detection, pose estimation, and related tasks. (This work is supported by the General Research Fund sponsored by the Research Grants Council of Hong Kong.) [sent-20, score-0.164]
11 This paper focuses on parsing a pedestrian figure into different semantic parts, such as hair, head, body, arms, and legs. [sent-33, score-0.264]
12 Existing studies of pedestrian parsing [2, 1, 6, 20] generally fall into two categories: template matching and Bayesian inference. [sent-37, score-0.285]
13 The pixel-level segmentation of body parts was first proposed in [2], which searches for templates of body parts (poselets) in the training set by incorporating the 3D skeletons of humans. [sent-38, score-0.22]
14 The identified templates are directly used as segmentation results and cannot accurately fit body boundaries of pedestrians in tests. [sent-39, score-0.184]
15 [1] (SBP) provided the ground truth annotations of the Penn-Fudan pedestrian database [27] and used it to evaluate the segmentation accuracy of their algorithm. [sent-42, score-0.164]
16 Their method segments an image into superpixels and then merges the superpixels into candidate body parts by comparing their shapes and positions with templates in the training set. [sent-43, score-0.115]
17 [6] (PbOS) treated human parsing as a Bayesian inference problem. [sent-49, score-0.144]
18 They model the appearance prior as a Gaussian mixture of pixel colors, and the body shape prior is modeled by the pose skeleton in [20] and the multinomial shape Boltzmann machine in [6]. [sent-51, score-0.224]
19 This paper addresses the aforementioned limitations by proposing a new deep model, the Deep Decompositional Network (DDN), which utilizes HOG features [3] as input and outputs the segmentation label maps. [sent-59, score-0.24]
20 HOG features can effectively characterize the boundaries of body parts and estimate human poses. [sent-60, score-0.105]
21 In order to explicitly handle the occlusion problem, DDN stacks three types of hidden layers, including occlusion estimation layers, completion layers, and decomposition layers (see Fig. [sent-61, score-0.862]
22 Specifically, the occlusion estimation layers infer a binary mask, indicating which part of the features is occluded. [sent-63, score-0.451]
23 Finally, the decomposition layers decompose the synthesized features to the label maps by learning a mapping (transformation) from the feature space to the space of label maps (see an example in Fig. [sent-65, score-0.493]
24 Unlike CNN [11], whose weights are shared and locally connected, we find fully connecting adjacent layers in DDN can capture the global structures of humans and can improve the parsing results. [sent-67, score-0.387]
25 At the training stage, we devise a new strategy based on least squares dictionary learning to pre-train the occlusion estimation layers and the decomposition layers, while the completion layers are pre-trained with a modified denoising autoencoder [26]. [sent-68, score-1.089]
26 The entire network is then fine-tuned by stochastic gradient descent. [sent-69, score-0.16]
27 At the testing stage, our network can efficiently transform an image into label maps without template matching or MCMC sampling. [sent-70, score-0.259]
28 (1) This is the first time that deep learning is studied specifically for pedestrian parsing. [sent-72, score-0.321]
29 We propose a novel deep network, where the models for occlusion estimation, data completion, and data transformation are incorporated into a unified deep architecture and jointly trained. [sent-74, score-0.624]
30 (4) We provide a large-scale benchmark human parsing dataset (refer to footnote 1) which includes 3,673 annotated samples collected from 171 surveillance videos, making it 20 times larger than existing public datasets. [sent-77, score-0.168]
31 Related Work: We review some related works on occlusion estimation [28, 4, 7, 24, 17], data completion [5, 21, 8], and cross-modality data transformation [16, 14, 10]. [sent-80, score-0.368]
32 Our DDN, with its deep structure, is more powerful than SVM, which is a flat model [8]. [sent-83, score-0.179]
33 In their network, occlusion patterns are sampled from the models and then verified with input images. [sent-86, score-0.157]
34 In contrast, our model directly maps the input features to occlusion masks. [sent-89, score-0.2]
35 The deep belief network (DBN) [8] and the deep Boltzmann machine (DBM) [21] both consist of multiple layers of RBMs, and complete the corrupted data using probabilistic inference. [sent-92, score-0.788]
36 The denoising autoencoder (DAE) [26] has shown excellent performance at recovering corrupted data, and we have integrated it as a module in our DDN. [sent-94, score-0.156]
37 [13] marginalized missing data with a proposed deep sum-product network for facial attribute recognition. [sent-96, score-0.313]
38 [16] proposed a multimodal deep network that concatenates data across modalities as input and reconstructs them by learning a shared representation. [sent-100, score-0.313]
39 DDN architecture, which combines occlusion estimation, data completion, and data transformation in a unified deep network. [sent-103, score-0.368]
40 the joint representation of images and label maps for face parsing. [sent-104, score-0.106]
41 [29] proposed a deep network to transform a face image under arbitrary pose and lighting to a canonical view. [sent-106, score-0.381]
42 [10] used convolutional neural networks (CNN), which consider data of one modality as input and the corresponding data of the other modality as output. [sent-109, score-0.106]
43 The decomposition layers in DDN are similar to CNN, but with fully-connected layers that capture the global structures of the pedestrians. [sent-110, score-0.594]
44 Fig. 2 (a) shows the architecture of DDN, the input of which is a feature vector x, and the output is a set of label maps {y1, . . . , yM}. [sent-113, score-0.15]
45 Each layer is fully connected with the next upper layer, and there are one down-sampling layer, two occlusion estimation layers, two completion layers, and two decomposition layers. [sent-117, score-0.53]
46 More layers can be added for more complex problems. [sent-119, score-0.265]
47 x is also mapped to a binary occlusion mask xo ∈ [0, 1]^n through two weight matrices Wo1, Wo2, and biases bo1, bo2. [sent-121, score-0.247]
48 xo^i = 0 if the i-th element of the feature is occluded, and xo^i = 1 otherwise. [sent-123, score-0.106]
49 xo is computed as xo = τ(Wo2 ρ(Wo1 x + bo1) + bo2), (1) where τ(x) = 1/(1 + exp(−x)) and ρ(x) = max(0, x). [sent-124, score-0.664]
50 The first occlusion estimation layer employs the rectified linear function. [sent-125, score-0.12]
51 The second layer models binary data with the sigmoid function. [sent-127, score-0.12]
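As a minimal numpy sketch of Eq. (1) (function and weight names here are illustrative assumptions, not the paper's code):

```python
import numpy as np

def relu(a):
    # rho(a) = max(0, a): the rectified linear function of the first layer
    return np.maximum(0.0, a)

def sigmoid(a):
    # tau(a) = 1 / (1 + exp(-a)): the sigmoid of the second layer
    return 1.0 / (1.0 + np.exp(-a))

def occlusion_mask(x, W_o1, b_o1, W_o2, b_o2):
    # Eq. (1): x_o = tau(W_o2 rho(W_o1 x + b_o1) + b_o2), entries in [0, 1]
    h = relu(W_o1 @ x + b_o1)        # first occlusion estimation layer
    return sigmoid(W_o2 @ h + b_o2)  # second layer outputs a soft binary mask
```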
52 xc is reconstructed from xo and xd as follows: z = ρ(Wc2 ρ(Wc1 (xo ⊙ xd) + bc1) + bc2), (2) [sent-136, score-0.237]
53 On top of DDN, the completed feature xc is decomposed (transformed) into several label maps {y1, . . . , yM}. [sent-141, score-0.158]
54 Each label map yi ∈ [0, 1]^n is estimated by yi = τ(Wt2^i ρ(Wt1 xc + bt1) + bt2^i), (4) where yij = 0 indicates the pixel belongs to the background and yij = 1 indicates the pixel is on the corresponding body part. [sent-151, score-0.164]
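A hedged sketch of the completion and decomposition forward passes (Eqs. (2) and (4)). The mapping from the hidden code z back to the completed feature xc is omitted because its exact form is not shown in this excerpt; all parameter names and the `heads` structure are assumptions:

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)  # rho

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))  # tau

def completion_hidden(x_d, x_o, W_c1, b_c1, W_c2, b_c2):
    # Eq. (2): z = rho(W_c2 rho(W_c1 (x_o * x_d) + b_c1) + b_c2);
    # x_o * x_d zeroes out the feature dimensions flagged as occluded.
    return relu(W_c2 @ relu(W_c1 @ (x_o * x_d) + b_c1) + b_c2)

def decompose(x_c, W_t1, b_t1, heads):
    # Eq. (4): y_i = tau(W_t2_i rho(W_t1 x_c + b_t1) + b_t2_i), with one
    # sigmoid head per body-part label map; heads is a list of (W, b) pairs.
    h = relu(W_t1 @ x_c + b_t1)
    return [sigmoid(W @ h + b) for W, b in heads]
```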
55 Pre-training Occlusion Estimation Layer The occlusion estimation layers infer a binary mask xo from an input feature x. [sent-160, score-0.649]
56 min_{Wo2} ||Xo − Wo2 Ho1||_F^2, (5) where Ho1 = ρ(Wo1 X) is the output of the first layer as shown in Fig. [sent-166, score-0.12]
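One possible reading of the least-squares pre-training step in Eq. (5), solving for Wo2 in closed form given Ho1 = ρ(Wo1 X); the small ridge term is an added assumption for numerical stability, not from the paper:

```python
import numpy as np

def pretrain_W_o2(X, X_o, W_o1, ridge=1e-6):
    # Least-squares initialization per Eq. (5):
    #   minimize ||X_o - W_o2 H_o1||_F^2  with  H_o1 = rho(W_o1 X);
    # X holds one training feature per column, X_o the matching masks.
    H = np.maximum(0.0, W_o1 @ X)             # H_o1
    G = H @ H.T + ridge * np.eye(H.shape[0])  # ridge term added for stability
    return (X_o @ H.T) @ np.linalg.inv(G)     # closed-form solution for W_o2
```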
57 We pre-train each layer in two steps: parameter initialization and reconstruction error minimization. [sent-210, score-0.12]
58 In the first step, for each completion layer, let v? [sent-211, score-0.16]
59 For each clean sample, we generate 40 corrupted samples by computing the element-wise product between the feature and the 40 templates in Fig. [sent-232, score-0.109]
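The corruption step for pre-training the completion layers can be sketched as below; the random binary templates only stand in for the paper's 40 hand-made occlusion templates, and the feature length is arbitrary:

```python
import numpy as np

def make_corrupted_samples(clean_feature, templates):
    # One corrupted copy per occlusion template: element-wise product of the
    # clean feature with each binary template (0 marks an occluded dimension).
    return [t * clean_feature for t in templates]

# Usage sketch: 40 random binary templates stand in for the paper's templates.
rng = np.random.default_rng(0)
templates = [(rng.random(1000) > 0.3).astype(float) for _ in range(40)]
corrupted = make_corrupted_samples(rng.random(1000), templates)
```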
60 Pre-training Decomposition Layers The first decomposition layer transforms the output of the previous layer to a different space through the weight matrix Wt1. [sent-239, score-0.334]
61 The second layer projects the output of the first layer to several subspaces through a set of weight matrices {Wt2^i}. [sent-240, score-0.298]
62 Both decomposition layers can be pre-trained using the strategy introduced in Sec. [sent-245, score-0.329]
63 ℓ ∈ {1, . . . , L} and i are the indices of layers and iterations. [sent-268, score-0.106]
64 For the output layer of DDN, eL = diag(ŷ − y) diag(ŷ) (1 − ŷ), (13) where diag(·) denotes a diagonal matrix. [sent-287, score-0.12]
65 For the ℓ-th lower layer with the sigmoid function, the backpropagation error is denoted as eℓ. [sent-289, score-0.151]
66 For a lower layer with the rectified linear function, the backpropagation error is computed as eℓ. [sent-301, score-0.151]
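A backpropagation sketch consistent with Eq. (13); the squared-error loss and the recursion for the rectified-linear layers are standard-backprop assumptions where the text above is truncated:

```python
import numpy as np

def output_error(y_hat, y):
    # Eq. (13) for the sigmoid output layer (squared-error loss assumed):
    # e_L = (y_hat - y) * y_hat * (1 - y_hat), element-wise.
    return (y_hat - y) * y_hat * (1.0 - y_hat)

def hidden_error_relu(W_up, e_up, pre_activation):
    # Standard recursion where the text is truncated: pull the upstream error
    # back through the weights and gate it by the ReLU derivative.
    return (W_up.T @ e_up) * (pre_activation > 0)
```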
67 The occlusion estimation layers are pre-trained with 600 images selected from the CUHK occlusion dataset [17], where the ground truth of occlusion masks was obtained as the overlapping regions of the bounding boxes of neighboring pedestrians, e.g. [sent-319, score-0.765]
68 Both the completion and decomposition layers are pre-trained with the HumanEva dataset [22], which contains 937 clean pedestrians with the ground truth of label maps annotated by [1]. [sent-323, score-0.664]
69 Pre-training the decomposition layers requires clean images and their label maps. [sent-324, score-0.414]
70 Pre-training the completion layers requires clean images, the corrupted data of which can be obtained by element-wise multiplication with the 40 occlusion templates shown in Fig. [sent-325, score-0.691]
71 2 shows the results of pedestrian parsing on two datasets: the Penn-Fudan dataset [27] and a new dataset constructed by us. [sent-329, score-0.264]
72 Our pedestrian parsing dataset contains 3,673 images from 171 videos of different surveillance scenes (PPSS), where 2,064 images are occluded and 1,609 are not. [sent-331, score-0.301]
73 Example images are shown in Fig. 7, which shows that large pose, illumination, and occlusion variations are present. [sent-334, score-0.157]
74 Compared with Penn-Fudan, PPSS is much larger and more diversified on scene coverage, and is therefore suitable to evaluate the performance of pedestrian parsing algorithms in practical applications. [sent-335, score-0.264]
75 We compare with structured SVM [7] and RoBM [24] for occlusion estimation on the CUHK dataset [17]. [sent-347, score-0.186]
76 The two completion layers in DDN have 104 and 3,000 neurons, respectively. [sent-366, score-0.425]
77 CNN has three convolutional layers, and each layer has 32 filters. [sent-389, score-0.146]
78 However, the major advantage of DDN is that all three modules can be well integrated into a unified deep architecture and jointly fine-tuned for the ultimate goal of human parsing. [sent-392, score-0.288]
79 2 shows that the performance of human parsing is significantly improved after fine-tuning. [sent-395, score-0.144]
80 It takes two hours to train our network on one NVIDIA GTX 670 GPU, and takes less than 0. [sent-402, score-0.134]
81 We also show the results by using only the decomposition layers (DL), which are trained with the HumanEva dataset. [sent-404, score-0.329]
82 Table 4 (a) reports the human parsing accuracy of DDN compared with SBP [1], P&S [20], and PbOS [6]. [sent-405, score-0.178]
83 DDN adopts HOG features and the fully-connected network architecture. [sent-414, score-0.134]
84 For our methods, using DL alone achieves better results than DDN, since this dataset has no occlusion, so the occlusion estimation and completion layers may introduce slight noise into the DL component of DDN. [sent-420, score-0.611]
85 1, and fine-tune the network with the training data of PPSS. [sent-426, score-0.134]
86 These three methods have the best performance among state-of-the-art methods on occlusion estimation, data completion, and data transformation. [sent-433, score-0.157]
87 First, the performance of DL drops significantly when occlusion is present. [sent-436, score-0.157]
88 DL essentially transforms the occluded HOG features to the label maps, which is difficult since the feature space of occlusion is extremely large. [sent-437, score-0.255]
89 Fig. 8 presents some segmentation examples of DDN compared to DL, and shows that DDN can effectively capture the pose of the human when large occlusion is present, because our network is carefully designed to handle occlusion. [sent-439, score-0.357]
90 Note that some small body parts can still be estimated, such as “shoes” and “arms”, since the correlations between pixels are implicitly maintained by our network structure. [sent-443, score-0.217]
91 Conclusions We present a new Deep Decompositional Network (DDN) for pedestrian parsing. [sent-450, score-0.142]
92 DDN combines the occlusion estimation layers, completion layers, and decomposition layers in a unified network, which can handle large occlusions. [sent-451, score-0.694]
93 We construct a large benchmark parsing dataset that is larger and more difficult than existing datasets. [sent-452, score-0.122]
94 Our method outperforms the state-of-the-art on pedestrian parsing, both with and without occlusions. [sent-453, score-0.142]
95 The shape Boltzmann machine: a strong model of object shape. [sent-482, score-0.097]
96 A deep sum-product architecture for robust facial attributes analysis. [sent-538, score-0.247]
97 A discriminative deep model for pedestrian detection with occlusion handling. [sent-565, score-0.478]
98 Modeling mutual visibility relationship with a deep model in pedestrian detection. [sent-570, score-0.321]
99 A generative model for simultaneous estimation of human body shape and pixellevel segmentation. [sent-580, score-0.134]
100 Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. [sent-618, score-0.397]
wordName wordTfidf (topN-words)
[('ddn', 0.724), ('layers', 0.265), ('deep', 0.179), ('xo', 0.166), ('completion', 0.16), ('occlusion', 0.157), ('pedestrian', 0.142), ('network', 0.134), ('sbp', 0.124), ('parsing', 0.122), ('layer', 0.12), ('cnn', 0.102), ('boltzmann', 0.097), ('pbos', 0.089), ('ppss', 0.089), ('humaneva', 0.087), ('body', 0.083), ('wc', 0.076), ('robm', 0.071), ('cuhk', 0.07), ('architecture', 0.068), ('dl', 0.066), ('decomposition', 0.064), ('autoencoder', 0.063), ('decompositional', 0.058), ('bmae', 0.053), ('eslami', 0.053), ('msbm', 0.053), ('xio', 0.053), ('arms', 0.051), ('dae', 0.048), ('pedestrians', 0.047), ('dbm', 0.047), ('dbn', 0.047), ('clean', 0.046), ('vc', 0.046), ('maps', 0.043), ('denoising', 0.042), ('occlusions', 0.041), ('completed', 0.041), ('luo', 0.04), ('rbm', 0.039), ('label', 0.039), ('occluded', 0.037), ('ouyang', 0.036), ('xd', 0.036), ('rauschert', 0.036), ('xc', 0.035), ('hog', 0.035), ('diag', 0.035), ('reports', 0.034), ('templates', 0.032), ('mask', 0.032), ('backpropagation', 0.031), ('pretraining', 0.031), ('corrupted', 0.031), ('weight', 0.03), ('hidden', 0.03), ('mnih', 0.029), ('ngiam', 0.029), ('modality', 0.029), ('estimation', 0.029), ('matrices', 0.028), ('noises', 0.028), ('biases', 0.028), ('rbms', 0.027), ('convolutional', 0.026), ('hair', 0.026), ('stochastic', 0.026), ('clutters', 0.025), ('devise', 0.024), ('face', 0.024), ('surveillance', 0.024), ('synthesize', 0.024), ('cloth', 0.023), ('clothes', 0.023), ('accuracies', 0.023), ('chinese', 0.022), ('human', 0.022), ('transformation', 0.022), ('networks', 0.022), ('segmentation', 0.022), ('transform', 0.022), ('pose', 0.022), ('mcmc', 0.022), ('multinomial', 0.022), ('yij', 0.021), ('hc', 0.021), ('template', 0.021), ('hong', 0.021), ('salakhutdinov', 0.021), ('kong', 0.02), ('dictionary', 0.02), ('hinton', 0.02), ('bc', 0.02), ('module', 0.02), ('ym', 0.019), ('unified', 0.019), ('poselets', 0.019), ('bengio', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 311 iccv-2013-Pedestrian Parsing via Deep Decompositional Network
Author: Ping Luo, Xiaogang Wang, Xiaoou Tang
Abstract: We propose a new Deep Decompositional Network (DDN) for parsing pedestrian images into semantic regions, such as hair, head, body, arms, and legs, where the pedestrians can be heavily occluded. Unlike existing methods based on template matching or Bayesian inference, our approach directly maps low-level visual features to the label maps of body parts with DDN, which is able to accurately estimate complex pose variations with good robustness to occlusions and background clutters. DDN jointly estimates occluded regions and segments body parts by stacking three types of hidden layers: occlusion estimation layers, completion layers, and decomposition layers. The occlusion estimation layers estimate a binary mask, indicating which part of a pedestrian is invisible. The completion layers synthesize low-level features of the invisible part from the original features and the occlusion mask. The decomposition layers directly transform the synthesized visual features to label maps. We devise a new strategy to pre-train these hidden layers, and then fine-tune the entire network using stochastic gradient descent. Experimental results show that our approach achieves better segmentation accuracy than state-of-the-art methods on pedestrian images with or without occlusions. Another important contribution of this paper is that it provides a large-scale benchmark human parsing dataset1 that includes 3,673 annotated samples collected from 171 surveillance videos. It is 20 times larger than existing public datasets.
2 0.22509682 279 iccv-2013-Multi-stage Contextual Deep Learning for Pedestrian Detection
Author: Xingyu Zeng, Wanli Ouyang, Xiaogang Wang
Abstract: Cascaded classifiers1 have been widely used in pedestrian detection and achieved great success. These classifiers are trained sequentially without joint optimization. In this paper, we propose a new deep model that can jointly train multi-stage classifiers through several stages of backpropagation. It keeps the score map output by a classifier within a local region and uses it as contextual information to support the decision at the next stage. Through a specific design of the training strategy, this deep architecture is able to simulate the cascaded classifiers by mining hard samples to train the network stage-by-stage. Each classifier handles samples at a different difficulty level. Unsupervised pre-training and specifically designed stage-wise supervised training are used to regularize the optimization problem. Both theoretical analysis and experimental results show that the training strategy helps to avoid overfitting. Experimental results on three datasets (Caltech, ETH and TUD-Brussels) show that our approach outperforms the state-of-the-art approaches.
3 0.20694886 220 iccv-2013-Joint Deep Learning for Pedestrian Detection
Author: Wanli Ouyang, Xiaogang Wang
Abstract: Feature extraction, deformation handling, occlusion handling, and classification are four important components in pedestrian detection. Existing methods learn or design these components either individually or sequentially. The interaction among these components is not yet well explored. This paper proposes that they should be jointly learned in order to maximize their strengths through cooperation. We formulate these four components into a joint deep learning framework and propose a new deep network architecture1. By establishing automatic, mutual interaction among components, the deep model achieves a 9% reduction in the average miss rate compared with the current best-performing pedestrian detection approaches on the largest Caltech benchmark dataset.
4 0.13509218 106 iccv-2013-Deep Learning Identity-Preserving Face Space
Author: Zhenyao Zhu, Ping Luo, Xiaogang Wang, Xiaoou Tang
Abstract: Face recognition with large pose and illumination variations is a challenging problem in computer vision. This paper addresses this challenge by proposing a new learningbased face representation: the face identity-preserving (FIP) features. Unlike conventional face descriptors, the FIP features can significantly reduce intra-identity variances, while maintaining discriminativeness between identities. Moreover, the FIP features extracted from an image under any pose and illumination can be used to reconstruct its face image in the canonical view. This property makes it possible to improve the performance of traditional descriptors, such as LBP [2] and Gabor [31], which can be extracted from our reconstructed images in the canonical view to eliminate variations. In order to learn the FIP features, we carefully design a deep network that combines the feature extraction layers and the reconstruction layer. The former encodes a face image into the FIP features, while the latter transforms them to an image in the canonical view. Extensive experiments on the large MultiPIE face database [7] demonstrate that it significantly outperforms the state-of-the-art face recognition methods.
5 0.13461831 206 iccv-2013-Hybrid Deep Learning for Face Verification
Author: Yi Sun, Xiaogang Wang, Xiaoou Tang
Abstract: This paper proposes a hybrid convolutional network (ConvNet)-Restricted Boltzmann Machine (RBM) model for face verification in wild conditions. A key contribution of this work is to directly learn relational visual features, which indicate identity similarities, from raw pixels of face pairs with a hybrid deep network. The deep ConvNets in our model mimic the primary visual cortex to jointly extract local relational visual features from two face images compared with the learned filter pairs. These relational features are further processed through multiple layers to extract high-level and global features. Multiple groups of ConvNets are constructed in order to achieve robustness and characterize face similarities from different aspects. The top-layer RBM performs inference from complementary high-level features extracted from different ConvNet groups with a two-level average pooling hierarchy. The entire hybrid deep network is jointly fine-tuned to optimize for the task of face verification. Our model achieves competitive face verification performance on the LFW dataset.
6 0.12713803 7 iccv-2013-A Deep Sum-Product Architecture for Robust Facial Attributes Analysis
7 0.11107802 190 iccv-2013-Handling Occlusions with Franken-Classifiers
8 0.10859267 151 iccv-2013-Exploiting Reflection Change for Automatic Reflection Removal
9 0.10160063 351 iccv-2013-Restoring an Image Taken through a Window Covered with Dirt or Rain
10 0.10091063 269 iccv-2013-Modeling Occlusion by Discriminative AND-OR Structures
11 0.0856914 242 iccv-2013-Learning People Detectors for Tracking in Crowded Scenes
12 0.083039209 420 iccv-2013-Topology-Constrained Layered Tracking with Latent Flow
13 0.075367071 403 iccv-2013-Strong Appearance and Expressive Spatial Models for Human Pose Estimation
14 0.075018719 256 iccv-2013-Locally Affine Sparse-to-Dense Matching for Motion and Occlusion Estimation
15 0.074665189 153 iccv-2013-Face Recognition Using Face Patch Networks
16 0.074158303 335 iccv-2013-Random Faces Guided Sparse Many-to-One Encoder for Pose-Invariant Face Recognition
17 0.074052654 105 iccv-2013-DeepFlow: Large Displacement Optical Flow with Deep Matching
18 0.072565146 24 iccv-2013-A Non-parametric Bayesian Network Prior of Human Pose
19 0.071690209 375 iccv-2013-Scene Collaging: Analysis and Synthesis of Natural Images with Semantic Layers
20 0.069027402 336 iccv-2013-Random Forests of Local Experts for Pedestrian Detection
topicId topicWeight
[(0, 0.146), (1, 0.007), (2, -0.023), (3, -0.034), (4, 0.029), (5, -0.08), (6, 0.013), (7, 0.079), (8, -0.033), (9, -0.018), (10, -0.024), (11, 0.009), (12, 0.038), (13, -0.059), (14, 0.004), (15, 0.037), (16, -0.017), (17, 0.044), (18, 0.141), (19, 0.183), (20, -0.047), (21, -0.001), (22, 0.028), (23, -0.074), (24, -0.18), (25, -0.028), (26, -0.019), (27, 0.077), (28, -0.086), (29, 0.114), (30, -0.052), (31, 0.105), (32, 0.134), (33, -0.032), (34, 0.036), (35, 0.064), (36, -0.077), (37, 0.105), (38, -0.005), (39, 0.053), (40, -0.074), (41, -0.067), (42, -0.04), (43, -0.019), (44, 0.028), (45, 0.141), (46, 0.089), (47, 0.028), (48, -0.051), (49, 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.9357332 311 iccv-2013-Pedestrian Parsing via Deep Decompositional Network
Author: Ping Luo, Xiaogang Wang, Xiaoou Tang
Abstract: We propose a new Deep Decompositional Network (DDN) for parsing pedestrian images into semantic regions, such as hair, head, body, arms, and legs, where the pedestrians can be heavily occluded. Unlike existing methods based on template matching or Bayesian inference, our approach directly maps low-level visual features to the label maps of body parts with DDN, which is able to accurately estimate complex pose variations with good robustness to occlusions and background clutters. DDN jointly estimates occluded regions and segments body parts by stacking three types of hidden layers: occlusion estimation layers, completion layers, and decomposition layers. The occlusion estimation layers estimate a binary mask, indicating which part of a pedestrian is invisible. The completion layers synthesize low-level features of the invisible part from the original features and the occlusion mask. The decomposition layers directly transform the synthesized visual features to label maps. We devise a new strategy to pre-train these hidden layers, and then fine-tune the entire network using stochastic gradient descent. Experimental results show that our approach achieves better segmentation accuracy than state-of-the-art methods on pedestrian images with or without occlusions. Another important contribution of this paper is that it provides a large-scale benchmark human parsing dataset1 that includes 3,673 annotated samples collected from 171 surveillance videos. It is 20 times larger than existing public datasets.
2 0.80124587 220 iccv-2013-Joint Deep Learning for Pedestrian Detection
Author: Wanli Ouyang, Xiaogang Wang
Abstract: Feature extraction, deformation handling, occlusion handling, and classification are four important components in pedestrian detection. Existing methods learn or design these components either individually or sequentially. The interaction among these components is not yet well explored. This paper proposes that they should be jointly learned in order to maximize their strengths through cooperation. We formulate these four components into a joint deep learning framework and propose a new deep network architecture1. By establishing automatic, mutual interaction among components, the deep model achieves a 9% reduction in the average miss rate compared with the current best-performing pedestrian detection approaches on the largest Caltech benchmark dataset.
3 0.74192387 279 iccv-2013-Multi-stage Contextual Deep Learning for Pedestrian Detection
Author: Xingyu Zeng, Wanli Ouyang, Xiaogang Wang
Abstract: Cascaded classifiers1 have been widely used in pedestrian detection and achieved great success. These classifiers are trained sequentially without joint optimization. In this paper, we propose a new deep model that can jointly train multi-stage classifiers through several stages of backpropagation. It keeps the score map output by a classifier within a local region and uses it as contextual information to support the decision at the next stage. Through a specific design of the training strategy, this deep architecture is able to simulate the cascaded classifiers by mining hard samples to train the network stage-by-stage. Each classifier handles samples at a different difficulty level. Unsupervised pre-training and specifically designed stage-wise supervised training are used to regularize the optimization problem. Both theoretical analysis and experimental results show that the training strategy helps to avoid overfitting. Experimental results on three datasets (Caltech, ETH and TUD-Brussels) show that our approach outperforms the state-of-the-art approaches.
4 0.59277481 351 iccv-2013-Restoring an Image Taken through a Window Covered with Dirt or Rain
Author: David Eigen, Dilip Krishnan, Rob Fergus
Abstract: Photographs taken through a window are often compromised by dirt or rain present on the window surface. Common cases of this include pictures taken from inside a vehicle, or outdoor security cameras mounted inside a protective enclosure. At capture time, defocus can be used to remove the artifacts, but this relies on achieving a shallow depth-of-field and placement of the camera close to the window. Instead, we present a post-capture image processing solution that can remove localized rain and dirt artifacts from a single image. We collect a dataset of clean/corrupted image pairs which are then used to train a specialized form of convolutional neural network. This learns how to map corrupted image patches to clean ones, implicitly capturing the characteristic appearance of dirt and water droplets in natural images. Our models demonstrate effective removal of dirt and rain in outdoor test conditions.
5 0.58617759 206 iccv-2013-Hybrid Deep Learning for Face Verification
Author: Yi Sun, Xiaogang Wang, Xiaoou Tang
Abstract: This paper proposes a hybrid convolutional network (ConvNet)-Restricted Boltzmann Machine (RBM) model for face verification in wild conditions. A key contribution of this work is to directly learn relational visual features, which indicate identity similarities, from raw pixels of face pairs with a hybrid deep network. The deep ConvNets in our model mimic the primary visual cortex to jointly extract local relational visual features from two face images compared with the learned filter pairs. These relational features are further processed through multiple layers to extract high-level and global features. Multiple groups of ConvNets are constructed in order to achieve robustness and characterize face similarities from different aspects. The top-layer RBM performs inference from complementary high-level features extracted from different ConvNet groups with a two-level average pooling hierarchy. The entire hybrid deep network is jointly fine-tuned to optimize for the task of face verification. Our model achieves competitive face verification performance on the LFW dataset.
6 0.56381667 151 iccv-2013-Exploiting Reflection Change for Automatic Reflection Removal
7 0.53248143 106 iccv-2013-Deep Learning Identity-Preserving Face Space
8 0.51148188 7 iccv-2013-A Deep Sum-Product Architecture for Robust Facial Attributes Analysis
9 0.47267151 211 iccv-2013-Image Segmentation with Cascaded Hierarchical Models and Logistic Disjunctive Normal Networks
10 0.46703029 420 iccv-2013-Topology-Constrained Layered Tracking with Latent Flow
11 0.43849954 306 iccv-2013-Paper Doll Parsing: Retrieving Similar Styles to Parse Clothing Items
12 0.40966725 269 iccv-2013-Modeling Occlusion by Discriminative AND-OR Structures
13 0.40681297 196 iccv-2013-Hierarchical Data-Driven Descent for Efficient Optimal Deformation Estimation
14 0.39873341 190 iccv-2013-Handling Occlusions with Franken-Classifiers
15 0.38869485 15 iccv-2013-A Generalized Low-Rank Appearance Model for Spatio-temporally Correlated Rain Streaks
16 0.37485117 8 iccv-2013-A Deformable Mixture Parsing Model with Parselets
18 0.35282665 75 iccv-2013-CoDeL: A Human Co-detection and Labeling Framework
19 0.34433156 336 iccv-2013-Random Forests of Local Experts for Pedestrian Detection
20 0.34242278 270 iccv-2013-Modeling Self-Occlusions in Dynamic Shape and Appearance Tracking
topicId topicWeight
[(2, 0.054), (7, 0.016), (26, 0.077), (31, 0.029), (34, 0.011), (35, 0.012), (42, 0.068), (48, 0.376), (64, 0.06), (67, 0.014), (73, 0.025), (78, 0.015), (89, 0.129)]
simIndex simValue paperId paperTitle
1 0.81544399 331 iccv-2013-Pyramid Coding for Functional Scene Element Recognition in Video Scenes
Author: Eran Swears, Anthony Hoogs, Kim Boyer
Abstract: Recognizing functional scene elements in video scenes based on the behaviors of moving objects that interact with them is an emerging problem of interest. Existing approaches have a limited ability to characterize elements such as cross-walks, intersections, and buildings that have low activity, are multi-modal, or have indirect evidence. Our approach recognizes the low-activity and multi-modal elements (crosswalks/intersections) by introducing a hierarchy of descriptive clusters to form a pyramid of codebooks that is sparse in the number of clusters and dense in content. The incorporation of local behavioral context such as person-enter-building and vehicle-parking nearby enables the detection of elements that do not have direct motion-based evidence, e.g. buildings. These two contributions significantly improve scene element recognition when compared against three state-of-the-art approaches. Results are shown on typical ground-level surveillance video and for the first time on the more complex Wide Area Motion Imagery.
same-paper 2 0.77938521 311 iccv-2013-Pedestrian Parsing via Deep Decompositional Network
Author: Ping Luo, Xiaogang Wang, Xiaoou Tang
Abstract: We propose a new Deep Decompositional Network (DDN) for parsing pedestrian images into semantic regions, such as hair, head, body, arms, and legs, where the pedestrians can be heavily occluded. Unlike existing methods based on template matching or Bayesian inference, our approach directly maps low-level visual features to the label maps of body parts with DDN, which is able to accurately estimate complex pose variations with good robustness to occlusions and background clutters. DDN jointly estimates occluded regions and segments body parts by stacking three types of hidden layers: occlusion estimation layers, completion layers, and decomposition layers. The occlusion estimation layers estimate a binary mask, indicating which part of a pedestrian is invisible. The completion layers synthesize low-level features of the invisible part from the original features and the occlusion mask. The decomposition layers directly transform the synthesized visual features to label maps. We devise a new strategy to pre-train these hidden layers, and then fine-tune the entire network using stochastic gradient descent. Experimental results show that our approach achieves better segmentation accuracy than state-of-the-art methods on pedestrian images with or without occlusions. Another important contribution of this paper is that it provides a large-scale benchmark human parsing dataset1 that includes 3,673 annotated samples collected from 171 surveillance videos. It is 20 times larger than existing public datasets.
3 0.74120981 63 iccv-2013-Bounded Labeling Function for Global Segmentation of Multi-part Objects with Geometric Constraints
Author: Masoud S. Nosrati, Shawn Andrews, Ghassan Hamarneh
Abstract: The inclusion of shape and appearance priors has proven useful for obtaining more accurate and plausible segmentations, especially for complex objects with multiple parts. In this paper, we augment the popular Mumford-Shah model to incorporate two important geometrical constraints, termed containment and detachment, between different regions with a specified minimum distance between their boundaries. (Figure 1: The inside vs. outside ambiguity in (a) is resolved by our containment constraint in (b).) Our method is able to handle multiple instances of multi-part objects defined by these geometrical constraints using a single labeling function while maintaining global optimality. We demonstrate the utility and advantages of these two constraints and show that the proposed convex continuous method is superior to other state-of-the-art methods, including its discrete counterpart, in terms of memory usage and metrication errors.
4 0.67900956 320 iccv-2013-Pose-Configurable Generic Tracking of Elongated Objects
Author: Daniel Wesierski, Patrick Horain
Abstract: Elongated objects have various shapes and can shift, rotate, change scale, and be rigid or deform by flexing, articulating, and vibrating, with examples as varied as a glass bottle, a robotic arm, a surgical suture, a finger pair, a tram, and a guitar string. This generally makes tracking of poses of elongated objects very challenging. We describe a unified, configurable framework for tracking the pose of elongated objects, which move in the image plane and extend over the image region. Our method strives for simplicity, versatility, and efficiency. The object is decomposed into a chained assembly of segments of multiple parts that are arranged under a hierarchy of tailored spatio-temporal constraints. In this hierarchy, segments can rescale independently while their elasticity is controlled with global orientations and local distances. While the trend in tracking is to design complex, structure-free algorithms that update object appearance online, we show that our tracker, with the novel but remarkably simple, structured organization of parts with constant appearance, reaches or improves state-of-the-art performance. Most importantly, our model can be easily configured to track exact pose of arbitrary, elongated objects in the image plane. The tracker can run up to 100 fps on a desktop PC, yet the computation time scales linearly with the number of object parts. To our knowledge, this is the first approach to generic tracking of elongated objects.
5 0.65197355 354 iccv-2013-Robust Dictionary Learning by Error Source Decomposition
Author: Zhuoyuan Chen, Ying Wu
Abstract: Sparsity models have recently shown great promise in many vision tasks. Using a learned dictionary in sparsity models can in general outperform predefined bases in clean data. In practice, both training and testing data may be corrupted and contain noises and outliers. Although recent studies attempted to cope with corrupted data and achieved encouraging results in testing phase, how to handle corruption in training phase still remains a very difficult problem. In contrast to most existing methods that learn the dictionary from clean data, this paper is targeted at handling corruptions and outliers in training data for dictionary learning. We propose a general method to decompose the reconstructive residual into two components: a non-sparse component for small universal noises and a sparse component for large outliers, respectively. In addition, further analysis reveals the connection between our approach and the "partial" dictionary learning approach, updating only part of the prototypes (or informative codewords) with remaining (or noisy codewords) fixed. Experiments on synthetic data as well as real applications have shown satisfactory performance of this new robust dictionary learning approach.
6 0.63861853 207 iccv-2013-Illuminant Chromaticity from Image Sequences
7 0.54429764 220 iccv-2013-Joint Deep Learning for Pedestrian Detection
8 0.52599764 279 iccv-2013-Multi-stage Contextual Deep Learning for Pedestrian Detection
9 0.51442993 7 iccv-2013-A Deep Sum-Product Architecture for Robust Facial Attributes Analysis
10 0.50036985 206 iccv-2013-Hybrid Deep Learning for Face Verification
11 0.49007297 106 iccv-2013-Deep Learning Identity-Preserving Face Space
12 0.48264325 5 iccv-2013-A Color Constancy Model with Double-Opponency Mechanisms
13 0.47871238 208 iccv-2013-Image Co-segmentation via Consistent Functional Maps
14 0.47521156 313 iccv-2013-Person Re-identification by Salience Matching
15 0.47423577 351 iccv-2013-Restoring an Image Taken through a Window Covered with Dirt or Rain
16 0.47171026 312 iccv-2013-Perceptual Fidelity Aware Mean Squared Error
17 0.47055298 61 iccv-2013-Beyond Hard Negative Mining: Efficient Detector Learning via Block-Circulant Decomposition
18 0.46804774 151 iccv-2013-Exploiting Reflection Change for Automatic Reflection Removal
19 0.4645592 95 iccv-2013-Cosegmentation and Cosketch by Unsupervised Learning
20 0.46253228 270 iccv-2013-Modeling Self-Occlusions in Dynamic Shape and Appearance Tracking