cvpr cvpr2013 cvpr2013-332 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Cheng Li, Kris M. Kitani
Abstract: We address the task of pixel-level hand detection in the context of ego-centric cameras. Extracting hand regions in ego-centric videos is a critical step for understanding handobject manipulation and analyzing hand-eye coordination. However, in contrast to traditional applications of hand detection, such as gesture interfaces or sign-language recognition, ego-centric videos present new challenges such as rapid changes in illuminations, significant camera motion and complex hand-object manipulations. To quantify the challenges and performance in this new domain, we present a fully labeled indoor/outdoor ego-centric hand detection benchmark dataset containing over 200 million labeled pixels, which contains hand images taken under various illumination conditions. Using both our dataset and a publicly available ego-centric indoors dataset, we give extensive analysis of detection performance using a wide range of local appearance features. Our analysis highlights the effectiveness of sparse features and the importance of modeling global illumination. We propose a modeling strategy based on our findings and show that our model outperforms several baseline approaches.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We address the task of pixel-level hand detection in the context of ego-centric cameras. [sent-4, score-0.315]
2 Extracting hand regions in ego-centric videos is a critical step for understanding handobject manipulation and analyzing hand-eye coordination. [sent-5, score-0.494]
3 However, in contrast to traditional applications of hand detection, such as gesture interfaces or sign-language recognition, ego-centric videos present new challenges such as rapid changes in illuminations, significant camera motion and complex hand-object manipulations. [sent-6, score-0.658]
4 To quantify the challenges and performance in this new domain, we present a fully labeled indoor/outdoor ego-centric hand detection benchmark dataset containing over 200 million labeled pixels, which contains hand images taken under various illumination conditions. [sent-7, score-0.947]
5 Using both our dataset and a publicly available ego-centric indoors dataset, we give extensive analysis of detection performance using a wide range of local appearance features. [sent-8, score-0.365]
6 Introduction In this work we focus on the task of pixel-wise hand detection from video recorded with a wearable head-mounted camera. [sent-12, score-0.525]
7 Recently, the use of ego-centric video is re-emerging as a popular topic in computer vision and has shown promising results in such areas as understanding hand-eye coordination [5] and recognizing activities of daily living [17]. [sent-14, score-0.289]
8 In order to achieve more detailed models of human interaction and object manipulation, it is important to detect hand regions with pixel-level accuracy. [sent-15, score-0.356]
9 Hand detection is an important element of such tasks as gesture recognition, hand tracking, grasp recognition, action recognition and understanding hand-object interactions. [sent-21, score-0.428]
10 In contrast to previous work on hand detection, the egocentric paradigm presents a new set of constraints and characteristics that introduce new challenges as well as unique properties that can be exploited for the task of first-person hand detection. [sent-22, score-0.71]
11 As a result, the large image displacement caused by body motion makes it very difficult to apply traditional image stabilization or background subtraction techniques. [sent-25, score-0.324]
12 Similarly, large changes in illumination conditions induce large fluctuations in the appearance of hands. [sent-26, score-0.407]
13 Fortunately, ego-centric videos also have the property of being user-specific, where images of hands and the physical world are always acquired with the same camera for the same user. [sent-27, score-0.421]
14 This implies that the intrinsic color of the hands does not change drastically over time. [sent-28, score-0.426]
15 The purpose of this work is to identify and address the challenges of hand detection for first-person vision. [sent-29, score-0.358]
16 To this end, we present a dataset of over 600 hand images taken under various illumination conditions and against different backgrounds (Figure 1). [sent-30, score-0.468]
17 Using this dataset and a publicly available indoor ego-centric dataset, we perform extensive tests to highlight the pros and cons of various widely-used local appearance features. [sent-32, score-0.268]
18 We evaluate the value of modeling global illumination to generate an ensemble of hand region detectors conditioned on the illumination conditions of the scene. [sent-33, score-0.81]
19 Based on our findings, we propose a model using sparse feature selection and an illumination-dependent modeling strategy, and show that it outperforms several baseline approaches. [sent-34, score-0.325]
20 Related Work We give a review of work that aims to generate pixel-level detections of hand regions from moving cameras. [sent-36, score-0.382]
21 Approaches for detecting hand regions can be roughly divided into three categories: (1) local appearance-based detection, (2) global appearance-based detection and (3) motion-based detection. [sent-37, score-0.523]
22 In many scenarios, local color is a simple yet strong feature for extracting hand regions and is the most classical approach for detecting skin-color regions [9]. [sent-38, score-0.745]
23 Their approach was shown to be effective for extracting skin regions in internet images. [sent-40, score-0.351]
24 Color models have also been combined with trackers to take into account both the static and dynamic appearance of skin [16, 23, 1, 11]. [sent-41, score-0.392]
25 Global appearance-based models detect hands using a global hand template, where dense or sparse hand templates are generated from a database of 2D images [25] or 2D projections of a 3D hand model [18, 24, 15]. [sent-42, score-1.199]
26 However, when hands must be detected in various configurations, this approach usually requires searching a very large space, and a tracking framework may be needed to constrain the search. [sent-44, score-0.345]
27 Motion-based approaches explicitly take into account the ego-motion of the camera by assuming that hands (foreground) and the background have different motion or appearance statistics. [sent-45, score-0.597]
28 However, since there is no explicit modeling of the hand, objects being handled by hands are often detected as foreground. [sent-47, score-0.341]
29 When there is no hand motion or camera motion, there is no way to disambiguate the foreground from the background. [sent-48, score-0.562]
30 Methods that attempt to model moving backgrounds are effective when camera motion is limited and video stabilization methods can be used to apply classical background modeling techniques [7, 6]. [sent-49, score-0.52]
31 In the greater context of activity analysis for ego-centric vision, the task of extracting hand regions with pixel-level accuracy will be a critical preprocessing step for many high-level tasks. [sent-51, score-0.41]
32 Modeling Hand Appearance We are interested in understanding how local appearance and global illumination should be modeled to effectively detect hand regions over a diverse set of imaging conditions. [sent-55, score-0.796]
33 To this end we evaluate a pool of widely used local appearance features, to understand how different features affect detection performance. [sent-56, score-0.35]
34 We also examine the use of global appearance features as a means of representing changes in global illumination. [sent-57, score-0.456]
35 Local Appearance Features Color is a strong feature for detecting skin and has been the feature of choice for a majority of previous work [9]. [sent-61, score-0.461]
36 Here we evaluate the RGB, HSV and LAB colorspaces, which have been shown to be robust for skin color detection. [sent-62, score-0.622]
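As a rough illustration of how such per-pixel color features could be assembled, here is a minimal sketch assuming OpenCV and NumPy; the function name and feature layout are ours, not from the paper:

import cv2
import numpy as np

def pixel_color_features(bgr_image):
    # Stack RGB, HSV and LAB values so each pixel gets a 9-dimensional color feature.
    rgb = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2RGB)
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    lab = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LAB)
    stacked = np.concatenate([rgb, hsv, lab], axis=2)
    return stacked.reshape(-1, 9).astype(np.float32)  # one row per pixel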
37 In contrast to previous work [8] that uses only single-pixel color features, we are interested in understanding how local color information (the color of pixels surrounding the pixel under evaluation) contributes to detection performance. [sent-63, score-0.587]
38 We use the response of a bank of 48 Gabor filters (8 orientations, 3 scales, both real and imaginary components) to examine how local texture affects the discriminability of skin color regions. [sent-64, score-0.571]
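A minimal sketch of such a Gabor bank using OpenCV follows; the kernel size, sigma and wavelength values are illustrative assumptions, not the paper's settings:

import cv2
import numpy as np

def gabor_bank_responses(gray, n_orientations=8, wavelengths=(4, 8, 16)):
    # 8 orientations x 3 scales x {real, imaginary} = 48 response maps per pixel.
    responses = []
    for k in range(n_orientations):
        theta = k * np.pi / n_orientations
        for lambd in wavelengths:
            for psi in (0.0, np.pi / 2):  # psi=0 -> real part, psi=pi/2 -> imaginary part
                kernel = cv2.getGaborKernel((21, 21), sigma=0.5 * lambd, theta=theta,
                                            lambd=lambd, gamma=0.5, psi=psi)
                responses.append(cv2.filter2D(gray, cv2.CV_32F, kernel))
    return np.stack(responses, axis=-1)  # shape (H, W, 48)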
39 A typical limitation of color-based skin detection approaches is the difficulty of discriminating against objects that share a similar color distribution with skin. [sent-65, score-0.465]
40 Figure 2 shows a visualization of the color feature space and the color+texture feature space for selected portions of the image (Figure 2a, image regions). [sent-66, score-0.438]
41 Skin features in red and the desk features in blue. [sent-68, score-0.308]
42 Pixel features extracted from a portion of the hand (marked in red) and a portion of the desk (marked in blue) are visualized in 2D. [sent-71, score-0.572]
43 Notice how the pixel features extracted from the hand and the desk overlap completely in the color space (Figure 2b). [sent-72, score-0.708]
44 However, by concatenating the response of 32 Gabor filters (4 orientations, 4 scales) to the color feature, we can see that the visualization of the feature space shows better separation between pixels of the hand and pixels of the desk (Figure 2c). [sent-73, score-0.789]
45 This visualization suggests that low-level texture can help to disambiguate between hands and other similarly colored objects. [sent-74, score-0.5]
46 We expect that these gradient histogram descriptors will capture local contours of hands and also encode typical background appearance to help improve classification performance. [sent-77, score-0.669]
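One plausible way to attach such a gradient histogram descriptor to a pixel is to compute HOG on the patch around it; this is a sketch assuming scikit-image, with illustrative patch and cell sizes:

from skimage.feature import hog

def patch_hog_descriptor(gray, y, x, patch=32):
    # Gradient histogram descriptor for the patch centered on pixel (y, x).
    half = patch // 2
    window = gray[y - half:y + half, x - half:x + half]
    return hog(window, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), feature_vector=True)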
47 Binary tests randomly selected from small local image patches indirectly encode texture and gradients, and have been proposed as a more efficient way of encoding local appearance similar to SIFT descriptors. [sent-78, score-0.328]
48 We evaluate the 16 dimensional BRIEF [3] descriptor and a 32 dimensional ORB [20] descriptor to measure relative performance with respect to the task of hand region detection. [sent-79, score-0.47]
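A sketch of extracting a binary descriptor at chosen pixel locations with OpenCV's ORB; ORB stands in here for both binary descriptors (BRIEF itself requires the opencv-contrib package), and the patch size is an illustrative choice:

import cv2

def orb_descriptors_at_pixels(gray, pixel_coords, patch_size=31.0):
    # One 32-byte binary descriptor per requested (row, col) location.
    orb = cv2.ORB_create()
    keypoints = [cv2.KeyPoint(float(c), float(r), patch_size) for (r, c) in pixel_coords]
    keypoints, descriptors = orb.compute(gray, keypoints)
    return keypoints, descriptors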
49 The use of small clusters of pixels, better known as superpixels, is a preprocessing step used for tasks such as image segmentation and appearance modeling for tracking [19, 28]. [sent-80, score-0.295]
50 The color descriptor is the mean and covariance of the HSV values within a superpixel (3+6 dimensions). [sent-85, score-0.354]
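A minimal sketch of that superpixel descriptor, assuming scikit-image SLIC for the superpixels; the segment count and compactness are illustrative:

import cv2
import numpy as np
from skimage.segmentation import slic

def superpixel_color_descriptors(bgr_image, n_segments=400):
    # Per-superpixel descriptor: HSV mean (3 dims) + upper-triangular HSV covariance (6 dims).
    rgb = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2RGB)
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV).reshape(-1, 3).astype(np.float64)
    labels = slic(rgb, n_segments=n_segments, compactness=10).reshape(-1)
    iu = np.triu_indices(3)
    descriptors = {}
    for label in np.unique(labels):
        values = hsv[labels == label]
        cov = np.cov(values, rowvar=False) if len(values) > 1 else np.zeros((3, 3))
        descriptors[label] = np.concatenate([values.mean(axis=0), cov[iu]])
    return descriptors  # 9-dimensional descriptor per superpixel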
51 Global Appearance Modeling Using a single hand detector to take into account the wide range of illumination variation and its effect on hand appearance is very challenging. [sent-91, score-0.869]
52 The visualization shows the large variance in hand appearance across changes in illumination. [sent-93, score-0.529]
53 The posterior distribution of a pixel x given a local appearance feature l and a global appearance feature g is computed by marginalizing over different scenes c: p(x|l, g) = Σ_c p(x|l, c) p(c|g). [sent-97, score-0.594]
54 Here p(x|l, c) is the output of a discriminative global appearance-specific regressor and p(c|g) is a conditional distribution over scenes c given a global appearance feature g. [sent-99, score-0.381]
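A sketch of this marginalization at test time; the names are ours, with scene_posterior holding p(c|g) for each cluster and scene_regressors the per-scene regressors:

import numpy as np

def hand_posterior(local_features, scene_posterior, scene_regressors):
    # p(x | l, g) = sum over scenes c of p(x | l, c) * p(c | g).
    posterior = np.zeros(len(local_features))
    for c, regressor in enumerate(scene_regressors):
        posterior += scene_posterior[c] * regressor.predict(local_features)
    return posterior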
55 Different global appearance models are learned using k-means clustering on the HSV histogram of each training image, and a separate random tree regressor is learned for each cluster. [sent-100, score-0.315]
56 By using a histogram over all three channels of the HSV colorspace, each scene cluster encodes both the appearance of the scene and the illumination of the scene. [sent-101, score-0.302]
57 Intuitively, we are modeling the fact that hands viewed under similar global appearance will share a similar distribution in the feature space; the scene clusters are learned by clustering image histograms. [sent-102, score-0.546]
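A sketch of the corresponding training side under these assumptions; scikit-learn KMeans and RandomForestRegressor stand in for the clustering and the random tree regressors, and the histogram binning and cluster count are illustrative:

import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor

def global_hsv_histogram(bgr_image, bins=16):
    # Concatenated H, S and V histograms serve as the global appearance feature g.
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    hist = np.concatenate([np.histogram(hsv[..., ch], bins=bins, range=(0, 256))[0]
                           for ch in range(3)]).astype(np.float64)
    return hist / hist.sum()

def train_scene_models(images, pixel_features, pixel_labels, k=10):
    # Cluster training images by global appearance, then fit one regressor per cluster.
    histograms = np.stack([global_hsv_histogram(img) for img in images])
    kmeans = KMeans(n_clusters=k).fit(histograms)
    regressors = []
    for c in range(k):
        idx = np.where(kmeans.labels_ == c)[0]
        X = np.concatenate([pixel_features[i] for i in idx])
        y = np.concatenate([pixel_labels[i] for i in idx])
        regressors.append(RandomForestRegressor(n_estimators=10).fit(X, y))
    return kmeans, regressors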
58 We evaluate (1) different patch sizes for color features, (2) feature selection over feature modality and (3) feature selection over sparse descriptor elements. [sent-108, score-0.733]
59 Second, we show how learning a collection of classifiers indexed by different scene models can increase robustness to changes in illumination and significantly improve performance. [sent-109, score-0.316]
60 We generated two datasets to evaluate the robustness of our system to extreme changes in illumination and mild camera motion induced by walking and climbing stairs. [sent-114, score-0.46]
61 Both hands are purposefully extended outwards for the entire duration of the video to capture the change in skin color under varying illumination (direct sunlight, office lights, staircases, shadows, etc.). [sent-117, score-0.92]
62 An additional video was taken in a kitchenette area which we denote as EDSH-kitchen, which features large amounts of ego-motion and hand deformations induced by the activity of making tea. [sent-120, score-0.508]
63 We also compare our approach on the publicly available Georgia Tech Egocentric Activity (GTEA) dataset [6]. [sent-124, score-0.294]
64 We used the foreground hand masks available on the project site to compute scores for their proposed hand detection algorithm. [sent-125, score-0.635]
65 The labeled portion of the GTEA dataset includes results for a single user performing three activities, which include making tea, making a peanut butter sandwich and making coffee. [sent-126, score-0.272]
66 The coffee sequence was used for training when testing on the tea and peanut butter sequences, and the tea sequence was used for training when testing on the coffee sequence. [sent-131, score-0.442]
67 Evaluating Local Color Features In this experiment we examine the effects of increasing the spatial extent of color descriptors to detect hand regions. [sent-143, score-0.501]
68 We extend the spatial extent of the color feature by vectorizing an m × m pixel patch of color values to encode local color information. [sent-144, score-0.535]
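A sketch of the patch vectorization; the patch size m and the feature layout are illustrative:

import numpy as np

def patch_color_feature(color_image, y, x, m=5):
    # Vectorize the m x m neighborhood of color values centered on pixel (y, x).
    half = m // 2
    patch = color_image[y - half:y + half + 1, x - half:x + half + 1]
    return patch.reshape(-1).astype(np.float32)  # m * m * channels dimensions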
69 Our results show that when color is the only feature type, modeling only a single pixel [8] does not always yield the best performance. [sent-147, score-0.37]
70 , pixels surrounded by more skin-like pixels) should help to disambiguate hand regions. [sent-152, score-0.396]
71 Feature Performance over Modality In this experiment we analyze the discriminative power of each feature modality using a forward feature selection criterion. [sent-155, score-0.287]
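A sketch of greedy forward selection over feature modalities; the scoring callback, regressor choice and round count are assumptions, and feature_blocks maps a modality name to its per-pixel feature matrix:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def forward_selection(feature_blocks, labels, score_fn, n_rounds=10):
    # Each round adds the modality whose inclusion most improves the score.
    selected, remaining = [], list(feature_blocks.keys())
    for _ in range(min(n_rounds, len(remaining))):
        best_name, best_score = None, -np.inf
        for name in remaining:
            X = np.hstack([feature_blocks[k] for k in selected + [name]])
            score = score_fn(RandomForestRegressor(n_estimators=10), X, labels)
            if score > best_score:
                best_name, best_score = name, score
        selected.append(best_name)
        remaining.remove(best_name)
    return selected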
72 Based on previous work we expect that color will play an influential role in detecting skin regions but we are also interested in how texture, gradient features and superpixel statistics can contribute to performance. [sent-157, score-0.743]
73 We see that high-order gradient features such as HOG and BRIEF are added after color features and enable an increase in overall performance later in the pipeline. [sent-163, score-0.418]
74 We also observed a small initial dip in performance (Kitchen dataset) as more color features were added into the feature pool. [sent-167, score-0.319]
75 The increase in the dimensionality of the color features initially causes over-fitting but the effect is counterbalanced (a mechanism of the RF learning algorithm) as more texture features are added to the pool. [sent-168, score-0.446]
76 Figure 5 shows the results of feature selection by selecting a single element at a time from the pool of all 498 local appearance feature dimensions. [sent-177, score-0.44]
77 Notice that the number of LAB and HSV features continues to increase after other texture and gradient features are added. [sent-184, score-0.336]
78 This indicates that local color information is discriminative when used together with texture and gradient features. [sent-185, score-0.328]
79 Other higher order gradient features are added between the 16th and 32nd iterations of the feature selection process. [sent-188, score-0.268]
80 When the dimensionality of the feature is extended to 100, we observe that SIFT and HSV features are aggressively selected because they help to disambiguate the more difficult cases. [sent-191, score-0.334]
81 This reconfirms our earlier finding that higher-order gradient features are more discriminative when coupled with color features. [sent-192, score-0.29]
82 Number of Global Appearance Models The appearance of the hands changes dramatically depending on the illumination of the scene, as can be seen in Figure 3. [sent-196, score-0.633]
83 Although we expect that the optimal number of scene clusters k will vary depending on the statistics of the dataset, we have gained an important insight that it is better to have multiple models of skin conditioned on the imaging conditions to achieve more robust performance. [sent-203, score-0.383]
84 We compare against four baseline approaches: (1) a single-pixel color approach inspired by [8], (2) a video stabilization approach inspired by [7] based on background modeling using affine alignment of image frames, (3) foreground modeling using the feature trajectory-based projection of Sheikh et al. [sent-207, score-0.816]
85 The single-pixel color classifier is a random regressor trained only on single-pixel LAB color values. [sent-210, score-0.432]
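A sketch of that baseline; a random forest regressor is used here as a stand-in for the paper's regressor, and masks are assumed to be binary hand labels:

import cv2
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def train_single_pixel_baseline(bgr_images, hand_masks):
    # Regress hand probability from the LAB value of each pixel alone.
    X = np.concatenate([cv2.cvtColor(img, cv2.COLOR_BGR2LAB).reshape(-1, 3)
                        for img in bgr_images]).astype(np.float32)
    y = np.concatenate([mask.reshape(-1) for mask in hand_masks]).astype(np.float32)
    return RandomForestRegressor(n_estimators=10).fit(X, y)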
86 After alignment, pixels with high variance are considered to be foreground hand regions. [sent-212, score-0.356]
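A sketch of such a variance test after affine stabilization; the tracker parameters and threshold are illustrative, and this approximates the baseline described above rather than reproducing the cited implementation:

import cv2
import numpy as np

def variance_foreground(gray_frames, threshold=200.0):
    # Warp every frame onto the first one with an affine fit to KLT tracks,
    # then flag pixels whose temporal variance stays high after alignment.
    reference = gray_frames[0]
    aligned = [reference.astype(np.float32)]
    for frame in gray_frames[1:]:
        pts = cv2.goodFeaturesToTrack(frame, maxCorners=500, qualityLevel=0.01, minDistance=7)
        matched, status, _ = cv2.calcOpticalFlowPyrLK(frame, reference, pts, None)
        good = status.reshape(-1) == 1
        M, _ = cv2.estimateAffinePartial2D(pts[good], matched[good])
        warped = cv2.warpAffine(frame, M, (reference.shape[1], reference.shape[0]))
        aligned.append(warped.astype(np.float32))
    variance = np.var(np.stack(aligned), axis=0)
    return variance > threshold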
87 The video stabilization approach for background modeling is sensitive to ego-motion and therefore performs better on the GTEA dataset and worse on the EDSH dataset, which contains significant ego-motion. [sent-222, score-0.415]
88 Our approach generates stable detections around hand regions but also occasionally misclassifies small regions such as red cups and wooden table-tops as hands. [sent-227, score-0.429]
89 post-processing, sparse features and scene-specific modeling) improves over a single color pixel approach score of 0. [sent-232, score-0.333]
90 On average, compared to a single color feature approach, our approach yields a 15% increase in performance over all datasets. [sent-235, score-0.291]
91 Conclusion We have presented a thorough analysis of local appearance features for detecting hand regions. [sent-237, score-0.566]
92 Our results have shown that using a sparse set of features improves the robustness of our approach and we have also shown that global appearance models can be used to adapt our detectors to changes in illumination (a prevalent phenomenon in wearable cameras). [sent-238, score-0.628]
93 Our experiments have shown that a sparse 50-dimensional combination of color, texture and gradient histogram features can be used to accurately detect hands over varying illumination and hand poses. [sent-239, score-1.018]
94 We have also shown that scene-specific illumination models are necessary to deal with large changes in illumination. [sent-240, score-0.305]
95 parts of the scene and hands become pure white), very dark scenes, and high contrast cast shadows, is very challenging for local appearance based approaches (Figure 8). [sent-248, score-0.441]
96 This work has shown that hand region pixels can be detected with reasonable confidence for a wide range of illumination changes and hand deformations. [sent-258, score-0.832]
97 Based on the findings of this work, we believe that our proposed pixel-level detection approach can be used to enable a variety of higher-level tasks such as hand tracking, gesture recognition, action recognition and manipulation analysis for first-person vision. [sent-259, score-0.531]
98 Statistical color models with application to skin detection. [sent-305, score-0.414]
99 Detecting activities of daily living in first-person camera views. [sent-355, score-0.271]
100 Visual tracking of high DOF articulated structures: an application to human hand tracking. [sent-360, score-0.344]
wordName wordTfidf (topN-words)
[('hands', 0.265), ('hand', 0.264), ('skin', 0.253), ('gtea', 0.208), ('hsv', 0.205), ('stabilization', 0.173), ('illumination', 0.163), ('color', 0.161), ('desk', 0.152), ('appearance', 0.139), ('egocentric', 0.139), ('lab', 0.13), ('gabor', 0.11), ('regressor', 0.11), ('colorspaces', 0.104), ('disambiguate', 0.096), ('camera', 0.095), ('tea', 0.091), ('orb', 0.083), ('tracking', 0.08), ('feature', 0.08), ('texture', 0.079), ('super', 0.079), ('video', 0.078), ('butter', 0.078), ('peanut', 0.078), ('features', 0.078), ('gesture', 0.078), ('manipulation', 0.077), ('modeling', 0.076), ('fathi', 0.075), ('wearable', 0.075), ('activities', 0.073), ('baseline', 0.069), ('modality', 0.068), ('sheikh', 0.067), ('global', 0.066), ('brief', 0.066), ('changes', 0.066), ('videos', 0.061), ('daily', 0.061), ('descriptor', 0.061), ('pixellevel', 0.061), ('plateaus', 0.061), ('visualization', 0.06), ('selection', 0.059), ('indoors', 0.058), ('klt', 0.058), ('regions', 0.057), ('recorded', 0.057), ('foreground', 0.056), ('tsinghua', 0.055), ('rehg', 0.055), ('subtraction', 0.053), ('pixel', 0.053), ('expect', 0.052), ('coffee', 0.052), ('indoor', 0.051), ('gradient', 0.051), ('motion', 0.051), ('detection', 0.051), ('increase', 0.05), ('million', 0.049), ('gpb', 0.049), ('detecting', 0.048), ('activity', 0.048), ('background', 0.047), ('kitchen', 0.047), ('extreme', 0.045), ('pool', 0.045), ('patch', 0.044), ('carnegie', 0.044), ('mellon', 0.043), ('challenges', 0.043), ('superpixel', 0.043), ('contours', 0.042), ('living', 0.042), ('dimensional', 0.042), ('sparse', 0.041), ('examine', 0.041), ('outdoor', 0.041), ('dataset', 0.041), ('extracting', 0.041), ('comparative', 0.041), ('induced', 0.04), ('wide', 0.039), ('conditions', 0.039), ('sift', 0.039), ('jones', 0.039), ('conditioned', 0.039), ('portion', 0.039), ('dimensions', 0.039), ('local', 0.037), ('indexed', 0.037), ('encode', 0.036), ('labeled', 0.036), ('pixels', 0.036), ('detect', 0.035), ('particle', 0.035), ('understanding', 0.035)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 332 cvpr-2013-Pixel-Level Hand Detection in Ego-centric Videos
Author: Cheng Li, Kris M. Kitani
Abstract: We address the task of pixel-level hand detection in the context of ego-centric cameras. Extracting hand regions in ego-centric videos is a critical step for understanding handobject manipulation and analyzing hand-eye coordination. However, in contrast to traditional applications of hand detection, such as gesture interfaces or sign-language recognition, ego-centric videos present new challenges such as rapid changes in illuminations, significant camera motion and complex hand-object manipulations. To quantify the challenges and performance in this new domain, we present a fully labeled indoor/outdoor ego-centric hand detection benchmark dataset containing over 200 million labeled pixels, which contains hand images taken under various illumination conditions. Using both our dataset and a publicly available ego-centric indoors dataset, we give extensive analysis of detection performance using a wide range of local appearance features. Our analysis highlights the effectiveness of sparse features and the importance of modeling global illumination. We propose a modeling strategy based on our findings and show that our model outperforms several baseline approaches.
2 0.1873181 158 cvpr-2013-Exploring Weak Stabilization for Motion Feature Extraction
Author: Dennis Park, C. Lawrence Zitnick, Deva Ramanan, Piotr Dollár
Abstract: We describe novel but simple motion features for the problem of detecting objects in video sequences. Previous approaches either compute optical flow or temporal differences on video frame pairs with various assumptions about stabilization. We describe a combined approach that uses coarse-scale flow and fine-scale temporal difference features. Our approach performs weak motion stabilization by factoring out camera motion and coarse object motion while preserving nonrigid motions that serve as useful cues for recognition. We show results for pedestrian detection and human pose estimation in video sequences, achieving state-of-the-art results in both. In particular, given a fixed detection rate our method achieves a five-fold reduction in false positives over prior art on the Caltech Pedestrian benchmark. Finally, we perform extensive diagnostic experiments to reveal what aspects of our system are crucial for good performance. Proper stabilization, long time-scale features, and proper normalization are all critical.
3 0.15960576 333 cvpr-2013-Plane-Based Content Preserving Warps for Video Stabilization
Author: Zihan Zhou, Hailin Jin, Yi Ma
Abstract: Recently, a new image deformation technique called content-preserving warping (CPW) has been successfully employed to produce the state-of-the-art video stabilization results in many challenging cases. The key insight of CPW is that the true image deformation due to viewpoint change can be well approximated by a carefully constructed warp using a set of sparsely constructed 3D points only. However, since CPW solely relies on the tracked feature points to guide the warping, it works poorly in large textureless regions, such as ground and building interiors. To overcome this limitation, in this paper we present a hybrid approach for novel view synthesis, observing that the textureless regions often correspond to large planar surfaces in the scene. Particularly, given a jittery video, we first segment each frame into piecewise planar regions as well as regions labeled as non-planar using Markov random fields. Then, a new warp is computed by estimating a single homography for regions belong to the same plane, while in- heriting results from CPW in the non-planar regions. We demonstrate how the segmentation information can be efficiently obtained and seamlessly integrated into the stabilization framework. Experimental results on a variety of real video sequences verify the effectiveness of our method.
4 0.15815909 331 cvpr-2013-Physically Plausible 3D Scene Tracking: The Single Actor Hypothesis
Author: Nikolaos Kyriazis, Antonis Argyros
Abstract: In several hand-object(s) interaction scenarios, the change in the objects ’ state is a direct consequence of the hand’s motion. This has a straightforward representation in Newtonian dynamics. We present the first approach that exploits this observation to perform model-based 3D tracking of a table-top scene comprising passive objects and an active hand. Our forward modelling of 3D hand-object(s) interaction regards both the appearance and the physical state of the scene and is parameterized over the hand motion (26 DoFs) between two successive instants in time. We demonstrate that our approach manages to track the 3D pose of all objects and the 3D pose and articulation of the hand by only searching for the parameters of the hand motion. In the proposed framework, covert scene state is inferred by connecting it to the overt state, through the incorporation of physics. Thus, our tracking approach treats a variety of challenging observability issues in a principled manner, without the need to resort to heuristics.
5 0.15467992 175 cvpr-2013-First-Person Activity Recognition: What Are They Doing to Me?
Author: Michael S. Ryoo, Larry Matthies
Abstract: This paper discusses the problem of recognizing interaction-level human activities from a first-person viewpoint. The goal is to enable an observer (e.g., a robot or a wearable camera) to understand ‘what activity others are performing to it’ from continuous video inputs. These include friendly interactions such as ‘a person hugging the observer’ as well as hostile interactions like ‘punching the observer’ or ‘throwing objects to the observer’, whose videos involve a large amount of camera ego-motion caused by physical interactions. The paper investigates multichannel kernels to integrate global and local motion information, and presents a new activity learning/recognition methodology that explicitly considers temporal structures displayed in first-person activity videos. In our experiments, we not only show classification results with segmented videos, but also confirm that our new approach is able to detect activities from continuous videos reliably.
6 0.1459516 287 cvpr-2013-Modeling Actions through State Changes
7 0.14291735 457 cvpr-2013-Visual Tracking via Locality Sensitive Histograms
8 0.14285065 217 cvpr-2013-Improving an Object Detector and Extracting Regions Using Superpixels
9 0.12187675 388 cvpr-2013-Semi-supervised Learning of Feature Hierarchies for Object Detection in a Video
10 0.12071275 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities
11 0.12000628 123 cvpr-2013-Detection of Manipulation Action Consequences (MAC)
13 0.11554947 386 cvpr-2013-Self-Paced Learning for Long-Term Tracking
14 0.10900336 148 cvpr-2013-Ensemble Video Object Cut in Highly Dynamic Scenes
15 0.10896882 187 cvpr-2013-Geometric Context from Videos
16 0.10889945 149 cvpr-2013-Evaluation of Color STIPs for Human Action Recognition
17 0.10886148 380 cvpr-2013-Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images
18 0.10687713 94 cvpr-2013-Context-Aware Modeling and Recognition of Activities in Video
19 0.10675165 314 cvpr-2013-Online Object Tracking: A Benchmark
20 0.10656799 378 cvpr-2013-Sampling Strategies for Real-Time Action Recognition
topicId topicWeight
[(0, 0.296), (1, 0.004), (2, 0.016), (3, -0.082), (4, -0.069), (5, -0.024), (6, 0.019), (7, -0.023), (8, 0.006), (9, 0.041), (10, 0.018), (11, -0.098), (12, 0.057), (13, 0.025), (14, 0.066), (15, -0.006), (16, 0.032), (17, -0.021), (18, -0.002), (19, 0.025), (20, 0.085), (21, 0.078), (22, -0.021), (23, -0.097), (24, -0.088), (25, -0.021), (26, 0.009), (27, 0.08), (28, -0.006), (29, 0.04), (30, -0.034), (31, 0.05), (32, -0.035), (33, -0.016), (34, 0.018), (35, -0.005), (36, -0.084), (37, 0.112), (38, -0.046), (39, 0.052), (40, -0.042), (41, 0.05), (42, 0.026), (43, -0.035), (44, 0.015), (45, -0.052), (46, 0.039), (47, 0.013), (48, -0.025), (49, 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.96539038 332 cvpr-2013-Pixel-Level Hand Detection in Ego-centric Videos
Author: Cheng Li, Kris M. Kitani
Abstract: We address the task of pixel-level hand detection in the context of ego-centric cameras. Extracting hand regions in ego-centric videos is a critical step for understanding handobject manipulation and analyzing hand-eye coordination. However, in contrast to traditional applications of hand detection, such as gesture interfaces or sign-language recognition, ego-centric videos present new challenges such as rapid changes in illuminations, significant camera motion and complex hand-object manipulations. To quantify the challenges and performance in this new domain, we present a fully labeled indoor/outdoor ego-centric hand detection benchmark dataset containing over 200 million labeled pixels, which contains hand images taken under various illumination conditions. Using both our dataset and a publicly available ego-centric indoors dataset, we give extensive analysis of detection performance using a wide range of local appearance features. Our analysis highlights the effectiveness of sparse features and the importance of modeling global illumination. We propose a modeling strategy based on our findings and show that our model outperforms several baseline approaches.
2 0.72199357 313 cvpr-2013-Online Dominant and Anomalous Behavior Detection in Videos
Author: Mehrsan Javan Roshtkhari, Martin D. Levine
Abstract: We present a novel approach for video parsing and simultaneous online learning of dominant and anomalous behaviors in surveillance videos. Dominant behaviors are those occurring frequently in videos and hence, usually do not attract much attention. They can be characterized by different complexities in space and time, ranging from a scene background to human activities. In contrast, an anomalous behavior is defined as having a low likelihood of occurrence. We do not employ any models of the entities in the scene in order to detect these two kinds of behaviors. In this paper, video events are learnt at each pixel without supervision using densely constructed spatio-temporal video volumes. Furthermore, the volumes are organized into large contextual graphs. These compositions are employed to construct a hierarchical codebook model for the dominant behaviors. By decomposing spatio-temporal contextual information into unique spatial and temporal contexts, the proposed framework learns the models of the dominant spatial and temporal events. Thus, it is ultimately capable of simultaneously modeling high-level behaviors as well as low-level spatial, temporal and spatio-temporal pixel level changes.
3 0.71364075 333 cvpr-2013-Plane-Based Content Preserving Warps for Video Stabilization
Author: Zihan Zhou, Hailin Jin, Yi Ma
Abstract: Recently, a new image deformation technique called content-preserving warping (CPW) has been successfully employed to produce the state-of-the-art video stabilization results in many challenging cases. The key insight of CPW is that the true image deformation due to viewpoint change can be well approximated by a carefully constructed warp using a set of sparsely constructed 3D points only. However, since CPW solely relies on the tracked feature points to guide the warping, it works poorly in large textureless regions, such as ground and building interiors. To overcome this limitation, in this paper we present a hybrid approach for novel view synthesis, observing that the textureless regions often correspond to large planar surfaces in the scene. Particularly, given a jittery video, we first segment each frame into piecewise planar regions as well as regions labeled as non-planar using Markov random fields. Then, a new warp is computed by estimating a single homography for regions belong to the same plane, while in- heriting results from CPW in the non-planar regions. We demonstrate how the segmentation information can be efficiently obtained and seamlessly integrated into the stabilization framework. Experimental results on a variety of real video sequences verify the effectiveness of our method.
4 0.70708632 210 cvpr-2013-Illumination Estimation Based on Bilayer Sparse Coding
Author: Bing Li, Weihua Xiong, Weiming Hu, Houwen Peng
Abstract: Computational color constancy is a very important topic in computer vision and has attracted many researchers ’ attention. Recently, lots of research has shown the effects of using high level visual content cues for improving illumination estimation. However, nearly all the existing methods are essentially combinational strategies in which image ’s content analysis is only used to guide the combination or selection from a variety of individual illumination estimation methods. In this paper, we propose a novel bilayer sparse coding model for illumination estimation that considers image similarity in terms of both low level color distribution and high level image scene content simultaneously. For the purpose, the image ’s scene content information is integrated with its color distribution to obtain optimal illumination estimation model. The experimental results on real-world image sets show that our algorithm is superior to some prevailing illumination estimation methods, even better than some combinational methods.
5 0.68605673 118 cvpr-2013-Detecting Pulse from Head Motions in Video
Author: Guha Balakrishnan, Fredo Durand, John Guttag
Abstract: We extract heart rate and beat lengths from videos by measuring subtle head motion caused by the Newtonian reaction to the influx of blood at each beat. Our method tracks features on the head and performs principal component analysis (PCA) to decompose their trajectories into a set of component motions. It then chooses the component that best corresponds to heartbeats based on its temporal frequency spectrum. Finally, we analyze the motion projected to this component and identify peaks of the trajectories, which correspond to heartbeats. When evaluated on 18 subjects, our approach reported heart rates nearly identical to an electrocardiogram device. Additionally we were able to capture clinically relevant information about heart rate variability.
6 0.68521589 149 cvpr-2013-Evaluation of Color STIPs for Human Action Recognition
7 0.67582119 270 cvpr-2013-Local Fisher Discriminant Analysis for Pedestrian Re-identification
8 0.66746247 378 cvpr-2013-Sampling Strategies for Real-Time Action Recognition
9 0.6633355 55 cvpr-2013-Background Modeling Based on Bidirectional Analysis
10 0.65957737 112 cvpr-2013-Dense Segmentation-Aware Descriptors
11 0.65749955 140 cvpr-2013-Efficient Color Boundary Detection with Color-Opponent Mechanisms
12 0.65659517 391 cvpr-2013-Sensing and Recognizing Surface Textures Using a GelSight Sensor
13 0.64926815 401 cvpr-2013-Sketch Tokens: A Learned Mid-level Representation for Contour and Object Detection
14 0.64889807 120 cvpr-2013-Detecting and Naming Actors in Movies Using Generative Appearance Models
15 0.64713955 137 cvpr-2013-Dynamic Scene Classification: Learning Motion Descriptors with Slow Features Analysis
16 0.64321458 37 cvpr-2013-Adherent Raindrop Detection and Removal in Video
17 0.6417731 352 cvpr-2013-Recovering Stereo Pairs from Anaglyphs
18 0.63826972 457 cvpr-2013-Visual Tracking via Locality Sensitive Histograms
19 0.63567197 103 cvpr-2013-Decoding Children's Social Behavior
20 0.63549346 130 cvpr-2013-Discriminative Color Descriptors
topicId topicWeight
[(10, 0.116), (16, 0.028), (26, 0.045), (33, 0.297), (67, 0.116), (69, 0.04), (76, 0.204), (87, 0.073)]
simIndex simValue paperId paperTitle
1 0.89964896 317 cvpr-2013-Optimal Geometric Fitting under the Truncated L2-Norm
Author: Erik Ask, Olof Enqvist, Fredrik Kahl
Abstract: This paper is concerned with model fitting in the presence of noise and outliers. Previously it has been shown that the number of outliers can be minimized with polynomial complexity in the number of measurements. This paper improves on these results in two ways. First, it is shown that for a large class of problems, the statistically more desirable truncated L2-norm can be optimized with the same complexity. Then, with the same methodology, it is shown how to transform multi-model fitting into a purely combinatorial problem—with worst-case complexity that is polynomial in the number of measurements, though exponential in the number of models. We apply our framework to a series of hard registration and stitching problems demonstrating that the approach is not only of theoretical interest. It gives a practical method for simultaneously dealing with measurement noise and large amounts of outliers for fitting problems with lowdimensional models.
2 0.89295405 426 cvpr-2013-Tensor-Based Human Body Modeling
Author: Yinpeng Chen, Zicheng Liu, Zhengyou Zhang
Abstract: In this paper, we present a novel approach to model 3D human body with variations on both human shape and pose, by exploring a tensor decomposition technique. 3D human body modeling is important for 3D reconstruction and animation of realistic human body, which can be widely used in Tele-presence and video game applications. It is challenging due to a wide range of shape variations over different people and poses. The existing SCAPE model [4] is popular in computer vision for modeling 3D human body. However, it considers shape and pose deformations separately, which is not accurate since pose deformation is persondependent. Our tensor-based model addresses this issue by jointly modeling shape and pose deformations. Experimental results demonstrate that our tensor-based model outperforms the SCAPE model quite significantly. We also apply our model to capture human body using Microsoft Kinect sensors with excellent results.
3 0.89154464 434 cvpr-2013-Topical Video Object Discovery from Key Frames by Modeling Word Co-occurrence Prior
Author: Gangqiang Zhao, Junsong Yuan, Gang Hua
Abstract: A topical video object refers to an object that is frequently highlighted in a video. It could be, e.g., the product logo and the leading actor/actress in a TV commercial. We propose a topic model that incorporates a word co-occurrence prior for efficient discovery of topical video objects from a set of key frames. Previous work using topic models, such as Latent Dirichelet Allocation (LDA), for video object discovery often takes a bag-of-visual-words representation, which ignored important co-occurrence information among the local features. We show that such data driven co-occurrence information from bottom-up can conveniently be incorporated in LDA with a Gaussian Markov prior, which combines top down probabilistic topic modeling with bottom up priors in a unified model. Our experiments on challenging videos demonstrate that the proposed approach can discover different types of topical objects despite variations in scale, view-point, color and lighting changes, or even partial occlusions. The efficacy of the co-occurrence prior is clearly demonstrated when comparing with topic models without such priors.
4 0.88966537 341 cvpr-2013-Procrustean Normal Distribution for Non-rigid Structure from Motion
Author: Minsik Lee, Jungchan Cho, Chong-Ho Choi, Songhwai Oh
Abstract: Non-rigid structure from motion is a fundamental problem in computer vision, which is yet to be solved satisfactorily. The main difficulty of the problem lies in choosing the right constraints for the solution. In this paper, we propose new constraints that are more effective for non-rigid shape recovery. Unlike the other proposals which have mainly focused on restricting the deformation space using rank constraints, our proposal constrains the motion parameters so that the 3D shapes are most closely aligned to each other, which makes the rank constraints unnecessary. Based on these constraints, we define a new class ofprobability distribution called the Procrustean normal distribution and propose a new NRSfM algorithm, EM-PND. The experimental results show that the proposed method outperforms the existing methods, and it works well even if there is no temporal dependence between the observed samples.
same-paper 5 0.88596731 332 cvpr-2013-Pixel-Level Hand Detection in Ego-centric Videos
Author: Cheng Li, Kris M. Kitani
Abstract: We address the task of pixel-level hand detection in the context of ego-centric cameras. Extracting hand regions in ego-centric videos is a critical step for understanding handobject manipulation and analyzing hand-eye coordination. However, in contrast to traditional applications of hand detection, such as gesture interfaces or sign-language recognition, ego-centric videos present new challenges such as rapid changes in illuminations, significant camera motion and complex hand-object manipulations. To quantify the challenges and performance in this new domain, we present a fully labeled indoor/outdoor ego-centric hand detection benchmark dataset containing over 200 million labeled pixels, which contains hand images taken under various illumination conditions. Using both our dataset and a publicly available ego-centric indoors dataset, we give extensive analysis of detection performance using a wide range of local appearance features. Our analysis highlights the effectiveness of sparse features and the importance of modeling global illumination. We propose a modeling strategy based on our findings and show that our model outperforms several baseline approaches.
6 0.88273942 201 cvpr-2013-Heterogeneous Visual Features Fusion via Sparse Multimodal Machine
7 0.85965019 339 cvpr-2013-Probabilistic Graphlet Cut: Exploiting Spatial Structure Cue for Weakly Supervised Image Segmentation
8 0.85959822 119 cvpr-2013-Detecting and Aligning Faces by Image Retrieval
9 0.85839933 254 cvpr-2013-Learning SURF Cascade for Fast and Accurate Object Detection
10 0.85686606 122 cvpr-2013-Detection Evolution with Multi-order Contextual Co-occurrence
11 0.85506225 248 cvpr-2013-Learning Collections of Part Models for Object Recognition
12 0.85385537 322 cvpr-2013-PISA: Pixelwise Image Saliency by Aggregating Complementary Appearance Contrast Measures with Spatial Priors
13 0.85359889 60 cvpr-2013-Beyond Physical Connections: Tree Models in Human Pose Estimation
14 0.85301954 2 cvpr-2013-3D Pictorial Structures for Multiple View Articulated Pose Estimation
15 0.85252202 345 cvpr-2013-Real-Time Model-Based Rigid Object Pose Estimation and Tracking Combining Dense and Sparse Visual Cues
16 0.85180271 94 cvpr-2013-Context-Aware Modeling and Recognition of Activities in Video
17 0.85167605 225 cvpr-2013-Integrating Grammar and Segmentation for Human Pose Estimation
18 0.85141534 438 cvpr-2013-Towards Pose Robust Face Recognition
19 0.85079253 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection
20 0.85058272 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases