iccv iccv2013 iccv2013-133 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Chi Xu, Li Cheng
Abstract: We tackle the practical problem of hand pose estimation from a single noisy depth image. A dedicated three-step pipeline is proposed: Initial estimation step provides an initial estimation of the hand in-plane orientation and 3D location; Candidate generation step produces a set of 3D pose candidates from the Hough voting space with the help of rotation-invariant depth features; Verification step delivers the final 3D hand pose as the solution to an optimization problem. We analyze the depth noises, and suggest tips to minimize their negative impacts on the overall performance. Our approach is able to work with Kinect-type noisy depth images, and reliably produces pose estimations of general motions efficiently (12 frames per second). Extensive experiments are conducted to qualitatively and quantitatively evaluate the performance with respect to the state-of-the-art methods that have access to additional RGB images. Our approach is shown to deliver on par or even better results.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We tackle the practical problem of hand pose estimation from a single noisy depth image. [sent-4, score-0.848]
2 We analyze the depth noises, and suggest tips to minimize their negative impacts on the overall performance. [sent-6, score-0.456]
3 Our approach is able to work with Kinect-type noisy depth images, and reliably produces pose estimations of general motions efficiently (12 frames per second). [sent-7, score-0.624]
4 Introduction 3D hand pose estimation has a wide range of applications including avatar animation and graphics [22, 24], robotic design [13], human-computer interaction, and ergonomics. [sent-11, score-0.525]
5 As surveyed in e.g. [17, 10, 9, 13] and references therein, it remains a challenging problem, mainly due to the complex and dexterous nature of hand articulations. [sent-14, score-0.29]
6 The Kinect-type depth images, on the other hand, come with noticeable depth noises that significantly degrade the image quality, as illustrated in Figure 1. [sent-16, score-0.875]
7 In particular, regions are sometimes missing, and there are ghost shadows around object boundaries: pixels with undefined depth values. [sent-17, score-0.324]
8 In this paper, we focus on efficient hand pose estimation from a single noisy depth image. [sent-18, score-0.417]
9 Experiments are conducted using the depth channel of a Kinect sensor, with some exemplar estimation results presented in Figure 1. [sent-27, score-0.394]
10 The main contributions are: • Our approach estimates 3D hand poses from a single depth image, which is applicable to general motions (i.e., not restricted to a predefined set of gestures). [sent-30, score-0.669]
11 To our knowledge, this is the only such system to work with Kinect-type noisy depth images and reliably produce pose estimations of general motions. [sent-33, score-0.678]
12 We analyze the depth noises, show that they are inherited from the geometric layout of the on-board sensors, and offer tips to minimize their negative impacts on the overall performance. [sent-34, score-0.456]
13 • Different from the often-used hand kinematic model in e.g. [sent-35, score-0.513]
14 [18] where the palm is assumed flat, our model is able to simulate palm arching (Figure 3), which is novel. [sent-37, score-0.397]
15 The first row shows the input depth images from Kinect, while the second row presents our corresponding results overlaid on the RGB images. [sent-47, score-0.324]
16 The depth images are very noisy and of low-resolution, nevertheless our approach produces satisfactory results. [sent-48, score-0.395]
17 In particular, the depth features of the second step are reformulated to become invariant to in-plane rotations. [sent-52, score-0.407]
18 In particular, their method utilizes large-scale synthetic data during the training stage and adopts the random forest paradigm of [5]. [sent-55, score-0.331]
19 This framework was later followed by [11], which directly regresses the 3D locations of 16 body joints for body pose estimation. [sent-56, score-0.347]
20 The same paradigm is adapted in [15] to address the problem of hand gesture recognition, i.e. [sent-58, score-0.29]
21 to recognize a limited set of predefined hand gestures. [sent-60, score-0.329]
22 The problem of estimating 3D hand pose from a single RGB image has also been studied by e. [sent-62, score-0.417]
23 More recently, a tracking-based 3D hand pose estimation method has been proposed in [16] by making use of both RGB and depth channels of Kinect. [sent-68, score-0.811]
24 It however requires an explicit initialization to learn a hand skin model which may be cumbersome in some circumstances. [sent-69, score-0.29]
25 The commercially available 3Gear system [2] is able to robustly estimate hand poses by accessing the RGB-D channels of two Kinects. [sent-71, score-0.399]
26 The performance is excellent with these predefined gestures at various orientations, but degrades severely when engaging with an unseen gesture. [sent-73, score-0.249]
27 Leapmotion [3] is the most recent commercial system designed for close-range (within about 50cm in depth) hand pose estimation. [sent-74, score-0.471]
28 Our observation is that it is not well tolerant to even moderate levels of self-occlusion of the fingertips: a finger will not be detected if it touches other fingers or the palm. [sent-76, score-0.565]
29 In contrast, our system works beyond half a meter in depth, and works well when some of the fingertips are occluded, as it does not rely on detecting fingertips. [sent-77, score-0.314]
30 Our Approach Our approach starts with a preprocessing step to decrease the discrepancies between synthetic and real depth images. [sent-80, score-0.542]
31 One chief motivation is the fact that a hand can easily roll sideways (i.e., rotate in-plane). [sent-83, score-0.29]
32 As depth features are usually not rotationally invariant, we instead handle orientation estimation and candidate generation in separate steps, which form the first two steps of our pipeline. [sent-86, score-0.375]
33 Furthermore, as a compound consequence of small hand size and dense self-occlusions, mode-seeking in the Hough voting space might not necessarily give the optimal 3D hand pose. [sent-87, score-0.639]
34 The top left panel presents three depth images: a real one, a synthetic one without noise, and a final synthetic one with noise. [sent-93, score-0.718]
35 Note that a pixel in a depth image is darker the closer it is to the viewing camera. [sent-94, score-0.357]
36 The pixels in pure black are those with unknown depth values. [sent-95, score-0.362]
37 Depth features We adopt the same depth features as mentioned in [19]. [sent-99, score-0.324]
38 That is, at a given pixel location x of an image I, denote its depth value by the mapping $d_I(x)$, and construct a feature $f_I(x)$ from two 2D offset positions u, v relative to x: $f_I(x) = d_I(x + u/d_I(x)) - d_I(x + v/d_I(x))$ (Equation 1). [sent-100, score-0.424]
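As a concrete illustration of Equation 1, here is a minimal Python sketch of this depth-difference feature. The array layout, the rounding of probe positions, and the BACKGROUND constant for out-of-image reads are our assumptions for illustration, not details taken from the paper.

```python
import numpy as np

BACKGROUND = 1e6  # assumed large value standing in for unknown/background depth

def depth_feature(depth, x, u, v):
    """Equation 1: f_I(x) = d_I(x + u/d_I(x)) - d_I(x + v/d_I(x)).
    x, u, v are 2D (column, row) vectors; normalizing the offsets by the
    depth at x makes the feature roughly invariant to the hand's distance."""
    d = depth[x[1], x[0]]
    def probe(offset):
        p = (x + np.asarray(offset, dtype=float) / d).round().astype(int)
        h, w = depth.shape
        if 0 <= p[0] < w and 0 <= p[1] < h:
            return float(depth[p[1], p[0]])
        return BACKGROUND  # probes falling outside the image read as background
    return probe(u) - probe(v)
```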
39 Regression model of Hough forest The regression model of Hough forest is developed in [11] to estimate 3D locations of body joints. [sent-107, score-0.416]
40 Preprocessing: Analyzing the Depth Noises Our approach relies on synthetic hand images for training the model, which is then applied to real depth images for pose estimation. [sent-111, score-0.916]
41 However, there are noticeable noises present in typical Kinect-type depth images. [sent-113, score-0.551]
42 The issue is further complicated by the fact that the Hough forest model tends to pick up features along depth boundaries, which often coincide with the areas having large noises (first row of Figure 1). [sent-115, score-0.707]
43 The noises, especially those with unknown values, turn out to be rather different from the typical image noises such as Gaussian noise or salt-and-pepper noise. [sent-117, score-0.264]
44 As depicted in the bottom of Figure 2, rendered from a certain view of the synthetic 3D hand, the surface normal of every pixel is calculated, together with its tendency to be perpendicular w.r.t. the viewing direction. [sent-121, score-0.276]
45 Those that tend to be perpendicular are less visible; therefore their depth values become unknown (in black). [sent-125, score-0.43]
46 As a result, there are often small regions surrounded by these black unknown pixels, which are subsequently missing from the final depth map due to the limited sampling rate of the depth camera. [sent-126, score-0.745]
47 During our experiments, these two sources of depth noise are added to the synthetic images. [sent-128, score-0.499]
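As a rough illustration of how this grazing-angle noise source could be simulated on synthetic renderings, here is a hedged Python sketch; the viewing direction, the angle threshold, and the zero-depth encoding of unknown pixels are all our assumptions.

```python
import numpy as np

def add_grazing_angle_noise(depth, normals, view_dir=(0.0, 0.0, 1.0),
                            angle_thresh_deg=80.0):
    """Mask out depth where the surface normal is nearly perpendicular to
    the viewing direction, mimicking Kinect's unknown-depth pixels.
    normals: HxWx3 unit surface normals rendered from the synthetic hand."""
    v = np.asarray(view_dir, dtype=float)
    cos_angle = np.abs(normals @ v)              # close to 0 at grazing angles
    grazing = cos_angle < np.cos(np.radians(angle_thresh_deg))
    noisy = depth.copy()
    noisy[grazing] = 0.0                         # 0 encodes "unknown depth"
    return noisy
```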
48 The parameters to be estimated are (x1, x2, x3, θ), in which x1, x2, x3 define the 3D position of the hand base (i.e., the wrist) and θ its in-plane orientation. [sent-132, score-0.29]
49 Step 2: Candidate Generation Given the in-plane orientation θ, the hand can be reversely rotated to its canonical pose in Figure 3 (c), by rotating each of the hand-related pixels x in-plane by −θ around the hand base. [sent-141, score-0.537]
50 The depth features can thus be computed from the processed image using Equation 1. [sent-142, score-0.324]
51 Having this in mind, we instead redefine our depth features as follows: $f_I(x) = d_I(x + \mathrm{rot}(u, -\theta)/d_I(x)) - d_I(x + \mathrm{rot}(v, -\theta)/d_I(x))$, where $\mathrm{rot}(\cdot, -\theta)$ rotates a 2D offset in-plane by $-\theta$. [sent-144, score-0.29]
52 Note this new depth feature does not exactly implement the image rotation scenario described above. [sent-155, score-0.373]
53 Nevertheless, it is an efficient approximation, and as a result our depth features become invariant to in-plane rotations. [sent-156, score-0.431]
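The counter-rotation of the offsets can be sketched as follows, reusing the hypothetical depth_feature from the earlier sketch; the rot helper is our illustrative implementation of the rotation operator.

```python
import numpy as np

def rot(offset, theta):
    """Rotate a 2D offset in-plane by angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([c * offset[0] - s * offset[1],
                     s * offset[0] + c * offset[1]])

def invariant_depth_feature(depth, x, u, v, theta):
    # Counter-rotate the offsets by the orientation estimated in step 1,
    # instead of rotating the whole image (an efficient approximation).
    return depth_feature(depth, x, rot(u, -theta), rot(v, -theta))
```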
54 A second regression model of Hough forest is then constructed to produce a pool of 3D hand pose candidates: each image pixel is parsed by one of the T2 trees, leading down the tree path to a certain leaf node that stores a collection of 27-dimensional votes. [sent-157, score-0.868]
55 The voting space is similarly formed as in step 1, and the standard mean-shift method [7] is used to find k local modes of the point cloud as the output pool of 3D hand candidates. [sent-158, score-0.447]
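For illustration, here is a sketch of this candidate generation; scikit-learn's MeanShift is our stand-in for the standard mean-shift of [7], and the bandwidth and the flat 27-D vote layout are assumptions.

```python
import numpy as np
from sklearn.cluster import MeanShift

def pose_candidates(votes, k=10, bandwidth=None):
    """Find up to k local modes of the 27-D vote cloud gathered from the
    forest leaves; the most supported modes become the pose candidates."""
    ms = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit(np.asarray(votes))
    counts = np.bincount(ms.labels_, minlength=len(ms.cluster_centers_))
    order = np.argsort(counts)[::-1]             # rank modes by vote support
    return ms.cluster_centers_[order[:k]]
```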
56 Besides, we only need to consider the canonical hand poses with small in-plane perturbations, as the in-plane rotation issue has been explicitly addressed by step 1. [sent-162, score-0.458]
57 3D location of joints Instead of estimating the 3D location of body joints from a depth image [11], our model predicts the parameters of a predefined hand kinematic chain, which is then used to build the 3D hand. [sent-165, score-1.27]
58 First, a kinematic chain model is better at dealing with self-occlusion, a scenario often seen in hand pose estimation. [sent-167, score-0.71]
59 Compared to 3D locations of joints, a kinematic chain is a global representation and is more tolerant to small local perturbations. [sent-168, score-0.407]
60 Second, for human pose estimation, once the body location is known, the body parts are relatively independent of each other; [sent-169, score-0.254]
61 e.g., a change of the left hand will not affect the other parts significantly. [sent-173, score-0.29]
62 In contrast, motions of the five fingers and the palm are tightly correlated. [sent-174, score-0.439]
63 Our 3D hand model A widely-used hand kinematic chain model has been proposed by Rehg and Kanade [18], which has 21+6 degrees of freedom (DoF). [sent-175, score-0.873]
64 Unfortunately, in this kinematic model, the palm is assumed to be a rigid object, making it unable to mimic gestures with perceivable palm arching. [sent-176, score-0.802]
65 Our 3D hand contains 21+6 degrees of freedom, including the hand root position and orientation (6), and the relative angles of individual joints (21). [sent-196, score-0.804]
66 From (A) to (C): The hand anatomy, the underlying skeleton kinematic model, and the skinned mesh model. [sent-197, score-0.513]
67 In our kinematic model, four DoF are attached to each finger from the bottom up in Figure 3 (b): for each of the four fingers (i.e. [sent-206, score-0.57]
68 index to pinky fingers), two DoF are used for the metacarpophalangeal (MCP) joint, and one each for the proximal interphalangeal (PIP) joint and the distal interphalangeal (DIP) joint, respectively. [sent-208, score-0.311]
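The DoF bookkeeping above can be summarized in a short sketch; the text specifies only the four fingers explicitly, so assigning the remaining 5 DoF to the thumb is our assumption.

```python
# Sketch of the 21+6 DoF layout described above.
ROOT_DOF = 6                                  # 3D position + 3D orientation
FINGER_DOF = {"MCP": 2, "PIP": 1, "DIP": 1}   # 4 DoF per non-thumb finger
FINGERS = ("index", "middle", "ring", "pinky")
THUMB_DOF = 21 - len(FINGERS) * sum(FINGER_DOF.values())  # = 5 (assumed)

total_dof = ROOT_DOF + len(FINGERS) * sum(FINGER_DOF.values()) + THUMB_DOF
assert total_dof == 27                        # 21 joint angles + 6 root DoF
```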
69 This 27-dimensional parameter vector is the vote stored in the leaf nodes of step 2, and is also used to represent each of the 3D hand candidates. [sent-212, score-0.375]
70 For example, it is adopted by [25] for motion capture and by [9] for estimating hand pose from one RGB image. [sent-216, score-0.417]
71 It often involves the minimization of a discrepancy measure $\rho$ between the observed and the synthetic depth image: $\Theta^* = \arg\min_{\Theta} \rho(I, I(\Theta))$, where $I(\Theta)$ denotes the synthetic depth image rendered from pose parameters $\Theta$. [sent-217, score-0.499]
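A hedged sketch of this discrepancy minimization over the candidate pool follows; the renderer interface and the truncated absolute-difference choice of ρ are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def verify(candidates, observed, render):
    """Return the candidate pose minimizing the discrepancy rho between the
    observed depth image and a synthetic rendering of that candidate.
    `render` is assumed to map pose parameters Theta to a depth map."""
    def rho(a, b):
        # truncated per-pixel absolute difference (truncation value assumed)
        return np.minimum(np.abs(a - b), 50.0).sum()
    scores = [rho(observed, render(theta)) for theta in candidates]
    return candidates[int(np.argmin(scores))]
```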
72 Our synthetic sets are randomly sampled from different orientations as well as a wide range of gestures: the kinematic models of over thirty American Sign Language letter and number gestures (after removing the dynamic gestures and a few duplicate gestures) are adopted as bases. [sent-229, score-0.737]
73 A number of random pairs are then selected, and from each pair we smoothly interpolate over each of the joints to obtain a series of new gestures in between. [sent-230, score-0.349]
74 This way we construct our pool of synthetic hand gestures. [sent-231, score-0.52]
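A minimal sketch of this interpolation scheme, assuming each gesture is encoded as a vector of joint angles and that the interpolation is linear per joint:

```python
import numpy as np

def interpolate_gestures(pose_a, pose_b, steps=5):
    """Synthesize in-between gestures by smoothly interpolating the joint
    angles of two base gestures (endpoints excluded)."""
    a, b = np.asarray(pose_a, float), np.asarray(pose_b, float)
    ts = np.linspace(0.0, 1.0, steps + 2)[1:-1]  # strictly in between
    return [(1.0 - t) * a + t * b for t in ts]
```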
75 In our experiments, the number of trees for both Hough forests is the same, T1 = T2 = 5; the tree depth is fixed to 20, and the candidate pool size is k = 10. [sent-232, score-0.475]
76 To train the first Hough forest in the initial estimation step, 20k synthetic images with different gestures, orientations and scales were randomly generated for each tree. [sent-233, score-0.456]
77 For the second forest in the candidate generation step, 100k synthetic images were randomly generated for each tree, thus 500k images in total for the entire forest. [sent-235, score-0.434]
78 For both forests, 512 pixels are randomly sampled from each synthetic image, which are roughly evenly distributed over the hand region. [sent-237, score-0.465]
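For completeness, here is a small sketch of how such per-image samples might be drawn; uniform sampling over a hand mask is our assumed way of obtaining the roughly even spatial distribution mentioned above.

```python
import numpy as np

def sample_hand_pixels(hand_mask, n=512, rng=None):
    """Sample n pixel locations (x, y) from the hand region of a
    synthetic image."""
    rng = rng or np.random.default_rng()
    ys, xs = np.nonzero(hand_mask)               # hand_mask: HxW boolean
    idx = rng.choice(len(xs), size=min(n, len(xs)), replace=False)
    return np.stack([xs[idx], ys[idx]], axis=1)  # (n, 2) array of (x, y)
```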
79 Average joint error is defined as the average error of the rotation of all hand joints. [sent-253, score-0.518]
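As a sketch, this metric can be computed as below; treating each joint's rotation error as a plain absolute angle difference (ignoring wrap-around) is a simplifying assumption of ours.

```python
import numpy as np

def average_joint_error(pred_angles, gt_angles):
    """Average rotation error over all hand joints, in degrees; both poses
    are assumed to be given as flat vectors of per-joint angles."""
    pred, gt = np.asarray(pred_angles, float), np.asarray(gt_angles, float)
    return float(np.mean(np.abs(pred - gt)))
```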
80 The estimation errors, in particular the translation error of the hand base (in %) and the average error of joints (in degrees), are evaluated as a function of distance. [sent-256, score-0.247]
81 Due to the physical limitation of Kinect, the minimum depth is 0.5 meter. [sent-257, score-0.324]
82 The hand area becomes too small beyond 1-1.5 meters. [sent-258, score-0.29]
83 In other words, the error metric is more sensitive when the hand is close, and is less sensitive when the hand is farther away from the camera. [sent-268, score-0.637]
84 Visual analysis of our results It is practically very important to consider the depth noises, and in particular the missing small regions, in our synthesized training data. [sent-275, score-0.576]
85 As shown in column 1 of Figure 6, the index finger tip is missing from the Kinect raw image; nevertheless our approach can still successfully estimate a proper 3D hand pose that is nicely aligned with the RGB image in hindsight. [sent-276, score-0.647]
86 Moreover, the incorporation of the palm arching parameter in our 3D hand model also turns out to be very helpful in estimating gestures with a bending palm. [sent-277, score-0.81]
87 Our system may fail to estimate the right orientation for hand gestures with similar frontal and back sides in the depth map, as in the third column of this figure. [sent-279, score-1.02]
88 Sometimes several fingers heavily overlap with each other, as in the crossed-finger case in the fourth column, and they may be estimated as a single finger instead. [sent-280, score-0.347]
89 As expected, 3Gear [2] performs poorly on these hand poses, as the gestures are outside of its 6 predefined ones, despite the expense of employing two Kinects. [sent-288, score-0.578]
90 [16] seems to have great difficulty picking up some gestures and orientations in tracking, as shown in this figure as well as in the supplementary videos. [sent-289, score-0.338]
91 Conclusion and Future Work We have proposed a systematic approach to estimate 3D hand poses of general motions from a single depth image. [sent-292, score-0.77]
92 This is made possible by a carefully designed three-step pipeline, as well as a detailed analysis of the depth noises. [sent-294, score-0.324]
93 For future work, we plan to exploit prior knowledge of hand motion constraints to work in a reduced parameter space, as well as to extend the current work to deal with scenarios where one hand is interacting with physical objects, including other hands. [sent-295, score-0.58]
94 Efficient regression of general-activity human poses from depth images. [sent-365, score-0.423]
95 Hand pose estimation and hand shape classification using multi-layered randomized decision forests. [sent-388, score-0.487]
96 Efficient model-based 3d tracking of hand articulations using kinect. [sent-393, score-0.29]
97 Visual interpretation of hand gestures for human-computer interaction: A review. [sent-399, score-0.539]
98 Visual tracking of high dof articulated structures: an application to human hand tracking. [sent-408, score-0.411]
99 Real-time human pose recognition in parts from single depth images. [sent-419, score-0.451]
100 Combining marker-based mocap and rgb-d camera for acquiring high-fidelity hand motion data. [sent-456, score-0.29]
wordName wordTfidf (topN-words)
[('depth', 0.324), ('hand', 0.29), ('gestures', 0.249), ('kinematic', 0.223), ('noises', 0.193), ('hough', 0.184), ('fingers', 0.176), ('synthetic', 0.175), ('finger', 0.171), ('palm', 0.165), ('forest', 0.156), ('pose', 0.127), ('dof', 0.121), ('rgb', 0.109), ('joints', 0.1), ('interphalangeal', 0.1), ('meter', 0.089), ('tips', 0.089), ('kinect', 0.079), ('oikonomidis', 0.077), ('rot', 0.077), ('orientation', 0.071), ('estimation', 0.07), ('rotation', 0.07), ('chain', 0.07), ('perpendicular', 0.068), ('location', 0.067), ('arching', 0.067), ('argyros', 0.067), ('distal', 0.067), ('diu', 0.067), ('inplane', 0.067), ('libhand', 0.067), ('mcp', 0.067), ('motions', 0.066), ('delivers', 0.066), ('hands', 0.065), ('translation', 0.063), ('pipeline', 0.06), ('body', 0.06), ('missing', 0.059), ('keskin', 0.059), ('subplot', 0.059), ('voting', 0.059), ('verification', 0.059), ('candidate', 0.057), ('error', 0.057), ('shadow', 0.056), ('pool', 0.055), ('poses', 0.055), ('orientations', 0.055), ('di', 0.054), ('system', 0.054), ('dedicated', 0.054), ('degrees', 0.053), ('rotational', 0.051), ('rotating', 0.049), ('tolerant', 0.047), ('fps', 0.046), ('generation', 0.046), ('mm', 0.045), ('star', 0.045), ('rehg', 0.045), ('singapore', 0.045), ('regression', 0.044), ('gpu', 0.044), ('panel', 0.044), ('joint', 0.044), ('step', 0.043), ('impacts', 0.043), ('parsed', 0.043), ('leaf', 0.042), ('everyday', 0.042), ('bioinformatics', 0.042), ('deliver', 0.04), ('invariant', 0.04), ('tree', 0.039), ('predefined', 0.039), ('stores', 0.039), ('bending', 0.039), ('perturbations', 0.038), ('animation', 0.038), ('unknown', 0.038), ('shotton', 0.038), ('noisy', 0.037), ('estimations', 0.036), ('enhances', 0.036), ('articulation', 0.035), ('letter', 0.035), ('systematic', 0.035), ('access', 0.034), ('noticeable', 0.034), ('desktop', 0.034), ('produces', 0.034), ('pick', 0.034), ('pixel', 0.033), ('noise', 0.033), ('plots', 0.033), ('sides', 0.032), ('tightly', 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 133 iccv-2013-Efficient Hand Pose Estimation from a Single Depth Image
Author: Chi Xu, Li Cheng
Abstract: We tackle the practical problem of hand pose estimation from a single noisy depth image. A dedicated three-step pipeline is proposed: Initial estimation step provides an initial estimation of the hand in-plane orientation and 3D location; Candidate generation step produces a set of 3D pose candidate from the Hough voting space with the help of the rotational invariant depth features; Verification step delivers the final 3D hand pose as the solution to an optimization problem. We analyze the depth noises, and suggest tips to minimize their negative impacts on the overall performance. Our approach is able to work with Kinecttype noisy depth images, and reliably produces pose estimations of general motions efficiently (12 frames per second). Extensive experiments are conducted to qualitatively and quantitatively evaluate the performance with respect to the state-of-the-art methods that have access to additional RGB images. Our approach is shown to deliver on par or even better results.
2 0.37091675 218 iccv-2013-Interactive Markerless Articulated Hand Motion Tracking Using RGB and Depth Data
Author: Srinath Sridhar, Antti Oulasvirta, Christian Theobalt
Abstract: Tracking the articulated 3D motion of the hand has important applications, for example, in human–computer interaction and teleoperation. We present a novel method that can capture a broad range of articulated hand motions at interactive rates. Our hybrid approach combines, in a voting scheme, a discriminative, part-based pose retrieval method with a generative pose estimation method based on local optimization. Color information from a multiview RGB camera setup along with a person-specific hand model are used by the generative method to find the pose that best explains the observed images. In parallel, our discriminative pose estimation method uses fingertips detected on depth data to estimate a complete or partial pose of the hand by adopting a part-based pose retrieval strategy. This part-based strategy helps reduce the search space drastically in comparison to a global pose retrieval strategy. Quantitative results show that our method achieves state-of-the-art accuracy on challenging sequences and a near-realtime performance of 10 fps on a desktop computer.
3 0.28386751 340 iccv-2013-Real-Time Articulated Hand Pose Estimation Using Semi-supervised Transductive Regression Forests
Author: Danhang Tang, Tsz-Ho Yu, Tae-Kyun Kim
Abstract: This paper presents the first semi-supervised transductive algorithm for real-time articulated hand pose estimation. Noisy data and occlusions are the major challenges of articulated hand pose estimation. In addition, the discrepancies among realistic and synthetic pose data undermine the performances of existing approaches that use synthetic data extensively in training. We therefore propose the Semi-supervised Transductive Regression (STR) forest which learns the relationship between a small, sparsely labelled realistic dataset and a large synthetic dataset. We also design a novel data-driven, pseudo-kinematic technique to refine noisy or occluded joints. Our contributions include: (i) capturing the benefits of both realistic and synthetic data via transductive learning; (ii) showing accuracies can be improved by considering unlabelled data; and (iii) introducing a pseudo-kinematic technique to refine articulations efficiently. Experimental results show not only the promising performance of our method with respect to noise and occlusions, but also its superiority over state-of- the-arts in accuracy, robustness and speed.
4 0.20282786 273 iccv-2013-Monocular Image 3D Human Pose Estimation under Self-Occlusion
Author: Ibrahim Radwan, Abhinav Dhall, Roland Goecke
Abstract: In this paper, an automatic approach for 3D pose reconstruction from a single image is proposed. The presence of human body articulation, hallucinated parts and cluttered background leads to ambiguity during the pose inference, which makes the problem non-trivial. Researchers have explored various methods based on motion and shading in order to reduce the ambiguity and reconstruct the 3D pose. The key idea of our algorithm is to impose both kinematic and orientation constraints. The former is imposed by projecting a 3D model onto the input image and pruning the parts, which are incompatible with the anthropomorphism. The latter is applied by creating synthetic views via regressing the input view to multiple oriented views. After applying the constraints, the 3D model is projected onto the initial and synthetic views, which further reduces the ambiguity. Finally, we borrow the direction of the unambiguous parts from the synthetic views to the initial one, which results in the 3D pose. Quantitative experiments are performed on the HumanEva-I dataset and qualitatively on unconstrained images from the Image Parse dataset. The results show the robustness of the proposed approach to accurately reconstruct the 3D pose form a single image.
5 0.1994618 341 iccv-2013-Real-Time Body Tracking with One Depth Camera and Inertial Sensors
Author: Thomas Helten, Meinard Müller, Hans-Peter Seidel, Christian Theobalt
Abstract: In recent years, the availability of inexpensive depth cameras, such as the Microsoft Kinect, has boosted the research in monocular full body skeletal pose tracking. Unfortunately, existing trackers often fail to capture poses where a single camera provides insufficient data, such as non-frontal poses, and all other poses with body part occlusions. In this paper, we present a novel sensor fusion approach for real-time full body tracking that succeeds in such difficult situations. It takes inspiration from previous tracking solutions, and combines a generative tracker and a discriminative tracker retrieving closest poses in a database. In contrast to previous work, both trackers employ data from a low number of inexpensive body-worn inertial sensors. These sensors provide reliable and complementary information when the monocular depth information alone is not sufficient. We also contribute by new algorithmic solutions to best fuse depth and inertial data in both trackers. One is a new visibility model to determine global body pose, occlusions and usable depth correspondences and to decide what data modality to use for discriminative tracking. We also contribute with a new inertial-basedpose retrieval, and an adapted late fusion step to calculate the final body pose.
6 0.19623801 24 iccv-2013-A Non-parametric Bayesian Network Prior of Human Pose
7 0.18822852 382 iccv-2013-Semi-dense Visual Odometry for a Monocular Camera
8 0.18055557 209 iccv-2013-Image Guided Depth Upsampling Using Anisotropic Total Generalized Variation
9 0.1626437 444 iccv-2013-Viewing Real-World Faces in 3D
10 0.1588687 18 iccv-2013-A Joint Intensity and Depth Co-sparse Analysis Model for Depth Map Super-resolution
11 0.15590164 366 iccv-2013-STAR3D: Simultaneous Tracking and Reconstruction of 3D Objects Using RGB-D Data
12 0.14313729 424 iccv-2013-Tracking Revisited Using RGBD Camera: Unified Benchmark and Baselines
13 0.13878387 199 iccv-2013-High Quality Shape from a Single RGB-D Image under Uncalibrated Natural Illumination
14 0.13018796 108 iccv-2013-Depth from Combining Defocus and Correspondence Using Light-Field Cameras
15 0.12878528 316 iccv-2013-Pictorial Human Spaces: How Well Do Humans Perceive a 3D Articulated Pose?
16 0.12298644 403 iccv-2013-Strong Appearance and Expressive Spatial Models for Human Pose Estimation
17 0.12069552 188 iccv-2013-Group Sparsity and Geometry Constrained Dictionary Learning for Action Recognition from Depth Maps
18 0.11793374 318 iccv-2013-PixelTrack: A Fast Adaptive Algorithm for Tracking Non-rigid Objects
19 0.11785869 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection
20 0.11768968 47 iccv-2013-Alternating Regression Forests for Object Detection and Pose Estimation
topicId topicWeight
[(0, 0.256), (1, -0.165), (2, -0.046), (3, 0.054), (4, 0.053), (5, -0.106), (6, 0.042), (7, -0.072), (8, -0.113), (9, 0.131), (10, 0.022), (11, -0.036), (12, -0.162), (13, -0.028), (14, 0.018), (15, 0.07), (16, -0.057), (17, -0.272), (18, -0.025), (19, 0.148), (20, 0.013), (21, 0.084), (22, 0.03), (23, 0.088), (24, -0.017), (25, -0.046), (26, 0.05), (27, 0.16), (28, 0.008), (29, -0.036), (30, 0.017), (31, 0.085), (32, -0.103), (33, 0.095), (34, -0.083), (35, -0.046), (36, 0.014), (37, 0.027), (38, -0.05), (39, 0.001), (40, -0.033), (41, 0.055), (42, 0.011), (43, 0.019), (44, -0.073), (45, 0.08), (46, 0.084), (47, -0.044), (48, -0.034), (49, 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.98275125 133 iccv-2013-Efficient Hand Pose Estimation from a Single Depth Image
Author: Chi Xu, Li Cheng
Abstract: We tackle the practical problem of hand pose estimation from a single noisy depth image. A dedicated three-step pipeline is proposed: Initial estimation step provides an initial estimation of the hand in-plane orientation and 3D location; Candidate generation step produces a set of 3D pose candidate from the Hough voting space with the help of the rotational invariant depth features; Verification step delivers the final 3D hand pose as the solution to an optimization problem. We analyze the depth noises, and suggest tips to minimize their negative impacts on the overall performance. Our approach is able to work with Kinecttype noisy depth images, and reliably produces pose estimations of general motions efficiently (12 frames per second). Extensive experiments are conducted to qualitatively and quantitatively evaluate the performance with respect to the state-of-the-art methods that have access to additional RGB images. Our approach is shown to deliver on par or even better results.
2 0.86789757 218 iccv-2013-Interactive Markerless Articulated Hand Motion Tracking Using RGB and Depth Data
Author: Srinath Sridhar, Antti Oulasvirta, Christian Theobalt
Abstract: Tracking the articulated 3D motion of the hand has important applications, for example, in human–computer interaction and teleoperation. We present a novel method that can capture a broad range of articulated hand motions at interactive rates. Our hybrid approach combines, in a voting scheme, a discriminative, part-based pose retrieval method with a generative pose estimation method based on local optimization. Color information from a multiview RGB camera setup along with a person-specific hand model are used by the generative method to find the pose that best explains the observed images. In parallel, our discriminative pose estimation method uses fingertips detected on depth data to estimate a complete or partial pose of the hand by adopting a part-based pose retrieval strategy. This part-based strategy helps reduce the search space drastically in comparison to a global pose retrieval strategy. Quantitative results show that our method achieves state-of-the-art accuracy on challenging sequences and a near-realtime performance of 10 fps on a desktop computer.
3 0.83935833 340 iccv-2013-Real-Time Articulated Hand Pose Estimation Using Semi-supervised Transductive Regression Forests
Author: Danhang Tang, Tsz-Ho Yu, Tae-Kyun Kim
Abstract: This paper presents the first semi-supervised transductive algorithm for real-time articulated hand pose estimation. Noisy data and occlusions are the major challenges of articulated hand pose estimation. In addition, the discrepancies among realistic and synthetic pose data undermine the performances of existing approaches that use synthetic data extensively in training. We therefore propose the Semi-supervised Transductive Regression (STR) forest which learns the relationship between a small, sparsely labelled realistic dataset and a large synthetic dataset. We also design a novel data-driven, pseudo-kinematic technique to refine noisy or occluded joints. Our contributions include: (i) capturing the benefits of both realistic and synthetic data via transductive learning; (ii) showing accuracies can be improved by considering unlabelled data; and (iii) introducing a pseudo-kinematic technique to refine articulations efficiently. Experimental results show not only the promising performance of our method with respect to noise and occlusions, but also its superiority over state-of- the-arts in accuracy, robustness and speed.
4 0.82122386 341 iccv-2013-Real-Time Body Tracking with One Depth Camera and Inertial Sensors
Author: Thomas Helten, Meinard Müller, Hans-Peter Seidel, Christian Theobalt
Abstract: In recent years, the availability of inexpensive depth cameras, such as the Microsoft Kinect, has boosted the research in monocular full body skeletal pose tracking. Unfortunately, existing trackers often fail to capture poses where a single camera provides insufficient data, such as non-frontal poses, and all other poses with body part occlusions. In this paper, we present a novel sensor fusion approach for real-time full body tracking that succeeds in such difficult situations. It takes inspiration from previous tracking solutions, and combines a generative tracker and a discriminative tracker retrieving closest poses in a database. In contrast to previous work, both trackers employ data from a low number of inexpensive body-worn inertial sensors. These sensors provide reliable and complementary information when the monocular depth information alone is not sufficient. We also contribute by new algorithmic solutions to best fuse depth and inertial data in both trackers. One is a new visibility model to determine global body pose, occlusions and usable depth correspondences and to decide what data modality to use for discriminative tracking. We also contribute with a new inertial-basedpose retrieval, and an adapted late fusion step to calculate the final body pose.
5 0.71871805 278 iccv-2013-Multi-scale Topological Features for Hand Posture Representation and Analysis
Author: Kaoning Hu, Lijun Yin
Abstract: In this paper, we propose a multi-scale topological feature representation for automatic analysis of hand posture. Such topological features have the advantage of being posture-dependent while being preserved under certain variations of illumination, rotation, personal dependency, etc. Our method studies the topology of the holes between the hand region and its convex hull. Inspired by the principle of Persistent Homology, which is the theory of computational topology for topological feature analysis over multiple scales, we construct the multi-scale Betti Numbers matrix (MSBNM) for the topological feature representation. In our experiments, we used 12 different hand postures and compared our features with three popular features (HOG, MCT, and Shape Context) on different data sets. In addition to hand postures, we also extend the feature representations to arm postures. The results demonstrate the feasibility and reliability of the proposed method.
6 0.71543705 47 iccv-2013-Alternating Regression Forests for Object Detection and Pose Estimation
7 0.69762951 273 iccv-2013-Monocular Image 3D Human Pose Estimation under Self-Occlusion
8 0.69462699 254 iccv-2013-Live Metric 3D Reconstruction on Mobile Phones
9 0.65903717 209 iccv-2013-Image Guided Depth Upsampling Using Anisotropic Total Generalized Variation
10 0.64069206 108 iccv-2013-Depth from Combining Defocus and Correspondence Using Light-Field Cameras
11 0.63149333 24 iccv-2013-A Non-parametric Bayesian Network Prior of Human Pose
12 0.62774634 382 iccv-2013-Semi-dense Visual Odometry for a Monocular Camera
13 0.61898732 18 iccv-2013-A Joint Intensity and Depth Co-sparse Analysis Model for Depth Map Super-resolution
14 0.60571098 316 iccv-2013-Pictorial Human Spaces: How Well Do Humans Perceive a 3D Articulated Pose?
15 0.59679914 366 iccv-2013-STAR3D: Simultaneous Tracking and Reconstruction of 3D Objects Using RGB-D Data
16 0.54575831 308 iccv-2013-Parsing IKEA Objects: Fine Pose Estimation
17 0.54081255 444 iccv-2013-Viewing Real-World Faces in 3D
18 0.53955239 199 iccv-2013-High Quality Shape from a Single RGB-D Image under Uncalibrated Natural Illumination
19 0.53054887 46 iccv-2013-Allocentric Pose Estimation
20 0.52997124 437 iccv-2013-Unsupervised Random Forest Manifold Alignment for Lipreading
topicId topicWeight
[(2, 0.07), (7, 0.016), (12, 0.016), (26, 0.059), (31, 0.05), (35, 0.024), (40, 0.027), (42, 0.109), (48, 0.017), (64, 0.049), (73, 0.045), (78, 0.016), (84, 0.211), (89, 0.219)]
simIndex simValue paperId paperTitle
1 0.94168806 401 iccv-2013-Stacked Predictive Sparse Coding for Classification of Distinct Regions in Tumor Histopathology
Author: Hang Chang, Yin Zhou, Paul Spellman, Bahram Parvin
Abstract: Image-based classification ofhistology sections, in terms of distinct components (e.g., tumor, stroma, normal), provides a series of indices for tumor composition. Furthermore, aggregation of these indices, from each whole slide image (WSI) in a large cohort, can provide predictive models of the clinical outcome. However, performance of the existing techniques is hindered as a result of large technical variations and biological heterogeneities that are always present in a large cohort. We propose a system that automatically learns a series of basis functions for representing the underlying spatial distribution using stacked predictive sparse decomposition (PSD). The learned representation is then fed into the spatial pyramid matching framework (SPM) with a linear SVM classifier. The system has been evaluated for classification of (a) distinct histological components for two cohorts of tumor types, and (b) colony organization of normal and malignant cell lines in 3D cell culture models. Throughput has been increased through the utility of graphical processing unit (GPU), and evalu- ation indicates a superior performance results, compared with previous research.
2 0.88103181 381 iccv-2013-Semantically-Based Human Scanpath Estimation with HMMs
Author: Huiying Liu, Dong Xu, Qingming Huang, Wen Li, Min Xu, Stephen Lin
Abstract: We present a method for estimating human scanpaths, which are sequences of gaze shifts that follow visual attention over an image. In this work, scanpaths are modeled based on three principal factors that influence human attention, namely low-levelfeature saliency, spatialposition, and semantic content. Low-level feature saliency is formulated as transition probabilities between different image regions based on feature differences. The effect of spatial position on gaze shifts is modeled as a Levy flight with the shifts following a 2D Cauchy distribution. To account for semantic content, we propose to use a Hidden Markov Model (HMM) with a Bag-of-Visual-Words descriptor of image regions. An HMM is well-suited for this purpose in that 1) the hidden states, obtained by unsupervised learning, can represent latent semantic concepts, 2) the prior distribution of the hidden states describes visual attraction to the semantic concepts, and 3) the transition probabilities represent human gaze shift patterns. The proposed method is applied to task-driven viewing processes. Experiments and analysis performed on human eye gaze data verify the effectiveness of this method.
3 0.85573399 60 iccv-2013-Bayesian Robust Matrix Factorization for Image and Video Processing
Author: Naiyan Wang, Dit-Yan Yeung
Abstract: Matrix factorization is a fundamental problem that is often encountered in many computer vision and machine learning tasks. In recent years, enhancing the robustness of matrix factorization methods has attracted much attention in the research community. To benefit from the strengths of full Bayesian treatment over point estimation, we propose here a full Bayesian approach to robust matrix factorization. For the generative process, the model parameters have conjugate priors and the likelihood (or noise model) takes the form of a Laplace mixture. For Bayesian inference, we devise an efficient sampling algorithm by exploiting a hierarchical view of the Laplace distribution. Besides the basic model, we also propose an extension which assumes that the outliers exhibit spatial or temporal proximity as encountered in many computer vision applications. The proposed methods give competitive experimental results when compared with several state-of-the-art methods on some benchmark image and video processing tasks.
same-paper 4 0.85472029 133 iccv-2013-Efficient Hand Pose Estimation from a Single Depth Image
Author: Chi Xu, Li Cheng
Abstract: We tackle the practical problem of hand pose estimation from a single noisy depth image. A dedicated three-step pipeline is proposed: Initial estimation step provides an initial estimation of the hand in-plane orientation and 3D location; Candidate generation step produces a set of 3D pose candidate from the Hough voting space with the help of the rotational invariant depth features; Verification step delivers the final 3D hand pose as the solution to an optimization problem. We analyze the depth noises, and suggest tips to minimize their negative impacts on the overall performance. Our approach is able to work with Kinecttype noisy depth images, and reliably produces pose estimations of general motions efficiently (12 frames per second). Extensive experiments are conducted to qualitatively and quantitatively evaluate the performance with respect to the state-of-the-art methods that have access to additional RGB images. Our approach is shown to deliver on par or even better results.
5 0.85440624 241 iccv-2013-Learning Near-Optimal Cost-Sensitive Decision Policy for Object Detection
Author: Tianfu Wu, Song-Chun Zhu
Abstract: Many object detectors, such as AdaBoost, SVM and deformable part-based models (DPM), compute additive scoring functions at a large number of windows scanned over image pyramid, thus computational efficiency is an important consideration beside accuracy performance. In this paper, we present a framework of learning cost-sensitive decision policy which is a sequence of two-sided thresholds to execute early rejection or early acceptance based on the accumulative scores at each step. A decision policy is said to be optimal if it minimizes an empirical global risk function that sums over the loss of false negatives (FN) and false positives (FP), and the cost of computation. While the risk function is very complex due to high-order connections among the two-sided thresholds, we find its upper bound can be optimized by dynamic programming (DP) efficiently and thus say the learned policy is near-optimal. Given the loss of FN and FP and the cost in three numbers, our method can produce a policy on-the-fly for Adaboost, SVM and DPM. In experiments, we show that our decision policy outperforms state-of-the-art cascade methods significantly in terms of speed with similar accuracy performance.
7 0.81931323 219 iccv-2013-Internet Based Morphable Model
8 0.80399162 218 iccv-2013-Interactive Markerless Articulated Hand Motion Tracking Using RGB and Depth Data
10 0.7788794 445 iccv-2013-Visual Reranking through Weakly Supervised Multi-graph Learning
11 0.77866936 349 iccv-2013-Regionlets for Generic Object Detection
12 0.77842653 65 iccv-2013-Breaking the Chain: Liberation from the Temporal Markov Assumption for Tracking Human Poses
13 0.77826178 24 iccv-2013-A Non-parametric Bayesian Network Prior of Human Pose
14 0.77791667 314 iccv-2013-Perspective Motion Segmentation via Collaborative Clustering
15 0.7779026 300 iccv-2013-Optical Flow via Locally Adaptive Fusion of Complementary Data Costs
16 0.77781624 204 iccv-2013-Human Attribute Recognition by Rich Appearance Dictionary
17 0.77762043 151 iccv-2013-Exploiting Reflection Change for Automatic Reflection Removal
18 0.77744687 50 iccv-2013-Analysis of Scores, Datasets, and Models in Visual Saliency Prediction
19 0.77725303 379 iccv-2013-Semantic Segmentation without Annotating Segments
20 0.77721655 75 iccv-2013-CoDeL: A Human Co-detection and Labeling Framework