iccv iccv2013 iccv2013-430 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Stephen Lombardi, Ko Nishino, Yasushi Makihara, Yasushi Yagi
Abstract: Human gait modeling (e.g., for person identification) largely relies on image-based representations that muddle gait with body shape. Silhouettes, for instance, inherently entangle body shape and gait. For gait analysis and recognition, decoupling these two factors is desirable. Most important, once decoupled, they can be combined for the task at hand, but not if left entangled in the first place. In this paper, we introduce Two-Point Gait, a gait representation that encodes the limb motions regardless of the body shape. Two-Point Gait is directly computed on the image sequence based on the two point statistics of optical flow fields. We demonstrate its use for exploring the space of human gait and gait recognition under large clothing variation. The results show that we can achieve state-of-the-art person recognition accuracy on a challenging dataset.
Reference: text
sentIndex sentText sentNum sentScore
1 Human gait modeling (e.g., for person identification) largely relies on image-based representations that muddle gait with body shape. [sent-4, score-1.087]
2 For gait analysis and recognition, decoupling these two factors is desirable. [sent-6, score-0.965]
3 In this paper, we introduce Two-Point Gait, a gait representation that encodes the limb motions regardless of the body shape. [sent-8, score-1.094]
4 We demonstrate its use for exploring the space of human gait and gait recognition under large clothing variation. [sent-10, score-1.993]
5 Introduction. The study of gait has enjoyed a rich history since its inception with Eadweard Muybridge’s study [26] of equine locomotion in 1878, and for good reason. [sent-13, score-0.993]
6 A large body of research has shown that human gait patterns contain a great deal of information. [sent-14, score-1.072]
7 This definition highlights the fact that gait is about motion rather than body shape. [sent-18, score-1.067]
8 Gunnar Johansson visualized this in his study of pure gait with point-light displays [16], which eliminate all body shape information. [sent-19, score-1.119]
9 Later work demonstrated the possibility of pure gait person identification with these same displays [34]. [sent-21, score-1.043]
10 On the other hand, many past approaches to modeling gait rely heavily on body shape information for recognition and other computer vision tasks. [sent-22, score-1.092]
11 Extracting gait from images and video has remained a challenging problem. [sent-23, score-0.957]
12 Each row corresponds to a different temporal position in the gait cycle. [sent-33, score-0.957]
13 It provides an easy-to-compute gait representation that is robust to body shape. [sent-37, score-1.066]
14 Model-based gait representations fit limbs or other parts of the body in a gait pattern to a predefined 2D or 3D model. [sent-40, score-2.044]
15 Prior work examines the periodic, pendulum-like motion of the legs to model gait [43]. [sent-42, score-0.981]
16 As model-based gait representations typically estimate the 3D motion of the limbs, they are naturally body shape-invariant. [sent-46, score-1.067]
17 Although they provide a better avenue for a body-shape invariant gait descriptor, they have not been able to capture the nuances of human gait as finely as silhouette-based methods. [sent-55, score-1.925]
18 Separating body shape from gait is related to the classic problem of “style vs. content.” [sent-56, score-1.084]
19 Elgammal and Lee study the automatic decomposition of style and content specifically for human gait [10]. [sent-58, score-0.991]
20 In our case, body shape is the “style,” which can be modified by clothing or occlusions, while gait is the “content” that we wish to capture. [sent-59, score-0.187]
21 There have been several works that specifically study the entanglement of body shape and gait for person recognition [8, 38, 39]. [sent-62, score-1.136]
22 They all conclude that recognition becomes strong only when both body shape and gait are used together. [sent-63, score-1.092]
23 The advantage of a pure gait representation is that it is optimal for applications where body shape is orthogonal to the things we would like to measure. [sent-71, score-1.124]
24 For example, determining the general emotional state of a person based on their gait should function the same despite their body shape. [sent-72, score-1.096]
25 In a medical setting, automatically determining rehabilitative progress from a gait sequence would also need to ignore body shape. [sent-73, score-1.052]
26 On the other hand, accurately recovering 3D posture is difficult unless we rely on a statistical model that heavily regulates potential poses, which would wash out the subtle differences in gait that we are interested in. [sent-75, score-0.957]
27 Our goal is to devise an image-based gait representation that factors out the body shape. [sent-76, score-1.066]
28 We achieve this with a novel gait representation based on a statistical distribution of the optical flow: the Two-Point Gait. [sent-77, score-1.042]
29 We focus on optical flow rather than silhouette as it primarily encodes the motion of a person rather than shape and can also be directly computed from the image without any manual intervention. [sent-78, score-0.242]
30 Our key idea is to extract the statistical characteristics of these optical flow fields that encode the gait of the person. [sent-80, score-1.085]
31 For two optical flow vectors, a and b, the two-point statistics is defined as the spatial distribution of pairs of pixels in the image whose optical flow vectors are a and b, respectively. [sent-83, score-0.292]
32 For example, if a points left and b right, the two-point statistics will encode information about the arm movement versus the leg movement because they move in opposition during much of the gait cycle. [sent-87, score-1.052]
33 We expect this representation to be very robust to body shape difference because it is principally encoding the changing spatial distribution of limbs rather than their size. [sent-88, score-0.182]
34 We introduce a synthetic data set that contains gait motion from the Carnegie Mellon University Motion Capture Database [2] realized with a set of synthetic body shapes created with MakeHuman [3]. [sent-90, score-1.128]
35 By examining the distance matrix of these synthetic data sets, we show that the Two- Point Gait is robust to body shape and appearance variations. [sent-91, score-0.175]
36 First, we demonstrate how the Two-Point Gait representation naturally encodes gait motion into an intuitive gait space in which the distance between Two-Point Gaits tells us how similarly two people walk regardless of their body shapes. [sent-93, score-2.102]
37 We show that, when combined with a body shape representation, the TwoPoint Gait achieves the state-of-the-art accuracy for gait recognition with clothing variation. [sent-95, score-1.152]
38 These results clearly demonstrate the power and advantage of having a pure gait representation that can be computed from 2D images. [sent-96, score-0.997]
39 Two-Point Gait. Our goal is to develop an image-based gait representation that is as insensitive as possible to body shape. [sent-98, score-1.066]
40 Our key idea is to use the two-point statistics to encode the gait observed in optical flow fields. [sent-107, score-1.108]
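To make this step concrete, the sketch below computes dense optical flow between consecutive frames and quantizes it into orientation bins, the form in which the two-point statistics are later accumulated. The choice of Farneback flow, the number of bins, and the magnitude threshold are illustrative assumptions rather than the exact settings used by the authors.

```python
import cv2
import numpy as np

def quantized_flow(prev_gray, next_gray, n_bins=8, mag_thresh=0.5):
    """Dense optical flow between two single-channel 8-bit frames,
    quantized into orientation bins.

    Returns an integer label map: -1 for pixels with negligible motion,
    otherwise the orientation bin index (0..n_bins-1).
    """
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, next_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    fx, fy = flow[..., 0], flow[..., 1]
    mag = np.sqrt(fx ** 2 + fy ** 2)
    ang = np.arctan2(fy, fx)                       # in (-pi, pi]
    bins = ((ang + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
    bins[mag < mag_thresh] = -1                    # ignore near-static pixels
    return bins
```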
41 Two-point statistics of optical flow. The most straightforward application of two-point statistics to optical flow would be to define a probability function, P(d|a, b) ∝ Σ_x 1[f(x) = a] · 1[f(x + d) = b], i.e., the frequency with which two pixels separated by displacement d carry optical flow vectors a and b, where f(x) denotes the optical flow at pixel x. [sent-112, score-0.284]
42 This formulation has the advantage of accounting for optical flow magnitude simply without requiring that we store the two-point statistics for a large amount of optical flow vector pairs (a, b). [sent-161, score-0.275]
43 Figure 1 visualizes an example gait sequence alongside its corresponding Two-Point Gait. [sent-163, score-0.957]
44 The representation is robust to body shape variation because it is the spatial distribution of limbs that is encoded rather than the size of the limbs. [sent-171, score-0.197]
45 After the optical flow has been quantized, we can compute the two-point statistics of optical flow for two orientations a and b, t_{a,b}(d) = Σ_x 1[q(x) = a] · 1[q(x + d) = b], where q(x) is the quantized flow orientation at pixel x. [sent-178, score-0.261]
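A minimal sketch of this computation for a single orientation pair (a, b) is given below: it histograms the displacement vectors between every pixel labelled a and every pixel labelled b in a quantized flow map such as the one produced above. The displacement range, bin size, and normalization are illustrative choices, not values from the paper.

```python
import numpy as np

def two_point_statistics(labels, a, b, max_d=32, bin_size=4):
    """Histogram of displacements d = x_b - x_a over all pixel pairs whose
    quantized flow orientations are a and b, respectively."""
    pa = np.argwhere(labels == a).astype(np.float32)   # (Na, 2) row, col
    pb = np.argwhere(labels == b).astype(np.float32)   # (Nb, 2)
    n = 2 * max_d // bin_size
    if len(pa) == 0 or len(pb) == 0:
        return np.zeros((n, n))
    # All Na x Nb pairwise displacements via broadcasting; this quadratic
    # cost is why the flow is downsampled in practice (see below).
    d = (pb[None, :, :] - pa[:, None, :]).reshape(-1, 2)
    edges = np.arange(-max_d, max_d + bin_size, bin_size)
    hist, _, _ = np.histogram2d(d[:, 0], d[:, 1], bins=[edges, edges])
    return hist / max(hist.sum(), 1.0)                 # normalized distribution
```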
46 We also normalize for the length of the gait cycle by quantizing into temporal bins. [sent-187, score-0.988]
47 We also divide each Two-Point Gait by its cycle length so that the temporal quantization does not favor longer gait cycles. [sent-191, score-0.98]
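One plausible way to implement this temporal normalization is sketched below, assuming one Two-Point Gait slice per frame of a detected gait cycle; the number of temporal bins is an illustrative choice.

```python
import numpy as np

def temporally_normalize(per_frame_tpg, n_temporal_bins=16):
    """per_frame_tpg: array of shape (T, ...) holding one Two-Point Gait
    slice per frame of a single gait cycle of length T frames."""
    T = per_frame_tpg.shape[0]
    binned = np.zeros((n_temporal_bins,) + per_frame_tpg.shape[1:])
    for t in range(T):
        k = min(int(t / T * n_temporal_bins), n_temporal_bins - 1)
        binned[k] += per_frame_tpg[t]
    return binned / T   # divide by cycle length so long cycles are not favored
```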
48 We do this by downsampling the optical flow histogram computed on the full size images and then computing the two-point statistics on the downsampled optical flow. [sent-196, score-0.215]
49 If the original image was W × H pixels, the running time drops from O(W²H²) without optical flow downsampling to a much smaller cost with optical flow downsampling. [sent-198, score-0.248]
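One way to realize this step is sketched below, assuming the quantized flow is first accumulated into a per-pixel orientation histogram and then block-summed before the pairwise statistics are computed; the block factor is an illustrative parameter.

```python
import numpy as np

def downsample_flow_histogram(hist, factor=4):
    """Block-sum a per-pixel orientation histogram of shape (H, W, n_bins)
    by an integer factor. Since the pair cost grows with the square of the
    number of pixels, this reduces the cost by roughly factor**4."""
    H, W, n_bins = hist.shape
    H2, W2 = H // factor, W // factor
    hist = hist[:H2 * factor, :W2 * factor]          # crop to a multiple of factor
    return hist.reshape(H2, factor, W2, factor, n_bins).sum(axis=(1, 3))
```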
50 Let S be the set of all pairs of Two-Point Gaits computed from gait sequences of a single person (i.e., (t(i), t(j)) ∈ S if and only if t(i) and t(j) were computed from gait sequences of a single person), and let D be the set of all pairs of Two-Point Gaits not from the same person. [sent-218, score-1.006]
51 That is, (t(i), t(j)) ∈ D if and only if t(i) and t(j) were computed from gait sequences of different people; we then set up an optimization problem over a weighting w, argmin_w, that makes same-person pairs closer than different-person pairs under the weighted distance. [sent-220, score-0.957]
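As an illustration of how such pair sets can drive a weighting of the representation, the sketch below computes non-negative per-bin weights that emphasize bins where different-person pairs (D) differ more than same-person pairs (S). This simple closed-form surrogate is an assumption for illustration, not the optimization actually used in the paper.

```python
import numpy as np

def learn_bin_weights(S_pairs, D_pairs):
    """S_pairs, D_pairs: lists of (t_i, t_j) Two-Point Gait pairs, each
    flattened to a 1-D vector. Returns non-negative per-bin weights that
    emphasize bins where different-person pairs differ more than
    same-person pairs (one simple surrogate for a learned metric)."""
    dS = np.array([(ti - tj) ** 2 for ti, tj in S_pairs]).mean(axis=0)
    dD = np.array([(ti - tj) ** 2 for ti, tj in D_pairs]).mean(axis=0)
    w = np.clip(dD - dS, 0.0, None)          # keep only discriminative bins
    return w / max(np.linalg.norm(w), 1e-12)

def weighted_distance(t_i, t_j, w):
    """Weighted squared distance between two flattened Two-Point Gaits."""
    return float(np.sum(w * (t_i - t_j) ** 2))
```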
52 We will show that many of the orientation pairs have an intuitive meaning and correspond to parts of the gait cycle. [sent-230, score-0.993]
53 For example, by examining the Two-Point Gait of optical flow vectors pointing left and right, we can observe the changing spatial relationship of the arms versus the legs. [sent-238, score-0.166]
54 The left column shows a frame of video, the middle column shows the optical flow for that frame, and the right column shows the Two-Point Gait for orientation pair (right, left). [sent-240, score-0.185]
55 That is, the blue dots have optical flow vectors pointing right, and the red dots have optical flow vectors pointing left. [sent-242, score-0.272]
56 Here we show two distance matrices from our synthetic data set to compare the robustness to body shape variation between the Two-Point Gait and a silhouette-based method [4]. [sent-246, score-0.192]
57 The distance matrices have 10 rows and columns, each corresponding to a combination of a motion-captured gait pattern retargeted to a synthetic body shape. [sent-247, score-1.117]
58 The block structure of the matrix in column (a) shows the TwoPoint Gait’s robustness to body shape variation. [sent-248, score-0.158]
59 Robustness to body shape and appearance. One goal of our representation is to be as insensitive as possible to changes in body shape. [sent-253, score-0.236]
60 The distance matrices have 10 rows and columns, each corresponding to a combination of a motion-captured gait pattern retargeted to a synthetic body shape. [sent-259, score-1.117]
61 We then selected several motion-captured gait patterns from the Carnegie Mellon University Motion Capture Database [2]. [sent-263, score-0.966]
62 Using Blender [1], we retargeted the motion capture animations to our synthetic human skeletons to create a synthetic gait database. [sent-264, score-1.05]
63 The elements of the 10×10 distance matrix being compared are all possible pairs of 5 synthetic body shapes with two unique gait patterns. [sent-266, score-1.101]
64 The block structure of the distance matrix shows that the representation is extremely robust to body shape variation and discriminates instead against the actual gait. [sent-267, score-0.178]
65 The silhouette-based method does not have the block structure showing its inability to decouple gait from body shape. [sent-269, score-1.061]
66 Another important quality for the representation is robustness to changes in body appearance (texture) which also affects optical flow accuracy. [sent-270, score-0.239]
67 We demonstrate this by showing a 10×10 distance matrix consisting of all combinations of a synthetic human model with five clothing variations with two unique gait patterns. [sent-272, score-1.054]
68 We first show how it enables us to map out the gait space, the space of gait patterns of different people. [sent-276, score-1.923]
69 The space of gait patterns. We can use the Two-Point Gait as a representation for an individual’s walk and use the distance between different people’s Two-Point Gaits to explore the space of gait patterns, that is, the entire space of how people walk. [sent-282, score-1.981]
70 We may, however, use manifold learning to extract a lower-dimensional manifold on which these gait patterns of different people lie. [sent-284, score-0.984]
71 We use locally linear embedding [30] to visualize this low-dimensional gait space. [sent-285, score-0.957]
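A minimal sketch of this visualization step, assuming each Two-Point Gait has been flattened to a fixed-length vector; the neighborhood size is an illustrative parameter.

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

def gait_space(tpg_vectors, n_neighbors=10):
    """tpg_vectors: (n_people, n_features) array of flattened Two-Point Gaits.
    Returns 2-D coordinates for visualizing the gait space."""
    lle = LocallyLinearEmbedding(n_neighbors=n_neighbors, n_components=2)
    return lle.fit_transform(np.asarray(tpg_vectors))
```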
72 Figure 5 shows the gait space together with several individuals who lie close to each other in this space. [sent-286, score-0.965]
73 This gait space can facilitate the study of human locomotion as a function of other attributes such as gender and age, which is only possible with a representation that decouples gait and body shape. [sent-291, score-2.071]
74 Gait recognition. The Two-Point Gait, as a pure gait representation, can be used for gait recognition. [sent-294, score-1.948]
75 Current methods heavily rely on silhouette-based representations, which strongly indicates that recognition is performed on a mixture of gait and body shape. [sent-296, score-1.06]
76 When the person is walking with almost the same clothes and viewing conditions in both the gallery and the probe, it is obvious that representations that primarily encode the body shape would suffice. [sent-299, score-0.188]
77 Obtaining a pure gait representation, however, is still crucial to tackle gait-based person identification. [sent-300, score-1.018]
78 Without disentangling gait and body shape, we cannot understand which contributes more to the recognition and hope to combine them in an optimal manner. [sent-301, score-1.078]
79 To this end, we evaluate the use of Two-Point Gait for person identification on the OU-ISIR clothing variation data set which is arguably the largest and most challenging gait recognition data [4]. [sent-304, score-1.1]
80 The data set consists of 68 subjects each with at most 32 combinations of very different clothing for a total of 2,746 gait sequences divided into three subsets: training, gallery, and probe. [sent-305, score-1.017]
81 That is, it essentially encodes the body shape as the mean silhouette of the gait cycle. [sent-311, score-0.254]
82 We combine Two-Point Gait with GEI by first computing the distance matrices for all gallery and probe pairs separately for each representation, and then by taking a linear combination to form a new combined distance matrix (in our experiments we used 45% GEI + 55% TPG) after proper scaling. [sent-313, score-1.024]
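A sketch of this score-level fusion is given below, with min-max rescaling standing in for the unspecified "proper scaling" step; only the 45%/55% weights come from the text, the rest is an illustrative choice.

```python
import numpy as np

def combine_distances(d_gei, d_tpg, w_gei=0.45, w_tpg=0.55):
    """d_gei, d_tpg: (n_probe, n_gallery) distance matrices from the two
    representations. Each is rescaled to [0, 1] (one simple choice of
    scaling) before the weighted combination."""
    def rescale(d):
        d = np.asarray(d, dtype=np.float64)
        return (d - d.min()) / max(d.max() - d.min(), 1e-12)
    return w_gei * rescale(d_gei) + w_tpg * rescale(d_tpg)
```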
83 This is expected because the Two-Point Gait recognizes gait and ignores body shape. [sent-321, score-1.052]
84 This is the accuracy one can hope to achieve for person identification based solely on pure gait observation. [sent-323, score-1.043]
85 The GEI is the average silhouette over the gait period and therefore is essentially encoding the body shape. [sent-326, score-1.073]
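For reference, the Gait Energy Image (GEI) used here as the body shape representation is the per-pixel average of the aligned binary silhouettes over one gait cycle; a minimal sketch:

```python
import numpy as np

def gait_energy_image(silhouettes):
    """silhouettes: (T, H, W) array of aligned binary silhouettes covering
    one gait cycle. The GEI is their per-pixel average, which mainly
    captures body shape rather than motion."""
    return np.asarray(silhouettes, dtype=np.float32).mean(axis=0)
```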
86 When trained on the clothing variation training data, the GEI learns to discriminate body shapes in the space of intra-class variation of specific clothing types. [sent-327, score-0.254]
87 In particular, notice the similarity of the arm and leg positions for a given gait cycle position. [sent-338, score-1.002]
88 The ROC curves show that we achieve state-of-the-art results with Two-Point Gait combined with GEI and clearly show the advantage of disentangling gait and body shape. [sent-341, score-1.07]
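For completeness, a sketch of computing an ROC curve and equal error rate from a probe-versus-gallery distance matrix; treating same-identity pairs as positives is the usual verification convention and is an assumption here.

```python
import numpy as np
from sklearn.metrics import roc_curve

def roc_and_eer(distances, probe_ids, gallery_ids):
    """distances: (n_probe, n_gallery) matrix, smaller = more similar.
    Returns (fpr, tpr, eer) treating same-identity pairs as positives."""
    labels = (np.asarray(probe_ids)[:, None] == np.asarray(gallery_ids)[None, :])
    scores = -np.asarray(distances)                 # higher score = more similar
    fpr, tpr, _ = roc_curve(labels.ravel().astype(int), scores.ravel())
    eer_idx = np.argmin(np.abs(fpr - (1.0 - tpr)))  # point where FPR ~= FNR
    eer = (fpr[eer_idx] + (1.0 - tpr[eer_idx])) / 2.0
    return fpr, tpr, eer
```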
89 In other words, GEI with LDA is not learning gait motion but rather learning a better and more subtle shape classifier. [sent-344, score-1.004]
90 This result (0.071) shows the strength of modeling gait separately from body shape. [sent-346, score-1.052]
91 This shows that disentangling shape from gait is extremely important: the only way to get such greatly improved performance is by starting with a pure gait representation and combining it with a pure shape representation. [sent-352, score-2.048]
92 These results demonstrate the effectiveness of Two-Point Gait as a discriminative gait representation for person identification. [sent-356, score-1.006]
93 Conclusion. In this paper, we introduced a novel gait representation: Two-Point Gait. [sent-358, score-0.957]
94 Two-Point Gait is unique in that it can be directly computed from 2D images yet it encodes purely the gait without regard to the body shape. [sent-359, score-1.072]
95 The experimental results using synthetic gait patterns demonstrate the invariance of the representation to body shape and appearance. [sent-360, score-1.133]
96 We demonstrated the use of Two-Point Gait for exploring the space of people based on their gait and also showed that it allows us to achieve state-of-the-art recognition performance on a challenging gait recognition dataset with clothing variation. [sent-361, score-2.008]
97 Clothing-invariant gait identification using part-based clothing categorization and adaptive weight control. [sent-384, score-1.042]
98 Gait flow image: A silhouette-based gait representation for human identification. [sent-472, score-1.038]
99 Automatic gait recognition via fourier descriptors of deformable objects. [sent-524, score-0.965]
100 The gait identification challenge problem: data sets and baseline algorithm. [sent-546, score-0.982]
wordName wordTfidf (topN-words)
[('gait', 0.957), ('theso', 0.123), ('body', 0.095), ('yd', 0.081), ('lc', 0.067), ('optical', 0.063), ('clothing', 0.06), ('gei', 0.059), ('flow', 0.056), ('ob', 0.056), ('twopoint', 0.055), ('person', 0.035), ('shape', 0.032), ('makihara', 0.031), ('pure', 0.026), ('synthetic', 0.026), ('identification', 0.025), ('hossain', 0.025), ('limbs', 0.024), ('statistics', 0.023), ('cycle', 0.023), ('orientation', 0.022), ('eer', 0.022), ('silhouette', 0.021), ('encodes', 0.02), ('disentangling', 0.018), ('gaits', 0.018), ('locomotion', 0.018), ('makehuman', 0.018), ('nixon', 0.018), ('people', 0.018), ('gallery', 0.017), ('entangle', 0.016), ('afgr', 0.016), ('retargeted', 0.015), ('silhouettes', 0.015), ('motion', 0.015), ('variation', 0.015), ('movement', 0.015), ('lda', 0.014), ('pairs', 0.014), ('style', 0.014), ('representation', 0.014), ('distance', 0.013), ('walk', 0.013), ('blender', 0.012), ('drexel', 0.012), ('gerontology', 0.012), ('mood', 0.012), ('southampton', 0.012), ('stevenage', 0.012), ('yam', 0.012), ('sigal', 0.011), ('human', 0.011), ('left', 0.011), ('pattern', 0.011), ('arm', 0.011), ('cmu', 0.011), ('mobo', 0.011), ('osaka', 0.011), ('tpg', 0.011), ('column', 0.011), ('leg', 0.011), ('robustness', 0.011), ('right', 0.011), ('probe', 0.01), ('downsampling', 0.01), ('gender', 0.01), ('pages', 0.01), ('hv', 0.009), ('pointing', 0.009), ('patterns', 0.009), ('metric', 0.009), ('carnegie', 0.009), ('mellon', 0.009), ('examining', 0.009), ('spatial', 0.009), ('elgammal', 0.009), ('emotional', 0.009), ('directions', 0.009), ('encode', 0.009), ('legs', 0.009), ('arms', 0.009), ('study', 0.009), ('umd', 0.009), ('shapes', 0.009), ('block', 0.009), ('recognition', 0.008), ('articulated', 0.008), ('displacement', 0.008), ('downsample', 0.008), ('quantizing', 0.008), ('distribution', 0.008), ('individuals', 0.008), ('roc', 0.008), ('dots', 0.008), ('decoupling', 0.008), ('decoupled', 0.008), ('limb', 0.008), ('receiver', 0.008)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 430 iccv-2013-Two-Point Gait: Decoupling Gait from Body Shape
Author: Stephen Lombardi, Ko Nishino, Yasushi Makihara, Yasushi Yagi
Abstract: Human gait modeling (e.g., for person identification) largely relies on image-based representations that muddle gait with body shape. Silhouettes, for instance, inherently entangle body shape and gait. For gait analysis and recognition, decoupling these two factors is desirable. Most important, once decoupled, they can be combined for the task at hand, but not if left entangled in the first place. In this paper, we introduce Two-Point Gait, a gait representation that encodes the limb motions regardless of the body shape. Two-Point Gait is directly computed on the image sequence based on the two point statistics of optical flow fields. We demonstrate its use for exploring the space of human gait and gait recognition under large clothing variation. The results show that we can achieve state-of-the-art person recognition accuracy on a challenging dataset.
2 0.05574441 143 iccv-2013-Estimating Human Pose with Flowing Puppets
Author: Silvia Zuffi, Javier Romero, Cordelia Schmid, Michael J. Black
Abstract: We address the problem of upper-body human pose estimation in uncontrolled monocular video sequences, without manual initialization. Most current methods focus on isolated video frames and often fail to correctly localize arms and hands. Inferring pose over a video sequence is advantageous because poses of people in adjacent frames exhibit properties of smooth variation due to the nature of human and camera motion. To exploit this, previous methods have used prior knowledge about distinctive actions or generic temporal priors combined with static image likelihoods to track people in motion. Here we take a different approach based on a simple observation: Information about how a person moves from frame to frame is present in the optical flow field. We develop an approach for tracking articulated motions that “links” articulated shape models of peo- ple in adjacent frames through the dense optical flow. Key to this approach is a 2D shape model of the body that we use to compute how the body moves over time. The resulting “flowing puppets ” provide a way of integrating image evidence across frames to improve pose inference. We apply our method on a challenging dataset of TV video sequences and show state-of-the-art performance.
3 0.05148571 403 iccv-2013-Strong Appearance and Expressive Spatial Models for Human Pose Estimation
Author: Leonid Pishchulin, Mykhaylo Andriluka, Peter Gehler, Bernt Schiele
Abstract: Typical approaches to articulated pose estimation combine spatial modelling of the human body with appearance modelling of body parts. This paper aims to push the state-of-the-art in articulated pose estimation in two ways. First we explore various types of appearance representations aiming to substantially improve the bodypart hypotheses. And second, we draw on and combine several recently proposed powerful ideas such as more flexible spatial models as well as image-conditioned spatial models. In a series of experiments we draw several important conclusions: (1) we show that the proposed appearance representations are complementary; (2) we demonstrate that even a basic tree-structure spatial human body model achieves state-ofthe-art performance when augmented with the proper appearance representation; and (3) we show that the combination of the best performing appearance model with a flexible image-conditioned spatial model achieves the best result, significantly improving over the state of the art, on the “Leeds Sports Poses ” and “Parse ” benchmarks.
4 0.049232766 317 iccv-2013-Piecewise Rigid Scene Flow
Author: Christoph Vogel, Konrad Schindler, Stefan Roth
Abstract: Estimating dense 3D scene flow from stereo sequences remains a challenging task, despite much progress in both classical disparity and 2D optical flow estimation. To overcome the limitations of existing techniques, we introduce a novel model that represents the dynamic 3D scene by a collection of planar, rigidly moving, local segments. Scene flow estimation then amounts to jointly estimating the pixelto-segment assignment, and the 3D position, normal vector, and rigid motion parameters of a plane for each segment. The proposed energy combines an occlusion-sensitive data term with appropriate shape, motion, and segmentation regularizers. Optimization proceeds in two stages: Starting from an initial superpixelization, we estimate the shape and motion parameters of all segments by assigning a proposal from a set of moving planes. Then the pixel-to-segment assignment is updated, while holding the shape and motion parameters of the moving planes fixed. We demonstrate the benefits of our model on different real-world image sets, including the challenging KITTI benchmark. We achieve leading performance levels, exceeding competing 3D scene flow methods, and even yielding better 2D motion estimates than all tested dedicated optical flow techniques.
5 0.045261715 306 iccv-2013-Paper Doll Parsing: Retrieving Similar Styles to Parse Clothing Items
Author: Kota Yamaguchi, M. Hadi Kiapour, Tamara L. Berg
Abstract: Clothing recognition is an extremely challenging problem due to wide variation in clothing item appearance, layering, and style. In this paper, we tackle the clothing parsing problem using a retrieval based approach. For a query image, we find similar styles from a large database of tagged fashion images and use these examples to parse the query. Our approach combines parsing from: pre-trained global clothing models, local clothing models learned on theflyfrom retrieved examples, and transferredparse masks (paper doll item transfer) from retrieved examples. Experimental evaluation shows that our approach significantly outperforms state of the art in parsing accuracy.
6 0.044955533 105 iccv-2013-DeepFlow: Large Displacement Optical Flow with Deep Matching
7 0.043398309 300 iccv-2013-Optical Flow via Locally Adaptive Fusion of Complementary Data Costs
8 0.041384716 341 iccv-2013-Real-Time Body Tracking with One Depth Camera and Inertial Sensors
9 0.041230138 12 iccv-2013-A General Dense Image Matching Framework Combining Direct and Feature-Based Costs
10 0.03982383 78 iccv-2013-Coherent Motion Segmentation in Moving Camera Videos Using Optical Flow Orientations
11 0.037236296 335 iccv-2013-Random Faces Guided Sparse Many-to-One Encoder for Pose-Invariant Face Recognition
12 0.036088571 225 iccv-2013-Joint Segmentation and Pose Tracking of Human in Natural Videos
13 0.035624366 39 iccv-2013-Action Recognition with Improved Trajectories
14 0.034969121 449 iccv-2013-What Do You Do? Occupation Recognition in a Photo via Social Context
15 0.032000281 204 iccv-2013-Human Attribute Recognition by Rich Appearance Dictionary
16 0.030824108 160 iccv-2013-Fast Object Segmentation in Unconstrained Video
17 0.030285748 37 iccv-2013-Action Recognition and Localization by Hierarchical Space-Time Segments
18 0.029341543 256 iccv-2013-Locally Affine Sparse-to-Dense Matching for Motion and Occlusion Estimation
19 0.028973529 316 iccv-2013-Pictorial Human Spaces: How Well Do Humans Perceive a 3D Articulated Pose?
20 0.028473809 65 iccv-2013-Breaking the Chain: Liberation from the Temporal Markov Assumption for Tracking Human Poses
topicId topicWeight
[(0, 0.058), (1, -0.01), (2, -0.0), (3, 0.02), (4, 0.011), (5, -0.016), (6, 0.013), (7, 0.013), (8, 0.015), (9, 0.033), (10, 0.003), (11, 0.007), (12, 0.029), (13, -0.026), (14, -0.007), (15, 0.023), (16, -0.03), (17, -0.014), (18, 0.037), (19, 0.015), (20, 0.065), (21, -0.008), (22, 0.04), (23, -0.02), (24, -0.01), (25, -0.014), (26, 0.033), (27, 0.016), (28, -0.001), (29, -0.008), (30, -0.019), (31, -0.039), (32, -0.011), (33, -0.022), (34, -0.019), (35, -0.001), (36, 0.025), (37, 0.009), (38, -0.004), (39, -0.005), (40, 0.021), (41, -0.014), (42, -0.019), (43, 0.008), (44, 0.023), (45, -0.012), (46, -0.021), (47, -0.01), (48, 0.049), (49, -0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.91056067 430 iccv-2013-Two-Point Gait: Decoupling Gait from Body Shape
Author: Stephen Lombardi, Ko Nishino, Yasushi Makihara, Yasushi Yagi
Abstract: Human gait modeling (e.g., for person identification) largely relies on image-based representations that muddle gait with body shape. Silhouettes, for instance, inherently entangle body shape and gait. For gait analysis and recognition, decoupling these two factors is desirable. Most important, once decoupled, they can be combined for the task at hand, but not if left entangled in the first place. In this paper, we introduce Two-Point Gait, a gait representation that encodes the limb motions regardless of the body shape. Two-Point Gait is directly computed on the image sequence based on the two point statistics of optical flow fields. We demonstrate its use for exploring the space of human gait and gait recognition under large clothing variation. The results show that we can achieve state-of-the-art person recognition accuracy on a challenging dataset.
2 0.75434482 143 iccv-2013-Estimating Human Pose with Flowing Puppets
Author: Silvia Zuffi, Javier Romero, Cordelia Schmid, Michael J. Black
Abstract: We address the problem of upper-body human pose estimation in uncontrolled monocular video sequences, without manual initialization. Most current methods focus on isolated video frames and often fail to correctly localize arms and hands. Inferring pose over a video sequence is advantageous because poses of people in adjacent frames exhibit properties of smooth variation due to the nature of human and camera motion. To exploit this, previous methods have used prior knowledge about distinctive actions or generic temporal priors combined with static image likelihoods to track people in motion. Here we take a different approach based on a simple observation: Information about how a person moves from frame to frame is present in the optical flow field. We develop an approach for tracking articulated motions that “links” articulated shape models of peo- ple in adjacent frames through the dense optical flow. Key to this approach is a 2D shape model of the body that we use to compute how the body moves over time. The resulting “flowing puppets ” provide a way of integrating image evidence across frames to improve pose inference. We apply our method on a challenging dataset of TV video sequences and show state-of-the-art performance.
3 0.62463081 300 iccv-2013-Optical Flow via Locally Adaptive Fusion of Complementary Data Costs
Author: Tae Hyun Kim, Hee Seok Lee, Kyoung Mu Lee
Abstract: Many state-of-the-art optical flow estimation algorithms optimize the data and regularization terms to solve ill-posed problems. In this paper, in contrast to the conventional optical flow framework that uses a single or fixed data model, we study a novel framework that employs locally varying data term that adaptively combines different multiple types of data models. The locally adaptive data term greatly reduces the matching ambiguity due to the complementary nature of the multiple data models. The optimal number of complementary data models is learnt by minimizing the redundancy among them under the minimum description length constraint (MDL). From these chosen data models, a new optical flow estimation energy model is designed with the weighted sum of the multiple data models, and a convex optimization-based highly effective and practical solution thatfinds the opticalflow, as well as the weights isproposed. Comparative experimental results on the Middlebury optical flow benchmark show that the proposed method using the complementary data models outperforms the state-ofthe art methods.
4 0.58722657 12 iccv-2013-A General Dense Image Matching Framework Combining Direct and Feature-Based Costs
Author: Jim Braux-Zin, Romain Dupont, Adrien Bartoli
Abstract: Dense motion field estimation (typically optical flow, stereo disparity and surface registration) is a key computer vision problem. Many solutions have been proposed to compute small or large displacements, narrow or wide baseline stereo disparity, but a unified methodology is still lacking. We here introduce a general framework that robustly combines direct and feature-based matching. The feature-based cost is built around a novel robust distance function that handles keypoints and “weak” features such as segments. It allows us to use putative feature matches which may contain mismatches to guide dense motion estimation out of local minima. Our framework uses a robust direct data term (AD-Census). It is implemented with a powerful second order Total Generalized Variation regularization with external and self-occlusion reasoning. Our framework achieves state of the art performance in several cases (standard optical flow benchmarks, wide-baseline stereo and non-rigid surface registration). Our framework has a modular design that customizes to specific application needs.
5 0.5794543 317 iccv-2013-Piecewise Rigid Scene Flow
Author: Christoph Vogel, Konrad Schindler, Stefan Roth
Abstract: Estimating dense 3D scene flow from stereo sequences remains a challenging task, despite much progress in both classical disparity and 2D optical flow estimation. To overcome the limitations of existing techniques, we introduce a novel model that represents the dynamic 3D scene by a collection of planar, rigidly moving, local segments. Scene flow estimation then amounts to jointly estimating the pixelto-segment assignment, and the 3D position, normal vector, and rigid motion parameters of a plane for each segment. The proposed energy combines an occlusion-sensitive data term with appropriate shape, motion, and segmentation regularizers. Optimization proceeds in two stages: Starting from an initial superpixelization, we estimate the shape and motion parameters of all segments by assigning a proposal from a set of moving planes. Then the pixel-to-segment assignment is updated, while holding the shape and motion parameters of the moving planes fixed. We demonstrate the benefits of our model on different real-world image sets, including the challenging KITTI benchmark. We achieve leading performance levels, exceeding competing 3D scene flow methods, and even yielding better 2D motion estimates than all tested dedicated optical flow techniques.
6 0.57475483 105 iccv-2013-DeepFlow: Large Displacement Optical Flow with Deep Matching
7 0.54172134 301 iccv-2013-Optimal Orthogonal Basis and Image Assimilation: Motion Modeling
8 0.54042846 316 iccv-2013-Pictorial Human Spaces: How Well Do Humans Perceive a 3D Articulated Pose?
9 0.53711122 256 iccv-2013-Locally Affine Sparse-to-Dense Matching for Motion and Occlusion Estimation
10 0.53370988 78 iccv-2013-Coherent Motion Segmentation in Moving Camera Videos Using Optical Flow Orientations
11 0.52334237 420 iccv-2013-Topology-Constrained Layered Tracking with Latent Flow
12 0.51232028 225 iccv-2013-Joint Segmentation and Pose Tracking of Human in Natural Videos
13 0.49802297 273 iccv-2013-Monocular Image 3D Human Pose Estimation under Self-Occlusion
14 0.48765522 160 iccv-2013-Fast Object Segmentation in Unconstrained Video
16 0.47707939 39 iccv-2013-Action Recognition with Improved Trajectories
17 0.47023547 263 iccv-2013-Measuring Flow Complexity in Videos
18 0.46323898 130 iccv-2013-Dynamic Structured Model Selection
19 0.45811152 403 iccv-2013-Strong Appearance and Expressive Spatial Models for Human Pose Estimation
20 0.45353866 118 iccv-2013-Discovering Object Functionality
topicId topicWeight
[(2, 0.038), (7, 0.035), (12, 0.029), (26, 0.059), (31, 0.038), (34, 0.013), (35, 0.014), (42, 0.085), (60, 0.274), (64, 0.044), (73, 0.026), (89, 0.13), (95, 0.012), (98, 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.73235512 430 iccv-2013-Two-Point Gait: Decoupling Gait from Body Shape
Author: Stephen Lombardi, Ko Nishino, Yasushi Makihara, Yasushi Yagi
Abstract: Human gait modeling (e.g., for person identification) largely relies on image-based representations that muddle gait with body shape. Silhouettes, for instance, inherently entangle body shape and gait. For gait analysis and recognition, decoupling these two factors is desirable. Most important, once decoupled, they can be combined for the task at hand, but not if left entangled in the first place. In this paper, we introduce Two-Point Gait, a gait representation that encodes the limb motions regardless of the body shape. Two-Point Gait is directly computed on the image sequence based on the two point statistics of optical flow fields. We demonstrate its use for exploring the space of human gait and gait recognition under large clothing variation. The results show that we can achieve state-of-the-art person recognition accuracy on a challenging dataset.
2 0.6451503 151 iccv-2013-Exploiting Reflection Change for Automatic Reflection Removal
Author: Yu Li, Michael S. Brown
Abstract: This paper introduces an automatic method for removing reflection interference when imaging a scene behind a glass surface. Our approach exploits the subtle changes in the reflection with respect to the background in a small set of images taken at slightly different view points. Key to this idea is the use of SIFT-flow to align the images such that a pixel-wise comparison can be made across the input set. Gradients with variation across the image set are assumed to belong to the reflected scenes while constant gradients are assumed to belong to the desired background scene. By correctly labelling gradients belonging to reflection or background, the background scene can be separated from the reflection interference. Unlike previous approaches that exploit motion, our approach does not make any assumptions regarding the background or reflected scenes’ geometry, nor requires the reflection to be static. This makes our approach practical for use in casual imaging scenarios. Our approach is straight forward and produces good results compared with existing methods. 1. Introduction and Related Work There are situations when a scene must be imaged behind a pane of glass. This is common when “window shopping” where one takes a photograph of an object behind a window. This is not a conducive setup for imaging as the glass will produce an unwanted layer of reflection in the resulting image. This problem can be treated as one of layer separation [7, 8], where the captured image I a linear combiis nation of a reflection layer IR and the desired background scene, IB, as follows: I IR + IB. = (1) The goal of reflection removal is to separate IB and IR from an input image I shown in Figure 1. as This problem is ill-posed, as it requires extracting two layers from one image. To make the problem tractable additional information, either supplied from the user or from Fig. 1. Example of our approach separating the background (IB) and reflection (IR) layers of one of the input images. Note that the reflection layer’s contrast has been boosted to improve visualization. multiple images, is required. For example, Levin and Weiss [7, 8] proposed a method where a user labelled image gradients as belonging to either background or reflection. Combing the markup with an optimization that imposed a sparsity prior on the separated images, their method produced compelling results. The only drawback was the need for user intervention. An automatic method was proposed by Levin et al. [9] that found the most likely decomposition which minimized the total number of edges and corners in the recovered image using a database of natural images. As 22443322 with example-based methods, the results were reliant on the similarity of the examples in the database. Another common strategy is to use multiple images. Some methods assume a fixed camera that is able to capture a set of images with different mixing of the layers through various means, e.g. rotating a polarized lens [3, 6, 12, 16, 17], changing focus [15], or applying a flash [1]. While these approaches demonstrate good results, the ability of controlling focal change, polarization, and flash may not always be possible. Sarel and Irani [13, 14] proposed video based methods that work by assuming the two layers, reflection and background, to be statistically uncorrelated. 
These methods can handle complex geometry in the reflection layer, but require a long image sequence such that the reflection layer has significant changes in order for a median-based approach [21] to extract the intrinsic image from the sequence as the initial guess for one of the layers. Techniques closer to ours exploit motion between the layers present in multiple images. In particular, when the background is captured from different points of view, the background and the reflection layers undergo different motions due to their different distance to the transparent layer. One issue with changing viewpoint is handling alignment among the images. Szeliski et al. [19] proposed a method that could simultaneously recover the two layers by assuming they were both static scenes and related by parametric transformations (i.e. homographies). Gai et al. [4, 5] proposed a similar approach that aligned the images in the gradient domain using gradient sparsity, again assuming static scenes. Tsin et al. [20] relaxed the planar scene constraint in [19] and used dense stereo correspondence with stereo matching configuration which limits the camera motion to unidirectional parallel motion. These approaches produce good results, but the constraint on scene geometry and assumed motion of the camera limit the type of scenes that can be processed. Our Contribution Our proposed method builds on the single-image approach by Levin and Weiss [8], but removes the need for user markup by examining the relative motion in a small set (e.g. 3-5) of images to automatically label gradients as either reflection or background. This is done by first aligning the images using SIFT-flow and then examining the variation in the gradients over the image set. Gradients with more variation are assumed to be from reflection while constant gradients are assumed to be from the desired background. While a simple idea, this approach does not impose any restrictions on the scene or reflection geometry. This allows a more practical imaging setup that is suitable for handheld cameras. The remainder of this paper is organized as follows. Section 2 overviews our approach; section 3 compares our results with prior methods on several examples; the paper is concluded in section 4. Warped ? ?Recovered ? ? Recovered ? ? Warp e d ? ?Recover d ? ? Recover d ? ? Fig. 2. This figure shows the separated layers of the first two input images. The layers illustrate that the background image IB has lit- tle variation while the reflection layers, IRi ,have notable variation due to the viewpoint change. 2. Reflection Removal Method 2.1. Imaging Assumption and Procedure The input ofour approach is a small set of k images taken of the scene from slightly varying view points. We assume the background dominates in the mixture image and the images are related by a warping, such that the background is registered and the reflection layer is changing. This relationship can be expressed as: Ii = wi(IRi + IB), (2) where Ii is the i-th mixture image, {wi}, i = 1, . . . , k are warping fuisn tchteio in-sth hcma uisxetud by mthaeg camera viewpoint change with respect to a reference image (in our case I1). Assuming we can estimate the inverse warps, w−i1, where w−11 is the identity, we get the following relationship: wi−1(Ii) = IRi + IB. (3) Even though IB appears static in the mixture image, the problem is still ill-posed given we have more unknowns than the number of input images. 
However, the presence of a static IB in the image set makes it possible to identify gradient edges of the background layer IB and edges of the changing reflection layers IRi . More specifically, edges in IB are assumed to appear every time in the image set while the edges in the reflection layer IRi are assumed to vary across the set. This reflection-change effect can be seen in Figure 2. This means edges can be labelled based on the frequency of a gradient appearing at a particular pixel across the aligned input images. After labelling edges as either background or reflection, we can reconstruct the two layers using an optimization that imposes the sparsity prior on the separated layers as done by [7, 8]. Figure 3 shows the processing pipeline of our approach. Each step is described in the following sections. 22443333 Fig. 3. This figure shows the pipeline of our approach: 1) warping functions are estimated to align the inputs to a reference view; 2) the edges are labelled as either background or foreground based on gradient frequency; 3) a reconstruction step is used to separate the two layers; 4) all recovered background layers are combined together to get the final recovered background. 2.2. Warping Our approach begins by estimating warping functions, w−i1, to register the input to the reference image. Previous approaches estimated these warps using global parametric motion (e.g. homographies [4, 5, 19]), however, the planarity constraint often leads to regions in the image with misalignments when the scene is not planar. Traditional dense correspondence method like optical flow is another option. However, even with our assumption that the background should be more prominent than the reflection layer, optical flow methods (e.g. [2, 18]) that are based on image intensity gave poor performance due to the reflection interference. This led us to try SIFT-flow [10] that is based on more robust image features. SIFT-flow [10] proved to work surprisingly well on our input sequences and provide a dense warp suitable to bring the images into alignment even under moderate interference of reflection. Empirical demonstration of the effectiveness of SIFT-flow in this task as well as the comparison with optical flow are shown in our supplemental materials. Our implementation fixes I1 as the reference, then uses SIFT-flow to estimate the inverse-warping functions {w−i1 }, i= 2, . . . , k for each ofthe input images I2 , . . . , Ik against ,I 1i . = W 2e, a.l.s.o, compute htohef gradient magnitudes Gi of the each input image and then warp the images Ii as well as the gradient magnitudes Gi using the same inverse-warping function w−i1, denoting the warped images and gradient magnitudes as Iˆi and Gˆi. 2.3. Edge separation Our approach first identifies salient edges using a simple threshold on the gradient magnitudes in Gˆi. The resulting binary edge map is denoted as Ei. After edge detection, the edges need to be separated as either background or foreground in each aligned image Iˆi. As previously discussed, the edges of the background layer should appear frequently across all the warped images while the edges of the reflection layer would only have sparse presence. To examine the sparsity of the edge occurrence, we use the following measurement: Φ(y) =??yy??2221, (4) where y is a vector containing the gradient magnitudes at a given pixel location. Since all elements in y are non-negative, we can rewrite equation 4 as Φ(y) = yi)2. This measurement can be conside?red as a L1? normalized L2 norm. 
It measures the sparsity o?f the vecto?r which achieves its maximum value of 1when only one non-zero item exists and achieve its minimum value of k1 when all items are non-zero and have identical values (i.e. y1 = y2 = . . . = yk > 0). This measurement is used to assign two probabilities to each edge pixel as belonging to either background or reflection. We estimate the reflection edge probability by examining ?ik=1 yi2/(?ik=1 22443344 the edge occurrence, as follows: PRi(x) = s?(??iikk==11GGˆˆii((xx))2)2−k1?,(5) Gˆi Iˆi. where, (x) is the gradient magnitude at pixel x of We subtract k1 to move the smallest value close to zero. The sparsity measurement is further stretched by a sigmoid function s(t) = (1 + e−(t−0.05)/0.05)−1 to facilitate the separation. The background edge probability is then estimated by: PBi(x) = s?−?(??iikk==11GGˆˆii((xx))2)2−k1??,(6) where PBi (x) + PRi (x) = ?1. These probabilities are defined only at the pixels that are edges in the image. We consider only edge pixels with relatively high probability in either the background edge probability map or reflection edge probability map. The final edge separation is performed by thresholding the two probability maps as: EBi/Ri(x) =⎨⎧ 10, Ei(x) = 1 aotndhe PrwBiis/eRi(x) > 0.6 Figure 4 shows ⎩the edge separation procedure. 2.4. Layer Reconstruction With the separated edges of the background and the reflection, we can reconstruct the two layers. Levin and Weis- ???????????? Gˆ Fig. 4. Edge separation illustration: 1) shows the all gradient maps in this case we have five input images; 2) plots the gradient values at two position across the five images - top plot is a pixel on a background edge, bottom plot is a pixel on a reflection edge; 3) shows the probability map estimated for each layer; 4) Final edge separation after thresholding the probability maps. s [7, 8] showed that the long tailed distribution of gradients in natural scenes is an effective prior in this problem. This kind of distributions is well modelled by a Laplacian or hyper-Laplacian distribution (P(t) ∝ p = 1for – e−|t|p/s, Laplacian and p < 1 for hyper-Laplacian). In our work, we use Laplacian approximation since the L1 norm converges quickly with good results. For each image Iˆi , we try to maximize the probability P(IBi , IRi ) in order to separate the two layers and this is equivalent to minimizing the cost log P(IBi , IRi ). Following the same deduction tinh e[ c7]o,s tw −ithlo tgheP independent assumption of the two layers (i.e. P(IBi , IRi ) = P(IBi ) · P(IRi )), the objective function becomes: − J(IBi) = ? |(IBi ∗ fn)(x)| + |((Iˆi − IBi) ∗ fn)(x)| ?x, ?n + λ?EBi(x)|((Iˆi − IBi) ∗ fn)(x)| ?x, ?n + λ?ERi(x)|(IBi ?x,n ∗ fn)(x)|, (7) where fn denotes the derivative filters and ∗ is the 2D convolution operator. hFeo rd efrniv, we use trwso a nodri e∗n istat tihoen 2s Dan cdo nt-wo degrees (first order and second order) derivative filters. While the first term in the objective function keeps the gradients of the two layer as sparse as possible, the last two terms force the gradients of IBi at edges positions in EBi to agree with the gradients of input image Iˆi and gradients of IRi at edge positions in ERi agree with the gradients of Iˆi. This equation can be further rewritten in the form of J = ?Au b? 1 and be minimized efficiently using iterative − reweighted lbea?st square [11]. 2.5. Combining the Results Our approach processes each image in the input set independently. 
Due to the reflective glass surface, some of the images may contain saturated regions from specular highlights. When saturation occurs, we can not fully recover the structure in these saturated regions because the information about the two layers are lost. In addition, sometimes the edges of the reflection in some regions are too weak to be correctly distinguished. This can lead to local regions in the background where the reflection is still present. These erroneous regions are often in different places in each input image due to changes in the reflection. In such cases, it is reasonable to assume that the minimum value across all recovered background layers may be a proper approximation of the true background. As such, the last step of our method is to take the minimum of the pixel value of all reconstructed background images as the final recovered background, as follows: IB (x) = mini IBi (x) . 22443355 (8) Fig. 5. This figure shows our combination procedure. The recovered background on each single image is good at first glance but may have reflection remaining in local regions. A simple minimum operator combining all recovered images gives a better result in these regions. The comparison can be seen in the zoomed-in regions. × Based on this, the reflection layer of each input image can be computed by IRi = IB . The effectiveness of this combination procedure is ill−us Itrated in Figure 5. Iˆi − 3. Results In this section, we present the experimental results of our proposed method. Additional results and test cases can be found in the accompanying supplemental materials. The experiments were conducted on an Intel i7? PC (3.4GHz CPU, 8.0GB RAM). The code was implemented in Matlab. We use the SIFT-Flow implementation provided by the authors 1. Matlab code and images used in our paper can be downloaded at the author’s webpage 2. The entire procedure outlined in Figure 3 takes approximately five minutes for a 500 400 image sequence containing up to five images. All t5h0e0 d×at4a0 s0h iomwang are qreuaeln scene captured pu ntodfe irv vea irmioaugse lighting conditions (e.g. indoor, outdoor). Input sequences range from three to five images. Figure 6 shows two examples of our edge separation results and final reconstructed background layers and reflection layers. Our method provides a clear separation of the edges of the two layers which is crucial in the reconstruc- 1http://people.csail.mit.edu/celiu/SIFTflow/SIFTflow.zip 2http://www.comp.nus.edu.sg/ liyu1988/ tion step. Figure 9 shows more reflection removal results of our method. We also compare our methods with those in [8] and [5]. For the method in [8], we use the source code 3 of the author to generate the results. The comparisons between our and [8] are not entirely fair since [8] uses single image to generate the result, while we have the advantage of the entire set. For the results produced by [8], the reference view was used as input. The required user-markup is also provided. For the method in [5], we set the layer number to be one, and estimate the motions of the background layer using their method. In the reconstruction phase, we set the remaining reflection layer in k input mixture images as k different layers, each only appearing once in one mixture. Figure 8 shows the results of two examples. Our results are arguably the best. The results of [8] still exhibited some edges from different layers even with the elaborate user mark-ups. This may be fixed by going back to further refine the user markup. 
But in the heavily overlapping edge regions, it is challenging for users to indicate the edges. If the edges are not clearly indicated the results tend to be over smoothed in one layer. For the method of [5], since it uses global transformations to align images, local misalignment effects often appear in the final recovered background image. Also, their approach uses all the input image into the optimization to recover the layers. This may lead to the result that has edges from different reflection layers of different images mixed and appear as ghosting effect in the recovered background image. For heavily saturated regions, none of the two previous methods can give visually plausible results like ours. 4. Discussion and Conclusion We have presented a method to automatically remove reflectance interference due to a glass surface. Our approach works by capturing a set of images of a scene from slightly varying view points. The images are then aligned and edges are labelled as belonging to either background or reflectance. This alignment was enabled by SIFT-flow, whose robustness to the reflection interference enabled our method. When using SIFT-flow, we assume that the background layer will be the most prominent and will provide sufficient SIFT features for matching. While we found this to work well in practice, images with very strong reflectance can produce poor alignment as SIFT-flow may attempt to align to the foreground which is changing. This will cause problems in the subsequent layer separation. Figure 7 shows such a case. While these failures can often be handled by cropping the image or simple user input (see supplemental material), it is a notable issue. Another challenging issue is when the background scene 3http://www.wisdom.weizmann.ac.il/ levina/papers/reflections.zip 22443366 ??? ??? ?? ??? Fig. 6. Example of edge separation results and recovered background and foreground layer using our method has large homogeneous regions. In such cases there are no edges to be labelled as background. This makes subsequent separation challenging, especially when the reflection interference in these regions is weak but still visually noticeable. While this problem is not unique to our approach, it is an issue to consider. We also found that by combining all the background results of the input images we can overcome Fig. 7. A failure case of our approach due to dominant reflection against the background in some regions (i.e. the upper part of the phonograph). This will cause unsatisfactory alignment of the background in the warping procedure which further lead to our edge separation and final reconstruction failure as can be seen in the figure. local regions with high saturation. While a simple idea, this combination strategy can be incorporated into other techniques to improve their results. Lastly, we believe reflection removal is an application that would be welcomed on many mobile devices, however, the current processing time is still too long for real world use. Exploring ways to speed up the processing pipeline is an area of interest for future work. Acknowledgement This work was supported by Singapore A*STAR PSF grant 11212100. References [1] A. K. Agrawal, R. Raskar, S. K. Nayar, and Y. Li. Removing photography artifacts using gradient projection and flashexposure sampling. ToG, 24(3):828–835, 2005. [2] A. Bruhn, J. Weickert, and C. Schn o¨rr. Lucas/kanade meets horn/schunck: Combining local and global optic flow methods. IJCV, 61(3):21 1–231, 2005. [3] H. Farid and E. H. Adelson. 
Acknowledgement

This work was supported by Singapore A*STAR PSF grant 11212100.

Fig. 8. Two examples of reflection removal results of our method and those in [8] and [5] (columns: Ours, Levin and Weiss [7], Gai et al. [4]; user markup for [8] is provided in the supplemental material). Our method provides more visually pleasing results. The results of [8] still exhibit remaining edges from the reflection and tend to over-smooth some local regions. The results of [5] suffer from misalignment due to their global transformation alignment, which results in a ghosting effect of different layers in the final recovered background image. For the reflection, our results give a very complete and clean recovery of the reflection layer.

Fig. 9. More results of reflection removal using our method in varying scenes (e.g., art museum, street shop, etc.).

References

[1] A. K. Agrawal, R. Raskar, S. K. Nayar, and Y. Li. Removing photography artifacts using gradient projection and flash-exposure sampling. ToG, 24(3):828–835, 2005.
[2] A. Bruhn, J. Weickert, and C. Schnörr. Lucas/Kanade meets Horn/Schunck: Combining local and global optic flow methods. IJCV, 61(3):211–231, 2005.
[3] H. Farid and E. H. Adelson. Separating reflections from images by use of independent component analysis. JOSA A, 16(9):2136–2145, 1999.
[4] K. Gai, Z. Shi, and C. Zhang. Blindly separating mixtures of multiple layers with spatial shifts. In CVPR, 2008.
[5] K. Gai, Z. Shi, and C. Zhang. Blind separation of superimposed moving images using image statistics. TPAMI, 34(1):19–32, 2012.
[6] N. Kong, Y.-W. Tai, and S. Y. Shin. A physically-based approach to reflection separation. In CVPR, 2012.
[7] A. Levin and Y. Weiss. User assisted separation of reflections from a single image using a sparsity prior. In ECCV, 2004.
[8] A. Levin and Y. Weiss. User assisted separation of reflections from a single image using a sparsity prior. TPAMI, 29(9):1647–1654, 2007.
[9] A. Levin, A. Zomet, and Y. Weiss. Separating reflections from a single image using local features. In CVPR, 2004.
[10] C. Liu, J. Yuen, and A. Torralba. SIFT flow: Dense correspondence across scenes and its applications. TPAMI, 33(5):978–994, 2011.
[11] P. Meer. Robust techniques for computer vision. Emerging Topics in Computer Vision, 2004.
[12] N. Ohnishi, K. Kumaki, T. Yamamura, and T. Tanaka. Separating real and virtual objects from their overlapping images. In ECCV, 1996.
[13] B. Sarel and M. Irani. Separating transparent layers through layer information exchange. In ECCV, 2004.
[14] B. Sarel and M. Irani. Separating transparent layers of repetitive dynamic behaviors. In ICCV, 2005.
[15] Y. Y. Schechner, N. Kiryati, and R. Basri. Separation of transparent layers using focus. IJCV, 39(1):25–39, 2000.
[16] Y. Y. Schechner, J. Shamir, and N. Kiryati. Polarization-based decorrelation of transparent layers: The inclination angle of an invisible surface. In ICCV, 1999.
[17] Y. Y. Schechner, J. Shamir, and N. Kiryati. Polarization and statistical analysis of scenes containing a semireflector. JOSA A, 17(2):276–284, 2000.
[18] D. Sun, S. Roth, and M. Black. Secrets of optical flow estimation and their principles. In CVPR, 2010.
[19] R. Szeliski, S. Avidan, and P. Anandan. Layer extraction from multiple images containing reflections and transparency. In CVPR, 2000.
[20] Y. Tsin, S. B. Kang, and R. Szeliski. Stereo matching with linear superposition of layers. TPAMI, 28(2):290–301, 2006.
[21] Y. Weiss. Deriving intrinsic images from image sequences. In ICCV, 2001.
3 0.64323652 270 iccv-2013-Modeling Self-Occlusions in Dynamic Shape and Appearance Tracking
Author: Yanchao Yang, Ganesh Sundaramoorthi
Abstract: We present a method to track the precise shape of a dynamic object in video. Joint dynamic shape and appearance models, in which a template of the object is propagated to match the object shape and radiance in the next frame, are advantageous over methods employing global image statistics in cases of complex object radiance and cluttered background. In cases of complex 3D object motion and relative viewpoint change, self-occlusions and disocclusions of the object are prominent, and current methods employing joint shape and appearance models are unable to accurately adapt to new shape and appearance information, leading to inaccurate shape detection. In this work, we model self-occlusions and dis-occlusions in a joint shape and appearance tracking framework. Experiments on video exhibiting occlusion/dis-occlusion, complex radiance and background show that occlusion/dis-occlusion modeling leads to superior shape accuracy compared to recent methods employing joint shape/appearance models or employing global statistics.
4 0.5485785 338 iccv-2013-Randomized Ensemble Tracking
Author: Qinxun Bai, Zheng Wu, Stan Sclaroff, Margrit Betke, Camille Monnier
Abstract: We propose a randomized ensemble algorithm to model the time-varying appearance of an object for visual tracking. In contrast with previous online methods for updating classifier ensembles in tracking-by-detection, the weight vector that combines weak classifiers is treated as a random variable and the posterior distribution for the weight vector is estimated in a Bayesian manner. In essence, the weight vector is treated as a distribution that reflects the confidence among the weak classifiers used to construct and adapt the classifier ensemble. The resulting formulation models the time-varying discriminative ability among weak classifiers so that the ensembled strong classifier can adapt to the varying appearance, backgrounds, and occlusions. The formulation is tested in a tracking-by-detection implementation. Experiments on 28 challenging benchmark videos demonstrate that the proposed method can achieve results comparable to and often better than those of state-of-the-art approaches.
5 0.54701728 349 iccv-2013-Regionlets for Generic Object Detection
Author: Xiaoyu Wang, Ming Yang, Shenghuo Zhu, Yuanqing Lin
Abstract: Generic object detection is confronted by dealing with different degrees of variation in distinct object classes with tractable computations, which demands descriptive and flexible object representations that are also efficient to evaluate at many locations. In view of this, we propose to model an object class by a cascaded boosting classifier which integrates various types of features from competing local regions, named regionlets. A regionlet is a base feature extraction region defined proportionally to a detection window at an arbitrary resolution (i.e. size and aspect ratio). These regionlets are organized in small groups with stable relative positions to delineate fine-grained spatial layouts inside objects. Their features are aggregated to a one-dimensional feature within one group so as to tolerate deformations. Then we evaluate the object bounding box proposals from selective search based on segmentation cues, limiting the evaluation locations to thousands. Our approach significantly outperforms the state-of-the-art on popular multi-class detection benchmark datasets with a single method, without any contexts. It achieves a detection mean average precision of 41.7% on the PASCAL VOC 2007 dataset and 39.7% on VOC 2010 for 20 object categories. It achieves 14.7% mean average precision on the ImageNet dataset for 200 object categories, outperforming the latest deformable part-based model (DPM) by 4.7%.
6 0.54671711 328 iccv-2013-Probabilistic Elastic Part Model for Unsupervised Face Detector Adaptation
7 0.54533184 180 iccv-2013-From Where and How to What We See
9 0.54478741 376 iccv-2013-Scene Text Localization and Recognition with Oriented Stroke Detection
10 0.54447758 150 iccv-2013-Exemplar Cut
11 0.5430181 182 iccv-2013-GOSUS: Grassmannian Online Subspace Updates with Structured-Sparsity
12 0.5429734 427 iccv-2013-Transfer Feature Learning with Joint Distribution Adaptation
13 0.54269439 157 iccv-2013-Fast Face Detector Training Using Tailored Views
14 0.54262221 59 iccv-2013-Bayesian Joint Topic Modelling for Weakly Supervised Object Localisation
15 0.54254359 379 iccv-2013-Semantic Segmentation without Annotating Segments
16 0.5425005 330 iccv-2013-Proportion Priors for Image Sequence Segmentation
17 0.54240972 316 iccv-2013-Pictorial Human Spaces: How Well Do Humans Perceive a 3D Articulated Pose?
18 0.54153585 187 iccv-2013-Group Norm for Learning Structured SVMs with Unstructured Latent Variables
19 0.54094833 65 iccv-2013-Breaking the Chain: Liberation from the Temporal Markov Assumption for Tracking Human Poses
20 0.54089582 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection