nips nips2011 nips2011-303 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Carl Vondrick, Deva Ramanan
Abstract: We introduce a novel active learning framework for video annotation. By judiciously choosing which frames a user should annotate, we can obtain highly accurate tracks with minimal user effort. We cast this problem as one of active learning, and show that we can obtain excellent performance by querying frames that, if annotated, would produce a large expected change in the estimated object track. We implement a constrained tracker and compute the expected change for putative annotations with efficient dynamic programming algorithms. We demonstrate our framework on four datasets, including two benchmark datasets constructed with key frame annotations obtained by Amazon Mechanical Turk. Our results indicate that we could obtain equivalent labels for a small fraction of the original cost. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: We introduce a novel active learning framework for video annotation. [sent-4, score-0.338]
2 By judiciously choosing which frames a user should annotate, we can obtain highly accurate tracks with minimal user effort. [sent-5, score-0.487]
3 We cast this problem as one of active learning, and show that we can obtain excellent performance by querying frames that, if annotated, would produce a large expected change in the estimated object track. [sent-6, score-0.616]
4 We implement a constrained tracker and compute the expected change for putative annotations with efficient dynamic programming algorithms. [sent-7, score-0.652]
5 We demonstrate our framework on four datasets, including two benchmark datasets constructed with key frame annotations obtained by Amazon Mechanical Turk. [sent-8, score-0.589]
6 1 Introduction With the decreasing costs of personal portable cameras and the rise of online video sharing services such as YouTube, there is an abundance of unlabeled video readily available. [sent-10, score-0.328]
7 Indeed, many approaches have demonstrated the power of data-driven analysis given labeled video footage [12, 17]. [sent-12, score-0.223]
8 The twenty-six hour VIRAT video data set consisting of surveillance footage of cars and people cost tens of thousands of dollars to annotate despite deploying state-of-the-art annotation protocols [13]. [sent-14, score-0.7]
9 Existing video annotation protocols typically work by having users (possibly on Amazon Mechanical Turk) label a sparse set of key frames followed by either linear interpolation [16] or nonlinear tracking [1, 15]. [sent-15, score-0.844]
10 We propose an adaptive key-frame strategy which uses active learning to intelligently query a worker to label only certain objects at only certain frames that are likely to improve performance. [sent-16, score-0.693]
11 In these cases, a few user clicks are enough to constrain a visual tracker to produce accurate tracks. [sent-18, score-0.532]
12 Rather, user clicks should be spent on more “hard” objects/frames that are visually ambiguous, such as occlusions or cluttered backgrounds. [sent-19, score-0.326]
13 Figure 1: Videos from the VIRAT data set [13] can have hundreds of objects per frame. [sent-21, score-0.236]
14 Our active learning framework automatically focuses the worker’s effort on the difficult instances (such as occlusion or deformation). [sent-23, score-0.357]
15 Our approach is an instance of active structured prediction [8, 7], since we train object models that predict a complex, structured label (an object track) rather than a binary class output. [sent-24, score-0.281]
16 However, rather than training a single car model over several videos (which must be invariant to instance-specific properties such as color and shape), we train a separate car model for each car instance to be tracked. [sent-25, score-0.234]
17 From this perspective, our training examples are individual frames rather than videos. [sent-26, score-0.23]
18 We believe this property makes video a prime candidate for active learning, possibly simplifying its theoretical analysis [14, 2] because one does not face an adversarial ordering of data. [sent-30, score-0.31]
19 Our approach is similar to recent work in active labeling [4], except that we determine which part of the label the user should annotate in order to improve performance the most. [sent-31, score-0.572]
20 Finally, we use a novel query strategy appropriate for video: rather than use expected information gain (expensive to compute for structured predictors) or label entropy (too coarse an approximation), we use the expected label change to select a frame. [sent-32, score-0.412]
21 Related work (Interactive video annotation): There has also been work on interactive tracking from the computer vision community. [sent-34, score-0.304]
22 [5] describe efficient data structures that enable interactive tracking, but do not focus on frame query strategies as we do. [sent-35, score-0.414]
23 2 Tracking In this section, we outline the dynamic programming tracker of [15]. [sent-37, score-0.321]
24 We begin by describing a method for tracking a single object, given a sparse set of key frame bounding-box annotations. [sent-39, score-0.432]
25 As in [15], we use a visual tracker to interpolate the annotations for the unlabeled in-between frames. [sent-40, score-0.471]
26 We define b^i_t to be a bounding box at frame t at pixel position i. [sent-41, score-0.689]
27 For every bounding box annotation in ζ, we extract its associated image patch and resize it to the average size in the set. [sent-46, score-0.247]
28 U_t(b_t) scores how well a particular b_t matches against the learned appearance model w, but truncated by α_1 so as to reduce the penalty when the object undergoes an occlusion. [sent-59, score-0.429]
29 S_t(b_t, b_{t-1}) favors smooth motion and prevents the tracked object from teleporting across the scene. [sent-61, score-0.193]
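To make the two cost terms concrete, here is a minimal sketch in Python. It assumes per-frame candidate boxes described by feature vectors and centers; the exact appearance features and the exact form of the motion cost (here squared center displacement, capped at an assumed constant α_2) are not specified in the text above, so treat both functions as illustrative rather than the authors' implementation.

```python
import numpy as np

def unary_cost(w, feats, alpha_1):
    # U_t(b): negative appearance score -w . phi(b), truncated at alpha_1 so that a poor
    # match (e.g. during an occlusion) is not penalized without bound.
    # feats: (K, D) array of features for the K candidate boxes in one frame.
    return np.minimum(-(feats @ w), alpha_1)

def pairwise_cost(centers, alpha_2):
    # S(b_prev, b_cur): a simple smooth-motion penalty, here the squared displacement
    # of box centers, capped at alpha_2. centers: (K, 2) candidate box centers.
    d = centers[None, :, :] - centers[:, None, :]      # (K, K, 2) displacements
    return np.minimum((d ** 2).sum(axis=-1), alpha_2)
```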
30 3 Efficient Optimization We can recover the missing annotations by computing the optimal path as given by the energy function. [sent-63, score-0.264]
31 b*_{0:T} = argmin_{b_{0:T}} Σ_t [ U_t(b_t) + S_t(b_t, b_{t-1}) ] with b_t = b^i_t ∀ b^i_t ∈ ζ (4), subject to the constraint that the path crosses through the annotations labeled by the worker in ζ. [sent-66, score-0.963]
32 We note that these constraints can be removed by simply redefining U_t(b_t) = ∞ for all b_t ≠ b^i_t at the annotated frames. [sent-67, score-0.321]
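The constrained shortest-path computation sketched above can be written as a standard Viterbi-style dynamic program. The following is a minimal sketch, not the authors' code: it assumes K candidate boxes per frame with precomputed unary and pairwise costs (e.g. from the functions above), and it enforces worker annotations exactly as just described, by setting the unary cost of every other box in an annotated frame to infinity.

```python
import numpy as np

def track(unary, pairwise, annotations):
    # unary      : (T, K) array, U_t(b) for each of K candidate boxes at frame t
    # pairwise   : (K, K) array, S(b_prev, b_cur) smooth-motion cost
    # annotations: dict {frame t: box index i} labeled by the worker (the set zeta)
    # Returns the minimum-energy sequence of box indices b_0 ... b_T.
    cost = np.array(unary, dtype=float)
    T, K = cost.shape
    for t, i in annotations.items():          # clamp labeled frames: U_t(b) = inf for b != b^i_t
        keep = cost[t, i]
        cost[t, :] = np.inf
        cost[t, i] = keep

    dp = cost[0].copy()                       # dp[j] = best cost of a path ending in box j
    back = np.zeros((T, K), dtype=int)        # backpointers for path recovery
    for t in range(1, T):
        total = dp[:, None] + pairwise        # transition cost from every prev box to every cur box
        back[t] = np.argmin(total, axis=0)
        dp = total[back[t], np.arange(K)] + cost[t]

    path = [int(np.argmin(dp))]               # best box in the last frame
    for t in range(T - 1, 0, -1):             # follow backpointers down to frame 0
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```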
33 3 Active Learning Let curr_{0:T} be the current best estimate for the path given a set of user annotations ζ. [sent-71, score-0.364]
34 We wish to compute which frame t* the user should annotate next. [sent-72, score-0.585]
35 In the ideal case, if we had knowledge of the ground-truth path b^{gt}_{0:T}, we should select the frame t that, when annotated with b^{gt}_t, would produce a new estimated path closest to the ground truth. [sent-73, score-0.933]
36 Let us write next_{0:T}(b^{gt}_t) for the estimated track given the augmented constraint set ζ = ζ ∪ b^{gt}_t. [sent-74, score-0.276]
37 The optimal next frame is: t_opt = argmin_{0≤t≤T} Σ_{j=0}^{T} err(b^{gt}_j, next_j(b^{gt}_t)) (7), where err could be squared error or a thresholded overlap (in which case err evaluates to 0 or 1 depending on whether the two locations sufficiently overlap or not). [sent-75, score-0.67]
38 First, we change the minimization to a maximization and replace the ground-truth error with the change in track label: err(b^{gt}_j, next_j(b^{gt}_t)) ⇒ err(curr_j, next_j(b^{gt}_t)). [sent-79, score-0.413]
39 However, this requires knowing the ground-truth location b^{gt}_t. [sent-81, score-0.245]
40 We make the second assumption that we have access to an accurate estimate of P(b^i_t), which is the probability that, if we show the user frame t, they will annotate a particular location i. [sent-82, score-0.633]
41 We can use this distribution to compute an expected change in track label: t* = argmax_{0≤t≤T} Σ_{i=0}^{K} P(b^i_t) · ∆I(b^i_t), where ∆I(b^i_t) = Σ_{j=0}^{T} err(curr_j, next_j(b^i_t)) (8). [sent-83, score-0.581]
42 Figure 2 panel titles: (a) One click: Initial frame only; (b) Two clicks: Initial and requested frame; (c) Identical objects; (e) Intersection point. [sent-85, score-0.381]
43 Figure 2: We consider a synthetic video of two nearly identical rectangles rotating around a point— one clockwise and the other counterclockwise. [sent-87, score-0.216]
44 The rectangles intersect every 20 frames, at which point the tracker does not know which direction the true rectangle is following. [sent-88, score-0.38]
45 (a) Our framework realizes the ambiguity can be resolved by requesting annotations when they do not intersect. [sent-90, score-0.281]
46 Due to the periodic motion, a fixed rate tracker may request annotations at the intersection points, resulting in wasted clicks. [sent-91, score-0.551]
47 The expected label change plateaus because every point along the maxima provides the same amount of disambiguating information. [sent-92, score-0.224]
48 (b) Once the requested frame is annotated, that corresponding segment is resolved, but the others remain ambiguous. [sent-93, score-0.381]
49 In this example, our framework can determine the true path for a particular rectangle in only 7 clicks, while a fixed rate tracker may require 13 clicks. [sent-94, score-0.432]
50 The above selects the frame that, when annotated, produces the largest expected track label change. [sent-95, score-0.215]
51 We now show how to compute P(b^i_t) and ∆I(b^i_t) using costs and constrained paths, respectively, from the dynamic-programming based visual tracker described in Section 2. [sent-96, score-0.283]
52 By considering every possible space-time location that a worker could annotate, we are able to determine which frame we expect could change the current path the most. [sent-97, score-0.586]
53 Moreover, (8) can be parallelized across frames in order to guarantee a rapid response time, often necessary due to the interactive nature of active learning. [sent-99, score-0.432]
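Given the two ingredients of (8), the selection step itself is short; the sketch below assumes P and ∆I have already been filled in (e.g. with the machinery described in the next subsections), and because each frame's expected change is an independent sum over locations it can be computed in parallel across frames, as noted above.

```python
import numpy as np

def select_frame(P, delta_I):
    # P[t, i]      : probability the user would place the box at location i in frame t
    # delta_I[t, i]: change in the estimated track if location i in frame t were annotated
    expected_change = (P * delta_I).sum(axis=1)   # sum over candidate locations i, per frame
    t_star = int(np.argmax(expected_change))      # frame with the largest expected label change
    return t_star, expected_change[t_star]
```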
54 2 Annotation Likelihood and Estimated Tracks A user has access to global knowledge and video history when annotating a frame. [sent-101, score-0.312]
55 Although both objects have the same appearance, our framework does not query for new annotations because the pairwise cost has made it unlikely that the two objects switch identities, indicated by a single mode in the probability map. [sent-106, score-0.448]
56 If the object is extremely difficult to localize, the active learner will automatically decide the optimal annotation strategy is to use fixed rate key frames. [sent-110, score-0.489]
57 By caching forward and backward pointers π→_t(b^i_t) and π←_t(b^i_t), the associated tracks next_{0:T}(b^i_t) can be found by backtracking both forward and backward from any spacetime location b^i_t. [sent-112, score-0.632]
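A sketch of that caching idea follows: one forward and one backward Viterbi pass over the same costs used by the tracker, storing the argmin pointers. Any constrained track next_{0:T}(b^i_t) is then recovered by backtracking in both directions from (t, i). The softmax used for the annotation likelihood P(b^i_t) is an assumption on our part; the text only says P is estimated from the path costs.

```python
import numpy as np

def forward_backward(cost, pairwise):
    # cost: (T, K) unary costs; pairwise: (K, K) motion costs, as in the tracking sketch.
    T, K = cost.shape
    fwd = np.zeros((T, K)); bwd = np.zeros((T, K))
    ptr_f = np.zeros((T, K), dtype=int); ptr_b = np.zeros((T, K), dtype=int)
    fwd[0], bwd[T - 1] = cost[0], cost[T - 1]
    for t in range(1, T):                                 # forward pass (from frame 0)
        total = fwd[t - 1][:, None] + pairwise
        ptr_f[t] = np.argmin(total, axis=0)
        fwd[t] = total[ptr_f[t], np.arange(K)] + cost[t]
    for t in range(T - 2, -1, -1):                        # backward pass (from frame T)
        total = pairwise + bwd[t + 1][None, :]
        ptr_b[t] = np.argmin(total, axis=1)
        bwd[t] = total[np.arange(K), ptr_b[t]] + cost[t]
    return fwd, bwd, ptr_f, ptr_b

def next_track(t, i, ptr_f, ptr_b):
    # Track forced through location i at frame t, recovered from the cached pointers.
    T = ptr_f.shape[0]
    path = [i]
    for s in range(t, 0, -1):                             # backtrack toward frame 0
        path.append(int(ptr_f[s][path[-1]]))
    path = path[::-1]
    for s in range(t, T - 1):                             # roll forward toward frame T
        path.append(int(ptr_b[s][path[-1]]))
    return path

def annotation_likelihood(fwd, bwd, cost, t, temperature=1.0):
    # One plausible (assumed) choice for P(b^i_t): a softmax over the negative cost of
    # the best path forced through each location i of frame t.
    score = -(fwd[t] + bwd[t] - cost[t]) / temperature    # cost[t] is counted in both passes
    score = score - score.max()
    p = np.exp(score)
    return p / p.sum()
```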
58 3 Label Change We now describe a dynamic programming algorithm for computing the label change ∆I(b^i_t) for all possible spacetime locations b^i_t. [sent-114, score-0.643]
59 The total label change is their sum, minus the double-counted error from frame t: ∆I(b^i_t) = Θ→(b^i_t) + Θ←(b^i_t) − err(curr_t, next_t(b^i_t)) (13). [sent-116, score-0.54]
60 Equation (13) is sensitive to small spatial shifts (i.e., the user may annotate any location within a small neighborhood and still produce a large label change). [sent-121, score-0.434]
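Under the same cached-pointer scheme, the two directional terms Θ→ and Θ← in (13) can be accumulated with one additional pass each. The sketch below assumes the per-location error against the current track, err(curr_t, b^i_t), has been precomputed as a (T, K) table (e.g. 0/1 thresholded overlap); the accumulation itself follows the recursion implied by (13).

```python
import numpy as np

def label_change(err, ptr_f, ptr_b):
    # err[t, i]: error between the current track at frame t and candidate box i.
    # Returns delta_I[t, i] = Theta_fwd(b^i_t) + Theta_bwd(b^i_t) - err[t, i], as in (13).
    T, K = err.shape
    theta_f = np.zeros((T, K)); theta_b = np.zeros((T, K))
    theta_f[0], theta_b[T - 1] = err[0], err[T - 1]
    for t in range(1, T):              # errors accumulated along forward pointers (frames 0..t)
        theta_f[t] = err[t] + theta_f[t - 1][ptr_f[t]]
    for t in range(T - 2, -1, -1):     # errors accumulated along backward pointers (frames t..T)
        theta_b[t] = err[t] + theta_b[t + 1][ptr_b[t]]
    return theta_f + theta_b - err     # frame t is counted by both passes, so subtract it once
```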
61 A tracker trained only on the initial frame will lose the object when its appearance changes. [sent-124, score-0.676]
62 Our framework is able to determine which additional frame the user should annotate in order to resolve the track. [sent-125, score-0.613]
63 (a) Our framework does not expect any significant label change when the person is wearing the same jacket as in the training frame (black curve). [sent-126, score-0.711]
64 But, when the jacket is removed and the person changes his pose (colorful curves), the tracker cannot localize the object and our framework queries for an additional annotation. [sent-127, score-0.6]
65 (b) After annotating the requested frame, the tracker learns the color of the person’s shirt and gains confidence in its track estimate. [sent-128, score-0.514]
66 A fixed rate tracker may pick a frame where the person is still wearing the jacket, resulting in a wasted click. [sent-129, score-0.712]
67 (c-f) The green box is the predicted path with one click and the red box is with two clicks. [sent-130, score-0.202]
68 extracted from frame t∗ (according to (1)), and repeat. [sent-132, score-0.306]
69 We stop requesting annotations once we are confident that additional annotations will not significantly change the predicted path: max_{0≤t≤T} Σ_{i=0}^{K} P(b^i_t) · ∆I(b^i_t) < tolerance (15). We then report b* as the final annotated track as found in (4). [sent-133, score-0.654]
70 As long as the budget is sufficiently high, the reported annotations will closely match the actual location of the tracked object. [sent-135, score-0.318]
71 We also note that one can apply our active learning algorithm in parallel for multiple objects in a video. [sent-136, score-0.236]
72 We select the object and frame with the maximum expected label change according to (8). [sent-138, score-0.281]
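Putting the pieces together, the overall annotation loop implied by the text looks roughly like the sketch below; the helper functions (best_query implementing (8) per object, ask_worker collecting one click, add_annotation retraining and re-tracking) are assumed interfaces, not part of the paper.

```python
def annotate_video(objects, budget, tolerance, best_query, ask_worker, add_annotation):
    # objects        : the object instances being tracked in the video
    # best_query(o)  : returns (frame t*, expected label change) for object o via (8)
    # ask_worker     : requests one bounding box from the user for (object, frame)
    # add_annotation : adds the box to o's constraint set, retrains, and re-tracks
    for _ in range(budget):
        scored = [(o,) + best_query(o) for o in objects]
        obj, frame, expected_change = max(scored, key=lambda s: s[2])
        if expected_change < tolerance:        # stopping criterion (15)
            break
        box = ask_worker(obj, frame)           # one user click
        add_annotation(obj, frame, box)
```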
73 4 Qualitative Experiments In order to demonstrate our framework’s capabilities, we show how our approach handles a couple of interesting annotation problems. [sent-140, score-0.185]
74 We have assembled two data sets: a synthetic video of easy-to-localize rectangles maneuvering in an uncluttered background, and a real-world data set of actors following scripted walking patterns. [sent-141, score-0.251]
75 Figure 6 panel titles: (a) One click: Initial frame only; (b) Two clicks: Initial and requested frame; (c) Training image; (d) Entering occlusion; (e) Total occlusion; (f) After occlusion. Figure 6: We investigate a car from [13] that undergoes a total occlusion and later reappears. [sent-142, score-1.224]
76 The tracker is able to localize the car until it enters the occlusion, but it cannot recover when the car reappears. [sent-143, score-0.441]
77 (a) Our framework expects a large label change during the occlusion and when the object is lost. [sent-144, score-0.423]
78 The largest label change occurs when the object begins to reappear because this frame would lock the tracker back onto the correct path. [sent-145, score-0.871]
79 (b) When the tracker receives the requested annotation, it is able to recover from the occlusion, but it is still confused when the object is not visible. [sent-146, score-0.445]
80 Figure 7 panel titles: (a) Initial frame; (b) Rotation; (c) Scale; (d) Estimated. Figure 7: We examine situations where there are many easy-to-localize objects (e. [sent-147, score-0.396]
81 Our framework realizes that the stationary objects are not likely to change their label, so it focuses annotations on moving objects. [sent-151, score-0.498]
82 Fig. 3 highlights how our framework does not request annotations when the paths of two identical objects are disjoint because the motion is not ambiguous. [sent-156, score-0.435]
83 Fig. 4 reveals how our framework will gracefully degrade to fixed rate key frames if the tracked object is difficult to localize. [sent-158, score-0.492]
84 Fig. 7 shows how we are able to transfer wasted clicks from stationary objects onto moving objects. [sent-164, score-0.36]
85 Figure 9 panel titles: (a) VIRAT Cars [13]; (b) Basketball Players [15]. Figure 9: We compare active key frames (green curve) vs. [sent-167, score-0.418]
86 fixed rate key frames (red curve) on a subset (a few thousand frames) of the VIRAT videos and part of a basketball game. [sent-168, score-0.439]
87 By decreasing the annotation frequency in the easy sections and instead transferring those clicks to the difficult frames, we achieve superior performance over the current methods on the same budget. [sent-170, score-0.334]
88 (a) Due to the large number of stationary objects in VIRAT, our framework assigns a tremendous number of clicks to moving objects, allowing us to achieve nearly zero error. [sent-171, score-0.336]
89 (b) By focusing annotation effort on ambiguous frames, we show nearly a 5% improvement on basketball players. [sent-172, score-0.38]
90 5 Benchmark Results We validate our approach on both the VIRAT challenge video surveillance data set [13] and the basketball game studied in [15]. [sent-173, score-0.356]
91 VIRAT is unique for its enormous size of over three million frames and up to hundreds of annotated objects in each frame. [sent-174, score-0.401]
92 The basketball game is extremely difficult due to cluttered backgrounds, motion blur, frequent occlusions, and drastic pose changes. [sent-175, score-0.237]
93 We evaluate the performance of our tracker using active key frames versus fixed rate key frames. [sent-176, score-0.743]
94 A fixed rate tracker simply requests annotations every T frames, regardless of the video content. [sent-177, score-0.635]
95 For active key frames, we use the annotation schedule presented in section 3. [sent-178, score-0.373]
96 Our key frame baseline is the state-of-the-art labeling protocol used to originally annotate both datasets [15, 13]. [sent-179, score-0.567]
97 In a given video, we allow our active learning protocol to iteratively pick a frame and an object to annotate until the budget is exhausted. [sent-180, score-0.746]
98 We then run the tracker described in section 2 constrained by these key frames and compare its performance. [sent-181, score-0.555]
99 We score the two key frame schedules by determining how well the tracker is able to estimate the ground truth annotations. [sent-182, score-0.631]
100 We compare our active approach to a fixed-rate baseline for a fixed amount of user effort: is it better to spend X user clicks on active or fixed-rate key frames? [sent-184, score-0.683]
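The comparison protocol described in this section can be summarized in a few lines; run_tracker, active_schedule, and score are assumed stand-ins for the constrained tracker of Section 2, the active schedule of Section 3, and the ground-truth error metric, so this is an illustrative harness rather than the authors' evaluation code.

```python
def fixed_rate_schedule(num_frames, clicks):
    # Annotate every `step` frames, regardless of the video content.
    step = max(1, num_frames // clicks)
    return list(range(0, num_frames, step))[:clicks]

def compare_schedules(video, ground_truth, clicks, active_schedule, run_tracker, score):
    # Same click budget for both schedules; score() measures error against ground truth.
    fixed_track = run_tracker(video, key_frames=fixed_rate_schedule(len(video), clicks))
    active_track = run_tracker(video, key_frames=active_schedule(video, clicks))
    return {"fixed": score(fixed_track, ground_truth),
            "active": score(active_track, ground_truth)}
```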
wordName wordTfidf (topN-words)
[('bi', 0.321), ('bt', 0.31), ('frame', 0.306), ('tracker', 0.283), ('frames', 0.23), ('bgt', 0.197), ('annotations', 0.188), ('annotation', 0.185), ('annotate', 0.179), ('virat', 0.177), ('video', 0.164), ('clicks', 0.149), ('active', 0.146), ('basketball', 0.121), ('occlusion', 0.113), ('err', 0.113), ('jacket', 0.111), ('label', 0.107), ('user', 0.1), ('objects', 0.09), ('change', 0.088), ('object', 0.087), ('tracking', 0.084), ('annotated', 0.081), ('track', 0.079), ('nextj', 0.079), ('vondrick', 0.079), ('yuen', 0.079), ('path', 0.076), ('requested', 0.075), ('worker', 0.068), ('footage', 0.059), ('spacetime', 0.059), ('tracks', 0.057), ('interactive', 0.056), ('tracked', 0.054), ('click', 0.054), ('car', 0.053), ('motion', 0.052), ('localize', 0.052), ('rectangles', 0.052), ('wasted', 0.052), ('query', 0.052), ('ct', 0.05), ('paths', 0.049), ('ut', 0.049), ('location', 0.048), ('annotating', 0.048), ('videos', 0.046), ('surveillance', 0.045), ('rectangle', 0.045), ('key', 0.042), ('backward', 0.041), ('ramanan', 0.041), ('effort', 0.041), ('labeling', 0.04), ('currj', 0.039), ('currt', 0.039), ('nextt', 0.039), ('occlusions', 0.039), ('person', 0.039), ('dynamic', 0.038), ('cluttered', 0.038), ('querying', 0.036), ('cars', 0.036), ('box', 0.036), ('walking', 0.035), ('moving', 0.035), ('realizes', 0.035), ('pointers', 0.035), ('qualitative', 0.034), ('bn', 0.034), ('stationary', 0.034), ('ambiguous', 0.033), ('hog', 0.033), ('rgb', 0.033), ('protocols', 0.032), ('undergoes', 0.032), ('wearing', 0.032), ('backtracking', 0.03), ('requesting', 0.03), ('locations', 0.03), ('argmin', 0.029), ('expected', 0.029), ('color', 0.029), ('automatically', 0.029), ('request', 0.028), ('framework', 0.028), ('budget', 0.028), ('players', 0.027), ('boxes', 0.027), ('bounding', 0.026), ('game', 0.026), ('gracefully', 0.026), ('putative', 0.026), ('liblinear', 0.026), ('cult', 0.026), ('benchmark', 0.025), ('degrade', 0.025), ('unary', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 303 nips-2011-Video Annotation and Tracking with Active Learning
Author: Carl Vondrick, Deva Ramanan
Abstract: We introduce a novel active learning framework for video annotation. By judiciously choosing which frames a user should annotate, we can obtain highly accurate tracks with minimal user effort. We cast this problem as one of active learning, and show that we can obtain excellent performance by querying frames that, if annotated, would produce a large expected change in the estimated object track. We implement a constrained tracker and compute the expected change for putative annotations with efficient dynamic programming algorithms. We demonstrate our framework on four datasets, including two benchmark datasets constructed with key frame annotations obtained by Amazon Mechanical Turk. Our results indicate that we could obtain equivalent labels for a small fraction of the original cost. 1
2 0.22303148 35 nips-2011-An ideal observer model for identifying the reference frame of objects
Author: Joseph L. Austerweil, Abram L. Friesen, Thomas L. Griffiths
Abstract: The object people perceive in an image can depend on its orientation relative to the scene it is in (its reference frame). For example, the images of the symbols × and + differ by a 45 degree rotation. Although real scenes have multiple images and reference frames, psychologists have focused on scenes with only one reference frame. We propose an ideal observer model based on nonparametric Bayesian statistics for inferring the number of reference frames in a scene and their parameters. When an ambiguous image could be assigned to two conflicting reference frames, the model predicts two factors should influence the reference frame inferred for the image: The image should be more likely to share the reference frame of the closer object (proximity) and it should be more likely to share the reference frame containing the most objects (alignment). We confirm people use both cues using a novel methodology that allows for easy testing of human reference frame inference. 1
3 0.20659171 180 nips-2011-Multiple Instance Filtering
Author: Kamil A. Wnuk, Stefano Soatto
Abstract: We propose a robust filtering approach based on semi-supervised and multiple instance learning (MIL). We assume that the posterior density would be unimodal if not for the effect of outliers that we do not wish to explicitly model. Therefore, we seek for a point estimate at the outset, rather than a generic approximation of the entire posterior. Our approach can be thought of as a combination of standard finite-dimensional filtering (Extended Kalman Filter, or Unscented Filter) with multiple instance learning, whereby the initial condition comes with a putative set of inlier measurements. We show how both the state (regression) and the inlier set (classification) can be estimated iteratively and causally by processing only the current measurement. We illustrate our approach on visual tracking problems whereby the object of interest (target) moves and evolves as a result of occlusions and deformations, and partial knowledge of the target is given in the form of a bounding box (training set). 1
4 0.13308486 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
Author: Zhen J. Xiang, Hao Xu, Peter J. Ramadge
Abstract: Learning sparse representations on data adaptive dictionaries is a state-of-the-art method for modeling data. But when the dictionary is large and the data dimension is high, it is a computationally challenging problem. We explore three aspects of the problem. First, we derive new, greatly improved screening tests that quickly identify codewords that are guaranteed to have zero weights. Second, we study the properties of random projections in the context of learning sparse representations. Finally, we develop a hierarchical framework that uses incremental random projections and screening to learn, in small stages, a hierarchically structured dictionary for sparse representations. Empirical results show that our framework can learn informative hierarchical sparse representations more efficiently. 1
5 0.12875326 275 nips-2011-Structured Learning for Cell Tracking
Author: Xinghua Lou, Fred A. Hamprecht
Abstract: We study the problem of learning to track a large quantity of homogeneous objects such as cell tracking in cell culture study and developmental biology. Reliable cell tracking in time-lapse microscopic image sequences is important for modern biomedical research. Existing cell tracking methods are usually kept simple and use only a small number of features to allow for manual parameter tweaking or grid search. We propose a structured learning approach that allows to learn optimum parameters automatically from a training set. This allows for the use of a richer set of features which in turn affords improved tracking compared to recently reported methods on two public benchmark sequences. 1
6 0.1229978 220 nips-2011-Prediction strategies without loss
7 0.10760715 135 nips-2011-Information Rates and Optimal Decoding in Large Neural Populations
8 0.10195859 193 nips-2011-Object Detection with Grammar Models
9 0.096182257 301 nips-2011-Variational Gaussian Process Dynamical Systems
10 0.09493129 138 nips-2011-Joint 3D Estimation of Objects and Scene Layout
11 0.094904855 148 nips-2011-Learning Probabilistic Non-Linear Latent Variable Models for Tracking Complex Activities
12 0.094233319 218 nips-2011-Predicting Dynamic Difficulty
13 0.08802253 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding
14 0.087690115 197 nips-2011-On Tracking The Partition Function
15 0.086376011 22 nips-2011-Active Ranking using Pairwise Comparisons
16 0.076520495 154 nips-2011-Learning person-object interactions for action recognition in still images
17 0.076214358 64 nips-2011-Convergent Bounds on the Euclidean Distance
18 0.075017162 66 nips-2011-Crowdclustering
19 0.071308419 94 nips-2011-Facial Expression Transfer with Input-Output Temporal Restricted Boltzmann Machines
20 0.069923088 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition
topicId topicWeight
[(0, 0.176), (1, 0.048), (2, -0.068), (3, 0.121), (4, 0.113), (5, 0.017), (6, -0.012), (7, -0.089), (8, -0.005), (9, 0.109), (10, 0.06), (11, -0.113), (12, -0.015), (13, -0.021), (14, 0.139), (15, -0.049), (16, 0.007), (17, 0.132), (18, -0.008), (19, 0.062), (20, 0.078), (21, -0.04), (22, 0.002), (23, 0.016), (24, -0.051), (25, -0.231), (26, -0.281), (27, 0.071), (28, -0.106), (29, -0.105), (30, 0.013), (31, -0.125), (32, 0.101), (33, -0.001), (34, 0.019), (35, 0.027), (36, -0.035), (37, -0.039), (38, 0.118), (39, -0.011), (40, -0.027), (41, 0.075), (42, -0.046), (43, -0.111), (44, -0.006), (45, -0.028), (46, 0.053), (47, -0.03), (48, 0.079), (49, 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.97474742 303 nips-2011-Video Annotation and Tracking with Active Learning
Author: Carl Vondrick, Deva Ramanan
Abstract: We introduce a novel active learning framework for video annotation. By judiciously choosing which frames a user should annotate, we can obtain highly accurate tracks with minimal user effort. We cast this problem as one of active learning, and show that we can obtain excellent performance by querying frames that, if annotated, would produce a large expected change in the estimated object track. We implement a constrained tracker and compute the expected change for putative annotations with efficient dynamic programming algorithms. We demonstrate our framework on four datasets, including two benchmark datasets constructed with key frame annotations obtained by Amazon Mechanical Turk. Our results indicate that we could obtain equivalent labels for a small fraction of the original cost. 1
2 0.65190274 35 nips-2011-An ideal observer model for identifying the reference frame of objects
Author: Joseph L. Austerweil, Abram L. Friesen, Thomas L. Griffiths
Abstract: The object people perceive in an image can depend on its orientation relative to the scene it is in (its reference frame). For example, the images of the symbols × and + differ by a 45 degree rotation. Although real scenes have multiple images and reference frames, psychologists have focused on scenes with only one reference frame. We propose an ideal observer model based on nonparametric Bayesian statistics for inferring the number of reference frames in a scene and their parameters. When an ambiguous image could be assigned to two conflicting reference frames, the model predicts two factors should influence the reference frame inferred for the image: The image should be more likely to share the reference frame of the closer object (proximity) and it should be more likely to share the reference frame containing the most objects (alignment). We confirm people use both cues using a novel methodology that allows for easy testing of human reference frame inference. 1
3 0.62220007 275 nips-2011-Structured Learning for Cell Tracking
Author: Xinghua Lou, Fred A. Hamprecht
Abstract: We study the problem of learning to track a large quantity of homogeneous objects such as cell tracking in cell culture study and developmental biology. Reliable cell tracking in time-lapse microscopic image sequences is important for modern biomedical research. Existing cell tracking methods are usually kept simple and use only a small number of features to allow for manual parameter tweaking or grid search. We propose a structured learning approach that allows to learn optimum parameters automatically from a training set. This allows for the use of a richer set of features which in turn affords improved tracking compared to recently reported methods on two public benchmark sequences. 1
4 0.61068428 180 nips-2011-Multiple Instance Filtering
Author: Kamil A. Wnuk, Stefano Soatto
Abstract: We propose a robust filtering approach based on semi-supervised and multiple instance learning (MIL). We assume that the posterior density would be unimodal if not for the effect of outliers that we do not wish to explicitly model. Therefore, we seek for a point estimate at the outset, rather than a generic approximation of the entire posterior. Our approach can be thought of as a combination of standard finite-dimensional filtering (Extended Kalman Filter, or Unscented Filter) with multiple instance learning, whereby the initial condition comes with a putative set of inlier measurements. We show how both the state (regression) and the inlier set (classification) can be estimated iteratively and causally by processing only the current measurement. We illustrate our approach on visual tracking problems whereby the object of interest (target) moves and evolves as a result of occlusions and deformations, and partial knowledge of the target is given in the form of a bounding box (training set). 1
5 0.5236432 148 nips-2011-Learning Probabilistic Non-Linear Latent Variable Models for Tracking Complex Activities
Author: Angela Yao, Juergen Gall, Luc V. Gool, Raquel Urtasun
Abstract: A common approach for handling the complexity and inherent ambiguities of 3D human pose estimation is to use pose priors learned from training data. Existing approaches however, are either too simplistic (linear), too complex to learn, or can only learn latent spaces from “simple data”, i.e., single activities such as walking or running. In this paper, we present an efficient stochastic gradient descent algorithm that is able to learn probabilistic non-linear latent spaces composed of multiple activities. Furthermore, we derive an incremental algorithm for the online setting which can update the latent space without extensive relearning. We demonstrate the effectiveness of our approach on the task of monocular and multi-view tracking and show that our approach outperforms the state-of-the-art. 1
6 0.46777332 193 nips-2011-Object Detection with Grammar Models
7 0.43600833 138 nips-2011-Joint 3D Estimation of Objects and Scene Layout
8 0.42208675 169 nips-2011-Maximum Margin Multi-Label Structured Prediction
9 0.39515632 218 nips-2011-Predicting Dynamic Difficulty
10 0.37716195 22 nips-2011-Active Ranking using Pairwise Comparisons
11 0.35353214 114 nips-2011-Hierarchical Multitask Structured Output Learning for Large-scale Sequence Segmentation
12 0.34685099 197 nips-2011-On Tracking The Partition Function
13 0.33966133 64 nips-2011-Convergent Bounds on the Euclidean Distance
14 0.33488187 166 nips-2011-Maximal Cliques that Satisfy Hard Constraints with Application to Deformable Object Model Learning
15 0.32695165 154 nips-2011-Learning person-object interactions for action recognition in still images
16 0.32645887 290 nips-2011-Transfer Learning by Borrowing Examples for Multiclass Object Detection
17 0.31886294 304 nips-2011-Why The Brain Separates Face Recognition From Object Recognition
18 0.31656647 255 nips-2011-Simultaneous Sampling and Multi-Structure Fitting with Adaptive Reversible Jump MCMC
19 0.31098175 277 nips-2011-Submodular Multi-Label Learning
20 0.31095514 192 nips-2011-Nonstandard Interpretations of Probabilistic Programs for Efficient Inference
topicId topicWeight
[(0, 0.011), (4, 0.077), (6, 0.211), (20, 0.08), (26, 0.04), (31, 0.084), (33, 0.039), (43, 0.033), (45, 0.095), (57, 0.029), (65, 0.012), (74, 0.061), (83, 0.051), (84, 0.015), (99, 0.058)]
simIndex simValue paperId paperTitle
same-paper 1 0.82920504 303 nips-2011-Video Annotation and Tracking with Active Learning
Author: Carl Vondrick, Deva Ramanan
Abstract: We introduce a novel active learning framework for video annotation. By judiciously choosing which frames a user should annotate, we can obtain highly accurate tracks with minimal user effort. We cast this problem as one of active learning, and show that we can obtain excellent performance by querying frames that, if annotated, would produce a large expected change in the estimated object track. We implement a constrained tracker and compute the expected change for putative annotations with efficient dynamic programming algorithms. We demonstrate our framework on four datasets, including two benchmark datasets constructed with key frame annotations obtained by Amazon Mechanical Turk. Our results indicate that we could obtain equivalent labels for a small fraction of the original cost. 1
2 0.72164261 199 nips-2011-On fast approximate submodular minimization
Author: Stefanie Jegelka, Hui Lin, Jeff A. Bilmes
Abstract: We are motivated by an application to extract a representative subset of machine learning training data and by the poor empirical performance we observe of the popular minimum norm algorithm. In fact, for our application, minimum norm can have a running time of about O(n7 ) (O(n5 ) oracle calls). We therefore propose a fast approximate method to minimize arbitrary submodular functions. For a large sub-class of submodular functions, the algorithm is exact. Other submodular functions are iteratively approximated by tight submodular upper bounds, and then repeatedly optimized. We show theoretical properties, and empirical results suggest significant speedups over minimum norm while retaining higher accuracies. 1
3 0.63992941 127 nips-2011-Image Parsing with Stochastic Scene Grammar
Author: Yibiao Zhao, Song-chun Zhu
Abstract: This paper proposes a parsing algorithm for scene understanding which includes four aspects: computing 3D scene layout, detecting 3D objects (e.g. furniture), detecting 2D faces (windows, doors etc.), and segmenting background. In contrast to previous scene labeling work that applied discriminative classifiers to pixels (or super-pixels), we use a generative Stochastic Scene Grammar (SSG). This grammar represents the compositional structures of visual entities from scene categories, 3D foreground/background, 2D faces, to 1D lines. The grammar includes three types of production rules and two types of contextual relations. Production rules: (i) AND rules represent the decomposition of an entity into sub-parts; (ii) OR rules represent the switching among sub-types of an entity; (iii) SET rules represent an ensemble of visual entities. Contextual relations: (i) Cooperative “+” relations represent positive links between binding entities, such as hinged faces of a object or aligned boxes; (ii) Competitive “-” relations represents negative links between competing entities, such as mutually exclusive boxes. We design an efficient MCMC inference algorithm, namely Hierarchical cluster sampling, to search in the large solution space of scene configurations. The algorithm has two stages: (i) Clustering: It forms all possible higher-level structures (clusters) from lower-level entities by production rules and contextual relations. (ii) Sampling: It jumps between alternative structures (clusters) in each layer of the hierarchy to find the most probable configuration (represented by a parse tree). In our experiment, we demonstrate the superiority of our algorithm over existing methods on public dataset. In addition, our approach achieves richer structures in the parse tree. 1
4 0.63744223 223 nips-2011-Probabilistic Joint Image Segmentation and Labeling
Author: Adrian Ion, Joao Carreira, Cristian Sminchisescu
Abstract: We present a joint image segmentation and labeling model (JSL) which, given a bag of figure-ground segment hypotheses extracted at multiple image locations and scales, constructs a joint probability distribution over both the compatible image interpretations (tilings or image segmentations) composed from those segments, and over their labeling into categories. The process of drawing samples from the joint distribution can be interpreted as first sampling tilings, modeled as maximal cliques, from a graph connecting spatially non-overlapping segments in the bag [1], followed by sampling labels for those segments, conditioned on the choice of a particular tiling. We learn the segmentation and labeling parameters jointly, based on Maximum Likelihood with a novel Incremental Saddle Point estimation procedure. The partition function over tilings and labelings is increasingly more accurately approximated by including incorrect configurations that a not-yet-competent model rates probable during learning. We show that the proposed methodology matches the current state of the art in the Stanford dataset [2], as well as in VOC2010, where 41.7% accuracy on the test set is achieved.
5 0.63648331 180 nips-2011-Multiple Instance Filtering
Author: Kamil A. Wnuk, Stefano Soatto
Abstract: We propose a robust filtering approach based on semi-supervised and multiple instance learning (MIL). We assume that the posterior density would be unimodal if not for the effect of outliers that we do not wish to explicitly model. Therefore, we seek for a point estimate at the outset, rather than a generic approximation of the entire posterior. Our approach can be thought of as a combination of standard finite-dimensional filtering (Extended Kalman Filter, or Unscented Filter) with multiple instance learning, whereby the initial condition comes with a putative set of inlier measurements. We show how both the state (regression) and the inlier set (classification) can be estimated iteratively and causally by processing only the current measurement. We illustrate our approach on visual tracking problems whereby the object of interest (target) moves and evolves as a result of occlusions and deformations, and partial knowledge of the target is given in the form of a bounding box (training set). 1
6 0.6359539 227 nips-2011-Pylon Model for Semantic Segmentation
7 0.62990099 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
8 0.62580353 263 nips-2011-Sparse Manifold Clustering and Embedding
9 0.62567562 166 nips-2011-Maximal Cliques that Satisfy Hard Constraints with Application to Deformable Object Model Learning
10 0.62338382 231 nips-2011-Randomized Algorithms for Comparison-based Search
11 0.62298495 154 nips-2011-Learning person-object interactions for action recognition in still images
12 0.62248486 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
13 0.62223494 266 nips-2011-Spatial distance dependent Chinese restaurant processes for image segmentation
14 0.62219208 168 nips-2011-Maximum Margin Multi-Instance Learning
15 0.62173641 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding
16 0.61673391 22 nips-2011-Active Ranking using Pairwise Comparisons
17 0.61545581 186 nips-2011-Noise Thresholds for Spectral Clustering
18 0.61521924 43 nips-2011-Bayesian Partitioning of Large-Scale Distance Data
19 0.61434656 55 nips-2011-Collective Graphical Models
20 0.61367995 64 nips-2011-Convergent Bounds on the Euclidean Distance