
209 nips-2012-Max-Margin Structured Output Regression for Spatio-Temporal Action Localization


Source: pdf

Author: Du Tran, Junsong Yuan

Abstract: Structured output learning has been successfully applied to object localization, where the mapping between an image and an object bounding box can be well captured. Its extension to action localization in videos, however, is much more challenging, because we need to predict the locations of the action patterns both spatially and temporally, i.e., identifying a sequence of bounding boxes that track the action in video. The problem becomes intractable due to the exponentially large size of the structured video space where actions could occur. We propose a novel structured learning approach for spatio-temporal action localization. The mapping between a video and a spatio-temporal action trajectory is learned. The intractable inference and learning problems are addressed by leveraging an efficient Max-Path search method, thus making it feasible to optimize the model over the whole structured space. Experiments on two challenging benchmark datasets show that our proposed method outperforms the state-of-the-art methods. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: Structured output learning has been successfully applied to object localization, where the mapping between an image and an object bounding box can be well captured. [sent-4, score-0.31]

2 Its extension to action localization in videos, however, is much more challenging, because we need to predict the locations of the action patterns both spatially and temporally, i.e., [sent-5, score-1.228]

3 identifying a sequence of bounding boxes that track the action in video. [sent-7, score-0.515]

4 The problem becomes intractable due to the exponentially large size of the structured video space where actions could occur. [sent-8, score-0.587]

5 We propose a novel structured learning approach for spatio-temporal action localization. [sent-9, score-0.483]

6 The mapping between a video and a spatio-temporal action trajectory is learned. [sent-10, score-0.553]

7 1 Introduction: Blaschko and Lampert have recently shown that object localization can be approached as a structured regression problem [2]. [sent-13, score-0.807]

8 Instead of modeling object localization as a binary classification problem and treating every bounding box independently, their method trains a discriminant function directly for predicting the bounding boxes of objects located in images. [sent-14, score-0.922]

9 Motivated by the successful application of structured regression in object localization [2], it is natural to ask if we can perform structured regression for action localization in videos. [sent-16, score-1.82]

10 Although this idea looks plausible, the extension from object localization to action localization is non-trivial. [sent-17, score-1.526]

11 Different from object localization, where a visual object can be well localized by a 2-dimensional (2D) subwindow, human actions cannot be tightly bounded in a similar way, i.e., by a 3D subvolume. [sent-18, score-0.425]

12 Although many current methods for action detection are based on this 3D subvolume assumption [6, 9, 20, 29], and search for video subvolumes to detect actions, such an assumption is only reasonable for “static” actions, where the subjects do not move globally, e.g. [sent-21, score-1.099]

13 Thus, a more accurate localization scheme that can track the actions is required for localizing dynamic actions in videos. [sent-27, score-1.01]

14 For example, one can localize an action by a 2D bounding box in each frame, and track it as the action moves across different frames. [sent-28, score-0.909]

15 This structured localization output forms a smooth spatio-temporal path of connected 2D bounding boxes. [sent-29, score-0.864]

16 Such a spatio-temporal path can tightly bound the actions in the video space and provide a more accurate spatio-temporal localization of actions. [sent-30, score-1.064]

17 Figure 1: Complexities of object and action localization: a) Object localization is of O(n^4). [sent-31, score-1.048]

18 b) Action localization by subvolume search is of O(n^6). [sent-32, score-0.738]

19 c) Spatio-temporal action localization has a much larger search space. [sent-33, score-0.953]

20 However, as the video space is much larger than the image space, spatio-temporal action localization has a much larger structured space compared with object localization. [sent-34, score-1.334]

21 For a video of size w × h × n, the search spaces for 3D subvolumes and 2D subwindows are only O(w^2 h^2 n^2) and O(w^2 h^2), respectively (Figure 1). [sent-35, score-0.372]

22 However, the search space for possible spatio-temporal paths in the video space is exponential, O(w h n k^n) [23], if we do not know the start and end points of the path (k is the number of incoming edges per node). [sent-36, score-0.408]

23 Any one of these paths can be a candidate for spatio-temporal action localization, so an exhaustive search is infeasible. [sent-37, score-0.478]
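As an illustrative aside (ours, not from the paper), a minimal Python sketch of these search-space counts for a toy video; the values of w, h, n, and k below are assumptions chosen only to show the scale gap.

```python
import math

# Assumed toy video dimensions and branching factor (illustrative only).
w, h, n, k = 320, 240, 100, 9  # width, height, frames, incoming edges per trellis node

subwindows_2d = (w ** 2) * (h ** 2)             # O(w^2 h^2): one 2D box in one image
subvolumes_3d = (w ** 2) * (h ** 2) * (n ** 2)  # O(w^2 h^2 n^2): one 3D box in the video
# O(w h n k^n): spatio-temporal paths with unknown start/end; too large to
# print as an integer, so report the order of magnitude instead.
log10_paths = math.log10(w * h * n) + n * math.log10(k)

print(f"2D subwindows: {subwindows_2d:.2e}")   # ~5.9e+09
print(f"3D subvolumes: {subvolumes_3d:.2e}")   # ~5.9e+13
print(f"paths:        ~10^{log10_paths:.0f}")  # ~10^102
```

Even for this small clip, the path space dwarfs the subvolume space by roughly ninety orders of magnitude, which is why exhaustive search is ruled out.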

24 This huge structured space keeps structured learning approaches from being practical for spatio-temporal action localization due to intractable inference. [sent-38, score-1.217]

25 This paper proposes a new approach for spatio-temporal action localization that addresses the above-mentioned problems. [sent-39, score-0.892]

26 Instead of using the 3D subvolume localization scheme, we precisely locate and track actions by finding an optimal spatio-temporal path. [sent-40, score-1.329]

27 The mapping between a video and a spatio-temporal action trajectory is learned. [sent-41, score-0.553]

28 Solved as a structured learning problem, our method can well exploit the correlations between locally dependent video features, and therefore optimizes the structured output. [sent-43, score-0.539]

29 Related work: Human action detection is traditionally approached by spatio-temporal video volume matching using different features: space-time orientation [6], volumetric features [9], action MACH [20], and HOG3D [10]. [sent-46, score-1.143]

30 Hu et al. used multiple-instance learning to detect actions [8]. [sent-50, score-0.456]

31 Mahadevan et al. used mixtures of dynamic textures to detect anomalous events [15]. [sent-51, score-0.293]

32 Niebles et al. used a probabilistic latent semantic analysis model for recognizing actions [17]. [sent-53, score-0.35]

33 Yuan et al. extended the branch-and-bound subwindow search [11] to subvolume search for action detection [29]. [sent-55, score-0.98]

34 Recently, Tran and Yuan relaxed the 3D bounding box constraint for detecting and localizing medium and long video events [23]. [sent-56, score-0.416]

35 More recently, Lan et al. used a latent SVM to jointly detect and recognize actions in videos [12]. [sent-66, score-0.523]

36 However, this method requires a reliable human detector in both inference and learning, so it is not applicable to “dynamic” actions where the human poses vary significantly. [sent-68, score-0.324]

37 Moreover, because it uses HOG3D features [26], it only detects actions in a sparse subset of frames where interest points are present. [sent-69, score-0.315]

38 2 Spatio-Temporal Action Localization as Structured Output Regression: Given a video x of size w × h × m, where w × h is the frame size and m is the video length, localizing actions requires predicting a structured object y, which is a smooth spatio-temporal path in the video space. [sent-70, score-0.857]

39 The output is y = {(l, t, r, b)_1, …, (l, t, r, b)_m}, where (l, t, r, b)_i are respectively the left, top, right, and bottom of the rectangle that bounds the action in the i-th frame. [sent-73, score-0.367]

40 These values of (l, t, r, b)_i are all set to zero when there is no action in the frame. [sent-74, score-0.336]

41 Because of the spatio-temporal smoothness constraint, the boxes in y necessarily vary smoothly over the spatio-temporal video space. [sent-75, score-0.296]

42 Let us denote X ⊂ [0, 255]^{3whm} as the set of color videos, and Y ⊂ R^{4m} as the set of all smooth spatio-temporal paths in the video space. [sent-76, score-0.298]

43 The problem of spatio-temporal action localization then becomes learning a structured prediction function f : X → Y. [sent-77, score-1.039]

44 We formulate the action localization problem using structured learning as presented in [24]. [sent-86, score-1.039]

45 F is a compatibility function that measures how well the localization y suits the given input video x. [sent-88, score-0.773]
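To make the formulation concrete, here is a minimal sketch (ours) of the structured output y and the prediction rule f(x) = argmax_y F(x, y) for a linear compatibility F(x, y; w) = ⟨w, φ(x, y)⟩. The names Box, Path, and predict are illustrative, and the enumeration over candidate paths is only a stand-in: computing this argmax efficiently is exactly what the Max-Path search discussed below is for.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence
import numpy as np

@dataclass
class Box:
    l: float  # left
    t: float  # top
    r: float  # right
    b: float  # bottom; all four zero => no action in this frame

# A structured output y: one box per frame, varying smoothly over time.
Path = List[Box]

def compatibility(w: np.ndarray, phi_xy: np.ndarray) -> float:
    """F(x, y; w) = <w, phi(x, y)> for a linear model."""
    return float(w @ phi_xy)

def predict(w: np.ndarray,
            candidates: Sequence[Path],
            joint_feature: Callable[[Path], np.ndarray]) -> Path:
    """f(x) = argmax_y F(x, y); brute-force enumeration stands in for Max-Path."""
    return max(candidates, key=lambda y: compatibility(w, joint_feature(y)))
```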

46 The Joint Kernel Feature Map for Action Localization: Let us denote x|y as the video portion cut out from x by the path y, namely the stack of images cropped by the bounding boxes b_1, …, b_m. [sent-114, score-0.464]

47 We also denote ϕ(b_i) ∈ R^k as a feature map for a 2D box b_i. [sent-117, score-0.311]

48 The ground truth path is ȳ = {b̄_1, …, b̄_m}, where b̄_i = (l, t, r, b)_i is the ground truth box of the i-th frame. [sent-130, score-0.42]
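Reading Eq. 11 below, the joint kernel feature map decomposes as a per-frame average, φ(x, y) = (1/m) Σ_{i=1}^{m} ϕ(b_i). A minimal sketch under that reading (ours), with the per-box feature extractor assumed given:

```python
import numpy as np

def joint_feature_map(per_box_features: np.ndarray) -> np.ndarray:
    """phi(x, y) = (1/m) * sum_i varphi(b_i).

    per_box_features: (m, k) array whose i-th row is varphi(b_i), the feature
    of the frame region cropped by box b_i (the extractor itself is assumed).
    """
    return per_box_features.mean(axis=0)
```

With this decomposition, ⟨w, φ(x, y)⟩ is the average of the per-box scores ⟨w, ϕ(b_i)⟩, which is what lets the path be scored frame by frame.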

49 The loss-augmented inference can be written as
max_{y∈Y} { Δ(ȳ, y) + ⟨w, φ(x, y)⟩ } (10)
= max_{y∈Y} { (1/m) Σ_{i=1}^{m} δ(b̄_i, b_i) + (1/m) Σ_{i=1}^{m} ⟨w, ϕ(b_i)⟩ } (11)
= max_{y∈Y} (1/m) Σ_{i=1}^{m} [ δ(b̄_i, b_i) + ⟨w, ϕ(b_i)⟩ ]. (12)
[sent-146, score-0.488]

50 To solve Eq. 12, one needs to search for a smooth path y* in the spatio-temporal video space Y that gives the maximum total score. [sent-148, score-0.352]

51 The Max-Path algorithm [23] was proposed to detect dynamic video events. [sent-155, score-0.323]

52 It is guaranteed to obtain the best spatio-temporal path in the video space provided that the local windows’ scores can be precomputed. [sent-156, score-0.322]

53 In testing, the trellis's local scores are ⟨w, ϕ(b_i)⟩, where b_i is the local window. [sent-158, score-0.275]

54 In training, the trellis values are δ(b̄_i, b_i) + ⟨w, ϕ(b_i)⟩, which are also computable given the parameter w, the feature map ϕ, and the ground truth b̄_i. [sent-160, score-0.678]
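A minimal dynamic-programming sketch of this trellis search (ours): it assumes a fixed candidate box set per frame, a smooth predicate saying which boxes in consecutive frames may be linked, and that every box has at least one allowed predecessor. The actual Max-Path algorithm [23] is more general, e.g. it also optimizes where the path starts and ends; this sketch only shows how precomputed local scores ⟨w, ϕ(b_i)⟩ (plus δ(b̄_i, b_i) during training) drive the search.

```python
def best_path(local_scores, smooth):
    """local_scores[t][j]: score of candidate box j in frame t, e.g.
    <w, varphi(b)> at test time, plus delta(b_bar, b) during training.
    smooth(t, i, j): True if box i in frame t-1 may link to box j in frame t.
    Returns (total score, per-frame box indices); the path is forced to span
    all frames, a simplification of Max-Path."""
    m = len(local_scores)
    best = [list(local_scores[0])]        # best[t][j]: best score ending at (t, j)
    back = [[-1] * len(local_scores[0])]  # back-pointers for path recovery
    for t in range(1, m):
        row, ptr = [], []
        for j, s in enumerate(local_scores[t]):
            prev = max((i for i in range(len(best[t - 1])) if smooth(t, i, j)),
                       key=lambda i: best[t - 1][i])
            row.append(best[t - 1][prev] + s)
            ptr.append(prev)
        best.append(row)
        back.append(ptr)
    j = max(range(len(best[-1])), key=lambda idx: best[-1][idx])
    score, path = best[-1][j], [j]
    for t in range(m - 1, 0, -1):         # trace back-pointers to recover the path
        j = back[t][j]
        path.append(j)
    return score, path[::-1]

# Toy usage: 3 frames, 2 candidate boxes each, links only between neighbors.
scores = [[1.0, 0.2], [0.1, 2.0], [0.5, 0.4]]
print(best_path(scores, lambda t, i, j: abs(i - j) <= 1))  # (3.5, [0, 1, 0])
```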

55 The margin constraints decompose as
⟨w, φ(x, ȳ)⟩ − ⟨w, φ(x, y)⟩ ≥ Δ(ȳ, y) − ξ, ∀y ∈ Y\ȳ (13)
⇔ (1/m) Σ_{i=1}^{m} ⟨w, ϕ(b̄_i)⟩ − (1/m) Σ_{i=1}^{m} ⟨w, ϕ(b_i)⟩ ≥ (1/m) Σ_{i=1}^{m} δ(b̄_i, b_i) − ξ, ∀y ∈ Y\ȳ (14)
⇔ Σ_{i=1}^{m} ⟨w, ϕ(b̄_i)⟩ − Σ_{i=1}^{m} ⟨w, ϕ(b_i)⟩ ≥ Σ_{i=1}^{m} δ(b̄_i, b_i) − mξ, ∀y ∈ Y\ȳ. (15)
[sent-172, score-0.536]

56 The constraint in Eq. 15 is enforced by the stronger per-box constraints ⟨w, ϕ(b̄_i)⟩ − ⟨w, ϕ(b_i)⟩ ≥ δ(b̄_i, b_i) − ξ, ∀i ∈ [1..m]. [sent-175, score-0.292]

57 However, the important benefit of using such enforcements is that instead of comparing the features of two different spatio-temporal paths ȳ and y, one can compare the features of individual box pairs (b̄_i, b_i) of those two paths. [sent-182, score-0.441]

58 The UCF-Sport dataset consists of 150 video sequences of 10 different action classes. [sent-185, score-0.577]

59 We perform the task of kiss detection and localization on this dataset. [sent-194, score-0.854]

60 Kissing actions are more challenging than the other action classes in this dataset due to weaker motion and appearance cues. [sent-195, score-0.552]

61 As in [12], the video localization score is measured by averaging its frame localization scores, each computed as the overlap area divided by the union area of the predicted and ground truth boxes. [sent-208, score-1.603]

62 A prediction is then considered correct if its localization score is greater than or equal to a threshold σ. [sent-209, score-0.556]
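A minimal sketch of this evaluation protocol (ours), with boxes given as (l, t, r, b) tuples; the function names are illustrative and the threshold σ is left as a parameter.

```python
def frame_iou(p, g):
    """Overlap area divided by union area of predicted box p and truth box g,
    each given as (l, t, r, b) with r > l and b > t."""
    iw = max(0.0, min(p[2], g[2]) - max(p[0], g[0]))
    ih = max(0.0, min(p[3], g[3]) - max(p[1], g[1]))
    inter = iw * ih
    union = (p[2] - p[0]) * (p[3] - p[1]) + (g[2] - g[0]) * (g[3] - g[1]) - inter
    return inter / union if union > 0 else 0.0

def video_localization_score(pred_boxes, true_boxes):
    """Average of per-frame localization scores, as in [12]."""
    ious = [frame_iou(p, g) for p, g in zip(pred_boxes, true_boxes)]
    return sum(ious) / len(ious)

def is_correct(pred_boxes, true_boxes, sigma):
    """A prediction counts as correct if its score reaches the threshold."""
    return video_localization_score(pred_boxes, true_boxes) >= sigma
```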

63 It is worth noting that detection evaluations are applied to both positive and negative testing examples while localization evaluations are only applied to positive ones. [sent-211, score-0.855]

64 As a result, the detection metric measures the reliability of the detections (precision/recall), while the localization metric indicates the quality of the detections. [sent-212, score-0.813]

65 More specifically, detection answers the question “Is there any action of interest in this video?” [sent-215, score-0.533]

66 Localization, meanwhile, answers “Provided that there is one action instance that appears in this video, where is it?” [sent-216, score-0.388]

67 Figure 2: Action detection results on UCF-Sport: detection curves of our proposed method compared with [12] and [23]. [sent-266, score-0.446]

68 Upper plots are detection results evaluated on the subset of frames given by [12], while lower plots are the results of all-frame evaluations. [sent-267, score-0.32]

69 Table 1: Action localization results on UCF-Sport: comparisons among our proposed method, [12], and [23]. [sent-290, score-0.556]

70 For [23], we train a linear SVM detector for each action class using the same features as ours. [sent-299, score-0.373]

71 The Max-Path algorithm is then applied to detect the actions of interest. [sent-300, score-0.298]

72 According to [12], their method uses HOG3D [26], so it is only able to detect and localize actions at a sparse set of frames where HOG3D interest points are present. [sent-301, score-0.484]

73 The first set is applied only to the subset of frames where [12] reports detections, and the second takes all frames into consideration. [sent-303, score-0.297]

74 Table 1 reports the results of action localization of different methods and action classes. [sent-304, score-1.254]

75 Our method significantly improves over [23] for all three action classes on both subset and all-frame evaluations. [sent-310, score-0.437]

76 However, [12] provides better detection results than ours on diving detection. [sent-312, score-0.399]

77 This better detection is because their interest-point-based sparse features are more suitable for deformable actions such as diving. [sent-313, score-0.426]

78 For a complete presentation, we visualize localization results of our method compared with those of [12] and [23] on a diving sequence (Figure 3). [sent-314, score-0.786]

79 All predicted boxes are plotted together with ground truth boxes for comparison. [sent-315, score-0.295]

80 Figure 3: Visualization of diving localization: plots of the localization scores of different methods on a diving video sequence. [sent-320, score-1.208]

81 Figure 4: Action detection and localization on UCF-Sport: Lan et al.'s [12] results are visualized in blue, Tran and Yuan's [23] in green, ours in red, and ground truth in black. [sent-323, score-0.952]

82 Our method and [23] can detect multiple instances of actions (two bottom left images). [sent-324, score-0.326]

83 Our method (red curve) localizes the diving action much more accurately than [23] (green curve). [sent-326, score-0.602]

84 [12] localizes the diving action fairly well; however, it is not applicable when more accurate localizations are required. [sent-327, score-0.574]

85 Oxford-TV: we compare our method with [23] on both detection and localization tasks. [sent-330, score-0.781]

86 For localization, besides the spatial localization (SL) metric used in the UCF dataset experiments, we also evaluate the methods by a temporal localization (TL) metric. [sent-332, score-1.198]

87 This metric is not applicable to the UCF dataset because most of its action instances start and end at the first and last frames, respectively. [sent-333, score-0.384]

88 Temporal localization is computed as the length [sent-334, score-0.556]

89 (measured in frames) of the intersection divided by the union of detection and ground truth. [sent-400, score-0.289]
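A minimal sketch of the TL metric as described (ours), assuming the detection and the ground truth are given as inclusive (start, end) frame intervals:

```python
def temporal_localization(det, gt):
    """TL = |intersection| / |union| of two frame intervals, in frames."""
    inter = max(0, min(det[1], gt[1]) - max(det[0], gt[0]) + 1)
    union = (det[1] - det[0] + 1) + (gt[1] - gt[0] + 1) - inter
    return inter / union

# e.g. temporal_localization((10, 49), (20, 59)) == 30 / 50 == 0.6
```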

90 Table 2 presents detection and localization results of our proposed method compared with [23]. [sent-401, score-0.781]

91 03% (Figure 6b) which demonstrates that our method can simultaneously detect and localize actions with high accuracy. [sent-415, score-0.396]

92 6 Conclusions: We have proposed a novel structured learning approach for spatio-temporal action localization in videos. [sent-416, score-1.039]

93 While most current approaches detect actions as 3D subvolumes [6, 9, 20, 29] or in a sparse subset of frames [12], our method can precisely detect and track actions in both the spatial and temporal domains. [sent-417, score-0.913]

94 Although [23] is also applicable to spatio-temporal action detection, this method cannot be optimized over the large video space due to its independently trained detectors. [sent-418, score-0.581]

95 Moreover, being free from people detection and background subtraction, our approach can efficiently handle unconstrained videos and be easily extended to detect other spatio-temporal video patterns. [sent-421, score-0.587]

96 Efficient action spotting based on a spacetime oriented structure representation. [sent-461, score-0.371]

97 Discriminative figure-centric models for joint action localization and recognition. [sent-503, score-0.892]

98 Learning hierarchical spatio-temporal features for action recognition with independent subspace analysis. [sent-517, score-0.373]

99 Unsupervised learning of human action categories using spatialtemporal words. [sent-539, score-0.388]

100 Action mach: A spatio-temporal maximum average correlation height filter for action recognition. [sent-560, score-0.336]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('localization', 0.556), ('action', 0.336), ('bi', 0.244), ('video', 0.217), ('diving', 0.202), ('detection', 0.197), ('actions', 0.192), ('lan', 0.156), ('structured', 0.147), ('tran', 0.142), ('al', 0.125), ('subvolume', 0.121), ('detect', 0.106), ('kiss', 0.101), ('frames', 0.088), ('dive', 0.081), ('trellis', 0.081), ('ucf', 0.081), ('boxes', 0.079), ('object', 0.078), ('path', 0.074), ('localize', 0.07), ('box', 0.067), ('videos', 0.067), ('yuan', 0.067), ('ground', 0.063), ('search', 0.061), ('hof', 0.061), ('subvolumes', 0.061), ('detections', 0.06), ('visualized', 0.057), ('bounding', 0.057), ('paths', 0.056), ('roc', 0.055), ('frame', 0.054), ('human', 0.052), ('cvpr', 0.052), ('evaluations', 0.051), ('lb', 0.049), ('constraint', 0.048), ('subwindow', 0.046), ('blaschko', 0.046), ('truth', 0.046), ('area', 0.043), ('track', 0.043), ('precision', 0.042), ('epr', 0.04), ('irregularities', 0.04), ('junsong', 0.04), ('klaser', 0.04), ('nanyang', 0.04), ('bmvc', 0.038), ('improves', 0.038), ('yi', 0.037), ('features', 0.037), ('cropped', 0.037), ('curve', 0.037), ('mach', 0.036), ('localizes', 0.036), ('marszalek', 0.036), ('subset', 0.035), ('temporal', 0.035), ('oriented', 0.035), ('argmax', 0.035), ('et', 0.033), ('subwindows', 0.033), ('boiman', 0.033), ('niebles', 0.033), ('scores', 0.031), ('volumetric', 0.031), ('rectangle', 0.031), ('enforcement', 0.031), ('mahadevan', 0.031), ('intractable', 0.031), ('output', 0.03), ('anomaly', 0.029), ('divided', 0.029), ('predicted', 0.028), ('pedestrian', 0.028), ('tl', 0.028), ('lampert', 0.028), ('recall', 0.028), ('method', 0.028), ('constraints', 0.027), ('localizing', 0.027), ('finley', 0.027), ('spatial', 0.027), ('reports', 0.026), ('approached', 0.026), ('tightly', 0.025), ('spatiotemporal', 0.025), ('dataset', 0.024), ('cutting', 0.024), ('yao', 0.024), ('sl', 0.024), ('curves', 0.024), ('ijcv', 0.023), ('locate', 0.023), ('windows', 0.023), ('svm', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000011 209 nips-2012-Max-Margin Structured Output Regression for Spatio-Temporal Action Localization

Author: Du Tran, Junsong Yuan

Abstract: Structured output learning has been successfully applied to object localization, where the mapping between an image and an object bounding box can be well captured. Its extension to action localization in videos, however, is much more challenging, because we need to predict the locations of the action patterns both spatially and temporally, i.e., identifying a sequence of bounding boxes that track the action in video. The problem becomes intractable due to the exponentially large size of the structured video space where actions could occur. We propose a novel structured learning approach for spatio-temporal action localization. The mapping between a video and a spatio-temporal action trajectory is learned. The intractable inference and learning problems are addressed by leveraging an efficient Max-Path search method, thus making it feasible to optimize the model over the whole structured space. Experiments on two challenging benchmark datasets show that our proposed method outperforms the state-of-the-art methods. 1

2 0.27038479 344 nips-2012-Timely Object Recognition

Author: Sergey Karayev, Tobias Baumgartner, Mario Fritz, Trevor Darrell

Abstract: In a large visual multi-class detection framework, the timeliness of results can be crucial. Our method for timely multi-class detection aims to give the best possible performance at any single point after a start time; it is terminated at a deadline time. Toward this goal, we formulate a dynamic, closed-loop policy that infers the contents of the image in order to decide which detector to deploy next. In contrast to previous work, our method significantly diverges from the predominant greedy strategies, and is able to learn to take actions with deferred values. We evaluate our method with a novel timeliness measure, computed as the area under an Average Precision vs. Time curve. Experiments are conducted on the PASCAL VOC object detection dataset. If execution is stopped when only half the detectors have been run, our method obtains 66% better AP than a random ordering, and 14% better performance than an intelligent baseline. On the timeliness measure, our method obtains at least 11% better performance. Our method is easily extensible, as it treats detectors and classifiers as black boxes and learns from execution traces using reinforcement learning. 1

3 0.2093845 311 nips-2012-Shifting Weights: Adapting Object Detectors from Image to Video

Author: Kevin Tang, Vignesh Ramanathan, Li Fei-fei, Daphne Koller

Abstract: Typical object detectors trained on images perform poorly on video, as there is a clear distinction in domain between the two types of data. In this paper, we tackle the problem of adapting object detectors learned from images to work well on videos. We treat the problem as one of unsupervised domain adaptation, in which we are given labeled data from the source domain (image), but only unlabeled data from the target domain (video). Our approach, self-paced domain adaptation, seeks to iteratively adapt the detector by re-training the detector with automatically discovered target domain examples, starting with the easiest first. At each iteration, the algorithm adapts by considering an increased number of target domain examples, and a decreased number of source domain examples. To discover target domain examples from the vast amount of video data, we introduce a simple, robust approach that scores trajectory tracks instead of bounding boxes. We also show how rich and expressive features specific to the target domain can be incorporated under the same framework. We show promising results on the 2011 TRECVID Multimedia Event Detection [1] and LabelMe Video [2] datasets that illustrate the benefit of our approach to adapt object detectors to video. 1

4 0.12468732 201 nips-2012-Localizing 3D cuboids in single-view images

Author: Jianxiong Xiao, Bryan Russell, Antonio Torralba

Abstract: In this paper we seek to detect rectangular cuboids and localize their corners in uncalibrated single-view images depicting everyday scenes. In contrast to recent approaches that rely on detecting vanishing points of the scene and grouping line segments to form cuboids, we build a discriminative parts-based detector that models the appearance of the cuboid corners and internal edges while enforcing consistency to a 3D cuboid model. Our model copes with different 3D viewpoints and aspect ratios and is able to detect cuboids across many different object categories. We introduce a database of images with cuboid annotations that spans a variety of indoor and outdoor scenes and show qualitative and quantitative results on our collected database. Our model out-performs baseline detectors that use 2D constraints alone on the task of localizing cuboid corners. 1

5 0.11675564 289 nips-2012-Recognizing Activities by Attribute Dynamics

Author: Weixin Li, Nuno Vasconcelos

Abstract: In this work, we consider the problem of modeling the dynamic structure of human activities in the attributes space. A video sequence is Ä?Ĺš rst represented in a semantic feature space, where each feature encodes the probability of occurrence of an activity attribute at a given time. A generative model, denoted the binary dynamic system (BDS), is proposed to learn both the distribution and dynamics of different activities in this space. The BDS is a non-linear dynamic system, which extends both the binary principal component analysis (PCA) and classical linear dynamic systems (LDS), by combining binary observation variables with a hidden Gauss-Markov state process. In this way, it integrates the representation power of semantic modeling with the ability of dynamic systems to capture the temporal structure of time-varying processes. An algorithm for learning BDS parameters, inspired by a popular LDS learning method from dynamic textures, is proposed. A similarity measure between BDSs, which generalizes the BinetCauchy kernel for LDS, is then introduced and used to design activity classiÄ?Ĺš ers. The proposed method is shown to outperform similar classiÄ?Ĺš ers derived from the kernel dynamic system (KDS) and state-of-the-art approaches for dynamics-based or attribute-based action recognition. 1

6 0.11539704 303 nips-2012-Searching for objects driven by context

7 0.11254552 40 nips-2012-Analyzing 3D Objects in Cluttered Images

8 0.10900647 106 nips-2012-Dynamical And-Or Graph Learning for Object Shape Modeling and Detection

9 0.1020058 81 nips-2012-Context-Sensitive Decision Forests for Object Detection

10 0.10194789 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

11 0.096239462 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video

12 0.091857731 360 nips-2012-Visual Recognition using Embedded Feature Selection for Curvature Self-Similarity

13 0.087453954 1 nips-2012-3D Object Detection and Viewpoint Estimation with a Deformable 3D Cuboid Model

14 0.086176179 168 nips-2012-Kernel Latent SVM for Visual Recognition

15 0.085906766 160 nips-2012-Imitation Learning by Coaching

16 0.075751215 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

17 0.07439556 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning

18 0.073904328 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning

19 0.070689686 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions

20 0.069859006 109 nips-2012-Efficient Monte Carlo Counterfactual Regret Minimization in Games with Many Player Actions


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.177), (1, -0.097), (2, -0.165), (3, -0.031), (4, 0.116), (5, -0.091), (6, 0.007), (7, -0.048), (8, -0.005), (9, -0.013), (10, -0.068), (11, 0.036), (12, 0.111), (13, -0.102), (14, 0.113), (15, 0.164), (16, -0.021), (17, -0.053), (18, -0.058), (19, -0.004), (20, 0.015), (21, 0.023), (22, -0.06), (23, -0.044), (24, -0.027), (25, -0.066), (26, 0.022), (27, 0.054), (28, 0.024), (29, 0.009), (30, 0.065), (31, 0.073), (32, -0.001), (33, 0.029), (34, -0.137), (35, 0.021), (36, 0.033), (37, -0.015), (38, -0.049), (39, 0.038), (40, 0.058), (41, -0.021), (42, 0.022), (43, -0.007), (44, -0.099), (45, -0.073), (46, 0.026), (47, 0.049), (48, -0.076), (49, 0.055)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97078896 209 nips-2012-Max-Margin Structured Output Regression for Spatio-Temporal Action Localization

Author: Du Tran, Junsong Yuan

Abstract: Structured output learning has been successfully applied to object localization, where the mapping between an image and an object bounding box can be well captured. Its extension to action localization in videos, however, is much more challenging, because we need to predict the locations of the action patterns both spatially and temporally, i.e., identifying a sequence of bounding boxes that track the action in video. The problem becomes intractable due to the exponentially large size of the structured video space where actions could occur. We propose a novel structured learning approach for spatio-temporal action localization. The mapping between a video and a spatio-temporal action trajectory is learned. The intractable inference and learning problems are addressed by leveraging an efficient Max-Path search method, thus making it feasible to optimize the model over the whole structured space. Experiments on two challenging benchmark datasets show that our proposed method outperforms the state-of-the-art methods. 1

2 0.8213439 344 nips-2012-Timely Object Recognition

Author: Sergey Karayev, Tobias Baumgartner, Mario Fritz, Trevor Darrell

Abstract: In a large visual multi-class detection framework, the timeliness of results can be crucial. Our method for timely multi-class detection aims to give the best possible performance at any single point after a start time; it is terminated at a deadline time. Toward this goal, we formulate a dynamic, closed-loop policy that infers the contents of the image in order to decide which detector to deploy next. In contrast to previous work, our method significantly diverges from the predominant greedy strategies, and is able to learn to take actions with deferred values. We evaluate our method with a novel timeliness measure, computed as the area under an Average Precision vs. Time curve. Experiments are conducted on the PASCAL VOC object detection dataset. If execution is stopped when only half the detectors have been run, our method obtains 66% better AP than a random ordering, and 14% better performance than an intelligent baseline. On the timeliness measure, our method obtains at least 11% better performance. Our method is easily extensible, as it treats detectors and classifiers as black boxes and learns from execution traces using reinforcement learning. 1

3 0.77531242 311 nips-2012-Shifting Weights: Adapting Object Detectors from Image to Video

Author: Kevin Tang, Vignesh Ramanathan, Li Fei-fei, Daphne Koller

Abstract: Typical object detectors trained on images perform poorly on video, as there is a clear distinction in domain between the two types of data. In this paper, we tackle the problem of adapting object detectors learned from images to work well on videos. We treat the problem as one of unsupervised domain adaptation, in which we are given labeled data from the source domain (image), but only unlabeled data from the target domain (video). Our approach, self-paced domain adaptation, seeks to iteratively adapt the detector by re-training the detector with automatically discovered target domain examples, starting with the easiest first. At each iteration, the algorithm adapts by considering an increased number of target domain examples, and a decreased number of source domain examples. To discover target domain examples from the vast amount of video data, we introduce a simple, robust approach that scores trajectory tracks instead of bounding boxes. We also show how rich and expressive features specific to the target domain can be incorporated under the same framework. We show promising results on the 2011 TRECVID Multimedia Event Detection [1] and LabelMe Video [2] datasets that illustrate the benefit of our approach to adapt object detectors to video. 1

4 0.70306557 1 nips-2012-3D Object Detection and Viewpoint Estimation with a Deformable 3D Cuboid Model

Author: Sanja Fidler, Sven Dickinson, Raquel Urtasun

Abstract: This paper addresses the problem of category-level 3D object detection. Given a monocular image, our aim is to localize the objects in 3D by enclosing them with tight oriented 3D bounding boxes. We propose a novel approach that extends the well-acclaimed deformable part-based model [1] to reason in 3D. Our model represents an object class as a deformable 3D cuboid composed of faces and parts, which are both allowed to deform with respect to their anchors on the 3D box. We model the appearance of each face in fronto-parallel coordinates, thus effectively factoring out the appearance variation induced by viewpoint. Our model reasons about face visibility patters called aspects. We train the cuboid model jointly and discriminatively and share weights across all aspects to attain efficiency. Inference then entails sliding and rotating the box in 3D and scoring object hypotheses. While for inference we discretize the search space, the variables are continuous in our model. We demonstrate the effectiveness of our approach in indoor and outdoor scenarios, and show that our approach significantly outperforms the stateof-the-art in both 2D [1] and 3D object detection [2]. 1

5 0.68313843 201 nips-2012-Localizing 3D cuboids in single-view images

Author: Jianxiong Xiao, Bryan Russell, Antonio Torralba

Abstract: In this paper we seek to detect rectangular cuboids and localize their corners in uncalibrated single-view images depicting everyday scenes. In contrast to recent approaches that rely on detecting vanishing points of the scene and grouping line segments to form cuboids, we build a discriminative parts-based detector that models the appearance of the cuboid corners and internal edges while enforcing consistency to a 3D cuboid model. Our model copes with different 3D viewpoints and aspect ratios and is able to detect cuboids across many different object categories. We introduce a database of images with cuboid annotations that spans a variety of indoor and outdoor scenes and show qualitative and quantitative results on our collected database. Our model out-performs baseline detectors that use 2D constraints alone on the task of localizing cuboid corners. 1

6 0.66701847 289 nips-2012-Recognizing Activities by Attribute Dynamics

7 0.65971416 303 nips-2012-Searching for objects driven by context

8 0.63942611 31 nips-2012-Action-Model Based Multi-agent Plan Recognition

9 0.5944581 357 nips-2012-Unsupervised Template Learning for Fine-Grained Object Recognition

10 0.58609217 106 nips-2012-Dynamical And-Or Graph Learning for Object Shape Modeling and Detection

11 0.56007749 360 nips-2012-Visual Recognition using Embedded Feature Selection for Curvature Self-Similarity

12 0.55365652 40 nips-2012-Analyzing 3D Objects in Cluttered Images

13 0.51626909 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

14 0.48654908 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning

15 0.48506981 223 nips-2012-Multi-criteria Anomaly Detection using Pareto Depth Analysis

16 0.47981077 185 nips-2012-Learning about Canonical Views from Internet Image Collections

17 0.47298086 350 nips-2012-Trajectory-Based Short-Sighted Probabilistic Planning

18 0.47124228 103 nips-2012-Distributed Probabilistic Learning for Camera Networks with Missing Data

19 0.46989462 101 nips-2012-Discriminatively Trained Sparse Code Gradients for Contour Detection

20 0.46912506 198 nips-2012-Learning with Target Prior


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.039), (1, 0.096), (17, 0.035), (21, 0.02), (38, 0.093), (39, 0.011), (42, 0.035), (44, 0.021), (54, 0.078), (55, 0.011), (74, 0.112), (76, 0.164), (80, 0.063), (86, 0.012), (92, 0.053), (93, 0.068)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.90229118 209 nips-2012-Max-Margin Structured Output Regression for Spatio-Temporal Action Localization

Author: Du Tran, Junsong Yuan

Abstract: Structured output learning has been successfully applied to object localization, where the mapping between an image and an object bounding box can be well captured. Its extension to action localization in videos, however, is much more challenging, because we need to predict the locations of the action patterns both spatially and temporally, i.e., identifying a sequence of bounding boxes that track the action in video. The problem becomes intractable due to the exponentially large size of the structured video space where actions could occur. We propose a novel structured learning approach for spatio-temporal action localization. The mapping between a video and a spatio-temporal action trajectory is learned. The intractable inference and learning problems are addressed by leveraging an efficient Max-Path search method, thus making it feasible to optimize the model over the whole structured space. Experiments on two challenging benchmark datasets show that our proposed method outperforms the state-of-the-art methods. 1

2 0.87079775 266 nips-2012-Patient Risk Stratification for Hospital-Associated C. diff as a Time-Series Classification Task

Author: Jenna Wiens, Eric Horvitz, John V. Guttag

Abstract: A patient’s risk for adverse events is affected by temporal processes including the nature and timing of diagnostic and therapeutic activities, and the overall evolution of the patient’s pathophysiology over time. Yet many investigators ignore this temporal aspect when modeling patient outcomes, considering only the patient’s current or aggregate state. In this paper, we represent patient risk as a time series. In doing so, patient risk stratification becomes a time-series classification task. The task differs from most applications of time-series analysis, like speech processing, since the time series itself must first be extracted. Thus, we begin by defining and extracting approximate risk processes, the evolving approximate daily risk of a patient. Once obtained, we use these signals to explore different approaches to time-series classification with the goal of identifying high-risk patterns. We apply the classification to the specific task of identifying patients at risk of testing positive for hospital acquired Clostridium difficile. We achieve an area under the receiver operating characteristic curve of 0.79 on a held-out set of several hundred patients. Our two-stage approach to risk stratification outperforms classifiers that consider only a patient’s current state (p<0.05). 1

3 0.86733216 223 nips-2012-Multi-criteria Anomaly Detection using Pareto Depth Analysis

Author: Ko-jen Hsiao, Kevin Xu, Jeff Calder, Alfred O. Hero

Abstract: We consider the problem of identifying patterns in a data set that exhibit anomalous behavior, often referred to as anomaly detection. In most anomaly detection algorithms, the dissimilarity between data samples is calculated by a single criterion, such as Euclidean distance. However, in many cases there may not exist a single dissimilarity measure that captures all possible anomalous patterns. In such a case, multiple criteria can be defined, and one can test for anomalies by scalarizing the multiple criteria using a linear combination of them. If the importance of the different criteria are not known in advance, the algorithm may need to be executed multiple times with different choices of weights in the linear combination. In this paper, we introduce a novel non-parametric multi-criteria anomaly detection method using Pareto depth analysis (PDA). PDA uses the concept of Pareto optimality to detect anomalies under multiple criteria without having to run an algorithm multiple times with different choices of weights. The proposed PDA approach scales linearly in the number of criteria and is provably better than linear combinations of the criteria. 1

4 0.86397111 344 nips-2012-Timely Object Recognition

Author: Sergey Karayev, Tobias Baumgartner, Mario Fritz, Trevor Darrell

Abstract: In a large visual multi-class detection framework, the timeliness of results can be crucial. Our method for timely multi-class detection aims to give the best possible performance at any single point after a start time; it is terminated at a deadline time. Toward this goal, we formulate a dynamic, closed-loop policy that infers the contents of the image in order to decide which detector to deploy next. In contrast to previous work, our method significantly diverges from the predominant greedy strategies, and is able to learn to take actions with deferred values. We evaluate our method with a novel timeliness measure, computed as the area under an Average Precision vs. Time curve. Experiments are conducted on the PASCAL VOC object detection dataset. If execution is stopped when only half the detectors have been run, our method obtains 66% better AP than a random ordering, and 14% better performance than an intelligent baseline. On the timeliness measure, our method obtains at least 11% better performance. Our method is easily extensible, as it treats detectors and classifiers as black boxes and learns from execution traces using reinforcement learning. 1

5 0.86081505 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries

Author: Aaron Wilson, Alan Fern, Prasad Tadepalli

Abstract: We consider the problem of learning control policies via trajectory preference queries to an expert. In particular, the agent presents an expert with short runs of a pair of policies originating from the same state and the expert indicates which trajectory is preferred. The agent’s goal is to elicit a latent target policy from the expert with as few queries as possible. To tackle this problem we propose a novel Bayesian model of the querying process and introduce two methods that exploit this model to actively select expert queries. Experimental results on four benchmark problems indicate that our model can effectively learn policies from trajectory preference queries and that active query selection can be substantially more efficient than random selection. 1

6 0.85554403 188 nips-2012-Learning from Distributions via Support Measure Machines

7 0.85339433 303 nips-2012-Searching for objects driven by context

8 0.84416759 210 nips-2012-Memorability of Image Regions

9 0.84004086 101 nips-2012-Discriminatively Trained Sparse Code Gradients for Contour Detection

10 0.83918524 81 nips-2012-Context-Sensitive Decision Forests for Object Detection

11 0.83899677 201 nips-2012-Localizing 3D cuboids in single-view images

12 0.83741975 357 nips-2012-Unsupervised Template Learning for Fine-Grained Object Recognition

13 0.83621591 287 nips-2012-Random function priors for exchangeable arrays with applications to graphs and relational data

14 0.83582801 176 nips-2012-Learning Image Descriptors with the Boosting-Trick

15 0.83491027 274 nips-2012-Priors for Diversity in Generative Latent Variable Models

16 0.83437049 193 nips-2012-Learning to Align from Scratch

17 0.83247489 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video

18 0.8315829 177 nips-2012-Learning Invariant Representations of Molecules for Atomization Energy Prediction

19 0.82697225 38 nips-2012-Algorithms for Learning Markov Field Policies

20 0.82673419 43 nips-2012-Approximate Message Passing with Consistent Parameter Estimation and Applications to Sparse Learning