nips nips2012 nips2012-311 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kevin Tang, Vignesh Ramanathan, Li Fei-fei, Daphne Koller
Abstract: Typical object detectors trained on images perform poorly on video, as there is a clear distinction in domain between the two types of data. In this paper, we tackle the problem of adapting object detectors learned from images to work well on videos. We treat the problem as one of unsupervised domain adaptation, in which we are given labeled data from the source domain (image), but only unlabeled data from the target domain (video). Our approach, self-paced domain adaptation, seeks to iteratively adapt the detector by re-training the detector with automatically discovered target domain examples, starting with the easiest first. At each iteration, the algorithm adapts by considering an increased number of target domain examples, and a decreased number of source domain examples. To discover target domain examples from the vast amount of video data, we introduce a simple, robust approach that scores trajectory tracks instead of bounding boxes. We also show how rich and expressive features specific to the target domain can be incorporated under the same framework. We show promising results on the 2011 TRECVID Multimedia Event Detection [1] and LabelMe Video [2] datasets that illustrate the benefit of our approach to adapt object detectors to video. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Typical object detectors trained on images perform poorly on video, as there is a clear distinction in domain between the two types of data. [sent-3, score-0.501]
2 In this paper, we tackle the problem of adapting object detectors learned from images to work well on videos. [sent-4, score-0.413]
3 We treat the problem as one of unsupervised domain adaptation, in which we are given labeled data from the source domain (image), but only unlabeled data from the target domain (video). [sent-5, score-1.064]
4 Our approach, self-paced domain adaptation, seeks to iteratively adapt the detector by re-training the detector with automatically discovered target domain examples, starting with the easiest first. [sent-6, score-1.271]
5 At each iteration, the algorithm adapts by considering an increased number of target domain examples, and a decreased number of source domain examples. [sent-7, score-0.734]
6 To discover target domain examples from the vast amount of video data, we introduce a simple, robust approach that scores trajectory tracks instead of bounding boxes. [sent-8, score-1.378]
7 We also show how rich and expressive features specific to the target domain can be incorporated under the same framework. [sent-9, score-0.513]
8 1 Introduction Following recent advances in learning algorithms and robust feature representations, tasks in video understanding have shifted from classifying simple motions and actions [3, 4] to detecting complex events and activities in Internet videos [1,5,6]. [sent-11, score-0.632]
9 Because many events are characterized by key objects and their interactions, it is imperative to have robust object detectors that can provide accurate detections. [sent-13, score-0.397]
10 However, as seen in Figure 1, the image and video domains are quite different: images of objects are often taken in controlled settings that differ greatly from the real-world situations in which they appear in video. [sent-17, score-0.511]
11 To adapt object detectors from image to video, we take an incremental, self-paced approach to learn from the large amounts of unlabeled video data available. [sent-19, score-0.911]
12 We make the assumption that within our unlabeled video data, there exist instances of our target object. [sent-20, score-0.773]
13 However, we do not assume that every video has an instance of the object, due to the noise present in Internet videos. [sent-21, score-0.411]
14 We start by introducing a simple, robust method for discovering examples in the video data using Kanade-Lucas-Tomasi (KLT) feature tracks [8,9]. [sent-22, score-0.688]
15 This is done by iteratively including examples from the video data into the training set, while removing examples from the image data based on the difficulty of the examples. [sent-25, score-0.763]
16 In addition, it is common to have discriminative features that are only available in the target domain, which we term target features. [sent-29, score-0.583]
17 For example, in the video domain, there are contextual features in the spatial and temporal vicinity of our detected object that we can take advantage of when performing detection. [sent-30, score-0.706]
18 2 Related Work Most relevant are works that also deal with adapting detectors to video [10–13], but these works typically deal with a constrained set of videos and limited object classes. [sent-32, score-0.968]
19 The work of [14] deals with a similar problem, but they adapt detectors from video to image. [sent-33, score-0.59]
20 More similar to our method are approaches based on optimizing Support Vector Machine (SVM) related objectives [19–24] or joint cost functions [25], that treat the features as fixed and seek to adapt parameters of the classifier from source to target domain. [sent-37, score-0.493]
21 However, with the exception of [18, 25], previous works deal with supervised or semi-supervised domain adaptation, which require labeled data in the target domain to generate associations between the source and target domains. [sent-38, score-1.023]
22 In our setting, unsupervised domain adaptation, the target domain examples are unlabeled, and we must simultaneously discover and label examples in addition to learning parameters. [sent-39, score-0.926]
23 The objective we optimize to learn our detector draws inspiration from [26–28], in which we include and exclude the loss of certain examples using binary-valued indicator variables. [sent-40, score-0.395]
24 However, our method is different from [26] in that we have three sets of weights that govern the source examples, target examples, and target features. [sent-42, score-0.674]
25 The weights are annealed in different directions, giving us the flexibility to iteratively include examples from the target domain, exclude examples from the source domain, and include parameters for the target features. [sent-43, score-1.024]
26 We assume that we are given a large amount of unlabeled video data with positive instances of our object class within some of these videos. [sent-51, score-0.671]
27 We start by initializing our detector using image positives and negatives (Step 1). [sent-53, score-0.736]
28 We then proceed to enter a loop in which we discover the top K video positives and negatives (Step 2), re-train our detector using these (Step 3), and then update the annealed parameters of the algorithm (Step 4). [sent-54, score-1.269]
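The loop just described can be summarized in a short sketch. This is an illustrative outline only: the step functions are passed in as callables whose names and signatures are placeholders rather than the authors' implementation, and the concrete steps are detailed in the sections that follow.

import typing

def adapt_detector(train_self_paced, discover_top_k, anneal, image_data, unlabeled_videos,
                   num_iters=5, K=100):
    # Step 1: initialize the detector on labeled image data only (no video examples yet).
    w = train_self_paced(image_data, video_data=[], params=anneal(0))
    for t in range(1, num_iters + 1):
        # Step 2: discover the top K video positives and negatives by scoring tracks.
        video_data = discover_top_k(w, unlabeled_videos, K)
        # Step 3: re-train with the self-paced objective described below.
        # Step 4: anneal(t) supplies the updated K^source, K^target, K^feat for this iteration.
        w = train_self_paced(image_data, video_data, params=anneal(t))
    return w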
29 We initialize our detector (Step 1 of Figure 2) by training a classifier on the labeled image positives and negatives, which we denote by our dataset (⟨x_1, y_1⟩, . . . , ⟨x_n, y_n⟩). [sent-55, score-0.590]
30 Our goal then is to discover the top K positive and negative examples from the unlabeled videos, and to use these examples to help re-train our detector. [sent-61, score-0.5]
31 We do not attempt to discover all instances, but simply a sufficient quantity to help adapt our detector to the video domain. [sent-62, score-0.793]
32 To discover the top K video positives and negatives (Step 2 of Figure 2), we utilize the strong prior of temporal continuity and score trajectory tracks instead of bounding boxes, which we describe in Section 3.1. [sent-63, score-1.328]
33 Given the discovered examples, we optimize a novel objective inspired by self-paced learning [26] that simultaneously selects easy examples and trains a new detector (Step 3 of Figure 2). [sent-65, score-0.418]
34 3.1 Discovering Examples in Video In this step of the algorithm, we are given weights w of an object detector that can be used to score bounding boxes in video frames. [sent-68, score-1.107]
35 A naive approach would run our detector on frames of video, taking the highest scoring and lowest scoring bounding boxes as the top K video positives and negatives. [sent-69, score-1.311]
36 An object that appears in one frame of a video will almost certainly appear at a nearby location in neighboring frames as well. [sent-71, score-0.704]
37 We obtain tracks by running a KLT tracker on our videos, which tracks a sparse set of features over large periods of time. [sent-74, score-0.397]
38 Because of the large number of unlabeled videos we have, we elect to extract KLT tracks rather than computing dense tracks using optical flow. [sent-75, score-0.603]
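As a rough sketch of this step, sparse KLT tracks could be extracted with an off-the-shelf tracker; the use of OpenCV below and all parameter values are assumptions for illustration, not details taken from the paper.

import cv2

def klt_tracks(video_path):
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    prev = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Detect a sparse set of good features to track in the first frame.
    pts = cv2.goodFeaturesToTrack(prev, maxCorners=500, qualityLevel=0.01, minDistance=8)
    if pts is None:
        return []
    tracks = [[tuple(p.ravel())] for p in pts]
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Propagate the feature points to the next frame with pyramidal Lucas-Kanade.
        new_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev, gray, pts, None)
        for tr, p, st in zip(tracks, new_pts, status):
            if st[0] == 1:                      # extend only points tracked successfully
                tr.append(tuple(p.ravel()))
        pts, prev = new_pts, gray
    return tracks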
39 Note that the number of bounding boxes in B is only dependent on the dimensions of the detector and the scales we search over. [sent-81, score-0.445]
40 The score b^s_i is computed by pooling scores of the bounding box along multiple points of the track in time. [sent-82, score-0.388]
41 After scoring each track in our unlabeled videos, we select the top and bottom few scoring tracks, and extract bounding boxes from each using the associated box coordinates (b^x_max, b^y_max) to get our top K video positives and negatives. [sent-85, score-1.346]
42 For each box, we average the scores at each point along the track, and take the box with the maximum score as the score and associated bounding box coordinates for this track. [sent-89, score-0.470]
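Putting the scoring rule above into a small sketch: each candidate box placement is scored at several points in time along the track, the scores are averaged, and the best placement provides the track score and its bounding-box coordinates. The detector_score callable and the (width, height) placement set are illustrative stand-ins for the actual detector and search scales.

import numpy as np

def score_track(track, frames, detector_score, placements):
    # track: list of (frame_index, x, y) trajectory points
    # placements: candidate (width, height) box sizes centered on the track point
    pooled = np.zeros(len(placements))
    for (f, x, y) in track:                          # pool detector scores along the track in time
        for i, (bw, bh) in enumerate(placements):
            box = (x - bw / 2.0, y - bh / 2.0, bw, bh)
            pooled[i] += detector_score(frames[f], box)
    pooled /= len(track)                             # average score per placement
    best = int(np.argmax(pooled))
    return pooled[best], placements[best]            # track score b^s and the winning box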
43 3.2 Self-Paced Domain Adaptation In this step of the algorithm, we are given the discovered top K video positives and negatives, which we denote by the dataset (⟨z_1, h_1⟩, . . . , ⟨z_k, h_k⟩). [sent-95, score-0.770]
44 Ideally, we would like to re-train with a set of easier examples whose labels we are confident of first, and then re-discover video examples with this new detector. [sent-105, score-0.649]
45 By repeating this process, we can avoid bad examples and iteratively refine our set of top K video positives and negatives before having to train with all of them. [sent-107, score-1.099]
46 Formalizing this intuition, our algorithm selects easier examples to learn from among the discovered video examples, and simultaneously selects harder examples among the image examples to stop learning from. [sent-108, score-0.888]
47 The number of examples selected from the video examples and image examples is governed by weights that will be annealed over iterations (Step 4 of Figure 2). [sent-110, score-0.976]
48 To prevent the algorithm from assigning all examples to be difficult, we introduce parameters K^source and K^target that control the number of examples considered from the source and target domain, respectively. [sent-120, score-0.982]
49 (w^{t+1}, v^{t+1}, u^{t+1}) = \arg\min_{w,v,u} r(w) + C\left( \sum_{i=1}^{n} v_i \mathrm{Loss}(x_i, y_i; w) + \sum_{j=1}^{k} u_j \mathrm{Loss}(z_j, h_j; w) \right) - \frac{1}{K^{source}} \sum_{i=1}^{n} v_i - \frac{1}{K^{target}} \sum_{j=1}^{k} u_j \qquad (2) [sent-121, score-0.404]
50 If K^target is large, the algorithm prefers to consider only easy target examples with a small Loss(·), and the same is true for K^source. [sent-122, score-0.774]
51 In the annealing of the weights for the algorithm (Step 4 of Figure 2), we decrease K^target and increase K^source to iteratively include more examples from the target domain and decrease examples from the source domain. [sent-123, score-1.267]
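With the detector weights w fixed, the indicator variables in the objective (2) reconstructed above decouple per example: an example is kept exactly when its scaled loss falls below the threshold set by 1/K^source or 1/K^target, which is what makes large K values admit only easy examples. A minimal sketch of one alternating step follows, with the loss and the SVM trainer passed in as generic callables (placeholders, not the authors' solver).

import numpy as np

def self_paced_step(w, source_data, target_data, C, K_source, K_target, loss, train):
    # source_data: labeled image pairs (x_i, y_i); target_data: discovered video pairs (z_j, h_j).
    # Closed-form selection with w fixed: keep an example iff its scaled loss is below
    # the threshold implied by K^source (image data) or K^target (video data).
    v = np.array([1 if C * loss(w, x, y) < 1.0 / K_source else 0 for (x, y) in source_data])
    u = np.array([1 if C * loss(w, z, h) < 1.0 / K_target else 0 for (z, h) in target_data])
    selected = [ex for ex, vi in zip(source_data, v) if vi] + \
               [ex for ex, uj in zip(target_data, u) if uj]
    w_new = train(selected, C)      # re-train the detector on the currently selected examples
    return w_new, v, u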
52 Leveraging target features Often, the target domain we are adapting to has additional features we can take advantage of. [sent-127, score-0.938]
53 However, as we iteratively adapt to the target domain and build more confidence in our detector, we can start utilizing these target features to help with detection. [sent-129, score-0.884]
54 We assume there are a set of features that are shared between the source and target domains, φ_shared, and a set of target domain-only features, φ_target, so that the full feature vector is φ = [φ_shared : φ_target]. [sent-131, score-1.287]
55 Since the source data doesn’t have target features, we initialize those features to be 0 so that w_target doesn’t affect the loss on the source data. [sent-133, score-0.711]
56 The new objective function is formulated as: (w^{t+1}, v^{t+1}, u^{t+1}) = \arg\min_{w,v,u} r(w) + C\left( \sum_{i=1}^{n} v_i \mathrm{Loss}(x_i, y_i; w) + \sum_{j=1}^{k} u_j \mathrm{Loss}(z_j, h_j; w) \right) - \frac{1}{K^{source}} \sum_{i=1}^{n} v_i - \frac{1}{K^{target}} \sum_{j=1}^{k} u_j + \frac{1}{K^{feat}} \|w_{target}\|_1 \qquad (3) [sent-134, score-0.475]
57 To anneal the weights for the target features, we increase K^feat to iteratively reduce the weight of the L1 penalty on the target features so that w_target can become non-zero. [sent-136, score-0.827]
58 Intuitively, we are forcing the weights w to only use shared features first, and to consider more target features when we have a better model of the target domain. [sent-137, score-0.715]
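A small sketch of this construction, assuming the feature vector is a plain array with the shared dimensions first and the target-only dimensions appended (the layout and helper names are illustrative assumptions):

import numpy as np

def pad_source_features(x_shared, n_target_dims):
    # Source examples get zeros in the target-only dimensions, so w_target never
    # contributes to their loss.
    return np.concatenate([np.asarray(x_shared, dtype=float), np.zeros(n_target_dims)])

def target_l1_penalty(w, n_shared_dims, K_feat):
    # The (1/K^feat) * ||w_target||_1 term of the objective above; while K^feat is small
    # the penalty is large and w_target is pushed to zero.
    w_target = np.asarray(w, dtype=float)[n_shared_dims:]
    return (1.0 / K_feat) * np.abs(w_target).sum()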
59 4 Experiments We present experimental results for adapting object detectors on the 2011 TRECVID Multimedia Event Detection (MED) dataset [1] and LabelMe Video [2] dataset. [sent-141, score-0.381]
60 The detection scores are computed on annotated video frames from the respective video datasets that are disjoint from the unlabeled videos used in the adapting stage. [sent-145, score-1.431]
61 The spatial features are taken to be HOG features bordering the object with dimensions half the size of the object bounding box. [sent-150, score-0.58]
62 To isolate the effects of adaptation and better analyze our method, we restrict our experiments to the setting in which we fix the video negatives, and focus our problem on adapting from the labeled image positives to the unlabeled video positives. [sent-153, score-1.523]
63 This scenario is realistic and commonly seen, as we can easily obtain video negatives by sampling from a set of unlabeled or weakly-labeled videos. [sent-154, score-0.706]
64 For the K^target and K^source weights, we set values for the first and final iterations, and linearly interpolate values for the remaining iterations in between. [sent-238, score-0.409]
65 For the K^target weight, we estimate the weights so that we start by considering only the video examples that have no loss, and end with all video examples considered. [sent-239, score-1.362]
66 For the target features, we set the algorithm to allow target features at the midpoint of total iterations. [sent-241, score-0.583]
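A sketch of such a schedule is given below; the endpoint values and the exact way K^feat is switched at the midpoint are illustrative assumptions, not the paper's settings.

import numpy as np

def anneal_schedule(num_iters, K_target_range=(1e3, 1e-3), K_source_range=(1e-3, 1e3),
                    K_feat_small=1e-6, K_feat_large=1e6):
    # K^target decreases (admitting more video examples) and K^source increases (retaining
    # fewer image examples), linearly from the first to the final iteration; K^feat stays
    # small (strong L1 penalty, w_target held at zero) until the midpoint, then becomes large.
    K_target = np.linspace(K_target_range[0], K_target_range[1], num_iters)
    K_source = np.linspace(K_source_range[0], K_source_range[1], num_iters)
    K_feat = np.array([K_feat_small if t < num_iters // 2 else K_feat_large
                       for t in range(num_iters)])
    return K_target, K_source, K_feat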
67 Model selection The free model parameters that can be varied are the number of top K examples to discover, the ending K^source weight, and whether or not to use target features. [sent-243, score-0.540]
68 In our results, we perform model selection by comparing the distribution of scores on the discovered video positives. [sent-244, score-0.538]
69 The distributions are compared between the initial models from iteration 1 for different model parameters to select K and K^source, and between the final iteration 5 models for different model parameters to determine the use of target features. [sent-245, score-0.372]
70 This allows us to evaluate the strength of the initial model trained on the image positives and video negatives, as well as our final adapted model. [sent-246, score-0.723]
71 4.2 Baseline Comparisons InitialBL This baseline is the initial detector trained only on image positives and video negatives. [sent-249, score-1.037]
72 VideoPosBL This baseline uses the initial detector to discover the top K video positives from the unlabeled video, then trains with all these examples without iterating. [sent-250, score-1.330]
73 Thus, it incorporates our idea of discovering video positives by scoring tracks and re-training, but does not use self-paced domain adaptation for learning weights. [sent-251, score-1.193]
74 Gopalan et al. This is a state-of-the-art method for unsupervised domain adaptation [18] that models the domain shift in feature space. [sent-255, score-0.509]
75 Since we are not given labels in the target domain, most previous methods for domain adaptation cannot be applied to our setting. [sent-256, score-0.579]
76 4.3 TRECVID MED The 2011 TRECVID MED dataset [1] consists of a collection of Internet videos collected by the Linguistic Data Consortium from various Internet video hosting sites. [sent-262, score-0.587]
77 There are a total of 15 complex events, and videos are labeled with either an event class or no label, where an absence of label indicates the video belongs to no event class. [sent-263, score-0.721]
78 We select 6 object classes to learn object detectors for because they are commonly present in selected events: “Skateboard”, “Animal”, “Tire”, “Vehicle”, “Sandwich”, and “Sewing machine”. [sent-264, score-0.464]
79 After sets of iterations, we show samples of newly discovered video positives (red boxes) that were not in the set of top K of previous iterations (left, middle columns). [sent-266, score-0.807]
80 As our model adapts, it is able to iteratively refine its set of top K video positives. [sent-268, score-0.513]
81 Green boxes show detections from our method, red boxes detections from “InitialBL”, blue boxes detections from “VideoPosBL”, and magenta boxes detections from Gopalan et al. [sent-271, score-0.904]
82 The video negatives were randomly sampled from the videos that were labeled with no event class. [sent-275, score-0.857]
83 To test our algorithm, we manually annotated approximately 200 frames with bounding boxes of positive examples for each object, resulting in 1234 annotated frames total from over 500 videos, giving us a diverse set of situations the objects can appear in. [sent-276, score-0.63]
84 For each object, we use 20 videos from the associated event as unlabeled video training data. [sent-277, score-0.746]
85 The video negatives were randomly sampled from the videos that were not annotated with any of these objects. [sent-283, score-0.814]
86 For each object class, we use the remaining videos that contain the object as the unlabeled video training data, resulting in around 9 videos per object. [sent-285, score-1.172]
87 This shows that if we discover the top K video positives and re-train our detector with all of them, we do not obtain consistent gains in performance. [sent-289, score-1.026]
88 As illustrated in Figure 4, our method is able to add new video positives from iteration to iteration that are good examples, and remove bad examples at the same time. [sent-291, score-0.813]
89 93% for classes that choose models with target features versus no target features. [sent-305, score-0.61]
90 However, we hypothesize that the inclusion of more complex target features such as temporal movement could help our method achieve even better results. [sent-307, score-0.393]
91 Although this is not a common occurrence, it can happen when our method of self-paced domain adaptation replaces good video positives taken in the first iteration with bad examples in future iterations. [sent-309, score-1.141]
92 This situation arises when there are incorrect examples present in the easiest of the top K video positives, causing our detector to re-train and iteratively become worse. [sent-310, score-0.898]
93 To discover examples in the unlabeled video data, we classify tracks instead of bounding boxes, allowing us to leverage temporal continuity to avoid spurious detections, and to discover examples we would’ve otherwise missed. [sent-313, score-1.191]
94 Furthermore, we introduce a novel self-paced domain adaptation algorithm that allows our detector to iteratively adapt from source to target domain, while also considering target features unique to the target domain. [sent-314, score-1.616]
95 We’ve shown convincing results that illustrate the benefit of our approach to adapting object detectors to video. [sent-316, score-0.381]
96 A measure that would allow us to estimate our performance on the target domain with theoretical guarantees would be an interesting direction. [sent-318, score-0.432]
97 Another possible direction would be to relax the assumption of having no labeled target domain examples, and to formulate similar methods for this scenario. [sent-319, score-0.47]
98 LabelMe video: Building a video database with human annotations. [sent-348, score-0.411]
99 Detection by detections: Non-parametric detector adaptation for a video. [sent-410, score-0.387]
100 Exploiting weakly-labeled web images to improve object classification: a domain adaptation approach. [sent-462, score-0.509]
wordName wordTfidf (topN-words)
[('video', 0.411), ('target', 0.251), ('positives', 0.251), ('detector', 0.24), ('initialbl', 0.185), ('negatives', 0.184), ('domain', 0.181), ('sandwich', 0.178), ('videos', 0.176), ('tracks', 0.158), ('object', 0.149), ('adaptation', 0.147), ('detectors', 0.139), ('trecvid', 0.134), ('source', 0.121), ('examples', 0.119), ('videoposbl', 0.118), ('boxes', 0.116), ('unlabeled', 0.111), ('detections', 0.11), ('gopalan', 0.11), ('vid', 0.101), ('wtarget', 0.101), ('tire', 0.096), ('adapting', 0.093), ('box', 0.091), ('bounding', 0.089), ('car', 0.088), ('sewing', 0.084), ('klt', 0.082), ('features', 0.081), ('discover', 0.075), ('labelme', 0.07), ('scores', 0.068), ('pls', 0.067), ('skateboard', 0.067), ('frames', 0.065), ('objects', 0.064), ('med', 0.063), ('im', 0.062), ('image', 0.061), ('track', 0.06), ('discovered', 0.059), ('annealed', 0.059), ('vehicle', 0.055), ('cvpr', 0.053), ('detection', 0.053), ('iteratively', 0.053), ('frame', 0.053), ('weights', 0.051), ('score', 0.051), ('imagenet', 0.05), ('ap', 0.049), ('top', 0.049), ('event', 0.048), ('bx', 0.045), ('scoring', 0.045), ('events', 0.045), ('animal', 0.044), ('annotated', 0.043), ('keyboard', 0.041), ('magni', 0.041), ('adapt', 0.04), ('baseline', 0.04), ('internet', 0.039), ('eat', 0.039), ('labeled', 0.038), ('iterations', 0.037), ('loss', 0.036), ('temporal', 0.034), ('visualizations', 0.034), ('hxn', 0.034), ('intial', 0.034), ('wshared', 0.034), ('svm', 0.034), ('images', 0.032), ('uj', 0.032), ('bad', 0.032), ('spatial', 0.031), ('saenko', 0.03), ('multimedia', 0.029), ('coordinates', 0.029), ('unseen', 0.029), ('bs', 0.029), ('eccv', 0.028), ('stanford', 0.028), ('dent', 0.028), ('iarpa', 0.027), ('occuring', 0.027), ('hog', 0.027), ('incremental', 0.027), ('classes', 0.027), ('help', 0.027), ('er', 0.026), ('trajectory', 0.026), ('appear', 0.026), ('bicycle', 0.026), ('easiest', 0.026), ('boat', 0.026), ('placements', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 311 nips-2012-Shifting Weights: Adapting Object Detectors from Image to Video
Author: Kevin Tang, Vignesh Ramanathan, Li Fei-fei, Daphne Koller
Abstract: Typical object detectors trained on images perform poorly on video, as there is a clear distinction in domain between the two types of data. In this paper, we tackle the problem of adapting object detectors learned from images to work well on videos. We treat the problem as one of unsupervised domain adaptation, in which we are given labeled data from the source domain (image), but only unlabeled data from the target domain (video). Our approach, self-paced domain adaptation, seeks to iteratively adapt the detector by re-training the detector with automatically discovered target domain examples, starting with the easiest first. At each iteration, the algorithm adapts by considering an increased number of target domain examples, and a decreased number of source domain examples. To discover target domain examples from the vast amount of video data, we introduce a simple, robust approach that scores trajectory tracks instead of bounding boxes. We also show how rich and expressive features specific to the target domain can be incorporated under the same framework. We show promising results on the 2011 TRECVID Multimedia Event Detection [1] and LabelMe Video [2] datasets that illustrate the benefit of our approach to adapt object detectors to video. 1
2 0.22044422 344 nips-2012-Timely Object Recognition
Author: Sergey Karayev, Tobias Baumgartner, Mario Fritz, Trevor Darrell
Abstract: In a large visual multi-class detection framework, the timeliness of results can be crucial. Our method for timely multi-class detection aims to give the best possible performance at any single point after a start time; it is terminated at a deadline time. Toward this goal, we formulate a dynamic, closed-loop policy that infers the contents of the image in order to decide which detector to deploy next. In contrast to previous work, our method significantly diverges from the predominant greedy strategies, and is able to learn to take actions with deferred values. We evaluate our method with a novel timeliness measure, computed as the area under an Average Precision vs. Time curve. Experiments are conducted on the PASCAL VOC object detection dataset. If execution is stopped when only half the detectors have been run, our method obtains 66% better AP than a random ordering, and 14% better performance than an intelligent baseline. On the timeliness measure, our method obtains at least 11% better performance. Our method is easily extensible, as it treats detectors and classifiers as black boxes and learns from execution traces using reinforcement learning. 1
3 0.2093845 209 nips-2012-Max-Margin Structured Output Regression for Spatio-Temporal Action Localization
Author: Du Tran, Junsong Yuan
Abstract: Structured output learning has been successfully applied to object localization, where the mapping between an image and an object bounding box can be well captured. Its extension to action localization in videos, however, is much more challenging, because we need to predict the locations of the action patterns both spatially and temporally, i.e., identifying a sequence of bounding boxes that track the action in video. The problem becomes intractable due to the exponentially large size of the structured video space where actions could occur. We propose a novel structured learning approach for spatio-temporal action localization. The mapping between a video and a spatio-temporal action trajectory is learned. The intractable inference and learning problems are addressed by leveraging an efficient Max-Path search method, thus making it feasible to optimize the model over the whole structured space. Experiments on two challenging benchmark datasets show that our proposed method outperforms the state-of-the-art methods. 1
4 0.18896498 142 nips-2012-Generalization Bounds for Domain Adaptation
Author: Chao Zhang, Lei Zhang, Jieping Ye
Abstract: In this paper, we provide a new framework to study the generalization bound of the learning process for domain adaptation. We consider two kinds of representative domain adaptation settings: one is domain adaptation with multiple sources and the other is domain adaptation combining source and target data. In particular, we use the integral probability metric to measure the difference between two domains. Then, we develop the specific Hoeffding-type deviation inequality and symmetrization inequality for either kind of domain adaptation to achieve the corresponding generalization bound based on the uniform entropy number. By using the resultant generalization bound, we analyze the asymptotic convergence and the rate of convergence of the learning process for domain adaptation. Meanwhile, we discuss the factors that affect the asymptotic behavior of the learning process. The numerical experiments support our results. 1
5 0.17258602 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video
Author: Will Zou, Shenghuo Zhu, Kai Yu, Andrew Y. Ng
Abstract: We apply salient feature detection and tracking in videos to simulate fixations and smooth pursuit in human vision. With tracked sequences as input, a hierarchical network of modules learns invariant features using a temporal slowness constraint. The network encodes invariance which are increasingly complex with hierarchy. Although learned from videos, our features are spatial instead of spatial-temporal, and well suited for extracting features from still images. We applied our features to four datasets (COIL-100, Caltech 101, STL-10, PubFig), and observe a consistent improvement of 4% to 5% in classification accuracy. With this approach, we achieve state-of-the-art recognition accuracy 61% on STL-10 dataset. 1
6 0.11717013 1 nips-2012-3D Object Detection and Viewpoint Estimation with a Deformable 3D Cuboid Model
7 0.11675823 201 nips-2012-Localizing 3D cuboids in single-view images
8 0.11198852 62 nips-2012-Burn-in, bias, and the rationality of anchoring
9 0.11198852 116 nips-2012-Emergence of Object-Selective Features in Unsupervised Feature Learning
10 0.11085517 168 nips-2012-Kernel Latent SVM for Visual Recognition
11 0.10138432 106 nips-2012-Dynamical And-Or Graph Learning for Object Shape Modeling and Detection
12 0.1000863 289 nips-2012-Recognizing Activities by Attribute Dynamics
13 0.095591553 112 nips-2012-Efficient Spike-Coding with Multiplicative Adaptation in a Spike Response Model
14 0.09506876 198 nips-2012-Learning with Target Prior
15 0.092217654 303 nips-2012-Searching for objects driven by context
16 0.089517586 360 nips-2012-Visual Recognition using Embedded Feature Selection for Curvature Self-Similarity
17 0.084438384 40 nips-2012-Analyzing 3D Objects in Cluttered Images
18 0.079596244 185 nips-2012-Learning about Canonical Views from Internet Image Collections
19 0.078086197 256 nips-2012-On the connections between saliency and tracking
20 0.075786531 361 nips-2012-Volume Regularization for Binary Classification
topicId topicWeight
[(0, 0.184), (1, -0.008), (2, -0.203), (3, -0.019), (4, 0.147), (5, -0.087), (6, 0.013), (7, -0.018), (8, 0.024), (9, -0.016), (10, -0.059), (11, 0.052), (12, 0.041), (13, -0.142), (14, 0.076), (15, 0.078), (16, -0.031), (17, -0.056), (18, -0.055), (19, 0.03), (20, 0.044), (21, -0.008), (22, -0.051), (23, -0.027), (24, -0.041), (25, 0.007), (26, 0.036), (27, 0.074), (28, -0.047), (29, -0.113), (30, 0.046), (31, 0.132), (32, 0.007), (33, -0.072), (34, -0.18), (35, 0.029), (36, 0.003), (37, -0.112), (38, 0.005), (39, 0.053), (40, 0.04), (41, 0.033), (42, 0.031), (43, 0.067), (44, -0.078), (45, -0.023), (46, 0.044), (47, 0.114), (48, -0.07), (49, -0.056)]
simIndex simValue paperId paperTitle
same-paper 1 0.96955466 311 nips-2012-Shifting Weights: Adapting Object Detectors from Image to Video
Author: Kevin Tang, Vignesh Ramanathan, Li Fei-fei, Daphne Koller
Abstract: Typical object detectors trained on images perform poorly on video, as there is a clear distinction in domain between the two types of data. In this paper, we tackle the problem of adapting object detectors learned from images to work well on videos. We treat the problem as one of unsupervised domain adaptation, in which we are given labeled data from the source domain (image), but only unlabeled data from the target domain (video). Our approach, self-paced domain adaptation, seeks to iteratively adapt the detector by re-training the detector with automatically discovered target domain examples, starting with the easiest first. At each iteration, the algorithm adapts by considering an increased number of target domain examples, and a decreased number of source domain examples. To discover target domain examples from the vast amount of video data, we introduce a simple, robust approach that scores trajectory tracks instead of bounding boxes. We also show how rich and expressive features specific to the target domain can be incorporated under the same framework. We show promising results on the 2011 TRECVID Multimedia Event Detection [1] and LabelMe Video [2] datasets that illustrate the benefit of our approach to adapt object detectors to video. 1
2 0.78494751 209 nips-2012-Max-Margin Structured Output Regression for Spatio-Temporal Action Localization
Author: Du Tran, Junsong Yuan
Abstract: Structured output learning has been successfully applied to object localization, where the mapping between an image and an object bounding box can be well captured. Its extension to action localization in videos, however, is much more challenging, because we need to predict the locations of the action patterns both spatially and temporally, i.e., identifying a sequence of bounding boxes that track the action in video. The problem becomes intractable due to the exponentially large size of the structured video space where actions could occur. We propose a novel structured learning approach for spatio-temporal action localization. The mapping between a video and a spatio-temporal action trajectory is learned. The intractable inference and learning problems are addressed by leveraging an efficient Max-Path search method, thus making it feasible to optimize the model over the whole structured space. Experiments on two challenging benchmark datasets show that our proposed method outperforms the state-of-the-art methods. 1
3 0.67538846 1 nips-2012-3D Object Detection and Viewpoint Estimation with a Deformable 3D Cuboid Model
Author: Sanja Fidler, Sven Dickinson, Raquel Urtasun
Abstract: This paper addresses the problem of category-level 3D object detection. Given a monocular image, our aim is to localize the objects in 3D by enclosing them with tight oriented 3D bounding boxes. We propose a novel approach that extends the well-acclaimed deformable part-based model [1] to reason in 3D. Our model represents an object class as a deformable 3D cuboid composed of faces and parts, which are both allowed to deform with respect to their anchors on the 3D box. We model the appearance of each face in fronto-parallel coordinates, thus effectively factoring out the appearance variation induced by viewpoint. Our model reasons about face visibility patterns called aspects. We train the cuboid model jointly and discriminatively and share weights across all aspects to attain efficiency. Inference then entails sliding and rotating the box in 3D and scoring object hypotheses. While for inference we discretize the search space, the variables are continuous in our model. We demonstrate the effectiveness of our approach in indoor and outdoor scenarios, and show that our approach significantly outperforms the state-of-the-art in both 2D [1] and 3D object detection [2]. 1
4 0.6742354 344 nips-2012-Timely Object Recognition
Author: Sergey Karayev, Tobias Baumgartner, Mario Fritz, Trevor Darrell
Abstract: In a large visual multi-class detection framework, the timeliness of results can be crucial. Our method for timely multi-class detection aims to give the best possible performance at any single point after a start time; it is terminated at a deadline time. Toward this goal, we formulate a dynamic, closed-loop policy that infers the contents of the image in order to decide which detector to deploy next. In contrast to previous work, our method significantly diverges from the predominant greedy strategies, and is able to learn to take actions with deferred values. We evaluate our method with a novel timeliness measure, computed as the area under an Average Precision vs. Time curve. Experiments are conducted on the PASCAL VOC object detection dataset. If execution is stopped when only half the detectors have been run, our method obtains 66% better AP than a random ordering, and 14% better performance than an intelligent baseline. On the timeliness measure, our method obtains at least 11% better performance. Our method is easily extensible, as it treats detectors and classifiers as black boxes and learns from execution traces using reinforcement learning. 1
5 0.64220923 201 nips-2012-Localizing 3D cuboids in single-view images
Author: Jianxiong Xiao, Bryan Russell, Antonio Torralba
Abstract: In this paper we seek to detect rectangular cuboids and localize their corners in uncalibrated single-view images depicting everyday scenes. In contrast to recent approaches that rely on detecting vanishing points of the scene and grouping line segments to form cuboids, we build a discriminative parts-based detector that models the appearance of the cuboid corners and internal edges while enforcing consistency to a 3D cuboid model. Our model copes with different 3D viewpoints and aspect ratios and is able to detect cuboids across many different object categories. We introduce a database of images with cuboid annotations that spans a variety of indoor and outdoor scenes and show qualitative and quantitative results on our collected database. Our model out-performs baseline detectors that use 2D constraints alone on the task of localizing cuboid corners. 1
6 0.62836474 289 nips-2012-Recognizing Activities by Attribute Dynamics
7 0.59964484 256 nips-2012-On the connections between saliency and tracking
8 0.5847249 357 nips-2012-Unsupervised Template Learning for Fine-Grained Object Recognition
9 0.54432821 303 nips-2012-Searching for objects driven by context
10 0.54234481 40 nips-2012-Analyzing 3D Objects in Cluttered Images
11 0.52885342 142 nips-2012-Generalization Bounds for Domain Adaptation
12 0.52189982 198 nips-2012-Learning with Target Prior
13 0.51192391 106 nips-2012-Dynamical And-Or Graph Learning for Object Shape Modeling and Detection
14 0.50590515 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video
15 0.49310136 2 nips-2012-3D Social Saliency from Head-mounted Cameras
16 0.49032143 360 nips-2012-Visual Recognition using Embedded Feature Selection for Curvature Self-Similarity
17 0.48872641 185 nips-2012-Learning about Canonical Views from Internet Image Collections
18 0.4643963 31 nips-2012-Action-Model Based Multi-agent Plan Recognition
19 0.44860846 223 nips-2012-Multi-criteria Anomaly Detection using Pareto Depth Analysis
20 0.43744782 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration
topicId topicWeight
[(0, 0.02), (21, 0.019), (38, 0.074), (42, 0.02), (54, 0.02), (55, 0.028), (74, 0.07), (76, 0.581), (80, 0.054), (92, 0.032)]
simIndex simValue paperId paperTitle
1 0.99327272 175 nips-2012-Learning High-Density Regions for a Generalized Kolmogorov-Smirnov Test in High-Dimensional Data
Author: Assaf Glazer, Michael Lindenbaum, Shaul Markovitch
Abstract: We propose an efficient, generalized, nonparametric, statistical Kolmogorov-Smirnov test for detecting distributional change in high-dimensional data. To implement the test, we introduce a novel, hierarchical, minimum-volume sets estimator to represent the distributions to be tested. Our work is motivated by the need to detect changes in data streams, and the test is especially efficient in this context. We provide the theoretical foundations of our test and show its superiority over existing methods. 1
same-paper 2 0.98643339 311 nips-2012-Shifting Weights: Adapting Object Detectors from Image to Video
Author: Kevin Tang, Vignesh Ramanathan, Li Fei-fei, Daphne Koller
Abstract: Typical object detectors trained on images perform poorly on video, as there is a clear distinction in domain between the two types of data. In this paper, we tackle the problem of adapting object detectors learned from images to work well on videos. We treat the problem as one of unsupervised domain adaptation, in which we are given labeled data from the source domain (image), but only unlabeled data from the target domain (video). Our approach, self-paced domain adaptation, seeks to iteratively adapt the detector by re-training the detector with automatically discovered target domain examples, starting with the easiest first. At each iteration, the algorithm adapts by considering an increased number of target domain examples, and a decreased number of source domain examples. To discover target domain examples from the vast amount of video data, we introduce a simple, robust approach that scores trajectory tracks instead of bounding boxes. We also show how rich and expressive features specific to the target domain can be incorporated under the same framework. We show promising results on the 2011 TRECVID Multimedia Event Detection [1] and LabelMe Video [2] datasets that illustrate the benefit of our approach to adapt object detectors to video. 1
3 0.98303014 33 nips-2012-Active Learning of Model Evidence Using Bayesian Quadrature
Author: Michael Osborne, Roman Garnett, Zoubin Ghahramani, David K. Duvenaud, Stephen J. Roberts, Carl E. Rasmussen
Abstract: Numerical integration is a key component of many problems in scientific computing, statistical modelling, and machine learning. Bayesian Quadrature is a modelbased method for numerical integration which, relative to standard Monte Carlo methods, offers increased sample efficiency and a more robust estimate of the uncertainty in the estimated integral. We propose a novel Bayesian Quadrature approach for numerical integration when the integrand is non-negative, such as the case of computing the marginal likelihood, predictive distribution, or normalising constant of a probabilistic model. Our approach approximately marginalises the quadrature model’s hyperparameters in closed form, and introduces an active learning scheme to optimally select function evaluations, as opposed to using Monte Carlo samples. We demonstrate our method on both a number of synthetic benchmarks and a real scientific problem from astronomy. 1
4 0.98293114 28 nips-2012-A systematic approach to extracting semantic information from functional MRI data
Author: Francisco Pereira, Matthew Botvinick
Abstract: This paper introduces a novel classification method for functional magnetic resonance imaging datasets with tens of classes. The method is designed to make predictions using information from as many brain locations as possible, instead of resorting to feature selection, and does this by decomposing the pattern of brain activation into differently informative sub-regions. We provide results over a complex semantic processing dataset that show that the method is competitive with state-of-the-art feature selection and also suggest how the method may be used to perform group or exploratory analyses of complex class structure. 1
5 0.98122841 286 nips-2012-Random Utility Theory for Social Choice
Author: Hossein Azari, David Parks, Lirong Xia
Abstract: Random utility theory models an agent’s preferences on alternatives by drawing a real-valued score on each alternative (typically independently) from a parameterized distribution, and then ranking the alternatives according to scores. A special case that has received significant attention is the Plackett-Luce model, for which fast inference methods for maximum likelihood estimators are available. This paper develops conditions on general random utility models that enable fast inference within a Bayesian framework through MC-EM, providing concave loglikelihood functions and bounded sets of global maxima solutions. Results on both real-world and simulated data provide support for the scalability of the approach and capability for model selection among general random utility models including Plackett-Luce. 1
6 0.9768002 205 nips-2012-MCMC for continuous-time discrete-state systems
7 0.97436219 169 nips-2012-Label Ranking with Partial Abstention based on Thresholded Probabilistic Models
8 0.94418156 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation
9 0.94311184 307 nips-2012-Semi-Crowdsourced Clustering: Generalizing Crowd Labeling by Robust Distance Metric Learning
10 0.94192386 247 nips-2012-Nonparametric Reduced Rank Regression
11 0.89113945 338 nips-2012-The Perturbed Variation
12 0.87921178 318 nips-2012-Sparse Approximate Manifolds for Differential Geometric MCMC
13 0.87308753 142 nips-2012-Generalization Bounds for Domain Adaptation
14 0.87071097 41 nips-2012-Ancestor Sampling for Particle Gibbs
15 0.87061012 99 nips-2012-Dip-means: an incremental clustering method for estimating the number of clusters
16 0.86672384 264 nips-2012-Optimal kernel choice for large-scale two-sample tests
17 0.86237162 163 nips-2012-Isotropic Hashing
18 0.86229515 209 nips-2012-Max-Margin Structured Output Regression for Spatio-Temporal Action Localization
19 0.86184978 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video
20 0.85902339 327 nips-2012-Structured Learning of Gaussian Graphical Models