iccv iccv2013 iccv2013-130 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: David Weiss, Benjamin Sapp, Ben Taskar
Abstract: In many cases, the predictive power of structured models for complex vision tasks is limited by a trade-off between the expressiveness and the computational tractability of the model. However, choosing this trade-off statically a priori is suboptimal, as images and videos in different settings vary tremendously in complexity. On the other hand, choosing the trade-off dynamically requires knowledge about the accuracy of different structured models on any given example. In this work, we propose a novel two-tier architecture that provides dynamic speed/accuracy trade-offs through a simple type of introspection. Our approach, which we call dynamic structured model selection (DMS), leverages typically intractable features in structured learning problems in order to automatically determine which of several models should be used at test-time in order to maximize accuracy under a fixed budgetary constraint. We demonstrate DMS on two sequential modeling vision tasks, and we establish a new state-of-the-art in human pose estimation in video with an implementation that is roughly 23× faster than the previous standard implementation.
Reference: text
sentIndex sentText sentNum sentScore
1 In many cases, the predictive power of structured models for complex vision tasks is limited by a trade-off between the expressiveness and the computational tractability of the model. [sent-7, score-0.22]
2 However, choosing this trade-off statically a priori is suboptimal, as images and videos in different settings vary tremendously in complexity. [sent-8, score-0.174]
3 On the other hand, choosing the trade-off dynamically requires knowledge about the accuracy of different structured models on any given example. [sent-9, score-0.264]
4 Our approach, which we call dynamic structured model selection (DMS), leverages typically intractable features in structured learning problems in order to automatically determine which of several models should be used at test-time in order to maximize accuracy under a fixed budgetary constraint. [sent-11, score-0.618]
5 We demonstrate DMS on two sequential modeling vision tasks, and we establish a new state-of-the-art in human pose estimation in video with an implementation that is roughly 23× faster than the previous standard implementation. [sent-12, score-0.231]
6 Introduction Computational budget constraints on inference in structured models for complex vision tasks force us to trade off between expressiveness and tractability of models. [sent-14, score-0.373]
7 Choosing this trade-off statically for all images is suboptimal since images and image parts vary tremendously in complexity; choosing it dynamically requires meta-level assessment and prediction of performance of different models on given examples. [sent-15, score-0.238]
8 In many object detection systems, the cost of computing these features is several times greater than the time spent on structured inference given the features. [sent-20, score-0.207]
9 There is a tremendous variation of cost of low-level processing based on resolution and other accuracy parameters; choosing one global setting is often not sufficient for complex images and wasteful on simple ones. [sent-21, score-0.141]
10 The key idea is a division of labor between a hierarchy of models/inference algorithms (tier one) and a meta-level model selector (tier two), which decides when to use expensive models adaptively, where they are most likely to improve the accuracy of predictions. [sent-24, score-0.458]
11 The two tiers have complementary strengths: Tier one models provide increasingly accurate and more expensive inference over structured outputs. [sent-25, score-0.352]
12 Tier two model-selectors use arbitrary sparsely-computed features and long-range dependencies, which would make inference intractable, in order to evaluate the outputs of the first tier and decide when to stop. [sent-26, score-0.244]
13 While the first tier optimizes over a combinatorial set of possibilities using inference over densely computed features, the second tier simply evaluates proposals of the first. [sent-27, score-0.375]
14 The advantage of this division is that both tiers are efficient and the second tier has more information than the first that allows it to reason about the success of the first. [sent-28, score-0.208]
15 We summarize the contributions of this work as follows: We propose dynamic structured model selection (DMS), a novel two-tier framework for creating faster and more accurate structured prediction systems. [sent-29, score-0.488]
16 We apply our approach to two sequential modeling tasks: handwriting recognition (section 4. [sent-31, score-0.226]
17 On the handwriting recognition task, we use DMS to achieve a significant increase in accuracy over baseline while at the same time being nearly 3× faster. [sent-34, score-0.205]
18 On the pose task, we propose a novel sequence-model-based re-ranker that utilizes the recently introduced model of [18] to achieve state-of-the-art accuracy on a benchmark dataset while being 23× faster than the previous best method. [sent-35, score-0.241]
19 We then apply DMS to achieve even faster times on a new, much larger benchmark dataset, reducing the re-ranking model runtime by a factor of 2 with no decrease in accuracy for wrist localization. [sent-36, score-0.225]
20 Nonetheless, while the goals of these works are similar to ours–explicitly controlling feature computation at test time–none of the classifier cascade literature addresses either inference as a batch or the structured prediction setting. [sent-48, score-0.399]
21 On the other hand, [6] also propose explicitly modeling the value of evaluating a classifier, but their approach requires modeling the entropy of a predicted class distribution and therefore does not apply to the structured setting in which there are exponentially many outputs. [sent-53, score-0.241]
22 Finally, [3] propose an approximate dynamic program to determine where in video clips to apply an expensive algorithm for analysis (whereas we learn to which video clips to apply different models). [sent-56, score-0.46]
23 [1] attempt to predict the performance of various video analysis algorithms (similar in spirit to the selector we propose), but based on measures of image quality rather than properties of model output. [sent-61, score-0.382]
24 [9] propose an evaluator for human pose estimators, but only for single-frame images, and propose only learning “correct or not” coarse-level distinctions, whereas we attempt to predict a measure of the error of each model directly. [sent-62, score-0.225]
25 There is considerable research into human pose estimation from 2D images; far more than we can review here. [sent-65, score-0.134]
26 However, as state-of-the-art pose estimation can take upwards of several minutes per frame (e. [sent-66, score-0.166]
27 , [19, 13]) there is significantly less prior work on pose estimation in video clips. [sent-68, score-0.144]
28 [16] propose a related approach to our method by stitching together N hypothesized poses per frame into video tracks, using N = 300 and evaluating their approach on 4 video sequences. [sent-72, score-0.151]
29 In contrast, we use N = 32 proposals from [19] (assuming the scale and location of the person is known), learn additional sequence models using features computed over proposed tracks, and evaluate on hundreds of short clips from cinema. [sent-73, score-0.181]
30 Dynamic Structured Model Selection In this section, we introduce our approach to dynamic model selection and provide an overview of the algorithm. [sent-76, score-0.18]
31 The core idea behind our approach is very simple: we learn to predict the value of choosing a more expensive model over a cheaper one, and we use predicted values to allocate computational resources at test time. [sent-77, score-0.287]
32 We consider the problem of structured prediction, in which our goal is to learn a hypothesis mapping inputs x ∈ X to outputs y ∈ Y(x), where |x| = ? [sent-80, score-0.18]
33 and y is a [...] Algorithm 1: Dynamic structured model selection. [sent-81, score-0.147]
34 For the dynamic model selection problem we consider here, we assume that we are given a set of models, h1, . [sent-121, score-0.18]
35 Given a fixed ordering of the models, we define the value of evaluating model $h_i$ on example $x$, $V(h_i, x, y) = L(h_{i-1}(x), y) - L(h_i(x), y)$, (2) where L(y, y? [sent-128, score-0.155]
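A minimal sketch of the value in (2): it is the reduction in loss obtained by moving from the previous model's output to the current model's output. The loss is not specified in this excerpt, so a Hamming loss over sequence positions is assumed here, and the names `hamming_loss` and `value` are illustrative only.

```python
from typing import Sequence


def hamming_loss(prediction: Sequence, truth: Sequence) -> int:
    """Number of positions where the prediction disagrees with the truth.

    Assumption: the excerpt does not state which loss L is used.
    """
    if len(prediction) != len(truth):
        raise ValueError("prediction and truth must have the same length")
    return sum(1 for p, t in zip(prediction, truth) if p != t)


def value(prev_prediction: Sequence, cur_prediction: Sequence, truth: Sequence) -> int:
    """V = L(h_{i-1}(x), y) - L(h_i(x), y): positive when the costlier model helps."""
    return hamming_loss(prev_prediction, truth) - hamming_loss(cur_prediction, truth)
```

A negative result corresponds to the case noted in the text where a more expensive model hurts performance on a particular example.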
36 (While more expensive models usually increase accuracy on average, in practice we find that there are many examples where the more expensive features hurt performance. [sent-131, score-0.207]
37 ) Our proposed goal for metalearning is to learn a selector ν(hi , x) to approximate the value function. [sent-132, score-0.324]
38 Note that even if all predicted ν(hi, xj) are negative, Algorithm 1 continues to greedily choose more expensive models as long as budget is available. [sent-151, score-0.301]
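The greedy batch allocation described above could be sketched as follows. This is an illustration under stated assumptions, not the authors' Algorithm 1: the budget is assumed to cover only upgrades beyond the cheapest model, each model has a fixed per-example cost, and `predicted_value(i, j)` is a hypothetical callable standing in for the selector ν(hi, xj).

```python
import heapq
from typing import Callable, List, Sequence


def greedy_model_selection(
    num_examples: int,
    costs: Sequence[float],
    predicted_value: Callable[[int, int], float],
    budget: float,
) -> List[int]:
    """Return, for each example, the index of the last model evaluated on it.

    costs[i] is the assumed cost of upgrading one example to model i
    (costs[0] is ignored: every example starts at the cheapest model).
    """
    num_models = len(costs)
    level = [0] * num_examples
    heap = []  # max-heap on predicted value via negated keys
    if num_models > 1:
        for j in range(num_examples):
            heapq.heappush(heap, (-predicted_value(1, j), j))
    remaining = budget
    while heap:
        _, j = heapq.heappop(heap)
        nxt = level[j] + 1
        if costs[nxt] > remaining:
            continue  # assumption: unaffordable upgrades are simply skipped
        remaining -= costs[nxt]
        level[j] = nxt
        if nxt + 1 < num_models:
            # The next upgrade's value is only scored after model nxt has run.
            heapq.heappush(heap, (-predicted_value(nxt + 1, j), j))
    return level
```

Because the loop never checks the sign of the predicted value, it keeps upgrading while budget remains, which matches the behavior noted in the text for all-negative predictions.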
39 In order for Algorithm 1 to succeed, the selector ν must provide a useful estimate of the value V . [sent-153, score-0.299]
40 We formulate the selector as a linear function of meta-features computed on the output of the models. [sent-154, score-0.367]
41 The key idea is that, while the feature generating function f for a structured prediction model decomposes over subsets of y in order to maintain feasible inference, the meta-features φ need only be computed efficiently for the specific outputs h1(x) through hi−1 (x). [sent-155, score-0.238]
42 We learn the selector by learning a weight vector β to approximate the value function. [sent-157, score-0.349]
43 Some care is needed when learning the selector in order to avoid re-using the same training set for learning both the models and the selector; i. [sent-170, score-0.404]
44 Application to Sequential Prediction In this section, we discuss two applications of our dynamic structured model selection framework to computer vision problems: handwriting recognition and human pose estimation from 2D video. [sent-179, score-0.609]
45 In both settings, DMS provides for a far more efficient structured prediction model. [sent-180, score-0.184]
46 In both settings, we use the following standard linear-chain structured prediction model. [sent-182, score-0.184]
47 For the handwriting recognition problem, each yi corresponds to one letter of the written word; for human pose estimation, each yi corresponds to one of K possible predicted poses. [sent-192, score-0.491]
48 At test time, we can efficiently make predictions using the Viterbi algorithm to find the state sequence that maximizes (5). [sent-205, score-0.14]
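For reference, Viterbi decoding on a linear chain can be sketched as below. Equation (5) is not reproduced in this excerpt, so the sketch assumes the score is a sum of per-position unary terms and a single position-independent pairwise table; the function name and argument layout are illustrative.

```python
from typing import List, Sequence, Tuple


def viterbi(
    unary: Sequence[Sequence[float]],
    pairwise: Sequence[Sequence[float]],
) -> Tuple[List[int], float]:
    """Return the highest-scoring state sequence and its score.

    unary[i][k]: score of state k at position i.
    pairwise[a][b]: score of moving from state a to state b (assumed
    identical at every position, which is a simplification).
    """
    num_states = len(unary[0])
    score = list(unary[0])
    backpointers = []
    for i in range(1, len(unary)):
        new_score, pointers = [], []
        for b in range(num_states):
            best_a = max(range(num_states), key=lambda a: score[a] + pairwise[a][b])
            pointers.append(best_a)
            new_score.append(score[best_a] + pairwise[best_a][b] + unary[i][b])
        score = new_score
        backpointers.append(pointers)
    state = max(range(num_states), key=lambda k: score[k])
    best_score = score[state]
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, best_score
```

The running time is linear in the sequence length and quadratic in the number of states, which is why a small state space keeps inference cheap.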
49 For the handwriting recognition task, we approximately optimize (6) using the structured perceptron algorithm, which has been shown to work well for this task [23]. [sent-210, score-0.295]
50 For the video pose estimation task, we optimize (6) directly using the recent stochastic Frank-Wolfe block-coordinate descent method of [12], which we found to be more robust. [sent-211, score-0.144]
51 Handwriting Recognition We first apply our method to the handwriting recognition dataset of [20]. [sent-215, score-0.196]
52 In this way, the selector is ideally suited to direct computation at test time, and a very fast and effective method is the result. [sent-221, score-0.323]
53 We use three different models for the handwriting recognition problem, differing only in the unary term features of the sequence model. [sent-223, score-0.288]
54 Trade-off on handwriting recognition task, displayed as a function of the efficiency speedup w. [sent-230, score-0.345]
55 To draw each curve, we sweep the budget B or tradeoff parameter η until we find a point with at least the target speedup and record the error rate. [sent-236, score-0.305]
56 Our approach (DMS) significantly outperforms imitation learning, yielding an error rate below that of the final model. [sent-237, score-0.202]
57 The Uniform method consists of picking which element to expand uniformly at random until all examples use the same model, and the Baseline method consists of picking a single entire fixed stage of models a priori. [sent-238, score-0.164]
58 The first are computed from the output of hi (x), consisting of the relative difference in the scores of the top two outputs and the average of the mean, min, and max entropies of the marginal distributions predicted by hi at each position in the sequence. [sent-242, score-0.315]
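A rough sketch of this first group of meta-features follows. The exact normalization of the score gap and the way the entropy statistics are combined are not fully specified in this excerpt, so the choices below (gap divided by the magnitude of the top score, and separate mean, min, and max entropies) are assumptions, as is the function name.

```python
import math
from typing import List, Sequence


def entropy(distribution: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in distribution if p > 0)


def output_meta_features(
    top_score: float,
    second_score: float,
    marginals: Sequence[Sequence[float]],
) -> List[float]:
    """[relative score gap, mean entropy, min entropy, max entropy].

    marginals[i] is the predicted distribution over states at position i.
    """
    gap = (top_score - second_score) / max(abs(top_score), 1e-12)
    entropies = [entropy(p) for p in marginals]
    return [gap, sum(entropies) / len(entropies), min(entropies), max(entropies)]
```

A small gap or a high entropy suggests that the current model is uncertain, which is the kind of signal a selector can use.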
59 The second set of meta-features counts the number of times an n-gram was predicted in hi (x) that occurred zero times in the training set, computed for n = 3, 4, 5. [sent-243, score-0.268]
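The unseen n-gram counts could be computed along these lines. This is a sketch with illustrative names; in practice the set of training n-grams would be built once and cached rather than rebuilt for every prediction.

```python
from typing import List, Sequence, Tuple


def ngrams(sequence: Sequence, n: int) -> List[Tuple]:
    """All contiguous length-n windows of the sequence, as tuples."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


def unseen_ngram_counts(
    prediction: Sequence,
    training_sequences: Sequence[Sequence],
    orders: Sequence[int] = (3, 4, 5),
) -> List[int]:
    """For each n, count predicted n-grams that never occur in the training set."""
    counts = []
    for n in orders:
        seen = {gram for seq in training_sequences for gram in ngrams(seq, n)}
        counts.append(sum(1 for gram in ngrams(prediction, n) if gram not in seen))
    return counts
```

For example, with training words "command" and "commend", the prediction "commqnd" contains three unseen n-grams at each of n = 3, 4, 5.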
60 We compare to an alternative method for dynamic model selection inspired by imitation learning methods for feature selection [8]. [sent-246, score-0.474]
61 For this baseline, we first pick a trade-off parameter η, and then for each example (xj , yj ) in the training set independently decide the optimal stopping point, τj? [sent-247, score-0.134]
62 We then learn an approximate policy π(i, xj) using [figure caption residue: "... approach of [19] (Ensemble) in elbow accuracy and exceeds it in wrist accuracy (at high precisions), and provides a significant boost in performance over MODEC."] [sent-259, score-0.285]
63 a linear SVM classifier trained with the same meta-features as the selector uses; we generated training data points by sampling all trajectories generated by the optimal policy on the training set. [sent-261, score-0.452]
64 The trade-off between error rate and computation time on the handwriting recognition task is given in Figure 1. [sent-265, score-0.195]
65 Our approach significantly outperforms imitation learning, and both imitation learning and our approach provide a significant increase in efficiency over choosing one of the models a priori or uniformly at random. [sent-267, score-0.627]
66 Besides the improvement in accuracy and speedup, there are several practical advantages of DMS over the imitation learning baseline. [sent-274, score-0.261]
67 Human Pose Estimation in Video Our approach to video pose estimation can be summarized as follows. [sent-279, score-0.144]
68 We adapt the efficient and current state-of-the-art MODEC pose model [18] to generate the states: each state corresponds to the highest scoring prediction of one of the 32 MODEC sub-models. [sent-282, score-0.23]
69 Next, given the set of states for each clip in our training database, we learn to predict a path through the states using high level features such as color and flow consistency. [sent-283, score-0.289]
70 To generate our 32 states, we find the argmax arm configuration for each of the 32 modes in MODEC. [sent-290, score-0.187]
71 Note that MODEC models each arm as a separate pose model, but chooses a single mode for each arm based on a combined compatibility score between the two poses; for our purposes, we ignore the compatibility score and take the 32 separate predictions for each arm independently. [sent-291, score-0.754]
72 Experimentally, we find that with 32 states per arm, at least one state is typically very close to the true arm pose for a given image (i. [sent-292, score-0.373]
73 Given a video sequence, we generate 32 states for each arm for each frame independently using the MODEC model. [sent-296, score-0.345]
74 The problem then becomes selecting which of the 32 poses for each arm and frame to choose. [sent-297, score-0.244]
75 Let yi be the state at frame i; for each assignment to yi we have a corresponding MODEC argmax pose on the i’th frame, which we denote pi (yi). [sent-299, score-0.349]
76 DMS provides a significant increase in speedup with very little accuracy cost compared to picking elements uniformly at random; e. [sent-303, score-0.291]
77 for elbows, a 2× speedup can be obtained for hardly any accuracy cost, while a 5× speedup with DMS can be obtained for the same accuracy as a 2× speedup when picking uniformly at random. [sent-305, score-0.401]
78 each state yi in the i’th frame we allow transitions from the 5 closest states yi? [sent-306, score-0.235]
79 At test time, we can efficiently make predictions using the Viterbi algorithm to find the state sequence that maximizes (5) with practically negligible runtime due to the tiny size of the state and transition space. [sent-310, score-0.255]
80 Given a video clip xj and a labeled pose pji for each frame i in xj, we define the ground truth state to be the state with the closest pose: $y_i^j = \arg\min_{y_i} \| p_i(y_i) - p_i^j \|_2$. [sent-314, score-0.517]
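Choosing the ground truth state for one frame amounts to a nearest-neighbor lookup under the Euclidean norm. The sketch below assumes each pose is flattened into a list of coordinates; the function name and that representation are illustrative.

```python
import math
from typing import Sequence


def closest_state(
    candidate_poses: Sequence[Sequence[float]],
    labeled_pose: Sequence[float],
) -> int:
    """Index of the candidate pose with the smallest L2 distance to the label."""

    def distance(pose: Sequence[float]) -> float:
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(pose, labeled_pose)))

    return min(range(len(candidate_poses)), key=lambda k: distance(candidate_poses[k]))
```

Applied to every frame of a clip, this yields the state sequence that serves as the training target for the sequence model.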
81 As in the handwriting recognition task, we use a fixed hierarchy of features to create a series of four increasingly complex base models. [sent-317, score-0.225]
82 The second model adds an image-dependent pairwise term, the χ2-distance between color histograms of the predicted arm locations from one frame to the next. [sent-319, score-0.335]
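A χ2 distance between two histograms can be sketched as follows. Conventions differ on the leading factor; the version here includes a factor of 1/2 so that normalized histograms with disjoint support have distance 1, which is an assumption rather than something stated in the text.

```python
from typing import Sequence


def chi2_distance(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """0.5 * sum_k (p_k - q_k)^2 / (p_k + q_k), skipping bins that are empty in both."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > eps)
```

A small distance between the predicted arm regions in consecutive frames indicates consistent color, which is what the pairwise term is meant to capture.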
83 The third model adds an image-dependent unary term; each image is quickly segmented into superpixels using [5], and we compute the intersection-over-union (IoU) score between the predicted arm rectangles and superpixels selected by the rectangles. [sent-320, score-0.316]
84 Finally, the fourth model computes a very fast and coarse optical flow using [15]; we obtain an estimate of the foreground flow by subtracting the median flow outside the target bounding box. [sent-321, score-0.322]
85 For every n-gram in the image, where n = 2k, we compute the mean and max χ2 distance between a center frame predicted arm location and the k frames before and after. [sent-328, score-0.312]
86 5, indicating a significant difference between the predicted arm color of the center frame and the surrounding frames. [sent-330, score-0.312]
87 Although all of the data will be publicly available, for our experiments in the following, we selected half of the clips that contained the most arm motion to provide for a more challenging dataset. [sent-344, score-0.264]
88 Before applying DMS, we evaluated the utility of the final, most complex MODEC+S model for articulated pose estimation in video. [sent-348, score-0.155]
89 We first compared to the approach of [19] on the VideoPose2 (VP2) dataset [19], which represents the state of the art in human pose estimation in challenging videos. [sent-349, score-0.134]
90 For each % of budget used, the distribution of stopping points for the batch examples between the four possible models is shown. [sent-368, score-0.249]
91 We evaluated our dynamic model selection framework on the CLIC dataset with 200 random partitions of the dataset. [sent-373, score-0.18]
92 Within the training set, we ran 3-fold cross-validation to generate model predictions for learning the selector as described in section 3. [sent-375, score-0.424]
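The cross-validation step can be sketched as below: every training example receives a prediction from a model that never saw it, so the selector is trained on outputs that resemble test-time behavior. The `train_fn` and `predict_fn` callables and the round-robin fold assignment are assumptions made for illustration.

```python
from typing import Any, Callable, List, Sequence


def held_out_predictions(
    examples: Sequence[Any],
    train_fn: Callable[[List[Any]], Any],
    predict_fn: Callable[[Any, Any], Any],
    num_folds: int = 3,
) -> List[Any]:
    """Predict each example with a model trained on the other folds only."""
    n = len(examples)
    predictions: List[Any] = [None] * n
    for fold in range(num_folds):
        # Round-robin fold assignment (an assumption; any partition works).
        training = [examples[j] for j in range(n) if j % num_folds != fold]
        model = train_fn(training)
        for j in range(n):
            if j % num_folds == fold:
                predictions[j] = predict_fn(model, examples[j])
    return predictions
```

The resulting predictions, paired with the true labels, give the loss differences that the selector is fit to approximate.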
93 When learning the selector, we focused on minimizing wrist test error, counting as an error any frame in which the wrist was not localized to within 20 pixels. [sent-376, score-0.244]
94 We also smoothed the predictions of each model before passing them to the selector since we found this improved the overall accuracy of the system. [sent-377, score-0.409]
95 For wrist localization, our approach was able to obtain a 2× speedup for little to no accuracy cost, and maintain a significant speedup compared to the uninformed model selection baseline. [sent-380, score-0.526]
96 For elbow localization, our approach yields a speedup of 5× at the same accuracy cost as a uniform 2× speedup, a significant improvement. [sent-381, score-0.206]
97 We also investigated whether or not Algorithm 1 was choosing models to use by simply choosing the cheapest first (Figure 4), which we found not to be the case. [sent-383, score-0.181]
98 We also established a new state-of-the-art in human pose estimation in video with an implementation that is 23× faster than the previous standard implementation. [sent-391, score-0.201]
99 Our results suggest that two-tier re-ranking style approaches are indeed a powerful technique to increase the efficiency and discriminative power of structured prediction systems. [sent-393, score-0.216]
100 Nonetheless, the DMS approach outlined here could be improved in several key ways: in future work, we intend to explore optimal strategies for constructing a model set, unordered model sets, and better greedy batch inference strategies. [sent-394, score-0.167]
wordName wordTfidf (topN-words)
[('modec', 0.57), ('selector', 0.299), ('dms', 0.243), ('clic', 0.205), ('imitation', 0.202), ('arm', 0.187), ('handwriting', 0.171), ('tier', 0.162), ('speedup', 0.142), ('structured', 0.124), ('hi', 0.108), ('xj', 0.103), ('budget', 0.102), ('dynamic', 0.09), ('wrist', 0.081), ('policy', 0.081), ('pose', 0.081), ('clips', 0.077), ('choosing', 0.075), ('yi', 0.073), ('expensive', 0.071), ('batch', 0.07), ('budgetary', 0.068), ('flic', 0.068), ('metafeatures', 0.068), ('predicted', 0.068), ('selection', 0.067), ('states', 0.066), ('yj', 0.064), ('prediction', 0.06), ('frame', 0.057), ('movies', 0.054), ('predictions', 0.053), ('inference', 0.051), ('picking', 0.05), ('sequence', 0.048), ('stopping', 0.046), ('cascade', 0.046), ('evaluator', 0.046), ('hwj', 0.046), ('neglible', 0.046), ('tiers', 0.046), ('sapp', 0.045), ('hm', 0.042), ('clip', 0.042), ('flow', 0.041), ('statically', 0.04), ('amortized', 0.04), ('cinema', 0.04), ('pji', 0.04), ('signing', 0.04), ('state', 0.039), ('unary', 0.038), ('uninformed', 0.037), ('bigram', 0.037), ('aistats', 0.037), ('video', 0.035), ('signifies', 0.035), ('accuracy', 0.034), ('tractability', 0.034), ('elbows', 0.034), ('sweep', 0.034), ('uniformly', 0.033), ('intractable', 0.032), ('tremendously', 0.032), ('faster', 0.032), ('efficiency', 0.032), ('cost', 0.032), ('models', 0.031), ('expressiveness', 0.031), ('taskar', 0.031), ('outputs', 0.031), ('sequential', 0.03), ('velocities', 0.03), ('elbow', 0.03), ('weiss', 0.03), ('runtime', 0.03), ('iou', 0.029), ('viterbi', 0.029), ('greedily', 0.029), ('increasingly', 0.029), ('daum', 0.028), ('mode', 0.028), ('estimation', 0.028), ('yij', 0.027), ('priori', 0.027), ('tradeoff', 0.027), ('scoring', 0.027), ('pi', 0.026), ('learn', 0.025), ('learning', 0.025), ('apply', 0.025), ('predict', 0.025), ('human', 0.025), ('classifier', 0.024), ('evaluating', 0.024), ('training', 0.024), ('computation', 0.024), ('model', 0.023), ('articulated', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 130 iccv-2013-Dynamic Structured Model Selection
2 0.11311509 65 iccv-2013-Breaking the Chain: Liberation from the Temporal Markov Assumption for Tracking Human Poses
Author: Ryan Tokola, Wongun Choi, Silvio Savarese
Abstract: We present an approach to multi-target tracking that has expressive potential beyond the capabilities of chain-shaped hidden Markov models, yet has significantly reduced complexity. Our framework, which we call tracking-by-selection, is similar to tracking-by-detection in that it separates the tasks of detection and tracking, but it shifts temporal reasoning from the tracking stage to the detection stage. The core feature of tracking-by-selection is that it reasons about path hypotheses that traverse the entire video instead of a chain of single-frame object hypotheses. A traditional chain-shaped tracking-by-detection model is only able to promote consistency between one frame and the next. In tracking-by-selection, path hypotheses exist across time, and encouraging long-term temporal consistency is as simple as rewarding path hypotheses with consistent image features. One additional advantage of tracking-by-selection is that it results in a dramatically simplified model that can be solved exactly. We adapt an existing tracking-by-detection model to the tracking-by-selection framework, and show improved performance on a challenging dataset (introduced in [18]).
3 0.11138581 187 iccv-2013-Group Norm for Learning Structured SVMs with Unstructured Latent Variables
Author: Daozheng Chen, Dhruv Batra, William T. Freeman
Abstract: Latent variable models have been applied to a number of computer vision problems. However, the complexity of the latent space is typically left as a free design choice. A larger latent space results in a more expressive model, but such models are prone to overfitting and are slower to perform inference with. The goal of this paper is to regularize the complexity of the latent space and learn which hidden states are really relevant for prediction. Specifically, we propose using group-sparsity-inducing regularizers such as ℓ1-ℓ2 to estimate the parameters of Structured SVMs with unstructured latent variables. Our experiments on digit recognition and object detection show that our approach is indeed able to control the complexity of latent space without any significant loss in accuracy of the learnt model.
4 0.1064612 127 iccv-2013-Dynamic Pooling for Complex Event Recognition
Author: Weixin Li, Qian Yu, Ajay Divakaran, Nuno Vasconcelos
Abstract: The problem of adaptively selecting pooling regions for the classification of complex video events is considered. Complex events are defined as events composed of several characteristic behaviors, whose temporal configuration can change from sequence to sequence. A dynamic pooling operator is defined so as to enable a unified solution to the problems of event specific video segmentation, temporal structure modeling, and event detection. Video is decomposed into segments, and the segments most informative for detecting a given event are identified, so as to dynamically determine the pooling operator most suited for each sequence. This dynamic pooling is implemented by treating the locations of characteristic segments as hidden information, which is inferred, on a sequence-by-sequence basis, via a large-margin classification rule with latent variables. Although the feasible set of segment selections is combinatorial, it is shown that a globally optimal solution to the inference problem can be obtained efficiently, through the solution of a series of linear programs. Besides the coarse-level location of segments, a finer model of video structure is implemented by jointly pooling features of segment-tuples. Experimental evaluation demonstrates that the resulting event detector has state-of-the-art performance on challenging video datasets.
5 0.096186116 336 iccv-2013-Random Forests of Local Experts for Pedestrian Detection
Author: Javier Marín, David Vázquez, Antonio M. López, Jaume Amores, Bastian Leibe
Abstract: Pedestrian detection is one of the most challenging tasks in computer vision, and has received a lot of attention in the last years. Recently, some authors have shown the advantages of using combinations of part/patch-based detectors in order to cope with the large variability of poses and the existence of partial occlusions. In this paper, we propose a pedestrian detection method that efficiently combines multiple local experts by means of a Random Forest ensemble. The proposed method works with rich block-based representations such as HOG and LBP, in such a way that the same features are reused by the multiple local experts, so that no extra computational cost is needed with respect to a holistic method. Furthermore, we demonstrate how to integrate the proposed approach with a cascaded architecture in order to achieve not only high accuracy but also an acceptable efficiency. In particular, the resulting detector operates at five frames per second using a laptop machine. We tested the proposed method with well-known challenging datasets such as Caltech, ETH, Daimler, and INRIA. The method proposed in this work consistently ranks among the top performers in all the datasets, being either the best method or having a small difference with the best one.
6 0.087544806 403 iccv-2013-Strong Appearance and Expressive Spatial Models for Human Pose Estimation
7 0.086512379 143 iccv-2013-Estimating Human Pose with Flowing Puppets
8 0.086244345 404 iccv-2013-Structured Forests for Fast Edge Detection
9 0.082366057 326 iccv-2013-Predicting Sufficient Annotation Strength for Interactive Foreground Segmentation
10 0.078546152 322 iccv-2013-Pose Estimation and Segmentation of People in 3D Movies
11 0.076955289 241 iccv-2013-Learning Near-Optimal Cost-Sensitive Decision Policy for Object Detection
12 0.071677104 24 iccv-2013-A Non-parametric Bayesian Network Prior of Human Pose
13 0.070209838 44 iccv-2013-Adapting Classification Cascades to New Domains
14 0.068549633 75 iccv-2013-CoDeL: A Human Co-detection and Labeling Framework
15 0.065071814 274 iccv-2013-Monte Carlo Tree Search for Scheduling Activity Recognition
16 0.064349242 225 iccv-2013-Joint Segmentation and Pose Tracking of Human in Natural Videos
17 0.063670047 233 iccv-2013-Latent Task Adaptation with Large-Scale Hierarchies
18 0.063351162 253 iccv-2013-Linear Sequence Discriminant Analysis: A Model-Based Dimensionality Reduction Method for Vector Sequences
19 0.061733298 273 iccv-2013-Monocular Image 3D Human Pose Estimation under Self-Occlusion
20 0.059760772 160 iccv-2013-Fast Object Segmentation in Unconstrained Video
topicId topicWeight
[(0, 0.173), (1, 0.012), (2, 0.002), (3, 0.027), (4, 0.06), (5, -0.001), (6, -0.013), (7, 0.045), (8, -0.027), (9, -0.004), (10, -0.015), (11, -0.011), (12, -0.024), (13, -0.036), (14, 0.009), (15, 0.041), (16, -0.073), (17, -0.029), (18, 0.016), (19, 0.073), (20, 0.028), (21, -0.04), (22, -0.015), (23, -0.023), (24, -0.032), (25, -0.055), (26, 0.064), (27, -0.022), (28, 0.061), (29, -0.042), (30, 0.012), (31, 0.017), (32, -0.018), (33, -0.002), (34, 0.034), (35, -0.0), (36, -0.027), (37, -0.003), (38, 0.006), (39, -0.026), (40, 0.053), (41, 0.033), (42, -0.036), (43, 0.015), (44, 0.022), (45, -0.044), (46, -0.001), (47, 0.004), (48, -0.025), (49, 0.056)]
simIndex simValue paperId paperTitle
same-paper 1 0.92664975 130 iccv-2013-Dynamic Structured Model Selection
2 0.66176224 24 iccv-2013-A Non-parametric Bayesian Network Prior of Human Pose
Author: Andreas M. Lehrmann, Peter V. Gehler, Sebastian Nowozin
Abstract: Having a sensible prior of human pose is a vital ingredient for many computer vision applications, including tracking and pose estimation. While the application of global non-parametric approaches and parametric models has led to some success, finding the right balance in terms of flexibility and tractability, as well as estimating model parameters from data has turned out to be challenging. In this work, we introduce a sparse Bayesian network model of human pose that is non-parametric with respect to the estimation of both its graph structure and its local distributions. We describe an efficient sampling scheme for our model and show its tractability for the computation of exact log-likelihoods. We empirically validate our approach on the Human 3.6M dataset and demonstrate superior performance to global models and parametric networks. We further illustrate our model’s ability to represent and compose poses not present in the training set (compositionality) and describe a speed-accuracy trade-off that allows realtime scoring of poses.
3 0.6558755 352 iccv-2013-Revisiting Example Dependent Cost-Sensitive Learning with Decision Trees
Author: Oisin Mac Aodha, Gabriel J. Brostow
Abstract: Typical approaches to classification treat class labels as disjoint. For each training example, it is assumed that there is only one class label that correctly describes it, and that all other labels are equally bad. We know, however, that good and bad labels are too simplistic in many scenarios, hurting accuracy. In the realm of example dependent cost-sensitive learning, each label is instead a vector representing a data point’s affinity for each of the classes. At test time, our goal is not to minimize the misclassification rate, but to maximize that affinity. We propose a novel example dependent cost-sensitive impurity measure for decision trees. Our experiments show that this new impurity measure improves test performance while still retaining the fast test times of standard classification trees. We compare our approach to classification trees and other cost-sensitive methods on three computer vision problems, tracking, descriptor matching, and optical flow, and show improvements in all three domains.
4 0.65349257 211 iccv-2013-Image Segmentation with Cascaded Hierarchical Models and Logistic Disjunctive Normal Networks
Author: Mojtaba Seyedhosseini, Mehdi Sajjadi, Tolga Tasdizen
Abstract: Contextual information plays an important role in solving vision problems such as image segmentation. However, extracting contextual information and using it in an effective way remains a difficult problem. To address this challenge, we propose a multi-resolution contextual framework, called cascaded hierarchical model (CHM), which learns contextual information in a hierarchical framework for image segmentation. At each level of the hierarchy, a classifier is trained based on downsampled input images and outputs of previous levels. Our model then incorporates the resulting multi-resolution contextual information into a classifier to segment the input image at original resolution. We repeat this procedure by cascading the hierarchical framework to improve the segmentation accuracy. Multiple classifiers are learned in the CHM; therefore, a fast and accurate classifier is required to make the training tractable. The classifier also needs to be robust against overfitting due to the large number of parameters learned during training. We introduce a novel classification scheme, called logistic disjunctive normal networks (LDNN), which consists of one adaptive layer of feature detectors implemented by logistic sigmoid functions followed by two fixed layers of logical units that compute conjunctions and disjunctions, respectively. We demonstrate that LDNN outperforms state-of-the-art classifiers and can be used in the CHM to improve object segmentation performance.
5 0.64419746 143 iccv-2013-Estimating Human Pose with Flowing Puppets
Author: Silvia Zuffi, Javier Romero, Cordelia Schmid, Michael J. Black
Abstract: We address the problem of upper-body human pose estimation in uncontrolled monocular video sequences, without manual initialization. Most current methods focus on isolated video frames and often fail to correctly localize arms and hands. Inferring pose over a video sequence is advantageous because poses of people in adjacent frames exhibit properties of smooth variation due to the nature of human and camera motion. To exploit this, previous methods have used prior knowledge about distinctive actions or generic temporal priors combined with static image likelihoods to track people in motion. Here we take a different approach based on a simple observation: Information about how a person moves from frame to frame is present in the optical flow field. We develop an approach for tracking articulated motions that “links” articulated shape models of people in adjacent frames through the dense optical flow. Key to this approach is a 2D shape model of the body that we use to compute how the body moves over time. The resulting “flowing puppets” provide a way of integrating image evidence across frames to improve pose inference. We apply our method on a challenging dataset of TV video sequences and show state-of-the-art performance.
6 0.64234 125 iccv-2013-Drosophila Embryo Stage Annotation Using Label Propagation
7 0.64218229 65 iccv-2013-Breaking the Chain: Liberation from the Temporal Markov Assumption for Tracking Human Poses
8 0.6351949 170 iccv-2013-Fingerspelling Recognition with Semi-Markov Conditional Random Fields
9 0.6258601 241 iccv-2013-Learning Near-Optimal Cost-Sensitive Decision Policy for Object Detection
10 0.61828274 145 iccv-2013-Estimating the Material Properties of Fabric from Video
11 0.61751741 386 iccv-2013-Sequential Bayesian Model Update under Structured Scene Prior for Semantic Road Scenes Labeling
12 0.61145836 273 iccv-2013-Monocular Image 3D Human Pose Estimation under Self-Occlusion
13 0.60983849 46 iccv-2013-Allocentric Pose Estimation
14 0.60724443 171 iccv-2013-Fix Structured Learning of 2013 ICCV paper k2opt.pdf
15 0.60439646 316 iccv-2013-Pictorial Human Spaces: How Well Do Humans Perceive a 3D Articulated Pose?
16 0.60245091 118 iccv-2013-Discovering Object Functionality
17 0.59710032 75 iccv-2013-CoDeL: A Human Co-detection and Labeling Framework
18 0.590895 234 iccv-2013-Learning CRFs for Image Parsing with Adaptive Subgradient Descent
19 0.59029508 165 iccv-2013-Find the Best Path: An Efficient and Accurate Classifier for Image Hierarchies
20 0.58698839 8 iccv-2013-A Deformable Mixture Parsing Model with Parselets
topicId topicWeight
[(2, 0.074), (7, 0.024), (12, 0.017), (26, 0.101), (31, 0.04), (34, 0.016), (42, 0.103), (48, 0.011), (49, 0.231), (64, 0.054), (73, 0.029), (78, 0.014), (84, 0.012), (89, 0.14), (95, 0.03), (98, 0.012)]
simIndex simValue paperId paperTitle
same-paper 1 0.7553708 130 iccv-2013-Dynamic Structured Model Selection
Author: David Weiss, Benjamin Sapp, Ben Taskar
Abstract: In many cases, the predictive power of structured models for complex vision tasks is limited by a trade-off between the expressiveness and the computational tractability of the model. However, choosing this trade-off statically a priori is suboptimal, as images and videos in different settings vary tremendously in complexity. On the other hand, choosing the trade-off dynamically requires knowledge about the accuracy of different structured models on any given example. In this work, we propose a novel two-tier architecture that provides dynamic speed/accuracy trade-offs through a simple type of introspection. Our approach, which we call dynamic structured model selection (DMS), leverages typically intractable features in structured learning problems in order to automatically determine which of several models should be used at test time in order to maximize accuracy under a fixed budgetary constraint. We demonstrate DMS on two sequential modeling vision tasks, and we establish a new state-of-the-art in human pose estimation in video with an implementation that is roughly 23× faster than the previous standard implementation.
2 0.72485071 232 iccv-2013-Latent Space Sparse Subspace Clustering
Author: Vishal M. Patel, Hien Van Nguyen, René Vidal
Abstract: We propose a novel algorithm called Latent Space Sparse Subspace Clustering for simultaneous dimensionality reduction and clustering of data lying in a union of subspaces. Specifically, we describe a method that learns the projection of data and finds the sparse coefficients in the low-dimensional latent space. Cluster labels are then assigned by applying spectral clustering to a similarity matrix built from these sparse coefficients. An efficient optimization method is proposed and its non-linear extensions based on the kernel methods are presented. One of the main advantages of our method is that it is computationally efficient as the sparse coefficients are found in the low-dimensional latent space. Various experiments show that the proposed method performs better than the competitive state-of-the-art subspace clustering methods.
3 0.68454462 410 iccv-2013-Support Surface Prediction in Indoor Scenes
Author: Ruiqi Guo, Derek Hoiem
Abstract: In this paper, we present an approach to predict the extent and height of supporting surfaces such as tables, chairs, and cabinet tops from a single RGBD image. We define support surfaces to be horizontal, planar surfaces that can physically support objects and humans. Given an RGBD image, our goal is to localize the height and full extent of such surfaces in 3D space. To achieve this, we created a labeling tool and annotated 1449 images with rich, complete 3D scene models in the NYU dataset. We extract ground truth from the annotated dataset and develop a pipeline for predicting floor space, walls, and the height and full extent of support surfaces. Finally, we match the predicted extent with annotated scenes in the training set and transfer the support surface configuration from those training scenes. We evaluate the proposed approach on our dataset and demonstrate its effectiveness in understanding scenes in 3D space.
4 0.68358254 150 iccv-2013-Exemplar Cut
Author: Jimei Yang, Yi-Hsuan Tsai, Ming-Hsuan Yang
Abstract: We present a hybrid parametric and nonparametric algorithm, exemplar cut, for generating class-specific object segmentation hypotheses. For the parametric part, we train a pylon model on a hierarchical region tree as the energy function for segmentation. For the nonparametric part, we match the input image with each exemplar by using regions to obtain a score which augments the energy function from the pylon model. Our method thus generates a set of highly plausible segmentation hypotheses by solving a series of exemplar augmented graph cuts. Experimental results on the Graz and PASCAL datasets show that the proposed algorithm achieves favorable segmentation performance against the state-of-the-art methods in terms of visual quality and accuracy.
5 0.68337852 180 iccv-2013-From Where and How to What We See
Author: S. Karthikeyan, Vignesh Jagadeesh, Renuka Shenoy, Miguel Eckstein, B.S. Manjunath
Abstract: Eye movement studies have confirmed that overt attention is highly biased towards faces and text regions in images. In this paper we explore a novel problem of predicting face and text regions in images using eye tracking data from multiple subjects. The problem is challenging as we aim to predict the semantics (face/text/background) only from eye tracking data without utilizing any image information. The proposed algorithm spatially clusters eye tracking data obtained in an image into different coherent groups and subsequently models the likelihood of the clusters containing faces and text using a fully connected Markov Random Field (MRF). Given the eye tracking data from a test image, it predicts potential face/head (humans, dogs and cats) and text locations reliably. Furthermore, the approach can be used to select regions of interest for further analysis by object detectors for faces and text. The hybrid eye position/object detector approach achieves better detection performance and reduced computation time compared to using only the object detection algorithm. We also present a new eye tracking dataset on 300 images selected from ICDAR, Street-view, Flickr and Oxford-IIIT Pet Dataset from 15 subjects.
6 0.6831761 326 iccv-2013-Predicting Sufficient Annotation Strength for Interactive Foreground Segmentation
7 0.67925012 241 iccv-2013-Learning Near-Optimal Cost-Sensitive Decision Policy for Object Detection
8 0.67674112 95 iccv-2013-Cosegmentation and Cosketch by Unsupervised Learning
9 0.67578256 414 iccv-2013-Temporally Consistent Superpixels
10 0.67488515 427 iccv-2013-Transfer Feature Learning with Joint Distribution Adaptation
11 0.67436254 156 iccv-2013-Fast Direct Super-Resolution by Simple Functions
12 0.67344946 330 iccv-2013-Proportion Priors for Image Sequence Segmentation
13 0.67303526 188 iccv-2013-Group Sparsity and Geometry Constrained Dictionary Learning for Action Recognition from Depth Maps
14 0.67289472 80 iccv-2013-Collaborative Active Learning of a Kernel Machine Ensemble for Recognition
15 0.67246497 384 iccv-2013-Semi-supervised Robust Dictionary Learning via Efficient l-Norms Minimization
16 0.6720404 245 iccv-2013-Learning a Dictionary of Shape Epitomes with Applications to Image Labeling
17 0.6710667 182 iccv-2013-GOSUS: Grassmannian Online Subspace Updates with Structured-Sparsity
18 0.67047709 126 iccv-2013-Dynamic Label Propagation for Semi-supervised Multi-class Multi-label Classification
19 0.66973555 197 iccv-2013-Hierarchical Joint Max-Margin Learning of Mid and Top Level Representations for Visual Recognition
20 0.66969389 44 iccv-2013-Adapting Classification Cascades to New Domains
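The simValue columns above appear to score how close each paper's topic vector is to that of paper 130, but the page does not state which measure is used. As a rough illustration only, the sketch below assumes cosine similarity over sparse (topicId, topicWeight) pairs; the function name and the second vector are hypothetical, and the code is not expected to reproduce the listed values (the same-paper scores above, for instance, are below 1).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors given as (id, weight) pairs.

    Assumption: this is an illustrative measure, not necessarily the one
    that produced the simValue column.
    """
    wa, wb = dict(a), dict(b)
    # Only ids present in both vectors contribute to the dot product.
    dot = sum(w * wb.get(i, 0.0) for i, w in wa.items())
    norm_a = math.sqrt(sum(w * w for w in wa.values()))
    norm_b = math.sqrt(sum(w * w for w in wb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # define similarity with an empty vector as zero
    return dot / (norm_a * norm_b)


# Topic weights for paper 130, copied from the topicId/topicWeight row above.
paper_130 = [(2, 0.074), (7, 0.024), (12, 0.017), (26, 0.101), (31, 0.04),
             (34, 0.016), (42, 0.103), (48, 0.011), (49, 0.231), (64, 0.054),
             (73, 0.029), (78, 0.014), (84, 0.012), (89, 0.14), (95, 0.03),
             (98, 0.012)]

# Hypothetical topic vector for some other paper (made up for illustration).
other_paper = [(26, 0.08), (49, 0.2), (89, 0.1)]

score = cosine_similarity(paper_130, other_paper)
print(f"similarity: {score:.4f}")
```

Storing each vector as a dictionary keeps the computation proportional to the number of non-zero topics rather than the full topic count, which suits listings like these where only a handful of topics carry weight.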