iccv iccv2013 iccv2013-61 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: João F. Henriques, João Carreira, Rui Caseiro, Jorge Batista
Abstract: Competitive sliding window detectors require vast training sets. Since a pool of natural images provides a nearly endless supply of negative samples, in the form of patches at different scales and locations, training with all the available data is considered impractical. A staple of current approaches is hard negative mining, a method of selecting relevant samples, which is nevertheless expensive. Given that samples at slightly different locations have overlapping support, there seems to be an enormous amount of duplicated work. It is natural, then, to ask whether these redundancies can be eliminated. In this paper, we show that the Gram matrix describing such data is block-circulant. We derive a transformation based on the Fourier transform that block-diagonalizes the Gram matrix, at once eliminating redundancies and partitioning the learning problem. This decomposition is valid for any dense features and several learning algorithms, and takes full advantage of modern parallel architectures. Surprisingly, it allows training with all the potential samples in sets of thousands of images. By considering the full set, we generate in a single shot the optimal solution, which is usually obtained only after several rounds of hard negative mining. We report speed gains on Caltech Pedestrians and INRIA Pedestrians of over an order of magnitude, allowing training on a desktop computer in a couple of minutes.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Competitive sliding window detectors require vast training sets. [sent-4, score-0.101]
2 Since a pool of natural images provides a nearly endless supply of negative samples, in the form of patches at different scales and locations, training with all the available data is considered impractical. [sent-5, score-0.212]
3 A staple of current approaches is hard negative mining, a method of selecting relevant samples, which is nevertheless expensive. [sent-6, score-0.249]
4 Given that samples at slightly different locations have overlapping support, there seems to be an enormous amount of duplicated work. [sent-7, score-0.132]
5 We derive a transformation based on the Fourier transform that block-diagonalizes the Gram matrix, at once eliminating redundancies and partitioning the learning problem. [sent-10, score-0.13]
6 This decomposition is valid for any dense features and several learning algorithms, and takes full advantage of modern parallel architectures. [sent-11, score-0.26]
7 Surprisingly, it allows training with all the potential samples in sets of thousands of images. [sent-12, score-0.161]
8 By considering the full set, we generate in a single shot the optimal solution, which is usually obtained only after several rounds of hard negative mining. [sent-13, score-0.446]
9 The templates, most often HOG filters [7], are evaluated exhaustively at all locations in an image over a discrete range of scales, using fast convolution algorithms which exploit the redundancy of overlapping image subwindows. [sent-18, score-0.127]
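To make this evaluation step concrete, the following sketch (MATLAB, with made-up image and template sizes, and a single feature channel; real HOG detectors sum such responses over all channels and repeat over scales) scores every window with one correlation:
% Exhaustive sliding-window evaluation as a single correlation.
% conv2 with a 180-degree-rotated kernel computes cross-correlation.
image = randn(120, 160);                       % stand-in for one feature channel
template = randn(15, 9);                       % stand-in for a learned filter
scores = conv2(image, rot90(template, 2), 'valid');
% scores(r, c) is the detector response for the window whose top-left
% corner is at (r, c); this reuses the overlap between adjacent windows.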
10 This asymmetry between the resolution of prediction and learning has been tackled by mining for hard negative examples. [sent-20, score-0.444]
11 In this iterative process, an initial model is trained using all positive examples and a randomly selected subset of negative examples, and this initial training set is progressively augmented with false positive examples produced while scanning the images with the model learned so far. [sent-21, score-0.361]
12 Hard negative mining is considered expensive, and this has become less tolerable as the community has strived to scale up to a large number of object models [22]. [sent-22, score-0.309]
13 Here we propose instead to learn directly from a training set comprising all image subwindows of a predetermined aspect-ratio and show this is feasible for a rich set of popular models including Ridge Regression, Support Vector Regression (SVR) and Logistic Regression. [sent-28, score-0.246]
14 The crux of our derivation is the observation that the Gram matrix of a set of images and their translated versions, as modeled by cyclic shifts, exhibits a block-circulant structure. [sent-30, score-0.333]
15 We build on recent work that explores the circulant structure of translations in a single image [19] or in pairs (in the temporal domain) [24]. [sent-32, score-0.578]
16 Additionally, while the circulant matrices used in these works model translations of one sample or a pair of samples, our block-circulant formulation allows an arbitrary number of samples. [sent-34, score-0.608]
17 Our closed-form decomposition is simple and can be implemented with a few lines of code (Algorithm 1). [sent-35, score-0.122]
18 Full source code for fast training of linear detectors is available1 . [sent-36, score-0.101]
19 We show that the structure of overlapping image subwindows allows efficient training of a detector with all subwindows of a set of negative images. [sent-41, score-0.614]
20 This obviates the need for expensive rounds of hard negative mining, which only approximate the full problem. [sent-42, score-0.446]
21 A theoretical analysis of the influence of translated images on a learning problem, by proving that the resulting Gram matrix is block-circulant (Section 2). [sent-44, score-0.207]
22 An explicit description of the data matrix which allows the use of fast linear solvers [13], that scale linearly with the number of training examples (Sec. [sent-52, score-0.208]
23 Experiments show that it is possible to train with all subwindows of large training sets (INRIA and Caltech Pedestrians), achieving the same performance as several rounds of hard negative mining in a single run. [sent-54, score-0.803]
24 Given n samples x_i and corresponding target labels y_i, the goal is to find the optimal weights w, min_w Σ_i L(w^T x_i, y_i) + λ‖w‖², (1) where L is a loss function that depends on the learning algorithm. [sent-58, score-0.138]
25 Many popular algorithms, including SVM, SVR, Ridge Regression and Logistic Regression, can equivalently be expressed in their dual form [sent-68, score-0.101]
26 min_α (1/2) α^T G α + Σ_i D(α_i, y_i), (2) with a vector α containing n dual optimization variables α_i, a function D that depends on the training algorithm, and the n × n Gram matrix G, with elements G_ij = x_i^T x_j. [sent-69, score-0.281]
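As a minimal illustration (ridge regression on made-up random data), the dual solution computed from the Gram matrix alone recovers exactly the primal weights:
% Primal ridge regression vs. its dual, which only needs G = X*X'.
n = 50; m = 10; lambda = 1;
X = randn(n, m); y = randn(n, 1);
w_primal = (X'*X + lambda*eye(m)) \ (X'*y);
G = X*X';                                % Gram matrix, G(i,j) = xi'*xj
alpha = (G + lambda*eye(n)) \ y;         % dual variables
w_dual = X' * alpha;                     % weights recovered from the dual
assert(norm(w_primal - w_dual) < 1e-8)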
27 Solvers either accept the samples x_i (and thus can optimize the primal or dual), or [sent-73, score-0.138]
28 accept the Gram matrix G (restricting them to the dual optimization). [sent-74, score-0.208]
29 Throughout this paper, we will prove most results using the dual formulation (Section 2-3). [sent-76, score-0.144]
30 Training sets of translated samples In detection tasks, it is common to train a classifier with negative examples that are cropped from a large set of images. [sent-88, score-0.387]
31 Beyond being numerous [7, 14], such samples share an additional structure: samples from the same image are translated versions of one another. [sent-92, score-0.298]
32 We can translate x by one element by multiplying it with the s × s cyclic permutation matrix P (the identity matrix with its rows shifted cyclically by one). [sent-96, score-0.102]
33 This operation is a cyclic shift: all elements are shifted one place to the right, and the element that exits the image on one side will reappear on the other side. [sent-99, score-0.163]
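A minimal sketch of this matrix and its action (sizes are arbitrary):
% P is the identity matrix with its rows shifted cyclically by one;
% multiplying by P performs the cyclic shift described above.
s = 5;
P = circshift(eye(s), 1);                  % s x s cyclic permutation matrix
x = (1:s)';
assert(isequal(P * x, circshift(x, 1)))    % one-place shift with wrap-around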
34 Figure 1: (a) An augmented training set of n base samples, and s horizontal translations. [sent-102, score-0.178]
35 Such data arises when training a classifier with several subwindows from images. [sent-103, score-0.246]
36 (b) Forming the ns × ns Gram matrix G, it is clear that there is some structure at the block level. [sent-104, score-0.226]
37 Our goal, then, is to train a classifier not only with a set of n samples xi, but also their translations, as modeled by cyclic shifts. [sent-113, score-0.263]
38 In the following sections, we will consider an augmented training set X, with a total of ns samples: s translations of n base samples. [sent-114, score-0.164]
39 As in Eq. 2, many learning algorithms can be expressed in the dual, in terms of a Gram matrix G. [sent-130, score-0.109]
40 Each block corresponds to a pair of translations (u, v), and each element (i, j) of a block corresponds to a pair of base samples (see Fig. [sent-136, score-0.162]
41 Each block has elements G^{(u,v)}_{ij} = x_i^T P^{v−u} x_j. (7) Since the blocks (u, v) are not independent, but a function of v − u,³ G is a block-circulant matrix [8]. [sent-146, score-0.124]
42 ³More precisely, P^{v−u} is a function of (v − u) modulo s, due to its cyclic nature: P^{ks} = P^0 for all integers k. [sent-147, score-0.163]
43 Since Eq. 9 computes dot-products between x_i and all translations of x_j, it is equivalent to the correlation between the two vectors (denoted with ⋆). [sent-151, score-0.224]
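This equivalence is easy to check numerically; the sketch below compares a naive loop of dot-products against all cyclic shifts with the O(s log s) FFT-based correlation (random 1D vectors):
% Dot-products of xi with every cyclic shift of xj, two ways.
s = 8; xi = randn(s, 1); xj = randn(s, 1);
g = zeros(s, 1);
for k = 0:s-1
    g(k+1) = xi' * circshift(xj, k);             % xi' * P^k * xj
end
g_fft = real(ifft(fft(xi) .* conj(fft(xj))));    % correlation theorem
assert(max(abs(g - g_fft)) < 1e-10)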
44 Separability of learning problems We now need to prove the intuition that the blocks of a block-diagonal Gram matrix Ḡ indeed represent independent learning problems. [sent-176, score-0.241]
45 The exact partitioning into sub-problems is also suitable for parallel implementations that take full advantage of modern architectures. [sent-178, score-0.138]
46 At this point, it is important to point out that the transformation matrix U we used in Section 2. [sent-179, score-0.116]
47 Given a unitary matrix U such that Ḡ = UGU⁻¹ is block-diagonal, with s blocks Ḡ^{(f)}, then Eq. 2 can be decomposed into the s sub-problems. [sent-189, score-0.216]
48 ⁴The Hermitian transpose is the most natural extension of transposition to the complex domain, simplifying many expressions. [sent-190, score-0.134]
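A small numerical check of this statement, for one concrete choice of such a U, namely U = F ⊗ I_n with F the unitary DFT matrix (random blocks with arbitrary sizes; the blocks need not come from actual data):
% Conjugating a block-circulant matrix with kron(F, I) leaves only
% the s diagonal blocks.
n = 3; s = 4;
B = arrayfun(@(k) {randn(n)}, 1:s);                 % s random n x n blocks
idx = mod(bsxfun(@minus, 0:s-1, (0:s-1)'), s) + 1;  % block (u,v) depends on v-u
G = cell2mat(B(idx));                               % ns x ns block-circulant
F = fft(eye(s)) / sqrt(s);                          % unitary DFT matrix
U = kron(F, eye(n));
Gbar = U * G * U';                                  % U unitary, so U^-1 = U'
mask = kron(eye(s), ones(n));                       % diagonal-block positions
assert(norm(Gbar .* (1 - mask), 'fro') < 1e-8)      % off-diagonal blocks vanish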
49 Ridge regression (RR) is a regularized form of least-squares, with loss function L(w^T x_i, y_i) = (w^T x_i − y_i)². [sent-204, score-0.119]
50 Since its terms are all dot-products, the decomposition is exact. [sent-214, score-0.122]
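An end-to-end check of this claim for the simplest setting (1D samples, one feature channel, random data; a sketch of the idea rather than the paper's full 2D multi-channel algorithm): ridge regression over all ns cyclic shifts, solved explicitly, matches s independent single-variable solves in the Fourier domain.
% Ridge regression with all cyclic shifts of n base samples, two ways.
n = 6; s = 8; lambda = 1;
X = randn(n, s);                         % base samples (one per row)
Y = randn(n, s);                         % one label per sample and shift
% (a) Explicit problem over all ns shifted samples.
Xbig = zeros(n*s, s); ybig = zeros(n*s, 1);
for i = 1:n
    for u = 0:s-1
        Xbig((i-1)*s + u + 1, :) = circshift(X(i,:), [0 u]);
        ybig((i-1)*s + u + 1) = Y(i, u+1);
    end
end
w_big = (Xbig'*Xbig + lambda*eye(s)) \ (Xbig'*ybig);
% (b) s independent one-variable complex ridge problems, one per frequency.
Xh = fft(X, [], 2); Yh = fft(Y, [], 2);
wh = zeros(s, 1);
for f = 1:s
    wh(f) = (Xh(:,f).' * Yh(:,f)) / (Xh(:,f)' * Xh(:,f) + lambda);
end
w_fft = real(ifft(wh));
assert(norm(w_big - w_fft) < 1e-6)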
51 We do not use SVM because it restricts the labels to {−1, 1}, and a unitary transformation would fall outside this set. [sent-251, score-0.136]
52 Explicit data matrix The transformed Gram matrix for each of the s sub-problems, Ḡ^{(f)}, can be used directly in a dual solver (e. [sent-253, score-0.302]
53 However, if n is large, an explicit description of the transformed data matrix that generates such a Gram matrix (through Eq. [sent-256, score-0.178]
54 From Eqs. 14-15, it can be seen that each block corresponds to a distinct Fourier frequency f, and each of its elements (i, j) is simply the product, at frequency f, of the Fourier transforms of samples x_i and x_j. [sent-259, score-0.389]
55 Equivalent to solving a regression with all spatial translations of the given samples. [sent-261, score-0.305]
56 The independent regression sub-problems can be solved in parallel. [sent-262, score-0.119]
57 ... X(:,:,f1,f2), Y(:,f1,f2)); end; end; W = real(ifft2(W)) * sqrt(s1*s2); where X̄(f) is an n × 1 vector with the Fourier frequency f of each sample. [sent-269, score-0.141]
58 This explicit description of the data matrix X̄(f) allows us to use a fast primal solver such as liblinear [13]. [sent-270, score-0.163]
59 We can simply extend X̄(f) to be an n × m matrix with the Fourier frequency f of m features. [sent-277, score-0.124]
60 By additivity of the dot-product, the Gram matrix obtained this way is the sum of Gram matrices over all m features, and all properties are preserved. [sent-278, score-0.102]
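In code, for the 1D single-channel case, the f-th sub-problem's data is simply column f of the samples' DFTs (a sketch with made-up sizes; the m-feature extension stacks one such column per feature):
% Explicit per-frequency data matrix and its Gram block.
X = randn(6, 8);             % n = 6 base samples of length s = 8
Xhat = fft(X, [], 2);        % n x s; column f feeds sub-problem f
f = 3;
Xf = Xhat(:, f);             % the n x 1 data matrix X̄(f)
Gf = Xf * Xf';               % n x n block of the transformed Gram matrix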
61 Complex-valued regression The fact that the data matrices of Eq. [sent-280, score-0.149]
62 19 are complex may apparently present some difficulties, since regression is usually real-valued. [sent-281, score-0.119]
63 We consider a generic data matrix X and regression targets y. [sent-283, score-0.227]
64 It can be shown that this is equivalent to a simple SVR with the augmented data matrix X̃. [sent-316, score-0.127]
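For ridge regression the real-valued reformulation is easy to verify (a sketch; the paper derives the SVR case, and the block signs of the augmented matrix depend on whether predictions are taken as x^T w or w^H x):
% Complex ridge regression solved as a real problem of twice the size.
n = 20; m = 4; lambda = 0.1;
X = randn(n, m) + 1i*randn(n, m);
y = randn(n, 1) + 1i*randn(n, 1);
w = (X'*X + lambda*eye(m)) \ (X'*y);           % complex solution (X' is Hermitian)
Xa = [real(X), -imag(X); imag(X), real(X)];    % augmented real data matrix
ya = [real(y); imag(y)];
wa = (Xa'*Xa + lambda*eye(2*m)) \ (Xa'*ya);
assert(norm(w - (wa(1:m) + 1i*wa(m+1:end))) < 1e-8)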
65 Experiments We tested the proposed decomposition on a number of detection tasks. [sent-339, score-0.16]
66 Recall that our method replaces the traditional hard negative mining steps with a single learning phase. [sent-340, score-0.444]
67 As such, the goal of these experiments is not to show greater accuracy, but to verify that we can achieve accuracy that is competitive with several rounds of hard negative mining, all other components being constant. [sent-341, score-0.301]
68 We chose the task of learning a single HOG filter from full object exemplars, which captures the core component shared by most modern object detectors [14, 4]. [sent-343, score-0.178]
69 Figure 2: Performance on the test set of INRIA Pedestrians using a HOG detector. [sent-348, score-0.436]
70 It takes several rounds of hard negative mining to converge to the same results as the proposed Circulant Decomposition, which is trained on the full set of negative windows. [sent-349, score-0.755]
71 The Circulant Decomposition allows training on the full set in one go. [sent-350, score-0.108]
72 The Circulant Decomposition is competitive with 3 expensive rounds of hard negative mining. [sent-353, score-0.399]
73 We examine: 1) how performance evolves as the number of rounds of hard negative mining increases, 2) the implicit ability of CD to enlarge the training set with many translations, and its impact on datasets with few positive samples, and 3) the computational savings of CD, compared to hard negative mining, which is generally considered expensive. [sent-354, score-0.653]
74 Pedestrian detection We experimented with pedestrian detection on two standard datasets: the well-known INRIA Pedestrians [7] and the recent Caltech Pedestrian Detection Benchmark [10]. [sent-357, score-0.167]
75 We will jump straight away to the main point of our paper: that a Circulant Decomposition is equivalent to training with all negative windows, a feat that can only be approximated by several rounds of hard negative mining. [sent-358, score-0.611]
76 Fig. 2 shows a comparison of CD and different numbers of hard negative rounds for INRIA Pedestrians, and Fig. [sent-360, score-0.399]
77 The results suggest that the Circulant Decomposition performs on par with the slower and intrinsically less complete process of learning with hard negative mining. [sent-363, score-0.286]
78 On INRIA Pedestrians the baseline classifier is trained with 12180 random negative windows, before mining hard negative examples from the set of 1218 negative images, which contains ∼10^8 potential windows. [sent-367, score-0.709]
79 We proceed similarly on the “reasonable” subset of the Caltech Pedestrian dataset, composed of 4250 training images obtained every 30 frames, of which 2217 are negative images, without pedestrians. [sent-370, score-0.212]
80 Both CD and mining implementations are parallelized, providing a fair representation of a modern set-up. [sent-373, score-0.249]
81 The factors √s1s2 are needed because most FFT implementations are only unitary if corrected by this scalar factor. [sent-377, score-0.129]
82 Additionally, we verified experimentally that it is necessary for the regression targets to have no DC component (line 3 of Algorithm 1). [sent-379, score-0.155]
83 2 Cyclic shifts as a model for translation Since samples must have the same support as the learned template w, cyclic shifts of a template-sized sample are less accurate for large translations, due to wrap-around effects. [sent-384, score-0.414]
84 Thus in practice we collect base samples x_i from negative images in a grid, at regular intervals of 2/3 of the template size. [Figure caption fragment: results on the INRIA and Caltech Pedestrians datasets, and the classical approach using increasing numbers of hard negative mining rounds.] [sent-385, score-0.758]
85 We verified experimentally that there is little impact on performance if we assume cyclic shifts are accurate up to ∼1/3 of the template size in all directions. [sent-388, score-0.215]
86 The cyclic shifts P^{u−1}x_i then model the finer translations, which come at no additional cost when using the Circulant Decomposition. [sent-389, score-0.163]
87 This allows us to effectively model a sliding window, while collecting only a few dozen base samples per negative image. [sent-390, score-0.313]
88 To verify that the gain in performance is truly due to the Circulant Decomposition and not the grid sampling scheme, we trained a full SVM classifier with the same base samples. [sent-391, score-0.109]
89 3 Positive samples Another aspect of our method is that we must choose labels for translations of positive samples, since they are implicitly accounted for during training. [sent-396, score-0.333]
90 Given the fact that regression allows labels outside the set {−1, +1}, we can use a Gaussian function to interpolate smoothly between the two, according to a Gaussian bandwidth σ. [sent-399, score-0.119]
91 This function plays a similar role to the training output plane in correlation filters [19, 3]. [sent-401, score-0.117]
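A sketch of such a label function for the s1 × s2 translations of one positive sample (σ and the sizes are made-up values), peaking at +1 for the untranslated sample and decaying smoothly towards −1, with cyclic wrap-around:
% Gaussian regression targets over cyclic 2D translations.
s1 = 15; s2 = 9; sigma = 2;
[u2, u1] = meshgrid(0:s2-1, 0:s1-1);              % shift amounts along each axis
d1 = min(u1, s1 - u1);                            % cyclic distance to zero shift
d2 = min(u2, s2 - u2);
y = 2 * exp(-(d1.^2 + d2.^2) / (2*sigma^2)) - 1;  % y(1,1) = +1, far shifts -> -1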
92 (a) Comparison of Circulant Decomposition and a full SVM (not decomposed) using the same base samples. [sent-414, score-0.109]
93 ETHZ Shapes As mentioned earlier, translations of positive samples are also considered. [sent-419, score-0.333]
94 This dataset has only between 22 and 45 positive examples for each of its 5 categories (Mugs, Bottles, Swans, Giraffes and Apple Logos), evenly split into training and testing sets. [sent-422, score-0.108]
95 Unlike on the pedestrian detection benchmarks, where positive examples abound, here results are markedly improved using the proposed method, as visible in Fig. [sent-423, score-0.176]
96 This is surprising since the number of subwindows is on the order of 10^8, which seems to even preclude loading the data into the computer’s memory, but is feasible using our proposed embedding of the learning problem into the Fourier domain. [sent-428, score-0.222]
97 Our methodology is likely to have broad applicability, as it allows for both a much more efficient and more complete learning process than iterative hard negative mining, used for learning virtually all modern object detectors. [sent-429, score-0.377]
98 Figure 5: Performance on ETHZ Shapes (Swans, Applelogos, Bottles, Giraffes, Mugs), a dataset with scarce training data (between 22 and 45 positive examples per class, with an equal train-test split). [sent-441, score-0.108]
99 In addition to allowing models to be trained faster, the Circulant Decomposition achieves higher performance in this case, perhaps due to the implicit inclusion of translated positive samples. [sent-442, score-0.145]
100 [24] J. Revaud, M. Douze, C. Schmid, and H. Jégou. Event retrieval in large video collections with circulant temporal encoding. In CVPR, 2013. [sent-623, score-0.392]
wordName wordTfidf (topN-words)
[('circulant', 0.392), ('gram', 0.346), ('translations', 0.186), ('subwindows', 0.185), ('cyclic', 0.163), ('mining', 0.158), ('fourier', 0.157), ('negative', 0.151), ('rounds', 0.15), ('pedestrians', 0.132), ('svr', 0.13), ('decomposition', 0.122), ('regression', 0.119), ('inria', 0.119), ('ridge', 0.108), ('dual', 0.101), ('caltech', 0.1), ('samples', 0.1), ('hard', 0.098), ('translated', 0.098), ('unitary', 0.092), ('pedestrian', 0.091), ('sqrt', 0.089), ('rime', 0.079), ('matrix', 0.072), ('transposition', 0.069), ('wtx', 0.069), ('cd', 0.067), ('base', 0.062), ('training', 0.061), ('blockcirculant', 0.06), ('conjugation', 0.06), ('henrique', 0.06), ('regre', 0.06), ('ugu', 0.06), ('whx', 0.06), ('block', 0.058), ('solver', 0.057), ('filters', 0.056), ('augmented', 0.055), ('modern', 0.054), ('ethz', 0.053), ('mugs', 0.053), ('swans', 0.053), ('shifts', 0.052), ('blocks', 0.052), ('frequency', 0.052), ('logistic', 0.051), ('permute', 0.049), ('giraffes', 0.049), ('henriques', 0.049), ('redundancies', 0.049), ('appendix', 0.049), ('ns', 0.048), ('elements', 0.047), ('positive', 0.047), ('translation', 0.047), ('full', 0.047), ('bottles', 0.046), ('hermitian', 0.046), ('transformation', 0.044), ('dubout', 0.044), ('prove', 0.043), ('nid', 0.042), ('xj', 0.042), ('solvers', 0.041), ('pv', 0.041), ('detectors', 0.04), ('caseiro', 0.039), ('imaginary', 0.039), ('jo', 0.039), ('hog', 0.039), ('convolution', 0.039), ('xi', 0.038), ('rui', 0.038), ('misaligned', 0.038), ('savings', 0.038), ('detection', 0.038), ('arguments', 0.037), ('dealt', 0.037), ('trix', 0.037), ('implementations', 0.037), ('learning', 0.037), ('suppressing', 0.036), ('targets', 0.036), ('accept', 0.035), ('vlfeat', 0.035), ('pu', 0.035), ('explicit', 0.034), ('gl', 0.034), ('transpose', 0.033), ('ap', 0.033), ('svm', 0.033), ('accelerate', 0.032), ('overlapping', 0.032), ('extension', 0.032), ('multiplying', 0.03), ('subproblems', 0.03), ('desktop', 0.03), ('matrices', 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 61 iccv-2013-Beyond Hard Negative Mining: Efficient Detector Learning via Block-Circulant Decomposition
2 0.14203726 277 iccv-2013-Multi-channel Correlation Filters
Author: Hamed Kiani Galoogahi, Terence Sim, Simon Lucey
Abstract: Modern descriptors like HOG and SIFT are now commonly used in vision for pattern detection within image and video. From a signal processing perspective, this detection process can be efficiently posed as a correlation/convolution between a multi-channel image and a multi-channel detector/filter which results in a single-channel response map indicating where the pattern (e.g. object) has occurred. In this paper, we propose a novel framework for learning a multi-channel detector/filter efficiently in the frequency domain, both in terms of training time and memory footprint, which we refer to as a multi-channel correlation filter. To demonstrate the effectiveness of our strategy, we evaluate it across a number of visual detection/localization tasks where we: (i) exhibit superior performance to current state of the art correlation filters, and (ii) superior computational and memory efficiencies compared to state of the art spatial detectors.
3 0.13201353 279 iccv-2013-Multi-stage Contextual Deep Learning for Pedestrian Detection
Author: Xingyu Zeng, Wanli Ouyang, Xiaogang Wang
Abstract: Cascaded classifiers have been widely used in pedestrian detection and achieved great success. These classifiers are trained sequentially without joint optimization. In this paper, we propose a new deep model that can jointly train multi-stage classifiers through several stages of backpropagation. It keeps the score map output by a classifier within a local region and uses it as contextual information to support the decision at the next stage. Through a specific design of the training strategy, this deep architecture is able to simulate the cascaded classifiers by mining hard samples to train the network stage-by-stage. Each classifier handles samples at a different difficulty level. Unsupervised pre-training and specifically designed stage-wise supervised training are used to regularize the optimization problem. Both theoretical analysis and experimental results show that the training strategy helps to avoid overfitting. Experimental results on three datasets (Caltech, ETH and TUD-Brussels) show that our approach outperforms the state-of-the-art approaches.
4 0.12922128 336 iccv-2013-Random Forests of Local Experts for Pedestrian Detection
Author: Javier Marín, David Vázquez, Antonio M. López, Jaume Amores, Bastian Leibe
Abstract: Pedestrian detection is one of the most challenging tasks in computer vision, and has received a lot of attention in the last years. Recently, some authors have shown the advantages of using combinations of part/patch-based detectors in order to cope with the large variability of poses and the existence of partial occlusions. In this paper, we propose a pedestrian detection method that efficiently combines multiple local experts by means of a Random Forest ensemble. The proposed method works with rich block-based representations such as HOG and LBP, in such a way that the same features are reused by the multiple local experts, so that no extra computational cost is needed with respect to a holistic method. Furthermore, we demonstrate how to integrate the proposed approach with a cascaded architecture in order to achieve not only high accuracy but also an acceptable efficiency. In particular, the resulting detector operates at five frames per second using a laptop machine. We tested the proposed method with well-known challenging datasets such as Caltech, ETH, Daimler, and INRIA. The method proposed in this work consistently ranks among the top performers in all the datasets, being either the best method or having a small difference with the best one.
5 0.12097768 426 iccv-2013-Training Deformable Part Models with Decorrelated Features
Author: Ross Girshick, Jitendra Malik
Abstract: In this paper, we show how to train a deformable part model (DPM) fast—typically in less than 20 minutes, or four times faster than the current fastest method—while maintaining high average precision on the PASCAL VOC datasets. At the core of our approach is “latent LDA,” a novel generalization of linear discriminant analysis for learning latent variable models. Unlike latent SVM, latent LDA uses efficient closed-form updates and does not require an expensive search for hard negative examples. Our approach also acts as a springboard for a detailed experimental study of DPM training. We isolate and quantify the impact of key training factors for the first time (e.g., How important are discriminative SVM filters? How important is joint parameter estimation? How many negative images are needed for training?). Our findings yield useful insights for researchers working with Markov random fields and partbased models, and have practical implications for speeding up tasks such as model selection.
6 0.11217999 111 iccv-2013-Detecting Dynamic Objects with Multi-view Background Subtraction
7 0.10351349 190 iccv-2013-Handling Occlusions with Franken-Classifiers
8 0.098302945 184 iccv-2013-Global Fusion of Relative Motions for Robust, Accurate and Scalable Structure from Motion
9 0.095148399 104 iccv-2013-Decomposing Bag of Words Histograms
10 0.086763889 377 iccv-2013-Segmentation Driven Object Detection with Fisher Vectors
11 0.083398901 121 iccv-2013-Discriminatively Trained Templates for 3D Object Detection: A Real Time Scalable Approach
12 0.082614124 242 iccv-2013-Learning People Detectors for Tracking in Crowded Scenes
13 0.081622429 236 iccv-2013-Learning Discriminative Part Detectors for Image Classification and Cosegmentation
14 0.079079941 6 iccv-2013-A Convex Optimization Framework for Active Learning
15 0.07882376 204 iccv-2013-Human Attribute Recognition by Rich Appearance Dictionary
16 0.077859789 220 iccv-2013-Joint Deep Learning for Pedestrian Detection
17 0.076419182 136 iccv-2013-Efficient Pedestrian Detection by Directly Optimizing the Partial Area under the ROC Curve
18 0.076291822 305 iccv-2013-POP: Person Re-identification Post-rank Optimisation
19 0.072783493 200 iccv-2013-Higher Order Matching for Consistent Multiple Target Tracking
20 0.072479226 189 iccv-2013-HOGgles: Visualizing Object Detection Features
topicId topicWeight
[(0, 0.199), (1, 0.031), (2, -0.038), (3, -0.054), (4, 0.027), (5, -0.003), (6, -0.009), (7, 0.032), (8, -0.036), (9, -0.082), (10, 0.004), (11, -0.069), (12, -0.03), (13, -0.065), (14, 0.059), (15, -0.005), (16, 0.021), (17, 0.072), (18, 0.061), (19, 0.057), (20, -0.021), (21, 0.001), (22, -0.082), (23, 0.031), (24, -0.042), (25, 0.053), (26, -0.006), (27, -0.005), (28, -0.055), (29, -0.021), (30, -0.03), (31, 0.03), (32, -0.025), (33, 0.028), (34, 0.094), (35, -0.02), (36, 0.019), (37, 0.033), (38, -0.013), (39, -0.014), (40, 0.014), (41, -0.015), (42, -0.025), (43, 0.022), (44, -0.05), (45, -0.017), (46, -0.016), (47, 0.01), (48, -0.036), (49, -0.009)]
simIndex simValue paperId paperTitle
same-paper 1 0.94384003 61 iccv-2013-Beyond Hard Negative Mining: Efficient Detector Learning via Block-Circulant Decomposition
2 0.84991878 136 iccv-2013-Efficient Pedestrian Detection by Directly Optimizing the Partial Area under the ROC Curve
Author: Sakrapee Paisitkriangkrai, Chunhua Shen, Anton Van Den Hengel
Abstract: Many typical applications of object detection operate within a prescribed false-positive range. In this situation the performance of a detector should be assessed on the basis of the area under the ROC curve over that range, rather than over the full curve, as the performance outside the range is irrelevant. This measure is labelled as the partial area under the ROC curve (pAUC). Effective cascade-based classification, for example, depends on training node classifiers that achieve the maximal detection rate at a moderate false positive rate, e.g., around 40% to 50%. We propose a novel ensemble learning method which achieves a maximal detection rate at a user-defined range of false positive rates by directly optimizing the partial AUC using structured learning. By optimizing for different ranges of false positive rates, the proposed method can be used to train either a single strong classifier or a node classifier forming part of a cascade classifier. Experimental results on both synthetic and real-world data sets demonstrate the effectiveness of our approach, and we show that it is possible to train state-of-the-art pedestrian detectors using the proposed structured ensemble learning method.
3 0.76686394 426 iccv-2013-Training Deformable Part Models with Decorrelated Features
4 0.75421739 241 iccv-2013-Learning Near-Optimal Cost-Sensitive Decision Policy for Object Detection
Author: Tianfu Wu, Song-Chun Zhu
Abstract: Many object detectors, such as AdaBoost, SVM and deformable part-based models (DPM), compute additive scoring functions at a large number of windows scanned over image pyramid, thus computational efficiency is an important consideration beside accuracy performance. In this paper, we present a framework of learning cost-sensitive decision policy which is a sequence of two-sided thresholds to execute early rejection or early acceptance based on the accumulative scores at each step. A decision policy is said to be optimal if it minimizes an empirical global risk function that sums over the loss of false negatives (FN) and false positives (FP), and the cost of computation. While the risk function is very complex due to high-order connections among the two-sided thresholds, we find its upper bound can be optimized by dynamic programming (DP) efficiently and thus say the learned policy is near-optimal. Given the loss of FN and FP and the cost in three numbers, our method can produce a policy on-the-fly for Adaboost, SVM and DPM. In experiments, we show that our decision policy outperforms state-of-the-art cascade methods significantly in terms of speed with similar accuracy performance.
5 0.74767339 277 iccv-2013-Multi-channel Correlation Filters
6 0.73885059 336 iccv-2013-Random Forests of Local Experts for Pedestrian Detection
7 0.73357081 279 iccv-2013-Multi-stage Contextual Deep Learning for Pedestrian Detection
9 0.70988387 211 iccv-2013-Image Segmentation with Cascaded Hierarchical Models and Logistic Disjunctive Normal Networks
10 0.70741224 190 iccv-2013-Handling Occlusions with Franken-Classifiers
11 0.69142336 390 iccv-2013-Shufflets: Shared Mid-level Parts for Fast Object Detection
12 0.6855039 55 iccv-2013-Automatic Kronecker Product Model Based Detection of Repeated Patterns in 2D Urban Images
13 0.67776471 189 iccv-2013-HOGgles: Visualizing Object Detection Features
14 0.67677927 125 iccv-2013-Drosophila Embryo Stage Annotation Using Label Propagation
15 0.66556412 220 iccv-2013-Joint Deep Learning for Pedestrian Detection
16 0.65637922 75 iccv-2013-CoDeL: A Human Co-detection and Labeling Framework
17 0.65236247 406 iccv-2013-Style-Aware Mid-level Representation for Discovering Visual Connections in Space and Time
18 0.64727885 349 iccv-2013-Regionlets for Generic Object Detection
19 0.64023477 248 iccv-2013-Learning to Rank Using Privileged Information
20 0.63193858 121 iccv-2013-Discriminatively Trained Templates for 3D Object Detection: A Real Time Scalable Approach
topicId topicWeight
[(2, 0.077), (7, 0.02), (12, 0.027), (26, 0.099), (31, 0.045), (35, 0.011), (40, 0.012), (42, 0.122), (48, 0.03), (58, 0.205), (64, 0.043), (73, 0.032), (78, 0.013), (89, 0.142), (95, 0.018), (98, 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.82826984 61 iccv-2013-Beyond Hard Negative Mining: Efficient Detector Learning via Block-Circulant Decomposition
2 0.7356497 326 iccv-2013-Predicting Sufficient Annotation Strength for Interactive Foreground Segmentation
Author: Suyog Dutt Jain, Kristen Grauman
Abstract: The mode of manual annotation used in an interactive segmentation algorithm affects both its accuracy and ease-of-use. For example, bounding boxes are fast to supply, yet may be too coarse to get good results on difficult images; freehand outlines are slower to supply and more specific, yet they may be overkill for simple images. Whereas existing methods assume a fixed form of input no matter the image, we propose to predict the tradeoff between accuracy and effort. Our approach learns whether a graph cuts segmentation will succeed if initialized with a given annotation mode, based on the image's visual separability and foreground uncertainty. Using these predictions, we optimize the mode of input requested on new images a user wants segmented. Whether given a single image that should be segmented as quickly as possible, or a batch of images that must be segmented within a specified time budget, we show how to select the easiest modality that will be sufficiently strong to yield high quality segmentations. Extensive results with real users and three datasets demonstrate the impact.
3 0.73438418 180 iccv-2013-From Where and How to What We See
Author: S. Karthikeyan, Vignesh Jagadeesh, Renuka Shenoy, Miguel Ecksteinz, B.S. Manjunath
Abstract: Eye movement studies have confirmed that overt attention is highly biased towards faces and text regions in images. In this paper we explore a novel problem of predicting face and text regions in images using eye tracking data from multiple subjects. The problem is challenging as we aim to predict the semantics (face/text/background) only from eye tracking data without utilizing any image information. The proposed algorithm spatially clusters eye tracking data obtained in an image into different coherent groups and subsequently models the likelihood of the clusters containing faces and text using a fully connected Markov Random Field (MRF). Given the eye tracking data from a test image, it predicts potential face/head (humans, dogs and cats) and text locations reliably. Furthermore, the approach can be used to select regions of interest for further analysis by object detectors for faces and text. The hybrid eye position/object detector approach achieves better detection performance and reduced computation time compared to using only the object detection algorithm. We also present a new eye tracking dataset on 300 images selected from ICDAR, Street-view, Flickr and Oxford-IIIT Pet Dataset from 15 subjects.
4 0.73414826 427 iccv-2013-Transfer Feature Learning with Joint Distribution Adaptation
Author: Mingsheng Long, Jianmin Wang, Guiguang Ding, Jiaguang Sun, Philip S. Yu
Abstract: Transfer learning is established as an effective technology in computer vision for leveraging rich labeled data in the source domain to build an accurate classifier for the target domain. However, most prior methods have not simultaneously reduced the difference in both the marginal distribution and conditional distribution between domains. In this paper, we put forward a novel transfer learning approach, referred to as Joint Distribution Adaptation (JDA). Specifically, JDA aims to jointly adapt both the marginal distribution and conditional distribution in a principled dimensionality reduction procedure, and construct new feature representation that is effective and robust for substantial distribution difference. Extensive experiments verify that JDA can significantly outperform several state-of-the-art methods on four types of cross-domain image classification problems.
5 0.73376536 150 iccv-2013-Exemplar Cut
Author: Jimei Yang, Yi-Hsuan Tsai, Ming-Hsuan Yang
Abstract: We present a hybrid parametric and nonparametric algorithm, exemplar cut, for generating class-specific object segmentation hypotheses. For the parametric part, we train a pylon model on a hierarchical region tree as the energy function for segmentation. For the nonparametric part, we match the input image with each exemplar by using regions to obtain a score which augments the energy function from the pylon model. Our method thus generates a set of highly plausible segmentation hypotheses by solving a series of exemplar augmented graph cuts. Experimental results on the Graz and PASCAL datasets show that the proposed algorithm achieves favorable segmentation performance against the state-of-the-art methods in terms of visual quality and accuracy.
6 0.7317813 80 iccv-2013-Collaborative Active Learning of a Kernel Machine Ensemble for Recognition
7 0.73032564 44 iccv-2013-Adapting Classification Cascades to New Domains
8 0.72779202 65 iccv-2013-Breaking the Chain: Liberation from the Temporal Markov Assumption for Tracking Human Poses
9 0.72683918 384 iccv-2013-Semi-supervised Robust Dictionary Learning via Efficient l-Norms Minimization
10 0.72647631 161 iccv-2013-Fast Sparsity-Based Orthogonal Dictionary Learning for Image Restoration
11 0.72645712 330 iccv-2013-Proportion Priors for Image Sequence Segmentation
12 0.72553158 328 iccv-2013-Probabilistic Elastic Part Model for Unsupervised Face Detector Adaptation
13 0.72540736 241 iccv-2013-Learning Near-Optimal Cost-Sensitive Decision Policy for Object Detection
14 0.72522432 156 iccv-2013-Fast Direct Super-Resolution by Simple Functions
15 0.72507018 126 iccv-2013-Dynamic Label Propagation for Semi-supervised Multi-class Multi-label Classification
16 0.72481549 206 iccv-2013-Hybrid Deep Learning for Face Verification
17 0.72447836 245 iccv-2013-Learning a Dictionary of Shape Epitomes with Applications to Image Labeling
18 0.72390938 95 iccv-2013-Cosegmentation and Cosketch by Unsupervised Learning
19 0.7237817 188 iccv-2013-Group Sparsity and Geometry Constrained Dictionary Learning for Action Recognition from Depth Maps
20 0.72377706 354 iccv-2013-Robust Dictionary Learning by Error Source Decomposition