cvpr cvpr2013 cvpr2013-304 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Liefeng Bo, Xiaofeng Ren, Dieter Fox
Abstract: Complex real-world signals, such as images, contain discriminative structures that differ in many aspects including scale, invariance, and data channel. While progress in deep learning shows the importance of learning features through multiple layers, it is equally important to learn features through multiple paths. We propose Multipath Hierarchical Matching Pursuit (M-HMP), a novel feature learning architecture that combines a collection of hierarchical sparse features for image classification to capture multiple aspects of discriminative structures. Our building blocks are MI-KSVD, a codebook learning algorithm that balances the reconstruction error and the mutual incoherence of the codebook, and batch orthogonal matching pursuit (OMP); we apply them recursively at varying layers and scales. The result is a highly discriminative image representation that leads to large improvements to the state-of-the-art on many standard benchmarks, e.g., Caltech-101, Caltech-256, MITScenes, Oxford-IIIT Pet and Caltech-UCSD Bird-200.
Reference: text
sentIndex sentText sentNum sentScore
1 While progress in deep learning shows the importance of learning features through multiple layers, it is equally important to learn features through multiple paths. [sent-8, score-0.202]
2 We propose Multipath Hierarchical Matching Pursuit (M-HMP), a novel feature learning architecture that combines a collection of hierarchical sparse features for image classification to capture multiple aspects of discriminative structures. [sent-9, score-0.357]
3 Our building blocks are MI-KSVD, a codebook learning algorithm that balances the reconstruction error and the mutual incoherence of the codebook, and batch orthogonal matching pursuit (OMP); we apply them recursively at varying layers and scales. [sent-10, score-0.721]
4 A variety of learning and coding techniques have been proposed and evaluated, such as deep belief nets [12], deep autoencoders [17], deep convolutional neural networks [15], and hierarchical sparse coding [32, 3]. [sent-20, score-0.89]
5 Many are deep learning approaches that learn to push pix- of different sizes (here, 16x16 and 36x36) are encoded via multiple layers of sparse coding. [sent-21, score-0.348]
6 Each path corresponds to a specific patch size and number of layers (numbers inside boxes indicate patch size at the corresponding layer and path). [sent-22, score-0.388]
7 The final layer of each path encodes complete image patches and generates a feature vector for the whole image via another spatial pooling operation. [sent-24, score-0.421]
8 The multipath architecture is important as it significantly and efficiently expands the richness of image representation and leads to large improvements to the state of the art of image classification, as evaluated on a variety of object and scene recognition benchmarks. [sent-33, score-0.35]
9 Related Work In the past few years, a growing amount of research on visual recognition has focused on learning rich features using unsupervised and supervised hierarchical architectures. [sent-36, score-0.165]
10 Deep Networks: Deep belief nets [12] learn a hierarchy of features, layer by layer, using the unsupervised restricted Boltzmann machine. [sent-37, score-0.278]
11 To make deep beliefnets applicable to full-size images, convolutional deep belief nets [18] use a small receptive field and share the weights between the hidden and visible layers among all locations in an image. [sent-39, score-0.364]
12 Deconvolutional networks [33] convolutionally decompose images in an unsupervised way under a sparsity constraint. [sent-40, score-0.188]
13 Sparse Coding: For many years, sparse coding [21] has been a popular tool for modeling images. [sent-44, score-0.228]
14 Sparse coding on top of raw patches or SIFT features has achieved state-ofthe-art performance on face recognition, texture segmentation [19], and generic object recognition [28, 5, 9]. [sent-45, score-0.339]
15 Very recently, multi-layer sparse coding networks including hierarchical sparse coding [32] and hierarchical matching pursuit [3, 4] have been proposed for building multiple level features from raw sensor data. [sent-46, score-0.971]
16 Such networks learn codebooks at each layer in an unsupervised way such that image patches or pooled features can be represented by a sparse, linear combination of codebook entries. [sent-47, score-0.823]
17 With learned codebooks, feature hierarchies are built from scratch, layer by layer, using sparse codes and spatial pooling [3]. [sent-48, score-0.584]
18 We propose a novel codebook learning algorithm, MI-KSVD, to maintain mutual incoherence of the codebook, and discuss how multi-layer sparse coding hierarchies for images can be built from scratch and how multipath sparse coding helps capture discriminative structures of varying characteristics. [sent-51, score-1.101]
19 Codebook Learning with Mutual Incoherence The key idea of sparse coding is to represent data as sparse linear combinations of codewords selected from a codebook/dictionary [21]. [sent-54, score-0.451]
20 When sparse coding is applied to object recognition, the data matrix Y consists of raw patches randomly sampled from images. [sent-58, score-0.431]
21 Since the patches frequently observed in images have a higher probability of being included in Y than the ones less frequently observed in images, the learned codebooks may overfit to the frequently observed patches. [sent-59, score-0.271]
22 In order to balance the roles of different types of image patches, it is desirable to maintain large mutual incoherence during the codebook learning phase. [sent-60, score-0.322]
23 On the other hand, theoretical results on sparse coding [7] have also indicated that it is much easier to recover the underlying sparse codes of data when the mutual incoherence of the codebook is large. [sent-61, score-0.797]
24 This motivates us to balance the reconstruction error and the mutual incoherence of the codebook mD,iXnkY −DXkF2+λi∑=M1j=∑1M,j6=i|di>dj| s. [sent-62, score-0.299]
25 ∀m, kdm k2 = 1 and ∀n, kxn k0 (2) ≤K Here, the mutual coherence λ ∑iM=1 ∑Mj=1,j6=i |d>idj| has been included in the objective function to encourage large amsu bteueanl incoherence where λ ≥ 0 is a tradeoff parameter. [sent-64, score-0.294]
26 During each iteration, the current codebook D is used to encode the data Y by computing the sparse code matrix X. [sent-68, score-0.267]
27 Train- ing and test data consists of 1,000,000 36x36 image patches and 100,000 36x36 image patches sampled from images, respectively. [sent-70, score-0.284]
28 This new codebook is then used in the next iteration to recompute the sparse code matrix followed by another round of codebook update. [sent-72, score-0.394]
29 Note that MI-KSVD is quite different from dictionary learning algorithms proposed in [26], which use L2 norm to measure mutual incoherence and L1 norm to enforce sparsity. [sent-73, score-0.195]
30 Encoding: Given a codebook D, the encoding problem is to find the sparse code x of y, leading to the following optimization mxin ky −Dxk2 s. [sent-74, score-0.294]
31 onal matching pursuit (OMP) [25] is used to compute th? [sent-81, score-0.206]
32 OMP selects the codeword best correlated with the current residual at each iteration, which is the reconstruction error remaining after the codewords chosen thus far are subtracted. [sent-83, score-0.179]
33 Once a new codeword is selected, the observation is orthogonally projected onto the span of all the previously selected codewords and the residual is recomputed. [sent-85, score-0.179]
34 Codebook Update: Given the sparse code matrix X, the codewords dm are optimized sequentially. [sent-87, score-0.249]
35 In the m-th step, the m-th codeword and its sparse codes can be computed by minimizing the residual matrix and the mutual coherence corresponding to that codeword ? [sent-88, score-0.48]
36 TYh −is ∑mi6=amtrix contains the differences between the observations and their approximations using all other codewords and their sparse codes. [sent-95, score-0.223]
37 The codebook learned by MI-KSVD has average mutual coherence (AMC) 0. [sent-105, score-0.258]
38 Hierarchial Matching Pursuit In our hierarchical matching pursuit, MI-KSVD is used to learn codebooks at three layers, where the data matrix Y in the first layer consists of raw patches sampled from images, and Y in the second and third layers are sparse codes pooled from the lower layers. [sent-110, score-1.049]
39 With the learned codebooks D, hierarchical matching pursuit builds a feature hierarchy, layer by layer, using batch orthogonal matching pursuit for computing sparse codes, spatial pooling for aggregating sparse codes, and contrast normalization for normalizing feature vectors, as shown in Fig. [sent-111, score-1.201]
40 First Layer: The goal of the first layer in HMP is to extract sparse codes for small patches (e. [sent-113, score-0.535]
41 , 5x5) and generate pooled codes for mid-level patches (e. [sent-115, score-0.406]
42 Orthogonal matching pursuit is used to compute the sparse codes x of small patches (e. [sent-118, score-0.599]
43 Spatial max pooling is then applied to aggregate the sparse codes. [sent-121, score-0.248]
44 We split the positive and negative components of the sparse codes into separate features to allow higher layers weight positive 662 Figure 3: A three-layer architecture of Hierarchical Matching Pursuit. [sent-124, score-0.477]
45 The feature FP describing an image patch P is the concatenation of aggregated sparse codes in each spatial cell FP = ? [sent-126, score-0.386]
46 Since the mpagnitude of sparse codes varies over a wide range due to lopcal variations in illumination and occlusion, this operation makes the appearance features robust to such variations, as commonly done in SIFT features. [sent-130, score-0.293]
47 Second Layer: The goal of the second layer in HMP is to gather and code mid-level sparse codes and generate pooled codes for large patches (e. [sent-133, score-0.844]
48 To do so, HMP applies batch OMP and spatial max pooling to features FP generated in the first layer. [sent-136, score-0.247]
49 The codebook for this level is learned by sampling features FP over images. [sent-137, score-0.208]
50 The process to extract the feature describing a large image patch is identical to that for the first layer: sparse codes of each image patch are computed using batch orthogonal matching pursuit, followed by spatial max pooling on the large patch. [sent-138, score-0.673]
51 Third Layer: The goal of the third layer in HMP is to generate pooled sparse codes for the whole image/object. [sent-140, score-0.539]
52 Similar to the second layer, the codebook for this level is learned by sampling these pooled sparse codes in the second layer. [sent-141, score-0.582]
53 With the learned codebook, just as in the second layer, sparse codes of each image patch (for instance, 36x36) are computed using batch OMP, followed by spatial max pooling on the whole images. [sent-142, score-0.584]
54 The features of the whole image/object are the concatenation of the aggregated sparse codes of the spatial cells. [sent-143, score-0.326]
55 A one-layer HMP has the same architecture as the final layer of a three-layer HMP, except that MI-KSVD and batch OMP are performed on 36x36 raw image patches instead of pooled sparse codes. [sent-145, score-0.73]
56 A two-layer HMP has the same architecture as the second and third layers of a three-layer HMP, except that MI-KSVD and batch OMP in the second layer are performed on raw image patches. [sent-146, score-0.443]
57 The spatial pyramid bag-of-patches model [16] overcomes this problem by organizing patches into spatial cells at multiple levels and then concatenating features from spatial cells into one feature vector. [sent-154, score-0.327]
58 To exploit the advantage of multiple patch sizes as well as the strength of multi-layer architectures, our Multipath Hierarchical Matching Pursuit (M-HMP) configures matching pursuit encoders in multiple pathways, varying patch sizes and the number of layers (see Fig. [sent-158, score-0.477]
59 Note that in the final layer, sparse coding is always applied to the full image patches (16x16 on the top and 36x36 on the bottom). [sent-160, score-0.351]
60 More importantly, we argue that multiple paths, by varying the number of layers in HMP, is important for a single patch size. [sent-162, score-0.166]
61 For instance, we could learn features for 36x36 image patches using a one-layer HMP or a two-layer HMP or a three-layer HMP. [sent-163, score-0.175]
62 These HMP networks with different layers capture different aspects of 36x36 image patches. [sent-164, score-0.182]
63 Intuitively, features of 36x36 image patches learned by a one-layer HMP capture basic structures of patches and are sensitive to spatial displacement. [sent-165, score-0.362]
64 Features of 36x36 image patches learned by a two-layer HMP introduce robustness to local deformations due to the usage of spatial max pooling in the first layer. [sent-166, score-0.323]
65 Features of 36x36 image patches learned by a three-layer HMP are highly abstract and robust due to their ability to recursively eliminate unimportant structures and increase invariance to local deformations by spatial max pooling in the previous layers. [sent-167, score-0.378]
66 Top: One-layer and two-layer HMP networks on image patches of size 16x16. [sent-169, score-0.247]
67 Bottom: Two-layer HMP networks on image patches of size 16x16 and on images patches of size 36x36. [sent-170, score-0.389]
68 Next, we outline the detailed architecture of the five HMP networks used in the experiments. [sent-171, score-0.236]
69 On image patches of size 16x16 (A), we learn a one-layer HMP on RGB images (A1) and a two-layer HMP on grayscale images (A2). [sent-172, score-0.207]
70 For the one-layer HMP, we learn codebooks of size 1000 with sparsity level 5 on a collection of 16x16 raw patches sampled from RGB images. [sent-173, score-0.465]
71 For the two-layer HMP, we first learn first-layer codebooks of size 75 with sparsity level 5 on a collection of 5x5 raw patches sampled from grayscale images. [sent-174, score-0.501]
72 We then generate the pooled sparse codes on 4x4 spatial cells of 16x16 image patches with a pooling size of 4x4 pixels. [sent-175, score-0.709]
73 Finally, we learn second-layer codebooks of size 1000 with sparsity level 10 on the resulting pooled sparse codes on 16x16 image patches. [sent-176, score-0.636]
74 On image patches of size 36x36 (B), we learn one-layer HMP on RGB images, and two-layer and three-layer HMP on grayscale images. [sent-177, score-0.207]
75 For the one-layer HMP (B 1), we learn codebooks of size 1000 with sparsity level 5 on a collection of 36x36 raw patches sampled from RGB images. [sent-178, score-0.465]
76 For the two-layer HMP (B2), we first learn codebooks of size 300 with sparsity level 5 on a collection of 10x10 raw patches sampled from grayscale images. [sent-179, score-0.501]
77 We then generate the pooled sparse codes on 4x4 spatial cells of 36x36 image patches with a pooling size of 9x9 pixels. [sent-180, score-0.709]
78 Finally, we learn codebooks of size 1000 with sparsity level 10 on the resulting pooled sparse codes on 36x36 image patches. [sent-181, score-0.636]
79 For the third layer, we first generate the pooled sparse codes on 3x3 spatial cells of the pooled sparse codes in the second layer with a pooling size of 3x3. [sent-183, score-1.125]
80 Finally, we learn codebooks of size 1000 with sparsity level 10 on the resulting pooled sparse codes based 36x36 image patches (36=4x3x3). [sent-184, score-0.759]
81 In the final layer of A1, A2, B 1, B2 and B3, we generate image-level features by computing sparse codes of 36x36 image patches and performing max pooling followed by contrast normalization on spatial pyramids 1x1, 2x2 and 4x4 on the whole images. [sent-185, score-0.725]
82 Note that the above architecture of multi-layer HMP networks leads to fast computation of pooled sparse codes. [sent-186, score-0.453]
83 Generally speaking, the architecture A1 works well for the categories which have consistent appearances (colors, textures and so on) while the architecture A2 does well for categories which have consistent shapes. [sent-190, score-0.278]
84 We remove the zero frequency component from raw patches by subtracting their mean in the first layer of the HMP networks. [sent-196, score-0.325]
85 MI-KSVD We compare KSVD and MI-KSVD for a one-layer HMP network with image patches of size 36x36 on the Caltech101 dataset. [sent-202, score-0.167]
86 We learn codebooks of size 1000 with sparsity level 5 on 1,000,000 sampled 36x36 raw patches. [sent-204, score-0.319]
87 First of all, the learned dictionaries have very rich appearances and include uniform colors of red, green and blue, transition codewords between different colors, gray and color edges, double gray and color edges, center-surround (dot) codewords, and so on. [sent-208, score-0.182]
88 Moreover, the codebook learned by MI-KSVD is more balanced and diverse than that learned by KSVD: there are 664 (right) on Caltech-101. [sent-210, score-0.193]
89 225 codewords randomly selected from 1000 codewords are shown. [sent-211, score-0.218]
90 less color codewords and more codewords with some complex structures in MI-KSVD. [sent-217, score-0.245]
91 In the following sections, we will demonstrate multipath sparse coding on more standard vision datasets. [sent-225, score-0.452]
92 The three-layer HMP network (B3) is superior to the one-layer networks (A1 and B 1), but inferior to the two layer HMP networks (A2 and B2). [sent-232, score-0.377]
93 Multipath coding combining all five HMP networks achieves the best result of 82. [sent-234, score-0.243]
94 We compare M-HMP with recently published state-of- Figure 6: Test accuracy of single-path and multipath HMP. [sent-236, score-0.241]
95 SIFT+T [9] is soft threshold coding and CRBM [27] is a convolutional variant of Restricted Boltzmann Machines (RBM). [sent-253, score-0.163]
96 HSC [32] is a two layer sparse coding network using L1-norm regularization. [sent-256, score-0.395]
97 2), with the only exception that the number of codewords in the final layer of HMP is increased to 2000 to accommodate for more categories and more images. [sent-268, score-0.283]
98 The best single-path M-HMP is an architecture of two layers on 16x16 image patches that obtains 44. [sent-283, score-0.307]
99 In Table 5, we compare our M-HMP with recently published algorithms such as multiple kernel learning [6], LLC [3 1], random forest [3 1], multi-cue [14], pose pooling [34] and unsupervised template learning [30]. [sent-318, score-0.201]
100 Our approach combines sparse coding through several pathways, 666 using multiple patches of varying size, and encoding each patch through multiple paths with a varying number of layers. [sent-324, score-0.528]
wordName wordTfidf (topN-words)
[('hmp', 0.762), ('multipath', 0.224), ('pursuit', 0.168), ('codes', 0.156), ('layer', 0.142), ('codebook', 0.127), ('pooled', 0.127), ('patches', 0.123), ('ksvd', 0.116), ('codebooks', 0.115), ('sparse', 0.114), ('coding', 0.114), ('codewords', 0.109), ('architecture', 0.107), ('pooling', 0.106), ('incoherence', 0.106), ('networks', 0.105), ('omp', 0.081), ('deep', 0.081), ('layers', 0.077), ('mutual', 0.066), ('raw', 0.06), ('kdm', 0.06), ('mhmp', 0.06), ('miksvd', 0.06), ('patch', 0.057), ('batch', 0.057), ('sparsity', 0.051), ('convolutional', 0.049), ('hierarchical', 0.048), ('pathways', 0.045), ('codeword', 0.042), ('pet', 0.041), ('matching', 0.038), ('csp', 0.037), ('grayscale', 0.036), ('nets', 0.035), ('spatial', 0.033), ('learned', 0.033), ('unsupervised', 0.032), ('fp', 0.032), ('varying', 0.032), ('coherence', 0.032), ('architectures', 0.032), ('categories', 0.032), ('cells', 0.031), ('amc', 0.03), ('kxn', 0.03), ('xjm', 0.03), ('paths', 0.029), ('dj', 0.029), ('learn', 0.029), ('residual', 0.028), ('bo', 0.028), ('invariance', 0.028), ('max', 0.028), ('encoding', 0.027), ('species', 0.027), ('structures', 0.027), ('orthogonal', 0.027), ('crbm', 0.027), ('nbnn', 0.027), ('cell', 0.026), ('code', 0.026), ('cats', 0.026), ('level', 0.025), ('network', 0.025), ('mj', 0.024), ('dogs', 0.024), ('sizes', 0.024), ('five', 0.024), ('belief', 0.023), ('learning', 0.023), ('collection', 0.023), ('features', 0.023), ('ren', 0.023), ('autoencoders', 0.022), ('imagenet', 0.022), ('recursive', 0.022), ('scratch', 0.021), ('sampled', 0.02), ('deconvolutional', 0.02), ('dictionaries', 0.02), ('pyramid', 0.02), ('reconfigurable', 0.02), ('rich', 0.02), ('fox', 0.019), ('discriminative', 0.019), ('size', 0.019), ('labs', 0.019), ('recognition', 0.019), ('signals', 0.018), ('rgb', 0.018), ('sift', 0.018), ('receptive', 0.018), ('test', 0.018), ('hyperparameters', 0.017), ('published', 0.017), ('hierarchy', 0.017), ('path', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999946 304 cvpr-2013-Multipath Sparse Coding Using Hierarchical Matching Pursuit
Author: Liefeng Bo, Xiaofeng Ren, Dieter Fox
Abstract: Complex real-world signals, such as images, contain discriminative structures that differ in many aspects including scale, invariance, and data channel. While progress in deep learning shows the importance of learning features through multiple layers, it is equally important to learn features through multiple paths. We propose Multipath Hierarchical Matching Pursuit (M-HMP), a novel feature learning architecture that combines a collection of hierarchical sparse features for image classification to capture multiple aspects of discriminative structures. Our building blocks are MI-KSVD, a codebook learning algorithm that balances the reconstruction error and the mutual incoherence of the codebook, and batch orthogonal matching pursuit (OMP); we apply them recursively at varying layers and scales. The result is a highly discriminative image representation that leads to large improvements to the state-of-the-art on many standard benchmarks, e.g., Caltech-101, Caltech-256, MITScenes, Oxford-IIIT Pet and Caltech-UCSD Bird-200.
2 0.14707474 104 cvpr-2013-Deep Convolutional Network Cascade for Facial Point Detection
Author: Yi Sun, Xiaogang Wang, Xiaoou Tang
Abstract: We propose a new approach for estimation of the positions of facial keypoints with three-level carefully designed convolutional networks. At each level, the outputs of multiple networks are fused for robust and accurate estimation. Thanks to the deep structures of convolutional networks, global high-level features are extracted over the whole face region at the initialization stage, which help to locate high accuracy keypoints. There are two folds of advantage for this. First, the texture context information over the entire face is utilized to locate each keypoint. Second, since the networks are trained to predict all the keypoints simultaneously, the geometric constraints among keypoints are implicitly encoded. The method therefore can avoid local minimum caused by ambiguity and data corruption in difficult image samples due to occlusions, large pose variations, and extreme lightings. The networks at the following two levels are trained to locally refine initial predictions and their inputs are limited to small regions around the initial predictions. Several network structures critical for accurate and robust facial point detection are investigated. Extensive experiments show that our approach outperforms state-ofthe-art methods in both detection accuracy and reliability1.
3 0.13931736 388 cvpr-2013-Semi-supervised Learning of Feature Hierarchies for Object Detection in a Video
Author: Yang Yang, Guang Shu, Mubarak Shah
Abstract: We propose a novel approach to boost the performance of generic object detectors on videos by learning videospecific features using a deep neural network. The insight behind our proposed approach is that an object appearing in different frames of a video clip should share similar features, which can be learned to build better detectors. Unlike many supervised detector adaptation or detection-bytracking methods, our method does not require any extra annotations or utilize temporal correspondence. We start with the high-confidence detections from a generic detector, then iteratively learn new video-specific features and refine the detection scores. In order to learn discriminative and compact features, we propose a new feature learning method using a deep neural network based on auto encoders. It differs from the existing unsupervised feature learning methods in two ways: first it optimizes both discriminative and generative properties of the features simultaneously, which gives our features better discriminative ability; second, our learned features are more compact, while the unsupervised feature learning methods usually learn a redundant set of over-complete features. Extensive experimental results on person and horse detection show that significant performance improvement can be achieved with our proposed method.
4 0.12476316 204 cvpr-2013-Histograms of Sparse Codes for Object Detection
Author: Xiaofeng Ren, Deva Ramanan
Abstract: Object detection has seen huge progress in recent years, much thanks to the heavily-engineered Histograms of Oriented Gradients (HOG) features. Can we go beyond gradients and do better than HOG? Weprovide an affirmative answer byproposing and investigating a sparse representation for object detection, Histograms of Sparse Codes (HSC). We compute sparse codes with dictionaries learned from data using K-SVD, and aggregate per-pixel sparse codes to form local histograms. We intentionally keep true to the sliding window framework (with mixtures and parts) and only change the underlying features. To keep training (and testing) efficient, we apply dimension reduction by computing SVD on learned models, and adopt supervised training where latent positions of roots and parts are given externally e.g. from a HOG-based detector. By learning and using local representations that are much more expressive than gradients, we demonstrate large improvements over the state of the art on the PASCAL benchmark for both root- only and part-based models.
Author: Ruobing Wu, Yizhou Yu, Wenping Wang
Abstract: Recognizing the category of a visual object remains a challenging computer vision problem. In this paper we develop a novel deep learning method thatfacilitates examplebased visual object category recognition. Our deep learning architecture consists of multiple stacked layers and computes an intermediate representation that can be fed to a nearest-neighbor classifier. This intermediate representation is discriminative and structure-preserving. It is also capable of extracting essential characteristics shared by objects in the same category while filtering out nonessential differences among them. Each layer in our model is a nonlinear mapping, whose parameters are learned through two sequential steps that are designed to achieve the aforementioned properties. The first step computes a discrete mapping called supervised Laplacian Eigenmap. The second step computes a continuous mapping from the discrete version through nonlinear regression. We have extensively tested our method and it achieves state-of-the-art recognition rates on a number of benchmark datasets.
6 0.11797338 268 cvpr-2013-Leveraging Structure from Motion to Learn Discriminative Codebooks for Scalable Landmark Classification
7 0.11487284 328 cvpr-2013-Pedestrian Detection with Unsupervised Multi-stage Feature Learning
8 0.11130131 421 cvpr-2013-Supervised Kernel Descriptors for Visual Recognition
9 0.10596573 296 cvpr-2013-Multi-level Discriminative Dictionary Learning towards Hierarchical Visual Categorization
10 0.1044689 178 cvpr-2013-From Local Similarity to Global Coding: An Application to Image Classification
11 0.092599943 164 cvpr-2013-Fast Convolutional Sparse Coding
12 0.088596508 200 cvpr-2013-Harvesting Mid-level Visual Concepts from Large-Scale Internet Images
13 0.086657159 355 cvpr-2013-Representing Videos Using Mid-level Discriminative Patches
14 0.085829698 422 cvpr-2013-Tag Taxonomy Aware Dictionary Learning for Region Tagging
15 0.084233522 256 cvpr-2013-Learning Structured Hough Voting for Joint Object Detection and Occlusion Reasoning
16 0.081523798 257 cvpr-2013-Learning Structured Low-Rank Representations for Image Classification
17 0.08137399 53 cvpr-2013-BFO Meets HOG: Feature Extraction Based on Histograms of Oriented p.d.f. Gradients for Image Classification
18 0.077749692 64 cvpr-2013-Blessing of Dimensionality: High-Dimensional Feature and Its Efficient Compression for Face Verification
19 0.075394131 32 cvpr-2013-Action Recognition by Hierarchical Sequence Summarization
20 0.070467122 442 cvpr-2013-Transfer Sparse Coding for Robust Image Representation
topicId topicWeight
[(0, 0.146), (1, -0.075), (2, -0.062), (3, 0.083), (4, 0.003), (5, 0.011), (6, 0.009), (7, 0.018), (8, -0.05), (9, -0.051), (10, -0.061), (11, -0.029), (12, 0.058), (13, -0.017), (14, 0.079), (15, 0.063), (16, -0.075), (17, 0.038), (18, 0.118), (19, 0.01), (20, 0.07), (21, -0.034), (22, 0.047), (23, -0.152), (24, -0.048), (25, 0.125), (26, -0.01), (27, -0.012), (28, 0.007), (29, -0.034), (30, -0.068), (31, -0.041), (32, 0.034), (33, 0.025), (34, 0.014), (35, 0.047), (36, 0.013), (37, 0.028), (38, 0.143), (39, 0.007), (40, -0.018), (41, -0.033), (42, 0.102), (43, -0.038), (44, 0.011), (45, 0.058), (46, -0.02), (47, -0.01), (48, 0.006), (49, 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.92985719 304 cvpr-2013-Multipath Sparse Coding Using Hierarchical Matching Pursuit
Author: Liefeng Bo, Xiaofeng Ren, Dieter Fox
Abstract: Complex real-world signals, such as images, contain discriminative structures that differ in many aspects including scale, invariance, and data channel. While progress in deep learning shows the importance of learning features through multiple layers, it is equally important to learn features through multiple paths. We propose Multipath Hierarchical Matching Pursuit (M-HMP), a novel feature learning architecture that combines a collection of hierarchical sparse features for image classification to capture multiple aspects of discriminative structures. Our building blocks are MI-KSVD, a codebook learning algorithm that balances the reconstruction error and the mutual incoherence of the codebook, and batch orthogonal matching pursuit (OMP); we apply them recursively at varying layers and scales. The result is a highly discriminative image representation that leads to large improvements to the state-of-the-art on many standard benchmarks, e.g., Caltech-101, Caltech-256, MITScenes, Oxford-IIIT Pet and Caltech-UCSD Bird-200.
2 0.72264671 421 cvpr-2013-Supervised Kernel Descriptors for Visual Recognition
Author: Peng Wang, Jingdong Wang, Gang Zeng, Weiwei Xu, Hongbin Zha, Shipeng Li
Abstract: In visual recognition tasks, the design of low level image feature representation is fundamental. The advent of local patch features from pixel attributes such as SIFT and LBP, has precipitated dramatic progresses. Recently, a kernel view of these features, called kernel descriptors (KDES) [1], generalizes the feature design in an unsupervised fashion and yields impressive results. In this paper, we present a supervised framework to embed the image level label information into the design of patch level kernel descriptors, which we call supervised kernel descriptors (SKDES). Specifically, we adopt the broadly applied bag-of-words (BOW) image classification pipeline and a large margin criterion to learn the lowlevel patch representation, which makes the patch features much more compact and achieve better discriminative ability than KDES. With this method, we achieve competitive results over several public datasets comparing with stateof-the-art methods.
3 0.70750463 404 cvpr-2013-Sparse Quantization for Patch Description
Author: Xavier Boix, Michael Gygli, Gemma Roig, Luc Van_Gool
Abstract: The representation of local image patches is crucial for the good performance and efficiency of many vision tasks. Patch descriptors have been designed to generalize towards diverse variations, depending on the application, as well as the desired compromise between accuracy and efficiency. We present a novel formulation of patch description, that serves such issues well. Sparse quantization lies at its heart. This allows for efficient encodings, leading to powerful, novel binary descriptors, yet also to the generalization of existing descriptors like SIFTorBRIEF. We demonstrate the capabilities of our formulation for both keypoint matching and image classification. Our binary descriptors achieve state-of-the-art results for two keypoint matching benchmarks, namely those by Brown [6] and Mikolajczyk [18]. For image classification, we propose new descriptors that perform similar to SIFT on Caltech101 [10] and PASCAL VOC07 [9].
Author: Ruobing Wu, Yizhou Yu, Wenping Wang
Abstract: Recognizing the category of a visual object remains a challenging computer vision problem. In this paper we develop a novel deep learning method thatfacilitates examplebased visual object category recognition. Our deep learning architecture consists of multiple stacked layers and computes an intermediate representation that can be fed to a nearest-neighbor classifier. This intermediate representation is discriminative and structure-preserving. It is also capable of extracting essential characteristics shared by objects in the same category while filtering out nonessential differences among them. Each layer in our model is a nonlinear mapping, whose parameters are learned through two sequential steps that are designed to achieve the aforementioned properties. The first step computes a discrete mapping called supervised Laplacian Eigenmap. The second step computes a continuous mapping from the discrete version through nonlinear regression. We have extensively tested our method and it achieves state-of-the-art recognition rates on a number of benchmark datasets.
5 0.63647258 164 cvpr-2013-Fast Convolutional Sparse Coding
Author: Hilton Bristow, Anders Eriksson, Simon Lucey
Abstract: Sparse coding has become an increasingly popular method in learning and vision for a variety of classification, reconstruction and coding tasks. The canonical approach intrinsically assumes independence between observations during learning. For many natural signals however, sparse coding is applied to sub-elements (i.e. patches) of the signal, where such an assumption is invalid. Convolutional sparse coding explicitly models local interactions through the convolution operator, however the resulting optimization problem is considerably more complex than traditional sparse coding. In this paper, we draw upon ideas from signal processing and Augmented Lagrange Methods (ALMs) to produce a fast algorithm with globally optimal subproblems and super-linear convergence.
6 0.61058849 104 cvpr-2013-Deep Convolutional Network Cascade for Facial Point Detection
7 0.60916334 328 cvpr-2013-Pedestrian Detection with Unsupervised Multi-stage Feature Learning
8 0.60910016 178 cvpr-2013-From Local Similarity to Global Coding: An Application to Image Classification
9 0.59999299 388 cvpr-2013-Semi-supervised Learning of Feature Hierarchies for Object Detection in a Video
10 0.59847295 53 cvpr-2013-BFO Meets HOG: Feature Extraction Based on Histograms of Oriented p.d.f. Gradients for Image Classification
11 0.57361799 204 cvpr-2013-Histograms of Sparse Codes for Object Detection
12 0.57222444 83 cvpr-2013-Classification of Tumor Histology via Morphometric Context
13 0.56592005 369 cvpr-2013-Rotation, Scaling and Deformation Invariant Scattering for Texture Discrimination
14 0.56349093 346 cvpr-2013-Real-Time No-Reference Image Quality Assessment Based on Filter Learning
15 0.53791291 69 cvpr-2013-Boosting Binary Keypoint Descriptors
16 0.5341866 403 cvpr-2013-Sparse Output Coding for Large-Scale Visual Recognition
17 0.53399044 200 cvpr-2013-Harvesting Mid-level Visual Concepts from Large-Scale Internet Images
18 0.52806073 255 cvpr-2013-Learning Separable Filters
19 0.52498674 442 cvpr-2013-Transfer Sparse Coding for Robust Image Representation
20 0.50088298 268 cvpr-2013-Leveraging Structure from Motion to Learn Discriminative Codebooks for Scalable Landmark Classification
topicId topicWeight
[(10, 0.1), (16, 0.026), (26, 0.041), (27, 0.022), (33, 0.273), (57, 0.015), (61, 0.181), (67, 0.059), (69, 0.083), (80, 0.027), (87, 0.065)]
simIndex simValue paperId paperTitle
1 0.93543696 47 cvpr-2013-As-Projective-As-Possible Image Stitching with Moving DLT
Author: Julio Zaragoza, Tat-Jun Chin, Michael S. Brown, David Suter
Abstract: We investigate projective estimation under model inadequacies, i.e., when the underpinning assumptions oftheprojective model are not fully satisfied by the data. We focus on the task of image stitching which is customarily solved by estimating a projective warp — a model that is justified when the scene is planar or when the views differ purely by rotation. Such conditions are easily violated in practice, and this yields stitching results with ghosting artefacts that necessitate the usage of deghosting algorithms. To this end we propose as-projective-as-possible warps, i.e., warps that aim to be globally projective, yet allow local non-projective deviations to account for violations to the assumed imaging conditions. Based on a novel estimation technique called Moving Direct Linear Transformation (Moving DLT), our method seamlessly bridges image regions that are inconsistent with the projective model. The result is highly accurate image stitching, with significantly reduced ghosting effects, thus lowering the dependency on post hoc deghosting.
2 0.90733558 16 cvpr-2013-A Linear Approach to Matching Cuboids in RGBD Images
Author: Hao Jiang, Jianxiong Xiao
Abstract: We propose a novel linear method to match cuboids in indoor scenes using RGBD images from Kinect. Beyond depth maps, these cuboids reveal important structures of a scene. Instead of directly fitting cuboids to 3D data, we first construct cuboid candidates using superpixel pairs on a RGBD image, and then we optimize the configuration of the cuboids to satisfy the global structure constraints. The optimal configuration has low local matching costs, small object intersection and occlusion, and the cuboids tend to project to a large region in the image; the number of cuboids is optimized simultaneously. We formulate the multiple cuboid matching problem as a mixed integer linear program and solve the optimization efficiently with a branch and bound method. The optimization guarantees the global optimal solution. Our experiments on the Kinect RGBD images of a variety of indoor scenes show that our proposed method is efficient, accurate and robust against object appearance variations, occlusions and strong clutter.
3 0.88350916 72 cvpr-2013-Boundary Detection Benchmarking: Beyond F-Measures
Author: Xiaodi Hou, Alan Yuille, Christof Koch
Abstract: For an ill-posed problem like boundary detection, human labeled datasets play a critical role. Compared with the active research on finding a better boundary detector to refresh the performance record, there is surprisingly little discussion on the boundary detection benchmark itself. The goal of this paper is to identify the potential pitfalls of today’s most popular boundary benchmark, BSDS 300. In the paper, we first introduce a psychophysical experiment to show that many of the “weak” boundary labels are unreliable and may contaminate the benchmark. Then we analyze the computation of f-measure and point out that the current benchmarking protocol encourages an algorithm to bias towards those problematic “weak” boundary labels. With this evidence, we focus on a new problem of detecting strong boundaries as one alternative. Finally, we assess the performances of 9 major algorithms on different ways of utilizing the dataset, suggesting new directions for improvements.
4 0.87538278 291 cvpr-2013-Motionlets: Mid-level 3D Parts for Human Motion Recognition
Author: LiMin Wang, Yu Qiao, Xiaoou Tang
Abstract: This paper proposes motionlet, a mid-level and spatiotemporal part, for human motion recognition. Motionlet can be seen as a tight cluster in motion and appearance space, corresponding to the moving process of different body parts. We postulate three key properties of motionlet for action recognition: high motion saliency, multiple scale representation, and representative-discriminative ability. Towards this goal, we develop a data-driven approach to learn motionlets from training videos. First, we extract 3D regions with high motion saliency. Then we cluster these regions and preserve the centers as candidate templates for motionlet. Finally, we examine the representative and discriminative power of the candidates, and introduce a greedy method to select effective candidates. With motionlets, we present a mid-level representation for video, called motionlet activation vector. We conduct experiments on three datasets, KTH, HMDB51, and UCF50. The results show that the proposed methods significantly outperform state-of-the-art methods.
same-paper 5 0.87378609 304 cvpr-2013-Multipath Sparse Coding Using Hierarchical Matching Pursuit
Author: Liefeng Bo, Xiaofeng Ren, Dieter Fox
Abstract: Complex real-world signals, such as images, contain discriminative structures that differ in many aspects including scale, invariance, and data channel. While progress in deep learning shows the importance of learning features through multiple layers, it is equally important to learn features through multiple paths. We propose Multipath Hierarchical Matching Pursuit (M-HMP), a novel feature learning architecture that combines a collection of hierarchical sparse features for image classification to capture multiple aspects of discriminative structures. Our building blocks are MI-KSVD, a codebook learning algorithm that balances the reconstruction error and the mutual incoherence of the codebook, and batch orthogonal matching pursuit (OMP); we apply them recursively at varying layers and scales. The result is a highly discriminative image representation that leads to large improvements to the state-of-the-art on many standard benchmarks, e.g., Caltech-101, Caltech-256, MITScenes, Oxford-IIIT Pet and Caltech-UCSD Bird-200.
6 0.87013757 100 cvpr-2013-Crossing the Line: Crowd Counting by Integer Programming with Local Features
8 0.85391414 61 cvpr-2013-Beyond Point Clouds: Scene Understanding by Reasoning Geometry and Physics
9 0.85337681 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases
10 0.85316551 70 cvpr-2013-Bottom-Up Segmentation for Top-Down Detection
11 0.85117859 372 cvpr-2013-SLAM++: Simultaneous Localisation and Mapping at the Level of Objects
12 0.85093462 292 cvpr-2013-Multi-agent Event Detection: Localization and Role Assignment
13 0.85021263 365 cvpr-2013-Robust Real-Time Tracking of Multiple Objects by Volumetric Mass Densities
14 0.8499521 248 cvpr-2013-Learning Collections of Part Models for Object Recognition
15 0.84960771 132 cvpr-2013-Discriminative Re-ranking of Diverse Segmentations
16 0.84960681 445 cvpr-2013-Understanding Bayesian Rooms Using Composite 3D Object Models
17 0.84934574 256 cvpr-2013-Learning Structured Hough Voting for Joint Object Detection and Occlusion Reasoning
18 0.8486616 231 cvpr-2013-Joint Detection, Tracking and Mapping by Semantic Bundle Adjustment
19 0.84834802 221 cvpr-2013-Incorporating Structural Alternatives and Sharing into Hierarchy for Multiclass Object Recognition and Detection
20 0.84736186 329 cvpr-2013-Perceptual Organization and Recognition of Indoor Scenes from RGB-D Images