nips nips2013 nips2013-190 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Carl Doersch, Abhinav Gupta, Alexei A. Efros
Abstract: Recent work on mid-level visual representations aims to capture information at the level of complexity higher than typical “visual words”, but lower than full-blown semantic objects. Several approaches [5, 6, 12, 23] have been proposed to discover mid-level visual elements that are both 1) representative, i.e., frequently occurring within a visual dataset, and 2) visually discriminative. However, the current approaches are rather ad hoc and difficult to analyze and evaluate. In this work, we pose visual element discovery as discriminative mode seeking, drawing connections to the well-known and well-studied mean-shift algorithm [2, 1, 4, 8]. Given a weakly-labeled image collection, our method discovers visually-coherent patch clusters that are maximally discriminative with respect to the labels. One advantage of our formulation is that it requires only a single pass through the data. We also propose the Purity-Coverage plot as a principled way of experimentally analyzing and evaluating different visual discovery approaches, and compare our method against prior work on the Paris Street View dataset of [5]. We also evaluate our method on the task of scene classification, demonstrating state-of-the-art performance on the MIT Scene-67 dataset.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Recent work on mid-level visual representations aims to capture information at the level of complexity higher than typical “visual words”, but lower than full-blown semantic objects. [sent-8, score-0.249]
2 Several approaches [5, 6, 12, 23] have been proposed to discover mid-level visual elements that are both 1) representative, [sent-9, score-0.249]
3 i.e., frequently occurring within a visual dataset, and 2) visually discriminative. [sent-11, score-0.347]
4 In this work, we pose visual element discovery as discriminative mode seeking, drawing connections to the well-known and well-studied mean-shift algorithm [2, 1, 4, 8]. [sent-13, score-0.604]
5 Given a weakly-labeled image collection, our method discovers visually-coherent patch clusters that are maximally discriminative with respect to the labels. [sent-14, score-0.543]
6 We also propose the Purity-Coverage plot as a principled way of experimentally analyzing and evaluating different visual discovery approaches, and compare our method against prior work on the Paris Street View dataset of [5]. [sent-16, score-0.369]
7 1 Introduction In terms of sheer size, visual data is, by most accounts, the biggest “Big Data” out there. [sent-18, score-0.249]
8 Standard machine learning algorithms (e.g., [13]) are not equipped to handle it directly, at the raw pixel level, making research on finding good visual representations particularly relevant and timely. [sent-21, score-0.249]
9 Currently, the most popular visual representations in machine learning are based on “visual words” [24], which are obtained by unsupervised clustering (k-means) of local features (SIFT) over a large dataset. [sent-22, score-0.32]
10 Recently, several approaches [5, 6, 11, 12, 15, 23, 26, 27] have proposed mining visual data for discriminative mid-level visual elements, i.e. [sent-28, score-0.69]
11 These elements are discovered using weak labels, e.g., scene categories [12] or GPS coordinates [5] (but can also run unsupervised [23]), and have been recently used for tasks including image classification [12, 23, 27], object detection [6], visual data mining [5, 15], action recognition [11], and geometry estimation [7]. [sent-33, score-0.613]
12 But how are informative visual elements to be identified in the weakly-labeled visual dataset? [sent-34, score-0.706]
13 The idea is to search for clusters of image patches that are both 1) representative, i.e., frequently occurring, and 2) visually discriminative. [sent-35, score-0.443]
14 Unfortunately, algorithms for finding patches that fit these criteria remain rather ad-hoc and poorly understood. [sent-38, score-0.279]
15 Figure 1: The distribution of patches in HOG feature space is very non-uniform and absolute distances cannot be trusted. [sent-50, score-0.321]
16 We show two patches with their 5 nearest-neighbors from the Paris Street View dataset [5]; beneath each nearest neighbor is its distance from the query. [sent-51, score-0.347]
17 Although the nearest neighbors on the left are visually much better, their distances are more than twice those on the right, meaning that the actual densities of the two regions will differ by a factor of more than 2^d, where d is the intrinsic dimensionality of patch feature space. [sent-52, score-0.346]
18 We show that the well-known, well-understood mean-shift algorithm can produce visual elements that are more representative and discriminative than those of previous approaches. [sent-54, score-0.655]
19 Mining visual elements from a large dataset is difficult for a number of reasons. [sent-55, score-0.525]
20 First, the search space is huge: a typical dataset for visual data mining has tens of thousands of images, and finding something in an image (e.g., [sent-56, score-0.423]
21 finding matches for a visual template) involves searching across tens of thousands of patches at different positions and scales. [sent-58, score-0.528]
22 To make matters worse, patch descriptors tend to be on the order of thousands of dimensions; not only is the curse of dimensionality a constant problem, but we must sift through terabytes of data. [sent-59, score-0.284]
23 And we are searching for a needle in a haystack: the vast majority of patches are actually uninteresting, either because they are rare (e.g. [sent-60, score-0.279]
24 The goal of mean-shift is to find the local maxima (modes) of a density using a sample from that density. [sent-65, score-0.273]
25 But in our case, we can use the weak labels to divide our data into two different subsets (“positive” (+) and “negative” (−)) and seek visual elements which appear only in the “positive” set and not in the “negative” set. [sent-69, score-0.494]
26 That is, we want to find points in feature space where the density of the positive set is large, and the density of the negative set is small. [sent-70, score-0.358]
27 While a number of algorithms exist for estimating ratios of densities (see [25] for a review), we did not find any that were particularly suitable for finding local maxima of density ratios. [sent-72, score-0.273]
28 Hence, the first contribution of our paper is to propose a discriminative variant of mean-shift for finding visual elements. [sent-73, score-0.39]
29 Similar to the way mean-shift performs gradient ascent on a density estimate, our algorithm performs gradient ascent on the density ratio (section 2). [sent-74, score-0.534]
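To make the gradient-ascent-on-a-ratio idea concrete, here is a minimal numerical sketch; the Gaussian kernel, fixed bandwidth, finite-difference gradient, and normalized step are all illustrative assumptions rather than the authors' algorithm:

```python
import numpy as np

def density(x, data, h):
    """Gaussian kernel density estimate at point x."""
    d2 = np.sum((data - x) ** 2, axis=1)
    return np.mean(np.exp(-d2 / (2 * h ** 2)))

def ratio_ascent(x0, pos, neg, h=1.0, step=0.1, iters=100, eps=1e-12):
    """Hill-climb the density ratio p_pos(x) / p_neg(x) by finite differences."""
    f = lambda y: density(y, pos, h) / (density(y, neg, h) + eps)
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        grad = np.zeros_like(x)
        for d in range(x.size):
            e = np.zeros_like(x)
            e[d] = 1e-4
            grad[d] = (f(x + e) - f(x - e)) / 2e-4
        x += step * grad / (np.linalg.norm(grad) + eps)  # normalized ascent step
    return x

# Toy data: positives clustered at (2, 2), negatives spread around the origin.
rng = np.random.default_rng(0)
pos = rng.normal([2.0, 2.0], 0.3, size=(200, 2))
neg = rng.normal([0.0, 0.0], 1.5, size=(800, 2))
print(ratio_ascent(pos[0], pos, neg))  # converges near the positive cluster
```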
30 When we perform gradient ascent separately for each element as in standard mean-shift, however, we find that the most frequently-occurring elements tend to be over-represented. [sent-75, score-0.427]
31 Hence, section 3 describes a modification to our gradient ascent algorithm which uses inter-element communication to approximate common adaptive bandwidth procedures. [sent-76, score-0.358]
32 Finally, in section 4 we demonstrate that our algorithms produce visual elements which are more representative and discriminative than previous methods, and in section 5 we show they significantly improve performance in scene classification. [sent-77, score-0.811]
33 2 Mode Seeking on Density Ratios Our goal is to extract discriminative visual elements by finding the local maxima of the density ratio. [sent-78, score-0.871]
34 Here K is a kernel (e.g., a Gaussian), and h is a globally-shared bandwidth parameter. [sent-81, score-0.236]
35 The bandwidth defines how much the density is smoothed before gradient ascent is performed, meaning these estimators assume a roughly equal distribution of points in all regions of the space. [sent-82, score-0.518]
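For reference, the fixed-bandwidth estimator these two sentences describe presumably has the standard kernel-density form (the exact kernel used in the paper is not preserved in this extraction):

```latex
\hat{p}(x) \;=\; \frac{1}{n}\sum_{i=1}^{n} K\!\left(\frac{d(x_i, x)}{h}\right)
```

Mean-shift then performs gradient ascent on \hat{p}(x); the next sentences argue that a single global h is a poor fit for HOG space.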
36 Unfortunately, absolute distances in HOG feature space cannot be trusted, as shown in Figure 1: any kernel bandwidth which is large enough to work well in the left example will be far too large to work well in the right. [sent-83, score-0.321]
37 One way to deal with the non-uniformity of the feature space is to use an adaptive bandwidth [4]: that is, different bandwidths are used in different regions of the space. [sent-84, score-0.357]
38 We want to maximize the density ratio, so we simply divide the two density estimates. [sent-91, score-0.273]
39 We allow an adaptive bandwidth, but rather than associating a bandwidth with each datapoint, we compute it as a function of w which depends on the data. [sent-92, score-0.236]
40 Hence, we define B(w) as the value of b which satisfies \sum_{i=1}^{n_{neg}} \max(b - d(x_i^-, w), 0) = \beta (3), where \beta is a constant analogous to the bandwidth parameter, except that it directly controls how many negative datapoints are in each cluster. [sent-95, score-0.392]
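Because the left-hand side of (3) is piecewise linear and increasing in b, B(w) can be computed exactly in one pass over sorted distances; note also that, by this construction, the negative-set kernel sum equals \beta for every w, which is presumably what keeps the ratio's denominator constant. A minimal sketch (function and variable names are ours):

```python
import numpy as np

def adaptive_b(dists, beta):
    """Solve sum_i max(b - d_i, 0) = beta for b (Eq. 3), given d_i >= 0.

    The left-hand side is piecewise linear and increasing in b, so we scan
    the sorted distances for the segment that contains the root.
    """
    d = np.sort(np.asarray(dists, dtype=float))
    csum = np.cumsum(d)
    for k in range(1, len(d) + 1):
        b = (beta + csum[k - 1]) / k          # root if exactly k terms are active
        upper = d[k] if k < len(d) else np.inf
        if d[k - 1] <= b <= upper:
            return b
    raise ValueError("no feasible b (beta must be positive)")

# Sanity check: the recovered b reproduces beta.
rng = np.random.default_rng(1)
d = rng.uniform(0.0, 5.0, size=50)
b = adaptive_b(d, beta=3.0)
print(np.sum(np.maximum(b - d, 0.0)))         # prints ~3.0
```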
41 This approach makes the implicit assumption that the distribution of the negatives captures the overall density of the patch space. [sent-98, score-0.317]
42 We can further rewrite the above equation as finding the local maxima of \sum_{i=1}^{n_{pos}} \max(w^\top x_i^+ - b, 0) - \lambda \|w\|^2, subject to \sum_{i=1}^{n_{neg}} \max(w^\top x_i^- - b, 0) = \beta. [sent-107, score-0.307]
43 We run this as an online algorithm by breaking the dataset into chunks and then mining, one chunk at a time, for patches where w^\top x - b > \epsilon for some small \epsilon, akin to “hard mining” for SVMs. [sent-121, score-0.409]
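A sketch of what such a chunked, single-pass loop could look like; only the chunking and the \epsilon threshold come from the text, while the gradient step and the re-balancing of b are our own simplifications:

```python
import numpy as np

def mine_online(chunks, w, b, eps=0.0, lr=1e-3, lam=0.01, beta=3.0):
    """Single pass over the data in chunks, keeping only "hard" patches.

    `chunks` yields (X_pos, X_neg) feature matrices one chunk at a time;
    patches with w.x - b <= eps are discarded, akin to hard mining for SVMs.
    """
    for X_pos, X_neg in chunks:
        hard_pos = X_pos[X_pos @ w - b > eps]
        hard_neg = X_neg[X_neg @ w - b > eps]
        if len(hard_pos) == 0:
            continue
        # Ascend sum_i max(w.x_i - b, 0) - lam * ||w||^2 over the hard positives.
        w = w + lr * (hard_pos.sum(axis=0) - 2.0 * lam * w)
        b = b - lr * len(hard_pos)
        # Crude substitute for the constraint: nudge b so the negative
        # hinge mass stays near beta.
        if len(hard_neg) > 0:
            b += 0.1 * (np.maximum(hard_neg @ w - b, 0).sum() - beta)
    return w, b
```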
44 One way to deal with this is to assign smaller bandwidths to patches in dense regions of the space [4], e. [sent-128, score-0.358]
45 , the window railing on row 1 of Figure 2 (middle) would hopefully have a smaller bandwidth and hence not match to the sidewalk barrier. [sent-130, score-0.363]
46 However, estimating a bandwidth for every datapoint in our setting is not practical, so we seek an approach which only requires one pass through the data. [sent-131, score-0.328]
47 Since patches in regions of the feature space with high density ratio will be members of many clusters, we want a mechanism that will reduce their bandwidth. [sent-132, score-0.535]
48 Specifically, we control how a single patch can contribute to multiple clusters by introducing a sharing weight \alpha_{i,j} for each patch i that is contained in a cluster j, akin to soft-assignment in EM GMM fitting. [sent-134, score-0.573]
49 Returning to our formulation, we maximize (again with respect to the w's and b's) \sum_{i=1}^{n_{pos}} \sum_{j=1}^{m} \alpha_{i,j} \max(w_j^\top x_i^+ - b_j, 0) - \lambda \sum_{j=1}^{m} \|w_j\|^2 [sent-135, score-0.235]
50 subject to \forall j: \sum_{i=1}^{n_{neg}} \max(w_j^\top x_i^- - b_j, 0) = \beta (6), where each \alpha_{i,j} is chosen such that any patch which is a member of multiple clusters gets a lower weight. [sent-137, score-0.428]
51 (6) also has a natural interpretation in terms of maximizing the “representativeness” of the set of clusters: clusters are rewarded for representing patches that are not represented by other clusters. [sent-138, score-0.388]
52 However, since w is roughly proportional to the density of the positive data, the bandwidth is only reduced when the density of positive data is high. [sent-141, score-0.542]
53 In each plot, purity measures the accuracy of the element detectors, whereas coverage captures how often they fire. [sent-174, score-0.579]
54 However, this goes against our mean-shift intuition: if two patches are really instances of the same element, then clusters initialized from those two points should converge to the same mode and not “compete” with one another. [sent-179, score-0.453]
55 Then we set \alpha_{i,j} = \max(w_j^\top x_i^+ - b_j, 0) \,/\, \big( \max(w_j^\top x_i^+ - b_j, 0) + \sum_{k=1}^{m} I(C_k \neq C_j) \max(w_k^\top x_i^+ - b_k, 0) \big) (7). In this way, any “competition” from elements that are too similar to each other is ignored. [sent-182, score-0.411]
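Equation (7) is straightforward to vectorize. The sketch below assumes scores[i, j] stores \max(w_j^\top x_i^+ - b_j, 0) and groups[j] stores the cluster id C_j from the agglomerative step described in the next sentence:

```python
import numpy as np

def sharing_weights(scores, groups):
    """Compute the alpha[i, j] of Eq. (7).

    scores[i, j] = max(w_j . x_i - b_j, 0) for positive patch i, element j;
    groups[j] = agglomerative-cluster id C_j of element j. Competition from
    elements in the same group as j is ignored, per the text.
    """
    n, m = scores.shape
    alpha = np.zeros_like(scores, dtype=float)
    for j in range(m):
        rivals = groups != groups[j]               # I(C_k != C_j)
        denom = scores[:, j] + scores[:, rivals].sum(axis=1)
        safe = np.where(denom > 0, denom, 1.0)     # avoid 0/0
        alpha[:, j] = np.where(denom > 0, scores[:, j] / safe, 0.0)
    return alpha

# Two elements in group 0 do not compete with each other, only with group 1.
s = np.array([[0.9, 0.8, 0.4]])
print(sharing_weights(s, groups=np.array([0, 0, 1])))  # [[0.692, 0.667, 0.190]]
```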
56 To obtain the clusters, we perform agglomerative (UPGMA) clustering on the set of element clusters, using the negative of the number of overlapping cluster members as a “distance” metric. [sent-183, score-0.242]
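One possible rendering of this step with SciPy's average-linkage (UPGMA) routine, assuming each element's top detections are available as a set of patch ids (the membership sets below are hypothetical):

```python
import numpy as np
from scipy.cluster.hierarchy import average, fcluster
from scipy.spatial.distance import squareform

def group_elements(membership, n_groups):
    """Cluster elements by member overlap, as described above.

    membership[j] is the set of patch ids among element j's top detections;
    distance between elements = -(number of shared members).
    """
    m = len(membership)
    D = np.zeros((m, m))
    for a in range(m):
        for b in range(a + 1, m):
            D[a, b] = D[b, a] = -len(membership[a] & membership[b])
    D -= D.min()                                 # shift so distances are non-negative
    Z = average(squareform(D, checks=False))     # UPGMA = average linkage
    return fcluster(Z, t=n_groups, criterion="maxclust")

print(group_elements([{1, 2, 3}, {2, 3, 4}, {7, 8}], n_groups=2))  # [1 1 2]
```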
57 In practice, however, it is extremely rare that the exact same patch is a member of two different clusters; instead, clusters will have member patches that merely overlap with each other. [sent-184, score-0.661]
58 Then we compute \alpha_{i,j} for a given patch by averaging \alpha_{i,j,p} over all pixels p in the patch. [sent-186, score-0.261]
59 It is admittedly difficult to analyze how well these heuristics approximate the adaptive bandwidth approach of [4], and even there the setting of the bandwidth for each datapoint has heuristic aspects. [sent-188, score-0.609]
60 4 Evaluation via Purity-Coverage Plot Our aim is to discover visual elements that are maximally representative and discriminative. [sent-190, score-0.553]
61 To measure this, we define two quantities for a set of visual elements: coverage (which captures representativeness) and purity (which captures discriminativeness). [sent-191, score-0.731]
62 Given a held-out test set, visual elements will generate a set of patch detections. [sent-192, score-0.656]
63 We define the coverage of this set of patches to be the fraction of the pixels from the positive images claimed by at least one patch. [sent-193, score-0.675]
64 We define the purity of a set as the percentage of the patches that share the same label. [sent-194, score-0.555]
65 For an individual visual element, of course, there is an inherent trade-off between purity and coverage: if we lower the detection threshold, we cover more pixels but also increase the likelihood of making mistakes. [sent-195, score-0.653]
66 We could perform this analysis on any dataset containing positive and negative images, but [5] presents a dataset which is particularly suitable. [sent-197, score-0.216]
67 The goal is to mine visual elements which define the look and feel of a geographical locale, with a training set of 2,000 Paris Street View images and 8,000 non-Paris images. [sent-198, score-0.551]
68 Figure 4: Coverage versus the number of elements used in the representation (two panels: purity of 100% and purity of 90%; x-axis: number of elements, 100–500). [sent-212, score-0.208]
69 On the right, we lower the detection threshold until the elements are 90% pure. [sent-216, score-0.274]
70 Note: this is the same purity and coverage measure for the same elements as Figure 3, just plotted differently. [sent-217, score-0.69]
71 To plot the curve for a given value of purity p, we rank all patches by w^\top x - b independently for every element, and select, for a given element, all patches up until the last point where the element has the desired purity. [sent-220, score-0.931]
72 We then compute the coverage as the union of patches selected for every element. [sent-221, score-0.485]
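Putting the last two sentences together, a small sketch of the purity-coverage computation; the detection tuples are hypothetical stand-ins for real detector output:

```python
def coverage_at_purity(detections, purity_target):
    """Compute the covered pixel set at a fixed purity, per the text above.

    detections[j] = list of (score, is_positive, pixels) for element j, where
    `pixels` is the set of positive-image pixels the detection claims (empty
    for detections on negative images). All names here are hypothetical.
    """
    covered = set()
    for dets in detections:
        dets = sorted(dets, key=lambda d: -d[0])  # rank by w.x - b
        keep_until, n_pos = -1, 0
        for idx, (_, is_pos, _) in enumerate(dets):
            n_pos += is_pos
            if n_pos / (idx + 1) >= purity_target:
                keep_until = idx                  # last point at the desired purity
        for _, _, pixels in dets[:keep_until + 1]:
            covered |= pixels
    return covered

dets = [[(0.9, True, {1, 2}), (0.7, False, set()), (0.5, True, {4})]]
print(len(coverage_at_purity(dets, purity_target=0.6)))  # 3 pixels covered
```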
73 Because we are taking a union of patches, adding more elements can only increase coverage, but in practice we prefer concise representations, both for interpretability and for computational reasons. [sent-222, score-0.208]
74 Hence, to compare two element discovery methods, we must select exactly the same number of elements for both of them. [sent-223, score-0.393]
75 Hence, we select elements in the same way for all algorithms, which approximates an “ideal” selection for our measure. [sent-225, score-0.244]
76 Specifically, we first fix a level of purity (95%) and greedily select elements to maximize coverage (on the testing data) for that level of purity. [sent-226, score-0.726]
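This greedy step is ordinary set-cover maximization. A minimal sketch, assuming each candidate element's covered pixels at the fixed purity have been precomputed:

```python
def greedy_select(candidate_pixels, n_elements):
    """Greedy coverage maximization over elements at a fixed purity level.

    candidate_pixels[j] = set of pixels element j covers at that purity.
    """
    covered, chosen = set(), []
    for _ in range(n_elements):
        gains = [len(p - covered) for p in candidate_pixels]
        best = max(range(len(gains)), key=gains.__getitem__)
        if gains[best] == 0:
            break                                 # nothing left to cover
        chosen.append(best)
        covered |= candidate_pixels[best]
    return chosen, covered

chosen, covered = greedy_select([{1, 2, 3}, {3, 4}, {2, 3}], n_elements=2)
print(chosen, len(covered))                       # [0, 1] covering 4 pixels
```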
77 Hence, this ranking serves as an oracle to choose the “best” set of elements for covering the dataset at that level of purity. [sent-227, score-0.276]
78 While this ranking has a bias toward large elements (which inherently cover more pixels per detection), we believe that it provides a valuable comparison between algorithms. [sent-228, score-0.27]
79 We can also slice the same data differently, fixing a level of purity for all elements and varying the number of elements, as shown in Figure 4. [sent-230, score-0.484]
80 We initially train 20,000 visual elements for all the baselines, and select the top elements using the method above. [sent-233, score-0.701]
81 Each cluster is represented by a hyperplane which maximally separates a single seed patch from the negative dataset, learned via LDA, i.e., using a Gaussian model of the negatives. [sent-235, score-0.417]
82 To show the effects of re-clustering, “LDA Retrained” takes the top 5 positive-set patches retrieved in Exemplar LDA (including the initial patch itself), and repeats LDA, separating those 5 from the negative Gaussian. [sent-238, score-0.523]
83 Finally, “LDA Retrained 5 times” begins with elements initialized via the LDA retraining method, and retrains the LDA classifier, each time throwing out the previous top 5 used to train the previous LDA, and selecting a new top 5 from held-out data. [sent-240, score-0.245]
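A compact sketch of these LDA baselines, assuming the negative mean and covariance (mu_neg, cov_neg) are precomputed; the retraining loop is our paraphrase of the description above, not the authors' code:

```python
import numpy as np

def exemplar_lda(seed, mu_neg, cov_neg):
    """Hyperplane separating one seed patch from a Gaussian negative model."""
    return np.linalg.solve(cov_neg, seed - mu_neg)

def lda_retrained(seed, X_pos, mu_neg, cov_neg, rounds=1, top_k=5):
    """"LDA Retrained (x rounds)": refit on the current top-k positives,
    discarding previously used patches each round, per the description."""
    w = exemplar_lda(seed, mu_neg, cov_neg)
    used = np.zeros(len(X_pos), dtype=bool)
    for _ in range(rounds):
        scores = X_pos @ w
        scores[used] = -np.inf                    # throw out the previous top-k
        top = np.argsort(-scores)[:top_k]
        used[top] = True
        w = np.linalg.solve(cov_neg, X_pos[top].mean(axis=0) - mu_neg)
    return w
```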
84 Implementation details: We use the same patch descriptors described in [5] and whiten them following [10]. [sent-244, score-0.241]
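For completeness, descriptor whitening in the spirit of [10] can be sketched as follows; the regularization constant is our assumption:

```python
import numpy as np

def whiten(X, mu, cov, reg=1e-2):
    """Whiten descriptors with generic-patch statistics (mu, cov), loosely
    following [10]; the regularizer keeps the covariance invertible."""
    vals, vecs = np.linalg.eigh(cov + reg * np.eye(cov.shape[0]))
    inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return (X - mu) @ inv_sqrt
```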
85 We mine elements using the online version of our algorithm, with a chunk size of 1000 (200 Paris, 800 non-Paris per batch). [sent-245, score-0.309]
86 We set \beta = t/500, where t is the iteration number, such that the bandwidth increases in proportion to the number of samples. [sent-246, score-0.236]
87 We train the elements for about 200 iterations. Figure 5: For each correctly classified image (left), we show four elements (center) and a heatmap of the locations (right) that contributed most to the classification. [sent-247, score-0.471]
88 To compute \alpha_{i,j} for patch i and detector j, we actually use scale-space voxels rather than pixels, since a large detection can completely cover a small detection but not vice versa. [sent-267, score-0.375]
89 Finally, to reduce the impact of highly redundant textures, we divide \alpha_{i,j} by the total number of detections for element j in the image containing i. [sent-273, score-0.259]
90 5 Scene Classification Finally, we evaluate whether our visual element representation is useful for scene classification. [sent-275, score-0.502]
91 For instance, it may not be obvious why a corridor would be classified as a staircase, but we can see (top right) that the algorithm has identified the railings as a key staircase element, and has found no other staircase elements in the image. [sent-277, score-0.437]
92 For indoor scenes, objects within the scene are often more useful features than global scene statistics [12]: for instance, shoe shops are similar to other stores in global layout, but they mostly contain shoes. [sent-279, score-0.312]
93 We also used smaller descriptors: 6-by-6 HOG cells, corresponding to 64-by-64 patches and 1188-dimensional descriptors. [sent-284, score-0.279]
94 We again select elements by fixing purity and greedily selecting elements to maximize coverage, as above. [sent-285, score-0.728]
95 We even outperform the Improved Fisher Vector of [12], as well as IFV combined with discriminative patches (IFV+BoP). [sent-295, score-0.42]
96 6 Conclusion We developed an extension of the classic mean-shift algorithm to density ratio estimation, showing that the resulting algorithm could be used for element discovery, and demonstrating state-of-the-art results for scene classification. [sent-299, score-0.425]
97 Also, our elements are detected based only on individual patches, but images often contain global structures beyond patches. [sent-304, score-0.263]
98 The variable bandwidth mean shift and data-driven scale selection. [sent-333, score-0.28]
99 Object bank: A high-level image representation for scene classification and semantic feature sparsification. [sent-403, score-0.253]
100 Learning discriminative part detectors for image classification and cosegmentation. [sent-481, score-0.238]
wordName wordTfidf (topN-words)
[('patches', 0.279), ('purity', 0.276), ('visual', 0.249), ('bandwidth', 0.236), ('elements', 0.208), ('lda', 0.207), ('coverage', 0.206), ('patch', 0.199), ('retrained', 0.165), ('scene', 0.156), ('discriminative', 0.141), ('ifv', 0.126), ('maxima', 0.118), ('density', 0.118), ('clusters', 0.109), ('doersch', 0.101), ('element', 0.097), ('datapoint', 0.092), ('staircase', 0.089), ('bj', 0.083), ('cvpr', 0.081), ('guess', 0.08), ('paris', 0.079), ('ascent', 0.077), ('wj', 0.077), ('bop', 0.076), ('nneg', 0.076), ('npos', 0.076), ('sidewalk', 0.076), ('gupta', 0.075), ('street', 0.073), ('detections', 0.07), ('hog', 0.07), ('dataset', 0.068), ('exemplar', 0.068), ('gt', 0.066), ('cluster', 0.066), ('detection', 0.066), ('mode', 0.065), ('visually', 0.063), ('pixels', 0.062), ('hariharan', 0.062), ('chunk', 0.062), ('ini', 0.058), ('seeking', 0.057), ('representative', 0.057), ('sivic', 0.055), ('images', 0.055), ('image', 0.055), ('denominator', 0.055), ('ratio', 0.054), ('efros', 0.053), ('svm', 0.052), ('discovery', 0.052), ('classi', 0.052), ('mining', 0.051), ('abhinav', 0.051), ('centrist', 0.051), ('corridor', 0.051), ('noverlap', 0.051), ('pnneg', 0.051), ('railing', 0.051), ('ramesh', 0.051), ('modes', 0.048), ('max', 0.048), ('iccv', 0.048), ('gradient', 0.045), ('negative', 0.045), ('itera', 0.045), ('representativeness', 0.045), ('admittedly', 0.045), ('comaniciu', 0.045), ('shift', 0.044), ('detector', 0.044), ('kernel', 0.043), ('sift', 0.043), ('feature', 0.042), ('regions', 0.042), ('descriptors', 0.042), ('detectors', 0.042), ('cj', 0.039), ('maximally', 0.039), ('xing', 0.039), ('mine', 0.039), ('pm', 0.039), ('fraction', 0.038), ('divide', 0.037), ('member', 0.037), ('local', 0.037), ('bandwidths', 0.037), ('retraining', 0.037), ('centroid', 0.037), ('iarpa', 0.037), ('bk', 0.037), ('object', 0.036), ('select', 0.036), ('occurring', 0.035), ('datapoints', 0.035), ('positive', 0.035), ('clustering', 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking
Author: Carl Doersch, Abhinav Gupta, Alexei A. Efros
Abstract: Recent work on mid-level visual representations aims to capture information at the level of complexity higher than typical “visual words”, but lower than full-blown semantic objects. Several approaches [5, 6, 12, 23] have been proposed to discover mid-level visual elements that are both 1) representative, i.e., frequently occurring within a visual dataset, and 2) visually discriminative. However, the current approaches are rather ad hoc and difficult to analyze and evaluate. In this work, we pose visual element discovery as discriminative mode seeking, drawing connections to the well-known and well-studied mean-shift algorithm [2, 1, 4, 8]. Given a weakly-labeled image collection, our method discovers visually-coherent patch clusters that are maximally discriminative with respect to the labels. One advantage of our formulation is that it requires only a single pass through the data. We also propose the Purity-Coverage plot as a principled way of experimentally analyzing and evaluating different visual discovery approaches, and compare our method against prior work on the Paris Street View dataset of [5]. We also evaluate our method on the task of scene classification, demonstrating state-of-the-art performance on the MIT Scene-67 dataset.
2 0.1803944 5 nips-2013-A Deep Architecture for Matching Short Texts
Author: Zhengdong Lu, Hang Li
Abstract: Many machine learning problems can be interpreted as learning for matching two types of objects (e.g., images and captions, users and products, queries and documents, etc.). The matching level of two objects is usually measured as the inner product in a certain feature space, while the modeling effort focuses on mapping of objects from the original space to the feature space. This schema, although proven successful on a range of matching tasks, is insufficient for capturing the rich structure in the matching process of more complicated objects. In this paper, we propose a new deep architecture to more effectively model the complicated matching relations between two objects from heterogeneous domains. More specifically, we apply this model to matching tasks in natural language, e.g., finding sensible responses for a tweet, or relevant answers to a given question. This new architecture naturally combines the localness and hierarchy intrinsic to the natural language problems, and therefore greatly improves upon the state-of-the-art models. 1
3 0.13431792 119 nips-2013-Fast Template Evaluation with Vector Quantization
Author: Mohammad Amin Sadeghi, David Forsyth
Abstract: Applying linear templates is an integral part of many object detection systems and accounts for a significant portion of computation time. We describe a method that achieves a substantial end-to-end speedup over the best current methods, without loss of accuracy. Our method is a combination of approximating scores by vector quantizing feature windows and a number of speedup techniques including cascade. Our procedure allows speed and accuracy to be traded off in two ways: by choosing the number of Vector Quantization levels, and by choosing to rescore windows or not. Our method can be directly plugged into any recognition system that relies on linear templates. We demonstrate our method to speed up the original Exemplar SVM detector [1] by an order of magnitude and Deformable Part models [2] by two orders of magnitude with no loss of accuracy. 1
4 0.13235329 349 nips-2013-Visual Concept Learning: Combining Machine Vision and Bayesian Generalization on Concept Hierarchies
Author: Yangqing Jia, Joshua T. Abbott, Joseph Austerweil, Thomas Griffiths, Trevor Darrell
Abstract: Learning a visual concept from a small number of positive examples is a significant challenge for machine learning algorithms. Current methods typically fail to find the appropriate level of generalization in a concept hierarchy for a given set of visual examples. Recent work in cognitive science on Bayesian models of generalization addresses this challenge, but prior results assumed that objects were perfectly recognized. We present an algorithm for learning visual concepts directly from images, using probabilistic predictions generated by visual classifiers as the input to a Bayesian generalization model. As no existing challenge data tests this paradigm, we collect and make available a new, large-scale dataset for visual concept learning using the ImageNet hierarchy as the source of possible concepts, with human annotators to provide ground truth labels as to whether a new image is an instance of each concept using a paradigm similar to that used in experiments studying word learning in children. We compare the performance of our system to several baseline algorithms, and show a significant advantage results from combining visual classifiers with the ability to identify an appropriate level of abstraction using Bayesian generalization. 1
5 0.13128249 351 nips-2013-What Are the Invariant Occlusive Components of Image Patches? A Probabilistic Generative Approach
Author: Zhenwen Dai, Georgios Exarchakis, Jörg Lücke
Abstract: We study optimal image encoding based on a generative approach with non-linear feature combinations and explicit position encoding. By far most approaches to unsupervised learning of visual features, such as sparse coding or ICA, account for translations by representing the same features at different positions. Some earlier models used a separate encoding of features and their positions to facilitate invariant data encoding and recognition. All probabilistic generative models with explicit position encoding have so far assumed a linear superposition of components to encode image patches. Here, we for the first time apply a model with non-linear feature superposition and explicit position encoding for patches. By avoiding linear superpositions, the studied model represents a closer match to component occlusions which are ubiquitous in natural images. In order to account for occlusions, the non-linear model encodes patches qualitatively very different from linear models by using component representations separated into mask and feature parameters. We first investigated encodings learned by the model using artificial data with mutually occluding components. We find that the model extracts the components, and that it can correctly identify the occlusive components with the hidden variables of the model. On natural image patches, the model learns component masks and features for typical image components. By using reverse correlation, we estimate the receptive fields associated with the model’s hidden units. We find many Gabor-like or globular receptive fields as well as fields sensitive to more complex structures. Our results show that probabilistic models that capture occlusions and invariances can be trained efficiently on image patches, and that the resulting encoding represents an alternative model for the neural encoding of images in the primary visual cortex. 1
6 0.11740539 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model
7 0.11351682 167 nips-2013-Learning the Local Statistics of Optical Flow
8 0.10491367 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding
9 0.099962004 195 nips-2013-Modeling Clutter Perception using Parametric Proto-object Partitioning
10 0.098885849 31 nips-2013-Adaptivity to Local Smoothness and Dimension in Kernel Regression
11 0.093453743 84 nips-2013-Deep Neural Networks for Object Detection
12 0.091681518 83 nips-2013-Deep Fisher Networks for Large-Scale Image Classification
13 0.090640977 356 nips-2013-Zero-Shot Learning Through Cross-Modal Transfer
14 0.090340376 37 nips-2013-Approximate Bayesian Image Interpretation using Generative Probabilistic Graphics Programs
15 0.088986903 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization
16 0.086460687 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
17 0.086445853 251 nips-2013-Predicting Parameters in Deep Learning
18 0.083472513 329 nips-2013-Third-Order Edge Statistics: Contour Continuation, Curvature, and Cortical Connections
19 0.083183348 148 nips-2013-Latent Maximum Margin Clustering
20 0.082108483 276 nips-2013-Reshaping Visual Datasets for Domain Adaptation
topicId topicWeight
[(0, 0.219), (1, 0.098), (2, -0.124), (3, -0.091), (4, 0.131), (5, -0.042), (6, -0.032), (7, 0.018), (8, -0.027), (9, 0.037), (10, -0.148), (11, 0.044), (12, 0.001), (13, 0.014), (14, 0.03), (15, 0.034), (16, 0.024), (17, -0.173), (18, -0.054), (19, -0.002), (20, 0.011), (21, 0.055), (22, 0.007), (23, 0.01), (24, -0.12), (25, -0.048), (26, 0.035), (27, -0.019), (28, -0.027), (29, -0.066), (30, 0.035), (31, 0.046), (32, 0.094), (33, -0.018), (34, -0.054), (35, 0.023), (36, 0.031), (37, 0.032), (38, -0.065), (39, 0.02), (40, 0.054), (41, -0.077), (42, -0.045), (43, -0.034), (44, -0.075), (45, -0.09), (46, -0.013), (47, 0.056), (48, 0.031), (49, -0.005)]
simIndex simValue paperId paperTitle
same-paper 1 0.96031713 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking
Author: Carl Doersch, Abhinav Gupta, Alexei A. Efros
Abstract: Recent work on mid-level visual representations aims to capture information at the level of complexity higher than typical “visual words”, but lower than full-blown semantic objects. Several approaches [5, 6, 12, 23] have been proposed to discover mid-level visual elements that are both 1) representative, i.e., frequently occurring within a visual dataset, and 2) visually discriminative. However, the current approaches are rather ad hoc and difficult to analyze and evaluate. In this work, we pose visual element discovery as discriminative mode seeking, drawing connections to the well-known and well-studied mean-shift algorithm [2, 1, 4, 8]. Given a weakly-labeled image collection, our method discovers visually-coherent patch clusters that are maximally discriminative with respect to the labels. One advantage of our formulation is that it requires only a single pass through the data. We also propose the Purity-Coverage plot as a principled way of experimentally analyzing and evaluating different visual discovery approaches, and compare our method against prior work on the Paris Street View dataset of [5]. We also evaluate our method on the task of scene classification, demonstrating state-of-the-art performance on the MIT Scene-67 dataset.
2 0.79187441 195 nips-2013-Modeling Clutter Perception using Parametric Proto-object Partitioning
Author: Chen-Ping Yu, Wen-Yu Hua, Dimitris Samaras, Greg Zelinsky
Abstract: Visual clutter, the perception of an image as being crowded and disordered, affects aspects of our lives ranging from object detection to aesthetics, yet relatively little effort has been made to model this important and ubiquitous percept. Our approach models clutter as the number of proto-objects segmented from an image, with proto-objects defined as groupings of superpixels that are similar in intensity, color, and gradient orientation features. We introduce a novel parametric method of clustering superpixels by modeling mixture of Weibulls on Earth Mover’s Distance statistics, then taking the normalized number of proto-objects following partitioning as our estimate of clutter perception. We validated this model using a new 90-image dataset of real world scenes rank ordered by human raters for clutter, and showed that our method not only predicted clutter extremely well (Spearman’s ρ = 0.8038, p < 0.001), but also outperformed all existing clutter perception models and even a behavioral object segmentation ground truth. We conclude that the number of proto-objects in an image affects clutter perception more than the number of objects or features. 1
3 0.77361554 351 nips-2013-What Are the Invariant Occlusive Components of Image Patches? A Probabilistic Generative Approach
Author: Zhenwen Dai, Georgios Exarchakis, Jörg Lücke
Abstract: We study optimal image encoding based on a generative approach with non-linear feature combinations and explicit position encoding. By far most approaches to unsupervised learning of visual features, such as sparse coding or ICA, account for translations by representing the same features at different positions. Some earlier models used a separate encoding of features and their positions to facilitate invariant data encoding and recognition. All probabilistic generative models with explicit position encoding have so far assumed a linear superposition of components to encode image patches. Here, we for the first time apply a model with non-linear feature superposition and explicit position encoding for patches. By avoiding linear superpositions, the studied model represents a closer match to component occlusions which are ubiquitous in natural images. In order to account for occlusions, the non-linear model encodes patches qualitatively very different from linear models by using component representations separated into mask and feature parameters. We first investigated encodings learned by the model using artificial data with mutually occluding components. We find that the model extracts the components, and that it can correctly identify the occlusive components with the hidden variables of the model. On natural image patches, the model learns component masks and features for typical image components. By using reverse correlation, we estimate the receptive fields associated with the model’s hidden units. We find many Gabor-like or globular receptive fields as well as fields sensitive to more complex structures. Our results show that probabilistic models that capture occlusions and invariances can be trained efficiently on image patches, and that the resulting encoding represents an alternative model for the neural encoding of images in the primary visual cortex. 1
4 0.72693175 119 nips-2013-Fast Template Evaluation with Vector Quantization
Author: Mohammad Amin Sadeghi, David Forsyth
Abstract: Applying linear templates is an integral part of many object detection systems and accounts for a significant portion of computation time. We describe a method that achieves a substantial end-to-end speedup over the best current methods, without loss of accuracy. Our method is a combination of approximating scores by vector quantizing feature windows and a number of speedup techniques including cascade. Our procedure allows speed and accuracy to be traded off in two ways: by choosing the number of Vector Quantization levels, and by choosing to rescore windows or not. Our method can be directly plugged into any recognition system that relies on linear templates. We demonstrate our method to speed up the original Exemplar SVM detector [1] by an order of magnitude and Deformable Part models [2] by two orders of magnitude with no loss of accuracy. 1
5 0.71124256 166 nips-2013-Learning invariant representations and applications to face verification
Author: Qianli Liao, Joel Z. Leibo, Tomaso Poggio
Abstract: One approach to computer object recognition and modeling the brain’s ventral stream involves unsupervised learning of representations that are invariant to common transformations. However, applications of these ideas have usually been limited to 2D affine transformations, e.g., translation and scaling, since they are easiest to solve via convolution. In accord with a recent theory of transformationinvariance [1], we propose a model that, while capturing other common convolutional networks as special cases, can also be used with arbitrary identitypreserving transformations. The model’s wiring can be learned from videos of transforming objects—or any other grouping of images into sets by their depicted object. Through a series of successively more complex empirical tests, we study the invariance/discriminability properties of this model with respect to different transformations. First, we empirically confirm theoretical predictions (from [1]) for the case of 2D affine transformations. Next, we apply the model to non-affine transformations; as expected, it performs well on face verification tasks requiring invariance to the relatively smooth transformations of 3D rotation-in-depth and changes in illumination direction. Surprisingly, it can also tolerate clutter “transformations” which map an image of a face on one background to an image of the same face on a different background. Motivated by these empirical findings, we tested the same model on face verification benchmark tasks from the computer vision literature: Labeled Faces in the Wild, PubFig [2, 3, 4] and a new dataset we gathered—achieving strong performance in these highly unconstrained cases as well. 1
6 0.70424241 167 nips-2013-Learning the Local Statistics of Optical Flow
7 0.69150394 84 nips-2013-Deep Neural Networks for Object Detection
8 0.68071997 212 nips-2013-Non-Uniform Camera Shake Removal Using a Spatially-Adaptive Sparse Penalty
9 0.65693998 329 nips-2013-Third-Order Edge Statistics: Contour Continuation, Curvature, and Cortical Connections
10 0.65052682 37 nips-2013-Approximate Bayesian Image Interpretation using Generative Probabilistic Graphics Programs
11 0.64916557 138 nips-2013-Higher Order Priors for Joint Intrinsic Image, Objects, and Attributes Estimation
12 0.64776963 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding
13 0.63184965 163 nips-2013-Learning a Deep Compact Image Representation for Visual Tracking
14 0.61340153 226 nips-2013-One-shot learning by inverting a compositional causal process
15 0.61077029 260 nips-2013-RNADE: The real-valued neural autoregressive density-estimator
17 0.55256802 5 nips-2013-A Deep Architecture for Matching Short Texts
18 0.53638995 261 nips-2013-Rapid Distance-Based Outlier Detection via Sampling
19 0.53638077 357 nips-2013-k-Prototype Learning for 3D Rigid Structures
20 0.53335875 83 nips-2013-Deep Fisher Networks for Large-Scale Image Classification
topicId topicWeight
[(2, 0.017), (16, 0.037), (18, 0.176), (19, 0.016), (33, 0.186), (34, 0.125), (41, 0.02), (49, 0.052), (56, 0.1), (70, 0.056), (85, 0.041), (89, 0.037), (93, 0.068), (95, 0.012)]
simIndex simValue paperId paperTitle
same-paper 1 0.88002539 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking
Author: Carl Doersch, Abhinav Gupta, Alexei A. Efros
Abstract: Recent work on mid-level visual representations aims to capture information at the level of complexity higher than typical “visual words”, but lower than full-blown semantic objects. Several approaches [5, 6, 12, 23] have been proposed to discover mid-level visual elements that are both 1) representative, i.e., frequently occurring within a visual dataset, and 2) visually discriminative. However, the current approaches are rather ad hoc and difficult to analyze and evaluate. In this work, we pose visual element discovery as discriminative mode seeking, drawing connections to the well-known and well-studied mean-shift algorithm [2, 1, 4, 8]. Given a weakly-labeled image collection, our method discovers visually-coherent patch clusters that are maximally discriminative with respect to the labels. One advantage of our formulation is that it requires only a single pass through the data. We also propose the Purity-Coverage plot as a principled way of experimentally analyzing and evaluating different visual discovery approaches, and compare our method against prior work on the Paris Street View dataset of [5]. We also evaluate our method on the task of scene classification, demonstrating state-of-the-art performance on the MIT Scene-67 dataset.
2 0.82578927 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization
Author: Nataliya Shapovalova, Michalis Raptis, Leonid Sigal, Greg Mori
Abstract: We propose a weakly-supervised structured learning approach for recognition and spatio-temporal localization of actions in video. As part of the proposed approach, we develop a generalization of the Max-Path search algorithm which allows us to efficiently search over a structured space of multiple spatio-temporal paths while also incorporating context information into the model. Instead of using spatial annotations in the form of bounding boxes to guide the latent model during training, we utilize human gaze data in the form of a weak supervisory signal. This is achieved by incorporating eye gaze, along with the classification, into the structured loss within the latent SVM learning framework. Experiments on a challenging benchmark dataset, UCF-Sports, show that our model is more accurate, in terms of classification, and achieves state-of-the-art results in localization. In addition, our model can produce top-down saliency maps conditioned on the classification label and localized latent paths. 1
3 0.82366562 201 nips-2013-Multi-Task Bayesian Optimization
Author: Kevin Swersky, Jasper Snoek, Ryan P. Adams
Abstract: Bayesian optimization has recently been proposed as a framework for automatically tuning the hyperparameters of machine learning models and has been shown to yield state-of-the-art performance with impressive ease and efficiency. In this paper, we explore whether it is possible to transfer the knowledge gained from previous optimizations to new tasks in order to find optimal hyperparameter settings more efficiently. Our approach is based on extending multi-task Gaussian processes to the framework of Bayesian optimization. We show that this method significantly speeds up the optimization process when compared to the standard single-task approach. We further propose a straightforward extension of our algorithm in order to jointly minimize the average error across multiple tasks and demonstrate how this can be used to greatly speed up k-fold cross-validation. Lastly, we propose an adaptation of a recently developed acquisition function, entropy search, to the cost-sensitive, multi-task setting. We demonstrate the utility of this new acquisition function by leveraging a small dataset to explore hyperparameter settings for a large dataset. Our algorithm dynamically chooses which dataset to query in order to yield the most information per unit cost. 1
4 0.82298529 64 nips-2013-Compete to Compute
Author: Rupesh K. Srivastava, Jonathan Masci, Sohrob Kazerounian, Faustino Gomez, Jürgen Schmidhuber
Abstract: Local competition among neighboring neurons is common in biological neural networks (NNs). In this paper, we apply the concept to gradient-based, backprop-trained artificial multilayer NNs. NNs with competing linear units tend to outperform those with non-competing nonlinear units, and avoid catastrophic forgetting when training sets change over time. 1
5 0.82291001 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding
Author: Marius Pachitariu, Adam M. Packer, Noah Pettit, Henry Dalgleish, Michael Hausser, Maneesh Sahani
Abstract: Biological tissue is often composed of cells with similar morphologies replicated throughout large volumes and many biological applications rely on the accurate identification of these cells and their locations from image data. Here we develop a generative model that captures the regularities present in images composed of repeating elements of a few different types. Formally, the model can be described as convolutional sparse block coding. For inference we use a variant of convolutional matching pursuit adapted to block-based representations. We extend the KSVD learning algorithm to subspaces by retaining several principal vectors from the SVD decomposition instead of just one. Good models with little cross-talk between subspaces can be obtained by learning the blocks incrementally. We perform extensive experiments on simulated images and the inference algorithm consistently recovers a large proportion of the cells with a small number of false positives. We fit the convolutional model to noisy GCaMP6 two-photon images of spiking neurons and to Nissl-stained slices of cortical tissue and show that it recovers cell body locations without supervision. The flexibility of the block-based representation is reflected in the variability of the recovered cell shapes. 1
6 0.82132757 251 nips-2013-Predicting Parameters in Deep Learning
7 0.81901801 173 nips-2013-Least Informative Dimensions
8 0.81851155 331 nips-2013-Top-Down Regularization of Deep Belief Networks
9 0.81694031 286 nips-2013-Robust learning of low-dimensional dynamics from large neural ensembles
10 0.81686729 5 nips-2013-A Deep Architecture for Matching Short Texts
11 0.81686372 183 nips-2013-Mapping paradigm ontologies to and from the brain
12 0.81528199 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits
13 0.81499207 236 nips-2013-Optimal Neural Population Codes for High-dimensional Stimulus Variables
14 0.81471449 301 nips-2013-Sparse Additive Text Models with Low Rank Background
15 0.81393123 294 nips-2013-Similarity Component Analysis
16 0.81375688 275 nips-2013-Reservoir Boosting : Between Online and Offline Ensemble Learning
17 0.81361556 304 nips-2013-Sparse nonnegative deconvolution for compressive calcium imaging: algorithms and phase transitions
18 0.81298023 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking
19 0.81240374 30 nips-2013-Adaptive dropout for training deep neural networks
20 0.81134444 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables