nips nips2010 nips2010-149 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Victor Lempitsky, Andrew Zisserman
Abstract: We propose a new supervised learning framework for visual object counting tasks, such as estimating the number of cells in a microscopic image or the number of humans in surveillance video frames. We focus on the practically-attractive case when the training images are annotated with dots (one dot per object). Our goal is to accurately estimate the count. However, we evade the hard task of learning to detect and localize individual object instances. Instead, we cast the problem as that of estimating an image density whose integral over any image region gives the count of objects within that region. Learning to infer such density can be formulated as a minimization of a regularized risk quadratic cost function. We introduce a new loss function, which is well-suited for such learning, and at the same time can be computed efficiently via a maximum subarray algorithm. The learning can then be posed as a convex quadratic program solvable with cutting-plane optimization. The proposed framework is very flexible as it can accept any domain-specific visual features. Once trained, our system provides accurate object counts and requires a very small time overhead over the feature extraction step, making it a good candidate for applications involving real-time processing or dealing with huge amount of visual data. 1
Reference: text
sentIndex sentText sentNum sentScore
1 We focus on the practically-attractive case when the training images are annotated with dots (one dot per object). [sent-2, score-0.417]
2 Instead, we cast the problem as that of estimating an image density whose integral over any image region gives the count of objects within that region. [sent-5, score-0.729]
3 Once trained, our system provides accurate object counts and requires a very small time overhead over the feature extraction step, making it a good candidate for applications involving real-time processing or dealing with huge amount of visual data. [sent-10, score-0.248]
4 1 Introduction The counting problem is the estimation of the number of objects in a still image or video frame. [sent-11, score-0.745]
5 It arises in many real-world applications including cell counting in microscopic images, monitoring crowds in surveillance systems, and performing wildlife census or counting the number of trees in an aerial image of a forest. [sent-12, score-1.182]
6 Arguably, the bare minimum of annotation is to provide the overall count of objects in each training image. [sent-15, score-0.471]
7 This paper focusses on the next level of annotation which is to specify the object position by putting a single dot on each object instance in each image. [sent-16, score-0.432]
8 Figure 1 gives examples of the counting problems and the dotted annotation we consider. [sent-17, score-0.613]
9 Dotting (pointing) is the natural way to count objects for humans, at least when the number of objects is large. [sent-18, score-0.346]
10 It may be argued therefore that providing dotted annotations for the training images is no harder for a human than giving just the raw counts. [sent-19, score-0.38]
11 On the other hand, a spatial arrangement of the dots provides a wealth of additional information, and this paper is, in part, about how to exploit this “free lunch” (in the context of the counting problem). [sent-20, score-0.5]
12 This paper develops a simple and general discriminative learning-based framework for counting objects in images. [sent-23, score-0.554]
13 The high-level idea of our approach is extremely simple: given an image I, our goal is to recover a density function F as a real function of pixels in this image. [sent-26, score-0.391]
14 Our notion of density function loosely … Figure 1: Examples of counting problems. [sent-27, score-0.548]
15 Left — counting bacterial cells in a fluorescence-light microscopy image (from [29]), right — counting people in a surveillance video frame (from [10]). [sent-28, score-1.403]
16 Our framework learns to estimate the number of objects in previously unseen images based on a set of training images of the same kind augmented with dotted annotations. [sent-31, score-0.637]
17 Given the estimate F of the density function and a query about the number of objects in the entire image I, the number of objects is estimated by integrating F over the entire image. [sent-33, score-0.873]
18 Furthermore, integrating the density over an image subregion S ⊂ I gives an estimate of the count of objects in that subregion. [sent-34, score-0.55]
19 Our approach assumes that each pixel $p$ in an image is represented by a feature vector $x_p$ and models the density function as a linear transformation of $x_p$: $F(p) = w^T x_p$. [sent-35, score-0.745]
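As an illustration of the mechanics of this model (a minimal sketch, not the authors' code), the linear density map and the count-by-summation step can be written as follows, assuming per-pixel features are stacked in a numpy array; all shapes and data below are hypothetical:

```python
import numpy as np

def density_map(features, w):
    """F(p) = w^T x_p for every pixel: features has shape (H, W, K), w has shape (K,)."""
    return features @ w  # (H, W) estimated density, one value per pixel

def count_in_region(F, mask=None):
    """Integrate the density over the whole image, or over a boolean region mask."""
    return float(F.sum()) if mask is None else float(F[mask].sum())

# Hypothetical stand-in data, just to show the shapes involved.
H, W, K = 64, 64, 16
features = np.random.rand(H, W, K)
w = np.random.rand(K)
F = density_map(features, w)
print(count_in_region(F))  # estimated object count for the entire image
```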
20 Given a set of training images, the parameter vector w is learnt in the regularized risk framework, so that the density function estimates for the training images match the ground truth densities inferred from the user annotations (under regularization on w). [sent-36, score-0.943]
21 The key conceptual difficulty with the density function is the discrete nature of both image observations (pixel grid) and, in particular, the user training annotation (sparse set of dots). [sent-37, score-0.577]
22 As a result, while it is easy to reason about average densities over extended image regions (e.g. [sent-38, score-0.296]
23 the whole image), the notion of density is not well-defined at a pixel level. [sent-40, score-0.317]
24 Thus, given a set of dotted annotations, there is no trivial answer to the question: what should the ground truth density for this training example be? [sent-41, score-0.733]
25 Our main contribution, addressing this conceptual difficulty, is a specific distance metric D between density functions used as a loss in our framework, which we call the MESA distance (where MESA stands for Maximum Excess over SubArrays, and is also the geological term for an elevated plateau). [sent-43, score-0.417]
26 Thus, it does not matter much how exactly we define the ground truth density locally, as long as the integrals of the ground truth density over the larger regions reflect the counts correctly. [sent-47, score-0.988]
27 We can then naturally define the “ground truth” density for a dotted annotation to be a sum of normalized Gaussians centered at the dots. [sent-48, score-0.383]
28 As virtually no assumptions are made about the features $x_p$, our framework can benefit from much of the research on good features for object detection. [sent-55, score-0.242]
29 A number of approaches tackle counting problems in an unsupervised way, performing grouping based on self-similarities [3] or motion similarities [27]. [sent-59, score-0.389]
30 However, the counting accuracy of such fully unsupervised methods is limited, and therefore others considered approaches based on supervised learning. [sent-60, score-0.389]
31 (MATLAB jet colormap is used.) Counting by detection: This assumes the use of a visual object detector that localizes individual object instances in the image. [sent-71, score-0.346]
32 Given the localizations of all instances, counting becomes trivial. [sent-72, score-0.389]
33 In particular, most current object detectors operate in two stages: first producing a real-valued confidence map, and second, given such a map, applying further thresholding and non-maximum suppression steps to locate peaks corresponding to individual instances [12, 26]. [sent-74, score-0.289]
34 More generative approaches avoid non-maximum suppression by reasoning about relations between object parts and instances [6, 14, 20, 33, 34], but they are still geared towards a situation with a small number of objects in images and require time-consuming inference. [sent-75, score-0.505]
35 Instead, a direct mapping from some global image characteristics (mainly histograms of various features) to the number of objects is learned. [sent-79, score-0.313]
36 As a result, a large number of training images with the supplied counts need to be provided during training. [sent-84, score-0.292]
37 Finally, counting by segmentation methods [10, 28] can be regarded as hybrids of counting-by-detection and counting-by-regression approaches. [sent-85, score-0.439]
38 They segment the objects into separate clusters and then regress from the global properties of each cluster to the overall number of objects in it. [sent-86, score-0.348]
39 It is also assumed that each pixel $p$ in each image $I_i$ is associated with a real-valued feature vector $x^i_p \in \mathbb{R}^K$. [sent-93, score-0.337]
40 It is finally assumed that each training image $I_i$ is annotated with a set of 2D points $\mathcal{P}_i = \{P_1, \ldots\}$. [sent-95, score-0.306]
41 The density functions in our approach are real-valued functions over pixel grids, whose integrals over image regions should match the object counts. [sent-99, score-0.673]
42 For a training image $I_i$, we define the ground truth density function to be a kernel density estimate based on the provided points: $\forall p \in I_i,\; F^0_i(p) = \sum_{P \in \mathcal{P}_i} \mathcal{N}(p;\, P,\, \sigma^2 \mathbf{1}_{2 \times 2})$. [sent-100, score-0.847]
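A sketch of how such a ground truth density could be rasterized from dot annotations, following the formula above; the image size, dot coordinates, and σ in the usage line are illustrative assumptions:

```python
import numpy as np

def ground_truth_density(shape, dots, sigma=4.0):
    """Sum of isotropic 2D Gaussians, one per annotated dot, on the pixel grid.
    No renormalization is applied, so dots near the border lose some probability
    mass outside the image, matching the behaviour discussed in the text."""
    H, W = shape
    ys, xs = np.mgrid[0:H, 0:W]
    F0 = np.zeros(shape)
    norm = 1.0 / (2.0 * np.pi * sigma ** 2)
    for py, px in dots:
        F0 += norm * np.exp(-((ys - py) ** 2 + (xs - px) ** 2) / (2.0 * sigma ** 2))
    return F0

F0 = ground_truth_density((64, 64), dots=[(10.0, 12.0), (30.0, 40.0), (62.0, 5.0)])
print(F0.sum())  # close to 3, slightly less because one dot lies near the border
```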
43 With this definition, the sum of the ground truth density $\sum_{p \in I_i} F^0_i(p)$ over the entire image will not match the dot count $C_i$ exactly, as dots that lie very close to the image boundary result in their Gaussian probability mass being partly outside the image. [sent-102, score-1.066]
44 This is a natural and desirable behaviour for most applications, as in many cases an object that lies partly outside the image boundary should not be counted as a full object, but rather as a fraction of an object. [sent-103, score-0.307]
45 After the optimal weight vector has been learned from the training data, the system can produce a density estimate for an unseen image I by a simple linear weighting of the feature vector computed at each pixel, as suggested by (2). [sent-107, score-0.607]
46 2 The MESA distance The distance D in (3) measures the mismatch between the ground truth and the estimated densities (the loss) and has a significant impact on the performance of the entire learning framework. [sent-110, score-0.596]
47 support vector regression and ridge regression for the L1 and L2 cases respectively), where each pixel in each training image effectively provides a sample in the training set. [sent-116, score-0.658]
48 • As the overall count is what we ultimately care about, one may choose D to be an absolute or squared difference between the overall sums of the two arguments over the entire image. [sent-123, score-0.464]
49 Once again, we get either the support vector regression (for the absolute differences) or ridge regression (for the squared differences), but now each training sample corresponds to the entire training image. [sent-128, score-0.436]
50 Thus, although this choice of loss matches our ultimate goal of learning to count very well, it requires many annotated images for training, as the spatial information in the annotation is discarded. [sent-129, score-0.528]
51 The MESA distance (in fact, a metric) can be regarded as an L∞ distance between combinatorially-long vectors of subarray sums. [sent-132, score-0.317]
52 Figure 3: Comparison of distances for matching density functions (panels: original, noise added, σ increased, dots jittered, dots removed, dots reshuffled). [sent-134, score-0.521]
53 Here, the top-left image shows one of the densities, computed as the ground truth density for a set of dots. [sent-135, score-0.609]
54 In the bottom row, we compare side-by-side the per-pixel L1 distance, the absolute difference of overall counts, and the MESA distance between the original and the perturbed densities (the distances are normalized across the 5 examples). [sent-137, score-0.348]
55 In the middle row we give per-pixel plots of the differences between the respective densities and show the boxes on which the maxima in the definition of the MESA distance are achieved. [sent-139, score-0.272]
56 Firstly, it is directly related to the counting objective we want to optimize. [sent-141, score-0.389]
57 Since the set of all subarrays includes the full image, DMESA (F1, F2) is an upper bound on the absolute difference of the overall count estimates given by the two densities F1 and F2. [sent-142, score-0.577]
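In symbols, the bound is immediate because the full image I is itself one of the admissible box subarrays; a one-line derivation, with $\mathcal{B}$ denoting the set of all box subarrays:

```latex
\left|\sum_{p \in I} F_1(p) - \sum_{p \in I} F_2(p)\right|
\;\le\; \max_{B \in \mathcal{B}} \left|\sum_{p \in B} F_1(p) - \sum_{p \in B} F_2(p)\right|
\;=\; D_{\mathrm{MESA}}(F_1, F_2), \qquad \text{since } I \in \mathcal{B}.
```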
58 Secondly, when the two density functions differ by a zero-mean high-frequency signal or an independent zero-mean noise, the DMESA distance between them is small, because positive and negative deviations of F1 from F2 tend to cancel each other over large regions. [sent-143, score-0.294]
59 If F1 and F2 are the ground truth densities corresponding to two point sets leaning towards two different corners of the image, then the DMESA distance between F1 and F2 is large, even if F1 and F2 sum to the same counts over the entire image. [sent-147, score-0.593]
60 Computing both inner maxima in (5) then constitutes a 2D maximum subarray problem, which is finding the box subarray of a given 2D array with the largest sum. [sent-151, score-0.341]
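A straightforward way to compute the MESA distance is the classical O(H²W) reduction of the 2D maximum subarray problem to repeated 1D Kadane scans; the sketch below uses that textbook algorithm, which is an assumption on my part, rather than whatever faster variant the authors may use:

```python
import numpy as np

def max_subarray_1d(a):
    """Kadane's algorithm; the empty subarray is allowed, so the result is >= 0."""
    best = cur = 0.0
    for v in a:
        cur = max(0.0, cur + v)
        best = max(best, cur)
    return best

def max_subarray_2d(A):
    """Largest box-subarray sum: fix a top row, accumulate column sums downward,
    and run 1D Kadane on each accumulated row of column sums."""
    H, W = A.shape
    best = 0.0
    for top in range(H):
        col = np.zeros(W)
        for bottom in range(top, H):
            col += A[bottom]
            best = max(best, max_subarray_1d(col))
    return best

def mesa_distance(F1, F2):
    """D_MESA(F1, F2): the largest absolute difference of the two densities'
    sums over any box subarray, i.e. max over boxes B of |sum_B F1 - sum_B F2|."""
    D = np.asarray(F1, dtype=float) - np.asarray(F2, dtype=float)
    return max(max_subarray_2d(D), max_subarray_2d(-D))
```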
61 The learning problem is the convex quadratic program: minimize over $w, \xi_1, \ldots, \xi_N$ the objective $w^T w + \lambda \sum_{i=1}^{N} \xi_i$ (6), subject to $\forall i, \forall B \in \mathcal{B}_i$: $\xi_i \ge \sum_{p \in B} \big( F^0_i(p) - w^T x^i_p \big)$ and $\xi_i \ge \sum_{p \in B} \big( w^T x^i_p - F^0_i(p) \big)$ (7). Here, $\xi_i$ are the auxiliary slack variables (one for each training image) and $\mathcal{B}_i$ is the set of all subarrays in image i. [sent-166, score-0.52]
62 Given the current estimates ${}^j w, {}^j\xi_1, \ldots, {}^j\xi_N$ after iteration $j$, one can find the box subarrays corresponding to the most violated constraints among (7). [sent-174, score-0.297]
63 To do that, for each image we find the subarrays that maximize the right-hand sides of (7), which are exactly the 2D maximum subarrays of $F^0_i(\cdot) - F_i(\cdot \,|\, {}^j w)$ and $F_i(\cdot \,|\, {}^j w) - F^0_i(\cdot)$ respectively. [sent-175, score-0.703]
64 The boxes ${}^j B^1_i$ and ${}^j B^2_i$ corresponding to these maximum subarrays are then found for each image $i$. [sent-176, score-0.478]
65 The iterations terminate when, for all images, the sums corresponding to the maximum subarrays are within a $(1 + \epsilon)$ factor of ${}^j \xi_i$, and hence no new constraints are activated. [sent-178, score-0.434]
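A compact sketch of this cutting-plane loop under stated assumptions: the restricted QP is handed to cvxpy (any convex QP solver would do; the authors do not specify one here), `best_box` is a box-tracking variant of the 2D maximum subarray scan sketched above, and each box constraint is linear in w because $\sum_{p \in B} w^T x_p = w^T \big(\sum_{p \in B} x_p\big)$:

```python
import numpy as np
import cvxpy as cp  # assumed available; any convex QP solver would work

def best_box(A):
    """Box subarray of A with the largest sum; returns (sum, (top, bottom, left, right))."""
    H, W = A.shape
    best, arg = 0.0, None
    for t in range(H):
        col = np.zeros(W)
        for b in range(t, H):
            col += A[b]
            cur, start = 0.0, 0
            for r in range(W):  # 1D Kadane scan with start-column tracking
                if cur <= 0.0:
                    cur, start = 0.0, r
                cur += col[r]
                if cur > best:
                    best, arg = cur, (t, b, start, r)
    return best, arg

def train_cutting_plane(features_list, F0_list, lam=100.0, eps=1e-3, max_iter=30):
    """features_list[i]: (H, W, K) per-pixel features; F0_list[i]: (H, W) ground truth.
    The working set of box constraints grows across iterations."""
    N, K = len(features_list), features_list[0].shape[-1]
    w, xi = cp.Variable(K), cp.Variable(N, nonneg=True)
    constraints, w_val, xi_val = [], np.zeros(K), np.zeros(N)
    for _ in range(max_iter):
        violated = False
        for i, (X, F0) in enumerate(zip(features_list, F0_list)):
            F = X @ w_val  # current density estimate for image i
            for sign in (+1.0, -1.0):
                v, box = best_box(sign * (F0 - F))
                if box is None or v <= (1.0 + eps) * xi_val[i]:
                    continue  # constraint not (sufficiently) violated
                t, b, l, r = box
                s = X[t:b + 1, l:r + 1].reshape(-1, K).sum(axis=0)
                c = F0[t:b + 1, l:r + 1].sum()
                constraints.append(xi[i] >= sign * (c - s @ w))
                violated = True
        if not violated:
            break  # all box sums within the (1 + eps) tolerance
        prob = cp.Problem(cp.Minimize(cp.sum_squares(w) + lam * cp.sum(xi)), constraints)
        prob.solve()
        w_val, xi_val = w.value, xi.value
    return w_val
```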
66 3 Experiments Our framework and several baselines were evaluated on counting tasks for two types of imagery shown in Figure 1. [sent-181, score-0.455]
67 For the experiments, we generated a dataset of images (available at [1]), with the overall number of cells varying between 74 and 317. [sent-187, score-0.259]
68 A few annotated datasets with real cell microscopy images also exist. [sent-188, score-0.37]
69 While it is tempting to use real rather than synthetic imagery, all the real image datasets, to the best of our knowledge, are small (only a few images have annotations), and, most importantly, there are always large discrepancies between the annotations of different human experts. [sent-189, score-0.387]
70 The latter effectively invalidates the use of such real datasets for quantitative comparison of different counting approaches. [sent-190, score-0.389]
71 Below we discuss the comparison of the counting accuracy achieved by our approach and baseline approaches. [sent-191, score-0.425]
72 The features used in all approaches were based on the dense SIFT descriptor [21], computed using the software of [32] at each pixel of each image, with a fixed SIFT frame radius (about the size of the cell) and fixed orientation. [sent-192, score-0.439]
73 Each algorithm was trained on N training images, while another N images were used for the validation of metaparameters. [sent-193, score-0.279]
74 Then each pixel is represented by a vector of length K, which is 1 in the dimension corresponding to the codebook entry assigned to the SIFT descriptor at that pixel and 0 in all other dimensions. [sent-197, score-0.316]
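A sketch of that encoding, assuming the codebook (e.g. from k-means on a subset of descriptors, a detail elided from this dump) is given as an array of centers; note that with the linear density model this reduces the density to a per-codeword weight lookup:

```python
import numpy as np

def one_hot_codewords(descriptors, codebook):
    """descriptors: (H, W, D) dense per-pixel descriptors; codebook: (K, D) centers.
    Returns (H, W, K) with a single 1 at the nearest codeword per pixel."""
    H, W, D = descriptors.shape
    d = descriptors.reshape(-1, D)
    # squared Euclidean distance from every pixel descriptor to every codeword
    d2 = (d ** 2).sum(1, keepdims=True) - 2.0 * d @ codebook.T + (codebook ** 2).sum(1)
    nearest = d2.argmin(axis=1)
    X = np.zeros((H * W, codebook.shape[0]))
    X[np.arange(H * W), nearest] = 1.0
    return X.reshape(H, W, -1)
```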
75 [Table 1 header, flattened by extraction: the compared methods are linear ridge regression, kernel ridge regression, detection, detection+correction, and density learning; the validation objectives are counting, detection, or MESA; the first entry (linear ridge regression, N = 1) reads 67.2, and the remaining numeric entries were lost.]
76 Table 1: Mean absolute errors for cell counting on the test set of 100 fluorescent microscopy images. [sent-284, score-0.648]
77 The last 6 columns correspond to the numbers of images in the training and validation sets. [sent-287, score-0.279]
78 Standard deviations in the table correspond to 5 different draws of training and validation image sets. [sent-289, score-0.324]
79 Table 2: Mean absolute errors for people counting in the surveillance video [10]. [sent-311, score-0.625]
80 Each of the training images was described by a global histogram of entry occurrences for the same codebook as above. [sent-316, score-0.26]
81 The slope and the intercept of the correction for each combination of τ , ρ, and regularization strength were estimated via robust regression on the union of the training and validation sets. [sent-327, score-0.247]
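For the detection+correction baseline, the linear count correction could be fit as below; Theil-Sen via scipy is used here as one concrete robust-regression choice, which is an assumption rather than the authors' stated estimator, and the count values are hypothetical:

```python
import numpy as np
from scipy.stats import theilslopes

# raw_counts: detector outputs on train+validation images; true_counts: dot counts.
raw_counts = np.array([12.0, 30.0, 55.0, 80.0, 120.0])   # hypothetical values
true_counts = np.array([15.0, 33.0, 60.0, 90.0, 130.0])
slope, intercept, _, _ = theilslopes(true_counts, raw_counts)
corrected = slope * raw_counts + intercept  # corrected count estimates
```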
82 The counting algorithm here is based on adaptive thresholding and morphological analysis. [sent-331, score-0.418]
83 The objective optimized during validation was counting accuracy. [sent-335, score-0.455]
84 For counting-by-detection, we also considered optimizing detection accuracy (computed via Hungarian matching against the ground truth), and, for our approach, we also considered minimizing the MESA distance to the ground truth density on the validation set. [sent-336, score-0.809]
85 The results for a different number N of training and validation images are given in Table 1, based on 5 random draws of training and validation sets. [sent-337, score-0.424]
86 The authors of [10] also provided the dotted ground truth for these frames, the position of the ground plane, and the region of interest, where the counts should be performed. [sent-342, score-0.582]
87 Thus, we first extracted the primary features at each pixel, including the absolute differences with the previous frame and with the background, the image intensity, and the absolute values of the x- and y-derivatives. [sent-354, score-0.553]
88 The training objective was the regression from the appearance of each pixel and its neighborhood to the ground truth density. [sent-356, score-0.555]
89 Given the pretrained forest, each pixel p gets assigned a vector xp of dimension equal to the total number of leaves in all trees, with ones corresponding to the leaves in each of the five trees the pixel falls into and zeros otherwise. [sent-358, score-0.399]
90 Finally, to account for the perspective distortion, we multiplied xp by the square of the depth of the ground plane at p (provided with the sequence). [sent-359, score-0.249]
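A sketch of that feature construction, using scikit-learn's RandomForestRegressor as a stand-in for the authors' forest (an assumption); for simplicity the indicator is indexed over all tree nodes, so internal-node slots simply stay zero, and the pixel features, targets, and depth map in the usage lines are hypothetical:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def leaf_indicator_features(forest, pixel_features, ground_plane_depth):
    """pixel_features: (P, D) raw per-pixel features; ground_plane_depth: (P,).
    Returns (P, total_nodes) indicators with one 1 per tree, scaled by squared depth."""
    leaves = forest.apply(pixel_features)            # (P, n_trees) leaf node ids
    sizes = [est.tree_.node_count for est in forest.estimators_]
    offsets = np.concatenate(([0], np.cumsum(sizes)))
    X = np.zeros((pixel_features.shape[0], offsets[-1]))
    for t in range(leaves.shape[1]):
        X[np.arange(leaves.shape[0]), offsets[t] + leaves[:, t]] = 1.0
    return X * (ground_plane_depth[:, None] ** 2)    # perspective correction

# Hypothetical usage: regress pixel appearance to ground-truth density, then encode.
forest = RandomForestRegressor(n_estimators=5, max_depth=10)
raw, target, depth = np.random.rand(1000, 8), np.random.rand(1000), np.random.rand(1000)
forest.fit(raw, target)
X = leaf_indicator_features(forest, raw, depth)
```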
91 Within each scenario, we allocated one-fifth of the training frames to pick λ and the tree depth through validation via the MESA distance. [sent-360, score-0.246]
92 In both sets of experiments, we tried two strategies for setting σ (kernel width in the definition of the ground truth densities): setting σ = 0 (effectively, the ground truth is then a sum of delta-functions), and setting σ = 4 (roughly comparable with object half-size in both experiments). [sent-363, score-0.67]
93 In the case of pedestrians, one can store the value $w_t$ computed during learning at each leaf $t$ in each tree, so that counting simply requires “pushing” each pixel down the forest and summing the resulting $w_t$ values from the obtained leaves. [sent-374, score-0.689]
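A sketch of that inference shortcut, reusing the node offsets from the previous sketch; folding the perspective factor in at the end is a simplification for brevity:

```python
import numpy as np

def split_weights_by_tree(w, offsets):
    """Regroup the learned weight vector so every tree node stores its own weight."""
    return [w[offsets[t]:offsets[t + 1]] for t in range(len(offsets) - 1)]

def fast_count(forest, pixel_features, per_tree_w, depth):
    """Push each pixel down the forest and sum the stored leaf weights; with the
    indicator features above this equals sum_p w^T x_p, i.e. the density integral."""
    leaves = forest.apply(pixel_features)  # (P, n_trees)
    per_pixel = sum(per_tree_w[t][leaves[:, t]] for t in range(leaves.shape[1]))
    return float((per_pixel * depth ** 2).sum())
```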
94 4 Conclusion We have presented a general framework for learning to count objects in images. [sent-376, score-0.243]
95 While our ultimate goal is counting accuracy over the entire image, during learning our approach optimizes a loss based on the MESA distance. [sent-377, score-0.491]
96 This loss involves counting accuracy over multiple subarrays of the entire image (and not only the entire image itself). [sent-378, score-1.126]
97 We demonstrate that, given a limited amount of training data, such an approach achieves much higher accuracy than directly optimizing the counting accuracy over the entire image (counting-by-regression). [sent-379, score-0.691]
98 On the detection of multiple object instances using Hough transforms. [sent-421, score-0.269]
99 Computational framework for simulating fluorescence microscope images with cell populations. [sent-524, score-0.288]
100 Software for quantification of labeled bacteria from digital microscope images by automated image analysis. [sent-598, score-0.357]
wordName wordTfidf (topN-words)
[('counting', 0.389), ('mesa', 0.35), ('subarrays', 0.262), ('dmesa', 0.219), ('image', 0.179), ('density', 0.159), ('pixel', 0.158), ('subarray', 0.153), ('ground', 0.139), ('images', 0.134), ('objects', 0.134), ('truth', 0.132), ('annotation', 0.131), ('crowd', 0.131), ('object', 0.128), ('densities', 0.117), ('dots', 0.111), ('microscopy', 0.109), ('frames', 0.101), ('dotted', 0.093), ('detection', 0.092), ('surveillance', 0.088), ('xp', 0.083), ('distance', 0.082), ('cell', 0.079), ('training', 0.079), ('counts', 0.079), ('count', 0.078), ('cells', 0.076), ('annotations', 0.074), ('wt', 0.071), ('absolute', 0.071), ('crowded', 0.07), ('ridge', 0.069), ('sift', 0.069), ('validation', 0.066), ('downscale', 0.066), ('upscale', 0.066), ('forest', 0.063), ('suppression', 0.06), ('bacterial', 0.058), ('microscopic', 0.058), ('correction', 0.055), ('pixels', 0.053), ('peaks', 0.052), ('segmentation', 0.05), ('fi', 0.05), ('integrals', 0.049), ('instances', 0.049), ('overall', 0.049), ('annotated', 0.048), ('lempitsky', 0.047), ('codebook', 0.047), ('pedestrians', 0.047), ('regression', 0.047), ('dot', 0.045), ('entire', 0.044), ('dotting', 0.044), ('microscope', 0.044), ('pearls', 0.044), ('selinummi', 0.044), ('video', 0.043), ('cvpr', 0.042), ('perturbations', 0.042), ('visual', 0.041), ('delaunay', 0.038), ('uorescence', 0.038), ('sums', 0.038), ('frame', 0.038), ('boxes', 0.037), ('dense', 0.036), ('differences', 0.036), ('metric', 0.036), ('baseline', 0.036), ('splits', 0.036), ('scenarios', 0.036), ('imagery', 0.035), ('box', 0.035), ('people', 0.034), ('descriptors', 0.034), ('hybrid', 0.033), ('unseen', 0.032), ('regress', 0.031), ('framework', 0.031), ('program', 0.03), ('risk', 0.03), ('detector', 0.03), ('dence', 0.03), ('distances', 0.029), ('bi', 0.029), ('loss', 0.029), ('ultimate', 0.029), ('morphological', 0.029), ('conceptual', 0.029), ('software', 0.028), ('ii', 0.028), ('rectangles', 0.028), ('iccv', 0.027), ('plane', 0.027), ('quadratic', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 149 nips-2010-Learning To Count Objects in Images
Author: Victor Lempitsky, Andrew Zisserman
Abstract: We propose a new supervised learning framework for visual object counting tasks, such as estimating the number of cells in a microscopic image or the number of humans in surveillance video frames. We focus on the practically-attractive case when the training images are annotated with dots (one dot per object). Our goal is to accurately estimate the count. However, we evade the hard task of learning to detect and localize individual object instances. Instead, we cast the problem as that of estimating an image density whose integral over any image region gives the count of objects within that region. Learning to infer such density can be formulated as a minimization of a regularized risk quadratic cost function. We introduce a new loss function, which is well-suited for such learning, and at the same time can be computed efficiently via a maximum subarray algorithm. The learning can then be posed as a convex quadratic program solvable with cutting-plane optimization. The proposed framework is very flexible as it can accept any domain-specific visual features. Once trained, our system provides accurate object counts and requires a very small time overhead over the feature extraction step, making it a good candidate for applications involving real-time processing or dealing with huge amount of visual data. 1
2 0.19621958 240 nips-2010-Simultaneous Object Detection and Ranking with Weak Supervision
Author: Matthew Blaschko, Andrea Vedaldi, Andrew Zisserman
Abstract: A standard approach to learning object category detectors is to provide strong supervision in the form of a region of interest (ROI) specifying each instance of the object in the training images [17]. In this work are goal is to learn from heterogeneous labels, in which some images are only weakly supervised, specifying only the presence or absence of the object or a weak indication of object location, whilst others are fully annotated. To this end we develop a discriminative learning approach and make two contributions: (i) we propose a structured output formulation for weakly annotated images where full annotations are treated as latent variables; and (ii) we propose to optimize a ranking objective function, allowing our method to more effectively use negatively labeled images to improve detection average precision performance. The method is demonstrated on the benchmark INRIA pedestrian detection dataset of Dalal and Triggs [14] and the PASCAL VOC dataset [17], and it is shown that for a significant proportion of weakly supervised images the performance achieved is very similar to the fully supervised (state of the art) results. 1
3 0.18395422 6 nips-2010-A Discriminative Latent Model of Image Region and Object Tag Correspondence
Author: Yang Wang, Greg Mori
Abstract: We propose a discriminative latent model for annotating images with unaligned object-level textual annotations. Instead of using the bag-of-words image representation currently popular in the computer vision community, our model explicitly captures more intricate relationships underlying visual and textual information. In particular, we model the mapping that translates image regions to annotations. This mapping allows us to relate image regions to their corresponding annotation terms. We also model the overall scene label as latent information. This allows us to cluster test images. Our training data consist of images and their associated annotations. But we do not have access to the ground-truth regionto-annotation mapping or the overall scene label. We develop a novel variant of the latent SVM framework to model them as latent variables. Our experimental results demonstrate the effectiveness of the proposed model compared with other baseline methods.
4 0.17824736 241 nips-2010-Size Matters: Metric Visual Search Constraints from Monocular Metadata
Author: Mario Fritz, Kate Saenko, Trevor Darrell
Abstract: Metric constraints are known to be highly discriminative for many objects, but if training is limited to data captured from a particular 3-D sensor the quantity of training data may be severely limited. In this paper, we show how a crucial aspect of 3-D information–object and feature absolute size–can be added to models learned from commonly available online imagery, without use of any 3-D sensing or reconstruction at training time. Such models can be utilized at test time together with explicit 3-D sensing to perform robust search. Our model uses a “2.1D” local feature, which combines traditional appearance gradient statistics with an estimate of average absolute depth within the local window. We show how category size information can be obtained from online images by exploiting relatively ubiquitous metadata fields specifying camera intrinsics. We develop an efficient metric branch-and-bound algorithm for our search task, imposing 3-D size constraints as part of an optimal search for a set of features which indicate the presence of a category. Experiments on test scenes captured with a traditional stereo rig are shown, exploiting training data from purely monocular sources with associated EXIF metadata. 1
5 0.15937486 133 nips-2010-Kernel Descriptors for Visual Recognition
Author: Liefeng Bo, Xiaofeng Ren, Dieter Fox
Abstract: The design of low-level image features is critical for computer vision algorithms. Orientation histograms, such as those in SIFT [16] and HOG [3], are the most successful and popular features for visual object and scene recognition. We highlight the kernel view of orientation histograms, and show that they are equivalent to a certain type of match kernels over image patches. This novel view allows us to design a family of kernel descriptors which provide a unified and principled framework to turn pixel attributes (gradient, color, local binary pattern, etc.) into compact patch-level features. In particular, we introduce three types of match kernels to measure similarities between image patches, and construct compact low-dimensional kernel descriptors from these match kernels using kernel principal component analysis (KPCA) [23]. Kernel descriptors are easy to design and can turn any type of pixel attribute into patch-level features. They outperform carefully tuned and sophisticated features including SIFT and deep belief networks. We report superior performance on standard image classification benchmarks: Scene-15, Caltech-101, CIFAR10 and CIFAR10-ImageNet.
8 0.11489902 209 nips-2010-Pose-Sensitive Embedding by Nonlinear NCA Regression
9 0.10986864 272 nips-2010-Towards Holistic Scene Understanding: Feedback Enabled Cascaded Classification Models
10 0.10921305 103 nips-2010-Generating more realistic images using gated MRF's
11 0.099888101 79 nips-2010-Estimating Spatial Layout of Rooms using Volumetric Reasoning about Objects and Surfaces
12 0.093685038 153 nips-2010-Learning invariant features using the Transformed Indian Buffet Process
13 0.090093948 235 nips-2010-Self-Paced Learning for Latent Variable Models
14 0.089953288 77 nips-2010-Epitome driven 3-D Diffusion Tensor image segmentation: on extracting specific structures
15 0.086151406 137 nips-2010-Large Margin Learning of Upstream Scene Understanding Models
16 0.085586965 1 nips-2010-(RF)^2 -- Random Forest Random Field
17 0.083577633 281 nips-2010-Using body-anchored priors for identifying actions in single images
18 0.083104432 143 nips-2010-Learning Convolutional Feature Hierarchies for Visual Recognition
19 0.078759052 159 nips-2010-Lifted Inference Seen from the Other Side : The Tractable Features
20 0.077274948 267 nips-2010-The Multidimensional Wisdom of Crowds
topicId topicWeight
[(0, 0.242), (1, 0.115), (2, -0.163), (3, -0.211), (4, 0.03), (5, -0.041), (6, -0.048), (7, 0.025), (8, 0.06), (9, 0.015), (10, 0.005), (11, -0.0), (12, -0.082), (13, -0.002), (14, 0.049), (15, -0.015), (16, 0.048), (17, -0.078), (18, 0.125), (19, 0.034), (20, 0.003), (21, 0.003), (22, 0.032), (23, 0.005), (24, -0.019), (25, -0.067), (26, 0.046), (27, -0.026), (28, -0.03), (29, 0.013), (30, 0.005), (31, 0.015), (32, 0.021), (33, -0.079), (34, -0.082), (35, -0.034), (36, 0.041), (37, -0.07), (38, 0.084), (39, -0.074), (40, -0.043), (41, -0.019), (42, -0.027), (43, 0.058), (44, -0.061), (45, -0.032), (46, -0.008), (47, -0.016), (48, 0.082), (49, 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.96041244 149 nips-2010-Learning To Count Objects in Images
Author: Victor Lempitsky, Andrew Zisserman
Abstract: We propose a new supervised learning framework for visual object counting tasks, such as estimating the number of cells in a microscopic image or the number of humans in surveillance video frames. We focus on the practically-attractive case when the training images are annotated with dots (one dot per object). Our goal is to accurately estimate the count. However, we evade the hard task of learning to detect and localize individual object instances. Instead, we cast the problem as that of estimating an image density whose integral over any image region gives the count of objects within that region. Learning to infer such density can be formulated as a minimization of a regularized risk quadratic cost function. We introduce a new loss function, which is well-suited for such learning, and at the same time can be computed efficiently via a maximum subarray algorithm. The learning can then be posed as a convex quadratic program solvable with cutting-plane optimization. The proposed framework is very flexible as it can accept any domain-specific visual features. Once trained, our system provides accurate object counts and requires a very small time overhead over the feature extraction step, making it a good candidate for applications involving real-time processing or dealing with huge amount of visual data. 1
2 0.82512414 256 nips-2010-Structural epitome: a way to summarize one’s visual experience
Author: Nebojsa Jojic, Alessandro Perina, Vittorio Murino
Abstract: In order to study the properties of total visual input in humans, a single subject wore a camera for two weeks capturing, on average, an image every 20 seconds. The resulting new dataset contains a mix of indoor and outdoor scenes as well as numerous foreground objects. Our first goal is to create a visual summary of the subject's two weeks of life using unsupervised algorithms that would automatically discover recurrent scenes, familiar faces or common actions. Direct application of existing algorithms, such as panoramic stitching (e.g., Photosynth) or appearance-based clustering models (e.g., the epitome), is impractical due to either the large dataset size or the dramatic variations in the lighting conditions. As a remedy to these problems, we introduce a novel image representation, the “structural element (stel) epitome,” and an associated efficient learning algorithm. In our model, each image or image patch is characterized by a hidden mapping T which, as in previous epitome models, defines a mapping between the image coordinates and the coordinates in the large “all-I-have-seen” epitome matrix. The limited epitome real-estate forces the mappings of different images to overlap which indicates image similarity. However, the image similarity no longer depends on direct pixel-to-pixel intensity/color/feature comparisons as in previous epitome models, but on spatial configuration of scene or object parts, as the model is based on the palette-invariant stel models. As a result, stel epitomes capture structure that is invariant to non-structural changes, such as illumination changes, that tend to uniformly affect pixels belonging to a single scene or object part. 1
3 0.81854129 241 nips-2010-Size Matters: Metric Visual Search Constraints from Monocular Metadata
Author: Mario Fritz, Kate Saenko, Trevor Darrell
Abstract: Metric constraints are known to be highly discriminative for many objects, but if training is limited to data captured from a particular 3-D sensor the quantity of training data may be severely limited. In this paper, we show how a crucial aspect of 3-D information–object and feature absolute size–can be added to models learned from commonly available online imagery, without use of any 3-D sensing or reconstruction at training time. Such models can be utilized at test time together with explicit 3-D sensing to perform robust search. Our model uses a “2.1D” local feature, which combines traditional appearance gradient statistics with an estimate of average absolute depth within the local window. We show how category size information can be obtained from online images by exploiting relatively ubiquitous metadata fields specifying camera intrinsics. We develop an efficient metric branch-and-bound algorithm for our search task, imposing 3-D size constraints as part of an optimal search for a set of features which indicate the presence of a category. Experiments on test scenes captured with a traditional stereo rig are shown, exploiting training data from purely monocular sources with associated EXIF metadata. 1
4 0.81025702 245 nips-2010-Space-Variant Single-Image Blind Deconvolution for Removing Camera Shake
Author: Stefan Harmeling, Michael Hirsch, Bernhard Schölkopf
Abstract: Modelling camera shake as a space-invariant convolution simplifies the problem of removing camera shake, but often insufficiently models actual motion blur such as those due to camera rotation and movements outside the sensor plane or when objects in the scene have different distances to the camera. In an effort to address these limitations, (i) we introduce a taxonomy of camera shakes, (ii) we build on a recently introduced framework for space-variant filtering by Hirsch et al. and a fast algorithm for single image blind deconvolution for space-invariant filters by Cho and Lee to construct a method for blind deconvolution in the case of space-variant blur, and (iii), we present an experimental setup for evaluation that allows us to take images with real camera shake while at the same time recording the space-variant point spread function corresponding to that blur. Finally, we demonstrate that our method is able to deblur images degraded by spatially-varying blur originating from real camera shake, even without using additional motion sensor information. 1
5 0.76139432 6 nips-2010-A Discriminative Latent Model of Image Region and Object Tag Correspondence
Author: Yang Wang, Greg Mori
Abstract: We propose a discriminative latent model for annotating images with unaligned object-level textual annotations. Instead of using the bag-of-words image representation currently popular in the computer vision community, our model explicitly captures more intricate relationships underlying visual and textual information. In particular, we model the mapping that translates image regions to annotations. This mapping allows us to relate image regions to their corresponding annotation terms. We also model the overall scene label as latent information. This allows us to cluster test images. Our training data consist of images and their associated annotations. But we do not have access to the ground-truth region-to-annotation mapping or the overall scene label. We develop a novel variant of the latent SVM framework to model them as latent variables. Our experimental results demonstrate the effectiveness of the proposed model compared with other baseline methods.
6 0.75863278 79 nips-2010-Estimating Spatial Layout of Rooms using Volumetric Reasoning about Objects and Surfaces
7 0.74718404 240 nips-2010-Simultaneous Object Detection and Ranking with Weak Supervision
8 0.74069226 234 nips-2010-Segmentation as Maximum-Weight Independent Set
10 0.70023149 77 nips-2010-Epitome driven 3-D Diffusion Tensor image segmentation: on extracting specific structures
11 0.66230828 267 nips-2010-The Multidimensional Wisdom of Crowds
12 0.63555723 153 nips-2010-Learning invariant features using the Transformed Indian Buffet Process
13 0.63555706 17 nips-2010-A biologically plausible network for the computation of orientation dominance
14 0.62750512 95 nips-2010-Feature Transitions with Saccadic Search: Size, Color, and Orientation Are Not Alike
15 0.62407237 86 nips-2010-Exploiting weakly-labeled Web images to improve object classification: a domain adaptation approach
16 0.62157136 1 nips-2010-(RF)^2 -- Random Forest Random Field
17 0.61927599 137 nips-2010-Large Margin Learning of Upstream Scene Understanding Models
18 0.60254663 103 nips-2010-Generating more realistic images using gated MRF's
19 0.58644509 209 nips-2010-Pose-Sensitive Embedding by Nonlinear NCA Regression
20 0.57436132 82 nips-2010-Evaluation of Rarity of Fingerprints in Forensics
topicId topicWeight
[(13, 0.031), (17, 0.012), (27, 0.083), (30, 0.054), (35, 0.32), (45, 0.26), (50, 0.045), (52, 0.025), (60, 0.026), (77, 0.038), (78, 0.015), (90, 0.031)]
simIndex simValue paperId paperTitle
1 0.95777631 170 nips-2010-Moreau-Yosida Regularization for Grouped Tree Structure Learning
Author: Jun Liu, Jieping Ye
Abstract: We consider the tree structured group Lasso where the structure over the features can be represented as a tree with leaf nodes as features and internal nodes as clusters of the features. The structured regularization with a pre-defined tree structure is based on a group-Lasso penalty, where one group is defined for each node in the tree. Such a regularization can help uncover the structured sparsity, which is desirable for applications with some meaningful tree structures on the features. However, the tree structured group Lasso is challenging to solve due to the complex regularization. In this paper, we develop an efficient algorithm for the tree structured group Lasso. One of the key steps in the proposed algorithm is to solve the Moreau-Yosida regularization associated with the grouped tree structure. The main technical contributions of this paper include (1) we show that the associated Moreau-Yosida regularization admits an analytical solution, and (2) we develop an efficient algorithm for determining the effective interval for the regularization parameter. Our experimental results on the AR and JAFFE face data sets demonstrate the efficiency and effectiveness of the proposed algorithm.
2 0.91082734 59 nips-2010-Deep Coding Network
Author: Yuanqing Lin, Zhang Tong, Shenghuo Zhu, Kai Yu
Abstract: This paper proposes a principled extension of the traditional single-layer flat sparse coding scheme, where a two-layer coding scheme is derived based on theoretical analysis of nonlinear functional approximation that extends recent results for local coordinate coding. The two-layer approach can be easily generalized to deeper structures in a hierarchical multiple-layer manner. Empirically, it is shown that the deep coding approach yields improved performance in benchmark datasets.
3 0.906829 97 nips-2010-Functional Geometry Alignment and Localization of Brain Areas
Author: Georg Langs, Yanmei Tie, Laura Rigolo, Alexandra Golby, Polina Golland
Abstract: Matching functional brain regions across individuals is a challenging task, largely due to the variability in their location and extent. It is particularly difficult, but highly relevant, for patients with pathologies such as brain tumors, which can cause substantial reorganization of functional systems. In such cases spatial registration based on anatomical data is only of limited value if the goal is to establish correspondences of functional areas among different individuals, or to localize potentially displaced active regions. Rather than rely on spatial alignment, we propose to perform registration in an alternative space whose geometry is governed by the functional interaction patterns in the brain. We first embed each brain into a functional map that reflects connectivity patterns during a fMRI experiment. The resulting functional maps are then registered, and the obtained correspondences are propagated back to the two brains. In application to a language fMRI experiment, our preliminary results suggest that the proposed method yields improved functional correspondences across subjects. This advantage is pronounced for subjects with tumors that affect the language areas and thus cause spatial reorganization of the functional regions. 1
4 0.9017005 73 nips-2010-Efficient and Robust Feature Selection via Joint ℓ2,1-Norms Minimization
Author: Feiping Nie, Heng Huang, Xiao Cai, Chris H. Ding
Abstract: Feature selection is an important component of many machine learning applications. Especially in many bioinformatics tasks, efficient and robust feature selection methods are desired to extract meaningful features and eliminate noisy ones. In this paper, we propose a new robust feature selection method emphasizing joint ℓ2,1-norm minimization on both loss function and regularization. The ℓ2,1-norm based loss function is robust to outliers in data points and the ℓ2,1-norm regularization selects features across all data points with joint sparsity. An efficient algorithm is introduced with proved convergence. Our regression based objective makes the feature selection process more efficient. Our method has been applied to both genomic and proteomic biomarker discovery. Extensive empirical studies are performed on six data sets to demonstrate the performance of our feature selection method. 1
same-paper 5 0.84195751 149 nips-2010-Learning To Count Objects in Images
Author: Victor Lempitsky, Andrew Zisserman
Abstract: We propose a new supervised learning framework for visual object counting tasks, such as estimating the number of cells in a microscopic image or the number of humans in surveillance video frames. We focus on the practically-attractive case when the training images are annotated with dots (one dot per object). Our goal is to accurately estimate the count. However, we evade the hard task of learning to detect and localize individual object instances. Instead, we cast the problem as that of estimating an image density whose integral over any image region gives the count of objects within that region. Learning to infer such density can be formulated as a minimization of a regularized risk quadratic cost function. We introduce a new loss function, which is well-suited for such learning, and at the same time can be computed efficiently via a maximum subarray algorithm. The learning can then be posed as a convex quadratic program solvable with cutting-plane optimization. The proposed framework is very flexible as it can accept any domain-specific visual features. Once trained, our system provides accurate object counts and requires a very small time overhead over the feature extraction step, making it a good candidate for applications involving real-time processing or dealing with huge amount of visual data. 1
6 0.79485577 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework
7 0.78511906 109 nips-2010-Group Sparse Coding with a Laplacian Scale Mixture Prior
8 0.7693578 26 nips-2010-Adaptive Multi-Task Lasso: with Application to eQTL Detection
9 0.76034445 76 nips-2010-Energy Disaggregation via Discriminative Sparse Coding
10 0.74767178 249 nips-2010-Spatial and anatomical regularization of SVM for brain image analysis
11 0.74611497 13 nips-2010-A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction
12 0.74469519 181 nips-2010-Network Flow Algorithms for Structured Sparsity
13 0.74305558 200 nips-2010-Over-complete representations on recurrent neural networks can support persistent percepts
15 0.74124789 12 nips-2010-A Primal-Dual Algorithm for Group Sparse Regularization with Overlapping Groups
16 0.74048901 246 nips-2010-Sparse Coding for Learning Interpretable Spatio-Temporal Primitives
17 0.73815763 55 nips-2010-Cross Species Expression Analysis using a Dirichlet Process Mixture Model with Latent Matchings
18 0.73721069 7 nips-2010-A Family of Penalty Functions for Structured Sparsity
19 0.72982109 217 nips-2010-Probabilistic Multi-Task Feature Selection
20 0.72812891 258 nips-2010-Structured sparsity-inducing norms through submodular functions