84 nips-2013-Deep Neural Networks for Object Detection

Source: pdf

Author: Christian Szegedy, Alexander Toshev, Dumitru Erhan

Abstract: Deep Neural Networks (DNNs) have recently shown outstanding performance on image classification tasks [14]. In this paper we go one step further and address the problem of object detection using DNNs, that is, not only classifying but also precisely localizing objects of various classes. We present a simple and yet powerful formulation of object detection as a regression problem to object bounding box masks. We define a multi-scale inference procedure which is able to produce high-resolution object detections at a low cost by a few network applications. State-of-the-art performance of the approach is shown on Pascal VOC.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 In this paper we go one step further and address the problem of object detection using DNNs, that is not only classifying but also precisely localizing objects of various classes. [sent-4, score-0.56]

2 We present a simple and yet powerful formulation of object detection as a regression problem to object bounding box masks. [sent-5, score-1.209]

3 We deﬁne a multi-scale inference procedure which is able to produce high-resolution object detections at a low cost by a few network applications. [sent-6, score-0.525]

4 1 Introduction As we move towards more complete image understanding, having more precise and detailed object recognition becomes crucial. [sent-8, score-0.418]

5 In this context, one cares not only about classifying images, but also about precisely estimating the class and location of objects contained within the images, a problem known as object detection. [sent-9, score-0.426]

6 The main advances in object detection were achieved thanks to improvements in object representations and machine learning models. [sent-10, score-0.745]

7 Using discriminative learning of graphical models allows for building high-precision part-based models for a variety of object classes. [sent-13, score-0.295]

8 Manually engineered representations in conjunction with shallow discriminatively trained models have been among the best performing paradigms for the related problem of object classiﬁcation as well [17]. [sent-14, score-0.452]

9 This expressivity and robust training algorithms allow for learning powerful object representations without the need to hand design features. [sent-18, score-0.404]

10 In this paper, we exploit the power of DNNs for the problem of object detection, where we not only classify but also try to precisely localize objects. [sent-20, score-0.347]

11 The problem we are addressing here is challenging, since we want to detect a potentially large number of object instances with varying sizes in the same image using a limited amount of computing resources. [sent-21, score-0.418]

12 We present a formulation which is capable of predicting the bounding boxes of multiple objects in a given image. [sent-22, score-0.576]

13 More precisely, we formulate a DNN-based regression which outputs a binary mask of the object bounding box (and portions of the box as well), as shown in Fig. [sent-23, score-1.396]

14 Additionally, we employ a simple bounding box inference to extract detections from the masks. [sent-25, score-0.592]

15 To increase localization precision, we apply the DNN mask generation in a multi-scale fashion on the full image as well as on a small number of large image crops, followed by a reﬁnement step (see Fig. [sent-26, score-0.771]

16 In this way, through only a few dozen DNN regressions, we can achieve state-of-the-art bounding box localization. [sent-28, score-0.462]

17 The somewhat surprising but powerful insight is that networks which to some extent encode translation invariance, can capture object locations as well. [sent-31, score-0.354]

18 This simplicity has the advantage of easy applicability to a wide range of classes, and it also shows better detection performance across a wider range of objects – rigid ones as well as deformable ones. [sent-36, score-0.341]

19 2 Related Work One of the most heavily studied paradigms for object detection is the deformable part-based model, with [9] being the most prominent example. [sent-39, score-0.548]

20 Deep architectures for object detection and parsing have been motivated by part-based models and traditionally are called compositional models, where the object is expressed as layered composition of image primitives. [sent-43, score-0.929]

21 A notable example is the And/Or graph [20], where an object is modeled by a tree with And-nodes representing different parts and Or-nodes representing different modes of the same part. [sent-44, score-0.327]

22 Similarly to DNNs, the And/Or graph consists of multiple layers, where lower layers represent small generic image primitives, while higher layers represent object parts. [sent-45, score-0.542]

23 Our approach, however, uses the full image as an input and performs localization through regression. [sent-58, score-0.276]

24 Figure 1: A schematic view of object detection as DNN-based regression (panel labels: DBN, mask regression layer, full object mask, left object mask, top object mask). [sent-70, score-2.959]

25 Figure 2: After regressing to object masks across several scales and large image boxes, we perform object box extraction (panel labels: DNN scale 1, DNN scale 2, small set of boxes covering image, merged object masks, object box extraction, refine). [sent-71, score-3.422]

26 The obtained boxes are reﬁned by repeating the same procedure on the sub images, cropped via the current object boxes. [sent-72, score-0.578]

27 For brevity, we display only the full object mask, however, we use all ﬁve object masks. [sent-73, score-0.627]

28 3 DNN-based Detection The core of our approach is a DNN-based regression towards an object mask, as shown in Fig. [sent-74, score-0.328]

29 Based on this regression model, we can generate masks for the full object as well as portions of the object. [sent-76, score-0.664]

30 A single DNN regression can give us masks of multiple objects in an image. [sent-77, score-0.44]

31 Instead of using a softmax classifier as a last layer, we use a regression layer which generates an object binary mask $DNN(x; \Theta) \in \mathbb{R}^N$, where $\Theta$ are the parameters of the network and $N$ is the total number of pixels. [sent-87, score-0.826]

32 Since the output of the network has a ﬁxed dimension, we predict a mask of a ﬁxed size N = d × d. [sent-88, score-0.481]

33 After being resized to the image size, the resulting binary mask represents one or several objects: it should have value 1 at a particular pixel if this pixel lies within the bounding box of an object of a given class and 0 otherwise. [sent-89, score-1.22]

34 The intuition is that most of the objects are small relative to the image size and the network can be easily trapped by the trivial solution of assigning a zero value to every output. [sent-92, score-0.299]

35 To avoid this undesirable behavior, it is helpful to increase the weight of the outputs corresponding to non-zero values in the ground truth mask by a parameter λ ∈ R+ . [sent-93, score-0.448]
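
The reweighting described in the sentence above can be sketched as follows. This is a minimal illustration, assuming a per-pixel squared error weighted by (m + λ), so that foreground pixels get weight 1 + λ and background pixels get weight λ. The function name `weighted_mask_loss` and this exact weighting are assumptions made for the example; the excerpt does not give the loss in closed form.

```python
def weighted_mask_loss(pred, target, lam):
    """Squared error with per-pixel weight (target + lam).

    pred, target: flat lists of floats of equal length (targets in [0, 1]).
    lam: positive float; a small value down-weights errors on background pixels.
    The weighting scheme is an assumption made for illustration.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((t + lam) * (p - t) ** 2 for p, t in zip(pred, target))


# One foreground pixel among three background pixels.
target = [0.0, 0.0, 0.0, 1.0]
loss_all_zero = weighted_mask_loss([0.0] * 4, target, lam=0.1)  # misses the object
loss_all_one = weighted_mask_loss([1.0] * 4, target, lam=0.1)   # three background errors
```

With a small λ, the trivial all-zero prediction costs more than three background mistakes, which is the behavior the sentence above asks for.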

36 In our implementation, we used networks with a receptive ﬁeld of 225 × 225 and outputs predicting a mask of size d × d for d = 24. [sent-95, score-0.46]

37 First, a single object mask might not be sufﬁcient to disambiguate objects which are placed next to each other. [sent-97, score-0.807]

38 Second, due to the limits in the output size, we generate masks that are much smaller than the size of the original image. [sent-98, score-0.34]

39 1 Multiple Masks for Robust Localization To deal with multiple touching objects, we generate not one but several masks, each representing either the full object or part of it. [sent-103, score-0.332]

40 Since our end goal is to produce a bounding box, we use one network to predict the object box mask and four additional networks to predict four halves of the box: bottom, top, left and right halves, all denoted by $m^h$, $h \in \{\text{full}, \text{bottom}, \text{top}, \text{left}, \text{right}\}$. [sent-104, score-1.421]

41 Further, if two objects of the same type are placed next to each other, then at least two of the produced five masks would not have the objects merged, which would allow us to disambiguate them. [sent-106, score-0.578]

42 At training time, we need to convert the object box to these ﬁve masks. [sent-108, score-0.576]

43 Since the masks can be much smaller than the original image, we need to downsize the ground truth mask to the size of the network output. [sent-109, score-0.79]

44 Denote by T (i, j) the rectangle in the image for which the presence of an object is predicted by output (i, j) of the network. [sent-110, score-0.483]

45 This rectangle has its upper left corner at $\left(\frac{d_1}{d}(i-1), \frac{d_2}{d}(j-1)\right)$ and has size $\frac{d_1}{d} \times \frac{d_2}{d}$, where $d$ is the size of the output mask and $d_1$, $d_2$ are the height and width of the image. [sent-111, score-0.469]

46 During training, we assign as the value $m^h(i, j)$ to be predicted the portion of $T(i, j)$ covered by box $bb(h)$: $m^h(i, j; bb) = \frac{\mathrm{area}(bb(h) \cap T(i, j))}{\mathrm{area}(T(i, j))}$ (1), where $bb(\text{full})$ corresponds to the ground truth object box. [sent-112, score-0.663]

47 Note that we use the full object box as well as the top, bottom, left and right halves of the box to define a total of five different coverage types. [sent-114, score-1.033]

48 The resulting $m^h(bb)$ for groundtruth box bb are used at training time for the network of type h. [sent-115, score-0.672]
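
The construction of the training targets in Eq. (1) can be sketched as below. This is an illustrative sketch, not the authors' code: the box layout `(top, left, bottom, right)` in image coordinates, the function names, and the 0-based cell indices are all assumptions of the example.

```python
def overlap_1d(a0, a1, b0, b1):
    """Length of the overlap of intervals [a0, a1] and [b0, b1]."""
    return max(0.0, min(a1, b1) - max(a0, b0))


def half_box(box, h):
    """Return bb(h): the full box or one of its four halves."""
    top, left, bottom, right = box
    mid_r = (top + bottom) / 2.0
    mid_c = (left + right) / 2.0
    if h == "full":
        return box
    if h == "top":
        return (top, left, mid_r, right)
    if h == "bottom":
        return (mid_r, left, bottom, right)
    if h == "left":
        return (top, left, bottom, mid_c)
    if h == "right":
        return (top, mid_c, bottom, right)
    raise ValueError("unknown mask type: %r" % (h,))


def target_mask(box, h, d1, d2, d):
    """d x d target: fraction of each cell T(i, j) covered by bb(h), as in Eq. (1).

    d1, d2 are the image height and width; i, j are 0-based here,
    whereas the text indexes cells from 1.
    """
    top, left, bottom, right = half_box(box, h)
    cell_h, cell_w = d1 / d, d2 / d
    mask = []
    for i in range(d):
        row = []
        for j in range(d):
            r0, c0 = i * cell_h, j * cell_w
            inter = (overlap_1d(r0, r0 + cell_h, top, bottom)
                     * overlap_1d(c0, c0 + cell_w, left, right))
            row.append(inter / (cell_h * cell_w))
        mask.append(row)
    return mask


# A box covering the upper half of a 100 x 100 image, on a 4 x 4 output grid.
m_full = target_mask((0, 0, 50, 100), "full", d1=100, d2=100, d=4)
m_left = target_mask((0, 0, 50, 100), "left", d1=100, d2=100, d=4)
m_thin = target_mask((0, 0, 12.5, 100), "full", d1=100, d2=100, d=4)
```

Cells that are only partly covered receive fractional values, as `m_thin` shows, which is why the targets are real-valued rather than strictly binary.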

49 At this point, it should be noted that one could train one network for all masks where the output layer would generate all ﬁve of them. [sent-116, score-0.492]

50 2 Object Localization from DNN Output In order to complete the detection process, we need to estimate a set of bounding boxes for each image. [sent-121, score-0.574]

51 Although the output resolution is smaller than the input image, we rescale the binary masks to the same resolution as the input image. [sent-122, score-0.448]

52 The goal is to estimate bounding boxes bb = (i, j, k, l) parametrized by their upper-left corner (i, j) and lower-right corner (k, l) in output mask coordinates. [sent-123, score-1.069]

53 To do this, we use a score S expressing the agreement of each bounding box bb with the masks and infer the boxes with the highest scores. [sent-124, score-1.191]

54 A natural agreement would be to measure what portion of the bounding box is covered by the mask: $S(bb, m) = \frac{1}{\mathrm{area}(bb)} \sum_{(i,j)} m(i, j)\, \mathrm{area}(bb \cap T(i, j))$ (2), where we sum over all network outputs indexed by $(i, j)$ and denote by $m = DNN(x)$ the output of the network. [sent-125, score-0.6]
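
A direct, unoptimized evaluation of the score in Eq. (2) might look like the sketch below. It assumes the box is given as `(top, left, bottom, right)` in image coordinates and that the mask is a square d x d grid; the function name `box_score` is hypothetical.

```python
def box_score(box, mask, d1, d2):
    """S(bb, m): mask-weighted fraction of the box area that is covered, as in Eq. (2).

    box = (top, left, bottom, right) in image coordinates;
    mask is a d x d grid; d1, d2 are the image height and width.
    """
    top, left, bottom, right = box
    box_area = (bottom - top) * (right - left)
    if box_area <= 0:
        raise ValueError("box must have positive area")
    d = len(mask)
    cell_h, cell_w = d1 / d, d2 / d
    total = 0.0
    for i in range(d):
        for j in range(d):
            r0, c0 = i * cell_h, j * cell_w
            # area(bb ∩ T(i, j)) factorizes into a row overlap and a column overlap
            ov_r = max(0.0, min(r0 + cell_h, bottom) - max(r0, top))
            ov_c = max(0.0, min(c0 + cell_w, right) - max(c0, left))
            total += mask[i][j] * ov_r * ov_c
    return total / box_area


# Mask that fires on the upper half of a 100 x 100 image.
mask = [[1.0] * 4, [1.0] * 4, [0.0] * 4, [0.0] * 4]
score_tight = box_score((0, 0, 50, 100), mask, 100, 100)
score_loose = box_score((0, 0, 100, 100), mask, 100, 100)
```

A box that fits the mask tightly scores 1, while a box twice as tall scores 0.5, since only half of its area is covered.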

55 If we expand the above score over all five mask types, then the final score reads: $S(bb) = \sum_{h \in \text{halves}} \left( S(bb(h), m^h) - S(bb(\bar{h}), m^h) \right)$ (3), where halves = {full, bottom, top, left, right} index the full box and its four halves. [sent-126, score-1.119]

56 a top mask should be well covered by a top mask and not at all by the bottom one. [sent-129, score-0.866]

57 For $h = \text{full}$, we denote by $\bar{h}$ a rectangular region around bb whose score will penalize if the full masks extend outside bb. [sent-130, score-0.547]

58 In the above summation, the score for a box would be large if it is consistent with all ﬁve masks. [sent-131, score-0.3]
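
The five-mask score of Eq. (3) can be illustrated with the sketch below. It makes several simplifying assumptions that are not in the text: boxes are aligned to mask cells and given as `(r0, c0, r1, c1)` with exclusive ends, and the penalty region for the full mask is taken to be a band of `margin` cells around the box. All names are hypothetical.

```python
OPPOSITE = {"top": "bottom", "bottom": "top", "left": "right", "right": "left"}


def part(box, h):
    """Full box or one of its halves; box = (r0, c0, r1, c1) in mask cells."""
    r0, c0, r1, c1 = box
    rm, cm = (r0 + r1) // 2, (c0 + c1) // 2
    return {"full": box,
            "top": (r0, c0, rm, c1), "bottom": (rm, c0, r1, c1),
            "left": (r0, c0, r1, cm), "right": (r0, cm, r1, c1)}[h]


def clip(box, mask):
    """Clip a box to the mask extent."""
    r0, c0, r1, c1 = box
    return (max(0, r0), max(0, c0), min(len(mask), r1), min(len(mask[0]), c1))


def area(box):
    return max(0, box[2] - box[0]) * max(0, box[3] - box[1])


def cell_sum(mask, box):
    r0, c0, r1, c1 = clip(box, mask)
    return sum(mask[r][c] for r in range(r0, r1) for c in range(c0, c1))


def coverage(mask, box):
    """Eq. (2) for a cell-aligned box: mean mask value inside the box."""
    a = area(box)
    return cell_sum(mask, box) / a if a > 0 else 0.0


def ring_coverage(mask, box, margin):
    """Mean mask value in a band around the box (assumed penalty region for 'full')."""
    r0, c0, r1, c1 = box
    outer = clip((r0 - margin, c0 - margin, r1 + margin, c1 + margin), mask)
    ring_area = area(outer) - area(clip(box, mask))
    if ring_area <= 0:
        return 0.0
    return (cell_sum(mask, outer) - cell_sum(mask, box)) / ring_area


def final_score(box, masks, margin=1):
    """Eq. (3): reward agreement with each mask, penalize the opposite region."""
    total = coverage(masks["full"], box) - ring_coverage(masks["full"], box, margin)
    for h, opp in OPPOSITE.items():
        total += coverage(masks[h], part(box, h)) - coverage(masks[h], part(box, opp))
    return total


def make_mask(box, size=8):
    """Ideal binary mask for a cell-aligned box on a size x size grid."""
    r0, c0, r1, c1 = box
    return [[1.0 if r0 <= r < r1 and c0 <= c < c1 else 0.0 for c in range(size)]
            for r in range(size)]


true_box = (2, 2, 6, 6)
masks = {h: make_mask(part(true_box, h))
         for h in ("full", "top", "bottom", "left", "right")}
score_true = final_score(true_box, masks)
score_wide = final_score((2, 2, 6, 8), masks)
```

With ideal masks, the true box collects one point per mask type, and a box that is too wide scores lower because its halves no longer line up with the half masks.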

59 We consider bounding boxes with mean dimension equal to [0. [sent-134, score-0.44]

60 9] of the mean image dimension and 10 different aspect ratios estimated by k-means clustering of the boxes of the objects in the training data. [sent-139, score-0.522]

61 (3) can be efﬁciently computed using 4 operations after the integral image of the mask m has been computed. [sent-142, score-0.495]

62 The exact number of operations is 5(2 × #pixels + 20 × #boxes), where the ﬁrst term measures the complexity of the integral mask computation while the second accounts for box score computation. [sent-143, score-0.672]
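
The integral-image trick mentioned above can be sketched as follows: once a summed-area table of the mask is built, the mask sum inside any cell-aligned box takes four table lookups. The function names are hypothetical, and the restriction to boxes whose edges fall on mask cell boundaries is an assumption of this sketch.

```python
def integral_image(mask):
    """Summed-area table with a zero border: ii[r][c] is the sum of mask[:r][:c]."""
    rows, cols = len(mask), len(mask[0])
    ii = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        running = 0.0  # sum of the current row up to column c
        for c in range(cols):
            running += mask[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + running
    return ii


def box_sum(ii, r0, c0, r1, c1):
    """Sum of mask rows r0..r1-1 and columns c0..c1-1 using four lookups."""
    return ii[r1][c1] - ii[r0][c1] - ii[r1][c0] + ii[r0][c0]


def aligned_box_score(ii, r0, c0, r1, c1):
    """Eq. (2) for a box whose edges fall on mask cell boundaries."""
    return box_sum(ii, r0, c0, r1, c1) / float((r1 - r0) * (c1 - c0))


mask = [[1.0, 1.0, 0.0],
        [1.0, 1.0, 0.0],
        [0.0, 0.0, 0.0]]
ii = integral_image(mask)
inner = box_sum(ii, 0, 0, 2, 2)
corner = box_sum(ii, 1, 1, 3, 3)
whole = aligned_box_score(ii, 0, 0, 3, 3)
```

Building the table is linear in the number of mask pixels, after which every candidate box costs a constant number of operations, which matches the two terms in the operation count above.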

63 The ﬁrst is by keeping boxes with strong score as deﬁned by Eq. [sent-145, score-0.31]

64 Using large windows at various scales, we produce several masks and merge them into higher resolution masks, one for each scale. [sent-157, score-0.484]

65 To achieve the above goals, we use three scales: the full image and two other scales such that the size of the window at a given scale is half of the size of the window at the previous scale. [sent-159, score-0.316]

66 We cover the image at each scale with windows such that these windows have a small overlap – 20% of their area. [sent-160, score-0.413]
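
One way to generate such a window cover is sketched below. It is only an illustration: the names are hypothetical, windows are spaced evenly so that neighbors overlap by at least the requested fraction along each axis (the text states the overlap as a fraction of window area), and each scale halves the window side of the previous one.

```python
import math


def window_starts(length, win, min_overlap=0.2):
    """Evenly spaced start offsets of windows of size `win` covering [0, length)."""
    if win >= length:
        return [0]
    max_stride = win * (1.0 - min_overlap)
    n = int(math.ceil((length - win) / max_stride)) + 1
    return [int(round(k * (length - win) / (n - 1))) for k in range(n)]


def generate_windows(height, width, scale, min_overlap=0.2):
    """Windows (top, left, bottom, right) for a scale; scale 0 is the full image."""
    win_h = max(1, height // (2 ** scale))
    win_w = max(1, width // (2 ** scale))
    return [(r, c, r + win_h, c + win_w)
            for r in window_starts(height, win_h, min_overlap)
            for c in window_starts(width, win_w, min_overlap)]


windows_by_scale = [generate_windows(400, 400, s) for s in range(3)]
```

For a 400 x 400 image this yields 1, 9 and 25 windows at the three scales, so the total number of network applications stays in the tens.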

67 Most importantly, the windows at the smallest scale allow localization at a higher resolution. [sent-162, score-0.275]

68 The generated object masks at each scale are merged by maximum operation. [sent-165, score-0.653]
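
Merging by the maximum operation can be sketched as below. The nearest-neighbor upsampling of each d x d window mask to the window size, and the function name `merge_masks`, are assumptions of this example rather than details given in the text.

```python
def merge_masks(height, width, window_masks):
    """Merge per-window masks into one image-sized mask by pixel-wise maximum.

    window_masks: list of ((top, left, bottom, right), mask) pairs, where
    mask is a d x d grid of values predicted for that window.
    """
    merged = [[0.0] * width for _ in range(height)]
    for (top, left, bottom, right), mask in window_masks:
        d = len(mask)
        win_h, win_w = bottom - top, right - left
        for r in range(top, bottom):
            i = min(d - 1, (r - top) * d // win_h)       # nearest mask row
            for c in range(left, right):
                j = min(d - 1, (c - left) * d // win_w)  # nearest mask column
                if mask[i][j] > merged[r][c]:
                    merged[r][c] = mask[i][j]
    return merged


# Two overlapping 4 x 4 windows on a 4 x 6 image, each with a 2 x 2 mask.
merged = merge_masks(4, 6, [
    ((0, 0, 4, 4), [[1.0, 0.2], [0.0, 0.0]]),
    ((0, 2, 4, 6), [[0.3, 0.5], [0.0, 0.0]]),
])
```

In the overlap region (columns 2 and 3) the larger of the two responses is kept.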

69 This gives us three masks of the size of the image, each ‘looking’ at objects of different sizes. [sent-166, score-0.407]

70 For each scale, we apply the bounding box inference from Sec. [sent-167, score-0.43]

71 The DNN localizer is applied on the windows deﬁned by the initial detection stage – each of the 15 bounding boxes is enlarged by a factor of 1. [sent-172, score-0.92]

72 Applying the localizer at higher resolution increases the precision of the detections signiﬁcantly. [sent-174, score-0.418]

73 The above algorithm is applied for each object class separately. [sent-177, score-0.295]

74 Input: x input image of size; networks $DNN^h$ producing full and partial object box mask. [sent-178, score-0.733]

75 Output: Set of detected object bounding boxes with conﬁdence scores. [sent-179, score-0.735]

76 for s ∈ scales do windows ← generate windows for the given scale s. [sent-181, score-0.332]

77 for w ∈ windows do for h ∈ {lower, upper, top, bottom, full} do $m^h_w$ ← $DNN^h(w)$ end end $m^h$ ← merge masks $m^h_w$, w ∈ windows; detections_s ← obtain a set of bounding boxes with scores from $m^h$ as in Sec. [sent-182, score-1.565]

78 2 detections ← detections ∪ detections_s end refined ← ∅ for d ∈ detections do c ← cropped image for enlarged bounding box of d for h ∈ {lower, upper, top, bottom, full} do $m^h$ ← $DNN^h(c)$ end detection ← infer highest scoring bounding box from $m^h$ as in Sec. [sent-184, score-2.094]

79 2 refined ← refined ∪ detection end return refined 6 DNN Training One of the compelling features of our network is its simplicity: the classifier is simply replaced by a mask generation layer without any smoothness prior or convolutional structure. [sent-186, score-1.101]

80 For training the mask generator, we generate several thousand samples from each image divided into 60% negative and 40% positive samples. [sent-188, score-0.535]

81 A sample is considered to be negative if it does not intersect the bounding box of any object of interest. [sent-189, score-0.725]

82 Positive samples are those covering at least 80% of the area of some of the object bounding boxes. [sent-190, score-0.509]
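
The sampling rule for the mask generator can be sketched as below. Boxes are assumed to be `(top, left, bottom, right)` tuples, and treating crops that touch an object without covering 80% of it as unused is an assumption of this sketch, since the text does not say what happens to them.

```python
def intersection_area(a, b):
    """Area of the intersection of two boxes (top, left, bottom, right)."""
    h = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    w = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    return h * w


def label_crop(crop, object_boxes, min_cover=0.8):
    """Return 'positive', 'negative', or None when the crop is not used."""
    covers = []
    for box in object_boxes:
        box_area = float((box[2] - box[0]) * (box[3] - box[1]))
        covers.append(intersection_area(crop, box) / box_area)
    if all(c == 0.0 for c in covers):
        return "negative"        # intersects no object box
    if any(c >= min_cover for c in covers):
        return "positive"        # covers at least 80% of some object box
    return None                  # partial overlap: skipped in this sketch


crop = (0, 0, 100, 100)
label_inside = label_crop(crop, [(10, 10, 50, 50)])
label_partial = label_crop(crop, [(90, 90, 200, 200)])
label_far = label_crop(crop, [(200, 200, 300, 300)])
```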

83 The negative samples are those whose bounding boxes have less than 0. [sent-194, score-0.44]

84 2 Jaccard-similarity with any of the groundtruth object boxes. The positive samples must have at least 0. [sent-195, score-0.585]

85 6 similarity with some of the object bounding boxes and are labeled by the class of the object with most similar bounding box to the crop. [sent-196, score-1.46]
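
The Jaccard-based labeling described in the three rows above can be sketched as follows. The thresholds 0.2 and 0.6 are read across the split rows; the function names, the `(top, left, bottom, right)` box layout, the `background` label, and discarding crops that fall between the two thresholds are assumptions of this example.

```python
def box_area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def jaccard(a, b):
    """Jaccard similarity (intersection over union) of two boxes."""
    inter_h = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_w = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_h * inter_w
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def label_by_similarity(crop, objects, neg_thresh=0.2, pos_thresh=0.6):
    """objects: list of (box, class_name) pairs.

    Returns the class of the most similar object for a positive crop,
    'background' for a negative crop, or None for a crop in between.
    """
    best_sim, best_cls = 0.0, None
    for box, cls in objects:
        sim = jaccard(crop, box)
        if sim > best_sim:
            best_sim, best_cls = sim, cls
    if best_sim >= pos_thresh:
        return best_cls
    if best_sim < neg_thresh:
        return "background"
    return None


objects = [((0, 0, 10, 12), "bird"), ((0, 5, 10, 15), "cat")]
label_pos = label_by_similarity((0, 0, 10, 10), objects)
label_neg = label_by_similarity((50, 50, 60, 60), objects)
label_mid = label_by_similarity((0, 0, 10, 10), [((0, 5, 10, 15), "cat")])
```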

86 Figure 3: For each image, we show two heat maps on the right: the first one corresponds to the output of $DNN^{\text{full}}$, while the second one encodes the four partial masks in terms of the strength of the colors red, green, blue and yellow. [sent-310, score-0.377]

87 In addition, we visualize the estimated object bounding box. [sent-311, score-0.484]

88 At test time an algorithm produces for an image a set of detections, defined by bounding boxes and their class labels. [sent-318, score-0.563]

89 After training this network as a 21-way classifier (VOC classes and background), we generate bounding boxes with 8 different aspect ratios and at 10 different scales spaced 5 pixels apart. [sent-323, score-0.621]

90 We reduce the number of the boxes by non-maximum suppression using Jaccard similarity of at least 1 Trained on VOC2012 training and validation sets. [sent-328, score-0.316]
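
Greedy non-maximum suppression with a Jaccard threshold can be sketched as below. The threshold is truncated in the row above, so it is left as a required parameter and the value 0.5 in the usage lines is purely hypothetical; the function names and the `(top, left, bottom, right)` box layout are likewise assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity (intersection over union) of two boxes."""
    inter_h = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_w = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_h * inter_w
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, threshold):
    """Keep boxes greedily by descending score.

    detections: list of (score, box) pairs. A box is dropped when its Jaccard
    similarity with an already kept box reaches `threshold`.
    """
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        if all(jaccard(box, kept_box) < threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept


detections = [(0.8, (1, 1, 11, 11)), (0.9, (0, 0, 10, 10)), (0.7, (20, 20, 30, 30))]
kept = non_max_suppression(detections, threshold=0.5)  # 0.5 is a hypothetical value
```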

91 This shows that it can handle less rigid objects in a better way while working well at the same time on rigid objects such as car, bus, etc. [sent-364, score-0.284]

92 3, where both the detected box as well as all ﬁve generated masks are visualized. [sent-366, score-0.54]

93 The generated masks are well localized and have almost no response outside the object. [sent-368, score-0.299]

94 The common misdetections are due to similarly looking objects (left object in last row of Fig. [sent-370, score-0.403]

95 3) or imprecise localization (right object in last row). [sent-371, score-0.411]

96 The latter problem is due to the ambiguous definition of object extent by the training data – in some images only the head of the bird is visible while in others the full body. [sent-372, score-0.448]

97 8 Conclusion In this work we leverage the expressivity of DNNs for object detection. [sent-378, score-0.321]

98 We show that the simple formulation of detection as DNN-based object mask regression can yield strong results when applied using a multi-scale coarse-to-fine procedure. [sent-379, score-0.834]

99 These results come at some computational cost at training time – one needs to train a network per object type and mask type. [sent-380, score-0.801]

100 Towards scalable representations of object categories: Learning a hierarchy of parts. [sent-425, score-0.316]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('mask', 0.372), ('masks', 0.299), ('object', 0.295), ('dnn', 0.291), ('boxes', 0.251), ('box', 0.241), ('detectornet', 0.233), ('bounding', 0.189), ('detections', 0.162), ('bb', 0.152), ('localizer', 0.143), ('detection', 0.134), ('mh', 0.132), ('windows', 0.131), ('dnns', 0.131), ('image', 0.123), ('localization', 0.116), ('objects', 0.108), ('halves', 0.087), ('nns', 0.079), ('convolutional', 0.079), ('ined', 0.072), ('network', 0.068), ('deformable', 0.065), ('sliding', 0.062), ('layers', 0.062), ('compositional', 0.06), ('precision', 0.059), ('score', 0.059), ('ref', 0.058), ('layer', 0.058), ('dn', 0.057), ('deep', 0.055), ('girshick', 0.055), ('nement', 0.055), ('resolution', 0.054), ('classi', 0.049), ('crops', 0.047), ('ve', 0.047), ('bird', 0.046), ('voc', 0.045), ('trained', 0.044), ('bus', 0.044), ('discriminatively', 0.044), ('pascal', 0.043), ('stage', 0.043), ('window', 0.043), ('scales', 0.042), ('output', 0.041), ('training', 0.04), ('vision', 0.039), ('groundtruth', 0.039), ('dpm', 0.037), ('full', 0.037), ('networks', 0.037), ('detectionss', 0.036), ('dumitru', 0.036), ('kinematically', 0.036), ('toshev', 0.036), ('covered', 0.036), ('rigid', 0.034), ('regression', 0.033), ('er', 0.033), ('parts', 0.032), ('corner', 0.032), ('ull', 0.032), ('cropped', 0.032), ('disambiguate', 0.032), ('dozen', 0.032), ('merged', 0.031), ('classes', 0.031), ('re', 0.031), ('images', 0.03), ('top', 0.029), ('localize', 0.029), ('sheep', 0.029), ('enlarged', 0.029), ('szegedy', 0.029), ('scale', 0.028), ('prominent', 0.028), ('imagenet', 0.028), ('capable', 0.028), ('bottom', 0.028), ('truth', 0.027), ('receptive', 0.026), ('train', 0.026), ('felzenszwalb', 0.026), ('expressivity', 0.026), ('paradigms', 0.026), ('suppression', 0.025), ('outputs', 0.025), ('area', 0.025), ('ground', 0.024), ('rectangle', 0.024), ('precisely', 0.023), ('yann', 0.023), ('shallow', 0.022), ('architectures', 0.022), ('powerful', 0.022), ('representations', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 84 nips-2013-Deep Neural Networks for Object Detection

Author: Christian Szegedy, Alexander Toshev, Dumitru Erhan

2 0.19572642 351 nips-2013-What Are the Invariant Occlusive Components of Image Patches? A Probabilistic Generative Approach

Author: Zhenwen Dai, Georgios Exarchakis, Jörg Lücke

Abstract: We study optimal image encoding based on a generative approach with non-linear feature combinations and explicit position encoding. By far most approaches to unsupervised learning of visual features, such as sparse coding or ICA, account for translations by representing the same features at different positions. Some earlier models used a separate encoding of features and their positions to facilitate invariant data encoding and recognition. All probabilistic generative models with explicit position encoding have so far assumed a linear superposition of components to encode image patches. Here, we for the ﬁrst time apply a model with non-linear feature superposition and explicit position encoding for patches. By avoiding linear superpositions, the studied model represents a closer match to component occlusions which are ubiquitous in natural images. In order to account for occlusions, the non-linear model encodes patches qualitatively very different from linear models by using component representations separated into mask and feature parameters. We ﬁrst investigated encodings learned by the model using artiﬁcial data with mutually occluding components. We ﬁnd that the model extracts the components, and that it can correctly identify the occlusive components with the hidden variables of the model. On natural image patches, the model learns component masks and features for typical image components. By using reverse correlation, we estimate the receptive ﬁelds associated with the model’s hidden units. We ﬁnd many Gabor-like or globular receptive ﬁelds as well as ﬁelds sensitive to more complex structures. Our results show that probabilistic models that capture occlusions and invariances can be trained efﬁciently on image patches, and that the resulting encoding represents an alternative model for the neural encoding of images in the primary visual cortex. 1

3 0.1914365 119 nips-2013-Fast Template Evaluation with Vector Quantization

Author: Mohammad Amin Sadeghi, David Forsyth

Abstract: Applying linear templates is an integral part of many object detection systems and accounts for a signiﬁcant portion of computation time. We describe a method that achieves a substantial end-to-end speedup over the best current methods, without loss of accuracy. Our method is a combination of approximating scores by vector quantizing feature windows and a number of speedup techniques including cascade. Our procedure allows speed and accuracy to be traded off in two ways: by choosing the number of Vector Quantization levels, and by choosing to rescore windows or not. Our method can be directly plugged into any recognition system that relies on linear templates. We demonstrate our method to speed up the original Exemplar SVM detector [1] by an order of magnitude and Deformable Part models [2] by two orders of magnitude with no loss of accuracy. 1

4 0.15251127 108 nips-2013-Error-Minimizing Estimates and Universal Entry-Wise Error Bounds for Low-Rank Matrix Completion

Author: Franz Kiraly, Louis Theran

Abstract: We propose a general framework for reconstructing and denoising single entries of incomplete and noisy entries. We describe: effective algorithms for deciding if an entry can be reconstructed and, if so, for reconstructing and denoising it; and a priori bounds on the error of each entry, individually. In the noiseless case our algorithm is exact. For rank-one matrices, the new algorithm is fast, admits a highly-parallel implementation, and produces an error minimizing estimate that is qualitatively close to our theoretical and the state-of-the-art Nuclear Norm and OptSpace methods.

5 0.12910487 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization

Author: Nataliya Shapovalova, Michalis Raptis, Leonid Sigal, Greg Mori

Abstract: We propose a weakly-supervised structured learning approach for recognition and spatio-temporal localization of actions in video. As part of the proposed approach, we develop a generalization of the Max-Path search algorithm which allows us to efﬁciently search over a structured space of multiple spatio-temporal paths while also incorporating context information into the model. Instead of using spatial annotations in the form of bounding boxes to guide the latent model during training, we utilize human gaze data in the form of a weak supervisory signal. This is achieved by incorporating eye gaze, along with the classiﬁcation, into the structured loss within the latent SVM learning framework. Experiments on a challenging benchmark dataset, UCF-Sports, show that our model is more accurate, in terms of classiﬁcation, and achieves state-of-the-art results in localization. In addition, our model can produce top-down saliency maps conditioned on the classiﬁcation label and localized latent paths. 1

6 0.11929997 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model

7 0.11859081 251 nips-2013-Predicting Parameters in Deep Learning

8 0.11102968 163 nips-2013-Learning a Deep Compact Image Representation for Visual Tracking

9 0.10961588 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks

10 0.10843979 138 nips-2013-Higher Order Priors for Joint Intrinsic Image, Objects, and Attributes Estimation

11 0.10559162 136 nips-2013-Hierarchical Modular Optimization of Convolutional Networks Achieves Representations Similar to Macaque IT and Human Ventral Stream

12 0.10285465 83 nips-2013-Deep Fisher Networks for Large-Scale Image Classification

13 0.10231553 349 nips-2013-Visual Concept Learning: Combining Machine Vision and Bayesian Generalization on Concept Hierarchies

14 0.09955541 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding

15 0.095733471 331 nips-2013-Top-Down Regularization of Deep Belief Networks

16 0.09367606 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors

17 0.093453743 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking

18 0.092506908 166 nips-2013-Learning invariant representations and applications to face verification

19 0.090258293 356 nips-2013-Zero-Shot Learning Through Cross-Modal Transfer

20 0.077607155 195 nips-2013-Modeling Clutter Perception using Parametric Proto-object Partitioning

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.157), (1, 0.095), (2, -0.161), (3, -0.105), (4, 0.129), (5, -0.122), (6, -0.062), (7, 0.024), (8, -0.053), (9, -0.008), (10, -0.065), (11, 0.019), (12, 0.015), (13, 0.021), (14, -0.066), (15, 0.024), (16, -0.047), (17, -0.139), (18, -0.071), (19, 0.051), (20, 0.037), (21, -0.015), (22, 0.04), (23, -0.002), (24, -0.084), (25, 0.054), (26, 0.018), (27, -0.101), (28, 0.006), (29, -0.044), (30, 0.051), (31, -0.044), (32, -0.025), (33, -0.036), (34, -0.079), (35, 0.067), (36, -0.041), (37, -0.022), (38, 0.014), (39, -0.073), (40, 0.046), (41, -0.048), (42, 0.028), (43, -0.072), (44, 0.024), (45, -0.037), (46, 0.043), (47, -0.023), (48, -0.019), (49, 0.069)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95745522 84 nips-2013-Deep Neural Networks for Object Detection

Author: Christian Szegedy, Alexander Toshev, Dumitru Erhan

2 0.80877447 163 nips-2013-Learning a Deep Compact Image Representation for Visual Tracking

Author: Naiyan Wang, Dit-Yan Yeung

Abstract: In this paper, we study the challenging problem of tracking the trajectory of a moving object in a video with possibly very complex background. In contrast to most existing trackers which only learn the appearance of the tracked object online, we take a different approach, inspired by recent advances in deep learning architectures, by putting more emphasis on the (unsupervised) feature learning problem. Speciﬁcally, by using auxiliary natural images, we train a stacked denoising autoencoder ofﬂine to learn generic image features that are more robust against variations. This is then followed by knowledge transfer from ofﬂine training to the online tracking process. Online tracking involves a classiﬁcation neural network which is constructed from the encoder part of the trained autoencoder as a feature extractor and an additional classiﬁcation layer. Both the feature extractor and the classiﬁer can be further tuned to adapt to appearance changes of the moving object. Comparison with the state-of-the-art trackers on some challenging benchmark video sequences shows that our deep learning tracker is more accurate while maintaining low computational cost with real-time performance when our MATLAB implementation of the tracker is used with a modest graphics processing unit (GPU). 1

3 0.79410088 119 nips-2013-Fast Template Evaluation with Vector Quantization

Author: Mohammad Amin Sadeghi, David Forsyth

4 0.77706045 166 nips-2013-Learning invariant representations and applications to face verification

Author: Qianli Liao, Joel Z. Leibo, Tomaso Poggio

Abstract: One approach to computer object recognition and modeling the brain’s ventral stream involves unsupervised learning of representations that are invariant to common transformations. However, applications of these ideas have usually been limited to 2D affine transformations, e.g., translation and scaling, since they are easiest to solve via convolution. In accord with a recent theory of transformation-invariance [1], we propose a model that, while capturing other common convolutional networks as special cases, can also be used with arbitrary identity-preserving transformations. The model’s wiring can be learned from videos of transforming objects, or any other grouping of images into sets by their depicted object. Through a series of successively more complex empirical tests, we study the invariance/discriminability properties of this model with respect to different transformations. First, we empirically confirm theoretical predictions (from [1]) for the case of 2D affine transformations. Next, we apply the model to non-affine transformations; as expected, it performs well on face verification tasks requiring invariance to the relatively smooth transformations of 3D rotation-in-depth and changes in illumination direction. Surprisingly, it can also tolerate clutter “transformations” which map an image of a face on one background to an image of the same face on a different background. Motivated by these empirical findings, we tested the same model on face verification benchmark tasks from the computer vision literature: Labeled Faces in the Wild, PubFig [2, 3, 4] and a new dataset we gathered, achieving strong performance in these highly unconstrained cases as well.

5 0.73632616 83 nips-2013-Deep Fisher Networks for Large-Scale Image Classification

Author: Karen Simonyan, Andrea Vedaldi, Andrew Zisserman

Abstract: As massively parallel computations have become broadly available with modern GPUs, deep architectures trained on very large datasets have risen in popularity. Discriminatively trained convolutional neural networks, in particular, were recently shown to yield state-of-the-art performance in challenging image classification benchmarks such as ImageNet. However, elements of these architectures are similar to standard hand-crafted representations used in computer vision. In this paper, we explore the extent of this analogy, proposing a version of the state-of-the-art Fisher vector image encoding that can be stacked in multiple layers. This architecture significantly improves on standard Fisher vectors, and obtains competitive results with deep convolutional networks at a smaller computational learning cost. Our hybrid architecture allows us to assess how the performance of a conventional hand-crafted image classification pipeline changes with increased depth. We also show that convolutional networks and Fisher vector encodings are complementary in the sense that their combination further improves the accuracy.

6 0.7359069 138 nips-2013-Higher Order Priors for Joint Intrinsic Image, Objects, and Attributes Estimation

7 0.69387972 195 nips-2013-Modeling Clutter Perception using Parametric Proto-object Partitioning

8 0.68699598 351 nips-2013-What Are the Invariant Occlusive Components of Image Patches? A Probabilistic Generative Approach

9 0.68111652 136 nips-2013-Hierarchical Modular Optimization of Convolutional Networks Achieves Representations Similar to Macaque IT and Human Ventral Stream

10 0.66626149 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking

11 0.66453886 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model

12 0.64803815 226 nips-2013-One-shot learning by inverting a compositional causal process

13 0.5832513 37 nips-2013-Approximate Bayesian Image Interpretation using Generative Probabilistic Graphics Programs

14 0.58141273 27 nips-2013-Adaptive Multi-Column Deep Neural Networks with Application to Robust Image Denoising

15 0.57711786 356 nips-2013-Zero-Shot Learning Through Cross-Modal Transfer

16 0.57356465 343 nips-2013-Unsupervised Structure Learning of Stochastic And-Or Grammars

17 0.57115477 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding

18 0.56518859 251 nips-2013-Predicting Parameters in Deep Learning

19 0.54385579 349 nips-2013-Visual Concept Learning: Combining Machine Vision and Bayesian Generalization on Concept Hierarchies

20 0.52959126 335 nips-2013-Transfer Learning in a Transductive Setting

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(16, 0.022), (19, 0.028), (33, 0.153), (34, 0.075), (41, 0.012), (49, 0.022), (56, 0.059), (70, 0.448), (85, 0.027), (89, 0.028), (93, 0.047)]
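The topic vector above is easier to read once it is sorted by weight. The following is a minimal sketch, not the pipeline that produced this page: the helper name `dominant_topics` and the variable names are hypothetical, and only the (topicId, topicWeight) pairs are taken from the listing. The listed weights add up to less than 1, presumably because low-weight topics were dropped before printing; that reading is an assumption.

```python
# Topic weights copied verbatim from the listing above: (topicId, topicWeight).
topic_weights = [
    (16, 0.022), (19, 0.028), (33, 0.153), (34, 0.075), (41, 0.012),
    (49, 0.022), (56, 0.059), (70, 0.448), (85, 0.027), (89, 0.028),
    (93, 0.047),
]


def dominant_topics(weights, k=3):
    """Return the k (topicId, topicWeight) pairs with the largest weight.

    Hypothetical helper for illustration; not part of any LDA library.
    """
    return sorted(weights, key=lambda pair: pair[1], reverse=True)[:k]


top = dominant_topics(topic_weights)

# Total probability mass carried by the topics that were actually listed.
listed_mass = sum(weight for _, weight in topic_weights)

print(top)
print(round(listed_mass, 3))
```

Sorting shows that topic 70 carries most of the listed mass, with topics 33 and 34 a distant second and third.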

similar papers list:

simIndex simValue paperId paperTitle

1 0.92229217 267 nips-2013-Recurrent networks of coupled Winner-Take-All oscillators for solving constraint satisfaction problems

Author: Hesham Mostafa, Lorenz K. Mueller, Giacomo Indiveri

Abstract: We present a recurrent neuronal network, modeled as a continuous-time dynamical system, that can solve constraint satisfaction problems. Discrete variables are represented by coupled Winner-Take-All (WTA) networks, and their values are encoded in localized patterns of oscillations that are learned by the recurrent weights in these networks. Constraints over the variables are encoded in the network connectivity. Although there are no sources of noise, the network can escape from local optima in its search for solutions that satisfy all constraints by modifying the effective network connectivity through oscillations. If there is no solution that satisfies all constraints, the network state changes in a seemingly random manner and its trajectory approximates a sampling procedure that selects a variable assignment with a probability that increases with the fraction of constraints satisfied by this assignment. External evidence, or input to the network, can force variables to specific values. When new inputs are applied, the network re-evaluates the entire set of variables in its search for states that satisfy the maximum number of constraints, while being consistent with the external input. Our results demonstrate that the proposed network architecture can perform a deterministic search for the optimal solution to problems with non-convex cost functions. The network is inspired by canonical microcircuit models of the cortex and suggests possible dynamical mechanisms to solve constraint satisfaction problems that can be present in biological networks, or implemented in neuromorphic electronic circuits.

same-paper 2 0.88068098 84 nips-2013-Deep Neural Networks for Object Detection

Author: Christian Szegedy, Alexander Toshev, Dumitru Erhan

3 0.8404659 157 nips-2013-Learning Multi-level Sparse Representations

Author: Ferran Diego Andilla, Fred A. Hamprecht

Abstract: Bilinear approximation of a matrix is a powerful paradigm of unsupervised learning. In some applications, however, there is a natural hierarchy of concepts that ought to be reflected in the unsupervised analysis. For example, in the neuroscience image sequence considered here, there are the semantic concepts of pixel → neuron → assembly that should find their counterpart in the unsupervised analysis. Driven by this concrete problem, we propose a decomposition of the matrix of observations into a product of more than two sparse matrices, with the rank decreasing from lower to higher levels. In contrast to prior work, we allow for both hierarchical and heterarchical relations of lower-level to higher-level concepts. In addition, we learn the nature of these relations rather than imposing them. Finally, we describe an optimization scheme that optimizes the decomposition over all levels jointly, rather than in a greedy level-by-level fashion. The proposed bilevel SHMF (sparse heterarchical matrix factorization) is the first formalism that allows simultaneous interpretation of a calcium imaging sequence in terms of the constituent neurons, their membership in assemblies, and the time courses of both neurons and assemblies. Experiments show that the proposed model fully recovers the structure from difficult synthetic data designed to imitate the experimental data. More importantly, bilevel SHMF yields plausible interpretations of real-world calcium imaging data.

4 0.83162129 16 nips-2013-A message-passing algorithm for multi-agent trajectory planning

Author: Jose Bento, Nate Derbinsky, Javier Alonso-Mora, Jonathan S. Yedidia

Abstract: We describe a novel approach for computing collision-free global trajectories for p agents with specified initial and final configurations, based on an improved version of the alternating direction method of multipliers (ADMM). Compared with existing methods, our approach is naturally parallelizable and allows for incorporating different cost functionals with only minor adjustments. We apply our method to classical challenging instances and observe that its computational requirements scale well with p for several cost functionals. We also show that a specialization of our algorithm can be used for local motion planning by solving the problem of joint optimization in velocity space.

5 0.83134043 90 nips-2013-Direct 0-1 Loss Minimization and Margin Maximization with Boosting

Author: Shaodan Zhai, Tian Xia, Ming Tan, Shaojun Wang

Abstract: We propose a boosting method, DirectBoost, a greedy coordinate descent algorithm that builds an ensemble classifier of weak classifiers through directly minimizing empirical classification error over labeled training examples; once the training classification error is reduced to a local coordinatewise minimum, DirectBoost runs a greedy coordinate ascent algorithm that continuously adds weak classifiers to maximize any targeted arbitrarily defined margins until reaching a local coordinatewise maximum of the margins in a certain sense. Experimental results on a collection of machine-learning benchmark datasets show that DirectBoost gives better results than AdaBoost, LogitBoost, LPBoost with column generation and BrownBoost, and is noise tolerant when it maximizes an n′th order bottom sample margin.

6 0.75936431 56 nips-2013-Better Approximation and Faster Algorithm Using the Proximal Average

7 0.73331761 15 nips-2013-A memory frontier for complex synapses

8 0.62780195 141 nips-2013-Inferring neural population dynamics from multiple partial recordings of the same neural circuit

9 0.61067146 121 nips-2013-Firing rate predictions in optimal balanced networks

10 0.59167635 77 nips-2013-Correlations strike back (again): the case of associative memory retrieval

11 0.58187985 162 nips-2013-Learning Trajectory Preferences for Manipulators via Iterative Improvement

12 0.56990653 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization

13 0.55911875 136 nips-2013-Hierarchical Modular Optimization of Convolutional Networks Achieves Representations Similar to Macaque IT and Human Ventral Stream

14 0.55029076 64 nips-2013-Compete to Compute

15 0.54146338 264 nips-2013-Reciprocally Coupled Local Estimators Implement Bayesian Information Integration Distributively

16 0.54064178 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks

17 0.53360093 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding

18 0.53065223 163 nips-2013-Learning a Deep Compact Image Representation for Visual Tracking

19 0.52650899 119 nips-2013-Fast Template Evaluation with Vector Quantization

20 0.52543825 275 nips-2013-Reservoir Boosting : Between Online and Offline Ensemble Learning