nips nips2013 nips2013-163 knowledge-graph by maker-knowledge-mining

163 nips-2013-Learning a Deep Compact Image Representation for Visual Tracking


Source: pdf

Author: Naiyan Wang, Dit-Yan Yeung

Abstract: In this paper, we study the challenging problem of tracking the trajectory of a moving object in a video with possibly very complex background. In contrast to most existing trackers which only learn the appearance of the tracked object online, we take a different approach, inspired by recent advances in deep learning architectures, by putting more emphasis on the (unsupervised) feature learning problem. Specifically, by using auxiliary natural images, we train a stacked denoising autoencoder offline to learn generic image features that are more robust against variations. This is then followed by knowledge transfer from offline training to the online tracking process. Online tracking involves a classification neural network which is constructed from the encoder part of the trained autoencoder as a feature extractor and an additional classification layer. Both the feature extractor and the classifier can be further tuned to adapt to appearance changes of the moving object. Comparison with the state-of-the-art trackers on some challenging benchmark video sequences shows that our deep learning tracker is more accurate while maintaining low computational cost with real-time performance when our MATLAB implementation of the tracker is used with a modest graphics processing unit (GPU). 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 In this paper, we study the challenging problem of tracking the trajectory of a moving object in a video with possibly very complex background. [sent-4, score-0.624]

2 In contrast to most existing trackers which only learn the appearance of the tracked object online, we take a different approach, inspired by recent advances in deep learning architectures, by putting more emphasis on the (unsupervised) feature learning problem. [sent-5, score-1.057]

3 Specifically, by using auxiliary natural images, we train a stacked denoising autoencoder offline to learn generic image features that are more robust against variations. [sent-6, score-0.547]

4 This is then followed by knowledge transfer from offline training to the online tracking process. [sent-7, score-0.357]

5 Online tracking involves a classification neural network which is constructed from the encoder part of the trained autoencoder as a feature extractor and an additional classification layer. [sent-8, score-0.564]

6 Both the feature extractor and the classifier can be further tuned to adapt to appearance changes of the moving object. [sent-9, score-0.186]

7 Visual tracking, also called object tracking, refers to the automatic estimation of the trajectory of an object as it moves around in a video. [sent-11, score-0.27]

8 It has numerous applications in many domains, including video surveillance for security, human-computer interaction, and sports video analysis. [sent-12, score-0.254]

9 While a certain application may require multiple moving objects to be tracked, the typical setting is to treat each object separately. [sent-13, score-0.192]

10 After the object to track is identified either manually or automatically in the first video frame, the goal of visual tracking is to automatically track the trajectory of the object over the subsequent frames. [sent-14, score-0.922]

11 Most existing trackers adopt either the generative or the discriminative approach. [sent-16, score-0.637]

12 Generative trackers, like other generative models in machine learning, assume that the object being tracked can be described by some generative process and hence tracking corresponds to finding the most probable candidate among possibly infinitely many. [sent-17, score-0.733]

13 The motivation behind generative trackers is to develop image representations which can facilitate robust tracking. [sent-18, score-0.722]

14 On the other hand, the discriminative approach treats tracking as a binary classification problem which learns to explicitly distinguish the object being tracked from its background. [sent-22, score-0.726]

15 Some representative trackers in this category are the online AdaBoost (OAB) tracker [6], multiple instance learning (MIL) tracker [3], and structured output tracker (Struck) [8]. [sent-23, score-1.427]

16 While generative trackers usually produce more accurate results under less complex environments due to the richer image representations used, discriminative trackers are more robust against strong occlusion and variations since they explicitly take the background into consideration. [sent-24, score-1.394]

17 We refer the reader to a recent paper [23] which empirically compares many existing trackers based on a common benchmark. [sent-25, score-0.55]

18 From the learning perspective, visual tracking is challenging because it has only one labeled instance in the form of an identified object in the first video frame. [sent-26, score-0.688]

19 In the subsequent frames, the tracker has to learn variations of the tracked object with only unlabeled data available. [sent-27, score-0.659]

20 With no prior knowledge about the object being tracked, it is easy for the tracker to drift away from the target. [sent-28, score-0.442]

21 To address this problem, some trackers taking the semi-supervised learning approach have been proposed [12, 7]. [sent-29, score-0.55]

22 An alternative approach [22] first learns a dictionary of image features (such as SIFT local descriptors) from auxiliary data and then transfers the knowledge learned to online tracking. [sent-30, score-0.25]

23 Another issue is that many existing trackers make use of image representations that may not be good enough for robust tracking in complex environments. [sent-31, score-0.979]

24 This is especially the case for discriminative trackers which usually put more emphasis on improving the classifiers rather than the image features used. [sent-32, score-0.732]

25 While many trackers simply use raw pixels as features, some attempts have used more informative features, such as Haar features, histogram features, and local binary patterns. [sent-33, score-0.573]

26 However, these features are all handcrafted offline but not tailor-made for the tracked object. [sent-34, score-0.311]

27 The key to success is to make use of deep architectures to learn richer invariant features via multiple nonlinear transformations. [sent-36, score-0.187]

28 We believe that visual tracking can also benefit from deep learning for the same reasons. [sent-37, score-0.47]

29 In this paper, we propose a novel deep learning tracker (DLT) for robust visual tracking. [sent-38, score-0.503]

30 We attempt to combine the philosophies behind both generative and discriminative trackers by developing a robust discriminative tracker which uses an effective image representation learned automatically. [sent-39, score-1.09]

31 First, it uses a stacked denoising autoencoder (SDAE) [20] to learn generic image features from a large image dataset as auxiliary data and then transfers the features learned to the online tracking task. [sent-41, score-0.981]

32 Second, unlike some previous methods which also learn features from auxiliary data, the learned features in DLT can be further tuned to adapt to specific objects during the online tracking process. [sent-42, score-0.604]

33 Moreover, since representing the tracked object does not require solving an optimization problem as in previous trackers based on sparse coding, DLT is significantly more efficient and hence is more suitable for real-time applications. [sent-44, score-0.909]

34 The particle filter approach [5] is commonly used for visual tracking. [sent-45, score-0.233]

35 Mathematically, object tracking corresponds to the problem of finding the most probable state for each time step t based on the observations up to the previous time step: $s_t^\ast = \arg\max_{s_t} p(s_t \mid y_{1:t-1}) = \arg\max_{s_t} \int p(s_t \mid s_{t-1})\, p(s_{t-1} \mid y_{1:t-1})\, ds_{t-1}$. [sent-48, score-0.557]

36 The particles $\{s_t^i\}_{i=1}^n$ are drawn from an importance distribution $q(s_t \mid s_{1:t-1}, y_{1:t})$ and the weights are updated as follows: $w_t^i = w_{t-1}^i \cdot \frac{p(y_t \mid s_t^i)\, p(s_t^i \mid s_{t-1}^i)}{q(s_t \mid s_{1:t-1}, y_{1:t})}$. [sent-51, score-0.257]

37 Consequently, the weights are updated as $w_t^i = w_{t-1}^i\, p(y_t \mid s_t^i)$. [sent-53, score-0.199]

38 If it falls below a threshold, resampling is applied to draw n particles from the current particle set in proportion to their weights, and their weights are then reset to 1/n. [sent-55, score-0.242]

39 For each frame, the tracking result is simply the particle with the largest weight. [sent-59, score-0.43]
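
To make the update concrete, here is a minimal Python/numpy sketch of one step of this bootstrap-style particle filter, where the transition prior serves as the proposal so the weight update reduces to $w_t^i = w_{t-1}^i\, p(y_t \mid s_t^i)$. The `transition` and `likelihood` callables and the 0.5·n effective-sample-size threshold are illustrative assumptions, not the paper's actual motion model, observation model, or settings.

```python
import numpy as np

def resample(particles, weights):
    """Multinomial resampling: draw n particles in proportion to their
    weights, then reset all weights to 1/n."""
    n = len(weights)
    idx = np.random.choice(n, size=n, p=weights)
    return particles[idx], np.full(n, 1.0 / n)

def particle_filter_step(particles, weights, y_t, transition, likelihood):
    """One time step: propagate, reweight, resample if needed, report best."""
    n = len(weights)
    particles = transition(particles)                # s_t ~ p(s_t | s_{t-1})
    weights = weights * likelihood(y_t, particles)   # w_t^i = w_{t-1}^i p(y_t | s_t^i)
    weights = weights / weights.sum()                # normalize
    ess = 1.0 / np.sum(weights ** 2)                 # effective sample size
    if ess < 0.5 * n:                                # assumed threshold
        particles, weights = resample(particles, weights)
    best = particles[np.argmax(weights)]             # result: heaviest particle
    return particles, weights, best

# Toy usage: 1-D random-walk state with a Gaussian observation model.
rng = np.random.default_rng(0)
parts = rng.standard_normal(100)
w = np.full(100, 0.01)
step = lambda p: p + 0.1 * rng.standard_normal(p.shape)
lik = lambda y, p: np.exp(-0.5 * (y - p) ** 2)
parts, w, est = particle_filter_step(parts, w, 0.3, step, lik)
```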

40 While many trackers also adopt the same particle filter approach, the main difference lies in the formulation of the observation model $p(y_t \mid s_t)$. [sent-60, score-0.81]

41 Clearly, a good observation model should be able to distinguish the tracked object well from the background while still being robust against various types of object variation. [sent-61, score-0.598]

42 The particle filter framework is the dominant approach in visual tracking for several reasons. [sent-63, score-0.529]

43 For visual tracking, this property makes it easier for the tracker to recover from incorrect tracking results. [sent-66, score-0.667]

44 A tutorial on using particle filters for visual tracking can be found in [2]. [sent-67, score-0.529]

45 Some recent work, e.g. [15], further improves the particle filter framework for visual tracking. [sent-70, score-0.233]

46 During the offline training stage, unsupervised feature learning is carried out by training an SDAE with auxiliary image data to learn generic natural image features. [sent-72, score-0.262]

47 During the online tracking process, an additional classification layer is added to the encoder part of the trained SDAE to result in a classification neural network. [sent-74, score-0.454]
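
As a rough illustration of this construction, the numpy sketch below stacks a single sigmoid classification unit on the encoder layers of a trained autoencoder. The class name, layer sizes, and initialization are hypothetical; only the overall shape (encoder reused as feature extractor, plus an added classification layer trained online) follows the description above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TrackingClassifier:
    """Encoder layers from a trained SDAE plus a new classification layer."""

    def __init__(self, encoder_weights, encoder_biases):
        self.W = list(encoder_weights)   # one (d_in, d_out) matrix per layer
        self.b = list(encoder_biases)
        k = self.W[-1].shape[1]          # width of the top hidden layer
        self.w_cls = np.zeros(k)         # new classification layer,
        self.b_cls = 0.0                 # trained during online tracking

    def features(self, x):
        h = x
        for W, b in zip(self.W, self.b):  # encoder acts as feature extractor
            h = sigmoid(h @ W + b)
        return h

    def confidence(self, x):
        """Confidence that patch x shows the tracked object (the p(y_t | s_t) score)."""
        return sigmoid(self.features(x) @ self.w_cls + self.b_cls)

# Toy usage with random "pretrained" encoder weights for a 1024-d patch.
rng = np.random.default_rng(0)
Ws = [0.01 * rng.standard_normal((1024, 512)), 0.01 * rng.standard_normal((512, 256))]
bs = [np.zeros(512), np.zeros(256)]
net = TrackingClassifier(Ws, bs)
conf = net.confidence(rng.random(1024))
```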

48 Since most state-of-the-art trackers included in our empirical comparison use only grayscale images, we have converted all the sampled images to grayscale (but our method can also use the color images directly if necessary). [sent-82, score-0.69]

49 2 Learning Generic Image Features with a Stacked Denoising Autoencoder The basic building block of an SDAE is a one-layer neural network called a denoising autoencoder (DAE), which is a more recent variant of the conventional autoencoder. [sent-87, score-0.223]

50 In so doing, robust features are learned, since the neural network contains a "bottleneck", i.e., a hidden layer with fewer units than the input layer. [sent-89, score-0.197]

51 By reconstructing the input from a corrupted version of it, a DAE is more effective than the conventional autoencoder in discovering more robust features by preventing the autoencoder from simply learning the identity mapping. [sent-98, score-0.39]
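
For concreteness, here is a small numpy sketch of one training step of a DAE with tied weights, masking-noise corruption, and a squared-error reconstruction loss. These are common choices for DAEs; the paper's exact corruption process, loss, and optimizer may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dae_step(x, W, b_enc, b_dec, corruption=0.3, lr=0.1):
    """One gradient step: encode a corrupted input, reconstruct the clean one."""
    mask = rng.random(x.shape) > corruption    # randomly zero out inputs
    x_tilde = x * mask                         # corrupted version of x
    h = sigmoid(x_tilde @ W + b_enc)           # encode
    x_hat = sigmoid(h @ W.T + b_dec)           # decode with tied weights
    d_out = (x_hat - x) * x_hat * (1 - x_hat)  # backprop through decoder sigmoid
    d_hid = (d_out @ W) * h * (1 - h)          # backprop through encoder sigmoid
    W -= lr * (x_tilde.T @ d_hid + d_out.T @ h)  # tied-weight gradient
    b_enc -= lr * d_hid.sum(axis=0)
    b_dec -= lr * d_out.sum(axis=0)
    return 0.5 * np.sum((x_hat - x) ** 2)      # reconstruction loss

# Toy usage: 64 random 16x16 "patches", 100 hidden units (a bottleneck).
x = rng.random((64, 256))
W = 0.01 * rng.standard_normal((256, 100))
b_enc, b_dec = np.zeros(100), np.zeros(256)
for _ in range(10):
    loss = dae_step(x, W, b_enc, b_dec)
```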

52 2 Online Tracking Process The object to track is specified by the location of its bounding box in the first frame. [sent-119, score-0.251]

53 When a new video frame arrives, we first draw particles according to the particle filter approach. [sent-124, score-0.48]

54 [Figure 1: Some key components of the network architecture: (a) denoising autoencoder; (b) stacked denoising autoencoder; (c) network for online tracking.] [sent-127, score-0.318]

55 If the maximum confidence of all particles in a frame is below a predefined threshold τ , it may indicate significant appearance change of the object being tracked. [sent-129, score-0.399]

56 If τ is too small, the tracker cannot adapt well to appearance changes. [sent-132, score-0.343]

57 On the other hand, if τ is too large, even an occluding object or the background may be mistaken for the tracked object and hence lead to drifting of the target. [sent-133, score-0.555]
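
A minimal sketch of this thresholding rule follows, with hypothetical `confidence_fn` and `update_fn` callables standing in for the classification network and its fine-tuning routine; tau = 0.9 is a placeholder, not the paper's value.

```python
import numpy as np

def track_frame(candidates, confidence_fn, update_fn, tau=0.9):
    """Pick the most confident particle; fine-tune when confidence < tau."""
    scores = np.array([confidence_fn(c) for c in candidates])
    best = int(np.argmax(scores))
    if scores[best] < tau:
        # Low confidence on every particle suggests a significant appearance
        # change: adapt the feature extractor and classifier. Too small a tau
        # adapts too rarely; too large a tau risks updating on an occluder or
        # the background and drifting.
        update_fn(candidates[best])
    return candidates[best], float(scores[best])
```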

58 4 Experiments We empirically compare DLT with some state-of-the-art trackers in this section using 10 challenging benchmark video sequences. [sent-134, score-0.708]

59 These trackers are: MTT [26], CT [24], VTD [15], MIL [3], the latest variant of L1T [4], TLD [13], and IVT [18]. [sent-135, score-0.55]

60 We use the original implementations of these trackers provided by their authors. [sent-136, score-0.55]

61 In case a tracker can only deal with grayscale video, the rgb2gray function provided by the MATLAB Image Processing Toolbox is used to convert the color video to grayscale. [sent-137, score-0.43]

62 Let $BB_T$ denote the bounding box produced by a tracker and $BB_G$ the ground-truth bounding box. [sent-160, score-0.353]

63 For each video frame, a tracker is considered successful if the overlap percentage $\frac{\mathrm{area}(BB_T \cap BB_G)}{\mathrm{area}(BB_T \cup BB_G)} > 50\%$. [sent-161, score-0.399]
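
This success criterion is straightforward to compute; the sketch below assumes boxes given as (x, y, width, height) tuples, a representation choice not specified here.

```python
def overlap_ratio(bb_t, bb_g):
    """area(BB_T ∩ BB_G) / area(BB_T ∪ BB_G) for axis-aligned boxes
    given as (x, y, width, height) tuples."""
    x1 = max(bb_t[0], bb_g[0])
    y1 = max(bb_t[1], bb_g[1])
    x2 = min(bb_t[0] + bb_t[2], bb_g[0] + bb_g[2])
    y2 = min(bb_t[1] + bb_t[3], bb_g[1] + bb_g[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)  # intersection area
    union = bb_t[2] * bb_t[3] + bb_g[2] * bb_g[3] - inter
    return inter / union if union > 0 else 0.0

def success_rate(tracked, ground_truth):
    """Fraction of frames in which the overlap exceeds 50%."""
    hits = sum(overlap_ratio(t, g) > 0.5 for t, g in zip(tracked, ground_truth))
    return hits / len(tracked)

# Example: a 10-pixel horizontal shift of a 50x50 box still counts as a hit.
assert overlap_ratio((0, 0, 50, 50), (10, 0, 50, 50)) > 0.5
```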

64 Since TLD can report that the tracked object is missing in some frames, we exclude it from the central-pixel error comparison. [sent-166, score-0.359]

65 Thanks to advances in GPU technology, our tracker can achieve an average frame rate of 15 fps (frames per second), which is sufficient for many real-time applications. [sent-170, score-0.433]

66 Table 1: Comparison of 8 trackers on 10 video sequences. [sent-327, score-0.677]

67 Table 2: Comparison of running time on 10 video sequences (in fps). [sent-340, score-0.186]

68 Figure 4 shows some key frames with bounding boxes reported by all eight trackers for each of the 10 video sequences. [sent-344, score-0.743]

69 More detailed results for the complete video sequences can be found in the supplemental material. [sent-345, score-0.186]

70 In both the car4 and car11 sequences, the tracked objects are cars moving on an open road. [sent-346, score-0.281]

71 Since the car being tracked is a rigid object, its shape does not change much and hence generative trackers like IVT, L1T and MTT generally perform well for these two sequences. [sent-349, score-0.837]

72 In the davidin and trellis sequences, each tracker has to track a face in indoor and outdoor environments, respectively. [sent-351, score-0.489]

73 As a consequence, all trackers drift or even fail to different degrees. [sent-354, score-0.62]

74 In the woman sequence, we track a woman walking in the street. [sent-356, score-0.275]

75 TLD first fails at frame 63 because of the pose change. [sent-358, score-0.192]

76 All other trackers compared fail when the woman walks close to the car at about frame 130. [sent-359, score-0.875]

77 Both the shaking and singer1 sequences are recordings on a stage with illumination changes. [sent-367, score-0.208]

78 For shaking, the pose of the head being tracked also changes. [sent-368, score-0.255]

79 L1T, IVT and TLD totally fail before frame 10, while MTT and MIL show some drifting effects afterwards. [sent-369, score-0.233]

80 All trackers except MTT can track the object but CT and MIL do not support scale change and hence the results are less satisfactory. [sent-372, score-0.75]

81 In the surfer sequence, the goal is to track the head of a surfer while its pose changes along the video sequence. [sent-373, score-0.44]

82 All trackers can merely track it except that TLD shows an incorrect scale and both CT and MIL drift slightly. [sent-374, score-0.65]

83 Most trackers fail or drift at about frame 15 with the exception of L1T, TLD and DLT. [sent-376, score-0.781]

84 First, we learn generic image features from a larger and more general dataset rather than a smaller set with only some chosen image categories. [sent-380, score-0.265]

85 Second, we learn the image features from raw images automatically instead of relying on handcrafted SIFT features. [sent-381, score-0.23]

86 Third, further learning is allowed during the online tracking process of our method so as to adapt better to the specific object being tracked. [sent-382, score-0.518]

87 The resulting tracker would be similar to previous patch (or fragment) based methods [1, 11] which have been shown to be robust against partial occlusion. [sent-384, score-0.329]

88 Nevertheless, current research on CNN focuses on learning shift-invariant features for such tasks as image classification and object detection. [sent-385, score-0.269]

89 However, the nature of object tracking is very different in that it has to learn shift-variant but similarity-preserving features to overcome the drifting problem. [sent-386, score-0.554]

90 Noting that the key to success for deep learning architectures is the learning of useful features, we first train a stacked denoising autoencoder using many auxiliary natural images to learn generic image features. [sent-390, score-0.572]

91 This alleviates the problem of not having much labeled data in visual tracking applications. [sent-391, score-0.395]

92 After offline training, the encoder part of the SDAE is used as a feature extractor [sent-392, score-1.401]

93 during the online tracking process to train a classification neural network to distinguish the tracked object from the background. [sent-393, score-0.775]

[Figure 4: Comparison of 8 trackers on 10 video sequences (car4, car11, davidin, trellis, woman, singer1, shaking, animal, surfer, bird2) in terms of the reported bounding boxes.]

94 Since further tuning is allowed during the online tracking process, both the feature extractor and the classifier can adapt to appearance changes of the moving object. [sent-395, score-0.543]

95 Through quantitative and qualitative comparison with state-of-the-art trackers on some challenging benchmark video sequences, we demonstrate that our deep learning tracker gives very encouraging results while having low computational cost. [sent-396, score-1.077]

96 Also, the classification layer in our current tracker is just a linear classifier for simplicity. [sent-399, score-0.318]

97 A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. [sent-413, score-0.195]

98 Real time robust L1 tracker using accelerated proximal gradient approach. [sent-426, score-0.329]

99 Visual tracking via adaptive structural local sparse appearance model. [sent-477, score-0.341]

100 Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. [sent-532, score-0.237]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('trackers', 0.55), ('tracking', 0.296), ('tracker', 0.272), ('dlt', 0.243), ('tracked', 0.224), ('sdae', 0.162), ('tld', 0.162), ('frame', 0.161), ('object', 0.135), ('particle', 0.134), ('video', 0.127), ('st', 0.126), ('autoencoder', 0.124), ('mil', 0.114), ('woman', 0.105), ('shaking', 0.1), ('visual', 0.099), ('mtt', 0.097), ('surfer', 0.097), ('offline', 0.095), ('davidin', 0.081), ('ivt', 0.081), ('image', 0.076), ('deep', 0.075), ('trellis', 0.071), ('dae', 0.071), ('bb', 0.068), ('track', 0.065), ('denoising', 0.063), ('online', 0.061), ('stacked', 0.059), ('sequences', 0.059), ('features', 0.058), ('particles', 0.058), ('robust', 0.057), ('extractor', 0.057), ('auxiliary', 0.055), ('animal', 0.052), ('encoder', 0.051), ('filter', 0.05), ('illumination', 0.049), ('discriminative', 0.048), ('layer', 0.046), ('appearance', 0.045), ('grabner', 0.043), ('vtd', 0.043), ('center', 0.042), ('yt', 0.041), ('filters', 0.04), ('pretraining', 0.039), ('generative', 0.039), ('images', 0.039), ('classification', 0.038), ('drifting', 0.037), ('overcomplete', 0.037), ('frames', 0.036), ('network', 0.036), ('moving', 0.035), ('drift', 0.035), ('fail', 0.035), ('tiny', 0.034), ('cvpr', 0.033), ('ct', 0.033), ('ghanem', 0.032), ('struck', 0.032), ('challenging', 0.031), ('bird', 0.031), ('grayscale', 0.031), ('pose', 0.031), ('activation', 0.031), ('momentum', 0.03), ('bounding', 0.03), ('architecture', 0.03), ('gpu', 0.029), ('daes', 0.029), ('handcrafted', 0.029), ('kalal', 0.029), ('learn', 0.028), ('environments', 0.027), ('corrupted', 0.027), ('generic', 0.027), ('architectures', 0.026), ('adapt', 0.026), ('weights', 0.025), ('background', 0.024), ('sigmoid', 0.024), ('wi', 0.024), ('car', 0.024), ('cnn', 0.024), ('distinguish', 0.023), ('changes', 0.023), ('pixels', 0.023), ('cluttered', 0.023), ('occlusion', 0.023), ('satisfactory', 0.023), ('quantitative', 0.022), ('objects', 0.022), ('eccv', 0.022), ('coding', 0.021), ('box', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 163 nips-2013-Learning a Deep Compact Image Representation for Visual Tracking

Author: Naiyan Wang, Dit-Yan Yeung

Abstract: In this paper, we study the challenging problem of tracking the trajectory of a moving object in a video with possibly very complex background. In contrast to most existing trackers which only learn the appearance of the tracked object online, we take a different approach, inspired by recent advances in deep learning architectures, by putting more emphasis on the (unsupervised) feature learning problem. Specifically, by using auxiliary natural images, we train a stacked denoising autoencoder offline to learn generic image features that are more robust against variations. This is then followed by knowledge transfer from offline training to the online tracking process. Online tracking involves a classification neural network which is constructed from the encoder part of the trained autoencoder as a feature extractor and an additional classification layer. Both the feature extractor and the classifier can be further tuned to adapt to appearance changes of the moving object. Comparison with the state-of-the-art trackers on some challenging benchmark video sequences shows that our deep learning tracker is more accurate while maintaining low computational cost with real-time performance when our MATLAB implementation of the tracker is used with a modest graphics processing unit (GPU). 1

2 0.11102968 84 nips-2013-Deep Neural Networks for Object Detection

Author: Christian Szegedy, Alexander Toshev, Dumitru Erhan

Abstract: Deep Neural Networks (DNNs) have recently shown outstanding performance on image classification tasks [14]. In this paper we go one step further and address the problem of object detection using DNNs, that is not only classifying but also precisely localizing objects of various classes. We present a simple and yet powerful formulation of object detection as a regression problem to object bounding box masks. We define a multi-scale inference procedure which is able to produce high-resolution object detections at a low cost by a few network applications. State-of-the-art performance of the approach is shown on Pascal VOC. 1

3 0.10700084 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction

Author: David J. Weiss, Ben Taskar

Abstract: Discriminative methods for learning structured models have enabled wide-spread use of very rich feature representations. However, the computational cost of feature extraction is prohibitive for large-scale or time-sensitive applications, often dominating the cost of inference in the models. Significant efforts have been devoted to sparsity-based model selection to decrease this cost. Such feature selection methods control computation statically and miss the opportunity to finetune feature extraction to each input at run-time. We address the key challenge of learning to control fine-grained feature extraction adaptively, exploiting nonhomogeneity of the data. We propose an architecture that uses a rich feedback loop between extraction and prediction. The run-time control policy is learned using efficient value-function approximation, which adaptively determines the value of information of features at the level of individual variables for each input. We demonstrate significant speedups over state-of-the-art methods on two challenging datasets. For articulated pose estimation in video, we achieve a more accurate state-of-the-art model that is also faster, with similar results on an OCR task. 1

4 0.097543836 251 nips-2013-Predicting Parameters in Deep Learning

Author: Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, Nando de Freitas

Abstract: We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. 1

5 0.09050329 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC

Author: Roger Frigola, Fredrik Lindsten, Thomas B. Schon, Carl Rasmussen

Abstract: State-space models are successfully used in many areas of science, engineering and economics to model time series and dynamical systems. We present a fully Bayesian approach to inference and learning (i.e. state estimation and system identification) in nonlinear nonparametric state-space models. We place a Gaussian process prior over the state transition dynamics, resulting in a flexible model able to capture complex dynamical phenomena. To enable efficient inference, we marginalize over the transition dynamics function and, instead, infer directly the joint smoothing distribution using specially tailored Particle Markov Chain Monte Carlo samplers. Once a sample from the smoothing distribution is computed, the state transition predictive distribution can be formulated analytically. Our approach preserves the full nonparametric expressivity of the model and can make use of sparse Gaussian processes to greatly reduce computational complexity. 1

6 0.084824339 280 nips-2013-Robust Data-Driven Dynamic Programming

7 0.084244408 331 nips-2013-Top-Down Regularization of Deep Belief Networks

8 0.081271328 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking

9 0.080966182 27 nips-2013-Adaptive Multi-Column Deep Neural Networks with Application to Robust Image Denoising

10 0.080526836 83 nips-2013-Deep Fisher Networks for Large-Scale Image Classification

11 0.079183489 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model

12 0.078049131 349 nips-2013-Visual Concept Learning: Combining Machine Vision and Bayesian Generalization on Concept Hierarchies

13 0.073366575 136 nips-2013-Hierarchical Modular Optimization of Convolutional Networks Achieves Representations Similar to Macaque IT and Human Ventral Stream

14 0.072379105 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization

15 0.068117507 351 nips-2013-What Are the Invariant Occlusive Components of Image Patches? A Probabilistic Generative Approach

16 0.066873051 166 nips-2013-Learning invariant representations and applications to face verification

17 0.066636145 30 nips-2013-Adaptive dropout for training deep neural networks

18 0.066112041 127 nips-2013-Generalized Denoising Auto-Encoders as Generative Models

19 0.064199105 5 nips-2013-A Deep Architecture for Matching Short Texts

20 0.06394019 226 nips-2013-One-shot learning by inverting a compositional causal process


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.15), (1, 0.053), (2, -0.113), (3, -0.075), (4, 0.067), (5, -0.09), (6, -0.044), (7, 0.074), (8, -0.031), (9, -0.07), (10, -0.071), (11, -0.024), (12, -0.006), (13, 0.074), (14, -0.082), (15, 0.027), (16, -0.025), (17, -0.064), (18, 0.008), (19, -0.005), (20, 0.005), (21, 0.036), (22, 0.046), (23, -0.011), (24, -0.017), (25, 0.023), (26, 0.018), (27, -0.006), (28, 0.013), (29, -0.019), (30, 0.068), (31, 0.01), (32, -0.025), (33, -0.019), (34, -0.051), (35, 0.041), (36, -0.073), (37, -0.063), (38, -0.006), (39, 0.004), (40, 0.054), (41, 0.02), (42, 0.007), (43, -0.069), (44, 0.056), (45, -0.06), (46, 0.111), (47, -0.031), (48, 0.031), (49, 0.073)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94168526 163 nips-2013-Learning a Deep Compact Image Representation for Visual Tracking

Author: Naiyan Wang, Dit-Yan Yeung

Abstract: In this paper, we study the challenging problem of tracking the trajectory of a moving object in a video with possibly very complex background. In contrast to most existing trackers which only learn the appearance of the tracked object online, we take a different approach, inspired by recent advances in deep learning architectures, by putting more emphasis on the (unsupervised) feature learning problem. Specifically, by using auxiliary natural images, we train a stacked denoising autoencoder offline to learn generic image features that are more robust against variations. This is then followed by knowledge transfer from offline training to the online tracking process. Online tracking involves a classification neural network which is constructed from the encoder part of the trained autoencoder as a feature extractor and an additional classification layer. Both the feature extractor and the classifier can be further tuned to adapt to appearance changes of the moving object. Comparison with the state-of-the-art trackers on some challenging benchmark video sequences shows that our deep learning tracker is more accurate while maintaining low computational cost with real-time performance when our MATLAB implementation of the tracker is used with a modest graphics processing unit (GPU). 1

2 0.80592859 84 nips-2013-Deep Neural Networks for Object Detection

Author: Christian Szegedy, Alexander Toshev, Dumitru Erhan

Abstract: Deep Neural Networks (DNNs) have recently shown outstanding performance on image classification tasks [14]. In this paper we go one step further and address the problem of object detection using DNNs, that is not only classifying but also precisely localizing objects of various classes. We present a simple and yet powerful formulation of object detection as a regression problem to object bounding box masks. We define a multi-scale inference procedure which is able to produce high-resolution object detections at a low cost by a few network applications. State-of-the-art performance of the approach is shown on Pascal VOC. 1

3 0.69556862 138 nips-2013-Higher Order Priors for Joint Intrinsic Image, Objects, and Attributes Estimation

Author: Vibhav Vineet, Carsten Rother, Philip Torr

Abstract: Many methods have been proposed to solve the problems of recovering intrinsic scene properties such as shape, reflectance and illumination from a single image, and object class segmentation separately. While these two problems are mutually informative, in the past not many papers have addressed this topic. In this work we explore such joint estimation of intrinsic scene properties recovered from an image, together with the estimation of the objects and attributes present in the scene. In this way, our unified framework is able to capture the correlations between intrinsic properties (reflectance, shape, illumination), objects (table, tv-monitor), and materials (wooden, plastic) in a given scene. For example, our model is able to enforce the condition that if a set of pixels take same object label, e.g. table, most likely those pixels would receive similar reflectance values. We cast the problem in an energy minimization framework and demonstrate the qualitative and quantitative improvement in the overall accuracy on the NYU and Pascal datasets. 1

4 0.6708011 166 nips-2013-Learning invariant representations and applications to face verification

Author: Qianli Liao, Joel Z. Leibo, Tomaso Poggio

Abstract: One approach to computer object recognition and modeling the brain’s ventral stream involves unsupervised learning of representations that are invariant to common transformations. However, applications of these ideas have usually been limited to 2D affine transformations, e.g., translation and scaling, since they are easiest to solve via convolution. In accord with a recent theory of transformationinvariance [1], we propose a model that, while capturing other common convolutional networks as special cases, can also be used with arbitrary identitypreserving transformations. The model’s wiring can be learned from videos of transforming objects—or any other grouping of images into sets by their depicted object. Through a series of successively more complex empirical tests, we study the invariance/discriminability properties of this model with respect to different transformations. First, we empirically confirm theoretical predictions (from [1]) for the case of 2D affine transformations. Next, we apply the model to non-affine transformations; as expected, it performs well on face verification tasks requiring invariance to the relatively smooth transformations of 3D rotation-in-depth and changes in illumination direction. Surprisingly, it can also tolerate clutter “transformations” which map an image of a face on one background to an image of the same face on a different background. Motivated by these empirical findings, we tested the same model on face verification benchmark tasks from the computer vision literature: Labeled Faces in the Wild, PubFig [2, 3, 4] and a new dataset we gathered—achieving strong performance in these highly unconstrained cases as well. 1

5 0.65353709 119 nips-2013-Fast Template Evaluation with Vector Quantization

Author: Mohammad Amin Sadeghi, David Forsyth

Abstract: Applying linear templates is an integral part of many object detection systems and accounts for a significant portion of computation time. We describe a method that achieves a substantial end-to-end speedup over the best current methods, without loss of accuracy. Our method is a combination of approximating scores by vector quantizing feature windows and a number of speedup techniques including cascade. Our procedure allows speed and accuracy to be traded off in two ways: by choosing the number of Vector Quantization levels, and by choosing to rescore windows or not. Our method can be directly plugged into any recognition system that relies on linear templates. We demonstrate our method to speed up the original Exemplar SVM detector [1] by an order of magnitude and Deformable Part models [2] by two orders of magnitude with no loss of accuracy. 1

6 0.62257159 83 nips-2013-Deep Fisher Networks for Large-Scale Image Classification

7 0.61304241 195 nips-2013-Modeling Clutter Perception using Parametric Proto-object Partitioning

8 0.60082656 136 nips-2013-Hierarchical Modular Optimization of Convolutional Networks Achieves Representations Similar to Macaque IT and Human Ventral Stream

9 0.59177929 226 nips-2013-One-shot learning by inverting a compositional causal process

10 0.58688128 27 nips-2013-Adaptive Multi-Column Deep Neural Networks with Application to Robust Image Denoising

11 0.5648644 37 nips-2013-Approximate Bayesian Image Interpretation using Generative Probabilistic Graphics Programs

12 0.56098825 351 nips-2013-What Are the Invariant Occlusive Components of Image Patches? A Probabilistic Generative Approach

13 0.54947323 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking

14 0.54449159 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model

15 0.53922749 331 nips-2013-Top-Down Regularization of Deep Belief Networks

16 0.5290795 160 nips-2013-Learning Stochastic Feedforward Neural Networks

17 0.52646869 85 nips-2013-Deep content-based music recommendation

18 0.52432883 343 nips-2013-Unsupervised Structure Learning of Stochastic And-Or Grammars

19 0.51252919 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding

20 0.50622332 251 nips-2013-Predicting Parameters in Deep Learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.01), (16, 0.033), (31, 0.011), (33, 0.144), (34, 0.097), (41, 0.011), (49, 0.047), (56, 0.071), (70, 0.064), (73, 0.256), (85, 0.032), (89, 0.073), (93, 0.068)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.80139256 163 nips-2013-Learning a Deep Compact Image Representation for Visual Tracking

Author: Naiyan Wang, Dit-Yan Yeung

Abstract: In this paper, we study the challenging problem of tracking the trajectory of a moving object in a video with possibly very complex background. In contrast to most existing trackers which only learn the appearance of the tracked object online, we take a different approach, inspired by recent advances in deep learning architectures, by putting more emphasis on the (unsupervised) feature learning problem. Specifically, by using auxiliary natural images, we train a stacked denoising autoencoder offline to learn generic image features that are more robust against variations. This is then followed by knowledge transfer from offline training to the online tracking process. Online tracking involves a classification neural network which is constructed from the encoder part of the trained autoencoder as a feature extractor and an additional classification layer. Both the feature extractor and the classifier can be further tuned to adapt to appearance changes of the moving object. Comparison with the state-of-the-art trackers on some challenging benchmark video sequences shows that our deep learning tracker is more accurate while maintaining low computational cost with real-time performance when our MATLAB implementation of the tracker is used with a modest graphics processing unit (GPU). 1

2 0.79591298 176 nips-2013-Linear decision rule as aspiration for simple decision heuristics

Author: Özgür1 Şimşek

Abstract: Several attempts to understand the success of simple decision heuristics have examined heuristics as an approximation to a linear decision rule. This research has identified three environmental structures that aid heuristics: dominance, cumulative dominance, and noncompensatoriness. This paper develops these ideas further and examines their empirical relevance in 51 natural environments. The results show that all three structures are prevalent, making it possible for simple rules to reach, and occasionally exceed, the accuracy of the linear decision rule, using less information and less computation. 1

3 0.72979611 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding

Author: Marius Pachitariu, Adam M. Packer, Noah Pettit, Henry Dalgleish, Michael Hausser, Maneesh Sahani

Abstract: Biological tissue is often composed of cells with similar morphologies replicated throughout large volumes and many biological applications rely on the accurate identification of these cells and their locations from image data. Here we develop a generative model that captures the regularities present in images composed of repeating elements of a few different types. Formally, the model can be described as convolutional sparse block coding. For inference we use a variant of convolutional matching pursuit adapted to block-based representations. We extend the KSVD learning algorithm to subspaces by retaining several principal vectors from the SVD decomposition instead of just one. Good models with little cross-talk between subspaces can be obtained by learning the blocks incrementally. We perform extensive experiments on simulated images and the inference algorithm consistently recovers a large proportion of the cells with a small number of false positives. We fit the convolutional model to noisy GCaMP6 two-photon images of spiking neurons and to Nissl-stained slices of cortical tissue and show that it recovers cell body locations without supervision. The flexibility of the block-based representation is reflected in the variability of the recovered cell shapes. 1

4 0.68806958 39 nips-2013-Approximate Gaussian process inference for the drift function in stochastic differential equations

Author: Andreas Ruttor, Philipp Batz, Manfred Opper

Abstract: We introduce a nonparametric approach for estimating drift functions in systems of stochastic differential equations from sparse observations of the state vector. Using a Gaussian process prior over the drift as a function of the state vector, we develop an approximate EM algorithm to deal with the unobserved, latent dynamics between observations. The posterior over states is approximated by a piecewise linearized process of the Ornstein-Uhlenbeck type and the MAP estimation of the drift is facilitated by a sparse Gaussian process regression. 1

5 0.63544697 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization

Author: Nataliya Shapovalova, Michalis Raptis, Leonid Sigal, Greg Mori

Abstract: We propose a weakly-supervised structured learning approach for recognition and spatio-temporal localization of actions in video. As part of the proposed approach, we develop a generalization of the Max-Path search algorithm which allows us to efficiently search over a structured space of multiple spatio-temporal paths while also incorporating context information into the model. Instead of using spatial annotations in the form of bounding boxes to guide the latent model during training, we utilize human gaze data in the form of a weak supervisory signal. This is achieved by incorporating eye gaze, along with the classification, into the structured loss within the latent SVM learning framework. Experiments on a challenging benchmark dataset, UCF-Sports, show that our model is more accurate, in terms of classification, and achieves state-of-the-art results in localization. In addition, our model can produce top-down saliency maps conditioned on the classification label and localized latent paths. 1

6 0.63059366 64 nips-2013-Compete to Compute

7 0.62864351 183 nips-2013-Mapping paradigm ontologies to and from the brain

8 0.62725204 251 nips-2013-Predicting Parameters in Deep Learning

9 0.6272518 331 nips-2013-Top-Down Regularization of Deep Belief Networks

10 0.62709236 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits

11 0.62697244 83 nips-2013-Deep Fisher Networks for Large-Scale Image Classification

12 0.62683189 121 nips-2013-Firing rate predictions in optimal balanced networks

13 0.62605298 304 nips-2013-Sparse nonnegative deconvolution for compressive calcium imaging: algorithms and phase transitions

14 0.62484837 236 nips-2013-Optimal Neural Population Codes for High-dimensional Stimulus Variables

15 0.62465972 141 nips-2013-Inferring neural population dynamics from multiple partial recordings of the same neural circuit

16 0.62198466 237 nips-2013-Optimal integration of visual speed across different spatiotemporal frequency channels

17 0.62114942 5 nips-2013-A Deep Architecture for Matching Short Texts

18 0.62110525 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking

19 0.62095153 234 nips-2013-Online Variational Approximations to non-Exponential Family Change Point Models: With Application to Radar Tracking

20 0.62062347 286 nips-2013-Robust learning of low-dimensional dynamics from large neural ensembles