nips nips2012 nips2012-197 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Oriol Vinyals, Yangqing Jia, Li Deng, Trevor Darrell
Abstract: Linear Support Vector Machines (SVMs) have become very popular in vision as part of state-of-the-art object recognition and other classification tasks but require high dimensional feature spaces for good performance. Deep learning methods can find more compact representations but current methods employ multilayer perceptrons that require solving a difficult, non-convex optimization problem. We propose a deep non-linear classifier whose layers are SVMs and which incorporates random projection as its core stacking element. Our method learns layers of linear SVMs recursively transforming the original data manifold through a random projection of the weak prediction computed from each layer. Our method scales as linear SVMs, does not rely on any kernel computations or nonconvex optimization, and exhibits better generalization ability than kernel-based SVMs. This is especially true when the number of training samples is smaller than the dimensionality of data, a common scenario in many real-world applications. The use of random projections is key to our method, as we show in the experiments section, in which we observe a consistent improvement over previous –often more complicated– methods on several vision and speech benchmarks. 1
Reference: text
sentIndex sentText sentNum sentScore
1 We propose a deep non-linear classifier whose layers are SVMs and which incorporates random projection as its core stacking element. [sent-3, score-0.655]
2 Our method learns layers of linear SVMs recursively transforming the original data manifold through a random projection of the weak prediction computed from each layer. [sent-4, score-0.376]
3 Our method scales as linear SVMs, does not rely on any kernel computations or nonconvex optimization, and exhibits better generalization ability than kernel-based SVMs. [sent-5, score-0.149]
4 This is especially true when the number of training samples is smaller than the dimensionality of data, a common scenario in many real-world applications. [sent-6, score-0.14]
5 The use of random projections is key to our method, as we show in the experiments section, in which we observe a consistent improvement over previous –often more complicated– methods on several vision and speech benchmarks. [sent-7, score-0.26]
6 The Support Vector Machine (SVM) has been a popular method for multimodal classification tasks since its introduction, and one of its main advantages is the simplicity of training a linear model. [sent-9, score-0.136]
7 In addition, finding the “oracle” kernel for a specific task remains an open problem, especially in applications such as vision and speech. [sent-11, score-0.139]
8 Our aim is to design a classifier that combines the simplicity of the linear Support Vector Machine (SVM) with the power derived from deep architectures. [sent-12, score-0.239]
9 Our approach follows the framework of building layer-by-layer architectures, and is motivated by the recent success of a convex stacking architecture which uses a simplified form of neural network with closed-form, convex learning [10]. [sent-15, score-0.429]
10 Specifically, we propose a new stacking technique for building a deep architecture, using a linear SVM as the base building block, and a random projection as its core stacking element. [sent-16, score-0.959]
11 The key element in our convex learning of each layer is to randomly project the predictions of the previous layer SVM back to the original feature space. [sent-18, score-0.547]
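To make this stacking element concrete, the following is a minimal NumPy sketch of the per-layer feature shift suggested by the description above; the scale factor beta, the sigmoid squashing, and the additive combination with the original features are taken from the rest of this summary, so treat it as an illustration rather than the paper's exact equations.

```python
import numpy as np

def random_svm_transform(d, scores, beta=0.5, rng=None):
    """Shift the original features d (N x D) by a random projection of the
    previous layer's per-class SVM scores (N x C). A sketch, not the paper's
    exact formulation."""
    rng = np.random.default_rng(rng)
    n_features = d.shape[1]
    n_scores = scores.shape[1]
    # Random matrix mapping the class-score space back to the feature space.
    W = rng.standard_normal((n_features, n_scores))
    # The sigmoid keeps imperfect, over-confident predictions from dominating d.
    squashed = 1.0 / (1.0 + np.exp(-scores))
    return d + beta * squashed @ W.T
```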
12 As we will show in the paper, this could be seen as recursively transforming the original data manifold so that data from different classes are moved apart, leading to better linear separability in the subsequent layers. [sent-19, score-0.233]
13 Starting from data manifolds that are not linearly separable, our method transforms the data manifolds in a stacked way to find a linear separating hyperplane in the high layers, which corresponds to non-linear separating hyperplanes in the lower layers. [sent-22, score-0.538]
14 Non-linear classification is achieved without kernelization, using a recursive architecture. [sent-23, score-0.146]
15 Our model does not require any complex learning techniques other than training linear SVMs, while canonical deep architectures usually require carefully designed pre-training and fine-tuning steps, which often depend on specific applications. [sent-24, score-0.347]
16 Using linear SVMs as building blocks, our model scales in the same way as the linear SVM does, enabling fast computation during both training and testing time. [sent-25, score-0.316]
17 From a kernel based perspective, our method could be viewed as a special non-linear SVM, with the benefit that the non-linear kernel naturally emerges from the stacked structure instead of being defined as in conventional algorithms. [sent-27, score-0.251]
18 Our findings suggest that the proposed model, while keeping the simplicity and efficiency of training a linear SVM, can exploit non-linear dependencies with the proposed deep architecture, as suggested by the results on two well known vision and speech datasets. [sent-30, score-0.504]
19 It exhibits better generalization (i.e., a smaller generalization gap), which is a desirable property inherited from the linear model used in the architecture presented in the paper. [sent-33, score-0.216]
20 2 Previous Work. There has been a trend in object, acoustic, and image classification to move the complexity from the classifier to the feature extraction step. [sent-34, score-0.164]
21 In [4], the authors note that the choice of codebook does not seem to impact performance significantly, and encoding via an inner product plus a non-linearity can effectively replace sparse coding, making testing significantly simpler and faster. [sent-40, score-0.308]
22 A disturbing issue with sparse coding + linear classification is that, with a limited codebook size, linear separability may be too strong an assumption, undermining the use of a single linear classifier. [sent-41, score-0.654]
23 This has been empirically verified: as we increase the codebook size, the performance keeps improving [4], indicating that such representations may not be able to fully exploit the complexity of the data [2]. [sent-42, score-0.25]
24 In fact, recent success on PASCAL VOC could partially be attributed to a huge codebook [25]. [sent-43, score-0.22]
25 While this is theoretically valid, the practical advantage of linear models diminishes quickly, as the computational cost of feature generation, as well as of training a high-dimensional (though linear) classifier, can make it as expensive as classical non-linear classifiers. [sent-44, score-0.195]
26 Despite this trend to rely on linear classifiers and overcomplete feature representations, sparse coding is still a flat model, and efforts have been made to add flexibility to the features. [sent-45, score-0.311]
27 Our approach can be seen as an extension to sparse coding used in a stacked architecture. [sent-47, score-0.314]
28 The method presented in this paper is a new stacking technique that has close connections to several stacking methods developed in the literature, which are briefly surveyed in this section. [sent-49, score-0.556]
29 In [23], the concept of stacking was proposed where simple modules of functions or classifiers are “stacked” on top of each other in order to learn complex functions or classifiers. [sent-50, score-0.326]
30 Since then, various ways of implementing stacking operations have been developed, and they can be divided into two general categories. [sent-51, score-0.278]
31 In the first category, stacking is performed in a layer-by-layer fashion and typically involves no supervised information. [sent-52, score-0.278]
32 This gives rise to multiple layers in unsupervised feature learning, as exemplified in Deep Belief Networks [14, 13, 9], layered Convolutional Neural Networks [15], Deep Auto-encoder [14, 9], etc. [sent-53, score-0.228]
33 Applications of such stacking methods include object recognition [15, 26, 4], speech recognition [20], etc. [sent-54, score-0.517]
34 In the second category of techniques, stacking is carried out using supervised information. [sent-55, score-0.312]
35 The modules of the stacking architectures are typically simple classifiers. [sent-56, score-0.371]
36 The new features for the stacked classifier at a higher level of the hierarchy come from concatenation of the classifier output of lower modules and the raw input features. [sent-57, score-0.378]
37 Cohen and de Carvalho [5] developed a stacking architecture where the simple module is a Conditional Random Field. [sent-58, score-0.416]
38 Another successful stacking architecture reported in [10, 11] uses supervised information for stacking, where the basic module is a simplified form of multilayer perceptron in which the output units are linear and the hidden units are sigmoidal (nonlinear). [sent-59, score-0.977]
39 The linearity in the output units permits highly efficient, closed-form estimation (results of convex optimization) for the output network weights given the hidden units’ outputs. [sent-60, score-0.128]
40 Stacked context has also been used in [3], where a set of classifier scores are stacked to produce a more reliable detection. [sent-61, score-0.167]
41 Our proposed method will build a stacked architecture where each layer is an SVM, which has proven to be a very successful classifier for computer vision applications. [sent-62, score-0.546]
42 Specifically, we consider a training set that contains N pairs of tuples (d(i), y(i)), where d(i) ∈ RD is the feature vector and y(i) ∈ {1, . . . , C} is the corresponding class label. [sent-64, score-0.15]
43 As depicted in Figure 2(b), the model is built from multiple layers of blocks, which we call Random SVMs, each of which learns a linear SVM classifier and transforms the data based on a random projection of the previous layers' SVM outputs. [sent-68, score-0.415]
44 Recursive Transform of Input Features: Figure 2(b) visualizes one typical layer in the pipeline of our algorithm. [sent-73, score-0.304]
45 Each layer takes the output of the previous layer (starting from x1 = d for the first layer as our initial input) and feeds it to a standard linear SVM that gives the output o1. [sent-74, score-0.549]
46 (a) The model is built with layers of Random SVM blocks, which are based on simple linear SVMs. [sent-78, score-0.204]
47 (b) For each random SVM layer, we train a linear SVM using the transformed data manifold by combining the original features and random projections of previous layers’ predictions. [sent-80, score-0.197]
48 The sigmoid function controls the scale of the resulting features, and at the same time prevents the random projection from being “too confident” on some data points, as the prediction of the lower layer is still imperfect. [sent-85, score-0.188]
49 (i.e., ol = ec, where ec is the one-hot encoding representing class c), so the fact that they are approximately orthogonal means that (with high probability) they are pushing the per-class manifolds apart. [sent-89, score-0.178]
50 Following [10], for each layer we use the outputs from all lower modules, instead of only the immediately lower module. [sent-92, score-0.203]
51 A chief difference of our proposed method from previous approaches is that, instead of concatenating predictions with the raw input data to form the new expanded input data, we use the predictions to modify the features in the original space with a non-linear transformation. [sent-93, score-0.254]
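Putting the pieces above together, a hedged end-to-end sketch with scikit-learn looks as follows; the helper names (r2svm_fit, r2svm_predict), the use of LinearSVC.decision_function as the per-class score, and the default values of beta and C are illustrative assumptions rather than the authors' reference implementation.

```python
import numpy as np
from sklearn.svm import LinearSVC

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def r2svm_fit(d, y, n_layers=5, beta=0.5, C=1.0, seed=0):
    """Train a stack of linear SVMs; each layer sees the original features d
    shifted by a random projection of all lower layers' squashed scores."""
    rng = np.random.default_rng(seed)
    n, dim = d.shape
    svms, projections, scores_so_far = [], [], []
    x = d
    for _ in range(n_layers):
        clf = LinearSVC(C=C).fit(x, y)
        svms.append(clf)
        scores_so_far.append(clf.decision_function(x).reshape(n, -1))
        o = np.hstack(scores_so_far)              # outputs of all lower modules
        W = rng.standard_normal((dim, o.shape[1]))
        projections.append(W)
        x = d + beta * sigmoid(o) @ W.T           # shift the original feature space
    return svms, projections, beta

def r2svm_predict(d, svms, projections, beta):
    n = d.shape[0]
    x, scores_so_far = d, []
    for clf, W in zip(svms, projections):
        scores_so_far.append(clf.decision_function(x).reshape(n, -1))
        x = d + beta * sigmoid(np.hstack(scores_so_far)) @ W.T
    last = scores_so_far[-1]                      # deepest layer's scores decide
    if last.shape[1] == 1:                        # binary case: sign of the margin
        return svms[-1].classes_[(last.ravel() > 0).astype(int)]
    return svms[-1].classes_[last.argmax(axis=1)]
```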
52 The following Lemma illustrates the fact that, if we are given an oracle prediction of the labels, it is possible to add an offset to each class to “pull” the manifolds apart with this new architecture, and to guarantee an improvement on the training set if we assume perfect labels. [sent-98, score-0.182]
53 Lemma 3.1 would work for any monotonically decreasing loss function (in particular, for the hinge loss of the SVM), and motivates our search for a transform of the original features to achieve linear separability, under the guidance of SVM predictions. [sent-114, score-0.138]
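The exact statement of Lemma 3.1 is not reproduced in this summary, so the following LaTeX note is only a reconstruction of the oracle-offset intuition from the surrounding sentences.

```latex
% Reconstruction of the oracle-offset intuition behind Lemma 3.1 (not verbatim).
% With an oracle prediction, the output for sample i is the one-hot vector
% e_{y^{(i)}}, and the transformed feature becomes
\[
  x^{(i)} \;=\; d^{(i)} + \beta\, W e_{y^{(i)}} \;=\; d^{(i)} + \beta\, w_{y^{(i)}},
\]
% where w_c is column c of the random matrix W. Random high-dimensional columns
% satisfy w_c^\top w_{c'} \approx 0 for c \neq c', so every class is translated
% along (approximately) its own direction and the per-class manifolds are pulled
% apart. For any monotonically decreasing loss (e.g. the hinge loss), a linear
% separator on the shifted data can then achieve a training loss no larger than
% on the original data, which is the guarantee the lemma gives under oracle labels.
```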
54 This contrasts with the deterministic offset of Lemma 3.1 (which degenerates due to imperfect predictions), or with alternative stacking strategies such as concatenation as in [10]. [sent-121, score-0.349]
55 In general, we aim to avoid supervision in the projection parameters, as trying to optimize the weights jointly would defeat the purpose of having a computationally efficient method, and would, perhaps, increase training accuracy at the expense of over-fitting. [sent-123, score-0.176]
56 The risk of over-fitting is also lower in this way, as we do not increase the dimensionality of the input space, and we do not learn the matrices Wl , which means we pass a weak signal from layer to layer. [sent-124, score-0.277]
57 The first layer of our approach is identical to the linear SVM, which is not able to separate the data well. [sent-130, score-0.276]
58 However, when classifiers are recursively stacked in our approach, the classification hyperplane is able to adapt to the nonlinear characteristics of the two classes. [sent-131, score-0.25]
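A small two-class toy run illustrates this behaviour; it assumes the r2svm_fit / r2svm_predict helpers from the sketch above are in scope and uses scikit-learn's make_moons as a stand-in for the synthetic data discussed here.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

# Assumes the r2svm_fit / r2svm_predict sketch from above is in scope.
X, y = make_moons(n_samples=1000, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for depth in (1, 2, 8):                     # depth 1 is exactly a linear SVM
    svms, Ws, beta = r2svm_fit(X_tr, y_tr, n_layers=depth, beta=0.5, C=1.0)
    acc = np.mean(r2svm_predict(X_te, svms, Ws, beta) == y_te)
    print(f"layers={depth:2d}  test accuracy={acc:.3f}")
```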
59 Figure 3: Classification hyperplane from different stages of our algorithm: first layer, second layer, and final layer outputs. [sent-133, score-0.257]
60 Figure 4(b): Accuracy versus codebook size on CIFAR-10 for linear SVM, RBF SVM, and our proposed method. [sent-139, score-0.293]
61 TIMIT is a speech database that contains two orders of magnitude more training samples than the other datasets, and the largest output label space. [sent-142, score-0.233]
62 Recall that our method relies on two parameters: β, which is the factor that controls how much to shift the original feature space, and C, the regularization parameter of the linear SVM trained at each layer. [sent-143, score-0.197]
63 C controls the regularization of each layer, and is an important parameter – setting it too high will yield overfitting as the number of layers is increased. [sent-145, score-0.161]
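One simple way to select these two parameters, assuming the r2svm_fit / r2svm_predict helpers sketched earlier and a held-out validation split (the concrete selection protocol used in the experiments is not spelled out here), is a small grid search:

```python
import itertools
import numpy as np

# Assumes r2svm_fit / r2svm_predict from the earlier sketch, plus a held-out
# split (X_tr, y_tr) / (X_val, y_val). Grid values are illustrative only.
best_params, best_acc = None, -np.inf
for beta, C in itertools.product([0.1, 0.5, 1.0], [0.01, 0.1, 1.0]):
    svms, Ws, b = r2svm_fit(X_tr, y_tr, n_layers=10, beta=beta, C=C)
    acc = np.mean(r2svm_predict(X_val, svms, Ws, b) == y_val)
    if acc > best_acc:
        best_params, best_acc = (beta, C), acc
print("selected (beta, C):", best_params, "validation accuracy:", round(best_acc, 3))
```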
64 As a result, even if the training and testing sets are fixed, randomness still exists in our algorithm. [sent-148, score-0.127]
65 For this dataset, we follow the standard pipeline defined in [4]: dense 6x6 local patches with ZCA whitening are extracted with stride 1, and thresholding coding with α = 0. [sent-153, score-0.187]
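For readers unfamiliar with this pipeline, the sketch below shows one common way to implement the patch extraction, ZCA whitening, and threshold-encoding steps in the style of [4]; the dictionary D is assumed to come from OMP-1 (or k-means) training, and the threshold value alpha is a placeholder because the sentence above is truncated.

```python
import numpy as np

def zca_whitener(patches, eps=0.1):
    """Fit a ZCA whitening transform on (n_patches x dim) flattened patches."""
    mean = patches.mean(axis=0)
    cov = np.cov(patches - mean, rowvar=False)
    U, S, _ = np.linalg.svd(cov)
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T
    return mean, W

def extract_patches(img, size=6, stride=1):
    """Densely extract size x size patches (stride 1) from an H x W x 3 image."""
    H, W_, _ = img.shape
    out = []
    for i in range(0, H - size + 1, stride):
        for j in range(0, W_ - size + 1, stride):
            out.append(img[i:i + size, j:j + size, :].ravel())
    return np.asarray(out)

def threshold_encode(patches, mean, W, D, alpha=0.25):
    """Soft-threshold coding f(x) = max(0, D^T x - alpha) on whitened patches.
    alpha is a placeholder value; D is a learned dictionary of shape (dim x K)."""
    z = (patches - mean) @ W @ D
    return np.maximum(0.0, z - alpha)
```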
66 As shown in Figure 4(b), the performance is almost monotonically increasing as we stack more layers in R2 SVM. [sent-158, score-0.131]
67 Also, stacks of SVMs built by concatenating the output and input feature spaces do not yield much gain above one layer (which is a linear SVM), and neither does a deterministic variant of our method. [Table 2: Results on CIFAR-10, with 25 training data per class.] [sent-159, score-0.562]
68 Table 1: Results on CIFAR-10, with different codebook sizes (hence feature dimensions). [sent-160, score-0.279]
69 This deterministic variant is a version of recursive SVM where the projection matrix is constructed as in the proof of Lemma 3.1 instead of being drawn at random. [sent-183, score-0.226]
70 Note that training each layer involves training a linear SVM, so the computational complexity is simply linear in the depth of our model. [sent-186, score-0.475]
71 In contrast, training deep learning models with many hidden layers may be significantly harder, partially due to the lack of supervised information for the hidden layers. [sent-187, score-0.414]
72 Figure 4(b) shows the effect that the feature dimensionality (controlled by the codebook size of OMP-1) has on the performance of the linear and non-linear classifiers, and Table 1 provides representative numerical results. [sent-188, score-0.399]
73 In particular, when the codebook size is low, the assumption that we can approximate the non-linear function f as a globally linear classifier fails, and in those cases the R2 SVM and RBF SVM clearly outperform the linear SVM. [sent-189, score-0.366]
74 Moreover, as the codebook size grows, non-linear classifiers, represented by RBF SVM in our experiments, suffer from the curse of dimensionality partially due to the large dimensionality of the over-complete feature representation. [sent-190, score-0.411]
75 For linear SVM, increasing the codebook size makes it perform better with respect to non-linear classifiers, but additional gains can still be consistently obtained by the Random Recursive SVM method. [sent-192, score-0.293]
76 Also note how our model outperforms DCN, another stacking architecture proposed in [10]. [sent-193, score-0.387]
77 Similar to the change of codebook sizes, it is interesting to experiment with the number of training examples per class. [sent-194, score-0.283]
78 This again suggests that our proposed method may generalize better than the RBF SVM, which is a desirable property when the number of training examples is small with respect to the dimensionality of the feature space, a case of interest in many computer vision applications. [sent-196, score-0.236]
79 In general, our method is able to combine the advantages of both linear and nonlinear SVM: it has higher representation power than linear SVM, providing consistent performance gains, and at the same time has a better robustness against overfitting. [sent-197, score-0.146]
80 It is also worth pointing out again that R2 SVM is highly efficient, since each layer is a simple linear SVM that can be carried out by simple matrix multiplication. [sent-198, score-0.31]
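At test time the whole stack indeed reduces to a handful of dense matrix products plus element-wise sigmoids; the sketch below spells this out for a trained stack, with the per-layer SVM weights and biases, the random projections, and the composition rule all following the same assumptions as the earlier sketches.

```python
import numpy as np

def r2svm_forward(d, svm_weights, svm_biases, projections, beta=0.5):
    """Test-time pass: only matrix multiplications and element-wise sigmoids.
    svm_weights[l] is the (C x D) one-vs-rest weight matrix of layer l's linear
    SVM (C >= 2 classes assumed), svm_biases[l] is (C,), and projections[l] is
    the (D x l*C) random matrix used during training."""
    x, scores = d, []
    for A, b, W in zip(svm_weights, svm_biases, projections):
        scores.append(x @ A.T + b)                       # one matmul per layer
        o = np.hstack(scores)
        x = d + beta * (1.0 / (1.0 + np.exp(-o))) @ W.T  # one more matmul
    return scores[-1].argmax(axis=1)                     # class index, last layer
```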
81 TIMIT Finally, we report our experiments using the popular speech database TIMIT. [sent-200, score-0.135]
82 The speech data is analyzed using a 25-ms Hamming window with a 10-ms fixed frame rate. [sent-201, score-0.135]
83 We represent the speech using first- to 12th-order Mel frequency cepstral coefficients (MFCCs) and energy, along with their first and second temporal derivatives. [sent-202, score-0.135]
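A hedged sketch of this front end using librosa is given below; the 25-ms Hamming window, 10-ms frame rate, 12 cepstra plus energy, and first/second derivatives come from the text, while the sampling rate and the use of the 0th cepstral coefficient as the energy term are assumptions.

```python
import numpy as np
import librosa

def timit_style_features(wav_path, sr=16000):
    """MFCCs (12 coefficients plus an energy term) with deltas and delta-deltas,
    25-ms Hamming window, 10-ms hop. A sketch of the front end described above."""
    y, sr = librosa.load(wav_path, sr=sr)
    n_fft = int(0.025 * sr)                     # 25-ms analysis window
    hop = int(0.010 * sr)                       # 10-ms fixed frame rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft,
                                hop_length=hop, win_length=n_fft,
                                window="hamming")
    d1 = librosa.feature.delta(mfcc, order=1)
    d2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, d1, d2]).T          # (frames, 39) feature matrix
```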
84 In Table 3 we also report recent work on this dataset [10], which uses a multi-layer perceptron with a hidden layer and linear output, and stacks blocks on top of one another. [sent-224, score-0.373]
85 In their experiments, the representation derived from the speech signal is not sparse, and instead uses a Restricted Boltzmann Machine, which is more time-consuming to learn. [sent-225, score-0.135]
86 Second, it can offer better generalization ability than nonlinear SVMs, especially when the ratio of dimensionality to the number of training samples is large. [sent-231, score-0.174]
87 These advantages, combined with the fact that R2 SVM is efficient in both training and testing, suggest that it could be adopted as an improvement over the existing classification pipeline in general. [sent-232, score-0.189]
88 We also note that in the current work we have not applied fine-tuning techniques similar to the one employed in the architecture of [10]. [sent-233, score-0.139]
89 Fine tuning of the latter architecture has accounted for between 10% and 20% error reduction, and reduces the need for large depth in order to achieve a fixed level of recognition accuracy. [sent-234, score-0.176]
90 We combined the simplicity of linear SVMs with the power derived from deep architectures, and proposed a new stacking technique for building a better classifier, using linear SVMs as the base building blocks and employing a random non-linear projection to add flexibility to the model. [sent-238, score-0.79]
91 Our work is partially motivated by the recent trend of using coding techniques as feature representation with relatively large dictionaries. [sent-239, score-0.208]
92 The chief advantage of our method lies in the fact that it learns non-linear classifiers without the need of kernel design, while keeping the efficiency of linear SVMs. [sent-240, score-0.156]
93 Experimental results on vision and speech datasets showed that the method provides consistent improvement over linear baselines, even with no learning of the model parameters. [sent-241, score-0.302]
94 The importance of encoding versus training with sparse coding and vector quantization. [sent-253, score-0.239]
95 Binary coding of speech spectrograms using a deep auto-encoder. [sent-268, score-0.418]
96 Deep convex network: A scalable architecture for deep learning. [sent-271, score-0.275]
97 What is the best multi-stage architecture for object recognition? [sent-286, score-0.139]
98 Investigation of full-sequence training of deep belief networks for speech recognition. [sent-304, score-0.393]
99 Linear spatial pyramid matching using sparse coding for image classification. [sent-316, score-0.174]
100 Efficient highly over-complete sparse coding using a mixture model. [sent-319, score-0.147]
wordName wordTfidf (topN-words)
[('svm', 0.548), ('stacking', 0.278), ('codebook', 0.22), ('layer', 0.203), ('dcn', 0.188), ('svms', 0.174), ('stacked', 0.167), ('deep', 0.166), ('recursive', 0.146), ('rbf', 0.144), ('speech', 0.135), ('wl', 0.131), ('layers', 0.131), ('coding', 0.117), ('classi', 0.117), ('architecture', 0.109), ('rsvm', 0.094), ('projection', 0.08), ('ft', 0.079), ('er', 0.078), ('ers', 0.074), ('linear', 0.073), ('interspeech', 0.072), ('concatenation', 0.071), ('pipeline', 0.07), ('separability', 0.068), ('vision', 0.067), ('manifolds', 0.065), ('training', 0.063), ('xl', 0.062), ('deng', 0.061), ('feature', 0.059), ('ol', 0.057), ('separating', 0.057), ('hyperplane', 0.054), ('berkeley', 0.054), ('sigmoid', 0.05), ('yu', 0.049), ('phone', 0.049), ('modules', 0.048), ('multilayer', 0.047), ('predictions', 0.047), ('dimensionality', 0.047), ('vinyals', 0.047), ('ot', 0.046), ('acoustic', 0.046), ('architectures', 0.045), ('mohamed', 0.044), ('building', 0.042), ('kernel', 0.042), ('chief', 0.041), ('faced', 0.04), ('perceptron', 0.039), ('layered', 0.038), ('suffer', 0.038), ('recognition', 0.037), ('codes', 0.036), ('blocks', 0.036), ('uc', 0.036), ('timit', 0.036), ('output', 0.035), ('randomness', 0.035), ('original', 0.035), ('carried', 0.034), ('wy', 0.034), ('speakers', 0.034), ('cvpr', 0.034), ('generalization', 0.034), ('accuracy', 0.033), ('philosophy', 0.033), ('trend', 0.032), ('stacks', 0.031), ('maji', 0.031), ('visualizes', 0.031), ('projections', 0.031), ('units', 0.031), ('object', 0.03), ('controls', 0.03), ('representations', 0.03), ('lemma', 0.03), ('rd', 0.03), ('tuning', 0.03), ('especially', 0.03), ('features', 0.03), ('sparse', 0.03), ('adopted', 0.029), ('encoding', 0.029), ('recursively', 0.029), ('module', 0.029), ('networks', 0.029), ('testing', 0.029), ('manifold', 0.028), ('tuples', 0.028), ('prevents', 0.028), ('improvement', 0.027), ('input', 0.027), ('cation', 0.027), ('image', 0.027), ('hidden', 0.027), ('class', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 197 nips-2012-Learning with Recursive Perceptual Representations
Author: Oriol Vinyals, Yangqing Jia, Li Deng, Trevor Darrell
Abstract: Linear Support Vector Machines (SVMs) have become very popular in vision as part of state-of-the-art object recognition and other classification tasks but require high dimensional feature spaces for good performance. Deep learning methods can find more compact representations but current methods employ multilayer perceptrons that require solving a difficult, non-convex optimization problem. We propose a deep non-linear classifier whose layers are SVMs and which incorporates random projection as its core stacking element. Our method learns layers of linear SVMs recursively transforming the original data manifold through a random projection of the weak prediction computed from each layer. Our method scales as linear SVMs, does not rely on any kernel computations or nonconvex optimization, and exhibits better generalization ability than kernel-based SVMs. This is especially true when the number of training samples is smaller than the dimensionality of data, a common scenario in many real-world applications. The use of random projections is key to our method, as we show in the experiments section, in which we observe a consistent improvement over previous –often more complicated– methods on several vision and speech benchmarks. 1
2 0.19106048 188 nips-2012-Learning from Distributions via Support Measure Machines
Author: Krikamol Muandet, Kenji Fukumizu, Francesco Dinuzzo, Bernhard Schölkopf
Abstract: This paper presents a kernel-based discriminative learning framework on probability measures. Rather than relying on large collections of vectorial training examples, our framework learns using a collection of probability distributions that have been constructed to meaningfully represent training data. By representing these probability distributions as mean embeddings in the reproducing kernel Hilbert space (RKHS), we are able to apply many standard kernel-based learning techniques in straightforward fashion. To accomplish this, we construct a generalization of the support vector machine (SVM) called a support measure machine (SMM). Our analyses of SMMs provides several insights into their relationship to traditional SVMs. Based on such insights, we propose a flexible SVM (FlexSVM) that places different kernel functions on each training example. Experimental results on both synthetic and real-world data demonstrate the effectiveness of our proposed framework. 1
3 0.17694177 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing
Author: Ashish Kapoor, Raajay Viswanathan, Prateek Jain
Abstract: In this paper, we present a Bayesian framework for multilabel classiďŹ cation using compressed sensing. The key idea in compressed sensing for multilabel classiďŹ cation is to ďŹ rst project the label vector to a lower dimensional space using a random transformation and then learn regression functions over these projections. Our approach considers both of these components in a single probabilistic model, thereby jointly optimizing over compression as well as learning tasks. We then derive an efďŹ cient variational inference scheme that provides joint posterior distribution over all the unobserved labels. The two key beneďŹ ts of the model are that a) it can naturally handle datasets that have missing labels and b) it can also measure uncertainty in prediction. The uncertainty estimate provided by the model allows for active learning paradigms where an oracle provides information about labels that promise to be maximally informative for the prediction task. Our experiments show signiďŹ cant boost over prior methods in terms of prediction performance over benchmark datasets, both in the fully labeled and the missing labels case. Finally, we also highlight various useful active learning scenarios that are enabled by the probabilistic model. 1
4 0.16320576 158 nips-2012-ImageNet Classification with Deep Convolutional Neural Networks
Author: Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called “dropout” that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry. 1
5 0.15714325 360 nips-2012-Visual Recognition using Embedded Feature Selection for Curvature Self-Similarity
Author: Angela Eigenstetter, Bjorn Ommer
Abstract: Category-level object detection has a crucial need for informative object representations. This demand has led to feature descriptors of ever increasing dimensionality like co-occurrence statistics and self-similarity. In this paper we propose a new object representation based on curvature self-similarity that goes beyond the currently popular approximation of objects using straight lines. However, like all descriptors using second order statistics, ours also exhibits a high dimensionality. Although improving discriminability, the high dimensionality becomes a critical issue due to lack of generalization ability and curse of dimensionality. Given only a limited amount of training data, even sophisticated learning algorithms such as the popular kernel methods are not able to suppress noisy or superfluous dimensions of such high-dimensional data. Consequently, there is a natural need for feature selection when using present-day informative features and, particularly, curvature self-similarity. We therefore suggest an embedded feature selection method for SVMs that reduces complexity and improves generalization capability of object models. By successfully integrating the proposed curvature self-similarity representation together with the embedded feature selection in a widely used state-of-the-art object detection framework we show the general pertinence of the approach. 1
6 0.15132993 168 nips-2012-Kernel Latent SVM for Visual Recognition
7 0.1504382 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models
8 0.15035118 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video
9 0.14833777 92 nips-2012-Deep Representations and Codes for Image Auto-Annotation
10 0.14593209 227 nips-2012-Multiclass Learning with Simplex Coding
11 0.14588545 200 nips-2012-Local Supervised Learning through Space Partitioning
12 0.13790412 87 nips-2012-Convolutional-Recursive Deep Learning for 3D Object Classification
13 0.12223029 361 nips-2012-Volume Regularization for Binary Classification
14 0.11253599 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines
15 0.11253163 193 nips-2012-Learning to Align from Scratch
16 0.10872429 284 nips-2012-Q-MKL: Matrix-induced Regularization in Multi-Kernel Learning with Applications to Neuroimaging
17 0.10809851 337 nips-2012-The Lovász ϑ function, SVMs and finding large dense subgraphs
18 0.10792045 62 nips-2012-Burn-in, bias, and the rationality of anchoring
19 0.10792045 116 nips-2012-Emergence of Object-Selective Features in Unsupervised Feature Learning
20 0.10188857 91 nips-2012-Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images
topicId topicWeight
[(0, 0.24), (1, 0.071), (2, -0.211), (3, -0.033), (4, 0.19), (5, -0.026), (6, 0.012), (7, 0.08), (8, -0.046), (9, -0.106), (10, 0.056), (11, 0.124), (12, 0.079), (13, 0.161), (14, -0.026), (15, -0.169), (16, -0.066), (17, 0.029), (18, 0.041), (19, -0.063), (20, -0.046), (21, -0.056), (22, -0.037), (23, -0.071), (24, -0.054), (25, -0.012), (26, -0.089), (27, -0.063), (28, -0.04), (29, -0.084), (30, -0.098), (31, 0.082), (32, 0.012), (33, 0.033), (34, 0.007), (35, -0.075), (36, 0.085), (37, 0.118), (38, 0.123), (39, 0.016), (40, 0.037), (41, 0.049), (42, -0.011), (43, 0.009), (44, -0.032), (45, -0.039), (46, -0.116), (47, -0.088), (48, -0.007), (49, 0.035)]
simIndex simValue paperId paperTitle
same-paper 1 0.96796584 197 nips-2012-Learning with Recursive Perceptual Representations
Author: Oriol Vinyals, Yangqing Jia, Li Deng, Trevor Darrell
Abstract: Linear Support Vector Machines (SVMs) have become very popular in vision as part of state-of-the-art object recognition and other classification tasks but require high dimensional feature spaces for good performance. Deep learning methods can find more compact representations but current methods employ multilayer perceptrons that require solving a difficult, non-convex optimization problem. We propose a deep non-linear classifier whose layers are SVMs and which incorporates random projection as its core stacking element. Our method learns layers of linear SVMs recursively transforming the original data manifold through a random projection of the weak prediction computed from each layer. Our method scales as linear SVMs, does not rely on any kernel computations or nonconvex optimization, and exhibits better generalization ability than kernel-based SVMs. This is especially true when the number of training samples is smaller than the dimensionality of data, a common scenario in many real-world applications. The use of random projections is key to our method, as we show in the experiments section, in which we observe a consistent improvement over previous –often more complicated– methods on several vision and speech benchmarks. 1
2 0.65992451 170 nips-2012-Large Scale Distributed Deep Networks
Author: Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc'aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Quoc V. Le, Andrew Y. Ng
Abstract: Recent work in unsupervised feature learning and deep learning has shown that being able to train large models can dramatically improve performance. In this paper, we consider the problem of training a deep network with billions of parameters using tens of thousands of CPU cores. We have developed a software framework called DistBelief that can utilize computing clusters with thousands of machines to train large models. Within this framework, we have developed two algorithms for large-scale distributed training: (i) Downpour SGD, an asynchronous stochastic gradient descent procedure supporting a large number of model replicas, and (ii) Sandblaster, a framework that supports a variety of distributed batch optimization procedures, including a distributed implementation of L-BFGS. Downpour SGD and Sandblaster L-BFGS both increase the scale and speed of deep network training. We have successfully used our system to train a deep network 30x larger than previously reported in the literature, and achieves state-of-the-art performance on ImageNet, a visual object recognition task with 16 million images and 21k categories. We show that these same techniques dramatically accelerate the training of a more modestly- sized deep network for a commercial speech recognition service. Although we focus on and report performance of these methods as applied to training large neural networks, the underlying algorithms are applicable to any gradient-based machine learning algorithm. 1
3 0.64878637 360 nips-2012-Visual Recognition using Embedded Feature Selection for Curvature Self-Similarity
Author: Angela Eigenstetter, Bjorn Ommer
Abstract: Category-level object detection has a crucial need for informative object representations. This demand has led to feature descriptors of ever increasing dimensionality like co-occurrence statistics and self-similarity. In this paper we propose a new object representation based on curvature self-similarity that goes beyond the currently popular approximation of objects using straight lines. However, like all descriptors using second order statistics, ours also exhibits a high dimensionality. Although improving discriminability, the high dimensionality becomes a critical issue due to lack of generalization ability and curse of dimensionality. Given only a limited amount of training data, even sophisticated learning algorithms such as the popular kernel methods are not able to suppress noisy or superfluous dimensions of such high-dimensional data. Consequently, there is a natural need for feature selection when using present-day informative features and, particularly, curvature self-similarity. We therefore suggest an embedded feature selection method for SVMs that reduces complexity and improves generalization capability of object models. By successfully integrating the proposed curvature self-similarity representation together with the embedded feature selection in a widely used state-of-the-art object detection framework we show the general pertinence of the approach. 1
4 0.63616514 87 nips-2012-Convolutional-Recursive Deep Learning for 3D Object Classification
Author: Richard Socher, Brody Huval, Bharath Bath, Christopher D. Manning, Andrew Y. Ng
Abstract: Recent advances in 3D sensing technologies make it possible to easily record color and depth images which together can improve object recognition. Most current methods rely on very well-designed features for this new 3D modality. We introduce a model based on a combination of convolutional and recursive neural networks (CNN and RNN) for learning features and classifying RGB-D images. The CNN layer learns low-level translationally invariant features which are then given as inputs to multiple, fixed-tree RNNs in order to compose higher order features. RNNs can be seen as combining convolution and pooling into one efficient, hierarchical operation. Our main result is that even RNNs with random weights compose powerful features. Our model obtains state of the art performance on a standard RGB-D object dataset while being more accurate and faster during training and testing than comparable architectures such as two-layer CNNs. 1
5 0.63206464 93 nips-2012-Deep Spatio-Temporal Architectures and Learning for Protein Structure Prediction
Author: Pietro D. Lena, Ken Nagata, Pierre F. Baldi
Abstract: Residue-residue contact prediction is a fundamental problem in protein structure prediction. Hower, despite considerable research efforts, contact prediction methods are still largely unreliable. Here we introduce a novel deep machine-learning architecture which consists of a multidimensional stack of learning modules. For contact prediction, the idea is implemented as a three-dimensional stack of Neural Networks NNk , where i and j index the spatial coordinates of the contact ij map and k indexes “time”. The temporal dimension is introduced to capture the fact that protein folding is not an instantaneous process, but rather a progressive refinement. Networks at level k in the stack can be trained in supervised fashion to refine the predictions produced by the previous level, hence addressing the problem of vanishing gradients, typical of deep architectures. Increased accuracy and generalization capabilities of this approach are established by rigorous comparison with other classical machine learning approaches for contact prediction. The deep approach leads to an accuracy for difficult long-range contacts of about 30%, roughly 10% above the state-of-the-art. Many variations in the architectures and the training algorithms are possible, leaving room for further improvements. Furthermore, the approach is applicable to other problems with strong underlying spatial and temporal components. 1
6 0.62952363 92 nips-2012-Deep Representations and Codes for Image Auto-Annotation
7 0.62294662 168 nips-2012-Kernel Latent SVM for Visual Recognition
8 0.62024742 158 nips-2012-ImageNet Classification with Deep Convolutional Neural Networks
9 0.61824 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video
10 0.60834676 188 nips-2012-Learning from Distributions via Support Measure Machines
11 0.6039775 101 nips-2012-Discriminatively Trained Sparse Code Gradients for Contour Detection
12 0.59289247 200 nips-2012-Local Supervised Learning through Space Partitioning
13 0.58778876 48 nips-2012-Augmented-SVM: Automatic space partitioning for combining multiple non-linear dynamics
14 0.58729386 72 nips-2012-Cocktail Party Processing via Structured Prediction
15 0.58534181 361 nips-2012-Volume Regularization for Binary Classification
16 0.57497931 159 nips-2012-Image Denoising and Inpainting with Deep Neural Networks
17 0.5689975 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models
18 0.56110173 193 nips-2012-Learning to Align from Scratch
19 0.55157757 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound
20 0.54883945 91 nips-2012-Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images
topicId topicWeight
[(0, 0.065), (17, 0.011), (21, 0.017), (38, 0.104), (39, 0.023), (42, 0.041), (44, 0.015), (54, 0.035), (55, 0.062), (74, 0.07), (76, 0.118), (80, 0.165), (92, 0.083), (96, 0.105)]
simIndex simValue paperId paperTitle
same-paper 1 0.90558052 197 nips-2012-Learning with Recursive Perceptual Representations
Author: Oriol Vinyals, Yangqing Jia, Li Deng, Trevor Darrell
Abstract: Linear Support Vector Machines (SVMs) have become very popular in vision as part of state-of-the-art object recognition and other classification tasks but require high dimensional feature spaces for good performance. Deep learning methods can find more compact representations but current methods employ multilayer perceptrons that require solving a difficult, non-convex optimization problem. We propose a deep non-linear classifier whose layers are SVMs and which incorporates random projection as its core stacking element. Our method learns layers of linear SVMs recursively transforming the original data manifold through a random projection of the weak prediction computed from each layer. Our method scales as linear SVMs, does not rely on any kernel computations or nonconvex optimization, and exhibits better generalization ability than kernel-based SVMs. This is especially true when the number of training samples is smaller than the dimensionality of data, a common scenario in many real-world applications. The use of random projections is key to our method, as we show in the experiments section, in which we observe a consistent improvement over previous –often more complicated– methods on several vision and speech benchmarks. 1
2 0.89633209 66 nips-2012-Causal discovery with scale-mixture model for spatiotemporal variance dependencies
Author: Zhitang Chen, Kun Zhang, Laiwan Chan
Abstract: In conventional causal discovery, structural equation models (SEM) are directly applied to the observed variables, meaning that the causal effect can be represented as a function of the direct causes themselves. However, in many real world problems, there are significant dependencies in the variances or energies, which indicates that causality may possibly take place at the level of variances or energies. In this paper, we propose a probabilistic causal scale-mixture model with spatiotemporal variance dependencies to represent a specific type of generating mechanism of the observations. In particular, the causal mechanism including contemporaneous and temporal causal relations in variances or energies is represented by a Structural Vector AutoRegressive model (SVAR). We prove the identifiability of this model under the non-Gaussian assumption on the innovation processes. We also propose algorithms to estimate the involved parameters and discover the contemporaneous causal structure. Experiments on synthetic and real world data are conducted to show the applicability of the proposed model and algorithms.
3 0.87181747 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines
Author: Nitish Srivastava, Ruslan Salakhutdinov
Abstract: A Deep Boltzmann Machine is described for learning a generative model of data that consists of multiple and diverse input modalities. The model can be used to extract a unified representation that fuses modalities together. We find that this representation is useful for classification and information retrieval tasks. The model works by learning a probability density over the space of multimodal inputs. It uses states of latent variables as representations of the input. The model can extract this representation even when some modalities are absent by sampling from the conditional distribution over them and filling them in. Our experimental results on bi-modal data consisting of images and text show that the Multimodal DBM can learn a good generative model of the joint space of image and text inputs that is useful for information retrieval from both unimodal and multimodal queries. We further demonstrate that this model significantly outperforms SVMs and LDA on discriminative tasks. Finally, we compare our model to other deep learning methods, including autoencoders and deep belief networks, and show that it achieves noticeable gains. 1
4 0.86894548 157 nips-2012-Identification of Recurrent Patterns in the Activation of Brain Networks
Author: Firdaus Janoos, Weichang Li, Niranjan Subrahmanya, Istvan Morocz, William Wells
Abstract: Identifying patterns from the neuroimaging recordings of brain activity related to the unobservable psychological or mental state of an individual can be treated as a unsupervised pattern recognition problem. The main challenges, however, for such an analysis of fMRI data are: a) deďŹ ning a physiologically meaningful feature-space for representing the spatial patterns across time; b) dealing with the high-dimensionality of the data; and c) robustness to the various artifacts and confounds in the fMRI time-series. In this paper, we present a network-aware feature-space to represent the states of a general network, that enables comparing and clustering such states in a manner that is a) meaningful in terms of the network connectivity structure; b)computationally efďŹ cient; c) low-dimensional; and d) relatively robust to structured and random noise artifacts. This feature-space is obtained from a spherical relaxation of the transportation distance metric which measures the cost of transporting “massâ€? over the network to transform one function into another. Through theoretical and empirical assessments, we demonstrate the accuracy and efďŹ ciency of the approximation, especially for large problems. 1
5 0.85896409 251 nips-2012-On Lifting the Gibbs Sampling Algorithm
Author: Deepak Venugopal, Vibhav Gogate
Abstract: First-order probabilistic models combine the power of first-order logic, the de facto tool for handling relational structure, with probabilistic graphical models, the de facto tool for handling uncertainty. Lifted probabilistic inference algorithms for them have been the subject of much recent research. The main idea in these algorithms is to improve the accuracy and scalability of existing graphical models’ inference algorithms by exploiting symmetry in the first-order representation. In this paper, we consider blocked Gibbs sampling, an advanced MCMC scheme, and lift it to the first-order level. We propose to achieve this by partitioning the first-order atoms in the model into a set of disjoint clusters such that exact lifted inference is polynomial in each cluster given an assignment to all other atoms not in the cluster. We propose an approach for constructing the clusters and show how it can be used to trade accuracy with computational complexity in a principled manner. Our experimental evaluation shows that lifted Gibbs sampling is superior to the propositional algorithm in terms of accuracy, scalability and convergence.
6 0.85821879 168 nips-2012-Kernel Latent SVM for Visual Recognition
7 0.85810089 193 nips-2012-Learning to Align from Scratch
8 0.85684621 200 nips-2012-Local Supervised Learning through Space Partitioning
9 0.85627216 65 nips-2012-Cardinality Restricted Boltzmann Machines
10 0.85494834 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs
11 0.85462606 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models
12 0.8540265 100 nips-2012-Discriminative Learning of Sum-Product Networks
13 0.85339606 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems
14 0.8520304 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions
15 0.84676892 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models
16 0.84483105 279 nips-2012-Projection Retrieval for Classification
17 0.84101403 171 nips-2012-Latent Coincidence Analysis: A Hidden Variable Model for Distance Metric Learning
18 0.84050965 234 nips-2012-Multiresolution analysis on the symmetric group
19 0.84036922 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders
20 0.8398332 274 nips-2012-Priors for Diversity in Generative Latent Variable Models