nips nips2009 nips2009-119 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Youngmin Cho, Lawrence K. Saul
Abstract: We introduce a new family of positive-definite kernel functions that mimic the computation in large, multilayer neural nets. These kernel functions can be used in shallow architectures, such as support vector machines (SVMs), or in deep kernel-based architectures that we call multilayer kernel machines (MKMs). We evaluate SVMs and MKMs with these kernel functions on problems designed to illustrate the advantages of deep architectures. On several problems, we obtain better results than previous, leading benchmarks from both SVMs with Gaussian kernels as well as deep belief nets. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We introduce a new family of positive-definite kernel functions that mimic the computation in large, multilayer neural nets. [sent-4, score-0.505]
2 These kernel functions can be used in shallow architectures, such as support vector machines (SVMs), or in deep kernel-based architectures that we call multilayer kernel machines (MKMs). [sent-5, score-1.298]
3 We evaluate SVMs and MKMs with these kernel functions on problems designed to illustrate the advantages of deep architectures. [sent-6, score-0.547]
4 On several problems, we obtain better results than previous, leading benchmarks from both SVMs with Gaussian kernels as well as deep belief nets. [sent-7, score-0.882]
5 1 Introduction Recent work in machine learning has highlighted the circumstances that appear to favor deep architectures, such as multilayer neural nets, over shallow architectures, such as support vector machines (SVMs) [1]. [sent-8, score-0.761]
6 Many issues surround the ongoing debate over deep versus shallow architectures [1, 6]. [sent-12, score-0.633]
7 Like many, we are intrigued by the successes of deep architectures yet drawn to the elegance of kernel methods. [sent-18, score-0.705]
8 In this paper, we explore the possibility of deep learning in kernel machines. [sent-19, score-0.547]
9 Second, using these kernel functions, we show how to train multilayer kernel machines (MKMs) that benefit from many advantages of deep learning. [sent-23, score-1.03]
10 2 Arc-cosine kernels In this section, we develop a new family of kernel functions for computing the similarity of vector inputs x, y ∈ ℝ^d. [sent-30, score-0.7]
11 We define the nth-order arc-cosine kernel function via the integral representation: k_n(x, y) = 2 ∫ dw [e^(−||w||²/2) / (2π)^(d/2)] Θ(w · x) Θ(w · y) (w · x)^n (w · y)^n (1). The integral representation makes it straightforward to show that these kernel functions are positive semidefinite. [sent-32, score-0.503]
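The integral in eq. (1) is an expectation over Gaussian-distributed weights w, so it can be checked numerically. Below is a minimal Monte Carlo sketch (not from the paper; the function name, sample size, and test vector are illustrative) that estimates k_n(x, y) by sampling w ~ N(0, I):

    import numpy as np

    def arc_cosine_kernel_mc(x, y, n, num_samples=200000, seed=0):
        """Monte Carlo estimate of eq. (1):
        k_n(x, y) = 2 E_{w ~ N(0, I)} [Theta(w.x) Theta(w.y) (w.x)^n (w.y)^n],
        where Theta is the Heaviside step function."""
        rng = np.random.default_rng(seed)
        W = rng.standard_normal((num_samples, len(x)))  # rows are draws of w
        wx, wy = W @ x, W @ y                           # projections w.x and w.y
        on = (wx > 0) & (wy > 0)                        # Theta(w.x) * Theta(w.y)
        return 2.0 * np.mean(on * wx**n * wy**n)

    # Sanity checks implied directly by eq. (1): k_0(x, x) = 1 and k_1(x, x) = ||x||^2.
    x = np.array([0.6, -0.8, 0.0])
    print(arc_cosine_kernel_mc(x, x, n=0))  # close to 1.0
    print(arc_cosine_kernel_mc(x, x, n=1))  # close to ||x||^2 = 1.0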
12 As a practical matter, we note that arc-cosine kernels do not have any continuous tuning parameters (such as the kernel width in RBF kernels), which can be laborious to set by cross-validation. [sent-57, score-0.67]
13 Arc-cosine kernels differ from polynomial and RBF kernels in one especially interesting respect. [sent-100, score-0.812]
14 As suggested by eq. (1), arc-cosine kernels induce feature spaces that mimic the sparse, nonnegative, distributed representations of single-layer threshold networks. [sent-102, score-0.52]
15 Polynomial and RBF kernels do not encode their inputs in this way. [sent-103, score-0.504]
16 In particular, the feature vector induced by polynomial kernels is neither sparse nor nonnegative, while the feature vector induced by RBF kernels resembles the localized output of a soft vector quantizer. [sent-104, score-0.906]
17 3 Computation in multilayer threshold networks A kernel function can be viewed as inducing a nonlinear mapping from inputs x to feature vectors Φ(x). [sent-107, score-0.701]
18-22 [Figure caption residue] Left and right panels: classification error rates on the test sets; SVMs with arc-cosine kernels have error rates from 22.…% and from 17.…%, respectively, compared against DBN-3. Results are shown for kernels of varying degree (n) and levels of recursion (ℓ).
23 After cross-validation, we retrained each SVM using all the training examples; for reference, we also report the best results obtained previously from three-layer deep belief nets (DBN-3) and SVMs with RBF kernels (SVM-RBF). [sent-147, score-1.107]
24 These references are representative of the state-of-the-art for deep and shallow architectures on these data sets. [sent-148, score-0.633]
25 Consider the iterated mapping that maps x to Φ(Φ(x)); for the linear kernel, the result is trivial: we obtain the identity map Φ(Φ(x)) = Φ(x) = x. [sent-152, score-1.155]
26 We experimented with kernels of degree n = 0, 1, and 2, corresponding to threshold networks with "step", "ramp", and "quarter-pipe" activation functions. [sent-154, score-0.962]
27 We also experimented with the multilayer kernels described in section 2. [sent-156, score-0.706]
28 For polynomial kernels, the composition computes Φ(Φ(x)) · Φ(Φ(y)) = (Φ(x) · Φ(y))^d (10). [sent-159, score-0.883]
29 The above result is not especially interesting: the kernel implied by this composition is also polynomial, merely of higher degree than the kernel from which it was constructed. [sent-161, score-0.953]
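To spell out the step, suppose Φ is the feature map of the degree-d polynomial kernel, so that Φ(x) · Φ(y) = (x · y)^d (an identification assumed here from the surrounding discussion). Then eq. (10) gives

    Φ(Φ(x)) · Φ(Φ(y)) = (Φ(x) · Φ(y))^d = ((x · y)^d)^d = (x · y)^(d²),

so two layers of composition simply produce another polynomial kernel, of degree d².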
30 However, the composed RBF kernel is essentially uninformative whenever the distance between inputs is large compared to the kernel width. [sent-169, score-0.691]
31 while higher-order (n > 1) kernels may induce feature spaces with severely distorted dynamic ranges. [sent-171, score-0.516]
32 For arc-cosine kernels, by contrast, the kernel induced by multilayer composition can be computed in closed form; we state the result in the form of a recursion. [sent-174, score-0.548]
33 The inductive step is given by: k_n^(ℓ+1)(x, y) = (1/π) […]. [sent-178, score-0.684]
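The right-hand side of the inductive step is truncated above. The sketch below assumes the published form of the recursion, k_n^(ℓ+1)(x, y) = (1/π) [k_n^(ℓ)(x, x) k_n^(ℓ)(y, y)]^(n/2) J_n(θ^(ℓ)), together with the closed forms J_0(θ) = π − θ and J_1(θ) = sin θ + (π − θ) cos θ; these expressions are assumptions here (they are not reproduced in this extraction), and the code is only an illustration of how an (ℓ+1)-level kernel matrix is built from ℓ-level values.

    import numpy as np

    def J(n, theta):
        # Assumed angular dependence J_n(theta) for n = 0 and n = 1.
        if n == 0:
            return np.pi - theta
        if n == 1:
            return np.sin(theta) + (np.pi - theta) * np.cos(theta)
        raise NotImplementedError("only n = 0 and n = 1 are sketched here")

    def arc_cosine_gram(X, n):
        # Single-level (l = 1) Gram matrix, using the assumed closed form
        # k_n(x, y) = (1/pi) ||x||^n ||y||^n J_n(theta).
        norms = np.linalg.norm(X, axis=1)
        cos_t = (X @ X.T) / np.outer(norms, norms)
        theta = np.arccos(np.clip(cos_t, -1.0, 1.0))
        return (1.0 / np.pi) * np.outer(norms, norms) ** n * J(n, theta)

    def multilayer_arc_cosine_gram(X, n, levels):
        # Iterate the assumed inductive step to reach the requested level.
        K = arc_cosine_gram(X, n)
        for _ in range(levels - 1):
            diag = np.diag(K)
            cos_t = K / np.sqrt(np.outer(diag, diag))
            theta = np.arccos(np.clip(cos_t, -1.0, 1.0))
            K = (1.0 / np.pi) * np.outer(diag, diag) ** (n / 2.0) * J(n, theta)
        return K

    X = np.random.default_rng(0).standard_normal((5, 3))
    print(multilayer_arc_cosine_gram(X, n=1, levels=3))  # a 5 x 5 Gram matrix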
34 The resulting kernels mimic the computations in large multilayer threshold networks. [sent-183, score-0.736]
35 These data sets were specifically constructed to compare deep architectures and kernel machines [11]. [sent-189, score-0.738]
36-39 [Figure caption residue, repeated] SVMs with arc-cosine kernels have error rates from 22.…% (left panel) and from 17.…% (right panel); results are shown for kernels of varying degree (n) and levels of recursion (ℓ).
40 After cross-validation, we then retrained each SVM using all the training examples; for reference, we also report the best results obtained previously from three-layer deep belief nets (DBN-3) and SVMs with RBF kernels (SVM-RBF). [sent-220, score-1.107]
41 These references are representative of the state-of-the-art for deep and shallow architectures on these data sets. [sent-221, score-0.633]
42 The test sets have 50000 examples and have been extensively benchmarked by previous experiments; the right panels of figures 2 and 3 show the test set error rates from arc-cosine kernels of varying degree (n) and levels of recursion (ℓ). [sent-225, score-0.871]
43 We also experimented with the multilayer kernels described in section 2. [sent-228, score-0.648]
44 Overall, the figures show that on these two data sets, SVMs with different arc-cosine kernels outperform the best results previously reported for SVMs with RBF kernels and deep belief nets. [sent-233, score-0.726]
45 We followed the same experimental methodology as previous authors [11]: SVMs were trained after choosing 2000 training examples as a validation set to select the margin penalty parameter, and after cross-validation we retrained each SVM using all the training examples. [sent-234, score-0.862]
46 At a high level, though, we note that SVMs with arc-cosine kernels are very straightforward to train; unlike SVMs with RBF kernels, they do not require tuning a kernel width parameter. [sent-238, score-0.857]
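As an illustration of this protocol (a sketch, not the authors' scripts), the code below trains an SVM on a precomputed arc-cosine Gram matrix, holds out the last 2000 training examples to pick the margin penalty C, and then retrains on all training examples; the Gram matrices, label arrays, and candidate C grid are placeholders.

    import numpy as np
    from sklearn.svm import SVC

    def select_C_and_retrain(K_train, y_train, C_grid, n_val=2000):
        """Pick the margin penalty C on a held-out tail of the training set,
        then retrain on the full training set with a precomputed kernel."""
        n = len(y_train)
        tr = np.arange(n - n_val)
        va = np.arange(n - n_val, n)
        best_C, best_acc = None, -1.0
        for C in C_grid:
            svm = SVC(C=C, kernel="precomputed")
            svm.fit(K_train[np.ix_(tr, tr)], y_train[tr])
            acc = svm.score(K_train[np.ix_(va, tr)], y_train[va])
            if acc > best_acc:
                best_C, best_acc = C, acc
        final = SVC(C=best_C, kernel="precomputed")
        final.fit(K_train, y_train)  # retrain on all training examples
        return final, best_C

    # Hypothetical usage: K_train is an (n_train, n_train) arc-cosine Gram matrix,
    # K_test is (n_test, n_train), and y_train / y_test are label arrays.
    # model, C = select_C_and_retrain(K_train, y_train, C_grid=[0.1, 1, 10, 100])
    # test_error = 1.0 - model.score(K_test, y_test)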
47 Overall, the figures show that many SVMs with arc-cosine kernels outperform traditional SVMs on these data sets. [sent-240, score-1.065]
48 We also experimented with the multilayer kernels described in section 2, with up to six levels of recursion. [sent-247, score-0.769]
50 Unlike SVMs with RBF kernels, they do not require tuning a kernel width parameter, and unlike deep belief nets, they do not require solving a difficult optimization or searching over possible architectures. [sent-252, score-1.362]
51 Our experiments with multilayer kernels revealed that these SVMs only performed well when arc-cosine kernels of degree n = 1 were used at higher (ℓ > 1) levels in the recursion. [sent-254, score-1.324]
52 Figures 2 and 3 therefore show only these sets of results; in particular, each group of bars shows the test error rates when a particular kernel (of degree n = 0, 1, 2) was used at the first layer of nonlinearity, while the n = 1 kernel was used at successive layers. [sent-256, score-0.781]
53 We hypothesize that only n = 1 arc-cosine kernels preserve sufficient information about the magnitude of their inputs to work effectively in composition with other kernels. [sent-257, score-0.615]
54 Recall that only the n = 1 arc-cosine kernel preserves the norm of its inputs: the n = 0 kernel maps all inputs onto a unit hypersphere in feature space, while higher-order (n > 1) kernels induce feature spaces with different dynamic ranges. [sent-258, score-1.071]
55 Finally, the results on both data sets reveal an interesting trend: the multilayer arc-cosine kernels often perform better than their single-layer counterparts. [sent-259, score-0.67]
56 Though SVMs are (inherently) shallow architectures, this trend suggests that for these problems in binary classification, arc-cosine kernels may be yielding some of the advantages typically associated with deep architectures. [sent-260, score-0.902]
57 3 Deep learning In this section, we explore how to use kernel methods in deep architectures [7]. [sent-261, score-0.705]
58 We show how to train deep kernel-based architectures by a simple combination of supervised and unsupervised methods. [sent-262, score-0.579]
59 Using the arc-cosine kernels in the previous section, these multilayer kernel machines (MKMs) perform very competitively on multiclass data sets designed to foil shallow architectures [11]. [sent-263, score-1.113]
60 This use of kernel PCA was suggested over a decade ago [16] and more recently inspired by the pretraining of deep belief nets by unsupervised methods. [sent-278, score-0.802]
61 In MKMs, the outputs (or features) from kernel PCA at one layer are the inputs to kernel PCA at the next layer. [sent-279, score-0.596]
62 While any nonlinear kernel can be used for the layerwise PCA in MKMs, arc-cosine kernels are natural choices to mimic the computations in large neural nets. [sent-281, score-0.671]
63 The use of LMNN is inspired by the supervised fine-tuning of weights in the training of deep architectures [18]. [sent-296, score-0.554]
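To make the layerwise construction concrete, here is a minimal sketch of the MKM feature pipeline: kernel PCA with an arc-cosine kernel at each layer, with the outputs of one layer feeding the next. The closed-form n = 1 kernel, the layer widths, and the final kNN classifier are assumptions for illustration; the feature-selection step and the LMNN fine-tuning described in the text are only indicated in comments.

    import numpy as np
    from sklearn.decomposition import KernelPCA
    from sklearn.neighbors import KNeighborsClassifier

    def arc_cosine_gram(X, Z):
        # Assumed closed form of the n = 1 arc-cosine kernel between rows of X and Z:
        # k_1(x, z) = (1/pi) ||x|| ||z|| (sin t + (pi - t) cos t), with t = angle(x, z).
        nx, nz = np.linalg.norm(X, axis=1), np.linalg.norm(Z, axis=1)
        cos_t = (X @ Z.T) / np.outer(nx, nz)
        t = np.arccos(np.clip(cos_t, -1.0, 1.0))
        return (1.0 / np.pi) * np.outer(nx, nz) * (np.sin(t) + (np.pi - t) * np.cos(t))

    def mkm_features(X_train, X_test, layer_widths=(300, 90)):
        # The outputs (features) from kernel PCA at one layer are the inputs
        # to kernel PCA at the next layer.
        H_tr, H_te = X_train, X_test
        for width in layer_widths:
            kpca = KernelPCA(n_components=width, kernel="precomputed")
            H_tr, H_te = (kpca.fit_transform(arc_cosine_gram(H_tr, H_tr)),
                          kpca.transform(arc_cosine_gram(H_te, H_tr)))
            # The text also describes pruning uninformative features at each layer
            # and fine-tuning the final features with LMNN before classification;
            # both steps are omitted from this sketch.
        return H_tr, H_te

    # Hypothetical usage with placeholder arrays X_train, y_train, X_test, y_test:
    # H_tr, H_te = mkm_features(X_train, X_test)
    # knn = KNeighborsClassifier(n_neighbors=5).fit(H_tr, y_train)
    # test_error = 1.0 - knn.score(H_te, y_test)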
64-72 [Figure caption, legend, and axis residue] Left/right panels: classification error rates on the test set, with legend entries "Step" (n = 0), "Ramp" (n = 1), and "Quarter-pipe" (n = 2) compared against SVM-RBF and DBN-3; vertical axes report test error rate (%). SVMs with arc-cosine kernels have error rates from 22.…% and from 17.…%, and MKMs with arc-cosine kernels have error rates from 6.…%; see text for details.
73 We experimented with kernels of degree n = 0, 1, and 2. [sent-347, score-0.537]
74 We also experimented with the multilayer kernels described in section 2, with up to six levels of recursion. [sent-351, score-0.774]
75-77 [Figure caption residue] Error rates for SVMs with arc-cosine kernels, compared against SVM-RBF and DBN-3; the best previous results are 19.…% and 18.…%; results are shown for kernels of varying degree (n) and levels of recursion (ℓ).
78-80 [Figure caption residue] Classification error rates on the test set for MKMs with different kernels and numbers of layers; MKMs with arc-cosine kernels have error rates from 18.…%, and the best previous results are 22.61% and 16.…%.
81 For reference, we also report the best results obtained previously from three-layer deep belief nets (DBN-3) and SVMs with RBF kernels (SVM-RBF). [sent-374, score-0.817]
82 We do not have a formal explanation for why only the n = 1 kernels perform well at higher levels of the recursion. [sent-375, score-0.772]
83 We trained MKMs with arc-cosine kernels and RBF kernels in each layer. [sent-378, score-0.708]
84 For each data set, we initially withheld the last 2000 training examples as a validation set. [sent-379, score-1.608]
85 We followed the same methodology as earlier. [sent-383, score-1.043]
86 We then used all 12000 training examples for feature selection and distance metric learning. [sent-387, score-0.839]
88 We chose these 6000 examples randomly, but repeated the selection multiple times to obtain a measure of average performance. [sent-392, score-1.191]
89 In our experiments, we quickly discovered that the multilayer kernels only performed well when n = 1 kernels were used at higher (ℓ > 1) levels in the recursion. [sent-394, score-0.703]
90 The right panels of Figs. 4 and 5 show the test set error rates of MKMs with different kernels and numbers of layers. [sent-395, score-0.992]
91 However, MKMs performed significantly better than shallow architectures such as SVMs with RBF kernels or LMNN with feature selection (reported as the case ℓ = 0). [sent-401, score-1.209]
92 MKMs obtained slightly lower error rates on one data set and slightly higher error rates on the other. [sent-403, score-0.712]
93 We can describe the architecture of an MKM by the number of selected features at each layer. [sent-404, score-0.726]
94 For the mnist-back-rand data set, the best MKM used an n = 1 arc-cosine kernel and 300-90-105-136-126-240 features at each layer. [sent-407, score-0.651]
95 The kernel of degree n = 2 performed less well in MKMs, perhaps because multiple iterations of kernel PCA distorted the dynamic range of the inputs (which in turn seemed to complicate the training for LMNN). [sent-411, score-0.631]
96 MKMs with RBF kernels were difficult to train due to the sensitive dependence on kernel width parameters. [sent-412, score-0.669]
97 For multiclass problems, these SVMs compared poorly to deep architectures (both DBNs and MKMs), presumably because they had no unsupervised training that shared information across examples from all different classes. [sent-417, score-0.609]
98 4 Discussion In this paper, we have developed a new family of kernel functions that mimic the computation in large, multilayer neural nets. [sent-421, score-0.505]
99 We hope that our results inspire more work on kernel methods for deep learning. [sent-425, score-0.547]
100 An empirical evaluation of deep architectures on problems with many factors of variation. [sent-531, score-0.532]
wordName wordTfidf (topN-words)
[('kernels', 0.395), ('deep', 0.374), ('mkms', 0.334), ('svms', 0.267), ('multilayer', 0.253), ('rbf', 0.228), ('arc', 0.181), ('kernel', 0.173), ('ramp', 0.167), ('architectures', 0.158), ('layer', 0.141), ('nets', 0.139), ('cosine', 0.129), ('pipe', 0.122), ('quarter', 0.117), ('recursion', 0.112), ('inputs', 0.109), ('shallow', 0.101), ('cos', 0.098), ('belief', 0.093), ('rates', 0.087), ('degree', 0.084), ('sin', 0.078), ('jn', 0.071), ('levels', 0.068), ('mkm', 0.067), ('kn', 0.065), ('lmnn', 0.058), ('experimented', 0.058), ('mimic', 0.056), ('width', 0.055), ('nonlinearity', 0.054), ('composition', 0.053), ('hypersphere', 0.05), ('layers', 0.05), ('distorted', 0.049), ('nonlinear', 0.047), ('error', 0.047), ('integral', 0.046), ('gures', 0.045), ('dbn', 0.04), ('arch', 0.039), ('retrained', 0.039), ('feature', 0.037), ('hypothesize', 0.036), ('svm', 0.036), ('preserves', 0.035), ('activation', 0.034), ('machines', 0.033), ('panels', 0.033), ('test', 0.032), ('examples', 0.032), ('trend', 0.032), ('threshold', 0.032), ('pca', 0.031), ('networks', 0.03), ('severely', 0.029), ('angular', 0.029), ('classi', 0.029), ('validation', 0.029), ('varying', 0.029), ('text', 0.028), ('prune', 0.027), ('though', 0.027), ('margin', 0.025), ('tuning', 0.025), ('train', 0.024), ('reference', 0.023), ('searching', 0.023), ('angle', 0.023), ('family', 0.023), ('architecture', 0.023), ('unsupervised', 0.023), ('network', 0.022), ('training', 0.022), ('preserve', 0.022), ('dependence', 0.022), ('arccosine', 0.022), ('cosin', 0.022), ('cosn', 0.022), ('datathe', 0.022), ('effe', 0.022), ('exam', 0.022), ('induc', 0.022), ('unli', 0.022), ('discovered', 0.022), ('knn', 0.022), ('bars', 0.022), ('successive', 0.022), ('interesting', 0.022), ('previously', 0.021), ('explanation', 0.021), ('iterated', 0.021), ('induced', 0.021), ('unit', 0.021), ('dynamic', 0.021), ('benchmarks', 0.02), ('uninformative', 0.02), ('mapping', 0.02), ('maps', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999934 119 nips-2009-Kernel Methods for Deep Learning
Author: Youngmin Cho, Lawrence K. Saul
Abstract: We introduce a new family of positive-definite kernel functions that mimic the computation in large, multilayer neural nets. These kernel functions can be used in shallow architectures, such as support vector machines (SVMs), or in deep kernel-based architectures that we call multilayer kernel machines (MKMs). We evaluate SVMs and MKMs with these kernel functions on problems designed to illustrate the advantages of deep architectures. On several problems, we obtain better results than previous, leading benchmarks from both SVMs with Gaussian kernels as well as deep belief nets. 1
2 0.22587298 128 nips-2009-Learning Non-Linear Combinations of Kernels
Author: Corinna Cortes, Mehryar Mohri, Afshin Rostamizadeh
Abstract: This paper studies the general problem of learning kernels based on a polynomial combination of base kernels. We analyze this problem in the case of regression and the kernel ridge regression algorithm. We examine the corresponding learning kernel optimization problem, show how that minimax problem can be reduced to a simpler minimization problem, and prove that the global solution of this problem always lies on the boundary. We give a projection-based gradient descent algorithm for solving the optimization problem, shown empirically to converge in few iterations. Finally, we report the results of extensive experiments with this algorithm using several publicly available datasets demonstrating the effectiveness of our technique.
3 0.2180918 151 nips-2009-Measuring Invariances in Deep Networks
Author: Ian Goodfellow, Honglak Lee, Quoc V. Le, Andrew Saxe, Andrew Y. Ng
Abstract: For many pattern recognition tasks, the ideal input feature would be invariant to multiple confounding properties (such as illumination and viewing angle, in computer vision applications). Recently, deep architectures trained in an unsupervised manner have been proposed as an automatic method for extracting useful features. However, it is difficult to evaluate the learned features by any means other than using them in a classifier. In this paper, we propose a number of empirical tests that directly measure the degree to which these learned features are invariant to different input transformations. We find that stacked autoencoders learn modestly increasingly invariant features with depth when trained on natural images. We find that convolutional deep belief networks learn substantially more invariant features in each layer. These results further justify the use of “deep” vs. “shallower” representations, but suggest that mechanisms beyond merely stacking one autoencoder on top of another may be important for achieving invariance. Our evaluation metrics can also be used to evaluate future work in deep learning, and thus help the development of future algorithms. 1
4 0.18113963 2 nips-2009-3D Object Recognition with Deep Belief Nets
Author: Vinod Nair, Geoffrey E. Hinton
Abstract: We introduce a new type of top-level model for Deep Belief Nets and evaluate it on a 3D object recognition task. The top-level model is a third-order Boltzmann machine, trained using a hybrid algorithm that combines both generative and discriminative gradients. Performance is evaluated on the NORB database (normalized-uniform version), which contains stereo-pair images of objects under different lighting conditions and viewpoints. Our model achieves 6.5% error on the test set, which is close to the best published result for NORB (5.9%) using a convolutional neural net that has built-in knowledge of translation invariance. It substantially outperforms shallow models such as SVMs (11.6%). DBNs are especially suited for semi-supervised learning, and to demonstrate this we consider a modified version of the NORB recognition task in which additional unlabeled images are created by applying small translations to the images in the database. With the extra unlabeled data (and the same amount of labeled data as before), our model achieves 5.2% error. 1
5 0.17657088 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions
Author: Kenji Fukumizu, Arthur Gretton, Gert R. Lanckriet, Bernhard Schölkopf, Bharath K. Sriperumbudur
Abstract: Embeddings of probability measures into reproducing kernel Hilbert spaces have been proposed as a straightforward and practical means of representing and comparing probabilities. In particular, the distance between embeddings (the maximum mean discrepancy, or MMD) has several key advantages over many classical metrics on distributions, namely easy computability, fast convergence and low bias of finite sample estimates. An important requirement of the embedding RKHS is that it be characteristic: in this case, the MMD between two distributions is zero if and only if the distributions coincide. Three new results on the MMD are introduced in the present study. First, it is established that MMD corresponds to the optimal risk of a kernel classifier, thus forming a natural link between the distance between distributions and their ease of classification. An important consequence is that a kernel must be characteristic to guarantee classifiability between distributions in the RKHS. Second, the class of characteristic kernels is broadened to incorporate all strictly positive definite kernels: these include non-translation invariant kernels and kernels on non-compact domains. Third, a generalization of the MMD is proposed for families of kernels, as the supremum over MMDs on a class of kernels (for instance the Gaussian kernels with different bandwidths). This extension is necessary to obtain a single distance measure if a large selection or class of characteristic kernels is potentially appropriate. This generalization is reasonable, given that it corresponds to the problem of learning the kernel by minimizing the risk of the corresponding kernel classifier. The generalized MMD is shown to have consistent finite sample estimates, and its performance is demonstrated on a homogeneity testing example. 1
6 0.14086351 77 nips-2009-Efficient Match Kernel between Sets of Features for Visual Recognition
7 0.13229991 95 nips-2009-Fast subtree kernels on graphs
8 0.1291049 219 nips-2009-Slow, Decorrelated Features for Pretraining Complex Cell-like Networks
9 0.12859063 120 nips-2009-Kernels and learning curves for Gaussian process regression on random graphs
10 0.11471449 253 nips-2009-Unsupervised feature learning for audio classification using convolutional deep belief networks
11 0.11425147 179 nips-2009-On the Algorithmics and Applications of a Mixed-norm based Kernel Learning Formulation
12 0.10236232 80 nips-2009-Efficient and Accurate Lp-Norm Multiple Kernel Learning
13 0.09923362 84 nips-2009-Evaluating multi-class learning strategies in a generative hierarchical framework for object detection
14 0.073566236 139 nips-2009-Linear-time Algorithms for Pairwise Statistical Problems
15 0.07017526 162 nips-2009-Neural Implementation of Hierarchical Bayesian Inference by Importance Sampling
16 0.068480834 33 nips-2009-Analysis of SVM with Indefinite Kernels
17 0.066255547 142 nips-2009-Locality-sensitive binary codes from shift-invariant kernels
18 0.066197425 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties
19 0.06466642 92 nips-2009-Fast Graph Laplacian Regularized Kernel Learning via Semidefinite–Quadratic–Linear Programming
20 0.062329847 8 nips-2009-A Fast, Consistent Kernel Two-Sample Test
topicId topicWeight
[(0, -0.176), (1, 0.015), (2, -0.064), (3, 0.145), (4, -0.121), (5, -0.021), (6, -0.037), (7, 0.309), (8, -0.154), (9, -0.022), (10, 0.198), (11, -0.148), (12, -0.225), (13, 0.17), (14, 0.049), (15, 0.083), (16, 0.006), (17, -0.022), (18, -0.024), (19, 0.01), (20, -0.057), (21, 0.088), (22, 0.021), (23, 0.035), (24, -0.006), (25, -0.019), (26, 0.023), (27, -0.079), (28, -0.054), (29, 0.008), (30, -0.091), (31, -0.019), (32, -0.029), (33, -0.03), (34, -0.015), (35, 0.045), (36, -0.041), (37, 0.059), (38, 0.021), (39, 0.013), (40, -0.015), (41, 0.012), (42, -0.0), (43, -0.077), (44, 0.091), (45, 0.019), (46, 0.023), (47, -0.001), (48, 0.081), (49, 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.97029114 119 nips-2009-Kernel Methods for Deep Learning
Author: Youngmin Cho, Lawrence K. Saul
Abstract: We introduce a new family of positive-definite kernel functions that mimic the computation in large, multilayer neural nets. These kernel functions can be used in shallow architectures, such as support vector machines (SVMs), or in deep kernel-based architectures that we call multilayer kernel machines (MKMs). We evaluate SVMs and MKMs with these kernel functions on problems designed to illustrate the advantages of deep architectures. On several problems, we obtain better results than previous, leading benchmarks from both SVMs with Gaussian kernels as well as deep belief nets. 1
2 0.71905595 128 nips-2009-Learning Non-Linear Combinations of Kernels
Author: Corinna Cortes, Mehryar Mohri, Afshin Rostamizadeh
Abstract: This paper studies the general problem of learning kernels based on a polynomial combination of base kernels. We analyze this problem in the case of regression and the kernel ridge regression algorithm. We examine the corresponding learning kernel optimization problem, show how that minimax problem can be reduced to a simpler minimization problem, and prove that the global solution of this problem always lies on the boundary. We give a projection-based gradient descent algorithm for solving the optimization problem, shown empirically to converge in few iterations. Finally, we report the results of extensive experiments with this algorithm using several publicly available datasets demonstrating the effectiveness of our technique.
3 0.69268239 77 nips-2009-Efficient Match Kernel between Sets of Features for Visual Recognition
Author: Liefeng Bo, Cristian Sminchisescu
Abstract: In visual recognition, the images are frequently modeled as unordered collections of local features (bags). We show that bag-of-words representations commonly used in conjunction with linear classifiers can be viewed as special match kernels, which count 1 if two local features fall into the same regions partitioned by visual words and 0 otherwise. Despite its simplicity, this quantization is too coarse, motivating research into the design of match kernels that more accurately measure the similarity between local features. However, it is impractical to use such kernels for large datasets due to their significant computational cost. To address this problem, we propose efficient match kernels (EMK) that map local features to a low dimensional feature space and average the resulting vectors to form a setlevel feature. The local feature maps are learned so their inner products preserve, to the best possible, the values of the specified kernel function. Classifiers based on EMK are linear both in the number of images and in the number of local features. We demonstrate that EMK are extremely efficient and achieve the current state of the art in three difficult computer vision datasets: Scene-15, Caltech-101 and Caltech-256. 1
4 0.6109972 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions
Author: Kenji Fukumizu, Arthur Gretton, Gert R. Lanckriet, Bernhard Schölkopf, Bharath K. Sriperumbudur
Abstract: Embeddings of probability measures into reproducing kernel Hilbert spaces have been proposed as a straightforward and practical means of representing and comparing probabilities. In particular, the distance between embeddings (the maximum mean discrepancy, or MMD) has several key advantages over many classical metrics on distributions, namely easy computability, fast convergence and low bias of finite sample estimates. An important requirement of the embedding RKHS is that it be characteristic: in this case, the MMD between two distributions is zero if and only if the distributions coincide. Three new results on the MMD are introduced in the present study. First, it is established that MMD corresponds to the optimal risk of a kernel classifier, thus forming a natural link between the distance between distributions and their ease of classification. An important consequence is that a kernel must be characteristic to guarantee classifiability between distributions in the RKHS. Second, the class of characteristic kernels is broadened to incorporate all strictly positive definite kernels: these include non-translation invariant kernels and kernels on non-compact domains. Third, a generalization of the MMD is proposed for families of kernels, as the supremum over MMDs on a class of kernels (for instance the Gaussian kernels with different bandwidths). This extension is necessary to obtain a single distance measure if a large selection or class of characteristic kernels is potentially appropriate. This generalization is reasonable, given that it corresponds to the problem of learning the kernel by minimizing the risk of the corresponding kernel classifier. The generalized MMD is shown to have consistent finite sample estimates, and its performance is demonstrated on a homogeneity testing example. 1
5 0.59699714 95 nips-2009-Fast subtree kernels on graphs
Author: Nino Shervashidze, Karsten M. Borgwardt
Abstract: In this article, we propose fast subtree kernels on graphs. On graphs with n nodes and m edges and maximum degree d, these kernels comparing subtrees of height h can be computed in O(mh), whereas the classic subtree kernel by Ramon & G¨ rtner scales as O(n2 4d h). Key to this efficiency is the observation that the a Weisfeiler-Lehman test of isomorphism from graph theory elegantly computes a subtree kernel as a byproduct. Our fast subtree kernels can deal with labeled graphs, scale up easily to large graphs and outperform state-of-the-art graph kernels on several classification benchmark datasets in terms of accuracy and runtime. 1
6 0.59362143 151 nips-2009-Measuring Invariances in Deep Networks
7 0.55963248 2 nips-2009-3D Object Recognition with Deep Belief Nets
8 0.55800915 253 nips-2009-Unsupervised feature learning for audio classification using convolutional deep belief networks
9 0.52261645 120 nips-2009-Kernels and learning curves for Gaussian process regression on random graphs
10 0.48641449 8 nips-2009-A Fast, Consistent Kernel Two-Sample Test
11 0.46922362 92 nips-2009-Fast Graph Laplacian Regularized Kernel Learning via Semidefinite–Quadratic–Linear Programming
12 0.44775897 80 nips-2009-Efficient and Accurate Lp-Norm Multiple Kernel Learning
13 0.4452388 219 nips-2009-Slow, Decorrelated Features for Pretraining Complex Cell-like Networks
14 0.43642217 84 nips-2009-Evaluating multi-class learning strategies in a generative hierarchical framework for object detection
15 0.40881512 227 nips-2009-Speaker Comparison with Inner Product Discriminant Functions
16 0.39622939 139 nips-2009-Linear-time Algorithms for Pairwise Statistical Problems
17 0.39330932 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties
18 0.38858637 176 nips-2009-On Invariance in Hierarchical Models
19 0.38792962 179 nips-2009-On the Algorithmics and Applications of a Mixed-norm based Kernel Learning Formulation
20 0.37802061 33 nips-2009-Analysis of SVM with Indefinite Kernels
topicId topicWeight
[(2, 0.221), (21, 0.018), (24, 0.035), (25, 0.06), (35, 0.061), (36, 0.106), (39, 0.042), (42, 0.012), (58, 0.075), (61, 0.019), (71, 0.024), (81, 0.02), (86, 0.192), (91, 0.031)]
simIndex simValue paperId paperTitle
Author: Natasha Singh-miller, Michael Collins
Abstract: We consider the problem of using nearest neighbor methods to provide a conditional probability estimate, P (y|a), when the number of labels y is large and the labels share some underlying structure. We propose a method for learning label embeddings (similar to error-correcting output codes (ECOCs)) to model the similarity between labels within a nearest neighbor framework. The learned ECOCs and nearest neighbor information are used to provide conditional probability estimates. We apply these estimates to the problem of acoustic modeling for speech recognition. We demonstrate significant improvements in terms of word error rate (WER) on a lecture recognition task over a state-of-the-art baseline GMM model. 1
same-paper 2 0.84720612 119 nips-2009-Kernel Methods for Deep Learning
Author: Youngmin Cho, Lawrence K. Saul
Abstract: We introduce a new family of positive-definite kernel functions that mimic the computation in large, multilayer neural nets. These kernel functions can be used in shallow architectures, such as support vector machines (SVMs), or in deep kernel-based architectures that we call multilayer kernel machines (MKMs). We evaluate SVMs and MKMs with these kernel functions on problems designed to illustrate the advantages of deep architectures. On several problems, we obtain better results than previous, leading benchmarks from both SVMs with Gaussian kernels as well as deep belief nets. 1
3 0.7390182 151 nips-2009-Measuring Invariances in Deep Networks
Author: Ian Goodfellow, Honglak Lee, Quoc V. Le, Andrew Saxe, Andrew Y. Ng
Abstract: For many pattern recognition tasks, the ideal input feature would be invariant to multiple confounding properties (such as illumination and viewing angle, in computer vision applications). Recently, deep architectures trained in an unsupervised manner have been proposed as an automatic method for extracting useful features. However, it is difficult to evaluate the learned features by any means other than using them in a classifier. In this paper, we propose a number of empirical tests that directly measure the degree to which these learned features are invariant to different input transformations. We find that stacked autoencoders learn modestly increasingly invariant features with depth when trained on natural images. We find that convolutional deep belief networks learn substantially more invariant features in each layer. These results further justify the use of “deep” vs. “shallower” representations, but suggest that mechanisms beyond merely stacking one autoencoder on top of another may be important for achieving invariance. Our evaluation metrics can also be used to evaluate future work in deep learning, and thus help the development of future algorithms. 1
4 0.72673982 253 nips-2009-Unsupervised feature learning for audio classification using convolutional deep belief networks
Author: Honglak Lee, Peter Pham, Yan Largman, Andrew Y. Ng
Abstract: In recent years, deep learning approaches have gained significant interest as a way of building hierarchical representations from unlabeled data. However, to our knowledge, these deep learning approaches have not been extensively studied for auditory data. In this paper, we apply convolutional deep belief networks to audio data and empirically evaluate them on various audio classification tasks. In the case of speech data, we show that the learned features correspond to phones/phonemes. In addition, our feature representations learned from unlabeled audio data show very good performance for multiple audio classification tasks. We hope that this paper will inspire more research on deep learning approaches applied to a wide range of audio recognition tasks. 1
5 0.72294003 137 nips-2009-Learning transport operators for image manifolds
Author: Benjamin Culpepper, Bruno A. Olshausen
Abstract: We describe an unsupervised manifold learning algorithm that represents a surface through a compact description of operators that traverse it. The operators are based on matrix exponentials, which are the solution to a system of first-order linear differential equations. The matrix exponents are represented by a basis that is adapted to the statistics of the data so that the infinitesimal generator for a trajectory along the underlying manifold can be produced by linearly composing a few elements. The method is applied to recover topological structure from low dimensional synthetic data, and to model local structure in how natural images change over time and scale. 1
6 0.72256875 176 nips-2009-On Invariance in Hierarchical Models
7 0.72193944 190 nips-2009-Polynomial Semantic Indexing
8 0.71976125 32 nips-2009-An Online Algorithm for Large Scale Image Similarity Learning
9 0.71818024 92 nips-2009-Fast Graph Laplacian Regularized Kernel Learning via Semidefinite–Quadratic–Linear Programming
10 0.7172243 17 nips-2009-A Sparse Non-Parametric Approach for Single Channel Separation of Known Sounds
11 0.71479911 104 nips-2009-Group Sparse Coding
12 0.69938266 210 nips-2009-STDP enables spiking neurons to detect hidden causes of their inputs
13 0.69789368 120 nips-2009-Kernels and learning curves for Gaussian process regression on random graphs
14 0.69677174 142 nips-2009-Locality-sensitive binary codes from shift-invariant kernels
15 0.69208384 212 nips-2009-Semi-Supervised Learning in Gigantic Image Collections
16 0.69106477 2 nips-2009-3D Object Recognition with Deep Belief Nets
17 0.69071043 87 nips-2009-Exponential Family Graph Matching and Ranking
18 0.68966258 203 nips-2009-Replacing supervised classification learning by Slow Feature Analysis in spiking neural networks
19 0.68789899 6 nips-2009-A Biologically Plausible Model for Rapid Natural Scene Identification
20 0.68677354 167 nips-2009-Non-Parametric Bayesian Dictionary Learning for Sparse Image Representations