jmlr jmlr2011 jmlr2011-48 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Grégoire Montavon, Mikio L. Braun, Klaus-Robert Müller
Abstract: When training deep networks it is common knowledge that an efficient and well generalizing representation of the problem is formed. In this paper we aim to elucidate what makes the emerging representation successful. We analyze the layer-wise evolution of the representation in a deep network by building a sequence of deeper and deeper kernels that subsume the mapping performed by more and more layers of the deep network and measuring how these increasingly complex kernels fit the learning problem. We observe that deep networks create increasingly better representations of the learning problem and that the structure of the deep network controls how fast the representation of the task is formed layer after layer. Keywords: deep networks, kernel principal component analysis, representations
Reference: text
sentIndex sentText sentNum sentScore
1 When training deep networks it is common knowledge that an efficient and well generalizing representation of the problem is formed. [sent-9, score-0.901]
2 We analyze the layer-wise evolution of the representation in a deep network by building a sequence of deeper and deeper kernels that subsume the mapping performed by more and more layers of the deep network and measuring how these increasingly complex kernels fit the learning problem. [sent-11, score-2.443]
3 We observe that deep networks create increasingly better representations of the learning problem and that the structure of the deep network controls how fast the representation of the task is formed layer after layer. [sent-12, score-1.956]
4 Keywords: deep networks, kernel principal component analysis, representations 1. [sent-13, score-0.968]
5 Through their deep multi-layered architecture, simpler and more accurate representations of the learning problem can be built layer after layer. [sent-21, score-0.99]
6 Also, their flexibility offers the possibility to systematically and structurally incorporate prior knowledge, for example, by constraining the connectivity of the deep network (e. [sent-23, score-0.825]
7 Such prior knowledge can significantly improve the generalization ability of deep networks, leading to state-of-the-art performance on several complex real-world data sets. [sent-28, score-0.818]
8 While a considerable amount of work has been dedicated to efficiently learning deep architectures (Orr and Müller, 1998; Hinton et al. [sent-29, score-0.783]
9 A significant amount of research has focused on improving our theoretical understanding of deep networks, in particular, understanding the benefits of unsupervised pretraining (Erhan et al. [sent-36, score-0.762]
10 , 2010), understanding the main difficulties when training deep networks (Larochelle et al. [sent-37, score-0.839]
11 , 2009) and studying the invariance of representations built in deep networks (Goodfellow et al. [sent-38, score-1.005]
12 However, quantifying how good hidden representations are and measuring how the representation evolves layer after layer are still open questions. [sent-40, score-0.473]
13 Overall, deep networks are thus generally assumed to be powerful and flexible learning machines that are however not well understood theoretically (Bengio, 2009). [sent-41, score-0.839]
14 In parallel to the development of deep networks, kernel methods (Müller et al. [sent-42, score-0.879]
15 The kernel operator k(x, x′)—a central concept of the kernel framework—measures the similarity between two points x and x′ of the input distribution, yielding an implicit kernel feature map x → φ(x) (Schölkopf et al. [sent-44, score-0.417]
16 The goal of this paper is to study in the light of the kernel framework how exactly the representation is built in a deep network, in particular, how the representation evolves as we map the input through more and more layers of the deep network. [sent-52, score-1.973]
17 Here, the kernel framework is not used as an effective learning machine, but as an abstraction tool for modeling the deep network. [sent-53, score-0.845]
18 Our analysis takes a trained deep network f (x) = fL ◦ · · · ◦ f1 (x) as input, defines a sequence of “deep kernels” k0 (x, x′ ) = kRBF (x, x′ ), k1 (x, x′ ) = kRBF ( f1 (x), f1 (x′ )), . [sent-54, score-0.876]
19 kL (x, x′ ) = kRBF ( fL ◦ · · · ◦ f1 (x), fL ◦ · · · ◦ f1 (x′ )) that subsume the mapping performed by more and more layers of the deep network and outputs how good the representations yielded by these deeper and deeper kernels are. [sent-57, score-1.344]
20 We quantify, for each kernel, how good the representation is with respect to the learning problem by measuring how much task-relevant information is contained in the leading principal components of the kernel feature space. [sent-58, score-0.511]
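This construction can be pictured with a short sketch (ours, not the authors' code): given the layer functions f1, ..., fL of an already trained network, the kernels k0, ..., kL are obtained by applying an RBF kernel to the output of more and more layers. The toy two-layer network, the RBF width sigma and the random data below are illustrative assumptions.

```python
# Minimal sketch: building the sequence of "deep kernels" k_0, ..., k_L
# from the layer functions of an already trained network.
import numpy as np

def rbf_kernel(a, b, sigma):
    """Gaussian (RBF) kernel between two batches of row vectors."""
    sq_dists = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-sq_dists / (2 * sigma**2))

def deep_kernels(layers, sigma):
    """Return [k_0, ..., k_L]: k_l applies the RBF kernel to the output of the
    first l layers of the trained network (k_0 acts on the raw input)."""
    kernels = [lambda a, b, s=sigma: rbf_kernel(a, b, s)]
    for l in range(1, len(layers) + 1):
        def k_l(a, b, depth=l, s=sigma):
            for f in layers[:depth]:   # map both arguments through the first `depth` layers
                a, b = f(a), f(b)
            return rbf_kernel(a, b, s)
        kernels.append(k_l)
    return kernels

# Toy example: a two-layer network with fixed random weights stands in for a trained one.
rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((10, 20)), rng.standard_normal((20, 20))
layers = [lambda x: np.tanh(x @ W1), lambda h: np.tanh(h @ W2)]
X = rng.standard_normal((5, 10))
print([k(X, X).shape for k in deep_kernels(layers, sigma=5.0)])  # three 5x5 kernel matrices
```

Each k_l is a valid kernel because it composes a fixed deterministic map with a Gaussian kernel.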
21 This analysis allows us for the first time to observe and quantify the evolution of the representation in deep networks. [sent-61, score-0.966]
22 We use our analysis to test two hypotheses on deep networks: Hypothesis 1: as the input is propagated through more and more layers of the deep network, simpler and more accurate representations of the learning problem are obtained. [sent-62, score-1.732]
23 Indeed, as the input is mapped through more and more layers, abstractions learned by the deep network are likely to change the perception of whether a task is simple or not. [sent-63, score-0.932]
24 Figure 1: Illustration of our analysis (deep networks with various structures, and curves of error e(d) versus dimensionality d at each layer l). [sent-64, score-1.057]
25 Curves on the left plot relate the simplicity (dimensionality) and accuracy (error) of the representation of the learning problem at each layer of the deep network. [sent-65, score-0.947]
26 The dimensionality is measured as the number of kernel principal components on which the representation is projected. [sent-66, score-0.321]
27 The thick gray arrows indicate the forward path of the deep network. [sent-67, score-0.731]
28 Hypothesis 1 states that as deeper and deeper kernels are built, simpler and more accurate representations of the learning problem are obtained. [sent-68, score-0.277]
29 Hypothesis 2 states that the structure of the deep network controls the way the solution is formed layer after layer. [sent-69, score-1.004]
30 Hypothesis 2: the structure of the deep network controls how fast the representation of the task is formed layer after layer. [sent-71, score-1.066]
31 We hypothesize that a common aspect of these various regularization techniques is to control the layer-wise evolution of the representation through the deep network. [sent-74, score-0.941]
32 On the other hand, a simple unregularized deep network may make inefficient use of the representational power of deep networks, distributing the discrimination steps across layers in a suboptimal way. [sent-75, score-1.781]
33 Testing these hypotheses is, in our opinion, of significant importance, as they might shed light on the nature of deep learning and on the way complex problems are to be solved. [sent-77, score-0.764]
34 , 2010) by extending the discussion on the interest of analyzing deep networks within the kernel framework and by extending the empirical study to more data sets and larger deep networks. [sent-79, score-1.684]
35 1 Related Work The concept of building kernels imitating the structure of deep architectures—or more simply, building “deep kernels”—is not new. [sent-81, score-0.789]
36 Cho and Saul (2009) already expressed deep architectures as kernels in order to solve a convex optimization problem and achieve large margin discrimination in a deep network. [sent-82, score-1.615]
37 This approach differs from our work in the sense that their deep kernel is not used as an analysis tool for trained deep networks but as part of an effective learning machine. [sent-83, score-1.735]
38 (2010) where a principal component analysis is performed on top of these deep kernels in order to measure invariance properties of deep networks. [sent-86, score-1.653]
39 While the last authors focus mostly on the representation of data in static deep architectures made of predefined features, we instead consider trainable deep architectures. [sent-87, score-1.576]
40 (2009) also analyze the layer-wise evolution of the representation in deep networks, showing that deep networks trained in an unsupervised fashion build increasing levels of invariance with respect to several engineered transformations of the input and to temporal transformations in video data. [sent-89, score-1.929]
41 Theory Before being able to observe the layer-wise evolution of the representation in deep networks, we first need to quantify how good a representation is with respect to the learning problem. [sent-91, score-1.028]
42 The analysis extends naturally to deep networks by building a sequence of kernels that subsume the mapping performed by more and more layers of the deep network and repeating the analysis for these deeper and deeper kernels. [sent-94, score-2.132]
43 For example, a local predictor could be modeled with a Gaussian kernel while a more intelligent human-like predictor should be modeled with a more complex kernel encoding translation invariance, rotation invariance, etc. [sent-106, score-0.261]
44 Then, the induced kernel feature map x → φ(x) implicitly encodes all the prior knowledge defined in the kernel, with the advantage that linear models can be built on top of it (Schölkopf et al. [sent-107, score-0.32]
45 2 Application to Deep Networks In this section, we describe how the analysis of representations presented above can be used to measure the layer-wise formation of the representation in deep networks. [sent-204, score-0.844]
46 Let f (x) = fL ◦ · · · ◦ f1 (x) be a trained deep network of L layers. [sent-205, score-0.876]
47 kL (x, x′ ) = kRBF ( fL ◦ · · · ◦ f1 (x), fL ◦ · · · ◦ f1 (x′ )) that subsume the mapping performed by more and more layers of the deep network and repeating for each kernel the analysis presented in Section 2. [sent-209, score-1.181]
48 When d increases, more variations of the mapped data distribution are captured. Algorithm 1: Main computational steps of our layer-wise analysis of deep networks. [sent-218, score-0.731]
49 At every layer of the deep network, the same analysis is performed, returning a list of curves e(d) capturing the evolution of the representation in the deep network. [sent-219, score-1.826]
Input: a data set {(x1, y1), ..., (xn, yn)} and a deep network f : x → fL ◦ · · · ◦ f1(x); Output: the curves e(d) for each layer l. [sent-223, score-0.979]
for l ∈ {1, ..., L} do: for σ ∈ Σ do: set k(x, x′) = kRBF(σ)(fl ◦ · · · ◦ f1(x), fl ◦ · · · ◦ f1(x′)), compute the kernel matrix K associated with k(x, x′) and (x1, ..., xn), and run the kernel PCA analysis of Section 2 on K to obtain e(d). [sent-226, score-0.248]
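Read together with the surrounding description, the algorithm can be sketched as follows (our reconstruction, not the authors' code): for every depth l and every RBF width σ in a grid Σ, compute the kernel matrix on the mapped training data, perform kernel PCA, and record the error e(d) obtained when the labels are predicted from the d leading components. The least-squares fit on one-hot labels, the 0-1 training error and the minimum over σ at each layer are our assumptions about details not spelled out in the fragments above.

```python
# Sketch of the layer-wise analysis (our reading of Algorithm 1).
import numpy as np

def rbf_kernel_matrix(Z, sigma):
    sq = np.sum(Z**2, 1)
    return np.exp(-(sq[:, None] + sq[None, :] - 2 * Z @ Z.T) / (2 * sigma**2))

def error_curve(K, Y, dims):
    """e(d): training error when one-hot labels Y are predicted from the projection
    of the data onto the d leading kernel principal components of K."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n                  # centering matrix
    eigval, eigvec = np.linalg.eigh(H @ K @ H)           # kernel PCA
    order = np.argsort(eigval)[::-1]                     # leading components first
    eigval, eigvec = eigval[order], eigvec[:, order]
    errors = []
    for d in dims:
        U = eigvec[:, :d] * np.sqrt(np.clip(eigval[:d], 1e-12, None))  # n x d mapped samples
        W, *_ = np.linalg.lstsq(U, Y, rcond=None)        # linear fit of the labels
        errors.append(np.mean((U @ W).argmax(1) != Y.argmax(1)))
    return np.array(errors)

def layerwise_analysis(X, Y, layers, sigmas, dims):
    """Return, for every depth l = 0..L, the error curve e(d), minimized over sigma."""
    curves, Z = [], X
    for l in range(len(layers) + 1):
        if l > 0:
            Z = layers[l - 1](Z)                         # output of the first l layers
        per_sigma = [error_curve(rbf_kernel_matrix(Z, s), Y, dims) for s in sigmas]
        curves.append(np.min(per_sigma, axis=0))         # scale-invariant choice of sigma
    return curves
```

Plotting the returned curves for l = 0, ..., L gives the layer-wise picture illustrated in Figure 1.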
52 Consequently, observing the learning problem become simpler as we build deeper and deeper kernels highlights the capacity of deep networks to model the regularities of the input distribution. [sent-259, score-1.102]
53 As the input distribution gets distorted more and more, the number of leading components required to solve the learning problem increases, hinting that Gaussian kernels become progressively less suited. [sent-265, score-0.234]
54 Scale invariance is desirable since the representation at a given layer of the deep network can take different scales due to the number of nodes contained in each layer or to the multiple types of nonlinearities that can be implemented in deep networks. [sent-268, score-1.987]
55 Methodology In Section 2, we presented the theory and algorithms required to test our two hypotheses on the evolution of representations in deep networks. [sent-270, score-0.93]
56 It remains to select a set of deep networks and data sets in order to test the hypotheses formulated in Section 1. [sent-272, score-0.839]
57 The second hypothesis on the effect of the structure of deep networks can be tested by taking a set of structured and unstructured deep networks and observing how the layer-wise evolution of the representation differs between these deep networks. [sent-279, score-2.652]
58 We consider a multilayer perceptron (MLP), a pretrained multilayer perceptron (PMLP) and a convolutional neural network (CNN). [sent-280, score-0.488]
59 Since it has been observed that overparameterizing deep networks generally improves generalization, the layer sizes are chosen to be large, constrained only by computational cost. [sent-286, score-1.021]
60 , 2006), referred to in this paper as the PMLP, is a multilayer perceptron that has been pretrained using a deep belief network (DBN, Hinton et al. [sent-290, score-1.014]
61 The pretraining procedure aims to build a deep generative model of the input that can be used as a starting point to learn the supervised task. [sent-292, score-0.799]
62 Here, the structure of the deep network is implicitly given by the weight initialization subsequent to the unsupervised pretraining. [sent-294, score-0.825]
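For concreteness, here is a sketch of the building block of that pretraining: a single contrastive-divergence (CD-1) update for a binary restricted Boltzmann machine, which Hinton et al. (2006) stack greedily to form a deep belief network whose weights then initialize the PMLP. The learning rate and the use of one Gibbs step are standard choices on our part, not settings reported in this paper.

```python
# Sketch of one CD-1 update for a binary RBM, the building block of DBN pretraining.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b_vis, b_hid, lr=0.05, rng=np.random.default_rng(0)):
    """One contrastive-divergence step on a batch of visible vectors v0 (batch x n_vis)."""
    # positive phase: sample hidden units given the data
    p_h0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # negative phase: one step of Gibbs sampling (reconstruction)
    p_v1 = sigmoid(h0 @ W.T + b_vis)
    p_h1 = sigmoid(p_v1 @ W + b_hid)
    # approximate gradient of the log-likelihood
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / len(v0)
    b_vis += lr * (v0 - p_v1).mean(0)
    b_hid += lr * (p_h0 - p_h1).mean(0)
    return W, b_vis, b_hid
```

Once one RBM is trained, its hidden activations serve as the data for the next RBM, so the stack is built one layer at a time before the supervised fine-tuning.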
63 , 1998) is a deep network inspired by the structure of the primary visual cortex (Hubel and Wiesel, 1962). [sent-296, score-0.852]
64 It is built by alternating (1) convolutional layers y = w ⊛ x + b transforming a set of input feature maps {x1, x2, ...} into a set of output feature maps {y1, y2, ...} [sent-298, score-0.324]
65 such that yi = ∑j wij ∗ xj + bi, where the wij are convolution kernels, (2) detection layers, where a nonlinearity is applied element-wise to the output of the convolutions in order to extract important features, and (3) pooling layers, which subsample each feature map by a given factor. [sent-304, score-0.433]
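A small sketch of one convolution–detection–pooling stage acting on a list of 2-D feature maps may help fix ideas; the 3×3 filters, tanh detection and 2×2 average pooling below are illustrative choices, not the exact architecture used in the experiments.

```python
# Sketch of one convolution -> detection -> pooling stage, mirroring y_i = sum_j w_ij * x_j + b_i.
import numpy as np

def conv2d_valid(x, w):
    """Plain 2-D valid convolution of a single map x with a single kernel w."""
    kh, kw = w.shape
    wf = w[::-1, ::-1]                       # flip the kernel (true convolution)
    out = np.empty((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * wf)
    return out

def conv_layer(xs, w, b):
    """xs: list of input maps; w[i][j]: kernel from input map j to output map i."""
    return [sum(conv2d_valid(xj, w[i][j]) for j, xj in enumerate(xs)) + b[i]
            for i in range(len(w))]

def detection_layer(ys):
    return [np.tanh(y) for y in ys]          # element-wise nonlinearity

def pooling_layer(ys, factor=2):
    """Subsample each map by averaging non-overlapping factor x factor blocks."""
    out = []
    for y in ys:
        h, w_ = (y.shape[0] // factor) * factor, (y.shape[1] // factor) * factor
        blocks = y[:h, :w_].reshape(h // factor, factor, w_ // factor, factor)
        out.append(blocks.mean(axis=(1, 3)))
    return out

# Toy forward pass: 1 input map -> 2 output maps -> tanh -> 2x2 pooling
rng = np.random.default_rng(0)
x = [rng.standard_normal((28, 28))]
w = [[rng.standard_normal((3, 3))], [rng.standard_normal((3, 3))]]
b = [0.0, 0.0]
maps = pooling_layer(detection_layer(conv_layer(x, w, b)))
print([m.shape for m in maps])               # [(13, 13), (13, 13)]
```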
66 The deep networks described above are trained on a supervised task with backpropagation (Rumelhart et al. [sent-307, score-0.89]
67 The softmax module (Bishop, 1996) optimizes the deep network for maximum likelihood. [sent-310, score-0.889]
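For reference, the softmax module and the maximum-likelihood objective can be written down in a few lines; this is the standard textbook formulation, with nothing specific to this paper.

```python
# Softmax output and the negative log-likelihood that backpropagation minimizes.
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)        # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll_and_gradient(z, y_onehot):
    """Negative log-likelihood and its gradient w.r.t. the pre-softmax activations z."""
    p = softmax(z)
    loss = -np.mean(np.sum(y_onehot * np.log(p + 1e-12), axis=1))
    grad = (p - y_onehot) / len(z)              # fed back into the network by backprop
    return loss, grad
```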
68 These deep networks are analyzed in two different settings: • Supervised learning: the deep network is trained in a supervised fashion on the target task (digit classification for the MNIST data set and image classification for the CIFAR data set). [sent-312, score-1.715]
69 • Transfer learning: the deep network is trained in a supervised fashion on a binary classification task that consists of determining whether the sample has been flipped vertically or not. [sent-313, score-0.876]
70 These settings allow us to measure how the structure contained in deep networks affects different aspects of learning such as the layer-wise organization of the learned solution or the transferability of features from one task to another. [sent-314, score-0.839]
71 1 Experimental Setup We train the deep networks on the 10000 samples of the data set until a training error of 2. [sent-316, score-0.839]
72 Such a stopping criterion ensures that the subsequent solutions have a constant complexity and that the limited capacity of the deep network has no side effect on the structure of the solution. [sent-318, score-0.825]
73 In our analysis, we estimate the kernel principal components with the 10000 samples used for training the deep network. [sent-326, score-0.963]
74 Therefore, the empirical estimate of the d leading kernel principal components takes the form of d 10000-dimensional vectors, or equivalently, of a data set of 10000 d-dimensional mapped samples. [sent-327, score-0.325]
75 The layers of interest are the input data (l = 0) and the output of each layer (l = 1, 2, . [sent-335, score-0.373]
76 Results In this section, we present the results of our analysis on the evolution of the representation in deep networks. [sent-340, score-0.941]
77 1 discusses the empirical observation that deep networks trained on the supervised task produce gradually simpler and more accurate representations of the learning problem. [sent-342, score-0.941]
78 Figure 5: Effect of the learning rate, of the capacity, of the training time and of the weight penalty on the layer-wise evolution of the representation built by an MLP on the MNIST-10K data set. [sent-343, score-0.264]
79 As the learning rate increases, the solution tends to make use primarily of the first layers of the deep network. [sent-345, score-0.913]
80 2 compares side-by-side the evolution of the representation in different deep networks and discusses the empirical observation that the structure of the deep network controls the layer-wise evolution of the representation in the deep network. [sent-348, score-2.84]
81 This means that the task-relevant information, initially spread over a large number of principal components, converges progressively towards the leading components of the mapped data distribution. [sent-364, score-0.25]
82 This layer-wise preservation of the statistical tractability of the learning problem and its progressive simplification is a theoretical motivation for using these deep networks in a modular way (Caruana, 1997; Weston et al. [sent-365, score-0.864]
83 2 Role of the Structure of Deep Networks Training deep networks is a complex nonconvex learning problem with many reasonable solutions. [sent-368, score-0.872]
84 As a consequence, the structure of the pretrained deep network already contains a certain part of the solution (Larochelle et al. [sent-375, score-0.885]
85 Similarly, in the context of sequential data, we can postulate that dedicating the early layers of the architecture to a convolutional preprocessing is also more effective (LeCun, 1989; Serre et al. [sent-377, score-0.286]
86 We corroborate this argument by comparing in Figure 7 the layer-wise evolution of the representation for different deep networks: a multilayer perceptron (MLP), a pretrained MLP (PMLP) and a convolutional neural network (CNN). [sent-380, score-1.3]
87 Figure 7 (top) shows the evolution of the representation with respect to the learning problem when the deep network has been trained on the target task. [sent-384, score-1.086]
88 We observe that the evolution of the representation of the MLP follows a different trend than the representation built by the PMLP and the CNN. [sent-385, score-0.326]
89 Figure 7 (bottom) shows the evolution of the representation with respect to the learning problem when the deep network has been trained on the transfer task. [sent-388, score-1.11]
90 (2010) already described the PMLP as a regularized version of the MLP and showed how it improves the generalization ability of deep networks. [sent-392, score-0.731]
91 Our analysis completes the study, providing a layer-wise perspective on the effect and the role of regularization in deep networks and a unified view on the very different regularizers implemented by the PMLP and the CNN. [sent-393, score-0.869]
92 Conclusion and Discussion We introduce a method for analyzing deep networks that combines kernel methods and descriptive statistics in order to quantify the layer-wise evolution of the representation in deep networks. [sent-395, score-1.919]
93 Our method abstracts deep networks as a sequence of deeper and deeper kernels subsuming the mapping performed by more and more layers. [sent-396, score-1.065]
94 The kernel framework expresses the relation between the representation built in the deep network and the learning problem. [sent-397, score-1.055]
95 Our analysis is able to detect and quantify the progressive and layer-wise transformation of the input performed by the deep network. [sent-398, score-0.818]
96 In particular, we find that properly trained deep networks progressively simplify the statistics of complex data distributions, building in their last layers representations that are both simple and accurate. [sent-399, score-1.195]
97 The analysis also corroborates the hypothesis that a suitable structure for the deep network allows efficient use of its representational power by controlling the rate of discrimination at each layer of the deep network. [sent-400, score-1.786]
98 This observation provides a new unified view on the role and effect of regularizers in deep networks. [sent-401, score-0.761]
99 A unified architecture for natural language processing: deep neural networks with multitask learning. [sent-480, score-0.917]
100 Layer-wise analysis of deep networks with Gaussian kernels. [sent-530, score-0.839]
wordName wordTfidf (topN-words)
[('deep', 0.731), ('layers', 0.182), ('pmlp', 0.181), ('layer', 0.154), ('evolution', 0.148), ('mlp', 0.133), ('cnn', 0.133), ('kernel', 0.114), ('krbf', 0.109), ('ontavon', 0.109), ('raun', 0.109), ('networks', 0.108), ('network', 0.094), ('eep', 0.092), ('uller', 0.092), ('deeper', 0.084), ('etworks', 0.083), ('hinton', 0.077), ('braun', 0.076), ('multilayer', 0.074), ('principal', 0.072), ('fl', 0.067), ('kpca', 0.064), ('softmax', 0.064), ('cifar', 0.062), ('representation', 0.062), ('invariance', 0.061), ('pretrained', 0.06), ('sigm', 0.06), ('subsume', 0.06), ('nalysis', 0.06), ('kernels', 0.058), ('mika', 0.055), ('perceptron', 0.055), ('ernel', 0.055), ('built', 0.054), ('leading', 0.054), ('architecture', 0.053), ('architectures', 0.052), ('representations', 0.051), ('convolutional', 0.051), ('trained', 0.051), ('montavon', 0.048), ('mnist', 0.047), ('components', 0.046), ('bernhard', 0.045), ('lecun', 0.045), ('discrimination', 0.043), ('mikio', 0.042), ('yoshua', 0.042), ('pooling', 0.041), ('rumelhart', 0.041), ('weston', 0.041), ('sch', 0.04), ('geoffrey', 0.039), ('gunnar', 0.039), ('progressively', 0.039), ('mapped', 0.039), ('lkopf', 0.038), ('input', 0.037), ('erhan', 0.036), ('goire', 0.036), ('goodfellow', 0.036), ('ller', 0.034), ('hypothesis', 0.033), ('alex', 0.033), ('un', 0.033), ('complex', 0.033), ('pretraining', 0.031), ('abstractions', 0.031), ('smale', 0.031), ('tk', 0.03), ('regularizers', 0.03), ('tsch', 0.029), ('bengio', 0.029), ('quantifying', 0.028), ('tomaso', 0.028), ('collobert', 0.028), ('larochelle', 0.028), ('nonlinearity', 0.028), ('dimensionality', 0.027), ('jason', 0.027), ('visual', 0.027), ('sebastian', 0.025), ('orr', 0.025), ('progressive', 0.025), ('cat', 0.025), ('tu', 0.025), ('neural', 0.025), ('pca', 0.025), ('controls', 0.025), ('quantify', 0.025), ('eigenvectors', 0.024), ('transfer', 0.024), ('measuring', 0.024), ('bed', 0.024), ('bouvrie', 0.024), ('cho', 0.024), ('hubel', 0.024), ('rectifying', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999809 48 jmlr-2011-Kernel Analysis of Deep Networks
Author: Grégoire Montavon, Mikio L. Braun, Klaus-Robert Müller
Abstract: When training deep networks it is common knowledge that an efficient and well generalizing representation of the problem is formed. In this paper we aim to elucidate what makes the emerging representation successful. We analyze the layer-wise evolution of the representation in a deep network by building a sequence of deeper and deeper kernels that subsume the mapping performed by more and more layers of the deep network and measuring how these increasingly complex kernels fit the learning problem. We observe that deep networks create increasingly better representations of the learning problem and that the structure of the deep network controls how fast the representation of the task is formed layer after layer. Keywords: deep networks, kernel principal component analysis, representations
2 0.28261971 42 jmlr-2011-In All Likelihood, Deep Belief Is Not Enough
Author: Lucas Theis, Sebastian Gerwinn, Fabian Sinz, Matthias Bethge
Abstract: Statistical models of natural images provide an important tool for researchers in the fields of machine learning and computational neuroscience. The canonical measure to quantitatively assess and compare the performance of statistical models is given by the likelihood. One class of statistical models which has recently gained increasing popularity and has been applied to a variety of complex data is formed by deep belief networks. Analyses of these models, however, have often been limited to qualitative analyses based on samples due to the computationally intractable nature of their likelihood. Motivated by these circumstances, the present article introduces a consistent estimator for the likelihood of deep belief networks which is computationally tractable and simple to apply in practice. Using this estimator, we quantitatively investigate a deep belief network for natural image patches and compare its performance to the performance of other models for natural image patches. We find that the deep belief network is outperformed with respect to the likelihood even by very simple mixture models. Keywords: deep belief network, restricted Boltzmann machine, likelihood estimation, natural image statistics, potential log-likelihood
3 0.12037653 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
Author: Graham W. Taylor, Geoffrey E. Hinton, Sam T. Roweis
Abstract: In this paper we develop a class of nonlinear generative models for high-dimensional time series. We first propose a model based on the restricted Boltzmann machine (RBM) that uses an undirected model with binary latent variables and real-valued “visible” variables. The latent and visible variables at each time step receive directed connections from the visible variables at the last few time-steps. This “conditional” RBM (CRBM) makes on-line inference efficient and allows us to use a simple approximate learning procedure. We demonstrate the power of our approach by synthesizing various sequences from a model trained on motion capture data and by performing on-line filling in of data lost during capture. We extend the CRBM in a way that preserves its most important computational properties and introduces multiplicative three-way interactions that allow the effective interaction weight between two variables to be modulated by the dynamic state of a third variable. We introduce a factoring of the implied three-way weight tensor to permit a more compact parameterization. The resulting model can capture diverse styles of motion with a single set of parameters, and the three-way interactions greatly improve its ability to blend motion styles or to transition smoothly among them. Videos and source code can be found at http://www.cs.nyu.edu/˜gwtaylor/publications/ jmlr2011. Keywords: unsupervised learning, restricted Boltzmann machines, time series, generative models, motion capture
4 0.10130812 68 jmlr-2011-Natural Language Processing (Almost) from Scratch
Author: Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, Pavel Kuksa
Abstract: We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements. Keywords: natural language processing, neural networks
5 0.07517729 60 jmlr-2011-Locally Defined Principal Curves and Surfaces
Author: Umut Ozertem, Deniz Erdogmus
Abstract: Principal curves are defined as self-consistent smooth curves passing through the middle of the data, and they have been used in many applications of machine learning as a generalization, dimensionality reduction and a feature extraction tool. We redefine principal curves and surfaces in terms of the gradient and the Hessian of the probability density estimate. This provides a geometric understanding of the principal curves and surfaces, as well as a unifying view for clustering, principal curve fitting and manifold learning by regarding those as principal manifolds of different intrinsic dimensionalities. The theory does not impose any particular density estimation method can be used with any density estimator that gives continuous first and second derivatives. Therefore, we first present our principal curve/surface definition without assuming any particular density estimation method. Afterwards, we develop practical algorithms for the commonly used kernel density estimation (KDE) and Gaussian mixture models (GMM). Results of these algorithms are presented in notional data sets as well as real applications with comparisons to other approaches in the principal curve literature. All in all, we present a novel theoretical understanding of principal curves and surfaces, practical algorithms as general purpose machine learning tools, and applications of these algorithms to several practical problems. Keywords: unsupervised learning, dimensionality reduction, principal curves, principal surfaces, subspace constrained mean-shift
6 0.068136215 3 jmlr-2011-A Cure for Variance Inflation in High Dimensional Kernel Principal Component Analysis
7 0.062669456 55 jmlr-2011-Learning Multi-modal Similarity
8 0.05336025 66 jmlr-2011-Multiple Kernel Learning Algorithms
9 0.051519055 105 jmlr-2011-lp-Norm Multiple Kernel Learning
10 0.048565146 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels
11 0.045034379 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
12 0.039247628 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
13 0.036960296 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
14 0.034053326 75 jmlr-2011-Parallel Algorithm for Learning Optimal Bayesian Network Structure
15 0.033505026 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
16 0.033477601 98 jmlr-2011-Universality, Characteristic Kernels and RKHS Embedding of Measures
17 0.032560546 50 jmlr-2011-LPmade: Link Prediction Made Easy
18 0.032324269 31 jmlr-2011-Efficient and Effective Visual Codebook Generation Using Additive Kernels
19 0.031394135 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
20 0.031023625 92 jmlr-2011-The Stationary Subspace Analysis Toolbox
topicId topicWeight
[(0, 0.187), (1, -0.132), (2, -0.007), (3, -0.186), (4, -0.232), (5, 0.033), (6, 0.213), (7, -0.438), (8, -0.087), (9, 0.024), (10, 0.034), (11, 0.023), (12, -0.003), (13, 0.023), (14, 0.019), (15, -0.006), (16, 0.005), (17, 0.033), (18, 0.014), (19, -0.028), (20, 0.029), (21, 0.002), (22, 0.005), (23, 0.055), (24, -0.0), (25, 0.049), (26, -0.047), (27, -0.109), (28, -0.04), (29, 0.046), (30, -0.013), (31, 0.026), (32, 0.002), (33, 0.039), (34, 0.03), (35, 0.027), (36, -0.043), (37, -0.047), (38, 0.097), (39, 0.016), (40, 0.014), (41, 0.044), (42, 0.038), (43, -0.001), (44, -0.008), (45, 0.037), (46, -0.111), (47, -0.037), (48, -0.092), (49, 0.11)]
simIndex simValue paperId paperTitle
same-paper 1 0.95821947 48 jmlr-2011-Kernel Analysis of Deep Networks
Author: Grégoire Montavon, Mikio L. Braun, Klaus-Robert Müller
Abstract: When training deep networks it is common knowledge that an efficient and well generalizing representation of the problem is formed. In this paper we aim to elucidate what makes the emerging representation successful. We analyze the layer-wise evolution of the representation in a deep network by building a sequence of deeper and deeper kernels that subsume the mapping performed by more and more layers of the deep network and measuring how these increasingly complex kernels fit the learning problem. We observe that deep networks create increasingly better representations of the learning problem and that the structure of the deep network controls how fast the representation of the task is formed layer after layer. Keywords: deep networks, kernel principal component analysis, representations
2 0.78885102 42 jmlr-2011-In All Likelihood, Deep Belief Is Not Enough
Author: Lucas Theis, Sebastian Gerwinn, Fabian Sinz, Matthias Bethge
Abstract: Statistical models of natural images provide an important tool for researchers in the fields of machine learning and computational neuroscience. The canonical measure to quantitatively assess and compare the performance of statistical models is given by the likelihood. One class of statistical models which has recently gained increasing popularity and has been applied to a variety of complex data is formed by deep belief networks. Analyses of these models, however, have often been limited to qualitative analyses based on samples due to the computationally intractable nature of their likelihood. Motivated by these circumstances, the present article introduces a consistent estimator for the likelihood of deep belief networks which is computationally tractable and simple to apply in practice. Using this estimator, we quantitatively investigate a deep belief network for natural image patches and compare its performance to the performance of other models for natural image patches. We find that the deep belief network is outperformed with respect to the likelihood even by very simple mixture models. Keywords: deep belief network, restricted Boltzmann machine, likelihood estimation, natural image statistics, potential log-likelihood
3 0.62071246 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
Author: Graham W. Taylor, Geoffrey E. Hinton, Sam T. Roweis
Abstract: In this paper we develop a class of nonlinear generative models for high-dimensional time series. We first propose a model based on the restricted Boltzmann machine (RBM) that uses an undirected model with binary latent variables and real-valued “visible” variables. The latent and visible variables at each time step receive directed connections from the visible variables at the last few time-steps. This “conditional” RBM (CRBM) makes on-line inference efficient and allows us to use a simple approximate learning procedure. We demonstrate the power of our approach by synthesizing various sequences from a model trained on motion capture data and by performing on-line filling in of data lost during capture. We extend the CRBM in a way that preserves its most important computational properties and introduces multiplicative three-way interactions that allow the effective interaction weight between two variables to be modulated by the dynamic state of a third variable. We introduce a factoring of the implied three-way weight tensor to permit a more compact parameterization. The resulting model can capture diverse styles of motion with a single set of parameters, and the three-way interactions greatly improve its ability to blend motion styles or to transition smoothly among them. Videos and source code can be found at http://www.cs.nyu.edu/˜gwtaylor/publications/ jmlr2011. Keywords: unsupervised learning, restricted Boltzmann machines, time series, generative models, motion capture
4 0.46537724 68 jmlr-2011-Natural Language Processing (Almost) from Scratch
Author: Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, Pavel Kuksa
Abstract: We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements. Keywords: natural language processing, neural networks
5 0.38098145 60 jmlr-2011-Locally Defined Principal Curves and Surfaces
Author: Umut Ozertem, Deniz Erdogmus
Abstract: Principal curves are defined as self-consistent smooth curves passing through the middle of the data, and they have been used in many applications of machine learning as a generalization, dimensionality reduction and a feature extraction tool. We redefine principal curves and surfaces in terms of the gradient and the Hessian of the probability density estimate. This provides a geometric understanding of the principal curves and surfaces, as well as a unifying view for clustering, principal curve fitting and manifold learning by regarding those as principal manifolds of different intrinsic dimensionalities. The theory does not impose any particular density estimation method can be used with any density estimator that gives continuous first and second derivatives. Therefore, we first present our principal curve/surface definition without assuming any particular density estimation method. Afterwards, we develop practical algorithms for the commonly used kernel density estimation (KDE) and Gaussian mixture models (GMM). Results of these algorithms are presented in notional data sets as well as real applications with comparisons to other approaches in the principal curve literature. All in all, we present a novel theoretical understanding of principal curves and surfaces, practical algorithms as general purpose machine learning tools, and applications of these algorithms to several practical problems. Keywords: unsupervised learning, dimensionality reduction, principal curves, principal surfaces, subspace constrained mean-shift
6 0.35896769 3 jmlr-2011-A Cure for Variance Inflation in High Dimensional Kernel Principal Component Analysis
7 0.30013335 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels
8 0.26845112 66 jmlr-2011-Multiple Kernel Learning Algorithms
9 0.25778762 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
10 0.25643778 105 jmlr-2011-lp-Norm Multiple Kernel Learning
11 0.24769726 55 jmlr-2011-Learning Multi-modal Similarity
12 0.23726326 49 jmlr-2011-Kernel Regression in the Presence of Correlated Errors
13 0.22523049 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
14 0.21977271 50 jmlr-2011-LPmade: Link Prediction Made Easy
15 0.21860357 75 jmlr-2011-Parallel Algorithm for Learning Optimal Bayesian Network Structure
16 0.21181753 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
17 0.20450006 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
18 0.1786664 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
19 0.17748062 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
20 0.16963595 98 jmlr-2011-Universality, Characteristic Kernels and RKHS Embedding of Measures
topicId topicWeight
[(4, 0.033), (9, 0.04), (10, 0.034), (24, 0.036), (31, 0.083), (32, 0.052), (41, 0.028), (45, 0.331), (60, 0.067), (66, 0.021), (71, 0.012), (73, 0.063), (78, 0.039), (86, 0.023), (87, 0.011), (90, 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.75157052 48 jmlr-2011-Kernel Analysis of Deep Networks
Author: Grégoire Montavon, Mikio L. Braun, Klaus-Robert Müller
Abstract: When training deep networks it is common knowledge that an efficient and well generalizing representation of the problem is formed. In this paper we aim to elucidate what makes the emerging representation successful. We analyze the layer-wise evolution of the representation in a deep network by building a sequence of deeper and deeper kernels that subsume the mapping performed by more and more layers of the deep network and measuring how these increasingly complex kernels fit the learning problem. We observe that deep networks create increasingly better representations of the learning problem and that the structure of the deep network controls how fast the representation of the task is formed layer after layer. Keywords: deep networks, kernel principal component analysis, representations
2 0.38573283 42 jmlr-2011-In All Likelihood, Deep Belief Is Not Enough
Author: Lucas Theis, Sebastian Gerwinn, Fabian Sinz, Matthias Bethge
Abstract: Statistical models of natural images provide an important tool for researchers in the fields of machine learning and computational neuroscience. The canonical measure to quantitatively assess and compare the performance of statistical models is given by the likelihood. One class of statistical models which has recently gained increasing popularity and has been applied to a variety of complex data is formed by deep belief networks. Analyses of these models, however, have often been limited to qualitative analyses based on samples due to the computationally intractable nature of their likelihood. Motivated by these circumstances, the present article introduces a consistent estimator for the likelihood of deep belief networks which is computationally tractable and simple to apply in practice. Using this estimator, we quantitatively investigate a deep belief network for natural image patches and compare its performance to the performance of other models for natural image patches. We find that the deep belief network is outperformed with respect to the likelihood even by very simple mixture models. Keywords: deep belief network, restricted Boltzmann machine, likelihood estimation, natural image statistics, potential log-likelihood
3 0.3645741 66 jmlr-2011-Multiple Kernel Learning Algorithms
Author: Mehmet Gönen, Ethem Alpaydın
Abstract: In recent years, several methods have been proposed to combine multiple kernels instead of using a single one. These different kernels may correspond to using different notions of similarity or may be using information coming from multiple sources (different representations or different feature subsets). In trying to organize and highlight the similarities and differences between them, we give a taxonomy of and review several multiple kernel learning algorithms. We perform experiments on real data sets for better illustration and comparison of existing algorithms. We see that though there may not be large differences in terms of accuracy, there is difference between them in complexity as given by the number of stored support vectors, the sparsity of the solution as given by the number of used kernels, and training time complexity. We see that overall, using multiple kernels instead of a single one is useful and believe that combining kernels in a nonlinear or data-dependent way seems more promising than linear combination in fusing information provided by simple linear kernels, whereas linear methods are more reasonable when combining complex Gaussian kernels. Keywords: support vector machines, kernel machines, multiple kernel learning
4 0.35878569 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
Author: Mark D. Reid, Robert C. Williamson
Abstract: We unify f -divergences, Bregman divergences, surrogate regret bounds, proper scoring rules, cost curves, ROC-curves and statistical information. We do this by systematically studying integral and variational representations of these objects and in so doing identify their representation primitives which all are related to cost-sensitive binary classification. As well as developing relationships between generative and discriminative views of learning, the new machinery leads to tight and more general surrogate regret bounds and generalised Pinsker inequalities relating f -divergences to variational divergence. The new viewpoint also illuminates existing algorithms: it provides a new derivation of Support Vector Machines in terms of divergences and relates maximum mean discrepancy to Fisher linear discriminants. Keywords: classification, loss functions, divergence, statistical information, regret bounds
5 0.35155579 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
Author: Tony Jebara
Abstract: A multitask learning framework is developed for discriminative classification and regression where multiple large-margin linear classifiers are estimated for different prediction problems. These classifiers operate in a common input space but are coupled as they recover an unknown shared representation. A maximum entropy discrimination (MED) framework is used to derive the multitask algorithm which involves only convex optimization problems that are straightforward to implement. Three multitask scenarios are described. The first multitask method produces multiple support vector machines that learn a shared sparse feature selection over the input space. The second multitask method produces multiple support vector machines that learn a shared conic kernel combination. The third multitask method produces a pooled classifier as well as adaptively specialized individual classifiers. Furthermore, extensions to regression, graphical model structure estimation and other sparse methods are discussed. The maximum entropy optimization problems are implemented via a sequential quadratic programming method which leverages recent progress in fast SVM solvers. Fast monotonic convergence bounds are provided by bounding the MED sparsifying cost function with a quadratic function and ensuring only a constant factor runtime increase above standard independent SVM solvers. Results are shown on multitask data sets and favor multitask learning over single-task or tabula rasa methods. Keywords: meta-learning, support vector machines, feature selection, kernel selection, maximum entropy, large margin, Bayesian methods, variational bounds, classification, regression, Lasso, graphical model structure estimation, quadratic programming, convex programming
6 0.35115364 25 jmlr-2011-Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood
7 0.34874398 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
8 0.34709406 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
9 0.34136185 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
10 0.33886239 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
11 0.33837044 60 jmlr-2011-Locally Defined Principal Curves and Surfaces
12 0.33580992 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
13 0.33521926 12 jmlr-2011-Bayesian Co-Training
14 0.33478785 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes
15 0.3344937 91 jmlr-2011-The Sample Complexity of Dictionary Learning
16 0.33419979 105 jmlr-2011-lp-Norm Multiple Kernel Learning
17 0.33259255 62 jmlr-2011-MSVMpack: A Multi-Class Support Vector Machine Package
18 0.33158469 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
19 0.32929936 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
20 0.32534364 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation