jmlr jmlr2011 jmlr2011-42 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Lucas Theis, Sebastian Gerwinn, Fabian Sinz, Matthias Bethge
Abstract: Statistical models of natural images provide an important tool for researchers in the fields of machine learning and computational neuroscience. The canonical measure to quantitatively assess and compare the performance of statistical models is given by the likelihood. One class of statistical models which has recently gained increasing popularity and has been applied to a variety of complex data is formed by deep belief networks. Analyses of these models, however, have often been limited to qualitative analyses based on samples due to the computationally intractable nature of their likelihood. Motivated by these circumstances, the present article introduces a consistent estimator for the likelihood of deep belief networks which is computationally tractable and simple to apply in practice. Using this estimator, we quantitatively investigate a deep belief network for natural image patches and compare its performance to the performance of other models for natural image patches. We find that the deep belief network is outperformed with respect to the likelihood even by very simple mixture models. Keywords: deep belief network, restricted Boltzmann machine, likelihood estimation, natural image statistics, potential log-likelihood
Reference: text
sentIndex sentText sentNum sentScore
1 One class of statistical models which has recently gained increasing popularity and has been applied to a variety of complex data is formed by deep belief networks. [sent-15, score-0.313]
2 Motivated by these circumstances, the present article introduces a consistent estimator for the likelihood of deep belief networks which is computationally tractable and simple to apply in practice. [sent-17, score-0.489]
3 Using this estimator, we quantitatively investigate a deep belief network for natural image patches and compare its performance to the performance of other models for natural image patches. [sent-18, score-0.691]
4 We find that the deep belief network is outperformed with respect to the likelihood even by very simple mixture models. [sent-19, score-0.44]
5 Keywords: deep belief network, restricted Boltzmann machine, likelihood estimation, natural image statistics, potential log-likelihood [sent-20, score-0.537]
6 The most prominent example of recent research in hierarchical image modeling is the study of a class of hierarchical generative models called deep belief networks. [sent-33, score-0.515]
7 They were introduced by Hinton et al. (2006) together with a greedy learning rule as an approach to the long-standing challenge of training deep neural networks, that is, hierarchical neural networks such as multi-layer perceptrons. [sent-35, score-0.359]
8 When applied to natural images, deep belief networks have been shown to develop biologically plausible features (Lee and Ng, 2007) and samples from the model were shown to adhere to certain statistical regularities also found in natural images (Osindero and Hinton, 2008). [sent-44, score-0.476]
9 Examples of natural image patches and features learned by a deep belief network are presented in Figure 1. [sent-45, score-0.561]
10 Even when the ultimate goal is classification, deep belief networks and related unsupervised feature learning approaches are optimized with respect to the likelihood. [sent-57, score-0.315]
11 Unfortunately, the likelihood of deep belief networks is in general computationally intractable to evaluate. [sent-59, score-0.397]
12 Figure 1: Left: Natural image patches sampled from the van Hateren dataset (van Hateren and van der Schaaf, 1998). [sent-60, score-0.279]
13 Right: Filters learned by a deep belief network trained on whitened image patches. [sent-61, score-0.496]
14 In this article, we set out to test the performance of a deep belief network by evaluating its likelihood. [sent-62, score-0.351]
15 After reviewing the relevant aspects of deep belief networks, we will derive a new consistent estimator for their likelihood and demonstrate the estimator’s applicability in practice. [sent-63, score-0.463]
16 We will investigate a particular deep belief network’s capability to model the statistical regularities found in natural image patches. [sent-64, score-0.42]
17 We will show that the deep belief network under study is not particularly good at capturing the statistics of natural image patches as it is outperformed with respect to the likelihood even by very simple mixture models. [sent-65, score-0.678]
18 Models. In this section we will review the statistical models used in the remainder of this article and discuss some of their properties relevant for estimating the likelihood of deep belief networks (DBNs). [sent-68, score-0.421]
19 Filled nodes denote visible variables; unfilled nodes denote hidden variables. [sent-74, score-0.328]
20 C: A semi-restricted Boltzmann machine, which in contrast to RBMs also allows connections between the visible units. [sent-77, score-0.251]
21 We will refer to states of observed or visible random variables as x and to states of unobserved or hidden random variables as y, such that s = (x, y). [sent-85, score-0.408]
22 The corresponding graph has no connections between the visible units and no connections between the hidden units (Figure 2). [sent-103, score-0.762]
23 The unnormalized marginal distribution of the visible units becomes q∗(x) = exp(b⊤x) ∏_j (1 + exp(w_j⊤x + c_j)). [sent-105, score-0.478]
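As an illustration of how this quantity can be evaluated in practice, here is a minimal numpy sketch (our own, not the authors' released code; parameter names follow the b, W, c notation above, and the log domain is used for numerical stability):

import numpy as np

def rbm_log_q_star(x, b, W, c):
    # log q*(x) = b^T x + sum_j log(1 + exp(w_j^T x + c_j)) for a binary RBM
    a = W.T @ x + c                      # hidden unit activations w_j^T x + c_j
    return b @ x + np.sum(np.logaddexp(0.0, a))

# usage with random parameters and a random binary visible state
rng = np.random.default_rng(0)
D, H = 16, 15                            # illustrative numbers of visible and hidden units
b, c = rng.normal(size=D), rng.normal(size=H)
W = 0.1 * rng.normal(size=(D, H))
x = rng.integers(0, 2, size=D).astype(float)
print(rbm_log_q_star(x, b, W, c))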
24 The GRBM employs continuous visible units and binary hidden units and can thus be used to model continuous data. [sent-107, score-0.709]
25 Each binary state of the hidden units encodes one mean, while σ controls the variance of each Gaussian and is the same for all hidden units. [sent-111, score-0.41]
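A corresponding hedged sketch for the GRBM, assuming one common parameterization with energy E(x, y) = ||x − b||²/(2σ²) − c⊤y − x⊤Wy/σ² (the article's exact parameterization may differ):

import numpy as np

def grbm_log_q_star(x, b, W, c, sigma):
    # log q*(x) = -||x - b||^2 / (2 sigma^2) + sum_j log(1 + exp(c_j + w_j^T x / sigma^2))
    a = W.T @ x / sigma**2 + c
    return -np.sum((x - b)**2) / (2.0 * sigma**2) + np.sum(np.logaddexp(0.0, a))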
26 In an SRBM, only the hidden units are constrained to have no direct connections to each other while the visible units are unconstrained (Figure 2). [sent-113, score-0.723]
27 Analytic expressions are therefore only available for q∗ (x) but not for q∗ (y) and the visible units are no longer independent given a state for the hidden units. [sent-114, score-0.506]
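For concreteness, a hedged sketch of q*(x) for an SRBM, assuming the energy E(x, y) = −b⊤x − ½x⊤Lx − c⊤y − x⊤Wy with symmetric lateral visible weights L and zero diagonal (again, the parameterization used in the article may differ); the hidden units remain conditionally independent given x and can therefore be summed out:

import numpy as np

def srbm_log_q_star_v(x, b, W, c, L):
    # log q*(x) = b^T x + 0.5 x^T L x + sum_j log(1 + exp(c_j + w_j^T x))
    a = W.T @ x + c
    return b @ x + 0.5 * x @ (L @ x) + np.sum(np.logaddexp(0.0, a))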
28 3 Deep Belief Networks. Figure 3: A graphical model representation of a two-layer deep belief network composed of two RBMs. [sent-116, score-0.348]
29 DBNs (Hinton and Salakhutdinov, 2006) are hierarchical generative models composed of several layers of RBMs or one of their generalizations. [sent-118, score-0.282]
30 Let q(x, y) and r(y, z) be the densities of two RBMs over visible states x and hidden states y and z. [sent-119, score-0.408]
31 The resulting model is best described not as a deep Boltzmann machine but as a graphical model with undirected connections between y and z and directed connections between x and y (Figure 3). [sent-121, score-0.317]
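To make this structure concrete, here is a minimal sketch of how one would draw a sample from such a two-layer DBN of binary RBMs (an assumed sampling scheme, not the authors' implementation): block Gibbs sampling in the top RBM over (y, z), followed by a single directed down-pass through q(x | y).

import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def sample_dbn(W1, b1, W2, b2, c2, n_gibbs=100):
    # Gibbs chain in the top RBM r(y, z); W2 maps y (rows) to z (columns)
    y = (rng.random(W2.shape[0]) < 0.5).astype(float)
    for _ in range(n_gibbs):
        z = (rng.random(W2.shape[1]) < sigmoid(W2.T @ y + c2)).astype(float)
        y = (rng.random(W2.shape[0]) < sigmoid(W2 @ z + b2)).astype(float)
    # directed down-pass: visible units are independent given y in the first-layer RBM
    x = (rng.random(W1.shape[0]) < sigmoid(W1 @ y + b1)).astype(float)
    return x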
32 DBNs with an arbitrary number of layers have been shown to be universal approximators even if the number of hidden units in each layer is fixed to the number of visible units (Sutskever and Hinton, 2008). [sent-123, score-1.106]
33 The first approximation is made by training the DBN in a greedy manner: After the first layer of the model has been trained to approximate the data distribution, its parameters are fixed and only the parameters of the second layer are optimized. [sent-127, score-0.714]
34 The greedy learning procedure can be generalized to more layers by training each additional layer to approximate the distribution obtained by conditionally sampling from each layer in turn, starting with the lowest layer. [sent-134, score-0.804]
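A hedged sketch of this greedy procedure for a stack of binary RBMs, using CD-1 updates as a simple stand-in for the training rules discussed later (this is not the authors' training code):

import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def cd1_step(X, W, b, c, lr=0.01):
    # one contrastive divergence (CD-1) update on a batch X of visible states (rows)
    ph = sigmoid(X @ W + c)
    h = (rng.random(ph.shape) < ph).astype(float)
    pv = sigmoid(h @ W.T + b)                 # mean-field reconstruction
    ph2 = sigmoid(pv @ W + c)
    W += lr * (X.T @ ph - pv.T @ ph2) / len(X)
    b += lr * (X - pv).mean(0)
    c += lr * (ph - ph2).mean(0)

def train_dbn_greedy(X, layer_sizes, epochs=10):
    # train each RBM on samples of the hidden representation produced by the layers
    # below, whose parameters are kept fixed once trained
    layers, data = [], X
    for n_hidden in layer_sizes:
        W = 0.01 * rng.normal(size=(data.shape[1], n_hidden))
        b, c = np.zeros(data.shape[1]), np.zeros(n_hidden)
        for _ in range(epochs):
            cd1_step(data, W, b, c)
        layers.append((W, b, c))
        data = (rng.random((len(data), n_hidden)) < sigmoid(data @ W + c)).astype(float)
    return layers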
35 Second, despite being able to integrate analytically over z, even computing just the unnormalized likelihood still requires integration over an exponential number of hidden states y, p∗(x) = ∑_y q(x | y) r∗(y). [sent-147, score-0.351]
36 After briefly reviewing previous approaches to resolving these difficulties, we will propose an unbiased estimator for p∗(x), its contribution being a possible solution to the second problem, and discuss how to construct a consistent estimator for p(x) based on this result. [sent-148, score-0.25]
37 w(x) = q∗(x)/s(x). (6) Estimates of the partition function Zq can therefore be obtained by drawing samples x(n) from a proposal distribution s and averaging the resulting importance weights w(x(n)). [sent-154, score-0.249]
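A minimal sketch of this estimate with generic placeholder functions (log_q_star evaluates log q*(x), sample_s draws from the proposal, and log_s evaluates its normalized log-density; these names are ours, not the paper's):

import numpy as np

def estimate_log_partition(log_q_star, sample_s, log_s, n_samples=10000):
    # log Z_q estimated as log( (1/N) sum_n q*(x_n) / s(x_n) ), x_n ~ s,
    # computed in the log domain for numerical stability
    log_w = np.array([log_q_star(x) - log_s(x)
                      for x in (sample_s() for _ in range(n_samples))])
    return np.logaddexp.reduce(log_w) - np.log(n_samples)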
38 It was pointed out in Minka (2005) that minimizing the variance of the importance sampling estimate of the partition function (6) is equivalent to minimizing an α-divergence1 between the proposal distribution s and the true distribution q. [sent-155, score-0.246]
39 Also note that the partition function Zr only has to be calculated once for all visible states we wish to evaluate. [sent-204, score-0.294]
40 Note that although the estimator tends to overestimate the true likelihood in expectation and is unbiased in the limit, it is still possible for it to underestimate the true likelihood most of the time. [sent-215, score-0.322]
41 If we cast Equation 9 into the form of Equation 6, we see that our estimator performs importance sampling with proposal distribution q(y | x) and target distribution p(y | x), where p(x) can be seen as the partition function of an unnormalized distribution p(x, y). [sent-219, score-0.426]
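In our reading (a hedged reconstruction of the construction described here, not a verbatim copy of the article's Equation 9), the underlying identity is
\[
p^*(x) \;=\; \sum_y q(x \mid y)\, r^*(y)
       \;=\; \sum_y q(y \mid x)\, \frac{q(x \mid y)}{q(y \mid x)}\, r^*(y)
       \;=\; q^*(x) \sum_y q(y \mid x)\, \frac{r^*(y)}{q^*(y)},
\]
since q(x | y)/q(y | x) = q*(x)/q*(y). Drawing y(1), ..., y(N) from q(y | x) then gives the unbiased estimate
\[
\hat{p}^*(x) \;=\; q^*(x)\, \frac{1}{N} \sum_{n=1}^{N} \frac{r^*(y^{(n)})}{q^*(y^{(n)})},
\]
and dividing by an estimate of Zr (obtained, for example, with annealed importance sampling) yields a consistent estimate of p(x).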
42 As mentioned earlier, the efficiency of importance sampling estimates of partition functions depends on how well the proposal distribution approximates the true distribution. [sent-220, score-0.281]
43 , ZL, and if we refer to the states of the random vectors in each layer by x0, . . . , xL, [sent-237, score-0.286]
44 where x0 contains the visible states and xL contains the states of the top hidden layer, then p(x0) = ∑_{x1, . . . , xL} p(x0, x1, . . . , xL). [sent-240, score-0.408]
45 Intuitively, the estimation process can be imagined as first assigning a basic value using the first layer and then correcting this value based on how the densities of each pair of consecutive layers relate to each other. [sent-255, score-0.422]
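A hedged numpy sketch of this intuition for an L-layer DBN (our reconstruction of the estimator with placeholder per-layer functions; not the released implementation): each proposal sample is propagated upwards through the conditional distributions, and the log-weight collects the base value of the first layer plus one density-ratio correction per pair of consecutive layers.

import numpy as np

def estimate_log_p_star(x0, layers, n_samples=100):
    # layers[l] is assumed to provide three callables for RBM l (l = 0 is the bottom layer):
    #   sample_h(x)      -- a sample from its conditional over hidden states given visible x
    #   log_q_star_v(x)  -- log of its unnormalized marginal over visible states
    #   log_q_star_h(y)  -- log of its unnormalized marginal over hidden states
    log_w = np.empty(n_samples)
    for n in range(n_samples):
        log_weight = layers[0].log_q_star_v(x0)        # basic value from the first layer
        x = x0
        for lower, upper in zip(layers[:-1], layers[1:]):
            x = lower.sample_h(x)                      # propagate the state upwards
            # correction: how the densities of two consecutive layers relate on this state
            log_weight += upper.log_q_star_v(x) - lower.log_q_star_h(x)
        log_w[n] = log_weight
    # average the importance weights in the log domain; subtracting log Z of the top
    # layer (estimated once, e.g., by AIS) would turn this into an estimate of log p(x0)
    return np.logaddexp.reduce(log_w) - np.log(n_samples)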
46 For each choice for the second layer of a DBN, we obtain a different value for the log-likelihood and its lower bound. [sent-260, score-0.246]
47 However, it is unlikely that such a distribution will be found, as the second layer is optimized with respect to the lower bound. [sent-264, score-0.246]
48 The model consists of a GRBM in the first layer and SRBMs in the subsequent layers. [sent-286, score-0.271]
49 For all but the topmost layer, our estimator requires evaluation of the unnormalized marginal density over hidden states (see Figure 4). [sent-287, score-0.336]
50 In an SRBM, however, integrating out the visible states is analytically intractable. [sent-288, score-0.277]
51 It employed 15 hidden units in each of the first two layers, 50 hidden units in the third layer and was [sent-293, score-0.41]
52 Adding more layers to the network did not help to improve the performance if the GRBM employed only few hidden units. [sent-304, score-0.326]
53 trained on 4 by 4 pixel image patches taken from the van Hateren image data set (van Hateren and van der Schaaf, 1998). [sent-305, score-0.506]
54 Using the same preprocessing of the image patches and the same hyperparameters as in Murray and Salakhutdinov (2009), we trained a two-layer model with 2000 hidden units in the first layer and 500 hidden units in the second layer on 20 by 20 pixel van Hateren image patches. [sent-311, score-1.64]
55 We used 150000 patches for training and 50000 patches for testing and obtained a negative log-likelihood of 2. [sent-313, score-0.307]
56 We further evaluated a two-layer model with 500 hidden units in the first and 2000 hidden units in the second layer trained on a binarized version of the MNIST data set. [sent-316, score-0.955]
57 For the first layer we took the parameters from Salakhutdinov (2009), which were available online. [sent-317, score-0.246]
58 We then tried to match the training of the second-layer RBM with 2000 hidden units using the information given in Salakhutdinov (2009) and Murray and Salakhutdinov (2009). [sent-318, score-0.337]
59 In our case, the target distribution is the conditional distribution over the hidden states of a DBN given a state for the visible units. [sent-326, score-0.368]
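For reference, a brief sketch of the relative effective sample size as we understand the quantity reported below, computed from log importance weights (the exact definition used in the article may differ in detail):

import numpy as np

def relative_ess(log_w):
    # ESS / N = (sum_n w_n)^2 / (N * sum_n w_n^2), invariant to rescaling the weights
    log_w = np.asarray(log_w) - np.max(log_w)     # shift for numerical stability
    w = np.exp(log_w)
    return w.sum()**2 / (len(w) * np.sum(w**2))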
60 Figure 6: Left: Distributions of relative effective sample sizes (ESS) for two-layer DBNs (thick lines) and three-layer DBNs (thin lines) trained on 4 by 4 pixel image patches using different training rules. [sent-331, score-0.402]
61 Right: Distributions of relative ESSs for two two-layer models trained on 20 by 20 pixel image patches. [sent-333, score-0.251]
62 Training a model on the larger image patches using the same hyperparameters as used by Osindero and Hinton (2008) and Murray and Salakhutdinov (2009), for example, led to a distribution of ESSs which was sharply peaked around 0. [sent-339, score-0.357]
63 For the larger models trained on 20 by 20 van Hateren image patches, it took us less than a second and up to a few minutes to get reasonably accurate estimates for 50 samples (Figure 7). [sent-349, score-0.305]
64 The number after each model name indicates either the number of hidden units or the number of mixture components used. [sent-377, score-0.354]
65 Using PCD, we trained a three-layer model with 100 hidden units per layer on the smaller 4 by 4 image patches. [sent-381, score-0.738]
66 Adding a second layer to the network only helped very little and we found that the better the performance achieved by the GRBM, the smaller the improvement contributed by subsequent layers. [sent-388, score-0.28]
67 For the hyperparameters that led to the best performance, adding a third layer had no effect on the likelihood and in some cases even led to a decrease in performance (Figure 9). [sent-389, score-0.51]
68 The hyperparameters were selected by performing separate grid searches for the different layers (for details, see Appendix A). [sent-391, score-0.24]
69 For other sets of hyperparameters we found that adding a third layer was still able to yield some improvement, but the overall performance of the three-layer model achieved with these hyperparameters was worse. [sent-399, score-0.399]
70 We also trained a two-layer DBN with 500 hidden units in the first layer and 2000 hidden units in the third layer on 20 by 20 pixel image patches. [sent-400, score-1.307]
71 As for the smaller model, the improvement in performance of the GRBM led to a smaller improvement gained by adding a second layer to the model. [sent-404, score-0.305]
72 Despite the much larger second layer, the performance gain induced by the second layer was even less than for the smaller models. [sent-405, score-0.246]
73 The estimator is unbiased for the unnormalized likelihood, but only asymptotically unbiased for the log-likelihood. [sent-436, score-0.312]
74 Since the number of proposal samples is the only parameter of our estimator and the evaluation can generally be performed quickly, a good way to do this is to test the estimator for different numbers of proposal samples on a smaller test set. [sent-438, score-0.512]
75 One goal of this paper, however, was to see if a deep network with simple layer modules could perform well as a generative model for natural images. [sent-443, score-0.587]
76 While more complex Boltzmann machines are likely to achieve a better likelihood, it is not clear why they should also be good choices as layer modules for constructing DBNs. [sent-444, score-0.305]
77 In our experiments, better performances achieved with the first layer were always accompanied by a decrease in the performance gained by adding layers. [sent-446, score-0.246]
78 This would make a model a better model for natural images, but not the best choice as a layer module for DBNs trained with the greedy learning algorithm. [sent-448, score-0.479]
79 This observation might be related to the observations made here, that adding layers does not help much to improve the likelihood on natural image patches. [sent-459, score-0.364]
80 A possible explanation for the small contribution of each layer would be that too little information is carried by the hidden representations of each layer about the respective visible states. [sent-460, score-0.82]
81 In ICA, all the information about a visible state is retained in the hidden representation, so that the potential log-likelihood is optimal for this layer module. [sent-464, score-0.634]
82 As shown in Figure 8, already a single ICA layer can compete with a DBN based on RBMs. [sent-465, score-0.246]
83 Furthermore, adding layers to the network is guaranteed to improve the likelihood of the model (Chen and Gopinath, 2001) and not just a lower bound as with the greedy learning algorithm for DBNs. [sent-466, score-0.375]
84 Hosseini and Bethge (2009) have shown that adding layers does indeed give significant improvements of the likelihood, although it was also shown that hierarchical ICA cannot compete with other, nonhierarchical models of natural images. [sent-467, score-0.272]
85 A lot of research has been devoted to creating new layer modules (e. [sent-470, score-0.271]
86 Currently, the lower layers are trained in a way which is independent of whether additional layers will be added to the network. [sent-479, score-0.448]
87 An improvement could nevertheless indirectly be achieved by maximizing the information that is carried in the hidden representations about the states of the visible units. [sent-482, score-0.368]
88 For the GRBM with 100 hidden units trained on 4 by 4 pixel image patches, we used a σ of 0. [sent-503, score-0.521]
89 During training, approximate samples from the conditional distribution of the visible units were obtained using 20 parallel mean field updates with a damping parameter of 0. [sent-510, score-0.428]
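A hedged sketch of such damped parallel mean-field updates for the visible units of an SRBM given a hidden state y (W are the visible-to-hidden weights, L the symmetric lateral visible weights, b the visible biases; the damping value of 0.2 is purely illustrative, since the value above is truncated):

import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def mean_field_visible(y, W, L, b, n_updates=20, damping=0.2):
    mu = np.full(W.shape[0], 0.5)                  # initial mean-field estimate of E[x | y]
    for _ in range(n_updates):
        new_mu = sigmoid(b + W @ y + L @ mu)       # parallel update of all visible units
        mu = damping * mu + (1.0 - damping) * new_mu
    return mu                                      # a binary sample can be drawn from Bernoulli(mu)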
90 The third layer was randomly initialized and had to be trained for 500 epochs before convergence was reached. [sent-514, score-0.375]
91 For the larger model applied to 20 by 20 pixel image patches, we trained both layers for 200 epochs using PCD. [sent-521, score-0.461]
92 Conditioned on the hidden units, the visible units were updated in a random order. [sent-534, score-0.506]
93 To estimate the unnormalized log-probability of each data point with respect to the smaller models, we used 200 samples from the proposal distribution of our estimator for the two-layer, and 500 samples for the three-layer model. [sent-535, score-0.382]
94 We used 4000 samples from the proposal distribution of our estimator to estimate the log-likelihood. [sent-543, score-0.256]
95 To estimate the unnormalized hidden marginals of SRBMs, we used 100 proposal samples from an approximating RBM with the same hidden-to-visible connections and same bias weights. [sent-545, score-0.407]
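A hedged sketch of this step, reusing the SRBM energy assumed earlier and a proposal s(x | y) given by the factorial conditional of an RBM that shares the visible-to-hidden weights and biases (all names are ours, not those of the released code):

import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def srbm_log_q_star_h(y, b, W, c, L, n_samples=100):
    # q*(y) = sum_x q*(x, y) estimated by importance sampling over visible states x,
    # with x drawn from the factorial RBM conditional that ignores the lateral weights L
    a = b + W @ y                                  # logits of the proposal s(x | y)
    log_w = np.empty(n_samples)
    for n in range(n_samples):
        x = (rng.random(a.shape) < sigmoid(a)).astype(float)
        log_s = x @ a - np.sum(np.logaddexp(0.0, a))                 # log s(x | y)
        log_q = b @ x + c @ y + x @ (W @ y) + 0.5 * x @ (L @ x)      # log q*(x, y)
        log_w[n] = log_q - log_s
    return np.logaddexp.reduce(log_w) - np.log(n_samples)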
96 Code for training and evaluating deep belief networks using our estimator can be found under http://www. [sent-549, score-0.478]
97 Modeling pixel means and covariances using factorized third-order Boltzmann machines. [sent-711, score-0.24]
98 Factored 3-way restricted Boltzmann machines for modeling natural images. [sent-718, score-0.249]
99 Representational power of restricted Boltzmann machines and deep belief networks. [sent-740, score-0.509]
100 Training restricted Boltzmann machines using approximations to the likelihood gradient. [sent-804, score-0.302]
wordName wordTfidf (topN-words)
[('grbm', 0.303), ('layer', 0.246), ('dbn', 0.228), ('visible', 0.212), ('dbns', 0.193), ('hinton', 0.189), ('deep', 0.189), ('boltzmann', 0.186), ('units', 0.178), ('salakhutdinov', 0.177), ('layers', 0.176), ('xl', 0.173), ('erwinn', 0.141), ('ethge', 0.141), ('heis', 0.141), ('inz', 0.141), ('zr', 0.141), ('patches', 0.132), ('elief', 0.13), ('nough', 0.13), ('proposal', 0.126), ('pcd', 0.119), ('hidden', 0.116), ('eep', 0.111), ('srbm', 0.108), ('belief', 0.1), ('ot', 0.099), ('murray', 0.097), ('trained', 0.096), ('estimator', 0.092), ('ranzato', 0.091), ('unnormalized', 0.088), ('hateren', 0.087), ('likelihood', 0.082), ('image', 0.077), ('ikelihood', 0.076), ('osindero', 0.074), ('rbm', 0.074), ('rbms', 0.074), ('unbiased', 0.066), ('bethge', 0.065), ('hyperparameters', 0.064), ('ql', 0.061), ('cd', 0.061), ('potential', 0.06), ('led', 0.059), ('greedy', 0.058), ('bits', 0.055), ('pixel', 0.054), ('const', 0.05), ('ess', 0.05), ('ll', 0.046), ('importance', 0.043), ('esss', 0.043), ('plim', 0.043), ('srbms', 0.043), ('training', 0.043), ('hierarchical', 0.043), ('partition', 0.042), ('images', 0.04), ('states', 0.04), ('connections', 0.039), ('generative', 0.039), ('energy', 0.039), ('roux', 0.038), ('loglikelihood', 0.038), ('samples', 0.038), ('sinz', 0.037), ('sampling', 0.035), ('mixture', 0.035), ('estimates', 0.035), ('van', 0.035), ('ica', 0.034), ('machines', 0.034), ('network', 0.034), ('zq', 0.033), ('tuebingen', 0.033), ('ais', 0.033), ('epochs', 0.033), ('mpg', 0.033), ('eichhorn', 0.032), ('gerwinn', 0.032), ('grbms', 0.032), ('hosseini', 0.032), ('susskind', 0.032), ('natural', 0.029), ('welling', 0.028), ('theis', 0.028), ('lateral', 0.028), ('evaluating', 0.028), ('posterior', 0.026), ('networks', 0.026), ('model', 0.025), ('analytically', 0.025), ('matthias', 0.025), ('lucas', 0.025), ('fabian', 0.025), ('modules', 0.025), ('models', 0.024), ('jan', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999997 42 jmlr-2011-In All Likelihood, Deep Belief Is Not Enough
Author: Lucas Theis, Sebastian Gerwinn, Fabian Sinz, Matthias Bethge
Abstract: Statistical models of natural images provide an important tool for researchers in the fields of machine learning and computational neuroscience. The canonical measure to quantitatively assess and compare the performance of statistical models is given by the likelihood. One class of statistical models which has recently gained increasing popularity and has been applied to a variety of complex data is formed by deep belief networks. Analyses of these models, however, have often been limited to qualitative analyses based on samples due to the computationally intractable nature of their likelihood. Motivated by these circumstances, the present article introduces a consistent estimator for the likelihood of deep belief networks which is computationally tractable and simple to apply in practice. Using this estimator, we quantitatively investigate a deep belief network for natural image patches and compare its performance to the performance of other models for natural image patches. We find that the deep belief network is outperformed with respect to the likelihood even by very simple mixture models. Keywords: deep belief network, restricted Boltzmann machine, likelihood estimation, natural image statistics, potential log-likelihood
2 0.36159843 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
Author: Graham W. Taylor, Geoffrey E. Hinton, Sam T. Roweis
Abstract: In this paper we develop a class of nonlinear generative models for high-dimensional time series. We first propose a model based on the restricted Boltzmann machine (RBM) that uses an undirected model with binary latent variables and real-valued “visible” variables. The latent and visible variables at each time step receive directed connections from the visible variables at the last few time-steps. This “conditional” RBM (CRBM) makes on-line inference efficient and allows us to use a simple approximate learning procedure. We demonstrate the power of our approach by synthesizing various sequences from a model trained on motion capture data and by performing on-line filling in of data lost during capture. We extend the CRBM in a way that preserves its most important computational properties and introduces multiplicative three-way interactions that allow the effective interaction weight between two variables to be modulated by the dynamic state of a third variable. We introduce a factoring of the implied three-way weight tensor to permit a more compact parameterization. The resulting model can capture diverse styles of motion with a single set of parameters, and the three-way interactions greatly improve its ability to blend motion styles or to transition smoothly among them. Videos and source code can be found at http://www.cs.nyu.edu/˜gwtaylor/publications/ jmlr2011. Keywords: unsupervised learning, restricted Boltzmann machines, time series, generative models, motion capture
3 0.28261971 48 jmlr-2011-Kernel Analysis of Deep Networks
Author: Grégoire Montavon, Mikio L. Braun, Klaus-Robert Müller
Abstract: When training deep networks it is common knowledge that an efficient and well generalizing representation of the problem is formed. In this paper we aim to elucidate what makes the emerging representation successful. We analyze the layer-wise evolution of the representation in a deep network by building a sequence of deeper and deeper kernels that subsume the mapping performed by more and more layers of the deep network and measuring how these increasingly complex kernels fit the learning problem. We observe that deep networks create increasingly better representations of the learning problem and that the structure of the deep network controls how fast the representation of the task is formed layer after layer. Keywords: deep networks, kernel principal component analysis, representations
4 0.096670456 68 jmlr-2011-Natural Language Processing (Almost) from Scratch
Author: Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, Pavel Kuksa
Abstract: We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements. Keywords: natural language processing, neural networks
5 0.048963312 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
Author: Krishnakumar Balasubramanian, Pinar Donmez, Guy Lebanon
Abstract: Many popular linear classifiers, such as logistic regression, boosting, or SVM, are trained by optimizing a margin-based risk function. Traditionally, these risk functions are computed based on a labeled data set. We develop a novel technique for estimating such risks using only unlabeled data and the marginal label distribution. We prove that the proposed risk estimator is consistent on high-dimensional data sets and demonstrate it on synthetic and real-world data. In particular, we show how the estimate is used for evaluating classifiers in transfer learning, and for training classifiers with no labeled data whatsoever. Keywords: classification, large margin, maximum likelihood
6 0.046469286 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review
7 0.045888551 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood
8 0.044284966 30 jmlr-2011-Efficient Structure Learning of Bayesian Networks using Constraints
9 0.042874172 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
10 0.041719574 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
11 0.036886927 54 jmlr-2011-Learning Latent Tree Graphical Models
12 0.03688452 61 jmlr-2011-Logistic Stick-Breaking Process
13 0.033598483 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods
14 0.030970987 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
15 0.030360471 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes
16 0.027804181 10 jmlr-2011-Anechoic Blind Source Separation Using Wigner Marginals
17 0.027232563 11 jmlr-2011-Approximate Marginals in Latent Gaussian Models
18 0.026098147 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
19 0.026087407 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms
20 0.026068563 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
topicId topicWeight
[(0, 0.202), (1, -0.155), (2, -0.122), (3, -0.094), (4, -0.336), (5, 0.031), (6, 0.304), (7, -0.539), (8, -0.026), (9, 0.129), (10, 0.117), (11, 0.015), (12, -0.05), (13, 0.015), (14, 0.12), (15, 0.045), (16, -0.101), (17, -0.014), (18, 0.023), (19, 0.041), (20, -0.011), (21, -0.077), (22, -0.054), (23, -0.005), (24, -0.048), (25, -0.01), (26, 0.041), (27, 0.02), (28, 0.02), (29, -0.011), (30, 0.044), (31, 0.025), (32, -0.007), (33, -0.009), (34, 0.018), (35, 0.006), (36, -0.015), (37, -0.001), (38, -0.003), (39, -0.009), (40, -0.017), (41, 0.003), (42, -0.026), (43, -0.042), (44, -0.027), (45, 0.019), (46, 0.033), (47, -0.015), (48, -0.012), (49, -0.034)]
simIndex simValue paperId paperTitle
same-paper 1 0.96026433 42 jmlr-2011-In All Likelihood, Deep Belief Is Not Enough
2 0.89957786 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
3 0.72343624 48 jmlr-2011-Kernel Analysis of Deep Networks
4 0.32728276 68 jmlr-2011-Natural Language Processing (Almost) from Scratch
5 0.20047481 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review
Author: Thomas L. Griffiths, Zoubin Ghahramani
Abstract: The Indian buffet process is a stochastic process defining a probability distribution over equivalence classes of sparse binary matrices with a finite number of rows and an unbounded number of columns. This distribution is suitable for use as a prior in probabilistic models that represent objects using a potentially infinite array of features, or that involve bipartite graphs in which the size of at least one class of nodes is unknown. We give a detailed derivation of this distribution, and illustrate its use as a prior in an infinite latent feature model. We then review recent applications of the Indian buffet process in machine learning, discuss its extensions, and summarize its connections to other stochastic processes. Keywords: nonparametric Bayes, Markov chain Monte Carlo, latent variable models, Chinese restaurant processes, beta process, exchangeable distributions, sparse binary matrices
6 0.19257082 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
7 0.18191919 61 jmlr-2011-Logistic Stick-Breaking Process
8 0.16523419 54 jmlr-2011-Learning Latent Tree Graphical Models
9 0.14636803 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
10 0.13374276 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood
11 0.13174742 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
12 0.12349113 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
13 0.12296271 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
14 0.11653462 78 jmlr-2011-Producing Power-Law Distributions and Damping Word Frequencies with Two-Stage Language Models
15 0.11596481 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes
16 0.11448011 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
17 0.11428747 70 jmlr-2011-Non-Parametric Estimation of Topic Hierarchies from Texts with Hierarchical Dirichlet Processes
18 0.11252774 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
19 0.11095122 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
20 0.11068474 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models
topicId topicWeight
[(4, 0.032), (9, 0.03), (10, 0.026), (24, 0.036), (31, 0.106), (32, 0.026), (41, 0.021), (60, 0.516), (70, 0.02), (71, 0.013), (73, 0.032), (78, 0.046), (90, 0.018)]
simIndex simValue paperId paperTitle
1 0.98575211 83 jmlr-2011-Scikit-learn: Machine Learning in Python
Author: Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, Édouard Duchesnay
Abstract: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net. Keywords: Python, supervised learning, unsupervised learning, model selection
same-paper 2 0.82131958 42 jmlr-2011-In All Likelihood, Deep Belief Is Not Enough
3 0.3769559 62 jmlr-2011-MSVMpack: A Multi-Class Support Vector Machine Package
Author: Fabien Lauer, Yann Guermeur
Abstract: This paper describes MSVMpack, an open source software package dedicated to our generic model of multi-class support vector machine. All four multi-class support vector machines (M-SVMs) proposed so far in the literature appear as instances of this model. MSVMpack provides for them the first unified implementation and offers a convenient basis to develop other instances. This is also the first parallel implementation for M-SVMs. The package consists in a set of command-line tools with a callable library. The documentation includes a tutorial, a user’s guide and a developer’s guide. Keywords: multi-class support vector machines, open source, C
4 0.36773998 48 jmlr-2011-Kernel Analysis of Deep Networks
5 0.36568031 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
6 0.33585024 68 jmlr-2011-Natural Language Processing (Almost) from Scratch
7 0.32632458 102 jmlr-2011-Waffles: A Machine Learning Toolkit
8 0.30114388 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
9 0.28714177 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
10 0.28389749 12 jmlr-2011-Bayesian Co-Training
11 0.27999359 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood
12 0.27725527 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
13 0.27404949 31 jmlr-2011-Efficient and Effective Visual Codebook Generation Using Additive Kernels
14 0.27390742 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
15 0.27390626 95 jmlr-2011-Training SVMs Without Offset
16 0.27182189 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation
17 0.26969954 16 jmlr-2011-Clustering Algorithms for Chains
18 0.26905206 15 jmlr-2011-CARP: Software for Fishing Out Good Clustering Algorithms
19 0.26804513 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
20 0.26753813 99 jmlr-2011-Unsupervised Similarity-Based Risk Stratification for Cardiovascular Events Using Long-Term Time-Series Data