nips nips2007 nips2007-180 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Marc'aurelio Ranzato, Y-lan Boureau, Yann L. Cun
Abstract: Unsupervised learning algorithms aim to discover the structure hidden in the data, and to learn representations that are more suitable as input to a supervised machine than the raw input. Many unsupervised methods are based on reconstructing the input from the representation, while constraining the representation to have certain desirable properties (e.g. low dimension, sparsity, etc). Others are based on approximating density by stochastically reconstructing the input from the representation. We describe a novel and efficient algorithm to learn sparse representations, and compare it theoretically and experimentally with a similar machine trained probabilistically, namely a Restricted Boltzmann Machine. We propose a simple criterion to compare and select different unsupervised machines based on the trade-off between the reconstruction error and the information content of the representation. We demonstrate this method by extracting features from a dataset of handwritten numerals, and from a dataset of natural image patches. We show that by stacking multiple levels of such machines and by training sequentially, high-order dependencies between the input observed variables can be captured. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Unsupervised learning algorithms aim to discover the structure hidden in the data, and to learn representations that are more suitable as input to a supervised machine than the raw input. [sent-3, score-0.217]
2 Many unsupervised methods are based on reconstructing the input from the representation, while constraining the representation to have certain desirable properties (e. [sent-4, score-0.21]
3 We describe a novel and efficient algorithm to learn sparse representations, and compare it theoretically and experimentally with a similar machine trained probabilistically, namely a Restricted Boltzmann Machine. [sent-8, score-0.16]
4 We propose a simple criterion to compare and select different unsupervised machines based on the trade-off between the reconstruction error and the information content of the representation. [sent-9, score-0.318]
5 We show that by stacking multiple levels of such machines and by training sequentially, high-order dependencies between the input observed variables can be captured. [sent-11, score-0.208]
6 1 Introduction One of the main purposes of unsupervised learning is to produce good representations for data that can be used for detection, recognition, prediction, or visualization. [sent-12, score-0.189]
7 One cause for the recent resurgence of interest in unsupervised learning is the ability to produce deep feature hierarchies by stacking unsupervised modules on top of each other, as proposed by Hinton et al. [sent-14, score-0.367]
8 The unsupervised module at one level in the hierarchy is fed with the representation vectors produced by the level below. [sent-17, score-0.215]
9 An encoder transforms the input into the representation (also known as the code or the feature vector), and a decoder reconstructs the input (perhaps stochastically) from the representation. [sent-21, score-1.056]
10 PCA, Auto-encoder neural nets, Restricted Boltzmann Machines (RBMs), our previous sparse energy-based model [3], and the model proposed in [6] for noisy overcomplete channels are just examples of this kind of architecture. [sent-22, score-0.215]
11 after training, computing the code is a very fast process that merely consists in running the input through the encoder; [sent-24, score-0.383]
12 reconstructing the input with the decoder provides a way to check that the code has captured the relevant information in the data. [sent-25, score-0.671]
13 Some learning algorithms [7] do not have a decoder and must resort to computationally expensive Markov Chain Monte Carlo (MCMC) sampling methods in order to provide reconstructions. [sent-26, score-0.255]
14 Other learning algorithms [8, 9] lack an encoder, which makes it necessary to run an expensive optimization algorithm to find the code associated with each new input sample. [sent-27, score-0.383]
15 The energy function includes the reconstruction error, and perhaps other terms as well. [sent-30, score-0.262]
16 Training the machine to model the input distribution is performed by finding the encoder and decoder parameters that minimize a loss function equal to the negative log likelihood of the training data under the model. [sent-32, score-0.868]
17 For a single training sample Y, the loss function is L(W, Y) = −(1/β) log Σ_z e^(−βE(Y,z)) + (1/β) log Σ_{y,z} e^(−βE(y,z)) (2). The first term is the free energy Fβ(Y). [sent-33, score-0.42]
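As a reading aid, the free energy referred to above and its zero-temperature limit can be restated as follows; identifying the minimum-over-codes form with the paper's equation 4 is an inference from how F∞(Y) is used later in the text, not a quote.

```latex
F_\beta(Y) = -\frac{1}{\beta}\,\log \sum_{z} e^{-\beta E(Y,z)},
\qquad
F_\infty(Y) = \lim_{\beta \to \infty} F_\beta(Y) = \min_{z} E(Y,z).
```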
18 The second term (the log partition function) ensures that the system produces low energy only for input vectors that have high probability in the (true) data distribution, and produces higher energies for all other input vectors [5]. [sent-36, score-0.313]
19 Regardless of whether only Z ∗ or the whole distribution over Z is considered, the main difficulty with this framework is that it can be very hard to compute the gradient of the log partition function in equation 2 or 3 with respect to the parameters W . [sent-38, score-0.197]
20 For instance, Restricted Boltzmann Machines (RBM) [10] approximate the gradient of the log partition function in equation 2 by sampling values of Y whose energy will be pulled up using an MCMC technique. [sent-40, score-0.412]
21 By running the MCMC for a short time, those samples are chosen in the vicinity of the training samples, thereby ensuring that the energy surface forms a ravine around the manifold of the training samples. [sent-41, score-0.419]
22 The role of the log partition function is merely to ensure that the energy surface is lower around training samples than anywhere else. [sent-43, score-0.449]
23 The method proposed here eliminates the log partition function from the loss, and replaces it by a term that limits the volume of the input space over which the energy surface can take a low value. [sent-44, score-0.442]
24 This is performed by adding a penalty term on the code rather than on the input. [sent-45, score-0.39]
25 To understand the method, we first note that if for each vector Y , there exists a corresponding optimal code Z ∗ (Y ) that makes the reconstruction error (or energy) F∞ (Y ) zero (or near zero), the model can perfectly reconstruct any input vector. [sent-47, score-0.52]
26 On the other hand, if Z can only take a small number of different values (low entropy code), then the energy F∞ (Y ) can only be low in a limited number of places (the Y ’s that are reconstructed from this small number of Z values), and the energy cannot be flat. [sent-49, score-0.471]
27 More generally, a convenient method through which flat energy surfaces can be avoided is to limit the maximum information content of the code. [sent-50, score-0.257]
28 Hence, minimizing the energy F∞ (Y ) together with the information content of the code is a good substitute for minimizing the log partition function. [sent-51, score-0.708]
29 A popular way to minimize the information content in the code is to make the code sparse or low-dimensional [5]. [sent-52, score-0.82]
30 This technique is used in a number of unsupervised learning methods, including PCA, auto-encoder neural networks, and sparse coding methods [6, 3, 8, 9]. [sent-53, score-0.289]
31 In sparse methods, the code is forced to have only a few non-zero units while most code units are zero most of the time. [sent-54, score-0.876]
32 This may explain why biology seems to like sparse representations [11]. [sent-58, score-0.156]
33 In our context, the main advantage of sparsity constraints is to allow us to replace a marginalization by a minimization, and to free ourselves from the need to minimize the log partition function explicitly. [sent-59, score-0.29]
34 Instead of the losses in equations 2 and 3, we consider a loss function which is a weighted sum of the reconstruction error and a sparsity penalty, as in many other unsupervised learning algorithms [13, 14, 8]. [sent-62, score-0.391]
35 Encoder and decoder are constrained to be symmetric, and share a set of linear filters. [sent-63, score-0.255]
36 Although we only consider linear filters in this paper, the method allows the use of any differentiable function for encoder and decoder. [sent-64, score-0.329]
37 The first step computes the optimal code by minimizing the energy for the given input. [sent-66, score-0.528]
38 Following [15], we evaluate these methods by measuring the reconstruction error for a given entropy of the code. [sent-70, score-0.174]
39 The representational power of this hierarchical non-linear feature extraction is demonstrated through the unsupervised discovery of the numeral class labels in the high-level code. [sent-79, score-0.168]
40 2 Architecture In this section we describe a Sparse Encoding Symmetric Machine (SESM) having a set of linear filters in both encoder and decoder. [sent-80, score-0.329]
41 However, everything can be easily extended to any other choice of parameterized functions as long as these are differentiable and maintain symmetry between encoder and decoder. [sent-81, score-0.329]
42 Let us denote with Y the input defined in R^N, and with Z the code defined in R^M, where M is in general greater than N (for overcomplete representations). [sent-82, score-0.496]
43 Let the filters in encoder and decoder be the columns of the matrix W ∈ R^(N×M), and let the biases in the encoder and decoder be denoted by b_enc ∈ R^M and b_dec ∈ R^N, respectively. [sent-83, score-1.301]
44 Then, encoder and decoder compute: f_enc(Y) = W^T Y + b_enc, f_dec(Z) = W l(Z) + b_dec (5), where the function l is a point-wise logistic non-linearity of the form l(x) = 1/(1 + exp(−gx)) (6), with g a fixed gain. [sent-84, score-0.897]
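The following is a minimal numpy sketch of the encoder and decoder of equations 5 and 6; the class name, initialization scale, and default gain are illustrative assumptions, not values from the paper.

```python
import numpy as np

def logistic(x, gain=1.0):
    # Point-wise logistic non-linearity l(x) = 1 / (1 + exp(-g x)) with fixed gain g (eq. 6).
    return 1.0 / (1.0 + np.exp(-gain * x))

class SESM:
    """Sparse Encoding Symmetric Machine: encoder and decoder share the filter matrix W."""
    def __init__(self, n_input, n_code, gain=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = 0.01 * rng.standard_normal((n_input, n_code))  # W in R^{N x M}, shared filters
        self.b_enc = np.zeros(n_code)    # encoder biases, in R^M
        self.b_dec = np.zeros(n_input)   # decoder biases, in R^N
        self.gain = gain

    def encode(self, Y):
        # f_enc(Y) = W^T Y + b_enc  (eq. 5)
        return self.W.T @ Y + self.b_enc

    def decode(self, Z):
        # f_dec(Z) = W l(Z) + b_dec  (eq. 5)
        return self.W @ logistic(Z, self.gain) + self.b_dec
```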
45 The system is characterized by an energy measuring the compatibility between pairs of input Y and latent code Z, E(Y, Z) [16]. [sent-85, score-0.568]
46 The second term is the mean-squared error between the input Y and the reconstruction provided by the decoder. [sent-88, score-0.214]
47 For instance, the first two terms appear in our previous model [3], but in that work, the weights of encoder and decoder were not tied and the parameters in the logistic were updated using running averages. [sent-92, score-0.631]
48 Because of the sparsity penalty and the linear reconstruction, code units become tiny and are compensated by the filters in the decoder that grow without bound. [sent-97, score-0.784]
49 In this work, we propose to enforce symmetry between encoder and decoder (through weight sharing) so as to have automatic scaling of filters. [sent-101, score-0.584]
50 The filters' norm cannot grow large because the code units, produced by the same encoder weights, would then take large values as well, producing bad reconstructions and increasing the energy (the second term in equations 7 and 8). [sent-102, score-1.004]
51 3 Learning Algorithm Learning consists of determining the parameters in W, b_enc, and b_dec that minimize the loss in equation 8. [sent-103, score-0.281]
52 As indicated in the introduction, the energy augmented with the sparsity constraint is minimized with respect to the code to find the optimal code. [sent-104, score-0.61]
53 Instead, we rely on the code sparsity constraints to ensure that the energy surface is not flat. [sent-108, score-0.654]
54 Since the second term in equation 8 couples Z with both W and b_dec, it is not straightforward to minimize this energy with respect to both. [sent-109, score-0.387]
55 Vice versa, if the parameters W are fixed, the optimal code Z ∗ that minimizes L can be computed easily through gradient descent. [sent-111, score-0.349]
56 1. For a given sample Y and parameter setting, minimize the loss in equation 8 with respect to Z by gradient descent to obtain the optimal code Z∗. [sent-113, score-0.528]
57 2. Clamping both the input Y and the optimal code Z∗ found at the previous step, do one step of gradient descent to update the parameters. [sent-114, score-0.444]
58 After training, the system converges to a state where the decoder produces good reconstructions from a sparse code, and the optimal code is predicted by a simple feed-forward propagation through the encoder. [sent-117, score-0.712]
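A sketch of the two-step procedure above, built on the SESM class sketched earlier. The energy is taken as a code-prediction term plus the reconstruction error, and the sparsity penalty is assumed to have the form log(1 + l(z)^2) on each code unit; the coefficients, learning rates, and the exact form of equation 8 are assumptions for illustration.

```python
import numpy as np

def train_step(m, Y, alpha_e=1.0, alpha_s=0.2, lr_code=0.1, lr_params=0.01, n_code_steps=30):
    """One iteration of the two-step procedure: (1) find the optimal code Z* by gradient
    descent on the loss with the parameters fixed, (2) one gradient step on W, b_enc, b_dec
    with Y and Z* clamped. Assumed loss: alpha_e*||Z - f_enc(Y)||^2 + ||Y - f_dec(Z)||^2
    + alpha_s * sum_i log(1 + l(z_i)^2)."""
    g = m.gain
    z_pred = m.encode(Y)                       # f_enc(Y), fixed during step 1
    # --- step 1: optimize the code Z for this input ---
    Z = z_pred.copy()                          # initialize at the encoder prediction
    for _ in range(n_code_steps):
        lz = logistic(Z, g)
        dlz = g * lz * (1.0 - lz)              # derivative of the logistic
        recon_err = Y - (m.W @ lz + m.b_dec)   # Y - f_dec(Z)
        grad_Z = (2.0 * alpha_e * (Z - z_pred)                 # code-prediction term
                  - 2.0 * dlz * (m.W.T @ recon_err)            # reconstruction term
                  + 2.0 * alpha_s * lz * dlz / (1.0 + lz**2))  # sparsity penalty
        Z -= lr_code * grad_Z
    # --- step 2: update the parameters with Y and Z* clamped ---
    lz = logistic(Z, g)
    pred_err = Z - z_pred                      # Z* - f_enc(Y)
    recon_err = Y - (m.W @ lz + m.b_dec)       # Y - f_dec(Z*)
    m.W -= lr_params * (-2.0 * alpha_e * np.outer(Y, pred_err) - 2.0 * np.outer(recon_err, lz))
    m.b_enc -= lr_params * (-2.0 * alpha_e * pred_err)
    m.b_dec -= lr_params * (-2.0 * recon_err)
    return Z
```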
59 An RBM is a binary stochastic symmetric machine defined by an energy function of the form: E(Y, Z) = −Z^T W^T Y − b_enc^T Z − b_dec^T Y. [sent-120, score-0.217]
60 Although this is not obvious at first glance, this energy can be seen as a special case of the encoder-decoder architecture that pertains to binary data vectors and code vectors [5]. [sent-121, score-0.554]
61 Training an RBM minimizes an approximation of the negative log likelihood loss function of equation 2, averaged over the training set, through a gradient descent procedure. [sent-122, score-0.243]
62 Instead of estimating the gradient of the log partition function, RBM training uses contrastive divergence [10], which takes random samples drawn over a limited region Ω around the training samples. [sent-123, score-0.374]
63 This means that only the energy of points around training samples is pulled up. [sent-126, score-0.321]
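For contrast, here is a standard CD-1 update for a binary RBM with the energy given above; this is the textbook contrastive-divergence rule, not the specific training settings used in the experiments.

```python
import numpy as np

def cd1_update(W, b_enc, b_dec, Y, lr=0.01, rng=np.random.default_rng(0)):
    """One contrastive-divergence (CD-1) update for a binary RBM with energy
    E(Y,Z) = -Z^T W^T Y - b_enc^T Z - b_dec^T Y."""
    def sigm(x):
        return 1.0 / (1.0 + np.exp(-x))
    # Positive phase: hidden probabilities and a sample given the data vector Y.
    p_h0 = sigm(W.T @ Y + b_enc)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Short MCMC: one Gibbs step produces a sample in the vicinity of the training point.
    p_v1 = sigm(W @ h0 + b_dec)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    p_h1 = sigm(W.T @ v1 + b_enc)
    # Approximate gradient: pull energy down at the data, up at the nearby sample.
    W += lr * (np.outer(Y, p_h0) - np.outer(v1, p_h1))
    b_enc += lr * (p_h0 - p_h1)
    b_dec += lr * (Y - v1)
    return W, b_enc, b_dec
```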
64 However, the code vector in an RBM is binary and noisy, and one may wonder whether this does not have the effect of surreptitiously limiting the information content of the code, thereby further minimizing the log partition function as a bonus. [sent-128, score-0.522]
65 RBM and SESM have almost the same architecture because they both have a symmetric encoder and decoder, and a logistic non-linearity on top of the encoder. [sent-129, score-0.458]
66 However, RBM is trained using (approximate) maximum likelihood, while SESM is trained by simply minimizing the average energy F∞ (Y ) of equation 4 with an additional code sparsity term. [sent-130, score-0.803]
67 SESM relies on the sparsity term to prevent flat energy surfaces, while RBM relies on an explicit contrastive term in the loss, an approximation of the log partition function. [sent-131, score-0.538]
68 Also, the coding strategy is very different because code units are “noisy” and binary in RBM, while they are quasi-binary and sparse in SESM. [sent-132, score-0.563]
69 1 Experimental Comparison In the first experiment we have trained SESM, RBM, and PCA on the first 20000 digits in the MNIST training dataset [18] in order to produce codes with 200 components. [sent-135, score-0.278]
70 Similarly to [15], we have collected test image codes after the logistic non-linearity (except for PCA, which is linear), and we have measured the root mean square error (RMSE) and the entropy. [sent-136, score-0.247]
71 SESM was run for different values of the sparsity coefficient αs in equation 8 (while all other parameters are left unchanged; see the next section for details). [sent-137, score-0.159]
72 The RMSE is defined as (1/σ) sqrt( (1/(PN)) Σ ||Y − f_dec(Z̄)||² ), where Z̄ is the uniformly quantized code produced by the encoder, P is the number of test samples, and σ is the estimated variance of units in the input Y. [sent-138, score-0.616]
73 Assuming the (quantized) code units are encoded independently and with the same distribution, the lower bound on the number of bits required to encode each of them is given by the empirical entropy H_c.u. = −Σ_{i=1}^{Q} (c_i/(PM)) log₂(c_i/(PM)), where c_i is the number of code units falling in the i-th of the Q quantization bins. [sent-139, score-0.429]
74 Unlike in [15, 12], the reconstruction is done using the quantized code in order to measure the robustness of the code to the quantization noise. [sent-145, score-0.865]
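A sketch of this evaluation, reusing the SESM sketch above: codes are taken after the logistic, uniformly quantized, decoded, and scored by RMSE and by the empirical entropy per code unit. Feeding the quantized post-logistic values directly to the linear part of the decoder, and the exact normalization, are assumptions.

```python
import numpy as np

def evaluate_codes(model, Y_test, n_bins=256):
    # Y_test: P x N array of test samples.
    P, N = Y_test.shape
    sigma = Y_test.std()                                       # spread of input units
    # Codes collected after the logistic (values in (0, 1)).
    Z = logistic(Y_test @ model.W + model.b_enc, model.gain)   # P x M
    # Uniform quantization into n_bins bins; reconstruct from the quantized code.
    idx = np.clip((Z * n_bins).astype(int), 0, n_bins - 1)
    Z_bar = (idx + 0.5) / n_bins                               # bin centers
    recon = Z_bar @ model.W.T + model.b_dec                    # linear decoding of quantized code
    rmse = np.sqrt(np.sum((Y_test - recon) ** 2) / (sigma ** 2 * P * N))
    # Lower bound on bits per code unit: empirical entropy of the bin occupancies.
    counts = np.bincount(idx.ravel(), minlength=n_bins)
    p = counts / counts.sum()
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
    return rmse, entropy
```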
75 As shown in fig. 1-C, RBM is very robust to noise in the code because it is trained by sampling. [sent-147, score-0.377]
76 Despite the similarities in the architecture, filters look quite different in general, revealing two different coding strategies: distributed for RBM, and sparse for SESM. [sent-152, score-0.176]
77 Since the labels are also available in MNIST, we have used the codes (produced by these machines trained unsupervised) as input to the same linear classifier. [sent-154, score-0.259]
78 Fig. 1-A/B show the results of these experiments, in addition to what can be achieved by a linear classifier trained on the raw pixel data. [sent-157, score-0.161]
79 [Plot legend residue from fig. 1: curves for PCA, RBM, and Sparse Coding, each with quantization into 5 and 256 bins.] [sent-165, score-0.948]
80 Figure 1: (A)-(B) Error rate on MNIST training (with 10, 100 and 1000 samples per class) and test set produced by a linear classifier trained on the codes produced by SESM, RBM, and PCA. [sent-175, score-0.378]
81 The entropy and RMSE refer to a quantization into 256 bins. [sent-176, score-0.167]
82 The comparison has also been extended to the same classifier trained on raw pixel data (showing the advantage of extracting features). [sent-177, score-0.161]
83 (C) Comparison between SESM, RBM, and PCA when quantizing the code into 5 and 256 bins. [sent-187, score-0.319]
84 (E) Some pairs of original and reconstructed digits from the code produced by the encoder in SESM (feed-forward propagation through encoder and decoder). [sent-190, score-1.066]
85 (G) Back-projection in image space of the filters learned in the second stage of the hierarchical feature extractor. [sent-192, score-0.167]
86 The second stage was trained on the non-linearly transformed codes produced by the first stage machine. [sent-193, score-0.426]
87 The back-projection has been performed by using a 1-of-10 code in the second stage machine, and propagating this through the second stage decoder and first stage decoder. [sent-194, score-0.82]
88 Even though the code unit activations have a very sparse distribution, reconstructions are very good (no minimization in code space was performed). [sent-209, score-0.776]
89 1 Hierarchical Features A hierarchical feature extractor can be trained layer-by-layer similarly to what has been proposed in [19, 1] for training deep belief nets (DBNs). [sent-212, score-0.287]
90 We have trained a second (higher) stage machine on the non-linearly transformed codes produced by the first (lower) stage machine described in the previous example. [sent-213, score-0.426]
91 Since we aimed to find a 1-of-10 code we increased the sparsity level (in the second stage machine) by setting αs to 1. [sent-215, score-0.507]
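A hypothetical sketch of this layer-wise stacking, reusing the SESM and train_step sketches above. The placeholder data array, the first-stage sparsity coefficient, and the epoch count are made-up values; only the 200-unit first-stage code, the 10-unit (1-of-10) second-stage code, and αs = 1 for the second stage come from the text.

```python
import numpy as np

def train_stage(machine, data, alpha_s, n_epochs=10):
    # Sequentially train one SESM stage on a dataset of vectors (one sample per row).
    for _ in range(n_epochs):
        for y in data:
            train_step(machine, y, alpha_s=alpha_s)
    return machine

# Placeholder data: in practice this would be the first 20000 MNIST digits, flattened to 784-D.
mnist_digits = np.random.default_rng(0).random((1000, 784))

# First stage: 200-component code on raw digits (784 pixels).
first = train_stage(SESM(n_input=784, n_code=200), mnist_digits, alpha_s=0.2)
# Second stage: trained on the non-linearly transformed codes of the first stage,
# with a higher sparsity level to push toward a 1-of-10 code.
codes = np.array([logistic(first.encode(y), first.gain) for y in mnist_digits])
second = train_stage(SESM(n_input=200, n_code=10), codes, alpha_s=1.0)
```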
92 Despite the completely unsupervised training procedure, the feature detectors in the second stage machine look like digit prototypes, as can be seen in fig. 1-G. [sent-216, score-0.306]
93 The hierarchical unsupervised feature extractor is able to capture higher order correlations among the input pixel intensities, and to discover the highly non-linear mapping from raw pixel data to the class labels. [sent-218, score-0.446]
94 For comparison, when training a DBN, prototypes are not recovered because the learned code is distributed among units. [sent-221, score-0.405]
95 2 Natural Image Patches A SESM with about the same setup was trained on a dataset of 30000 8x8 natural image patches randomly extracted from the Berkeley segmentation dataset [20]. [sent-223, score-0.159]
96 We have considered a 2 times overcomplete code with 128 units. [sent-226, score-0.432]
97 6 Conclusions There are two strategies to train unsupervised machines: 1) having a contrastive term in the loss function minimized during training, 2) constraining the internal representation in such a way that training samples can be better reconstructed than other points in input space. [sent-233, score-0.512]
98 We have proposed an evaluation protocol to compare different machines which is based on RMSE, entropy and, when labels are also available, error rate. [sent-236, score-0.176]
99 A theoretical analysis of robust coding over noisy overcomplete channels. [sent-286, score-0.187]
100 Sparse coding with an overcomplete basis set: a strategy employed by V1? [sent-303, score-0.187]
wordName wordTfidf (topN-words)
[('sesm', 0.472), ('rbm', 0.37), ('encoder', 0.329), ('code', 0.319), ('decoder', 0.255), ('energy', 0.185), ('rmse', 0.162), ('overcomplete', 0.113), ('unsupervised', 0.113), ('lters', 0.11), ('quantization', 0.108), ('sparsity', 0.106), ('sparse', 0.102), ('codes', 0.089), ('pca', 0.082), ('stage', 0.082), ('reconstruction', 0.077), ('bdec', 0.076), ('fdec', 0.076), ('coding', 0.074), ('partition', 0.073), ('units', 0.068), ('input', 0.064), ('contrastive', 0.063), ('raw', 0.062), ('training', 0.061), ('entropy', 0.059), ('deep', 0.059), ('trained', 0.058), ('loss', 0.057), ('benc', 0.057), ('fenc', 0.057), ('hinton', 0.056), ('representations', 0.054), ('equation', 0.053), ('architecture', 0.05), ('bins', 0.05), ('chopra', 0.049), ('ranzato', 0.049), ('machines', 0.048), ('produced', 0.047), ('logistic', 0.047), ('handwritten', 0.047), ('samples', 0.045), ('surface', 0.044), ('non', 0.043), ('quantized', 0.042), ('reconstructed', 0.042), ('content', 0.042), ('pixel', 0.041), ('log', 0.041), ('mnist', 0.039), ('bengio', 0.039), ('minimize', 0.038), ('error', 0.038), ('numerals', 0.038), ('pcm', 0.038), ('discover', 0.037), ('reconstructions', 0.036), ('penalty', 0.036), ('term', 0.035), ('stacking', 0.035), ('reconstructing', 0.033), ('mcmc', 0.033), ('extractor', 0.033), ('rbms', 0.033), ('marginalization', 0.032), ('symmetric', 0.032), ('train', 0.032), ('descent', 0.031), ('rate', 0.031), ('hierarchy', 0.03), ('boltzmann', 0.03), ('gradient', 0.03), ('surfaces', 0.03), ('pulled', 0.03), ('image', 0.03), ('hierarchical', 0.03), ('encoding', 0.029), ('patches', 0.027), ('recognition', 0.027), ('olshausen', 0.026), ('classi', 0.026), ('digits', 0.026), ('prototypes', 0.025), ('fed', 0.025), ('osindero', 0.025), ('feature', 0.025), ('transformed', 0.025), ('minimizing', 0.024), ('stochastically', 0.024), ('localized', 0.023), ('likelihood', 0.023), ('thereby', 0.023), ('dataset', 0.022), ('produce', 0.022), ('understand', 0.022), ('encode', 0.021), ('er', 0.021), ('nets', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 180 nips-2007-Sparse Feature Learning for Deep Belief Networks
Author: Marc'aurelio Ranzato, Y-lan Boureau, Yann L. Cun
Abstract: Unsupervised learning algorithms aim to discover the structure hidden in the data, and to learn representations that are more suitable as input to a supervised machine than the raw input. Many unsupervised methods are based on reconstructing the input from the representation, while constraining the representation to have certain desirable properties (e.g. low dimension, sparsity, etc). Others are based on approximating density by stochastically reconstructing the input from the representation. We describe a novel and efficient algorithm to learn sparse representations, and compare it theoretically and experimentally with a similar machine trained probabilistically, namely a Restricted Boltzmann Machine. We propose a simple criterion to compare and select different unsupervised machines based on the trade-off between the reconstruction error and the information content of the representation. We demonstrate this method by extracting features from a dataset of handwritten numerals, and from a dataset of natural image patches. We show that by stacking multiple levels of such machines and by training sequentially, high-order dependencies between the input observed variables can be captured. 1
2 0.23857762 182 nips-2007-Sparse deep belief net model for visual area V2
Author: Honglak Lee, Chaitanya Ekanadham, Andrew Y. Ng
Abstract: Motivated in part by the hierarchical organization of the cortex, a number of algorithms have recently been proposed that try to learn hierarchical, or “deep,” structure from unlabeled data. While several authors have formally or informally compared their algorithms to computations performed in visual area V1 (and the cochlea), little attempt has been made thus far to evaluate these algorithms in terms of their fidelity for mimicking computations at deeper levels in the cortical hierarchy. This paper presents an unsupervised learning model that faithfully mimics certain properties of visual area V2. Specifically, we develop a sparse variant of the deep belief networks of Hinton et al. (2006). We learn two layers of nodes in the network, and demonstrate that the first layer, similar to prior work on sparse coding and ICA, results in localized, oriented, edge filters, similar to the Gabor functions known to model V1 cell receptive fields. Further, the second layer in our model encodes correlations of the first layer responses in the data. Specifically, it picks up both colinear (“contour”) features as well as corners and junctions. More interestingly, in a quantitative comparison, the encoding of these more complex “corner” features matches well with the results from the Ito & Komatsu’s study of biological V2 responses. This suggests that our sparse variant of deep belief networks holds promise for modeling more higher-order features. 1
3 0.22007081 132 nips-2007-Modeling image patches with a directed hierarchy of Markov random fields
Author: Simon Osindero, Geoffrey E. Hinton
Abstract: We describe an efficient learning procedure for multilayer generative models that combine the best aspects of Markov random fields and deep, directed belief nets. The generative models can be learned one layer at a time and when learning is complete they have a very fast inference procedure for computing a good approximation to the posterior distribution in all of the hidden layers. Each hidden layer has its own MRF whose energy function is modulated by the top-down directed connections from the layer above. To generate from the model, each layer in turn must settle to equilibrium given its top-down input. We show that this type of model is good at capturing the statistics of patches of natural images. 1
4 0.18255731 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes
Author: Geoffrey E. Hinton, Ruslan Salakhutdinov
Abstract: We show how to use unlabeled data and a deep belief net (DBN) to learn a good covariance kernel for a Gaussian process. We first learn a deep generative model of the unlabeled data using the fast, greedy algorithm introduced by [7]. If the data is high-dimensional and highly-structured, a Gaussian kernel applied to the top layer of features in the DBN works much better than a similar kernel applied to the raw input. Performance at both regression and classification can then be further improved by using backpropagation through the DBN to discriminatively fine-tune the covariance kernel.
5 0.11916678 181 nips-2007-Sparse Overcomplete Latent Variable Decomposition of Counts Data
Author: Madhusudana Shashanka, Bhiksha Raj, Paris Smaragdis
Abstract: An important problem in many fields is the analysis of counts data to extract meaningful latent components. Methods like Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA) have been proposed for this purpose. However, they are limited in the number of components they can extract and lack an explicit provision to control the “expressiveness” of the extracted components. In this paper, we present a learning formulation to address these limitations by employing the notion of sparsity. We start with the PLSA framework and use an entropic prior in a maximum a posteriori formulation to enforce sparsity. We show that this allows the extraction of overcomplete sets of latent components which better characterize the data. We present experimental evidence of the utility of such representations.
6 0.11593682 129 nips-2007-Mining Internet-Scale Software Repositories
7 0.1144815 111 nips-2007-Learning Horizontal Connections in a Sparse Coding Model of Natural Images
8 0.11177319 145 nips-2007-On Sparsity and Overcompleteness in Image Models
9 0.072394706 115 nips-2007-Learning the 2-D Topology of Images
10 0.069290906 25 nips-2007-An in-silico Neural Model of Dynamic Routing through Neuronal Coherence
11 0.054828353 8 nips-2007-A New View of Automatic Relevance Determination
12 0.053372301 33 nips-2007-Bayesian Inference for Spiking Neuron Models with a Sparsity Prior
13 0.053277716 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)
14 0.050086271 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images
15 0.049924411 140 nips-2007-Neural characterization in partially observed populations of spiking neurons
16 0.049100693 62 nips-2007-Convex Learning with Invariances
17 0.048660606 188 nips-2007-Subspace-Based Face Recognition in Analog VLSI
18 0.047947451 206 nips-2007-Topmoumoute Online Natural Gradient Algorithm
19 0.046259817 37 nips-2007-Blind channel identification for speech dereverberation using l1-norm sparse learning
20 0.044686992 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering
topicId topicWeight
[(0, -0.194), (1, 0.106), (2, 0.016), (3, -0.063), (4, -0.016), (5, 0.073), (6, -0.228), (7, 0.08), (8, 0.217), (9, 0.028), (10, 0.258), (11, -0.034), (12, 0.116), (13, -0.036), (14, 0.022), (15, -0.003), (16, -0.025), (17, -0.073), (18, -0.121), (19, 0.02), (20, -0.006), (21, 0.003), (22, -0.061), (23, -0.033), (24, 0.018), (25, -0.092), (26, -0.015), (27, 0.074), (28, 0.023), (29, -0.039), (30, 0.044), (31, -0.083), (32, -0.034), (33, 0.04), (34, 0.014), (35, 0.037), (36, 0.038), (37, -0.015), (38, -0.066), (39, -0.057), (40, -0.067), (41, -0.009), (42, -0.005), (43, -0.041), (44, 0.003), (45, -0.042), (46, -0.081), (47, -0.006), (48, -0.064), (49, -0.033)]
simIndex simValue paperId paperTitle
same-paper 1 0.93424863 180 nips-2007-Sparse Feature Learning for Deep Belief Networks
Author: Marc'aurelio Ranzato, Y-lan Boureau, Yann L. Cun
Abstract: Unsupervised learning algorithms aim to discover the structure hidden in the data, and to learn representations that are more suitable as input to a supervised machine than the raw input. Many unsupervised methods are based on reconstructing the input from the representation, while constraining the representation to have certain desirable properties (e.g. low dimension, sparsity, etc). Others are based on approximating density by stochastically reconstructing the input from the representation. We describe a novel and efficient algorithm to learn sparse representations, and compare it theoretically and experimentally with a similar machine trained probabilistically, namely a Restricted Boltzmann Machine. We propose a simple criterion to compare and select different unsupervised machines based on the trade-off between the reconstruction error and the information content of the representation. We demonstrate this method by extracting features from a dataset of handwritten numerals, and from a dataset of natural image patches. We show that by stacking multiple levels of such machines and by training sequentially, high-order dependencies between the input observed variables can be captured. 1
2 0.82162684 132 nips-2007-Modeling image patches with a directed hierarchy of Markov random fields
Author: Simon Osindero, Geoffrey E. Hinton
Abstract: We describe an efficient learning procedure for multilayer generative models that combine the best aspects of Markov random fields and deep, directed belief nets. The generative models can be learned one layer at a time and when learning is complete they have a very fast inference procedure for computing a good approximation to the posterior distribution in all of the hidden layers. Each hidden layer has its own MRF whose energy function is modulated by the top-down directed connections from the layer above. To generate from the model, each layer in turn must settle to equilibrium given its top-down input. We show that this type of model is good at capturing the statistics of patches of natural images. 1
3 0.80874008 182 nips-2007-Sparse deep belief net model for visual area V2
Author: Honglak Lee, Chaitanya Ekanadham, Andrew Y. Ng
Abstract: Motivated in part by the hierarchical organization of the cortex, a number of algorithms have recently been proposed that try to learn hierarchical, or “deep,” structure from unlabeled data. While several authors have formally or informally compared their algorithms to computations performed in visual area V1 (and the cochlea), little attempt has been made thus far to evaluate these algorithms in terms of their fidelity for mimicking computations at deeper levels in the cortical hierarchy. This paper presents an unsupervised learning model that faithfully mimics certain properties of visual area V2. Specifically, we develop a sparse variant of the deep belief networks of Hinton et al. (2006). We learn two layers of nodes in the network, and demonstrate that the first layer, similar to prior work on sparse coding and ICA, results in localized, oriented, edge filters, similar to the Gabor functions known to model V1 cell receptive fields. Further, the second layer in our model encodes correlations of the first layer responses in the data. Specifically, it picks up both colinear (“contour”) features as well as corners and junctions. More interestingly, in a quantitative comparison, the encoding of these more complex “corner” features matches well with the results from the Ito & Komatsu’s study of biological V2 responses. This suggests that our sparse variant of deep belief networks holds promise for modeling more higher-order features. 1
4 0.58755541 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes
Author: Geoffrey E. Hinton, Ruslan Salakhutdinov
Abstract: We show how to use unlabeled data and a deep belief net (DBN) to learn a good covariance kernel for a Gaussian process. We first learn a deep generative model of the unlabeled data using the fast, greedy algorithm introduced by [7]. If the data is high-dimensional and highly-structured, a Gaussian kernel applied to the top layer of features in the DBN works much better than a similar kernel applied to the raw input. Performance at both regression and classification can then be further improved by using backpropagation through the DBN to discriminatively fine-tune the covariance kernel.
5 0.46765584 111 nips-2007-Learning Horizontal Connections in a Sparse Coding Model of Natural Images
Author: Pierre Garrigues, Bruno A. Olshausen
Abstract: It has been shown that adapting a dictionary of basis functions to the statistics of natural images so as to maximize sparsity in the coefficients results in a set of dictionary elements whose spatial properties resemble those of V1 (primary visual cortex) receptive fields. However, the resulting sparse coefficients still exhibit pronounced statistical dependencies, thus violating the independence assumption of the sparse coding model. Here, we propose a model that attempts to capture the dependencies among the basis function coefficients by including a pairwise coupling term in the prior over the coefficient activity states. When adapted to the statistics of natural images, the coupling terms learn a combination of facilitatory and inhibitory interactions among neighboring basis functions. These learned interactions may offer an explanation for the function of horizontal connections in V1 in terms of a prior over natural images.
6 0.44561508 181 nips-2007-Sparse Overcomplete Latent Variable Decomposition of Counts Data
7 0.42082852 145 nips-2007-On Sparsity and Overcompleteness in Image Models
8 0.41655728 129 nips-2007-Mining Internet-Scale Software Repositories
9 0.41173163 25 nips-2007-An in-silico Neural Model of Dynamic Routing through Neuronal Coherence
10 0.38358095 67 nips-2007-Direct Importance Estimation with Model Selection and Its Application to Covariate Shift Adaptation
11 0.38312191 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)
12 0.37268972 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images
13 0.35744393 210 nips-2007-Unconstrained On-line Handwriting Recognition with Recurrent Neural Networks
14 0.35724509 89 nips-2007-Feature Selection Methods for Improving Protein Structure Prediction with Rosetta
15 0.33431983 37 nips-2007-Blind channel identification for speech dereverberation using l1-norm sparse learning
16 0.31895158 211 nips-2007-Unsupervised Feature Selection for Accurate Recommendation of High-Dimensional Image Data
17 0.31425956 188 nips-2007-Subspace-Based Face Recognition in Analog VLSI
18 0.30692345 206 nips-2007-Topmoumoute Online Natural Gradient Algorithm
19 0.30242667 115 nips-2007-Learning the 2-D Topology of Images
20 0.29134363 62 nips-2007-Convex Learning with Invariances
topicId topicWeight
[(5, 0.078), (10, 0.145), (13, 0.042), (16, 0.041), (18, 0.011), (19, 0.017), (21, 0.057), (34, 0.035), (35, 0.054), (47, 0.093), (49, 0.012), (83, 0.183), (85, 0.016), (87, 0.062), (90, 0.07)]
simIndex simValue paperId paperTitle
same-paper 1 0.90106654 180 nips-2007-Sparse Feature Learning for Deep Belief Networks
Author: Marc'aurelio Ranzato, Y-lan Boureau, Yann L. Cun
Abstract: Unsupervised learning algorithms aim to discover the structure hidden in the data, and to learn representations that are more suitable as input to a supervised machine than the raw input. Many unsupervised methods are based on reconstructing the input from the representation, while constraining the representation to have certain desirable properties (e.g. low dimension, sparsity, etc). Others are based on approximating density by stochastically reconstructing the input from the representation. We describe a novel and efficient algorithm to learn sparse representations, and compare it theoretically and experimentally with a similar machine trained probabilistically, namely a Restricted Boltzmann Machine. We propose a simple criterion to compare and select different unsupervised machines based on the trade-off between the reconstruction error and the information content of the representation. We demonstrate this method by extracting features from a dataset of handwritten numerals, and from a dataset of natural image patches. We show that by stacking multiple levels of such machines and by training sequentially, high-order dependencies between the input observed variables can be captured. 1
2 0.83319861 115 nips-2007-Learning the 2-D Topology of Images
Author: Nicolas L. Roux, Yoshua Bengio, Pascal Lamblin, Marc Joliveau, Balázs Kégl
Abstract: We study the following question: is the two-dimensional structure of images a very strong prior or is it something that can be learned with a few examples of natural images? If someone gave us a learning task involving images for which the two-dimensional topology of pixels was not known, could we discover it automatically and exploit it? For example suppose that the pixels had been permuted in a fixed but unknown way, could we recover the relative two-dimensional location of pixels on images? The surprising result presented here is that not only the answer is yes, but that about as few as a thousand images are enough to approximately recover the relative locations of about a thousand pixels. This is achieved using a manifold learning algorithm applied to pixels associated with a measure of distributional similarity between pixel intensities. We compare different topologyextraction approaches and show how having the two-dimensional topology can be exploited.
3 0.83183086 63 nips-2007-Convex Relaxations of Latent Variable Training
Author: Yuhong Guo, Dale Schuurmans
Abstract: We investigate a new, convex relaxation of an expectation-maximization (EM) variant that approximates a standard objective while eliminating local minima. First, a cautionary result is presented, showing that any convex relaxation of EM over hidden variables must give trivial results if any dependence on the missing values is retained. Although this appears to be a strong negative outcome, we then demonstrate how the problem can be bypassed by using equivalence relations instead of value assignments over hidden variables. In particular, we develop new algorithms for estimating exponential conditional models that only require equivalence relation information over the variable values. This reformulation leads to an exact expression for EM variants in a wide range of problems. We then develop a semidefinite relaxation that yields global training by eliminating local minima. 1
4 0.82899094 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images
Author: Matthias Bethge, Philipp Berens
Abstract: Maximum entropy analysis of binary variables provides an elegant way for studying the role of pairwise correlations in neural populations. Unfortunately, these approaches suffer from their poor scalability to high dimensions. In sensory coding, however, high-dimensional data is ubiquitous. Here, we introduce a new approach using a near-maximum entropy model, that makes this type of analysis feasible for very high-dimensional data—the model parameters can be derived in closed form and sampling is easy. Therefore, our NearMaxEnt approach can serve as a tool for testing predictions from a pairwise maximum entropy model not only for low-dimensional marginals, but also for high dimensional measurements of more than thousand units. We demonstrate its usefulness by studying natural images with dichotomized pixel intensities. Our results indicate that the statistics of such higher-dimensional measurements exhibit additional structure that are not predicted by pairwise correlations, despite the fact that pairwise correlations explain the lower-dimensional marginal statistics surprisingly well up to the limit of dimensionality where estimation of the full joint distribution is feasible. 1
5 0.82878679 189 nips-2007-Supervised Topic Models
Author: Jon D. Mcauliffe, David M. Blei
Abstract: We introduce supervised latent Dirichlet allocation (sLDA), a statistical model of labelled documents. The model accommodates a variety of response types. We derive a maximum-likelihood procedure for parameter estimation, which relies on variational approximations to handle intractable posterior expectations. Prediction problems motivate this research: we use the fitted model to predict response values for new documents. We test sLDA on two real-world problems: movie ratings predicted from reviews, and web page popularity predicted from text descriptions. We illustrate the benefits of sLDA versus modern regularized regression, as well as versus an unsupervised LDA analysis followed by a separate regression. 1
6 0.82874173 73 nips-2007-Distributed Inference for Latent Dirichlet Allocation
7 0.82592678 49 nips-2007-Colored Maximum Variance Unfolding
8 0.82027215 18 nips-2007-A probabilistic model for generating realistic lip movements from speech
9 0.8190223 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning
10 0.81842631 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons
11 0.8170045 16 nips-2007-A learning framework for nearest neighbor search
12 0.81651646 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)
13 0.81515151 187 nips-2007-Structured Learning with Approximate Inference
14 0.81486148 153 nips-2007-People Tracking with the Laplacian Eigenmaps Latent Variable Model
15 0.81368601 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering
16 0.81240016 2 nips-2007-A Bayesian LDA-based model for semi-supervised part-of-speech tagging
17 0.80902886 186 nips-2007-Statistical Analysis of Semi-Supervised Regression
18 0.80895102 156 nips-2007-Predictive Matrix-Variate t Models
19 0.80882078 7 nips-2007-A Kernel Statistical Test of Independence
20 0.80865073 116 nips-2007-Learning the structure of manifolds using random projections