nips nips2006 nips2006-72 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Marc'aurelio Ranzato, Christopher Poultney, Sumit Chopra, Yann LeCun
Abstract: We describe a novel unsupervised method for learning sparse, overcomplete features. The model uses a linear encoder, and a linear decoder preceded by a sparsifying non-linearity that turns a code vector into a quasi-binary sparse code vector. Given an input, the optimal code minimizes the distance between the output of the decoder and the input patch while being as similar as possible to the encoder output. Learning proceeds in a two-phase EM-like fashion: (1) compute the minimum-energy code vector, (2) adjust the parameters of the encoder and decoder so as to decrease the energy. The model produces “stroke detectors” when trained on handwritten numerals, and Gabor-like filters when trained on natural image patches. Inference and learning are very fast, requiring no preprocessing, and no expensive sampling. Using the proposed unsupervised method to initialize the first layer of a convolutional network, we achieved an error rate slightly lower than the best reported result on the MNIST dataset. Finally, an extension of the method is described to learn topographical filter maps. 1
Reference: text
sentIndex sentText sentNum sentScore
1 The model uses a linear encoder, and a linear decoder preceded by a sparsifying non-linearity that turns a code vector into a quasi-binary sparse code vector. [sent-4, score-1.434]
2 Given an input, the optimal code minimizes the distance between the output of the decoder and the input patch while being as similar as possible to the encoder output. [sent-5, score-1.453]
3 Learning proceeds in a two-phase EM-like fashion: (1) compute the minimum-energy code vector, (2) adjust the parameters of the encoder and decoder so as to decrease the energy. [sent-6, score-1.284]
4 The model produces “stroke detectors” when trained on handwritten numerals, and Gabor-like filters when trained on natural image patches. [sent-7, score-0.224]
5 Using the proposed unsupervised method to initialize the first layer of a convolutional network, we achieved an error rate slightly lower than the best reported result on the MNIST dataset. [sent-9, score-0.243]
6 However, several recent works have advocated the use of sparse-overcomplete representations for images, in which the dimension of the feature vector is larger than the dimension of the input, but only a small number of components are non-zero for any one image [3, 4]. [sent-14, score-0.16]
7 It seems reasonable to consider a representation “complete” if it is possible to reconstruct the input from it, because the information contained in the input would need to be preserved in the representation itself. [sent-19, score-0.168]
8 Most unsupervised learning methods for feature extraction are based on this principle, and can be understood in terms of an encoder module followed by a decoder module. [sent-20, score-1.137]
9 The encoder takes the input and computes a code vector, for example a sparse and overcomplete representation. [sent-21, score-1.145]
10 The decoder takes the code vector given by the encoder and produces a reconstruction of the input. [sent-22, score-1.401]
11 Encoder and decoder are trained in such a way that reconstructions provided by the decoder are as similar as possible to the actual input data, when these input data have the same statistics as the training samples. [sent-23, score-1.0]
12 Methods such as Vector Quantization, PCA, auto-encoders [7], Restricted Boltzmann Machines [8], and others [9] have exactly this architecture but with different constraints on the code and learning algorithms, and different kinds of encoder and decoder architectures. [sent-24, score-1.323]
13 In other approaches, the encoding module is missing but its role is taken by a minimization in code space which retrieves the representation [3]. [sent-25, score-0.604]
14 Likewise, in non-causal models the decoding module is missing and sampling techniques must be used to reconstruct the input from a code [4]. [sent-26, score-0.602]
15 After training, the encoder allows very fast inference because finding a representation does not require solving an optimization problem. [sent-29, score-0.534]
16 The decoder provides an easy way to reconstruct input vectors, thus allowing the trainer to assess directly whether the representation extracts most of the information from the input. [sent-30, score-0.485]
17 This term usually penalizes those code units that are active, aiming to make the distribution of their activities highly peaked at zero with heavy tails [10] [4]. [sent-33, score-0.496]
18 An alternative approach is to embed a sparsifying module, e.g. a non-linearity, in the architecture itself. [sent-35, score-0.222]
19 In this paper, we present a system which achieves sparsity by placing a non-linearity between encoder and decoder. [sent-39, score-0.626]
20 A later section describes this module, dubbed the “Sparsifying Logistic”, which is a logistic function with an adaptive bias that tracks the mean of its input. [sent-42, score-0.156]
21 This non-linearity is parameterized in a simple way which allows us to control the degree of sparsity of the representation as well as the entropy of each code unit. [sent-43, score-0.422]
22 Unfortunately, learning the parameters in encoder and decoder can not be achieved by simple backpropagation of the gradients of the reconstruction error: the Sparsifying Logistic is highly non-linear and resets most of the gradients coming from the decoder to zero. [sent-44, score-1.467]
23 We then propose to augment the loss function by considering not only the parameters of the system but also the code vectors as variables over which the optimization is performed. [sent-46, score-0.462]
24 The procedure can be seen as a sort of deterministic version of the EM algorithm in which the code vectors play the role of hidden variables. [sent-48, score-0.39]
25 We report feature extraction experiments on handwritten numerals and natural image patches. [sent-52, score-0.255]
26 When the system has a linear encoder and decoder (remember that the Sparsifying Logistic is a separate module), the filters resemble “object parts” for the numerals, and localized, oriented features for the natural image patches. [sent-53, score-1.055]
27 We conclude by showing a hierarchical extension which suggests the form of simple and complex cell receptive fields, and leads to a topographic layout of the filters which is reminiscent of the topographic maps found in area V1 of the visual cortex. [sent-55, score-0.215]
28 The architecture, shown in fig. 1, comprises three modules: • The encoder: A set of feed-forward filters parameterized by the rows of matrix WC that computes a code vector from an image patch X. [sent-57, score-0.499]
29 • The Sparsifying Logistic: A non-linear module that transforms the code vector Z into a sparse code vector Z̄ with components in the range [0, 1]. [sent-58, score-0.986]
30 • The decoder: A set of reverse filters parameterized by the columns of matrix WD that computes a reconstruction of the input image patch from the sparse code vector Z̄ (a code sketch of these three modules follows). [sent-59, score-0.739]
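As a concrete reading of this three-module architecture, the sketch below sets up a linear encoder and decoder for flattened 12x12 patches with 200 code units (sizes taken from the experiments described later); the initialization scale and random seed are illustrative assumptions, and the Sparsifying Logistic that sits between the two modules is sketched further down.

```python
import numpy as np

# Minimal sketch of the linear encoder/decoder pair, assuming flattened
# 12x12 patches (n = 144) and 200 code units (m = 200).
rng = np.random.default_rng(0)
n, m = 144, 200
W_C = 0.01 * rng.standard_normal((m, n))   # encoder filters: rows of W_C
W_D = 0.01 * rng.standard_normal((n, m))   # decoder filters: columns of W_D

def encode(X, W_C):
    """Feed-forward code prediction Z = W_C X for a flattened patch X."""
    return W_C @ X

def decode(Zbar, W_D):
    """Linear reconstruction of the patch from the sparse code Zbar."""
    return W_D @ Zbar
```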
31 The energy of the system is the sum of two terms: E(X, Z, WC, WD) = EC(X, Z, WC) + ED(X, Z, WD) (1). The first term is the code prediction energy, which measures the discrepancy between the output of the encoder and the code vector Z. [sent-60, score-1.528]
32 In our experiments, it is defined as EC(X, Z, WC) = (1/2)||Z − Enc(X, WC)||² = (1/2)||Z − WC X||² (2). The second term is the reconstruction energy, which measures the discrepancy between the reconstructed image patch produced by the decoder and the input image patch X. [sent-61, score-0.875]
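The two energy terms can be written directly from eqs. (1)-(2) and the description of the reconstruction energy; the exact (1/2)||X − WD Z̄||² form of ED below mirrors eq. (2) and is our reading of the text rather than a quoted equation.

```python
import numpy as np

def code_prediction_energy(X, Z, W_C):
    """E_C = 1/2 ||Z - W_C X||^2, eq. (2): discrepancy between the encoder
    output and the code vector Z."""
    r = Z - W_C @ X
    return 0.5 * float(r @ r)

def reconstruction_energy(X, Zbar, W_D):
    """E_D = 1/2 ||X - W_D Zbar||^2: discrepancy between the decoder's
    reconstruction and the input patch (the 1/2 ||.||^2 form is assumed,
    by analogy with eq. (2))."""
    r = X - W_D @ Zbar
    return 0.5 * float(r @ r)

def total_energy(X, Z, Zbar, W_C, W_D):
    """E = E_C + E_D, eq. (1)."""
    return code_prediction_energy(X, Z, W_C) + reconstruction_energy(X, Zbar, W_D)
```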
33 The input image patch X is processed by the encoder to produce an initial estimate of the code vector. [sent-63, score-1.126]
34 The encoding prediction energy EC measures the squared distance between the code vector Z and its estimate. [sent-64, score-0.517]
35 The code vector Z is passed through the Sparsifying Logistic non-linearity which produces a sparsified code vector Z̄. [sent-65, score-0.772]
36 The decoder reconstructs the input image patch from the sparse code. [sent-66, score-0.664]
37 The reconstruction energy ED measures the squared distance between the reconstruction and the input image patch. [sent-67, score-0.357]
38 The optimal code vector Z ∗ for a given patch minimizes the sum of the two energies. [sent-68, score-0.444]
39 The learning process finds the encoder and decoder parameters that minimize the energy for the optimal code vectors averaged over a set of training samples. [sent-69, score-1.423]
40 Figure 2: Toy example of sparsifying rectification produced by the Sparsifying Logistic for different choices of the parameters η and β. [sent-70, score-0.222]
41 The reconstruction energy is defined analogously as ED(X, Z, WD) = (1/2)||X − WD Z̄||² (3). The Sparsifying Logistic module is a non-linear front-end to the decoder that transforms the code vector into a sparse vector with positive components. [sent-77, score-0.222]
42 Let zi(k) be the i-th component of the code vector and z̄i(k) be its corresponding output, with i ∈ [1..m]. [sent-79, score-0.48]
43 Here m is the number of components in the code vector. [sent-81, score-0.394]
44 This non-linearity can be easily understood as a weighted softmax function applied over consecutive samples of the same code unit. [sent-87, score-0.422]
45 η controls the sparseness of the code by determining the “width” of the time window over which samples are summed up. [sent-90, score-0.398]
46 Another view of the Sparsifying Logistic is as a logistic function with an adaptive bias that tracks the average input; by dividing the right hand side of eq. [sent-94, score-0.193]
47 Large values of this parameter will turn the non-linearity into a step function and will make Z̄(k) a binary code vector. [sent-98, score-0.37]
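A plausible implementation consistent with this description (a weighted softmax over consecutive samples of each code unit, i.e. a logistic whose normalizer tracks a running average, with η setting the window width and β the gain) is sketched below; the exact recursive update of the running sum ζ, and the default parameter values, are assumptions rather than quotations of the paper's equations.

```python
import numpy as np

def sparsifying_logistic(Z_seq, eta=0.02, beta=1.0):
    """Hedged sketch of the Sparsifying Logistic.  Z_seq has shape
    (num_samples, m): one code vector per row, in presentation order.
    Each unit's output is its exponentiated, eta-weighted contribution
    divided by a running sum over a time window of width ~1/eta, so the
    result is quasi-binary, sparse, and in [0, 1]."""
    num_samples, m = Z_seq.shape
    zeta = np.ones(m)                          # running sums (saturating them, as the
                                               # text mentions, would cap zeta)
    Zbar = np.empty_like(Z_seq, dtype=float)
    for k in range(num_samples):
        w = eta * np.exp(beta * Z_seq[k])      # weighted "softmax" numerator for sample k
        zeta = w + (1.0 - eta) * zeta          # adaptive normalizer tracking past samples
        Zbar[k] = w / zeta
    return Zbar
```

In this sketch a large β drives the output toward a step function (quasi-binary codes) and a small η lengthens the window, making each unit active for fewer samples, which matches the behavior described above.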
48 Spatial sparsity usually requires some sort of ad-hoc normalization to ensure that the components of the code that are “on” are not always the same ones. [sent-103, score-0.446]
49 Our solution tackles this problem differently: each unit must be sparse when encoding different samples, independently of the activities of the other components in the code vector. [sent-104, score-0.568]
50 Unlike other methods [10], no ad-hoc rescaling of the weights or code units is necessary. [sent-105, score-0.468]
51 Indicating with superscripts the indices referring to the training samples and making explicit the dependencies on the code vectors, we can rewrite the energy of the system as: E(WC, WD, Z^1, …, Z^P) = Σ_{i=1..P} [EC(X^i, Z^i, WC) + ED(X^i, Z^i, WD)] (7). [sent-108, score-0.606]
52 It is easy to minimize this loss with respect to WC and WD when the Z^i are known and, particularly for our experiments where encoder and decoder are a set of linear filters, this is a convex quadratic optimization problem. [sent-118, score-0.946]
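Because the problem is a convex quadratic in WC and WD once the codes are fixed, one can either take gradient steps or, for a small batch, solve it in closed form. The batch least-squares solution below is only an illustration of that convexity; the ridge term is an added numerical-stability assumption, and an on-line variant would instead take one small gradient step per sample.

```python
import numpy as np

def update_filters(Xs, Zs, Zbars, ridge=1e-6):
    """Closed-form minimizers of eq. (7) over W_C and W_D with the codes held
    fixed.  Xs: (n, P) patches as columns, Zs: (m, P) optimal codes,
    Zbars: (m, P) their sparsified versions."""
    n, m = Xs.shape[0], Zs.shape[0]
    # W_C minimizes sum_i ||Z^i - W_C X^i||^2  (ordinary least squares)
    W_C = Zs @ Xs.T @ np.linalg.inv(Xs @ Xs.T + ridge * np.eye(n))
    # W_D minimizes sum_i ||X^i - W_D Zbar^i||^2
    W_D = Xs @ Zbars.T @ np.linalg.inv(Zbars @ Zbars.T + ridge * np.eye(m))
    return W_C, W_D
```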
53 First, we find the optimal Z^i for a given set of filters in encoder and decoder. [sent-121, score-0.534]
54 1. Propagate the input X through the encoder to get a codeword Zinit. [sent-125, score-0.597]
55 2. Minimize the loss in eq. 6 (the sum of reconstruction and code prediction energy) with respect to Z by gradient descent, using Zinit as the initial value. [sent-127, score-0.509]
56 Since the code vector Z minimizes both energy terms, it not only minimizes the reconstruction energy, but is also as similar as possible to the code predicted by the encoder. [sent-129, score-0.894]
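A sketch of this inference loop is given below. For simplicity, the non-linearity is treated during the inner minimization as an elementwise logistic with a fixed gain and bias (an assumption; the actual Sparsifying Logistic adapts its normalizer across samples), which keeps the gradient of the energy with respect to Z in closed form.

```python
import numpy as np

def infer_code(X, W_C, W_D, beta=1.0, bias=0.0, lr=0.1, steps=30):
    """Find an (approximately) minimum-energy code for patch X:
    1. initialize Z at the encoder prediction W_C X,
    2. run gradient descent on E = 1/2||Z - W_C X||^2 + 1/2||X - W_D s(Z)||^2,
       where s(z) = logistic(beta*z - bias) stands in for the Sparsifying
       Logistic (simplifying assumption).  Step size and iteration count are
       placeholders."""
    Z_init = W_C @ X
    Z = Z_init.copy()
    for _ in range(steps):
        S = 1.0 / (1.0 + np.exp(-(beta * Z - bias)))            # sparse code s(Z)
        grad = (Z - Z_init) + (beta * S * (1.0 - S)) * (W_D.T @ (W_D @ S - X))
        Z -= lr * grad
    S = 1.0 / (1.0 + np.exp(-(beta * Z - bias)))
    return Z, S                                                 # Z* and its sparsified version
```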
57 After training, the decoder settles on filters that produce low reconstruction errors from minimum-energy, sparsified code vectors Z̄*, while the encoder simultaneously learns filters that predict the corresponding minimum-energy codes Z*. [sent-130, score-1.521]
58 In other words, the system converges to a state where minimum-energy code vectors not only reconstruct the image patch but can also be easily predicted by the encoder filters. [sent-131, score-1.135]
59 Moreover, starting the minimization over Z from the prediction given by the encoder allows convergence in very few iterations. [sent-132, score-0.611]
60 When training is complete, a simple pass through the encoder will produce an accurate prediction of the minimum-energy code vector. [sent-134, score-1.008]
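Putting the pieces together, the two-phase, EM-like procedure can be sketched as an on-line loop: infer the minimum-energy code for each patch (phase 1), then take one small gradient step on the encoder and decoder filters at that code (phase 2). The learning rates, epoch count, and the simplified non-linearity carried over from the inference sketch above are assumptions.

```python
import numpy as np

def train(patches, W_C, W_D, num_epochs=10, param_lr=0.01):
    """Alternating (coordinate-descent) training sketch.  `patches` is an
    iterable of flattened input patches; `infer_code` is the inference
    sketch defined earlier."""
    for _ in range(num_epochs):
        for X in patches:
            Z, Zbar = infer_code(X, W_C, W_D)                 # phase 1: minimum-energy code
            # phase 2: lower the energy at the fixed code by a gradient step
            W_C -= param_lr * np.outer(W_C @ X - Z, X)        # d E_C / d W_C
            W_D -= param_lr * np.outer(W_D @ Zbar - X, Zbar)  # d E_D / d W_D
    return W_C, W_D
```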
61 Notice that we could weight the encoding and reconstruction energies differently in the loss function. [sent-138, score-0.171]
62 In particular, assigning a very large weight to the encoding energy corresponds to turning the penalty on the encoding prediction into a hard constraint. [sent-139, score-0.201]
63 The code vector would be assigned the value predicted by the encoder, and the minimization would reduce to a mean square error minimization through back-propagation as in a standard autoencoder. [sent-140, score-0.476]
64 Hence, the gradients back-propagated to the encoder are likely to be very small. [sent-143, score-0.565]
65 This causes the direct minimization over encoder parameters to fail, but does not seem to adversely affect the minimization over code vectors. [sent-144, score-1.01]
66 We surmise that the large number of degrees of freedom in code vectors (relative to the number of encoder parameters) makes the minimization problem considerably better conditioned. [sent-145, score-1.003]
67 The alternating descent over codes and parameters can be seen as a kind of deterministic EM. [sent-147, score-0.431]
68 For example, in Olshausen and Field’s [10] linear generative model, inference is expensive because minimization in code space is necessary during testing as well as training. [sent-154, score-0.423]
69 In the model of [4], learning is very expensive because the decoder is missing, and sampling techniques [8] must be used to provide a reconstruction. [sent-156, score-0.38]
70 Two standard data sets were used: natural image patches and handwritten digits. [sent-160, score-0.177]
71 Feature Extraction from Natural Image Patches: In the first experiment, the system was trained on 100,000 gray-level patches of size 12x12 extracted from the Berkeley segmentation data set [12]. [sent-166, score-0.165]
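For concreteness, a minimal way to build such a training set, assuming the Berkeley images are available as a list of 2-D gray-level arrays, is sketched below; uniform random crop positions and the absence of any whitening or other preprocessing are assumptions (the text only states that no preprocessing is required).

```python
import numpy as np

def sample_patches(images, num_patches=100_000, size=12, seed=0):
    """Draw random size x size gray-level patches from a list of 2-D arrays
    and return them flattened, one per row."""
    rng = np.random.default_rng(seed)
    patches = np.empty((num_patches, size * size))
    for k in range(num_patches):
        img = images[rng.integers(len(images))]
        r = rng.integers(img.shape[0] - size + 1)
        c = rng.integers(img.shape[1] - size + 1)
        patches[k] = img[r:r + size, c:c + size].ravel()
    return patches
```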
72 We chose an overcomplete factor approximately equal to 2 by representing the input with 200 code units¹. [sent-168, score-0.519]
73 The learning rate for the minimization in code space was set to 0. [sent-177, score-0.444]
74 Some components of the sparse code must be allowed to take continuous values to account for the average value of a patch. [sent-180, score-0.486]
75 For this reason, during training we saturated the running sums ζ to allow some units to be always active. [sent-181, score-0.194]
76 Examples of learned encoder and decoder filters are shown in figure 3. [sent-186, score-0.914]
77 ¹ Overcompleteness must be evaluated by considering the number of code units and the effective dimensionality of the input as given by PCA. [sent-189, score-0.531]
78 Figure 4: Top: A randomly selected subset of encoder filters learned by our energy-based model when trained on the MNIST handwritten digit dataset. [sent-192, score-0.656]
79 The reconstruction is made by adding “parts”: it is the additive linear combination of a few basis functions of the decoder with positive coefficients. [sent-194, score-0.465]
80 Interestingly, the encoder and decoder filter values are nearly identical up to a scale factor. [sent-195, score-0.914]
81 The number of components in the code vector was 196. [sent-200, score-0.394]
82 Reconstruction of most single digits can be achieved by a linear additive combination of a small number of filters since the output of the Sparsifying Logistic is sparse and positive. [sent-209, score-0.154]
83 Learning Local Features for the MNIST dataset: Deep convolutional networks trained with backpropagation hold the current record for accuracy on the MNIST dataset [14, 15]. [sent-213, score-0.204]
84 The authors of [16] have recently shown that initializing the weights of a deep network using unsupervised learning before performing supervised learning with back-propagation can significantly improve its performance. [sent-216, score-0.193]
85 This section describes a similar experiment in which we used the proposed method to initialize the first layer of a large convolutional network. [sent-217, score-0.186]
86 However, because our model produces sparse features, our network had a considerably larger number of feature maps: 50 for layers 1 and 2, 50 for layers 3 and 4, 200 for layer 5, and 10 for the output layer. [sent-219, score-0.542]
87 In the next experiment, the proposed sparse feature learning method was trained on 5x5 image patches extracted from the MNIST training set. [sent-228, score-0.353]
88 The encoder filters were used to initialize the first layer of the 50-50-200-10 net. [sent-230, score-0.636]
89 The network was then trained in the usual way, except that the first layer was kept fixed for the first 10 epochs through the training set. [sent-231, score-0.235]
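A framework-agnostic sketch of this initialization step is given below: each encoder filter learned on 5x5 MNIST patches (a row of WC) is reshaped into one first-layer convolution kernel, and those kernels are simply not updated during the first epochs. Which 50 of the learned filters are used, and the freezing convention, are assumptions here.

```python
import numpy as np

def init_first_layer(W_C_mnist, num_maps=50, ksize=5):
    """Reshape encoder filters learned on ksize x ksize patches into a bank
    of convolution kernels with shape (num_maps, ksize, ksize)."""
    return W_C_mnist[:num_maps].reshape(num_maps, ksize, ksize).copy()

# Sketch of the freezing schedule described in the text: skip gradient
# updates of the first layer for the first 10 epochs, then train it normally.
def first_layer_is_trainable(epoch, frozen_epochs=10):
    return epoch >= frozen_epochs
```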
90 Figure 5: Filters in the first convolutional layer after training when the network is randomly initialized (top row) and when the first layer of the network is initialized with the features learned by the unsupervised energy-based model (bottom row). [sent-244, score-0.512]
91 Table 1: Comparison of test error rates on the MNIST dataset using convolutional networks with various training set sizes: 20,000, 60,000, and 60,000 plus 550,000 elastic distortions. [sent-254, score-0.161]
92 Hierarchical Extension: Learning Topographic Maps. It has already been observed that features extracted from natural image patches resemble Gabor-like filters, see fig. [sent-257, score-0.186]
93 In order to capture higher-order dependencies among code units, we propose to extend the encoder architecture by adding a second layer of units to the linear filter bank. [sent-260, score-1.073]
94 In order to activate a unit at the output of the Sparsifying Logistic, all the afferent unrectified units in the first layer must agree in giving a strong positive response to the input patch. [sent-263, score-0.295]
95 Using a neighborhood of size 3x3, toroidal boundary conditions, and computing code vectors with 400 units from 12x12 input patches from the Berkeley dataset, we have obtained the topographic map shown in fig. [sent-266, score-0.67]
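The sketch below illustrates one reading of this extension: the 400 code units are laid out on a 20x20 grid, and each second-layer unit combines the unrectified first-layer outputs over a 3x3 neighborhood with toroidal (wrap-around) boundaries before the Sparsifying Logistic is applied. The uniform 3x3 weights and the 20x20 layout are assumptions; the actual model uses its own combination weights.

```python
import numpy as np

def topographic_second_layer(Z, grid_side=20, K=None):
    """Combine first-layer code units over 3x3 toroidal neighborhoods on a
    grid_side x grid_side map.  Z has grid_side**2 components."""
    Z_grid = Z.reshape(grid_side, grid_side)
    if K is None:
        K = np.ones((3, 3)) / 9.0               # assumed uniform neighborhood weights
    out = np.zeros_like(Z_grid, dtype=float)
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            shifted = np.roll(np.roll(Z_grid, dr, axis=0), dc, axis=1)  # toroidal shift
            out += K[dr + 1, dc + 1] * shifted
    return out.ravel()                          # then fed to the Sparsifying Logistic
```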
96 Conclusions: An energy-based model was proposed for unsupervised learning of sparse, overcomplete representations. [sent-287, score-0.214]
97 The decoder produces accurate reconstructions of the patches, while the encoder provides a fast prediction of the code without the need for any particular preprocessing of the input images. [sent-290, score-1.427]
98 It seems that a non-linearity that directly sparsifies the code is considerably simpler to control than adding a sparsity term in the loss function, which generally requires ad-hoc normalization procedures [3]. [sent-291, score-0.48]
99 Another possible extension would stack multiple instances of the system described in the paper, with each system as a module in a multi-layer structure where the sparse code produced by one feature extractor is fed to the input of a higher-level feature extractor. [sent-296, score-0.795]
100 Future work will include the application of the model to various tasks, including facial feature extraction, image denoising, image compression, inpainting, classification, and invariant feature extraction for robotics applications. [sent-297, score-0.222]
wordName wordTfidf (topN-words)
[('encoder', 0.534), ('decoder', 0.38), ('code', 0.37), ('wc', 0.223), ('sparsifying', 0.222), ('wd', 0.198), ('lters', 0.175), ('logistic', 0.128), ('mnist', 0.113), ('module', 0.106), ('layer', 0.102), ('units', 0.098), ('sparse', 0.092), ('overcomplete', 0.086), ('reconstruction', 0.085), ('convolutional', 0.084), ('patch', 0.074), ('energy', 0.069), ('patches', 0.065), ('input', 0.063), ('hinton', 0.062), ('numerals', 0.062), ('handwritten', 0.057), ('zi', 0.055), ('image', 0.055), ('encoding', 0.054), ('topographic', 0.054), ('deep', 0.054), ('minimization', 0.053), ('codes', 0.052), ('ec', 0.052), ('sparsity', 0.052), ('training', 0.05), ('representations', 0.05), ('extraction', 0.05), ('saturated', 0.046), ('olshausen', 0.045), ('reconstruct', 0.042), ('system', 0.04), ('trained', 0.04), ('architecture', 0.039), ('sparsi', 0.037), ('dividing', 0.037), ('unsupervised', 0.036), ('stroke', 0.035), ('zinit', 0.035), ('initialized', 0.034), ('subtracting', 0.034), ('loss', 0.032), ('output', 0.032), ('produces', 0.032), ('gradients', 0.031), ('alternated', 0.031), ('feature', 0.031), ('produce', 0.03), ('digits', 0.03), ('descent', 0.03), ('lter', 0.029), ('receptive', 0.029), ('berkeley', 0.029), ('reminiscent', 0.028), ('bank', 0.028), ('inpainting', 0.028), ('seung', 0.028), ('tracks', 0.028), ('hierarchical', 0.028), ('samples', 0.028), ('teh', 0.028), ('activities', 0.028), ('dataset', 0.027), ('ever', 0.027), ('considerably', 0.026), ('backpropagation', 0.026), ('initializing', 0.026), ('digit', 0.025), ('lewicki', 0.025), ('referring', 0.025), ('osindero', 0.025), ('components', 0.024), ('transforms', 0.024), ('localized', 0.024), ('prediction', 0.024), ('features', 0.024), ('filters', 0.024), ('softmax', 0.024), ('superscripts', 0.024), ('reconstructions', 0.024), ('parts', 0.023), ('network', 0.023), ('lecun', 0.022), ('extension', 0.022), ('resemble', 0.022), ('detectors', 0.022), ('coding', 0.021), ('rate', 0.021), ('missing', 0.021), ('extracted', 0.02), ('vectors', 0.02), ('kept', 0.02), ('discrepancy', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 72 nips-2006-Efficient Learning of Sparse Representations with an Energy-Based Model
Author: Marc'aurelio Ranzato, Christopher Poultney, Sumit Chopra, Yann LeCun
Abstract: We describe a novel unsupervised method for learning sparse, overcomplete features. The model uses a linear encoder, and a linear decoder preceded by a sparsifying non-linearity that turns a code vector into a quasi-binary sparse code vector. Given an input, the optimal code minimizes the distance between the output of the decoder and the input patch while being as similar as possible to the encoder output. Learning proceeds in a two-phase EM-like fashion: (1) compute the minimum-energy code vector, (2) adjust the parameters of the encoder and decoder so as to decrease the energy. The model produces “stroke detectors” when trained on handwritten numerals, and Gabor-like filters when trained on natural image patches. Inference and learning are very fast, requiring no preprocessing, and no expensive sampling. Using the proposed unsupervised method to initialize the first layer of a convolutional network, we achieved an error rate slightly lower than the best reported result on the MNIST dataset. Finally, an extension of the method is described to learn topographical filter maps. 1
2 0.14720213 167 nips-2006-Recursive ICA
Author: Honghao Shan, Lingyun Zhang, Garrison W. Cottrell
Abstract: Independent Component Analysis (ICA) is a popular method for extracting independent features from visual data. However, as a fundamentally linear technique, there is always nonlinear residual redundancy that is not captured by ICA. Hence there have been many attempts to try to create a hierarchical version of ICA, but so far none of the approaches have a natural way to apply them more than once. Here we show that there is a relatively simple technique that transforms the absolute values of the outputs of a previous application of ICA into a normal distribution, to which ICA maybe applied again. This results in a recursive ICA algorithm that may be applied any number of times in order to extract higher order structure from previous layers. 1
3 0.14369516 88 nips-2006-Greedy Layer-Wise Training of Deep Networks
Author: Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle
Abstract: Complexity theory of circuits strongly suggests that deep architectures can be much more efficient (sometimes exponentially) than shallow architectures, in terms of computational elements required to represent some functions. Deep multi-layer neural networks have many levels of non-linearities allowing them to compactly represent highly non-linear and highly-varying functions. However, until recently it was not clear how to train such deep networks, since gradient-based optimization starting from random initialization appears to often get stuck in poor solutions. Hinton et al. recently introduced a greedy layer-wise unsupervised learning algorithm for Deep Belief Networks (DBN), a generative model with many layers of hidden causal variables. In the context of the above optimization problem, we study this algorithm empirically and explore variants to better understand its success and extend it to cases where the inputs are continuous or where the structure of the input distribution is not revealing enough about the variable to be predicted in a supervised task. Our experiments also confirm the hypothesis that the greedy layer-wise unsupervised training strategy mostly helps the optimization, by initializing weights in a region near a good local minimum, giving rise to internal distributed representations that are high-level abstractions of the input, bringing better generalization.
4 0.10153402 187 nips-2006-Temporal Coding using the Response Properties of Spiking Neurons
Author: Thomas Voegtlin
Abstract: In biological neurons, the timing of a spike depends on the timing of synaptic currents, in a way that is classically described by the Phase Response Curve. This has implications for temporal coding: an action potential that arrives on a synapse has an implicit meaning, that depends on the position of the postsynaptic neuron on the firing cycle. Here we show that this implicit code can be used to perform computations. Using theta neurons, we derive a spike-timing dependent learning rule from an error criterion. We demonstrate how to train an auto-encoder neural network using this rule. 1
5 0.095182739 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis
Author: J.t. Lindgren, Aapo Hyvärinen
Abstract: In previous studies, quadratic modelling of natural images has resulted in cell models that react strongly to edges and bars. Here we apply quadratic Independent Component Analysis to natural image patches, and show that up to a small approximation error, the estimated components are computing conjunctions of two linear features. These conjunctive features appear to represent not only edges and bars, but also inherently two-dimensional stimuli, such as corners. In addition, we show that for many of the components, the underlying linear features have essentially V1 simple cell receptive field characteristics. Our results indicate that the development of the V2 cells preferring angles and corners may be partly explainable by the principle of unsupervised sparse coding of natural images. 1
6 0.093792669 16 nips-2006-A Theory of Retinal Population Coding
7 0.093304202 75 nips-2006-Efficient sparse coding algorithms
8 0.092534877 8 nips-2006-A Nonparametric Approach to Bottom-Up Visual Saliency
9 0.086651281 126 nips-2006-Logistic Regression for Single Trial EEG Classification
10 0.084733747 134 nips-2006-Modeling Human Motion Using Binary Latent Variables
11 0.079516344 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization
12 0.079118267 179 nips-2006-Sparse Representation for Signal Classification
13 0.072837748 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation
14 0.06366279 94 nips-2006-Image Retrieval and Classification Using Local Distance Functions
15 0.060728967 148 nips-2006-Nonlinear physically-based models for decoding motor-cortical population activity
16 0.058489021 63 nips-2006-Cross-Validation Optimization for Large Scale Hierarchical Classification Kernel Methods
17 0.057923026 168 nips-2006-Reducing Calibration Time For Brain-Computer Interfaces: A Clustering Approach
18 0.057065688 132 nips-2006-Modeling Dyadic Data with Binary Latent Factors
19 0.055672627 78 nips-2006-Fast Discriminative Visual Codebooks using Randomized Clustering Forests
20 0.054597341 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation
topicId topicWeight
[(0, -0.178), (1, -0.047), (2, 0.125), (3, -0.054), (4, -0.078), (5, -0.042), (6, -0.095), (7, -0.058), (8, -0.035), (9, 0.11), (10, -0.0), (11, -0.037), (12, -0.062), (13, -0.078), (14, -0.072), (15, 0.082), (16, 0.0), (17, 0.187), (18, 0.181), (19, -0.066), (20, -0.053), (21, -0.055), (22, 0.096), (23, -0.082), (24, -0.081), (25, 0.044), (26, 0.034), (27, 0.064), (28, 0.009), (29, -0.031), (30, 0.017), (31, 0.064), (32, 0.002), (33, 0.022), (34, -0.067), (35, -0.011), (36, 0.022), (37, -0.033), (38, 0.091), (39, -0.056), (40, 0.005), (41, 0.002), (42, -0.061), (43, -0.011), (44, 0.009), (45, 0.003), (46, 0.031), (47, -0.076), (48, 0.051), (49, -0.101)]
simIndex simValue paperId paperTitle
same-paper 1 0.94397032 72 nips-2006-Efficient Learning of Sparse Representations with an Energy-Based Model
Author: Marc'aurelio Ranzato, Christopher Poultney, Sumit Chopra, Yann LeCun
Abstract: We describe a novel unsupervised method for learning sparse, overcomplete features. The model uses a linear encoder, and a linear decoder preceded by a sparsifying non-linearity that turns a code vector into a quasi-binary sparse code vector. Given an input, the optimal code minimizes the distance between the output of the decoder and the input patch while being as similar as possible to the encoder output. Learning proceeds in a two-phase EM-like fashion: (1) compute the minimum-energy code vector, (2) adjust the parameters of the encoder and decoder so as to decrease the energy. The model produces “stroke detectors” when trained on handwritten numerals, and Gabor-like filters when trained on natural image patches. Inference and learning are very fast, requiring no preprocessing, and no expensive sampling. Using the proposed unsupervised method to initialize the first layer of a convolutional network, we achieved an error rate slightly lower than the best reported result on the MNIST dataset. Finally, an extension of the method is described to learn topographical filter maps. 1
2 0.75030142 167 nips-2006-Recursive ICA
Author: Honghao Shan, Lingyun Zhang, Garrison W. Cottrell
Abstract: Independent Component Analysis (ICA) is a popular method for extracting independent features from visual data. However, as a fundamentally linear technique, there is always nonlinear residual redundancy that is not captured by ICA. Hence there have been many attempts to try to create a hierarchical version of ICA, but so far none of the approaches have a natural way to apply them more than once. Here we show that there is a relatively simple technique that transforms the absolute values of the outputs of a previous application of ICA into a normal distribution, to which ICA maybe applied again. This results in a recursive ICA algorithm that may be applied any number of times in order to extract higher order structure from previous layers. 1
3 0.7123422 88 nips-2006-Greedy Layer-Wise Training of Deep Networks
Author: Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle
Abstract: Complexity theory of circuits strongly suggests that deep architectures can be much more efficient (sometimes exponentially) than shallow architectures, in terms of computational elements required to represent some functions. Deep multi-layer neural networks have many levels of non-linearities allowing them to compactly represent highly non-linear and highly-varying functions. However, until recently it was not clear how to train such deep networks, since gradient-based optimization starting from random initialization appears to often get stuck in poor solutions. Hinton et al. recently introduced a greedy layer-wise unsupervised learning algorithm for Deep Belief Networks (DBN), a generative model with many layers of hidden causal variables. In the context of the above optimization problem, we study this algorithm empirically and explore variants to better understand its success and extend it to cases where the inputs are continuous or where the structure of the input distribution is not revealing enough about the variable to be predicted in a supervised task. Our experiments also confirm the hypothesis that the greedy layer-wise unsupervised training strategy mostly helps the optimization, by initializing weights in a region near a good local minimum, giving rise to internal distributed representations that are high-level abstractions of the input, bringing better generalization.
4 0.58119828 16 nips-2006-A Theory of Retinal Population Coding
Author: Eizaburo Doi, Michael S. Lewicki
Abstract: Efficient coding models predict that the optimal code for natural images is a population of oriented Gabor receptive fields. These results match response properties of neurons in primary visual cortex, but not those in the retina. Does the retina use an optimal code, and if so, what is it optimized for? Previous theories of retinal coding have assumed that the goal is to encode the maximal amount of information about the sensory signal. However, the image sampled by retinal photoreceptors is degraded both by the optics of the eye and by the photoreceptor noise. Therefore, de-blurring and de-noising of the retinal signal should be important aspects of retinal coding. Furthermore, the ideal retinal code should be robust to neural noise and make optimal use of all available neurons. Here we present a theoretical framework to derive codes that simultaneously satisfy all of these desiderata. When optimized for natural images, the model yields filters that show strong similarities to retinal ganglion cell (RGC) receptive fields. Importantly, the characteristics of receptive fields vary with retinal eccentricities where the optical blur and the number of RGCs are significantly different. The proposed model provides a unified account of retinal coding, and more generally, it may be viewed as an extension of the Wiener filter with an arbitrary number of noisy units. 1
5 0.53377253 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis
Author: J.t. Lindgren, Aapo Hyvärinen
Abstract: In previous studies, quadratic modelling of natural images has resulted in cell models that react strongly to edges and bars. Here we apply quadratic Independent Component Analysis to natural image patches, and show that up to a small approximation error, the estimated components are computing conjunctions of two linear features. These conjunctive features appear to represent not only edges and bars, but also inherently two-dimensional stimuli, such as corners. In addition, we show that for many of the components, the underlying linear features have essentially V1 simple cell receptive field characteristics. Our results indicate that the development of the V2 cells preferring angles and corners may be partly explainable by the principle of unsupervised sparse coding of natural images. 1
6 0.50662422 75 nips-2006-Efficient sparse coding algorithms
7 0.48810166 52 nips-2006-Clustering appearance and shape by learning jigsaws
8 0.47349161 179 nips-2006-Sparse Representation for Signal Classification
9 0.4523465 134 nips-2006-Modeling Human Motion Using Binary Latent Variables
10 0.43003649 107 nips-2006-Large Margin Multi-channel Analog-to-Digital Conversion with Applications to Neural Prosthesis
11 0.41677311 94 nips-2006-Image Retrieval and Classification Using Local Distance Functions
12 0.38977236 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation
13 0.38048628 73 nips-2006-Efficient Methods for Privacy Preserving Face Detection
14 0.37225711 182 nips-2006-Statistical Modeling of Images with Fields of Gaussian Scale Mixtures
15 0.36707771 8 nips-2006-A Nonparametric Approach to Bottom-Up Visual Saliency
16 0.36370054 78 nips-2006-Fast Discriminative Visual Codebooks using Randomized Clustering Forests
17 0.34995341 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization
18 0.34801257 185 nips-2006-Subordinate class recognition using relational object models
19 0.33710578 149 nips-2006-Nonnegative Sparse PCA
20 0.33598819 122 nips-2006-Learning to parse images of articulated bodies
topicId topicWeight
[(1, 0.097), (3, 0.036), (7, 0.064), (9, 0.069), (20, 0.031), (22, 0.066), (44, 0.061), (56, 0.222), (57, 0.107), (64, 0.023), (65, 0.038), (69, 0.074), (71, 0.012)]
simIndex simValue paperId paperTitle
same-paper 1 0.82395649 72 nips-2006-Efficient Learning of Sparse Representations with an Energy-Based Model
Author: Marc'aurelio Ranzato, Christopher Poultney, Sumit Chopra, Yann LeCun
Abstract: We describe a novel unsupervised method for learning sparse, overcomplete features. The model uses a linear encoder, and a linear decoder preceded by a sparsifying non-linearity that turns a code vector into a quasi-binary sparse code vector. Given an input, the optimal code minimizes the distance between the output of the decoder and the input patch while being as similar as possible to the encoder output. Learning proceeds in a two-phase EM-like fashion: (1) compute the minimum-energy code vector, (2) adjust the parameters of the encoder and decoder so as to decrease the energy. The model produces “stroke detectors” when trained on handwritten numerals, and Gabor-like filters when trained on natural image patches. Inference and learning are very fast, requiring no preprocessing, and no expensive sampling. Using the proposed unsupervised method to initialize the first layer of a convolutional network, we achieved an error rate slightly lower than the best reported result on the MNIST dataset. Finally, an extension of the method is described to learn topographical filter maps. 1
2 0.66896665 167 nips-2006-Recursive ICA
Author: Honghao Shan, Lingyun Zhang, Garrison W. Cottrell
Abstract: Independent Component Analysis (ICA) is a popular method for extracting independent features from visual data. However, as a fundamentally linear technique, there is always nonlinear residual redundancy that is not captured by ICA. Hence there have been many attempts to try to create a hierarchical version of ICA, but so far none of the approaches have a natural way to apply them more than once. Here we show that there is a relatively simple technique that transforms the absolute values of the outputs of a previous application of ICA into a normal distribution, to which ICA maybe applied again. This results in a recursive ICA algorithm that may be applied any number of times in order to extract higher order structure from previous layers. 1
3 0.66325003 158 nips-2006-PG-means: learning the number of clusters in data
Author: Yu Feng, Greg Hamerly
Abstract: We present a novel algorithm called PG-means which is able to learn the number of clusters in a classical Gaussian mixture model. Our method is robust and efficient; it uses statistical hypothesis tests on one-dimensional projections of the data and model to determine if the examples are well represented by the model. In so doing, we are applying a statistical test for the entire model at once, not just on a per-cluster basis. We show that our method works well in difficult cases such as non-Gaussian data, overlapping clusters, eccentric clusters, high dimension, and many true clusters. Further, our new method provides a much more stable estimate of the number of clusters than existing methods. 1
4 0.65705615 34 nips-2006-Approximate Correspondences in High Dimensions
Author: Kristen Grauman, Trevor Darrell
Abstract: Pyramid intersection is an efficient method for computing an approximate partial matching between two sets of feature vectors. We introduce a novel pyramid embedding based on a hierarchy of non-uniformly shaped bins that takes advantage of the underlying structure of the feature space and remains accurate even for sets with high-dimensional feature vectors. The matching similarity is computed in linear time and forms a Mercer kernel. Whereas previous matching approximation algorithms suffer from distortion factors that increase linearly with the feature dimension, we demonstrate that our approach can maintain constant accuracy even as the feature dimension increases. When used as a kernel in a discriminative classifier, our approach achieves improved object recognition results over a state-of-the-art set kernel. 1
5 0.65677387 160 nips-2006-Part-based Probabilistic Point Matching using Equivalence Constraints
Author: Graham Mcneill, Sethu Vijayakumar
Abstract: Correspondence algorithms typically struggle with shapes that display part-based variation. We present a probabilistic approach that matches shapes using independent part transformations, where the parts themselves are learnt during matching. Ideas from semi-supervised learning are used to bias the algorithm towards finding ‘perceptually valid’ part structures. Shapes are represented by unlabeled point sets of arbitrary size and a background component is used to handle occlusion, local dissimilarity and clutter. Thus, unlike many shape matching techniques, our approach can be applied to shapes extracted from real images. Model parameters are estimated using an EM algorithm that alternates between finding a soft correspondence and computing the optimal part transformations using Procrustes analysis.
6 0.64984262 8 nips-2006-A Nonparametric Approach to Bottom-Up Visual Saliency
7 0.64538544 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation
8 0.64385688 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis
9 0.64354849 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation
10 0.641087 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
11 0.64047492 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment
12 0.64020032 175 nips-2006-Simplifying Mixture Models through Function Approximation
13 0.64018846 118 nips-2006-Learning to Model Spatial Dependency: Semi-Supervised Discriminative Random Fields
14 0.63973343 75 nips-2006-Efficient sparse coding algorithms
15 0.63827789 42 nips-2006-Bayesian Image Super-resolution, Continued
16 0.63638639 65 nips-2006-Denoising and Dimension Reduction in Feature Space
17 0.63605475 119 nips-2006-Learning to Rank with Nonsmooth Cost Functions
18 0.63566309 97 nips-2006-Inducing Metric Violations in Human Similarity Judgements
19 0.63408184 43 nips-2006-Bayesian Model Scoring in Markov Random Fields
20 0.6338709 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds