nips nips2010 nips2010-206 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: George Dahl, Marc'aurelio Ranzato, Abdel-rahman Mohamed, Geoffrey E. Hinton
Abstract: Straightforward application of Deep Belief Nets (DBNs) to acoustic modeling produces a rich distributed representation of speech data that is useful for recognition and yields impressive results on the speaker-independent TIMIT phone recognition task. However, the first-layer Gaussian-Bernoulli Restricted Boltzmann Machine (GRBM) has an important limitation, shared with mixtures of diagonal-covariance Gaussians: GRBMs treat different components of the acoustic input vector as conditionally independent given the hidden state. The mean-covariance restricted Boltzmann machine (mcRBM), first introduced for modeling natural images, is a much more representationally efficient and powerful way of modeling the covariance structure of speech data. Every configuration of the precision units of the mcRBM specifies a different precision matrix for the conditional distribution over the acoustic space. In this work, we use the mcRBM to learn features of speech data that serve as input into a standard DBN. The mcRBM features combined with DBNs allow us to achieve a phone error rate of 20.5%, which is superior to all published results on speaker-independent TIMIT to date.
Reference: text
sentIndex sentText sentNum sentScore
1 Straightforward application of Deep Belief Nets (DBNs) to acoustic modeling produces a rich distributed representation of speech data that is useful for recognition and yields impressive results on the speaker-independent TIMIT phone recognition task. [sent-4, score-0.975]
2 However, the first-layer Gaussian-Bernoulli Restricted Boltzmann Machine (GRBM) has an important limitation, shared with mixtures of diagonal-covariance Gaussians: GRBMs treat different components of the acoustic input vector as conditionally independent given the hidden state. [sent-5, score-0.394]
3 The mean-covariance restricted Boltzmann machine (mcRBM), first introduced for modeling natural images, is a much more representationally efficient and powerful way of modeling the covariance structure of speech data. [sent-6, score-0.326]
4 Every configuration of the precision units of the mcRBM specifies a different precision matrix for the conditional distribution over the acoustic space. [sent-7, score-0.718]
5 In this work, we use the mcRBM to learn features of speech data that serve as input into a standard DBN. [sent-8, score-0.189]
6 The mcRBM features combined with DBNs allow us to achieve a phone error rate of 20.5%, which is superior to all published results on speaker-independent TIMIT to date. [sent-9, score-0.396]
7 1 Introduction Acoustic modeling is a fundamental problem in automatic continuous speech recognition. [sent-11, score-0.199]
8 Most state of the art speech recognition systems perform acoustic modeling using the following approach [1]. [sent-12, score-0.511]
9 Hidden Markov models (HMMs), with Gaussian mixture models (GMMs) for the emission distributions, are used to model the probability of the acoustic vector sequence given the (tri)phone sequence in the utterance to be recognized. [sent-14, score-0.307]
10 Typically, all of the individual Gaussians in the mixtures are restricted to have diagonal covariance matrices and a large hidden Markov model is constructed from sub-HMMs for each triphone to help deal with the effects of context-dependent variations. [sent-15, score-0.391]
11 Although systems of this sort have yielded many useful results, diagonal covariance CDHMM models have several potential weaknesses as models of speech data. [sent-17, score-0.254]
12 However, perhaps even more disturbing than the frame-independence assumption are the compromises required to deal with two competing pressures in Gaussian mixture model training (footnote 1: we will refer to HMMs with GMM emission distributions as CDHMMs, for continuous-density HMMs): [sent-19, score-0.131]
13 the need for expressive models capable of representing the variability present in real speech data and the need to combat the resulting data sparsity and statistical efficiency issues. [sent-20, score-0.159]
14 These pressures of course exist for other models as well, but the tendency of GMMs to partition the input space into regions where only one component of the mixture dominates is a weakness that inhibits efficient use of a very large number of tunable parameters. [sent-21, score-0.116]
15 The common decision to use diagonal covariance Gaussians for the mixture components is an example of such a compromise of expressiveness that suggests that it might be worthwhile to explore models in which each parameter is constrained by a large fraction of the training data. [sent-22, score-0.183]
16 By contrast, models that use the simultaneous activation of a large number of hidden features to generate an observed input can use many more of their parameters to model each training example and hence have many more training examples to constrain each parameter. [sent-23, score-0.312]
17 The diagonal covariance approximation typically employed for GMM-based acoustic models is symptomatic of, but distinct from, the general representational inefficiencies that tend to crop up in mixture models with massive numbers of highly specialized, distinctly parameterized mixture components. [sent-25, score-0.46]
18 Restricting mixture components to have diagonal covariance matrices introduces a conditional independence assumption between dimensions within a single frame. [sent-26, score-0.185]
19 The delta-feature augmentation mitigates the severity of the approximation and thus makes outperforming diagonal covariance Gaussian mixture models difficult. [sent-27, score-0.174]
20 However, a variety of precision matrix modeling techniques have emerged in the speech recognition literature. [sent-28, score-0.404]
21 GRBMs model different dimensions of their input as conditionally independent given the hidden unit activations, a weakness akin to restricting Gaussians in a GMM to have diagonal covariance. [sent-31, score-0.304]
22 This conditional independence assumption is inappropriate for speech data encoded as a sequence of overlapping frames of spectral information, especially when many frames are concatenated to form the input vector. [sent-32, score-0.346]
23 We demonstrate the efficacy of our approach by reporting results on the speaker-independent TIMIT phone recognition task. [sent-36, score-0.464]
24 TIMIT, as argued in [7], is an ideal dataset for testing new ideas in speech recognition before trying to scale them up to large vocabulary tasks because it is phonetically rich, has well-labeled transcriptions, and is small enough not to pose substantial computational challenges at test time. [sent-37, score-0.289]
25 Our best system achieves a phone error rate on the TIMIT corpus of 20. [sent-38, score-0.366]
26 We obtain these results without augmenting the input with temporal difference features since a sensible model of speech data should be able to learn to extract its own useful features that make explicit inclusion of difference features unnecessary. [sent-40, score-0.249]
27 We construct training cases for the DBN by taking n adjacent frames of acoustic input and pairing them with the identity of the HMM state for the central frame. [sent-42, score-0.312]
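The construction of training cases described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `make_training_cases`, the choice to skip frames too close to the utterance edges, and the toy data are all assumptions.

```python
import numpy as np

def make_training_cases(frames, state_labels, n_context=15):
    """Stack n_context adjacent frames and pair each window with the
    HMM state label of its central frame.

    frames:       (T, D) array of per-frame acoustic features
    state_labels: (T,) array of integer HMM state ids
    Edge handling is an assumption: frames without a full window are skipped.
    """
    assert n_context % 2 == 1, "an odd window has a well-defined central frame"
    half = n_context // 2
    n_frames = frames.shape[0]
    inputs, targets = [], []
    for t in range(half, n_frames - half):
        window = frames[t - half : t + half + 1]   # (n_context, D)
        inputs.append(window.reshape(-1))          # concatenate the frames
        targets.append(state_labels[t])            # label of the central frame
    return np.asarray(inputs), np.asarray(targets)

# Toy utterance: 20 frames of 40 features each (hypothetical data).
frames = np.arange(20 * 40, dtype=float).reshape(20, 40)
labels = np.arange(20)
X, y = make_training_cases(frames, labels, n_context=15)
```

With 20 frames and a 15-frame window, only the six central positions yield a full window, so each training case is a 600-dimensional vector paired with one state label.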
28 Equation 2 implicitly assumes that the visible units have a diagonal covariance Gaussian noise model with a variance of 1 on each dimension. [sent-49, score-0.486]
29 The mcRBM has two groups of hidden units: mean units and precision units. [sent-51, score-0.539]
30 The precision units are designed to enforce smoothness constraints in the data, but when one of these constraints is seriously violated, it is removed by turning off the precision unit. [sent-54, score-0.495]
31 The set of active precision units therefore specifies a sample-specific covariance matrix. [sent-55, score-0.414]
32 In order for a visible vector to be assigned high probability by the precision units, it must only fail to satisfy a small number of the precision unit constraints, although each of these constraints could be egregiously violated. [sent-56, score-0.411]
33 In other words, the RBM energy function is modified to have multiplicative interactions between triples of two visible units, $v_i$ and $v_j$, and one hidden unit $h_k$. [sent-58, score-0.469]
34 After factoring, we may write the cRBM energy function (with visible biases omitted) as: $E(v, h) = -d^T h - (v^T R)^2 P h$ (3), where R is the visible-factor weight matrix, d denotes the hidden unit bias vector, and P is the factor-hidden, or “pooling” matrix. [sent-61, score-0.433]
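As a rough numerical illustration of equation 3, the sketch below evaluates the cRBM energy for a tiny model. It assumes the square in $(v^T R)^2$ is taken element-wise over the factor outputs; the function name `crbm_energy`, the array shapes, and the toy parameter values are hypothetical.

```python
import numpy as np

def crbm_energy(v, h, R, P, d):
    """E(v, h) = -d^T h - (v^T R)^2 P h, square taken element-wise (assumed).

    v: (V,) visible vector       R: (V, F) visible-factor weights
    h: (H,) binary hidden state  P: (F, H) factor-hidden ("pooling") weights
    d: (H,) hidden biases
    """
    factor_outputs = v @ R                      # (F,) filter responses v^T R
    return -(d @ h) - (factor_outputs ** 2) @ P @ h

# Toy parameters (hypothetical): two visibles, two factors, two hidden units.
v = np.array([1.0, 2.0])
R = np.eye(2)
P = -np.eye(2)                 # non-positive pooling weights, an assumption
d = np.array([0.5, 0.0])

e_one_unit = crbm_energy(v, np.array([1.0, 0.0]), R, P, d)
e_both_units = crbm_energy(v, np.array([1.0, 1.0]), R, P, d)
```

In this toy setting, turning on the second hidden unit raises the energy because its factor responds strongly to `v`, which matches the picture of a precision unit switching off when its constraint is violated.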
35 The hidden units of the cRBM are still (just as in GRBMs) conditionally independent given the states of the visible units, so inference remains simple. [sent-65, score-0.602]
36 However, the visible units are coupled in a Markov Random Field determined by the settings of the hidden units. [sent-66, score-0.571]
37 The interaction weight between two arbitrary visible units $v_i$ and $v_j$, which we shall denote $\tilde{w}_{i,j}$, depends on the states of all the hidden units according to: $\tilde{w}_{i,j} = \sum_k \sum_f h_k r_{if} r_{jf} p_{kf}$. [sent-67, score-0.89]
38 The conditional distribution of the visible units given the hidden unit states for the cRBM is given by: $P(v \mid h) \sim \mathcal{N}\left(0, \left(R\,\mathrm{diag}(-P^T h)\,R^T\right)^{-1}\right)$ (4). [sent-71, score-0.698]
39 The cRBM always assigns highest probability to the all zero visible vector. [sent-72, score-0.139]
40 In order to allow the model to shift the mean, we add an additional set of binary hidden units whose vector of states we shall denote m. [sent-73, score-0.463]
41 If $E_C(v, h)$ denotes the cRBM energy function (equation 3) and $E_M(v, m)$ denotes the GRBM energy function (equation 2), then the mcRBM energy function is: $E_{MC}(v, h, m) = E_C(v, h) + E_M(v, m)$. [sent-75, score-0.168]
42 The resulting conditional distribution over the visible units, given the two sets of hidden units, is: $P(v \mid h, m) \propto \mathcal{N}(\Sigma W m, \Sigma)$, where $\Sigma = \left(R\,\mathrm{diag}(-P^T h)\,R^T\right)^{-1}$. [sent-77, score-0.609]
43 Thus the mcRBM can produce conditional distributions over the visible units, given the hidden units, that have non-zero means, unlike the cRBM. [sent-78, score-0.357]
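To make the role of the two groups of hidden units concrete, the following sketch computes the mean $\Sigma W m$ and covariance $\Sigma$ of the conditional Gaussian for a toy model. It is only an illustration under stated assumptions: `P` is stored hidden-by-factor so that $P^T h$ has one entry per factor, the pooling weights are non-positive, enough factors are active for the precision matrix to be invertible, and the function name and parameter values are hypothetical.

```python
import numpy as np

def mcrbm_conditional(h, m, R, P, W):
    """Mean and covariance of N(Sigma W m, Sigma),
    with Sigma = (R diag(-P^T h) R^T)^{-1}.

    Assumed shapes: R (V, F), P (H, F), W (V, M), h (H,), m (M,).
    """
    factor_precisions = -(P.T @ h)                     # (F,), >= 0 when P <= 0
    precision = R @ np.diag(factor_precisions) @ R.T   # (V, V) precision matrix
    sigma = np.linalg.inv(precision)                   # sample-specific covariance
    return sigma @ W @ m, sigma

# Toy parameters (hypothetical).
R = np.eye(2)
P = np.array([[-1.0, -4.0],
              [-1.0,  0.0]])
W = np.eye(2)

mean_a, sigma_a = mcrbm_conditional(np.array([1.0, 0.0]), np.array([1.0, 1.0]), R, P, W)
mean_b, sigma_b = mcrbm_conditional(np.array([1.0, 1.0]), np.array([1.0, 1.0]), R, P, W)
mean_c, _ = mcrbm_conditional(np.array([1.0, 0.0]), np.zeros(2), R, P, W)
```

Changing the precision units `h` changes the covariance (and, through it, the mean), while setting the mean units `m` to zero recovers the zero-mean behaviour of the cRBM.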
44 However, since the matrix inversion required to sample from P(v|h, m) can be expensive, we integrate out the hidden units and use Hybrid Monte Carlo (HMC) [11] on the mcRBM free energy to obtain the reconstructions. [sent-80, score-0.488]
45 1 Practical details In order to facilitate stable training, we make the precision unit term in the energy function insensitive to the scale of the input data by normalizing by the length of v. [sent-84, score-0.221]
46 4 Deep Belief Nets Learning is difficult in densely connected, directed belief nets that have many hidden layers because it is difficult to infer the posterior distribution over the hidden variables, when given a data vector, due to the phenomenon of explaining away. [sent-90, score-0.681]
47 In [8] complementary priors were used to eliminate the explaining away effects, producing a training procedure which is equivalent to training a stack of restricted Boltzmann machines. [sent-92, score-0.171]
48 Once an RBM has been trained on data, we can infer the hidden unit activation probabilities given a data vector and re-represent the data vector as the vector of corresponding hidden activations. [sent-94, score-0.443]
49 Once we have used one RBM as a feature extractor we can, if desired, train an additional RBM that treats the hidden activations of the first RBM as data to model. [sent-96, score-0.21]
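The re-representation step described above, where each trained RBM maps its input to hidden activation probabilities that become data for the next RBM, might be sketched as below. The logistic hidden units, the `(W, b)` layer layout, and the names `sigmoid` and `propagate_up` are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def propagate_up(data, layers):
    """Pass data through a stack of trained RBMs.

    layers: list of (W, b) pairs, W of shape (n_in, n_out), b of shape (n_out,).
    Returns the hidden activation probabilities of the top RBM.
    """
    activations = data
    for W, b in layers:
        # Hidden activation probabilities of this RBM become the next RBM's data.
        activations = sigmoid(activations @ W + b)
    return activations

# Toy stack with all-zero parameters (hypothetical): every unit outputs 0.5.
layers = [(np.zeros((6, 4)), np.zeros(4)), (np.zeros((4, 3)), np.zeros(3))]
batch = np.ones((5, 6))
top_features = propagate_up(batch, layers)
```

The same forward pass, with an output layer appended, is what a pre-trained feed-forward network would compute before discriminative fine-tuning.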
50 After training a sequence of RBMs, we can compose them to form a generative model whose top two layers are the final RBM in the stack and whose lower layers all have downward-directed connections that implement the $p(h_{k-1} \mid h_k)$ learned by the $k$th RBM, where $h_0 = v$. [sent-97, score-0.298]
51 The weights obtained by the greedy layer-by-layer training procedure described for stacking RBMs, above, can be used to initialize the weights of a deep feed-forward neural network. [sent-98, score-0.161]
52 Once we add an output layer to the pre-trained neural network, we can discriminatively fine-tune the weights of this neural net using any variant of backpropagation [13] we wish. [sent-99, score-0.161]
53 Note that the RBM immediately above the mcRBM uses both the mean unit activities and the precision unit activities together as visible data. [sent-102, score-0.362]
54 1 The TIMIT Dataset We used the TIMIT corpus for all of our phone recognition experiments. [sent-105, score-0.464]
55 Since there are three HMM states per phone and 61 phones, all DBN architectures had a 183-way softmax output unit. [sent-112, score-0.464]
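The size of the output layer follows directly from three states for each of 61 phones. The sketch below spells out that arithmetic and a numerically stable softmax; the contiguous state-to-phone index layout in `state_to_phone` is an assumption, not something stated in the text.

```python
import numpy as np

N_PHONES = 61
STATES_PER_PHONE = 3
N_CLASSES = N_PHONES * STATES_PER_PHONE   # 183 output units

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)

def state_to_phone(state_id):
    # Assumed layout: states of phone p occupy ids 3p, 3p + 1, 3p + 2.
    return state_id // STATES_PER_PHONE

posteriors = softmax(np.zeros((2, N_CLASSES)))   # uniform over 183 states
```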
56 After decoding, starting and ending silences were removed and the 61 phone classes were mapped to a set of 39 classes as in [14] for scoring. [sent-120, score-0.538]
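Scoring after class folding can be illustrated with a small phone error rate computation. This is a sketch under assumptions: PER is taken as the Levenshtein distance between folded reference and hypothesis sequences divided by the reference length, and `FOLD` lists only four of the merges from the 61-to-39 mapping, chosen for illustration, so it is not the complete table from [14].

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phone sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

# Partial, illustrative fold table (not the full 61-to-39 mapping).
FOLD = {"ao": "aa", "ix": "ih", "zh": "sh", "ux": "uw"}

def phone_error_rate(ref, hyp, fold=FOLD):
    ref = [fold.get(p, p) for p in ref]
    hyp = [fold.get(p, p) for p in hyp]
    return edit_distance(ref, hyp) / len(ref)

per = phone_error_rate(["sh", "ix", "hh", "ae", "d"], ["zh", "ih", "ae", "d"])
```

In the example, folding makes `zh`/`sh` and `ix`/`ih` count as matches, so the only error is the deleted `hh`.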
57 We removed starting and ending silences before scoring in order to be as similar to [5] as possible. [sent-121, score-0.2]
58 However, to produce a more informative comparison between our results and results in the literature that do not remove starting and ending silences, we also present the phone error rate of our best model using the more common scoring strategy. [sent-122, score-0.453]
59 2 Preprocessing Since we have completely abandoned Gaussian mixture model emission distributions, we are no longer forced to use temporal derivative features. [sent-127, score-0.123]
60 For all experiments the acoustic signal was analyzed using a 25-ms Hamming window with 10-ms between the left edges of successive frames. [sent-128, score-0.214]
61 We use the output from a mel scale filterbank, extracting 39 filterbank output log magnitudes and one log energy per frame. [sent-129, score-0.116]
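The framing parameters and feature counts above determine the size of the network input. The sketch below frames a signal with a 25-ms Hamming window and a 10-ms shift and works out the input dimensionality for a 15-frame context; the 16 kHz sample rate, the function name `frame_signal`, and the silent test signal are assumptions, and the filterbank itself is not implemented.

```python
import numpy as np

SAMPLE_RATE = 16000                 # assumed sample rate in Hz
WINDOW_MS, SHIFT_MS = 25, 10
FEATURES_PER_FRAME = 39 + 1         # 39 log filterbank outputs + 1 log energy
CONTEXT_FRAMES = 15
INPUT_DIM = FEATURES_PER_FRAME * CONTEXT_FRAMES   # 600

def frame_signal(signal, sample_rate=SAMPLE_RATE):
    """Split a 1-D signal into Hamming-windowed frames.

    Requires at least one full window of samples.
    """
    win = sample_rate * WINDOW_MS // 1000     # 400 samples at 16 kHz
    hop = sample_rate * SHIFT_MS // 1000      # 160 samples at 16 kHz
    n_frames = 1 + (len(signal) - win) // hop
    # Row t holds samples [t * hop, t * hop + win).
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx] * np.hamming(win)

one_second = np.zeros(SAMPLE_RATE)
frames = frame_signal(one_second)
```

Under these assumptions one second of audio gives 98 frames, and each training case concatenates 15 frames of 40 features into a 600-dimensional input.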
62 Determining the number of frames of acoustic context to give to the DBN is an important preprocessing decision; preliminary experiments revealed that moving to 15 frames of acoustic data, from the 11 used in [5], could provide improvements in PER when training a DBN on features from a mcRBM. [sent-132, score-0.644]
63 It is possible that even larger acoustic contexts might be beneficial as well. [sent-133, score-0.214]
64 An epoch of training of an mcRBM that had 1536 hidden units (1024 precision units and 512 mean units) took 20 minutes. [sent-139, score-0.929]
65 When each DBN layer had 2048 hidden units, each epoch of pre-training for the first DBN layer took about three minutes and each epoch of pretraining for the fifth layer took seven to eight minutes, since we propagated through each earlier layer. [sent-140, score-0.708]
66 Each epoch of fine-tuning for such a five-DBN-layer architecture took 12 minutes. [sent-141, score-0.13]
67 We used 100 epochs to train the mcRBM, 50 epochs to train each RBM in the stack and 14 epochs of discriminative fine-tuning of the whole network for a total of nearly 60 hours, about 34 of which were spent training the mcRBM. [sent-142, score-0.204]
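The reported total can be roughly checked from the per-epoch timings. The sketch below is back-of-the-envelope arithmetic only: the per-epoch times for the second to fourth DBN layers are not given, so a linear interpolation between three minutes and about seven and a half minutes is assumed.

```python
# Per-epoch timings in minutes, taken from the text where available.
MCRBM_EPOCHS, MCRBM_MIN_PER_EPOCH = 100, 20
RBM_EPOCHS = 50
FINETUNE_EPOCHS, FINETUNE_MIN_PER_EPOCH = 14, 12

# Assumption: layers 2-4 interpolate linearly between layer 1 (3 min)
# and layer 5 (7 to 8 min, taken here as 7.5 min).
first, last, n_layers = 3.0, 7.5, 5
rbm_min_per_epoch = [first + i * (last - first) / (n_layers - 1)
                     for i in range(n_layers)]

mcrbm_hours = MCRBM_EPOCHS * MCRBM_MIN_PER_EPOCH / 60
pretrain_hours = RBM_EPOCHS * sum(rbm_min_per_epoch) / 60
finetune_hours = FINETUNE_EPOCHS * FINETUNE_MIN_PER_EPOCH / 60
total_hours = mcrbm_hours + pretrain_hours + finetune_hours
```

Under these assumptions the mcRBM accounts for a little over 33 hours and the total comes to roughly 58 hours, in line with the figures of about 34 and nearly 60 hours quoted above.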
68 6 Experiments Since one goal of this work is to improve performance on TIMIT by using deep learning architectures, we explored varying the number of DBN layers in our architecture. [sent-143, score-0.238]
69 Figure 2 plots phone error rate on both the development set and the core test set against the number of hidden layers in a mcRBM-DBN (we don’t count the mcRBM as a hidden layer since we do not backpropagate through it). [sent-145, score-1.03]
70 The particular mcRBM-DBN shown had 1536 hidden units in each DBN hidden layer, 1024 precision units in the mcRBM, and 512 mean units in the mcRBM. [sent-146, score-1.223]
71 As the number of DBN hidden layers increased, error on the development and test sets decreased and eventually leveled off. [sent-147, score-0.325]
72 In fact, an mcRBM-DBN with 8 hidden layers is what exhibits the best development set error, 20. [sent-149, score-0.325]
73 5% if starting and ending silences are included in scoring). [sent-153, score-0.143]
74 Models of this depth (note also that an mcRBM-DBN with 8 DBN hidden layers is really a 9 layer model) have rarely been employed in the deep learning literature (cf. [sent-163, score-0.513]
75 Table 1 demonstrates that once the hidden layers are sufficiently large, continuing to increase the size of the hidden layers did not seem to provide additional improvements. [sent-165, score-0.586]
76 In general, we did not find our results to be very sensitive to the exact number of hidden units in each layer, as long as the hidden layers were relatively large. [sent-166, score-0.725]
77 Table 3 compares previously published results on the speaker-independent TIMIT phone recognition task to the best mcRBM-DBN architecture we investigated. [sent-170, score-0.53]
78 Results marked with a * remove starting and ending silences at test time before scoring. [sent-171, score-0.228]
79 [Table 2: mcRBM-DBN vs GRBM-DBN Phone Error Rate. Models: 5 layer GRBM-DBN; mcRBM + 4 layer DBN. Column: devset PER (values truncated in extraction: 22. ... 5%).] [sent-186, score-0.143]
80 One should note that the work of [7] used triphone HMMs and a trigram language model whereas in this work we used only a bigram language model and monophone HMMs, so table 3 probably underestimates the error reduction our system provides over the best published GMM-based approach. [sent-187, score-0.334]
81 7 Conclusions and Future Work We have presented a new deep architecture for phone recognition that combines a mcRBM feature extraction module with a standard DBN. [sent-188, score-0.648]
82 Our approach attacks both the representational inefficiency issues of GMMs and an important limitation of previous work applying DBNs to phone recognition. [sent-189, score-0.413]
83 However, DBN-based acoustic modeling approaches are still in their infancy and many important research questions remain. [sent-191, score-0.254]
84 During the fine-tuning, one could imagine backpropagating through the decoder itself and optimizing an objective function more closely related to the phone error rate. [sent-192, score-0.405]
85 Since the pretraining procedure can make use of large quantities of completely unlabeled data, leveraging untranscribed speech data on a large scale might allow our approach to be even more robust to inter-speaker acoustic variations and would certainly be an interesting avenue of future work. [sent-193, score-0.412]
86 Young, “Statistical modeling in continuous speech recognition (CSR),” in UAI ’01: Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, San Francisco, CA, USA, 2001, pp. [sent-195, score-0.297]
87 Gales, “Minimum phone error training of precision matrix models,” IEEE Transactions on Audio, Speech & Language Processing, vol. [sent-214, score-0.509]
88 Hinton, “Deep belief networks for phone recognition,” in NIPS Workshop on Deep Learning for Speech Recognition and Related Applications, 2009. [sent-223, score-0.461]
89 Picheny, “An exploration of large vocabulary tools for small vocabulary phonetic recognition,” in IEEE Automatic Speech Recognition and Understanding Workshop, 2009. [sent-232, score-0.119]
90 Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. [sent-237, score-0.193]
91 Bourlard, “Continuous speech recognition,” Signal Processing Magazine, IEEE, vol. [sent-242, score-0.159]
92 Hon, “Speaker-independent phone recognition using hidden markov models,” IEEE Transactions on Audio, Speech & Language Processing, vol. [sent-275, score-0.644]
93 Hinton, “3-d object recognition with deep belief nets,” in Advances in Neural Information Processing Systems 22, Y. [sent-286, score-0.291]
94 Rohlicek, “Fast algorithms for phone classification and recognition using segment-based models,” IEEE Transactions on Signal Processing, vol. [sent-301, score-0.464]
95 Saul, “Large margin gaussian mixture modeling for phonetic classification and recognition,” in Proc. [sent-311, score-0.147]
96 Renals, “Speech recognition using augmented conditional random fields,” IEEE Transactions on Audio, Speech & Language Processing, vol. [sent-316, score-0.169]
97 Robinson, “An application of recurrent nets to phone probability estimation,” IEEE Transactions on Neural Networks, vol. [sent-321, score-0.475]
98 Smith, “Improved phone recognition using bayesian triphone models,” in Proc. [sent-328, score-0.548]
99 Yu, “Use of differential cepstra as acoustic features in hidden trajectory modelling for phonetic recognition,” in Proc. [sent-333, score-0.479]
100 Glass, “Heterogeneous measurements and multiple classifiers for speech recognition,” in Proc. [sent-338, score-0.159]
wordName wordTfidf (topN-words)
[('mcrbm', 0.435), ('phone', 0.366), ('dbn', 0.276), ('units', 0.252), ('acoustic', 0.214), ('timit', 0.189), ('hidden', 0.18), ('speech', 0.159), ('visible', 0.139), ('rbm', 0.133), ('deep', 0.125), ('crbm', 0.124), ('dbns', 0.116), ('hmm', 0.114), ('layers', 0.113), ('nets', 0.109), ('grbm', 0.108), ('precision', 0.107), ('recognition', 0.098), ('layer', 0.095), ('boltzmann', 0.091), ('grbms', 0.084), ('silences', 0.084), ('triphone', 0.084), ('hmms', 0.076), ('epoch', 0.073), ('gmm', 0.069), ('belief', 0.068), ('gmms', 0.067), ('hinton', 0.066), ('backpropagation', 0.066), ('rbms', 0.066), ('frames', 0.062), ('ending', 0.059), ('unit', 0.058), ('language', 0.057), ('energy', 0.056), ('phonetic', 0.055), ('covariance', 0.055), ('mixture', 0.052), ('representational', 0.047), ('filterbank', 0.046), ('epochs', 0.044), ('emission', 0.041), ('ranzato', 0.041), ('diagonal', 0.04), ('modeling', 0.04), ('decoder', 0.039), ('pretraining', 0.039), ('cdhmm', 0.038), ('devset', 0.038), ('monophone', 0.038), ('pressures', 0.038), ('icassp', 0.038), ('published', 0.038), ('architectures', 0.038), ('conditional', 0.038), ('gaussians', 0.037), ('stack', 0.036), ('hk', 0.036), ('training', 0.036), ('audio', 0.034), ('backpropagate', 0.034), ('trigram', 0.034), ('em', 0.034), ('augmented', 0.033), ('restricted', 0.032), ('vocabulary', 0.032), ('ec', 0.032), ('development', 0.032), ('states', 0.031), ('mel', 0.031), ('testset', 0.031), ('mohamed', 0.031), ('explaining', 0.031), ('extraction', 0.031), ('core', 0.03), ('features', 0.03), ('forced', 0.03), ('activations', 0.03), ('constrain', 0.03), ('took', 0.029), ('per', 0.029), ('removed', 0.029), ('tying', 0.029), ('dahl', 0.029), ('scoring', 0.028), ('architecture', 0.028), ('cepstral', 0.027), ('augmentation', 0.027), ('networks', 0.027), ('weakness', 0.026), ('factoring', 0.026), ('bigram', 0.026), ('vt', 0.026), ('preprocessing', 0.026), ('viterbi', 0.025), ('concatenated', 0.025), ('trained', 0.025), ('whitening', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 206 nips-2010-Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine
Author: George Dahl, Marc'aurelio Ranzato, Abdel-rahman Mohamed, Geoffrey E. Hinton
Abstract: Straightforward application of Deep Belief Nets (DBNs) to acoustic modeling produces a rich distributed representation of speech data that is useful for recognition and yields impressive results on the speaker-independent TIMIT phone recognition task. However, the first-layer Gaussian-Bernoulli Restricted Boltzmann Machine (GRBM) has an important limitation, shared with mixtures of diagonal-covariance Gaussians: GRBMs treat different components of the acoustic input vector as conditionally independent given the hidden state. The mean-covariance restricted Boltzmann machine (mcRBM), first introduced for modeling natural images, is a much more representationally efficient and powerful way of modeling the covariance structure of speech data. Every configuration of the precision units of the mcRBM specifies a different precision matrix for the conditional distribution over the acoustic space. In this work, we use the mcRBM to learn features of speech data that serve as input into a standard DBN. The mcRBM features combined with DBNs allow us to achieve a phone error rate of 20.5%, which is superior to all published results on speaker-independent TIMIT to date.
2 0.19370653 207 nips-2010-Phoneme Recognition with Large Hierarchical Reservoirs
Author: Fabian Triefenbach, Azarakhsh Jalalvand, Benjamin Schrauwen, Jean-pierre Martens
Abstract: Automatic speech recognition has gradually improved over the years, but the reliable recognition of unconstrained speech is still not within reach. In order to achieve a breakthrough, many research groups are now investigating new methodologies that have potential to outperform the Hidden Markov Model technology that is at the core of all present commercial systems. In this paper, it is shown that the recently introduced concept of Reservoir Computing might form the basis of such a methodology. In a limited amount of time, a reservoir system that can recognize the elementary sounds of continuous speech has been built. The system already achieves a state-of-the-art performance, and there is evidence that the margin for further improvements is still significant.
3 0.15588924 140 nips-2010-Layer-wise analysis of deep networks with Gaussian kernels
Author: Grégoire Montavon, Klaus-Robert Müller, Mikio L. Braun
Abstract: Deep networks can potentially express a learning problem more efficiently than local learning machines. While deep networks outperform local learning machines on some problems, it is still unclear how their nice representation emerges from their complex structure. We present an analysis based on Gaussian kernels that measures how the representation of the learning problem evolves layer after layer as the deep network builds higher-level abstract representations of the input. We use this analysis to show empirically that deep networks build progressively better representations of the learning problem and that the best representations are obtained when the deep network discriminates only in the last layers.
4 0.13742629 103 nips-2010-Generating more realistic images using gated MRF's
Author: Marc'aurelio Ranzato, Volodymyr Mnih, Geoffrey E. Hinton
Abstract: Probabilistic models of natural images are usually evaluated by measuring performance on rather indirect tasks, such as denoising and inpainting. A more direct way to evaluate a generative model is to draw samples from it and to check whether statistical properties of the samples match the statistics of natural images. This method is seldom used with high-resolution images, because current models produce samples that are very different from natural images, as assessed by even simple visual inspection. We investigate the reasons for this failure and we show that by augmenting existing models so that there are two sets of latent variables, one set modelling pixel intensities and the other set modelling image-specific pixel covariances, we are able to generate high-resolution images that look much more realistic than before. The overall model can be interpreted as a gated MRF where both pair-wise dependencies and mean intensities of pixels are modulated by the states of latent variables. Finally, we confirm that if we disallow weight-sharing between receptive fields that overlap each other, the gated MRF learns more efficient internal representations, as demonstrated in several recognition tasks. 1 Introduction and Prior Work The study of the statistical properties of natural images has a long history and has influenced many fields, from image processing to computational neuroscience [1]. In this work we focus on probabilistic models of natural images. These models are useful for extracting representations [2, 3, 4] that can be used for discriminative tasks and they can also provide adaptive priors [5, 6, 7] that can be used in applications like denoising and inpainting. Our main focus, however, will be on improving the quality of the generative model, rather than exploring its possible applications. Markov Random Fields (MRF’s) provide a very general framework for modelling natural images. 
In an MRF, an image is assigned a probability which is a normalized product of potential functions, with each function typically being defined over a subset of the observed variables. In this work we consider a very versatile class of MRF’s in which potential functions are defined over both pixels and latent variables, thus allowing the states of the latent variables to modulate or gate the effective interactions between the pixels. This type of MRF, that we dub gated MRF, was proposed as an image model by Geman and Geman [8]. Welling et al. [9] showed how an MRF in this family (footnote 1: Product of Student’s t models (without pooling) may not appear to have latent variables but each potential can be viewed as an infinite mixture of zero-mean Gaussians where the inverse variance of the Gaussian is the latent variable) could be learned for small image patches and their work was extended to high-resolution images by Roth and Black [6] who also demonstrated its success in some practical applications [7]. Besides their practical use, these models were specifically designed to match the statistical properties of natural images, and therefore, it seems natural to evaluate them in those terms. Indeed, several authors [10, 7] have proposed that these models should be evaluated by generating images and checking whether the samples match the statistical properties observed in natural images. It is, therefore, very troublesome that none of the existing models can generate good samples, especially for high-resolution images (see for instance fig. 2 in [7] which is one of the best models of high-resolution images reported in the literature so far). In fact, as our experiments demonstrate, the generated samples from these models are more similar to random images than to natural images! When MRF’s with gated interactions are applied to small image patches, they actually seem to work moderately well, as demonstrated by several authors [11, 12, 13].
The generated patches have some coherent and elongated structure and, like natural image patches, they are predominantly very smooth with sudden outbreaks of strong structure. This is unsurprising because these models have a built-in assumption that images are very smooth with occasional strong violations of smoothness [8, 14, 15]. However, the extension of these patch-based models to high-resolution images by replicating filters across the image has proven to be difficult. The receptive fields that are learned no longer resemble Gabor wavelets but look random [6, 16] and the generated images lack any of the long range structure that is so typical of natural images [7]. The success of these methods in applications such as denoising is a poor measure of the quality of the generative model that has been learned: Setting the parameters to random values works almost as well for eliminating independent Gaussian noise [17], because this can be done quite well by just using a penalty for high-frequency variation. In this work, we show that the generative quality of these models can be drastically improved by jointly modelling both pixel mean intensities and pixel covariances. This can be achieved by using two sets of latent variables, one that gates pair-wise interactions between pixels and another one that sets the mean intensities of pixels, as we already proposed in some earlier work [4]. Here, we show that this modelling choice is crucial to make the gated MRF work well on high-resolution images. Finally, we show that the most widely used method of sharing weights in MRF’s for high-resolution images is overly constrained. Earlier work considered homogeneous MRF’s in which each potential is replicated at all image locations. This has the subtle effect of making learning very difficult because of strong correlations at nearby sites. 
Following Gregor and LeCun [18] and also Tang and Eliasmith [19], we keep the number of parameters under control by using local potentials, but unlike Roth and Black [6] we only share weights between potentials that do not overlap. 2 Augmenting Gated MRF’s with Mean Hidden Units A Product of Student’s t (PoT) model [15] is a gated MRF defined on small image patches that can be viewed as modelling image-specific, pair-wise relationships between pixel values by using the states of its latent variables. It is very good at representing the fact that two pixels have very similar intensities and no good at all at modelling what these intensities are. Failure to model the mean also leads to impoverished modelling of the covariances when the input images have nonzero mean intensity. The covariance RBM (cRBM) [20] is another model that shares the same limitation since it only differs from PoT in the distribution of its latent variables: The posterior over the latent variables is a product of Bernoulli distributions instead of Gamma distributions as in PoT. We explain the fundamental limitation of these models by using a simple toy example: Modelling two-pixel images using a cRBM with only one binary hidden unit, see fig. 1. This cRBM assumes that the conditional distribution over the input is a zero-mean Gaussian with a covariance that is determined by the state of the latent variable. Since the latent variable is binary, the cRBM can be viewed as a mixture of two zero-mean full covariance Gaussians. The latent variable uses the pairwise relationship between pixels to decide which of the two covariance matrices should be used to model each image. When the input data is pre-processed by making each image have zero mean intensity (the empirical histogram is shown in the first row and first column), most images lie near the origin because most of the time nearby pixels are strongly correlated.
Less frequently we encounter edge images that exhibit strong anti-correlation between the pixels, as shown by the long tails along the anti-diagonal line. A cRBM could model this data by using two Gaussians (first row and second column): one that is spherical and tight at the origin for smooth images and another one that has a covariance elongated along the anti-diagonal for structured images. If, however, the whole set of images is normalized by subtracting from every pixel the mean value of all pixels over all images (second row and first column), the cRBM fails at modelling structured images (second row and second column). It can fit a Gaussian to the smooth images by discovering the direction of strong correlation along the main diagonal, but it is very likely to fail to discover the direction of anti-correlation, which is crucial to represent discontinuities, because structured images with different mean intensity appear to be evenly spread over the whole input space.

Figure 1: In the first row, each image is zero mean. In the second row, the whole set of data points is centered but each image can have non-zero mean. The first column shows 8x8 images picked at random from natural images. The images in the second column are generated by a model that does not account for mean intensity. The images in the third column are generated by a model that has both “mean” and “covariance” hidden units. The contours in the first column show the negative log of the empirical distribution of (tiny) natural two-pixel images (x-axis being the first pixel and the y-axis the second pixel). The plots in the other columns are toy examples showing how each model could represent the empirical distribution using a mixture of Gaussians with components that have one of two possible covariances (corresponding to the state of a binary “covariance” latent variable). Models that can change the means of the Gaussians (mPoT and mcRBM) can better represent structured images (edge images lie along the anti-diagonal and are fitted by the Gaussians shown in red) while the other models (PoT and cRBM) fail, especially when each image can have non-zero mean.

If the model has another set of latent variables that can change the means of the Gaussian distributions in the mixture (as explained more formally below and yielding the mPoT and mcRBM models), then the model can represent both changes of mean intensity and the correlational structure of pixels (see last column). The mean latent variables effectively subtract off the relevant mean from each data-point, letting the covariance latent variable capture the covariance structure of the data. As before, the covariance latent variable needs only to select between two covariance matrices. In fact, experiments on real 8x8 image patches confirm these conjectures. Fig. 1 shows samples drawn from PoT and mPoT. mPoT (and similarly mcRBM [4]) is not only better at modelling zero-mean images but it can also represent images that have non-zero mean intensity well. We now describe mPoT, referring the reader to [4] for a detailed description of mcRBM. In PoT [9] the energy function is:

$$E^{PoT}(x, h^c) = \sum_i \left[ h^c_i \left( 1 + \tfrac{1}{2} (C_i^T x)^2 \right) + (1 - \gamma) \log h^c_i \right] \qquad (1)$$

where $x$ is a vectorized image patch, $h^c$ is a vector of Gamma “covariance” latent variables, $C$ is a filter bank matrix and $\gamma$ is a scalar parameter. The joint probability over input pixels and latent variables is proportional to $\exp(-E^{PoT}(x, h^c))$. Therefore, the conditional distribution over the input pixels is a zero-mean Gaussian with covariance equal to:

$$\Sigma^c = (C \, \mathrm{diag}(h^c) \, C^T)^{-1}. \qquad (2)$$

In order to make the mean of the conditional distribution non-zero, we define mPoT as the normalized product of the above zero-mean Gaussian that models the covariance and a spherical covariance Gaussian that models the mean. 
The overall energy function becomes:

$$E^{mPoT}(x, h^c, h^m) = E^{PoT}(x, h^c) + E^m(x, h^m) \qquad (3)$$

where $h^m$ is another set of latent variables that are assumed to be Bernoulli distributed (but other distributions could be used). The new energy term is:

$$E^m(x, h^m) = \tfrac{1}{2} x^T x - \sum_j h^m_j W_j^T x \qquad (4)$$

yielding the following conditional distribution over the input pixels:

$$p(x | h^c, h^m) = N(\Sigma (W h^m), \Sigma), \qquad \Sigma = ((\Sigma^c)^{-1} + I)^{-1} \qquad (5)$$

with $\Sigma^c$ defined in eq. 2. As desired, the conditional distribution has non-zero mean (see footnote 2). Patch-based models like PoT have been extended to high-resolution images by using spatially localized filters [6]. While we can subtract off the mean intensity from independent image patches to successfully train PoT, we cannot do that on a high-resolution image because overlapping patches might have different means. Unfortunately, replicating potentials over the image while ignoring variations of mean intensity has been the leading strategy to date [6] (see footnote 3). This is the major reason why generation of high-resolution images is so poor. Sec. 4 shows that generation can be drastically improved by explicitly accounting for variations of mean intensity, as performed by mPoT and mcRBM.

Figure 2: Illustration of different choices of weight-sharing scheme for a RBM. Links converging to one latent variable are filters. Filters with the same color share the same parameters. Kinds of weight-sharing scheme: A) Global, B) Local, C) TConv and D) Conv. E) TConv applied to an image. Cells correspond to neighborhoods to which filters are applied. Cells with the same color share the same parameters. F) 256 filters learned by a Gaussian RBM with TConv weight-sharing scheme on high-resolution natural images. Each filter has size 16x16 pixels and it is applied every 16 pixels in both the horizontal and vertical directions. Filters in position (i, j) and (1, 1) are applied to neighborhoods that are (i, j) pixels away from each other. Best viewed in color.
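As a rough illustration of the mPoT conditional distribution for a single patch, the following NumPy sketch computes its mean and covariance. It assumes that the precisions of the two Gaussian factors add, so that the precision is C diag(h^c) C^T + I, and that filters are stored as matrix columns. The function name `mpot_conditional`, the toy dimensions and the random parameters are hypothetical and are not taken from the paper.

```python
import numpy as np

def mpot_conditional(C, W, h_c, h_m):
    """Mean and covariance of p(x | h_c, h_m) for one patch (illustrative sketch).

    C   : (D, F) covariance filters, one filter per column
    W   : (D, M) mean filters, one filter per column
    h_c : (F,) positive "covariance" latent states
    h_m : (M,) binary "mean" latent states
    """
    D = C.shape[0]
    # Assumed precision: PoT precision C diag(h_c) C^T plus the identity
    # contributed by the x^T x / 2 term of the mean energy.
    precision = C @ np.diag(h_c) @ C.T + np.eye(D)
    sigma = np.linalg.inv(precision)
    mean = sigma @ (W @ h_m)
    return mean, sigma

# Toy usage with made-up sizes and random parameters.
rng = np.random.default_rng(0)
D, F, M = 4, 6, 3
C = rng.normal(size=(D, F))
W = rng.normal(size=(D, M))
h_c = rng.gamma(shape=2.0, size=F)
h_m = rng.integers(0, 2, size=M).astype(float)

mean, sigma = mpot_conditional(C, W, h_c, h_m)
# With all mean units off, the conditional falls back to a zero-mean Gaussian.
mean_off, _ = mpot_conditional(C, W, h_c, np.zeros(M))
```

Switching mean units on shifts the Gaussian away from the origin without changing its covariance, which is the separation of roles the section argues for.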
Footnote 2: The need to model the means was clearly recognized in [21] but they used conjunctive latent features that simultaneously represented a contribution to the “precision matrix” in a specific direction and the mean along that same direction.

Footnote 3: The success of PoT-like models in Bayesian denoising is not surprising since the noisy image effectively replaces the reconstruction term from the mean hidden units (see eq. 5), providing a set of noisy mean intensities that are cleaned up by the patterns of correlation enforced by the covariance latent variables.

3 Weight-Sharing Schemes

By integrating out the latent variables, we can write the density function of any gated MRF as a normalized product of potential functions (for mPoT refer to eq. 6). In this section we investigate different ways of constraining the parameters of the potentials of a generic MRF.

Global: The obvious way to extend a patch-based model like PoT to high-resolution images is to define potentials over the whole image; we call this scheme global. This is not practical because 1) the number of parameters grows about quadratically with the size of the image making training too slow, 2) we do not need to model interactions between very distant pairs of pixels since their dependence is negligible, and 3) we would not be able to use the model on images of different size.

Conv: The most popular way to handle big images is to define potentials on small subsets of variables (e.g., neighborhoods of size 5x5 pixels) and to replicate these potentials across space while sharing their parameters at each image location [23, 24, 6]. This yields a convolutional weight-sharing scheme, also called homogeneous field in the statistics literature. This choice is justified by the stationarity of natural images. This weight-sharing scheme is extremely concise in terms of number of parameters, but also rather inefficient in terms of latent representation. 
First, if there are N filters at each location and these filters are stepped by one pixel then the internal representation is about N times overcomplete. The internal representation not only has a high computational cost, but it is also highly redundant. Since the input is mostly smooth and the parameters are the same across space, the latent variables are strongly correlated as well. This inefficiency turns out to be particularly harmful for a model like PoT, causing the learned filters to become “random” looking (see fig. 3-iii). A simple intuition follows from the equivalence between PoT and square ICA [15]. If the filter matrix C of eq. 1 is square and invertible, we can marginalize out the latent variables and write: $p(y) = \prod_i S(y_i)$, where $y_i = C_i^T x$ and S is a Student’s t distribution. In other words, there is an underlying assumption that filter outputs are independent. However, if the filters of matrix C are shifted and overlapping versions of each other, this clearly cannot be true. Training PoT with the Conv weight-sharing scheme forces the model to find filters that make filter outputs as independent as possible, which explains the very high-frequency patterns that are usually discovered [6].

Local: The Global and Conv weight-sharing schemes are at the two extremes of a spectrum of possibilities. For instance, we can define potentials on a small subset of input variables but, unlike Conv, each potential can have its own set of parameters, as shown in fig. 2-B. This is called local, or inhomogeneous field. Compared to Conv the number of parameters increases only slightly but the number of latent variables required and their redundancy is greatly reduced. In fact, the model learns different receptive fields at different locations as a better strategy for representing the input, especially when the number of potentials is limited (see also fig. 2-F). 
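The overcompleteness argument above can be checked with a little arithmetic. The sketch below counts latent variables for square filters stepped across a square image and divides by the number of pixels. The helper names `num_latents` and `overcompleteness`, and the choice of 16 filters per location, are hypothetical; the 160-pixel image side and 8-pixel filter side are borrowed from the experiments purely as convenient numbers.

```python
def num_latents(image_side, filter_side, stride, filters_per_location):
    """Count latent variables when square filters are stepped across a square image."""
    positions_per_axis = (image_side - filter_side) // stride + 1
    return filters_per_location * positions_per_axis ** 2

def overcompleteness(image_side, filter_side, stride, filters_per_location):
    """Ratio of the number of latent variables to the number of input pixels."""
    latents = num_latents(image_side, filter_side, stride, filters_per_location)
    return latents / image_side ** 2

# Filters stepped by one pixel: the ratio approaches the number of
# filters per location (slightly less because of border effects).
conv_ratio = overcompleteness(160, 8, 1, 16)

# The same filters stepped by their own width (no overlap between copies):
# the ratio drops by roughly the filter area.
tiled_ratio = overcompleteness(160, 8, 8, 16)

print(conv_ratio, tiled_ratio)
```

Under these assumed numbers the one-pixel stride gives a representation many times larger than the input, while the non-overlapping stride gives one smaller than the input.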
TConv: Local would not allow the model to be trained and tested on images of different resolution, and it might seem wasteful not to exploit the translation invariant property of images. We therefore advocate the use of a weight-sharing scheme that we call tiled-convolutional (TConv) shown in fig. 2-C and E [18]. Each filter tiles the image without overlaps with copies of itself (i.e. the stride equals the filter diameter). This reduces spatial redundancy of latent variables and allows the input images to have arbitrary size. At the same time, different filters do overlap with each other in order to avoid tiling artifacts. Fig. 2-F shows filters that were (jointly) learned by a Restricted Boltzmann Machine (RBM) [29] with Gaussian input variables using the TConv weight-sharing scheme.

4 Experiments

We train gated MRF’s with and without mean hidden units using different weight-sharing schemes. The training procedure is very similar in all cases. We perform approximate maximum likelihood by using Fast Persistent Contrastive Divergence (FPCD) [25] and we draw samples by using Hybrid Monte Carlo (HMC) [26]. Since all latent variables can be exactly marginalized out we can use HMC on the free energy (negative logarithm of the marginal distribution over the input pixels). For mPoT this is:

$$F^{mPoT}(x) = -\log p(x) + \mathrm{const.} = \sum_{k,i} \gamma \log\left(1 + \tfrac{1}{2} (C_{ik}^T x_k)^2\right) + \tfrac{1}{2} x^T x - \sum_{k,j} \log\left(1 + \exp(W_{jk}^T x_k)\right) \qquad (6)$$

where the index k runs over spatial locations and $x_k$ is the k-th image patch. FPCD keeps samples, called negative particles, that it uses to represent the model distribution. These particles are all updated after each weight update. For each mini-batch of data-points a) we compute the derivative of the free energy w.r.t. the training samples, b) we update the negative particles by running HMC for one HMC step consisting of 20 leapfrog steps. 
We start at the previous set of negative particles and use as parameters the sum of the regular parameters and a small perturbation vector, c) we compute the derivative of the free energy at the negative particles, and d) we update the regular parameters by using the difference of gradients between steps a) and c) while the perturbation vector is updated using the gradient from c) only. The perturbation is also strongly decayed to zero and is subject to a larger learning rate. The aim is to encourage the negative particles to explore the space more quickly by slightly and temporarily raising the energy at their current position. Note that the use of FPCD as opposed to other estimation methods (like Persistent Contrastive Divergence [27]) turns out to be crucial to achieve good mixing of the sampler even after training. We train on mini-batches of 32 samples using gray-scale images of approximate size 160x160 pixels randomly cropped from the Berkeley segmentation dataset [28]. We perform 160,000 weight updates decreasing the learning rate by a factor of 4 by the end of training. The initial learning rate is set to 0.1 for the covariance filters (matrix C of eq. 1), 0.01 for the mean parameters (matrix W of eq. 4), and 0.001 for the other parameters (γ of eq. 1).

Figure 3: 160x160 samples drawn by A) mPoT-TConv, B) mHPoT-TConv, C) mcRBM-TConv and D) PoT-TConv. On the side also i) a subset of 8x8 “covariance” filters learned by mPoT-TConv (the plot below shows how the whole set of filters tile a small patch; each bar corresponds to a Gabor fit of a filter and colors identify filters applied at the same 8x8 location, each group is shifted by 2 pixels down the diagonal and a high-resolution image is tiled by replicating this pattern every 8 pixels horizontally and vertically), ii) a subset of 8x8 “mean” filters learned by the same mPoT-TConv, iii) filters learned by PoT-Conv and iv) by PoT-TConv.
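To make the sampling step more concrete, here is a minimal NumPy sketch of the mPoT free energy for a single patch (the sum over spatial locations k is dropped), its gradient with respect to the input, and one HMC transition built from 20 leapfrog steps. This is an illustrative sketch and not the code used in the experiments: the function names, the step size, the value of gamma and the random filters are all assumptions, and the fast-weight perturbation of FPCD is omitted.

```python
import numpy as np

def free_energy(x, C, W, gamma):
    """Single-patch mPoT free energy (filters are the columns of C and W)."""
    y = C.T @ x                                   # covariance filter outputs
    cov_term = gamma * np.sum(np.log1p(0.5 * y ** 2))
    mean_term = np.sum(np.logaddexp(0.0, W.T @ x))  # stable log(1 + exp(.))
    return cov_term + 0.5 * x @ x - mean_term

def free_energy_grad(x, C, W, gamma):
    """Analytic gradient of free_energy with respect to x."""
    y = C.T @ x
    a = W.T @ x
    sigmoid = 0.5 * (1.0 + np.tanh(0.5 * a))      # numerically stable logistic
    return C @ (gamma * y / (1.0 + 0.5 * y ** 2)) + x - W @ sigmoid

def hmc_step(x, C, W, gamma, rng, step_size=0.01, n_leapfrog=20):
    """One HMC transition: leapfrog integration plus Metropolis accept/reject."""
    p0 = rng.normal(size=x.shape)                 # fresh Gaussian momentum
    x_new, p = x.copy(), p0.copy()
    p -= 0.5 * step_size * free_energy_grad(x_new, C, W, gamma)  # half step
    for i in range(n_leapfrog):
        x_new += step_size * p                    # full position step
        if i < n_leapfrog - 1:
            p -= step_size * free_energy_grad(x_new, C, W, gamma)
    p -= 0.5 * step_size * free_energy_grad(x_new, C, W, gamma)  # final half step
    h_old = free_energy(x, C, W, gamma) + 0.5 * p0 @ p0
    h_new = free_energy(x_new, C, W, gamma) + 0.5 * p @ p
    accept = np.log(rng.uniform()) < h_old - h_new
    return (x_new if accept else x), accept

# Toy usage with hypothetical sizes and random (untrained) filters.
rng = np.random.default_rng(0)
D = 16
C = rng.normal(size=(D, 32))
W = rng.normal(size=(D, 8))
gamma = 2.0                                       # assumed value, for illustration
x = rng.normal(size=D)
x_next, accepted = hmc_step(x, C, W, gamma, rng)
```

In an FPCD-style loop, a step like `hmc_step` would be applied to every negative particle after each weight update, using the regular parameters plus the perturbation vector described above.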
During training we condition on the borders and initialize the negative particles at zero in order to avoid artifacts at the border of the image. We learn 8x8 filters and pre-multiply the covariance filters by a whitening transform retaining 99% of the variance; we also normalize the norm of the covariance filters to prevent some of them from decaying to zero during training (see footnote 4). Whenever we use the TConv weight-sharing scheme the model learns covariance filters that mostly resemble localized and oriented Gabor functions (see fig. 3-i and iv), while the Conv weight-sharing scheme learns structured but poorly localized high-frequency patterns (see fig. 3-iii) [6]. The TConv models re-use the same 8x8 filters every 8 pixels and apply a diagonal offset of 2 pixels between neighboring filters with different weights in order to reduce tiling artifacts. There are 4 sets of filters, each with 64 filters for a total of 256 covariance filters (see bottom plot of fig. 3). Similarly, we have 4 sets of mean filters, each with 32 filters. These filters usually have non-zero mean and exhibit on-center off-surround and off-center on-surround patterns, see fig. 3-ii. In order to draw samples from the learned models, we run HMC for a long time (10,000 iterations, each composed of 20 leap-frog steps). Some samples of size 160x160 pixels are reported in fig. 3 A)-D). Without modelling the mean intensity, samples lack structure and do not seem much different from those that would be generated by a simple Gaussian model merely fitting the second order statistics (see fig. 3 in [1] and also fig. 2 in [7]). By contrast, structure, sharp boundaries and some simple texture emerge only from models that have mean latent variables, namely mcRBM, mPoT and mHPoT, which differs from mPoT by having a second layer pooling matrix on the squared covariance filter outputs [11]. A more quantitative comparison is reported in table 1. 
We first compute marginal statistics of filter responses using the generated images, natural images from the test set, and random images. The statistics are the normalized histogram of individual filter responses to 24 Gabor filters (8 orientations and 3 scales). We then calculate the KL divergence between the histograms on random images and generated images and the KL divergence between the histograms on natural images and generated images. The table also reports the average difference of energies between random images and natural images. All results demonstrate that models that account for mean intensity generate images that are closer to natural images than to random images, whereas models that do not account for the mean (like the widely used PoT-Conv) produce samples that are actually closer to random images.

MODEL            F(R) − F(T) (10^4)    KL(R‖G)    KL(T‖G)    KL(R‖G) − KL(T‖G)
PoT - Conv              2.9              0.3        0.6            -0.3
PoT - TConv             2.8              0.4        1.0            -0.6
mPoT - TConv            5.2              1.0        0.2             0.8
mHPoT - TConv           4.9              1.7        0.8             0.9
mcRBM - TConv           3.5              1.5        1.0             0.5

Table 1: Comparing MRF’s by measuring: difference of energy (negative log ratio of probabilities) between random images (R) and test natural images (T), the KL divergence between statistics of random images (R) and generated images (G), KL divergence between statistics of test natural images (T) and generated images (G), and difference of these two KL divergences. Statistics are computed using 24 Gabor filters.

Footnote 4: The code used in the experiments can be found at the first author’s web-page.

4.1 Discriminative Experiments on Weight-Sharing Schemes

In future work, we intend to use the features discovered by the generative model for recognition. To understand how the different weight sharing schemes affect recognition performance we have done preliminary tests using the discriminative performance of a simpler model on simpler data. We consider one of the simplest and most versatile models, namely the RBM [29]. 
Since we also aim to test the Global weight-sharing scheme we are constrained to using fairly low resolution datasets such as the MNIST dataset of handwritten digits [30] and the CIFAR 10 dataset of generic object categories [22]. The MNIST dataset has soft binary images of size 28x28 pixels, while the CIFAR 10 dataset has color images of size 32x32 pixels. CIFAR 10 has 10 classes, 5000 training samples per class and 1000 test samples per class. MNIST also has 10 classes with, on average, 6000 training samples per class and 1000 test samples per class. The energy function of the RBM trained on the CIFAR 10 dataset, modelling input pixels with 3 (R,G,B) Gaussian variables [31], is exactly the one shown in eq. 4; while the RBM trained on MNIST uses logistic units for the pixels and the energy function is again the same as before but without any quadratic term. All models are trained in an unsupervised way to approximately maximize the likelihood in the training set using Contrastive Divergence [32]. They are then used to represent each input image with a feature vector (mean of the posterior over the latent variables) which is fed to a multinomial logistic classifier for discrimination. Models are compared in terms of: 1) recognition accuracy, 2) convergence time and 3) dimensionality of the representation. In general, assuming filters much smaller than the input image and assuming equal number of latent variables, Conv, TConv and Local models process each sample faster than Global by a factor approximately equal to the ratio between the area of the image and the area of the filters, which can be very large in practice. In the first set of experiments reported on the left of fig. 4 we study the internal representation in terms of discrimination and dimensionality using the MNIST dataset. For each choice of dimensionality all models are trained using the same number of operations. 
This is set to the amount necessary to complete one epoch over the training set using the Global model. This experiment shows that: 1) Local outperforms all other weight-sharing schemes for a wide range of dimensionalities, 2) TConv does not perform as well as Local probably because the translation invariant assumption is clearly violated for these relatively small, centered, images, 3) Conv performs well only when the internal representation is very high dimensional (10 times overcomplete) otherwise it severely underfits, 4) Global performs well when the representation is compact but its performance degrades rapidly as this increases because it needs more than the allotted training time. The right hand side of fig. 4 shows how the recognition performance evolves as we increase the number of operations (or training time) using models that produce a twice overcomplete internal representation. With only very few filters Conv still underfits and it does not improve its performance by training for longer, but Global does improve and eventually it reaches the performance of Local. If we look at the crossing of the error rate at 2% we can see that Local is about 4 times faster than Global.

Figure 4: Experiments on MNIST using RBM’s with different weight-sharing schemes. Left: Error rate (%) as a function of the dimensionality of the latent representation, for Global, Local, TConv and Conv. Right: Error rate (%) as a function of the number of operations (# flops, normalized to those needed to perform one epoch in the Global model), for Global, Local and Conv; all models have a twice overcomplete latent representation.

To summarize, Local provides more compact representations than Conv, is much faster than Global while achieving similar performance in discrimination. 
Also, Local can easily scale to larger images while Global cannot. Similar experiments are performed using the CIFAR 10 dataset [22] of natural images. Using the same protocol introduced in earlier work by Krizhevsky [22], the RBM’s are trained in an unsupervised way on a subset of the 80 million tiny images dataset [33] and then “fine-tuned” on the CIFAR 10 dataset by supervised back-propagation of the error through the linear classifier and feature extractor. All models produce an approximately 10,000 dimensional internal representation to make a fair comparison. Models using local filters learn 16x16 filters that are stepped every pixel. Again, we do not experiment with the TConv weight-sharing scheme because the image is not large enough to allow enough replicas. Similarly to fig. 3-iii the Conv weight-sharing scheme was very difficult to train and did not produce Gabor-like features. Indeed, careful injection of sparsity and long training time seem necessary [31] for these RBM’s. By contrast, both Local and Global produce Gabor-like filters similar to those shown in fig. 2 F). The model trained with Conv weight-sharing scheme yields an accuracy equal to 56.6%, while Local and Global yield much better performance, 63.6% and 64.8% [22], respectively. Although Local and Global have similar performance, training with the Local weight-sharing scheme took under an hour while using the Global weight-sharing scheme required more than a day.

5 Conclusions and Future Work

This work is motivated by the poor generative quality of currently popular MRF models of natural images. These models generate images that are actually more similar to white noise than to natural images. Our contribution is to recognize that current models can benefit from 1) the addition of a simple model of the mean intensities and from 2) the use of a less constrained weight-sharing scheme. 
By augmenting these models with an extra set of latent variables that model mean intensity we can generate samples that look much more realistic: they are characterized by smooth regions, sharp boundaries and some simple high frequency texture. We validate our approach by comparing the statistics of filter outputs on natural images and generated images. In the future, we plan to integrate these MRF’s into deeper hierarchical models and to use their internal representation to perform object recognition in high-resolution images. The hope is to further improve generation by capturing longer range dependencies and to exploit this to better cope with missing values and ambiguous sensory inputs.

References
[1] E.P. Simoncelli. Statistical modeling of photographic images. Handbook of Image and Video Processing, pages 431–441, 2005.
[2] A. Hyvarinen, J. Karhunen, and E. Oja. Independent Component Analysis. John Wiley & Sons, 2001.
[3] G.E. Hinton and R.R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[4] M. Ranzato and G.E. Hinton. Modeling pixel means and covariances using factorized third-order Boltzmann machines. In CVPR, 2010.
[5] M.J. Wainwright and E.P. Simoncelli. Scale mixtures of Gaussians and the statistics of natural images. In NIPS, 2000.
[6] S. Roth and M.J. Black. Fields of experts: A framework for learning image priors. In CVPR, 2005.
[7] U. Schmidt, Q. Gao, and S. Roth. A generative perspective on MRFs in low-level vision. In CVPR, 2010.
[8] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. PAMI, 6:721–741, 1984.
[9] M. Welling, G.E. Hinton, and S. Osindero. Learning sparse topographic representations with products of student-t distributions. In NIPS, 2003.
[10] S.C. Zhu and D. Mumford. Prior learning and Gibbs reaction diffusion. PAMI, pages 1236–1250, 1997.
[11] S. Osindero, M. Welling, and G.E. Hinton. Topographic product models applied to natural scene statistics. Neural Comp., 18:344–381, 2006.
[12] S. Osindero and G.E. Hinton. Modeling image patches with a directed hierarchy of Markov random fields. In NIPS, 2008.
[13] Y. Karklin and M.S. Lewicki. Emergence of complex cell properties by learning to generalize in natural scenes. Nature, 457:83–86, 2009.
[14] B.A. Olshausen and D.J. Field. Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision Research, 37:3311–3325, 1997.
[15] Y.W. Teh, M. Welling, S. Osindero, and G.E. Hinton. Energy-based models for sparse overcomplete representations. JMLR, 4:1235–1260, 2003.
[16] Y. Weiss and W.T. Freeman. What makes a good model of natural images? In CVPR, 2007.
[17] S. Roth and M.J. Black. Fields of experts. Int. Journal of Computer Vision, 82:205–229, 2009.
[18] K. Gregor and Y. LeCun. Emergence of complex-like cells in a temporal product network with local receptive fields. arXiv:1006.0448, 2010.
[19] C. Tang and C. Eliasmith. Deep networks for robust visual recognition. In ICML, 2010.
[20] M. Ranzato, A. Krizhevsky, and G.E. Hinton. Factored 3-way restricted Boltzmann machines for modeling natural images. In AISTATS, 2010.
[21] N. Heess, C.K.I. Williams, and G.E. Hinton. Learning generative texture models with extended fields-of-experts. In BMVC, 2009.
[22] A. Krizhevsky. Learning multiple layers of features from tiny images, 2009. MSc Thesis, Dept. of Comp. Science, Univ. of Toronto.
[23] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang. Phoneme recognition using time-delay neural networks. IEEE Acoustics Speech and Signal Proc., 37:328–339, 1989.
[24] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[25] T. Tieleman and G.E. Hinton. Using fast weights to improve persistent contrastive divergence. In ICML, 2009.
[26] R.M. Neal. Bayesian learning for neural networks. Springer-Verlag, 1996.
[27] T. Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In ICML, 2008.
[28] http://www.cs.berkeley.edu/projects/vision/grouping/segbench/.
[29] M. Welling, M. Rosen-Zvi, and G.E. Hinton. Exponential family harmoniums with an application to information retrieval. In NIPS, 2005.
[30] http://yann.lecun.com/exdb/mnist/.
[31] H. Lee, R. Grosse, R. Ranganath, and A.Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proc. ICML, 2009.
[32] G.E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771–1800, 2002.
[33] A. Torralba, R. Fergus, and W.T. Freeman. 80 million tiny images: a large dataset for non-parametric object and scene recognition. PAMI, 30:1958–1970, 2008.
5 0.12775934 101 nips-2010-Gaussian sampling by local perturbations
Author: George Papandreou, Alan L. Yuille
Abstract: We present a technique for exact simulation of Gaussian Markov random fields (GMRFs), which can be interpreted as locally injecting noise to each Gaussian factor independently, followed by computing the mean/mode of the perturbed GMRF. Coupled with standard iterative techniques for the solution of symmetric positive definite systems, this yields a very efficient sampling algorithm with essentially linear complexity in terms of speed and memory requirements, well suited to extremely large scale probabilistic models. Apart from synthesizing data under a Gaussian model, the proposed technique directly leads to an efficient unbiased estimator of marginal variances. Beyond Gaussian models, the proposed algorithm is also very useful for handling highly non-Gaussian continuously-valued MRFs such as those arising in statistical image modeling or in the first layer of deep belief networks describing real-valued data, where the non-quadratic potentials coupling different sites can be represented as finite or infinite mixtures of Gaussians with the help of local or distributed latent mixture assignment variables. The Bayesian treatment of such models most naturally involves a block Gibbs sampler which alternately draws samples of the conditionally independent latent mixture assignments and the conditionally multivariate Gaussian continuous vector and we show that it can directly benefit from the proposed methods.
6 0.11996444 99 nips-2010-Gated Softmax Classification
7 0.11484646 271 nips-2010-Tiled convolutional neural networks
8 0.11139044 156 nips-2010-Learning to combine foveal glimpses with a third-order Boltzmann machine
9 0.10545376 281 nips-2010-Using body-anchored priors for identifying actions in single images
10 0.092473872 141 nips-2010-Layered image motion with explicit occlusions, temporal consistency, and depth ordering
11 0.091470957 28 nips-2010-An Alternative to Low-level-Sychrony-Based Methods for Speech Detection
12 0.086979583 61 nips-2010-Direct Loss Minimization for Structured Prediction
13 0.07546965 111 nips-2010-Hallucinations in Charles Bonnet Syndrome Induced by Homeostasis: a Deep Boltzmann Machine Model
14 0.075251706 128 nips-2010-Infinite Relational Modeling of Functional Connectivity in Resting State fMRI
15 0.07404431 272 nips-2010-Towards Holistic Scene Understanding: Feedback Enabled Cascaded Classification Models
16 0.070263237 143 nips-2010-Learning Convolutional Feature Hierarchies for Visual Recognition
17 0.069616668 59 nips-2010-Deep Coding Network
18 0.06367062 8 nips-2010-A Log-Domain Implementation of the Diffusion Network in Very Large Scale Integration
19 0.063302085 209 nips-2010-Pose-Sensitive Embedding by Nonlinear NCA Regression
20 0.061385587 224 nips-2010-Regularized estimation of image statistics by Score Matching
topicId topicWeight
[(0, 0.167), (1, 0.061), (2, -0.135), (3, -0.037), (4, -0.016), (5, -0.021), (6, 0.009), (7, 0.046), (8, -0.112), (9, 0.057), (10, -0.01), (11, -0.08), (12, 0.142), (13, -0.166), (14, -0.129), (15, -0.019), (16, 0.021), (17, -0.0), (18, -0.088), (19, -0.167), (20, -0.045), (21, 0.226), (22, 0.012), (23, 0.128), (24, 0.141), (25, 0.118), (26, 0.057), (27, -0.036), (28, 0.068), (29, -0.041), (30, -0.095), (31, -0.117), (32, 0.131), (33, 0.09), (34, 0.064), (35, 0.007), (36, 0.044), (37, -0.011), (38, -0.011), (39, -0.025), (40, -0.001), (41, -0.036), (42, -0.031), (43, -0.112), (44, 0.002), (45, -0.022), (46, -0.014), (47, 0.05), (48, 0.032), (49, -0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.95404166 206 nips-2010-Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine
Author: George Dahl, Marc'aurelio Ranzato, Abdel-rahman Mohamed, Geoffrey E. Hinton
Abstract: Straightforward application of Deep Belief Nets (DBNs) to acoustic modeling produces a rich distributed representation of speech data that is useful for recognition and yields impressive results on the speaker-independent TIMIT phone recognition task. However, the first-layer Gaussian-Bernoulli Restricted Boltzmann Machine (GRBM) has an important limitation, shared with mixtures of diagonal-covariance Gaussians: GRBMs treat different components of the acoustic input vector as conditionally independent given the hidden state. The mean-covariance restricted Boltzmann machine (mcRBM), first introduced for modeling natural images, is a much more representationally efficient and powerful way of modeling the covariance structure of speech data. Every configuration of the precision units of the mcRBM specifies a different precision matrix for the conditional distribution over the acoustic space. In this work, we use the mcRBM to learn features of speech data that serve as input into a standard DBN. The mcRBM features combined with DBNs allow us to achieve a phone error rate of 20.5%, which is superior to all published results on speaker-independent TIMIT to date.
2 0.81997156 207 nips-2010-Phoneme Recognition with Large Hierarchical Reservoirs
Author: Fabian Triefenbach, Azarakhsh Jalalvand, Benjamin Schrauwen, Jean-pierre Martens
Abstract: Automatic speech recognition has gradually improved over the years, but the reliable recognition of unconstrained speech is still not within reach. In order to achieve a breakthrough, many research groups are now investigating new methodologies that have potential to outperform the Hidden Markov Model technology that is at the core of all present commercial systems. In this paper, it is shown that the recently introduced concept of Reservoir Computing might form the basis of such a methodology. In a limited amount of time, a reservoir system that can recognize the elementary sounds of continuous speech has been built. The system already achieves a state-of-the-art performance, and there is evidence that the margin for further improvements is still significant.
3 0.75353408 271 nips-2010-Tiled convolutional neural networks
Author: Jiquan Ngiam, Zhenghao Chen, Daniel Chia, Pang W. Koh, Quoc V. Le, Andrew Y. Ng
Abstract: Convolutional neural networks (CNNs) have been successfully applied to many tasks such as digit and object recognition. Using convolutional (tied) weights significantly reduces the number of parameters that have to be learned, and also allows translational invariance to be hard-coded into the architecture. In this paper, we consider the problem of learning invariances, rather than relying on hardcoding. We propose tiled convolution neural networks (Tiled CNNs), which use a regular “tiled” pattern of tied weights that does not require that adjacent hidden units share identical weights, but instead requires only that hidden units k steps away from each other have tied weights. By pooling over neighboring units, this architecture is able to learn complex invariances (such as scale and rotational invariance) beyond translational invariance. Further, it also enjoys much of CNNs’ advantage of having a relatively small number of learned parameters (such as ease of learning and greater scalability). We provide an efficient learning algorithm for Tiled CNNs based on Topographic ICA, and show that learning complex invariant features allows us to achieve highly competitive results for both the NORB and CIFAR-10 datasets.
4 0.73790395 140 nips-2010-Layer-wise analysis of deep networks with Gaussian kernels
Author: Grégoire Montavon, Klaus-Robert Müller, Mikio L. Braun
Abstract: Deep networks can potentially express a learning problem more efficiently than local learning machines. While deep networks outperform local learning machines on some problems, it is still unclear how their nice representation emerges from their complex structure. We present an analysis based on Gaussian kernels that measures how the representation of the learning problem evolves layer after layer as the deep network builds higher-level abstract representations of the input. We use this analysis to show empirically that deep networks build progressively better representations of the learning problem and that the best representations are obtained when the deep network discriminates only in the last layers.
5 0.68480831 111 nips-2010-Hallucinations in Charles Bonnet Syndrome Induced by Homeostasis: a Deep Boltzmann Machine Model
Author: Peggy Series, David P. Reichert, Amos J. Storkey
Abstract: The Charles Bonnet Syndrome (CBS) is characterized by complex vivid visual hallucinations in people with, primarily, eye diseases and no other neurological pathology. We present a Deep Boltzmann Machine model of CBS, exploring two core hypotheses: First, that the visual cortex learns a generative or predictive model of sensory input, thus explaining its capability to generate internal imagery. And second, that homeostatic mechanisms stabilize neuronal activity levels, leading to hallucinations being formed when input is lacking. We reproduce a variety of qualitative findings in CBS. We also introduce a modification to the DBM that allows us to model a possible role of acetylcholine in CBS as mediating the balance of feed-forward and feed-back processing. Our model might provide new insights into CBS and also demonstrates that generative frameworks are promising as hypothetical models of cortical learning and perception.
6 0.59778643 28 nips-2010-An Alternative to Low-level-Sychrony-Based Methods for Speech Detection
7 0.59221435 156 nips-2010-Learning to combine foveal glimpses with a third-order Boltzmann machine
8 0.5816344 99 nips-2010-Gated Softmax Classification
9 0.51136345 188 nips-2010-On Herding and the Perceptron Cycling Theorem
10 0.47609633 61 nips-2010-Direct Loss Minimization for Structured Prediction
11 0.47140616 31 nips-2010-An analysis on negative curvature induced by singularity in multi-layer neural-network learning
12 0.46418229 103 nips-2010-Generating more realistic images using gated MRF's
13 0.42289627 101 nips-2010-Gaussian sampling by local perturbations
14 0.39876074 143 nips-2010-Learning Convolutional Feature Hierarchies for Visual Recognition
15 0.38083789 209 nips-2010-Pose-Sensitive Embedding by Nonlinear NCA Regression
16 0.37127939 215 nips-2010-Probabilistic Deterministic Infinite Automata
17 0.36211312 251 nips-2010-Sphere Embedding: An Application to Part-of-Speech Induction
18 0.3572098 125 nips-2010-Inference and communication in the game of Password
19 0.35523692 141 nips-2010-Layered image motion with explicit occlusions, temporal consistency, and depth ordering
20 0.35209835 281 nips-2010-Using body-anchored priors for identifying actions in single images
topicId topicWeight
[(0, 0.013), (13, 0.032), (17, 0.025), (27, 0.09), (30, 0.036), (35, 0.017), (45, 0.19), (50, 0.086), (52, 0.055), (59, 0.292), (60, 0.022), (77, 0.033), (78, 0.014), (90, 0.03)]
simIndex simValue paperId paperTitle
1 0.78855646 4 nips-2010-A Computational Decision Theory for Interactive Assistants
Author: Alan Fern, Prasad Tadepalli
Abstract: We study several classes of interactive assistants from the points of view of decision theory and computational complexity. We first introduce a class of POMDPs called hidden-goal MDPs (HGMDPs), which formalize the problem of interactively assisting an agent whose goal is hidden and whose actions are observable. In spite of its restricted nature, we show that optimal action selection in finite horizon HGMDPs is PSPACE-complete even in domains with deterministic dynamics. We then introduce a more restricted model called helper action MDPs (HAMDPs), where the assistant’s action is accepted by the agent when it is helpful, and can be easily ignored by the agent otherwise. We show classes of HAMDPs that are complete for PSPACE and NP along with a polynomial time class. Furthermore, we show that for general HAMDPs a simple myopic policy achieves a regret, compared to an omniscient assistant, that is bounded by the entropy of the initial goal distribution. A variation of this policy is shown to achieve worst-case regret that is logarithmic in the number of goals for any goal distribution.
same-paper 2 0.77068758 206 nips-2010-Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine
3 0.71756816 254 nips-2010-Stability Approach to Regularization Selection (StARS) for High Dimensional Graphical Models
Author: Han Liu, Kathryn Roeder, Larry Wasserman
Abstract: A challenging problem in estimating high-dimensional graphical models is to choose the regularization parameter in a data-dependent way. The standard techniques include K-fold cross-validation (K-CV), Akaike information criterion (AIC), and Bayesian information criterion (BIC). Though these methods work well for low-dimensional problems, they are not suitable in high dimensional settings. In this paper, we present StARS: a new stability-based method for choosing the regularization parameter in high dimensional inference for undirected graphs. The method has a clear interpretation: we use the least amount of regularization that simultaneously makes a graph sparse and replicable under random sampling. This interpretation requires essentially no conditions. Under mild conditions, we show that StARS is partially sparsistent in terms of graph estimation: i.e. with high probability, all the true edges will be included in the selected model even when the graph size diverges with the sample size. Empirically, the performance of StARS is compared with the state-of-the-art model selection procedures, including K-CV, AIC, and BIC, on both synthetic data and a real microarray dataset. StARS outperforms all these competing procedures.
4 0.68983632 83 nips-2010-Evidence-Specific Structures for Rich Tractable CRFs
Author: Anton Chechetka, Carlos Guestrin
Abstract: We present a simple and effective approach to learning tractable conditional random fields with structure that depends on the evidence. Our approach retains the advantages of tractable discriminative models, namely efficient exact inference and arbitrarily accurate parameter learning in polynomial time. At the same time, our algorithm does not suffer a large expressive power penalty inherent to fixed tractable structures. On real-life relational datasets, our approach matches or exceeds the state-of-the-art accuracy of the dense models, and at the same time provides an order of magnitude speedup.
5 0.66761857 63 nips-2010-Distributed Dual Averaging In Networks
Author: Alekh Agarwal, Martin J. Wainwright, John C. Duchi
Abstract: The goal of decentralized optimization over a network is to optimize a global objective formed by a sum of local (possibly nonsmooth) convex functions using only local computation and communication. We develop and analyze distributed algorithms based on dual averaging of subgradients, and provide sharp bounds on their convergence rates as a function of the network size and topology. Our analysis clearly separates the convergence of the optimization algorithm itself from the effects of communication constraints arising from the network structure. We show that the number of iterations required by our algorithm scales inversely in the spectral gap of the network. The sharpness of this prediction is confirmed both by theoretical lower bounds and simulations for various networks.
6 0.64638978 207 nips-2010-Phoneme Recognition with Large Hierarchical Reservoirs
7 0.62519485 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
8 0.6229468 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
9 0.62184 96 nips-2010-Fractionally Predictive Spiking Neurons
10 0.62143409 109 nips-2010-Group Sparse Coding with a Laplacian Scale Mixture Prior
11 0.61814994 21 nips-2010-Accounting for network effects in neuronal responses using L1 regularized point process models
12 0.61422664 17 nips-2010-A biologically plausible network for the computation of orientation dominance
13 0.61313999 44 nips-2010-Brain covariance selection: better individual functional connectivity models using population prior
14 0.61247587 98 nips-2010-Functional form of motion priors in human motion perception
15 0.60949194 158 nips-2010-Learning via Gaussian Herding
16 0.60922378 194 nips-2010-Online Learning for Latent Dirichlet Allocation
17 0.60901225 161 nips-2010-Linear readout from a neural population with partial correlation data
18 0.60874307 55 nips-2010-Cross Species Expression Analysis using a Dirichlet Process Mixture Model with Latent Matchings
19 0.6087302 103 nips-2010-Generating more realistic images using gated MRF's
20 0.60796309 242 nips-2010-Slice sampling covariance hyperparameters of latent Gaussian models