nips nips2007 nips2007-132 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Simon Osindero, Geoffrey E. Hinton
Abstract: We describe an efficient learning procedure for multilayer generative models that combine the best aspects of Markov random fields and deep, directed belief nets. The generative models can be learned one layer at a time and when learning is complete they have a very fast inference procedure for computing a good approximation to the posterior distribution in all of the hidden layers. Each hidden layer has its own MRF whose energy function is modulated by the top-down directed connections from the layer above. To generate from the model, each layer in turn must settle to equilibrium given its top-down input. We show that this type of model is good at capturing the statistics of patches of natural images. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Modeling image patches with a directed hierarchy of Markov random fields Simon Osindero and Geoffrey Hinton Department of Computer Science, University of Toronto 6, King’s College Road, M5S 3G4, Canada osindero,hinton@cs. [sent-1, score-0.306]
2 edu Abstract We describe an efficient learning procedure for multilayer generative models that combine the best aspects of Markov random fields and deep, directed belief nets. [sent-3, score-0.372]
3 The generative models can be learned one layer at a time and when learning is complete they have a very fast inference procedure for computing a good approximation to the posterior distribution in all of the hidden layers. [sent-4, score-0.659]
4 Each hidden layer has its own MRF whose energy function is modulated by the top-down directed connections from the layer above. [sent-5, score-1.474]
5 To generate from the model, each layer in turn must settle to equilibrium given its top-down input. [sent-6, score-0.549]
6 We show that this type of model is good at capturing the statistics of patches of natural images. [sent-7, score-0.044]
7 1 Introduction The soldiers on a parade ground form a neat rectangle by interacting with their neighbors. [sent-8, score-0.268]
8 An officer decides where the rectangle should be, but he would be ill-advised to try to tell each individual soldier exactly where to stand. [sent-9, score-0.149]
9 By allowing constraints to be enforced by local interactions, the officer enormously reduces the bandwidth of top-down communication required to generate a familiar pattern. [sent-10, score-0.079]
10 Instead of micro-managing the soldiers, the officer specifies an objective function and leaves it to the soldiers to optimise that function. [sent-11, score-0.167]
11 This example of pattern generation suggests that a multilayer, directed belief net may not be the most effective way to generate patterns. [sent-12, score-0.256]
12 Instead of using shared ancestors to create correlations between the variables within a layer, it may be more efficient for each layer to have its own energy function that is modulated by directed, top-down input from the layer above. [sent-13, score-1.081]
13 Given the top-down input, each layer can then use lateral interactions to settle on a good configuration and this configuration can then provide the top-down input for the next layer down. [sent-14, score-0.985]
14 In this paper, we show that recently developed techniques for learning deep belief nets (DBN’s) can be generalized to solve the apparently more difficult problem of learning a directed hierarchy of Markov Random Fields (MRF’s). [sent-16, score-0.451]
15 The method we describe can learn models that have many hidden layers, each with its own MRF whose energy function is conditional on the values of the variables in the layer above. [sent-17, score-0.821]
16 It does not require detailed prior knowledge about the data to be modeled, though it obviously works better if the architecture and the types of latent variable are well matched to the task. [sent-18, score-0.024]
17 1 2 Learning deep belief nets: An overview The learning procedure for deep belief nets has now been described in several places (Hinton et al. [sent-19, score-0.424]
18 1 Restricted Boltzmann Machines An RBM consists of a layer of binary stochastic “visible” units connected to a layer of binary, stochastic “hidden” units via symmetrically weighted connections. [sent-24, score-1.465]
19 Given a training vector, v, the binary state, hj , of each feature detector, j, is set to 1 with probability σ(bj + i vi wij ), where σ(x) is the logistic function 1/(1 + exp(−x)), bj is the bias of j, vi is the state of visible unit i, and wij is the weight between i and j. [sent-27, score-1.658]
20 Once binary states have been chosen for the hidden units, a reconstruction is produced by setting each vi to 1 with probability σ(bi + j hj wij ). [sent-28, score-1.011]
21 The states of the hidden units are then updated once more so that they represent features of the reconstruction. [sent-29, score-0.513]
22 The learning works well even though it is not exactly following the gradient of the log probability of the training data (Hinton, 2002). [sent-32, score-0.025]
23 2 Compositions of experts A single layer of binary features is usually not the best way to capture the structure in the data. [sent-34, score-0.463]
24 We now show how RBM’S can be composed to create much more powerful, multilayer models. [sent-35, score-0.152]
25 After using an RBM to learn the first layer of hidden features we have an undirected model that defines p(v, h) via the energy function in Eq. [sent-36, score-0.798]
26 We can also think of the model as defining p(v, h) by defining a consistent pair of conditional probabilities, p(h|v) and p(v|h) which can be used to sample from the model distribution. [sent-38, score-0.06]
27 Unlike a standard directed model, this p(h) does not have its own separate parameters. [sent-40, score-0.168]
28 This peculiar decomposition into p(h) and p(v|h) suggests a recursive algorithm: keep the learned p(v|h) but replace p(h) by a better prior over h, i. [sent-42, score-0.068]
29 a prior that is closer to the average, over all the data vectors, of the conditional posterior over h. [sent-44, score-0.085]
30 We can sample from this average conditional posterior by simply applying p(h|v) to the training data. [sent-45, score-0.085]
31 The sampled h vectors are then the “data” that is used for training a higher-level RBM that learns the next layer of features. [sent-46, score-0.418]
32 We could initialize the higher-level RBM model by using the same parameters as the lower-level RBM but with the roles of the hidden and visible units reversed. [sent-47, score-0.8]
33 This ensures that p(v) for the higher-level RBM starts out being exactly the same as p(h) for the lowerlevel one. [sent-48, score-0.025]
34 Provided the number of features per layer does not decrease, Hinton et al. [sent-49, score-0.388]
35 (2006) show that each extra layer increases a variational lower bound on the log probability of the data. [sent-50, score-0.388]
36 The directed connections from the first hidden layer to the visible units in the final, composite graphical model are a consequence of the the fact that we keep the p(v|h) but throw away the p(h) defined by the first level RBM. [sent-51, score-1.611]
37 In the final composite model, the only undirected connections are 2 between the top two layers, because we do not throw away the p(h) for the highest-level RBM. [sent-52, score-0.29]
38 3 Semi-restricted Boltzmann machines For contrastive divergence learning to work well, it is important for the hidden units to be sampled from their conditional distribution given the data or the reconstructions. [sent-54, score-0.701]
39 It not necessary, however, for the reconstructions to be sampled from their conditional distribution given the hidden states. [sent-55, score-0.42]
40 All that is required is that the reconstructions have lower free energy than the data. [sent-56, score-0.275]
41 So it is possible to include lateral connections between the visible units and to create reconstructions by taking a small step towards the conditional equilibrium distribution given the hidden states. [sent-57, score-1.236]
42 If we are using meanfield activities for the reconstructions, we can move towards the equilibrium distribution by using a few damped mean-field updates (Welling and Hinton, 2002). [sent-58, score-0.102]
43 The visible units form a conditional MRF with the biases of the visible units being determined by the hidden states. [sent-60, score-1.485]
44 The learning procedure for the visible to hidden connections is unaffected and the same learning procedure applies to the lateral connections. [sent-61, score-0.779]
wordName wordTfidf (topN-words)
[('layer', 0.388), ('rbm', 0.287), ('visible', 0.286), ('vi', 0.282), ('units', 0.274), ('hj', 0.233), ('hidden', 0.214), ('wij', 0.171), ('directed', 0.168), ('energy', 0.159), ('hinton', 0.158), ('cer', 0.137), ('soldiers', 0.137), ('reconstructions', 0.116), ('bj', 0.113), ('mrf', 0.112), ('multilayer', 0.109), ('boltzmann', 0.098), ('deep', 0.095), ('hiddens', 0.091), ('srbm', 0.091), ('visibles', 0.091), ('biases', 0.091), ('lateral', 0.091), ('connections', 0.084), ('throw', 0.079), ('nets', 0.076), ('bi', 0.073), ('modulated', 0.073), ('equilibrium', 0.068), ('settle', 0.068), ('guration', 0.065), ('recon', 0.064), ('rectangle', 0.064), ('belief', 0.063), ('contrastive', 0.061), ('conditional', 0.06), ('composite', 0.058), ('layers', 0.058), ('interactions', 0.05), ('hierarchy', 0.049), ('binary', 0.049), ('unit', 0.046), ('image', 0.045), ('patches', 0.044), ('create', 0.043), ('lii', 0.04), ('symmetrically', 0.04), ('unaffected', 0.04), ('neat', 0.04), ('peculiar', 0.04), ('transmitting', 0.04), ('reconstruction', 0.037), ('undirected', 0.037), ('compositions', 0.036), ('mouth', 0.034), ('sketched', 0.034), ('simon', 0.034), ('damped', 0.034), ('salakhutdinov', 0.034), ('nose', 0.034), ('machines', 0.033), ('raise', 0.032), ('decides', 0.032), ('away', 0.032), ('procedure', 0.032), ('road', 0.03), ('ancestors', 0.03), ('dbn', 0.03), ('osindero', 0.03), ('optimise', 0.03), ('sampled', 0.03), ('divergence', 0.029), ('geoffrey', 0.029), ('suppress', 0.029), ('enforced', 0.029), ('keep', 0.028), ('locations', 0.028), ('tell', 0.028), ('module', 0.028), ('raised', 0.028), ('king', 0.028), ('fraction', 0.027), ('elds', 0.027), ('toronto', 0.027), ('interacting', 0.027), ('stochastic', 0.026), ('experts', 0.026), ('roles', 0.026), ('posterior', 0.025), ('weight', 0.025), ('generate', 0.025), ('familiar', 0.025), ('adjusting', 0.025), ('reconstructed', 0.025), ('exactly', 0.025), ('eld', 0.025), ('states', 0.025), ('fields', 0.024), ('matched', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 132 nips-2007-Modeling image patches with a directed hierarchy of Markov random fields
Author: Simon Osindero, Geoffrey E. Hinton
Abstract: We describe an efficient learning procedure for multilayer generative models that combine the best aspects of Markov random fields and deep, directed belief nets. The generative models can be learned one layer at a time and when learning is complete they have a very fast inference procedure for computing a good approximation to the posterior distribution in all of the hidden layers. Each hidden layer has its own MRF whose energy function is modulated by the top-down directed connections from the layer above. To generate from the model, each layer in turn must settle to equilibrium given its top-down input. We show that this type of model is good at capturing the statistics of patches of natural images. 1
2 0.42420977 182 nips-2007-Sparse deep belief net model for visual area V2
Author: Honglak Lee, Chaitanya Ekanadham, Andrew Y. Ng
Abstract: Motivated in part by the hierarchical organization of the cortex, a number of algorithms have recently been proposed that try to learn hierarchical, or “deep,” structure from unlabeled data. While several authors have formally or informally compared their algorithms to computations performed in visual area V1 (and the cochlea), little attempt has been made thus far to evaluate these algorithms in terms of their fidelity for mimicking computations at deeper levels in the cortical hierarchy. This paper presents an unsupervised learning model that faithfully mimics certain properties of visual area V2. Specifically, we develop a sparse variant of the deep belief networks of Hinton et al. (2006). We learn two layers of nodes in the network, and demonstrate that the first layer, similar to prior work on sparse coding and ICA, results in localized, oriented, edge filters, similar to the Gabor functions known to model V1 cell receptive fields. Further, the second layer in our model encodes correlations of the first layer responses in the data. Specifically, it picks up both colinear (“contour”) features as well as corners and junctions. More interestingly, in a quantitative comparison, the encoding of these more complex “corner” features matches well with the results from the Ito & Komatsu’s study of biological V2 responses. This suggests that our sparse variant of deep belief networks holds promise for modeling more higher-order features. 1
3 0.40912354 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes
Author: Geoffrey E. Hinton, Ruslan Salakhutdinov
Abstract: We show how to use unlabeled data and a deep belief net (DBN) to learn a good covariance kernel for a Gaussian process. We first learn a deep generative model of the unlabeled data using the fast, greedy algorithm introduced by [7]. If the data is high-dimensional and highly-structured, a Gaussian kernel applied to the top layer of features in the DBN works much better than a similar kernel applied to the raw input. Performance at both regression and classification can then be further improved by using backpropagation through the DBN to discriminatively fine-tune the covariance kernel.
4 0.22007081 180 nips-2007-Sparse Feature Learning for Deep Belief Networks
Author: Marc'aurelio Ranzato, Y-lan Boureau, Yann L. Cun
Abstract: Unsupervised learning algorithms aim to discover the structure hidden in the data, and to learn representations that are more suitable as input to a supervised machine than the raw input. Many unsupervised methods are based on reconstructing the input from the representation, while constraining the representation to have certain desirable properties (e.g. low dimension, sparsity, etc). Others are based on approximating density by stochastically reconstructing the input from the representation. We describe a novel and efficient algorithm to learn sparse representations, and compare it theoretically and experimentally with a similar machine trained probabilistically, namely a Restricted Boltzmann Machine. We propose a simple criterion to compare and select different unsupervised machines based on the trade-off between the reconstruction error and the information content of the representation. We demonstrate this method by extracting features from a dataset of handwritten numerals, and from a dataset of natural image patches. We show that by stacking multiple levels of such machines and by training sequentially, high-order dependencies between the input observed variables can be captured. 1
5 0.092221469 111 nips-2007-Learning Horizontal Connections in a Sparse Coding Model of Natural Images
Author: Pierre Garrigues, Bruno A. Olshausen
Abstract: It has been shown that adapting a dictionary of basis functions to the statistics of natural images so as to maximize sparsity in the coefficients results in a set of dictionary elements whose spatial properties resemble those of V1 (primary visual cortex) receptive fields. However, the resulting sparse coefficients still exhibit pronounced statistical dependencies, thus violating the independence assumption of the sparse coding model. Here, we propose a model that attempts to capture the dependencies among the basis function coefficients by including a pairwise coupling term in the prior over the coefficient activity states. When adapted to the statistics of natural images, the coupling terms learn a combination of facilitatory and inhibitory interactions among neighboring basis functions. These learned interactions may offer an explanation for the function of horizontal connections in V1 in terms of a prior over natural images.
6 0.085167661 87 nips-2007-Fast Variational Inference for Large-scale Internet Diagnosis
7 0.06951052 145 nips-2007-On Sparsity and Overcompleteness in Image Models
8 0.06835518 63 nips-2007-Convex Relaxations of Latent Variable Training
9 0.062243577 97 nips-2007-Hidden Common Cause Relations in Relational Learning
10 0.061414044 64 nips-2007-Cooled and Relaxed Survey Propagation for MRFs
11 0.055271331 206 nips-2007-Topmoumoute Online Natural Gradient Algorithm
12 0.055150546 210 nips-2007-Unconstrained On-line Handwriting Recognition with Recurrent Neural Networks
13 0.052940845 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning
14 0.052602187 140 nips-2007-Neural characterization in partially observed populations of spiking neurons
15 0.050668657 59 nips-2007-Continuous Time Particle Filtering for fMRI
16 0.048279487 207 nips-2007-Transfer Learning using Kolmogorov Complexity: Basic Theory and Empirical Evaluations
17 0.047424603 56 nips-2007-Configuration Estimates Improve Pedestrian Finding
18 0.042760339 22 nips-2007-Agreement-Based Learning
19 0.042299829 105 nips-2007-Infinite State Bayes-Nets for Structured Domains
20 0.041611072 48 nips-2007-Collective Inference on Markov Models for Modeling Bird Migration
topicId topicWeight
[(0, -0.168), (1, 0.111), (2, 0.031), (3, -0.072), (4, -0.025), (5, 0.038), (6, -0.366), (7, 0.074), (8, 0.4), (9, -0.037), (10, 0.312), (11, -0.081), (12, 0.092), (13, -0.073), (14, -0.017), (15, -0.03), (16, -0.056), (17, -0.145), (18, -0.094), (19, 0.104), (20, 0.159), (21, -0.025), (22, -0.176), (23, -0.068), (24, 0.102), (25, -0.061), (26, -0.041), (27, 0.03), (28, 0.056), (29, -0.003), (30, 0.021), (31, -0.036), (32, 0.01), (33, 0.155), (34, 0.052), (35, 0.013), (36, 0.012), (37, -0.023), (38, 0.065), (39, -0.045), (40, -0.027), (41, -0.027), (42, 0.032), (43, 0.037), (44, -0.011), (45, 0.057), (46, 0.054), (47, 0.055), (48, 0.016), (49, -0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.97951645 132 nips-2007-Modeling image patches with a directed hierarchy of Markov random fields
Author: Simon Osindero, Geoffrey E. Hinton
Abstract: We describe an efficient learning procedure for multilayer generative models that combine the best aspects of Markov random fields and deep, directed belief nets. The generative models can be learned one layer at a time and when learning is complete they have a very fast inference procedure for computing a good approximation to the posterior distribution in all of the hidden layers. Each hidden layer has its own MRF whose energy function is modulated by the top-down directed connections from the layer above. To generate from the model, each layer in turn must settle to equilibrium given its top-down input. We show that this type of model is good at capturing the statistics of patches of natural images. 1
2 0.77155429 182 nips-2007-Sparse deep belief net model for visual area V2
Author: Honglak Lee, Chaitanya Ekanadham, Andrew Y. Ng
Abstract: Motivated in part by the hierarchical organization of the cortex, a number of algorithms have recently been proposed that try to learn hierarchical, or “deep,” structure from unlabeled data. While several authors have formally or informally compared their algorithms to computations performed in visual area V1 (and the cochlea), little attempt has been made thus far to evaluate these algorithms in terms of their fidelity for mimicking computations at deeper levels in the cortical hierarchy. This paper presents an unsupervised learning model that faithfully mimics certain properties of visual area V2. Specifically, we develop a sparse variant of the deep belief networks of Hinton et al. (2006). We learn two layers of nodes in the network, and demonstrate that the first layer, similar to prior work on sparse coding and ICA, results in localized, oriented, edge filters, similar to the Gabor functions known to model V1 cell receptive fields. Further, the second layer in our model encodes correlations of the first layer responses in the data. Specifically, it picks up both colinear (“contour”) features as well as corners and junctions. More interestingly, in a quantitative comparison, the encoding of these more complex “corner” features matches well with the results from the Ito & Komatsu’s study of biological V2 responses. This suggests that our sparse variant of deep belief networks holds promise for modeling more higher-order features. 1
3 0.68058109 180 nips-2007-Sparse Feature Learning for Deep Belief Networks
Author: Marc'aurelio Ranzato, Y-lan Boureau, Yann L. Cun
Abstract: Unsupervised learning algorithms aim to discover the structure hidden in the data, and to learn representations that are more suitable as input to a supervised machine than the raw input. Many unsupervised methods are based on reconstructing the input from the representation, while constraining the representation to have certain desirable properties (e.g. low dimension, sparsity, etc). Others are based on approximating density by stochastically reconstructing the input from the representation. We describe a novel and efficient algorithm to learn sparse representations, and compare it theoretically and experimentally with a similar machine trained probabilistically, namely a Restricted Boltzmann Machine. We propose a simple criterion to compare and select different unsupervised machines based on the trade-off between the reconstruction error and the information content of the representation. We demonstrate this method by extracting features from a dataset of handwritten numerals, and from a dataset of natural image patches. We show that by stacking multiple levels of such machines and by training sequentially, high-order dependencies between the input observed variables can be captured. 1
4 0.6687035 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes
Author: Geoffrey E. Hinton, Ruslan Salakhutdinov
Abstract: We show how to use unlabeled data and a deep belief net (DBN) to learn a good covariance kernel for a Gaussian process. We first learn a deep generative model of the unlabeled data using the fast, greedy algorithm introduced by [7]. If the data is high-dimensional and highly-structured, a Gaussian kernel applied to the top layer of features in the DBN works much better than a similar kernel applied to the raw input. Performance at both regression and classification can then be further improved by using backpropagation through the DBN to discriminatively fine-tune the covariance kernel.
5 0.30578154 25 nips-2007-An in-silico Neural Model of Dynamic Routing through Neuronal Coherence
Author: Devarajan Sridharan, Brian Percival, John Arthur, Kwabena A. Boahen
Abstract: We describe a neurobiologically plausible model to implement dynamic routing using the concept of neuronal communication through neuronal coherence. The model has a three-tier architecture: a raw input tier, a routing control tier, and an invariant output tier. The correct mapping between input and output tiers is realized by an appropriate alignment of the phases of their respective background oscillations by the routing control units. We present an example architecture, implemented on a neuromorphic chip, that is able to achieve circular-shift invariance. A simple extension to our model can accomplish circular-shift dynamic routing with only O(N ) connections, compared to O(N 2 ) connections required by traditional models. 1 Dynamic Routing Circuit Models for Circular-Shift Invariance Dynamic routing circuit models are among the most prominent neural models for invariant recognition [1] (also see [2] for review). These models implement shift invariance by dynamically changing spatial connectivity to transform an object to a standard position or orientation. The connectivity between the raw input and invariant output layers is controlled by routing units, which turn certain subsets of connections on or off (Figure 1A). An important feature of this model is the explicit representation of what and where information in the main network and the routing units, respectively; the routing units use the where information to create invariant representations. Traditional solutions for shift invariance are neurobiologically implausible for at least two reasons. First, there are too many synaptic connections: for N input neurons, N output neurons and N possible input-output mappings, the network requires O(N 2 ) connections in the routing layer— between each of the N routing units and each set of N connections that that routing unit gates (Figure 1A). Second, these connections must be extremely precise: each routing unit must activate an inputoutput mapping (N individual connections) corresponding to the desired shift (as highlighted in Figure 1A). Other approaches that have been proposed, including invariant feature networks [3,4], also suffer from significant drawbacks, such as the inability to explicitly represent where information [2]. It remains an open question how biology could achieve shift invariance without profligate and precise connections. In this article, we propose a simple solution for shift invariance for quantities that are circular or periodic in nature—circular-shift invariance (CSI)—orientation invariance in vision and key invariance in music. The visual system may create orientation-invariant representations to aid recognition under conditions of object rotation or head-tilt [5,6]; a similar mechanism could be employed by the auditory system to create key-invariant representations under conditions where the same melody 1 Figure 1: Dynamic routing. A In traditional dynamic routing, connections from the (raw) input layer to the (invariant) output layer are gated by routing units. For instance, the mapping from A to 5, B to 6, . . . , F to 4 is achieved by turning on the highlighted routing unit. B In time-division multiplexing (TDM), the encoder samples input channels periodically (using a rotating switch) while the decoder sends each sample to the appropriate output channel (based on its time bin). TDM can be extended to achieve a circular-shift transformation by altering the angle between encoder and decoder switches (θ), thereby creating a rotated mapping between input and output channels (adapted from [7]). is played in different keys. Similar to orientation, which is a periodic quantity, musical notes one octave apart sound alike, a phenomenon known as octave equivalence [8]. Thus, the problems of key invariance and orientation invariance admit similar solutions. Deriving inspiration from time-division multiplexing (TDM), we propose a neural network for CSI that uses phase to encode and decode information. We modulate the temporal window of communication between (raw) input and (invariant) output neurons to achieve the appropriate input–output mapping. Extending TDM, any particular circular-shift transformation can be accomplished by changing the relative angle, θ, between the rotating switches of the encoder (that encodes the raw input in time) and decoder (that decodes the invariant output in time) (Figure 1B). This obviates the need to hardwire routing control units that specifically modulate the strength of each possible inputoutput connection, thereby significantly reducing the complexity inherent in the traditional dynamic routing solution. Similarly, a remapping between the input and output neurons can be achieved by introducing a relative phase-shift in their background oscillations. 2 Dynamic Routing through Neuronal Coherence To modulate the temporal window of communication, the model uses a ring of neurons (the oscillation ring) to select the pool of neurons (in the projection ring) that encode or decode information at a particular time (Figure 2A). Each projection pool encodes a specific value of the feature (for example, one of twelve musical notes). Upon activation by external input, each pool is active only when background inhibition generated by the oscillation ring (outer ring of neurons) is at a minimum. In addition to exciting 12 inhibitory interneurons in the projection ring, each oscillation ring neuron excites its nearest 18 neighbors in the clockwise direction around the oscillation ring. As a result, a wave of inhibition travels around the projection ring that allows only one pool to be excitable at any point in time. These neurons become excitable at roughly the same time (numbered sectors, inner ring) by virtue of recurrent excitatory intra-pool connections. Decoding is accomplished by a second tier of rings (Figure 2B). The projection ring of the first (input) tier connects all-to-all to the projection ring of the second (output) tier. The two oscillation rings create a window of excitability for the pools of neurons in their respective projection rings. Hence, the most effective communication occurs between input and output pools that become excitable at the same time (i.e. are oscillating in phase with one another [9]). The CSI problem is solved by introducing a phase-shift between the input and output tiers. If they are exactly in phase, then an input pool is simply mapped to the output pool directly above it. If their 2 Figure 2: Double-Ring Network for Encoding and Decoding. A The projection (inner) ring is divided into (numbered) pools. The oscillation (outer) ring modulates sub-threshold activity (waveforms) of the projection ring by exciting (black distribution) inhibitory neurons that inhibit neighboring projection neurons. A wave of activity travels around the oscillation ring due to asymmetric excitatory connections, creating a corresponding wave of inhibitory activity in the projection ring, such that only one pool of projection neurons is excitable (spikes) at a given time. B Two instances of the double-ring structure from A. The input projection ring connects all-to-all to the output projection ring (dashed lines). Because each input pool will spike only during a distinct time bin, and each output pool is excitable only in a certain time bin, communication occurs between input and output pools that are oscillating in phase with each other. Appropriate phase offset between input and output oscillation rings realizes the desired circular shift (input pool H to output pool 1, solid arrow). C Interactions among pools highlighted in B. phases are different, the input is dynamically routed to an appropriate circularly shifted position in the output tier. Such changes in phase are analogous to adjusting the angle of the rotating switch at either the encoder or the decoder in TDM (see Figure 1B). There is some evidence that neural systems could employ phase relationships of subthreshold oscillations to selectively target neural populations [9-11]. 3 Implementation in Silicon We implemented this solution to CSI on a neuromorphic silicon chip [12]. The neuromorphic chip has neurons whose properties resemble that of biological neurons; these neurons even have intrinsic differences, thereby mimicking heterogeneity in real neurobiological systems. The chip uses a conductance-based spiking model for both inhibitory and excitatory neurons. Inhibitory neurons project to nearby excitatory and inhibitory neurons via a diffusor network that determines the spread of inhibition. A lookup table of excitatory synaptic connectivity is stored in a separate randomaccess memory (RAM) chip. Spikes occurring on-chip are converted to a neuron address, mapped to synapses (if any) via the lookup table, and routed to the targeted on-chip synapse. A universal serial bus (USB) interface chip communicates spikes to and from a computer, for external input and 3 Figure 3: Traveling-wave activity in the oscillation ring. A Population activity (5ms bins) of a pool of eighteen (adjacent) oscillation neurons. B Increasing the strength of feedforward excitation led to increasing frequencies of periodic firing in the θ and α range (1-10 Hz). Strength of excitation is the amplitude change in post-synaptic conductance due to a single pre-synaptic spike (measured relative to minimum amplitude used). data analysis, respectively. Simulations on the chip occur in real-time, making it an attractive option for implementing the model. We configured the following parameters: • Magnitude of a potassium M-current: increasing this current’s magnitude increased the post-spike repolarization time of the membrane potential, thereby constraining spiking to a single time bin per cycle. • The strength of excitatory and inhibitory synapses: a correct balance had to be established between excitation and inhibition to make only a small subset of neurons in the projection rings fire at a time—too much excitation led to widespread firing and too much inhibition led to neurons that were entirely silent or fired sporadically. • The space constant of inhibitory spread: increasing the spread was effective in preventing runaway excitation, which could occur due to the recurrent excitatory connections. We were able to create a stable traveling wave of background activity within the oscillation ring. We transiently stimulated a small subset of the neurons, which initiated a wave of activity that propagated in a stable manner around the ring after the transient external stimulation had ceased (Figure 3A). The network frequency determined from a Fourier transform of the network activity smoothed with a non-causal Gaussian kernel (FDHM = 80ms) was 7.4Hz. The frequency varied with the strength of the neurons’ excitatory connections (Figure 3B), measured as the amplitude of the step increase in membrane conductivity due to the arrival of a pre-synaptic spike. Over much of the range of the synaptic strengths tested, we observed stable oscillations in the θ and α bands (1-10Hz); the frequency appeared to increase logarithmically with synaptic strength. 4 Phase-based Encoding and Decoding In order to assess the best-case performance of the model, the background activity in the input and output projection rings was derived from the input oscillation ring. Their spikes were delivered to the appropriately circularly-shifted output oscillation neurons. The asymmetric feedforward connections were disabled in the output oscillation ring. For instance, in order to achieve a circular shift by k pools (i.e. mapping input projection pool 1 to output projection pool k + 1, input pool 2 to output pool k + 2, and so on), activity from the input oscillation neurons closest to input pool 1 was fed into the output oscillation neurons closest to output pool k. By providing the appropriate phase difference between input and output oscillation, we were able to assess the performance of the model under ideal conditions. In the Discussion section, we discuss a biologically plausible mechanism to control the relative phases. 4 Figure 4: Phase-based encoding. Rasters indicating activity of projection pools in 1ms bins, and mean phase of firing (×’s) for each pool (relative to arbitrary zero time). The abscissa shows firing time normalized by the period of oscillation (which may be converted to firing phase by multiplication by 2π). Under constant input to the input projection ring, the input pools fire approximately in sequence. Two cycles of pool activity normalized by maximum firing rate for each pool are shown in left inset (for clarity, pools 1-6 are shown in the top panel and pools 7-12 are shown separately in the bottom panel); phase of background inhibition of pool 4 is shown (below) for reference. Phase-aligned average1 of activity (right inset) showed that the firing times were relatively tight and uniform across pools: a standard deviation of 0.0945 periods, or equivalently, a spread of 1.135 pools at any instant of time. We verified that the input projection pools fired in a phase-shifted fashion relative to one another, a property critical for accurate encoding (see Figure 2). We stimulated all pools in the input projection ring simultaneously while the input oscillation ring provided a periodic wave of background inhibition. The mean phase of firing for each pool (relative to arbitrary zero time) increased nearly linearly with pool number, thereby providing evidence for accurate, phase-based encoding (Figure 4). The firing times of all pools are shown for two cycles of background oscillatory activity (Figure 4 left inset). A phase-aligned average1 showed that the timing was relatively tight (standard deviation 1.135 pools) and uniform across pools of neurons (Figure 4 right inset). We then characterized the system’s ability to correctly decode this encoding under a given circular shift. The shift was set to seven pools, mapping input pool 1 to output pool 8, and so on. Each input pool was stimulated in turn. We expected to see only the appropriately shifted output pool become highly active. In fact, not only was this pool active, but other pools around it were also active, though to a lesser extent (Figure 5A). Thus, the phase-encoded input was decoded successfully, and circularly shifted, except that the output units were broadly tuned. To quantify the overall precision of encoding and decoding, we constructed an input-locked average of the tuning curves (Figure 5B): the curves were circularly shifted to the left by an amount corresponding to the stimulated input pool number, and the raw pool firing rates were averaged. If the phase-based encoding and decoding were perfect, the peak should occur at a shift of 7 pools. 1 The phase-aligned average was constructed by shifting the pool-activity curves by the (# of the pool) × 1 ( 12 of the period) to align activity across pools, which was then averaged. 5 Figure 5: Decoding phase-encoded input. A In order to assess decoding performance under a given circular shift (here 7 pools) each input pool was stimulated in turn and activity in each output pool was recorded and averaged over 500ms. The pool’s response, normalized by its maximum firing rate, is plotted for each stimulated input pool (arrows pointing to curves, color code as in Figure 4). Each input pool stimulation trial consistently resulted in peak activity in the appropriate output pool; however, adjacent pools were also active, but to a lesser extent, resulting in a broad tuning curve. B The best-fit Gaussian (dot-dashed grey curve, σ = 2.30 pools) to the input-locked average of the raw pool firing rates (see text for details) revealed a maximum between a shift of 7 and 8 pools (inverted grey triangle; expected peak at a shift of 7 pools). Indeed, the highest (average) firing rate corresponded to a shift of 7 pools. However, the activity corresponding to a shift of 8 pools was nearly equal to that of 7 pools, and the best fitting Gaussian curve to the activity histogram (grey dot-dashed line) peaked at a point between pools 7 and 8 (inverted grey triangle). The standard deviation (σ) was 2.30 pools, versus the expected ideal σ of 1.60, which corresponds to the encoding distribution (σ = 1.135 pools) convolved with itself. 5 Discussion We have demonstrated a biologically plausible mechanism for the dynamic routing of information in time that obviates the need for precise gating of connections. This mechanism requires that a wave of activity propagate around pools of neurons arranged in a ring. While previous work has described traveling waves in a ring of neurons [13], and a double ring architecture (for determining head-direction) [14], our work combines these two features (twin rings with phase-shifted traveling waves) to achieve dynamic routing. These features of the model are found in the cortex: Bonhoeffer and Grinwald [15] describe iso-orientation columns in the cat visual cortex that are arranged in ring-like pinwheel patterns, with orientation tuning changing gradually around the pinwheel center. Moreover, Rubino et al. [16] have shown that coherent oscillations can propagate as waves across the cortical surface in the motor cortex of awake, behaving monkeys performing a delayed reaching task. Our solution for CSI is also applicable to music perception. In the Western twelve-tone, equaltemperament tuning system (12-tone scale), each octave is divided into twelve logarithmicallyspaced notes. Human observers are known to construct mental representations for raw notes that are invariant of the (perceived) key of the music: a note of C heard in the key of C-Major is perceptually equivalent to the note C# heard in the key of C#-Major [8,17]. In previous dynamic routing models of key invariance, the tonic—the first note of the key (e.g., C is the tonic of C-Major)— supplies the equivalent where information used by routing units that gate precise connections to map the raw note into a key-invariant output representation [17]. To achieve key invariance in our model, the bottom tier encodes raw note information while the top tier decodes key-invariant notes (Figure 6). The middle tier receives the tonic information and aligns the phase of the first output pool (whose invariant representation corresponds to the tonic) with the appropriate input pool (whose raw note representation corresponds to the tonic of the perceived key). 6 Figure 6: Phase-based dynamic routing to achieve key-invariance. The input (bottom) tier encodes raw note information, and the output (top) tier decodes key-invariant information. The routing (middle) tier sets the phase of the background wave activity in the input and output oscillation rings (dashed arrows) such that the first output pool is in phase with the input pool representing the note corresponding to the tonic. On the left, where G is the tonic, input pool G, output pool 1, and the routing tier are in phase with one another (black clocks), while input pool C and output pool 6 are in phase with one another (grey clocks). Thus, the raw note input, G, activates the invariant output 1, which corresponds to the perceived tonic invariant representation (heavy solid arrows). On the right, the same raw input note, G, is active, but the key is different and A is now the active tonic; thus the raw input, G, is now mapped to output pool 11. The tonic information is supplied to a specific pool in the routing ring according to the perceived key. This pool projects directly down to the input pool corresponding to the tonic. This ensures that the current tonic’s input pool is excitable in the same time bin as the first output pool. Each of the remaining raw input notes of the octave is mapped by time binning to the corresponding key-invariant representation in the output tier, as the phases of input pools are all shifted by the same amount. Supporting evidence for phase-based encoding of note information comes from MEG recordings in humans: the phase of the MEG signal (predominantly over right hemispheric sensor locations) tracks the note of the heard note sequence with surprising accuracy [18]. The input and output tiers’ periods must be kept in lock-step, which can be accomplished through more plausible means than employed in the current implementation of this model. Here, we maintained a fixed phase shift between the input and output oscillation rings by feeding activity from the input oscillation ring to the appropriately shifted pool in the output oscillation ring. This approach allowed us to avoid difficulties achieving coherent oscillations at identical frequencies in the input and output oscillation rings. Alternatively, entrainment could be achieved even when the frequencies are not identical—a more biologically plausible scenario—if the routing ring resets the phase of the input and output rings on a cycle-by-cycle basis. Lakatos et al. [19] have shown that somatosensory inputs can reset the phase of ongoing neuronal oscillations in the primary auditory cortex (A1), which helps in the generation of a unified auditory-tactile percept (the so-called “Hearing-Hands Effect”). A simple extension to our model can reduce the number of connections below the requirements of traditional dynamic routing models. Instead of having all-to-all connections between the input and output layers, a relay layer of very few (M N ) neurons could be used to transmit the spikes form the input neurons to the output neurons (analogous to the single wire connecting encoder and decoder in Figure 1B). A small number of (or ideally even one) relay neurons suffices because encoding and decoding occur in time. Hence, the connections between each input pool and the relay neurons require O(M N ) ≈ O(N ) connections (as long as M does not scale with N ) and those between the relay neurons and each output pool require O(M N) ≈ O(N ) connections as well. Thus, by removing all-to-all connectivity between the input and output units (a standard feature in traditional dynamic routing models), the number of required connections is reduced from O(N 2 ) 7 to O(N ). Further, by replacing the strict pool boundaries with nearest neighbor connectivity in the projection rings, the proposed model can accommodate a continuum of rotation angles. In summary, we propose that the mechanism of dynamic routing through neuronal coherence could be a general mechanism that could be used by multiple sensory and motor modalities in the neocortex: it is particularly suitable for placing raw information in an appropriate context (defined by the routing tier). Acknowledgments DS was supported by a Stanford Graduate Fellowship and BP was supported under a National Science Foundation Graduate Research Fellowship. References [1] Olshausen B.A., Anderson C.H. & Van Essen D.C. (1993). A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. Journal of Neuroscience 13(11):47004719. [2] Wiskott L. (2004). How does our visual system achieve shift and size invariance? In J.L. van Hemmen & T.J. Sejnowski (Eds.), 23 Problems in Systems Neuroscience, Oxford University Press. [3] Fukushima K., Miyake S. & Ito T. (1983). A neural network model for a mechanism of visual pattern recognition. IEEE Transactions on Systems, Man and Cybernetics 13:826-834. [4] Mel B.W., Ruderman D.L & Archie K.A. (1998). Translation invariant orientation tuning in visual “complex” cells could derive from intradendritic computations. Journal of Neuroscience 18(11):4325-4334. [5] McKone, E. & Grenfell, T. (1999). Orientation invariance in naming rotated objects: Individual differences and repetition priming. Perception and Psychophysics, 61:1590-1603. [6] Harris IM & Dux PE. (2005). Orientation-invariant object recognition: evidence from repetition blindness. Cognition, 95(1):73-93. [7] Naval Electrical Engineering Training Series (NEETS). Module 17, Radio-Frequency Communication Principles, Chapter 3, pp.32. Published online at http://www.tpub.com/content/neets/14189 (Integrated Publishing). [8] Krumhansl C.L. (1990). Cognitive foundations of musical pitch. Oxford University Press, 1990. [9] Fries P. (2005). A mechanism for cognitive dynamics: neuronal communication through neuronal coherence. Trends in Cognitive Sciences 9(10):474-480. [10] Buzsaki G. & Draguhn A. (2004). Neuronal Oscillations in Cortical Networks. Science 304(5679):19261929. [11] Sejnowski T.J. & Paulsen O. (2006). Network oscillations: Emerging computational principles. Journal of Neuroscience 26(6):1673-1676. [12] Arthur J.A. & Boahen K. (2005). Learning in Silicon: Timing is Everything. Advances in Neural Information Processing Systems 17, B Sholkopf and Y Weiss, Eds, MIT Press, 2006. [13] Hahnloser R.H.R., Sarpeshkar R., Mahowald M.A., Douglas R.J., & Seung H.S. (2000). Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature 405:947-951. [14] Xie X., Hahnloser R.H.R., & Seung H.S (2002). Double-ring network modeling of the head-direction system. Phys. Rev. E66 041902:1-9. [15] Bonhoeffer K. & Grinwald A. (1991). Iso-orientation domains in cat visual cortex are arranged in pinwheel-like patterns. Nature 353:426-437. [16] Rubino D., Robbins K.A. & Hastopoulos N.G. (2006). Propagating waves mediate information transfer in the motor cortex. Nature Neuroscience 9:1549-1557. [17] Bharucha J.J. (1999). Neural nets, temporal composites and tonality. In D. Deutsch (Ed.), The Psychology of Music (2d Ed.) Academic Press, New York. [18] Patel A.D. & Balaban E. (2000). Temporal patterns of human cortical activity reflect tone sequence structure. Nature 404:80-84. [19] Lakatos P., Chen C., O’Connell M., Mills A. & Schroeder C. (2007). Neuronal oscillations and multisensory interaction in primary auditory cortex. Neuron 53(2):279-292. 8
6 0.29325825 210 nips-2007-Unconstrained On-line Handwriting Recognition with Recurrent Neural Networks
7 0.26846525 87 nips-2007-Fast Variational Inference for Large-scale Internet Diagnosis
8 0.25742459 89 nips-2007-Feature Selection Methods for Improving Protein Structure Prediction with Rosetta
9 0.20854571 64 nips-2007-Cooled and Relaxed Survey Propagation for MRFs
10 0.19995399 97 nips-2007-Hidden Common Cause Relations in Relational Learning
11 0.19537333 206 nips-2007-Topmoumoute Online Natural Gradient Algorithm
12 0.19475593 68 nips-2007-Discovering Weakly-Interacting Factors in a Complex Stochastic Process
13 0.19190194 130 nips-2007-Modeling Natural Sounds with Modulation Cascade Processes
14 0.1908237 111 nips-2007-Learning Horizontal Connections in a Sparse Coding Model of Natural Images
15 0.18916419 63 nips-2007-Convex Relaxations of Latent Variable Training
16 0.18382294 105 nips-2007-Infinite State Bayes-Nets for Structured Domains
17 0.1803361 145 nips-2007-On Sparsity and Overcompleteness in Image Models
18 0.1770763 211 nips-2007-Unsupervised Feature Selection for Accurate Recommendation of High-Dimensional Image Data
19 0.17353731 60 nips-2007-Contraction Properties of VLSI Cooperative Competitive Neural Networks of Spiking Neurons
20 0.17228045 140 nips-2007-Neural characterization in partially observed populations of spiking neurons
topicId topicWeight
[(5, 0.036), (9, 0.011), (13, 0.017), (21, 0.026), (35, 0.019), (47, 0.049), (83, 0.687), (87, 0.013), (90, 0.039)]
simIndex simValue paperId paperTitle
1 0.99650508 159 nips-2007-Progressive mixture rules are deviation suboptimal
Author: Jean-yves Audibert
Abstract: We consider the learning task consisting in predicting as well as the best function in a finite reference set G up to the smallest possible additive term. If R(g) denotes the generalization error of a prediction function g, under reasonable assumptions on the loss function (typically satisfied by the least square loss when the output is bounded), it is known that the progressive mixture rule g satisfies ˆ log |G| (1) ER(ˆ) ≤ ming∈G R(g) + Cst g , n where n denotes the size of the training set, and E denotes the expectation w.r.t. the training set distribution.This work shows that, surprisingly, for appropriate reference sets G, the deviation convergence rate of the progressive mixture rule is √ no better than Cst / n: it fails to achieve the expected Cst /n. We also provide an algorithm which does not suffer from this drawback, and which is optimal in both deviation and expectation convergence rates. 1
same-paper 2 0.99441546 132 nips-2007-Modeling image patches with a directed hierarchy of Markov random fields
Author: Simon Osindero, Geoffrey E. Hinton
Abstract: We describe an efficient learning procedure for multilayer generative models that combine the best aspects of Markov random fields and deep, directed belief nets. The generative models can be learned one layer at a time and when learning is complete they have a very fast inference procedure for computing a good approximation to the posterior distribution in all of the hidden layers. Each hidden layer has its own MRF whose energy function is modulated by the top-down directed connections from the layer above. To generate from the model, each layer in turn must settle to equilibrium given its top-down input. We show that this type of model is good at capturing the statistics of patches of natural images. 1
3 0.99382281 109 nips-2007-Kernels on Attributed Pointsets with Applications
Author: Mehul Parsana, Sourangshu Bhattacharya, Chiru Bhattacharya, K. Ramakrishnan
Abstract: This paper introduces kernels on attributed pointsets, which are sets of vectors embedded in an euclidean space. The embedding gives the notion of neighborhood, which is used to define positive semidefinite kernels on pointsets. Two novel kernels on neighborhoods are proposed, one evaluating the attribute similarity and the other evaluating shape similarity. Shape similarity function is motivated from spectral graph matching techniques. The kernels are tested on three real life applications: face recognition, photo album tagging, and shot annotation in video sequences, with encouraging results. 1
4 0.99257207 110 nips-2007-Learning Bounds for Domain Adaptation
Author: John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, Jennifer Wortman
Abstract: Empirical risk minimization offers well-known learning guarantees when training and test data come from the same domain. In the real world, though, we often wish to adapt a classifier from a source domain with a large amount of training data to different target domain with very little training data. In this work we give uniform convergence bounds for algorithms that minimize a convex combination of source and target empirical risk. The bounds explicitly model the inherent trade-off between training on a large but inaccurate source data set and a small but accurate target training set. Our theory also gives results when we have multiple source domains, each of which may have a different number of instances, and we exhibit cases in which minimizing a non-uniform combination of source risks can achieve much lower target error than standard empirical risk minimization. 1
5 0.99244505 61 nips-2007-Convex Clustering with Exemplar-Based Models
Author: Danial Lashkari, Polina Golland
Abstract: Clustering is often formulated as the maximum likelihood estimation of a mixture model that explains the data. The EM algorithm widely used to solve the resulting optimization problem is inherently a gradient-descent method and is sensitive to initialization. The resulting solution is a local optimum in the neighborhood of the initial guess. This sensitivity to initialization presents a significant challenge in clustering large data sets into many clusters. In this paper, we present a different approach to approximate mixture fitting for clustering. We introduce an exemplar-based likelihood function that approximates the exact likelihood. This formulation leads to a convex minimization problem and an efficient algorithm with guaranteed convergence to the globally optimal solution. The resulting clustering can be thought of as a probabilistic mapping of the data points to the set of exemplars that minimizes the average distance and the information-theoretic cost of mapping. We present experimental results illustrating the performance of our algorithm and its comparison with the conventional approach to mixture model clustering. 1
6 0.98846275 20 nips-2007-Adaptive Embedded Subgraph Algorithms using Walk-Sum Analysis
7 0.92538929 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels
8 0.9157058 38 nips-2007-Boosting Algorithms for Maximizing the Soft Margin
9 0.89720792 54 nips-2007-Computational Equivalence of Fixed Points and No Regret Algorithms, and Convergence to Equilibria
10 0.89477015 116 nips-2007-Learning the structure of manifolds using random projections
11 0.89474595 120 nips-2007-Linear programming analysis of loopy belief propagation for weighted matching
12 0.89311445 40 nips-2007-Bundle Methods for Machine Learning
13 0.89094281 21 nips-2007-Adaptive Online Gradient Descent
14 0.88785309 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering
15 0.8812232 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs
16 0.88014013 101 nips-2007-How SVMs can estimate quantiles and the median
17 0.87828678 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs
18 0.87588936 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning
19 0.8749972 178 nips-2007-Simulated Annealing: Rigorous finite-time guarantees for optimization on continuous domains
20 0.8735258 46 nips-2007-Cluster Stability for Finite Samples