nips nips2002 nips2002-73 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: David Barber
Abstract: The application of latent/hidden variable Dynamic Bayesian Networks is constrained by the complexity of marginalising over latent variables. For this reason either small latent dimensions or Gaussian latent conditional tables linearly dependent on past states are typically considered in order that inference is tractable. We suggest an alternative approach in which the latent variables are modelled using deterministic conditional probability tables. This specialisation has the advantage of tractable inference even for highly complex non-linear/non-Gaussian visible conditional probability tables. This approach enables the consideration of highly complex latent dynamics whilst retaining the benefits of a tractable probabilistic model. 1
Reference: text
sentIndex sentText sentNum sentScore
1 uk Abstract The application of latent/hidden variable Dynamic Bayesian Networks is constrained by the complexity of marginalising over latent variables. [sent-6, score-0.104]
2 For this reason either small latent dimensions or Gaussian latent conditional tables linearly dependent on past states are typically considered in order that inference is tractable. [sent-7, score-0.651]
3 We suggest an alternative approach in which the latent variables are modelled using deterministic conditional probability tables. [sent-8, score-0.385]
4 This specialisation has the advantage of tractable inference even for highly complex non-linear/non-Gaussian visible conditional probability tables. [sent-9, score-0.641]
5 This approach enables the consideration of highly complex latent dynamics whilst retaining the benefits of a tractable probabilistic model. [sent-10, score-0.382]
6 1 Introduction Dynamic Bayesian Networks are a powerful framework for temporal data models with widespread application in time series analysis[10, 2, 5]. [sent-11, score-0.212]
7 A time series of length T is a sequence of observation vectors V = {v(1), v(2), . . . , v(T)}, where v_i(t) represents the state of visible variable i at time t. [sent-15, score-0.442]
9 For example, in a speech application V may represent a vector of cepstral coefficients through time, the aim being to classify the sequence as belonging to a particular phoneme[2, 9]. [sent-16, score-0.278]
10 The power in the Dynamic Bayesian Network is the assumption that the observations may be generated by some latent (hidden) process that cannot be directly experimentally observed. [sent-17, score-0.104]
11 The basic structure of these models is shown in fig(1)[a] where network states are only dependent on a short time history of previous states (the Markov assumption). [sent-18, score-0.349]
12 Representing the hidden variable sequence by H = {h(1), h(2), . . . , h(T)}, the joint distribution of a first order Dynamic Bayesian Network is

p(V, H) = p(v(1)) p(h(1)|v(1)) ∏_{t=1}^{T−1} p(v(t+1)|v(t), h(t)) p(h(t+1)|v(t), v(t+1), h(t))

This is a Hidden Markov Model (HMM), with additional connections from visible to hidden units[9]. [sent-22, score-0.931]
14 The usage of such models is varied, but here we shall concentrate on unsupervised sequence learning. [sent-23, score-0.169]
15 That is, given a set of training sequences, we wish to adjust the model parameters to maximise the likelihood, as described below. Figure 1: (a) A first order Dynamic Bayesian Network containing a sequence of hidden (latent) variables h(1), h(2), . . . , h(T) and a sequence of visible (observable) variables v(1), v(2), . . . , v(T). [sent-27, score-0.573]
17 In general, all conditional probability tables are stochastic – that is, more than one state can be realised. [sent-31, score-0.265]
18 (b) Conditioning on the visible units forms an undirected chain in the hidden space. [sent-32, score-1.124]
19 Hidden unit inference is achieved by propagating information along both directions of the chain to ensure normalisation. [sent-33, score-0.181]
20 Denoting the parameters of the model by Θ, learning can be achieved using the EM algorithm, which maximises a lower bound on the likelihood of a set of observed sequences by the procedure[5]:

Θ^new = arg max_Θ Σ_{µ=1}^{P} Σ_{H^µ} p(H^µ | V^µ, Θ^old) log p(H^µ, V^µ, Θ)    (1)

This procedure contains expectations with respect to the distribution p(H|V) – that is, to do learning, we need to infer the hidden unit distribution conditional on the visible variables. [sent-39, score-1.096]
22 p(H|V) is represented by the undirected clique graph, fig(1)[b], in which each node represents a function (dependent on the clamped visible units) of the hidden variables it contains, with p(H|V) being the product of these clique potentials. [sent-40, score-1.353]
23 In order to do inference on such a graph, in general, it is necessary to carry out a message passing type procedure in which messages are first passed one way along the undirected graph, and then back, such as in the forward-backward algorithm in HMMs [5]. [sent-41, score-0.231]
24 Only when messages have been passed along both directions of all links can the normalised conditional hidden unit distribution be numerically determined. [sent-42, score-0.78]
25 The complexity of calculating messages is dominated by marginalisation of the clique functions over a hidden vector h(t). [sent-43, score-0.802]
26 In the case of discrete hidden units with S states, this complexity is of the order S^2, and the total complexity of inference is then O(T S^2). [sent-44, score-0.793]
27 For continuous hidden units, the analogous marginalisation requires integration of a clique function over a hidden vector. [sent-45, score-1.309]
28 If the clique function is very low dimensional, this may be feasible. [sent-46, score-0.16]
29 However, in high dimensions, this is typically intractable unless the clique functions are of a very specific form, such as Gaussians. [sent-47, score-0.16]
30 This motivates the Kalman filter model[5] in which all conditional probability tables are Gaussian with means determined by a linear combination of previous states. [sent-48, score-0.172]
31 (c) Conditioning on the visible variables forms a directed chain in the hidden space which is deterministic. [sent-53, score-0.982]
32 Hidden unit inference can be achieved by forward propagation alone. [sent-54, score-0.215]
33 (d) Integrating out hidden variables gives a cascade style directed visible graph, shown here for only four time steps. [sent-55, score-0.982]
34 The aim is then to be able to consider non-Gaussian and nonlinear conditional probability tables (CPTs), and hence richer dynamics in the hidden space. [sent-57, score-0.833]
35 Whilst the restriction to deterministic CPTs appears severe, the model retains some attractive features: the marginal p(V) is non-Markovian, coupling all the variables in the sequence, see fig(2)[d]. [sent-60, score-0.207]
36 The marginal p(H) is stochastic, whilst hidden unit inference is deterministic, as illustrated in fig(2)[c]. [sent-61, score-0.811]
37 The case of training multiple independently generated sequences V^µ, µ = 1, . . . , P, is a straightforward extension. [sent-64, score-0.199]
38 To maximise the log-likelihood, it is useful to evaluate the derivatives with respect to the model parameters. [sent-68, score-0.083]
39 Hence the derivatives can be calculated by deterministic forward propagation of errors, and highly complex functions f and CPTs p(v(t + 1)|v(t), h(t)) may be used. [sent-70, score-0.268]
40 Whilst the training of such networks resembles back-propagation in neural networks [1, 6], the models have a stochastic interpretation and retain the benefits inherited from probability theory, including the possibility of a Bayesian treatment. [sent-71, score-0.323]
41 3 A Discrete Visible Illustration To make the above framework more explicit, we consider the case of continuous hidden units and discrete, binary visible units, v_i(t) ∈ {0, 1}. [sent-72, score-1.111]
42 In particular, we restrict attention to the model:

p(v(t+1)|v(t), h(t)) = ∏_{i=1}^{V} σ( (2v_i(t+1) − 1) Σ_j w_ij φ_j(t) ),    h_i(t+1) = Σ_j u_ij ψ_j(t)

where σ(x) = 1/(1 + e^{−x}) and φ_j(t) and ψ_j(t) represent fixed functions of the network state (h(t), v(t)). [sent-73, score-0.188]
43 This model generalises a recurrent stochastic heteroassociative Hopfield network[4] to include deterministic hidden units dependent on past network states. [sent-75, score-1.129]
44 (b) The training sequence consists of a random set of vectors (V = 3) over T = 10 time steps. [sent-77, score-0.205]
45 The initial state v(t = 1) for the recalled sequence was set to the correct initial training value albeit with one of the values flipped. [sent-79, score-0.339]
46 Note how the learned dynamics is an attractor for the original sequence. [sent-80, score-0.163]
47 A slice of the network is illustrated in fig(3)[a]. [sent-82, score-0.164]
48 We can easily iterate the hidden states in this case to give

h(t+1) = A h(t) + B v(t) = A^t h(1) + Σ_{t'=0}^{t−1} A^{t'} B v(t − t')

which demonstrates how the hidden state depends on the full past history of the observations. [sent-83, score-1.213]
49 We trained the network using 3 visible units and 7 hidden units to maximise the likelihood of the binary sequence in fig(3)[b]. [sent-84, score-1.516]
50 Note that this sequence contains repeated patterns and therefore could not be recalled perfectly with a model which does not contain hidden units. [sent-85, score-0.802]
51 We tested whether the learned model had captured the dynamics of the training sequence by initialising the network in the first visible state of the training sequence, but with one of the values flipped. [sent-86, score-0.949]
52 The network then generated the following hidden and visible states recursively, as plotted in fig(3)[c]. [sent-87, score-1.087]
53 The learned network is an attractor with the training sequence as a stable point, demonstrating that such models are capable of learning attractor recurrent networks more powerful than those without hidden units. [sent-88, score-1.17]
54 Learning is very fast in such networks, and we have successfully applied these models to cases of several hundred hidden and visible unit dimensions. [sent-89, score-1.075]
55 3.1 Recall Capacity What effect do the hidden units have on the ability of Hopfield networks to recall sequences? [sent-91, score-0.767]
56 By recall, we mean that a training sequence is correctly generated by the network given that only the initial state of the training sequence is presented to the trained network. [sent-92, score-0.557]
57 A sequence x(1), . . . , x(T) can be recalled correctly if we can find a matrix M and real numbers ε_i(t) such that M[x̃(1), . . . , x̃(T − 1)] yields the successor patterns x(2), . . . , x(T), up to the thresholds ε_i(t). [sent-98, score-0.098]
58 The use of hidden units therefore increases the length of temporal sequences that we can store by forming, during learning, appropriate hidden representations h(t) such that the stacked vectors [h(2); v(2)], . . . , [h(T); v(T)] are linearly independent. [sent-109, score-1.337]
59 The reader might consider forming, from a set of linearly dependent patterns v(1), . . . , v(T), non-linearly mapped vectors v̂(1), . . . , v̂(T) that are linearly independent. [sent-115, score-0.161]
60 This would appear to dispense with the need to use hidden units. [sent-119, score-0.586]
61 However, if the same pattern in the training set is repeated at different times in the sequence (as in fig(3)[b]), then no matter how complex this non-linear mapping, the resulting vectors v̂(1), . . . , v̂(T) contain repeats and therefore remain linearly dependent. [sent-120, score-0.237]
62 This demonstrates that hidden units not only solve the linear dependence problem for non-repeated patterns, they also solve it for repeated patterns. [sent-124, score-0.699]
63 They are therefore capable of sequence disambiguation since the hidden unit representations formed are dependent on the full history of the visible units. [sent-125, score-1.233]
64 Nine speakers uttered two Japanese vowels /ae/ successively to form discrete time series with 12 LPC cepstral coefficients. [sent-132, score-0.368]
65 Each utterance forms a time series V whose length is in the range T = 7 to T = 29 and each vector v(t) of the time series contains 12 cepstral coefficients. [sent-133, score-0.332]
66 The training data consists of 30 training utterances for each of the 9 speakers. [sent-134, score-0.221]
67 The test data contains 370 time series, each uttered by one of the nine speakers. [sent-135, score-0.115]
68 We used the special settings f (x) ≡ x and g(x) ≡ x to see if such a simple network would be able to perform well. [sent-137, score-0.111]
69 We split the training data into a 2/3 training and a 1/3 validation part, then trained a set of 10 models for each of the 9 speakers, with hidden unit dimensions taking the values H = 1, 2, . . . , 10, using 20 training iterations of conjugate gradient learning[1]. [sent-141, score-0.089]
71 For simplicity, we used the same number of hidden units for each of the nine speaker models. [sent-142, score-0.772]
72 To classify a test utterance, we chose the speaker model which had the highest likelihood of generating the test utterance, using an error of 0 if the utterance was assigned to the correct speaker and an error of 1 otherwise. [sent-143, score-0.221]
73 (Right) Five sequences from the model v(t) = sin(5(t − 1) + ε_3(t)) + 0.1 ε_4(t), where the ε_i(t) are zero mean unit variance Gaussian noise samples. [sent-147, score-0.091]
75 These were combined to form a training set of 10 unlabelled sequences. [sent-148, score-0.138]
76 We performed unsupervised learning by fitting a two component mixture model. [sent-149, score-0.083]
77 The posterior probabilities p(i = 1|V^µ) of the 5 sequences on the left, and p(i = 2|V^µ) of the 5 sequences on the right, show that each sequence is assigned to its generating class with high confidence. [sent-155, score-0.146]
79 Based on these validation results, we retrained a model with H = 3 hidden units on all available training data. [sent-162, score-0.795]
80 This gave a test error comparable to that reported for training using a continuous-output HMM with 5 (discrete) hidden states[8]. [sent-166, score-0.614]
81 Although our model is not powerful enough to reconstruct the training data exactly, it does learn sufficient information in the data to make reliable classifications. [sent-167, score-0.134]
82 An interesting alternative training method not explored here would be to use discriminative learning[7]. [sent-169, score-0.129]
83 Also not explored here is the possibility of using Bayesian methods to set the number of hidden dimensions. [sent-170, score-0.565]
84 Training mixture models by maximum likelihood on a set of sequences V^1, . . . , V^P is also straightforward in this framework. [sent-172, score-0.251]
85 The data is a set of 10 one dimensional (V = 1) time series, each of length T = 40. [sent-176, score-0.158]
86 Two distinct models were used to generate 10 training sequences, see fig(4). [sent-177, score-0.142]
87 We fitted a two component mixture model using mixture components of the form (9) (with linear functions f and g), each component having H = 3 hidden units. [sent-178, score-0.659]
88 It was satisfying to find that the separation of the unlabelled training sequences is entirely consistent with the data generation process, see fig(4). [sent-181, score-0.248]
89 An interesting observation is that, whilst the true data generating process is governed by effectively stochastic hidden transitions, the deterministic hidden model still performs admirably. [sent-182, score-1.368]
90 6 Discussion We have considered a class of models for temporal sequence processing which are a specially constrained version of Dynamic Bayesian Networks. [sent-183, score-0.245]
91 The constraint was chosen to ensure that inference would be trivial even in high dimensional continuous hidden/latent spaces. [sent-184, score-0.128]
92 Highly complex dynamics may therefore be postulated for the hidden space transitions, and also for the hidden to the visible transitions. [sent-185, score-1.558]
93 However, unlike traditional neural networks the models remain probabilistic (generative models), and hence the full machinery of Bayesian inference is applicable to this class of models. [sent-186, score-0.205]
94 Indeed, whilst not explored here, model selection issues, such as assessing the relevant hidden unit dimension, are greatly facilitated in this class of models. [sent-187, score-0.761]
95 An area we are currently investigating is using these models for fast inference and learning in Independent Component Analysis and related areas. [sent-189, score-0.143]
96 In the case that the hidden unit dynamics is known to be highly stochastic, this class of models is arguably less appropriate. [sent-190, score-0.803]
97 However, stochastic hidden dynamics is often used in cases where one believes that the true hidden dynamics is too complex to model effectively (or, rather, deal with computationally) and one uses noise to ‘cover’ for the lack of complexity in the assumed hidden dynamics. [sent-191, score-1.836]
98 The models outlined here provide an alternative in the case that a potentially complex hidden dynamics form can be assumed, and may also still provide a reasonable solution even in cases where the underlying hidden dynamics is stochastic. [sent-192, score-1.339]
99 This class of models is therefore a potential route to computationally tractable, yet powerful time series models. [sent-193, score-0.177]
100 Juang, An introduction to hidden Markov models, IEEE Transactions on Acoustics Speech, Signal Processing 3 (1986), no. [sent-234, score-0.525]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 73 nips-2002-Dynamic Bayesian Networks with Deterministic Latent Tables
Author: David Barber
Abstract: The application of latent/hidden variable Dynamic Bayesian Networks is constrained by the complexity of marginalising over latent variables. For this reason either small latent dimensions or Gaussian latent conditional tables linearly dependent on past states are typically considered in order that inference is tractable. We suggest an alternative approach in which the latent variables are modelled using deterministic conditional probability tables. This specialisation has the advantage of tractable inference even for highly complex non-linear/non-Gaussian visible conditional probability tables. This approach enables the consideration of highly complex latent dynamics whilst retaining the benefits of a tractable probabilistic model. 1
2 0.24530193 129 nips-2002-Learning in Spiking Neural Assemblies
Author: David Barber
Abstract: We consider a statistical framework for learning in a class of networks of spiking neurons. Our aim is to show how optimal local learning rules can be readily derived once the neural dynamics and desired functionality of the neural assembly have been specified, in contrast to other models which assume (sub-optimal) learning rules. Within this framework we derive local rules for learning temporal sequences in a model of spiking neurons and demonstrate its superior performance to correlation (Hebbian) based approaches. We further show how to include mechanisms such as synaptic depression and outline how the framework is readily extensible to learning in networks of highly complex spiking neurons. A stochastic quantal vesicle release mechanism is considered and implications on the complexity of learning discussed. 1
3 0.16039565 164 nips-2002-Prediction of Protein Topologies Using Generalized IOHMMs and RNNs
Author: Gianluca Pollastri, Pierre Baldi, Alessandro Vullo, Paolo Frasconi
Abstract: We develop and test new machine learning methods for the prediction of topological representations of protein structures in the form of coarse- or fine-grained contact or distance maps that are translation and rotation invariant. The methods are based on generalized input-output hidden Markov models (GIOHMMs) and generalized recursive neural networks (GRNNs). The methods are used to predict topology directly in the fine-grained case and, in the coarsegrained case, indirectly by first learning how to score candidate graphs and then using the scoring function to search the space of possible configurations. Computer simulations show that the predictors achieve state-of-the-art performance.
4 0.1169566 25 nips-2002-An Asynchronous Hidden Markov Model for Audio-Visual Speech Recognition
Author: Samy Bengio
Abstract: This paper presents a novel Hidden Markov Model architecture to model the joint probability of pairs of asynchronous sequences describing the same event. It is based on two other Markovian models, namely Asynchronous Input/ Output Hidden Markov Models and Pair Hidden Markov Models. An EM algorithm to train the model is presented, as well as a Viterbi decoder that can be used to obtain the optimal state sequence as well as the alignment between the two sequences. The model has been tested on an audio-visual speech recognition task using the M2VTS database and yielded robust performances under various noise conditions. 1
5 0.10807092 93 nips-2002-Forward-Decoding Kernel-Based Phone Recognition
Author: Shantanu Chakrabartty, Gert Cauwenberghs
Abstract: Forward decoding kernel machines (FDKM) combine large-margin classifiers with hidden Markov models (HMM) for maximum a posteriori (MAP) adaptive sequence estimation. State transitions in the sequence are conditioned on observed data using a kernel-based probability model trained with a recursive scheme that deals effectively with noisy and partially labeled data. Training over very large data sets is accomplished using a sparse probabilistic support vector machine (SVM) model based on quadratic entropy, and an on-line stochastic steepest descent algorithm. For speaker-independent continuous phone recognition, FDKM trained over 177 ,080 samples of the TlMIT database achieves 80.6% recognition accuracy over the full test set, without use of a prior phonetic language model.
6 0.10343777 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions
7 0.097330637 74 nips-2002-Dynamic Structure Super-Resolution
8 0.093649164 181 nips-2002-Self Supervised Boosting
9 0.090196185 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
10 0.088480234 116 nips-2002-Interpreting Neural Response Variability as Monte Carlo Sampling of the Posterior
11 0.088143021 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
12 0.085372008 51 nips-2002-Classifying Patterns of Visual Motion - a Neuromorphic Approach
13 0.085191056 69 nips-2002-Discriminative Learning for Label Sequences via Boosting
14 0.0849953 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition
15 0.079799742 206 nips-2002-Visual Development Aids the Acquisition of Motion Velocity Sensitivities
16 0.076980777 124 nips-2002-Learning Graphical Models with Mercer Kernels
17 0.07669577 157 nips-2002-On the Dirichlet Prior and Bayesian Regularization
18 0.076123133 199 nips-2002-Timing and Partial Observability in the Dopamine System
19 0.075496949 94 nips-2002-Fractional Belief Propagation
20 0.072938398 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond
simIndex simValue paperId paperTitle
same-paper 1 0.98112154 73 nips-2002-Dynamic Bayesian Networks with Deterministic Latent Tables
Author: David Barber
2 0.73844022 129 nips-2002-Learning in Spiking Neural Assemblies
Author: David Barber
3 0.67987162 164 nips-2002-Prediction of Protein Topologies Using Generalized IOHMMs and RNNs
Author: Gianluca Pollastri, Pierre Baldi, Alessandro Vullo, Paolo Frasconi
Abstract: We develop and test new machine learning methods for the prediction of topological representations of protein structures in the form of coarse- or fine-grained contact or distance maps that are translation and rotation invariant. The methods are based on generalized input-output hidden Markov models (GIOHMMs) and generalized recursive neural networks (GRNNs). The methods are used to predict topology directly in the fine-grained case and, in the coarsegrained case, indirectly by first learning how to score candidate graphs and then using the scoring function to search the space of possible configurations. Computer simulations show that the predictors achieve state-of-the-art performance. 1 Introduction: Protein Topology Prediction Predicting the 3D structure of protein chains from the linear sequence of amino acids is a fundamental open problem in computational molecular biology [1]. Any approach to the problem must deal with the basic fact that protein structures are translation and rotation invariant. To address this invariance, we have proposed a machine learning approach to protein structure prediction [4] based on the prediction of topological representations of proteins, in the form of contact or distance maps. The contact or distance map is a 2D representation of neighborhood relationships consisting of an adjacency matrix at some distance cutoff (typically in the range of 6 to 12 ˚), or a matrix of pairwise Euclidean distances. Fine-grained maps A are derived at the amino acid or even atomic level. Coarse maps are obtained by looking at secondary structure elements, such as helices, and the distance between their centers of gravity or, as in the simulations below, the minimal distances between their Cα atoms. Reasonable methods for reconstructing 3D coordinates from contact/distance maps have been developed in the NMR literature and elsewhere Oi B Hi F Hi Ii Figure 1: Bayesian network for bidirectional IOHMMs consisting of input units, output units, and both forward and backward Markov chains of hidden states. [14] using distance geometry and stochastic optimization techniques. Thus the main focus here is on the more difficult task of contact map prediction. Various algorithms for the prediction of contact maps have been developed, in particular using feedforward neural networks [6]. The best contact map predictor in the literature and at the last CASP prediction experiment reports an average precision [True Positives/(True Positives + False Positives)] of 21% for distant contacts, i.e. with a linear distance of 8 amino acid or more [6] for fine-grained amino acid maps. While this result is encouraging and well above chance level by a factor greater than 6, it is still far from providing sufficient accuracy for reliable 3D structure prediction. A key issue in this area is the amount of noise that can be tolerated in a contact map prediction without compromising the 3D-reconstruction step. While systematic tests in this area have not yet been published, preliminary results appear to indicate that recovery of as little as half of the distant contacts may suffice for proper reconstruction, at least for proteins up to 150 amino acid long (Rita Casadio and Piero Fariselli, private communication and oral presentation during CASP4 [10]). 
It is important to realize that the input to a fine-grained contact map predictor need not be confined to the sequence of amino acids only, but may also include evolutionary information in the form of profiles derived by multiple alignment of homologue proteins, or structural feature information, such as secondary structure (alpha helices, beta strands, and coils), or solvent accessibility (surface/buried), derived by specialized predictors [12, 13]. In our approach, we use different GIOHMM and GRNN strategies to predict both structural features and contact maps. 2 GIOHMM Architectures Loosely speaking, GIOHMMs are Bayesian networks with input, hidden, and output units that can be used to process complex data structures such as sequences, images, trees, chemical compounds and so forth, built on work in, for instance, [5, 3, 7, 2, 11]. In general, the connectivity of the graphs associated with the hidden units matches the structure of the data being processed. Often multiple copies of the same hidden graph, but with different edge orientations, are used in the hidden layers to allow direct propagation of information in all relevant directions. Output Plane NE NW 4 Hidden Planes SW SE Input Plane Figure 2: 2D GIOHMM Bayesian network for processing two-dimensional objects such as contact maps, with nodes regularly arranged in one input plane, one output plane, and four hidden planes. In each hidden plane, nodes are arranged on a square lattice, and all edges are oriented towards the corresponding cardinal corner. Additional directed edges run vertically in column from the input plane to each hidden plane, and from each hidden plane to the output plane. To illustrate the general idea, a first example of GIOHMM is provided by the bidirectional IOHMMs (Figure 1) introduced in [2] to process sequences and predict protein structural features, such as secondary structure. Unlike standard HMMs or IOHMMS used, for instance in speech recognition, this architecture is based on two hidden markov chains running in opposite directions to leverage the fact that biological sequences are spatial objects rather than temporal sequences. Bidirectional IOHMMs have been used to derive a suite of structural feature predictors [12, 13, 4] available through http://promoter.ics.uci.edu/BRNN-PRED/. These predictors have accuracy rates in the 75-80% range on a per amino acid basis. 2.1 Direct Prediction of Topology To predict contact maps, we use a 2D generalization of the previous 1D Bayesian network. The basic version of this architecture (Figures 2) contains 6 layers of units: input, output, and four hidden layers, one for each cardinal corner. Within each column indexed by i and j, connections run from the input to the four hidden units, and from the four hidden units to the output unit. In addition, the hidden units in each hidden layer are arranged on a square or triangular lattice, with all the edges oriented towards the corresponding cardinal corner. Thus the parameters of this two-dimensional GIOHMMs, in the square lattice case, are the conditional probability distributions: NE NW SW SE P (Oi |Ii,j , Hi,j , Hi,j , Hi,j , Hi,j, ) NE NE NE P (Hi,j |Ii,j , Hi−1,j , Hi,j−1 ) N NW NW P (Hi,jW |Ii,j , Hi+1,j , Hi,j−1 ) SW SW SW P (Hi,j |Ii,j , Hi+1,j , Hi,j+1 ) SE SE SE P (Hi,j |Ii,j , Hi−1,j , Hi,j+1 ) (1) In a contact map prediction at the amino acid level, for instance, the (i, j) output represents the probability of whether amino acids i and j are in contact or not. 
This prediction depends directly on the (i, j) input and the four-hidden units in the same column, associated with omni-directional contextual propagation in the hidden planes. In the simulations reported below, we use a more elaborated input consisting of a 20 × 20 probability matrix over amino acid pairs derived from a multiple alignment of the given protein sequence and its homologues, as well as the structural features of the corresponding amino acids, including their secondary structure classification and their relative exposure to the solvent, derived from our corresponding predictors. It should be clear how GIOHMM ideas can be generalized to other data structures and problems in many ways. In the case of 3D data, for instance, a standard GIOHMM would have an input cube, an output cube, and up to 8 cubes of hidden units, one for each corner with connections inside each hidden cube oriented towards the corresponding corner. In the case of data with an underlying tree structure, the hidden layers would correspond to copies of the same tree with different orientations and so forth. Thus a fundamental advantage of GIOHMMs is that they can process a wide range of data structures of variable sizes and dimensions. 2.2 Indirect Prediction of Topology Although GIOHMMs allow flexible integration of contextual information over ranges that often exceed what can be achieved, for instance, with fixed-input neural networks, the models described above still suffer from the fact that the connections remain local and therefore long-ranged propagation of information during learning remains difficult. Introduction of large numbers of long-ranged connections is computationally intractable but in principle not necessary since the number of contacts in proteins is known to grow linearly with the length of the protein, and hence connectivity is inherently sparse. The difficulty of course is that the location of the long-ranged contacts is not known. To address this problem, we have developed also a complementary GIOHMM approach described in Figure 3 where a candidate graph structure is proposed in the hidden layers of the GIOHMM, with the two different orientations naturally associated with a protein sequence. Thus the hidden graphs change with each protein. In principle the output ought to be a single unit (Figure 3b) which directly computes a global score for the candidate structure presented in the hidden layer. In order to cope with long-ranged dependencies, however, it is preferable to compute a set of local scores (Figure 3c), one for each vertex, and combine the local scores into a global score by averaging. More specifically, consider a true topology represented by the undirected contact graph G∗ = (V, E ∗ ), and a candidate undirected prediction graph G = (V, E). A global measure of how well E approximates E ∗ is provided by the informationretrieval F1 score defined by the normalized edge-overlap F1 = 2|E ∩ E ∗ |/(|E| + |E ∗ |) = 2P R/(P + R), where P = |E ∩ E ∗ |/|E| is the precision (or specificity) and R = |E ∩ E ∗ |/|E ∗ | is the recall (or sensitivity) measure. Obviously, 0 ≤ F1 ≤ 1 and F1 = 1 if and only if E = E ∗ . The scoring function F1 has the property of being monotone in the sense that if |E| = |E | then F1 (E) < F1 (E ) if and only if |E ∩ E ∗ | < |E ∩ E ∗ |. Furthermore, if E = E ∪ {e} where e is an edge in E ∗ but not in E, then F1 (E ) > F1 (E). Monotonicity is important to guide the search in the space of possible topologies. 
It is easy to check that a simple search algorithm based on F_1 takes on the order of O(|V|^3) steps to find E*, basically by trying all possible edges one after the other. The problem then is to learn F_1, or rather a good approximation to F_1.

Figure 3: Indirect prediction of contact maps. (a) Target contact graph to be predicted. (b) GIOHMM with two hidden layers: the two hidden layers correspond to two copies of the same candidate graph, oriented in opposite directions from one end of the protein to the other. The single output O is the global score of how well the candidate graph approximates the true contact map. (c) Similar to (b), but with a local score O(v) at each vertex; the local scores can be averaged to produce a global score. In (b) and (c), I(v) represents the input for vertex v, and H^F(v) and H^B(v) are the corresponding hidden variables.

To approximate F_1, we first consider a similar local measure F_v defined over the set E_v of edges adjacent to vertex v, F_v = 2|E_v ∩ E*_v| / (|E_v| + |E*_v|), together with the global average F̄ = Σ_v F_v / |V|. If n and n* are the average degrees of G and G*, it can be shown that:

F_1 = (1/|V|) Σ_v 2|E_v ∩ E*| / (n + n*)    and    F̄ = (1/|V|) Σ_v 2|E_v ∩ E*| / (n + ε_v + n* + ε*_v)    (2)

where n + ε_v (resp. n* + ε*_v) is the degree of v in G (resp. in G*). In particular, if G and G* are regular graphs, then F_1(E) = F̄(E), so that F̄ is a good approximation to F_1. In the contact map regime, where the number of contacts grows linearly with the length of the sequence, we should have in general |E| ≈ |E*| ≈ (1 + α)|V|, so that each node has on average n = n* = 2(1 + α) edges. The value of α depends, of course, on the neighborhood cutoff.

As in reinforcement learning, to learn the scoring function one is faced with the problem of generating good training sets in a high-dimensional space, where the states are the topologies (graphs) and the policies are algorithms for adding a single edge to a given graph. In the simulations we adopt several different strategies, including static and dynamic generation. Within dynamic generation we use three exploration strategies: random exploration (the successor graph is chosen at random), pure exploitation (the successor graph maximizes the current scoring function), and semi-uniform exploitation, which balances exploration and exploitation [with probability ε (resp. 1 − ε) we choose random exploration (resp. pure exploitation)].

3 GRNN Architectures

Inference and learning in the protein GIOHMMs we have described are computationally intensive due to the large number of undirected loops they contain. This problem can be addressed using a neural network reparameterization that assumes: (a) all the nodes in the graphs are associated with a deterministic vector (note that in the case of the output nodes this vector can represent a probability distribution, so that the overall model remains probabilistic); (b) each vector is a deterministic function of its parents; (c) each function is parameterized using a neural network (or some other class of approximators); and (d) weight sharing, or stationarity, is used between similar neural networks in the model.
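Before turning to the concrete GRNN parameterization, a brief sketch of the per-vertex scores F_v and their average F̄ of Eq. (2), reusing the toy edge sets from the previous sketch; again this is illustrative code, not the authors' implementation.

```python
# Hedged sketch of the per-vertex scores F_v and their average F-bar from Eq. (2).
from collections import defaultdict

def adjacent_edges(E):
    """Map each vertex v to the set E_v of edges of E incident to v."""
    by_vertex = defaultdict(set)
    for e in E:
        for v in e:
            by_vertex[v].add(e)
    return by_vertex

def f_bar(E, E_star, vertices):
    """Average of F_v = 2|E_v ∩ E*_v| / (|E_v| + |E*_v|) over all vertices."""
    adj, adj_star = adjacent_edges(E), adjacent_edges(E_star)
    scores = []
    for v in vertices:
        Ev, Ev_star = adj[v], adj_star[v]
        denom = len(Ev) + len(Ev_star)
        scores.append(2 * len(Ev & Ev_star) / denom if denom else 1.0)
    return sum(scores) / len(scores)

V = range(1, 5)
true_E = {frozenset(e) for e in [(1, 2), (2, 3), (3, 4)]}
cand_E = {frozenset(e) for e in [(1, 2), (2, 4)]}
print(f_bar(cand_E, true_E, V))   # 0.375 for these toy graphs
```

On these small, non-regular toy graphs F̄ only approximates the global F1 (0.375 versus 0.4); for regular graphs the two coincide, as noted above.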
In the 2D GIOHMM contact map predictor, for example, this reparameterization uses a total of five neural networks to recursively compute the four hidden states and the output in each column:

O_{i,j} = N_O(I_{i,j}, H^NE_{i,j}, H^NW_{i,j}, H^SW_{i,j}, H^SE_{i,j})
H^NE_{i,j} = N_NE(I_{i,j}, H^NE_{i-1,j}, H^NE_{i,j-1})
H^NW_{i,j} = N_NW(I_{i,j}, H^NW_{i+1,j}, H^NW_{i,j-1})
H^SW_{i,j} = N_SW(I_{i,j}, H^SW_{i+1,j}, H^SW_{i,j+1})
H^SE_{i,j} = N_SE(I_{i,j}, H^SE_{i-1,j}, H^SE_{i,j+1})    (3)

In the NE plane, for instance, the boundary conditions are set to H^NE_{i,j} = 0 for i = 0 or j = 0. The activity vector associated with the hidden unit H^NE_{i,j} depends on the local input I_{i,j} and on the activity vectors of the units H^NE_{i-1,j} and H^NE_{i,j-1}. Activity in the NE plane can therefore be propagated row by row, from West to East within each row and from the first row to the last (South to North), or column by column, from South to North within each column and from the first column to the last. These GRNN architectures can be trained by gradient descent by unfolding the structures in space, leveraging the acyclic nature of the underlying GIOHMMs.

4 Data

Many data sets are available or can be constructed for training and testing purposes, as described in the references. The data sets used in the present simulations are extracted from the publicly available Protein Data Bank (PDB) and then redundancy reduced, or from the non-homologous subset of PDB Select (ftp://ftp.embl-heidelberg.de/pub/databases/). In addition, we typically exclude structures with poor resolution (worse than 2.5-3 Å), sequences containing fewer than 30 amino acids, and structures containing multiple sequences or sequences with chain breaks. For coarse contact maps, we use the DSSP program [9] (CMBI version) to assign secondary structures, and we also remove sequences for which DSSP crashes. The results we report for fine-grained contact maps are derived using 424 proteins with lengths in the 30-200 range for training and an additional non-homologous set of 48 proteins in the same length range for testing. For the coarse contact maps, we use a set of 587 proteins of length less than 300. Because the average length of a secondary structure element is slightly above 7, the size of a coarse map is roughly 2% of the size of the corresponding amino acid map.

5 Simulation Results and Conclusions

We have trained several 2D GIOHMM/GRNN models on the direct prediction of fine-grained contact maps. Training a single model typically takes on the order of a week on a fast workstation. A sample of validation results is reported in Table 1 for four different distance cutoffs.

Table 1: Direct prediction of amino acid contact maps. Column 1: four distance cutoffs. Columns 2, 3, and 4: overall percentages of amino acids correctly classified as contacts, non-contacts, and in total. Column 5: precision percentage for distant contacts (|i − j| ≥ 8) with a threshold of 0.5. Single-model results, except for the last line, which corresponds to an ensemble of 5 models.

Cutoff   Contact   Non-Contact   Total   Precision (P)
6 Å      .714      .998          .985    .594
8 Å      .638      .998          .970    .670
10 Å     .512      .993          .931    .557
12 Å     .433      .987          .878    .549
12 Å     .445      .990          .883    .717

Overall percentages of correctly predicted contacts and non-contacts at all linear distances, as well as precision results for distant contacts (|i − j| ≥ 8), are reported for a single GIOHMM/GRNN model. The model has k = 14 hidden units in the hidden and output layers of the four hidden networks, as well as in the hidden layer of the output network.
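Returning to the GRNN parameterization of Eq. (3), the sketch below implements one recursive forward pass. The single-layer tanh networks, the sigmoid output, and the feature dimensions are assumptions made for illustration; only the four weight-shared plane networks, the zero boundary conditions, and sweep orders that visit parents before children are taken from the text.

```python
# Minimal numpy sketch of the recursive forward pass in Eq. (3); within each
# plane one network is reused at every position (weight sharing / stationarity).
import numpy as np

rng = np.random.default_rng(0)

def layer(in_dim, out_dim):
    """A single affine layer with randomly initialised weights."""
    W = rng.normal(scale=0.1, size=(out_dim, in_dim))
    b = np.zeros(out_dim)
    return lambda x: W @ x + b

def grnn_forward(I, k=14):
    """I: (L, L, d) array of per-pair input features; returns (L, L) contact probabilities."""
    L, _, d = I.shape
    planes = ("NE", "NW", "SW", "SE")
    hidden_net = {c: layer(d + 2 * k, k) for c in planes}   # N_NE, N_NW, N_SW, N_SE
    output_net = layer(d + 4 * k, 1)                        # N_O
    H = {c: np.zeros((L, L, k)) for c in planes}
    # In-plane parents of (i, j), as in Eq. (3); out-of-lattice parents are the
    # zero vector (boundary condition H = 0).
    parent = {"NE": ((-1, 0), (0, -1)), "NW": ((+1, 0), (0, -1)),
              "SW": ((+1, 0), (0, +1)), "SE": ((-1, 0), (0, +1))}
    fwd, bwd = range(L), range(L - 1, -1, -1)
    sweep = {"NE": (fwd, fwd), "NW": (bwd, fwd), "SW": (bwd, bwd), "SE": (fwd, bwd)}
    zero = np.zeros(k)
    for c in planes:
        rows, cols = sweep[c]
        for i in rows:
            for j in cols:
                ps = [H[c][i + di, j + dj]
                      if 0 <= i + di < L and 0 <= j + dj < L else zero
                      for di, dj in parent[c]]
                H[c][i, j] = np.tanh(hidden_net[c](np.concatenate([I[i, j], *ps])))
    O = np.empty((L, L))
    for i in range(L):
        for j in range(L):
            x = np.concatenate([I[i, j]] + [H[c][i, j] for c in planes])
            O[i, j] = 1.0 / (1.0 + np.exp(-output_net(x)[0]))   # P(contact at (i, j))
    return O

contact_probs = grnn_forward(rng.normal(size=(8, 8, 20)))   # toy 8-residue example
print(contact_probs.shape)                                  # (8, 8)
```

Because the unfolded computation graph is acyclic, the same sweeps can be reversed for backpropagation, which is how these architectures are trained by gradient descent.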
In the last row of Table 1, we also report, as an example, the results obtained at 12 Å by an ensemble of 5 networks with k = 11, 12, 13, 14, and 15. Note that the precision for distant contacts exceeds all previously reported results and is well above 50%.

For the prediction of coarse-grained contact maps, we use the indirect GIOHMM/GRNN strategy and compare different exploration/exploitation strategies: random exploration, pure exploitation, and their convex combination (semi-uniform exploitation). In the semi-uniform case we set the probability of random uniform exploration to ε = 0.4. In addition, we also try a fourth, hybrid strategy in which the search proceeds greedily (i.e., the best successor is chosen at each step, as in pure exploitation), but the network is trained by randomly sub-sampling the successors of the current state. Eight numerical features encode the input label of each node: a one-hot encoding of the secondary structure classes; normalized linear distances from the N and C termini; and the average, maximum, and minimum hydrophobic character of the segment (based on the Kyte-Doolittle scale with a moving window of length 7). A sample of results obtained with 5-fold cross-validation is shown in Table 2. Hidden state vectors have dimension k = 5, with no hidden layers. For each strategy we measure performance by means of several indices: micro- and macro-averaged precision (mP, MP), recall (mR, MR), and F1 measure (mF1, MF1). Micro-averages are derived from each pair of secondary structure elements in each protein, whereas macro-averages are obtained on a per-protein basis, by first computing precision and recall for each protein and then averaging over the set of all proteins. In addition, we also measure the micro and macro averages of specificity, in the sense of the percentage of correct predictions for non-contacts (mP(nc), MP(nc)). Note the tradeoffs between precision and recall across the training methods, with the hybrid method achieving the best F1 results.

Table 2: Indirect prediction of coarse contact maps with dynamic sampling.

Strategy              mP     mP(nc)   mR     mF1    MP     MP(nc)   MR     MF1
Random exploration    .715   .769     .418   .518   .767   .709     .469   .574
Semi-uniform          .454   .787     .631   .526   .507   .767     .702   .588
Pure exploitation     .431   .806     .726   .539   .481   .793     .787   .596
Hybrid                .417   .834     .790   .546   .474   .821     .843   .607

We have presented two approaches, based on a very general IOHMM/RNN framework, that achieve state-of-the-art performance in the prediction of protein contact maps at fine- and coarse-grained levels of resolution. In principle both methods can be applied at both resolution levels, although the indirect prediction is computationally too demanding for fine-grained prediction of large proteins. Several extensions are currently under development, including the integration of these methods into complete 3D structure predictors. While these systems require long training periods, once trained they can rapidly sift through large proteomic data sets.

Acknowledgments

The work of PB and GP is supported by a Laurel Wilkening Faculty Innovation award and awards from NIH, BREP, Sun Microsystems, and the California Institute for Telecommunications and Information Technology. The work of PF and AV is partially supported by a MURST grant.

References

[1] D. Baker and A. Sali. Protein structure prediction and structural genomics. Science, 294:93–96, 2001.
[2] P. Baldi, S. Brunak, P. Frasconi, G. Soda, and G. Pollastri. Exploiting the past and the future in protein secondary structure prediction.
Bioinformatics, 15(11):937–946, 1999.
[3] P. Baldi and Y. Chauvin. Hybrid modeling, HMM/NN architectures, and protein applications. Neural Computation, 8(7):1541–1565, 1996.
[4] P. Baldi and G. Pollastri. Machine learning structural and functional proteomics. IEEE Intelligent Systems, Special Issue on Intelligent Systems in Biology, 17(2), 2002.
[5] Y. Bengio and P. Frasconi. Input-output HMM's for sequence processing. IEEE Trans. on Neural Networks, 7:1231–1249, 1996.
[6] P. Fariselli, O. Olmea, A. Valencia, and R. Casadio. Prediction of contact maps with neural networks and correlated mutations. Protein Engineering, 14:835–843, 2001.
[7] P. Frasconi, M. Gori, and A. Sperduti. A general framework for adaptive processing of data structures. IEEE Trans. on Neural Networks, 9:768–786, 1998.
[8] Z. Ghahramani and M. I. Jordan. Factorial hidden Markov models. Machine Learning, 29:245–273, 1997.
[9] W. Kabsch and C. Sander. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22:2577–2637, 1983.
[10] A. M. Lesk, L. Lo Conte, and T. J. P. Hubbard. Assessment of novel fold targets in CASP4: predictions of three-dimensional structures, secondary structures, and interresidue contacts. Proteins, 45, S5:98–118, 2001.
[11] G. Pollastri and P. Baldi. Prediction of contact maps by GIOHMMs and recurrent neural networks using lateral propagation from all four cardinal corners. Proceedings of the 2002 ISMB (Intelligent Systems for Molecular Biology) Conference. Bioinformatics, 18, S1:62–70, 2002.
[12] G. Pollastri, D. Przybylski, B. Rost, and P. Baldi. Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins, 47:228–235, 2002.
[13] G. Pollastri, P. Baldi, P. Fariselli, and R. Casadio. Prediction of coordination number and relative solvent accessibility in proteins. Proteins, 47:142–153, 2002.
[14] M. Vendruscolo, E. Kussell, and E. Domany. Recovery of protein structure from contact maps. Folding and Design, 2:295–306, 1997.
4 0.56553221 7 nips-2002-A Hierarchical Bayesian Markovian Model for Motifs in Biopolymer Sequences
Author: Eric P. Xing, Michael I. Jordan, Richard M. Karp, Stuart Russell
Abstract: We propose a dynamic Bayesian model for motifs in biopolymer sequences which captures rich biological prior knowledge and positional dependencies in motif structure in a principled way. Our model posits that the position-specific multinomial parameters for monomer distribution are distributed as a latent Dirichlet-mixture random variable, and the position-specific Dirichlet component is determined by a hidden Markov process. Model parameters can be fit on training motifs using a variational EM algorithm within an empirical Bayesian framework. Variational inference is also used for detecting hidden motifs. Our model improves over previous models that ignore biological priors and positional dependence. It has much higher sensitivity to motifs during detection and a notable ability to distinguish genuine motifs from false recurring patterns.
5 0.55542558 93 nips-2002-Forward-Decoding Kernel-Based Phone Recognition
Author: Shantanu Chakrabartty, Gert Cauwenberghs
Abstract: Forward decoding kernel machines (FDKM) combine large-margin classifiers with hidden Markov models (HMM) for maximum a posteriori (MAP) adaptive sequence estimation. State transitions in the sequence are conditioned on observed data using a kernel-based probability model trained with a recursive scheme that deals effectively with noisy and partially labeled data. Training over very large data sets is accomplished using a sparse probabilistic support vector machine (SVM) model based on quadratic entropy, and an on-line stochastic steepest descent algorithm. For speaker-independent continuous phone recognition, FDKM trained over 177,080 samples of the TIMIT database achieves 80.6% recognition accuracy over the full test set, without use of a prior phonetic language model.
6 0.49735656 25 nips-2002-An Asynchronous Hidden Markov Model for Audio-Visual Speech Recognition
7 0.48854151 22 nips-2002-Adaptive Nonlinear System Identification with Echo State Networks
8 0.48265094 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions
9 0.48262307 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition
10 0.39734185 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond
11 0.3827568 150 nips-2002-Multiple Cause Vector Quantization
12 0.36129549 206 nips-2002-Visual Development Aids the Acquisition of Motion Velocity Sensitivities
13 0.36005047 199 nips-2002-Timing and Partial Observability in the Dopamine System
14 0.35996535 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
15 0.35631779 81 nips-2002-Expected and Unexpected Uncertainty: ACh and NE in the Neocortex
16 0.35011661 195 nips-2002-The Effect of Singularities in a Learning Machine when the True Parameters Do Not Lie on such Singularities
17 0.34473896 87 nips-2002-Fast Transformation-Invariant Factor Analysis
18 0.34410515 116 nips-2002-Interpreting Neural Response Variability as Monte Carlo Sampling of the Posterior
19 0.34360421 124 nips-2002-Learning Graphical Models with Mercer Kernels
20 0.3399688 5 nips-2002-A Digital Antennal Lobe for Pattern Equalization: Analysis and Design
topicId topicWeight
[(11, 0.019), (23, 0.022), (42, 0.048), (54, 0.084), (55, 0.039), (68, 0.438), (74, 0.067), (92, 0.035), (98, 0.171)]
simIndex simValue paperId paperTitle
1 0.98053962 171 nips-2002-Reconstructing Stimulus-Driven Neural Networks from Spike Times
Author: Duane Q. Nykamp
Abstract: We present a method to distinguish direct connections between two neurons from common input originating from other, unmeasured neurons. The distinction is computed from the spike times of the two neurons in response to a white noise stimulus. Although the method is based on a highly idealized linear-nonlinear approximation of neural response, we demonstrate via simulation that the approach can work with a more realistic, integrate-and-fire neuron model. We propose that the approach exemplified by this analysis may yield viable tools for reconstructing stimulus-driven neural networks from data gathered in neurophysiology experiments.
same-paper 2 0.8911677 73 nips-2002-Dynamic Bayesian Networks with Deterministic Latent Tables
Author: David Barber
Abstract: The application of latent/hidden variable Dynamic Bayesian Networks is constrained by the complexity of marginalising over latent variables. For this reason either small latent dimensions or Gaussian latent conditional tables linearly dependent on past states are typically considered in order that inference is tractable. We suggest an alternative approach in which the latent variables are modelled using deterministic conditional probability tables. This specialisation has the advantage of tractable inference even for highly complex non-linear/non-Gaussian visible conditional probability tables. This approach enables the consideration of highly complex latent dynamics whilst retaining the benefits of a tractable probabilistic model. 1
3 0.83288902 62 nips-2002-Coulomb Classifiers: Generalizing Support Vector Machines via an Analogy to Electrostatic Systems
Author: Sepp Hochreiter, Michael C. Mozer, Klaus Obermayer
Abstract: We introduce a family of classifiers based on a physical analogy to an electrostatic system of charged conductors. The family, called Coulomb classifiers, includes the two best-known support-vector machines (SVMs), the ν–SVM and the C–SVM. In the electrostatics analogy, a training example corresponds to a charged conductor at a given location in space, the classification function corresponds to the electrostatic potential function, and the training objective function corresponds to the Coulomb energy. The electrostatic framework provides not only a novel interpretation of existing algorithms and their interrelationships, but it suggests a variety of new methods for SVMs including kernels that bridge the gap between polynomial and radial-basis functions, objective functions that do not require positive-definite kernels, regularization techniques that allow for the construction of an optimal classifier in Minkowski space. Based on the framework, we propose novel SVMs and perform simulation studies to show that they are comparable or superior to standard SVMs. The experiments include classification tasks on data which are represented in terms of their pairwise proximities, where a Coulomb Classifier outperformed standard SVMs. 1
4 0.7653923 76 nips-2002-Dynamical Constraints on Computing with Spike Timing in the Cortex
Author: Arunava Banerjee, Alexandre Pouget
Abstract: If the cortex uses spike timing to compute, the timing of the spikes must be robust to perturbations. Based on a recent framework that provides a simple criterion to determine whether a spike sequence produced by a generic network is sensitive to initial conditions, and numerical simulations of a variety of network architectures, we argue within the limits set by our model of the neuron, that it is unlikely that precise sequences of spike timings are used for computation under conditions typically found in the cortex.
5 0.72369677 49 nips-2002-Charting a Manifold
Author: Matthew Brand
Abstract: We construct a nonlinear mapping from a high-dimensional sample space to a low-dimensional vector space, effectively recovering a Cartesian coordinate system for the manifold from which the data is sampled. The mapping preserves local geometric relations in the manifold and is pseudo-invertible. We show how to estimate the intrinsic dimensionality of the manifold from samples, decompose the sample data into locally linear low-dimensional patches, merge these patches into a single lowdimensional coordinate system, and compute forward and reverse mappings between the sample and coordinate spaces. The objective functions are convex and their solutions are given in closed form. 1 Nonlinear dimensionality reduction (NLDR) by charting Charting is the problem of assigning a low-dimensional coordinate system to data points in a high-dimensional sample space. It is presumed that the data lies on or near a lowdimensional manifold embedded in the sample space, and that there exists a 1-to-1 smooth nonlinear transform between the manifold and a low-dimensional vector space. The datamodeler’s goal is to estimate smooth continuous mappings between the sample and coordinate spaces. Often this analysis will shed light on the intrinsic variables of the datagenerating phenomenon, for example, revealing perceptual or configuration spaces. Our goal is to find a mapping—expressed as a kernel-based mixture of linear projections— that minimizes information loss about the density and relative locations of sample points. This constraint is expressed in a posterior that combines a standard gaussian mixture model (GMM) likelihood function with a prior that penalizes uncertainty due to inconsistent projections in the mixture. Section 3 develops a special case where this posterior is unimodal and maximizable in closed form, yielding a GMM whose covariances reveal a patchwork of overlapping locally linear subspaces that cover the manifold. Section 4 shows that for this (or any) GMM and a choice of reduced dimension d, there is a unique, closed-form solution for a minimally distorting merger of the subspaces into a d-dimensional coordinate space, as well as an reverse mapping defining the surface of the manifold in the sample space. The intrinsic dimensionality d of the data manifold can be estimated from the growth process of point-to-point distances. In analogy to differential geometry, we call the subspaces “charts” and their merger the “connection.” Section 5 considers example problems where these methods are used to untie knots, unroll and untwist sheets, and visualize video data. 1.1 Background Topology-neutral NLDR algorithms can be divided into those that compute mappings, and those that directly compute low-dimensional embeddings. The fi has its roots in mapeld ping algorithms: DeMers and Cottrell [3] proposed using auto-encoding neural networks with a hidden layer “ bottleneck,” effectively casting dimensionality reduction as a compression problem. Hastie defi principal curves [5] as nonparametric 1 D curves that pass ned through the center of “ nearby” data points. A rich literature has grown up around properly regularizing this approach and extending it to surfaces. Smola and colleagues [10] analyzed the NLDR problem in the broader framework of regularized quantization methods. More recent advances aim for embeddings: Gomes and Mojsilovic [4] treat manifold completion as an anisotropic diffusion problem, iteratively expanding points until they connect to their neighbors. 
The I SO M AP algorithm [12] represents remote distances as sums of a trusted set of distances between immediate neighbors, then uses multidimensional scaling to compute a low-dimensional embedding that minimally distorts all distances. The locally linear embedding algorithm (LLE) [9] represents each point as a weighted combination of a trusted set of nearest neighbors, then computes a minimally distorting low-dimensional barycentric embedding. They have complementary strengths: I SO M AP handles holes well but can fail if the data hull is nonconvex [12]; and vice versa for LLE [9]. Both offer embeddings without mappings. It has been noted that trusted-set methods are vulnerable to noise because they consider the subset of point-to-point relationships that has the lowest signal-to-noise ratio; small changes to the trusted set can induce large changes in the set of constraints on the embedding, making solutions unstable [1]. In a return to mapping, Roweis and colleagues [8] proposed global coordination— learning a mixture of locally linear projections from sample to coordinate space. They constructed a posterior that penalizes distortions in the mapping, and gave a expectation-maximization (EM) training rule. Innovative use of variational methods highlighted the diffi culty of even hill-climbing their multimodal posterior. Like [2, 7, 6, 8], the method we develop below is a decomposition of the manifold into locally linear neighborhoods. It bears closest relation to global coordination [8], although by a different construction of the problem, we avoid hill-climbing a spiky posterior and instead develop a closed-form solution. 2 Estimating locally linear scale and intrinsic dimensionality . We begin with matrix of sample points Y = [y1 , · · · , yN ], yn ∈ RD populating a Ddimensional sample space, and a conjecture that these points are samples from a manifold M of intrinsic dimensionality d < D. We seek a mapping onto a vector space . G(Y) → X = [x1 , · · · , xN ], xn ∈ Rd and 1-to-1 reverse mapping G−1 (X) → Y such that local relations between nearby points are preserved (this will be formalized below). The map G should be non-catastrophic, that is, without folds: Parallel lines on the manifold in RD should map to continuous smooth non-intersecting curves in Rd . This guarantees that linear operations on X such as interpolation will have reasonable analogues on Y. Smoothness means that at some scale r the mapping from a neighborhood on M to Rd is effectively linear. Consider a ball of radius r centered on a data point and containing n(r) data points. The count n(r) grows as rd , but only at the locally linear scale; the grow rate is inflated by isotropic noise at smaller scales and by embedding curvature at larger scales. . To estimate r, we look at how the r-ball grows as points are added to it, tracking c(r) = d d log n(r) log r. At noise scales, c(r) ≈ 1/D < 1/d, because noise has distributed points in all directions with equal probability. At the scale at which curvature becomes signifi cant, c(r) < 1/d, because the manifold is no longer perpendicular to the surface of the ball, so the ball does not have to grow as fast to accommodate new points. At the locally linear scale, the process peaks at c(r) = 1/d, because points are distributed only in the directions of the manifold’s local tangent space. 
The maximum of c(r) therefore gives an estimate of both the scale and the local dimensionality of the manifold (see fi gure 1), provided that the ball hasn’t expanded to a manifold boundary— boundaries have lower dimension than Scale behavior of a 1D manifold in 2-space Point−count growth process on a 2D manifold in 3−space 1 10 radial growth process 1D hypothesis 2D hypothesis 3D hypothesis radius (log scale) samples noise scale locally linear scale curvature scale 0 10 2 1 10 2 10 #points (log scale) 3 10 Figure 1: Point growth processes. L EFT: At the locally linear scale, the number of points in an r-ball grows as rd ; at noise and curvature scales it grows faster. R IGHT: Using the point-count growth process to fi the intrinsic dimensionality of a 2D manifold nonlinearly nd embedded in 3-space (see fi gure 2). Lines of slope 1/3 , 1/2 , and 1 are fi tted to sections of the log r/ log nr curve. For neighborhoods of radius r ≈ 1 with roughly n ≈ 10 points, the slope peaks at 1/2 indicating a dimensionality of d = 2. Below that, the data appears 3 D because it is dominated by noise (except for n ≤ D points); above, the data appears >2 D because of manifold curvature. As the r-ball expands to cover the entire data-set the dimensionality appears to drop to 1 as the process begins to track the 1D edges of the 2D sheet. the manifold. For low-dimensional manifolds such as sheets, the boundary submanifolds (edges and corners) are very small relative to the full manifold, so the boundary effect is typically limited to a small rise in c(r) as r approaches the scale of the entire data set. In practice, our code simply expands an r-ball at every point and looks for the fi peak in rst c(r), averaged over many nearby r-balls. One can estimate d and r globally or per-point. 3 Charting the data In the charting step we fi a soft partitioning of the data into locally linear low-dimensional nd neighborhoods, as a prelude to computing the connection that gives the global lowdimensional embedding. To minimize information loss in the connection, we require that the data points project into a subspace associated with each neighborhood with (1) minimal loss of local variance and (2) maximal agreement of the projections of nearby points into nearby neighborhoods. Criterion (1) is served by maximizing the likelihood function of a Gaussian mixture model (GMM) density fi tted to the data: . p(yi |µ, Σ) = ∑ j p(yi |µ j , Σ j ) p j = ∑ j N (yi ; µ j , Σ j ) p j . (1) Each gaussian component defi a local neighborhood centered around µ j with axes denes fi ned by the eigenvectors of Σ j . The amount of data variance along each axis is indicated by the eigenvalues of Σ j ; if the data manifold is locally linear in the vicinity of the µ j , all but the d dominant eigenvalues will be near-zero, implying that the associated eigenvectors constitute the optimal variance-preserving local coordinate system. To some degree likelihood maximization will naturally realize this property: It requires that the GMM components shrink in volume to fi the data as tightly as possible, which is best achieved by t positioning the components so that they “ pancake” onto locally flat collections of datapoints. However, this state of affairs is easily violated by degenerate (zero-variance) GMM components or components fi tted to overly small enough locales where the data density off the manifold is comparable to density on the manifold (e.g., at the noise scale). Consequently a prior is needed. 
Criterion (2) implies that neighboring partitions should have dominant axes that span similar subspaces, since disagreement (large subspace angles) would lead to inconsistent projections of a point and therefore uncertainty about its location in a low-dimensional coordinate space. The principal insight is that criterion (2) is exactly the cost of coding the location of a point in one neighborhood when it is generated by another neighborhood— the cross-entropy between the gaussian models defi ning the two neighborhoods: D(N1 N2 ) = = dy N (y; µ1 ,Σ1 ) log N (y; µ1 ,Σ1 ) N (y; µ2 ,Σ2 ) (log |Σ−1 Σ2 | + trace(Σ−1 Σ1 ) + (µ2 −µ1 ) Σ−1 (µ2 −µ1 ) − D)/2. (2) 1 2 2 Roughly speaking, the terms in (2) measure differences in size, orientation, and position, respectively, of two coordinate frames located at the means µ1 , µ2 with axes specifi by ed the eigenvectors of Σ1 , Σ2 . All three terms decline to zero as the overlap between the two frames is maximized. To maximize consistency between adjacent neighborhoods, we form . the prior p(µ, Σ) = exp[− ∑i= j mi (µ j )D(Ni N j )], where mi (µ j ) is a measure of co-locality. Unlike global coordination [8], we are not asking that the dominant axes in neighboring charts are aligned— only that they span nearly the same subspace. This is a much easier objective to satisfy, and it contains a useful special case where the posterior p(µ, Σ|Y) ∝ ∑i p(yi |µ, Σ)p(µ, Σ) is unimodal and can be maximized in closed form: Let us associate a gaussian neighborhood with each data-point, setting µi = yi ; take all neighborhoods to be a priori equally probable, setting pi = 1/N; and let the co-locality measure be determined from some local kernel. For example, in this paper we use mi (µ j ) ∝ N (µ j ; µi , σ2 ), with the scale parameter σ specifying the expected size of a neighborhood on the manifold in sample space. A reasonable choice is σ = r/2, so that 2erf(2) > 99.5% of the density of mi (µ j ) is contained in the area around yi where the manifold is expected to be locally linear. With uniform pi and µi , mi (µ j ) and fi xed, the MAP estimates of the GMM covariances are Σi = ∑ mi (µ j ) (y j − µi )(y j − µi ) + (µ j − µi )(µ j − µi ) + Σ j j ∑ mi (µ j ) (3) . j Note that each covariance Σi is dependent on all other Σ j . The MAP estimators for all covariances can be arranged into a set of fully constrained linear equations and solved exactly for their mutually optimal values. This key step brings nonlocal information about the manifold’s shape into the local description of each neighborhood, ensuring that adjoining neighborhoods have similar covariances and small angles between their respective subspaces. Even if a local subset of data points are dense in a direction perpendicular to the manifold, the prior encourages the local chart to orient parallel to the manifold as part of a globally optimal solution, protecting against a pathology noted in [8]. Equation (3) is easily adapted to give a reduced number of charts and/or charts centered on local centroids. 4 Connecting the charts We now build a connection for set of charts specifi as an arbitrary nondegenerate GMM. A ed GMM gives a soft partitioning of the dataset into neighborhoods of mean µk and covariance Σk . The optimal variance-preserving low-dimensional coordinate system for each neighborhood derives from its weighted principal component analysis, which is exactly specifi ed by the eigenvectors of its covariance matrix: Eigendecompose Vk Λk Vk ← Σk with eigen. 
values in descending order on the diagonal of Λk and let Wk = [Id , 0]Vk be the operator . th projecting points into the k local chart, such that local chart coordinate uki = Wk (yi − µk ) . and Uk = [uk1 , · · · , ukN ] holds the local coordinates of all points. Our goal is to sew together all charts into a globally consistent low-dimensional coordinate system. For each chart there will be a low-dimensional affi transform Gk ∈ R(d+1)×d ne that projects Uk into the global coordinate space. Summing over all charts, the weighted average of the projections of point yi into the low-dimensional vector space is W j (y − µ j ) 1 . x|y = ∑ G j j p j|y (y) . xi |yi = ∑ G j ⇒ u ji 1 j p j|y (yi ), (4) where pk|y (y) ∝ pk N (y; µk , Σk ), ∑k pk|y (y) = 1 is the probability that chart k generates point y. As pointed out in [8], if a point has nonzero probabilities in two charts, then there should be affi transforms of those two charts that map the point to the same place in a ne global coordinate space. We set this up as a weighted least-squares problem: . G = [G1 , · · · , GK ] = arg min uki 1 ∑ pk|y (yi )p j|y (yi ) Gk Gk ,G j i −Gj u ji 1 2 . (5) F Equation (5) generates a homogeneous set of equations that determines a solution up to an affi transform of G. There are two solution methods. First, let us temporarily anchor one ne neighborhood at the origin to fi this indeterminacy. This adds the constraint G1 = [I, 0] . x . To solve, defi indicator matrix Fk = [0, · · · , 0, I, 0, · · · , 0] with the identity mane . trix occupying the kth block, such that Gk = GFk . Let the diagonal of Pk = diag([pk|y (y1 ), · · · , pk|y (yN )]) record the per-point posteriors of chart k. The squared error of the connection is then a sum of of all patch-to-anchor and patch-to-patch inconsistencies: . E =∑ (GUk − k U1 0 2 )Pk P1 F + ∑ (GU j − GUk )P j Pk j=k 2 F ; . Uk = Fk Uk 1 . (6) Setting dE /dG = 0 and solving to minimize convex E gives −1 G = ∑ Uk P2 k k ∑ j=k P2 j Uk − ∑ ∑ Uk P2 P2 k 1 Uk P2 P2 U j k j k j=k U1 0 . (7) We now remove the dependence on a reference neighborhood G1 by rewriting equation 5, G = arg min ∑ j=k (GU j − GUk )P j Pk G 2 F = GQ 2 F = trace(GQQ G ) , (8) . where Q = ∑ j=k U j − Uk P j Pk . If we require that GG = I to prevent degenerate solutions, then equation (8) is solved (up to rotation in coordinate space) by setting G to the eigenvectors associated with the smallest eigenvalues of QQ . The eigenvectors can be computed effi ciently without explicitly forming QQ ; other numerical effi ciencies obtain by zeroing any vanishingly small probabilities in each Pk , yielding a sparse eigenproblem. A more interesting strategy is to numerically condition the problem by calculating the trailing eigenvectors of QQ + 1. It can be shown that this maximizes the posterior 2 p(G|Q) ∝ p(Q|G)p(G) ∝ e− GQ F e− G1 , where the prior p(G) favors a mapping G whose unit-norm rows are also zero-mean. This maximizes variance in each row of G and thereby spreads the projected points broadly and evenly over coordinate space. The solutions for MAP charts (equation (5)) and connection (equation (8)) can be applied to any well-fi tted mixture of gaussians/factors1 /PCAs density model; thus large eigenproblems can be avoided by connecting just a small number of charts that cover the data. 1 We thank reviewers for calling our attention to Teh & Roweis ([11]— in this volume), which shows how to connect a set of given local dimensionality reducers in a generalized eigenvalue problem that is related to equation (8). 
LLE, n=5 charting (projection onto coordinate space) charting best Isomap LLE, n=6 LLE, n=7 LLE, n=8 random subset of local charts XYZ view LLE, n=9 LLE, n=10 XZ view data (linked) embedding, XY view XY view original data reconstruction (back−projected coordinate grid) best LLE (regularized) Figure 2: The twisted curl problem. L EFT: Comparison of charting, I SO M AP, & LLE. 400 points are randomly sampled from the manifold with noise. Charting is the only method that recovers the original space without catastrophes (folding), albeit with some shear. R IGHT: The manifold is regularly sampled (with noise) to illustrate the forward and backward projections. Samples are shown linked into lines to help visualize the manifold structure. Coordinate axes of a random selection of charts are shown as bold lines. Connecting subsets of charts such as this will also give good mappings. The upper right quadrant shows various LLE results. At bottom we show the charting solution and the reconstructed (back-projected) manifold, which smooths out the noise. Once the connection is solved, equation (4) gives the forward projection of any point y down into coordinate space. There are several numerically distinct candidates for the backprojection: posterior mean, mode, or exact inverse. In general, there may not be a unique posterior mode and the exact inverse is not solvable in closed form (this is also true of [8]). Note that chart-wise projection defi a complementary density in coordinate space nes px|k (x) = N (x; Gk 0 1 , Gk [Id , 0]Λk [Id , 0] 0 0 0 Gk ). (9) Let p(y|x, k), used to map x into subspace k on the surface of the manifold, be a Dirac delta function whose mean is a linear function of x. Then the posterior mean back-projection is obtained by integrating out uncertainty over which chart generates x: y|x = ∑ pk|x (x) k µk + Wk Gk I 0 + x − Gk 0 1 , (10) where (·)+ denotes pseudo-inverse. In general, a back-projecting map should not reconstruct the original points. Instead, equation (10) generates a surface that passes through the weighted average of the µi of all the neighborhoods in which yi has nonzero probability, much like a principal curve passes through the center of each local group of points. 5 Experiments Synthetic examples: 400 2 D points were randomly sampled from a 2 D square and embedded in 3 D via a curl and twist, then contaminated with gaussian noise. Even if noiselessly sampled, this manifold cannot be “ unrolled” without distortion. In addition, the outer curl is sampled much less densely than the inner curl. With an order of magnitude fewer points, higher noise levels, no possibility of an isometric mapping, and uneven sampling, this is arguably a much more challenging problem than the “ swiss roll” and “ s-curve” problems featured in [12, 9, 8, 1]. Figure 2LEFT contrasts the (unique) output of charting and the best outputs obtained from I SO M AP and LLE (considering all neighborhood sizes between 2 and 20 points). I SO M AP and LLE show catastrophic folding; we had to change LLE’s b. data, yz view c. local charts d. 2D embedding e. 1D embedding 1D ordinate a. data, xy view true manifold arc length Figure 3: Untying a trefoil knot ( ) by charting. 900 noisy samples from a 3 D-embedded 1 D manifold are shown as connected dots in front (a) and side (b) views. A subset of charts is shown in (c). Solving for the 2 D connection gives the “ unknot” in (d). 
After removing some points to cut the knot, charting gives a 1 D embedding which we plot against true manifold arc length in (e); monotonicity (modulo noise) indicates correctness. Three principal degrees of freedom recovered from raw jittered images pose scale expression images synthesized via backprojection of straight lines in coordinate space Figure 4: Modeling the manifold of facial images from raw video. Each row contains images synthesized by back-projecting an axis-parallel straight line in coordinate space onto the manifold in image space. Blurry images correspond to points on the manifold whose neighborhoods contain few if any nearby data points. regularization in order to coax out nondegenerate (>1 D) solutions. Although charting is not designed for isometry, after affi transform the forward-projected points disagree with ne the original points with an RMS error of only 1.0429, lower than the best LLE (3.1423) or best I SO M AP (1.1424, not shown). Figure 2RIGHT shows the same problem where points are sampled regularly from a grid, with noise added before and after embedding. Figure 3 shows a similar treatment of a 1 D line that was threaded into a 3 D trefoil knot, contaminated with gaussian noise, and then “ untied” via charting. Video: We obtained a 1965-frame video sequence (courtesy S. Roweis and B. Frey) of 20 × 28-pixel images in which B.F. strikes a variety of poses and expressions. The video is heavily contaminated with synthetic camera jitters. We used raw images, though image processing could have removed this and other uninteresting sources of variation. We took a 500-frame subsequence and left-right mirrored it to obtain 1000 points in 20 × 28 = 560D image space. The point-growth process peaked just above d = 3 dimensions. We solved for 25 charts, each centered on a random point, and a 3D connection. The recovered degrees of freedom— recognizable as pose, scale, and expression— are visualized in fi gure 4. original data stereographic map to 3D fishbowl charting Figure 5: Flattening a fi shbowl. From the left: Original 2000×2D points; their stereographic mapping to a 3D fi shbowl; its 2D embedding recovered using 500 charts; and the stereographic map. Fewer charts lead to isometric mappings that fold the bowl (not shown). Conformality: Some manifolds can be flattened conformally (preserving local angles) but not isometrically. Figure 5 shows that if the data is fi nely charted, the connection behaves more conformally than isometrically. This problem was suggested by J. Tenenbaum. 6 Discussion Charting breaks kernel-based NLDR into two subproblems: (1) Finding a set of datacovering locally linear neighborhoods (“ charts” ) such that adjoining neighborhoods span maximally similar subspaces, and (2) computing a minimal-distortion merger (“ connection” ) of all charts. The solution to (1) is optimal w.r.t. the estimated scale of local linearity r; the solution to (2) is optimal w.r.t. the solution to (1) and the desired dimensionality d. Both problems have Bayesian settings. By offloading the nonlinearity onto the kernels, we obtain least-squares problems and closed form solutions. This scheme is also attractive because large eigenproblems can be avoided by using a reduced set of charts. The dependence on r, like trusted-set methods, is a potential source of solution instability. 
In practice the point-growth estimate seems fairly robust to data perturbations (to be expected if the data density changes slowly over a manifold of integral Hausdorff dimension), while the use of a soft neighborhood partitioning appears to make charting solutions reasonably stable to variations in r. Eigenvalue stability analyses may prove useful here. Ultimately, we would prefer to integrate r out. In contrast, use of d appears to be a virtue: Unlike other eigenvector-based methods, the best d-dimensional embedding is not merely a linear projection of the best d + 1-dimensional embedding; a unique distortion is found for each value of d that maximizes the information content of its embedding. Why does charting performs well on datasets where the signal-to-noise ratio confounds recent state-of-the-art methods? Two reasons may be adduced: (1) Nonlocal information is used to construct both the system of local charts and their global connection. (2) The mapping only preserves the component of local point-to-point distances that project onto the manifold; relationships perpendicular to the manifold are discarded. Thus charting uses global shape information to suppress noise in the constraints that determine the mapping. Acknowledgments Thanks to J. Buhmann, S. Makar, S. Roweis, J. Tenenbaum, and anonymous reviewers for insightful comments and suggested “ challenge” problems. References [1] M. Balasubramanian and E. L. Schwartz. The IsoMap algorithm and topological stability. Science, 295(5552):7, January 2002. [2] C. Bregler and S. Omohundro. Nonlinear image interpolation using manifold learning. In NIPS–7, 1995. [3] D. DeMers and G. Cottrell. Nonlinear dimensionality reduction. In NIPS–5, 1993. [4] J. Gomes and A. Mojsilovic. A variational approach to recovering a manifold from sample points. In ECCV, 2002. [5] T. Hastie and W. Stuetzle. Principal curves. J. Am. Statistical Assoc, 84(406):502–516, 1989. [6] G. Hinton, P. Dayan, and M. Revow. Modeling the manifolds of handwritten digits. IEEE Trans. Neural Networks, 8, 1997. [7] N. Kambhatla and T. Leen. Dimensionality reduction by local principal component analysis. Neural Computation, 9, 1997. [8] S. Roweis, L. Saul, and G. Hinton. Global coordination of linear models. In NIPS–13, 2002. [9] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, December 22 2000. [10] A. Smola, S. Mika, B. Schölkopf, and R. Williamson. Regularized principal manifolds. Machine Learning, 1999. [11] Y. W. Teh and S. T. Roweis. Automatic alignment of hidden representations. In NIPS–15, 2003. [12] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, December 22 2000.
6 0.68428147 5 nips-2002-A Digital Antennal Lobe for Pattern Equalization: Analysis and Design
7 0.67871517 51 nips-2002-Classifying Patterns of Visual Motion - a Neuromorphic Approach
8 0.63420385 11 nips-2002-A Model for Real-Time Computation in Generic Neural Microcircuits
9 0.63024694 102 nips-2002-Hidden Markov Model of Cortical Synaptic Plasticity: Derivation of the Learning Rule
10 0.62163484 43 nips-2002-Binary Coding in Auditory Cortex
11 0.61858511 50 nips-2002-Circuit Model of Short-Term Synaptic Dynamics
12 0.58263445 148 nips-2002-Morton-Style Factorial Coding of Color in Primary Visual Cortex
13 0.58198607 184 nips-2002-Spectro-Temporal Receptive Fields of Subthreshold Responses in Auditory Cortex
14 0.57873851 199 nips-2002-Timing and Partial Observability in the Dopamine System
15 0.57336873 186 nips-2002-Spike Timing-Dependent Plasticity in the Address Domain
16 0.56919181 180 nips-2002-Selectivity and Metaplasticity in a Unified Calcium-Dependent Model
17 0.56706208 160 nips-2002-Optoelectronic Implementation of a FitzHugh-Nagumo Neural Model
18 0.56460333 128 nips-2002-Learning a Forward Model of a Reflex
19 0.55937564 141 nips-2002-Maximally Informative Dimensions: Analyzing Neural Responses to Natural Signals
20 0.55518496 123 nips-2002-Learning Attractor Landscapes for Learning Motor Primitives