nips nips2005 nips2005-62 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Herbert Jaeger, Mingjie Zhao, Andreas Kolling
Abstract: A standard method to obtain stochastic models for symbolic time series is to train state-emitting hidden Markov models (SE-HMMs) with the Baum-Welch algorithm. Based on observable operator models (OOMs), in the last few months a number of novel learning algorithms for similar purposes have been developed: (1,2) two versions of an ”efficiency sharpening” (ES) algorithm, which iteratively improves the statistical efficiency of a sequence of OOM estimators, (3) a constrained gradient descent ML estimator for transition-emitting HMMs (TE-HMMs). We give an overview on these algorithms and compare them with SE-HMM/EM learning on synthetic and real-life data. 1
Reference: text
sentIndex sentText sentNum sentScore
1 de Abstract A standard method to obtain stochastic models for symbolic time series is to train state-emitting hidden Markov models (SE-HMMs) with the Baum-Welch algorithm. [sent-5, score-0.072]
2 1 Introduction Stochastic symbol sequences with memory effects are frequently modelled by training hidden Markov models with the Baum-Welch variant of the EM algorithm. [sent-8, score-0.21]
3 More specifically, state-emitting HMMs (SE-HMMs) are standardly employed, which emit observable events from hidden states. [sent-9, score-0.082]
4 Known weaknesses of HMM training with Baum-Welch are long runtimes and proneness to getting trapped in local maxima. [sent-10, score-0.07]
5 Over the last few years, an alternative to HMMs has been developed, observable operator models (OOMs). [sent-11, score-0.163]
6 OOMs identify the observable events a of a process with linear observable operators τa acting on a real vector space of predictive states w [1]. [sent-13, score-0.247]
7 A basic learning algorithm for OOMs [2] estimates the observable operators τa by solving a linear system of learning equations. [sent-14, score-0.139]
8 The learning algorithm is constructive, fast and yields asymptotically correct estimates. [sent-15, score-0.055]
9 Two problems that so far prevented OOMs from practical use were (i) poor statistical efficiency, (ii) the possibility that the obtained models might predict negative “probabilities” for some sequences. [sent-16, score-0.099]
10 In this novel approach to learning OOMs from data we iteratively construct a sequence of estimators whose statistical efficiency increases, which led us to call the method efficiency sharpening (ES). [sent-18, score-0.129]
11 Another, somewhat neglected class of stochastic models is transition-emitting HMMs (TEHMMs). [sent-19, score-0.048]
12 TEHMMs are equivalent to OOMs whose operator matrices are non-negative. [sent-24, score-0.088]
13 We have derived an alternative learning constrained log gradient (CLG) algorithm for MOOMs which performs a constrained gradient descent on the log likelihood surface in the log model parameter space of MOOMs. [sent-28, score-0.184]
14 2 Basics of OOMs Let (Ω, A, P, (Xn )n≥0 ) or (Xn ) for short be a discrete-time stochastic process with values in a finite symbol set O = {a1 , . [sent-31, score-0.068]
15 We will consider only stationary processes here for notational simplicity; OOMs can equally model nonstationary processes. [sent-35, score-0.055]
16 An mdimensional OOM for (Xn ) is a structure A = (Rm , (τa )a∈O , w0 ), where each observable operator τa is a real-valued m × m matrix and w0 ∈ Rm is the starting state, provided that for any finite sequence ai0 . [sent-36, score-0.163]
17 Xn = ain ) = 1m τain · · · τai0 w0 , (1) where 1m always denotes a row vector of units of length m (we drop the subscript if it is clear from the context). [sent-42, score-0.106]
18 We will use the shorthand notation a to denote a generic sequence ¯ and τa to denote a concatenation of the corresponding operators in reverse order, which ¯ would condense (1) into P (¯) = 1τa w0 . [sent-43, score-0.269]
19 a ¯ Conversely, if a structure A = (Rm , (τa )a∈O , w0 ) satisfies (i) 1w0 = 1, (ii) 1( τa ) = 1, (iii) ∀¯ ∈ O∗ : 1τa w0 ≥ 0, a ¯ (2) a∈O (where O∗ denotes the set of all finite sequences over O), then there exists a process whose distribution is described by A via (1). [sent-44, score-0.072]
20 The process is stationary iff ( a∈O τa )w0 = w0 . [sent-45, score-0.07]
21 Conditions (i) and (ii) are easy to check, but no efficient criterium is known to decide whether the non-negativity criterium (iii) holds for a structure A (for recent progress in this problem, which is equivalent to a problem of general interest in linear algebra, see [4]). [sent-46, score-0.141]
22 The state wa of an OOM after an initial history a is obtained by normalizing τa w0 to unit ¯ ¯ ¯ component sum via wa = τa w0 /1τa w0 . [sent-48, score-0.242]
23 ¯ ¯ ¯ A fundamental (and nontrivial) theorem for OOMs characterizes equivalence of two ˜ τ ˜ OOMs. [sent-49, score-0.061]
24 By the equivalence theorem, A is equivalent to A if and only if there exists a transformation matrix of size m × m, satisfying 1 = 1, such that τa = τa −1 for all symbols ˜ a. [sent-51, score-0.125]
25 It can be shown [2] that every OOM has characb ¯ terizers of length k for suitably large k. [sent-60, score-0.046]
26 Intuitively, a characterizer “bundles” the length k future distribution into the state vector by projection. [sent-61, score-0.446]
27 ˜ If two equivalent OOMs A, A are related by τa = τa ˜ ˜ it is easy to check that C is a characterizer for A. [sent-62, score-0.431]
28 −1 , and C is a characterizer for A, We conclude this section by explaining the basic learning equations. [sent-63, score-0.4]
29 An analysis of (1) reveals that for any state wa and operator τb from an OOM it holds that ¯ τa wa = P (a|¯)waa , a ¯ ¯ (4) a ¯ where aa is the concatenation of a with a. [sent-64, score-0.329]
30 The vectors wa and P (a|¯)waa thus form an ¯ ¯ ¯ argument-value pair for τa . [sent-65, score-0.121]
31 , al be a finite sequence of finite sequences over ¯ ¯ O, and let V = (wa1 · · · wal ) be the matrix containing the corresponding state vectors. [sent-69, score-0.168]
32 ¯ ¯ Let again C be a m × κ sized characterizer of length k and ¯1 , . [sent-70, score-0.446]
33 Let V = (P (¯i |¯j )) be the κ × l matrix containing the conditional b a b continuation probabilities of the initial sequences aj by the sequences ¯i . [sent-74, score-0.176]
34 A linear operator on Rm is uniquely determined by l ≥ m argument-value pairs provided there are at least m linearly independent argument vectors in these pairs. [sent-79, score-0.081]
35 Thus, if a characterizer C is found such that V = CV has rank m, the operators τa of an OOM characterized by C are uniquely determined by V and the matrices W a via τa = Wa V † = CW a (CV )† , where † denotes the pseudo-inverse. [sent-80, score-0.481]
36 Now, given a training sequence S, the conditional continuation probabilities P (¯i |¯j ), P (a¯i |¯j ) that make up V , W a can be estimated from b a b a ˆ b a ˆ b a S by an obvious counting scheme, yielding estimates P (¯i |¯j ), P (a¯i |¯j ) for making up ˆ ˆ V and W a , respectively. [sent-81, score-0.156]
37 ˆ (5) In words, to learn an OOM from S, first fix a model dimension m, a characterizer C, inˆ ˆ dicative sequences a1 , . [sent-83, score-0.51]
38 This estimation procedure is asymptotically correct in the sense that, if the training data were generated by an m-dimensional OOM in the first place, this generator will almost surely be perfectly recovered as the size ˆ ˆ of training data goes to infinity. [sent-87, score-0.233]
39 The starting state can be recovered from the estimated ˆ operators by exploiting ( a∈O τa )w0 = w0 or directly from C and V (see [2] for details). [sent-89, score-0.057]
40 3 The ES Family of Learning Algorithms All learning algorithms based on (5) are asymptotically correct (which EM algorithms are not, by the way), but their statistical efficiency (model variance) depends crucially on (i) the choice of indicative sequences a1 , . [sent-90, score-0.169]
41 , al and (ii) the characterizer C (assuming that ¯ ¯ the model dimension m is determined by other means, e. [sent-93, score-0.47]
42 We will first address (ii) and describe an iterative scheme to obtain characterizers that lead to a low model variance. [sent-96, score-0.163]
43 First, observe that if some characterizer C is used ˆ ˜ with (5), obtaining a model A, and is an OOM equivalence transformation, then if C = ˆ is an equivalent version of A via . [sent-103, score-0.492]
44 ˜ ˆ C is used with (5), the obtained model A Furthermore, it is easy to see [2] that two characterizers C1 , C2 characterize the same OOM iff C1 V = C2 V . [sent-104, score-0.165]
45 We call two characterizers similar if this holds, and write C1 ∼ C2 . [sent-105, score-0.145]
46 That is, the similarity equivalence class of some characterizer C is the set {C + G|GV = 0, 1G = 0}. [sent-107, score-0.461]
47 Together with the first observation this implies that we may confine our search for “good” characterizers to a single (and arbitrary) such equivalence class of characterizers. [sent-108, score-0.181]
48 ˆ In [2] it is explained that the variance of C V is monotonically tied to ¯i ) wa − C(:, i) 2 , where C(:, i) is the i-th column of C. [sent-110, score-0.145]
49 Given an OOM A = (Rm , (τa )a∈O , w0 ) with an r r induced probability distribution PA , its reverse OOM Ar = (Rm , (τa )a∈O , w0 ) is characr satisfying terized by a probability distribution PA ∀ a0 · · · an ∈ O∗ : PA (a0 · · · an ) = PAr (an · · · a0 ). [sent-127, score-0.188]
50 (7) A reverse OOM can be easily computed from the “forward” OOM as follows. [sent-128, score-0.188]
51 If A = (Rm , (τa )a∈O , w0 ) is an OOM for a stationary process, and w0 has no zero entry, then Ar = (Rm , (Dτa D−1 )a∈O , w0 ) (8) is a reverse OOM to A, where D = diag(w0 ) is a diagonal matrix with w0 on its diagonal. [sent-129, score-0.213]
52 Let Ar = (Rm , (τa )a∈O , w0 ) be b b the reverse OOM to A, which was characterized by C0 . [sent-134, score-0.188]
53 Then it holds that C = (w¯1 · · · w¯κ ) is a characterizer b b b for an OOM equivalent to A. [sent-139, score-0.461]
54 C can effectively be transformed into a characterizer C r for A by C r = r C, where r 1τ¯1 b . [sent-140, score-0.4]
55 1τ¯κ b (9) We call C r the reverse characterizer of A, because it is composed from the states of a reverse OOM to A. [sent-144, score-0.827]
56 (10) To summarize, within a similarity class of characterizers, the one which minimizes model variance is the (unique) reverse characterizer in this class. [sent-146, score-0.612]
57 This analytical finding suggests the following generic, iterative procedure to obtain characterizers that minimize model variance: 1. [sent-148, score-0.15]
58 Choose a model dimension m and a characterizer length k. [sent-150, score-0.484]
59 Terminate when the training log-likelihood of models A(n) appear to settle on a plateau. [sent-162, score-0.094]
60 ˆ The rationale behind this scheme is that the initial model A(0) is obtained essentially from an uninformed, ad hoc characterizer, for which one has to expect a large model variation ˆ ˆ and thus (on the average) a poor A(0) . [sent-163, score-0.118]
61 However, the characterizer C r(1) obtained from ˆ(0) is not uninformed any longer but shaped by a reasonable reverse model. [sent-164, score-0.628]
62 We have developed two specific instantiations of the general ES learning scheme, differentiated by the set of indicative sequences used. [sent-169, score-0.114]
63 , ¯κ , which leads to a computationally very cheap iterated recomputation of (5) with b b updated reverse characterizers. [sent-176, score-0.188]
64 The statistical efficiency of the poor man’s ES algorithm is impaired by the fact that only the counting statistics of subsequences of length 2k are exploited. [sent-178, score-0.178]
65 The other ES instantiation exploits the statistics of all subsequences in the original training string. [sent-179, score-0.097]
66 In each iteration, the current reverse model is run backwards through S and the obtained reverse states are additively collected bottom-up in the nodes of the ST. [sent-182, score-0.402]
67 From the ST nodes the collected states are then harvested into matrices ˆ ˆ corresponding directly to C V and C W a , that is, an explicit computation of the reverse characterizer is not required. [sent-183, score-0.614]
68 This scheme is analytically derived as gradient descent in the model log likelihood surface over the log space of these matrix parameters, observing constraints of non-negativity of these parameters and the general OOM constraints (i) and (ii) from Eqn. [sent-188, score-0.168]
69 sN be the training string and for 1 ≤ k ≤ N define ak = s1 . [sent-195, score-0.131]
70 ¯ b Define for some m-dimensional OOM and a ∈ O σk = 1τ¯k b , ya = 1τ¯k wak ¯ b s σk wak−1 ¯ k =a 1τsk wak−1 ¯ , y0 = max{[ya ]ij }, [ya0 ]i,j = [ya ]i,j /y0 . [sent-202, score-0.102]
71 (2), and λ is a learning rate which here unconventionally appears in the exponent because the gradient descent is carried out in the log parameter space. [sent-204, score-0.127]
72 This update scheme is derived τ+ τ in a way that is unrelated to the derivation of the EM algorithm; to our surprise we found that for λ = 1 (12) is equivalent to the Baum-Welch algorithm for TE-HMMs. [sent-206, score-0.099]
73 For each of (a) and (b), 40 experiments were carried out with freshly constructed generators per experiment; a training string of length 1000 and a test string of length 10000 was produced from each generator. [sent-209, score-0.353]
74 For (c), likewise 40 experiments were carried out with freshly generated training/testing sequences of same lengthes as before; here however the generator was identical for all experiments. [sent-210, score-0.212]
75 For the (d) dataset, after preprocessing which shrunk the number of different symbols to 27, the original string was sorted sentence-wise into a training and a testing string, each of length ∼ 21000 (details in [2]). [sent-212, score-0.241]
76 The following settings were used with the various training methods. [sent-213, score-0.07]
77 (i) The poor man’s ES algorithm was used with a length k = 2 of indicative sequences on all datasets. [sent-214, score-0.235]
78 Two ES iterations were carried out and the model of the last iteration was used to compute the reported log likelihoods. [sent-215, score-0.101]
79 (ii) For the suffix-tree based ES algorithm, on datasets (a) – (c), likewise two ES iterations were done and the model from the iteration with the lowest (reverse) training LL was used for reporting. [sent-216, score-0.17]
80 On dataset (d), 4 ES iterations were called and similarly the model with the best reverse training LL was chosen. [sent-217, score-0.318]
81 Iterations were stopped when two consecutive training LL’s differed by less than 5e-5% or after 100 iterations. [sent-220, score-0.137]
82 Iterations were stopped after 100 steps or if LL’s differed by less than 1e-5%. [sent-222, score-0.067]
83 All computations were done in Matlab on 2 GHz PCs except the HMM training on dataset (d) which was done on a 330 MHz machine (the reported CPU times were scaled by 330/2000 to make them comparable with the other studies). [sent-223, score-0.095]
84 Figure 1 shows the training and testing loglikelihoods as well as the CPU times for all methods and datasets. [sent-224, score-0.101]
85 In each panel, the left y-axis shows log likelihoods for training and testing (testing LL normalized to training stringlength) and the right y-axis measures the log 10 of CPU times. [sent-231, score-0.235]
86 HMM models are documented in solid/black lines, poor man’s ES models in dotted/magenta lines, suffix-tree ES models in broken/blue, and MOOMs in dash-dotted/red lines. [sent-232, score-0.147]
87 The thickest lines in each panel show training LL, the thinnest CPU time, and intermediate testing LL. [sent-233, score-0.101]
88 On dataset (c), no results of the poor man’s algorithm are given because the learning equations became ill-conditioned for all but the lowest dimensions. [sent-235, score-0.132]
89 (1) The CPU times roughly exhibit an even log spread over almost 2 orders of magnitude, in the order poor man’s (fastest) – suffix-tree ES – CLG – Baum-Welch. [sent-238, score-0.107]
90 (2) CLG has the lowest training LL throughout, which needs an explanation because the proper OOMs trained by ES are more expressive. [sent-239, score-0.102]
91 Apparently the ES algorithm does not lead to local ML optima; otherwise suffix-tree ES models should show the lowest training LL. [sent-240, score-0.126]
92 (4) On the MOOM data (b), the test LL of MOOM/CLG and OOM/poor man models of dimension 2 equals the best HMM/Baum-Welch test LL which is attained at a dimension of 4; the OOM/suffix-tree test LL at dimension 2 is superior to the best HMM test LL. [sent-242, score-0.244]
93 (5) On the “probability clock” data (c), the suffix-tree ES trained OOMs surpassed the non-OOM models in test LL, with the optimal value obtained at the (correct) model dimension 3. [sent-243, score-0.062]
94 This comes as no surprise because these data come from a generator that is incommensurable with either HMMs or MOOMs. [sent-244, score-0.063]
95 (6) On the large empirical dataset (d) the CLG/MOOMs have by a fair margin the highest training LL, but the test LL quickly drops to unacceptable lows. [sent-245, score-0.095]
96 It is hard to explain this by overfitting, considering the complexity and the size of the training string. [sent-246, score-0.07]
97 The other three types of models are evenly ordered in both training and testing error from HMMs (poorest) to suffix-tree ES trained OOMs. [sent-247, score-0.125]
98 Depending on whether one wants a very fast algorithm with good, or a fast algorithm with very good train/test LL, one here would choose the poor man’s or the suffix-tree ES algorithm as the winner. [sent-249, score-0.075]
99 The topics that we will address next in our research group are (i) a mathematical analysis of the asymptotic behaviour of the ES algorithms, (ii) online adaptive versions of these algorithms, and (iii) versions of the ES algorithms for nonstationary time series. [sent-261, score-0.076]
100 Stochastic models combining discrete symbols and continuous attributes in handwriting recognition. [sent-293, score-0.057]
wordName wordTfidf (topN-words)
[('oom', 0.541), ('characterizer', 0.4), ('ooms', 0.4), ('reverse', 0.188), ('ll', 0.165), ('clg', 0.16), ('wa', 0.121), ('characterizers', 0.12), ('man', 0.106), ('hmms', 0.093), ('rm', 0.082), ('observable', 0.082), ('copt', 0.08), ('mooms', 0.08), ('sharpening', 0.08), ('es', 0.078), ('poor', 0.075), ('sequences', 0.072), ('ciency', 0.071), ('cpu', 0.071), ('training', 0.07), ('hmm', 0.066), ('string', 0.061), ('equivalence', 0.061), ('ain', 0.06), ('mealy', 0.06), ('moom', 0.06), ('tehmms', 0.06), ('wak', 0.06), ('operator', 0.057), ('operators', 0.057), ('gv', 0.052), ('ii', 0.048), ('iii', 0.047), ('length', 0.046), ('iff', 0.045), ('cv', 0.044), ('symbol', 0.044), ('scheme', 0.043), ('ij', 0.043), ('indicative', 0.042), ('ya', 0.042), ('alphabetical', 0.04), ('basics', 0.04), ('bremen', 0.04), ('criterium', 0.04), ('enumeration', 0.04), ('gopt', 0.04), ('uninformed', 0.04), ('waa', 0.04), ('wal', 0.04), ('zhao', 0.04), ('generator', 0.038), ('dimension', 0.038), ('ar', 0.037), ('bk', 0.037), ('suf', 0.036), ('differed', 0.035), ('freshly', 0.035), ('iterations', 0.035), ('descent', 0.034), ('carried', 0.034), ('likewise', 0.033), ('symbols', 0.033), ('al', 0.032), ('continuation', 0.032), ('jaeger', 0.032), ('stopped', 0.032), ('lowest', 0.032), ('log', 0.032), ('testing', 0.031), ('asymptotically', 0.031), ('equivalent', 0.031), ('analytical', 0.03), ('counting', 0.03), ('holds', 0.03), ('months', 0.03), ('ok', 0.03), ('nonstationary', 0.03), ('sk', 0.03), ('xn', 0.029), ('cw', 0.028), ('clock', 0.027), ('plateau', 0.027), ('subsequences', 0.027), ('gradient', 0.027), ('states', 0.026), ('surprise', 0.025), ('ef', 0.025), ('matlab', 0.025), ('stationary', 0.025), ('call', 0.025), ('dataset', 0.025), ('sequence', 0.024), ('stochastic', 0.024), ('models', 0.024), ('variance', 0.024), ('correct', 0.024), ('uniquely', 0.024), ('versions', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 62 nips-2005-Efficient Estimation of OOMs
Author: Herbert Jaeger, Mingjie Zhao, Andreas Kolling
Abstract: A standard method to obtain stochastic models for symbolic time series is to train state-emitting hidden Markov models (SE-HMMs) with the Baum-Welch algorithm. Based on observable operator models (OOMs), in the last few months a number of novel learning algorithms for similar purposes have been developed: (1,2) two versions of an ”efficiency sharpening” (ES) algorithm, which iteratively improves the statistical efficiency of a sequence of OOM estimators, (3) a constrained gradient descent ML estimator for transition-emitting HMMs (TE-HMMs). We give an overview on these algorithms and compare them with SE-HMM/EM learning on synthetic and real-life data. 1
2 0.062959217 196 nips-2005-Two view learning: SVM-2K, Theory and Practice
Author: Jason Farquhar, David Hardoon, Hongying Meng, John S. Shawe-taylor, Sándor Szedmák
Abstract: Kernel methods make it relatively easy to define complex highdimensional feature spaces. This raises the question of how we can identify the relevant subspaces for a particular learning task. When two views of the same phenomenon are available kernel Canonical Correlation Analysis (KCCA) has been shown to be an effective preprocessing step that can improve the performance of classification algorithms such as the Support Vector Machine (SVM). This paper takes this observation to its logical conclusion and proposes a method that combines this two stage learning (KCCA followed by SVM) into a single optimisation termed SVM-2K. We present both experimental and theoretical analysis of the approach showing encouraging results and insights. 1
3 0.057388339 82 nips-2005-Generalization Error Bounds for Aggregation by Mirror Descent with Averaging
Author: Anatoli Juditsky, Alexander Nazin, Alexandre Tsybakov, Nicolas Vayatis
Abstract: We consider the problem of constructing an aggregated estimator from a finite class of base functions which approximately minimizes a convex risk functional under the ℓ1 constraint. For this purpose, we propose a stochastic procedure, the mirror descent, which performs gradient descent in the dual space. The generated estimates are additionally averaged in a recursive fashion with specific weights. Mirror descent algorithms have been developed in different contexts and they are known to be particularly efficient in high dimensional problems. Moreover their implementation is adapted to the online setting. The main result of the paper is the upper bound on the convergence rate for the generalization error. 1
4 0.048025001 105 nips-2005-Large-Scale Multiclass Transduction
Author: Thomas Gärtner, Quoc V. Le, Simon Burton, Alex J. Smola, Vishy Vishwanathan
Abstract: We present a method for performing transductive inference on very large datasets. Our algorithm is based on multiclass Gaussian processes and is effective whenever the multiplication of the kernel matrix or its inverse with a vector can be computed sufficiently fast. This holds, for instance, for certain graph and string kernels. Transduction is achieved by variational inference over the unlabeled data subject to a balancing constraint. 1
5 0.043248571 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis
Author: Jian Zhang, Zoubin Ghahramani, Yiming Yang
Abstract: We propose a probabilistic model based on Independent Component Analysis for learning multiple related tasks. In our model the task parameters are assumed to be generated from independent sources which account for the relatedness of the tasks. We use Laplace distributions to model hidden sources which makes it possible to identify the hidden, independent components instead of just modeling correlations. Furthermore, our model enjoys a sparsity property which makes it both parsimonious and robust. We also propose efficient algorithms for both empirical Bayes method and point estimation. Our experimental results on two multi-label text classification data sets show that the proposed approach is promising.
6 0.041116513 201 nips-2005-Variational Bayesian Stochastic Complexity of Mixture Models
7 0.03975264 114 nips-2005-Learning Rankings via Convex Hull Separation
8 0.039473195 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach
9 0.038866233 164 nips-2005-Representing Part-Whole Relationships in Recurrent Neural Networks
10 0.038731351 148 nips-2005-Online Discovery and Learning of Predictive State Representations
11 0.037055742 124 nips-2005-Measuring Shared Information and Coordinated Activity in Neuronal Networks
12 0.036505528 66 nips-2005-Estimation of Intrinsic Dimensionality Using High-Rate Vector Quantization
13 0.035573293 175 nips-2005-Sequence and Tree Kernels with Statistical Feature Mining
14 0.035520971 79 nips-2005-Fusion of Similarity Data in Clustering
15 0.035164192 64 nips-2005-Efficient estimation of hidden state dynamics from spike trains
16 0.034118261 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators
17 0.033979766 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning
18 0.033612512 100 nips-2005-Interpolating between types and tokens by estimating power-law generators
19 0.032836825 202 nips-2005-Variational EM Algorithms for Non-Gaussian Latent Variable Models
20 0.031885393 147 nips-2005-On the Convergence of Eigenspaces in Kernel Principal Component Analysis
topicId topicWeight
[(0, 0.138), (1, 0.018), (2, 0.01), (3, -0.009), (4, 0.021), (5, 0.001), (6, -0.022), (7, 0.018), (8, 0.005), (9, 0.002), (10, -0.024), (11, -0.006), (12, 0.008), (13, -0.029), (14, -0.0), (15, 0.056), (16, -0.024), (17, 0.015), (18, 0.005), (19, 0.045), (20, 0.024), (21, 0.022), (22, 0.018), (23, 0.042), (24, -0.036), (25, -0.037), (26, 0.033), (27, 0.025), (28, 0.024), (29, 0.094), (30, 0.05), (31, -0.049), (32, -0.048), (33, 0.098), (34, 0.05), (35, 0.049), (36, -0.008), (37, 0.02), (38, 0.006), (39, -0.028), (40, -0.054), (41, -0.057), (42, -0.124), (43, 0.04), (44, -0.02), (45, 0.038), (46, -0.129), (47, -0.027), (48, 0.084), (49, -0.006)]
simIndex simValue paperId paperTitle
same-paper 1 0.86154604 62 nips-2005-Efficient Estimation of OOMs
Author: Herbert Jaeger, Mingjie Zhao, Andreas Kolling
Abstract: A standard method to obtain stochastic models for symbolic time series is to train state-emitting hidden Markov models (SE-HMMs) with the Baum-Welch algorithm. Based on observable operator models (OOMs), in the last few months a number of novel learning algorithms for similar purposes have been developed: (1,2) two versions of an ”efficiency sharpening” (ES) algorithm, which iteratively improves the statistical efficiency of a sequence of OOM estimators, (3) a constrained gradient descent ML estimator for transition-emitting HMMs (TE-HMMs). We give an overview on these algorithms and compare them with SE-HMM/EM learning on synthetic and real-life data. 1
2 0.53143948 196 nips-2005-Two view learning: SVM-2K, Theory and Practice
Author: Jason Farquhar, David Hardoon, Hongying Meng, John S. Shawe-taylor, Sándor Szedmák
Abstract: Kernel methods make it relatively easy to define complex highdimensional feature spaces. This raises the question of how we can identify the relevant subspaces for a particular learning task. When two views of the same phenomenon are available kernel Canonical Correlation Analysis (KCCA) has been shown to be an effective preprocessing step that can improve the performance of classification algorithms such as the Support Vector Machine (SVM). This paper takes this observation to its logical conclusion and proposes a method that combines this two stage learning (KCCA followed by SVM) into a single optimisation termed SVM-2K. We present both experimental and theoretical analysis of the approach showing encouraging results and insights. 1
3 0.52007163 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations
Author: Bhaskar D. Rao, David P. Wipf
Abstract: Given a redundant dictionary of basis vectors (or atoms), our goal is to find maximally sparse representations of signals. Previously, we have argued that a sparse Bayesian learning (SBL) framework is particularly well-suited for this task, showing that it has far fewer local minima than other Bayesian-inspired strategies. In this paper, we provide further evidence for this claim by proving a restricted equivalence condition, based on the distribution of the nonzero generating model weights, whereby the SBL solution will equal the maximally sparse representation. We also prove that if these nonzero weights are drawn from an approximate Jeffreys prior, then with probability approaching one, our equivalence condition is satisfied. Finally, we motivate the worst-case scenario for SBL and demonstrate that it is still better than the most widely used sparse representation algorithms. These include Basis Pursuit (BP), which is based on a convex relaxation of the ℓ0 (quasi)-norm, and Orthogonal Matching Pursuit (OMP), a simple greedy strategy that iteratively selects basis vectors most aligned with the current residual. 1
4 0.51659387 162 nips-2005-Rate Distortion Codes in Sensor Networks: A System-level Analysis
Author: Tatsuto Murayama, Peter Davis
Abstract: This paper provides a system-level analysis of a scalable distributed sensing model for networked sensors. In our system model, a data center acquires data from a bunch of L sensors which each independently encode their noisy observations of an original binary sequence, and transmit their encoded data sequences to the data center at a combined rate R, which is limited. Supposing that the sensors use independent LDGM rate distortion codes, we show that the system performance can be evaluated for any given finite R when the number of sensors L goes to infinity. The analysis shows how the optimal strategy for the distributed sensing problem changes at critical values of the data rate R or the noise level. 1
5 0.47871003 124 nips-2005-Measuring Shared Information and Coordinated Activity in Neuronal Networks
Author: Kristina Klinkner, Cosma Shalizi, Marcelo Camperi
Abstract: Most nervous systems encode information about stimuli in the responding activity of large neuronal networks. This activity often manifests itself as dynamically coordinated sequences of action potentials. Since multiple electrode recordings are now a standard tool in neuroscience research, it is important to have a measure of such network-wide behavioral coordination and information sharing, applicable to multiple neural spike train data. We propose a new statistic, informational coherence, which measures how much better one unit can be predicted by knowing the dynamical state of another. We argue informational coherence is a measure of association and shared information which is superior to traditional pairwise measures of synchronization and correlation. To find the dynamical states, we use a recently-introduced algorithm which reconstructs effective state spaces from stochastic time series. We then extend the pairwise measure to a multivariate analysis of the network by estimating the network multi-information. We illustrate our method by testing it on a detailed model of the transition from gamma to beta rhythms. Much of the most important information in neural systems is shared over multiple neurons or cortical areas, in such forms as population codes and distributed representations [1]. On behavioral time scales, neural information is stored in temporal patterns of activity as opposed to static markers; therefore, as information is shared between neurons or brain regions, it is physically instantiated as coordination between entire sequences of neural spikes. Furthermore, neural systems and regions of the brain often require coordinated neural activity to perform important functions; acting in concert requires multiple neurons or cortical areas to share information [2]. Thus, if we want to measure the dynamic network-wide behavior of neurons and test hypotheses about them, we need reliable, practical methods to detect and quantify behavioral coordination and the associated information sharing across multiple neural units. These would be especially useful in testing ideas about how particular forms of coordination relate to distributed coding (e.g., that of [3]). Current techniques to analyze relations among spike trains handle only pairs of neurons, so we further need a method which is extendible to analyze the coordination in the network, system, or region as a whole. Here we propose a new measure of behavioral coordination and information sharing, informational coherence, based on the notion of dynamical state. Section 1 argues that coordinated behavior in neural systems is often not captured by exist- ing measures of synchronization or correlation, and that something sensitive to nonlinear, stochastic, predictive relationships is needed. Section 2 defines informational coherence as the (normalized) mutual information between the dynamical states of two systems and explains how looking at the states, rather than just observables, fulfills the needs laid out in Section 1. Since we rarely know the right states a prori, Section 2.1 briefly describes how we reconstruct effective state spaces from data. Section 2.2 gives some details about how we calculate the informational coherence and approximate the global information stored in the network. Section 3 applies our method to a model system (a biophysically detailed conductance-based model) comparing our results to those of more familiar second-order statistics. In the interest of space, we omit proofs and a full discussion of the existing literature, giving only minimal references here; proofs and references will appear in a longer paper now in preparation. 1 Synchrony or Coherence? Most hypotheses which involve the idea that information sharing is reflected in coordinated activity across neural units invoke a very specific notion of coordinated activity, namely strict synchrony: the units should be doing exactly the same thing (e.g., spiking) at exactly the same time. Investigators then measure coordination by measuring how close the units come to being strictly synchronized (e.g., variance in spike times). From an informational point of view, there is no reason to favor strict synchrony over other kinds of coordination. One neuron consistently spiking 50 ms after another is just as informative a relationship as two simultaneously spiking, but such stable phase relations are missed by strict-synchrony approaches. Indeed, whatever the exact nature of the neural code, it uses temporally extended patterns of activity, and so information sharing should be reflected in coordination of those patterns, rather than just the instantaneous activity. There are three common ways of going beyond strict synchrony: cross-correlation and related second-order statistics, mutual information, and topological generalized synchrony. The cross-correlation function (the normalized covariance function; this includes, for present purposes, the joint peristimulus time histogram [2]), is one of the most widespread measures of synchronization. It can be efficiently calculated from observable series; it handles statistical as well as deterministic relationships between processes; by incorporating variable lags, it reduces the problem of phase locking. Fourier transformation of the covariance function γXY (h) yields the cross-spectrum FXY (ν), which in turn gives the 2 spectral coherence cXY (ν) = FXY (ν)/FX (ν)FY (ν), a normalized correlation between the Fourier components of X and Y . Integrated over frequencies, the spectral coherence measures, essentially, the degree of linear cross-predictability of the two series. ([4] applies spectral coherence to coordinated neural activity.) However, such second-order statistics only handle linear relationships. Since neural processes are known to be strongly nonlinear, there is little reason to think these statistics adequately measure coordination and synchrony in neural systems. Mutual information is attractive because it handles both nonlinear and stochastic relationships and has a very natural and appealing interpretation. Unfortunately, it often seems to fail in practice, being disappointingly small even between signals which are known to be tightly coupled [5]. The major reason is that the neural codes use distinct patterns of activity over time, rather than many different instantaneous actions, and the usual approach misses these extended patterns. Consider two neurons, one of which drives the other to spike 50 ms after it does, the driving neuron spiking once every 500 ms. These are very tightly coordinated, but whether the first neuron spiked at time t conveys little information about what the second neuron is doing at t — it’s not spiking, but it’s not spiking most of the time anyway. Mutual information calculated from the direct observations conflates the “no spike” of the second neuron preparing to fire with its just-sitting-around “no spike”. Here, mutual information could find the coordination if we used a 50 ms lag, but that won’t work in general. Take two rate-coding neurons with base-line firing rates of 1 Hz, and suppose that a stimulus excites one to 10 Hz and suppresses the other to 0.1 Hz. The spiking rates thus share a lot of information, but whether the one neuron spiked at t is uninformative about what the other neuron did then, and lagging won’t help. Generalized synchrony is based on the idea of establishing relationships between the states of the various units. “State” here is taken in the sense of physics, dynamics and control theory: the state at time t is a variable which fixes the distribution of observables at all times ≥ t, rendering the past of the system irrelevant [6]. Knowing the state allows us to predict, as well as possible, how the system will evolve, and how it will respond to external forces [7]. Two coupled systems are said to exhibit generalized synchrony if the state of one system is given by a mapping from the state of the other. Applications to data employ statespace reconstruction [8]: if the state x ∈ X evolves according to smooth, d-dimensional deterministic dynamics, and we observe a generic function y = f (x), then the space Y of time-delay vectors [y(t), y(t − τ ), ...y(t − (k − 1)τ )] is diffeomorphic to X if k > 2d, for generic choices of lag τ . The various versions of generalized synchrony differ on how, precisely, to quantify the mappings between reconstructed state spaces, but they all appear to be empirically equivalent to one another and to notions of phase synchronization based on Hilbert transforms [5]. Thus all of these measures accommodate nonlinear relationships, and are potentially very flexible. Unfortunately, there is essentially no reason to believe that neural systems have deterministic dynamics at experimentally-accessible levels of detail, much less that there are deterministic relationships among such states for different units. What we want, then, but none of these alternatives provides, is a quantity which measures predictive relationships among states, but allows those relationships to be nonlinear and stochastic. The next section introduces just such a measure, which we call “informational coherence”. 2 States and Informational Coherence There are alternatives to calculating the “surface” mutual information between the sequences of observations themselves (which, as described, fails to capture coordination). If we know that the units are phase oscillators, or rate coders, we can estimate their instantaneous phase or rate and, by calculating the mutual information between those variables, see how coordinated the units’ patterns of activity are. However, phases and rates do not exhaust the repertoire of neural patterns and a more general, common scheme is desirable. The most general notion of “pattern of activity” is simply that of the dynamical state of the system, in the sense mentioned above. We now formalize this. Assuming the usual notation for Shannon information [9], the information content of a state variable X is H[X] and the mutual information between X and Y is I[X; Y ]. As is well-known, I[X; Y ] ≤ min H[X], H[Y ]. We use this to normalize the mutual state information to a 0 − 1 scale, and this is the informational coherence (IC). ψ(X, Y ) = I[X; Y ] , with 0/0 = 0 . min H[X], H[Y ] (1) ψ can be interpreted as follows. I[X; Y ] is the Kullback-Leibler divergence between the joint distribution of X and Y , and the product of their marginal distributions [9], indicating the error involved in ignoring the dependence between X and Y . The mutual information between predictive, dynamical states thus gauges the error involved in assuming the two systems are independent, i.e., how much predictions could improve by taking into account the dependence. Hence it measures the amount of dynamically-relevant information shared between the two systems. ψ simply normalizes this value, and indicates the degree to which two systems have coordinated patterns of behavior (cf. [10], although this only uses directly observable quantities). 2.1 Reconstruction and Estimation of Effective State Spaces As mentioned, the state space of a deterministic dynamical system can be reconstructed from a sequence of observations. This is the main tool of experimental nonlinear dynamics [8]; but the assumption of determinism is crucial and false, for almost any interesting neural system. While classical state-space reconstruction won’t work on stochastic processes, such processes do have state-space representations [11], and, in the special case of discretevalued, discrete-time series, there are ways to reconstruct the state space. Here we use the CSSR algorithm, introduced in [12] (code available at http://bactra.org/CSSR). This produces causal state models, which are stochastic automata capable of statistically-optimal nonlinear prediction; the state of the machine is a minimal sufficient statistic for the future of the observable process[13].1 The basic idea is to form a set of states which should be (1) Markovian, (2) sufficient statistics for the next observable, and (3) have deterministic transitions (in the automata-theory sense). The algorithm begins with a minimal, one-state, IID model, and checks whether these properties hold, by means of hypothesis tests. If they fail, the model is modified, generally but not always by adding more states, and the new model is checked again. Each state of the model corresponds to a distinct distribution over future events, i.e., to a statistical pattern of behavior. Under mild conditions, which do not involve prior knowledge of the state space, CSSR converges in probability to the unique causal state model of the data-generating process [12]. In practice, CSSR is quite fast (linear in the data size), and generalizes at least as well as training hidden Markov models with the EM algorithm and using cross-validation for selection, the standard heuristic [12]. One advantage of the causal state approach (which it shares with classical state-space reconstruction) is that state estimation is greatly simplified. In the general case of nonlinear state estimation, it is necessary to know not just the form of the stochastic dynamics in the state space and the observation function, but also their precise parametric values and the distribution of observation and driving noises. Estimating the state from the observable time series then becomes a computationally-intensive application of Bayes’s Rule [17]. Due to the way causal states are built as statistics of the data, with probability 1 there is a finite time, t, at which the causal state at time t is certain. This is not just with some degree of belief or confidence: because of the way the states are constructed, it is impossible for the process to be in any other state at that time. Once the causal state has been established, it can be updated recursively, i.e., the causal state at time t + 1 is an explicit function of the causal state at time t and the observation at t + 1. The causal state model can be automatically converted, therefore, into a finite-state transducer which reads in an observation time series and outputs the corresponding series of states [18, 13]. (Our implementation of CSSR filters its training data automatically.) The result is a new time series of states, from which all non-predictive components have been filtered out. 2.2 Estimating the Coherence Our algorithm for estimating the matrix of informational coherences is as follows. For each unit, we reconstruct the causal state model, and filter the observable time series to produce a series of causal states. Then, for each pair of neurons, we construct a joint histogram of 1 Causal state models have the same expressive power as observable operator models [14] or predictive state representations [7], and greater power than variable-length Markov models [15, 16]. a b Figure 1: Rastergrams of neuronal spike-times in the network. Excitatory, pyramidal neurons (numbers 1 to 1000) are shown in green, inhibitory interneurons (numbers 1001 to 1300) in red. During the first 10 seconds (a), the current connections among the pyramidal cells are suppressed and a gamma rhythm emerges (left). At t = 10s, those connections become active, leading to a beta rhythm (b, right). the state distribution, estimate the mutual information between the states, and normalize by the single-unit state informations. This gives a symmetric matrix of ψ values. Even if two systems are independent, their estimated IC will, on average, be positive, because, while they should have zero mutual information, the empirical estimate of mutual information is non-negative. Thus, the significance of IC values must be assessed against the null hypothesis of system independence. The easiest way to do so is to take the reconstructed state models for the two systems and run them forward, independently of one another, to generate a large number of simulated state sequences; from these calculate values of the IC. This procedure will approximate the sampling distribution of the IC under a null model which preserves the dynamics of each system, but not their interaction. We can then find p-values as usual. We omit them here to save space. 2.3 Approximating the Network Multi-Information There is broad agreement [2] that analyses of networks should not just be an analysis of pairs of neurons, averaged over pairs. Ideally, an analysis of information sharing in a network would look at the over-all structure of statistical dependence between the various units, reflected in the complete joint probability distribution P of the states. This would then allow us, for instance, to calculate the n-fold multi-information, I[X1 , X2 , . . . Xn ] ≡ D(P ||Q), the Kullback-Leibler divergence between the joint distribution P and the product of marginal distributions Q, analogous to the pairwise mutual information [19]. Calculated over the predictive states, the multi-information would give the total amount of shared dynamical information in the system. Just as we normalized the mutual information I[X1 , X2 ] by its maximum possible value, min H[X1 ], H[X2 ], we normalize the multiinformation by its maximum, which is the smallest sum of n − 1 marginal entropies: I[X1 ; X2 ; . . . Xn ] ≤ min k H[Xn ] i=k Unfortunately, P is a distribution over a very high dimensional space and so, hard to estimate well without strong parametric constraints. We thus consider approximations. The lowest-order approximation treats all the units as independent; this is the distribution Q. One step up are tree distributions, where the global distribution is a function of the joint distributions of pairs of units. Not every pair of units needs to enter into such a distribution, though every unit must be part of some pair. Graphically, a tree distribution corresponds to a spanning tree, with edges linking units whose interactions enter into the global probability, and conversely spanning trees determine tree distributions. Writing ET for the set of pairs (i, j) and abbreviating X1 = x1 , X2 = x2 , . . . Xn = xn by X = x, one has n T (X = x) = (i,j)∈ET T (Xi = xi , Xj = xj ) T (Xi = xi ) T (Xi = xi )T (Xj = xj ) i=1 (2) where the marginal distributions T (Xi ) and the pair distributions T (Xi , Xj ) are estimated by the empirical marginal and pair distributions. We must now pick edges ET so that T best approximates the true global distribution P . A natural approach is to minimize D(P ||T ), the divergence between P and its tree approximation. Chow and Liu [20] showed that the maximum-weight spanning tree gives the divergence-minimizing distribution, taking an edge’s weight to be the mutual information between the variables it links. There are three advantages to using the Chow-Liu approximation. (1) Estimating T from empirical probabilities gives a consistent maximum likelihood estimator of the ideal ChowLiu tree [20], with reasonable rates of convergence, so T can be reliably known even if P cannot. (2) There are efficient algorithms for constructing maximum-weight spanning trees, such as Prim’s algorithm [21, sec. 23.2], which runs in time O(n2 + n log n). Thus, the approximation is computationally tractable. (3) The KL divergence of the Chow-Liu distribution from Q gives a lower bound on the network multi-information; that bound is just the sum of the mutual informations along the edges in the tree: I[X1 ; X2 ; . . . Xn ] ≥ D(T ||Q) = I[Xi ; Xj ] (3) (i,j)∈ET Even if we knew P exactly, Eq. 3 would be useful as an alternative to calculating D(P ||Q) directly, evaluating log P (x)/Q(x) for all the exponentially-many configurations x. It is natural to seek higher-order approximations to P , e.g., using three-way interactions not decomposable into pairwise interactions [22, 19]. But it is hard to do so effectively, because finding the optimal approximation to P when such interactions are allowed is NP [23], and analytical formulas like Eq. 3 generally do not exist [19]. We therefore confine ourselves to the Chow-Liu approximation here. 3 Example: A Model of Gamma and Beta Rhythms We use simulated data as a test case, instead of empirical multiple electrode recordings, which allows us to try the method on a system of over 1000 neurons and compare the measure against expected results. The model, taken from [24], was originally designed to study episodes of gamma (30–80Hz) and beta (12–30Hz) oscillations in the mammalian nervous system, which often occur successively with a spontaneous transition between them. More concretely, the rhythms studied were those displayed by in vitro hippocampal (CA1) slice preparations and by in vivo neocortical EEGs. The model contains two neuron populations: excitatory (AMPA) pyramidal neurons and inhibitory (GABAA ) interneurons, defined by conductance-based Hodgkin-Huxley-style equations. Simulations were carried out in a network of 1000 pyramidal cells and 300 interneurons. Each cell was modeled as a one-compartment neuron with all-to-all coupling, endowed with the basic sodium and potassium spiking currents, an external applied current, and some Gaussian input noise. The first 10 seconds of the simulation correspond to the gamma rhythm, in which only a group of neurons is made to spike via a linearly increasing applied current. The beta rhythm a b c d Figure 2: Heat-maps of coordination for the network, as measured by zero-lag cross-correlation (top row) and informational coherence (bottom), contrasting the gamma rhythm (left column) with the beta (right). Colors run from red (no coordination) through yellow to pale cream (maximum). (subsequent 10 seconds) is obtained by activating pyramidal-pyramidal recurrent connections (potentiated by Hebbian preprocessing as a result of synchrony during the gamma rhythm) and a slow outward after-hyper-polarization (AHP) current (the M-current), suppressed during gamma due to the metabotropic activation used in the generation of the rhythm. During the beta rhythm, pyramidal cells, silent during gamma rhythm, fire on a subset of interneurons cycles (Fig. 1). Fig. 2 compares zero-lag cross-correlation, a second-order method of quantifying coordination, with the informational coherence calculated from the reconstructed states. (In this simulation, we could have calculated the actual states of the model neurons directly, rather than reconstructing them, but for purposes of testing our method we did not.) Crosscorrelation finds some of the relationships visible in Fig. 1, but is confused by, for instance, the phase shifts between pyramidal cells. (Surface mutual information, not shown, gives similar results.) Informational coherence, however, has no trouble recognizing the two populations as effectively coordinated blocks. The presence of dynamical noise, problematic for ordinary state reconstruction, is not an issue. The average IC is 0.411 (or 0.797 if the inactive, low-numbered neurons are excluded). The tree estimate of the global informational multi-information is 3243.7 bits, with a global coherence of 0.777. The right half of Fig. 2 repeats this analysis for the beta rhythm; in this stage, the average IC is 0.614, and the tree estimate of the global multi-information is 7377.7 bits, though the estimated global coherence falls very slightly to 0.742. This is because low-numbered neurons which were quiescent before are now active, contributing to the global information, but the over-all pattern is somewhat weaker and more noisy (as can be seen from Fig. 1b.) So, as expected, the total information content is higher, but the overall coordination across the network is lower. 4 Conclusion Informational coherence provides a measure of neural information sharing and coordinated activity which accommodates nonlinear, stochastic relationships between extended patterns of spiking. It is robust to dynamical noise and leads to a genuinely multivariate measure of global coordination across networks or regions. Applied to data from multi-electrode recordings, it should be a valuable tool in evaluating hypotheses about distributed neural representation and function. Acknowledgments Thanks to R. Haslinger, E. Ionides and S. Page; and for support to the Santa Fe Institute (under grants from Intel, the NSF and the MacArthur Foundation, and DARPA agreement F30602-00-2-0583), the Clare Booth Luce Foundation (KLK) and the James S. McDonnell Foundation (CRS). References [1] L. F. Abbott and T. J. Sejnowski, eds. Neural Codes and Distributed Representations. MIT Press, 1998. [2] E. N. Brown, R. E. Kass, and P. P. Mitra. Nature Neuroscience, 7:456–461, 2004. [3] D. H. Ballard, Z. Zhang, and R. P. N. Rao. In R. P. N. Rao, B. A. Olshausen, and M. S. Lewicki, eds., Probabilistic Models of the Brain, pp. 273–284, MIT Press, 2002. [4] D. R. Brillinger and A. E. P. Villa. In D. R. Brillinger, L. T. Fernholz, and S. Morgenthaler, eds., The Practice of Data Analysis, pp. 77–92. Princeton U.P., 1997. [5] R. Quian Quiroga et al. Physical Review E, 65:041903, 2002. [6] R. F. Streater. Statistical Dynamics. Imperial College Press, London. [7] M. L. Littman, R. S. Sutton, and S. Singh. In T. G. Dietterich, S. Becker, and Z. Ghahramani, eds., Advances in Neural Information Processing Systems 14, pp. 1555–1561. MIT Press, 2002. [8] H. Kantz and T. Schreiber. Nonlinear Time Series Analysis. Cambridge U.P., 1997. [9] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, 1991. [10] M. Palus et al. Physical Review E, 63:046211, 2001. [11] F. B. Knight. Annals of Probability, 3:573–596, 1975. [12] C. R. Shalizi and K. L. Shalizi. In M. Chickering and J. Halpern, eds., Uncertainty in Artificial Intelligence: Proceedings of the Twentieth Conference, pp. 504–511. AUAI Press, 2004. [13] C. R. Shalizi and J. P. Crutchfield. Journal of Statistical Physics, 104:817–819, 2001. [14] H. Jaeger. Neural Computation, 12:1371–1398, 2000. [15] D. Ron, Y. Singer, and N. Tishby. Machine Learning, 25:117–149, 1996. [16] P. B¨ hlmann and A. J. Wyner. Annals of Statistics, 27:480–513, 1999. u [17] N. U. Ahmed. Linear and Nonlinear Filtering for Scientists and Engineers. World Scientific, 1998. [18] D. R. Upper. PhD thesis, University of California, Berkeley, 1997. [19] E. Schneidman, S. Still, M. J. Berry, and W. Bialek. Physical Review Letters, 91:238701, 2003. [20] C. K. Chow and C. N. Liu. IEEE Transactions on Information Theory, IT-14:462–467, 1968. [21] T. H. Cormen et al. Introduction to Algorithms. 2nd ed. MIT Press, 2001. [22] S. Amari. IEEE Transacttions on Information Theory, 47:1701–1711, 2001. [23] S. Kirshner, P. Smyth, and A. Robertson. Tech. Rep. 04-04, UC Irvine, Information and Computer Science, 2004. [24] M. S. Olufsen et al. Journal of Computational Neuroscience, 14:33–54, 2003.
6 0.47497231 148 nips-2005-Online Discovery and Learning of Predictive State Representations
7 0.46543014 164 nips-2005-Representing Part-Whole Relationships in Recurrent Neural Networks
8 0.45091918 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning
9 0.43247244 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation
10 0.42819124 105 nips-2005-Large-Scale Multiclass Transduction
11 0.42786562 143 nips-2005-Off-Road Obstacle Avoidance through End-to-End Learning
12 0.42075926 163 nips-2005-Recovery of Jointly Sparse Signals from Few Random Projections
13 0.41251653 180 nips-2005-Spectral Bounds for Sparse PCA: Exact and Greedy Algorithms
14 0.39942771 114 nips-2005-Learning Rankings via Convex Hull Separation
15 0.39249238 73 nips-2005-Fast biped walking with a reflexive controller and real-time policy searching
16 0.3868129 61 nips-2005-Dynamical Synapses Give Rise to a Power-Law Distribution of Neuronal Avalanches
17 0.37239245 20 nips-2005-Affine Structure From Sound
18 0.367039 6 nips-2005-A Connectionist Model for Constructive Modal Reasoning
19 0.36085057 37 nips-2005-Benchmarking Non-Parametric Statistical Tests
20 0.35953209 126 nips-2005-Metric Learning by Collapsing Classes
topicId topicWeight
[(3, 0.042), (10, 0.041), (18, 0.029), (27, 0.034), (31, 0.036), (34, 0.078), (41, 0.019), (44, 0.011), (55, 0.052), (69, 0.063), (73, 0.027), (77, 0.36), (88, 0.066), (91, 0.051)]
simIndex simValue paperId paperTitle
1 0.8014009 15 nips-2005-A Theoretical Analysis of Robust Coding over Noisy Overcomplete Channels
Author: Eizaburo Doi, Doru C. Balcan, Michael S. Lewicki
Abstract: Biological sensory systems are faced with the problem of encoding a high-fidelity sensory signal with a population of noisy, low-fidelity neurons. This problem can be expressed in information theoretic terms as coding and transmitting a multi-dimensional, analog signal over a set of noisy channels. Previously, we have shown that robust, overcomplete codes can be learned by minimizing the reconstruction error with a constraint on the channel capacity. Here, we present a theoretical analysis that characterizes the optimal linear coder and decoder for one- and twodimensional data. The analysis allows for an arbitrary number of coding units, thus including both under- and over-complete representations, and provides a number of important insights into optimal coding strategies. In particular, we show how the form of the code adapts to the number of coding units and to different data and noise conditions to achieve robustness. We also report numerical solutions for robust coding of highdimensional image data and show that these codes are substantially more robust compared against other image codes such as ICA and wavelets. 1
same-paper 2 0.78090352 62 nips-2005-Efficient Estimation of OOMs
Author: Herbert Jaeger, Mingjie Zhao, Andreas Kolling
Abstract: A standard method to obtain stochastic models for symbolic time series is to train state-emitting hidden Markov models (SE-HMMs) with the Baum-Welch algorithm. Based on observable operator models (OOMs), in the last few months a number of novel learning algorithms for similar purposes have been developed: (1,2) two versions of an ”efficiency sharpening” (ES) algorithm, which iteratively improves the statistical efficiency of a sequence of OOM estimators, (3) a constrained gradient descent ML estimator for transition-emitting HMMs (TE-HMMs). We give an overview on these algorithms and compare them with SE-HMM/EM learning on synthetic and real-life data. 1
3 0.68363976 44 nips-2005-Computing the Solution Path for the Regularized Support Vector Regression
Author: Lacey Gunter, Ji Zhu
Abstract: In this paper we derive an algorithm that computes the entire solution path of the support vector regression, with essentially the same computational cost as fitting one SVR model. We also propose an unbiased estimate for the degrees of freedom of the SVR model, which allows convenient selection of the regularization parameter. 1
4 0.42331582 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach
Author: O. P. Kreidl, Alan S. Willsky
Abstract: Given a directed graphical model with binary-valued hidden nodes and real-valued noisy observations, consider deciding upon the maximum a-posteriori (MAP) or the maximum posterior-marginal (MPM) assignment under the restriction that each node broadcasts only to its children exactly one single-bit message. We present a variational formulation, viewing the processing rules local to all nodes as degrees-of-freedom, that minimizes the loss in expected (MAP or MPM) performance subject to such online communication constraints. The approach leads to a novel message-passing algorithm to be executed offline, or before observations are realized, which mitigates the performance loss by iteratively coupling all rules in a manner implicitly driven by global statistics. We also provide (i) illustrative examples, (ii) assumptions that guarantee convergence and efficiency and (iii) connections to active research areas. 1
5 0.4157919 162 nips-2005-Rate Distortion Codes in Sensor Networks: A System-level Analysis
Author: Tatsuto Murayama, Peter Davis
Abstract: This paper provides a system-level analysis of a scalable distributed sensing model for networked sensors. In our system model, a data center acquires data from a bunch of L sensors which each independently encode their noisy observations of an original binary sequence, and transmit their encoded data sequences to the data center at a combined rate R, which is limited. Supposing that the sensors use independent LDGM rate distortion codes, we show that the system performance can be evaluated for any given finite R when the number of sensors L goes to infinity. The analysis shows how the optimal strategy for the distributed sensing problem changes at critical values of the data rate R or the noise level. 1
6 0.40900242 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning
7 0.40870234 168 nips-2005-Rodeo: Sparse Nonparametric Regression in High Dimensions
8 0.40303683 20 nips-2005-Affine Structure From Sound
9 0.40266207 102 nips-2005-Kernelized Infomax Clustering
10 0.4003686 78 nips-2005-From Weighted Classification to Policy Search
11 0.39314136 61 nips-2005-Dynamical Synapses Give Rise to a Power-Law Distribution of Neuronal Avalanches
12 0.3900623 167 nips-2005-Robust design of biological experiments
13 0.39001468 163 nips-2005-Recovery of Jointly Sparse Signals from Few Random Projections
14 0.38601431 175 nips-2005-Sequence and Tree Kernels with Statistical Feature Mining
15 0.38198778 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations
16 0.38045788 179 nips-2005-Sparse Gaussian Processes using Pseudo-inputs
17 0.37986371 200 nips-2005-Variable KD-Tree Algorithms for Spatial Pattern Search and Discovery
18 0.37935603 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation
19 0.37831092 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
20 0.37791595 72 nips-2005-Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation