nips nips2012 nips2012-14 knowledge-graph by maker-knowledge-mining

14 nips-2012-A P300 BCI for the Masses: Prior Information Enables Instant Unsupervised Spelling


Source: pdf

Author: Pieter-jan Kindermans, Hannes Verschore, David Verstraeten, Benjamin Schrauwen

Abstract: The usability of Brain Computer Interfaces (BCI) based on the P300 speller is severely hindered by the need for long training times and many repetitions of the same stimulus. In this contribution we introduce a set of unsupervised hierarchical probabilistic models that tackle both problems simultaneously by incorporating prior knowledge from two sources: information from other training subjects (through transfer learning) and information about the words being spelled (through language models). We show, that due to this prior knowledge, the performance of the unsupervised models parallels and in some cases even surpasses that of supervised models, while eliminating the tedious training session. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: The usability of Brain Computer Interfaces (BCI) based on the P300 speller is severely hindered by the need for long training times and many repetitions of the same stimulus. [sent-3, score-0.28]

2 The user is presented with a grid of 36 characters in which rows and columns light up alternately, and focuses on the character he wishes to spell. [sent-8, score-0.475]

3 By correctly detecting this so-called P300 Event Related Potential (ERP), the character intended by the user can be determined. [sent-10, score-0.266]

4 To increase the spelling accuracy, multiple epochs are used before a character gets predicted, where a single epoch is defined as a sequence of stimuli where each row and each column is intensified once. [sent-11, score-0.865]
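
Sentences 2-4 describe the stimulation protocol: rows and columns of the grid are intensified, each epoch flashes every row and column once, and evidence from several epochs is accumulated before a character is predicted. The sketch below only illustrates that accumulation step, not the paper's probabilistic classifier; the `p300_score` function and the 6x6 grid layout are placeholder assumptions.

```python
import numpy as np

def predict_character(epochs, p300_score, grid):
    """Accumulate P300 scores over row/column intensifications and pick a character.

    epochs      : list of epochs, each a list of (stimulus_index, eeg_window) pairs,
                  where stimulus_index 0-5 codes a row and 6-11 a column.
    p300_score  : callable mapping an EEG window to a real-valued P300 score,
                  assumed to come from some trained classifier.
    grid        : 6x6 layout of characters, e.g. letters, digits and '_'.
    """
    scores = np.zeros(12)                      # 6 rows + 6 columns
    for epoch in epochs:
        for stimulus_index, eeg_window in epoch:
            scores[stimulus_index] += p300_score(eeg_window)
    row = int(np.argmax(scores[:6]))           # row with the strongest P300 evidence
    col = int(np.argmax(scores[6:]))           # column with the strongest P300 evidence
    return grid[row][col]
```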

5 The main difficulty in the construction of a P300 speller thus lies in the construction of a classifier for the P300 wave. [sent-12, score-0.215]

6 It has been shown that these classifiers are among the best performing for P300 spelling [12]. [sent-17, score-0.367]

7 A recent and interesting improvement to P300 spelling is to post-process the classifier outputs with a language model in order to improve spelling results [15]. [sent-18, score-0.922]

8 Other researchers have focused on adaptive classifiers which are first trained supervisedly and then adapt to the test session while spelling [11, 13, 10]. [sent-19, score-0.49]

9 When the speller is used online without any prior training, it needs a warm-up period. [sent-23, score-0.293]

10 During this warm-up period the speller output will be more or less random as the classifier is still trying to determine the underlying structure of the P300 ERP. [sent-24, score-0.248]

11 The length of this warm-up period depends on both the individual subject and the number of epochs to spell each character. [sent-26, score-0.439]

12 A higher number of epochs will result in fewer letters in the warm-up, but the total spelling time might increase. [sent-27, score-0.599]

13 To achieve this goal, we extend the graphical model of the unsupervised classifier by incorporating two types of prior knowledge: inter-subject information and language information. [sent-32, score-0.339]

14 2 Methods. Figure 1: Graphical representation of the different classifiers, with (a) the standard model, (b) subject transfer, and (c) subject transfer combined with a language model. [sent-37, score-1.213]

15 On the right is the most complex model, which combines inter-subject information with a trigram language model. [sent-40, score-0.405]

16 2.1 Unsupervised P300 Speller. The basic unsupervised speller which we extend in this paper is the unsupervised P300 classifier proposed in [9]. [sent-42, score-0.397]

17 The row vector xs,t,i contains the EEG recorded after intensification i during spelling of ct by subject s, and a bias term. [sent-48, score-0.804]

18 The EEG for a character will be denoted as Xs,t which is a matrix whose rows are the different xs,t,i . [sent-49, score-0.266]

19 The mean y s,t,i (cs,t ) equals 1 when during intensification i, character cs,t was highlighted, otherwise y s,t,i (cs,t ) = −1. [sent-52, score-0.266]

20 Thus during all epochs, each intensification of this character should yield the P300. [sent-54, score-0.266]

21 Each intensification which does not include this character should not elicit a P300 response. [sent-55, score-0.266]

22 The probability of a character given the EEG can be computed by application of Bayes' rule: p(cs,t | Xs,t, ws, βs) = p(cs,t) p(Xs,t | cs,t, ws, βs) / Σ_cs,t p(cs,t) p(Xs,t | cs,t, ws, βs), where the sum in the denominator runs over all possible characters cs,t. [sent-56, score-0.455]
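
Sentences 17-22 describe a regression-style model in which the projection of each EEG row onto ws is compared with a target of +1 (character highlighted) or -1 (not highlighted), and Bayes' rule then turns these likelihoods into a posterior over the 36 characters. A minimal NumPy sketch under that reading is given below; the Gaussian likelihood with precision βs, the `highlighted` indicator matrix, and all variable names are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def character_posterior(X, w, beta, highlighted, prior):
    """Posterior p(c | X, w, beta) over the 36 grid characters.

    X           : (n_intens, n_features) EEG rows x_{s,t,i}, bias term included.
    w           : (n_features,) classifier weights w_s.
    beta        : noise precision of the assumed Gaussian likelihood.
    highlighted : (n_intens, 36) boolean, True if character c was lit during
                  intensification i.
    prior       : (36,) prior p(c), e.g. uniform or a unigram language model.
    """
    proj = X @ w                                    # classifier outputs per intensification
    targets = np.where(highlighted, 1.0, -1.0)      # y_i(c) for every candidate character
    # Gaussian log-likelihood around the targets, summed over intensifications
    log_lik = -0.5 * beta * ((proj[:, None] - targets) ** 2).sum(axis=0)
    log_post = np.log(prior) + log_lik
    log_post -= log_post.max()                      # numerical stability
    post = np.exp(log_post)
    return post / post.sum()                        # Bayes' rule denominator
```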

23 In this model, the EEG Xs contains the observed variables, the characters are the latent variables which need to be inferred, and ws, βs, αs are the parameters which we want to optimize. [sent-57, score-0.398]

24 Let y(cs) be the labels which are assigned to the EEG given the character predictions cs when the application constraints described above are taken into account. [sent-60, score-0.319]

25 First, we can incorporate prior information about the characters (the bottom of the graphical model) through language models. [sent-65, score-0.43]

26 The advantage of working with the most likely value is that there is no time penalty for transfer learning when used in an online setting. [sent-75, score-0.192]

27 On the other hand, if µw takes on a nonzero value, the update equations for ws, αs become: ws = (Xs^T Xs + (αs^old/βs^old) I)^-1 (Xs^T Σ_cs p(cs | Xs, ws^old, βs^old) ys(cs) + (αs^old/βs^old) I µw) and αs = D / ((ws^old − µw)^T (ws^old − µw)). [sent-77, score-0.973]
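
Taking the reconstructed update above at face value, a NumPy sketch of this M-step could look as follows. Here `y_bar` stands for the expected labels Σ_cs p(cs | Xs, ws^old, βs^old) ys(cs) delivered by the E-step, the βs update is not shown, and all names are assumptions made for illustration.

```python
import numpy as np

def m_step(X, y_bar, w_old, mu_w, alpha_old, beta_old):
    """One M-step update for (w_s, alpha_s) with transfer-learning prior mean mu_w.

    X      : (N, D) stacked EEG feature rows for this subject.
    y_bar  : (N,) expected labels from the E-step.
    w_old  : (D,) previous weight estimate.
    mu_w   : (D,) prior mean from transfer learning (all zeros = no transfer).
    """
    D = X.shape[1]
    ridge = (alpha_old / beta_old) * np.eye(D)
    # w_s = (Xs^T Xs + (a/b) I)^-1 (Xs^T y_bar + (a/b) mu_w)
    w_new = np.linalg.solve(X.T @ X + ridge,
                            X.T @ y_bar + (alpha_old / beta_old) * mu_w)
    # alpha_s = D / ((w_old - mu_w)^T (w_old - mu_w))
    alpha_new = D / float((w_old - mu_w) @ (w_old - mu_w))
    return w_new, alpha_new
```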

28 To apply transfer learning for a new subject S + 1, we assign µw = µnew and keep it fixed. [sent-95, score-0.229]

29 2.3 Incorporation of language models. A second possibility is to incorporate language models. [sent-99, score-0.376]

30 The only difference between working with and without a language model lies in the computation of the probability of a character given the EEG. [sent-100, score-0.454]

31 An n-gram language model takes the history into account: the probability of a character is defined given the n − 1 previous characters, p(ct | ct−1, ..., ct−n+1). [sent-103, score-0.454]

32 In this work, we limit ourselves to uni-, bi- and trigram language models. [sent-107, score-0.356]

33 The graphical model of the P300 speller with subject transfer and a trigram language model is shown in Figure 1(c). [sent-108, score-0.745]

34 For the unigram language model, which counts character frequencies, we only have to change the prior on the characters p (ct ) to the probability of each character occurring. [sent-109, score-1.0]
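
In code this is the smallest possible change: keeping the sketch after sentence 22, only the `prior` argument is replaced by empirical character frequencies. The `counts` vector below is a placeholder for letter counts gathered from a text corpus.

```python
import numpy as np

def unigram_prior(counts):
    """Turn raw per-character counts from a corpus into the prior p(c) used above."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()
```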

35 To compute the marginal probability of a character given the EEG, which is exactly what we need in the E-step, we use the well-known forward-backward algorithm for HMMs [1]. [sent-110, score-0.336]

36 The probability of a character can be computed as follows: p (X1 , . [sent-162, score-0.266]

37 Note that when we cache the forward pass from previous character predictions, only a single step of both the forward and backward pass has to be executed to spell a new character. [sent-192, score-0.524]
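
Sentences 35-37 describe running the forward-backward recursion over the spelled characters, with the forward messages cached so that only one new step is needed per character in an online setting. The sketch below implements this for a bigram letter model for brevity (the trigram case would use character pairs as hidden states); the emission probabilities are assumed to be the per-character likelihood terms produced by the classifier, and the function is a generic HMM smoother rather than the authors' implementation.

```python
import numpy as np

def character_marginals(emissions, bigram, prior):
    """Forward-backward marginals p(c_t | X_1..X_T) for a bigram letter model.

    emissions : (T, K) likelihoods p(X_t | c_t) from the P300 classifier.
    bigram    : (K, K) transition matrix, bigram[i, j] = p(c_t = j | c_{t-1} = i).
    prior     : (K,) distribution over the first character.
    """
    T, K = emissions.shape
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    # forward pass; in an online speller only the newest row needs recomputing
    # when the earlier forward messages are cached
    alpha[0] = prior * emissions[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = emissions[t] * (alpha[t - 1] @ bigram)
        alpha[t] /= alpha[t].sum()
    # backward pass
    for t in range(T - 2, -1, -1):
        beta[t] = bigram @ (emissions[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)
```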

38 The recording was performed with the BCI2000 P300 speller software [14] with the following settings: a 2 second pause before and after each character, and 62.5 ms between the intensifications. [sent-198, score-0.215]

39 These intensifications lasted 125 ms each and the spelling matrix contained the characters [a−z 1−9 _]. [sent-199, score-0.576]

40 The train set contains 16 characters with 15 epochs per character. [sent-201, score-0.494]

41 The number of characters in the test set is subject-dependent and ranges from 17 to 29, with an average of 22. [sent-203, score-0.291]

42 This limited number of characters per sequence is very challenging for our unsupervised classifier, since the spelling has to be as correct as possible right from the start, in order to obtain high average accuracies. [sent-205, score-0.697]

43 As the pre-processing in [9] has been shown to lead to good spelling performance, we adhere to their approach. [sent-206, score-0.367]

44 The EEG is preprocessed one character at a time; as a consequence, this approach is valid in real online experiments. [sent-207, score-0.311]
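
The actual preprocessing follows [9] and is not spelled out in this summary; the sketch below merely illustrates the kind of causal, one-character-at-a-time processing the sentence refers to, using SciPy with an assumed band-pass range and downsampling factor.

```python
from scipy.signal import butter, lfilter

def preprocess_character_block(eeg_block, fs=250.0, band=(0.5, 15.0), step=6):
    """Causally preprocess the EEG recorded for a single character.

    eeg_block : (n_samples, n_channels) raw EEG for one character.
    fs        : sampling rate in Hz.
    band      : band-pass edges in Hz (assumed values, not taken from [9]).
    step      : keep every `step`-th sample after filtering.

    lfilter only uses past samples, so the same code can run one character
    at a time in a real online experiment.
    """
    b, a = butter(4, [f / (fs / 2.0) for f in band], btype="band")
    filtered = lfilter(b, a, eeg_block, axis=0)
    return filtered[::step, :]
```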

45 It is therefore not an accurate representation of how a realistic speller would be used. [sent-214, score-0.215]

46 In a P300 speller, a look-up table assigns a specific character to each position in the on-screen matrix. [sent-218, score-0.266]

47 The Wikipedia dataset was transformed to lowercase and we used the first 5 · 10^8 characters in the dataset to select the 36 most frequently occurring characters, excluding numeric symbols. [sent-224, score-0.493]

48 As such, it makes sense to replace all the numeric characters with other symbols. [sent-226, score-0.238]

49 The selected characters are the following: [a − z : %() − ”. [sent-227, score-0.209]

50 This set of characters is then used to train unigram, bigram and trigram letter models. [sent-229, score-0.378]

51 These language models were trained on the first 5 · 10^8 characters and we applied Witten-Bell smoothing [4], which assigns small but non-zero probabilities to n-grams not encountered in the train set. [sent-230, score-0.452]
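
A self-contained sketch of Witten-Bell smoothing for a bigram letter model is given below; it interpolates the maximum-likelihood bigram estimate with the unigram distribution, which is exactly what gives unseen n-grams a small but non-zero probability. The training text and alphabet are placeholders, and the actual models may well have been built with a standard language-modelling toolkit.

```python
from collections import Counter, defaultdict

def train_witten_bell_bigram(text, alphabet):
    """Bigram letter model p(c_t | c_{t-1}) with Witten-Bell smoothing."""
    unigrams = Counter(ch for ch in text if ch in alphabet)
    total = sum(unigrams.values())
    p_uni = {c: unigrams[c] / total for c in alphabet}

    bigrams = defaultdict(Counter)
    for prev, cur in zip(text, text[1:]):
        if prev in alphabet and cur in alphabet:
            bigrams[prev][cur] += 1

    def prob(cur, prev):
        counts = bigrams[prev]
        c_h = sum(counts.values())      # how often the history `prev` was seen
        t_h = len(counts)               # distinct characters ever seen after `prev`
        if c_h == 0:                    # unseen history: back off to the unigram model
            return p_uni[cur]
        # Witten-Bell: interpolate the ML bigram estimate with the unigram distribution
        return (counts[cur] + t_h * p_uni[cur]) / (c_h + t_h)

    return prob
```

For example, `prob('h', 't')` returns the smoothed probability of 'h' following 't' under the trained model.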

52 This part was not used to train the language models. [sent-232, score-0.211]

53 Additionally, we modified the contents of the character grid such that it contains the 36 selected symbols. [sent-237, score-0.266]

54 The look-up table for the individual spelling actions was changed such that the correct solution is the newly sampled text. [sent-238, score-0.367]

55 The second letter indicates whether the classifier adapts unsupervisedly during the spelling session (A) or is static (S). [sent-244, score-0.446]

56 We compared the standard unsupervised (and adaptive) algorithm (RA) which is randomly initialized, our proposed transfer learning approach without online adaptation (TS) and the transfer learning approach with adaptation (TA). [sent-245, score-0.492]

57 These three different setups were tested without a language model, and with a uni-, bi- and trigram language model. [sent-246, score-0.489]

58 We will indicate the language model by appending the subscript ‘−’ for the classifier without a language model, ‘uni’ for unigram, etcetera. [sent-247, score-0.403]

59 For example: TAtri is the unsupervised classifier which uses transfer learning, learns on the fly and includes a trigram language model. [sent-248, score-0.539]

60 This means that for each subject we have 20 desired texts, 20 random initializations and 20 subject transfer initializations. [sent-253, score-0.379]

61 Additionally, we repeated the experiments with 3, 4, 5, 10 and 15 epochs per character. [sent-255, score-0.262]

62 In the case of transfer learning the initializations are computed as discussed in section 2. [sent-260, score-0.189]

63 The initial classifiers used in the transfer learning process itself were trained unsupervisedly and offline without a language model. [sent-262, score-0.413]

64 Finally, the current test subject is omitted when computing the transfer learning parameters. [sent-265, score-0.229]

65 However, the time needed per EM iteration scales linearly with the number of characters in the training set. [sent-269, score-0.239]

66 The addition of n-gram language models scales the time per E-step with (number of characters in grid)^(n−1). [sent-270, score-0.427]
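
As a concrete instance of this scaling: with the 36-character grid used here, a bigram model multiplies the E-step cost by 36 and a trigram model by 36^2 = 1296, compared with the unigram or no-language-model case.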

67 As this is a major issue in this real-time application, we will also discuss the setting where the classifier is first used to spell the next character and the EM updates are executed during the intensifications for the following symbol. [sent-272, score-0.38]

68 Figure 2: Overview of all spelling results from online experiments; panels (a) 3 epochs, (b) 4 epochs, (c) 5 epochs, (d) 10 epochs and (e) 15 epochs show accuracy for RA, TS and TA without a language model (−) and with uni-, bi- and trigram models. [sent-283, score-3.397]

69 Increasing the number of epochs or adding complex language models improves accuracy. [sent-284, score-0.42]

70 Application of the baseline method RA− and averaging the results over the different subjects results in an online spelling accuracy starting at 24. [sent-291, score-0.47]

71 The result with 15 epochs is usable in practice and predicts only 4 characters incorrectly. [sent-294, score-0.441]

72 However, the spelling time is about half a minute per character. [sent-295, score-0.421]

73 Retesting the classifiers obtained after the online experiment gives the following results: when 3 epochs are used the final classifier is able to spell 60. [sent-296, score-0.369]

74 By evaluating the addition of a language model (RAuni, RAbi, RAtri), we see an improvement in the online results. [sent-300, score-0.233]

75 As more repetitions are used per character, the performance gain of the language models diminishes. [sent-302, score-0.261]

76 For 3 repetitions, a trigram model produces an online spelling accuracy of 43. [sent-303, score-0.412]

77 The results for 15 repetitions show that on average 3 characters are predicted incorrectly when a trigram is used. [sent-306, score-0.365]

78 Analysis of the re-evaluation of the classifiers after online processing shows a smaller improvement to the results, indicating that the language model mainly helps to reduce the warm-up period. [sent-307, score-0.233]

79 This brings us to the full model TA: adaptive unsupervised training which is initialized with transfer learning and optionally makes use of language models. [sent-314, score-0.495]

80 The basic RA− and the full model TAtri are included. [sent-318, score-0.246]

81 Furthermore, we give results for an adapted version TA∗tri, which spells the character before the EM updates, and for TA-Rtri, which is the re-evaluation of TAtri after processing the test set. [sent-319, score-0.543]

82 Table 1: spelling accuracies for 3, 4, 5, 10 and 15 epochs, with columns RA−, TA∗tri, TAtri, TA-Rtri and the supervised baselines BLDA, BLDAtri and BLDA−tri. [sent-320, score-0.492]

83 Also, the trigram classifier produces the best results, which is not surprising given the incorporation of important prior language knowledge into the model. [sent-356, score-0.375]

84 Next, we give an overview of spelling accuracies in Table 1, where we compare the basic unsupervised method RA− to the full model TAtri . [sent-357, score-0.458]

85 With nearly three times as accurate spelling for 3 epochs (74.6%) [sent-358, score-0.599]

86 and near perfect spelling for 15 epochs, we can conclude that the full model is capable of instant spelling for a novel subject. [sent-360, score-0.761]

87 The application of TA∗tri results in a minute performance drop, but as this classifier spells the character before performing the EM iterations, it allows for real-time spelling when the EEG is received and is therefore of more use in an online setting. [sent-361, score-0.979]

88 The BLDA classifiers in this table are supervisedly trained using 15 epochs per character on 16 characters. [sent-363, score-0.628]

89 The BLDA−tri classifier used a limited training set with only 3 epochs per character, or almost three minutes of training. [sent-365, score-0.796]

90 When the limited training set is used, we see that our proposed method produces results which are competitive for 3-5 epochs and better for 10 and 15. [sent-366, score-0.254]

91 The BLDAtri model outperforms our method when we consider a low number of repetitions per character but not for 10 or 15 epochs. [sent-367, score-0.339]

92 From 4 epochs onwards we can see that the re-evaluated classifier after online learning (TA-Rtri ) is able to learn models which are as good as supervisedly trained models. [sent-368, score-0.377]

93 Finally, we would like to point out that even for just 3 epochs per character, our proposed method spelled fewer characters incorrectly (about 6 on average) than the number of characters used during the supervised training (16 for each subject). [sent-369, score-0.809]

94 4 Conclusion. In this work we set out to build a P300-based BCI which is able to produce accurate spelling for a novel subject without any form of training session. [sent-370, score-0.471]

95 This is made possible by incorporating both inter-subject information and language models directly into an unsupervised classifier. [sent-371, score-0.306]

96 There are only a few other unsupervised approaches for P300 spelling, but they need a warm-up period during which the speller is unreliable or they need labeled data to initialize the adaptive spellers. [sent-374, score-0.362]

97 We compared our method to the original unsupervised speller proposed in [9] and have shown that unlike theirs, our approach works instantly. [sent-375, score-0.306]

98 Furthermore, our final experiments demonstrated that the proposed method can compete with state of the art subject-specific and supervisedly trained classifiers [7], even when incorporating a language model. [sent-376, score-0.315]

99 A self-training semi-supervised SVM algorithm and its application in an EEG-based brain computer interface speller system. [sent-470, score-0.293]

100 Natural language processing with dynamic classification improves P300 speller accuracy and bit rate. [sent-513, score-0.403]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('spelling', 0.367), ('ct', 0.355), ('character', 0.266), ('tri', 0.246), ('epochs', 0.232), ('eeg', 0.224), ('speller', 0.215), ('characters', 0.209), ('intensi', 0.2), ('ws', 0.189), ('language', 0.188), ('ra', 0.15), ('transfer', 0.147), ('er', 0.113), ('trigram', 0.113), ('bci', 0.112), ('wold', 0.109), ('blda', 0.108), ('classi', 0.095), ('ers', 0.093), ('ta', 0.093), ('spell', 0.092), ('unsupervised', 0.091), ('xs', 0.085), ('subject', 0.082), ('spelled', 0.077), ('tatri', 0.077), ('old', 0.076), ('supervisedly', 0.068), ('uni', 0.064), ('akimpech', 0.062), ('subjects', 0.058), ('bi', 0.055), ('cs', 0.053), ('wnew', 0.05), ('schalk', 0.047), ('unsupervisedly', 0.046), ('online', 0.045), ('repetitions', 0.043), ('initializations', 0.042), ('incorporation', 0.041), ('interface', 0.041), ('texts', 0.04), ('backward', 0.038), ('ghent', 0.038), ('unigram', 0.038), ('brain', 0.037), ('symbol', 0.037), ('ts', 0.035), ('prior', 0.033), ('letter', 0.033), ('period', 0.033), ('forward', 0.032), ('interfaces', 0.032), ('em', 0.032), ('trained', 0.032), ('bldatri', 0.031), ('kindermans', 0.031), ('schlogl', 0.031), ('schroder', 0.031), ('spells', 0.031), ('adaptation', 0.031), ('supervised', 0.03), ('per', 0.03), ('numeric', 0.029), ('wikipedia', 0.029), ('text', 0.029), ('instant', 0.027), ('appending', 0.027), ('hinterberger', 0.027), ('pfurtscheller', 0.027), ('verstraeten', 0.027), ('incorporating', 0.027), ('desired', 0.026), ('blankertz', 0.025), ('disabled', 0.025), ('exion', 0.025), ('rehabilitation', 0.025), ('wolpaw', 0.025), ('hz', 0.024), ('xt', 0.024), ('biomedical', 0.024), ('initialized', 0.024), ('electrocorticographic', 0.024), ('minute', 0.024), ('nger', 0.024), ('train', 0.023), ('adaptive', 0.023), ('dataset', 0.023), ('inter', 0.022), ('muller', 0.022), ('executed', 0.022), ('training', 0.022), ('guan', 0.021), ('drew', 0.021), ('recursion', 0.021), ('spanish', 0.021), ('sutskever', 0.021), ('pass', 0.021), ('cations', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000006 14 nips-2012-A P300 BCI for the Masses: Prior Information Enables Instant Unsupervised Spelling

Author: Pieter-jan Kindermans, Hannes Verschore, David Verstraeten, Benjamin Schrauwen

Abstract: The usability of Brain Computer Interfaces (BCI) based on the P300 speller is severely hindered by the need for long training times and many repetitions of the same stimulus. In this contribution we introduce a set of unsupervised hierarchical probabilistic models that tackle both problems simultaneously by incorporating prior knowledge from two sources: information from other training subjects (through transfer learning) and information about the words being spelled (through language models). We show, that due to this prior knowledge, the performance of the unsupervised models parallels and in some cases even surpasses that of supervised models, while eliminating the tedious training session. 1

2 0.14173277 302 nips-2012-Scaling MPE Inference for Constrained Continuous Markov Random Fields with Consensus Optimization

Author: Stephen Bach, Matthias Broecheler, Lise Getoor, Dianne O'leary

Abstract: Probabilistic graphical models are powerful tools for analyzing constrained, continuous domains. However, finding most-probable explanations (MPEs) in these models can be computationally expensive. In this paper, we improve the scalability of MPE inference in a class of graphical models with piecewise-linear and piecewise-quadratic dependencies and linear constraints over continuous domains. We derive algorithms based on a consensus-optimization framework and demonstrate their superior performance over state of the art. We show empirically that in a large-scale voter-preference modeling problem our algorithms scale linearly in the number of dependencies and constraints. 1

3 0.12678133 50 nips-2012-Bandit Algorithms boost Brain Computer Interfaces for motor-task selection of a brain-controlled button

Author: Joan Fruitet, Alexandra Carpentier, Maureen Clerc, Rémi Munos

Abstract: Brain-computer interfaces (BCI) allow users to “communicate” with a computer without using their muscles. BCI based on sensori-motor rhythms use imaginary motor tasks, such as moving the right or left hand, to send control signals. The performances of a BCI can vary greatly across users but also depend on the tasks used, making the problem of appropriate task selection an important issue. This study presents a new procedure to automatically select as fast as possible a discriminant motor task for a brain-controlled button. We develop for this purpose an adaptive algorithm, UCB-classif , based on the stochastic bandit theory. This shortens the training stage, thereby allowing the exploration of a greater variety of tasks. By not wasting time on inefficient tasks, and focusing on the most promising ones, this algorithm results in a faster task selection and a more efficient use of the BCI training session. Comparing the proposed method to the standard practice in task selection, for a fixed time budget, UCB-classif leads to an improved classification rate, and for a fixed classification rate, to a reduction of the time spent in training by 50%. 1

4 0.11190526 200 nips-2012-Local Supervised Learning through Space Partitioning

Author: Joseph Wang, Venkatesh Saligrama

Abstract: We develop a novel approach for supervised learning based on adaptively partitioning the feature space into different regions and learning local region-specific classifiers. We formulate an empirical risk minimization problem that incorporates both partitioning and classification in to a single global objective. We show that space partitioning can be equivalently reformulated as a supervised learning problem and consequently any discriminative learning method can be utilized in conjunction with our approach. Nevertheless, we consider locally linear schemes by learning linear partitions and linear region classifiers. Locally linear schemes can not only approximate complex decision boundaries and ensure low training error but also provide tight control on over-fitting and generalization error. We train locally linear classifiers by using LDA, logistic regression and perceptrons, and so our scheme is scalable to large data sizes and high-dimensions. We present experimental results demonstrating improved performance over state of the art classification techniques on benchmark datasets. We also show improved robustness to label noise.

5 0.093815267 198 nips-2012-Learning with Target Prior

Author: Zuoguan Wang, Siwei Lyu, Gerwin Schalk, Qiang Ji

Abstract: In the conventional approaches for supervised parametric learning, relations between data and target variables are provided through training sets consisting of pairs of corresponded data and target variables. In this work, we describe a new learning scheme for parametric learning, in which the target variables y can be modeled with a prior model p(y) and the relations between data and target variables are estimated with p(y) and a set of uncorresponded data X in training. We term this method as learning with target priors (LTP). Specifically, LTP learning seeks parameter θ that maximizes the log likelihood of fθ (X) on a uncorresponded training set with regards to p(y). Compared to the conventional (semi)supervised learning approach, LTP can make efficient use of prior knowledge of the target variables in the form of probabilistic distributions, and thus removes/reduces the reliance on training data in learning. Compared to the Bayesian approach, the learned parametric regressor in LTP can be more efficiently implemented and deployed in tasks where running efficiency is critical. We demonstrate the effectiveness of the proposed approach on parametric regression tasks for BCI signal decoding and pose estimation from video. 1

6 0.072391883 60 nips-2012-Bayesian nonparametric models for ranked data

7 0.062921844 273 nips-2012-Predicting Action Content On-Line and in Real Time before Action Onset – an Intracranial Human Study

8 0.06256628 331 nips-2012-Symbolic Dynamic Programming for Continuous State and Observation POMDPs

9 0.059534844 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound

10 0.057978459 356 nips-2012-Unsupervised Structure Discovery for Semantic Analysis of Audio

11 0.057826251 147 nips-2012-Graphical Models via Generalized Linear Models

12 0.056739766 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions

13 0.054834507 332 nips-2012-Symmetric Correspondence Topic Models for Multilingual Text Analysis

14 0.047508281 197 nips-2012-Learning with Recursive Perceptual Representations

15 0.047290519 213 nips-2012-Minimization of Continuous Bethe Approximations: A Positive Variation

16 0.046927068 28 nips-2012-A systematic approach to extracting semantic information from functional MRI data

17 0.046340004 91 nips-2012-Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images

18 0.045139641 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning

19 0.044561889 22 nips-2012-A latent factor model for highly multi-relational data

20 0.042774592 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.121), (1, 0.004), (2, -0.039), (3, 0.013), (4, 0.018), (5, -0.023), (6, 0.006), (7, 0.028), (8, -0.054), (9, -0.006), (10, -0.017), (11, 0.05), (12, 0.064), (13, 0.021), (14, 0.007), (15, -0.049), (16, -0.099), (17, 0.026), (18, -0.042), (19, 0.04), (20, 0.009), (21, -0.008), (22, -0.14), (23, -0.056), (24, 0.048), (25, -0.094), (26, 0.059), (27, 0.059), (28, -0.001), (29, 0.04), (30, 0.011), (31, 0.098), (32, 0.115), (33, -0.051), (34, -0.035), (35, -0.036), (36, 0.046), (37, 0.017), (38, -0.022), (39, 0.057), (40, -0.001), (41, -0.141), (42, 0.101), (43, -0.033), (44, -0.141), (45, -0.027), (46, -0.064), (47, -0.047), (48, 0.068), (49, -0.109)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92324746 14 nips-2012-A P300 BCI for the Masses: Prior Information Enables Instant Unsupervised Spelling

Author: Pieter-jan Kindermans, Hannes Verschore, David Verstraeten, Benjamin Schrauwen

Abstract: The usability of Brain Computer Interfaces (BCI) based on the P300 speller is severely hindered by the need for long training times and many repetitions of the same stimulus. In this contribution we introduce a set of unsupervised hierarchical probabilistic models that tackle both problems simultaneously by incorporating prior knowledge from two sources: information from other training subjects (through transfer learning) and information about the words being spelled (through language models). We show, that due to this prior knowledge, the performance of the unsupervised models parallels and in some cases even surpasses that of supervised models, while eliminating the tedious training session. 1

2 0.76080084 50 nips-2012-Bandit Algorithms boost Brain Computer Interfaces for motor-task selection of a brain-controlled button

Author: Joan Fruitet, Alexandra Carpentier, Maureen Clerc, Rémi Munos

Abstract: Brain-computer interfaces (BCI) allow users to “communicate” with a computer without using their muscles. BCI based on sensori-motor rhythms use imaginary motor tasks, such as moving the right or left hand, to send control signals. The performances of a BCI can vary greatly across users but also depend on the tasks used, making the problem of appropriate task selection an important issue. This study presents a new procedure to automatically select as fast as possible a discriminant motor task for a brain-controlled button. We develop for this purpose an adaptive algorithm, UCB-classif , based on the stochastic bandit theory. This shortens the training stage, thereby allowing the exploration of a greater variety of tasks. By not wasting time on inefficient tasks, and focusing on the most promising ones, this algorithm results in a faster task selection and a more efficient use of the BCI training session. Comparing the proposed method to the standard practice in task selection, for a fixed time budget, UCB-classif leads to an improved classification rate, and for a fixed classification rate, to a reduction of the time spent in training by 50%. 1

3 0.67745429 273 nips-2012-Predicting Action Content On-Line and in Real Time before Action Onset – an Intracranial Human Study

Author: Uri Maoz, Shengxuan Ye, Ian Ross, Adam Mamelak, Christof Koch

Abstract: The ability to predict action content from neural signals in real time before the action occurs has been long sought in the neuroscientific study of decision-making, agency and volition. On-line real-time (ORT) prediction is important for understanding the relation between neural correlates of decision-making and conscious, voluntary action as well as for brain-machine interfaces. Here, epilepsy patients, implanted with intracranial depth microelectrodes or subdural grid electrodes for clinical purposes, participated in a “matching-pennies” game against an opponent. In each trial, subjects were given a 5 s countdown, after which they had to raise their left or right hand immediately as the “go” signal appeared on a computer screen. They won a fixed amount of money if they raised a different hand than their opponent and lost that amount otherwise. The question we here studied was the extent to which neural precursors of the subjects’ decisions can be detected in intracranial local field potentials (LFP) prior to the onset of the action. We found that combined low-frequency (0.1–5 Hz) LFP signals from 10 electrodes were predictive of the intended left-/right-hand movements before the onset of the go signal. Our ORT system predicted which hand the patient would raise 0.5 s before the go signal with 68±3% accuracy in two patients. Based on these results, we constructed an ORT system that tracked up to 30 electrodes simultaneously, and tested it on retrospective data from 7 patients. On average, we could predict the correct hand choice in 83% of the trials, which rose to 92% if we let the system drop 3/10 of the trials on which it was less confident. Our system demonstrates— for the first time—the feasibility of accurately predicting a binary action on single trials in real time for patients with intracranial recordings, well before the action occurs. 1 1

4 0.67726004 198 nips-2012-Learning with Target Prior

Author: Zuoguan Wang, Siwei Lyu, Gerwin Schalk, Qiang Ji

Abstract: In the conventional approaches for supervised parametric learning, relations between data and target variables are provided through training sets consisting of pairs of corresponded data and target variables. In this work, we describe a new learning scheme for parametric learning, in which the target variables y can be modeled with a prior model p(y) and the relations between data and target variables are estimated with p(y) and a set of uncorresponded data X in training. We term this method as learning with target priors (LTP). Specifically, LTP learning seeks parameter θ that maximizes the log likelihood of fθ (X) on a uncorresponded training set with regards to p(y). Compared to the conventional (semi)supervised learning approach, LTP can make efficient use of prior knowledge of the target variables in the form of probabilistic distributions, and thus removes/reduces the reliance on training data in learning. Compared to the Bayesian approach, the learned parametric regressor in LTP can be more efficiently implemented and deployed in tasks where running efficiency is critical. We demonstrate the effectiveness of the proposed approach on parametric regression tasks for BCI signal decoding and pose estimation from video. 1

5 0.60327584 28 nips-2012-A systematic approach to extracting semantic information from functional MRI data

Author: Francisco Pereira, Matthew Botvinick

Abstract: This paper introduces a novel classification method for functional magnetic resonance imaging datasets with tens of classes. The method is designed to make predictions using information from as many brain locations as possible, instead of resorting to feature selection, and does this by decomposing the pattern of brain activation into differently informative sub-regions. We provide results over a complex semantic processing dataset that show that the method is competitive with state-of-the-art feature selection and also suggest how the method may be used to perform group or exploratory analyses of complex class structure. 1

6 0.54497343 289 nips-2012-Recognizing Activities by Attribute Dynamics

7 0.50302434 200 nips-2012-Local Supervised Learning through Space Partitioning

8 0.45942053 170 nips-2012-Large Scale Distributed Deep Networks

9 0.45315135 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound

10 0.42526773 302 nips-2012-Scaling MPE Inference for Constrained Continuous Markov Random Fields with Consensus Optimization

11 0.40898934 279 nips-2012-Projection Retrieval for Classification

12 0.38980007 197 nips-2012-Learning with Recursive Perceptual Representations

13 0.38915968 331 nips-2012-Symbolic Dynamic Programming for Continuous State and Observation POMDPs

14 0.37176067 97 nips-2012-Diffusion Decision Making for Adaptive k-Nearest Neighbor Classification

15 0.37164873 256 nips-2012-On the connections between saliency and tracking

16 0.37092528 209 nips-2012-Max-Margin Structured Output Regression for Spatio-Temporal Action Localization

17 0.36967164 167 nips-2012-Kernel Hyperalignment

18 0.35931638 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions

19 0.35927364 181 nips-2012-Learning Multiple Tasks using Shared Hypotheses

20 0.35173714 48 nips-2012-Augmented-SVM: Automatic space partitioning for combining multiple non-linear dynamics


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.024), (17, 0.015), (21, 0.014), (38, 0.09), (39, 0.019), (42, 0.025), (54, 0.026), (55, 0.015), (71, 0.013), (74, 0.035), (76, 0.125), (77, 0.398), (80, 0.086), (92, 0.04)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.74081618 14 nips-2012-A P300 BCI for the Masses: Prior Information Enables Instant Unsupervised Spelling

Author: Pieter-jan Kindermans, Hannes Verschore, David Verstraeten, Benjamin Schrauwen

Abstract: The usability of Brain Computer Interfaces (BCI) based on the P300 speller is severely hindered by the need for long training times and many repetitions of the same stimulus. In this contribution we introduce a set of unsupervised hierarchical probabilistic models that tackle both problems simultaneously by incorporating prior knowledge from two sources: information from other training subjects (through transfer learning) and information about the words being spelled (through language models). We show, that due to this prior knowledge, the performance of the unsupervised models parallels and in some cases even surpasses that of supervised models, while eliminating the tedious training session. 1

2 0.6478771 167 nips-2012-Kernel Hyperalignment

Author: Alexander Lorbert, Peter J. Ramadge

Abstract: We offer a regularized, kernel extension of the multi-set, orthogonal Procrustes problem, or hyperalignment. Our new method, called Kernel Hyperalignment, expands the scope of hyperalignment to include nonlinear measures of similarity and enables the alignment of multiple datasets with a large number of base features. With direct application to fMRI data analysis, kernel hyperalignment is well-suited for multi-subject alignment of large ROIs, including the entire cortex. We report experiments using real-world, multi-subject fMRI data. 1

3 0.60630792 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models

Author: Jeff Beck, Alexandre Pouget, Katherine A. Heller

Abstract: Recent experiments have demonstrated that humans and animals typically reason probabilistically about their environment. This ability requires a neural code that represents probability distributions and neural circuits that are capable of implementing the operations of probabilistic inference. The proposed probabilistic population coding (PPC) framework provides a statistically efficient neural representation of probability distributions that is both broadly consistent with physiological measurements and capable of implementing some of the basic operations of probabilistic inference in a biologically plausible way. However, these experiments and the corresponding neural models have largely focused on simple (tractable) probabilistic computations such as cue combination, coordinate transformations, and decision making. As a result it remains unclear how to generalize this framework to more complex probabilistic computations. Here we address this short coming by showing that a very general approximate inference algorithm known as Variational Bayesian Expectation Maximization can be naturally implemented within the linear PPC framework. We apply this approach to a generic problem faced by any given layer of cortex, namely the identification of latent causes of complex mixtures of spikes. We identify a formal equivalent between this spike pattern demixing problem and topic models used for document classification, in particular Latent Dirichlet Allocation (LDA). We then construct a neural network implementation of variational inference and learning for LDA that utilizes a linear PPC. This network relies critically on two non-linear operations: divisive normalization and super-linear facilitation, both of which are ubiquitously observed in neural circuits. We also demonstrate how online learning can be achieved using a variation of Hebb’s rule and describe an extension of this work which allows us to deal with time varying and correlated latent causes. 1 Introduction to Probabilistic Inference in Cortex Probabilistic (Bayesian) reasoning provides a coherent and, in many ways, optimal framework for dealing with complex problems in an uncertain world. It is, therefore, somewhat reassuring that behavioural experiments reliably demonstrate that humans and animals behave in a manner consistent with optimal probabilistic reasoning when performing a wide variety of perceptual [1, 2, 3], motor [4, 5, 6], and cognitive tasks[7]. This remarkable ability requires a neural code that represents probability distribution functions of task relevant stimuli rather than just single values. While there 1 are many ways to represent functions, Bayes rule tells us that when it comes to probability distribution functions, there is only one statistically optimal way to do it. More precisely, Bayes Rule states that any pattern of activity, r, that efficiently represents a probability distribution over some task relevant quantity s, must satisfy the relationship p(s|r) ∝ p(r|s)p(s), where p(r|s) is the stimulus conditioned likelihood function that specifies the form of neural variability, p(s) gives the prior belief regarding the stimulus, and p(s|r) gives the posterior distribution over values of the stimulus, s given the representation r . Of course, it is unlikely that the nervous system consistently achieves this level of optimality. 
None-the-less, Bayes rule suggests the existence of a link between neural variability as characterized by the likelihood function p(r|s) and the state of belief of a mature statistical learning machine such as the brain. The so called Probabilistic Population Coding (or PPC) framework[8, 9, 10] takes this link seriously by proposing that the function encoded by a pattern of neural activity r is, in fact, the likelihood function p(r|s). When this is the case, the precise form of the neural variability informs the nature of the neural code. For example, the exponential family of statistical models with linear sufficient statistics has been shown to be flexible enough to model the first and second order statistics of in vivo recordings in awake behaving monkeys[9, 11, 12] and anesthetized cats[13]. When the likelihood function is modeled in this way, the log posterior probability over the stimulus is linearly encoded by neural activity, i.e. log p(s|r) = h(s) · r − log Z(r) (1) Here, the stimulus dependent kernel, h(s), is a vector of functions of s, the dot represents a standard dot product, and Z(r) is the partition function which serves to normalize the posterior. This log linear form for a posterior distribution is highly computationally convenient and allows for evidence integration to be implemented via linear operations on neural activity[14, 8]. Proponents of this kind of linear PPC have demonstrated how to build biologically plausible neural networks capable of implementing the operations of probabilistic inference that are needed to optimally perform the behavioural tasks listed above. This includes, linear PPC implementations of cue combination[8], evidence integration over time, maximum likelihood and maximum a posterior estimation[9], coordinate transformation/auditory localization[10], object tracking/Kalman filtering[10], explaining away[10], and visual search[15]. Moreover, each of these neural computations has required only a single recurrently connected layer of neurons that is capable of just two non-linear operations: coincidence detection and divisive normalization, both of which are widely observed in cortex[16, 17]. Unfortunately, this research program has been a piecemeal effort that has largely proceeded by building neural networks designed deal with particular problems. As a result, there have been no proposals for a general principle by which neural network implementations of linear PPCs might be generated and no suggestions regarding how to deal with complex (intractable) problems of probabilistic inference. In this work, we will partially address this short coming by showing that Variation Bayesian Expectation Maximization (VBEM) algorithm provides a general scheme for approximate inference and learning with linear PPCs. In section 2, we briefly review the VBEM algorithm and show how it naturally leads to a linear PPC representation of the posterior as well as constraints on the neural network dynamics which build that PPC representation. Because this section describes the VB-PPC approach rather abstractly, the remainder of the paper is dedicated to concrete applications. As a motivating example, we consider the problem of inferring the concentrations of odors in an olfactory scene from a complex pattern of spikes in a population of olfactory receptor neurons (ORNs). In section 3, we argue that this requires solving a spike pattern demixing problem which is indicative of the generic problem faced by many layers of cortex. 
We then show that this demixing problem is equivalent to the problem addressed by a class of models for text documents know as probabilistic topic models, in particular Latent Dirichlet Allocation or LDA[18]. In section 4, we apply the VB-PPC approach to build a neural network implementation of probabilistic inference and learning for LDA. This derivation shows that causal inference with linear PPC’s also critically relies on divisive normalization. This result suggests that this particular non-linearity may be involved in very general and fundamental probabilistic computation, rather than simply playing a role in gain modulation. In this section, we also show how this formulation allows for a probabilistic treatment of learning and show that a simple variation of Hebb’s rule can implement Bayesian learning in neural circuits. 2 We conclude this work by generalizing this approach to time varying inputs by introducing the Dynamic Document Model (DDM) which can infer short term fluctuations in the concentrations of individual topics/odors and can be used to model foraging and other tracking tasks. 2 Variational Bayesian Inference with linear Probabilistic Population Codes Variational Bayesian (VB) inference refers to a class of deterministic methods for approximating the intractable integrals which arise in the context of probabilistic reasoning. Properly implemented it can result a fast alternative to sampling based methods of inference such as MCMC[19] sampling. Generically, the goal of any Bayesian inference algorithm is to infer a posterior distribution over behaviourally relevant latent variables Z given observations X and a generative model which specifies the joint distribution p(X, Θ, Z). This task is confounded by the fact that the generative model includes latent parameters Θ which must be marginalized out, i.e. we wish to compute, p(Z|X) ∝ p(X, Θ, Z)dΘ (2) When the number of latent parameters is large this integral can be quite unwieldy. The VB algorithms simplify this marginalization by approximating the complex joint distribution over behaviourally relevant latents and parameters, p(Θ, Z|X), with a distribution q(Θ, Z) for which integrals of this form are easier to deal with in some sense. There is some art to choosing the particular form for the approximating distribution to make the above integral tractable, however, a factorized approximation is common, i.e. q(Θ, Z) = qΘ (Θ)qZ (Z). Regardless, for any given observation X, the approximate posterior is found by minimizing the Kullback-Leibler divergence between q(Θ, Z) and p(Θ, Z|X). When a factorized posterior is assumed, the Variational Bayesian Expectation Maximization (VBEM) algorithm finds a local minimum of the KL divergence by iteratively updating, qΘ (Θ) and qZ (Z) according to the scheme n log qΘ (Θ) ∼ log p(X, Θ, Z) n qZ (Z) and n+1 log qZ (Z) ∼ log p(X, Θ, Z) n qΘ (Θ) (3) Here the brackets indicate an expected value taken with respect to the subscripted probability distribution function and the tilde indicates equality up to a constant which is independent of Θ and Z. The key property to note here is that the approximate posterior which results from this procedure is in an exponential family form and is therefore representable by a linear PPC (Eq. 1). This feature allows for the straightforward construction of networks which implement the VBEM algorithm with linear PPC’s in the following way. 
If rn and rn are patterns of activity that use a linear PPC representation Θ Z of the relevant posteriors, then n log qΘ (Θ) ∼ hΘ (Θ) · rn Θ and n+1 log qZ (Z) ∼ hZ (Z) · rn+1 . Z (4) Here the stimulus dependent kernels hZ (Z) and hΘ (Θ) are chosen so that their outer product results in a basis that spans the function space on Z × Θ given by log p(X, Θ, Z) for every X. This choice guarantees that there exist functions fΘ (X, rn ) and fZ (X, rn ) such that Z Θ rn = fΘ (X, rn ) Θ Z and rn+1 = fZ (X, rn ) Θ Z (5) satisfy Eq. 3. When this is the case, simply iterating the discrete dynamical system described by Eq. 5 until convergence will find the VBEM approximation to the posterior. This is one way to build a neural network implementation of the VB algorithm. However, its not the only way. In general, any dynamical system which has stable fixed points in common with Eq. 5 can also be said to implement the VBEM algorithm. In the example below we will take advantage of this flexibility in order to build biologically plausible neural network implementations. 3 Response! to Mixture ! of Odors! Single Odor Response Cause Intensity Figure 1: (Left) Each cause (e.g. coffee) in isolation results in a pattern of neural activity (top). When multiple causes contribute to a scene this results in an overall pattern of neural activity which is a mixture of these patterns weighted by the intensities (bottom). (Right) The resulting pattern can be represented by a raster, where each spike is colored by its corresponding latent cause. 3 Probabilistic Topic Models for Spike Train Demixing Consider the problem of odor identification depicted in Fig. 1. A typical mammalian olfactory system consists of a few hundred different types of olfactory receptor neurons (ORNs), each of which responds to a wide range of volatile chemicals. This results in a highly distributed code for each odor. Since, a typical olfactory scene consists of many different odors at different concentrations, the pattern of ORN spike trains represents a complex mixture. Described in this way, it is easy to see that the problem faced by early olfactory cortex can be described as the task of demixing spike trains to infer latent causes (odor intensities). In many ways this olfactory problem is a generic problem faced by each cortical layer as it tries to make sense of the activity of the neurons in the layer below. The input patterns of activity consist of spikes (or spike counts) labeled by the axons which deliver them and summarized by a histogram which indicates how many spikes come from each input neuron. Of course, just because a spike came from a particular neuron does not mean that it had a particular cause, just as any particular ORN spike could have been caused by any one of a large number of volatile chemicals. Like olfactory codes, cortical codes are often distributed and multiple latent causes can be present at the same time. Regardless, this spike or histogram demixing problem is formally equivalent to a class of demixing problems which arise in the context of probabilistic topic models used for document modeling. A simple but successful example of this kind of topic model is called Latent Dirichlet Allocation (LDA) [18]. LDA assumes that word order in documents is irrelevant and, therefore, models documents as histograms of word counts. It also assumes that there are K topics and that each of these topics appears in different proportions in each document, e.g. 
80% of the words in a document might be concerned with coffee and 20% with strawberries. Words from a given topic are themselves drawn from a distribution over words associated with that topic, e.g. when talking about coffee you have a 5% chance of using the word ’bitter’. The goal of LDA is to infer both the distribution over topics discussed in each document and the distribution of words associated with each topic. We can map the generative model for LDA onto the task of spike demixing in cortex by letting topics become latent causes or odors, words become neurons, word occurrences become spikes, word distributions associated with each topic become patterns of neural activity associated with each cause, and different documents become the observed patterns of neural activity on different trials. This equivalence is made explicit in Fig. 2 which describes the standard generative model for LDA applied to documents on the left and mixtures of spikes on the right. 4 LDA Inference and Network Implementation In this section we will apply the VB-PPC formulation to build a biologically plausible network capable of approximating probabilistic inference for spike pattern demixing. For simplicity, we will use the equivalent Gamma-Poisson formulation of LDA which directly models word and topic counts 4 1. For each topic k = 1, . . . , K, (a) Distribution over words βk ∼ Dirichlet(η0 ) 2. For document d = 1, . . . , D, (a) Distribution over topics θd ∼ Dirichlet(α0 ) (b) For word m = 1, . . . , Ωd i. Topic assignment zd,m ∼ Multinomial(θd ) ii. Word assignment ωd,m ∼ Multinomial(βzm ) 1. For latent cause k = 1, . . . , K, (a) Pattern of neural activity βk ∼ Dirichlet(η0 ) 2. For scene d = 1, . . . , D, (a) Relative intensity of each cause θd ∼ Dirichlet(α0 ) (b) For spike m = 1, . . . , Ωd i. Cause assignment zd,m ∼ Multinomial(θd ) ii. Neuron assignment ωd,m ∼ Multinomial(βzm ) Figure 2: (Left) The LDA generative model in the context of document modeling. (Right) The corresponding LDA generative model mapped onto the problem of spike demixing. Text related attributes on the left, in red, have been replaced with neural attributes on the right, in green. rather than topic assignments. Specifically, we define, Rd,j to be the number of times neuron j fires during trial d. Similarly, we let Nd,j,k to be the number of times a spike in neuron j comes from cause k in trial d. These new variables play the roles of the cause and neuron assignment variables, zd,m and ωd,m by simply counting them up. If we let cd,k be an un-normalized intensity of cause j such that θd,k = cd,k / k cd,k then the generative model, Rd,j = k Nd,j,k Nd,j,k ∼ Poisson(βj,k cd,k ) 0 cd,k ∼ Gamma(αk , C −1 ). (6) is equivalent to the topic models described above. Here the parameter C is a scale parameter which sets the expected total number of spikes from the population on each trial. Note that, the problem of inferring the wj,k and cd,k is a non-negative matrix factorization problem similar to that considered by Lee and Seung[20]. The primary difference is that, here, we are attempting to infer a probability distribution over these quantities rather than maximum likelihood estimates. See supplement for details. Following the prescription laid out in section 2, we approximate the posterior over latent variables given a set of input patterns, Rd , d = 1, . . . , D, with a factorized distribution of the form, qN (N)qc (c)qβ (β). 
This results in marginal posterior distributions q (β:,k |η:,k ), q cd,k |αd,k , C −1 + 1 ), and q (Nd,j,: | log pd,j,: , Rd,i ) which are Dirichlet, Gamma, and Multinomial respectively. Here, the parameters η:,k , αd,k , and log pd,j,: are the natural parameters of these distributions. The VBEM update algorithm yields update rules for these parameters which are summarized in Fig. 3 Algorithm1. Algorithm 1: Batch VB updates 1: while ηj,k not converged do 2: for d = 1, · · · , D do 3: while pd,j,k , αd,k not converged do 4: αd,k → α0 + j Rd,j pd,j,k 5: pd,j,k → Algorithm 2: Online VB updates 1: for d = 1, · · · , D do 2: reinitialize pj,k , αk ∀j, k 3: while pj,k , αk not converged do 4: αk → α0 + j Rd,j pj,k 5: pj,k → exp (ψ(ηj,k )−ψ(¯k )) exp ψ(αk ) η η i exp (ψ(ηj,i )−ψ(¯i )) exp ψ(αi ) exp (ψ(ηj,k )−ψ(¯k )) exp ψ(αd,k ) η η i exp (ψ(ηj,i )−ψ(¯i )) exp ψ(αd,i ) 6: end while 7: end for 8: ηj,k = η 0 + 9: end while end while ηj,k → (1 − dt)ηj,k + dt(η 0 + Rd,j pj,k ) 8: end for 6: 7: d Rd,j pd,j,k Figure 3: Here ηk = j ηj,k and ψ(x) is the digamma function so that exp ψ(x) is a smoothed ¯ threshold linear function. Before we move on to the neural network implementation, note that this standard formulation of variational inference for LDA utilizes a batch learning scheme that is not biologically plausible. Fortunately, an online version of this variational algorithm was recently proposed and shown to give 5 superior results when compared to the batch learning algorithm[21]. This algorithm replaces the sum over d in update equation for ηj,k with an incremental update based upon only the most recently observed pattern of spikes. See Fig. 3 Algorithm 2. 4.1 Neural Network Implementation Recall that the goal was to build a neural network that implements the VBEM algorithm for the underlying latent causes of a mixture of spikes using a neural code that represents the posterior distribution via a linear PPC. A linear PPC represents the natural parameters of a posterior distribution via a linear operation on neural activity. Since the primary quantity of interest here is the posterior distribution over odor concentrations, qc (c|α), this means that we need a pattern of activity rα which is linearly related to the αk ’s in the equations above. One way to accomplish this is to simply assume that the firing rates of output neurons are equal to the positive valued αk parameters. Fig. 4 depicts the overall network architecture. Input patterns of activity, R, are transmitted to the synapses of a population of output neurons which represent the αk ’s. The output activity is pooled to ¯ form an un-normalized prediction of the activity of each input neuron, Rj , given the output layer’s current state of belief about the latent causes of the Rj . The activity at each synapse targeted by input neuron j is then inhibited divisively by this prediction. This results in a dendrite that reports to the ¯ soma a quantity, Nj,k , which represents the fraction of unexplained spikes from input neuron j that could be explained by latent cause k. A continuous time dynamical system with this feature and the property that it shares its fixed points with the LDA algorithm is given by d ¯ Nj,k dt d αk dt ¯ ¯ = wj,k Rj − Rj Nj,k = (7) ¯ Nj,k exp (ψ (¯k )) (α0 − αk ) + exp (ψ (αk )) η (8) i ¯ where Rj = k wj,k exp (ψ (αk )), and wj,k = exp (ψ (ηj,k )). Note that, despite its form, it is Eq. 7 which implements the required divisive normalization operation since, in the steady state, ¯ ¯ Nj,k = wj,k Rj /Rj . 
4.1 Neural Network Implementation

Recall that the goal was to build a neural network that implements the VBEM algorithm for the underlying latent causes of a mixture of spikes, using a neural code that represents the posterior distribution via a linear PPC. A linear PPC represents the natural parameters of a posterior distribution via a linear operation on neural activity. Since the primary quantity of interest here is the posterior distribution over odor concentrations, q_c(c | α), this means that we need a pattern of activity r_α which is linearly related to the α_k's in the equations above. One way to accomplish this is to simply assume that the firing rates of the output neurons are equal to the positive-valued α_k parameters.

Fig. 4 depicts the overall network architecture. Input patterns of activity, R, are transmitted to the synapses of a population of output neurons which represent the α_k's. The output activity is pooled to form an un-normalized prediction of the activity of each input neuron, R̄_j, given the output layer's current state of belief about the latent causes of the R_j. The activity at each synapse targeted by input neuron j is then inhibited divisively by this prediction. This results in a dendrite that reports to the soma a quantity, N̄_{j,k}, which represents the fraction of unexplained spikes from input neuron j that could be explained by latent cause k. A continuous-time dynamical system with this feature, and with the property that it shares its fixed points with the LDA algorithm, is given by

dN̄_{j,k}/dt = w_{j,k} R_j − R̄_j N̄_{j,k}    (7)
dα_k/dt = exp(ψ(η̄_k)) (α_0 − α_k) + exp(ψ(α_k)) Σ_j N̄_{j,k}    (8)

where R̄_j = Σ_k w_{j,k} exp(ψ(α_k)) and w_{j,k} = exp(ψ(η_{j,k})). Note that, despite its form, it is Eq. 7 which implements the required divisive normalization operation since, in the steady state, N̄_{j,k} = w_{j,k} R_j / R̄_j.

Regardless, this network has a variety of interesting properties that align well with biology. It predicts that a balance of excitation and inhibition is maintained in the dendrites via divisive normalization, and that the role of inhibitory neurons is to predict the input spikes which target individual dendrites. It also predicts superlinear facilitation. Specifically, the final term on the right of Eq. 8 indicates that more active cells will be more sensitive to their dendritic inputs. Alternatively, this could be implemented via recurrent excitation at the population level. In either case, this is the mechanism by which the network implements a sparse prior on topic concentrations, and it stands in stark contrast to winner-take-all mechanisms, which rely on competitive mutual inhibition. Additionally, the η̄_k in Eq. 8 acts as a cell-wide 'leak' parameter, indicating that the total leak should be roughly proportional to the summed weight of the synapses which drive the neuron. This predicts that cells that are highly sensitive to input should also decay back to baseline more quickly. This implementation also predicts Hebbian learning of synaptic weights. To see this, note that the online update rule for the η_{j,k} parameters can be implemented by simply correlating the activity at each synapse, N̄_{j,k}, with the activity at the soma, α_k, via the equation

τ_L dw_{j,k}/dt = exp(ψ(η̄_k)) (η_0 − 1/2 − w_{j,k}) + N̄_{j,k} exp(ψ(α_k))    (9)

where τ_L is a long time constant for learning and we have used the fact that exp(ψ(x)) ≈ x − 1/2 for x > 1. For a detailed derivation see the supplementary material.

Figure 4: The LDA network model. Dendritically targeted inhibition is pooled from the activity of all neurons in the output layer and acts divisively.

Figure 5: The DDM network model also includes recurrent connections which target the soma with both a linear excitatory signal and an inhibitory signal that also takes the form of a divisive normalization.
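As a concrete reading of these dynamics, here is a minimal Euler-integration sketch of Eqs. 7-8 and the learning rule of Eq. 9 as reconstructed above; it is not the authors' simulation code, and the step size, number of steps, learning time constant, and initial conditions are arbitrary assumptions.

```python
import numpy as np
from scipy.special import digamma

def run_network(Rj, eta, alpha0=0.3, eta0=0.5, dt=1e-3, n_steps=20000, tau_L=None):
    """Integrate Eqs. 7-8 for one input pattern Rj (shape (J,)) given pseudo-counts
    eta (shape (J, K)); if tau_L is given, also integrate the Hebbian rule of Eq. 9."""
    J, K = eta.shape
    w = np.exp(digamma(eta))                  # w_{j,k} = exp(psi(eta_{j,k}))
    leak = np.exp(digamma(eta.sum(0)))        # exp(psi(etabar_k)), cell-wide leak factor
    N = np.zeros((J, K))                      # dendritic variables Nbar_{j,k}
    alpha = np.full(K, alpha0)                # somatic activities alpha_k
    for _ in range(n_steps):
        R_pred = w @ np.exp(digamma(alpha))   # Rbar_j = sum_k w_{j,k} exp(psi(alpha_k))
        dN = w * Rj[:, None] - R_pred[:, None] * N                                 # Eq. 7
        dalpha = leak * (alpha0 - alpha) + np.exp(digamma(alpha)) * N.sum(axis=0)  # Eq. 8
        N += dt * dN
        alpha += dt * dalpha
        if tau_L is not None:                 # Eq. 9: correlate synaptic N with somatic alpha
            dw = leak * (eta0 - 0.5 - w) + N * np.exp(digamma(alpha))[None, :]
            w += (dt / tau_L) * dw
    return alpha, N, w
```

At its fixed point the network state matches the inner-loop fixed point of the VB updates, which is the sense in which the dynamics share fixed points with the LDA algorithm.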
5 Dynamic Document Model

LDA is a rather simple generative model that makes several unrealistic assumptions about mixtures of sensory and cortical spikes. In particular, it assumes both that there are no correlations between the intensities of latent causes and that there are no correlations between the intensities of latent causes in temporally adjacent trials or scenes. This makes LDA a rather poor computational model for a task like olfactory foraging, which requires the animal to track the rise and fall of odor intensities as it navigates its environment. We can model this more complicated task by replacing the static cause or odor intensity parameters with dynamic odor intensity parameters whose behavior is governed by an exponentiated Ornstein-Uhlenbeck process with drift and diffusion matrices given by Λ and Σ_D. We call this variant of LDA the Dynamic Document Model (DDM), as it could be used to model smooth changes in the distribution of topics over the course of a single document.

5.1 DDM Model

The generative model for the DDM is as follows:
1. For each latent cause k = 1, ..., K:
  (a) Cause distribution over spikes β_k ∼ Dirichlet(η_0)
2. For each scene t = 1, ..., T:
  (a) Log intensity of the causes c(t) ∼ Normal(Λ c(t−1), Σ_D)
  (b) Number of spikes in neuron j resulting from cause k: N_{j,k}(t) ∼ Poisson(β_{j,k} exp c_k(t))
  (c) Number of spikes in neuron j: R_j(t) = Σ_k N_{j,k}(t)

This model bears many similarities to the Correlated and Dynamic topic models [22], but it models dynamics over a short time scale, where the dynamic relationship (Λ, Σ_D) is important.

5.2 Network Implementation

Once again the quantity of interest is the current distribution of latent causes, p(c(t) | R(τ), τ = 0...T). If no spikes occur then no evidence is presented, and posterior inference over c(t) is simply given by an undriven Kalman filter with parameters (Λ, Σ_D). A recurrent neural network which uses a linear PPC to encode a posterior that evolves according to a Kalman filter has the property that neural responses are linearly related to the inverse covariance matrix of the posterior as well as to that inverse covariance matrix times the posterior mean. In the absence of evidence, it is easy to show that these quantities must evolve according to recurrent dynamics which implement divisive normalization [10]. Thus, the patterns of neural activity which linearly encode them must do so as well. When a new spike arrives, optimal inference is no longer possible and a variational approximation must be utilized. As is shown in the supplement, this variational approximation is similar to the variational approximation used for LDA. As a result, a network which can divisively inhibit its synapses is able to implement approximate Bayesian inference. Curiously, this implies that the addition of spatial and temporal correlations to the latent causes adds very little complexity to the VB-PPC network implementation of probabilistic inference. All that is required is an additional inhibitory population which targets the somata in the output population. See Fig. 5.

Figure 6: (Left) Neural network approximation to the natural parameters of the posterior distribution over topics (the α's) as a function of the VBEM estimates of those same parameters, for a variety of 'documents'. (Center) Same as left, but for the natural parameters of the DDM (i.e., the entries of Σ^{-1}(t) and Σ^{-1}(t)µ(t) of the distribution over log topic intensities). (Right) Three example traces of cause intensity in the DDM. Black shows the true concentration; blue and red (indistinguishable) show the MAP estimates of the network and VBEM algorithms.

6 Experimental Results

We compared the PPC neural network implementations of variational inference with the standard VBEM algorithm. This comparison is necessary because the two algorithms are not guaranteed to converge to the same solution, since we only required that the neural network dynamics have the same fixed points as the standard VBEM algorithm. As a result, it is possible for the two algorithms to converge to different local minima of the KL divergence.
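For a concrete sense of the data such a comparison runs on, here is a minimal sketch of sampling scenes from the DDM generative model of Section 5.1; the sizes K, J, T and the parameters Λ and Σ_D are arbitrary placeholders, not the settings used in the paper's supplement.

```python
import numpy as np

rng = np.random.default_rng(1)
K, J, T = 3, 20, 500                   # latent causes, neurons, scenes (time steps)
eta0 = 0.5
Lam = 0.98 * np.eye(K)                 # drift matrix Lambda (slowly decaying log intensities)
Sigma_D = 0.05 * np.eye(K)             # diffusion matrix Sigma_D
chol = np.linalg.cholesky(Sigma_D)

beta = rng.dirichlet(np.full(J, eta0), size=K).T     # cause patterns beta_{j,k}, shape (J, K)
c = np.zeros(K)                                      # log intensities c(t)
R = np.zeros((T, J), dtype=int)
for t in range(T):
    # c(t) ~ Normal(Lam c(t-1), Sigma_D): exponentiated Ornstein-Uhlenbeck step
    c = Lam @ c + chol @ rng.standard_normal(K)
    # N_{j,k}(t) ~ Poisson(beta_{j,k} exp c_k(t));  R_j(t) = sum_k N_{j,k}(t)
    N = rng.poisson(beta * np.exp(c)[None, :])
    R[t] = N.sum(axis=1)
```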
For the network implementation of LDA we find good agreement between the neural network and VBEM estimates of the natural parameters of the posterior. See Fig. 6 (left), which shows the two algorithms' estimates of the shape parameter of the posterior distribution over topic (odor) concentrations (a quantity which is proportional to the expected concentration). This agreement, however, is not perfect, especially when the posterior predicted concentrations are low. In part, this is due to the fact that we are presenting the network with difficult inference problems, for which the true posterior distribution over topics (odors) is highly correlated and multimodal. As a result, the objective function (KL divergence) is littered with local minima. Additionally, the discrete iterations of the VBEM algorithm can take very large steps in the space of natural parameters, while the neural network implementation cannot. In contrast, the network implementation of the DDM is in much better agreement with the VBEM estimates; see Fig. 6 (right). This is because the smooth temporal dynamics of the topics eliminate the need for the VBEM algorithm to take large steps. As a result, the smooth network dynamics are better able to accurately track the VBEM algorithm's output. For simulation details please see the supplement.

7 Discussion and Conclusion

In this work we presented a general framework for inference and learning with linear Probabilistic Population Codes. This framework takes advantage of the fact that the Variational Bayesian Expectation Maximization algorithm generates approximate posterior distributions which are in exponential family form. This is precisely the form needed in order to make probability distributions representable by a linear PPC. We then outlined a general means by which one can build a neural network implementation of the VB algorithm using this kind of neural code. We applied this VB-PPC framework to generate a biologically plausible neural network for spike train demixing. We chose this problem because it has many of the features of the canonical problem faced by nearly every layer of cortex, i.e. that of inferring the latent causes of complex mixtures of spike trains in the layer below. Curiously, this very complicated problem of probabilistic inference and learning ended up having a remarkably simple network solution, requiring only that neurons be capable of implementing divisive normalization via dendritically targeted inhibition and superlinear facilitation. Moreover, we showed that extending this approach to the more complex dynamic case, in which latent causes change in intensity over time, does not substantially increase the complexity of the neural circuit. Finally, we would like to note that, while we utilized a rate coding scheme for our linear PPC, the basic equations would still apply to any spike-based log probability code such as that considered by Boerlin and Deneve [23].

References
[1] Daniel Kersten, Pascal Mamassian, and Alan Yuille. Object perception as Bayesian inference. Annual Review of Psychology, 55:271–304, 2004.
[2] Marc O. Ernst and Martin S. Banks. Humans integrate visual and haptic information in a statistically optimal fashion. Nature, 415(6870):429–433, 2002.
[3] Yair Weiss, Eero P. Simoncelli, and Edward H. Adelson. Motion illusions as optimal percepts. Nature Neuroscience, 5(6):598–604, 2002.
[4] P. N. Sabes. The planning and control of reaching movements. Current Opinion in Neurobiology, 10(6):740–746, 2000.
[5] Konrad P. Körding and Daniel M. Wolpert. Bayesian integration in sensorimotor learning. Nature, 427(6971):244–247, 2004.
[6] Emanuel Todorov. Optimality principles in sensorimotor control. Nature Neuroscience, 7(9):907–915, 2004.
[7] Ernő Téglás, Edward Vul, Vittorio Girotto, Michel Gonzalez, Joshua B. Tenenbaum, and Luca L. Bonatti. Pure reasoning in 12-month-old infants as probabilistic inference. Science, 332(6033):1054–1059, 2011.
[8] W. J. Ma, J. M. Beck, P. E. Latham, and A. Pouget. Bayesian inference with probabilistic population codes. Nature Neuroscience, 2006.
[9] Jeffrey M. Beck, Wei Ji Ma, Roozbeh Kiani, Tim Hanks, Anne K. Churchland, Jamie Roitman, Michael N. Shadlen, Peter E. Latham, and Alexandre Pouget. Probabilistic population codes for Bayesian decision making. Neuron, 60(6):1142–1152, 2008.
[10] J. M. Beck, P. E. Latham, and A. Pouget. Marginalization in neural circuits with divisive normalization. Journal of Neuroscience, 31(43):15310–15319, 2011.
[11] Tianming Yang and Michael N. Shadlen. Probabilistic reasoning by neurons. Nature, 447(7148):1075–1080, 2007.
[12] R. H. S. Carpenter and M. L. L. Williams. Neural computation of log likelihood in control of saccadic eye movements. Nature, 1995.
[13] Arnulf B. A. Graf, Adam Kohn, Mehrdad Jazayeri, and J. Anthony Movshon. Decoding the activity of neuronal populations in macaque primary visual cortex. Nature Neuroscience, 14(2):239–245, 2011.
[14] H. B. Barlow. Pattern recognition and the responses of sensory neurons. Annals of the New York Academy of Sciences, 1969.
[15] Wei Ji Ma, Vidhya Navalpakkam, Jeffrey M. Beck, Ronald van den Berg, and Alexandre Pouget. Behavior and neural basis of near-optimal visual search. Nature Neuroscience, 2011.
[16] D. J. Heeger. Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9, 1992.
[17] M. Carandini, D. J. Heeger, and J. A. Movshon. Linearity and normalization in simple cells of the macaque primary visual cortex. Journal of Neuroscience, 17(21):8621–8644, 1997.
[18] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. JMLR, 2003.
[19] M. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Unit, UCL, 2003.
[20] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.
[21] M. Hoffman, D. Blei, and F. Bach. Online learning for latent Dirichlet allocation. In NIPS, 2010.
[22] D. Blei and J. Lafferty. Dynamic topic models. In ICML, 2006.
[23] M. Boerlin and S. Deneve. Spike-based population coding and working memory. PLoS Computational Biology, 2011.

4 0.54024839 65 nips-2012-Cardinality Restricted Boltzmann Machines

Author: Kevin Swersky, Ilya Sutskever, Daniel Tarlow, Richard S. Zemel, Ruslan Salakhutdinov, Ryan P. Adams

Abstract: The Restricted Boltzmann Machine (RBM) is a popular density model that is also good for extracting features. A main source of tractability in RBM models is that, given an input, the posterior distribution over hidden variables is factorizable and can be easily computed and sampled from. Sparsity and competition in the hidden representation are beneficial, and while an RBM with competition among its hidden units would acquire some of the attractive properties of sparse coding, such constraints are typically not added, as the resulting posterior over the hidden units seemingly becomes intractable. In this paper we show that a dynamic programming algorithm can be used to implement exact sparsity in the RBM’s hidden units. We also show how to pass derivatives through the resulting posterior marginals, which makes it possible to fine-tune a pre-trained neural network with sparse hidden layers. 1

5 0.5302434 96 nips-2012-Density Propagation and Improved Bounds on the Partition Function

Author: Stefano Ermon, Ashish Sabharwal, Bart Selman, Carla P. Gomes

Abstract: Given a probabilistic graphical model, its density of states is a distribution that, for any likelihood value, gives the number of configurations with that probability. We introduce a novel message-passing algorithm called Density Propagation (DP) for estimating this distribution. We show that DP is exact for tree-structured graphical models and is, in general, a strict generalization of both sum-product and max-product algorithms. Further, we use density of states and tree decomposition to introduce a new family of upper and lower bounds on the partition function. For any tree decomposition, the new upper bound based on finer-grained density of state information is provably at least as tight as previously known bounds based on convexity of the log-partition function, and strictly stronger if a general condition holds. We conclude with empirical evidence of improvement over convex relaxations and mean-field based bounds. 1

6 0.43580848 322 nips-2012-Spiking and saturating dendrites differentially expand single neuron computation capacity

7 0.43533966 24 nips-2012-A mechanistic model of early sensory processing based on subtracting sparse representations

8 0.42709944 197 nips-2012-Learning with Recursive Perceptual Representations

9 0.42536554 198 nips-2012-Learning with Target Prior

10 0.42520157 279 nips-2012-Projection Retrieval for Classification

11 0.42299119 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

12 0.42240205 321 nips-2012-Spectral learning of linear dynamics from generalised-linear observations with application to neural population data

13 0.42238986 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

14 0.42230529 188 nips-2012-Learning from Distributions via Support Measure Machines

15 0.42195183 200 nips-2012-Local Supervised Learning through Space Partitioning

16 0.42184657 277 nips-2012-Probabilistic Low-Rank Subspace Clustering

17 0.42184338 74 nips-2012-Collaborative Gaussian Processes for Preference Learning

18 0.42178023 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning

19 0.42156795 280 nips-2012-Proper losses for learning from partial labels

20 0.42147934 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines