nips nips2012 nips2012-167 knowledge-graph by maker-knowledge-mining

167 nips-2012-Kernel Hyperalignment


Source: pdf

Author: Alexander Lorbert, Peter J. Ramadge

Abstract: We offer a regularized, kernel extension of the multi-set, orthogonal Procrustes problem, or hyperalignment. Our new method, called Kernel Hyperalignment, expands the scope of hyperalignment to include nonlinear measures of similarity and enables the alignment of multiple datasets with a large number of base features. With direct application to fMRI data analysis, kernel hyperalignment is well-suited for multi-subject alignment of large ROIs, including the entire cortex. We report experiments using real-world, multi-subject fMRI data. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Alexander Lorbert, Peter J. Ramadge, Department of Electrical Engineering, Princeton University. Abstract: We offer a regularized, kernel extension of the multi-set, orthogonal Procrustes problem, or hyperalignment. [sent-2, score-0.202]

2 Our new method, called Kernel Hyperalignment, expands the scope of hyperalignment to include nonlinear measures of similarity and enables the alignment of multiple datasets with a large number of base features. [sent-3, score-1.058]

3 With direct application to fMRI data analysis, kernel hyperalignment is well-suited for multi-subject alignment of large ROIs, including the entire cortex. [sent-4, score-1.228]

4 If yes, we say the data are aligned; if not, we must first perform an alignment of the data. [sent-9, score-0.305]

5 The alignment problem is crucial to multi-subject fMRI data analysis, which is the motivation for this work. [sent-10, score-0.305]

6 This is to ensure temporal alignment across subjects for a common stimulus. [sent-12, score-0.437]

7 However, with each subject exhibiting his/her own unique spatial response patterns, there is a need for spatial alignment. [sent-13, score-0.095]

8 Specifically, we want between-subject correspondence of voxel j at TR i (Time of Repetition). [sent-14, score-0.188]

9 The typical approach taken is anatomical alignment [20] whereby anatomical landmarks are used to anchor spatial commonality across subjects. [sent-15, score-0.604]

10 In linear algebra parlance, anatomical alignment is an affine transformation with 9 degrees of freedom. [sent-16, score-0.468]

11 Instead of a 9-parameter transformation, a higher-order, orthogonal transformation is derived from voxel time-series data. [sent-19, score-0.154]

12 The underlying assumption of hyperalignment is that, for a fixed stimulus, a subject’s time-series data will possess a common geometry. [sent-20, score-0.727]

13 Accordingly, the role of alignment is to find isometric transformations of the per-subject trajectories traced out in voxel space so that the transformed time-series best match each other. [sent-21, score-0.419]

14 Using their method, the authors were able to achieve a between-subject classification accuracy on par with—and even greater than—within-subject accuracy. [sent-22, score-0.03]

15 Suppose that subject data are recorded in matrices X1:m ∈ Rt×n . [sent-23, score-0.069]

16 We are interested in extending the regularized hyperalignment problem, a constrained minimization over orthogonal transformations with regularization parameters α > 0 and β ≥ 0. [sent-25, score-0.807]

17 As with regularized hyperalignment [22], when (α, β) = (1, 0) we obtain hyperalignment and when (α, β) ≈ (0, 1) we obtain a form of CCA. [sent-26, score-1.489]
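
As a rough sketch only (the exact objective is not reproduced in this summary), the regularization path can be pictured as whitening each X_i by a matrix of the assumed form A_i = αI + βX_i^T X_i before the orthogonal alignment: (α, β) = (1, 0) leaves the data untouched, while (α, β) → (0, 1) fully whitens it, the CCA-like extreme.

```python
import numpy as np

def whiten(X, alpha, beta):
    """Return X A^{-1/2} with A = alpha*I + beta*X^T X (assumed form of the regularizer)."""
    n = X.shape[1]
    A = alpha * np.eye(n) + beta * (X.T @ X)
    lam, V = np.linalg.eigh(A)                      # A is symmetric positive definite for alpha > 0
    return X @ (V @ np.diag(1.0 / np.sqrt(lam)) @ V.T)

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))                    # t x n toy data
print(np.allclose(whiten(X, 1.0, 0.0), X))          # hyperalignment end of the path: True
```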

18 We introduce two symmetric, positive definite matrices: B_i = V_i diag_j{ 1/√(α + βλ_ij) } V_i^T and C_i = V_i diag_j{ (1/λ_ij)( 1/√(α + βλ_ij) − 1/√α ) } V_i^T. [sent-31, score-0.165]
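
Assuming the λ_ij and V_i above come from an eigendecomposition K_i = V_i Λ_i V_i^T of the i-th subject's kernel matrix (an assumption; the summary does not show where they are defined), B_i can be formed directly from the eigenpairs:

```python
import numpy as np

def B_matrix(K_i, alpha, beta):
    """B_i = V_i diag_j{ 1/sqrt(alpha + beta*lambda_ij) } V_i^T from the eigenpairs of K_i."""
    lam, V = np.linalg.eigh(K_i)
    return V @ np.diag(1.0 / np.sqrt(alpha + beta * lam)) @ V.T

rng = np.random.default_rng(1)
Phi = rng.standard_normal((30, 8))     # explicit feature map, for illustration only
K = Phi @ Phi.T                        # assumed subject kernel matrix
B = B_matrix(K, alpha=1.0, beta=0.5)
print(np.allclose(B, B.T))             # B is symmetric (and positive definite): True
```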

19 We can transform (7) into arg min_{Q^T Q = I} ‖B_i Φ_i Q − Ψ‖_F^2 or, equivalently, arg max_{Q^T Q = I} tr( Q^T Φ_i^T B_i (1/|A|) Σ_{j∈A} B_j Φ_j Q̂_j ), (8) where Q̂_j is the current estimate of Q_j. [sent-36, score-0.028]

20 The following lemma is the gateway for managing this problem. [sent-38, score-0.055]

21 Familiar applications of the above lemma include the identity matrix (G̃ = I_d) and Householder reflections (G̃ = −I_d). [sent-42, score-0.062]

22 If G̃ is block diagonal with 2 × 2 blocks of Givens rotations, then the columns of U, taken two at a time, are the two-dimensional planes of rotation [7]. [sent-43, score-0.027]

23 The lemma can be interpreted as a lifting mechanism for identity deviations. [sent-46, score-0.028]
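
Reading the lemma off these fragments, the lifting appears to take a small orthogonal G̃ ∈ R^{r×r} to Q = I_N + U(G̃ − I_r)U^T with U having orthonormal columns; G̃ = I_r gives the identity and G̃ = −I_r gives the Householder-type reflection I_N − 2UU^T. A sketch verifying this, with the lifting formula itself treated as an assumption:

```python
import numpy as np

def lift(U, G_tilde):
    """Q = I_N + U (G_tilde - I_r) U^T  (assumed form of the lifting in the lemma)."""
    N, r = U.shape
    return np.eye(N) + U @ (G_tilde - np.eye(r)) @ U.T

rng = np.random.default_rng(2)
N, r = 10, 3
U, _ = np.linalg.qr(rng.standard_normal((N, r)))     # orthonormal columns
G, _ = np.linalg.qr(rng.standard_normal((r, r)))     # a small orthogonal matrix
Q = lift(U, G)
print(np.allclose(Q.T @ Q, np.eye(N)))                               # lifted Q is orthogonal: True
print(np.allclose(lift(U, -np.eye(r)), np.eye(N) - 2 * U @ U.T))     # Householder case: True
```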

24 As is typically the case when using kernel methods, leveraging the Representer Theorem shifts the dimensionality of the problem from the feature cardinality to the number of examples, i. [sent-55, score-0.194]

25 We pool all of the data, forming the mt × N matrix Φ_0 = [Φ_1^T Φ_2^T · · · Φ_m^T]^T, and set U = Φ_0^T K_0^{−1/2} ∈ R^{N×r}, (10) with K_0 = Φ_0 Φ_0^T assumed positive definite. [sent-58, score-0.089]
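
A quick check of this construction with an explicit (linear) feature map, used here purely for illustration since Φ_0 is never formed in the kernel setting: U = Φ_0^T K_0^{−1/2} has orthonormal columns.

```python
import numpy as np

rng = np.random.default_rng(3)
m, t, N = 3, 5, 40                                   # subjects, time points, feature dimension
Phis = [rng.standard_normal((t, N)) for _ in range(m)]

Phi0 = np.vstack(Phis)                               # the (m*t) x N pooled feature matrix
K0 = Phi0 @ Phi0.T                                   # pooled Gram matrix, positive definite here
lam, V = np.linalg.eigh(K0)
K0_inv_sqrt = V @ np.diag(1.0 / np.sqrt(lam)) @ V.T
U = Phi0.T @ K0_inv_sqrt                             # N x r with r = m*t
print(np.allclose(U.T @ U, np.eye(m * t)))           # columns of U are orthonormal: True
```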

26 St(N, d) = {Z ∈ R^{N×d} : Z^T Z = I_d} is the (N, d) Stiefel manifold (N ≥ d), and O(N) = {Z ∈ R^{N×N} : Z^T Z = I_N} is the orthogonal group of N × N matrices. [sent-72, score-0.04]

27 Algorithm 2 (kernel hyperalignment) takes k(·, ·), α, β, and X_{1:m} ∈ R^{t×n} as input, returns R_{1:m} (linear maps in feature space), and initializes feature maps Φ_1, . . . , Φ_m ∈ R^{t×N}. [sent-73, score-0.114]

28 Algorithm 1 (hyperalignment) takes X_{1:m} ∈ R^{t×n} and A_{1:m} ∈ R^{n×n} as input, returns R_{1:m} ∈ R^{n×n}, initializes Q_{1:m} as the identity (n × n), and sets X̃_i ← X_i A_i^{−1/2}; each round then loops over subjects/views i, using the leave-one-out (LOO) mean over {1, 2, . . . , m} \ {i}. [sent-76, score-0.244]

29 Algorithm 2 initializes the plane support Φ_0 = [Φ_1^T Φ_2^T · · · Φ_m^T]^T and G_{1:m} ∈ R^{r×r} as the identity (r = mt), and likewise loops over rounds and subjects/views i ∈ {1, 2, . . . , m}. [sent-82, score-0.267]

30 In general, rather than compute Q according to (7), involving N(N−1)/2 = O(N^2) degrees of freedom (when N is finite), we end up with r(r−1)/2 = O(r^2) degrees of freedom via the kernel trick. [sent-91, score-0.28]

31 We reduce (8) in terms of G_i and obtain (Supplementary Material) G_i = arg max_{G ∈ O(r)} tr( G^T B̃_i^T (1/|A|) Σ_{j∈A} B̃_j Ĝ_j ), (11) where Ĝ_j is the current estimate of G_j. [sent-93, score-0.028]

32 Equation (11) is the classical orthogonal Procrustes problem: if ŪΣ̄V̄^T is the SVD of B̃_i^T (1/|A|) Σ_{j∈A} B̃_j Ĝ_j, then a maximizer is given by ŪV̄^T [7]. [sent-94, score-0.04]

33 The kernel hyperalignment procedure is given in Algorithm 2. [sent-96, score-0.889]
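
The SVD step is the standard orthogonal Procrustes solution: for any matrix M, the maximizer of tr(G^T M) over orthogonal G is UV^T, where M = UΣV^T. A minimal numpy check, with M standing in for the averaged B̃_i^T B̃_j Ĝ_j term:

```python
import numpy as np

def procrustes_maximizer(M):
    """argmax over orthogonal G of tr(G^T M): take G = U V^T from the SVD M = U S V^T."""
    U, _, Vt = np.linalg.svd(M)
    return U @ Vt

rng = np.random.default_rng(4)
M = rng.standard_normal((6, 6))
G_star = procrustes_maximizer(M)
# no random orthogonal candidate beats the SVD solution
best_random = max(
    np.trace(np.linalg.qr(rng.standard_normal((6, 6)))[0].T @ M) for _ in range(1000)
)
print(np.trace(G_star.T @ M) >= best_random)         # True
```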

34 Using the approach taken in this section also leads to an efficient solution of the standard orthogonal Procrustes problem for n ≥ 2t (Supplementary Material). [sent-97, score-0.04]

35 In turn, this leads to an efficient iterative solution for the hyperalignment problem when n is large. [sent-98, score-0.727]

36 4 Alignment Assessment An alignment procedure is not subject to the typical train-and-test paradigm. [sent-99, score-0.35]

37 The lack of spatial correspondence demands an align-train-test approach. [sent-100, score-0.054]

38 With all other parameters fixed, if the aligned test error is smaller than the unaligned test error, there is strong evidence suggesting that alignment was the underlying cause. [sent-102, score-0.409]

39 Kernel hyperalignment returns linear transformations R1:m that act on data living in feature space. [sent-103, score-0.759]

40 In general, we cannot directly train and test in the feature space due to its large size. [sent-104, score-0.032]

41 For example, we can compute distances between examples and, subsequently, produce nearest neighbor classifiers. [sent-106, score-0.031]
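
Distances between examples follow from kernel evaluations alone, since ‖Φ(x) − Φ(y)‖² = k(x, x) + k(y, y) − 2k(x, y). A small nearest-neighbour sketch built only on a kernel function (the RBF kernel and its bandwidth are arbitrary choices here, not taken from the paper):

```python
import numpy as np

def rbf(x, y, gamma=0.5):
    return np.exp(-gamma * np.sum((x - y) ** 2))

def feature_space_dist2(x, y, k):
    """||Phi(x) - Phi(y)||^2 = k(x,x) + k(y,y) - 2 k(x,y), via the kernel trick."""
    return k(x, x) + k(y, y) - 2.0 * k(x, y)

def nearest_neighbor_label(x, X_train, y_train, k):
    d = [feature_space_dist2(x, xi, k) for xi in X_train]
    return y_train[int(np.argmin(d))]

rng = np.random.default_rng(5)
X_train = rng.standard_normal((20, 4))
y_train = np.array([0] * 10 + [1] * 10)
print(nearest_neighbor_label(X_train[0] + 0.01, X_train, y_train, rbf))   # 0
```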

42 We realized early on that the alignment and training phase would be replete with lengthy expansions and, consequently, sought to simplify matters with a computer science solution. [sent-112, score-0.336]

43 Both binary and unary operations in feature space can be accomplished with a simple class. [sent-113, score-0.032]

44 Our Phi class stores expressions of the following forms: Type 1: Σ_{k=1}^{K} M_k Φ(X_{a(k)}); Type 2: Σ_{k=1}^{K} Φ(X_{ā(k)})^T M_k; Type 3: b I_N + Σ_{k=1}^{K} Φ(X_{ā(k)})^T M_k Φ(X_{a(k)}). (14) [sent-114, score-0.029]

45 Each class instance stores matrices M_{1:K}, a scalar b, a right address vector a, and a left address vector ā. [sent-115, score-0.053]
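
A stripped-down sketch of what such a class might look like; only Type-1 expressions are kept, and multiplying a Type-1 object by the transpose of another collapses to an ordinary numeric matrix via the kernel trick. The class name and interface are illustrative, not the authors' implementation:

```python
import numpy as np

class PhiType1:
    """Stores sum_k M_k Phi(X_k) lazily as (M_k, X_k) pairs; Phi is never formed explicitly."""
    def __init__(self, terms):
        self.terms = terms                           # list of (M_k, X_k) pairs

    def inner(self, other, kernel):
        """self @ other^T = sum_{k,l} M_k K(X_k, X_l') M_l'^T, an ordinary numeric matrix."""
        out = 0.0
        for M, X in self.terms:
            for Mp, Xp in other.terms:
                out = out + M @ kernel(X, Xp) @ Mp.T
        return out

def linear_kernel(X, Y):
    return X @ Y.T

rng = np.random.default_rng(6)
X1, X2 = rng.standard_normal((5, 7)), rng.standard_normal((5, 7))
A = PhiType1([(rng.standard_normal((3, 5)), X1)])
B = PhiType1([(rng.standard_normal((4, 5)), X2)])
print(A.inner(B, linear_kernel).shape)               # (3, 4)
```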

46 If types match, then the M matrices must be checked for compatible sizes. [sent-119, score-0.048]

47 The first of these cases, for example, produces a numeric result via the kernel trick. [sent-121, score-0.162]

48 We also define scalar multiplication and division for all types and matrix multiplication for types 1 and 2. [sent-122, score-0.112]

49 A transpose operator applies for all types and maps type 1 to 2, 2 to 1, and 3 to 3. [sent-123, score-0.049]

50 The construction of the Phi class allows us to stay in feature space and avoid lengthy expansions. [sent-126, score-0.063]

51 Let X̄_1, . . . , X̄_m ∈ R^{s×n} be our training data with feature representation Φ̄_i = Φ(X̄_i) ∈ R^{s×N}. [sent-131, score-0.032]

52 Recall that kernel hyperalignment seeks to align in feature space. [sent-132, score-0.921]

53 Before alignment we might have considered K̄_{ij} = Φ̄_i Φ̄_j^T; we now consider the Gram matrix (Φ̄_i R_i)(Φ̄_j R_j)^T = Φ̄_i R_i R_j^T Φ̄_j^T. [sent-133, score-0.305]

54 These blocks Φ̄_i R_i R_j^T Φ̄_j^T are collected into K_A = K_A^T ∈ R^{ms×ms}, the aligned kernel matrix. [sent-146, score-0.228]

55 The unaligned kernel matrix, K_U, is also an m × m block matrix, with ij-th block K̄_{ij}. [sent-147, score-0.254]
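
Assuming explicit feature representations Φ̄_i and feature-space maps R_i are available (for illustration only; in practice everything stays in kernel form), the aligned kernel matrix is assembled block by block:

```python
import numpy as np

def aligned_kernel(Phi_bar, R):
    """K_A: m x m block matrix whose (i, j)-th block is Phi_bar[i] R[i] R[j]^T Phi_bar[j]^T."""
    m = len(Phi_bar)
    rows = [
        np.hstack([Phi_bar[i] @ R[i] @ R[j].T @ Phi_bar[j].T for j in range(m)])
        for i in range(m)
    ]
    return np.vstack(rows)

rng = np.random.default_rng(7)
m, s, N = 3, 4, 10
Phi_bar = [rng.standard_normal((s, N)) for _ in range(m)]
R = [np.linalg.qr(rng.standard_normal((N, N)))[0] for _ in range(m)]   # orthogonal maps
K_A = aligned_kernel(Phi_bar, R)
print(K_A.shape, np.allclose(K_A, K_A.T))             # (12, 12) True
```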

56 Similar to a k-nearest neighbor classifier relying on pairwise distances, an SVM relies on the kernel matrix. [sent-149, score-0.162]

57 The kernel matrix is a matrix of inner products and is therefore linear. [sent-150, score-0.162]

58 Each alignment produces two aligned kernel matrices, which we sum and then input into an SVM. [sent-153, score-0.533]

59 Thus, linearity provides us the means to handle finer partitions by simply summing the aligned kernel matrices. [sent-154, score-0.228]
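
Because the kernel matrix is linear in inner products, per-partition aligned kernel matrices (e.g. one per hemisphere) can be summed and passed to an SVM with a precomputed kernel. The sketch below uses scikit-learn and synthetic stand-ins for the aligned kernels; neither is specified by the paper:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(8)
n_train, n_test, d = 60, 20, 15
X = rng.standard_normal((n_train + n_test, d))
y = (X[:, 0] > 0).astype(int)

# stand-ins for the aligned kernel matrices of two partitions (e.g. two hemispheres)
K_left = X[:, :8] @ X[:, :8].T
K_right = X[:, 8:] @ X[:, 8:].T
K = K_left + K_right                                  # summing kernels = concatenating features

clf = SVC(kernel="precomputed")
clf.fit(K[:n_train, :n_train], y[:n_train])           # training Gram block
print(clf.score(K[n_train:, :n_train], y[n_train:]))  # test rows vs. training columns
```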

60 Table 1: Seven-label classification using movie-based alignment. Below is the cross-validated, between-subject classification accuracy (within-subject in brackets) with (α, β) = (1, 0). [sent-155, score-0.335]

61 Four hundred TRs per subject were used for the alignment. [sent-156, score-0.045]

62 Experiments: The data used in this section consisted of fMRI time-series data from 10 subjects who viewed a movie and also engaged in a block-design visualization experiment [17]. [sent-193, score-0.17]

63 Each subject saw Raiders of the Lost Ark (1981) lasting a total of 2213 TRs. [sent-194, score-0.045]

64 In the visualization experiment, subjects were shown images belonging to a specific class for 16 TRs followed by 10 TRs of rest. [sent-195, score-0.125]

65 To provide the same number of voxels per ROI for all subjects, we first performed anatomical alignment. [sent-199, score-0.256]

66 We then selected a contiguous block of 400 TRs from the movie data to serve as the per-subject input of the kernel hyperalignment. [sent-200, score-0.234]

67 Next, we extracted labeled examples from the visualization experiment by taking an offset time average of each 16 TR class exposure. [sent-201, score-0.054]

68 An offset of 6 seconds factored in the hemodynamic response. [sent-202, score-0.027]
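
A rough sketch of that labeling step; the TR duration used here (2 s, so a 6 s offset becomes 3 TRs) is a guess purely for illustration and is not stated in this summary:

```python
import numpy as np

def offset_time_average(ts, onset, block_len_tr=16, offset_sec=6.0, tr_sec=2.0):
    """Average the TRs of one class exposure, shifted to absorb the hemodynamic lag."""
    shift = int(round(offset_sec / tr_sec))           # 6 s at an assumed 2 s TR -> 3 TRs
    window = ts[onset + shift : onset + shift + block_len_tr]
    return window.mean(axis=0)                        # one labeled example of shape (n_voxels,)

rng = np.random.default_rng(9)
time_series = rng.standard_normal((200, 500))         # TRs x voxels toy data
example = offset_time_average(time_series, onset=40)
print(example.shape)                                  # (500,)
```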

69 This produced 560 labeled examples: 10 subjects × 8 runs/subject × 7 examples/run. [sent-203, score-0.098]

70 Kernel hyperalignment allows us to (a) use nonlinear measures of similarity, and (b) consider more voxels for the alignment. [sent-204, score-0.872]

71 Consequently, we (a) experiment with a variety of kernels, and (b) do not need to pre-select or screen voxels as was done in [9]—we include them all. [sent-205, score-0.119]

72 We used the first 400 TRs from each subject’s movie data, and aligned each hemisphere separately. [sent-209, score-0.111]

73 The kernel functions are supplied in the Supplementary Material. [sent-210, score-0.162]

74 As observed in [9] and repeated here, hyperalignment leads to increased between-subject accuracy and outperforms within-subject accuracy. [sent-211, score-0.757]

75 Whereas employing Algorithm 1 for 2,997 voxels is feasible (and slow), 133,590 voxels is not feasible at all. [sent-213, score-0.238]

76 Figure 1 displays the cross-validated, between-subject classification accuracy for varying (α, β) where α = 1−β. [sent-215, score-0.03]

77 This traces out a route from CCA (α ≈ 0) to hyperalignment (α = 1). [sent-216, score-0.727]

78 When compared to the alignments in [9], our voxel counts are orders of magnitude larger. [sent-217, score-0.148]

79 For our four chosen kernels, hyperalignment (α = 1) presents itself as the option with near-greatest accuracy. [sent-218, score-0.727]

80 Our results support the robustness of hyperalignment and imply that voxel selection may be a crucial pre-processing step when dealing with the whole volume. [sent-219, score-0.841]

81 More voxels mean more noisy voxels, and hyperalignment does not distinguish itself from anatomical alignment when the entire cortex is considered. [sent-220, score-1.368]

82 MDS takes as input all of the pairwise distances between subjects (the previous section discussed distance calculations). [sent-222, score-0.129]

83 Figure 2 depicts the optimal Euclidean representation of our 10 subjects before and after kernel hyperalignment ((α, β) = (1, 0)) with respect to the first 400 TRs of the movie data. [sent-223, score-1.032]
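
Classical MDS recovers a low-dimensional Euclidean embedding from the matrix of pairwise distances by double-centering and eigendecomposition; a compact sketch of that step (the embedding dimension of 2 matches the figure, everything else is generic):

```python
import numpy as np

def classical_mds(D, dim=2):
    """Embed points in `dim` dimensions from a matrix D of pairwise Euclidean distances."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J                       # double-centered squared distances
    lam, V = np.linalg.eigh(B)
    idx = np.argsort(lam)[::-1][:dim]                 # keep the largest eigenvalues
    return V[:, idx] * np.sqrt(np.maximum(lam[idx], 0.0))

rng = np.random.default_rng(10)
pts = rng.standard_normal((10, 2))                    # 10 "subjects" living in a 2-D plane
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
emb = classical_mds(D, dim=2)
D_hat = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
print(np.allclose(D, D_hat))                          # original distances are reproduced: True
```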

84 Focusing on VT, kernel hyperalignment manages to cluster 7 of the 10 subjects. [sent-224, score-0.918]

85 However, when we shift to the entire cortex, we see that anatomical alignment has already succeeded in a similar clustering. [sent-225, score-0.476]

86 Kernel hyperalignment manages to group the subjects closer together, and manifests itself as a re-centering. [sent-226, score-0.854]

87 Figure 1: Cross-validated between-subject classification accuracy (7 labels) as a function of the regularization parameter, α = 1−β, for various kernels after alignment. [sent-275, score-0.03]

88 The solid curves are for Ventral Temporal and the dashed curves are for the entire cortex. [sent-276, score-0.034]

89 Conclusion: We have extended hyperalignment in both scale and feature space. [sent-281, score-0.759]

90 Kernel hyperalignment can handle a large number of original features and incorporate nonlinear measures of similarity. [sent-282, score-0.753]

91 We have also shown how to use the linear maps—applied in feature space—for post-alignment classification. [sent-283, score-0.032]

92 In the setting of fMRI, we have demonstrated successful alignment with a variety of kernels. [sent-284, score-0.305]

93 Kernel hyperalignment achieved better between-subject classification than anatomical alignment for VT. [sent-285, score-1.169]

94 There was no noticeable difference when we considered the entire cortex. [sent-286, score-0.034]

95 Nevertheless, kernel hyperalignment proved robust and did not degrade with increasing voxel count. [sent-287, score-1.003]

96 Empirically, we have noticed a tradeoff between feature cardinality and classification accuracy, motivating the need for intelligent feature selection within our established framework. [sent-289, score-0.064]

97 Although we have limited our focus to fMRI data analysis, kernel hyperalignment can be applied to other research areas which rely on multi-set Procrustes problems. [sent-290, score-0.889]

98 A common, high-dimensional model of the representational space in human ventral temporal cortex. [sent-367, score-0.102]

99 A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. [sent-396, score-0.029]

100 Co-planar stereotaxic atlas of the human brain: 3-dimensional proportional system: an approach to cerebral imaging. [sent-438, score-0.029]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('hyperalignment', 0.727), ('alignment', 0.305), ('procrustes', 0.164), ('kernel', 0.162), ('trs', 0.145), ('anatomical', 0.137), ('fmri', 0.126), ('voxels', 0.119), ('voxel', 0.114), ('rt', 0.113), ('foreach', 0.108), ('subjects', 0.098), ('gi', 0.094), ('bsc', 0.094), ('id', 0.084), ('diagj', 0.07), ('mds', 0.07), ('ramadge', 0.07), ('ventral', 0.068), ('gj', 0.068), ('mt', 0.067), ('aligned', 0.066), ('qi', 0.06), ('xa', 0.059), ('qj', 0.056), ('ut', 0.048), ('conroy', 0.047), ('guntupalli', 0.047), ('phi', 0.047), ('cortex', 0.046), ('bi', 0.046), ('representer', 0.046), ('movie', 0.045), ('subject', 0.045), ('ri', 0.045), ('bj', 0.044), ('vt', 0.043), ('haxby', 0.041), ('lorbert', 0.041), ('qt', 0.041), ('svd', 0.04), ('ai', 0.04), ('orthogonal', 0.04), ('loo', 0.038), ('unaligned', 0.038), ('mk', 0.038), ('xt', 0.037), ('rn', 0.037), ('rm', 0.036), ('regularized', 0.035), ('classi', 0.034), ('lemma', 0.034), ('alignments', 0.034), ('temporal', 0.034), ('entire', 0.034), ('vi', 0.033), ('ka', 0.033), ('multiplication', 0.032), ('canonical', 0.032), ('feature', 0.032), ('bt', 0.031), ('lengthy', 0.031), ('distances', 0.031), ('accuracy', 0.03), ('manages', 0.029), ('cerebral', 0.029), ('correspondence', 0.029), ('stores', 0.029), ('tr', 0.028), ('ir', 0.028), ('identity', 0.028), ('block', 0.027), ('offset', 0.027), ('svm', 0.027), ('visualization', 0.027), ('degrees', 0.026), ('nonlinear', 0.026), ('maps', 0.025), ('spatial', 0.025), ('sigmoid', 0.025), ('rs', 0.025), ('ij', 0.025), ('matrices', 0.024), ('types', 0.024), ('orthogonality', 0.024), ('sch', 0.023), ('initialize', 0.023), ('plane', 0.023), ('rj', 0.023), ('end', 0.022), ('st', 0.022), ('forming', 0.022), ('relational', 0.022), ('freedom', 0.022), ('gt', 0.021), ('appreciable', 0.021), ('gateway', 0.021), ('connolly', 0.021), ('arias', 0.021), ('edelman', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 167 nips-2012-Kernel Hyperalignment

Author: Alexander Lorbert, Peter J. Ramadge

Abstract: We offer a regularized, kernel extension of the multi-set, orthogonal Procrustes problem, or hyperalignment. Our new method, called Kernel Hyperalignment, expands the scope of hyperalignment to include nonlinear measures of similarity and enables the alignment of multiple datasets with a large number of base features. With direct application to fMRI data analysis, kernel hyperalignment is well-suited for multi-subject alignment of large ROIs, including the entire cortex. We report experiments using real-world, multi-subject fMRI data. 1

2 0.13708054 193 nips-2012-Learning to Align from Scratch

Author: Gary Huang, Marwan Mattar, Honglak Lee, Erik G. Learned-miller

Abstract: Unsupervised joint alignment of images has been demonstrated to improve performance on recognition tasks such as face verification. Such alignment reduces undesired variability due to factors such as pose, while only requiring weak supervision in the form of poorly aligned examples. However, prior work on unsupervised alignment of complex, real-world images has required the careful selection of feature representation based on hand-crafted image descriptors, in order to achieve an appropriate, smooth optimization landscape. In this paper, we instead propose a novel combination of unsupervised joint alignment with unsupervised feature learning. Specifically, we incorporate deep learning into the congealing alignment framework. Through deep learning, we obtain features that can represent the image at differing resolutions based on network depth, and that are tuned to the statistics of the specific data being aligned. In addition, we modify the learning algorithm for the restricted Boltzmann machine by incorporating a group sparsity penalty, leading to a topographic organization of the learned filters and improving subsequent alignment results. We apply our method to the Labeled Faces in the Wild database (LFW). Using the aligned images produced by our proposed unsupervised algorithm, we achieve higher accuracy in face verification compared to prior work in both unsupervised and supervised alignment. We also match the accuracy for the best available commercial method. 1

3 0.12800375 28 nips-2012-A systematic approach to extracting semantic information from functional MRI data

Author: Francisco Pereira, Matthew Botvinick

Abstract: This paper introduces a novel classification method for functional magnetic resonance imaging datasets with tens of classes. The method is designed to make predictions using information from as many brain locations as possible, instead of resorting to feature selection, and does this by decomposing the pattern of brain activation into differently informative sub-regions. We provide results over a complex semantic processing dataset that show that the method is competitive with state-of-the-art feature selection and also suggest how the method may be used to perform group or exploratory analyses of complex class structure. 1

4 0.075828038 264 nips-2012-Optimal kernel choice for large-scale two-sample tests

Author: Arthur Gretton, Dino Sejdinovic, Heiko Strathmann, Sivaraman Balakrishnan, Massimiliano Pontil, Kenji Fukumizu, Bharath K. Sriperumbudur

Abstract: Given samples from distributions p and q, a two-sample test determines whether to reject the null hypothesis that p = q, based on the value of a test statistic measuring the distance between the samples. One choice of test statistic is the maximum mean discrepancy (MMD), which is a distance between embeddings of the probability distributions in a reproducing kernel Hilbert space. The kernel used in obtaining these embeddings is critical in ensuring the test has high power, and correctly distinguishes unlike distributions with high probability. A means of parameter selection for the two-sample test based on the MMD is proposed. For a given test level (an upper bound on the probability of making a Type I error), the kernel is chosen so as to maximize the test power, and minimize the probability of making a Type II error. The test statistic, test threshold, and optimization over the kernel parameters are obtained with cost linear in the sample size. These properties make the kernel selection and test procedures suited to data streams, where the observations cannot all be stored in memory. In experiments, the new kernel selection approach yields a more powerful test than earlier kernel selection heuristics.

5 0.073275536 188 nips-2012-Learning from Distributions via Support Measure Machines

Author: Krikamol Muandet, Kenji Fukumizu, Francesco Dinuzzo, Bernhard Schölkopf

Abstract: This paper presents a kernel-based discriminative learning framework on probability measures. Rather than relying on large collections of vectorial training examples, our framework learns using a collection of probability distributions that have been constructed to meaningfully represent training data. By representing these probability distributions as mean embeddings in the reproducing kernel Hilbert space (RKHS), we are able to apply many standard kernel-based learning techniques in straightforward fashion. To accomplish this, we construct a generalization of the support vector machine (SVM) called a support measure machine (SMM). Our analyses of SMMs provides several insights into their relationship to traditional SVMs. Based on such insights, we propose a flexible SVM (FlexSVM) that places different kernel functions on each training example. Experimental results on both synthetic and real-world data demonstrate the effectiveness of our proposed framework. 1

6 0.073161863 284 nips-2012-Q-MKL: Matrix-induced Regularization in Multi-Kernel Learning with Applications to Neuroimaging

7 0.05932514 227 nips-2012-Multiclass Learning with Simplex Coding

8 0.058519043 157 nips-2012-Identification of Recurrent Patterns in the Activation of Brain Networks

9 0.053994555 340 nips-2012-The representer theorem for Hilbert spaces: a necessary and sufficient condition

10 0.053298578 231 nips-2012-Multiple Operator-valued Kernel Learning

11 0.052296404 330 nips-2012-Supervised Learning with Similarity Functions

12 0.051718354 280 nips-2012-Proper losses for learning from partial labels

13 0.049127072 168 nips-2012-Kernel Latent SVM for Visual Recognition

14 0.0475001 249 nips-2012-Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison

15 0.046839617 42 nips-2012-Angular Quantization-based Binary Codes for Fast Similarity Search

16 0.046658587 187 nips-2012-Learning curves for multi-task Gaussian process regression

17 0.045853179 197 nips-2012-Learning with Recursive Perceptual Representations

18 0.045727413 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function

19 0.04532123 199 nips-2012-Link Prediction in Graphs with Autoregressive Features

20 0.044955026 359 nips-2012-Variational Inference for Crowdsourcing


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.135), (1, 0.018), (2, -0.014), (3, -0.0), (4, 0.077), (5, -0.005), (6, 0.001), (7, 0.05), (8, -0.002), (9, -0.047), (10, 0.02), (11, -0.001), (12, 0.019), (13, 0.009), (14, 0.041), (15, -0.02), (16, 0.018), (17, 0.057), (18, 0.013), (19, -0.044), (20, 0.042), (21, -0.035), (22, -0.042), (23, -0.129), (24, 0.048), (25, 0.018), (26, 0.005), (27, 0.003), (28, -0.045), (29, 0.028), (30, -0.025), (31, -0.016), (32, 0.027), (33, 0.04), (34, 0.107), (35, -0.023), (36, -0.009), (37, -0.022), (38, -0.061), (39, 0.023), (40, -0.073), (41, -0.081), (42, 0.07), (43, -0.034), (44, -0.021), (45, -0.055), (46, 0.074), (47, 0.055), (48, -0.015), (49, -0.044)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.89629304 167 nips-2012-Kernel Hyperalignment

Author: Alexander Lorbert, Peter J. Ramadge

Abstract: We offer a regularized, kernel extension of the multi-set, orthogonal Procrustes problem, or hyperalignment. Our new method, called Kernel Hyperalignment, expands the scope of hyperalignment to include nonlinear measures of similarity and enables the alignment of multiple datasets with a large number of base features. With direct application to fMRI data analysis, kernel hyperalignment is well-suited for multi-subject alignment of large ROIs, including the entire cortex. We report experiments using real-world, multi-subject fMRI data. 1

2 0.68326497 284 nips-2012-Q-MKL: Matrix-induced Regularization in Multi-Kernel Learning with Applications to Neuroimaging

Author: Chris Hinrichs, Vikas Singh, Jiming Peng, Sterling Johnson

Abstract: Multiple Kernel Learning (MKL) generalizes SVMs to the setting where one simultaneously trains a linear classifier and chooses an optimal combination of given base kernels. Model complexity is typically controlled using various norm regularizations on the base kernel mixing coefficients. Existing methods neither regularize nor exploit potentially useful information pertaining to how kernels in the input set ‘interact’; that is, higher order kernel-pair relationships that can be easily obtained via unsupervised (similarity, geodesics), supervised (correlation in errors), or domain knowledge driven mechanisms (which features were used to construct the kernel?). We show that by substituting the norm penalty with an arbitrary quadratic function Q ⪰ 0, one can impose a desired covariance structure on mixing weights, and use this as an inductive bias when learning the concept. This formulation significantly generalizes the widely used 1- and 2-norm MKL objectives. We explore the model’s utility via experiments on a challenging Neuroimaging problem, where the goal is to predict a subject’s conversion to Alzheimer’s Disease (AD) by exploiting aggregate information from many distinct imaging modalities. Here, our new model outperforms the state of the art (p-values < 10^−3). We briefly discuss ramifications in terms of learning bounds (Rademacher complexity). 1

3 0.66038352 249 nips-2012-Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison

Author: Tianbao Yang, Yu-feng Li, Mehrdad Mahdavi, Rong Jin, Zhi-Hua Zhou

Abstract: Both random Fourier features and the Nyström method have been successfully applied to efficient kernel learning. In this work, we investigate the fundamental difference between these two approaches, and how the difference could affect their generalization performances. Unlike approaches based on random Fourier features where the basis functions (i.e., cosine and sine functions) are sampled from a distribution independent from the training data, basis functions used by the Nyström method are randomly sampled from the training examples and are therefore data dependent. By exploring this difference, we show that when there is a large gap in the eigen-spectrum of the kernel matrix, approaches based on the Nyström method can yield impressively better generalization error bound than random Fourier features based approach. We empirically verify our theoretical findings on a wide range of large data sets. 1

4 0.65449655 231 nips-2012-Multiple Operator-valued Kernel Learning

Author: Hachem Kadri, Alain Rakotomamonjy, Philippe Preux, Francis R. Bach

Abstract: Positive definite operator-valued kernels generalize the well-known notion of reproducing kernels, and are naturally adapted to multi-output learning situations. This paper addresses the problem of learning a finite linear combination of infinite-dimensional operator-valued kernels which are suitable for extending functional data analysis methods to nonlinear contexts. We study this problem in the case of kernel ridge regression for functional responses with an ℓr-norm constraint on the combination coefficients (r ≥ 1). The resulting optimization problem is more involved than those of multiple scalar-valued kernel learning since operator-valued kernels pose more technical and theoretical issues. We propose a multiple operator-valued kernel learning algorithm based on solving a system of linear operator equations by using a block coordinate-descent procedure. We experimentally validate our approach on a functional regression task in the context of finger movement prediction in brain-computer interfaces. 1

5 0.57510817 188 nips-2012-Learning from Distributions via Support Measure Machines

Author: Krikamol Muandet, Kenji Fukumizu, Francesco Dinuzzo, Bernhard Schölkopf

Abstract: This paper presents a kernel-based discriminative learning framework on probability measures. Rather than relying on large collections of vectorial training examples, our framework learns using a collection of probability distributions that have been constructed to meaningfully represent training data. By representing these probability distributions as mean embeddings in the reproducing kernel Hilbert space (RKHS), we are able to apply many standard kernel-based learning techniques in straightforward fashion. To accomplish this, we construct a generalization of the support vector machine (SVM) called a support measure machine (SMM). Our analyses of SMMs provides several insights into their relationship to traditional SVMs. Based on such insights, we propose a flexible SVM (FlexSVM) that places different kernel functions on each training example. Experimental results on both synthetic and real-world data demonstrate the effectiveness of our proposed framework. 1

6 0.57427573 177 nips-2012-Learning Invariant Representations of Molecules for Atomization Energy Prediction

7 0.54670703 144 nips-2012-Gradient-based kernel method for feature extraction and variable selection

8 0.52755094 264 nips-2012-Optimal kernel choice for large-scale two-sample tests

9 0.52615482 330 nips-2012-Supervised Learning with Similarity Functions

10 0.52231908 28 nips-2012-A systematic approach to extracting semantic information from functional MRI data

11 0.50071138 340 nips-2012-The representer theorem for Hilbert spaces: a necessary and sufficient condition

12 0.49438262 157 nips-2012-Identification of Recurrent Patterns in the Activation of Brain Networks

13 0.48280567 198 nips-2012-Learning with Target Prior

14 0.48191166 151 nips-2012-High-Order Multi-Task Feature Learning to Identify Longitudinal Phenotypic Markers for Alzheimer's Disease Progression Prediction

15 0.44943085 168 nips-2012-Kernel Latent SVM for Visual Recognition

16 0.44808587 363 nips-2012-Wavelet based multi-scale shape features on arbitrary surfaces for cortical thickness discrimination

17 0.44752565 306 nips-2012-Semantic Kernel Forests from Multiple Taxonomies

18 0.44678274 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs

19 0.44122344 46 nips-2012-Assessing Blinding in Clinical Trials

20 0.4337962 273 nips-2012-Predicting Action Content On-Line and in Real Time before Action Onset – an Intracranial Human Study


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.032), (17, 0.013), (21, 0.036), (38, 0.093), (42, 0.032), (53, 0.015), (54, 0.038), (55, 0.028), (74, 0.066), (76, 0.158), (77, 0.273), (80, 0.08), (92, 0.045)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.86053193 14 nips-2012-A P300 BCI for the Masses: Prior Information Enables Instant Unsupervised Spelling

Author: Pieter-jan Kindermans, Hannes Verschore, David Verstraeten, Benjamin Schrauwen

Abstract: The usability of Brain Computer Interfaces (BCI) based on the P300 speller is severely hindered by the need for long training times and many repetitions of the same stimulus. In this contribution we introduce a set of unsupervised hierarchical probabilistic models that tackle both problems simultaneously by incorporating prior knowledge from two sources: information from other training subjects (through transfer learning) and information about the words being spelled (through language models). We show, that due to this prior knowledge, the performance of the unsupervised models parallels and in some cases even surpasses that of supervised models, while eliminating the tedious training session. 1

same-paper 2 0.81093705 167 nips-2012-Kernel Hyperalignment

Author: Alexander Lorbert, Peter J. Ramadge

Abstract: We offer a regularized, kernel extension of the multi-set, orthogonal Procrustes problem, or hyperalignment. Our new method, called Kernel Hyperalignment, expands the scope of hyperalignment to include nonlinear measures of similarity and enables the alignment of multiple datasets with a large number of base features. With direct application to fMRI data analysis, kernel hyperalignment is well-suited for multi-subject alignment of large ROIs, including the entire cortex. We report experiments using real-world, multi-subject fMRI data. 1

3 0.76004159 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models

Author: Jeff Beck, Alexandre Pouget, Katherine A. Heller

Abstract: Recent experiments have demonstrated that humans and animals typically reason probabilistically about their environment. This ability requires a neural code that represents probability distributions and neural circuits that are capable of implementing the operations of probabilistic inference. The proposed probabilistic population coding (PPC) framework provides a statistically efficient neural representation of probability distributions that is both broadly consistent with physiological measurements and capable of implementing some of the basic operations of probabilistic inference in a biologically plausible way. However, these experiments and the corresponding neural models have largely focused on simple (tractable) probabilistic computations such as cue combination, coordinate transformations, and decision making. As a result it remains unclear how to generalize this framework to more complex probabilistic computations. Here we address this short coming by showing that a very general approximate inference algorithm known as Variational Bayesian Expectation Maximization can be naturally implemented within the linear PPC framework. We apply this approach to a generic problem faced by any given layer of cortex, namely the identification of latent causes of complex mixtures of spikes. We identify a formal equivalent between this spike pattern demixing problem and topic models used for document classification, in particular Latent Dirichlet Allocation (LDA). We then construct a neural network implementation of variational inference and learning for LDA that utilizes a linear PPC. This network relies critically on two non-linear operations: divisive normalization and super-linear facilitation, both of which are ubiquitously observed in neural circuits. We also demonstrate how online learning can be achieved using a variation of Hebb’s rule and describe an extension of this work which allows us to deal with time varying and correlated latent causes. 1 Introduction to Probabilistic Inference in Cortex Probabilistic (Bayesian) reasoning provides a coherent and, in many ways, optimal framework for dealing with complex problems in an uncertain world. It is, therefore, somewhat reassuring that behavioural experiments reliably demonstrate that humans and animals behave in a manner consistent with optimal probabilistic reasoning when performing a wide variety of perceptual [1, 2, 3], motor [4, 5, 6], and cognitive tasks[7]. This remarkable ability requires a neural code that represents probability distribution functions of task relevant stimuli rather than just single values. While there 1 are many ways to represent functions, Bayes rule tells us that when it comes to probability distribution functions, there is only one statistically optimal way to do it. More precisely, Bayes Rule states that any pattern of activity, r, that efficiently represents a probability distribution over some task relevant quantity s, must satisfy the relationship p(s|r) ∝ p(r|s)p(s), where p(r|s) is the stimulus conditioned likelihood function that specifies the form of neural variability, p(s) gives the prior belief regarding the stimulus, and p(s|r) gives the posterior distribution over values of the stimulus, s given the representation r . Of course, it is unlikely that the nervous system consistently achieves this level of optimality. 
None-the-less, Bayes rule suggests the existence of a link between neural variability as characterized by the likelihood function p(r|s) and the state of belief of a mature statistical learning machine such as the brain. The so called Probabilistic Population Coding (or PPC) framework[8, 9, 10] takes this link seriously by proposing that the function encoded by a pattern of neural activity r is, in fact, the likelihood function p(r|s). When this is the case, the precise form of the neural variability informs the nature of the neural code. For example, the exponential family of statistical models with linear sufficient statistics has been shown to be flexible enough to model the first and second order statistics of in vivo recordings in awake behaving monkeys[9, 11, 12] and anesthetized cats[13]. When the likelihood function is modeled in this way, the log posterior probability over the stimulus is linearly encoded by neural activity, i.e. log p(s|r) = h(s) · r − log Z(r) (1) Here, the stimulus dependent kernel, h(s), is a vector of functions of s, the dot represents a standard dot product, and Z(r) is the partition function which serves to normalize the posterior. This log linear form for a posterior distribution is highly computationally convenient and allows for evidence integration to be implemented via linear operations on neural activity[14, 8]. Proponents of this kind of linear PPC have demonstrated how to build biologically plausible neural networks capable of implementing the operations of probabilistic inference that are needed to optimally perform the behavioural tasks listed above. This includes, linear PPC implementations of cue combination[8], evidence integration over time, maximum likelihood and maximum a posterior estimation[9], coordinate transformation/auditory localization[10], object tracking/Kalman filtering[10], explaining away[10], and visual search[15]. Moreover, each of these neural computations has required only a single recurrently connected layer of neurons that is capable of just two non-linear operations: coincidence detection and divisive normalization, both of which are widely observed in cortex[16, 17]. Unfortunately, this research program has been a piecemeal effort that has largely proceeded by building neural networks designed deal with particular problems. As a result, there have been no proposals for a general principle by which neural network implementations of linear PPCs might be generated and no suggestions regarding how to deal with complex (intractable) problems of probabilistic inference. In this work, we will partially address this short coming by showing that Variation Bayesian Expectation Maximization (VBEM) algorithm provides a general scheme for approximate inference and learning with linear PPCs. In section 2, we briefly review the VBEM algorithm and show how it naturally leads to a linear PPC representation of the posterior as well as constraints on the neural network dynamics which build that PPC representation. Because this section describes the VB-PPC approach rather abstractly, the remainder of the paper is dedicated to concrete applications. As a motivating example, we consider the problem of inferring the concentrations of odors in an olfactory scene from a complex pattern of spikes in a population of olfactory receptor neurons (ORNs). In section 3, we argue that this requires solving a spike pattern demixing problem which is indicative of the generic problem faced by many layers of cortex. 
We then show that this demixing problem is equivalent to the problem addressed by a class of models for text documents know as probabilistic topic models, in particular Latent Dirichlet Allocation or LDA[18]. In section 4, we apply the VB-PPC approach to build a neural network implementation of probabilistic inference and learning for LDA. This derivation shows that causal inference with linear PPC’s also critically relies on divisive normalization. This result suggests that this particular non-linearity may be involved in very general and fundamental probabilistic computation, rather than simply playing a role in gain modulation. In this section, we also show how this formulation allows for a probabilistic treatment of learning and show that a simple variation of Hebb’s rule can implement Bayesian learning in neural circuits. 2 We conclude this work by generalizing this approach to time varying inputs by introducing the Dynamic Document Model (DDM) which can infer short term fluctuations in the concentrations of individual topics/odors and can be used to model foraging and other tracking tasks. 2 Variational Bayesian Inference with linear Probabilistic Population Codes Variational Bayesian (VB) inference refers to a class of deterministic methods for approximating the intractable integrals which arise in the context of probabilistic reasoning. Properly implemented it can result a fast alternative to sampling based methods of inference such as MCMC[19] sampling. Generically, the goal of any Bayesian inference algorithm is to infer a posterior distribution over behaviourally relevant latent variables Z given observations X and a generative model which specifies the joint distribution p(X, Θ, Z). This task is confounded by the fact that the generative model includes latent parameters Θ which must be marginalized out, i.e. we wish to compute, p(Z|X) ∝ p(X, Θ, Z)dΘ (2) When the number of latent parameters is large this integral can be quite unwieldy. The VB algorithms simplify this marginalization by approximating the complex joint distribution over behaviourally relevant latents and parameters, p(Θ, Z|X), with a distribution q(Θ, Z) for which integrals of this form are easier to deal with in some sense. There is some art to choosing the particular form for the approximating distribution to make the above integral tractable, however, a factorized approximation is common, i.e. q(Θ, Z) = qΘ (Θ)qZ (Z). Regardless, for any given observation X, the approximate posterior is found by minimizing the Kullback-Leibler divergence between q(Θ, Z) and p(Θ, Z|X). When a factorized posterior is assumed, the Variational Bayesian Expectation Maximization (VBEM) algorithm finds a local minimum of the KL divergence by iteratively updating, qΘ (Θ) and qZ (Z) according to the scheme n log qΘ (Θ) ∼ log p(X, Θ, Z) n qZ (Z) and n+1 log qZ (Z) ∼ log p(X, Θ, Z) n qΘ (Θ) (3) Here the brackets indicate an expected value taken with respect to the subscripted probability distribution function and the tilde indicates equality up to a constant which is independent of Θ and Z. The key property to note here is that the approximate posterior which results from this procedure is in an exponential family form and is therefore representable by a linear PPC (Eq. 1). This feature allows for the straightforward construction of networks which implement the VBEM algorithm with linear PPC’s in the following way. 
If rn and rn are patterns of activity that use a linear PPC representation Θ Z of the relevant posteriors, then n log qΘ (Θ) ∼ hΘ (Θ) · rn Θ and n+1 log qZ (Z) ∼ hZ (Z) · rn+1 . Z (4) Here the stimulus dependent kernels hZ (Z) and hΘ (Θ) are chosen so that their outer product results in a basis that spans the function space on Z × Θ given by log p(X, Θ, Z) for every X. This choice guarantees that there exist functions fΘ (X, rn ) and fZ (X, rn ) such that Z Θ rn = fΘ (X, rn ) Θ Z and rn+1 = fZ (X, rn ) Θ Z (5) satisfy Eq. 3. When this is the case, simply iterating the discrete dynamical system described by Eq. 5 until convergence will find the VBEM approximation to the posterior. This is one way to build a neural network implementation of the VB algorithm. However, its not the only way. In general, any dynamical system which has stable fixed points in common with Eq. 5 can also be said to implement the VBEM algorithm. In the example below we will take advantage of this flexibility in order to build biologically plausible neural network implementations. 3 Response! to Mixture ! of Odors! Single Odor Response Cause Intensity Figure 1: (Left) Each cause (e.g. coffee) in isolation results in a pattern of neural activity (top). When multiple causes contribute to a scene this results in an overall pattern of neural activity which is a mixture of these patterns weighted by the intensities (bottom). (Right) The resulting pattern can be represented by a raster, where each spike is colored by its corresponding latent cause. 3 Probabilistic Topic Models for Spike Train Demixing Consider the problem of odor identification depicted in Fig. 1. A typical mammalian olfactory system consists of a few hundred different types of olfactory receptor neurons (ORNs), each of which responds to a wide range of volatile chemicals. This results in a highly distributed code for each odor. Since, a typical olfactory scene consists of many different odors at different concentrations, the pattern of ORN spike trains represents a complex mixture. Described in this way, it is easy to see that the problem faced by early olfactory cortex can be described as the task of demixing spike trains to infer latent causes (odor intensities). In many ways this olfactory problem is a generic problem faced by each cortical layer as it tries to make sense of the activity of the neurons in the layer below. The input patterns of activity consist of spikes (or spike counts) labeled by the axons which deliver them and summarized by a histogram which indicates how many spikes come from each input neuron. Of course, just because a spike came from a particular neuron does not mean that it had a particular cause, just as any particular ORN spike could have been caused by any one of a large number of volatile chemicals. Like olfactory codes, cortical codes are often distributed and multiple latent causes can be present at the same time. Regardless, this spike or histogram demixing problem is formally equivalent to a class of demixing problems which arise in the context of probabilistic topic models used for document modeling. A simple but successful example of this kind of topic model is called Latent Dirichlet Allocation (LDA) [18]. LDA assumes that word order in documents is irrelevant and, therefore, models documents as histograms of word counts. It also assumes that there are K topics and that each of these topics appears in different proportions in each document, e.g. 
80% of the words in a document might be concerned with coffee and 20% with strawberries. Words from a given topic are themselves drawn from a distribution over words associated with that topic, e.g. when talking about coffee you have a 5% chance of using the word ’bitter’. The goal of LDA is to infer both the distribution over topics discussed in each document and the distribution of words associated with each topic. We can map the generative model for LDA onto the task of spike demixing in cortex by letting topics become latent causes or odors, words become neurons, word occurrences become spikes, word distributions associated with each topic become patterns of neural activity associated with each cause, and different documents become the observed patterns of neural activity on different trials. This equivalence is made explicit in Fig. 2 which describes the standard generative model for LDA applied to documents on the left and mixtures of spikes on the right. 4 LDA Inference and Network Implementation In this section we will apply the VB-PPC formulation to build a biologically plausible network capable of approximating probabilistic inference for spike pattern demixing. For simplicity, we will use the equivalent Gamma-Poisson formulation of LDA which directly models word and topic counts 4 1. For each topic k = 1, . . . , K, (a) Distribution over words βk ∼ Dirichlet(η0 ) 2. For document d = 1, . . . , D, (a) Distribution over topics θd ∼ Dirichlet(α0 ) (b) For word m = 1, . . . , Ωd i. Topic assignment zd,m ∼ Multinomial(θd ) ii. Word assignment ωd,m ∼ Multinomial(βzm ) 1. For latent cause k = 1, . . . , K, (a) Pattern of neural activity βk ∼ Dirichlet(η0 ) 2. For scene d = 1, . . . , D, (a) Relative intensity of each cause θd ∼ Dirichlet(α0 ) (b) For spike m = 1, . . . , Ωd i. Cause assignment zd,m ∼ Multinomial(θd ) ii. Neuron assignment ωd,m ∼ Multinomial(βzm ) Figure 2: (Left) The LDA generative model in the context of document modeling. (Right) The corresponding LDA generative model mapped onto the problem of spike demixing. Text related attributes on the left, in red, have been replaced with neural attributes on the right, in green. rather than topic assignments. Specifically, we define, Rd,j to be the number of times neuron j fires during trial d. Similarly, we let Nd,j,k to be the number of times a spike in neuron j comes from cause k in trial d. These new variables play the roles of the cause and neuron assignment variables, zd,m and ωd,m by simply counting them up. If we let cd,k be an un-normalized intensity of cause j such that θd,k = cd,k / k cd,k then the generative model, Rd,j = k Nd,j,k Nd,j,k ∼ Poisson(βj,k cd,k ) 0 cd,k ∼ Gamma(αk , C −1 ). (6) is equivalent to the topic models described above. Here the parameter C is a scale parameter which sets the expected total number of spikes from the population on each trial. Note that, the problem of inferring the wj,k and cd,k is a non-negative matrix factorization problem similar to that considered by Lee and Seung[20]. The primary difference is that, here, we are attempting to infer a probability distribution over these quantities rather than maximum likelihood estimates. See supplement for details. Following the prescription laid out in section 2, we approximate the posterior over latent variables given a set of input patterns, Rd , d = 1, . . . , D, with a factorized distribution of the form, qN (N)qc (c)qβ (β). 
This results in marginal posterior distributions q (β:,k |η:,k ), q cd,k |αd,k , C −1 + 1 ), and q (Nd,j,: | log pd,j,: , Rd,i ) which are Dirichlet, Gamma, and Multinomial respectively. Here, the parameters η:,k , αd,k , and log pd,j,: are the natural parameters of these distributions. The VBEM update algorithm yields update rules for these parameters which are summarized in Fig. 3 Algorithm1. Algorithm 1: Batch VB updates 1: while ηj,k not converged do 2: for d = 1, · · · , D do 3: while pd,j,k , αd,k not converged do 4: αd,k → α0 + j Rd,j pd,j,k 5: pd,j,k → Algorithm 2: Online VB updates 1: for d = 1, · · · , D do 2: reinitialize pj,k , αk ∀j, k 3: while pj,k , αk not converged do 4: αk → α0 + j Rd,j pj,k 5: pj,k → exp (ψ(ηj,k )−ψ(¯k )) exp ψ(αk ) η η i exp (ψ(ηj,i )−ψ(¯i )) exp ψ(αi ) exp (ψ(ηj,k )−ψ(¯k )) exp ψ(αd,k ) η η i exp (ψ(ηj,i )−ψ(¯i )) exp ψ(αd,i ) 6: end while 7: end for 8: ηj,k = η 0 + 9: end while end while ηj,k → (1 − dt)ηj,k + dt(η 0 + Rd,j pj,k ) 8: end for 6: 7: d Rd,j pd,j,k Figure 3: Here ηk = j ηj,k and ψ(x) is the digamma function so that exp ψ(x) is a smoothed ¯ threshold linear function. Before we move on to the neural network implementation, note that this standard formulation of variational inference for LDA utilizes a batch learning scheme that is not biologically plausible. Fortunately, an online version of this variational algorithm was recently proposed and shown to give 5 superior results when compared to the batch learning algorithm[21]. This algorithm replaces the sum over d in update equation for ηj,k with an incremental update based upon only the most recently observed pattern of spikes. See Fig. 3 Algorithm 2. 4.1 Neural Network Implementation Recall that the goal was to build a neural network that implements the VBEM algorithm for the underlying latent causes of a mixture of spikes using a neural code that represents the posterior distribution via a linear PPC. A linear PPC represents the natural parameters of a posterior distribution via a linear operation on neural activity. Since the primary quantity of interest here is the posterior distribution over odor concentrations, qc (c|α), this means that we need a pattern of activity rα which is linearly related to the αk ’s in the equations above. One way to accomplish this is to simply assume that the firing rates of output neurons are equal to the positive valued αk parameters. Fig. 4 depicts the overall network architecture. Input patterns of activity, R, are transmitted to the synapses of a population of output neurons which represent the αk ’s. The output activity is pooled to ¯ form an un-normalized prediction of the activity of each input neuron, Rj , given the output layer’s current state of belief about the latent causes of the Rj . The activity at each synapse targeted by input neuron j is then inhibited divisively by this prediction. This results in a dendrite that reports to the ¯ soma a quantity, Nj,k , which represents the fraction of unexplained spikes from input neuron j that could be explained by latent cause k. A continuous time dynamical system with this feature and the property that it shares its fixed points with the LDA algorithm is given by d ¯ Nj,k dt d αk dt ¯ ¯ = wj,k Rj − Rj Nj,k = (7) ¯ Nj,k exp (ψ (¯k )) (α0 − αk ) + exp (ψ (αk )) η (8) i ¯ where Rj = k wj,k exp (ψ (αk )), and wj,k = exp (ψ (ηj,k )). Note that, despite its form, it is Eq. 7 which implements the required divisive normalization operation since, in the steady state, ¯ ¯ Nj,k = wj,k Rj /Rj . 
Regardless, this network has a variety of interesting properties that align well with biology. It predicts that a balance of excitation and inhibition is maintained in the dendrites via divisive normalization and that the role of inhibitory neurons is to predict the input spikes which target individual dendrites. It also predicts superlinear facilitation. Specifically, the final term on the right of Eq. 8 indicates that more active cells will be more sensitive to their dendritic inputs. Alternatively, this could be implemented via recurrent excitation at the population level. In either case, this is the mechanism by which the network implements a sparse prior on topic concentrations and stands in stark contrast to the winner take all mechanisms which rely on competitive mutual inhibition mechanisms. Additionally, the ηj in Eq. 8 represents a cell wide ’leak’ parameter that indicates that the total leak should be ¯ roughly proportional to the sum total weight of the synapses which drive the neuron. This predicts that cells that are highly sensitive to input should also decay back to baseline more quickly. This implementation also predicts Hebbian learning of synaptic weights. To observe this fact, note that the online update rule for the ηj,k parameters can be implemented by simply correlating the activity at ¯ each synapse, Nj,k with activity at the soma αj via the equation: τL d ¯ wj,k = exp (ψ (¯k )) (η0 − 1/2 − wj,k ) + Nj,k exp ψ (αk ) η dt (9) where τL is a long time constant for learning and we have used the fact that exp (ψ (ηjk )) ≈ ηjk −1/2 for x > 1. For a detailed derivation see the supplementary material. 5 Dynamic Document Model LDA is a rather simple generative model that makes several unrealistic assumptions about mixtures of sensory and cortical spikes. In particular, it assumes both that there are no correlations between the 6 Targeted Divisive Normalization Targeted Divisive Normalization αj Ri Input Neurons Recurrent Connections ÷ ÷ -1 -1 Σ μj Nij Ri Synapses Output Neurons Figure 4: The LDA network model. Dendritically targeted inhibition is pooled from the activity of all neurons in the output layer and acts divisively. Σ jj' Nij Input Neurons Synapses Output Neurons Figure 5: DDM network model also includes recurrent connections which target the soma with both a linear excitatory signal and an inhibitory signal that also takes the form of a divisive normalization. intensities of latent causes and that there are no correlations between the intensities of latent causes in temporally adjacent trials or scenes. This makes LDA a rather poor computational model for a task like olfactory foraging which requires the animal to track the rise a fall of odor intensities as it navigates its environment. We can model this more complicated task by replacing the static cause or odor intensity parameters with dynamic odor intensity parameters whose behavior is governed by an exponentiated Ornstein-Uhlenbeck process with drift and diffusion matrices given by (Λ and ΣD ). We call this variant of LDA the Dynamic Document Model (DDM) as it could be used to model smooth changes in the distribution of topics over the course of a single document. 5.1 DDM Model Thus the generative model for the DDM is as follows: 1. For latent cause k = 1, . . . , K, (a) Cause distribution over spikes βk ∼ Dirichlet(η0 ) 2. For scene t = 1, . . . 
This model bears many similarities to the Correlated and Dynamic Topic Models [22], but it models dynamics over a short time scale, where the dynamic relationship (Λ, Σ_D) is important.

5.2 Network Implementation

Once again the quantity of interest is the current distribution of latent causes, p(c(t) | R(τ), τ = 0..T). If no spikes occur, then no evidence is presented and posterior inference over c(t) is simply given by an undriven Kalman filter with parameters (Λ, Σ_D). A recurrent neural network which uses a linear PPC to encode a posterior that evolves according to a Kalman filter has the property that neural responses are linearly related to the inverse covariance matrix of the posterior, as well as to that inverse covariance matrix times the posterior mean. In the absence of evidence, it is easy to show that these quantities must evolve according to recurrent dynamics which implement divisive normalization [10]. Thus, the patterns of neural activity which linearly encode them must do so as well. When a new spike arrives, optimal inference is no longer possible and a variational approximation must be utilized. As is shown in the supplement, this variational approximation is similar to the one used for LDA. As a result, a network which can divisively inhibit its synapses is able to implement approximate Bayesian inference. Curiously, this implies that the addition of spatial and temporal correlations to the latent causes adds very little complexity to the VB-PPC network implementation of probabilistic inference. All that is required is an additional inhibitory population which targets the somata in the output population. See Fig. 5. The evidence-free update on the natural parameters is sketched below.
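The statement that, between spikes, inference reduces to an undriven Kalman filter can be illustrated by writing the standard prediction step in information form, i.e., directly on the natural parameters Σ^{-1} and Σ^{-1}µ that a linear PPC would carry. This is a generic sketch of that step (with Λ and Σ_D assumed known), not code from the paper or its supplement.

```python
import numpy as np

def undriven_information_step(Lam_info, h, Lambda, Sigma_D):
    """One evidence-free prediction step for c(t) ~ Normal(Lambda c(t-1), Sigma_D),
    acting on the natural parameters Lam_info = Sigma^{-1} and h = Sigma^{-1} mu."""
    Sigma = np.linalg.inv(Lam_info)                    # recover moments from natural parameters
    mu = Sigma @ h
    Sigma_pred = Lambda @ Sigma @ Lambda.T + Sigma_D   # covariance grows by diffusion
    mu_pred = Lambda @ mu                              # mean drifts under Lambda
    Lam_pred = np.linalg.inv(Sigma_pred)               # back to information form
    return Lam_pred, Lam_pred @ mu_pred
```

Because the PPC activity is linearly related to exactly these two quantities, their evidence-free evolution can be carried out by recurrent dynamics of the divisive-normalization type [10]; the variational correction applied when a spike arrives is what requires the dendritically targeted inhibition described above.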
Figure 6: (Left) Neural network approximation to the natural parameters of the posterior distribution over topics (the α's) as a function of the VBEM estimate of those same parameters, for a variety of 'documents'. (Center) Same as left, but for the natural parameters of the DDM (i.e., the entries of Σ^{-1}(t) and Σ^{-1}(t)µ(t) of the distribution over log topic intensities). (Right) Three example traces of cause intensity in the DDM. Black shows the true concentration; blue and red (indistinguishable) show MAP estimates for the network and VBEM algorithms.

6 Experimental Results

We compared the PPC neural network implementations of variational inference with the standard VBEM algorithm. This comparison is necessary because the two algorithms are not guaranteed to converge to the same solution: we only required that the neural network dynamics have the same fixed points as the standard VBEM algorithm, so the two algorithms may converge to different local minima of the KL divergence. For the network implementation of LDA we find good agreement between the neural network and VBEM estimates of the natural parameters of the posterior. See Fig. 6 (left), which shows the two algorithms' estimates of the shape parameter of the posterior distribution over topic (odor) concentrations (a quantity which is proportional to the expected concentration). This agreement, however, is not perfect, especially when posterior predicted concentrations are low. In part, this is because we are presenting the network with difficult inference problems for which the true posterior distribution over topics (odors) is highly correlated and multimodal; as a result, the objective function (KL divergence) is littered with local minima. Additionally, the discrete iterations of the VBEM algorithm can take very large steps in the space of natural parameters, while the neural network implementation cannot. In contrast, the network implementation of the DDM is in much better agreement with the VBEM estimates; see Fig. 6 (right). This is because the smooth temporal dynamics of the topics eliminate the need for the VBEM algorithm to take large steps, so the smooth network dynamics are better able to accurately track the VBEM algorithm's output. For simulation details please see the supplement.

7 Discussion and Conclusion

In this work we presented a general framework for inference and learning with linear probabilistic population codes. This framework takes advantage of the fact that the Variational Bayesian Expectation Maximization algorithm generates approximate posterior distributions in exponential family form, which is precisely the form needed to make probability distributions representable by a linear PPC. We then outlined a general means by which one can build a neural network implementation of the VB algorithm using this kind of neural code. We applied this VB-PPC framework to generate a biologically plausible neural network for spike train demixing. We chose this problem because it has many of the features of the canonical problem faced by nearly every layer of cortex, i.e., that of inferring the latent causes of complex mixtures of spike trains in the layer below. Curiously, this very complicated problem of probabilistic inference and learning ended up having a remarkably simple network solution, requiring only that neurons be capable of implementing divisive normalization via dendritically targeted inhibition and superlinear facilitation. Moreover, we showed that extending this approach to the more complex dynamic case, in which latent causes change in intensity over time, does not substantially increase the complexity of the neural circuit. Finally, we note that, while we utilized a rate coding scheme for our linear PPC, the basic equations would still apply to any spike-based log-probability code such as that considered by Boerlin and Deneve [23].

References

[1] Daniel Kersten, Pascal Mamassian, and Alan Yuille. Object perception as Bayesian inference. Annual Review of Psychology, 55:271–304, 2004.
[2] Marc O. Ernst and Martin S. Banks. Humans integrate visual and haptic information in a statistically optimal fashion. Nature, 415(6870):429–433, 2002.
[3] Yair Weiss, Eero P. Simoncelli, and Edward H. Adelson. Motion illusions as optimal percepts. Nature Neuroscience, 5(6):598–604, 2002.
[4] P. N. Sabes. The planning and control of reaching movements. Current Opinion in Neurobiology, 10(6):740–746, 2000.
[5] Konrad P. Körding and Daniel M. Wolpert. Bayesian integration in sensorimotor learning. Nature, 427(6971):244–247, 2004.
[6] Emanuel Todorov. Optimality principles in sensorimotor control. Nature Neuroscience, 7(9):907–915, 2004.
[7] Erno Téglás, Edward Vul, Vittorio Girotto, Michel Gonzalez, Joshua B. Tenenbaum, and Luca L. Bonatti. Pure reasoning in 12-month-old infants as probabilistic inference. Science, 332(6033):1054–1059, 2011.
[8] W. J. Ma, J. M. Beck, P. E. Latham, and A. Pouget. Bayesian inference with probabilistic population codes. Nature Neuroscience, 2006.
[9] Jeffrey M. Beck, Wei Ji Ma, Roozbeh Kiani, Tim Hanks, Anne K. Churchland, Jamie Roitman, Michael N. Shadlen, Peter E. Latham, and Alexandre Pouget. Probabilistic population codes for Bayesian decision making. Neuron, 60(6):1142–1152, 2008.
[10] J. M. Beck, P. E. Latham, and A. Pouget. Marginalization in neural circuits with divisive normalization. Journal of Neuroscience, 31(43):15310–15319, 2011.
[11] Tianming Yang and Michael N. Shadlen. Probabilistic reasoning by neurons. Nature, 447(7148):1075–1080, 2007.
[12] R. H. S. Carpenter and M. L. L. Williams. Neural computation of log likelihood in control of saccadic eye movements. Nature, 1995.
[13] Arnulf B. A. Graf, Adam Kohn, Mehrdad Jazayeri, and J. Anthony Movshon. Decoding the activity of neuronal populations in macaque primary visual cortex. Nature Neuroscience, 14(2):239–245, 2011.
[14] H. B. Barlow. Pattern recognition and the responses of sensory neurons. Annals of the New York Academy of Sciences, 1969.
[15] Wei Ji Ma, Vidhya Navalpakkam, Jeffrey M. Beck, Ronald van den Berg, and Alexandre Pouget. Behavior and neural basis of near-optimal visual search. Nature Neuroscience, 2011.
[16] D. J. Heeger. Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9, 1992.
[17] M. Carandini, D. J. Heeger, and J. A. Movshon. Linearity and normalization in simple cells of the macaque primary visual cortex. Journal of Neuroscience, 17(21):8621–8644, 1997.
[18] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. JMLR, 2003.
[19] M. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Unit, UCL, 2003.
[20] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.
[21] M. Hoffman, D. Blei, and F. Bach. Online learning for Latent Dirichlet Allocation. In NIPS, 2010.
[22] D. Blei and J. Lafferty. Dynamic topic models. In ICML, 2006.
[23] M. Boerlin and S. Deneve. Spike-based population coding and working memory. PLoS Computational Biology, 2011.

4 0.70020485 65 nips-2012-Cardinality Restricted Boltzmann Machines

Author: Kevin Swersky, Ilya Sutskever, Daniel Tarlow, Richard S. Zemel, Ruslan Salakhutdinov, Ryan P. Adams

Abstract: The Restricted Boltzmann Machine (RBM) is a popular density model that is also good for extracting features. A main source of tractability in RBM models is that, given an input, the posterior distribution over hidden variables is factorizable and can be easily computed and sampled from. Sparsity and competition in the hidden representation is beneficial, and while an RBM with competition among its hidden units would acquire some of the attractive properties of sparse coding, such constraints are typically not added, as the resulting posterior over the hidden units seemingly becomes intractable. In this paper we show that a dynamic programming algorithm can be used to implement exact sparsity in the RBM’s hidden units. We also show how to pass derivatives through the resulting posterior marginals, which makes it possible to fine-tune a pre-trained neural network with sparse hidden layers. 1

5 0.68650538 96 nips-2012-Density Propagation and Improved Bounds on the Partition Function

Author: Stefano Ermon, Ashish Sabharwal, Bart Selman, Carla P. Gomes

Abstract: Given a probabilistic graphical model, its density of states is a distribution that, for any likelihood value, gives the number of configurations with that probability. We introduce a novel message-passing algorithm called Density Propagation (DP) for estimating this distribution. We show that DP is exact for tree-structured graphical models and is, in general, a strict generalization of both sum-product and max-product algorithms. Further, we use density of states and tree decomposition to introduce a new family of upper and lower bounds on the partition function. For any tree decomposition, the new upper bound based on finer-grained density of state information is provably at least as tight as previously known bounds based on convexity of the log-partition function, and strictly stronger if a general condition holds. We conclude with empirical evidence of improvement over convex relaxations and mean-field based bounds. 1

6 0.62765807 209 nips-2012-Max-Margin Structured Output Regression for Spatio-Temporal Action Localization

7 0.62182623 188 nips-2012-Learning from Distributions via Support Measure Machines

8 0.62117994 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video

9 0.61722362 210 nips-2012-Memorability of Image Regions

10 0.61548531 197 nips-2012-Learning with Recursive Perceptual Representations

11 0.61532462 168 nips-2012-Kernel Latent SVM for Visual Recognition

12 0.61527163 318 nips-2012-Sparse Approximate Manifolds for Differential Geometric MCMC

13 0.6150046 279 nips-2012-Projection Retrieval for Classification

14 0.61469531 92 nips-2012-Deep Representations and Codes for Image Auto-Annotation

15 0.61468649 303 nips-2012-Searching for objects driven by context

16 0.61456156 112 nips-2012-Efficient Spike-Coding with Multiplicative Adaptation in a Spike Response Model

17 0.61416245 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

18 0.6127612 183 nips-2012-Learning Partially Observable Models Using Temporally Abstract Decision Trees

19 0.61270571 111 nips-2012-Efficient Sampling for Bipartite Matching Problems

20 0.61267406 38 nips-2012-Algorithms for Learning Markov Field Policies