nips nips2006 nips2006-178 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Gavin C. Cawley, Nicola L. Talbot, Mark Girolami
Abstract: Multinomial logistic regression provides the standard penalised maximumlikelihood solution to multi-class pattern recognition problems. More recently, the development of sparse multinomial logistic regression models has found application in text processing and microarray classification, where explicit identification of the most informative features is of value. In this paper, we propose a sparse multinomial logistic regression method, in which the sparsity arises from the use of a Laplace prior, but where the usual regularisation parameter is integrated out analytically. Evaluation over a range of benchmark datasets reveals this approach results in similar generalisation performance to that obtained using cross-validation, but at greatly reduced computational expense. 1
Reference: text
sentIndex sentText sentNum sentScore
1 uk Abstract Multinomial logistic regression provides the standard penalised maximumlikelihood solution to multi-class pattern recognition problems. [sent-19, score-0.408]
2 More recently, the development of sparse multinomial logistic regression models has found application in text processing and microarray classification, where explicit identification of the most informative features is of value. [sent-20, score-0.696]
3 In this paper, we propose a sparse multinomial logistic regression method, in which the sparsity arises from the use of a Laplace prior, but where the usual regularisation parameter is integrated out analytically. [sent-21, score-1.377]
4 Evaluation over a range of benchmark datasets reveals this approach results in similar generalisation performance to that obtained using cross-validation, but at greatly reduced computational expense. [sent-22, score-0.125]
5 1 Introduction Multinomial logistic and probit regression are perhaps the classic statistical methods for multi-class pattern recognition problems (for a detailed introduction, see e. [sent-23, score-0.425]
6 The output of a multinomial logistic regression model can be interpreted as an a-posteriori estimate of the probability that a pattern belongs to each of c disjoint classes. [sent-26, score-0.634]
7 As a result, these models have been adopted in a diverse range of applications, including cancer classification [6, 7], text categorisation [8], analysis of DNA binding sites [9] and call routing. [sent-28, score-0.11]
8 More recently, the focus of research has been on methods for inducing sparsity in (multinomial) logistic or probit regression models. [sent-29, score-0.512]
9 In this case the sparsity is desirable for the purposes of computational expediency, rather than as an aid to understanding the data. [sent-32, score-0.079]
10 A variety of methods have been explored that aim to introduce sparsity in non-parametric regression models through the incorporation of a penalty or regularisation term within the training criterion. [sent-33, score-0.809]
11 In the context of least-squares regression using Radial Basis Function (RBF) networks, Orr [10], proposes the use of local regularisation, in which a weight-decay regularisation term is used with distinct regularisation parameters for each weight. [sent-34, score-1.251]
12 The optimisation of the Generalised Cross-Validation (GCV) score typically leads to the regularisation parameters for redundant basis functions achieving very high values, allowing them to be identified and pruned from the network (c. [sent-35, score-0.737]
13 The relevance vector machine (RVM) [13] implements a form of Bayesian automatic relevance determination (ARD), using a separable Gaussian prior. [sent-39, score-0.134]
14 In this case, the regularisation parameter for each weight is adjusted so as to maximise the marginal likelihood, also known as the Bayesian evidence for the model. [sent-40, score-0.727]
15 An efficient component-wise training algorithm is given in [14]. [sent-41, score-0.046]
16 An alternative approach, known as the LASSO [15], seeks to minimise the negative log-likelihood of the sample, subject to an upper bound on the sum of the absolute value of the weights (see also [16] for a practical training procedure). [sent-42, score-0.079]
17 This strategy is equivalent to the use of a Laplace prior over the model parameters [17], which has been demonstrated to control over-fitting and induce sparsity in the weights of multi-layer perceptron networks [18]. [sent-43, score-0.142]
18 The equivalence of the Laplace prior and a separable Gaussian prior (with appropriate choice of regularisation parameters) has been established by Grandvalet [11, 12], unifying these strands of research. [sent-44, score-0.731]
19 In this paper, we demonstrate that, in the case of the Laplace prior, the regularisation parameters can be integrated out analytically, obviating the need for a lengthy cross-validation based model selection stage. [sent-45, score-0.623]
20 The resulting sparse multinomial logistic regression algorithm with Bayesian regularisation (SBMLR) is then fully automated and, having storage requirements that scale only linearly with the number of model parameters, is well suited to relatively large-scale applications. [sent-46, score-1.237]
21 The remainder of this paper is set out as follows: The sparse multinomial logistic regression procedure with Bayesian regularisation is presented in Section 2. [sent-47, score-1.237]
22 The proposed algorithm is then evaluated against competing approaches over a range of benchmark learning problems in Section 3. [sent-48, score-0.069]
23 Finally, the work is summarised in Section 5 and conclusion drawn. [sent-49, score-0.042]
24 2 Method ℓ Let D = {(xn , tn )}n=1 represent the training sample, where xn ∈ X ⊂ Rd is the vector of input features for the ith example, and tn ∈ T = {t | t ∈ {0, 1}c , t 1 = 1 } is the corresponding vector of desired outputs, using the usual 1-of-c coding scheme. [sent-50, score-0.279]
25 At a minima of L, the partial derivatives of L with respect to the model parameters will be uniformly zero, giving ∂ED =α ∂wij if |wij | > 0 and ∂ED <α ∂wij if |wij | = 0. [sent-55, score-0.069]
26 This implies that if the sensitivity of the negative log-likelihood with respect to a model parameter, wij , falls below α, then the value of that parameter will be set exactly to zero and the corresponding input feature can be pruned from the model. [sent-56, score-0.351]
27 L is then, up to an additive constant, the negative logarithm of the posterior density. [sent-59, score-0.048]
28 The prior over model parameters, w, is then given by a separable Laplace distribution P (w) = α 2 W W exp{−αEW } = α exp {−α|wi |} , 2 i=1 (3) where W is the number of active (non-zero) model parameters. [sent-60, score-0.125]
29 A good value for the regularisation parameter α can be estimated, within a Bayesian framework, by maximising the evidence [22] or alternatively it may be integrated out analytically [17, 23]. [sent-61, score-0.767]
30 Here we take the latter approach, where the prior distribution over model parameters is given by marginalising over α, p(w) = p(w|α)p(α)dα. [sent-62, score-0.105]
31 As α is a scale parameter, an appropriate ignorance prior is given by the improper Jeffrey’s prior, p(α) ∝ 1/α, corresponding to a uniform prior over log α. [sent-63, score-0.214]
32 384], we obtain − log p(w) ∝ W log EW , =⇒ giving a revised optimisation criterion for sparse logistic regression with Bayesian regularisation, M = ED + W log EW , (4) in which the regularisation parameter has been eliminated, for further details and theoretical justification, see [17]. [sent-66, score-1.208]
33 Note that we integrate out the regularisation parameter and optimise the model parameters, which is unusual in that most Bayesian approaches, such as the relevance vector machine [13] optimise the regularisation parameters and integrate over the weights. [sent-67, score-1.307]
34 1 Practical Implementation The training criterion incorporating a fully Bayesian regularisation term can be minimised via a simple modification of existing cyclic co-ordinate descent algorithms for sparse regression using a Laplace prior (e. [sent-70, score-0.863]
35 Differentiating the original and modified training criteria, (2) and (4) respectively, we have that ∇L = ∇ED + α∇EW ∇M = ∇ED + α∇EW ˜ and where 1 1/˜ = α W W i=1 |wi |. [sent-73, score-0.046]
36 (5) From a gradient descent perspective, minimising M effectively becomes equivalent to minimising L, assuming that the regularisation parameter, α, is continuously updated according to (5) following every change in the vector of model parameters, w [17]. [sent-74, score-0.641]
37 This requires only a very minor modification of the existing training algorithm, whilst eliminating the only training parameter and hence the need for a model selection procedure in fitting the model. [sent-75, score-0.175]
38 2 Equivalence of Marginalisation and Optimisation under the Evidence Framework Williams [17] notes that, at least in the case of the Laplace prior, integrating out the regularisation parameter analytically is equivalent to its optimisation under the evidence framework of MacKay [22]. [sent-78, score-0.839]
39 The argument provided by Williams can be summarised as follows: The evidence framework sets the value of the regularisation parameter so as to optimise the marginal likelihood, P (D) = P (D|w)P (w)dw, also known as the evidence for the model. [sent-79, score-0.846]
40 The Bayesian interpretation of the regularised objective function gives, 1 P (D) = exp {−L} dw, ZW where ZW is a normalising constant for the prior over the model parameters, for the Laplace prior, ZW = (2/α)W . [sent-80, score-0.212]
41 In the case of multinomial logistic regression, ED represents the negative logarithm of a normalised distribution, and so the corresponding normalising constant for the data misfit term is redundant. [sent-81, score-0.573]
42 The regulariser corresponding to the Laplace prior is locally a hyper-plane, and so does not contribute to the Hessian and so A = ∇∇ED . [sent-83, score-0.1]
43 The negative logarithm of the evidence can then be written as, − log P (D) = ED + αEW + 1 log |A| + log ZW + constant. [sent-84, score-0.188]
44 2 Setting the derivative of the evidence with respect to α to zero, gives rise to a simple update rule for the regularisation parameter, W 1 1 = α ˜ W j=1 |wj |, which is equivalent to the update rule obtained using the integrate-out approach. [sent-85, score-0.632]
45 Maximising the evidence for the model also provides a convenient means for model selection. [sent-86, score-0.065]
46 Using the Laplace approximation, evidence for a multinomial logistic regression model under the proposed Bayesian regularisation scheme is given by Γ (W ) 2W − log {D} = ED + W log EW − log + 1 log |A| + constant 2 where A = ∇∇L. [sent-87, score-1.332]
47 2 A Simple but Efficient Training Algorithm In this study, we adopt a simplified version of the efficient component-wise training algorithm of Shevade and Keerthi [25], adapted for multinomial, rather than binomial, logistic regression. [sent-89, score-0.285]
48 The principal advantage of a component-wise optimisation algorithm is that the Hessian matrix is not required, but only the first and second partial derivatives of the regularised training criterion. [sent-90, score-0.331]
49 The first partial derivatives of the data mis-fit term are given by, n ∂ED n = ∂aj c i=1 n n ∂ED ∂yi n ∂an ∂yi j where n ∂ED tn i n = − yn , ∂yi i n ∂yi = yi δij − yi yj ∂an j and δij = 1 if i = j and otherwise δij = 0. [sent-91, score-0.288]
50 Substituting, we obtain, ℓ ℓ ∂ED n [yi − tn ] = i ∂ai n=1 =⇒ ℓ ℓ ∂ED n n = [yi − tn ] xn = yi xn − tn xn . [sent-92, score-0.511]
51 i j j i j ∂wij n=1 n=1 n=1 Similarly, the second partial derivatives are given by, ℓ ℓ ∂ 2 ED ∂y n n n = xn i = yi (1 − yi ) xn j j ∂wij ∂wij n=1 n=1 2 . [sent-93, score-0.329]
52 The Laplace regulariser is locally a hyperplane, with the magnitude of the gradient given by the regularisation parameter, α, ∂αEW = sign {wij } α ∂wij and ∂ 2 αEW = 0. [sent-94, score-0.604]
53 Any step that causes a change of sign in a model parameter is truncated and that parameter set to zero. [sent-99, score-0.062]
54 All that remains is to decide on a heuristic used to select the parameter to be optimised in each step. [sent-100, score-0.062]
55 In this study, we adopt the heuristic chosen by Shevade and Keerthi, in which the parameter having the steepest gradient is selected in each iteration. [sent-101, score-0.055]
56 The optimisation proceeds using two nested loops, in the inner loop, only active parameters are considered. [sent-102, score-0.133]
57 If no further progress can be made by optimising active parameters, the search is extended to parameters that are currently set to zero. [sent-103, score-0.037]
58 An optimisation strategy based on scaled conjugate gradient descent [27] has also be found to be effective. [sent-104, score-0.133]
59 Clearly, there is little reason to prefer either model over the other in terms of generalisation performance, as neither consistently dominates the other, either in terms of error rate or cross-entropy. [sent-107, score-0.056]
60 Table 1 also shows that the Bayesian regularisation scheme results in models with a slightly higher degree of sparsity (i. [sent-108, score-0.646]
61 However, the most striking aspect of the comparison is that the Bayesian regularisation scheme is typically around two orders of magnitude faster than the cross-validation based approach, with SBMLR being approximately five times faster in the worst case (COVTYPE). [sent-111, score-0.567]
62 Probabilistic classifiers allow rejection thresholds to be set in a straight-forward manner. [sent-116, score-0.025]
63 Finally, the output of Table 1: Evaluation of linear sparse multinomial logistic regression methods over a set of nine benchmark datasets. [sent-118, score-0.739]
64 The best results for each statistic are shown in bold. [sent-119, score-0.025]
65 The final column shows the logarithm of the ratio of the training times for the SMLR and SBMLR, such that a value of 2 would indicate that SBMLR is 100 times faster than SMLR for a given benchmark dataset. [sent-120, score-0.163]
66 5541 a probabilistic classifier can be adjusted after training to compensate for a difference between the relative class frequencies in the training set and those observed in operation. [sent-184, score-0.153]
67 Saerens [4] provides a simple expectation-maximisation (EM) based procedure for estimating unknown operational apriori probabilities from the output of a probabilistic classifier (c. [sent-185, score-0.101]
68 Let pt (Ci ) represent the a-priori probability of class Ci in the training set and pt (Ci |xn ) represent the raw output of the classifier for the nth pattern of the test data (representing operational conditions). [sent-188, score-0.304]
69 The operational a-priori probabilities, po (Ci ) can then be updated iteratively via p(s) (ωi |xn ) o = p(s) (ωi ) n o pt (ωi ) pt (ωi |x ) (s) po (ωj ) c n j=1 pt (ωj ) pt (ωj |x ) N and p(s+1) (ωi ) o 1 p(s) (ωi |xn ), = ℓ n=1 o (6) (0) beginning with po (Ci ) = pt (Ci ). [sent-189, score-0.548]
70 The adjusted estimates of a-posteriori probability are then given by the first part of equation (6). [sent-191, score-0.035]
71 The training and validation sets of the COVTYPE benchmark have been artificially balanced, by random sampling, so that each class is represented by the same number of examples. [sent-192, score-0.115]
72 The test set consists of the unused patterns, and so the test set a-priori probabilities are both highly disparate and very different from the training set a-priori probabilities. [sent-193, score-0.074]
73 5 training set test set estimated Table 2: Error rate and average crossentropy score for linear SBMLR models of the COVTYPE benchmark, using the raw and corrected outputs. [sent-196, score-0.123]
74 1 0 1 2 3 4 class 5 6 7 Figure 1: Training set, test set and estimated a-priori probabilities for the COVTYPE benchmark. [sent-205, score-0.028]
75 In [29] this hierarchical representation of the Laplace prior is utilized to develop an EM style sparse binomial probit regression algorithm. [sent-207, score-0.372]
76 The hyper-parameter α is selected via cross-validation but in an attempt to circumvent this requirement a Jeffreys prior is placed on τ and is used to replace the exponential distribution in the above integral. [sent-208, score-0.063]
77 This yields an improper parameter free prior distribution over w which removes the explicit requirement to perform any cross-validation. [sent-209, score-0.157]
78 Likewise in [13] the RVM employs a similar scale-mixture for the prior where now the Exponential distribution is replaced by a Gamma distribution whose marginal yields a Student prior distribution. [sent-211, score-0.155]
79 No attempt is made to estimate the associated hyper-parameters and these are typically set to zero producing, as in [29], a sparsity inducing improper prior. [sent-212, score-0.184]
80 It is interesting to note that the SBMLR implements a strategy that is exactly the opposite of the relevance vector machine (RVM) [13], in that it integrates over the hyper-parameter and optimises the weights, rather than marginalising the model parameters and optimising the hyper-parameters. [sent-217, score-0.127]
81 It seems reasonable to suggest that this approach is feasible in the case of the Laplace prior as the pruning action of this prior ensures that values of all of the weights are strongly determined by the data misfit term. [sent-218, score-0.126]
82 A similar strategy has already proved effective in cancer classification based on gene expression microarray data in a binomial setting [32], and we plan to extend this work to multi-class cancer classification in the near future. [sent-219, score-0.281]
83 Adjusting the outputs of a classifier to new a priori probabilities: A simple procedure. [sent-243, score-0.037]
84 Multi-class cancer classification using multinomial probit regression with Bayesian gene selection. [sent-261, score-0.582]
85 Sequence features of DNA binding sites reveal structural class of associated transcription factor. [sent-273, score-0.029]
86 Regularisation in the selection of radial basis function centres. [sent-279, score-0.026]
87 Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. [sent-335, score-0.034]
88 A simple and efficient algorithm for gene selection using sparse logistic regression. [sent-378, score-0.368]
89 Sprse multinomial logistic regression: Fast algorithms and generalisation bounds. [sent-410, score-0.539]
90 Gene selection in cancer classification using sparse logistic regression with Bayesian regularisation. [sent-428, score-0.509]
wordName wordTfidf (topN-words)
[('regularisation', 0.567), ('wij', 0.283), ('multinomial', 0.268), ('sbmlr', 0.252), ('laplace', 0.219), ('logistic', 0.215), ('ew', 0.19), ('smlr', 0.168), ('optimisation', 0.133), ('regression', 0.117), ('covtype', 0.105), ('bayesian', 0.086), ('tn', 0.085), ('wmp', 0.084), ('zw', 0.084), ('regularised', 0.083), ('cancer', 0.081), ('sparsity', 0.079), ('pt', 0.072), ('sparse', 0.07), ('benchmark', 0.069), ('yi', 0.067), ('evidence', 0.065), ('prior', 0.063), ('improper', 0.063), ('mis', 0.063), ('shevade', 0.063), ('xn', 0.063), ('classi', 0.059), ('probit', 0.059), ('gene', 0.057), ('generalisation', 0.056), ('rvm', 0.05), ('relevance', 0.048), ('logarithm', 0.048), ('po', 0.047), ('operational', 0.047), ('optimise', 0.047), ('training', 0.046), ('corrected', 0.044), ('ci', 0.043), ('analytically', 0.043), ('anglia', 0.042), ('cawley', 0.042), ('marginalising', 0.042), ('norfolk', 0.042), ('normalising', 0.042), ('norwich', 0.042), ('penalised', 0.042), ('summarised', 0.042), ('inducing', 0.042), ('separable', 0.038), ('outputs', 0.037), ('derivatives', 0.037), ('pruned', 0.037), ('saerens', 0.037), ('minimising', 0.037), ('regulariser', 0.037), ('generalised', 0.037), ('optimising', 0.037), ('september', 0.036), ('binomial', 0.036), ('adjusted', 0.035), ('edition', 0.034), ('pattern', 0.034), ('grandvalet', 0.033), ('girolami', 0.033), ('glasgow', 0.033), ('minimise', 0.033), ('raw', 0.033), ('ij', 0.032), ('ed', 0.032), ('partial', 0.032), ('parameter', 0.031), ('maximising', 0.031), ('optimised', 0.031), ('er', 0.031), ('integrated', 0.03), ('cation', 0.03), ('dw', 0.029), ('binding', 0.029), ('marginal', 0.029), ('probabilities', 0.028), ('keerthi', 0.028), ('east', 0.028), ('figueiredo', 0.028), ('hessian', 0.027), ('bioinformatics', 0.027), ('utilized', 0.027), ('shrinkage', 0.027), ('selection', 0.026), ('probabilistic', 0.026), ('microarray', 0.026), ('eliminating', 0.026), ('log', 0.025), ('statistic', 0.025), ('rejection', 0.025), ('adopt', 0.024), ('exp', 0.024), ('january', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999917 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation
Author: Gavin C. Cawley, Nicola L. Talbot, Mark Girolami
Abstract: Multinomial logistic regression provides the standard penalised maximumlikelihood solution to multi-class pattern recognition problems. More recently, the development of sparse multinomial logistic regression models has found application in text processing and microarray classification, where explicit identification of the most informative features is of value. In this paper, we propose a sparse multinomial logistic regression method, in which the sparsity arises from the use of a Laplace prior, but where the usual regularisation parameter is integrated out analytically. Evaluation over a range of benchmark datasets reveals this approach results in similar generalisation performance to that obtained using cross-validation, but at greatly reduced computational expense. 1
2 0.12281594 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors
Author: Mark Girolami, Mingjun Zhong
Abstract: By adopting Gaussian process priors a fully Bayesian solution to the problem of integrating possibly heterogeneous data sets within a classification setting is presented. Approximate inference schemes employing Variational & Expectation Propagation based methods are developed and rigorously assessed. We demonstrate our approach to integrating multiple data sets on a large scale protein fold prediction problem where we infer the optimal combinations of covariance functions and achieve state-of-the-art performance without resorting to any ad hoc parameter tuning and classifier combination. 1
3 0.11505503 95 nips-2006-Implicit Surfaces with Globally Regularised and Compactly Supported Basis Functions
Author: Christian Walder, Olivier Chapelle, Bernhard Schölkopf
Abstract: We consider the problem of constructing a function whose zero set is to represent a surface, given sample points with surface normal vectors. The contributions include a novel means of regularising multi-scale compactly supported basis functions that leads to the desirable properties previously only associated with fully supported bases, and show equivalence to a Gaussian process with modified covariance function. We also provide a regularisation framework for simpler and more direct treatment of surface normals, along with a corresponding generalisation of the representer theorem. We demonstrate the techniques on 3D problems of up to 14 million data points, as well as 4D time series data. 1
4 0.09647271 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation
Author: Dong S. Cheng, Vittorio Murino, Mário Figueiredo
Abstract: This paper proposes a new approach to model-based clustering under prior knowledge. The proposed formulation can be interpreted from two different angles: as penalized logistic regression, where the class labels are only indirectly observed (via the probability density of each class); as finite mixture learning under a grouping prior. To estimate the parameters of the proposed model, we derive a (generalized) EM algorithm with a closed-form E-step, in contrast with other recent approaches to semi-supervised probabilistic clustering which require Gibbs sampling or suboptimal shortcuts. We show that our approach is ideally suited for image segmentation: it avoids the combinatorial nature Markov random field priors, and opens the door to more sophisticated spatial priors (e.g., wavelet-based) in a simple and computationally efficient way. Finally, we extend our formulation to work in unsupervised, semi-supervised, or discriminative modes. 1
5 0.094376415 126 nips-2006-Logistic Regression for Single Trial EEG Classification
Author: Ryota Tomioka, Kazuyuki Aihara, Klaus-Robert Müller
Abstract: We propose a novel framework for the classification of single trial ElectroEncephaloGraphy (EEG), based on regularized logistic regression. Framed in this robust statistical framework no prior feature extraction or outlier removal is required. We present two variations of parameterizing the regression function: (a) with a full rank symmetric matrix coefficient and (b) as a difference of two rank=1 matrices. In the first case, the problem is convex and the logistic regression is optimal under a generative model. The latter case is shown to be related to the Common Spatial Pattern (CSP) algorithm, which is a popular technique in Brain Computer Interfacing. The regression coefficients can also be topographically mapped onto the scalp similarly to CSP projections, which allows neuro-physiological interpretation. Simulations on 162 BCI datasets demonstrate that classification accuracy and robustness compares favorably against conventional CSP based classifiers. 1
6 0.086150512 43 nips-2006-Bayesian Model Scoring in Markov Random Fields
7 0.07514964 193 nips-2006-Tighter PAC-Bayes Bounds
8 0.067455754 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization
9 0.06547299 92 nips-2006-High-Dimensional Graphical Model Selection Using $\ell 1$-Regularized Logistic Regression
10 0.062630966 54 nips-2006-Comparative Gene Prediction using Conditional Random Fields
11 0.062410947 20 nips-2006-Active learning for misspecified generalized linear models
12 0.059962455 156 nips-2006-Ordinal Regression by Extended Binary Classification
13 0.059730463 179 nips-2006-Sparse Representation for Signal Classification
14 0.059205096 93 nips-2006-Hyperparameter Learning for Graph Based Semi-supervised Learning Algorithms
15 0.058860447 159 nips-2006-Parameter Expanded Variational Bayesian Methods
16 0.058399487 186 nips-2006-Support Vector Machines on a Budget
17 0.05832592 113 nips-2006-Learning Structural Equation Models for fMRI
18 0.058108173 62 nips-2006-Correcting Sample Selection Bias by Unlabeled Data
19 0.054597341 72 nips-2006-Efficient Learning of Sparse Representations with an Energy-Based Model
20 0.054330267 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
topicId topicWeight
[(0, -0.183), (1, 0.017), (2, 0.024), (3, -0.063), (4, -0.037), (5, 0.116), (6, 0.057), (7, 0.03), (8, -0.032), (9, 0.046), (10, -0.082), (11, -0.034), (12, 0.012), (13, -0.006), (14, -0.028), (15, 0.159), (16, 0.031), (17, -0.007), (18, -0.009), (19, 0.01), (20, 0.027), (21, 0.018), (22, 0.059), (23, -0.064), (24, 0.023), (25, -0.027), (26, 0.111), (27, 0.071), (28, -0.035), (29, -0.112), (30, 0.058), (31, -0.1), (32, -0.026), (33, 0.035), (34, 0.027), (35, 0.15), (36, 0.127), (37, 0.013), (38, 0.01), (39, -0.021), (40, 0.139), (41, 0.028), (42, -0.188), (43, -0.008), (44, -0.02), (45, -0.042), (46, -0.138), (47, -0.187), (48, 0.057), (49, 0.036)]
simIndex simValue paperId paperTitle
same-paper 1 0.92667681 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation
Author: Gavin C. Cawley, Nicola L. Talbot, Mark Girolami
Abstract: Multinomial logistic regression provides the standard penalised maximumlikelihood solution to multi-class pattern recognition problems. More recently, the development of sparse multinomial logistic regression models has found application in text processing and microarray classification, where explicit identification of the most informative features is of value. In this paper, we propose a sparse multinomial logistic regression method, in which the sparsity arises from the use of a Laplace prior, but where the usual regularisation parameter is integrated out analytically. Evaluation over a range of benchmark datasets reveals this approach results in similar generalisation performance to that obtained using cross-validation, but at greatly reduced computational expense. 1
2 0.66715443 95 nips-2006-Implicit Surfaces with Globally Regularised and Compactly Supported Basis Functions
Author: Christian Walder, Olivier Chapelle, Bernhard Schölkopf
Abstract: We consider the problem of constructing a function whose zero set is to represent a surface, given sample points with surface normal vectors. The contributions include a novel means of regularising multi-scale compactly supported basis functions that leads to the desirable properties previously only associated with fully supported bases, and show equivalence to a Gaussian process with modified covariance function. We also provide a regularisation framework for simpler and more direct treatment of surface normals, along with a corresponding generalisation of the representer theorem. We demonstrate the techniques on 3D problems of up to 14 million data points, as well as 4D time series data. 1
3 0.49836478 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors
Author: Mark Girolami, Mingjun Zhong
Abstract: By adopting Gaussian process priors a fully Bayesian solution to the problem of integrating possibly heterogeneous data sets within a classification setting is presented. Approximate inference schemes employing Variational & Expectation Propagation based methods are developed and rigorously assessed. We demonstrate our approach to integrating multiple data sets on a large scale protein fold prediction problem where we infer the optimal combinations of covariance functions and achieve state-of-the-art performance without resorting to any ad hoc parameter tuning and classifier combination. 1
4 0.44279709 43 nips-2006-Bayesian Model Scoring in Markov Random Fields
Author: Sridevi Parise, Max Welling
Abstract: Scoring structures of undirected graphical models by means of evaluating the marginal likelihood is very hard. The main reason is the presence of the partition function which is intractable to evaluate, let alone integrate over. We propose to approximate the marginal likelihood by employing two levels of approximation: we assume normality of the posterior (the Laplace approximation) and approximate all remaining intractable quantities using belief propagation and the linear response approximation. This results in a fast procedure for model scoring. Empirically, we find that our procedure has about two orders of magnitude better accuracy than standard BIC methods for small datasets, but deteriorates when the size of the dataset grows. 1
5 0.43252647 63 nips-2006-Cross-Validation Optimization for Large Scale Hierarchical Classification Kernel Methods
Author: Matthias Seeger
Abstract: We propose a highly efficient framework for kernel multi-class models with a large and structured set of classes. Kernel parameters are learned automatically by maximizing the cross-validation log likelihood, and predictive probabilities are estimated. We demonstrate our approach on large scale text classification tasks with hierarchical class structure, achieving state-of-the-art results in an order of magnitude less time than previous work. 1
6 0.42608845 101 nips-2006-Isotonic Conditional Random Fields and Local Sentiment Flow
7 0.40134317 28 nips-2006-An Efficient Method for Gradient-Based Adaptation of Hyperparameters in SVM Models
8 0.39998969 93 nips-2006-Hyperparameter Learning for Graph Based Semi-supervised Learning Algorithms
9 0.39411607 140 nips-2006-Multiple Instance Learning for Computer Aided Diagnosis
10 0.3932437 41 nips-2006-Bayesian Ensemble Learning
11 0.3906543 131 nips-2006-Mixture Regression for Covariate Shift
12 0.38806754 129 nips-2006-Map-Reduce for Machine Learning on Multicore
13 0.38451874 126 nips-2006-Logistic Regression for Single Trial EEG Classification
14 0.38276768 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation
15 0.37410128 193 nips-2006-Tighter PAC-Bayes Bounds
16 0.36519355 179 nips-2006-Sparse Representation for Signal Classification
17 0.36456823 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization
18 0.3639906 72 nips-2006-Efficient Learning of Sparse Representations with an Energy-Based Model
19 0.35953373 75 nips-2006-Efficient sparse coding algorithms
20 0.35104603 186 nips-2006-Support Vector Machines on a Budget
topicId topicWeight
[(1, 0.123), (3, 0.021), (6, 0.025), (7, 0.071), (9, 0.041), (20, 0.031), (22, 0.083), (44, 0.075), (45, 0.263), (52, 0.016), (57, 0.056), (65, 0.033), (69, 0.051), (71, 0.011), (90, 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.78478444 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation
Author: Gavin C. Cawley, Nicola L. Talbot, Mark Girolami
Abstract: Multinomial logistic regression provides the standard penalised maximumlikelihood solution to multi-class pattern recognition problems. More recently, the development of sparse multinomial logistic regression models has found application in text processing and microarray classification, where explicit identification of the most informative features is of value. In this paper, we propose a sparse multinomial logistic regression method, in which the sparsity arises from the use of a Laplace prior, but where the usual regularisation parameter is integrated out analytically. Evaluation over a range of benchmark datasets reveals this approach results in similar generalisation performance to that obtained using cross-validation, but at greatly reduced computational expense. 1
2 0.5953474 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments
Author: Jeremy Lewi, Robert Butera, Liam Paninski
Abstract: Adaptively optimizing experiments can significantly reduce the number of trials needed to characterize neural responses using parametric statistical models. However, the potential for these methods has been limited to date by severe computational challenges: choosing the stimulus which will provide the most information about the (typically high-dimensional) model parameters requires evaluating a high-dimensional integration and optimization in near-real time. Here we present a fast algorithm for choosing the optimal (most informative) stimulus based on a Fisher approximation of the Shannon information and specialized numerical linear algebra techniques. This algorithm requires only low-rank matrix manipulations and a one-dimensional linesearch to choose the stimulus and is therefore efficient even for high-dimensional stimulus and parameter spaces; for example, we require just 15 milliseconds on a desktop computer to optimize a 100-dimensional stimulus. Our algorithm therefore makes real-time adaptive experimental design feasible. Simulation results show that model parameters can be estimated much more efficiently using these adaptive techniques than by using random (nonadaptive) stimuli. Finally, we generalize the algorithm to efficiently handle both fast adaptation due to spike-history effects and slow, non-systematic drifts in the model parameters. Maximizing the efficiency of data collection is important in any experimental setting. In neurophysiology experiments, minimizing the number of trials needed to characterize a neural system is essential for maintaining the viability of a preparation and ensuring robust results. As a result, various approaches have been developed to optimize neurophysiology experiments online in order to choose the “best” stimuli given prior knowledge of the system and the observed history of the cell’s responses. The “best” stimulus can be defined a number of different ways depending on the experimental objectives. One reasonable choice, if we are interested in finding a neuron’s “preferred stimulus,” is the stimulus which maximizes the firing rate of the neuron [1, 2, 3, 4]. Alternatively, when investigating the coding properties of sensory cells it makes sense to define the optimal stimulus in terms of the mutual information between the stimulus and response [5]. Here we take a system identification approach: we define the optimal stimulus as the one which tells us the most about how a neural system responds to its inputs [6, 7]. We consider neural systems in † ‡ http://www.prism.gatech.edu/∼gtg120z http://www.stat.columbia.edu/∼liam which the probability p(rt |{xt , xt−1 , ..., xt−tk }, {rt−1 , . . . , rt−ta }) of the neural response rt given the current and past stimuli {xt , xt−1 , ..., xt−tk }, and the observed recent history of the neuron’s activity, {rt−1 , . . . , rt−ta }, can be described by a model p(rt |{xt }, {rt−1 }, θ), specified by a finite vector of parameters θ. Since we estimate these parameters from experimental trials, we want to choose our stimuli so as to minimize the number of trials needed to robustly estimate θ. Two inconvenient facts make it difficult to realize this goal in a computationally efficient manner: 1) model complexity — we typically need a large number of parameters to accurately model a system’s response p(rt |{xt }, {rt−1 }, θ); and 2) stimulus complexity — we are typically interested in neural responses to stimuli xt which are themselves very high-dimensional (e.g., spatiotemporal movies if we are dealing with visual neurons). In particular, it is computationally challenging to 1) update our a posteriori beliefs about the model parameters p(θ|{rt }, {xt }) given new stimulus-response data, and 2) find the optimal stimulus quickly enough to be useful in an online experimental context. In this work we present methods for solving these problems using generalized linear models (GLM) for the input-output relationship p(rt |{xt }, {rt−1 }, θ) and certain Gaussian approximations of the posterior distribution of the model parameters. Our emphasis is on finding solutions which scale well in high dimensions. We solve problem (1) by using efficient rank-one update methods to update the Gaussian approximation to the posterior, and problem (2) by a reduction to a highly tractable onedimensional optimization problem. Simulation results show that the resulting algorithm produces a set of stimulus-response pairs which is much more informative than the set produced by random sampling. Moreover, the algorithm is efficient enough that it could feasibly run in real-time. Neural systems are highly adaptive and more generally nonstatic. A robust approach to optimal experimental design must be able to cope with changes in θ. We emphasize that the model framework analyzed here can account for three key types of changes: stimulus adaptation, spike rate adaptation, and random non-systematic changes. Adaptation which is completely stimulus dependent can be accounted for by including enough stimulus history terms in the model p(rt |{xt , ..., xt−tk }, {rt−1 , ..., rt−ta }). Spike-rate adaptation effects, and more generally spike history-dependent effects, are accounted for explicitly in the model (1) below. Finally, we consider slow, non-systematic changes which could potentially be due to changes in the health, arousal, or attentive state of the preparation. Methods We model a neuron as a point process whose conditional intensity function (instantaneous firing rate) is given as the output of a generalized linear model (GLM) [8, 9]. This model class has been discussed extensively elsewhere; briefly, this class is fairly natural from a physiological point of view [10], with close connections to biophysical models such as the integrate-and-fire cell [9], and has been applied in a wide variety of experimental settings [11, 12, 13, 14]. The model is summarized as: tk λt = E(rt ) = f ta aj rt−j ki,t−l xi,t−l + i l=1 (1) j=1 In the above summation the filter coefficients ki,t−l capture the dependence of the neuron’s instantaneous firing rate λt on the ith component of the vector stimulus at time t − l, xt−l ; the model therefore allows for spatiotemporal receptive fields. For convenience, we arrange all the stimulus coefficients in a vector, k, which allows for a uniform treatment of the spatial and temporal components of the receptive field. The coefficients aj model the dependence on the observed recent activity r at time t − j (these terms may reflect e.g. refractory effects, burstiness, firing-rate adaptation, etc., depending on the value of the vector a [9]). For convenience we denote the unknown parameter vector as θ = {k; a}. The experimental objective is the estimation of the unknown filter coefficients, θ, given knowledge of the stimuli, xt , and the resulting responses rt . We chose the nonlinear stage of the GLM, the link function f (), to be the exponential function for simplicity. This choice ensures that the log likelihood of the observed data is a concave function of θ [9]. Representing and updating the posterior. As emphasized above, our first key task is to efficiently update the posterior distribution of θ after t trials, p(θt |xt , rt ), as new stimulus-response pairs are trial 100 trial 500 trial 2500 trial 5000 θ true 1 info. max. trial 0 0 random −1 (a) random info. max. 2000 Time(Seconds) Entropy 1500 1000 500 0 −500 0 1000 2000 3000 Iteration (b) 4000 5000 0.1 total time diagonalization posterior update 1d line Search 0.01 0.001 0 200 400 Dimensionality 600 (c) Figure 1: A) Plots of the estimated receptive field for a simulated visual neuron. The neuron’s receptive field θ has the Gabor structure shown in the last panel (spike history effects were set to zero for simplicity here, a = 0). The estimate of θ is taken as the mean of the posterior, µt . The images compare the accuracy of the estimates using information maximizing stimuli and random stimuli. B) Plots of the posterior entropies for θ in these two cases; note that the information-maximizing stimuli constrain the posterior of θ much more effectively than do random stimuli. C) A plot of the timing of the three steps performed on each iteration as a function of the dimensionality of θ. The timing for each step was well-fit by a polynomial of degree 2 for the diagonalization, posterior update and total time, and degree 1 for the line search. The times are an average over many iterations. The error-bars for the total time indicate ±1 std. observed. (We use xt and rt to abbreviate the sequences {xt , . . . , x0 } and {rt , . . . , r0 }.) To solve this problem, we approximate this posterior as a Gaussian; this approximation may be justified by the fact that the posterior is the product of two smooth, log-concave terms, the GLM likelihood function and the prior (which we assume to be Gaussian, for simplicity). Furthermore, the main theorem of [7] indicates that a Gaussian approximation of the posterior will be asymptotically accurate. We use a Laplace approximation to construct the Gaussian approximation of the posterior, p(θt |xt , rt ): we set µt to the peak of the posterior (i.e. the maximum a posteriori (MAP) estimate of θ), and the covariance matrix Ct to the negative inverse of the Hessian of the log posterior at µt . In general, computing these terms directly requires O(td2 + d3 ) time (where d = dim(θ); the time-complexity increases with t because to compute the posterior we must form a product of t likelihood terms, and the d3 term is due to the inverse of the Hessian matrix), which is unfortunately too slow when t or d becomes large. Therefore we further approximate p(θt−1 |xt−1 , rt−1 ) as Gaussian; to see how this simplifies matters, we use Bayes to write out the posterior: 1 −1 log p(θ|rt , xt ) = − (θ − µt−1 )T Ct−1 (θ − µt−1 ) + − exp {xt ; rt−1 }T θ 2 + rt {xt ; rt−1 }T θ + const d log p(θ|rt , xt ) −1 = −(θ − µt−1 )T Ct−1 + (2) − exp({xt ; rt−1 }T θ) + rt {xt ; rt−1 }T dθ d2 log p(θ|rt , xt ) −1 = −Ct−1 − exp({xt ; rt−1 }T θ){xt ; rt−1 }{xt ; rt−1 }T dθi dθj (3) Now, to update µt we only need to find the peak of a one-dimensional function (as opposed to a d-dimensional function); this follows by noting that that the likelihood only varies along a single direction, {xt ; rt−1 }, as a function of θ. At the peak of the posterior, µt , the first term in the gradient must be parallel to {xt ; rt−1 } because the gradient is zero. Since Ct−1 is non-singular, µt − µt−1 must be parallel to Ct−1 {xt ; rt−1 }. Therefore we just need to solve a one dimensional problem now to determine how much the mean changes in the direction Ct−1 {xt ; rt−1 }; this requires only O(d2 ) time. Moreover, from the second derivative term above it is clear that computing Ct requires just a rank-one matrix update of Ct−1 , which can be evaluated in O(d2 ) time via the Woodbury matrix lemma. Thus this Gaussian approximation of p(θt−1 |xt−1 , rt−1 ) provides a large gain in efficiency; our simulations (data not shown) showed that, despite this improved efficiency, the loss in accuracy due to this approximation was minimal. Deriving the (approximately) optimal stimulus. To simplify the derivation of our maximization strategy, we start by considering models in which the firing rate does not depend on past spiking, so θ = {k}. To choose the optimal stimulus for trial t + 1, we want to maximize the conditional mutual information I(θ; rt+1 |xt+1 , xt , rt ) = H(θ|xt , rt ) − H(θ|xt+1 , rt+1 ) (4) with respect to the stimulus xt+1 . The first term does not depend on xt+1 , so maximizing the information requires minimizing the conditional entropy H(θ|xt+1 , rt+1 ) = p(rt+1 |xt+1 ) −p(θ|rt+1 , xt+1 ) log p(θ|rt+1 , xt+1 )dθ = Ert+1 |xt+1 log det[Ct+1 ] + const. rt+1 (5) We do not average the entropy of p(θ|rt+1 , xt+1 ) over xt+1 because we are only interested in the conditional entropy for the particular xt+1 which will be presented next. The equality above is due to our Gaussian approximation of p(θ|xt+1 , rt+1 ). Therefore, we need to minimize Ert+1 |xt+1 log det[Ct+1 ] with respect to xt+1 . Since we set Ct+1 to be the negative inverse Hessian of the log-posterior, we have: −1 Ct+1 = Ct + Jobs (rt+1 , xt+1 ) −1 , (6) Jobs is the observed Fisher information. Jobs (rt+1 , xt+1 ) = −∂ 2 log p(rt+1 |ε = xt θ)/∂ε2 xt+1 xt t+1 t+1 (7) Here we use the fact that for the GLM, the likelihood depends only on the dot product, ε = xt θ. t+1 We can use the Woodbury lemma to evaluate the inverse: Ct+1 = Ct I + D(rt+1 , ε)(1 − D(rt+1 , ε)xt Ct xt+1 )−1 xt+1 xt Ct t+1 t+1 (8) where D(rt+1 , ε) = ∂ 2 log p(rt+1 |ε)/∂ε2 . Using some basic matrix identities, log det[Ct+1 ] = log det[Ct ] − log(1 − D(rt+1 , ε)xt Ct xt+1 ) t+1 = log det[Ct ] + D(rt+1 , ε)xt Ct xt+1 t+1 + o(D(rt+1 , ε)xt Ct xt+1 ) t+1 (9) (10) Ignoring the higher order terms, we need to minimize Ert+1 |xt+1 D(rt+1 , ε)xt Ct xt+1 . In our case, t+1 with f (θt xt+1 ) = exp(θt xt+1 ), we can use the moment-generating function of the multivariate Trial info. max. i.i.d 2 400 −10−4 0 0.05 −10−1 −2 ai 800 2 0 −2 −7 −10 i i.i.d k info. max. 1 1 50 i 1 50 i 1 10 1 i (a) i 100 0 −0.05 10 1 1 (b) i 10 (c) Figure 2: A comparison of parameter estimates using information-maximizing versus random stimuli for a model neuron whose conditional intensity depends on both the stimulus and the spike history. The images in the top row of A and B show the MAP estimate of θ after each trial as a row in the image. Intensity indicates the value of the coefficients. The true value of θ is shown in the second row of images. A) The estimated stimulus coefficients, k. B) The estimated spike history coefficients, a. C) The final estimates of the parameters after 800 trials: dashed black line shows true values, dark gray is estimate using information maximizing stimuli, and light gray is estimate using random stimuli. Using our algorithm improved the estimates of k and a. Gaussian p(θ|xt , rt ) to evaluate this expectation. After some algebra, we find that to maximize I(θ; rt+1 |xt+1 , xt , rt ), we need to maximize 1 F (xt+1 ) = exp(xT µt ) exp( xT Ct xt+1 )xT Ct xt+1 . t+1 t+1 2 t+1 (11) Computing the optimal stimulus. For the GLM the most informative stimulus is undefined, since increasing the stimulus power ||xt+1 ||2 increases the informativeness of any putatively “optimal” stimulus. To obtain a well-posed problem, we optimize the stimulus under the usual power constraint ||xt+1 ||2 ≤ e < ∞. We maximize Eqn. 11 under this constraint using Lagrange multipliers and an eigendecomposition to reduce our original d-dimensional optimization problem to a onedimensional problem. Expressing Eqn. 11 in terms of the eigenvectors of Ct yields: 1 2 2 F (xt+1 ) = exp( u i yi + ci yi ) ci yi (12) 2 i i i = g( 2 ci yi ) ui yi )h( i (13) i where ui and yi represent the projection of µt and xt+1 onto the ith eigenvector and ci is the corresponding eigenvalue. To simplify notation we also introduce the functions g() and h() which are monotonically strictly increasing functions implicitly defined by Eqn. 12. We maximize F (xt+1 ) by breaking the problem into an inner and outer problem by fixing the value of i ui yi and maximizing h() subject to that constraint. A single line search over all possible values of i ui yi will then find the global maximum of F (.). This approach is summarized by the equation: max F (y) = max g(b) · y:||y||2 =e b max y:||y||2 =e,y t u=b 2 ci yi ) h( i Since h() is increasing, to solve the inner problem we only need to solve: 2 ci yi max y:||y||2 =e,y t u=b (14) i This last expression is a quadratic function with quadratic and linear constraints and we can solve it using the Lagrange method for constrained optimization. The result is an explicit system of 1 true θ random info. max. info. max. no diffusion 1 0.8 0.6 trial 0.4 0.2 400 0 −0.2 −0.4 800 1 100 θi 1 θi 100 1 θi 100 1 θ i 100 −0.6 random info. max. θ true θ i 1 0 −1 Entropy θ i 1 0 −1 random info. max. 250 200 i 1 θ Trial 400 Trial 200 Trial 0 (a) 0 −1 20 40 (b) i 60 80 100 150 0 200 400 600 Iteration 800 (c) Figure 3: Estimating the receptive field when θ is not constant. A) The posterior means µt and true θt plotted after each trial. θ was 100 dimensional, with its components following a Gabor function. To simulate nonsystematic changes in the response function, the center of the Gabor function was moved according to a random walk in between trials. We modeled the changes in θ as a random walk with a white covariance matrix, Q, with variance .01. In addition to the results for random and information-maximizing stimuli, we also show the µt given stimuli chosen to maximize the information under the (mistaken) assumption that θ was constant. Each row of the images plots θ using intensity to indicate the value of the different components. B) Details of the posterior means µt on selected trials. C) Plots of the posterior entropies as a function of trial number; once again, we see that information-maximizing stimuli constrain the posterior of θt more effectively. equations for the optimal yi as a function of the Lagrange multiplier λ1 . ui e yi (λ1 ) = ||y||2 2(ci − λ1 ) (15) Thus to find the global optimum we simply vary λ1 (this is equivalent to performing a search over b), and compute the corresponding y(λ1 ). For each value of λ1 we compute F (y(λ1 )) and choose the stimulus y(λ1 ) which maximizes F (). It is possible to show (details omitted) that the maximum of F () must occur on the interval λ1 ≥ c0 , where c0 is the largest eigenvalue. This restriction on the optimal λ1 makes the implementation of the linesearch significantly faster and more stable. To summarize, updating the posterior and finding the optimal stimulus requires three steps: 1) a rankone matrix update and one-dimensional search to compute µt and Ct ; 2) an eigendecomposition of Ct ; 3) a one-dimensional search over λ1 ≥ c0 to compute the optimal stimulus. The most expensive step here is the eigendecomposition of Ct ; in principle this step is O(d3 ), while the other steps, as discussed above, are O(d2 ). Here our Gaussian approximation of p(θt−1 |xt−1 , rt−1 ) is once again quite useful: recall that in this setting Ct is just a rank-one modification of Ct−1 , and there exist efficient algorithms for rank-one eigendecomposition updates [15]. While the worst-case running time of this rank-one modification of the eigendecomposition is still O(d3 ), we found the average running time in our case to be O(d2 ) (Fig. 1(c)), due to deflation which reduces the cost of matrix multiplications associated with finding the eigenvectors of repeated eigenvalues. Therefore the total time complexity of our algorithm is empirically O(d2 ) on average. Spike history terms. The preceding derivation ignored the spike-history components of the GLM model; that is, we fixed a = 0 in equation (1). Incorporating spike history terms only affects the optimization step of our algorithm; updating the posterior of θ = {k; a} proceeds exactly as before. The derivation of the optimization strategy proceeds in a similar fashion and leads to an analogous optimization strategy, albeit with a few slight differences in detail which we omit due to space constraints. The main difference is that instead of maximizing the quadratic expression in Eqn. 14 to find the maximum of h(), we need to maximize a quadratic expression which includes a linear term due to the correlation between the stimulus coefficients, k, and the spike history coefficients,a. The results of our simulations with spike history terms are shown in Fig. 2. Dynamic θ. In addition to fast changes due to adaptation and spike-history effects, animal preparations often change slowly and nonsystematically over the course of an experiment [16]. We model these effects by letting θ experience diffusion: θt+1 = θt + wt (16) Here wt is a normally distributed random variable with mean zero and known covariance matrix Q. This means that p(θt+1 |xt , rt ) is Gaussian with mean µt and covariance Ct + Q. To update the posterior and choose the optimal stimulus, we use the same procedure as described above1 . Results Our first simulation considered the use of our algorithm for learning the receptive field of a visually sensitive neuron. We took the neuron’s receptive field to be a Gabor function, as a proxy model of a V1 simple cell. We generated synthetic responses by sampling Eqn. 1 with θ set to a 25x33 Gabor function. We used this synthetic data to compare how well θ could be estimated using information maximizing stimuli compared to using random stimuli. The stimuli were 2-d images which were rasterized in order to express x as a vector. The plots of the posterior means µt in Fig. 1 (recall these are equivalent to the MAP estimate of θ) show that the information maximizing strategy converges an order of magnitude more rapidly to the true θ. These results are supported by the conclusion of [7] that the information maximization strategy is asymptotically never worse than using random stimuli and is in general more efficient. The running time for each step of the algorithm as a function of the dimensionality of θ is plotted in Fig. 1(c). These results were obtained on a machine with a dual core Intel 2.80GHz XEON processor running Matlab. The solid lines indicate fitted polynomials of degree 1 for the 1d line search and degree 2 for the remaining curves; the total running time for each trial scaled as O(d2 ), as predicted. When θ was less than 200 dimensions, the total running time was roughly 50 ms (and for dim(θ) ≈ 100, the runtime was close to 15 ms), well within the range of tolerable latencies for many experiments. In Fig. 2 we apply our algorithm to characterize the receptive field of a neuron whose response depends on its past spiking. Here, the stimulus coefficients k were chosen to follow a sine-wave; 1 The one difference is that the covariance matrix of p(θt+1 |xt+1 , rt+1 ) is in general no longer just a rankone modification of the covariance matrix of p(θt |xt , rt ); thus, we cannot use the rank-one update to compute the eigendecomposition. However, it is often reasonable to take Q to be white, Q = cI; in this case the eigenvectors of Ct + Q are those of Ct and the eigenvalues are ci + c where ci is the ith eigenvalue of Ct ; thus in this case, our methods may be applied without modification. the spike history coefficients a were inhibitory and followed an exponential function. When choosing stimuli we updated the posterior for the full θ = {k; a} simultaneously and maximized the information about both the stimulus coefficients and the spike history coefficients. The information maximizing strategy outperformed random sampling for estimating both the spike history and stimulus coefficients. Our final set of results, Fig. 3, considers a neuron whose receptive field drifts non-systematically with time. We take the receptive field to be a Gabor function whose center moves according to a random walk (we have in mind a slow random drift of eye position during a visual experiment). The results demonstrate the feasibility of the information-maximization strategy in the presence of nonstationary response properties θ, and emphasize the superiority of adaptive methods in this context. Conclusion We have developed an efficient implementation of an algorithm for online optimization of neurophysiology experiments based on information-theoretic criterion. Reasonable approximations based on a GLM framework allow the algorithm to run in near-real time even for high dimensional parameter and stimulus spaces, and in the presence of spike-rate adaptation and time-varying neural response properties. Despite these approximations the algorithm consistently provides significant improvements over random sampling; indeed, the differences in efficiency are large enough that the information-optimization strategy may permit robust system identification in cases where it is simply not otherwise feasible to estimate the neuron’s parameters using random stimuli. Thus, in a sense, the proposed stimulus-optimization technique significantly extends the reach and power of classical neurophysiology methods. Acknowledgments JL is supported by the Computational Science Graduate Fellowship Program administered by the DOE under contract DE-FG02-97ER25308 and by the NSF IGERT Program in Hybrid Neural Microsystems at Georgia Tech via grant number DGE-0333411. LP is supported by grant EY018003 from the NEI and by a Gatsby Foundation Pilot Grant. We thank P. Latham for helpful conversations. References [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] I. Nelken, et al., Hearing Research 72, 237 (1994). P. Foldiak, Neurocomputing 38–40, 1217 (2001). K. Zhang, et al., Proceedings (Computational and Systems Neuroscience Meeting, 2004). R. C. deCharms, et al., Science 280, 1439 (1998). C. Machens, et al., Neuron 47, 447 (2005). A. Watson, et al., Perception and Psychophysics 33, 113 (1983). L. Paninski, Neural Computation 17, 1480 (2005). P. McCullagh, et al., Generalized linear models (Chapman and Hall, London, 1989). L. Paninski, Network: Computation in Neural Systems 15, 243 (2004). E. Simoncelli, et al., The Cognitive Neurosciences, M. Gazzaniga, ed. (MIT Press, 2004), third edn. P. Dayan, et al., Theoretical Neuroscience (MIT Press, 2001). E. Chichilnisky, Network: Computation in Neural Systems 12, 199 (2001). F. Theunissen, et al., Network: Computation in Neural Systems 12, 289 (2001). L. Paninski, et al., Journal of Neuroscience 24, 8551 (2004). M. Gu, et al., SIAM Journal on Matrix Analysis and Applications 15, 1266 (1994). N. A. Lesica, et al., IEEE Trans. On Neural Systems And Rehabilitation Engineering 13, 194 (2005).
3 0.59247482 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
Author: Rey Ramírez, Jason Palmer, Scott Makeig, Bhaskar D. Rao, David P. Wipf
Abstract: The ill-posed nature of the MEG/EEG source localization problem requires the incorporation of prior assumptions when choosing an appropriate solution out of an infinite set of candidates. Bayesian methods are useful in this capacity because they allow these assumptions to be explicitly quantified. Recently, a number of empirical Bayesian approaches have been proposed that attempt a form of model selection by using the data to guide the search for an appropriate prior. While seemingly quite different in many respects, we apply a unifying framework based on automatic relevance determination (ARD) that elucidates various attributes of these methods and suggests directions for improvement. We also derive theoretical properties of this methodology related to convergence, local minima, and localization bias and explore connections with established algorithms. 1
4 0.59097135 5 nips-2006-A Kernel Method for the Two-Sample-Problem
Author: Arthur Gretton, Karsten M. Borgwardt, Malte Rasch, Bernhard Schölkopf, Alex J. Smola
Abstract: We propose two statistical tests to determine if two samples are from different distributions. Our test statistic is in both cases the distance between the means of the two samples mapped into a reproducing kernel Hilbert space (RKHS). The first test is based on a large deviation bound for the test statistic, while the second is based on the asymptotic distribution of this statistic. The test statistic can be computed in O(m2 ) time. We apply our approach to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where our test performs strongly. We also demonstrate excellent performance when comparing distributions over graphs, for which no alternative tests currently exist.
5 0.59023333 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation
Author: Dong S. Cheng, Vittorio Murino, Mário Figueiredo
Abstract: This paper proposes a new approach to model-based clustering under prior knowledge. The proposed formulation can be interpreted from two different angles: as penalized logistic regression, where the class labels are only indirectly observed (via the probability density of each class); as finite mixture learning under a grouping prior. To estimate the parameters of the proposed model, we derive a (generalized) EM algorithm with a closed-form E-step, in contrast with other recent approaches to semi-supervised probabilistic clustering which require Gibbs sampling or suboptimal shortcuts. We show that our approach is ideally suited for image segmentation: it avoids the combinatorial nature Markov random field priors, and opens the door to more sophisticated spatial priors (e.g., wavelet-based) in a simple and computationally efficient way. Finally, we extend our formulation to work in unsupervised, semi-supervised, or discriminative modes. 1
6 0.58942169 20 nips-2006-Active learning for misspecified generalized linear models
7 0.58921641 65 nips-2006-Denoising and Dimension Reduction in Feature Space
8 0.58604485 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming
9 0.5851534 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning
10 0.58243716 169 nips-2006-Relational Learning with Gaussian Processes
11 0.58205158 175 nips-2006-Simplifying Mixture Models through Function Approximation
12 0.58166242 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy
13 0.58155078 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds
14 0.58138043 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment
15 0.58065027 131 nips-2006-Mixture Regression for Covariate Shift
16 0.57851827 193 nips-2006-Tighter PAC-Bayes Bounds
17 0.57812685 106 nips-2006-Large Margin Hidden Markov Models for Automatic Speech Recognition
18 0.57683557 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures
19 0.57650268 95 nips-2006-Implicit Surfaces with Globally Regularised and Compactly Supported Basis Functions
20 0.5759576 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space