nips nips2006 nips2006-5 knowledge-graph by maker-knowledge-mining

5 nips-2006-A Kernel Method for the Two-Sample-Problem


Source: pdf

Author: Arthur Gretton, Karsten M. Borgwardt, Malte Rasch, Bernhard Schölkopf, Alex J. Smola

Abstract: We propose two statistical tests to determine if two samples are from different distributions. Our test statistic is in both cases the distance between the means of the two samples mapped into a reproducing kernel Hilbert space (RKHS). The first test is based on a large deviation bound for the test statistic, while the second is based on the asymptotic distribution of this statistic. The test statistic can be computed in O(m2 ) time. We apply our approach to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where our test performs strongly. We also demonstrate excellent performance when comparing distributions over graphs, for which no alternative tests currently exist.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We propose two statistical tests to determine if two samples are from different distributions. [sent-19, score-0.163]

2 Our test statistic is in both cases the distance between the means of the two samples mapped into a reproducing kernel Hilbert space (RKHS). [sent-20, score-0.418]

3 The first test is based on a large deviation bound for the test statistic, while the second is based on the asymptotic distribution of this statistic. [sent-21, score-0.361]

4 The test statistic can be computed in O(m2 ) time. [sent-22, score-0.325]

5 We apply our approach to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where our test performs strongly. [sent-23, score-0.47]

6 We also demonstrate excellent performance when comparing distributions over graphs, for which no alternative tests currently exist. [sent-24, score-0.144]

7 1 Introduction We address the problem of comparing samples from two probability distributions, by proposing a statistical test of the hypothesis that these distributions are different (this is called the two-sample or homogeneity problem). [sent-25, score-0.291]

8 In database attribute matching, it is desirable to merge databases containing multiple fields, where it is not known in advance which fields correspond: the fields are matched by maximising the similarity in the distributions of their entries. [sent-28, score-0.245]

9 In this study, we propose to test whether distributions p and q are different on the basis of samples drawn from each of them, by finding a smooth function which is large on the points drawn from p, and small (as negative as possible) on the points from q. [sent-29, score-0.182]

10 We use as our test statistic the difference between the mean function values on the two samples; when this is large, the samples are likely from different distributions. [sent-30, score-0.354]

11 We call this statistic the Maximum Mean Discrepancy (MMD). [sent-31, score-0.205]

12 Clearly the quality of MMD as a statistic depends heavily on the class F of smooth functions that define it. [sent-32, score-0.205]

13 On the other hand, for the test to be consistent, F needs to be “restrictive” enough for the empirical estimate of MMD to converge quickly to its expectation as the sample size increases. [sent-34, score-0.229]

14 We define two non-parametric statistical tests based on MMD. [sent-37, score-0.134]

15 The first, which uses distribution-independent uniform convergence bounds, provides finite-sample guarantees of test performance, at the expense of being conservative in detecting differences between p and q. [sent-38, score-0.221]

16 The second test is based on the asymptotic distribution of MMD, and is in practice more sensitive to differences in distribution at small sample sizes. [sent-39, score-0.29]

17 In addition, the present approach employs a more accurate approximation to the asymptotic distribution of the test statistic. [sent-41, score-0.218]

18 We begin our presentation in Section 2 with a formal definition of the MMD, and a proof that the population MMD is zero if and only if p = q when F is the unit ball of a universal RKHS. [sent-42, score-0.14]

19 We also give an overview of hypothesis testing as it applies to the two-sample problem, and review previous approaches. [sent-43, score-0.141]

20 We take a different approach in Section 4, where we use the asymptotic distribution of an unbiased estimate of the squared MMD as the basis for a second test. [sent-46, score-0.178]

21 Finally, in Section 5, we demonstrate the performance of our method on problems from neuroscience, bioinformatics, and attribute matching using the Hungarian marriage approach. [sent-47, score-0.261]

22 Our approach performs well on high dimensional data with low sample size; in addition, we are able to successfully apply our test to graph data, for which no alternative tests exist. [sent-48, score-0.277]

23 2 The Two-Sample-Problem Our goal is to formulate a statistical test that answers the following question: Problem 1 Let p and q be distributions defined on a domain X. [sent-53, score-0.176]

24 Then we define the maximum mean discrepancy (MMD) and its empirical estimate as MMD[F, p, q] := sup_{f ∈ F} ( E_{x∼p}[f(x)] − E_{y∼q}[f(y)] ) (1) and MMD[F, X, Y] := sup_{f ∈ F} ( (1/m) Σ_{i=1}^m f(x_i) − (1/n) Σ_{i=1}^n f(y_i) ). (2) [sent-73, score-0.144]

25 Theorem 3 Let F be a unit ball in a universal RKHS H, defined on the compact metric space X, with associated kernel k(·, ·). [sent-78, score-0.14]

26 Writing µ[X] := (1/m) Σ_{i=1}^m φ(x_i) and k(x, x′) = ⟨φ(x), φ(x′)⟩, an empirical estimate of MMD is MMD[F, X, Y] = [ (1/m²) Σ_{i,j=1}^m k(x_i, x_j) − (2/(mn)) Σ_{i=1}^m Σ_{j=1}^n k(x_i, y_j) + (1/n²) Σ_{i,j=1}^n k(y_i, y_j) ]^{1/2}. [sent-86, score-0.152]
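
The formula above can be evaluated directly from Gram matrices. Below is a minimal NumPy sketch of this biased empirical estimate with a Gaussian RBF kernel; the function names, the kernel choice, and the default width are illustrative, not taken from the paper.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # Gaussian RBF Gram matrix between the rows of A and B,
    # where A and B have shape (n_samples, n_features).
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-sq_dists / (2.0 * sigma**2))

def mmd_biased(X, Y, sigma=1.0):
    # Biased empirical MMD: the square root of
    # (1/m^2) sum k(xi,xj) - (2/(mn)) sum k(xi,yj) + (1/n^2) sum k(yi,yj).
    m, n = len(X), len(Y)
    Kxx = rbf_kernel(X, X, sigma)
    Kyy = rbf_kernel(Y, Y, sigma)
    Kxy = rbf_kernel(X, Y, sigma)
    val = Kxx.sum() / m**2 - 2.0 * Kxy.sum() / (m * n) + Kyy.sum() / n**2
    return np.sqrt(max(val, 0.0))  # clip tiny negative rounding errors
```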

27 Overview of Statistical Hypothesis Testing, and of Previous Approaches Having defined our test statistic, we briefly describe the framework of statistical hypothesis testing as it applies in the present context, following [9, Chapter 8]. [sent-92, score-0.258]

28 Given i.i.d. samples X ∼ p of size m and Y ∼ q of size n, the statistical test T(X, Y) : X^m × X^n → {0, 1} is used to distinguish between the null hypothesis H0 : p = q and the alternative hypothesis H1 : p ≠ q. [sent-96, score-0.307]

29 This is achieved by comparing the test statistic MMD[F, X, Y ] with a particular threshold: if the threshold is exceeded, then the test rejects the null hypothesis (bearing in mind that a zero population MMD indicates p = q). [sent-97, score-0.719]

30 The acceptance region of the test is thus defined as any real number below the threshold. [sent-98, score-0.144]

31 Since the test is based on finite samples, it is possible that an incorrect answer will be returned: we define the Type I error as the probability of rejecting p = q based on the observed sample, despite the null hypothesis being true. [sent-99, score-0.289]

32 The level α of a test is an upper bound on the Type I error: this is a design parameter of the test, and is used to set the threshold to which we compare the test statistic (finding the test threshold for a given α is the topic of Sections 3 and 4). [sent-101, score-0.67]

33 A consistent test achieves a level α, and a Type II error of zero, in the large sample limit. [sent-102, score-0.166]

34 We next give a brief overview of previous approaches to the two sample problem for multivariate data. [sent-104, score-0.191]

35 A generalisation of the Wald-Wolfowitz runs test to the multivariate domain was proposed and analysed in [12, 17] (Wolf), which involves counting the number of edges in the minimum spanning tree over the aggregated data that connect points in X to points in Y . [sent-107, score-0.321]

36 Two possible generalisations of the Kolmogorov-Smirnov test to the multivariate case were studied in [4, 12]. [sent-109, score-0.239]

37 A more recent multivariate test was proposed in [20], which is based on the minimum distance non-bipartite matching over the aggregate data, at cost O((m + n)3 ). [sent-111, score-0.394]

38 Another recent test was proposed in [15] (Hall): for each point from p, it requires computing the closest points in the aggregated data, and counting how many of these are from q (the procedure is repeated for each point from q with respect to points from p). [sent-112, score-0.165]

39 The test statistic is costly to compute; [15] consider only tens of points in their experiments. [sent-113, score-0.325]

40 A further approach uses a distance (e.g., L1 or L2) between Parzen window estimates of the densities as a test statistic [1, 3], based on the asymptotic distribution of this distance given p = q. [sent-116, score-0.423]

41 When the L2 norm is used, the test statistic is related to those we present here, although it is arrived at from a different perspective (see [13]); the test in [1] is obtained in a more restricted setting where the RKHS kernel is an inner product between Parzen windows. [sent-117, score-0.533]

42 This establishes the consistency of statistical tests based on MMD. [sent-124, score-0.162]

43 Our next goal is to refine this result in a way that allows us to define a test threshold under the null hypothesis p = q. [sent-130, score-0.33]

44 The first inequality is interesting inasmuch as it provides a link between the bias bound B1 (F, p) and kernel size (for instance, if we were to use a Gaussian kernel with large σ, then k(x, x) and k(x, x′ ) would likely be close, and the bias small). [sent-134, score-0.151]

45 In the context of testing, however, we would need to provide an additional bound to show convergence of an empirical estimate of B1 (F, p) to its population equivalent. [sent-135, score-0.179]

46 Thus, in the following test for p = q based on Theorem 5, we use B2 (F, p) to bound the bias. [sent-136, score-0.143]

47 Lemma 6 A hypothesis test of level α for the null hypothesis p = q (equivalently MMD[F, p, q] = 0) has the acceptance region MMD[F, X, Y] < √(2K/m) (1 + √(log α⁻¹)). [sent-137, score-0.399]
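
A sketch of the resulting test, reusing the mmd_biased helper from the earlier sketch. It assumes m = n, a kernel bounded by K (K = 1 for the RBF kernel), and the threshold as reconstructed above, so treat the constants as an approximation of the lemma rather than a faithful transcription.

```python
import numpy as np

def mmd_bound_test(X, Y, sigma=1.0, alpha=0.05, K=1.0):
    # Distribution-free level-alpha test: reject p = q when the biased MMD
    # estimate exceeds sqrt(2K/m) * (1 + sqrt(log(1/alpha))), as reconstructed
    # above. Assumes m = n and a kernel bounded by K (K = 1 for the RBF kernel).
    m = len(X)
    threshold = np.sqrt(2.0 * K / m) * (1.0 + np.sqrt(np.log(1.0 / alpha)))
    return mmd_biased(X, Y, sigma) > threshold  # True means "reject H0"
```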

48 To put this convergence rate in perspective, consider a test of whether two normal distributions have equal means, given they have unknown but equal variance [9, Exercise 8. [sent-139, score-0.182]

49 In this case, the test statistic has a Student-t distribution with n + m − 2 degrees of freedom, and its error probability converges at the same rate as our test. [sent-141, score-0.325]

50 4 An Unbiased Test Based on the Asymptotic Distribution of the U-Statistic We now propose a second test, which is based on the asymptotic distribution of an unbiased estimate of MMD². [sent-142, score-0.178]

51 An unbiased empirical estimate of MMD² is MMD²_u[F, X, Y] = (1/(m(m − 1))) Σ_{i≠j} h(z_i, z_j), (6) which is a one-sample U-statistic with h(z_i, z_j) := k(x_i, x_j) + k(y_i, y_j) − k(x_i, y_j) − k(x_j, y_i). [sent-154, score-0.253]
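
A companion sketch of this unbiased estimate for equal sample sizes (m = n), reusing the rbf_kernel helper defined earlier; as before, names and defaults are illustrative.

```python
import numpy as np

def mmd2_unbiased(X, Y, sigma=1.0):
    # Unbiased one-sample U-statistic estimate of MMD^2 for m = n:
    # (1/(m(m-1))) * sum over i != j of
    # k(x_i,x_j) + k(y_i,y_j) - k(x_i,y_j) - k(x_j,y_i).
    m = len(X)
    Kxx = rbf_kernel(X, X, sigma)
    Kyy = rbf_kernel(Y, Y, sigma)
    Kxy = rbf_kernel(X, Y, sigma)
    np.fill_diagonal(Kxx, 0.0)   # drop i = j terms
    np.fill_diagonal(Kyy, 0.0)
    np.fill_diagonal(Kxy, 0.0)
    return (Kxx.sum() + Kyy.sum() - 2.0 * Kxy.sum()) / (m * (m - 1))
```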

52 The empirical statistic is an unbiased estimate of MMD², although it does not have minimum variance (the minimum variance estimate is almost identical: see [21, Section 5. [sent-155, score-0.348]

53 The asymptotic distribution of this test statistic under H1 is given by [21, Section 5. [sent-160, score-0.423]

54 Our goal is to determine whether the empirical test statistic MMD²_u is so large as to be outside the 1 − α quantile of the null distribution in (7) (consistency of the resulting test is guaranteed by the form of the distribution under H1). [sent-176, score-0.593]

55 One way to estimate this quantile is using the bootstrap [2] on the aggregated data. [sent-177, score-0.168]
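
The sketch below illustrates this resampling step by repeatedly shuffling the aggregated sample and recomputing MMD²_u; this permutation-style procedure is a stand-in for the bootstrap cited in the text, not the paper's exact recipe, and it reuses mmd2_unbiased from the sketch above.

```python
import numpy as np

def null_quantile(X, Y, sigma=1.0, alpha=0.05, n_resamples=500, seed=0):
    # Estimate the (1 - alpha) quantile of the null distribution of MMD^2_u
    # by shuffling the aggregated sample and splitting it back into two groups
    # of the original sizes. Assumes m = n so mmd2_unbiased applies.
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, Y])
    m = len(X)
    stats = []
    for _ in range(n_resamples):
        perm = rng.permutation(len(Z))
        stats.append(mmd2_unbiased(Z[perm[:m]], Z[perm[m:]], sigma))
    return np.quantile(stats, 1.0 - alpha)
```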

56 5 Experiments We conducted distribution comparisons using our MMD-based tests on datasets from three real-world domains: database applications, bioinformatics, and neurobiology. [sent-183, score-0.156]

57 We investigated the uniform convergence approach (MMD), the asymptotic approach with bootstrap (MMD²_u B), and the asymptotic approach with moment matching to Pearson curves (MMD²_u M). [sent-184, score-0.393]

58 We also compared against several alternatives from the literature (where applicable): the multivariate t-test, the Friedman-Rafsky Kolmogorov-Smirnov generalisation (Smir), the Friedman-Rafsky Wald-Wolfowitz generalisation (Wolf), the Biau-Györfi test (Biau), and the Hall-Tajvidi test (Hall). [sent-185, score-0.465]

59 Note that we do not apply the Biau-Györfi test to high-dimensional problems (see end of Section 2), and that MMD is the only method applicable to structured data such as graphs. [sent-186, score-0.156]

60 An important issue in the practical application of the MMD-based tests is the selection of the kernel parameters. [sent-187, score-0.175]

61 We illustrate this with a Gaussian RBF kernel, where we must choose the kernel width σ (we use this kernel for univariate and multivariate data, but not for graphs). [sent-188, score-0.344]

62 The empirical MMD is zero both for kernel size σ = 0 (where the aggregate Gram matrix over X and Y is a unit matrix), and also approaches zero as σ → ∞ (where the aggregate Gram matrix becomes uniformly constant). [sent-189, score-0.208]
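
Given this degeneracy at both extremes of σ, some intermediate width has to be picked; a widely used heuristic, shown below, is the median pairwise distance over the aggregated sample (this is a common convention, not necessarily the selection rule used in the paper's experiments).

```python
import numpy as np

def median_heuristic_sigma(X, Y):
    # Median pairwise distance over the aggregated sample, a common choice
    # of RBF kernel width for MMD (not necessarily the rule used in the paper).
    Z = np.vstack([X, Y])
    sq = np.sum(Z**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    d = np.sqrt(np.maximum(d2, 0.0))
    return np.median(d[np.triu_indices_from(d, k=1)])
```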

63 Data integration As a first application of MMD, we performed distribution testing for data integration: the objective is to aggregate two datasets into a single sample, with the understanding that both original samples are generated from the same distribution. [sent-191, score-0.157]

64 We applied our tests to these datasets in the following fashion. [sent-195, score-0.156]

65 We note that the Type I error of the bootstrap test on the Subtype dataset is far from its design value of 0.05. [sent-200, score-0.236]

66 Finally, the uniform convergence-based test is too conservative, finding differences in distribution only for the data with largest sample size. [sent-204, score-0.192]

67 Table 1: Distribution testing for data integration on multivariate data. [sent-262, score-0.148]

68 Numbers indicate the percentage of repetitions for which the null hypothesis (p=q) was accepted, given α = 0.05. [sent-263, score-0.23]

69 Attribute matching Our second series of experiments addresses automatic attribute matching. [sent-266, score-0.224]

70 We use a two-sample test on pairs of attributes from two databases to find corresponding pairs. [sent-268, score-0.365]

71 This procedure is also called table matching for tables from different databases. [sent-269, score-0.181]

72 We performed attribute matching as follows: first, the dataset D was split into two halves A and B. [sent-270, score-0.273]

73 Each of the n attributes … (Footnote 1: corresponding attributes may have different distributions in real-world databases.) [sent-271, score-0.345]

74 Advanced approaches to schema matching using MMD as one key statistical test are a topic of current research. [sent-273, score-0.271]

75 We then tested all pairs of attributes from A and from B against each other, to find the optimal assignment of attributes A1 , . [sent-277, score-0.312]

76 As a naive approach, one could assume that any possible pair of attributes might correspond, and thus that every attribute of A needs to be tested against all the attributes of B to find the optimal match. [sent-285, score-0.435]

77 We report results for this naive approach, aggregated over all pairs of possible attribute matches, in Table 2. [sent-286, score-0.168]

78 We used three datasets: the census income dataset from the UCI KDD archive (CNUM), the protein homology dataset from the 2004 KDD Cup (BIO) [8], and the forest dataset from the UCI ML archive [5]. [sent-287, score-0.327]

79 For the final dataset, we performed univariate matching of attributes (FOREST) and multivariate matching of tables (FOREST10D) from two different databases, where each table represents one type of forest. [sent-288, score-0.711]

80 Both our asymptotic MMD²_u-based tests perform as well as or better than the alternatives, notably for CNUM, where the advantage of MMD²_u is large. [sent-289, score-0.209]

81 A more principled approach to attribute matching is also possible. [sent-297, score-0.224]

82 (φ1(A1), . . . , φn(An)): in other words, the kernel decomposes into kernels on the individual attributes of A (and also decomposes this way on the attributes of B). [sent-301, score-0.376]

83 Our goal of optimally assigning attributes from B to attributes of A via MMD is equivalent to finding the optimal permutation π of the attributes of B that minimizes Σ_{i=1}^n ||µ_i(A_i) − µ_i(B_{π(i)})||². [sent-303, score-0.468]
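
A sketch of this assignment step, assuming equal-sized attribute samples and using SciPy's Hungarian-algorithm solver with pairwise MMD²_u values as costs; mmd2_unbiased is the helper sketched earlier and all other names are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_attributes(A_attrs, B_attrs, sigma=1.0):
    # A_attrs, B_attrs: lists of equal-length sample arrays, one per attribute.
    # Build a cost matrix of pairwise MMD^2_u estimates and solve the
    # resulting assignment problem with the Hungarian method.
    n = len(A_attrs)
    cost = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            cost[i, j] = mmd2_unbiased(A_attrs[i], B_attrs[j], sigma)
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols))  # attribute i of A is matched to cols[i] of B
```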

84 Table 2: Naive attribute matching on univariate (BIO, FOREST, CNUM) and multivariate data (FOREST10D). [sent-371, score-0.44]

85 Numbers indicate the percentage of cases in which the null hypothesis (p=q) was accepted, pooled over attributes. [sent-372, score-0.169]

86 We tested this 'Hungarian approach' to attribute matching via MMD²_u B on three univariate datasets (BIO, CNUM, FOREST) and for table matching on a fourth (FOREST10D). [sent-376, score-0.503]

87 To study MMD²_u B on structured data, we obtained two datasets of protein graphs (PROTEINS and ENZYMES) and used the graph kernel for proteins from [7] for table matching via the Hungarian method (the other tests were not applicable to this graph data). [sent-377, score-0.472]

88 The challenge here is to match tables representing one functional class of proteins (or enzymes) from dataset A to the corresponding tables (functional classes) in B. [sent-378, score-0.191]

89 6 Summary and Discussion We have established two simple multivariate tests for comparing two distributions p and q. [sent-381, score-0.263]

90 The test statistics are based on the maximum deviation of the expectation of a function evaluated on each of the random variables, taken over a sufficiently rich function class. [sent-382, score-0.149]

91 We do not require density estimates as an intermediate step. [sent-383, score-0.539]

92 (Table 3 data) Dataset: BIO, CNUM, FOREST, FOREST10D, ENZYME, PROTEINS; Data type: univariate, univariate, univariate, multivariate, structured, structured; No. attributes: 6, 13, 10, 2, 6, 2; Sample size: 377, 386, 538, 1000, 50, 200; Repetitions: 100, 100, 100, 100, 50, 50; % correct matches: 90.0, … (remaining values truncated in this extract). [sent-384, score-0.183]

93 Table 3: Hungarian method for attribute matching via MMD²_u B on univariate (BIO, CNUM, FOREST), multivariate (FOREST10D), and structured data (ENZYMES, PROTEINS) (α = 0.05; [sent-390, score-0.476]

94 '% correct matches' is the percentage of the correct attribute matches detected over all repetitions). [sent-391, score-0.15]

95 Finally, our test was successfully used to compare distributions on graphs, for which it is currently the only option. [sent-394, score-0.153]

96 Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. [sent-402, score-0.239]

97 On the asymptotic properties of a nonparametric l1 -test statistic of homogeneity. [sent-412, score-0.303]

98 A distribution free version of the Smirnov two sample test in the p-variate case. [sent-416, score-0.166]

99 Permutation tests for equality of distributions in high-dimensional settings. [sent-503, score-0.144]

100 An exact distribution-free test comparing two multivariate distributions based on adjacency. [sent-533, score-0.272]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('mmd', 0.735), ('statistic', 0.205), ('bio', 0.16), ('attributes', 0.156), ('cnum', 0.147), ('attribute', 0.123), ('test', 0.12), ('multivariate', 0.119), ('tests', 0.111), ('hungarian', 0.102), ('matching', 0.101), ('asymptotic', 0.098), ('forest', 0.097), ('univariate', 0.097), ('wolf', 0.095), ('biau', 0.092), ('smir', 0.092), ('databases', 0.089), ('hypothesis', 0.086), ('null', 0.083), ('subtype', 0.073), ('bootstrap', 0.067), ('population', 0.064), ('kernel', 0.064), ('repetitions', 0.061), ('hall', 0.06), ('rkhs', 0.06), ('borgwardt', 0.058), ('pearson', 0.058), ('type', 0.057), ('aggregate', 0.054), ('proteins', 0.054), ('unbiased', 0.053), ('status', 0.051), ('universal', 0.049), ('dataset', 0.049), ('gretton', 0.048), ('rasch', 0.048), ('ez', 0.047), ('health', 0.047), ('sample', 0.046), ('datasets', 0.045), ('aggregated', 0.045), ('microarray', 0.045), ('tables', 0.044), ('enzymes', 0.044), ('ii', 0.042), ('threshold', 0.041), ('ex', 0.039), ('marriage', 0.037), ('paration', 0.037), ('smirnov', 0.037), ('subtypes', 0.037), ('wage', 0.037), ('generalisation', 0.037), ('kdd', 0.037), ('mpi', 0.037), ('structured', 0.036), ('empirical', 0.036), ('table', 0.036), ('theorem', 0.035), ('distributions', 0.033), ('alternatives', 0.032), ('yj', 0.032), ('hilbertian', 0.032), ('salary', 0.032), ('bioinformatics', 0.031), ('discrepancy', 0.031), ('testing', 0.029), ('convergence', 0.029), ('archive', 0.029), ('arthur', 0.029), ('nicta', 0.029), ('quantile', 0.029), ('zl', 0.029), ('rich', 0.029), ('samples', 0.029), ('consistency', 0.028), ('matches', 0.027), ('schema', 0.027), ('estimate', 0.027), ('ball', 0.027), ('xi', 0.027), ('annals', 0.027), ('overview', 0.026), ('uci', 0.026), ('differences', 0.026), ('parzen', 0.026), ('graz', 0.026), ('detect', 0.026), ('ep', 0.025), ('xj', 0.025), ('protein', 0.025), ('sup', 0.025), ('acceptance', 0.024), ('zj', 0.024), ('perspective', 0.024), ('gram', 0.023), ('statistical', 0.023), ('bound', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 5 nips-2006-A Kernel Method for the Two-Sample-Problem

Author: Arthur Gretton, Karsten M. Borgwardt, Malte Rasch, Bernhard Schölkopf, Alex J. Smola

Abstract: We propose two statistical tests to determine if two samples are from different distributions. Our test statistic is in both cases the distance between the means of the two samples mapped into a reproducing kernel Hilbert space (RKHS). The first test is based on a large deviation bound for the test statistic, while the second is based on the asymptotic distribution of this statistic. The test statistic can be computed in O(m2 ) time. We apply our approach to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where our test performs strongly. We also demonstrate excellent performance when comparing distributions over graphs, for which no alternative tests currently exist.

2 0.08721301 158 nips-2006-PG-means: learning the number of clusters in data

Author: Yu Feng, Greg Hamerly

Abstract: We present a novel algorithm called PG-means which is able to learn the number of clusters in a classical Gaussian mixture model. Our method is robust and efficient; it uses statistical hypothesis tests on one-dimensional projections of the data and model to determine if the examples are well represented by the model. In so doing, we are applying a statistical test for the entire model at once, not just on a per-cluster basis. We show that our method works well in difficult cases such as non-Gaussian data, overlapping clusters, eccentric clusters, high dimension, and many true clusters. Further, our new method provides a much more stable estimate of the number of clusters than existing methods. 1

3 0.086496033 166 nips-2006-Recursive Attribute Factoring

Author: Deepak Verma, Karl Pfleger, David Tax

Abstract: Clustering, or factoring of a document collection attempts to “explain” each observed document in terms of one or a small number of inferred prototypes. Prior work demonstrated that when links exist between documents in the corpus (as is the case with a collection of web pages or scientific papers), building a joint model of document contents and connections produces a better model than that built from contents or connections alone. Many problems arise when trying to apply these joint models to corpus at the scale of the World Wide Web, however; one of these is that the sheer overhead of representing a feature space on the order of billions of dimensions becomes impractical. We address this problem with a simple representational shift inspired by probabilistic relational models: instead of representing document linkage in terms of the identities of linking documents, we represent it by the explicit and inferred attributes of the linking documents. Several surprising results come with this shift: in addition to being computationally more tractable, the new model produces factors that more cleanly decompose the document collection. We discuss several variations on this model and show how some can be seen as exact generalizations of the PageRank algorithm. 1

4 0.086052254 62 nips-2006-Correcting Sample Selection Bias by Unlabeled Data

Author: Jiayuan Huang, Arthur Gretton, Karsten M. Borgwardt, Bernhard Schölkopf, Alex J. Smola

Abstract: We consider the scenario where training and test data are drawn from different distributions, commonly referred to as sample selection bias. Most algorithms for this setting try to first recover sampling distributions and then make appropriate corrections based on the distribution estimate. We present a nonparametric method which directly produces resampling weights without distribution estimation. Our method works by matching distributions between training and testing sets in feature space. Experimental results demonstrate that our method works well in practice.

5 0.074748017 65 nips-2006-Denoising and Dimension Reduction in Feature Space

Author: Mikio L. Braun, Klaus-Robert Müller, Joachim M. Buhmann

Abstract: We show that the relevant information about a classification problem in feature space is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem. Thus, kernels not only transform data sets such that good generalization can be achieved even by linear discriminant functions, but this transformation is also performed in a manner which makes economic use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data to perform classification. Practically, we propose an algorithm which enables us to recover the subspace and dimensionality relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to help in model selection, and to (3) de-noise in feature space in order to yield better classification results. 1

6 0.074250698 39 nips-2006-Balanced Graph Matching

7 0.058058303 20 nips-2006-Active learning for misspecified generalized linear models

8 0.056855462 116 nips-2006-Learning from Multiple Sources

9 0.054581154 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space

10 0.053683773 67 nips-2006-Differential Entropic Clustering of Multivariate Gaussians

11 0.052967921 34 nips-2006-Approximate Correspondences in High Dimensions

12 0.052347235 130 nips-2006-Max-margin classification of incomplete data

13 0.052104793 77 nips-2006-Fast Computation of Graph Kernels

14 0.050068319 37 nips-2006-Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions

15 0.04569729 201 nips-2006-Using Combinatorial Optimization within Max-Product Belief Propagation

16 0.045533903 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors

17 0.045141038 150 nips-2006-On Transductive Regression

18 0.045057394 55 nips-2006-Computation of Similarity Measures for Sequential Data using Generalized Suffix Trees

19 0.044847354 109 nips-2006-Learnability and the doubling dimension

20 0.044723824 33 nips-2006-Analysis of Representations for Domain Adaptation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.162), (1, 0.046), (2, -0.006), (3, 0.031), (4, 0.031), (5, 0.061), (6, -0.006), (7, 0.044), (8, 0.037), (9, -0.002), (10, -0.042), (11, 0.031), (12, 0.048), (13, -0.045), (14, 0.028), (15, -0.046), (16, -0.052), (17, -0.042), (18, 0.007), (19, -0.041), (20, 0.011), (21, -0.023), (22, 0.039), (23, 0.034), (24, -0.001), (25, -0.162), (26, 0.051), (27, 0.058), (28, 0.067), (29, 0.024), (30, 0.001), (31, -0.163), (32, 0.027), (33, 0.025), (34, 0.012), (35, -0.062), (36, 0.052), (37, 0.033), (38, 0.078), (39, 0.077), (40, -0.065), (41, -0.144), (42, 0.051), (43, -0.097), (44, 0.202), (45, 0.033), (46, 0.111), (47, 0.092), (48, 0.08), (49, -0.085)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91953385 5 nips-2006-A Kernel Method for the Two-Sample-Problem

Author: Arthur Gretton, Karsten M. Borgwardt, Malte Rasch, Bernhard Schölkopf, Alex J. Smola

Abstract: We propose two statistical tests to determine if two samples are from different distributions. Our test statistic is in both cases the distance between the means of the two samples mapped into a reproducing kernel Hilbert space (RKHS). The first test is based on a large deviation bound for the test statistic, while the second is based on the asymptotic distribution of this statistic. The test statistic can be computed in O(m2 ) time. We apply our approach to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where our test performs strongly. We also demonstrate excellent performance when comparing distributions over graphs, for which no alternative tests currently exist.

2 0.52737463 166 nips-2006-Recursive Attribute Factoring

Author: Deepak Verma, Karl Pfleger, David Tax

Abstract: Clustering, or factoring of a document collection attempts to “explain” each observed document in terms of one or a small number of inferred prototypes. Prior work demonstrated that when links exist between documents in the corpus (as is the case with a collection of web pages or scientific papers), building a joint model of document contents and connections produces a better model than that built from contents or connections alone. Many problems arise when trying to apply these joint models to corpus at the scale of the World Wide Web, however; one of these is that the sheer overhead of representing a feature space on the order of billions of dimensions becomes impractical. We address this problem with a simple representational shift inspired by probabilistic relational models: instead of representing document linkage in terms of the identities of linking documents, we represent it by the explicit and inferred attributes of the linking documents. Several surprising results come with this shift: in addition to being computationally more tractable, the new model produces factors that more cleanly decompose the document collection. We discuss several variations on this model and show how some can be seen as exact generalizations of the PageRank algorithm. 1

3 0.50710279 158 nips-2006-PG-means: learning the number of clusters in data

Author: Yu Feng, Greg Hamerly

Abstract: We present a novel algorithm called PG-means which is able to learn the number of clusters in a classical Gaussian mixture model. Our method is robust and efficient; it uses statistical hypothesis tests on one-dimensional projections of the data and model to determine if the examples are well represented by the model. In so doing, we are applying a statistical test for the entire model at once, not just on a per-cluster basis. We show that our method works well in difficult cases such as non-Gaussian data, overlapping clusters, eccentric clusters, high dimension, and many true clusters. Further, our new method provides a much more stable estimate of the number of clusters than existing methods. 1

4 0.5068329 109 nips-2006-Learnability and the doubling dimension

Author: Yi Li, Philip M. Long

Abstract: Given a set of classifiers and a probability distribution over their domain, one can define a metric by taking the distance between a pair of classifiers to be the probability that they classify a random item differently. We prove bounds on the sample complexity of PAC learning in terms of the doubling dimension of this metric. These bounds imply known bounds on the sample complexity of learning halfspaces with respect to the uniform distribution that are optimal up to a constant factor. We prove a bound that holds for any algorithm that outputs a classifier with zero error whenever this is possible; this bound is in terms of the maximum of the doubling dimension and the VC-dimension of , and strengthens the best known bound in terms of the VC-dimension alone. We show that there is no bound on the doubling dimension in terms of the VC-dimension of (in contrast with the metric dimension).

5 0.45225424 62 nips-2006-Correcting Sample Selection Bias by Unlabeled Data

Author: Jiayuan Huang, Arthur Gretton, Karsten M. Borgwardt, Bernhard Schölkopf, Alex J. Smola

Abstract: We consider the scenario where training and test data are drawn from different distributions, commonly referred to as sample selection bias. Most algorithms for this setting try to first recover sampling distributions and then make appropriate corrections based on the distribution estimate. We present a nonparametric method which directly produces resampling weights without distribution estimation. Our method works by matching distributions between training and testing sets in feature space. Experimental results demonstrate that our method works well in practice.

6 0.43462262 169 nips-2006-Relational Learning with Gaussian Processes

7 0.40399775 144 nips-2006-Near-Uniform Sampling of Combinatorial Spaces Using XOR Constraints

8 0.40130949 37 nips-2006-Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions

9 0.39977667 82 nips-2006-Gaussian and Wishart Hyperkernels

10 0.39972612 140 nips-2006-Multiple Instance Learning for Computer Aided Diagnosis

11 0.39839876 131 nips-2006-Mixture Regression for Covariate Shift

12 0.38455603 30 nips-2006-An Oracle Inequality for Clipped Regularized Risk Minimizers

13 0.36990562 90 nips-2006-Hidden Markov Dirichlet Process: Modeling Genetic Recombination in Open Ancestral Space

14 0.36963138 201 nips-2006-Using Combinatorial Optimization within Max-Product Belief Propagation

15 0.36823761 85 nips-2006-Geometric entropy minimization (GEM) for anomaly detection and localization

16 0.35765988 155 nips-2006-Optimal Single-Class Classification Strategies

17 0.35617822 39 nips-2006-Balanced Graph Matching

18 0.34672573 150 nips-2006-On Transductive Regression

19 0.3463026 34 nips-2006-Approximate Correspondences in High Dimensions

20 0.34303415 177 nips-2006-Sparse Kernel Orthonormalized PLS for feature extraction in large data sets


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.102), (3, 0.014), (6, 0.278), (7, 0.077), (9, 0.038), (20, 0.011), (22, 0.11), (44, 0.079), (57, 0.066), (65, 0.05), (69, 0.029), (71, 0.015), (83, 0.015), (90, 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.85318404 114 nips-2006-Learning Time-Intensity Profiles of Human Activity using Non-Parametric Bayesian Models

Author: Alexander T. Ihler, Padhraic Smyth

Abstract: Data sets that characterize human activity over time through collections of timestamped events or counts are of increasing interest in application areas such as human-computer interaction, video surveillance, and Web data analysis. We propose a non-parametric Bayesian framework for modeling collections of such data. In particular, we use a Dirichlet process framework for learning a set of intensity functions corresponding to different categories, which form a basis set for representing individual time-periods (e.g., several days) depending on which categories the time-periods are assigned to. This allows the model to learn in a data-driven fashion what “factors” are generating the observations on a particular day, including (for example) weekday versus weekend effects or day-specific effects corresponding to unique (single-day) occurrences of unusual behavior, sharing information where appropriate to obtain improved estimates of the behavior associated with each category. Applications to real-world data sets of count data involving both vehicles and people are used to illustrate the technique. 1

2 0.79179662 95 nips-2006-Implicit Surfaces with Globally Regularised and Compactly Supported Basis Functions

Author: Christian Walder, Olivier Chapelle, Bernhard Schölkopf

Abstract: We consider the problem of constructing a function whose zero set is to represent a surface, given sample points with surface normal vectors. The contributions include a novel means of regularising multi-scale compactly supported basis functions that leads to the desirable properties previously only associated with fully supported bases, and show equivalence to a Gaussian process with modified covariance function. We also provide a regularisation framework for simpler and more direct treatment of surface normals, along with a corresponding generalisation of the representer theorem. We demonstrate the techniques on 3D problems of up to 14 million data points, as well as 4D time series data. 1

same-paper 3 0.78561622 5 nips-2006-A Kernel Method for the Two-Sample-Problem

Author: Arthur Gretton, Karsten M. Borgwardt, Malte Rasch, Bernhard Schölkopf, Alex J. Smola

Abstract: We propose two statistical tests to determine if two samples are from different distributions. Our test statistic is in both cases the distance between the means of the two samples mapped into a reproducing kernel Hilbert space (RKHS). The first test is based on a large deviation bound for the test statistic, while the second is based on the asymptotic distribution of this statistic. The test statistic can be computed in O(m2 ) time. We apply our approach to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where our test performs strongly. We also demonstrate excellent performance when comparing distributions over graphs, for which no alternative tests currently exist.

4 0.74970233 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments

Author: Jeremy Lewi, Robert Butera, Liam Paninski

Abstract: Adaptively optimizing experiments can significantly reduce the number of trials needed to characterize neural responses using parametric statistical models. However, the potential for these methods has been limited to date by severe computational challenges: choosing the stimulus which will provide the most information about the (typically high-dimensional) model parameters requires evaluating a high-dimensional integration and optimization in near-real time. Here we present a fast algorithm for choosing the optimal (most informative) stimulus based on a Fisher approximation of the Shannon information and specialized numerical linear algebra techniques. This algorithm requires only low-rank matrix manipulations and a one-dimensional linesearch to choose the stimulus and is therefore efficient even for high-dimensional stimulus and parameter spaces; for example, we require just 15 milliseconds on a desktop computer to optimize a 100-dimensional stimulus. Our algorithm therefore makes real-time adaptive experimental design feasible. Simulation results show that model parameters can be estimated much more efficiently using these adaptive techniques than by using random (nonadaptive) stimuli. Finally, we generalize the algorithm to efficiently handle both fast adaptation due to spike-history effects and slow, non-systematic drifts in the model parameters. Maximizing the efficiency of data collection is important in any experimental setting. In neurophysiology experiments, minimizing the number of trials needed to characterize a neural system is essential for maintaining the viability of a preparation and ensuring robust results. As a result, various approaches have been developed to optimize neurophysiology experiments online in order to choose the “best” stimuli given prior knowledge of the system and the observed history of the cell’s responses. The “best” stimulus can be defined a number of different ways depending on the experimental objectives. One reasonable choice, if we are interested in finding a neuron’s “preferred stimulus,” is the stimulus which maximizes the firing rate of the neuron [1, 2, 3, 4]. Alternatively, when investigating the coding properties of sensory cells it makes sense to define the optimal stimulus in terms of the mutual information between the stimulus and response [5]. Here we take a system identification approach: we define the optimal stimulus as the one which tells us the most about how a neural system responds to its inputs [6, 7]. We consider neural systems in † ‡ http://www.prism.gatech.edu/∼gtg120z http://www.stat.columbia.edu/∼liam which the probability p(rt |{xt , xt−1 , ..., xt−tk }, {rt−1 , . . . , rt−ta }) of the neural response rt given the current and past stimuli {xt , xt−1 , ..., xt−tk }, and the observed recent history of the neuron’s activity, {rt−1 , . . . , rt−ta }, can be described by a model p(rt |{xt }, {rt−1 }, θ), specified by a finite vector of parameters θ. Since we estimate these parameters from experimental trials, we want to choose our stimuli so as to minimize the number of trials needed to robustly estimate θ. 
Two inconvenient facts make it difficult to realize this goal in a computationally efficient manner: 1) model complexity — we typically need a large number of parameters to accurately model a system’s response p(rt |{xt }, {rt−1 }, θ); and 2) stimulus complexity — we are typically interested in neural responses to stimuli xt which are themselves very high-dimensional (e.g., spatiotemporal movies if we are dealing with visual neurons). In particular, it is computationally challenging to 1) update our a posteriori beliefs about the model parameters p(θ|{rt }, {xt }) given new stimulus-response data, and 2) find the optimal stimulus quickly enough to be useful in an online experimental context. In this work we present methods for solving these problems using generalized linear models (GLM) for the input-output relationship p(rt |{xt }, {rt−1 }, θ) and certain Gaussian approximations of the posterior distribution of the model parameters. Our emphasis is on finding solutions which scale well in high dimensions. We solve problem (1) by using efficient rank-one update methods to update the Gaussian approximation to the posterior, and problem (2) by a reduction to a highly tractable onedimensional optimization problem. Simulation results show that the resulting algorithm produces a set of stimulus-response pairs which is much more informative than the set produced by random sampling. Moreover, the algorithm is efficient enough that it could feasibly run in real-time. Neural systems are highly adaptive and more generally nonstatic. A robust approach to optimal experimental design must be able to cope with changes in θ. We emphasize that the model framework analyzed here can account for three key types of changes: stimulus adaptation, spike rate adaptation, and random non-systematic changes. Adaptation which is completely stimulus dependent can be accounted for by including enough stimulus history terms in the model p(rt |{xt , ..., xt−tk }, {rt−1 , ..., rt−ta }). Spike-rate adaptation effects, and more generally spike history-dependent effects, are accounted for explicitly in the model (1) below. Finally, we consider slow, non-systematic changes which could potentially be due to changes in the health, arousal, or attentive state of the preparation. Methods We model a neuron as a point process whose conditional intensity function (instantaneous firing rate) is given as the output of a generalized linear model (GLM) [8, 9]. This model class has been discussed extensively elsewhere; briefly, this class is fairly natural from a physiological point of view [10], with close connections to biophysical models such as the integrate-and-fire cell [9], and has been applied in a wide variety of experimental settings [11, 12, 13, 14]. The model is summarized as: tk λt = E(rt ) = f ta aj rt−j ki,t−l xi,t−l + i l=1 (1) j=1 In the above summation the filter coefficients ki,t−l capture the dependence of the neuron’s instantaneous firing rate λt on the ith component of the vector stimulus at time t − l, xt−l ; the model therefore allows for spatiotemporal receptive fields. For convenience, we arrange all the stimulus coefficients in a vector, k, which allows for a uniform treatment of the spatial and temporal components of the receptive field. The coefficients aj model the dependence on the observed recent activity r at time t − j (these terms may reflect e.g. refractory effects, burstiness, firing-rate adaptation, etc., depending on the value of the vector a [9]). 
For convenience we denote the unknown parameter vector as θ = {k; a}. The experimental objective is the estimation of the unknown filter coefficients, θ, given knowledge of the stimuli, xt , and the resulting responses rt . We chose the nonlinear stage of the GLM, the link function f (), to be the exponential function for simplicity. This choice ensures that the log likelihood of the observed data is a concave function of θ [9]. Representing and updating the posterior. As emphasized above, our first key task is to efficiently update the posterior distribution of θ after t trials, p(θt |xt , rt ), as new stimulus-response pairs are trial 100 trial 500 trial 2500 trial 5000 θ true 1 info. max. trial 0 0 random −1 (a) random info. max. 2000 Time(Seconds) Entropy 1500 1000 500 0 −500 0 1000 2000 3000 Iteration (b) 4000 5000 0.1 total time diagonalization posterior update 1d line Search 0.01 0.001 0 200 400 Dimensionality 600 (c) Figure 1: A) Plots of the estimated receptive field for a simulated visual neuron. The neuron’s receptive field θ has the Gabor structure shown in the last panel (spike history effects were set to zero for simplicity here, a = 0). The estimate of θ is taken as the mean of the posterior, µt . The images compare the accuracy of the estimates using information maximizing stimuli and random stimuli. B) Plots of the posterior entropies for θ in these two cases; note that the information-maximizing stimuli constrain the posterior of θ much more effectively than do random stimuli. C) A plot of the timing of the three steps performed on each iteration as a function of the dimensionality of θ. The timing for each step was well-fit by a polynomial of degree 2 for the diagonalization, posterior update and total time, and degree 1 for the line search. The times are an average over many iterations. The error-bars for the total time indicate ±1 std. observed. (We use xt and rt to abbreviate the sequences {xt , . . . , x0 } and {rt , . . . , r0 }.) To solve this problem, we approximate this posterior as a Gaussian; this approximation may be justified by the fact that the posterior is the product of two smooth, log-concave terms, the GLM likelihood function and the prior (which we assume to be Gaussian, for simplicity). Furthermore, the main theorem of [7] indicates that a Gaussian approximation of the posterior will be asymptotically accurate. We use a Laplace approximation to construct the Gaussian approximation of the posterior, p(θt |xt , rt ): we set µt to the peak of the posterior (i.e. the maximum a posteriori (MAP) estimate of θ), and the covariance matrix Ct to the negative inverse of the Hessian of the log posterior at µt . In general, computing these terms directly requires O(td2 + d3 ) time (where d = dim(θ); the time-complexity increases with t because to compute the posterior we must form a product of t likelihood terms, and the d3 term is due to the inverse of the Hessian matrix), which is unfortunately too slow when t or d becomes large. 
Therefore we further approximate p(θt−1 |xt−1 , rt−1 ) as Gaussian; to see how this simplifies matters, we use Bayes to write out the posterior: 1 −1 log p(θ|rt , xt ) = − (θ − µt−1 )T Ct−1 (θ − µt−1 ) + − exp {xt ; rt−1 }T θ 2 + rt {xt ; rt−1 }T θ + const d log p(θ|rt , xt ) −1 = −(θ − µt−1 )T Ct−1 + (2) − exp({xt ; rt−1 }T θ) + rt {xt ; rt−1 }T dθ d2 log p(θ|rt , xt ) −1 = −Ct−1 − exp({xt ; rt−1 }T θ){xt ; rt−1 }{xt ; rt−1 }T dθi dθj (3) Now, to update µt we only need to find the peak of a one-dimensional function (as opposed to a d-dimensional function); this follows by noting that that the likelihood only varies along a single direction, {xt ; rt−1 }, as a function of θ. At the peak of the posterior, µt , the first term in the gradient must be parallel to {xt ; rt−1 } because the gradient is zero. Since Ct−1 is non-singular, µt − µt−1 must be parallel to Ct−1 {xt ; rt−1 }. Therefore we just need to solve a one dimensional problem now to determine how much the mean changes in the direction Ct−1 {xt ; rt−1 }; this requires only O(d2 ) time. Moreover, from the second derivative term above it is clear that computing Ct requires just a rank-one matrix update of Ct−1 , which can be evaluated in O(d2 ) time via the Woodbury matrix lemma. Thus this Gaussian approximation of p(θt−1 |xt−1 , rt−1 ) provides a large gain in efficiency; our simulations (data not shown) showed that, despite this improved efficiency, the loss in accuracy due to this approximation was minimal. Deriving the (approximately) optimal stimulus. To simplify the derivation of our maximization strategy, we start by considering models in which the firing rate does not depend on past spiking, so θ = {k}. To choose the optimal stimulus for trial t + 1, we want to maximize the conditional mutual information I(θ; rt+1 |xt+1 , xt , rt ) = H(θ|xt , rt ) − H(θ|xt+1 , rt+1 ) (4) with respect to the stimulus xt+1 . The first term does not depend on xt+1 , so maximizing the information requires minimizing the conditional entropy H(θ|xt+1 , rt+1 ) = p(rt+1 |xt+1 ) −p(θ|rt+1 , xt+1 ) log p(θ|rt+1 , xt+1 )dθ = Ert+1 |xt+1 log det[Ct+1 ] + const. rt+1 (5) We do not average the entropy of p(θ|rt+1 , xt+1 ) over xt+1 because we are only interested in the conditional entropy for the particular xt+1 which will be presented next. The equality above is due to our Gaussian approximation of p(θ|xt+1 , rt+1 ). Therefore, we need to minimize Ert+1 |xt+1 log det[Ct+1 ] with respect to xt+1 . Since we set Ct+1 to be the negative inverse Hessian of the log-posterior, we have: −1 Ct+1 = Ct + Jobs (rt+1 , xt+1 ) −1 , (6) Jobs is the observed Fisher information. Jobs (rt+1 , xt+1 ) = −∂ 2 log p(rt+1 |ε = xt θ)/∂ε2 xt+1 xt t+1 t+1 (7) Here we use the fact that for the GLM, the likelihood depends only on the dot product, ε = xt θ. t+1 We can use the Woodbury lemma to evaluate the inverse: Ct+1 = Ct I + D(rt+1 , ε)(1 − D(rt+1 , ε)xt Ct xt+1 )−1 xt+1 xt Ct t+1 t+1 (8) where D(rt+1 , ε) = ∂ 2 log p(rt+1 |ε)/∂ε2 . Using some basic matrix identities, log det[Ct+1 ] = log det[Ct ] − log(1 − D(rt+1 , ε)xt Ct xt+1 ) t+1 = log det[Ct ] + D(rt+1 , ε)xt Ct xt+1 t+1 + o(D(rt+1 , ε)xt Ct xt+1 ) t+1 (9) (10) Ignoring the higher order terms, we need to minimize Ert+1 |xt+1 D(rt+1 , ε)xt Ct xt+1 . In our case, t+1 with f (θt xt+1 ) = exp(θt xt+1 ), we can use the moment-generating function of the multivariate Trial info. max. i.i.d 2 400 −10−4 0 0.05 −10−1 −2 ai 800 2 0 −2 −7 −10 i i.i.d k info. max. 
1 1 50 i 1 50 i 1 10 1 i (a) i 100 0 −0.05 10 1 1 (b) i 10 (c) Figure 2: A comparison of parameter estimates using information-maximizing versus random stimuli for a model neuron whose conditional intensity depends on both the stimulus and the spike history. The images in the top row of A and B show the MAP estimate of θ after each trial as a row in the image. Intensity indicates the value of the coefficients. The true value of θ is shown in the second row of images. A) The estimated stimulus coefficients, k. B) The estimated spike history coefficients, a. C) The final estimates of the parameters after 800 trials: dashed black line shows true values, dark gray is estimate using information maximizing stimuli, and light gray is estimate using random stimuli. Using our algorithm improved the estimates of k and a. Gaussian p(θ|xt , rt ) to evaluate this expectation. After some algebra, we find that to maximize I(θ; rt+1 |xt+1 , xt , rt ), we need to maximize 1 F (xt+1 ) = exp(xT µt ) exp( xT Ct xt+1 )xT Ct xt+1 . t+1 t+1 2 t+1 (11) Computing the optimal stimulus. For the GLM the most informative stimulus is undefined, since increasing the stimulus power ||xt+1 ||2 increases the informativeness of any putatively “optimal” stimulus. To obtain a well-posed problem, we optimize the stimulus under the usual power constraint ||xt+1 ||2 ≤ e < ∞. We maximize Eqn. 11 under this constraint using Lagrange multipliers and an eigendecomposition to reduce our original d-dimensional optimization problem to a onedimensional problem. Expressing Eqn. 11 in terms of the eigenvectors of Ct yields: 1 2 2 F (xt+1 ) = exp( u i yi + ci yi ) ci yi (12) 2 i i i = g( 2 ci yi ) ui yi )h( i (13) i where ui and yi represent the projection of µt and xt+1 onto the ith eigenvector and ci is the corresponding eigenvalue. To simplify notation we also introduce the functions g() and h() which are monotonically strictly increasing functions implicitly defined by Eqn. 12. We maximize F (xt+1 ) by breaking the problem into an inner and outer problem by fixing the value of i ui yi and maximizing h() subject to that constraint. A single line search over all possible values of i ui yi will then find the global maximum of F (.). This approach is summarized by the equation: max F (y) = max g(b) · y:||y||2 =e b max y:||y||2 =e,y t u=b 2 ci yi ) h( i Since h() is increasing, to solve the inner problem we only need to solve: 2 ci yi max y:||y||2 =e,y t u=b (14) i This last expression is a quadratic function with quadratic and linear constraints and we can solve it using the Lagrange method for constrained optimization. The result is an explicit system of 1 true θ random info. max. info. max. no diffusion 1 0.8 0.6 trial 0.4 0.2 400 0 −0.2 −0.4 800 1 100 θi 1 θi 100 1 θi 100 1 θ i 100 −0.6 random info. max. θ true θ i 1 0 −1 Entropy θ i 1 0 −1 random info. max. 250 200 i 1 θ Trial 400 Trial 200 Trial 0 (a) 0 −1 20 40 (b) i 60 80 100 150 0 200 400 600 Iteration 800 (c) Figure 3: Estimating the receptive field when θ is not constant. A) The posterior means µt and true θt plotted after each trial. θ was 100 dimensional, with its components following a Gabor function. To simulate nonsystematic changes in the response function, the center of the Gabor function was moved according to a random walk in between trials. We modeled the changes in θ as a random walk with a white covariance matrix, Q, with variance .01. 
In addition to the results for random and information-maximizing stimuli, we also show the µt given stimuli chosen to maximize the information under the (mistaken) assumption that θ was constant. Each row of the images plots θ using intensity to indicate the value of the different components. B) Details of the posterior means µt on selected trials. C) Plots of the posterior entropies as a function of trial number; once again, we see that information-maximizing stimuli constrain the posterior of θt more effectively. equations for the optimal yi as a function of the Lagrange multiplier λ1 . ui e yi (λ1 ) = ||y||2 2(ci − λ1 ) (15) Thus to find the global optimum we simply vary λ1 (this is equivalent to performing a search over b), and compute the corresponding y(λ1 ). For each value of λ1 we compute F (y(λ1 )) and choose the stimulus y(λ1 ) which maximizes F (). It is possible to show (details omitted) that the maximum of F () must occur on the interval λ1 ≥ c0 , where c0 is the largest eigenvalue. This restriction on the optimal λ1 makes the implementation of the linesearch significantly faster and more stable. To summarize, updating the posterior and finding the optimal stimulus requires three steps: 1) a rankone matrix update and one-dimensional search to compute µt and Ct ; 2) an eigendecomposition of Ct ; 3) a one-dimensional search over λ1 ≥ c0 to compute the optimal stimulus. The most expensive step here is the eigendecomposition of Ct ; in principle this step is O(d3 ), while the other steps, as discussed above, are O(d2 ). Here our Gaussian approximation of p(θt−1 |xt−1 , rt−1 ) is once again quite useful: recall that in this setting Ct is just a rank-one modification of Ct−1 , and there exist efficient algorithms for rank-one eigendecomposition updates [15]. While the worst-case running time of this rank-one modification of the eigendecomposition is still O(d3 ), we found the average running time in our case to be O(d2 ) (Fig. 1(c)), due to deflation which reduces the cost of matrix multiplications associated with finding the eigenvectors of repeated eigenvalues. Therefore the total time complexity of our algorithm is empirically O(d2 ) on average. Spike history terms. The preceding derivation ignored the spike-history components of the GLM model; that is, we fixed a = 0 in equation (1). Incorporating spike history terms only affects the optimization step of our algorithm; updating the posterior of θ = {k; a} proceeds exactly as before. The derivation of the optimization strategy proceeds in a similar fashion and leads to an analogous optimization strategy, albeit with a few slight differences in detail which we omit due to space constraints. The main difference is that instead of maximizing the quadratic expression in Eqn. 14 to find the maximum of h(), we need to maximize a quadratic expression which includes a linear term due to the correlation between the stimulus coefficients, k, and the spike history coefficients,a. The results of our simulations with spike history terms are shown in Fig. 2. Dynamic θ. In addition to fast changes due to adaptation and spike-history effects, animal preparations often change slowly and nonsystematically over the course of an experiment [16]. We model these effects by letting θ experience diffusion: θt+1 = θt + wt (16) Here wt is a normally distributed random variable with mean zero and known covariance matrix Q. This means that p(θt+1 |xt , rt ) is Gaussian with mean µt and covariance Ct + Q. 
Results

Our first simulation considered the use of our algorithm for learning the receptive field of a visually sensitive neuron. We took the neuron's receptive field to be a Gabor function, as a proxy model of a V1 simple cell, and generated synthetic responses by sampling Eqn. 1 with θ set to a 25×33 Gabor function. We used this synthetic data to compare how well θ could be estimated using information-maximizing stimuli versus random stimuli. The stimuli were 2-d images which were rasterized in order to express x as a vector. The plots of the posterior means µt in Fig. 1 (recall these are equivalent to the MAP estimate of θ) show that the information-maximizing strategy converges an order of magnitude more rapidly to the true θ. These results are consistent with the conclusion of [7] that the information-maximization strategy is asymptotically never worse than using random stimuli and is in general more efficient. The running time for each step of the algorithm as a function of the dimensionality of θ is plotted in Fig. 1(c). These results were obtained on a machine with a dual-core Intel 2.80 GHz XEON processor running Matlab. The solid lines indicate fitted polynomials of degree 1 for the 1-d line search and degree 2 for the remaining curves; the total running time for each trial scaled as O(d^2), as predicted. When θ had fewer than 200 dimensions, the total running time was roughly 50 ms (and for dim(θ) ≈ 100, the runtime was close to 15 ms), well within the range of tolerable latencies for many experiments.

In Fig. 2 we apply our algorithm to characterize the receptive field of a neuron whose response depends on its past spiking. Here, the stimulus coefficients k were chosen to follow a sine wave, and the spike history coefficients a were inhibitory and followed an exponential function. When choosing stimuli we updated the posterior for the full θ = {k; a} simultaneously and maximized the information about both the stimulus coefficients and the spike history coefficients. The information-maximizing strategy outperformed random sampling for estimating both the spike history and the stimulus coefficients.

Our final set of results, Fig. 3, considers a neuron whose receptive field drifts nonsystematically with time. We take the receptive field to be a Gabor function whose center moves according to a random walk (we have in mind a slow random drift of eye position during a visual experiment). The results demonstrate the feasibility of the information-maximization strategy in the presence of nonstationary response properties θ, and emphasize the superiority of adaptive methods in this context.
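As a concrete companion to the first simulation above, the sketch below builds a rasterized 25×33 Gabor receptive field and draws synthetic responses. The helper names (gabor_rf, sample_responses) and all numerical parameter values are ours, and since Eqn. 1 is stated earlier in the paper we assume here, purely for illustration, a Poisson GLM with exponential nonlinearity and no spike-history terms.

import numpy as np

def gabor_rf(ny=25, nx=33, sigma=4.0, freq=0.25, angle=np.pi / 4, center=(12, 16)):
    # A 25x33 Gabor receptive field, rasterized into a vector.
    yy, xx = np.mgrid[0:ny, 0:nx]
    dy, dx = yy - center[0], xx - center[1]
    carrier = np.cos(2 * np.pi * freq * (dx * np.cos(angle) + dy * np.sin(angle)))
    envelope = np.exp(-(dx ** 2 + dy ** 2) / (2 * sigma ** 2))
    return (envelope * carrier).ravel()

def sample_responses(theta, stimuli, rng):
    # Stand-in for sampling Eqn. 1: Poisson counts with exponential nonlinearity (assumed form).
    return rng.poisson(np.exp(stimuli @ theta))

rng = np.random.default_rng(0)
theta_true = gabor_rf()
X = rng.standard_normal((800, theta_true.size))          # 800 random rasterized image stimuli
X /= np.linalg.norm(X, axis=1, keepdims=True)            # unit-norm stimuli (power budget e = 1)
r = sample_responses(theta_true, X, rng)

An information-maximizing run would instead pick each stimulus with the line-search routine sketched earlier, updating the posterior (µt, Ct) after every trial.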
Conclusion

We have developed an efficient implementation of an algorithm for online optimization of neurophysiology experiments based on an information-theoretic criterion. Reasonable approximations based on a GLM framework allow the algorithm to run in near-real time even for high-dimensional parameter and stimulus spaces, and in the presence of spike-rate adaptation and time-varying neural response properties. Despite these approximations, the algorithm consistently provides significant improvements over random sampling; indeed, the differences in efficiency are large enough that the information-optimization strategy may permit robust system identification in cases where it is simply not otherwise feasible to estimate the neuron's parameters using random stimuli. Thus, in a sense, the proposed stimulus-optimization technique significantly extends the reach and power of classical neurophysiology methods.

Acknowledgments

JL is supported by the Computational Science Graduate Fellowship Program administered by the DOE under contract DE-FG02-97ER25308 and by the NSF IGERT Program in Hybrid Neural Microsystems at Georgia Tech via grant number DGE-0333411. LP is supported by grant EY018003 from the NEI and by a Gatsby Foundation Pilot Grant. We thank P. Latham for helpful conversations.

References

[1] I. Nelken, et al., Hearing Research 72, 237 (1994).
[2] P. Foldiak, Neurocomputing 38–40, 1217 (2001).
[3] K. Zhang, et al., Proceedings of the Computational and Systems Neuroscience Meeting (2004).
[4] R. C. deCharms, et al., Science 280, 1439 (1998).
[5] C. Machens, et al., Neuron 47, 447 (2005).
[6] A. Watson, et al., Perception and Psychophysics 33, 113 (1983).
[7] L. Paninski, Neural Computation 17, 1480 (2005).
[8] P. McCullagh, et al., Generalized Linear Models (Chapman and Hall, London, 1989).
[9] L. Paninski, Network: Computation in Neural Systems 15, 243 (2004).
[10] E. Simoncelli, et al., The Cognitive Neurosciences, M. Gazzaniga, ed. (MIT Press, 2004), third edn.
[11] P. Dayan, et al., Theoretical Neuroscience (MIT Press, 2001).
[12] E. Chichilnisky, Network: Computation in Neural Systems 12, 199 (2001).
[13] F. Theunissen, et al., Network: Computation in Neural Systems 12, 289 (2001).
[14] L. Paninski, et al., Journal of Neuroscience 24, 8551 (2004).
[15] M. Gu, et al., SIAM Journal on Matrix Analysis and Applications 15, 1266 (1994).
[16] N. A. Lesica, et al., IEEE Transactions on Neural Systems and Rehabilitation Engineering 13, 194 (2005).

5 0.58339709 20 nips-2006-Active learning for misspecified generalized linear models

Author: Francis R. Bach

Abstract: Active learning refers to algorithmic frameworks aimed at selecting training data points in order to reduce the number of required training data points and/or improve the generalization performance of a learning method. In this paper, we present an asymptotic analysis of active learning for generalized linear models. Our analysis holds under the common practical situation of model misspecification, and is based on realistic assumptions regarding the nature of the sampling distributions, which are usually neither independent nor identical. We derive unbiased estimators of generalization performance, as well as estimators of expected reduction in generalization error after adding a new training data point, that allow us to optimize its sampling distribution through a convex optimization problem. Our analysis naturally leads to an algorithm for sequential active learning which is applicable for all tasks supported by generalized linear models (e.g., binary classification, multi-class classification, regression) and can be applied in non-linear settings through the use of Mercer kernels.

6 0.58088499 61 nips-2006-Convex Repeated Games and Fenchel Duality

7 0.57874906 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy

8 0.57582021 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning

9 0.57365048 65 nips-2006-Denoising and Dimension Reduction in Feature Space

10 0.57330739 175 nips-2006-Simplifying Mixture Models through Function Approximation

11 0.57146662 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment

12 0.56896996 21 nips-2006-AdaBoost is Consistent

13 0.56846148 131 nips-2006-Mixture Regression for Covariate Shift

14 0.56826884 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming

15 0.56806159 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization

16 0.56760639 194 nips-2006-Towards a general independent subspace analysis

17 0.5661273 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation

18 0.56463903 203 nips-2006-implicit Online Learning with Kernels

19 0.56390578 79 nips-2006-Fast Iterative Kernel PCA

20 0.56229365 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis