nips nips2005 nips2005-205 knowledge-graph by maker-knowledge-mining

205 nips-2005-Worst-Case Bounds for Gaussian Process Models


Source: pdf

Author: Sham M. Kakade, Matthias W. Seeger, Dean P. Foster

Abstract: We present a competitive analysis of some non-parametric Bayesian algorithms in a worst-case online learning setting, where no probabilistic assumptions about the generation of the data are made. We consider models which use a Gaussian process prior (over the space of all functions) and provide bounds on the regret (under the log loss) for commonly used non-parametric Bayesian algorithms — including Gaussian regression and logistic regression — which show how these algorithms can perform favorably under rather general conditions. These bounds explicitly handle the infinite dimensionality of these non-parametric classes in a natural way. We also make formal connections to the minimax and minimum description length (MDL) framework. Here, we show precisely how Bayesian Gaussian regression is a minimax strategy. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 We present a competitive analysis of some non-parametric Bayesian algorithms in a worst-case online learning setting, where no probabilistic assumptions about the generation of the data are made. [sent-4, score-0.032]

2 These bounds explicitly handle the infinite dimensionality of these non-parametric classes in a natural way. [sent-6, score-0.023]

3 We also make formal connections to the minimax and minimum description length (MDL) framework. [sent-7, score-0.344]

4 Here, we show precisely how Bayesian Gaussian regression is a minimax strategy. [sent-8, score-0.365]

5 We study an online (sequential) prediction setting in which, at each timestep, the learner is given some input from the set X, and the learner must predict the output variable from the set Y. [sent-9, score-0.085]

6 The sequence {(xt, yt)} for t ∈ {1, . . . , T} is chosen by Nature (or by an adversary), and importantly, we do not make any statistical assumptions about its source: our statements hold for all sequences. [sent-13, score-0.068]

7 Our goal is to sequentially code the next label yt, given that we have observed x≤t and y<t … c > 0 is a constant such that, for all yt ∈ y≤T, −(d²/du²) log P(yt | u) ≤ c for all u ∈ R. [sent-14, score-0.313]

8 The proof of this theorem parallels that provided by Kakade and Ng [2004], with a number of added complexities for handling GP priors. [sent-15, score-0.078]

9 For the special case of Gaussian regression where c = σ⁻², the following theorem shows the stronger result that the bound is satisfied with an equality for all sequences. [sent-16, score-0.134]

10 Then, −log Pbayes(y≤T | x≤T) = min_{f ∈ H} { −log P(y≤T | x≤T, f(·)) + (1/2)‖f‖²_K } + (1/2) log|I + σ⁻²K|  (2), and the minimum is attained for a kernel expansion over x≤T. [sent-20, score-0.368] (a numerical sketch of this equality follows the sentence list)

11 This equality has important implications in our minimax theory (in Corollary 4. [sent-21, score-0.348]

12 It is not hard to see that the equality does not hold for other likelihoods. [sent-23, score-0.054]

13 The regret bound depends on two terms, ‖f‖²_K and log|I + cK|. [sent-25, score-0.9]

14 The dependence on ‖f‖²_K states the intuitive fact that a meaningful bound can only be obtained under smoothness assumptions on the set of experts. [sent-27, score-0.06]

15 The more complicated f is (as measured by ‖·‖_K), the higher the regret may be. [sent-28, score-0.796]

16 Let us now interpret the log |I + cK | term, which we refer to as the regret term. [sent-32, score-0.9]

17 The constant c, which bounds the curvature of the likelihood, exists for most commonly used exponential family likelihoods. [sent-33, score-0.103]

18 For logistic regression, we have c = 1/4, and for Gaussian regression, we have c = σ⁻². [sent-34, score-0.02]

19 Also, interestingly, while f is an arbitrary function in H, this regret term depends on K only at the sequence points x≤T . [sent-35, score-0.857]

20 For most infinite-dimensional kernels and without strong restrictions on the inputs, the regret term can be as large as Ω(T) — the sequence can be chosen s.t. [sent-36, score-0.857]

21 K ≈ c′I, which implies that log|I + cK| ≈ T log(1 + cc′). [sent-38, score-0.121]

22 For example, for an isotropic kernel (which is a function of the norm ‖x − x′‖₂) we can choose the xt to be mutually far from each other. [sent-39, score-0.072]

23 the Ornstein-Uhlenbeck kernel exp(−b‖x − x′‖₁) — the regret term can easily be Ω(T). [sent-42, score-0.857]

24 The cases we are interested in are those where the regret term is o(T ), in which case the average regret tends to 0 with time. [sent-43, score-1.634]

25 A spectral interpretation of this term helps us understand the behavior. [sent-44, score-0.061]

26 Let λ1, . . . , λT be the eigenvalues of K; then log|I + cK| = ∑_{t=1}^{T} log(1 + cλt) ≤ c·tr K, where tr K is the trace of K. [sent-48, score-0.174] (a numerical check of these bounds follows the sentence list)

27 Clearly, if the sum of the eigenvalues has a sublinear growth rate of o(T ), then the average regret tends to 0. [sent-51, score-0.864]

28 then the above eigenvalues are essentially the process eigenvalues. [sent-55, score-0.032]

29 In a forthcoming longer version, we explore this spectral interpretation in more detail and provide a case using the exponential kernel in which the regret grows as O(poly(log T )). [sent-56, score-0.883]

30 Namely, the familiar linear model — with u(x) = θ·x, θ, x ∈ R^d and Gaussian prior θ ∼ N(0, I) — can be seen as a GP model with the linear kernel: K(x, x′) = x·x′. [sent-60, score-0.019]

31 With X = (x1, . . . , xT)^T we have that a kernel expansion f(x) = ∑i αi xi·x = θ·x with θ = X^T α, and ‖f‖²_K = α^T X X^T α = ‖θ‖², so that H = {θ·x | θ ∈ R^d}, and so log|I + cK| = log|I + cX^T X|. Therefore, our result gives an input-dependent version of the result of Kakade and Ng [2004]. [sent-64, score-0.264]

32 If we make the further assumption that ‖x‖₂ ≤ 1 (as done in Kakade and Ng [2004]), then we can obtain exactly their regret term: log|I + cK| ≤ d log(1 + cT/d), which can be seen by rotating K into a diagonal matrix and maximizing the expression subject to the constraint that ‖x‖₂ ≤ 1 (i.e. [sent-65, score-1.055]

33 In general, this example shows that if K is a finite-dimensional kernel such as the linear or the polynomial kernel, then the regret term is only O(log T). [sent-68, score-0.857]

34 4 Relationships to Minimax Procedures and MDL This section builds the framework for understanding the minimax property of Gaussian regression. [sent-69, score-0.31]

35 We start by reviewing Shtarkov’s theorem, which shows that a certain normalized maximum likelihood density is the minimax strategy (when using the log loss). [sent-70, score-0.637]

36 In many cases, this minimax strategy does not exist — in those cases where the minimax regret is infinite. [sent-71, score-1.48]

37 We then propose a different, penalized notion of regret, and show that a certain normalized maximum a posteriori density is the minimax strategy here. [sent-72, score-0.659]

38 4) shows that for Gaussian regression the Bayesian strategy is precisely this minimax strategy. [sent-74, score-0.493]

39 Here, let us assume that there are no inputs — sequences consist of only yt ∈ Y. [sent-75, score-0.131]

40 Given a measurable space with base measure µ, we employ a countable number of random variables yt in Y. [sent-76, score-0.141]

41 Fix the sequence length T and define the model class F = {Q(·|θ) | θ ∈ Θ}, where Q(·|θ) denotes a joint probability density over Y^T with respect to µ. [sent-77, score-0.106]

42 We assume that for our model class there exists a parameter, θml (y ≤T ), maximizing the likelihood Q(y ≤T |θ) over Θ for all y ≤T ∈ Y T . [sent-78, score-0.115]

43 We make this assumption to make the connections to maximum likelihood (and, later, MAP) estimation clear. [sent-79, score-0.109]

44 Define the regret of a joint density P on y≤T as: R(y≤T, P, Θ) = −log P(y≤T) − inf_{θ∈Θ} {−log Q(y≤T | θ)}  (3)  = −log P(y≤T) + log Q(y≤T | θml(y≤T)). [sent-80, score-1.332]

45 Define the minimax regret with respect to Θ as: R(Θ) = inf_P sup_{y≤T ∈ Y^T} R(y≤T, P, Θ), where the inf is over all probability densities on Y^T. [sent-82, score-1.288]

46 The following theorem due to Shtarkov [1987] characterizes the minimax strategy. [sent-83, score-0.351]

47 if it has a finite normalization constant), then define it to be the normalized maximum likelihood (NML) density. [sent-87, score-0.086]

48 Pml(y≤T) = Q(y≤T | θml(y≤T)) / ∫ Q(y≤T | θml(y≤T)) dµ(y≤T)  (5). If Pml exists, it is a minimax strategy, i.e. [sent-88, score-0.31] (a toy NML/NMAP illustration follows the sentence list)

49 for all y ≤T , the regret R(y ≤T , Pml , Θ) does not exceed R(Θ). [sent-90, score-0.826]

50 Note that this density exists only if the normalizing constant is finite, which is not the case in general. [sent-91, score-0.117]

51 The proof is straightforward using the fact that the NML density is an equalizer — meaning that it has constant regret on all sequences. [sent-92, score-0.96]

52 Proof: First note that the regret R(y≤T, Pml, Θ) is the constant log ∫ Q(y≤T | θml(y≤T)) dµ(y≤T). [sent-93, score-0.104]

53 To see this, simply substitute Eq. [sent-95, score-0.83]

54 For convenience, define the regret of any P as R(P, Θ) = sup_{y≤T ∈ Y^T} R(y≤T, P, Θ). [sent-97, score-0.796]

55 For any P ≠ Pml (differing on a set with positive measure), there exists some y≤T such that P(y≤T) < Pml(y≤T), since the densities are normalized. [sent-98, score-0.072]

56 This implies that R(P, Θ) ≥ R(y ≤T , P, Θ) > R(y ≤T , Pml , Θ) = R(Pml , Θ) where the first step follows from the definition of R(P, Θ), the second from − log P (y ≤T ) > − log Pml (y ≤T ), and the last from the fact that Pml is an equalizer (its regret is constant on all sequences). [sent-99, score-1.092]

57 Hence, P has a strictly larger regret, implying that Pml is the unique minimax strategy. [sent-100, score-0.31]

58 Unfortunately, in many important model classes, the minimax regret R(Θ) is not finite, and the NML density does not exist. [sent-101, score-1.159]

59 Consider a model which assumes the sequence is generated i.i.d. [sent-104, score-0.035]

60 It is easy to see that for this class the minimax regret is infinite and Pml does not exist (see Grunwald [2005]). [sent-109, score-1.124]

61 This example can be generalized to the Gaussian regression model (if we know the sequence x≤T in advance). [sent-110, score-0.09]

62 For this problem, if one modifies the space of allowable sequences (i.e. [sent-111, score-0.045]

63 To remedy this problem, consider placing some structure on the model class F = {Q(·|θ) | θ ∈ Θ}. [sent-117, score-0.035]

64 The idea is to penalize Q(·|θ) ∈ F based on this structure. [sent-118, score-0.015]

65 Assume that Θ is a measurable space and place a prior distribution with density function q on Θ. [sent-120, score-0.098]

66 Define the penalized regret of P on y≤T as: Rq(y≤T, P, Θ) = −log P(y≤T) − inf_{θ∈Θ} {−log Q(y≤T | θ) − log q(θ)}. [sent-121, score-1.322]

67 Note that −log Q(y≤T | θ) − log q(θ) can be viewed as a “two part” code, in which we first code θ under the prior q and then code y≤T under the likelihood Q(·|θ). [sent-122, score-0.323]

68 Unlike the standard regret, the penalized regret can be viewed as a comparison to an actual code. [sent-123, score-0.943]

69 However, in MDL, they consider using minimax schemes (via Pml ) for the likelihood part of the code, while we consider minimax schemes for this penalized regret. [sent-125, score-0.841]

70 Again, for clarity, assume there exists a parameter, θmap (y ≤T ) maximizing log Q(y ≤T |θ)+ log q(θ). [sent-126, score-0.271]

71 Notice that this is just the maximum a posteriori (MAP) parameter, if one were to use a Bayesian strategy with the prior q (since the posterior density would be proportional to Q(y≤T|θ)q(θ)). [sent-127, score-0.179]

72 Here, Rq(y≤T, P, Θ) = −log P(y≤T) + log Q(y≤T | θmap(y≤T)) + log q(θmap(y≤T)). Similarly, with respect to Θ, define the minimax penalized regret as: Rq(Θ) = inf_P sup_{y≤T ∈ Y^T} Rq(y≤T, P, Θ), where again the inf is over all densities on Y^T. [sent-128, score-1.747]

73 If Θ is finite or countable and Q(·|θ) > 0 for all θ, then the Bayes procedure has the desirable property of having penalized regret which is non-positive. [sent-129, score-0.979]

74 However, in general, the Bayes procedure does not achieve the minimax penalized regret, Rq(Θ), which is what we desire — though, for one case, we show that it does (in the next section). [sent-130, score-0.472]

75 Define the normalized maximum a posteriori (NMAP) density, if it exists, as: Pmap(y≤T) = Q(y≤T | θmap(y≤T)) q(θmap(y≤T)) / ∫ Q(y≤T | θmap(y≤T)) q(θmap(y≤T)) dµ(y≤T)  (6). [sent-133, score-0.085]

76 If Pmap exists, it is a minimax strategy for the penalized regret, i.e. [sent-134, score-0.521]

77 for all y ≤T , the penalized regret Rq (y ≤T , Pmap , Θ) does not exceed Rq (Θ). [sent-136, score-0.973]

78 The proof relies on Pmap being an equalizer for the penalized regret and is identical to that of Theorem 4.1. [sent-137, score-1.034]

79 Just replace all quantities with their penalized equivalents. [sent-138, score-0.147]

80 We now return to the setting with inputs and show how the Bayesian strategy for the Gaussian regression model is a minimax strategy for all input sequences x≤T. [sent-140, score-0.545]

81 If we fix the input sequence x≤T, we can consider the competitor class to be F = {P(y≤T | x≤T, θ) | θ ∈ Θ}. [sent-141, score-0.069]

82 In other words, we make the more stringent comparison against a model class which has full knowledge of the input sequence in advance. [sent-142, score-0.086]
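
The three short scripts below are illustrative sketches written for this summary, not code from the paper; the kernels, inputs, and constants are arbitrary choices. This first one numerically checks the Gaussian-regression equality of Eq. (2) (sentences 9-11): with c = σ⁻², the Bayesian log loss equals the best penalized kernel expansion plus the regret term (1/2) log|I + σ⁻²K|.

```python
import numpy as np

rng = np.random.default_rng(0)
T, sigma2 = 8, 0.3                                 # sequence length, noise variance

# An arbitrary PSD kernel matrix K (RBF kernel on random 1-D inputs) and labels y.
x = rng.normal(size=T)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)
y = rng.normal(size=T)

# Left-hand side: Bayesian (GP) log loss, -log N(y; 0, K + sigma^2 I).
C = K + sigma2 * np.eye(T)
lhs = 0.5 * y @ np.linalg.solve(C, y) + 0.5 * np.linalg.slogdet(C)[1] \
      + 0.5 * T * np.log(2 * np.pi)

# Right-hand side: min over kernel expansions f = K @ alpha of
#   -log P(y | f) + (1/2) ||f||_K^2,  plus the regret term (1/2) log|I + K / sigma^2|.
alpha = np.linalg.solve(C, y)                      # minimizer of the penalized log loss
f = K @ alpha
data_fit = 0.5 / sigma2 * np.sum((y - f) ** 2) + 0.5 * T * np.log(2 * np.pi * sigma2)
penalty = 0.5 * alpha @ K @ alpha                  # (1/2) ||f||_K^2
regret_term = 0.5 * np.linalg.slogdet(np.eye(T) + K / sigma2)[1]

print(lhs, data_fit + penalty + regret_term)       # the two sides agree to numerical precision
```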

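The second sketch (same caveats: synthetic inputs, not the authors' code) checks the spectral and linear-kernel bounds on the regret term from sentences 26 and 31-32: log|I + cK| = ∑t log(1 + cλt) ≤ c·tr K, and, for the linear kernel with ‖x‖₂ ≤ 1, log|I + cK| = log|I + cX^T X| ≤ d log(1 + cT/d).

```python
import numpy as np

rng = np.random.default_rng(1)
T, d, c = 50, 3, 0.25                              # c = 1/4, as for logistic regression

# Linear kernel: rows of X are inputs rescaled so that ||x_t||_2 <= 1, and K = X X^T.
X = rng.normal(size=(T, d))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))
K = X @ X.T

lam = np.linalg.eigvalsh(K)                        # eigenvalues of K
reg = np.linalg.slogdet(np.eye(T) + c * K)[1]      # the regret term log|I + cK|

print(reg, np.sum(np.log1p(c * lam)))              # agree: sum over eigenvalues of K
print(reg <= c * np.trace(K))                      # True: log|I + cK| <= c tr K
print(reg, np.linalg.slogdet(np.eye(d) + c * X.T @ X)[1])  # agree: d x d determinant
print(reg <= d * np.log(1 + c * T / d))            # True: the O(d log T) regret term
```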

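The last sketch illustrates the equalizer arguments behind the NML and NMAP densities (sentences 47-53 and 75-79) on a toy Bernoulli model over short binary sequences, a hypothetical finite setup chosen so that both normalizers are finite; it also checks that the Bayes mixture over a finite Θ has non-positive penalized regret (sentence 73).

```python
import numpy as np
from itertools import product

T = 5
seqs = list(product([0, 1], repeat=T))             # all of Y^T for binary outcomes

def lik(y, th):                                    # Bernoulli likelihood Q(y | theta)
    k = sum(y)
    return th ** k * (1 - th) ** (T - k)

# NML (Eq. 5): theta ranges over [0, 1], maximum-likelihood parameter theta_ml = k / T.
max_lik = {y: lik(y, sum(y) / T) for y in seqs}
Z_ml = sum(max_lik.values())                       # Shtarkov normalizer (finite here)
P_ml = {y: max_lik[y] / Z_ml for y in seqs}
regrets = [-np.log(P_ml[y]) + np.log(max_lik[y]) for y in seqs]
print(np.ptp(regrets), np.log(Z_ml))               # zero spread: P_ml is an equalizer, regret = log Z_ml

# NMAP (Eq. 6): a small finite class Theta with prior q.
thetas, q = [0.2, 0.5, 0.8], [0.25, 0.5, 0.25]
map_val = {y: max(lik(y, th) * qi for th, qi in zip(thetas, q)) for y in seqs}
Z_map = sum(map_val.values())
P_map = {y: map_val[y] / Z_map for y in seqs}

def pen_regret(P, y):                              # penalized regret of density P on y
    return -np.log(P[y]) + np.log(map_val[y])

print(np.ptp([pen_regret(P_map, y) for y in seqs]))        # zero: P_map is an equalizer

P_bayes = {y: sum(qi * lik(y, th) for th, qi in zip(thetas, q)) for y in seqs}
print(max(pen_regret(P_bayes, y) for y in seqs) <= 1e-12)  # True: non-positive penalized regret
```
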
similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('regret', 0.796), ('minimax', 0.31), ('pml', 0.296), ('penalized', 0.147), ('rq', 0.139), ('log', 0.104), ('pmap', 0.091), ('yt', 0.079), ('mdl', 0.079), ('ck', 0.075), ('kakade', 0.069), ('equalizer', 0.068), ('nml', 0.068), ('shtarkov', 0.068), ('inf', 0.067), ('ml', 0.065), ('strategy', 0.064), ('grunwald', 0.059), ('regression', 0.055), ('density', 0.053), ('map', 0.049), ('exists', 0.044), ('theorem', 0.041), ('equality', 0.038), ('xt', 0.037), ('countable', 0.036), ('foster', 0.036), ('learner', 0.035), ('gaussian', 0.035), ('kernel', 0.035), ('sequence', 0.035), ('likelihood', 0.034), ('posteriori', 0.033), ('bayesian', 0.033), ('eigenvalues', 0.032), ('ng', 0.032), ('code', 0.031), ('exceed', 0.03), ('normalized', 0.029), ('densities', 0.028), ('pennsylvania', 0.028), ('sequences', 0.027), ('term', 0.026), ('measurable', 0.026), ('inputs', 0.025), ('dependence', 0.024), ('bounds', 0.023), ('nite', 0.023), ('proof', 0.023), ('maximum', 0.023), ('gp', 0.022), ('expansion', 0.021), ('sup', 0.02), ('interpretation', 0.02), ('rd', 0.02), ('logistic', 0.02), ('schemes', 0.02), ('reviewing', 0.02), ('adversary', 0.02), ('aposteriori', 0.02), ('supy', 0.02), ('sublinear', 0.02), ('constant', 0.02), ('importantly', 0.019), ('maximizing', 0.019), ('tr', 0.019), ('smoothness', 0.019), ('prior', 0.019), ('corollary', 0.018), ('class', 0.018), ('barely', 0.018), ('sham', 0.018), ('barron', 0.018), ('allowable', 0.018), ('make', 0.018), ('parametric', 0.017), ('assumptions', 0.017), ('remedy', 0.017), ('cc', 0.017), ('poly', 0.017), ('statements', 0.017), ('forthcoming', 0.017), ('commonly', 0.016), ('hold', 0.016), ('matthias', 0.016), ('competitor', 0.016), ('connections', 0.016), ('tends', 0.016), ('differing', 0.015), ('dean', 0.015), ('penalize', 0.015), ('stringent', 0.015), ('desire', 0.015), ('favorably', 0.015), ('spectral', 0.015), ('online', 0.015), ('bayes', 0.015), ('rotating', 0.014), ('complexities', 0.014), ('substitute', 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 205 nips-2005-Worst-Case Bounds for Gaussian Process Models

Author: Sham M. Kakade, Matthias W. Seeger, Dean P. Foster

Abstract: We present a competitive analysis of some non-parametric Bayesian algorithms in a worst-case online learning setting, where no probabilistic assumptions about the generation of the data are made. We consider models which use a Gaussian process prior (over the space of all functions) and provide bounds on the regret (under the log loss) for commonly used non-parametric Bayesian algorithms — including Gaussian regression and logistic regression — which show how these algorithms can perform favorably under rather general conditions. These bounds explicitly handle the infinite dimensionality of these non-parametric classes in a natural way. We also make formal connections to the minimax and minimum description length (MDL) framework. Here, we show precisely how Bayesian Gaussian regression is a minimax strategy. 1

2 0.060012549 74 nips-2005-Faster Rates in Regression via Active Learning

Author: Rebecca Willett, Robert Nowak, Rui M. Castro

Abstract: This paper presents a rigorous statistical analysis characterizing regimes in which active learning significantly outperforms classical passive learning. Active learning algorithms are able to make queries or select sample locations in an online fashion, depending on the results of the previous queries. In some regimes, this extra flexibility leads to significantly faster rates of error decay than those possible in classical passive learning settings. The nature of these regimes is explored by studying fundamental performance limits of active and passive learning in two illustrative nonparametric function classes. In addition to examining the theoretical potential of active learning, this paper describes a practical algorithm capable of exploiting the extra flexibility of the active setting and provably improving upon the classical passive techniques. Our active learning theory and methods show promise in a number of applications, including field estimation using wireless sensor networks and fault line detection. 1

3 0.053634401 47 nips-2005-Consistency of one-class SVM and related algorithms

Author: Régis Vert, Jean-philippe Vert

Abstract: We determine the asymptotic limit of the function computed by support vector machines (SVM) and related algorithms that minimize a regularized empirical convex loss function in the reproducing kernel Hilbert space of the Gaussian RBF kernel, in the situation where the number of examples tends to infinity, the bandwidth of the Gaussian kernel tends to 0, and the regularization parameter is held fixed. Non-asymptotic convergence bounds to this limit in the L2 sense are provided, together with upper bounds on the classification error that is shown to converge to the Bayes risk, therefore proving the Bayes-consistency of a variety of methods although the regularization term does not vanish. These results are particularly relevant to the one-class SVM, for which the regularization can not vanish by construction, and which is shown for the first time to be a consistent density level set estimator. 1

4 0.048884209 202 nips-2005-Variational EM Algorithms for Non-Gaussian Latent Variable Models

Author: Jason Palmer, Kenneth Kreutz-Delgado, Bhaskar D. Rao, David P. Wipf

Abstract: We consider criteria for variational representations of non-Gaussian latent variables, and derive variational EM algorithms in general form. We establish a general equivalence among convex bounding methods, evidence based methods, and ensemble learning/Variational Bayes methods, which has previously been demonstrated only for particular cases.

5 0.047214366 50 nips-2005-Convex Neural Networks

Author: Yoshua Bengio, Nicolas L. Roux, Pascal Vincent, Olivier Delalleau, Patrice Marcotte

Abstract: Convexity has recently received a lot of attention in the machine learning community, and the lack of convexity has been seen as a major disadvantage of many learning algorithms, such as multi-layer artificial neural networks. We show that training multi-layer neural networks in which the number of hidden units is learned can be viewed as a convex optimization problem. This problem involves an infinite number of variables, but can be solved by incrementally inserting a hidden unit at a time, each time finding a linear classifier that minimizes a weighted sum of errors. 1

6 0.046675049 45 nips-2005-Conditional Visual Tracking in Kernel Space

7 0.041446175 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning

8 0.041202344 58 nips-2005-Divergences, surrogate loss functions and experimental design

9 0.041089356 2 nips-2005-A Bayes Rule for Density Matrices

10 0.040415391 136 nips-2005-Noise and the two-thirds power Law

11 0.03804886 30 nips-2005-Assessing Approximations for Gaussian Process Classification

12 0.037806798 80 nips-2005-Gaussian Process Dynamical Models

13 0.036998816 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis

14 0.036907014 140 nips-2005-Nonparametric inference of prior probabilities from Bayes-optimal behavior

15 0.036342949 168 nips-2005-Rodeo: Sparse Nonparametric Regression in High Dimensions

16 0.03580001 159 nips-2005-Q-Clustering

17 0.035368901 201 nips-2005-Variational Bayesian Stochastic Complexity of Mixture Models

18 0.033606179 166 nips-2005-Robust Fisher Discriminant Analysis

19 0.0326511 82 nips-2005-Generalization Error Bounds for Aggregation by Mirror Descent with Averaging

20 0.032425109 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.108), (1, 0.044), (2, -0.017), (3, -0.032), (4, 0.084), (5, -0.044), (6, -0.002), (7, -0.039), (8, -0.045), (9, -0.018), (10, 0.01), (11, 0.042), (12, 0.02), (13, 0.035), (14, -0.003), (15, -0.021), (16, 0.021), (17, 0.017), (18, 0.013), (19, -0.018), (20, 0.015), (21, -0.042), (22, -0.008), (23, -0.071), (24, -0.002), (25, 0.067), (26, 0.025), (27, -0.012), (28, -0.001), (29, -0.028), (30, 0.011), (31, -0.041), (32, 0.021), (33, -0.009), (34, 0.013), (35, 0.002), (36, 0.019), (37, -0.017), (38, 0.028), (39, 0.038), (40, 0.073), (41, -0.011), (42, 0.013), (43, 0.019), (44, 0.064), (45, 0.074), (46, 0.03), (47, -0.025), (48, -0.087), (49, -0.119)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.89753288 205 nips-2005-Worst-Case Bounds for Gaussian Process Models

Author: Sham M. Kakade, Matthias W. Seeger, Dean P. Foster

Abstract: We present a competitive analysis of some non-parametric Bayesian algorithms in a worst-case online learning setting, where no probabilistic assumptions about the generation of the data are made. We consider models which use a Gaussian process prior (over the space of all functions) and provide bounds on the regret (under the log loss) for commonly used non-parametric Bayesian algorithms — including Gaussian regression and logistic regression — which show how these algorithms can perform favorably under rather general conditions. These bounds explicitly handle the infinite dimensionality of these non-parametric classes in a natural way. We also make formal connections to the minimax and minimum description length (MDL) framework. Here, we show precisely how Bayesian Gaussian regression is a minimax strategy. 1

2 0.5911091 197 nips-2005-Unbiased Estimator of Shape Parameter for Spiking Irregularities under Changing Environments

Author: Keiji Miura, Masato Okada, Shun-ichi Amari

Abstract: We considered a gamma distribution of interspike intervals as a statistical model for neuronal spike generation. The model parameters consist of a time-dependent firing rate and a shape parameter that characterizes spiking irregularities of individual neurons. Because the environment changes with time, observed data are generated from the time-dependent firing rate, which is an unknown function. A statistical model with an unknown function is called a semiparametric model, which is one of the unsolved problem in statistics and is generally very difficult to solve. We used a novel method of estimating functions in information geometry to estimate the shape parameter without estimating the unknown function. We analytically obtained an optimal estimating function for the shape parameter independent of the functional form of the firing rate. This estimation is efficient without Fisher information loss and better than maximum likelihood estimation. 1

3 0.47586995 47 nips-2005-Consistency of one-class SVM and related algorithms

Author: Régis Vert, Jean-philippe Vert

Abstract: We determine the asymptotic limit of the function computed by support vector machines (SVM) and related algorithms that minimize a regularized empirical convex loss function in the reproducing kernel Hilbert space of the Gaussian RBF kernel, in the situation where the number of examples tends to infinity, the bandwidth of the Gaussian kernel tends to 0, and the regularization parameter is held fixed. Non-asymptotic convergence bounds to this limit in the L2 sense are provided, together with upper bounds on the classification error that is shown to converge to the Bayes risk, therefore proving the Bayes-consistency of a variety of methods although the regularization term does not vanish. These results are particularly relevant to the one-class SVM, for which the regularization can not vanish by construction, and which is shown for the first time to be a consistent density level set estimator. 1

4 0.43910074 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction

Author: Gilles Blanchard, Masashi Sugiyama, Motoaki Kawanabe, Vladimir Spokoiny, Klaus-Robert Müller

Abstract: We propose a new linear method for dimension reduction to identify nonGaussian components in high dimensional data. Our method, NGCA (non-Gaussian component analysis), uses a very general semi-parametric framework. In contrast to existing projection methods we define what is uninteresting (Gaussian): by projecting out uninterestingness, we can estimate the relevant non-Gaussian subspace. We show that the estimation error of finding the non-Gaussian components tends to zero at a parametric rate. Once NGCA components are identified and extracted, various tasks can be applied in the data analysis process, like data visualization, clustering, denoising or classification. A numerical study demonstrates the usefulness of our method. 1

5 0.43818673 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification

Author: Yves Grandvalet, Johnny Mariethoz, Samy Bengio

Abstract: In this paper, we show that the hinge loss can be interpreted as the neg-log-likelihood of a semi-parametric model of posterior probabilities. From this point of view, SVMs represent the parametric component of a semi-parametric model fitted by a maximum a posteriori estimation procedure. This connection enables to derive a mapping from SVM scores to estimated posterior probabilities. Unlike previous proposals, the suggested mapping is interval-valued, providing a set of posterior probabilities compatible with each SVM score. This framework offers a new way to adapt the SVM optimization problem to unbalanced classification, when decisions result in unequal (asymmetric) losses. Experiments show improvements over state-of-the-art procedures. 1

6 0.43474096 44 nips-2005-Computing the Solution Path for the Regularized Support Vector Regression

7 0.43006533 140 nips-2005-Nonparametric inference of prior probabilities from Bayes-optimal behavior

8 0.42312643 82 nips-2005-Generalization Error Bounds for Aggregation by Mirror Descent with Averaging

9 0.41824824 45 nips-2005-Conditional Visual Tracking in Kernel Space

10 0.41414914 30 nips-2005-Assessing Approximations for Gaussian Process Classification

11 0.40306193 112 nips-2005-Learning Minimum Volume Sets

12 0.40119755 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis

13 0.3999323 58 nips-2005-Divergences, surrogate loss functions and experimental design

14 0.39623487 21 nips-2005-An Alternative Infinite Mixture Of Gaussian Process Experts

15 0.39471105 202 nips-2005-Variational EM Algorithms for Non-Gaussian Latent Variable Models

16 0.39395919 155 nips-2005-Predicting EMG Data from M1 Neurons with Variational Bayesian Least Squares

17 0.39220861 93 nips-2005-Ideal Observers for Detecting Motion: Correspondence Noise

18 0.39196354 49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations

19 0.38969377 168 nips-2005-Rodeo: Sparse Nonparametric Regression in High Dimensions

20 0.38475227 50 nips-2005-Convex Neural Networks


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.069), (10, 0.057), (27, 0.016), (31, 0.03), (34, 0.092), (39, 0.01), (41, 0.016), (55, 0.027), (65, 0.012), (69, 0.033), (73, 0.026), (77, 0.011), (87, 0.303), (88, 0.094), (91, 0.075)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.81190556 34 nips-2005-Bayesian Surprise Attracts Human Attention

Author: Laurent Itti, Pierre F. Baldi

Abstract: The concept of surprise is central to sensory processing, adaptation, learning, and attention. Yet, no widely-accepted mathematical theory currently exists to quantitatively characterize surprise elicited by a stimulus or event, for observers that range from single neurons to complex natural or engineered systems. We describe a formal Bayesian definition of surprise that is the only consistent formulation under minimal axiomatic assumptions. Surprise quantifies how data affects a natural or artificial observer, by measuring the difference between posterior and prior beliefs of the observer. Using this framework we measure the extent to which humans direct their gaze towards surprising items while watching television and video games. We find that subjects are strongly attracted towards surprising locations, with 72% of all human gaze shifts directed towards locations more surprising than the average, a figure which rises to 84% when considering only gaze targets simultaneously selected by all subjects. The resulting theory of surprise is applicable across different spatio-temporal scales, modalities, and levels of abstraction.

same-paper 2 0.78525871 205 nips-2005-Worst-Case Bounds for Gaussian Process Models

Author: Sham M. Kakade, Matthias W. Seeger, Dean P. Foster

Abstract: We present a competitive analysis of some non-parametric Bayesian algorithms in a worst-case online learning setting, where no probabilistic assumptions about the generation of the data are made. We consider models which use a Gaussian process prior (over the space of all functions) and provide bounds on the regret (under the log loss) for commonly used non-parametric Bayesian algorithms — including Gaussian regression and logistic regression — which show how these algorithms can perform favorably under rather general conditions. These bounds explicitly handle the infinite dimensionality of these non-parametric classes in a natural way. We also make formal connections to the minimax and minimum description length (MDL) framework. Here, we show precisely how Bayesian Gaussian regression is a minimax strategy. 1

3 0.74342382 130 nips-2005-Modeling Neuronal Interactivity using Dynamic Bayesian Networks

Author: Lei Zhang, Dimitris Samaras, Nelly Alia-klein, Nora Volkow, Rita Goldstein

Abstract: Functional Magnetic Resonance Imaging (fMRI) has enabled scientists to look into the active brain. However, interactivity between functional brain regions, is still little studied. In this paper, we contribute a novel framework for modeling the interactions between multiple active brain regions, using Dynamic Bayesian Networks (DBNs) as generative models for brain activation patterns. This framework is applied to modeling of neuronal circuits associated with reward. The novelty of our framework from a Machine Learning perspective lies in the use of DBNs to reveal the brain connectivity and interactivity. Such interactivity models which are derived from fMRI data are then validated through a group classification task. We employ and compare four different types of DBNs: Parallel Hidden Markov Models, Coupled Hidden Markov Models, Fully-linked Hidden Markov Models and Dynamically MultiLinked HMMs (DML-HMM). Moreover, we propose and compare two schemes of learning DML-HMMs. Experimental results show that by using DBNs, group classification can be performed even if the DBNs are constructed from as few as 5 brain regions. We also demonstrate that, by using the proposed learning algorithms, different DBN structures characterize drug addicted subjects vs. control subjects. This finding provides an independent test for the effect of psychopathology on brain function. In general, we demonstrate that incorporation of computer science principles into functional neuroimaging clinical studies provides a novel approach for probing human brain function.

4 0.49979827 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation

Author: Alan L. Yuille

Abstract: We show that linear generalizations of Rescorla-Wagner can perform Maximum Likelihood estimation of the parameters of all generative models for causal reasoning. Our approach involves augmenting variables to deal with conjunctions of causes, similar to the agumented model of Rescorla. Our results involve genericity assumptions on the distributions of causes. If these assumptions are violated, for example for the Cheng causal power theory, then we show that a linear Rescorla-Wagner can estimate the parameters of the model up to a nonlinear transformtion. Moreover, a nonlinear Rescorla-Wagner is able to estimate the parameters directly to within arbitrary accuracy. Previous results can be used to determine convergence and to estimate convergence rates. 1

5 0.49927065 94 nips-2005-Identifying Distributed Object Representations in Human Extrastriate Visual Cortex

Author: Rory Sayres, David Ress, Kalanit Grill-spector

Abstract: The category of visual stimuli has been reliably decoded from patterns of neural activity in extrastriate visual cortex [1]. It has yet to be seen whether object identity can be inferred from this activity. We present fMRI data measuring responses in human extrastriate cortex to a set of 12 distinct object images. We use a simple winner-take-all classifier, using half the data from each recording session as a training set, to evaluate encoding of object identity across fMRI voxels. Since this approach is sensitive to the inclusion of noisy voxels, we describe two methods for identifying subsets of voxels in the data which optimally distinguish object identity. One method characterizes the reliability of each voxel within subsets of the data, while another estimates the mutual information of each voxel with the stimulus set. We find that both metrics can identify subsets of the data which reliably encode object identity, even when noisy measurements are artificially added to the data. The mutual information metric is less efficient at this task, likely due to constraints in fMRI data. 1

6 0.49611729 74 nips-2005-Faster Rates in Regression via Active Learning

7 0.49545673 50 nips-2005-Convex Neural Networks

8 0.49431914 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error

9 0.49421191 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification

10 0.49206004 177 nips-2005-Size Regularized Cut for Data Clustering

11 0.48950395 201 nips-2005-Variational Bayesian Stochastic Complexity of Mixture Models

12 0.48942274 54 nips-2005-Data-Driven Online to Batch Conversions

13 0.4892616 49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations

14 0.4890866 30 nips-2005-Assessing Approximations for Gaussian Process Classification

15 0.48878855 112 nips-2005-Learning Minimum Volume Sets

16 0.48872286 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction

17 0.48864985 41 nips-2005-Coarse sample complexity bounds for active learning

18 0.48860225 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations

19 0.48635018 66 nips-2005-Estimation of Intrinsic Dimensionality Using High-Rate Vector Quantization

20 0.48609519 47 nips-2005-Consistency of one-class SVM and related algorithms