nips nips2005 nips2005-205 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sham M. Kakade, Matthias W. Seeger, Dean P. Foster
Abstract: We present a competitive analysis of some non-parametric Bayesian algorithms in a worst-case online learning setting, where no probabilistic assumptions about the generation of the data are made. We consider models which use a Gaussian process prior (over the space of all functions) and provide bounds on the regret (under the log loss) for commonly used non-parametric Bayesian algorithms — including Gaussian regression and logistic regression — which show how these algorithms can perform favorably under rather general conditions. These bounds explicitly handle the infinite dimensionality of these non-parametric classes in a natural way. We also make formal connections to the minimax and minimum description length (MDL) framework. Here, we show precisely how Bayesian Gaussian regression is a minimax strategy. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Foster University of Pennsylvania Abstract We present a competitive analysis of some non-parametric Bayesian algorithms in a worst-case online learning setting, where no probabilistic assumptions about the generation of the data are made. [sent-4, score-0.032]
2 These bounds explicitly handle the infinite dimensionality of these non-parametric classes in a natural way. [sent-6, score-0.023]
3 We also make formal connections to the minimax and minimum description length (MDL) framework. [sent-7, score-0.344]
4 Here, we show precisely how Bayesian Gaussian regression is a minimax strategy. [sent-8, score-0.365]
5 1 Introduction We study an online (sequential) prediction setting in which, at each timestep, the learner is given some input from the set X , and the learner must predict the output variable from the set Y. [sent-9, score-0.085]
6 , T } is chosen by Nature (or by an adversary), and importantly, we do not make any statistical assumptions about its source: our statements hold for all sequences. [sent-13, score-0.068]
7 Our goal is to sequentially code the next label yt , given that we have observed x≤t and y 0 is a constant such that for all yt ∈ y ≤T , − d2 log P (yt |u) ≤ c du2 for all u ∈ R. [sent-14, score-0.313]
8 The proof of this theorem parallels that provided by Kakade and Ng [2004], with a number of added complexities for handling GP priors. [sent-15, score-0.078]
9 For the special case of Gaussian regression where c = σ −2 , the following theorem shows the stronger result that the bound is satisfied with an equality for all sequences. [sent-16, score-0.134]
10 Then, − log Pbayes (y ≤T |x≤T ) = min − log P (y ≤T |x≤T , f (·)) + f ∈H 1 + log I + σ −2 K 2 and the minimum is attained for a kernel expansion over x≤T . [sent-20, score-0.368]
11 1 f 2 2 K (2) This equality has important implications in our minimax theory (in Corollary 4. [sent-21, score-0.348]
12 It is not hard to see that the equality does not hold for other likelihoods. [sent-23, score-0.054]
13 1 Interpretation The regret bound depends on two terms, f 2 and log |I + cK |. [sent-25, score-0.9]
14 K The dependence on f 2 states the intuitive fact that a meaningful bound can only be K obtained under smoothness assumptions on the set of experts. [sent-27, score-0.06]
15 The more complicated f is (as measured by · K ), the higher the regret may be. [sent-28, score-0.796]
16 Let us now interpret the log |I + cK | term, which we refer to as the regret term. [sent-32, score-0.9]
17 The constant c, which bounds the curvature of the likelihood, exists for most commonly used exponential family likelihoods. [sent-33, score-0.103]
18 For logistic regression, we have c = 1/4, and for the Gaussian regression, we have c = σ −2 . [sent-34, score-0.02]
19 Also, interestingly, while f is an arbitrary function in H, this regret term depends on K only at the sequence points x≤T . [sent-35, score-0.857]
20 For most infinite-dimensional kernels and without strong restrictions on the inputs, the regret term can be as large as Ω(T ) — the sequence can be chosen s. [sent-36, score-0.857]
21 K ≈ c I, which implies that log |I + cK | ≈ T log(1 + cc ). [sent-38, score-0.121]
22 For example, for an isotropic kernel (which is a function of the norm x − x 2 ) we can choose the xt to be mutually far from each other. [sent-39, score-0.072]
23 the Ornstein-Uhlenbeck kernel exp(−b x − x 1 ) — the regret term can easily Ω(T ). [sent-42, score-0.857]
24 The cases we are interested in are those where the regret term is o(T ), in which case the average regret tends to 0 with time. [sent-43, score-1.634]
25 A spectral interpretation of this term helps us understand the behavior. [sent-44, score-0.061]
26 λT be the eigenvalues of K , then T log |I + cK | = log(1 + cλt ) ≤ c tr K t=1 where tr K is the trace of K . [sent-48, score-0.174]
27 Clearly, if the sum of the eigenvalues has a sublinear growth rate of o(T ), then the average regret tends to 0. [sent-51, score-0.864]
28 then the above eigenvalues are essentially the process eigenvalues. [sent-55, score-0.032]
29 In a forthcoming longer version, we explore this spectral interpretation in more detail and provide a case using the exponential kernel in which the regret grows as O(poly(log T )). [sent-56, score-0.883]
30 Namely, the familiar linear model — with u(x) = θ · x, θ, x ∈ Rd and Gaussian prior θ ∼ N (0, I) — can be seen as a GP model with the linear kernel: K(x, x ) = x · x . [sent-60, score-0.019]
31 xT )T we have that a kernel expansion f (x) = i αi xi · x = θ · x with θ = X T α, and f 2 = αT X X T α = θ 2 , so that H = {θ · x | θ ∈ Rd }, and so 2 K log |I + cK | = log I + cX T X Therefore, our result gives an input-dependent version of the result of Kakade and Ng [2004]. [sent-64, score-0.264]
32 If we make the further assumption that x 2 ≤ 1 (as done in Kakade and Ng [2004]), then we can obtain exactly their regret term: cT log |I + cK | ≤ d log 1 + d which can seen by rotating K into an diagonal matrix and maximizing the expression subject to the constraint that x 2 ≤ 1 (i. [sent-65, score-1.055]
33 In general, this example shows that if K is a finite-dimension kernel such as the linear or the polynomial kernel, then the regret term is only O(log T ). [sent-68, score-0.857]
34 4 Relationships to Minimax Procedures and MDL This section builds the framework for understanding the minimax property of Gaussian regression. [sent-69, score-0.31]
35 We start by reviewing Shtarkov’s theorem, which shows that a certain normalized maximum likelihood density is the minimax strategy (when using the log loss). [sent-70, score-0.637]
36 In many cases, this minimax strategy does not exist — in those cases where the minimax regret is infinite. [sent-71, score-1.48]
37 We then propose a different, penalized notion of regret, and show that a certain normalized maximum a posteriori density is the minimax strategy here. [sent-72, score-0.659]
38 4) shows that for Gaussian regression the Bayesian strategy is precisely this minimax strategy 4. [sent-74, score-0.493]
39 1 Normalized Maximum Likelihood Here, let us assume that there are no inputs — sequences consist of only yt ∈ Y. [sent-75, score-0.131]
40 Given a measurable space with base measure µ, we employ a countable number of random variables yt in Y. [sent-76, score-0.141]
41 Fix the sequence length T and define the model class F = {Q(·|θ) | θ ∈ Θ)}, where Q(·|θ) denotes a joint probability density over Y T with respect to µ. [sent-77, score-0.106]
42 We assume that for our model class there exists a parameter, θml (y ≤T ), maximizing the likelihood Q(y ≤T |θ) over Θ for all y ≤T ∈ Y T . [sent-78, score-0.115]
43 We make this assumption to make the connections to maximum likelihood (and, later, MAP) estimation clear. [sent-79, score-0.109]
44 Define the regret of a joint density P on y ≤T as: R(y ≤T , P, Θ) = − log P (y ≤T ) − inf {− log Q(y ≤T |θ)} (3) θ∈Θ = − log P (y ≤T ) + log Q(y ≤T |θml (y ≤T )) . [sent-80, score-1.332]
45 Define the minimax regret with respect to Θ as: R(Θ) = inf sup P y T ≤T ∈Y R(y ≤T , P, Θ) where the inf is over all probability densities on Y T . [sent-82, score-1.288]
46 The following theorem due to Shtarkov [1987] characterizes the minimax strategy. [sent-83, score-0.351]
47 if it has a finite normalization constant), then define it to be the normalized maximum likelihood (NML) density. [sent-87, score-0.086]
48 Pml (y ≤T ) = Q(y ≤T |θml (y ≤T )) Q(y ≤T |θml (y ≤T ))dµ(y ≤T ) (5) If Pml exists, it is a minimax strategy, i. [sent-88, score-0.31]
49 for all y ≤T , the regret R(y ≤T , Pml , Θ) does not exceed R(Θ). [sent-90, score-0.826]
50 Note that this density exists only if the normalizing constant is finite, which is not the case in general. [sent-91, score-0.117]
51 The proof is straightforward using the fact that the NML density is an equalizer — meaning that it has constant regret on all sequences. [sent-92, score-0.96]
52 Proof: First note that log Q(y ≤T |θml (y ≤T ))dµ(y ≤T ). [sent-93, score-0.104]
53 the regret R(y ≤T , Pml , Θ) is the constant To see this, simply substitute Eq. [sent-95, score-0.83]
54 4 For convenience, define the regret of any P as R(P, Θ) = supy≤T ∈Y T R(y ≤T , P, Θ). [sent-97, score-0.796]
55 For any P = Pml (differing on a set with positive measure), there exists some y ≤T such that P (y ≤T ) < Pml (y ≤T ), since the densities are normalized. [sent-98, score-0.072]
56 This implies that R(P, Θ) ≥ R(y ≤T , P, Θ) > R(y ≤T , Pml , Θ) = R(Pml , Θ) where the first step follows from the definition of R(P, Θ), the second from − log P (y ≤T ) > − log Pml (y ≤T ), and the last from the fact that Pml is an equalizer (its regret is constant on all sequences). [sent-99, score-1.092]
57 Hence, P has a strictly larger regret, implying that Pml is the unique minimax strategy. [sent-100, score-0.31]
58 Unfortunately, in many important model classes, the minimax regret R(Θ) is not finite, and the NML density does not exist. [sent-101, score-1.159]
59 2: Consider a model which assumes the sequence is generated i. [sent-104, score-0.035]
60 It is easy to see that for this class the t=1 minimax regret is infinite and Pml does not exist (see Grunwald [2005]). [sent-109, score-1.124]
61 This example can be generalized to the Gaussian regression model (if we know the sequence x≤T in advance). [sent-110, score-0.09]
62 For this problem, if one modifies the space of allowable sequences (i. [sent-111, score-0.045]
63 2 Normalized Maximum a Posteriori To remedy this problem, consider placing some structure on the model class F = {Q(·|θ)|θ ∈ Θ}. [sent-117, score-0.035]
64 The idea is to penalize Q(·|θ) ∈ F based on this structure. [sent-118, score-0.015]
65 Assume that Θ is a measurable space and place a prior distribution with density function q on Θ. [sent-120, score-0.098]
66 Define the penalized regret of P on y ≤T as: Rq (y ≤T , P, Θ) = − log P (y ≤T ) − inf {− log Q(y ≤T |θ) − log q(θ)} . [sent-121, score-1.322]
67 θ∈Θ Note that − log Q(y ≤T |θ) − log q(θ) can be viewed as a “two part” code, in which we first code θ under the prior q and then code y ≤T under the likelihood Q(·|θ). [sent-122, score-0.323]
68 Unlike the standard regret, the penalized regret can be viewed as a comparison to an actual code. [sent-123, score-0.943]
69 However, in MDL, they consider using minimax schemes (via Pml ) for the likelihood part of the code, while we consider minimax schemes for this penalized regret. [sent-125, score-0.841]
70 Again, for clarity, assume there exists a parameter, θmap (y ≤T ) maximizing log Q(y ≤T |θ)+ log q(θ). [sent-126, score-0.271]
71 Notice that this is just the maximum aposteriori (MAP) parameter, if one were to use a Bayesian strategy with the prior q (since the posterior density would be proportional to Q(y ≤T |θ)q(θ)). [sent-127, score-0.179]
72 Here, Rq (y ≤T , P, Θ) = − log P (y ≤T ) + log Q(y ≤T |θmap (y ≤T )) + log q(θmap (y ≤T )) Similarly, with respect to Θ, define the minimax penalized regret as: Rq (Θ) = inf sup P y T ≤T ∈Y Rq (y ≤T P, Θ) where again the inf is over all densities on Y T . [sent-128, score-1.747]
73 If Θ is finite or countable and Q(·|θ) > 0 for all θ, then the Bayes procedure has the desirable property of having penalized regret which is non-positive. [sent-129, score-0.979]
74 2 However, in general, the Bayes procedure does not achieve the minimax penalized regret, Rq (Θ), which is what we desire — though, for one case, we show that it does (in the next section). [sent-130, score-0.472]
75 3: Define the normalized maximum a posteriori (NMAP) density, if it exists, as: Pmap (y ≤T ) = Q(y ≤T |θmap (y ≤T ))q(θmap (y ≤T )) . [sent-133, score-0.085]
76 Q(y ≤T |θmap (y ≤T ))q(θmap (y ≤T )) dµ(y ≤T ) (6) If Pmap exists, it is a minimax strategy for the penalized regret, i. [sent-134, score-0.521]
77 for all y ≤T , the penalized regret Rq (y ≤T , Pmap , Θ) does not exceed Rq (Θ). [sent-136, score-0.973]
78 The proof relies on Pmap being an equalizer for the penalized regret and is identical to that of Theorem 4. [sent-137, score-1.034]
79 1 — just replace all quantities with their penalized equivalents. [sent-138, score-0.147]
80 3 Bayesian Gaussian Regression as a Minimax Procedure We now return to the setting with inputs and show how the Bayesian strategy for the Gaussian regression model is a minimax strategy for all input sequences x≤T . [sent-140, score-0.545]
81 If we fix the input sequence x≤T , we can consider the competitor class to be F = {P (y ≤T |x≤T , θ) | θ ∈ Θ)}. [sent-141, score-0.069]
82 In other words, we make the more stringent comparison against a model class which has full knowledge of the input sequence in advance. [sent-142, score-0.086]
wordName wordTfidf (topN-words)
[('regret', 0.796), ('minimax', 0.31), ('pml', 0.296), ('penalized', 0.147), ('rq', 0.139), ('log', 0.104), ('pmap', 0.091), ('yt', 0.079), ('mdl', 0.079), ('ck', 0.075), ('kakade', 0.069), ('equalizer', 0.068), ('nml', 0.068), ('shtarkov', 0.068), ('inf', 0.067), ('ml', 0.065), ('strategy', 0.064), ('grunwald', 0.059), ('regression', 0.055), ('density', 0.053), ('map', 0.049), ('exists', 0.044), ('theorem', 0.041), ('equality', 0.038), ('xt', 0.037), ('countable', 0.036), ('foster', 0.036), ('learner', 0.035), ('gaussian', 0.035), ('kernel', 0.035), ('sequence', 0.035), ('likelihood', 0.034), ('posteriori', 0.033), ('bayesian', 0.033), ('eigenvalues', 0.032), ('ng', 0.032), ('code', 0.031), ('exceed', 0.03), ('normalized', 0.029), ('densities', 0.028), ('pennsylvania', 0.028), ('sequences', 0.027), ('term', 0.026), ('measurable', 0.026), ('inputs', 0.025), ('dependence', 0.024), ('bounds', 0.023), ('nite', 0.023), ('proof', 0.023), ('maximum', 0.023), ('gp', 0.022), ('expansion', 0.021), ('sup', 0.02), ('interpretation', 0.02), ('rd', 0.02), ('logistic', 0.02), ('schemes', 0.02), ('reviewing', 0.02), ('adversary', 0.02), ('aposteriori', 0.02), ('supy', 0.02), ('sublinear', 0.02), ('constant', 0.02), ('importantly', 0.019), ('maximizing', 0.019), ('tr', 0.019), ('smoothness', 0.019), ('prior', 0.019), ('corollary', 0.018), ('class', 0.018), ('barely', 0.018), ('sham', 0.018), ('barron', 0.018), ('allowable', 0.018), ('make', 0.018), ('parametric', 0.017), ('assumptions', 0.017), ('remedy', 0.017), ('cc', 0.017), ('poly', 0.017), ('statements', 0.017), ('forthcoming', 0.017), ('commonly', 0.016), ('hold', 0.016), ('matthias', 0.016), ('competitor', 0.016), ('connections', 0.016), ('tends', 0.016), ('differing', 0.015), ('dean', 0.015), ('penalize', 0.015), ('stringent', 0.015), ('desire', 0.015), ('favorably', 0.015), ('spectral', 0.015), ('online', 0.015), ('bayes', 0.015), ('rotating', 0.014), ('complexities', 0.014), ('substitute', 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 205 nips-2005-Worst-Case Bounds for Gaussian Process Models
Author: Sham M. Kakade, Matthias W. Seeger, Dean P. Foster
Abstract: We present a competitive analysis of some non-parametric Bayesian algorithms in a worst-case online learning setting, where no probabilistic assumptions about the generation of the data are made. We consider models which use a Gaussian process prior (over the space of all functions) and provide bounds on the regret (under the log loss) for commonly used non-parametric Bayesian algorithms — including Gaussian regression and logistic regression — which show how these algorithms can perform favorably under rather general conditions. These bounds explicitly handle the infinite dimensionality of these non-parametric classes in a natural way. We also make formal connections to the minimax and minimum description length (MDL) framework. Here, we show precisely how Bayesian Gaussian regression is a minimax strategy. 1
2 0.060012549 74 nips-2005-Faster Rates in Regression via Active Learning
Author: Rebecca Willett, Robert Nowak, Rui M. Castro
Abstract: This paper presents a rigorous statistical analysis characterizing regimes in which active learning significantly outperforms classical passive learning. Active learning algorithms are able to make queries or select sample locations in an online fashion, depending on the results of the previous queries. In some regimes, this extra flexibility leads to significantly faster rates of error decay than those possible in classical passive learning settings. The nature of these regimes is explored by studying fundamental performance limits of active and passive learning in two illustrative nonparametric function classes. In addition to examining the theoretical potential of active learning, this paper describes a practical algorithm capable of exploiting the extra flexibility of the active setting and provably improving upon the classical passive techniques. Our active learning theory and methods show promise in a number of applications, including field estimation using wireless sensor networks and fault line detection. 1
3 0.053634401 47 nips-2005-Consistency of one-class SVM and related algorithms
Author: Régis Vert, Jean-philippe Vert
Abstract: We determine the asymptotic limit of the function computed by support vector machines (SVM) and related algorithms that minimize a regularized empirical convex loss function in the reproducing kernel Hilbert space of the Gaussian RBF kernel, in the situation where the number of examples tends to infinity, the bandwidth of the Gaussian kernel tends to 0, and the regularization parameter is held fixed. Non-asymptotic convergence bounds to this limit in the L2 sense are provided, together with upper bounds on the classification error that is shown to converge to the Bayes risk, therefore proving the Bayes-consistency of a variety of methods although the regularization term does not vanish. These results are particularly relevant to the one-class SVM, for which the regularization can not vanish by construction, and which is shown for the first time to be a consistent density level set estimator. 1
4 0.048884209 202 nips-2005-Variational EM Algorithms for Non-Gaussian Latent Variable Models
Author: Jason Palmer, Kenneth Kreutz-Delgado, Bhaskar D. Rao, David P. Wipf
Abstract: We consider criteria for variational representations of non-Gaussian latent variables, and derive variational EM algorithms in general form. We establish a general equivalence among convex bounding methods, evidence based methods, and ensemble learning/Variational Bayes methods, which has previously been demonstrated only for particular cases.
5 0.047214366 50 nips-2005-Convex Neural Networks
Author: Yoshua Bengio, Nicolas L. Roux, Pascal Vincent, Olivier Delalleau, Patrice Marcotte
Abstract: Convexity has recently received a lot of attention in the machine learning community, and the lack of convexity has been seen as a major disadvantage of many learning algorithms, such as multi-layer artificial neural networks. We show that training multi-layer neural networks in which the number of hidden units is learned can be viewed as a convex optimization problem. This problem involves an infinite number of variables, but can be solved by incrementally inserting a hidden unit at a time, each time finding a linear classifier that minimizes a weighted sum of errors. 1
6 0.046675049 45 nips-2005-Conditional Visual Tracking in Kernel Space
7 0.041446175 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
8 0.041202344 58 nips-2005-Divergences, surrogate loss functions and experimental design
9 0.041089356 2 nips-2005-A Bayes Rule for Density Matrices
10 0.040415391 136 nips-2005-Noise and the two-thirds power Law
11 0.03804886 30 nips-2005-Assessing Approximations for Gaussian Process Classification
12 0.037806798 80 nips-2005-Gaussian Process Dynamical Models
13 0.036998816 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis
14 0.036907014 140 nips-2005-Nonparametric inference of prior probabilities from Bayes-optimal behavior
15 0.036342949 168 nips-2005-Rodeo: Sparse Nonparametric Regression in High Dimensions
16 0.03580001 159 nips-2005-Q-Clustering
17 0.035368901 201 nips-2005-Variational Bayesian Stochastic Complexity of Mixture Models
18 0.033606179 166 nips-2005-Robust Fisher Discriminant Analysis
19 0.0326511 82 nips-2005-Generalization Error Bounds for Aggregation by Mirror Descent with Averaging
20 0.032425109 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation
topicId topicWeight
[(0, 0.108), (1, 0.044), (2, -0.017), (3, -0.032), (4, 0.084), (5, -0.044), (6, -0.002), (7, -0.039), (8, -0.045), (9, -0.018), (10, 0.01), (11, 0.042), (12, 0.02), (13, 0.035), (14, -0.003), (15, -0.021), (16, 0.021), (17, 0.017), (18, 0.013), (19, -0.018), (20, 0.015), (21, -0.042), (22, -0.008), (23, -0.071), (24, -0.002), (25, 0.067), (26, 0.025), (27, -0.012), (28, -0.001), (29, -0.028), (30, 0.011), (31, -0.041), (32, 0.021), (33, -0.009), (34, 0.013), (35, 0.002), (36, 0.019), (37, -0.017), (38, 0.028), (39, 0.038), (40, 0.073), (41, -0.011), (42, 0.013), (43, 0.019), (44, 0.064), (45, 0.074), (46, 0.03), (47, -0.025), (48, -0.087), (49, -0.119)]
simIndex simValue paperId paperTitle
same-paper 1 0.89753288 205 nips-2005-Worst-Case Bounds for Gaussian Process Models
Author: Sham M. Kakade, Matthias W. Seeger, Dean P. Foster
Abstract: We present a competitive analysis of some non-parametric Bayesian algorithms in a worst-case online learning setting, where no probabilistic assumptions about the generation of the data are made. We consider models which use a Gaussian process prior (over the space of all functions) and provide bounds on the regret (under the log loss) for commonly used non-parametric Bayesian algorithms — including Gaussian regression and logistic regression — which show how these algorithms can perform favorably under rather general conditions. These bounds explicitly handle the infinite dimensionality of these non-parametric classes in a natural way. We also make formal connections to the minimax and minimum description length (MDL) framework. Here, we show precisely how Bayesian Gaussian regression is a minimax strategy. 1
2 0.5911091 197 nips-2005-Unbiased Estimator of Shape Parameter for Spiking Irregularities under Changing Environments
Author: Keiji Miura, Masato Okada, Shun-ichi Amari
Abstract: We considered a gamma distribution of interspike intervals as a statistical model for neuronal spike generation. The model parameters consist of a time-dependent firing rate and a shape parameter that characterizes spiking irregularities of individual neurons. Because the environment changes with time, observed data are generated from the time-dependent firing rate, which is an unknown function. A statistical model with an unknown function is called a semiparametric model, which is one of the unsolved problem in statistics and is generally very difficult to solve. We used a novel method of estimating functions in information geometry to estimate the shape parameter without estimating the unknown function. We analytically obtained an optimal estimating function for the shape parameter independent of the functional form of the firing rate. This estimation is efficient without Fisher information loss and better than maximum likelihood estimation. 1
3 0.47586995 47 nips-2005-Consistency of one-class SVM and related algorithms
Author: Régis Vert, Jean-philippe Vert
Abstract: We determine the asymptotic limit of the function computed by support vector machines (SVM) and related algorithms that minimize a regularized empirical convex loss function in the reproducing kernel Hilbert space of the Gaussian RBF kernel, in the situation where the number of examples tends to infinity, the bandwidth of the Gaussian kernel tends to 0, and the regularization parameter is held fixed. Non-asymptotic convergence bounds to this limit in the L2 sense are provided, together with upper bounds on the classification error that is shown to converge to the Bayes risk, therefore proving the Bayes-consistency of a variety of methods although the regularization term does not vanish. These results are particularly relevant to the one-class SVM, for which the regularization can not vanish by construction, and which is shown for the first time to be a consistent density level set estimator. 1
4 0.43910074 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction
Author: Gilles Blanchard, Masashi Sugiyama, Motoaki Kawanabe, Vladimir Spokoiny, Klaus-Robert Müller
Abstract: We propose a new linear method for dimension reduction to identify nonGaussian components in high dimensional data. Our method, NGCA (non-Gaussian component analysis), uses a very general semi-parametric framework. In contrast to existing projection methods we define what is uninteresting (Gaussian): by projecting out uninterestingness, we can estimate the relevant non-Gaussian subspace. We show that the estimation error of finding the non-Gaussian components tends to zero at a parametric rate. Once NGCA components are identified and extracted, various tasks can be applied in the data analysis process, like data visualization, clustering, denoising or classification. A numerical study demonstrates the usefulness of our method. 1
5 0.43818673 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification
Author: Yves Grandvalet, Johnny Mariethoz, Samy Bengio
Abstract: In this paper, we show that the hinge loss can be interpreted as the neg-log-likelihood of a semi-parametric model of posterior probabilities. From this point of view, SVMs represent the parametric component of a semi-parametric model fitted by a maximum a posteriori estimation procedure. This connection enables to derive a mapping from SVM scores to estimated posterior probabilities. Unlike previous proposals, the suggested mapping is interval-valued, providing a set of posterior probabilities compatible with each SVM score. This framework offers a new way to adapt the SVM optimization problem to unbalanced classification, when decisions result in unequal (asymmetric) losses. Experiments show improvements over state-of-the-art procedures. 1
6 0.43474096 44 nips-2005-Computing the Solution Path for the Regularized Support Vector Regression
7 0.43006533 140 nips-2005-Nonparametric inference of prior probabilities from Bayes-optimal behavior
8 0.42312643 82 nips-2005-Generalization Error Bounds for Aggregation by Mirror Descent with Averaging
9 0.41824824 45 nips-2005-Conditional Visual Tracking in Kernel Space
10 0.41414914 30 nips-2005-Assessing Approximations for Gaussian Process Classification
11 0.40306193 112 nips-2005-Learning Minimum Volume Sets
12 0.40119755 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis
13 0.3999323 58 nips-2005-Divergences, surrogate loss functions and experimental design
14 0.39623487 21 nips-2005-An Alternative Infinite Mixture Of Gaussian Process Experts
15 0.39471105 202 nips-2005-Variational EM Algorithms for Non-Gaussian Latent Variable Models
16 0.39395919 155 nips-2005-Predicting EMG Data from M1 Neurons with Variational Bayesian Least Squares
17 0.39220861 93 nips-2005-Ideal Observers for Detecting Motion: Correspondence Noise
18 0.39196354 49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations
19 0.38969377 168 nips-2005-Rodeo: Sparse Nonparametric Regression in High Dimensions
20 0.38475227 50 nips-2005-Convex Neural Networks
topicId topicWeight
[(3, 0.069), (10, 0.057), (27, 0.016), (31, 0.03), (34, 0.092), (39, 0.01), (41, 0.016), (55, 0.027), (65, 0.012), (69, 0.033), (73, 0.026), (77, 0.011), (87, 0.303), (88, 0.094), (91, 0.075)]
simIndex simValue paperId paperTitle
1 0.81190556 34 nips-2005-Bayesian Surprise Attracts Human Attention
Author: Laurent Itti, Pierre F. Baldi
Abstract: The concept of surprise is central to sensory processing, adaptation, learning, and attention. Yet, no widely-accepted mathematical theory currently exists to quantitatively characterize surprise elicited by a stimulus or event, for observers that range from single neurons to complex natural or engineered systems. We describe a formal Bayesian definition of surprise that is the only consistent formulation under minimal axiomatic assumptions. Surprise quantifies how data affects a natural or artificial observer, by measuring the difference between posterior and prior beliefs of the observer. Using this framework we measure the extent to which humans direct their gaze towards surprising items while watching television and video games. We find that subjects are strongly attracted towards surprising locations, with 72% of all human gaze shifts directed towards locations more surprising than the average, a figure which rises to 84% when considering only gaze targets simultaneously selected by all subjects. The resulting theory of surprise is applicable across different spatio-temporal scales, modalities, and levels of abstraction. Life is full of surprises, ranging from a great christmas gift or a new magic trick, to wardrobe malfunctions, reckless drivers, terrorist attacks, and tsunami waves. Key to survival is our ability to rapidly attend to, identify, and learn from surprising events, to decide on present and future courses of action [1]. Yet, little theoretical and computational understanding exists of the very essence of surprise, as evidenced by the absence from our everyday vocabulary of a quantitative unit of surprise: Qualities such as the “wow factor” have remained vague and elusive to mathematical analysis. Informal correlates of surprise exist at nearly all stages of neural processing. In sensory neuroscience, it has been suggested that only the unexpected at one stage is transmitted to the next stage [2]. Hence, sensory cortex may have evolved to adapt to, to predict, and to quiet down the expected statistical regularities of the world [3, 4, 5, 6], focusing instead on events that are unpredictable or surprising. Electrophysiological evidence for this early sensory emphasis onto surprising stimuli exists from studies of adaptation in visual [7, 8, 4, 9], olfactory [10, 11], and auditory cortices [12], subcortical structures like the LGN [13], and even retinal ganglion cells [14, 15] and cochlear hair cells [16]: neural response greatly attenuates with repeated or prolonged exposure to an initially novel stimulus. Surprise and novelty are also central to learning and memory formation [1], to the point that surprise is believed to be a necessary trigger for associative learning [17, 18], as supported by mounting evidence for a role of the hippocampus as a novelty detector [19, 20, 21]. Finally, seeking novelty is a well-identified human character trait, with possible association with the dopamine D4 receptor gene [22, 23, 24]. In the Bayesian framework, we develop the only consistent theory of surprise, in terms of the difference between the posterior and prior distributions of beliefs of an observer over the available class of models or hypotheses about the world. We show that this definition derived from first principles presents key advantages over more ad-hoc formulations, typically relying on detecting outlier stimuli. Armed with this new framework, we provide direct experimental evidence that surprise best characterizes what attracts human gaze in large amounts of natural video stimuli. We here extend a recent pilot study [25], adding more comprehensive theory, large-scale human data collection, and additional analysis. 1 Theory Bayesian Definition of Surprise. We propose that surprise is a general concept, which can be derived from first principles and formalized across spatio-temporal scales, sensory modalities, and, more generally, data types and data sources. Two elements are essential for a principled definition of surprise. First, surprise can exist only in the presence of uncertainty, which can arise from intrinsic stochasticity, missing information, or limited computing resources. A world that is purely deterministic and predictable in real-time for a given observer contains no surprises. Second, surprise can only be defined in a relative, subjective, manner and is related to the expectations of the observer, be it a single synapse, neuronal circuit, organism, or computer device. The same data may carry different amount of surprise for different observers, or even for the same observer taken at different times. In probability and decision theory it can be shown that the only consistent and optimal way for modeling and reasoning about uncertainty is provided by the Bayesian theory of probability [26, 27, 28]. Furthermore, in the Bayesian framework, probabilities correspond to subjective degrees of beliefs in hypotheses or models which are updated, as data is acquired, using Bayes’ theorem as the fundamental tool for transforming prior belief distributions into posterior belief distributions. Therefore, within the same optimal framework, the only consistent definition of surprise must involve: (1) probabilistic concepts to cope with uncertainty; and (2) prior and posterior distributions to capture subjective expectations. Consistently with this Bayesian approach, the background information of an observer is captured by his/her/its prior probability distribution {P (M )}M ∈M over the hypotheses or models M in a model space M. Given this prior distribution of beliefs, the fundamental effect of a new data observation D on the observer is to change the prior distribution {P (M )}M ∈M into the posterior distribution {P (M |D)}M ∈M via Bayes theorem, whereby P (D|M ) ∀M ∈ M, P (M |D) = P (M ). (1) P (D) In this framework, the new data observation D carries no surprise if it leaves the observer beliefs unaffected, that is, if the posterior is identical to the prior; conversely, D is surprising if the posterior distribution resulting from observing D significantly differs from the prior distribution. Therefore we formally measure surprise elicited by data as some distance measure between the posterior and prior distributions. This is best done using the relative entropy or Kullback-Leibler (KL) divergence [29]. Thus, surprise is defined by the average of the log-odd ratio: P (M |D) S(D, M) = KL(P (M |D), P (M )) = P (M |D) log dM (2) P (M ) M taken with respect to the posterior distribution over the model class M. Note that KL is not symmetric but has well-known theoretical advantages, including invariance with respect to Figure 1: Computing surprise in early sensory neurons. (a) Prior data observations, tuning preferences, and top-down influences contribute to shaping a set of “prior beliefs” a neuron may have over a class of internal models or hypotheses about the world. For instance, M may be a set of Poisson processes parameterized by the rate λ, with {P (M )}M ∈M = {P (λ)}λ∈I +∗ the prior distribution R of beliefs about which Poisson models well describe the world as sensed by the neuron. New data D updates the prior into the posterior using Bayes’ theorem. Surprise quantifies the difference between the posterior and prior distributions over the model class M. The remaining panels detail how surprise differs from conventional model fitting and outlier-based novelty. (b) In standard iterative Bayesian model fitting, at every iteration N , incoming data DN is used to update the prior {P (M |D1 , D2 , ..., DN −1 )}M ∈M into the posterior {P (M |D1 , D2 , ..., DN )}M ∈M . Freezing this learning at a given iteration, one then picks the currently best model, usually using either a maximum likelihood criterion, or a maximum a posteriori one (yielding MM AP shown). (c) This best model is used for a number of tasks at the current iteration, including outlier-based novelty detection. New data is then considered novel at that instant if it has low likelihood for the best model b a (e.g., DN is more novel than DN ). This focus onto the single best model presents obvious limitations, especially in situations where other models are nearly as good (e.g., M∗ in panel (b) is entirely ignored during standard novelty computation). One palliative solution is to consider mixture models, or simply P (D), but this just amounts to shifting the problem into a different model class. (d) Surprise directly addresses this problem by simultaneously considering all models and by measuring how data changes the observer’s distribution of beliefs from {P (M |D1 , D2 , ..., DN −1 )}M ∈M to {P (M |D1 , D2 , ..., DN )}M ∈M over the entire model class M (orange shaded area). reparameterizations. A unit of surprise — a “wow” — may then be defined for a single model M as the amount of surprise corresponding to a two-fold variation between P (M |D) and P (M ), i.e., as log P (M |D)/P (M ) (with log taken in base 2), with the total number of wows experienced for all models obtained through the integration in eq. 2. Surprise and outlier detection. Outlier detection based on the likelihood P (D|M best ) of D given a single best model Mbest is at best an approximation to surprise and, in some cases, is misleading. Consider, for instance, a case where D has very small probability both for a model or hypothesis M and for a single alternative hypothesis M. Although D is a strong outlier, it carries very little information regarding whether M or M is the better model, and therefore very little surprise. Thus an outlier detection method would strongly focus attentional resources onto D, although D is a false positive, in the sense that it carries no useful information for discriminating between the two alternative hypotheses M and M. Figure 1 further illustrates this disconnect between outlier detection and surprise. 2 Human experiments To test the surprise hypothesis — that surprise attracts human attention and gaze in natural scenes — we recorded eye movements from eight na¨ve observers (three females and ı five males, ages 23-32, normal or corrected-to-normal vision). Each watched a subset from 50 videoclips totaling over 25 minutes of playtime (46,489 video frames, 640 × 480, 60.27 Hz, mean screen luminance 30 cd/m2 , room 4 cd/m2 , viewing distance 80cm, field of view 28◦ × 21◦ ). Clips comprised outdoors daytime and nighttime scenes of crowded environments, video games, and television broadcast including news, sports, and commercials. Right-eye position was tracked with a 240 Hz video-based device (ISCAN RK-464), with methods as previously [30]. Two hundred calibrated eye movement traces (10,192 saccades) were analyzed, corresponding to four distinct observers for each of the 50 clips. Figure 2 shows sample scanpaths for one videoclip. To characterize image regions selected by participants, we process videoclips through computational metrics that output a topographic dynamic master response map, assigning in real-time a response value to every input location. A good master map would highlight, more than expected by chance, locations gazed to by observers. To score each metric we hence sample, at onset of every human saccade, master map activity around the saccade’s future endpoint, and around a uniformly random endpoint (random sampling was repeated 100 times to evaluate variability). We quantify differences between histograms of master Figure 2: (a) Sample eye movement traces from four observers (squares denote saccade endpoints). (b) Our data exhibits high inter-individual overlap, shown here with the locations where one human saccade endpoint was nearby (≈ 5◦ ) one (white squares), two (cyan squares), or all three (black squares) other humans. (c) A metric where the master map was created from the three eye movement traces other than that being tested yields an upper-bound KL score, computed by comparing the histograms of metric values at human (narrow blue bars) and random (wider green bars) saccade targets. Indeed, this metric’s map was very sparse (many random saccades landing on locations with nearzero response), yet humans preferentially saccaded towards the three active hotspots corresponding to the eye positions of three other humans (many human saccades landing on locations with near-unity responses). map samples collected from human and random saccades using again the Kullback-Leibler (KL) distance: metrics which better predict human scanpaths exhibit higher distances from random as, typically, observers non-uniformly gaze towards a minority of regions with highest metric responses while avoiding a majority of regions with low metric responses. This approach presents several advantages over simpler scoring schemes [31, 32], including agnosticity to putative mechanisms for generating saccades and the fact that applying any continuous nonlinearity to master map values would not affect scoring. Experimental results. We test six computational metrics, encompassing and extending the state-of-the-art found in previous studies. The first three quantify static image properties (local intensity variance in 16 × 16 image patches [31]; local oriented edge density as measured with Gabor filters [33]; and local Shannon entropy in 16 × 16 image patches [34]). The remaining three metrics are more sensitive to dynamic events (local motion [33]; outlier-based saliency [33]; and surprise [25]). For all metrics, we find that humans are significantly attracted by image regions with higher metric responses. However, the static metrics typically respond vigorously at numerous visual locations (Figure 3), hence they are poorly specific and yield relatively low KL scores between humans and random. The metrics sensitive to motion, outliers, and surprising events, in comparison, yield sparser maps and higher KL scores. The surprise metric of interest here quantifies low-level surprise in image patches over space and time, and at this point does not account for high-level or cognitive beliefs of our human observers. Rather, it assumes a family of simple models for image patches, each processed through 72 early feature detectors sensitive to color, orientation, motion, etc., and computes surprise from shifts in the distribution of beliefs about which models better describe the patches (see [25] and [35] for details). We find that the surprise metric significantly outperforms all other computational metrics (p < 10−100 or better on t-tests for equality of KL scores), scoring nearly 20% better than the second-best metric (saliency) and 60% better than the best static metric (entropy). Surprising stimuli often substantially differ from simple feature outliers; for example, a continually blinking light on a static background elicits sustained flicker due to its locally outlier temporal dynamics but is only surprising for a moment. Similarly, a shower of randomly-colored pixels continually excites all low-level feature detectors but rapidly becomes unsurprising. Strongest attractors of human attention. Clearly, in our and previous eye-tracking experiments, in some situations potentially interesting targets were more numerous than in others. With many possible targets, different observers may orient towards different locations, making it more difficult for a single metric to accurately predict all observers. Hence we consider (Figure 4) subsets of human saccades where at least two, three, or all four observers simultaneously agreed on a gaze target. Observers could have agreed based on bottom-up factors (e.g., only one location had interesting visual appearance at that time), top-down factors (e.g., only one object was of current cognitive interest), or both (e.g., a single cognitively interesting object was present which also had distinctive appearance). Irrespectively of the cause for agreement, it indicates consolidated belief that a location was attractive. While the KL scores of all metrics improved when progressively focusing onto only those locations, dynamic metrics improved more steeply, indicating that stimuli which more reliably attracted all observers carried more motion, saliency, and surprise. Surprise remained significantly the best metric to characterize these agreed-upon attractors of human gaze (p < 10−100 or better on t-tests for equality of KL scores). Overall, surprise explained the greatest fraction of human saccades, indicating that humans are significantly attracted towards surprising locations in video displays. Over 72% of all human saccades were targeted to locations predicted to be more surprising than on average. When only considering saccades where two, three, or four observers agreed on a common gaze target, this figure rose to 76%, 80%, and 84%, respectively. Figure 3: (a) Sample video frames, with corresponding human saccades and predictions from the entropy, surprise, and human-derived metrics. Entropy maps, like intensity variance and orientation maps, exhibited many locations with high responses, hence had low specificity and were poorly discriminative. In contrast, motion, saliency, and surprise maps were much sparser and more specific, with surprise significantly more often on target. For three example frames (first column), saccades from one subject are shown (arrows) with corresponding apertures over which master map activity at the saccade endpoint was sampled (circles). (b) KL scores for these metrics indicate significantly different performance levels, and a strict ranking of variance < orientation < entropy < motion < saliency < surprise < human-derived. KL scores were computed by comparing the number of human saccades landing onto each given range of master map values (narrow blue bars) to the number of random saccades hitting the same range (wider green bars). A score of zero would indicate equality between the human and random histograms, i.e., humans did not tend to hit various master map values any differently from expected by chance, or, the master map could not predict human saccades better than random saccades. Among the six computational metrics tested in total, surprise performed best, in that surprising locations were relatively few yet reliably gazed to by humans. Figure 4: KL scores when considering only saccades where at least one (all 10,192 saccades), two (7,948 saccades), three (5,565 saccades), or all four (2,951 saccades) humans agreed on a common gaze location, for the static (a) and dynamic metrics (b). Static metrics improved substantially when progressively focusing onto saccades with stronger inter-observer agreement (average slope 0.56 ± 0.37 percent KL score units per 1,000 pruned saccades). Hence, when humans agreed on a location, they also tended to be more reliably predicted by the metrics. Furthermore, dynamic metrics improved 4.5 times more steeply (slope 2.44 ± 0.37), suggesting a stronger role of dynamic events in attracting human attention. Surprising events were significantly the strongest (t-tests for equality of KL scores between surprise and other metrics, p < 10−100 ). 3 Discussion While previous research has shown with either static scenes or dynamic synthetic stimuli that humans preferentially fixate regions of high entropy [34], contrast [31], saliency [32], flicker [36], or motion [37], our data provides direct experimental evidence that humans fixate surprising locations even more reliably. These conclusions were made possible by developing new tools to quantify what attracts human gaze over space and time in dynamic natural scenes. Surprise explained best where humans look when considering all saccades, and even more so when restricting the analysis to only those saccades for which human observers tended to agree. Surprise hence represents an inexpensive, easily computable approximation to human attentional allocation. In the absence of quantitative tools to measure surprise, most experimental and modeling work to date has adopted the approximation that novel events are surprising, and has focused on experimental scenarios which are simple enough to ensure an overlap between informal notions of novelty and surprise: for example, a stimulus is novel during testing if it has not been seen during training [9]. Our definition opens new avenues for more sophisticated experiments, where surprise elicited by different stimuli can be precisely compared and calibrated, yielding predictions at the single-unit as well as behavioral levels. The definition of surprise — as the distance between the posterior and prior distributions of beliefs over models — is entirely general and readily applicable to the analysis of auditory, olfactory, gustatory, or somatosensory data. While here we have focused on behavior rather than detailed biophysical implementation, it is worth noting that detecting surprise in neural spike trains does not require semantic understanding of the data carried by the spike trains, and thus could provide guiding signals during self-organization and development of sensory areas. At higher processing levels, top-down cues and task demands are known to combine with stimulus novelty in capturing attention and triggering learning [1, 38], ideas which may now be formalized and quantified in terms of priors, posteriors, and surprise. Surprise, indeed, inherently depends on uncertainty and on prior beliefs. Hence surprise theory can further be tested and utilized in experiments where the prior is biased, for ex- ample by top-down instructions or prior exposures to stimuli [38]. In addition, simple surprise-based behavioral measures such as the eye-tracking one used here may prove useful for early diagnostic of human conditions including autism and attention-deficit hyperactive disorder, as well as for quantitative comparison between humans and animals which may have lower or different priors, including monkeys, frogs, and flies. Beyond sensory biology, computable surprise could guide the development of data mining and compression systems (giving more bits to surprising regions of interest), to find surprising agents in crowds, surprising sentences in books or speeches, surprising sequences in genomes, surprising medical symptoms, surprising odors in airport luggage racks, surprising documents on the world-wide-web, or to design surprising advertisements. Acknowledgments: Supported by HFSP, NSF and NGA (L.I.), NIH and NSF (P.B.). We thank UCI’s Institute for Genomics and Bioinformatics and USC’s Center High Performance Computing and Communications (www.usc.edu/hpcc) for access to their computing clusters. References [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] [22] [23] [24] [25] [26] [27] [28] [29] [30] [31] [32] [33] [34] [35] [36] [37] [38] Ranganath, C. & Rainer, G. Nat Rev Neurosci 4, 193–202 (2003). Rao, R. P. & Ballard, D. H. Nat Neurosci 2, 79–87 (1999). Olshausen, B. A. & Field, D. J. Nature 381, 607–609 (1996). M¨ ller, J. R., Metha, A. B., Krauskopf, J. & Lennie, P. Science 285, 1405–1408 (1999). u Dragoi, V., Sharma, J., Miller, E. K. & Sur, M. Nat Neurosci 5, 883–891 (2002). David, S. V., Vinje, W. E. & Gallant, J. L. J Neurosci 24, 6991–7006 (2004). Maffei, L., Fiorentini, A. & Bisti, S. Science 182, 1036–1038 (1973). Movshon, J. A. & Lennie, P. Nature 278, 850–852 (1979). Fecteau, J. H. & Munoz, D. P. Nat Rev Neurosci 4, 435–443 (2003). Kurahashi, T. & Menini, A. Nature 385, 725–729 (1997). Bradley, J., Bonigk, W., Yau, K. W. & Frings, S. Nat Neurosci 7, 705–710 (2004). Ulanovsky, N., Las, L. & Nelken, I. Nat Neurosci 6, 391–398 (2003). Solomon, S. G., Peirce, J. W., Dhruv, N. T. & Lennie, P. Neuron 42, 155–162 (2004). Smirnakis, S. M., Berry, M. J. & et al. Nature 386, 69–73 (1997). Brown, S. P. & Masland, R. H. Nat Neurosci 4, 44–51 (2001). Kennedy, H. J., Evans, M. G. & et al. Nat Neurosci 6, 832–836 (2003). Schultz, W. & Dickinson, A. Annu Rev Neurosci 23, 473–500 (2000). Fletcher, P. C., Anderson, J. M., Shanks, D. R. et al. Nat Neurosci 4, 1043–1048 (2001). Knight, R. Nature 383, 256–259 (1996). Stern, C. E., Corkin, S., Gonzalez, R. G. et al. Proc Natl Acad Sci U S A 93, 8660–8665 (1996). Li, S., Cullen, W. K., Anwyl, R. & Rowan, M. J. Nat Neurosci 6, 526–531 (2003). Ebstein, R. P., Novick, O., Umansky, R. et al. Nat Genet 12, 78–80 (1996). Benjamin, J., Li, L. & et al. Nat Genet 12, 81–84 (1996). Lusher, J. M., Chandler, C. & Ball, D. Mol Psychiatry 6, 497–499 (2001). Itti, L. & Baldi, P. In Proc. IEEE CVPR. San Siego, CA (2005 in press). Cox, R. T. Am. J. Phys. 14, 1–13 (1964). Savage, L. J. The foundations of statistics (Dover, New York, 1972). (First Edition in 1954). Jaynes, E. T. Probability Theory. The Logic of Science (Cambridge University Press, 2003). Kullback, S. Information Theory and Statistics (Wiley, New York:New York, 1959). Itti, L. Visual Cognition (2005 in press). Reinagel, P. & Zador, A. M. Network 10, 341–350 (1999). Parkhurst, D., Law, K. & Niebur, E. Vision Res 42, 107–123 (2002). Itti, L. & Koch, C. Nat Rev Neurosci 2, 194–203 (2001). Privitera, C. M. & Stark, L. W. IEEE Trans Patt Anal Mach Intell 22, 970–982 (2000). All source code for all metrics is freely available at http://iLab.usc.edu/toolkit/. Theeuwes, J. Percept Psychophys 57, 637–644 (1995). Abrams, R. A. & Christ, S. E. Psychol Sci 14, 427–432 (2003). Wolfe, J. M. & Horowitz, T. S. Nat Rev Neurosci 5, 495–501 (2004).
same-paper 2 0.78525871 205 nips-2005-Worst-Case Bounds for Gaussian Process Models
Author: Sham M. Kakade, Matthias W. Seeger, Dean P. Foster
Abstract: We present a competitive analysis of some non-parametric Bayesian algorithms in a worst-case online learning setting, where no probabilistic assumptions about the generation of the data are made. We consider models which use a Gaussian process prior (over the space of all functions) and provide bounds on the regret (under the log loss) for commonly used non-parametric Bayesian algorithms — including Gaussian regression and logistic regression — which show how these algorithms can perform favorably under rather general conditions. These bounds explicitly handle the infinite dimensionality of these non-parametric classes in a natural way. We also make formal connections to the minimax and minimum description length (MDL) framework. Here, we show precisely how Bayesian Gaussian regression is a minimax strategy. 1
3 0.74342382 130 nips-2005-Modeling Neuronal Interactivity using Dynamic Bayesian Networks
Author: Lei Zhang, Dimitris Samaras, Nelly Alia-klein, Nora Volkow, Rita Goldstein
Abstract: Functional Magnetic Resonance Imaging (fMRI) has enabled scientists to look into the active brain. However, interactivity between functional brain regions, is still little studied. In this paper, we contribute a novel framework for modeling the interactions between multiple active brain regions, using Dynamic Bayesian Networks (DBNs) as generative models for brain activation patterns. This framework is applied to modeling of neuronal circuits associated with reward. The novelty of our framework from a Machine Learning perspective lies in the use of DBNs to reveal the brain connectivity and interactivity. Such interactivity models which are derived from fMRI data are then validated through a group classification task. We employ and compare four different types of DBNs: Parallel Hidden Markov Models, Coupled Hidden Markov Models, Fully-linked Hidden Markov Models and Dynamically MultiLinked HMMs (DML-HMM). Moreover, we propose and compare two schemes of learning DML-HMMs. Experimental results show that by using DBNs, group classification can be performed even if the DBNs are constructed from as few as 5 brain regions. We also demonstrate that, by using the proposed learning algorithms, different DBN structures characterize drug addicted subjects vs. control subjects. This finding provides an independent test for the effect of psychopathology on brain function. In general, we demonstrate that incorporation of computer science principles into functional neuroimaging clinical studies provides a novel approach for probing human brain function.
4 0.49979827 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation
Author: Alan L. Yuille
Abstract: We show that linear generalizations of Rescorla-Wagner can perform Maximum Likelihood estimation of the parameters of all generative models for causal reasoning. Our approach involves augmenting variables to deal with conjunctions of causes, similar to the agumented model of Rescorla. Our results involve genericity assumptions on the distributions of causes. If these assumptions are violated, for example for the Cheng causal power theory, then we show that a linear Rescorla-Wagner can estimate the parameters of the model up to a nonlinear transformtion. Moreover, a nonlinear Rescorla-Wagner is able to estimate the parameters directly to within arbitrary accuracy. Previous results can be used to determine convergence and to estimate convergence rates. 1
5 0.49927065 94 nips-2005-Identifying Distributed Object Representations in Human Extrastriate Visual Cortex
Author: Rory Sayres, David Ress, Kalanit Grill-spector
Abstract: The category of visual stimuli has been reliably decoded from patterns of neural activity in extrastriate visual cortex [1]. It has yet to be seen whether object identity can be inferred from this activity. We present fMRI data measuring responses in human extrastriate cortex to a set of 12 distinct object images. We use a simple winner-take-all classifier, using half the data from each recording session as a training set, to evaluate encoding of object identity across fMRI voxels. Since this approach is sensitive to the inclusion of noisy voxels, we describe two methods for identifying subsets of voxels in the data which optimally distinguish object identity. One method characterizes the reliability of each voxel within subsets of the data, while another estimates the mutual information of each voxel with the stimulus set. We find that both metrics can identify subsets of the data which reliably encode object identity, even when noisy measurements are artificially added to the data. The mutual information metric is less efficient at this task, likely due to constraints in fMRI data. 1
6 0.49611729 74 nips-2005-Faster Rates in Regression via Active Learning
7 0.49545673 50 nips-2005-Convex Neural Networks
8 0.49431914 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error
9 0.49421191 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
10 0.49206004 177 nips-2005-Size Regularized Cut for Data Clustering
11 0.48950395 201 nips-2005-Variational Bayesian Stochastic Complexity of Mixture Models
12 0.48942274 54 nips-2005-Data-Driven Online to Batch Conversions
13 0.4892616 49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations
14 0.4890866 30 nips-2005-Assessing Approximations for Gaussian Process Classification
15 0.48878855 112 nips-2005-Learning Minimum Volume Sets
16 0.48872286 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction
17 0.48864985 41 nips-2005-Coarse sample complexity bounds for active learning
18 0.48860225 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations
19 0.48635018 66 nips-2005-Estimation of Intrinsic Dimensionality Using High-Rate Vector Quantization
20 0.48609519 47 nips-2005-Consistency of one-class SVM and related algorithms