nips nips2003 nips2003-51 knowledge-graph by maker-knowledge-mining

51 nips-2003-Design of Experiments via Information Theory


Source: pdf

Author: Liam Paninski

Abstract: We discuss an idea for collecting data in a relatively efficient manner. Our point of view is Bayesian and information-theoretic: on any given trial, we want to adaptively choose the input in such a way that the mutual information between the (unknown) state of the system and the (stochastic) output is maximal, given any prior information (including data collected on any previous trials). We prove a theorem that quantifies the effectiveness of this strategy and give a few illustrative examples comparing the performance of this adaptive technique to that of the more usual nonadaptive experimental design. For example, we are able to explicitly calculate the asymptotic relative efficiency of the “staircase method” widely employed in psychophysics research, and to demonstrate the dependence of this efficiency on the form of the “psychometric function” underlying the output responses. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 We prove a theorem that quantifies the effectiveness of this strategy and give a few illustrative examples comparing the performance of this adaptive technique to that of the more usual nonadaptive experimental design. [sent-5, score-0.58]

2 For example, we are able to explicitly calculate the asymptotic relative efficiency of the “staircase method” widely employed in psychophysics research, and to demonstrate the dependence of this efficiency on the form of the “psychometric function” underlying the output responses. [sent-6, score-0.309]

3 1 Introduction One simple model of experimental design (we have neurophysiological experiments in mind, but our results are general with respect to the identity of the system under study) is as follows. [sent-7, score-0.18]

4 This knowledge is summarized in the form of a prior distribution, p0 (θ), on some space Θ of models θ. [sent-9, score-0.056]

5 A model is a set of probabilistic input-output relationships: regular conditional distributions p(y|x, θ) on Y , the set of possible output responses, given each x in X. [sent-10, score-0.04]

6 Thus the joint probability of stimulus and response is: $p(x, y) = \int_\Theta p(x, y, \theta)\, d\theta = \int_\Theta p_0(\theta)\, p(x)\, p(y|\theta, x)\, d\theta$. [sent-11, score-0.121]

7 The “design” of an experiment is given by the choice of input probability p(x). [sent-12, score-0.045]

8 We want to design our experiment — choose p(x) — optimally in some sense. [sent-13, score-0.211]

9 One natural idea would be to choose p(x) in such a way that we learn as much as possible about the underlying model, on average. [sent-14, score-0.046]

10 Information theory thus suggests we choose p(x) to optimize the objective function below; a longer version of this paper, including proofs, has been submitted and is available at http://www. [sent-15, score-0.046]

11 The objective function is $I(\{x, y\}; \theta) = \int_{X \times Y \times \Theta} p(x, y, \theta) \log \frac{p(x, y, \theta)}{p(x, y)\, p(\theta)}$ (1), where $I(\cdot\,;\cdot)$ denotes the mutual information. [sent-19, score-0.077]

12 In other words, we want to maximize the information provided about θ by the pair {x, y}, given our current knowledge of the model as summarized in the posterior distribution given N samples of data: $p_N(\theta) = p(\theta \mid \{x_i, y_i\}_{1 \le i \le N})$. [sent-22, score-0.12]
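
To make the posterior update concrete, here is a minimal Python sketch that maintains p_N(θ) on a discretized parameter grid; the logistic likelihood, the grid, and the example trial are illustrative assumptions, not part of the paper.

import numpy as np

def lik(y, x, theta):
    # assumed placeholder for p(y|x, theta): a logistic psychometric function
    # with threshold theta; y is a binary detection response
    p = 1.0 / (1.0 + np.exp(-(x - theta)))
    return p if y == 1 else 1.0 - p

def update_posterior(post, theta_grid, x, y):
    # one Bayes step: p_N(theta) is proportional to p_{N-1}(theta) * p(y|x, theta)
    post = post * np.array([lik(y, x, th) for th in theta_grid])
    return post / post.sum()

# flat prior p_0(theta) on a grid, then condition on a single observed trial (x, y)
theta_grid = np.linspace(-5.0, 5.0, 201)
post = np.ones_like(theta_grid) / theta_grid.size
post = update_posterior(post, theta_grid, x=0.3, y=1)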

13 Somewhat surprisingly, we have not seen any applications of the information-theoretic objective function (1) to the design of neurophysiological experiments (although see the abstract by [7], who seem to have independently implemented the same idea in a simulation study). [sent-27, score-0.214]

14 The primary goal of this paper is to elucidate the asymptotic behavior of the a posteriori density pN (θ) when we choose x according to the recipe outlined above; in particular, we want to compare the adaptive strategy to the more usual case, in which the stimuli are drawn i.i.d. from some fixed distribution p(x). [sent-28, score-0.812]

15 Our main result (section 2) states that, under acceptably weak conditions on the models p(y|θ, x), the informationmaximization strategy leads to consistent and efficient estimates of the true underlying model, in a natural sense. [sent-32, score-0.397]

16 Thus, instead of finding optimal distributions p(x), we need only find optimal inputs x, in the sense of maximizing the conditional information between θ and y, given a single input x: $I(y; \theta|x) \equiv \int_Y \int_\Theta p_N(\theta)\, p(y|\theta, x) \log \frac{p(y|x, \theta)}{\int_\Theta p_N(\theta)\, p(y|x, \theta)\, d\theta}$. [sent-36, score-0.04]

17 Our main result is a “Bernstein-von Mises”-type theorem [12]. [sent-37, score-0.15]
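
Under the same assumptions as the sketch above (a discretized Θ, a binary response, and a placeholder logistic likelihood, none of which come from the paper), the greedy rule suggested by this objective can be sketched as follows: score each candidate input x by I(y; θ|x) under the current posterior and present the maximizer.

import numpy as np

theta_grid = np.linspace(-5.0, 5.0, 201)
post = np.ones_like(theta_grid) / theta_grid.size      # current posterior p_N(theta)

def lik(y, x, theta):
    # assumed placeholder p(y|x, theta): logistic detection probability, y in {0, 1}
    p = 1.0 / (1.0 + np.exp(-(x - theta)))
    return p if y == 1 else 1.0 - p

def conditional_info(post, theta_grid, x):
    # I(y; theta|x) = sum_y sum_theta p_N(theta) p(y|x,theta)
    #                 * log[ p(y|x,theta) / sum_theta' p_N(theta') p(y|x,theta') ]
    info = 0.0
    for y in (0, 1):
        p_y_th = np.array([lik(y, x, th) for th in theta_grid])
        p_y = float(np.sum(post * p_y_th))              # predictive probability p(y|x)
        info += np.sum(post * p_y_th * np.log(p_y_th / p_y))
    return info

def choose_stimulus(post, theta_grid, candidates):
    # information-maximizing design: pick the candidate x with maximal I(y; theta|x)
    scores = [conditional_info(post, theta_grid, x) for x in candidates]
    return candidates[int(np.argmax(scores))]

next_x = choose_stimulus(post, theta_grid, np.linspace(-5.0, 5.0, 101))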

18 It turns out that the hard part is proving consistency (c.f. section 4). [sent-40, score-0.116]

19 We give the basic consistency lemma (interesting in its own right) first, from which the main theorem follows fairly easily. [sent-42, score-0.333]

20 The loglikelihood log p(y|x, θ) is Lipschitz in θ, uniformly in x, with respect to some dominating measure on Y . [sent-47, score-0.2]

21 The prior measure p0 assigns positive measure to any neighborhood of θ0 . [sent-49, score-0.305]

22 The maximal divergence $\sup_x D_{KL}(\theta_0; \theta|x)$ is positive for all θ ≠ θ0. [sent-51, score-0.242]

23 Then the posteriors are consistent: pN (U ) → 1 in probability for any neighborhood U of θ0 . [sent-52, score-0.179]

24 Assume the conditions of Lemma 1, strengthened as follows: 1. [sent-54, score-0.065]

25 Θ has a smooth, finite-dimensional manifold structure in a neighborhood of θ 0 . [sent-55, score-0.107]

26 The loglikelihood log p(y|x, θ) is uniformly $C^2$ in θ. [sent-57, score-0.129]

27 In particular, the Fisher information matrices $I_\theta(x) = \int_Y \frac{\dot p(y|x, \theta)}{p(y|x, \theta)} \Big(\frac{\dot p(y|x, \theta)}{p(y|x, \theta)}\Big)^t p(y|\theta, x)\, dy$, where the differential $\dot p$ is taken with respect to θ, are well-defined and continuous in θ, uniformly in (x, θ) in some neighborhood of θ0. [sent-58, score-0.229]

28 The prior measure p0 is absolutely continuous in some neighborhood of θ0 , with a continuous positive density at θ0 . [sent-60, score-0.43]

29 $\max_{C \in \mathrm{co}(I_{\theta_0}(x))} \det(C) > 0$, where $\mathrm{co}(I_{\theta_0}(x))$ denotes the convex closure of the set of Fisher information matrices $I_{\theta_0}(x)$. [sent-62, score-0.04]

30 Here $\|\cdot\|$ denotes variation distance, $N(\mu_N, \sigma_N^2)$ denotes the normal density with mean $\mu_N$ and variance $\sigma_N^2$, and $\mu_N$ is asymptotically normal with mean θ0 and variance $\sigma_N^2$. [sent-64, score-0.474]

31 Thus, under these conditions, the information maximization strategy works, and works better than the i.i.d. x strategy. [sent-66, score-0.178]

32 (For the i.i.d. x strategy, the asymptotic variance σ² is inversely related to an average, not a maximum, over x, and is therefore generically larger.) [sent-69, score-0.431]

33 It should also be clear that we have not stated the results as generally as possible; we have chosen instead to use assumptions that are simple to understand and verify, and to leave the technical generalizations to the interested reader. [sent-73, score-0.046]

34 Our assumptions should be weak enough for most neurophysiological and psychophysical situations, for example, by assuming that parameters take values in bounded (though possibly large) sets and that tuning curves are not infinitely steep. [sent-74, score-0.153]

35 The proofs of these three results are basically elaborations on Wald’s consistency method and Le Cam’s approach to the Bernstein-von Mises theorem [12]. [sent-75, score-0.267]

36 (Applications: psychometric model.) As noted in the introduction, psychophysicists have employed versions of the information-maximization procedure for some years [14, 9, 13, 6]. [sent-77, score-0.158]

37 Our results above allow us to precisely quantify the effectiveness of this strategy. [sent-79, score-0.056]

38 Let f be “sigmoidal”: a uniformly smooth, monotonically increasing function on the line, such that f(0) = 1/2, $\lim_{t \to -\infty} f(t) = 0$ and $\lim_{t \to \infty} f(t) = 1$ (this function represents the detection probability when the subject is presented with a stimulus of strength t). [sent-82, score-0.198]

39 Let fa,θ = f ((t − θ)/a); θ here serves as a location (“threshold”) parameter, while a sets the scale (we assume a is known, for now, although of course this can be relaxed [6]). [sent-83, score-0.069]

40 Finally, let p(x) and p 0 (θ) be some fixed sampling and prior distributions, respectively, both with smooth densities with respect to Lebesgue measure on some interval Θ. [sent-84, score-0.362]

41 Now, for any fixed scale a, we want to compare the performance of the information-maximization strategy to that of the i.i.d. sampling strategy. [sent-85, score-0.418]

42 We have by theorem 2 that the most efficient estimator of θ is asymptotically unbiased with asymptotic variance $\sigma^2_{\mathrm{info}} \approx (N \sup_x I_{\theta_0}(x))^{-1}$, while the usual calculations show that the asymptotic variance of any efficient estimator based on i.i.d. sampling is governed by an average, rather than a supremum, of $I_{\theta_0}(x)$ over p(x). [sent-89, score-0.992]

43 Only for one special choice of f (derived below) is the i.i.d. sampling strategy as asymptotically efficient as the information-maximization strategy; for all other f, information maximization is strictly more efficient. [sent-98, score-0.646]

44 It follows that f∗ is the unique solution of the differential equation $\frac{df^*}{dt} = c\,[f^*(t)(1 - f^*(t))]^{1/2}$, where the auxiliary constant $c = \sqrt{I_\theta}$ uniquely fixes the scale a. [sent-101, score-0.127]

45 After some calculus, we obtain $f^*(t) = \frac{\sin(ct) + 1}{2}$ on the interval [−π/2c, π/2c] (and defined uniquely, by monotonicity, as 0 or 1 outside this interval). [sent-102, score-0.069]
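
A quick numerical check, with an arbitrary assumed value for the constant c, that this f∗ does satisfy the differential equation above on [−π/2c, π/2c]:

import numpy as np

c = 2.0                                          # assumed value of c = sqrt(I_theta)
t = np.linspace(-np.pi / (2 * c), np.pi / (2 * c), 1001)

f_star = (np.sin(c * t) + 1.0) / 2.0
dfdt = np.gradient(f_star, t)                    # numerical derivative df*/dt
rhs = c * np.sqrt(f_star * (1.0 - f_star))       # c [f*(t)(1 - f*(t))]^(1/2)

assert np.allclose(dfdt, rhs, atol=1e-4)         # the ODE holds on the interval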

46 Since the support of the derivative of this function is compact, this result is quite dependent on the sampling density p(x); if p(x) places any of its mass outside of the interval [−π/2c, π/2c], then $\sigma^2_{\mathrm{iid}}$ is always strictly greater than $\sigma^2_{\mathrm{info}}$. [sent-103, score-0.264]
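
The size of this gap can be illustrated numerically. For a Bernoulli (detection) response, the per-trial Fisher information at stimulus x is f'(x − θ0)² / [f(1 − f)], so the ratio of its supremum to its average under p(x) estimates the asymptotic efficiency gain of information maximization over i.i.d. sampling; the logistic f and the uniform sampling interval below are assumptions made only for illustration.

import numpy as np

def fisher_info(t, a=1.0):
    # per-trial Fisher information for a Bernoulli response with detection
    # probability f(t) = logistic(t / a): I(t) = f'(t)^2 / (f(t) (1 - f(t)))
    f = 1.0 / (1.0 + np.exp(-t / a))
    fprime = f * (1.0 - f) / a
    return fprime ** 2 / (f * (1.0 - f))

theta0 = 0.0
x = np.linspace(-4.0, 4.0, 2001)                 # assumed uniform sampling density p(x)
I_vals = fisher_info(x - theta0)

I_sup = I_vals.max()                             # governs sigma^2_info ~ 1 / (N sup_x I)
I_avg = I_vals.mean()                            # governs sigma^2_iid  ~ 1 / (N E_p(x) I)
print("estimated sigma^2_iid / sigma^2_info:", I_sup / I_avg)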

47 This recapitulates a basic theme from the psychophysical literature comparing adaptive and nonadaptive techniques: when the scale of the nonlinearity f is either unknown or smaller than the scale of the i.i.d. sampling distribution, adaptive methods are substantially more efficient. [sent-104, score-0.452]

48 (Linear-nonlinear cascade model.) We now consider a model that has received increasing attention from the neurophysiology community. [sent-110, score-0.05]

49 Here the linear filter θ is some unit vector in X*, the dual space of X (thus, the model space Θ is isomorphic to X), while the nonlinearity f is some nonconstant, nonnegative function on [−1, 1]. [sent-114, score-0.171]

50 We assume that f is uniformly smooth, to satisfy the conditions of theorem 2; we also assume f is known, although, again, this can be relaxed. [sent-115, score-0.248]

51 The response space Y — the space of possible spike counts, given the stimulus x — can be taken to be the nonnegative integers. [sent-116, score-0.16]

52 For simplicity, let the conditional probabilities p(y|x, θ) be parametrized uniquely by the mean firing rate f (< θ, x >); the most convenient model, as usual, is to assume that p(y|x, θ) is Poisson with mean f (< θ, x >). [sent-117, score-0.098]

53 Finally, we assume that the sampling density p(x) is uniform on the unit sphere (this choice is natural for several reasons, mainly involving symmetry; see, e.g., [2, 8]). [sent-118, score-0.305]

54 We also assume that the prior p0(θ) is positive and continuous (and is therefore bounded away from zero, by the compactness of Θ). [sent-120, score-0.222]

55 The Fisher information for this model is easily calculated as $I_\theta(x) = \frac{(f'(\langle \theta, x \rangle))^2}{f(\langle \theta, x \rangle)} P_{x,\theta}$, where f' is the usual derivative of the real function f and $P_{x,\theta}$ is the projection operator corresponding to x, restricted to the (d − 1)-dimensional tangent space to the unit sphere at θ. [sent-121, score-0.296]
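
Since the asymptotic variance of the adaptive strategy is controlled by sup_x I_θ0(x), a crude surrogate for the information-maximizing choice at large N is to maximize the scalar factor (f'(⟨θ, x⟩))² / f(⟨θ, x⟩) at the current parameter estimate. The sketch below does this by brute force over a random pool of unit vectors, with an assumed exponential nonlinearity; neither choice comes from the paper.

import numpy as np

rng = np.random.default_rng(0)

def info_factor(u):
    # scalar part of I_theta(x) for the Poisson LN model: f'(u)^2 / f(u),
    # with the assumed nonlinearity f(u) = exp(u), so the factor is simply exp(u)
    return np.exp(u)

def choose_stimulus(theta_hat, n_candidates=500):
    # pick, from a random pool of unit vectors, the stimulus x maximizing the
    # Fisher-information factor at the current filter estimate theta_hat
    d = theta_hat.size
    cand = rng.standard_normal((n_candidates, d))
    cand /= np.linalg.norm(cand, axis=1, keepdims=True)   # project onto the unit sphere
    scores = info_factor(cand @ theta_hat)
    return cand[int(np.argmax(scores))]

theta_hat = np.zeros(20)
theta_hat[0] = 1.0                               # current unit-norm estimate of the filter
x_next = choose_stimulus(theta_hat)              # the candidate best aligned with theta_hat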

56 The asymptotics of section 3.1 apply here as well: the ratio $\sigma_{\mathrm{iid}}/\sigma_{\mathrm{info}}$ grows roughly linearly in the inverse of the scale of the nonlinearity. [sent-124, score-0.069]

57 The more interesting asymptotics here, though, are in d. [sent-125, score-0.069]

58 This is because the unit sphere has a measure concentration property [11]: as d → ∞, the measure p(t) becomes exponentially concentrated around 0. [sent-126, score-0.353]

59 In fact, it is easy to show directly that, in this limit, p(t) converges in distribution to the normal measure with mean zero and variance $d^{-2}$. [sent-127, score-0.189]

60 The most surprising implication of this result is seen for nonlinearities f such that f'(0) = 0, f(0) > 0; we have in mind, for example, symmetric nonlinearities like those often used to model complex cells in visual cortex. [sent-128, score-0.144]

61 For these nonlinearities, $\sigma^2_{\mathrm{info}}/\sigma^2_{\mathrm{iid}} = O(d^{-2})$: that is, the information maximization strategy becomes infinitely more efficient than the usual i.i.d. design as d → ∞. [sent-129, score-0.304]

62 (A negative example.) Our next example is more negative and perhaps more surprising: it shows how the information-maximization strategy can fail, in a certain sense, if the conditions of the consistency lemma are not met. [sent-133, score-0.458]

63 Let Θ be multidimensional, with coordinates which are “independent” in a certain sense, and assume the expected information obtained from one coordinate of the parameter remains bounded strictly away from the expected information obtained from one of the other coordinates. [sent-134, score-0.147]

64 5|, < 0 and 0 < θ1 < 1 are the parameters we want to learn. [sent-139, score-0.061]

65 Let the initial prior be absolutely continuous with respect to Lebesgue measure; this implies that all posteriors will have the same property. [sent-140, score-0.236]

66 Thus the information-maximization strategy will sample from x < 0 forever, leading to a linear information growth rate (and easily-proven consistency) for the left parameter and non-convergence on the right. [sent-142, score-0.13]

67 Compare this to the i.i.d. approach for choosing x (using any Lebesgue-dominating measure on the parameter space), which leads to the standard root-N rate for both parameters. [sent-146, score-0.071]

68 Note that this kind of inconsistency problem does not occur in the case of sufficiently smooth p(y|x, θ), by our main theorem. [sent-149, score-0.118]

69 Thus one way of avoiding this problem would be to fix a finite sampling scale for each coordinate. [sent-150, score-0.205]

70 However, it is possible to find other examples which show that the lack of consistency is not necessarily tied to the discontinuous nature of the conditional densities. [sent-154, score-0.156]

71 (Directions.) In this paper, we have presented a rigorous theoretical framework for adaptively designing experiments using an information-theoretic objective function. [sent-155, score-0.174]

72 For example, our theorem 2 might suggest the use of a mixture-of-Gaussians representation as an efficient approximation for the posteriors pN (θ) [5]. [sent-157, score-0.178]

73 Perhaps the most obvious such question concerns the use of non-information theoretic objective functions. [sent-159, score-0.077]

74 It turns out that many of our results apply with only modest changes if the experiment is instead designed to minimize something like the Bayes mean-square error (perhaps defined only locally if Θ has a nontrivial manifold structure): in this case, the results in sections 3.1 and 3.2 remain completely unchanged. [sent-160, score-0.045]

75 The statement of our main theorem requires only slight changes in the asymptotic variance formula (see http://www. [sent-162, score-0.492]

76 Thus it seems our results here can add very little to any discussion of what objective function is “best” in general. [sent-166, score-0.077]

77 (“Batch mode” and stimulus dependencies.) Perhaps our strongest assumption here is that the experimenter will be able to freely choose the stimuli on each trial. [sent-169, score-0.328]

78 Another common situation involves stimuli which vary temporally, for which the system is commonly modelled as responding not just to a given stimulus x(t), but also to all of its time-translates x(t − τ ). [sent-171, score-0.213]

79 Finally, if there is some cost C(x0, x1) associated with changing the state of the observational apparatus from the current state x0 to x1, the experimenter may wish to optimize an objective function which incorporates this cost, for example $I(y; \theta|x_1) - C(x_0, x_1)$. [sent-172, score-0.146]

80 Here we restrict ourselves to the first setting, and give a simple conjecture, based on the asymptotic results presented above and inspired by results like those of [1, 4, 10]. [sent-174, score-0.231]

81 First, we state more precisely the optimization problem inherent in designing a “batch” experiment: we wish to choose some sequence, $\{x_i\}_{1 \le i \le k}$, to maximize $I(\{x_i, y_i\}_{1 \le i \le k}; \theta)$; the main difference here is that $\{x_i\}_{1 \le i \le k}$ must be chosen nonadaptively, i.e., before any of the responses $y_i$ are observed. [sent-175, score-0.129]

82 (Without such a smoothness condition — for example, if some input x could definitively decide between some given θ0 and θ1 — no such asymptotic independence statement can hold, since no more than one sample from such an x would be necessary.) [sent-179, score-0.272]

83 Thus, we can hope that we should be able to asymptotically approximate this optimal experiment by sampling in an i.i.d. manner from some fixed distribution p(x). [sent-180, score-0.295]

84 Moreover, we can make a guess as to the identity of this putative p(x): Conjecture (“Batch” mode). [sent-184, score-0.043]

85 Under suitable conditions, the empirical distribution corresponding to any optimal sequence $\{x_i\}_{1 \le i \le k}$, $\hat p_k(x) \equiv \frac{1}{k} \sum_{i=1}^{k} \delta_{x_i}(x)$, converges weakly as k → ∞ to S, the convex set of maximizers in p(x) of $E_\theta \log(\det(E_x I_\theta(x)))$ (2). [sent-185, score-0.052]

86 Expression (2) above is an average over p(θ) of terms proportional to the negative entropy of the asymptotic Gaussian posterior distribution corresponding to each θ, and thus should be maximized by any optimal approximant distribution p(x). [sent-186, score-0.347]
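
To make expression (2) concrete, the sketch below evaluates E_θ log det(E_x I_θ(x)) for two candidate designs p(x) on a small finite stimulus set. The linear-Gaussian observation model is an assumed stand-in that makes the θ-average trivial; it is not the paper's setting, but it shows how the conjectured objective reduces to a D-optimality-style criterion.

import numpy as np

def fisher_info(x, sigma=1.0):
    # I_theta(x) for an assumed linear-Gaussian model y = <theta, x> + noise:
    # x x^T / sigma^2, which happens not to depend on theta
    x = np.asarray(x, dtype=float)
    return np.outer(x, x) / sigma ** 2

def batch_objective(design, stimuli):
    # E_theta log det( E_x I_theta(x) ) for a design p(x) on a finite stimulus set;
    # the average over theta is trivial here because I does not depend on theta
    avg_info = sum(p * fisher_info(x) for p, x in zip(design, stimuli))
    sign, logdet = np.linalg.slogdet(avg_info)
    return logdet if sign > 0 else -np.inf

stimuli = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0]) / np.sqrt(2)]
designs = {
    "half weight on each axis": [0.5, 0.5, 0.0],
    "equal weight on all three": [1 / 3, 1 / 3, 1 / 3],
}
for name, p in designs.items():
    print(name, batch_objective(p, stimuli))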

87 In fact, it is not difficult, using the results of Clarke and Barron [3], to prove the above conjecture under conditions like those of Theorem 2, assuming that X is finite (in which case weak convergence is equivalent to pointwise convergence); we leave generalizations for future work. [sent-188, score-0.186]

88 Bayesian Statistics 4, chapter On priors that maximize expected information, pages 35–60. [sent-199, score-0.046]

89 Jeffreys’ prior is asymptotically least favorable under entropy risk. [sent-213, score-0.271]

90 Using mutual information to pre-process input data for a virtual sensor. [sent-221, score-0.07]

91 Shannon optimal priors on iid statistical experiments converge weakly to jeffreys’ prior. [sent-246, score-0.411]

92 Concentration of measure and isoperimetric inequalities in product spaces. [sent-254, score-0.071]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('iid', 0.313), ('asymptotic', 0.231), ('pn', 0.199), ('psychometric', 0.197), ('inf', 0.173), ('asymptotically', 0.158), ('informationmaximization', 0.158), ('strategy', 0.13), ('usual', 0.126), ('stimulus', 0.121), ('clarke', 0.118), ('supx', 0.118), ('consistency', 0.116), ('sphere', 0.115), ('neighborhood', 0.107), ('theorem', 0.106), ('fisher', 0.104), ('batch', 0.104), ('co', 0.103), ('liam', 0.103), ('dkl', 0.094), ('stimuli', 0.092), ('sampling', 0.092), ('adaptive', 0.083), ('fa', 0.082), ('perhaps', 0.08), ('nonadaptive', 0.079), ('psychophysics', 0.078), ('neurophysiological', 0.078), ('compactness', 0.078), ('uniformly', 0.077), ('objective', 0.077), ('nonlinearity', 0.077), ('maximal', 0.075), ('psychophysical', 0.075), ('conjecture', 0.075), ('smooth', 0.074), ('nonlinearities', 0.072), ('posteriors', 0.072), ('measure', 0.071), ('variance', 0.07), ('mutual', 0.07), ('interval', 0.069), ('scale', 0.069), ('monotonicity', 0.069), ('asymptotics', 0.069), ('experimenter', 0.069), ('limt', 0.069), ('mises', 0.069), ('lemma', 0.067), ('conditions', 0.065), ('absolutely', 0.063), ('jeffreys', 0.063), ('normality', 0.063), ('want', 0.061), ('strictly', 0.06), ('det', 0.059), ('posterior', 0.059), ('design', 0.059), ('adaptively', 0.058), ('lipschitz', 0.058), ('uniquely', 0.058), ('entropy', 0.057), ('prior', 0.056), ('effectiveness', 0.056), ('ef', 0.055), ('lebesgue', 0.055), ('px', 0.055), ('unit', 0.055), ('mode', 0.052), ('watson', 0.052), ('loglikelihood', 0.052), ('weakly', 0.052), ('dp', 0.05), ('cascade', 0.05), ('divergence', 0.049), ('maximization', 0.048), ('normal', 0.048), ('choose', 0.046), ('priors', 0.046), ('generalizations', 0.046), ('experiment', 0.045), ('continuous', 0.045), ('mind', 0.045), ('basically', 0.045), ('main', 0.044), ('coordinate', 0.044), ('density', 0.043), ('identity', 0.043), ('away', 0.043), ('irrelevant', 0.042), ('concentration', 0.041), ('statement', 0.041), ('ring', 0.041), ('nitely', 0.041), ('conditional', 0.04), ('xi', 0.04), ('denotes', 0.04), ('designing', 0.039), ('nonnegative', 0.039)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 51 nips-2003-Design of Experiments via Information Theory

Author: Liam Paninski

Abstract: We discuss an idea for collecting data in a relatively efficient manner. Our point of view is Bayesian and information-theoretic: on any given trial, we want to adaptively choose the input in such a way that the mutual information between the (unknown) state of the system and the (stochastic) output is maximal, given any prior information (including data collected on any previous trials). We prove a theorem that quantifies the effectiveness of this strategy and give a few illustrative examples comparing the performance of this adaptive technique to that of the more usual nonadaptive experimental design. For example, we are able to explicitly calculate the asymptotic relative efficiency of the “staircase method” widely employed in psychophysics research, and to demonstrate the dependence of this efficiency on the form of the “psychometric function” underlying the output responses. 1

2 0.16354325 151 nips-2003-PAC-Bayesian Generic Chaining

Author: Jean-yves Audibert, Olivier Bousquet

Abstract: There exist many different generalization error bounds for classification. Each of these bounds contains an improvement over the others for certain situations. Our goal is to combine these different improvements into a single bound. In particular we combine the PAC-Bayes approach introduced by McAllester [1], which is interesting for averaging classifiers, with the optimal union bound provided by the generic chaining technique developed by Fernique and Talagrand [2]. This combination is quite natural since the generic chaining is based on the notion of majorizing measures, which can be considered as priors on the set of classifiers, and such priors also arise in the PAC-bayesian setting. 1

3 0.14743757 82 nips-2003-Geometric Clustering Using the Information Bottleneck Method

Author: Susanne Still, William Bialek, Léon Bottou

Abstract: We argue that K–means and deterministic annealing algorithms for geometric clustering can be derived from the more general Information Bottleneck approach. If we cluster the identities of data points to preserve information about their location, the set of optimal solutions is massively degenerate. But if we treat the equations that define the optimal solution as an iterative algorithm, then a set of “smooth” initial conditions selects solutions with the desired geometrical properties. In addition to conceptual unification, we argue that this approach can be more efficient and robust than classic algorithms. 1

4 0.14192045 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates

Author: Peter L. Bartlett, Michael I. Jordan, Jon D. Mcauliffe

Abstract: Many classification algorithms, including the support vector machine, boosting and logistic regression, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0-1 loss function. We characterize the statistical consequences of using such a surrogate by providing a general quantitative relationship between the risk as assessed using the 0-1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial bounds under the weakest possible condition on the loss function—that it satisfy a pointwise form of Fisher consistency for classification. The relationship is based on a variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise. Finally, we present applications of our results to the estimation of convergence rates in the general setting of function classes that are scaled hulls of a finite-dimensional base class.

5 0.13113712 170 nips-2003-Self-calibrating Probability Forecasting

Author: Vladimir Vovk, Glenn Shafer, Ilia Nouretdinov

Abstract: In the problem of probability forecasting the learner’s goal is to output, given a training set and a new object, a suitable probability measure on the possible values of the new object’s label. An on-line algorithm for probability forecasting is said to be well-calibrated if the probabilities it outputs agree with the observed frequencies. We give a natural nonasymptotic formalization of the notion of well-calibratedness, which we then study under the assumption of randomness (the object/label pairs are independent and identically distributed). It turns out that, although no probability forecasting algorithm is automatically well-calibrated in our sense, there exists a wide class of algorithms for “multiprobability forecasting” (such algorithms are allowed to output a set, ideally very narrow, of probability measures) which satisfy this property; we call the algorithms in this class “Venn probability machines”. Our experimental results demonstrate that a 1-Nearest Neighbor Venn probability machine performs reasonably well on a standard benchmark data set, and one of our theoretical results asserts that a simple Venn probability machine asymptotically approaches the true conditional probabilities regardless, and without knowledge, of the true probability measure generating the examples.

6 0.12710249 102 nips-2003-Large Scale Online Learning

7 0.120924 5 nips-2003-A Classification-based Cocktail-party Processor

8 0.11251912 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions

9 0.11207349 125 nips-2003-Maximum Likelihood Estimation of a Stochastic Integrate-and-Fire Neural Model

10 0.10751586 115 nips-2003-Linear Dependent Dimensionality Reduction

11 0.093197882 176 nips-2003-Sequential Bayesian Kernel Regression

12 0.092211351 142 nips-2003-On the Concentration of Expectation and Approximate Inference in Layered Networks

13 0.0885875 126 nips-2003-Measure Based Regularization

14 0.085563049 180 nips-2003-Sparseness of Support Vector Machines---Some Asymptotically Sharp Bounds

15 0.084522814 160 nips-2003-Prediction on Spike Data Using Kernel Algorithms

16 0.082573533 30 nips-2003-Approximability of Probability Distributions

17 0.080805793 174 nips-2003-Semidefinite Relaxations for Approximate Inference on Graphs with Cycles

18 0.078736283 149 nips-2003-Optimal Manifold Representation of Data: An Information Theoretic Approach

19 0.075246572 113 nips-2003-Learning with Local and Global Consistency

20 0.073727295 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.288), (1, -0.034), (2, 0.007), (3, -0.032), (4, 0.115), (5, 0.087), (6, 0.022), (7, -0.012), (8, -0.177), (9, 0.084), (10, -0.121), (11, 0.057), (12, 0.155), (13, 0.202), (14, 0.143), (15, -0.074), (16, 0.105), (17, 0.007), (18, 0.026), (19, 0.052), (20, -0.046), (21, 0.044), (22, -0.019), (23, -0.027), (24, 0.014), (25, -0.004), (26, 0.08), (27, 0.02), (28, 0.07), (29, -0.018), (30, -0.065), (31, -0.041), (32, -0.045), (33, 0.05), (34, -0.003), (35, -0.029), (36, -0.086), (37, 0.005), (38, 0.089), (39, 0.04), (40, 0.047), (41, 0.017), (42, -0.056), (43, -0.045), (44, -0.172), (45, 0.057), (46, -0.07), (47, 0.016), (48, -0.102), (49, -0.185)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95703316 51 nips-2003-Design of Experiments via Information Theory

Author: Liam Paninski

Abstract: We discuss an idea for collecting data in a relatively efficient manner. Our point of view is Bayesian and information-theoretic: on any given trial, we want to adaptively choose the input in such a way that the mutual information between the (unknown) state of the system and the (stochastic) output is maximal, given any prior information (including data collected on any previous trials). We prove a theorem that quantifies the effectiveness of this strategy and give a few illustrative examples comparing the performance of this adaptive technique to that of the more usual nonadaptive experimental design. For example, we are able to explicitly calculate the asymptotic relative efficiency of the “staircase method” widely employed in psychophysics research, and to demonstrate the dependence of this efficiency on the form of the “psychometric function” underlying the output responses. 1

2 0.64454341 151 nips-2003-PAC-Bayesian Generic Chaining

Author: Jean-yves Audibert, Olivier Bousquet

Abstract: There exist many different generalization error bounds for classification. Each of these bounds contains an improvement over the others for certain situations. Our goal is to combine these different improvements into a single bound. In particular we combine the PAC-Bayes approach introduced by McAllester [1], which is interesting for averaging classifiers, with the optimal union bound provided by the generic chaining technique developed by Fernique and Talagrand [2]. This combination is quite natural since the generic chaining is based on the notion of majorizing measures, which can be considered as priors on the set of classifiers, and such priors also arise in the PAC-bayesian setting. 1

3 0.57158971 170 nips-2003-Self-calibrating Probability Forecasting

Author: Vladimir Vovk, Glenn Shafer, Ilia Nouretdinov

Abstract: In the problem of probability forecasting the learner’s goal is to output, given a training set and a new object, a suitable probability measure on the possible values of the new object’s label. An on-line algorithm for probability forecasting is said to be well-calibrated if the probabilities it outputs agree with the observed frequencies. We give a natural nonasymptotic formalization of the notion of well-calibratedness, which we then study under the assumption of randomness (the object/label pairs are independent and identically distributed). It turns out that, although no probability forecasting algorithm is automatically well-calibrated in our sense, there exists a wide class of algorithms for “multiprobability forecasting” (such algorithms are allowed to output a set, ideally very narrow, of probability measures) which satisfy this property; we call the algorithms in this class “Venn probability machines”. Our experimental results demonstrate that a 1-Nearest Neighbor Venn probability machine performs reasonably well on a standard benchmark data set, and one of our theoretical results asserts that a simple Venn probability machine asymptotically approaches the true conditional probabilities regardless, and without knowledge, of the true probability measure generating the examples.

4 0.52472168 82 nips-2003-Geometric Clustering Using the Information Bottleneck Method

Author: Susanne Still, William Bialek, Léon Bottou

Abstract: We argue that K–means and deterministic annealing algorithms for geometric clustering can be derived from the more general Information Bottleneck approach. If we cluster the identities of data points to preserve information about their location, the set of optimal solutions is massively degenerate. But if we treat the equations that define the optimal solution as an iterative algorithm, then a set of “smooth” initial conditions selects solutions with the desired geometrical properties. In addition to conceptual unification, we argue that this approach can be more efficient and robust than classic algorithms. 1

5 0.50535816 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions

Author: Tong Zhang

Abstract: In this paper we obtain convergence bounds for the concentration of Bayesian posterior distributions (around the true distribution) using a novel method that simplifies and enhances previous results. Based on the analysis, we also introduce a generalized family of Bayesian posteriors, and show that the convergence behavior of these generalized posteriors is completely determined by the local prior structure around the true distribution. This important and surprising robustness property does not hold for the standard Bayesian posterior in that it may not concentrate when there exist “bad” prior structures even at places far away from the true distribution. 1

6 0.48192748 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates

7 0.48099539 102 nips-2003-Large Scale Online Learning

8 0.47623655 98 nips-2003-Kernel Dimensionality Reduction for Supervised Learning

9 0.46477202 180 nips-2003-Sparseness of Support Vector Machines---Some Asymptotically Sharp Bounds

10 0.44429892 125 nips-2003-Maximum Likelihood Estimation of a Stochastic Integrate-and-Fire Neural Model

11 0.4332622 6 nips-2003-A Fast Multi-Resolution Method for Detection of Significant Spatial Disease Clusters

12 0.43172583 126 nips-2003-Measure Based Regularization

13 0.42443788 196 nips-2003-Wormholes Improve Contrastive Divergence

14 0.42029139 134 nips-2003-Near-Minimax Optimal Classification with Dyadic Classification Trees

15 0.41779763 85 nips-2003-Human and Ideal Observers for Detecting Image Curves

16 0.41612306 30 nips-2003-Approximability of Probability Distributions

17 0.40895021 67 nips-2003-Eye Micro-movements Improve Stimulus Detection Beyond the Nyquist Limit in the Peripheral Retina

18 0.39651147 3 nips-2003-AUC Optimization vs. Error Rate Minimization

19 0.39373034 75 nips-2003-From Algorithmic to Subjective Randomness

20 0.39352763 130 nips-2003-Model Uncertainty in Classical Conditioning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.032), (11, 0.014), (29, 0.021), (30, 0.012), (35, 0.045), (53, 0.12), (69, 0.362), (71, 0.07), (76, 0.049), (85, 0.065), (91, 0.116), (99, 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.87836587 196 nips-2003-Wormholes Improve Contrastive Divergence

Author: Max Welling, Andriy Mnih, Geoffrey E. Hinton

Abstract: In models that define probabilities via energies, maximum likelihood learning typically involves using Markov Chain Monte Carlo to sample from the model’s distribution. If the Markov chain is started at the data distribution, learning often works well even if the chain is only run for a few time steps [3]. But if the data distribution contains modes separated by regions of very low density, brief MCMC will not ensure that different modes have the correct relative energies because it cannot move particles from one mode to another. We show how to improve brief MCMC by allowing long-range moves that are suggested by the data distribution. If the model is approximately correct, these long-range moves have a reasonable acceptance rate.

same-paper 2 0.86626327 51 nips-2003-Design of Experiments via Information Theory

Author: Liam Paninski

Abstract: We discuss an idea for collecting data in a relatively efficient manner. Our point of view is Bayesian and information-theoretic: on any given trial, we want to adaptively choose the input in such a way that the mutual information between the (unknown) state of the system and the (stochastic) output is maximal, given any prior information (including data collected on any previous trials). We prove a theorem that quantifies the effectiveness of this strategy and give a few illustrative examples comparing the performance of this adaptive technique to that of the more usual nonadaptive experimental design. For example, we are able to explicitly calculate the asymptotic relative efficiency of the “staircase method” widely employed in psychophysics research, and to demonstrate the dependence of this efficiency on the form of the “psychometric function” underlying the output responses. 1

3 0.59357518 58 nips-2003-Efficient Multiscale Sampling from Products of Gaussian Mixtures

Author: Alexander T. Ihler, Erik B. Sudderth, William T. Freeman, Alan S. Willsky

Abstract: The problem of approximating the product of several Gaussian mixture distributions arises in a number of contexts, including the nonparametric belief propagation (NBP) inference algorithm and the training of product of experts models. This paper develops two multiscale algorithms for sampling from a product of Gaussian mixtures, and compares their performance to existing methods. The first is a multiscale variant of previously proposed Monte Carlo techniques, with comparable theoretical guarantees but improved empirical convergence rates. The second makes use of approximate kernel density evaluation methods to construct a fast approximate sampler, which is guaranteed to sample points to within a tunable parameter of their true probability. We compare both multiscale samplers on a set of computational examples motivated by NBP, demonstrating significant improvements over existing methods. 1

4 0.57420975 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates

Author: Peter L. Bartlett, Michael I. Jordan, Jon D. Mcauliffe

Abstract: Many classification algorithms, including the support vector machine, boosting and logistic regression, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0-1 loss function. We characterize the statistical consequences of using such a surrogate by providing a general quantitative relationship between the risk as assessed using the 0-1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial bounds under the weakest possible condition on the loss function—that it satisfy a pointwise form of Fisher consistency for classification. The relationship is based on a variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise. Finally, we present applications of our results to the estimation of convergence rates in the general setting of function classes that are scaled hulls of a finite-dimensional base class.

5 0.55471277 113 nips-2003-Learning with Local and Global Consistency

Author: Dengyong Zhou, Olivier Bousquet, Thomas N. Lal, Jason Weston, Bernhard Schölkopf

Abstract: We consider the general problem of learning from labeled and unlabeled data, which is often called semi-supervised learning or transductive inference. A principled approach to semi-supervised learning is to design a classifying function which is sufficiently smooth with respect to the intrinsic structure collectively revealed by known labeled and unlabeled points. We present a simple algorithm to obtain such a smooth solution. Our method yields encouraging experimental results on a number of classification problems and demonstrates effective use of unlabeled data. 1

6 0.55172473 126 nips-2003-Measure Based Regularization

7 0.54851264 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions

8 0.54093796 125 nips-2003-Maximum Likelihood Estimation of a Stochastic Integrate-and-Fire Neural Model

9 0.53404671 143 nips-2003-On the Dynamics of Boosting

10 0.53080058 115 nips-2003-Linear Dependent Dimensionality Reduction

11 0.53039026 23 nips-2003-An Infinity-sample Theory for Multi-category Large Margin Classification

12 0.52528268 172 nips-2003-Semi-Supervised Learning with Trees

13 0.52359164 91 nips-2003-Inferring State Sequences for Non-linear Systems with Embedded Hidden Markov Models

14 0.52221739 55 nips-2003-Distributed Optimization in Adaptive Networks

15 0.52185977 30 nips-2003-Approximability of Probability Distributions

16 0.51944894 161 nips-2003-Probabilistic Inference in Human Sensorimotor Processing

17 0.51898324 138 nips-2003-Non-linear CCA and PCA by Alignment of Local Models

18 0.51816922 78 nips-2003-Gaussian Processes in Reinforcement Learning

19 0.51684129 194 nips-2003-Warped Gaussian Processes

20 0.5159871 102 nips-2003-Large Scale Online Learning