nips nips2007 nips2007-174 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Andreas Krause, Brendan McMahan, Carlos Guestrin, Anupam Gupta
Abstract: In many applications, one has to actively select among a set of expensive observations before making an informed decision. Often, we want to select observations which perform well when evaluated with an objective function chosen by an adversary. Examples include minimizing the maximum posterior variance in Gaussian Process regression, robust experimental design, and sensor placement for outbreak detection. In this paper, we present the Submodular Saturation algorithm, a simple and efficient algorithm with strong theoretical approximation guarantees for the case where the possible objective functions exhibit submodularity, an intuitive diminishing returns property. Moreover, we prove that better approximation algorithms do not exist unless NP-complete problems admit efficient algorithms. We evaluate our algorithm on several real-world problems. For Gaussian Process regression, our algorithm compares favorably with state-of-the-art heuristics described in the geostatistics literature, while being simpler, faster and providing theoretical guarantees. For robust experimental design, our algorithm performs favorably compared to SDP-based algorithms. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Examples include minimizing the maximum posterior variance in Gaussian Process regression, robust experimental design, and sensor placement for outbreak detection. [sent-5, score-0.517]
2 For robust experimental design, our algorithm performs favorably compared to SDP-based algorithms. [sent-10, score-0.241]
3 1 Introduction In tasks such as sensor placement for environmental temperature monitoring or experimental design, one has to select among a large set of possible, but expensive, observations. [sent-11, score-0.199]
4 For example, in the environmental monitoring problem, we want to minimize the marginal posterior variance of our temperature estimate at all locations simultaneously. [sent-13, score-0.24]
5 Many such objectives (e.g., the marginal variance at one location, or the information gain for a fixed set of parameters [1, 2]) satisfy submodularity, an intuitive diminishing returns property: adding a new observation helps less if we have already made many observations, and more if we have made few observations thus far. [sent-19, score-0.218]
6 While NP-hard, the problem of selecting an optimal set of k observations maximizing a single submodular objective can be approximately solved using a simple greedy forward-selection algorithm, which is guaranteed to perform near-optimally [3]. [sent-20, score-0.739]
7 In particular: (1) We present S ATURATE, an efficient algorithm for settings where an adversarially-chosen submodular objective function must be optimized. [sent-23, score-0.517]
8 (3) We extensively evaluate our algorithm on several real-world tasks, including minimizing the maximum posterior variance in Gaussian Process regression, finding experiment designs which are robust with respect to parameter uncertainty, and sensor placement for outbreak detection. [sent-26, score-0.67]
9 2 The adversarial observation selection problem Observation selection with a single submodular objective. [sent-27, score-0.73]
10 Formally, F is submodular [3] if, for all A ⊆ B ⊆ V and s ∈ V \ B, it holds that F (A ∪ {s}) − F (A) ≥ F (B ∪ {s}) − F (B); F is monotonic if for all A ⊆ B ⊆ V it holds that F (A) ≤ F (B), and F is normalized if F (∅) = 0. [sent-30, score-0.516]
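As an illustrative aid (not from the paper), a minimal Python sketch that verifies these three properties by exhaustive enumeration for a set function F on a small ground set V; F is assumed to accept any set-like argument:

from itertools import combinations

def subsets(V):
    for r in range(len(V) + 1):
        for A in combinations(V, r):
            yield frozenset(A)

def is_submodular(F, V):
    # diminishing returns: F(A ∪ {s}) − F(A) ≥ F(B ∪ {s}) − F(B) for all A ⊆ B, s ∉ B
    return all(F(A | {s}) - F(A) >= F(B | {s}) - F(B) - 1e-12
               for B in subsets(V) for A in subsets(B) for s in set(V) - B)

def is_monotonic(F, V):
    # F(A) ≤ F(A ∪ {s}) for every A and s implies F(A) ≤ F(B) for all A ⊆ B
    return all(F(A) <= F(A | {s}) + 1e-12 for A in subsets(V) for s in set(V) - A)

def is_normalized(F):
    return abs(F(frozenset())) < 1e-12

Exhaustive checking is exponential in |V|, so this is only practical as a sanity check on toy objectives before running the greedy procedures discussed below.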
11 The classic result of [3] states that for submodular functions, the greedy algorithm achieves a constant-factor approximation: the set AG obtained by the greedy algorithm achieves at least a constant fraction (1 − 1/e) of the objective value obtained by the optimal solution, i.e., F (AG ) ≥ (1 − 1/e) max|A|≤k F (A). [sent-39, score-0.906]
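The greedy forward-selection algorithm itself is only a few lines; the following sketch (our own illustration) assumes F is normalized, monotonic and submodular and operates on Python sets:

def greedy(F, V, k):
    A = set()
    for _ in range(k):
        # add the element with the largest marginal gain
        s_best = max((s for s in V if s not in A),
                     key=lambda s: F(A | {s}) - F(A))
        A.add(s_best)
    return A

Under the stated assumptions, the returned set AG satisfies the (1 − 1/e) guarantee of [3].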
12 Here, we are given a collection of monotonic submodular functions F1 , . . . , Fm . [sent-45, score-0.516]
13 In fact, we show below that, in this setting, the simple greedy algorithm (which performs near-optimally in the single-criterion setting) can perform arbitrarily badly. [sent-55, score-0.234]
14 We want to select locations A such that the maximum marginal variance is as small as possible. [sent-72, score-0.274]
15 Equivalently, we can define the variance reduction Fs (A) = σs² − σ²s|A , and desire that the minimum variance reduction over all locations s is as large as possible. [sent-73, score-0.306]
16 Das and Kempe [1] show that, in many practical cases, the variance reduction Fs is a monotonic submodular function. [sent-74, score-0.642]
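For concreteness, a small numpy sketch of this objective (our illustration; it assumes a given prior covariance matrix K over all locations and noise-free observations):

import numpy as np

def variance_reduction(K, s, A):
    # F_s(A) = sigma_s^2 - sigma^2_{s|A}, via standard Gaussian conditioning
    if not A:
        return 0.0
    A = list(A)
    K_AA = K[np.ix_(A, A)]
    k_sA = K[s, A]
    posterior_var = K[s, s] - k_sA @ np.linalg.solve(K_AA, k_sA)
    return K[s, s] - posterior_var

Each F_s can then be fed into the adversarial (minimum over s) objective discussed next.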
17 Another application is experimental design under nonlinear dynamics [7]. [sent-76, score-0.172]
18 In many cases, experimental design for linear models (where y = A(x)T θ + w) with Gaussian noise w can be efficiently solved [8]. [sent-78, score-0.172]
19 In [7], it was shown that the efficiency of the design can be very sensitive with respect to the initial parameter estimates θ0 . [sent-82, score-0.216]
20 One can instead seek a design which optimizes a robust criterion (e.g., the goal is to minimize the maximum eigenvalue of the error covariance) which is robust against perturbations of the Jacobian Vθ0 . [sent-85, score-0.194]
21 We show how to find (Bayesian A-optimal) designs which are robust against uncertainty in these parameter estimates. [sent-87, score-0.26]
22 Hence, in order to find a robust design, we maximize the minimum variance reduction, where the minimum is taken over (a discretization into a finite subset of) all initial parameter values θ0 . [sent-89, score-0.267]
23 Another class of examples consists of outbreak detection problems on graphs, such as contamination detection in water distribution networks [9]. [sent-91, score-0.341]
24 We define a set of intrusion scenarios I; each scenario i ∈ I models an outbreak (e.g., a contamination spreading through the network). [sent-93, score-0.262]
25 In [9], it was shown that these utilities Fi are monotonic and submodular for a large class of utility functions. [sent-99, score-0.546]
26 In the adversarial setting, the adversary observes our sensor placement A, and then decides on an intrusion i for which our utility Fi (A) is as small as possible. [sent-100, score-0.443]
27 Hence, our goal is to find a placement A which performs well against such an adversarial opponent. [sent-101, score-0.271]
28 Given the near-optimal performance of the greedy algorithm for the single-objective problem, a natural question is whether the performance guarantee generalizes to the more complex adversarial setting. [sent-103, score-0.401]
29 Consider the case with two submodular functions, F1 and F2 , where the set of observations is V = {s1 , s2 , t1 , t2 }. [sent-105, score-0.51]
30 Optimizing for a set of 2 elements, the greedy algorithm maximizing G(A) = min{F1 (A), F2 (A)} would choose the set {t1 , t2 }, since each such choice increases G by ε (2ε in total), whereas adding an si alone would not increase the score. [sent-109, score-0.195]
31 Hence, as ε → 0, the greedy algorithm performs arbitrarily worse than the optimal solution. [sent-111, score-0.26]
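This failure is easy to reproduce numerically; the following is a toy reconstruction of the example (the paper's exact construction may differ in its details), showing greedy stalling at 2ε while the optimum achieves 1:

eps = 0.01
def F1(A): return (1.0 if 's1' in A else 0.0) + eps * len(A & {'t1', 't2'})
def F2(A): return (1.0 if 's2' in A else 0.0) + eps * len(A & {'t1', 't2'})
def G(A):  return min(F1(A), F2(A))

# greedy on G picks t1 then t2 (each adds eps; adding s1 or s2 alone adds 0)
print(G({'t1', 't2'}), G({'s1', 's2'}))  # 0.02 vs. 1.0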
32 If there exists a polynomial-time algorithm which is guaranteed to find a set A′ of size k such that mini Fi (A′ ) ≥ γ(n) max|A|≤k mini Fi (A), then P = NP. [sent-117, score-0.221]
33 For c > 0 define Fi,c (A) = min{Fi (A), c}, the original function Fi truncated at score level c; these Fi,c functions are also submodular [10]. [sent-136, score-0.484]
34 GPC(F̄c , c):
    A ← ∅;
    while F̄c (A) < c do
        foreach s ∈ V \ A do δs ← F̄c (A ∪ {s}) − F̄c (A);
        A ← A ∪ {argmaxs δs };
Algorithm 1: The greedy submodular partial cover (GPC) algorithm. [sent-137, score-0.623]
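A direct Python transcription of Algorithm 1 (a sketch; F_c is assumed monotonic submodular, and a tolerance guards against floating-point issues):

def gpc(F_c, V, c, tol=1e-9):
    A = set()
    while F_c(A) < c - tol:
        gains = {s: F_c(A | {s}) - F_c(A) for s in V if s not in A}
        if not gains or max(gains.values()) <= 0:
            break  # cannot make further progress
        A.add(max(gains, key=gains.get))
    return A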
35 SATURATE(F1 , . . . , Fm , k, α):
    cmin ← 0; cmax ← mini Fi (V); Abest ← ∅;
    while (cmax − cmin ) ≥ 1/m do
        c ← (cmin + cmax )/2;
        ∀A define F̄c (A) ← (1/m) Σi min{Fi (A), c};
        A′ ← GPC(F̄c , c);
        if |A′ | > αk then cmax ← c; else { cmin ← c; Abest ← A′ };
Algorithm 2: The Submodular Saturation algorithm (SATURATE). [sent-141, score-0.475]
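And a sketch of Algorithm 2 built on the gpc() sketch above (the objectives F_i are assumed normalized, monotonic and submodular; alpha = 1 recovers the variant used in the experiments below):

def saturate(Fs, V, k, alpha=1.0):
    m = len(Fs)
    c_min, c_max = 0.0, min(F(set(V)) for F in Fs)
    A_best = set()
    while c_max - c_min >= 1.0 / m:
        c = (c_min + c_max) / 2.0
        F_bar = lambda A, c=c: sum(min(F(A), c) for F in Fs) / m
        A = gpc(F_bar, V, c)  # feasibility test: can level c be saturated?
        if len(A) > alpha * k:
            c_max = c  # c is infeasible with alpha*k elements
        else:
            c_min, A_best = c, A
    return A_best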
36 Let F̄c (A) = (1/m) Σi Fi,c (A) be their average value; submodular functions are closed under convex combinations, so F̄c is submodular and monotonic. [sent-142, score-0.91]
37 Hence, in order to determine whether some c is feasible, we solve a submodular covering problem: Ac = argminA⊆V |A| such that F̄c (A) = c. (2) [sent-144, score-0.455]
38 Such problems are NP-hard in general [4], but in [11] it is shown that the greedy algorithm (cf. Algorithm 1) solves them near-optimally. [sent-146, score-0.195]
39 We can compute this approximation guarantee α for any given instance of the adversarial observation selection problem. [sent-154, score-0.283]
40 Hence, if for a given value of c the greedy algorithm returns a set of size greater than αk, there cannot exist a solution A′ with |A′ | ≤ k with Fi (A′ ) ≥ c for all i; thus, the optimal solution to the adversarial observation selection problem must be less than c. [sent-155, score-0.465]
41 We call Algorithm 2, which formalizes this procedure, the submodular saturation algorithm (S ATURATE), as the algorithm considers the truncated objectives Fi,c , and chooses sets which saturate all these objectives. [sent-157, score-0.742]
42 Theorem 3 (given below) states that S ATURATE is guaranteed to find a set which achieves adversarial score mini Fi at least as high as the optimal solution, if we allow the set to be logarithmically larger than the optimal solution. [sent-158, score-0.345]
43 For any integer k, SATURATE finds a solution AS such that mini Fi (AS ) ≥ max|A|≤k mini Fi (A) and |AS | ≤ αk, for α = 1 + log(maxs∈V Σi Fi ({s})). [sent-160, score-0.194]
44 The total number of submodular function evaluations is O(|V|² m log(Σi Fi (V))). [sent-161, score-0.455]
45 If we had an exact algorithm for submodular coverage, α = 1 would be the correct choice. [sent-164, score-0.482]
46 Since the greedy algorithm solves submodular coverage very effectively, in our experiments, we call S ATURATE with α = 1, which empirically performs very well. [sent-165, score-0.689]
47 If there were a polynomial time algorithm which, for any integer k, is guaranteed to find a solution AS such that mini Fi (AS ) ≥ max|A|≤k mini Fi (A) and |AS | ≤ βk, where β ≤ (1 − ε)(1 + log maxs∈V Σi Fi ({s})) for some fixed ε > 0, then NP ⊆ DTIME(n^(log log n)). [sent-173, score-0.221]
48 However, rather than guaranteeing that Fi (A) ≥ c for all i (which, in this example, means that the minimum variance reduction is c), we want to guarantee that σ²i|A ≤ c for all i. [sent-188, score-0.161]
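In code, this amounts to truncating each variance-reduction objective at its own level, as in this sketch (our reading of the construction; prior_vars[i] holds σi² and Fs[i] the variance reduction Fi):

def truncated_objectives(Fs, prior_vars, c):
    # sigma^2_{i|A} <= c  is equivalent to  F_i(A) >= sigma_i^2 - c
    return [lambda A, F=F, t=max(v - c, 0.0): min(F(A), t)
            for F, v in zip(Fs, prior_vars)]

The binary search of SATURATE is then run over the variance level c, with the feasibility test asking whether every truncated objective can be saturated.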
49 In the geostatistics literature, the predominant optimization algorithms are carefully tuned local search procedures, most prominently simulated annealing. [sent-202, score-0.243]
50 We compare our S ATURATE algorithm against a state-of-the-art implementation of such a simulated annealing (SA) algorithm, first proposed by [14]. [sent-205, score-0.245]
51 This algorithm has 7 parameters which need to be tuned, describing the annealing schedule, distribution of iterations among several inner loops, etc. [sent-207, score-0.169]
52 Fig. 1(a) compares simulated annealing, SATURATE, and the greedy algorithm, which selects the elements that decrease the maximum variance the most. [sent-211, score-0.394]
53 We also used S ATURATE to initialize the simulated annealing algorithm (using only a single run of simulated annealing, as opposed to 10 random trials). [sent-212, score-0.321]
54 S ATURATE obtains placements which are drastically better than the placements obtained by the greedy algorithm. [sent-213, score-0.346]
55 Furthermore, the performance is very close to the performance of the simulated annealing algorithm. [sent-214, score-0.218]
56 When selecting 30 or more sensors, SATURATE strictly outperforms the simulated annealing algorithm. [sent-215, score-0.218]
57 Using SATURATE to initialize the simulated annealing algorithm almost always resulted in the best solutions we were able to find, while still executing faster than simulated annealing with 10 random restarts as proposed by [15]. [sent-218, score-0.436]
58 Hence we compared placements obtained by S ATURATE, minimizing the maximum marginal posterior variance, with placements obtained by the greedy algorithm, where we minimize the average marginal variance. [sent-221, score-0.455]
59 Note that, whereas the reduction of the maximum variance is non-submodular, the average variance reduction is (often) submodular [1], and hence the greedy algorithm can be expected to provide near-optimal placements. [sent-222, score-0.936]
60 Figure 1: (a) SATURATE, greedy and SA on the precipitation data. [sent-250, score-0.295]
61 (c) Optimizing for the maximum variance (using S ATURATE) leads to low average variance, but optimizing for average variance (using greedy) does not lead to low maximum variance. [sent-253, score-0.273]
62 We consider the robust design of experiments for the Michaelis-Menten mass-action kinetics model, as discussed in [7]. [sent-257, score-0.25]
63 Classical experimental design considers the error covariance of the least squares estimate θ̂, Cov(θ̂ | θ0 , w) = σ²(Vθ0ᵀ W Vθ0 )⁻¹, where W = diag(w), and aims to find designs w which minimize this error covariance. [sent-265, score-0.351]
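For reference, a short numpy sketch of this criterion (our illustration; V_theta0 is the n × p Jacobian at θ0, w a vector of nonnegative design weights, sigma2 the noise variance):

import numpy as np

def error_covariance(V_theta0, w, sigma2=1.0):
    W = np.diag(w)
    return sigma2 * np.linalg.inv(V_theta0.T @ W @ V_theta0)

def e_optimality(V_theta0, w, sigma2=1.0):
    # E-optimal designs minimize the largest eigenvalue of the error covariance
    return np.linalg.eigvalsh(error_covariance(V_theta0, w, sigma2)).max()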
64 To avoid this problem, [7] find a design wρ (θ0 ) by solving a robust SDP which minimizes the error size, subject to a worst-case (adversarially-chosen) perturbation ∆ on the Jacobian Vθ0 ; the robustness parameter ρ bounds the spectral norm of ∆. [sent-270, score-0.308]
65 As evaluation criterion, [7] define a notion of efficiency, which is the error size of the optimal design with correct initial parameter estimate, divided by the error when using a robust design obtained at the wrong initial parameter estimates, i.e., [sent-271, score-0.564]
66 efficiency ≡ λmax [Cov(θ̂ | θtrue , wopt (θtrue ))] / λmax [Cov(θ̂ | θtrue , wρ (θ0 ))], where wopt (θ) is the E-optimal design for parameter θ. [sent-273, score-0.23]
67 They show that for appropriately chosen values of ρ, the robust design is more efficient than the optimal design, if the initial parameter θ0 does not equal the true parameter. [sent-274, score-0.348]
68 For example, if the function f describes a process which behaves characteristically differently in different “phases”, and the parameter θ controls which of the phases the process is in, then a robust design should intuitively “hedge” against the behavior in each possible phase. [sent-278, score-0.451]
69 In such a case, the uniform distribution (which the robust SDP chooses for large ρ) would not be the most robust design. [sent-279, score-0.212]
70 If we discretize the space of possible parameter perturbations (within a reasonably chosen interval), we can use S ATURATE to find robust experimental designs. [sent-280, score-0.189]
71 While the classical E-optimality is not submodular [2], Bayesian A-optimality is (often) submodular [1, 2]. [sent-281, score-0.91]
72 [Figure 2 axes: (a) efficiency vs. initial parameter estimate θ02 (curves: robust SDP with ρ = 10⁻³, classical E-optimal design); (b, c) maximum population affected vs. number of sensors, 5–30 (curves: Simulated Annealing, Saturate + SA).] [sent-295, score-0.246]
73 Figure 2: (a) Efficiency of the robust SDP of [7] and SATURATE on a biological experimental design problem. [sent-300, score-0.503]
74 (b,c) S ATURATE, greedy and SA in the water network setting, when optimizing worst-case detection time (Z1 ) and affected population (Z2 ). [sent-302, score-0.401]
75 We first optimized designs which are robust against a small perturbation of the initial parameter estimate. [sent-306, score-0.304]
76 In region B, which contains the true parameter setting, the E-optimal design (which is optimal if the true parameter is known, i.e., if θ0 = θtrue ) performs well. [sent-312, score-0.226]
77 Outside of region B, however, where the standard E-optimal design performs badly, neither robust design performs well either. [sent-318, score-0.415]
78 Consequently, we compared designs which are robust against a large parameter range. [sent-320, score-0.26]
79 For SATURATE, we optimized a single design which maximizes the minimum normalized variance reduction over all values of θ02 between 0 and 16. [sent-323, score-0.304]
80 Fig. 2(a) shows that, in this case, the design obtained by SATURATE achieves an efficiency of 69%, whereas the efficiency of the SDP design is only 52%. [sent-325, score-0.288]
81 In the regions A and C, the S ATURATE design strictly outperforms the other robust designs. [sent-326, score-0.25]
82 This experiment indicates that designs which are robust against a large range of initial parameter estimates, as provided by S ATURATE, can be more efficient than designs which are robust against perturbations of the Jacobian (the SDP approach). [sent-327, score-0.563]
83 Consider a city water distribution network, delivering water to households via a system of pipes, pumps, and junctions. [sent-329, score-0.166]
84 In August 2006, the Battle of Water Sensor Networks (BWSN) [16] was organized as an international challenge to find the best sensor placements for a real (but anonymized) metropolitan water distribution network, consisting of 12,527 nodes. [sent-331, score-0.238]
85 In this challenge, a set of intrusion scenarios is specified, and for each scenario a realistic simulator provided by the EPA [17] is used to simulate the spread of the contaminant for a 48 hour period. [sent-332, score-0.175]
86 In this paper, we consider the adversarial setting, where an opponent chooses the contamination scenario with knowledge of the sensor locations. [sent-336, score-0.321]
87 The objective functions Z1 and Z2 are in fact submodular for a fixed intrusion scenario [9], and so the adversarial problem of minimizing the impact of the worst possible intrusion fits into our model. [sent-337, score-0.891]
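Concretely, with per-scenario penalty-reduction functions Fs (one per intrusion i), the adversarial objective is simply their pointwise minimum, and the saturate() sketch given earlier applies unchanged:

def adversarial_objective(Fs, A):
    return min(F(A) for F in Fs)

# placement = saturate(Fs, V=sensor_locations, k=num_sensors)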
88 Figures 2(b) and 2(c) compare the greedy algorithm, S ATURATE and the simulated annealing (SA) algorithm for the problem of maximizing the worst-case detection time (Z1 ) and worst-case affected population (Z2 ). [sent-339, score-0.536]
89 For the affected population (Z2 ), greedy performs reasonably, and SA sometimes even outperforms S ATURATE. [sent-341, score-0.281]
90 For the detection time (Z1 ), however, the greedy algorithm did not improve the objective at all, and SA performs poorly. [sent-342, score-0.318]
91 Hence, there is a strong “gradient”, as the adversarial objective changes quickly when the high impact scenarios are covered. [sent-344, score-0.276]
92 Hence, many node exchange proposals considered by SA, as well as the addition of a new sensor location by greedy, do not change the adversarial objective, and the algorithms have no useful performance metric. [sent-348, score-0.246]
93 We demonstrated how this class of problems encompasses the problem of finding designs which minimize the maximum posterior variance in Gaussian Process regression, robust experimental design, and detecting events spreading over graphs. [sent-351, score-0.448]
94 In each of these settings, the individual objectives are submodular and can be approximately optimized using, e.g., the greedy algorithm. [sent-352, score-0.524]
95 The adversarial objective, however, is not submodular. [sent-354, score-0.295]
96 We proved that there cannot exist any approximation algorithm for the adversarial problem if the constraint on the observation set size must be exactly met, unless P = NP. [sent-355, score-0.286]
97 For robust experimental design, SATURATE performs favorably compared to SDP-based approaches. [sent-360, score-0.214]
98 An analysis of the approximations for maximizing submodular set functions. [sent-380, score-0.455]
99 An analysis of the greedy algorithm for the submodular set covering problem. [sent-430, score-0.65]
100 Battle of water sensor networks: A design challenge for engineers and algorithms. [sent-464, score-0.306]
wordName wordTfidf (topN-words)
[('aturate', 0.552), ('submodular', 0.455), ('fi', 0.249), ('greedy', 0.168), ('adversarial', 0.167), ('sa', 0.16), ('design', 0.144), ('annealing', 0.142), ('saturate', 0.127), ('designs', 0.126), ('outbreak', 0.116), ('robust', 0.106), ('sdp', 0.105), ('intrusion', 0.102), ('mini', 0.097), ('variance', 0.089), ('water', 0.083), ('sensor', 0.079), ('simulated', 0.076), ('placements', 0.076), ('jacobian', 0.075), ('objectives', 0.069), ('placement', 0.065), ('cmax', 0.063), ('cmin', 0.063), ('monotonic', 0.061), ('krause', 0.058), ('ag', 0.058), ('observations', 0.055), ('maxs', 0.054), ('locations', 0.054), ('sensors', 0.053), ('detection', 0.049), ('guestrin', 0.046), ('unless', 0.046), ('observation', 0.046), ('initial', 0.044), ('scenarios', 0.044), ('bwsn', 0.044), ('contamination', 0.044), ('dtime', 0.044), ('hereby', 0.044), ('scs', 0.044), ('submodularity', 0.044), ('linearization', 0.043), ('xa', 0.043), ('favorably', 0.041), ('np', 0.041), ('affected', 0.041), ('xv', 0.041), ('gp', 0.04), ('performs', 0.039), ('guarantee', 0.039), ('kriging', 0.038), ('precipitation', 0.038), ('nlog', 0.038), ('spreading', 0.038), ('reduction', 0.037), ('saturation', 0.037), ('marginal', 0.037), ('objective', 0.035), ('gpc', 0.035), ('maximum', 0.034), ('want', 0.033), ('population', 0.033), ('cov', 0.032), ('selection', 0.031), ('opponent', 0.031), ('impact', 0.03), ('utility', 0.03), ('robustness', 0.03), ('abest', 0.029), ('characteristically', 0.029), ('cheapest', 0.029), ('contaminant', 0.029), ('nemhauser', 0.029), ('wopt', 0.029), ('cmu', 0.029), ('fm', 0.029), ('carlos', 0.029), ('score', 0.029), ('parameter', 0.028), ('experimental', 0.028), ('reductions', 0.028), ('algorithm', 0.027), ('optimizing', 0.027), ('select', 0.027), ('ciency', 0.027), ('perturbations', 0.027), ('minimize', 0.027), ('optimal', 0.026), ('covariance', 0.026), ('drastically', 0.026), ('paci', 0.025), ('argmaxs', 0.025), ('geostatistics', 0.025), ('gupta', 0.025), ('anupam', 0.025), ('mcmahan', 0.025), ('heuristics', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 174 nips-2007-Selecting Observations against Adversarial Objectives
Author: Andreas Krause, Brendan Mcmahan, Carlos Guestrin, Anupam Gupta
Abstract: In many applications, one has to actively select among a set of expensive observations before making an informed decision. Often, we want to select observations which perform well when evaluated with an objective function chosen by an adversary. Examples include minimizing the maximum posterior variance in Gaussian Process regression, robust experimental design, and sensor placement for outbreak detection. In this paper, we present the Submodular Saturation algorithm, a simple and efficient algorithm with strong theoretical approximation guarantees for the case where the possible objective functions exhibit submodularity, an intuitive diminishing returns property. Moreover, we prove that better approximation algorithms do not exist unless NP-complete problems admit efficient algorithms. We evaluate our algorithm on several real-world problems. For Gaussian Process regression, our algorithm compares favorably with state-of-the-art heuristics described in the geostatistics literature, while being simpler, faster and providing theoretical guarantees. For robust experimental design, our algorithm performs favorably compared to SDP-based algorithms. 1
2 0.098880686 178 nips-2007-Simulated Annealing: Rigorous finite-time guarantees for optimization on continuous domains
Author: Andrea Lecchini-Visintini, John Lygeros, Jan Maciejowski
Abstract: Simulated annealing is a popular method for approaching the solution of a global optimization problem. Existing results on its performance apply to discrete combinatorial optimization where the optimization variables can assume only a finite set of possible values. We introduce a new general formulation of simulated annealing which allows one to guarantee finite-time performance in the optimization of functions of continuous variables. The results hold universally for any optimization problem on a bounded domain and establish a connection between simulated annealing and up-to-date theory of convergence of Markov chain Monte Carlo methods on continuous domains. This work is inspired by the concept of finite-time learning with known accuracy and confidence developed in statistical learning theory. Optimization is the general problem of finding a value of a vector of variables θ that maximizes (or minimizes) some scalar criterion U (θ). The set of all possible values of the vector θ is called the optimization domain. The elements of θ can be discrete or continuous variables. In the first case the optimization domain is usually finite, such as in the well-known traveling salesman problem; in the second case the optimization domain is a continuous set. An important example of a continuous optimization domain is the set of 3-D configurations of a sequence of amino-acids in the problem of finding the minimum energy folding of the corresponding protein [1]. In principle, any optimization problem on a finite domain can be solved by an exhaustive search. However, this is often beyond computational capacity: the optimization domain of the traveling salesman problem with 100 cities contains more than 10^155 possible tours. An efficient algorithm to solve the traveling salesman and many similar problems has not yet been found and such problems remain reliably solvable only in principle [2]. Statistical mechanics has inspired widely used methods for finding good approximate solutions in hard discrete optimization problems which defy efficient exact solutions [3, 4, 5, 6]. Here a key idea has been that of simulated annealing [3]: a random search based on the Metropolis-Hastings algorithm, such that the distribution of the elements of the domain visited during the search converges to an equilibrium distribution concentrated around the global optimizers. Convergence and finite-time performance of simulated annealing on finite domains has been evaluated in many works, e.g. [7, 8, 9, 10]. On continuous domains, most popular optimization methods perform a local gradient-based search and in general converge to local optimizers; with the notable exception of convex criteria where convergence to the unique global optimizer occurs [11]. Simulated annealing performs a global search and can be easily implemented on continuous domains. Hence it can be considered a powerful complement to local methods. In this paper, we introduce for the first time rigorous guarantees on the finite-time performance of simulated annealing on continuous domains. We will show that it is possible to derive simulated annealing algorithms which, with an arbitrarily high level of confidence, find an approximate solution to the problem of optimizing a function of continuous variables, within a specified tolerance to the global optimal solution after a known finite number of steps.
Rigorous guarantees on the finite-time performance of simulated annealing in the optimization of functions of continuous variables have never been obtained before; the only results available state that simulated annealing converges to a global optimizer as the number of steps grows to infinity, e.g. [12, 13, 14, 15]. The background of our work is twofold. On the one hand, our notion of approximate solution to a global optimization problem is inspired by the concept of finite-time learning with known accuracy and confidence developed in statistical learning theory [16, 17]. We actually maintain an important aspect of statistical learning theory which is that we do not introduce any particular assumption on the optimization criterion, i.e. our results hold regardless of what U is. On the other hand, we ground our results on the theory of convergence, with quantitative bounds on the distance to the target distribution, of the Metropolis-Hastings algorithm and Markov Chain Monte Carlo (MCMC) methods, which has been one of the main achievements of recent research in statistics [18, 19, 20, 21]. In this paper, we will not develop any ready-to-use optimization algorithm. We will instead introduce a general formulation of the simulated annealing method which allows one to derive new simulated annealing algorithms with rigorous finite-time guarantees on the basis of existing theory. The Metropolis-Hastings algorithm and the general family of MCMC methods have many degrees of freedom. The choice and comparison of specific algorithms goes beyond the scope of the paper. The paper is organized in the following sections. In Simulated annealing we introduce the method and fix the notation. In Convergence we recall the reasons why finite-time guarantees for simulated annealing on continuous domains have not been obtained before. In Finite-time guarantees we present the main result of the paper. In Conclusions we state our findings and conclude the paper. 1 Simulated annealing The original formulation of simulated annealing was inspired by the analogy between the stochastic evolution of the thermodynamic state of an annealing material towards the configurations of minimal energy and the search for the global minimum of an optimization criterion [3]. In the procedure, the optimization criterion plays the role of the energy and the state of the annealed material is simulated by the evolution of the state of an inhomogeneous Markov chain. The state of the chain evolves according to the Metropolis-Hastings algorithm in order to simulate the Boltzmann distribution of thermodynamic equilibrium. The Boltzmann distribution is simulated for a decreasing sequence of temperatures (“cooling”). The target distribution of the cooling procedure is the limiting Boltzmann distribution, for the temperature that tends to zero, which takes non-zero values only on the set of global minimizers [7]. The original formulation of the method was for a finite domain. However, simulated annealing can be generalized straightforwardly to a continuous domain because the Metropolis-Hastings algorithm can be used with almost no differences on discrete and continuous domains. The main difference is that on a continuous domain the equilibrium distributions are specified by probability densities.
On a continuous domain, Markov transition kernels in which the distribution of the elements visited by the chain converges to an equilibrium distribution with the desired density can be constructed using the Metropolis-Hastings algorithm and the general family of MCMC methods [22]. We point out that Boltzmann distributions are not the only distributions which can be adopted as equilibrium distributions in simulated annealing [7]. In this paper it is convenient for us to adopt a different type of equilibrium distribution in place of Boltzmann distributions. 1.1 Our setting The optimization criterion is U : Θ → [0, 1], with Θ ⊂ R^N. The assumption that U takes values in the interval [0, 1] is a technical one. It does not imply any serious loss of generality. In general, any bounded optimization criterion can be scaled to take values in [0, 1]. We assume that the optimization task is to find a global maximizer; this can be done without loss of generality. We also assume that Θ is a bounded set. We consider equilibrium distributions defined by probability density functions proportional to [U (θ) + δ]^J where J and δ are two strictly positive parameters. We use π (J) to denote an equilibrium distribution, i.e. π (J) (dθ) ∝ [U (θ) + δ]^J πLeb (dθ) where πLeb is the standard Lebesgue measure. Here, J −1 plays the role of the temperature: if the function U (θ) plus δ is taken to a positive power J then as J increases (i.e. as J −1 decreases) [U (θ) + δ]^J becomes increasingly peaked around the global maximizers. The parameter δ is an offset which guarantees that the equilibrium densities are always strictly positive, even if U takes zero values on some elements of the domain. The offset δ is chosen by the user and we show later that our results allow one to make an optimal selection of δ. The zero-temperature distribution is the limiting distribution, for J → ∞, which takes non-zero values only on the set of global maximizers. It is denoted by π (∞). In the generic formulation of the method, the Markov transition kernel of the k-th step of the inhomogeneous chain has equilibrium distribution π (Jk ) where {Jk }k=1,2,... is the “cooling schedule”. The cooling schedule is a non-decreasing sequence of positive numbers according to which the equilibrium distributions become increasingly sharpened during the evolution of the chain. We use θk to denote the state of the chain and Pθk to denote its probability distribution. The distribution Pθk obviously depends on the initial condition θ0 . However, in this work, we don’t need to make this dependence explicit in the notation. Remark 1: If, given an element θ in Θ, the value U (θ) can be computed directly, we say that U is a deterministic criterion, e.g. the energy landscape in protein structure prediction [1]. In problems involving random variables, the value U (θ) may be the expected value U (θ) = ∫ g(x, θ) px (x; θ) dx of some function g which depends on both the optimization variable θ, and on some random variable x which has probability density px (x; θ) (which may itself depend on θ). In such problems it is usually not possible to compute U (θ) directly, either because evaluation of the integral requires too much computation, or because no analytical expression for px (x; θ) is available. Typically one must perform stochastic simulations in order to obtain samples of x for a given θ, hence obtain sample values of g(x, θ), and thus construct a Monte Carlo estimate of U (θ).
The Bayesian design of clinical trials is an important application area where such expected-value criteria arise [23]. The authors of this paper investigate the optimization of expected-value criteria motivated by problems of aircraft routing [24]. In the particular case that px (x; θ) does not depend on θ, the optimization task is often called “empirical risk minimization”, and is studied extensively in statistical learning theory [16, 17]. The results of this paper apply in the same way to the optimization of both deterministic and expected-value criteria. The MCMC method developed by Müller [25, 26] allows one to construct simulated annealing algorithms for the optimization of expected-value criteria. Müller [25, 26] employs the same equilibrium distributions as those described in our setting; in his context J is restricted to integer values. 2 Convergence The rationale of simulated annealing is as follows: if the temperature is kept constant, say Jk = J, then the distribution of the state of the chain Pθk tends to the equilibrium distribution π (J); if J → ∞ then the equilibrium distribution π (J) tends to the zero-temperature distribution π (∞); as a result, if the cooling schedule Jk tends to infinity, one obtains that Pθk “follows” π (Jk ) and that π (Jk ) tends to π (∞) and eventually that the distribution of the state of the chain Pθk tends to π (∞). The theory shows that, under conditions on the cooling schedule and the Markov transition kernels, the distribution of the state of the chain Pθk actually converges to the target zero-temperature distribution π (∞) as k → ∞ [12, 13, 14, 15]. Convergence to the zero-temperature distribution implies that asymptotically the state of the chain eventually coincides with a global optimizer with probability one. The difficulty which must be overcome in order to obtain finite step results on simulated annealing algorithms on a continuous domain is that usually, in an optimization problem defined over continuous variables, the set of global optimizers has zero Lebesgue measure (e.g. a set of isolated points). If the set of global optimizers has zero measure then the set of global optimizers has null probability according to the equilibrium distributions π (J) for any finite J and, as a consequence, according to the distributions Pθk for any finite k. Put another way, the probability that the state of the chain visits the set of global optimizers is constantly zero after any finite number of steps. Hence the confidence of the fact that the solution provided by the algorithm in finite time coincides with a global optimizer is also constantly zero. Notice that this is not the case for a finite domain, where the set of global optimizers is of non-null measure with respect to the reference counting measure [7, 8, 9, 10]. It is instructive to look at the issue also in terms of the rate of convergence to the target zero-temperature distribution. On a discrete domain, the distribution of the state of the chain at each step and the zero-temperature distribution are both standard discrete distributions. It is then possible to define a distance between them and study the rate of convergence of this distance to zero. This analysis allows one to obtain results on the finite-time behavior of simulated annealing [7, 8]. On a continuous domain and for a set of global optimizers of measure zero, the target zero-temperature distribution π (∞) ends up being a mixture of probability masses on the set of global optimizers.
In this situation, although the distribution of the state of the chain Pθk still converges asymptotically to π (∞), it is not possible to introduce a sensible distance between the two distributions and a rate of convergence to the target distribution cannot even be defined (weak convergence), see [12, Theorem 3.3]. This is the reason that until now there have been no guarantees on the performance of simulated annealing on a continuous domain after a finite number of computations: by adopting the zero-temperature distribution π (∞) as the target distribution it is only possible to prove asymptotic convergence in infinite time to a global optimizer. Remark 2: The standard distance between two distributions, say µ1 and µ2 , on a continuous support is the total variation norm ‖µ1 − µ2 ‖TV = supA |µ1 (A) − µ2 (A)|, see e.g. [21]. In simulated annealing on a continuous domain the distribution of the state of the chain Pθk is absolutely continuous with respect to the Lebesgue measure (i.e. πLeb (A) = 0 ⇒ Pθk (A) = 0), by construction for any finite k. Hence if the set of global optimizers has zero Lebesgue measure then it has zero measure also according to Pθk . The set of global optimizers has however measure 1 according to π (∞). The distance ‖Pθk − π (∞) ‖TV is then constantly 1 for any finite k. It is also worth mentioning that if the set of global optimizers has zero measure then asymptotic convergence to the zero-temperature distribution π (∞) can be proven only under the additional assumptions of continuity and differentiability of U [12, 13, 14, 15]. 3 Finite-time guarantees In general, optimization algorithms for problems defined on continuous variables can only find approximate solutions in finite time [27]. Given an element θ of a continuous domain how can we assess how good it is as an approximate solution to an optimization problem? Here we introduce the concept of approximate global optimizer to answer this question. The definition is given for a maximization problem in a continuous but bounded domain. We use two parameters: the value imprecision ǫ (greater than or equal to 0) and the residual domain α (between 0 and 1) which together determine the level of approximation. We say that θ is an approximate global optimizer of U with value imprecision ǫ and residual domain α if the function U takes values strictly greater than U (θ) + ǫ only on a subset of values of θ no larger than an α portion of the optimization domain. The formal definition is as follows. Definition 1 Let U : Θ → R be an optimization criterion where Θ ⊂ R^N is bounded. Let πLeb denote the standard Lebesgue measure. Let ǫ ≥ 0 and α ∈ [0, 1] be given numbers. Then θ is an approximate global optimizer of U with value imprecision ǫ and residual domain α if πLeb {θ′ ∈ Θ : U (θ′) > U (θ) + ǫ} ≤ α πLeb (Θ). In other words, the value U (θ) is within ǫ of a value which is greater than the values that U takes on at least a 1 − α portion of the domain. The smaller ǫ and α are, the better is the approximation of a true global optimizer. If both α and ǫ are equal to zero then U (θ) coincides with the essential supremum of U . Our definition of approximate global optimizer carries an important property, which holds regardless of what the criterion U is: if ǫ and α have non-zero values then the set of approximate global optimizers always has non-zero Lebesgue measure. It follows that the probability that the chain visits the set of approximate global optimizers can be non-zero.
Hence, it is sensible to study the confidence of the fact that the solution found by simulated annealing in finite time is an approximate global optimizer. Remark 3: The intuition that our notion of approximate global optimizer can be used to obtain formal guarantees on the finite-time performance of optimization methods based on a stochastic search of the domain is already apparent in the work of Vidyasagar [17, 28]. Vidyasagar [17, 28] introduces a similar definition and obtains rigorous finite-time guarantees in the optimization of expected value criteria based on uniform independent sampling of the domain. Notably, the number of independent samples required to guarantee some desired accuracy and confidence turns out to be polynomial in the values of the desired imprecision, residual domain and confidence. Although the method of Vidyasagar is not highly sophisticated, it has had considerable success in solving difficult control system design applications [28, 29]. Its appeal stems from its rigorous finite-time guarantees which exist without the need for any particular assumption on the optimization criterion. Here we show that finite-time guarantees for simulated annealing can be obtained by selecting a distribution π (J) with a finite J as the target distribution in place of the zero-temperature distribution π (∞). The fundamental result is the following theorem which allows one to select in a rigorous way δ and J in the target distribution π (J). It is important to stress that the result holds universally for any optimization criterion U on a bounded domain. The only minor requirement is that U takes values in [0, 1]. Theorem 1 Let U : Θ → [0, 1] be an optimization criterion where Θ ⊂ R^N is bounded. Let J ≥ 1 and δ > 0 be given numbers. Let θ be a multivariate random variable with distribution π (J) (dθ) ∝ [U (θ) + δ]^J πLeb (dθ). Let α ∈ (0, 1] and ǫ ∈ [0, 1] be given numbers and define σ = 1 / ( 1 + ((1 + δ)/(ǫ + 1 + δ))^J · ((1/α)((1 + δ)/(ǫ + δ)) − 1) · ((1 + δ)/δ) ). (1) Then the statement “θ is an approximate global optimizer of U with value imprecision ǫ and residual domain α” holds with probability at least σ. Proof. See Appendix A. The importance of the choice of a target distribution π (J) with a finite J is that π (J) is absolutely continuous with respect to the Lebesgue measure. Hence, the distance ‖Pθk − π (J) ‖TV between the distribution of the state of the chain Pθk and the target distribution π (J) is a meaningful quantity. Convergence of the Metropolis-Hastings algorithm and MCMC methods in total variation norm is a well studied problem. The theory provides simple conditions under which one derives upper bounds on the distance to the target distribution which are known at each step of the chain and decrease monotonically to zero as the number of steps of the chain grows. The theory has been developed mainly for homogeneous chains [18, 19, 20, 21]. In the case of simulated annealing, the factor that enables us to employ these results is the absolute continuity of the target distribution π (J) with respect to the Lebesgue measure. However, simulated annealing involves the simulation of inhomogeneous chains. In this respect, another important fact is that the choice of a target distribution π (J) with a finite J implies that the inhomogeneous Markov chain can in fact be formed by a finite sequence of homogeneous chains (i.e. the cooling schedule {Jk }k=1,2,... can be chosen to be a sequence that takes only a finite set of values).
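Assuming the reconstruction of equation (1) above, the bound is straightforward to evaluate numerically; this hypothetical helper (not from the paper) computes σ and searches for the smallest J reaching a target confidence:

def confidence(J, delta, eps, alpha):
    # sigma from equation (1); requires delta > 0, eps in [0, 1], alpha in (0, 1]
    decay = ((1 + delta) / (eps + 1 + delta)) ** J
    slack = (1.0 / alpha) * (1 + delta) / (eps + delta) - 1
    return 1.0 / (1 + decay * slack * (1 + delta) / delta)

def smallest_J(delta, eps, alpha, target):
    J = 1
    while confidence(J, delta, eps, alpha) < target:
        J += 1
    return J

As the text notes, for fixed ǫ and α one can additionally tune δ to maximize σ at a given J.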
In turn, this allows one to apply the theory of homogeneous MCMC methods to study the convergence of Pθk to π (J) in total variation norm. On a bounded domain, simple conditions on the ‘proposal distribution’ in the iteration of the simulated annealing algorithm allow one to obtain upper bounds on ‖Pθk − π (J) ‖TV that decrease geometrically to zero as k → ∞, without the need for any additional assumption on U [18, 19, 20, 21]. It is then appropriate to introduce the following finite-time result. Theorem 2 Let the notation and assumptions of Theorem 1 hold. Let θk , with distribution Pθk , be the state of the inhomogeneous chain of a simulated annealing algorithm with target distribution π (J). Then the statement “θk is an approximate global optimizer of U with value imprecision ǫ and residual domain α” holds with probability at least σ − ‖Pθk − π (J) ‖TV . The proof of the theorem follows directly from the definition of the total variation norm. It follows that if simulated annealing is implemented with an algorithm which converges in total variation distance to a target distribution π (J) with a finite J, then one can state with confidence arbitrarily close to 1 that the solution found by the algorithm after the known appropriate finite number of steps is an approximate global optimizer with the desired approximation level. For given non-zero values of ǫ, α the value of σ given by (1) can be made arbitrarily close to 1 by choice of J; while the distance ‖Pθk − π (J) ‖TV can be made arbitrarily small by taking the known sufficient number of steps. It can be shown that there exists the possibility of making an optimal choice of δ and J in the target distribution π (J). In fact, for given ǫ and α and a given value of J there exists an optimal choice of δ which maximizes the value of σ given by (1). Hence, it is possible to obtain a desired σ with the smallest possible J. The advantage of choosing the smallest J, consistent with the required approximation and confidence, is that it will decrease the number of steps required to achieve the desired reduction of ‖Pθk − π (J) ‖TV . 4 Conclusions We have introduced a new formulation of simulated annealing which admits rigorous finite-time guarantees in the optimization of functions of continuous variables. First, we have introduced the notion of approximate global optimizer. Then, we have shown that simulated annealing is guaranteed to find approximate global optimizers, with the desired confidence and the desired level of accuracy, in a known finite number of steps, if a proper choice of the target distribution is made and conditions for convergence in total variation norm are met. The results hold for any optimization criterion on a bounded domain with the only minor requirement that it takes values between 0 and 1. In this framework, simulated annealing algorithms with rigorous finite-time guarantees can be derived by studying the choice of the proposal distribution and of the cooling schedule, in the generic iteration of simulated annealing, in order to ensure convergence to the target distribution in total variation norm. To do this, existing theory of convergence of the Metropolis-Hastings algorithm and MCMC methods on continuous domains can be used [18, 19, 20, 21].
Vidyasagar [17, 28] has introduced a similar definition of approximate global optimizer and has shown that approximate optimizers with desired accuracy and confidence can be obtained with a number of uniform independent samples of the domain which is polynomial in the accuracy and confidence parameters. In general, algorithms developed with the MCMC methodology can be expected to be equally or more efficient than uniform independent sampling. Acknowledgments Work supported by EPSRC, Grant EP/C014006/1, and by the European Commission under projects HYGEIA FP6-NEST-4995 and iFly FP6-TREN-037180. We thank S. Brooks, M. Vidyasagar and D. M. Wolpert for discussions and useful comments on the paper. A Proof of Theorem 1 Let ᾱ ∈ (0, 1] and ρ ∈ (0, 1] be given numbers. Let Uδ (θ) := U (θ) + δ, and let π̄δ be a normalized measure such that π̄δ (dθ) ∝ Uδ (θ) πLeb (dθ). In the first part of the proof we find a lower bound on the probability that θ belongs to the set {θ ∈ Θ : π̄δ {θ′ ∈ Θ : ρ Uδ (θ′) > Uδ (θ)} ≤ ᾱ}. Let yᾱ := inf{y : π̄δ {θ ∈ Θ : Uδ (θ) ≤ y} ≥ 1 − ᾱ}. To start with we show that the set {θ ∈ Θ : π̄δ {θ′ ∈ Θ : ρ Uδ (θ′) > Uδ (θ)} ≤ ᾱ} coincides with {θ ∈ Θ : Uδ (θ) ≥ ρ yᾱ }. Notice that the quantity π̄δ {θ ∈ Θ : Uδ (θ) ≤ y} is a right-continuous non-decreasing function of y because it has the form of a distribution function (see e.g. [30, p.162] and [17, Lemma 11.1]). Therefore we have π̄δ {θ ∈ Θ : Uδ (θ) ≤ yᾱ } ≥ 1 − ᾱ, and y ≥ ρ yᾱ ⇒ π̄δ {θ′ ∈ Θ : ρ Uδ (θ′) ≤ y} ≥ 1 − ᾱ ⇒ π̄δ {θ′ ∈ Θ : ρ Uδ (θ′) > y} ≤ ᾱ. Moreover, y < ρ yᾱ ⇒ π̄δ {θ′ ∈ Θ : ρ Uδ (θ′) ≤ y} < 1 − ᾱ ⇒ π̄δ {θ′ ∈ Θ : ρ Uδ (θ′) > y} > ᾱ, and taking the contrapositive one obtains π̄δ {θ′ ∈ Θ : ρ Uδ (θ′) > y} ≤ ᾱ ⇒ y ≥ ρ yᾱ . Therefore {θ ∈ Θ : Uδ (θ) ≥ ρ yᾱ } ≡ {θ ∈ Θ : π̄δ {θ′ ∈ Θ : ρ Uδ (θ′) > Uδ (θ)} ≤ ᾱ}. We now derive a lower bound on π (J) {θ ∈ Θ : Uδ (θ) ≥ ρ yᾱ }. Let us introduce the notation Aᾱ := {θ ∈ Θ : Uδ (θ) < yᾱ }, Āᾱ := {θ ∈ Θ : Uδ (θ) ≥ yᾱ }, Bᾱ,ρ := {θ ∈ Θ : Uδ (θ) < ρ yᾱ } and B̄ᾱ,ρ := {θ ∈ Θ : Uδ (θ) ≥ ρ yᾱ }. Notice that Bᾱ,ρ ⊆ Aᾱ and Āᾱ ⊆ B̄ᾱ,ρ . The quantity π̄δ {θ ∈ Θ : Uδ (θ) < y} as a function of y is the left-continuous version of π̄δ {θ ∈ Θ : Uδ (θ) ≤ y} [30, p.162]. Hence, the definition of yᾱ implies π̄δ (Aᾱ ) ≤ 1 − ᾱ and π̄δ (Āᾱ ) ≥ ᾱ. Notice that π̄δ (Aᾱ ) ≤ 1 − ᾱ ⇒ δ πLeb (Aᾱ ) / ∫Θ Uδ (θ) πLeb (dθ) ≤ 1 − ᾱ, and π̄δ (Āᾱ ) ≥ ᾱ ⇒ (1 + δ) πLeb (Āᾱ ) / ∫Θ Uδ (θ) πLeb (dθ) ≥ ᾱ. Hence, πLeb (Āᾱ ) > 0 and πLeb (Aᾱ )/πLeb (Āᾱ ) ≤ ((1 − ᾱ)/ᾱ)((1 + δ)/δ). Notice that πLeb (Āᾱ ) > 0 implies πLeb (B̄ᾱ,ρ ) > 0. We obtain π (J) {θ ∈ Θ : Uδ (θ) ≥ ρ yᾱ } = 1 / (1 + ∫Bᾱ,ρ Uδ (θ)^J πLeb (dθ) / ∫B̄ᾱ,ρ Uδ (θ)^J πLeb (dθ)) ≥ 1 / (1 + (ρ yᾱ )^J πLeb (Bᾱ,ρ ) / (yᾱ^J πLeb (Āᾱ ))) ≥ 1 / (1 + ρ^J πLeb (Aᾱ )/πLeb (Āᾱ )) ≥ 1 / (1 + ρ^J ((1 − ᾱ)/ᾱ)((1 + δ)/δ)). Since {θ ∈ Θ : Uδ (θ) ≥ ρ yᾱ } ≡ {θ ∈ Θ : π̄δ {θ′ ∈ Θ : ρ Uδ (θ′) > Uδ (θ)} ≤ ᾱ} the first part of the proof is complete. In the second part of the proof we show that the set {θ ∈ Θ : π̄δ {θ′ ∈ Θ : ρ Uδ (θ′) > Uδ (θ)} ≤ ᾱ} is contained in the set of approximate global optimizers of U with value imprecision ǫ̃ := (ρ⁻¹ − 1)(1 + δ) and residual domain α̃ := ((1 + δ)/(ǫ̃ + δ)) ᾱ. Hence, we show that {θ ∈ Θ : π̄δ {θ′ ∈ Θ : ρ Uδ (θ′) > Uδ (θ)} ≤ ᾱ} ⊆ {θ ∈ Θ : πLeb {θ′ ∈ Θ : U (θ′) > U (θ) + ǫ̃} ≤ α̃ πLeb (Θ)}.
We have U (θ′) > U (θ) + ǫ̃ ⇔ ρ Uδ (θ′) > ρ [Uδ (θ) + ǫ̃] ⇒ ρ Uδ (θ′) > Uδ (θ), which is proven by noticing that ρ [Uδ (θ) + ǫ̃] ≥ Uδ (θ) ⇔ 1 − ρ ≥ U (θ)(1 − ρ) and U (θ) ∈ [0, 1]. Hence {θ′ ∈ Θ : ρ Uδ (θ′) > Uδ (θ)} ⊇ {θ′ ∈ Θ : U (θ′) > U (θ) + ǫ̃}. Therefore π̄δ {θ′ ∈ Θ : ρ Uδ (θ′) > Uδ (θ)} ≤ ᾱ ⇒ π̄δ {θ′ ∈ Θ : U (θ′) > U (θ) + ǫ̃} ≤ ᾱ. Let Qθ,ǫ̃ := {θ′ ∈ Θ : U (θ′) > U (θ) + ǫ̃} and notice that π̄δ {θ′ ∈ Θ : U (θ′) > U (θ) + ǫ̃} = (∫Qθ,ǫ̃ U (θ′) πLeb (dθ′) + δ πLeb (Qθ,ǫ̃ )) / (∫Θ U (θ′) πLeb (dθ′) + δ πLeb (Θ)). We obtain π̄δ {θ′ ∈ Θ : U (θ′) > U (θ) + ǫ̃} ≤ ᾱ ⇒ (ǫ̃ + δ) πLeb (Qθ,ǫ̃ ) ≤ ᾱ (1 + δ) πLeb (Θ) ⇒ πLeb {θ′ ∈ Θ : U (θ′) > U (θ) + ǫ̃} ≤ α̃ πLeb (Θ). Hence we can conclude that π̄δ {θ′ ∈ Θ : ρ Uδ (θ′) > Uδ (θ)} ≤ ᾱ ⇒ πLeb {θ′ ∈ Θ : U (θ′) > U (θ) + ǫ̃} ≤ α̃ πLeb (Θ), and the second part of the proof is complete. We have shown that given ᾱ ∈ (0, 1], ρ ∈ (0, 1], ǫ̃ := (ρ⁻¹ − 1)(1 + δ), α̃ := ((1 + δ)/(ǫ̃ + δ)) ᾱ and σ := 1 / (1 + ρ^J ((1 − ᾱ)/ᾱ)((1 + δ)/δ)) = 1 / (1 + ((1 + δ)/(ǫ̃ + 1 + δ))^J ((1/α̃)((1 + δ)/(ǫ̃ + δ)) − 1)((1 + δ)/δ)), the statement “θ is an approximate global optimizer of U with value imprecision ǫ̃ and residual domain α̃” holds with probability at least σ. Notice that ǫ̃ ∈ [0, 1] and α̃ ∈ (0, 1] are linked through a bijective relation to ρ ∈ [(1 + δ)/(2 + δ), 1] and ᾱ ∈ (0, (ǫ̃ + δ)/(1 + δ)]. The statement of the theorem is eventually obtained by expressing σ as a function of the desired ǫ = ǫ̃ and α = α̃. References [1] D. J. Wales. Energy Landscapes. Cambridge University Press, Cambridge, UK, 2003. [2] D. Achlioptas, A. Naor, and Y. Peres. Rigorous location of phase transitions in hard optimization problems. Nature, 435:759–764, 2005. [3] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by Simulated Annealing. Science, 220(4598):671–680, 1983. [4] E. Bonomi and J. Lutton. The N-city travelling salesman problem: statistical mechanics and the Metropolis algorithm. SIAM Rev., 26(4):551–568, 1984. [5] Y. Fu and P. W. Anderson. Application of statistical mechanics to NP-complete problems in combinatorial optimization. J. Phys. A: Math. Gen., 19(9):1605–1620, 1986. [6] M. Mézard, G. Parisi, and R. Zecchina. Analytic and Algorithmic Solution of Random Satisfiability Problems. Science, 297:812–815, 2002. [7] P. M. J. van Laarhoven and E. H. L. Aarts. Simulated Annealing: Theory and Applications. D. Reidel Publishing Company, Dordrecht, Holland, 1987. [8] D. Mitra, F. Romeo, and A. Sangiovanni-Vincentelli. Convergence and finite-time behavior of simulated annealing. Adv. Appl. Prob., 18:747–771, 1986. [9] B. Hajek. Cooling schedules for optimal annealing. Math. Oper. Res., 13:311–329, 1988. [10] J. Hannig, E. K. P. Chong, and S. R. Kulkarni. Relative Frequencies of Generalized Simulated Annealing. Math. Oper. Res., 31(1):199–216, 2006. [11] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, UK, 2004. [12] H. Haario and E. Saksman. Simulated annealing process in general state space. Adv. Appl. Prob., 23:866–893, 1991. [13] S. B. Gelfand and S. K. Mitter. Simulated Annealing Type Algorithms for Multivariate Optimization. Algorithmica, 6:419–436, 1991. [14] C. Tsallis and D. A. Stariolo. Generalized simulated annealing. Physica A, 233:395–406, 1996. [15] M. Locatelli. Simulated Annealing Algorithms for Continuous Global Optimization: Convergence Conditions. J. Optimiz. Theory App., 104(1):121–133, 2000. [16] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, US, 1995. [17] M.
Vidyasagar. Learning and Generalization: With Application to Neural Networks. Springer-Verlag, London, second edition, 2003. [18] S. P. Meyn and R. L. Tweedie. Markov Chains and Stochastic Stability. Springer-Verlag, London, 1993. [19] J. S. Rosenthal. Minorization Conditions and Convergence Rates for Markov Chain Monte Carlo. J. Am. Stat. Assoc., 90(430):558–566, 1995. [20] K. L. Mengersen and R. L. Tweedie. Rates of convergence of the Hastings and Metropolis algorithm. Ann. Stat., 24(1):101–121, 1996. [21] G. O. Roberts and J. S. Rosenthal. General state space Markov chains and MCMC algorithms. Prob. Surv., 1:20–71, 2004. [22] C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer-Verlag, New York, second edition, 2004. [23] D. J. Spiegelhalter, K. R. Abrams, and J. P. Myles. Bayesian approaches to clinical trials and health-care evaluation. John Wiley & Sons, Chichester, UK, 2004. [24] A. Lecchini-Visintini, W. Glover, J. Lygeros, and J. M. Maciejowski. Monte Carlo Optimization for Conflict Resolution in Air Traffic Control. IEEE Trans. Intell. Transp. Syst., 7(4):470–482, 2006. [25] P. Müller. Simulation based optimal design. In J. O. Berger, J. M. Bernardo, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics 6: proceedings of the Sixth Valencia International Meeting, pages 459–474. Oxford: Clarendon Press, 1999. [26] P. Müller, B. Sansó, and M. De Iorio. Optimal Bayesian design by Inhomogeneous Markov Chain Simulation. J. Am. Stat. Assoc., 99(467):788–798, 2004. [27] L. Blum, C. Cucker, M. Shub, and S. Smale. Complexity and Real Computation. Springer-Verlag, New York, 1998. [28] M. Vidyasagar. Randomized algorithms for robust controller synthesis using statistical learning theory. Automatica, 37(10):1515–1528, 2001. [29] R. Tempo, G. Calafiore, and F. Dabbene. Randomized Algorithms for Analysis and Control of Uncertain Systems. Springer-Verlag, London, 2005. [30] B. V. Gnedenko. Theory of Probability. Chelsea, New York, fourth edition, 1968.
3 0.088101998 51 nips-2007-Comparing Bayesian models for multisensory cue combination without mandatory integration
Author: Ulrik Beierholm, Ladan Shams, Wei J. Ma, Konrad Koerding
Abstract: Bayesian models of multisensory perception traditionally address the problem of estimating an underlying variable that is assumed to be the cause of the two sensory signals. The brain, however, has to solve a more general problem: it also has to establish which signals come from the same source and should be integrated, and which ones do not and should be segregated. In the last couple of years, a few models have been proposed to solve this problem in a Bayesian fashion. One of these has the strength that it formalizes the causal structure of sensory signals. We first compare these models on a formal level. Furthermore, we conduct a psychophysics experiment to test human performance in an auditory-visual spatial localization task in which integration is not mandatory. We find that the causal Bayesian inference model accounts for the data better than other models. Keywords: causal inference, Bayesian methods, visual perception. 1 Multisensory perception In the ventriloquist illusion, a performer speaks without moving his/her mouth while moving a puppet’s mouth in synchrony with his/her speech. This makes the puppet appear to be speaking. This illusion was first conceptualized as “visual capture”, occurring when visual and auditory stimuli exhibit a small conflict ([1, 2]). Only recently has it been demonstrated that the phenomenon may be seen as a byproduct of a much more flexible and nearly Bayes-optimal strategy ([3]), and therefore is part of a large collection of cue combination experiments showing such statistical near-optimality [4, 5]. In fact, cue combination has become the poster child for Bayesian inference in the nervous system. In previous studies of multisensory integration, two sensory stimuli are presented which act as cues about a single underlying source. For instance, in the auditory-visual localization experiment by Alais and Burr [3], observers were asked to envisage each presentation of a light blob and a sound click as a single event, like a ball hitting the screen. In many cases, however, the brain is not only posed with the problem of identifying the position of a common source, but also of determining whether there was a common source at all. In the on-stage ventriloquist illusion, it is indeed primarily the causal inference process that is being fooled, because veridical perception would attribute independent causes to the auditory and the visual stimulus. To extend our understanding of multisensory perception to this more general problem, it is necessary to manipulate the degree of belief assigned to there being a common cause within a multisensory task. Intuitively, we expect that when two signals are very different, they are less likely to be perceived as having a common source. It is well-known that increasing the discrepancy or inconsistency between stimuli reduces the influence that they have on each other [6, 7, 8, 9, 10, 11]. In auditory-visual spatial localization, one variable that controls stimulus similarity is spatial disparity (another would be temporal disparity). Indeed, it has been reported that increasing spatial disparity leads to a decrease in auditory localization bias [1, 12, 13, 14, 15, 16, 17, 2, 18, 19, 20, 21]. This decrease also correlates with a decrease in the reports of unity [19, 21]. Despite the abundance of experimental data on this issue, no general theory exists that can explain multisensory perception across a wide range of cue conflicts.
2 Models

The success of Bayesian models for cue integration has motivated attempts to extend them to situations of large sensory conflict and a consequent low degree of integration. In one recent study taking this approach, subjects were presented with concurrent visual flashes and auditory beeps and asked to count both the number of flashes and the number of beeps [11]. The advantage of the experimental paradigm adopted there was that it probed the joint response distribution by requiring a dual report. Human data were accounted for well by a Bayesian model in which the joint prior distribution over visual and auditory number was approximated from the data. In a similar study, subjects were presented with concurrent flashes and taps and asked to count either the flashes or the taps [9, 22]. The Bayesian model proposed by these authors assumed a joint prior distribution with a near-diagonal form. The corresponding generative model assumes that the sensory sources somehow interact with one another. A third experiment modulated the rates of flashes and beeps. The task was to judge either the visual or the auditory modulation rate relative to a standard [23]. The data from this experiment were modeled using a joint prior distribution which is the sum of a near-diagonal prior and a flat background.

While all these models are Bayesian in a formal sense, their underlying generative model does not formalize the model selection process that underlies the combination of cues. This makes it necessary to either estimate an empirical prior [11] by fitting it to human behavior or to assume an ad hoc form [22, 23]. However, we believe that such assumptions are not needed. It was shown recently that human judgments of spatial unity in an auditory-visual spatial localization task can be described using a Bayesian inference model that infers causal structure [24, 25]. In this model, the brain does not only estimate a stimulus variable, but also infers the probability that the two stimuli have a common cause. In this paper we compare these different models on a large data set of human position estimates in an auditory-visual task. In this section we first describe the traditional cue integration model, then the recent models based on joint stimulus priors, and finally the causal inference model. To relate to the experiment in the next section, we will use the terminology of auditory-visual spatial localization, but the formalism is very general.

2.1 Traditional cue integration

The traditional generative model of cue integration [26] has a single source location s which produces on each trial an internal representation (cue) of visual location, x_V, and one of auditory location, x_A. We assume that the noise processes by which these internal representations are generated are conditionally independent from each other and follow Gaussian distributions. That is, p(x_V|s) ∼ N(x_V; s, σ_V) and p(x_A|s) ∼ N(x_A; s, σ_A), where N(x; µ, σ) stands for the normal distribution over x with mean µ and standard deviation σ. If on a given trial the internal representations are x_V and x_A, the probability that their source was s is given by Bayes' rule, p(s|x_V, x_A) ∝ p(x_V|s) p(x_A|s). If a subject performs maximum-likelihood estimation, then the estimate will be ŝ = (w_V x_V + w_A x_A)/(w_V + w_A), where w_V = 1/σ_V^2 and w_A = 1/σ_A^2. It is important to keep in mind that this is the estimate on a single trial.
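As an illustration, the following sketch (our own, not from the paper; all parameter values are placeholders) computes the single-trial reliability-weighted estimate above, and also checks by simulation the mean and variance of the estimate over many trials, which are derived in the next paragraph:

import numpy as np

rng = np.random.default_rng(0)

def cue_combination_estimate(x_v, x_a, sigma_v, sigma_a):
    # Reliability-weighted maximum-likelihood estimate on a single trial
    w_v = 1.0 / sigma_v**2  # visual reliability
    w_a = 1.0 / sigma_a**2  # auditory reliability
    return (w_v * x_v + w_a * x_a) / (w_v + w_a)

# Single trial: a precise visual cue (sigma_v = 2) dominates a noisy auditory one (sigma_a = 9)
print(cue_combination_estimate(x_v=0.0, x_a=10.0, sigma_v=2.0, sigma_a=9.0))  # ~0.47

# Across trials: noisy internal representations generated from sources s_v, s_a
s_v, s_a, sigma_v, sigma_a = 0.0, 10.0, 2.0, 9.0
w_v, w_a = 1.0 / sigma_v**2, 1.0 / sigma_a**2
x_v = rng.normal(s_v, sigma_v, 100000)
x_a = rng.normal(s_a, sigma_a, 100000)
s_hat = cue_combination_estimate(x_v, x_a, sigma_v, sigma_a)
print(s_hat.mean(), (w_v * s_v + w_a * s_a) / (w_v + w_a))  # empirical vs analytic mean
print(s_hat.var(), 1.0 / (w_v + w_a))                       # empirical vs analytic variance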
A psychophysical experimenter can never have access to x_V and x_A, which are the noisy internal representations. Instead, an experimenter will want to collect estimates over many trials and is interested in the distribution of ŝ given s_V and s_A, which are the sources generated by the experimenter. In a typical cue combination experiment, x_V and x_A are not actually generated by the same source, but by different sources, a visual one s_V and an auditory one s_A. These sources are chosen close to each other so that the subject can imagine that the resulting cues originate from a single source and thus implicitly have a common cause. The experimentally observed distribution is then

p(ŝ|s_V, s_A) = ∫∫ p(ŝ|x_V, x_A) p(x_V|s_V) p(x_A|s_A) dx_V dx_A

Given that ŝ is a linear combination of two normally distributed variables, it will itself follow a normal distribution, with mean s̄ = (w_V s_V + w_A s_A)/(w_V + w_A) and variance σ_ŝ^2 = 1/(w_V + w_A). The reason that we emphasize this point is because many authors identify the estimate distribution p(ŝ|s_V, s_A) with the posterior distribution p(s|x_V, x_A). This is justified in this case because all distributions are Gaussian and the estimate is a linear combination of cues. However, in the case of causal inference, these conditions are violated and the estimate distribution will in general not be the same as the posterior distribution.

2.2 Models with bisensory stimulus priors

Models with bisensory stimulus priors propose the posterior over source positions to be proportional to the product of unimodal likelihoods and a two-dimensional prior:

p(s_V, s_A|x_V, x_A) ∝ p(s_V, s_A) p(x_V|s_V) p(x_A|s_A)

The traditional cue combination model has p(s_V, s_A) = p(s_V) δ(s_V − s_A), usually (as above) even with p(s_V) uniform. The question arises what bisensory stimulus prior is appropriate. In [11], the prior is estimated from data, has a large number of parameters, and is therefore limited in its predictive power. In [23], it has the form

p(s_V, s_A) ∝ ω + exp(−(s_V − s_A)^2 / (2σ_coupling^2))

while in [22] the additional assumption ω = 0 is made¹. In all three models, the response distribution p(ŝ_V, ŝ_A|s_V, s_A) is obtained by identifying it with the posterior distribution p(s_V, s_A|x_V, x_A). This procedure thus implicitly assumes that marginalizing over the latent variables x_V and x_A is not necessary, which leads to a significant error for non-Gaussian priors. In this paper we correctly deal with these issues and in all cases marginalize over the latent variables. The parametric models used for the coupling between the cues lead to an elegant low-dimensional model of cue integration that allows for estimates of single cues that differ from one another.

2.3 Causal inference model

In the causal inference model [24, 25], we start from the traditional cue integration model but remove the assumption that two signals are caused by the same source. Instead, the number of sources can be one or two and is itself a variable that needs to be inferred from the cues.

Figure 1: Generative model of causal inference. [Graphical model: a variable C selects between C = 1, where a single source S generates both X_V and X_A, and C = 2, where independent sources S_V and S_A generate X_V and X_A respectively.]

¹ This family of Bayesian posterior distributions also includes one used to successfully model cue combination in depth perception [27, 28]. In depth perception, however, there is no notion of segregation as always a single surface is assumed.

If there are two sources, they are assumed to be independent. Thus, we use the graphical model depicted in Fig. 1. We denote the number of sources by C.
The probability distribution over C given internal representations x_V and x_A is given by Bayes' rule:

p(C|x_V, x_A) ∝ p(x_V, x_A|C) p(C).

In this equation, p(C) is the a priori probability of C. We will denote the probability of a common cause by p_common, so that p(C = 1) = p_common and p(C = 2) = 1 − p_common. The probability of generating x_V and x_A given C is obtained by inserting a summation over the sources:

p(x_V, x_A|C = 1) = ∫ p(x_V, x_A|s) p(s) ds = ∫ p(x_V|s) p(x_A|s) p(s) ds

Here p(s) is a prior for spatial location, which we assume to be distributed as N(s; 0, σ_P). Then all three factors in this integral are Gaussians, allowing for an analytic solution:

p(x_V, x_A|C = 1) = 1/(2π √(σ_V^2 σ_A^2 + σ_V^2 σ_P^2 + σ_A^2 σ_P^2)) · exp[ −(1/2) ((x_V − x_A)^2 σ_P^2 + x_V^2 σ_A^2 + x_A^2 σ_V^2) / (σ_V^2 σ_A^2 + σ_V^2 σ_P^2 + σ_A^2 σ_P^2) ].

For p(x_V, x_A|C = 2) we realize that x_V and x_A are independent of each other and thus obtain

p(x_V, x_A|C = 2) = ∫ p(x_V|s_V) p(s_V) ds_V · ∫ p(x_A|s_A) p(s_A) ds_A

Again, as all these distributions are assumed to be Gaussian, we obtain an analytic solution,

p(x_V, x_A|C = 2) = 1/(2π √((σ_V^2 + σ_P^2)(σ_A^2 + σ_P^2))) · exp[ −(1/2) (x_V^2/(σ_V^2 + σ_P^2) + x_A^2/(σ_A^2 + σ_P^2)) ].

Now that we have computed p(C|x_V, x_A), the posterior distribution over sources is given by

p(s_i|x_V, x_A) = Σ_{C=1,2} p(s_i|x_V, x_A, C) p(C|x_V, x_A)

where i can be V or A and the posteriors conditioned on C are well-known:

p(s_i|x_A, x_V, C = 1) = p(x_A|s_i) p(x_V|s_i) p(s_i) / ∫ p(x_A|s) p(x_V|s) p(s) ds,
p(s_i|x_A, x_V, C = 2) = p(x_i|s_i) p(s_i) / ∫ p(x_i|s_i) p(s_i) ds_i

The former is the same as in the case of mandatory integration with a prior, the latter is simply the unimodal posterior in the presence of a prior. Based on the posterior distribution on a given trial, p(s_i|x_V, x_A), an estimate has to be created. For this, we use a sum-squared-error cost function,

Cost = p(C = 1|x_V, x_A) (ŝ − s)^2 + p(C = 2|x_V, x_A) (ŝ − s_{V or A})^2.

Then the best estimate is the mean of the posterior distribution, for instance for the visual estimate:

ŝ_V = p(C = 1|x_A, x_V) ŝ_{V,C=1} + p(C = 2|x_A, x_V) ŝ_{V,C=2}

where

ŝ_{V,C=1} = (x_V σ_V^{−2} + x_A σ_A^{−2} + x_P σ_P^{−2}) / (σ_V^{−2} + σ_A^{−2} + σ_P^{−2}) and ŝ_{V,C=2} = (x_V σ_V^{−2} + x_P σ_P^{−2}) / (σ_V^{−2} + σ_P^{−2}).

If p_common equals 0 or 1, this estimate reduces to one of the conditioned estimates and is linear in x_V and x_A. If 0 < p_common < 1, the estimate is a nonlinear combination of x_V and x_A, because of the functional form of p(C|x_V, x_A). The response distributions, that is, the distributions of ŝ_V and ŝ_A given s_V and s_A over many trials, now cannot be identified with the posterior distribution on a single trial and cannot be computed analytically either. The correct way to obtain the response distribution is to simulate an experiment numerically. Note that the causal inference model above can also be cast in the form of a bisensory stimulus prior by integrating out the latent variable C, with:

p(s_A, s_V) = p(C = 1) δ(s_A − s_V) p(s_A) + p(C = 2) p(s_A) p(s_V)

However, in addition to justifying the form of the interaction between the cues, the causal inference model has the advantage of being based on a generative model that well formalizes salient properties of the world, and it thereby also allows one to predict judgments of unity.

3 Model performance and comparison

To examine the performance of the causal inference model and to compare it to previous models, we performed a human psychophysics experiment in which we adopted the same dual-report paradigm as was used in [11].
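Before describing the experiment, here is a compact numerical sketch of the single-trial visual estimate in the causal inference model, using the two analytic likelihoods and the model-averaged estimate derived above (our own illustration; the parameter values are placeholders of the order of the fits reported below, and the prior mean x_P is taken to be 0 since p(s) is centered at zero):

import numpy as np

def causal_inference_estimate_v(x_v, x_a, sigma_v, sigma_a, sigma_p, p_common):
    var_v, var_a, var_p = sigma_v**2, sigma_a**2, sigma_p**2
    # p(x_V, x_A | C=1): common source s integrated out analytically
    d1 = var_v * var_a + var_v * var_p + var_a * var_p
    like1 = np.exp(-0.5 * ((x_v - x_a)**2 * var_p + x_v**2 * var_a + x_a**2 * var_v) / d1) / (2 * np.pi * np.sqrt(d1))
    # p(x_V, x_A | C=2): two independent sources
    like2 = np.exp(-0.5 * (x_v**2 / (var_v + var_p) + x_a**2 / (var_a + var_p))) / (2 * np.pi * np.sqrt((var_v + var_p) * (var_a + var_p)))
    post_c1 = p_common * like1 / (p_common * like1 + (1 - p_common) * like2)
    # Conditioned posterior means (prior mean 0, so the x_P term vanishes)
    s_v_c1 = (x_v / var_v + x_a / var_a) / (1 / var_v + 1 / var_a + 1 / var_p)
    s_v_c2 = (x_v / var_v) / (1 / var_v + 1 / var_p)
    # Model-averaged estimate that minimizes the sum-squared-error cost
    return post_c1 * s_v_c1 + (1 - post_c1) * s_v_c2

print(causal_inference_estimate_v(x_v=5.0, x_a=15.0, sigma_v=2.1, sigma_a=9.2, sigma_p=12.3, p_common=0.28))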
Observers were simultaneously presented with a brief visual and also an auditory stimulus, each of which could originate from one of five locations on an imaginary horizontal line (−10°, −5°, 0°, 5°, or 10° with respect to the fixation point). Auditory stimuli were 32 ms of white noise filtered through an individually calibrated head related transfer function (HRTF) and presented through a pair of headphones, whereas the visual stimuli were high contrast Gabors on a noisy background presented on a 21-inch CRT monitor. Observers had to report by means of a key press (1-5) the perceived positions of both the visual and the auditory stimulus. Each combination of locations was presented with the same frequency over the course of the experiment. In this way, for each condition, visual and auditory response histograms were obtained.

We obtained response distributions for each of the three models described above by numerical simulation. On each trial, estimation is followed by a step in which the key corresponding to the position closest to the best estimate is selected. The simulated histograms obtained in this way were compared to the measured response frequencies of all subjects by computing the R^2 statistic. The parameters in the causal inference model were optimized using fminsearch in MATLAB to maximize R^2. The best combination of parameters yielded an R^2 of 0.97. The response frequencies are depicted in Fig. 2. The bisensory prior models also explain most of the variance, with R^2 = 0.96 for the Roach model and R^2 = 0.91 for the Bresciani model. This shows that it is possible to model cue combination for large disparities well using such models.

Figure 2: A comparison between subjects' performance and the causal inference model. The blue line indicates the frequency of subjects' responses to visual stimuli, the red line the responses to auditory stimuli. Each set of lines is one set of audio-visual stimulus conditions. Rows of conditions indicate a constant visual stimulus, columns a constant audio stimulus (including unimodal "no vision" and "no audio" conditions). Model predictions are indicated by the red and blue dotted lines.

3.1 Model comparison

To facilitate quantitative comparison with other models, we now fit the parameters of each model² to individual subject data, maximizing the likelihood of the model, i.e., the probability of the response frequencies under the model. The causal inference model fits human data better than the other models. Compared to the best fit of the causal inference model, the Bresciani model has a maximal log likelihood ratio (base e) of the data of −22 ± 6 (mean ± s.e.m. over subjects), and the Roach model has a maximal log likelihood ratio of the data of −18 ± 6. A causal inference model that maximizes the probability of being correct instead of minimizing the mean squared error has a maximal log likelihood ratio of −18 ± 3. These values are considered decisive evidence in favor of the causal inference model that minimizes the mean squared error (for details, see [25]). The parameter values found in the likelihood optimization of the causal model are as follows: p_common = 0.28 ± 0.05, σ_V = 2.14 ± 0.22°, σ_A = 9.2 ± 1.1°, σ_P = 12.3 ± 1.1° (mean ± s.e.m. over subjects). We see that there is a relatively low prior probability of a common cause. In this paradigm, auditory localization is considerably less precise than visual localization. Also, there is a weak prior for central locations.
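Referring back to the simulated key-press stage described above, a minimal sketch (ours) of how continuous estimates are discretized to the five response keys when building simulated histograms:

import numpy as np

KEYS = np.array([-10.0, -5.0, 0.0, 5.0, 10.0])  # the five response positions (deg)

def response_histogram(estimates):
    # Select, for each trial, the key closest to the best estimate, then histogram
    idx = np.argmin(np.abs(estimates[:, None] - KEYS[None, :]), axis=1)
    counts = np.bincount(idx, minlength=KEYS.size)
    return counts / counts.sum()

print(response_histogram(np.array([-7.2, -4.9, 0.3, 1.1, 9.8, 4.4])))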
3.2 Localization bias

A useful quantity to gain more insight into the structure of multisensory data is the cross-modal bias. In our experiment, relative auditory bias is defined as the difference between the mean auditory estimate in a given condition and the real auditory position, divided by the difference between the real visual position and the real auditory position in this condition. If the influence of vision on the auditory estimate is strong, then the relative auditory bias will be high (close to one). It is well-known that bias decreases with spatial disparity and our experiment is no exception (solid line in Fig. 3; data were combined between positive and negative disparities). It can easily be shown that a traditional cue integration model would predict a bias equal to (1 + σ_V^2/σ_A^2)^{−1}, which would be close to 1 and independent of disparity, unlike the data. This shows that a mandatory integration model is an insufficient model of multisensory interactions.

We used the individual subject fittings from above and averaged the auditory bias values obtained from those fits (i.e. we did not fit the bias data themselves). Fits are shown in Fig. 3 (dashed lines). We applied a paired t-test to the differences between the 5° and 20° disparity conditions (model-subject comparison). Using a double-sided test, the null hypothesis that the difference between the bias in the 5° and 20° conditions is correctly predicted by each model is rejected for the Bresciani model (p < 0.002) and the Roach model (p < 0.042) and accepted for the causal inference model (p > 0.17). Alternatively, with a single-sided test, the hypothesis is rejected for the Bresciani model (p < 0.001) and the Roach model (p < 0.021) and accepted for the causal inference model (p > 0.9).

Figure 3: Auditory bias (%) as a function of spatial disparity (deg.). Solid blue line: data. Red: causal inference model. Green: model by Roach et al. [23]. Purple: model by Bresciani et al. [22]. Models were optimized on response frequencies (as in Fig. 2), not on the bias data.

The reason that the Bresciani model fares worst is that its prior distribution does not include a component that corresponds to independent causes. On the contrary, the prior used in the Roach model contains two terms, one term that is independent of the disparity and one term that decreases with increasing disparity. It is thus functionally somewhat similar to the causal inference model.

² The Roach et al. model has four free parameters (ω, σ_V, σ_A, σ_coupling), the Bresciani et al. model has three (σ_V, σ_A, σ_coupling), and the causal inference model has four (p_common, σ_V, σ_A, σ_P). We do not consider the Shams et al. model here, since it has many more parameters and it is not immediately clear how in this model the erroneous identification of posterior with response distribution can be corrected.

4 Discussion

We have argued that any model of multisensory perception should account not only for situations of small, but also of large conflict. In these situations, segregation is more likely, in which the two stimuli are not perceived to have the same cause. Even when segregation occurs, the two stimuli can still influence each other. We compared three Bayesian models designed to account for situations of large conflict by applying them to auditory-visual spatial localization data.
We pointed out a common mistake: for non-Gaussian bisensory priors without mandatory integration, the response distribution can no longer be identified with the posterior distribution. After correct implementation of the three models, we found that the causal inference model is superior to the models with ad hoc bisensory priors. This is expected, as the nervous system actually needs to solve the problem of deciding which stimuli have a common cause and which stimuli are unrelated.

We have seen that multisensory perception is a suitable tool for studying causal inference. However, the causal inference model also has the potential to quantitatively explain a number of other perceptual phenomena, including perceptual grouping and binding, as well as within-modality cue combination [27, 28]. Causal inference is a universal problem: whenever the brain has multiple pieces of information it must decide if they relate to one another or are independent. As the causal inference model describes how the brain processes probabilistic sensory information, the question arises about the neural basis of these processes. Neural populations encode probability distributions over stimuli through Bayes' rule, a type of coding known as probabilistic population coding. Recent work has shown how the optimal cue combination assuming a common cause can be implemented in probabilistic population codes through simple linear operations on neural activities [29]. This framework makes essential use of the structure of neural variability and leads to physiological predictions for activity in areas that combine multisensory input, such as the superior colliculus. Computational mechanisms for causal inference are expected to have a neural substrate that generalizes these linear operations on population activities. A neural implementation of the causal inference model will open the door to a complete neural theory of multisensory perception.

References
[1] H. L. Pick, D. H. Warren, and J. C. Hay. Sensory conflict in judgements of spatial direction. Percept. Psychophys., 6:203–205, 1969.
[2] D. H. Warren, R. B. Welch, and T. J. McCarthy. The role of visual-auditory "compellingness" in the ventriloquism effect: implications for transitivity among the spatial senses. Percept Psychophys, 30(6):557–64, 1981.
[3] D. Alais and D. Burr. The ventriloquist effect results from near-optimal bimodal integration. Curr Biol, 14(3):257–62, 2004.
[4] R. A. Jacobs. Optimal integration of texture and motion cues to depth. Vision Res, 39(21):3621–9, 1999.
[5] R. J. van Beers, A. C. Sittig, and J. J. Gon. Integration of proprioceptive and visual position-information: An experimentally supported model. J Neurophysiol, 81(3):1355–64, 1999.
[6] D. H. Warren and W. T. Cleaves. Visual-proprioceptive interaction under large amounts of conflict. J Exp Psychol, 90(2):206–14, 1971.
[7] C. E. Jack and W. R. Thurlow. Effects of degree of visual association and angle of displacement on the "ventriloquism" effect. Percept Mot Skills, 37(3):967–79, 1973.
[8] G. H. Recanzone. Auditory influences on visual temporal rate perception. J Neurophysiol, 89(2):1078–93, 2003.
[9] J. P. Bresciani, M. O. Ernst, K. Drewing, G. Bouyer, V. Maury, and A. Kheddar. Feeling what you hear: auditory signals can modulate tactile tap perception. Exp Brain Res, 162(2):172–80, 2005.
[10] R. Gepshtein, P. Leiderman, L. Genosar, and D. Huppert. Testing the three step excited state proton transfer model by the effect of an excess proton. J Phys Chem A, 109(42):9674–84, 2005.
[11] L. Shams, W. J. Ma, and U. Beierholm. Sound-induced flash illusion as an optimal percept. Neuroreport, 16(17):1923–7, 2005.
[12] G. Thomas. Experimental study of the influence of vision on sound localisation. J Exp Psychol, 28:167–177, 1941.
[13] W. R. Thurlow and C. E. Jack. Certain determinants of the "ventriloquism effect". Percept Mot Skills, 36(3):1171–84, 1973.
[14] C. S. Choe, R. B. Welch, R. M. Gilford, and J. F. Juola. The "ventriloquist effect": visual dominance or response bias. Perception and Psychophysics, 18:55–60, 1975.
[15] R. I. Bermant and R. B. Welch. Effect of degree of separation of visual-auditory stimulus and eye position upon spatial interaction of vision and audition. Percept Mot Skills, 42(43):487–93, 1976.
[16] R. B. Welch and D. H. Warren. Immediate perceptual response to intersensory discrepancy. Psychol Bull, 88(3):638–67, 1980.
[17] P. Bertelson and M. Radeau. Cross-modal bias and perceptual fusion with auditory-visual spatial discordance. Percept Psychophys, 29(6):578–84, 1981.
[18] P. Bertelson, F. Pavani, E. Ladavas, J. Vroomen, and B. de Gelder. Ventriloquism in patients with unilateral visual neglect. Neuropsychologia, 38(12):1634–42, 2000.
[19] D. A. Slutsky and G. H. Recanzone. Temporal and spatial dependency of the ventriloquism effect. Neuroreport, 12(1):7–10, 2001.
[20] J. Lewald, W. H. Ehrenstein, and R. Guski. Spatio-temporal constraints for auditory–visual integration. Behav Brain Res, 121(1-2):69–79, 2001.
[21] M. T. Wallace, G. E. Roberson, W. D. Hairston, B. E. Stein, J. W. Vaughan, and J. A. Schirillo. Unifying multisensory signals across time and space. Exp Brain Res, 158(2):252–8, 2004.
[22] J. P. Bresciani, F. Dammeier, and M. O. Ernst. Vision and touch are automatically integrated for the perception of sequences of events. J Vis, 6(5):554–64, 2006.
[23] N. W. Roach, J. Heron, and P. V. McGraw. Resolving multisensory conflict: a strategy for balancing the costs and benefits of audio-visual integration. Proc Biol Sci, 273(1598):2159–68, 2006.
[24] K. P. Kording and D. M. Wolpert. Bayesian decision theory in sensorimotor control. Trends Cogn Sci, 2006.
[25] K. P. Kording, U. Beierholm, W. J. Ma, S. Quartz, J. Tenenbaum, and L. Shams. Causal inference in multisensory perception. PLoS ONE, 2(9):e943, 2007.
[26] Z. Ghahramani. Computational and psychophysics of sensorimotor integration. PhD thesis, Massachusetts Institute of Technology, 1995.
[27] D. C. Knill. Mixture models and the probabilistic structure of depth cues. Vision Res, 43(7):831–54, 2003.
[28] D. C. Knill. Robust cue integration: A Bayesian model and evidence from cue conflict studies with stereoscopic and figure cues to slant. Journal of Vision, 7(7):2–24, 2007.
[29] W. J. Ma, J. M. Beck, P. E. Latham, and A. Pouget. Bayesian inference with probabilistic population codes. Nat Neurosci, 9(11):1432–8, 2006.
4 0.066130556 162 nips-2007-Random Sampling of States in Dynamic Programming
Author: Chris Atkeson, Benjamin Stephens
Abstract: We combine three threads of research on approximate dynamic programming: sparse random sampling of states, value function and policy approximation using local models, and using local trajectory optimizers to globally optimize a policy and associated value function. Our focus is on finding steady state policies for deterministic time invariant discrete time control problems with continuous states and actions often found in robotics. In this paper we show that we can now solve problems we couldn’t solve previously. 1
5 0.058883861 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations
Author: Charles L. Isbell, Michael P. Holmes, Alexander G. Gray
Abstract: Machine learning contains many computational bottlenecks in the form of nested summations over datasets. Kernel estimators and other methods are burdened by these expensive computations. Exact evaluation is typically O(n^2) or higher, which severely limits application to large datasets. We present a multi-stage stratified Monte Carlo method for approximating such summations with probabilistic relative error control. The essential idea is fast approximation by sampling in trees. This method differs from many previous scalability techniques (such as standard multi-tree methods) in that its error is stochastic, but we derive conditions for error control and demonstrate that they work. Further, we give a theoretical sample complexity for the method that is independent of dataset size, and show that this appears to hold in experiments, where speedups reach as high as 10^14, many orders of magnitude beyond the previous state of the art. 1
6 0.054428622 112 nips-2007-Learning Monotonic Transformations for Classification
7 0.053429816 135 nips-2007-Multi-task Gaussian Process Prediction
8 0.051937912 185 nips-2007-Stable Dual Dynamic Programming
9 0.051781315 214 nips-2007-Variational inference for Markov jump processes
10 0.051613837 170 nips-2007-Robust Regression with Twinned Gaussian Processes
11 0.050901718 78 nips-2007-Efficient Principled Learning of Thin Junction Trees
12 0.050744105 97 nips-2007-Hidden Common Cause Relations in Relational Learning
13 0.050439384 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning
14 0.047965143 30 nips-2007-Bayes-Adaptive POMDPs
15 0.045953169 19 nips-2007-Active Preference Learning with Discrete Choice Data
16 0.041051518 145 nips-2007-On Sparsity and Overcompleteness in Image Models
17 0.040909354 17 nips-2007-A neural network implementing optimal state estimation based on dynamic spike train decoding
18 0.040667355 80 nips-2007-Ensemble Clustering using Semidefinite Programming
19 0.039643031 8 nips-2007-A New View of Automatic Relevance Determination
20 0.03931167 104 nips-2007-Inferring Neural Firing Rates from Spike Trains Using Gaussian Processes
topicId topicWeight
[(0, -0.144), (1, -0.004), (2, -0.005), (3, 0.02), (4, -0.031), (5, -0.005), (6, -0.042), (7, -0.006), (8, -0.039), (9, -0.014), (10, -0.054), (11, -0.034), (12, 0.034), (13, -0.036), (14, 0.0), (15, -0.008), (16, 0.021), (17, 0.024), (18, 0.082), (19, -0.069), (20, 0.01), (21, 0.016), (22, -0.049), (23, 0.043), (24, -0.037), (25, -0.003), (26, -0.038), (27, 0.096), (28, 0.092), (29, -0.023), (30, -0.127), (31, -0.024), (32, 0.083), (33, 0.09), (34, 0.005), (35, -0.135), (36, -0.031), (37, -0.078), (38, 0.037), (39, 0.105), (40, -0.185), (41, 0.067), (42, 0.058), (43, -0.103), (44, 0.067), (45, -0.014), (46, 0.072), (47, 0.034), (48, 0.077), (49, -0.057)]
simIndex simValue paperId paperTitle
same-paper 1 0.92011034 174 nips-2007-Selecting Observations against Adversarial Objectives
Author: Andreas Krause, Brendan Mcmahan, Carlos Guestrin, Anupam Gupta
Abstract: In many applications, one has to actively select among a set of expensive observations before making an informed decision. Often, we want to select observations which perform well when evaluated with an objective function chosen by an adversary. Examples include minimizing the maximum posterior variance in Gaussian Process regression, robust experimental design, and sensor placement for outbreak detection. In this paper, we present the Submodular Saturation algorithm, a simple and efficient algorithm with strong theoretical approximation guarantees for the case where the possible objective functions exhibit submodularity, an intuitive diminishing returns property. Moreover, we prove that better approximation algorithms do not exist unless NP-complete problems admit efficient algorithms. We evaluate our algorithm on several real-world problems. For Gaussian Process regression, our algorithm compares favorably with state-of-the-art heuristics described in the geostatistics literature, while being simpler, faster and providing theoretical guarantees. For robust experimental design, our algorithm performs favorably compared to SDP-based algorithms. 1
Author: Andrea Lecchini-visintini, John Lygeros, Jan Maciejowski
Abstract: Simulated annealing is a popular method for approaching the solution of a global optimization problem. Existing results on its performance apply to discrete combinatorial optimization where the optimization variables can assume only a finite set of possible values. We introduce a new general formulation of simulated annealing which allows one to guarantee finite-time performance in the optimization of functions of continuous variables. The results hold universally for any optimization problem on a bounded domain and establish a connection between simulated annealing and up-to-date theory of convergence of Markov chain Monte Carlo methods on continuous domains. This work is inspired by the concept of finite-time learning with known accuracy and confidence developed in statistical learning theory.

Optimization is the general problem of finding a value of a vector of variables θ that maximizes (or minimizes) some scalar criterion U(θ). The set of all possible values of the vector θ is called the optimization domain. The elements of θ can be discrete or continuous variables. In the first case the optimization domain is usually finite, such as in the well-known traveling salesman problem; in the second case the optimization domain is a continuous set. An important example of a continuous optimization domain is the set of 3-D configurations of a sequence of amino-acids in the problem of finding the minimum energy folding of the corresponding protein [1].

In principle, any optimization problem on a finite domain can be solved by an exhaustive search. However, this is often beyond computational capacity: the optimization domain of the traveling salesman problem with 100 cities contains more than 10^155 possible tours. An efficient algorithm to solve the traveling salesman and many similar problems has not yet been found and such problems remain reliably solvable only in principle [2]. Statistical mechanics has inspired widely used methods for finding good approximate solutions in hard discrete optimization problems which defy efficient exact solutions [3, 4, 5, 6]. Here a key idea has been that of simulated annealing [3]: a random search based on the Metropolis-Hastings algorithm, such that the distribution of the elements of the domain visited during the search converges to an equilibrium distribution concentrated around the global optimizers. Convergence and finite-time performance of simulated annealing on finite domains has been evaluated in many works, e.g. [7, 8, 9, 10].

On continuous domains, most popular optimization methods perform a local gradient-based search and in general converge to local optimizers; with the notable exception of convex criteria where convergence to the unique global optimizer occurs [11]. Simulated annealing performs a global search and can be easily implemented on continuous domains. Hence it can be considered a powerful complement to local methods. In this paper, we introduce for the first time rigorous guarantees on the finite-time performance of simulated annealing on continuous domains. We will show that it is possible to derive simulated annealing algorithms which, with an arbitrarily high level of confidence, find an approximate solution to the problem of optimizing a function of continuous variables, within a specified tolerance to the global optimal solution after a known finite number of steps.
Rigorous guarantees on the finite-time performance of simulated annealing in the optimization of functions of continuous variables have never been obtained before; the only results available state that simulated annealing converges to a global optimizer as the number of steps grows to infinity, e.g. [12, 13, 14, 15].

The background of our work is twofold. On the one hand, our notion of approximate solution to a global optimization problem is inspired by the concept of finite-time learning with known accuracy and confidence developed in statistical learning theory [16, 17]. We actually maintain an important aspect of statistical learning theory which is that we do not introduce any particular assumption on the optimization criterion, i.e. our results hold regardless of what U is. On the other hand, we ground our results on the theory of convergence, with quantitative bounds on the distance to the target distribution, of the Metropolis-Hastings algorithm and Markov Chain Monte Carlo (MCMC) methods, which has been one of the main achievements of recent research in statistics [18, 19, 20, 21].

In this paper, we will not develop any ready-to-use optimization algorithm. We will instead introduce a general formulation of the simulated annealing method which allows one to derive new simulated annealing algorithms with rigorous finite-time guarantees on the basis of existing theory. The Metropolis-Hastings algorithm and the general family of MCMC methods have many degrees of freedom. The choice and comparison of specific algorithms goes beyond the scope of the paper.

The paper is organized in the following sections. In Simulated annealing we introduce the method and fix the notation. In Convergence we recall the reasons why finite-time guarantees for simulated annealing on continuous domains have not been obtained before. In Finite-time guarantees we present the main result of the paper. In Conclusions we state our findings and conclude the paper.

1 Simulated annealing

The original formulation of simulated annealing was inspired by the analogy between the stochastic evolution of the thermodynamic state of an annealing material towards the configurations of minimal energy and the search for the global minimum of an optimization criterion [3]. In the procedure, the optimization criterion plays the role of the energy and the state of the annealed material is simulated by the evolution of the state of an inhomogeneous Markov chain. The state of the chain evolves according to the Metropolis-Hastings algorithm in order to simulate the Boltzmann distribution of thermodynamic equilibrium. The Boltzmann distribution is simulated for a decreasing sequence of temperatures ("cooling"). The target distribution of the cooling procedure is the limiting Boltzmann distribution, for the temperature that tends to zero, which takes non-zero values only on the set of global minimizers [7].

The original formulation of the method was for a finite domain. However, simulated annealing can be generalized straightforwardly to a continuous domain because the Metropolis-Hastings algorithm can be used with almost no differences on discrete and continuous domains. The main difference is that on a continuous domain the equilibrium distributions are specified by probability densities.
On a continuous domain, Markov transition kernels in which the distribution of the elements visited by the chain converges to an equilibrium distribution with the desired density can be constructed using the Metropolis-Hastings algorithm and the general family of MCMC methods [22]. We point out that Boltzmann distributions are not the only distributions which can be adopted as equilibrium distributions in simulated annealing [7]. In this paper it is convenient for us to adopt a different type of equilibrium distribution in place of Boltzmann distributions.

1.1 Our setting

The optimization criterion is U : Θ → [0, 1], with Θ ⊂ R^N. The assumption that U takes values in the interval [0, 1] is a technical one. It does not imply any serious loss of generality. In general, any bounded optimization criterion can be scaled to take values in [0, 1]. We assume that the optimization task is to find a global maximizer; this can be done without loss of generality. We also assume that Θ is a bounded set. We consider equilibrium distributions defined by probability density functions proportional to [U(θ) + δ]^J where J and δ are two strictly positive parameters. We use π^(J) to denote an equilibrium distribution, i.e. π^(J)(dθ) ∝ [U(θ) + δ]^J π_Leb(dθ) where π_Leb is the standard Lebesgue measure. Here, J^{−1} plays the role of the temperature: if the function U(θ) plus δ is taken to a positive power J then, as J increases (i.e. as J^{−1} decreases), [U(θ) + δ]^J becomes increasingly peaked around the global maximizers. The parameter δ is an offset which guarantees that the equilibrium densities are always strictly positive, even if U takes zero values on some elements of the domain. The offset δ is chosen by the user and we show later that our results allow one to make an optimal selection of δ. The zero-temperature distribution is the limiting distribution, for J → ∞, which takes non-zero values only on the set of global maximizers. It is denoted by π^(∞).

In the generic formulation of the method, the Markov transition kernel of the k-th step of the inhomogeneous chain has equilibrium distribution π^(J_k) where {J_k}_{k=1,2,...} is the "cooling schedule". The cooling schedule is a non-decreasing sequence of positive numbers according to which the equilibrium distributions become increasingly sharpened during the evolution of the chain. We use θ_k to denote the state of the chain and P_θk to denote its probability distribution. The distribution P_θk obviously depends on the initial condition θ_0. However, in this work, we don't need to make this dependence explicit in the notation.

Remark 1: If, given an element θ in Θ, the value U(θ) can be computed directly, we say that U is a deterministic criterion, e.g. the energy landscape in protein structure prediction [1]. In problems involving random variables, the value U(θ) may be the expected value U(θ) = ∫ g(x, θ) p_x(x; θ) dx of some function g which depends on both the optimization variable θ, and on some random variable x which has probability density p_x(x; θ) (which may itself depend on θ). In such problems it is usually not possible to compute U(θ) directly, either because evaluation of the integral requires too much computation, or because no analytical expression for p_x(x; θ) is available. Typically one must perform stochastic simulations in order to obtain samples of x for a given θ, hence obtain sample values of g(x, θ), and thus construct a Monte Carlo estimate of U(θ).
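For concreteness, a random-walk Metropolis sketch targeting the density proportional to [U(θ) + δ]^J on a box-bounded domain (our own illustration; the proposal width, bounds, and toy criterion are arbitrary choices, not prescribed by the paper):

import numpy as np

rng = np.random.default_rng(0)

def metropolis_pi_J(U, theta0, J, delta, n_steps, proposal_std, lo, hi):
    # Chain with equilibrium density proportional to [U(theta) + delta]**J on [lo, hi]^N
    theta = np.array(theta0, dtype=float)
    for _ in range(n_steps):
        cand = theta + rng.normal(0.0, proposal_std, size=theta.shape)
        if np.all(cand >= lo) and np.all(cand <= hi):  # moves outside Θ are rejected
            a = ((U(cand) + delta) / (U(theta) + delta)) ** J  # Metropolis acceptance ratio
            if rng.random() < a:
                theta = cand
    return theta

# Toy criterion on Θ = [0, 1]^2 with its global maximizer at (0.7, 0.3)
U = lambda th: np.exp(-20.0 * np.sum((th - np.array([0.7, 0.3]))**2))
print(metropolis_pi_J(U, theta0=[0.5, 0.5], J=50, delta=0.1, n_steps=20000, proposal_std=0.05, lo=0.0, hi=1.0))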
The Bayesian design of clinical trials is an important application area where such expected-value criteria arise [23]. The authors of this paper investigate the optimization of expected-value criteria motivated by problems of aircraft routing [24]. In the particular case that p_x(x; θ) does not depend on θ, the optimization task is often called "empirical risk minimization", and is studied extensively in statistical learning theory [16, 17]. The results of this paper apply in the same way to the optimization of both deterministic and expected-value criteria. The MCMC method developed by Müller [25, 26] allows one to construct simulated annealing algorithms for the optimization of expected-value criteria. Müller [25, 26] employs the same equilibrium distributions as those described in our setting; in his context J is restricted to integer values.

2 Convergence

The rationale of simulated annealing is as follows: if the temperature is kept constant, say J_k = J, then the distribution of the state of the chain P_θk tends to the equilibrium distribution π^(J); if J → ∞ then the equilibrium distribution π^(J) tends to the zero-temperature distribution π^(∞); as a result, if the cooling schedule J_k tends to infinity, one obtains that P_θk "follows" π^(J_k) and that π^(J_k) tends to π^(∞) and eventually that the distribution of the state of the chain P_θk tends to π^(∞). The theory shows that, under conditions on the cooling schedule and the Markov transition kernels, the distribution of the state of the chain P_θk actually converges to the target zero-temperature distribution π^(∞) as k → ∞ [12, 13, 14, 15]. Convergence to the zero-temperature distribution implies that asymptotically the state of the chain eventually coincides with a global optimizer with probability one.

The difficulty which must be overcome in order to obtain finite step results on simulated annealing algorithms on a continuous domain is that usually, in an optimization problem defined over continuous variables, the set of global optimizers has zero Lebesgue measure (e.g. a set of isolated points). If the set of global optimizers has zero measure then the set of global optimizers has null probability according to the equilibrium distributions π^(J) for any finite J and, as a consequence, according to the distributions P_θk for any finite k. Put another way, the probability that the state of the chain visits the set of global optimizers is constantly zero after any finite number of steps. Hence the confidence of the fact that the solution provided by the algorithm in finite time coincides with a global optimizer is also constantly zero. Notice that this is not the case for a finite domain, where the set of global optimizers is of non-null measure with respect to the reference counting measure [7, 8, 9, 10].

It is instructive to look at the issue also in terms of the rate of convergence to the target zero-temperature distribution. On a discrete domain, the distribution of the state of the chain at each step and the zero-temperature distribution are both standard discrete distributions. It is then possible to define a distance between them and study the rate of convergence of this distance to zero. This analysis allows one to obtain results on the finite-time behavior of simulated annealing [7, 8]. On a continuous domain and for a set of global optimizers of measure zero, the target zero-temperature distribution π^(∞) ends up being a mixture of probability masses on the set of global optimizers.
In this situation, although the distribution of the state of the chain P_θk still converges asymptotically to π^(∞), it is not possible to introduce a sensible distance between the two distributions and a rate of convergence to the target distribution cannot even be defined (weak convergence), see [12, Theorem 3.3]. This is the reason that until now there have been no guarantees on the performance of simulated annealing on a continuous domain after a finite number of computations: by adopting the zero-temperature distribution π^(∞) as the target distribution it is only possible to prove asymptotic convergence in infinite time to a global optimizer.

Remark 2: The standard distance between two distributions, say µ_1 and µ_2, on a continuous support is the total variation norm ‖µ_1 − µ_2‖_TV = sup_A |µ_1(A) − µ_2(A)|, see e.g. [21]. In simulated annealing on a continuous domain the distribution of the state of the chain P_θk is absolutely continuous with respect to the Lebesgue measure (i.e. π_Leb(A) = 0 ⇒ P_θk(A) = 0), by construction for any finite k. Hence if the set of global optimizers has zero Lebesgue measure then it has zero measure also according to P_θk. The set of global optimizers has however measure 1 according to π^(∞). The distance ‖P_θk − π^(∞)‖_TV is then constantly 1 for any finite k. It is also worth mentioning that if the set of global optimizers has zero measure then asymptotic convergence to the zero-temperature distribution π^(∞) can be proven only under the additional assumptions of continuity and differentiability of U [12, 13, 14, 15].

3 Finite-time guarantees

In general, optimization algorithms for problems defined on continuous variables can only find approximate solutions in finite time [27]. Given an element θ of a continuous domain, how can we assess how good it is as an approximate solution to an optimization problem? Here we introduce the concept of approximate global optimizer to answer this question. The definition is given for a maximization problem in a continuous but bounded domain. We use two parameters: the value imprecision ǫ (greater than or equal to 0) and the residual domain α (between 0 and 1) which together determine the level of approximation. We say that θ is an approximate global optimizer of U with value imprecision ǫ and residual domain α if the function U takes values strictly greater than U(θ) + ǫ only on a subset of values of θ no larger than an α portion of the optimization domain. The formal definition is as follows.

Definition 1 Let U : Θ → R be an optimization criterion where Θ ⊂ R^N is bounded. Let π_Leb denote the standard Lebesgue measure. Let ǫ ≥ 0 and α ∈ [0, 1] be given numbers. Then θ is an approximate global optimizer of U with value imprecision ǫ and residual domain α if

π_Leb{θ′ ∈ Θ : U(θ′) > U(θ) + ǫ} ≤ α π_Leb(Θ).

In other words, the value U(θ) is within ǫ of a value which is greater than the values that U takes on at least a 1 − α portion of the domain. The smaller ǫ and α are, the better is the approximation of a true global optimizer. If both α and ǫ are equal to zero then U(θ) coincides with the essential supremum of U.

Our definition of approximate global optimizer carries an important property, which holds regardless of what the criterion U is: if ǫ and α have non-zero values then the set of approximate global optimizers always has non-zero Lebesgue measure. It follows that the probability that the chain visits the set of approximate global optimizers can be non-zero.
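Definition 1 can be probed numerically on a box domain by estimating, with uniform samples, the Lebesgue fraction of Θ on which U exceeds U(θ) + ǫ. A sketch (ours; the uniform-sampling estimate of the measure carries its own statistical error, which we ignore here, and the toy criterion is arbitrary):

import numpy as np

rng = np.random.default_rng(1)

def is_approx_global_optimizer(U, theta, eps, alpha, lo, hi, n_samples=200000):
    # Monte Carlo estimate of pi_Leb{theta': U(theta') > U(theta) + eps} / pi_Leb(Theta)
    theta = np.asarray(theta, dtype=float)
    samples = rng.uniform(lo, hi, size=(n_samples, theta.size))
    frac_better = np.mean(U(samples) > U(theta) + eps)
    return frac_better <= alpha

# Toy criterion, vectorized over rows; global maximizer at (0.7, 0.3)
U = lambda th: np.exp(-20.0 * np.sum((th - np.array([0.7, 0.3]))**2, axis=-1))
print(is_approx_global_optimizer(U, theta=[0.68, 0.31], eps=0.05, alpha=0.01, lo=0.0, hi=1.0))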
Hence, it is sensible to study the confidence of the fact that the solution found by simulated annealing in finite time is an approximate global optimizer.

Remark 3: The intuition that our notion of approximate global optimizer can be used to obtain formal guarantees on the finite-time performance of optimization methods based on a stochastic search of the domain is already apparent in the work of Vidyasagar [17, 28]. Vidyasagar [17, 28] introduces a similar definition and obtains rigorous finite-time guarantees in the optimization of expected value criteria based on uniform independent sampling of the domain. Notably, the number of independent samples required to guarantee some desired accuracy and confidence turns out to be polynomial in the values of the desired imprecision, residual domain and confidence. Although the method of Vidyasagar is not highly sophisticated, it has had considerable success in solving difficult control system design applications [28, 29]. Its appeal stems from its rigorous finite-time guarantees which exist without the need for any particular assumption on the optimization criterion.

Here we show that finite-time guarantees for simulated annealing can be obtained by selecting a distribution π^(J) with a finite J as the target distribution in place of the zero-temperature distribution π^(∞). The fundamental result is the following theorem which allows one to select in a rigorous way δ and J in the target distribution π^(J). It is important to stress that the result holds universally for any optimization criterion U on a bounded domain. The only minor requirement is that U takes values in [0, 1].

Theorem 1 Let U : Θ → [0, 1] be an optimization criterion where Θ ⊂ R^N is bounded. Let J ≥ 1 and δ > 0 be given numbers. Let θ be a multivariate random variable with distribution π^(J)(dθ) ∝ [U(θ) + δ]^J π_Leb(dθ). Let α ∈ (0, 1] and ǫ ∈ [0, 1] be given numbers and define

σ = 1 / ( 1 + [(1 + δ)/(ǫ + 1 + δ)]^J · [(1 + δ)/(α(ǫ + δ)) − 1] · (1 + δ)/δ ).   (1)

Then the statement "θ is an approximate global optimizer of U with value imprecision ǫ and residual domain α" holds with probability at least σ.

Proof. See Appendix A.

The importance of the choice of a target distribution π^(J) with a finite J is that π^(J) is absolutely continuous with respect to the Lebesgue measure. Hence, the distance ‖P_θk − π^(J)‖_TV between the distribution of the state of the chain P_θk and the target distribution π^(J) is a meaningful quantity. Convergence of the Metropolis-Hastings algorithm and MCMC methods in total variation norm is a well studied problem. The theory provides simple conditions under which one derives upper bounds on the distance to the target distribution which are known at each step of the chain and decrease monotonically to zero as the number of steps of the chain grows. The theory has been developed mainly for homogeneous chains [18, 19, 20, 21]. In the case of simulated annealing, the factor that enables us to employ these results is the absolute continuity of the target distribution π^(J) with respect to the Lebesgue measure. However, simulated annealing involves the simulation of inhomogeneous chains. In this respect, another important fact is that the choice of a target distribution π^(J) with a finite J implies that the inhomogeneous Markov chain can in fact be formed by a finite sequence of homogeneous chains (i.e. the cooling schedule {J_k}_{k=1,2,...} can be chosen to be a sequence that takes only a finite set of values).
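A small numerical sketch of how Theorem 1 can be used (our own code, based on equation (1) as reconstructed above): compute the confidence σ, and search for the smallest J that reaches a desired confidence for a few values of the offset δ:

def confidence_sigma(J, eps, alpha, delta):
    # Lower bound (1) on the probability that a draw from pi^(J) is an approximate
    # global optimizer with value imprecision eps and residual domain alpha
    r = ((1.0 + delta) / (eps + 1.0 + delta)) ** J
    return 1.0 / (1.0 + r * ((1.0 + delta) / (alpha * (eps + delta)) - 1.0) * (1.0 + delta) / delta)

def smallest_J(eps, alpha, delta, target_sigma):
    # Smallest integer J for which the bound reaches the desired confidence
    J = 1
    while confidence_sigma(J, eps, alpha, delta) < target_sigma:
        J += 1
    return J

# Value imprecision 0.05, residual domain 1%, confidence 0.99; delta can then be
# tuned to minimize the required J, as noted in the text
for delta in (0.1, 0.5, 1.0):
    print(delta, smallest_J(eps=0.05, alpha=0.01, delta=delta, target_sigma=0.99))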
In turn, this allows one to apply the theory of homogeneous MCMC methods to study the convergence of P_θk to π^(J) in total variation norm. On a bounded domain, simple conditions on the 'proposal distribution' in the iteration of the simulated annealing algorithm allow one to obtain upper bounds on ‖P_θk − π^(J)‖_TV that decrease geometrically to zero as k → ∞, without the need for any additional assumption on U [18, 19, 20, 21]. It is then appropriate to introduce the following finite-time result.

Theorem 2 Let the notation and assumptions of Theorem 1 hold. Let θ_k, with distribution P_θk, be the state of the inhomogeneous chain of a simulated annealing algorithm with target distribution π^(J). Then the statement "θ_k is an approximate global optimizer of U with value imprecision ǫ and residual domain α" holds with probability at least σ − ‖P_θk − π^(J)‖_TV.

The proof of the theorem follows directly from the definition of the total variation norm. It follows that if simulated annealing is implemented with an algorithm which converges in total variation distance to a target distribution π^(J) with a finite J, then one can state with confidence arbitrarily close to 1 that the solution found by the algorithm after the known appropriate finite number of steps is an approximate global optimizer with the desired approximation level. For given non-zero values of ǫ and α, the value of σ given by (1) can be made arbitrarily close to 1 by choice of J, while the distance ‖P_θk − π^(J)‖_TV can be made arbitrarily small by taking the known sufficient number of steps. It can be shown that an optimal choice of δ and J in the target distribution π^(J) is possible: for given ǫ and α and a given value of J there exists an optimal choice of δ which maximizes the value of σ given by (1). Hence, it is possible to obtain a desired σ with the smallest possible J. The advantage of choosing the smallest J, consistent with the required approximation and confidence, is that it will decrease the number of steps required to achieve the desired reduction of ‖P_θk − π^(J)‖_TV.

4 Conclusions

We have introduced a new formulation of simulated annealing which admits rigorous finite-time guarantees in the optimization of functions of continuous variables. First, we have introduced the notion of approximate global optimizer. Then, we have shown that simulated annealing is guaranteed to find approximate global optimizers, with the desired confidence and the desired level of accuracy, in a known finite number of steps, if a proper choice of the target distribution is made and conditions for convergence in total variation norm are met. The results hold for any optimization criterion on a bounded domain with the only minor requirement that it takes values between 0 and 1. In this framework, simulated annealing algorithms with rigorous finite-time guarantees can be derived by studying the choice of the proposal distribution and of the cooling schedule, in the generic iteration of simulated annealing, in order to ensure convergence to the target distribution in total variation norm. To do this, existing theory of convergence of the Metropolis-Hastings algorithm and MCMC methods on continuous domains can be used [18, 19, 20, 21].
Vidyasagar [17, 28] has introduced a similar definition of approximate global optimizer and has shown that approximate optimizers with desired accuracy and confidence can be obtained with a number of uniform independent samples of the domain which is polynomial in the accuracy and confidence parameters. In general, algorithms developed with the MCMC methodology can be expected to be equally or more efficient than uniform independent sampling.

Acknowledgments
Work supported by EPSRC, Grant EP/C014006/1, and by the European Commission under projects HYGEIA FP6-NEST-4995 and iFly FP6-TREN-037180. We thank S. Brooks, M. Vidyasagar and D. M. Wolpert for discussions and useful comments on the paper.

A Proof of Theorem 1

Let ᾱ ∈ (0, 1] and ρ ∈ (0, 1] be given numbers. Let U_δ(θ) := U(θ) + δ and let π̄_δ be a normalized measure such that π̄_δ(dθ) ∝ U_δ(θ) π_Leb(dθ). In the first part of the proof we find a lower bound on the probability that θ belongs to the set {θ ∈ Θ : π̄_δ{θ′ ∈ Θ : ρ U_δ(θ′) > U_δ(θ)} ≤ ᾱ}.

Let y_ᾱ := inf{y : π̄_δ{θ ∈ Θ : U_δ(θ) ≤ y} ≥ 1 − ᾱ}. To start with, we show that the set {θ ∈ Θ : π̄_δ{θ′ ∈ Θ : ρ U_δ(θ′) > U_δ(θ)} ≤ ᾱ} coincides with {θ ∈ Θ : U_δ(θ) ≥ ρ y_ᾱ}. Notice that the quantity π̄_δ{θ ∈ Θ : U_δ(θ) ≤ y} is a right-continuous non-decreasing function of y because it has the form of a distribution function (see e.g. [30, p. 162] and [17, Lemma 11.1]). Therefore we have π̄_δ{θ ∈ Θ : U_δ(θ) ≤ y_ᾱ} ≥ 1 − ᾱ and

y ≥ ρ y_ᾱ ⇒ π̄_δ{θ′ ∈ Θ : ρ U_δ(θ′) ≤ y} ≥ 1 − ᾱ ⇒ π̄_δ{θ′ ∈ Θ : ρ U_δ(θ′) > y} ≤ ᾱ.

Moreover,

y < ρ y_ᾱ ⇒ π̄_δ{θ′ ∈ Θ : ρ U_δ(θ′) ≤ y} < 1 − ᾱ ⇒ π̄_δ{θ′ ∈ Θ : ρ U_δ(θ′) > y} > ᾱ,

and taking the contrapositive one obtains π̄_δ{θ′ ∈ Θ : ρ U_δ(θ′) > y} ≤ ᾱ ⇒ y ≥ ρ y_ᾱ. Therefore {θ ∈ Θ : U_δ(θ) ≥ ρ y_ᾱ} ≡ {θ ∈ Θ : π̄_δ{θ′ ∈ Θ : ρ U_δ(θ′) > U_δ(θ)} ≤ ᾱ}.

We now derive a lower bound on π^(J){θ ∈ Θ : U_δ(θ) ≥ ρ y_ᾱ}. Let us introduce the notation A_ᾱ := {θ ∈ Θ : U_δ(θ) < y_ᾱ}, Ā_ᾱ := {θ ∈ Θ : U_δ(θ) ≥ y_ᾱ}, B_ᾱ,ρ := {θ ∈ Θ : U_δ(θ) < ρ y_ᾱ} and B̄_ᾱ,ρ := {θ ∈ Θ : U_δ(θ) ≥ ρ y_ᾱ}. Notice that B_ᾱ,ρ ⊆ A_ᾱ and Ā_ᾱ ⊆ B̄_ᾱ,ρ. The quantity π̄_δ{θ ∈ Θ : U_δ(θ) < y}, as a function of y, is the left-continuous version of π̄_δ{θ ∈ Θ : U_δ(θ) ≤ y} [30, p. 162]. Hence, the definition of y_ᾱ implies π̄_δ(A_ᾱ) ≤ 1 − ᾱ and π̄_δ(Ā_ᾱ) ≥ ᾱ. Notice that

π̄_δ(A_ᾱ) ≤ 1 − ᾱ ⇒ δ π_Leb(A_ᾱ) / ∫_Θ U_δ(θ) π_Leb(dθ) ≤ 1 − ᾱ,
π̄_δ(Ā_ᾱ) ≥ ᾱ ⇒ (1 + δ) π_Leb(Ā_ᾱ) / ∫_Θ U_δ(θ) π_Leb(dθ) ≥ ᾱ.

Hence, π_Leb(Ā_ᾱ) > 0 and

π_Leb(A_ᾱ) / π_Leb(Ā_ᾱ) ≤ [(1 − ᾱ)/ᾱ] · [(1 + δ)/δ].

Notice that π_Leb(Ā_ᾱ) > 0 implies π_Leb(B̄_ᾱ,ρ) > 0. We obtain

π^(J){θ ∈ Θ : U_δ(θ) ≥ ρ y_ᾱ} = [1 + ∫_{B_ᾱ,ρ} U_δ(θ)^J π_Leb(dθ) / ∫_{B̄_ᾱ,ρ} U_δ(θ)^J π_Leb(dθ)]^{−1}
≥ [1 + (ρ y_ᾱ)^J π_Leb(B_ᾱ,ρ) / (y_ᾱ^J π_Leb(Ā_ᾱ))]^{−1}
= [1 + ρ^J π_Leb(B_ᾱ,ρ) / π_Leb(Ā_ᾱ)]^{−1}
≥ [1 + ρ^J π_Leb(A_ᾱ) / π_Leb(Ā_ᾱ)]^{−1}
≥ [1 + ρ^J ((1 − ᾱ)/ᾱ)((1 + δ)/δ)]^{−1}.

Since {θ ∈ Θ : U_δ(θ) ≥ ρ y_ᾱ} ≡ {θ ∈ Θ : π̄_δ{θ′ ∈ Θ : ρ U_δ(θ′) > U_δ(θ)} ≤ ᾱ}, the first part of the proof is complete.

In the second part of the proof we show that the set {θ ∈ Θ : π̄_δ{θ′ ∈ Θ : ρ U_δ(θ′) > U_δ(θ)} ≤ ᾱ} is contained in the set of approximate global optimizers of U with value imprecision ǫ̃ := (ρ^{−1} − 1)(1 + δ) and residual domain α̃ := ᾱ (1 + δ)/(ǫ̃ + δ). Hence, we show that {θ ∈ Θ : π̄_δ{θ′ ∈ Θ : ρ U_δ(θ′) > U_δ(θ)} ≤ ᾱ} ⊆ {θ ∈ Θ : π_Leb{θ′ ∈ Θ : U(θ′) > U(θ) + ǫ̃} ≤ α̃ π_Leb(Θ)}. We have

U(θ′) > U(θ) + ǫ̃ ⇔ ρ U_δ(θ′) > ρ [U_δ(θ) + ǫ̃] ⇒ ρ U_δ(θ′) > U_δ(θ),

which is proven by noticing that ρ [U_δ(θ) + ǫ̃] ≥ U_δ(θ) ⇔ 1 − ρ ≥ U(θ)(1 − ρ) and U(θ) ∈ [0, 1]. Hence

{θ′ ∈ Θ : ρ U_δ(θ′) > U_δ(θ)} ⊇ {θ′ ∈ Θ : U(θ′) > U(θ) + ǫ̃}.

Therefore π̄_δ{θ′ ∈ Θ : ρ U_δ(θ′) > U_δ(θ)} ≤ ᾱ ⇒ π̄_δ{θ′ ∈ Θ : U(θ′) > U(θ) + ǫ̃} ≤ ᾱ. Let Q_{θ,ǫ̃} := {θ′ ∈ Θ : U(θ′) > U(θ) + ǫ̃} and notice that

π̄_δ{θ′ ∈ Θ : U(θ′) > U(θ) + ǫ̃} = [∫_{Q_{θ,ǫ̃}} U(θ′) π_Leb(dθ′) + δ π_Leb(Q_{θ,ǫ̃})] / [∫_Θ U(θ′) π_Leb(dθ′) + δ π_Leb(Θ)].

We obtain

π̄_δ{θ′ ∈ Θ : U(θ′) > U(θ) + ǫ̃} ≤ ᾱ ⇒ ǫ̃ π_Leb(Q_{θ,ǫ̃}) + δ π_Leb(Q_{θ,ǫ̃}) ≤ ᾱ (1 + δ) π_Leb(Θ) ⇒ π_Leb{θ′ ∈ Θ : U(θ′) > U(θ) + ǫ̃} ≤ α̃ π_Leb(Θ).

Hence we can conclude that π̄_δ{θ′ ∈ Θ : ρ U_δ(θ′) > U_δ(θ)} ≤ ᾱ ⇒ π_Leb{θ′ ∈ Θ : U(θ′) > U(θ) + ǫ̃} ≤ α̃ π_Leb(Θ), and the second part of the proof is complete.

We have shown that, given ᾱ ∈ (0, 1], ρ ∈ (0, 1], ǫ̃ := (ρ^{−1} − 1)(1 + δ), α̃ := ᾱ (1 + δ)/(ǫ̃ + δ) and

σ := [1 + ρ^J ((1 − ᾱ)/ᾱ)((1 + δ)/δ)]^{−1} = [1 + ((1 + δ)/(ǫ̃ + 1 + δ))^J ((1 + δ)/(α̃(ǫ̃ + δ)) − 1)((1 + δ)/δ)]^{−1},

the statement "θ is an approximate global optimizer of U with value imprecision ǫ̃ and residual domain α̃" holds with probability at least σ. Notice that ǫ̃ ∈ [0, 1] and α̃ ∈ (0, 1] are linked through a bijective relation to ρ ∈ [(1 + δ)/(2 + δ), 1] and ᾱ ∈ (0, (ǫ̃ + δ)/(1 + δ)]. The statement of the theorem is eventually obtained by expressing σ as a function of the desired ǫ = ǫ̃ and α = α̃.

References
[1] D. J. Wales. Energy Landscapes. Cambridge University Press, Cambridge, UK, 2003.
[2] D. Achlioptas, A. Naor, and Y. Peres. Rigorous location of phase transitions in hard optimization problems. Nature, 435:759–764, 2005.
[3] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by Simulated Annealing. Science, 220(4598):671–680, 1983.
[4] E. Bonomi and J. Lutton. The N-city travelling salesman problem: statistical mechanics and the Metropolis algorithm. SIAM Rev., 26(4):551–568, 1984.
[5] Y. Fu and P. W. Anderson. Application of statistical mechanics to NP-complete problems in combinatorial optimization. J. Phys. A: Math. Gen., 19(9):1605–1620, 1986.
[6] M. Mézard, G. Parisi, and R. Zecchina. Analytic and Algorithmic Solution of Random Satisfiability Problems. Science, 297:812–815, 2002.
[7] P. M. J. van Laarhoven and E. H. L. Aarts. Simulated Annealing: Theory and Applications. D. Reidel Publishing Company, Dordrecht, Holland, 1987.
[8] D. Mitra, F. Romeo, and A. Sangiovanni-Vincentelli. Convergence and finite-time behavior of simulated annealing. Adv. Appl. Prob., 18:747–771, 1986.
[9] B. Hajek. Cooling schedules for optimal annealing. Math. Oper. Res., 13:311–329, 1988.
[10] J. Hannig, E. K. P. Chong, and S. R. Kulkarni. Relative Frequencies of Generalized Simulated Annealing. Math. Oper. Res., 31(1):199–216, 2006.
[11] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, UK, 2004.
[12] H. Haario and E. Saksman. Simulated annealing process in general state space. Adv. Appl. Prob., 23:866–893, 1991.
[13] S. B. Gelfand and S. K. Mitter. Simulated Annealing Type Algorithms for Multivariate Optimization. Algorithmica, 6:419–436, 1991.
[14] C. Tsallis and D. A. Stariolo. Generalized simulated annealing. Physica A, 233:395–406, 1996.
[15] M. Locatelli. Simulated Annealing Algorithms for Continuous Global Optimization: Convergence Conditions. J. Optimiz. Theory App., 104(1):121–133, 2000.
[16] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, US, 1995.
[17] M. Vidyasagar. Learning and Generalization: With Application to Neural Networks. Springer-Verlag, London, second edition, 2003.
[17] M. Vidyasagar. Learning and Generalization: With Application to Neural Networks. Springer-Verlag, London, second edition, 2003.
[18] S. P. Meyn and R. L. Tweedie. Markov Chains and Stochastic Stability. Springer-Verlag, London, 1993.
[19] J. S. Rosenthal. Minorization Conditions and Convergence Rates for Markov Chain Monte Carlo. J. Am. Stat. Assoc., 90(430):558–566, 1995.
[20] K. L. Mengersen and R. L. Tweedie. Rates of convergence of the Hastings and Metropolis algorithms. Ann. Stat., 24(1):101–121, 1996.
[21] G. O. Roberts and J. S. Rosenthal. General state space Markov chains and MCMC algorithms. Prob. Surv., 1:20–71, 2004.
[22] C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer-Verlag, New York, second edition, 2004.
[23] D. J. Spiegelhalter, K. R. Abrams, and J. P. Myles. Bayesian approaches to clinical trials and health-care evaluation. John Wiley & Sons, Chichester, UK, 2004.
[24] A. Lecchini-Visintini, W. Glover, J. Lygeros, and J. M. Maciejowski. Monte Carlo Optimization for Conflict Resolution in Air Traffic Control. IEEE Trans. Intell. Transp. Syst., 7(4):470–482, 2006.
[25] P. Müller. Simulation based optimal design. In J. O. Berger, J. M. Bernardo, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics 6: proceedings of the Sixth Valencia International Meeting, pages 459–474. Oxford: Clarendon Press, 1999.
[26] P. Müller, B. Sansó, and M. De Iorio. Optimal Bayesian design by Inhomogeneous Markov Chain Simulation. J. Am. Stat. Assoc., 99(467):788–798, 2004.
[27] L. Blum, F. Cucker, M. Shub, and S. Smale. Complexity and Real Computation. Springer-Verlag, New York, 1998.
[28] M. Vidyasagar. Randomized algorithms for robust controller synthesis using statistical learning theory. Automatica, 37(10):1515–1528, 2001.
[29] R. Tempo, G. Calafiore, and F. Dabbene. Randomized Algorithms for Analysis and Control of Uncertain Systems. Springer-Verlag, London, 2005.
[30] B. V. Gnedenko. Theory of Probability. Chelsea, New York, fourth edition, 1968.
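As a side note on the uniform-sampling baseline mentioned at the top of this excerpt: the Vidyasagar-style result is easy to make concrete. The sketch below (all names are ours, not from the paper) draws N uniform independent samples and keeps the best one; if a fraction α of the domain has higher value than the returned point, that event has probability at most (1 − α)^N, so N of order (1/α) log(1/σ) samples suffice for confidence 1 − σ, which is the polynomial sample complexity referred to above (stated here for the residual-domain parameter only).

```python
import math
import random

def uniform_sample_optimizer(u, sample_domain, n_samples):
    """Approximate global maximizer of u by uniform independent sampling.

    P(all n_samples miss the top-alpha set) <= (1 - alpha)**n_samples,
    so n_samples ~ (1/alpha) * log(1/sigma) gives confidence 1 - sigma.
    """
    best_theta, best_val = None, float("-inf")
    for _ in range(n_samples):
        theta = sample_domain()          # one uniform draw from the domain
        val = u(theta)
        if val > best_val:
            best_theta, best_val = theta, val
    return best_theta, best_val

# Usage: maximize a toy multimodal function on [0, 1].
theta_hat, _ = uniform_sample_optimizer(
    u=lambda t: math.sin(8.0 * t) * t,
    sample_domain=random.random,
    n_samples=2000,
)
```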
3 0.49354526 162 nips-2007-Random Sampling of States in Dynamic Programming
Author: Chris Atkeson, Benjamin Stephens
Abstract: We combine three threads of research on approximate dynamic programming: sparse random sampling of states, value function and policy approximation using local models, and using local trajectory optimizers to globally optimize a policy and associated value function. Our focus is on finding steady state policies for deterministic time invariant discrete time control problems with continuous states and actions often found in robotics. In this paper we show that we can now solve problems we couldn’t solve previously. 1
4 0.46298063 4 nips-2007-A Constraint Generation Approach to Learning Stable Linear Dynamical Systems
Author: Byron Boots, Geoffrey J. Gordon, Sajid M. Siddiqi
Abstract: Stability is a desirable characteristic for linear dynamical systems, but it is often ignored by algorithms that learn these systems from data. We propose a novel method for learning stable linear dynamical systems: we formulate an approximation of the problem as a convex program, start with a solution to a relaxed version of the program, and incrementally add constraints to improve stability. Rather than continuing to generate constraints until we reach a feasible solution, we test stability at each step; because the convex program is only an approximation of the desired problem, this early stopping rule can yield a higher-quality solution. We apply our algorithm to the task of learning dynamic textures from image sequences as well as to modeling biosurveillance drug-sales data. The constraint generation approach leads to noticeable improvement in the quality of simulated sequences. We compare our method to those of Lacy and Bernstein [1, 2], with positive results in terms of accuracy, quality of simulated sequences, and efficiency. 1
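A heavily simplified sketch may help fix the constraint-generation idea this abstract describes. The real method incrementally adds linear constraints to a convex program and tests stability at each step, with early stopping; the stand-in below instead caps the largest singular value of the least-squares solution, which is a sufficient (but more conservative) condition for stability. All names are illustrative.

```python
import numpy as np

def fit_stable_lds(X, shrink=0.99, max_iters=100):
    """Fit x_{t+1} ~ A x_t from a state sequence X (dim x T), then nudge A
    toward stability.

    Stand-in for constraint generation: capping singular values at < 1
    forces spectral radius < 1; the loop stops as soon as A is stable.
    """
    X0, X1 = X[:, :-1], X[:, 1:]
    A = X1 @ np.linalg.pinv(X0)                    # unconstrained least squares
    for _ in range(max_iters):
        if np.max(np.abs(np.linalg.eigvals(A))) < 1.0:
            break                                  # early stopping: A is stable
        U, s, Vt = np.linalg.svd(A)
        s = np.minimum(s, shrink)                  # cap singular values below 1
        A = U @ np.diag(s) @ Vt
    return A
```

The conservatism of the singular-value cap is exactly what the paper's constraint-generation approach avoids: it only constrains the program as much as the data demand.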
5 0.46083179 51 nips-2007-Comparing Bayesian models for multisensory cue combination without mandatory integration
Author: Ulrik Beierholm, Ladan Shams, Wei J. Ma, Konrad Koerding
Abstract: Bayesian models of multisensory perception traditionally address the problem of estimating an underlying variable that is assumed to be the cause of the two sensory signals. The brain, however, has to solve a more general problem: it also has to establish which signals come from the same source and should be integrated, and which ones do not and should be segregated. In the last couple of years, a few models have been proposed to solve this problem in a Bayesian fashion. One of these has the strength that it formalizes the causal structure of sensory signals. We first compare these models on a formal level. Furthermore, we conduct a psychophysics experiment to test human performance in an auditory-visual spatial localization task in which integration is not mandatory. We find that the causal Bayesian inference model accounts for the data better than other models. Keywords: causal inference, Bayesian methods, visual perception.

1 Multisensory perception

In the ventriloquist illusion, a performer speaks without moving his/her mouth while moving a puppet's mouth in synchrony with his/her speech. This makes the puppet appear to be speaking. This illusion was first conceptualized as "visual capture", occurring when visual and auditory stimuli exhibit a small conflict ([1, 2]). Only recently has it been demonstrated that the phenomenon may be seen as a byproduct of a much more flexible and nearly Bayes-optimal strategy ([3]), and therefore is part of a large collection of cue combination experiments showing such statistical near-optimality [4, 5]. In fact, cue combination has become the poster child for Bayesian inference in the nervous system.

In previous studies of multisensory integration, two sensory stimuli are presented which act as cues about a single underlying source. For instance, in the auditory-visual localization experiment by Alais and Burr [3], observers were asked to envisage each presentation of a light blob and a sound click as a single event, like a ball hitting the screen. In many cases, however, the brain is not only posed with the problem of identifying the position of a common source, but also of determining whether there was a common source at all. In the on-stage ventriloquist illusion, it is indeed primarily the causal inference process that is being fooled, because veridical perception would attribute independent causes to the auditory and the visual stimulus.

To extend our understanding of multisensory perception to this more general problem, it is necessary to manipulate the degree of belief assigned to there being a common cause within a multisensory task. Intuitively, we expect that when two signals are very different, they are less likely to be perceived as having a common source. It is well known that increasing the discrepancy or inconsistency between stimuli reduces the influence that they have on each other [6, 7, 8, 9, 10, 11]. In auditory-visual spatial localization, one variable that controls stimulus similarity is spatial disparity (another would be temporal disparity). Indeed, it has been reported that increasing spatial disparity leads to a decrease in auditory localization bias [1, 12, 13, 14, 15, 16, 17, 2, 18, 19, 20, 21]. This decrease also correlates with a decrease in the reports of unity [19, 21]. Despite the abundance of experimental data on this issue, no general theory exists that can explain multisensory perception across a wide range of cue conflicts.
2 Models

The success of Bayesian models for cue integration has motivated attempts to extend them to situations of large sensory conflict and a consequent low degree of integration. In one recent study taking this approach, subjects were presented with concurrent visual flashes and auditory beeps and asked to count both the number of flashes and the number of beeps [11]. The advantage of the experimental paradigm adopted here was that it probed the joint response distribution by requiring a dual report. Human data were accounted for well by a Bayesian model in which the joint prior distribution over visual and auditory number was approximated from the data. In a similar study, subjects were presented with concurrent flashes and taps and asked to count either the flashes or the taps [9, 22]. The Bayesian model proposed by these authors assumed a joint prior distribution with a near-diagonal form. The corresponding generative model assumes that the sensory sources somehow interact with one another. A third experiment modulated the rates of flashes and beeps. The task was to judge either the visual or the auditory modulation rate relative to a standard [23]. The data from this experiment were modeled using a joint prior distribution which is the sum of a near-diagonal prior and a flat background.

While all these models are Bayesian in a formal sense, their underlying generative model does not formalize the model selection process that underlies the combination of cues. This makes it necessary to either estimate an empirical prior [11] by fitting it to human behavior or to assume an ad hoc form [22, 23]. However, we believe that such assumptions are not needed. It was shown recently that human judgments of spatial unity in an auditory-visual spatial localization task can be described using a Bayesian inference model that infers causal structure [24, 25]. In this model, the brain not only estimates a stimulus variable, but also infers the probability that the two stimuli have a common cause. In this paper we compare these different models on a large data set of human position estimates in an auditory-visual task. In this section we first describe the traditional cue integration model, then the recent models based on joint stimulus priors, and finally the causal inference model. To relate to the experiment in the next section, we will use the terminology of auditory-visual spatial localization, but the formalism is very general.

2.1 Traditional cue integration

The traditional generative model of cue integration [26] has a single source location s which produces on each trial an internal representation (cue) of visual location, x_V, and one of auditory location, x_A. We assume that the noise processes by which these internal representations are generated are conditionally independent of each other and follow Gaussian distributions. That is, p(x_V | s) ∼ N(x_V; s, σ_V) and p(x_A | s) ∼ N(x_A; s, σ_A), where N(x; µ, σ) stands for the normal distribution over x with mean µ and standard deviation σ. If on a given trial the internal representations are x_V and x_A, the probability that their source was s is given by Bayes' rule, p(s | x_V, x_A) ∝ p(x_V | s) p(x_A | s). If a subject performs maximum-likelihood estimation, then the estimate will be ŝ = (w_V x_V + w_A x_A)/(w_V + w_A), where w_V = 1/σ_V² and w_A = 1/σ_A². It is important to keep in mind that this is the estimate on a single trial.
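To make the single-trial computation concrete, here is a minimal sketch of the reliability-weighted average (names and numbers are illustrative, not from the paper):

```python
def combine_cues(x_v, x_a, sigma_v, sigma_a):
    """Single-trial maximum-likelihood cue combination.

    Returns the reliability-weighted average of the two internal
    representations; the weights are the inverse variances.
    """
    w_v, w_a = 1.0 / sigma_v**2, 1.0 / sigma_a**2
    return (w_v * x_v + w_a * x_a) / (w_v + w_a)

# A precise visual cue at 2 deg and a noisy auditory cue at 8 deg yield
# an estimate pulled strongly toward the visual cue.
s_hat = combine_cues(x_v=2.0, x_a=8.0, sigma_v=2.0, sigma_a=9.0)
```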
A psychophysical experimenter can never have access to x_V and x_A, which are the noisy internal representations. Instead, an experimenter will want to collect estimates over many trials and is interested in the distribution of ŝ given s_V and s_A, which are the sources generated by the experimenter. In a typical cue combination experiment, x_V and x_A are not actually generated by the same source, but by different sources, a visual one s_V and an auditory one s_A. These sources are chosen close to each other so that the subject can imagine that the resulting cues originate from a single source and thus implicitly have a common cause. The experimentally observed distribution is then

  p(ŝ | s_V, s_A) = ∫∫ p(ŝ | x_V, x_A) p(x_V | s_V) p(x_A | s_A) dx_V dx_A.

Given that ŝ is a linear combination of two normally distributed variables, it will itself follow a normal distribution, with mean (w_V s_V + w_A s_A)/(w_V + w_A) and variance σ_ŝ² = 1/(w_V + w_A). The reason that we emphasize this point is that many authors identify the estimate distribution p(ŝ | s_V, s_A) with the posterior distribution p(s | x_V, x_A). This is justified in this case because all distributions are Gaussian and the estimate is a linear combination of cues. However, in the case of causal inference, these conditions are violated and the estimate distribution will in general not be the same as the posterior distribution.

2.2 Models with bisensory stimulus priors

Models with bisensory stimulus priors propose the posterior over source positions to be proportional to the product of unimodal likelihoods and a two-dimensional prior:

  p(s_V, s_A | x_V, x_A) ∝ p(s_V, s_A) p(x_V | s_V) p(x_A | s_A).

The traditional cue combination model has p(s_V, s_A) = p(s_V) δ(s_V − s_A), usually (as above) even with p(s_V) uniform. The question arises what bisensory stimulus prior is appropriate. In [11], the prior is estimated from data, has a large number of parameters, and is therefore limited in its predictive power. In [23], it has the form

  p(s_V, s_A) ∝ ω + exp( −(s_V − s_A)² / (2σ²_coupling) ),

while in [22] the additional assumption ω = 0 is made. (Footnote: this family of Bayesian posterior distributions also includes one used to successfully model cue combination in depth perception [27, 28]. In depth perception, however, there is no notion of segregation, as a single surface is always assumed.) In all three models, the response distribution p(ŝ_V, ŝ_A | s_V, s_A) is obtained by identifying it with the posterior distribution p(s_V, s_A | x_V, x_A). This procedure thus implicitly assumes that marginalizing over the latent variables x_V and x_A is not necessary, which leads to a significant error for non-Gaussian priors. In this paper we correctly deal with these issues and in all cases marginalize over the latent variables. The parametric models used for the coupling between the cues lead to an elegant low-dimensional model of cue integration that allows for estimates of single cues that differ from one another.

2.3 Causal inference model

In the causal inference model [24, 25], we start from the traditional cue integration model but remove the assumption that the two signals are caused by the same source. Instead, the number of sources can be one or two and is itself a variable that needs to be inferred from the cues. If there are two sources, they are assumed to be independent. Thus, we use the graphical model depicted in Fig. 1. We denote the number of sources by C.

[Figure 1: Generative model of causal inference. The graphical model shows the variable C selecting between a single source s generating both x_V and x_A (C = 1) and two independent sources s_V and s_A generating x_V and x_A (C = 2).]
The probability distribution over C given the internal representations x_V and x_A is given by Bayes' rule:

  p(C | x_V, x_A) ∝ p(x_V, x_A | C) p(C).

In this equation, p(C) is the a priori probability of C. We will denote the probability of a common cause by p_common, so that p(C = 1) = p_common and p(C = 2) = 1 − p_common. The probability of generating x_V and x_A given C is obtained by inserting a summation over the sources:

  p(x_V, x_A | C = 1) = ∫ p(x_V, x_A | s) p(s) ds = ∫ p(x_V | s) p(x_A | s) p(s) ds.

Here p(s) is a prior for spatial location, which we assume to be distributed as N(s; 0, σ_P). Then all three factors in this integral are Gaussians, allowing for an analytic solution:

  p(x_V, x_A | C = 1) = 1 / (2π √(σ_V²σ_A² + σ_V²σ_P² + σ_A²σ_P²)) · exp[ −(1/2) ((x_V − x_A)² σ_P² + x_V² σ_A² + x_A² σ_V²) / (σ_V²σ_A² + σ_V²σ_P² + σ_A²σ_P²) ].

For p(x_V, x_A | C = 2) we realize that x_V and x_A are independent of each other and thus obtain

  p(x_V, x_A | C = 2) = [ ∫ p(x_V | s_V) p(s_V) ds_V ] · [ ∫ p(x_A | s_A) p(s_A) ds_A ].

Again, as all these distributions are assumed to be Gaussian, we obtain an analytic solution:

  p(x_V, x_A | C = 2) = 1 / (2π √((σ_V² + σ_P²)(σ_A² + σ_P²))) · exp[ −(1/2) ( x_V²/(σ_V² + σ_P²) + x_A²/(σ_A² + σ_P²) ) ].

Now that we have computed p(C | x_V, x_A), the posterior distribution over sources is given by

  p(s_i | x_V, x_A) = Σ_{C=1,2} p(s_i | x_V, x_A, C) p(C | x_V, x_A),

where i can be V or A and the posteriors conditioned on C are well known:

  p(s_i | x_A, x_V, C = 1) = p(x_A | s_i) p(x_V | s_i) p(s_i) / ∫ p(x_A | s) p(x_V | s) p(s) ds,
  p(s_i | x_A, x_V, C = 2) = p(x_i | s_i) p(s_i) / ∫ p(x_i | s_i) p(s_i) ds_i.

The former is the same as in the case of mandatory integration with a prior; the latter is simply the unimodal posterior in the presence of a prior. Based on the posterior distribution on a given trial, p(s_i | x_V, x_A), an estimate has to be created. For this, we use a sum-squared-error cost function,

  Cost = p(C = 1 | x_V, x_A) (ŝ − s)² + p(C = 2 | x_V, x_A) (ŝ − s_{V or A})².

Then the best estimate is the mean of the posterior distribution, for instance for the visual estimate:

  ŝ_V = p(C = 1 | x_A, x_V) ŝ_{V,C=1} + p(C = 2 | x_A, x_V) ŝ_{V,C=2},

where

  ŝ_{V,C=1} = (x_V σ_V⁻² + x_A σ_A⁻² + x_P σ_P⁻²) / (σ_V⁻² + σ_A⁻² + σ_P⁻²)  and  ŝ_{V,C=2} = (x_V σ_V⁻² + x_P σ_P⁻²) / (σ_V⁻² + σ_P⁻²).

If p_common equals 0 or 1, this estimate reduces to one of the conditioned estimates and is linear in x_V and x_A. If 0 < p_common < 1, the estimate is a nonlinear combination of x_V and x_A, because of the functional form of p(C | x_V, x_A). The response distributions, that is, the distributions of ŝ_V and ŝ_A given s_V and s_A over many trials, now cannot be identified with the posterior distribution on a single trial and cannot be computed analytically either. The correct way to obtain the response distribution is to simulate an experiment numerically. Note that the causal inference model above can also be cast in the form of a bisensory stimulus prior by integrating out the latent variable C, with

  p(s_A, s_V) = p(C = 1) δ(s_A − s_V) p(s_A) + p(C = 2) p(s_A) p(s_V).

However, in addition to justifying the form of the interaction between the cues, the causal inference model has the advantage of being based on a generative model that formalizes salient properties of the world well, and it thereby also allows one to predict judgments of unity.

3 Model performance and comparison

To examine the performance of the causal inference model and to compare it to previous models, we performed a human psychophysics experiment in which we adopted the same dual-report paradigm as was used in [11].
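Before turning to the experimental details, the single-trial estimator of Section 2.3 can be sketched directly from the equations above (a hedged illustration, not the authors' code; the prior mean, written x_P above, is taken to be 0):

```python
import math

def causal_inference_estimate(x_v, x_a, sigma_v, sigma_a, sigma_p, p_common):
    """Model-averaged visual estimate for one trial (prior mean = 0)."""
    vv, aa, pp = sigma_v**2, sigma_a**2, sigma_p**2
    # Likelihood of a common cause (C = 1).
    z1 = vv * aa + vv * pp + aa * pp
    like1 = math.exp(-0.5 * ((x_v - x_a)**2 * pp + x_v**2 * aa + x_a**2 * vv) / z1) \
            / (2 * math.pi * math.sqrt(z1))
    # Likelihood of independent causes (C = 2).
    z2 = (vv + pp) * (aa + pp)
    like2 = math.exp(-0.5 * (x_v**2 / (vv + pp) + x_a**2 / (aa + pp))) \
            / (2 * math.pi * math.sqrt(z2))
    # Posterior over C by Bayes' rule.
    post1 = like1 * p_common / (like1 * p_common + like2 * (1 - p_common))
    # Conditional posterior means, then model averaging; post1 depends
    # nonlinearly on (x_v, x_a), which makes the final estimate nonlinear.
    s_c1 = (x_v / vv + x_a / aa) / (1 / vv + 1 / aa + 1 / pp)
    s_c2 = (x_v / vv) / (1 / vv + 1 / pp)
    return post1 * s_c1 + (1 - post1) * s_c2
```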
Observers were simultaneously presented with a brief visual and also an auditory stimulus, each of which could originate from one of five locations on an imaginary horizontal line (-10°, -5°, 0°, 5°, or 10° with respect to the fixation point). Auditory stimuli were 32 ms of white noise filtered through an individually calibrated head related transfer function (HRTF) and presented through a pair of headphones, whereas the visual stimuli were high contrast Gabors on a noisy background presented on a 21-inch CRT monitor. Observers had to report by means of a key press (1-5) the perceived positions of both the visual and the auditory stimulus. Each combination of locations was presented with the same frequency over the course of the experiment. In this way, for each condition, visual and auditory response histograms were obtained.

We obtained response distributions for each of the three models described above by numerical simulation. On each trial, estimation is followed by a step in which the key corresponding to the position closest to the best estimate is selected. The simulated histograms obtained in this way were compared to the measured response frequencies of all subjects by computing the R² statistic. The parameters in the causal inference model were optimized using fminsearch in MATLAB to maximize R². The best combination of parameters yielded an R² of 0.97. The response frequencies are depicted in Fig. 2. The bisensory prior models also explain most of the variance, with R² = 0.96 for the Roach model and R² = 0.91 for the Bresciani model. This shows that it is possible to model cue combination for large disparities well using such models.

[Figure 2: A comparison between subjects' performance and the causal inference model. The blue line indicates the frequency of subjects' responses to visual stimuli, the red line the responses to auditory stimuli. Each set of lines is one set of audio-visual stimulus conditions: rows of conditions indicate a constant visual stimulus, columns a constant auditory stimulus ("no vision" and "no audio" mark the unimodal conditions). Model predictions are indicated by the red and blue dotted lines.]

3.1 Model comparison

To facilitate quantitative comparison with other models, we now fit the parameters of each model (see the parameter note in Section 3.2) to individual subject data, maximizing the likelihood of the model, i.e., the probability of the response frequencies under the model. The causal inference model fits human data better than the other models. Compared to the best fit of the causal inference model, the Bresciani model has a maximal log likelihood ratio (base e) of the data of −22 ± 6 (mean ± s.e.m. over subjects), and the Roach model has a maximal log likelihood ratio of the data of −18 ± 6. A causal inference model that maximizes the probability of being correct instead of minimizing the mean squared error has a maximal log likelihood ratio of −18 ± 3. These values are considered decisive evidence in favor of the causal inference model that minimizes the mean squared error (for details, see [25]). The parameter values found in the likelihood optimization of the causal model are as follows: p_common = 0.28 ± 0.05, σ_V = 2.14 ± 0.22°, σ_A = 9.2 ± 1.1°, σ_P = 12.3 ± 1.1° (mean ± s.e.m. over subjects). We see that there is a relatively low prior probability of a common cause. In this paradigm, auditory localization is considerably less precise than visual localization. Also, there is a weak prior for central locations.
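The simulate-then-compare loop described above can be sketched as follows. It reuses causal_inference_estimate from the previous sketch; the MATLAB fminsearch step is omitted (one would wrap simulate_histogram in any derivative-free optimizer), and all names are illustrative:

```python
import numpy as np

KEYS = np.array([-10.0, -5.0, 0.0, 5.0, 10.0])  # response positions (deg)

def simulate_histogram(s_v, s_a, params, n_trials=5000, seed=0):
    """Simulated visual-response histogram for one stimulus condition."""
    sigma_v, sigma_a, sigma_p, p_common = params
    rng = np.random.default_rng(seed)
    counts = np.zeros(len(KEYS))
    for _ in range(n_trials):
        x_v = rng.normal(s_v, sigma_v)   # noisy internal representations
        x_a = rng.normal(s_a, sigma_a)
        s_hat = causal_inference_estimate(x_v, x_a, sigma_v, sigma_a,
                                          sigma_p, p_common)
        counts[np.argmin(np.abs(KEYS - s_hat))] += 1   # nearest key press
    return counts / n_trials

def r_squared(model_freqs, data_freqs):
    """R^2 between simulated and measured response frequencies."""
    ss_res = np.sum((data_freqs - model_freqs) ** 2)
    ss_tot = np.sum((data_freqs - np.mean(data_freqs)) ** 2)
    return 1.0 - ss_res / ss_tot
```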
3.2 Localization bias

A useful quantity for gaining more insight into the structure of multisensory data is the cross-modal bias. In our experiment, relative auditory bias is defined as the difference between the mean auditory estimate in a given condition and the real auditory position, divided by the difference between the real visual position and the real auditory position in this condition. If the influence of vision on the auditory estimate is strong, then the relative auditory bias will be high (close to one). It is well known that bias decreases with spatial disparity, and our experiment is no exception (solid line in Fig. 3; data were combined between positive and negative disparities). It can easily be shown that a traditional cue integration model would predict a bias equal to (1 + σ_V²/σ_A²)⁻¹, which would be close to 1 and independent of disparity, unlike the data. This shows that a mandatory integration model is an insufficient model of multisensory interactions.

We used the individual subject fits from above and averaged the auditory bias values obtained from those fits (i.e., we did not fit the bias data themselves). Fits are shown in Fig. 3 (dashed lines). We applied a paired t-test to the differences between the 5° and 20° disparity conditions (model-subject comparison). Using a double-sided test, the null hypothesis that the difference between the bias in the 5° and 20° conditions is correctly predicted by each model is rejected for the Bresciani model (p < 0.002) and the Roach model (p < 0.042) and accepted for the causal inference model (p > 0.17). Alternatively, with a single-sided test, the hypothesis is rejected for the Bresciani model (p < 0.001) and the Roach model (p < 0.021) and accepted for the causal inference model (p > 0.9).

[Figure 3: Auditory bias (%) as a function of spatial disparity (deg.). Solid blue line: data. Red: causal inference model. Green: model by Roach et al. [23]. Purple: model by Bresciani et al. [22]. Models were optimized on response frequencies (as in Fig. 2), not on the bias data.]

The reason that the Bresciani model fares worst is that its prior distribution does not include a component that corresponds to independent causes. On the contrary, the prior used in the Roach model contains two terms, one term that is independent of the disparity and one term that decreases with increasing disparity. It is thus functionally somewhat similar to the causal inference model.

(Parameter note: the Roach et al. model has four free parameters (ω, σ_V, σ_A, σ_coupling), the Bresciani et al. model has three (σ_V, σ_A, σ_coupling), and the causal inference model has four (p_common, σ_V, σ_A, σ_P). We do not consider the Shams et al. model here, since it has many more parameters and it is not immediately clear how in this model the erroneous identification of the posterior with the response distribution can be corrected.)

4 Discussion

We have argued that any model of multisensory perception should account not only for situations of small, but also of large conflict. In these situations, segregation is more likely, in which the two stimuli are not perceived to have the same cause. Even when segregation occurs, the two stimuli can still influence each other. We compared three Bayesian models designed to account for situations of large conflict by applying them to auditory-visual spatial localization data.
We pointed out a common mistake: for non-Gaussian bisensory priors without mandatory integration, the response distribution can no longer be identified with the posterior distribution. After correct implementation of the three models, we found that the causal inference model is superior to the models with ad hoc bisensory priors. This is expected, as the nervous system actually needs to solve the problem of deciding which stimuli have a common cause and which stimuli are unrelated. We have seen that multisensory perception is a suitable tool for studying causal inference. However, the causal inference model also has the potential to quantitatively explain a number of other perceptual phenomena, including perceptual grouping and binding, as well as within-modality cue combination [27, 28]. Causal inference is a universal problem: whenever the brain has multiple pieces of information it must decide if they relate to one another or are independent.

As the causal inference model describes how the brain processes probabilistic sensory information, the question arises about the neural basis of these processes. Neural populations encode probability distributions over stimuli through Bayes' rule, a type of coding known as probabilistic population coding. Recent work has shown how optimal cue combination assuming a common cause can be implemented in probabilistic population codes through simple linear operations on neural activities [29]. This framework makes essential use of the structure of neural variability and leads to physiological predictions for activity in areas that combine multisensory input, such as the superior colliculus. Computational mechanisms for causal inference are expected to have a neural substrate that generalizes these linear operations on population activities. A neural implementation of the causal inference model will open the door to a complete neural theory of multisensory perception.

References

[1] H. L. Pick, D. H. Warren, and J. C. Hay. Sensory conflict in judgements of spatial direction. Percept. Psychophys., 6:203–205, 1969.
[2] D. H. Warren, R. B. Welch, and T. J. McCarthy. The role of visual-auditory "compellingness" in the ventriloquism effect: implications for transitivity among the spatial senses. Percept Psychophys, 30(6):557–64, 1981.
[3] D. Alais and D. Burr. The ventriloquist effect results from near-optimal bimodal integration. Curr Biol, 14(3):257–62, 2004.
[4] R. A. Jacobs. Optimal integration of texture and motion cues to depth. Vision Res, 39(21):3621–9, 1999.
[5] R. J. van Beers, A. C. Sittig, and J. J. Gon. Integration of proprioceptive and visual position-information: An experimentally supported model. J Neurophysiol, 81(3):1355–64, 1999.
[6] D. H. Warren and W. T. Cleaves. Visual-proprioceptive interaction under large amounts of conflict. J Exp Psychol, 90(2):206–14, 1971.
[7] C. E. Jack and W. R. Thurlow. Effects of degree of visual association and angle of displacement on the "ventriloquism" effect. Percept Mot Skills, 37(3):967–79, 1973.
[8] G. H. Recanzone. Auditory influences on visual temporal rate perception. J Neurophysiol, 89(2):1078–93, 2003.
[9] J. P. Bresciani, M. O. Ernst, K. Drewing, G. Bouyer, V. Maury, and A. Kheddar. Feeling what you hear: auditory signals can modulate tactile tap perception. Exp Brain Res, 162(2):172–80, 2005.
[10] R. Gepshtein, P. Leiderman, L. Genosar, and D. Huppert. Testing the three step excited state proton transfer model by the effect of an excess proton.
J Phys Chem A Mol Spectrosc Kinet Environ Gen Theory, 109(42):9674–84, 2005.
[11] L. Shams, W. J. Ma, and U. Beierholm. Sound-induced flash illusion as an optimal percept. Neuroreport, 16(17):1923–7, 2005.
[12] G. Thomas. Experimental study of the influence of vision on sound localisation. J Exp Psychol, 28:167–177, 1941.
[13] W. R. Thurlow and C. E. Jack. Certain determinants of the "ventriloquism effect". Percept Mot Skills, 36(3):1171–84, 1973.
[14] C. S. Choe, R. B. Welch, R. M. Gilford, and J. F. Juola. The "ventriloquist effect": visual dominance or response bias. Perception and Psychophysics, 18:55–60, 1975.
[15] R. I. Bermant and R. B. Welch. Effect of degree of separation of visual-auditory stimulus and eye position upon spatial interaction of vision and audition. Percept Mot Skills, 42(43):487–93, 1976.
[16] R. B. Welch and D. H. Warren. Immediate perceptual response to intersensory discrepancy. Psychol Bull, 88(3):638–67, 1980.
[17] P. Bertelson and M. Radeau. Cross-modal bias and perceptual fusion with auditory-visual spatial discordance. Percept Psychophys, 29(6):578–84, 1981.
[18] P. Bertelson, F. Pavani, E. Ladavas, J. Vroomen, and B. de Gelder. Ventriloquism in patients with unilateral visual neglect. Neuropsychologia, 38(12):1634–42, 2000.
[19] D. A. Slutsky and G. H. Recanzone. Temporal and spatial dependency of the ventriloquism effect. Neuroreport, 12(1):7–10, 2001.
[20] J. Lewald, W. H. Ehrenstein, and R. Guski. Spatio-temporal constraints for auditory-visual integration. Behav Brain Res, 121(1-2):69–79, 2001.
[21] M. T. Wallace, G. E. Roberson, W. D. Hairston, B. E. Stein, J. W. Vaughan, and J. A. Schirillo. Unifying multisensory signals across time and space. Exp Brain Res, 158(2):252–8, 2004.
[22] J. P. Bresciani, F. Dammeier, and M. O. Ernst. Vision and touch are automatically integrated for the perception of sequences of events. J Vis, 6(5):554–64, 2006.
[23] N. W. Roach, J. Heron, and P. V. McGraw. Resolving multisensory conflict: a strategy for balancing the costs and benefits of audio-visual integration. Proc Biol Sci, 273(1598):2159–68, 2006.
[24] K. P. Kording and D. M. Wolpert. Bayesian decision theory in sensorimotor control. Trends Cogn Sci, 2006.
[25] K. P. Kording, U. Beierholm, W. J. Ma, S. Quartz, J. Tenenbaum, and L. Shams. Causal inference in multisensory perception. PLoS ONE, 2(9):e943, 2007.
[26] Z. Ghahramani. Computational and psychophysics of sensorimotor integration. PhD thesis, Massachusetts Institute of Technology, 1995.
[27] D. C. Knill. Mixture models and the probabilistic structure of depth cues. Vision Res, 43(7):831–54, 2003.
[28] D. C. Knill. Robust cue integration: A Bayesian model and evidence from cue conflict studies with stereoscopic and figure cues to slant. Journal of Vision, 7(7):2–24, 2007.
[29] W. J. Ma, J. M. Beck, P. E. Latham, and A. Pouget. Bayesian inference with probabilistic population codes. Nat Neurosci, 9(11):1432–8, 2006.
6 0.432246 112 nips-2007-Learning Monotonic Transformations for Classification
7 0.40068775 163 nips-2007-Receding Horizon Differential Dynamic Programming
8 0.40000388 119 nips-2007-Learning with Tree-Averaged Densities and Distributions
9 0.39704022 170 nips-2007-Robust Regression with Twinned Gaussian Processes
10 0.3905504 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations
11 0.38501358 195 nips-2007-The Generalized FITC Approximation
12 0.38097668 214 nips-2007-Variational inference for Markov jump processes
13 0.37512892 28 nips-2007-Augmented Functional Time Series Representation and Forecasting with Gaussian Processes
14 0.36941099 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data
15 0.36574179 150 nips-2007-Optimal models of sound localization by barn owls
16 0.3574518 97 nips-2007-Hidden Common Cause Relations in Relational Learning
17 0.34327948 63 nips-2007-Convex Relaxations of Latent Variable Training
18 0.32994351 167 nips-2007-Regulator Discovery from Gene Expression Time Series of Malaria Parasites: a Hierachical Approach
19 0.32788947 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models
20 0.32683548 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine
topicId topicWeight
[(5, 0.04), (13, 0.025), (16, 0.047), (18, 0.012), (19, 0.019), (21, 0.075), (22, 0.27), (31, 0.051), (34, 0.032), (35, 0.041), (46, 0.015), (47, 0.078), (49, 0.014), (83, 0.09), (85, 0.033), (90, 0.071)]
simIndex simValue paperId paperTitle
same-paper 1 0.73063028 174 nips-2007-Selecting Observations against Adversarial Objectives
Author: Andreas Krause, Brendan Mcmahan, Carlos Guestrin, Anupam Gupta
Abstract: In many applications, one has to actively select among a set of expensive observations before making an informed decision. Often, we want to select observations which perform well when evaluated with an objective function chosen by an adversary. Examples include minimizing the maximum posterior variance in Gaussian Process regression, robust experimental design, and sensor placement for outbreak detection. In this paper, we present the Submodular Saturation algorithm, a simple and efficient algorithm with strong theoretical approximation guarantees for the case where the possible objective functions exhibit submodularity, an intuitive diminishing returns property. Moreover, we prove that better approximation algorithms do not exist unless NP-complete problems admit efficient algorithms. We evaluate our algorithm on several real-world problems. For Gaussian Process regression, our algorithm compares favorably with state-of-the-art heuristics described in the geostatistics literature, while being simpler, faster and providing theoretical guarantees. For robust experimental design, our algorithm performs favorably compared to SDP-based algorithms. 1
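As a rough illustration of the truncate-and-greedy idea behind the algorithm named in this abstract: min(F_i, c) preserves monotone submodularity, so greedily maximizing the truncated average stays well behaved. This is a sketch under our own naming, not the paper's implementation, which wraps a step like this in a binary search over the saturation level c and relaxes the cardinality constraint to obtain its guarantee.

```python
def greedy_truncated(objectives, ground_set, c, k):
    """Greedily pick up to k elements to saturate all objectives at level c.

    Each objective maps a frozenset to a nonnegative value and is assumed
    monotone submodular; truncation at c keeps the average submodular.
    """
    chosen = frozenset()

    def truncated_avg(s):
        return sum(min(f(s), c) for f in objectives) / len(objectives)

    while len(chosen) < k and truncated_avg(chosen) < c:
        best = max(ground_set - chosen,
                   key=lambda e: truncated_avg(chosen | {e}))
        if truncated_avg(chosen | {best}) <= truncated_avg(chosen):
            break  # no marginal gain left; c is unreachable with this budget
        chosen = chosen | {best}
    return chosen
```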
2 0.59655088 201 nips-2007-The Value of Labeled and Unlabeled Examples when the Model is Imperfect
Author: Kaushik Sinha, Mikhail Belkin
Abstract: Semi-supervised learning, i.e. learning from both labeled and unlabeled data, has received significant attention in the machine learning literature in recent years. Still our understanding of the theoretical foundations of the usefulness of unlabeled data remains somewhat limited. The simplest and the best understood situation is when the data is described by an identifiable mixture model, and where each class comes from a pure component. This natural setup and its implications were analyzed in [11, 5]. One important result was that in certain regimes, labeled data becomes exponentially more valuable than unlabeled data. However, in most realistic situations, one would not expect that the data comes from a parametric mixture distribution with identifiable components. There have been recent efforts to analyze the non-parametric situation, for example, "cluster" and "manifold" assumptions have been suggested as a basis for analysis. Still, a satisfactory and fairly complete theoretical understanding of the nonparametric problem, similar to that in [11, 5], has not yet been developed. In this paper we investigate an intermediate situation, when the data comes from a probability distribution which can be modeled, but not perfectly, by an identifiable mixture distribution. This seems applicable to many situations, when, for example, a mixture of Gaussians is used to model the data. The contribution of this paper is an analysis of the role of labeled and unlabeled data depending on the amount of imperfection in the model.
3 0.51835001 140 nips-2007-Neural characterization in partially observed populations of spiking neurons
Author: Jonathan W. Pillow, Peter E. Latham
Abstract: Point process encoding models provide powerful statistical methods for understanding the responses of neurons to sensory stimuli. Although these models have been successfully applied to neurons in the early sensory pathway, they have fared less well capturing the response properties of neurons in deeper brain areas, owing in part to the fact that they do not take into account multiple stages of processing. Here we introduce a new twist on the point-process modeling approach: we include unobserved as well as observed spiking neurons in a joint encoding model. The resulting model exhibits richer dynamics and more highly nonlinear response properties, making it more powerful and more flexible for fitting neural data. More importantly, it allows us to estimate connectivity patterns among neurons (both observed and unobserved), and may provide insight into how networks process sensory input. We formulate the estimation procedure using variational EM and the wake-sleep algorithm, and illustrate the model’s performance using a simulated example network consisting of two coupled neurons.
4 0.51830286 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images
Author: Matthias Bethge, Philipp Berens
Abstract: Maximum entropy analysis of binary variables provides an elegant way for studying the role of pairwise correlations in neural populations. Unfortunately, these approaches suffer from their poor scalability to high dimensions. In sensory coding, however, high-dimensional data is ubiquitous. Here, we introduce a new approach using a near-maximum entropy model that makes this type of analysis feasible for very high-dimensional data: the model parameters can be derived in closed form and sampling is easy. Therefore, our NearMaxEnt approach can serve as a tool for testing predictions from a pairwise maximum entropy model not only for low-dimensional marginals, but also for high-dimensional measurements of more than a thousand units. We demonstrate its usefulness by studying natural images with dichotomized pixel intensities. Our results indicate that the statistics of such higher-dimensional measurements exhibit additional structure that is not predicted by pairwise correlations, despite the fact that pairwise correlations explain the lower-dimensional marginal statistics surprisingly well up to the limit of dimensionality where estimation of the full joint distribution is feasible. 1
5 0.51469332 156 nips-2007-Predictive Matrix-Variate t Models
Author: Shenghuo Zhu, Kai Yu, Yihong Gong
Abstract: It is becoming increasingly important to learn from a partially-observed random matrix and predict its missing elements. We assume that the entire matrix is a single sample drawn from a matrix-variate t distribution and suggest a matrix-variate t model (MVTM) to predict those missing elements. We show that MVTM generalizes a range of known probabilistic models, and automatically performs model selection to encourage sparse predictive models. Due to the non-conjugacy of its prior, it is difficult to make predictions by computing the mode or mean of the posterior distribution. We suggest an optimization method that sequentially minimizes a convex upper bound of the log-likelihood, which is very efficient and scalable. The experiments on toy data and the EachMovie dataset show good predictive accuracy of the model. 1
6 0.51401281 214 nips-2007-Variational inference for Markov jump processes
7 0.51278573 104 nips-2007-Inferring Neural Firing Rates from Spike Trains Using Gaussian Processes
8 0.51274049 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models
9 0.51158714 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data
10 0.50994498 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning
11 0.50971627 195 nips-2007-The Generalized FITC Approximation
12 0.50762105 158 nips-2007-Probabilistic Matrix Factorization
13 0.50687182 63 nips-2007-Convex Relaxations of Latent Variable Training
14 0.50675607 122 nips-2007-Locality and low-dimensions in the prediction of natural experience from fMRI
15 0.50522077 164 nips-2007-Receptive Fields without Spike-Triggering
16 0.50406313 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC
17 0.5026896 18 nips-2007-A probabilistic model for generating realistic lip movements from speech
18 0.50214535 28 nips-2007-Augmented Functional Time Series Representation and Forecasting with Gaussian Processes
19 0.50119609 86 nips-2007-Exponential Family Predictive Representations of State
20 0.49980268 16 nips-2007-A learning framework for nearest neighbor search