nips nips2007 nips2007-198 knowledge-graph by maker-knowledge-mining

198 nips-2007-The Noisy-Logical Distribution and its Application to Causal Inference


Source: pdf

Author: Hongjing Lu, Alan L. Yuille

Abstract: We describe a novel noisy-logical distribution for representing the distribution of a binary output variable conditioned on multiple binary input variables. The distribution is represented in terms of noisy-or’s and noisy-and-not’s of causal features which are conjunctions of the binary inputs. The standard noisy-or and noisy-and-not models, used in causal reasoning and artificial intelligence, are special cases of the noisy-logical distribution. We prove that the noisy-logical distribution is complete in the sense that it can represent all conditional distributions provided a sufficient number of causal factors are used. We illustrate the noisy-logical distribution by showing that it can account for new experimental findings on how humans perform causal reasoning in complex contexts. We speculate on the use of the noisy-logical distribution for causal reasoning and artificial intelligence.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We describe a novel noisy-logical distribution for representing the distribution of a binary output variable conditioned on multiple binary input variables. [sent-4, score-0.138]

2 The distribution is represented in terms of noisy-or’s and noisy-and-not’s of causal features which are conjunctions of the binary inputs. [sent-5, score-0.887]

3 The standard noisy-or and noisy-and-not models, used in causal reasoning and artificial intelligence, are special cases of the noisy-logical distribution. [sent-6, score-0.738]

4 We prove that the noisy-logical distribution is complete in the sense that it can represent all conditional distributions provided a sufficient number of causal factors are used. [sent-7, score-0.87]

5 We illustrate the noisy-logical distribution by showing that it can account for new experimental findings on how humans perform causal reasoning in complex contexts. [sent-8, score-0.859]

6 We speculate on the use of the noisy-logical distribution for causal reasoning and artificial intelligence. [sent-9, score-0.777]

7 1 Introduction. The noisy-or and noisy-and-not conditional probability distributions are frequently studied in cognitive science for modeling causal reasoning [1], [2], [3] and are also used as probabilistic models for artificial intelligence [4]. [sent-10, score-0.911]

8 It has been shown, for example, that human judgments of the power of causal cues in experiments involving two cues [1] can be interpreted in terms of maximum likelihood estimation and model selection using these types of models [3]. [sent-11, score-0.86]

9 But the noisy-or and noisy-and-not distributions are limited in the sense that they can only represent a restricted set of all possible conditional distributions. [sent-12, score-0.145]

10 This restriction is sometimes an advantage because there may not be sufficient data to determine the full conditional distribution. [sent-13, score-0.079]

11 Nevertheless it would be better to have a representation that can expand to represent the full conditional distribution, if sufficient data is available, but can be reduced to simpler forms (e.g., the noisy-or or noisy-and-not) when only limited data is available. [sent-14, score-0.089]

12 The noisy-logical distribution is defined in terms of noisy-or’s and noisy-and-not’s of causal features which are conjunctions of the basic input variables (inspired by the use of conjunctive features in [2] and the extensions in [5]). [sent-18, score-0.896]

13 By restricting the choice of causal features we can obtain the standard noisy-or and noisy-and-not models. [sent-19, score-0.754]

14 We prove that the noisy-logical distribution is complete in the sense that it can represent any conditional distribution provided we use all the causal features. [sent-20, score-0.827]

15 Overall, it gives a distribution whose complexity can be adjusted by restricting the number of causal features. [sent-21, score-0.812]

16 To illustrate the noisy-logical distribution we apply it to modeling some recent human experiments on causal reasoning in complex environments [6]. [sent-22, score-0.861]

17 We show that noisy-logical distributions involving causal factors are able to account for human performance. [sent-23, score-0.836]

18 By contrast, an alternative linear model gives predictions which are the opposite of the observed trends in human causal judgments. [sent-24, score-0.825]

19 Section (2) presents the noisy-logical distribution for the case with two input causes (the case commonly studied in causal reasoning). [sent-25, score-0.833]

20 Section (5) illustrates the noisy-logical distribution by showing that it accounts for recent experimental findings in causal reasoning. [sent-27, score-0.751]

21 2 The Case with N = 2 causes. In this section we study the simple case when the binary output effect E depends only on two binary-valued causes C1, C2. [sent-28, score-0.371]

22 Firstly, we define four binary-valued causal features: Ψ0(C) = 1 (the background), Ψ1(C) = C1, Ψ2(C) = C2, and the conjunction Ψ3(C) = C1 ∧ C2. [sent-32, score-0.702]

23 Secondly, we introduce binary-valued hidden states E0, E1, E2, E3 which are caused by the corresponding features Ψ0, Ψ1, Ψ2, Ψ3. [sent-41, score-0.155]

24 Thirdly, we define the output effect E to be a logical combination of the states E0, E1, E2, E3, which we write in the form δ_{E, f(E0,E1,E2,E3)}. [sent-46, score-0.234]

25 Here f(·) is a logic function formed by a combination of the three logic operations AND, OR, NOT. [sent-50, score-0.162]

26 We can represent the distribution by a circuit diagram where the output E is a logical function of the hidden states E0, ..., E3. [sent-63, score-0.38]

27 Each state is caused probabilistically by the corresponding causal feature Ψ0, ..., Ψ3. [sent-66, score-0.744]

28 For example, the noisy-and-not distribution is

P_nl(E = 1|C1, C2; ω1, ω2) = Σ_{E1,E2} δ_{1, E1 ∧ ¬E2} P(E1|Ψ1(C); ω1) P(E2|Ψ2(C); ω2) = ω1 C1 (1 − ω2 C2) = P_{n-and-not}(E = 1|C1, C2; ω1, ω2).   (2)

We claim that noisy-logical distributions of this form can represent any conditional distribution P(E|C). [sent-79, score-0.184]
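
As an editorial aside, the two building blocks are easy to state in code. The sketch below is our illustration (not code from the paper), assuming the standard parameterization in which an active feature Ψi switches its hidden state Ei on with probability ωi:

    # N = 2 noisy-logical building blocks (illustrative sketch).
    # Assumed parameterization: feature Psi_i(C) turns hidden state E_i on
    # with probability w_i; E is a fixed logical combination of the E_i.

    def noisy_or(c1, c2, w1, w2):
        # E = E1 OR E2:  P(E=1|C1,C2) = 1 - (1 - w1*C1)(1 - w2*C2)
        return 1.0 - (1.0 - w1 * c1) * (1.0 - w2 * c2)

    def noisy_and_not(c1, c2, w1, w2):
        # Equation (2), E = E1 AND NOT E2:  P(E=1|C1,C2) = w1*C1*(1 - w2*C2)
        return w1 * c1 * (1.0 - w2 * c2)

    def noisy_and_not_by_summation(c1, c2, w1, w2):
        # The same quantity computed the long way: sum over the hidden
        # states, keeping only the configuration where E1 AND NOT E2 holds.
        total = 0.0
        for e1 in (0, 1):
            for e2 in (0, 1):
                if e1 == 1 and e2 == 0:
                    p_e1 = w1 * c1 if e1 else 1.0 - w1 * c1
                    p_e2 = w2 * c2 if e2 else 1.0 - w2 * c2
                    total += p_e1 * p_e2
        return total

    assert abs(noisy_and_not(1, 1, 0.7, 0.4)
               - noisy_and_not_by_summation(1, 1, 0.7, 0.4)) < 1e-12

The brute-force summation mirrors the δ-sum in equation (2); the closed forms are the standard noisy-or and noisy-and-not.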

29 The logical function f (E0 , E1 , E2 , E3 ) will be expressed as a combination of logic operations AND-NOT, OR. [sent-80, score-0.294]

30 This situation is studied in cognitive science where C1 is considered to be a background cause which always takes value 1; see [1], [3]. [sent-84, score-0.144]

31 In this case, only the causal features Ψ1(C) = C1 and Ψ2(C) = C2 are considered. [sent-85, score-0.702]

32 3 The Noisy-Logical Distribution for N causes. We next consider representing probability distributions of the form P(E|C), where E ∈ {0, 1} and C = (C1, ..., CN). [sent-99, score-0.19]

33 We define the set of 2^N binary-valued causal features {Ψi(C) : i = 0, ..., 2^N − 1}. [sent-106, score-0.702]

34 Binary-valued hidden states {Ei : i = 0, ..., 2^N − 1} are caused by the corresponding causal features {Ψi : i = 0, ..., 2^N − 1}, as in the N = 2 case. [sent-123, score-0.702]

35 Then we define the output variable E to be a logical (i.e., deterministic) combination of the hidden states, E = E0 ⊗ E1 ⊗ ... ⊗ E_{2^N − 1}. [sent-130, score-0.188]

36 Here E1 ⊗ E2 can be E1 ∨ E2 or E1 ∧ ¬E2 (where ¬E means logical negation). [sent-144, score-0.188]

37 The noisy-logical distribution then takes the form

P_nl(E = 1|C; ω) = Σ_{E0,...,E_{2^N−1}} δ_{1, f(E0,...,E_{2^N−1})} Π_{i=0}^{2^N−1} P(Ei|Ψi(C); ωi).   (3)

The Completeness Result. This section proves that the noisy-logical distribution is capable of representing any conditional distribution. [sent-150, score-0.091]

38 All conditional distributions can be represented in this form if we use all possible 2^N causal features Ψ, choose the correct parameters ω, and select the correct logical combinations ⊗. [sent-153, score-0.998]

39 Result: We can represent any conditional distribution P(E|C) defined on binary variables in terms of a noisy-logical distribution given by equation (3). [sent-154, score-0.385]

40 We order the input states: first those whose features are unary (i.e., only one Ci is non-zero), then those with two-way conjunctions (i.e., two Ci are non-zero), and so on. [sent-168, score-0.116]

41 We loop over the states in this order and incrementally construct the logical function f(E0, ..., E_{2^N−1}). [sent-172, score-0.21]
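
The constructive proof can be mirrored in code. The sketch below is our reconstruction under stated assumptions, not the authors' implementation: conjunctive features are visited in order of size, and at each input state only the newest feature is unconstrained, so its weight ω can be solved exactly, attaching the corresponding hidden state with OR when the target probability must rise and with AND-NOT when it must fall:

    from itertools import combinations, product
    import random

    def nl_predict(model, c):
        # Evaluate P(E=1|c) for an ordered list of (subset, w, op) triples.
        # A feature that is off leaves the running probability unchanged.
        p = 0.0
        for subset, w, op in model:
            if all(c[j] for j in subset):  # conjunction feature is on
                p = p + (1.0 - p) * w if op == 'OR' else p * (1.0 - w)
        return p

    def fit_noisy_logical(target, n):
        # Visit input states by conjunction size: background, unary, pairs, ...
        # At each state only the newest feature is free, so solve for its w.
        model = []
        for size in range(n + 1):
            for subset in combinations(range(n), size):
                c = tuple(1 if j in subset else 0 for j in range(n))
                q, t = nl_predict(model, c), target[c]
                if t >= q:  # probability must rise: attach with OR
                    w = 0.0 if q == 1.0 else (t - q) / (1.0 - q)
                    model.append((frozenset(subset), w, 'OR'))
                else:       # probability must fall: attach with AND-NOT
                    model.append((frozenset(subset), (q - t) / q, 'AND-NOT'))
        return model

    # Numerical check of completeness on a random conditional with N = 3.
    target = {c: random.random() for c in product((0, 1), repeat=3)}
    model = fit_noisy_logical(target, 3)
    assert all(abs(nl_predict(model, c) - target[c]) < 1e-9 for c in target)

A new feature of size |S| is off at every previously visited state, so fixing its weight never disturbs the probabilities already matched; this is why the incremental loop terminates with an exact fit.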

42 5 Cognitive Science Human Experiments. We illustrate noisy-logical distributions by applying them to model two recent cognitive science experiments by Liljeholm and Cheng which involve causal reasoning in complex environments [6]. [sent-268, score-0.904]

43 In these experiments, the participants are asked questions about the causal structure of the data. [sent-269, score-0.892]

44 But the participants are not given enough data to determine the full distribution. [sent-270, score-0.261]

45 That is, there is not enough data to determine the causal structure with certainty. [sent-272, score-0.687]

46 Instead the experimental design forces them to choose between two different causal structures. [sent-273, score-0.777]

47 Formally, we specify distributions P (D|ω, Graph) for generating the data D from a causal model specified by Graph and parameterized by ω. [sent-275, score-0.739]

48 The evidence for the causal model is given by: P(D|Graph) = ∫ dω P(D|ω, Graph) P(ω|Graph). [sent-278, score-0.683]

49 We then evaluate the log-likelihood ratio log [P(D|Graph1) / P(D|Graph2)] between the two causal models Graph1 and Graph2 (equation (5)), called the causal support [3], and use this to predict the performance of the participants. [sent-279, score-1.32]
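
For concreteness, here is a hedged sketch of the causal-support computation (our illustration; the counts, the uniform prior, and the grid quadrature are assumptions for the example, not values or methods quoted from [6]). It contrasts a noisy-or graph that includes the candidate cause with one that omits it:

    import numpy as np

    def log_marginal(n_e, n_trials, include_cause, grid=201):
        # log P(D|Graph): background B = 1 always present, optional cause C,
        # combined by noisy-or; uniform prior over the weights (wb, wc).
        # n_e[c], n_trials[c]: counts of E = 1 and of trials when C = c.
        w = np.linspace(1e-6, 1.0 - 1e-6, grid)
        if include_cause:
            wb, wc = np.meshgrid(w, w, indexing='ij')
        else:
            wb, wc = w, np.zeros_like(w)
        ll = np.zeros_like(wb)
        for c in (0, 1):
            p = 1.0 - (1.0 - wb) * (1.0 - wc * c)  # noisy-or prediction
            ll += n_e[c] * np.log(p) + (n_trials[c] - n_e[c]) * np.log(1.0 - p)
        m = ll.max()
        return m + np.log(np.exp(ll - m).mean())  # average over the prior grid

    def causal_support(n_e, n_trials):
        # log P(D|Graph with C) - log P(D|Graph without C)
        return (log_marginal(n_e, n_trials, True)
                - log_marginal(n_e, n_trials, False))

    # Hypothetical counts: 16/24 headaches with the medicine, 8/24 without.
    print(causal_support({0: 8, 1: 16}, {0: 24, 1: 24}))

The paper's Graph1/Graph2 comparison follows the same recipe with different graph structures in the two marginal likelihoods.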

50 As an alternative theoretical model, we consider the possibility that the participants use the same causal structures, specified by Graph1 and Graph2, but use a linear model to combine cues. [sent-281, score-0.878]
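
A minimal sketch of the linear alternative, under our assumed parameterization (the paper states only that a linear model combines the cues): the prediction is a clipped weighted sum, and it can be substituted for the noisy-or line in log_marginal above to score the linear model with the same machinery:

    import numpy as np

    def linear_prediction(wb, wc, c):
        # Linear cue combination: P(E=1|B=1, C=c) = wb + wc*c, clipped away
        # from 0 and 1 so the Bernoulli log-likelihood stays finite.
        return np.clip(wb + wc * c, 1e-6, 1.0 - 1e-6)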

51 Our simulations show that this model does not account for human participant performance. [sent-297, score-0.124]

52 We note that previous attempts to model experiments with multiple causes and conjunctions by Novick and Cheng [2] can be interpreted as performing maximum likelihood estimation of the parameters of noisy-logical distributions (their paper helped inspire our work). [sent-298, score-0.352]

53 1 Experiment I: Multiple Causes. In Experiment 1 of [6], the cover story involves a set of allergy patients who either did or did not have a headache, and either had or had not received allergy medicines A and B. [sent-302, score-0.228]

54 The experimental participants were informed that two independent studies had been conducted in different labs using different patient groups. [sent-303, score-0.289]

55 In the first study, patients were administered medicine A, whereas in the second study patients were administered both medicines A and B. [sent-304, score-0.425]

56 A simultaneous presentation format [7] was used to display the specific contingency conditions used in both studies to the experimental subjects. [sent-305, score-0.209]

57 The participants were then asked whether medicine B caused the headache. [sent-306, score-0.432]

58 We represent this experiment as follows using binary-valued variables E, B1 , B2 , C1 , C2 . [sent-307, score-0.089]

59 B1 = 1 and B2 = 1 denote background causes for the two studies (which are always present). [sent-309, score-0.215]

60 C1 and C2 indicate whether medicines A and B are present, respectively (e.g., C1 = 1 when medicine A is administered and C1 = 0 otherwise). [sent-310, score-0.158]

61 The experimental design forces the participants to choose between the two causal models shown on the left of figure (3). [sent-314, score-0.972]

62 In the first power-constant condition [6], the data is consistent with the causal structure for Graph1 (i.e., with noisy-or distributions). [sent-329, score-0.7]

63 In the second ∆P-constant condition [6], the data is consistent with the causal structure for Graph1 but with the noisy-or replaced by linear distributions. [sent-332, score-0.756]

64 The left panel shows the proportion of participants who decide that medicine B causes a headache for the two conditions. [sent-344, score-0.658]

65 The right panel shows the predictions of our model (labeled "noisy-logical") together with the predictions of a model that replaces the noisy-logical distributions by a linear model (labeled "linear"). [sent-345, score-0.191]

66 The simulations show that the noisy-logical model correctly predicts that participants (on average) judge that medicine B has no effect in the first experimental condition, but B does have an effect in the second condition. [sent-346, score-0.505]

67 In summary, model selection comparing two noisy-logical models gives a good prediction of participant performance. [sent-348, score-0.095]

68 Left panel: two alternative causal models for the two studies. [sent-350, score-0.66]

69 Right panel: the experimental results (proportion of participants who think medicine B causes headaches) for the Power-constant and ∆P-constant conditions [6]. [sent-351, score-0.451]

70 Far right, the causal support for the noisy-logic and linear models. [sent-352, score-0.66]

71 3 Experiment II: Causal Interaction. Liljeholm and Cheng [6] also investigated causal interactions. [sent-354, score-0.66]

72 The experimental design was identical to that used in Experiment 1, except that participants were presented with three studies in which only one medicine (A) was tested. [sent-355, score-0.481]

73 Participants were asked to judge whether medicine A interacts with background causes that vary across the three studies. [sent-356, score-0.397]

74 We define the background causes as B1 ,B2 ,B3 for the three studies, and C1 for medicine A. [sent-357, score-0.331]

75 The first power-constant condition [6] was consistent with a noisy-logical model, but the second power-varying condition [6] was not. [sent-359, score-0.08]

76 All the distributions are noisy-or on the unary causal features (e.g., B, C1). [sent-366, score-0.758]

77 But the nature of the conjunctive cause B ∧ C1 is unknown (i.e., it may be generative or preventative). [sent-368, score-0.076]

78 In other words, the conjunction can produce or can prevent headaches; see graph 2 of Figure (4). [sent-375, score-0.086]

79 4 Results of Experiment II. Figure (4) shows human and model performance for the two experimental conditions. [sent-377, score-0.137]

80 Our noisy-logical model is in agreement with human performance. [sent-378, score-0.085]

81 That is, there is no interaction between causes in the power-constant condition, but there is interaction in the power-varying condition. [sent-380, score-0.22]

82 By contrast, the linear model predicts interaction in both conditions and hence fails to model human performance. [sent-381, score-0.198]

83 Left panel: two alternative causal models (one involving conjunctions) for the three studies. [sent-383, score-0.734]

84 Right panel: the proportion of participants who think that there is an interaction (conjunction) between medicine A and the background for the power-constant and power-varying conditions [6]. [sent-384, score-0.514]

85 Far right, the causal support for the noisy-logical and linear models. [sent-385, score-0.66]

86 6 Summary. The noisy-logical distribution gives a new way to represent conditional probability distributions defined over binary variables. [sent-386, score-0.247]

87 The complexity of the distribution can be adjusted by restricting the set of causal factors. [sent-387, score-0.779]

88 If all the causal factors are allowed, then the distribution can represent any conditional distribution. [sent-388, score-0.814]

89 But by restricting the set of causal factors we can obtain standard distributions such as the noisy-or and noisy-and-not. [sent-389, score-0.794]

90 We illustrated the noisy-logical distribution by modeling experimental findings on causal reasoning. [sent-390, score-0.751]

91 Our results showed that this distribution fitted the experimental data and, in particular, accounted for the major trends (unlike the linear model). [sent-391, score-0.116]

92 This is consistent with the success of noisy-or and noisy-and-not models for accounting for experiments involving two causes [1], [2], [3]. [sent-392, score-0.166]

93 This suggests that humans may make use of noisy-logical representations for causal reasoning. [sent-393, score-0.69]

94 One attraction of the noisy-logical representation is that it helps clarify the relationship between logic and probabilities. [sent-394, score-0.081]

95 Standard logical relationships between causes and effects arise in the limit as the ωi take values 0 or 1. [sent-395, score-0.322]
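
Using the N = 2 sketch given earlier, this limit can be checked directly: with ω1 = ω2 = 1 the noisy-or reduces to the deterministic OR and the noisy-and-not to C1 ∧ ¬C2:

    for c1 in (0, 1):
        for c2 in (0, 1):
            assert noisy_or(c1, c2, 1.0, 1.0) == float(c1 or c2)
            assert noisy_and_not(c1, c2, 1.0, 1.0) == float(c1 and not c2)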

96 We can, for example, bias the data towards a logical form by using a prior on the ω. [sent-396, score-0.188]

97 This may be useful, for example, when modeling human cognition – evidence suggests that humans first learn logical relationships and, only later, move to probabilities. [sent-397, score-0.304]

98 In summary, the noisy-logical distribution is a novel way to represent conditional probability distributions defined on binary variables. [sent-398, score-0.214]

99 We hope this class of distributions will be useful for modeling cognitive phenomena and for applications to artificial intelligence. [sent-399, score-0.121]

100 From covariation to causation: A test of the assumption of causal power. [sent-449, score-0.702]


similar papers computed by tfidf model


similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999946 198 nips-2007-The Noisy-Logical Distribution and its Application to Causal Inference

Author: Hongjing Lu, Alan L. Yuille


2 0.17891736 51 nips-2007-Comparing Bayesian models for multisensory cue combination without mandatory integration

Author: Ulrik Beierholm, Ladan Shams, Wei J. Ma, Konrad Koerding

Abstract: Bayesian models of multisensory perception traditionally address the problem of estimating an underlying variable that is assumed to be the cause of the two sensory signals. The brain, however, has to solve a more general problem: it also has to establish which signals come from the same source and should be integrated, and which ones do not and should be segregated. In the last couple of years, a few models have been proposed to solve this problem in a Bayesian fashion. One of these has the strength that it formalizes the causal structure of sensory signals. We first compare these models on a formal level. Furthermore, we conduct a psychophysics experiment to test human performance in an auditory-visual spatial localization task in which integration is not mandatory. We find that the causal Bayesian inference model accounts for the data better than other models. Keywords: causal inference, Bayesian methods, visual perception.

3 0.079132795 125 nips-2007-Markov Chain Monte Carlo with People

Author: Adam Sanborn, Thomas L. Griffiths

Abstract: Many formal models of cognition implicitly use subjective probability distributions to capture the assumptions of human learners. Most applications of these models determine these distributions indirectly. We propose a method for directly determining the assumptions of human learners by sampling from subjective probability distributions. Using a correspondence between a model of human choice and Markov chain Monte Carlo (MCMC), we describe a method for sampling from the distributions over objects that people associate with different categories. In our task, subjects choose whether to accept or reject a proposed change to an object. The task is constructed so that these decisions follow an MCMC acceptance rule, defining a Markov chain for which the stationary distribution is the category distribution. We test this procedure for both artificial categories acquired in the laboratory, and natural categories acquired from experience.

4 0.075205989 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data

Author: Michael Ross, Andrew Cohen

Abstract: This paper describes a new model for human visual classification that enables the recovery of image features that explain human subjects’ performance on different visual classification tasks. Unlike previous methods, this algorithm does not model their performance with a single linear classifier operating on raw image pixels. Instead, it represents classification as the combination of multiple feature detectors. This approach extracts more information about human visual classification than previous methods and provides a foundation for further exploration.

5 0.066381648 114 nips-2007-Learning and using relational theories

Author: Charles Kemp, Noah Goodman, Joshua B. Tenenbaum

Abstract: Much of human knowledge is organized into sophisticated systems that are often called intuitive theories. We propose that intuitive theories are mentally represented in a logical language, and that the subjective complexity of a theory is determined by the length of its representation in this language. This complexity measure helps to explain how theories are learned from relational data, and how they support inductive inferences about unobserved relations. We describe two experiments that test our approach, and show that it provides a better account of human learning and reasoning than an approach developed by Goodman [1]. What is a theory, and what makes one theory better than another? Questions like these are of obvious interest to philosophers of science but are also discussed by psychologists, who have argued that everyday knowledge is organized into rich and complex systems that are similar in many respects to scientific theories. Even young children, for instance, have systematic beliefs about domains including folk physics, folk biology, and folk psychology [2]. Intuitive theories like these play many of the same roles as scientific theories: in particular, both kinds of theories are used to explain and encode observations of the world, and to predict future observations. This paper explores the nature, use and acquisition of simple theories. Consider, for instance, an anthropologist who has just begun to study the social structure of a remote tribe, and observes that certain words are used to indicate relationships between selected pairs of individuals. Suppose that term T1(·, ·) can be glossed as ancestor(·, ·), and that T2(·, ·) can be glossed as friend(·, ·). The anthropologist might discover that the first term is transitive, and that the second term is symmetric with a few exceptions. Suppose that term T3(·, ·) can be glossed as defers to(·, ·), and that the tribe divides into two castes such that members of the second caste defer to members of the first caste. In this case the anthropologist might discover two latent concepts (caste 1(·) and caste 2(·)) along with the relationship between these concepts. As these examples suggest, a theory can be defined as a system of laws and concepts that specify the relationships between the elements in some domain [2]. We will consider how these theories are learned, how they are used to encode relational data, and how they support predictions about unobserved relations. Our approach to all three problems relies on the notion of subjective complexity. We propose that theory learners prefer simple theories, that people remember relational data in terms of the simplest underlying theory, and that people extend a partially observed data set according to the simplest theory that is consistent with their observations. There is no guarantee that a single measure of subjective complexity can do all of the work that we require [3]. This paper, however, explores the strong hypothesis that a single measure will suffice. Our formal treatment of subjective complexity begins with the question of how theories are mentally represented. We suggest that theories are represented in some logical language, and propose a specific first-order language that serves as a hypothesis about the “language of thought.” We then pursue the idea that the subjective complexity of a theory corresponds to the length of its representation in this language. 
Our approach therefore builds on the work of Feldman [4], and is related to other psychological applications of the notion of Kolmogorov complexity [5]. The complexity measure we describe can be used to define a probability distribution over a space of theories, and we develop a model of theory acquisition by using this distribution as the prior for a Bayesian learner. We also 1 (a) Star 11 (b) Bipartite (c) Exception 22 33 44 55 66 77 88 16 26 36 46 56 21 31 41 51 61 71 81 11 17 27 37 47 57 18 28 38 48 58 R(X, X). T(6). T(7). T(8). R(X, Y) ← ¯(X), T(Y). T R(X, 1). (d) Symmetric (e) Transitive 11 22 33 44 55 66 77 13 31 12 21 24 42 56 65 26 36 46 56 17 27 37 47 57 18 28 38 48 58 T(6). T(7). T(8). R(X, Y) ← ¯(X), T(Y). T ¯ R(1, 1). R(1, 6). (f) Random 12 21 13 23 14 24 34 13 32 14 24 34 15 25 35 45 16 26 36 46 56 51 52 35 54 61 26 63 46 56 R(1, 2). R(1, 3). R(2, 4). R(5, 6). R(1, 2). R(2, 3). R(3, 4). R(5, X). R(X, 4). R(X, Y) ← R(Y, X). R(X, X). R(4, 5). R(5, 6). R(X, Z) ← R(X, Y), R(Y, Z). R(2, 1). R(1, 3). R(6, 1). R(3, 2). R(2, 6). R(3, 5). R(6, 3). R(4, 6). ¯ ¯ ¯ R(X, X). R(6, 4). R(5, 3). Figure 1: Six possible extensions for a binary predicate R(·, ·). In each case, the objects in the domain are represented as digits, and a pair such as 16 indicates that R(1, 6) is true. Below each set of pairs, the simplest theory according to our complexity measure is shown. show how the same Bayesian approach helps to explain how theories support inductive generalization: given a set of observations, future observations (e.g. whether one individual defers to another) can be predicted using the posterior distribution over the space of theories. We test our approach by developing two experiments where people learn and make predictions about binary and ternary relations. As far as we know, the approach of Goodman [1] is the only other measure of theory complexity that has previously been tested as a psychological model [6]. We show that our experiments support our approach and raise challenges for this alternative model. 1 Theory complexity: a representation length approach Intuitive theories correspond to mental representations of some sort, and our first task is to characterize the elements used to build these representations. We explore the idea that a theory is a system of statements in a logical language, and six examples are shown in Fig. 1. The theory in Fig. 1b is related to the defers to(·, ·) example already described. Here we are interested in a domain including 9 elements, and a two place predicate R(·, ·) that is true of all and only the 15 pairs shown. R is defined using a unary predicate T which is true of only three elements: 6, 7, and 8. The theory includes a clause which states that R(X, Y) is true for all pairs XY such that T(X) is false and T(Y) is true. The theory in Fig. 1c is very similar, but includes an additional clause which specifies that R(1, 1) is true, and an exception which specifies that R(1, 6) is false. Formally, each theory we consider is a collection of function-free definite clauses. All variables are universally quantified: for instance, the clause R(X, Z) ← R(X, Y), R(Y, Z) is equivalent to the logical formula ∀x ∀y ∀z (R(x, z) ← R(x, y) ∧ R(y, z)). For readability, the theories in Fig. 1 include parentheses and arrows, but note that these symbols are unnecessary and can be removed. Our proposed language includes only predicate symbols, variable symbols, constant symbols, and a period that indicates when one clause finishes and another begins. 
Each theory in Fig. 1 specifies the extension of one or more predicates. The extension of predicate P is defined in terms of predicate P+ (which captures the basic rules that lead to membership in P) and predicate P− (which captures exceptions to these rules). The resulting extension of P is defined 2 as P+ \ P− , or the set difference of P+ and P− .1 Once P has been defined, later clauses in the theory may refer to P or its negation ¯. To ensure that our semantics is well-defined, the predicates P in any valid theory must permit an ordering so that the definition of any predicate does not refer to predicates that follow it in the order. Formally, the definition of each predicate P+ or P− can refer only to itself (recursive definitions are allowed) and to any predicate M or ¯ where M < P. M Once we have committed to a specific language, the subjective complexity of a theory is assumed to correspond to the number of symbols in its representation. We have chosen a language where there is one symbol for each position in a theory where a predicate, variable or constant appears, and one symbol to indicate when each clause ends. Given this language, the subjective complexity c(T ) of theory T is equal to the sum of the number of clauses in the theory and the number of positions in the theory where a predicate, variable or constant appears: c(T ) = #clauses(T ) + #pred slots(T ) + #var slots(T ) + #const slots(T ). (1) For instance, the clause R(X, Z) ← R(X, Y), R(Y, Z). contributes ten symbols towards the complexity of a theory (three predicate symbols, six variable symbols, and one period). Other languages might be considered: for instance, we could use a language which uses five symbols (e.g. five bits) to represent each predicate, variable and constant, and one symbol (e.g. one bit) to indicate the end of a clause. Our approach to subjective complexity depends critically on the representation language, but once a language has been chosen the complexity measure is uniquely specified. Although our approach is closely related to the notion of Kolmogorov complexity and to Minimum Message Length (MML) and Minimum Description Length (MDL) approaches, we refer to it as a Representation Length (RL) approach. A RL approach includes a commitment to a specific language that is proposed as a psychological hypothesis, but these other approaches aspire towards results that do not depend on the language chosen.2 It is sometimes suggested that the notion of Kolmogorov complexity provides a more suitable framework for psychological research than the RL approach, precisely because it allows for results that do not depend on a specific description language [8]. We subscribe to the opposite view. Mental representations presumably rely on some particular language, and identifying this language is a central challenge for psychological research. The language we described should be considered as a tentative approximation of the language of thought. Other languages can and should be explored, but our language has several appealing properties. Feldman [4] has argued that definite clauses are psychologically natural, and working with these representations allows our approach to account for several classic results from the concept learning literature. 
One classic result our language captures is that conjunctive concepts are easier to learn than disjunctive concepts [9].³ Working with definite clauses also ensures that each of our theories has a unique minimal model, which means that the extension of a theory can be defined in a particularly simple way. Finally, human learners deal gracefully with noise and exceptions, and our language provides a simple way to handle exceptions.

Any concrete proposal about the language of thought should make predictions about memory, learning and reasoning. Suppose that data set D lists the extensions of one or more predicates, and that a theory is a “candidate theory” for D if it correctly defines the extensions of all predicates in D. Note that a candidate theory may well include latent predicates: predicates that do not appear in D, but are useful for defining the predicates that have been observed. We will assume that humans encode D in terms of the simplest candidate theory for D, and that the difficulty of memorizing D is determined by the subjective complexity of this theory. Our approach can and should be tested against classic results from the memory literature. Unlike some other approaches to complexity [10], for instance, our model predicts that a sequence of k items is about equally easy to remember regardless of whether the items are drawn from a set of size 2, a set of size 10, or a set of size 1000 [11].

¹ The extension of P+ is the smallest set that satisfies all of the clauses that define P+, and the extension of P− is defined similarly. To simplify our notation, Fig. 1 uses P to refer to both P and P+, and P̄ to refer to both P̄ and P−. Any instance of P that appears in a clause defining P is really an instance of P+, and any instance of P̄ that appears in a clause defining P̄ is really an instance of P−.
² MDL approaches also commit to a specific language, but this language is often intended to be as general as possible. See, for instance, the discussion of universal codes in Grünwald et al. [7].
³ A conjunctive concept C(·) can be defined using a single clause: C(X) ← A(X), B(X). The shortest definition of a disjunctive concept requires two clauses: D(X) ← A(X). D(X) ← B(X).

To develop a model of inductive learning and reasoning, we take a Bayesian approach, and use our complexity measure to define a prior distribution over a hypothesis space of theories: P(T) ∝ 2^{−c(T)}.⁴ Given this prior distribution, we can use Bayesian inference to make predictions about unobserved relations and to discover the theory T that best accounts for the observations in data set D [12, 13]. Suppose that we have a likelihood function P(D|T) which specifies how the examples in D were generated from some underlying theory T. The best explanation for the data D is the theory that maximizes the posterior distribution P(T|D) ∝ P(D|T)P(T). If we need to predict whether ground term g is likely to be true,⁵ we can sum over the space of theories:

P(g|D) = Σ_T P(g|T) P(T|D) = (1/P(D)) Σ_{T : g∈T} P(D|T) P(T)   (2)

where the final sum is over all theories T that make ground term g true.

1.1 Related work

The theories we consider are closely related to logic programs, and methods for Inductive Logic Programming (ILP) explore how these programs can be learned from examples [14]. ILP algorithms are often inspired by the idea of searching for the shortest theory that accounts for the available data, and ILP is occasionally cast as the problem of minimizing an explicit MDL criterion [10].
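The shortest-theory search that motivates ILP also has a direct Bayesian reading here. As a minimal sketch (our own illustration, not the paper's code), the function below scores a finite candidate set of theories with the prior P(T) ∝ 2^(−c(T)) and applies Equation 2 with the simple 0/1 likelihood used for the first induction task in Section 2; it reuses minimal_model and complexity from the earlier sketches, and the function and argument names are assumptions.

```python
def predict(theories, data, query, domain=DOMAIN):
    """P(query | data) under the prior P(T) ∝ 2^(-c(T)) and Equation 2."""
    scored = {}
    for name, clauses in theories.items():
        extension = minimal_model(clauses, domain)
        prior = 2.0 ** -complexity(clauses)
        likelihood = 1.0 if data <= extension else 0.0  # theory must entail every observation
        scored[name] = (prior * likelihood, extension)
    z = sum(weight for weight, _ in scored.values()) or 1.0  # guard: empty candidate set
    return sum(weight for weight, ext in scored.values() if query in ext) / z
```

For example, predict({"transitive": transitive_theory}, data={("R", (1, 2))}, query=("R", (1, 3))) returns 1.0, since the single surviving candidate makes the query true; with several candidates, simpler consistent theories dominate the sum.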
Although ILP algorithms are rarely considered as cognitive models, the RL approach has a long psychological history: versions of it were proposed by Chomsky [15] and Leeuwenberg [16], among others. Formal measures of complexity have been developed in many fields [17], and there is at least one other psychological account of theory complexity. Goodman [1] developed a complexity measure that was originally a philosophical proposal about scientific theories, but was later tested as a model of subjective complexity [6]. A detailed description of this measure is not possible here, but we attempt to give a flavor of the approach. Suppose that a basis is a set of predicates. The starting point for Goodman’s model is the intuition that basis B1 is at least as complex as basis B2 if B1 can be used to define B2. Goodman argues that this intuition is flawed, but his model is founded on a refinement of this intuition. For instance, since the binary predicate in Fig. 1b can be defined in terms of two unary predicates, Goodman’s approach requires that the complexity of the binary predicate is no more than the sum of the complexities of the two unary predicates.

We will use Goodman’s model as a baseline for evaluating our own approach, and a comparison between these two models should be informed by both theoretical and empirical considerations. On the theoretical side, our approach relies on a simple principle for deciding which structural properties are relevant to the measurement of complexity: the relevant properties are those with short logical representations. Goodman’s approach incorporates no such principle, and he proposes somewhat arbitrarily that reflexivity and symmetry are among the relevant structural properties but that transitivity is not. A second reason for preferring our model is that it makes contact with a general principle that has found many applications across psychology, machine learning, and philosophy: the idea that simplicity is related to representation length.

2 Experimental results

We designed two experiments to explore settings where people learn, remember, and make inductive inferences about relational data. Although theories often consist of systems of many interlocking relations, we keep our experiments simple by asking subjects to learn and reason about a single relation at a time. Despite this restriction, our experiments still make contact with several issues raised by systems of relations. As the defers to(·, ·) example suggests, a single relation may be best explained as the observable tip of a system involving several latent predicates (e.g. caste 1(·) and caste 2(·)).

⁴ To ensure that this distribution can be normalized, we assume that there is some upper bound on the number of predicate symbols, variable symbols, and constants, and on the length of the theories we will consider. There will therefore be a finite number of possible theories, and our prior will be a valid probability distribution.
⁵ A ground term is a term such as R(8, 9) that does not include any variables.

[Figure 2: (a) Average time in seconds to learn the six sets in Fig. 1. (b) Average ratings of set complexity. (c) Complexity scores according to our representation length (RL) model. (d) Complexity scores according to Goodman’s model. Each panel plots the six conditions: star, bipartite, exception, symmetric, transitive, random.]
2.1 Experiment 1: memory and induction

In our first experiment, we studied the subjective complexity of six binary relations that display a range of structural properties, including reflexivity, symmetry, and transitivity.

Materials and Methods. 18 adults participated in this experiment. Subjects were required to learn the 6 sets shown in Fig. 1, and to make inductive inferences about each set. Although Fig. 1 shows pairs of digits, the experiment used letter pairs, and the letters for each condition and the order in which these conditions were presented were randomized across subjects. The pairs for each condition were initially laid out randomly on screen, and subjects could drag them around and organize them to help them understand the structure of the set. At any stage, subjects could enter a test phase where they were asked to list the 15 pairs belonging to the current set. Subjects who made an error on the test were returned to the learning phase. After 9 minutes had elapsed, subjects were allowed to pass the test regardless of how many errors they made. After passing the test, subjects were asked to rate the complexity of the set compared to other sets with 15 pairs. Ratings were provided on a 7-point scale. Subjects were then asked to imagine that a new letter (e.g. letter 9) had belonged to the current alphabet, and were given two inductive tasks. First they were asked to enter between 1 and 10 novel pairs that they might have expected to see (each novel pair was required to include the new letter). Next they were told about a novel pair that belonged to the set (e.g. pair 91), and were again asked to enter up to 10 additional pairs that they might have expected to see.

Results. The average time needed to learn each set is shown in Fig. 2a, and ratings of set complexity are shown in Fig. 2b. It is encouraging that these measures yield converging results, but they may be confounded since subjects rated the complexity of a set immediately after learning it. The complexities plotted in Fig. 2c are the complexities of the theories shown in Fig. 1, which we believe to be the simplest theories according to our complexity measure. The final plot in Fig. 2 shows complexities according to Goodman’s model, which assigns each binary relation an integer between 0 and 4. There are several differences between these models: for instance, Goodman’s account incorrectly predicts that the exception case is the hardest of the six, but our model acknowledges that a simple theory remains simple if a handful of exceptions are added. Goodman’s account also predicts that transitivity is not an important structural regularity, but our model correctly predicts that the transitive set is simpler than the same set with some of the pairs reversed (the random set).

Results for the inductive task are shown in Fig. 3. The first two columns show the number of subjects who listed each novel pair. The remaining two columns show the probability of set membership predicted by our model. To generate these predictions, we applied Equation 2 and summed over a set of theories created by systematically extending the theories shown in Fig. 1. Each extended theory includes up to one additional clause for each predicate in the base theory, and each additional clause includes at most two predicate slots.
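To give a feel for the size of this extended hypothesis space, the sketch below enumerates candidate one-clause additions for the binary predicate R under the two-predicate-slot limit. The paper specifies only the slot limits, so the available terms (here the constants 2 and 9 and two variables) and the omission of negated additions such as R̄(2, 9) are simplifying assumptions of ours.

```python
from itertools import product

def candidate_additions(pred="R", terms=(2, 9, "X", "Y")):
    """Candidate clauses with at most two predicate slots, e.g. R(9, X). or
    R(X, 9) <- R(X, 2)., in the (head, body) encoding used earlier."""
    atoms = [(pred, args) for args in product(terms, repeat=2)]
    facts = [(head, ()) for head in atoms]                          # one predicate slot
    rules = [(head, (body,)) for head in atoms for body in atoms]   # two predicate slots
    return facts + rules

print(len(candidate_additions()))  # 16 facts + 256 one-atom rules = 272 candidates
```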
For instance, each extended theory for the bipartite case is created by choosing whether or not to add the clause T(9), and adding up to one clause for predicate R.⁶ For the first inductive task, the likelihood term P(D|T) (see Equation 2) is set to 0 for all theories that are not consistent with the pairs observed during training, and to a constant for all remaining theories. For the second task we assumed in addition that the novel pair observed is sampled at random from all pairs involving the new letter.⁷ All model predictions were computed using Mace4 [18] to generate the extension of each theory considered.

⁶ R(9, X), R̄(2, 9), and R(X, 9) ← R(X, 2) are three possible additions.

[Figure 3: Data and model predictions for the induction task in Experiment 1. Rows correspond to the star, bipartite, exception, symmetric, transitive, and random conditions; the four columns are Human (no examples), Human (1 example), RL (no examples), and RL (one example). Columns 1 and 3 show predictions before any pairs involving the new letter are observed. Columns 2 and 4 show predictions after a single novel pair (marked with a gray bar) is observed to belong to the set. The model plots for each condition include correlations with the human data.]

The supporting material includes predictions for a model based on the Goodman complexity measure and an exemplar model which assumes that the new letter will be just like one of the old letters.⁸ The exemplar model outperforms our model in the random condition, and makes accurate predictions about three other conditions. Overall, however, our model performs better than the two baselines. Here we focus on two important predictions that are not well handled by the exemplar model. In the symmetry condition, almost all subjects predict that 78 belongs to the set after learning that 87 belongs to the set, suggesting that they have learned an abstract rule. In the transitive condition, most subjects predict that pairs 72 through 76 belong to the set after learning that 71 belongs to the set. Our model accounts for this result, but the exemplar model has no basis for making predictions about letter 7, since this letter is now known to be unlike any of the others.

2.2 Experiment 2: learning from positive examples

During the learning phase of our first experiment, subjects learned a theory based on positive examples (the theory included all pairs they had seen) and negative examples (the theory ruled out all pairs they had not seen). Often, however, humans learn theories based on positive examples alone. Suppose, for instance, that our anthropologist has spent only a few hours with a new tribe. She may have observed several pairs who are obviously friends, but should realize that many other pairs of friends have not yet interacted in her presence.

⁷ For the second task, P(D|T) is set to 0 for theories that are inconsistent with the training pairs and for theories which do not include the observed novel pair. For all remaining theories, P(D|T) is set to 1/n, where n is the total number of novel pairs that are consistent with T.
⁸ Supporting material is available at www.charleskemp.com
Our framework can handle cases like these if we assume that the data D in Equation 2 are sampled from the ground terms that are true according to the underlying theory. We follow [10] and [13] and use a distribution P(D|T) which assumes that the examples in D are randomly sampled with replacement from the ground terms that are true. This sampling assumption encourages our model to identify the theory with the smallest extension that is compatible with all of the training examples. We tested this approach by designing an experiment where learners were given sets of examples that were compatible with several underlying theories.

Materials and Methods. 15 adults participated in this experiment immediately after taking Experiment 1. In each of five conditions, subjects were told about a set of triples built from an alphabet of 9 letters. They were shown four triples that belonged to the set (Fig. 4), and told that the set might include triples that they had not seen. Subjects then gave ratings on a seven-point scale to indicate whether five additional triples (see Fig. 4) were likely to belong to the set.

[Figure 4: Data and model predictions for Experiment 2. The four triples observed for each set are shown at the top of the figure: (a) R(X, X, X) with triples 111, 222, 333, 444; (b) R(X, X, 1) with 221, 331, 441, 551; (c) R(X, X, Y) with 221, 443, 552, 663; (d) R(X, Y, Z) with 231, 456, 615, 344; (e) R(2, 3, X) with 231, 234, 235, 236. The probe triples are 777, 771, 778, 789, and 237. The first row of plots shows average ratings on a scale from 1 (very unlikely to belong to the set) to 7 (very likely). Model predictions (RL) are plotted as log probabilities.]

Results. Average ratings and model predictions are shown in Fig. 4. Model predictions for each condition were computed using Equation 2 and summing over a space of theories that included the five theories shown at the top of Fig. 4, variants of these five theories which stated that certain pairs of slots could not be occupied by the same constant,⁹ and theories that included no variables but merely enumerated up to 5 triples.¹⁰ Although there are general theories like R(X, Y, Z) that are compatible with the triples observed in all five conditions, Fig. 4 shows that people were sensitive to different regularities in each case.¹¹

We focus on one condition (Fig. 4b) that exposes the strengths and weaknesses of our model. According to our model, the two most probable theories given the triples for this condition are R(X, X, 1) and the closely related variant that rules out R(1, 1, 1). The next most probable theory is R(X, X, Y). These predictions are consistent with people’s judgments that 771 is very likely to belong to the set, and that 778 is the next most likely option. Unlike our model, however, people consider 777 to be substantially less likely than 778 to belong to the set. This result may suggest that the variant of R(X, X, Y) that rules out R(X, X, X) deserves a higher prior probability than our model recognizes.

⁹ One such theory includes two clauses: R(X, X, Y). R̄(X, X, X).
¹⁰ One such theory is the following list of clauses: R(2, 2, 1). R(3, 3, 1). R(4, 4, 1). R(5, 5, 1). R(7, 7, 7).
¹¹ Similar results have been found with 9-month-old infants. Cases like Figs. 4b and 4c have been tested in an infant language-learning study where the stimuli were three-syllable strings [19]. 9-month-old infants exposed to strings like the four in Fig. 4c generalized to other strings consistent with the theory R(X, X, Y), but infants in the condition corresponding to Fig. 4b generalized only to strings consistent with the theory R(X, X, 1).
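Before turning to the discussion, here is a small sketch (ours, with assumed names) of the sampling assumption above. Because each example is drawn with replacement from the ground terms a theory makes true, every observation contributes a factor of one over the extension size, which is why theories with smaller extensions are preferred.

```python
def sampling_likelihood(data, extension):
    """P(D|T) when examples are sampled with replacement from T's true ground terms."""
    if any(example not in extension for example in data):
        return 0.0                     # the theory must make every observed example true
    return (1.0 / len(extension)) ** len(data)

# Over a 9-letter alphabet, R(X, X, 1) has 9 true triples, R(X, X, Y) has 81,
# and R(X, Y, Z) has 729, so four observations favor the first theory by 9^4
# over the second and by 81^4 over the third (before the prior is applied).
```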
To better account for cases like the 777/778 asymmetry above, it may be worth considering languages where any two variables that belong to the same clause but have different names must refer to different entities.

3 Discussion and Conclusion

There are many psychological models of concept learning [4, 12, 13], but few that use representations rich enough to capture the content of intuitive theories. We suggested that intuitive theories are mentally represented in a first-order logical language, and proposed a specific hypothesis about this “language of thought.” We assumed that the subjective complexity of a theory depends on the length of its representation in this language, and described experiments which suggest that the resulting complexity measure helps to explain how theories are learned and used for inductive inference.

Our experiments deliberately used stimuli that minimize the influence of prior knowledge. Theories, however, are cumulative, and the theory that seems simplest to a learner will often depend on her background knowledge. Our approach provides a natural place for background knowledge to be inserted. A learner can be supplied with a stock of background predicates, and the shortest representation for a data set will depend on which background predicates are available. Since different sets of predicates will lead to different predictions about subjective complexity, empirical results can help to determine the background knowledge that people bring to a given class of problems.

Future work should aim to refine the representation language and complexity measure we proposed. We expect that something like our approach will be suitable for modeling a broad class of intuitive theories, but the specific framework presented here can almost certainly be improved. Future work should also consider different strategies for searching the space of theories. Some of the strategies developed in the ILP literature should be relevant [14], but a detailed investigation of search algorithms seems premature until our approach has held up to additional empirical tests. It is comparatively easy to establish whether the theories that are simple according to our approach are also considered simple by people, and our experiments have made a start in this direction. It is much harder to establish that our approach captures most of the theories that are subjectively simple, and more exhaustive experiments are needed before this conclusion can be drawn.

Boolean concept learning has been studied for more than fifty years [4, 9], and many psychologists have made empirical and theoretical contributions to this field. An even greater effort will be needed to crack the problem of theory learning, since the space of intuitive theories is much richer than the space of Boolean concepts. The difficulty of this problem should not be underestimated, but computational approaches can contribute part of the solution.
Acknowledgments

Supported by the William Asbjornsen Albert memorial fellowship (CK), the James S. McDonnell Foundation Causal Learning Collaborative Initiative (NDG, JBT) and the Paul E. Newton chair (JBT).

References

[1] N. Goodman. The structure of appearance. 2nd edition, 1961.
[2] S. Carey. Conceptual change in childhood. MIT Press, Cambridge, MA, 1985.
[3] H. A. Simon. Complexity and the representation of patterned sequences of symbols. Psychological Review, 79:369–382, 1972.
[4] J. Feldman. An algebra of human concept learning. JMP, 50:339–368, 2006.
[5] N. Chater and P. Vitanyi. Simplicity: a unifying principle in cognitive science. TICS, 7:19–22, 2003.
[6] J. T. Krueger. A theory of structural simplicity and its relevance to aspects of memory, perception, and conceptual naturalness. PhD thesis, University of Pennsylvania, 1979.
[7] P. Grünwald, I. J. Myung, and M. Pitt, editors. Advances in Minimum Description Length: Theory and Applications. 2005.
[8] N. Chater. Reconciling simplicity and likelihood principles in perceptual organization. Psychological Review, 103:566–581, 1996.
[9] J. S. Bruner, J. J. Goodnow, and G. A. Austin. A study of thinking. Wiley, 1956.
[10] D. Conklin and I. H. Witten. Complexity-based induction. Machine Learning, 16(3):203–225, 1994.
[11] G. A. Miller. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63(1):81–97, 1956.
[12] N. D. Goodman, T. L. Griffiths, J. Feldman, and J. B. Tenenbaum. A rational analysis of rule-based concept learning. In CogSci, 2007.
[13] J. B. Tenenbaum and T. L. Griffiths. Generalization, similarity, and Bayesian inference. BBS, 24:629–641, 2001.
[14] S. Muggleton and L. De Raedt. Inductive logic programming: theory and methods. Journal of Logic Programming, 19-20:629–679, 1994.
[15] N. Chomsky. The logical structure of linguistic theory. University of Chicago Press, Chicago, 1975.
[16] E. L. J. Leeuwenberg. A perceptual coding language for visual and auditory patterns. American Journal of Psychology, 84(3):307–349, 1971.
[17] B. Edmonds. Syntactic measures of complexity. PhD thesis, University of Manchester, 1999.
[18] W. McCune. Mace4 reference manual and guide. Technical Report ANL/MCS-TM-264, Argonne National Laboratory, 2003.
[19] L. Gerken. Decisions, decisions: infant language learning when multiple generalizations are possible. Cognition, 98(3):67–74, 2006.

6 0.060096655 17 nips-2007-A neural network implementing optimal state estimation based on dynamic spike train decoding

7 0.055656567 140 nips-2007-Neural characterization in partially observed populations of spiking neurons

8 0.050478283 3 nips-2007-A Bayesian Model of Conditioned Perception

9 0.047878485 203 nips-2007-The rat as particle filter

10 0.047393113 181 nips-2007-Sparse Overcomplete Latent Variable Decomposition of Counts Data

11 0.046693578 97 nips-2007-Hidden Common Cause Relations in Relational Learning

12 0.045126386 149 nips-2007-Optimal ROC Curve for a Combination of Classifiers

13 0.044527989 84 nips-2007-Expectation Maximization and Posterior Constraints

14 0.044393927 59 nips-2007-Continuous Time Particle Filtering for fMRI

15 0.042724226 63 nips-2007-Convex Relaxations of Latent Variable Training

16 0.042449258 80 nips-2007-Ensemble Clustering using Semidefinite Programming

17 0.036987137 86 nips-2007-Exponential Family Predictive Representations of State

18 0.036512159 18 nips-2007-A probabilistic model for generating realistic lip movements from speech

19 0.036010306 33 nips-2007-Bayesian Inference for Spiking Neuron Models with a Sparsity Prior

20 0.035366621 121 nips-2007-Local Algorithms for Approximate Inference in Minor-Excluded Graphs


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.119), (1, 0.03), (2, 0.015), (3, -0.04), (4, 0.002), (5, 0.011), (6, 0.016), (7, 0.062), (8, -0.037), (9, -0.098), (10, -0.031), (11, -0.032), (12, -0.008), (13, -0.049), (14, 0.046), (15, -0.056), (16, -0.012), (17, 0.062), (18, 0.103), (19, 0.025), (20, 0.094), (21, 0.053), (22, -0.044), (23, -0.056), (24, -0.033), (25, 0.045), (26, -0.074), (27, -0.01), (28, 0.253), (29, 0.133), (30, 0.075), (31, 0.087), (32, -0.051), (33, -0.04), (34, -0.033), (35, -0.107), (36, -0.02), (37, -0.06), (38, -0.036), (39, 0.053), (40, -0.007), (41, 0.113), (42, 0.123), (43, -0.035), (44, 0.027), (45, -0.053), (46, -0.081), (47, -0.007), (48, -0.129), (49, 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95807421 198 nips-2007-The Noisy-Logical Distribution and its Application to Causal Inference

Author: Hongjing Lu, Alan L. Yuille

Abstract: We describe a novel noisy-logical distribution for representing the distribution of a binary output variable conditioned on multiple binary input variables. The distribution is represented in terms of noisy-or’s and noisy-and-not’s of causal features which are conjunctions of the binary inputs. The standard noisy-or and noisy-andnot models, used in causal reasoning and artificial intelligence, are special cases of the noisy-logical distribution. We prove that the noisy-logical distribution is complete in the sense that it can represent all conditional distributions provided a sufficient number of causal factors are used. We illustrate the noisy-logical distribution by showing that it can account for new experimental findings on how humans perform causal reasoning in complex contexts. We speculate on the use of the noisy-logical distribution for causal reasoning and artificial intelligence.

2 0.81521183 51 nips-2007-Comparing Bayesian models for multisensory cue combination without mandatory integration

Author: Ulrik Beierholm, Ladan Shams, Wei J. Ma, Konrad Koerding

Abstract: Bayesian models of multisensory perception traditionally address the problem of estimating an underlying variable that is assumed to be the cause of the two sensory signals. The brain, however, has to solve a more general problem: it also has to establish which signals come from the same source and should be integrated, and which ones do not and should be segregated. In the last couple of years, a few models have been proposed to solve this problem in a Bayesian fashion. One of these has the strength that it formalizes the causal structure of sensory signals. We first compare these models on a formal level. Furthermore, we conduct a psychophysics experiment to test human performance in an auditory-visual spatial localization task in which integration is not mandatory. We find that the causal Bayesian inference model accounts for the data better than other models. Keywords: causal inference, Bayesian methods, visual perception.

3 0.57199925 3 nips-2007-A Bayesian Model of Conditioned Perception

Author: Alan Stocker, Eero P. Simoncelli

Abstract: unknown-abstract

4 0.54721487 119 nips-2007-Learning with Tree-Averaged Densities and Distributions

Author: Sergey Kirshner

Abstract: We utilize the ensemble of trees framework, a tractable mixture over superexponential number of tree-structured distributions [1], to develop a new model for multivariate density estimation. The model is based on a construction of treestructured copulas – multivariate distributions with uniform on [0, 1] marginals. By averaging over all possible tree structures, the new model can approximate distributions with complex variable dependencies. We propose an EM algorithm to estimate the parameters for these tree-averaged models for both the real-valued and the categorical case. Based on the tree-averaged framework, we propose a new model for joint precipitation amounts data on networks of rain stations. 1

5 0.5065735 150 nips-2007-Optimal models of sound localization by barn owls

Author: Brian J. Fischer

Abstract: Sound localization by barn owls is commonly modeled as a matching procedure where localization cues derived from auditory inputs are compared to stored templates. While the matching models can explain properties of neural responses, no model explains how the owl resolves spatial ambiguity in the localization cues to produce accurate localization for sources near the center of gaze. Here, I examine two models for the barn owl’s sound localization behavior. First, I consider a maximum likelihood estimator in order to further evaluate the cue matching model. Second, I consider a maximum a posteriori estimator to test whether a Bayesian model with a prior that emphasizes directions near the center of gaze can reproduce the owl’s localization behavior. I show that the maximum likelihood estimator can not reproduce the owl’s behavior, while the maximum a posteriori estimator is able to match the behavior. This result suggests that the standard cue matching model will not be sufficient to explain sound localization behavior in the barn owl. The Bayesian model provides a new framework for analyzing sound localization in the barn owl and leads to predictions about the owl’s localization behavior.

6 0.47498214 114 nips-2007-Learning and using relational theories

7 0.40442297 125 nips-2007-Markov Chain Monte Carlo with People

8 0.40212128 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data

9 0.38807967 203 nips-2007-The rat as particle filter

10 0.31953526 81 nips-2007-Estimating disparity with confidence from energy neurons

11 0.30162948 130 nips-2007-Modeling Natural Sounds with Modulation Cascade Processes

12 0.28617406 64 nips-2007-Cooled and Relaxed Survey Propagation for MRFs

13 0.27772275 174 nips-2007-Selecting Observations against Adversarial Objectives

14 0.27260107 97 nips-2007-Hidden Common Cause Relations in Relational Learning

15 0.26013589 87 nips-2007-Fast Variational Inference for Large-scale Internet Diagnosis

16 0.25492844 1 nips-2007-A Bayesian Framework for Cross-Situational Word-Learning

17 0.25363415 68 nips-2007-Discovering Weakly-Interacting Factors in a Complex Stochastic Process

18 0.24542645 67 nips-2007-Direct Importance Estimation with Model Selection and Its Application to Covariate Shift Adaptation

19 0.24221788 89 nips-2007-Feature Selection Methods for Improving Protein Structure Prediction with Rosetta

20 0.23362143 66 nips-2007-Density Estimation under Independent Similarly Distributed Sampling Assumptions


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.031), (13, 0.041), (16, 0.015), (18, 0.057), (21, 0.063), (31, 0.036), (34, 0.028), (35, 0.027), (47, 0.107), (49, 0.015), (63, 0.016), (74, 0.305), (83, 0.1), (85, 0.015), (90, 0.052)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.71789646 198 nips-2007-The Noisy-Logical Distribution and its Application to Causal Inference

Author: Hongjing Lu, Alan L. Yuille

Abstract: We describe a novel noisy-logical distribution for representing the distribution of a binary output variable conditioned on multiple binary input variables. The distribution is represented in terms of noisy-or’s and noisy-and-not’s of causal features which are conjunctions of the binary inputs. The standard noisy-or and noisy-andnot models, used in causal reasoning and artificial intelligence, are special cases of the noisy-logical distribution. We prove that the noisy-logical distribution is complete in the sense that it can represent all conditional distributions provided a sufficient number of causal factors are used. We illustrate the noisy-logical distribution by showing that it can account for new experimental findings on how humans perform causal reasoning in complex contexts. We speculate on the use of the noisy-logical distribution for causal reasoning and artificial intelligence.
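Since the noisy-or and noisy-and-not are standard building blocks, a minimal sketch may help fix ideas; the causal-power values below are illustrative, and the paper's full noisy-logical construction composes such terms over conjunctive causal features:

```python
import numpy as np

def noisy_or(c, w):
    """P(E=1 | causes c) for generative causes: the effect fails only if
    every present cause independently fails to generate it."""
    c, w = np.asarray(c, float), np.asarray(w, float)
    return 1.0 - np.prod((1.0 - w) ** c)

def noisy_and_not(c_gen, w_gen, c_prev, w_prev):
    """P(E=1) with one generative and one preventive cause: the effect is
    generated by the first cause and not blocked by the second."""
    return noisy_or([c_gen], [w_gen]) * (1.0 - w_prev) ** c_prev

# Two generative causes with causal powers 0.8 and 0.4, both present:
print(noisy_or([1, 1], [0.8, 0.4]))   # 1 - 0.2 * 0.6 = 0.88
# A generative cause (power 0.8) plus a present preventer (power 0.5):
print(noisy_and_not(1, 0.8, 1, 0.5))  # 0.8 * 0.5 = 0.40
```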

2 0.625745 178 nips-2007-Simulated Annealing: Rigorous finite-time guarantees for optimization on continuous domains

Author: Andrea Lecchini-Visintini, John Lygeros, Jan Maciejowski

Abstract: Simulated annealing is a popular method for approaching the solution of a global optimization problem. Existing results on its performance apply to discrete combinatorial optimization where the optimization variables can assume only a finite set of possible values. We introduce a new general formulation of simulated annealing which allows one to guarantee finite-time performance in the optimization of functions of continuous variables. The results hold universally for any optimization problem on a bounded domain and establish a connection between simulated annealing and up-to-date theory of convergence of Markov chain Monte Carlo methods on continuous domains. This work is inspired by the concept of finite-time learning with known accuracy and confidence developed in statistical learning theory. Optimization is the general problem of finding a value of a vector of variables θ that maximizes (or minimizes) some scalar criterion U(θ). The set of all possible values of the vector θ is called the optimization domain. The elements of θ can be discrete or continuous variables. In the first case the optimization domain is usually finite, such as in the well-known traveling salesman problem; in the second case the optimization domain is a continuous set. An important example of a continuous optimization domain is the set of 3-D configurations of a sequence of amino-acids in the problem of finding the minimum energy folding of the corresponding protein [1]. In principle, any optimization problem on a finite domain can be solved by an exhaustive search. However, this is often beyond computational capacity: the optimization domain of the traveling salesman problem with 100 cities contains more than 10^155 possible tours. An efficient algorithm to solve the traveling salesman and many similar problems has not yet been found and such problems remain reliably solvable only in principle [2]. Statistical mechanics has inspired widely used methods for finding good approximate solutions in hard discrete optimization problems which defy efficient exact solutions [3, 4, 5, 6]. Here a key idea has been that of simulated annealing [3]: a random search based on the Metropolis-Hastings algorithm, such that the distribution of the elements of the domain visited during the search converges to an equilibrium distribution concentrated around the global optimizers. Convergence and finite-time performance of simulated annealing on finite domains have been evaluated in many works, e.g. [7, 8, 9, 10]. On continuous domains, most popular optimization methods perform a local gradient-based search and in general converge to local optimizers, with the notable exception of convex criteria where convergence to the unique global optimizer occurs [11]. Simulated annealing performs a global search and can be easily implemented on continuous domains. Hence it can be considered a powerful complement to local methods. In this paper, we introduce for the first time rigorous guarantees on the finite-time performance of simulated annealing on continuous domains. We will show that it is possible to derive simulated annealing algorithms which, with an arbitrarily high level of confidence, find an approximate solution to the problem of optimizing a function of continuous variables, within a specified tolerance to the global optimal solution after a known finite number of steps.
Rigorous guarantees on the finite-time performance of simulated annealing in the optimization of functions of continuous variables have never been obtained before; the only results available state that simulated annealing converges to a global optimizer as the number of steps grows to infinity, e.g. [12, 13, 14, 15]. The background of our work is twofold. On the one hand, our notion of approximate solution to a global optimization problem is inspired by the concept of finite-time learning with known accuracy and confidence developed in statistical learning theory [16, 17]. We actually maintain an important aspect of statistical learning theory which is that we do not introduce any particular assumption on the optimization criterion, i.e. our results hold regardless of what U is. On the other hand, we ground our results on the theory of convergence, with quantitative bounds on the distance to the target distribution, of the Metropolis-Hastings algorithm and Markov Chain Monte Carlo (MCMC) methods, which has been one of the main achievements of recent research in statistics [18, 19, 20, 21]. In this paper, we will not develop any ready-to-use optimization algorithm. We will instead introduce a general formulation of the simulated annealing method which allows one to derive new simulated annealing algorithms with rigorous finite-time guarantees on the basis of existing theory. The Metropolis-Hastings algorithm and the general family of MCMC methods have many degrees of freedom. The choice and comparison of specific algorithms go beyond the scope of the paper. The paper is organized in the following sections. In Simulated annealing we introduce the method and fix the notation. In Convergence we recall the reasons why finite-time guarantees for simulated annealing on continuous domains have not been obtained before. In Finite-time guarantees we present the main result of the paper. In Conclusions we state our findings and conclude the paper.

1 Simulated annealing

The original formulation of simulated annealing was inspired by the analogy between the stochastic evolution of the thermodynamic state of an annealing material towards the configurations of minimal energy and the search for the global minimum of an optimization criterion [3]. In the procedure, the optimization criterion plays the role of the energy and the state of the annealed material is simulated by the evolution of the state of an inhomogeneous Markov chain. The state of the chain evolves according to the Metropolis-Hastings algorithm in order to simulate the Boltzmann distribution of thermodynamic equilibrium. The Boltzmann distribution is simulated for a decreasing sequence of temperatures (“cooling”). The target distribution of the cooling procedure is the limiting Boltzmann distribution, for the temperature that tends to zero, which takes non-zero values only on the set of global minimizers [7]. The original formulation of the method was for a finite domain. However, simulated annealing can be generalized straightforwardly to a continuous domain because the Metropolis-Hastings algorithm can be used with almost no differences on discrete and continuous domains. The main difference is that on a continuous domain the equilibrium distributions are specified by probability densities.
On a continuous domain, Markov transition kernels in which the distribution of the elements visited by the chain converges to an equilibrium distribution with the desired density can be constructed using the Metropolis-Hastings algorithm and the general family of MCMC methods [22]. We point out that Boltzmann distributions are not the only distributions which can be adopted as equilibrium distributions in simulated annealing [7]. In this paper it is convenient for us to adopt a different type of equilibrium distribution in place of Boltzmann distributions.

1.1 Our setting

The optimization criterion is U : Θ → [0, 1], with Θ ⊂ R^N. The assumption that U takes values in the interval [0, 1] is a technical one. It does not imply any serious loss of generality. In general, any bounded optimization criterion can be scaled to take values in [0, 1]. We assume that the optimization task is to find a global maximizer; this can be done without loss of generality. We also assume that Θ is a bounded set. We consider equilibrium distributions defined by probability density functions proportional to [U(θ) + δ]^J where J and δ are two strictly positive parameters. We use π^(J) to denote an equilibrium distribution, i.e. π^(J)(dθ) ∝ [U(θ) + δ]^J π_Leb(dθ), where π_Leb is the standard Lebesgue measure. Here, J^(−1) plays the role of the temperature: if the function U(θ) plus δ is taken to a positive power J then as J increases (i.e. as J^(−1) decreases) [U(θ) + δ]^J becomes increasingly peaked around the global maximizers. The parameter δ is an offset which guarantees that the equilibrium densities are always strictly positive, even if U takes zero values on some elements of the domain. The offset δ is chosen by the user and we show later that our results allow one to make an optimal selection of δ. The zero-temperature distribution is the limiting distribution, for J → ∞, which takes non-zero values only on the set of global maximizers. It is denoted by π^(∞). In the generic formulation of the method, the Markov transition kernel of the k-th step of the inhomogeneous chain has equilibrium distribution π^(J_k) where {J_k}_{k=1,2,...} is the “cooling schedule”. The cooling schedule is a non-decreasing sequence of positive numbers according to which the equilibrium distributions become increasingly sharpened during the evolution of the chain. We use θ_k to denote the state of the chain and P_{θ_k} to denote its probability distribution. The distribution P_{θ_k} obviously depends on the initial condition θ_0. However, in this work, we don’t need to make this dependence explicit in the notation.

Remark 1: If, given an element θ in Θ, the value U(θ) can be computed directly, we say that U is a deterministic criterion, e.g. the energy landscape in protein structure prediction [1]. In problems involving random variables, the value U(θ) may be the expected value U(θ) = ∫ g(x, θ) p_x(x; θ) dx of some function g which depends on both the optimization variable θ, and on some random variable x which has probability density p_x(x; θ) (which may itself depend on θ). In such problems it is usually not possible to compute U(θ) directly, either because evaluation of the integral requires too much computation, or because no analytical expression for p_x(x; θ) is available. Typically one must perform stochastic simulations in order to obtain samples of x for a given θ, hence obtain sample values of g(x, θ), and thus construct a Monte Carlo estimate of U(θ).
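As a concrete illustration of the Monte Carlo estimate described in Remark 1, here is a minimal sketch; the quadratic-exponential g, the Gaussian p_x, and the sample size are illustrative assumptions, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_U(theta, n_samples=10_000):
    """Monte Carlo estimate of U(theta) = E[g(x, theta)] with x ~ p_x(x; theta).
    Illustrative choices: p_x is Gaussian centered at theta, and
    g(x, theta) = exp(-x**2), so that U takes values in [0, 1]."""
    x = rng.normal(loc=theta, scale=1.0, size=n_samples)  # samples of x given theta
    g = np.exp(-x**2)                                      # sample values of g(x, theta)
    return g.mean()                                        # Monte Carlo estimate of U(theta)

print(estimate_U(0.0))  # near E[exp(-x^2)] for x ~ N(0, 1), i.e. 1/sqrt(3) ≈ 0.577
print(estimate_U(2.0))  # smaller: the Gaussian mass sits away from the peak of g
```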
The Bayesian design of clinical trials is an important application area where such expected-value criteria arise [23]. The authors of this paper investigate the optimization of expected-value criteria motivated by problems of aircraft routing [24]. In the particular case that p_x(x; θ) does not depend on θ, the optimization task is often called “empirical risk minimization”, and is studied extensively in statistical learning theory [16, 17]. The results of this paper apply in the same way to the optimization of both deterministic and expected-value criteria. The MCMC method developed by Müller [25, 26] allows one to construct simulated annealing algorithms for the optimization of expected-value criteria. Müller [25, 26] employs the same equilibrium distributions as those described in our setting; in his context J is restricted to integer values.

2 Convergence

The rationale of simulated annealing is as follows: if the temperature is kept constant, say J_k = J, then the distribution of the state of the chain P_{θ_k} tends to the equilibrium distribution π^(J); if J → ∞ then the equilibrium distribution π^(J) tends to the zero-temperature distribution π^(∞); as a result, if the cooling schedule J_k tends to infinity, one obtains that P_{θ_k} “follows” π^(J_k) and that π^(J_k) tends to π^(∞) and eventually that the distribution of the state of the chain P_{θ_k} tends to π^(∞). The theory shows that, under conditions on the cooling schedule and the Markov transition kernels, the distribution of the state of the chain P_{θ_k} actually converges to the target zero-temperature distribution π^(∞) as k → ∞ [12, 13, 14, 15]. Convergence to the zero-temperature distribution implies that asymptotically the state of the chain eventually coincides with a global optimizer with probability one. The difficulty which must be overcome in order to obtain finite-step results on simulated annealing algorithms on a continuous domain is that usually, in an optimization problem defined over continuous variables, the set of global optimizers has zero Lebesgue measure (e.g. a set of isolated points). If the set of global optimizers has zero measure then the set of global optimizers has null probability according to the equilibrium distributions π^(J) for any finite J and, as a consequence, according to the distributions P_{θ_k} for any finite k. Put another way, the probability that the state of the chain visits the set of global optimizers is constantly zero after any finite number of steps. Hence the confidence of the fact that the solution provided by the algorithm in finite time coincides with a global optimizer is also constantly zero. Notice that this is not the case for a finite domain, where the set of global optimizers is of non-null measure with respect to the reference counting measure [7, 8, 9, 10]. It is instructive to look at the issue also in terms of the rate of convergence to the target zero-temperature distribution. On a discrete domain, the distribution of the state of the chain at each step and the zero-temperature distribution are both standard discrete distributions. It is then possible to define a distance between them and study the rate of convergence of this distance to zero. This analysis allows one to obtain results on the finite-time behavior of simulated annealing [7, 8]. On a continuous domain and for a set of global optimizers of measure zero, the target zero-temperature distribution π^(∞) ends up being a mixture of probability masses on the set of global optimizers.
In this situation, although the distribution of the state of the chain P_{θ_k} still converges asymptotically to π^(∞), it is not possible to introduce a sensible distance between the two distributions and a rate of convergence to the target distribution cannot even be defined (weak convergence), see [12, Theorem 3.3]. This is the reason that until now there have been no guarantees on the performance of simulated annealing on a continuous domain after a finite number of computations: by adopting the zero-temperature distribution π^(∞) as the target distribution it is only possible to prove asymptotic convergence in infinite time to a global optimizer.

Remark 2: The standard distance between two distributions, say µ1 and µ2, on a continuous support is the total variation norm ‖µ1 − µ2‖_TV = sup_A |µ1(A) − µ2(A)|, see e.g. [21]. In simulated annealing on a continuous domain the distribution of the state of the chain P_{θ_k} is absolutely continuous with respect to the Lebesgue measure (i.e. π_Leb(A) = 0 ⇒ P_{θ_k}(A) = 0), by construction for any finite k. Hence if the set of global optimizers has zero Lebesgue measure then it has zero measure also according to P_{θ_k}. The set of global optimizers has however measure 1 according to π^(∞). The distance ‖P_{θ_k} − π^(∞)‖_TV is then constantly 1 for any finite k. It is also worth mentioning that if the set of global optimizers has zero measure then asymptotic convergence to the zero-temperature distribution π^(∞) can be proven only under the additional assumptions of continuity and differentiability of U [12, 13, 14, 15].

3 Finite-time guarantees

In general, optimization algorithms for problems defined on continuous variables can only find approximate solutions in finite time [27]. Given an element θ of a continuous domain, how can we assess how good it is as an approximate solution to an optimization problem? Here we introduce the concept of approximate global optimizer to answer this question. The definition is given for a maximization problem in a continuous but bounded domain. We use two parameters: the value imprecision ǫ (greater than or equal to 0) and the residual domain α (between 0 and 1) which together determine the level of approximation. We say that θ is an approximate global optimizer of U with value imprecision ǫ and residual domain α if the function U takes values strictly greater than U(θ) + ǫ only on a subset of values of θ no larger than an α portion of the optimization domain. The formal definition is as follows.

Definition 1 Let U : Θ → R be an optimization criterion where Θ ⊂ R^N is bounded. Let π_Leb denote the standard Lebesgue measure. Let ǫ ≥ 0 and α ∈ [0, 1] be given numbers. Then θ is an approximate global optimizer of U with value imprecision ǫ and residual domain α if

π_Leb{θ′ ∈ Θ : U(θ′) > U(θ) + ǫ} ≤ α π_Leb(Θ).

In other words, the value U(θ) is within ǫ of a value which is greater than the values that U takes on at least a 1 − α portion of the domain. The smaller ǫ and α are, the better the approximation of a true global optimizer. If both α and ǫ are equal to zero then U(θ) coincides with the essential supremum of U. Our definition of approximate global optimizer carries an important property, which holds regardless of what the criterion U is: if ǫ and α have non-zero values then the set of approximate global optimizers always has non-zero Lebesgue measure. It follows that the probability that the chain visits the set of approximate global optimizers can be non-zero.
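A direct way to read Definition 1 is as a Monte Carlo test. The following sketch, in which the test criterion, domain, and sample size are illustrative assumptions, estimates the fraction of the domain on which U exceeds U(θ) + ǫ:

```python
import numpy as np

rng = np.random.default_rng(1)

def is_approx_optimizer(U, theta, eps, alpha, low, high, n=100_000):
    """Monte Carlo check of Definition 1 on the interval [low, high]:
    estimate pi_Leb{theta' : U(theta') > U(theta) + eps} / pi_Leb(Theta)
    by uniform sampling, and compare it with alpha."""
    samples = rng.uniform(low, high, size=n)
    frac_better = np.mean(U(samples) > U(theta) + eps)
    return frac_better <= alpha

# Illustrative criterion on Theta = [0, 1], values in [0, 1], peak at 0.3:
U = lambda t: np.exp(-50.0 * (t - 0.3) ** 2)

print(is_approx_optimizer(U, theta=0.28, eps=0.05, alpha=0.01, low=0.0, high=1.0))  # True
print(is_approx_optimizer(U, theta=0.60, eps=0.05, alpha=0.01, low=0.0, high=1.0))  # False
```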
Hence, it is sensible to study the confidence of the fact that the solution found by simulated annealing in finite time is an approximate global optimizer.

Remark 3: The intuition that our notion of approximate global optimizer can be used to obtain formal guarantees on the finite-time performance of optimization methods based on a stochastic search of the domain is already apparent in the work of Vidyasagar [17, 28]. Vidyasagar [17, 28] introduces a similar definition and obtains rigorous finite-time guarantees in the optimization of expected-value criteria based on uniform independent sampling of the domain. Notably, the number of independent samples required to guarantee some desired accuracy and confidence turns out to be polynomial in the values of the desired imprecision, residual domain and confidence. Although the method of Vidyasagar is not highly sophisticated, it has had considerable success in solving difficult control system design applications [28, 29]. Its appeal stems from its rigorous finite-time guarantees which exist without the need for any particular assumption on the optimization criterion. Here we show that finite-time guarantees for simulated annealing can be obtained by selecting a distribution π^(J) with a finite J as the target distribution in place of the zero-temperature distribution π^(∞). The fundamental result is the following theorem which allows one to select in a rigorous way δ and J in the target distribution π^(J). It is important to stress that the result holds universally for any optimization criterion U on a bounded domain. The only minor requirement is that U takes values in [0, 1].

Theorem 1 Let U : Θ → [0, 1] be an optimization criterion where Θ ⊂ R^N is bounded. Let J ≥ 1 and δ > 0 be given numbers. Let θ be a multivariate random variable with distribution π^(J)(dθ) ∝ [U(θ) + δ]^J π_Leb(dθ). Let α ∈ (0, 1] and ǫ ∈ [0, 1] be given numbers and define

σ = [ 1 + ((1 + δ)/(ǫ + 1 + δ))^J · ((1/α)·((1 + δ)/(ǫ + δ)) − 1) · ((1 + δ)/δ) ]^(−1).   (1)

Then the statement “θ is an approximate global optimizer of U with value imprecision ǫ and residual domain α” holds with probability at least σ.

Proof. See Appendix A.

The importance of the choice of a target distribution π^(J) with a finite J is that π^(J) is absolutely continuous with respect to the Lebesgue measure. Hence, the distance ‖P_{θ_k} − π^(J)‖_TV between the distribution of the state of the chain P_{θ_k} and the target distribution π^(J) is a meaningful quantity. Convergence of the Metropolis-Hastings algorithm and MCMC methods in total variation norm is a well-studied problem. The theory provides simple conditions under which one derives upper bounds on the distance to the target distribution which are known at each step of the chain and decrease monotonically to zero as the number of steps of the chain grows. The theory has been developed mainly for homogeneous chains [18, 19, 20, 21]. In the case of simulated annealing, the factor that enables us to employ these results is the absolute continuity of the target distribution π^(J) with respect to the Lebesgue measure. However, simulated annealing involves the simulation of inhomogeneous chains. In this respect, another important fact is that the choice of a target distribution π^(J) with a finite J implies that the inhomogeneous Markov chain can in fact be formed by a finite sequence of homogeneous chains (i.e. the cooling schedule {J_k}_{k=1,2,...} can be chosen to be a sequence that takes only a finite set of values).
In turn, this allows one to apply the theory of homogeneous MCMC methods to study the convergence of P_{θ_k} to π^(J) in total variation norm. On a bounded domain, simple conditions on the ‘proposal distribution’ in the iteration of the simulated annealing algorithm allow one to obtain upper bounds on ‖P_{θ_k} − π^(J)‖_TV that decrease geometrically to zero as k → ∞, without the need for any additional assumption on U [18, 19, 20, 21]. It is then appropriate to introduce the following finite-time result.

Theorem 2 Let the notation and assumptions of Theorem 1 hold. Let θ_k, with distribution P_{θ_k}, be the state of the inhomogeneous chain of a simulated annealing algorithm with target distribution π^(J). Then the statement “θ_k is an approximate global optimizer of U with value imprecision ǫ and residual domain α” holds with probability at least σ − ‖P_{θ_k} − π^(J)‖_TV.

The proof of the theorem follows directly from the definition of the total variation norm. It follows that if simulated annealing is implemented with an algorithm which converges in total variation distance to a target distribution π^(J) with a finite J, then one can state with confidence arbitrarily close to 1 that the solution found by the algorithm after the known appropriate finite number of steps is an approximate global optimizer with the desired approximation level. For given non-zero values of ǫ, α the value of σ given by (1) can be made arbitrarily close to 1 by choice of J, while the distance ‖P_{θ_k} − π^(J)‖_TV can be made arbitrarily small by taking the known sufficient number of steps. It can be shown that it is possible to make an optimal choice of δ and J in the target distribution π^(J). In fact, for given ǫ and α and a given value of J there exists an optimal choice of δ which maximizes the value of σ given by (1). Hence, it is possible to obtain a desired σ with the smallest possible J. The advantage of choosing the smallest J, consistent with the required approximation and confidence, is that it will decrease the number of steps required to achieve the desired reduction of ‖P_{θ_k} − π^(J)‖_TV.

4 Conclusions

We have introduced a new formulation of simulated annealing which admits rigorous finite-time guarantees in the optimization of functions of continuous variables. First, we have introduced the notion of approximate global optimizer. Then, we have shown that simulated annealing is guaranteed to find approximate global optimizers, with the desired confidence and the desired level of accuracy, in a known finite number of steps, if a proper choice of the target distribution is made and conditions for convergence in total variation norm are met. The results hold for any optimization criterion on a bounded domain with the only minor requirement that it takes values between 0 and 1. In this framework, simulated annealing algorithms with rigorous finite-time guarantees can be derived by studying the choice of the proposal distribution and of the cooling schedule, in the generic iteration of simulated annealing, in order to ensure convergence to the target distribution in total variation norm. To do this, existing theory of convergence of the Metropolis-Hastings algorithm and MCMC methods on continuous domains can be used [18, 19, 20, 21].
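To make the bookkeeping behind Theorems 1 and 2 concrete, here is a minimal sketch; the geometric TV bound C·r^k is an assumption standing in for whatever bound the chosen proposal distribution yields, and all numbers are illustrative:

```python
import math

def sigma(eps, alpha, delta, J):
    """Confidence bound of Theorem 1, equation (1), for the target pi^(J)."""
    ratio = ((1 + delta) / (eps + 1 + delta)) ** J
    return 1.0 / (1.0 + ratio * ((1 / alpha) * (1 + delta) / (eps + delta) - 1)
                  * (1 + delta) / delta)

def smallest_J(eps, alpha, delta, sigma_target):
    """Smallest integer J >= 1 with sigma(...) >= sigma_target.
    Assumes eps > 0 and sigma_target < 1, so the loop terminates."""
    J = 1
    while sigma(eps, alpha, delta, J) < sigma_target:
        J += 1
    return J

eps, alpha, delta = 0.05, 0.01, 0.1
J = smallest_J(eps, alpha, delta, sigma_target=0.995)
print(J, sigma(eps, alpha, delta, J))

# Theorem 2: with an assumed geometric bound ||P_k - pi^(J)||_TV <= C * r**k,
# the confidence after k steps is at least sigma - C * r**k; steps needed for
# an extra slack of 0.004 with C = 1 and r = 0.999:
C, r, slack = 1.0, 0.999, 0.004
k = math.ceil(math.log(slack / C) / math.log(r))
print(k)
```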
Vidyasagar [17, 28] has introduced a similar definition of approximate global optimizer and has shown that approximate optimizers with desired accuracy and confidence can be obtained with a number of uniform independent samples of the domain which is polynomial in the accuracy and confidence parameters. In general, algorithms developed with the MCMC methodology can be expected to be equally or more efficient than uniform independent sampling.

Acknowledgments

Work supported by EPSRC, Grant EP/C014006/1, and by the European Commission under projects HYGEIA FP6-NEST-4995 and iFly FP6-TREN-037180. We thank S. Brooks, M. Vidyasagar and D. M. Wolpert for discussions and useful comments on the paper.

A Proof of Theorem 1

Let ᾱ ∈ (0, 1] and ρ ∈ (0, 1] be given numbers. Let U_δ(θ) := U(θ) + δ. Let π̄_δ be a normalized measure such that π̄_δ(dθ) ∝ U_δ(θ) π_Leb(dθ). In the first part of the proof we find a lower bound on the probability that θ belongs to the set {θ ∈ Θ : π̄_δ{θ′ ∈ Θ : ρ U_δ(θ′) > U_δ(θ)} ≤ ᾱ}.

Let y_ᾱ := inf{y : π̄_δ{θ ∈ Θ : U_δ(θ) ≤ y} ≥ 1 − ᾱ}. To start with we show that the set {θ ∈ Θ : π̄_δ{θ′ ∈ Θ : ρ U_δ(θ′) > U_δ(θ)} ≤ ᾱ} coincides with {θ ∈ Θ : U_δ(θ) ≥ ρ y_ᾱ}. Notice that the quantity π̄_δ{θ ∈ Θ : U_δ(θ) ≤ y} is a right-continuous non-decreasing function of y because it has the form of a distribution function (see e.g. [30, p. 162] and [17, Lemma 11.1]). Therefore we have π̄_δ{θ ∈ Θ : U_δ(θ) ≤ y_ᾱ} ≥ 1 − ᾱ and

y ≥ ρ y_ᾱ ⇒ π̄_δ{θ′ ∈ Θ : ρ U_δ(θ′) ≤ y} ≥ 1 − ᾱ ⇒ π̄_δ{θ′ ∈ Θ : ρ U_δ(θ′) > y} ≤ ᾱ.

Moreover,

y < ρ y_ᾱ ⇒ π̄_δ{θ′ ∈ Θ : ρ U_δ(θ′) ≤ y} < 1 − ᾱ ⇒ π̄_δ{θ′ ∈ Θ : ρ U_δ(θ′) > y} > ᾱ,

and taking the contrapositive one obtains π̄_δ{θ′ ∈ Θ : ρ U_δ(θ′) > y} ≤ ᾱ ⇒ y ≥ ρ y_ᾱ. Therefore {θ ∈ Θ : U_δ(θ) ≥ ρ y_ᾱ} ≡ {θ ∈ Θ : π̄_δ{θ′ ∈ Θ : ρ U_δ(θ′) > U_δ(θ)} ≤ ᾱ}.

We now derive a lower bound on π^(J){θ ∈ Θ : U_δ(θ) ≥ ρ y_ᾱ}. Let us introduce the notation A_ᾱ := {θ ∈ Θ : U_δ(θ) < y_ᾱ}, Ā_ᾱ := {θ ∈ Θ : U_δ(θ) ≥ y_ᾱ}, B_ᾱ,ρ := {θ ∈ Θ : U_δ(θ) < ρ y_ᾱ} and B̄_ᾱ,ρ := {θ ∈ Θ : U_δ(θ) ≥ ρ y_ᾱ}. Notice that B_ᾱ,ρ ⊆ A_ᾱ and Ā_ᾱ ⊆ B̄_ᾱ,ρ. The quantity π̄_δ{θ ∈ Θ : U_δ(θ) < y} as a function of y is the left-continuous version of π̄_δ{θ ∈ Θ : U_δ(θ) ≤ y} [30, p. 162]. Hence, the definition of y_ᾱ implies π̄_δ(A_ᾱ) ≤ 1 − ᾱ and π̄_δ(Ā_ᾱ) ≥ ᾱ. Notice that

π̄_δ(A_ᾱ) ≤ 1 − ᾱ ⇒ δ π_Leb(A_ᾱ) / ∫_Θ U_δ(θ) π_Leb(dθ) ≤ 1 − ᾱ,
π̄_δ(Ā_ᾱ) ≥ ᾱ ⇒ (1 + δ) π_Leb(Ā_ᾱ) / ∫_Θ U_δ(θ) π_Leb(dθ) ≥ ᾱ.

Hence, π_Leb(Ā_ᾱ) > 0 and

π_Leb(A_ᾱ) / π_Leb(Ā_ᾱ) ≤ ((1 − ᾱ)/ᾱ) · ((1 + δ)/δ).

Notice that π_Leb(Ā_ᾱ) > 0 implies π_Leb(B̄_ᾱ,ρ) > 0. We obtain

π^(J){θ ∈ Θ : U_δ(θ) ≥ ρ y_ᾱ}
= [1 + ∫_{B_ᾱ,ρ} U_δ(θ)^J π_Leb(dθ) / ∫_{B̄_ᾱ,ρ} U_δ(θ)^J π_Leb(dθ)]^(−1)
≥ [1 + (ρ y_ᾱ)^J π_Leb(A_ᾱ) / (y_ᾱ^J π_Leb(Ā_ᾱ))]^(−1)
= [1 + ρ^J π_Leb(A_ᾱ)/π_Leb(Ā_ᾱ)]^(−1)
≥ [1 + ρ^J ((1 − ᾱ)/ᾱ)((1 + δ)/δ)]^(−1).

Since {θ ∈ Θ : U_δ(θ) ≥ ρ y_ᾱ} ≡ {θ ∈ Θ : π̄_δ{θ′ ∈ Θ : ρ U_δ(θ′) > U_δ(θ)} ≤ ᾱ} the first part of the proof is complete.

In the second part of the proof we show that the set {θ ∈ Θ : π̄_δ{θ′ ∈ Θ : ρ U_δ(θ′) > U_δ(θ)} ≤ ᾱ} is contained in the set of approximate global optimizers of U with value imprecision ǫ̃ := (ρ^(−1) − 1)(1 + δ) and residual domain α̃ := ((1 + δ)/(ǫ̃ + δ)) ᾱ. Hence, we show that

{θ ∈ Θ : π̄_δ{θ′ ∈ Θ : ρ U_δ(θ′) > U_δ(θ)} ≤ ᾱ} ⊆ {θ ∈ Θ : π_Leb{θ′ ∈ Θ : U(θ′) > U(θ) + ǫ̃} ≤ α̃ π_Leb(Θ)}.

We have

U(θ′) > U(θ) + ǫ̃ ⇔ ρ U_δ(θ′) > ρ [U_δ(θ) + ǫ̃] ⇒ ρ U_δ(θ′) > U_δ(θ),

where the implication is proven by noticing that ρ [U_δ(θ) + ǫ̃] ≥ U_δ(θ) ⇔ 1 − ρ ≥ U(θ)(1 − ρ), which holds because U(θ) ∈ [0, 1]. Hence

{θ′ ∈ Θ : ρ U_δ(θ′) > U_δ(θ)} ⊇ {θ′ ∈ Θ : U(θ′) > U(θ) + ǫ̃}.

Therefore π̄_δ{θ′ ∈ Θ : ρ U_δ(θ′) > U_δ(θ)} ≤ ᾱ ⇒ π̄_δ{θ′ ∈ Θ : U(θ′) > U(θ) + ǫ̃} ≤ ᾱ. Let Q_{θ,ǫ̃} := {θ′ ∈ Θ : U(θ′) > U(θ) + ǫ̃} and notice that

π̄_δ{θ′ ∈ Θ : U(θ′) > U(θ) + ǫ̃} = (∫_{Q_{θ,ǫ̃}} U(θ′) π_Leb(dθ′) + δ π_Leb(Q_{θ,ǫ̃})) / (∫_Θ U(θ′) π_Leb(dθ′) + δ π_Leb(Θ)).

We obtain

π̄_δ{θ′ ∈ Θ : U(θ′) > U(θ) + ǫ̃} ≤ ᾱ ⇒ (ǫ̃ + δ) π_Leb(Q_{θ,ǫ̃}) ≤ ᾱ (1 + δ) π_Leb(Θ) ⇒ π_Leb{θ′ ∈ Θ : U(θ′) > U(θ) + ǫ̃} ≤ α̃ π_Leb(Θ).

Hence we can conclude that π̄_δ{θ′ ∈ Θ : ρ U_δ(θ′) > U_δ(θ)} ≤ ᾱ ⇒ π_Leb{θ′ ∈ Θ : U(θ′) > U(θ) + ǫ̃} ≤ α̃ π_Leb(Θ) and the second part of the proof is complete.

We have shown that, given ᾱ ∈ (0, 1] and ρ ∈ (0, 1], with ǫ̃ := (ρ^(−1) − 1)(1 + δ), α̃ := ((1 + δ)/(ǫ̃ + δ)) ᾱ and

σ := [1 + ρ^J ((1 − ᾱ)/ᾱ)((1 + δ)/δ)]^(−1) = [1 + ((1 + δ)/(ǫ̃ + 1 + δ))^J ((1/α̃)((1 + δ)/(ǫ̃ + δ)) − 1)((1 + δ)/δ)]^(−1),

the statement “θ is an approximate global optimizer of U with value imprecision ǫ̃ and residual domain α̃” holds with probability at least σ. Notice that ǫ̃ ∈ [0, 1] and α̃ ∈ (0, 1] are linked through a bijective relation to ρ ∈ [(1 + δ)/(2 + δ), 1] and ᾱ ∈ (0, (ǫ̃ + δ)/(1 + δ)]. The statement of the theorem is eventually obtained by expressing σ as a function of the desired ǫ = ǫ̃ and α = α̃.

References

[1] D. J. Wales. Energy Landscapes. Cambridge University Press, Cambridge, UK, 2003.
[2] D. Achlioptas, A. Naor, and Y. Peres. Rigorous location of phase transitions in hard optimization problems. Nature, 435:759–764, 2005.
[3] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by Simulated Annealing. Science, 220(4598):671–680, 1983.
[4] E. Bonomi and J. Lutton. The N-city travelling salesman problem: statistical mechanics and the Metropolis algorithm. SIAM Rev., 26(4):551–568, 1984.
[5] Y. Fu and P. W. Anderson. Application of statistical mechanics to NP-complete problems in combinatorial optimization. J. Phys. A: Math. Gen., 19(9):1605–1620, 1986.
[6] M. Mézard, G. Parisi, and R. Zecchina. Analytic and Algorithmic Solution of Random Satisfiability Problems. Science, 297:812–815, 2002.
[7] P. M. J. van Laarhoven and E. H. L. Aarts. Simulated Annealing: Theory and Applications. D. Reidel Publishing Company, Dordrecht, Holland, 1987.
[8] D. Mitra, F. Romeo, and A. Sangiovanni-Vincentelli. Convergence and finite-time behavior of simulated annealing. Adv. Appl. Prob., 18:747–771, 1986.
[9] B. Hajek. Cooling schedules for optimal annealing. Math. Oper. Res., 13:311–329, 1988.
[10] J. Hannig, E. K. P. Chong, and S. R. Kulkarni. Relative Frequencies of Generalized Simulated Annealing. Math. Oper. Res., 31(1):199–216, 2006.
[11] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, UK, 2004.
[12] H. Haario and E. Saksman. Simulated annealing process in general state space. Adv. Appl. Prob., 23:866–893, 1991.
[13] S. B. Gelfand and S. K. Mitter. Simulated Annealing Type Algorithms for Multivariate Optimization. Algorithmica, 6:419–436, 1991.
[14] C. Tsallis and D. A. Stariolo. Generalized simulated annealing. Physica A, 233:395–406, 1996.
[15] M. Locatelli. Simulated Annealing Algorithms for Continuous Global Optimization: Convergence Conditions. J. Optimiz. Theory App., 104(1):121–133, 2000.
[16] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, US, 1995.
[17] M. Vidyasagar. Learning and Generalization: With Application to Neural Networks. Springer-Verlag, London, second edition, 2003.
[18] S. P. Meyn and R. L. Tweedie. Markov Chains and Stochastic Stability. Springer-Verlag, London, 1993.
[19] J. S. Rosenthal. Minorization Conditions and Convergence Rates for Markov Chain Monte Carlo. J. Am. Stat. Assoc., 90(430):558–566, 1995.
[20] K. L. Mengersen and R. L. Tweedie. Rates of convergence of the Hastings and Metropolis algorithms. Ann. Stat., 24(1):101–121, 1996.
[21] G. O. Roberts and J. S. Rosenthal. General state space Markov chains and MCMC algorithms. Prob. Surv., 1:20–71, 2004.
[22] C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer-Verlag, New York, second edition, 2004.
[23] D. J. Spiegelhalter, K. R. Abrams, and J. P. Myles. Bayesian approaches to clinical trials and health-care evaluation. John Wiley & Sons, Chichester, UK, 2004.
[24] A. Lecchini-Visintini, W. Glover, J. Lygeros, and J. M. Maciejowski. Monte Carlo Optimization for Conflict Resolution in Air Traffic Control. IEEE Trans. Intell. Transp. Syst., 7(4):470–482, 2006.
[25] P. Müller. Simulation based optimal design. In J. O. Berger, J. M. Bernardo, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics 6: proceedings of the Sixth Valencia International Meeting, pages 459–474. Oxford: Clarendon Press, 1999.
[26] P. Müller, B. Sansó, and M. De Iorio. Optimal Bayesian design by Inhomogeneous Markov Chain Simulation. J. Am. Stat. Assoc., 99(467):788–798, 2004.
[27] L. Blum, F. Cucker, M. Shub, and S. Smale. Complexity and Real Computation. Springer-Verlag, New York, 1998.
[28] M. Vidyasagar. Randomized algorithms for robust controller synthesis using statistical learning theory. Automatica, 37(10):1515–1528, 2001.
[29] R. Tempo, G. Calafiore, and F. Dabbene. Randomized Algorithms for Analysis and Control of Uncertain Systems. Springer-Verlag, London, 2005.
[30] B. V. Gnedenko. Theory of Probability. Chelsea, New York, fourth edition, 1968.

3 0.50976169 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC

Author: Matthew Hoffman, Arnaud Doucet, Nando D. Freitas, Ajay Jasra

Abstract: A recently proposed formulation of the stochastic planning and control problem as one of parameter estimation for suitable artificial statistical models has led to the adoption of inference algorithms for this notoriously hard problem. At the algorithmic level, the focus has been on developing Expectation-Maximization (EM) algorithms. In this paper, we begin by making the crucial observation that the stochastic control problem can be reinterpreted as one of trans-dimensional inference. With this new interpretation, we are able to propose a novel reversible jump Markov chain Monte Carlo (MCMC) algorithm that is more efficient than its EM counterparts. Moreover, it enables us to implement full Bayesian policy search, without the need for gradients and with one single Markov chain. The new approach involves sampling directly from a distribution that is proportional to the reward and, consequently, performs better than classic simulation methods in situations where the reward is a rare event.

4 0.49807325 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods

Author: Alessandro Lazaric, Marcello Restelli, Andrea Bonarini

Abstract: Learning in real-world domains often requires dealing with continuous state and action spaces. Although many solutions have been proposed to apply Reinforcement Learning algorithms to continuous state problems, the same techniques can hardly be extended to continuous action spaces, where, besides the computation of a good approximation of the value function, a fast method for the identification of the highest-valued action is needed. In this paper, we propose a novel actor-critic approach in which the policy of the actor is estimated through sequential Monte Carlo methods. The importance sampling step is performed on the basis of the values learned by the critic, while the resampling step modifies the actor’s policy. The proposed approach has been empirically compared to other learning algorithms in several domains; in this paper, we report results obtained in a control problem consisting of steering a boat across a river.
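A minimal sketch of the particle-based policy idea described above; the bandit-style setting, the stand-in critic values, and the resampling scheme are illustrative assumptions, not the paper's algorithm:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative continuous-action problem: the (unknown) best action is 1.5.
def critic_value(a):
    return np.exp(-(a - 1.5) ** 2)   # stand-in for the critic's learned values

# Actor's policy as a set of action particles, initially spread over [-3, 3].
particles = rng.uniform(-3, 3, size=500)

for _ in range(20):
    # Importance sampling step: weight each action by the critic's value.
    weights = critic_value(particles)
    weights /= weights.sum()
    # Resampling step: concentrate the particle set on high-valued actions.
    particles = rng.choice(particles, size=particles.size, p=weights)
    # Small perturbation so the particle set keeps exploring nearby actions.
    particles += rng.normal(0.0, 0.05, size=particles.size)

print(particles.mean())   # close to the highest-valued action, 1.5
```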

5 0.49213964 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data

Author: Michael Ross, Andrew Cohen

Abstract: This paper describes a new model for human visual classification that enables the recovery of image features that explain human subjects’ performance on different visual classification tasks. Unlike previous methods, this algorithm does not model their performance with a single linear classifier operating on raw image pixels. Instead, it represents classification as the combination of multiple feature detectors. This approach extracts more information about human visual classification than previous methods and provides a foundation for further exploration. 1

6 0.49184155 86 nips-2007-Exponential Family Predictive Representations of State

7 0.48926163 100 nips-2007-Hippocampal Contributions to Control: The Third Way

8 0.48727503 63 nips-2007-Convex Relaxations of Latent Variable Training

9 0.48568147 153 nips-2007-People Tracking with the Laplacian Eigenmaps Latent Variable Model

10 0.48566699 213 nips-2007-Variational Inference for Diffusion Processes

11 0.48497489 18 nips-2007-A probabilistic model for generating realistic lip movements from speech

12 0.48404941 122 nips-2007-Locality and low-dimensions in the prediction of natural experience from fMRI

13 0.48351672 125 nips-2007-Markov Chain Monte Carlo with People

14 0.48098645 43 nips-2007-Catching Change-points with Lasso

15 0.47979051 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

16 0.47802475 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images

17 0.47774059 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning

18 0.47701064 174 nips-2007-Selecting Observations against Adversarial Objectives

19 0.47668236 158 nips-2007-Probabilistic Matrix Factorization

20 0.47621039 69 nips-2007-Discriminative Batch Mode Active Learning