nips nips2007 nips2007-119 knowledge-graph by maker-knowledge-mining

119 nips-2007-Learning with Tree-Averaged Densities and Distributions


Source: pdf

Author: Sergey Kirshner

Abstract: We utilize the ensemble of trees framework, a tractable mixture over a super-exponential number of tree-structured distributions [1], to develop a new model for multivariate density estimation. The model is based on a construction of tree-structured copulas – multivariate distributions with uniform marginals on [0, 1]. By averaging over all possible tree structures, the new model can approximate distributions with complex variable dependencies. We propose an EM algorithm to estimate the parameters for these tree-averaged models for both the real-valued and the categorical case. Based on the tree-averaged framework, we propose a new model for joint precipitation amounts data on networks of rain stations. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We utilize the ensemble of trees framework, a tractable mixture over a super-exponential number of tree-structured distributions [1], to develop a new model for multivariate density estimation. [sent-3, score-0.252]

2 The model is based on a construction of tree-structured copulas – multivariate distributions with uniform marginals on [0, 1]. [sent-4, score-0.479]

3 By averaging over all possible tree structures, the new model can approximate distributions with complex variable dependencies. [sent-5, score-0.156]

4 We propose an EM algorithm to estimate the parameters for these tree-averaged models for both the real-valued and the categorical case. [sent-6, score-0.064]

5 Based on the tree-averaged framework, we propose a new model for joint precipitation amounts data on networks of rain stations. [sent-7, score-0.208]

6 1 Introduction Multivariate real-valued data appears in many real-world data sets, and much research focuses on the development of multivariate real-valued distributions. [sent-8, score-0.083]

7 One of the challenges in constructing such distributions is that univariate continuous distributions commonly do not have a clear multivariate generalization. [sent-9, score-0.314]

8 The most studied exception is the multivariate Gaussian distribution, owing to properties such as a closed-form density expression with a convenient generalization to higher dimensions and closure under linear projections. [sent-10, score-0.135]

9 While modeling multivariate distributions is in general difficult due to complicated functional forms and the curse of dimensionality, learning models for individual variables (univariate marginals) is often straightforward. [sent-14, score-0.172]

10 Once the univariate marginals are known (or assumed known), the rest can be modeled using copulas, multivariate distributions with all univariate marginals equal to uniform distributions on [0, 1] (e. [sent-15, score-0.693]

11 A large portion of copula research has concentrated on bivariate copulas, as extensions to higher dimensions are often difficult. [sent-18, score-1.031]

12 Thus if the desired distribution decomposes into its univariate marginals and only bivariate distributions, the machinery of copulas can be effectively utilized. [sent-19, score-0.816]

13 , [4]) have exactly these properties, as probability density functions over the variables with tree-structured conditional independence graphs can be written as a product involving univariate marginals and bivariate marginals corresponding to the edges of the tree. [sent-22, score-0.689]

14 In this paper, we extend this tree-averaged model to continuous variables with the help of copulas and derive a learning algorithm to estimate the parameters within the maximum likelihood framework with EM [6]. [sent-24, score-0.371]

15 In the process, we introduce a previously unexplored tree-structured copula density and an algorithm for estimating its structure and parameters. [sent-29, score-0.570]

16 We then construct multivariate copulas with tree-structured dependence from bivariate copulas (Section 3. [sent-32, score-1.029]

17 1) and show how to estimate the parameters of the bivariate copulas and perform the edge selection. [sent-33, score-0.588]

18 1); we also develop a new model for multi-site precipitation amounts, a problem involving both binary (rain/no rain) and continuous (how much rain) variables (Section 5. [sent-36, score-0.143]

19 Let Fv (xv ) = F (Xv = xv , Xu = ∞ : u ∈ V \ {v}) denote a univariate marginal of F over the variable Xv . [sent-46, score-0.421]

20 Let pv (xv ) denote the probability density function (pdf) of Xv . [sent-47, score-0.168]

21 , ad ), so a is a vector of quantiles of components of x with respect to corresponding univariate marginals. [sent-51, score-0.205]

22 Next, we define the copula, a multivariate distribution over vectors of quantiles. [sent-52, score-0.102]

23 The copula associated with F is a distribution function $C : [0,1]^d \to [0,1]$ that satisfies $F(\mathbf{x}) = C\left(F_1(x_1), \ldots, F_d(x_d)\right)$ (1). [sent-54, score-0.497]

24 If F is a continuous distribution on $\mathbb{R}^d$ with univariate marginals $F_1, \ldots, F_d$, then the copula C in (1) is unique. [sent-58, score-0.267]

25 Assuming that F has d-th order partial derivatives, the probability density function (pdf) can be obtained from the distribution function via differentiation and expressed in terms of a derivative of the copula: $p(\mathbf{x}) = \frac{\partial^d F(\mathbf{x})}{\partial x_1 \cdots \partial x_d} = c(\mathbf{a}) \prod_{v \in V} p_v(x_v)$ (2). [sent-65, score-0.052]

26 Here $c(\mathbf{a}) = \frac{\partial^d C(\mathbf{a})}{\partial a_1 \cdots \partial a_d}$ is referred to as the copula density function. [sent-77, score-0.665]

27 Let $D = \mathbf{x}^1, \ldots, \mathbf{x}^N$ be a data set of d-component real-valued vectors $\mathbf{x}^n = \left(x_1^n, \ldots, x_d^n\right)$. [sent-81, score-0.093]

28 A maximum likelihood (ML) estimate for the parameters of c (or p) from data can be obtained by maximizing the log-likelihood of D: $\ln p(D) = \sum_{v \in V} \sum_{n=1}^{N} \ln p_v\left(x_v^n\right) + \sum_{n=1}^{N} \ln c\left(F_1(x_1^n), \ldots, F_d(x_d^n)\right)$ (3). [sent-88, score-0.461]

29 The first term of the log-likelihood corresponds to the total log-likelihood of all univariate marginals of p, and the second term to the log-likelihood of its d-variate copula. [sent-92, score-0.267]

30 However, a useful (and asymptotically consistent) heuristic is first to maximize the log-likelihood for the marginals (first term only), and then to estimate the parameters for the copula given the solution for the marginals. [sent-94, score-0.628]

31 The univariate marginals can be accurately estimated by either fitting the parameters of some appropriately chosen univariate distributions or by applying non-parametric methods, as the marginals are estimated independently of each other and do not suffer from the curse of dimensionality. [sent-95, score-0.626]

32 Let $\hat{p}_v(x_v)$ be the estimated pdf for component v, and $\hat{F}_v$ be the corresponding cdf. [sent-96, score-0.159]

33 Under the above heuristic, the ML estimate for the copula density c is computed by maximizing $\ln c(A) = \sum_{n=1}^{N} \ln c(\mathbf{a}^n)$ over the transformed data $\mathbf{a}^n = \left(\hat{F}_1(x_1^n), \ldots, \hat{F}_d(x_d^n)\right)$. [sent-107, score-0.773]
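
A minimal sketch of this two-stage heuristic, under the assumption that empirical CDFs stand in for the fitted marginals (the MAGIC experiment below estimates marginals with Gaussian KDEs instead): each observation is mapped to its vector of quantiles, which then becomes the input to copula estimation.

import numpy as np

def pseudo_observations(X):
    """Map an (N, d) data matrix to quantiles a[n, v] = F_hat_v(x[n, v]) in (0, 1)."""
    N, d = X.shape
    ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1.0  # per-column ranks 1..N
    return ranks / (N + 1.0)  # rescale so the quantiles avoid the endpoints 0 and 1

# Usage: A = pseudo_observations(X); the copula log-likelihood sum_n ln c(A[n])
# is then maximized over the copula parameters only.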

34 3 Exploiting Tree-Structured Dependence Joint probability distributions are often modeled with probabilistic graphical models where the structure of the graph captures the conditional independence relations of the variables. [sent-108, score-0.038]

35 We would like to keep the number of variables for each of the functions small, as the number of parameters and the number of points needed for parameter estimation often grow exponentially with the number of variables. [sent-110, score-0.056]

36 They can also be placed in a fully Bayesian framework with decomposable priors, allowing one to compute expected values (over all possible spanning trees) of products of functions defined on the edges of the trees [1]. [sent-113, score-0.213]

37 As we will see later in this section, under tree-structured dependence, a copula density can be computed as a product of bivariate copula densities over the edges of the graph. [sent-114, score-1.322]

38 This property allows us to estimate the parameters for the edge copulas independently. [sent-115, score-0.39]

39 For a distribution F admitting tree-structured Markov networks (referred to from now on as tree-structured distributions), assuming that $p(\mathbf{x}) > 0$ and $p(\mathbf{x}) < \infty$ for $\mathbf{x} \in R \subseteq \mathbb{R}^d$, the density (for $\mathbf{x} \in R$) can be rewritten as $p(\mathbf{x}) = \prod_{v \in V} p_v(x_v) \prod_{\{u,v\} \in E} \frac{p_{uv}(x_u, x_v)}{p_u(x_u)\, p_v(x_v)}$ (4). [sent-120, score-0.478]

40 This formulation easily follows from the Hammersley-Clifford theorem [11]. [sent-121, score-0.149]

41 Note that for {u, v} ∈ E, a copula density $c_{uv}(a_u, a_v)$ for $F(x_u, x_v)$ can be computed using Equation 2: $c_{uv}(a_u, a_v) = \frac{p_{uv}(x_u, x_v)}{p_u(x_u)\, p_v(x_v)}$ (5). [sent-122, score-1.485]

42 Using Equations 2, 4, and 5, $c_p(\mathbf{a})$ for $F(\mathbf{x})$ can be computed as $c_p(\mathbf{a}) = \frac{p(\mathbf{x})}{\prod_{v \in V} p_v(x_v)} = \prod_{\{u,v\} \in E} \frac{p_{uv}(x_u, x_v)}{p_u(x_u)\, p_v(x_v)} = \prod_{\{u,v\} \in E} c_{uv}(a_u, a_v)$ (6). [sent-123, score-1.312]

43 Equation 6 states that a copula density for a tree-structured distribution decomposes as a product of bivariate copulas over its edges. [sent-124, score-1.115]
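
A short sketch of Equation 6 in code: the log-density of a tree-structured copula is simply the sum of bivariate copula log-densities over the edges of the tree. The helper log_c_uv is an assumed callable returning $\ln c_{uv}(a_u, a_v)$ for whichever parametric family is chosen for edge {u, v}.

def tree_copula_logpdf(a, edges, log_c_uv):
    """a: length-d vector of quantiles; edges: list of (u, v) pairs forming a tree or forest."""
    return sum(log_c_uv(u, v, a[u], a[v]) for (u, v) in edges)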

44 The converse is true as well; a tree-structured copula can be constructed by specifying copulas for the edges of the tree. [sent-125, score-0.86]

45 Given a tree or a forest G = (V, E) and copula densities $c_{uv}(a_u, a_v)$ for {u, v} ∈ E, $c_E(\mathbf{a}) = \prod_{\{u,v\} \in E} c_{uv}(a_u, a_v)$ is a valid copula density. [sent-127, score-1.847]

46 The parameters can be fitted by maximizing $\sum_{n=1}^{N} \ln c_{uv}\left(a_u^n, a_v^n\right)$ independently for different pairs {u, v} ∈ E. [sent-129, score-0.388]

47 The tree structure can be learned from the data as well, as in the Chow-Liu algorithm [4]. [sent-130, score-0.082]
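
A hedged sketch of Chow-Liu-style edge selection: fit each candidate bivariate copula independently, score every pair by its maximized log-likelihood, and keep a maximum-weight spanning tree. Using the fitted copula log-likelihood as the edge score is an illustrative assumption in the spirit of the Chow-Liu algorithm (which scores edges by mutual information), not a detail taken from the text above.

import numpy as np

def max_spanning_tree(weights):
    """Prim's algorithm on a dense symmetric (d, d) score matrix; returns a list of edges."""
    d = weights.shape[0]
    in_tree, remaining, edges = [0], set(range(1, d)), []
    while remaining:
        u, v = max(((i, j) for i in in_tree for j in remaining), key=lambda e: weights[e])
        edges.append((u, v))
        in_tree.append(v)
        remaining.remove(v)
    return edges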

48 While the approach of Section 3.1 is computationally efficient and convenient for implementation, the imposed tree-structured dependence is too restrictive for real-world problems. [sent-133, score-0.076]

49 Vines [7], for example, deal with this problem by allowing recursive refinements for the bivariate probabilities over variables not connected by the tree edges. [sent-134, score-0.312]

50 However, vines require estimation of additional characteristics of the distribution (e. [sent-135, score-0.079]

51 Our proposed method would only require optimization of parameters of bivariate copulas from the corresponding two components of weighted data vectors. [sent-138, score-0.573]

52 Using the Bayesian framework for spanning trees from [1], it is possible to construct an object constituting a convex combination over all possible spanning trees allowing a much richer set of conditional independencies than a single tree. [sent-139, score-0.276]

53 Meilă and Jaakkola [1] proposed a decomposable prior over all possible spanning tree structures. [sent-140, score-0.174]

54 Let β be a symmetric matrix of non-negative weights for all pairs of distinct variables and zeros on the diagonal. [sent-141, score-0.044]

55 Let E be the set of all possible spanning trees over V. [sent-142, score-0.13]

56 The probability distribution over all spanning tree structures over V is defined as $P(E \in \mathcal{E} \mid \beta) = \frac{1}{Z} \prod_{\{u,v\} \in E} \beta_{uv}$, where $Z = \sum_{E \in \mathcal{E}} \prod_{\{u,v\} \in E} \beta_{uv}$ (7). [sent-143, score-0.173]
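
The normalization constant Z in Equation 7 need not be computed by enumerating trees: by the weighted matrix tree theorem it equals the determinant of any first minor of the graph Laplacian built from β. A minimal sketch:

import numpy as np

def laplacian_minor(w):
    """First minor of the weighted graph Laplacian of a symmetric (d, d) weight matrix w."""
    L = np.diag(w.sum(axis=1)) - w
    return L[1:, 1:]  # delete the row and column of one (arbitrary) vertex

def tree_partition_function(beta):
    return np.linalg.det(laplacian_minor(beta))  # sum over spanning trees of the product of beta_uv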

57 Let P (E) be a distribution over spanning tree structures defined by (7). [sent-148, score-0.173]

58 The decomposability property of the tree prior (Equation 7) allows us to compute the average of the tree-structured distributions over all $d^{d-2}$ tree structures. [sent-151, score-0.202]

59 In [1], such averaging was applied to tree-structured distributions over categorical variables. [sent-152, score-0.119]

60 Similarly, we define a tree-averaged copula density as a convex combination of copula densities of the form (6): $r(\mathbf{a}) = \sum_{E \in \mathcal{E}} P(E \mid \beta)\, c_E(\mathbf{a}) = \frac{1}{Z} \sum_{E \in \mathcal{E}} \prod_{\{u,v\} \in E} \beta_{uv}\, c_{uv}(a_u, a_v) = \frac{|L(\beta c(\mathbf{a}))|}{|L(\beta)|}$, where entry (u, v) of the matrix $\beta c(\mathbf{a})$ denotes $\beta_{uv}\, c_{uv}(a_u, a_v)$. [sent-153, score-1.817]
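
A sketch of how the super-exponential sum collapses in practice: with the weighted matrix tree theorem, r(a) is a ratio of two determinants of reduced Laplacians. Here laplacian_minor is the helper sketched after Equation 7, and c_matrix holds the bivariate copula densities c_uv(a_u, a_v) evaluated at a single observation.

import numpy as np

def tree_averaged_copula_density(beta, c_matrix):
    num = np.linalg.det(laplacian_minor(beta * c_matrix))  # |L(beta c(a))|
    den = np.linalg.det(laplacian_minor(beta))             # |L(beta)| = Z
    return num / den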

61 A finite convex combination of copulas is a copula, so r (a) is a copula density. [sent-154, score-0.833]

62 Instead, noticing that we are dealing with a mixture model (granted, one where the number of mixture components is super-exponential), we propose performing the parameter optimization with the EM algorithm [6]. [sent-158, score-0.056]

63 The possibility of an EM algorithm for ensemble-of-trees with categorical data was mentioned in [1], but the idea was abandoned due to concerns about the M-step. [sent-159, score-0.045]

64 4 Algorithm TreeAveragedCopulaDensity(D, c). Inputs: a complete data set D of d-component real-valued vectors; a set of bivariate parametric copula densities c = {c_uv : u, v ∈ V}. [sent-160, score-0.746]

65 1. Estimate univariate margins $\hat{F}_v(X_v)$ for all components v ∈ V, treating all components independently. [sent-161, score-0.222]

66 The probability distribution $P(E_n \mid \mathbf{a}^n, \beta, \theta)$ is of the same form as the tree prior, so to compute $s_n(\{u, v\})$ one needs to compute the sum of the probabilities of all trees containing edge {u, v}. [sent-171, score-0.219]

67 Let P (E|β) be a tree prior defined in Equation 7. [sent-173, score-0.082]

68 As a consequence of Theorem 3, for each $\mathbf{a}^n$, all $d(d-1)/2$ edge probabilities $s_n(\{u, v\})$ can be computed simultaneously with the time complexity of a single $(d-1) \times (d-1)$ matrix inversion, $O(d^3)$. [sent-176, score-0.076]
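
A hedged sketch of this E-step computation: with w[u, v] = beta_uv * c_uv(a_u^n, a_v^n), all edge probabilities can be read off one inverse of the reduced Laplacian. The indexing below (deleting vertex 0 and adjusting when an edge touches it) follows the standard weighted matrix tree theorem derivation and should be treated as an illustration rather than the paper's exact statement of Theorem 3.

import numpy as np

def edge_probabilities(w):
    """w: symmetric (d, d) matrix of non-negative edge weights, zero diagonal."""
    d = w.shape[0]
    L = np.diag(w.sum(axis=1)) - w
    Minv = np.linalg.inv(L[1:, 1:])  # one (d-1) x (d-1) inversion, O(d^3)
    s = np.zeros((d, d))
    for u in range(1, d):
        s[0, u] = s[u, 0] = w[0, u] * Minv[u - 1, u - 1]
        for v in range(u + 1, d):
            val = w[u, v] * (Minv[u - 1, u - 1] + Minv[v - 1, v - 1] - 2.0 * Minv[u - 1, v - 1])
            s[u, v] = s[v, u] = val
    return s  # s[u, v] = probability that edge {u, v} is in the random spanning tree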

69 Assuming a candidate bivariate copula $c_{uv}$ has one free parameter $\theta_{uv}$, $\theta_{uv}$ can be optimized by setting $\frac{\partial M(\beta', \theta'; \beta, \theta)}{\partial \theta_{uv}} = \sum_{n=1}^{N} s_n(\{u, v\})\, \frac{\partial \ln c_{uv}(a_u^n, a_v^n; \theta_{uv})}{\partial \theta_{uv}}$ (9) to 0. [sent-177, score-1.276]
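
A sketch of how one such M-step update could be carried out numerically, maximizing the weighted log-likelihood whose stationary condition is Equation 9; minimize_scalar and the (-0.99, 0.99) bounds are illustrative choices for a single-parameter family such as a correlation-parameterized copula.

import numpy as np
from scipy.optimize import minimize_scalar

def update_theta(a_u, a_v, s_uv, log_c, bounds=(-0.99, 0.99)):
    """a_u, a_v: length-N quantiles for one edge; s_uv: E-step weights s_n({u, v})."""
    def neg_weighted_ll(theta):
        return -np.sum(s_uv * log_c(a_u, a_v, theta))
    return minimize_scalar(neg_weighted_ll, bounds=bounds, method='bounded').x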

70 The parameters of the tree prior can be updated by maximizing $\frac{1}{N} \sum_{\{u,v\}} \sum_{n=1}^{N} s_n(\{u, v\}) \ln \beta_{uv} - \ln |L(\beta)|$, an expression concave in $\ln \beta_{uv}$ for all {u, v}. [sent-179, score-0.487]

71 β can be updated using a gradient ascent algorithm on $\ln \beta_{uv}$ for all {u, v}, with time complexity $O(d^3)$ per iteration. [sent-180, score-0.121]

72 Assuming the complexity of each bivariate copula update is $O(N)$, the time complexity of each EM iteration is $O(N d^3)$. [sent-182, score-0.695]
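
A schematic of how the pieces above fit into one EM iteration (a sketch only: edge_probabilities and update_theta are the illustrative helpers from the earlier sketches, c(a, theta) is assumed to return the (d, d) matrix of bivariate copula densities at one observation, and the beta update is omitted).

import numpy as np

def em_iteration(A, beta, theta, c, log_c):
    N, d = A.shape
    # E-step: per-observation edge posteriors, O(N d^3) in total
    S = [edge_probabilities(beta * c(A[n], theta)) for n in range(N)]
    # M-step: re-fit each edge parameter against its weighted data
    for u in range(d):
        for v in range(u + 1, d):
            s_uv = np.array([S[n][u, v] for n in range(N)])
            theta[u, v] = theta[v, u] = update_theta(A[:, u], A[:, v], s_uv, log_c)
    return theta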

73 The EM algorithm can be easily transferred to tree averaging for categorical data. [sent-183, score-0.163]

74 The E-step does not change, and in the M-step, the parameters for the univariate marginals are updated ignoring bivariate terms. [sent-184, score-0.503]

75 Then, the parameters for the bivariate distributions for each edge are updated given the new values of the parameters for the univariate distributions. [sent-185, score-0.483]

76 5 Experiments. 5.1 MAGIC Gamma Telescope Data Set. First, we tested our tree-averaged density estimator on the MAGIC Gamma Telescope Data Set from the UCI Machine Learning Repository [13]. [sent-188, score-0.052]

77 We considered only the examples from class gamma (signal); this set consists of 12332 vectors of d = 10 real-valued components. [sent-189, score-0.039]

78 The univariate marginals are not Gaussian (some are bounded; some have multiple modes). [sent-190, score-0.267]

79 The marginals were estimated using Gaussian kernel density estimators (KDE) with Rule-of-Thumb bandwidth selection. [sent-193, score-0.164]

80 All of the models except for the full Gaussian have the same marginals and differ only in the multivariate dependence (copula). [sent-194, score-0.159]

81 As expected from the curse of dimensionality, product KDE improves logarithmically with the amount of data. [sent-195, score-0.052]

82 Not only are the marginals not Gaussian (evidenced by a Gaussian copula with KDE marginals outperforming a Gaussian distribution), but the multivariate dependence is also not Gaussian, as evidenced by a tree-structured Frank copula outperforming both a tree-structured and a full Gaussian copula. [sent-196, score-1.44]

83 However, model averaging even with the wrong dependence model (tree-averaged Gaussian copula) yields superior performance. [sent-197, score-0.112]

84 5.2 Multi-Site Precipitation Modeling. We applied the tree-averaged framework to the problem of modeling daily rainfall amounts for a regional spatial network of stations. [sent-199, score-0.139]

85 , [14]) with the transition distribution responsible for modeling of temporal dependence, and the emission distributions capturing most of the spatial dependence. [sent-206, score-0.091]

86 We will use HMMs as the wrapper model with tree-averaged (and tree-structured) distributions to model the emission components. [sent-208, score-0.074]

87 The distribution of daily rainfall amounts for any given station can be viewed as a non-overlapping mixture with one component corresponding to zero precipitation, and the other component to positive precipitation. [sent-209, score-0.198]

88 For a station v, let $r_v$ be the precipitation amount, $\pi_v$ be a probability of positive precipitation, and let $f_v(r_v \mid \lambda_v)$ be a probability density function for amounts given positive precipitation: $p(r_v \mid \pi_v, \lambda_v) = \begin{cases} 1 - \pi_v & r_v = 0, \\ \pi_v\, f_v(r_v \mid \lambda_v) & r_v > 0. \end{cases}$ [sent-210, score-1.334]
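
A sketch of this per-station marginal, with an exponential density standing in for f_v; the choice of exponential (and the parameter name lam_v) is an illustrative assumption, since the text above only requires some density for positive amounts.

import numpy as np

def station_density(r, pi_v, lam_v):
    """p(r | pi_v, lam_v): mass 1 - pi_v at r = 0, exponential amounts on wet days."""
    r = np.asarray(r, dtype=float)
    wet = pi_v * lam_v * np.exp(-lam_v * r)  # pi_v * f_v(r) for r > 0
    return np.where(r > 0, wet, 1.0 - pi_v)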

89 The emission density involves the tree prior normalizer $|L(\beta)|$, edge factors $\omega_{uv}(\mathbf{r})$ for {u, v} ∈ E, and bivariate Gaussian copula densities $c_{uv}(a_u, a_v; \theta_{uv}) = \frac{1}{\sqrt{1 - \theta_{uv}^2}} \exp\left(-\frac{\theta_{uv}^2\, \Phi^{-1}(a_u)^2 + \theta_{uv}^2\, \Phi^{-1}(a_v)^2 - 2\theta_{uv}\, \Phi^{-1}(a_u)\, \Phi^{-1}(a_v)}{2(1 - \theta_{uv}^2)}\right)$. [sent-213, score-0.87]
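
The bivariate Gaussian copula density above, written out directly; norm.ppf is the standard normal quantile function $\Phi^{-1}$.

import numpy as np
from scipy.stats import norm

def gaussian_copula_density(a_u, a_v, theta):
    x, y = norm.ppf(a_u), norm.ppf(a_v)
    quad = theta**2 * (x**2 + y**2) - 2.0 * theta * x * y
    return np.exp(-quad / (2.0 * (1.0 - theta**2))) / np.sqrt(1.0 - theta**2)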

90 We applied the models to a data set collected from 30 stations from a region in Southeastern Australia (Fig. [sent-214, score-0.117]

91 We used a 5-state HMM with three different types of emission distributions: tree-averaged ($p_{ta}$), tree-structured ($p_t$), and conditionally independent (first term of $p_t$ and $p_{ta}$). [sent-216, score-0.152]

92 For HMM-TA, we reduced the number of free parameters by only allowing edges for stations adjacent to each other as determined by the Delaunay triangulation (Fig. [sent-218, score-0.202]

93 We also did not learn the edge weights (β), setting them to 1 for selected edges and to 0 for the rest. [sent-220, score-0.062]

94 The resulting log-likelihoods divided by the number of days and stations are −0. [sent-223, score-0.117]

95 We are particularly interested in how well they capture pairwise dependence; we concentrate on two measures: the log-odds ratio for occurrence and Kendall’s τ measure of concordance for pairs when both stations had positive amounts. [sent-228, score-0.183]

96 We thank Stephen Charles (CSIRO, Australia) for providing us with precipitation data. [sent-233, score-0.127]

97 Figure 3: Station map with station locations (red dots), coastline, and the pairs of stations selected according to the Delaunay triangulation (dotted lines). [sent-269, score-0.226]

98 Figure 4: Scatter-plots of log-odds ratios for occurrence (left) and Kendall’s τ measure of concordance (right) for all pairs of stations for the historical data vs HMM-TA (red o), HMM-Tree (blue x), and HMM-CI (green ·). [sent-291, score-0.21]

99 The estimation method of inference functions for margins for multivariate models. [sent-311, score-0.131]

100 A semiparametric estimation procedure of dependence parameters in multivariate families of distributions. [sent-318, score-0.199]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('copula', 0.497), ('copulas', 0.336), ('uv', 0.336), ('xv', 0.266), ('rv', 0.244), ('cuv', 0.219), ('bivariate', 0.198), ('fv', 0.161), ('univariate', 0.155), ('av', 0.141), ('precipitation', 0.127), ('stations', 0.117), ('pv', 0.116), ('au', 0.113), ('ru', 0.113), ('marginals', 0.112), ('ln', 0.102), ('multivariate', 0.083), ('tree', 0.082), ('kde', 0.076), ('dependence', 0.076), ('spanning', 0.069), ('xu', 0.065), ('trees', 0.061), ('pta', 0.058), ('station', 0.058), ('vines', 0.058), ('fd', 0.054), ('density', 0.052), ('hmm', 0.051), ('densities', 0.051), ('em', 0.049), ('categorical', 0.045), ('magic', 0.044), ('puv', 0.044), ('pdf', 0.043), ('fu', 0.043), ('amounts', 0.043), ('xd', 0.042), ('sn', 0.041), ('daily', 0.041), ('kendall', 0.041), ('concordance', 0.038), ('rain', 0.038), ('rainfall', 0.038), ('distributions', 0.038), ('xn', 0.037), ('pt', 0.036), ('emission', 0.036), ('averaging', 0.036), ('edge', 0.035), ('curse', 0.035), ('alberta', 0.035), ('pu', 0.033), ('meil', 0.033), ('en', 0.031), ('dd', 0.031), ('ad', 0.03), ('coastline', 0.029), ('delaunay', 0.029), ('qvv', 0.029), ('sergey', 0.029), ('synoptic', 0.029), ('tcopula', 0.029), ('telescope', 0.029), ('frank', 0.029), ('cp', 0.029), ('gaussian', 0.028), ('pairs', 0.028), ('edges', 0.027), ('historical', 0.027), ('margins', 0.027), ('quu', 0.025), ('triangulation', 0.023), ('decomposable', 0.023), ('evidenced', 0.023), ('structures', 0.022), ('treestructured', 0.022), ('estimation', 0.021), ('ta', 0.02), ('outperforming', 0.02), ('odds', 0.02), ('components', 0.02), ('gamma', 0.02), ('maximizing', 0.02), ('solid', 0.019), ('parameters', 0.019), ('updated', 0.019), ('vectors', 0.019), ('hmms', 0.019), ('australia', 0.019), ('rd', 0.018), ('mixture', 0.018), ('spatial', 0.017), ('product', 0.017), ('variables', 0.016), ('equation', 0.016), ('allowing', 0.016), ('simulated', 0.015), ('decomposes', 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 119 nips-2007-Learning with Tree-Averaged Densities and Distributions

Author: Sergey Kirshner

Abstract: We utilize the ensemble of trees framework, a tractable mixture over a super-exponential number of tree-structured distributions [1], to develop a new model for multivariate density estimation. The model is based on a construction of tree-structured copulas – multivariate distributions with uniform marginals on [0, 1]. By averaging over all possible tree structures, the new model can approximate distributions with complex variable dependencies. We propose an EM algorithm to estimate the parameters for these tree-averaged models for both the real-valued and the categorical case. Based on the tree-averaged framework, we propose a new model for joint precipitation amounts data on networks of rain stations. 1

2 0.14868842 51 nips-2007-Comparing Bayesian models for multisensory cue combination without mandatory integration

Author: Ulrik Beierholm, Ladan Shams, Wei J. Ma, Konrad Koerding

Abstract: Bayesian models of multisensory perception traditionally address the problem of estimating an underlying variable that is assumed to be the cause of the two sensory signals. The brain, however, has to solve a more general problem: it also has to establish which signals come from the same source and should be integrated, and which ones do not and should be segregated. In the last couple of years, a few models have been proposed to solve this problem in a Bayesian fashion. One of these has the strength that it formalizes the causal structure of sensory signals. We first compare these models on a formal level. Furthermore, we conduct a psychophysics experiment to test human performance in an auditory-visual spatial localization task in which integration is not mandatory. We find that the causal Bayesian inference model accounts for the data better than other models. Keywords: causal inference, Bayesian methods, visual perception. 1 Multisensory perception In the ventriloquist illusion, a performer speaks without moving his/her mouth while moving a puppet’s mouth in synchrony with his/her speech. This makes the puppet appear to be speaking. This illusion was first conceptualized as ”visual capture”, occurring when visual and auditory stimuli exhibit a small conflict ([1, 2]). Only recently has it been demonstrated that the phenomenon may be seen as a byproduct of a much more flexible and nearly Bayes-optimal strategy ([3]), and therefore is part of a large collection of cue combination experiments showing such statistical near-optimality [4, 5]. In fact, cue combination has become the poster child for Bayesian inference in the nervous system. In previous studies of multisensory integration, two sensory stimuli are presented which act as cues about a single underlying source. For instance, in the auditory-visual localization experiment by Alais and Burr [3], observers were asked to envisage each presentation of a light blob and a sound click as a single event, like a ball hitting the screen. In many cases, however, the brain is not only posed with the problem of identifying the position of a common source, but also of determining whether there was a common source at all. In the on-stage ventriloquist illusion, it is indeed primarily the causal inference process that is being fooled, because veridical perception would attribute independent causes to the auditory and the visual stimulus. 1 To extend our understanding of multisensory perception to this more general problem, it is necessary to manipulate the degree of belief assigned to there being a common cause within a multisensory task. Intuitively, we expect that when two signals are very different, they are less likely to be perceived as having a common source. It is well-known that increasing the discrepancy or inconsistency between stimuli reduces the influence that they have on each other [6, 7, 8, 9, 10, 11]. In auditoryvisual spatial localization, one variable that controls stimulus similarity is spatial disparity (another would be temporal disparity). Indeed, it has been reported that increasing spatial disparity leads to a decrease in auditory localization bias [1, 12, 13, 14, 15, 16, 17, 2, 18, 19, 20, 21]. This decrease also correlates with a decrease in the reports of unity [19, 21]. Despite the abundance of experimental data on this issue, no general theory exists that can explain multisensory perception across a wide range of cue conflicts. 
2 Models The success of Bayesian models for cue integration has motivated attempts to extend them to situations of large sensory conflict and a consequent low degree of integration. In one of recent studies taking this approach, subjects were presented with concurrent visual flashes and auditory beeps and asked to count both the number of flashes and the number of beeps [11]. The advantage of the experimental paradigm adopted here was that it probed the joint response distribution by requiring a dual report. Human data were accounted for well by a Bayesian model in which the joint prior distribution over visual and auditory number was approximated from the data. In a similar study, subjects were presented with concurrent flashes and taps and asked to count either the flashes or the taps [9, 22]. The Bayesian model proposed by these authors assumed a joint prior distribution with a near-diagonal form. The corresponding generative model assumes that the sensory sources somehow interact with one another. A third experiment modulated the rates of flashes and beeps. The task was to judge either the visual or the auditory modulation rate relative to a standard [23]. The data from this experiment were modeled using a joint prior distribution which is the sum of a near-diagonal prior and a flat background. While all these models are Bayesian in a formal sense, their underlying generative model does not formalize the model selection process that underlies the combination of cues. This makes it necessary to either estimate an empirical prior [11] by fitting it to human behavior or to assume an ad hoc form [22, 23]. However, we believe that such assumptions are not needed. It was shown recently that human judgments of spatial unity in an auditory-visual spatial localization task can be described using a Bayesian inference model that infers causal structure [24, 25]. In this model, the brain does not only estimate a stimulus variable, but also infers the probability that the two stimuli have a common cause. In this paper we compare these different models on a large data set of human position estimates in an auditory-visual task. In this section we first describe the traditional cue integration model, then the recent models based on joint stimulus priors, and finally the causal inference model. To relate to the experiment in the next section, we will use the terminology of auditory-visual spatial localization, but the formalism is very general. 2.1 Traditional cue integration The traditional generative model of cue integration [26] has a single source location s which produces on each trial an internal representation (cue) of visual location, xV and one of auditory location, xA . We assume that the noise processes by which these internal representations are generated are conditionally independent from each other and follow Gaussian distributions. That is, p (xV |s) ∼ N (xV ; s, σV )and p (xA |s) ∼ N (xA ; s, σA ), where N (x; µ, σ) stands for the normal distribution over x with mean µ and standard deviation σ. If on a given trial the internal representations are xV and xA , the probability that their source was s is given by Bayes’ rule, p (s|xV , xA ) ∝ p (xV |s) p (xA |s) . If a subject performs maximum-likelihood estimation, then the estimate will be xV +wA s = wV wV +wA xA , where wV = σ1 and wA = σ1 . It is important to keep in mind that this is the ˆ 2 2 V A estimate on a single trial. 
A psychophysical experimenter can never have access to xV and xA , which 2 are the noisy internal representations. Instead, an experimenter will want to collect estimates over many trials and is interested in the distribution of s given sV and sA , which are the sources generated ˆ by the experimenter. In a typical cue combination experiment, xV and xA are not actually generated by the same source, but by different sources, a visual one sV and an auditory one sA . These sources are chosen close to each other so that the subject can imagine that the resulting cues originate from a single source and thus implicitly have a common cause. The experimentally observed distribution is then p (ˆ|sV , sA ) = s p (ˆ|xV , xA ) p (xV |sV ) p (xA |sA ) dxV dxA s Given that s is a linear combination of two normally distributed variables, it will itself follow a ˆ sV +wA 1 2 normal distribution, with mean s = wVwV +wA sA and variance σs = wV +wA . The reason that we ˆ ˆ emphasize this point is because many authors identify the estimate distribution p (ˆ|sV , sA ) with s the posterior distribution p (s|xV , xA ). This is justified in this case because all distributions are Gaussian and the estimate is a linear combination of cues. However, in the case of causal inference, these conditions are violated and the estimate distribution will in general not be the same as the posterior distribution. 2.2 Models with bisensory stimulus priors Models with bisensory stimulus priors propose the posterior over source positions to be proportional to the product of unimodal likelihoods and a two-dimensional prior: p (sV , sA |xV , xA ) = p (sV , sA ) p (xV |sV ) p (xA |sA ) The traditional cue combination model has p (sV , sA ) = p (sV ) δ (sV − sA ), usually (as above) even with p (sV ) uniform. The question arises what bisensory stimulus prior is appropriate. In [11], the prior is estimated from data, has a large number of parameters, and is therefore limited in its predictive power. In [23], it has the form − (sV −sA )2 p (sV , sA ) ∝ ω + e 2σ 2 coupling while in [22] the additional assumption ω = 0 is made1 . In all three models, the response distribution p (ˆV , sA |sV , sA ) is obtained by idens ˆ tifying it with the posterior distribution p (sV , sA |xV , xA ). This procedure thus implicitly assumes that marginalizing over the latent variables xV and xA is not necessary, which leads to a significant error for non-Gaussian priors. In this paper we correctly deal with these issues and in all cases marginalize over the latent variables. The parametric models used for the coupling between the cues lead to an elegant low-dimensional model of cue integration that allows for estimates of single cues that differ from one another. C C=1 SA S XA 2.3 C=2 XV SV XA XV Causal inference model In the causal inference model [24, 25], we start from the traditional cue integration model but remove the assumption that two signals are caused by the same source. Instead, the number of sources can be one or two and is itself a variable that needs to be inferred from the cues. Figure 1: Generative model of causal inference. 1 This family of Bayesian posterior distributions also includes one used to successfully model cue combination in depth perception [27, 28]. In depth perception, however, there is no notion of segregation as always a single surface is assumed. 3 If there are two sources, they are assumed to be independent. Thus, we use the graphical model depicted in Fig. 1. We denote the number of sources by C. 
The probability distribution over C given internal representations xV and xA is given by Bayes’ rule: p (C|xV , xA ) ∝ p (xV , xA |C) p (C) . In this equation, p (C) is the a priori probability of C. We will denote the probability of a common cause by pcommon , so that p (C = 1) = pcommon and p (C = 2) = 1 − pcommon . The probability of generating xV and xA given C is obtained by inserting a summation over the sources: p (xV , xA |C = 1) = p (xV , xA |s)p (s) ds = p (xV |s) p (xA |s)p (s) ds Here p (s) is a prior for spatial location, which we assume to be distributed as N (s; 0, σP ). Then all three factors in this integral are Gaussians, allowing for an analytic solution: p (xV , xA |C = 1) = 2 2 2 2 2 −xA )2 σP σA √ 2 2 1 2 2 2 2 exp − 1 (xV σ2 σ2 +σ2+xV+σ2+xA σV . 2 σ2 σ2 2π σV σA +σV σP +σA σP V A V P A P For p (xV , xA |C = 2) we realize that xV and xA are independent of each other and thus obtain p (xV , xA |C = 2) = p (xV |sV )p (sV ) dsV p (xA |sA )p (sA ) dsA Again, as all these distributions are assumed to be Gaussian, we obtain an analytic solution, x2 x2 1 1 V A p (xV , xA |C = 2) = exp − 2 σ2 +σ2 + σ2 +σ2 . Now that we have com2 +σ 2 2 +σ 2 p p V A 2π (σV p )(σA p) puted p (C|xV , xA ), the posterior distribution over sources is given by p (si |xV , xA ) = p (si |xV , xA , C) p (C|xV , xA ) C=1,2 where i can be V or A and the posteriors conditioned on C are well-known: p (si |xA , xV , C = 1) = p (xA |si ) p (xV |si ) p (si ) , p (xA |s) p (xV |s) p (s) ds p (si |xA , xV , C = 2) = p (xi |si ) p (si ) p (xi |si ) p (si ) dsi The former is the same as in the case of mandatory integration with a prior, the latter is simply the unimodal posterior in the presence of a prior. Based on the posterior distribution on a given trial, p (si |xV , xA ), an estimate has to be created. For this, we use a sum-squared-error cost func2 2 tion, Cost = p (C = 1|xV , xA ) (ˆ − s) + p (C = 2|xV , xA ) (ˆ − sV or A ) . Then the best s s estimate is the mean of the posterior distribution, for instance for the visual estimation: sV = p (C = 1|xA , xV ) sV,C=1 + p (C = 2|xA , xV ) sV,C=2 ˆ ˆ ˆ where sV,C=1 = ˆ −2 −2 −2 xV σV +xA σA +xP σP −2 −2 −2 σV +σA +σP and sV,C=2 = ˆ −2 −2 xV σV +xP σP . −2 −2 σV +σP If pcommon equals 0 or 1, this estimate reduces to one of the conditioned estimates and is linear in xV and xA . If 0 < pcommon < 1, the estimate is a nonlinear combination of xV and xA , because of the functional form of p (C|xV , xA ). The response distributions, that is the distributions of sV and sA given ˆ ˆ sV and sA over many trials, now cannot be identified with the posterior distribution on a single trial and cannot be computed analytically either. The correct way to obtain the response distribution is to simulate an experiment numerically. Note that the causal inference model above can also be cast in the form of a bisensory stimulus prior by integrating out the latent variable C, with: p (sA , sV ) = p (C = 1) δ (sA − sV ) p (sA ) + p (sA ) p (sV ) p (C = 2) However, in addition to justifying the form of the interaction between the cues, the causal inference model has the advantage of being based on a generative model that well formalizes salient properties of the world, and it thereby also allows to predict judgments of unity. 4 3 Model performance and comparison To examine the performance of the causal inference model and to compare it to previous models, we performed a human psychophysics experiment in which we adopted the same dual-report paradigm as was used in [11]. 
Observers were simultaneously presented with a brief visual and also an auditory stimulus, each of which could originate from one of five locations on an imaginary horizontal line (-10◦ , -5◦ , 0◦ , 5◦ , or 10◦ with respect to the fixation point). Auditory stimuli were 32 ms of white noise filtered through an individually calibrated head related transfer function (HRTF) and presented through a pair of headphones, whereas the visual stimuli were high contrast Gabors on a noisy background presented on a 21-inch CRT monitor. Observers had to report by means of a key press (1-5) the perceived positions of both the visual and the auditory stimulus. Each combination of locations was presented with the same frequency over the course of the experiment. In this way, for each condition, visual and auditory response histograms were obtained. We obtained response distributions for each the three models described above by numeral simulation. On each trial, estimation is followed by a step in which, the key is selected which corresponds to the position closed to the best estimate. The simulated histograms obtained in this way were compared to the measured response frequencies of all subjects by computing the R2 statistic. Auditory response Auditory model Visual response Visual model no vision The parameters in the causal inference model were optimized using fminsearch in MATLAB to maximize R2 . The best combination of parameters yielded an R2 of 0.97. The response frequencies are depicted in Fig. 2. The bisensory prior models also explain most of the variance, with R2 = 0.96 for the Roach model and R2 = 0.91 for the Bresciani model. This shows that it is possible to model cue combination for large disparities well using such models. no audio 1 0 Figure 2: A comparison between subjects’ performance and the causal inference model. The blue line indicates the frequency of subjects responses to visual stimuli, red line is the responses to auditory stimuli. Each set of lines is one set of audio-visual stimulus conditions. Rows of conditions indicate constant visual stimulus, columns is constant audio stimulus. Model predictions is indicated by the red and blue dotted line. 5 3.1 Model comparison To facilitate quantitative comparison with other models, we now fit the parameters of each model2 to individual subject data, maximizing the likelihood of the model, i.e., the probability of the response frequencies under the model. The causal inference model fits human data better than the other models. Compared to the best fit of the causal inference model, the Bresciani model has a maximal log likelihood ratio (base e) of the data of −22 ± 6 (mean ± s.e.m. over subjects), and the Roach model has a maximal log likelihood ratio of the data of −18 ± 6. A causal inference model that maximizes the probability of being correct instead of minimizing the mean squared error has a maximal log likelihood ratio of −18 ± 3. These values are considered decisive evidence in favor of the causal inference model that minimizes the mean squared error (for details, see [25]). The parameter values found in the likelihood optimization of the causal model are as follows: pcommon = 0.28 ± 0.05, σV = 2.14 ± 0.22◦ , σA = 9.2 ± 1.1◦ , σP = 12.3 ± 1.1◦ (mean ± s.e.m. over subjects). We see that there is a relatively low prior probability of a common cause. In this paradigm, auditory localization is considerably less precise than visual localization. Also, there is a weak prior for central locations. 
3.2 Localization bias A useful quantity to gain more insight into the structure of multisensory data is the cross-modal bias. In our experiment, relative auditory bias is defined as the difference between the mean auditory estimate in a given condition and the real auditory position, divided by the difference between the real visual position and the real auditory position in this condition. If the influence of vision on the auditory estimate is strong, then the relative auditory bias will be high (close to one). It is well-known that bias decreases with spatial disparity and our experiment is no exception (solid line in Fig. 3; data were combined between positive and negative disparities). It can easily be shown that a traditional cue integration model would predict a bias equal to σ2 −1 , which would be close to 1 and 1 + σV 2 A independent of disparity, unlike the data. This shows that a mandatory integration model is an insufficient model of multisensory interactions. 45 % Auditory Bias We used the individual subject fittings from above and and averaged the auditory bias values obtained from those fits (i.e. we did not fit the bias data themselves). Fits are shown in Fig. 3 (dashed lines). We applied a paired t-test to the differences between the 5◦ and 20◦ disparity conditions (model-subject comparison). Using a double-sided test, the null hypothesis that the difference between the bias in the 5◦ and 20◦ conditions is correctly predicted by each model is rejected for the Bresciani model (p < 0.002) and the Roach model (p < 0.042) and accepted for the causal inference model (p > 0.17). Alternatively, with a single-sided test, the hypothesis is rejected for the Bresciani model (p < 0.001) and the Roach model (p < 0.021) and accepted for the causal inference model (> 0.9). 50 40 35 30 25 20 5 10 15 Spatial Disparity (deg.) 20 Figure 3: Auditory bias as a function of spatial disparity. Solid blue line: data. Red: Causal inference model. Green: Model by Roach et al. [23]. Purple: Model by Bresciani et al. [22]. Models were optimized on response frequencies (as in Fig. 2), not on the bias data. The reason that the Bresciani model fares worst is that its prior distribution does not include a component that corresponds to independent causes. On 2 The Roach et al. model has four free parameters (ω,σV , σA , σcoupling ), the Bresciani et al. model has three (σV , σA , σcoupling ), and the causal inference model has four (pcommon ,σV , σA , σP ). We do not consider the Shams et al. model here, since it has many more parameters and it is not immediately clear how in this model the erroneous identification of posterior with response distribution can be corrected. 6 the contrary, the prior used in the Roach model contains two terms, one term that is independent of the disparity and one term that decreases with increasing disparity. It is thus functionally somewhat similar to the causal inference model. 4 Discussion We have argued that any model of multisensory perception should account not only for situations of small, but also of large conflict. In these situations, segregation is more likely, in which the two stimuli are not perceived to have the same cause. Even when segregation occurs, the two stimuli can still influence each other. We compared three Bayesian models designed to account for situations of large conflict by applying them to auditory-visual spatial localization data. 
We pointed out a common mistake: for nonGaussian bisensory priors without mandatory integration, the response distribution can no longer be identified with the posterior distribution. After correct implementation of the three models, we found that the causal inference model is superior to the models with ad hoc bisensory priors. This is expected, as the nervous system actually needs to solve the problem of deciding which stimuli have a common cause and which stimuli are unrelated. We have seen that multisensory perception is a suitable tool for studying causal inference. However, the causal inference model also has the potential to quantitatively explain a number of other perceptual phenomena, including perceptual grouping and binding, as well as within-modality cue combination [27, 28]. Causal inference is a universal problem: whenever the brain has multiple pieces of information it must decide if they relate to one another or are independent. As the causal inference model describes how the brain processes probabilistic sensory information, the question arises about the neural basis of these processes. Neural populations encode probability distributions over stimuli through Bayes’ rule, a type of coding known as probabilistic population coding. Recent work has shown how the optimal cue combination assuming a common cause can be implemented in probabilistic population codes through simple linear operations on neural activities [29]. This framework makes essential use of the structure of neural variability and leads to physiological predictions for activity in areas that combine multisensory input, such as the superior colliculus. Computational mechanisms for causal inference are expected have a neural substrate that generalizes these linear operations on population activities. A neural implementation of the causal inference model will open the door to a complete neural theory of multisensory perception. References [1] H.L. Pick, D.H. Warren, and J.C. Hay. Sensory conflict in judgements of spatial direction. Percept. Psychophys., 6:203205, 1969. [2] D. H. Warren, R. B. Welch, and T. J. McCarthy. The role of visual-auditory ”compellingness” in the ventriloquism effect: implications for transitivity among the spatial senses. Percept Psychophys, 30(6):557– 64, 1981. [3] D. Alais and D. Burr. The ventriloquist effect results from near-optimal bimodal integration. Curr Biol, 14(3):257–62, 2004. [4] R. A. Jacobs. Optimal integration of texture and motion cues to depth. Vision Res, 39(21):3621–9, 1999. [5] R. J. van Beers, A. C. Sittig, and J. J. Gon. Integration of proprioceptive and visual position-information: An experimentally supported model. J Neurophysiol, 81(3):1355–64, 1999. [6] D. H. Warren and W. T. Cleaves. Visual-proprioceptive interaction under large amounts of conflict. J Exp Psychol, 90(2):206–14, 1971. [7] C. E. Jack and W. R. Thurlow. Effects of degree of visual association and angle of displacement on the ”ventriloquism” effect. Percept Mot Skills, 37(3):967–79, 1973. [8] G. H. Recanzone. Auditory influences on visual temporal rate perception. J Neurophysiol, 89(2):1078–93, 2003. [9] J. P. Bresciani, M. O. Ernst, K. Drewing, G. Bouyer, V. Maury, and A. Kheddar. Feeling what you hear: auditory signals can modulate tactile tap perception. Exp Brain Res, 162(2):172–80, 2005. 7 [10] R. Gepshtein, P. Leiderman, L. Genosar, and D. Huppert. Testing the three step excited state proton transfer model by the effect of an excess proton. 
J Phys Chem A Mol Spectrosc Kinet Environ Gen Theory, 109(42):9674–84, 2005. [11] L. Shams, W. J. Ma, and U. Beierholm. Sound-induced flash illusion as an optimal percept. Neuroreport, 16(17):1923–7, 2005. [12] G Thomas. Experimental study of the influence of vision on sound localisation. J Exp Psychol, 28:167177, 1941. [13] W. R. Thurlow and C. E. Jack. Certain determinants of the ”ventriloquism effect”. Percept Mot Skills, 36(3):1171–84, 1973. [14] C.S. Choe, R. B. Welch, R.M. Gilford, and J.F. Juola. The ”ventriloquist effect”: visual dominance or response bias. Perception and Psychophysics, 18:55–60, 1975. [15] R. I. Bermant and R. B. Welch. Effect of degree of separation of visual-auditory stimulus and eye position upon spatial interaction of vision and audition. Percept Mot Skills, 42(43):487–93, 1976. [16] R. B. Welch and D. H. Warren. Immediate perceptual response to intersensory discrepancy. Psychol Bull, 88(3):638–67, 1980. [17] P. Bertelson and M. Radeau. Cross-modal bias and perceptual fusion with auditory-visual spatial discordance. Percept Psychophys, 29(6):578–84, 1981. [18] P. Bertelson, F. Pavani, E. Ladavas, J. Vroomen, and B. de Gelder. Ventriloquism in patients with unilateral visual neglect. Neuropsychologia, 38(12):1634–42, 2000. [19] D. A. Slutsky and G. H. Recanzone. Temporal and spatial dependency of the ventriloquism effect. Neuroreport, 12(1):7–10, 2001. [20] J. Lewald, W. H. Ehrenstein, and R. Guski. Spatio-temporal constraints for auditory–visual integration. Behav Brain Res, 121(1-2):69–79, 2001. [21] M. T. Wallace, G. E. Roberson, W. D. Hairston, B. E. Stein, J. W. Vaughan, and J. A. Schirillo. Unifying multisensory signals across time and space. Exp Brain Res, 158(2):252–8, 2004. [22] J. P. Bresciani, F. Dammeier, and M. O. Ernst. Vision and touch are automatically integrated for the perception of sequences of events. J Vis, 6(5):554–64, 2006. [23] N. W. Roach, J. Heron, and P. V. McGraw. Resolving multisensory conflict: a strategy for balancing the costs and benefits of audio-visual integration. Proc Biol Sci, 273(1598):2159–68, 2006. [24] K. P. Kording and D. M. Wolpert. Bayesian decision theory in sensorimotor control. Trends Cogn Sci, 2006. 1364-6613 (Print) Journal article. [25] K.P. Kording, U. Beierholm, W.J. Ma, S. Quartz, J. Tenenbaum, and L. Shams. Causal inference in multisensory perception. PLoS ONE, 2(9):e943, 2007. [26] Z. Ghahramani. Computational and psychophysics of sensorimotor integration. PhD thesis, Massachusetts Institute of Technology, 1995. [27] D. C. Knill. Mixture models and the probabilistic structure of depth cues. Vision Res, 43(7):831–54, 2003. [28] D. C. Knill. Robust cue integration: A bayesian model and evidence from cue conflict studies with stereoscopic and figure cues to slant. Journal of Vision, 7(7):2–24. [29] W. J. Ma, J. M. Beck, P. E. Latham, and A. Pouget. Bayesian inference with probabilistic population codes. Nat Neurosci, 9(11):1432–8, 2006. 8

3 0.1138767 48 nips-2007-Collective Inference on Markov Models for Modeling Bird Migration

Author: M.a. S. Elmohamed, Dexter Kozen, Daniel R. Sheldon

Abstract: We investigate a family of inference problems on Markov models, where many sample paths are drawn from a Markov chain and partial information is revealed to an observer who attempts to reconstruct the sample paths. We present algorithms and hardness results for several variants of this problem which arise by revealing different information to the observer and imposing different requirements for the reconstruction of sample paths. Our algorithms are analogous to the classical Viterbi algorithm for Hidden Markov Models, which finds the single most probable sample path given a sequence of observations. Our work is motivated by an important application in ecology: inferring bird migration paths from a large database of observations. 1

4 0.06757006 20 nips-2007-Adaptive Embedded Subgraph Algorithms using Walk-Sum Analysis

Author: Venkat Chandrasekaran, Alan S. Willsky, Jason K. Johnson

Abstract: We consider the estimation problem in Gaussian graphical models with arbitrary structure. We analyze the Embedded Trees algorithm, which solves a sequence of problems on tractable subgraphs thereby leading to the solution of the estimation problem on an intractable graph. Our analysis is based on the recently developed walk-sum interpretation of Gaussian estimation. We show that non-stationary iterations of the Embedded Trees algorithm using any sequence of subgraphs converge in walk-summable models. Based on walk-sum calculations, we develop adaptive methods that optimize the choice of subgraphs used at each iteration with a view to achieving maximum reduction in error. These adaptive procedures provide a significant speedup in convergence over stationary iterative methods, and also appear to converge in a larger class of models. 1

5 0.054911144 116 nips-2007-Learning the structure of manifolds using random projections

Author: Yoav Freund, Sanjoy Dasgupta, Mayank Kabra, Nakul Verma

Abstract: We present a simple variant of the k-d tree which automatically adapts to intrinsic low dimensional structure in data. 1

6 0.052482951 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs

7 0.049125805 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models

8 0.048479997 156 nips-2007-Predictive Matrix-Variate t Models

9 0.044783585 36 nips-2007-Better than least squares: comparison of objective functions for estimating linear-nonlinear models

10 0.044262111 7 nips-2007-A Kernel Statistical Test of Independence

11 0.043009624 63 nips-2007-Convex Relaxations of Latent Variable Training

12 0.041650511 173 nips-2007-Second Order Bilinear Discriminant Analysis for single trial EEG analysis

13 0.039453596 78 nips-2007-Efficient Principled Learning of Thin Junction Trees

14 0.036246803 84 nips-2007-Expectation Maximization and Posterior Constraints

15 0.034951385 176 nips-2007-Sequential Hypothesis Testing under Stochastic Deadlines

16 0.034570843 44 nips-2007-Catching Up Faster in Bayesian Model Selection and Model Averaging

17 0.033510905 22 nips-2007-Agreement-Based Learning

18 0.03121235 197 nips-2007-The Infinite Markov Model

19 0.030816136 207 nips-2007-Transfer Learning using Kolmogorov Complexity: Basic Theory and Empirical Evaluations

20 0.029485121 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.114), (1, 0.006), (2, -0.013), (3, -0.016), (4, -0.001), (5, -0.051), (6, -0.014), (7, 0.033), (8, -0.043), (9, -0.086), (10, -0.023), (11, 0.042), (12, 0.026), (13, -0.052), (14, 0.027), (15, -0.065), (16, 0.017), (17, 0.015), (18, 0.078), (19, 0.0), (20, 0.016), (21, 0.082), (22, -0.009), (23, -0.022), (24, 0.127), (25, -0.021), (26, -0.082), (27, -0.047), (28, 0.142), (29, 0.021), (30, -0.127), (31, 0.098), (32, 0.046), (33, -0.016), (34, -0.011), (35, -0.136), (36, 0.008), (37, -0.121), (38, -0.125), (39, 0.205), (40, -0.166), (41, 0.06), (42, -0.031), (43, 0.121), (44, 0.029), (45, -0.016), (46, -0.016), (47, -0.04), (48, -0.26), (49, 0.083)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92614859 119 nips-2007-Learning with Tree-Averaged Densities and Distributions

Author: Sergey Kirshner

Abstract: We utilize the ensemble of trees framework, a tractable mixture over a super-exponential number of tree-structured distributions [1], to develop a new model for multivariate density estimation. The model is based on a construction of tree-structured copulas – multivariate distributions with uniform marginals on [0, 1]. By averaging over all possible tree structures, the new model can approximate distributions with complex variable dependencies. We propose an EM algorithm to estimate the parameters for these tree-averaged models for both the real-valued and the categorical case. Based on the tree-averaged framework, we propose a new model for joint precipitation amounts data on networks of rain stations. 1

2 0.64370322 51 nips-2007-Comparing Bayesian models for multisensory cue combination without mandatory integration

Author: Ulrik Beierholm, Ladan Shams, Wei J. Ma, Konrad Koerding

Abstract: Bayesian models of multisensory perception traditionally address the problem of estimating an underlying variable that is assumed to be the cause of the two sensory signals. The brain, however, has to solve a more general problem: it also has to establish which signals come from the same source and should be integrated, and which ones do not and should be segregated. In the last couple of years, a few models have been proposed to solve this problem in a Bayesian fashion. One of these has the strength that it formalizes the causal structure of sensory signals. We first compare these models on a formal level. Furthermore, we conduct a psychophysics experiment to test human performance in an auditory-visual spatial localization task in which integration is not mandatory. We find that the causal Bayesian inference model accounts for the data better than other models. Keywords: causal inference, Bayesian methods, visual perception. 1 Multisensory perception In the ventriloquist illusion, a performer speaks without moving his/her mouth while moving a puppet’s mouth in synchrony with his/her speech. This makes the puppet appear to be speaking. This illusion was first conceptualized as ”visual capture”, occurring when visual and auditory stimuli exhibit a small conflict ([1, 2]). Only recently has it been demonstrated that the phenomenon may be seen as a byproduct of a much more flexible and nearly Bayes-optimal strategy ([3]), and therefore is part of a large collection of cue combination experiments showing such statistical near-optimality [4, 5]. In fact, cue combination has become the poster child for Bayesian inference in the nervous system. In previous studies of multisensory integration, two sensory stimuli are presented which act as cues about a single underlying source. For instance, in the auditory-visual localization experiment by Alais and Burr [3], observers were asked to envisage each presentation of a light blob and a sound click as a single event, like a ball hitting the screen. In many cases, however, the brain is not only posed with the problem of identifying the position of a common source, but also of determining whether there was a common source at all. In the on-stage ventriloquist illusion, it is indeed primarily the causal inference process that is being fooled, because veridical perception would attribute independent causes to the auditory and the visual stimulus. 1 To extend our understanding of multisensory perception to this more general problem, it is necessary to manipulate the degree of belief assigned to there being a common cause within a multisensory task. Intuitively, we expect that when two signals are very different, they are less likely to be perceived as having a common source. It is well-known that increasing the discrepancy or inconsistency between stimuli reduces the influence that they have on each other [6, 7, 8, 9, 10, 11]. In auditoryvisual spatial localization, one variable that controls stimulus similarity is spatial disparity (another would be temporal disparity). Indeed, it has been reported that increasing spatial disparity leads to a decrease in auditory localization bias [1, 12, 13, 14, 15, 16, 17, 2, 18, 19, 20, 21]. This decrease also correlates with a decrease in the reports of unity [19, 21]. Despite the abundance of experimental data on this issue, no general theory exists that can explain multisensory perception across a wide range of cue conflicts. 
2 Models

The success of Bayesian models for cue integration has motivated attempts to extend them to situations of large sensory conflict and a consequent low degree of integration. In one of the recent studies taking this approach, subjects were presented with concurrent visual flashes and auditory beeps and asked to count both the number of flashes and the number of beeps [11]. The advantage of the experimental paradigm adopted here was that it probed the joint response distribution by requiring a dual report. Human data were accounted for well by a Bayesian model in which the joint prior distribution over visual and auditory number was approximated from the data. In a similar study, subjects were presented with concurrent flashes and taps and asked to count either the flashes or the taps [9, 22]. The Bayesian model proposed by these authors assumed a joint prior distribution with a near-diagonal form. The corresponding generative model assumes that the sensory sources somehow interact with one another. A third experiment modulated the rates of flashes and beeps. The task was to judge either the visual or the auditory modulation rate relative to a standard [23]. The data from this experiment were modeled using a joint prior distribution which is the sum of a near-diagonal prior and a flat background. While all these models are Bayesian in a formal sense, their underlying generative model does not formalize the model selection process that underlies the combination of cues. This makes it necessary to either estimate an empirical prior [11] by fitting it to human behavior or to assume an ad hoc form [22, 23]. However, we believe that such assumptions are not needed. It was shown recently that human judgments of spatial unity in an auditory-visual spatial localization task can be described using a Bayesian inference model that infers causal structure [24, 25]. In this model, the brain does not only estimate a stimulus variable, but also infers the probability that the two stimuli have a common cause. In this paper we compare these different models on a large data set of human position estimates in an auditory-visual task. In this section we first describe the traditional cue integration model, then the recent models based on joint stimulus priors, and finally the causal inference model. To relate to the experiment in the next section, we will use the terminology of auditory-visual spatial localization, but the formalism is very general.

2.1 Traditional cue integration

The traditional generative model of cue integration [26] has a single source location s which produces on each trial an internal representation (cue) of visual location, xV, and one of auditory location, xA. We assume that the noise processes by which these internal representations are generated are conditionally independent from each other and follow Gaussian distributions. That is, p(xV | s) ~ N(xV; s, σV) and p(xA | s) ~ N(xA; s, σA), where N(x; µ, σ) stands for the normal distribution over x with mean µ and standard deviation σ. If on a given trial the internal representations are xV and xA, the probability that their source was s is given by Bayes' rule, p(s | xV, xA) ∝ p(xV | s) p(xA | s). If a subject performs maximum-likelihood estimation, then the estimate will be

ŝ = (wV xV + wA xA) / (wV + wA),  where wV = 1/σV² and wA = 1/σA².

It is important to keep in mind that this is the estimate on a single trial.
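To make the single-trial combination rule above concrete, the following is a minimal numerical sketch (not from the paper; the noise levels and source position are hypothetical values chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noise levels (degrees); the visual cue is more reliable than the auditory one.
sigma_V, sigma_A = 2.0, 9.0
s_true = 5.0  # common source location on this trial

# Noisy internal representations x_V and x_A for one trial.
x_V = rng.normal(s_true, sigma_V)
x_A = rng.normal(s_true, sigma_A)

# Reliability weights w = 1 / sigma^2 and the single-trial ML estimate.
w_V, w_A = 1.0 / sigma_V**2, 1.0 / sigma_A**2
s_hat = (w_V * x_V + w_A * x_A) / (w_V + w_A)
print(f"x_V = {x_V:.2f}, x_A = {x_A:.2f}, combined estimate = {s_hat:.2f}")
```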
A psychophysical experimenter can never have access to xV and xA, which are the noisy internal representations. Instead, an experimenter will want to collect estimates over many trials and is interested in the distribution of ŝ given sV and sA, which are the sources generated by the experimenter. In a typical cue combination experiment, xV and xA are not actually generated by the same source, but by different sources, a visual one sV and an auditory one sA. These sources are chosen close to each other so that the subject can imagine that the resulting cues originate from a single source and thus implicitly have a common cause. The experimentally observed distribution is then

p(ŝ | sV, sA) = ∫∫ p(ŝ | xV, xA) p(xV | sV) p(xA | sA) dxV dxA.

Given that ŝ is a linear combination of two normally distributed variables, it will itself follow a normal distribution, with mean (wV sV + wA sA)/(wV + wA) and variance 1/(wV + wA). The reason that we emphasize this point is because many authors identify the estimate distribution p(ŝ | sV, sA) with the posterior distribution p(s | xV, xA). This is justified in this case because all distributions are Gaussian and the estimate is a linear combination of cues. However, in the case of causal inference, these conditions are violated and the estimate distribution will in general not be the same as the posterior distribution.

2.2 Models with bisensory stimulus priors

Models with bisensory stimulus priors propose the posterior over source positions to be proportional to the product of unimodal likelihoods and a two-dimensional prior:

p(sV, sA | xV, xA) ∝ p(sV, sA) p(xV | sV) p(xA | sA).

The traditional cue combination model has p(sV, sA) = p(sV) δ(sV − sA), usually (as above) even with p(sV) uniform. The question arises what bisensory stimulus prior is appropriate. In [11], the prior is estimated from data, has a large number of parameters, and is therefore limited in its predictive power. In [23], it has the form

p(sV, sA) ∝ ω + exp(−(sV − sA)² / (2σ_coupling²)),

while in [22] the additional assumption ω = 0 is made. (Footnote: this family of Bayesian posterior distributions also includes one used to successfully model cue combination in depth perception [27, 28]. In depth perception, however, there is no notion of segregation, as a single surface is always assumed.) In all three models, the response distribution p(ŝV, ŝA | sV, sA) is obtained by identifying it with the posterior distribution p(sV, sA | xV, xA). This procedure thus implicitly assumes that marginalizing over the latent variables xV and xA is not necessary, which leads to a significant error for non-Gaussian priors. In this paper we correctly deal with these issues and in all cases marginalize over the latent variables. The parametric models used for the coupling between the cues lead to an elegant low-dimensional model of cue integration that allows for estimates of single cues that differ from one another.

2.3 Causal inference model

In the causal inference model [24, 25], we start from the traditional cue integration model but remove the assumption that two signals are caused by the same source. Instead, the number of sources can be one or two and is itself a variable that needs to be inferred from the cues. (Figure 1: Generative model of causal inference. Under C = 1, a single source s generates both xA and xV; under C = 2, independent sources sA and sV generate xA and xV, respectively.) If there are two sources, they are assumed to be independent. Thus, we use the graphical model depicted in Fig. 1. We denote the number of sources by C.
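As a minimal sketch of this generative model (my own illustration; the parameter values are hypothetical), a single trial can be simulated by first drawing C and then the source position(s) and cues:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_trial(p_common=0.3, sigma_V=2.0, sigma_A=9.0, sigma_P=12.0):
    """Draw one trial from the causal-inference generative model sketched above."""
    C = 1 if rng.random() < p_common else 2
    if C == 1:
        s = rng.normal(0.0, sigma_P)      # single source drawn from the spatial prior
        s_V, s_A = s, s
    else:
        s_V = rng.normal(0.0, sigma_P)    # independent visual source
        s_A = rng.normal(0.0, sigma_P)    # independent auditory source
    x_V = rng.normal(s_V, sigma_V)        # noisy visual cue
    x_A = rng.normal(s_A, sigma_A)        # noisy auditory cue
    return C, s_V, s_A, x_V, x_A

print(simulate_trial())
```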
The probability distribution over C given internal representations xV and xA is given by Bayes' rule:

p(C | xV, xA) ∝ p(xV, xA | C) p(C).

In this equation, p(C) is the a priori probability of C. We will denote the probability of a common cause by p_common, so that p(C = 1) = p_common and p(C = 2) = 1 − p_common. The probability of generating xV and xA given C is obtained by inserting a summation over the sources:

p(xV, xA | C = 1) = ∫ p(xV, xA | s) p(s) ds = ∫ p(xV | s) p(xA | s) p(s) ds.

Here p(s) is a prior for spatial location, which we assume to be distributed as N(s; 0, σP). Then all three factors in this integral are Gaussians, allowing for an analytic solution:

p(xV, xA | C = 1) = 1 / (2π √(σV²σA² + σV²σP² + σA²σP²)) · exp[ −((xV − xA)² σP² + xV² σA² + xA² σV²) / (2(σV²σA² + σV²σP² + σA²σP²)) ].

For p(xV, xA | C = 2) we realize that xV and xA are independent of each other and thus obtain

p(xV, xA | C = 2) = ∫ p(xV | sV) p(sV) dsV · ∫ p(xA | sA) p(sA) dsA.

Again, as all these distributions are assumed to be Gaussian, we obtain an analytic solution,

p(xV, xA | C = 2) = 1 / (2π √((σV² + σP²)(σA² + σP²))) · exp[ −(xV²/(σV² + σP²) + xA²/(σA² + σP²)) / 2 ].

Now that we have computed p(C | xV, xA), the posterior distribution over sources is given by

p(si | xV, xA) = Σ_{C=1,2} p(si | xV, xA, C) p(C | xV, xA),

where i can be V or A and the posteriors conditioned on C are well-known:

p(si | xA, xV, C = 1) = p(xA | si) p(xV | si) p(si) / ∫ p(xA | s) p(xV | s) p(s) ds,
p(si | xA, xV, C = 2) = p(xi | si) p(si) / ∫ p(xi | si) p(si) dsi.

The former is the same as in the case of mandatory integration with a prior, the latter is simply the unimodal posterior in the presence of a prior. Based on the posterior distribution on a given trial, p(si | xV, xA), an estimate has to be created. For this, we use a sum-squared-error cost function,

Cost = p(C = 1 | xV, xA) (ŝ − s)² + p(C = 2 | xV, xA) (ŝ − s_{V or A})².

Then the best estimate is the mean of the posterior distribution, for instance for the visual estimation:

ŝV = p(C = 1 | xA, xV) ŝ_{V,C=1} + p(C = 2 | xA, xV) ŝ_{V,C=2},

where

ŝ_{V,C=1} = (xV/σV² + xA/σA² + xP/σP²) / (1/σV² + 1/σA² + 1/σP²)  and  ŝ_{V,C=2} = (xV/σV² + xP/σP²) / (1/σV² + 1/σP²).

If p_common equals 0 or 1, this estimate reduces to one of the conditioned estimates and is linear in xV and xA. If 0 < p_common < 1, the estimate is a nonlinear combination of xV and xA, because of the functional form of p(C | xV, xA). The response distributions, that is the distributions of ŝV and ŝA given sV and sA over many trials, now cannot be identified with the posterior distribution on a single trial and cannot be computed analytically either. The correct way to obtain the response distribution is to simulate an experiment numerically. Note that the causal inference model above can also be cast in the form of a bisensory stimulus prior by integrating out the latent variable C, with

p(sA, sV) = p(C = 1) δ(sA − sV) p(sA) + p(C = 2) p(sA) p(sV).

However, in addition to justifying the form of the interaction between the cues, the causal inference model has the advantage of being based on a generative model that well formalizes salient properties of the world, and it thereby also allows to predict judgments of unity.
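Before turning to the experiment, here is a compact numerical sketch of the estimator derived above — the analytic likelihoods p(xV, xA | C), the posterior over C, and the model-averaged estimates — assuming a zero-mean spatial prior (so xP = 0); the parameter and stimulus values are illustrative, not the paper's:

```python
import numpy as np

def causal_inference_estimates(x_V, x_A, p_common, sigma_V, sigma_A, sigma_P):
    """Model-averaged position estimates for one trial (zero-mean spatial prior, x_P = 0)."""
    vV, vA, vP = sigma_V**2, sigma_A**2, sigma_P**2

    # Likelihood of the cues under a common cause (C = 1).
    denom1 = vV * vA + vV * vP + vA * vP
    like_c1 = np.exp(-0.5 * ((x_V - x_A)**2 * vP + x_V**2 * vA + x_A**2 * vV) / denom1) \
              / (2 * np.pi * np.sqrt(denom1))

    # Likelihood under independent causes (C = 2).
    like_c2 = np.exp(-0.5 * (x_V**2 / (vV + vP) + x_A**2 / (vA + vP))) \
              / (2 * np.pi * np.sqrt((vV + vP) * (vA + vP)))

    # Posterior probability of a common cause.
    post_c1 = like_c1 * p_common / (like_c1 * p_common + like_c2 * (1 - p_common))

    # Conditional posterior means (x_P = 0 because the prior is centered at zero).
    s_c1 = (x_V / vV + x_A / vA) / (1 / vV + 1 / vA + 1 / vP)
    s_V_c2 = (x_V / vV) / (1 / vV + 1 / vP)
    s_A_c2 = (x_A / vA) / (1 / vA + 1 / vP)

    # Model-averaged estimates minimizing expected squared error.
    s_V_hat = post_c1 * s_c1 + (1 - post_c1) * s_V_c2
    s_A_hat = post_c1 * s_c1 + (1 - post_c1) * s_A_c2
    return s_V_hat, s_A_hat, post_c1

# Response distributions are obtained by simulating many trials for given true sources.
rng = np.random.default_rng(2)
s_V_true, s_A_true = 5.0, -5.0
est = [causal_inference_estimates(rng.normal(s_V_true, 2.0), rng.normal(s_A_true, 9.0),
                                  p_common=0.28, sigma_V=2.0, sigma_A=9.0, sigma_P=12.0)
       for _ in range(10000)]
print("mean visual / auditory estimates:",
      np.mean([e[0] for e in est]), np.mean([e[1] for e in est]))
```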
3 Model performance and comparison

To examine the performance of the causal inference model and to compare it to previous models, we performed a human psychophysics experiment in which we adopted the same dual-report paradigm as was used in [11]. Observers were simultaneously presented with a brief visual and also an auditory stimulus, each of which could originate from one of five locations on an imaginary horizontal line (−10°, −5°, 0°, 5°, or 10° with respect to the fixation point). Auditory stimuli were 32 ms of white noise filtered through an individually calibrated head related transfer function (HRTF) and presented through a pair of headphones, whereas the visual stimuli were high contrast Gabors on a noisy background presented on a 21-inch CRT monitor. Observers had to report by means of a key press (1-5) the perceived positions of both the visual and the auditory stimulus. Each combination of locations was presented with the same frequency over the course of the experiment. In this way, for each condition, visual and auditory response histograms were obtained. We obtained response distributions for each of the three models described above by numerical simulation. On each trial, estimation is followed by a step in which the key corresponding to the position closest to the best estimate is selected. The simulated histograms obtained in this way were compared to the measured response frequencies of all subjects by computing the R² statistic. The parameters in the causal inference model were optimized using fminsearch in MATLAB to maximize R². The best combination of parameters yielded an R² of 0.97. The response frequencies are depicted in Fig. 2. The bisensory prior models also explain most of the variance, with R² = 0.96 for the Roach model and R² = 0.91 for the Bresciani model. This shows that it is possible to model cue combination for large disparities well using such models.

(Figure 2: A comparison between subjects' performance and the causal inference model. The blue lines indicate the frequency of subjects' responses to visual stimuli, the red lines the responses to auditory stimuli; model predictions are indicated by the red and blue dotted lines. Each set of lines is one set of audio-visual stimulus conditions, including no-vision and no-audio conditions; rows of conditions indicate a constant visual stimulus, columns a constant audio stimulus.)

3.1 Model comparison

To facilitate quantitative comparison with other models, we now fit the parameters of each model (see the note on free parameters in Section 3.2) to individual subject data, maximizing the likelihood of the model, i.e., the probability of the response frequencies under the model. The causal inference model fits human data better than the other models. Compared to the best fit of the causal inference model, the Bresciani model has a maximal log likelihood ratio (base e) of the data of −22 ± 6 (mean ± s.e.m. over subjects), and the Roach model has a maximal log likelihood ratio of the data of −18 ± 6. A causal inference model that maximizes the probability of being correct instead of minimizing the mean squared error has a maximal log likelihood ratio of −18 ± 3. These values are considered decisive evidence in favor of the causal inference model that minimizes the mean squared error (for details, see [25]). The parameter values found in the likelihood optimization of the causal model are as follows: p_common = 0.28 ± 0.05, σV = 2.14 ± 0.22°, σA = 9.2 ± 1.1°, σP = 12.3 ± 1.1° (mean ± s.e.m. over subjects). We see that there is a relatively low prior probability of a common cause. In this paradigm, auditory localization is considerably less precise than visual localization. Also, there is a weak prior for central locations.
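A rough sketch of the kind of fitting loop described in this section, assuming SciPy's Nelder–Mead in place of MATLAB's fminsearch, placeholder "observed" counts instead of real subject data, and reusing the causal_inference_estimates function from the sketch above; this is illustrative rather than the authors' actual pipeline:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
locations = np.array([-10.0, -5.0, 0.0, 5.0, 10.0])

# Placeholder "observed" counts: visual location x auditory location x visual key x auditory key.
observed = rng.integers(0, 20, size=(5, 5, 5, 5))

def simulate_histograms(params, n_trials=200):
    """Monte-Carlo response histograms under the causal-inference model (illustrative only)."""
    p_common, sigma_V, sigma_A, sigma_P = params
    hist = np.zeros_like(observed, dtype=float)
    for i, sV in enumerate(locations):
        for j, sA in enumerate(locations):
            x_V = rng.normal(sV, sigma_V, n_trials)
            x_A = rng.normal(sA, sigma_A, n_trials)
            sV_hat, sA_hat, _ = causal_inference_estimates(
                x_V, x_A, p_common, sigma_V, sigma_A, sigma_P)  # from the earlier sketch
            kV = np.abs(sV_hat[:, None] - locations).argmin(axis=1)  # nearest response key, visual
            kA = np.abs(sA_hat[:, None] - locations).argmin(axis=1)  # nearest response key, auditory
            np.add.at(hist[i, j], (kV, kA), 1.0)
    return hist / n_trials

def neg_log_likelihood(params):
    p = simulate_histograms(params) + 1e-6            # smooth to avoid log(0)
    p /= p.sum(axis=(2, 3), keepdims=True)
    return -np.sum(observed * np.log(p))

start = np.array([0.3, 2.0, 9.0, 12.0])               # p_common, sigma_V, sigma_A, sigma_P
fit = minimize(neg_log_likelihood, start, method="Nelder-Mead",
               options={"maxiter": 100, "xatol": 1e-2, "fatol": 1.0})
print(fit.x)
```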
3.2 Localization bias

A useful quantity to gain more insight into the structure of multisensory data is the cross-modal bias. In our experiment, relative auditory bias is defined as the difference between the mean auditory estimate in a given condition and the real auditory position, divided by the difference between the real visual position and the real auditory position in this condition. If the influence of vision on the auditory estimate is strong, then the relative auditory bias will be high (close to one). It is well-known that bias decreases with spatial disparity and our experiment is no exception (solid line in Fig. 3; data were combined between positive and negative disparities). It can easily be shown that a traditional cue integration model would predict a bias equal to (1 + σV²/σA²)⁻¹, which would be close to 1 and independent of disparity, unlike the data. This shows that a mandatory integration model is an insufficient model of multisensory interactions. We used the individual subject fittings from above and averaged the auditory bias values obtained from those fits (i.e., we did not fit the bias data themselves). Fits are shown in Fig. 3 (dashed lines). We applied a paired t-test to the differences between the 5° and 20° disparity conditions (model-subject comparison). Using a double-sided test, the null hypothesis that the difference between the bias in the 5° and 20° conditions is correctly predicted by each model is rejected for the Bresciani model (p < 0.002) and the Roach model (p < 0.042) and accepted for the causal inference model (p > 0.17). Alternatively, with a single-sided test, the hypothesis is rejected for the Bresciani model (p < 0.001) and the Roach model (p < 0.021) and accepted for the causal inference model (p > 0.9).

(Figure 3: Auditory bias (%) as a function of spatial disparity (deg.). Solid blue line: data. Red: causal inference model. Green: model by Roach et al. [23]. Purple: model by Bresciani et al. [22]. Models were optimized on response frequencies, as in Fig. 2, not on the bias data.)

The reason that the Bresciani model fares worst is that its prior distribution does not include a component that corresponds to independent causes. (Note on free parameters: the Roach et al. model has four free parameters (ω, σV, σA, σ_coupling), the Bresciani et al. model has three (σV, σA, σ_coupling), and the causal inference model has four (p_common, σV, σA, σP). We do not consider the Shams et al. model here, since it has many more parameters and it is not immediately clear how in this model the erroneous identification of the posterior with the response distribution can be corrected.) On the contrary, the prior used in the Roach model contains two terms, one term that is independent of the disparity and one term that decreases with increasing disparity. It is thus functionally somewhat similar to the causal inference model.
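A small sketch of the relative auditory bias defined above, together with the constant bias predicted by mandatory integration; the mean-estimate input is hypothetical, while the σ values are the fitted means quoted in Section 3.1:

```python
def relative_auditory_bias(mean_auditory_estimate, s_A, s_V):
    """(mean auditory estimate - true auditory position) / (true visual - true auditory)."""
    return (mean_auditory_estimate - s_A) / (s_V - s_A)

# Hypothetical example: auditory source at 0 deg, visual source at 10 deg,
# and a mean auditory estimate of 3.5 deg gives a relative bias of 0.35.
print(relative_auditory_bias(3.5, s_A=0.0, s_V=10.0))

# Mandatory integration predicts bias = (1 + sigma_V**2 / sigma_A**2)**-1, independent of disparity.
sigma_V, sigma_A = 2.14, 9.2
print(1.0 / (1.0 + sigma_V**2 / sigma_A**2))
```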
4 Discussion

We have argued that any model of multisensory perception should account not only for situations of small, but also of large conflict. In these situations, segregation is more likely, in which the two stimuli are not perceived to have the same cause. Even when segregation occurs, the two stimuli can still influence each other. We compared three Bayesian models designed to account for situations of large conflict by applying them to auditory-visual spatial localization data. We pointed out a common mistake: for non-Gaussian bisensory priors without mandatory integration, the response distribution can no longer be identified with the posterior distribution. After correct implementation of the three models, we found that the causal inference model is superior to the models with ad hoc bisensory priors. This is expected, as the nervous system actually needs to solve the problem of deciding which stimuli have a common cause and which stimuli are unrelated. We have seen that multisensory perception is a suitable tool for studying causal inference. However, the causal inference model also has the potential to quantitatively explain a number of other perceptual phenomena, including perceptual grouping and binding, as well as within-modality cue combination [27, 28]. Causal inference is a universal problem: whenever the brain has multiple pieces of information it must decide if they relate to one another or are independent. As the causal inference model describes how the brain processes probabilistic sensory information, the question arises about the neural basis of these processes. Neural populations encode probability distributions over stimuli through Bayes' rule, a type of coding known as probabilistic population coding. Recent work has shown how the optimal cue combination assuming a common cause can be implemented in probabilistic population codes through simple linear operations on neural activities [29]. This framework makes essential use of the structure of neural variability and leads to physiological predictions for activity in areas that combine multisensory input, such as the superior colliculus. Computational mechanisms for causal inference are expected to have a neural substrate that generalizes these linear operations on population activities. A neural implementation of the causal inference model will open the door to a complete neural theory of multisensory perception.

References

[1] H. L. Pick, D. H. Warren, and J. C. Hay. Sensory conflict in judgements of spatial direction. Percept. Psychophys., 6:203–205, 1969.
[2] D. H. Warren, R. B. Welch, and T. J. McCarthy. The role of visual-auditory "compellingness" in the ventriloquism effect: implications for transitivity among the spatial senses. Percept Psychophys, 30(6):557–64, 1981.
[3] D. Alais and D. Burr. The ventriloquist effect results from near-optimal bimodal integration. Curr Biol, 14(3):257–62, 2004.
[4] R. A. Jacobs. Optimal integration of texture and motion cues to depth. Vision Res, 39(21):3621–9, 1999.
[5] R. J. van Beers, A. C. Sittig, and J. J. Gon. Integration of proprioceptive and visual position-information: An experimentally supported model. J Neurophysiol, 81(3):1355–64, 1999.
[6] D. H. Warren and W. T. Cleaves. Visual-proprioceptive interaction under large amounts of conflict. J Exp Psychol, 90(2):206–14, 1971.
[7] C. E. Jack and W. R. Thurlow. Effects of degree of visual association and angle of displacement on the "ventriloquism" effect. Percept Mot Skills, 37(3):967–79, 1973.
[8] G. H. Recanzone. Auditory influences on visual temporal rate perception. J Neurophysiol, 89(2):1078–93, 2003.
[9] J. P. Bresciani, M. O. Ernst, K. Drewing, G. Bouyer, V. Maury, and A. Kheddar. Feeling what you hear: auditory signals can modulate tactile tap perception. Exp Brain Res, 162(2):172–80, 2005.
[10] R. Gepshtein, P. Leiderman, L. Genosar, and D. Huppert. Testing the three step excited state proton transfer model by the effect of an excess proton.
J Phys Chem A, 109(42):9674–84, 2005.
[11] L. Shams, W. J. Ma, and U. Beierholm. Sound-induced flash illusion as an optimal percept. Neuroreport, 16(17):1923–7, 2005.
[12] G. Thomas. Experimental study of the influence of vision on sound localisation. J Exp Psychol, 28:167–177, 1941.
[13] W. R. Thurlow and C. E. Jack. Certain determinants of the "ventriloquism effect". Percept Mot Skills, 36(3):1171–84, 1973.
[14] C. S. Choe, R. B. Welch, R. M. Gilford, and J. F. Juola. The "ventriloquist effect": visual dominance or response bias. Perception and Psychophysics, 18:55–60, 1975.
[15] R. I. Bermant and R. B. Welch. Effect of degree of separation of visual-auditory stimulus and eye position upon spatial interaction of vision and audition. Percept Mot Skills, 42(43):487–93, 1976.
[16] R. B. Welch and D. H. Warren. Immediate perceptual response to intersensory discrepancy. Psychol Bull, 88(3):638–67, 1980.
[17] P. Bertelson and M. Radeau. Cross-modal bias and perceptual fusion with auditory-visual spatial discordance. Percept Psychophys, 29(6):578–84, 1981.
[18] P. Bertelson, F. Pavani, E. Ladavas, J. Vroomen, and B. de Gelder. Ventriloquism in patients with unilateral visual neglect. Neuropsychologia, 38(12):1634–42, 2000.
[19] D. A. Slutsky and G. H. Recanzone. Temporal and spatial dependency of the ventriloquism effect. Neuroreport, 12(1):7–10, 2001.
[20] J. Lewald, W. H. Ehrenstein, and R. Guski. Spatio-temporal constraints for auditory-visual integration. Behav Brain Res, 121(1-2):69–79, 2001.
[21] M. T. Wallace, G. E. Roberson, W. D. Hairston, B. E. Stein, J. W. Vaughan, and J. A. Schirillo. Unifying multisensory signals across time and space. Exp Brain Res, 158(2):252–8, 2004.
[22] J. P. Bresciani, F. Dammeier, and M. O. Ernst. Vision and touch are automatically integrated for the perception of sequences of events. J Vis, 6(5):554–64, 2006.
[23] N. W. Roach, J. Heron, and P. V. McGraw. Resolving multisensory conflict: a strategy for balancing the costs and benefits of audio-visual integration. Proc Biol Sci, 273(1598):2159–68, 2006.
[24] K. P. Kording and D. M. Wolpert. Bayesian decision theory in sensorimotor control. Trends Cogn Sci, 2006.
[25] K. P. Kording, U. Beierholm, W. J. Ma, S. Quartz, J. Tenenbaum, and L. Shams. Causal inference in multisensory perception. PLoS ONE, 2(9):e943, 2007.
[26] Z. Ghahramani. Computation and psychophysics of sensorimotor integration. PhD thesis, Massachusetts Institute of Technology, 1995.
[27] D. C. Knill. Mixture models and the probabilistic structure of depth cues. Vision Res, 43(7):831–54, 2003.
[28] D. C. Knill. Robust cue integration: A Bayesian model and evidence from cue conflict studies with stereoscopic and figure cues to slant. Journal of Vision, 7(7):2–24.
[29] W. J. Ma, J. M. Beck, P. E. Latham, and A. Pouget. Bayesian inference with probabilistic population codes. Nat Neurosci, 9(11):1432–8, 2006.

3 0.53780246 198 nips-2007-The Noisy-Logical Distribution and its Application to Causal Inference

Author: Hongjing Lu, Alan L. Yuille

Abstract: We describe a novel noisy-logical distribution for representing the distribution of a binary output variable conditioned on multiple binary input variables. The distribution is represented in terms of noisy-or’s and noisy-and-not’s of causal features which are conjunctions of the binary inputs. The standard noisy-or and noisy-and-not models, used in causal reasoning and artificial intelligence, are special cases of the noisy-logical distribution. We prove that the noisy-logical distribution is complete in the sense that it can represent all conditional distributions provided a sufficient number of causal factors are used. We illustrate the noisy-logical distribution by showing that it can account for new experimental findings on how humans perform causal reasoning in complex contexts. We speculate on the use of the noisy-logical distribution for causal reasoning and artificial intelligence.
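Since the abstract describes the noisy-or and noisy-and-not building blocks only verbally, here is a minimal sketch of their standard conditional probabilities for binary causes (my own illustration; the paper's exact parameterization may differ):

```python
import numpy as np

def noisy_or(causes, weights):
    """P(E = 1 | causes): each active cause i independently produces the effect with prob weights[i]."""
    causes = np.asarray(causes, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return 1.0 - np.prod((1.0 - weights) ** causes)

def noisy_and_not(gen_causes, gen_weights, prev_causes, prev_weights):
    """Noisy-and-not: the effect occurs if produced by a generative cause and not blocked
    by any active preventive cause (each blocking independently with its own probability)."""
    p_on = noisy_or(gen_causes, gen_weights)
    p_not_blocked = np.prod((1.0 - np.asarray(prev_weights))
                            ** np.asarray(prev_causes, dtype=float))
    return p_on * p_not_blocked

# Example: two generative causes (both present) and one preventive cause (present).
print(noisy_or([1, 1], [0.8, 0.4]))                   # 1 - 0.2 * 0.6 = 0.88
print(noisy_and_not([1, 1], [0.8, 0.4], [1], [0.5]))  # 0.88 * 0.5 = 0.44
```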

4 0.45897186 48 nips-2007-Collective Inference on Markov Models for Modeling Bird Migration

Author: M. A. S. Elmohamed, Dexter Kozen, Daniel R. Sheldon

Abstract: We investigate a family of inference problems on Markov models, where many sample paths are drawn from a Markov chain and partial information is revealed to an observer who attempts to reconstruct the sample paths. We present algorithms and hardness results for several variants of this problem which arise by revealing different information to the observer and imposing different requirements for the reconstruction of sample paths. Our algorithms are analogous to the classical Viterbi algorithm for Hidden Markov Models, which finds the single most probable sample path given a sequence of observations. Our work is motivated by an important application in ecology: inferring bird migration paths from a large database of observations. 1
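For background on the analogy drawn above, a standard log-domain Viterbi decoder for an HMM looks roughly like this (a generic sketch, not the authors' collective-inference algorithms):

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit, observations):
    """Most probable hidden state path given observations (classical HMM Viterbi, log domain)."""
    n_states = log_init.shape[0]
    T = len(observations)
    delta = np.full((T, n_states), -np.inf)    # best log-prob of any path ending in each state
    back = np.zeros((T, n_states), dtype=int)  # backpointers
    delta[0] = log_init + log_emit[:, observations[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans     # indexed by (previous state, next state)
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emit[:, observations[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Tiny example with 2 hidden states and 3 observation symbols.
log_init = np.log([0.6, 0.4])
log_trans = np.log([[0.7, 0.3], [0.4, 0.6]])
log_emit = np.log([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi(log_init, log_trans, log_emit, [0, 1, 2, 2]))
```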

5 0.39063409 20 nips-2007-Adaptive Embedded Subgraph Algorithms using Walk-Sum Analysis

Author: Venkat Chandrasekaran, Alan S. Willsky, Jason K. Johnson

Abstract: We consider the estimation problem in Gaussian graphical models with arbitrary structure. We analyze the Embedded Trees algorithm, which solves a sequence of problems on tractable subgraphs thereby leading to the solution of the estimation problem on an intractable graph. Our analysis is based on the recently developed walk-sum interpretation of Gaussian estimation. We show that non-stationary iterations of the Embedded Trees algorithm using any sequence of subgraphs converge in walk-summable models. Based on walk-sum calculations, we develop adaptive methods that optimize the choice of subgraphs used at each iteration with a view to achieving maximum reduction in error. These adaptive procedures provide a significant speedup in convergence over stationary iterative methods, and also appear to converge in a larger class of models. 1

6 0.38823491 150 nips-2007-Optimal models of sound localization by barn owls

7 0.36914185 78 nips-2007-Efficient Principled Learning of Thin Junction Trees

8 0.33623639 44 nips-2007-Catching Up Faster in Bayesian Model Selection and Model Averaging

9 0.33435655 27 nips-2007-Anytime Induction of Cost-sensitive Trees

10 0.32726294 174 nips-2007-Selecting Observations against Adversarial Objectives

11 0.29850775 142 nips-2007-Non-parametric Modeling of Partially Ranked Data

12 0.28336462 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs

13 0.28008193 144 nips-2007-On Ranking in Survival Analysis: Bounds on the Concordance Index

14 0.27417257 66 nips-2007-Density Estimation under Independent Similarly Distributed Sampling Assumptions

15 0.25699753 176 nips-2007-Sequential Hypothesis Testing under Stochastic Deadlines

16 0.23995256 130 nips-2007-Modeling Natural Sounds with Modulation Cascade Processes

17 0.21217473 196 nips-2007-The Infinite Gamma-Poisson Feature Model

18 0.20753595 197 nips-2007-The Infinite Markov Model

19 0.20746174 31 nips-2007-Bayesian Agglomerative Clustering with Coalescents

20 0.20647028 116 nips-2007-Learning the structure of manifolds using random projections


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.021), (13, 0.033), (16, 0.012), (18, 0.016), (21, 0.03), (31, 0.018), (34, 0.015), (35, 0.024), (47, 0.054), (49, 0.012), (83, 0.091), (85, 0.014), (87, 0.011), (90, 0.553)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.91396314 8 nips-2007-A New View of Automatic Relevance Determination

Author: David P. Wipf, Srikantan S. Nagarajan

Abstract: Automatic relevance determination (ARD) and the closely-related sparse Bayesian learning (SBL) framework are effective tools for pruning large numbers of irrelevant features leading to a sparse explanatory subset. However, popular update rules used for ARD are either difficult to extend to more general problems of interest or are characterized by non-ideal convergence properties. Moreover, it remains unclear exactly how ARD relates to more traditional MAP estimation-based methods for learning sparse representations (e.g., the Lasso). This paper furnishes an alternative means of expressing the ARD cost function using auxiliary functions that naturally addresses both of these issues. First, the proposed reformulation of ARD can naturally be optimized by solving a series of re-weighted ℓ1 problems. The result is an efficient, extensible algorithm that can be implemented using standard convex programming toolboxes and is guaranteed to converge to a local minimum (or saddle point). Secondly, the analysis reveals that ARD is exactly equivalent to performing standard MAP estimation in weight space using a particular feature- and noise-dependent, non-factorial weight prior. We then demonstrate that this implicit prior maintains several desirable advantages over conventional priors with respect to feature selection. Overall these results suggest alternative cost functions and update procedures for selecting features and promoting sparse solutions in a variety of general situations. In particular, the methodology readily extends to handle problems such as non-negative sparse coding and covariance component estimation. 1
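As an illustration of the re-weighted ℓ1 idea mentioned in the abstract, here is a generic iteratively re-weighted Lasso sketch (my own simplifications: a fixed regularization weight and scikit-learn's Lasso as the inner solver; this is not the authors' exact update rule):

```python
import numpy as np
from sklearn.linear_model import Lasso

def reweighted_l1(X, y, lam=0.1, n_outer=10, eps=1e-6):
    """Iteratively re-weighted L1 regression: each outer step solves a weighted Lasso,
    implemented by rescaling the columns of X by the current per-coefficient weights."""
    n, d = X.shape
    u = np.ones(d)                      # per-coefficient penalty weights
    w = np.zeros(d)
    for _ in range(n_outer):
        Xs = X / u                      # column rescaling turns the weighted L1 into a plain L1
        model = Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(Xs, y)
        w = model.coef_ / u             # map back to the original coordinates
        u = 1.0 / (np.abs(w) + eps)     # heavier penalty on coefficients that are already small
    return w

# Toy sparse-recovery example with 3 relevant features out of 30.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 30))
w_true = np.zeros(30)
w_true[[2, 7, 19]] = [1.5, -2.0, 1.0]
y = X @ w_true + 0.05 * rng.normal(size=100)
w_est = reweighted_l1(X, y)
print(np.flatnonzero(np.abs(w_est) > 0.1))   # indices of the recovered features
```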

2 0.90939885 85 nips-2007-Experience-Guided Search: A Theory of Attentional Control

Author: David Baldwin, Michael C. Mozer

Abstract: People perform a remarkable range of tasks that require search of the visual environment for a target item among distractors. The Guided Search model (Wolfe, 1994, 2007), or GS, is perhaps the best developed psychological account of human visual search. To prioritize search, GS assigns saliency to locations in the visual field. Saliency is a linear combination of activations from retinotopic maps representing primitive visual features. GS includes heuristics for setting the gain coefficient associated with each map. Variants of GS have formalized the notion of optimization as a principle of attentional control (e.g., Baldwin & Mozer, 2006; Cave, 1999; Navalpakkam & Itti, 2006; Rao et al., 2002), but every GS-like model must be ’dumbed down’ to match human data, e.g., by corrupting the saliency map with noise and by imposing arbitrary restrictions on gain modulation. We propose a principled probabilistic formulation of GS, called Experience-Guided Search (EGS), based on a generative model of the environment that makes three claims: (1) Feature detectors produce Poisson spike trains whose rates are conditioned on feature type and whether the feature belongs to a target or distractor; (2) the environment and/or task is nonstationary and can change over a sequence of trials; and (3) a prior specifies that features are more likely to be present for target than for distractors. Through experience, EGS infers latent environment variables that determine the gains for guiding search. Control is thus cast as probabilistic inference, not optimization. We show that EGS can replicate a range of human data from visual search, including data that GS does not address. 1

same-paper 3 0.88545585 119 nips-2007-Learning with Tree-Averaged Densities and Distributions

Author: Sergey Kirshner

Abstract: We utilize the ensemble of trees framework, a tractable mixture over superexponential number of tree-structured distributions [1], to develop a new model for multivariate density estimation. The model is based on a construction of treestructured copulas – multivariate distributions with uniform on [0, 1] marginals. By averaging over all possible tree structures, the new model can approximate distributions with complex variable dependencies. We propose an EM algorithm to estimate the parameters for these tree-averaged models for both the real-valued and the categorical case. Based on the tree-averaged framework, we propose a new model for joint precipitation amounts data on networks of rain stations. 1

4 0.87523592 184 nips-2007-Stability Bounds for Non-i.i.d. Processes

Author: Mehryar Mohri, Afshin Rostamizadeh

Abstract: The notion of algorithmic stability has been used effectively in the past to derive tight generalization bounds. A key advantage of these bounds is that they are designed for specific learning algorithms, exploiting their particular properties. But, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed (i.i.d.). In many machine learning applications, however, this assumption does not hold. The observations received by the learning algorithm often have some inherent temporal dependence, which is clear in system diagnosis or time series prediction problems. This paper studies the scenario where the observations are drawn from a stationary mixing sequence, which implies a dependence between observations that weakens over time. It proves novel stability-based generalization bounds that hold even in this more general setting. These bounds strictly generalize the bounds given in the i.i.d. case. It also illustrates their application in the case of several general classes of learning algorithms, including Support Vector Regression and Kernel Ridge Regression.

5 0.78697169 182 nips-2007-Sparse deep belief net model for visual area V2

Author: Honglak Lee, Chaitanya Ekanadham, Andrew Y. Ng

Abstract: Motivated in part by the hierarchical organization of the cortex, a number of algorithms have recently been proposed that try to learn hierarchical, or “deep,” structure from unlabeled data. While several authors have formally or informally compared their algorithms to computations performed in visual area V1 (and the cochlea), little attempt has been made thus far to evaluate these algorithms in terms of their fidelity for mimicking computations at deeper levels in the cortical hierarchy. This paper presents an unsupervised learning model that faithfully mimics certain properties of visual area V2. Specifically, we develop a sparse variant of the deep belief networks of Hinton et al. (2006). We learn two layers of nodes in the network, and demonstrate that the first layer, similar to prior work on sparse coding and ICA, results in localized, oriented edge filters, similar to the Gabor functions known to model V1 cell receptive fields. Further, the second layer in our model encodes correlations of the first layer responses in the data. Specifically, it picks up both colinear (“contour”) features as well as corners and junctions. More interestingly, in a quantitative comparison, the encoding of these more complex “corner” features matches well with the results from Ito & Komatsu’s study of biological V2 responses. This suggests that our sparse variant of deep belief networks holds promise for modeling more higher-order features. 1

6 0.54094815 66 nips-2007-Density Estimation under Independent Similarly Distributed Sampling Assumptions

7 0.53966677 202 nips-2007-The discriminant center-surround hypothesis for bottom-up saliency

8 0.51342899 156 nips-2007-Predictive Matrix-Variate t Models

9 0.48629457 128 nips-2007-Message Passing for Max-weight Independent Set

10 0.47519255 113 nips-2007-Learning Visual Attributes

11 0.47450531 96 nips-2007-Heterogeneous Component Analysis

12 0.47346753 185 nips-2007-Stable Dual Dynamic Programming

13 0.47267023 82 nips-2007-Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization

14 0.46877366 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models

15 0.46570989 63 nips-2007-Convex Relaxations of Latent Variable Training

16 0.4629803 155 nips-2007-Predicting human gaze using low-level saliency combined with face detection

17 0.4605642 122 nips-2007-Locality and low-dimensions in the prediction of natural experience from fMRI

18 0.45893228 49 nips-2007-Colored Maximum Variance Unfolding

19 0.45133793 47 nips-2007-Collapsed Variational Inference for HDP

20 0.45027155 163 nips-2007-Receding Horizon Differential Dynamic Programming