nips nips2007 nips2007-150 knowledge-graph by maker-knowledge-mining

150 nips-2007-Optimal models of sound localization by barn owls


Source: pdf

Author: Brian J. Fischer

Abstract: Sound localization by barn owls is commonly modeled as a matching procedure where localization cues derived from auditory inputs are compared to stored templates. While the matching models can explain properties of neural responses, no model explains how the owl resolves spatial ambiguity in the localization cues to produce accurate localization for sources near the center of gaze. Here, I examine two models for the barn owl’s sound localization behavior. First, I consider a maximum likelihood estimator in order to further evaluate the cue matching model. Second, I consider a maximum a posteriori estimator to test whether a Bayesian model with a prior that emphasizes directions near the center of gaze can reproduce the owl’s localization behavior. I show that the maximum likelihood estimator can not reproduce the owl’s behavior, while the maximum a posteriori estimator is able to match the behavior. This result suggests that the standard cue matching model will not be sufficient to explain sound localization behavior in the barn owl. The Bayesian model provides a new framework for analyzing sound localization in the barn owl and leads to predictions about the owl’s localization behavior.

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 Optimal models of sound localization by barn owls Brian J. [sent-1, score-0.926]

2 edu Abstract Sound localization by barn owls is commonly modeled as a matching procedure where localization cues derived from auditory inputs are compared to stored templates. [sent-3, score-1.438]

3 While the matching models can explain properties of neural responses, no model explains how the owl resolves spatial ambiguity in the localization cues to produce accurate localization for sources near the center of gaze. [sent-4, score-1.723]

4 Here, I examine two models for the barn owl’s sound localization behavior. [sent-5, score-0.846]

5 Second, I consider a maximum a posteriori estimator to test whether a Bayesian model with a prior that emphasizes directions near the center of gaze can reproduce the owl’s localization behavior. [sent-7, score-0.922]

6 This result suggests that the standard cue matching model will not be sufficient to explain sound localization behavior in the barn owl. [sent-9, score-0.939]

7 The Bayesian model provides a new framework for analyzing sound localization in the barn owl and leads to predictions about the owl’s localization behavior. [sent-10, score-1.785]

8 Owls localize broadband noise signals with great accuracy for source directions near the center of gaze [1]. [sent-12, score-0.518]

9 However, localization errors increase as source directions move to the periphery, consistent with an underestimate of the source direction [1]. [sent-13, score-0.742]

10 Behavioral experiments show that the barn owl uses the interaural time difference (ITD) for localization in the horizontal dimension and the interaural level difference (ILD) for localization in the vertical dimension [2]. [sent-14, score-1.964]

11 Direct measurements of the sounds received at the ears for sources at different locations in space show that disparate directions are associated with very similar localization cues. [sent-15, score-0.552]

12 Specifically, there is a similarity between ILD and ITD cues for directions near the center of gaze and directions with eccentric elevations on the vertical plane. [sent-16, score-0.779]

13 How does the owl resolve this ambiguity in the localization cues to produce accurate localization for sound sources near the center of gaze? [sent-17, score-1.748]

14 Theories regarding the use of localization cues by the barn owl are drawn from the extensive knowledge of processing in the barn owl’s auditory system. [sent-18, score-1.939]

15 Neurophysiological and anatomical studies show that the barn owl’s auditory system contains specialized circuitry that is devoted to extracting spectral ILD and ITD cues and processing them to derive source direction information [2]. [sent-19, score-0.791]

16 It has been suggested that a spectral matching operation between ILD and ITD cues computed from auditory inputs and preferred ILD and ITD spectra associated with spatially selective auditory neurons underlies the derivation of spatial information from the auditory cues [3–6]. [sent-20, score-0.693]

17 The spectral matching models reproduce aspects of neural responses, but none reproduces the sound localization behavior of the barn owl. [sent-21, score-0.997]

18 In particular, the spectral matching models do not describe how the owl resolves ambiguities in the localization cues. [sent-22, score-1.057]

19 In addition to spectral matching of localization cues, it is possible 1 that the owl incorporates prior experience or beliefs into the process of deriving direction estimates from the auditory input signals. [sent-23, score-1.201]

20 These two approaches to sound localization can be formalized using the language of estimation theory as maximum likelihood (ML) and Bayesian solutions, respectively. [sent-24, score-0.534]

21 Here, I examine two models for the barn owl’s sound localization behavior in order to further evaluate the spectral matching model and to test whether a Bayesian model with a prior that emphasizes directions near the center of gaze can reproduce the owl’s localization behavior. [sent-25, score-1.792]

22 I begin by viewing the sound localization problem as a statistical estimation problem. [sent-26, score-0.48]

23 Maximum likelihood and maximum a posteriori (MAP) solutions to the estimation problem are compared with the localization behavior of a barn owl in a head turning task. [sent-27, score-1.512]

24 2 Observation model To define the localization problem, we must specify an observation model that describes the information the owl uses to produce a direction estimate. [sent-28, score-0.993]

25 Neurophysiological and behavioral experiments suggest that the barn owl derives direction estimates from ILD and ITD cues that are computed at an array of frequencies [2, 7, 8]. [sent-29, score-1.222]

26 Here I consider a model where the observation made by the owl is given by the ILD and IPD spectra derived from barn owl head-related transfer functions (HRTFs) after corruption with additive noise. [sent-31, score-1.584]

27 The likelihood function will have peaks at directions where the expected spectral cues ILDθ,φ and IPDθ,φ are near the observed values rILD and rIPD . [sent-51, score-0.458]

28 2 3 Model performance measure I evaluate maximum likelihood and maximum a posteriori methods for estimating the source direction from the observed ILD and IPD cues by computing an expected localization error and comparing the results to an owl’s behavior. [sent-52, score-0.744]

29 This measure of estimation error is directly compared to the behavioral performance of a barn owl in a head turning localization task [1]. [sent-54, score-1.465]

30 The error is computed using HRTFs for two barn owls [10] and is calculated for direci=1 tions in the frontal hemisphere with 5◦ increments in azimuth and elevation, as defined using double polar coordinates. [sent-56, score-0.527]

31 4 Maximum likelihood estimate The maximum likelihood direction estimate is derived from the observed noisy ILD and IPD cues by finding the source direction that maximizes the likelihood function, yielding ˆ ˆ (θML (r), φML (r)) = arg max pr|Θ,Φ (r|θ, φ). [sent-57, score-0.456]

32 The direction with associated cues that are closest to the observed cues is designated as the estimate. [sent-60, score-0.358]

33 This estimator is of particular interest because of the claim that salience in the neural map of auditory space in the barn owl can be described by a spectral cue matching operation [3, 4, 6]. [sent-61, score-1.247]

34 The maximum likelihood estimator was unable to reproduce the owl’s localization behavior. [sent-62, score-0.475]

35 For noise variances large enough that the error increased at peripheral directions, in accordance with the barn owl’s behavior, the error also increased significantly for directions near the center of the interaural coordinate system (Figure 1). [sent-64, score-0.889]

36 This pattern of error as a function of eccentricity, with a large central peak, is not consistent with the performance of the barn owl in the head turning task [1]. [sent-65, score-1.083]

37 Additionally, directions near the center of gaze were often confused with directions in the periphery leading to a high variability in the direction estimates, which is not seen in the owl’s behavior. [sent-66, score-0.631]

38 5 Maximum a posteriori estimate In the Bayesian framework, the direction estimate depends on both the likelihood function and the prior distribution over source directions through the posterior distribution. [sent-67, score-0.379]

39 (9) The prior distribution is used to summarize the owl’s belief about the most likely source directions before an observation of ILD and IPD cues is made. [sent-69, score-0.41]

40 Based on the barn owl’s tendency to underestimate source directions [1], I use a prior that emphasizes directions near the center of gaze. [sent-70, score-0.972]

41 The maximum a posteriori source direction estimate is computed for a given observation by finding the source direction that maximizes the posterior density, yielding ˆ ˆ (θMAP (r), φMAP (r)) = arg max pΘ,Φ|r (θ, φ|r). [sent-72, score-0.303]

42 Center column: Estimation error on the horizontal plane along with the estimation error of a barn owl in a head turning task [1]. [sent-76, score-1.152]

43 Right column: Estimation error on the vertical plane along with the estimation error of a barn owl in a head turning task. [sent-77, score-1.185]

44 4 Figure 2: Estimates for the MAP estimator on the horizontal plane (left) and the vertical plane (right) using HRTFs from owl 880. [sent-79, score-0.779]

45 In the MAP case, the estimate depends on spectral matching of observations with expected cues for each direction, but with a penalty on the selection of peripheral directions. [sent-83, score-0.318]

46 It was possible to find a MAP estimator that was consistent with the owl’s localization behavior (Figures 1,2). [sent-84, score-0.401]

47 This pattern of error qualitatively matches the pattern of error displayed by the owl in a head turning localization task [1]. [sent-87, score-1.062]

48 For comparison, the range of ILD values normally experienced by the barn owl falls between ± 30 dB [10]. [sent-94, score-0.975]

49 2 for owl 884 and an elevational width parameter κ2 of 0. [sent-97, score-0.594]

50 The implication of this model for implementation in the owl’s auditory system is that the spectral localization cues ILD and IPD do not need to be computed with great accuracy and the emphasis on central directions does not need to be large in order to produce the barn owl’s behavior. [sent-100, score-1.188]

51 1 A new approach to modeling sound localization in the barn owl The simulation results show that the maximum likelihood model considered here can not reproduce the owl’s behavior, while the maximum a posteriori solution is able to match the behavior. [sent-102, score-1.583]

52 This result suggests that the standard spectral matching model will not be sufficient to explain sound localization behavior in the barn owl. [sent-103, score-0.959]

53 Previously, suggestions have been made that sound localization by the barn owl can be described using the Bayesian framework [11, 12], but no specific models have been proposed. [sent-104, score-1.44]

54 This paper demonstrates that a Bayesian model can qualitatively match the owl’s localization behavior. [sent-105, score-0.363]

55 The Bayesian approach described here provides a new framework for analyzing sound localization in the owl. [sent-106, score-0.465]

56 The existence of spatial ambiguity has been noted in previous descriptions of barn owl HRTFs [3, 10, 13]. [sent-109, score-1.047]

57 In addition to sim5 ilarity of cues between proximal directions, distant directions can have similar ILD and IPD cues. [sent-111, score-0.311]

58 Most significantly, there is a similarity between the ILD and IPD cues at the center of gaze and at peripheral directions on the vertical plane. [sent-112, score-0.598]

59 The consequence of such ambiguity between distant directions is that noise in measuring localization cues can lead to large errors in direction estimation, as seen in the ML estimate. [sent-113, score-0.803]

60 This work shows that a possible mechanism for choosing between such directions is to incorporate a bias towards directions at the center of gaze through a prior distribution and utilize the Bayesian estimation framework. [sent-115, score-0.518]

61 The use of a prior that emphasizes directions near the center of gaze is similar to the use of central weighting functions in models of human lateralization [14]. [sent-116, score-0.469]

62 3 Predictions of the Bayesian model The MAP estimator predicts the underestimation of peripheral source directions on the horizontal and vertical planes (Figure 2). [sent-118, score-0.446]

63 For the owl whose performance is displayed in Figure 1, the largest errors on the vertical and horizontal planes were less than 20 deg and 11 deg, respectively. [sent-121, score-0.791]

64 The large errors in elevation result from the ambiguity in the localization cues on the vertical plane and the shape of the prior distribution. [sent-123, score-0.719]

65 As discussed above, for broadband noise stimuli, there is a similarity between the ILD and IPD cues for central and peripheral directions on the vertical plane [3, 10, 13]. [sent-124, score-0.554]

66 The presence of a prior distribution that emphasizes central directions causes direction estimates for both central and peripheral directions to be concentrated near zero deg. [sent-125, score-0.614]

67 Therefore, estimation errors are minimal for sources at the center of gaze, but approach the magnitude of the source direction for peripheral source directions. [sent-126, score-0.386]

68 Behavioral data shows that localization accuracy is the greatest near the center of gaze [1], but there is no data for localization performance at the most eccentric directions on the vertical plane. [sent-127, score-1.158]

69 There is a significant spatial ambiguity in the localization cues when target sounds are narrowband. [sent-129, score-0.593]

70 The owl measures the interaural time difference for each frequency of the input sound as an interaural phase difference. [sent-131, score-0.93]

71 Therefore, multiple directions in space that differ in their associated interaural time difference by the period of a tone at that frequency are consistent with the same interaural phase difference and can not be distinguished. [sent-132, score-0.389]

72 Behavioral experiments show that the owl may localize a phantom source in the horizontal dimension when the signal is a tone [20]. [sent-133, score-0.723]

73 This prediction can not be evaluated from available data because localization of tonal signals has only been systematically studied using 5 kHz tones with target directions at ± 20 deg [19]. [sent-135, score-0.628]

74 Because the prior is broad, the target direction of ± 20 deg and the phantom direction of ± 50 deg may both be considered central. [sent-136, score-0.285]

75 Therefore, for narrowband sounds, the owl can not uniquely determine the direction of a sound source from the ITD and ILD cues. [sent-139, score-0.84]

76 I predict that for tonal signals above 7 kHz, there will be multiple directions on the vertical plane that are confused with directions near zero deg. [sent-140, score-0.529]

77 I predict that confusion between source directions near zero deg and eccentric directions will always lead to estimates of directions near zero deg. [sent-141, score-0.791]

78 6 Figure 3: Model predictions for localization of tones on the vertical plane. [sent-143, score-0.426]

79 (A) ILD as a function of elevation at 8 kHz, computed from HRTFs of owl 880 recorded by Keller et al. [sent-144, score-0.625]

80 If the target is at any of the three directions, there will be large localization errors because of confusion with the other directions. [sent-147, score-0.387]

81 4 Testing the Bayesian model Further head turning localization experiments with barn owls must be performed to test predictions generated by the Bayesian hypothesis and to provide constraints on a model of sound localization. [sent-152, score-0.999]

82 Experiments should test the localization accuracy of the owl for broadband noise sources and tonal signals at directions covering the frontal hemisphere. [sent-153, score-1.241]

83 The Bayesian model will be supported if, first, localization accuracy is high for both tonal and broadband noise sources near the center of gaze and, second, peripherally located sources are confused for targets near the center of gaze, leading to large localization errors. [sent-154, score-1.199]

84 5 Implications for neural processing The analysis presented here does not directly address the neural implementation of the solution to the localization problem. [sent-158, score-0.345]

85 However, our abstract analysis of the sound localization problem has implications for neural processing. [sent-159, score-0.465]

86 The prior may correspond to the distribution of best directions of space-specific neurons in ICx and OT, which emphasizes directions near the center of gaze [23]. [sent-164, score-0.622]

87 6 Conclusion This analysis supports the Bayesian model of the barn owl’s solution to the localization problem over the maximum likelihood model. [sent-166, score-0.78]

88 This result suggests that the standard spectral matching model will not be sufficient to explain sound localization behavior in the barn owl. [sent-167, score-0.959]

89 The Bayesian model 7 provides a new framework for analyzing sound localization in the owl. [sent-168, score-0.465]

90 The simulation results using the MAP estimator lead to testable predictions that can be used to evaluate the Bayesian model of sound localization in the barn owl. [sent-169, score-0.884]

91 Sound localization by the barn owl (Tyto alba) measured with the search coil technique. [sent-177, score-1.335]

92 Neural computations leading to space-specific auditory responses in the barn owl. [sent-205, score-0.483]

93 A model of the computations leading to a representation of auditory space in the midbrain of the barn owl. [sent-212, score-0.467]

94 Sound-localization experiments with barn owls in virtual space: influence of broadband interaural level difference on head-turning behavior. [sent-229, score-0.628]

95 From spectrum to space: The contribution of level difference cues to spatial receptive fields in the barn owl inferior colliculus. [sent-239, score-1.176]

96 Head-related transfer functions of the barn owl: measurement and neural responses. [sent-257, score-0.396]

97 Neural maps of interaural time and intensity differences in the optic tectum of the barn owl. [sent-276, score-0.525]

98 Sound-localization deficits induced by lesions in the barn owl’s auditory space map. [sent-295, score-0.467]

99 Sound-localization experiments with barn owls in virtual space: influence of interaural time difference on head-turning behavior. [sent-303, score-0.584]

100 Mechanisms of sound localization in the barn owl (Tyto alba) measured with the search coil technique. [sent-337, score-1.455]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('owl', 0.594), ('barn', 0.381), ('ild', 0.346), ('localization', 0.345), ('ipd', 0.231), ('directions', 0.159), ('cues', 0.152), ('sound', 0.12), ('interaural', 0.108), ('gaze', 0.105), ('auditory', 0.086), ('itd', 0.08), ('owls', 0.08), ('source', 0.072), ('hrtfs', 0.071), ('peripheral', 0.071), ('deg', 0.066), ('near', 0.066), ('vertical', 0.058), ('direction', 0.054), ('knudsen', 0.053), ('center', 0.053), ('ambiguity', 0.049), ('matching', 0.049), ('nf', 0.046), ('spectral', 0.046), ('broadband', 0.044), ('rild', 0.044), ('ripd', 0.044), ('behavioral', 0.041), ('turning', 0.04), ('emphasizes', 0.04), ('mises', 0.039), ('reproduce', 0.038), ('estimator', 0.038), ('keller', 0.035), ('tonal', 0.035), ('likelihood', 0.035), ('head', 0.033), ('posteriori', 0.032), ('plane', 0.032), ('khz', 0.031), ('elevation', 0.031), ('bayesian', 0.029), ('map', 0.027), ('prior', 0.027), ('binaural', 0.027), ('eccentric', 0.027), ('icx', 0.027), ('takahashi', 0.027), ('spectrum', 0.026), ('cue', 0.026), ('von', 0.026), ('errors', 0.025), ('horizontal', 0.025), ('sources', 0.024), ('sounds', 0.024), ('planes', 0.023), ('resolves', 0.023), ('tones', 0.023), ('spatial', 0.023), ('db', 0.023), ('frontal', 0.021), ('hl', 0.02), ('confused', 0.02), ('maximum', 0.019), ('noise', 0.019), ('central', 0.019), ('ml', 0.019), ('behavior', 0.018), ('qualitatively', 0.018), ('alba', 0.018), ('bessel', 0.018), ('hartung', 0.018), ('optic', 0.018), ('phantom', 0.018), ('poganiatz', 0.018), ('prild', 0.018), ('pripd', 0.018), ('quartile', 0.018), ('tectum', 0.018), ('tyto', 0.018), ('ot', 0.018), ('confusion', 0.017), ('error', 0.016), ('responses', 0.016), ('cos', 0.016), ('underestimate', 0.015), ('periphery', 0.015), ('coil', 0.015), ('hemisphere', 0.015), ('hr', 0.015), ('transfer', 0.015), ('virtual', 0.015), ('estimation', 0.015), ('hearing', 0.014), ('tone', 0.014), ('behaviorally', 0.014), ('polar', 0.014), ('neurons', 0.013)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 150 nips-2007-Optimal models of sound localization by barn owls

Author: Brian J. Fischer

Abstract: Sound localization by barn owls is commonly modeled as a matching procedure where localization cues derived from auditory inputs are compared to stored templates. While the matching models can explain properties of neural responses, no model explains how the owl resolves spatial ambiguity in the localization cues to produce accurate localization for sources near the center of gaze. Here, I examine two models for the barn owl’s sound localization behavior. First, I consider a maximum likelihood estimator in order to further evaluate the cue matching model. Second, I consider a maximum a posteriori estimator to test whether a Bayesian model with a prior that emphasizes directions near the center of gaze can reproduce the owl’s localization behavior. I show that the maximum likelihood estimator can not reproduce the owl’s behavior, while the maximum a posteriori estimator is able to match the behavior. This result suggests that the standard cue matching model will not be sufficient to explain sound localization behavior in the barn owl. The Bayesian model provides a new framework for analyzing sound localization in the barn owl and leads to predictions about the owl’s localization behavior.

2 0.078454725 51 nips-2007-Comparing Bayesian models for multisensory cue combination without mandatory integration

Author: Ulrik Beierholm, Ladan Shams, Wei J. Ma, Konrad Koerding

Abstract: Bayesian models of multisensory perception traditionally address the problem of estimating an underlying variable that is assumed to be the cause of the two sensory signals. The brain, however, has to solve a more general problem: it also has to establish which signals come from the same source and should be integrated, and which ones do not and should be segregated. In the last couple of years, a few models have been proposed to solve this problem in a Bayesian fashion. One of these has the strength that it formalizes the causal structure of sensory signals. We first compare these models on a formal level. Furthermore, we conduct a psychophysics experiment to test human performance in an auditory-visual spatial localization task in which integration is not mandatory. We find that the causal Bayesian inference model accounts for the data better than other models. Keywords: causal inference, Bayesian methods, visual perception. 1 Multisensory perception In the ventriloquist illusion, a performer speaks without moving his/her mouth while moving a puppet’s mouth in synchrony with his/her speech. This makes the puppet appear to be speaking. This illusion was first conceptualized as ”visual capture”, occurring when visual and auditory stimuli exhibit a small conflict ([1, 2]). Only recently has it been demonstrated that the phenomenon may be seen as a byproduct of a much more flexible and nearly Bayes-optimal strategy ([3]), and therefore is part of a large collection of cue combination experiments showing such statistical near-optimality [4, 5]. In fact, cue combination has become the poster child for Bayesian inference in the nervous system. In previous studies of multisensory integration, two sensory stimuli are presented which act as cues about a single underlying source. For instance, in the auditory-visual localization experiment by Alais and Burr [3], observers were asked to envisage each presentation of a light blob and a sound click as a single event, like a ball hitting the screen. In many cases, however, the brain is not only posed with the problem of identifying the position of a common source, but also of determining whether there was a common source at all. In the on-stage ventriloquist illusion, it is indeed primarily the causal inference process that is being fooled, because veridical perception would attribute independent causes to the auditory and the visual stimulus. 1 To extend our understanding of multisensory perception to this more general problem, it is necessary to manipulate the degree of belief assigned to there being a common cause within a multisensory task. Intuitively, we expect that when two signals are very different, they are less likely to be perceived as having a common source. It is well-known that increasing the discrepancy or inconsistency between stimuli reduces the influence that they have on each other [6, 7, 8, 9, 10, 11]. In auditoryvisual spatial localization, one variable that controls stimulus similarity is spatial disparity (another would be temporal disparity). Indeed, it has been reported that increasing spatial disparity leads to a decrease in auditory localization bias [1, 12, 13, 14, 15, 16, 17, 2, 18, 19, 20, 21]. This decrease also correlates with a decrease in the reports of unity [19, 21]. Despite the abundance of experimental data on this issue, no general theory exists that can explain multisensory perception across a wide range of cue conflicts. 2 Models The success of Bayesian models for cue integration has motivated attempts to extend them to situations of large sensory conflict and a consequent low degree of integration. In one of recent studies taking this approach, subjects were presented with concurrent visual flashes and auditory beeps and asked to count both the number of flashes and the number of beeps [11]. The advantage of the experimental paradigm adopted here was that it probed the joint response distribution by requiring a dual report. Human data were accounted for well by a Bayesian model in which the joint prior distribution over visual and auditory number was approximated from the data. In a similar study, subjects were presented with concurrent flashes and taps and asked to count either the flashes or the taps [9, 22]. The Bayesian model proposed by these authors assumed a joint prior distribution with a near-diagonal form. The corresponding generative model assumes that the sensory sources somehow interact with one another. A third experiment modulated the rates of flashes and beeps. The task was to judge either the visual or the auditory modulation rate relative to a standard [23]. The data from this experiment were modeled using a joint prior distribution which is the sum of a near-diagonal prior and a flat background. While all these models are Bayesian in a formal sense, their underlying generative model does not formalize the model selection process that underlies the combination of cues. This makes it necessary to either estimate an empirical prior [11] by fitting it to human behavior or to assume an ad hoc form [22, 23]. However, we believe that such assumptions are not needed. It was shown recently that human judgments of spatial unity in an auditory-visual spatial localization task can be described using a Bayesian inference model that infers causal structure [24, 25]. In this model, the brain does not only estimate a stimulus variable, but also infers the probability that the two stimuli have a common cause. In this paper we compare these different models on a large data set of human position estimates in an auditory-visual task. In this section we first describe the traditional cue integration model, then the recent models based on joint stimulus priors, and finally the causal inference model. To relate to the experiment in the next section, we will use the terminology of auditory-visual spatial localization, but the formalism is very general. 2.1 Traditional cue integration The traditional generative model of cue integration [26] has a single source location s which produces on each trial an internal representation (cue) of visual location, xV and one of auditory location, xA . We assume that the noise processes by which these internal representations are generated are conditionally independent from each other and follow Gaussian distributions. That is, p (xV |s) ∼ N (xV ; s, σV )and p (xA |s) ∼ N (xA ; s, σA ), where N (x; µ, σ) stands for the normal distribution over x with mean µ and standard deviation σ. If on a given trial the internal representations are xV and xA , the probability that their source was s is given by Bayes’ rule, p (s|xV , xA ) ∝ p (xV |s) p (xA |s) . If a subject performs maximum-likelihood estimation, then the estimate will be xV +wA s = wV wV +wA xA , where wV = σ1 and wA = σ1 . It is important to keep in mind that this is the ˆ 2 2 V A estimate on a single trial. A psychophysical experimenter can never have access to xV and xA , which 2 are the noisy internal representations. Instead, an experimenter will want to collect estimates over many trials and is interested in the distribution of s given sV and sA , which are the sources generated ˆ by the experimenter. In a typical cue combination experiment, xV and xA are not actually generated by the same source, but by different sources, a visual one sV and an auditory one sA . These sources are chosen close to each other so that the subject can imagine that the resulting cues originate from a single source and thus implicitly have a common cause. The experimentally observed distribution is then p (ˆ|sV , sA ) = s p (ˆ|xV , xA ) p (xV |sV ) p (xA |sA ) dxV dxA s Given that s is a linear combination of two normally distributed variables, it will itself follow a ˆ sV +wA 1 2 normal distribution, with mean s = wVwV +wA sA and variance σs = wV +wA . The reason that we ˆ ˆ emphasize this point is because many authors identify the estimate distribution p (ˆ|sV , sA ) with s the posterior distribution p (s|xV , xA ). This is justified in this case because all distributions are Gaussian and the estimate is a linear combination of cues. However, in the case of causal inference, these conditions are violated and the estimate distribution will in general not be the same as the posterior distribution. 2.2 Models with bisensory stimulus priors Models with bisensory stimulus priors propose the posterior over source positions to be proportional to the product of unimodal likelihoods and a two-dimensional prior: p (sV , sA |xV , xA ) = p (sV , sA ) p (xV |sV ) p (xA |sA ) The traditional cue combination model has p (sV , sA ) = p (sV ) δ (sV − sA ), usually (as above) even with p (sV ) uniform. The question arises what bisensory stimulus prior is appropriate. In [11], the prior is estimated from data, has a large number of parameters, and is therefore limited in its predictive power. In [23], it has the form − (sV −sA )2 p (sV , sA ) ∝ ω + e 2σ 2 coupling while in [22] the additional assumption ω = 0 is made1 . In all three models, the response distribution p (ˆV , sA |sV , sA ) is obtained by idens ˆ tifying it with the posterior distribution p (sV , sA |xV , xA ). This procedure thus implicitly assumes that marginalizing over the latent variables xV and xA is not necessary, which leads to a significant error for non-Gaussian priors. In this paper we correctly deal with these issues and in all cases marginalize over the latent variables. The parametric models used for the coupling between the cues lead to an elegant low-dimensional model of cue integration that allows for estimates of single cues that differ from one another. C C=1 SA S XA 2.3 C=2 XV SV XA XV Causal inference model In the causal inference model [24, 25], we start from the traditional cue integration model but remove the assumption that two signals are caused by the same source. Instead, the number of sources can be one or two and is itself a variable that needs to be inferred from the cues. Figure 1: Generative model of causal inference. 1 This family of Bayesian posterior distributions also includes one used to successfully model cue combination in depth perception [27, 28]. In depth perception, however, there is no notion of segregation as always a single surface is assumed. 3 If there are two sources, they are assumed to be independent. Thus, we use the graphical model depicted in Fig. 1. We denote the number of sources by C. The probability distribution over C given internal representations xV and xA is given by Bayes’ rule: p (C|xV , xA ) ∝ p (xV , xA |C) p (C) . In this equation, p (C) is the a priori probability of C. We will denote the probability of a common cause by pcommon , so that p (C = 1) = pcommon and p (C = 2) = 1 − pcommon . The probability of generating xV and xA given C is obtained by inserting a summation over the sources: p (xV , xA |C = 1) = p (xV , xA |s)p (s) ds = p (xV |s) p (xA |s)p (s) ds Here p (s) is a prior for spatial location, which we assume to be distributed as N (s; 0, σP ). Then all three factors in this integral are Gaussians, allowing for an analytic solution: p (xV , xA |C = 1) = 2 2 2 2 2 −xA )2 σP σA √ 2 2 1 2 2 2 2 exp − 1 (xV σ2 σ2 +σ2+xV+σ2+xA σV . 2 σ2 σ2 2π σV σA +σV σP +σA σP V A V P A P For p (xV , xA |C = 2) we realize that xV and xA are independent of each other and thus obtain p (xV , xA |C = 2) = p (xV |sV )p (sV ) dsV p (xA |sA )p (sA ) dsA Again, as all these distributions are assumed to be Gaussian, we obtain an analytic solution, x2 x2 1 1 V A p (xV , xA |C = 2) = exp − 2 σ2 +σ2 + σ2 +σ2 . Now that we have com2 +σ 2 2 +σ 2 p p V A 2π (σV p )(σA p) puted p (C|xV , xA ), the posterior distribution over sources is given by p (si |xV , xA ) = p (si |xV , xA , C) p (C|xV , xA ) C=1,2 where i can be V or A and the posteriors conditioned on C are well-known: p (si |xA , xV , C = 1) = p (xA |si ) p (xV |si ) p (si ) , p (xA |s) p (xV |s) p (s) ds p (si |xA , xV , C = 2) = p (xi |si ) p (si ) p (xi |si ) p (si ) dsi The former is the same as in the case of mandatory integration with a prior, the latter is simply the unimodal posterior in the presence of a prior. Based on the posterior distribution on a given trial, p (si |xV , xA ), an estimate has to be created. For this, we use a sum-squared-error cost func2 2 tion, Cost = p (C = 1|xV , xA ) (ˆ − s) + p (C = 2|xV , xA ) (ˆ − sV or A ) . Then the best s s estimate is the mean of the posterior distribution, for instance for the visual estimation: sV = p (C = 1|xA , xV ) sV,C=1 + p (C = 2|xA , xV ) sV,C=2 ˆ ˆ ˆ where sV,C=1 = ˆ −2 −2 −2 xV σV +xA σA +xP σP −2 −2 −2 σV +σA +σP and sV,C=2 = ˆ −2 −2 xV σV +xP σP . −2 −2 σV +σP If pcommon equals 0 or 1, this estimate reduces to one of the conditioned estimates and is linear in xV and xA . If 0 < pcommon < 1, the estimate is a nonlinear combination of xV and xA , because of the functional form of p (C|xV , xA ). The response distributions, that is the distributions of sV and sA given ˆ ˆ sV and sA over many trials, now cannot be identified with the posterior distribution on a single trial and cannot be computed analytically either. The correct way to obtain the response distribution is to simulate an experiment numerically. Note that the causal inference model above can also be cast in the form of a bisensory stimulus prior by integrating out the latent variable C, with: p (sA , sV ) = p (C = 1) δ (sA − sV ) p (sA ) + p (sA ) p (sV ) p (C = 2) However, in addition to justifying the form of the interaction between the cues, the causal inference model has the advantage of being based on a generative model that well formalizes salient properties of the world, and it thereby also allows to predict judgments of unity. 4 3 Model performance and comparison To examine the performance of the causal inference model and to compare it to previous models, we performed a human psychophysics experiment in which we adopted the same dual-report paradigm as was used in [11]. Observers were simultaneously presented with a brief visual and also an auditory stimulus, each of which could originate from one of five locations on an imaginary horizontal line (-10◦ , -5◦ , 0◦ , 5◦ , or 10◦ with respect to the fixation point). Auditory stimuli were 32 ms of white noise filtered through an individually calibrated head related transfer function (HRTF) and presented through a pair of headphones, whereas the visual stimuli were high contrast Gabors on a noisy background presented on a 21-inch CRT monitor. Observers had to report by means of a key press (1-5) the perceived positions of both the visual and the auditory stimulus. Each combination of locations was presented with the same frequency over the course of the experiment. In this way, for each condition, visual and auditory response histograms were obtained. We obtained response distributions for each the three models described above by numeral simulation. On each trial, estimation is followed by a step in which, the key is selected which corresponds to the position closed to the best estimate. The simulated histograms obtained in this way were compared to the measured response frequencies of all subjects by computing the R2 statistic. Auditory response Auditory model Visual response Visual model no vision The parameters in the causal inference model were optimized using fminsearch in MATLAB to maximize R2 . The best combination of parameters yielded an R2 of 0.97. The response frequencies are depicted in Fig. 2. The bisensory prior models also explain most of the variance, with R2 = 0.96 for the Roach model and R2 = 0.91 for the Bresciani model. This shows that it is possible to model cue combination for large disparities well using such models. no audio 1 0 Figure 2: A comparison between subjects’ performance and the causal inference model. The blue line indicates the frequency of subjects responses to visual stimuli, red line is the responses to auditory stimuli. Each set of lines is one set of audio-visual stimulus conditions. Rows of conditions indicate constant visual stimulus, columns is constant audio stimulus. Model predictions is indicated by the red and blue dotted line. 5 3.1 Model comparison To facilitate quantitative comparison with other models, we now fit the parameters of each model2 to individual subject data, maximizing the likelihood of the model, i.e., the probability of the response frequencies under the model. The causal inference model fits human data better than the other models. Compared to the best fit of the causal inference model, the Bresciani model has a maximal log likelihood ratio (base e) of the data of −22 ± 6 (mean ± s.e.m. over subjects), and the Roach model has a maximal log likelihood ratio of the data of −18 ± 6. A causal inference model that maximizes the probability of being correct instead of minimizing the mean squared error has a maximal log likelihood ratio of −18 ± 3. These values are considered decisive evidence in favor of the causal inference model that minimizes the mean squared error (for details, see [25]). The parameter values found in the likelihood optimization of the causal model are as follows: pcommon = 0.28 ± 0.05, σV = 2.14 ± 0.22◦ , σA = 9.2 ± 1.1◦ , σP = 12.3 ± 1.1◦ (mean ± s.e.m. over subjects). We see that there is a relatively low prior probability of a common cause. In this paradigm, auditory localization is considerably less precise than visual localization. Also, there is a weak prior for central locations. 3.2 Localization bias A useful quantity to gain more insight into the structure of multisensory data is the cross-modal bias. In our experiment, relative auditory bias is defined as the difference between the mean auditory estimate in a given condition and the real auditory position, divided by the difference between the real visual position and the real auditory position in this condition. If the influence of vision on the auditory estimate is strong, then the relative auditory bias will be high (close to one). It is well-known that bias decreases with spatial disparity and our experiment is no exception (solid line in Fig. 3; data were combined between positive and negative disparities). It can easily be shown that a traditional cue integration model would predict a bias equal to σ2 −1 , which would be close to 1 and 1 + σV 2 A independent of disparity, unlike the data. This shows that a mandatory integration model is an insufficient model of multisensory interactions. 45 % Auditory Bias We used the individual subject fittings from above and and averaged the auditory bias values obtained from those fits (i.e. we did not fit the bias data themselves). Fits are shown in Fig. 3 (dashed lines). We applied a paired t-test to the differences between the 5◦ and 20◦ disparity conditions (model-subject comparison). Using a double-sided test, the null hypothesis that the difference between the bias in the 5◦ and 20◦ conditions is correctly predicted by each model is rejected for the Bresciani model (p < 0.002) and the Roach model (p < 0.042) and accepted for the causal inference model (p > 0.17). Alternatively, with a single-sided test, the hypothesis is rejected for the Bresciani model (p < 0.001) and the Roach model (p < 0.021) and accepted for the causal inference model (> 0.9). 50 40 35 30 25 20 5 10 15 Spatial Disparity (deg.) 20 Figure 3: Auditory bias as a function of spatial disparity. Solid blue line: data. Red: Causal inference model. Green: Model by Roach et al. [23]. Purple: Model by Bresciani et al. [22]. Models were optimized on response frequencies (as in Fig. 2), not on the bias data. The reason that the Bresciani model fares worst is that its prior distribution does not include a component that corresponds to independent causes. On 2 The Roach et al. model has four free parameters (ω,σV , σA , σcoupling ), the Bresciani et al. model has three (σV , σA , σcoupling ), and the causal inference model has four (pcommon ,σV , σA , σP ). We do not consider the Shams et al. model here, since it has many more parameters and it is not immediately clear how in this model the erroneous identification of posterior with response distribution can be corrected. 6 the contrary, the prior used in the Roach model contains two terms, one term that is independent of the disparity and one term that decreases with increasing disparity. It is thus functionally somewhat similar to the causal inference model. 4 Discussion We have argued that any model of multisensory perception should account not only for situations of small, but also of large conflict. In these situations, segregation is more likely, in which the two stimuli are not perceived to have the same cause. Even when segregation occurs, the two stimuli can still influence each other. We compared three Bayesian models designed to account for situations of large conflict by applying them to auditory-visual spatial localization data. We pointed out a common mistake: for nonGaussian bisensory priors without mandatory integration, the response distribution can no longer be identified with the posterior distribution. After correct implementation of the three models, we found that the causal inference model is superior to the models with ad hoc bisensory priors. This is expected, as the nervous system actually needs to solve the problem of deciding which stimuli have a common cause and which stimuli are unrelated. We have seen that multisensory perception is a suitable tool for studying causal inference. However, the causal inference model also has the potential to quantitatively explain a number of other perceptual phenomena, including perceptual grouping and binding, as well as within-modality cue combination [27, 28]. Causal inference is a universal problem: whenever the brain has multiple pieces of information it must decide if they relate to one another or are independent. As the causal inference model describes how the brain processes probabilistic sensory information, the question arises about the neural basis of these processes. Neural populations encode probability distributions over stimuli through Bayes’ rule, a type of coding known as probabilistic population coding. Recent work has shown how the optimal cue combination assuming a common cause can be implemented in probabilistic population codes through simple linear operations on neural activities [29]. This framework makes essential use of the structure of neural variability and leads to physiological predictions for activity in areas that combine multisensory input, such as the superior colliculus. Computational mechanisms for causal inference are expected have a neural substrate that generalizes these linear operations on population activities. A neural implementation of the causal inference model will open the door to a complete neural theory of multisensory perception. References [1] H.L. Pick, D.H. Warren, and J.C. Hay. Sensory conflict in judgements of spatial direction. Percept. Psychophys., 6:203205, 1969. [2] D. H. Warren, R. B. Welch, and T. J. McCarthy. The role of visual-auditory ”compellingness” in the ventriloquism effect: implications for transitivity among the spatial senses. Percept Psychophys, 30(6):557– 64, 1981. [3] D. Alais and D. Burr. The ventriloquist effect results from near-optimal bimodal integration. Curr Biol, 14(3):257–62, 2004. [4] R. A. Jacobs. Optimal integration of texture and motion cues to depth. Vision Res, 39(21):3621–9, 1999. [5] R. J. van Beers, A. C. Sittig, and J. J. Gon. Integration of proprioceptive and visual position-information: An experimentally supported model. J Neurophysiol, 81(3):1355–64, 1999. [6] D. H. Warren and W. T. Cleaves. Visual-proprioceptive interaction under large amounts of conflict. J Exp Psychol, 90(2):206–14, 1971. [7] C. E. Jack and W. R. Thurlow. Effects of degree of visual association and angle of displacement on the ”ventriloquism” effect. Percept Mot Skills, 37(3):967–79, 1973. [8] G. H. Recanzone. Auditory influences on visual temporal rate perception. J Neurophysiol, 89(2):1078–93, 2003. [9] J. P. Bresciani, M. O. Ernst, K. Drewing, G. Bouyer, V. Maury, and A. Kheddar. Feeling what you hear: auditory signals can modulate tactile tap perception. Exp Brain Res, 162(2):172–80, 2005. 7 [10] R. Gepshtein, P. Leiderman, L. Genosar, and D. Huppert. Testing the three step excited state proton transfer model by the effect of an excess proton. J Phys Chem A Mol Spectrosc Kinet Environ Gen Theory, 109(42):9674–84, 2005. [11] L. Shams, W. J. Ma, and U. Beierholm. Sound-induced flash illusion as an optimal percept. Neuroreport, 16(17):1923–7, 2005. [12] G Thomas. Experimental study of the influence of vision on sound localisation. J Exp Psychol, 28:167177, 1941. [13] W. R. Thurlow and C. E. Jack. Certain determinants of the ”ventriloquism effect”. Percept Mot Skills, 36(3):1171–84, 1973. [14] C.S. Choe, R. B. Welch, R.M. Gilford, and J.F. Juola. The ”ventriloquist effect”: visual dominance or response bias. Perception and Psychophysics, 18:55–60, 1975. [15] R. I. Bermant and R. B. Welch. Effect of degree of separation of visual-auditory stimulus and eye position upon spatial interaction of vision and audition. Percept Mot Skills, 42(43):487–93, 1976. [16] R. B. Welch and D. H. Warren. Immediate perceptual response to intersensory discrepancy. Psychol Bull, 88(3):638–67, 1980. [17] P. Bertelson and M. Radeau. Cross-modal bias and perceptual fusion with auditory-visual spatial discordance. Percept Psychophys, 29(6):578–84, 1981. [18] P. Bertelson, F. Pavani, E. Ladavas, J. Vroomen, and B. de Gelder. Ventriloquism in patients with unilateral visual neglect. Neuropsychologia, 38(12):1634–42, 2000. [19] D. A. Slutsky and G. H. Recanzone. Temporal and spatial dependency of the ventriloquism effect. Neuroreport, 12(1):7–10, 2001. [20] J. Lewald, W. H. Ehrenstein, and R. Guski. Spatio-temporal constraints for auditory–visual integration. Behav Brain Res, 121(1-2):69–79, 2001. [21] M. T. Wallace, G. E. Roberson, W. D. Hairston, B. E. Stein, J. W. Vaughan, and J. A. Schirillo. Unifying multisensory signals across time and space. Exp Brain Res, 158(2):252–8, 2004. [22] J. P. Bresciani, F. Dammeier, and M. O. Ernst. Vision and touch are automatically integrated for the perception of sequences of events. J Vis, 6(5):554–64, 2006. [23] N. W. Roach, J. Heron, and P. V. McGraw. Resolving multisensory conflict: a strategy for balancing the costs and benefits of audio-visual integration. Proc Biol Sci, 273(1598):2159–68, 2006. [24] K. P. Kording and D. M. Wolpert. Bayesian decision theory in sensorimotor control. Trends Cogn Sci, 2006. 1364-6613 (Print) Journal article. [25] K.P. Kording, U. Beierholm, W.J. Ma, S. Quartz, J. Tenenbaum, and L. Shams. Causal inference in multisensory perception. PLoS ONE, 2(9):e943, 2007. [26] Z. Ghahramani. Computational and psychophysics of sensorimotor integration. PhD thesis, Massachusetts Institute of Technology, 1995. [27] D. C. Knill. Mixture models and the probabilistic structure of depth cues. Vision Res, 43(7):831–54, 2003. [28] D. C. Knill. Robust cue integration: A bayesian model and evidence from cue conflict studies with stereoscopic and figure cues to slant. Journal of Vision, 7(7):2–24. [29] W. J. Ma, J. M. Beck, P. E. Latham, and A. Pouget. Bayesian inference with probabilistic population codes. Nat Neurosci, 9(11):1432–8, 2006. 8

3 0.048761945 110 nips-2007-Learning Bounds for Domain Adaptation

Author: John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, Jennifer Wortman

Abstract: Empirical risk minimization offers well-known learning guarantees when training and test data come from the same domain. In the real world, though, we often wish to adapt a classifier from a source domain with a large amount of training data to different target domain with very little training data. In this work we give uniform convergence bounds for algorithms that minimize a convex combination of source and target empirical risk. The bounds explicitly model the inherent trade-off between training on a large but inaccurate source data set and a small but accurate target training set. Our theory also gives results when we have multiple source domains, each of which may have a different number of instances, and we exhibit cases in which minimizing a non-uniform combination of source risks can achieve much lower target error than standard empirical risk minimization. 1

4 0.045227587 1 nips-2007-A Bayesian Framework for Cross-Situational Word-Learning

Author: Noah Goodman, Joshua B. Tenenbaum, Michael J. Black

Abstract: For infants, early word learning is a chicken-and-egg problem. One way to learn a word is to observe that it co-occurs with a particular referent across different situations. Another way is to use the social context of an utterance to infer the intended referent of a word. Here we present a Bayesian model of cross-situational word learning, and an extension of this model that also learns which social cues are relevant to determining reference. We test our model on a small corpus of mother-infant interaction and find it performs better than competing models. Finally, we show that our model accounts for experimental phenomena including mutual exclusivity, fast-mapping, and generalization from social cues. To understand the difficulty of an infant word-learner, imagine walking down the street with a friend who suddenly says “dax blicket philbin na fivy!” while at the same time wagging her elbow. If you knew any of these words you might infer from the syntax of her sentence that blicket is a novel noun, and hence the name of a novel object. At the same time, if you knew that this friend indicated her attention by wagging her elbow at objects, you might infer that she intends to refer to an object in a nearby show window. On the other hand if you already knew that “blicket” meant the object in the window, you might be able to infer these elements of syntax and social cues. Thus, the problem of early word-learning is a classic chicken-and-egg puzzle: in order to learn word meanings, learners must use their knowledge of the rest of language (including rules of syntax, parts of speech, and other word meanings) as well as their knowledge of social situations. But in order to learn about the facts of their language they must first learn some words, and in order to determine which cues matter for establishing reference (for instance, pointing and looking at an object but normally not waggling your elbow) they must first have a way to know the intended referent in some situations. For theories of language acquisition, there are two common ways out of this dilemma. The first involves positing a wide range of innate structures which determine the syntax and categories of a language and which social cues are informative. (Though even when all of these elements are innately determined using them to learn a language from evidence may not be trivial [1].) The other alternative involves bootstrapping: learning some words, then using those words to learn how to learn more. This paper gives a proposal for the second alternative. We first present a Bayesian model of how learners could use a statistical strategy—cross-situational word-learning—to learn how words map to objects, independent of syntactic and social cues. We then extend this model to a true bootstrapping situation: using social cues to learn words while using words to learn social cues. Finally, we examine several important phenomena in word learning: mutual exclusivity (the tendency to assign novel words to novel referents), fast-mapping (the ability to assign a novel word in a linguistic context to a novel referent after only a single use), and social generalization (the ability to use social context to learn the referent of a novel word). Without adding additional specialized machinery, we show how these can be explained within our model as the result of domain-general probabilistic inference mechanisms operating over the linguistic domain. 1 Os r, b Is Ws Figure 1: Graphical model describing the generation of words (Ws ) from an intention (Is ) and lexicon ( ), and intention from the objects present in a situation (Os ). The plate indicates multiple copies of the model for different situation/utterance pairs (s). Dotted portions indicate additions to include the generation of social cues Ss from intentions. Ss ∀s 1 The Model Behind each linguistic utterance is a meaning that the speaker intends to communicate. Our model operates by attempting to infer this intended meaning (which we call the intent) on the basis of the utterance itself and observations of the physical and social context. For the purpose of modeling early word learning—which consists primarily of learning words for simple object categories—in our model, we assume that intents are simply groups of objects. To state the model formally, we assume the non-linguistic situation consists of a set Os of objects and that utterances are unordered sets of words Ws 1 . The lexicon is a (many-to-many) map from words to objects, which captures the meaning of those words. (Syntax enters our model only obliquely by different treatment of words depending on whether they are in the lexicon or not—that is, whether they are common nouns or other types of words.) In this setting the speaker’s intention will be captured by a set of objects in the situation to which she intends to refer: Is ⊆ Os . This setup is indicated in the graphical model of Fig. 1. Different situation-utterance pairs Ws , Os are independent given the lexicon , giving: P (Ws |Is , ) · P (Is |Os ). P (W| , O) = s (1) Is We further simplify by assuming that P (Is |Os ) ∝ 1 (which could be refined by adding a more detailed model of the communicative intentions a person is likely to form in different situations). We will assume that words in the utterance are generated independently given the intention and the lexicon and that the length of the utterance is observed. Each word is then generated from the intention set and lexicon by first choosing whether the word is a referential word or a non-referential word (from a binomial distribution of weight γ), then, for referential words, choosing which object in the intent it refers to (uniformly). This process gives: P (Ws |Is , ) = (1 − γ)PNR (w| ) + γ w∈Ws x∈Is 1 PR (w|x, ) . |Is | The probability of word w referring to object x is PR (w|x, ) ∝ δx∈ w occurring as a non-referring word is PNR (w| ) ∝ 1 if (w) = ∅, κ otherwise. (w) , (2) and the probability of word (3) (this probability is a distribution over all words in the vocabulary, not just those in lexicon ). The constant κ is a penalty for using a word in the lexicon as a non-referring word—this penalty indirectly enforces a light-weight difference between two different groups of words (parts-of-speech): words that refer and words that do not refer. Because the generative structure of this model exposes the role of speaker’s intentions, it is straightforward to add non-linguistic social cues. We assume that social cues such as pointing are generated 1 Note that, since we ignore word order, the distribution of words in a sentence should be exchangeable given the lexicon and situation. This implies, by de Finetti’s theorem, that they are independent conditioned on a latent state—we assume that the latent state giving rise to words is the intention of the speaker. 2 from the speaker’s intent independently of the linguistic aspects (as shown in the dotted arrows of Fig. 1). With the addition of social cues Ss , Eq. 1 becomes: P (Ws |Is , ) · P (Ss |Is ) · P (Is |Os ). P (W| , O) = s (4) Is We assume that the social cues are a set Si (x) of independent binary (cue present or not) feature values for each object x ∈ Os , which are generated through a noisy-or process: P (Si (x)=1|Is , ri , bi ) = 1 − (1 − bi )(1 − ri )δx∈Is . (5) Here ri is the relevance of cue i, while bi is its base rate. For the model without social cues the posterior probability of a lexicon given a set of situated utterances is: P ( |W, O) ∝ P (W| , O)P ( ). (6) And for the model with social cues the joint posterior over lexicon and cue parameters is: P ( , r, b|W, O) ∝ P (W| , r, b, O)P ( )P (r, b). (7) We take the prior probability of a lexicon to be exponential in its size: P ( ) ∝ e−α| | , and the prior probability of social cue parameters to be uniform. Given the model above and the corpus described below, we found the best lexicon (or lexicon and cue parameters) according to Eq. 6 and 7 by MAP inference using stochastic search2 . 2 Previous work While cross-situational word-learning has been widely discussed in the empirical literature, e.g., [2], there have been relatively few attempts to model this process computationally. Siskind [3] created an ambitious model which used deductive rules to make hypotheses about propositional word meanings their use across situations. This model achieved surprising success in learning word meanings in artificial corpora, but was extremely complex and relied on the availability of fully coded representations of the meaning of each sentence, making it difficult to extend to empirical corpus data. More recently, Yu and Ballard [4] have used a machine translation model (similar to IBM Translation Model I) to learn word-object association probabilities. In their study, they used a pre-existing corpus of mother-infant interactions and coded the objects present during each utterance (an example from this corpus—illustrated with our own coding scheme—is shown in Fig. 2). They applied their translation model to estimate the probability of an object given a word, creating a table of associations between words and objects. Using this table, they extracted a lexicon (a group of word-object mappings) which was relatively accurate in its guesses about the names of objects that were being talked about. They further extended their model to incorporate prosodic emphasis on words (a useful cue which we will not discuss here) and joint attention on objects. Joint attention was coded by hand, isolating a subset of objects which were attended to by both mother and infant. Their results reflected a sizable increase in recall with the use of social cues. 3 Materials and Assessment Methods To test the performance of our model on natural data, we used the Rollins section of the CHILDES corpus[5]. For comparison with the model by Yu and Ballard [4], we chose the files me03 and di06, each of which consisted of approximately ten minutes of interaction between a mother and a preverbal infant playing with objects found in a box of toys. Because we were not able to obtain the exact corpus Yu and Ballard used, we recoded the objects in the videos and added a coding of social cues co-occurring with each utterance. We annotated each utterance with the set of objects visible to the infant and with a social coding scheme (for an illustrated example, see Figure 2). Our social code included seven features: infants eyes, infants hands, infants mouth, infant touching, mothers hands, mothers eyes, mother touching. For each utterance, this coding created an object by social feature matrix. 2 In order to speed convergence we used a simulated tempering scheme with three temperature chains and a range of data-driven proposals. 3 Figure 2: A still frame from our corpus showing the coding of objects and social cues. We coded all mid-sized objects visible to the infant as well as social information including what both mother and infant were touching and looking at. We evaluated all models based on their coverage of a gold-standard lexicon, computing precision (how many of the word-object mappings in a lexicon were correct relative to the gold-standard), recall (how many of the total correct mappings were found), and their geometric mean, F-score. However, the gold-standard lexicon for word-learning is not obvious. For instance, should it include the mapping between the plural “pigs” or the sound “oink” and the object PIG? Should a goldstandard lexicon include word-object pairings that are correct but were not present in the learning situation? In the results we report, we included those pairings which would be useful for a child to learn (e.g., “oink” → PIG) but not including those pairings which were not observed to co-occur in the corpus (however, modifying these decisions did not affect the qualitative pattern of results). 4 Results For the purpose of comparison, we give scores for several other models on the same corpus. We implemented a range of simple associative models based on co-occurrence frequency, conditional probability (both word given object and object given word), and point-wise mutual information. In each of these models, we computed the relevant statistic across the entire corpus and then created a lexicon by including all word-object pairings for which the association statistic met a threshold value. We additionally implemented a translation model (based on Yu and Ballard [4]). Because Yu and Ballard did not include details on how they evaluated their model, we scored it in the same way as the other associative models, by creating an association matrix based on the scores P (O|W ) (as given in equation (3) in their paper) and then creating a lexicon based on a threshold value. In order to simulate this type of threshold value for our model, we searched for the MAP lexicon over a range of parameters α in our prior (the larger the prior value, the less probable a larger lexicon, thus this manipulation served to create more or less selective lexicons) . Base model. In Figure 3, we plot the precision and the recall for lexicons across a range of prior parameter values for our model and the full range of threshold values for the translation model and two of the simple association models (since results for the conditional probability models were very similar but slightly inferior to the performance of mutual information, we did not include them). For our model, we averaged performance at each threshold value across three runs of 5000 search iterations each. Our model performed better than any of the other models on a number of dimensions (best lexicon shown in Table 1), both achieving the highest F-score and showing a better tradeoff between precision and recall at sub-optimal threshold values. The translation model also performed well, increasing precision as the threshold of association was raised. Surprisingly, standard cooccurrence statistics proved to be relatively ineffective at extracting high-scoring lexicons: at any given threshold value, these models included a very large number of incorrect pairs. Table 1: The best lexicon found by the Bayesian model (α=11, γ=0.2, κ=0.01). baby → book hand → hand bigbird → bird hat → hat on → ring bird → rattle meow → kitty ring → ring 4 birdie → duck moocow → cow sheep → sheep book → book oink → pig 1 Co!occurrence frequency Mutual information Translation model Bayesian model 0.9 0.8 0.7 recall 0.6 0.5 0.4 0.3 F=0.54 F=0.44 F=0.21 F=0.12 0.2 0.1 0 0 0.2 0.4 0.6 precision 0.8 1 Figure 3: Comparison of models on corpus data: we plot model precision vs. recall across a range of threshold values for each model (see text). Unlike standard ROC curves for classification tasks, the precision and recall of a lexicon depends on the entire lexicon, and irregularities in the curves reflect the small size of the lexicons). One additional virtue of our model over other associative models is its ability to determine which objects the speaker intended to refer to. In Table 2, we give some examples of situations in which the model correctly inferred the objects that the speaker was talking about. Social model. While the addition of social cues did not increase corpus performance above that found in the base model, the lexicons which were found by the social model did have several properties that were not present in the base model. First, the model effectively and quickly converged on the social cues that we found subjectively important in viewing the corpus videos. The two cues which were consistently found relevant across the model were (1) the target of the infant’s gaze and (2) the caregiver’s hand. These data are especially interesting in light of the speculation that infants initially believe their own point of gaze is a good cue to reference, and must learn over the second year that the true cue is the caregiver’s point of gaze, not their own [6]. Second, while the social model did not outperform the base model on the full corpus (where many words were paired with their referents several times), on a smaller corpus (taking every other utterance), the social cue model did slightly outperform a model without social cues (max F-score=0.43 vs. 0.37). Third, the addition of social cues allowed the model to infer the intent of a speaker even in the absence of a word being used. In the right-hand column of Table 2, we give an example of a situation in which the caregiver simply says ”see that?” but from the direction of the infant’s eyes and the location of her hand, the model correctly infers that she is talking about the COW, not either of the other possible referents. This kind of inference might lead the way in allowing infants to learn words like pronouns, which serve pick out an unambiguous focus of attention (one that is so obvious based on social and contextual cues that it does not need to be named). Finally, in the next section we show that the addition of social cues to the model allows correct performance in experimental tests of social generalization which only children older than 18 months can pass, suggesting perhaps that the social model is closer to the strategy used by more mature word learners. Table 2: Intentions inferred by the Bayesian model after having learned a lexicon from the corpus. (IE=Infant’s eyes, CH=Caregiver’s hands). Words Objects Social Cues Inferred intention “look at the moocow” COW GIRL BEAR “see the bear by the rattle?” BEAR RATTLE COW COW BEAR RATTLE 5 “see that?” BEAR RATTLE COW IE & CH→COW COW situation: !7.3, corpus: !631.1, total: !638.4

5 0.044128589 130 nips-2007-Modeling Natural Sounds with Modulation Cascade Processes

Author: Richard Turner, Maneesh Sahani

Abstract: Natural sounds are structured on many time-scales. A typical segment of speech, for example, contains features that span four orders of magnitude: Sentences (∼ 1 s); phonemes (∼ 10−1 s); glottal pulses (∼ 10−2 s); and formants ( 10−3 s). The auditory system uses information from each of these time-scales to solve complicated tasks such as auditory scene analysis [1]. One route toward understanding how auditory processing accomplishes this analysis is to build neuroscienceinspired algorithms which solve similar tasks and to compare the properties of these algorithms with properties of auditory processing. There is however a discord: Current machine-audition algorithms largely concentrate on the shorter time-scale structures in sounds, and the longer structures are ignored. The reason for this is two-fold. Firstly, it is a difficult technical problem to construct an algorithm that utilises both sorts of information. Secondly, it is computationally demanding to simultaneously process data both at high resolution (to extract short temporal information) and for long duration (to extract long temporal information). The contribution of this work is to develop a new statistical model for natural sounds that captures structure across a wide range of time-scales, and to provide efficient learning and inference algorithms. We demonstrate the success of this approach on a missing data task. 1

6 0.042255893 3 nips-2007-A Bayesian Model of Conditioned Perception

7 0.032004509 2 nips-2007-A Bayesian LDA-based model for semi-supervised part-of-speech tagging

8 0.027114598 37 nips-2007-Blind channel identification for speech dereverberation using l1-norm sparse learning

9 0.026671549 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels

10 0.02517611 18 nips-2007-A probabilistic model for generating realistic lip movements from speech

11 0.025051653 74 nips-2007-EEG-Based Brain-Computer Interaction: Improved Accuracy by Automatic Single-Trial Error Detection

12 0.022754168 103 nips-2007-Inferring Elapsed Time from Stochastic Neural Processes

13 0.021867983 26 nips-2007-An online Hebbian learning rule that performs Independent Component Analysis

14 0.021428192 129 nips-2007-Mining Internet-Scale Software Repositories

15 0.020999286 145 nips-2007-On Sparsity and Overcompleteness in Image Models

16 0.020912819 8 nips-2007-A New View of Automatic Relevance Determination

17 0.019553786 33 nips-2007-Bayesian Inference for Spiking Neuron Models with a Sparsity Prior

18 0.019196725 111 nips-2007-Learning Horizontal Connections in a Sparse Coding Model of Natural Images

19 0.01903525 182 nips-2007-Sparse deep belief net model for visual area V2

20 0.018902833 120 nips-2007-Linear programming analysis of loopy belief propagation for weighted matching


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.064), (1, 0.023), (2, 0.016), (3, -0.02), (4, 0.009), (5, 0.031), (6, 0.011), (7, 0.039), (8, -0.021), (9, -0.026), (10, -0.005), (11, -0.003), (12, -0.014), (13, -0.023), (14, 0.02), (15, -0.063), (16, -0.021), (17, 0.021), (18, 0.065), (19, -0.021), (20, -0.021), (21, 0.018), (22, 0.018), (23, -0.03), (24, -0.019), (25, -0.057), (26, -0.007), (27, 0.015), (28, 0.091), (29, -0.053), (30, 0.036), (31, 0.033), (32, -0.074), (33, -0.056), (34, 0.11), (35, -0.075), (36, 0.014), (37, -0.061), (38, -0.105), (39, 0.05), (40, -0.061), (41, -0.0), (42, 0.055), (43, -0.002), (44, 0.087), (45, 0.02), (46, -0.047), (47, 0.022), (48, 0.048), (49, 0.05)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93100154 150 nips-2007-Optimal models of sound localization by barn owls

Author: Brian J. Fischer

Abstract: Sound localization by barn owls is commonly modeled as a matching procedure where localization cues derived from auditory inputs are compared to stored templates. While the matching models can explain properties of neural responses, no model explains how the owl resolves spatial ambiguity in the localization cues to produce accurate localization for sources near the center of gaze. Here, I examine two models for the barn owl’s sound localization behavior. First, I consider a maximum likelihood estimator in order to further evaluate the cue matching model. Second, I consider a maximum a posteriori estimator to test whether a Bayesian model with a prior that emphasizes directions near the center of gaze can reproduce the owl’s localization behavior. I show that the maximum likelihood estimator can not reproduce the owl’s behavior, while the maximum a posteriori estimator is able to match the behavior. This result suggests that the standard cue matching model will not be sufficient to explain sound localization behavior in the barn owl. The Bayesian model provides a new framework for analyzing sound localization in the barn owl and leads to predictions about the owl’s localization behavior.

2 0.63353974 51 nips-2007-Comparing Bayesian models for multisensory cue combination without mandatory integration

Author: Ulrik Beierholm, Ladan Shams, Wei J. Ma, Konrad Koerding

Abstract: Bayesian models of multisensory perception traditionally address the problem of estimating an underlying variable that is assumed to be the cause of the two sensory signals. The brain, however, has to solve a more general problem: it also has to establish which signals come from the same source and should be integrated, and which ones do not and should be segregated. In the last couple of years, a few models have been proposed to solve this problem in a Bayesian fashion. One of these has the strength that it formalizes the causal structure of sensory signals. We first compare these models on a formal level. Furthermore, we conduct a psychophysics experiment to test human performance in an auditory-visual spatial localization task in which integration is not mandatory. We find that the causal Bayesian inference model accounts for the data better than other models. Keywords: causal inference, Bayesian methods, visual perception. 1 Multisensory perception In the ventriloquist illusion, a performer speaks without moving his/her mouth while moving a puppet’s mouth in synchrony with his/her speech. This makes the puppet appear to be speaking. This illusion was first conceptualized as ”visual capture”, occurring when visual and auditory stimuli exhibit a small conflict ([1, 2]). Only recently has it been demonstrated that the phenomenon may be seen as a byproduct of a much more flexible and nearly Bayes-optimal strategy ([3]), and therefore is part of a large collection of cue combination experiments showing such statistical near-optimality [4, 5]. In fact, cue combination has become the poster child for Bayesian inference in the nervous system. In previous studies of multisensory integration, two sensory stimuli are presented which act as cues about a single underlying source. For instance, in the auditory-visual localization experiment by Alais and Burr [3], observers were asked to envisage each presentation of a light blob and a sound click as a single event, like a ball hitting the screen. In many cases, however, the brain is not only posed with the problem of identifying the position of a common source, but also of determining whether there was a common source at all. In the on-stage ventriloquist illusion, it is indeed primarily the causal inference process that is being fooled, because veridical perception would attribute independent causes to the auditory and the visual stimulus. 1 To extend our understanding of multisensory perception to this more general problem, it is necessary to manipulate the degree of belief assigned to there being a common cause within a multisensory task. Intuitively, we expect that when two signals are very different, they are less likely to be perceived as having a common source. It is well-known that increasing the discrepancy or inconsistency between stimuli reduces the influence that they have on each other [6, 7, 8, 9, 10, 11]. In auditoryvisual spatial localization, one variable that controls stimulus similarity is spatial disparity (another would be temporal disparity). Indeed, it has been reported that increasing spatial disparity leads to a decrease in auditory localization bias [1, 12, 13, 14, 15, 16, 17, 2, 18, 19, 20, 21]. This decrease also correlates with a decrease in the reports of unity [19, 21]. Despite the abundance of experimental data on this issue, no general theory exists that can explain multisensory perception across a wide range of cue conflicts. 2 Models The success of Bayesian models for cue integration has motivated attempts to extend them to situations of large sensory conflict and a consequent low degree of integration. In one of recent studies taking this approach, subjects were presented with concurrent visual flashes and auditory beeps and asked to count both the number of flashes and the number of beeps [11]. The advantage of the experimental paradigm adopted here was that it probed the joint response distribution by requiring a dual report. Human data were accounted for well by a Bayesian model in which the joint prior distribution over visual and auditory number was approximated from the data. In a similar study, subjects were presented with concurrent flashes and taps and asked to count either the flashes or the taps [9, 22]. The Bayesian model proposed by these authors assumed a joint prior distribution with a near-diagonal form. The corresponding generative model assumes that the sensory sources somehow interact with one another. A third experiment modulated the rates of flashes and beeps. The task was to judge either the visual or the auditory modulation rate relative to a standard [23]. The data from this experiment were modeled using a joint prior distribution which is the sum of a near-diagonal prior and a flat background. While all these models are Bayesian in a formal sense, their underlying generative model does not formalize the model selection process that underlies the combination of cues. This makes it necessary to either estimate an empirical prior [11] by fitting it to human behavior or to assume an ad hoc form [22, 23]. However, we believe that such assumptions are not needed. It was shown recently that human judgments of spatial unity in an auditory-visual spatial localization task can be described using a Bayesian inference model that infers causal structure [24, 25]. In this model, the brain does not only estimate a stimulus variable, but also infers the probability that the two stimuli have a common cause. In this paper we compare these different models on a large data set of human position estimates in an auditory-visual task. In this section we first describe the traditional cue integration model, then the recent models based on joint stimulus priors, and finally the causal inference model. To relate to the experiment in the next section, we will use the terminology of auditory-visual spatial localization, but the formalism is very general. 2.1 Traditional cue integration The traditional generative model of cue integration [26] has a single source location s which produces on each trial an internal representation (cue) of visual location, xV and one of auditory location, xA . We assume that the noise processes by which these internal representations are generated are conditionally independent from each other and follow Gaussian distributions. That is, p (xV |s) ∼ N (xV ; s, σV )and p (xA |s) ∼ N (xA ; s, σA ), where N (x; µ, σ) stands for the normal distribution over x with mean µ and standard deviation σ. If on a given trial the internal representations are xV and xA , the probability that their source was s is given by Bayes’ rule, p (s|xV , xA ) ∝ p (xV |s) p (xA |s) . If a subject performs maximum-likelihood estimation, then the estimate will be xV +wA s = wV wV +wA xA , where wV = σ1 and wA = σ1 . It is important to keep in mind that this is the ˆ 2 2 V A estimate on a single trial. A psychophysical experimenter can never have access to xV and xA , which 2 are the noisy internal representations. Instead, an experimenter will want to collect estimates over many trials and is interested in the distribution of s given sV and sA , which are the sources generated ˆ by the experimenter. In a typical cue combination experiment, xV and xA are not actually generated by the same source, but by different sources, a visual one sV and an auditory one sA . These sources are chosen close to each other so that the subject can imagine that the resulting cues originate from a single source and thus implicitly have a common cause. The experimentally observed distribution is then p (ˆ|sV , sA ) = s p (ˆ|xV , xA ) p (xV |sV ) p (xA |sA ) dxV dxA s Given that s is a linear combination of two normally distributed variables, it will itself follow a ˆ sV +wA 1 2 normal distribution, with mean s = wVwV +wA sA and variance σs = wV +wA . The reason that we ˆ ˆ emphasize this point is because many authors identify the estimate distribution p (ˆ|sV , sA ) with s the posterior distribution p (s|xV , xA ). This is justified in this case because all distributions are Gaussian and the estimate is a linear combination of cues. However, in the case of causal inference, these conditions are violated and the estimate distribution will in general not be the same as the posterior distribution. 2.2 Models with bisensory stimulus priors Models with bisensory stimulus priors propose the posterior over source positions to be proportional to the product of unimodal likelihoods and a two-dimensional prior: p (sV , sA |xV , xA ) = p (sV , sA ) p (xV |sV ) p (xA |sA ) The traditional cue combination model has p (sV , sA ) = p (sV ) δ (sV − sA ), usually (as above) even with p (sV ) uniform. The question arises what bisensory stimulus prior is appropriate. In [11], the prior is estimated from data, has a large number of parameters, and is therefore limited in its predictive power. In [23], it has the form − (sV −sA )2 p (sV , sA ) ∝ ω + e 2σ 2 coupling while in [22] the additional assumption ω = 0 is made1 . In all three models, the response distribution p (ˆV , sA |sV , sA ) is obtained by idens ˆ tifying it with the posterior distribution p (sV , sA |xV , xA ). This procedure thus implicitly assumes that marginalizing over the latent variables xV and xA is not necessary, which leads to a significant error for non-Gaussian priors. In this paper we correctly deal with these issues and in all cases marginalize over the latent variables. The parametric models used for the coupling between the cues lead to an elegant low-dimensional model of cue integration that allows for estimates of single cues that differ from one another. C C=1 SA S XA 2.3 C=2 XV SV XA XV Causal inference model In the causal inference model [24, 25], we start from the traditional cue integration model but remove the assumption that two signals are caused by the same source. Instead, the number of sources can be one or two and is itself a variable that needs to be inferred from the cues. Figure 1: Generative model of causal inference. 1 This family of Bayesian posterior distributions also includes one used to successfully model cue combination in depth perception [27, 28]. In depth perception, however, there is no notion of segregation as always a single surface is assumed. 3 If there are two sources, they are assumed to be independent. Thus, we use the graphical model depicted in Fig. 1. We denote the number of sources by C. The probability distribution over C given internal representations xV and xA is given by Bayes’ rule: p (C|xV , xA ) ∝ p (xV , xA |C) p (C) . In this equation, p (C) is the a priori probability of C. We will denote the probability of a common cause by pcommon , so that p (C = 1) = pcommon and p (C = 2) = 1 − pcommon . The probability of generating xV and xA given C is obtained by inserting a summation over the sources: p (xV , xA |C = 1) = p (xV , xA |s)p (s) ds = p (xV |s) p (xA |s)p (s) ds Here p (s) is a prior for spatial location, which we assume to be distributed as N (s; 0, σP ). Then all three factors in this integral are Gaussians, allowing for an analytic solution: p (xV , xA |C = 1) = 2 2 2 2 2 −xA )2 σP σA √ 2 2 1 2 2 2 2 exp − 1 (xV σ2 σ2 +σ2+xV+σ2+xA σV . 2 σ2 σ2 2π σV σA +σV σP +σA σP V A V P A P For p (xV , xA |C = 2) we realize that xV and xA are independent of each other and thus obtain p (xV , xA |C = 2) = p (xV |sV )p (sV ) dsV p (xA |sA )p (sA ) dsA Again, as all these distributions are assumed to be Gaussian, we obtain an analytic solution, x2 x2 1 1 V A p (xV , xA |C = 2) = exp − 2 σ2 +σ2 + σ2 +σ2 . Now that we have com2 +σ 2 2 +σ 2 p p V A 2π (σV p )(σA p) puted p (C|xV , xA ), the posterior distribution over sources is given by p (si |xV , xA ) = p (si |xV , xA , C) p (C|xV , xA ) C=1,2 where i can be V or A and the posteriors conditioned on C are well-known: p (si |xA , xV , C = 1) = p (xA |si ) p (xV |si ) p (si ) , p (xA |s) p (xV |s) p (s) ds p (si |xA , xV , C = 2) = p (xi |si ) p (si ) p (xi |si ) p (si ) dsi The former is the same as in the case of mandatory integration with a prior, the latter is simply the unimodal posterior in the presence of a prior. Based on the posterior distribution on a given trial, p (si |xV , xA ), an estimate has to be created. For this, we use a sum-squared-error cost func2 2 tion, Cost = p (C = 1|xV , xA ) (ˆ − s) + p (C = 2|xV , xA ) (ˆ − sV or A ) . Then the best s s estimate is the mean of the posterior distribution, for instance for the visual estimation: sV = p (C = 1|xA , xV ) sV,C=1 + p (C = 2|xA , xV ) sV,C=2 ˆ ˆ ˆ where sV,C=1 = ˆ −2 −2 −2 xV σV +xA σA +xP σP −2 −2 −2 σV +σA +σP and sV,C=2 = ˆ −2 −2 xV σV +xP σP . −2 −2 σV +σP If pcommon equals 0 or 1, this estimate reduces to one of the conditioned estimates and is linear in xV and xA . If 0 < pcommon < 1, the estimate is a nonlinear combination of xV and xA , because of the functional form of p (C|xV , xA ). The response distributions, that is the distributions of sV and sA given ˆ ˆ sV and sA over many trials, now cannot be identified with the posterior distribution on a single trial and cannot be computed analytically either. The correct way to obtain the response distribution is to simulate an experiment numerically. Note that the causal inference model above can also be cast in the form of a bisensory stimulus prior by integrating out the latent variable C, with: p (sA , sV ) = p (C = 1) δ (sA − sV ) p (sA ) + p (sA ) p (sV ) p (C = 2) However, in addition to justifying the form of the interaction between the cues, the causal inference model has the advantage of being based on a generative model that well formalizes salient properties of the world, and it thereby also allows to predict judgments of unity. 4 3 Model performance and comparison To examine the performance of the causal inference model and to compare it to previous models, we performed a human psychophysics experiment in which we adopted the same dual-report paradigm as was used in [11]. Observers were simultaneously presented with a brief visual and also an auditory stimulus, each of which could originate from one of five locations on an imaginary horizontal line (-10◦ , -5◦ , 0◦ , 5◦ , or 10◦ with respect to the fixation point). Auditory stimuli were 32 ms of white noise filtered through an individually calibrated head related transfer function (HRTF) and presented through a pair of headphones, whereas the visual stimuli were high contrast Gabors on a noisy background presented on a 21-inch CRT monitor. Observers had to report by means of a key press (1-5) the perceived positions of both the visual and the auditory stimulus. Each combination of locations was presented with the same frequency over the course of the experiment. In this way, for each condition, visual and auditory response histograms were obtained. We obtained response distributions for each the three models described above by numeral simulation. On each trial, estimation is followed by a step in which, the key is selected which corresponds to the position closed to the best estimate. The simulated histograms obtained in this way were compared to the measured response frequencies of all subjects by computing the R2 statistic. Auditory response Auditory model Visual response Visual model no vision The parameters in the causal inference model were optimized using fminsearch in MATLAB to maximize R2 . The best combination of parameters yielded an R2 of 0.97. The response frequencies are depicted in Fig. 2. The bisensory prior models also explain most of the variance, with R2 = 0.96 for the Roach model and R2 = 0.91 for the Bresciani model. This shows that it is possible to model cue combination for large disparities well using such models. no audio 1 0 Figure 2: A comparison between subjects’ performance and the causal inference model. The blue line indicates the frequency of subjects responses to visual stimuli, red line is the responses to auditory stimuli. Each set of lines is one set of audio-visual stimulus conditions. Rows of conditions indicate constant visual stimulus, columns is constant audio stimulus. Model predictions is indicated by the red and blue dotted line. 5 3.1 Model comparison To facilitate quantitative comparison with other models, we now fit the parameters of each model2 to individual subject data, maximizing the likelihood of the model, i.e., the probability of the response frequencies under the model. The causal inference model fits human data better than the other models. Compared to the best fit of the causal inference model, the Bresciani model has a maximal log likelihood ratio (base e) of the data of −22 ± 6 (mean ± s.e.m. over subjects), and the Roach model has a maximal log likelihood ratio of the data of −18 ± 6. A causal inference model that maximizes the probability of being correct instead of minimizing the mean squared error has a maximal log likelihood ratio of −18 ± 3. These values are considered decisive evidence in favor of the causal inference model that minimizes the mean squared error (for details, see [25]). The parameter values found in the likelihood optimization of the causal model are as follows: pcommon = 0.28 ± 0.05, σV = 2.14 ± 0.22◦ , σA = 9.2 ± 1.1◦ , σP = 12.3 ± 1.1◦ (mean ± s.e.m. over subjects). We see that there is a relatively low prior probability of a common cause. In this paradigm, auditory localization is considerably less precise than visual localization. Also, there is a weak prior for central locations. 3.2 Localization bias A useful quantity to gain more insight into the structure of multisensory data is the cross-modal bias. In our experiment, relative auditory bias is defined as the difference between the mean auditory estimate in a given condition and the real auditory position, divided by the difference between the real visual position and the real auditory position in this condition. If the influence of vision on the auditory estimate is strong, then the relative auditory bias will be high (close to one). It is well-known that bias decreases with spatial disparity and our experiment is no exception (solid line in Fig. 3; data were combined between positive and negative disparities). It can easily be shown that a traditional cue integration model would predict a bias equal to σ2 −1 , which would be close to 1 and 1 + σV 2 A independent of disparity, unlike the data. This shows that a mandatory integration model is an insufficient model of multisensory interactions. 45 % Auditory Bias We used the individual subject fittings from above and and averaged the auditory bias values obtained from those fits (i.e. we did not fit the bias data themselves). Fits are shown in Fig. 3 (dashed lines). We applied a paired t-test to the differences between the 5◦ and 20◦ disparity conditions (model-subject comparison). Using a double-sided test, the null hypothesis that the difference between the bias in the 5◦ and 20◦ conditions is correctly predicted by each model is rejected for the Bresciani model (p < 0.002) and the Roach model (p < 0.042) and accepted for the causal inference model (p > 0.17). Alternatively, with a single-sided test, the hypothesis is rejected for the Bresciani model (p < 0.001) and the Roach model (p < 0.021) and accepted for the causal inference model (> 0.9). 50 40 35 30 25 20 5 10 15 Spatial Disparity (deg.) 20 Figure 3: Auditory bias as a function of spatial disparity. Solid blue line: data. Red: Causal inference model. Green: Model by Roach et al. [23]. Purple: Model by Bresciani et al. [22]. Models were optimized on response frequencies (as in Fig. 2), not on the bias data. The reason that the Bresciani model fares worst is that its prior distribution does not include a component that corresponds to independent causes. On 2 The Roach et al. model has four free parameters (ω,σV , σA , σcoupling ), the Bresciani et al. model has three (σV , σA , σcoupling ), and the causal inference model has four (pcommon ,σV , σA , σP ). We do not consider the Shams et al. model here, since it has many more parameters and it is not immediately clear how in this model the erroneous identification of posterior with response distribution can be corrected. 6 the contrary, the prior used in the Roach model contains two terms, one term that is independent of the disparity and one term that decreases with increasing disparity. It is thus functionally somewhat similar to the causal inference model. 4 Discussion We have argued that any model of multisensory perception should account not only for situations of small, but also of large conflict. In these situations, segregation is more likely, in which the two stimuli are not perceived to have the same cause. Even when segregation occurs, the two stimuli can still influence each other. We compared three Bayesian models designed to account for situations of large conflict by applying them to auditory-visual spatial localization data. We pointed out a common mistake: for nonGaussian bisensory priors without mandatory integration, the response distribution can no longer be identified with the posterior distribution. After correct implementation of the three models, we found that the causal inference model is superior to the models with ad hoc bisensory priors. This is expected, as the nervous system actually needs to solve the problem of deciding which stimuli have a common cause and which stimuli are unrelated. We have seen that multisensory perception is a suitable tool for studying causal inference. However, the causal inference model also has the potential to quantitatively explain a number of other perceptual phenomena, including perceptual grouping and binding, as well as within-modality cue combination [27, 28]. Causal inference is a universal problem: whenever the brain has multiple pieces of information it must decide if they relate to one another or are independent. As the causal inference model describes how the brain processes probabilistic sensory information, the question arises about the neural basis of these processes. Neural populations encode probability distributions over stimuli through Bayes’ rule, a type of coding known as probabilistic population coding. Recent work has shown how the optimal cue combination assuming a common cause can be implemented in probabilistic population codes through simple linear operations on neural activities [29]. This framework makes essential use of the structure of neural variability and leads to physiological predictions for activity in areas that combine multisensory input, such as the superior colliculus. Computational mechanisms for causal inference are expected have a neural substrate that generalizes these linear operations on population activities. A neural implementation of the causal inference model will open the door to a complete neural theory of multisensory perception. References [1] H.L. Pick, D.H. Warren, and J.C. Hay. Sensory conflict in judgements of spatial direction. Percept. Psychophys., 6:203205, 1969. [2] D. H. Warren, R. B. Welch, and T. J. McCarthy. The role of visual-auditory ”compellingness” in the ventriloquism effect: implications for transitivity among the spatial senses. Percept Psychophys, 30(6):557– 64, 1981. [3] D. Alais and D. Burr. The ventriloquist effect results from near-optimal bimodal integration. Curr Biol, 14(3):257–62, 2004. [4] R. A. Jacobs. Optimal integration of texture and motion cues to depth. Vision Res, 39(21):3621–9, 1999. [5] R. J. van Beers, A. C. Sittig, and J. J. Gon. Integration of proprioceptive and visual position-information: An experimentally supported model. J Neurophysiol, 81(3):1355–64, 1999. [6] D. H. Warren and W. T. Cleaves. Visual-proprioceptive interaction under large amounts of conflict. J Exp Psychol, 90(2):206–14, 1971. [7] C. E. Jack and W. R. Thurlow. Effects of degree of visual association and angle of displacement on the ”ventriloquism” effect. Percept Mot Skills, 37(3):967–79, 1973. [8] G. H. Recanzone. Auditory influences on visual temporal rate perception. J Neurophysiol, 89(2):1078–93, 2003. [9] J. P. Bresciani, M. O. Ernst, K. Drewing, G. Bouyer, V. Maury, and A. Kheddar. Feeling what you hear: auditory signals can modulate tactile tap perception. Exp Brain Res, 162(2):172–80, 2005. 7 [10] R. Gepshtein, P. Leiderman, L. Genosar, and D. Huppert. Testing the three step excited state proton transfer model by the effect of an excess proton. J Phys Chem A Mol Spectrosc Kinet Environ Gen Theory, 109(42):9674–84, 2005. [11] L. Shams, W. J. Ma, and U. Beierholm. Sound-induced flash illusion as an optimal percept. Neuroreport, 16(17):1923–7, 2005. [12] G Thomas. Experimental study of the influence of vision on sound localisation. J Exp Psychol, 28:167177, 1941. [13] W. R. Thurlow and C. E. Jack. Certain determinants of the ”ventriloquism effect”. Percept Mot Skills, 36(3):1171–84, 1973. [14] C.S. Choe, R. B. Welch, R.M. Gilford, and J.F. Juola. The ”ventriloquist effect”: visual dominance or response bias. Perception and Psychophysics, 18:55–60, 1975. [15] R. I. Bermant and R. B. Welch. Effect of degree of separation of visual-auditory stimulus and eye position upon spatial interaction of vision and audition. Percept Mot Skills, 42(43):487–93, 1976. [16] R. B. Welch and D. H. Warren. Immediate perceptual response to intersensory discrepancy. Psychol Bull, 88(3):638–67, 1980. [17] P. Bertelson and M. Radeau. Cross-modal bias and perceptual fusion with auditory-visual spatial discordance. Percept Psychophys, 29(6):578–84, 1981. [18] P. Bertelson, F. Pavani, E. Ladavas, J. Vroomen, and B. de Gelder. Ventriloquism in patients with unilateral visual neglect. Neuropsychologia, 38(12):1634–42, 2000. [19] D. A. Slutsky and G. H. Recanzone. Temporal and spatial dependency of the ventriloquism effect. Neuroreport, 12(1):7–10, 2001. [20] J. Lewald, W. H. Ehrenstein, and R. Guski. Spatio-temporal constraints for auditory–visual integration. Behav Brain Res, 121(1-2):69–79, 2001. [21] M. T. Wallace, G. E. Roberson, W. D. Hairston, B. E. Stein, J. W. Vaughan, and J. A. Schirillo. Unifying multisensory signals across time and space. Exp Brain Res, 158(2):252–8, 2004. [22] J. P. Bresciani, F. Dammeier, and M. O. Ernst. Vision and touch are automatically integrated for the perception of sequences of events. J Vis, 6(5):554–64, 2006. [23] N. W. Roach, J. Heron, and P. V. McGraw. Resolving multisensory conflict: a strategy for balancing the costs and benefits of audio-visual integration. Proc Biol Sci, 273(1598):2159–68, 2006. [24] K. P. Kording and D. M. Wolpert. Bayesian decision theory in sensorimotor control. Trends Cogn Sci, 2006. 1364-6613 (Print) Journal article. [25] K.P. Kording, U. Beierholm, W.J. Ma, S. Quartz, J. Tenenbaum, and L. Shams. Causal inference in multisensory perception. PLoS ONE, 2(9):e943, 2007. [26] Z. Ghahramani. Computational and psychophysics of sensorimotor integration. PhD thesis, Massachusetts Institute of Technology, 1995. [27] D. C. Knill. Mixture models and the probabilistic structure of depth cues. Vision Res, 43(7):831–54, 2003. [28] D. C. Knill. Robust cue integration: A bayesian model and evidence from cue conflict studies with stereoscopic and figure cues to slant. Journal of Vision, 7(7):2–24. [29] W. J. Ma, J. M. Beck, P. E. Latham, and A. Pouget. Bayesian inference with probabilistic population codes. Nat Neurosci, 9(11):1432–8, 2006. 8

3 0.4742493 3 nips-2007-A Bayesian Model of Conditioned Perception

Author: Alan Stocker, Eero P. Simoncelli

Abstract: unkown-abstract

4 0.46229246 198 nips-2007-The Noisy-Logical Distribution and its Application to Causal Inference

Author: Hongjing Lu, Alan L. Yuille

Abstract: We describe a novel noisy-logical distribution for representing the distribution of a binary output variable conditioned on multiple binary input variables. The distribution is represented in terms of noisy-or’s and noisy-and-not’s of causal features which are conjunctions of the binary inputs. The standard noisy-or and noisy-andnot models, used in causal reasoning and artificial intelligence, are special cases of the noisy-logical distribution. We prove that the noisy-logical distribution is complete in the sense that it can represent all conditional distributions provided a sufficient number of causal factors are used. We illustrate the noisy-logical distribution by showing that it can account for new experimental findings on how humans perform causal reasoning in complex contexts. We speculate on the use of the noisy-logical distribution for causal reasoning and artificial intelligence.

5 0.40578547 203 nips-2007-The rat as particle filter

Author: Aaron C. Courville, Nathaniel D. Daw

Abstract: Although theorists have interpreted classical conditioning as a laboratory model of Bayesian belief updating, a recent reanalysis showed that the key features that theoretical models capture about learning are artifacts of averaging over subjects. Rather than learning smoothly to asymptote (reflecting, according to Bayesian models, the gradual tradeoff from prior to posterior as data accumulate), subjects learn suddenly and their predictions fluctuate perpetually. We suggest that abrupt and unstable learning can be modeled by assuming subjects are conducting inference using sequential Monte Carlo sampling with a small number of samples — one, in our simulations. Ensemble behavior resembles exact Bayesian models since, as in particle filters, it averages over many samples. Further, the model is capable of exhibiting sophisticated behaviors like retrospective revaluation at the ensemble level, even given minimally sophisticated individuals that do not track uncertainty in their beliefs over trials. 1

6 0.36751711 1 nips-2007-A Bayesian Framework for Cross-Situational Word-Learning

7 0.35967964 125 nips-2007-Markov Chain Monte Carlo with People

8 0.35938126 119 nips-2007-Learning with Tree-Averaged Densities and Distributions

9 0.34665331 130 nips-2007-Modeling Natural Sounds with Modulation Cascade Processes

10 0.34628418 37 nips-2007-Blind channel identification for speech dereverberation using l1-norm sparse learning

11 0.33805412 110 nips-2007-Learning Bounds for Domain Adaptation

12 0.29698128 57 nips-2007-Congruence between model and human attention reveals unique signatures of critical visual events

13 0.2918275 174 nips-2007-Selecting Observations against Adversarial Objectives

14 0.28740913 29 nips-2007-Automatic Generation of Social Tags for Music Recommendation

15 0.28139541 2 nips-2007-A Bayesian LDA-based model for semi-supervised part-of-speech tagging

16 0.27714071 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data

17 0.27701873 81 nips-2007-Estimating disparity with confidence from energy neurons

18 0.27640921 9 nips-2007-A Probabilistic Approach to Language Change

19 0.27470335 18 nips-2007-A probabilistic model for generating realistic lip movements from speech

20 0.27373394 26 nips-2007-An online Hebbian learning rule that performs Independent Component Analysis


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.036), (13, 0.019), (16, 0.017), (18, 0.046), (19, 0.019), (21, 0.039), (34, 0.012), (35, 0.034), (47, 0.05), (83, 0.079), (85, 0.01), (90, 0.051), (95, 0.475)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.78462315 150 nips-2007-Optimal models of sound localization by barn owls

Author: Brian J. Fischer

Abstract: Sound localization by barn owls is commonly modeled as a matching procedure where localization cues derived from auditory inputs are compared to stored templates. While the matching models can explain properties of neural responses, no model explains how the owl resolves spatial ambiguity in the localization cues to produce accurate localization for sources near the center of gaze. Here, I examine two models for the barn owl’s sound localization behavior. First, I consider a maximum likelihood estimator in order to further evaluate the cue matching model. Second, I consider a maximum a posteriori estimator to test whether a Bayesian model with a prior that emphasizes directions near the center of gaze can reproduce the owl’s localization behavior. I show that the maximum likelihood estimator can not reproduce the owl’s behavior, while the maximum a posteriori estimator is able to match the behavior. This result suggests that the standard cue matching model will not be sufficient to explain sound localization behavior in the barn owl. The Bayesian model provides a new framework for analyzing sound localization in the barn owl and leads to predictions about the owl’s localization behavior.

2 0.37429667 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs

Author: Stephane Ross, Joelle Pineau, Brahim Chaib-draa

Abstract: Planning in partially observable environments remains a challenging problem, despite significant recent advances in offline approximation techniques. A few online methods have also been proposed recently, and proven to be remarkably scalable, but without the theoretical guarantees of their offline counterparts. Thus it seems natural to try to unify offline and online techniques, preserving the theoretical properties of the former, and exploiting the scalability of the latter. In this paper, we provide theoretical guarantees on an anytime algorithm for POMDPs which aims to reduce the error made by approximate offline value iteration algorithms through the use of an efficient online searching procedure. The algorithm uses search heuristics based on an error analysis of lookahead search, to guide the online search towards reachable beliefs with the most potential to reduce error. We provide a general theorem showing that these search heuristics are admissible, and lead to complete and ǫ-optimal algorithms. This is, to the best of our knowledge, the strongest theoretical result available for online POMDP solution methods. We also provide empirical evidence showing that our approach is also practical, and can find (provably) near-optimal solutions in reasonable time. 1

3 0.25177807 213 nips-2007-Variational Inference for Diffusion Processes

Author: Cédric Archambeau, Manfred Opper, Yuan Shen, Dan Cornford, John S. Shawe-taylor

Abstract: Diffusion processes are a family of continuous-time continuous-state stochastic processes that are in general only partially observed. The joint estimation of the forcing parameters and the system noise (volatility) in these dynamical systems is a crucial, but non-trivial task, especially when the system is nonlinear and multimodal. We propose a variational treatment of diffusion processes, which allows us to compute type II maximum likelihood estimates of the parameters by simple gradient techniques and which is computationally less demanding than most MCMC approaches. We also show how a cheap estimate of the posterior over the parameters can be constructed based on the variational free energy. 1

4 0.25172931 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC

Author: Matthew Hoffman, Arnaud Doucet, Nando D. Freitas, Ajay Jasra

Abstract: A recently proposed formulation of the stochastic planning and control problem as one of parameter estimation for suitable artificial statistical models has led to the adoption of inference algorithms for this notoriously hard problem. At the algorithmic level, the focus has been on developing Expectation-Maximization (EM) algorithms. In this paper, we begin by making the crucial observation that the stochastic control problem can be reinterpreted as one of trans-dimensional inference. With this new interpretation, we are able to propose a novel reversible jump Markov chain Monte Carlo (MCMC) algorithm that is more efficient than its EM counterparts. Moreover, it enables us to implement full Bayesian policy search, without the need for gradients and with one single Markov chain. The new approach involves sampling directly from a distribution that is proportional to the reward and, consequently, performs better than classic simulations methods in situations where the reward is a rare event.

5 0.25053823 63 nips-2007-Convex Relaxations of Latent Variable Training

Author: Yuhong Guo, Dale Schuurmans

Abstract: We investigate a new, convex relaxation of an expectation-maximization (EM) variant that approximates a standard objective while eliminating local minima. First, a cautionary result is presented, showing that any convex relaxation of EM over hidden variables must give trivial results if any dependence on the missing values is retained. Although this appears to be a strong negative outcome, we then demonstrate how the problem can be bypassed by using equivalence relations instead of value assignments over hidden variables. In particular, we develop new algorithms for estimating exponential conditional models that only require equivalence relation information over the variable values. This reformulation leads to an exact expression for EM variants in a wide range of problems. We then develop a semidefinite relaxation that yields global training by eliminating local minima. 1

6 0.24968253 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data

7 0.24955106 156 nips-2007-Predictive Matrix-Variate t Models

8 0.24651383 125 nips-2007-Markov Chain Monte Carlo with People

9 0.24647443 96 nips-2007-Heterogeneous Component Analysis

10 0.24632245 153 nips-2007-People Tracking with the Laplacian Eigenmaps Latent Variable Model

11 0.24592181 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images

12 0.24521524 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning

13 0.24464227 18 nips-2007-A probabilistic model for generating realistic lip movements from speech

14 0.24453783 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models

15 0.24433032 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)

16 0.24429728 86 nips-2007-Exponential Family Predictive Representations of State

17 0.2442473 122 nips-2007-Locality and low-dimensions in the prediction of natural experience from fMRI

18 0.24397708 3 nips-2007-A Bayesian Model of Conditioned Perception

19 0.24344338 187 nips-2007-Structured Learning with Approximate Inference

20 0.24300586 158 nips-2007-Probabilistic Matrix Factorization