nips nips2007 nips2007-130 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Richard Turner, Maneesh Sahani
Abstract: Natural sounds are structured on many time-scales. A typical segment of speech, for example, contains features that span four orders of magnitude: Sentences (∼ 1 s); phonemes (∼ 10⁻¹ s); glottal pulses (∼ 10⁻² s); and formants (∼ 10⁻³ s). The auditory system uses information from each of these time-scales to solve complicated tasks such as auditory scene analysis [1]. One route toward understanding how auditory processing accomplishes this analysis is to build neuroscience-inspired algorithms which solve similar tasks and to compare the properties of these algorithms with properties of auditory processing. There is however a discord: Current machine-audition algorithms largely concentrate on the shorter time-scale structures in sounds, and the longer structures are ignored. The reason for this is two-fold. Firstly, it is a difficult technical problem to construct an algorithm that utilises both sorts of information. Secondly, it is computationally demanding to simultaneously process data both at high resolution (to extract short temporal information) and for long duration (to extract long temporal information). The contribution of this work is to develop a new statistical model for natural sounds that captures structure across a wide range of time-scales, and to provide efficient learning and inference algorithms. We demonstrate the success of this approach on a missing data task. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Richard Turner and Maneesh Sahani, Gatsby Computational Neuroscience Unit, Alexandra House, 17 Queen Square, London WC1N 3AR. Abstract: Natural sounds are structured on many time-scales. [sent-2, score-0.239]
2 A typical segment of speech, for example, contains features that span four orders of magnitude: Sentences (∼ 1 s); phonemes (∼ 10⁻¹ s); glottal pulses (∼ 10⁻² s); and formants (∼ 10⁻³ s). [sent-3, score-0.218]
3 The auditory system uses information from each of these time-scales to solve complicated tasks such as auditory scene analysis [1]. [sent-4, score-0.514]
4 One route toward understanding how auditory processing accomplishes this analysis is to build neuroscience-inspired algorithms which solve similar tasks and to compare the properties of these algorithms with properties of auditory processing. [sent-5, score-0.514]
5 The contribution of this work is to develop a new statistical model for natural sounds that captures structure across a wide range of time-scales, and to provide efficient learning and inference algorithms. [sent-10, score-0.378]
6 In contrast, there has been much less success in the auditory domain, and this is due in part to the paucity of flexible models with an explicit temporal dimension (although see [2]). [sent-15, score-0.291]
7 In the first we review models for the short-time structure of sound and argue that a probabilistic time-frequency model has several distinct benefits over traditional time-frequency representations for auditory modeling. [sent-18, score-0.451]
8 In the third section these two models are combined with the notion of auditory features to produce a full generative model for sounds called the Modulation Cascade Process (MCP). [sent-20, score-0.667]
9 We then show how to carry out learning and inference in such a complex hierarchical model, and provide results on speech for complete and missing data tasks. [sent-21, score-0.206]
10 Of course, the spectral content of natural sounds changes slowly over time. [sent-24, score-0.339]
11 More specifically, the STFT (x_{d,t}) and spectrogram (s_{d,t}) of a discretised sound (y_t) are given by x_{d,t} = Σ_{t'=1}^{T} r_{t−t'} y_{t'} exp(−i ω_d t'), and s_{d,t} = log |x_{d,t}|. [sent-26, score-0.178]
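To make the notation concrete, the following NumPy sketch computes windowed Fourier coefficients and a log-spectrogram in the form given above; the Gaussian window r, the frequency grid and all parameter values are illustrative choices rather than the paper's settings.

```python
import numpy as np

def stft_spectrogram(y, win_len=256, hop=64, n_freqs=128):
    """Windowed Fourier coefficients x[d, t] and log-spectrogram s[d, t].

    y        : 1-D discretised sound waveform
    win_len  : window length in samples (sets the time/frequency trade-off)
    hop      : step between successive window positions
    n_freqs  : number of frequency channels omega_d
    """
    # Gaussian window r over each frame (an illustrative choice of r)
    n = np.arange(win_len) - win_len / 2
    r = np.exp(-0.5 * (n / (win_len / 6)) ** 2)

    omegas = np.pi * np.arange(1, n_freqs + 1) / n_freqs      # frequency grid
    frames = range(0, len(y) - win_len, hop)

    x = np.zeros((n_freqs, len(frames)), dtype=complex)
    for ti, start in enumerate(frames):
        seg = r * y[start:start + win_len]
        t_idx = np.arange(start, start + win_len)
        # x_{d,t} = sum_{t'} r_{t-t'} y_{t'} exp(-i omega_d t')
        x[:, ti] = seg @ np.exp(-1j * np.outer(t_idx, omegas))
    s = np.log(np.abs(x) + 1e-12)                              # log-spectrogram
    return x, s

# Example: spectrogram of a chirp-like test signal
y = np.sin(2 * np.pi * np.linspace(0, 1, 8000) ** 2 * 200)
x, s = stft_spectrogram(y)
```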
12 For example, the window for speech is typically chosen to last for several pitch periods, so that both pitch and formant information is represented spectrally. [sent-28, score-0.291]
13 The first stage of the auditory pathway derives a time-frequency-like representation mechanically at the basilar membrane. [sent-29, score-0.288]
14 Subsequent stages extract progressively more complex auditory features, with structure extending over more time. [sent-30, score-0.317]
15 Thus, computational models of auditory processing often begin with a time-frequency (or auditory-filter bank) decomposition, deriving new representations from the time-frequency coefficients [4]. [sent-31, score-0.291]
16 The potential advantage lies in the ease with which auditory features may be extracted from the STFT representation. [sent-33, score-0.307]
17 This means that realisable sounds live on a manifold in the time-frequency space (for the STFT this manifold is a hyper-plane). [sent-38, score-0.335]
18 The heuristic behind the STFT – that sound comprises sinusoids in slowly-varying linear superposition – led Qi et al [6] to propose a probabilistic algorithm called Bayesian Spectrum Estimation (BSE), in which the sinusoid coefficients (xd,t ) are latent variables. [sent-44, score-0.235]
19 BSE is a model for the short-time structure of sounds and it will essentially form the bottom level of the MCP. [sent-53, score-0.346]
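The sketch below illustrates one plausible reading of this generative view: sinusoidal coefficients that drift slowly under AR(1) dynamics and combine linearly to give the waveform. The dynamics, frequency grid and noise levels are assumptions made for illustration, not Qi et al.'s exact priors; because such a model is conditionally linear-Gaussian, exact inference over the coefficients is possible with Kalman-style recursions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_bse(T=2000, n_freqs=8, lam=0.995, sigma=0.05, noise=0.01):
    """Draw a sound from a BSE-style generative model: sinusoids whose
    amplitude coefficients drift slowly (AR(1) latents), summed and
    observed in Gaussian noise. All parameter values are illustrative."""
    omegas = np.pi * np.arange(1, n_freqs + 1) / (2 * n_freqs)
    a = np.zeros((n_freqs, T))   # cosine coefficients
    b = np.zeros((n_freqs, T))   # sine coefficients
    for t in range(1, T):
        a[:, t] = lam * a[:, t - 1] + sigma * rng.standard_normal(n_freqs)
        b[:, t] = lam * b[:, t - 1] + sigma * rng.standard_normal(n_freqs)
    ts = np.arange(T)
    y = (a * np.cos(np.outer(omegas, ts)) +
         b * np.sin(np.outer(omegas, ts))).sum(axis=0)
    return y + noise * rng.standard_normal(T)

y = sample_bse()
```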
20 3 Probabilistic Demodulation Cascade A salient property of the long-time statistics of sounds is the persistence of strong amplitude modulation [7]. [sent-55, score-0.424]
21 Motivated by these observations, Anonymous Authors [8] have proposed a model for the long-time structures in sounds using a demodulation cascade. [sent-58, score-0.468]
22 The basic idea of the demodulation cascade is to represent a sound as a product of processes drawn from a hierarchy, or cascade, of progressively longer time-scale modulators. [sent-59, score-0.426]
23 For speech this might involve three processes: representing sentences on top, phonemes in the middle, and pitch and formants at the bottom (e. [sent-60, score-0.39]
24 To construct such a representation, one might start with a traditional amplitude demodulation algorithm, which decomposes a signal into a quickly-varying carrier and more slowly-varying envelope. [sent-64, score-0.494]
25 The cascade could then be built by applying the same algorithm to the (possibly transformed) envelope, and then to the envelope that results from this, and so on. [sent-65, score-0.196]
26 This procedure is only stable, however, if both the carrier and the envelope found by the demodulation algorithm are well-behaved. [sent-66, score-0.487]
27 Unfortunately, traditional methods (like the Hilbert Transform, or low-pass filtering a non-linear transformation of the stimulus) return a suitable carrier or envelope, but not both. [sent-67, score-0.199]
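For comparison, here is a minimal sketch of the traditional Hilbert-transform approach mentioned above: it returns a positive envelope, but the implied carrier (the signal divided by that envelope) is not guaranteed to be well-behaved, which is the instability the text refers to.

```python
import numpy as np
from scipy.signal import hilbert

def hilbert_demodulate(y):
    """Classic amplitude demodulation via the analytic signal: the envelope
    is positive by construction, but the implied carrier y / envelope need
    not be well-behaved, especially for broadband carriers."""
    analytic = hilbert(y)
    envelope = np.abs(analytic)
    carrier = y / np.maximum(envelope, 1e-12)   # guard against division by zero
    return carrier, envelope

# Example: a noise carrier modulated by a slow positive envelope
rng = np.random.default_rng(1)
t = np.linspace(0, 1, 8000)
true_env = 1 + 0.8 * np.sin(2 * np.pi * 3 * t)
y = true_env * rng.standard_normal(t.size)
carrier, envelope = hilbert_demodulate(y)
```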
28 A new approach to amplitude demodulation is thus called for. [sent-68, score-0.295]
29 In a nutshell, the new approach is to view amplitude demodulation as a task of probabilistic inference. [sent-69, score-0.327]
30 This is natural, as demodulation is fundamentally ill-posed — there are infinitely many decompositions of a signal into a positive envelope and real valued carrier — and so prior information must always be leveraged to realise such a decomposition. [sent-70, score-0.487]
31 Furthermore, it is not necessary to use the recursive procedure (just described) to derive a modulation cascade: the whole hierarchy can be estimated at once using a single generative model. [sent-72, score-0.328]
32 The generative model for Probabilistic Amplitude Demodulation (PAD) is: p(z_0^{(m)}) = Norm(0, 1), p(z_t^{(m)} | z_{t−1}^{(m)}) = Norm(λ_m z_{t−1}^{(m)}, σ_m^2) ∀ t > 0, (4) x_t^{(m)} = f_{a^{(m)}}(z_t^{(m)}) ∀ m > 1, x_t^{(1)} ∼ Norm(0, 1), y_t = ∏_{m=1}^{M} x_t^{(m)}. (5) [sent-73, score-0.579]
33 A set of modulators (X^{(2:M)}) are drawn in a two-stage process: First a set of slowly varying processes (Z^{(2:M)}) are drawn from a one-step linear Gaussian prior (identical to Eq. [sent-74, score-0.458]
34 Second the modulators are formed by passing these variables through a point-wise non-linearity to enforce positivity. [sent-77, score-0.371]
35 A typical choice might be f_{a^{(m)}}(z_t^{(m)}) = log(exp(z_t^{(m)} + a^{(m)}) + 1), (6) which is exponentially small for large negative values of z_t^{(m)}, and linear for large positive values. [sent-78, score-0.283]
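A minimal sampling sketch of Eqs. (4)-(6) follows; the number of levels, time constants, innovation variances and offsets a^{(m)} are illustrative values, not those used in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
softplus = lambda z, a: np.log1p(np.exp(z + a))   # f_a(z) as in Eq. (6)

def sample_pad(T=8000, lams=(0.999, 0.9999), sigmas=(0.045, 0.014), a=(1.0, 1.0)):
    """Sample from the PAD generative model, Eqs. (4)-(5): a white-noise
    carrier multiplied by M-1 positive modulators obtained by passing slow
    AR(1) processes through the soft-plus nonlinearity."""
    y = rng.standard_normal(T)                     # x^{(1)}: white-noise carrier
    modulators = []
    for lam, sig, am in zip(lams, sigmas, a):
        z = np.zeros(T)
        z[0] = rng.standard_normal()               # z_0 ~ Norm(0, 1)
        for t in range(1, T):
            z[t] = lam * z[t - 1] + sig * rng.standard_normal()
        x = softplus(z, am)                        # positive modulator x^{(m)}
        modulators.append(x)
        y = y * x                                  # y_t = prod_m x_t^{(m)}
    return y, modulators

y, mods = sample_pad()
```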
36 For example, as described in the previous section, we expect that the carrier process will be structured and yet it is modelled as Gaussian white noise. [sent-85, score-0.232]
37 Firstly the carrier (x_t^{(1)}) is integrated out and then the modulators are found by maximum a posteriori (MAP). [sent-88, score-0.538]
38 Slower, more Bayesian algorithms that integrate over the modulators using MCMC indicate that this approximation is not too severe, and the results are compelling. [sent-89, score-0.339]
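To make the inference scheme concrete, the sketch below writes down a negative log-posterior of the kind such a MAP optimisation could use: because the carrier is unit-variance white noise, integrating it out leaves y_t zero-mean Gaussian with standard deviation equal to the product of the modulators. The paper's exact objective and optimiser are not reproduced here, so treat this as a schematic under those assumptions.

```python
import numpy as np

def pad_map_objective(zs, y, lams, sigmas, a=1.0):
    """Negative log-posterior for PAD with the white-noise carrier integrated
    out: given the modulators, y_t ~ Norm(0, (prod_m x_t^{(m)})^2), plus the
    AR(1) priors on the underlying z's. A schematic objective one could hand
    to a gradient optimiser to obtain MAP modulators.

    zs : (M-1, T) latent z's for the modulators
    y  : (T,) observed waveform
    """
    xs = np.log1p(np.exp(zs + a))                  # positive modulators
    env = xs.prod(axis=0)                          # overall envelope
    nll = 0.5 * np.sum(np.log(2 * np.pi * env ** 2) + (y / env) ** 2)
    for z, lam, sig in zip(zs, lams, sigmas):      # AR(1) priors, Eq. (4)
        dz = z[1:] - lam * z[:-1]
        nll += 0.5 * np.sum(dz ** 2) / sig ** 2 + 0.5 * z[0] ** 2
    return nll
```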
39 4 Modulation Cascade Processes We have reviewed two contrasting models: The first captures the local harmonic structure of sounds, but has no long-time structure; The second captures long-time amplitude modulations, but models the short-time structure as white noise. [sent-90, score-0.317]
40 We are guided by the observation that the auditory system might implement a similar synthesis. [sent-92, score-0.257]
41 In the well-known psychophysical phenomenon of comodulation masking release (see [9] for a review), a tone masked by noise with a bandwidth greater than an auditory filter becomes audible. [sent-93, score-0.365]
Figure 1: An example of a modulation-cascade representation of speech (A and B) and typical samples from generative models used to derive that representation (C). [sent-98, score-0.255]
43 A) The spoken-speech waveform (black) is represented as the product of a carrier (blue), a phoneme modulator (red) and a sentence modulator (magenta). [sent-99, score-1.086]
44 B) A close up of the first sentence (2 s) additionally showing the derived envelope (x_t^{(2)} x_t^{(3)}) superposed onto the speech (red, bottom panel). [sent-100, score-0.323]
45 C) The generative model (M = 3) with a carrier (blue), a phoneme modulator (red) and a sentence modulator (magenta). [sent-101, score-1.165]
46 This suggests that long-time envelope information is processed and analysed across (short-time) frequency channels in the auditory system. [sent-103, score-0.405]
47 However, power across even widely separated channels of natural sounds can be strongly correlated [7]. [sent-107, score-0.306]
48 Thus, a synthesis of BSE and PAD should incorporate the notion of auditory features. [sent-110, score-0.257]
49 Once again, latent variables are arranged in a hierarchy according to their time-scales (which depend on m). [sent-113, score-0.236]
50 At the top of the hierarchy is a long-time process which models slow structures, like the sentences of speech. [sent-114, score-0.235]
51 Finally, the bottom level of the hierarchy captures short-time variability (intra-phoneme variability for instance). [sent-116, score-0.239]
52 So, for example if K1 = 4 and K2 = 2, there would be four quickly varying modulators in the lower level, two modulators in the middle level, and one slowly varying modulator at the top (see fig. [sent-118, score-1.118]
53 The idea is that the modulators in the first level independently control the presence or absence of individual spectral features (given by Σ_d g_{d,k1,k2} sin(ω_d t + φ_d)). [sent-120, score-0.502]
54 For example, in speech a typical phoneme might be periodic, but this periodicity might change systematically as the speaker alters their pitch. [sent-121, score-0.347]
55 This change in pitch might be modeled using two spectral features: one for the start of the phoneme and one for the end, with a region of coactivation in the middle. [sent-122, score-0.289]
56 Indeed it is because speech [Figure 2 schematic: the modulators x_t^{(2)} and x_t^{(3)} combine multiplicatively to produce y_t; panel B] [sent-123, score-0.185]
57 The hierarchy of latent variables moves from the slowest modulator at the top (magenta) to the fastest (blue) with an intermediate modulator between (red). [sent-135, score-0.839]
58 The outer-product of the modulators multiplies the generative weights (black and white, only 4 of the 8 shown). [sent-136, score-0.46]
59 and other natural sounds are not precisely stationary even over short time-scales that we require the lowest layer of the hierarchy. [sent-141, score-0.33]
60 The role of the modulators in the second level is to simultaneously turn on groups of similar features. [sent-142, score-0.376]
61 For example, one modulator might control the presence of all the harmonic features and the other the broad-band features. [sent-143, score-0.369]
62 Finally the top level modulator gates all the auditory features at once. [sent-144, score-0.668]
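Putting the three levels together, the sketch below draws a waveform from an MCP-like model with K1 = 4 and K2 = 2, using one consistent reading of the emission: the outer product of the modulators multiplies a generative tensor that weights D sinusoidal features (cf. the description of Figure 2). The tensor, time constants and nonlinearity here are random placeholders, not learned values.

```python
import numpy as np

rng = np.random.default_rng(3)

def softplus(z, a=1.0):
    return np.log1p(np.exp(z + a))

def ar1(T, lam, sigma):
    z = np.zeros(T)
    for t in range(1, T):
        z[t] = lam * z[t - 1] + sigma * rng.standard_normal()
    return z

def sample_mcp(T=4000, D=32, K1=4, K2=2):
    """Sketch of generation from an MCP-like model: a top-level 'sentence'
    modulator, K2 'phoneme' modulators and K1 fast modulators gate D
    sinusoidal features through the generative tensor g."""
    # Modulators at three time-scales (slow -> fast), all made positive
    x3 = softplus(ar1(T, 0.9999, 0.01))                         # top level
    x2 = softplus(np.array([ar1(T, 0.999, 0.03) for _ in range(K2)]))
    x1 = softplus(np.array([ar1(T, 0.99, 0.1) for _ in range(K1)]))

    omegas = np.pi * np.arange(1, D + 1) / (2 * D)
    phis = rng.uniform(0, 2 * np.pi, D)
    sinusoids = np.sin(np.outer(omegas, np.arange(T)) + phis[:, None])  # (D, T)

    g = rng.standard_normal((D, K1, K2)) * 0.1                  # generative tensor
    # Spectral feature (k1, k2): sum_d g[d,k1,k2] * sin(omega_d t + phi_d)
    features = np.einsum('dij,dt->ijt', g, sinusoids)           # (K1, K2, T)
    gain = x1[:, None, :] * x2[None, :, :] * x3[None, None, :]  # (K1, K2, T)
    y = (features * gain).sum(axis=(0, 1)) + 0.01 * rng.standard_normal(T)
    return y

y = sample_mcp()
```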
63 The MCP is similar, in that the higher level latent variables alter the power of similar auditory features. [sent-152, score-0.412]
64 The MCP can be seen as a generalisation of the GSMMs to include time-varying latent variables, a deeper hierarchy and a probabilistic time-frequency representation. [sent-154, score-0.236]
65 The second is that the generative tensor contains a large number of elements D × K1 × K2 , making learning slow too. [sent-163, score-0.274]
66 In the first stage of the initialisation, for example, the top and middle levels of the hierarchy are clamped and the mean of the emission distribution becomes µ_{y_t} = Σ_{d,k1} γ_{d,k1} x^{(1)}_{k1,t} sin(ω_d t + φ_d), (12) where γ_{d,k1} = Σ_{k2} g_{d,k1,k2}. [sent-169, score-0.265]
67 Learning and inference then proceed by gradient-based optimisation of the cost-function (log p(X, Y, G|θ)) with respect to the un-clamped latents (x^{(1)}_{k1,t}) and the contracted generative weights (γ_{d,k1}). [sent-170, score-0.294]
68 When this process is complete, the second layer of latent variables is un-clamped, and learning of these variables commences. [sent-172, score-0.178]
69 This requires the full generative tensor, which must be initialised from the contracted generative weights learned at the previous stage. [sent-173, score-0.334]
70 An alternative is to use small chunks of sounds to learn the lower level weights. [sent-175, score-0.306]
71 This allows us to make the simplifying assumption that just one second-level modulator was active during the chunk. [sent-177, score-0.279]
72 The generative tensor can therefore be initialised using g_{d,k1,k2} = γ_{d,k1} δ_{k2,j}. [sent-178, score-0.271]
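As a concrete illustration of this initialisation (the chunk index j, the dimensions and the contracted weights are placeholder values):

```python
import numpy as np

# Initialise the full tensor from the contracted weights gamma (D, K1) learned
# in the first stage, assuming chunk j used a single second-level modulator:
# g[d, k1, k2] = gamma[d, k1] * delta(k2, j)
D, K1, K2, j = 32, 4, 2, 0
gamma = np.random.randn(D, K1) * 0.1     # placeholder for learned contracted weights
g = np.zeros((D, K1, K2))
g[:, :, j] = gamma
```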
73 In the E-Step the latent variables are updated simultaneously using gradient-based optimisation of Eq. [sent-181, score-0.177]
74 In the M-Step, the generative tensor is updated using co-ordinate ascent. [sent-183, score-0.239]
75 In principle, joint optimisation of the generative tensor and latent variables is possible, but the memory requirements are prohibitive. [sent-186, score-0.416]
76 This is also why co-ordinate ascent is used to learn the generative tensor (rather than using the usual linear regression solution which involves a prohibitive matrix inverse). [sent-187, score-0.239]
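The sketch below shows the coordinate-ascent structure of such an M-step applied to a least-squares reconstruction term. The paper's actual objective also involves the observation noise model and any priors on G, so this is only a schematic of how per-element updates avoid the large matrix inverse.

```python
import numpy as np

def mstep_coordinate_ascent(y, regressors, g, n_sweeps=2):
    """Schematic M-step: coordinate ascent on the squared reconstruction error
    with respect to the flattened generative tensor g.

    y          : (T,) waveform
    regressors : (P, T) one row per tensor element, e.g. the product
                 x1[k1] * x2[k2] * x3 * sin(omega_d t + phi_d) from the E-step
    g          : (P,) current tensor elements (flattened)
    """
    resid = y - g @ regressors                     # current reconstruction error
    for _ in range(n_sweeps):
        for p in range(len(g)):
            phi = regressors[p]
            # optimal 1-D update of g[p] with all other coordinates held fixed
            delta = phi @ resid / (phi @ phi + 1e-12)
            g[p] += delta
            resid -= delta * phi                   # keep the residual in sync
    return g
```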
77 The time-scales of the modulators were chosen to be {20 ms, 200 ms, 2 s}. [sent-189, score-0.339]
78 These are coloured according to which of the phoneme modulators they interact most strongly with. [sent-200, score-0.547]
79 The speech waveform is shown in the bottom panel. [sent-201, score-0.182]
80 Right panels: The learned spectral features (g_sin² + g_cos²) coloured according to phoneme modulator. [sent-202, score-0.323]
81 Spectra corresponding to one phoneme modulator look similar and often the features differ only in their phase. [sent-204, score-0.502]
82 The MCP recovers a sentence modulator, phoneme modulators, and intra-phoneme modulators. [sent-207, score-0.287]
83 One way of assessing which features of speech the model captures is to sample from the forward model using the learned parameters. [sent-210, score-0.267]
84 The conclusion is that the model is capturing structure across a wide range of time-scales: formants and pitch structure, phoneme structure, and sentence structure. [sent-213, score-0.456]
85 The reason is that the learned generative tensor contains many g_{k1,k2} which are nearly zero. [sent-219, score-0.267]
86 In generation, this means that significant contributions to the output are only made when particular pairs of phoneme and intra-phoneme modulators are active. [sent-220, score-0.512]
87 So although many modulators are active at one time, only one or two make sizeable contributions. [sent-221, score-0.339]
88 Conversely, in inference, we can only get information about the value of a modulator when it is part of a contributing pair. [sent-222, score-0.279]
89 In order to use the MCP to fill in the missing data, it is first necessary to learn a set of auditory features. [sent-227, score-0.333]
90 The MCP was therefore trained on a different spoken sentence from the same speaker, before inference was carried out on the test data. [sent-228, score-0.207]
91 As PAD models the carrier as white noise it predicts zeros in the missing regions. [sent-236, score-0.308]
92 Figure 4: A selection of typical missing data results for three phonemes (columns). [sent-251, score-0.184]
93 The top row shows the original speech segment with the missing regions shown in red. [sent-252, score-0.221]
94 Both MCP and BSE smoothly interpolate their latent variables over the missing region. [sent-255, score-0.194]
95 However, whereas BSE smoothly interpolates each sinusoidal component independently, MCP interpolates the set of learned auditory features in a complex manner determined by the interaction of the modulators. [sent-256, score-0.367]
96 6 Conclusion We have introduced a neuroscience-inspired generative model for natural sounds that is capable of capturing structure spanning a wide range of temporal scales. [sent-258, score-0.459]
97 The model is a marriage between a probabilistic time-frequency representation (that captures the short-time structure) and a probabilistic demodulation cascade (that captures the long-time structure). [sent-259, score-0.458]
98 When the model is trained on a spoken sentence, the first level of the hierarchy learns auditory features (weighted sets of sinusoids) that capture structures like different voiced sections of speech. [sent-260, score-0.595]
99 The upper levels comprise a temporally ordered set of modulators that are used to represent sentence structure, phoneme structure and intra-phoneme variability. [sent-261, score-0.656]
100 (2000) Auditory images: How complex sounds are represented in the auditory system. [sent-278, score-0.496]
wordName wordTfidf (topN-words)
[('mcp', 0.379), ('modulators', 0.339), ('modulator', 0.279), ('auditory', 0.257), ('sounds', 0.239), ('bse', 0.199), ('carrier', 0.199), ('demodulation', 0.199), ('phoneme', 0.173), ('zkm', 0.159), ('pad', 0.121), ('generative', 0.121), ('stft', 0.12), ('tensor', 0.118), ('hierarchy', 0.118), ('sentence', 0.114), ('cascade', 0.107), ('speech', 0.1), ('amplitude', 0.096), ('modulation', 0.089), ('envelope', 0.089), ('latent', 0.086), ('yt', 0.085), ('zt', 0.083), ('pitch', 0.079), ('cients', 0.079), ('missing', 0.076), ('phonemes', 0.074), ('coef', 0.07), ('xt', 0.069), ('spoken', 0.063), ('formants', 0.06), ('gsmms', 0.06), ('sinusoids', 0.059), ('initialisation', 0.059), ('optimisation', 0.059), ('sound', 0.058), ('spectra', 0.056), ('magenta', 0.056), ('latents', 0.052), ('features', 0.05), ('top', 0.045), ('forward', 0.045), ('captures', 0.044), ('ica', 0.044), ('norm', 0.043), ('waveform', 0.042), ('segments', 0.041), ('bottom', 0.04), ('alters', 0.04), ('comodulation', 0.04), ('realisable', 0.04), ('release', 0.04), ('timefrequency', 0.04), ('voiced', 0.04), ('xkm', 0.04), ('harmonic', 0.04), ('sin', 0.039), ('level', 0.037), ('sentences', 0.037), ('spectral', 0.037), ('natural', 0.035), ('slow', 0.035), ('spectrogram', 0.035), ('coloured', 0.035), ('lewicki', 0.035), ('typical', 0.034), ('representations', 0.034), ('temporal', 0.034), ('fourier', 0.034), ('white', 0.033), ('window', 0.033), ('processes', 0.032), ('middle', 0.032), ('channels', 0.032), ('probabilistic', 0.032), ('variables', 0.032), ('sinusoidal', 0.032), ('contracted', 0.032), ('recursions', 0.032), ('initialised', 0.032), ('stage', 0.031), ('inference', 0.03), ('structures', 0.03), ('structure', 0.03), ('spectrum', 0.03), ('chunks', 0.03), ('nonstationary', 0.03), ('progressively', 0.03), ('panels', 0.029), ('short', 0.028), ('manifold', 0.028), ('layer', 0.028), ('slowly', 0.028), ('modulate', 0.028), ('masking', 0.028), ('learned', 0.028), ('varying', 0.028), ('frequency', 0.027), ('lter', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 130 nips-2007-Modeling Natural Sounds with Modulation Cascade Processes
Author: Richard Turner, Maneesh Sahani
Abstract: Natural sounds are structured on many time-scales. A typical segment of speech, for example, contains features that span four orders of magnitude: Sentences (∼ 1 s); phonemes (∼ 10⁻¹ s); glottal pulses (∼ 10⁻² s); and formants (∼ 10⁻³ s). The auditory system uses information from each of these time-scales to solve complicated tasks such as auditory scene analysis [1]. One route toward understanding how auditory processing accomplishes this analysis is to build neuroscience-inspired algorithms which solve similar tasks and to compare the properties of these algorithms with properties of auditory processing. There is however a discord: Current machine-audition algorithms largely concentrate on the shorter time-scale structures in sounds, and the longer structures are ignored. The reason for this is two-fold. Firstly, it is a difficult technical problem to construct an algorithm that utilises both sorts of information. Secondly, it is computationally demanding to simultaneously process data both at high resolution (to extract short temporal information) and for long duration (to extract long temporal information). The contribution of this work is to develop a new statistical model for natural sounds that captures structure across a wide range of time-scales, and to provide efficient learning and inference algorithms. We demonstrate the success of this approach on a missing data task. 1
2 0.095205724 18 nips-2007-A probabilistic model for generating realistic lip movements from speech
Author: Gwenn Englebienne, Tim Cootes, Magnus Rattray
Abstract: The present work aims to model the correspondence between facial motion and speech. The face and sound are modelled separately, with phonemes being the link between both. We propose a sequential model and evaluate its suitability for the generation of the facial animation from a sequence of phonemes, which we obtain from speech. We evaluate the results both by computing the error between generated sequences and real video, as well as with a rigorous double-blind test with human subjects. Experiments show that our model compares favourably to other existing methods and that the sequences generated are comparable to real video sequences. 1
3 0.082762852 145 nips-2007-On Sparsity and Overcompleteness in Image Models
Author: Pietro Berkes, Richard Turner, Maneesh Sahani
Abstract: Computational models of visual cortex, and in particular those based on sparse coding, have enjoyed much recent attention. Despite this currency, the question of how sparse or how over-complete a sparse representation should be, has gone without principled answer. Here, we use Bayesian model-selection methods to address these questions for a sparse-coding model based on a Student-t prior. Having validated our methods on toy data, we find that natural images are indeed best modelled by extremely sparse distributions; although for the Student-t prior, the associated optimal basis size is only modestly over-complete. 1
4 0.078017622 51 nips-2007-Comparing Bayesian models for multisensory cue combination without mandatory integration
Author: Ulrik Beierholm, Ladan Shams, Wei J. Ma, Konrad Koerding
Abstract: Bayesian models of multisensory perception traditionally address the problem of estimating an underlying variable that is assumed to be the cause of the two sensory signals. The brain, however, has to solve a more general problem: it also has to establish which signals come from the same source and should be integrated, and which ones do not and should be segregated. In the last couple of years, a few models have been proposed to solve this problem in a Bayesian fashion. One of these has the strength that it formalizes the causal structure of sensory signals. We first compare these models on a formal level. Furthermore, we conduct a psychophysics experiment to test human performance in an auditory-visual spatial localization task in which integration is not mandatory. We find that the causal Bayesian inference model accounts for the data better than other models. Keywords: causal inference, Bayesian methods, visual perception. 1
5 0.077466153 140 nips-2007-Neural characterization in partially observed populations of spiking neurons
Author: Jonathan W. Pillow, Peter E. Latham
Abstract: Point process encoding models provide powerful statistical methods for understanding the responses of neurons to sensory stimuli. Although these models have been successfully applied to neurons in the early sensory pathway, they have fared less well capturing the response properties of neurons in deeper brain areas, owing in part to the fact that they do not take into account multiple stages of processing. Here we introduce a new twist on the point-process modeling approach: we include unobserved as well as observed spiking neurons in a joint encoding model. The resulting model exhibits richer dynamics and more highly nonlinear response properties, making it more powerful and more flexible for fitting neural data. More importantly, it allows us to estimate connectivity patterns among neurons (both observed and unobserved), and may provide insight into how networks process sensory input. We formulate the estimation procedure using variational EM and the wake-sleep algorithm, and illustrate the model’s performance using a simulated example network consisting of two coupled neurons.
6 0.069725834 111 nips-2007-Learning Horizontal Connections in a Sparse Coding Model of Natural Images
7 0.065303378 173 nips-2007-Second Order Bilinear Discriminant Analysis for single trial EEG analysis
8 0.062441118 146 nips-2007-On higher-order perceptron algorithms
9 0.060127735 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons
10 0.05942145 103 nips-2007-Inferring Elapsed Time from Stochastic Neural Processes
11 0.058570087 77 nips-2007-Efficient Inference for Distributions on Permutations
12 0.056354079 21 nips-2007-Adaptive Online Gradient Descent
13 0.056243554 17 nips-2007-A neural network implementing optimal state estimation based on dynamic spike train decoding
14 0.053915594 153 nips-2007-People Tracking with the Laplacian Eigenmaps Latent Variable Model
15 0.051895946 9 nips-2007-A Probabilistic Approach to Language Change
16 0.047920235 72 nips-2007-Discriminative Log-Linear Grammars with Latent Variables
17 0.047881264 213 nips-2007-Variational Inference for Diffusion Processes
18 0.046902951 203 nips-2007-The rat as particle filter
19 0.04673937 68 nips-2007-Discovering Weakly-Interacting Factors in a Complex Stochastic Process
20 0.046556372 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
topicId topicWeight
[(0, -0.16), (1, 0.042), (2, 0.047), (3, -0.042), (4, 0.011), (5, 0.043), (6, -0.045), (7, 0.03), (8, -0.06), (9, -0.063), (10, 0.087), (11, 0.05), (12, 0.004), (13, 0.032), (14, -0.009), (15, -0.049), (16, -0.1), (17, 0.014), (18, 0.025), (19, -0.029), (20, -0.016), (21, 0.052), (22, 0.07), (23, 0.054), (24, -0.006), (25, 0.001), (26, 0.04), (27, 0.015), (28, 0.058), (29, 0.009), (30, 0.017), (31, 0.091), (32, -0.022), (33, 0.073), (34, 0.031), (35, -0.098), (36, 0.001), (37, -0.001), (38, -0.082), (39, 0.139), (40, 0.146), (41, -0.12), (42, 0.083), (43, 0.08), (44, -0.018), (45, 0.051), (46, 0.004), (47, -0.006), (48, 0.015), (49, -0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.91291851 130 nips-2007-Modeling Natural Sounds with Modulation Cascade Processes
Author: Richard Turner, Maneesh Sahani
Abstract: Natural sounds are structured on many time-scales. A typical segment of speech, for example, contains features that span four orders of magnitude: Sentences (∼ 1 s); phonemes (∼ 10⁻¹ s); glottal pulses (∼ 10⁻² s); and formants (∼ 10⁻³ s). The auditory system uses information from each of these time-scales to solve complicated tasks such as auditory scene analysis [1]. One route toward understanding how auditory processing accomplishes this analysis is to build neuroscience-inspired algorithms which solve similar tasks and to compare the properties of these algorithms with properties of auditory processing. There is however a discord: Current machine-audition algorithms largely concentrate on the shorter time-scale structures in sounds, and the longer structures are ignored. The reason for this is two-fold. Firstly, it is a difficult technical problem to construct an algorithm that utilises both sorts of information. Secondly, it is computationally demanding to simultaneously process data both at high resolution (to extract short temporal information) and for long duration (to extract long temporal information). The contribution of this work is to develop a new statistical model for natural sounds that captures structure across a wide range of time-scales, and to provide efficient learning and inference algorithms. We demonstrate the success of this approach on a missing data task. 1
2 0.5588572 18 nips-2007-A probabilistic model for generating realistic lip movements from speech
Author: Gwenn Englebienne, Tim Cootes, Magnus Rattray
Abstract: The present work aims to model the correspondence between facial motion and speech. The face and sound are modelled separately, with phonemes being the link between both. We propose a sequential model and evaluate its suitability for the generation of the facial animation from a sequence of phonemes, which we obtain from speech. We evaluate the results both by computing the error between generated sequences and real video, as well as with a rigorous double-blind test with human subjects. Experiments show that our model compares favourably to other existing methods and that the sequences generated are comparable to real video sequences. 1
3 0.51263261 77 nips-2007-Efficient Inference for Distributions on Permutations
Author: Jonathan Huang, Carlos Guestrin, Leonidas Guibas
Abstract: Permutations are ubiquitous in many real-world problems, such as voting, rankings and data association. Representing uncertainty over permutations is challenging, since there are n! possibilities, and typical compact representations such as graphical models cannot efficiently capture the mutual exclusivity constraints associated with permutations. In this paper, we use the “low-frequency” terms of a Fourier decomposition to represent such distributions compactly. We present Kronecker conditioning, a general and efficient approach for maintaining these distributions directly in the Fourier domain. Low-order Fourier-based approximations can lead to functions that do not correspond to valid distributions. To address this problem, we present an efficient quadratic program defined directly in the Fourier domain to project the approximation onto a relaxed form of the marginal polytope. We demonstrate the effectiveness of our approach on a real camera-based multi-person tracking setting. 1
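The compactness the abstract alludes to can be made concrete with a toy calculation: the lowest-order Fourier terms of a distribution over permutations carry roughly the same information as its first-order marginals, i.e., the matrix of probabilities that item i is placed in position j. The sketch below computes those marginals by brute force for a tiny n; the function name and the toy distribution are illustrative, not taken from the paper.

```python
import itertools
import numpy as np

def first_order_marginals(dist):
    """First-order marginals M[j, i] = P(item i is placed in position j)
    for a distribution over permutations given as {perm_tuple: prob},
    where perm_tuple[i] is the position assigned to item i."""
    n = len(next(iter(dist)))
    M = np.zeros((n, n))
    for perm, p in dist.items():
        for item, pos in enumerate(perm):
            M[pos, item] += p
    return M

# Toy example: a distribution over the 3! = 6 permutations of 3 items
# that puts most of its mass on the identity assignment.
perms = list(itertools.permutations(range(3)))
dist = {perm: 0.5 if perm == (0, 1, 2) else 0.1 for perm in perms}
print(first_order_marginals(dist))  # doubly stochastic: rows and columns sum to 1
```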
4 0.49191418 96 nips-2007-Heterogeneous Component Analysis
Author: Shigeyuki Oba, Motoaki Kawanabe, Klaus-Robert Müller, Shin Ishii
Abstract: In bioinformatics it is often desirable to combine data from various measurement sources, which means analyzing structured feature vectors whose blocks have different intrinsic characteristics (e.g., different patterns of missing values, observation noise levels, and effective intrinsic dimensionalities). We propose a new machine learning tool, heterogeneous component analysis (HCA), for feature extraction, in order to better understand the factors that underlie such complex structured heterogeneous data. HCA is a linear block-wise sparse Bayesian PCA based not only on a probabilistic model with block-wise residual variance terms but also on a Bayesian treatment of a block-wise sparse factor-loading matrix. We study various algorithms that implement our HCA concept, extracting sparse heterogeneous structure by obtaining components common to all blocks as well as components specific to each block. Simulations on toy and bioinformatics data underline the usefulness of the proposed structured matrix factorization concept. 1
5 0.48318064 28 nips-2007-Augmented Functional Time Series Representation and Forecasting with Gaussian Processes
Author: Nicolas Chapados, Yoshua Bengio
Abstract: We introduce a functional representation of time series which allows forecasts to be performed over an unspecified horizon with progressively-revealed information sets. By virtue of using Gaussian processes, a complete covariance matrix between forecasts at several time-steps is available. This information is put to use in an application to actively trade price spreads between commodity futures contracts. The approach delivers impressive out-of-sample risk-adjusted returns after transaction costs on a portfolio of 30 spreads. 1
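The key point of this abstract, that a Gaussian process yields not just point forecasts but a full covariance matrix between forecasts at several horizons, can be illustrated with ordinary GP regression. The following is a generic sketch (squared-exponential kernel, fixed hyperparameters, hand-picked horizons), not the paper's augmented functional representation or trading strategy; all names and parameter values are illustrative assumptions.

```python
import numpy as np

def sq_exp_kernel(a, b, lengthscale=2.0, variance=1.0):
    """Squared-exponential covariance between two vectors of time points."""
    d = a.reshape(-1, 1) - b.reshape(1, -1)
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_forecast(t_obs, y_obs, t_new, noise=0.1):
    """Posterior mean and full joint covariance of y at the horizons t_new."""
    K = sq_exp_kernel(t_obs, t_obs) + noise**2 * np.eye(len(t_obs))
    Ks = sq_exp_kernel(t_new, t_obs)
    Kss = sq_exp_kernel(t_new, t_new)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_obs))
    mean = Ks @ alpha
    V = np.linalg.solve(L, Ks.T)
    cov = Kss - V.T @ V      # covariance *between* forecast horizons
    return mean, cov

# Toy usage: forecast a noisy sinusoid at four future horizons at once.
t_obs = np.arange(0, 20.0)
y_obs = np.sin(0.3 * t_obs) + 0.1 * np.random.default_rng(0).normal(size=20)
mean, cov = gp_forecast(t_obs, y_obs, np.array([20.0, 21.0, 22.0, 25.0]))
```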
6 0.46402568 146 nips-2007-On higher-order perceptron algorithms
7 0.45403308 150 nips-2007-Optimal models of sound localization by barn owls
8 0.44968531 153 nips-2007-People Tracking with the Laplacian Eigenmaps Latent Variable Model
9 0.43303698 51 nips-2007-Comparing Bayesian models for multisensory cue combination without mandatory integration
10 0.41690782 50 nips-2007-Combined discriminative and generative articulated pose and non-rigid shape estimation
11 0.40728685 68 nips-2007-Discovering Weakly-Interacting Factors in a Complex Stochastic Process
12 0.40358013 145 nips-2007-On Sparsity and Overcompleteness in Image Models
13 0.40206653 4 nips-2007-A Constraint Generation Approach to Learning Stable Linear Dynamical Systems
14 0.39502835 26 nips-2007-An online Hebbian learning rule that performs Independent Component Analysis
15 0.39063469 173 nips-2007-Second Order Bilinear Discriminant Analysis for single trial EEG analysis
16 0.38860214 72 nips-2007-Discriminative Log-Linear Grammars with Latent Variables
17 0.38082704 167 nips-2007-Regulator Discovery from Gene Expression Time Series of Malaria Parasites: a Hierachical Approach
18 0.37828985 17 nips-2007-A neural network implementing optimal state estimation based on dynamic spike train decoding
19 0.36454523 140 nips-2007-Neural characterization in partially observed populations of spiking neurons
20 0.35243478 210 nips-2007-Unconstrained On-line Handwriting Recognition with Recurrent Neural Networks
topicId topicWeight
[(5, 0.047), (13, 0.03), (16, 0.028), (18, 0.015), (19, 0.028), (21, 0.065), (34, 0.014), (35, 0.455), (47, 0.097), (49, 0.014), (83, 0.059), (85, 0.014), (87, 0.017), (90, 0.051)]
simIndex simValue paperId paperTitle
same-paper 1 0.86194909 130 nips-2007-Modeling Natural Sounds with Modulation Cascade Processes
Author: Richard Turner, Maneesh Sahani
2 0.79906416 160 nips-2007-Random Features for Large-Scale Kernel Machines
Author: Ali Rahimi, Benjamin Recht
Abstract: To accelerate the training of kernel machines, we propose to map the input data to a randomized low-dimensional feature space and then apply existing fast linear methods. The features are designed so that the inner products of the transformed data are approximately equal to those in the feature space of a user-specified shift-invariant kernel. We explore two sets of random features, provide convergence bounds on their ability to approximate various radial basis kernels, and show that in large-scale classification and regression tasks linear machine learning algorithms applied to these features outperform state-of-the-art large-scale kernel machines. 1
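As a concrete illustration of the shift-invariant case described above, here is a minimal NumPy sketch of a random Fourier feature map for the Gaussian (RBF) kernel: frequencies are drawn from the kernel's Fourier transform, and inner products of the transformed data then approximate kernel evaluations. The feature dimension D, the bandwidth gamma, and the function name are illustrative choices, not a reference implementation from the paper.

```python
import numpy as np

def random_fourier_features(X, D=500, gamma=1.0, rng=None):
    """Map rows of X to a D-dimensional random feature space whose inner
    products approximate the RBF kernel exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    # Frequencies sampled from the Fourier transform of the RBF kernel
    # (a Gaussian with variance 2*gamma per dimension); phases uniform.
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# Usage: kernel entries become plain dot products, so any fast linear
# method can be trained on Z in place of the kernel machine.
X = np.random.default_rng(0).normal(size=(5, 3))
Z = random_fourier_features(X, D=2000, gamma=0.5, rng=1)
K_approx = Z @ Z.T   # approx. exp(-0.5 * ||x_i - x_j||^2)
```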
3 0.79255718 78 nips-2007-Efficient Principled Learning of Thin Junction Trees
Author: Anton Chechetka, Carlos Guestrin
Abstract: We present the first truly polynomial algorithm for PAC-learning the structure of bounded-treewidth junction trees – an attractive subclass of probabilistic graphical models that permits both the compact representation of probability distributions and efficient exact inference. For a constant treewidth, our algorithm has polynomial time and sample complexity. If a junction tree with sufficiently strong intraclique dependencies exists, we provide strong theoretical guarantees in terms of KL divergence of the result from the true distribution. We also present a lazy extension of our approach that leads to very significant speed ups in practice, and demonstrate the viability of our method empirically, on several real world datasets. One of our key new theoretical insights is a method for bounding the conditional mutual information of arbitrarily large sets of variables with only polynomially many mutual information computations on fixed-size subsets of variables, if the underlying distribution can be approximated by a bounded-treewidth junction tree. 1
4 0.75303435 12 nips-2007-A Spectral Regularization Framework for Multi-Task Structure Learning
Author: Andreas Argyriou, Massimiliano Pontil, Yiming Ying, Charles A. Micchelli
Abstract: Learning the common structure shared by a set of supervised tasks is an important practical and theoretical problem. Knowledge of this structure may lead to better generalization performance on the tasks and may also facilitate learning new tasks. We propose a framework for solving this problem, which is based on regularization with spectral functions of matrices. This class of regularization problems exhibits appealing computational properties and can be optimized efficiently by an alternating minimization algorithm. In addition, we provide a necessary and sufficient condition for convexity of the regularizer. We analyze concrete examples of the framework, which are equivalent to regularization with Lp matrix norms. Experiments on two real data sets indicate that the algorithm scales well with the number of tasks and improves on state-of-the-art statistical performance. 1
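One member of the family this abstract describes can be written down concretely: penalizing the task-weight matrix with the trace norm (the L1 norm of its singular values) and fitting it by proximal gradient descent with singular-value soft-thresholding. This is a hedged sketch of that special case only, not the paper's alternating minimization algorithm; the loss, step size, and regularization weight are illustrative assumptions. The soft-thresholding step is what shrinks the rank of W, which is how the tasks come to share a low-dimensional structure.

```python
import numpy as np

def svt(W, tau):
    """Singular-value soft-thresholding: the proximal map of tau * ||W||_*."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def multitask_trace_norm(Xs, ys, lam=0.1, lr=0.01, iters=500):
    """One weight vector per task (columns of W), coupled by a trace-norm
    penalty: minimize sum_t ||X_t w_t - y_t||^2 / (2 n_t) + lam * ||W||_*."""
    d, T = Xs[0].shape[1], len(Xs)
    W = np.zeros((d, T))
    for _ in range(iters):
        G = np.zeros_like(W)
        for t, (X, y) in enumerate(zip(Xs, ys)):
            G[:, t] = X.T @ (X @ W[:, t] - y) / len(y)
        W = svt(W - lr * G, lr * lam)   # proximal gradient step
    return W

# Toy usage: four tasks whose true weight vectors are scalings of one
# shared direction, so the fitted W should be close to rank one.
rng = np.random.default_rng(0)
u = rng.normal(size=10)
Xs = [rng.normal(size=(50, 10)) for _ in range(4)]
ys = [X @ (0.5 * (t + 1) * u) for t, X in enumerate(Xs)]
W_hat = multitask_trace_norm(Xs, ys, lam=0.05)
```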
5 0.49425295 96 nips-2007-Heterogeneous Component Analysis
Author: Shigeyuki Oba, Motoaki Kawanabe, Klaus-Robert Müller, Shin Ishii
6 0.48714098 16 nips-2007-A learning framework for nearest neighbor search
7 0.45060587 134 nips-2007-Multi-Task Learning via Conic Programming
8 0.44945863 175 nips-2007-Semi-Supervised Multitask Learning
9 0.44649369 156 nips-2007-Predictive Matrix-Variate t Models
10 0.42465967 49 nips-2007-Colored Maximum Variance Unfolding
11 0.42229739 158 nips-2007-Probabilistic Matrix Factorization
12 0.42200372 4 nips-2007-A Constraint Generation Approach to Learning Stable Linear Dynamical Systems
13 0.42081755 63 nips-2007-Convex Relaxations of Latent Variable Training
14 0.42064875 35 nips-2007-Bayesian binning beats approximate alternatives: estimating peri-stimulus time histograms
15 0.4196949 152 nips-2007-Parallelizing Support Vector Machines on Distributed Computers
16 0.41752476 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning
17 0.41687936 41 nips-2007-COFI RANK - Maximum Margin Matrix Factorization for Collaborative Ranking
18 0.41338286 186 nips-2007-Statistical Analysis of Semi-Supervised Regression
19 0.40375444 179 nips-2007-SpAM: Sparse Additive Models
20 0.40208507 116 nips-2007-Learning the structure of manifolds using random projections