nips nips2010 nips2010-157 knowledge-graph by maker-knowledge-mining

157 nips-2010-Learning to localise sounds with spiking neural networks


Source: pdf

Author: Dan Goodman, Romain Brette

Abstract: To localise the source of a sound, we use location-specific properties of the signals received at the two ears caused by the asymmetric filtering of the original sound by our head and pinnae, the head-related transfer functions (HRTFs). These HRTFs change throughout an organism’s lifetime, during development for example, and so the required neural circuitry cannot be entirely hardwired. Since HRTFs are not directly accessible from perceptual experience, they can only be inferred from filtered sounds. We present a spiking neural network model of sound localisation based on extracting location-specific synchrony patterns, and a simple supervised algorithm to learn the mapping between synchrony patterns and locations from a set of example sounds, with no previous knowledge of HRTFs. After learning, our model was able to accurately localise new sounds in both azimuth and elevation, including the difficult task of distinguishing sounds coming from the front and back. Keywords: Auditory Perception & Modeling (Primary); Computational Neural Models, Neuroscience, Supervised Learning (Secondary) 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Learning to localise sounds with spiking neural networks Romain Brette Département d'Etudes Cognitives Ecole Normale Supérieure 29 Rue d'Ulm Paris 75005, France romain. [sent-1, score-0.468]

2 fr Abstract To localise the source of a sound, we use location-specific properties of the signals received at the two ears caused by the asymmetric filtering of the original sound by our head and pinnae, the head-related transfer functions (HRTFs). [sent-7, score-0.721]

3 We present a spiking neural network model of sound localisation based on extracting location-specific synchrony patterns, and a simple supervised algorithm to learn the mapping between synchrony patterns and locations from a set of example sounds, with no previous knowledge of HRTFs. [sent-10, score-0.793]

4 After learning, our model was able to accurately localise new sounds in both azimuth and elevation, including the difficult task of distinguishing sounds coming from the front and back. [sent-11, score-0.809]

5 Psychophysical studies have shown that source localisation relies on a variety of acoustic cues such as interaural time and level differences (ITDs and ILDs) and spectral cues (Blauert 1997). [sent-14, score-0.438]

6 Previous neural models addressed the mechanisms of cue extraction, in particular neural mechanisms underlying ITD sensitivity, using simplified binaural stimuli such as tones or noise bursts with artificially induced ITDs (Colburn, 1973; Reed and Blum, 1990; Gerstner et al. [sent-17, score-0.375]

7 , 2008), but did not address the problem of learning to localise natural sounds in realistic acoustical environments. [sent-20, score-0.573]

8 For binaural hearing, the acoustical environment includes the head, body and pinnae, and the sounds received at the two ears are FL ∗ S and FR ∗ S, where (FL , FR ) is a pair of location-specific filters. [sent-23, score-1.0]

9 Because the two sounds originate from the same signal, the binaural stimulus has a specific structure, which should result in synchrony patterns in the encoding neurons. [sent-24, score-0.819]

10 Specifically, we modelled the response of monaural neurons by a linear filtering of the sound followed by a spiking nonlinearity. [sent-25, score-0.606]

11 Two neurons A and B responding to two different sides (left and right), with receptive fields NA and NB , transform the signals NA ∗ FL ∗ S and NB ∗ FR ∗ S into spike trains. [sent-26, score-0.304]

12 Thus, in our model, sounds presented at a given location induce specific synchrony patterns, which then activate a specific assembly of postsynaptic neurons (coincidence detection neurons), in a way that is independent of the source signal (see Goodman and Brette, in press). [sent-30, score-0.932]

13 Learning a new location consists in assigning a label to the activated assembly, using a teacher signal (for example visual input). [sent-31, score-0.18]

14 We used measured human HRTFs to generate binaural signals at different source locations from a set of various sounds. [sent-32, score-0.503]

15 After learning, the model was able to accurately locate unknown sounds in both azimuth and elevation. [sent-34, score-0.45]

16 All sounds were of 1 second duration and were presented at 80 dB SPL. [sent-42, score-0.255]

17 Consider two neurons A and B which respond to sounds from the left and right ear, respectively. [sent-53, score-0.446]

18 When a sound S is produced by a source at a given location, it arrives at the two ears as the binaural signal (FL ∗ S, FR ∗ S) (convolution), where (FL , FR ) is the location-specific pair of acoustical filters. [sent-54, score-1.024]
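
To make this filtering step concrete, here is a minimal sketch (in Python/NumPy, not the authors' code) that generates the binaural signal (FL ∗ S, FR ∗ S) by convolving a source with a location-specific impulse-response pair; the HRIR arrays and the toy two-tap example are hypothetical placeholders.

```python
import numpy as np

def binaural_signal(source, hrir_left, hrir_right):
    """Simulate the signals received at the two ears for one source location.

    source     : 1-D array, the monaural source signal S
    hrir_left  : 1-D array, impulse response of the left filter FL
    hrir_right : 1-D array, impulse response of the right filter FR
    Returns (FL * S, FR * S), the left- and right-ear signals.
    """
    left = np.convolve(source, hrir_left, mode="full")
    right = np.convolve(source, hrir_right, mode="full")
    return left, right

# Hypothetical example: 1 s of white noise and toy two-tap "HRIRs".
rng = np.random.default_rng(0)
s = rng.standard_normal(44100)
l, r = binaural_signal(s, np.array([1.0, 0.3]), np.array([0.7, 0.5]))
```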

19 These will be identical for any sound S whenever NA ∗ FL = NB ∗ FR , implying that the two neurons fire synchronously. [sent-56, score-0.422]

20 For each location indicated by its filter pair (FL , FR ), we define the synchrony pattern as the set of binaural pairs of neurons (A, B) such that NA ∗ FL = NB ∗ FR . [sent-57, score-0.828]

21 Therefore, the identity of the synchrony pattern induced by a binaural stimulus indicates the location of the source. [sent-59, score-0.637]

22 Learning consists in assigning a synchrony pattern induced by a sound to the location of the source. [sent-60, score-0.493]

23 Then neurons A and B fire in synchrony whenever FR∗ ∗ FL = FL∗ ∗ FR, in particular when FL = FL∗ and FR = FR∗, that is, at location x (since convolution is commutative). [sent-62, score-0.453]

24 More generally, if U is a band-pass filter and the receptive fields of neurons A and B are U ∗ FR∗ and U ∗ FL∗, respectively, then the neurons fire synchronously at location x. [sent-63, score-0.53]

25 Therefore, to represent all possible locations in pairs of neuron filters, we consider that the set of neural transformations N is a bank of band-pass filters followed by a set of delays and gains. [sent-68, score-0.426]

26 To decode synchrony patterns, we define a set of binaural neurons which receive input spike trains from monaural neurons on both sides (two inputs per neuron). [sent-69, score-1.034]

27 A binaural neuron responds preferentially when its two inputs are synchronous, so that synchrony patterns are mapped to assemblies of binaural neurons. [sent-70, score-1.184]

28 Each location-specific assembly is the set of binaural neurons for which the input neurons fire synchronously at that location. [sent-71, score-0.896]

29 This is conceptually similar to the Jeffress model (Jeffress, 1948), where a neuron is maximally activated when acoustical and axonal delays match, and related models (Lindemann, 1986; Gaik, 1993). [sent-72, score-0.692]

30 However, the Jeffress model is restricted to azimuth estimation and it is difficult to implement it directly with neuron models because ILDs always co-occur with ITDs and disturb spike-timing. [sent-73, score-0.295]

31 3 Implementation with spiking neuron models. Figure 1: Implementation of the model (processing stages, mirrored for the two ears: HRTF, cochlear filtering with gammatone filters γi, neural filtering FjL/FjR, coincidence detection; example locations (30°, 15°), (45°, 15°), (90°, 15°)). [sent-75, score-0.393]

32 The source signal arrives at the two ears after acoustical filtering by HRTFs. [sent-76, score-0.418]

33 The two monaural signals are filtered by a set of gammatone filters γi with central frequencies between 150 Hz and 5 kHz (cochlear filtering). [sent-77, score-0.196]

34 In each band (3 bands shown, between dashed lines), various gains and delays are applied to the signal (neural filtering FjL and FjR ) and spiking neuron models transform the resulting signals into spike trains, which converge from each side on a coincidence detector neuron (same neuron model). [sent-78, score-1.061]

35 The neural assembly corresponding to a particular location is the set of coincidence detector neurons for which their input neurons fire in synchrony at that location (one pair for each frequency channel). [sent-79, score-1.078]

36 Head-filtered sounds were passed through a bank of fourth-order gammatone filters with center frequencies distributed on the ERB scale (central frequencies from 150 Hz to 5 kHz), modeling cochlear filtering (Glasberg and Moore, 1990). [sent-85, score-0.474]
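
As an illustration of how centre frequencies between 150 Hz and 5 kHz could be spaced on the ERB scale, the sketch below uses the ERB-rate formula of Glasberg and Moore (1990); the channel count and exact spacing are assumptions, not taken from the authors' implementation.

```python
import numpy as np

def erb_rate(f_hz):
    # ERB-rate scale of Glasberg and Moore (1990).
    return 21.4 * np.log10(4.37e-3 * f_hz + 1.0)

def inverse_erb_rate(e):
    return (10.0 ** (e / 21.4) - 1.0) / 4.37e-3

def erb_spaced_centre_freqs(f_lo=150.0, f_hi=5000.0, n_channels=80):
    # Centre frequencies equally spaced on the ERB-rate scale.
    e = np.linspace(erb_rate(f_lo), erb_rate(f_hi), n_channels)
    return inverse_erb_rate(e)

cf = erb_spaced_centre_freqs()  # 80 centre frequencies from 150 Hz to 5 kHz
```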

37 Gains and delays were then applied, with delays at most 1 ms and gains at most ±10 dB. [sent-87, score-0.628]

38 The filtered sounds were half-wave rectified and compressed by a 1/3 power law, I = k([x]+)^(1/3) (where x is the sound pressure in pascals). [sent-89, score-0.486]
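
A minimal sketch of the per-channel neural filtering and transduction described in the two sentences above: a gain (at most ±10 dB), a delay (at most 1 ms, implemented as an integer sample shift), half-wave rectification, and the 1/3 power-law compression I = k([x]+)^(1/3). The sampling rate, k = 1, and the example gain/delay values are assumptions for illustration.

```python
import numpy as np

FS = 44100.0  # assumed sampling rate (Hz)

def neural_filter(x, gain_db, delay_s, k=1.0):
    """Apply a gain and delay to one gammatone-filtered channel, then
    half-wave rectify and compress by a 1/3 power law."""
    n = int(round(delay_s * FS))
    delayed = np.roll(x, n) * 10.0 ** (gain_db / 20.0)
    delayed[:n] = 0.0  # zero the samples wrapped around by np.roll
    return k * np.maximum(delayed, 0.0) ** (1.0 / 3.0)

# Example channel: a 500 Hz tone, delayed by 0.5 ms and attenuated by 3 dB.
x = np.sin(2 * np.pi * 500.0 * np.arange(0, 0.1, 1 / FS))
drive = neural_filter(x, gain_db=-3.0, delay_s=0.0005)
```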

39 These neurons make synaptic connections with binaural neurons in a second layer (two presynaptic neurons for each binaural neuron). [sent-94, score-1.386]

40 These coincidence detector neurons are leaky integrate-and-fire neurons with the same equations but their inputs are synaptic. [sent-95, score-0.513]

41 Spikes arriving at these neurons cause an instantaneous increase W in V (where W is the synaptic weight). [sent-96, score-0.223]
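
The extracted text specifies only that the coincidence detectors are leaky integrate-and-fire neurons whose input spikes cause an instantaneous jump of size W in the membrane potential V. The sketch below is a generic discrete-time version with assumed time constant, threshold and reset values, not the paper's exact parameters.

```python
import numpy as np

def lif_coincidence(spikes_a, spikes_b, dt=1e-4, tau=1e-3,
                    w=0.6, v_thresh=1.0, v_reset=0.0):
    """Count output spikes of a leaky integrate-and-fire coincidence detector
    driven by two binary spike trains (one per monaural input neuron).
    All parameter values are illustrative assumptions."""
    v, n_out = 0.0, 0
    for sa, sb in zip(spikes_a, spikes_b):
        v -= v * dt / tau        # leak towards rest
        v += w * (sa + sb)       # instantaneous jumps of size W per input spike
        if v >= v_thresh:        # threshold crossing: emit a spike and reset
            n_out += 1
            v = v_reset
    return n_out

# Near-coincident input spikes reach threshold; isolated spikes decay away.
a = np.zeros(1000); b = np.zeros(1000)
a[[100, 500]] = 1.0; b[[101, 800]] = 1.0
print(lif_coincidence(a, b))  # -> 1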

42 Each location is assigned an assembly of coincidence detector neurons, one in each frequency channel. [sent-99, score-0.434]

43 When a sound is presented to the model, the total firing rate of all neurons in each assembly is computed. [sent-100, score-0.561]

44 The estimated location is the one assigned to the maximally activated assembly. [sent-101, score-0.181]
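
A sketch of this read-out: given the firing rates of all coincidence detector neurons for a presented sound and, for each candidate location, the assembly of one neuron per frequency channel, the estimate is the location whose assembly has the largest summed rate. The array layout is an assumption made for the example.

```python
import numpy as np

def estimate_location(firing_rates, assemblies):
    """firing_rates : (n_neurons,) spike counts for the presented sound
    assemblies   : (n_locations, n_channels) integer array; assemblies[l, c]
                   is the coincidence detector assigned to location l in
                   frequency channel c.
    Returns the index of the maximally activated location-specific assembly."""
    assembly_activation = firing_rates[assemblies].sum(axis=1)
    return int(np.argmax(assembly_activation))

# Toy example: 3 candidate locations, 2 frequency channels, 6 neurons.
rates = np.array([5.0, 1.0, 0.5, 7.0, 2.0, 0.0])
assemblies = np.array([[0, 3], [1, 4], [2, 5]])
print(estimate_location(rates, assemblies))  # -> 0
```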

45 Figure 2 shows the activation of all location-specific assemblies in an example where a sound was presented to the model, after learning. [sent-102, score-0.345]

46 In the hardwired model, we defined the location-specific assemblies from the knowledge of HRTFs (the learning algorithm is explained in section 2. [sent-104, score-0.287]

47 The RMS difference is minimized when the delays correspond to the maximum of the cross-correlation between L and R, C(s) = ∫ (G∗FL)(t)·(G∗FR)(t+s) dt, so that C(dR − dL) is the maximum, and gR/gL = C(dR − dL) / ∫ R(t)² dt. [sent-108, score-0.258]
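
A sketch of this hardwired construction for one frequency channel: the best relative delay is the lag that maximises the cross-correlation C(s) between the band-filtered left and right impulse responses, and the gain ratio is C(dR − dL) divided by the energy of the right response. The sign convention and the assumption that the inputs are band-filtered HRIRs sampled at fs are mine, and may need adjusting against the authors' implementation.

```python
import numpy as np

def best_delay_and_gain(left, right, fs):
    """left, right : band-filtered impulse responses G*FL and G*FR (1-D arrays)
    Returns the relative delay (seconds) maximising C(s) = sum_t L(t) R(t+s),
    and the gain ratio gR/gL = C_max / sum_t R(t)^2."""
    # np.correlate(right, left, 'full')[i] = sum_n right[n + k] * left[n]
    # with lag k = i - (len(left) - 1), i.e. C(k) in the notation above.
    c = np.correlate(right, left, mode="full")
    lags = np.arange(-(len(left) - 1), len(right))
    k = int(np.argmax(c))
    delay = lags[k] / fs
    gain_ratio = c[k] / np.sum(np.asarray(right, dtype=float) ** 2)
    return delay, gain_ratio
```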

48 4 Learning In the hardwired model, the knowledge of the full set of HRTFs is used to estimate source location. [sent-110, score-0.228]

49 In our model, when HRTFs are not explicitly known, location-specific assemblies are learned by presenting unknown sounds at different locations to the model, where there is one coincidence detector neuron for each choice of frequency, relative delay and relative gain. [sent-113, score-0.711]

50 In total, 69 relative delays and 61 relative gains were chosen. [sent-117, score-0.258]

51 With 80 frequency channels, this gives a total of roughly 10^6 neurons in the model. [sent-118, score-0.245]

52 When a sound is presented at a given location, we define the assembly for this location by picking the maximally activated neuron in each frequency channel, as would be expected from a supervised Hebbian learning process with a teacher signal (e. [sent-119, score-0.766]
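
A sketch of this assembly formation step: for every training location, the winning (maximally activated) coincidence detector is selected in each frequency channel. The bookkeeping array channel_of_neuron, which records the frequency channel of every detector, is an assumed convenience, not something specified in the extracted text. Together with the read-out sketch above, this gives a toy end-to-end decoding pipeline.

```python
import numpy as np

def learn_assemblies(rates_per_location, channel_of_neuron, n_channels):
    """rates_per_location : (n_locations, n_neurons) firing rates accumulated
                            while presenting training sounds at known locations
    channel_of_neuron   : (n_neurons,) frequency channel of each detector
    Returns an (n_locations, n_channels) array of winning neuron indices,
    i.e. the learned location-specific assemblies."""
    n_locations = rates_per_location.shape[0]
    assemblies = np.zeros((n_locations, n_channels), dtype=int)
    for c in range(n_channels):
        idx = np.flatnonzero(channel_of_neuron == c)  # detectors in channel c
        winners = np.argmax(rates_per_location[:, idx], axis=1)
        assemblies[:, c] = idx[winners]
    return assemblies
```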

53 Figure 2: Activation of all location-specific assemblies (azimuth and elevation axes in degrees) in response to a sound coming from a particular location indicated by a black +. For practical reasons, [sent-122, score-0.455]

54 we did not implement this supervised learning with spiking models, but supervised learning with spiking neurons has been described in several previous studies (Song and Abbott, 2001; Davison and Frégnac, 2006; Witten et al. [sent-125, score-0.387]

55 Performance is better for sounds with broader spectra, as each channel provides additional information. [sent-128, score-0.294]

56 The model was also able to distinguish between sounds coming from the left and right (with an accuracy of almost 100%), and performed well for the more difficult tasks of distinguishing between front and back (80-85%) and between up and down (70-90%). [sent-129, score-0.304]

57 Figures 3D-F show the results using the learned best delays and gains, using the full training data set (seven sounds presented at each location, each of one second duration) and different test sounds. [sent-130, score-0.513]

58 Average azimuth errors for 80 channels are 4-8 degrees, and elevation errors are 10-27 degrees. [sent-132, score-0.385]

59 With only a single sound of one second duration at each location, the performance is already very good. [sent-135, score-0.231]

60 Although it is close, the performance does not seem to converge to that of the hardwired model, which might be due to a limited sampling of delays and gains (69 relative delays and 61 relative gains), or perhaps to the presence of physiological noise in our models (Goodman and Brette, in press). [sent-137, score-0.765]

61 Figure 5 shows the properties of neurons in a location-specific assembly: interaural delay (Figure 5A) and interaural gain difference (Figure 5B) for each frequency channel. [sent-138, score-0.634]

62 For this location, the assemblies in the hardwired model and with learning were very similar, which indicates that the learning procedure was indeed able to capture the binaural cues associated with that location. [sent-139, score-0.724]

63 The distributions of delays and gain differences were similar in the hardwired model and with learning. [sent-140, score-0.431]

64 In the hardwired model, these interaural delays and gains correspond to the ITDs and ILDs in fine frequency bands. [sent-141, score-0.734]

65 To each location corresponds a specific frequency-dependent pattern of ITDs and ILDs, which is informative of both azimuth and elevation. [sent-142, score-0.274]

66 In particular, these patterns are different when the location is reversed between front and back (not shown), and this difference is exploited by the model to distinguish between these two cases. [sent-143, score-0.196]

67 A, D, Mean error in azimuth estimates as a function of the number of frequency channels (i. [sent-145, score-0.287]

68 A, Average estimation error in azimuth (blue) and elevation (green) as a function of the number of sounds presented at each location during learning (each sound lasts 1 second). [sent-153, score-0.912]

69 number of sounds per location for discriminating left and right (green), front and back (blue) and up and down (red). [sent-157, score-0.414]

70 4 Discussion The sound produced by a source propagates to the ears according to linear laws. [sent-158, score-0.387]

71 Thus the ears receive two differently filtered versions of the same signal, which induce a location-specific structure. Figure 5 (panels A, B): Location-specific assembly in the hardwired model and with learning. [sent-159, score-0.413]

72 preferred frequency for neurons in an assembly corresponding to one particular location, in the hardwired model (white circles) and with learning (black circles). [sent-161, score-0.557]

73 The colored background shows the distribution of preferred delays in all neurons in the hardwired model. [sent-162, score-0.622]

74 When binaural signals are transformed by a heterogeneous population of neurons, this structure is mapped to synchrony patterns, which are location-specific. [sent-166, score-0.563]

75 We designed a simple spiking neuron model which exploits this property to estimate the location of sound sources in a way that is independent of the source signal. [sent-167, score-0.625]

76 We showed that the mapping between assemblies and locations can be directly learned in a supervised way from the presentation of a set of sounds at different locations, with no previous knowledge of the HRTFs or the sounds. [sent-169, score-0.406]

77 With 80 frequency channels, we found that 1 second of training data per location was enough to estimate the azimuth of a new sound with mean error 6 degrees and the elevation with error 18 degrees. [sent-170, score-0.711]

78 Humans can learn to localise sound sources when their acoustical cues change, for example when molds are inserted into their ears (Hofman et al. [sent-171, score-0.741]

79 Learning a new mapping can take a long time (several weeks in the first study), which is consistent with the idea that the new mapping is learned from exposure to sounds from known locations. [sent-174, score-0.255]

80 Interestingly, the previous mapping is instantly recovered when the ear molds are removed, meaning that the representations of the two acoustical environments do not interfere. [sent-175, score-0.319]

81 This is consistent with our model, in which two acoustical environments would be represented by two possibly overlapping sets of neural assemblies. [sent-176, score-0.232]

82 In our model, we assumed that the receptive field of monaural neurons can be modeled as a band-pass filter with various gains and delays. [sent-177, score-0.391]

83 Delays could arise from many causes: axonal delays (either presynaptic or postsynaptic), cochlear delays (Joris et al. [sent-179, score-0.661]

84 The distribution of best delays of the binaural neurons in our model reflects the distribution of ITDs in the acoustical environment. [sent-182, score-0.375]

85 This contradicts the observation in many species that the best delays are always smaller than half the characteristic period, i. [sent-183, score-0.258]

86 However, we checked that the model performed almost equally well with this constraint (Goodman and Brette, in press), which is not very surprising since best delays above the π-limit are mostly redundant. [sent-186, score-0.258]

87 In small mammals (guinea pigs, gerbils), it has been shown that the best phases of binaural neurons in the MSO and IC are in fact even more constrained, since they are scattered around ±π/4, in contrast with birds (e. [sent-187, score-0.599]

88 It has not been measured in humans, but the same optimal coding theory that predicts the discrete distribution of phases in small mammals predicts that best delays should be continuously distributed above 400 Hz (80% of the frequency channels in our model). [sent-192, score-0.414]

89 In addition, psychophysical results also imply that humans can estimate both the azimuth and elevation of low-pass filtered sound sources (< 3 kHz) (Algazi et al. [sent-193, score-0.547]

90 This contradicts the two-channel model (best delays at ±π/4) and agrees with ours (including the fact that elevation could only be estimated away from the median plane in these experiments). [sent-195, score-0.41]

91 The HRTFs used in our virtual acoustic environment were recorded at a constant distance, so that we could only test the model performance in estimating the azimuth and elevation of a sound source. [sent-198, score-0.584]

92 It should also apply equally well to non-anechoic environments, because our model only relies on the linearity of sound propagation. [sent-200, score-0.231]

93 However, a difficult task, which we have not addressed, is to locate sounds in a new environment, because reflections would change the binaural cues and therefore the location-specific assemblies. [sent-201, score-0.723]

94 One possibility would be to isolate the direct sound from the reflections, but this requires additional mechanisms, which probably underlie the precedence effect (Litovsky et al. [sent-202, score-0.231]

95 A matter of time: internal delays in binaural processing. [sent-299, score-0.633]

96 Extension of a binaural cross-correlation model by contralateral inhibition. [sent-314, score-0.375]

97 A biologically inspired spiking neural network for sound localisation by the inferior colliculus. [sent-334, score-0.415]

98 Distribution of interaural time difference in the barn owl’s inferior colliculus in the low- and High-Frequency ranges. [sent-371, score-0.202]

99 Perceptual recalibration in human sound localization: Learning to remediate front-back reversals. [sent-396, score-0.231]

100 A model for interaural time difference sensitivity in the medial superior olive: Interaction of excitatory and inhibitory synaptic inputs, channel dynamics, and cellular morphology. [sent-404, score-0.244]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('binaural', 0.375), ('delays', 0.258), ('sounds', 0.255), ('acoustical', 0.232), ('sound', 0.231), ('fl', 0.192), ('neurons', 0.191), ('hrtfs', 0.187), ('fr', 0.178), ('hardwired', 0.173), ('interaural', 0.173), ('azimuth', 0.164), ('synchrony', 0.152), ('elevation', 0.152), ('assembly', 0.139), ('neuron', 0.131), ('cochlear', 0.114), ('assemblies', 0.114), ('location', 0.11), ('ears', 0.101), ('coincidence', 0.098), ('spiking', 0.098), ('brette', 0.089), ('itds', 0.086), ('localisation', 0.086), ('localise', 0.086), ('monaural', 0.086), ('gains', 0.076), ('nb', 0.072), ('channels', 0.069), ('ltered', 0.069), ('goodman', 0.069), ('pmid', 0.063), ('cues', 0.062), ('ear', 0.058), ('america', 0.058), ('colburn', 0.058), ('ilds', 0.058), ('jeffress', 0.058), ('source', 0.055), ('frequency', 0.054), ('membrane', 0.053), ('yin', 0.052), ('front', 0.049), ('na', 0.048), ('joris', 0.046), ('ltering', 0.046), ('mv', 0.046), ('lter', 0.045), ('dl', 0.044), ('gammatone', 0.043), ('hrtf', 0.043), ('pinnae', 0.043), ('delay', 0.043), ('khz', 0.041), ('filtering', 0.041), ('auditory', 0.04), ('activated', 0.04), ('dr', 0.039), ('channel', 0.039), ('spike', 0.039), ('receptive', 0.038), ('mcalpine', 0.038), ('wagner', 0.038), ('locations', 0.037), ('patterns', 0.037), ('environment', 0.037), ('ms', 0.036), ('signals', 0.036), ('gr', 0.035), ('hofman', 0.035), ('db', 0.034), ('hz', 0.034), ('head', 0.034), ('detector', 0.033), ('mammals', 0.033), ('lters', 0.032), ('synaptic', 0.032), ('presynaptic', 0.031), ('locate', 0.031), ('maximally', 0.031), ('frequencies', 0.031), ('signal', 0.03), ('hearing', 0.03), ('algazi', 0.029), ('barn', 0.029), ('etudes', 0.029), ('frgnac', 0.029), ('gaik', 0.029), ('glasberg', 0.029), ('hrirs', 0.029), ('lindemann', 0.029), ('litovsky', 0.029), ('lorenzi', 0.029), ('macdonald', 0.029), ('molds', 0.029), ('mso', 0.029), ('olive', 0.029), ('owl', 0.029), ('partment', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 157 nips-2010-Learning to localise sounds with spiking neural networks

Author: Dan Goodman, Romain Brette

Abstract: To localise the source of a sound, we use location-specific properties of the signals received at the two ears caused by the asymmetric filtering of the original sound by our head and pinnae, the head-related transfer functions (HRTFs). These HRTFs change throughout an organism’s lifetime, during development for example, and so the required neural circuitry cannot be entirely hardwired. Since HRTFs are not directly accessible from perceptual experience, they can only be inferred from filtered sounds. We present a spiking neural network model of sound localisation based on extracting location-specific synchrony patterns, and a simple supervised algorithm to learn the mapping between synchrony patterns and locations from a set of example sounds, with no previous knowledge of HRTFs. After learning, our model was able to accurately localise new sounds in both azimuth and elevation, including the difficult task of distinguishing sounds coming from the front and back. Keywords: Auditory Perception & Modeling (Primary); Computational Neural Models, Neuroscience, Supervised Learning (Secondary) 1

2 0.11861367 96 nips-2010-Fractionally Predictive Spiking Neurons

Author: Jaldert Rombouts, Sander M. Bohte

Abstract: Recent experimental work has suggested that the neural firing rate can be interpreted as a fractional derivative, at least when signal variation induces neural adaptation. Here, we show that the actual neural spike-train itself can be considered as the fractional derivative, provided that the neural signal is approximated by a sum of power-law kernels. A simple standard thresholding spiking neuron suffices to carry out such an approximation, given a suitable refractory response. Empirically, we find that the online approximation of signals with a sum of powerlaw kernels is beneficial for encoding signals with slowly varying components, like long-memory self-similar signals. For such signals, the online power-law kernel approximation typically required less than half the number of spikes for similar SNR as compared to sums of similar but exponentially decaying kernels. As power-law kernels can be accurately approximated using sums or cascades of weighted exponentials, we demonstrate that the corresponding decoding of spiketrains by a receiving neuron allows for natural and transparent temporal signal filtering by tuning the weights of the decoding kernel. 1

3 0.11683626 16 nips-2010-A VLSI Implementation of the Adaptive Exponential Integrate-and-Fire Neuron Model

Author: Sebastian Millner, Andreas Grübl, Karlheinz Meier, Johannes Schemmel, Marc-olivier Schwartz

Abstract: We describe an accelerated hardware neuron being capable of emulating the adaptive exponential integrate-and-fire neuron model. Firing patterns of the membrane stimulated by a step current are analyzed in transistor level simulations and in silicon on a prototype chip. The neuron is destined to be the hardware neuron of a highly integrated wafer-scale system reaching out for new computational paradigms and opening new experimentation possibilities. As the neuron is dedicated as a universal device for neuroscientific experiments, the focus lays on parameterizability and reproduction of the analytical model. 1

4 0.11600591 268 nips-2010-The Neural Costs of Optimal Control

Author: Samuel Gershman, Robert Wilson

Abstract: Optimal control entails combining probabilities and utilities. However, for most practical problems, probability densities can be represented only approximately. Choosing an approximation requires balancing the benefits of an accurate approximation against the costs of computing it. We propose a variational framework for achieving this balance and apply it to the problem of how a neural population code should optimally represent a distribution under resource constraints. The essence of our analysis is the conjecture that population codes are organized to maximize a lower bound on the log expected utility. This theory can account for a plethora of experimental data, including the reward-modulation of sensory receptive fields, GABAergic effects on saccadic movements, and risk aversion in decisions under uncertainty. 1

5 0.10455762 10 nips-2010-A Novel Kernel for Learning a Neuron Model from Spike Train Data

Author: Nicholas Fisher, Arunava Banerjee

Abstract: From a functional viewpoint, a spiking neuron is a device that transforms input spike trains on its various synapses into an output spike train on its axon. We demonstrate in this paper that the function mapping underlying the device can be tractably learned based on input and output spike train data alone. We begin by posing the problem in a classification based framework. We then derive a novel kernel for an SRM0 model that is based on PSP and AHP like functions. With the kernel we demonstrate how the learning problem can be posed as a Quadratic Program. Experimental results demonstrate the strength of our approach. 1

6 0.094974533 161 nips-2010-Linear readout from a neural population with partial correlation data

7 0.086311355 252 nips-2010-SpikeAnts, a spiking neuron network modelling the emergence of organization in a complex system

8 0.084974349 28 nips-2010-An Alternative to Low-level-Sychrony-Based Methods for Speech Detection

9 0.07657823 244 nips-2010-Sodium entry efficiency during action potentials: A novel single-parameter family of Hodgkin-Huxley models

10 0.074984804 253 nips-2010-Spike timing-dependent plasticity as dynamic filter

11 0.073445983 115 nips-2010-Identifying Dendritic Processing

12 0.068883516 119 nips-2010-Implicit encoding of prior probabilities in optimal neural populations

13 0.06780114 21 nips-2010-Accounting for network effects in neuronal responses using L1 regularized point process models

14 0.057034414 200 nips-2010-Over-complete representations on recurrent neural networks can support persistent percepts

15 0.049267139 101 nips-2010-Gaussian sampling by local perturbations

16 0.049242489 207 nips-2010-Phoneme Recognition with Large Hierarchical Reservoirs

17 0.042250227 65 nips-2010-Divisive Normalization: Justification and Effectiveness as Efficient Coding Transform

18 0.042197935 107 nips-2010-Global seismic monitoring as probabilistic inference

19 0.041993815 17 nips-2010-A biologically plausible network for the computation of orientation dominance

20 0.041027263 227 nips-2010-Rescaling, thinning or complementing? On goodness-of-fit procedures for point process models and Generalized Linear Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.094), (1, 0.033), (2, -0.162), (3, 0.126), (4, 0.067), (5, 0.137), (6, -0.054), (7, 0.058), (8, 0.037), (9, -0.031), (10, 0.034), (11, 0.042), (12, 0.03), (13, 0.017), (14, 0.016), (15, -0.018), (16, 0.024), (17, -0.044), (18, -0.037), (19, -0.027), (20, 0.032), (21, 0.016), (22, 0.029), (23, -0.019), (24, 0.026), (25, -0.05), (26, 0.046), (27, -0.075), (28, -0.014), (29, -0.059), (30, 0.009), (31, -0.002), (32, 0.028), (33, -0.031), (34, 0.042), (35, 0.034), (36, 0.088), (37, -0.011), (38, 0.011), (39, 0.049), (40, 0.048), (41, 0.013), (42, 0.017), (43, -0.043), (44, 0.002), (45, 0.022), (46, -0.005), (47, -0.037), (48, 0.058), (49, 0.087)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95465332 157 nips-2010-Learning to localise sounds with spiking neural networks

Author: Dan Goodman, Romain Brette

Abstract: To localise the source of a sound, we use location-specific properties of the signals received at the two ears caused by the asymmetric filtering of the original sound by our head and pinnae, the head-related transfer functions (HRTFs). These HRTFs change throughout an organism’s lifetime, during development for example, and so the required neural circuitry cannot be entirely hardwired. Since HRTFs are not directly accessible from perceptual experience, they can only be inferred from filtered sounds. We present a spiking neural network model of sound localisation based on extracting location-specific synchrony patterns, and a simple supervised algorithm to learn the mapping between synchrony patterns and locations from a set of example sounds, with no previous knowledge of HRTFs. After learning, our model was able to accurately localise new sounds in both azimuth and elevation, including the difficult task of distinguishing sounds coming from the front and back. Keywords: Auditory Perception & Modeling (Primary); Computational Neural Models, Neuroscience, Supervised Learning (Secondary) 1

2 0.86012155 16 nips-2010-A VLSI Implementation of the Adaptive Exponential Integrate-and-Fire Neuron Model

Author: Sebastian Millner, Andreas Grübl, Karlheinz Meier, Johannes Schemmel, Marc-olivier Schwartz

Abstract: We describe an accelerated hardware neuron being capable of emulating the adaptive exponential integrate-and-fire neuron model. Firing patterns of the membrane stimulated by a step current are analyzed in transistor level simulations and in silicon on a prototype chip. The neuron is destined to be the hardware neuron of a highly integrated wafer-scale system reaching out for new computational paradigms and opening new experimentation possibilities. As the neuron is dedicated as a universal device for neuroscientific experiments, the focus lays on parameterizability and reproduction of the analytical model. 1

3 0.79796529 115 nips-2010-Identifying Dendritic Processing

Author: Aurel A. Lazar, Yevgeniy Slutskiy

Abstract: In system identification both the input and the output of a system are available to an observer and an algorithm is sought to identify parameters of a hypothesized model of that system. Here we present a novel formal methodology for identifying dendritic processing in a neural circuit consisting of a linear dendritic processing filter in cascade with a spiking neuron model. The input to the circuit is an analog signal that belongs to the space of bandlimited functions. The output is a time sequence associated with the spike train. We derive an algorithm for identification of the dendritic processing filter and reconstruct its kernel with arbitrary precision. 1

4 0.73893517 252 nips-2010-SpikeAnts, a spiking neuron network modelling the emergence of organization in a complex system

Author: Sylvain Chevallier, Hél\`ene Paugam-moisy, Michele Sebag

Abstract: Many complex systems, ranging from neural cell assemblies to insect societies, involve and rely on some division of labor. How to enforce such a division in a decentralized and distributed way, is tackled in this paper, using a spiking neuron network architecture. Specifically, a spatio-temporal model called SpikeAnts is shown to enforce the emergence of synchronized activities in an ant colony. Each ant is modelled from two spiking neurons; the ant colony is a sparsely connected spiking neuron network. Each ant makes its decision (among foraging, sleeping and self-grooming) from the competition between its two neurons, after the signals received from its neighbor ants. Interestingly, three types of temporal patterns emerge in the ant colony: asynchronous, synchronous, and synchronous periodic foraging activities − similar to the actual behavior of some living ant colonies. A phase diagram of the emergent activity patterns with respect to two control parameters, respectively accounting for ant sociability and receptivity, is presented and discussed. 1

5 0.73500884 96 nips-2010-Fractionally Predictive Spiking Neurons

Author: Jaldert Rombouts, Sander M. Bohte

Abstract: Recent experimental work has suggested that the neural firing rate can be interpreted as a fractional derivative, at least when signal variation induces neural adaptation. Here, we show that the actual neural spike-train itself can be considered as the fractional derivative, provided that the neural signal is approximated by a sum of power-law kernels. A simple standard thresholding spiking neuron suffices to carry out such an approximation, given a suitable refractory response. Empirically, we find that the online approximation of signals with a sum of powerlaw kernels is beneficial for encoding signals with slowly varying components, like long-memory self-similar signals. For such signals, the online power-law kernel approximation typically required less than half the number of spikes for similar SNR as compared to sums of similar but exponentially decaying kernels. As power-law kernels can be accurately approximated using sums or cascades of weighted exponentials, we demonstrate that the corresponding decoding of spiketrains by a receiving neuron allows for natural and transparent temporal signal filtering by tuning the weights of the decoding kernel. 1

6 0.61582834 10 nips-2010-A Novel Kernel for Learning a Neuron Model from Spike Train Data

7 0.60786414 244 nips-2010-Sodium entry efficiency during action potentials: A novel single-parameter family of Hodgkin-Huxley models

8 0.5906527 253 nips-2010-Spike timing-dependent plasticity as dynamic filter

9 0.54083025 21 nips-2010-Accounting for network effects in neuronal responses using L1 regularized point process models

10 0.51516086 161 nips-2010-Linear readout from a neural population with partial correlation data

11 0.48027325 8 nips-2010-A Log-Domain Implementation of the Diffusion Network in Very Large Scale Integration

12 0.45095509 34 nips-2010-Attractor Dynamics with Synaptic Depression

13 0.44603193 268 nips-2010-The Neural Costs of Optimal Control

14 0.44436949 65 nips-2010-Divisive Normalization: Justification and Effectiveness as Efficient Coding Transform

15 0.43815154 207 nips-2010-Phoneme Recognition with Large Hierarchical Reservoirs

16 0.43373176 119 nips-2010-Implicit encoding of prior probabilities in optimal neural populations

17 0.43357188 28 nips-2010-An Alternative to Low-level-Sychrony-Based Methods for Speech Detection

18 0.42358232 17 nips-2010-A biologically plausible network for the computation of orientation dominance

19 0.40010536 81 nips-2010-Evaluating neuronal codes for inference using Fisher information

20 0.37859982 200 nips-2010-Over-complete representations on recurrent neural networks can support persistent percepts


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.025), (17, 0.026), (27, 0.091), (30, 0.022), (35, 0.02), (45, 0.113), (50, 0.037), (52, 0.038), (60, 0.015), (77, 0.062), (90, 0.025), (92, 0.429)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.74425542 157 nips-2010-Learning to localise sounds with spiking neural networks

Author: Dan Goodman, Romain Brette

Abstract: To localise the source of a sound, we use location-specific properties of the signals received at the two ears caused by the asymmetric filtering of the original sound by our head and pinnae, the head-related transfer functions (HRTFs). These HRTFs change throughout an organism’s lifetime, during development for example, and so the required neural circuitry cannot be entirely hardwired. Since HRTFs are not directly accessible from perceptual experience, they can only be inferred from filtered sounds. We present a spiking neural network model of sound localisation based on extracting location-specific synchrony patterns, and a simple supervised algorithm to learn the mapping between synchrony patterns and locations from a set of example sounds, with no previous knowledge of HRTFs. After learning, our model was able to accurately localise new sounds in both azimuth and elevation, including the difficult task of distinguishing sounds coming from the front and back. Keywords: Auditory Perception & Modeling (Primary); Computational Neural Models, Neuroscience, Supervised Learning (Secondary) 1

2 0.53898656 237 nips-2010-Shadow Dirichlet for Restricted Probability Modeling

Author: Bela Frigyik, Maya Gupta, Yihua Chen

Abstract: Although the Dirichlet distribution is widely used, the independence structure of its components limits its accuracy as a model. The proposed shadow Dirichlet distribution manipulates the support in order to model probability mass functions (pmfs) with dependencies or constraints that often arise in real world problems, such as regularized pmfs, monotonic pmfs, and pmfs with bounded variation. We describe some properties of this new class of distributions, provide maximum entropy constructions, give an expectation-maximization method for estimating the mean parameter, and illustrate with real data. 1 Modeling Probabilities for Machine Learning Modeling probability mass functions (pmfs) as random is useful in solving many real-world problems. A common random model for pmfs is the Dirichlet distribution [1]. The Dirichlet is conjugate to the multinomial and hence mathematically convenient for Bayesian inference, and the number of parameters is conveniently linear in the size of the sample space. However, the Dirichlet is a distribution over the entire probability simplex, and for many problems this is simply the wrong domain if there is application-specific prior knowledge that the pmfs come from a restricted subset of the simplex. For example, in natural language modeling, it is common to regularize a pmf over n-grams by some generic language model distribution q0 , that is, the pmf to be modeled is assumed to have the form θ = λq + (1 − λ)q0 for some q in the simplex, λ ∈ (0, 1) and a fixed generic model q0 [2]. But once q0 and λ are fixed, the pmf θ can only come from a subset of the simplex. Another natural language processing example is modeling the probability of keywords in a dictionary where some words are related, such as espresso and latte, and evidence for the one is to some extent evidence for the other. This relationship can be captured with a bounded variation model that would constrain the modeled probability of espresso to be within some of the modeled probability of latte. We show that such bounds on the variation between pmf components also restrict the domain of the pmf to a subset of the simplex. As a third example of restricting the domain, the similarity discriminant analysis classifier estimates class-conditional pmfs that are constrained to be monotonically increasing over an ordered sample space of discrete similarity values [3]. In this paper we propose a simple variant of the Dirichlet whose support is a subset of the simplex, explore its properties, and show how to learn the model from data. We first discuss the alternative solution of renormalizing the Dirichlet over the desired subset of the simplex, and other related work. Then we propose the shadow Dirichlet distribution; explain how to construct a shadow Dirichlet for three types of restricted domains: the regularized pmf case, bounded variation between pmf components, and monotonic pmfs; and discuss the most general case. We show how to use the expectation-maximization (EM) algorithm to estimate the shadow Dirichlet parameter α, and present simulation results for the estimation. 1 Dirichlet Shadow Dirichlet Renormalized Dirichlet Figure 1: Dirichlet, shadow Dirichlet, and renormalized Dirichlet for α = [3.94 2.25 2.81]. 2 Related Work One solution to modeling pmfs on only a subset of the simplex is to simply restrict the support of ˜ ˜ the Dirichlet to the desired support S, and renormalize the Dirichlet over S (see Fig. 1 for an example). 
This renormalized Dirichlet has the advantage that it is still a conjugate distribution for the multinomial. Nallapati et al.considered the renormalized Dirichlet for language modeling, but found it difficult to use because the density requires numerical integration to compute the normalizer [4] . In addition, there is no closed form solution for the mean, covariance, or peak of the renormalized Dirichlet, making it difficult to work with. Table 1 summarizes these properties. Additionally, generating samples from the renormalized Dirichlet is inefficient: one draws samples from the stan˜ dard Dirichlet, then rejects realizations that are outside S. For high-dimensional sample spaces, this could greatly increase the time to generate samples. Although the Dirichlet is a classic and popular distribution on the simplex, Aitchison warns it “is totally inadequate for the description of the variability of compositional data,” because of its “implied independence structure and so the Dirichlet class is unlikely to be of any great use for describing compositions whose components have even weak forms of dependence” [5]. Aitchison instead championed a logistic normal distribution with more parameters to control covariance between components. A number of variants of the Dirichlet that can capture more dependence have been proposed and analyzed. For example, the scaled Dirichlet enables a more flexible shape for the distribution [5], but does not change the support. The original Dirichlet(α1 , α2 , . . . αd ) can be derived as Yj / j Yj where Yj ∼ Γ(αj , β), whereas the scaled Dirichlet is derived from Yj ∼ Γ(αj , βj ), resulting in α density p(θ) = γ j ( α −1 βj j θ j j βi θi )α1 +···+αd i , where β, α ∈ Rd are parameters, and γ is the normalizer. + Another variant is the generalized Dirichlet [6] which also has parameters β, α ∈ Rd , and allows + greater control of the covariance structure, again without changing the support. As perhaps first noted by Karl Pearson [7] and expounded upon by Aitchison [5], correlations of proportional data can be very misleading. Many Dirichlet variants have been generalizations of the Connor-Mossiman variant, Dirichlet process variants, other compound Dirichlet models, and hierarchical Dirichlet models. Ongaro et al. [8] propose the flexible Dirichlet distribution by forming a re-parameterized mixture of Dirichlet distributions. Rayens and Srinivasan [9] considered the dependence structure for the general Dirichlet family called the generalized Liouville distributions. In contrast to prior efforts, the shadow Dirichlet manipulates the support to achieve various kinds of dependence that arise frequently in machine learning problems. 3 Shadow Dirichlet Distribution We introduce a new distribution that we call the shadow Dirichlet distribution. Let S be the prob˜ ability (d − 1)-simplex, and let Θ ∈ S be a random pmf drawn from a Dirichlet distribution with density pD and unnormalized parameter α ∈ Rd . Then we say the random pmf Θ ∈ S is distributed + ˜ according to a shadow Dirichlet distribution if Θ = M Θ for some fixed d × d left-stochastic (that ˜ is, each column of M sums to 1) full-rank (and hence invertible) matrix M , and we call Θ the gen2 erating Dirichlet of Θ, or Θ’s Dirichlet shadow. Because M is a left-stochastic linear map between finite-dimensional spaces, it is a continuous map from the convex and compact S to a convex and compact subset of S that we denote SM . 
The shadow Dirichlet has two parameters: the generating Dirichlet’s parameter α ∈ Rd , and the + d × d matrix M . Both α and M can be estimated from data. However, as we show in the following subsections, the matrix M can be profitably used as a design parameter that is chosen based on application-specific knowledge or side-information to specify the restricted domain SM , and in that way impose dependency between the components of the random pmfs. The shadow Dirichlet density p(θ) is the normalized pushforward of the Dirichlet density, that is, it is the composition of the Dirichlet density and M −1 with the Jacobian: 1 α −1 p(θ) = (M −1 θ)j j , (1) B(α) |det(M )| j Γ(αj ) d j is the standard Dirichlet normalizer, and α0 = j=1 αj is the standard where B(α) Γ(α0 ) Dirichlet precision factor. Table 1 summarizes the basic properties of the shadow Dirichlet. Fig. 1 shows an example shadow Dirichlet distribution. Generating samples from the shadow Dirichlet is trivial: generate samples from its generating Dirichlet (for example, using stick-breaking or urn-drawing) and multiply each sample by M to create the corresponding shadow Dirichlet sample. Table 1: Table compares and summarizes the Dirichlet, renormalized Dirichlet, and shadow Dirichlet distributions. Dirichlet(α) Density p(θ) Mean 1 B(α) d j=1 α −1 θj j Shadow Dirichlet (α, M ) 1 B(α)|det(M )| α α0 Renormalized ˜ Dirichlet (α, S) d −1 αj −1 θ)j j=1 (M 1 ˜ S M d j=1 α α0 αj −1 qj dq d j=1 α −1 θj j ˜ θp(θ)dθ S ¯ ¯ − θ)(θ − θ)T p(θ)dθ Cov(Θ) M Cov(Θ)M T αj −1 α0 −d j M α0 −d max p(θ) stick-breaking, urn-drawing draw from Dirichlet(α), multiply by M draw from Dirichlet(α), ˜ reject if not in S ML Estimate iterative (simple functions) iterative (simple functions) unknown complexity ML Compound Estimate iterative (simple functions) iterative (numerical integration) unknown complexity Covariance Mode (if α > 1) How to Sample 3.1 α −1 ˜ (θ S ˜ θ∈S Example: Regularized Pmfs The shadow Dirichlet can be designed to specify a distribution over a set of regularized pmfs SM = ˜ ˘ ˜ ˘ ˘ {θ θ = λθ + (1 − λ)θ, θ ∈ S}, for specific values of λ and θ. In general, for a given λ and θ ∈ S, the following d × d matrix M will change the support to the desired subset SM by mapping the extreme points of S to the extreme points of SM : ˘ M = (1 − λ)θ1T + λI, (2) where I is the d × d identity matrix. In Section 4 we show that the M given in (2) is optimal in a maximum entropy sense. 3 3.2 Example: Bounded Variation Pmfs We describe how to use the shadow Dirichlet to model a random pmf that has bounded variation such that |θk − θl | ≤ k,l for any k, ∈ {1, 2, . . . , d} and k,l > 0. To construct specified bounds on the variation, we first analyze the variation for a given M . For any d × d left stochastic matrix T d d ˜ ˜ ˜ M, θ = Mθ = M1j θj . . . Mdj θj , so the difference between any two entries is j=1 j=1 ˜ |Mkj − Mlj | θj . ˜ (Mkj − Mlj )θj ≤ |θk − θl | = (3) j j Thus, to obtain a distribution over pmfs with bounded |θk − θ | ≤ k,l for any k, components, it is sufficient to choose components of the matrix M such that |Mkj − Mlj | ≤ k,l for all j = 1, . . . , d ˜ because θ in (3) sums to 1. One way to create such an M is using the regularization strategy described in Section 3.1. For this ˜ ˜ ˘ case, the jth component of θ is θj = M θ = λθj + (1 − λ)θj , and thus the variation between the j ith and jth component of any pmf in SM is: ˜ ˘ ˜ ˘ ˜ ˜ ˘ ˘ |θi − θj | = λθi + (1 − λ)θi − λθj − (1 − λ)θj ≤ λ θi − θj + (1 − λ) θi − θj ˘ ˘ ≤ λ + (1 − λ) max θi − θj . 
(4) i,j ˘ Thus by choosing an appropriate λ and regularizing pmf θ, one can impose the bounded variation ˘ to be the uniform pmf, and choose any λ ∈ (0, 1), then the matrix given by (4). For example, set θ M given by (2) will guarantee that the difference between any two entries of any pmf drawn from the shadow Dirichlet (M, α) will be less than or equal to λ. 3.3 Example: Monotonic Pmfs For pmfs over ordered components, it may be desirable to restrict the support of the random pmf distribution to only monotonically increasing pmfs (or to only monotonically decreasing pmfs). A d × d left-stochastic matrix M that will result in a shadow Dirichlet that generates only monotonically increasing d × 1 pmfs has kth column [0 . . . 0 1/(d − k + 1) . . . 1/(d − k + 1)]T , we call this the monotonic M . It is easy to see that with this M only monotonic θ’s can be produced, 1˜ 1˜ 1 ˜ because θ1 = d θ1 which is less than or equal to θ2 = d θ1 + d−1 θ2 and so on. In Section 4 we show that the monotonic M is optimal in a maximum entropy sense. Note that to provide support over both monotonically increasing and decreasing pmfs with one distribution is not achievable with a shadow Dirichlet, but could be achieved by a mixture of two shadow Dirichlets. 3.4 What Restricted Subsets are Possible? Above we have described solutions to construct M for three kinds of dependence that arise in machine learning applications. Here we consider the more general question: What subsets of the simplex can be the support of the shadow Dirichlet, and how to design a shadow Dirichlet for a particular support? For any matrix M , by the Krein-Milman theorem [10], SM = M S is the convex hull of its extreme points. If M is injective, the extreme points of SM are easy to specify, as a d × d matrix M will have d extreme points that occur for the d choices of θ that have only one nonzero component, as the rest of the θ will create a non-trivial convex combination of the columns of M , and therefore cannot result in extreme points of SM by definition. That is, the extreme points of SM are the d columns of M , and one can design any SM with d extreme points by setting the columns of M to be those extreme pmfs. However, if one wants the new support to be a polytope in the probability (d − 1)-simplex with m > d extreme points, then one must use a fat M with d × m entries. Let S m denote the probability 4 (m − 1)-simplex, then the domain of the shadow Dirichlet will be M S m , which is the convex hull of the m columns of M and forms a convex polytope in S with at most m vertices. In this case M cannot be injective, and hence it is not bijective between S m and M S m . However, a density on M S m can be defined as: p(θ) = 1 B(α) ˜ {θ ˜α −1 ˜ θj j dθ. ˜ M θ=θ} j (5) On the other hand, if one wants the support to be a low-dimensional polytope subset of a higherdimensional probability simplex, then a thin d × m matrix M , where m < d, can be used to implement this. If M is injective, then it has a left inverse M ∗ that is a matrix of dimension m × d, and the normalized pushforward of the original density can be used as a density on the image M S m : p(θ) = 1 α −1 1/2 B(α) |det(M T M )| (M ∗ θ)j j , j If M is not injective then one way to determine a density is to use (5). 4 Information-theoretic Properties In this section we note two information-theoretic properties of the shadow Dirichlet. Let Θ be drawn ˜ from shadow Dirichlet density pM , and let its generating Dirichlet Θ be drawn from pD . 
Then the differential entropy of the shadow Dirichlet is h(pM ) = log |det(M )| + h(pD ), where h(pD ) is the differential entropy of its generating Dirichlet. In fact, the shadow Dirichlet always has less entropy than its Dirichlet shadow because log |det(M )| ≤ 0, which can be shown as a corollary to the following lemma (proof not included due to lack of space): Lemma 4.1. Let {x1 , . . . , xn } and {y1 , . . . , yn } be column vectors in Rn . If each yj is a convex n n combination of the xi ’s, i.e. yj = i=1 γji xi , i=1 γji = 1, γjk ≥ 0, ∀j, k ∈ {1, . . . , n} then |det[y1 , . . . , yn ]| ≤ |det[x1 , . . . , xn ]|. It follows from Lemma 4.1 that the constructive solutions for M given in (2) and the monotonic M are optimal in the sense of maximizing entropy: Corollary 4.1. Let Mreg be the set of left-stochastic matrices M that parameterize shadow Dirichlet ˜ ˘ ˜ ˘ distributions with support in {θ θ = λθ + (1 − λ)θ, θ ∈ S}, for a specific choice of λ and θ. Then the M given in (2) results in the shadow Dirichlet with maximum entropy, that is, (2) solves arg maxM ∈Mreg h(pM ). Corollary 4.2. Let Mmono be the set of left-stochastic matrices M that parameterize shadow Dirichlet distributions that generate only monotonic pmfs. Then the monotonic M given in Section 3.3 results in the shadow Dirichlet with maximum entropy, that is, the monotonic M solves arg maxM ∈Mmono h(pM ). 5 Estimating the Distribution from Data In this section, we discuss the estimation of α for the shadow Dirichlet and compound shadow Dirichlet, and the estimation of M . 5.1 Estimating α for the Shadow Dirichlet Let matrix M be specified (for example, as described in the subsections of Section 3), and let q be a d × N matrix where the ith column qi is the ith sample pmf for i = 1 . . . N , and let (qi )j be the jth component of the ith sample pmf for j = 1, . . . , d. Then finding the maximum likelihood estimate 5 of α for the shadow Dirichlet is straightforward:  N 1 arg max log + log  p(qi |α) ≡ arg max log B(α) |det(M )| α∈Rk α∈Rk + + i=1   1 αj −1  (˜i )j q , ≡ arg max log  B(α)N i j α∈Rk +  N α −1 (M −1 qi )j j  i j (6) where q = M −1 q. Note (6) is the maximum likelihood estimation problem for the Dirichlet dis˜ tribution given the matrix q , and can be solved using the standard methods for that problem (see ˜ e.g. [11, 12]). 5.2 Estimating α for the Compound Shadow Dirichlet For many machine learning applications the given data are modeled as samples from realizations of a random pmf, and given these samples one must estimate the random pmf model’s parameters. We refer to this case as the compound shadow Dirichlet, analogous to the compound Dirichlet (also called the multivariate P´ lya distribution). Assuming one has already specified M , we first discuss o method of moments estimation, and then describe an expectation-maximization (EM) method for computing the maximum likelihood estimate α. ˘ One can form an estimate of α by the method of moments. For the standard compound Dirichlet, one treats the samples of the realizations as normalized empirical histograms, sets the normalized α parameter equal to the empirical mean of the normalized histograms, and uses the empirical variances to determine the precision α0 . By definition, this estimate will be less likely than the maximum likelihood estimate, but may be a practical short-cut in some cases. For the compound shadow Dirichlet, we believe the method of moments estimator will be a poorer estimate in general. 
The problem is that if one draws samples from a pmf θ from a restricted subset SM of the simplex, ˘ then the normalized empirical histogram θ of those samples may not be in SM . For example given a monotonic pmf, the histogram of five samples drawn from it may not be monotonic. Then the empirical mean of such normalized empirical histograms may not be in SM , and so setting the shadow Dirichlet mean M α equal to the empirical mean may lead to an infeasible estimate (one that is outside SM ). A heuristic solution is to project the empirical mean into SM first, for example, by finding the nearest pmf in SM in squared error or relative entropy. As with the compound Dirichlet, this may still be a useful approach in practice for some problems. Next we state an EM method to find the maximum likelihood estimate α. Let s be a d × N matrix ˘ of sample histograms from different experiments, such that the ith column si is the ith histogram for i = 1, . . . , N , and (si )j is the number of times we have observed the jth event from the ith pmf vi . Then the maximum log-likelihood estimate of α solves arg max log p(s|α) for α ∈ Rk . + If the random pmfs are drawn from a Dirichlet distribution, then finding this maximum likelihood estimate requires an iterative procedure, and can be done in several ways including a gradient descent (ascent) approach. However, if the random pmfs are drawn from a shadow Dirichlet distribution, then a direct gradient descent approach is highly inconvenient as it requires taking derivatives of numerical integrals. However, it is practical to apply the expectation-maximization (EM) algorithm [13][14], as we describe in the rest of this section. Code to perform the EM estimation of α can be downloaded from idl.ee.washington.edu/publications.php. We assume that the experiments are independent and therefore p(s|α) = p({si }|α) = and hence arg maxα∈Rk log p(s|α) = arg maxα∈Rk i log p(si |α). + + i p(si |α) To apply the EM method, we consider the complete data to be the sample histograms s and the pmfs that generated them (s, v1 , v2 , . . . , vN ), whose expected log-likelihood will be maximized. Specifically, because of the assumed independence of the {vi }, the EM method requires one to repeatedly maximize the Q-function such that the estimate of α at the (m + 1)th iteration is: N α(m+1) = arg max α∈Rk + Evi |si ,α(m) [log p(vi |α)] . i=1 6 (7) Like the compound Dirichlet likelihood, the compound shadow Dirichlet likelihood is not necessarily concave. However, note that the Q-function given in (7) is concave, because log p(vi |α) = − log |det(M )| + log pD,α M −1 vi , where pD,α is the Dirichlet distribution with parameter α, and by a theorem of Ronning [11], log pD,α is a concave function, and adding a constant does not change the concavity. The Q-function is a finite integration of such concave functions and hence also concave [15]. We simplify (7) without destroying the concavity to yield the equivalent problem α(m+1) = d d arg max g(α) for α ∈ Rk , where g(α) = log Γ(α0 ) − j=1 log Γ(αj ) + j=1 βj αj , and + βj = N tij i=1 zi , 1 N where tij and zi are integrals we compute with Monte Carlo integration: d (s ) log(M −1 vi )j γi tij = SM (vi )k i k pM (vi |α(m) )dvi k=1 d zi = (vi )j k(si )k pM (vi |α(m) )dvi , γi SM k=1 where γi is the normalization constant for the multinomial with histogram si . We apply the Newton method [16] to maximize g(α), where the gradient g(α) has kth component ψ0 (α0 ) − ψ0 (α1 ) + β1 , where ψ0 denotes the digamma function. 
5.3 Estimating M for the Shadow Dirichlet

Thus far we have discussed how to construct M to achieve certain desired properties and how to interpret a given M's effect on the support. In some cases it may be useful to estimate M directly from data, for example by finding the maximum likelihood M. In general this is a non-convex problem because the set of rank d − 1 matrices is not convex. However, we offer two approximations. First, note that, as in estimating the support of a uniform distribution, the maximum likelihood M will correspond to a support that is no larger than needed to contain the convex hull of the sample pmfs. Second, the mean of the empirical pmfs will be in the support, and thus a heuristic is to set the kth column of M (which corresponds to the kth vertex of the support) to be a convex combination of the kth vertex of the standard probability simplex and the empirical mean pmf. We provide code that finds the d optimal such convex combinations such that a specified percentage of the sample pmfs are within the support, which reduces the non-convex problem of finding the maximum likelihood d × d matrix M to a d-dimensional convex relaxation.

6 Demonstrations

It is reasonable to believe that if the shadow Dirichlet better matches the problem's statistics it will perform better in practice, but it is an open question how much better. To motivate the reader to investigate this question further in applications, we provide two small demonstrations.

6.1 Verifying the EM Estimation

We used a broad suite of simulations to test and verify the EM estimation. Here we include a simple visual confirmation that the EM estimation works: we drew 100 i.i.d. pmfs from a shadow Dirichlet with monotonic M for d = 3 and α = [3.94 2.25 2.81] (used in [18]). From each of the 100 pmfs, we drew 100 i.i.d. samples. We then applied the EM algorithm to find the α for both the standard compound Dirichlet and the compound shadow Dirichlet with the correct M. Fig. 2 shows the true distribution and the two estimated distributions.

[Figure 2: Samples were drawn from the true distribution and the given EM method was applied to form the estimated distributions. Panels: True Distribution (Shadow Dirichlet); Estimated Shadow Dirichlet; Estimated Dirichlet.]
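Reproducing this kind of check only requires drawing pmfs from a shadow Dirichlet, which amounts to drawing from the generating Dirichlet and pushing the draws through M. The sketch below uses a monotonic M whose columns are the extreme points of the monotonically non-increasing pmfs (uniform prefix vectors); we assume this is the flavour of construction meant in Section 3.3, but it should be read as an illustrative stand-in rather than the paper's exact matrix.

```python
# Sampling from a shadow Dirichlet: draw theta ~ Dirichlet(alpha), return M @ theta.
import numpy as np

def monotonic_M(d):
    M = np.zeros((d, d))
    for k in range(1, d + 1):
        M[:k, k - 1] = 1.0 / k        # k-th column: uniform on the first k outcomes
    return M

def sample_shadow_dirichlet(alpha, M, n, rng):
    theta = rng.dirichlet(alpha, size=n)   # (n, d) draws from the generating Dirichlet
    return theta @ M.T                     # each row is M @ theta_i, a pmf in S_M

rng = np.random.default_rng(0)
alpha = np.array([3.94, 2.25, 2.81])       # the alpha used in this demonstration
V = sample_shadow_dirichlet(alpha, monotonic_M(3), 100, rng)
assert np.all(np.diff(V, axis=1) <= 1e-12)               # every sampled pmf is non-increasing
counts = np.array([rng.multinomial(100, v) for v in V])  # histograms s_i for the EM input
```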
6.2 Estimating Proportions from Sales

Manufacturers often have constrained manufacturing resources, such as equipment, inventory of raw materials, and employee time, with which to produce multiple products, and they must decide how to allocate these constrained resources across their product line based on an estimate of proportional sales. The manufacturer Artifact Puzzles gave us their past retail sales data for the 20 puzzles they sold from July 2009 through December 2009, which we used to predict the proportion of sales expected for each puzzle. These estimates were then tested on the following months of sales data, January 2010 through April 2010. The company also provided a similarity matrix S between puzzles, where S(A, B) is the proportion of orders during the six training months that included both puzzles A and B, among the orders that included puzzle A.

We compared treating each of the six training months of sales data as a sample from a compound Dirichlet versus a compound shadow Dirichlet. For the shadow Dirichlet, we normalized each column of the similarity matrix S to sum to one so that it was left-stochastic, and used that as the matrix M; this forces puzzles that are often bought together to have closer estimated proportions. We estimated each α parameter by EM to maximize the likelihood of the past sales data, and then estimated the future sales proportions to be the mean of the estimated Dirichlet or shadow Dirichlet distribution. We also compared with treating all six months of sales data as coming from one multinomial, which we estimated as the maximum likelihood multinomial, and with taking the mean of the six empirical pmfs.
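A minimal sketch of the deterministic pieces of this pipeline is given below: column-normalizing S to obtain M, and turning a fitted α into predicted proportions via the shadow Dirichlet mean Mα/α_0. The EM fit itself is the procedure of Section 5.2 and is not repeated here; variable names and the row/column convention for S are our own assumptions.

```python
import numpy as np

def similarity_to_M(S):
    # Normalize each column of the similarity matrix to sum to one so that the
    # result is left-stochastic and can serve as the shadow Dirichlet matrix M.
    S = np.asarray(S, dtype=float)
    return S / S.sum(axis=0, keepdims=True)

def predicted_proportions(alpha, M):
    # Mean of the shadow Dirichlet: E[M theta] = M alpha / alpha_0.
    return M @ (alpha / alpha.sum())

def squared_error(estimate, actual):
    # The evaluation reported in Table 2: squared error between proportion vectors.
    return float(np.sum((estimate - actual) ** 2))
```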
Table 2: Squared errors between estimates and actual proportional sales.

                     Jan.    Feb.    Mar.    Apr.
Multinomial         .0129   .0185   .0231   .0240
Mean Pmf            .0106   .0206   .0222   .0260
Dirichlet           .0109   .0172   .0227   .0235
Shadow Dirichlet    .0093   .0164   .0197   .0222

7 Summary

In this paper we have proposed a variant of the Dirichlet distribution that naturally captures some of the dependent structure that often arises in machine learning applications. We have discussed some of its theoretical properties, and shown how to specify the distribution for regularized pmfs, bounded-variation pmfs, monotonic pmfs, and any desired convex polytopal domain. We have derived the EM method and made available code to estimate both the shadow Dirichlet and the compound shadow Dirichlet from data. Experimental results demonstrate that the EM method can estimate the shadow Dirichlet effectively, and that the shadow Dirichlet may provide worthwhile advantages in practice.

8 References

[1] B. Frigyik, A. Kapila, and M. R. Gupta, "Introduction to the Dirichlet distribution and related processes," Tech. Rep., University of Washington, 2010.
[2] C. Zhai and J. Lafferty, "A study of smoothing methods for language models applied to information retrieval," ACM Trans. on Information Systems, vol. 22, no. 2, pp. 179–214, 2004.
[3] Y. Chen, E. K. Garcia, M. R. Gupta, A. Rahimi, and L. Cazzanti, "Similarity-based classification: Concepts and algorithms," Journal of Machine Learning Research, vol. 10, pp. 747–776, March 2009.
[4] R. Nallapati, T. Minka, and S. Robertson, "The smoothed-Dirichlet distribution: a building block for generative topic models," Tech. Rep., Microsoft Research, Cambridge, 2007.
[5] J. Aitchison, Statistical Analysis of Compositional Data, Chapman & Hall, New York, 1986.
[6] R. J. Connor and J. E. Mosimann, "Concepts of independence for proportions with a generalization of the Dirichlet distribution," Journal of the American Statistical Association, vol. 64, pp. 194–206, 1969.
[7] K. Pearson, "Mathematical contributions to the theory of evolution–on a form of spurious correlation which may arise when indices are used in the measurement of organs," Proc. Royal Society of London, vol. 60, pp. 489–498, 1897.
[8] A. Ongaro, S. Migliorati, and G. S. Monti, "A new distribution on the simplex containing the Dirichlet family," Proc. 3rd Compositional Data Analysis Workshop, 2008.
[9] W. S. Rayens and C. Srinivasan, "Dependence properties of generalized Liouville distributions on the simplex," Journal of the American Statistical Association, vol. 89, no. 428, pp. 1465–1470, 1994.
[10] W. Rudin, Functional Analysis, McGraw-Hill, New York, 1991.
[11] G. Ronning, "Maximum likelihood estimation of Dirichlet distributions," Journal of Statistical Computation and Simulation, vol. 34, no. 4, pp. 215–221, 1989.
[12] T. Minka, "Estimating a Dirichlet distribution," Tech. Rep., Microsoft Research, Cambridge, 2009.
[13] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society: Series B (Methodological), vol. 39, no. 1, pp. 1–38, 1977.
[14] M. R. Gupta and Y. Chen, Theory and Use of the EM Method, Foundations and Trends in Signal Processing, Hanover, MA, 2010.
[15] R. T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, NJ, 1970.
[16] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, Cambridge, 2004.
[17] K. B. Petersen and M. S. Pedersen, The Matrix Cookbook, 2009, available at matrixcookbook.com.
[18] R. E. Madsen, D. Kauchak, and C. Elkan, "Modeling word burstiness using the Dirichlet distribution," in Proc. Intl. Conf. Machine Learning, 2005.

3 0.52094293 173 nips-2010-Multi-View Active Learning in the Non-Realizable Case

Author: Wei Wang, Zhi-hua Zhou

Abstract: The sample complexity of active learning under the realizability assumption has been well-studied. The realizability assumption, however, rarely holds in practice. In this paper, we theoretically characterize the sample complexity of active learning in the non-realizable case under multi-view setting. We prove that, with unbounded Tsybakov noise, the sample complexity of multi-view active learning can be O(log 1/ǫ), contrasting to single-view setting where the polynomial improvement is the best possible achievement. We also prove that in general multi-view setting the sample complexity of active learning with unbounded Tsybakov noise is O(1/ǫ), where the order of 1/ǫ is independent of the parameter in Tsybakov noise, contrasting to previous polynomial bounds where the order of 1/ǫ is related to the parameter in Tsybakov noise. 1

4 0.51446557 50 nips-2010-Constructing Skill Trees for Reinforcement Learning Agents from Demonstration Trajectories

Author: George Konidaris, Scott Kuindersma, Roderic Grupen, Andre S. Barreto

Abstract: We introduce CST, an algorithm for constructing skill trees from demonstration trajectories in continuous reinforcement learning domains. CST uses a changepoint detection method to segment each trajectory into a skill chain by detecting a change of appropriate abstraction, or that a segment is too complex to model as a single skill. The skill chains from each trajectory are then merged to form a skill tree. We demonstrate that CST constructs an appropriate skill tree that can be further refined through learning in a challenging continuous domain, and that it can be used to segment demonstration trajectories on a mobile manipulator into chains of skills where each skill is assigned an appropriate abstraction. 1

5 0.36139575 21 nips-2010-Accounting for network effects in neuronal responses using L1 regularized point process models

Author: Ryan Kelly, Matthew Smith, Robert Kass, Tai S. Lee

Abstract: Activity of a neuron, even in the early sensory areas, is not simply a function of its local receptive field or tuning properties, but depends on global context of the stimulus, as well as the neural context. This suggests the activity of the surrounding neurons and global brain states can exert considerable influence on the activity of a neuron. In this paper we implemented an L1 regularized point process model to assess the contribution of multiple factors to the firing rate of many individual units recorded simultaneously from V1 with a 96-electrode “Utah” array. We found that the spikes of surrounding neurons indeed provide strong predictions of a neuron’s response, in addition to the neuron’s receptive field transfer function. We also found that the same spikes could be accounted for with the local field potentials, a surrogate measure of global network states. This work shows that accounting for network fluctuations can improve estimates of single trial firing rate and stimulus-response transfer functions. 1

6 0.35925415 161 nips-2010-Linear readout from a neural population with partial correlation data

7 0.35593036 200 nips-2010-Over-complete representations on recurrent neural networks can support persistent percepts

8 0.35546556 81 nips-2010-Evaluating neuronal codes for inference using Fisher information

9 0.35233331 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing

10 0.35168427 17 nips-2010-A biologically plausible network for the computation of orientation dominance

11 0.35065886 127 nips-2010-Inferring Stimulus Selectivity from the Spatial Structure of Neural Network Dynamics

12 0.34996292 96 nips-2010-Fractionally Predictive Spiking Neurons

13 0.34961098 98 nips-2010-Functional form of motion priors in human motion perception

14 0.3457458 268 nips-2010-The Neural Costs of Optimal Control

15 0.34522086 44 nips-2010-Brain covariance selection: better individual functional connectivity models using population prior

16 0.34492964 56 nips-2010-Deciphering subsampled data: adaptive compressive sampling as a principle of brain communication

17 0.34324455 109 nips-2010-Group Sparse Coding with a Laplacian Scale Mixture Prior

18 0.34166384 19 nips-2010-A rational decision making framework for inhibitory control

19 0.34146407 39 nips-2010-Bayesian Action-Graph Games

20 0.34017229 266 nips-2010-The Maximal Causes of Natural Scenes are Edge Filters