nips nips2012 nips2012-150 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yan Karklin, Chaitanya Ekanadham, Eero P. Simoncelli
Abstract: Natural sounds exhibit complex statistical regularities at multiple scales. Acoustic events underlying speech, for example, are characterized by precise temporal and frequency relationships, but they can also vary substantially according to the pitch, duration, and other high-level properties of speech production. Learning this structure from data while capturing the inherent variability is an important first step in building auditory processing systems, as well as understanding the mechanisms of auditory perception. Here we develop Hierarchical Spike Coding, a two-layer probabilistic generative model for complex acoustic structure. The first layer consists of a sparse spiking representation that encodes the sound using kernels positioned precisely in time and frequency. Patterns in the positions of first layer spikes are learned from the data: on a coarse scale, statistical regularities are encoded by a second-layer spiking representation, while fine-scale structure is captured by recurrent interactions within the first layer. When fit to speech data, the second layer acoustic features include harmonic stacks, sweeps, frequency modulations, and precise temporal onsets, which can be composed to represent complex acoustic events. Unlike spectrogram-based methods, the model gives a probability distribution over sound pressure waveforms. This allows us to use the second-layer representation to synthesize sounds directly, and to perform model-based denoising, on which we demonstrate a significant improvement over standard methods.
Reference: text
sentIndex sentText sentNum sentScore
1 Hierarchical spike coding of sound Yan Karklin∗ Howard Hughes Medical Institute, Center for Neural Science, New York University [sent-3, score-0.363]
2 Abstract Natural sounds exhibit complex statistical regularities at multiple scales. [sent-10, score-0.356]
3 Acoustic events underlying speech, for example, are characterized by precise temporal and frequency relationships, but they can also vary substantially according to the pitch, duration, and other high-level properties of speech production. [sent-11, score-0.446]
4 Learning this structure from data while capturing the inherent variability is an important first step in building auditory processing systems, as well as understanding the mechanisms of auditory perception. [sent-12, score-0.234]
5 Here we develop Hierarchical Spike Coding, a two-layer probabilistic generative model for complex acoustic structure. [sent-13, score-0.274]
6 The first layer consists of a sparse spiking representation that encodes the sound using kernels positioned precisely in time and frequency. [sent-14, score-0.627]
7 Patterns in the positions of first layer spikes are learned from the data: on a coarse scale, statistical regularities are encoded by a second-layer spiking representation, while fine-scale structure is captured by recurrent interactions within the first layer. [sent-15, score-0.811]
8 When fit to speech data, the second layer acoustic features include harmonic stacks, sweeps, frequency modulations, and precise temporal onsets, which can be composed to represent complex acoustic events. [sent-16, score-1.062]
9 Unlike spectrogram-based methods, the model gives a probability distribution over sound pressure waveforms. [sent-17, score-0.214]
10 This allows us to use the second-layer representation to synthesize sounds directly, and to perform model-based denoising, on which we demonstrate a significant improvement over standard methods. [sent-18, score-0.279]
11 1 Introduction Natural sounds, such as speech and animal vocalizations, consist of complex acoustic events occurring at multiple scales. [sent-19, score-0.485]
12 Precise timing and frequency relationships among these events convey important information about the sound, while intrinsic variability confounds simple approaches to sound processing and understanding. [sent-20, score-0.409]
13 An auditory representation that captures the corresponding structure while remaining invariant to this variability would provide a useful first step for many applications in auditory processing. [sent-22, score-0.289]
14 ∗ Contributed equally. Many recent efforts to learn auditory representations in an unsupervised setting have focused on sparse decompositions chosen to capture structure inherent in sound ensembles. [sent-23, score-0.306]
15 For example, Klein et al [3] adapted a set of time-frequency kernels to represent spectrograms of speech signals and showed that the resulting kernels were localized and bore resemblance to auditory receptive fields. [sent-25, score-0.723]
16 First, they operate on spectrograms (rather than the original sound waveforms), which impose limitations on both time and frequency resolution. [sent-28, score-0.344]
17 The features learned by these models are tied to specific frequencies, and must be replicated at different frequency offsets to accommodate pitch shifts that occur in natural sounds. [sent-30, score-0.337]
18 To address these limitations, we propose a two-layer hierarchical model that encodes complex acoustic events using a representation that is shiftable in both time and frequency. [sent-32, score-0.533]
19 The first layer is a “spikegram” representation of the sound pressure waveform, as developed in [6, 5]. [sent-33, score-0.367]
20 The prior probabilities for coefficients in the first layer are modulated by the output of the second layer, combined with a recurrent component that operates within the first layer. [sent-34, score-0.238]
21 When trained on speech, the kernels learned at the second layer encode complex acoustic events which, when positioned at specific times and frequencies, compactly represent the first-layer spikegram, which is itself a compact description of the sound pressure waveform. [sent-35, score-0.955]
22 Despite its very sparse activation, the second-layer representation retains much of the acoustic information: sounds sampled according to the generative model approximate well the original sound. [sent-36, score-0.562]
23 Finally, we demonstrate that the model performs well on a denoising task, particularly when the noise is structured, suggesting that the higher-order representation provides a useful statistical description of speech. [sent-37, score-0.315]
24 2 Hierarchical spike coding In the “spikegram” representation [5], a sound is encoded using a linear combination of sparse, time-shifted kernels $\phi_f(t)$: $x_t = \sum_{\tau,f} S_{\tau,f}\,\phi_f(t-\tau) + \epsilon_t$ (1) where $\epsilon_t$ denotes Gaussian white noise and the coefficients $S_{\tau,f}$ are mostly zero. [sent-38, score-0.727]
25 1b, offers an efficient representation of sounds [8] that avoids the blocking artifacts and time-frequency trade-offs associated with more traditional spectrogram representations. [sent-44, score-0.342]
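The spikegram synthesis in Eq. (1) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' encoder: the `gammatone` and `synthesize` helpers, the 16 kHz sampling rate, the 20 ms kernel length, the fourth-order envelope with a Glasberg-Moore bandwidth, and the three example spikes are all assumptions made for this sketch. It only shows how a sparse list of (time, channel, amplitude) spikes maps to a waveform.

```python
import numpy as np

def gammatone(fc, sr=16000, dur=0.02, order=4):
    """Unit-norm gammatone kernel phi_f(t) with centre frequency fc in Hz.
    Bandwidth follows the Glasberg-Moore ERB formula (an assumption here)."""
    t = np.arange(int(round(dur * sr))) / sr
    erb = 24.7 * (4.37e-3 * fc + 1.0)
    env = t ** (order - 1) * np.exp(-2 * np.pi * 1.019 * erb * t)
    k = env * np.cos(2 * np.pi * fc * t)
    return k / np.linalg.norm(k)

def synthesize(spikes, kernels, n_samples, noise_std=0.0, rng=None):
    """x_t = sum_{tau,f} S_{tau,f} phi_f(t - tau) + eps_t  (Eq. 1).
    spikes: iterable of (tau, f, amplitude); tau in samples, f a kernel index."""
    x = np.zeros(n_samples)
    for tau, f, amp in spikes:
        if tau >= n_samples:
            continue  # spike falls outside the synthesis window
        k = kernels[f]
        end = min(tau + len(k), n_samples)
        x[tau:end] += amp * k[: end - tau]
    if noise_std > 0:
        if rng is None:
            rng = np.random.default_rng(0)
        x += noise_std * rng.standard_normal(n_samples)
    return x

# Hypothetical example: three channels, three spikes, no added noise.
kernels = [gammatone(fc) for fc in (246.0, 546.0, 1214.0)]
spikes = [(100, 0, 1.0), (400, 2, 0.5), (450, 1, -0.8)]
x = synthesize(spikes, kernels, n_samples=1600)
```

Because the representation is a list of events rather than a fixed grid of frames, the time of each kernel is specified to the sample, which is the property the text contrasts with block-based spectrograms.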
26 We aim to model the statistical regularities present in the spikegram representations. [sent-45, score-0.5]
27 Spikes placed at precise locations in time and frequency reveal acoustic features, harmonic structures, as well as slow modulations in the sound envelope. [sent-49, score-0.683]
28 The coarse scale non-stationarity is likely caused by higher-order acoustic events, such as phoneme utterances that span a much larger time-frequency range than the individual gammatone kernels. [sent-50, score-0.438]
29 On the other hand, the fine-scale correlations are due to some combination of the correlations inherent in the gammatone filterbank and the precise temporal structure present in speech. [sent-51, score-0.2]
30 We introduce the hierarchical spike coding (HSC) model, illustrated in Fig. [sent-52, score-0.292]
31 2, to capture the structure in the spikegrams ($S^{(1)}$) on both coarse and fine scales. [sent-53, score-0.188]
32 We add a second layer of unobserved spikes ($S^{(2)}$), assumed to arise from a Poisson process with constant rate $\lambda$. [sent-54, score-0.569]
33 These spikes are convolved with a set of time-frequency “rate kernels” ($K^r$) to yield the logarithm of the firing rate of the first-layer spikes on a coarse scale. [sent-55, score-0.965]
34 On a fine scale, the logarithm of the firing rate of first-layer spikes is modulated using recurrent interactions, by convolving the local spike history with coupling kernels ($K^c$). [sent-56, score-1.379]
35 [Figure 1 panel labels: a. speech waveform; b. spikegram representation; c. time/freq cross-correlation. Axes: center freq (Hz), ∆ log Hz, inter-spike interval (sec).] [sent-60, score-0.637]
36 Figure 1: Coarse (top row) and fine (bottom row) scale structure in spikegram encodings of speech. [sent-68, score-0.546]
37 The sound pressure waveform of a spoken sentence and b. [sent-70, score-0.304]
38 Each spike (dot) has an associated time (abscissa) and center frequency (ordinate) as well as an amplitude (dot size). [sent-72, score-0.394]
39 Cross-correlation function for a spikegram ensemble reveals correlations across large time/frequency scales. [sent-74, score-0.398]
40 Magnification of a portion of (a), with two gammatone kernels (red and blue), corresponding to the red and blue spikes in (e). [sent-76, score-0.704]
41 Magnification of corresponding portion of (b), revealing that spike timing exhibits strong regularities at a fine scale. [sent-78, score-0.25]
42 Histograms of interspike-intervals for two frequency channels corresponding to the colored spikes in (e) reveal strong temporal dependencies. [sent-80, score-0.598]
43 Second-layer spikes $S^{(2)}$ associated with 3 features (indicated by color) are sampled in time and frequency according to a Poisson process, with exponentially-distributed amplitudes (indicated by dot size). [sent-92, score-0.628]
44 These are convolved with corresponding rate kernels $K^r$ (outlined in colored rectangles), summed together, and passed through an exponential nonlinearity to drive the instantaneous rate of the first-layer spikes on a coarse scale. [sent-93, score-0.843]
45 The first-layer spike rate is also modulated on a fine scale by a recurrent component that convolves previous spikes with coupling kernels $K^c$. [sent-94, score-1.053]
46 At a given time step (vertical line), spikes $S^{(1)}$ are generated according to a Poisson process whose rate depends on the top-down and the recurrent terms. [sent-95, score-0.501]
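The generative process described in this caption can be sketched as follows. This is a toy version under several assumptions that are not in the source: a small time-frequency grid, a single coupling kernel shared by all channels (the paper defines one per channel), no amplitude kernels, hand-made rate kernels, and a cap on the log rate added purely as a numerical guard. The names `place_kernels` and `sample_hsc` are hypothetical.

```python
import numpy as np

def place_kernels(S, K):
    """Sum amplitude-scaled copies of K centred on each nonzero entry of S
    (a sparse 2-D convolution; K must have odd dimensions)."""
    T, F = S.shape
    kt, kf = K.shape
    out = np.zeros((T + kt - 1, F + kf - 1))
    for t, f in zip(*np.nonzero(S)):
        out[t:t + kt, f:f + kf] += S[t, f] * K
    return out[kt // 2:kt // 2 + T, kf // 2:kf // 2 + F]

def sample_hsc(rate_kernels, Kc, T=200, F=40, lam=0.002, bias=-5.0, seed=0):
    rng = np.random.default_rng(seed)
    # Second layer: constant-rate Poisson events with exponential amplitudes.
    S2 = [(rng.poisson(lam, (T, F)) > 0) * rng.exponential(1.0, (T, F))
          for _ in rate_kernels]
    # Coarse scale: kernels add in the log-rate domain.
    coarse = bias + sum(place_kernels(s, K) for s, K in zip(S2, rate_kernels))
    L, w = Kc.shape            # coupling extent: L past time bins, w channels (odd)
    c = w // 2
    recur = np.zeros((T + L, F + 2 * c))   # padded store of recurrent input
    S1 = np.zeros((T, F), dtype=int)
    for t in range(T):
        # Cap the log rate at 2.0 as a numerical guard (not part of the model).
        log_rate = np.minimum(coarse[t] + recur[t, c:c + F], 2.0)
        S1[t] = rng.poisson(np.exp(log_rate))
        for f in np.flatnonzero(S1[t]):
            # Each emitted spike modulates the rate of nearby future bins.
            recur[t + 1:t + 1 + L, f:f + w] += S1[t, f] * Kc
    return S2, S1

# Hypothetical toy kernels: a stack of horizontal lines and a sharp onset.
stack = np.zeros((9, 13)); stack[:, ::4] = 4.0
onset = np.zeros((3, 13)); onset[1, :] = 4.0
Kc = np.zeros((5, 3)); Kc[:2, :] = -6.0   # suppression right after a spike
S2, S1 = sample_hsc([stack, onset], Kc)
```

The first-layer loop has to run sequentially in time because each spike changes the rate of later bins, whereas the coarse term can be computed for the whole grid at once.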
47 Section 4 describes a method for approximate inference of the second-layer spikes (solving Eq. [sent-105, score-0.378]
48 [Figure 3 panel titles: center freq = 246 Hz, center freq = 546 Hz, center freq = 1214 Hz.] [sent-111, score-0.921]
49 Figure 3: Example model kernels learned on the TIMIT data set. [sent-115, score-0.256]
50 Bottom: Four representative coupling kernels (scaling indicated by colorbar). [sent-117, score-0.333]
51 4 Inference Inference of the second-layer spikes $S^{(2)}$ (Eq. [sent-118, score-0.378]
52 (8)) involves maximizing the trade-off between the GLM likelihood term, which we denote by $\tilde{L}(\Theta, S^{(2)})$, and the last term, which penalizes the number of spikes ($\|S^{(2)}\|_0$). [sent-119, score-0.378]
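To make the trade-off concrete, here is a one-dimensional toy in which a Poisson log-likelihood is balanced against a penalty on the number of second-layer spikes. The greedy search below is a stand-in chosen for brevity; the excerpt does not spell out the authors' approximate inference procedure, and this sketch should not be read as that method. The functions `log_rate`, `objective`, and `greedy_infer`, the candidate amplitude set, and all numerical values are assumptions.

```python
import numpy as np

def log_rate(S2, K, bias):
    """Log firing rate of the first layer given second-layer spikes (1-D toy)."""
    return bias + np.convolve(S2, K, mode="same")

def objective(S1, S2, K, bias, penalty):
    """Poisson log-likelihood (up to a constant) minus an L0 penalty on S2."""
    r = log_rate(S2, K, bias)
    return np.sum(S1 * r - np.exp(r)) - penalty * np.count_nonzero(S2)

def greedy_infer(S1, K, bias, penalty, amps=(0.5, 1.0, 2.0), max_spikes=50):
    """Add one spike at a time while the penalized objective keeps improving."""
    S2 = np.zeros(len(S1))
    best = objective(S1, S2, K, bias, penalty)
    for _ in range(max_spikes):
        cand = None
        for i in np.flatnonzero(S2 == 0):
            for a in amps:
                S2[i] = a                      # try a spike at position i
                val = objective(S1, S2, K, bias, penalty)
                if val > best:
                    best, cand = val, (i, a)
            S2[i] = 0.0                        # undo the trial
        if cand is None:
            break                              # no single spike helps any more
        S2[cand[0]] = cand[1]
    return S2, best

# Toy check on noise-free expected counts generated by one known spike.
K = np.array([0.75, 1.5, 3.0, 1.5, 0.75])
true_S2 = np.zeros(60); true_S2[30] = 1.0
S1 = np.exp(log_rate(true_S2, K, bias=-3.0))
S2_hat, score = greedy_infer(S1, K, bias=-3.0, penalty=0.5)
```

Raising `penalty` makes the search stop earlier and yields a sparser code, which is the role the spike-count term plays in the objective above.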
53 5 Results Model parameters learned from speech We applied the model to the TIMIT speech corpus [10]. [sent-129, score-0.297]
54 First, we obtained spikegrams by encoding sounds to 20dB precision using a set of 200 gammatone filters with center frequencies spaced evenly on a logarithmic scale (see [5] for details). [sent-130, score-0.584]
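As a small illustration of the filterbank layout, the snippet below places 200 center frequencies evenly on a logarithmic axis. The frequency range is a placeholder, since the excerpt does not state the lowest and highest center frequencies; only the count of 200 and the logarithmic spacing come from the text.

```python
import numpy as np

n_filters = 200
f_lo, f_hi = 100.0, 6000.0   # hypothetical range in Hz, not from the source

# Evenly spaced in log-frequency means a constant ratio between neighbours.
center_freqs = np.geomspace(f_lo, f_hi, n_filters)
ratios = center_freqs[1:] / center_freqs[:-1]
octaves_per_channel = np.log2(f_hi / f_lo) / (n_filters - 1)
```

With this spacing, a fixed shift in channel index corresponds to a fixed multiplicative change in frequency, which is what later allows kernels to be translated along the frequency axis.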
55 For each audio sample, this gave us a spikegram with fine time and frequency resolution (6. [sent-131, score-0.653]
56 We trained a model with 20 rate and 20 amplitude kernels, with frequency resolution equivalent to that of the spikegram and time resolution of 20ms. [sent-134, score-0.72]
57 Coupling kernels were defined independently for each frequency channel; they extended over 20ms and 2. [sent-137, score-0.361]
58 7 octaves around the channel center frequency with the same time/frequency resolution as the spikegram. [sent-138, score-0.366]
59 3 displays the learned rate kernels (top) and coupling kernels (bottom). [sent-142, score-0.643]
60 Among the patterns learned by the rate kernels are harmonic stacks of different durations and pitch shifts (e. [sent-143, score-0.472]
61 , kernels 4, 9, 11, 18), ramps in frequency (kernels 1, 7, 15, 16), sharp temporal onsets and offsets (kernels [Figure 4 panel labels: second-layer spikes $S^{(2)}$ for phone pairs aa + r and ao + l; axes: freq, time.] [sent-145, score-0.792]
62 Each row shows inferred second-layer spikes, the rate kernels most correlated with the utterance of each phone pair, shifted to their corresponding spikes’ frequencies (colored on left), and the encoded log firing rate centered on the phone pair utterance. [sent-147, score-0.561]
63 7, 13, 19), and acoustic features localized in time and frequency (kernels 5, 10, 12, 20) (example sounds synthesized by turning on single features are available in supplementary materials). [sent-148, score-0.71]
64 The corresponding amplitude kernels (not shown) contain patterns highly correlated with the rate kernels, suggesting a strong dependence in the spikegram between spike rate and magnitude. [sent-149, score-0.913]
65 For most frequency channels, the coupling kernels are strongly negative at times immediately following the spike and at adjacent frequencies, representing “refractory periods” observed in the spikegrams. [sent-150, score-0.627]
66 Positive peaks in the coupling kernels encode precise alignment of spikes across time and frequency. [sent-151, score-0.758]
67 Second-layer representation The learned kernels combine in various ways to represent complex acoustic events. [sent-152, score-0.585]
68 Vowel phones are approximated by a harmonic stack (outlined in yellow) together with a ramp in frequency (outlined in orange and dark blue). [sent-155, score-0.237]
69 Because the rate kernels add to specify the logarithm of the firing rate, their superposition results in a multiplicative modulation of the intensities at each level of the harmonic stack. [sent-156, score-0.368]
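The multiplicative effect follows directly from the exponential link. Writing the first-layer rate as $r(t,f)$, a baseline log rate as $b$, and indexing features by $m$ (these three symbols are notation introduced here for illustration, not taken from the source):

```latex
\log r(t,f) = b + \sum_{m} \bigl(K^{r}_{m} * S^{(2)}_{m}\bigr)(t,f)
\quad\Longrightarrow\quad
r(t,f) = e^{b} \prod_{m} \exp\!\Bigl[\bigl(K^{r}_{m} * S^{(2)}_{m}\bigr)(t,f)\Bigr]
```

For example, if a harmonic-stack kernel contributes $+2$ to the log rate at some bin and an overlapping ramp kernel contributes $-1$, the baseline rate there is scaled by $e^{2} \cdot e^{-1} = e \approx 2.72$, rather than being shifted by an additive amount.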
70 Translating the kernels in log-frequency allows the same set of fundamental features to participate in a range of acoustic events: the same vocalizations at different pitch are often represented by the same set of features. [sent-159, score-0.631]
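A short numerical check illustrates why a log-frequency axis makes one kernel reusable across pitches: the harmonics of a fundamental $f_0$ sit at $\log_2 k + \log_2 f_0$, so changing $f_0$ shifts the whole pattern without changing its shape. The helper name and the two example fundamentals are assumptions for this sketch.

```python
import numpy as np

def harmonic_log_positions(f0, n_harmonics=5):
    """Positions of the first harmonics of f0 on a log2-frequency axis."""
    return np.log2(f0 * np.arange(1, n_harmonics + 1))

low = harmonic_log_positions(110.0)    # hypothetical low-pitched voice
high = harmonic_log_positions(220.0)   # the same pattern one octave up

# The offsets between harmonics are identical; only the origin moves.
same_shape = np.allclose(low - low[0], high - high[0])
shift_in_octaves = high - low
```

On a linear frequency axis the same change of pitch would stretch the spacing between harmonics, so a fixed kernel could not be reused by translation alone.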
71 4, the same set of kernels is used in a similar configuration across different speakers and genders. [sent-161, score-0.245]
72 It should be noted that the second-layer representation does not discard precise time and frequency information (this information is carried in the times and frequencies of the second-layer spikes). [sent-162, score-0.336]
73 However, the identities of the features that are active remain invariant to pitch and frequency modulations. [sent-163, score-0.255]
74 Synthesis One can further understand the acoustic information that is captured by second-layer spikes by sampling a spikegram according to the generative model. [sent-164, score-1.02]
75 5 middle) and sampled two spikegrams: one with only the hierarchical component (left), and one that included both hierarchical and coupling components (right). [sent-166, score-0.276]
76 At a coarse scale the two samples closely resemble the spikegram of the original sound. [sent-167, score-0.481]
77 However, at the fine time scale, only the spikegram sampled with coupling contains the regularities observed in speech data (Fig. [sent-168, score-0.746]
78 Sounds were also generated from these spikegram samples by superimposing gammatone kernels as in [5]. [sent-170, score-0.724]
79 Despite the fact that the second- [Figure 5 panel titles: Second layer (176 spikes); Hierarchical (2741 spikes); Data (2544 spikes); Coupling + Hierarchical (2358 spikes). Axis: freq (log Hz).] [sent-171, score-0.851]
80 Middle bottom: spikegram representation of the sentence in Fig. [sent-174, score-0.487]
81 1; Middle top: Inferred second-layer representation; Left: first-layer spikes generated using only the hierarchical model component; Right: first-layer spikes generated using hierarchical and coupling kernels. [sent-175, score-1.032]
82 [Table 1 header residue: input SNR columns -10dB, -5dB, 0dB, 5dB, 10dB; noise type: sparse temporally modulated noise; methods: Wiener, wav thr, MP, HSC. Cell values were lost in extraction.] [sent-197, score-0.287]
83 Table 1: Denoising accuracy (dB SNR) for speech corrupted with white noise (left) or with sparse, temporally modulated noise (right). [sent-217, score-0.423]
84 layer representation contains over 15 times fewer spikes than the first-layer spikegrams, the synthesized sounds are intelligible, and the addition of the coupling filters provides a noticeable improvement (audio examples in supplementary materials). [sent-218, score-0.911]
85 We incorporate the HSC model directly into this denoising algorithm by replacing the fixed probability of spiking at the first layer with the rate specified by the second layer. [sent-222, score-0.398]
86 Since neither the first- nor second-layer spike code for the noisy signal is known, we first infer the first and then the second layer using MAP estimation, and then recompute the first layer given both the data and second layer. [sent-223, score-0.377]
87 To the extent that the parameters learned by HSC reflect statistical properties of the signal, incorporating the more sophisticated spikegram prior into a denoising algorithm should allow us to better distinguish signal from noise. [sent-225, score-0.678]
88 We tested this by denoising speech waveforms (held out during model training) that have been corrupted by additive white Gaussian noise. [sent-226, score-0.461]
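The evaluation setup for this experiment can be sketched as below: corrupt a clean signal with white Gaussian noise at a chosen input SNR, then score an estimate in dB SNR. The helper names, the stand-in sine-wave signal, and the seed are assumptions; no denoiser is included, so the snippet only shows how the input and output SNR figures would be computed.

```python
import numpy as np

def snr_db(clean, estimate):
    """Signal-to-noise ratio of an estimate relative to the clean signal, in dB."""
    err = estimate - clean
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(err ** 2))

def add_white_noise(clean, target_snr_db, rng):
    """Add white Gaussian noise scaled so the result has exactly the target SNR."""
    noise = rng.standard_normal(len(clean))
    scale = np.sqrt(np.sum(clean ** 2)
                    / (np.sum(noise ** 2) * 10.0 ** (target_snr_db / 10.0)))
    return clean + scale * noise

rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = np.sin(2 * np.pi * 220.0 * t)      # stand-in for a speech waveform
noisy = add_white_noise(clean, 5.0, rng)
input_snr = snr_db(clean, noisy)
```

A denoiser is then judged by how far `snr_db(clean, denoised)` rises above `input_snr` at each noise level.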
89 HSC-based denoising is able to outperform standard methods, as well as matching pursuit denoising (Table 1 left). [sent-228, score-0.504]
90 To test more rigorously the benefit of a structured prior, we evaluated denoising performance on signals corrupted with non-stationary noise whose power is correlated over time. [sent-230, score-0.292]
91 We generated sparse temporally modulated noise by scaling white Gaussian noise with a temporally smooth envelope (given as a convolution of a Gaussian function with st. [sent-235, score-0.38]
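One way to produce noise of this kind is sketched below. The source sentence is cut off before it gives the Gaussian width or what the Gaussian is convolved with, so everything specific here is a guess made for illustration: the envelope is built by convolving a Gaussian bump with a sparse random impulse train, and `sigma_s`, `events_per_s`, and the 16 kHz sampling rate are placeholder values.

```python
import numpy as np

def modulated_noise(n, sr=16000, sigma_s=0.02, events_per_s=10.0, seed=0):
    """White Gaussian noise scaled by a smooth, sparse envelope.
    All parameter values are placeholders, not taken from the paper."""
    rng = np.random.default_rng(seed)
    # Sparse impulse train marking where bursts of noise occur (assumed form).
    impulses = (rng.random(n) < events_per_s / sr).astype(float)
    # Gaussian bump truncated at +/- 4 standard deviations.
    half = int(4 * sigma_s * sr)
    t = np.arange(-half, half + 1) / sr
    bump = np.exp(-0.5 * (t / sigma_s) ** 2)
    envelope = np.convolve(impulses, bump, mode="same")
    return envelope * rng.standard_normal(n), envelope

noise, envelope = modulated_noise(16000)
```

Unlike stationary white noise, this noise arrives in bursts whose power is correlated over time, which is the structure the experiment uses to separate a structured prior from a fixed one.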
92 The reconstruction SNR does not fully convey the manner in which different algorithms handle noise: perceptually, we find that the sounds denoised by the hierarchical model sound more similar to the original (audio examples in supplementary materials). [sent-241, score-0.52]
93 6 Discussion We developed a hierarchical spike code model that captures complex structure in sounds. [sent-242, score-0.257]
94 Our work builds on the spikegram representation of [5], thus avoiding the limitations arising from spectrogram-based methods, and makes a number of novel contributions. [sent-243, score-0.453]
95 Unlike previous work [3, 4], the learned kernels are shiftable in both time and log-frequency, which enables the model to learn time- and frequency-relative patterns and use a small number of kernels efficiently to represent a wide variety of sound features. [sent-244, score-0.663]
96 In addition, the model describes acoustic structure on multiple scales (via a hierarchical component and a recurrent component), which capture fundamentally different kinds of statistical regularities. [sent-245, score-0.392]
97 Applying the model to complex natural sounds (speech), we demonstrated that it can learn nontrivial features, and we have shown how these features can be composed to form basic acoustic units. [sent-249, score-0.527]
98 The framework provides a general methodology for learning higher-order features of sounds, and we expect that it will prove useful in representing other structured sounds such as music, animal vocalizations, or ambient natural sounds. [sent-251, score-0.253]
99 Godsill, “Sparse linear regression with structured priors and application to denoising of musical audio,” Audio, Speech, and Language Processing, IEEE Transactions on, vol. [sent-258, score-0.206]
100 Dahlgren, “Darpa timit acoustic phonetic continuous speech corpus cdrom,” 1993. [sent-328, score-0.436]
wordName wordTfidf (topN-words)
[('spikegram', 0.398), ('spikes', 0.378), ('freq', 0.251), ('acoustic', 0.244), ('sounds', 0.224), ('kernels', 0.215), ('denoising', 0.206), ('sound', 0.15), ('spike', 0.148), ('frequency', 0.146), ('speech', 0.128), ('hsc', 0.126), ('coupling', 0.118), ('auditory', 0.117), ('gammatone', 0.111), ('spikegrams', 0.105), ('regularities', 0.102), ('layer', 0.098), ('octaves', 0.092), ('frequencies', 0.088), ('events', 0.083), ('coarse', 0.083), ('pitch', 0.08), ('hierarchical', 0.079), ('modulated', 0.071), ('audio', 0.07), ('sec', 0.069), ('recurrent', 0.069), ('coding', 0.065), ('timit', 0.064), ('pressure', 0.064), ('vocalizations', 0.063), ('phone', 0.058), ('waveform', 0.056), ('wiener', 0.056), ('center', 0.056), ('hz', 0.056), ('pursuit', 0.056), ('ao', 0.055), ('waveforms', 0.055), ('representation', 0.055), ('harmonic', 0.054), ('noise', 0.054), ('rate', 0.054), ('outlined', 0.052), ('lewicki', 0.051), ('spectrograms', 0.048), ('precise', 0.047), ('amplitudes', 0.046), ('logarithm', 0.045), ('br', 0.045), ('amplitude', 0.044), ('temporally', 0.044), ('temporal', 0.042), ('consonant', 0.042), ('daudet', 0.042), ('modulations', 0.042), ('onsets', 0.042), ('shiftable', 0.042), ('thr', 0.042), ('learned', 0.041), ('offsets', 0.041), ('ring', 0.041), ('spiking', 0.04), ('white', 0.04), ('resolution', 0.039), ('unobserved', 0.039), ('sparse', 0.039), ('synthesized', 0.038), ('wav', 0.037), ('denoised', 0.037), ('phones', 0.037), ('matching', 0.036), ('wavelet', 0.036), ('courant', 0.034), ('utterance', 0.034), ('spectrogram', 0.034), ('magni', 0.034), ('sentence', 0.034), ('convolution', 0.034), ('ba', 0.033), ('materials', 0.033), ('channel', 0.033), ('signal', 0.033), ('colored', 0.032), ('corrupted', 0.032), ('poisson', 0.031), ('positioned', 0.03), ('convey', 0.03), ('mallat', 0.03), ('speakers', 0.03), ('complex', 0.03), ('hughes', 0.029), ('artifacts', 0.029), ('dot', 0.029), ('features', 0.029), ('stacks', 0.028), ('mp', 0.028), ('coef', 0.028), ('convolved', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 150 nips-2012-Hierarchical spike coding of sound
2 0.22306231 356 nips-2012-Unsupervised Structure Discovery for Semantic Analysis of Audio
Author: Sourish Chaudhuri, Bhiksha Raj
Abstract: Approaches to audio classification and retrieval tasks largely rely on detection-based discriminative models. We submit that such models make a simplistic assumption in mapping acoustics directly to semantics, whereas the actual process is likely more complex. We present a generative model that maps acoustics in a hierarchical manner to increasingly higher-level semantics. Our model has two layers with the first layer modeling generalized sound units with no clear semantic associations, while the second layer models local patterns over these sound units. We evaluate our model on a large-scale retrieval task from TRECVID 2011, and report significant improvements over standard baselines.
3 0.19617154 341 nips-2012-The topographic unsupervised learning of natural sounds in the auditory cortex
Author: Hiroki Terashima, Masato Okada
Abstract: The computational modelling of the primary auditory cortex (A1) has been less fruitful than that of the primary visual cortex (V1) due to the less organized properties of A1. Greater disorder has recently been demonstrated for the tonotopy of A1 that has traditionally been considered to be as ordered as the retinotopy of V1. This disorder appears to be incongruous, given the uniformity of the neocortex; however, we hypothesized that both A1 and V1 would adopt an efficient coding strategy and that the disorder in A1 reflects natural sound statistics. To provide a computational model of the tonotopic disorder in A1, we used a model that was originally proposed for the smooth V1 map. In contrast to natural images, natural sounds exhibit distant correlations, which were learned and reflected in the disordered map. The auditory model predicted harmonic relationships among neighbouring A1 cells; furthermore, the same mechanism used to model V1 complex cells reproduced nonlinear responses similar to the pitch selectivity. These results contribute to the understanding of the sensory cortices of different modalities in a novel and integrated manner.
4 0.17419466 347 nips-2012-Towards a learning-theoretic analysis of spike-timing dependent plasticity
Author: David Balduzzi, Michel Besserve
Abstract: This paper suggests a learning-theoretic perspective on how synaptic plasticity benefits global brain functioning. We introduce a model, the selectron, that (i) arises as the fast time constant limit of leaky integrate-and-fire neurons equipped with spiking timing dependent plasticity (STDP) and (ii) is amenable to theoretical analysis. We show that the selectron encodes reward estimates into spikes and that an error bound on spikes is controlled by a spiking margin and the sum of synaptic weights. Moreover, the efficacy of spikes (their usefulness to other reward maximizing selectrons) also depends on total synaptic strength. Finally, based on our analysis, we propose a regularized version of STDP, and show the regularization improves the robustness of neuronal learning when faced with multiple stimuli.
5 0.15921924 159 nips-2012-Image Denoising and Inpainting with Deep Neural Networks
Author: Junyuan Xie, Linli Xu, Enhong Chen
Abstract: We present a novel approach to low-level vision problems that combines sparse coding and deep networks pre-trained with denoising auto-encoder (DA). We propose an alternative training scheme that successfully adapts DA, originally designed for unsupervised feature learning, to the tasks of image denoising and blind inpainting. Our method’s performance in the image denoising task is comparable to that of KSVD which is a widely used sparse coding technique. More importantly, in blind image inpainting task, the proposed method provides solutions to some complex problems that have not been tackled before. Specifically, we can automatically remove complex patterns like superimposed text from an image, rather than simple patterns like pixels missing at random. Moreover, the proposed method does not need the information regarding the region that requires inpainting to be given a priori. Experimental results demonstrate the effectiveness of the proposed method in the tasks of image denoising and blind inpainting. We also show that our new training scheme for DA is more effective and can improve the performance of unsupervised feature learning.
6 0.15755408 239 nips-2012-Neuronal Spike Generation Mechanism as an Oversampling, Noise-shaping A-to-D converter
7 0.12477428 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models
8 0.12379528 190 nips-2012-Learning optimal spike-based representations
9 0.11579494 112 nips-2012-Efficient Spike-Coding with Multiplicative Adaptation in a Spike Response Model
10 0.10424665 72 nips-2012-Cocktail Party Processing via Structured Prediction
11 0.10385366 270 nips-2012-Phoneme Classification using Constrained Variational Gaussian Process Dynamical System
12 0.093305416 197 nips-2012-Learning with Recursive Perceptual Representations
13 0.085303135 284 nips-2012-Q-MKL: Matrix-induced Regularization in Multi-Kernel Learning with Applications to Neuroimaging
14 0.077957027 73 nips-2012-Coding efficiency and detectability of rate fluctuations with non-Poisson neuronal firing
15 0.076855376 158 nips-2012-ImageNet Classification with Deep Convolutional Neural Networks
16 0.075481251 138 nips-2012-Fully Bayesian inference for neural models with negative-binomial spiking
17 0.070596591 188 nips-2012-Learning from Distributions via Support Measure Machines
18 0.069898017 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video
19 0.068545267 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes
20 0.062107291 264 nips-2012-Optimal kernel choice for large-scale two-sample tests
topicId topicWeight
[(0, 0.147), (1, 0.055), (2, -0.15), (3, 0.103), (4, -0.01), (5, 0.189), (6, -0.008), (7, 0.049), (8, -0.041), (9, 0.006), (10, 0.013), (11, 0.042), (12, 0.027), (13, 0.041), (14, 0.016), (15, -0.061), (16, -0.006), (17, 0.014), (18, 0.031), (19, -0.095), (20, 0.011), (21, -0.029), (22, -0.053), (23, -0.164), (24, 0.04), (25, 0.084), (26, 0.014), (27, -0.16), (28, -0.093), (29, 0.049), (30, 0.174), (31, -0.024), (32, 0.045), (33, -0.055), (34, -0.116), (35, -0.026), (36, 0.152), (37, 0.074), (38, 0.116), (39, 0.022), (40, 0.096), (41, 0.165), (42, -0.126), (43, -0.037), (44, -0.036), (45, 0.073), (46, 0.002), (47, 0.095), (48, -0.001), (49, -0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.94856852 150 nips-2012-Hierarchical spike coding of sound
2 0.76543754 356 nips-2012-Unsupervised Structure Discovery for Semantic Analysis of Audio
Author: Sourish Chaudhuri, Bhiksha Raj
Abstract: Approaches to audio classification and retrieval tasks largely rely on detection-based discriminative models. We submit that such models make a simplistic assumption in mapping acoustics directly to semantics, whereas the actual process is likely more complex. We present a generative model that maps acoustics in a hierarchical manner to increasingly higher-level semantics. Our model has two layers, with the first layer modeling generalized sound units with no clear semantic associations, while the second layer models local patterns over these sound units. We evaluate our model on a large-scale retrieval task from TRECVID 2011, and report significant improvements over standard baselines.
3 0.59017706 341 nips-2012-The topographic unsupervised learning of natural sounds in the auditory cortex
Author: Hiroki Terashima, Masato Okada
Abstract: The computational modelling of the primary auditory cortex (A1) has been less fruitful than that of the primary visual cortex (V1) due to the less organized properties of A1. Greater disorder has recently been demonstrated for the tonotopy of A1 that has traditionally been considered to be as ordered as the retinotopy of V1. This disorder appears to be incongruous, given the uniformity of the neocortex; however, we hypothesized that both A1 and V1 would adopt an efficient coding strategy and that the disorder in A1 reflects natural sound statistics. To provide a computational model of the tonotopic disorder in A1, we used a model that was originally proposed for the smooth V1 map. In contrast to natural images, natural sounds exhibit distant correlations, which were learned and reflected in the disordered map. The auditory model predicted harmonic relationships among neighbouring A1 cells; furthermore, the same mechanism used to model V1 complex cells reproduced nonlinear responses similar to the pitch selectivity. These results contribute to the understanding of the sensory cortices of different modalities in a novel and integrated manner.
4 0.58913159 239 nips-2012-Neuronal Spike Generation Mechanism as an Oversampling, Noise-shaping A-to-D converter
Author: Dmitri B. Chklovskii, Daniel Soudry
Abstract: We test the hypothesis that the neuronal spike generation mechanism is an analog-to-digital (AD) converter encoding rectified low-pass filtered summed synaptic currents into a spike train linearly decodable in postsynaptic neurons. Faithful encoding of an analog waveform by a binary signal requires that the spike generation mechanism has a sampling rate exceeding the Nyquist rate of the analog signal. Such oversampling is consistent with the experimental observation that the precision of the spike-generation mechanism is an order of magnitude greater than the cut-off frequency of low-pass filtering in dendrites. Additional improvement in the coding accuracy may be achieved by noise-shaping, a technique used in signal processing. If noise-shaping were used in neurons, it would reduce coding error relative to a Poisson spike generator for frequencies below Nyquist by introducing correlations into spike times. By using experimental data from three different classes of neurons, we demonstrate that biological neurons utilize noise-shaping. Therefore, the spike-generation mechanism can be viewed as an oversampling and noise-shaping AD converter. The nature of the neural spike code remains a central problem in neuroscience [1-3]. In particular, no consensus exists on whether information is encoded in firing rates [4, 5] or individual spike timing [6, 7]. On the single-neuron level, evidence exists to support both points of view. On the one hand, post-synaptic currents are low-pass-filtered by dendrites with a cut-off frequency of approximately 30Hz [8], Figure 1B, providing ammunition for the firing rate camp: if the signal reaching the soma is slowly varying, why would precise spike timing be necessary? On the other hand, the ability of the spike-generation mechanism to encode harmonics of the injected current up to about 300Hz [9, 10], Figure 1B, points at its exquisite temporal precision [11].
Yet, in view of the slow variation of the somatic current, such precision may seem gratuitous and puzzling. The timescale mismatch between gradual variation of the somatic current and high precision of spike generation has been addressed previously. Existing explanations often rely on the population nature of the neural code [10, 12]. Although this is a distinct possibility, the question remains whether invoking population coding is necessary. Other possible explanations for the timescale mismatch include the possibility that some synaptic currents (for example, GABAergic) may be generated by synapses proximal to the soma and therefore not subject to low-pass filtering, or that the high-frequency harmonics are so strong in the pre-synaptic spike that, despite attenuation, their trace is still present. Although these explanations could apply in some cases, for the majority of synaptic inputs to typical neurons there is a glaring mismatch. The perceived mismatch between the time scales of somatic currents and the spike-generation mechanism can be resolved naturally if one views spike trains as digitally encoding analog somatic currents [13-15], Figure 1A. Although somatic currents vary slowly, the information that could be communicated by their analog amplitude far exceeds that of binary signals, such as all-or-none spikes, of the same sampling rate. Therefore, faithful digital encoding requires the sampling rate of the digital signal to be much higher than the cut-off frequency of the analog signal, so-called oversampling. Although the spike generation mechanism operates in continuous time, the high temporal precision of the spike-generation mechanism may be viewed as a manifestation of oversampling, which is needed for the digital encoding of the analog signal.
Therefore, the extra order of magnitude in temporal precision available to the spike-generation mechanism relative to somatic current, Figure 1B, is necessary to faithfully encode the amplitude of the analog signal, thus potentially reconciling the firing rate and the spike timing points of view [13-15].
Figure 1. Hybrid digital-analog operation of neuronal circuits. A. Post-synaptic currents are low-pass filtered and summed in dendrites (black) to produce a somatic current (blue). This analog signal is converted by the spike generation mechanism into a sequence of all-or-none spikes (green), a digital signal. Spikes propagate along an axon and are chemically transduced across synapses (gray) into post-synaptic currents (black), whose amplitude reflects synaptic weights, thus converting the digital signal back to analog. B. Frequency response function for dendrites (blue, adapted from [8]) and for the spike generation mechanism (green, adapted from [9]). Note the one order of magnitude gap between the cut-off frequencies. C. Amplitude of the summed postsynaptic currents depends strongly on spike timing. If the blue spike arrives just 5ms later, as shown in red, the EPSCs sum to a value already 20% less. Therefore, the extra precision of the digital signal may be used to communicate the amplitude of the analog signal.
In signal processing, efficient AD conversion combines the principle of oversampling with that of noise-shaping, which utilizes correlations in the digital signal to allow more accurate encoding of the analog amplitude. This is exemplified by a family of AD converters called ΔΣ modulators [16], of which the basic one is analogous to an integrate-and-fire (IF) neuron [13-15]. The analogy between the basic ΔΣ modulator and the IF neuron led to the suggestion that neurons also use noise-shaping to encode the incoming analog current waveform in the digital spike train [13].
However, the hypothesis of noise-shaping AD conversion has never been tested experimentally in biological neurons. In this paper, by analyzing existing experimental datasets, we demonstrate that noise-shaping is present in three different classes of neurons from vertebrates and invertebrates. This lends support to the view that neurons act as oversampling and noise-shaping AD converters and accounts for the mismatch between the slowly varying somatic currents and precise spike timing. Moreover, we show that the degree of noise-shaping in biological neurons exceeds that used by basic ΔΣ modulators or IF neurons and propose viewing more complicated models in the noise-shaping framework. This paper is organized as follows: We review the principles of oversampling and noise-shaping in Section 2. In Section 3, we present experimental evidence for noise-shaping AD conversion in neurons. In Section 4 we argue that rectification of somatic currents may improve energy efficiency and/or implement de-noising.
2. Oversampling and noise-shaping in AD converters
To understand how oversampling can lead to more accurate encoding of the analog signal amplitude in a digital form, we first consider a Poisson spike encoder, whose rate of spiking is modulated by the signal amplitude, Figure 2A. Such an AD converter samples an analog signal at discrete time points and generates a spike with a probability given by the (normalized) signal amplitude. Because of the binary nature of spike trains, the resulting spike train encodes the signal with a large error even when the sampling is done at the Nyquist rate, i.e. the lowest rate for alias-free sampling. To reduce the encoding error, a Poisson encoder can sample at frequencies, $f_s$, higher than Nyquist, $f_N$; hence the term oversampling, Figure 2B. When combined with decoding by low-pass filtering (down to Nyquist) on the receiving end, this leads to a reduction of the error, which can be estimated as follows.
The number of samples over a Nyquist half-period ($1/(2f_N)$) is given by the oversampling ratio: $R = f_s/f_N$. As the normalized signal amplitude, $x$, stays roughly constant over the Nyquist half-period, it can be encoded by spikes generated with a fixed probability, $x$. For a Poisson process the variance in the number of spikes is equal to the mean, $\sigma_N^2 = \langle N \rangle = xR$. Therefore, the mean relative error of the signal decoded by averaging over the Nyquist half-period is $\sigma_N / \langle N \rangle = (xR)^{-1/2} \propto R^{-1/2}$, (1) indicating that oversampling reduces transmission error. However, the weak dependence of the error on the oversampling frequency indicates diminishing returns on the investment in oversampling and motivates one to search for other ways to lower the error.
Figure 2. Oversampling and noise-shaping in AD conversion. A. Analog somatic current (blue) and its digital code (green). The difference between the green and the blue curves is encoding error. B. Digital output of oversampling Poisson encoder over one Nyquist half-period. C. Error power spectrum of a Nyquist (dark green) and oversampled (light green) Poisson encoder. Although the total error power is the same, the fraction surviving low-pass filtering during decoding (solid green) is smaller in the oversampled case. D. Basic ΔΣ modulator. E. Signal at the output of the integrator. F. Digital output of the ΔΣ modulator over one Nyquist period. G. Error power spectrum of the ΔΣ modulator (brown) is shifted to higher frequencies and low-pass filtered during decoding. The remaining error power (solid brown) is smaller than for the Poisson encoder.
To reduce encoding error beyond the ½ power of the oversampling ratio, the principle of noise-shaping was put forward [17]. To illustrate noise-shaping consider a basic AD converter called the ΔΣ modulator [18], Figure 2D. In the basic ΔΣ modulator, the previous quantized signal is fed back and subtracted from the incoming signal and then the difference is integrated in time.
Rather than quantizing the input signal, as would be done in the Poisson encoder, the ΔΣ modulator quantizes the integral of the difference between the incoming analog signal and the previous quantized signal, Figure 2F. One can see that, in the oversampling regime, the quantization error of the basic ΔΣ modulator is significantly less than that of the Poisson encoder. As the variance in the number of spikes over the Nyquist period is less than one, the mean relative error of the signal is at most $1/(xR) \propto R^{-1}$, which is better than the Poisson encoder. To gain additional insight and understand the origin of the term noise-shaping, we repeat the above analysis in the Fourier domain. First, the Poisson encoder has a flat power spectrum up to the sampling frequency, Figure 2C. Oversampling preserves the total error power but extends the frequency range, resulting in lower error power below Nyquist. Second, a more detailed analysis of the basic ΔΣ modulator, where the dynamics is linearized by replacing the quantization device with a random noise injection [19], shows that the quantization noise is effectively differentiated. Taking the derivative in time is equivalent to multiplying the power spectrum of the quantization noise by frequency squared. Such reduction of noise power at low frequencies is an example of noise shaping, Figure 2G. Under the additional assumption of white quantization noise, such analysis yields a mean relative error $\propto R^{-3/2}$, (2) which for $R \gg 1$ is significantly better performance than for the Poisson encoder, Eq.(1). As mentioned previously, the basic ΔΣ modulator, Figure 2D, in the continuous-time regime is nothing other than an IF neuron [13, 20, 21]. In the IF neuron, quantization is implemented by the spike generation mechanism and the negative feedback corresponds to the after-spike reset. Note that resetting the integrator to zero is strictly equivalent to subtraction only for continuous-time operation.
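The difference between the two encoders can be illustrated with a small simulation. The following is a minimal sketch in Python, not the authors' code: the function names, the Bernoulli stand-in for the Poisson encoder, the unit threshold, and the random initial integrator state are all assumptions made for illustration. It encodes a constant amplitude x over R samples, decodes by averaging the spikes, and compares root-mean-square errors. The ΔΣ encoder subtracts the threshold after a spike instead of resetting to zero, which is the appropriate form in discrete time.

```python
import math
import random


def poisson_encode(x, R, rng):
    """R binary samples; each is a spike with probability x.

    A Bernoulli draw per sample is used here as a stand-in for the
    Poisson encoder (an assumption of this sketch).
    """
    return [1 if rng.random() < x else 0 for _ in range(R)]


def delta_sigma_encode(x, R, v0=0.0, threshold=1.0):
    """First-order delta-sigma encoder for a constant input x in [0, 1].

    The integrator accumulates the input; when it reaches the threshold
    a spike is emitted and the threshold is subtracted (not reset).
    """
    v = v0
    spikes = []
    for _ in range(R):
        v += x
        if v >= threshold:
            spikes.append(1)
            v -= threshold
        else:
            spikes.append(0)
    return spikes


def rms_errors(x, R, trials=2000, seed=0):
    """RMS error of the averaged spike train for both encoders."""
    rng = random.Random(seed)
    sq_poisson = 0.0
    sq_ds = 0.0
    for _ in range(trials):
        est_p = sum(poisson_encode(x, R, rng)) / R
        # Random initial integrator state (hypothetical choice) so the
        # delta-sigma error is averaged over starting phases.
        est_ds = sum(delta_sigma_encode(x, R, v0=rng.random())) / R
        sq_poisson += (est_p - x) ** 2
        sq_ds += (est_ds - x) ** 2
    return math.sqrt(sq_poisson / trials), math.sqrt(sq_ds / trials)


if __name__ == "__main__":
    for R in (10, 100, 1000):
        p, ds = rms_errors(0.3, R, trials=500)
        print(f"R={R:5d}  Poisson rms={p:.4f}  delta-sigma rms={ds:.4f}")
```

Under these assumptions the ΔΣ spike count differs from xR by less than one spike in every trial, so its decoding error stays below 1/R, whereas the Poisson-style error shrinks only as the inverse square root of R.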
In discrete-time computer simulations, the integrator value may exceed the threshold, and, therefore, subtraction of the threshold value rather than reset must be used. Next, motivated by the ΔΣ-IF analogy, we look for the signs of noise-shaping AD conversion in real neurons.
3. Experimental evidence of noise-shaping AD conversion in real neurons
In order to determine whether noise-shaping AD conversion takes place in biological neurons, we analyzed three experimental datasets, where spike trains were generated by time-varying somatic currents: 1) rat somatosensory cortex L5 pyramidal neurons [9], 2) mouse olfactory mitral cells [22, 23], and 3) fruit fly olfactory receptor neurons [24]. In the first two datasets, the current was injected through an electrode in whole-cell patch clamp mode, while in the third, the recording was extracellular and the intrinsic somatic current could be measured because the glial compartment included only one active neuron. Testing the noise-shaping AD conversion hypothesis is complicated by the fact that encoded and decoded signals are hard to measure accurately. First, as somatic current is rectified by the spike-generation mechanism, only its super-threshold component can be encoded faithfully, making it hard to know exactly what is being encoded. Second, decoding in the dendrites is not accessible in these single-neuron recordings. In view of these difficulties, we start by simply computing the power spectrum of the reconstruction error obtained by subtracting a scaled and shifted, but otherwise unaltered, spike train from the somatic current. The scaling factor was determined by the total weight of the decoding linear filter and the shift was optimized to maximize information capacity, see below. At frequencies below 20Hz the error contains significantly lower power than the input signal, Figure 3, indicating that the spike generation mechanism may be viewed as an AD converter.
Furthermore, the error power spectrum of the biological neuron is below that of the Poisson encoder, thus indicating the presence of noise-shaping. For dataset 3 we also plot the error power spectrum of the IF neuron, the threshold of which is chosen to generate the same number of spikes as the biological neuron.
Figure 3. Evidence of noise-shaping. Power spectra (spectral power, a.u., vs. frequency [Hz]) of the somatic current (blue), the difference between the somatic current and the digital spike train of the biological neuron (black), of the Poisson encoder (green) and of the IF neuron (red). Left: dataset 1, right: dataset 3.
Although the simple analysis presented above indicates noise-shaping, subtracting the spike train from the input signal, Figure 3, does not accurately quantify the error when decoding involves additional filtering. An example of such additional encoding/decoding is predictive coding, which will be discussed below [25]. To take such a decoding filter into account, we computed a decoded waveform by convolving the spike train with the optimal linear filter, which predicts the somatic current from the spike train with the least mean squared error. Our linear decoding analysis lends additional support to the noise-shaping AD conversion hypothesis [13-15]. First, the optimal linear filter shape is similar to unitary post-synaptic currents, Figure 4B, thus supporting the view that dendrites reconstruct the somatic current of the presynaptic neuron by low-pass filtering the spike train in accordance with the noise-shaping principle [13]. Second, we found that linear decoding using an optimal filter accounts for 60-80% of the somatic current variance. Naturally, such prediction works better for neurons in suprathreshold regime, i.e.
with high firing rates, an issue to which we return in Section 4. To avoid complications associated with rectification, for now we focused on neurons which were in suprathreshold regime by monitoring that the relationship between predicted and actual current is close to linear.
Figure 4. Linear decoding of experimentally recorded spike trains. A. Waveform of somatic current (blue), resulting spike train (black), and the linearly decoded waveform (red) from dataset 1. B. Top: Optimal linear filter for the trace in A, is representative of other datasets as well. Bottom: Typical EPSPs have a shape similar to the decoding filter (adapted from [26]). C-D. Power spectra (spectral power, a.u., vs. frequency [Hz]) of the somatic current (blue), the decoding error of the biological neuron (black), the Poisson encoder (green), and IF neuron (red) for dataset 1 (C) and dataset 3 (D).
Next, we analyzed the spectral distribution of the reconstruction error calculated by subtracting the decoded spike train, i.e. convolved with the computed optimal linear filter, from the somatic current. We found that at low frequencies the error power is significantly lower than in the input signal, Figure 4C,D. This observation confirms that signals below the dendritic cut-off frequency of 20-30Hz can be efficiently communicated using spike trains. To quantify the effect of noise-shaping we computed information capacity of different encoders: $C = \sum_f \log_2 \left[ S(f)/N(f) \right]$, where S(f) and N(f) are the power spectra of the somatic current and encoding error correspondingly and the sum is computed only over the frequencies for which S(f) > N(f).
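The capacity comparison can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes the capacity is the sum of log2[S(f)/N(f)] over frequency bins with S(f) > N(f), consistent with the restriction stated above, and it omits any normalization the authors may have applied. The function name and the toy spectra are hypothetical.

```python
import math


def information_capacity(signal_power, error_power):
    """Sum of log2(S/N) over frequency bins where S > N.

    signal_power and error_power are sequences of power-spectrum values
    sampled on the same frequency grid. Bins where the error power is
    non-positive or not smaller than the signal power contribute nothing.
    """
    if len(signal_power) != len(error_power):
        raise ValueError("spectra must be sampled on the same frequency grid")
    total = 0.0
    for s, n in zip(signal_power, error_power):
        if s > n > 0:
            total += math.log2(s / n)
    return total


if __name__ == "__main__":
    # Toy spectra: the error is below the signal in the first two bins only.
    S = [8.0, 4.0, 1.0]
    N = [1.0, 1.0, 2.0]
    print(information_capacity(S, N))
```

An encoder whose error spectrum lies further below the signal spectrum at low frequencies obtains a larger value, which is the sense in which stronger noise-shaping raises capacity.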
Because the plots in Figure 4C,D use a semi-logarithmic scale, the information capacity can be estimated from the area between the somatic current (blue) power spectrum and an error power spectrum. We find that the biological spike generation mechanism has higher information capacity than the Poisson encoder and IF neurons. Therefore, neurons act as AD converters with stronger noise-shaping than IF neurons. We now return to the predictive nature of the spike generation mechanism. Given the causal nature of the spike generation mechanism it is surprising that the optimal filters for all three datasets carry most of their weight following a spike, Figure 4B. This indicates that the spike generation mechanism is capable of making predictions, which are possible in these experiments because somatic currents are temporally correlated. We note that these observations make delay-free reconstruction of the signal possible, thus allowing fast operation of neural circuits [27]. The predictive nature of the encoder can be captured by a ΔΣ modulator embedded in a predictive coding feedback loop [28], Figure 5A. We verified by simulation that such a nested architecture generates a similar optimal linear filter with most of its weight in the time following a spike, Figure 5A right. Of course, such prediction is only possible for correlated inputs, implying that the shape of the optimal linear filter depends on the statistics of the inputs. The role of predictive coding is to reduce the dynamic range of the signal that enters the ΔΣ modulator, thus avoiding overloading. A possible biological implementation for such integrating feedback could be Ca2+ concentration and Ca2+-dependent potassium channels [25, 29].
Figure 5. Enhanced ΔΣ modulators. A. ΔΣ modulator combined with a predictive coder. In such a device, the optimal decoding filter computed for correlated inputs has most of its weight following a spike, similar to experimental measurements, Figure 4B. B.
Second-order ΔΣ modulator possesses stronger noise-shaping properties. Because such a circuit contains an internal state variable it generates a non-periodic spike train in response to a constant input. Bottom trace shows a typical result of a simulation. Black: spikes, blue: input current.
4. Possible reasons for current rectification: energy efficiency and de-noising
We have shown that at high firing rates biological neurons encode somatic current into a linearly decodable spike train. However, at low firing rates linear decoding cannot faithfully reproduce the somatic current because of rectification in the spike generation mechanism. If the objective of spike generation is faithful AD conversion, why would such rectification exist? We see two potential reasons: energy efficiency and de-noising. It is widely believed that minimizing metabolic costs is an important consideration in brain design and operation [30, 31]. Moreover, spikes are known to consume a significant fraction of the metabolic budget [30, 32], placing a premium on their total number. Thus, we can postulate that neuronal spike trains find a trade-off between the mean squared error in the decoded spike train relative to the input signal and the total number of spikes, as expressed by the following cost function over a time interval T: $E = \sum_{t \in T} \left[ \left( x_t - (w * s)_t \right)^2 + \lambda s_t \right]$, (3) where $x$ is the analog input signal, $s$ is the binary spike sequence composed of zeros and ones, $w$ is the linear filter, and $\lambda$ sets the cost per spike. To demonstrate how solving Eq.(3) would lead to thresholding, let us consider a simplified version taken over a Nyquist period, during which the input signal stays constant: $E = (x - N)^2 + \lambda N$, (4) where $N$ is the number of spikes and $x$ and $\lambda$ are normalized by $w$. Minimizing such a cost function reduces to choosing the lowest lying parabola for a given $x$, Figure 6A. Therefore, thresholding is a natural outcome of minimizing a cost function combining the decoding error and the energy cost, Eq.(3).
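The thresholding argument can be checked numerically. The sketch below assumes the simplified cost over one Nyquist period takes the form (x - N)^2 + lam * N, with x the constant input, N the spike count, and lam a hypothetical per-spike cost, both expressed in units of the filter weight. The names and the exact form are assumptions for illustration and are not necessarily the authors' Eq. (4).

```python
def spike_cost(x, n_spikes, lam):
    """Squared decoding error plus a linear cost per spike (assumed form)."""
    return (x - n_spikes) ** 2 + lam * n_spikes


def best_spike_count(x, lam, n_max=10):
    """Spike count in 0..n_max whose parabola lies lowest at input x."""
    return min(range(n_max + 1), key=lambda n: spike_cost(x, n, lam))


def firing_threshold(lam):
    """Input at which one spike starts to beat zero spikes.

    Solves x**2 = (x - 1)**2 + lam for x.
    """
    return (1.0 + lam) / 2.0


if __name__ == "__main__":
    lam = 0.5
    for x in (0.2, 0.7, 0.8, 1.9):
        print(x, best_spike_count(x, lam))
    print("threshold:", firing_threshold(lam))
```

With a positive spike cost the optimal count remains zero until the input crosses the threshold, so the energy term by itself produces rectification in this toy setting.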
In addition to energy efficiency, there may be a computational reason for thresholding somatic current in neurons. To illustrate this point, we note that the cost function in Eq. (3) for continuous variables, $s_t$, may be viewed as a non-negative version of the L1-norm regularized linear regression called LASSO [33], which is commonly used for de-noising of sparse and Laplacian signals [34]. Such a cost function can be minimized by iteratively applying a gradient descent step and a shrinkage step [35], which is equivalent to thresholding (one-sided in the case of non-negative variables), Figure 6B,C. Therefore, neurons may be encoding a de-noised input signal.
Figure 6. Possible reasons for rectification in neurons. A. Cost function combining encoding error squared with metabolic expense vs. input signal for different values of the spike number N, Eq.(4). Note that the optimal number of spikes jumps from zero to one as a function of input. B. Estimating the most probable “clean” signal value for a continuous non-negative Laplacian signal and Gaussian noise, Eq.(3) (while setting w = 1). The parabolas (red) illustrate the quadratic log-likelihood term in (3) for different values of the measurement, s, while the linear function (blue) reflects the linear log-prior term in (3). C. The minimum of the combined cost function in B is at zero if s is below threshold, and grows linearly with s above threshold.
5. Discussion
In this paper, we demonstrated that the neuronal spike-generation mechanism can be viewed as an oversampling and noise-shaping AD converter, which encodes a rectified low-pass filtered somatic current as a digital spike train. Rectification by the spike generation mechanism may subserve both energy efficiency and de-noising. As the degree of noise-shaping in biological neurons exceeds that in IF neurons, or basic ΔΣ modulators, we suggest that neurons should be modeled by more advanced ΔΣ modulators, e.g. Figure 5B. Interestingly, ΔΣ modulators can also be viewed as coders with error prediction feedback [19].
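The one-sided shrinkage step mentioned above for the non-negative LASSO view of Eq. (3) can be made concrete in the scalar case. The following Python sketch assumes w = 1 and a cost of the form 0.5 * (s - x)^2 + lam * x over x >= 0, where s is the noisy measurement, x the estimated clean value, and lam a hypothetical regularization weight. The factor 0.5 is a convention chosen here so that the threshold equals lam, and the brute-force search is included only to check the closed form.

```python
def shrink_nonneg(s, lam):
    """Minimiser of 0.5 * (s - x)**2 + lam * x over x >= 0.

    This is one-sided soft thresholding: zero below lam, linear above.
    """
    return max(s - lam, 0.0)


def brute_force_min(s, lam, step=1e-3, x_max=5.0):
    """Grid search over x in [0, x_max], used only as a sanity check."""
    best_x = 0.0
    best_cost = 0.5 * s * s  # cost at x = 0
    n_steps = int(x_max / step)
    for i in range(1, n_steps + 1):
        x = i * step
        cost = 0.5 * (s - x) ** 2 + lam * x
        if cost < best_cost:
            best_x, best_cost = x, cost
    return best_x


if __name__ == "__main__":
    lam = 0.5
    for s in (0.3, 0.5, 1.0, 2.0):
        print(s, shrink_nonneg(s, lam), round(brute_force_min(s, lam), 3))
```

The estimate is exactly zero for measurements below the threshold and grows linearly above it, which is the rectifying input-output relation sketched in Figure 6C.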
Many publications have studied various aspects of spike generation in neurons, yet we believe that the framework [13-15] we adopt is different, and we discuss its relationship to some of these studies. Our framework is different from previous proposals to cast neurons as predictors [36, 37] because a different quantity is being predicted. The possibility of perfect decoding from a spike train with infinite temporal precision has been proven in [38]. Here, we are concerned with the more practical issue of how reconstruction error scales with the oversampling ratio. Also, we consider linear decoding, which sets our work apart from [39]. Finally, previous experiments addressing noise-shaping [40] studied the power spectrum of the spike train rather than that of the encoding error. Our work is aimed at understanding biological and computational principles of spike generation and decoding and is not meant as a substitute for the existing phenomenological spike-generation models [41], which allow efficient fitting of parameters and prediction of spike trains [42]. Yet, the theoretical framework [13-15] we adopt may assist in building better models of spike generation for a given somatic current waveform. First, having interpreted spike generation as AD conversion, we can draw on the rich experience in signal processing to attack the problem. Second, this framework suggests a natural metric to compare the performance of different spike generation models in the high firing rate regime: a mean squared error between the injected current waveform and the filtered version of the spike train produced by a model, provided the total number of spikes is the same as in the experimental data. The AD conversion framework adds justification to the previously proposed spike distance obtained by subtracting low-pass filtered spike trains [43].
As the framework [13-15] we adopt relies on viewing neuronal computation as an analog-digital hybrid, which requires AD and DA conversion at every step, one may wonder about the reason for such a hybrid scheme. Starting with the early days of computers, the analog mode is known to be advantageous for computation. For example, performing addition of many variables in one step is possible in the analog mode simply by Kirchhoff's law, but would require hundreds of logical gates in the digital mode [44]. However, the analog mode is vulnerable to noise build-up over many stages of computation and is inferior in precisely communicating information over long distances under a limited energy budget [30, 31]. While early analog computers were displaced by their digital counterparts, evolution combined analog and digital modes into a computational hybrid [44], thus necessitating efficient AD and DA conversion, which was the focus of the present study.
We are grateful to L. Abbott, S. Druckmann, D. Golomb, T. Hu, J. Magee, N. Spruston, B. Theilman for helpful discussions and comments on the manuscript, to X.-J. Wang, D. McCormick, K. Nagel, R. Wilson, K. Padmanabhan, N. Urban, S. Tripathy, H. Koendgen, and M. Giugliano for sharing their data. The work of D.S. was partially supported by the Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI).
References
1. Ferster, D. and N. Spruston, Cracking the neural code. Science, 1995. 270: p. 756-7.
2. Panzeri, S., et al., Sensory neural codes using multiplexed temporal scales. Trends Neurosci, 2010. 33(3): p. 111-20.
3. Stevens, C.F. and A. Zador, Neural coding: The enigma of the brain. Curr Biol, 1995. 5(12): p. 1370-1.
4. Shadlen, M.N. and W.T. Newsome, The variable discharge of cortical neurons: implications for connectivity, computation, and information coding. J Neurosci, 1998. 18(10): p. 3870-96.
5. Shadlen, M.N. and W.T. Newsome, Noise, neural codes and cortical organization.
Curr Opin Neurobiol, 1994. 4(4): p. 569-79.
6. Singer, W. and C.M. Gray, Visual feature integration and the temporal correlation hypothesis. Annu Rev Neurosci, 1995. 18: p. 555-86.
7. Meister, M., Multineuronal codes in retinal signaling. Proc Natl Acad Sci U S A, 1996. 93(2): p. 609-14.
8. Cook, E.P., et al., Dendrite-to-soma input/output function of continuous time-varying signals in hippocampal CA1 pyramidal neurons. J Neurophysiol, 2007. 98(5): p. 2943-55.
9. Kondgen, H., et al., The dynamical response properties of neocortical neurons to temporally modulated noisy inputs in vitro. Cereb Cortex, 2008. 18(9): p. 2086-97.
10. Tchumatchenko, T., et al., Ultrafast population encoding by cortical neurons. J Neurosci, 2011. 31(34): p. 12171-9.
11. Mainen, Z.F. and T.J. Sejnowski, Reliability of spike timing in neocortical neurons. Science, 1995. 268(5216): p. 1503-6.
12. Mar, D.J., et al., Noise shaping in populations of coupled model neurons. Proc Natl Acad Sci U S A, 1999. 96(18): p. 10450-5.
13. Shin, J., Adaptive noise shaping neural spike encoding and decoding. Neurocomputing, 2001. 38-40: p. 369-381.
14. Shin, J., The noise shaping neural coding hypothesis: a brief history and physiological implications. Neurocomputing, 2002. 44: p. 167-175.
15. Shin, J.H., Adaptation in spiking neurons based on the noise shaping neural coding hypothesis. Neural Networks, 2001. 14(6-7): p. 907-919.
16. Schreier, R. and G.C. Temes, Understanding delta-sigma data converters. 2005, Piscataway, NJ: IEEE Press, Wiley. xii, 446 p.
17. Candy, J.C., A use of limit cycle oscillations to obtain robust analog-to-digital converters. IEEE Trans. Commun, 1974. COM-22: p. 298-305.
18. Inose, H., Y. Yasuda, and J. Murakami, A telemetering system by code modulation: Δ-Σ modulation. IRE Trans. Space Elect. Telemetry, 1962. SET-8: p. 204-209.
19. Spang, H.A. and P.M. Schultheiss, Reduction of quantizing noise by use of feedback. IRE Trans. Commun. Sys., 1962: p. 373-380.
20. Hovin, M., et al., Delta-Sigma modulation in single neurons, in IEEE International Symposium on Circuits and Systems. 2002.
21. Cheung, K.F. and P.Y.H. Tang, Sigma-Delta Modulation Neural Networks. Proc. IEEE Int Conf Neural Networks, 1993: p. 489-493.
22. Padmanabhan, K. and N. Urban, Intrinsic biophysical diversity decorrelates neuronal firing while increasing information content. Nat Neurosci, 2010. 13: p. 1276-82.
23. Urban, N. and S. Tripathy, Neuroscience: Circuits drive cell diversity. Nature, 2012. 488(7411): p. 289-90.
24. Nagel, K.I. and R.I. Wilson, personal communication.
25. Shin, J., C. Koch, and R. Douglas, Adaptive neural coding dependent on the time-varying statistics of the somatic input current. Neural Comp, 1999. 11: p. 1893-913.
26. Magee, J.C. and E.P. Cook, Somatic EPSP amplitude is independent of synapse location in hippocampal pyramidal neurons. Nat Neurosci, 2000. 3(9): p. 895-903.
27. Thorpe, S., D. Fize, and C. Marlot, Speed of processing in the human visual system. Nature, 1996. 381(6582): p. 520-2.
28. Tewksbury, S.K. and R.W. Hallock, Oversampled, linear predictive and noise-shaping coders of order N>1. IEEE Trans Circuits & Sys, 1978. CAS25: p. 436-47.
29. Wang, X.J., et al., Adaptation and temporal decorrelation by single neurons in the primary visual cortex. J Neurophysiol, 2003. 89(6): p. 3279-93.
30. Attwell, D. and S.B. Laughlin, An energy budget for signaling in the grey matter of the brain. J Cereb Blood Flow Metab, 2001. 21(10): p. 1133-45.
31. Laughlin, S.B. and T.J. Sejnowski, Communication in neuronal networks. Science, 2003. 301(5641): p. 1870-4.
32. Lennie, P., The cost of cortical computation. Curr Biol, 2003. 13(6): p. 493-7.
33. Tibshirani, R., Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society Series B-Methodological, 1996. 58(1): p. 267-288.
34. Chen, S.S.B., D.L. Donoho, and M.A. Saunders, Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 1998. 20(1): p. 33-61.
35. Elad, M., et al., Wide-angle view at iterated shrinkage algorithms. P Soc Photo-Opt Ins, 2007. 6701: p. 70102.
36. Deneve, S., Bayesian spiking neurons I: inference. Neural Comp, 2008. 20: p. 91.
37. Yu, A.J., Optimal Change-Detection and Spiking Neurons, in NIPS, B. Scholkopf, J. Platt, and T. Hofmann, Editors. 2006.
38. Lazar, A. and L. Toth, Perfect Recovery and Sensitivity Analysis of Time Encoded Bandlimited Signals. IEEE Transactions on Circuits and Systems, 2004. 51(10).
39. Pfister, J.P., P. Dayan, and M. Lengyel, Synapses with short-term plasticity are optimal estimators of presynaptic membrane potentials. Nat Neurosci, 2010. 13(10): p. 1271-5.
40. Chacron, M.J., et al., Experimental and theoretical demonstration of noise shaping by interspike interval correlations. Fluctuations and Noise in Biological, Biophysical, and Biomedical Systems III, 2005. 5841: p. 150-163.
41. Pillow, J., Likelihood-based approaches to modeling the neural code, in Bayesian Brain: Probabilistic Approaches to Neural Coding, K. Doya, et al., Editors. 2007, MIT Press.
42. Jolivet, R., et al., A benchmark test for a quantitative assessment of simple neuron models. J Neurosci Methods, 2008. 169(2): p. 417-24.
43. van Rossum, M.C., A novel spike distance. Neural Comput, 2001. 13(4): p. 751-63.
44. Sarpeshkar, R., Analog versus digital: extrapolating from electronics to neurobiology. Neural Computation, 1998. 10(7): p. 1601-38.
5 0.51716334 270 nips-2012-Phoneme Classification using Constrained Variational Gaussian Process Dynamical System
Author: Hyunsin Park, Sungrack Yun, Sanghyuk Park, Jongmin Kim, Chang D. Yoo
Abstract: For phoneme classification, this paper describes an acoustic model based on the variational Gaussian process dynamical system (VGPDS). The nonlinear and nonparametric acoustic model is adopted to overcome the limitations of classical hidden Markov models (HMMs) in modeling speech. The Gaussian process priors on the dynamics and emission functions enable the complex dynamic structure and long-range dependency of speech to be represented better than by an HMM. In addition, a variance constraint in the VGPDS is introduced to eliminate the sparse approximation error in the kernel matrix. The effectiveness of the proposed model is demonstrated with three experimental results, including parameter estimation and classification performance, on synthetic and benchmark datasets.
6 0.51227272 72 nips-2012-Cocktail Party Processing via Structured Prediction
7 0.51113671 112 nips-2012-Efficient Spike-Coding with Multiplicative Adaptation in a Spike Response Model
8 0.45620382 347 nips-2012-Towards a learning-theoretic analysis of spike-timing dependent plasticity
9 0.43600243 73 nips-2012-Coding efficiency and detectability of rate fluctuations with non-Poisson neuronal firing
10 0.43212929 159 nips-2012-Image Denoising and Inpainting with Deep Neural Networks
11 0.42770961 190 nips-2012-Learning optimal spike-based representations
12 0.39450729 39 nips-2012-Analog readout for optical reservoir computers
13 0.37005112 362 nips-2012-Waveform Driven Plasticity in BiFeO3 Memristive Devices: Model and Implementation
14 0.3654983 235 nips-2012-Natural Images, Gaussian Mixtures and Dead Leaves
15 0.36126012 224 nips-2012-Multi-scale Hyper-time Hardware Emulation of Human Motor Nervous System Based on Spiking Neurons using FPGA
16 0.35727054 219 nips-2012-Modelling Reciprocating Relationships with Hawkes Processes
17 0.35189024 43 nips-2012-Approximate Message Passing with Consistent Parameter Estimation and Applications to Sparse Learning
18 0.34862056 322 nips-2012-Spiking and saturating dendrites differentially expand single neuron computation capacity
19 0.34664598 289 nips-2012-Recognizing Activities by Attribute Dynamics
20 0.34599954 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models
topicId topicWeight
[(0, 0.093), (21, 0.033), (38, 0.097), (42, 0.026), (44, 0.286), (54, 0.025), (55, 0.041), (74, 0.031), (76, 0.089), (80, 0.058), (87, 0.053), (92, 0.057), (94, 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.76851732 150 nips-2012-Hierarchical spike coding of sound
Author: Yan Karklin, Chaitanya Ekanadham, Eero P. Simoncelli
Abstract: Natural sounds exhibit complex statistical regularities at multiple scales. Acoustic events underlying speech, for example, are characterized by precise temporal and frequency relationships, but they can also vary substantially according to the pitch, duration, and other high-level properties of speech production. Learning this structure from data while capturing the inherent variability is an important first step in building auditory processing systems, as well as understanding the mechanisms of auditory perception. Here we develop Hierarchical Spike Coding, a two-layer probabilistic generative model for complex acoustic structure. The first layer consists of a sparse spiking representation that encodes the sound using kernels positioned precisely in time and frequency. Patterns in the positions of first layer spikes are learned from the data: on a coarse scale, statistical regularities are encoded by a second-layer spiking representation, while fine-scale structure is captured by recurrent interactions within the first layer. When fit to speech data, the second layer acoustic features include harmonic stacks, sweeps, frequency modulations, and precise temporal onsets, which can be composed to represent complex acoustic events. Unlike spectrogram-based methods, the model gives a probability distribution over sound pressure waveforms. This allows us to use the second-layer representation to synthesize sounds directly, and to perform model-based denoising, on which we demonstrate a significant improvement over standard methods.
2 0.69177568 193 nips-2012-Learning to Align from Scratch
Author: Gary Huang, Marwan Mattar, Honglak Lee, Erik G. Learned-Miller
Abstract: Unsupervised joint alignment of images has been demonstrated to improve performance on recognition tasks such as face verification. Such alignment reduces undesired variability due to factors such as pose, while only requiring weak supervision in the form of poorly aligned examples. However, prior work on unsupervised alignment of complex, real-world images has required the careful selection of feature representation based on hand-crafted image descriptors, in order to achieve an appropriate, smooth optimization landscape. In this paper, we instead propose a novel combination of unsupervised joint alignment with unsupervised feature learning. Specifically, we incorporate deep learning into the congealing alignment framework. Through deep learning, we obtain features that can represent the image at differing resolutions based on network depth, and that are tuned to the statistics of the specific data being aligned. In addition, we modify the learning algorithm for the restricted Boltzmann machine by incorporating a group sparsity penalty, leading to a topographic organization of the learned filters and improving subsequent alignment results. We apply our method to the Labeled Faces in the Wild database (LFW). Using the aligned images produced by our proposed unsupervised algorithm, we achieve higher accuracy in face verification compared to prior work in both unsupervised and supervised alignment. We also match the accuracy of the best available commercial method.
3 0.64621991 81 nips-2012-Context-Sensitive Decision Forests for Object Detection
Author: Peter Kontschieder, Samuel R. Bulò, Antonio Criminisi, Pushmeet Kohli, Marcello Pelillo, Horst Bischof
Abstract: In this paper we introduce Context-Sensitive Decision Forests, a new perspective for exploiting contextual information in the popular decision forest framework for the object detection problem. They are tree-structured classifiers with the ability to access intermediate prediction (here: classification and regression) information during training and inference time. This intermediate prediction is available for each sample and allows us to develop context-based decision criteria, used for refining the prediction process. In addition, we introduce a novel split criterion which, in combination with a priority-based way of constructing the trees, allows more accurate regression mode selection and hence improves the current context information. In our experiments, we demonstrate improved results for the task of pedestrian detection on the challenging TUD data set when compared to state-of-the-art methods.
1 Introduction and Related Work
In recent years, the random forest framework [1, 6] has become a very popular and powerful tool for classification and regression problems by exhibiting many appealing properties like inherent multi-class capability, robustness to label noise and reduced tendencies to overfitting [7]. Random forests are considered to be close to an ideal learner [13], making them attractive in many areas of computer vision like image classification [5, 17], clustering [19], regression [8] or semantic segmentation [24, 15, 18]. In this work we show how the decision forest algorithm can be extended to include contextual information during learning and inference for classification and regression problems. We focus on applying random forests to object detection, i.e. the problem of localizing multiple instances of a given object class in a test image.
This task has been previously addressed in random forests [9], where the trees were modified to learn a mapping between the appearance of an image patch and its relative position to the object category centroid (i.e. center voting information). During inference, the resulting Hough Forest not only performs classification on test samples but also casts probabilistic votes in a generalized Hough-voting space [3] that is subsequently used to obtain object center hypotheses. Since then, a series of applications such as tracking and action recognition [10], body-joint position estimation [12] and multi-class object detection [22] have been presented. However, Hough Forests typically produce non-distinctive object hypotheses in the Hough space and hence there is the need to perform non-maximum suppression (NMS) for obtaining the final results. While this has been addressed in [4, 26], another shortcoming is that standard (Hough) forests treat samples in a completely independent way, i.e. there is no mechanism that encourages the classifier to perform consistent predictions. In this work we propose that context information can be used to overcome the aforementioned problems. For example, training data for visual learning is often represented by images in the form of a (regular) pixel grid topology, i.e. objects appearing in natural images can often be found in a specific context. The importance of contextual information was already highlighted in the 1980s with a pioneering work on relaxation labelling [14] and a later work with focus on inference tasks [20] that addressed the issue of learning within the same framework.
Figure 1: Top row: Training image, label image, visualization of priority-based growing of tree (the lower, the earlier the consideration during training). Bottom row: Inverted Hough image using [9] and breadth-first training after 6 levels ($2^6 = 64$ nodes), Inverted Hough image after growing 64 nodes using our priority queue, Inverted Hough image using priority queue shows distinctive peaks at the end of training.
More recently, contextual information has been used in the field of object class segmentation [21], however, mostly for high-level reasoning in random field models or to resolve contradicting segmentation results. The introduction of contextual information as additional features in low-level classifiers was initially proposed in the Auto-context [25] and Semantic Texton Forest [24] models. Auto-context shows a general approach for classifier boosting by iteratively learning from appearance and context information. In this line of research, [18] augmented the feature space for an Entanglement Random Forest with a classification feature that is consequently refined by the class posterior distributions according to the progress of the trained subtree. The training procedure is allowed to perform tests for specific, contextual label configurations, which was demonstrated to significantly improve the segmentation results. However, the ...
In this paper we present Context-Sensitive Decision Forests, a novel and unified interpretation of Hough Forests in light of contextual sensitivity. Our work is inspired by Auto-Context and Entanglement Forests, but instead of providing only posterior classification results from an earlier level of the classifier construction during learning and testing, we additionally provide regression (voting) information as it is used in Hough Forests. The second core contribution of our work is related to how we grow the trees: Instead of training them in a depth- or breadth-first way, we propose a priority-based construction (which could actually consider depth- or breadth-first as particular cases). The priority is determined by the current training error, i.e. we first grow the parts of the tree where we experience higher error.
To this end, we introduce a unified splitting criterion that estimates the joint error of classification and regression. The consequences of using our priority-based training are illustrated in Figure 1: Given the training image with corresponding label image (top row, images 1 and 2), the tree first tries to learn the foreground samples as shown in the color-coded plot (top row, image 3, colors correspond to index number of nodes in the tree). The effects on the intermediate prediction quality are shown in the bottom row for the regression case: The first image shows the regression quality after training a tree with 6 levels ($2^6 = 64$ nodes) in a breadth-first way while the second image shows the progress after growing 64 nodes according to the priority-based training. Clearly, the modes for the center hypotheses are more distinctive, which in turn yields more accurate intermediate regression information that can be used for further tree construction. Our third contribution is a new family of split functions that allows learning from training images containing multiple training instances, as shown for the pedestrians in the example. We introduce a test that checks the centroid compatibility for pairs of training samples taken from the context, based on the intermediate classification and regression derived as described before. To assess our contributions, we performed several experiments on the challenging TUD pedestrian data set [2], yielding a significant improvement of 9% in the recall at 90% precision rate in comparison to standard Hough Forests, when learning from crowded pedestrian images.
2 Context-Sensitive Decision Trees
This section introduces the general idea behind the context-sensitive decision forest without references to specific applications. Only in Section 3 do we show a particular application to the problem of object detection.
After showing some basic notational conventions that are used in the paper, we provide a section that revisits the random forest framework for classification and regression tasks from a joint perspective, i.e. a theory that allows us to consider e.g. [1, 11] and [9] in a unified way. Starting from this general view we finally introduce the context-sensitive forests in 2.2.
Notations. In the paper we denote vectors using boldface lowercase (e.g. $\mathbf{d}$, $\mathbf{u}$, $\mathbf{v}$) and sets by using uppercase calligraphic (e.g. $\mathcal{X}$, $\mathcal{Y}$) symbols. The sets of real, natural and integer numbers are denoted with $\mathbb{R}$, $\mathbb{N}$ and $\mathbb{Z}$ as usual. We denote by $2^{\mathcal{X}}$ the power set of $\mathcal{X}$ and by $\mathbf{1}[P]$ the indicator function returning 1 or 0 according to whether the proposition $P$ is true or false. Moreover, with $\mathcal{P}(\mathcal{Y})$ we denote the set of probability distributions having $\mathcal{Y}$ as sample space and we implicitly assume that some $\sigma$-algebra is defined on $\mathcal{Y}$. We denote by $\delta(x)$ the Dirac delta function. Finally, $\mathbb{E}_{x \sim Q}[f(x)]$ denotes the expectation of $f(x)$ with respect to $x$ sampled according to distribution $Q$.
2.1 Random Decision Forests for joint classification and regression
A (binary) decision tree is a tree-structured predictor$^1$ where, starting from the root, a sample is routed until it reaches a leaf where the prediction takes place. At each internal node of the tree the decision is taken whether the sample should be forwarded to the left or right child, according to a binary-valued function. In formal terms, let $\mathcal{X}$ denote the input space, let $\mathcal{Y}$ denote the output space and let $\mathcal{T}^{dt}$ be the set of decision trees. In its simplest form a decision tree consists of a single node (a leaf) and is parametrized by a probability distribution $Q \in \mathcal{P}(\mathcal{Y})$ which represents the posterior probability of elements in $\mathcal{Y}$ given any data sample reaching the leaf. We denote this (admittedly rudimentary) tree as $\mathrm{LF}(Q) \in \mathcal{T}^{dt}$. Otherwise, a decision tree consists of a node with a left and a right sub-tree.
This node is parametrized by a split function $\phi : \mathcal{X} \to \{0, 1\}$, which determines whether to route a data sample $x \in \mathcal{X}$ reaching it to the left decision sub-tree $t_l \in \mathcal{T}^{dt}$ (if $\phi(x) = 0$) or to the right one $t_r \in \mathcal{T}^{dt}$ (if $\phi(x) = 1$). We denote such a tree as $\mathrm{ND}(\phi, t_l, t_r) \in \mathcal{T}^{dt}$. Finally, a decision forest is an ensemble $\mathcal{F} \subseteq \mathcal{T}^{dt}$ of decision trees which makes a prediction about a data sample by averaging over the single predictions gathered from all trees.
Inference. Given a decision tree $t \in \mathcal{T}^{dt}$, the associated posterior probability of each element in $\mathcal{Y}$ given a sample $x \in \mathcal{X}$ is determined by finding the probability distribution $Q$ parametrizing the leaf that is reached by $x$ when routed along the tree. This is compactly presented with the following definition of $P(y \mid x, t)$, which is inductive in the structure of $t$:
$$P(y \mid x, t) = \begin{cases} Q(y) & \text{if } t = \mathrm{LF}(Q) \\ P(y \mid x, t_l) & \text{if } t = \mathrm{ND}(\phi, t_l, t_r) \text{ and } \phi(x) = 0 \\ P(y \mid x, t_r) & \text{if } t = \mathrm{ND}(\phi, t_l, t_r) \text{ and } \phi(x) = 1 . \end{cases} \tag{1}$$
Finally, the combination of the posterior probabilities derived from the trees in a forest $\mathcal{F} \subseteq \mathcal{T}^{dt}$ can be done by an averaging operation [6], yielding a single posterior probability for the whole forest:
$$P(y \mid x, \mathcal{F}) = \frac{1}{|\mathcal{F}|} \sum_{t \in \mathcal{F}} P(y \mid x, t) . \tag{2}$$
Randomized training. A random forest is created by training a set of random decision trees independently on random subsets of the training data $\mathcal{D} \subseteq \mathcal{X} \times \mathcal{Y}$. The training procedure for a single decision tree heuristically optimizes a set of parameters like the tree structure, the split functions at the internal nodes and the density estimates at the leaves in order to reduce the prediction error on the training data. In order to prevent overfitting problems, the search space of possible split functions is limited to a random set and a minimum number of training samples is required to grow a leaf node. During the training procedure, each new node is fed with a set of training samples $\mathcal{Z} \subseteq \mathcal{D}$.
If some stopping condition holds, depending on $\mathcal{Z}$, the node becomes a leaf and a density on $\mathcal{Y}$ is estimated based on $\mathcal{Z}$. Otherwise, an internal node is grown and a split function is selected from a pool of random ones so as to minimize some sort of training error on $\mathcal{Z}$. The selected split function induces a partition of $\mathcal{Z}$ into two sets, which in turn become the left and right children of the current node, where the training procedure is continued.
$^1$ We use the term predictor because we will jointly consider classification and regression.
We will now write this training procedure in more formal terms. To this end we introduce a function $\pi(\mathcal{Z}) \in \mathcal{P}(\mathcal{Y})$ providing a density on $\mathcal{Y}$ estimated from the training data $\mathcal{Z} \subseteq \mathcal{D}$ and a loss function $L(\mathcal{Z} \mid Q) \in \mathbb{R}$ penalizing wrong predictions on the training samples in $\mathcal{Z}$, when predictions are given according to a distribution $Q \in \mathcal{P}(\mathcal{Y})$. The loss function $L$ can be further decomposed in terms of a loss function $\ell(\cdot \mid Q) : \mathcal{Y} \to \mathbb{R}$ acting on each sample of the training set:
$$L(\mathcal{Z} \mid Q) = \sum_{(x, y) \in \mathcal{Z}} \ell(y \mid Q) . \tag{3}$$
Also, let $\Phi(\mathcal{Z})$ be a set of split functions randomly generated for a training set $\mathcal{Z}$ and, given a split function $\phi \in \Phi(\mathcal{Z})$, we denote by $\mathcal{Z}_l^{\phi}$ and $\mathcal{Z}_r^{\phi}$ the sets identified by splitting $\mathcal{Z}$ according to $\phi$, i.e.
$$\mathcal{Z}_l^{\phi} = \{(x, y) \in \mathcal{Z} : \phi(x) = 0\} \quad \text{and} \quad \mathcal{Z}_r^{\phi} = \{(x, y) \in \mathcal{Z} : \phi(x) = 1\} .$$
We can now summarize the training procedure in terms of a recursive function $g : 2^{\mathcal{X} \times \mathcal{Y}} \to \mathcal{T}^{dt}$, which generates a random decision tree from a training set given as argument:
$$g(\mathcal{Z}) = \begin{cases} \mathrm{LF}(\pi(\mathcal{Z})) & \text{if some stopping condition holds} \\ \mathrm{ND}\left(\phi, g(\mathcal{Z}_l^{\phi}), g(\mathcal{Z}_r^{\phi})\right) & \text{otherwise} . \end{cases} \tag{4}$$
Here, we determine the optimal split function $\phi$ in the pool $\Phi(\mathcal{Z})$ as the one minimizing the loss we incur as a result of the node split:
$$\phi \in \arg\min \left\{ L(\mathcal{Z}_l^{\phi'}) + L(\mathcal{Z}_r^{\phi'}) : \phi' \in \Phi(\mathcal{Z}) \right\} \tag{5}$$
where we compactly write $L(\mathcal{Z})$ for $L(\mathcal{Z} \mid \pi(\mathcal{Z}))$, i.e. the loss on $\mathcal{Z}$ obtained with predictions driven by $\pi(\mathcal{Z})$. A typical split function selection criterion commonly adopted for classification and regression is information gain.
The equivalent counterpart in terms of loss can be obtained by using a log-loss, i.e. $\ell(y \mid Q) = -\log(Q(y))$. A further widely used criterion is based on Gini impurity, which can be expressed in this setting by using $\ell(y \mid Q) = 1 - Q(y)$. Finally, the stopping condition that is used in (4) to determine whether to create a leaf or to continue branching the tree typically consists in checking whether $|\mathcal{Z}|$, i.e. the number of training samples at the node, or the loss $L(\mathcal{Z})$ is below some given threshold, or whether a maximum depth is reached.
2.2 Context-sensitive decision forests
A context-sensitive (CS) decision tree is a decision tree in which split functions are enriched with the ability of testing contextual information of a sample, before taking a decision about where to route it. We generate contextual information at each node of a decision tree by exploiting a truncated version of the same tree as a predictor. This idea is shared with [18]; however, we introduce some novelties by tackling both classification and regression problems in a joint manner and by leaving a wider flexibility in the tree truncation procedure. We denote the set of CS decision trees as $\mathcal{T}$. The main differences characterizing a CS decision tree $t \in \mathcal{T}$ compared with a standard decision tree are the following: a) every node (leaves and internal nodes) of $t$ has an associated probability distribution $Q \in \mathcal{P}(\mathcal{Y})$ representing the posterior probability of an element in $\mathcal{Y}$ given any data sample reaching it; b) internal nodes are indexed with distinct natural numbers $n \in \mathbb{N}$ in a way to preserve the property that children nodes have a larger index compared to their parent node; c) the split function at each internal node, denoted by $\varphi(\cdot \mid t') : \mathcal{X} \to \{0, 1\}$, is bound to a CS decision tree $t' \in \mathcal{T}$, which is a truncated version of $t$ and can be used to compute intermediate, contextual information.
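As a point of reference for the CS notation that follows, the standard machinery of Section 2.1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Leaf`, `Node`, `posterior`, `forest_posterior`, `pi`, `total_loss` and `best_split` are hypothetical, inputs are scalars for brevity, and the density estimate $\pi(\mathcal{Z})$ is assumed to be the empirical label histogram. The sketch covers the inference rule (1), the forest average (2), the decomposed loss (3) with the log-loss and Gini per-sample losses, and the split selection (5).

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, Sequence, Tuple, Union

Dist = Dict[Hashable, float]     # a distribution Q over the output space Y
Sample = Tuple[float, Hashable]  # a training pair (x, y); x is a scalar here


@dataclass
class Leaf:
    """LF(Q): a single leaf parametrized by a distribution Q."""
    q: Dist


@dataclass
class Node:
    """ND(phi, t_l, t_r): an internal node with a binary split function."""
    phi: Callable[[float], int]
    left: "Tree"
    right: "Tree"


Tree = Union[Leaf, Node]


def posterior(y: Hashable, x: float, t: Tree) -> float:
    """Eq. (1): route x down the tree and read Q(y) at the leaf it reaches."""
    while isinstance(t, Node):
        t = t.left if t.phi(x) == 0 else t.right
    return t.q.get(y, 0.0)


def forest_posterior(y: Hashable, x: float, forest: Sequence[Tree]) -> float:
    """Eq. (2): average the per-tree posteriors."""
    return sum(posterior(y, x, t) for t in forest) / len(forest)


def pi(z: Sequence[Sample]) -> Dist:
    """Assumed density estimate pi(Z): the empirical label histogram."""
    counts = Counter(y for _, y in z)
    return {y: c / len(z) for y, c in counts.items()}


def log_loss(y: Hashable, q: Dist) -> float:
    """Per-sample loss l(y | Q) = -log Q(y)."""
    return -math.log(q[y])


def gini_loss(y: Hashable, q: Dist) -> float:
    """Per-sample loss l(y | Q) = 1 - Q(y)."""
    return 1.0 - q[y]


def total_loss(z: Sequence[Sample], loss=log_loss) -> float:
    """Eq. (3) with Q = pi(Z), i.e. the shorthand L(Z)."""
    if not z:
        return 0.0
    q = pi(z)
    return sum(loss(y, q) for _, y in z)


def best_split(z: Sequence[Sample], pool, loss=log_loss):
    """Eq. (5): pick the split in the pool minimizing L(Z_l) + L(Z_r)."""
    def cost(phi) -> float:
        z_left = [s for s in z if phi(s[0]) == 0]
        z_right = [s for s in z if phi(s[0]) == 1]
        return total_loss(z_left, loss) + total_loss(z_right, loss)
    return min(pool, key=cost)
```

For example, with four samples labelled `a, a, b, b` at positions `0.1, 0.2, 0.8, 0.9`, a threshold split at `0.5` separates the labels perfectly and incurs zero loss under either per-sample loss, so `best_split` prefers it over a threshold at `0.15`.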
Similar to Section 2.1 we denote by $\mathrm{LF}(Q) \in \mathcal{T}$ the simplest CS decision tree consisting of a single leaf node parametrized by the distribution $Q$, while we denote by $\mathrm{ND}(n, Q, \varphi, t_l, t_r) \in \mathcal{T}$ the rest of the trees consisting of a node having a left and a right sub-tree, denoted by $t_l, t_r \in \mathcal{T}$ respectively, and being parametrized by the index $n$, a probability distribution $Q$ and the split function $\varphi$ as described above. As shown in Figure 2, the truncation of a CS decision tree at each node is obtained by exploiting the indexing imposed on the internal nodes of the tree.
Figure 2: (a) A CS decision tree $t$. (b) The truncated version $t(<5)$. On the left, we find a CS decision tree $t$, where only the internal nodes are indexed. On the right, we see the truncated version $t(<5)$ of $t$, which is obtained by converting to leaves all nodes having index $\geq 5$ (we marked with colors the corresponding node transformations).
Given a CS decision tree $t \in \mathcal{T}$ and $m \in \mathbb{N}$, we denote by t( < τ ... (8)
where $P_j = P(\cdot \mid (u + h_j, I), t)$, with $j = 1, 2$, are the posterior probabilities obtained from tree $t$ given samples at position $u + h_1$ and $u + h_2$ of image $I$, respectively. Please note that this test should not be confused with the regression split criterion in [9], which tries to partition the training set in a way to group examples with similar voting direction and length. Besides the novel context-sensitive split function we also employ standard split functions performing tests on $\mathcal{X}$ as defined in [24].
$^2$ In the experiments conducted, we never exceeded 10 iterations for finding a mode.
4 Experiments
To assess our proposed approach, we have conducted several experiments on the task of pedestrian detection. Detecting pedestrians is very challenging for Hough-voting based methods, as pedestrians typically exhibit strong articulations of feet and arms, yielding non-distinctive hypotheses in the Hough space.
We evaluated our method on the TUD pedestrian database [2] in two different ways: First, we show our detection results with training according to the standard protocol using 400 training images (where each image contains a single annotation of a pedestrian) and evaluation on the Campus and Crossing scenes, respectively (Section 4.1). With this experiment we show the improvement over state-of-the-art approaches when learning can be performed with simultaneous knowledge about context information. In a second variation (Section 4.2), we use the images of the Crossing scene (201 images) as a training set. Most images of this scene contain more than four persons with strong overlap and mutual occlusions. However, instead of using the original annotation which covers only pedestrians with at least 50% overlap (1008 bounding boxes), we use the more accurate, pixel-wise ground truth annotations of [23] for the entire scene that includes all persons and consists of 1215 bounding boxes. Please note that this annotation is even more detailed than the one presented in [4] with 1018 bounding boxes. The purpose of the second experiment is to show that our context-sensitive forest can exploit the availability of multiple training instances significantly better than the state of the art. The most related work and therefore also the baseline in our experiments is the Hough Forest [9]. To guarantee a fair comparison, we use the same training parameters for [9] and our context-sensitive forest: We trained 20 trees and the training data (including horizontally flipped images) was sampled homogeneously per category per image. The patch size was fixed to 30 × 30 and we performed 1600 node tests for finding the best split function parameters per node. Tree growing was stopped when fewer than 7 samples were available. As image features, we used the first 16 feature channels provided in the publicly available Hough Forest code of [9].
In order to obtain the object detection hypotheses from the Hough space, we use the same non-maximum suppression (NMS) technique in all our experiments as suggested in [9]. To evaluate the obtained hypotheses, we use the standard PASCAL VOC criterion which requires the mutual overlap between ground truth and detected bounding boxes to be ≥ 50%. The additional parameter of (7) was fixed to σ = 7.
4.1 Evaluation using standard protocol training set
The standard training set contains 400 images where each image comes with a single pedestrian annotation. For our experiments, we rescaled the images by a factor of 0.5 and doubled the training image set by also including the horizontally flipped images. We randomly chose 125 training samples per image for foreground and background, resulting in 2 · 400 · 2 · 125 = 200k training samples per tree. For additional comparisons, we provide the results presented in the recent work on joint object detection and segmentation of [23], from which we also provide evaluation results of the Implicit Shape Model (ISM) [16]. However, please note that the results of [23] are based on a different baseline implementation. Moreover, we show the results of [4] when using the provided code and configuration files from the first author's homepage. Unfortunately, we could not reproduce the results of the original paper. First, we discuss the results obtained on the Campus scene. This data set consists of 71 images showing walking pedestrians at severe scale differences and partial occlusions. The ground truth we use has been released with [4] and contains a total number of 314 pedestrians. Figure 3, first row, plot 1 shows the precision-recall curves when using 3 scales (factors 0.3, 0.4, 0.55) for our baseline [9] (blue), results from re-evaluating [4] (cyan, 5 scales), [23] (green) and our Context-Sensitive Forest without and with using the priority queue based tree construction (red/magenta).
When not using the priority queue, we trained the trees in a breadth-first way. We obtain a performance boost of ≈ 6% in recall at a precision of 90% when using both context information and the priority-based construction of our forest. The second plot in the first row of Figure 3 shows the results when the same forests are tested on the Crossing scene, using the more detailed ground truth annotations.
[Figure 3 plots: precision (vertical axis) against recall (horizontal axis); panels: TUD Campus (3 scales), TUD-Crossing (3 scales), TUD Campus (3 scales), TUD Campus (5 scales). Curves: Baseline Hough Forest; Barinova et al. CVPR'10; Proposed Context-Sensitive, No Priority Queue; Proposed Context-Sensitive, With Priority Queue; Riemenschneider et al. ECCV'12; Leibe et al. IJCV'08.]
Figure 3: Precision-Recall Curves for detections. Top row: Standard training (400 images), evaluation on Campus and Crossing (3 scales). Bottom row: Training on Crossing annotations of [23], evaluation on Campus, 3 and 5 scales. Right images: Qualitative examples for Campus (top 2) and Crossing (bottom 2) scenes. (green) correctly found by our method (blue) ground truth (red) wrong association (cyan) missed detection.
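The precision and recall values reported in this section rest on the overlap criterion stated earlier (a detection counts as correct when its overlap with a ground truth box is at least 50%). The following Python sketch shows one common way to compute such a point on a precision-recall curve. It is an illustration under assumptions that the text does not spell out: overlap is taken to be intersection over union, detections are matched greedily in order of decreasing score, each ground truth box may be matched at most once, and the names `iou` and `precision_recall` are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(
    detections: List[Tuple[float, Box]],
    ground_truth: List[Box],
    min_overlap: float = 0.5,
) -> Tuple[float, float]:
    """Greedy matching (assumed): highest-scoring detections claim boxes first."""
    matched = set()
    true_positives = 0
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_index, best_overlap = -1, 0.0
        for j, gt_box in enumerate(ground_truth):
            if j in matched:
                continue  # each ground truth box is matched at most once
            overlap = iou(box, gt_box)
            if overlap > best_overlap:
                best_index, best_overlap = j, overlap
        if best_index >= 0 and best_overlap >= min_overlap:
            matched.add(best_index)
            true_positives += 1
    # Returning 0.0 for empty inputs is a convention chosen for this sketch.
    precision = true_positives / len(detections) if detections else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```

Sweeping a score threshold over the detections and calling `precision_recall` on each thresholded subset would trace out a full curve of the kind shown in Figure 3.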
The Crossing data set shows walking pedestrians (Figure 3, right side, last 2 images) with a smaller variation in scale compared to the Campus scene but with strong mutual occlusions and overlaps. The improvement with respect to the baseline is lower (≈ 2% gain at a precision of 90%) and we find similar developments of the curves. However, this is somewhat expected, as the training data does not properly reflect the occlusions we actually want to model.
4.2 Evaluation on Campus scene using Crossing scene as training set
In our next experiment we trained the forests (same parameters) on the novel annotations of [23] for the Crossing scene. Please note that this reduces the training set to only 201 images (we did not include the flipped images). Qualitative detection results are shown in Figure 3, right side, images 1 and 2. From the first precision-recall curve in the second row of Figure 3 we can see that the margin between the baseline and our proposed method increased clearly (gain of ≈ 9% recall at precision 90%) when evaluating on the same 3 scales. With evaluation on 5 scales (factors 0.34, 0.42, 0.51, 0.65, 0.76) we found a strong increase in the recall, however, at the cost of losing 2-3% of precision below a recall of 60%, as illustrated in the second plot of row 2 in Figure 3. While our method is able to maintain a precision above 90% up to a recall of ≈ 83%, the baseline implementation already drops at a recall of ≈ 20%.
5 Conclusions
In this work we have presented Context-Sensitive Decision Forests with application to the object detection problem. Our new forest has the ability to access intermediate prediction (classification and regression) information about all samples of the training set and can therefore learn from contextual information throughout the growing process. This is in contrast to existing random forest methods used for object detection which typically treat training samples in an independent manner.
Moreover, we have introduced a novel splitting criterion together with a mode isolation technique, which allows us to (a) grow trees in a priority-driven way and (b) install novel context-based test functions that check for mutual object centroid agreements. In our experimental results on pedestrian detection we demonstrated superior performance with respect to state-of-the-art methods and additionally found that our new algorithm can exploit training data containing multiple training objects significantly better.
Acknowledgements. Peter Kontschieder acknowledges financial support of the Austrian Science Fund (FWF) from project 'Fibermorph' with number P22261-N22.
References
[1] Y. Amit and D. Geman. Shape quantization and recognition with randomized trees. Neural Computation, 1997.
[2] M. Andriluka, S. Roth, and B. Schiele. People-tracking-by-detection and people-detection-by-tracking. In CVPR, 2008.
[3] D. H. Ballard. Generalizing the Hough transform to detect arbitrary shapes. Pattern Recognition, 13(2), 1981.
[4] O. Barinova, V. Lempitsky, and P. Kohli. On detection of multiple object instances using Hough transforms. In CVPR, 2010.
[5] A. Bosch, A. Zisserman, and X. Muñoz. Image classification using random forests and ferns. In ICCV, 2007.
[6] L. Breiman. Random forests. Machine Learning, 2001.
[7] A. Criminisi, J. Shotton, and E. Konukoglu. Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Foundations and Trends in Computer Graphics and Vision, volume 7, pages 81–227, 2012.
[8] A. Criminisi, J. Shotton, D. Robertson, and E. Konukoglu. Regression forests for efficient anatomy detection and localization in CT scans. In MICCAI-MCV Workshop, 2010.
[9] J. Gall and V. Lempitsky. Class-specific Hough forests for object detection. In CVPR, 2009.
[10] J. Gall, A. Yao, N. Razavi, L. Van Gool, and V. Lempitsky. Hough forests for object detection, tracking, and action recognition. PAMI, 2011.
[11] P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Machine Learning, 2006.
[12] R. Girshick, J. Shotton, P. Kohli, A. Criminisi, and A. Fitzgibbon. Efficient regression of general-activity human poses from depth images. In ICCV, 2011.
[13] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning. Springer, 2009.
[14] R. A. Hummel and S. W. Zucker. On the foundations of relaxation labeling. PAMI, 5(3):267–287, 1983.
[15] P. Kontschieder, S. Rota Bulò, H. Bischof, and M. Pelillo. Structured class-labels in random forests for semantic image labelling. In ICCV, 2011.
[16] B. Leibe, A. Leonardis, and B. Schiele. Robust object detection with interleaved categorization and segmentation. IJCV, 2008.
[17] R. Marée, P. Geurts, J. Piater, and L. Wehenkel. Random subwindows for robust image classification. In CVPR, 2005.
[18] A. Montillo, J. Shotton, J. Winn, J. E. Iglesias, D. Metaxas, and A. Criminisi. Entangled decision forests and their application for semantic segmentation of CT images. In IPMI, 2011.
[19] F. Moosmann, B. Triggs, and F. Jurie. Fast discriminative visual codebooks using randomized clustering forests. In NIPS, 2006.
[20] M. Pelillo and M. Refice. Learning compatibility coefficients for relaxation labeling processes. PAMI, 16(9):933–945, 1994.
[21] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie. Objects in context. In ICCV, 2007.
[22] N. Razavi, J. Gall, and L. Van Gool. Scalable multi-class object detection. In CVPR, 2011.
[23] H. Riemenschneider, S. Sternig, M. Donoser, P. M. Roth, and H. Bischof. Hough regions for joining instance localization and segmentation. In ECCV, 2012.
[24] J. Shotton, M. Johnson, and R. Cipolla. Semantic texton forests for image categorization and segmentation. In CVPR, 2008.
[25] Z. Tu. Auto-context and its application to high-level vision tasks. In CVPR, 2008.
[26] O. Woodford, M. Pham, A. Maki, F. Perbet, and B. Stenger. Demisting the Hough transform for 3D shape recognition and registration. In BMVC, 2011.
4 0.60677123 148 nips-2012-Hamming Distance Metric Learning
Author: Mohammad Norouzi, David J. Fleet, Ruslan Salakhutdinov
Abstract: Motivated by large-scale multimedia applications we propose to learn mappings from high-dimensional data to binary codes that preserve semantic similarity. Binary codes are well suited to large-scale applications as they are storage efficient and permit exact sub-linear kNN search. The framework is applicable to broad families of mappings, and uses a flexible form of triplet ranking loss. We overcome discontinuous optimization of the discrete mappings by minimizing a piecewise-smooth upper bound on empirical loss, inspired by latent structural SVMs. We develop a new loss-augmented inference algorithm that is quadratic in the code length. We show strong retrieval performance on CIFAR-10 and MNIST, with promising classification results using no more than kNN on the binary codes.
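To make the kNN-on-binary-codes idea in the abstract above concrete, here is a minimal sketch of Hamming-distance nearest-neighbour lookup. It is an illustration only: the function names and the toy 8-bit codes are hypothetical, and the brute-force linear scan stands in for the sub-linear search the abstract mentions. It does not reproduce the paper's learned mapping or its search procedure.

```python
from typing import List, Sequence


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions at which two binary codes differ."""
    # XOR leaves a 1 exactly where the codes disagree; count those bits.
    return bin(a ^ b).count("1")


def knn_hamming(query: int, database: Sequence[int], k: int) -> List[int]:
    """Indices of the k database codes closest to the query.

    Brute-force linear scan (assumption: illustration only). Ties keep the
    lower index because sorted() is stable.
    """
    order = sorted(
        range(len(database)),
        key=lambda i: hamming_distance(query, database[i]),
    )
    return order[:k]


# Toy 8-bit codes, made up for this example.
codes = [0b10110010, 0b10110011, 0b01001101, 0b10010010]
query = 0b10110000
nearest = knn_hamming(query, codes, k=2)
```

Storing each code as an integer keeps the distance computation to one XOR and a bit count, which is the storage and speed advantage the abstract attributes to binary codes.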
5 0.5722158 199 nips-2012-Link Prediction in Graphs with Autoregressive Features
Author: Emile Richard, Stephane Gaiffas, Nicolas Vayatis
Abstract: In this paper, we consider the problem of link prediction in time-evolving graphs. We assume that certain graph features, such as the node degree, follow a vector autoregressive (VAR) model and we propose to use this information to improve the accuracy of prediction. Our strategy involves a joint optimization procedure over the space of adjacency matrices and VAR matrices which takes into account both sparsity and low rank properties of the matrices. Oracle inequalities are derived and illustrate the trade-offs in the choice of smoothing parameters when modeling the joint effect of sparsity and low rank property. The estimate is computed efficiently using proximal methods through a generalized forward-backward algorithm.
6 0.55836689 341 nips-2012-The topographic unsupervised learning of natural sounds in the auditory cortex
7 0.52446663 192 nips-2012-Learning the Dependency Structure of Latent Factors
8 0.51164556 233 nips-2012-Multiresolution Gaussian Processes
9 0.51162797 191 nips-2012-Learning the Architecture of Sum-Product Networks Using Clustering on Variables
10 0.51109123 182 nips-2012-Learning Networks of Heterogeneous Influence
11 0.50954199 124 nips-2012-Factorial LDA: Sparse Multi-Dimensional Text Models
12 0.50658083 239 nips-2012-Neuronal Spike Generation Mechanism as an Oversampling, Noise-shaping A-to-D converter
13 0.50641018 282 nips-2012-Proximal Newton-type methods for convex optimization
14 0.50251532 270 nips-2012-Phoneme Classification using Constrained Variational Gaussian Process Dynamical System
15 0.50083667 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models
16 0.50040364 159 nips-2012-Image Denoising and Inpainting with Deep Neural Networks
17 0.49877477 7 nips-2012-A Divide-and-Conquer Method for Sparse Inverse Covariance Estimation
18 0.49830785 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model
19 0.49816468 12 nips-2012-A Neural Autoregressive Topic Model
20 0.49735186 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes