nips nips2004 nips2004-5 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Rasmus K. Olsson, Lars K. Hansen
Abstract: We discuss an identification framework for noisy speech mixtures. A block-based generative model is formulated that explicitly incorporates the time-varying harmonic plus noise (H+N) model for a number of latent sources observed through noisy convolutive mixtures. All parameters including the pitches of the source signals, the amplitudes and phases of the sources, the mixing filters and the noise statistics are estimated by maximum likelihood, using an EM-algorithm. Exact averaging over the hidden sources is obtained using the Kalman smoother. We show that pitch estimation and source separation can be performed simultaneously. The pitch estimates are compared to laryngograph (EGG) measurements. Artificial and real room mixtures are used to demonstrate the viability of the approach. Intelligible speech signals are re-synthesized from the estimated H+N models.
Reference: text
sentIndex sentText sentNum sentScore
A Harmonic Excitation State-Space Approach to Blind Separation of Speech

Rasmus Kongsgaard Olsson and Lars Kai Hansen
Informatics and Mathematical Modelling, Technical University of Denmark, 2800 Lyngby, Denmark
rko,lkh@imm.dk

Abstract

We discuss an identification framework for noisy speech mixtures. A block-based generative model is formulated that explicitly incorporates the time-varying harmonic plus noise (H+N) model for a number of latent sources observed through noisy convolutive mixtures. All parameters, including the pitches of the source signals, the amplitudes and phases of the sources, the mixing filters and the noise statistics, are estimated by maximum likelihood using an EM algorithm. Exact averaging over the hidden sources is obtained using the Kalman smoother. We show that pitch estimation and source separation can be performed simultaneously. The pitch estimates are compared to laryngograph (EGG) measurements. Artificial and real room mixtures are used to demonstrate the viability of the approach. Intelligible speech signals are re-synthesized from the estimated H+N models.
1 Introduction

Our aim is to understand the properties of mixtures of speech signals within a generative statistical framework. We consider the noisy convolutive mixture

    x_t = sum_{k=0}^{L-1} A_k s_{t-k} + n_t,                                    (1)

where the elements of the source signal vector s_t, i.e. the d_s statistically independent source signals, are convolved with the corresponding elements of the filter matrix A_k. The multichannel sensor signal x_t is furthermore degraded by additive Gaussian white noise.
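For illustration, a minimal numpy sketch of the mixing model (1) could read as follows; the function name and the random test signals are assumptions for the example, not part of the paper.

```python
import numpy as np

def convolutive_mix(S, A, noise_std=0.01, seed=0):
    """Noisy convolutive mixture: x_t = sum_{k=0}^{L-1} A_k s_{t-k} + n_t.

    S : (d_s, T) array of source signals
    A : (L, d_x, d_s) array of mixing filters A_k
    """
    rng = np.random.default_rng(seed)
    L, d_x, _ = A.shape
    T = S.shape[1]
    X = np.zeros((d_x, T))
    for k in range(L):
        # sources delayed by k samples, mixed through A_k
        X[:, k:] += A[k] @ S[:, :T - k]
    # additive Gaussian white sensor noise n_t
    return X + noise_std * rng.standard_normal((d_x, T))
```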
It is well-known that separation of the source signals based on second-order statistics is infeasible in general. Consider the second-order statistic

    <x_t x_t^T> = sum_{k,k'=0}^{L-1} A_k <s_{t-k} s_{t-k'}^T> A_{k'}^T + R,     (2)

where R is the (diagonal) noise covariance matrix. If the sources can be assumed stationary white noise, the source covariance matrix can be assumed proportional to the unit matrix without loss of generality, and we see that the statistic is invariant to a common rotation of all mixing matrices, A_k → A_k U. This rotational invariance means that the acquired statistic is not informative enough to identify the mixing matrix, and hence the source time series.
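The rotational ambiguity can be checked numerically: for unit-variance white sources, the zero-lag statistic (2) reduces to sum_k A_k A_k^T + R, which is unchanged when every A_k is multiplied by the same orthogonal U. A small sketch (variable names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 3, 2
A = rng.standard_normal((L, d, d))          # mixing filters A_k

# a random orthogonal matrix U (QR factor of a Gaussian matrix)
U, _ = np.linalg.qr(rng.standard_normal((d, d)))

# zero-lag sensor covariance for stationary white unit-variance sources
# (the noise term R is omitted: it is common to both expressions)
cov = sum(A[k] @ A[k].T for k in range(L))
cov_rot = sum((A[k] @ U) @ (A[k] @ U).T for k in range(L))

print(np.allclose(cov, cov_rot))  # True: the statistic is blind to U
```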
For non-stationary sources, on the other hand, the autocorrelation functions vary through time and it is not possible to choose a single common whitening filter for each source. This means that the mixing matrices may be identifiable from multiple estimates of the second-order correlation statistic (2) for non-stationary sources. Also in [2], the constraining effect of source non-stationarity was exploited by the simultaneous diagonalization of multiple estimates of the source power spectrum. In [3] we formulated a generative probabilistic model of this process and proved that it could estimate sources and mixing matrices in noisy mixtures.
Blind source separation based on state-space models has been studied previously. The approach is especially useful for including prior knowledge about the source signals and for handling noisy mixtures. One example of considerable practical importance is the case of speech mixtures. For speech mixtures, the generative model based on white-noise excitation may be improved using more realistic priors. Speech models based on sinusoidal excitation have been quite popular in speech modelling since [6].
This approach assumes that the speech signal is a time-varying mixture of a harmonic signal and a noise signal (the H+N model). A recent application of this model to pitch estimation can be found in [7]. Also, [8] and [9] exploit the harmonic structure of certain classes of signals for enhancement purposes. A related application is the BSS algorithm of [10], which uses the cross-correlation of the amplitude in different frequency bands. The state-space model naturally leads to maximum-likelihood estimation using the EM algorithm. In this work we generalize our previous work on state-space models for blind source separation to include harmonic excitation, and demonstrate that it is possible to perform simultaneous un-mixing and pitch tracking.
2 The model

The assumption of time-variant source statistics helps identify parameters that would otherwise not be unique within the model. In the following, the measured signals are segmented into frames, within which they are assumed stationary. The mixing filters and the observation noise covariance matrix are assumed stationary across all frames. The colored-noise (AR) process that was used in [3] to model the sources is augmented to include a periodic excitation signal that is also time-varying. In frame n, source i is represented by

    s^n_{i,t} = sum_{t'=1}^{p} f^n_{i,t'} s^n_{i,t-t'} + sum_{k=1}^{K} α^n_{i,k} sin(ω^n_{0,i} k t + β^n_i) + v^n_{i,t}
              = sum_{t'=1}^{p} f^n_{i,t'} s^n_{i,t-t'} + sum_{k=1}^{K} [ c^n_{i,2k-1} sin(ω^n_{0,i} k t) + c^n_{i,2k} cos(ω^n_{0,i} k t) ] + v^n_{i,t},    (3)

where n ∈ {1, 2, ...} indexes the frames.
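A direct transcription of the H+N recursion (3) for a single frame might look as follows; the helper name is an assumption, and the phases β are already folded into the c-coefficients as in the second line of (3).

```python
import numpy as np

def hn_frame(f, c, omega0, T, noise_std=0.01, seed=0):
    """Synthesize one frame of the H+N source model (3).

    f      : AR coefficients f_1..f_p
    c      : harmonic coefficients c_1..c_{2K} (sin/cos interleaved)
    omega0 : fundamental frequency in radians per sample
    """
    rng = np.random.default_rng(seed)
    p, K = len(f), len(c) // 2
    s = np.zeros(T)
    for t in range(T):
        # AR part: sum_{t'=1}^p f_{t'} s_{t-t'} (zero initial conditions)
        ar = sum(f[j] * s[t - 1 - j] for j in range(p) if t - 1 - j >= 0)
        # harmonic part: sum_k c_{2k-1} sin(k w0 t) + c_{2k} cos(k w0 t)
        harm = sum(c[2 * k] * np.sin(omega0 * (k + 1) * t)
                   + c[2 * k + 1] * np.cos(omega0 * (k + 1) * t)
                   for k in range(K))
        s[t] = ar + harm + noise_std * rng.standard_normal()
    return s
```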
The fundamental frequency, ω^n_{0,i}, enters the estimation problem in an inherently non-linear manner. In order to benefit from well-established estimation theory, the above recursion is fitted into the framework of Gaussian linear models; see [15].
All s^n_{i,t} are stacked in the total source vector

    s̄^n_t = [ (s̄^n_{1,t})^T  (s̄^n_{2,t})^T  ... ]^T,    s̄^n_{i,t} = [ s^n_{i,t}  s^n_{i,t-1}  ...  s^n_{i,t-p+1} ]^T.

The resulting state-space model is

    s̄^n_t = F^n s̄^n_{t-1} + C^n u^n_t + v_t,
    x^n_t = A s̄^n_t + n_t,

where v_t ∼ N(0, Q^n), n_t ∼ N(0, R) and s̄^n_1 ∼ N(µ^n, Σ^n). The deterministic input vector is defined as u^n_t = [ (u^n_{1,t})^T  (u^n_{2,t})^T  ... ]^T, where the harmonics corresponding to source i in frame n are

    u^n_{i,t} = [ sin(ω^n_{0,i} t)  cos(ω^n_{0,i} t)  ...  sin(K ω^n_{0,i} t)  cos(K ω^n_{0,i} t) ]^T.

It is apparent that the matrix multiplication by A constitutes a convolutive mixing of the sources, where A contains the d_x × d_s channel filters a_{ij} = [ a_{ij,1}  ...  a_{ij,L} ]. In order to implement the combined H+N source model, the parameter matrices are constrained as follows: F^n = diag(F^n_1, ..., F^n_{d_s}) is block-diagonal, with each F^n_i in companion form carrying the AR coefficients f^n_{i,t'} in its first row and a shifted identity below; Q^n_i has (Q^n_i)_{jj'} = q^n_i for j = j' = 1 and zero elsewhere; and C^n_i carries the harmonic coefficients c^n_{i,1}, ..., c^n_{i,2K} in its first row and zeros elsewhere.
48 0 0 ··· 0 3 Learning Having described the convolutive mixing problem in the general framework of linear Gaussian models, more specifically the Kalman filter model, optimal inference of the sources is obtained by the Kalman smoother. [sent-162, score-0.621]
49 For the Gausˆ sian model the means are also source MAP estimates. [sent-171, score-0.34]
3.1 E-step

The forward-backward recursions which comprise the Kalman smoother are employed in the E-step to infer the moments of the source posterior, p(S|X, θ), i.e. the joint posterior of the sources conditioned on all observations. The relevant second-order statistics of this distribution in segment n are the marginal posterior mean, <s̄^n_t>, and the autocorrelation, M^n_{i,t} ≡ <s̄^n_{i,t} (s̄^n_{i,t})^T> ≡ [ m^n_{i,1,t}  m^n_{i,2,t}  ... ]. In particular, the lag-one covariance is M^{1,n}_{i,t} ≡ <s̄^n_{i,t} (s̄^n_{i,t-1})^T>, and m^n_{i,1,t} is the first element of m^n_{i,t}. The forward recursion also yields the log-likelihood, L(θ).
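The E-step means can be obtained with a standard Rauch-Tung-Striebel smoother. The sketch below (interface assumed) returns only the smoothed means, which for this Gaussian model coincide with the MAP source estimates; the full E-step would additionally propagate the second-order moments used above.

```python
import numpy as np

def rts_smoother(X, F, C, U, A, Q, R, mu0, S0):
    """Kalman filter + RTS backward pass for
    s_t = F s_{t-1} + C u_t + v_t,   x_t = A s_t + n_t."""
    T = X.shape[1]
    d = F.shape[0]
    mu_p = np.zeros((T, d)); P_p = np.zeros((T, d, d))  # predicted
    mu_f = np.zeros((T, d)); P_f = np.zeros((T, d, d))  # filtered
    m, P = mu0, S0
    for t in range(T):
        if t > 0:  # time update (the prior handles t = 0)
            m = F @ m + C @ U[:, t]
            P = F @ P @ F.T + Q
        mu_p[t], P_p[t] = m, P
        # measurement update
        S = A @ P @ A.T + R
        K = P @ A.T @ np.linalg.inv(S)
        m = m + K @ (X[:, t] - A @ m)
        P = P - K @ A @ P
        mu_f[t], P_f[t] = m, P
    # backward recursion for the smoothed means
    mu_s = mu_f.copy()
    for t in range(T - 2, -1, -1):
        J = P_f[t] @ F.T @ np.linalg.inv(P_p[t + 1])
        mu_s[t] = mu_f[t] + J @ (mu_s[t + 1] - mu_p[t + 1])
    return mu_s
```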
The linear source parameters are grouped as

    d^n_i ≡ [ (f^n_i)^T  (c^n_i)^T ]^T,    z^n_{i,t} ≡ [ (s̄^n_{i,t-1})^T  (u^n_{i,t})^T ]^T,

where f^n_i ≡ [ f^n_{i,1}  f^n_{i,2}  ...  f^n_{i,p} ]^T. The estimators of [12] are modified in order to respect the special constrained format of the parameter matrices and to allow for an external input to the model. More details on the estimators for the correlated source model are given in [3].
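In the simplest case the re-estimation of d^n_i reduces to a least-squares regression of s_t on z_t. The sketch below (function name assumed) uses plain products of point estimates where the full M-step would use the posterior moments from the E-step:

```python
import numpy as np

def reestimate_d(s, U, p):
    """Least-squares estimate of d = [f; c] from a source frame s and
    harmonic inputs U ((2K, T)), regressing s_t on z_t = [s_{t-1..t-p}; u_t]."""
    T = len(s)
    Z = np.array([np.concatenate(([s[t - 1 - j] for j in range(p)], U[:, t]))
                  for t in range(p, T)])
    y = s[p:]
    d, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return d[:p], d[p:]  # AR part f, harmonic part c
```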
The search exploits the fact that the pitch of speech lies in the range 50-400 Hz. A candidate estimate for ω^n_{0,i} is obtained by computing the autocorrelation function of the residual s^n_{i,t} − (f^n_i)^T s̄^n_{i,t-1}.
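A simple way to implement the candidate search, restricted to the stated 50-400 Hz pitch range (function name assumed):

```python
import numpy as np

def pitch_candidate(residual, fs, fmin=50.0, fmax=400.0):
    """Fundamental-frequency candidate from the autocorrelation of the
    AR-whitened residual, with the lag restricted to the speech pitch range."""
    r = residual - residual.mean()
    ac = np.correlate(r, r, mode='full')[len(r) - 1:]  # non-negative lags
    lo = int(fs / fmax)                                # shortest period
    hi = min(int(fs / fmin), len(ac) - 1)              # longest period
    lag = lo + np.argmax(ac[lo:hi + 1])
    return fs / lag
```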
Figure 1: Amplitude spectrograms of the frequency range 0-4000 Hz; from left to right: the true sources, the estimated sources and the re-synthesized sources.

That is, a unit norm is enforced on the filter coefficients related to source i.
4 Experiment I: BSS and pitch tracking in a noisy artificial mixture

The performance of a pitch detector can be evaluated using electro-laryngograph (EGG) recordings, which are obtained from electrodes placed on the neck; see [7]. In the following experiment, speech signals from the TIMIT [16] corpus, for which the EGG signals were measured, are used; these were kindly provided by the 'festvox' project (http://festvox. ...). Two male speech signals (Fs = 16 kHz) were mixed through known mixing filters and degraded by additive white noise (SNR ~20 dB), yielding two observation signals. The signals were segmented into frames of τ = 320 samples (~20 ms), and the order of the AR process was set to p = 1. The pitch grid search involved 30 re-estimations of d^n_i.
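The frame segmentation used in the experiments amounts to (function name assumed):

```python
import numpy as np

def segment(x, tau=320):
    """Split a signal into non-overlapping frames of tau samples
    (tau = 320 at Fs = 16 kHz is ~20 ms); the remainder is dropped."""
    n = len(x) // tau
    return x[:n * tau].reshape(n, tau)
```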
Figure 1 shows the spectrograms of approximately 1 second of 1) the original sources, 2) the MAP source estimates and 3) the re-synthesized sources (from the estimated model parameters).

Figure 2: The estimated (dashed) and EGG-provided (solid) pitches as a function of time. The speech mixtures were artificially mixed from TIMIT utterances and white noise was added.
Also, the re-synthesizations are almost indistinguishable from the source estimates. In figure 2, the estimated pitches of both speech signals are shown along with the pitch from the EGG measurements.[1] The voiced sections of the speech were manually preselected; this step is easily automated. The estimated pitches do follow the 'true' pitches as provided by the EGG. The smoothness of the estimates further indicates the viability of the approach, as the pitch estimates are frame-local.
5 Experiment II: BSS and pitch tracking in a real mixture

The algorithm was further evaluated on real room recordings that were also used in [17]. The filter length, the frame length, the order of the AR process and the number of harmonics were set to L = 25, τ = 320, p = 1 and K = 40, respectively. Figure 3 shows the MAP source estimates and the re-synthesized sources. Features of speech such as amplitude modulation are clearly evident in the estimates and re-synthesizations.[3] A listening test confirms 1) the separation of the sources and 2) the good quality of the synthesized sources, reconfirming the applicability of the H+N model. Figure 4 displays the estimated pitches of the sources, where the voiced sections were manually preselected. Although the 'true' pitch is unavailable in this experiment, the smoothness of the frame-local pitch estimates further supports the approach.
[1] The EGG data are themselves noisy measurements of the hypothesized 'truth'.
[3] Note that the 'English' counter lowers the pitch throughout the sentence.
Figure 3: Spectrograms of the estimated (left) and re-synthesized sources (right) extracted from the 'one two ...' mixtures, source 1 and 2, respectively.

6 Conclusion

It was shown that prior knowledge on speech signals, and quasi-periodic signals in general, can be integrated into a linear non-stationary state-space model. As a result, the simultaneous separation of the speech sources and estimation of their pitches could be achieved. It was demonstrated that the method could cope with noisy artificially mixed signals and real room mixtures. Future research concerns more realistic mixtures in terms of reverberation time and the inclusion of further domain knowledge.
Figure 4: Pitch tracking in the 'one two ...' mixtures.

References

- Estimating the number of sources in a noisy convolutive mixture using BIC.
- Blind separation of independent sources in linear dynamical media.
- Blind deconvolution of dynamical systems: a state-space approach, Journal of Signal Processing.
- Speech analysis/synthesis based on a sinusoidal representation, IEEE Trans.
- Approximate Kalman filtering for the harmonic plus noise model, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics.
- One-microphone blind dereverberation based on quasi-periodicity of speech signals, Advances in Neural Information Processing Systems 16 (to appear), MIT Press, 2004.
- Monaural speech segregation based on pitch tracking and amplitude modulation, IEEE Trans.
- Convolutive blind source separation of speech signals based on amplitude modulation decorrelation, Journal of the Acoustical Society of America.
- Maximum likelihood for blind separation and deconvolution of noisy signals using mixture models, ICASSP.
wordName wordTfidf (topN-words)
[('source', 0.34), ('speech', 0.302), ('pitch', 0.276), ('sources', 0.258), ('sn', 0.226), ('convolutive', 0.217), ('blind', 0.189), ('signals', 0.176), ('un', 0.176), ('pitches', 0.165), ('mixing', 0.146), ('ds', 0.138), ('st', 0.129), ('hz', 0.128), ('separation', 0.12), ('egg', 0.118), ('sec', 0.116), ('cn', 0.112), ('lter', 0.102), ('mixtures', 0.101), ('kalman', 0.096), ('excitation', 0.09), ('harmonics', 0.086), ('olsson', 0.081), ('ak', 0.078), ('statistic', 0.073), ('noisy', 0.072), ('amplitude', 0.07), ('harmonic', 0.069), ('signal', 0.067), ('dn', 0.065), ('bss', 0.065), ('degraded', 0.065), ('qn', 0.065), ('spectrograms', 0.065), ('frame', 0.062), ('mn', 0.062), ('noise', 0.061), ('deconvolution', 0.06), ('sin', 0.06), ('lters', 0.058), ('fn', 0.058), ('autocorrelation', 0.057), ('hansen', 0.057), ('adx', 0.054), ('eusipco', 0.054), ('uno', 0.054), ('matrices', 0.052), ('nt', 0.052), ('audio', 0.05), ('room', 0.05), ('estimates', 0.048), ('estimated', 0.048), ('kt', 0.048), ('white', 0.048), ('dos', 0.047), ('fin', 0.047), ('modulation', 0.045), ('stationary', 0.044), ('viability', 0.043), ('cardoso', 0.043), ('decorrelation', 0.043), ('parra', 0.043), ('timit', 0.043), ('cos', 0.041), ('simultaneous', 0.041), ('em', 0.041), ('rotational', 0.04), ('sinusoidal', 0.04), ('snr', 0.04), ('voiced', 0.04), ('tracking', 0.039), ('det', 0.038), ('fs', 0.038), ('recursion', 0.038), ('xt', 0.036), ('enforcing', 0.036), ('zn', 0.036), ('mixed', 0.036), ('estimation', 0.035), ('qi', 0.035), ('cially', 0.035), ('denmark', 0.035), ('english', 0.035), ('male', 0.035), ('periodic', 0.035), ('pp', 0.035), ('segmented', 0.035), ('mixture', 0.034), ('posterior', 0.034), ('infeasible', 0.033), ('recordings', 0.033), ('grid', 0.033), ('acoustics', 0.032), ('estimators', 0.031), ('generative', 0.031), ('ltering', 0.031), ('vt', 0.03), ('covariance', 0.029), ('augmented', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 5 nips-2004-A Harmonic Excitation State-Space Approach to Blind Separation of Speech
Author: Rasmus K. Olsson, Lars K. Hansen
Abstract: We discuss an identification framework for noisy speech mixtures. A block-based generative model is formulated that explicitly incorporates the time-varying harmonic plus noise (H+N) model for a number of latent sources observed through noisy convolutive mixtures. All parameters including the pitches of the source signals, the amplitudes and phases of the sources, the mixing filters and the noise statistics are estimated by maximum likelihood, using an EM-algorithm. Exact averaging over the hidden sources is obtained using the Kalman smoother. We show that pitch estimation and source separation can be performed simultaneously. The pitch estimates are compared to laryngograph (EGG) measurements. Artificial and real room mixtures are used to demonstrate the viability of the approach. Intelligible speech signals are re-synthesized from the estimated H+N models.
2 0.40880996 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach
Author: Francis R. Bach, Michael I. Jordan
Abstract: We present an algorithm to perform blind, one-microphone speech separation. Our algorithm separates mixtures of speech without modeling individual speakers. Instead, we formulate the problem of speech separation as a problem in segmenting the spectrogram of the signal into two or more disjoint sets. We build feature sets for our segmenter using classical cues from speech psychophysics. We then combine these features into parameterized affinity matrices. We also take advantage of the fact that we can generate training examples for segmentation by artificially superposing separately-recorded signals. Thus the parameters of the affinity matrices can be tuned using recent work on learning spectral clustering [1]. This yields an adaptive, speech-specific segmentation algorithm that can successfully separate one-microphone speech mixtures. 1
3 0.32272512 152 nips-2004-Real-Time Pitch Determination of One or More Voices by Nonnegative Matrix Factorization
Author: Fei Sha, Lawrence K. Saul
Abstract: An auditory “scene”, composed of overlapping acoustic sources, can be viewed as a complex object whose constituent parts are the individual sources. Pitch is known to be an important cue for auditory scene analysis. In this paper, with the goal of building agents that operate in human environments, we describe a real-time system to identify the presence of one or more voices and compute their pitch. The signal processing in the front end is based on instantaneous frequency estimation, a method for tracking the partials of voiced speech, while the pattern-matching in the back end is based on nonnegative matrix factorization, an unsupervised algorithm for learning the parts of complex objects. While supporting a framework to analyze complicated auditory scenes, our system maintains real-time operability and state-of-the-art performance in clean speech.
Author: Tobias Blaschke, Laurenz Wiskott
Abstract: In contrast to the equivalence of linear blind source separation and linear independent component analysis it is not possible to recover the original source signal from some unknown nonlinear transformations of the sources using only the independence assumption. Integrating the objectives of statistical independence and temporal slowness removes this indeterminacy leading to a new method for nonlinear blind source separation. The principle of temporal slowness is adopted from slow feature analysis, an unsupervised method to extract slowly varying features from a given observed vectorial signal. The performance of the algorithm is demonstrated on nonlinearly mixed speech data. 1
5 0.14729548 27 nips-2004-Bayesian Regularization and Nonnegative Deconvolution for Time Delay Estimation
Author: Yuanqing Lin, Daniel D. Lee
Abstract: Bayesian Regularization and Nonnegative Deconvolution (BRAND) is proposed for estimating time delays of acoustic signals in reverberant environments. Sparsity of the nonnegative filter coefficients is enforced using an L1 -norm regularization. A probabilistic generative model is used to simultaneously estimate the regularization parameters and filter coefficients from the signal data. Iterative update rules are derived under a Bayesian framework using the Expectation-Maximization procedure. The resulting time delay estimation algorithm is demonstrated on noisy acoustic data.
6 0.13403997 102 nips-2004-Learning first-order Markov models for control
7 0.12082705 121 nips-2004-Modeling Nonlinear Dependencies in Natural Images using Mixture of Laplacian Distribution
8 0.11257113 198 nips-2004-Unsupervised Variational Bayesian Learning of Nonlinear Models
9 0.10938415 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
10 0.099896356 20 nips-2004-An Auditory Paradigm for Brain-Computer Interfaces
11 0.095866442 97 nips-2004-Learning Efficient Auditory Codes Using Spikes Predicts Cochlear Filters
12 0.092594609 103 nips-2004-Limits of Spectral Clustering
13 0.090927891 120 nips-2004-Modeling Conversational Dynamics as a Mixed-Memory Markov Process
14 0.083940603 50 nips-2004-Dependent Gaussian Processes
15 0.077689931 104 nips-2004-Linear Multilayer Independent Component Analysis for Large Natural Scenes
16 0.073459044 16 nips-2004-Adaptive Discriminative Generative Model and Its Applications
17 0.072279498 124 nips-2004-Multiple Alignment of Continuous Time Series
18 0.057907682 166 nips-2004-Semi-supervised Learning via Gaussian Processes
19 0.057703733 163 nips-2004-Semi-parametric Exponential Family PCA
20 0.055682629 98 nips-2004-Learning Gaussian Process Kernels via Hierarchical Bayes
topicId topicWeight
[(0, -0.193), (1, -0.046), (2, -0.001), (3, -0.259), (4, -0.338), (5, -0.343), (6, 0.318), (7, 0.163), (8, 0.057), (9, 0.087), (10, 0.071), (11, 0.018), (12, 0.039), (13, 0.104), (14, -0.164), (15, 0.03), (16, -0.036), (17, 0.031), (18, 0.053), (19, 0.073), (20, -0.051), (21, 0.069), (22, 0.125), (23, 0.028), (24, 0.012), (25, -0.042), (26, 0.022), (27, -0.011), (28, 0.029), (29, -0.041), (30, 0.004), (31, -0.021), (32, 0.035), (33, -0.067), (34, 0.002), (35, -0.051), (36, -0.03), (37, -0.022), (38, 0.012), (39, -0.01), (40, -0.01), (41, 0.063), (42, 0.005), (43, 0.01), (44, -0.043), (45, -0.038), (46, -0.074), (47, 0.009), (48, -0.046), (49, 0.047)]
simIndex simValue paperId paperTitle
same-paper 1 0.98464942 5 nips-2004-A Harmonic Excitation State-Space Approach to Blind Separation of Speech
Author: Rasmus K. Olsson, Lars K. Hansen
Abstract: We discuss an identification framework for noisy speech mixtures. A block-based generative model is formulated that explicitly incorporates the time-varying harmonic plus noise (H+N) model for a number of latent sources observed through noisy convolutive mixtures. All parameters including the pitches of the source signals, the amplitudes and phases of the sources, the mixing filters and the noise statistics are estimated by maximum likelihood, using an EM-algorithm. Exact averaging over the hidden sources is obtained using the Kalman smoother. We show that pitch estimation and source separation can be performed simultaneously. The pitch estimates are compared to laryngograph (EGG) measurements. Artificial and real room mixtures are used to demonstrate the viability of the approach. Intelligible speech signals are re-synthesized from the estimated H+N models.
2 0.8525278 152 nips-2004-Real-Time Pitch Determination of One or More Voices by Nonnegative Matrix Factorization
Author: Fei Sha, Lawrence K. Saul
Abstract: An auditory “scene”, composed of overlapping acoustic sources, can be viewed as a complex object whose constituent parts are the individual sources. Pitch is known to be an important cue for auditory scene analysis. In this paper, with the goal of building agents that operate in human environments, we describe a real-time system to identify the presence of one or more voices and compute their pitch. The signal processing in the front end is based on instantaneous frequency estimation, a method for tracking the partials of voiced speech, while the pattern-matching in the back end is based on nonnegative matrix factorization, an unsupervised algorithm for learning the parts of complex objects. While supporting a framework to analyze complicated auditory scenes, our system maintains real-time operability and state-of-the-art performance in clean speech.
3 0.79392087 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach
Author: Francis R. Bach, Michael I. Jordan
Abstract: We present an algorithm to perform blind, one-microphone speech separation. Our algorithm separates mixtures of speech without modeling individual speakers. Instead, we formulate the problem of speech separation as a problem in segmenting the spectrogram of the signal into two or more disjoint sets. We build feature sets for our segmenter using classical cues from speech psychophysics. We then combine these features into parameterized affinity matrices. We also take advantage of the fact that we can generate training examples for segmentation by artificially superposing separately-recorded signals. Thus the parameters of the affinity matrices can be tuned using recent work on learning spectral clustering [1]. This yields an adaptive, speech-specific segmentation algorithm that can successfully separate one-microphone speech mixtures. 1
Author: Tobias Blaschke, Laurenz Wiskott
Abstract: In contrast to the equivalence of linear blind source separation and linear independent component analysis it is not possible to recover the original source signal from some unknown nonlinear transformations of the sources using only the independence assumption. Integrating the objectives of statistical independence and temporal slowness removes this indeterminacy leading to a new method for nonlinear blind source separation. The principle of temporal slowness is adopted from slow feature analysis, an unsupervised method to extract slowly varying features from a given observed vectorial signal. The performance of the algorithm is demonstrated on nonlinearly mixed speech data. 1
5 0.59372938 120 nips-2004-Modeling Conversational Dynamics as a Mixed-Memory Markov Process
Author: Tanzeem Choudhury, Sumit Basu
Abstract: In this work, we quantitatively investigate the ways in which a given person influences the joint turn-taking behavior in a conversation. After collecting an auditory database of social interactions among a group of twenty-three people via wearable sensors (66 hours of data each over two weeks), we apply speech and conversation detection methods to the auditory streams. These methods automatically locate the conversations, determine their participants, and mark which participant was speaking when. We then model the joint turn-taking behavior as a Mixed-Memory Markov Model [1] that combines the statistics of the individual subjects' self-transitions and the partners ' cross-transitions. The mixture parameters in this model describe how much each person 's individual behavior contributes to the joint turn-taking behavior of the pair. By estimating these parameters, we thus estimate how much influence each participant has in determining the joint turntaking behavior. We show how this measure correlates significantly with betweenness centrality [2], an independent measure of an individual's importance in a social network. This result suggests that our estimate of conversational influence is predictive of social influence. 1
6 0.52025312 27 nips-2004-Bayesian Regularization and Nonnegative Deconvolution for Time Delay Estimation
7 0.40014336 104 nips-2004-Linear Multilayer Independent Component Analysis for Large Natural Scenes
8 0.39747003 198 nips-2004-Unsupervised Variational Bayesian Learning of Nonlinear Models
9 0.37638518 97 nips-2004-Learning Efficient Auditory Codes Using Spikes Predicts Cochlear Filters
10 0.33647773 121 nips-2004-Modeling Nonlinear Dependencies in Natural Images using Mixture of Laplacian Distribution
11 0.31773189 102 nips-2004-Learning first-order Markov models for control
12 0.29772434 20 nips-2004-An Auditory Paradigm for Brain-Computer Interfaces
13 0.2895883 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
14 0.26146105 29 nips-2004-Beat Tracking the Graphical Model Way
15 0.25183055 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images
16 0.25105241 50 nips-2004-Dependent Gaussian Processes
17 0.20714748 124 nips-2004-Multiple Alignment of Continuous Time Series
18 0.20387624 16 nips-2004-Adaptive Discriminative Generative Model and Its Applications
19 0.2016876 172 nips-2004-Sparse Coding of Natural Images Using an Overcomplete Set of Limited Capacity Units
20 0.19936953 74 nips-2004-Harmonising Chorales by Probabilistic Inference
topicId topicWeight
[(13, 0.275), (15, 0.089), (26, 0.045), (31, 0.027), (33, 0.131), (39, 0.011), (47, 0.056), (50, 0.031), (54, 0.013), (65, 0.026), (76, 0.091), (82, 0.01), (89, 0.018), (94, 0.05)]
simIndex simValue paperId paperTitle
same-paper 1 0.94505173 5 nips-2004-A Harmonic Excitation State-Space Approach to Blind Separation of Speech
Author: Rasmus K. Olsson, Lars K. Hansen
Abstract: We discuss an identification framework for noisy speech mixtures. A block-based generative model is formulated that explicitly incorporates the time-varying harmonic plus noise (H+N) model for a number of latent sources observed through noisy convolutive mixtures. All parameters including the pitches of the source signals, the amplitudes and phases of the sources, the mixing filters and the noise statistics are estimated by maximum likelihood, using an EM-algorithm. Exact averaging over the hidden sources is obtained using the Kalman smoother. We show that pitch estimation and source separation can be performed simultaneously. The pitch estimates are compared to laryngograph (EGG) measurements. Artificial and real room mixtures are used to demonstrate the viability of the approach. Intelligible speech signals are re-synthesized from the estimated H+N models.
2 0.93036819 176 nips-2004-Sub-Microwatt Analog VLSI Support Vector Machine for Pattern Classification and Sequence Estimation
Author: Shantanu Chakrabartty, Gert Cauwenberghs
Abstract: An analog system-on-chip for kernel-based pattern classification and sequence estimation is presented. State transition probabilities conditioned on input data are generated by an integrated support vector machine. Dot product based kernels and support vector coefficients are implemented in analog programmable floating gate translinear circuits, and probabilities are propagated and normalized using sub-threshold current-mode circuits. A 14-input, 24-state, and 720-support vector forward decoding kernel machine is integrated on a 3mm×3mm chip in 0.5µm CMOS technology. Experiments with the processor trained for speaker verification and phoneme sequence estimation demonstrate real-time recognition accuracy at par with floating-point software, at sub-microwatt power.
3 0.93005395 138 nips-2004-Online Bounds for Bayesian Algorithms
Author: Sham M. Kakade, Andrew Y. Ng
Abstract: We present a competitive analysis of Bayesian learning algorithms in the online learning setting and show that many simple Bayesian algorithms (such as Gaussian linear regression and Bayesian logistic regression) perform favorably when compared, in retrospect, to the single best model in the model class. The analysis does not assume that the Bayesian algorithms’ modeling assumptions are “correct,” and our bounds hold even if the data is adversarially chosen. For Gaussian linear regression (using logloss), our error bounds are comparable to the best bounds in the online learning literature, and we also provide a lower bound showing that Gaussian linear regression is optimal in a certain worst case sense. We also give bounds for some widely used maximum a posteriori (MAP) estimation algorithms, including regularized logistic regression.
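The Gaussian linear regression algorithm analyzed in that abstract admits an exact online update: after each example the Gaussian posterior over the weights is updated in closed form. The sketch below assumes a standard-normal prior and unit noise variance; the class name and those hyperparameter choices are illustrative, not from the paper.

```python
# Sketch: online Bayesian (Gaussian) linear regression.  The posterior
# over weights w stays Gaussian, so each update is a rank-one change to
# the precision matrix.  Prior N(0, I) and unit noise variance assumed.
import numpy as np

class OnlineBayesLinReg:
    def __init__(self, d, noise_var=1.0):
        self.Sinv = np.eye(d)      # posterior precision (starts at prior I)
        self.b = np.zeros(d)       # precision-weighted posterior mean
        self.noise_var = noise_var

    def update(self, x, y):
        # incorporate one observation y = w @ x + noise
        self.Sinv += np.outer(x, x) / self.noise_var
        self.b += y * x / self.noise_var

    def mean(self):
        # posterior mean = Sinv^{-1} b (the ridge-regression estimate)
        return np.linalg.solve(self.Sinv, self.b)
```

The posterior mean after t examples coincides with the ridge-regression solution, which is why the online-learning bounds for this algorithm match regularized least squares.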
4 0.92686033 27 nips-2004-Bayesian Regularization and Nonnegative Deconvolution for Time Delay Estimation
Author: Yuanqing Lin, Daniel D. Lee
Abstract: Bayesian Regularization and Nonnegative Deconvolution (BRAND) is proposed for estimating time delays of acoustic signals in reverberant environments. Sparsity of the nonnegative filter coefficients is enforced using an L1 -norm regularization. A probabilistic generative model is used to simultaneously estimate the regularization parameters and filter coefficients from the signal data. Iterative update rules are derived under a Bayesian framework using the Expectation-Maximization procedure. The resulting time delay estimation algorithm is demonstrated on noisy acoustic data.
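The BRAND objective described above — a least-squares fit with an L1 penalty on nonnegative filter coefficients — can be minimized in several ways; the paper derives EM update rules, which are not reproduced here. The sketch below instead uses simple projected gradient descent on the same objective, purely as an illustration (function name and step-size rule are assumptions).

```python
# Sketch: L1-regularized nonnegative least squares in the spirit of BRAND,
#   minimize ||y - X h||^2 + lam * sum(h)   s.t.  h >= 0,
# solved by projected gradient descent (NOT the paper's EM updates).
# Note: for h >= 0 the L1 norm reduces to sum(h), so the penalty is linear.
import numpy as np

def nn_deconv_l1(X, y, lam=0.1, lr=None, n_iter=2000):
    n = X.shape[1]
    h = np.zeros(n)
    if lr is None:
        # step size 1/L from the Lipschitz constant L = 2 * ||X||_2^2
        lr = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ h - y) + lam   # gradient of the objective
        h = np.maximum(h - lr * grad, 0.0)     # project onto h >= 0
    return h
```

With `X` a Toeplitz (convolution) matrix built from the source signal, the nonzero entries of `h` locate the acoustic time delays; the L1 term drives spurious coefficients to exactly zero.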
5 0.9252063 114 nips-2004-Maximum Likelihood Estimation of Intrinsic Dimension
Author: Elizaveta Levina, Peter J. Bickel
Abstract: We propose a new method for estimating intrinsic dimension of a dataset derived by applying the principle of maximum likelihood to the distances between close neighbors. We derive the estimator by a Poisson process approximation, assess its bias and variance theoretically and by simulations, and apply it to a number of simulated and real datasets. We also show it has the best overall performance compared with two other intrinsic dimension estimators.
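The maximum-likelihood estimator described in that abstract has a closed form: for each point, the MLE of the local dimension from the distances to its k nearest neighbours is the inverse mean log distance ratio, and the per-point estimates are averaged over the dataset. A minimal sketch (brute-force distances, k = 10 as an arbitrary choice):

```python
# Sketch: Levina-Bickel maximum-likelihood intrinsic-dimension estimator.
# Per-point MLE:  m_hat(x) = [ (1/(k-1)) * sum_{j<k} log(T_k(x)/T_j(x)) ]^{-1}
# where T_j(x) is the distance from x to its j-th nearest neighbour.
import numpy as np

def intrinsic_dim_mle(X, k=10):
    # brute-force pairwise Euclidean distances (fine for small datasets)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    D.sort(axis=1)
    Tk = D[:, 1:k + 1]                       # k nearest, skipping self (col 0)
    logs = np.log(Tk[:, -1:] / Tk[:, :-1])   # log(T_k / T_j), j = 1..k-1
    m_hat = (k - 1) / logs.sum(axis=1)       # per-point MLE
    return m_hat.mean()                      # average over the dataset
```

On data sampled uniformly from a low-dimensional set, the estimate concentrates near the true intrinsic dimension regardless of the embedding dimension, with some bias at small k and near boundaries.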
6 0.92257267 85 nips-2004-Instance-Based Relevance Feedback for Image Retrieval
7 0.92016405 36 nips-2004-Class-size Independent Generalization Analsysis of Some Discriminative Multi-Category Classification
8 0.83212495 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons
10 0.79330689 152 nips-2004-Real-Time Pitch Determination of One or More Voices by Nonnegative Matrix Factorization
11 0.79045182 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants
12 0.78754956 96 nips-2004-Learning, Regularization and Ill-Posed Inverse Problems
13 0.78112894 28 nips-2004-Bayesian inference in spiking neurons
14 0.77695423 131 nips-2004-Non-Local Manifold Tangent Learning
15 0.77578861 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform
16 0.77559197 102 nips-2004-Learning first-order Markov models for control
17 0.77348411 163 nips-2004-Semi-parametric Exponential Family PCA
18 0.77288437 116 nips-2004-Message Errors in Belief Propagation
19 0.77195573 22 nips-2004-An Investigation of Practical Approximate Nearest Neighbor Algorithms
20 0.77146065 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss