nips nips2002 nips2002-29 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sachin S. Kajarekar, Hynek Hermansky
Abstract: We propose analysis of information in speech using three sources - language (phone), speaker and channel. Information in speech is measured as mutual information between the source and the set of features extracted from the speech signal. We assume that the distribution of features can be modeled using a Gaussian distribution. The mutual information is computed using the results of analysis of variability in speech. We observe similarity in the results of phone variability and phone information, and show that the results of the proposed analysis have more meaningful interpretations than the analysis of variability.
Reference: text
sentIndex sentText sentNum sentScore
1 The mutual information is computed using the results of analysis of variability in speech. [sent-5, score-0.312]
2 We observe similarity in the results of phone variability and phone information, and show that the results of the proposed analysis have more meaningful interpretations than the analysis of variability. [sent-6, score-1.714]
3 1 Introduction Speech signal carries information about the linguistic message, the speaker, and the communication channel. In previous work [1, 2], we proposed analysis of information in speech as analysis of variability in a set of features extracted from the speech signal. [sent-7, score-0.659]
4 The variability was measured as the covariance of the features, and the analysis was performed using multivariate analysis of variance (MANOVA). [sent-8, score-0.38]
5 Total variability was divided into three types of variabilities, namely, intra-phone (or phone) variability, speaker variability, and channel variability. [sent-9, score-0.899]
6 Effect of each type was measured as its contribution to the total variability. [sent-10, score-0.091]
7 In this paper, we extend our previous work by proposing an information-theoretic analysis of information in speech. [sent-11, score-0.068]
8 Similar to MANOVA, we assume that speech carries information from three main sources - language, speaker, and channel. We measure information from a source as mutual information (MI) [3] between the corresponding class labels and features. [sent-12, score-0.314]
9 For example, linguistic information is measured as MI between phone labels and features. [sent-13, score-0.82]
10 The effect of sources is measured in nats (or bits). [sent-14, score-0.151]
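As a brief sketch of the quantity involved (standard definitions, not a formula reproduced from the paper), the MI between a source label S and a feature vector X is

$$
I(S; X) \;=\; H(X) - H(X \mid S) \;=\; H(X) - \sum_{s} P(s)\, H(X \mid S = s),
$$

which is in nats when natural logarithms are used and in bits for base-2 logarithms.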
11 In this work, we show it is easier to interpret the results of this analysis than the analysis of variability. [sent-15, score-0.082]
12 The proposed analysis is based on the second method, where distribution of features is modeled as a Gaussian distribution. [sent-21, score-0.127]
13 Section 4 proposes an information-theoretic approach for analysis of information in speech and presents the results. [sent-26, score-0.222]
14 2 Experimental Setup In previous work [1, 2], we analyzed variability in the features using three databases - HTIMIT, OGI Stories and TIMIT. [sent-29, score-0.23]
15 Therefore, phone is considered as a source of variability or source of information. [sent-33, score-0.982]
16 The utterances are not labeled separately by speakers and channels, so we cannot measure speaker and channel as separate sources. [sent-34, score-0.787]
17 Instead, we assume that different speakers have used different channels and consider speaker+channel as a single source of variability or a single source of information. [sent-35, score-0.382]
18 Figure 1 shows a commonly used time-frequency representation of energy in speech signal. [sent-36, score-0.105]
19 The y-axis represents frequency, x-axis represents time, and the darkness of each element shows the energy at a given frequency and time. [sent-37, score-0.073]
20 A spectral vector is defined by the number of points on the y-axis, S(w, t_m). [sent-38, score-0.097]
21 The vector is estimated at every 10 ms using a 25 ms speech segment. [sent-40, score-0.253]
22 It is labeled by the phone and the speaker and channel label of the corresponding speech segment. [sent-41, score-1.516]
23 A temporal vector is defined by a sequence of points along time at a given frequency, S(w_n, t). [sent-42, score-0.114]
24 As the spectral vectors are computed every 10 ms, the temporal vector represents 1 sec of temporal information. [sent-44, score-0.356]
25 The temporal vectors are labeled by the phone and the speaker and channel label of the current speech segment. [sent-45, score-1.651]
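A minimal sketch of how the two vector types could be cut from a time-frequency energy matrix, assuming a 10 ms frame step, 15 critical bands on the frequency axis, and a 101-point (roughly 1 s) temporal window; the array names and the toy dimensions are illustrative assumptions, not the authors' code.

```python
import numpy as np

def spectral_vector(tf_energy, m):
    """Spectral vector S(w, t_m): all frequency points at frame m."""
    return tf_energy[:, m]                                  # shape: (n_bands,)

def temporal_vector(tf_energy, n, m, half_span=50):
    """Temporal vector S(w_n, t): 101 frames (about +/-500 ms at a
    10 ms frame step) of critical band n, centered on frame m."""
    return tf_energy[n, m - half_span : m + half_span + 1]  # shape: (101,)

# tf_energy: rows = critical bands, columns = 10 ms frames (toy example)
tf_energy = np.random.rand(15, 1000)
x_spectral = spectral_vector(tf_energy, m=500)
x_temporal = temporal_vector(tf_energy, n=4, m=500)
```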
26 In this work, the analysis is performed independently using spectral and temporal vectors. [sent-46, score-0.252]
27 In this work, we use two factors - phone and speaker+channel. [sent-48, score-0.732]
28 The underlying model of MANOVA is X_ijk = m + (m_i - m) + (m_ij - m_i) + E_ijk (1), where i = 1, ..., p represents phones and j = 1, ..., sc represents speakers and channels. [sent-49, score-0.113]
29 Here m is the overall mean; m_i, the mean of phone i; m_ij, the mean of the speaker and channel j, and phone i; and E_ijk, an error in this approximation. [sent-54, score-1.411]
30 First, all the feature vectors (X) belonging to each phone i are collected and their mean (X_i) is computed. [sent-75, score-0.75]
31 The covariance of these phone means, Sigma_p, is the estimate of phone variability. [sent-76, score-1.438]
32 Next, the data for each speaker and channel j within each phone i is collected and the mean of the data (X_ij) is computed. [sent-77, score-1.435]
33 The covariance of the means of different speakers averaged over all phones, Sigma_sc, is the estimate of speaker variability. [sent-78, score-0.513]
34 Not all the variability in the data is explained by these sources. [sent-79, score-0.217]
35 The unaccounted sources, such as context and coarticulation, cause variability in the data collected from one speaker speaking one phone through one channel. [sent-80, score-1.347]
36 The covariance within each phone, speaker, and channel is averaged over all phones, speakers, and channels, and the resulting covariance, Sigma_residual, is the estimate of residual variability. [sent-81, score-0.358]
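The covariance estimates described above can be sketched as follows; this is an illustrative reconstruction assuming features X with integer phone labels and speaker+channel labels, not the authors' implementation.

```python
import numpy as np

def manova_covariances(X, phone, spkchan):
    """X: (N, d) feature vectors; phone, spkchan: (N,) integer labels.
    Returns estimates of phone, speaker+channel, and residual covariance."""
    phones = np.unique(phone)
    phone_means = np.array([X[phone == p].mean(axis=0) for p in phones])
    Sigma_p = np.cov(phone_means, rowvar=False)          # phone variability

    sc_covs, res_covs = [], []
    for p in phones:
        Xp, scp = X[phone == p], spkchan[phone == p]
        cell_means = []
        for s in np.unique(scp):
            Xps = Xp[scp == s]
            cell_means.append(Xps.mean(axis=0))          # mean for speaker+channel s within phone p
            if len(Xps) > 1:
                res_covs.append(np.cov(Xps, rowvar=False))
        if len(cell_means) > 1:
            sc_covs.append(np.cov(np.array(cell_means), rowvar=False))
    Sigma_sc = np.mean(sc_covs, axis=0)                  # speaker+channel variability
    Sigma_residual = np.mean(res_covs, axis=0)           # residual variability
    return Sigma_p, Sigma_sc, Sigma_residual
```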
37 1 Results Results of MANOVA are interpreted at two levels - feature element and feature vector. [sent-83, score-0.078]
38 The contribution of different sources is calculated as trace(Sigma_source)/trace(Sigma_total). [sent-86, score-0.077]
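The trace ratio above could be computed as in this small sketch (illustrative, using the covariance estimates from the previous snippet):

```python
import numpy as np

def contribution(Sigma_source, Sigma_total):
    """Fraction of total variability attributed to one source."""
    return np.trace(Sigma_source) / np.trace(Sigma_total)
```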
39 Therefore, we cannot directly compare contribution of variabilities in time and frequency domains. [sent-88, score-0.1]
40 For comparison, the contribution of sources in the temporal domain is calculated ... (Table 1: Contribution of sources in spectral and temporal domains; columns: source, Spectral Domain, Temporal Domain; the phone row begins 35., remaining values truncated.) [sent-89, score-1.342]
41 E_{101x15} is a matrix of 15 leading ... In the spectral domain, the highest phone variability is between 4-6 Barks. [sent-94, score-1.024]
42 The highest speaker and channel variability is between 1-2 Barks where phone variability is the lowest. [sent-95, score-1.826]
43 In temporal domain, phone variability spreads for approximately 250 ms around the current phone. [sent-96, score-1.181]
44 Speaker and channel variability is almost constant except around the current frame. [sent-97, score-0.54]
45 This deviation is explained by the difference in the phonetic context among the phone instances across different speakers. [sent-98, score-0.812]
46 Thus, features for speakers within a phone differ not only because of different speaker characteristics but also because of different phonetic contexts. [sent-99, score-1.269]
47 This deviation is also seen in the speaker and channel information in the proposed analysis. [sent-100, score-0.752]
48 In the overall results for each domain, spectral domain has higher variability due to different phones than temporal domain. [sent-101, score-0.608]
49 It also has higher speaker and channel variability than temporal domain. [sent-102, score-1.032]
50 For example, how much phone variability is needed for perfect phone recognition? [sent-104, score-1.614]
51 And is 4% of phone variability in the temporal domain significant? [sent-105, score-1.102]
52 In order to answer these questions, we propose an information theoretic analysis. [sent-106, score-0.067]
53 Therefore, we propose a different formulation for the information theoretic analysis as follows. [sent-118, score-0.09]
54 For example, we can assume that Y_1 = {y_1^k} represents the phone factor and each y_1^k represents a phone class. [sent-125, score-1.444]
55 The term on the left-hand side is the entropy of X, which is the total information in X that can be explained using Y. [sent-137, score-0.098]
56 On the right-hand side, the first term is similar to the phone variability, the second term is similar to the speaker variability, and the last term calculates the effect of unaccounted factors (Y_3, ...). [sent-139, score-1.161]
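A plausible reconstruction of this decomposition (an assumption based on the chain rule for mutual information, not the paper's numbered equation) is

$$
H(X) \;=\; I(X; Y_1) \;+\; I(X; Y_2 \mid Y_1) \;+\; H(X \mid Y_1, Y_2),
$$

where $Y_1$ is the phone factor, $Y_2$ the speaker+channel factor, and the final conditional entropy absorbs the unaccounted factors $(Y_3, \ldots)$.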
57 (5) The h(.) terms are estimated using a parametric approximation to the total and conditional distributions. It is assumed that the total distribution of features is a Gaussian distribution with covariance Sigma. [sent-145, score-0.183]
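Under the Gaussian assumption the entropy terms reduce to log-determinants of covariance matrices; a minimal sketch of the resulting MI estimate is given below (illustrative names and a single-Gaussian model per class, as assumed in the text).

```python
import numpy as np

def gaussian_entropy(Sigma):
    """Differential entropy of a Gaussian with covariance Sigma, in nats:
    h = 0.5 * (d * log(2*pi*e) + log det Sigma)."""
    d = Sigma.shape[0]
    _, logdet = np.linalg.slogdet(Sigma)
    return 0.5 * (d * np.log(2.0 * np.pi * np.e) + logdet)

def gaussian_mi(X, labels):
    """I(X; Y) approximated as h(X) - sum_k P(y_k) * h(X | y_k),
    with the total and each conditional density modeled as one Gaussian."""
    h_total = gaussian_entropy(np.cov(X, rowvar=False))
    h_cond = 0.0
    for y in np.unique(labels):
        Xy = X[labels == y]
        h_cond += (len(Xy) / len(X)) * gaussian_entropy(np.cov(Xy, rowvar=False))
    return h_total - h_cond   # in nats
```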
58 (Figure 3: Results of information-theoretic analysis; axes: Frequency (Critical Band Index), 1-15, and Time (ms), -250 to 250.) Table 2: Mutual information between features and phone labels, and speaker and channel labels, in spectral and temporal domains (columns: source - phone, speaker+channel; values truncated). [sent-173, score-2.828]
59 1 Results Figure 3 shows the results of information-theoretic analysis in spectral and temporal domain. [sent-174, score-0.252]
60 In spectral domain, phone information is highest between 3-6 Barks. [sent-176, score-0.86]
61 Speaker and channel information is lowest in that range and highest between 1-2 Barks. [sent-177, score-0.366]
62 Since the OGI Stories database was collected over different telephones, speaker+channel information below 2 Barks (approximately 200 Hz) is due to different telephone channels. [sent-178, score-0.379]
63 In temporal domain, the highest phone information is at the center (0 ms). [sent-179, score-0.877]
64 It spreads for approximately 200 ms around the center. [sent-180, score-0.152]
65 Speaker and channel information is almost constant across time except near the center. [sent-181, score-0.348]
66 Note that the nature of speaker and channel variability also deviates from the constant around the current frame. [sent-182, score-0.957]
67 But, at the current frame, phone variability is higher than speaker and channel variability. [sent-183, score-1.664]
68 The results of analysis of information show that, at the current frame, phone information is lower than speaker and channel information. [sent-184, score-1.5]
69 They are related to data insufficiency, specifically, in temporal domain where the feature vector is 101 points and there are approximately 60 vectors per speaker per phone. [sent-190, score-0.652]
70 We have used a condition number of 1000 to estimate the determinant of Sigma and Sigma_i, and a condition number of 100 to estimate the determinant of Sigma_ij. [sent-194, score-0.07]
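Flooring the eigenvalue spread with a condition number could look like the sketch below; the thresholds 1000 and 100 come from the text, while the specific flooring scheme is an assumption.

```python
import numpy as np

def conditioned_logdet(Sigma, cond=1000.0):
    """Log-determinant with eigenvalues floored so that
    max(eig) / min(eig) <= cond, to cope with data insufficiency."""
    eigvals = np.linalg.eigvalsh(Sigma)
    floor = eigvals.max() / cond
    return float(np.sum(np.log(np.maximum(eigvals, floor))))

# e.g. conditioned_logdet(Sigma_total, cond=1000.0); conditioned_logdet(Sigma_ij, cond=100.0)
```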
71 The results show that phone information in spectral domain is 1. [sent-195, score-0.921]
72 Comparison of results from spectral and temporal domains shows that spectral domain has higher phone information than temporal domain. [sent-203, score-1.287]
73 Temporal domain has higher speaker and channel information than spectral domain. [sent-204, score-0.945]
74 The first question was: how much phone variability is needed for perfect phone recognition? [sent-206, score-1.614]
75 The answer to the question is H(Y_1), because the maximum value of I(X; Y_1) is H(Y_1). We compute H(Y_1) using phone priors. [sent-207, score-0.721]
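H(Y_1) could be computed from the phone priors as in this small illustrative sketch:

```python
import numpy as np

def label_entropy(priors):
    """H(Y) = -sum_k P(y_k) * log P(y_k), in nats."""
    p = np.asarray(priors, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))
```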
76 The question about the significance of phone information in the temporal domain is addressed by comparing it with an information-less MI level. [sent-211, score-0.954]
77 The information-less MI is computed as the MI between the current phone label and features at 500 ms in the past or in the future. [sent-212, score-0.886]
It is 0.0013 nats considering the feature at 500 ms in the past, and 0.0010 nats considering the feature at 500 ms in the future. [sent-214, score-0.165] [sent-215, score-0.196]
80 The difference in actual MI levels across the two studies is related to the difference in the estimation techniques. [sent-226, score-0.075]
81 In spectral domain, Yang's study showed higher phone information between 3-8 Barks. [sent-227, score-0.866]
82 The highest phone information was observed at 4 Barks. [sent-228, score-0.763]
83 Higher speaker and channel information was observed around 1-2 Barks. [sent-229, score-0.757]
84 In temporal domain, their study showed that phone information spreads for approximately 200 ms around the current time frame. [sent-230, score-1.037]
85 Comparison of results from this analysis and our analysis shows that nature of phone information is similar in both studies. [sent-231, score-0.827]
86 Nature of speaker and channel information in spectral domain is also similar. [sent-232, score-0.926]
87 We could not compare the speaker and channel information in temporal domain because Yang's study did not present these results. [sent-233, score-0.963]
88 ..., we observed a difference in the nature of speaker and channel variability, and of speaker and channel information, at 5 Barks. [sent-235, score-1.48]
89 Comparing MI levels from our study to those from Yang's study, we observe that Yang's results show that speaker and channel information at 5 Barks is less than the corresponding phone information. [sent-236, score-1.474]
90 In the future work, we plan to model the densities using more sophisticated techniques, and improve the estimation of speaker and channel information. [sent-242, score-0.738]
91 6 Conclusions We proposed analysis of information in speech using three sources of information - language (phone), speaker and channel. [sent-243, score-0.683]
92 Information in speech was measured as MI between the class labels and the set of features extracted from speech signal. [sent-244, score-0.324]
93 For example, linguistic information was measured using phone labels and the features. [sent-245, score-0.82]
94 Thus we related the analysis to the previously proposed analysis of variability in speech. [sent-247, score-0.29]
95 We observed similar results for phone variability and phone information. [sent-248, score-1.597]
96 The speaker and channel variability and the speaker and channel information around the current frame were different. [sent-249, score-1.699]
97 This was shown to be related to the over-estimation of speaker and channel information using unimodal Gaussian model. [sent-250, score-0.735]
98 Note that the analysis of information was proposed because its results have more meaningful interpretations than results of analysis of variability. [sent-251, score-0.144]
99 Hermansky, "Analysis of sources of variability in speech," in Proc. [sent-259, score-0.238]
100 Hermansky, "Analysis of speaker and channel variability in speech," in Proc. [sent-265, score-0.899]
wordName wordTfidf (topN-words)
[('phone', 0.703), ('speaker', 0.402), ('channel', 0.306), ('variability', 0.191), ('mi', 0.153), ('temporal', 0.114), ('speech', 0.105), ('spectral', 0.097), ('yang', 0.097), ('domain', 0.094), ('manova', 0.093), ('phones', 0.093), ('ova', 0.081), ('speakers', 0.079), ('hermansky', 0.078), ('ogi', 0.078), ('ms', 0.074), ('man', 0.072), ('yd', 0.069), ('nats', 0.068), ('yl', 0.052), ('stories', 0.049), ('barks', 0.047), ('variabilities', 0.047), ('sources', 0.047), ('phonetic', 0.046), ('source', 0.044), ('analysis', 0.041), ('features', 0.039), ('mutual', 0.039), ('spreads', 0.037), ('measured', 0.036), ('determinant', 0.035), ('highest', 0.033), ('covariance', 0.032), ('kajarekar', 0.031), ('malayath', 0.031), ('sidual', 0.031), ('vuuren', 0.031), ('xijk', 0.031), ('iii', 0.03), ('linguistic', 0.03), ('contribution', 0.03), ('factors', 0.029), ('covariances', 0.029), ('information', 0.027), ('unaccounted', 0.027), ('explained', 0.026), ('total', 0.025), ('bits', 0.024), ('channels', 0.024), ('collected', 0.024), ('labels', 0.024), ('ii', 0.023), ('feature', 0.023), ('frequency', 0.023), ('zj', 0.023), ('xij', 0.023), ('database', 0.022), ('theoretic', 0.022), ('difference', 0.022), ('domains', 0.022), ('frame', 0.022), ('around', 0.022), ('current', 0.021), ('yf', 0.021), ('carries', 0.021), ('side', 0.02), ('describes', 0.02), ('residual', 0.02), ('ym', 0.02), ('study', 0.02), ('past', 0.02), ('gaussian', 0.019), ('icassp', 0.019), ('il', 0.019), ('higher', 0.019), ('approximately', 0.019), ('answer', 0.018), ('interpretations', 0.018), ('represents', 0.017), ('proposed', 0.017), ('language', 0.017), ('parametric', 0.017), ('band', 0.017), ('perfect', 0.017), ('yi', 0.016), ('comparing', 0.016), ('levels', 0.016), ('element', 0.016), ('ll', 0.016), ('across', 0.015), ('future', 0.015), ('distribution', 0.015), ('extracted', 0.015), ('plan', 0.015), ('modeled', 0.015), ('nature', 0.015), ('computed', 0.014), ('ip', 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 29 nips-2002-Analysis of Information in Speech Based on MANOVA
Author: Sachin S. Kajarekar, Hynek Hermansky
Abstract: We propose analysis of information in speech using three sources - language (phone), speaker and channel. Information in speech is measured as mutual information between the source and the set of features extracted from the speech signal. We assume that the distribution of features can be modeled using a Gaussian distribution. The mutual information is computed using the results of analysis of variability in speech. We observe similarity in the results of phone variability and phone information, and show that the results of the proposed analysis have more meaningful interpretations than the analysis of variability.
2 0.17237821 93 nips-2002-Forward-Decoding Kernel-Based Phone Recognition
Author: Shantanu Chakrabartty, Gert Cauwenberghs
Abstract: Forward decoding kernel machines (FDKM) combine large-margin classifiers with hidden Markov models (HMM) for maximum a posteriori (MAP) adaptive sequence estimation. State transitions in the sequence are conditioned on observed data using a kernel-based probability model trained with a recursive scheme that deals effectively with noisy and partially labeled data. Training over very large data sets is accomplished using a sparse probabilistic support vector machine (SVM) model based on quadratic entropy, and an on-line stochastic steepest descent algorithm. For speaker-independent continuous phone recognition, FDKM trained over 177 ,080 samples of the TlMIT database achieves 80.6% recognition accuracy over the full test set, without use of a prior phonetic language model.
3 0.11832944 14 nips-2002-A Probabilistic Approach to Single Channel Blind Signal Separation
Author: Gil-jin Jang, Te-Won Lee
Abstract: We present a new technique for achieving source separation when given only a single channel recording. The main idea is based on exploiting the inherent time structure of sound sources by learning a priori sets of basis filters in time domain that encode the sources in a statistically efficient manner. We derive a learning algorithm using a maximum likelihood approach given the observed single channel data and sets of basis filters. For each time point we infer the source signals and their contribution factors. This inference is possible due to the prior knowledge of the basis filters and the associated coefficient densities. A flexible model for density estimation allows accurate modeling of the observation and our experimental results exhibit a high level of separation performance for mixtures of two music signals as well as the separation of two voice signals.
4 0.087622441 116 nips-2002-Interpreting Neural Response Variability as Monte Carlo Sampling of the Posterior
Author: Patrik O. Hoyer, Aapo Hyvärinen
Abstract: The responses of cortical sensory neurons are notoriously variable, with the number of spikes evoked by identical stimuli varying significantly from trial to trial. This variability is most often interpreted as ‘noise’, purely detrimental to the sensory system. In this paper, we propose an alternative view in which the variability is related to the uncertainty, about world parameters, which is inherent in the sensory stimulus. Specifically, the responses of a population of neurons are interpreted as stochastic samples from the posterior distribution in a latent variable model. In addition to giving theoretical arguments supporting such a representational scheme, we provide simulations suggesting how some aspects of response variability might be understood in this framework.
5 0.075247779 170 nips-2002-Real Time Voice Processing with Audiovisual Feedback: Toward Autonomous Agents with Perfect Pitch
Author: Lawrence K. Saul, Daniel D. Lee, Charles L. Isbell, Yann L. Cun
Abstract: We have implemented a real time front end for detecting voiced speech and estimating its fundamental frequency. The front end performs the signal processing for voice-driven agents that attend to the pitch contours of human speech and provide continuous audiovisual feedback. The algorithm we use for pitch tracking has several distinguishing features: it makes no use of FFTs or autocorrelation at the pitch period; it updates the pitch incrementally on a sample-by-sample basis; it avoids peak picking and does not require interpolation in time or frequency to obtain high resolution estimates; and it works reliably over a four octave range, in real time, without the need for postprocessing to produce smooth contours. The algorithm is based on two simple ideas in neural computation: the introduction of a purposeful nonlinearity, and the error signal of a least squares fit. The pitch tracker is used in two real time multimedia applications: a voice-to-MIDI player that synthesizes electronic music from vocalized melodies, and an audiovisual Karaoke machine with multimodal feedback. Both applications run on a laptop and display the user’s pitch scrolling across the screen as he or she sings into the computer.
6 0.07083071 147 nips-2002-Monaural Speech Separation
7 0.055932824 183 nips-2002-Source Separation with a Sensor Array using Graphical Models and Subband Filtering
8 0.045801897 11 nips-2002-A Model for Real-Time Computation in Generic Neural Microcircuits
9 0.044003852 59 nips-2002-Constraint Classification for Multiclass Classification and Ranking
10 0.042216197 193 nips-2002-Temporal Coherence, Natural Image Sequences, and the Visual Cortex
11 0.041620936 38 nips-2002-Bayesian Estimation of Time-Frequency Coefficients for Audio Signal Enhancement
12 0.041095883 73 nips-2002-Dynamic Bayesian Networks with Deterministic Latent Tables
13 0.040556714 25 nips-2002-An Asynchronous Hidden Markov Model for Audio-Visual Speech Recognition
14 0.040128328 172 nips-2002-Recovering Articulated Model Topology from Observed Rigid Motion
15 0.04007173 148 nips-2002-Morton-Style Factorial Coding of Color in Primary Visual Cortex
16 0.037148878 9 nips-2002-A Minimal Intervention Principle for Coordinated Movement
17 0.037090518 43 nips-2002-Binary Coding in Auditory Cortex
18 0.036980741 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine
19 0.036354657 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition
20 0.036303669 90 nips-2002-Feature Selection in Mixture-Based Clustering
topicId topicWeight
[(0, -0.106), (1, 0.029), (2, 0.008), (3, 0.016), (4, -0.021), (5, -0.003), (6, -0.104), (7, 0.001), (8, 0.131), (9, -0.058), (10, 0.09), (11, 0.022), (12, -0.122), (13, 0.001), (14, -0.015), (15, -0.018), (16, -0.028), (17, 0.034), (18, -0.065), (19, 0.016), (20, -0.007), (21, -0.022), (22, 0.033), (23, -0.127), (24, -0.0), (25, -0.169), (26, -0.1), (27, -0.069), (28, 0.056), (29, 0.014), (30, -0.055), (31, -0.095), (32, 0.019), (33, -0.009), (34, 0.145), (35, -0.12), (36, -0.13), (37, -0.017), (38, 0.037), (39, 0.003), (40, 0.072), (41, 0.033), (42, -0.016), (43, -0.134), (44, 0.153), (45, -0.135), (46, -0.045), (47, -0.235), (48, 0.042), (49, -0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.96449906 29 nips-2002-Analysis of Information in Speech Based on MANOVA
Author: Sachin S. Kajarekar, Hynek Hermansky
Abstract: We propose analysis of information in speech using three sources - language (phone), speaker and channel. Information in speech is measured as mutual information between the source and the set of features extracted from the speech signal. We assume that the distribution of features can be modeled using a Gaussian distribution. The mutual information is computed using the results of analysis of variability in speech. We observe similarity in the results of phone variability and phone information, and show that the results of the proposed analysis have more meaningful interpretations than the analysis of variability. 1
2 0.60253644 93 nips-2002-Forward-Decoding Kernel-Based Phone Recognition
Author: Shantanu Chakrabartty, Gert Cauwenberghs
Abstract: Forward decoding kernel machines (FDKM) combine large-margin classifiers with hidden Markov models (HMM) for maximum a posteriori (MAP) adaptive sequence estimation. State transitions in the sequence are conditioned on observed data using a kernel-based probability model trained with a recursive scheme that deals effectively with noisy and partially labeled data. Training over very large data sets is accomplished using a sparse probabilistic support vector machine (SVM) model based on quadratic entropy, and an on-line stochastic steepest descent algorithm. For speaker-independent continuous phone recognition, FDKM trained over 177 ,080 samples of the TlMIT database achieves 80.6% recognition accuracy over the full test set, without use of a prior phonetic language model.
3 0.3894878 14 nips-2002-A Probabilistic Approach to Single Channel Blind Signal Separation
Author: Gil-jin Jang, Te-Won Lee
Abstract: We present a new technique for achieving source separation when given only a single channel recording. The main idea is based on exploiting the inherent time structure of sound sources by learning a priori sets of basis filters in time domain that encode the sources in a statistically efficient manner. We derive a learning algorithm using a maximum likelihood approach given the observed single channel data and sets of basis filters. For each time point we infer the source signals and their contribution factors. This inference is possible due to the prior knowledge of the basis filters and the associated coefficient densities. A flexible model for density estimation allows accurate modeling of the observation and our experimental results exhibit a high level of separation performance for mixtures of two music signals as well as the separation of two voice signals.
4 0.37751982 114 nips-2002-Information Regularization with Partially Labeled Data
Author: Martin Szummer, Tommi S. Jaakkola
Abstract: Classification with partially labeled data requires using a large number of unlabeled examples (or an estimated marginal P (x)), to further constrain the conditional P (y|x) beyond a few available labeled examples. We formulate a regularization approach to linking the marginal and the conditional in a general way. The regularization penalty measures the information that is implied about the labels over covering regions. No parametric assumptions are required and the approach remains tractable even for continuous marginal densities P (x). We develop algorithms for solving the regularization problem for finite covers, establish a limiting differential equation, and exemplify the behavior of the new regularization approach in simple cases.
5 0.3475925 147 nips-2002-Monaural Speech Separation
Author: Guoning Hu, Deliang Wang
Abstract: Monaural speech separation has been studied in previous systems that incorporate auditory scene analysis principles. A major problem for these systems is their inability to deal with speech in the highfrequency range. Psychoacoustic evidence suggests that different perceptual mechanisms are involved in handling resolved and unresolved harmonics. Motivated by this, we propose a model for monaural separation that deals with low-frequency and highfrequency signals differently. For resolved harmonics, our model generates segments based on temporal continuity and cross-channel correlation, and groups them according to periodicity. For unresolved harmonics, the model generates segments based on amplitude modulation (AM) in addition to temporal continuity and groups them according to AM repetition rates derived from sinusoidal modeling. Underlying the separation process is a pitch contour obtained according to psychoacoustic constraints. Our model is systematically evaluated, and it yields substantially better performance than previous systems, especially in the high-frequency range. 1 In t rod u ct i on In a natural environment, speech usually occurs simultaneously with acoustic interference. An effective system for attenuating acoustic interference would greatly facilitate many applications, including automatic speech recognition (ASR) and speaker identification. Blind source separation using independent component analysis [10] or sensor arrays for spatial filtering require multiple sensors. In many situations, such as telecommunication and audio retrieval, a monaural (one microphone) solution is required, in which intrinsic properties of speech or interference must be considered. Various algorithms have been proposed for monaural speech enhancement [14]. These methods assume certain properties of interference and have difficulty in dealing with general acoustic interference. Monaural separation has also been studied using phasebased decomposition [3] and statistical learning [17], but with only limited evaluation. While speech enhancement remains a challenge, the auditory system shows a remarkable capacity for monaural speech separation. According to Bregman [1], the auditory system separates the acoustic signal into streams, corresponding to different sources, based on auditory scene analysis (ASA) principles. Research in ASA has inspired considerable work to build computational auditory scene analysis (CASA) systems for sound separation [19] [4] [7] [18]. Such systems generally approach speech separation in two main stages: segmentation (analysis) and grouping (synthesis). In segmentation, the acoustic input is decomposed into sensory segments, each of which is likely to originate from a single source. In grouping, those segments that likely come from the same source are grouped together, based mostly on periodicity. In a recent CASA model by Wang and Brown [18], segments are formed on the basis of similarity between adjacent filter responses (cross-channel correlation) and temporal continuity, while grouping among segments is performed according to the global pitch extracted within each time frame. In most situations, the model is able to remove intrusions and recover low-frequency (below 1 kHz) energy of target speech. However, this model cannot handle high-frequency (above 1 kHz) signals well, and it loses much of target speech in the high-frequency range. In fact, the inability to deal with speech in the high-frequency range is a common problem for CASA systems. 
We study monaural speech separation with particular emphasis on the high-frequency problem in CASA. For voiced speech, we note that the auditory system can resolve the first few harmonics in the low-frequency range [16]. It has been suggested that different perceptual mechanisms are used to handle resolved and unresolved harmonics [2]. Consequently, our model employs different methods to segregate resolved and unresolved harmonics of target speech. More specifically, our model generates segments for resolved harmonics based on temporal continuity and cross-channel correlation, and these segments are grouped according to common periodicity. For unresolved harmonics, it is well known that the corresponding filter responses are strongly amplitude-modulated and the response envelopes fluctuate at the fundamental frequency (F0) of target speech [8]. Therefore, our model generates segments for unresolved harmonics based on common AM in addition to temporal continuity. The segments are grouped according to AM repetition rates. We calculate AM repetition rates via sinusoidal modeling, which is guided by target pitch estimated according to characteristics of natural speech. Section 2 describes the overall system. In section 3, systematic results and a comparison with the Wang-Brown system are given. Section 4 concludes the paper. 2 M od el d escri p t i on Our model is a multistage system, as shown in Fig. 1. Description for each stage is given below. 2.1 I n i t i a l p r oc e s s i n g First, an acoustic input is analyzed by a standard cochlear filtering model with a bank of 128 gammatone filters [15] and subsequent hair cell transduction [12]. This peripheral processing is done in time frames of 20 ms long with 10 ms overlap between consecutive frames. As a result, the input signal is decomposed into a group of timefrequency (T-F) units. Each T-F unit contains the response from a certain channel at a certain frame. The envelope of the response is obtained by a lowpass filter with Segregated Speech Mixture Peripheral and Initial Pitch mid-level segregation tracking processing Unit Final Resynthesis labeling segregation Figure 1. Schematic diagram of the proposed multistage system. passband [0, 1 kHz] and a Kaiser window of 18.25 ms. Mid-level processing is performed by computing a correlogram (autocorrelation function) of the individual responses and their envelopes. These autocorrelation functions reveal response periodicities as well as AM repetition rates. The global pitch is obtained from the summary correlogram. For clean speech, the autocorrelations generally have peaks consistent with the pitch and their summation shows a dominant peak corresponding to the pitch period. With acoustic interference, a global pitch may not be an accurate description of the target pitch, but it is reasonably close. Because a harmonic extends for a period of time and its frequency changes smoothly, target speech likely activates contiguous T-F units. This is an instance of the temporal continuity principle. In addition, since the passbands of adjacent channels overlap, a resolved harmonic usually activates adjacent channels, which leads to high crosschannel correlations. Hence, in initial segregation, the model first forms segments by merging T-F units based on temporal continuity and cross-channel correlation. Then the segments are grouped into a foreground stream and a background stream by comparing the periodicities of unit responses with global pitch. A similar process is described in [18]. Fig. 
2(a) and Fig. 2(b) illustrate the segments and the foreground stream. The input is a mixture of a voiced utterance and a cocktail party noise (see Sect. 3). Since the intrusion is not strongly structured, most segments correspond to target speech. In addition, most segments are in the low-frequency range. The initial foreground stream successfully groups most of the major segments. 2.2 P i t c h tr a c k i n g In the presence of acoustic interference, the global pitch estimated in mid-level processing is generally not an accurate description of target pitch. To obtain accurate pitch information, target pitch is first estimated from the foreground stream. At each frame, the autocorrelation functions of T-F units in the foreground stream are summated. The pitch period is the lag corresponding to the maximum of the summation in the plausible pitch range: [2 ms, 12.5 ms]. Then we employ the following two constraints to check its reliability. First, an accurate pitch period at a frame should be consistent with the periodicity of the T-F units at this frame in the foreground stream. At frame j, let τ ( j) represent the estimated pitch period, and A(i, j,τ ) the autocorrelation function of uij, the unit in channel i. uij agrees with τ ( j) if A(i , j , τ ( j )) / A(i, j ,τ m ) > θ d (1) (a) (b) Frequency (Hz) 5000 5000 2335 2335 1028 1028 387 387 80 0 0.5 1 Time (Sec) 1.5 80 0 0.5 1 Time (Sec) 1.5 Figure 2. Results of initial segregation for a speech and cocktail-party mixture. (a) Segments formed. Each segment corresponds to a contiguous black region. (b) Foreground stream. Here, θd = 0.95, the same threshold used in [18], and τ m is the lag corresponding to the maximum of A(i, j,τ ) within [2 ms, 12.5 ms]. τ ( j) is considered reliable if more than half of the units in the foreground stream at frame j agree with it. Second, pitch periods in natural speech vary smoothly in time [11]. We stipulate the difference between reliable pitch periods at consecutive frames be smaller than 20% of the pitch period, justified from pitch statistics. Unreliable pitch periods are replaced by new values extrapolated from reliable pitch points using temporal continuity. As an example, suppose at two consecutive frames j and j+1 that τ ( j) is reliable while τ ( j+1) is not. All the channels corresponding to the T-F units agreeing with τ ( j) are selected. τ ( j+1) is then obtained from the summation of the autocorrelations for the units at frame j+1 in those selected channels. Then the re-estimated pitch is further verified with the second constraint. For more details, see [9]. Fig. 3 illustrates the estimated pitch periods from the speech and cocktail-party mixture, which match the pitch periods obtained from clean speech very well. 2.3 U n i t l a be l i n g With estimated pitch periods, (1) provides a criterion to label T-F units according to whether target speech dominates the unit responses or not. This criterion compares an estimated pitch period with the periodicity of the unit response. It is referred as the periodicity criterion. It works well for resolved harmonics, and is used to label the units of the segments generated in initial segregation. However, the periodicity criterion is not suitable for units responding to multiple harmonics because unit responses are amplitude-modulated. As shown in Fig. 4, for a filter response that is strongly amplitude-modulated (Fig. 
4(a)), the target pitch corresponds to a local maximum, indicated by the vertical line, in the autocorrelation instead of the global maximum (Fig. 4(b)). Observe that for a filter responding to multiple harmonics of a harmonic source, the response envelope fluctuates at the rate of F0 [8]. Hence, we propose a new criterion for labeling the T-F units corresponding to unresolved harmonics by comparing AM repetition rates with estimated pitch. This criterion is referred as the AM criterion. To obtain an AM repetition rate, the entire response of a gammatone filter is half-wave rectified and then band-pass filtered to remove the DC component and other possible 14 Pitch Period (ms) 12 (a) 10 180 185 190 195 200 Time (ms) 2 4 6 8 Lag (ms) 205 210 8 6 4 0 (b) 0.5 1 Time (Sec) Figure 3. Estimated target pitch for the speech and cocktail-party mixture, marked by “x”. The solid line indicates the pitch contour obtained from clean speech. 0 10 12 Figure 4. AM effects. (a) Response of a filter with center frequency 2.6 kHz. (b) Corresponding autocorrelation. The vertical line marks the position corresponding to the pitch period of target speech. harmonics except for the F0 component. The rectified and filtered signal is then normalized by its envelope to remove the intensity fluctuations of the original signal, where the envelope is obtained via the Hilbert Transform. Because the pitch of natural speech does not change noticeably within a single frame, we model the corresponding normalized signal within a T-F unit by a single sinusoid to obtain the AM repetition rate. Specifically, f ,φ f ij , φ ij = arg min M ˆ [r (i, jT − k ) − sin(2π k f / f S + φ )]2 , for f ∈[80 Hz, 500 Hz], (2) k =1 ˆ where a square error measure is used. r (i , t ) is the normalized filter response, fS is the sampling frequency, M spans a frame, and T= 10 ms is the progressing period from one frame to the next. In the above equation, fij gives the AM repetition rate for unit uij. Note that in the discrete case, a single sinusoid with a sufficiently high frequency can always match these samples perfectly. However, we are interested in finding a frequency within the plausible pitch range. Hence, the solution does not reduce to a degenerate case. With appropriately chosen initial values, this optimization problem can be solved effectively using iterative gradient descent (see [9]). The AM criterion is used to label T-F units that do not belong to any segments generated in initial segregation; such segments, as discussed earlier, tend to miss unresolved harmonics. Specifically, unit uij is labeled as target speech if the final square error is less than half of the total energy of the corresponding signal and the AM repetition rate is close to the estimated target pitch: | f ijτ ( j ) − 1 | < θ f . (3) Psychoacoustic evidence suggests that to separate sounds with overlapping spectra requires 6-12% difference in F0 [6]. Accordingly, we choose θf to be 0.12. 2.4 F i n a l s e gr e g a t i on a n d r e s y n t he s i s For adjacent channels responding to unresolved harmonics, although their responses may be quite different, they exhibit similar AM patterns and their response envelopes are highly correlated. Therefore, for T-F units labeled as target speech, segments are generated based on cross-channel envelope correlation in addition to temporal continuity. 
The spectra of target speech and intrusion often overlap and, as a result, some segments generated in initial segregation contain both units where target speech dominates and those where intrusion dominates. Given unit labels generated in the last stage, we further divide the segments in the foreground stream, SF, so that all the units in a segment have the same label. Then the streams are adjusted as follows. First, since segments for speech usually are at least 50 ms long, segments with the target label are retained in SF only if they are no shorter than 50 ms. Second, segments with the intrusion label are added to the background stream, SB, if they are no shorter than 50 ms. The remaining segments are removed from SF, becoming undecided. Finally, other units are grouped into the two streams by temporal and spectral continuity. First, SB expands iteratively to include undecided segments in its neighborhood. Then, all the remaining undecided segments are added back to SF. For individual units that do not belong to either stream, they are grouped into SF iteratively if the units are labeled as target speech as well as in the neighborhood of SF. The resulting SF is the final segregated stream of target speech. Fig. 5(a) shows the new segments generated in this process for the speech and cocktailparty mixture. Fig. 5(b) illustrates the segregated stream from the same mixture. Fig. 5(c) shows all the units where target speech is stronger than intrusion. The foreground stream generated by our algorithm contains most of the units where target speech is stronger. In addition, only a small number of units where intrusion is stronger are incorrectly grouped into it. A speech waveform is resynthesized from the final foreground stream. Here, the foreground stream works as a binary mask. It is used to retain the acoustic energy from the mixture that corresponds to 1’s and reject the mixture energy corresponding to 0’s. For more details, see [19]. 3 Evalu at i on an d comp ari son Our model is evaluated with a corpus of 100 mixtures composed of 10 voiced utterances mixed with 10 intrusions collected by Cooke [4]. The intrusions have a considerable variety. Specifically, they are: N0 - 1 kHz pure tone, N1 - white noise, N2 - noise bursts, N3 - “cocktail party” noise, N4 - rock music, N5 - siren, N6 - trill telephone, N7 - female speech, N8 - male speech, and N9 - female speech. Given our decomposition of an input signal into T-F units, we suggest the use of an ideal binary mask as the ground truth for target speech. The ideal binary mask is constructed as follows: a T-F unit is assigned one if the target energy in the corresponding unit is greater than the intrusion energy and zero otherwise. Theoretically speaking, an ideal binary mask gives a performance ceiling for all binary masks. Figure 5(c) illustrates the ideal mask for the speech and cocktail-party mixture. Ideal masks also suit well the situations where more than one target need to be segregated or the target changes dynamically. The use of ideal masks is supported by the auditory masking phenomenon: within a critical band, a weaker signal is masked by a stronger one [13]. In addition, an ideal mask gives excellent resynthesis for a variety of sounds and is similar to a prior mask used in a recent ASR study that yields excellent recognition performance [5]. The speech waveform resynthesized from the final foreground stream is used for evaluation, and it is denoted by S(t). 
The speech waveform resynthesized from the ideal binary mask is denoted by I(t). Furthermore, let e1(t) denote the signal present in I(t) but missing from S(t), and e2(t) the signal present in S(t) but missing from I(t). Then, the relative energy loss, REL, and the relative noise residue, RNR, are calculated as follows: R EL = e12 (t ) t I 2 (t ) , S 2 (t ) . (4b) ¡ ¡ R NR = (4a) t 2 e 2 (t ) t t (a) (b) (c) Frequency (Hz) 5000 2355 1054 387 80 0 0.5 1 Time (Sec) 0 0.5 1 Time (Sec) 0 0.5 1 Time (Sec) Figure 5. Results of final segregation for the speech and cocktail-party mixture. (a) New segments formed in the final segregation. (b) Final foreground stream. (c) Units where target speech is stronger than the intrusion. Table 1: REL and RNR Proposed model Wang-Brown model REL (%) RNR (%) N0 2.12 0.02 N1 4.66 3.55 N2 1.38 1.30 N3 3.83 2.72 N4 4.00 2.27 N5 2.83 0.10 N6 1.61 0.30 N7 3.21 2.18 N8 1.82 1.48 N9 8.57 19.33 3.32 Average 3.40 REL (%) RNR (%) 6.99 0 28.96 1.61 5.77 0.71 21.92 1.92 10.22 1.41 7.47 0 5.99 0.48 8.61 4.23 7.27 0.48 15.81 33.03 11.91 4.39 15 SNR (dB) Intrusion 20 10 5 0 −5 N0 N1 N2 N3 N4 N5 N6 N7 N8 N9 Intrusion Type Figure 6. SNR results for segregated speech. White bars show the results from the proposed model, gray bars those from the Wang-Brown system, and black bars those of the mixtures. The results from our model are shown in Table 1. Each value represents the average of one intrusion with 10 voiced utterances. A further average across all intrusions is also shown in the table. On average, our system retains 96.60% of target speech energy, and the relative residual noise is kept at 3.32%. As a comparison, Table 1 also shows the results from the Wang-Brown model [18], whose performance is representative of current CASA systems. As shown in the table, our model reduces REL significantly. In addition, REL and RNR are balanced in our system. Finally, to compare waveforms directly we measure a form of signal-to-noise ratio (SNR) in decibels using the resynthesized signal from the ideal binary mask as ground truth: ( I (t ) − S (t )) 2 ] . I 2 (t ) SNR = 10 log10 [ t (5) t The SNR for each intrusion averaged across 10 target utterances is shown in Fig. 6, together with the results from the Wang-Brown system and the SNR of the original mixtures. Our model achieves an average SNR gain of around 12 dB and 5 dB improvement over the Wang-Brown model. 4 Di scu ssi on The main feature of our model lies in using different mechanisms to deal with resolved and unresolved harmonics. As a result, our model is able to recover target speech and reduce noise interference in the high-frequency range where harmonics of target speech are unresolved. The proposed system considers the pitch contour of the target source only. However, it is possible to track the pitch contour of the intrusion if it has a harmonic structure. With two pitch contours, one could label a T-F unit more accurately by comparing whether its periodicity is more consistent with one or the other. Such a method is expected to lead to better performance for the two-speaker situation, e.g. N7 through N9. As indicated in Fig. 6, the performance gain of our system for such intrusions is relatively limited. Our model is limited to separation of voiced speech. In our view, unvoiced speech poses the biggest challenge for monaural speech separation. Other grouping cues, such as onset, offset, and timbre, have been demonstrated to be effective for human ASA [1], and may play a role in grouping unvoiced speech. 
In addition, one should consider the acoustic and phonetic characteristics of individual unvoiced consonants. We plan to investigate these issues in future study. A c k n ow l e d g me n t s We thank G. J. Brown and M. Wu for helpful comments. Preliminary versions of this work were presented in 2001 IEEE WASPAA and 2002 IEEE ICASSP. This research was supported in part by an NSF grant (IIS-0081058) and an AFOSR grant (F4962001-1-0027). References [1] A. S. Bregman, Auditory scene analysis, Cambridge MA: MIT Press, 1990. [2] R. P. Carlyon and T. M. Shackleton, “Comparing the fundamental frequencies of resolved and unresolved harmonics: evidence for two pitch mechanisms?” J. Acoust. Soc. Am., Vol. 95, pp. 3541-3554, 1994. [3] G. Cauwenberghs, “Monaural separation of independent acoustical components,” In Proc. of IEEE Symp. Circuit & Systems, 1999. [4] M. Cooke, Modeling auditory processing and organization, Cambridge U.K.: Cambridge University Press, 1993. [5] M. Cooke, P. Green, L. Josifovski, and A. Vizinho, “Robust automatic speech recognition with missing and unreliable acoustic data,” Speech Comm., Vol. 34, pp. 267-285, 2001. [6] C. J. Darwin and R. P. Carlyon, “Auditory grouping,” in Hearing, B. C. J. Moore, Ed., San Diego CA: Academic Press, 1995. [7] D. P. W. Ellis, Prediction-driven computational auditory scene analysis, Ph.D. Dissertation, MIT Department of Electrical Engineering and Computer Science, 1996. [8] H. Helmholtz, On the sensations of tone, Braunschweig: Vieweg & Son, 1863. (A. J. Ellis, English Trans., Dover, 1954.) [9] G. Hu and D. L. Wang, “Monaural speech segregation based on pitch tracking and amplitude modulation,” Technical Report TR6, Ohio State University Department of Computer and Information Science, 2002. (available at www.cis.ohio-state.edu/~hu) [10] A. Hyvärinen, J. Karhunen, and E. Oja, Independent component analysis, New York: Wiley, 2001. [11] W. J. M. Levelt, Speaking: From intention to articulation, Cambridge MA: MIT Press, 1989. [12] R. Meddis, “Simulation of auditory-neural transduction: further studies,” J. Acoust. Soc. Am., Vol. 83, pp. 1056-1063, 1988. [13] B. C. J. Moore, An Introduction to the psychology of hearing, 4th Ed., San Diego CA: Academic Press, 1997. [14] D. O’Shaughnessy, Speech communications: human and machine, 2nd Ed., New York: IEEE Press, 2000. [15] R. D. Patterson, I. Nimmo-Smith, J. Holdsworth, and P. Rice, “An efficient auditory filterbank based on the gammatone function,” APU Report 2341, MRC, Applied Psychology Unit, Cambridge U.K., 1988. [16] R. Plomp and A. M. Mimpen, “The ear as a frequency analyzer II,” J. Acoust. Soc. Am., Vol. 43, pp. 764-767, 1968. [17] S. Roweis, “One microphone source separation,” In Advances in Neural Information Processing Systems 13 (NIPS’00), 2001. [18] D. L. Wang and G. J. Brown, “Separation of speech from interfering sounds based on oscillatory correlation,” IEEE Trans. Neural Networks, Vol. 10, pp. 684-697, 1999. [19] M. Weintraub, A theory and computational model of auditory monaural sound separation, Ph.D. Dissertation, Stanford University Department of Electrical Engineering, 1985.
6 0.32780039 60 nips-2002-Convergence Properties of Some Spike-Triggered Analysis Techniques
7 0.32551882 183 nips-2002-Source Separation with a Sensor Array using Graphical Models and Subband Filtering
8 0.32027158 116 nips-2002-Interpreting Neural Response Variability as Monte Carlo Sampling of the Posterior
9 0.28901237 170 nips-2002-Real Time Voice Processing with Audiovisual Feedback: Toward Autonomous Agents with Perfect Pitch
10 0.27569121 73 nips-2002-Dynamic Bayesian Networks with Deterministic Latent Tables
11 0.27216256 172 nips-2002-Recovering Articulated Model Topology from Observed Rigid Motion
12 0.27163619 25 nips-2002-An Asynchronous Hidden Markov Model for Audio-Visual Speech Recognition
13 0.26354277 129 nips-2002-Learning in Spiking Neural Assemblies
14 0.2613112 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition
15 0.25229439 124 nips-2002-Learning Graphical Models with Mercer Kernels
16 0.24499798 111 nips-2002-Independent Components Analysis through Product Density Estimation
17 0.22914229 9 nips-2002-A Minimal Intervention Principle for Coordinated Movement
18 0.22827254 38 nips-2002-Bayesian Estimation of Time-Frequency Coefficients for Audio Signal Enhancement
19 0.22371724 174 nips-2002-Regularized Greedy Importance Sampling
20 0.21699522 89 nips-2002-Feature Selection by Maximum Marginal Diversity
topicId topicWeight
[(11, 0.026), (23, 0.019), (41, 0.014), (42, 0.082), (54, 0.085), (55, 0.024), (67, 0.013), (68, 0.043), (74, 0.099), (91, 0.348), (92, 0.026), (98, 0.094)]
simIndex simValue paperId paperTitle
same-paper 1 0.75557739 29 nips-2002-Analysis of Information in Speech Based on MANOVA
Author: Sachin S. Kajarekar, Hynek Hermansky
Abstract: We propose analysis of information in speech using three sources - language (phone), speaker and channel. Information in speech is measured as mutual information between the source and the set of features extracted from speech signal. We assume that the distribution of features can be modeled using a Gaussian distribution. The mutual information is computed using the results of analysis of variability in speech. We observe similarity in the results of phone variability and phone information, and show that the results of the proposed analysis have more meaningful interpretations than the analysis of variability. 1
2 0.4602657 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions
Author: Max Welling, Simon Osindero, Geoffrey E. Hinton
Abstract: We propose a model for natural images in which the probability of an image is proportional to the product of the probabilities of some filter outputs. We encourage the system to find sparse features by using a Studentt distribution to model each filter output. If the t-distribution is used to model the combined outputs of sets of neurally adjacent filters, the system learns a topographic map in which the orientation, spatial frequency and location of the filters change smoothly across the map. Even though maximum likelihood learning is intractable in our model, the product form allows a relatively efficient learning procedure that works well even for highly overcomplete sets of filters. Once the model has been learned it can be used as a prior to derive the “iterated Wiener filter” for the purpose of denoising images.
3 0.45553255 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
Author: Olivier Chapelle, Jason Weston, Bernhard SchĂślkopf
Abstract: We propose a framework to incorporate unlabeled data in kernel classifier, based on the idea that two points in the same cluster are more likely to have the same label. This is achieved by modifying the eigenspectrum of the kernel matrix. Experimental results assess the validity of this approach. 1
4 0.45231622 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture
Author: David R. Martin, Charless C. Fowlkes, Jitendra Malik
Abstract: The goal of this work is to accurately detect and localize boundaries in natural scenes using local image measurements. We formulate features that respond to characteristic changes in brightness and texture associated with natural boundaries. In order to combine the information from these features in an optimal way, a classifier is trained using human labeled images as ground truth. We present precision-recall curves showing that the resulting detector outperforms existing approaches.
5 0.45215726 3 nips-2002-A Convergent Form of Approximate Policy Iteration
Author: Theodore J. Perkins, Doina Precup
Abstract: We study a new, model-free form of approximate policy iteration which uses Sarsa updates with linear state-action value function approximation for policy evaluation, and a “policy improvement operator” to generate a new policy based on the learned state-action values. We prove that if the policy improvement operator produces -soft policies and is Lipschitz continuous in the action values, with a constant that is not too large, then the approximate policy iteration algorithm converges to a unique solution from any initial policy. To our knowledge, this is the first convergence result for any form of approximate policy iteration under similar computational-resource assumptions.
6 0.45176619 74 nips-2002-Dynamic Structure Super-Resolution
7 0.45011467 124 nips-2002-Learning Graphical Models with Mercer Kernels
8 0.44916824 163 nips-2002-Prediction and Semantic Association
9 0.44866443 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
10 0.44847858 41 nips-2002-Bayesian Monte Carlo
11 0.448452 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition
12 0.44588646 169 nips-2002-Real-Time Particle Filters
13 0.44552091 11 nips-2002-A Model for Real-Time Computation in Generic Neural Microcircuits
14 0.44530469 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
15 0.44517499 141 nips-2002-Maximally Informative Dimensions: Analyzing Neural Responses to Natural Signals
16 0.44438523 5 nips-2002-A Digital Antennal Lobe for Pattern Equalization: Analysis and Design
17 0.44400772 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
18 0.44372424 53 nips-2002-Clustering with the Fisher Score
19 0.44329196 173 nips-2002-Recovering Intrinsic Images from a Single Image
20 0.44241208 175 nips-2002-Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games