nips nips2000 nips2000-123 knowledge-graph by maker-knowledge-mining

123 nips-2000-Speech Denoising and Dereverberation Using Probabilistic Models


Source: pdf

Author: Hagai Attias, John C. Platt, Alex Acero, Li Deng

Abstract: This paper presents a unified probabilistic framework for denoising and dereverberation of speech signals. The framework transforms the denoising and dereverberation problems into Bayes-optimal signal estimation. The key idea is to use a strong speech model that is pre-trained on a large data set of clean speech. Computational efficiency is achieved by using variational EM, working in the frequency domain, and employing conjugate priors. The framework covers both single and multiple microphones. We apply this approach to noisy reverberant speech signals and get results substantially better than standard methods.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract This paper presents a unified probabilistic framework for denoising and dereverberation of speech signals. [sent-3, score-1.196]

2 The framework transforms the denoising and dereverberation problems into Bayes-optimal signal estimation. [sent-4, score-0.523]

3 The key idea is to use a strong speech model that is pre-trained on a large data set of clean speech. [sent-5, score-0.868]

4 Computational efficiency is achieved by using variational EM, working in the frequency domain, and employing conjugate priors. [sent-6, score-0.296]

5 We apply this approach to noisy reverberant speech signals and get results substantially better than standard methods. [sent-8, score-0.829]

6 1 Introduction This paper presents a statistical-model-based algorithm for reconstructing a speech source from microphone signals recorded in a stationary noisy reverberant environment. [sent-9, score-1.405]

7 Speech enhancement in a realistic environment is a challenging problem, which remains largely unsolved in spite of more than three decades of research. [sent-10, score-0.195]

8 Speech enhancement has many applications and is particularly useful for robust speech recognition [7] and for telecommunication. [sent-11, score-0.725]

9 The difficulty of speech enhancement depends strongly on environmental conditions. [sent-12, score-0.801]

10 If a speaker is close to a microphone, reverberation effects are minimal and traditional methods can handle typical moderate noise levels. [sent-13, score-0.298]

11 However, if the speaker is far away from a microphone, there are more severe distortions, including large amounts of noise and noticeable reverberation. [sent-14, score-0.202]

12 Denoising and dereverberation of speech in this condition has proven to be a very difficult problem [4]. [sent-15, score-0.742]

13 Current speech enhancement methods can be placed into two categories: single-microphone methods and multiple-microphone methods. [sent-16, score-0.725]

14 A large body of literature exists on single-microphone speech enhancement methods. [sent-17, score-0.725]

15 These methods often use a probabilistic framework with statistical models of a single speech signal corrupted by Gaussian noise [6, 8]. [sent-18, score-0.888]

16 These models have not been extended to dereverberation or multiple microphones. [sent-19, score-0.226]

17 Multiple-microphone methods start with microphone array processing, where an array of microphones with a known geometry is deployed to make both spatial and temporal measurements of sounds. [sent-20, score-0.626]

18 A microphone array offers significant advantages compared to single-microphone methods. [sent-21, score-0.843]

19 Non-adaptive algorithms can denoise a signal reasonably well, as long as it originates from a limited range of azimuth. [sent-22, score-0.113]

20 Adaptive algorithms can handle reverberation to some extent [4], but existing methods are not derived from a principled probabilistic framework and hence may be sub-optimal. [sent-24, score-0.36]

21 Work on blind source separation has attempted to remove the need for fixed array geometries and pre-specified room models. [sent-25, score-0.26]

22 In practice, the most successful algorithms concentrate on instantaneous noise-free mixing with the same number of sources as sensors and with very weak probabilistic models for the source [5]. [sent-27, score-0.369]

23 Some algorithms for noisy non-square instantaneous mixing have been developed [1], as well as algorithms for convolutive square noise-free mixing [9]. [sent-28, score-0.372]

24 However, the full problem including noise and convolution has so far remained open. [sent-29, score-0.253]

25 In this paper, we present a new method for speech denoising and dereverberation. [sent-30, score-0.802]

26 We use the framework of probabilistic models, which allows us to integrate the different aspects of the whole problem, including strong speech models, environmental noise and reverberation, and microphone arrays. [sent-31, score-1.346]

27 This integration is performed in a principled manner facilitating a coherent unified treatment. [sent-32, score-0.119]

28 The framework allows us to produce a Bayes-optimal estimation algorithm. [sent-33, score-0.049]

29 Using a strong speech model leads to computational intractability, which we overcome using a variational approach. [sent-34, score-0.72]

30 The computational efficiency is further enhanced by working in the frequency domain and by employing conjugate priors. [sent-35, score-0.341]

31 Results on noisy speech show significant improvement over standard methods. [sent-37, score-0.668]

32 Due to space limitations, the full derivation and mathematical details for this method are provided in the technical report [3]. [sent-38, score-0.033]

33 With the superscript i omitted, Yn denotes all microphone signals. [sent-47, score-0.379]

34 When n is also omitted, Y denotes all signals at all time points. [sent-48, score-0.056]

35 Superscripts may become subscripts and vice versa when no confusion arises. [sent-49, score-0.03]

36 We define the primed quantity $a'_k = 1 - \sum_{n=1}^{p} a_n\, e^{-i\omega_k n}$ (1) for variables $a_n$ with $n = 1, \dots, p$. [sent-51, score-0.03]

37 The Gaussian distribution for a random vector $a$ with mean $\mu$ and precision matrix $V$ (defined as the inverse covariance matrix) is denoted $\mathcal{N}(a \mid \mu, V)$. [sent-55, score-0.23]

38 The Gamma distribution for a non-negative random variable $v$ with $\alpha$ degrees of freedom and inverse scale $\beta$ is denoted $\mathcal{G}(v \mid \alpha, \beta) \propto v^{\alpha/2 - 1} \exp(-\beta v / 2)$. [sent-56, score-0.052]

39 Their product, the Normal-Gamma distribution $\mathcal{NG}(a, v \mid \mu, V, \alpha, \beta) = \mathcal{N}(a \mid \mu, vV)\,\mathcal{G}(v \mid \alpha, \beta)$ (2), turns out to be particularly useful. [sent-57, score-0.027]

40 Problem Formulation We consider the case where a single speech source is present and M microphones are available. [sent-59, score-0.712]

41 Let $x_n$ be the signal emitted by the source at time $n$, and let $y^i_n$ be the signal received at microphone $i$ at the same time. [sent-61, score-0.617]

42 Then $y^i_n = h^i_n * x_n + u^i_n = \sum_m h^i_m\, x_{n-m} + u^i_n$ (3), where $h^i_m$ is the impulse response of the filter (of length $K^i \le N$) operating on the source as it propagates toward microphone $i$, $*$ is the convolution operator, and $u^i_n$ denotes the noise recorded at that microphone. [sent-62, score-0.76]
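
As a concrete illustration of the observation model (3), the following NumPy sketch synthesizes multi-microphone recordings from a source signal. The filter shapes, the noise level, and the use of white (rather than AR) noise are illustrative assumptions only, not settings from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_microphones(x, impulse_responses, noise_std=0.05):
    """Observation model (3): y^i_n = sum_m h^i_m x_{n-m} + u^i_n for each microphone i.

    `impulse_responses` is a list of 1-D arrays h^i, one per microphone; the noise u^i
    is white Gaussian here for simplicity (the paper models it as AR(q))."""
    ys = []
    for h in impulse_responses:
        reverberant = np.convolve(x, h)[: len(x)]          # h^i * x, truncated to N samples
        noise = rng.normal(scale=noise_std, size=len(x))   # stand-in for the AR noise u^i
        ys.append(reverberant + noise)
    return np.stack(ys)                                    # shape (M, N)

# Toy usage: two microphones with short synthetic room filters.
x = rng.normal(size=256)
h1 = np.r_[1.0, np.zeros(19), 0.4]         # direct path plus one echo
h2 = np.r_[0.0, 1.0, np.zeros(30), 0.25]   # delayed direct path plus a weaker echo
Y = simulate_microphones(x, [h1, h2])
```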

43 Noise may originate from both microphone responses and from environmental sources. [sent-63, score-0.455]

44 In a given environment, the task is to provide an optimal estimate of the clean speech signal $x$ from the noisy microphone signals $y^i$. [sent-64, score-1.347]

45 This requires the estimation of the convolving filters $h^i$ and the characteristics of the noise $u^i$. [sent-65, score-0.234]

46 This estimation is accomplished by Bayesian inference on probabilistic models for $x$ and $u^i$. [sent-66, score-0.168]

47 2 Probabilistic Signal Models We now turn to our model for the speech source. [sent-67, score-0.61]

48 Much past work on speech denoising has employed very simple source models: AR or ARMA descriptions [6]. [sent-68, score-0.933]

49 These simple denoising models incorporate very little information on the structure of speech. [sent-70, score-0.286]

50 Such an approach a priori allows any value for the model coefficients, including values that are unlikely to occur in a speech signal. [sent-71, score-0.644]

51 Without a strong prior, it is difficult to estimate the convolving filters accurately because of identifiability problems. [sent-72, score-0.154]

52 A source prior is especially important in the single-microphone case, where $N$ clean samples plus model coefficients must be estimated from only $N$ noisy samples. [sent-73, score-0.909]

53 Thus, the absence of a strong speech model degrades reconstruction quality. [sent-74, score-0.706]

54 The most detailed statistical speech models available are those employed by state-of-the-art speech recognition engines. [sent-75, score-1.194]

55 These systems are generally based on mixtures of diagonal Gaussians in the mel-cepstral domain. [sent-76, score-0.045]

56 These models are endowed with temporal Markov dynamics and have a very large (f'. [sent-77, score-0.045]

57 However, in the mel-cepstral domain, the noisy reverberant speech has a strong non-linear relationship to the clean speech. [sent-79, score-1.031]

58 In this paper, we work in the linear time/frequency domain using a statistical model and take an intermediate approach regarding the model size. [sent-81, score-0.187]

59 We model speech production with an AR(p) model: $x_n = \sum_{m=1}^{p} a_m x_{n-m} + v_n$ (4), where the coefficients $a_m$ are related to the physical shape of a "lossless tube" model of the vocal tract. [sent-82, score-0.912]

60 To turn this physical model into a probabilistic model, we assume that the $v_n$ are independent zero-mean Gaussian variables with scalar precision $v$. [sent-83, score-0.29]
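
A minimal sketch of drawing samples from the AR(p) production model (4) with Gaussian excitation of precision $v$; the low-order coefficients used here are purely illustrative, not trained values.

```python
import numpy as np

def sample_ar(a, v, N, rng=np.random.default_rng(1)):
    """Sample x_1..x_N from (4): x_n = sum_{m=1}^p a_m x_{n-m} + v_n, with v_n ~ N(0, 1/v)."""
    p = len(a)
    x = np.zeros(N + p)                                    # p leading zeros as initial conditions
    excitation = rng.normal(scale=1.0 / np.sqrt(v), size=N)
    for n in range(N):
        past = x[n : n + p][::-1]                          # x_{n-1}, ..., x_{n-p} in padded coordinates
        x[n + p] = a @ past + excitation[n]
    return x[p:]

# Hypothetical AR(2) example: a damped resonance, loosely formant-like.
frame = sample_ar(np.array([1.6, -0.8]), v=100.0, N=256)
```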

61 Given $\theta$, the joint distribution of $x$ is generally a zero-mean Gaussian, $p(x \mid \theta) = \mathcal{N}(x \mid 0, A)$, where $A$ is the $N \times N$ precision matrix. [sent-91, score-0.135]

62 Specifically, the joint distribution is given by the product $p(x \mid \theta) = \prod_n \mathcal{N}(x_n \mid \sum_m a_m x_{n-m},\, v)$ (5). [sent-92, score-0.042]

63 Probabilistic model in the frequency domain. [sent-93, score-0.113]

64 However, rather than employing this product form directly, we work in the frequency domain and use the DFT to write $p(x \mid \theta) \propto \exp\!\left(-\frac{v}{2N}\sum_{k=0}^{N-1} |a'_k|^2\, |x_k|^2\right)$ (6), where $a'_k$ is defined in (1). [sent-94, score-0.22]

65 The precision matrix $A$ is now given by an inverse DFT, $A_{nm} = (v/N)\sum_k e^{i\omega_k(n-m)}\, |a'_k|^2$. [sent-95, score-0.145]
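
To make the frequency-domain bookkeeping concrete, the sketch below evaluates $a'_k$ from (1) with an FFT and assembles the circulant Toeplitz precision matrix $A$ via the inverse DFT above; the AR coefficients and sizes are illustrative inputs only.

```python
import numpy as np

def ar_prime(a, N):
    """a'_k = 1 - sum_{n=1}^p a_n e^{-i w_k n} at w_k = 2*pi*k/N, as in equation (1)."""
    p = len(a)
    coeffs = np.zeros(N)
    coeffs[0] = 1.0
    coeffs[1 : p + 1] = -a
    return np.fft.fft(coeffs)

def circulant_precision(a, v, N):
    """A_nm = (v/N) sum_k e^{i w_k (n-m)} |a'_k|^2: a circulant Toeplitz precision matrix."""
    power = np.abs(ar_prime(a, N)) ** 2
    first_col = v * np.fft.ifft(power).real                 # ifft supplies the 1/N factor
    idx = (np.arange(N)[:, None] - np.arange(N)[None, :]) % N
    return first_col[idx]

A = circulant_precision(np.array([1.6, -0.8]), v=100.0, N=256)
```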

66 This matrix belongs to a sub-class of Toeplitz matrices called circulant Toeplitz. [sent-96, score-0.07]

67 To complete our speech model, we must specify a distribution over the speech production parameters $\theta$. [sent-99, score-1.266]

68 We use an $S$-state mixture model with a Normal-Gamma distribution (2) for each component $s = 1, \dots, S$: $p(\theta \mid s) = \mathcal{N}(a_1, \dots, a_p \mid \mu_s, vV_s)\,\mathcal{G}(v \mid \alpha_s, \beta_s)$. [sent-100, score-0.087]

69 This form is chosen by invoking the idea of a conjugate prior, which is defined as follows. [sent-101, score-0.118]

70 Given the model $p(x \mid \theta)\,p(\theta \mid s)$, the prior $p(\theta \mid s)$ is conjugate to $p(x \mid \theta)$ iff the posterior $p(\theta \mid x, s)$, computed by Bayes' rule, has the same functional form as the prior. [sent-102, score-0.137]
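
As a minimal worked illustration of this definition (a single scalar observation, using the Normal-Gamma conventions of (2); this standard textbook derivation is added here for clarity and is not taken from the paper):

```latex
% Scalar case: likelihood N(x | a, v) with v the precision, prior NG(a, v | mu, V, alpha, beta).
p(a, v \mid x)
  \;\propto\; \mathcal{N}(x \mid a, v)\,\mathcal{N}(a \mid \mu, vV)\,\mathcal{G}(v \mid \alpha, \beta)
  \;\propto\; \mathcal{NG}\!\left(a, v \,\Big|\; \tfrac{x + V\mu}{1+V},\; 1+V,\; \alpha + 1,\;
              \beta + \tfrac{V (x - \mu)^2}{1+V}\right),
% so the posterior keeps the Normal-Gamma form: the family is conjugate to the Gaussian likelihood.
```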

71 This choice has the advantage of being quite general while keeping the clean speech model analytically tractable. [sent-103, score-0.802]

72 It turns out, as discussed below, that significant computational savings result if we restrict the $p \times p$ precision matrices $V_s$ to have a circulant Toeplitz structure. [sent-104, score-0.19]

73 To do this without having to impose an explicit constraint, we reparametrize $p(\theta \mid s)$ in terms of $\xi^s_k, \eta^s_k$ instead of $\mu^s_n, V^s_{nm}$, and work in the frequency domain: $p(\theta \mid s) \propto \exp\!\left(-\frac{v}{2p}\sum_{k=0}^{p-1} \bigl|\xi^s_k\, a'_k - \eta^s_k\bigr|^2\right)\mathcal{G}(v \mid \alpha_s, \beta_s)$. [sent-105, score-0.064]

74 The precisions are now given by the inverse DFT $V^s_{nm} = (1/p)\sum_k e^{i\omega_k(n-m)}\, |\xi^s_k|^2$ and are manifestly circulant. [sent-107, score-0.107]

75 Finally, the mixing fractions are given by $p(s) = \pi_s$. [sent-109, score-0.066]

76 This completes the specification of our clean speech model $p(x)$ in terms of the latent variable model $p(x, \theta, s) = p(x \mid \theta)\,p(\theta \mid s)\,p(s)$. [sent-110, score-0.853]

77 The model is parametrized by $W = (\xi^s_k, \eta^s_k, \alpha_s, \beta_s, \pi_s)$. [sent-111, score-0.049]

78 We pre-train the speech model parameters W using 10000 sentences of the Wall Street Journal corpus, recorded with a close-talking microphone for 150 male and female speakers of North American English. [sent-113, score-1.113]

79 We used 16 msec overlapping frames with $N = 256$ time points at a 16 kHz sampling rate. [sent-114, score-0.03]

80 Training was performed using an EM algorithm derived specifically for this model [3]. [sent-115, score-0.049]

81 $W$ was initialized by extracting the AR(p) coefficients from each frame using the autocorrelation method. [sent-117, score-0.138]

82 These coefficients were converted into cepstral coefficients, and clustered into $S$ classes by k-means clustering. [sent-118, score-0.03]

83 We then considered the corresponding hard clusters of the AR(p) coefficients, and separately fit a model $p(\theta \mid s)$ (7) to each. [sent-119, score-0.084]

84 The resulting parameters were used as initial values for the full EM algorithm. [sent-120, score-0.068]
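
A rough sketch of the initialization recipe in sentences 81-83. The Levinson-Durbin routine, the LPC-to-cepstrum recursion, and SciPy's k-means are standard stand-ins assumed here; the authors' exact implementation is not specified beyond the description above.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def ar_by_autocorrelation(frame, p):
    """AR(p) coefficients and excitation variance via Levinson-Durbin on the biased autocorrelation."""
    n = len(frame)
    r = np.array([frame[: n - m] @ frame[m:] for m in range(p + 1)]) / n
    a, err = np.zeros(p), r[0]
    for m in range(p):
        k = (r[m + 1] - a[:m] @ r[1 : m + 1][::-1]) / err
        a[:m] = a[:m] - k * a[:m][::-1]
        a[m] = k
        err *= 1.0 - k * k
    return a, err

def lpc_to_cepstrum(a):
    """Standard recursion from AR coefficients to (same-order) cepstral coefficients."""
    p = len(a)
    c = np.zeros(p)
    for n in range(1, p + 1):
        c[n - 1] = a[n - 1] + sum((k / n) * c[k - 1] * a[n - k - 1] for k in range(1, n))
    return c

def initialize_clusters(frames, p=12, S=8):
    """Per-frame AR fits, cepstral conversion, and hard k-means clusters; each cluster seeds one p(theta | s)."""
    ar = np.array([ar_by_autocorrelation(f, p)[0] for f in frames])
    cep = np.array([lpc_to_cepstrum(row) for row in ar])
    _, labels = kmeans2(cep, S, minit='++')
    return ar, labels
```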

85 In this paper, we use an AR(q) description for the noise recorded by microphone $i$: $u^i_n = \sum_m b^i_m\, u^i_{n-m} + w^i_n$. [sent-122, score-0.54]

86 The noise parameters are $\phi^i = (b^i_m, \lambda^i)$, where $\lambda^i$ are the precisions of the zero-mean Gaussian excitations $w^i_n$. [sent-123, score-0.219]

87 In the frequency domain we have the joint distribution $p(u^i \mid \phi^i) \propto \exp\!\left(-\frac{\lambda^i}{2N}\sum_{k=0}^{N-1} |b'^i_k|^2\, |u^i_k|^2\right)$ (8). [sent-124, score-0.195]

88 As in (6), the parameters $\phi^i$ determine the spectra of the noise. [sent-125, score-0.035]
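
To make this last remark concrete: under (8) the implied noise power spectrum at bin $k$ is proportional to $1/(\lambda^i |b'^i_k|^2)$, so the AR(q) coefficients directly shape the spectrum. A small sketch with made-up AR(1) noise coefficients:

```python
import numpy as np

def noise_spectrum(b, lam, N):
    """Power spectrum implied by (8): proportional to 1 / (lam * |b'_k|^2) per frequency bin."""
    coeffs = np.zeros(N)
    coeffs[0] = 1.0
    coeffs[1 : len(b) + 1] = -b          # b'_k = 1 - sum_m b_m e^{-i w_k m}
    b_prime = np.fft.fft(coeffs)
    return 1.0 / (lam * np.abs(b_prime) ** 2)

spectrum = noise_spectrum(np.array([0.9]), lam=50.0, N=256)   # low-pass, AR(1)-shaped noise
```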

89 But unlike the speech model, the AR(q) noise model is chosen for mathematical convenience rather than for its relation to an underlying physical model. [sent-126, score-0.775]

90 The form (8) now implies that, given the clean speech $x$, the distribution of the data $y^i$ is $p(y^i \mid x) \propto \exp\!\left(-\frac{\lambda^i}{2N}\sum_{k=0}^{N-1} |b'^i_k|^2\, \bigl|y^i_k - h^i_k\, x_k\bigr|^2\right)$ (9), where $y^i_k$, $x_k$, and $h^i_k$ are the DFTs of the microphone signal, the clean speech, and the filter. [sent-128, score-0.799]

91 This completes the specification of our noisy speech model $p(y)$ in terms of the joint distribution $\prod_i p(y^i \mid x)\,p(x \mid \theta)\,p(\theta \mid s)\,p(s)$. [sent-133, score-0.859]

92 3 Variational Speech Enhancement (VSE) Algorithm The denoising and dereverberation task is accomplished by estimating the clean speech $x$, which requires estimating the speech parameters $\theta$, the filter coefficients $h^i$, and the noise parameters $\phi^i$. [sent-134, score-2.053]

93 This algorithm receives the data yi from an utterance (a long sequence of frames) as input and proceeds iteratively. [sent-136, score-0.046]

94 In the E-step, the algorithm computes the sufficient statistics of the clean speech $x$ and the production parameters $\theta$ for each frame. [sent-137, score-0.897]
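
For intuition about the clean-speech part of this E-step, the sketch below works out the special case of a single microphone with $\theta$, $h$, and $\phi$ held fixed: (6) and (9) then make each DFT bin of $x$ Gaussian, and the posterior mean is a Wiener-type filter. This is only an illustrative reduction, not the paper's full variational update, which also maintains uncertainty over $\theta$ and $s$ (see [3]).

```python
import numpy as np

def posterior_clean_speech(y, a, v, h, b, lam):
    """Per-bin Gaussian posterior of the clean speech, single microphone, fixed parameters.

    Prior precision of x_k is proportional to v|a'_k|^2 (from (6)); the likelihood (9)
    contributes lam|b'_k|^2 |h_k|^2, giving a Wiener-style posterior mean. Illustrative only."""
    N = len(y)
    def prime(c):                                   # 1 - sum_n c_n e^{-i w_k n}, as in (1)
        z = np.zeros(N); z[0] = 1.0; z[1 : len(c) + 1] = -c
        return np.fft.fft(z)
    A = v * np.abs(prime(a)) ** 2                   # prior precision per bin (common 1/N dropped)
    B = lam * np.abs(prime(b)) ** 2                 # inverse noise spectrum per bin
    H = np.fft.fft(h, N)                            # filter frequency response
    Y = np.fft.fft(y)
    post_precision = A + B * np.abs(H) ** 2
    x_hat = (B * np.conj(H) * Y) / post_precision   # posterior mean of each x_k
    return np.fft.ifft(x_hat).real                  # estimated clean frame
```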


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('speech', 0.561), ('microphone', 0.379), ('denoising', 0.241), ('clean', 0.192), ('dereverberation', 0.181), ('enhancement', 0.164), ('ar', 0.137), ('production', 0.109), ('reverberation', 0.109), ('noisy', 0.107), ('dff', 0.105), ('reverberant', 0.105), ('source', 0.104), ('noise', 0.099), ('precision', 0.093), ('domain', 0.089), ('conjugate', 0.088), ('fl', 0.085), ('array', 0.085), ('probabilistic', 0.082), ('coefficients', 0.078), ('environmental', 0.076), ('circulant', 0.07), ('modelp', 0.07), ('toeplitz', 0.07), ('employing', 0.067), ('physical', 0.066), ('mixing', 0.066), ('strong', 0.066), ('frequency', 0.064), ('recorded', 0.062), ('convolving', 0.06), ('convolution', 0.06), ('signals', 0.056), ('exp', 0.055), ('xk', 0.055), ('vn', 0.055), ('completes', 0.055), ('precisions', 0.055), ('xn', 0.054), ('signal', 0.052), ('inverse', 0.052), ('handle', 0.051), ('unified', 0.051), ('framework', 0.049), ('model', 0.049), ('em', 0.048), ('microphones', 0.047), ('hi', 0.047), ('yi', 0.046), ('models', 0.045), ('specification', 0.045), ('variational', 0.044), ('joint', 0.042), ('accomplished', 0.041), ('blind', 0.041), ('instantaneous', 0.041), ('speaker', 0.039), ('microsoft', 0.039), ('ap', 0.038), ('principled', 0.038), ('clusters', 0.035), ('parameters', 0.035), ('including', 0.034), ('full', 0.033), ('efficiency', 0.033), ('gaussian', 0.033), ('algorithms', 0.031), ('omitted', 0.031), ('presents', 0.031), ('environment', 0.031), ('separation', 0.03), ('frame', 0.03), ('frames', 0.03), ('primed', 0.03), ('invoking', 0.03), ('deployed', 0.03), ('excitations', 0.03), ('facilitating', 0.03), ('originates', 0.03), ('convolutive', 0.03), ('versa', 0.03), ('noticeable', 0.03), ('degrades', 0.03), ('acero', 0.03), ('dft', 0.03), ('hagai', 0.03), ('autocorrelation', 0.03), ('clustered', 0.03), ('vv', 0.03), ('conjugacy', 0.03), ('emitted', 0.03), ('deng', 0.03), ('filter', 0.029), ('filters', 0.028), ('turns', 0.027), ('employed', 0.027), ('impulse', 0.027), ('sentences', 0.027), ('remained', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 123 nips-2000-Speech Denoising and Dereverberation Using Probabilistic Models

Author: Hagai Attias, John C. Platt, Alex Acero, Li Deng

Abstract: This paper presents a unified probabilistic framework for denoising and dereverberation of speech signals. The framework transforms the denoising and dereverberation problems into Bayes-optimal signal estimation. The key idea is to use a strong speech model that is pre-trained on a large data set of clean speech. Computational efficiency is achieved by using variational EM, working in the frequency domain, and employing conjugate priors. The framework covers both single and multiple microphones. We apply this approach to noisy reverberant speech signals and get results substantially better than standard methods.

2 0.31004018 91 nips-2000-Noise Suppression Based on Neurophysiologically-motivated SNR Estimation for Robust Speech Recognition

Author: Jürgen Tchorz, Michael Kleinschmidt, Birger Kollmeier

Abstract: A novel noise suppression scheme for speech signals is proposed which is based on a neurophysiologically-motivated estimation of the local signal-to-noise ratio (SNR) in different frequency channels. For SNR-estimation, the input signal is transformed into so-called Amplitude Modulation Spectrograms (AMS), which represent both spectral and temporal characteristics of the respective analysis frame, and which imitate the representation of modulation frequencies in higher stages of the mammalian auditory system. A neural network is used to analyse AMS patterns generated from noisy speech and estimates the local SNR. Noise suppression is achieved by attenuating frequency channels according to their SNR. The noise suppression algorithm is evaluated in speakerindependent digit recognition experiments and compared to noise suppression by Spectral Subtraction. 1

3 0.23587736 96 nips-2000-One Microphone Source Separation

Author: Sam T. Roweis

Abstract: Source separation, or computational auditory scene analysis, attempts to extract individual acoustic objects from input which contains a mixture of sounds from different sources, altered by the acoustic environment. Unmixing algorithms such as ICA and its extensions recover sources by reweighting multiple observation sequences, and thus cannot operate when only a single observation signal is available. I present a technique called refiltering which recovers sources by a nonstationary reweighting (

4 0.18839677 90 nips-2000-New Approaches Towards Robust and Adaptive Speech Recognition

Author: Hervé Bourlard, Samy Bengio, Katrin Weber

Abstract: In this paper, we discuss some new research directions in automatic speech recognition (ASR), and which somewhat deviate from the usual approaches. More specifically, we will motivate and briefly describe new approaches based on multi-stream and multi-band ASR. These approaches extend the standard hidden Markov model (HMM) based approach by assuming that the different (frequency) channels representing the speech signal are processed by different (independent)

5 0.17848183 65 nips-2000-Higher-Order Statistical Properties Arising from the Non-Stationarity of Natural Signals

Author: Lucas C. Parra, Clay Spence, Paul Sajda

Abstract: We present evidence that several higher-order statistical properties of natural images and signals can be explained by a stochastic model which simply varies scale of an otherwise stationary Gaussian process. We discuss two interesting consequences. The first is that a variety of natural signals can be related through a common model of spherically invariant random processes, which have the attractive property that the joint densities can be constructed from the one dimensional marginal. The second is that in some cases the non-stationarity assumption and only second order methods can be explicitly exploited to find a linear basis that is equivalent to independent components obtained with higher-order methods. This is demonstrated on spectro-temporal components of speech. 1

6 0.13766128 78 nips-2000-Learning Joint Statistical Models for Audio-Visual Fusion and Segregation

7 0.12828471 99 nips-2000-Periodic Component Analysis: An Eigenvalue Method for Representing Periodic Structure in Speech

8 0.12753493 84 nips-2000-Minimum Bayes Error Feature Selection for Continuous Speech Recognition

9 0.11901182 51 nips-2000-Factored Semi-Tied Covariance Matrices

10 0.099587902 33 nips-2000-Combining ICA and Top-Down Attention for Robust Speech Recognition

11 0.093430974 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning

12 0.075288944 89 nips-2000-Natural Sound Statistics and Divisive Normalization in the Auditory System

13 0.073666923 46 nips-2000-Ensemble Learning and Linear Response Theory for ICA

14 0.068812408 28 nips-2000-Balancing Multiple Sources of Reward in Reinforcement Learning

15 0.068536893 2 nips-2000-A Comparison of Image Processing Techniques for Visual Speech Recognition Applications

16 0.058867402 27 nips-2000-Automatic Choice of Dimensionality for PCA

17 0.057346411 92 nips-2000-Occam's Razor

18 0.05282101 122 nips-2000-Sparse Representation for Gaussian Process Models

19 0.051395945 138 nips-2000-The Use of Classifiers in Sequential Inference

20 0.049599975 86 nips-2000-Model Complexity, Goodness of Fit and Diminishing Returns


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.216), (1, -0.128), (2, 0.172), (3, 0.218), (4, -0.087), (5, -0.223), (6, -0.423), (7, -0.078), (8, -0.003), (9, 0.015), (10, 0.117), (11, 0.058), (12, -0.059), (13, -0.038), (14, 0.077), (15, -0.008), (16, 0.045), (17, 0.023), (18, -0.045), (19, -0.054), (20, -0.067), (21, 0.028), (22, -0.007), (23, 0.04), (24, -0.026), (25, 0.041), (26, 0.015), (27, 0.02), (28, -0.052), (29, 0.021), (30, -0.009), (31, 0.026), (32, 0.023), (33, 0.025), (34, 0.023), (35, -0.087), (36, 0.074), (37, 0.035), (38, -0.062), (39, 0.004), (40, 0.034), (41, -0.009), (42, -0.004), (43, 0.011), (44, -0.021), (45, -0.06), (46, -0.004), (47, 0.019), (48, 0.027), (49, -0.035)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97351402 123 nips-2000-Speech Denoising and Dereverberation Using Probabilistic Models

Author: Hagai Attias, John C. Platt, Alex Acero, Li Deng

Abstract: This paper presents a unified probabilistic framework for denoising and dereverberation of speech signals. The framework transforms the denoising and dereverberation problems into Bayes-optimal signal estimation. The key idea is to use a strong speech model that is pre-trained on a large data set of clean speech. Computational efficiency is achieved by using variational EM, working in the frequency domain, and employing conjugate priors. The framework covers both single and multiple microphones. We apply this approach to noisy reverberant speech signals and get results substantially better than standard methods.

2 0.90525842 91 nips-2000-Noise Suppression Based on Neurophysiologically-motivated SNR Estimation for Robust Speech Recognition

Author: Jürgen Tchorz, Michael Kleinschmidt, Birger Kollmeier

Abstract: A novel noise suppression scheme for speech signals is proposed which is based on a neurophysiologically-motivated estimation of the local signal-to-noise ratio (SNR) in different frequency channels. For SNR-estimation, the input signal is transformed into so-called Amplitude Modulation Spectrograms (AMS), which represent both spectral and temporal characteristics of the respective analysis frame, and which imitate the representation of modulation frequencies in higher stages of the mammalian auditory system. A neural network is used to analyse AMS patterns generated from noisy speech and estimates the local SNR. Noise suppression is achieved by attenuating frequency channels according to their SNR. The noise suppression algorithm is evaluated in speakerindependent digit recognition experiments and compared to noise suppression by Spectral Subtraction. 1

3 0.78759325 90 nips-2000-New Approaches Towards Robust and Adaptive Speech Recognition

Author: Hervé Bourlard, Samy Bengio, Katrin Weber

Abstract: In this paper, we discuss some new research directions in automatic speech recognition (ASR), and which somewhat deviate from the usual approaches. More specifically, we will motivate and briefly describe new approaches based on multi-stream and multi-band ASR. These approaches extend the standard hidden Markov model (HMM) based approach by assuming that the different (frequency) channels representing the speech signal are processed by different (independent)

4 0.77210099 96 nips-2000-One Microphone Source Separation

Author: Sam T. Roweis

Abstract: Source separation, or computational auditory scene analysis, attempts to extract individual acoustic objects from input which contains a mixture of sounds from different sources, altered by the acoustic environment. Unmixing algorithms such as ICA and its extensions recover sources by reweighting multiple observation sequences, and thus cannot operate when only a single observation signal is available. I present a technique called refiltering which recovers sources by a nonstationary reweighting (

5 0.66608512 99 nips-2000-Periodic Component Analysis: An Eigenvalue Method for Representing Periodic Structure in Speech

Author: Lawrence K. Saul, Jont B. Allen

Abstract: An eigenvalue method is developed for analyzing periodic structure in speech. Signals are analyzed by a matrix diagonalization reminiscent of methods for principal component analysis (PCA) and independent component analysis (ICA). Our method-called periodic component analysis (1l

6 0.64034837 65 nips-2000-Higher-Order Statistical Properties Arising from the Non-Stationarity of Natural Signals

7 0.50958687 84 nips-2000-Minimum Bayes Error Feature Selection for Continuous Speech Recognition

8 0.41018867 51 nips-2000-Factored Semi-Tied Covariance Matrices

9 0.34689105 78 nips-2000-Learning Joint Statistical Models for Audio-Visual Fusion and Segregation

10 0.31188801 89 nips-2000-Natural Sound Statistics and Divisive Normalization in the Auditory System

11 0.30031154 46 nips-2000-Ensemble Learning and Linear Response Theory for ICA

12 0.29440054 138 nips-2000-The Use of Classifiers in Sequential Inference

13 0.28618228 33 nips-2000-Combining ICA and Top-Down Attention for Robust Speech Recognition

14 0.26405963 27 nips-2000-Automatic Choice of Dimensionality for PCA

15 0.25006011 28 nips-2000-Balancing Multiple Sources of Reward in Reinforcement Learning

16 0.23198949 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning

17 0.22982819 53 nips-2000-Feature Correspondence: A Markov Chain Monte Carlo Approach

18 0.22522669 92 nips-2000-Occam's Razor

19 0.21916637 85 nips-2000-Mixtures of Gaussian Processes

20 0.20631492 115 nips-2000-Sequentially Fitting ``Inclusive'' Trees for Inference in Noisy-OR Networks


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.025), (8, 0.311), (10, 0.027), (17, 0.097), (26, 0.016), (32, 0.022), (33, 0.044), (62, 0.039), (65, 0.03), (67, 0.052), (75, 0.012), (76, 0.052), (79, 0.047), (81, 0.033), (90, 0.047), (91, 0.027), (97, 0.034)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.78936386 123 nips-2000-Speech Denoising and Dereverberation Using Probabilistic Models

Author: Hagai Attias, John C. Platt, Alex Acero, Li Deng

Abstract: This paper presents a unified probabilistic framework for denoising and dereverberation of speech signals. The framework transforms the denoising and dereverberation problems into Bayes-optimal signal estimation. The key idea is to use a strong speech model that is pre-trained on a large data set of clean speech. Computational efficiency is achieved by using variational EM, working in the frequency domain, and employing conjugate priors. The framework covers both single and multiple microphones. We apply this approach to noisy reverberant speech signals and get results substantially better than standard methods.

2 0.42295158 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning

Author: Zoubin Ghahramani, Matthew J. Beal

Abstract: Variational approximations are becoming a widespread tool for Bayesian learning of graphical models. We provide some theoretical results for the variational updates in a very general family of conjugate-exponential graphical models. We show how the belief propagation and the junction tree algorithms can be used in the inference step of variational Bayesian learning. Applying these results to the Bayesian analysis of linear-Gaussian state-space models we obtain a learning procedure that exploits the Kalman smoothing propagation, while integrating over all model parameters. We demonstrate how this can be used to infer the hidden state dimensionality of the state-space model in a variety of synthetic problems and one real high-dimensional data set. 1

3 0.41933045 74 nips-2000-Kernel Expansions with Unlabeled Examples

Author: Martin Szummer, Tommi Jaakkola

Abstract: Modern classification applications necessitate supplementing the few available labeled examples with unlabeled examples to improve classification performance. We present a new tractable algorithm for exploiting unlabeled examples in discriminative classification. This is achieved essentially by expanding the input vectors into longer feature vectors via both labeled and unlabeled examples. The resulting classification method can be interpreted as a discriminative kernel density estimate and is readily trained via the EM algorithm, which in this case is both discriminative and achieves the optimal solution. We provide, in addition, a purely discriminative formulation of the estimation problem by appealing to the maximum entropy framework. We demonstrate that the proposed approach requires very few labeled examples for high classification accuracy.

4 0.41528949 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm

Author: Claudio Gentile

Abstract: A new incremental learning algorithm is described which approximates the maximal margin hyperplane w.r.t. norm $p \ge 2$ for a set of linearly separable data. Our algorithm, called ALMA_p (Approximate Large Margin algorithm w.r.t. norm p), takes $O\!\left(\frac{(p-1)X^2}{\alpha^2 \gamma^2}\right)$ corrections to separate the data with p-norm margin larger than $(1 - \alpha)\gamma$, where $\gamma$ is the p-norm margin of the data and $X$ is a bound on the p-norm of the instances. ALMA_p avoids quadratic (or higher-order) programming methods. It is very easy to implement and is as fast as on-line algorithms, such as Rosenblatt's perceptron. We report on some experiments comparing ALMA_p to two incremental algorithms: Perceptron and Li and Long's ROMMA. Our algorithm seems to perform quite better than both. The accuracy levels achieved by ALMA_p are slightly inferior to those obtained by Support Vector Machines (SVMs). On the other hand, ALMA_p is quite faster and easier to implement than standard SVM training algorithms.

5 0.41105369 4 nips-2000-A Linear Programming Approach to Novelty Detection

Author: Colin Campbell, Kristin P. Bennett

Abstract: Novelty detection involves modeling the normal behaviour of a system hence enabling detection of any divergence from normality. It has potential applications in many areas such as detection of machine damage or highlighting abnormal features in medical data. One approach is to build a hypothesis estimating the support of the normal data i. e. constructing a function which is positive in the region where the data is located and negative elsewhere. Recently kernel methods have been proposed for estimating the support of a distribution and they have performed well in practice - training involves solution of a quadratic programming problem. In this paper we propose a simpler kernel method for estimating the support based on linear programming. The method is easy to implement and can learn large datasets rapidly. We demonstrate the method on medical and fault detection datasets. 1 Introduction. An important classification task is the ability to distinguish b etween new instances similar to m embers of the training set and all other instances that can occur. For example, we may want to learn the normal running behaviour of a machine and highlight any significant divergence from normality which may indicate onset of damage or faults. This issue is a generic problem in many fields. For example, an abnormal event or feature in medical diagnostic data typically leads to further investigation. Novel events can be highlighted by constructing a real-valued density estimation function. However, here we will consider the simpler task of modelling the support of a data distribution i.e. creating a binary-valued function which is positive in those regions of input space where the data predominantly lies and negative elsewhere. Recently kernel methods have been applied to this problem [4]. In this approach data is implicitly mapped to a high-dimensional space called feature space [13]. Suppose the data points in input space are X i (with i = 1, . . . , m) and the mapping is Xi --+ ¢;(Xi) then in the span of {¢;(Xi)}, we can expand a vector w = Lj cr.j¢;(Xj). Hence we can define separating hyperplanes in feature space by w . ¢;(x;) + b = O. We will refer to w . ¢;(Xi) + b as the margin which will be positive on one side of the separating hyperplane and negative on the other. Thus we can also define a decision function: (1) where z is a new data point. The data appears in the form of an inner product in feature space so we can implicitly define feature space by our choice of kernel function: (2) A number of choices for the kernel are possible, for example, RBF kernels: (3) With the given kernel the decision function is therefore given by: (4) One approach to novelty detection is to find a hypersphere in feature space with a minimal radius R and centre a which contains most of the data: novel test points lie outside the boundary of this hypersphere [3 , 12] . This approach to novelty detection was proposed by Tax and Duin [10] and successfully used on real life applications [11] . The effect of outliers is reduced by using slack variables to allow for datapoints outside the sphere and the task is to minimise the volume of the sphere and number of datapoints outside i.e. e i mIll s.t. [R2 + oX L i ei 1 (Xi - a) . (Xi - a) S R2 + e ei i, ~ a (5) Since the data appears in the form of inner products kernel substitution can be applied and the learning task can be reduced to a quadratic programming problem. An alternative approach has been developed by Scholkopf et al. [7]. 
Suppose we restricted our attention to RBF kernels (3) then the data lies on the surface of a hypersphere in feature space since ¢;(x) . ¢;(x) = K(x , x) = l. The objective is therefore to separate off the surface region constaining data from the region containing no data. This is achieved by constructing a hyperplane which is maximally distant from the origin with all datapoints lying on the opposite side from the origin and such that the margin is positive. The learning task in dual form involves minimisation of: mIll s.t. W(cr.) = t L7,'k=l cr.icr.jK(Xi, Xj) a S cr.i S C, L::1 cr.i = l. (6) However, the origin plays a special role in this model. As the authors point out [9] this is a disadvantage since the origin effectively acts as a prior for where the class of abnormal instances is assumed to lie. In this paper we avoid this problem: rather than repelling the hyperplane away from an arbitrary point outside the data distribution we instead try and attract the hyperplane towards the centre of the data distribution. In this paper we will outline a new algorithm for novelty detection which can be easily implemented using linear programming (LP) techniques. As we illustrate in section 3 it performs well in practice on datasets involving the detection of abnormalities in medical data and fault detection in condition monitoring. 2 The Algorithm For the hard margin case (see Figure 1) the objective is to find a surface in input space which wraps around the data clusters: anything outside this surface is viewed as abnormal. This surface is defined as the level set, J(z) = 0, of some nonlinear function. In feature space, J(z) = L; O'.;K(z, x;) + b, this corresponds to a hyperplane which is pulled onto the mapped datapoints with the restriction that the margin always remains positive or zero. We make the fit of this nonlinear function or hyperplane as tight as possible by minimizing the mean value of the output of the function, i.e., Li J(x;). This is achieved by minimising: (7) subject to: m LO'.jK(x;,Xj) + b 2:: 0 (8) j=l m L 0'.; = 1, 0'.; 2:: 0 (9) ;=1 The bias b is just treated as an additional parameter in the minimisation process though unrestricted in sign. The added constraints (9) on 0'. bound the class of models to be considered - we don't want to consider simple linear rescalings of the model. These constraints amount to a choice of scale for the weight vector normal to the hyperplane in feature space and hence do not impose a restriction on the model. Also, these constraints ensure that the problem is well-posed and that an optimal solution with 0'. i- 0 exists. Other constraints on the class of functions are possible, e.g. 110'.111 = 1 with no restriction on the sign of O'.i. Many real-life datasets contain noise and outliers. To handle these we can introduce a soft margin in analogy to the usual approach used with support vector machines. In this case we minimise: (10) subject to: m LO:jJ{(Xi , Xj)+b~-ei' ei~O (11) j=l and constraints (9). The parameter). controls the extent of margin errors (larger ). means fewer outliers are ignored: ). -+ 00 corresponds to the hard margin limit). The above problem can be easily solved for problems with thousands of points using standard simplex or interior point algorithms for linear programming. With the addition of column generation techniques, these same approaches can be adopted for very large problems in which the kernel matrix exceeds the capacity of main memory. 
Column generation algorithms incrementally add and drop columns each corresponding to a single kernel function until optimality is reached. Such approaches have been successfully applied to other support vector problems [6 , 2]. Basic simplex algorithms were sufficient for the problems considered in this paper, so we defer a listing of the code for column generation to a later paper together with experiments on large datasets [1]. 3 Experiments Artificial datasets. Before considering experiments on real-life data we will first illustrate the performance of the algorithm on some artificial datasets. In Figure 1 the algorithm places a boundary around two data clusters in input space: a hard margin was used with RBF kernels and (J

6 0.40932772 94 nips-2000-On Reversing Jensen's Inequality

7 0.40931591 122 nips-2000-Sparse Representation for Gaussian Process Models

8 0.40751991 133 nips-2000-The Kernel Gibbs Sampler

9 0.40704215 17 nips-2000-Active Learning for Parameter Estimation in Bayesian Networks

10 0.40595055 37 nips-2000-Convergence of Large Margin Separable Linear Classification

11 0.40279147 79 nips-2000-Learning Segmentation by Random Walks

12 0.40161029 146 nips-2000-What Can a Single Neuron Compute?

13 0.40128344 104 nips-2000-Processing of Time Series by Neural Circuits with Biologically Realistic Synaptic Dynamics

14 0.40082499 92 nips-2000-Occam's Razor

15 0.40051413 52 nips-2000-Fast Training of Support Vector Classifiers

16 0.40031606 96 nips-2000-One Microphone Source Separation

17 0.39811403 21 nips-2000-Algorithmic Stability and Generalization Performance

18 0.39808285 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling

19 0.39805013 89 nips-2000-Natural Sound Statistics and Divisive Normalization in the Auditory System

20 0.3977097 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers