nips nips2000 nips2000-51 knowledge-graph by maker-knowledge-mining

51 nips-2000-Factored Semi-Tied Covariance Matrices


Source: pdf

Author: Mark J. F. Gales

Abstract: A new form of covariance modelling for Gaussian mixture models and hidden Markov models is presented. This is an extension to an efficient form of covariance modelling used in speech recognition, semi-tied covariance matrices. In the standard form of semi-tied covariance matrices the covariance matrix is decomposed into a highly shared decorrelating transform and a component-specific diagonal covariance matrix. The use of a factored decorrelating transform is presented in this paper. This factoring effectively increases the number of possible transforms without increasing the number of free parameters. Maximum likelihood estimation schemes for all the model parameters are presented including the component/transform assignment, transform and component parameters. This new model form is evaluated on a large vocabulary speech recognition task. It is shown that using this factored form of covariance modelling reduces the word error rate.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 uk Abstract A new form of covariance modelling for Gaussian mixture models and hidden Markov models is presented. [sent-8, score-0.328]

2 This is an extension to an efficient form of covariance modelling used in speech recognition, semi-tied covariance matrices. [sent-9, score-0.687]

3 In the standard form of semi-tied covariance matrices the covariance matrix is decomposed into a highly shared decorrelating transform and a component-specific diagonal covariance matrix. [sent-10, score-1.408]

4 The use of a factored decorrelating transform is presented in this paper. [sent-11, score-0.453]

5 This factoring effectively increases the number of possible transforms without increasing the number of free parameters. [sent-12, score-0.444]

6 Maximum likelihood estimation schemes for all the model parameters are presented including the component/transform assignment, transform and component parameters. [sent-13, score-0.617]

7 This new model form is evaluated on a large vocabulary speech recognition task. [sent-14, score-0.343]

8 It is shown that using this factored form of covariance modelling reduces the word error rate. [sent-15, score-0.595]

9 Solutions should be efficient both in terms of number of model parameters and cost of the likelihood calculation. [sent-17, score-0.168]

10 For speech recognition this is particularly important due to the large number of Gaussian components used, typically in the tens of thousands, and the relatively large dimensionality of the data, typically 30-60. [sent-18, score-0.24]

11 When the dimensionality of the nuisance dimensions is reduced to zero this generative model becomes equivalent to a semi-tied covariance matrix system [3] with a single, global, semi-tied class. [sent-29, score-0.695]

12 This generative model has a clear advantage during recognition compared to the standard linear Gaussian models [2] in the reduction in the computational cost of the likelihood calculation. [sent-30, score-0.389]

13 The likelihood for component m may be computed as [sent-31, score-0.253]

14 p(\mathbf{o}(\tau); m) = |\det(\mathbf{A})|\, l(\tau)\, \mathcal{N}(\mathbf{A}_{[1]}\mathbf{o}(\tau); \boldsymbol{\mu}^{(m)}, \boldsymbol{\Sigma}^{(m)}_{\mathrm{diag}}) \quad (3), where \boldsymbol{\mu}^{(m)} is the n_1-dimensional mean and \boldsymbol{\Sigma}^{(m)}_{\mathrm{diag}} the diagonal covariance matrix of Gaussian component m. [sent-33, score-0.711]

15 l(\tau) is the nuisance dimension likelihood, which is independent of the component being considered and only needs to be computed once for each time instance. [sent-34, score-0.334]

16 The initial normalisation term is only required during recognition when multiple transforms are used. [sent-35, score-0.475]

17 The dominant cost is a diagonal Gaussian computation for each component, O(n1) per component. [sent-36, score-0.195]
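As a concrete illustration of this cost argument, the following is a minimal numpy sketch (not code from the paper) of an equation (3) style likelihood evaluation: the retained rows of the decorrelating transform are applied once per frame, after which every component only needs an O(n_1) diagonal Gaussian computation. All names (frame_log_likelihoods, A_useful, and so on) are illustrative assumptions.

```python
import numpy as np

def frame_log_likelihoods(o, A_useful, means, variances, log_l_nuisance=0.0):
    """Per-component log-likelihoods for one observation frame.

    o              : (n,)     observation vector
    A_useful       : (n1, n)  the n1 retained rows of the decorrelating transform
    means          : (M, n1)  component means in the transformed space
    variances      : (M, n1)  component-specific diagonal variances
    log_l_nuisance : scalar nuisance-dimension log-likelihood l(tau),
                     shared by all components and computed once per frame
    """
    x = A_useful @ o                      # shared cost: one matrix-vector product per frame
    diff = x[None, :] - means             # (M, n1)
    log_gauss = -0.5 * (np.sum(np.log(2.0 * np.pi * variances), axis=1)
                        + np.sum(diff * diff / variances, axis=1))
    # A log-determinant normalisation term would be added here when several
    # transforms are in use; with a single global transform it is a constant offset.
    return log_gauss + log_l_nuisance
```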

18 In contrast a scheme such as factor analysis (a covariance modelling scheme from the linear Gaussian model in [7]) has a cost of O(n_1^2) per component (assuming there are n_1 factors). [sent-37, score-0.889]

19 The disadvantage of this form of generative model is that there is no simple expectation-maximisation (EM) [1] scheme for estimating the model parameters. [sent-38, score-0.297]

20 For some tasks, such as speech recognition where there are many different "sounds" to be recognised, it is unlikely that a single transform is sufficient to well model the data. [sent-40, score-0.433]

21 The standard approach for using multiple transforms is to assign each component, m, to a particular transform, F^{(T_m)}. [sent-42, score-0.454]

22 To simplify the description of the new scheme only modifications to the semi-tied covariance matrix scheme, where the nuisance dimension is zero, are considered. [sent-43, score-0.552]

23 The generative model is modified to be o(\tau) = F^{(T_m)} x(\tau), where T_m is the transform class associated with the generating component, m, at time instance \tau. [sent-44, score-0.302]

24 The assignment variable, T_m, may either be determined by an "expert", for example using phonetic context information, or it may be assigned in a maximum likelihood (ML) fashion [3]. [sent-45, score-0.403]

25 Although it is not strictly necessary to use diagonal covariance matrices, these currently dominate applications in speech recognition. [sent-46, score-0.504]

26 This paper uses the following convention: capital bold letters refer to matrices, e.g. [sent-48, score-0.192]

27 a_i is the ith row of matrix A, a_ij is the element of row i, column j of matrix A and b_i is element i of vector b. [sent-57, score-0.81]

28 x n matrix (n is the dimensionality of the feature vector and n. [sent-60, score-0.623]

29 Where subsets of the diagonal matrices are specified the matrices are square, e.g. [sent-62, score-0.924]

30 A^T is the transpose of the matrix and det(A) is the determinant of the matrix. [sent-65, score-0.859]

31 Simply increasing the number of transforms increases the number of model parameters to be estimated, hence reducing the robustness of the estimates. [sent-66, score-0.44]

32 In the limit there is a single transform per component, the standard full-covariance matrix case. [sent-68, score-0.354]

33 The approach adopted in this paper is to factor the transform into multiple streams. [sent-69, score-0.232]

34 Each component can then use a different transform for each stream. [sent-70, score-0.371]

35 Hence instead of using an assignment variable an assignment vector is used. [sent-71, score-0.596]

36 In order to maintain the efficient likelihood computation of equation 3, F(r)-l, rather than F(r), must be factored into rows. [sent-72, score-0.283]

37 In common with other factoring schemes this dramatically increases the effective number of transforms from which each component may select without increasing the number of transform parameters. [sent-74, score-0.896]

38 Though this paper only considers factoring semi-tied covariance matrices the extension to the "projection" schemes presented in [2] is straightforward. [sent-75, score-0.54]

39 This paper describes how to estimate the set of transforms and determine which subspaces a particular component should use. [sent-76, score-0.56]

40 The next section describes how to assign components to transforms and, given this assignment, how to estimate the appropriate transforms . [sent-77, score-0.723]

41 Some initial experiments on a large vocabulary speech recognition task are presented in the following section. [sent-78, score-0.367]

42 2 Factored Semi-Tied Covariance Matrices In order to factor semi-tied covariance matrices the inverse of the observation transformation for a component is broken into multiple streams. [sent-79, score-0.683]

43 The feature space of each stream is then determined by selecting from an inventory of possible transforms. [sent-80, score-0.234]

44 The effective full covariance matrix of component m, \Sigma^{(m)}, may be written as \Sigma^{(m)} = F^{(z^{(m)})} \Sigma^{(m)}_{\mathrm{diag}} F^{(z^{(m)})T}, where the form of F^{(z^{(m)})} is restricted so that its inverse A^{(z^{(m)})} is built by stacking the stream-specific rows A^{(z_s^{(m)})}_{[s]} (equation 4), and z^{(m)} is the S-dimensional assignment vector for component m. [sent-82, score-1.029]
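A minimal sketch, under assumed data structures, of how the restricted form of equation (4) can be realised: the inverse transform A = F^{-1} for a component is assembled by taking each stream's rows from the transform named by the assignment vector, and the effective full covariance then follows by inverting back. The names composite_inverse_transform, transforms and stream_rows are assumptions for illustration, not the paper's notation.

```python
import numpy as np

def composite_inverse_transform(z_m, transforms, stream_rows):
    """Build A^(z(m)) by stacking stream-specific rows.

    z_m         : length-S list, z_m[s] = index of the transform chosen for stream s
    transforms  : transforms[s][r] is an (n, n) candidate transform for stream s
    stream_rows : stream_rows[s] = row indices owned by stream s (assumed to partition 0..n-1)
    """
    n = transforms[0][0].shape[0]
    A = np.zeros((n, n))
    for s, r in enumerate(z_m):
        rows = stream_rows[s]
        A[rows, :] = transforms[s][r][rows, :]   # only stream s's rows are taken
    return A

def effective_covariance(A, sigma_diag):
    """Effective full covariance F Sigma_diag F^T with F = A^{-1}."""
    F = np.linalg.inv(A)
    return F @ np.diag(sigma_diag) @ F.T
```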

45 The complete set of model parameters, M, consists of the standard model parameters (the component means, [sent-83, score-0.311]

46 variances and weights) and, additionally, the set of transforms \{A^{(1)}_{[s]}, \ldots, A^{(R_s)}_{[s]}\} for each stream s (R_s is the number of transforms associated with stream s) and the assignment vector z^{(m)} for each component. [sent-86, score-1.406]

47 Note that the semi-tied covariance matrix scheme is the case when S = 1. [sent-87, score-0.471]

48 The likelihood is efficiently estimated by storing the transformed observations for each stream transform, i.e. storing A^{(r)}_{[s]} o(\tau) once per frame for every stream s and transform r. [sent-88, score-0.332]
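A hedged sketch of that bookkeeping follows, reusing the transforms/stream_rows layout assumed in the previous sketch. Each (stream, transform) projection is computed once per frame, and a component then only concatenates the S cached pieces named by its assignment vector; the helper names are assumptions.

```python
import numpy as np

def cache_stream_projections(o, transforms, stream_rows):
    """cache[s][r] holds A^(r)_[s] o(tau): stream s's rows of transform r applied to the frame."""
    return [[transforms[s][r][stream_rows[s], :] @ o
             for r in range(len(transforms[s]))]
            for s in range(len(transforms))]

def transformed_observation(z_m, cache):
    """Assemble the transformed observation for one component from its assignment vector.

    The concatenation is in stream order, so the component's mean and variance
    vectors are assumed to be stored in the same row ordering.
    """
    return np.concatenate([cache[s][z_m[s]] for s in range(len(z_m))])
```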

49 The component priors and HMM transition matrices are estimated in the standard fashion [6]. [sent-103, score-0.492]

50 Directly optimising the auxiliary function for the model parameters is computationally expensive [3] and does not allow the embedding of the assignment process. [sent-104, score-0.463]

51 Instead a simple iterative optimisation scheme is used as follows: 1. [sent-105, score-0.263]

52 Estimate the within-class covariance matrix for each Gaussian component in the system, W^{(m)}, using the values of \gamma_m(\tau) (a sketch of this accumulation is given after the step list). [sent-106, score-0.513]

53 Initialise the set of assignment vectors, \{z\} = \{z^{(1)}, \ldots, [sent-107, score-0.298]

54 z^{(M)}\} and the set of transforms for each stream, \{A\} = \{A^{(1)}_{[1]}, \ldots, A^{(R_1)}_{[1]}, \ldots, A^{(1)}_{[S]}, \ldots, A^{(R_S)}_{[S]}\}. [sent-110, score-0.554]

55 Using the current estimates of the transforms and assignment vectors, obtain the ML estimate of the set of component-specific diagonal covariance matrices, incorporating the appropriate parameter tying as required. [sent-115, score-1.343]

56 Estimate the new set of transforms, \{A\}, using the current set of component covariance matrices \{\hat{\Sigma}\} and assignment vectors \{z\}. [sent-118, score-0.877]

57 Update the set of assignment variables for each component, \{z\}, given the current set of model transforms, \{A\}. [sent-121, score-0.519]

58 Otherwise update \{\hat{\Sigma}\} and the component means using the latest transforms and assignment variables. [sent-124, score-0.805]
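As referenced at step 1, here is a minimal sketch (assumed argument names, not the paper's code) of accumulating the within-class covariance W^{(m)} of each component from the posteriors \gamma_m(\tau).

```python
import numpy as np

def within_class_covariances(obs, gamma, means):
    """Within-class covariance per component.

    obs   : (T, n)  observation frames
    gamma : (T, M)  component posteriors gamma_m(tau)
    means : (M, n)  component means
    """
    M = gamma.shape[1]
    n = obs.shape[1]
    W = np.zeros((M, n, n))
    counts = gamma.sum(axis=0) + 1e-10        # beta^(m) = sum_tau gamma_m(tau)
    for m in range(M):
        diff = obs - means[m]                  # (T, n)
        W[m] = (gamma[:, m, None] * diff).T @ diff / counts[m]
    return W
```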

59 First the ML estimate of the set of component specific diagonal covariance matrices is required. [sent-126, score-0.725]

60 Second, the new set of transforms must be estimated. [sent-127, score-0.32]

61 Finally the new set of assignment vectors is required. [sent-128, score-0.298]

62 The ML estimates of the component-specific variances (and means) under a transformation are a standard problem, e.g. [sent-129, score-0.297]
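For instance, with the composite transform held fixed, the ML estimate of the component-specific diagonal variances is simply the diagonal of the transformed within-class covariance. A one-function sketch, following the assumed names used in the earlier sketches:

```python
import numpy as np

def ml_diagonal_covariance(A, W_m):
    """sigma_{diag,i}^(m) = a_i W^(m) a_i^T for each row a_i of the composite transform A."""
    return np.einsum('ij,jk,ik->i', A, W_m, A)
```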

63 The ML estimation of the transforms and assignment variables is described below. [sent-132, score-0.681]

64 The proposed scheme is derived by modifying the standard semi-tied covariance optimisation equation in [3]. [sent-134, score-0.531]

65 Consider row i of stream p of transform r, a^{(r)}_{[p]i}; the auxiliary function may be written as (ignoring constant scalings and elements independent of a^{(r)}_{[p]i}) [sent-136, score-0.581]

66 \mathcal{Q}(\mathcal{M}, \hat{\mathcal{M}}; \{\hat{\Sigma}\}, \{z\}) = \sum_m \beta^{(m)} \log\left( \left( c^{(z^{(m)})}_{[p]i} a^{(z^{(m)})T}_{[p]i} \right)^2 \right) - \sum_j a^{(r)}_{[p]i} K^{(srj)} a^{(r)T}_{[p]i} \quad (6), where c^{(z^{(m)})}_{[p]i} is the cofactor of row i of stream p of transform A^{(z^{(m)})} and \beta^{(m)} = \sum_\tau \gamma_m(\tau). [sent-145, score-0.486]

67 At the (t+1)th iteration the row is updated as a^{(r)}_{[p]i}(t+1) = a^{(r)}_{[p]i}(t) - \left(\partial^2\mathcal{Q}/\partial a^2\right)^{-1} \partial\mathcal{Q}/\partial a, where the gradient and Hessian are those of equation (6). In practice this estimation scheme was highly stable. [sent-161, score-0.209]
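A hedged sketch of what such a gradient/Hessian (Newton-style) row update could look like for an auxiliary of the form Q(a) = sum_m beta_m log((c_m . a)^2) - a G a^T; this is my reading of equation (6), not the paper's implementation, and the names (cofactors, betas, G) are assumptions. A simple step-halving guard keeps the update stable when a full Newton step overshoots.

```python
import numpy as np

def update_row(a, cofactors, betas, G, num_iters=5):
    """Newton-style ascent on one transform row.

    a         : (n,)   current row
    cofactors : (M, n) cofactor row c_m seen by each component m
    betas     : (M,)   component occupancies beta^(m)
    G         : (n, n) symmetric positive-definite weighted statistics matrix
    """
    a = a.copy()

    def Q(v):
        return np.sum(betas * np.log((cofactors @ v) ** 2)) - v @ G @ v

    for _ in range(num_iters):
        proj = cofactors @ a                                     # (M,) values c_m . a
        grad = 2.0 * (betas / proj) @ cofactors - 2.0 * G @ a    # dQ/da
        hess = (-2.0 * (cofactors.T * (betas / proj ** 2)) @ cofactors
                - 2.0 * G)                                       # d2Q/da2 (negative definite)
        step = np.linalg.solve(hess, grad)                       # Newton direction
        candidate, scale = a - step, 1.0
        while Q(candidate) < Q(a) and scale > 1e-4:              # damp if the step overshoots
            scale *= 0.5
            candidate = a - scale * step
        a = candidate
    return a
```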

68 The assignment for stream s of component m is found using a greedy search technique based on ML estimation. [sent-163, score-0.719]
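A hedged sketch of one way such a greedy ML search could proceed for a single component: sweep over the streams, try each candidate transform for that stream while the others are held fixed, score the resulting composite matrix against the component's within-class statistics, and keep the best choice. The score below is the usual determinant-minus-log-diagonal criterion with the ML variances substituted in; all names and the data layout are assumptions mirroring the earlier sketches, not the paper's API.

```python
import numpy as np

def build_composite(z, transforms, stream_rows):
    """Stack stream-specific rows into A^(z); mirrors the earlier composite-transform sketch."""
    n = transforms[0][0].shape[0]
    A = np.zeros((n, n))
    for s, r in enumerate(z):
        A[stream_rows[s], :] = transforms[s][r][stream_rows[s], :]
    return A

def assignment_score(A, W_m, beta_m):
    """beta/2 * (log det(A)^2 - sum_i log(a_i W a_i^T)), up to a constant."""
    _, logdet = np.linalg.slogdet(A)
    diag_vars = np.einsum('ij,jk,ik->i', A, W_m, A)      # a_i W a_i^T for every row
    return 0.5 * beta_m * (2.0 * logdet - np.sum(np.log(diag_vars)))

def greedy_assign(z_m, W_m, beta_m, transforms, stream_rows, sweeps=2):
    """Greedy per-stream search for the assignment vector of one component."""
    z_m = list(z_m)
    for _ in range(sweeps):
        for s in range(len(z_m)):
            best_r, best_score = z_m[s], -np.inf
            for r in range(len(transforms[s])):
                trial = list(z_m)
                trial[s] = r
                score = assignment_score(build_composite(trial, transforms, stream_rows),
                                         W_m, beta_m)
                if score > best_score:
                    best_r, best_score = r, score
            z_m[s] = best_r
    return z_m
```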

69 3 Results and Discussion An initial investigation of the use of factored semi-tied covariance matrices was carried out on a large-vocabulary speaker-independent continuous-speech recognition task. [sent-168, score-0.749]

70 The recognition experiments were performed on the 1994 ARPA Hub 1 data (the H1 task), an unlimited-vocabulary task. [sent-169, score-0.173]

71 The baseline system used for the recognition task was a gender-independent cross-word-triphone mixture-Gaussian tied-state HMM system. [sent-172, score-0.23]

72 The set of baseline experiments with semi-tied covariance matrices (S = 1) used "expert" knowledge to determine the transform classes. [sent-176, score-0.624]

73 The first was based on phone level transforms where all components of all states from the same phone shared the same class (phone classes). [sent-178, score-0.764]

74 The second used an individual transform per state (state classes). [sent-179, score-0.262]

75 In addition a global transform (global class) and a full-covariance matrix system (comp class) were tested. [sent-180, score-0.375]

76 Two systems were examined: a four-Gaussian-component-per-state system and a twelve-Gaussian-component system. [sent-181, score-0.405]

77 The twelve component system is the standard system described in [8]. [sent-182, score-0.483]

78 In both cases a diagonal covariance matrix system (labelled none) was generated in the standard HTK fashion [9]. [sent-183, score-0.639]

79 The previously described expert scheme and two ML-based schemes, standard and factored, were compared. [sent-187, score-0.323]

80 The standard scheme used a single stream (S = 1), which is similar to the scheme described in [3]. [sent-188, score-0.607]

81 The factored scheme used the new approach described in this paper with a separate stream for each of the elements of the feature vector (S = 39). [sent-189, score-0.623]

82 Table 1: System performance on the 1994 ARPA H1 task (word error rates for the none, global, phone, state and comp transform classes under the expert, standard and factored assignment schemes). [sent-190, score-1.408]

83 The results of the baseline semi-tied covariance matrix systems are shown in Table 1. [sent-202, score-0.374]

84 For the four component system the full covariance matrix system achieved approximately the same performance as that of the expert state semi-tied system. [sent-203, score-0.866]

85 The expert phone system shows around a 9% degradation in performance compared to the state system, but used less than a hundredth of the number of transforms (46 versus 6399). [sent-206, score-0.847]

86 Using the standard ML assignment scheme with initial phone classes, S = 1, reduced the error rate of the phone system by around 3% over the expert system. [sent-207, score-1.287]

87 The factored scheme, S = 39, achieved further reductions in error rate. [sent-208, score-0.245]

88 A 5% reduction in word error rate was achieved over the expert system, which is significant at the 95% level. [sent-209, score-0.275]

89 Table 1 also shows the performance of the twelve component system. [sent-210, score-0.254]

90 The use of a global semi-tied transform significantly reduced the error rate by around 9% relative. [sent-211, score-0.332]

91 Increasing the number of transforms using the expert assignment showed no reduction in error rate. [sent-212, score-0.844]

92 Again using the phone level system and training the component transform assignments, either the standard or the factored schemes, reduced the word error rate. [sent-213, score-1.039]

93 Using the factored semi-tied transforms (S = 39) significantly reduced the error rate, by around 5%, compared to the expert systems. [sent-214, score-0.81]

94 4 Conclusions This paper has presented a new form of semi-tied covariance, the factored semi-tied covariance matrix. [sent-215, score-0.467]

95 The theory for estimating these transforms has been developed and implemented on a large vocabulary speech recognition task. [sent-216, score-0.629]

96 On this task the use of these factored transforms was found to decrease the word error rate by around 5% over using a single transform, or multiple transforms, where the assignments are expertly determined. [sent-217, score-0.788]

97 In future work the problems of determining the required number of transforms for each of the streams and how to determine the appropriate dimensions will be investigated. [sent-219, score-0.386]

98 Maximum likelihood multiple projection schemes for hidden Markov models. [sent-224, score-0.222]

99 A tutorial on hidden Markov models and selected applications in speech recognition. [sent-242, score-0.163]

100 The development of the 1994 HTK large vocabulary speech recognition system. [sent-248, score-0.345]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('transforms', 0.32), ('assignment', 0.298), ('pji', 0.269), ('tlle', 0.261), ('covariance', 0.25), ('stream', 0.234), ('factored', 0.217), ('phone', 0.207), ('component', 0.187), ('transform', 0.184), ('expert', 0.167), ('scheme', 0.145), ('matrices', 0.142), ('speech', 0.136), ('diagonal', 0.118), ('ml', 0.105), ('htk', 0.104), ('vocabulary', 0.094), ('generative', 0.084), ('schemes', 0.081), ('nuisance', 0.081), ('optimisation', 0.08), ('diag', 0.08), ('recognition', 0.079), ('arpa', 0.078), ('gmm', 0.078), ('idet', 0.078), ('matrix', 0.076), ('system', 0.073), ('row', 0.068), ('twelve', 0.067), ('factoring', 0.067), ('likelihood', 0.066), ('hmm', 0.066), ('auxiliary', 0.064), ('af', 0.056), ('assignments', 0.056), ('standard', 0.056), ('hessian', 0.053), ('cofactors', 0.052), ('decorrelating', 0.052), ('lri', 0.052), ('odell', 0.052), ('pri', 0.052), ('modelling', 0.051), ('bold', 0.05), ('word', 0.049), ('gaussian', 0.049), ('multiple', 0.048), ('baseline', 0.048), ('lda', 0.045), ('srm', 0.045), ('tm', 0.045), ('global', 0.042), ('comp', 0.041), ('around', 0.04), ('state', 0.04), ('cost', 0.039), ('fi', 0.039), ('fashion', 0.039), ('iterative', 0.038), ('optimising', 0.038), ('per', 0.038), ('reduced', 0.038), ('estimation', 0.036), ('development', 0.036), ('dimensions', 0.034), ('model', 0.034), ('investigation', 0.033), ('ns', 0.033), ('estimated', 0.032), ('streams', 0.032), ('markov', 0.032), ('reduction', 0.031), ('increasing', 0.031), ('written', 0.031), ('shared', 0.03), ('assign', 0.03), ('task', 0.03), ('observation', 0.029), ('rs', 0.029), ('parameters', 0.029), ('gradient', 0.028), ('error', 0.028), ('estimate', 0.028), ('static', 0.028), ('labelled', 0.028), ('discriminant', 0.028), ('sj', 0.028), ('initial', 0.028), ('described', 0.027), ('hidden', 0.027), ('denoted', 0.027), ('transformation', 0.027), ('none', 0.027), ('variances', 0.027), ('generated', 0.027), ('increases', 0.026), ('dimensionality', 0.025), ('describes', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 51 nips-2000-Factored Semi-Tied Covariance Matrices

Author: Mark J. F. Gales

Abstract: A new form of covariance modelling for Gaussian mixture models and hidden Markov models is presented. This is an extension to an efficient form of covariance modelling used in speech recognition, semi-tied covariance matrices. In the standard form of semi-tied covariance matrices the covariance matrix is decomposed into a highly shared decorrelating transform and a component-specific diagonal covariance matrix. The use of a factored decorrelating transform is presented in this paper. This factoring effectively increases the number of possible transforms without increasing the number of free parameters. Maximum likelihood estimation schemes for all the model parameters are presented including the component/transform assignment, transform and component parameters. This new model form is evaluated on a large vocabulary speech recognition task. It is shown that using this factored form of covariance modelling reduces the word error rate.

2 0.151943 59 nips-2000-From Mixtures of Mixtures to Adaptive Transform Coding

Author: Cynthia Archer, Todd K. Leen

Abstract: We establish a principled framework for adaptive transform coding. Transform coders are often constructed by concatenating an ad hoc choice of transform with suboptimal bit allocation and quantizer design. Instead, we start from a probabilistic latent variable model in the form of a mixture of constrained Gaussian mixtures. From this model we derive a transform coding algorithm, which is a constrained version of the generalized Lloyd algorithm for vector quantizer design. A byproduct of our derivation is the introduction of a new transform basis, which unlike other transforms (PCA, DCT, etc.) is explicitly optimized for coding. Image compression experiments show adaptive transform coders designed with our algorithm improve compressed image signal-to-noise ratio up to 3 dB compared to global transform coding and 0.5 to 2 dB compared to other adaptive transform coders. 1

3 0.14136778 84 nips-2000-Minimum Bayes Error Feature Selection for Continuous Speech Recognition

Author: George Saon, Mukund Padmanabhan

Abstract: We consider the problem of designing a linear transformation () E lRPx n, of rank p ~ n, which projects the features of a classifier x E lRn onto y = ()x E lRP such as to achieve minimum Bayes error (or probability of misclassification). Two avenues will be explored: the first is to maximize the ()-average divergence between the class densities and the second is to minimize the union Bhattacharyya bound in the range of (). While both approaches yield similar performance in practice, they outperform standard LDA features and show a 10% relative improvement in the word error rate over state-of-the-art cepstral features on a large vocabulary telephony speech recognition task.

4 0.11901182 123 nips-2000-Speech Denoising and Dereverberation Using Probabilistic Models

Author: Hagai Attias, John C. Platt, Alex Acero, Li Deng

Abstract: This paper presents a unified probabilistic framework for denoising and dereverberation of speech signals. The framework transforms the denoising and dereverberation problems into Bayes-optimal signal estimation. The key idea is to use a strong speech model that is pre-trained on a large data set of clean speech. Computational efficiency is achieved by using variational EM, working in the frequency domain, and employing conjugate priors. The framework covers both single and multiple microphones. We apply this approach to noisy reverberant speech signals and get results substantially better than standard methods.

5 0.10936049 90 nips-2000-New Approaches Towards Robust and Adaptive Speech Recognition

Author: Hervé Bourlard, Samy Bengio, Katrin Weber

Abstract: In this paper, we discuss some new research directions in automatic speech recognition (ASR), and which somewhat deviate from the usual approaches. More specifically, we will motivate and briefly describe new approaches based on multi-stream and multi/band ASR. These approaches extend the standard hidden Markov model (HMM) based approach by assuming that the different (frequency) channels representing the speech signal are processed by different (independent)

6 0.10617781 121 nips-2000-Sparse Kernel Principal Component Analysis

7 0.087988801 31 nips-2000-Beyond Maximum Likelihood and Density Estimation: A Sample-Based Criterion for Unsupervised Learning of Complex Models

8 0.084940359 60 nips-2000-Gaussianization

9 0.084325314 91 nips-2000-Noise Suppression Based on Neurophysiologically-motivated SNR Estimation for Robust Speech Recognition

10 0.083802037 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning

11 0.081934348 140 nips-2000-Tree-Based Modeling and Estimation of Gaussian Processes on Graphs with Cycles

12 0.077690758 27 nips-2000-Automatic Choice of Dimensionality for PCA

13 0.077318974 96 nips-2000-One Microphone Source Separation

14 0.077035561 142 nips-2000-Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task

15 0.076819174 49 nips-2000-Explaining Away in Weight Space

16 0.076729149 65 nips-2000-Higher-Order Statistical Properties Arising from the Non-Stationarity of Natural Signals

17 0.074566945 14 nips-2000-A Variational Mean-Field Theory for Sigmoidal Belief Networks

18 0.071260296 2 nips-2000-A Comparison of Image Processing Techniques for Visual Speech Recognition Applications

19 0.070267715 6 nips-2000-A Neural Probabilistic Language Model

20 0.070254788 120 nips-2000-Sparse Greedy Gaussian Process Regression


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.218), (1, -0.05), (2, 0.138), (3, 0.097), (4, 0.0), (5, -0.015), (6, -0.221), (7, -0.049), (8, 0.016), (9, 0.019), (10, -0.06), (11, -0.063), (12, 0.074), (13, 0.041), (14, -0.124), (15, 0.051), (16, 0.008), (17, 0.003), (18, -0.047), (19, 0.185), (20, 0.163), (21, 0.092), (22, 0.102), (23, 0.118), (24, -0.295), (25, 0.002), (26, 0.021), (27, 0.204), (28, 0.133), (29, 0.013), (30, -0.081), (31, 0.005), (32, -0.047), (33, 0.035), (34, 0.03), (35, -0.005), (36, 0.041), (37, 0.025), (38, -0.037), (39, -0.03), (40, -0.057), (41, 0.033), (42, -0.088), (43, -0.068), (44, -0.113), (45, -0.041), (46, -0.077), (47, -0.026), (48, 0.061), (49, 0.149)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96981871 51 nips-2000-Factored Semi-Tied Covariance Matrices

Author: Mark J. F. Gales

Abstract: A new form of covariance modelling for Gaussian mixture models and hidden Markov models is presented. This is an extension to an efficient form of covariance modelling used in speech recognition, semi-tied covariance matrices. In the standard form of semi-tied covariance matrices the covariance matrix is decomposed into a highly shared decorrelating transform and a component-specific diagonal covariance matrix. The use of a factored decorrelating transform is presented in this paper. This factoring effectively increases the number of possible transforms without increasing the number of free parameters. Maximum likelihood estimation schemes for all the model parameters are presented including the component/transform assignment, transform and component parameters. This new model form is evaluated on a large vocabulary speech recognition task. It is shown that using this factored form of covariance modelling reduces the word error rate.

2 0.64681357 59 nips-2000-From Mixtures of Mixtures to Adaptive Transform Coding

Author: Cynthia Archer, Todd K. Leen

Abstract: We establish a principled framework for adaptive transform coding. Transform coders are often constructed by concatenating an ad hoc choice of transform with suboptimal bit allocation and quantizer design. Instead, we start from a probabilistic latent variable model in the form of a mixture of constrained Gaussian mixtures. From this model we derive a transform coding algorithm, which is a constrained version of the generalized Lloyd algorithm for vector quantizer design. A byproduct of our derivation is the introduction of a new transform basis, which unlike other transforms (PCA, DCT, etc.) is explicitly optimized for coding. Image compression experiments show adaptive transform coders designed with our algorithm improve compressed image signal-to-noise ratio up to 3 dB compared to global transform coding and 0.5 to 2 dB compared to other adaptive transform coders. 1

3 0.62609249 84 nips-2000-Minimum Bayes Error Feature Selection for Continuous Speech Recognition

Author: George Saon, Mukund Padmanabhan

Abstract: We consider the problem of designing a linear transformation () E lRPx n, of rank p ~ n, which projects the features of a classifier x E lRn onto y = ()x E lRP such as to achieve minimum Bayes error (or probability of misclassification). Two avenues will be explored: the first is to maximize the ()-average divergence between the class densities and the second is to minimize the union Bhattacharyya bound in the range of (). While both approaches yield similar performance in practice, they outperform standard LDA features and show a 10% relative improvement in the word error rate over state-of-the-art cepstral features on a large vocabulary telephony speech recognition task.

4 0.60111392 60 nips-2000-Gaussianization

Author: Scott Saobing Chen, Ramesh A. Gopinath

Abstract: High dimensional data modeling is difficult mainly because the so-called

5 0.42618975 90 nips-2000-New Approaches Towards Robust and Adaptive Speech Recognition

Author: Hervé Bourlard, Samy Bengio, Katrin Weber

Abstract: In this paper, we discuss some new research directions in automatic speech recognition (ASR), and which somewhat deviate from the usual approaches. More specifically, we will motivate and briefly describe new approaches based on multi-stream and multi/band ASR. These approaches extend the standard hidden Markov model (HMM) based approach by assuming that the different (frequency) channels representing the speech signal are processed by different (independent)

6 0.42573792 123 nips-2000-Speech Denoising and Dereverberation Using Probabilistic Models

7 0.3797369 27 nips-2000-Automatic Choice of Dimensionality for PCA

8 0.36766675 121 nips-2000-Sparse Kernel Principal Component Analysis

9 0.35107169 140 nips-2000-Tree-Based Modeling and Estimation of Gaussian Processes on Graphs with Cycles

10 0.32615539 91 nips-2000-Noise Suppression Based on Neurophysiologically-motivated SNR Estimation for Robust Speech Recognition

11 0.3225159 31 nips-2000-Beyond Maximum Likelihood and Density Estimation: A Sample-Based Criterion for Unsupervised Learning of Complex Models

12 0.32005668 49 nips-2000-Explaining Away in Weight Space

13 0.29374546 65 nips-2000-Higher-Order Statistical Properties Arising from the Non-Stationarity of Natural Signals

14 0.28837731 138 nips-2000-The Use of Classifiers in Sequential Inference

15 0.283402 93 nips-2000-On Iterative Krylov-Dogleg Trust-Region Steps for Solving Neural Networks Nonlinear Least Squares Problems

16 0.27077293 61 nips-2000-Generalizable Singular Value Decomposition for Ill-posed Datasets

17 0.26742083 120 nips-2000-Sparse Greedy Gaussian Process Regression

18 0.26562175 6 nips-2000-A Neural Probabilistic Language Model

19 0.2543436 2 nips-2000-A Comparison of Image Processing Techniques for Visual Speech Recognition Applications

20 0.2535314 14 nips-2000-A Variational Mean-Field Theory for Sigmoidal Belief Networks


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.02), (10, 0.032), (12, 0.296), (17, 0.16), (26, 0.029), (32, 0.011), (33, 0.042), (54, 0.012), (55, 0.032), (62, 0.046), (65, 0.015), (67, 0.041), (76, 0.052), (79, 0.013), (81, 0.034), (90, 0.042), (97, 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.83878106 131 nips-2000-The Early Word Catches the Weights

Author: Mark A. Smith, Garrison W. Cottrell, Karen L. Anderson

Abstract: The strong correlation between the frequency of words and their naming latency has been well documented. However, as early as 1973, the Age of Acquisition (AoA) of a word was alleged to be the actual variable of interest, but these studies seem to have been ignored in most of the literature. Recently, there has been a resurgence of interest in AoA. While some studies have shown that frequency has no effect when AoA is controlled for, more recent studies have found independent contributions of frequency and AoA. Connectionist models have repeatedly shown strong effects of frequency, but little attention has been paid to whether they can also show AoA effects. Indeed, several researchers have explicitly claimed that they cannot show AoA effects. In this work, we explore these claims using a simple feed forward neural network. We find a significant contribution of AoA to naming latency, as well as conditions under which frequency provides an independent contribution. 1 Background Naming latency is the time between the presentation of a picture or written word and the beginning of the correct utterance of that word. It is undisputed that there are significant differences in the naming latency of many words, even when controlling word length, syllabic complexity, and other structural variants. The cause of differences in naming latency has been the subject of numerous studies. Earlier studies found that the frequency with which a word appears in spoken English is the best determinant of its naming latency (Oldfield & Wingfield, 1965). More recent psychological studies, however, show that the age at which a word is learned, or its Age of Acquisition (AoA), may be a better predictor of naming latency. Further, in many multiple regression analyses, frequency is not found to be significant when AoA is controlled for (Brown & Watson, 1987; Carroll & White, 1973; Morrison et al. 1992; Morrison & Ellis, 1995). These studies show that frequency and AoA are highly correlated (typically r =-.6) explaining the confound of older studies on frequency. However, still more recent studies question this finding and find that both AoA and frequency are significant and contribute independently to naming latency (Ellis & Morrison, 1998; Gerhand & Barry, 1998,1999). Much like their psychological counterparts, connectionist networks also show very strong frequency effects. However, the ability of a connectionist network to show AoA effects has been doubted (Gerhand & Barry, 1998; Morrison & Ellis, 1995). Most of these claims are based on the well known fact that connectionist networks exhibit

same-paper 2 0.81373888 51 nips-2000-Factored Semi-Tied Covariance Matrices

Author: Mark J. F. Gales

Abstract: A new form of covariance modelling for Gaussian mixture models and hidden Markov models is presented. This is an extension to an efficient form of covariance modelling used in speech recognition, semi-tied covariance matrices. In the standard form of semi-tied covariance matrices the covariance matrix is decomposed into a highly shared decorrelating transform and a component-specific diagonal covariance matrix. The use of a factored decorrelating transform is presented in this paper. This factoring effectively increases the number of possible transforms without increasing the number of free parameters. Maximum likelihood estimation schemes for all the model parameters are presented including the component/transform assignment, transform and component parameters. This new model form is evaluated on a large vocabulary speech recognition task. It is shown that using this factored form of covariance modelling reduces the word error rate.

3 0.52872497 122 nips-2000-Sparse Representation for Gaussian Process Models

Author: Lehel Csatč´¸, Manfred Opper

Abstract: We develop an approach for a sparse representation for Gaussian Process (GP) models in order to overcome the limitations of GPs caused by large data sets. The method is based on a combination of a Bayesian online algorithm together with a sequential construction of a relevant subsample of the data which fully specifies the prediction of the model. Experimental results on toy examples and large real-world data sets indicate the efficiency of the approach.

4 0.52650946 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling

Author: Christopher K. I. Williams

Abstract: In this paper we show that the kernel peA algorithm of Sch6lkopf et al (1998) can be interpreted as a form of metric multidimensional scaling (MDS) when the kernel function k(x, y) is isotropic, i.e. it depends only on Ilx - yll. This leads to a metric MDS algorithm where the desired configuration of points is found via the solution of an eigenproblem rather than through the iterative optimization of the stress objective function. The question of kernel choice is also discussed. 1

5 0.52309728 4 nips-2000-A Linear Programming Approach to Novelty Detection

Author: Colin Campbell, Kristin P. Bennett

Abstract: Novelty detection involves modeling the normal behaviour of a system hence enabling detection of any divergence from normality. It has potential applications in many areas such as detection of machine damage or highlighting abnormal features in medical data. One approach is to build a hypothesis estimating the support of the normal data i. e. constructing a function which is positive in the region where the data is located and negative elsewhere. Recently kernel methods have been proposed for estimating the support of a distribution and they have performed well in practice - training involves solution of a quadratic programming problem. In this paper we propose a simpler kernel method for estimating the support based on linear programming. The method is easy to implement and can learn large datasets rapidly. We demonstrate the method on medical and fault detection datasets. 1 Introduction. An important classification task is the ability to distinguish b etween new instances similar to m embers of the training set and all other instances that can occur. For example, we may want to learn the normal running behaviour of a machine and highlight any significant divergence from normality which may indicate onset of damage or faults. This issue is a generic problem in many fields. For example, an abnormal event or feature in medical diagnostic data typically leads to further investigation. Novel events can be highlighted by constructing a real-valued density estimation function. However, here we will consider the simpler task of modelling the support of a data distribution i.e. creating a binary-valued function which is positive in those regions of input space where the data predominantly lies and negative elsewhere. Recently kernel methods have been applied to this problem [4]. In this approach data is implicitly mapped to a high-dimensional space called feature space [13]. Suppose the data points in input space are X i (with i = 1, . . . , m) and the mapping is Xi --+ ¢;(Xi) then in the span of {¢;(Xi)}, we can expand a vector w = Lj cr.j¢;(Xj). Hence we can define separating hyperplanes in feature space by w . ¢;(x;) + b = O. We will refer to w . ¢;(Xi) + b as the margin which will be positive on one side of the separating hyperplane and negative on the other. Thus we can also define a decision function: (1) where z is a new data point. The data appears in the form of an inner product in feature space so we can implicitly define feature space by our choice of kernel function: (2) A number of choices for the kernel are possible, for example, RBF kernels: (3) With the given kernel the decision function is therefore given by: (4) One approach to novelty detection is to find a hypersphere in feature space with a minimal radius R and centre a which contains most of the data: novel test points lie outside the boundary of this hypersphere [3 , 12] . This approach to novelty detection was proposed by Tax and Duin [10] and successfully used on real life applications [11] . The effect of outliers is reduced by using slack variables to allow for datapoints outside the sphere and the task is to minimise the volume of the sphere and number of datapoints outside i.e. e i mIll s.t. [R2 + oX L i ei 1 (Xi - a) . (Xi - a) S R2 + e ei i, ~ a (5) Since the data appears in the form of inner products kernel substitution can be applied and the learning task can be reduced to a quadratic programming problem. An alternative approach has been developed by Scholkopf et al. [7]. 
Suppose we restricted our attention to RBF kernels (3) then the data lies on the surface of a hypersphere in feature space since ¢;(x) . ¢;(x) = K(x , x) = l. The objective is therefore to separate off the surface region constaining data from the region containing no data. This is achieved by constructing a hyperplane which is maximally distant from the origin with all datapoints lying on the opposite side from the origin and such that the margin is positive. The learning task in dual form involves minimisation of: mIll s.t. W(cr.) = t L7,'k=l cr.icr.jK(Xi, Xj) a S cr.i S C, L::1 cr.i = l. (6) However, the origin plays a special role in this model. As the authors point out [9] this is a disadvantage since the origin effectively acts as a prior for where the class of abnormal instances is assumed to lie. In this paper we avoid this problem: rather than repelling the hyperplane away from an arbitrary point outside the data distribution we instead try and attract the hyperplane towards the centre of the data distribution. In this paper we will outline a new algorithm for novelty detection which can be easily implemented using linear programming (LP) techniques. As we illustrate in section 3 it performs well in practice on datasets involving the detection of abnormalities in medical data and fault detection in condition monitoring. 2 The Algorithm For the hard margin case (see Figure 1) the objective is to find a surface in input space which wraps around the data clusters: anything outside this surface is viewed as abnormal. This surface is defined as the level set, J(z) = 0, of some nonlinear function. In feature space, J(z) = L; O'.;K(z, x;) + b, this corresponds to a hyperplane which is pulled onto the mapped datapoints with the restriction that the margin always remains positive or zero. We make the fit of this nonlinear function or hyperplane as tight as possible by minimizing the mean value of the output of the function, i.e., Li J(x;). This is achieved by minimising: (7) subject to: m LO'.jK(x;,Xj) + b 2:: 0 (8) j=l m L 0'.; = 1, 0'.; 2:: 0 (9) ;=1 The bias b is just treated as an additional parameter in the minimisation process though unrestricted in sign. The added constraints (9) on 0'. bound the class of models to be considered - we don't want to consider simple linear rescalings of the model. These constraints amount to a choice of scale for the weight vector normal to the hyperplane in feature space and hence do not impose a restriction on the model. Also, these constraints ensure that the problem is well-posed and that an optimal solution with 0'. i- 0 exists. Other constraints on the class of functions are possible, e.g. 110'.111 = 1 with no restriction on the sign of O'.i. Many real-life datasets contain noise and outliers. To handle these we can introduce a soft margin in analogy to the usual approach used with support vector machines. In this case we minimise: (10) subject to: m LO:jJ{(Xi , Xj)+b~-ei' ei~O (11) j=l and constraints (9). The parameter). controls the extent of margin errors (larger ). means fewer outliers are ignored: ). -+ 00 corresponds to the hard margin limit). The above problem can be easily solved for problems with thousands of points using standard simplex or interior point algorithms for linear programming. With the addition of column generation techniques, these same approaches can be adopted for very large problems in which the kernel matrix exceeds the capacity of main memory. 
Column generation algorithms incrementally add and drop columns each corresponding to a single kernel function until optimality is reached. Such approaches have been successfully applied to other support vector problems [6 , 2]. Basic simplex algorithms were sufficient for the problems considered in this paper, so we defer a listing of the code for column generation to a later paper together with experiments on large datasets [1]. 3 Experiments Artificial datasets. Before considering experiments on real-life data we will first illustrate the performance of the algorithm on some artificial datasets. In Figure 1 the algorithm places a boundary around two data clusters in input space: a hard margin was used with RBF kernels and (J

6 0.52252972 107 nips-2000-Rate-coded Restricted Boltzmann Machines for Face Recognition

7 0.52184391 2 nips-2000-A Comparison of Image Processing Techniques for Visual Speech Recognition Applications

8 0.52110302 74 nips-2000-Kernel Expansions with Unlabeled Examples

9 0.52049381 130 nips-2000-Text Classification using String Kernels

10 0.51347053 133 nips-2000-The Kernel Gibbs Sampler

11 0.51128143 60 nips-2000-Gaussianization

12 0.51023549 37 nips-2000-Convergence of Large Margin Separable Linear Classification

13 0.50766319 45 nips-2000-Emergence of Movement Sensitive Neurons' Properties by Learning a Sparse Code for Natural Moving Images

14 0.50748086 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning

15 0.50634742 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm

16 0.50539249 79 nips-2000-Learning Segmentation by Random Walks

17 0.50463259 36 nips-2000-Constrained Independent Component Analysis

18 0.50411093 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks

19 0.5030337 10 nips-2000-A Productive, Systematic Framework for the Representation of Visual Structure

20 0.50229836 111 nips-2000-Regularized Winnow Methods