nips nips2001 nips2001-162 knowledge-graph by maker-knowledge-mining

162 nips-2001-Relative Density Nets: A New Way to Combine Backpropagation with HMM's


Source: pdf

Author: Andrew D. Brown, Geoffrey E. Hinton

Abstract: Logistic units in the first hidden layer of a feedforward neural network compute the relative probability of a data point under two Gaussians. This leads us to consider substituting other density models. We present an architecture for performing discriminative learning of Hidden Markov Models using a network of many small HMM's. Experiments on speech data show it to be superior to the standard method of discriminatively training HMM's. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Logistic units in the first hidden layer of a feedforward neural network compute the relative probability of a data point under two Gaussians. [sent-8, score-0.48]

2 This leads us to consider substituting other density models. [sent-9, score-0.089]

3 We present an architecture for performing discriminative learning of Hidden Markov Models using a network of many small HMM's. [sent-10, score-0.227]

4 Experiments on speech data show it to be superior to the standard method of discriminatively training HMM's. [sent-11, score-0.152]

5 1 Introduction A standard way of performing classification using a generative model is to divide the training cases into their respective classes and then train a set of class-conditional models. [sent-12, score-0.229]

6 This unsupervised approach to classification is appealing for two reasons. [sent-13, score-0.109]

7 It is possible to reduce overfitting, because the model learns the class-conditional input densities P(x|c) rather than the input-conditional class probabilities P(c|x). [sent-14, score-0.126]

8 Also, provided that the model density is a good match to the underlying data density, the decision provided by a probabilistic model is Bayes optimal. [sent-15, score-0.178]

9 The problem with this unsupervised approach to using probabilistic models for classification is that, for reasons of computational efficiency and analytical convenience, very simple generative models are typically used and the optimality of the procedure no longer holds. [sent-16, score-0.173]

10 For this reason it is usually advantageous to train a classifier discriminatively. [sent-17, score-0.074]

11 In this paper we will look specifically at the problem of learning HMM's for classifying speech sequences. [sent-18, score-0.086]

12 It is an application area where the assumption that the HMM is the correct generative model for the data is inaccurate and discriminative methods of training have been successful. [sent-19, score-0.09]

13 The first section will give an overview of current methods of discriminatively training HMM classifiers. [sent-20, score-0.092]

14 We will then introduce a new type of multi-layer backpropagation network which takes better advantage of the HMM's for discrimination. [sent-21, score-0.138]

15 Finally, we present some simulations comparing the two methods. [sent-22, score-0.021]

16 Figure 1: An Alphanet with one HMM per class. [sent-23, score-0.031]

17 Each computes a score for the sequence and this feeds into a softmax output layer. [sent-24, score-0.207]

18 2 Alphanets and Discriminative Learning The unsupervised way of using an HMM for classifying a collection of sequences is to use the Baum-Welch algorithm [1] to fit one HMM per class. [sent-25, score-0.141]

19 Then new sequences are classified by computing the probability of a sequence under each model and assigning it to the one with the highest probability. [sent-26, score-0.086]
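
As a concrete aside (not code from the paper), the per-model sequence probability in this classification rule is obtained with the forward algorithm; below is a minimal log-space sketch in Python/numpy, with hypothetical emission log-densities and transition parameters standing in for trained models.

```python
import numpy as np
from scipy.special import logsumexp

def hmm_log_likelihood(log_emit, log_pi, log_A):
    """Log P(x_1:T | H) via the forward recursion; cost is O(T K^2).

    log_emit : (T, K) log emission densities log p(x_t | s_t = k)
    log_pi   : (K,)   log initial-state prior
    log_A    : (K, K) log transitions, log_A[i, j] = log P(s_t = j | s_t-1 = i)
    """
    log_alpha = log_pi + log_emit[0]
    for t in range(1, log_emit.shape[0]):
        # Sum over the previous state in log space, then add the new emission term.
        log_alpha = logsumexp(log_alpha[:, None] + log_A, axis=0) + log_emit[t]
    return logsumexp(log_alpha)

# Hypothetical per-class models: classify by the highest log-likelihood.
T, K, n_classes = 50, 4, 3
rng = np.random.default_rng(0)
models = [(np.log(np.full(K, 1.0 / K)),                 # log initial prior
           np.log(rng.dirichlet(np.ones(K), size=K)))   # log transition matrix
          for _ in range(n_classes)]
log_emit = rng.normal(-5.0, 1.0, size=(T, K))           # stand-in emission scores (shared only for brevity)
scores = [hmm_log_likelihood(log_emit, lp, la) for lp, la in models]
predicted_class = int(np.argmax(scores))
```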

20 Speech recognition is one of the commonest applications of HMM's, but unfortunately an HMM is a poor model of the speech production process. [sent-27, score-0.088]

21 For this reason speech researchers have looked at the possibility of improving the performance of an HMM classifier by using information from negative examples - examples drawn from classes other than the one which the HMM was meant to model. [sent-28, score-0.136]

22 One way of doing this is to compute the mutual information between the class label and the data under the HMM density, and maximize that objective function [2]. [sent-29, score-0.068]

23 It was later shown that this procedure could be viewed as a type of neural network (see Figure 1) in which the inputs to the network are the log-probability scores $\mathcal{L}(x_{1:T}|\mathcal{H})$ of the sequence under hidden Markov model $\mathcal{H}$ [3]. [sent-30, score-0.329]

24 In such a model there is one HMM per class, and the output is a softmax non-linearity: $p_k = \exp(\mathcal{L}(x_{1:T}|\mathcal{H}_k)) \big/ \sum_{k'} \exp(\mathcal{L}(x_{1:T}|\mathcal{H}_{k'}))$ (1). Training this model by maximizing the log probability of correct classification leads to a classifier which will perform better than an equivalent HMM model trained solely in an unsupervised manner. [sent-31, score-0.413]
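
A minimal numpy sketch of this output non-linearity, with hypothetical per-class log-probability scores standing in for the HMM forward passes (which are not shown here).

```python
import numpy as np

def alphanet_output(scores):
    """Softmax over the per-class HMM log-probability scores."""
    z = np.exp(scores - scores.max())   # subtract the max for numerical stability
    return z / z.sum()

scores = np.array([-812.3, -845.1, -790.7])   # hypothetical L(x_1:T | H_k) values
p = alphanet_output(scores)                   # class posteriors, sum to 1
```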

25 Such an architecture has been termed an "AIphanet" because it may be implemented as a recurrent neural network which mimics the forward pass of the forward-backward algorithm. [sent-32, score-0.209]

26 Given a mixture of two Gaussians where we know the component priors P(g) and the component densities P(x|g), the posterior probability that Gaussian g_0 generated an observation x is a logistic function whose argument is the negative log-odds of the two classes [4]. [sent-34, score-0.184]
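
This claim can be checked numerically. Here is a small sketch (my own illustration with made-up parameters, using scipy for the Gaussian densities): the posterior responsibility of one component equals a logistic unit applied to the log-odds.

```python
import numpy as np
from scipy.stats import norm

def posterior_g0(x, prior0, mu0, s0, prior1, mu1, s1):
    """Posterior P(g_0 | x) for a two-component 1-D Gaussian mixture."""
    p0 = prior0 * norm.pdf(x, mu0, s0)
    p1 = prior1 * norm.pdf(x, mu1, s1)
    return p0 / (p0 + p1)

def logistic_of_log_odds(x, prior0, mu0, s0, prior1, mu1, s1):
    """The same quantity written as a logistic unit, sigma(log p0 - log p1)."""
    log_odds = (np.log(prior0) + norm.logpdf(x, mu0, s0)
                - np.log(prior1) - norm.logpdf(x, mu1, s1))
    return 1.0 / (1.0 + np.exp(-log_odds))

x = np.linspace(-4.0, 4.0, 9)
a = posterior_g0(x, 0.5, -1.0, 1.0, 0.5, 1.0, 1.5)
b = logistic_of_log_odds(x, 0.5, -1.0, 1.0, 0.5, 1.0, 1.5)
assert np.allclose(a, b)   # the two formulations agree
```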

27 This can clearly be seen by rearranging the expression for the posterior. (Footnote 1: The results of the forward pass are the probabilities of the hidden states conditioned on the past observations, or "alphas" in standard HMM terminology.) [sent-35, score-0.225]

28 4 A New Kind of Discriminative Net This view of a feedforward network suggests variations in which other kinds of density models are used in place of Gaussians in the input space. [sent-37, score-0.235]

29 In particular, instead of performing pairwise comparisons between Gaussians, the units in the first hidden layer can perform pairwise comparisons between the densities of an input sequence under M different HMM's. [sent-38, score-0.672]

30 For a given sequence, the log-probability under each HMM is computed and the difference in log-probability is used as input to the logistic hidden unit. [sent-39, score-0.322]

31 This is equivalent to computing the posterior responsibilities of a mixture of two HMM's with equal prior probabilities. [sent-40, score-0.058]

32 In order to maximally leverage the information captured by the HMM's, we use $\binom{M}{2}$ hidden units so that all possible pairs are included. [sent-41, score-0.22]

33 The output of a hidden unit is given by $h_{(mn)} = \sigma\big(\mathcal{L}(x_{1:T}|\mathcal{H}_m) - \mathcal{L}(x_{1:T}|\mathcal{H}_n)\big)$ (7), where we have used $(mn)$ as an index over the set $\binom{M}{2}$ of all unordered pairs of the HMM's. [sent-42, score-0.279]

34 The results of this hidden layer computation are then combined using a fully connected layer of free weights, W, and finally passed through a softmax function to make the final decision. [sent-43, score-0.454]

35 Figure 2: A multi-layer density net with HMM's ("density comparator units") in the input layer. [sent-45, score-0.152]

36 The hidden layer units perform all pairwise comparisons between the HMM's. [sent-46, score-0.461]

37 where we have used $\sigma(\cdot)$ as shorthand for the logistic function, and $p_k$ is the value of the kth output unit. [sent-47, score-0.158]

38 Because each unit in the hidden layer takes as input the difference in log-probability of two HMM's, this can be thought of as a fixed layer of weights connecting each hidden unit to a pair of HMM's with weights of ±1. [sent-49, score-0.764]
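
To make the architecture of Figure 2 concrete, here is a minimal Python/numpy sketch of the forward pass just described (not code from the paper). It assumes the log-probabilities of a sequence under the M HMM's have already been computed; the fixed ±1 connections show up as pairwise differences, and the weight matrix W and all numerical values are hypothetical placeholders.

```python
import numpy as np
from itertools import combinations

def rdn_forward(log_probs, W):
    """Forward pass of the relative density net head.

    log_probs : (M,) log-probabilities of one sequence under M HMM's
    W         : (K, M*(M-1)/2) free weights from hidden units to K classes
    Returns the hidden activations h and the class probabilities p.
    """
    pairs = list(combinations(range(len(log_probs)), 2))  # all unordered pairs (m, n)
    # Each hidden unit is a logistic function of a log-probability difference,
    # i.e. fixed +1/-1 weights to one pair of HMM's (Eq. 7).
    diffs = np.array([log_probs[m] - log_probs[n] for m, n in pairs])
    h = 1.0 / (1.0 + np.exp(-diffs))
    # Fully connected layer of free weights followed by a softmax.
    a = W @ h
    p = np.exp(a - a.max())
    return h, p / p.sum()

M, K = 4, 3
rng = np.random.default_rng(0)
log_probs = rng.normal(-100.0, 5.0, size=M)            # stand-in HMM log-likelihoods
W = rng.normal(0.0, 0.1, size=(K, M * (M - 1) // 2))   # free output weights
h, p = rdn_forward(log_probs, W)
```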

39 In contrast to the Alphanet, which allocates one HMM to model each class, this network does not require a one-to-one alignment between models and classes and it gets maximum discriminative benefit from the HMM's by comparing all pairs. [sent-50, score-0.227]

40 Another benefit of this architecture is that it allows us to use more HMM's than there are classes. [sent-51, score-0.141]

41 The unsupervised approach to training HMM classifiers is problematic because it depends on the assumption that a single HMM is a good model of the data and, in the case of speech, this is a poor assumption. [sent-52, score-0.06]

42 Training the classifier discriminatively alleviates this drawback, and the multi-layer classifier goes even further in this direction by allowing many HMM's to be used to learn the decision boundaries between the classes. [sent-53, score-0.199]

43 The intuition here is that many small HMM's can be a far more efficient way to characterize sequences than one big HMM. [sent-54, score-0.076]

44 When many small HMM's cooperate to generate sequences, the mutual information between different parts of generated sequences scales linearly with the number of HMM's and only logarithmically with the number of hidden nodes in each HMM [5]. [sent-55, score-0.254]

45 5 Derivative Updates for a Relative Density Network The learning algorithm for an RDN is just the backpropagation algorithm applied to the network architecture as defined in equations 7, 8 and 9. [sent-56, score-0.23]

46 The output layer is a distribution over class memberships of data point $x_{1:T}$, and this is parameterized as a softmax function. [sent-57, score-0.359]

47 We minimize the cross-entropy loss function $f = -\sum_{k=1}^{K} t_k \log p_k$ (10), where $p_k$ is the value of the kth output unit and $t_k$ is an indicator variable which is equal to 1 if k is the true class. [sent-58, score-0.266]
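
For reference, here is a small numpy sketch (my own illustration, with placeholder values) of this objective together with the standard gradient of a softmax/cross-entropy pair with respect to the softmax pre-activations, which is the error signal that gets backpropagated to the hidden layer.

```python
import numpy as np

def cross_entropy(p, t):
    """Cross-entropy loss -sum_k t_k log p_k for a one-hot target t."""
    return -np.sum(t * np.log(p))

def grad_wrt_softmax_inputs(p, t):
    """Gradient of the cross-entropy with respect to the softmax pre-activations.

    For a softmax output with a one-hot target this is the familiar p - t,
    which is then chained with the hidden-layer and HMM derivatives.
    """
    return p - t

p = np.array([0.7, 0.2, 0.1])   # hypothetical softmax output
t = np.array([1.0, 0.0, 0.0])   # indicator vector of the true class
loss = cross_entropy(p, t)
delta = grad_wrt_softmax_inputs(p, t)
```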

48 This derivative can be chained with the derivatives backpropagated from the output to the hidden layer. [sent-60, score-0.432]

49 For the final step of the backpropagation procedure we need the derivative of the log-likelihood of each HMM with respect to its parameters. [sent-61, score-0.124]

50 In the experiments we use HMM's with a single axis-aligned Gaussian output density per state. [sent-62, score-0.186]

51 We use the following notation for the parameters: A: $a_{ij}$ is the transition probability from state i to state j; $\Pi$: $\pi_i$ is the initial state prior; [sent-63, score-0.117]

52 $\mu_i$: mean vector for state i; $v_i$: vector of variances for state i; $\mathcal{H}$: set of HMM parameters $\{A, \Pi, \mu, v\}$. [sent-64, score-0.048]

53 We also use the variable $s_t$ to represent the state of the HMM at time t. [sent-65, score-0.024]

54 We make use of the property of all latent variable density models that the derivative of the log-likelihood is equal to the expected derivative of the joint log-likelihood under the posterior distribution. [sent-66, score-0.252]

55 For an HMM this means that: $\frac{\partial \mathcal{L}(x_{1:T}|\mathcal{H})}{\partial \mathcal{H}_i} = \sum_{s_{1:T}} P(s_{1:T}|x_{1:T},\mathcal{H})\, \frac{\partial}{\partial \mathcal{H}_i} \log P(x_{1:T}, s_{1:T}|\mathcal{H})$ (14). The expected joint log-likelihood of an HMM is $\langle \log P(x_{1:T}, s_{1:T}|\mathcal{H}) \rangle = \sum_i \langle \delta_{s_1,i} \rangle \log \pi_i + \sum_{i,j} \sum_{t=2}^{T} \langle \delta_{s_t,j}\,\delta_{s_{t-1},i} \rangle \log a_{ij} + \sum_i \sum_{t=1}^{T} \langle \delta_{s_t,i} \rangle \big[ -\tfrac{1}{2} \sum_d \log v_{i,d} - \tfrac{1}{2} \sum_d (x_{t,d} - \mu_{i,d})^2 / v_{i,d} \big] + \mathrm{const}$ (15), [sent-67, score-0.078]

56 where $\langle \cdot \rangle$ denotes expectations under the posterior distribution and $\langle \delta_{s_t,i} \rangle$ and $\langle \delta_{s_t,j}\,\delta_{s_{t-1},i} \rangle$ are the expected state occupancies and transitions under this distribution. [sent-68, score-0.081]

57 All the necessary expectations are computed by the forward-backward algorithm. [sent-69, score-0.05]
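
As a sketch of how these expectations turn into parameter derivatives, suppose the forward-backward pass has produced state occupancies gamma and pairwise transition expectations xi; differentiating Eq. 15 then gives the expressions below. This is my own restatement with hypothetical variable names, not code from the paper.

```python
import numpy as np

def hmm_loglik_grads(X, gamma, xi, pi, A, mu, v):
    """Derivatives of log P(x_1:T | H) via expected statistics (Eqs. 14-15).

    X     : (T, D) observation sequence
    gamma : (T, K) expected state occupancies <delta_{s_t,i}>
    xi    : (T-1, K, K), xi[t, i, j] = <delta_{s_t,i} delta_{s_t+1,j}>
    pi, A : initial-state prior (K,) and transition matrix (K, K)
    mu, v : per-state means and variances, each (K, D)
    """
    d_pi = gamma[0] / pi                        # from <delta_{s_1,i}> log pi_i
    d_A = xi.sum(axis=0) / A                    # from sum_t <...> log a_ij
    resid = X[:, None, :] - mu[None, :, :]      # (T, K, D) residuals
    d_mu = np.einsum('tk,tkd->kd', gamma, resid / v[None, :, :])
    d_logv = 0.5 * np.einsum('tk,tkd->kd', gamma, resid**2 / v[None, :, :] - 1.0)
    return d_pi, d_A, d_mu, d_logv

# Toy call with random stand-ins for the forward-backward expectations.
T, K, D = 30, 4, 12
rng = np.random.default_rng(1)
X = rng.normal(size=(T, D))
gamma = rng.dirichlet(np.ones(K), size=T)
xi = rng.dirichlet(np.ones(K * K), size=T - 1).reshape(T - 1, K, K)
pi, A = np.full(K, 1.0 / K), rng.dirichlet(np.ones(K), size=K)
mu, v = rng.normal(size=(K, D)), np.ones((K, D))
grads = hmm_loglik_grads(X, gamma, xi, pi, A, mu, v)
```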

58 We could take derivatives with respect to this functional directly, but that would require doing constrained gradient descent on the probabilities and the variances. [sent-70, score-0.055]

59 Instead, we reparameterize the model using a softmax basis for probability vectors and an exponential basis for the variance parameters. [sent-71, score-0.1]

60 $a_{ij} = \exp(\theta^{(a)}_{ij}) \big/ \sum_{j'} \exp(\theta^{(a)}_{ij'})$ [sent-74, score-0.04]
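
Below is a minimal sketch of this change of basis, assuming (as the text indicates) one softmax basis per probability vector and an exponential basis for each variance; unconstrained gradient steps on the theta parameters then automatically keep the rows of A on the simplex and the variances positive. The shapes and values are hypothetical.

```python
import numpy as np

def reparameterize(theta_pi, theta_A, theta_v):
    """Map unconstrained parameters to valid HMM parameters.

    theta_pi : (K,)   softmax basis for the initial-state prior
    theta_A  : (K, K) softmax basis, one row per source state, for transitions
    theta_v  : (K, D) exponential (log-variance) basis for the variances
    """
    pi = np.exp(theta_pi - theta_pi.max())
    pi /= pi.sum()
    A = np.exp(theta_A - theta_A.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)      # each row is a probability distribution
    v = np.exp(theta_v)                    # variances stay strictly positive
    return pi, A, v

K, D = 4, 12
rng = np.random.default_rng(2)
pi, A, v = reparameterize(rng.normal(size=K), rng.normal(size=(K, K)),
                          rng.normal(size=(K, D)))
assert np.allclose(A.sum(axis=1), 1.0) and np.all(v > 0)
```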

61 $\frac{\partial \mathcal{L}}{\partial \theta^{(v)}_{i,d}} = \tfrac{1}{2} \sum_{t=1}^{T} \langle \delta_{s_t,i} \rangle \big[ (x_{t,d} - \mu_{i,d})^2 / v_{i,d} - 1 \big]$ (19). When chained with the error signal backpropagated from the output, these derivatives give us the direction in which to move the parameters of each HMM in order to increase the log probability of the correct classification of the sequence. [sent-83, score-0.262]

62 6 Experiments To evaluate the relative merits of the RDN, we compared it against an Alphanet on a speaker identification task. [sent-84, score-0.108]

63 The data consisted of 12 speakers uttering phrases of 6 different sequences of connected digits, recorded multiple times (48) over the course of 12 recording sessions. [sent-86, score-0.093]

64 The log magnitude filter response was then used as the feature vector for the HMM's. [sent-89, score-0.033]

65 This pre-processing reduced the data dimensionality while retaining its spectral structure. [sent-90, score-0.021]

66 While mel-cepstral coefficients are typically recommended for use with axis-aligned Gaussians, they destroy the spectral structure of the data, and we would like to allow for the possibility that some of the many HMM's will specialize on particular sub-bands of the frequency domain. [sent-91, score-0.021]

67 They can do this by treating the variance as a measure of the importance of a particular frequency band - using large variances for unimportant bands, and small ones for bands to which they pay particular attention. [sent-92, score-0.033]

68 We compared the RDN with an Alphanet and three other models which were implemented as controls. [sent-93, score-0.021]

69 The first of these was a network with a similar architecture to the RDN (as shown in figure 2), except that instead of fixed connections of ±1, the hidden units have a set of adaptable weights to all M of the HMM's. [sent-94, score-0.462]

70 We refer to this network as a comparative density net (CDN). [sent-95, score-0.194]

71 A second control experiment used an architecture similar to a CDN without the hidden layer, i.e. [sent-96, score-0.267]

72 there is a single layer of adaptable weights directly connecting the HMM's with the softmax output units. [sent-98, score-0.414]

73 The CDN-1 differs from the Alphanet in that each softmax output unit has adaptable connections to the HMM's and we can vary the number of HMM's, whereas the Alphanet has just one HMM per class directly connected to each softmax output unit. [sent-100, score-0.514]

74 Finally, we implemented a version of a network similar to an Alphanet, but using a mixture of Gaussians as the input density model. [sent-101, score-0.203]

75 The point of this comparison was to see if the HMM actually achieves a benefit from modelling the temporal aspects of the speaker recognition task. [sent-102, score-0.134]

76 In each experiment an RDN constructed out of a set of M 4-state HMM's was compared to the four other networks, all matched to have the same number of free parameters, except for the MoGnet. [sent-103, score-0.042]

77 In the case of the MoGnet, we used the same number of Gaussian mixture models as HMM's in the Alphanet, each with the same number of hidden states. [sent-104, score-0.197]

78 Figure 3 (boxplot): results for the RDN, Alphanet, CDN, CDN-1 and MoG net architectures. [sent-131, score-0.038]

79 For the Alphanet and MoGnet we varied the number of states in the HMM's and the number of components in the Gaussian mixtures, respectively. [sent-135, score-0.053]

80 For the CDN model we used the same number of 4-state HMM's as the RDN and varied the number of units in the hidden layer of the network. [sent-136, score-0.392]

81 Since the CDN-1 network has no hidden units, we used the same number of HMM's as the RDN and varied the number of states in the HMM. [sent-137, score-0.274]

82 All the models were trained using 90 iterations of a conjugate gradient optimization procedure [6] . [sent-139, score-0.042]

83 7 Results The boxplot in figure 3 shows the results of the classification performance on the 10 runs in each of the 4 experiments. [sent-140, score-0.07]

84 This indicates that given a classification network with a fixed number of parameters, there is an advantage to using many small HMM's and using all the pairwise information about an observed sequence, as opposed to using a network with a single large HMM per class. [sent-144, score-0.284]

85 In the third experiment involving the MoGnet we see that its performance is comparable to that of the Alphanet. [sent-145, score-0.021]

86 This suggests that the HMM's ability to model the temporal structure of the data is not really necessary for the speaker classification task as we have set it up. [sent-146, score-0.127]

87 Nevertheless, the performance of both the Alphanet and the MoGnet is less than that of the RDN. (Footnote 3: If we had done text-dependent speaker identification, instead of multiple digit phrases, ...) [sent-147, score-0.085]

88 Unfortunately the CDN and CDN-l networks perform much worse than we expected. [sent-148, score-0.041]

89 While we expected these models to perform similarly to the RDN, it seems that the optimization procedure takes much longer with these models. [sent-149, score-0.041]

90 This is probably because the small initial weights from the HMM's to the next layer severely attenuate the backpropagated error derivatives that are used to train the HMM's. [sent-150, score-0.298]

91 As a result the CDN networks do not converge properly in the time allowed. [sent-151, score-0.021]

92 8 Conclusions We have introduced relative density networks, and shown that this method of discriminatively learning many small density models in place of a single density model per class has benefits in classification performance. [sent-152, score-0.513]

93 In addition, there may be a small speed benefit to using many smaller HMM's compared to a few big ones. [sent-153, score-0.08]

94 Computing the probability of a sequence under an HMM is order $O(TK^2)$, where T is the length of the sequence and K is the number of hidden states in the network. [sent-154, score-0.257]

95 However, this is somewhat counterbalanced by the quadratic growth in the size of the hidden layer as M increases. [sent-156, score-0.294]

96 Mercer, "Maximum mutual information of hidden Markov model parameters for speech recognition," in Proceeding of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. [sent-175, score-0.249]

97 Bridle, "Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters," in Advances in Neural Information Processing Systems (D. [sent-178, score-0.084]

98 Hinton, "Products of hidden Markov models," in Proceedings of Artificial Intelligence and Statistics 2001 (T. [sent-193, score-0.154]

99 Matlab conjugate gradient code available from http://www. [sent-202, score-0.021]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('hmm', 0.664), ('alphanet', 0.338), ('rdn', 0.338), ('cdn', 0.203), ('hidden', 0.154), ('layer', 0.14), ('mn', 0.105), ('softmax', 0.1), ('architecture', 0.092), ('mognet', 0.09), ('density', 0.089), ('pk', 0.085), ('discriminatively', 0.071), ('backpropagation', 0.071), ('xl', 0.071), ('classification', 0.07), ('network', 0.067), ('qo', 0.067), ('units', 0.066), ('output', 0.066), ('logistic', 0.061), ('speech', 0.06), ('oak', 0.059), ('adaptable', 0.059), ('backpropagated', 0.059), ('speaker', 0.057), ('gaussians', 0.056), ('derivatives', 0.055), ('tk', 0.055), ('classifier', 0.054), ('derivative', 0.053), ('benefit', 0.049), ('pairwise', 0.049), ('discriminative', 0.047), ('bim', 0.045), ('chained', 0.045), ('magnet', 0.045), ('sequences', 0.045), ('aij', 0.045), ('densities', 0.043), ('sl', 0.043), ('sequence', 0.041), ('exp', 0.04), ('unsupervised', 0.039), ('bridle', 0.039), ('unit', 0.039), ('net', 0.038), ('toronto', 0.037), ('brown', 0.036), ('gj', 0.036), ('posterior', 0.036), ('mutual', 0.035), ('class', 0.033), ('bands', 0.033), ('feedforward', 0.033), ('log', 0.033), ('varied', 0.032), ('comparisons', 0.032), ('hinton', 0.031), ('identification', 0.031), ('ow', 0.031), ('kth', 0.031), ('big', 0.031), ('per', 0.031), ('forward', 0.029), ('tn', 0.028), ('phrases', 0.028), ('bin', 0.028), ('recognition', 0.028), ('classifying', 0.026), ('markov', 0.026), ('input', 0.025), ('connecting', 0.025), ('state', 0.024), ('weights', 0.024), ('generative', 0.022), ('classes', 0.022), ('mixture', 0.022), ('networks', 0.021), ('states', 0.021), ('expectations', 0.021), ('pass', 0.021), ('xt', 0.021), ('conjugate', 0.021), ('spectral', 0.021), ('models', 0.021), ('experiment', 0.021), ('performing', 0.021), ('comparing', 0.021), ('training', 0.021), ('indicator', 0.02), ('train', 0.02), ('perform', 0.02), ('relative', 0.02), ('connected', 0.02), ('memberships', 0.02), ('unordered', 0.02), ('alleviated', 0.02), ('hen', 0.02), ('cooperate', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999994 162 nips-2001-Relative Density Nets: A New Way to Combine Backpropagation with HMM's

Author: Andrew D. Brown, Geoffrey E. Hinton

Abstract: Logistic units in the first hidden layer of a feedforward neural network compute the relative probability of a data point under two Gaussians. This leads us to consider substituting other density models. We present an architecture for performing discriminative learning of Hidden Markov Models using a network of many small HMM's. Experiments on speech data show it to be superior to the standard method of discriminatively training HMM's. 1

2 0.2472095 172 nips-2001-Speech Recognition using SVMs

Author: N. Smith, Mark Gales

Abstract: An important issue in applying SVMs to speech recognition is the ability to classify variable length sequences. This paper presents extensions to a standard scheme for handling this variable length data, the Fisher score. A more useful mapping is introduced based on the likelihood-ratio. The score-space defined by this mapping avoids some limitations of the Fisher score. Class-conditional generative models are directly incorporated into the definition of the score-space. The mapping, and appropriate normalisation schemes, are evaluated on a speaker-independent isolated letter task where the new mapping outperforms both the Fisher score and HMMs trained to maximise likelihood. 1

3 0.19998382 183 nips-2001-The Infinite Hidden Markov Model

Author: Matthew J. Beal, Zoubin Ghahramani, Carl E. Rasmussen

Abstract: We show that it is possible to extend hidden Markov models to have a countably infinite number of hidden states. By using the theory of Dirichlet processes we can implicitly integrate out the infinitely many transition parameters, leaving only three hyperparameters which can be learned from data. These three hyperparameters define a hierarchical Dirichlet process capable of capturing a rich set of transition dynamics. The three hyperparameters control the time scale of the dynamics, the sparsity of the underlying state-transition matrix, and the expected number of distinct hidden states in a finite sequence. In this framework it is also natural to allow the alphabet of emitted symbols to be infinite— consider, for example, symbols being possible words appearing in English text.

4 0.19136864 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data

Author: Jens Kohlmorgen, Steven Lemm

Abstract: We propose a novel method for the analysis of sequential data that exhibits an inherent mode switching. In particular, the data might be a non-stationary time series from a dynamical system that switches between multiple operating modes. Unlike other approaches, our method processes the data incrementally and without any training of internal parameters. We use an HMM with a dynamically changing number of states and an on-line variant of the Viterbi algorithm that performs an unsupervised segmentation and classification of the data on-the-fly, i.e. the method is able to process incoming data in real-time. The main idea of the approach is to track and segment changes of the probability density of the data in a sliding window on the incoming data stream. The usefulness of the algorithm is demonstrated by an application to a switching dynamical system. 1

5 0.17025758 115 nips-2001-Linear-time inference in Hierarchical HMMs

Author: Kevin P. Murphy, Mark A. Paskin

Abstract: The hierarchical hidden Markov model (HHMM) is a generalization of the hidden Markov model (HMM) that models sequences with structure at many length/time scales [FST98]. Unfortunately, the original inference algorithm is rather complicated and takes O(T^3) time, where T is the length of the sequence, making it impractical for many domains. In this paper, we show how HHMMs are a special kind of dynamic Bayesian network (DBN), and thereby derive a much simpler inference algorithm, which only takes O(T) time. Furthermore, by drawing the connection between HHMMs and DBNs, we enable the application of many standard approximation techniques to further speed up inference.

6 0.15803149 39 nips-2001-Audio-Visual Sound Separation Via Hidden Markov Models

7 0.14128305 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine

8 0.084090821 4 nips-2001-ALGONQUIN - Learning Dynamic Noise Models From Noisy Speech for Robust Speech Recognition

9 0.08294867 15 nips-2001-A New Discriminative Kernel From Probabilistic Models

10 0.080686942 153 nips-2001-Product Analysis: Learning to Model Observations as Products of Hidden Variables

11 0.079603337 173 nips-2001-Speech Recognition with Missing Data using Recurrent Neural Nets

12 0.078636147 109 nips-2001-Learning Discriminative Feature Transforms to Low Dimensions in Low Dimentions

13 0.078291059 43 nips-2001-Bayesian time series classification

14 0.070848592 20 nips-2001-A Sequence Kernel and its Application to Speaker Recognition

15 0.069599375 3 nips-2001-ACh, Uncertainty, and Cortical Inference

16 0.068217427 75 nips-2001-Fast, Large-Scale Transformation-Invariant Clustering

17 0.067924581 129 nips-2001-Multiplicative Updates for Classification by Mixture Models

18 0.066977777 123 nips-2001-Modeling Temporal Structure in Classical Conditioning

19 0.06657017 52 nips-2001-Computing Time Lower Bounds for Recurrent Sigmoidal Neural Networks

20 0.066213645 133 nips-2001-On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.195), (1, 0.004), (2, -0.012), (3, -0.084), (4, -0.285), (5, 0.077), (6, 0.198), (7, -0.068), (8, -0.112), (9, -0.046), (10, 0.007), (11, 0.119), (12, 0.061), (13, 0.161), (14, 0.006), (15, -0.037), (16, -0.056), (17, -0.029), (18, 0.2), (19, -0.018), (20, 0.114), (21, 0.197), (22, -0.042), (23, -0.015), (24, -0.031), (25, 0.125), (26, -0.11), (27, 0.007), (28, -0.215), (29, 0.017), (30, -0.077), (31, -0.102), (32, -0.018), (33, -0.053), (34, 0.015), (35, -0.014), (36, -0.011), (37, -0.073), (38, -0.036), (39, 0.069), (40, 0.041), (41, 0.024), (42, 0.026), (43, -0.068), (44, -0.078), (45, -0.03), (46, 0.053), (47, 0.003), (48, -0.015), (49, 0.051)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97105312 162 nips-2001-Relative Density Nets: A New Way to Combine Backpropagation with HMM's

Author: Andrew D. Brown, Geoffrey E. Hinton

Abstract: Logistic units in the first hidden layer of a feedforward neural network compute the relative probability of a data point under two Gaussians. This leads us to consider substituting other density models. We present an architecture for performing discriminative learning of Hidden Markov Models using a network of many small HMM's. Experiments on speech data show it to be superior to the standard method of discriminatively training HMM's. 1

2 0.73071432 172 nips-2001-Speech Recognition using SVMs

Author: N. Smith, Mark Gales

Abstract: An important issue in applying SVMs to speech recognition is the ability to classify variable length sequences. This paper presents extensions to a standard scheme for handling this variable length data, the Fisher score. A more useful mapping is introduced based on the likelihood-ratio. The score-space defined by this mapping avoids some limitations of the Fisher score. Class-conditional generative models are directly incorporated into the definition of the score-space. The mapping, and appropriate normalisation schemes, are evaluated on a speaker-independent isolated letter task where the new mapping outperforms both the Fisher score and HMMs trained to maximise likelihood. 1

3 0.68874109 183 nips-2001-The Infinite Hidden Markov Model

Author: Matthew J. Beal, Zoubin Ghahramani, Carl E. Rasmussen

Abstract: We show that it is possible to extend hidden Markov models to have a countably infinite number of hidden states. By using the theory of Dirichlet processes we can implicitly integrate out the infinitely many transition parameters, leaving only three hyperparameters which can be learned from data. These three hyperparameters define a hierarchical Dirichlet process capable of capturing a rich set of transition dynamics. The three hyperparameters control the time scale of the dynamics, the sparsity of the underlying state-transition matrix, and the expected number of distinct hidden states in a finite sequence. In this framework it is also natural to allow the alphabet of emitted symbols to be infinite— consider, for example, symbols being possible words appearing in English text.

4 0.68263555 115 nips-2001-Linear-time inference in Hierarchical HMMs

Author: Kevin P. Murphy, Mark A. Paskin

Abstract: The hierarchical hidden Markov model (HHMM) is a generalization of the hidden Markov model (HMM) that models sequences with structure at many length/time scales [FST98]. Unfortunately, the original inference algorithm is rather complicated and takes O(T^3) time, where T is the length of the sequence, making it impractical for many domains. In this paper, we show how HHMMs are a special kind of dynamic Bayesian network (DBN), and thereby derive a much simpler inference algorithm, which only takes O(T) time. Furthermore, by drawing the connection between HHMMs and DBNs, we enable the application of many standard approximation techniques to further speed up inference.

5 0.65071517 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data

Author: Jens Kohlmorgen, Steven Lemm

Abstract: We propose a novel method for the analysis of sequential data that exhibits an inherent mode switching. In particular, the data might be a non-stationary time series from a dynamical system that switches between multiple operating modes. Unlike other approaches, our method processes the data incrementally and without any training of internal parameters. We use an HMM with a dynamically changing number of states and an on-line variant of the Viterbi algorithm that performs an unsupervised segmentation and classification of the data on-the-fly, i.e. the method is able to process incoming data in real-time. The main idea of the approach is to track and segment changes of the probability density of the data in a sliding window on the incoming data stream. The usefulness of the algorithm is demonstrated by an application to a switching dynamical system. 1

6 0.4632605 3 nips-2001-ACh, Uncertainty, and Cortical Inference

7 0.43650442 39 nips-2001-Audio-Visual Sound Separation Via Hidden Markov Models

8 0.416347 133 nips-2001-On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes

9 0.4133943 123 nips-2001-Modeling Temporal Structure in Classical Conditioning

10 0.35386258 15 nips-2001-A New Discriminative Kernel From Probabilistic Models

11 0.33827364 85 nips-2001-Grammar Transfer in a Second Order Recurrent Neural Network

12 0.33035752 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine

13 0.3209087 173 nips-2001-Speech Recognition with Missing Data using Recurrent Neural Nets

14 0.31253785 43 nips-2001-Bayesian time series classification

15 0.30259266 153 nips-2001-Product Analysis: Learning to Model Observations as Products of Hidden Variables

16 0.3021909 109 nips-2001-Learning Discriminative Feature Transforms to Low Dimensions in Low Dimentions

17 0.28348815 83 nips-2001-Geometrical Singularities in the Neuromanifold of Multilayer Perceptrons

18 0.27455398 179 nips-2001-Tempo tracking and rhythm quantization by sequential Monte Carlo

19 0.26840016 4 nips-2001-ALGONQUIN - Learning Dynamic Noise Models From Noisy Speech for Robust Speech Recognition

20 0.26233724 75 nips-2001-Fast, Large-Scale Transformation-Invariant Clustering


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(14, 0.025), (17, 0.026), (18, 0.252), (19, 0.04), (27, 0.121), (30, 0.109), (38, 0.023), (59, 0.026), (72, 0.059), (79, 0.053), (83, 0.026), (88, 0.011), (91, 0.126)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.85761851 73 nips-2001-Eye movements and the maturation of cortical orientation selectivity

Author: Antonino Casile, Michele Rucci

Abstract: Neural activity appears to be a crucial component for shaping the receptive fields of cortical simple cells into adjacent, oriented subregions alternately receiving ON- and OFF-center excitatory geniculate inputs. It is known that the orientation selective responses of V1 neurons are refined by visual experience. After eye opening, the spatiotemporal structure of neural activity in the early stages of the visual pathway depends both on the visual environment and on how the environment is scanned. We have used computational modeling to investigate how eye movements might affect the refinement of the orientation tuning of simple cells in the presence of a Hebbian scheme of synaptic plasticity. Levels of correlation between the activity of simulated cells were examined while natural scenes were scanned so as to model sequences of saccades and fixational eye movements, such as microsaccades, tremor and ocular drift. The specific patterns of activity required for a quantitatively accurate development of simple cell receptive fields with segregated ON and OFF subregions were observed during fixational eye movements, but not in the presence of saccades or with static presentation of natural visual input. These results suggest an important role for the eye movements occurring during visual fixation in the refinement of orientation selectivity.

same-paper 2 0.81040621 162 nips-2001-Relative Density Nets: A New Way to Combine Backpropagation with HMM's

Author: Andrew D. Brown, Geoffrey E. Hinton

Abstract: Logistic units in the first hidden layer of a feedforward neural network compute the relative probability of a data point under two Gaussians. This leads us to consider substituting other density models. We present an architecture for performing discriminative learning of Hidden Markov Models using a network of many small HMM's. Experiments on speech data show it to be superior to the standard method of discriminatively training HMM's. 1

3 0.65009534 13 nips-2001-A Natural Policy Gradient

Author: Sham M. Kakade

Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1

4 0.64993322 102 nips-2001-KLD-Sampling: Adaptive Particle Filters

Author: Dieter Fox

Abstract: Over the last years, particle filters have been applied with great success to a variety of state estimation problems. We present a statistical approach to increasing the efficiency of particle filters by adapting the size of sample sets on-the-fly. The key idea of the KLD-sampling method is to bound the approximation error introduced by the sample-based representation of the particle filter. The name KLD-sampling is due to the fact that we measure the approximation error by the Kullback-Leibler distance. Our adaptation approach chooses a small number of samples if the density is focused on a small part of the state space, and it chooses a large number of samples if the state uncertainty is high. Both the implementation and computation overhead of this approach are small. Extensive experiments using mobile robot localization as a test application show that our approach yields drastic improvements over particle filters with fixed sample set sizes and over a previously introduced adaptation technique.

5 0.64973038 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade

Author: Paul Viola, Michael Jones

Abstract: This paper develops a new approach for extremely fast detection in domains where the distribution of positive and negative examples is highly skewed (e.g. face detection or database retrieval). In such domains a cascade of simple classifiers each trained to achieve high detection rates and modest false positive rates can yield a final detector with many desirable features: including high detection rates, very low false positive rates, and fast performance. Achieving extremely high detection rates, rather than low error, is not a task typically addressed by machine learning algorithms. We propose a new variant of AdaBoost as a mechanism for training the simple classifiers used in the cascade. Experimental results in the domain of face detection show the training algorithm yields significant improvements in performance over conventional AdaBoost. The final face detection system can process 15 frames per second, achieves over 90% detection, and a false positive rate of 1 in a 1,000,000.

6 0.64837062 149 nips-2001-Probabilistic Abstraction Hierarchies

7 0.64721787 46 nips-2001-Categorization by Learning and Combining Object Parts

8 0.64546734 190 nips-2001-Thin Junction Trees

9 0.64462692 131 nips-2001-Neural Implementation of Bayesian Inference in Population Codes

10 0.64388418 27 nips-2001-Activity Driven Adaptive Stochastic Resonance

11 0.6436168 52 nips-2001-Computing Time Lower Bounds for Recurrent Sigmoidal Neural Networks

12 0.64361334 56 nips-2001-Convolution Kernels for Natural Language

13 0.64285135 161 nips-2001-Reinforcement Learning with Long Short-Term Memory

14 0.64252472 150 nips-2001-Probabilistic Inference of Hand Motion from Neural Activity in Motor Cortex

15 0.64183116 157 nips-2001-Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning

16 0.64182246 60 nips-2001-Discriminative Direction for Kernel Classifiers

17 0.63853037 89 nips-2001-Grouping with Bias

18 0.63797402 121 nips-2001-Model-Free Least-Squares Policy Iteration

19 0.63699013 169 nips-2001-Small-World Phenomena and the Dynamics of Information

20 0.63682926 185 nips-2001-The Method of Quantum Clustering