nips nips2002 nips2002-93 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Shantanu Chakrabartty, Gert Cauwenberghs
Abstract: Forward decoding kernel machines (FDKM) combine large-margin classifiers with hidden Markov models (HMM) for maximum a posteriori (MAP) adaptive sequence estimation. State transitions in the sequence are conditioned on observed data using a kernel-based probability model trained with a recursive scheme that deals effectively with noisy and partially labeled data. Training over very large data sets is accomplished using a sparse probabilistic support vector machine (SVM) model based on quadratic entropy, and an on-line stochastic steepest descent algorithm. For speaker-independent continuous phone recognition, FDKM trained over 177,080 samples of the TIMIT database achieves 80.6% recognition accuracy over the full test set, without use of a prior phonetic language model.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Forward decoding kernel machines (FDKM) combine large-margin classifiers with hidden Markov models (HMM) for maximum a posteriori (MAP) adaptive sequence estimation. [sent-2, score-0.627]
2 State transitions in the sequence are conditioned on observed data using a kernel-based probability model trained with a recursive scheme that deals effectively with noisy and partially labeled data. [sent-3, score-0.262]
3 Training over very large data sets is accomplished using a sparse probabilistic support vector machine (SVM) model based on quadratic entropy, and an on-line stochastic steepest descent algorithm. [sent-4, score-0.044]
4 For speaker-independent continuous phone recognition, FDKM trained over 177,080 samples of the TIMIT database achieves 80.6% recognition accuracy over the full test set, without use of a prior phonetic language model. [sent-5, score-0.272] [sent-6, score-0.254]
6 1 Introduction Sequence estimation is at the core of many problems in pattern recognition, most notably speech and language processing. [sent-7, score-0.217]
7 Recognizing dynamic patterns in sequential data requires a set of tools very different from classifiers trained to recognize static patterns in data assumed i.i.d. [sent-8, score-0.321]
8 The speech recognition community has predominantly relied on hidden Markov models (HMMs) [1] to produce state-of-the-art results. [sent-12, score-0.302]
9 If the aim is discrimination between classes, then it might be sufficient to model discrimination boundaries between classes which (in most affine cases) afford fewer parameters. [sent-14, score-0.225]
10 Recurrent neural networks have been used to extend the dynamic modeling power of HMMs with the discriminant nature of neural networks [2], but learning long term dependencies remains a challenging problem [3]. [sent-15, score-0.045]
11 Typically, neural network training algorithms are prone to local optima, and while they work well in many situations, the quality and consistency of the converged solution cannot be warranted. [sent-16, score-0.089]
12 Large margin classifiers, like support vector machines, have been the subject of intensive research in the neural network and artificial intelligence communities [4]. [sent-17, score-0.114]
13 They are attractive because they generalize well even with relatively few data points in the training set, and bounds on the generalization error can be directly obtained from the training data. [sent-18, score-0.178]
14 Under general conditions, the training procedure finds a unique solution (decision or regression surface) that provides an out-of-sample performance superior to many techniques. [sent-19, score-0.214]
15 Recently, support vector machines (SVMs) [4] have been used for phoneme (or phone) recognition [5] and have shown encouraging results. [sent-20, score-0.172]
16 However, use of a standard SVM ... Figure 1: (a) Two-state Markovian maximum-likelihood (ML) model with static state transition probabilities and observation vectors x emitted from the states. [sent-21, score-0.465]
17 (b) Two-state Markovian MAP model, where transition probabilities between states are modulated by the observation vector x. [sent-22, score-0.487]
18 To model inter-phonetic dependencies, maximum likelihood (ML) approaches assume a phonetic language model that is independent of the utterance data [6], as illustrated in Figure 1 (a). [sent-27, score-0.235]
19 In contrast, the maximum a posteriori (MAP) approach assumes transitions between states that are directly modulated by the observed data, as illustrated in Figure 1 (b). [sent-28, score-0.368]
20 The MAP approach lends itself naturally to hybrid HMM/connectionist approaches with performance comparable to state-of-the-art HMM systems [7]. [sent-29, score-0.096]
21 FDKM [8] can be seen as a hybrid HMM/SVM MAP approach to sequence estimation. [sent-30, score-0.151]
22 It thereby augments the ability of large margin classifiers to infer sequential properties of the data. [sent-31, score-0.337]
23 FDKMs have shown superior performance for channel equalization in digital communication, where the received symbol sequence is contaminated by inter-symbol interference [8]. [sent-32, score-0.393]
24 In the present paper, FDKM is applied to speaker-independent continuous phone recognition. [sent-33, score-0.188]
25 To handle the vast amount of data in the TIMIT corpus, we present a sparse probabilistic model and efficient implementation of the associated FDKM training procedure. [sent-34, score-0.136]
26 2 FDKM formulation The problem of FDKM recognition is formulated in the framework of MAP (maximum a posteriori) estimation, combining Markovian dynamics with kernel machines. [sent-35, score-0.332]
27 A Markovian model is assumed with symbols belonging to S classes, as illustrated in Figure 1(a) for S = 2. [sent-36, score-0.114]
28 Transitions between the classes are modulated in probability by observation (data) vectors x over time. [sent-37, score-0.227]
29 2.1 Decoding Formulation The MAP forward decoder receives the sequence X[n] = {x[n], x[n-1], ..., x[1]} and produces an estimate of the probability of the state variable q[n] over all classes i, α_i[n] = P(q[n] = i | X[n], w), where w denotes the set of parameters for the learning machine. [sent-39, score-0.271] [sent-42, score-0.143]
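Equations (1) and (2), referenced in the next sentences, do not appear among the extracted sentences. A plausible reconstruction of the forward decoding recursion and the decoding rule, based only on the description above (the exact form used in the paper may differ), is:

\alpha_i[n] = \sum_{j=0}^{S-1} P_{ij}[n] \, \alpha_j[n-1],
    \quad\text{with}\quad
    P_{ij}[n] \equiv P(q[n]=i \mid q[n-1]=j, \mathbf{x}[n], \mathbf{w})   % (1)

\hat{q}[n] = \arg\max_i \; \alpha_i[n]   % (2), MAP readout of the decoded symbol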
31 Unlike hidden Markov models, the states directly encode the symbols, and the observations x modulate transition probabilities between states [7]. [sent-43, score-0.36]
32 The forward decoding (1) embeds sequential dependence of the data wherein the probability estimate at time instant n depends on all the previous data. [sent-45, score-0.448]
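A minimal sketch of how this forward recursion can be run, assuming the data-modulated transition probabilities are already available from a trained model; the function name transition_prob, the greedy arg-max readout, and the per-step renormalization are illustrative assumptions, not the paper's exact decoder:

import numpy as np

def forward_decode(X, transition_prob, alpha0):
    """Run the forward recursion alpha_i[n] = sum_j P_ij[n] * alpha_j[n-1].

    X               : observation vectors x[1..N], shape (N, d)
    transition_prob : callable returning an (S, S) matrix P[i, j] = P(i | j, x),
                      standing in for the trained kernel-based probability model
    alpha0          : initial state probabilities, shape (S,)
    """
    alpha = np.asarray(alpha0, dtype=float)
    decoded = []
    for x in X:
        P = transition_prob(x)                 # transition probabilities modulated by x[n]
        alpha = P @ alpha                      # forward update over all outgoing states j
        alpha = alpha / alpha.sum()            # renormalize for numerical stability
        decoded.append(int(np.argmax(alpha)))  # greedy MAP estimate of q[n]
    return np.array(decoded), alpha

Renormalizing alpha at every step keeps the recursion stable over long sequences without changing the arg-max.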
33 Accurate estimation of transition probabilities P_ij[n] in (1) is crucial in decoding (2) to provide good performance. [sent-47, score-0.391]
34 In [8] we used kernel logistic regression [10], with regularized maximum cross-entropy, to model conditional probabilities. [sent-48, score-0.344]
35 2.2 Training Formulation For training the MAP forward decoder, we assume access to a training sequence with labels (class memberships). [sent-51, score-0.449]
36 For instance, the TIMIT speech database comes labeled with phonemes. [sent-52, score-0.175]
37 Continuous (soft) labels could be assigned rather than binary indicator labels, to signify uncertainty in the training data over the classes. [sent-53, score-0.211]
38 The parameter space w can be partitioned into disjoint parameter vectors w_ij and b_ij for each pair of classes i, j = 0, ..., S-1, such that P_ij[n] depends only on w_ij and b_ij. (The parameter b_ij corresponds to the bias term in the standard SVM formulation.) [sent-56, score-0.471] [sent-59, score-0.257] [sent-60, score-0.257]
41 The objective function (4) is similar to the primal formulation of a large margin classifier [4]. [sent-62, score-0.324]
42 Unlike the convex (quadratic) cost function of SVMs, the formulation (4) does not have a unique solution and direct optimization could lead to poor local optima. [sent-63, score-0.203]
43 However, a lower bound of the objective function can be formulated so that maximizing this lower bound reduces to a set of convex optimization sub-problems with an elegant dual formulation in terms of support vectors and kernels. [sent-64, score-0.316]
44 Applying the convexity of the -log(·) function to the convex sum in the forward estimation (1), we obtain directly (5), where H_j = Σ_{n=0}^{N-1} C_j[n] Σ_{i=0}^{S-1} y_i[n] log P_ij[n] - (1/2) Σ_{i=0}^{S-1} |w_ij|²  (6), with effective regularization sequence C_j[n] = C α_j[n-1]  (7). [sent-66, score-0.411]
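The bound invoked here follows from the concavity of log(·) (equivalently, the convexity of -log(·)) applied to the convex combination in (1); a sketch of the step, in the notation above:

% Since sum_j alpha_j[n-1] = 1 and log is concave (Jensen's inequality):
\log \alpha_i[n]
  = \log \sum_{j=0}^{S-1} \alpha_j[n-1] \, P_{ij}[n]
  \ge \sum_{j=0}^{S-1} \alpha_j[n-1] \, \log P_{ij}[n]

Summing over n and i with the cross-entropy targets y_i[n] and adding the regularizer then bounds the objective (4) from below by the sum of the per-state terms H_j in (6), with the factors α_j[n-1] absorbed into the effective regularization sequence C_j[n] = C α_j[n-1] of (7).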
45 Disregarding the intricate dependence of (7) on the results of (6), which we defer to the following section, the formulation (6) is equivalent to regression of conditional probabilities P_ij[n] from labeled data x[n] and y_i[n] for a given outgoing state j. [sent-67, score-0.695]
46 3 Kernel Logistic Probability Regression Estimation of conditional probabilities Pr(i|x) from training data x[n] and labels y_i[n] can be obtained using a regularized form of kernel logistic regression [10]. [sent-69, score-0.612]
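As a rough illustration of such a probability model, the sketch below evaluates softmax-normalized kernel expansions, the standard form of multi-class kernel logistic regression; the Gaussian kernel, the variable names, and the dense coefficient matrix are assumptions made here for illustration (the paper's sparse quadratic-entropy model would leave most coefficients at zero):

import numpy as np

def rbf_kernel(x, Xtrain, gamma=1.0):
    """Gaussian kernel between one observation x and all training points."""
    d2 = np.sum((Xtrain - x) ** 2, axis=1)
    return np.exp(-gamma * d2)

def conditional_probs(x, Xtrain, Lam, b, gamma=1.0):
    """Sketch of a kernel logistic model Pr(i | x) for one outgoing state j.

    Xtrain : training observations x[m], shape (N, d)
    Lam    : kernel expansion coefficients lambda_i[m], shape (S, N)
    b      : bias terms b_i, shape (S,)
    """
    k = rbf_kernel(x, Xtrain, gamma)   # K(x, x[m]) for all training points
    f = Lam @ k + b                    # f_i(x) = sum_m lambda_i[m] K(x, x[m]) + b_i
    f = f - f.max()                    # stabilize the softmax numerically
    p = np.exp(f)
    return p / p.sum()                 # Pr(i | x) as normalized exponentials

In FDKM, one such model would be trained for each outgoing state j, with its training samples weighted by the effective regularization sequence C_j[n].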
wordName wordTfidf (topN-words)
[('fdkm', 0.414), ('pij', 0.306), ('bij', 0.257), ('adn', 0.237), ('phone', 0.188), ('classifiers', 0.144), ('formulation', 0.129), ('decoding', 0.126), ('ydn', 0.118), ('markovian', 0.116), ('transition', 0.111), ('hmms', 0.104), ('probabilities', 0.104), ('timit', 0.103), ('forward', 0.1), ('modulated', 0.098), ('sequence', 0.096), ('map', 0.093), ('training', 0.089), ('phonetic', 0.088), ('posteriori', 0.087), ('language', 0.087), ('ij', 0.086), ('outgoing', 0.083), ('speech', 0.08), ('kernel', 0.08), ('sequential', 0.079), ('recognition', 0.079), ('regularizer', 0.079), ('symbol', 0.078), ('labels', 0.075), ('regression', 0.075), ('decoder', 0.075), ('logistic', 0.075), ('convex', 0.074), ('classes', 0.074), ('transitions', 0.073), ('cj', 0.07), ('objective', 0.069), ('state', 0.069), ('classifier', 0.063), ('margin', 0.063), ('svms', 0.062), ('illustrated', 0.06), ('svm', 0.058), ('conditional', 0.057), ('ml', 0.057), ('hmm', 0.057), ('static', 0.057), ('regularized', 0.057), ('hybrid', 0.055), ('observation', 0.055), ('symbols', 0.054), ('disjoint', 0.054), ('labeled', 0.052), ('discrimination', 0.052), ('wherein', 0.051), ('predominantly', 0.051), ('xio', 0.051), ('augments', 0.051), ('biasvariance', 0.051), ('communities', 0.051), ('embeds', 0.051), ('est', 0.051), ('fij', 0.051), ('lx', 0.051), ('memberships', 0.051), ('estimation', 0.05), ('states', 0.05), ('superior', 0.05), ('machines', 0.049), ('afford', 0.047), ('vast', 0.047), ('relied', 0.047), ('inter', 0.047), ('iw', 0.047), ('gert', 0.047), ('signify', 0.047), ('regularization', 0.047), ('unlike', 0.046), ('hidden', 0.045), ('dependencies', 0.045), ('formulated', 0.044), ('equalization', 0.044), ('baltimore', 0.044), ('disregarding', 0.044), ('yd', 0.044), ('phoneme', 0.044), ('steepest', 0.044), ('intricate', 0.044), ('ol', 0.044), ('database', 0.043), ('yi', 0.041), ('defer', 0.041), ('hopkins', 0.041), ('johns', 0.041), ('md', 0.041), ('lends', 0.041), ('trained', 0.041), ('dependence', 0.041)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 93 nips-2002-Forward-Decoding Kernel-Based Phone Recognition
Author: Shantanu Chakrabartty, Gert Cauwenberghs
Abstract: Forward decoding kernel machines (FDKM) combine large-margin classifiers with hidden Markov models (HMM) for maximum a posteriori (MAP) adaptive sequence estimation. State transitions in the sequence are conditioned on observed data using a kernel-based probability model trained with a recursive scheme that deals effectively with noisy and partially labeled data. Training over very large data sets is accomplished using a sparse probabilistic support vector machine (SVM) model based on quadratic entropy, and an on-line stochastic steepest descent algorithm. For speaker-independent continuous phone recognition, FDKM trained over 177,080 samples of the TIMIT database achieves 80.6% recognition accuracy over the full test set, without use of a prior phonetic language model.
2 0.17237821 29 nips-2002-Analysis of Information in Speech Based on MANOVA
Author: Sachin S. Kajarekar, Hynek Hermansky
Abstract: We propose analysis of information in speech using three sources - language (phone), speaker and channel. Information in speech is measured as mutual information between the source and the set of features extracted from the speech signal. We assume that the distribution of features can be modeled using a Gaussian distribution. The mutual information is computed using the results of analysis of variability in speech. We observe similarity in the results of phone variability and phone information, and show that the results of the proposed analysis have more meaningful interpretations than the analysis of variability. 1
3 0.11086109 25 nips-2002-An Asynchronous Hidden Markov Model for Audio-Visual Speech Recognition
Author: Samy Bengio
Abstract: This paper presents a novel Hidden Markov Model architecture to model the joint probability of pairs of asynchronous sequences describing the same event. It is based on two other Markovian models, namely Asynchronous Input/ Output Hidden Markov Models and Pair Hidden Markov Models. An EM algorithm to train the model is presented, as well as a Viterbi decoder that can be used to obtain the optimal state sequence as well as the alignment between the two sequences. The model has been tested on an audio-visual speech recognition task using the M2VTS database and yielded robust performances under various noise conditions. 1
4 0.10807092 73 nips-2002-Dynamic Bayesian Networks with Deterministic Latent Tables
Author: David Barber
Abstract: The application of latent/hidden variable Dynamic Bayesian Networks is constrained by the complexity of marginalising over latent variables. For this reason either small latent dimensions or Gaussian latent conditional tables linearly dependent on past states are typically considered in order that inference is tractable. We suggest an alternative approach in which the latent variables are modelled using deterministic conditional probability tables. This specialisation has the advantage of tractable inference even for highly complex non-linear/non-Gaussian visible conditional probability tables. This approach enables the consideration of highly complex latent dynamics whilst retaining the benefits of a tractable probabilistic model. 1
5 0.10545421 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition
Author: Shinji Watanabe, Yasuhiro Minami, Atsushi Nakamura, Naonori Ueda
Abstract: In this paper, we propose a Bayesian framework, which constructs shared-state triphone HMMs based on a variational Bayesian approach, and recognizes speech based on the Bayesian prediction classification; variational Bayesian estimation and clustering for speech recognition (VBEC). An appropriate model structure with high recognition performance can be found within a VBEC framework. Unlike conventional methods, including BIC or MDL criterion based on the maximum likelihood approach, the proposed model selection is valid in principle, even when there are insufficient amounts of data, because it does not use an asymptotic assumption. In isolated word recognition experiments, we show the advantage of VBEC over conventional methods, especially when dealing with small amounts of data.
6 0.10145215 121 nips-2002-Knowledge-Based Support Vector Machine Classifiers
7 0.09783987 135 nips-2002-Learning with Multiple Labels
8 0.097539723 161 nips-2002-PAC-Bayes & Margins
9 0.095783032 69 nips-2002-Discriminative Learning for Label Sequences via Boosting
10 0.093504816 45 nips-2002-Boosted Dyadic Kernel Discriminants
11 0.092081599 191 nips-2002-String Kernels, Fisher Kernels and Finite State Automata
12 0.088278212 62 nips-2002-Coulomb Classifiers: Generalizing Support Vector Machines via an Analogy to Electrostatic Systems
13 0.087932929 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
14 0.086856648 156 nips-2002-On the Complexity of Learning the Kernel Matrix
15 0.084210597 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
16 0.08170528 19 nips-2002-Adapting Codes and Embeddings for Polychotomies
17 0.079935834 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines
18 0.077421635 119 nips-2002-Kernel Dependency Estimation
19 0.075851016 106 nips-2002-Hyperkernels
20 0.075644612 165 nips-2002-Ranking with Large Margin Principle: Two Approaches
topicId topicWeight
[(0, -0.228), (1, -0.08), (2, 0.011), (3, -0.048), (4, 0.008), (5, 0.005), (6, -0.028), (7, 0.055), (8, 0.132), (9, -0.087), (10, 0.106), (11, -0.075), (12, -0.08), (13, -0.049), (14, -0.081), (15, -0.051), (16, 0.032), (17, 0.152), (18, 0.04), (19, 0.089), (20, 0.056), (21, -0.087), (22, 0.08), (23, -0.147), (24, -0.045), (25, -0.087), (26, -0.031), (27, -0.17), (28, 0.024), (29, -0.022), (30, -0.013), (31, -0.124), (32, 0.058), (33, -0.035), (34, 0.165), (35, -0.091), (36, -0.15), (37, 0.116), (38, 0.06), (39, 0.059), (40, 0.031), (41, -0.086), (42, 0.025), (43, -0.104), (44, 0.084), (45, -0.087), (46, 0.02), (47, -0.141), (48, 0.05), (49, -0.046)]
simIndex simValue paperId paperTitle
same-paper 1 0.94032186 93 nips-2002-Forward-Decoding Kernel-Based Phone Recognition
Author: Shantanu Chakrabartty, Gert Cauwenberghs
Abstract: Forward decoding kernel machines (FDKM) combine large-margin classifiers with hidden Markov models (HMM) for maximum a posteriori (MAP) adaptive sequence estimation. State transitions in the sequence are conditioned on observed data using a kernel-based probability model trained with a recursive scheme that deals effectively with noisy and partially labeled data. Training over very large data sets is accomplished using a sparse probabilistic support vector machine (SVM) model based on quadratic entropy, and an on-line stochastic steepest descent algorithm. For speaker-independent continuous phone recognition, FDKM trained over 177,080 samples of the TIMIT database achieves 80.6% recognition accuracy over the full test set, without use of a prior phonetic language model.
2 0.72196996 29 nips-2002-Analysis of Information in Speech Based on MANOVA
Author: Sachin S. Kajarekar, Hynek Hermansky
Abstract: We propose analysis of information in speech using three sources - language (phone), speaker and channel. Information in speech is measured as mutual information between the source and the set of features extracted from the speech signal. We assume that the distribution of features can be modeled using a Gaussian distribution. The mutual information is computed using the results of analysis of variability in speech. We observe similarity in the results of phone variability and phone information, and show that the results of the proposed analysis have more meaningful interpretations than the analysis of variability. 1
3 0.50403816 73 nips-2002-Dynamic Bayesian Networks with Deterministic Latent Tables
Author: David Barber
Abstract: The application of latent/hidden variable Dynamic Bayesian Networks is constrained by the complexity of marginalising over latent variables. For this reason either small latent dimensions or Gaussian latent conditional tables linearly dependent on past states are typically considered in order that inference is tractable. We suggest an alternative approach in which the latent variables are modelled using deterministic conditional probability tables. This specialisation has the advantage of tractable inference even for highly complex non-linear/non-Gaussian visible conditional probability tables. This approach enables the consideration of highly complex latent dynamics whilst retaining the benefits of a tractable probabilistic model. 1
4 0.48048973 25 nips-2002-An Asynchronous Hidden Markov Model for Audio-Visual Speech Recognition
Author: Samy Bengio
Abstract: This paper presents a novel Hidden Markov Model architecture to model the joint probability of pairs of asynchronous sequences describing the same event. It is based on two other Markovian models, namely Asynchronous Input/ Output Hidden Markov Models and Pair Hidden Markov Models. An EM algorithm to train the model is presented, as well as a Viterbi decoder that can be used to obtain the optimal state sequence as well as the alignment between the two sequences. The model has been tested on an audio-visual speech recognition task using the M2VTS database and yielded robust performances under various noise conditions. 1
5 0.46958056 114 nips-2002-Information Regularization with Partially Labeled Data
Author: Martin Szummer, Tommi S. Jaakkola
Abstract: Classification with partially labeled data requires using a large number of unlabeled examples (or an estimated marginal P (x)), to further constrain the conditional P (y|x) beyond a few available labeled examples. We formulate a regularization approach to linking the marginal and the conditional in a general way. The regularization penalty measures the information that is implied about the labels over covering regions. No parametric assumptions are required and the approach remains tractable even for continuous marginal densities P (x). We develop algorithms for solving the regularization problem for finite covers, establish a limiting differential equation, and exemplify the behavior of the new regularization approach in simple cases.
6 0.4637289 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition
7 0.4598937 121 nips-2002-Knowledge-Based Support Vector Machine Classifiers
8 0.41483644 129 nips-2002-Learning in Spiking Neural Assemblies
9 0.41451153 62 nips-2002-Coulomb Classifiers: Generalizing Support Vector Machines via an Analogy to Electrostatic Systems
10 0.40252703 192 nips-2002-Support Vector Machines for Multiple-Instance Learning
11 0.40012756 178 nips-2002-Robust Novelty Detection with Single-Class MPM
12 0.39642227 135 nips-2002-Learning with Multiple Labels
13 0.39003927 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines
14 0.34870371 124 nips-2002-Learning Graphical Models with Mercer Kernels
15 0.34017748 69 nips-2002-Discriminative Learning for Label Sequences via Boosting
16 0.32931721 165 nips-2002-Ranking with Large Margin Principle: Two Approaches
17 0.32836464 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
18 0.32436243 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
19 0.32280853 7 nips-2002-A Hierarchical Bayesian Markovian Model for Motifs in Biopolymer Sequences
20 0.31792417 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
topicId topicWeight
[(10, 0.269), (11, 0.024), (23, 0.042), (42, 0.045), (54, 0.132), (55, 0.036), (57, 0.02), (67, 0.03), (68, 0.045), (74, 0.122), (92, 0.051), (98, 0.12)]
simIndex simValue paperId paperTitle
1 0.93459797 20 nips-2002-Adaptive Caching by Refetching
Author: Robert B. Gramacy, Manfred K. Warmuth, Scott A. Brandt, Ismail Ari
Abstract: We are constructing caching policies that have 13-20% lower miss rates than the best of twelve baseline policies over a large variety of request streams. This represents an improvement of 49–63% over Least Recently Used, the most commonly implemented policy. We achieve this not by designing a specific new policy but by using on-line Machine Learning algorithms to dynamically shift between the standard policies based on their observed miss rates. A thorough experimental evaluation of our techniques is given, as well as a discussion of what makes caching an interesting on-line learning problem.
same-paper 2 0.80221361 93 nips-2002-Forward-Decoding Kernel-Based Phone Recognition
Author: Shantanu Chakrabartty, Gert Cauwenberghs
Abstract: Forward decoding kernel machines (FDKM) combine large-margin classifiers with hidden Markov models (HMM) for maximum a posteriori (MAP) adaptive sequence estimation. State transitions in the sequence are conditioned on observed data using a kernel-based probability model trained with a recursive scheme that deals effectively with noisy and partially labeled data. Training over very large data sets is accomplished using a sparse probabilistic support vector machine (SVM) model based on quadratic entropy, and an on-line stochastic steepest descent algorithm. For speaker-independent continuous phone recognition, FDKM trained over 177,080 samples of the TIMIT database achieves 80.6% recognition accuracy over the full test set, without use of a prior phonetic language model.
3 0.63423628 27 nips-2002-An Impossibility Theorem for Clustering
Author: Jon M. Kleinberg
Abstract: Although the study of clustering is centered around an intuitively compelling goal, it has been very difficult to develop a unified framework for reasoning about it at a technical level, and profoundly diverse approaches to clustering abound in the research community. Here we suggest a formal perspective on the difficulty in finding such a unification, in the form of an impossibility theorem: for a set of three simple properties, we show that there is no clustering function satisfying all three. Relaxations of these properties expose some of the interesting (and unavoidable) trade-offs at work in well-studied clustering techniques such as single-linkage, sum-of-pairs, k-means, and k-median. 1
4 0.63379312 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
Author: Peter Meinicke, Thorsten Twellmann, Helge Ritter
Abstract: We propose a framework for classifier design based on discriminative densities for representation of the differences of the class-conditional distributions in a way that is optimal for classification. The densities are selected from a parametrized set by constrained maximization of some objective function which measures the average (bounded) difference, i.e. the contrast between discriminative densities. We show that maximization of the contrast is equivalent to minimization of an approximation of the Bayes risk. Therefore using suitable classes of probability density functions, the resulting maximum contrast classifiers (MCCs) can approximate the Bayes rule for the general multiclass case. In particular for a certain parametrization of the density functions we obtain MCCs which have the same functional form as the well-known Support Vector Machines (SVMs). We show that MCC-training in general requires some nonlinear optimization but under certain conditions the problem is concave and can be tackled by a single linear program. We indicate the close relation between SVM- and MCC-training and in particular we show that Linear Programming Machines can be viewed as an approximate realization of MCCs. In the experiments on benchmark data sets, the MCC shows a competitive classification performance.
5 0.63335657 10 nips-2002-A Model for Learning Variance Components of Natural Images
Author: Yan Karklin, Michael S. Lewicki
Abstract: We present a hierarchical Bayesian model for learning efficient codes of higher-order structure in natural images. The model, a non-linear generalization of independent component analysis, replaces the standard assumption of independence for the joint distribution of coefficients with a distribution that is adapted to the variance structure of the coefficients of an efficient image basis. This offers a novel description of higherorder image structure and provides a way to learn coarse-coded, sparsedistributed representations of abstract image properties such as object location, scale, and texture.
6 0.63233644 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond
7 0.63176429 135 nips-2002-Learning with Multiple Labels
9 0.63044488 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
10 0.62980962 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
11 0.62968051 124 nips-2002-Learning Graphical Models with Mercer Kernels
12 0.62965178 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions
13 0.62961471 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition
14 0.62947321 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture
15 0.6286974 44 nips-2002-Binary Tuning is Optimal for Neural Rate Coding with High Temporal Resolution
16 0.62862378 89 nips-2002-Feature Selection by Maximum Marginal Diversity
17 0.62772346 3 nips-2002-A Convergent Form of Approximate Policy Iteration
18 0.62759244 2 nips-2002-A Bilinear Model for Sparse Coding
19 0.62684399 53 nips-2002-Clustering with the Fisher Score
20 0.62618756 52 nips-2002-Cluster Kernels for Semi-Supervised Learning