nips nips2006 nips2006-106 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Fei Sha, Lawrence K. Saul
Abstract: We study the problem of parameter estimation in continuous density hidden Markov models (CD-HMMs) for automatic speech recognition (ASR). As in support vector machines, we propose a learning algorithm based on the goal of margin maximization. Unlike earlier work on max-margin Markov networks, our approach is specifically geared to the modeling of real-valued observations (such as acoustic feature vectors) using Gaussian mixture models. Unlike previous discriminative frameworks for ASR, such as maximum mutual information and minimum classification error, our framework leads to a convex optimization, without any spurious local minima. The objective function for large margin training of CD-HMMs is defined over a parameter space of positive semidefinite matrices. Its optimization can be performed efficiently with simple gradient-based methods that scale well to large problems. We obtain competitive results for phonetic recognition on the TIMIT speech corpus.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We study the problem of parameter estimation in continuous density hidden Markov models (CD-HMMs) for automatic speech recognition (ASR). [sent-6, score-0.287]
2 As in support vector machines, we propose a learning algorithm based on the goal of margin maximization. [sent-7, score-0.378]
3 Unlike earlier work on max-margin Markov networks, our approach is specifically geared to the modeling of real-valued observations (such as acoustic feature vectors) using Gaussian mixture models. [sent-8, score-0.292]
4 Unlike previous discriminative frameworks for ASR, such as maximum mutual information and minimum classification error, our framework leads to a convex optimization, without any spurious local minima. [sent-9, score-0.287]
5 The objective function for large margin training of CD-HMMs is defined over a parameter space of positive semidefinite matrices. [sent-10, score-0.53]
6 We obtain competitive results for phonetic recognition on the TIMIT speech corpus. [sent-12, score-0.398]
7 1 Introduction As a result of many years of widespread use, continuous density hidden Markov models (CD-HMMs) are very well matched to current front and back ends for automatic speech recognition (ASR) [21]. [sent-13, score-0.287]
8 The distributions of these acoustic feature vectors are modeled by Gaussian mixture models (GMMs), which in turn appear as observation models in CD-HMMs. [sent-15, score-0.344]
9 Viterbi decoding is used to solve the problem of sequential classification in ASR—namely, the mapping of sequences of acoustic feature vectors to sequences of phonemes and/or words, which are modeled by state transitions in CD-HMMs. [sent-16, score-0.602]
10 A weakness of this approach, however, is that the model parameters of CD-HMMs are not optimized for sequential classification: in general, maximizing the joint likelihood does not minimize the phoneme or word error rates, which are more relevant metrics for ASR. [sent-20, score-0.316]
11 The learning algorithms in these frameworks optimize discriminative criteria that more closely track actual error rates, as opposed to the EM algorithm for maximum likelihood estimation. [sent-22, score-0.283]
12 Recently, in a new approach to discriminative acoustic modeling, we proposed the use of “large margin GMMs” for multiway classification [15]. [sent-24, score-0.89]
13 Inspired by support vector machines (SVMs), the learning algorithm in large margin GMMs is designed to maximize the distance between labeled examples and the decision boundaries that separate different classes [19]. [sent-25, score-0.581]
14 In contrast to SVMs, however, large margin GMMs are very naturally suited to problems in multiway (as opposed to binary) classification; also, they do not require the kernel trick for nonlinear decision boundaries. [sent-27, score-0.748]
15 We showed how to train large margin GMMs as segment-based phonetic classifiers, yielding significantly lower error rates than maximum likelihood GMMs [15]. [sent-28, score-0.74]
16 The integrated large margin training of GMMs and transition probabilities in CD-HMMs, however, was left as an open problem. [sent-29, score-0.514]
17 We address that problem in this paper, showing how to train large margin CD-HMMs in the more general setting of sequential (as opposed to multiway) classification. [sent-30, score-0.641]
18 In this setting, the GMMs appear as acoustic models whose likelihoods are integrated over time by Viterbi decoding. [sent-31, score-0.21]
19 Experimentally, we find that large margin training of HMMs for sequential classification leads to significant improvement beyond the frame-based and segment-based discriminative training in [15]. [sent-32, score-0.816]
20 Our framework for large margin training of CD-HMMs builds on ideas from many previous studies in machine learning and ASR. [sent-33, score-0.483]
21 It is similar in motivation to recent frameworks for sequential classification in the machine learning community [1, 6, 17], but differs in its focus on the real-valued acoustic feature representations used in ASR. [sent-34, score-0.425]
22 It is similar in motivation to other discriminative paradigms in ASR [3, 4, 5, 11, 13, 20], but differs in its goal of margin maximization and its formulation of the learning problem as a convex optimization over positive semidefinite matrices. [sent-35, score-0.668]
23 2 Large margin GMMs for multiway classification Before developing large margin HMMs for ASR, we briefly review large margin GMMs for multiway classification [15]. [sent-37, score-1.624]
24 The problem of multiway classification is to map inputs x ∈ ℜ^d to labels y ∈ {1, 2, …, C}. [sent-38, score-0.241]
25 Large margin GMMs are trained from a set of labeled examples {(x_n, y_n)}_{n=1}^N. [sent-42, score-0.604]
26 They have many parallels to SVMs, including the goal of margin maximization and the use of a convex surrogate to the zero-one loss [19]. [sent-43, score-0.489]
27 Unlike SVMs, where classes are modeled by half-spaces, in large margin GMMs the classes are modeled by collections of ellipsoids. [sent-44, score-0.521]
28 For this reason, they are more naturally suited to problems in multiway as opposed to binary classification. [sent-45, score-0.259]
29 Sections 2.1–2.3 review the basic framework for large margin GMMs: first, the simplest setting in which each class is modeled by a single ellipsoid; second, the formulation of the learning problem as a convex optimization; third, the general setting in which each class is modeled by two or more ellipsoids. [sent-48, score-0.661]
30 2.1 Parameterization of the decision rule The simplest large margin GMMs model each class by a single ellipsoid in the input space. [sent-52, score-0.682]
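To make the decision rule concrete, here is a minimal Python sketch, assuming the construction used for large margin GMMs in [15]: each class c collects its centroid μ_c, positive semidefinite matrix Ψ_c, and nonnegative offset θ_c into one augmented matrix Φ_c, so that the score z^T Φ_c z of the augmented vector z = [x, 1] equals (x − μ_c)^T Ψ_c (x − μ_c) + θ_c, and an input is labeled by the class with the smallest score. Names and shapes are illustrative, not the authors' code.

```python
import numpy as np

def make_phi(mu, psi, theta):
    # Build the augmented matrix Phi so that z^T Phi z equals
    # (x - mu)^T psi (x - mu) + theta for z = [x, 1].
    d = len(mu)
    phi = np.zeros((d + 1, d + 1))
    phi[:d, :d] = psi
    phi[:d, d] = -psi @ mu
    phi[d, :d] = -psi @ mu
    phi[d, d] = mu @ psi @ mu + theta
    return phi

def classify(x, phis):
    # Decision rule: the class whose ellipsoid score is smallest.
    z = np.append(x, 1.0)
    return int(np.argmin([z @ phi @ z for phi in phis]))
```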
31 The margin of a labeled example is defined as its distance to the nearest decision boundary. [sent-76, score-0.542]
32 If possible, each labeled example is constrained to lie at least one unit distance away from the decision boundary to each competing class: ∀c ≠ y_n: z_n^T (Φ_c − Φ_{y_n}) z_n ≥ 1. (4) [sent-77, score-0.337]
33 Therefore, as in SVMs, we propose a convex optimization that selects the “smallest” parameters that satisfy the large margin constraints in eq. (4). [sent-81, score-0.649]
34 min Σ_c trace(Ψ_c) s.t. z_n^T (Φ_c − Φ_{y_n}) z_n ≥ 1, ∀c ≠ y_n, n = 1, 2, …, N. (5) Note that the trace of the matrix Ψ_c appears in the above objective function, as opposed to the trace of the matrix Φ_c defined earlier. [sent-91, score-0.224]
35 As in SVMs, we introduce nonnegative slack variables ξ_{nc} to monitor the amount by which the margin constraints in eq. (4) are violated. [sent-96, score-0.493]
36 The objective function in this setting balances the margin violations versus the scale regularization: min Σ_{nc} ξ_{nc} + γ Σ_c trace(Ψ_c) s.t. 1 + z_n^T (Φ_{y_n} − Φ_c) z_n ≤ ξ_{nc}, ξ_{nc} ≥ 0, ∀c ≠ y_n, n = 1, 2, …, N. (6) [sent-98, score-0.572]
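As a sketch of how the slack-relaxed objective in eq. (6) can be evaluated, the following assumes the constraint form reconstructed above, with one hinge slack per example and competing class; it is an illustration, not the authors' solver.

```python
import numpy as np

def objective(phis, psis, Z, y, gamma):
    # Eq. (6): sum of slacks xi_nc plus gamma times the trace regularizer.
    # Z: (N, d+1) augmented inputs z_n = [x_n, 1]; y: (N,) class labels.
    slacks = 0.0
    for z, yn in zip(Z, y):
        target = z @ phis[yn] @ z
        for c in range(len(phis)):
            if c != yn:
                # xi_nc = max(0, 1 + z^T Phi_{y_n} z - z^T Phi_c z)
                slacks += max(0.0, 1.0 + target - z @ phis[c] @ z)
    return slacks + gamma * sum(np.trace(psi) for psi in psis)
```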
37 2.3 Softmax margin maximization for multiple mixture components Lastly we review the extension to mixture modeling where each class is represented by multiple ellipsoids [15]. [sent-109, score-0.684]
38 Let Φcm denote the matrix for the mth ellipsoid (or mixture component) in class c. [sent-110, score-0.245]
39 We imagine that each example xn has not only a class label yn , but also a mixture component label mn . [sent-111, score-0.389]
40 Such labels are not provided a priori in the training data, but we can generate “proxy” labels by fitting GMMs to the examples in each class by maximum likelihood estimation, then for each example, computing the mixture component with the highest posterior probability. [sent-112, score-0.313]
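The proxy-label step can be sketched as follows; scikit-learn's GaussianMixture stands in for whatever EM implementation was actually used, which is an assumption.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def proxy_labels(X, y, num_classes, num_components):
    # Fit one GMM per class by maximum likelihood (EM), then label each
    # example with its highest-posterior mixture component.
    m = np.zeros(len(y), dtype=int)
    for c in range(num_classes):
        idx = np.flatnonzero(y == c)
        gmm = GaussianMixture(n_components=num_components).fit(X[idx])
        m[idx] = gmm.predict_proba(X[idx]).argmax(axis=1)
    return m
```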
41 In the setting where each class is represented by multiple ellipsoids, the goal of learning is to ensure that each example is closer to its “target” ellipsoid than the ellipsoids from all other classes. [sent-113, score-0.238]
42 Specifically, for a labeled example (x_n, y_n, m_n), the constraint in eq. (4) is replaced by the M constraints: ∀c ≠ y_n, ∀m: z_n^T (Φ_{cm} − Φ_{y_n m_n}) z_n ≥ 1. (7) [sent-114, score-0.294]
43 [Figure 1: Decision boundary in a large margin GMM; labeled examples lie at least one unit of distance away. Labels: margin, decision boundary; legend: mixture 1, 2, 4, 8.] [sent-115, score-0.802]
44 Table 1: Test error rates on MNIST digit recognition: maximum likelihood versus large margin GMMs. [sent-124, score-0.62]
45 The M constraints in eq. (7) are replaced by the stricter constraint: ∀c ≠ y_n: −log Σ_m e^{−z_n^T Φ_{cm} z_n} − z_n^T Φ_{y_n m_n} z_n ≥ 1. (8) [sent-128, score-0.257]
46 We will use a similar technique in section 3 to handle the exponentially many constraints that arise in sequential classification. [sent-129, score-0.286]
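The sense in which eq. (8) is stricter than eq. (7) follows from a log-sum-exp identity: the softmax distance −log Σ_m e^{−d_m} never exceeds min_m d_m, so one constraint on the softmax distance implies all M constraints. A quick numerical check:

```python
import numpy as np
from scipy.special import logsumexp

d = np.array([3.2, 1.7, 5.0])   # distances to one class's ellipsoids
soft_min = -logsumexp(-d)       # softmax distance: -log sum_m exp(-d_m)
assert soft_min <= d.min()      # the softmax distance lower-bounds the min
```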
47 The optimization is given by: min Σ_{nc} ξ_{nc} + γ Σ_{cm} trace(Ψ_{cm}) [sent-139, score-0.334]
48 s.t. 1 + z_n^T Φ_{y_n m_n} z_n + log Σ_m e^{−z_n^T Φ_{cm} z_n} ≤ ξ_{nc}, ξ_{nc} ≥ 0, ∀c ≠ y_n, n = 1, 2, …, N. (9) [sent-141, score-0.257]
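A per-example sketch of the slack terms in eq. (9), assuming phi is a lookup from (class, mixture component) pairs to augmented matrices as above:

```python
import numpy as np
from scipy.special import logsumexp

def slacks(z, yn, mn, phi, num_classes, num_mixtures):
    # xi_nc = max(0, 1 + z^T Phi_{y_n m_n} z + log sum_m exp(-z^T Phi_{cm} z))
    target = z @ phi[(yn, mn)] @ z
    return {c: max(0.0, 1.0 + target + logsumexp(
                [-(z @ phi[(c, m)] @ z) for m in range(num_mixtures)]))
            for c in range(num_classes) if c != yn}
```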
49 2.4 Handwritten digit recognition We trained large margin GMMs for multiway classification of MNIST handwritten digits [8]. [sent-153, score-0.815]
50 Table 1 shows that the large margin GMMs yielded significantly lower test error rates than GMMs trained by maximum likelihood estimation. [sent-155, score-0.605]
51 For our best model, with four mixture components per digit class, the core training optimization over all training examples took five minutes on a PC. [sent-159, score-0.347]
52 3 Large margin HMMs for sequential classification In this section, we extend the framework in the previous section from multiway classification to sequential classification. [sent-161, score-0.926]
53 In sequential classification by CD-HMMs, the goal is to infer the correct hidden state sequence y = [y_1, y_2, …, y_T] given an observation sequence X = [x_1, x_2, …, x_T]. [sent-165, score-0.353]
54 In the application to ASR, the hidden states correspond to phoneme labels, and the observations are acoustic feature vectors. [sent-172, score-0.354]
55 Note that if an observation sequence has length T and each label can belong to C classes, then the number of incorrect state sequences grows as O(C^T). [sent-173, score-0.288]
56 This combinatorial explosion presents the main challenge for large margin methods in sequential classification: how to separate the correct hidden state sequence from the exponentially large number of incorrect ones. [sent-174, score-0.931]
57 Section 3.2 describes our algorithm for large margin training of CD-HMMs. [sent-179, score-0.483]
58 Both these extensions were implemented, however, for the experiments on phonetic recognition in section 3.3. [sent-184, score-0.238]
59 3.1 Margin constraints for sequential classification We start by defining a discriminant function over state (label) sequences of the CD-HMM. [sent-187, score-0.401]
60 Let a(i, j) denote the transition probabilities of the CD-HMM, and let Φs denote the ellipsoid parameters of state s. [sent-188, score-0.225]
61 In the setting where each state is modeled by multiple mixture components, the acoustic scores from individual Mahalanobis distances are replaced with “softmax” distances of the form −log Σ_{m=1}^M e^{−z_t^T Φ_{s_t m} z_t}, as described in section 2.3. [sent-198, score-0.477]
62 We introduce margin constraints in terms of the above discriminant function. [sent-200, score-0.492]
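A sketch of the discriminant function D(X, s) described above, accumulating per-frame softmax acoustic scores and, as the text implies, log transition probabilities; the array shapes and the handling of the initial state are assumptions.

```python
import numpy as np
from scipy.special import logsumexp

def discriminant(X, s, log_trans, phi):
    # X: (T, d) acoustic features; s: (T,) state sequence;
    # log_trans[i, j] = log a(i, j); phi: (S, M, d+1, d+1) state matrices.
    score = 0.0
    for t in range(len(s)):
        z = np.append(X[t], 1.0)
        # Negative softmax distance: log sum_m exp(-z^T Phi_{s_t m} z).
        score += logsumexp([-(z @ phi[s[t], m] @ z) for m in range(phi.shape[1])])
        if t > 0:
            score += log_trans[s[t - 1], s[t]]
    return score
```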
63 Let H(s, y) denote the Hamming distance (i.e., the number of mismatched labels) between an arbitrary state sequence s and the target state sequence y. [sent-203, score-0.246]
64 Earlier, in section 2 on multiway classification, we constrained each labeled example to lie at least one unit distance from the decision boundary to each competing class; see eq. (4). [sent-204, score-0.399]
65 In contrast, eq. (11), which takes the form D(X, y) − D(X, s) ≥ H(s, y) for all s ≠ y, requires that the (log-likelihood) gap between the score of an incorrect sequence s and the target sequence y grow in proportion to the number of individual label errors. [sent-207, score-0.298]
66 The appropriateness of such proportional constraints for sequential classification was first noted by [17]. [sent-208, score-0.244]
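Checking eq. (11) for one particular incorrect sequence is then straightforward; in this sketch, score_y and score_s stand for D(X, y) and D(X, s) computed as above.

```python
def violation(y, s, score_y, score_s):
    # Eq. (11) demands D(X, y) - D(X, s) >= H(s, y); return the shortfall.
    hamming = sum(int(a != b) for a, b in zip(y, s))
    return max(0.0, hamming - (score_y - score_s))
```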
67 3.2 Softmax margin maximization for sequential classification The challenge of large margin sequence classification lies in the exponentially large number of constraints, one for each incorrect sequence s, embodied by eq. (11). [sent-210, score-1.282]
68 We rewrite eq. (11) as: −D(X, y) + max_{s≠y} {H(s, y) + D(X, s)} ≤ 0. (12) We obtain a more manageable constraint by substituting a softmax upper bound for the max term and requiring that the inequality still hold: −D(X, y) + log Σ_{s≠y} e^{H(s,y)+D(X,s)} ≤ 0. (13) Note that eq. (13) implies eq. (12), since the log-sum upper-bounds the max. [sent-215, score-0.227]
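Although eq. (13) sums over exponentially many sequences, both H(s, y) and D(X, s) decompose over frames, so the log-sum can be computed by a forward recursion in O(T S^2) time. The sketch below sums over all sequences rather than s ≠ y, which only loosens the upper bound and so keeps the constraint valid; frame_scores is assumed to hold the per-frame softmax acoustic scores.

```python
import numpy as np
from scipy.special import logsumexp

def softmax_bound(frame_scores, log_trans, y):
    # Computes log sum_s exp(H(s, y) + D(X, s)) over all state sequences.
    # frame_scores: (T, S); log_trans: (S, S); y: (T,) target sequence.
    T, S = frame_scores.shape
    mismatch = (np.arange(S)[None, :] != np.asarray(y)[:, None]).astype(float)
    alpha = frame_scores[0] + mismatch[0]     # H adds 1 whenever s_t != y_t
    for t in range(1, T):
        alpha = frame_scores[t] + mismatch[t] + \
                logsumexp(alpha[:, None] + log_trans, axis=0)
    return logsumexp(alpha)
```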
69 As in the setting for multiway classification, the objective function for sequential classification balances two terms: one regularizing the scale of the GMM parameters, the other penalizing margin violations. [sent-218, score-0.871]
70 Denoting the training sequences by {X_n, y_n}_{n=1}^N and the slack variables (one for each training sequence) by ξ_n ≥ 0, we obtain the following convex optimization: min Σ_n ξ_n + γ Σ_{cm} trace(Ψ_{cm}) s.t. −D(X_n, y_n) + log Σ_{s≠y_n} e^{H(s,y_n)+D(X_n,s)} ≤ ξ_n, ξ_n ≥ 0, n = 1, 2, …, N; Φ_{cm} ⪰ 0, c = 1, 2, …, C, m = 1, 2, …, M. (14) [sent-219, score-0.448]
71 It is worth emphasizing several crucial differences between this optimization and previous ones [4, 11, 20] for discriminative training of CD-HMMs for ASR. [sent-230, score-0.245]
72 Second, with the Mahalanobis distances expressed as in eq. (2), the discriminant function D(X_n, y_n) and the softmax function are convex in the model parameters. [sent-236, score-0.259]
73 Third, the optimization not only increases the log-likelihood gap between correct and incorrect state sequences, but also drives the gap to grow in proportion to the number of individually incorrect labels (which we believe leads to more robust generalization). [sent-239, score-0.419]
74 Finally, compared to the large margin framework in [17], the softmax handling of the exponentially large number of margin constraints makes it possible to train on larger data sets. [sent-240, score-1.091]
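The abstract states that the optimization is performed with simple gradient-based methods over positive semidefinite matrices; one standard ingredient of such methods, shown here as an assumption rather than the authors' exact procedure, is projecting each updated matrix back onto the PSD cone.

```python
import numpy as np

def project_psd(mat):
    # Symmetrize, then clip negative eigenvalues to zero.
    sym = 0.5 * (mat + mat.T)
    w, v = np.linalg.eigh(sym)
    return (v * np.clip(w, 0.0, None)) @ v.T
```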
75 3.3 Phoneme recognition We used the TIMIT speech corpus [7, 9, 12] to perform experiments in phonetic recognition. [sent-243, score-0.398]
76 We trained baseline maximum likelihood recognizers and two different types of large margin recognizers. [sent-248, score-0.622]
77 The large margin recognizers in the first group were “low-cost” discriminative CD-HMMs whose GMMs were merely trained for frame-based classification. [sent-249, score-0.687]
78 The large margin recognizers in the second group were fully trained for sequential classification. [sent-252, score-0.73]
79 In all the recognizers, the acoustic feature vectors were labeled by 48 phonetic classes, each represented by one state in a first-order CD-HMM. [sent-255, score-0.49]
80 For each recognizer, we compared the phonetic state sequences obtained by Viterbi decoding to the “ground-truth” phonetic transcriptions provided by the TIMIT corpus. [sent-256, score-0.52]
81 For the purpose of computing error rates, we followed standard conventions in mapping the 48 phonetic state labels down to 39 broader phone categories. [sent-257, score-0.422]
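The 48-to-39 folding is conventionally that of Lee and Hon (1989); the sketch below shows only a few representative entries as an illustration, not the complete standard table.

```python
# Partial illustration of folding 48 phonetic labels into 39 phone classes.
FOLD = {'cl': 'sil', 'vcl': 'sil', 'epi': 'sil',   # closures -> silence
        'el': 'l', 'en': 'n', 'zh': 'sh', 'ao': 'aa', 'ix': 'ih'}

def fold(labels):
    # Labels without an entry map to themselves.
    return [FOLD.get(p, p) for p in labels]
```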
82 We computed two different types of phone error rates, one based on Hamming distance, the other based on edit distance. [sent-258, score-0.218]
83 The “frame-based” phone error rate computed from Hamming distances is more closely tracked by our objective function for large margin training, while the “string-based” phone error rate computed from edit distances provides a more relevant metric for ASR. [sent-261, score-0.906]
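For the string-based rate, the edit (Levenshtein) distance between the decoded and reference phone strings can be computed with standard dynamic programming:

```python
def edit_distance(ref, hyp):
    # Minimum number of substitutions, insertions, and deletions.
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),
                          d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[m][n]
```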
84 For both types of error rates, and across all model sizes, the best performance was consistently obtained by large margin CD-HMMs trained for sequential classification. [sent-263, score-0.65]
85 Moreover, among the two different types of large margin recognizers, utterance-based training generally yielded significant improvement over frame-based training. [sent-264, score-0.483]
86 Table 3: String-based phone error rates, from edit distance, of different recognizers. [sent-280, score-0.218]
87 In [16], we compare the large margin training proposed in this paper to both MMI and MCE systems for phoneme recognition trained on the exact same acoustic features. [sent-283, score-0.881]
88 There we find that the large margin approach leads to lower error rates, owing perhaps to the absence of local minima in the objective function and/or the use of margin constraints based on Hamming distances. [sent-284, score-0.942]
89 The use of the Hamming distance in this context is a crucial insight from the work of [17] in the machine learning community, and it differs profoundly from merely penalizing the log-likelihood gap between incorrect and correct transcriptions, as commonly done in ASR. [sent-288, score-0.224]
90 Second, in distinction to previous work in machine learning, we have proposed a framework for sequential classification that naturally integrates with the infrastructure of modern speech recognizers. [sent-289, score-0.331]
91 Using the softmax function, we have also proposed a novel way to monitor the exponentially many margin constraints that arise in sequential classification. [sent-290, score-0.806]
92 For real-valued observation sequences, we have shown how to train large margin HMMs via convex optimizations over their parameter space of positive semidefinite matrices. [sent-291, score-0.534]
93 Finally, we have demonstrated that these learning algorithms lead to improved sequential classification on data sets with over one million training examples. [sent-292, score-0.237]
94 The optimizations in eqs. (5), (6), (9) and (14) are convex: specifically, in terms of the matrices that parameterize large margin GMMs and HMMs, the objective functions are linear, while the constraints define convex sets. [sent-297, score-0.659]
95 MMI training for continuous phoneme recognition on the TIMIT database. [sent-352, score-0.219]
96 A decision-theoretic formulation of a training problem in speech recognition and a comparison of training by unconditional versus conditional maximum likelihood. [sent-408, score-0.39]
97 Large margin training of acoustic models for speech recognition. [sent-422, score-0.814]
98 Large margin Gaussian mixture modeling for phonetic classification and recognition. [sent-428, score-0.63]
99 Comparison of large margin training to other discriminative methods for phonetic recognition by hidden Markov models. [sent-434, score-0.876]
100 Large scale discriminative training of hidden Markov models for speech recognition. [sent-462, score-0.381]
wordName wordTfidf (topN-words)
[('gmms', 0.409), ('margin', 0.378), ('asr', 0.224), ('acoustic', 0.21), ('multiway', 0.206), ('sequential', 0.171), ('phonetic', 0.17), ('speech', 0.16), ('cm', 0.145), ('yn', 0.144), ('softmax', 0.142), ('ellipsoid', 0.131), ('phone', 0.127), ('semide', 0.126), ('hamming', 0.109), ('recognizers', 0.107), ('nc', 0.106), ('hmms', 0.105), ('classi', 0.098), ('discriminative', 0.096), ('phoneme', 0.085), ('optimization', 0.083), ('mixture', 0.082), ('incorrect', 0.08), ('convex', 0.076), ('timit', 0.075), ('ellipsoids', 0.075), ('constraints', 0.073), ('decision', 0.072), ('recognition', 0.068), ('mn', 0.067), ('training', 0.066), ('edit', 0.064), ('mmi', 0.064), ('transcriptions', 0.064), ('state', 0.063), ('rates', 0.063), ('trace', 0.062), ('svms', 0.062), ('sequence', 0.06), ('reparameterization', 0.06), ('hidden', 0.059), ('gmm', 0.054), ('icassp', 0.054), ('opposed', 0.053), ('sequences', 0.053), ('modeled', 0.052), ('viterbi', 0.05), ('digit', 0.05), ('inequality', 0.049), ('em', 0.048), ('sha', 0.048), ('mahalanobis', 0.048), ('labeled', 0.047), ('objective', 0.047), ('zn', 0.046), ('matrices', 0.046), ('distance', 0.045), ('emission', 0.045), ('frameworks', 0.044), ('cdhmms', 0.043), ('minm', 0.043), ('slack', 0.042), ('cation', 0.042), ('exponentially', 0.042), ('discriminant', 0.041), ('spurious', 0.041), ('balances', 0.041), ('subgradient', 0.041), ('optimizations', 0.041), ('st', 0.041), ('large', 0.039), ('handwritten', 0.039), ('gap', 0.039), ('markov', 0.039), ('solver', 0.038), ('constraint', 0.036), ('distances', 0.035), ('maximization', 0.035), ('trained', 0.035), ('labels', 0.035), ('mnist', 0.034), ('accumulates', 0.034), ('mce', 0.034), ('utterance', 0.034), ('hinge', 0.033), ('likelihood', 0.033), ('label', 0.032), ('eh', 0.032), ('class', 0.032), ('merely', 0.032), ('transition', 0.031), ('saul', 0.031), ('maximum', 0.03), ('centroid', 0.03), ('rule', 0.03), ('competing', 0.029), ('penalizing', 0.028), ('score', 0.027), ('error', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 106 nips-2006-Large Margin Hidden Markov Models for Automatic Speech Recognition
Author: Fei Sha, Lawrence K. Saul
Abstract: We study the problem of parameter estimation in continuous density hidden Markov models (CD-HMMs) for automatic speech recognition (ASR). As in support vector machines, we propose a learning algorithm based on the goal of margin maximization. Unlike earlier work on max-margin Markov networks, our approach is specifically geared to the modeling of real-valued observations (such as acoustic feature vectors) using Gaussian mixture models. Unlike previous discriminative frameworks for ASR, such as maximum mutual information and minimum classification error, our framework leads to a convex optimization, without any spurious local minima. The objective function for large margin training of CD-HMMs is defined over a parameter space of positive semidefinite matrices. Its optimization can be performed efficiently with simple gradient-based methods that scale well to large problems. We obtain competitive results for phonetic recognition on the TIMIT speech corpus.
2 0.19992158 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning
Author: Hamed Valizadegan, Rong Jin
Abstract: Maximum margin clustering was proposed lately and has shown promising performance in recent studies [1, 2]. It extends the theory of support vector machine to unsupervised learning. Despite its good performance, there are three major problems with maximum margin clustering that question its efficiency for real-world applications. First, it is computationally expensive and difficult to scale to large-scale datasets because the number of parameters in maximum margin clustering is quadratic in the number of examples. Second, it requires data preprocessing to ensure that any clustering boundary will pass through the origins, which makes it unsuitable for clustering unbalanced dataset. Third, it is sensitive to the choice of kernel functions, and requires external procedure to determine the appropriate values for the parameters of kernel functions. In this paper, we propose “generalized maximum margin clustering” framework that addresses the above three problems simultaneously. The new framework generalizes the maximum margin clustering algorithm by allowing any clustering boundaries including those not passing through the origins. It significantly improves the computational efficiency by reducing the number of parameters. Furthermore, the new framework is able to automatically determine the appropriate kernel matrix without any labeled data. Finally, we show a formal connection between maximum margin clustering and spectral clustering. We demonstrate the efficiency of the generalized maximum margin clustering algorithm using both synthetic datasets and real datasets from the UCI repository. 1
3 0.14288706 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy
Author: Samuel S. Gross, Olga Russakovsky, Chuong B. Do, Serafim Batzoglou
Abstract: We consider the problem of training a conditional random field (CRF) to maximize per-label predictive accuracy on a training set, an approach motivated by the principle of empirical risk minimization. We give a gradient-based procedure for minimizing an arbitrarily accurate approximation of the empirical risk under a Hamming loss function. In experiments with both simulated and real data, our optimization procedure gives significantly better testing performance than several current approaches for CRF training, especially in situations of high label noise. 1
4 0.1374511 176 nips-2006-Single Channel Speech Separation Using Factorial Dynamics
Author: John R. Hershey, Trausti Kristjansson, Steven Rennie, Peder A. Olsen
Abstract: Human listeners have the extraordinary ability to hear and recognize speech even when more than one person is talking. Their machine counterparts have historically been unable to compete with this ability, until now. We present a model-based system that performs on par with humans in the task of separating speech of two talkers from a single-channel recording. Remarkably, the system surpasses human recognition performance in many conditions. The models of speech use temporal dynamics to help infer the source speech signals, given mixed speech signals. The estimated source signals are then recognized using a conventional speech recognition system. We demonstrate that the system achieves its best performance when the model of temporal dynamics closely captures the grammatical constraints of the task.
5 0.12359586 108 nips-2006-Large Scale Hidden Semi-Markov SVMs
Author: Gunnar Rätsch, Sören Sonnenburg
Abstract: We describe Hidden Semi-Markov Support Vector Machines (SHM SVMs), an extension of HM SVMs to semi-Markov chains. This allows us to predict segmentations of sequences based on segment-based features measuring properties such as the length of the segment. We propose a novel technique to partition the problem into sub-problems. The independently obtained partial solutions can then be recombined in an efficient way, which allows us to solve label sequence learning problems with several thousands of labeled sequences. We have tested our algorithm for predicting gene structures, an important problem in computational biology. Results on a well-known model organism illustrate the great potential of SHM SVMs in computational biology. 1
6 0.1066303 130 nips-2006-Max-margin classification of incomplete data
7 0.090489894 186 nips-2006-Support Vector Machines on a Budget
8 0.086184137 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming
9 0.085301481 156 nips-2006-Ordinal Regression by Extended Binary Classification
10 0.0725343 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization
11 0.072302349 172 nips-2006-Scalable Discriminative Learning for Natural Language Parsing and Translation
12 0.071383715 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation
13 0.070976049 54 nips-2006-Comparative Gene Prediction using Conditional Random Fields
14 0.06911806 55 nips-2006-Computation of Similarity Measures for Sequential Data using Generalized Suffix Trees
15 0.067574263 160 nips-2006-Part-based Probabilistic Point Matching using Equivalence Constraints
16 0.065187082 94 nips-2006-Image Retrieval and Classification Using Local Distance Functions
17 0.063128442 193 nips-2006-Tighter PAC-Bayes Bounds
18 0.062287956 105 nips-2006-Large Margin Component Analysis
19 0.060395017 33 nips-2006-Analysis of Representations for Domain Adaptation
20 0.059159543 185 nips-2006-Subordinate class recognition using relational object models
topicId topicWeight
[(0, -0.233), (1, 0.061), (2, -0.003), (3, -0.009), (4, -0.038), (5, 0.068), (6, -0.023), (7, -0.046), (8, 0.005), (9, -0.042), (10, -0.17), (11, 0.03), (12, -0.181), (13, 0.102), (14, 0.083), (15, -0.023), (16, 0.081), (17, -0.014), (18, 0.085), (19, -0.02), (20, 0.222), (21, 0.022), (22, -0.03), (23, -0.059), (24, 0.035), (25, -0.033), (26, -0.093), (27, 0.071), (28, 0.012), (29, 0.08), (30, 0.132), (31, 0.152), (32, 0.037), (33, -0.071), (34, -0.132), (35, -0.131), (36, -0.061), (37, 0.163), (38, 0.047), (39, 0.044), (40, -0.055), (41, 0.209), (42, 0.042), (43, 0.019), (44, -0.02), (45, 0.122), (46, 0.051), (47, -0.087), (48, -0.007), (49, -0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.95489901 106 nips-2006-Large Margin Hidden Markov Models for Automatic Speech Recognition
Author: Fei Sha, Lawrence K. Saul
Abstract: We study the problem of parameter estimation in continuous density hidden Markov models (CD-HMMs) for automatic speech recognition (ASR). As in support vector machines, we propose a learning algorithm based on the goal of margin maximization. Unlike earlier work on max-margin Markov networks, our approach is specifically geared to the modeling of real-valued observations (such as acoustic feature vectors) using Gaussian mixture models. Unlike previous discriminative frameworks for ASR, such as maximum mutual information and minimum classification error, our framework leads to a convex optimization, without any spurious local minima. The objective function for large margin training of CD-HMMs is defined over a parameter space of positive semidefinite matrices. Its optimization can be performed efficiently with simple gradient-based methods that scale well to large problems. We obtain competitive results for phonetic recognition on the TIMIT speech corpus.
2 0.56136811 108 nips-2006-Large Scale Hidden Semi-Markov SVMs
Author: Gunnar Rätsch, Sören Sonnenburg
Abstract: We describe Hidden Semi-Markov Support Vector Machines (SHM SVMs), an extension of HM SVMs to semi-Markov chains. This allows us to predict segmentations of sequences based on segment-based features measuring properties such as the length of the segment. We propose a novel technique to partition the problem into sub-problems. The independently obtained partial solutions can then be recombined in an efficient way, which allows us to solve label sequence learning problems with several thousands of labeled sequences. We have tested our algorithm for predicting gene structures, an important problem in computational biology. Results on a well-known model organism illustrate the great potential of SHM SVMs in computational biology. 1
3 0.56035632 176 nips-2006-Single Channel Speech Separation Using Factorial Dynamics
Author: John R. Hershey, Trausti Kristjansson, Steven Rennie, Peder A. Olsen
Abstract: Human listeners have the extraordinary ability to hear and recognize speech even when more than one person is talking. Their machine counterparts have historically been unable to compete with this ability, until now. We present a model-based system that performs on par with humans in the task of separating speech of two talkers from a single-channel recording. Remarkably, the system surpasses human recognition performance in many conditions. The models of speech use temporal dynamics to help infer the source speech signals, given mixed speech signals. The estimated source signals are then recognized using a conventional speech recognition system. We demonstrate that the system achieves its best performance when the model of temporal dynamics closely captures the grammatical constraints of the task. One of the hallmarks of human perception is our ability to solve the auditory cocktail party problem: we can direct our attention to a given speaker in the presence of interfering speech, and understand what was said remarkably well. Until now the same could not be said for automatic speech recognition systems. However, we have recently introduced a system which in many conditions performs this task better than humans [1][2]. The model addresses the PASCAL Speech Separation Challenge task [3], and outperforms all other published results by more than 10% word error rate (WER). In this model, dynamics are modeled using a layered combination of one or two Markov chains: one for long-term dependencies and another for short-term dependencies. The combination of the two speakers was handled via an iterative Laplace approximation method known as Algonquin [4]. Here we describe experiments that show better performance on the same task with a simpler version of the model. The task we address is provided by the PASCAL Speech Separation Challenge [3], which provides standard training, development, and test data sets of single-channel speech mixtures following an arbitrary but simple grammar. In addition, the challenge organizers have conducted human-listening experiments to provide an interesting baseline for comparison of computational techniques. The overall system we developed is composed of three components: a speaker identification and gain estimation component, a signal separation component, and a speech recognition system. In this paper we focus on the signal separation component, which is composed of the acoustic and grammatical models. The details of the other components are discussed in [2]. Single-channel speech separation has previously been attempted using Gaussian mixture models (GMMs) on individual frames of acoustic features. However, such models tend to perform well only when speakers are of different gender or have rather different voices [4]. When speakers have similar voices, speaker-dependent mixture models cannot unambiguously identify the component speakers. In such cases it is helpful to model the temporal dynamics of the speech. Several models in the literature have attempted to do so, either for recognition [5, 6] or enhancement [7, 8] of speech. Such models have typically been based on a discrete-state hidden Markov model (HMM) operating on a frame-based acoustic feature vector. Modeling the dynamics of the log spectrum of speech is challenging in that different speech components evolve at different time-scales. For example, the excitation, which carries mainly pitch, and the filter, which consists of the formant structure, are somewhat independent of each other.
The formant structure closely follows the sequences of phonemes in each word, which are pronounced at a rate of several per second. In non-tonal languages such as English, the pitch fluctuates with prosody over the course of a sentence, and is not directly coupled with the words being spoken. Nevertheless, it seems to be important in separating speech, because the pitch harmonics carry predictable structure that stands out against the background. We address the various dynamic components of speech by testing different levels of dynamic constraints in our models. We explore four different levels of dynamics: no dynamics, low-level acoustic dynamics, high-level grammar dynamics, and a layered combination, dual dynamics, of the acoustic and grammar dynamics. The grammar dynamics and dual dynamics models perform the best in our experiments. The acoustic models are combined to model mixtures of speech using two methods: a nonlinear model known as Algonquin, which models the combination of log-spectrum models as a sum in the power spectrum, and a simpler max model that combines two log spectra using the max function. It turns out that whereas Algonquin works well, our formulation of the max model does better overall. With the combination of the max model and grammar-level dynamics, the model produces remarkable results: it is often able to extract two utterances from a mixture even when they are from the same speaker. (Demos and information can be found at http://www.research.ibm.com/speechseparation.) Overall results are given in Table 1, which shows that our closest competitors are human listeners.

Table 1: Overall word error rates across all conditions on the challenge task.
  System     Word Error Rate
  Human      22.3%   (average human error rate)
  IBM        22.6%   (our best result)
  Next Best  34.2%   (best of the eight other published results on this task)
  Chance     93.0%   (theoretical error rate for random guessing)

1 Speech Models

The model consists of an acoustic model and temporal dynamics model for each source, and a mixing model, which models how the source models are combined to describe the mixture. The acoustic features were short-time log spectrum frames computed every 15 ms. Each frame was of length 40 ms and a 640-point mixed-radix FFT was used. The DC component was discarded, producing a 319-dimensional log-power-spectrum feature vector y_t. The acoustic model consists of a set of diagonal-covariance Gaussians in the features. For a given speaker a, we model the conditional probability of the log-power spectrum of each source signal x^a given a discrete acoustic state s^a as Gaussian, p(x^a | s^a) = N(x^a; μ_{s^a}, Σ_{s^a}), with mean μ_{s^a} and covariance matrix Σ_{s^a}. We used 256 Gaussians, one per acoustic state, to model the acoustic space of each speaker. For efficiency and tractability we restrict the covariance to be diagonal. A model with no dynamics can be formulated by producing state probabilities p(s^a), and is depicted in Figure 1(a).

Acoustic Dynamics: To capture the low-level dynamics of the acoustic signal, we modeled the acoustic dynamics of a given speaker a via state transitions p(s^a_t | s^a_{t-1}), as shown in Figure 1(b). There are 256 acoustic states; hence for each speaker a we estimated a 256 x 256 element transition matrix A^a.

Grammar Dynamics: The grammar dynamics are modeled by grammar state transitions p(v^a_t | v^a_{t-1}), which consist of left-to-right phone models. (A brief sketch of the per-state acoustic likelihood computation appears below.)
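The following minimal sketch scores one 319-dimensional log-spectrum frame against all 256 diagonal-covariance Gaussians at once (illustrative only; array names are assumptions, not the authors' code):

import numpy as np

def diag_gauss_loglik(x, means, variances):
    # x: (F,) log-power-spectrum frame; means, variances: (S, F) per-state
    # diagonal Gaussian parameters. Returns (S,) per-state log-likelihoods.
    diff = x[None, :] - means
    return -0.5 * np.sum(np.log(2.0 * np.pi * variances)
                         + diff ** 2 / variances, axis=1)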
The legal word sequences are given by the Speech Separation Challenge grammar [3] and are modeled using a set of pronunciations that map from words to three-state context-dependent phone models. The state transition probabilities derived from these phone models are sparse in the sense that most transition probabilities are zero.

[Figure 1: Graph of models for a given source. In (a), there are no dynamics, so the model is a simple mixture model. In (b), only acoustic dynamics are modeled. In (c), grammar dynamics are modeled with a shared set of acoustic Gaussians; in (d), dual (grammar and acoustic) dynamics have been combined. Note that (a), (b) and (c) are special cases of (d), where different nodes are assumed independent.]

We model speaker-dependent distributions p(s^a | v^a) that associate the grammar states v^a with the speaker-dependent acoustic states. These are learned from training data where the grammar state sequences and acoustic state sequences are known for each utterance. The grammar of our system has 506 states, so we estimate a 506 x 256 element conditional probability matrix B^a for each speaker.

Dual Dynamics: The dual-dynamics model combines the acoustic dynamics with the grammar dynamics. It is useful in this case to avoid modeling the full combination of s and v states in the joint transitions p(s^a_t | s^a_{t-1}, v_t). Instead we make a naive-Bayes assumption to approximate this as (1/z) p(s^a_t | s^a_{t-1})^α p(s^a_t | v_t)^β, where α and β adjust the relative influence of the two probabilities and z is the normalizing constant. Here we simply use the probability matrices A^a and B^a, defined above.

2 Mixed Speech Models

The speech separation challenge involves recognizing speech in mixtures of signals from two speakers, a and b. We consider only mixing models that operate independently on each frequency for analytical and computational tractability. The short-time log spectrum of the mixture y_t, in a given frequency band, is related to that of the two sources x^a_t and x^b_t via the mixing model given by the conditional probability distribution p(y | x^a, x^b). The joint distribution of the observation and sources in one feature dimension, given the source states, is thus:

  p(y_t, x^a_t, x^b_t | s^a_t, s^b_t) = p(y_t | x^a_t, x^b_t) p(x^a_t | s^a_t) p(x^b_t | s^b_t).   (1)

In general, to infer and reconstruct speech we need to compute the likelihood of the observed mixture,

  p(y_t | s^a_t, s^b_t) = ∫∫ p(y_t, x^a_t, x^b_t | s^a_t, s^b_t) dx^a_t dx^b_t,   (2)

and the posterior expected values of the sources given the states,

  E(x^a_t | y_t, s^a_t, s^b_t) = ∫∫ x^a_t p(x^a_t, x^b_t | y_t, s^a_t, s^b_t) dx^a_t dx^b_t,   (3)

and similarly for x^b_t. These quantities, combined with a prior model for the joint state sequences {s^a_{1..T}, s^b_{1..T}}, allow us to compute the minimum mean squared error (MMSE) estimators E(x^a_{1..T} | y_{1..T}) or the maximum a posteriori (MAP) estimate E(x^a_{1..T} | y_{1..T}, ŝ^a_{1..T}, ŝ^b_{1..T}), where ŝ^a_{1..T}, ŝ^b_{1..T} = argmax_{s^a_{1..T}, s^b_{1..T}} p(s^a_{1..T}, s^b_{1..T} | y_{1..T}), and where the subscript 1..T refers to all frames in the signal. The mixing model can be defined in a number of ways. We explore two popular candidates for which the above integrals can be readily computed: Algonquin, and the max model.
[Figure 2: Model combination for two talkers. In (a) all dependencies are shown. In (b) the full dual-dynamics model is graphed with x^a and x^b integrated out, and corresponding states from each speaker combined into product states. The other models are special cases of this graph with different edges removed, as in Figure 1.]

Algonquin: The relationship between the sources and mixture in the log power spectral domain is approximated as

  p(y_t | x^a_t, x^b_t) = N(y_t; log(exp(x^a_t) + exp(x^b_t)), Ψ),   (4)

where Ψ is introduced to model the error due to the omission of phase [4]. An iterative Newton-Laplace method accurately approximates the conditional posterior p(x^a_t, x^b_t | y_t, s^a_t, s^b_t) from (1) as Gaussian. This Gaussian allows us to analytically compute the observation likelihood p(y_t | s^a_t, s^b_t) and expected value E(x^a_t | y_t, s^a_t, s^b_t), as in [4].

Max model: The mixing model is simplified using the fact that the log of a sum is approximately the log of the maximum:

  p(y | x^a, x^b) = δ(y − max(x^a, x^b)).   (5)

In this model the likelihood is

  p(y_t | s^a_t, s^b_t) = p_{x^a}(y_t | s^a_t) Φ_{x^b}(y_t | s^b_t) + p_{x^b}(y_t | s^b_t) Φ_{x^a}(y_t | s^a_t),   (6)

where Φ_{x^a}(y_t | s^a_t) = ∫_{−∞}^{y_t} N(x^a_t; μ_{s^a_t}, Σ_{s^a_t}) dx^a_t is a Gaussian cumulative distribution function [5]. In [5], such a model was used to compute state likelihoods and find the optimal state sequence. In [8], a simplified model was used to infer binary masking values for refiltering. We take the max model a step further and derive source posteriors, so that we can compute the MMSE estimators for the log power spectrum. Note that the source posteriors in x^a_t and x^b_t are each a mixture of a delta function and a truncated Gaussian. Thus we analytically derive the necessary expected value:

  E(x^a_t | y_t, s^a_t, s^b_t) = p(x^a_t = y_t | y_t, s^a_t, s^b_t) y_t + p(x^a_t < y_t | y_t, s^a_t, s^b_t) E(x^a_t | x^a_t < y_t, s^a_t)   (7)
                              = π^a_t y_t + π^b_t ( μ_{s^a_t} − Σ_{s^a_t} p_{x^a}(y_t | s^a_t) / Φ_{x^a}(y_t | s^a_t) ),   (8)

with weights π^a_t = p(x^a_t = y_t | y_t, s^a_t, s^b_t) = p_{x^a}(y_t | s^a_t) Φ_{x^b}(y_t | s^b_t) / p(y_t | s^a_t, s^b_t), and π^b_t = 1 − π^a_t. For many pairs of states one model is significantly louder than another, μ_{s^a} ≫ μ_{s^b}, in a given frequency band, relative to their variances. In such cases it is reasonable to approximate the likelihood as p(y_t | s^a_t, s^b_t) ≈ p_{x^a}(y_t | s^a_t), and the posterior expected values according to E(x^a_t | y_t, s^a_t, s^b_t) ≈ y_t and E(x^b_t | y_t, s^a_t, s^b_t) ≈ min(y_t, μ_{s^b_t}), and similarly for μ_{s^a} ≪ μ_{s^b}.
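Equations (6)-(8) are simple enough to transcribe directly. The following sketch evaluates the max-model likelihood and the posterior mean of one source in a single frequency band (scipy's norm supplies the Gaussian pdf and cdf; variable names are illustrative, not the authors' code):

from scipy.stats import norm

def max_model_likelihood(y, mu_a, sig_a, mu_b, sig_b):
    # Eq. (6): p(y | s^a, s^b) in one band; sig_* are standard deviations.
    return (norm.pdf(y, mu_a, sig_a) * norm.cdf(y, mu_b, sig_b)
            + norm.pdf(y, mu_b, sig_b) * norm.cdf(y, mu_a, sig_a))

def max_model_posterior_mean_a(y, mu_a, sig_a, mu_b, sig_b):
    # Eqs. (7)-(8): E(x^a | y, s^a, s^b), a mix of a delta at y and the
    # mean of a Gaussian truncated below y.
    p_a, P_a = norm.pdf(y, mu_a, sig_a), norm.cdf(y, mu_a, sig_a)
    p_b, P_b = norm.pdf(y, mu_b, sig_b), norm.cdf(y, mu_b, sig_b)
    lik = p_a * P_b + p_b * P_a
    pi_a = p_a * P_b / lik                      # P(x^a = y | y, states)
    trunc_mean = mu_a - sig_a ** 2 * p_a / P_a  # E(x^a | x^a < y, s^a)
    return pi_a * y + (1.0 - pi_a) * trunc_mean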
3 Likelihood Estimation

Because of the large number of state combinations, the model would not be practical without techniques to reduce computation time. To speed up the evaluation of the joint state likelihood, we employed both band quantization of the acoustic Gaussians and joint-state pruning.

Band Quantization: One source of computational savings stems from the fact that some of the Gaussians in our model may differ only in a few features. Band quantization addresses this by approximating each of the D Gaussians of each model with a shared set of d Gaussians, where d ≪ D, in each of the F frequency bands of the feature vector. A similar idea is described in [9]. It relies on the use of a diagonal covariance matrix, so that p(x^a | s^a) = ∏_f N(x^a_f; μ_{f,s^a}, Σ_{f,s^a}), where Σ_{f,s^a} are the diagonal elements of covariance matrix Σ_{s^a}. The mapping M_f(s_i) associates each of the D Gaussians with one of the d Gaussians in band f. Now p̂(x^a | s^a) = ∏_f N(x^a_f; μ_{f,M_f(s^a)}, Σ_{f,M_f(s^a)}) is used as a surrogate for p(x^a | s^a). Figure 3 illustrates the idea.

[Figure 3: In band quantization, many multi-dimensional Gaussians are mapped to a few uni-dimensional Gaussians.]

Under this model the d Gaussians are optimized by minimizing the KL-divergence D( ∑_{s^a} p(s^a) p(x^a | s^a) || ∑_{s^a} p(s^a) p̂(x^a | s^a) ), and likewise for s^b. Then in each frequency band, only d x d, instead of D x D, combinations of Gaussians have to be evaluated to compute p(y | s^a, s^b). Despite the relatively small number of components d in each band, taken across bands, band quantization is capable of expressing d^F distinct patterns in an F-dimensional feature space, although in practice only a subset of these will be used to approximate the Gaussians in a given model. We used d = 8 and D = 256, which reduced the likelihood computation time by three orders of magnitude.

Joint State Pruning: Another source of computational savings comes from the sparseness of the model. Only a handful of (s^a, s^b) combinations have likelihoods that are significantly larger than the rest for a given observation. Only these states are required to adequately explain the observation. By pruning the total number of combinations down to a smaller number we can speed up the likelihood calculation, the estimation of the component signals, and the temporal inference. However, we must estimate the likelihoods in order to determine which states to retain. We therefore used band quantization to estimate likelihoods for all states, performed state pruning, and then ran the full model on the pruned states using the exact parameters. In the experiments reported here, we pruned down to 256 state combinations. The effect of these speedup methods on accuracy will be reported in a future publication.
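As a rough illustration of band quantization, the sketch below clusters the per-band Gaussian parameters with k-means to build the mapping M_f (the paper optimizes a KL-divergence instead, so k-means is only a stand-in; names and the use of scikit-learn are assumptions):

import numpy as np
from sklearn.cluster import KMeans

def band_quantize(means, variances, d=8):
    # means, variances: (D, F) per-state diagonal Gaussian parameters.
    # Returns per-band codebooks and the (F, D) mapping M_f(s).
    D, F = means.shape
    mapping = np.zeros((F, D), dtype=int)
    codebooks = []
    for f in range(F):
        feats = np.column_stack([means[:, f], np.log(variances[:, f])])
        km = KMeans(n_clusters=d, n_init=4).fit(feats)
        mapping[f] = km.labels_
        codebooks.append(km.cluster_centers_)
    return codebooks, mapping

With d = 8 shared components per band, each per-band table of Gaussian combinations shrinks from D x D = 65536 to d x d = 64 entries, which is the source of the three-orders-of-magnitude speedup quoted above.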
4 Inference

In our experiments we performed inference in four different conditions: no dynamics, with acoustic dynamics only, with grammar dynamics only, and with dual dynamics (acoustic and grammar). With no dynamics the source models reduce to GMMs and we infer MMSE estimates of the sources based on p(x^a, x^b | y) as computed from (1), using Algonquin or the max model. Once the log spectrum of each source is estimated, we estimate the corresponding time-domain signal as shown in [4].

In the acoustic dynamics condition the exact inference algorithm uses a 2-dimensional Viterbi search, described below, with acoustic temporal constraints p(s_t | s_{t-1}) and likelihoods from Eqn. (1), to find the most likely joint state sequence s_{1..T}. Similarly, in the grammar dynamics condition, 2-D Viterbi search is used to infer the grammar state sequences v_{1..T}. Instead of single Gaussians as the likelihood models, however, we have mixture models in this case, so we can perform an MMSE estimate of the sources by averaging over the posterior probability of the mixture components given the grammar Viterbi sequence and the observations.

It is critical to use the 2-D Viterbi algorithm in both cases, rather than the forward-backward algorithm, because in the same-speaker condition at 0 dB, the acoustic models and dynamics are symmetric. This symmetry means that the posterior is essentially bimodal, and averaging over these modes would yield identical estimates for both speakers. By finding the best path through the joint state space, the 2-D Viterbi algorithm breaks this symmetry and allows the model to make different estimates for each speaker.

In the dual-dynamics condition we use the model of Figure 2(b). With two speakers, exact inference is computationally complex because the full joint distribution of the grammar and acoustic states, (v^a x s^a) x (v^b x s^b), is required and is very large. Instead we perform approximate inference by alternating the 2-D Viterbi search between two factors: the Cartesian product s^a x s^b of the acoustic state sequences and the Cartesian product v^a x v^b of the grammar state sequences. When evaluating each state sequence we hold the other chain constant, which decouples its dynamics and allows for efficient inference. This is a useful factorization because the states s^a and s^b interact strongly with each other, and similarly for v^a and v^b. Again, in the same-talker condition, the 2-D Viterbi search breaks the symmetry in each factor.

2-D Viterbi search: The Viterbi algorithm estimates the maximum-likelihood state sequence s_{1..T} given the observations x_{1..T}. The complexity of the Viterbi search is O(T D^2), where D is the number of states and T is the number of frames. For producing MAP estimates of the two sources, we require a 2-dimensional Viterbi search which finds the most likely joint state sequences s^a_{1..T} and s^b_{1..T} given the mixed signal y_{1..T}, as was proposed in [5]. On the surface, the 2-D Viterbi search appears to be of complexity O(T D^4). Surprisingly, it can be computed in O(T D^3) operations. This stems from the fact that the dynamics for each chain are independent. The forward-backward algorithm for a factorial HMM with N state variables requires only O(T N D^{N+1}) rather than the O(T D^{2N}) required for a naive implementation [10]. The same is true for the Viterbi algorithm. In the Viterbi algorithm, we wish to find the most probable paths leading to each state by finding the two arguments s^a_{t-1} and s^b_{t-1} of the following maximization:

  {ŝ^a_{t-1}, ŝ^b_{t-1}} = argmax_{s^a_{t-1}, s^b_{t-1}} p(s^a_t | s^a_{t-1}) p(s^b_t | s^b_{t-1}) p(s^a_{t-1}, s^b_{t-1} | y_{1..t-1})
                         = argmax_{s^a_{t-1}} p(s^a_t | s^a_{t-1}) [ max_{s^b_{t-1}} p(s^b_t | s^b_{t-1}) p(s^a_{t-1}, s^b_{t-1} | y_{1..t-1}) ].   (9)

The two maximizations can be done in sequence, requiring O(D^3) operations with O(D^2) storage for each step. In general, as with the forward-backward algorithm, the N-dimensional Viterbi search requires O(T N D^{N+1}) operations. We can also exploit the sparsity of the transition matrices and observation likelihoods by pruning unlikely values. Using both of these methods, our implementation of 2-D Viterbi search is faster than the acoustic likelihood computation that serves as its input, for the model sizes and grammars chosen in the speech separation task. (A vectorized sketch of the factorized maximization follows below.)
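The O(D^3) trick in Eq. (9) is just two nested maximizations done in sequence. One frame of the 2-D Viterbi recursion can be sketched as follows (illustrative only; observation log-likelihoods would be added to the returned scores, and backpointers are omitted; note the vectorized form allocates O(D^3) temporaries):

import numpy as np

def joint_viterbi_step(prev, logA_a, logA_b):
    # prev[i, j]: best joint log-score of (s^a_{t-1}=i, s^b_{t-1}=j).
    # logA_a, logA_b: (D, D) log transition matrices, rows = from-state.
    # First maximize over s^b_{t-1} for each (s^a_{t-1}, s^b_t) pair ...
    inner = np.max(prev[:, :, None] + logA_b[None, :, :], axis=1)   # (D, D)
    # ... then over s^a_{t-1} for each (s^a_t, s^b_t) pair: O(D^3) total.
    return np.max(inner[:, None, :] + logA_a[:, :, None], axis=0)   # (D, D)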
Speaker and Gain Estimation: In the challenge task, the gains and identities of the two speakers were unknown at test time and were selected from a set of 34 speakers mixed at SNRs ranging from 6 dB to −9 dB. We used speaker-dependent acoustic models because of their advantages when separating different speakers. These models were trained on gain-normalized data, so the models are not well matched to the different gains of the signals at test time. This means that we have to estimate both the speaker identities and the gains in order to adapt our models to the source signals for each test utterance. The number of speakers and range of SNRs in the test set make it too expensive to consider every possible combination of models and gains. Instead, we developed an efficient model-based method for identifying the speakers and gains, described in [2]. The algorithm is based upon a very simple idea: identify and utilize frames that are dominated by a single source (based on their likelihoods under each speaker-dependent acoustic model) to determine what sources are present in the mixture. Using this criterion we can eliminate most of the unlikely speakers and explore all combinations of the remaining speakers. An approximate EM procedure is then used to select a single pair of speakers and estimate their gains.

Recognition: Although inference in the system may involve recognition of the words (for models that contain a grammar), we still found that a separately trained recognizer performed better. After reconstruction, each of the two signals is therefore decoded with a speech recognition system that incorporates Speaker Dependent Labeling (SDL) [2]. This method uses speaker-dependent models for each of the 34 speakers. Instead of using the speaker identities provided by the speaker ID and gain module, we followed the approach for gender-dependent labeling (GDL) described in [11]. This technique provides better results than if the true speaker ID is specified.

5 Results

The Speech Separation Challenge [3] involves separating the mixed speech of two speakers drawn from a set of 34 speakers. An example utterance is "place white by R 4 now". In each recording, one of the speakers says "white" while the other says "blue", "red" or "green". The task is to recognize the letter and the digit of the speaker that said "white". Using the SDL recognizer, we decoded the two estimated signals under the assumption that one signal contains "white" and the other does not, and vice versa. We then used the association that yielded the highest combined likelihood.

[Figure 4: Average word error rate (WER) as a function of model dynamics (No Separation, No dynamics, Acoustic Dyn., Grammar Dyn., Dual Dyn., Human), in the Same Talker, Same Gender, Different Gender, and All conditions, compared to human error rates, using Algonquin.]

Human listener performance [3] is compared in Figure 4 to results using the SDL recognizer without speech separation, and for each of the proposed models. Performance is poor without separation in all conditions. With no dynamics the models do surprisingly well in the different talker conditions, but poorly when the signals come from the same talker. Acoustic dynamics gives some improvement, mainly in the same-talker condition. The grammar dynamics seems to give the most benefit, bringing the error rate in the same-gender condition below that of humans. The dual-dynamics model performed about the same as the grammar dynamics model, despite our intuitions. Replacing Algonquin with the max model reduced the error rate in the dual dynamics model (from 24.3% to 23.5%) and grammar dynamics model (from 24.6% to 22.6%), which brings the latter closer than any other model to the human recognition rate of 22.3%. Figure 5 shows the relative word error rate of the best system compared to human subjects. When both speakers are around the same loudness, the system exceeds human performance, and in the same-gender condition makes less than half the errors of the humans.
Human listeners do better when the two signals are at different levels, even if the target is below the masker (i.e., at −9 dB), suggesting that they are better able to make use of differences in amplitude as a cue for separation.

[Figure 5: Word error rate of the best system relative to human performance, plotted against signal-to-noise ratio (6 dB to −9 dB) for the Same Talker, Same Gender, and Different Gender conditions. The shaded area is where the system outperforms human listeners.]

An interesting question is to what extent different grammar constraints affect the results. To test this, we limited the grammar to just the two test utterances, and the error rate on the estimated sources dropped to around 10%. This may be a useful paradigm for separating speech from background noise when the text is known, such as in closed-captioned recordings. At the other extreme, in realistic speech recognition scenarios, there is little knowledge of the background speaker's grammar. In such cases the benefits of models of low-level acoustic continuity over purely grammar-based systems may be more apparent. It is our hope that further experiments with both human and machine listeners will provide us with a better understanding of the differences in their performance characteristics, and provide insights into how the human auditory system functions, as well as how automatic speech perception in general can be brought to human levels of performance.

References
[1] T. Kristjansson, J. R. Hershey, P. A. Olsen, S. Rennie, and R. Gopinath, "Super-human multi-talker speech recognition: The IBM 2006 speech separation challenge system," in ICSLP, 2006.
[2] S. Rennie, P. A. Olsen, J. R. Hershey, and T. Kristjansson, "Separating multiple speakers using temporal constraints," in ISCA Workshop on Statistical and Perceptual Audition, 2006.
[3] M. Cooke and T.-W. Lee, "Interspeech speech separation challenge," http://www.dcs.shef.ac.uk/~martin/SpeechSeparationChallenge.htm, 2006.
[4] T. Kristjansson, J. Hershey, and H. Attias, "Single microphone source separation using high resolution signal reconstruction," in ICASSP, 2004.
[5] P. Varga and R. K. Moore, "Hidden Markov model decomposition of speech and noise," in ICASSP, pp. 845-848, 1990.
[6] M. Gales and S. Young, "Robust continuous speech recognition using parallel model combination," IEEE Transactions on Speech and Audio Processing, vol. 4, no. 5, pp. 352-359, September 1996.
[7] Y. Ephraim, "A Bayesian estimation approach for speech enhancement using hidden Markov models," vol. 40, no. 4, pp. 725-735, 1992.
[8] S. Roweis, "Factorial models and refiltering for speech separation and denoising," in Eurospeech, pp. 1009-1012, 2003.
[9] E. Bocchieri, "Vector quantization for the efficient computation of continuous density likelihoods," in ICASSP, 1993, vol. II, pp. 692-695.
[10] Z. Ghahramani and M. I. Jordan, "Factorial hidden Markov models," in Advances in Neural Information Processing Systems, vol. 8.
[11] P. Olsen and S. Dharanipragada, "An efficient integrated gender detection scheme and time mediated averaging of gender dependent acoustic models," in Eurospeech, 2003, vol. 4, pp. 2509-2512.
4 0.55500549 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy
Author: Samuel S. Gross, Olga Russakovsky, Chuong B. Do, Serafim Batzoglou
Abstract: We consider the problem of training a conditional random field (CRF) to maximize per-label predictive accuracy on a training set, an approach motivated by the principle of empirical risk minimization. We give a gradient-based procedure for minimizing an arbitrarily accurate approximation of the empirical risk under a Hamming loss function. In experiments with both simulated and real data, our optimization procedure gives significantly better testing performance than several current approaches for CRF training, especially in situations of high label noise.
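As a loose illustration of the idea (not the paper's exact construction), one can smooth the per-position 0/1 error with a sigmoid of a marginal-based margin; here the CRF's per-position label marginals are assumed to be supplied by forward-backward:

import numpy as np

def smoothed_labelwise_loss(marginals, y_true, lam=10.0):
    # marginals: (T, K) posterior label probabilities from the CRF;
    # y_true: (T,) gold labels. As lam grows, this approaches the
    # 0/1 Hamming loss; the full method also differentiates through
    # the marginals themselves.
    T = marginals.shape[0]
    p_true = marginals[np.arange(T), y_true]
    comp = marginals.copy()
    comp[np.arange(T), y_true] = -np.inf          # best competing label
    margin = p_true - comp.max(axis=1)
    return np.mean(1.0 / (1.0 + np.exp(lam * margin)))  # sigmoid(-lam*margin)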
5 0.51518661 186 nips-2006-Support Vector Machines on a Budget
Author: Ofer Dekel, Yoram Singer
Abstract: The standard Support Vector Machine formulation does not provide its user with the ability to explicitly control the number of support vectors used to define the generated classifier. We present a modified version of SVM that allows the user to set a budget parameter B and focuses on minimizing the loss attained by the B worst-classified examples while ignoring the remaining examples. This idea can be used to derive sparse versions of both L1-SVM and L2-SVM. Technically, we obtain these new SVM variants by replacing the 1-norm in the standard SVM formulation with various interpolation-norms. We also adapt the SMO optimization algorithm to our setting and report on some preliminary experimental results.
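One illustrative reading of "minimize the loss of the B worst-classified examples" is a subgradient step on the sum of the B largest hinge losses (the paper's actual mechanism is the interpolation-norm formulation solved with SMO, so this is only a sketch with assumed names):

import numpy as np

def topB_hinge_subgradient_step(w, X, y, B, lr=0.01, reg=1e-3):
    # X: (n, d) features; y: (n,) labels in {-1, +1}; w: (d,) weights.
    losses = np.maximum(0.0, 1.0 - y * (X @ w))
    worst = np.argsort(losses)[-B:]          # indices of the B largest losses
    active = worst[losses[worst] > 0]        # keep only violated margins
    grad = reg * w - (y[active, None] * X[active]).sum(axis=0)
    return w - lr * grad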
6 0.48976544 160 nips-2006-Part-based Probabilistic Point Matching using Equivalence Constraints
7 0.48115152 130 nips-2006-Max-margin classification of incomplete data
8 0.4785096 105 nips-2006-Large Margin Component Analysis
9 0.46804297 156 nips-2006-Ordinal Regression by Extended Binary Classification
10 0.4623467 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning
11 0.44264618 107 nips-2006-Large Margin Multi-channel Analog-to-Digital Conversion with Applications to Neural Prosthesis
12 0.41769373 140 nips-2006-Multiple Instance Learning for Computer Aided Diagnosis
13 0.39916548 28 nips-2006-An Efficient Method for Gradient-Based Adaptation of Hyperparameters in SVM Models
14 0.39178649 24 nips-2006-Aggregating Classification Accuracy across Time: Application to Single Trial EEG
15 0.37713581 179 nips-2006-Sparse Representation for Signal Classification
16 0.3581813 63 nips-2006-Cross-Validation Optimization for Large Scale Hierarchical Classification Kernel Methods
17 0.35384196 180 nips-2006-Speakers optimize information density through syntactic reduction
18 0.35294765 47 nips-2006-Boosting Structured Prediction for Imitation Learning
19 0.34825581 185 nips-2006-Subordinate class recognition using relational object models
20 0.34099945 54 nips-2006-Comparative Gene Prediction using Conditional Random Fields
topicId topicWeight
[(1, 0.115), (3, 0.066), (7, 0.077), (9, 0.028), (22, 0.083), (44, 0.062), (57, 0.079), (64, 0.01), (65, 0.077), (69, 0.078), (77, 0.216), (90, 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.82239372 106 nips-2006-Large Margin Hidden Markov Models for Automatic Speech Recognition
Author: Fei Sha, Lawrence K. Saul
Abstract: We study the problem of parameter estimation in continuous density hidden Markov models (CD-HMMs) for automatic speech recognition (ASR). As in support vector machines, we propose a learning algorithm based on the goal of margin maximization. Unlike earlier work on max-margin Markov networks, our approach is specifically geared to the modeling of real-valued observations (such as acoustic feature vectors) using Gaussian mixture models. Unlike previous discriminative frameworks for ASR, such as maximum mutual information and minimum classification error, our framework leads to a convex optimization, without any spurious local minima. The objective function for large margin training of CD-HMMs is defined over a parameter space of positive semidefinite matrices. Its optimization can be performed efficiently with simple gradient-based methods that scale well to large problems. We obtain competitive results for phonetic recognition on the TIMIT speech corpus.
2 0.68283969 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy
Author: Samuel S. Gross, Olga Russakovsky, Chuong B. Do, Serafim Batzoglou
Abstract: We consider the problem of training a conditional random field (CRF) to maximize per-label predictive accuracy on a training set, an approach motivated by the principle of empirical risk minimization. We give a gradient-based procedure for minimizing an arbitrarily accurate approximation of the empirical risk under a Hamming loss function. In experiments with both simulated and real data, our optimization procedure gives significantly better testing performance than several current approaches for CRF training, especially in situations of high label noise.
3 0.67797279 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming
Author: Kilian Q. Weinberger, Fei Sha, Qihui Zhu, Lawrence K. Saul
Abstract: In many areas of science and engineering, the problem arises of how to discover low dimensional representations of high dimensional data. Recently, a number of researchers have converged on common solutions to this problem using methods from convex optimization. In particular, many results have been obtained by constructing semidefinite programs (SDPs) with low rank solutions. While the rank of matrix variables in SDPs cannot be directly constrained, it has been observed that low rank solutions emerge naturally by computing high variance or maximal trace solutions that respect local distance constraints. In this paper, we show how to solve very large problems of this type by a matrix factorization that leads to much smaller SDPs than those previously studied. The matrix factorization is derived by expanding the solution of the original problem in terms of the bottom eigenvectors of a graph Laplacian. The smaller SDPs obtained from this matrix factorization yield very good approximations to solutions of the original problem. Moreover, these approximations can be further refined by conjugate gradient descent. We illustrate the approach on localization in large scale sensor networks, where optimizations involving tens of thousands of nodes can be solved in just a few minutes.
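The key preprocessing step, computing the bottom eigenvectors of the graph Laplacian so the embedding can be expanded as X = Q Y with a small matrix variable Y, might look as follows (a sketch with assumed names; the reduced SDP itself would still be handed to an SDP solver):

import numpy as np
from scipy.sparse import csgraph
from scipy.sparse.linalg import eigsh

def laplacian_basis(W, m):
    # W: (n, n) symmetric nonnegative weight matrix of the neighborhood graph.
    # Returns Q: (n, m), the bottom-m eigenvectors of the graph Laplacian.
    L = csgraph.laplacian(W, normed=False)
    vals, vecs = eigsh(np.asarray(L, dtype=float), k=m, which='SM')
    return vecs  # substitute X = Q @ Y to shrink the n x n SDP to m x m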
Author: Gloria Haro, Gregory Randall, Guillermo Sapiro
Abstract: The study of point cloud data sampled from a stratification, a collection of manifolds with possibly different dimensions, is pursued in this paper. We present a technique for simultaneously soft clustering and estimating the mixed dimensionality and density of such structures. The framework is based on a maximum likelihood estimation of a Poisson mixture model. The presentation of the approach is completed with artificial and real examples demonstrating the importance of extending manifold learning to stratification learning.
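The per-point statistics that such a Poisson mixture model typically works from are nearest-neighbor distances; a Levina-Bickel style local dimension estimate (a related building block, not the paper's full model) can be sketched as:

import numpy as np
from scipy.spatial import cKDTree

def local_dimension_mle(X, k=10):
    # X: (n, d) point cloud. Returns an (n,) array of local intrinsic
    # dimension estimates; assumes no duplicate points (zero distances
    # would need filtering before the log).
    tree = cKDTree(X)
    dist, _ = tree.query(X, k=k + 1)     # column 0 is the point itself
    rk = dist[:, -1][:, None]            # distance to the k-th neighbor
    logs = np.log(rk / dist[:, 1:k])     # log(T_k / T_j), j = 1..k-1
    return (k - 2) / logs.sum(axis=1)    # bias-corrected MLE variant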
5 0.67463487 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment
Author: Andrea Vedaldi, Stefano Soatto
Abstract: Image Congealing (IC) is a non-parametric method for the joint alignment of a collection of images affected by systematic and unwanted deformations. The method attempts to undo the deformations by minimizing a measure of complexity of the image ensemble, such as the averaged per-pixel entropy. This enables alignment without an explicit model of the aligned dataset as required by other methods (e.g. transformed component analysis). While IC is simple and general, it may introduce degenerate solutions when the transformations allow minimizing the complexity of the data by collapsing them to a constant. Such solutions need to be explicitly removed by regularization. In this paper we propose an alternative formulation which solves this regularization issue on a more principled ground. We make the simple observation that alignment should simplify the data while preserving the useful information carried by them. Therefore we trade off fidelity and complexity of the aligned ensemble rather than minimizing the complexity alone. This eliminates the need for an explicit regularization of the transformations, and has a number of other useful properties such as noise suppression. We show the modeling and computational benefits of the approach on some of the problems on which IC has been demonstrated.
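For concreteness, the ensemble-complexity measure mentioned above, averaged per-pixel entropy, can be computed for a stack of images (assumed here to take values in [0, 1]) as follows; congealing-style methods search over per-image transformations to drive this quantity down:

import numpy as np

def average_pixel_entropy(images, bins=16):
    # images: (N, H, W) aligned stack with pixel values assumed in [0, 1].
    N = images.shape[0]
    flat = images.reshape(N, -1)                     # (N, P) pixel columns
    total = 0.0
    for p in range(flat.shape[1]):
        hist, _ = np.histogram(flat[:, p], bins=bins, range=(0.0, 1.0))
        q = hist[hist > 0] / N                       # empirical distribution
        total -= np.sum(q * np.log(q))               # entropy of this pixel
    return total / flat.shape[1]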
6 0.67396843 20 nips-2006-Active learning for misspecified generalized linear models
7 0.67361665 61 nips-2006-Convex Repeated Games and Fenchel Duality
8 0.67326003 160 nips-2006-Part-based Probabilistic Point Matching using Equivalence Constraints
9 0.66955453 118 nips-2006-Learning to Model Spatial Dependency: Semi-Supervised Discriminative Random Fields
10 0.66875601 65 nips-2006-Denoising and Dimension Reduction in Feature Space
11 0.66773915 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning
12 0.66707861 34 nips-2006-Approximate Correspondences in High Dimensions
13 0.66695243 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation
14 0.6655429 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
15 0.66531956 79 nips-2006-Fast Iterative Kernel PCA
16 0.66496861 175 nips-2006-Simplifying Mixture Models through Function Approximation
17 0.66466951 48 nips-2006-Branch and Bound for Semi-Supervised Support Vector Machines
18 0.66451478 119 nips-2006-Learning to Rank with Nonsmooth Cost Functions
19 0.66391021 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space
20 0.66290736 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation