acl acl2012 acl2012-16 knowledge-graph by maker-knowledge-mining

16 acl-2012-A Nonparametric Bayesian Approach to Acoustic Model Discovery


Source: pdf

Author: Chia-ying Lee ; James Glass

Abstract: We investigate the problem of acoustic modeling in which prior language-specific knowledge and transcribed data are unavailable. We present an unsupervised model that simultaneously segments the speech, discovers a proper set of sub-word units (e.g., phones) and learns a Hidden Markov Model (HMM) for each induced acoustic unit. Our approach is formulated as a Dirichlet process mixture model in which each mixture is an HMM that represents a sub-word unit. We apply our model to the TIMIT corpus, and the results demonstrate that our model discovers sub-word units that are highly correlated with English phones and also produces better segmentation than the state-of-the-art unsupervised baseline. We test the quality of the learned acoustic models on a spoken term detection task. Compared to the baselines, our model improves the relative precision of top hits by at least 22.1% and outperforms a language-mismatched acoustic model.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 We investigate the problem of acoustic modeling in which prior language-specific knowledge and transcribed data are unavailable. [sent-3, score-0.555]

2 We present an unsupervised model that simultaneously segments the speech, discovers a proper set of sub-word units (e. [sent-4, score-0.279]

3 , phones) and learns a Hidden Markov Model (HMM) for each induced acoustic unit. [sent-6, score-0.471]

4 Our approach is formulated as a Dirichlet process mixture model in which each mixture is an HMM that represents a sub-word unit. [sent-7, score-0.354]

5 We apply our model to the TIMIT corpus, and the results demonstrate that our model discovers sub-word units that are highly correlated with English phones and also produces better segmentation than the state-of-the-art unsupervised baseline. [sent-8, score-0.457]

6 We test the quality of the learned acoustic models on a spoken term detection task. [sent-9, score-0.799]

7 However, the standard process of training acoustic models is expensive, and requires not only language-specific knowledge, e. [sent-13, score-0.471]

8 Therefore, a procedure for training acoustic models without annotated data would not only be a breakthrough from the traditional approach, but would also allow us to build speech recognizers for any language efficiently. [sent-17, score-0.542]

9 In this paper, we investigate the problem of unsupervised acoustic modeling with only spoken utterances as training data. [sent-18, score-0.784]

10 As suggested in Garcia and Gish (2006), unsupervised acoustic modeling can be broken down to three sub-tasks: segmentation, clustering segments, and modeling the sound pattern of each cluster. [sent-19, score-0.656]

11 For example, the speech data was usually segmented regardless of the clustering results and the learned acoustic models. [sent-22, score-0.634]

12 In contrast to the previous methods, we approach the problem by modeling the three sub-problems as well as the unknown set of sub-word units as latent variables in one nonparametric Bayesian model. [sent-23, score-0.261]

13 More specifically, we formulate a Dirichlet process mixture model where each mixture is a Hidden Markov Model (HMM) used to model a subword unit and to generate observed segments of that unit. [sent-24, score-0.514]

14 Our model shows its ability to discover sub-word units that are highly correlated with standard English phones and to capture acoustic context information. [sent-29, score-0.685]

15 Finally, we test the quality of the learned acoustic models through a keyword spotting task. [sent-34, score-0.609]

16 1% with only some degradation in equal error rate (EER), and outperforms a language-mismatched acoustic model trained with supervised data. [sent-37, score-0.517]

17 , 1988; Garcia and Gish, 2006; Chan and Lee, 2011) and approach the problem of unsupervised acoustic modeling by solving three sub-problems of the task: segmentation, clustering and modeling each cluster. [sent-39, score-0.656]

18 For our work, we assume no transcriptions are available and measure the quality of the learned acoustic units via a spoken query detection task as in Jansen and Church (2011). [sent-47, score-0.821]

19 Jansen and Church (2011) approached the task of unsupervised acoustic modeling by first discovering repetitive patterns in the data and then learning a whole-word HMM for each found pattern, where the number of states in each HMM depends on the average length of the pattern. [sent-48, score-0.738]

20 The states of the whole-word HMMs were then collapsed and used to represent acoustic units. [sent-49, score-0.471]

21 Most unsupervised speech segmentation methods rely on acoustic change for hypothesizing phone boundaries (Scharenborg et al. [sent-54, score-1.006]

22 Our model does not make a single one-stage decision; instead, it infers the segmentation through an iterative process and exploits the learned sub-word models to guide its hypotheses on phone boundaries. [sent-60, score-0.423]

23 3 Problem Formulation The goal of our model, given a set of spoken utterances, is to jointly learn the following: • Segmentation: To find the phonetic boundaries within each utterance. [sent-66, score-0.295]

24 In this section, we describe the observed data, latent variables, and auxiliary variables Figure 1: An example of the observed data and hidden variables of the problem for the word banana. [sent-70, score-0.36]

25 Boundary (bit) We use a binary variable bti to indicate whether a phone boundary exists between xti and xti+1. [sent-79, score-0.558]

26 If our model hypothesizes xti to be the last frame of a sub-word unit, which is called a boundary frame in this paper, bti is assigned with value 1; or 0 otherwise. [sent-80, score-0.478]

27 We use an auxiliary variable gqi to denote the index of the qth boundary frame in utterance i. [sent-83, score-0.358]

28 To make the derivation of posterior distributions easier in Section 5, we define g0i to be the beginning of an utterance, and Li to be the number of boundary frames in an utterance. [sent-84, score-0.264]
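
As a rough illustration of the boundary notation above, the sketch below turns the binary boundary variables of one utterance into boundary-frame indices and segments. The function names, the 0-based frame indexing, and the use of -1 as the marker for the beginning of the utterance are assumptions made for this example; they are not taken from the paper.

```python
def boundary_frames(b):
    """Return the indices t with b[t] == 1, i.e. the boundary frames."""
    return [t for t, bt in enumerate(b) if bt == 1]


def segments(b):
    """Split frames 0..len(b)-1 into inclusive (start, end) segments.

    Assumes the last frame of the utterance is marked as a boundary frame.
    """
    g = [-1] + boundary_frames(b)  # g[0] stands for the beginning of the utterance
    return [(g[q] + 1, g[q + 1]) for q in range(len(g) - 1)]


b = [0, 0, 1, 0, 1, 0, 0, 1]
print(boundary_frames(b))  # [2, 4, 7]
print(segments(b))         # [(0, 2), (3, 4), (5, 7)]
```

Here the utterance has three boundary frames, so it is split into three segments.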

29 Segment (pij,k) We define a segment to be composed of feature vectors between two boundary frames. [sent-87, score-0.252]

30 We use θc to represent the set of parameters that define the cth HMM, which includes state transition probability and the GMM parameters of each state emission probability. [sent-96, score-0.234]

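31 We use $a^c_{j,k}$ to denote the transition probabilities, $w^c_{m,s} \in \mathbb{R}$ to denote the weight, and $\mu^c_{m,s} \in \mathbb{R}^{39}$ and $\lambda^c_{m,s} \in \mathbb{R}^{39}$ to denote the mean vector and the diagonal of the inverse covariance matrix of the mth mixture in the GMM for the sth state in the cth HMM. [sent-97, score-0.424]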

32 Hidden State (sit) Since we assume the observed data are generated by HMMs, each feature vector, xti, has an associated hidden state index. [sent-98, score-0.235]

33 4 Model We aim to discover and model a set of sub-word units that represent the spoken data. [sent-102, score-0.276]

34 We assume each spoken segment is generated by one of the clusters in this DP mixture model. [sent-110, score-0.421]

35 This cluster label can be either an existing label or a new one. [sent-117, score-0.26]

36 Given the cluster label, choose a hidden state for each feature vector xti in the segment. [sent-120, score-0.486]

37 For each xti, based on its hidden state, choose a mixture from the GMM of the chosen state. [sent-122, score-0.248]

38 The generative process indicates that our model ignores utterance boundaries and views the entire data as concatenated spoken segments. [sent-125, score-0.326]

39 Specifically, we use a Bernoulli distribution as the prior of the boundary variables and impose a Dirichlet process prior on the cluster labels and the HMM parameters. [sent-129, score-0.505]

40 For example, the boundary variables deterministically construct the duration of each segment, d, which in turn sets the number of feature vectors that should be generated for a segment. [sent-131, score-0.298]

41 , 2004) to approximate the posterior distribution of the hidden variables in our model. [sent-135, score-0.286]

42 To apply Gibbs sampling to our problem, we need to derive the conditional posterior distributions of each hidden variable of the model. [sent-136, score-0.404]

43 In the following sections, we first derive the sampling equations for each hidden variable and then describe how we incorporate acoustic cues to reduce the sampling load at the end. [sent-137, score-0.765]

44 Cluster Label (cj,k) Let C be the set of distinctive label values in c−j,k, which represents all the cluster labels except cj,k. [sent-141, score-0.247]

45 The first term is the conditional prior, which is a result of the DP prior imposed on the cluster labels. [sent-145, score-0.373]
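
The conditional prior induced by a DP over cluster labels is commonly written in its Chinese-restaurant-process form: an existing label is chosen in proportion to how many other segments carry it, and a new label in proportion to the concentration parameter. The sketch below shows only that prior term, not the likelihood term. The function name, the `"new"` key, and the parameter name `gamma` are assumptions for this example.

```python
from collections import Counter


def crp_conditional_prior(other_labels, gamma):
    """Prior over the label of one segment given the labels of all others.

    other_labels: cluster labels of every other segment (c_{-j,k}).
    gamma: DP concentration parameter (name assumed for this sketch).
    """
    counts = Counter(other_labels)
    total = len(other_labels) + gamma
    prior = {c: n / total for c, n in counts.items()}
    prior["new"] = gamma / total  # probability of opening a new cluster
    return prior


p = crp_conditional_prior([0, 0, 1, 2, 2, 2], gamma=1.0)
print(p)
```

In a full Gibbs step this prior would be multiplied by the likelihood of the segment under each candidate HMM and then normalized.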

46 The second term is the likelihood of xt being emitted by state s of HMMcj,k . [sent-147, score-0.227]

47 Hidden State (st) To enforce the assumption that a traversal of an HMM must start from the first state and end at the last state, we do not sample hidden state indices for the first and the last frame of a segment. [sent-156, score-0.33]

48 The conditional posterior probability of $w^c_s$ is: $P(w^c_s \mid \cdots) \propto P(w^c_s; \beta)\, P(m^c_s \mid w^c_s) \propto \mathrm{Dir}(w^c_s; \beta)\, \mathrm{Mul}(m^c_s; w^c_s) \propto \mathrm{Dir}(w^c_s; \beta')$ (5), where $m^c_s$ is the set of mixture IDs of feature vectors that belong to state s of HMM c. [sent-163, score-0.465]
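
The Dirichlet-multinomial conjugacy used in Eq. 5 can be sketched as follows: the posterior Dirichlet parameters are the prior parameters plus the per-mixture counts, and a draw from a Dirichlet can be obtained by normalizing independent Gamma draws. The sketch assumes a symmetric prior with a single scalar `beta`; the function names and the toy mixture IDs are hypothetical.

```python
import random


def dirichlet_posterior_params(mixture_ids, num_mixtures, beta):
    """Add the count of each mixture ID to a symmetric Dirichlet prior."""
    counts = [0] * num_mixtures
    for m in mixture_ids:
        counts[m] += 1
    return [beta + n for n in counts]


def sample_dirichlet(alphas, rng):
    """Draw from Dir(alphas) by normalizing independent Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


# Toy mixture IDs for the frames assigned to one state of one HMM.
params = dirichlet_posterior_params([0, 2, 2, 1, 2], num_mixtures=3, beta=3.0)
print(params)  # [4.0, 4.0, 6.0]
weights = sample_dirichlet(params, random.Random(0))
print(weights)
```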

49 µcm,s,d The conjugate prior we use for the two variables is a normal-Gamma distribution with hyperparameters µ0, κ0, α0 and β0 (Murphy, 2007). [sent-172, score-0.248]
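
For a single feature dimension, the standard normal-Gamma posterior update (in the form given in Murphy, 2007) can be sketched as below. This is a generic conjugate update under the assumption that each dimension is treated independently, which matches the diagonal covariance described here; the function name and the toy data are illustrative only.

```python
def normal_gamma_posterior(xs, mu0, kappa0, alpha0, beta0):
    """Posterior hyperparameters of a normal-Gamma prior after observing xs.

    xs holds the values of one feature dimension for the frames assigned
    to one Gaussian; the mean and precision of that Gaussian are unknown.
    """
    n = len(xs)
    if n == 0:
        return mu0, kappa0, alpha0, beta0
    mean = sum(xs) / n
    sum_sq = sum((x - mean) ** 2 for x in xs)
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * mean) / kappa_n
    alpha_n = alpha0 + n / 2
    beta_n = beta0 + 0.5 * sum_sq + kappa0 * n * (mean - mu0) ** 2 / (2 * kappa_n)
    return mu_n, kappa_n, alpha_n, beta_n


print(normal_gamma_posterior([1.0, 2.0, 3.0], mu0=0.0, kappa0=1.0, alpha0=1.0, beta0=1.0))
# (1.5, 4.0, 2.5, 3.5)
```

A Gibbs sampler would then draw the precision from Gamma(alpha_n, rate beta_n) and the mean from a normal with mean mu_n and precision kappa_n times the sampled precision.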

50 $a^c_{j,k}$: Transition Probabilities We represent the transition probabilities at state j in HMM c using $a^c_j$. If we view $a^c_j$ as mixing weights for states reachable from state j, we can simply apply the update rule derived for the mixing weights of Gaussian mixtures shown in Eq. [sent-175, score-0.234]

51 Dirichlet distribution with a positive hyperparameter η as the prior, the conditional posterior for $a^c_j$ is: $P(a^c_j \mid \cdots) \propto \mathrm{Dir}(a^c_j; \eta')$, where the kth entry of $\eta'$ is η + the number of occurrences of the state transition pair (j, k) in segments that belong to HMM c. [sent-178, score-0.41]
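
The count-based update for the transition probabilities can be sketched in the same way as the mixture-weight update: count how often each transition (j, k) occurs in the hidden-state sequences of the segments assigned to one HMM and add the prior η. As a simplification, the sketch keeps an entry for every state rather than only the states reachable from j; the function name and the toy state sequences are assumptions.

```python
def transition_posterior_params(state_sequences, j, num_states, eta):
    """Dirichlet posterior parameters for the transitions leaving state j.

    state_sequences: hidden-state index sequences, one per segment that
    belongs to the HMM in question.
    """
    counts = [0] * num_states
    for seq in state_sequences:
        for prev, nxt in zip(seq, seq[1:]):
            if prev == j:
                counts[nxt] += 1
    return [eta + n for n in counts]


# Two toy segments assigned to the same 3-state HMM.
seqs = [[0, 0, 1, 2], [0, 1, 1, 2]]
print(transition_posterior_params(seqs, j=0, num_states=3, eta=3.0))  # [4.0, 5.0, 3.0]
```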

52 Note that the value of bt only affects segmentation between xl and xr. [sent-181, score-0.317]

53 If bt is turned on, the sampler hypothesizes two segments pl,t and pt+1,r between xl and xr. [sent-182, score-0.346]

54 For bt = 1, to account for the fact that when the model generates pt+1,r, pl,t is already generated and owns a cluster label, we sample a cluster label for pl,t that is reflected in the Kronecker delta function. [sent-194, score-0.57]

55 2 Heuristic Boundary Elimination To reduce the inference load on the boundary variables bt, we exploit acoustic cues in the feature space to eliminate bt’s that are unlikely to be phonetic boundaries. [sent-198, score-0.767]

56 For the rest of the boundary variables that are proposed by the heuristic algorithm, we randomly initialize their values and proceed with the sampling process described above. [sent-200, score-0.27]

57 6 Experimental Setup To the best of our knowledge, there are no standard corpora for evaluating unsupervised methods for acoustic modeling. [sent-201, score-0.55]

58 In this section, we describe the methods used to measure the performance of our model on the following three tasks: sub-word acoustic modeling, segmentation and nonparametric clustering. [sent-207, score-0.688]

59 Unsupervised Segmentation We compare the phonetic boundaries proposed by our model to the manual labels provided in the TIMIT dataset. [sent-208, score-0.233]

60 We compare our model against the state-of-the-art unsupervised and semi-supervised segmentation methods that were also evaluated on the TIMIT training set (Dusan and Rabiner, 2006; Qiao et al. [sent-211, score-0.243]

61 To answer the question, we develop a method to map cluster labels to the phone set in a dataset. [sent-215, score-0.399]

62 We align each cluster label in an utterance to the phone(s) it overlaps with in time by using the boundaries proposed by our model and the manually-labeled ones. [sent-216, score-0.389]

63 When a cluster label overlaps with more than one phone, we align it to the phone with the largest overlap. [sent-217, score-0.412]
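
The largest-overlap alignment rule can be sketched as follows. Segments are represented as `(start, end, label)` tuples in seconds; this representation, the function names, and the toy times are assumptions for the example rather than the authors' data format.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the time overlap between two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def align_to_phones(cluster_segments, phone_segments):
    """Map each cluster segment to the phone it overlaps with the most."""
    pairs = []
    for c_start, c_end, c_label in cluster_segments:
        best = max(phone_segments,
                   key=lambda p: overlap(c_start, c_end, p[0], p[1]))
        if overlap(c_start, c_end, best[0], best[1]) > 0:
            pairs.append((c_label, best[2]))
    return pairs


clusters = [(0.00, 0.12, 19), (0.12, 0.30, 7)]
phones = [(0.00, 0.10, "ae"), (0.10, 0.30, "n")]
print(align_to_phones(clusters, phones))  # [(19, 'ae'), (7, 'n')]
```

Counting the resulting (cluster, phone) pairs over all utterances, for example with `collections.Counter`, gives the entries of a confusion matrix.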

64 We compile the alignment results for 3696 training utterances and present a confusion matrix between the learned cluster labels and the 48 phonetic units used in TIMIT (Lee and Hon, 1989). [sent-218, score-0.499]

65 Sub-word Acoustic Modeling Finally, and most importantly, we need to gauge the quality of the learned sub-word acoustic models. [sent-219, score-0.527]

66 (2008) and Garcia and Gish (2006) tested their models on a phone recognition task and a term detection task respectively. [sent-221, score-0.329]

67 These two tasks are fair measuring methods, but performance on these tasks depends not only on the learned acoustic models, but also other components such as the label-to-phone transducer in (Varadarajan et al. [sent-222, score-0.527]

68 To reduce performance dependencies on components other than the acoustic model, we turn to the task of spoken term detection, which is also the measuring method used in (Jansen and Church, 2011). [sent-224, score-0.679]

69 We compare our unsupervised acoustic model with three supervised ones: 1) an English triphone model, 2) an English monophone model and 3) a Thai monophone model. [sent-225, score-0.814]

70 Table 1: The values of the hyperparameters of our model, where µd and λd are the dth entry of the mean and the diagonal of the inverse covariance matrix of training data. [sent-233, score-0.234]

71 In addition to the supervised acoustic models, we also compare our model against the state-of-the-art unsupervised methods for this task (Zhang and Glass, 2009; Zhang et al. [sent-237, score-0.596]

72 (2012) used a deep Boltzmann machine (DBM) trained with pseudo phone labels generated from an unsupervised GMM to produce a posteriorgram representation. [sent-240, score-0.32]

73 Hyperparameters and Training Iterations The values of the hyperparameters of our model are shown in Table 1, where µd and λd are the dth entry of the mean and the diagonal of the inverse covariance matrix computed from training data. [sent-243, score-0.28]

74 4 shows a confusion matrix of the 48 phones used in TIMIT and the sub-word units learned from 3696 TIMIT utterances. [sent-248, score-0.302]

75 Each circle represents a mapping pair for a cluster label and an English phone. [sent-249, score-0.247]

76 Figure 4: A confusion matrix of the learned cluster labels from the TIMIT training set excluding the sa type utterances and the 48 phones used in TIMIT. [sent-251, score-0.467]

77 A more careful examination on the alignment results shows that the three clusters are mapped to the same vowel in a different acoustic context. [sent-256, score-0.545]

78 For example, cluster 19 is mapped to /ae/ followed by stop consonants, while cluster 20 corresponds to /ae/ followed by nasal consonants. [sent-257, score-0.354]

79 The performance of the four acoustic models on the spoken term detection task is presented in Table 2. [sent-265, score-0.743]

80 The English triphone model achieves the best P@N and EER results and performs slightly better than the English monophone model, which indicates a correlation between the quality of an acoustic model and its performance on the spoken term detection task. [sent-266, score-0.921]

81 hR897resupr- vised acoustic models on the spoken term detection task. [sent-269, score-0.743]

82 acoustic models, it generates a comparable EER and a more accurate detection performance for top hits than the Thai monophone model. [sent-270, score-0.678]

83 This indicates that even without supervision, our model captures and learns the acoustic characteristics of a language automatically and is able to produce an acoustic model that outperforms a language-mismatched acoustic model trained with high supervision. [sent-271, score-1.551]

84 Table 3 shows that our model improves P@N by a large margin and generates only a slightly worse EER than the GMM baseline on the spoken term detection task. [sent-272, score-0.318]

85 , 2012), the hierarchical structure of DBM allows the model to provide a decent posterior representation of phonetic units. [sent-280, score-0.234]

86 This demonstrates that even with just a simple model structure, the proposed learning algorithm is able to acquire rich phonetic knowledge from data and generate a fine posterior representation for phonetic units. [sent-282, score-0.319]

87 [...], and DBM baselines on the spoken term detection task. [sent-290, score-0.272]

88 *The number of phone boundaries in each utterance was assumed to be known in this model. [sent-292, score-0.337]

89 In addition, it also allows the model to capture proper phone durations, which compensates for the fact that we do not include any explicit duration modeling mechanisms in our approach. [sent-297, score-0.326]

90 , 2008), the number of phone boundaries in an utterance was assumed to be known. [sent-299, score-0.337]

91 When compared to the baseline in which the number of phone boundaries in each utterance was also unknown (Dusan and Rabiner, 2006), our model outperforms in both recall and precision, improving the relative F-score by 18. [sent-301, score-0.383]
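
For reference, the F-score combines boundary recall and precision as their harmonic mean, and a relative improvement is the gain divided by the baseline value. The sketch below uses made-up numbers purely for illustration; they are not results from the paper, and the function names are hypothetical.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_improvement(new, baseline):
    """Gain of `new` over `baseline`, as a fraction of the baseline."""
    return (new - baseline) / baseline


# Illustrative values only, not taken from the paper.
f_base = f_score(precision=0.60, recall=0.60)
f_new = f_score(precision=0.75, recall=0.75)
print(f_base, f_new, relative_improvement(f_new, f_base))
```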

92 The key difference between the two baselines and our method is that our model does not treat segmentation as a stand-alone problem; instead, it jointly learns segmentation, clustering and acoustic units from data. [sent-303, score-0.755]

93 8 Conclusion We present a Bayesian unsupervised approach to the problem of acoustic modeling. [sent-305, score-0.55]

94 Without any prior knowledge, this method is able to discover phonetic units that are closely related to English phones, improve upon the state-of-the-art unsupervised segmentation method and achieve more precise spoken term detection on the TIMIT dataset. [sent-306, score-0.687]

95 In the future, we plan to explore phonological context and use more flexible topological structures to model acoustic units within our framework. [sent-307, score-0.601]

96 Acknowledgements The authors would like to thank Hung-an Chang and Ekapol Chuangsuwanich for training the English and Thai acoustic models. [sent-308, score-0.471]

97 Unsupervised hidden Markov modeling of spoken queries for spoken term detection without speech recognition. [sent-315, score-0.618]

98 On the relation between maximum spectral transition positions and phone boundaries. [sent-325, score-0.243]

99 Unsupervised speech segmentation: An analysis of the hypothesized phone boundaries. [sent-408, score-0.274]

100 Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams. [sent-416, score-0.228]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('acoustic', 0.471), ('timit', 0.292), ('phone', 0.203), ('hmm', 0.187), ('cluster', 0.158), ('bt', 0.157), ('mixture', 0.154), ('spoken', 0.146), ('gmm', 0.137), ('xti', 0.137), ('glass', 0.123), ('boundary', 0.122), ('gaussian', 0.121), ('eer', 0.12), ('garcia', 0.12), ('segmentation', 0.118), ('posterior', 0.103), ('dusan', 0.103), ('state', 0.097), ('hidden', 0.094), ('gish', 0.09), ('variables', 0.089), ('thai', 0.089), ('dbm', 0.086), ('monophone', 0.086), ('qiao', 0.086), ('phonetic', 0.085), ('segment', 0.085), ('units', 0.084), ('hmms', 0.084), ('phones', 0.084), ('unsupervised', 0.079), ('speech', 0.071), ('utterance', 0.07), ('segments', 0.07), ('scharenborg', 0.069), ('xt', 0.068), ('conditional', 0.066), ('hyperparameters', 0.065), ('boundaries', 0.064), ('detection', 0.064), ('term', 0.062), ('rabiner', 0.06), ('sampling', 0.059), ('hits', 0.057), ('learned', 0.056), ('st', 0.056), ('dirichlet', 0.056), ('varadarajan', 0.055), ('pt', 0.054), ('utterances', 0.053), ('nonparametric', 0.053), ('zhang', 0.051), ('bti', 0.051), ('covariance', 0.051), ('estevan', 0.051), ('label', 0.051), ('prior', 0.049), ('dp', 0.049), ('icassp', 0.048), ('model', 0.046), ('jansen', 0.046), ('conjugate', 0.045), ('murphy', 0.045), ('vectors', 0.045), ('bayesian', 0.045), ('variable', 0.045), ('observed', 0.044), ('keyword', 0.044), ('xl', 0.042), ('duration', 0.042), ('frame', 0.042), ('denote', 0.041), ('diagonal', 0.041), ('matrix', 0.04), ('transition', 0.04), ('lee', 0.04), ('sampler', 0.039), ('frames', 0.039), ('mapped', 0.038), ('neal', 0.038), ('dir', 0.038), ('circle', 0.038), ('hypothesizes', 0.038), ('integral', 0.038), ('spotting', 0.038), ('confusion', 0.038), ('labels', 0.038), ('index', 0.038), ('derive', 0.037), ('dth', 0.037), ('clusters', 0.036), ('goldwater', 0.036), ('clustering', 0.036), ('modeling', 0.035), ('acj', 0.034), ('ajc', 0.034), ('garofolo', 0.034), ('gaussians', 0.034), ('gelman', 0.034)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000008 16 acl-2012-A Nonparametric Bayesian Approach to Acoustic Model Discovery

Author: Chia-ying Lee ; James Glass

Abstract: We investigate the problem of acoustic modeling in which prior language-specific knowledge and transcribed data are unavailable. We present an unsupervised model that simultaneously segments the speech, discovers a proper set of sub-word units (e.g., phones) and learns a Hidden Markov Model (HMM) for each induced acoustic unit. Our approach is formulated as a Dirichlet process mixture model in which each mixture is an HMM that represents a sub-word unit. We apply our model to the TIMIT corpus, and the results demonstrate that our model discovers sub-word units that are highly correlated with English phones and also produces better segmentation than the state-of-the-art unsupervised baseline. We test the quality of the learned acoustic models on a spoken term detection task. Compared to the baselines, our model improves the relative precision of top hits by at least 22.1% and outperforms a language-mismatched acoustic model.

2 0.19633916 41 acl-2012-Bootstrapping a Unified Model of Lexical and Phonetic Acquisition

Author: Micha Elsner ; Sharon Goldwater ; Jacob Eisenstein

Abstract: During early language acquisition, infants must learn both a lexicon and a model of phonetics that explains how lexical items can vary in pronunciation—for instance “the” might be realized as [Di] or [D@]. Previous models of acquisition have generally tackled these problems in isolation, yet behavioral evidence suggests infants acquire lexical and phonetic knowledge simultaneously. We present a Bayesian model that clusters together phonetic variants of the same lexical item while learning both a language model over lexical items and a log-linear model of pronunciation variability based on articulatory features. The model is trained on transcribed surface pronunciations, and learns by bootstrapping, without access to the true lexicon. We test the model using a corpus of child-directed speech with realistic phonetic variation and either gold standard or automatically induced word boundaries. In both cases modeling variability improves the accuracy of the learned lexicon over a system that assumes each lexical item has a unique pronunciation.

3 0.12066844 74 acl-2012-Discriminative Pronunciation Modeling: A Large-Margin, Feature-Rich Approach

Author: Hao Tang ; Joseph Keshet ; Karen Livescu

Abstract: We address the problem of learning the mapping between words and their possible pronunciations in terms of sub-word units. Most previous approaches have involved generative modeling of the distribution of pronunciations, usually trained to maximize likelihood. We propose a discriminative, feature-rich approach using large-margin learning. This approach allows us to optimize an objective closely related to a discriminative task, to incorporate a large number of complex features, and still do inference efficiently. We test the approach on the task of lexical access; that is, the prediction of a word given a phonetic transcription. In experiments on a subset of the Switchboard conversational speech corpus, our models thus far improve classification error rates from a previously published result of 29.1% to about 15%. We find that large-margin approaches outperform conditional random field learning, and that the Passive-Aggressive algorithm for large-margin learning is faster to converge than the Pegasos algorithm.

4 0.10791162 14 acl-2012-A Joint Model for Discovery of Aspects in Utterances

Author: Asli Celikyilmaz ; Dilek Hakkani-Tur

Abstract: We describe a joint model for understanding user actions in natural language utterances. Our multi-layer generative approach uses both labeled and unlabeled utterances to jointly learn aspects regarding utterance’s target domain (e.g. movies), intention (e.g., finding a movie) along with other semantic units (e.g., movie name). We inject information extracted from unstructured web search query logs as prior information to enhance the generative process of the natural language utterance understanding model. Using utterances from five domains, our approach shows up to 4.5% improvement on domain and dialog act performance over cascaded approach in which each semantic component is learned sequentially and a supervised joint learning model (which requires fully labeled data).

5 0.10357515 165 acl-2012-Probabilistic Integration of Partial Lexical Information for Noise Robust Haptic Voice Recognition

Author: Khe Chai Sim

Abstract: This paper presents a probabilistic framework that combines multiple knowledge sources for Haptic Voice Recognition (HVR), a multimodal input method designed to provide efficient text entry on modern mobile devices. HVR extends the conventional voice input by allowing users to provide complementary partial lexical information via touch input to improve the efficiency and accuracy of voice recognition. This paper investigates the use of the initial letter of the words in the utterance as the partial lexical information. In addition to the acoustic and language models used in automatic speech recognition systems, HVR uses the haptic and partial lexical models as additional knowledge sources to reduce the recognition search space and suppress confusions. Experimental results show that both the word error rate and runtime factor can be reduced by a factor of two using HVR.

6 0.10190342 143 acl-2012-Mixing Multiple Translation Models in Statistical Machine Translation

7 0.094336472 171 acl-2012-SITS: A Hierarchical Nonparametric Model using Speaker Identity for Topic Segmentation in Multiparty Conversations

8 0.088515684 182 acl-2012-Spice it up? Mining Refinements to Online Instructions from User Generated Content

9 0.086645968 179 acl-2012-Smaller Alignment Models for Better Translations: Unsupervised Word Alignment with the l0-norm

10 0.077533811 140 acl-2012-Machine Translation without Words through Substring Alignment

11 0.077340871 113 acl-2012-INPROwidth.3emiSS: A Component for Just-In-Time Incremental Speech Synthesis

12 0.074186191 33 acl-2012-Automatic Event Extraction with Structured Preference Modeling

13 0.073442213 94 acl-2012-Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection

14 0.073371224 211 acl-2012-Using Rejuvenation to Improve Particle Filtering for Bayesian Word Segmentation

15 0.07033255 208 acl-2012-Unsupervised Relation Discovery with Sense Disambiguation

16 0.069475144 119 acl-2012-Incremental Joint Approach to Word Segmentation, POS Tagging, and Dependency Parsing in Chinese

17 0.067690112 95 acl-2012-Fast Syntactic Analysis for Statistical Language Modeling via Substructure Sharing and Uptraining

18 0.067661375 64 acl-2012-Crosslingual Induction of Semantic Roles

19 0.065437138 18 acl-2012-A Probabilistic Model for Canonicalizing Named Entity Mentions

20 0.065136343 48 acl-2012-Classifying French Verbs Using French and English Lexical Resources


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.193), (1, 0.032), (2, -0.004), (3, 0.022), (4, -0.061), (5, 0.148), (6, 0.041), (7, -0.036), (8, 0.027), (9, 0.01), (10, -0.107), (11, -0.116), (12, -0.105), (13, -0.021), (14, -0.087), (15, 0.018), (16, 0.043), (17, 0.076), (18, 0.076), (19, 0.171), (20, 0.034), (21, -0.075), (22, -0.098), (23, -0.062), (24, -0.136), (25, -0.181), (26, 0.151), (27, 0.012), (28, 0.122), (29, 0.033), (30, 0.015), (31, 0.018), (32, 0.066), (33, -0.039), (34, -0.102), (35, -0.05), (36, 0.011), (37, -0.024), (38, 0.066), (39, 0.048), (40, -0.16), (41, 0.142), (42, 0.002), (43, -0.103), (44, 0.002), (45, -0.109), (46, -0.099), (47, -0.013), (48, 0.001), (49, -0.065)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95113277 16 acl-2012-A Nonparametric Bayesian Approach to Acoustic Model Discovery

Author: Chia-ying Lee ; James Glass

Abstract: We investigate the problem of acoustic modeling in which prior language-specific knowledge and transcribed data are unavailable. We present an unsupervised model that simultaneously segments the speech, discovers a proper set of sub-word units (e.g., phones) and learns a Hidden Markov Model (HMM) for each induced acoustic unit. Our approach is formulated as a Dirichlet process mixture model in which each mixture is an HMM that represents a sub-word unit. We apply our model to the TIMIT corpus, and the results demonstrate that our model discovers sub-word units that are highly correlated with English phones and also produces better segmentation than the state-of-the-art unsupervised baseline. We test the quality of the learned acoustic models on a spoken term detection task. Compared to the baselines, our model improves the relative precision of top hits by at least 22.1% and outperforms a language-mismatched acoustic model.

2 0.73062783 41 acl-2012-Bootstrapping a Unified Model of Lexical and Phonetic Acquisition

Author: Micha Elsner ; Sharon Goldwater ; Jacob Eisenstein

Abstract: During early language acquisition, infants must learn both a lexicon and a model of phonetics that explains how lexical items can vary in pronunciation—for instance “the” might be realized as [Di] or [D@]. Previous models of acquisition have generally tackled these problems in isolation, yet behavioral evidence suggests infants acquire lexical and phonetic knowledge simultaneously. We present a Bayesian model that clusters together phonetic variants of the same lexical item while learning both a language model over lexical items and a log-linear model of pronunciation variability based on articulatory features. The model is trained on transcribed surface pronunciations, and learns by bootstrapping, without access to the true lexicon. We test the model using a corpus of child-directed speech with realistic phonetic variation and either gold standard or automatically induced word boundaries. In both cases modeling variability improves the accuracy of the learned lexicon over a system that assumes each lexical item has a unique pronunciation.

3 0.60779119 113 acl-2012-INPROwidth.3emiSS: A Component for Just-In-Time Incremental Speech Synthesis

Author: Timo Baumann ; David Schlangen

Abstract: We present a component for incremental speech synthesis (iSS) and a set of applications that demonstrate its capabilities. This component can be used to increase the responsivity and naturalness of spoken interactive systems. While iSS can show its full strength in systems that generate output incrementally, we also discuss how even otherwise unchanged systems may profit from its capabilities.

4 0.60382402 74 acl-2012-Discriminative Pronunciation Modeling: A Large-Margin, Feature-Rich Approach

Author: Hao Tang ; Joseph Keshet ; Karen Livescu

Abstract: We address the problem of learning the mapping between words and their possible pronunciations in terms of sub-word units. Most previous approaches have involved generative modeling of the distribution of pronunciations, usually trained to maximize likelihood. We propose a discriminative, feature-rich approach using large-margin learning. This approach allows us to optimize an objective closely related to a discriminative task, to incorporate a large number of complex features, and still do inference efficiently. We test the approach on the task of lexical access; that is, the prediction of a word given a phonetic transcription. In experiments on a subset of the Switchboard conversational speech corpus, our models thus far improve classification error rates from a previously published result of 29.1% to about 15%. We find that large-margin approaches outperform conditional random field learning, and that the Passive-Aggressive algorithm for large-margin learning is faster to converge than the Pegasos algorithm.

5 0.60083401 211 acl-2012-Using Rejuvenation to Improve Particle Filtering for Bayesian Word Segmentation

Author: Benjamin Borschinger ; Mark Johnson

Abstract: We present a novel extension to a recently proposed incremental learning algorithm for the word segmentation problem originally introduced in Goldwater (2006). By adding rejuvenation to a particle filter, we are able to considerably improve its performance, both in terms of finding higher probability and higher accuracy solutions.

6 0.57147753 14 acl-2012-A Joint Model for Discovery of Aspects in Utterances

7 0.55420846 165 acl-2012-Probabilistic Integration of Partial Lexical Information for Noise Robust Haptic Voice Recognition

8 0.47940612 182 acl-2012-Spice it up? Mining Refinements to Online Instructions from User Generated Content

9 0.40751192 210 acl-2012-Unsupervized Word Segmentation: the Case for Mandarin Chinese

10 0.38916057 129 acl-2012-Learning High-Level Planning from Text

11 0.3724966 32 acl-2012-Automated Essay Scoring Based on Finite State Transducer: towards ASR Transcription of Oral English Speech

12 0.37052065 94 acl-2012-Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection

13 0.33945468 179 acl-2012-Smaller Alignment Models for Better Translations: Unsupervised Word Alignment with the l0-norm

14 0.33710456 118 acl-2012-Improving the IBM Alignment Models Using Variational Bayes

15 0.33104685 95 acl-2012-Fast Syntactic Analysis for Statistical Language Modeling via Substructure Sharing and Uptraining

16 0.32769874 48 acl-2012-Classifying French Verbs Using French and English Lexical Resources

17 0.32640153 174 acl-2012-Semantic Parsing with Bayesian Tree Transducers

18 0.3261205 18 acl-2012-A Probabilistic Model for Canonicalizing Named Entity Mentions

19 0.32324627 42 acl-2012-Bootstrapping via Graph Propagation

20 0.32260415 117 acl-2012-Improving Word Representations via Global Context and Multiple Word Prototypes


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(25, 0.025), (26, 0.049), (28, 0.037), (30, 0.025), (36, 0.013), (37, 0.029), (39, 0.071), (59, 0.014), (72, 0.27), (74, 0.03), (82, 0.015), (84, 0.034), (85, 0.013), (90, 0.157), (92, 0.074), (94, 0.033), (99, 0.046)]
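
The listing does not say how simValue is derived from these topic weights. As a rough sketch only, the snippet below compares two sparse (topicId, topicWeight) lists with cosine similarity. The choice of cosine, the function name `cosine_similarity`, and the second vector `other_topics` are all assumptions made for illustration. The same-paper entry below scores 0.757 rather than 1, so the site evidently uses some other computation.

```python
import math

# Sparse LDA topic vector for this paper, copied from the listing above.
paper_topics = [
    (25, 0.025), (26, 0.049), (28, 0.037), (30, 0.025), (36, 0.013),
    (37, 0.029), (39, 0.071), (59, 0.014), (72, 0.27), (74, 0.03),
    (82, 0.015), (84, 0.034), (85, 0.013), (90, 0.157), (92, 0.074),
    (94, 0.033), (99, 0.046),
]

# Hypothetical topic vector for another paper (made up for illustration).
other_topics = [(39, 0.10), (72, 0.30), (90, 0.05)]


def cosine_similarity(a, b):
    """Cosine similarity between two sparse (topicId, weight) lists.

    Assumption: this is NOT necessarily the measure used by the listing.
    """
    wa, wb = dict(a), dict(b)
    # Only topics present in both vectors contribute to the dot product.
    dot = sum(w * wb.get(t, 0.0) for t, w in wa.items())
    norm_a = math.sqrt(sum(w * w for w in wa.values()))
    norm_b = math.sqrt(sum(w * w for w in wb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine_similarity(paper_topics, other_topics))
```

Ranking every other paper by such a score and keeping the top 20 would produce a list shaped like the one that follows, though the actual values would differ.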

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.75730789 16 acl-2012-A Nonparametric Bayesian Approach to Acoustic Model Discovery

Author: Chia-ying Lee ; James Glass

Abstract: We investigate the problem of acoustic modeling in which prior language-specific knowledge and transcribed data are unavailable. We present an unsupervised model that simultaneously segments the speech, discovers a proper set of sub-word units (e.g., phones) and learns a Hidden Markov Model (HMM) for each induced acoustic unit. Our approach is formulated as a Dirichlet process mixture model in which each mixture is an HMM that represents a sub-word unit. We apply our model to the TIMIT corpus, and the results demonstrate that our model discovers sub-word units that are highly correlated with English phones and also produces better segmentation than the state-of-the-art unsupervised baseline. We test the quality of the learned acoustic models on a spoken term detection task. Compared to the baselines, our model improves the relative precision of top hits by at least 22.1% and outperforms a language-mismatched acoustic model.

2 0.59920955 88 acl-2012-Exploiting Social Information in Grounded Language Learning via Grammatical Reduction

Author: Mark Johnson ; Katherine Demuth ; Michael Frank

Abstract: This paper uses an unsupervised model of grounded language acquisition to study the role that social cues play in language acquisition. The input to the model consists of (orthographically transcribed) child-directed utterances accompanied by the set of objects present in the non-linguistic context. Each object is annotated by social cues, indicating e.g., whether the caregiver is looking at or touching the object. We show how to model the task of inferring which objects are being talked about (and which words refer to which objects) as standard grammatical inference, and describe PCFG-based unigram models and adaptor grammar-based collocation models for the task. Exploiting social cues improves the performance of all models. Our models learn the relative importance of each social cue jointly with word-object mappings and collocation structure, consistent with the idea that children could discover the importance of particular social information sources during word learning.

3 0.59813106 28 acl-2012-Aspect Extraction through Semi-Supervised Modeling

Author: Arjun Mukherjee ; Bing Liu

Abstract: Aspect extraction is a central problem in sentiment analysis. Current methods either extract aspects without categorizing them, or extract and categorize them using unsupervised topic modeling. By categorizing, we mean the synonymous aspects should be clustered into the same category. In this paper, we solve the problem in a different setting where the user provides some seed words for a few aspect categories and the model extracts and clusters aspect terms into categories simultaneously. This setting is important because categorizing aspects is a subjective task. For different application purposes, different categorizations may be needed. Some form of user guidance is desired. In this paper, we propose two statistical models to solve this seeded problem, which aim to discover exactly what the user wants. Our experimental results show that the two proposed models are indeed able to perform the task effectively.

4 0.59737724 167 acl-2012-QuickView: NLP-based Tweet Search

Author: Xiaohua Liu ; Furu Wei ; Ming Zhou ; QuickView Team Microsoft

Abstract: Tweets have become a comprehensive repository for real-time information. However, it is often hard for users to quickly get information they are interested in from tweets, owing to the sheer volume of tweets as well as their noisy and informal nature. We present QuickView, an NLP-based tweet search platform to tackle this issue. Specifically, it exploits a series of natural language processing technologies, such as tweet normalization, named entity recognition, semantic role labeling, sentiment analysis, tweet classification, to extract useful information, i.e., named entities, events, opinions, etc., from a large volume of tweets. Then, non-noisy tweets, together with the mined information, are indexed, on top of which two brand new scenarios are enabled, i.e., categorized browsing and advanced search, allowing users to effectively access either the tweets or fine-grained information they are interested in.

5 0.59408081 156 acl-2012-Online Plagiarized Detection Through Exploiting Lexical, Syntax, and Semantic Information

Author: Wan-Yu Lin ; Nanyun Peng ; Chun-Chao Yen ; Shou-de Lin

Abstract: In this paper, we introduce a framework that identifies online plagiarism by exploiting lexical, syntactic and semantic features that include duplication-gram, reordering and alignment of words, POS and phrase tags, and semantic similarity of sentences. We establish an ensemble framework to combine the predictions of each model. Results demonstrate that our system can not only find a considerable amount of real-world online plagiarism cases but also outperforms several state-of-the-art algorithms and commercial software. Keywords: Plagiarism Detection, Lexical, Syntactic, Semantic

6 0.59121758 142 acl-2012-Mining Entity Types from Query Logs via User Intent Modeling

7 0.59010893 123 acl-2012-Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT

8 0.58972138 182 acl-2012-Spice it up? Mining Refinements to Online Instructions from User Generated Content

9 0.58933222 10 acl-2012-A Discriminative Hierarchical Model for Fast Coreference at Large Scale

10 0.5891037 38 acl-2012-Bayesian Symbol-Refined Tree Substitution Grammars for Syntactic Parsing

11 0.58885324 174 acl-2012-Semantic Parsing with Bayesian Tree Transducers

12 0.5881415 217 acl-2012-Word Sense Disambiguation Improves Information Retrieval

13 0.5879494 61 acl-2012-Cross-Domain Co-Extraction of Sentiment and Topic Lexicons

14 0.58785909 45 acl-2012-Capturing Paradigmatic and Syntagmatic Lexical Relations: Towards Accurate Chinese Part-of-Speech Tagging

15 0.58725345 132 acl-2012-Learning the Latent Semantics of a Concept from its Definition

16 0.58673006 211 acl-2012-Using Rejuvenation to Improve Particle Filtering for Bayesian Word Segmentation

17 0.58460653 73 acl-2012-Discriminative Learning for Joint Template Filling

18 0.58407819 130 acl-2012-Learning Syntactic Verb Frames using Graphical Models

19 0.58392602 21 acl-2012-A System for Real-time Twitter Sentiment Analysis of 2012 U.S. Presidential Election Cycle

20 0.58348209 22 acl-2012-A Topic Similarity Model for Hierarchical Phrase-based Translation