acl acl2011 acl2011-203 knowledge-graph by maker-knowledge-mining

203 acl-2011-Learning Sub-Word Units for Open Vocabulary Speech Recognition


Source: pdf

Author: Carolina Parada ; Mark Dredze ; Abhinav Sethy ; Ariya Rastrow

Abstract: Large vocabulary speech recognition systems fail to recognize words beyond their vocabulary, many of which are information rich terms, like named entities or foreign words. Hybrid word/sub-word systems solve this problem by adding sub-word units to large vocabulary word based systems; new words can then be represented by combinations of subword units. Previous work heuristically created the sub-word lexicon from phonetic representations of text using simple statistics to select common phone sequences. We propose a probabilistic model to learn the subword lexicon optimized for a given task. We consider the task of out of vocabulary (OOV) word detection, which relies on output from a hybrid model. A hybrid model with our learned sub-word lexicon reduces error by 6.3% and 7.6% (absolute) at a 5% false alarm rate on an English Broadcast News and MIT Lectures task respectively.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Hybrid word/sub-word systems solve this problem by adding sub-word units to large vocabulary word based systems; new words can then be represented by combinations of subword units. [sent-10, score-0.32]

2 Previous work heuristically created the sub-word lexicon from phonetic representations of text using simple statistics to select common phone sequences. [sent-11, score-0.261]

3 We consider the task of out of vocabulary (OOV) word detection, which relies on output from a hybrid model. [sent-13, score-0.346]

4 A hybrid model with our learned sub-word lexicon reduces error by 6.3% and 7.6% (absolute) at a 5% false alarm rate on English Broadcast News and MIT Lectures, respectively. [sent-14, score-0.394]

5 While large vocabulary continuous speech recognition (LVCSR) systems produce high quality transcripts, they fail to recognize out of vocabulary (OOV) words. [sent-18, score-0.196]

6 Hybrid word/sub-word recognizers can produce a sequence of sub-word units in place of OOV words. [sent-20, score-0.243]

7 Ideally, the recognizer outputs a complete word for in-vocabulary (IV) utterances, and sub-word units for OOVs. [sent-21, score-0.245]

8 If the OOV word “slobodan” is misrecognized as in-vocabulary words (e.g. “slow it dawn”), a hybrid system could output a sequence of multi-phoneme units: s l ow, b ax, d ae n. [sent-26, score-0.293]

9 In fact, hybrid systems have improved OOV spoken term detection (Mamou et al. [sent-28, score-0.412]

10 , 2009), achieved better phone error rates, especially in OOV regions (Rastrow et al. [sent-30, score-0.175]

11 Hybrid recognizers vary in a number of ways: sub-word unit type: variable-length phoneme units (Rastrow et al. [sent-33, score-0.366]

12 In this work, we consider how to optimally create sub-word units for a hybrid system. [sent-35, score-0.464]

13 These units are variable-length phoneme sequences, although in principle our work can be used for other unit types. [sent-36, score-0.322]

14 These units typically represent the most frequent phoneme sequences in English words. [sent-41, score-0.231]

15 However, it isn’t clear why these units would produce the best hybrid output. [sent-42, score-0.464]

16 Instead, we introduce a probabilistic model for learning the optimal units for a given task. [sent-43, score-0.199]

17 Our model learns a segmentation of a text corpus given some side information: a mapping between the vocabulary and a label set; learned units are predictive of class labels. [sent-44, score-0.48]

18 In this paper, we learn sub-word units optimized for OOV detection. [sent-45, score-0.199]

19 OOV detection aims to identify regions in the LVCSR output where OOVs were uttered. [sent-46, score-0.175]

20 Towards this goal, we are interested in selecting units such that the recognizer outputs them only for OOV regions, while preferring to output a complete word for in-vocabulary regions. [sent-47, score-0.31]

21 We begin by presenting our log-linear model for learning sub-word units with a simple but effective inference procedure. [sent-49, score-0.199]

22 After reviewing existing OOV detection approaches, we detail how the learned units are integrated into a hybrid speech recognition system. [sent-50, score-0.632]

23 Learning Sub-Word Units: given raw text, our objective is to produce a lexicon of sub-word units that can be used by a hybrid system for open vocabulary speech recognition. [sent-52, score-0.684]

24 We assume there is a latent segmentation S of this corpus, and model the probability of the labels Y given the words W, P(Y|W). [sent-56, score-0.176]

25 Since we are maximizing the observed Y by marginalizing over segmentations, P(Y|W) = Σ_S P(Y, S|W), the segmentation S must discriminate between different possible labels. [sent-59, score-0.176]

26 We learn variable-length multi-phone units by segmenting the phonetic representation ofeach word in the corpus. [sent-60, score-0.245]

27 Model: inspired by the morphological segmentation model of Poon et al. [sent-67, score-0.176]

28 (2009), we define a log-linear model parameterized by Λ: P_Λ(Y, S|W) = u_Λ(Y, S, W) / Z(W) (1), where u_Λ(Y, S, W) defines the score of the proposed segmentation S for words W and labels Y according to model parameters Λ. [sent-69, score-0.198]
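
The unnormalized score in Eq. (1) has a standard log-linear form. The following is a minimal Python sketch of how such a score could be assembled; the dictionary interfaces, the precomputed prior values, and the weights alpha and beta are illustrative assumptions, not the authors' implementation.

```python
import math

def unnormalized_score(feature_counts, weights, lexicon_prior_value,
                       corpus_prior_value, alpha=-1.0, beta=-1.0):
    """feature_counts: dict mapping feature keys to counts for a proposed (S, Y).
    weights: dict mapping the same keys to lambda values.
    alpha, beta: (negative) weights on the lexicon and corpus priors."""
    score = sum(weights.get(f, 0.0) * c for f, c in feature_counts.items())
    score += alpha * lexicon_prior_value + beta * corpus_prior_value
    return math.exp(score)  # u_Lambda(Y, S, W); Z(W) would normalize this
```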

29 Sub-word units σ compose S, where each σ is a phone sequence, including the full pronunciation for vocabulary words; the collection of σs forms the lexicon. [sent-70, score-0.475]

30 Each unit σ is present in a segmentation with some context c = (φl, φr) of the form φlσφr. [sent-71, score-0.267]

31 In addition to scoring a segmentation based on features, we include two priors inspired by the Minimum Description Length (MDL) principle suggested by Poon et al. [sent-73, score-0.176]

32 The lexicon prior favors smaller lexicons by placing an exponential prior with negative weight on the length of the lexicon, Σ_σ |σ|, where |σ| is the length of the unit σ in number of phones. [sent-75, score-0.375]

33 The corpus prior counters this effect: an exponential prior with negative weight on the number of units in each word's segmentation normalized by the word's length, Σ_i |s_i| / |w_i|, where |s_i| is the segmentation length of word i and |w_i| is its length in phones. [sent-77, score-0.273]
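
A small sketch of how the two MDL-inspired priors could be computed for a candidate corpus segmentation, assuming a segmentation is stored as a mapping from words to lists of phone-tuple units; the data structures are illustrative, not taken from the paper.

```python
def lexicon_prior(segmentation):
    """Sum of unit lengths (in phones) over the distinct units used anywhere."""
    lexicon = {unit for units in segmentation.values() for unit in units}
    return sum(len(unit) for unit in lexicon)

def corpus_prior(segmentation, pronunciations):
    """Number of units per word, normalized by the word's length in phones."""
    return sum(len(units) / len(pronunciations[word])
               for word, units in segmentation.items())
```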

34 Using these definitions, the segmentation score u_Λ(Y, S, W) is given as in Eq. (2). (Since sub-word units can expand to full words, we refer to both words and sub-words simply as units.) [sent-79, score-0.375]

35 Figure 1: Units and bigram phone context (in parentheses) for an example segmentation of the word “slobodan” (pronunciation s l ow b ax d ae n): s l ow (#, #, _, b, ax); b ax (l, ow, _, d, ae); d ae n (b, ax, _, #, #). [sent-82, score-0.558]

36 In Eq. (2), f_{σ,y}(S, Y) are the co-occurrence counts of the pair (σ, y), where σ is a unit under segmentation S and y is the label. [sent-84, score-0.267]

37 The normalizer Z sums over all possible segmentations and labels: Z(W) = Σ_{S'} Σ_{Y'} u_Λ(Y', S', W) (3). Consider the example segmentation for the word “slobodan” with pronunciation s, l, ow, b, ax, d, ae, n (Figure 1). [sent-88, score-0.354]

38 Overlapping context features capture rich segmentation regularities associated with each class. [sent-93, score-0.176]
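
To make these unit-and-context features concrete, here is a hedged sketch that reproduces the unit / bigram-phone-context pairs of Figure 1 for a given segmentation; the list-of-phone-tuples representation and the '#' boundary padding are assumptions for illustration.

```python
def unit_context_features(units):
    """units: list of phone tuples, e.g. [('s','l','ow'), ('b','ax'), ('d','ae','n')]."""
    phones = ['#', '#'] + [p for u in units for p in u] + ['#', '#']
    feats, pos = [], 2
    for u in units:
        left = tuple(phones[pos - 2:pos])                      # two phones to the left
        right = tuple(phones[pos + len(u):pos + len(u) + 2])   # two phones to the right
        feats.append((u, left + right))
        pos += len(u)
    return feats

# unit_context_features([('s','l','ow'), ('b','ax'), ('d','ae','n')]) yields
# [(('s','l','ow'), ('#','#','b','ax')), (('b','ax'), ('l','ow','d','ae')),
#  (('d','ae','n'), ('b','ax','#','#'))], matching Figure 1.
```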

39 Inference: inference is challenging since the lexicon prior renders all word segmentations interdependent. [sent-99, score-0.235]

40 Numerous segmentations are possible; each word has 2^{N−1} possible segmentations, where N is the number of phones in its pronunciation (i.e., a binary split decision at each of the N−1 internal phone boundaries). [sent-101, score-0.247]

41 However, if we decide to segment the first word (“cesar”) as {s iy, z er}, then the segmentation for “cesium”, {s iy, z iy ax m}, incurs a lexicon prior penalty only for including the new segment z iy ax m. [sent-104, score-0.753]

42 If instead we segment “cesar” as {s iy z, er}, the segmentation {s iy, z iy ax m} incurs double the lexicon prior penalty (since we are including two new units in the lexicon: s iy and z iy ax m). [sent-105, score-0.855]

43 This dependency requires joint segmentation of the entire corpus, which is intractable. [sent-106, score-0.176]

44 One approach is to use Gibbs sampling: iterating through each word, sampling a new segmentation conditioned on the segmentation of all other words. [sent-109, score-0.426]

45 The sampling distribution requires enumerating all possible segmentations for each word (2^{N−1}) and computing the conditional probabilities for each segmentation: P(S|Y*, W) = P(Y*, S|W) / P(Y*|W) (the features are extracted from the remaining words in the corpus). [sent-110, score-0.167]

46 Given samples S_1, …, S_M, we compute E_{S|Y*,W}[f_i] as follows: E_{S|Y*,W}[f_i] ≈ (1/M) Σ_j f_i[S_j]. Similarly, to compute E_{S,Y|W} we sample a segmentation and a label for each word. [sent-114, score-0.176]
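
The Monte Carlo estimate above is straightforward to implement. The sketch below assumes each sample is summarized as a Counter of feature counts; this representation is an illustrative assumption, not the authors' code.

```python
from collections import Counter

def expected_features(sampled_feature_counts):
    """sampled_feature_counts: list of Counter objects, one per sampled segmentation."""
    total = Counter()
    for feats in sampled_feature_counts:
        total.update(feats)
    m = len(sampled_feature_counts)
    return {f: c / m for f, c in total.items()}  # (1/M) * sum_j f_i[S_j]
```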

47 A sampled segmentation can introduce new units, which may have higher probability than existing ones. [sent-117, score-0.176]

48 To speed up burn-in, the sampler is initialized with the most likely segmentation from the previous iteration. [sent-122, score-0.201]

49 To initialize the sampler the first time, we set all the parameters to zero (only the priors have non-zero values) and run deterministic annealing to obtain the first segmentation of the corpus. [sent-123, score-0.251]

50 Efficient Sampling: sampling a segmentation for the corpus requires computing the normalization constant (3), which contains a summation over all possible corpus segmentations. [sent-125, score-0.176]

51 Still, even sampling a single word’s segmentation requires enumerating probabilities for all possible segmentations. [sent-127, score-0.25]

52 However, the lexicon prior poses a problem for this construction since the penalty incurred by a new unit in the segmentation depends on whether that unit is present elsewhere in that segmentation. [sent-131, score-0.526]

53 For example, consider the segmentation for the word ANJANI : AA N, JH, AA N, IY. [sent-132, score-0.176]

54 If none of these units are in the lexicon, this segmentation yields the lowest prior penalty since it repeats the unit AA N. [sent-133, score-0.529]

55 Once we sample a segmentation (and label), we accept it according to Eq. [sent-142, score-0.176]

56 sampleSL samples a segmentation and label sequence for the entire corpus from P(Y, S|W), and sampleS samples a segmentation from P(S|Y*, W). [sent-146, score-0.41]

57 Splitting at phone boundaries yields the same lexicon prior but a higher corpus prior. [sent-147, score-0.252]

58 Initialization: λ̄_0 = 0̄, S_0 = a random segmentation for each word in L. [sent-153, score-0.176]
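
Putting the pieces together, training alternates sampling with gradient updates on Λ. The sketch below shows only that overall structure, using hypothetical helpers sample_S (drawing S from P(S|Y*, W)), sample_SL (drawing (S, Y) from P(Y, S|W)), extract_features, and the expected_features estimator sketched above; it is not the exact algorithm from the paper.

```python
def train(corpus, labels, num_iters=10, num_samples=50, lr=0.1):
    weights = {}  # lambda parameters, all zero initially (only the priors are active)
    for _ in range(num_iters):
        # "Clamped" expectation E_{S|Y*,W}[f]: segmentations sampled with labels fixed.
        clamped = expected_features(
            [extract_features(sample_S(corpus, labels, weights), labels)
             for _ in range(num_samples)])
        # "Free" expectation E_{S,Y|W}[f]: segmentations and labels sampled jointly.
        free = expected_features(
            [extract_features(s, y)
             for s, y in (sample_SL(corpus, weights) for _ in range(num_samples))])
        # Stochastic gradient step on the log-likelihood of the observed labels.
        for f in set(clamped) | set(free):
            weights[f] = weights.get(f, 0.0) + lr * (clamped.get(f, 0.0) - free.get(f, 0.0))
    return weights
```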

59 OOV detection for ASR output can be categorized into two broad groups: 1) hybrid (filler) models, which explicitly model OOVs using either filler, sub-word, or generic word models (Bazzi, 2002; Schaaf, 2001; Bisani and Ney, 2005; Klakow et al. [sent-156, score-0.375]

60 (2010), a second order CRF with features based on the output of a hybrid recognizer. [sent-186, score-0.265]

61 This detector processes hybrid recognizer output, so we can evaluate different sub-word unit lexicons for the hybrid recognizer and measure the change in OOV detection accuracy. [sent-187, score-0.887]

62 Given a sub-word lexicon, the words and sub-words are combined to form a hybrid language model (LM) to be used by the LVCSR system. [sent-193, score-0.29]

63 This hybrid LM captures dependencies between word and sub-words. [sent-194, score-0.265]

64 Since sub-words represent OOVs when building the hybrid LM, the presence of sub-words in ASR output indicates an OOV region. [sent-197, score-0.265]

65 Sub-word Posterior = Σ_{σ∈t_j} p(σ|t_j) (7) and Word-Entropy = −Σ_{w∈t_j} p(w|t_j) log p(w|t_j) (8), where t_j is the current bin in the confusion network and σ is a sub-word in the hybrid dictionary. [sent-206, score-0.359]
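
A minimal sketch of these two per-bin features, assuming each confusion network bin is available as a dictionary from tokens to posterior probabilities and that sub-word tokens can be identified by lexicon lookup; this interface is a hypothetical simplification.

```python
import math

def subword_posterior(bin_posteriors, subword_lexicon):
    """Eq. (7): total posterior mass assigned to sub-word tokens in the bin."""
    return sum(p for tok, p in bin_posteriors.items() if tok in subword_lexicon)

def word_entropy(bin_posteriors):
    """Eq. (8): entropy of the posterior distribution over tokens in the bin."""
    return -sum(p * math.log(p) for p in bin_posteriors.values() if p > 0.0)
```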

66 We also use a hybrid LVCSR system, combining word and sub-word units obtained from either our approach or a state-of-the-art baseline approach (Rastrow et al. [sent-225, score-0.501]

67 Our hybrid system's lexicon has 83K words (Rastrow et al., 2009a). [sent-228, score-0.265]

68 The 1290 words are OOVs to both the word and hybrid systems. [sent-235, score-0.265]

69 In addition we report OOV detection results on an MIT Lectures data set (Glass et al. [sent-236, score-0.186]

70 This out-of-domain test set helps us evaluate the cross-domain performance of the proposed and baseline hybrid systems. [sent-241, score-0.302]

71 Each word in training and development was converted to its most likely pronunciation using the dictionary. (This was used to obtain the 5K hybrid system.) [sent-252, score-0.402]

72 To learn subwords for the 10K hybrid system we used 10K in-vocabulary words and 10K OOVs. [sent-253, score-0.29]

73 We limit segmentations to those including units of at most 5 phones to speed sampling with no significant degradation in performance. [sent-265, score-0.435]

74 (2009a) as our baseline unit selection method, a data-driven approach where the language model training text is converted into phones using the dictionary (or a letter-to-sound model for OOVs), and an N-gram phone LM is estimated on this data and pruned using a relative-entropy-based method. [sent-269, score-0.359]

75 The hybrid lexicon includes the resulting sub-words, ranging from unigram to 5-gram phone units, and the 83K word lexicon. [sent-270, score-0.37]

76 Evaluation: we obtain confusion networks from both the word and hybrid LVCSR systems. [sent-272, score-0.319]

77 We report OOV detection accuracy using standard detection error tradeoff (DET) curves (Martin et al. [sent-275, score-0.254]
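
For reference, a DET curve traces miss rate against false alarm rate as a decision threshold is swept over the detector's scores. A sketch under the assumption that each candidate region carries an OOV confidence score and a ground-truth OOV flag (both classes present):

```python
def det_points(scored_regions):
    """scored_regions: list of (oov_score, is_oov) pairs, one per candidate region."""
    total_oov = sum(1 for _, is_oov in scored_regions if is_oov)
    total_iv = len(scored_regions) - total_oov
    points = []
    for threshold in sorted({score for score, _ in scored_regions}):
        flagged = [(s, o) for s, o in scored_regions if s >= threshold]
        hits = sum(1 for _, o in flagged if o)
        false_alarms = sum(1 for _, o in flagged if not o)
        points.append((false_alarms / total_iv, (total_oov - hits) / total_oov))
    return points  # (false alarm rate, miss rate) pairs tracing the DET curve
```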

78 In this work we ignore pronunciation variability and simply consider the most likely pronunciation for each word. [sent-280, score-0.17]

79 It is straightforward to extend to multiple pronunciations by first sampling a pronunciation for each word and then sampling a segmentation for that pronunciation. [sent-281, score-0.451]

80 Results: we compare the performance of a hybrid system with baseline units (§5. [sent-283, score-0.302]

81 2) and one with units learned by our model on OOV detection (§5. [sent-284, score-0.223]

82 We present results using a hybrid system with 5k and 10k sub-words. [sent-286, score-0.265]

83 OOV detection improvements can be attributed to increased coverage of OOV regions by the learned sub-words compared to the baseline. [sent-301, score-0.199]

84 Table 1 shows the percentage of Hits (sub-word units predicted in OOV regions) and False Alarms (sub-word units predicted for in-vocabulary words). [sent-302, score-0.398]

85 Interestingly, the average sub-word length for the proposed units exceeded that of the baseline units by 0. [sent-305, score-0.435]

86 When implementing the lexicon baseline, we discovered that their hybrid units were mistakenly derived from text containing test OOVs. [sent-311, score-0.569]

87 Figure 4: DET curves for OOV detection using baseline hybrid systems with different lexicon sizes and the proposed discriminative hybrid system, on the OOVCORP data set. [sent-316, score-0.816]

88 Figure 5: Effect of adding context features to baseline and discriminative hybrid systems on the OOVCORP data set. [sent-318, score-0.302]

89 The proposed hybrid system (This Paper 10k + context-features) still improves over the baseline (Baseline 10k + context-features); however, the relative gain is reduced. [sent-321, score-0.302]

90 Figure 6: DET curves for OOV detection using baseline hybrid systems with different lexicon sizes and the proposed discriminative hybrid system, on the MIT Lectures data set. [sent-340, score-0.816]

91 Figure 7: Effect of adding context features to baseline and discriminative hybrid systems on the MIT Lectures data set. [sent-342, score-0.302]

92 Improved Phonetic Transcription: we consider the hybrid lexicon's impact on Phone Error Rate (PER) with respect to the reference transcription. [sent-345, score-0.265]

93 The reference phone sequence is obtained by doing forced alignment of the audio stream to the reference transcripts using acoustic models. [sent-346, score-0.215]
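
PER is then a normalized edit distance between hypothesized and reference phone sequences. A straightforward sketch (standard Levenshtein distance; this is a generic illustration, not the evaluation scripts used in the paper):

```python
def phone_error_rate(hyp, ref):
    """hyp, ref: lists of phone symbols; returns edit distance / reference length."""
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(hyp)][len(ref)] / len(ref)
```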

94 Table 2 presents PERs for the word and different hybrid systems. [sent-349, score-0.265]

95 As in prior work (Rastrow et al., 2009b), the hybrid systems achieve better PER, especially in OOV regions, since they predict sub-word units for OOVs. [sent-351, score-0.412]

96 Our method achieves modest improvements in PER compared to the hybrid baseline. [sent-352, score-0.265]

97 Conclusions: our probabilistic model learns sub-word units for hybrid speech recognizers by segmenting a text corpus while exploiting side information. [sent-354, score-0.542]

98 Furthermore, we have confirmed previous work that hybrid systems achieve better phone accuracy, and our model makes modest improvements over a baseline with a similarly sized sub-word lexicon. [sent-361, score-0.412]

99 A new method for OOV detection using hybrid word/fragment system. [sent-470, score-0.375]

100 Towards using hybrid, word, and fragment units for vocabulary independent LVCSR systems. [sent-474, score-0.28]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('oov', 0.518), ('oovs', 0.454), ('hybrid', 0.265), ('lvcsr', 0.258), ('units', 0.199), ('segmentation', 0.176), ('rastrow', 0.152), ('parada', 0.137), ('phone', 0.11), ('detection', 0.11), ('lexicon', 0.105), ('segmentations', 0.093), ('unit', 0.091), ('pronunciation', 0.085), ('ax', 0.083), ('vocabulary', 0.081), ('broadcast', 0.081), ('ih', 0.081), ('iy', 0.076), ('lectures', 0.076), ('sampling', 0.074), ('phones', 0.069), ('regions', 0.065), ('detector', 0.064), ('abhinav', 0.061), ('ariya', 0.061), ('bhuvana', 0.061), ('oovcorp', 0.061), ('sethy', 0.061), ('confusion', 0.054), ('bazzi', 0.053), ('bisani', 0.046), ('phonetic', 0.046), ('recognizer', 0.046), ('ramabhadran', 0.046), ('slobodan', 0.046), ('acoustic', 0.045), ('recognizers', 0.044), ('det', 0.044), ('pronunciations', 0.042), ('subword', 0.04), ('alarms', 0.04), ('mangu', 0.04), ('tj', 0.04), ('si', 0.04), ('mit', 0.039), ('transcripts', 0.038), ('prior', 0.037), ('baseline', 0.037), ('spoken', 0.037), ('carolina', 0.037), ('aa', 0.037), ('iv', 0.037), ('asr', 0.037), ('speech', 0.034), ('curves', 0.034), ('hours', 0.034), ('lm', 0.033), ('glass', 0.032), ('phoneme', 0.032), ('news', 0.031), ('false', 0.031), ('alarm', 0.03), ('bestsegmentation', 0.03), ('burget', 0.03), ('cesium', 0.03), ('mamou', 0.03), ('samplesl', 0.03), ('wessel', 0.03), ('samples', 0.029), ('fi', 0.029), ('ae', 0.028), ('ym', 0.028), ('unobserved', 0.028), ('annealing', 0.028), ('poon', 0.027), ('cesar', 0.027), ('fiscus', 0.027), ('hrs', 0.027), ('issam', 0.027), ('soltau', 0.027), ('penalty', 0.026), ('dictionary', 0.026), ('converted', 0.026), ('ow', 0.025), ('region', 0.025), ('sampler', 0.025), ('es', 0.025), ('subwords', 0.025), ('fa', 0.024), ('absolute', 0.024), ('learned', 0.024), ('utterances', 0.023), ('sm', 0.023), ('fred', 0.023), ('white', 0.023), ('segment', 0.022), ('audio', 0.022), ('parameters', 0.022), ('klakow', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 203 acl-2011-Learning Sub-Word Units for Open Vocabulary Speech Recognition

Author: Carolina Parada ; Mark Dredze ; Abhinav Sethy ; Ariya Rastrow

Abstract: Large vocabulary speech recognition systems fail to recognize words beyond their vocabulary, many of which are information rich terms, like named entities or foreign words. Hybrid word/sub-word systems solve this problem by adding sub-word units to large vocabulary word based systems; new words can then be represented by combinations of subword units. Previous work heuristically created the sub-word lexicon from phonetic representations of text using simple statistics to select common phone sequences. We propose a probabilistic model to learn the subword lexicon optimized for a given task. We consider the task of out of vocabulary (OOV) word detection, which relies on output from a hybrid model. A hybrid model with our learned sub-word lexicon reduces error by 6.3% and 7.6% (absolute) at a 5% false alarm rate on an English Broadcast News and MIT Lectures task respectively.

2 0.27971935 163 acl-2011-Improved Modeling of Out-Of-Vocabulary Words Using Morphological Classes

Author: Thomas Mueller ; Hinrich Schuetze

Abstract: We present a class-based language model that clusters rare words of similar morphology together. The model improves the prediction of words after histories containing outof-vocabulary words. The morphological features used are obtained without the use of labeled data. The perplexity improvement compared to a state of the art Kneser-Ney model is 4% overall and 81% on unknown histories.

3 0.25005651 208 acl-2011-Lexical Normalisation of Short Text Messages: Makn Sens a #twitter

Author: Bo Han ; Timothy Baldwin

Abstract: Twitter provides access to large volumes of data in real time, but is notoriously noisy, hampering its utility for NLP. In this paper, we target out-of-vocabulary words in short text messages and propose a method for identifying and normalising ill-formed words. Our method uses a classifier to detect ill-formed words, and generates correction candidates based on morphophonemic similarity. Both word similarity and context are then exploited to select the most probable correction candidate for the word. The proposed method doesn’t require any annotations, and achieves state-of-the-art performance over an SMS corpus and a novel dataset based on Twitter.

4 0.1230177 140 acl-2011-Fully Unsupervised Word Segmentation with BVE and MDL

Author: Daniel Hewlett ; Paul Cohen

Abstract: Several results in the word segmentation literature suggest that description length provides a useful estimate of segmentation quality in fully unsupervised settings. However, since the space of potential segmentations grows exponentially with the length of the corpus, no tractable algorithm follows directly from the Minimum Description Length (MDL) principle. Therefore, it is necessary to generate a set of candidate segmentations and select between them according to the MDL principle. We evaluate several algorithms for generating these candidate segmentations on a range of natural language corpora, and show that the Bootstrapped Voting Experts algorithm consistently outperforms other methods when paired with MDL.

5 0.11172798 104 acl-2011-Domain Adaptation for Machine Translation by Mining Unseen Words

Author: Hal Daume III ; Jagadeesh Jagarlamudi

Abstract: We show that unseen words account for a large part of the translation error when moving to new domains. Using an extension of a recent approach to mining translations from comparable corpora (Haghighi et al., 2008), we are able to find translations for otherwise OOV terms. We show several approaches to integrating such translations into a phrasebased translation system, yielding consistent improvements in translations quality (between 0.5 and 1.5 Bleu points) on four domains and two language pairs.

6 0.09489651 339 acl-2011-Word Alignment Combination over Multiple Word Segmentation

7 0.09405116 172 acl-2011-Insertion, Deletion, or Substitution? Normalizing Text Messages without Pre-categorization nor Supervision

8 0.084770538 27 acl-2011-A Stacked Sub-Word Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging

9 0.080299489 318 acl-2011-Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models

10 0.071899377 228 acl-2011-N-Best Rescoring Based on Pitch-accent Patterns

11 0.067859605 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation

12 0.066962503 95 acl-2011-Detection of Agreement and Disagreement in Broadcast Conversations

13 0.062952816 184 acl-2011-Joint Hebrew Segmentation and Parsing using a PCFGLA Lattice Parser

14 0.060381468 319 acl-2011-Unsupervised Decomposition of a Document into Authorial Components

15 0.055256635 57 acl-2011-Bayesian Word Alignment for Statistical Machine Translation

16 0.053235229 75 acl-2011-Combining Morpheme-based Machine Translation with Post-processing Morpheme Prediction

17 0.052175254 257 acl-2011-Question Detection in Spoken Conversations Using Textual Conversations

18 0.050475888 49 acl-2011-Automatic Evaluation of Chinese Translation Output: Word-Level or Character-Level?

19 0.05012954 272 acl-2011-Semantic Information and Derivation Rules for Robust Dialogue Act Detection in a Spoken Dialogue System

20 0.049566932 77 acl-2011-Computing and Evaluating Syntactic Complexity Features for Automated Scoring of Spontaneous Non-Native Speech


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.143), (1, -0.009), (2, -0.004), (3, 0.021), (4, -0.068), (5, 0.012), (6, 0.082), (7, -0.022), (8, 0.028), (9, 0.172), (10, -0.048), (11, 0.084), (12, 0.004), (13, 0.088), (14, 0.014), (15, -0.021), (16, -0.094), (17, -0.018), (18, 0.173), (19, 0.073), (20, 0.115), (21, -0.087), (22, -0.006), (23, -0.039), (24, 0.052), (25, -0.041), (26, 0.077), (27, 0.031), (28, -0.077), (29, 0.104), (30, -0.025), (31, -0.184), (32, 0.01), (33, -0.001), (34, -0.039), (35, 0.001), (36, -0.08), (37, 0.231), (38, 0.041), (39, -0.038), (40, -0.038), (41, 0.096), (42, -0.098), (43, 0.12), (44, 0.013), (45, 0.04), (46, 0.019), (47, 0.035), (48, 0.082), (49, -0.038)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94021112 203 acl-2011-Learning Sub-Word Units for Open Vocabulary Speech Recognition

Author: Carolina Parada ; Mark Dredze ; Abhinav Sethy ; Ariya Rastrow

Abstract: Large vocabulary speech recognition systems fail to recognize words beyond their vocabulary, many of which are information rich terms, like named entities or foreign words. Hybrid word/sub-word systems solve this problem by adding sub-word units to large vocabulary word based systems; new words can then be represented by combinations of subword units. Previous work heuristically created the sub-word lexicon from phonetic representations of text using simple statistics to select common phone sequences. We propose a probabilistic model to learn the subword lexicon optimized for a given task. We consider the task of out of vocabulary (OOV) word detection, which relies on output from a hybrid model. A hybrid model with our learned sub-word lexicon reduces error by 6.3% and 7.6% (absolute) at a 5% false alarm rate on an English Broadcast News and MIT Lectures task respectively.

2 0.73370242 208 acl-2011-Lexical Normalisation of Short Text Messages: Makn Sens a #twitter

Author: Bo Han ; Timothy Baldwin

Abstract: Twitter provides access to large volumes of data in real time, but is notoriously noisy, hampering its utility for NLP. In this paper, we target out-of-vocabulary words in short text messages and propose a method for identifying and normalising ill-formed words. Our method uses a classifier to detect ill-formed words, and generates correction candidates based on morphophonemic similarity. Both word similarity and context are then exploited to select the most probable correction candidate for the word. The proposed method doesn’t require any annotations, and achieves state-of-the-art performance over an SMS corpus and a novel dataset based on Twitter.

3 0.71336567 163 acl-2011-Improved Modeling of Out-Of-Vocabulary Words Using Morphological Classes

Author: Thomas Mueller ; Hinrich Schuetze

Abstract: We present a class-based language model that clusters rare words of similar morphology together. The model improves the prediction of words after histories containing outof-vocabulary words. The morphological features used are obtained without the use of labeled data. The perplexity improvement compared to a state of the art Kneser-Ney model is 4% overall and 81% on unknown histories.

4 0.68668705 172 acl-2011-Insertion, Deletion, or Substitution? Normalizing Text Messages without Pre-categorization nor Supervision

Author: Fei Liu ; Fuliang Weng ; Bingqing Wang ; Yang Liu

Abstract: Most text message normalization approaches are based on supervised learning and rely on human labeled training data. In addition, the nonstandard words are often categorized into different types and specific models are designed to tackle each type. In this paper, we propose a unified letter transformation approach that requires neither pre-categorization nor human supervision. Our approach models the generation process from the dictionary words to nonstandard tokens under a sequence labeling framework, where each letter in the dictionary word can be retained, removed, or substituted by other letters/digits. To avoid the expensive and time consuming hand labeling process, we automatically collected a large set of noisy training pairs using a novel webbased approach and performed character-level . alignment for model training. Experiments on both Twitter and SMS messages show that our system significantly outperformed the stateof-the-art deletion-based abbreviation system and the jazzy spell checker (absolute accuracy gain of 21.69% and 18. 16% over jazzy spell checker on the two test sets respectively).

5 0.61743057 140 acl-2011-Fully Unsupervised Word Segmentation with BVE and MDL

Author: Daniel Hewlett ; Paul Cohen

Abstract: Several results in the word segmentation literature suggest that description length provides a useful estimate of segmentation quality in fully unsupervised settings. However, since the space of potential segmentations grows exponentially with the length of the corpus, no tractable algorithm follows directly from the Minimum Description Length (MDL) principle. Therefore, it is necessary to generate a set of candidate segmentations and select between them according to the MDL principle. We evaluate several algorithms for generating these candidate segmentations on a range of natural language corpora, and show that the Bootstrapped Voting Experts algorithm consistently outperforms other methods when paired with MDL.

6 0.48588744 301 acl-2011-The impact of language models and loss functions on repair disfluency detection

7 0.47204122 175 acl-2011-Integrating history-length interpolation and classes in language modeling

8 0.46453962 27 acl-2011-A Stacked Sub-Word Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging

9 0.45254919 38 acl-2011-An Empirical Investigation of Discounting in Cross-Domain Language Models

10 0.42822704 238 acl-2011-P11-2093 k2opt.pdf

11 0.4152554 318 acl-2011-Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models

12 0.40354916 228 acl-2011-N-Best Rescoring Based on Pitch-accent Patterns

13 0.38884822 339 acl-2011-Word Alignment Combination over Multiple Word Segmentation

14 0.37550956 11 acl-2011-A Fast and Accurate Method for Approximate String Search

15 0.37454581 257 acl-2011-Question Detection in Spoken Conversations Using Textual Conversations

16 0.36088166 319 acl-2011-Unsupervised Decomposition of a Document into Authorial Components

17 0.35245296 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation

18 0.33869159 142 acl-2011-Generalized Interpolation in Decision Tree LM

19 0.33839008 24 acl-2011-A Scalable Probabilistic Classifier for Language Modeling

20 0.33059561 341 acl-2011-Word Maturity: Computational Modeling of Word Knowledge


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.246), (5, 0.015), (17, 0.047), (20, 0.012), (37, 0.099), (39, 0.061), (41, 0.066), (55, 0.035), (59, 0.027), (72, 0.044), (91, 0.058), (96, 0.159)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.91948092 35 acl-2011-An ERP-based Brain-Computer Interface for text entry using Rapid Serial Visual Presentation and Language Modeling

Author: Kenneth Hild ; Umut Orhan ; Deniz Erdogmus ; Brian Roark ; Barry Oken ; Shalini Purwar ; Hooman Nezamfar ; Melanie Fried-Oken

Abstract: Event related potentials (ERP) corresponding to stimuli in electroencephalography (EEG) can be used to detect the intent of a person for brain computer interfaces (BCI). This paradigm is widely used to build letter-byletter text input systems using BCI. Nevertheless using a BCI-typewriter depending only on EEG responses will not be sufficiently accurate for single-trial operation in general, and existing systems utilize many-trial schemes to achieve accuracy at the cost of speed. Hence incorporation of a language model based prior or additional evidence is vital to improve accuracy and speed. In this demonstration we will present a BCI system for typing that integrates a stochastic language model with ERP classification to achieve speedups, via the rapid serial visual presentation (RSVP) paradigm.

2 0.87632 6 acl-2011-A Comprehensive Dictionary of Multiword Expressions

Author: Kosho Shudo ; Akira Kurahone ; Toshifumi Tanabe

Abstract: It has been widely recognized that one of the most difficult and intriguing problems in natural language processing (NLP) is how to cope with idiosyncratic multiword expressions. This paper presents an overview of the comprehensive dictionary (JDMWE) of Japanese multiword expressions. The JDMWE is characterized by a large notational, syntactic, and semantic diversity of contained expressions as well as a detailed description of their syntactic functions, structures, and flexibilities. The dictionary contains about 104,000 expressions, potentially 750,000 expressions. This paper shows that the JDMWE’s validity can be supported by comparing the dictionary with a large-scale Japanese N-gram frequency dataset, namely the LDC2009T08, generated by Google Inc. (Kudo et al. 2009). 1

same-paper 3 0.81288207 203 acl-2011-Learning Sub-Word Units for Open Vocabulary Speech Recognition

Author: Carolina Parada ; Mark Dredze ; Abhinav Sethy ; Ariya Rastrow

Abstract: Large vocabulary speech recognition systems fail to recognize words beyond their vocabulary, many of which are information rich terms, like named entities or foreign words. Hybrid word/sub-word systems solve this problem by adding sub-word units to large vocabulary word based systems; new words can then be represented by combinations of subword units. Previous work heuristically created the sub-word lexicon from phonetic representations of text using simple statistics to select common phone sequences. We propose a probabilistic model to learn the subword lexicon optimized for a given task. We consider the task of out of vocabulary (OOV) word detection, which relies on output from a hybrid model. A hybrid model with our learned sub-word lexicon reduces error by 6.3% and 7.6% (absolute) at a 5% false alarm rate on an English Broadcast News and MIT Lectures task respectively.

4 0.77351201 158 acl-2011-Identification of Domain-Specific Senses in a Machine-Readable Dictionary

Author: Fumiyo Fukumoto ; Yoshimi Suzuki

Abstract: This paper focuses on domain-specific senses and presents a method for assigning category/domain label to each sense of words in a dictionary. The method first identifies each sense of a word in the dictionary to its corresponding category. We used a text classification technique to select appropriate senses for each domain. Then, senses were scored by computing the rank scores. We used Markov Random Walk (MRW) model. The method was tested on English and Japanese resources, WordNet 3.0 and EDR Japanese dictionary. For evaluation of the method, we compared English results with the Subject Field Codes (SFC) resources. We also compared each English and Japanese results to the first sense heuristics in the WSD task. These results suggest that identification of domain-specific senses (IDSS) may actually be of benefit.

5 0.77019203 281 acl-2011-Sentiment Analysis of Citations using Sentence Structure-Based Features

Author: Awais Athar

Abstract: Sentiment analysis of citations in scientific papers and articles is a new and interesting problem due to the many linguistic differences between scientific texts and other genres. In this paper, we focus on the problem of automatic identification of positive and negative sentiment polarity in citations to scientific papers. Using a newly constructed annotated citation sentiment corpus, we explore the effectiveness of existing and novel features, including n-grams, specialised science-specific lexical features, dependency relations, sentence splitting and negation features. Our results show that 3-grams and dependencies perform best in this task; they outperform the sentence splitting, science lexicon and negation based features.

6 0.74371421 2 acl-2011-AM-FM: A Semantic Framework for Translation Quality Assessment

7 0.66816509 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation

8 0.66276866 58 acl-2011-Beam-Width Prediction for Efficient Context-Free Parsing

9 0.66198617 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction

10 0.66189426 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning

11 0.66169715 28 acl-2011-A Statistical Tree Annotator and Its Applications

12 0.66023839 318 acl-2011-Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models

13 0.65985882 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering

14 0.65826774 190 acl-2011-Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations

15 0.65794092 202 acl-2011-Learning Hierarchical Translation Structure with Linguistic Annotations

16 0.65777373 246 acl-2011-Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition

17 0.65715653 117 acl-2011-Entity Set Expansion using Topic information

18 0.65502077 284 acl-2011-Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models

19 0.65496361 145 acl-2011-Good Seed Makes a Good Crop: Accelerating Active Learning Using Language Modeling

20 0.65436745 238 acl-2011-P11-2093 k2opt.pdf