
170 acl-2010-Letter-Phoneme Alignment: An Exploration


Source: pdf

Author: Sittichai Jiampojamarn ; Grzegorz Kondrak

Abstract: Letter-phoneme alignment is usually generated by a straightforward application of the EM algorithm. We explore several alternative alignment methods that employ phonetics, integer programming, and sets of constraints, and propose a novel approach of refining the EM alignment by aggregation of best alignments. We perform both intrinsic and extrinsic evaluation of the assortment of methods. We show that our proposed EM-Aggregation algorithm leads to the improvement of the state of the art in letter-to-phoneme conversion on several different data sets.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Letter-phoneme alignment is usually generated by a straightforward application of the EM algorithm. [sent-3, score-0.392]

2 We explore several alternative alignment methods that employ phonetics, integer programming, and sets of constraints, and propose a novel approach of refining the EM alignment by aggregation of best alignments. [sent-4, score-0.889]

3 1 Introduction Letter-to-phoneme (L2P) conversion (also called grapheme-to-phoneme conversion) is the task of predicting the pronunciation of a word given its orthographic form by converting a sequence of letters into a sequence of phonemes. [sent-7, score-0.341]

4 Letter-phoneme alignment is an important step in the L2P task. [sent-12, score-0.392]

5 The training data usually consists of pairs of letter and phoneme sequences, which are not aligned. [sent-13, score-0.557]

6 Since there is no explicit information indicating the relationships between individual letter and phonemes, these must be inferred by a letter-phoneme alignment algorithm before a prediction model can be trained. [sent-14, score-0.705]

7 The quality of the alignment affects the accuracy of L2P conversion. [sent-15, score-0.464]

8 Letter-phoneme alignment is closely related to transliteration alignment (Pervouchine et al. [sent-16, score-0.784]

9 Letter-phoneme alignment may also be considered as a task in itself; for example, in the alignment of speech transcription with text in spoken corpora. [sent-18, score-0.818]

10 Most previous L2P approaches induce the alignment between letters and phonemes with the expectation maximization (EM) algorithm. [sent-19, score-0.826]

11 In this paper, we propose a number of alternative alignment methods, and compare them to the EM-based algorithms using both intrinsic and extrinsic evaluations. [sent-20, score-0.546]

12 The intrinsic evaluation is conducted by comparing the generated alignments to a manually-constructed gold standard. [sent-21, score-0.375]

13 We discuss the advantages and disadvantages of various methods, and show that better alignments tend to improve the accuracy of the L2P systems regardless of the actual technique. [sent-23, score-0.314]

14 We also examine the relationship between alignment entropy and alignment quality. [sent-25, score-0.838]

15 In Section 2, we enumerate the assumptions that the alignment methods commonly adopt. [sent-27, score-0.392]

16 In Section 7, we propose an algorithm to refine the alignments produced by EM. [sent-30, score-0.305]

17 (c 2010 Association for Computational Linguistics, pages 780–78) 2 Background We define the letter-phoneme alignment task as the problem of inducing links between units that are related by pronunciation. [sent-35, score-0.595]

18 The leftmost example alignment of the word accuse [@kjuz] below includes 1-1, 1-0, 1-2, and 2-1 links. [sent-37, score-0.421]

19 The letter e is considered to be linked to a special null phoneme. [sent-38, score-0.353]

20 We refer to an alignment model that assumes all three constraints as a pure one-to-one (1-1) model. [sent-42, score-0.459]

21 By allowing only 1-1 and 1-0 links, the alignment task is thus greatly simplified. [sent-43, score-0.392]

22 In the simplest case, when the number of letters is equal to the number of phonemes, there is only one possible alignment that satisfies all three constraints. [sent-44, score-0.528]

23 When there are more letters than phonemes, the search is reduced to identifying letters that must be linked to null phonemes (the process referred to as “epsilon scattering” by Black et al. [sent-45, score-0.643]
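
For short words, the search described above can be enumerated exhaustively. The following is a minimal sketch, assuming a pure 1-1 model with only 1-1 and 1-0 links; the function name `epsilon_scatterings` and the `"_"` null symbol are illustrative choices and do not come from the paper.

```python
from itertools import combinations

NULL = "_"  # hypothetical symbol standing in for the null phoneme


def epsilon_scatterings(letters, phonemes):
    """Yield every pure 1-1 alignment as a list of (letter, phoneme) links."""
    n, m = len(letters), len(phonemes)
    if m > n:
        return  # a pure 1-1 model cannot align more phonemes than letters
    # Choose which letter positions keep a real phoneme; the rest get NULL.
    for kept in combinations(range(n), m):
        kept_set = set(kept)
        remaining = iter(phonemes)
        yield [(ch, next(remaining) if i in kept_set else NULL)
               for i, ch in enumerate(letters)]


if __name__ == "__main__":
    for alignment in epsilon_scatterings("ab", ["X"]):
        print(alignment)
```

With equal numbers of letters and phonemes the generator yields exactly one alignment; with n letters and m < n phonemes it yields one alignment per choice of the n - m null positions.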

24 Moreover, a pure 1-1 approach cannot handle cases where the number of phonemes exceeds the number of letters. [sent-48, score-0.363]

25 A typical solution to overcome this problem is to introduce so-called double phonemes by merging adjacent phonemes that could be represented as a single letter. [sent-49, score-0.623]

26 For example, a double phoneme U would replace a sequence of the phonemes j and u in Figure 1. [sent-50, score-0.575]
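
The double-phoneme substitution can be illustrated with a small preprocessing step. This is only a sketch: the function name and the dictionary of merge rules are assumptions made for illustration, and the single rule shown is the j + u → U case mentioned above.

```python
def merge_double_phonemes(phonemes, doubles):
    """Replace adjacent phoneme pairs listed in `doubles` with a single symbol."""
    merged, i = [], 0
    while i < len(phonemes):
        pair = tuple(phonemes[i:i + 2])
        if pair in doubles:
            merged.append(doubles[pair])  # two phonemes become one unit
            i += 2
        else:
            merged.append(phonemes[i])
            i += 1
    return merged


if __name__ == "__main__":
    # accuse [@kjuz]: five phonemes become four, so a 1-1 model can align them.
    print(merge_double_phonemes(["@", "k", "j", "u", "z"], {("j", "u"): "U"}))
```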

27 3 EM Alignment Early EM-based alignment methods (Daelemans and Bosch, 1997; Black et al. [sent-56, score-0.392]

28 The 1-1 alignment problem can be formulated as a dynamic programming problem to find the maximum score of alignment, given a probability table of aligning letter and phoneme as a mapping function. [sent-59, score-1.047]

29 In practice, the latter probability is often set to zero in order to enforce the representation constraint, which facilitates the subsequent phoneme generation process. [sent-61, score-0.334]

30 The probability table δ(xi, yj) can be initialized by a uniform distribution and is iteratively re-computed (M-step) from the most likely alignments found at each iteration over the data set (E-step). [sent-62, score-0.299]

31 The final alignments are constructed after the probability table converges. [sent-63, score-0.299]
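
The procedure outlined above (dynamic programming over a probability table, followed by re-estimation from the most likely alignments) can be sketched as follows. This is a simplified illustration rather than the implementation used in the paper: it assumes only 1-1 and 1-0 links, normalises the table as a joint distribution over links, and runs a fixed number of iterations instead of testing for convergence. The names `best_alignment`, `hard_em`, and `NULL` are hypothetical.

```python
import math
from collections import Counter

NULL = "_"  # hypothetical null-phoneme symbol


def best_alignment(letters, phonemes, delta):
    """Viterbi search over 1-1 and 1-0 links; returns a list of links or None."""
    n, m = len(letters), len(phonemes)
    neg_inf = float("-inf")

    def logp(x, y):
        p = delta.get((x, y), 0.0)
        return math.log(p) if p > 0.0 else neg_inf

    score = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0.0
    for t in range(1, n + 1):
        x = letters[t - 1]
        for v in range(m + 1):
            # 1-0 link: letter x produces no phoneme.
            options = [(score[t - 1][v] + logp(x, NULL), v)]
            if v > 0:
                # 1-1 link: letter x produces phoneme y_v.
                options.append((score[t - 1][v - 1] + logp(x, phonemes[v - 1]), v - 1))
            best, prev_v = max(options)
            if best > neg_inf:
                score[t][v], back[t][v] = best, prev_v
    if score[n][m] == neg_inf:
        return None
    links, v = [], m
    for t in range(n, 0, -1):  # follow back-pointers from the final cell
        prev_v = back[t][v]
        links.append((letters[t - 1], phonemes[prev_v] if prev_v < v else NULL))
        v = prev_v
    return links[::-1]


def hard_em(data, iterations=10):
    """data: list of (letters, phonemes) pairs. Returns (delta, alignments)."""
    # Uniform initialisation over every letter/phoneme (or NULL) pair seen together.
    pairs = {(x, y) for xs, ys in data for x in xs for y in list(ys) + [NULL]}
    delta = {pair: 1.0 / len(pairs) for pair in pairs}
    alignments = []
    for _ in range(iterations):
        alignments = [best_alignment(xs, ys, delta) for xs, ys in data]
        counts = Counter(link for a in alignments if a for link in a)
        total = sum(counts.values())
        delta = {link: c / total for link, c in counts.items()}
    return delta, alignments
```

Because only the single best alignment of each entry contributes counts, this variant can lock onto whatever alignment wins the initial ties; that is a property of the sketch and not a claim about the systems evaluated in the paper.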

32 , 2007) is a many-to-many (M-M) alignment algorithm based on EM that allows for mapping of multiple letters to multiple phonemes. [sent-65, score-0.6]

33 Algorithm 1 describes the E-step of the many-to-many alignment algorithm. [sent-66, score-0.392]

34 γ represents partial counts collected over all possible mappings between substrings of letters and phonemes. [sent-67, score-0.329]
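
The partial counts γ can be illustrated with a generic forward-backward pass over substring pairs. The sketch below is not a transcription of Algorithm 1, which is not reproduced in this summary; it assumes that a link maps 1 to `max_x` letters onto 0 to `max_y` phonemes (zero phonemes standing for a null link), and the function and parameter names are hypothetical.

```python
from collections import defaultdict


def expected_counts(x, y, delta, max_x=2, max_y=2):
    """Forward-backward partial counts (gamma) for one (letters, phonemes) pair."""
    T, V = len(x), len(y)

    def steps(t, v):
        # All substring pairs that can end at letter position t and phoneme position v.
        for i in range(1, min(max_x, t) + 1):
            for j in range(0, min(max_y, v) + 1):  # j == 0 is a null link
                pair = (x[t - i:t], tuple(y[v - j:v]))
                yield i, j, pair, delta.get(pair, 0.0)

    # Forward probabilities: total probability of all partial alignments reaching (t, v).
    alpha = [[0.0] * (V + 1) for _ in range(T + 1)]
    alpha[0][0] = 1.0
    for t in range(1, T + 1):
        for v in range(V + 1):
            alpha[t][v] = sum(p * alpha[t - i][v - j] for i, j, _, p in steps(t, v))

    # Backward probabilities: total probability of completing an alignment from (t, v).
    beta = [[0.0] * (V + 1) for _ in range(T + 1)]
    beta[T][V] = 1.0
    for t in range(T, 0, -1):
        for v in range(V, -1, -1):
            for i, j, _, p in steps(t, v):
                beta[t - i][v - j] += p * beta[t][v]

    gamma = defaultdict(float)
    z = alpha[T][V]
    if z == 0.0:
        return gamma  # no alignment has non-zero probability
    for t in range(1, T + 1):
        for v in range(V + 1):
            for i, j, pair, p in steps(t, v):
                gamma[pair] += alpha[t - i][v - j] * p * beta[t][v] / z
    return gamma
```

Each count is the posterior probability that the corresponding substring pair is used as a link, summed over all positions where it could occur.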

35 The final many-to-many alignments are created by finding the most likely paths using the Viterbi algorithm based on the learned mapping probability table. [sent-85, score-0.371]

36 Although the many-to-many approach tends to create relatively large models, it generates more intuitive alignments and leads to improvement in the L2P accuracy (Jiampojamarn et al. [sent-87, score-0.314]

37 However, since many links involve multiple letters, it also introduces additional complexity in the phoneme prediction phase. [sent-89, score-0.446]

38 One possible solution is to apply a letter segmentation algorithm at test time to cluster letters according to the alignments in the training data. [sent-90, score-0.721]

39 com/p/m2m-aligner/ 4 Phonetic alignment The EM-based approaches to L2P alignment treat both letters and phonemes as abstract symbols. [sent-97, score-1.218]

40 A completely different approach to L2P alignment is based on the phonetic similarity between phonemes. [sent-98, score-0.509]

41 The key idea of the approach is to represent each letter by a phoneme that is likely to be represented by the letter. [sent-99, score-0.594]

42 The actual phonemes on the phoneme side and the phonemes representing letters on the letter side can then be aligned on the basis of phonetic similarity between phonemes. [sent-100, score-1.406]

43 The main advantage of the phonetic alignment is that it requires no training data, and so can readily be applied to languages for which no pronunciation lexicons are available. [sent-101, score-0.624]

44 The task of identifying the phoneme that is most likely to be represented by a given letter may seem complex and highly language-dependent. [sent-102, score-0.557]

45 Intuitively, the letters that had been chosen (often centuries ago) to represent phonemes in any orthographic system tend to be close to the prototype phoneme in the original script. [sent-105, score-0.711]

46 The post-processing algorithm produces an alignment that contains 1-0, 1-1, and 1-2 links. [sent-202, score-0.425]

47 We are interested in establishing whether a set of allowable letter-phoneme mappings could be derived directly from the data without relying on phonetic features. [sent-214, score-0.363]

48 (1998) report that constructing lists of possible phonemes for each letter leads to L2P improvement. [sent-216, score-0.608]

49 The lists constrain the alignments performed by the EM algorithm and lead to better-quality alignments. [sent-218, score-0.335]

50 We implement a similar interactive program that incrementally expands the lists of possible phonemes for each letter by refining alignments constrained by those lists. [sent-219, score-0.908]

51 However, instead of employing the EM algorithm, we induce alignments using the standard edit distance algorithm with substitution and deletion assigned the same cost. [sent-220, score-0.305]

52 6 IP Alignment The process of manually inducing allowable letter-phoneme mappings is time-consuming and involves a great deal of language-specific knowledge. [sent-245, score-0.365]

53 We specify two types of binary variables that correspond to local alignment links and global letter-phoneme mappings, respectively. [sent-249, score-0.658]

54 In the lexicon entry k, let lik be the letter at position i, and pjk the phoneme at position j. [sent-252, score-0.752]

55 A corresponding global variable G(lik, pjk) is set if the list of allowed letter-phoneme mappings includes the link (lik, pjk). [sent-254, score-0.399]

56 Figure 2: A network of possible alignment links. [sent-256, score-0.421]

57 We create a network of possible alignment links for each lexicon entry k, and assign a binary variable to each link in the network. [sent-257, score-0.71]

58 Figure 2 shows an alignment network for the lexicon entry k: wriggle [r Ig @ L]. [sent-258, score-0.485]

59 There are three 1-0 links (level), three 1-1 links (diagonal), and one 1-2 link (steep). [sent-259, score-0.394]

60 We create constraints to ensure that the link variables receiving a value of 1 form a left-to-right path through the alignment network, and that all other link variables receive a value of 0. [sent-262, score-0.661]

61 We accomplish this by requiring the sum of the links entering each node to equal the sum of the links leaving each node. [sent-263, score-0.338]
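
The flow-conservation idea can be checked on a concrete 0/1 assignment without an IP solver. The sketch below is an illustration under stated assumptions: a link is encoded as a hypothetical triple `(i, j, k)`, meaning that the letter at position i + 1 is linked to the k phonemes following phoneme position j (k = 0 for a 1-0 link, 1 for a 1-1 link, 2 for a 1-2 link), and the function name is invented for this example.

```python
from collections import Counter


def forms_single_path(n_letters, n_phonemes, links):
    """Return True if the active links form one left-to-right path through the network."""
    inflow, outflow = Counter(), Counter()
    for i, j, k in links:
        # Reject links that fall outside the alignment network.
        if not (0 <= i < n_letters and k in (0, 1, 2) and 0 <= j and j + k <= n_phonemes):
            return False
        outflow[(i, j)] += 1
        inflow[(i + 1, j + k)] += 1
    source, sink = (0, 0), (n_letters, n_phonemes)
    # Exactly one unit of flow leaves the source and reaches the sink.
    if outflow[source] != 1 or inflow[sink] != 1:
        return False
    # Every other node passes on exactly as much flow as it receives.
    nodes = set(inflow) | set(outflow)
    return all(inflow[node] == outflow[node] for node in nodes - {source, sink})


if __name__ == "__main__":
    # One path for wriggle [r I g @ L] with three 1-0, three 1-1 and one 1-2 link;
    # the particular choice of links is assumed here for illustration.
    path = {(0, 0, 0), (1, 0, 1), (2, 1, 1), (3, 2, 1), (4, 3, 0), (5, 3, 2), (6, 5, 0)}
    print(forms_single_path(7, 5, path))
```

Because the network is acyclic, these equalities leave room for only one connected path from the first node to the last; in an IP formulation the same equalities would appear as linear constraints over the binary link variables.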

62 Instead, we first run the full set of variables on a subset of the training data which includes only the lexicon entries in which the number of phonemes exceeds the number of letters. [sent-265, score-0.453]

63 In the second pass, we run the model on the full data set, but we allow only the 1-2 links that belong to the initial set of 1-2 mappings induced in the first pass. [sent-267, score-0.362]

64 1 Combining IP with EM The set of allowable letter-phoneme mappings can also be used as an input to the EM alignment algorithm. [sent-269, score-0.638]

65 We train the EM model in a similar fashion to the many-to-many alignment algorithm presented in Section 3, except that we limit the letter size to be one letter, and that any letter-phoneme mapping that is not in the minimal set is assigned zero count during the E-step. [sent-273, score-0.744]

66 7 Alignment by aggregation During our development experiments, we observed that the technique that combines IP with EM described in the previous section generally leads to alignment quality improvement in comparison with the IP alignment. [sent-275, score-0.472]

67 We propose an alternative EM-based alignment method that instead utilizes a list of alternative one-to-many alignments created with M2M-aligner and aggregates 1-M links into M-M links in cases when there is a disagreement between alignments within the list. [sent-280, score-1.328]

68 For example, if the list contains the two alignments shown in Figure 3, the algorithm creates a single many-to-many alignment by merging the first pair of 1-1 and 1-0 links into a single ph:f link. [sent-281, score-0.893]

69 Therefore, the resulting alignment reinforces the ph:f mapping, but avoids the questionable se:z link. [sent-283, score-0.392]
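
One plausible reading of the aggregation step is to keep only the link boundaries on which all alternative alignments agree and to merge everything between two consecutive shared boundaries into a single link. The sketch below implements that reading; it is an assumption about the mechanics and not the authors' code, and the word and alignments in the usage example are invented to mirror the ph:f case rather than copied from Figure 3.

```python
def boundaries(alignment):
    """Return the set of (letters consumed, phonemes consumed) points after each link."""
    points, i, j = {(0, 0)}, 0, 0
    for letters, phonemes in alignment:
        i += len(letters)
        j += len(phonemes)
        points.add((i, j))
    return points


def aggregate(alignments):
    """Merge links wherever the alternative alignments of one word disagree."""
    first = alignments[0]
    x = "".join(letters for letters, _ in first)
    y = [p for _, phonemes in first for p in phonemes]
    # Boundaries shared by every alternative alignment, in left-to-right order.
    common = sorted(set.intersection(*(boundaries(a) for a in alignments)))
    return [(x[i0:i1], tuple(y[j0:j1]))
            for (i0, j0), (i1, j1) in zip(common, common[1:])]


if __name__ == "__main__":
    # A link is (letter string, tuple of phonemes); an empty tuple is a null link.
    a1 = [("p", ("f",)), ("h", ()), ("r", ("r",)), ("a", ("eI",)), ("s", ("z",)), ("e", ())]
    a2 = [("p", ()), ("h", ("f",)), ("r", ("r",)), ("a", ("eI",)), ("s", ("z",)), ("e", ())]
    print(aggregate([a1, a2]))
```

In the usage example the two alternatives disagree only on how p and h share the phoneme f, so they are merged into ph:f, while s:z and the null link for e stay separate because both alternatives agree on them.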

70 Each cell Q_{t,v} contains a list of n-best scores that correspond to alternative alignments during the forward pass. [sent-286, score-0.544]

71 Algorithm 2: Extracting n-best alignments. Input: x, y, δ. Output: Q_{T,V}. Line 1: T = |x| + 1, V = |y| + 1. Line 2: for t = 1 [...] maxY s.t. v − j [...]; for q ∈ Q_{t−1,v−j} do. Line 9: append q · δ(x_t, y_{v−j+1}^{v}) to Q_{t,v}. Line 10: sort Q_{t,v}. Line 11: Q_{t,v} = Q_{t,v}[1 : n]. [sent-293, score-0.314]

72 In line 9, we consider all possible 1-M links between letter x_t and phoneme substring y_{v−j+1}^{v}. [sent-294, score-0.757]

73 However, in order to further restrict the set of high-quality alignments, we also discard the alignments with scores below threshold R with respect to the best alignment score. [sent-297, score-0.664]
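
The n-best extraction and the threshold filter can be sketched as below. The code follows the steps that are recoverable from the text (extend the scores of a predecessor cell, sort, keep the top n), but it uses 0-based indices, treats a link to zero phonemes as a null link, and interprets the threshold R as a ratio to the best score; all three are assumptions, as are the function and parameter names.

```python
def n_best_scores(x, y, delta, n=3, max_y=2):
    """Return up to n best alignment scores using 1-0, 1-1, ..., 1-max_y links."""
    T, V = len(x), len(y)
    Q = [[[] for _ in range(V + 1)] for _ in range(T + 1)]
    Q[0][0] = [1.0]
    for t in range(1, T + 1):
        for v in range(V + 1):
            cell = []
            for j in range(0, min(max_y, v) + 1):  # j == 0: letter linked to null
                p = delta.get((x[t - 1], tuple(y[v - j:v])), 0.0)
                if p > 0.0:
                    # Extend every score kept in the predecessor cell.
                    cell.extend(q * p for q in Q[t - 1][v - j])
            cell.sort(reverse=True)
            Q[t][v] = cell[:n]  # keep only the n best scores
    return Q[T][V]


def prune(scores, R):
    """Drop scores that fall below R times the best score (assumed meaning of R)."""
    if not scores:
        return []
    best = max(scores)
    return [s for s in scores if s >= R * best]


if __name__ == "__main__":
    delta = {("p", ("f",)): 0.6, ("h", ()): 0.5, ("p", ()): 0.2, ("h", ("f",)): 0.5}
    scores = n_best_scores("ph", ["f"], delta)
    print(scores, prune(scores, 0.5))
```

Recovering the alignments themselves, rather than only their scores, would additionally require storing a back-pointer with each score; that bookkeeping is omitted here to keep the sketch short.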

74 8 Intrinsic evaluation For the intrinsic evaluation, we compared the generated alignments to gold standard alignments extracted from the core vocabulary of the Combilex data set (Richmond et al. [sent-300, score-0.647]

75 Each alignment approach creates alignments from unaligned word-phoneme pairs in an unsupervised fashion. [sent-305, score-0.7]

76 We report the alignment quality in terms of precision, recall and F-score. [sent-307, score-0.422]

77 However, it is possible to obtain the perfect precision because we count as correct all 1-1 links that are consistent with the M-M links in the gold standard. [sent-310, score-0.374]
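
A link-level precision, recall and F-score computation can be sketched as follows. The exact counting scheme of the paper is not given in this summary, so the sketch makes assumptions: every link is expanded into (letter position, phoneme position) pairs, a predicted pair counts as correct if it occurs in the expansion of the gold alignment, and null links contribute no pairs. All names are hypothetical.

```python
def link_pairs(alignment):
    """Expand links into a set of (letter index, phoneme index) pairs."""
    pairs, i, j = set(), 0, 0
    for letters, phonemes in alignment:
        for a in range(len(letters)):
            for b in range(len(phonemes)):
                pairs.add((i + a, j + b))
        i += len(letters)
        j += len(phonemes)
    return pairs


def precision_recall_f(predicted, gold):
    """Compare a predicted alignment of one word against its gold alignment."""
    p, g = link_pairs(predicted), link_pairs(gold)
    correct = len(p & g)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


if __name__ == "__main__":
    gold = [("ph", ("f",)), ("r", ("r",))]               # many-to-many gold link ph:f
    pred = [("p", ("f",)), ("h", ()), ("r", ("r",))]     # 1-1 and 1-0 links only
    print(precision_recall_f(pred, gold))
```

Under these assumptions a 1-1 alignment that is consistent with the gold M-M links reaches perfect precision but loses recall, because it cannot cover every pair inside a multi-letter gold link.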

78 Alignment entropy is a measure of alignment quality proposed by Pervouchine et al. [sent-313, score-0.476]

79 The entropy indicates the uncertainty of mapping between letter l and phoneme p resulting from the alignment. We compute the alignment entropy for each of the methods using the following formula: $H = -\sum_{l,p} P(l,p) \log P(l \mid p)$ (7). Table 1 includes the results of the intrinsic evaluation. [sent-315, score-1.163]
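
The entropy formula can be evaluated directly from the links of an aligned corpus. This is a minimal sketch, assuming that P(l, p) and P(l | p) are estimated by relative frequencies of the observed links and that the logarithm is taken to base 2; neither choice is stated in this summary, and the function name is hypothetical.

```python
import math
from collections import Counter


def alignment_entropy(links, base=2.0):
    """Compute H = -sum over (l, p) of P(l, p) * log P(l | p) from a list of links."""
    joint = Counter(links)                      # counts of (letter unit, phoneme unit)
    per_phoneme = Counter(p for _, p in links)  # counts of each phoneme unit
    total = sum(joint.values())
    return -sum((count / total) * math.log(count / per_phoneme[p], base)
                for (_, p), count in joint.items())


if __name__ == "__main__":
    # Two letters competing for the same phoneme give one bit of uncertainty.
    print(alignment_entropy([("c", "k"), ("k", "k")]))
```

A corpus in which every phoneme unit is always linked to the same letter unit yields an entropy of zero under this estimate.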

80 The baseline BaseEM is an implementation of the one-to-one alignment method of (Black et al. [sent-317, score-0.392]

81 IP-align is the alignment generated by the IP formulation from Section 6. [sent-324, score-0.392]

82 EM-Aggr is our final many-to-many alignment method described in Section 7. [sent-327, score-0.392]

83 9 Extrinsic evaluation In order to investigate the relationship between the alignment quality and L2P performance, we feed the alignments to two different L2P systems. [sent-416, score-0.694]

84 We observe that although better alignment quality does not always translate into better L2P accuracy, there is nevertheless a strong correlation between the two, especially for the weaker phoneme generation system. [sent-431, score-0.729]

85 However, there is no reason to claim that the gold standard alignments are optimal for the L2P generation task, so that result should not be considered as an upper bound. [sent-433, score-0.338]

86 Finally, we note that alignment entropy seems to match the L2P accuracy better than it matches alignment quality. [sent-434, score-0.88]

87 The TiMBL L2P generation method (Table 2) is applicable only to the 1-1 alignment models. [sent-438, score-0.422]

88 In general, the 1-M-EM method achieves the best results among the 1-1 alignment methods. Overall, EM-Aggr achieves the best word accuracy in comparison to other alignment methods, including the joint n-gram results, which are taken directly from the original paper of Bisani and Ney (2008). [sent-483, score-0.826]

89 Figure 4 contains a plot of alignment entropy values vs. [sent-485, score-0.446]

90 Each point represents an application of a particular alignment method to a different data set. [sent-487, score-0.429]

91 It appears that there is only weak correlation between alignment entropy and L2P accuracy. [sent-488, score-0.446]

92 So far, we have been unable to find either direct or indirect evidence that alignment entropy is a reliable measure of letter-phoneme alignment quality. [sent-489, score-0.923]

93 The phonetic alignment is recommended for languages with little or no training data. [sent-491, score-0.509]

94 The IP alignment requires no linguistic expertise and guarantees a minimal set of letter-phoneme mappings. [sent-493, score-0.392]

95 The alignment by aggregation advances the state-of-the-art results in L2P conversion. [sent-494, score-0.442]

96 We thoroughly evaluated the resulting alignments on several data sets by using them as input to two different L2P generation systems. [sent-495, score-0.302]

97 Finally, we employed an independently constructed lexicon to demonstrate the close relationship between alignment quality and L2P conversion accuracy. [sent-496, score-0.544]

98 One open question that we would like to investigate in the future is whether L2P conversion accuracy could be improved by treating letter-phoneme alignment links as latent variables, instead of committing to a single best alignment. [sent-497, score-0.693]

99 Aligning text and phonemes for speech technology applications using an EM-like algorithm. [sent-528, score-0.332]

100 A new algorithm for the alignment of phonetic sequences. [sent-549, score-0.542]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('alignment', 0.392), ('phonemes', 0.298), ('letter', 0.28), ('phoneme', 0.277), ('alignments', 0.272), ('mappings', 0.193), ('jiampojamarn', 0.19), ('aline', 0.185), ('links', 0.169), ('letters', 0.136), ('ip', 0.134), ('combilex', 0.127), ('phonetic', 0.117), ('pronunciation', 0.115), ('em', 0.11), ('marchand', 0.106), ('xtt', 0.093), ('conversion', 0.09), ('bisani', 0.085), ('damper', 0.085), ('letterphoneme', 0.085), ('ipa', 0.074), ('black', 0.071), ('lik', 0.068), ('intrinsic', 0.067), ('pjk', 0.063), ('sittichai', 0.063), ('yvv', 0.063), ('variables', 0.061), ('extrinsic', 0.06), ('grzegorz', 0.06), ('daelemans', 0.057), ('link', 0.056), ('nettalk', 0.056), ('maxy', 0.056), ('pervouchine', 0.056), ('latin', 0.055), ('entropy', 0.054), ('allowable', 0.053), ('bosch', 0.053), ('constraint', 0.052), ('maxx', 0.051), ('aggregation', 0.05), ('timbl', 0.048), ('yannick', 0.048), ('null', 0.043), ('cmudict', 0.042), ('engelbrecht', 0.042), ('qd', 0.042), ('qpp', 0.042), ('schroeter', 0.042), ('seedmap', 0.042), ('accuracy', 0.042), ('synthesis', 0.041), ('antal', 0.041), ('mapping', 0.039), ('rightmost', 0.038), ('ney', 0.037), ('satisfaction', 0.037), ('ph', 0.037), ('canisius', 0.037), ('sejnowski', 0.037), ('grapheme', 0.037), ('resent', 0.037), ('global', 0.036), ('unaligned', 0.036), ('gold', 0.036), ('constraints', 0.035), ('activating', 0.034), ('speech', 0.034), ('inducing', 0.034), ('analogy', 0.033), ('exceeds', 0.033), ('algorithm', 0.033), ('discriminative', 0.033), ('lexicon', 0.032), ('programming', 0.032), ('pure', 0.032), ('robert', 0.032), ('entry', 0.032), ('xt', 0.031), ('den', 0.031), ('generation', 0.03), ('linked', 0.03), ('alberta', 0.03), ('kondrak', 0.03), ('richmond', 0.03), ('quality', 0.03), ('lists', 0.03), ('network', 0.029), ('includes', 0.029), ('phonetics', 0.029), ('decoder', 0.028), ('demberg', 0.028), ('refining', 0.028), ('zens', 0.027), ('alternative', 0.027), ('probability', 0.027), ('merging', 0.027), ('rf', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 170 acl-2010-Letter-Phoneme Alignment: An Exploration

Author: Sittichai Jiampojamarn ; Grzegorz Kondrak

Abstract: Letter-phoneme alignment is usually generated by a straightforward application of the EM algorithm. We explore several alternative alignment methods that employ phonetics, integer programming, and sets of constraints, and propose a novel approach of refining the EM alignment by aggregation of best alignments. We perform both intrinsic and extrinsic evaluation of the assortment of methods. We show that our proposed EM-Aggregation algorithm leads to the improvement of the state of the art in letter-to-phoneme conversion on several different data sets.

2 0.32651657 24 acl-2010-Active Learning-Based Elicitation for Semi-Supervised Word Alignment

Author: Vamshi Ambati ; Stephan Vogel ; Jaime Carbonell

Abstract: Semi-supervised word alignment aims to improve the accuracy of automatic word alignment by incorporating full or partial manual alignments. Motivated by standard active learning query sampling frameworks like uncertainty-, margin- and query-by-committee sampling, we propose multiple query strategies for the alignment link selection task. Our experiments show that by active selection of uncertain and informative links, we reduce the overall manual effort involved in elicitation of alignment link data for training a semi-supervised word aligner.

3 0.32423636 133 acl-2010-Hierarchical Search for Word Alignment

Author: Jason Riesa ; Daniel Marcu

Abstract: We present a simple yet powerful hierarchical search algorithm for automatic word alignment. Our algorithm induces a forest of alignments from which we can efficiently extract a ranked k-best list. We score a given alignment within the forest with a flexible, linear discriminative model incorporating hundreds of features, and trained on a relatively small amount of annotated data. We report results on Arabic-English word alignment and translation tasks. Our model outperforms a GIZA++ Model-4 baseline by 6.3 points in F-measure, yielding a 1.1 BLEU score increase over a state-of-the-art syntax-based machine translation system.

4 0.28319296 90 acl-2010-Diversify and Combine: Improving Word Alignment for Machine Translation on Low-Resource Languages

Author: Bing Xiang ; Yonggang Deng ; Bowen Zhou

Abstract: We present a novel method to improve word alignment quality and eventually the translation performance by producing and combining complementary word alignments for low-resource languages. Instead of focusing on the improvement of a single set of word alignments, we generate multiple sets of diversified alignments based on different motivations, such as linguistic knowledge, morphology and heuristics. We demonstrate this approach on an English-to-Pashto translation task by combining the alignments obtained from syntactic reordering, stemming, and partial words. The combined alignment outperforms the baseline alignment, with significantly higher F-scores and better translation performance.

5 0.24555434 87 acl-2010-Discriminative Modeling of Extraction Sets for Machine Translation

Author: John DeNero ; Dan Klein

Abstract: We present a discriminative model that directly predicts which set of phrasal translation rules should be extracted from a sentence pair. Our model scores extraction sets: nested collections of all the overlapping phrase pairs consistent with an underlying word alignment. Extraction set models provide two principal advantages over word-factored alignment models. First, we can incorporate features on phrase pairs, in addition to word links. Second, we can optimize for an extraction-based loss function that relates directly to the end task of generating translations. Our model gives improvements in alignment quality relative to state-of-the-art unsupervised and supervised baselines, as well as providing up to a 1.4 improvement in BLEU score in Chinese-to-English translation experiments.

6 0.18582387 240 acl-2010-Training Phrase Translation Models with Leaving-One-Out

7 0.17704648 147 acl-2010-Improving Statistical Machine Translation with Monolingual Collocation

8 0.16538343 262 acl-2010-Word Alignment with Synonym Regularization

9 0.1424554 88 acl-2010-Discriminative Pruning for Discriminative ITG Alignment

10 0.13418648 110 acl-2010-Exploring Syntactic Structural Features for Sub-Tree Alignment Using Bilingual Tree Kernels

11 0.11101656 116 acl-2010-Finding Cognate Groups Using Phylogenies

12 0.10168891 55 acl-2010-Bootstrapping Semantic Analyzers from Non-Contradictory Texts

13 0.098520793 172 acl-2010-Minimized Models and Grammar-Informed Initialization for Supertagging with Highly Ambiguous Lexicons

14 0.077482417 29 acl-2010-An Exact A* Method for Deciphering Letter-Substitution Ciphers

15 0.077282421 46 acl-2010-Bayesian Synchronous Tree-Substitution Grammar Induction and Its Application to Sentence Compression

16 0.077133618 201 acl-2010-Pseudo-Word for Phrase-Based Machine Translation

17 0.074681133 233 acl-2010-The Same-Head Heuristic for Coreference

18 0.072435401 246 acl-2010-Unsupervised Discourse Segmentation of Documents with Inherently Parallel Structure

19 0.071928553 265 acl-2010-cdec: A Decoder, Alignment, and Learning Framework for Finite-State and Context-Free Translation Models

20 0.071854457 180 acl-2010-On Jointly Recognizing and Aligning Bilingual Named Entities


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.228), (1, -0.249), (2, -0.063), (3, -0.046), (4, 0.096), (5, 0.072), (6, -0.163), (7, 0.107), (8, 0.233), (9, -0.16), (10, -0.178), (11, -0.086), (12, -0.197), (13, 0.067), (14, -0.107), (15, -0.03), (16, 0.028), (17, 0.015), (18, -0.001), (19, -0.015), (20, 0.037), (21, 0.021), (22, -0.002), (23, -0.078), (24, 0.052), (25, -0.046), (26, -0.01), (27, -0.073), (28, 0.01), (29, 0.043), (30, -0.079), (31, 0.0), (32, 0.044), (33, -0.059), (34, -0.069), (35, -0.1), (36, -0.03), (37, 0.041), (38, 0.01), (39, -0.024), (40, -0.094), (41, -0.056), (42, -0.018), (43, -0.008), (44, -0.048), (45, -0.027), (46, 0.069), (47, -0.043), (48, 0.019), (49, 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97727519 170 acl-2010-Letter-Phoneme Alignment: An Exploration

Author: Sittichai Jiampojamarn ; Grzegorz Kondrak

Abstract: Letter-phoneme alignment is usually generated by a straightforward application of the EM algorithm. We explore several alternative alignment methods that employ phonetics, integer programming, and sets of constraints, and propose a novel approach of refining the EM alignment by aggregation of best alignments. We perform both intrinsic and extrinsic evaluation of the assortment of methods. We show that our proposed EM-Aggregation algorithm leads to the improvement of the state of the art in letter-to-phoneme conversion on several different data sets.

2 0.82263809 24 acl-2010-Active Learning-Based Elicitation for Semi-Supervised Word Alignment

Author: Vamshi Ambati ; Stephan Vogel ; Jaime Carbonell

Abstract: Semi-supervised word alignment aims to improve the accuracy of automatic word alignment by incorporating full or partial manual alignments. Motivated by standard active learning query sampling frameworks like uncertainty-, margin- and query-by-committee sampling, we propose multiple query strategies for the alignment link selection task. Our experiments show that by active selection of uncertain and informative links, we reduce the overall manual effort involved in elicitation of alignment link data for training a semi-supervised word aligner.

3 0.81934088 90 acl-2010-Diversify and Combine: Improving Word Alignment for Machine Translation on Low-Resource Languages

Author: Bing Xiang ; Yonggang Deng ; Bowen Zhou

Abstract: We present a novel method to improve word alignment quality and eventually the translation performance by producing and combining complementary word alignments for low-resource languages. Instead of focusing on the improvement of a single set of word alignments, we generate multiple sets of diversified alignments based on different motivations, such as linguistic knowledge, morphology and heuristics. We demonstrate this approach on an English-to-Pashto translation task by combining the alignments obtained from syntactic reordering, stemming, and partial words. The combined alignment outperforms the baseline alignment, with significantly higher F-scores and better translation performance.

4 0.78557712 133 acl-2010-Hierarchical Search for Word Alignment

Author: Jason Riesa ; Daniel Marcu

Abstract: We present a simple yet powerful hierarchical search algorithm for automatic word alignment. Our algorithm induces a forest of alignments from which we can efficiently extract a ranked k-best list. We score a given alignment within the forest with a flexible, linear discriminative model incorporating hundreds of features, and trained on a relatively small amount of annotated data. We report results on Arabic-English word alignment and translation tasks. Our model outperforms a GIZA++ Model-4 baseline by 6.3 points in F-measure, yielding a 1.1 BLEU score increase over a state-of-the-art syntax-based machine translation system.

5 0.75421584 87 acl-2010-Discriminative Modeling of Extraction Sets for Machine Translation

Author: John DeNero ; Dan Klein

Abstract: We present a discriminative model that directly predicts which set of phrasal translation rules should be extracted from a sentence pair. Our model scores extraction sets: nested collections of all the overlapping phrase pairs consistent with an underlying word alignment. Extraction set models provide two principal advantages over word-factored alignment models. First, we can incorporate features on phrase pairs, in addition to word links. Second, we can optimize for an extraction-based loss function that relates directly to the end task of generating translations. Our model gives improvements in alignment quality relative to state-of-the-art unsupervised and supervised baselines, as well as providing up to a 1.4 improvement in BLEU score in Chinese-to-English translation experiments.

6 0.71116966 88 acl-2010-Discriminative Pruning for Discriminative ITG Alignment

7 0.70137352 262 acl-2010-Word Alignment with Synonym Regularization

8 0.55946136 147 acl-2010-Improving Statistical Machine Translation with Monolingual Collocation

9 0.55205512 116 acl-2010-Finding Cognate Groups Using Phylogenies

10 0.47747505 240 acl-2010-Training Phrase Translation Models with Leaving-One-Out

11 0.47116563 110 acl-2010-Exploring Syntactic Structural Features for Sub-Tree Alignment Using Bilingual Tree Kernels

12 0.45972547 29 acl-2010-An Exact A* Method for Deciphering Letter-Substitution Ciphers

13 0.44944143 55 acl-2010-Bootstrapping Semantic Analyzers from Non-Contradictory Texts

14 0.4137592 16 acl-2010-A Statistical Model for Lost Language Decipherment

15 0.40853417 201 acl-2010-Pseudo-Word for Phrase-Based Machine Translation

16 0.40533009 135 acl-2010-Hindi-to-Urdu Machine Translation through Transliteration

17 0.38370547 100 acl-2010-Enhanced Word Decomposition by Calibrating the Decision Threshold of Probabilistic Models and Using a Model Ensemble

18 0.38226578 180 acl-2010-On Jointly Recognizing and Aligning Bilingual Named Entities

19 0.36873227 246 acl-2010-Unsupervised Discourse Segmentation of Documents with Inherently Parallel Structure

20 0.35226026 68 acl-2010-Conditional Random Fields for Word Hyphenation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.011), (14, 0.026), (20, 0.244), (25, 0.046), (39, 0.019), (42, 0.028), (44, 0.018), (59, 0.099), (73, 0.083), (78, 0.032), (83, 0.065), (84, 0.032), (98, 0.175)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.82642305 170 acl-2010-Letter-Phoneme Alignment: An Exploration

Author: Sittichai Jiampojamarn ; Grzegorz Kondrak

Abstract: Letter-phoneme alignment is usually generated by a straightforward application of the EM algorithm. We explore several alternative alignment methods that employ phonetics, integer programming, and sets of constraints, and propose a novel approach of refining the EM alignment by aggregation of best alignments. We perform both intrinsic and extrinsic evaluation of the assortment of methods. We show that our proposed EM-Aggregation algorithm leads to the improvement of the state of the art in letter-to-phoneme conversion on several different data sets.

2 0.76790822 48 acl-2010-Better Filtration and Augmentation for Hierarchical Phrase-Based Translation Rules

Author: Zhiyang Wang ; Yajuan Lv ; Qun Liu ; Young-Sook Hwang

Abstract: This paper presents a novel filtration criterion to restrict the rule extraction for the hierarchical phrase-based translation model, where a bilingual but relaxed well-formed dependency restriction is used to filter out bad rules. Furthermore, a new feature which describes the regularity that the source/target dependency edge triggers the target/source word is also proposed. Experimental results show that the new criterion weeds out about 40% of the rules while improving translation performance, and the new feature brings another improvement to the baseline system, especially on a larger corpus.

3 0.66588157 79 acl-2010-Cross-Lingual Latent Topic Extraction

Author: Duo Zhang ; Qiaozhu Mei ; ChengXiang Zhai

Abstract: Probabilistic latent topic models have recently enjoyed much success in extracting and analyzing latent topics in text in an unsupervised way. One common deficiency of existing topic models, though, is that they would not work well for extracting cross-lingual latent topics simply because words in different languages generally do not co-occur with each other. In this paper, we propose a way to incorporate a bilingual dictionary into a probabilistic topic model so that we can apply topic models to extract shared latent topics in text data of different languages. Specifically, we propose a new topic model called Probabilistic Cross-Lingual Latent Semantic Analysis (PCLSA) which extends the Probabilistic Latent Semantic Analysis (PLSA) model by regularizing its likelihood function with soft constraints defined based on a bilingual dictionary. Both qualitative and quantitative experimental results show that the PCLSA model can effectively extract cross-lingual latent topics from multilingual text data.

4 0.66423607 146 acl-2010-Improving Chinese Semantic Role Labeling with Rich Syntactic Features

Author: Weiwei Sun

Abstract: Developing features has been shown to be crucial to advancing the state of the art in Semantic Role Labeling (SRL). To improve Chinese SRL, we propose a set of additional features, some of which are designed to better capture structural information. Our system achieves 93.49 F-measure, a significant improvement over the best reported performance 92.0. We are further concerned with the effect of parsing in Chinese SRL. We empirically analyze the two-fold effect, grouping words into constituents and providing syntactic information. We also give some preliminary linguistic explanations.

5 0.66395152 83 acl-2010-Dependency Parsing and Projection Based on Word-Pair Classification

Author: Wenbin Jiang ; Qun Liu

Abstract: In this paper we describe an intuitionistic method for dependency parsing, where a classifier is used to determine whether a pair of words forms a dependency edge. And we also propose an effective strategy for dependency projection, where the dependency relationships of the word pairs in the source language are projected to the word pairs of the target language, leading to a set of classification instances rather than a complete tree. Experiments show that the classifier trained on the projected classification instances significantly outperforms previous projected dependency parsers. More importantly, when this classifier is integrated into a maximum spanning tree (MST) dependency parser, obvious improvement is obtained over the MST baseline.

6 0.66222346 52 acl-2010-Bitext Dependency Parsing with Bilingual Subtree Constraints

7 0.6621505 133 acl-2010-Hierarchical Search for Word Alignment

8 0.66190457 202 acl-2010-Reading between the Lines: Learning to Map High-Level Instructions to Commands

9 0.66182077 55 acl-2010-Bootstrapping Semantic Analyzers from Non-Contradictory Texts

10 0.66142547 140 acl-2010-Identifying Non-Explicit Citing Sentences for Citation-Based Summarization.

11 0.66061503 100 acl-2010-Enhanced Word Decomposition by Calibrating the Decision Threshold of Probabilistic Models and Using a Model Ensemble

12 0.6601094 116 acl-2010-Finding Cognate Groups Using Phylogenies

13 0.66007984 218 acl-2010-Structural Semantic Relatedness: A Knowledge-Based Method to Named Entity Disambiguation

14 0.66001028 87 acl-2010-Discriminative Modeling of Extraction Sets for Machine Translation

15 0.65982991 154 acl-2010-Jointly Optimizing a Two-Step Conditional Random Field Model for Machine Transliteration and Its Fast Decoding Algorithm

16 0.65969849 3 acl-2010-A Bayesian Method for Robust Estimation of Distributional Similarities

17 0.65964258 22 acl-2010-A Unified Graph Model for Sentence-Based Opinion Retrieval

18 0.65963697 102 acl-2010-Error Detection for Statistical Machine Translation Using Linguistic Features

19 0.65961581 232 acl-2010-The S-Space Package: An Open Source Package for Word Space Models

20 0.65928888 15 acl-2010-A Semi-Supervised Key Phrase Extraction Approach: Learning from Title Phrases through a Document Semantic Network