acl acl2010 acl2010-135 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Nadir Durrani ; Hassan Sajjad ; Alexander Fraser ; Helmut Schmid
Abstract: We present a novel approach to integrate transliteration into Hindi-to-Urdu statistical machine translation. We propose two probabilistic models, based on conditional and joint probability formulations, that are novel solutions to the problem. Our models consider both transliteration and translation when translating a particular Hindi word given the context whereas in previous work transliteration is only used for translating OOV (out-of-vocabulary) words. We use transliteration as a tool for disambiguation of Hindi homonyms which can be both translated or transliterated or transliterated differently based on different contexts. We obtain final BLEU scores of 19.35 (conditional probability model) and 19.00 (joint probability model) as compared to 14.30 for a baseline phrase-based system and 16.25 for a system which transliterates OOV words in the baseline system. This indicates that transliteration is useful for more than only translating OOV words for language pairs like Hindi-Urdu.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We present a novel approach to integrate transliteration into Hindi-to-Urdu statistical machine translation. [sent-3, score-0.447]
2 Our models consider both transliteration and translation when translating a particular Hindi word given the context whereas in previous work transliteration is only used for translating OOV (out-of-vocabulary) words. [sent-5, score-1.039]
3 We use transliteration as a tool for disambiguation of Hindi homonyms which can be both translated or transliterated or transliterated differently based on different contexts. [sent-6, score-0.601]
4 This indicates that transliteration is useful for more than only translating OOV words for language pairs like Hindi-Urdu. [sent-12, score-0.476]
5 This provides a strong motivation to implement an end-to-end translation system which strongly relies on high quality transliteration from Hindi to Urdu. [sent-25, score-0.534]
6 Hindi and Urdu have similar sound systems but transliteration from Hindi to Urdu is still very hard because some phonemes in Hindi have several orthographic equivalents in Urdu. [sent-26, score-0.447]
7 In such cases we hope to choose the correct transliteration by using context. [sent-49, score-0.447]
8 Sometimes there is also an ambiguity of whether to translate or transliterate a particular word. [sent-51, score-0.104]
9 We try to model whether to translate or transliterate in a given situation. [sent-53, score-0.133]
10 Section 3 introduces two probabilistic models for integrating translations and transliterations into a translation model which are based on conditional and joint probability distributions. [sent-57, score-0.542]
11 We remedy these problems by adding heuristics and modifications to our models, which improve the results as discussed in section 6. [sent-61, score-0.09]
12 Section 7 gives two examples illustrating how our model decides whether to translate or transliterate and how it is able to choose among different valid transliterations given the context. [sent-62, score-0.446]
13 The first group is generic transliteration work, which is evaluated outside of the context of translation. [sent-66, score-0.447]
14 addresses Hindi to Urdu transliteration using hand-crafted rules and a phonemic representation; it ignores translation context. [sent-72, score-0.534]
15 Al-Onaizan and Knight (2002) transliterate Arabic NEs into English and score them against their respective translations using a modified IBM Model 1. [sent-74, score-0.112]
16 An efficient way to compute and re-rank the transliterations of NEs and integrate them on the fly might be possible. [sent-77, score-0.313]
17 However, this is not practical in our case as our model considers transliterations of all input words and not just NEs. [sent-78, score-0.342]
18 A log-linear block transliteration model is applied to OOV NEs in Arabic to English SMT by Zhao et al. [sent-79, score-0.476]
19 (2007) integrates translations provided by external sources such as transliteration or rule-based translation of numbers and dates, for an arbitrary number of entries within the input text. [sent-83, score-0.57]
20 (2007) in that our model compares transliterations with translations on the fly whereas transliterations in Kashani et al. [sent-85, score-0.691]
21 (2008) use a tagger to identify good candidates for transliteration (which are mostly NEs) in input text and add transliterations to the SMT phrase table dynamically such that they can directly compete with translations during decoding. [sent-89, score-0.815]
22 This is closer to our approach except that we use transliteration as an alternative to translation for all Hindi words. [sent-90, score-0.534]
23 Moreover, they are working with a large bitext so they can rely on their translation model and only need to transliterate NEs and OOVs. [sent-92, score-0.192]
24 Our translation model is based on data which is both sparse and noisy. [sent-93, score-0.116]
25 Therefore we pit transliterations against translations for every input word. [sent-94, score-0.349]
26 This work also uses transliteration only for the translation of unknown words. [sent-96, score-0.569]
27 The third group uses transliteration models inside of a cross-lingual IR system (AbdulJaleel and Larkey, 2003; Virga and Khudanpur, 2003; Pirkola et al. [sent-98, score-0.447]
28 Picking a single best transliteration or translation in context is not important in an IR system. [sent-100, score-0.534]
29 3 Our Approach Both of our models combine a character-based transliteration model with a word-based translation model. [sent-102, score-0.563]
30 1 Model-1: Conditional Probability Model Applying a noisy channel model to compute the most probable translation $\hat{u}_1^n$, we get: $\hat{u}_1^n = \operatorname{argmax}_{u_1^n} p(u_1^n \mid h_1^n) = \operatorname{argmax}_{u_1^n} p(u_1^n)\, p(h_1^n \mid u_1^n)$ (1) [sent-110, score-0.116]
31 For a multi-word ui we do multiple language model look-ups, one for each uix in ui = ui1 ... uiX. [sent-119, score-0.781]
32 Language Model for Unknown Words: Our model generates transliterations that can be known or unknown to the language model and the translation model. [sent-123, score-0.493]
33 We refer to the words known to the language model and to the translation model as LM-known and TM-known words respectively and to words that are unknown as LM-unknown and TM-unknown respectively. [sent-124, score-0.18]
34 If one or more uix in a multi-word ui are LM-unknown, we assign a language model score pLM(ui|ui−1) = ψ for the entire ui, meaning that we consider partially known transliterations to be as bad as fully unknown transliterations. [sent-126, score-0.764]
35 It does not influence translation options because they are always LM-known in our case. [sent-128, score-0.12]
36 This is because our monolingual corpus also contains the Urdu part of the translation corpus. [sent-129, score-0.105]
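A minimal sketch of this floor rule, assuming a hypothetical bigram scorer lm_prob and a set lm_vocab of LM-known tokens; PSI stands in for ψ:

```python
PSI = 1e-40  # floor probability for LM-unknown transliterations (see section 5.1)

def lm_score_multiword(ui_tokens, prev_token, lm_prob, lm_vocab):
    """Score a (possibly multi-word) transliteration u_i = u_i1 ... u_iX.

    If any constituent u_ix is LM-unknown, the entire u_i receives the
    floor score PSI, i.e. partially known transliterations are treated
    as being as bad as fully unknown ones.
    """
    if any(tok not in lm_vocab for tok in ui_tokens):
        return PSI
    score = 1.0
    for tok in ui_tokens:
        score *= lm_prob(tok, prev_token)  # one look-up per u_ix
        prev_token = tok
    return score
```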
37 2 Translation Model The translation model (TM) p(h1n|u1n) is approximated with a context-independent model: $p(h_1^n \mid u_1^n) = \prod_{i=1}^{n} p(h_i \mid u_i)$ (3) where hi and ui are Hindi and Urdu tokens respectively. [sent-137, score-0.632]
38 Our model estimates the conditional probability p(hi|ui) by interpolating a word-based model and a character-based (transliteration) model. [sent-138, score-0.164]
39 $p(h_i \mid u_i) = \lambda\, p_w(h_i \mid u_i) + (1 - \lambda)\, p_c(h_i \mid u_i)$ (4) The parameters of the word-based translation model pw(h|u) are estimated from the word alignments of a small parallel corpus. [sent-139, score-0.348]
40 We only retain 1-1/1-N (1 Hindi word, 1 or more Urdu words) alignments and throw away N-1 and M-N alignments for our models. [sent-140, score-0.146]
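Equation (4) itself is a one-liner; the sketch below assumes hypothetical p_word and p_char callables for the two component models, and a placeholder value for λ (the paper tunes it on held-out data):

```python
def p_translate(h, u, p_word, p_char, lam=0.5):
    """Interpolated translation probability of equation (4):
    p(h|u) = lambda * p_w(h|u) + (1 - lambda) * p_c(h|u)."""
    return lam * p_word(h, u) + (1.0 - lam) * p_char(h, u)
```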
41 The character-based transliteration model pc(h|u) is computed in terms of pc(h, u), a joint character model, which is also used for Chinese-English back-transliteration (Li et al. [sent-144, score-0.496]
42 The character-based transliteration probability is defined as follows: $p_c(h, u) = \sum_{a_1^n \in \mathrm{align}(h,u)} p(a_1^n) = \sum_{a_1^n \in \mathrm{align}(h,u)} \prod_{i=1}^{n} p(a_i \mid a_{i-k}^{i-1})$ (5) where ai is a pair consisting of the i-th Hindi character hi and the sequence of 0 or more Urdu characters that it is aligned with. [sent-147, score-0.683]
43 The parameters p(ai |ai−1) are estimated from a small transliteration corpus which we automatically extracted from the translation corpus. [sent-152, score-0.575]
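A sketch of equation (5), assuming the candidate alignments align(h, u) are given as lists of (Hindi character, Urdu character sequence) units and p_unit is a hypothetical n-gram model over such units:

```python
def p_c_joint(alignments, p_unit, k=2):
    """Joint character probability p_c(h, u) of equation (5): sum over
    candidate alignments of the product of n-gram unit probabilities
    p(a_i | a_{i-k} ... a_{i-1})."""
    total = 0.0
    for a in alignments:                 # a = [(h_char, u_chars), ...]
        prob = 1.0
        for i, unit in enumerate(a):
            history = tuple(a[max(0, i - k):i])
            prob *= p_unit(unit, history)
        total += prob
    return total
```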
44 Again, the translation model p(h1n|u1n) is approximated with a context-independent model: $p(h_1^n \mid u_1^n) = \prod_{i=1}^{n} p(h_i \mid u_i) = \prod_{i=1}^{n} \frac{p(h_i, u_i)}{p(u_i)}$ (10) The joint probability p(hi, ui) of a Hindi and an Urdu word is estimated by interpolating a word-based model and a character-based model. [sent-162, score-0.264]
45 The character-based transliteration probability pc(hi, ui) and the character-based prior probability pc(ui) are defined by (5) and (7) respectively in the previous section. [sent-164, score-0.533]
46 It searches for an Urdu string that maximizes the product of translation probability and the language model probability (equation 1) by translating one Hindi word at a time. [sent-168, score-0.213]
47 At the lower level, it computes n-best transliterations for each Hindi word hi according to pc(h, u). [sent-170, score-0.464]
48 The joint probabilities given by pc(h, u) are marginalized for each Urdu transliteration to give pc(h|u). [sent-171, score-0.485]
49 At the higher level, transliteration probabilities are interpolated with pw(h|u) and then multiplied with language model probabilities to give the probability of a hypothesis. [sent-172, score-0.664]
50 We use 20-best translations and 25-best transliterations for pw(h|u) and pc(h|u) respectively, and a 5-gram language model. [sent-173, score-0.467]
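Roughly, the per-word step of this two-level search can be sketched as follows; every argument is a hypothetical stand-in for the trained components, and the beam search over full sentence hypotheses is collapsed to a per-word argmax for brevity:

```python
def best_option(h, prev_u, translations, transliterations,
                p_w, p_c_joint, p_c_prior, lm, lam):
    """Pick the best Urdu option for one Hindi word h.

    Lower level: the joint probabilities p_c(h, u) of the 25-best
    transliterations are turned into conditional ones,
    p_c(h|u) = p_c(h, u) / p_c(u).
    Higher level: interpolate with p_w(h|u) (equation 4) and multiply
    with the language model probability (equation 1).
    """
    candidates = set(translations(h, n=20)) | set(transliterations(h, n=25))

    def score(u):
        p_c = p_c_joint(h, u) / p_c_prior(u)
        p_tm = lam * p_w(h, u) + (1.0 - lam) * p_c
        return p_tm * lm(u, prev_u)

    return max(candidates, key=score)
```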
51 We extracted a total of 107323 alignment pairs (5743 N-1 alignments, 8404 M-N alignments and 93176 1-1/1-N alignments). [sent-184, score-0.102]
52 Of these alignments, the M-N and N-1 alignment pairs were ignored. [sent-185, score-0.102]
53 For valid M-N alignments we observed that these could be broken into 1-1/1-N alignments in most of the cases. [sent-193, score-0.146]
54 We also observed that we usually have coverage of the resulting 1-1 and 1-N alignments in our translation corpus. [sent-194, score-0.16]
55 3 Transliteration Corpus The training corpus for transliteration is extracted from the 1-1/1-N word-alignments of the EMILLE corpus discussed in section 4. [sent-205, score-0.468]
56 We use an edit distance algorithm to align this training corpus at the character level, and we eliminate translation pairs with high edit distance, which are unlikely to be transliterations. [sent-208, score-0.121]
57 The mapping was further extended by looking into available Hindi-Urdu transliteration and other resources (Gupta, 2004; Malik et al. [sent-213, score-0.447]
58 A Hindi character that always maps to only one Urdu character is assigned a cost of 0, whereas Hindi characters that map to different Urdu characters are assigned a cost of 0. [sent-216, score-0.102]
59 Using this metric we filter out the word pairs with high edit-distance to extract our transliteration corpus. [sent-224, score-0.447]
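A sketch of this cost-sensitive filter: translit_edit_cost is a standard edit distance whose substitution costs come from the character mapping, and the dict sub_cost and the threshold value are hypothetical stand-ins for the paper's actual tables:

```python
def translit_edit_cost(h_word, u_word, sub_cost):
    """Character-level edit distance with pair-specific substitution
    costs: an unambiguous Hindi-Urdu character pair costs 0, ambiguous
    pairs a small positive amount, anything else 1.  sub_cost is a
    hypothetical dict mapping (hindi_char, urdu_char) to such costs."""
    m, n = len(h_word), len(u_word)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = sub_cost.get((h_word[i - 1], u_word[j - 1]), 1.0)
            d[i][j] = min(d[i - 1][j] + 1.0,      # delete a Hindi character
                          d[i][j - 1] + 1.0,      # insert an Urdu character
                          d[i - 1][j - 1] + sub)  # substitute
    return d[m][n]

def extract_translit_corpus(word_pairs, sub_cost, threshold=1.0):
    """Keep only word pairs cheap enough to be plausible
    transliterations; the threshold is a placeholder value."""
    return [(h, u) for h, u in word_pairs
            if translit_edit_cost(h, u, sub_cost) <= threshold]
```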
60 The resulting alignments are modified by merging unaligned ∅ → 1 (no character on the source side, 1 character on the target side) or ∅ → N alignments with the preceding alignment pair. [sent-226, score-0.209]
61 Our model retains 1 → ∅ and N → ∅ alignments as deletion operations. [sent-229, score-0.102]
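A small sketch of this post-processing step; the (hindi_char_or_None, urdu_chars) unit representation is a hypothetical one chosen for illustration:

```python
def merge_null_source(units):
    """Merge unaligned (None -> Urdu) units into the preceding
    alignment pair; (Hindi -> '') units are kept as deletions."""
    merged = []
    for h_char, u_chars in units:
        if h_char is None and merged:
            prev_h, prev_u = merged[-1]
            merged[-1] = (prev_h, prev_u + u_chars)  # attach to preceding pair
        else:
            merged.append((h_char, u_chars))
    return merged
```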
62 1 Parameter Optimization Our model contains two parameters: λ (the interpolating factor between the translation and transliteration modules) and ψ (the factor that controls the trade-off between LM-known and LM-unknown transliterations). [sent-252, score-0.624]
63 We chose a very low value, 1e−40, for the factor ψ initially, favoring LM-known transliterations very strongly. [sent-254, score-0.331]
64 Again, the language model is imple… 7It should be noted, though, that diacritics play a very important role when transliterating in the reverse direction, because these are virtually always written in Hindi as dependent vowels. [sent-270, score-0.148]
65 We also used two methods to incorporate transliterations in the phrase-based system: Post-process Pb1: all the OOV words in the phrase-based output are replaced with their top-candidate transliteration as given by our transliteration system. [sent-285, score-1.229]
66 Pre-process Pb2: instead of adding transliterations as a post-process, we do a second pass by adding the unknown words with their top-candidate transliteration to the training corpus and rerunning Koehn’s training script with the new training corpus. [sent-286, score-0.817]
67 The transliteration-aided phrase-based systems Pb1 and Pb2 are closer to our Model-2 results but well below the Model-1 results. [sent-291, score-0.447]
68 The difference of …35 BLEU points between M1 and Pb1 indicates that transliteration is useful for more than only translating OOV words for language pairs like Hindi-Urdu. [sent-293, score-0.476]
69 Our models choose between translations and transliterations based on context unlike the phrase-based systems Pb1 and Pb2 which use transliteration only as a tool to translate OOV words. [sent-294, score-0.824]
70 1 Heuristic-1 A lot of errors occur because our translation model is built on very sparse and noisy data. [sent-302, score-0.116]
71 The motivation for this heuristic is to counter wrong alignments, at least in the case of verbs and functional words (which are often transliterations). [sent-303, score-0.115]
72 This heuristic favors translations that also appear in the n-best transliteration list over only-translation and only-transliteration options. [sent-304, score-0.502]
73 We modify the translation model for both the conditional and the joint model by adding another factor which strongly weighs translation+transliteration options by taking the square root of the product of the translation and transliteration probabilities. [sent-305, score-0.773]
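A sketch of this heuristic; the exact way the extra factor is combined in the paper may differ, and p_word and p_char are hypothetical stand-ins for the two component models:

```python
import math

def p_translate_h1(h, u, p_word, p_char, lam):
    """Heuristic-1 sketch: options supported by both models get an
    extra factor, the square root of the product of the translation
    and transliteration probabilities, strongly favoring
    translation+transliteration options."""
    pw, pc = p_word(h, u), p_char(h, u)
    score = lam * pw + (1.0 - lam) * pc    # equation (4) baseline
    if pw > 0.0 and pc > 0.0:              # supported by both models
        score += math.sqrt(pw * pc)
    return score
```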
74 2 Heuristic-2 When an unknown Hindi word occurs for which all transliteration options are LM-unknown, then the best transliteration should be selected. [sent-310, score-0.962]
75 Hence our model selects the transliteration that has the best score, i.e. … [sent-312, score-0.476]
76 The language model probability of unknown words is uniform (and equal to ψ) whereas the translation model uses the nonuniform prior probability pc(ui) for these words. [sent-316, score-0.266]
77 There is another reason why we cannot use pc(hi, ui)/pc(ui). 13The translation coefficient λ1 is the same as the λ used in previous models, and the transliteration coefficient λ2 = 1 − λ. 14After optimization we normalize the lambdas to make their sum equal to 1. [sent-317, score-0.592]
78 The value of ψ is very small, because of which transliterations that are actually LM-unknown, but are mistakenly broken into constituents that are LM-known, will always be preferred over their counterparts. [sent-320, score-0.336]
79 An example of this is (America) for which two possible transliterations as given by our model are (AmerIkA, without space) and (AmerI kA, with space). [sent-321, score-0.342]
80 Space insertion is an important feature of our transliteration model. [sent-324, score-0.447]
81 The last line of the calculation shows that we simply drop pc(ui) if ui is LM-unknown and use the constant instead of ψ. [sent-329, score-0.365]
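A minimal sketch of this scoring rule; const is a placeholder for the shared constant used in the paper, and p_c_joint is a hypothetical stand-in for the joint character model:

```python
def best_lm_unknown_translit(h, candidates, p_c_joint, const=1.0):
    """Heuristic-2 sketch: when every transliteration of h is
    LM-unknown, the prior p_c(u) is dropped and a shared constant
    replaces psi, so candidates are ranked directly by the joint
    character probability p_c(h, u)."""
    return max(candidates, key=lambda u: const * p_c_joint(h, u))
```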
82 As a result of this, transliterations are sometimes incorrectly favored over their translation alternatives. [sent-345, score-0.4]
83 In order to remedy this problem we assign a minimal probability β to the word-based prior pw (ui) in case of TM-unknown transliterations, which prevents it from ever being zero. [sent-346, score-0.17]
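A sketch of this modification to Model-2, assuming hypothetical callables for the trained word-based and character-based components; the resulting probability is given in the formula below:

```python
def p_translate_m2_h3(h, u, p_w_joint, p_w_prior, p_c_joint, p_c_prior,
                      lam, beta, tm_vocab):
    """Model-2 translation probability with Heuristic-3: the word-based
    prior p_w(u) of a TM-unknown word is floored at beta so that the
    interpolation weight lambda no longer cancels out."""
    pw_u = p_w_prior(u) if u in tm_vocab else beta
    joint = lam * p_w_joint(h, u) + (1.0 - lam) * p_c_joint(h, u)
    prior = lam * pw_u + (1.0 - lam) * p_c_prior(u)
    return joint / prior
```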
84 Because of this addition the translation model probability for LM-unknown words becomes: $\frac{(1-\lambda)\, p_c(h_i, u_i)}{\lambda\beta + (1-\lambda)\, p_c(u_i)}$ where $\beta = \frac{1}{\text{number of Urdu types in the TM}}$. 6 Final Results This section shows the improvement in BLEU score by applying heuristics and combinations of heuristics in both models. [sent-347, score-0.256]
85 Tables 5 and 6 show the improvements achieved by using the different heuristics and modifications discussed in section 5. [sent-348, score-0.09]
86 We refer to the results as MxHy, where x denotes the model number (1 for the conditional probability model, 2 for the joint probability model) and y denotes a heuristic or a combination of heuristics applied to that model.15 [sent-349, score-0.27]
87 Using heuristic 2 we were able to properly score LM-unknown transliterations against each other. [sent-352, score-0.313]
88 Heuristic-3 remedies the flaw in M2 by assigning a special value to the word-based prior pw(ui) for TM-unknown words, which prevents the cancellation of the interpolation parameter λ. [sent-356, score-0.191]
89 We observed that, on data where the translators preferred to translate rather than transliterate, our system is sometimes penalized by BLEU even though our output string is a valid translation. [sent-367, score-0.475]
90 Our model correctly identifies which transliteration to choose given the context. [sent-373, score-0.476]
91 Our model successfully decides whether to translate or transliterate given the context. [sent-375, score-0.133]
92 8 Conclusion We have presented a novel way to integrate transliterations into machine translation. [sent-376, score-0.313]
93 First, transliteration helps overcome the problem of data sparsity and noisy alignments. [sent-384, score-0.447]
94 We are able to generate word translations that are unseen in the translation corpus but known to the language model. [sent-385, score-0.123]
95 Additionally, we can generate novel transliterations (that are LM-Unknown). [sent-386, score-0.313]
96 Second, generating multiple transliterations for homograph Hindi words and using language model context helps us solve the problem of disambiguation. [sent-387, score-0.342]
97 We found that the joint probability model performs almost as well as the conditional probability model but that it was more complex to make it work well. [sent-388, score-0.169]
98 Name translation in statistical machine translation - learning when to transliterate. [sent-412, score-0.174]
99 Integration of an Arabic transliteration module into a statistical machine translation system. [sent-422, score-0.534]
100 A log-linear block transliteration model based on bi-stream HMMs. [sent-491, score-0.476]
wordName wordTfidf (topN-words)
[('urdu', 0.459), ('transliteration', 0.447), ('hindi', 0.396), ('ui', 0.365), ('transliterations', 0.313), ('pc', 0.168), ('hi', 0.151), ('pw', 0.118), ('translation', 0.087), ('uii', 0.085), ('pcp', 0.076), ('transliterate', 0.076), ('alignments', 0.073), ('oov', 0.07), ('transliterated', 0.061), ('heuristics', 0.053), ('nes', 0.047), ('bleu', 0.046), ('diacritics', 0.043), ('ch', 0.043), ('written', 0.038), ('malik', 0.038), ('transliterating', 0.038), ('kashani', 0.038), ('translations', 0.036), ('unknown', 0.035), ('character', 0.034), ('probability', 0.034), ('interpolating', 0.033), ('iy', 0.033), ('options', 0.033), ('homonyms', 0.032), ('qjk', 0.032), ('sampa', 0.032), ('sant', 0.032), ('shanti', 0.032), ('transliterator', 0.032), ('yn', 0.032), ('fold', 0.03), ('alignment', 0.029), ('translating', 0.029), ('model', 0.029), ('ekbal', 0.028), ('translate', 0.028), ('arabic', 0.028), ('kun', 0.026), ('parameters', 0.025), ('optimization', 0.024), ('counter', 0.023), ('ser', 0.023), ('conditional', 0.023), ('vocabulary', 0.023), ('kt', 0.022), ('koehn', 0.022), ('abduljaleel', 0.022), ('aom', 0.022), ('axlign', 0.022), ('durrani', 0.022), ('emille', 0.022), ('farsi', 0.022), ('flaw', 0.022), ('forschungsgemeinschaft', 0.022), ('jawaid', 0.022), ('lmknown', 0.022), ('muodel', 0.022), ('phwi', 0.022), ('pirkola', 0.022), ('ppcc', 0.022), ('topcandidate', 0.022), ('uix', 0.022), ('virga', 0.022), ('roughly', 0.021), ('discussed', 0.021), ('joint', 0.02), ('token', 0.02), ('ka', 0.019), ('compete', 0.019), ('sanskrit', 0.019), ('sur', 0.019), ('za', 0.019), ('heuristic', 0.019), ('yp', 0.018), ('monolingual', 0.018), ('factor', 0.018), ('prior', 0.018), ('probabilities', 0.018), ('hermjakob', 0.017), ('plm', 0.017), ('srilm', 0.017), ('coefficient', 0.017), ('characters', 0.017), ('discusses', 0.016), ('smt', 0.016), ('estimated', 0.016), ('modifications', 0.016), ('wordbased', 0.016), ('deutsche', 0.016), ('sfb', 0.016), ('name', 0.016), ('knight', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 135 acl-2010-Hindi-to-Urdu Machine Translation through Transliteration
Author: Nadir Durrani ; Hassan Sajjad ; Alexander Fraser ; Helmut Schmid
Abstract: We present a novel approach to integrate transliteration into Hindi-to-Urdu statistical machine translation. We propose two probabilistic models, based on conditional and joint probability formulations, that are novel solutions to the problem. Our models consider both transliteration and translation when translating a particular Hindi word given the context whereas in previous work transliteration is only used for translating OOV (out-of-vocabulary) words. We use transliteration as a tool for disambiguation of Hindi homonyms which can be both translated or transliterated or transliterated differently based on different contexts. We obtain final BLEU scores of 19.35 (conditional probability model) and 19.00 (joint probability model) as compared to 14.30 for a baseline phrase-based system and 16.25 for a system which transliterates OOV words in the baseline system. This indicates that transliteration is useful for more than only translating OOV words for language pairs like Hindi-Urdu.
Author: Dong Yang ; Paul Dixon ; Sadaoki Furui
Abstract: This paper presents a joint optimization method of a two-step conditional random field (CRF) model for machine transliteration and a fast decoding algorithm for the proposed method. Our method lies in the category of direct orthographical mapping (DOM) between two languages without using any intermediate phonemic mapping. In the two-step CRF model, the first CRF segments an input word into chunks and the second one converts each chunk into one unit in the target language. In this paper, we propose a method to jointly optimize the two-step CRFs and also a fast algorithm to realize it. Our experiments show that the proposed method outperforms the well-known joint source channel model (JSCM) and our proposed fast algorithm decreases the decoding time significantly. Furthermore, combination of the proposed method and the JSCM gives further improvement, which outperforms state-of-the-art results in terms of top-1 accuracy.
3 0.11820351 143 acl-2010-Importance of Linguistic Constraints in Statistical Dependency Parsing
Author: Bharat Ram Ambati
Abstract: Statistical systems with high accuracy are very useful in real-world applications. If these systems can capture basic linguistic information, then the usefulness of these statistical systems improve a lot. This paper is an attempt at incorporating linguistic constraints in statistical dependency parsing. We consider a simple linguistic constraint that a verb should not have multiple subjects/objects as its children in the dependency tree. We first describe the importance of this constraint considering Machine Translation systems which use dependency parser output, as an example application. We then show how the current state-ofthe-art dependency parsers violate this constraint. We present two new methods to handle this constraint. We evaluate our methods on the state-of-the-art dependency parsers for Hindi and Czech. 1
4 0.08535257 90 acl-2010-Diversify and Combine: Improving Word Alignment for Machine Translation on Low-Resource Languages
Author: Bing Xiang ; Yonggang Deng ; Bowen Zhou
Abstract: We present a novel method to improve word alignment quality and eventually the translation performance by producing and combining complementary word alignments for low-resource languages. Instead of focusing on the improvement of a single set of word alignments, we generate multiple sets of diversified alignments based on different motivations, such as linguistic knowledge, morphology and heuristics. We demonstrate this approach on an English-to-Pashto translation task by combining the alignments obtained from syntactic reordering, stemming, and partial words. The combined alignment outperforms the baseline alignment, with significantly higher F-scores and better transla- tion performance.
5 0.079932287 240 acl-2010-Training Phrase Translation Models with Leaving-One-Out
Author: Joern Wuebker ; Arne Mauser ; Hermann Ney
Abstract: Several attempts have been made to learn phrase translation probabilities for phrase-based statistical machine translation that go beyond pure counting of phrases in word-aligned training data. Most approaches report problems with overfitting. We describe a novel leaving-one-out approach to prevent over-fitting that allows us to train phrase models that show improved translation performance on the WMT08 Europarl German-English task. In contrast to most previous work where phrase models were trained separately from other models used in translation, we include all components such as single word lexica and reordering models in training. Using this consistent training of phrase models we are able to achieve improvements of up to 1.4 points in BLEU. As a side effect, the phrase table size is reduced by more than 80%.
6 0.079772398 133 acl-2010-Hierarchical Search for Word Alignment
7 0.079729445 54 acl-2010-Boosting-Based System Combination for Machine Translation
8 0.065373115 170 acl-2010-Letter-Phoneme Alignment: An Exploration
9 0.062597781 51 acl-2010-Bilingual Sense Similarity for Statistical Machine Translation
10 0.060103845 147 acl-2010-Improving Statistical Machine Translation with Monolingual Collocation
11 0.06009382 180 acl-2010-On Jointly Recognizing and Aligning Bilingual Named Entities
12 0.059203248 87 acl-2010-Discriminative Modeling of Extraction Sets for Machine Translation
13 0.057047263 249 acl-2010-Unsupervised Search for the Optimal Segmentation for Statistical Machine Translation
14 0.05669307 24 acl-2010-Active Learning-Based Elicitation for Semi-Supervised Word Alignment
15 0.053094298 145 acl-2010-Improving Arabic-to-English Statistical Machine Translation by Reordering Post-Verbal Subjects for Alignment
16 0.050895371 77 acl-2010-Cross-Language Document Summarization Based on Machine Translation Quality Prediction
17 0.049710181 244 acl-2010-TrustRank: Inducing Trust in Automatic Translations via Ranking
18 0.049573708 16 acl-2010-A Statistical Model for Lost Language Decipherment
19 0.049432341 102 acl-2010-Error Detection for Statistical Machine Translation Using Linguistic Features
20 0.049111564 50 acl-2010-Bilingual Lexicon Generation Using Non-Aligned Signatures
topicId topicWeight
[(0, -0.123), (1, -0.112), (2, -0.039), (3, -0.01), (4, 0.037), (5, 0.011), (6, -0.045), (7, -0.012), (8, 0.025), (9, 0.024), (10, 0.0), (11, 0.069), (12, 0.047), (13, -0.027), (14, -0.039), (15, -0.014), (16, -0.031), (17, 0.061), (18, -0.023), (19, -0.023), (20, 0.014), (21, -0.04), (22, -0.049), (23, -0.069), (24, -0.004), (25, -0.059), (26, 0.021), (27, -0.058), (28, -0.071), (29, -0.004), (30, -0.02), (31, -0.014), (32, 0.002), (33, -0.165), (34, -0.173), (35, 0.058), (36, -0.117), (37, 0.061), (38, -0.038), (39, -0.19), (40, -0.043), (41, 0.067), (42, -0.093), (43, 0.007), (44, 0.169), (45, -0.107), (46, 0.015), (47, -0.061), (48, -0.036), (49, 0.127)]
simIndex simValue paperId paperTitle
same-paper 1 0.89242017 135 acl-2010-Hindi-to-Urdu Machine Translation through Transliteration
Author: Nadir Durrani ; Hassan Sajjad ; Alexander Fraser ; Helmut Schmid
Abstract: We present a novel approach to integrate transliteration into Hindi-to-Urdu statistical machine translation. We propose two probabilistic models, based on conditional and joint probability formulations, that are novel solutions to the problem. Our models consider both transliteration and translation when translating a particular Hindi word given the context whereas in previous work transliteration is only used for translating OOV (out-of-vocabulary) words. We use transliteration as a tool for disambiguation of Hindi homonyms which can be both translated or transliterated or transliterated differently based on different contexts. We obtain final BLEU scores of 19.35 (conditional probability model) and 19.00 (joint probability model) as compared to 14.30 for a baseline phrase-based system and 16.25 for a system which transliterates OOV words in the baseline system. This indicates that transliteration is useful for more than only translating OOV words for language pairs like Hindi-Urdu.
Author: Dong Yang ; Paul Dixon ; Sadaoki Furui
Abstract: This paper presents a joint optimization method of a two-step conditional random field (CRF) model for machine transliteration and a fast decoding algorithm for the proposed method. Our method lies in the category of direct orthographical mapping (DOM) between two languages without using any intermediate phonemic mapping. In the two-step CRF model, the first CRF segments an input word into chunks and the second one converts each chunk into one unit in the target language. In this paper, we propose a method to jointly optimize the two-step CRFs and also a fast algorithm to realize it. Our experiments show that the proposed method outperforms the well-known joint source channel model (JSCM) and our proposed fast algorithm decreases the decoding time significantly. Furthermore, combination of the proposed method and the JSCM gives further improvement, which outperforms state-of-the-art results in terms of top-1 accuracy.
3 0.57356656 180 acl-2010-On Jointly Recognizing and Aligning Bilingual Named Entities
Author: Yufeng Chen ; Chengqing Zong ; Keh-Yih Su
Abstract: We observe that (1) how a given named entity (NE) is translated (i.e., either semantically or phonetically) depends greatly on its associated entity type, and (2) entities within an aligned pair should share the same type. Also, (3) those initially detected NEs are anchors, whose information should be used to give certainty scores when selecting candidates. From this basis, an integrated model is thus proposed in this paper to jointly identify and align bilingual named entities between Chinese and English. It adopts a new mapping type ratio feature (which is the proportion of NE internal tokens that are semantically translated), enforces an entity type consistency constraint, and utilizes additional monolingual candidate certainty factors (based on those NE anchors). The experi- ments show that this novel approach has substantially raised the type-sensitive F-score of identified NE-pairs from 68.4% to 81.7% (42.1% F-score imperfection reduction) in our Chinese-English NE alignment task.
4 0.48995689 68 acl-2010-Conditional Random Fields for Word Hyphenation
Author: Nikolaos Trogkanis ; Charles Elkan
Abstract: Finding allowable places in words to insert hyphens is an important practical problem. The algorithm that is used most often nowadays has remained essentially unchanged for 25 years. This method is the TEX hyphenation algorithm of Knuth and Liang. We present here a hyphenation method that is clearly more accurate. The new method is an application of conditional random fields. We create new training sets for English and Dutch from the CELEX European lexical resource, and achieve error rates for English of less than 0.1% for correctly allowed hyphens, and less than 0.01% for Dutch. Experiments show that both the Knuth/Liang method and a leading current commercial alternative have error rates several times higher for both languages.
5 0.4805041 29 acl-2010-An Exact A* Method for Deciphering Letter-Substitution Ciphers
Author: Eric Corlett ; Gerald Penn
Abstract: Letter-substitution ciphers encode a document from a known or hypothesized language into an unknown writing system or an unknown encoding of a known writing system. It is a problem that can occur in a number of practical applications, such as in the problem of determining the encodings of electronic documents in which the language is known, but the encoding standard is not. It has also been used in relation to OCR applications. In this paper, we introduce an exact method for deciphering messages using a generalization of the Viterbi algorithm. We test this model on a set of ciphers developed from various web sites, and find that our algorithm has the potential to be a viable, practical method for efficiently solving decipherment problems.
7 0.37484735 98 acl-2010-Efficient Staggered Decoding for Sequence Labeling
8 0.3642047 143 acl-2010-Importance of Linguistic Constraints in Statistical Dependency Parsing
9 0.34008083 104 acl-2010-Evaluating Machine Translations Using mNCD
10 0.33631471 12 acl-2010-A Probabilistic Generative Model for an Intermediate Constituency-Dependency Representation
11 0.33177602 48 acl-2010-Better Filtration and Augmentation for Hierarchical Phrase-Based Translation Rules
12 0.32909438 223 acl-2010-Tackling Sparse Data Issue in Machine Translation Evaluation
13 0.32005706 50 acl-2010-Bilingual Lexicon Generation Using Non-Aligned Signatures
14 0.31723753 16 acl-2010-A Statistical Model for Lost Language Decipherment
15 0.31163374 132 acl-2010-Hierarchical Joint Learning: Improving Joint Parsing and Named Entity Recognition with Non-Jointly Labeled Data
16 0.31035191 40 acl-2010-Automatic Sanskrit Segmentizer Using Finite State Transducers
17 0.3102617 265 acl-2010-cdec: A Decoder, Alignment, and Learning Framework for Finite-State and Context-Free Translation Models
18 0.30721971 249 acl-2010-Unsupervised Search for the Optimal Segmentation for Statistical Machine Translation
19 0.30666721 244 acl-2010-TrustRank: Inducing Trust in Automatic Translations via Ranking
20 0.30053642 170 acl-2010-Letter-Phoneme Alignment: An Exploration
topicId topicWeight
[(5, 0.309), (14, 0.017), (16, 0.02), (25, 0.036), (39, 0.018), (44, 0.016), (52, 0.01), (59, 0.118), (73, 0.081), (76, 0.013), (78, 0.016), (80, 0.015), (83, 0.067), (84, 0.025), (98, 0.117)]
simIndex simValue paperId paperTitle
1 0.80254203 35 acl-2010-Automated Planning for Situated Natural Language Generation
Author: Konstantina Garoufi ; Alexander Koller
Abstract: We present a natural language generation approach which models, exploits, and manipulates the non-linguistic context in situated communication, using techniques from AI planning. We show how to generate instructions which deliberately guide the hearer to a location that is convenient for the generation of simple referring expressions, and how to generate referring expressions with context-dependent adjectives. We implement and evaluate our approach in the framework of the Challenge on Generating Instructions in Virtual Environments, finding that it performs well even under the constraints of realtime generation.
same-paper 2 0.77979439 135 acl-2010-Hindi-to-Urdu Machine Translation through Transliteration
Author: Nadir Durrani ; Hassan Sajjad ; Alexander Fraser ; Helmut Schmid
Abstract: We present a novel approach to integrate transliteration into Hindi-to-Urdu statistical machine translation. We propose two probabilistic models, based on conditional and joint probability formulations, that are novel solutions to the problem. Our models consider both transliteration and translation when translating a particular Hindi word given the context whereas in previous work transliteration is only used for translating OOV (out-of-vocabulary) words. We use transliteration as a tool for disambiguation of Hindi homonyms which can be both translated or transliterated or transliterated differently based on different contexts. We obtain final BLEU scores of 19.35 (conditional probability model) and 19.00 (joint probability model) as compared to 14.30 for a baseline phrase-based system and 16.25 for a system which transliterates OOV words in the baseline system. This indicates that transliteration is useful for more than only translating OOV words for language pairs like Hindi-Urdu.
3 0.65508604 5 acl-2010-A Framework for Figurative Language Detection Based on Sense Differentiation
Author: Daria Bogdanova
Abstract: Various text mining algorithms require the process of feature selection. High-level semantically rich features, such as figurative language uses, speech errors etc., are very promising for such problems as e.g. writing style detection, but automatic extraction of such features is a big challenge. In this paper, we propose a framework for figurative language use detection. This framework is based on the idea of sense differentiation. We describe two algorithms illustrating the mentioned idea. We show then how these algorithms work by applying them to Russian language data.
4 0.52418745 238 acl-2010-Towards Open-Domain Semantic Role Labeling
Author: Danilo Croce ; Cristina Giannone ; Paolo Annesi ; Roberto Basili
Abstract: Current Semantic Role Labeling technologies are based on inductive algorithms trained over large scale repositories of annotated examples. Frame-based systems currently make use of the FrameNet database but fail to show suitable generalization capabilities in out-of-domain scenarios. In this paper, a state-of-art system for frame-based SRL is extended through the encapsulation of a distributional model of semantic similarity. The resulting argument classification model promotes a simpler feature space that limits the potential overfitting effects. The large scale empirical study here discussed confirms that state-of-art accuracy can be obtained for out-of-domain evaluations.
Author: Dong Yang ; Paul Dixon ; Sadaoki Furui
Abstract: This paper presents a joint optimization method of a two-step conditional random field (CRF) model for machine transliteration and a fast decoding algorithm for the proposed method. Our method lies in the category of direct orthographical mapping (DOM) between two languages without using any intermediate phonemic mapping. In the two-step CRF model, the first CRF segments an input word into chunks and the second one converts each chunk into one unit in the target language. In this paper, we propose a method to jointly optimize the two-step CRFs and also a fast algorithm to realize it. Our experiments show that the proposed method outperforms the well-known joint source channel model (JSCM) and our proposed fast algorithm decreases the decoding time significantly. Furthermore, combination of the proposed method and the JSCM gives further improvement, which outperforms state-of-the-art results in terms of top-1 accuracy.
7 0.52015233 148 acl-2010-Improving the Use of Pseudo-Words for Evaluating Selectional Preferences
8 0.51945233 184 acl-2010-Open-Domain Semantic Role Labeling by Modeling Word Spans
9 0.51898706 55 acl-2010-Bootstrapping Semantic Analyzers from Non-Contradictory Texts
10 0.51792645 113 acl-2010-Extraction and Approximation of Numerical Attributes from the Web
11 0.51749611 144 acl-2010-Improved Unsupervised POS Induction through Prototype Discovery
12 0.51685095 87 acl-2010-Discriminative Modeling of Extraction Sets for Machine Translation
13 0.51681846 96 acl-2010-Efficient Optimization of an MDL-Inspired Objective Function for Unsupervised Part-Of-Speech Tagging
14 0.51652801 56 acl-2010-Bridging SMT and TM with Translation Recommendation
15 0.51642072 162 acl-2010-Learning Common Grammar from Multilingual Corpus
16 0.51599962 156 acl-2010-Knowledge-Rich Word Sense Disambiguation Rivaling Supervised Systems
17 0.51573968 118 acl-2010-Fine-Grained Tree-to-String Translation Rule Extraction
18 0.51554811 145 acl-2010-Improving Arabic-to-English Statistical Machine Translation by Reordering Post-Verbal Subjects for Alignment
19 0.51480204 218 acl-2010-Structural Semantic Relatedness: A Knowledge-Based Method to Named Entity Disambiguation
20 0.51408255 48 acl-2010-Better Filtration and Augmentation for Hierarchical Phrase-Based Translation Rules