acl acl2012 acl2012-20 knowledge-graph by maker-knowledge-mining

20 acl-2012-A Statistical Model for Unsupervised and Semi-supervised Transliteration Mining


Source: pdf

Author: Hassan Sajjad ; Alexander Fraser ; Helmut Schmid

Abstract: We propose a novel model to automatically extract transliteration pairs from parallel corpora. Our model is efficient, language pair independent and mines transliteration pairs in a consistent fashion in both unsupervised and semi-supervised settings. We model transliteration mining as an interpolation of transliteration and non-transliteration sub-models. We evaluate on NEWS 2010 shared task data and on parallel corpora with competitive results.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We propose a novel model to automatically extract transliteration pairs from parallel corpora. [sent-3, score-0.926]

2 Our model is efficient, language pair independent and mines transliteration pairs in a consistent fashion in both unsupervised and semi-supervised settings. [sent-4, score-1.073]

3 We model transliteration mining as an interpolation of transliteration and non-transliteration sub-models. [sent-5, score-1.805]

4 1 Introduction Transliteration mining is the extraction of transliteration pairs from unlabelled data. [sent-7, score-1.178]

5 Most transliteration mining systems are built using labelled training data or using heuristics to extract transliteration pairs. [sent-8, score-2.039]

6 Our system extracts transliteration pairs in an unsupervised fashion. [sent-10, score-1.049]

7 We present a novel model of transliteration mining defined as a mixture of a transliteration model and a non-transliteration model. [sent-12, score-1.775]

8 The transliteration model is a joint source channel model (Li et al. [sent-13, score-0.836]

9 At test time, we label word pairs as transliterations if they have a higher probability assigned by the transliteration sub-model than by the non-transliteration sub-model. [sent-17, score-1.085]

10 The S-step takes the probability estimates from unlabelled data (computed in the Mstep) and uses them as a backoff distribution to smooth probabilities which were estimated from labelled data. [sent-19, score-0.468]

11 We evaluate our unsupervised and semi-supervised transliteration mining system on the datasets available from the NEWS 2010 shared task on transliteration mining (Kumaran et al. [sent-22, score-2.19]

12 Compared with a baseline unsupervised system our unsupervised system achieves up to 5% better F-measure. [sent-25, score-0.416]

13 Additional experiments on parallel corpora show that we are able to effectively mine transliteration pairs from very noisy data. [sent-31, score-0.955]

14 2 Previous Work We first discuss the literature on semi-supervised and supervised techniques for transliteration mining and then describe a previously defined unsupervised system. [sent-39, score-1.139]

15 Supervised and semi-supervised systems use a manually labelled set of training data to learn character mappings between source and target strings. [sent-40, score-0.358]

16 The labelled training data either consists of a few hundred transliteration pairs or of just a few carefully selected transliteration pairs. [sent-41, score-1.922]

17 The NEWS 2010 shared task on transliteration mining (NEWS10) (Kumaran et al. [sent-42, score-0.987]

18 Our transliteration mining model can mine transliterations without using any labelled data. [sent-48, score-1.375]

19 The transliteration mining systems evaluated on the NEWS10 dataset generally used heuristic methods, discriminative models or generative models for transliteration mining (Kumaran et al. [sent-50, score-1.961]

20 They presented two discriminative methods: an SVM-based classifier and alignment-based string similarity for transliteration mining. [sent-54, score-0.81]

21 We propose a flexible generative model for transliteration mining usable for both unsupervised and semi-supervised learning. [sent-56, score-1.107]

22 Our method is different from theirs as our generative story explains the unlabelled data using a combination of a transliteration and a non-transliteration sub-model. [sent-60, score-0.957]

23 The transliteration model jointly generates source and target strings, whereas the non-transliteration system generates them independently of each other. [sent-61, score-0.921]
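As a concrete illustration of this split, here is a minimal sketch (not the authors' code) of the non-transliteration sub-model as two independent character unigram models; the add-one smoothing and the omission of end-of-word handling are simplifying assumptions:

```python
from collections import Counter

def char_unigram_model(words):
    """Estimate a character unigram distribution from a word list
    (add-one smoothed; a simplifying assumption)."""
    counts = Counter(ch for w in words for ch in w)
    total, vocab = sum(counts.values()), len(counts)
    return lambda ch: (counts[ch] + 1) / (total + vocab)

def p_nontranslit(e, f, p_src, p_tgt):
    """p2(e, f): source and target strings generated independently of each other."""
    p = 1.0
    for ch in e:
        p *= p_src(ch)
    for ch in f:
        p *= p_tgt(ch)
    return p
```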

24 Sajjad et al. (2011) proposed a heuristic-based unsupervised transliteration mining system. [sent-63, score-1.107]

25 To our knowledge, it is the only unsupervised mining system that has been evaluated on the NEWS10 dataset to date. [sent-65, score-0.383]

26 In this paper, we propose a novel model-based approach to transliteration mining. [sent-68, score-0.81]

27 Unlike the previous unsupervised system, and unlike the supervised and semi-supervised systems we mentioned, our model can be used for both unsupervised and semi-supervised mining in a consistent way. [sent-70, score-0.471]

28 The joint transliteration probability p1(e, f) of a word pair is the sum of the probabilities of all alignment sequences: p1(e, f) = Σ_{a ∈ Align(e,f)} p(a) (1) Transliteration systems are trained on a list of transliteration pairs. [sent-74, score-1.862]

29 The alignment between the transliteration pairs is learned with Expectation Maximization (EM). [sent-75, score-0.919]

30 We use a simple unigram model, so an alignment sequence from function Align(e, f) is a combination of 0–1, 1–1, and 1–0 character alignments between a source word e and its transliteration f. [sent-76, score-1.016]

31 p(a) = p(q_1, . . . , q_{|a|}) = ∏_{j=1}^{|a|} p(q_j) (2) While transliteration systems are trained on a clean list of transliteration pairs, our transliteration mining system has to learn from data containing both transliterations and non-transliterations. [sent-82, score-2.868]
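A minimal sketch, under the unigram multigram model of Equations 1 and 2, of computing p1(e, f) by dynamic programming over 0–1, 1–1 and 1–0 character alignments; the multigram table `p_q` (character pairs, with '' marking the empty side) is an assumed input, e.g. obtained from EM training:

```python
from functools import lru_cache

def p1_joint(e, f, p_q):
    """Sum p(a) over all alignments of e and f built from 1-0, 0-1 and 1-1
    multigrams, where p(a) is the product of the multigram probabilities
    (Equations 1 and 2)."""
    @lru_cache(maxsize=None)
    def alpha(i, j):           # total probability of covering e[:i], f[:j]
        if i == 0 and j == 0:
            return 1.0
        total = 0.0
        if i > 0:              # 1-0: source character paired with nothing
            total += alpha(i - 1, j) * p_q.get((e[i - 1], ''), 0.0)
        if j > 0:              # 0-1: target character paired with nothing
            total += alpha(i, j - 1) * p_q.get(('', f[j - 1]), 0.0)
        if i > 0 and j > 0:    # 1-1: one source and one target character
            total += alpha(i - 1, j - 1) * p_q.get((e[i - 1], f[j - 1]), 0.0)
        return total
    return alpha(len(e), len(f))
```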

32 The transliteration model p1(e, f) handles only the transliteration pairs. [sent-83, score-1.62]

33 Interpolation with the non-transliteration model allows the transliteration model to concentrate on modelling transliterations during EM training. [sent-85, score-0.955]

34 After EM training, transliteration word pairs are assigned a high probability by the transliteration submodel and a low probability by the non-transliteration submodel, and vice versa for non-transliteration pairs. [sent-86, score-1.818]

35 The transliteration mining model is an interpolation of the transliteration model p1(e, f) and the non-transliteration model p2(e, f): p(e, f) = (1 − λ) p1(e, f) + λ p2(e, f) (4) where λ is the prior probability of non-transliteration. [sent-92, score-1.036]
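Equation 4 and the posterior of non-transliteration derived from it, as a short sketch (the two sub-model scores and λ are assumed given):

```python
def mixture_prob(p1, p2, lam):
    """p(e, f) = (1 - lam) * p1(e, f) + lam * p2(e, f)   (Equation 4)."""
    return (1.0 - lam) * p1 + lam * p2

def p_ntr(p1, p2, lam):
    """Posterior probability that a word pair is a non-transliteration."""
    return lam * p2 / mixture_prob(p1, p2, lam)
```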

36 1 Model Estimation In this section, we discuss the estimation of the parameters of the transliteration model p1(e, f) and the non-transliteration model p2(e, f). [sent-94, score-0.81]

37 For the transliteration model, we implement a simplified form of the grapheme-to-phoneme converter, g2p (Bisani and Ney, 2008). [sent-97, score-0.81]

38 In the E-step the EM algorithm computes expected counts for the multigrams and in the M-step the multigram probabilities are reestimated from these counts. [sent-103, score-0.33]

39 The expected count of a multigram q (E-step) is computed by multiplying the posterior probability of each alignment a with the frequency of q in a and summing these weighted frequencies over all alignments of all word pairs. [sent-107, score-0.349]

40 Consider a node r which is connected with a node s via an arc labelled with the multigram q. [sent-113, score-0.401]
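Reading the lattice description above as code: a sketch of forward-backward expected multigram counts for one word pair, where node (i, j) covers the first i source and j target characters and an arc labelled q connects adjacent nodes; this is an illustration consistent with the description, not the authors' implementation:

```python
from collections import defaultdict

def expected_counts(e, f, p_q):
    """gamma_rs = alpha_r * p(q) * beta_s / p1(e, f) for every arc, summed
    here per multigram type q (E-step expected counts for one word pair)."""
    m, n = len(e), len(f)
    def arcs_from(i, j):                        # outgoing arcs of node (i, j)
        if i < m:
            yield (i + 1, j), (e[i], '')        # 1-0 multigram
        if j < n:
            yield (i, j + 1), ('', f[j])        # 0-1 multigram
        if i < m and j < n:
            yield (i + 1, j + 1), (e[i], f[j])  # 1-1 multigram
    nodes = [(i, j) for i in range(m + 1) for j in range(n + 1)]
    alpha = defaultdict(float); alpha[(0, 0)] = 1.0
    for node in nodes:                          # forward pass
        for dest, q in arcs_from(*node):
            alpha[dest] += alpha[node] * p_q.get(q, 0.0)
    beta = defaultdict(float); beta[(m, n)] = 1.0
    for node in reversed(nodes):                # backward pass
        for dest, q in arcs_from(*node):
            beta[node] += p_q.get(q, 0.0) * beta[dest]
    total = alpha[(m, n)]                       # equals p1(e, f)
    counts = defaultdict(float)
    if total > 0.0:
        for node in nodes:
            for dest, q in arcs_from(*node):
                counts[q] += alpha[node] * p_q.get(q, 0.0) * beta[dest] / total
    return counts
```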

41 We multiply the expected count of a transition by the posterior probability of transliteration (1 − pntr(e, f)), which indicates how likely the string pair is to be a transliteration. [sent-116, score-0.926]

42 The counts γrs are then summed for all multigram types q over all training pairs to obtain the frequencies c(q) which are used to reestimate the multigram probabilities according to Equation 5. [sent-117, score-0.465]
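Putting the pieces together, one EM iteration of the mining model, reusing the `p1_joint` and `expected_counts` sketches above; re-estimating λ as the average posterior of non-transliteration is an assumption about a detail the summary does not spell out:

```python
from collections import defaultdict

def em_iteration(pairs, p_q, lam, p2_model):
    """Weight each pair's expected multigram counts by its posterior
    transliteration probability (1 - pntr), then renormalize (Equation 5)."""
    counts = defaultdict(float)
    pntr_sum = 0.0
    for e, f in pairs:
        p1 = p1_joint(e, f, p_q)               # transliteration sub-model
        p2 = p2_model(e, f)                    # non-transliteration sub-model
        pntr = lam * p2 / ((1.0 - lam) * p1 + lam * p2)
        pntr_sum += pntr
        for q, c in expected_counts(e, f, p_q).items():
            counts[q] += (1.0 - pntr) * c      # transliteration-weighted counts
    total = sum(counts.values())
    new_p_q = {q: c / total for q, c in counts.items()}
    new_lam = pntr_sum / len(pairs)            # assumed re-estimation of the prior
    return new_p_q, new_lam
```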

43 4 Semi-supervised Transliteration Mining Model Our unsupervised transliteration mining system can be applied to language pairs for which no labelled data is available. [sent-118, score-1.464]

44 However, the unsupervised system is focused on high recall and also mines close transliterations (see Section 5 for details). [sent-119, score-0.414]

45 In a task-dependent scenario, it is difficult for the unsupervised system to mine transliteration pairs according to the details of a particular definition of what is considered a transliteration (which may vary somewhat with the task). [sent-120, score-1.912]

46 In this section, we propose an extension of our unsupervised model which overcomes this shortcoming by using labelled data. [sent-121, score-0.378]

47 The idea is to rely on probabilities from labelled data where they can be estimated reliably and to use probabilities from unlabelled data where the labelled data is sparse. [sent-122, score-0.707]

48 This is achieved by smoothing the labelled data probabilities using the unlabelled data probabilities as a backoff. [sent-123, score-0.471]

49 We obtain this effect by smoothing the probability distribution of unlabelled and labelled data using a technique similar to Witten-Bell smoothing (Witten and Bell, 1991), as we describe below. [sent-127, score-0.424]

50 The first step creates a reasonable alignment of the labelled data from which multigram counts can be obtained. [sent-130, score-0.469]

51 The labelled data is a small list of transliteration pairs. [sent-131, score-1.111]

52 Therefore we use the unlabelled data to help correctly align it and train our unsupervised mining system on the combined labelled and unlabelled training data. [sent-132, score-0.882]

53 We start the second step with the probability estimates from the first step and run the E-step separately on labelled and unlabelled data. [sent-135, score-0.424]

54 The E-step on the labelled data is done using Equation 8, which forces the posterior probability of non-transliteration to zero, while the E-step on the unlabelled data uses Equation 4. [sent-136, score-0.45]

55 After the two E-steps, we estimate a probability distribution from the counts obtained from the unlabelled data (M-step) and use it as a backoff distribution in computing smoothed probabilities from the labelled data counts (S-step). [sent-137, score-0.557]
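A minimal sketch of the S-step as Witten-Bell-style interpolation; the weight n/(n+t) below is the textbook Witten-Bell choice and is an assumption, since the summary does not give the paper's exact formula:

```python
def s_step(c_lab, p_unlab):
    """Smooth labelled-data counts c_lab with the unlabelled-data
    distribution p_unlab as backoff: reliable labelled estimates dominate
    where counts are high; sparse events back off to unlabelled estimates."""
    n = sum(c_lab.values())        # labelled multigram tokens
    t = len(c_lab)                 # distinct labelled multigram types
    if n == 0:
        return dict(p_unlab)
    w = n / (n + t)                # Witten-Bell weight on labelled data
    events = set(c_lab) | set(p_unlab)
    return {q: w * c_lab.get(q, 0) / n + (1 - w) * p_unlab.get(q, 0.0)
            for q in events}
```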

56 5 Evaluation We evaluate our unsupervised system and semi-supervised system on two tasks, NEWS10 and parallel corpora. [sent-139, score-0.343]

57 NEWS10 is a standard task on transliteration mining from Wikipedia InterLanguage links (WIL). [sent-140, score-0.965]

58 On NEWS10, we compare our results with the unsupervised mining system of Sajjad et al. [sent-141, score-0.352]

59 The seed data is a list of 1000 transliteration pairs provided to semi-supervised systems for initial training. [sent-150, score-0.999]

60 We compared the word-aligned list with the NEWS10 reference data and found that the word-aligned list is missing some transliteration pairs because of word-alignment errors. [sent-159, score-1.006]

61 We built another list by adding a word pair for every source word that cooccurs with a target word in a parallel phrase/sentence; we refer to this below as the cross-product list. [sent-160, score-0.383]

62 The cross-product list is noisier but contains almost all transliteration pairs in the corpus. [sent-161, score-0.968]
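A minimal sketch of building the cross-product list described above, assuming whitespace-tokenized parallel phrases/sentences:

```python
def cross_product_list(parallel_segments):
    """Pair every source word with every co-occurring target word; noisier
    than a word-aligned list, but it retains almost all transliteration pairs."""
    candidates = set()
    for src, tgt in parallel_segments:
        for e in src.split():
            for f in tgt.split():
                candidates.add((e, f))
    return candidates
```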

63 The word-aligned list calculated from the NEWS10 dataset is used to compare our unsupervised system with the unsupervised system of Sajjad et al. [sent-170, score-0.509]

64 Unsupervised Transliteration Mining We run our unsupervised transliteration mining system on the word-aligned list and the cross-product list. [sent-177, score-1.254]

65 The word pairs with a posterior probability of transliteration 1 − pntr(e, f) = 1 − λ p2(e_i, f_i)/p(e_i, f_i) greater than 0.5 are classified as transliterations. [sent-178, score-0.966]

66 We compare our unsupervised system with the unsupervised system of Sajjad11. [sent-180, score-0.394]

67 On the same machine, our transliteration mining system only takes 1. [sent-194, score-1.02]

68 Table 2: F-measure results on NEWS10 datasets, where SJD is the unsupervised system of Sajjad11, OU is our unsupervised system built on the cross-product list, OS is our semi-supervised system, SBest is the best NEWS10 system, GR is the supervised system of Kahki et al. [sent-205, score-0.509]

69 (2011) and DBN is the semi-supervised system of Nabende (2011). Our unsupervised mining system built on the cross-product list consistently outperforms the one built on the word-aligned list. [sent-206, score-0.528]

70 Table 2 shows the results of our unsupervised system OU in comparison with the unsupervised system of Sajjad11 (SJD), the best semi-supervised systems presented at NEWS10 (SBest) and the best semi-supervised results reported on the NEWS10 dataset (GR, DBN). [sent-208, score-0.37]

71 Cognates are close transliterations which differ by only one or two characters from an exact transliteration pair. [sent-220, score-1.016]

72 The unsupervised system learns to delete the additional one or two characters with a high probability and incorrectly mines such word pairs as transliterations. [sent-221, score-0.435]

73 Table 3: Precision (P), Recall (R) and F-measure (F) of our unsupervised and semi-supervised transliteration mining systems on NEWS10 datasets. [sent-231, score-1.107]

74 This shows that the unlabelled training data is already providing most of the transliteration information. [sent-235, score-0.957]

75 The seed data is used to help the transliteration mining system to learn the right definition of transliteration. [sent-236, score-1.096]

76 The increase in precision shows that the seed data is helping the system in disambiguating transliteration pairs from cognates. [sent-241, score-0.989]

77 Our transliteration mining system wrongly extracts such pairs as transliterations. [sent-249, score-1.154]

78 Table 4 shows a few examples of such word pairs. Table 4: Word pairs with pronunciation differences. Table 5: Examples of word pairs which are wrongly annotated as transliterations in the gold standard. [sent-253, score-0.392]

79 Inconsistencies in the gold standard: There are several inconsistencies in the gold standard where our transliteration system correctly identifies a word pair as a transliteration but it is marked as a non-transliteration, or vice versa. [sent-254, score-1.788]

80 Our semi-supervised system learns this as a non-transliteration but it is wrongly annotated as a transliteration in the gold standard. [sent-256, score-0.956]

81 Our mining system classifies such cases as non-transliterations, but 24 of them are incorrectly annotated as transliterations in the gold standard. [sent-259, score-0.406]

82 Often the Russian word differs only by the last character from a correct transliteration of the English word. [sent-267, score-0.881]

83 Due to the large number of such word pairs in the English/Russian data, our mining system learns to delete the final case-marking characters from the Russian words. [sent-268, score-0.359]

84 It assigns a high transliteration probability to these word pairs and extracts them as transliterations. Table 6: A few examples of English/Russian cognates. [sent-269, score-0.985]

85 Transliteration Mining using Parallel Corpora The percentage of transliteration pairs in the NEWS10 datasets is high. [sent-279, score-0.876]

86 We further check the effectiveness of our unsupervised and semi-supervised mining systems by evaluating them on parallel corpora with as few as 2% transliteration pairs. [sent-280, score-1.157]

87 The English/Hindi and English/Arabic transliteration gold standards were provided by Sajjad et al. [sent-285, score-0.842]

88 We first train and test our unsupervised mining system on the word-aligned list and compare our results with Sajjad et al. [sent-292, score-0.417]

89 Table 7: Transliteration mining results of our unsupervised system and the Sajjad11 system, trained and tested on the word-aligned list of the English/Hindi and English/Arabic parallel corpora (columns: TP, FN, TN, FP, P, R, F). [sent-307, score-0.522]

90 Table 8: Transliteration mining results of our unsupervised and semi-supervised systems, trained on the word-aligned list and tested on the cross-product list of the English/Hindi and English/Arabic parallel corpora. The cross-product list is noisier than the word-aligned list but has almost 100% recall of transliteration pairs. [sent-313, score-1.259]

91 The English-Hindi cross-product list has almost 55% more transliteration pairs (412 types) than the word-aligned list (180 types). [sent-314, score-1.006]

92 The mined transliteration pairs of our unsupervised system contain 65 and 111 close transliterations for the English/Hindi and English/Arabic tasks respectively. [sent-323, score-1.241]

93 We define their probability as the inverse of the number of multigram tokens in the Viterbi alignment of the labelled and unlabelled data together. [sent-325, score-0.632]

94 We think these pairs provide transliteration information to the systems and help them to avoid problems with data sparseness. [sent-327, score-0.876]

95 6 Conclusion and Future Work We presented a novel model to automatically mine transliteration pairs. [sent-332, score-0.839]

96 Both the unsupervised and semi-supervised systems achieve higher accuracy than the only unsupervised transliteration mining system we are aware of and are competitive with the state-of-the-art supervised and semi-supervised systems. [sent-334, score-1.377]

97 These language pairs require one-to-many character mappings to learn transliteration units, while our current system only learns unigram character alignments. [sent-337, score-1.105]

98 Whitepaper of NEWS 2010 shared task on transliteration mining. [sent-382, score-0.832]

99 Language independent transliteration mining system using finite state automata framework. [sent-402, score-1.02]

100 An algorithm for unsupervised transliteration mining with an application to word alignment. [sent-411, score-1.13]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('transliteration', 0.81), ('labelled', 0.236), ('multigram', 0.165), ('sajjad', 0.165), ('mining', 0.155), ('unlabelled', 0.147), ('transliterations', 0.145), ('unsupervised', 0.142), ('kumaran', 0.088), ('kahki', 0.082), ('multigrams', 0.069), ('pairs', 0.066), ('list', 0.065), ('seed', 0.058), ('cognates', 0.055), ('system', 0.055), ('parallel', 0.05), ('em', 0.049), ('character', 0.048), ('probabilities', 0.044), ('jiampojamarn', 0.044), ('alignment', 0.043), ('bisani', 0.041), ('semisupervised', 0.041), ('probability', 0.041), ('arabic', 0.04), ('news', 0.039), ('characters', 0.038), ('unigram', 0.038), ('wrongly', 0.037), ('alphabetic', 0.036), ('fraser', 0.034), ('pronounced', 0.034), ('vowels', 0.033), ('uppsala', 0.032), ('gold', 0.032), ('supervised', 0.032), ('dataset', 0.031), ('helmut', 0.031), ('russian', 0.031), ('extracts', 0.031), ('target', 0.03), ('interpolation', 0.03), ('mines', 0.029), ('wordaligned', 0.029), ('mine', 0.029), ('built', 0.028), ('alignments', 0.028), ('axlign', 0.027), ('crossproduct', 0.027), ('deligne', 0.027), ('deutsche', 0.027), ('eisele', 0.027), ('forschungsgemeinschaft', 0.027), ('noeman', 0.027), ('noisier', 0.027), ('pntr', 0.027), ('quran', 0.027), ('reestimated', 0.027), ('sjd', 0.027), ('submodel', 0.027), ('tpfntnfpprf', 0.027), ('expectation', 0.027), ('hassan', 0.026), ('posterior', 0.026), ('pair', 0.026), ('source', 0.026), ('gr', 0.025), ('participated', 0.025), ('counts', 0.025), ('dbn', 0.024), ('khapra', 0.024), ('mitesh', 0.024), ('nabende', 0.024), ('sbest', 0.024), ('wil', 0.024), ('later', 0.024), ('word', 0.023), ('count', 0.023), ('equation', 0.023), ('maximization', 0.023), ('wikipedia', 0.023), ('close', 0.023), ('learns', 0.022), ('darwish', 0.022), ('kareem', 0.022), ('stuttgart', 0.022), ('achieves', 0.022), ('shared', 0.022), ('haizhou', 0.022), ('smoothed', 0.021), ('witten', 0.02), ('fi', 0.02), ('recall', 0.02), ('schmid', 0.019), ('incorrectly', 0.019), ('ei', 0.019), ('calculated', 0.019), ('estimate', 0.018), ('learn', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 20 acl-2012-A Statistical Model for Unsupervised and Semi-supervised Transliteration Mining

Author: Hassan Sajjad ; Alexander Fraser ; Helmut Schmid

Abstract: We propose a novel model to automatically extract transliteration pairs from parallel corpora. Our model is efficient, language pair independent and mines transliteration pairs in a consistent fashion in both unsupervised and semi-supervised settings. We model transliteration mining as an interpolation of transliteration and non-transliteration sub-models. We evaluate on NEWS 2010 shared task data and on parallel corpora with competitive results.

2 0.27933666 54 acl-2012-Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages

Author: Preslav Nakov ; Jorg Tiedemann

Abstract: We propose several techniques for improving statistical machine translation between closely-related languages with scarce resources. We use character-level translation trained on n-gram-character-aligned bitexts and tuned using word-level BLEU, which we further augment with character-based transliteration at the word level and combine with a word-level translation model. The evaluation on Macedonian-Bulgarian movie subtitles shows an improvement of 2.84 BLEU points over a phrase-based word-level baseline.

3 0.23276511 134 acl-2012-Learning to Find Translations and Transliterations on the Web

Author: Joseph Z. Chang ; Jason S. Chang ; Roger Jyh-Shing Jang

Abstract: By identifying such translation counterparts on the Web, we can cope with the OOV problem. In this paper, we present a new method for learning to find translations and transliterations on the Web for a given term. The approach involves using a small set of terms and translations to obtain mixed-code snippets from a search engine, and automatically annotating the snippets with tags and features for training a conditional random field model. At runtime, the model is used to extract translation candidates for a given term. Preliminary experiments and evaluation show that our method cleanly combines various features, resulting in a system that outperforms previous work.

4 0.12722953 150 acl-2012-Multilingual Named Entity Recognition using Parallel Data and Metadata from Wikipedia

Author: Sungchul Kim ; Kristina Toutanova ; Hwanjo Yu

Abstract: In this paper we propose a method to automatically label multi-lingual data with named entity tags. We build on prior work utilizing Wikipedia metadata and show how to effectively combine the weak annotations stemming from Wikipedia metadata with information obtained through English-foreign language parallel Wikipedia sentences. The combination is achieved using a novel semi-CRF model for foreign sentence tagging in the context of a parallel English sentence. The model outperforms both standard annotation projection methods and methods based solely on Wikipedia metadata.

5 0.11694294 140 acl-2012-Machine Translation without Words through Substring Alignment

Author: Graham Neubig ; Taro Watanabe ; Shinsuke Mori ; Tatsuya Kawahara

Abstract: In this paper, we demonstrate that accurate machine translation is possible without the concept of “words,” treating MT as a problem of transformation between character strings. We achieve this result by applying phrasal inversion transduction grammar alignment techniques to character strings to train a character-based translation model, and using this in the phrase-based MT framework. We also propose a look-ahead parsing algorithm and substring-informed prior probabilities to achieve more effective and efficient alignment. In an evaluation, we demonstrate that character-based translation can achieve results that compare to word-based systems while effectively translating unknown and uncommon words over several language pairs.

6 0.094120294 27 acl-2012-Arabic Retrieval Revisited: Morphological Hole Filling

7 0.080341443 42 acl-2012-Bootstrapping via Graph Propagation

8 0.062525384 83 acl-2012-Error Mining on Dependency Trees

9 0.061490238 179 acl-2012-Smaller Alignment Models for Better Translations: Unsupervised Word Alignment with the l0-norm

10 0.050451092 67 acl-2012-Deciphering Foreign Language by Combining Language Models and Context Vectors

11 0.050140366 81 acl-2012-Enhancing Statistical Machine Translation with Character Alignment

12 0.046986267 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations

13 0.0449747 64 acl-2012-Crosslingual Induction of Semantic Roles

14 0.041586321 1 acl-2012-ACCURAT Toolkit for Multi-Level Alignment and Information Extraction from Comparable Corpora

15 0.040578619 128 acl-2012-Learning Better Rule Extraction with Translation Span Alignment

16 0.040431853 172 acl-2012-Selective Sharing for Multilingual Dependency Parsing

17 0.040268708 207 acl-2012-Unsupervised Morphology Rivals Supervised Morphology for Arabic MT

18 0.039835982 118 acl-2012-Improving the IBM Alignment Models Using Variational Bayes

19 0.038865097 41 acl-2012-Bootstrapping a Unified Model of Lexical and Phonetic Acquisition

20 0.037977651 141 acl-2012-Maximum Expected BLEU Training of Phrase and Lexicon Translation Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.143), (1, -0.049), (2, 0.037), (3, 0.045), (4, 0.098), (5, 0.102), (6, 0.016), (7, -0.059), (8, -0.0), (9, -0.064), (10, 0.018), (11, -0.077), (12, -0.025), (13, -0.01), (14, 0.026), (15, -0.063), (16, -0.042), (17, -0.008), (18, 0.044), (19, 0.194), (20, -0.183), (21, 0.051), (22, 0.194), (23, -0.085), (24, -0.138), (25, 0.129), (26, -0.092), (27, -0.052), (28, -0.093), (29, 0.181), (30, -0.184), (31, -0.07), (32, -0.184), (33, -0.099), (34, 0.152), (35, -0.116), (36, 0.116), (37, 0.076), (38, -0.018), (39, 0.004), (40, -0.014), (41, -0.033), (42, 0.207), (43, -0.234), (44, -0.04), (45, 0.038), (46, 0.018), (47, -0.157), (48, 0.114), (49, 0.041)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95024568 20 acl-2012-A Statistical Model for Unsupervised and Semi-supervised Transliteration Mining

Author: Hassan Sajjad ; Alexander Fraser ; Helmut Schmid

Abstract: We propose a novel model to automatically extract transliteration pairs from parallel corpora. Our model is efficient, language pair independent and mines transliteration pairs in a consistent fashion in both unsupervised and semi-supervised settings. We model transliteration mining as an interpolation of transliteration and non-transliteration sub-models. We evaluate on NEWS 2010 shared task data and on parallel corpora with competitive results.

2 0.62268716 54 acl-2012-Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages

Author: Preslav Nakov ; Jorg Tiedemann

Abstract: We propose several techniques for improving statistical machine translation between closely-related languages with scarce resources. We use character-level translation trained on n-gram-character-aligned bitexts and tuned using word-level BLEU, which we further augment with character-based transliteration at the word level and combine with a word-level translation model. The evaluation on Macedonian-Bulgarian movie subtitles shows an improvement of 2.84 BLEU points over a phrase-based word-level baseline.

3 0.48858213 134 acl-2012-Learning to Find Translations and Transliterations on the Web

Author: Joseph Z. Chang ; Jason S. Chang ; Roger Jyh-Shing Jang

Abstract: By identifying such translation counterparts on the Web, we can cope with the OOV problem. In this paper, we present a new method for learning to find translations and transliterations on the Web for a given term. The approach involves using a small set of terms and translations to obtain mixed-code snippets from a search engine, and automatically annotating the snippets with tags and features for training a conditional random field model. At runtime, the model is used to extract translation candidates for a given term. Preliminary experiments and evaluation show that our method cleanly combines various features, resulting in a system that outperforms previous work.

4 0.36473942 27 acl-2012-Arabic Retrieval Revisited: Morphological Hole Filling

Author: Kareem Darwish ; Ahmed Ali

Abstract: Due to Arabic’s morphological complexity, Arabic retrieval benefits greatly from morphological analysis – particularly stemming. However, the best known stemming does not handle linguistic phenomena such as broken plurals and malformed stems. In this paper we propose a model of character-level morphological transformation that is trained using Wikipedia hypertext to page title links. The use of our model yields statistically significant improvements in Arabic retrieval over the use of the best statistical stemming technique. The technique can potentially be applied to other languages.

5 0.34735599 150 acl-2012-Multilingual Named Entity Recognition using Parallel Data and Metadata from Wikipedia

Author: Sungchul Kim ; Kristina Toutanova ; Hwanjo Yu

Abstract: In this paper we propose a method to automatically label multi-lingual data with named entity tags. We build on prior work utilizing Wikipedia metadata and show how to effectively combine the weak annotations stemming from Wikipedia metadata with information obtained through English-foreign language parallel Wikipedia sentences. The combination is achieved using a novel semi-CRF model for foreign sentence tagging in the context of a parallel English sentence. The model outperforms both standard annotation projection methods and methods based solely on Wikipedia metadata.

6 0.30714861 83 acl-2012-Error Mining on Dependency Trees

7 0.29086307 42 acl-2012-Bootstrapping via Graph Propagation

8 0.28426355 140 acl-2012-Machine Translation without Words through Substring Alignment

9 0.24408323 26 acl-2012-Applications of GPC Rules and Character Structures in Games for Learning Chinese Characters

10 0.23844182 1 acl-2012-ACCURAT Toolkit for Multi-Level Alignment and Information Extraction from Comparable Corpora

11 0.20927413 81 acl-2012-Enhancing Statistical Machine Translation with Character Alignment

12 0.20392783 137 acl-2012-Lemmatisation as a Tagging Task

13 0.20050012 11 acl-2012-A Feature-Rich Constituent Context Model for Grammar Induction

14 0.19482729 179 acl-2012-Smaller Alignment Models for Better Translations: Unsupervised Word Alignment with the l0-norm

15 0.19314829 49 acl-2012-Coarse Lexical Semantic Annotation with Supersenses: An Arabic Case Study

16 0.19028705 67 acl-2012-Deciphering Foreign Language by Combining Language Models and Context Vectors

17 0.18583833 163 acl-2012-Prediction of Learning Curves in Machine Translation

18 0.1779619 74 acl-2012-Discriminative Pronunciation Modeling: A Large-Margin, Feature-Rich Approach

19 0.17338656 41 acl-2012-Bootstrapping a Unified Model of Lexical and Phonetic Acquisition

20 0.17146353 28 acl-2012-Aspect Extraction through Semi-Supervised Modeling


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(25, 0.024), (26, 0.063), (28, 0.061), (30, 0.018), (37, 0.035), (39, 0.058), (57, 0.019), (74, 0.029), (82, 0.023), (84, 0.019), (85, 0.025), (89, 0.28), (90, 0.145), (92, 0.037), (94, 0.023), (99, 0.041)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.74967068 124 acl-2012-Joint Inference of Named Entity Recognition and Normalization for Tweets

Author: Xiaohua Liu ; Ming Zhou ; Xiangyang Zhou ; Zhongyang Fu ; Furu Wei

Abstract: Tweets represent a critical source of fresh information, in which named entities occur frequently with rich variations. We study the problem of named entity normalization (NEN) for tweets. Two main challenges are the errors propagated from named entity recognition (NER) and the dearth of information in a single tweet. We propose a novel graphical model to simultaneously conduct NER and NEN on multiple tweets to address these challenges. Particularly, our model introduces a binary random variable for each pair of words with the same lemma across similar tweets, whose value indicates whether the two related words are mentions of the same entity. We evaluate our method on a manually annotated data set, and show that our method outperforms the baseline that handles these two tasks separately, boosting the F1 from 80.2% to 83.6% for NER, and the Accuracy from 79.4% to 82.6% for NEN, respectively.

same-paper 2 0.70311201 20 acl-2012-A Statistical Model for Unsupervised and Semi-supervised Transliteration Mining

Author: Hassan Sajjad ; Alexander Fraser ; Helmut Schmid

Abstract: We propose a novel model to automatically extract transliteration pairs from parallel corpora. Our model is efficient, language pair independent and mines transliteration pairs in a consistent fashion in both unsupervised and semi-supervised settings. We model transliteration mining as an interpolation of transliteration and non-transliteration sub-models. We evaluate on NEWS 2010 shared task data and on parallel corpora with competitive results.

3 0.69170886 137 acl-2012-Lemmatisation as a Tagging Task

Author: Andrea Gesmundo ; Tanja Samardzic

Abstract: We present a novel approach to the task of word lemmatisation. We formalise lemmatisation as a category tagging task, by describing how a word-to-lemma transformation rule can be encoded in a single label and how a set of such labels can be inferred for a specific language. In this way, a lemmatisation system can be trained and tested using any supervised tagging model. In contrast to previous approaches, the proposed technique allows us to easily integrate relevant contextual information. We test our approach on eight languages reaching a new state-of-the-art level for the lemmatisation task.

4 0.55410588 123 acl-2012-Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT

Author: Patrick Simianer ; Stefan Riezler ; Chris Dyer

Abstract: With a few exceptions, discriminative training in statistical machine translation (SMT) has been content with tuning weights for large feature sets on small development data. Evidence from machine learning indicates that increasing the training sample size results in better prediction. The goal of this paper is to show that this common wisdom can also be brought to bear upon SMT. We deploy local features for SCFG-based SMT that can be read off from rules at runtime, and present a learning algorithm that applies ‘1/‘2 regularization for joint feature selection over distributed stochastic learning processes. We present experiments on learning on 1.5 million training sentences, and show significant improvements over tuning discriminative models on small development sets.

5 0.54929352 45 acl-2012-Capturing Paradigmatic and Syntagmatic Lexical Relations: Towards Accurate Chinese Part-of-Speech Tagging

Author: Weiwei Sun ; Hans Uszkoreit

Abstract: From the perspective of structural linguistics, we explore paradigmatic and syntagmatic lexical relations for Chinese POS tagging, an important and challenging task for Chinese language processing. Paradigmatic lexical relations are explicitly captured by word clustering on large-scale unlabeled data and are used to design new features to enhance a discriminative tagger. Syntagmatic lexical relations are implicitly captured by constituent parsing and are utilized via system combination. Experiments on the Penn Chinese Treebank demonstrate the importance of both paradigmatic and syntagmatic relations. Our linguistically motivated approaches yield a relative error reduction of 18% in total over a stateof-the-art baseline.

6 0.54817206 140 acl-2012-Machine Translation without Words through Substring Alignment

7 0.54810119 41 acl-2012-Bootstrapping a Unified Model of Lexical and Phonetic Acquisition

8 0.54768437 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations

9 0.54713905 11 acl-2012-A Feature-Rich Constituent Context Model for Grammar Induction

10 0.5459429 187 acl-2012-Subgroup Detection in Ideological Discussions

11 0.54497659 148 acl-2012-Modified Distortion Matrices for Phrase-Based Statistical Machine Translation

12 0.5435012 61 acl-2012-Cross-Domain Co-Extraction of Sentiment and Topic Lexicons

13 0.54306453 156 acl-2012-Online Plagiarized Detection Through Exploiting Lexical, Syntax, and Semantic Information

14 0.54191154 217 acl-2012-Word Sense Disambiguation Improves Information Retrieval

15 0.54187763 63 acl-2012-Cross-lingual Parse Disambiguation based on Semantic Correspondence

16 0.54186416 28 acl-2012-Aspect Extraction through Semi-Supervised Modeling

17 0.54174542 127 acl-2012-Large-Scale Syntactic Language Modeling with Treelets

18 0.54117799 147 acl-2012-Modeling the Translation of Predicate-Argument Structure for SMT

19 0.53949106 72 acl-2012-Detecting Semantic Equivalence and Information Disparity in Cross-lingual Documents

20 0.53949028 130 acl-2012-Learning Syntactic Verb Frames using Graphical Models