acl acl2011 acl2011-197 knowledge-graph by maker-knowledge-mining

197 acl-2011-Latent Class Transliteration based on Source Language Origin


Source: pdf

Author: Masato Hagiwara ; Satoshi Sekine

Abstract: Transliteration, a rich source of proper noun spelling variations, is usually recognized by phonetic- or spelling-based models. However, a single model cannot deal with different words from different language origins, e.g., “get” in “piaget” and “target.” Li et al. (2007) propose a method which explicitly models and classifies the source language origins and switches transliteration models accordingly. This model, however, requires an explicitly tagged training set with language origins. We propose a novel method which models language origins as latent classes. The parameters are learned from a set of transliterated word pairs via the EM algorithm. The experimental results of the transliteration task of Western names to Japanese show that the proposed model can achieve higher accuracy compared to the conventional models without latent classes.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Transliteration, a rich source of proper noun spelling variations, is usually recognized by phonetic- or spelling-based models. [sent-4, score-0.155]

2 However, a single model cannot deal with different words from different language origins, e. [sent-5, score-0.038]

3 (2007) propose a method which explicitly models and classifies the source language origins and switches transliteration models accordingly. [sent-9, score-1.081]

4 This model, however, requires an explicitly tagged training set with language origins. [sent-10, score-0.085]

5 We propose a novel method which models language origins as latent classes. [sent-11, score-0.527]

6 The parameters are learned from a set of transliterated word pairs via the EM algorithm. [sent-12, score-0.349]

7 The experimental results of the transliteration task of Western names to Japanese show that the proposed model can achieve higher accuracy compared to the conventional models without latent classes. [sent-13, score-0.815]

8 Transliteration (e.g., “バラク オバマ baraku obama / Barack Obama”) is phonetic translation between languages with different writing systems. [sent-16, score-0.067]

9 Words are often transliterated when imported into differet languages, which is a major cause of spelling variations of proper nouns in Japanese and many other languages. [sent-17, score-0.452]

10 Accurate transliteration is also the key to robust machine translation systems. [sent-18, score-0.518]

11 Phonetic-based rewriting models (Knight and Graehl, 1998) and spelling-based supervised models (Brill and Moore, 2000) have been proposed for this task. [sent-19, score-0.041]

12 These methods usually learn a single model given a training set. [sent-24, score-0.038]

13 For example, the “get” parts in “piaget / ピアジェ piaje” (French origin) and “target / ターゲット tāgetto” (English origin) may differ in how they are transliterated depending on their origins. [sent-26, score-0.297]

14 Li et al. (2007) tackled this issue by proposing a class transliteration model, which explicitly models and classifies origins such as languages and genders, and switches the corresponding transliteration models. [sent-28, score-1.676]

15 This method requires training sets of transliterated word pairs with language origin. [sent-29, score-0.349]

16 However, it is difficult to obtain such tagged data, especially for proper nouns, a rich source of transliterated words. [sent-30, score-0.439]

17 In addition, the explicitly tagged language origins are not necessarily helpful for loanwords. [sent-31, score-0.484]

18 For example, the word “spaghetti” (Italian origin) can also be found in an English dictionary, but applying an English model can lead to unwanted results. [sent-32, score-0.038]

19 In this paper, we propose a latent class transliteration model, which models the source language origin as unobservable latent classes and applies appropriate transliteration models to given transliteration pairs. [sent-33, score-2.216]

20 The model parameters are learned via the EM algorithm from training sets of transliterated pairs. [sent-34, score-0.335]

21 We expect that, for example, a latent class which is mostly occupied by Italian words would be assigned to “spaghetti / スパゲティ supageti” and the pair will be correctly recognized. [sent-35, score-0.241]

22 In the evaluation experiments, we evaluated the accuracy in estimating a corresponding Japanese transliteration given an unknown foreign word. [sent-36, score-0.581]

23 The results showed that the proposed model achieves higher accuracy than conventional models without latent classes. [sent-39, score-0.215]

24 Related research includes Llitjós and Black (2001), where it is shown that source language origins may improve the pronunciation of proper nouns in text-to-speech systems. [sent-40, score-0.63]

25 Another one by Ahmad and Kondrak (2005) estimates character-based error probabilities from query logs via the EM algorithm. [sent-41, score-0.083]

26 This model is less general than ours because it only deals with character-based error probability. [sent-42, score-0.038]

27 2 Alpha-Beta Model We adopted the alpha-beta model (Brill and Moore, 2000), which directly models the string substitution probabilities of transliterated pairs, as the base model in this paper. [sent-43, score-0.506]

28 This model is an extension to the conventional edit distance, and gives probabilities to general string substitutions in the form of α → β (α, β are strings of any length). [sent-44, score-0.179]

29 The whole probability of rewriting word s into t is then P(t|s) = max over partitions ∏_i P(α_i → β_i), i.e., the product of substitution probabilities over the best partition. [sent-45, score-0.041]

30 In practice, we conditioned P(α → β) by the position of α in words, i.e., whether it appears at the beginning, in the middle, or at the end of the word. [sent-48, score-0.041]

31 This conditioning is simply omitted in the equations in this paper. [sent-51, score-0.03]
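As an illustration of the scoring described above, the best-partition probability can be computed with a small dynamic program over segmentations of s and t. This is a minimal sketch: the substitution table and window size below are toy stand-ins, not learned values.

```python
from functools import lru_cache

# Hypothetical substitution probabilities P(alpha -> beta); keys are
# (source substring, target substring) with length up to W.
SUBST = {
    ("g", "g"): 0.9, ("e", "e"): 0.8, ("t", "t"): 0.9,
    ("et", "e"): 0.1,  # e.g., a French-style silent "t"
}
W = 2  # maximum substring length

def alpha_beta_score(s: str, t: str) -> float:
    """P(t|s) under the alpha-beta model: max over partitions of the
    product of substring substitution probabilities."""
    @lru_cache(maxsize=None)
    def best(i: int, j: int) -> float:
        if i == len(s) and j == len(t):
            return 1.0
        score = 0.0
        for di in range(0, W + 1):
            for dj in range(0, W + 1):
                if di == dj == 0:
                    continue  # must consume at least one character
                if i + di > len(s) or j + dj > len(t):
                    continue
                a, b = s[i:i + di], t[j:j + dj]
                if (a, b) in SUBST:
                    score = max(score, SUBST[(a, b)] * best(i + di, j + dj))
        return score
    return best(0, 0)
```

For example, `alpha_beta_score("get", "ge")` follows the partition g→g, et→e.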

32 The substitution probabilities P(α → β) are learned from transliterated pairs. [sent-52, score-0.133]

33 Firstly, we obtain an edit operation sequence using the normal DP for edit distance computation. [sent-53, score-0.169]

34 Secondly, non-match operations are merged with adjacent edit operations. [sent-58, score-0.068]

35 The maximum length of the merged substitution pairs is limited to W. [sent-59, score-0.207]

36 When W = 2, for example, the first non-match operation ε → u is merged with one operation on the left and right, producing f → fu and l → ur. [sent-60, score-0.17]

37 Finally, substitution probabilities are calculated as relative frequencies of all substitution operations created in this way. [sent-61, score-0.135]

38 Note that the minimum edit operation sequence is not unique, so we take the averaged frequencies over all the possible minimum sequences. [sent-62, score-0.112]
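The learning procedure above (DP alignment, merging non-match operations with their neighbours up to length W, then relative-frequency counts) can be sketched as follows. Unlike the description above, this sketch backtraces a single minimal edit sequence rather than averaging over all of them.

```python
from collections import defaultdict

def edit_ops(s, t):
    """One minimum edit-operation sequence via standard Levenshtein DP.
    Each op is a (source substring, target substring) pair; '' marks ins/del."""
    n, m = len(s), len(t)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1): d[i][0] = i
    for j in range(m + 1): d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s[i-1] == t[j-1] else 1
            d[i][j] = min(d[i-1][j-1] + cost, d[i-1][j] + 1, d[i][j-1] + 1)
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i-1][j-1] + (0 if s[i-1] == t[j-1] else 1):
            ops.append((s[i-1], t[j-1])); i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i-1][j] + 1:
            ops.append((s[i-1], "")); i -= 1   # deletion
        else:
            ops.append(("", t[j-1])); j -= 1   # insertion
    return ops[::-1]

def substitution_counts(pairs, W=2):
    """Count alpha->beta substitutions, merging each non-match operation
    with its left and right neighbours (merged length limited by W)."""
    counts = defaultdict(float)
    for s, t in pairs:
        ops = edit_ops(s, t)
        for k, (a, b) in enumerate(ops):
            counts[(a, b)] += 1
            if a != b:  # non-match: also count merged neighbours
                if k > 0:
                    la, lb = ops[k-1]
                    if len(la + a) <= W and len(lb + b) <= W:
                        counts[(la + a, lb + b)] += 1
                if k + 1 < len(ops):
                    ra, rb = ops[k+1]
                    if len(a + ra) <= W and len(b + rb) <= W:
                        counts[(a + ra, b + rb)] += 1
    return counts
```

Normalizing these counts per source side α yields the P(α → β) table used by the scoring model.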

39 3 Class Transliteration Model The alpha-beta model showed better performance in tasks such as spelling correction (Brill and Moore, 2000), transliteration (Brill et al., 2001), and query alteration (Ahmad and Kondrak, 2005). [sent-63, score-0.609]

40 However, the substitution probabilities learned by this model are simply the monolithic average of training set statistics, and cannot be switched depending on the source language origin of given pairs, as explained in Section 1. [sent-65, score-0.466]

41 Transliteration of Indo-European names such as “亜歴山 / Alexandra” can be addressed by the Mandarin pronunciation (Pinyin) “Ya-Li-Shan-Da,” while Japanese names such as “山本 / Yamamoto” can only be addressed by considering the Japanese pronunciation, not the Chinese pronunciation “Shan-Ben.” [sent-68, score-0.35]

42 This model can be interpreted as firstly computing the class probability distribution P(c|s) and then taking a weighted sum of P(t|s, c) with regard to the estimated class c and the target t. [sent-73, score-0.321]

43 Note that this weighted sum can be regarded as doing soft-clustering of the input s into classes with probabilities. [sent-74, score-0.063]

44 Alternatively, we can employ hard-clustering by taking one class such that c∗ = arg max_{l,g} P(l, g|s) and computing the transliteration probability by: P_hard(t|s) ∝ P(t|s, c∗). (3) [sent-75, score-0.631]
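The soft and hard variants reduce to a few lines of code; the class posteriors P(c|s) and per-class probabilities P(t|s,c) below are hypothetical toy values for a fixed pair (s, t), not the paper's learned models.

```python
# Toy stand-ins: class posteriors P(c|s) for a fixed source s,
# and per-class transliteration probabilities P(t|s,c) for a fixed (s, t).
P_CLASS = {"en": 0.7, "fr": 0.3}
P_T_GIVEN_SC = {"en": 0.2, "fr": 0.9}

def p_soft():
    """Soft clustering: P(t|s) = sum_c P(c|s) * P(t|s,c)."""
    return sum(P_CLASS[c] * P_T_GIVEN_SC[c] for c in P_CLASS)

def p_hard():
    """Hard clustering: pick c* = argmax_c P(c|s), then use P(t|s,c*)."""
    c_star = max(P_CLASS, key=P_CLASS.get)
    return P_T_GIVEN_SC[c_star]
```

Note how the two disagree when the most probable class is not the best transliterator: soft weighting still lets the French model contribute.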

45 4 Latent Class Transliteration Model The model explained in the previous section integrates different transliteration models for words with different language origins, but it requires us to build a class detection model from training pairs explicitly tagged with language origins. [sent-76, score-0.913]

46 Instead of assigning an explicit class c to each transliterated pair, we can introduce a random variable z and consider a conditioned string substitution probability P(α → β|z). [sent-77, score-0.549]

47 This latent class z corresponds to the classes of transliterated pairs. [sent-78, score-0.27]

48 These pairs share the same transliteration characteristics, such as language origins and genders. [sent-80, score-0.969]

49 Although z is not directly observable from sets of transliterated words, we can compute it via EM algorithm so that it maximizes the training set likelihood as shown below. [sent-81, score-0.297]

50 X_train is the training set consisting of transliterated pairs {(s_n, t_n) | 1 ≤ n ≤ N}, N is the number of training pairs, and K is the number of latent classes. [sent-83, score-0.477]

51 The final transliteration probability is given by: P_latent(t|s) = Σ_z P(t, z|s) = Σ_z P(z|s) P(t|s, z) ∝ Σ_z π_z P(s|z) P(t|s, z). (7) The proposed model cannot explicitly model P(s|z), which is in practice approximated by P(t|s, z). [sent-85, score-0.639]
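Under the approximation noted here (P(s|z) taken as P(t|s,z)), the mixture score and the EM E-step responsibilities can be sketched as follows; the per-class pair probabilities P(t|s,z) and mixture weights below are hypothetical toy values for K = 2.

```python
# Hypothetical per-latent-class probabilities P(t|s,z) for K = 2 classes,
# and mixture weights pi_z.
PI = [0.5, 0.5]
P_PAIR = {("spaghetti", "スパゲティ"): [0.9, 0.1],
          ("target", "ターゲット"): [0.2, 0.8]}

def p_latent(s, t):
    """P_latent(t|s) ∝ sum_z pi_z * P(s|z) * P(t|s,z), with P(s|z)
    approximated by P(t|s,z) as described in the text."""
    p = P_PAIR[(s, t)]
    return sum(PI[z] * p[z] * p[z] for z in range(len(PI)))

def e_step(s, t):
    """EM responsibilities: gamma_z = pi_z * P(t|s,z) / sum_z' pi_z' * P(t|s,z')."""
    p = P_PAIR[(s, t)]
    joint = [PI[z] * p[z] for z in range(len(PI))]
    total = sum(joint)
    return [j / total for j in joint]
```

The M-step would then re-estimate π_z and the per-class substitution tables from these responsibilities, which is omitted here.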

52 5 Experiments Here we evaluate the performance of the transliteration models as an information retrieval task, where the model ranks target t0 for a given source s0, based on the model P(t0|s0). [sent-89, score-0.63]

53 We used all the t′_n in the test set X_test = {(s′_n, t′_n) | 1 ≤ n ≤ M} as target candidates and s′_n as queries. [sent-90, score-0.03]

54 Five-fold cross-validation was adopted when learning the models; that is, the datasets described in the next subsections were equally split into five folds, of which four were used for training and one for testing. [sent-91, score-0.058]

55 The mean reciprocal rank (MRR) of top 10 ranked candidates was used as a performance measure. [sent-92, score-0.03]
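A minimal MRR@10 implementation consistent with this setup; the function name and data layout are ours, not from the paper.

```python
def mrr_at_10(ranked_lists, gold):
    """Mean reciprocal rank over queries, counting only the top 10 candidates.
    ranked_lists: query -> ranked candidate list; gold: query -> correct target."""
    total = 0.0
    for query, candidates in ranked_lists.items():
        for rank, cand in enumerate(candidates[:10], start=1):
            if cand == gold[query]:
                total += 1.0 / rank
                break  # gold outside the top 10 contributes 0
    return total / len(ranked_lists)
```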

56 5.1 Experimental Settings Dataset 1: Western Person Name List This dataset contains 6,717 Western person names and their Katakana readings taken from a European name website 欧羅巴人名録, consisting of German (de), English (en), and French (fr) person name pairs. [sent-94, score-0.305]

57 The numbers of pairs for these languages are 2,470, 2,492, and 1,747, respectively. [sent-95, score-0.084]

58 Dataset 2: Western Proper Noun List This dataset contains 11,323 proper nouns and their Japanese counterparts extracted from Wikipedia interwiki. [sent-98, score-0.216]

59 The languages and numbers of pairs contained are: German (de): 2,003, English (en): 5,530, Spanish (es): 781, French (fr): 1,918, and Italian (it). [sent-99, score-0.084]

60 Linked English and Japanese titles are extracted, unless the Japanese title contains any characters other than Katakana, hyphens, or middle dots. [sent-105, score-0.127]

61 The language origin of a title was detected by checking whether appropriate country names are included in the first sentence of the Japanese article. [sent-106, score-0.37]

62 If the sentence contains any of Spain, Argentina, Mexico, Peru, or Chile plus “の”(of), it is marked as Spanish origin. [sent-108, score-0.038]

63 If they contain any of America, England, Australia or Canada plus “の”(of), it is marked as English origin. [sent-109, score-0.038]
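The country-name heuristic above can be sketched as a lookup over the article's first sentence. We assume the country names appear in Japanese in the Wikipedia text, and the lists below are abridged from the rules stated in the text.

```python
# Abridged country lists per origin (assumed to appear in Japanese articles).
ORIGIN_COUNTRIES = {
    "es": ["スペイン", "アルゼンチン", "メキシコ", "ペルー", "チリ"],  # Spain, Argentina, ...
    "en": ["アメリカ", "イギリス", "オーストラリア", "カナダ"],       # America, England, ...
}

def detect_origin(first_sentence: str):
    """Mark the origin if 'country name + の' (of) appears in the first sentence."""
    for origin, countries in ORIGIN_COUNTRIES.items():
        if any(c + "の" in first_sentence for c in countries):
            return origin
    return None  # no rule fired; origin stays unknown
```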

64 The latter parts of Japanese/foreign titles starting from “,” or “(” were removed. [sent-110, score-0.089]

65 Japanese and foreign titles were split into chunks by middle dots and spaces, respectively, and the resulting chunks were aligned. [sent-111, score-0.282]

66 Title pairs with different numbers of chunks, or ones with a foreign-character length of less than 3, were excluded. [sent-112, score-0.115]

67 All accent marks were normalized (German “ß” was converted to “ss”). [sent-113, score-0.093]
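The chunking, alignment, filtering, and accent-normalization steps above can be sketched as follows (middle dot "・" for Japanese titles, spaces for foreign ones). The helper name is ours, and applying the length filter per chunk is an assumption.

```python
import unicodedata

def align_title_pair(ja_title: str, foreign_title: str):
    """Split both titles into chunks, align them one-to-one, and filter."""
    # Normalize accent marks: "ß" -> "ss", then strip combining diacritics
    # (e.g. "é" -> "e") via NFKD decomposition.
    foreign_title = foreign_title.replace("ß", "ss")
    foreign_title = "".join(c for c in unicodedata.normalize("NFKD", foreign_title)
                            if not unicodedata.combining(c))
    ja_chunks = ja_title.split("・")
    fo_chunks = foreign_title.split(" ")
    if len(ja_chunks) != len(fo_chunks):
        return []  # different numbers of chunks: pair excluded
    # Keep chunks whose foreign side has at least 3 characters.
    return [(f, j) for f, j in zip(fo_chunks, ja_chunks) if len(f) >= 3]
```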

68 Implementation Details P(c|s) of the class transliteration model was calculated by a character 3-gram language model with Witten-Bell discounting. [sent-114, score-0.707]
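A character 3-gram model with interpolated Witten-Bell smoothing can be sketched as below; this is our minimal reading of the setup, where one such model would be trained per class and P(c|s) obtained by Bayes' rule over the per-class scores.

```python
from collections import defaultdict

class WittenBellCharLM:
    """Character 3-gram LM with interpolated Witten-Bell smoothing (a sketch)."""
    def __init__(self, words, order=3):
        self.order = order
        self.counts = defaultdict(lambda: defaultdict(int))  # history -> char -> count
        self.vocab = set()
        for w in words:
            chars = ["<s>"] * (order - 1) + list(w) + ["</s>"]
            self.vocab.update(chars)
            for i in range(order - 1, len(chars)):
                for n in range(order):  # collect 1..order-gram counts
                    h = tuple(chars[i - n:i])
                    self.counts[h][chars[i]] += 1

    def prob(self, hist, ch):
        """P(ch | hist): interpolate from uniform up through longer histories,
        with lambda_h = c(h) / (c(h) + T(h)), T(h) = distinct continuation types."""
        p = 1.0 / len(self.vocab)  # uniform base distribution
        for n in range(self.order):
            h = tuple(hist[len(hist) - n:])
            c = self.counts.get(h)
            if c:
                total, types = sum(c.values()), len(c)
                lam = total / (total + types)
                p = lam * c.get(ch, 0) / total + (1 - lam) * p
        return p
```

Scoring a word multiplies `prob` over its characters; the class with the highest (prior-weighted) score is the detected origin.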

69 Japanese Katakana was all converted to Hepburn-style Roman characters, with minor changes so as to incorporate foreign pronunciations such as “wi / ウィ” and “we / ウェ.” [sent-115, score-0.105]

70 The maximum length of substitution pairs W described in Section 2 was set to W = 2. [sent-119, score-0.15]

71 The EM algorithm parameters P(α → β|z) were initialized to the probability P(α → β) of the alpha-beta model plus Gaussian noise, and π_k were uniformly initialized to 1/K. [sent-120, score-0.096]

72 5.2 Results Language Class Detection We firstly show the precision of language detection using the class transliteration model P(c|s) and Equation (3). [sent-123, score-0.232]

73 [Table residue: per-language detection precision (de, en, es, fr, it); Table 3: Model Performance Comparison (MRR; %)] [sent-136, score-0.556]

74 The overall precision is relatively lower than, e.g., Li et al. [sent-139, score-0.032]

75 (2007), which is attributed to the fact that European names can be quite ambiguous. [sent-142, score-0.082]

76 For example, “Charles” can read “チャールズ chāruzu” or “シャルル sharuru.” The precision on Dataset 2 is even worse because it has more classes. [sent-144, score-0.032]

77 We can also use the result of the latent class transliteration for clustering by regarding k∗ = arg max_k γ_nk as the class of the pair. [sent-145, score-0.872]

78 Transliteration Model Comparison We show the evaluation results of the transliteration candidate retrieval task using each of P_AB(t|s) (AB), P_hard(t|s) (HARD), P_soft(t|s) (SOFT), and P_latent(t|s) (LATENT) (Table 3). [sent-148, score-0.518]

79 HARD, on the other hand, shows lower performance, due mainly to the low precision of class detection. [sent-152, score-0.145]

80 The detection errors are alleviated in SOFT by considering the weighted sum of transliteration probabilities. [sent-153, score-0.518]

81 We also conducted the evaluation based on the top-1 accuracy of transliteration candidates. [sent-154, score-0.518]

82 Because we found out that the tendency of the results is the same as for MRR, we simply omitted the result in this paper. [sent-155, score-0.064]

83 The simplest model AB incorrectly reads “Felix / フェリックス” and “Read / リード” as “フィリス Firisu” and “レアード Reādo.” [sent-156, score-0.038]

84 This may be because the English pronunciations “x / ックス kkusu” and “ea / イー ī” are influenced by other languages. [sent-157, score-0.093]

85 SOFT and LATENT can find correct candidates for these pairs. [sent-158, score-0.03]

86 For some pairs such as “イルダ Iruda” (English), it is difficult to find correct counterparts even by LATENT. [sent-167, score-0.041]

87 Finally, we investigated the effect of the number of latent classes K. [sent-168, score-0.157]

88 The performance is higher when K is slightly smaller than the number of language origins in the dataset. [sent-169, score-0.472]

89 6 Conclusion In this paper, we proposed a latent class transliteration method which models source language origins as latent classes. [sent-172, score-1.322]

90 The model parameters are learned from sets of transliterated words with different origins via the EM algorithm. [sent-173, score-0.734]

91 The experimental results of the Western person/proper name transliteration tasks show that, even though the proposed model does not rely on explicit language origins, it achieves higher accuracy than conventional methods that use explicit language origins. [sent-174, score-0.746]

92 Considering sources other than Western languages, as well as targets other than Japanese, is left as future work. [sent-175, score-0.032]

93 Learning a spelling error model from search query logs. [sent-178, score-0.139]

94 Automatically harvesting katakana-english term pairs from search engine query logs. [sent-189, score-0.1]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('transliteration', 0.518), ('origins', 0.399), ('transliterated', 0.297), ('origin', 0.199), ('japanese', 0.18), ('western', 0.169), ('latent', 0.128), ('rakuten', 0.117), ('class', 0.113), ('brill', 0.112), ('nk', 0.101), ('hagiwara', 0.1), ('substitution', 0.098), ('pronunciation', 0.093), ('titles', 0.089), ('names', 0.082), ('em', 0.077), ('italian', 0.076), ('dataset', 0.073), ('mrr', 0.072), ('flexti', 0.066), ('katakana', 0.066), ('masato', 0.066), ('pab', 0.066), ('piaget', 0.066), ('platent', 0.066), ('soft', 0.066), ('proper', 0.066), ('foreign', 0.063), ('alteration', 0.059), ('spaghetti', 0.059), ('edit', 0.057), ('operation', 0.055), ('moore', 0.053), ('spelling', 0.053), ('pairs', 0.052), ('accent', 0.051), ('switches', 0.051), ('sn', 0.05), ('conventional', 0.049), ('firstly', 0.049), ('query', 0.048), ('french', 0.046), ('chunks', 0.046), ('explicitly', 0.045), ('tn', 0.045), ('ahmad', 0.044), ('mai', 0.043), ('south', 0.043), ('li', 0.042), ('converted', 0.042), ('german', 0.042), ('counterparts', 0.041), ('rewriting', 0.041), ('conditioned', 0.041), ('tagged', 0.04), ('person', 0.039), ('middle', 0.038), ('detection', 0.038), ('model', 0.038), ('plus', 0.038), ('operations', 0.037), ('nouns', 0.036), ('name', 0.036), ('avenue', 0.036), ('fu', 0.036), ('source', 0.036), ('probabilities', 0.035), ('obama', 0.035), ('dp', 0.034), ('fn', 0.034), ('sum', 0.034), ('park', 0.034), ('classifies', 0.032), ('languages', 0.032), ('precision', 0.032), ('merged', 0.031), ('explained', 0.031), ('omitted', 0.03), ('york', 0.03), ('candidates', 0.03), ('amnd', 0.029), ('farooq', 0.029), ('nve', 0.029), ('splitted', 0.029), ('nnk', 0.029), ('hde', 0.029), ('monolithic', 0.029), ('phe', 0.029), ('torifb', 0.029), ('rweo', 0.029), ('unobservable', 0.029), ('tol', 0.029), ('tyo', 0.029), ('kuo', 0.029), ('asse', 0.029), ('cthhe', 0.029), ('ized', 0.029), ('tfhe', 0.029), ('classes', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 197 acl-2011-Latent Class Transliteration based on Source Language Origin

Author: Masato Hagiwara ; Satoshi Sekine

Abstract: Transliteration, a rich source of proper noun spelling variations, is usually recognized by phonetic- or spelling-based models. However, a single model cannot deal with different words from different language origins, e.g., “get” in “piaget” and “target.” Li et al. (2007) propose a method which explicitly models and classifies the source language origins and switches transliteration models accordingly. This model, however, requires an explicitly tagged training set with language origins. We propose a novel method which models language origins as latent classes. The parameters are learned from a set of transliterated word pairs via the EM algorithm. The experimental results of the transliteration task of Western names to Japanese show that the proposed model can achieve higher accuracy compared to the conventional models without latent classes.

2 0.46667951 34 acl-2011-An Algorithm for Unsupervised Transliteration Mining with an Application to Word Alignment

Author: Hassan Sajjad ; Alexander Fraser ; Helmut Schmid

Abstract: We propose a language-independent method for the automatic extraction of transliteration pairs from parallel corpora. In contrast to previous work, our method uses no form of supervision, and does not require linguistically informed preprocessing. We conduct experiments on data sets from the NEWS 2010 shared task on transliteration mining and achieve an F-measure of up to 92%, outperforming most of the semi-supervised systems that were submitted. We also apply our method to English/Hindi and English/Arabic parallel corpora and compare the results with manually built gold standards which mark transliterated word pairs. Finally, we integrate the transliteration module into the GIZA++ word aligner and evaluate it on two word alignment tasks achieving improvements in both precision and recall measured against gold standard word alignments.

3 0.21266842 153 acl-2011-How do you pronounce your name? Improving G2P with transliterations

Author: Aditya Bhargava ; Grzegorz Kondrak

Abstract: Grapheme-to-phoneme conversion (G2P) of names is an important and challenging problem. The correct pronunciation of a name is often reflected in its transliterations, which are expressed within a different phonological inventory. We investigate the problem of using transliterations to correct errors produced by state-of-the-art G2P systems. We present a novel re-ranking approach that incorporates a variety of score and n-gram features, in order to leverage transliterations from multiple languages. Our experiments demonstrate significant accuracy improvements when re-ranking is applied to n-best lists generated by three different G2P programs.

4 0.18135329 232 acl-2011-Nonparametric Bayesian Machine Transliteration with Synchronous Adaptor Grammars

Author: Yun Huang ; Min Zhang ; Chew Lim Tan

Abstract: Machine transliteration is defined as automatic phonetic translation of names across languages. In this paper, we propose synchronous adaptor grammar, a novel nonparametric Bayesian learning approach, for machine transliteration. This model provides a general framework without heuristic or restriction to automatically learn syllable equivalents between languages. The proposed model outperforms the state-of-the-art EMbased model in the English to Chinese transliteration task.

5 0.08615949 11 acl-2011-A Fast and Accurate Method for Approximate String Search

Author: Ziqi Wang ; Gu Xu ; Hang Li ; Ming Zhang

Abstract: This paper proposes a new method for approximate string search, specifically candidate generation in spelling error correction, which is a task as follows. Given a misspelled word, the system finds words in a dictionary, which are most “similar” to the misspelled word. The paper proposes a probabilistic approach to the task, which is both accurate and efficient. The approach includes the use of a log linear model, a method for training the model, and an algorithm for finding the top k candidates. The log linear model is defined as a conditional probability distribution of a corrected word and a rule set for the correction conditioned on the misspelled word. The learning method employs the criterion in candidate generation as loss function. The retrieval algorithm is efficient and is guaranteed to find the optimal k candidates. Experimental results on large scale data show that the proposed approach improves upon existing methods in terms of accuracy in different settings.

6 0.080098405 9 acl-2011-A Cross-Lingual ILP Solution to Zero Anaphora Resolution

7 0.078189485 151 acl-2011-Hindi to Punjabi Machine Translation System

8 0.073049359 18 acl-2011-A Latent Topic Extracting Method based on Events in a Document and its Application

9 0.072108343 258 acl-2011-Ranking Class Labels Using Query Sessions

10 0.071020447 238 acl-2011-P11-2093 k2opt.pdf

11 0.066526853 323 acl-2011-Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections

12 0.065113418 137 acl-2011-Fine-Grained Class Label Markup of Search Queries

13 0.063906789 336 acl-2011-Why Press Backspace? Understanding User Input Behaviors in Chinese Pinyin Input Method

14 0.063452639 161 acl-2011-Identifying Word Translations from Comparable Corpora Using Latent Topic Models

15 0.059458271 94 acl-2011-Deciphering Foreign Language

16 0.057790689 46 acl-2011-Automated Whole Sentence Grammar Correction Using a Noisy Channel Model

17 0.055921409 110 acl-2011-Effective Use of Function Words for Rule Generalization in Forest-Based Translation

18 0.054877467 16 acl-2011-A Joint Sequence Translation Model with Integrated Reordering

19 0.054526694 157 acl-2011-I Thou Thee, Thou Traitor: Predicting Formal vs. Informal Address in English Literature

20 0.051380321 6 acl-2011-A Comprehensive Dictionary of Multiword Expressions


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.156), (1, -0.016), (2, -0.01), (3, 0.061), (4, -0.02), (5, -0.069), (6, 0.037), (7, -0.032), (8, -0.014), (9, 0.073), (10, 0.013), (11, 0.032), (12, 0.028), (13, 0.164), (14, 0.071), (15, 0.037), (16, 0.123), (17, 0.14), (18, 0.35), (19, -0.172), (20, -0.264), (21, 0.197), (22, -0.082), (23, -0.081), (24, -0.202), (25, -0.115), (26, 0.024), (27, -0.057), (28, -0.05), (29, 0.082), (30, -0.088), (31, -0.016), (32, 0.04), (33, 0.01), (34, -0.027), (35, -0.018), (36, -0.051), (37, -0.066), (38, -0.025), (39, -0.059), (40, 0.052), (41, -0.043), (42, 0.005), (43, -0.038), (44, 0.048), (45, 0.002), (46, -0.015), (47, -0.02), (48, -0.021), (49, -0.034)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94231534 197 acl-2011-Latent Class Transliteration based on Source Language Origin

Author: Masato Hagiwara ; Satoshi Sekine

Abstract: Transliteration, a rich source of proper noun spelling variations, is usually recognized by phonetic- or spelling-based models. However, a single model cannot deal with different words from different language origins, e.g., “get” in “piaget” and “target.” Li et al. (2007) propose a method which explicitly models and classifies the source language origins and switches transliteration models accordingly. This model, however, requires an explicitly tagged training set with language origins. We propose a novel method which models language origins as latent classes. The parameters are learned from a set of transliterated word pairs via the EM algorithm. The experimental results of the transliteration task of Western names to Japanese show that the proposed model can achieve higher accuracy compared to the conventional models without latent classes.

2 0.89551771 34 acl-2011-An Algorithm for Unsupervised Transliteration Mining with an Application to Word Alignment

Author: Hassan Sajjad ; Alexander Fraser ; Helmut Schmid

Abstract: We propose a language-independent method for the automatic extraction of transliteration pairs from parallel corpora. In contrast to previous work, our method uses no form of supervision, and does not require linguistically informed preprocessing. We conduct experiments on data sets from the NEWS 2010 shared task on transliteration mining and achieve an F-measure of up to 92%, outperforming most of the semi-supervised systems that were submitted. We also apply our method to English/Hindi and English/Arabic parallel corpora and compare the results with manually built gold standards which mark transliterated word pairs. Finally, we integrate the transliteration module into the GIZA++ word aligner and evaluate it on two word alignment tasks achieving improvements in both precision and recall measured against gold standard word alignments.

3 0.84083211 153 acl-2011-How do you pronounce your name? Improving G2P with transliterations

Author: Aditya Bhargava ; Grzegorz Kondrak

Abstract: Grapheme-to-phoneme conversion (G2P) of names is an important and challenging problem. The correct pronunciation of a name is often reflected in its transliterations, which are expressed within a different phonological inventory. We investigate the problem of using transliterations to correct errors produced by state-of-the-art G2P systems. We present a novel re-ranking approach that incorporates a variety of score and n-gram features, in order to leverage transliterations from multiple languages. Our experiments demonstrate significant accuracy improvements when re-ranking is applied to n-best lists generated by three different G2P programs.

4 0.7399568 232 acl-2011-Nonparametric Bayesian Machine Transliteration with Synchronous Adaptor Grammars

Author: Yun Huang ; Min Zhang ; Chew Lim Tan

Abstract: Machine transliteration is defined as automatic phonetic translation of names across languages. In this paper, we propose synchronous adaptor grammar, a novel nonparametric Bayesian learning approach, for machine transliteration. This model provides a general framework without heuristic or restriction to automatically learn syllable equivalents between languages. The proposed model outperforms the state-of-the-art EMbased model in the English to Chinese transliteration task.

5 0.40841219 151 acl-2011-Hindi to Punjabi Machine Translation System

Author: Vishal Goyal ; Gurpreet Singh Lehal

Abstract: Hindi-Punjabi being closely related language pair (Goyal V. and Lehal G.S., 2008) , Hybrid Machine Translation approach has been used for developing Hindi to Punjabi Machine Translation System. Non-availability of lexical resources, spelling variations in the source language text, source text ambiguous words, named entity recognition and collocations are the major challenges faced while developing this syetm. The key activities involved during translation process are preprocessing, translation engine and post processing. Lookup algorithms, pattern matching algorithms etc formed the basis for solving these issues. The system accuracy has been evaluated using intelligibility test, accuracy test and BLEU score. The hybrid syatem is found to perform better than the constituent systems. Keywords: Machine Translation, Computational Linguistics, Natural Language Processing, Hindi, Punjabi. Translate Hindi to Punjabi, Closely related languages. 1Introduction Machine Translation system is a software designed that essentially takes a text in one language (called the source language), and translates it into another language (called the target language). There are number of approaches for MT like Direct based, Transform based, Interlingua based, Statistical etc. But the choice of approach depends upon the available resources and the kind of languages involved. In general, if the two languages are structurally similar, in particular as regards lexical correspondences, morphology and word order, the case for abstract syntactic analysis seems less convincing. Since the present research work deals with a pair of closely related language 1 Gurpreet Singh Lehal Department of Computer Science Punjabi University, Patiala,India gs lehal @ gmai l com . i.e. Hindi-Punjabi , thus direct word-to-word translation approach is the obvious choice. As some rule based approach has also been used, thus, Hybrid approach has been adopted for developing the system. 
An exhaustive survey has already been given for existing machine translations systems developed so far mentioning their accuracies and limitations. (Goyal V. and Lehal G.S., 2009). 2 System Architecture 2.1 Pre Processing Phase The preprocessing stage is a collection of operations that are applied on input data to make it processable by the translation engine. In the first phase of Machine Translation system, various activities incorporated include text normalization, replacing collocations and replacing proper nouns. 2.2 Text Normalization The variety in the alphabet, different dialects and influence of foreign languages has resulted in spelling variations of the same word. Such variations sometimes can be treated as errors in writing. (Goyal V. and Lehal G.S., 2010). 2.3 Replacing Collocations After passing the input text through text normalization, the text passes through this Collocation replacement sub phase of Preprocessing phase. Collocation is two or more consecutive words with a special behavior. (Choueka :1988). For example, the collocation उ?र ?देश (uttar pradēsh) if translated word to word, will be translated as ਜਵਾਬ ਰਾਜ (javāb rāj) but it must be translated as ਉ?ਤਰ ਪ?ਦਸ਼ੇ (uttar pradēsh). The accuracy of the results for collocation extraction using t-test is not accurate and includes number of such bigrams and trigrams that are not actually collocations. Thus, manually such entries were removed and actual collocations were further extracted. The Portland, POrroecgeoend,in UgSsA o,f 2 t1he Ju AnCeL 2-0H1L1T. 2 ?c 021101 S1y Astessmoc Diaetmioonn fsotr a Ctioonms,p puatagteiosn 1a–l6 L,inguistics correct corresponding Punjabi translation for each extracted collocation is stored in the collocation table of the database. The collocation table of the database consists of 5000 such entries. In this sub phase, the normalized input text is analyzed. 
Each collocation in the database that is found in the input text is replaced with the Punjabi translation of the corresponding collocation. When tested on a corpus containing about 1,00,000 words, only 0.001% collocations were found and replaced during the translation.

Figure 1: Overview of Hindi-Punjabi Machine Translation System

2.4 Replacing Proper Nouns

A great proportion of unseen words consists of proper nouns such as personal names, days of the month, days of the week, country names, city names, bank names, organization names, ocean names, river names, university names etc., and if these are translated word for word, their meaning is changed. Even where the meaning is not affected, this step makes the translation fast: once such words are recognized and stored in the proper noun database, there is no need to decide about their translation or transliteration every time they occur in the input text. This list of proper nouns is self-growing and accurate, and is extended during each translation. To carry out this sub-phase, the system requires a proper noun gazetteer that has been compiled offline. For this task, we have developed an offline module to extract proper nouns from the corpus based on some rules. Also, a named entity recognition module has been developed based on the CRF approach (Sharma R. and Goyal V., 2011b).

2.5 Tokenizer

Tokenizers (also known as lexical analyzers or word segmenters) segment a stream of characters into meaningful units called tokens. The tokenizer takes the text generated by the preprocessing phase as input. Individual words or tokens are extracted and processed to generate their equivalents in the target language. Using space and punctuation marks as delimiters, this module extracts tokens one by one from the text and gives them to the translation engine for analysis until the complete input text is read and processed.
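A delimiter-based tokenizer of the kind described can be sketched in a few lines. This is an illustrative regular-expression version, not the system's actual module; it splits on whitespace and peels off common punctuation marks, including the Devanagari danda (U+0964), as separate tokens.

```python
import re

def tokenize(text):
    # A token is either a run of non-space, non-punctuation characters,
    # or a single punctuation mark (danda, period, comma, etc.).
    return re.findall(r"[^\s\u0964.,!?;:]+|[\u0964.,!?;:]", text)
```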
2.6 Translation Engine

The translation engine is the main component of our Machine Translation system. It takes the tokens generated by the tokenizer as input and outputs the translated tokens in the target language. These translated tokens are concatenated one after another along with the delimiters. The modules included in this phase are explained below one by one.

2.6.1 Identifying Titles and Surnames

A title may be defined as a formal appellation attached to the name of a person or family by virtue of office, rank, hereditary privilege, noble birth, or attainment, or used as a mark of respect. The word next to a title is usually a proper noun, and sometimes a word used as the proper name of a person has its own meaning in the target language. Similarly, a surname may be defined as a name shared in common to identify the members of a family, as distinguished from each member's given name; it is also called a family name or last name, and the word previous to a surname is usually a proper noun. When such a proper noun adjacent to a title or surname is passed through the translation engine, it is translated by the system. This causes system failure, as these proper names should be transliterated instead of translated. For example, consider the Hindi sentence श्रीमान हर्ष जी हमारे यहाँ पधारे। (shrīmān harsh jī hamārē yahāṃ padhārē). In this sentence, हर्ष (harsh) has the meaning "joy", and its equivalent translation in the target language is ਖੁਸ਼ੀ (khushī). Similarly, consider the Hindi sentence प्रकाश सिंह हमारे यहाँ पधारे। (prakāsh siṃh hamārē yahāṃ padhārē). Here, प्रकाश (prakāsh) is acting as a proper noun and must be transliterated, not translated, because सिंह (siṃh) is a surname and the word previous to it is a proper noun. Thus, a small module has been developed for locating such proper nouns. There is also a special character '॰' in Devanagari script to mark abbreviated titles like डा॰ and प्रो॰.
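A minimal sketch of this check is shown below, using romanized tokens and tiny illustrative lists (the system's databases hold 14 titles and 654 surnames). It marks the word after a title, or before a surname, for transliteration rather than translation; the '॰' abbreviation sign (U+0970) is stripped so that abbreviated titles like डा॰ are also matched.

```python
TITLES = {"shriman", "dr"}      # word AFTER a title is a proper noun
SURNAMES = {"singh", "kumar"}   # word BEFORE a surname is a proper noun

def flag_transliterate(tokens):
    # Return one flag per token: True means "transliterate, do not translate".
    flags = [False] * len(tokens)
    for i, tok in enumerate(tokens):
        word = tok.rstrip("\u0970.").lower()  # drop '॰' / '.' abbreviation marks
        if word in TITLES and i + 1 < len(tokens):
            flags[i + 1] = True
        if word in SURNAMES and i > 0:
            flags[i - 1] = True
    return flags
```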
If this module finds this symbol, or an entry from the title or surname database, the word next to the title or previous to the surname, as the case may be, is transliterated rather than translated. The title and surname databases consist of 14 and 654 entries respectively, and can be extended at any time to allow new titles and surnames to be added. This module was tested on a large Hindi corpus, which showed that about 2-5% of the input text, depending upon its domain, consists of proper nouns. Thus, this module plays an important role in translation.

2.6.2 Hindi Morphological Analyzer

This module finds the root word for the token and its morphological features. The morphological analyzer developed by IIT-H has been ported to the Windows platform to make it usable for this system (Goyal V. and Lehal G.S., 2008a).

2.6.3 Word-to-Word Translation Using Lexicon Lookup

If a token is not a title or a surname, it is looked up in the HPDictionary database containing direct Hindi to Punjabi word-to-word translations. If it is found, the entry is used for translation; if no entry is found, the token is sent to the next sub-phase for processing. The HPDictionary database consists of 54,127 entries and can be extended at any time to allow new entries to be added.

2.6.4 Resolving Ambiguity

Among the various approaches to disambiguation, the most appropriate for determining the correct meaning of a Hindi word in a particular usage in our Machine Translation system is to examine its context using an N-gram approach. After analyzing the past experience of various authors, we have chosen the values of n to be 3 and 2, i.e. trigram and bigram approaches respectively. Trigrams are further categorized into three different types. The first category of trigram consists of the context of one word previous to and one word next to the ambiguous word. The second category of trigram consists of the context of the two adjacent previous words to the ambiguous word.
The third category of trigram consists of the context of the two adjacent next words to the ambiguous word. Bigrams are likewise categorized into two categories: the first consists of the context of one word previous to the ambiguous word, and the second of one context word next to the ambiguous word. For this purpose, a Hindi corpus consisting of about 2 million words was collected from different sources such as online daily newspapers, blogs, Prem Chand stories, Yashwant Jain stories, articles etc., and the most common ambiguous words were identified. We have found a list of 75 ambiguous words, of which the most frequent are से (sē) and और (aur) (Goyal V. and Lehal G.S., 2011).

2.6.5 Handling Unknown Words

2.6.5.1 Word Inflectional Analysis and Generation

In linguistics, a suffix (also sometimes called a postfix or ending) is an affix which is placed after the stem of a word. Common examples are case endings, which indicate the grammatical case of nouns or adjectives, and verb endings. Hindi is a (relatively) free word-order and highly inflectional language. Because of their common origin, both languages have very similar structure and grammar; the difference lies only in the words and in pronunciation. For example, the word for boy is लड़का in Hindi and ਮੁੰਡਾ in Punjabi, and sometimes there is no difference at all, as with घर (ghar) and ਘਰ (ghar). The inflectional forms of such word pairs in Hindi and Punjabi are also similar. In this activity, inflectional analysis without using morphology is performed for all those tokens that are not processed by the morphological analysis module, following a rule-based approach. When a token is passed to this sub-phase, if any pattern of a regular expression (inflection rule) matches the token, that rule is applied and the token's equivalent translation in Punjabi is generated based on the matched rule(s).
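Such suffix-swapping inflection rules can be sketched as regular-expression pairs. The two rules below are hypothetical romanized examples for illustration only; the real system uses a much larger hand-written rule set.

```python
import re

# Hypothetical inflection rules: (pattern, replacement) pairs that swap a
# Hindi-style suffix for a Punjabi-style one on the same stem.
RULES = [
    (re.compile(r"(.+)on$"), r"\1an"),   # e.g. oblique plural -on -> -an
    (re.compile(r"(.+)ta$"), r"\1da"),   # e.g. imperfective -ta -> -da
]

def inflect(token):
    # Apply the first matching rule; return None if no rule matches, so the
    # token falls through to the transliteration module described next.
    for pattern, replacement in RULES:
        if pattern.match(token):
            return pattern.sub(replacement, token)
    return None
```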
There is also a check on the generated word for its correctness: a database of correct Punjabi words is used to test each generated word.

2.6.5.2 Transliteration

This module handles out-of-vocabulary words. For example, the word विशाल (vishāl) is transliterated as ਵਿਸ਼ਾਲ (vishāl), whereas it would be translated as ਵੱਡਾ. Every Machine Translation system needs some method for handling words like technical terms and proper names of persons, places, objects etc. that cannot be found in translation resources such as the Hindi-Punjabi bilingual dictionary, the surnames database, the titles database etc., and transliteration is the obvious choice for such words (Goyal V. and Lehal G.S., 2009a).

2.7 Post-Processing

2.7.1 Agreement Corrections

In spite of the great similarity between Hindi and Punjabi, there are still a number of important agreement divergences in gender and number. The output generated by the translation engine becomes the input for the post-processing phase, which corrects agreement errors based on rules implemented in the form of regular expressions (Goyal V. and Lehal G.S., 2011).

3 Evaluation and Results

The evaluation document set consisted of documents from various online newspapers: news, articles, blogs, biographies etc. This test bed consisted of 35,500 words and was translated using our Machine Translation system.

3.1 Test Document

For the evaluation of our Machine Translation system, we have used the benchmark sampling method for selecting the set of sentences. Input sentences were selected from randomly chosen news (sports, politics, world, regional, entertainment, travel etc.), articles (published by various writers, philosophers etc.), literature (stories by Prem Chand, Yashwant Jain etc.), official language (the language officially used in files in government offices) and blogs (posted by the general public in forums etc.). Care has been taken to ensure that the sentences use a variety of constructs.
All possible constructs, simple as well as complex, are incorporated in the set. The sentence set also contains all types of sentences, such as declarative, interrogative, imperative and exclamatory. Sentence length is not restricted, although care has been taken that single sentences do not become too long.

Table 1: Test data set for the evaluation of Hindi to Punjabi Machine Translation (documents, sentences and words drawn from daily news, articles, official language, quotes, blogs and literature; 35,500 words in total).

3.2 Experiments

It is also important to choose appropriate evaluators for our experiments. Depending upon the requirements of the tests mentioned above, 50 people from different professions were selected for the experiments: 20 persons from villages who knew only Punjabi and not Hindi, and 30 persons from different professions with knowledge of both Hindi and Punjabi. The average ratings for the individual translated sentences were then summed up (separately for intelligibility and accuracy) to get the average scores. The percentages of accurate and of intelligible sentences were also calculated separately by counting the number of such sentences.

3.2.1 Intelligibility Evaluation

The evaluators have no clue about the source language, i.e. Hindi. They judge each sentence (in the target language, i.e. Punjabi) on the basis of its comprehensibility. The target user is a layman who is interested only in the comprehensibility of the translations. Intelligibility is affected by grammatical errors, mistranslations and untranslated words.

3.2.1.1 Results

The responses by the evaluators were analyzed, with the following results:
• 70.3% of the sentences got the score 3, i.e. they were perfectly clear and intelligible.
• 25.1% of the sentences got the score 2, i.e. they were generally clear and intelligible.
• 3.5% of the sentences got the score 1, i.e.
they were hard to understand.
• 1.1% of the sentences got the score 0, i.e. they were not understandable.

So we can say that about 95.40% of the sentences are intelligible, namely those with a score of 2 or above. Thus, the direct approach can translate Hindi text to Punjabi text with considerably good accuracy.

3.2.2 Accuracy Evaluation / Fidelity Measure

The evaluators are provided with the source text along with the translated text. A highly intelligible output sentence need not be a correct translation of the source sentence; it is important to check whether the meaning of the source language sentence is preserved in the translation. This property is called accuracy.

3.2.2.1 Results

Initially the null hypothesis is assumed, i.e. that the system's performance is null: the author assumes that the system is dumb and does not produce any valuable output. The intelligibility and accuracy analyses prove this hypothesis wrong. The accuracy percentage for the system is found to be 87.60%. Further investigation of the remaining 12.40% reveals that:
• 80.6% of these sentences achieve a match between 50% and 99%.
• 17.2% of the remaining sentences were marked with less than a 50% match against the correct sentences.
• Only 2.2% of the sentences were found to be unfaithful.

A match lower than 50% does not mean that a sentence is not usable; after some post-editing, such sentences can fit properly in the translated text (Goyal V. and Lehal G.S., 2009b).

3.3 BLEU Score

As no Hindi-Punjabi parallel corpus was available for testing the system automatically, we generated a Hindi-Punjabi parallel corpus of about 10K sentences. The BLEU score comes out to be 0.7801.

4 Conclusion

In this paper, a hybrid translation approach for translating text from Hindi to Punjabi has been presented. The proposed architecture has shown extremely good results and is found to be appropriate for MT systems between closely related language pairs.
Copyright

The developed system has already been copyrighted with the Registrar, Punjabi University, Patiala, with the same authors as those of this publication.

Acknowledgement

We are thankful to Dr. Amba Kulkarni, University of Hyderabad, for her support in providing technical assistance for developing this system.

References

Bharati, Akshar, Chaitanya, Vineet, Kulkarni, Amba P., Sangal, Rajeev. 1997. Anusaaraka: Machine Translation in Stages. Vivek, A Quarterly in Artificial Intelligence, Vol. 10, No. 3, NCST, Bangalore, India, pp. 22-25.

Goyal V., Lehal G.S. 2008. Comparative Study of Hindi and Punjabi Language Scripts. Nepalese Linguistics, Journal of the Linguistics Society of Nepal, Volume 23, November Issue, pp. 67-82.

Goyal V., Lehal G.S. 2008a. Hindi Morphological Analyzer and Generator. In Proc.: 1st International Conference on Emerging Trends in Engineering and Technology, G.H. Raisoni College of Engineering, Nagpur, July 16-19, 2008, pp. 1156-1159, IEEE Computer Society Press, California, USA.

Goyal V., Lehal G.S. 2009. Advances in Machine Translation Systems. Language In India, Volume 9, November Issue, pp. 138-150.

Goyal V., Lehal G.S. 2009a. A Machine Transliteration System for Machine Translation System: An Application on Hindi-Punjabi Language Pair. Atti della Fondazione Giorgio Ronchi (Italy), Volume LXIV, No. 1, pp. 27-35.

Goyal V., Lehal G.S. 2009b. Evaluation of Hindi to Punjabi Machine Translation System. International Journal of Computer Science Issues, France, Vol. 4, No. 1, pp. 36-39.

Goyal V., Lehal G.S. 2010. Automatic Spelling Standardization for Hindi Text. In: 1st International Conference on Computer & Communication Technology, Moti Lal Nehru National Institute of Technology, Allahabad, September 17-19, 2010, pp. 764-767, IEEE Computer Society Press, California.

Goyal V., Lehal G.S. 2011. N-Grams Based Word Sense Disambiguation: A Case Study of Hindi to Punjabi Machine Translation System. International Journal of Translation.
(Accepted, In Print).

Goyal V., Lehal G.S. 2011a. Hindi to Punjabi Machine Translation System. In Proc.: International Conference for Information Systems for Indian Languages, Department of Computer Science, Punjabi University, Patiala, March 9-11, 2011, pp. 236-241, Springer CCIS 139, Germany.

Sharma R., Goyal V. 2011b. Named Entity Recognition Systems for Hindi using CRF Approach. In Proc.: International Conference for Information Systems for Indian Languages, Department of Computer Science, Punjabi University, Patiala, March 9-11, 2011, pp. 31-35, Springer CCIS 139, Germany.
