acl acl2011 acl2011-153 knowledge-graph by maker-knowledge-mining

153 acl-2011-How do you pronounce your name? Improving G2P with transliterations

Source: pdf

Author: Aditya Bhargava ; Grzegorz Kondrak

Abstract: Grapheme-to-phoneme conversion (G2P) of names is an important and challenging problem. The correct pronunciation of a name is often reflected in its transliterations, which are expressed within a different phonological inventory. We investigate the problem of using transliterations to correct errors produced by state-of-the-art G2P systems. We present a novel re-ranking approach that incorporates a variety of score and n-gram features, in order to leverage transliterations from multiple languages. Our experiments demonstrate significant accuracy improvements when re-ranking is applied to n-best lists generated by three different G2P programs.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Improving G2P with transliterations. Aditya Bhargava and Grzegorz Kondrak, Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada, T6G 2E8. {abhargava, kondrak}@cs. [sent-2, score-0.633]

2 Abstract: Grapheme-to-phoneme conversion (G2P) of names is an important and challenging problem. [sent-4, score-0.177]

3 The correct pronunciation of a name is often reflected in its transliterations, which are expressed within a different phonological inventory. [sent-5, score-0.245]

4 We investigate the problem of using transliterations to correct errors produced by state-of-the-art G2P systems. [sent-6, score-0.659]

5 We present a novel re-ranking approach that incorporates a variety of score and n-gram features, in order to leverage transliterations from multiple languages. [sent-7, score-0.68]

6 1 Introduction Grapheme-to-phoneme conversion (G2P), in which the aim is to convert the orthography of a word to its pronunciation (phonetic transcription), plays an important role in speech synthesis and understanding. [sent-9, score-0.27]

7 , 1998), present a particular challenge to G2P systems because of their high pronunciation variability. [sent-11, score-0.142]

8 Guessing the correct pronunciation of a name is often difficult, especially if it is of foreign origin; this is attested by the ad hoc transcriptions which sometimes accompany new names introduced in news articles, especially for international stories with many foreign names. [sent-12, score-0.433]

9 Transliterations provide a way of disambiguating the pronunciation of names. [sent-13, score-0.163]

10 They are more abundant than phonetic transcriptions, for example when news items of international or global significance are reported in multiple languages. [sent-14, score-0.142]

11 In addition, writing scripts such as Arabic, Korean, or Hindi are more consistent and easier to identify than various phonetic transcription schemes. [sent-15, score-0.154]

12 Thus, the correct pronunciation of a name is partially encoded in the form of the transliteration. [sent-18, score-0.222]

13 For example, given the ambiguous letter-to-phoneme mapping of the English letter g, the initial phoneme of the name Gershwin may be predicted by a G2P system to be either /g/ (as in Gertrude) or /dʒ/ (as in Gerald). [sent-19, score-0.197]

14 The transliterations of the name in other scripts provide support for the former (correct) alternative. [sent-20, score-0.739]

15 Although it seems evident that transliterations should be helpful in determining the correct pronunciation of a name, designing a system that takes advantage of this insight is not trivial. [sent-21, score-0.852]

16 For example, because Hindi has no /w/ sound, the transliteration of Gershwin instead uses a symbol that represents the phoneme /ʋ/, similar to the /v/ phoneme in English. [sent-24, score-0.531]

17 In addition, converting transliterations into phonemes is often non-trivial; although few orthographies are as inconsistent as that of English, this is effectively the G2P task for the particular language in question. [sent-25, score-0.711]

18 In this paper, we demonstrate that leveraging transliterations can, in fact, improve the grapheme-to-phoneme conversion of names. [sent-26, score-0.698]

19 fail to achieve the same goal, and that transliterations from multiple languages are more helpful than from a single language. (Page footer: © 2011 Association for Computational Linguistics, pages 399–408.) [sent-30, score-0.726]

20 The experiments that we perform demonstrate significant error reduction for three very different G2P base systems. [sent-32, score-0.151]

21 In the G2P task, x is composed of graphemes and y is composed of phonemes; in transliteration, both sequences consist of graphemes but they represent different writing scripts. [sent-35, score-0.144]

22 We assume that we have available a base G2P system that produces an n-best list of outputs with a corresponding list of confidence scores. [sent-39, score-0.269]

23 The goal is to improve the base system’s performance by applying existing transliterations of the input x to re-rank the system’s n-best output list. [sent-40, score-0.84]

24 2 Similarity-based methods A simple and intuitive approach to improving G2P with transliterations is to select from the n-best list the output sequence that is most similar to the corresponding transliteration. [sent-42, score-0.689]

25 For example, the Hindi transliteration in Figure 1 is arguably closest in perceptual terms to the phonetic transcription of the second output in the n-best list, as compared to the other outputs. [sent-43, score-0.486]

26 One obvious problem with this method is that it ignores the relative ordering of the n-best lists and their corresponding scores produced by the base system. [sent-44, score-0.227]

27 A better approach is to combine the similarity score with the output score from the base system, allowing it to contribute an estimate of confidence in its output. [sent-45, score-0.341]
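The combination described in sentence 27 can be sketched as follows. This is a minimal illustration and not the authors' implementation: the convex form `weight * base_score + (1 - weight) * similarity` is an assumption (the text only says the two scores are combined linearly), and `rerank_by_linear_combination` and `toy_similarity` are hypothetical names.

```python
from typing import Callable, List, Tuple


def toy_similarity(a: str, b: str) -> float:
    """Hypothetical similarity: fraction of positions whose symbols match."""
    if not a and not b:
        return 1.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))


def rerank_by_linear_combination(
    nbest: List[Tuple[str, float]],
    transliteration: str,
    similarity: Callable[[str, str], float],
    weight: float,
) -> List[Tuple[str, float]]:
    """Re-rank (output, base_score) pairs, best first.

    Assumed scoring: weight * base_score + (1 - weight) * similarity.
    """
    rescored = [
        (output,
         weight * base_score
         + (1.0 - weight) * similarity(output, transliteration))
        for output, base_score in nbest
    ]
    # Highest combined score first.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
```

With `weight = 1.0` the base system's ordering is preserved; lowering the weight lets the transliteration evidence override it. In practice the weight would be tuned on held-out data.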

28 This approach is similar to the method used by Finch and Sumita (2010) to combine the scores of two different machine transliteration systems. [sent-48, score-0.373]

29 A more general approach is to skip the transcription step and compute the similarity between phonemes and graphemes directly. [sent-54, score-0.247]

30 For example, the edit distance function can be learned from a training set of transliterations and their phonetic transcriptions (Ristad and Yianilos, 1998). [sent-55, score-0.808]

31 M2M-ALIGNER was originally designed to align graphemes and phonemes, but can be applied to discover the alignment between any sets of symbols (given training data). [sent-58, score-0.132]

32 2, which are based on the similarity between outputs and transliterations, are difficult to generalize when multiple transliterations of a single name are available. [sent-62, score-0.849]

33 The SVM re-ranking paradigm offers a solution. Figure 1: An example name showing the data used for feature construction. Input: Gershwin. N-best outputs: /d͡ʒɜːʃwɪn/, /ɡɜːʃwɪn/, /d͡ʒɛɹʃwɪn/. Transliterations: गर्शविन (/ɡʌrʃʋɪn/), ガーシュウィン (/ɡaːɕuwiɴ/), Гершвин (/ɡerʂvin/). [sent-65, score-0.75]

34 The score features use similarity scores for transliteration-transcription pairs and system output scores for input-output pairs. [sent-67, score-0.284]

35 The scores produced by the base system for each output in the n-best list. [sent-72, score-0.274]

36 The similarity scores between the outputs and each available transliteration. [sent-74, score-0.155]
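The two kinds of score features listed in sentences 35 and 36 could be assembled per candidate roughly as below. This is a hedged sketch: the feature names, the dictionary representation, and the `score_features` helper are assumptions made for illustration, and the similarity function is supplied by the caller.

```python
from typing import Callable, Dict


def score_features(
    base_score: float,
    output: str,
    transliterations: Dict[str, str],
    similarity: Callable[[str, str], float],
) -> Dict[str, float]:
    """Build score features for one candidate in the n-best list.

    One feature holds the base system's score; one feature per available
    transliteration language holds the output/transliteration similarity.
    Languages with no transliteration simply contribute no feature.
    """
    features = {"base_score": base_score}
    for language, transliteration in transliterations.items():
        features["sim_" + language] = similarity(output, transliteration)
    return features
```

Keeping one similarity feature per language lets a re-ranker learn a separate weight for each transliteration source.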

37 The linear chain features combine the context features with the bigram transition features. [sent-84, score-0.159]

38 Unlike a traditional G2P generator, our re-ranker has access to the outputs produced by the base system. [sent-91, score-0.214]

39 Since the n-gram features are also applied to transliteration-transcription pairs, the reverse features enable us to include features which bind a variety of n-grams in the transliteration string with a single corresponding phoneme. [sent-93, score-0.485]

40 The construction of n-gram features presupposes a fixed alignment between the input and output sequences. [sent-94, score-0.139]

41 If the base G2P system does not provide input-output alignments, we use M2M-ALIGNER for this purpose. [sent-95, score-0.203]

42 1 Data & setup For pronunciation data, we extracted all names from the Combilex corpus (Richmond et al. [sent-104, score-0.254]

43 Both the similarity and SVM methods require transliterations for identifying the best candidates in the n-best lists. [sent-107, score-0.689]

44 They are therefore trained and evaluated on the subset of the G2P corpus for which transliterations are available. [sent-108, score-0.633]

45 Naturally, allowing transliterations from all languages results in a larger corpus than the one obtained by the intersection with transliterations from a single language. [sent-109, score-1.315]

46 For SVM reranking, during both development and testing we split the training set into 10 folds; this is necessary when training the re-ranker as it must have system output scores that are representative of the scores on unseen data. [sent-112, score-0.18]
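The fold-based procedure in sentence 46 can be illustrated as follows: each item's n-best list comes from a base model that never saw that item in training, so its scores resemble scores on unseen data. This is only a sketch; `train_fn` and `predict_fn` are hypothetical stand-ins for training and running a base G2P system, and the round-robin fold assignment is an assumption.

```python
from typing import Callable, Dict, Hashable, List, Sequence


def make_folds(items: Sequence[Hashable], k: int = 10) -> List[list]:
    """Split items into k folds by round-robin assignment."""
    return [list(items[i::k]) for i in range(k)]


def heldout_outputs(
    items: Sequence[Hashable],
    train_fn: Callable[[list], object],
    predict_fn: Callable[[object, Hashable], object],
    k: int = 10,
) -> Dict[Hashable, object]:
    """For each fold, train on the other k-1 folds and predict the held-out fold."""
    folds = make_folds(items, k)
    outputs = {}
    for i, held_out in enumerate(folds):
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train_fn(training)
        for item in held_out:
            outputs[item] = predict_fn(model, item)
    return outputs
```

The outputs collected this way would then serve as training input for the re-ranker.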

47 Our transliteration data come from the shared tasks on transliteration at the 2009 and 2010 Named Entities Workshops (Li et al. [sent-114, score-0.643]

48 In cases where the data provide alternative transliterations for a given input, we keep only one; our preliminary experiments indicated that including alternative transliterations did not improve performance. [sent-118, score-1.287]

49 It should be noted that these transliteration corpora are noisy: Jiampojamarn et al. [sent-119, score-0.307]

50 English-to-Hindi transliteration performance with a simple cleaning of the data. [sent-124, score-0.307]

51 Our tests involving transliterations from multiple languages are performed on the set of names for which we have both the pronunciation and transliteration data. [sent-125, score-1.246]

52 There are 7,423 names in the G2P corpus for which at least one transliteration is available. [sent-126, score-0.419]

53 Table 1 lists the total size of the transliteration corpora as well as the amount of overlap with the G2P data. [sent-127, score-0.377]

54 Note that the base G2P systems are trained using all 10,084 names in the corpus as opposed to only the 7,423 names for which there are transliterations available. [sent-128, score-1.008]

55 This ensures that the G2P systems have more training data to provide the best possible base performance. [sent-129, score-0.172]

56 This measure marks pronunciations that are even slightly different from the correct one as incorrect, so even a small change in pronunciation that might be acceptable or even unnoticeable to humans would count against the system’s performance. [sent-135, score-0.193]
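The strictness of this measure is easy to see in code. Below is a minimal sketch of exact-match word accuracy, assuming each pronunciation is represented as a sequence of phoneme symbols; the function name and data layout are illustrative assumptions.

```python
from typing import Sequence


def word_accuracy(predicted: Sequence[Sequence[str]],
                  gold: Sequence[Sequence[str]]) -> float:
    """Fraction of words whose predicted pronunciation matches the gold one exactly."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    # A single differing phoneme makes the whole word count as wrong.
    correct = sum(1 for p, g in zip(predicted, gold) if list(p) == list(g))
    return correct / len(gold)
```

A prediction that differs from the reference by a single phoneme scores the same as one that is entirely wrong.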

57 2 Base systems It is important to test multiple base systems in order to ensure that any gain in performance applies to the task in general and not just to a particular system. [sent-137, score-0.174]

58 All systems are capable of providing n-best output lists along with scores for each output, although for FESTIVAL they had to be constructed from the list of output probabilities for each input character. [sent-146, score-0.212]

59 For SEQUITUR, we keep default options except for the enabling of the 10 best outputs and we convert the probabilities assigned to the outputs to log-probabilities. [sent-152, score-0.154]

60 Note that the three base systems differ slightly in terms of the alignment information that they provide in their outputs. [sent-154, score-0.21]

61 FESTIVAL operates letter-by-letter, so we use the single-letter inputs with the phoneme outputs as the aligned units. [sent-155, score-0.175]

62 In order to find the similarity between phonetic transcriptions, we use the two different methods described in Section 2. [sent-160, score-0.138]

63 We further test the use of a linear combination of the similarity scores with the base system’s score so that its confidence information can be taken into account; the linear combination weight is determined from the training set. [sent-162, score-0.395]

64 For these experiments, our training and testing sets are obtained by intersecting our G2P training and testing sets respectively with the Hindi transliteration corpus, yielding 1,950 names for training and 229 names for testing. [sent-164, score-0.573]

65 Furthermore, ALINE operates on phoneme sequences, so we first need to convert the transliterations to phonemes. [sent-166, score-0.773]

66 Regardless of the choice of the similarity function, the simplest approaches fail in a spectacular manner, significantly reducing the accuracy with respect to the base system. [sent-176, score-0.252]

67 However, they perform much better than the methods based on similarity scores alone as they are able to take advantage of the base system’s output scores. [sent-178, score-0.299]

68 Table 2: Word accuracy (in percentages) of various methods when only Hindi transliterations are used. [sent-200, score-0.657]

69 on the training set, we find that they are higher for the stronger base systems, indicating more reliance on the base system output scores. [sent-201, score-0.389]

70 We would expect to achieve at least the base system’s performance, but disparities between the training and testing sets prevent this. [sent-206, score-0.172]

71 SVM-ALL produces impressive accuracy gains for all three base systems, while SVM-HINDI yields smaller (but still statistically significant) improvements for FESTIVAL and SEQUITUR. [sent-208, score-0.233]

72 On the other hand, the results also show that the information obtained by consulting a single transliteration may be insufficient to improve an already high-performing G2P converter. [sent-210, score-0.327]

73 4 Transliterations from multiple languages Our second experiment expands upon the first; we use all available transliterations instead of being restricted to one language. [sent-212, score-0.685]

74 Table 3: Word accuracy of the base system versus the reranking variants with transliterations from multiple languages. [sent-226, score-0.892]

75 We draw a further conclusion from our results: consider the large disparity in improvements over the base systems. [sent-244, score-0.151]

76 In the case of DIRECTL+, the transliterations help through the n-gram features rather than the score features; this is probably because the crucial feature that signals the inability of M2M-ALIGNER to align a given transliteration-transcription pair belongs to the set of the n-gram features. [sent-248, score-0.724]

77 The two therefore overlap to some degree, although the score features still provide useful information via probabilities learned during the alignment training process. [sent-250, score-0.158]

78 The most likely reason why our re-ranker instead selects the correct pronunciation /bækəs/ is that M2M-ALIGNER fails to align three of the five available transliterations with /bæktʃəs/. [sent-253, score-0.823]

79 Such alignment failures are caused by a lack of evidence for the mapping of the grapheme representing the sound /k/ in the transliteration training data with the phoneme /tʃ/. [sent-254, score-0.484]

80 In fact, many instances of human transliterations in our corpora are clearly incorrect. [sent-257, score-0.633]

81 For example, the Hindi transliteration of Bacchus contains the /tʃ/ consonant instead of the correct /k/. [sent-258, score-0.333]

82 Moreover, our strict evaluation based on word accuracy counts all system outputs that fail to exactly match the dictionary data as errors. [sent-259, score-0.139]

83 1%, (Footnote 2: The phoneme accuracy is calculated from the minimum edit distance between the predicted and correct pronunciations.) [sent-262, score-0.193]

84 Table 4: Absolute improvement in word accuracy (%) over the base system (DIRECTL+) of the SVM re-ranker for various numbers of available transliterations. [sent-272, score-0.206]

85 which provides some idea of how similar the predicted pronunciation is to the correct one. [sent-273, score-0.168]
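A phoneme-level score of this kind can be sketched with a standard Levenshtein distance. The normalization used here (one minus the distance divided by the length of the correct pronunciation, floored at zero) is an assumption, since the text only says the accuracy is calculated from the minimum edit distance; the phoneme symbols in the usage note are placeholders.

```python
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def phoneme_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Assumed definition: 1 - edit_distance / len(gold), floored at 0."""
    if not gold:
        raise ValueError("gold pronunciation must be non-empty")
    return max(0.0, 1.0 - edit_distance(predicted, gold) / len(gold))
```

For example, `phoneme_accuracy(["b", "ae", "k", "ch", "@", "s"], ["b", "ae", "k", "@", "s"])` involves one extra phoneme against five correct ones, so the prediction still receives most of the credit even though it counts as a word-level error.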

86 5 Effect of multiple transliterations One motivating factor for the use of SVM re-ranking was the ability to incorporate multiple transliteration languages. [sent-275, score-0.986]

87 To examine this question, we look particularly at the sets of names having at most k transliterations available. [sent-277, score-0.745]

88 Table 4 shows the results with DIRECTL+ as the base system. [sent-278, score-0.151]

89 Note that the number of names with more than five transliterations was small. [sent-279, score-0.633]

90 Importantly, we see that the increase in performance when only one transliteration is available is so small as to be insignificant. [sent-280, score-0.307]

91 From this, we can conclude that obtaining improvement on the basis of a single transliteration is difficult in general. [sent-281, score-0.327]

92 (1998) report similar accuracy on names as for other types of English words. [sent-293, score-0.136]

93 A similar pivoting approach has also been applied to machine transliteration (Zhang et al. [sent-302, score-0.307]

94 Finch and Sumita (2010) combine two very different approaches to transliteration using simple linear interpolation: they use SEQUITUR’s n-best outputs and re-rank them using a linear combination of the original SEQUITUR score and the score for that output of a phrase-based SMT system. [sent-306, score-0.607]

95 5 Conclusions & future work In this paper, we explored the application of transliterations to G2P. [sent-310, score-0.633]

96 We demonstrated that transliterations have the potential for helping choose between n-best output lists provided by standard G2P systems. [sent-311, score-0.729]

97 Simple approaches based solely on similarity do not work when tested using a single transliteration language (Hindi), necessitating the use of smarter methods that can incorporate multiple transliteration languages. [sent-312, score-0.742]

98 We apply SVM reranking to this task, enabling us to use a variety of features based not only on similarity scores but on n-grams as well. [sent-313, score-0.187]

99 Our analysis demonstrated that it is essential to provide the re-ranking system with transliterations from multiple languages in order to mitigate the differences between phonological inventories and smooth out noise in the transliterations. [sent-316, score-0.788]

100 We would also like to apply our approach to web data; we have shown that it is possible to use noisy transliteration data, so it may be possible to leverage the noisy ad hoc pronunciation data as well. [sent-319, score-0.469]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('transliterations', 0.633), ('directl', 0.325), ('transliteration', 0.307), ('sequitur', 0.216), ('jiampojamarn', 0.19), ('festival', 0.162), ('base', 0.151), ('pronunciation', 0.142), ('aline', 0.117), ('hindi', 0.117), ('phoneme', 0.112), ('names', 0.112), ('phonetic', 0.082), ('phonemes', 0.078), ('sittichai', 0.073), ('graphemes', 0.072), ('grzegorz', 0.067), ('conversion', 0.065), ('svm', 0.063), ('outputs', 0.063), ('transcriptions', 0.062), ('similarity', 0.056), ('output', 0.056), ('kondrak', 0.055), ('name', 0.054), ('dtl', 0.054), ('fest', 0.054), ('gershwin', 0.054), ('features', 0.045), ('seq', 0.044), ('vladimir', 0.044), ('transcription', 0.041), ('bisani', 0.041), ('finch', 0.041), ('lists', 0.04), ('haizhou', 0.04), ('kumaran', 0.039), ('linear', 0.039), ('alignment', 0.038), ('alberta', 0.037), ('news', 0.037), ('bacchus', 0.036), ('combilex', 0.036), ('heuvel', 0.036), ('kienappel', 0.036), ('nanneke', 0.036), ('pervouchine', 0.036), ('richmond', 0.036), ('scores', 0.036), ('synthesis', 0.035), ('impressive', 0.034), ('alignments', 0.033), ('bhargava', 0.032), ('martens', 0.032), ('henk', 0.032), ('black', 0.031), ('system', 0.031), ('scripts', 0.031), ('edit', 0.031), ('entities', 0.031), ('combine', 0.03), ('overlap', 0.03), ('reranking', 0.03), ('den', 0.03), ('ristad', 0.029), ('converter', 0.029), ('necessitating', 0.029), ('languages', 0.029), ('shared', 0.029), ('inventories', 0.028), ('convert', 0.028), ('sound', 0.027), ('correct', 0.026), ('suntec', 0.025), ('aditya', 0.025), ('sumita', 0.025), ('pronunciations', 0.025), ('combination', 0.025), ('score', 0.024), ('produces', 0.024), ('accuracy', 0.024), ('capable', 0.024), ('min', 0.024), ('phonological', 0.023), ('multiple', 0.023), ('li', 0.023), ('named', 0.023), ('reverse', 0.023), ('liblinear', 0.022), ('sweden', 0.022), ('align', 0.022), ('testing', 0.021), ('provide', 0.021), ('fail', 0.021), ('korean', 0.021), ('singapore', 0.021), ('apply', 0.02), ('interspeech', 0.02), 
('designing', 0.02), ('single', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 153 acl-2011-How do you pronounce your name? Improving G2P with transliterations

Author: Aditya Bhargava ; Grzegorz Kondrak

2 0.42399171 34 acl-2011-An Algorithm for Unsupervised Transliteration Mining with an Application to Word Alignment

Author: Hassan Sajjad ; Alexander Fraser ; Helmut Schmid

Abstract: We propose a language-independent method for the automatic extraction of transliteration pairs from parallel corpora. In contrast to previous work, our method uses no form of supervision, and does not require linguistically informed preprocessing. We conduct experiments on data sets from the NEWS 2010 shared task on transliteration mining and achieve an F-measure of up to 92%, outperforming most of the semi-supervised systems that were submitted. We also apply our method to English/Hindi and English/Arabic parallel corpora and compare the results with manually built gold standards which mark transliterated word pairs. Finally, we integrate the transliteration module into the GIZA++ word aligner and evaluate it on two word alignment tasks achieving improvements in both precision and recall measured against gold standard word alignments.

3 0.21266842 197 acl-2011-Latent Class Transliteration based on Source Language Origin

Author: Masato Hagiwara ; Satoshi Sekine

Abstract: Transliteration, a rich source of proper noun spelling variations, is usually recognized by phonetic- or spelling-based models. However, a single model cannot deal with different words from different language origins, e.g., “get” in “piaget” and “target.” Li et al. (2007) propose a method which explicitly models and classifies the source language origins and switches transliteration models accordingly. This model, however, requires an explicitly tagged training set with language origins. We propose a novel method which models language origins as latent classes. The parameters are learned from a set of transliterated word pairs via the EM algorithm. The experimental results of the transliteration task of Western names to Japanese show that the proposed model can achieve higher accuracy compared to the conventional models without latent classes.

4 0.12921135 232 acl-2011-Nonparametric Bayesian Machine Transliteration with Synchronous Adaptor Grammars

Author: Yun Huang ; Min Zhang ; Chew Lim Tan

Abstract: Machine transliteration is defined as automatic phonetic translation of names across languages. In this paper, we propose synchronous adaptor grammar, a novel nonparametric Bayesian learning approach, for machine transliteration. This model provides a general framework without heuristic or restriction to automatically learn syllable equivalents between languages. The proposed model outperforms the state-of-the-art EMbased model in the English to Chinese transliteration task.

5 0.08175768 151 acl-2011-Hindi to Punjabi Machine Translation System

Author: Vishal Goyal ; Gurpreet Singh Lehal

Abstract: Hindi and Punjabi being a closely related language pair (Goyal V. and Lehal G.S., 2008), a hybrid machine translation approach has been used for developing the Hindi to Punjabi Machine Translation System. Non-availability of lexical resources, spelling variations in the source language text, ambiguous words in the source text, named entity recognition and collocations are the major challenges faced while developing this system. The key activities involved during the translation process are preprocessing, the translation engine and post-processing. Lookup algorithms, pattern matching algorithms, etc. formed the basis for solving these issues. The system accuracy has been evaluated using an intelligibility test, an accuracy test and BLEU score. The hybrid system is found to perform better than the constituent systems. Keywords: Machine Translation, Computational Linguistics, Natural Language Processing, Hindi, Punjabi, Translate Hindi to Punjabi, Closely related languages. 1 Introduction A machine translation system is software that takes a text in one language (called the source language) and translates it into another language (called the target language). There are a number of approaches for MT, such as Direct based, Transform based, Interlingua based, Statistical, etc. But the choice of approach depends upon the available resources and the kind of languages involved. In general, if the two languages are structurally similar, in particular as regards lexical correspondences, morphology and word order, the case for abstract syntactic analysis seems less convincing. Since the present research work deals with a pair of closely related languages, i.e. Hindi-Punjabi, the direct word-to-word translation approach is the obvious choice. As some rule-based approach has also been used, a hybrid approach has been adopted for developing the system. (Author affiliation: Gurpreet Singh Lehal, Department of Computer Science, Punjabi University, Patiala, India, gs lehal @ gmai l com.)
An exhaustive survey has already been given of existing machine translation systems developed so far, mentioning their accuracies and limitations (Goyal V. and Lehal G.S., 2009). 2 System Architecture 2.1 Pre Processing Phase The preprocessing stage is a collection of operations that are applied on input data to make it processable by the translation engine. In the first phase of the Machine Translation system, the activities incorporated include text normalization, replacing collocations and replacing proper nouns. 2.2 Text Normalization The variety in the alphabet, different dialects and the influence of foreign languages have resulted in spelling variations of the same word. Such variations can sometimes be treated as errors in writing (Goyal V. and Lehal G.S., 2010). 2.3 Replacing Collocations After passing through text normalization, the input text passes through the collocation replacement sub phase of the preprocessing phase. A collocation is two or more consecutive words with a special behavior (Choueka, 1988). For example, the collocation उत्तर प्रदेश (uttar pradēsh), if translated word to word, will be translated as ਜਵਾਬ ਰਾਜ (javāb rāj), but it must be translated as ਉੱਤਰ ਪ੍ਰਦੇਸ਼ (uttar pradēsh). The results of collocation extraction using the t-test are not accurate and include a number of bigrams and trigrams that are not actually collocations. Thus, such entries were removed manually and the actual collocations were further extracted. The correct corresponding Punjabi translation for each extracted collocation is stored in the collocation table of the database. The collocation table of the database consists of 5000 such entries. In this sub phase, the normalized input text is analyzed. (Page footer: Proceedings of the ACL-HLT 2011 System Demonstrations, pages 1–6, Portland, Oregon, USA, 21 June 2011. © 2011 Association for Computational Linguistics.)
Each collocation in the database found in the input text will be replaced with the Punjabi translation of the corresponding collocation. It is found that when tested on a corpus containing about 1,00,000 words, only 0.001 % collocations were found and replaced during the translation. Hindi Text Figure 1: Overview of Hindi-Punjabi Machine Translation System 2.4 Replacing Proper Nouns A great proposition of unseen words includes proper nouns like personal, days of month, days of week, country names, city names, bank fastens words proper decide the translation process. Once these are recognized and stored into the noun database, there is no need to about their translation or transliteration names, organization names, ocean names, river every names, university words names etc. and if translated time in the case of presence in word to word, their meaning is changed. If the gazetteer meaning is not affected, even though this step fast. This input makes list text for the translation is self of such translation. growing This accurate and during each 2 translation. Thus, to process this sub phase, the system requires a proper noun gazetteer that has been complied offline. For this task, we have developed an offline module to extract proper nouns from the corpus based on some rules. Also, Named Entity recognition module has been developed based on the CRF approach (Sharma R. and Goyal V., 2011b). 2.5 Tokenizer Tokenizers (also known as lexical analyzers or word segmenters) segment a stream of characters into meaningful units called tokens. The tokenizer takes the text generated by pre processing phase as input. Individual words or tokens are extracted and processed to generate its equivalent in the target language. This module, using space, a punctuation mark, as delimiter, extracts tokens (word) one by one from the text and gives it to translation engine for analysis till the complete input text is read and processed. 
2.6 Translation Engine The translation engine is the main component of our Machine Translation system. It takes token generated by the tokenizer as input and outputs the translated token in the target language. These translated tokens are concatenated one after another along with the delimiter. Modules included in this phase are explained below one by one. 2.6.1 Identifying Titles and Surnames Title may be defined as a formal appellation attached to the name of a person or family by virtue of office, rank, hereditary privilege, noble birth, or attainment or used as a mark of respect. Thus word next to title and word previous to surname is usually a proper noun. And sometimes, a word used as proper name of a person has its own meaning in target language. Similarly, Surname may be defined as a name shared in common to identify the members of a family, as distinguished from each member's given name. It is also called family name or last name. When either title or surname is passed through the translation engine, it is translated by the system. This cause the system failure as these proper names should be transliterated instead of translation. For example consider the Hindi sentence 3 ?ीमान हष? जी हमार ेयहाँ पधार।े (shrīmān harsh jī हष? hamārē yahāṃ padhārē). In this sentence, (harsh) has the meaning “joy”. The equivalent translation of हष? (harsh) in target language is ਖੁਸ਼ੀ (khushī). Similarly, consider the Hindi sentence ?काश ?सह हमार े (prakāsh siṃh hamārē yahāṃ padhārē). Here, ?काश (prakāsh) word is acting as proper noun and it must be transliterated and not translated because (siṃh) is surname and word previous to it is proper noun. Thus, a small module has been developed for यहाँ पधार।े. ?सह locating such proper nouns to consider them as title or surname. There is one special character ‘॰’ in Devanagari script to mark the symbols like डा॰, ?ो॰. 
If this module found this symbol to be title or surname, the word next and previous to this token as the case may be for title or surname respectively, will be transliterated not translated. The title and surname database consists of 14 and 654 entries respectively. These databases can be extended at any time to allow new titles and surnames to be added. This module was tested on a large Hindi corpus and showed that about 2-5 % text of the input text depending upon its domain is proper noun. Thus, this module plays an important role in translation. 2.6.2 Hindi Morphological analyzer This module finds the root word for the token and its morphological features.Morphological analyzer developed by IIT-H has been ported for Windows platform for making it usable for this system. (Goyal V. and Lehal G.S.,2008a) 2.6.3 Word-to-Word translation using lexicon lookup If token is not a title or a surname, it is looked up in the HPDictionary database containing Hindi to Punjabi direct word to word translation. If it is found, it is used for translation. If no entry is found in HPDictionary database, it is sent to next sub phase for processing. The HPDictionary database consists of 54, 127 entries.This database can be extended at any time to allow new entries in the dictionary to be added. 2.6.4 Resolving Ambiguity Among number of approaches for disambiguation, the most appropriate approach to determine the correct meaning of a Hindi word in a particular usage for our Machine Translation system is to examine its context using N-gram approach. After analyzing the past experiences of various authors, we have chosen the value of n to be 3 and 2 i.e. trigram and bigram approaches respectively for our system. Trigrams are further categorized into three different types. First category of trigram consists of context one word previous to and one word next to the ambiguous word. Second category of trigram consists of context of two adjacent previous words to the ambiguous word. 
The third category of trigram consists of the context of the two adjacent words following the ambiguous word. Bigrams are also divided into two categories. The first category of bigram consists of the context of one word preceding the ambiguous word, and the second category consists of one context word following the ambiguous word. For this purpose, a Hindi corpus of about 2 million words was collected from different sources such as online daily newspapers, blogs, Prem Chand stories, Yashwant Jain stories, articles, etc. A list of the most common ambiguous words was compiled; it contains 75 words, of which the most frequent are से (sē) and और (aur). (Goyal V. and Lehal G.S., 2011)

2.6.5 Handling Unknown Words

2.6.5.1 Word Inflectional Analysis and Generation

In linguistics, a suffix (also sometimes called a postfix or ending) is an affix which is placed after the stem of a word. Common examples are case endings, which indicate the grammatical case of nouns or adjectives, and verb endings. Hindi is a (relatively) free word-order and highly inflectional language. Because of their common origin, both languages have very similar structure and grammar. The difference is only in words and in pronunciation, e.g. the word for boy is लड़का in Hindi and ਮੁੰਡਾ in Punjabi, and sometimes even that difference is absent, as in घर (ghar) and ਘਰ (ghar). The inflected forms of both these words in Hindi and Punjabi are also similar. In this activity, inflectional analysis without using morphology has been performed for all those tokens that are not processed by the morphological analysis module. Thus, a rule-based approach has been followed for performing inflectional analysis. When a token is passed to this sub-phase for inflectional analysis, if any pattern of a regular expression (inflection rule) matches the token, that rule is applied to the token and its equivalent translation in Punjabi is generated based on the matched rule(s).
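The rule-matching step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the system's actual implementation: the single rule, the ASCII romanization, the tiny root lexicon, and the list of valid Punjabi words are all hypothetical placeholders, and the final membership test stands in for a check of the generated word against a list of known Punjabi words.

```python
import re

# Hypothetical inflection rules: (regex over a romanized Hindi token,
# Punjabi suffix to attach). Illustrative placeholders only; the real
# system's rules are not given in the text.
INFLECTION_RULES = [
    (re.compile(r"^(?P<stem>.+)on$"), "an"),
]

# Toy stand-ins for the root lexicon and the list of valid Punjabi words.
ROOT_LEXICON = {"ghar": "ghar"}
VALID_PUNJABI_WORDS = {"ghar", "gharan"}


def translate_by_inflection(token):
    """Return a Punjabi form for `token`, or None if no rule yields a valid word."""
    for pattern, punjabi_suffix in INFLECTION_RULES:
        match = pattern.match(token)
        if match is None:
            continue  # this rule does not apply to the token
        punjabi_stem = ROOT_LEXICON.get(match.group("stem"))
        if punjabi_stem is None:
            continue  # stem unknown, try the next rule
        candidate = punjabi_stem + punjabi_suffix
        # Accept the generated word only if it is a known Punjabi word.
        if candidate in VALID_PUNJABI_WORDS:
            return candidate
    return None
```

Returning `None` when no rule produces a valid word lets a caller hand the token on to a later fallback such as transliteration.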
The generated word is also checked for correctness, using a database of correct Punjabi words.

2.6.5.2 Transliteration

This module is beneficial for handling out-of-vocabulary words. For example, the word विशाल (vishāl) is transliterated as ਵਿਸ਼ਾਲ (vishāl), whereas it is translated as ਵੱਡਾ. Every Machine Translation system must have some method for handling words, such as technical terms and proper names of persons, places, objects, etc., that cannot be found in translation resources such as the Hindi-Punjabi bilingual dictionary, surnames database, titles database, etc.; transliteration is an obvious choice for such words. (Goyal V. and Lehal G.S., 2009a)

2.7 Post-Processing

2.7.1 Agreement Corrections

In spite of the great similarity between Hindi and Punjabi, there are still a number of important agreement divergences in gender and number. The output generated by the translation engine phase becomes the input for the post-processing phase. This phase corrects the agreement errors based on rules implemented in the form of regular expressions. (Goyal V. and Lehal G.S., 2011)

3 Evaluation and Results

The evaluation document set consisted of documents from various online newspapers, articles, blogs, biographies, etc. This test bed consisted of 35,500 words and was translated using our Machine Translation system.

3.1 Test Document

For the evaluation of our Machine Translation system, we used the benchmark sampling method for selecting the set of sentences. Input sentences were selected from randomly selected news (sports, politics, world, regional, entertainment, travel, etc.), articles (published by various writers, philosophers, etc.), literature (stories by Prem Chand, Yashwant Jain, etc.), official language for office letters (the language officially used on files in government offices), and blogs (posted by the general public in forums, etc.). Care has been taken to ensure that sentences use a variety of constructs.
All possible constructs, simple as well as complex, are incorporated in the set. The sentence set also contains all types of sentences: declarative, interrogative, imperative, and exclamatory. Sentence length is not restricted, although care has been taken that single sentences do not become too long. The following table shows the test data set:

Table 1: Test data set for the evaluation of Hindi to Punjabi Machine Translation
[Table body not recoverable from the extracted text.]

3.2 Experiments

It is also important to choose appropriate evaluators for our experiments. Thus, depending upon the requirements of the above-mentioned tests, 50 people of different professions were selected for performing the experiments. 20 persons were from villages and knew only Punjabi, not Hindi, and 30 persons were from different professions and knew both Hindi and Punjabi. The average ratings for the sentences of the individual translations were then summed up (separately for intelligibility and accuracy) to get the average scores. The percentages of accurate sentences and intelligible sentences were also calculated separately by counting the number of sentences.

3.2.1 Intelligibility Evaluation

The evaluators do not have any clue about the source language, i.e. Hindi. They judge each sentence (in the target language, i.e. Punjabi) on the basis of its comprehensibility. The target user is a layman who is interested only in the comprehensibility of translations. Intelligibility is affected by grammatical errors, mistranslations, and untranslated words.

3.2.1.1 Results

The responses of the evaluators were analysed, and the results are as follows:
• 70.3% of sentences received a score of 3, i.e. they were perfectly clear and intelligible.
• 25.1% of sentences received a score of 2, i.e. they were generally clear and intelligible.
• 3.5% of sentences received a score of 1, i.e. they were hard to understand.
• 1.1% of sentences received a score of 0, i.e. they were not understandable.

So we can say that about 95.40% of sentences are intelligible; these are the sentences with a score of 2 or above. Thus, we can say that the direct approach can translate Hindi text to Punjabi text with considerably good accuracy.

3.2.2 Accuracy Evaluation / Fidelity Measure

The evaluators are provided with the source text along with the translated text. A highly intelligible output sentence need not be a correct translation of the source sentence. It is important to check whether the meaning of the source language sentence is preserved in the translation. This property is called accuracy.

3.2.2.1 Results

Initially, the null hypothesis is assumed, i.e. that the system's performance is null: the author assumes that the system is dumb and does not produce any valuable output. The intelligibility analysis and the accuracy analysis have proved this wrong. The accuracy percentage for the system is found to be 87.60%. Further investigations reveal that out of 13.40%:
• 80.6% of sentences achieve a match between 50 and 99%.
• 17.2% of the remaining sentences were marked with less than 50% match against the correct sentences.
• Only 2.2% of sentences are found to be unfaithful.

A match below 50% does not mean that the sentences are not usable. After some post-editing, they can fit properly in the translated text. (Goyal, V., Lehal, G.S., 2009b)

3.2.2 BLEU Score

As no Hindi parallel corpus was available, we generated a Hindi parallel corpus of about 10K sentences for testing the system automatically. The BLEU score comes out to be 0.7801.

5 Conclusion

In this paper, a hybrid translation approach for translating text from Hindi to Punjabi has been presented. The proposed architecture has shown extremely good results and is found to be appropriate for MT systems between closely related language pairs.
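As a rough illustration of the token-level flow described in Section 2.6, the sketch below shows how title and surname detection can take priority over lexicon lookup, with transliteration as the fallback. It is a simplified sketch under stated assumptions: morphological analysis, ambiguity resolution, and inflectional analysis are omitted, the tiny title, surname, and lexicon tables are hypothetical stand-ins for the real databases, and `mark_transliterated` is a placeholder that only tags a token instead of performing real transliteration.

```python
# Hypothetical stand-ins for the title, surname, and HPDictionary databases
# (ASCII romanization used for readability).
TITLES = {"shriman"}
SURNAMES = {"singh"}
LEXICON = {"harsh": "khushi"}


def mark_transliterated(token):
    """Placeholder transliterator: tags the token instead of converting script."""
    return "<" + token + ">"


def translate_tokens(tokens, titles, surnames, lexicon, transliterate):
    """Translate tokens, transliterating titles, surnames, and adjacent names."""
    protected = set()  # indices that must be transliterated, not translated
    for i, tok in enumerate(tokens):
        if tok in titles:
            protected.add(i)
            if i + 1 < len(tokens):
                protected.add(i + 1)  # word after a title is a proper noun
        if tok in surnames:
            protected.add(i)
            if i > 0:
                protected.add(i - 1)  # word before a surname is a proper noun
    output = []
    for i, tok in enumerate(tokens):
        if i not in protected and tok in lexicon:
            output.append(lexicon[tok])        # direct word-to-word translation
        else:
            output.append(transliterate(tok))  # proper noun or unknown word
    return output
```

With these toy tables, the token "harsh" on its own is translated through the lexicon, while the same token following the title "shriman" is passed to the transliterator instead.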
Copyright

The developed system has already been copyrighted with The Registrar, Punjabi University, Patiala, with the same authors as this publication.

Acknowledgement

We are thankful to Dr. Amba Kulkarni, University of Hyderabad, for her support in providing technical assistance for developing this system.

References

Bharati, Akshar, Chaitanya, Vineet, Kulkarni, Amba P., Sangal, Rajeev. 1997. Anusaaraka: Machine Translation in stages. Vivek, A Quarterly in Artificial Intelligence, Vol. 10, No. 3, NCST, Bangalore, India, pp. 22-25.

Goyal V., Lehal G.S. 2008. Comparative Study of Hindi and Punjabi Language Scripts. Nepalese Linguistics, Journal of the Linguistics Society of Nepal, Volume 23, November Issue, pp. 67-82.

Goyal V., Lehal G.S. 2008a. Hindi Morphological Analyzer and Generator. In Proc.: 1st International Conference on Emerging Trends in Engineering and Technology, G.H. Raisoni College of Engineering, Nagpur, July 16-19, 2008, pp. 1156-1159, IEEE Computer Society Press, California, USA.

Goyal V., Lehal G.S. 2009. Advances in Machine Translation Systems. Language In India, Volume 9, November Issue, pp. 138-150.

Goyal V., Lehal G.S. 2009a. A Machine Transliteration System for Machine Translation System: An Application on Hindi-Punjabi Language Pair. Atti Della Fondazione Giorgio Ronchi (Italy), Volume LXIV, No. 1, pp. 27-35.

Goyal V., Lehal G.S. 2009b. Evaluation of Hindi to Punjabi Machine Translation System. International Journal of Computer Science Issues, France, Vol. 4, No. 1, pp. 36-39.

Goyal V., Lehal G.S. 2010. Automatic Spelling Standardization for Hindi Text. In: 1st International Conference on Computer & Communication Technology, Moti Lal Nehru National Institute of Technology, Allahabad, September 17-19, 2010, pp. 764-767, IEEE Computer Society Press, California.

Goyal V., Lehal G.S. 2011. N-Grams Based Word Sense Disambiguation: A Case Study of Hindi to Punjabi Machine Translation System. International Journal of Translation. (Accepted, In Print).

Goyal V., Lehal G.S. 2011a. Hindi to Punjabi Machine Translation System. In Proc.: International Conference for Information Systems for Indian Languages, Department of Computer Science, Punjabi University, Patiala, March 9-11, 2011, pp. 236-241, Springer CCIS 139, Germany.

Sharma R., Goyal V. 2011b. Named Entity Recognition Systems for Hindi using CRF Approach. In Proc.: International Conference for Information Systems for Indian Languages, Department of Computer Science, Punjabi University, Patiala, March 9-11, 2011, pp. 31-35, Springer CCIS 139, Germany.

6 0.062434472 139 acl-2011-From Bilingual Dictionaries to Interlingual Document Representations

7 0.054409415 152 acl-2011-How Much Can We Gain from Supervised Word Alignment?

8 0.052376788 12 acl-2011-A Generative Entity-Mention Model for Linking Entities with Knowledge Base

9 0.052265551 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents

10 0.04947285 172 acl-2011-Insertion, Deletion, or Substitution? Normalizing Text Messages without Pre-categorization nor Supervision

11 0.048854601 59 acl-2011-Better Automatic Treebank Conversion Using A Feature-Based Approach

12 0.04775057 79 acl-2011-Confidence Driven Unsupervised Semantic Parsing

13 0.047321446 43 acl-2011-An Unsupervised Model for Joint Phrase Alignment and Extraction

14 0.045026429 57 acl-2011-Bayesian Word Alignment for Statistical Machine Translation

15 0.043150481 240 acl-2011-ParaSense or How to Use Parallel Corpora for Word Sense Disambiguation

16 0.043052964 329 acl-2011-Using Deep Morphology to Improve Automatic Error Detection in Arabic Handwriting Recognition

17 0.042763725 146 acl-2011-Goodness: A Method for Measuring Machine Translation Confidence

18 0.042552531 104 acl-2011-Domain Adaptation for Machine Translation by Mining Unseen Words

19 0.042544764 304 acl-2011-Together We Can: Bilingual Bootstrapping for WSD

20 0.042354502 128 acl-2011-Exploring Entity Relations for Named Entity Disambiguation

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.146), (1, -0.033), (2, 0.001), (3, 0.044), (4, 0.001), (5, -0.017), (6, 0.067), (7, -0.024), (8, -0.032), (9, 0.063), (10, 0.015), (11, 0.051), (12, 0.012), (13, 0.11), (14, 0.021), (15, 0.048), (16, 0.148), (17, 0.124), (18, 0.328), (19, -0.18), (20, -0.224), (21, 0.134), (22, -0.098), (23, -0.059), (24, -0.208), (25, -0.165), (26, 0.034), (27, -0.021), (28, -0.053), (29, 0.125), (30, -0.121), (31, 0.019), (32, -0.021), (33, 0.057), (34, 0.027), (35, -0.014), (36, -0.041), (37, -0.074), (38, -0.002), (39, -0.058), (40, 0.019), (41, -0.049), (42, -0.009), (43, -0.051), (44, 0.096), (45, 0.045), (46, 0.035), (47, -0.021), (48, -0.047), (49, 0.052)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.91301197 34 acl-2011-An Algorithm for Unsupervised Transliteration Mining with an Application to Word Alignment

Author: Hassan Sajjad ; Alexander Fraser ; Helmut Schmid

same-paper 2 0.90525538 153 acl-2011-How do you pronounce your name? Improving G2P with transliterations

Author: Aditya Bhargava ; Grzegorz Kondrak

3 0.87328172 197 acl-2011-Latent Class Transliteration based on Source Language Origin

Author: Masato Hagiwara ; Satoshi Sekine

4 0.70034486 232 acl-2011-Nonparametric Bayesian Machine Transliteration with Synchronous Adaptor Grammars

Author: Yun Huang ; Min Zhang ; Chew Lim Tan

5 0.41126558 151 acl-2011-Hindi to Punjabi Machine Translation System

Author: Vishal Goyal ; Gurpreet Singh Lehal

6 0.28523761 301 acl-2011-The impact of language models and loss functions on repair disfluency detection

7 0.27084595 11 acl-2011-A Fast and Accurate Method for Approximate String Search

8 0.25637692 199 acl-2011-Learning Condensed Feature Representations from Large Unsupervised Data Sets for Supervised Learning

9 0.2371836 335 acl-2011-Why Initialization Matters for IBM Model 1: Multiple Optima and Non-Strict Convexity

10 0.23490423 246 acl-2011-Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition

11 0.22796653 102 acl-2011-Does Size Matter - How Much Data is Required to Train a REG Algorithm?

12 0.22715613 239 acl-2011-P11-5002 k2opt.pdf

13 0.22654404 321 acl-2011-Unsupervised Discovery of Rhyme Schemes

14 0.22218525 139 acl-2011-From Bilingual Dictionaries to Interlingual Document Representations

15 0.2216565 341 acl-2011-Word Maturity: Computational Modeling of Word Knowledge

16 0.22076422 314 acl-2011-Typed Graph Models for Learning Latent Attributes from Names

17 0.217841 220 acl-2011-Minimum Bayes-risk System Combination

18 0.2160487 205 acl-2011-Learning to Grade Short Answer Questions using Semantic Similarity Measures and Dependency Graph Alignments

19 0.21448968 35 acl-2011-An ERP-based Brain-Computer Interface for text entry using Rapid Serial Visual Presentation and Language Modeling

20 0.21444111 212 acl-2011-Local Histograms of Character N-grams for Authorship Attribution

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.021), (17, 0.065), (20, 0.264), (26, 0.018), (37, 0.075), (39, 0.049), (41, 0.097), (55, 0.031), (59, 0.046), (72, 0.037), (91, 0.044), (96, 0.136), (97, 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.74736595 153 acl-2011-How do you pronounce your name? Improving G2P with transliterations

Author: Aditya Bhargava ; Grzegorz Kondrak

2 0.71538734 286 acl-2011-Social Network Extraction from Texts: A Thesis Proposal

Author: Apoorv Agarwal

Abstract: In my thesis, I propose to build a system that would enable extraction of social interactions from texts. To date I have defined a comprehensive set of social events and built a preliminary system that extracts social events from news articles. I plan to improve the performance of my current system by incorporating semantic information. Using domain adaptation techniques, I propose to apply my system to a wide range of genres. By extracting linguistic constructs relevant to social interactions, I will be able to empirically analyze different kinds of linguistic constructs that people use to express social interactions. Lastly, I will attempt to make convolution kernels more scalable and interpretable.

3 0.69462585 70 acl-2011-Clustering Comparable Corpora For Bilingual Lexicon Extraction

Author: Bo Li ; Eric Gaussier ; Akiko Aizawa

Abstract: We study in this paper the problem of enhancing the comparability of bilingual corpora in order to improve the quality of bilingual lexicons extracted from comparable corpora. We introduce a clustering-based approach for enhancing corpus comparability which exploits the homogeneity feature of the corpus, and finally preserves most of the vocabulary of the original corpus. Our experiments illustrate the well-foundedness of this method and show that the bilingual lexicons obtained from the homogeneous corpus are of better quality than the lexicons obtained with previous approaches.

4 0.61630398 65 acl-2011-Can Document Selection Help Semi-supervised Learning? A Case Study On Event Extraction

Author: Shasha Liao ; Ralph Grishman

Abstract: Annotating training data for event extraction is tedious and labor-intensive. Most current event extraction tasks rely on hundreds of annotated documents, but this is often not enough. In this paper, we present a novel self-training strategy, which uses Information Retrieval (IR) to collect a cluster of related documents as the resource for bootstrapping. Also, based on the particular characteristics of this corpus, global inference is applied to provide more confident and informative data selection. We compare this approach to self-training on a normal newswire corpus and show that IR can provide a better corpus for bootstrapping and that global inference can further improve instance selection. We obtain gains of 1.7% in trigger labeling and 2.3% in role labeling through IR and an additional 1.1% in trigger labeling and 1.3% in role labeling by applying global inference.

5 0.61214 196 acl-2011-Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models

Author: Sameer Singh ; Amarnag Subramanya ; Fernando Pereira ; Andrew McCallum

Abstract: Cross-document coreference, the task of grouping all the mentions of each entity in a document collection, arises in information extraction and automated knowledge base construction. For large collections, it is clearly impractical to consider all possible groupings of mentions into distinct entities. To solve the problem we propose two ideas: (a) a distributed inference technique that uses parallelism to enable large scale processing, and (b) a hierarchical model of coreference that represents uncertainty over multiple granularities of entities to facilitate more effective approximate inference. To evaluate these ideas, we constructed a labeled corpus of 1.5 million disambiguated mentions in Web pages by selecting link anchors referring to Wikipedia entities. We show that the combination of the hierarchical model with distributed inference quickly obtains high accuracy (with error reduction of 38%) on this large dataset, demonstrating the scalability of our approach.

6 0.61202532 34 acl-2011-An Algorithm for Unsupervised Transliteration Mining with an Application to Word Alignment

7 0.61182177 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction

8 0.61132073 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering

9 0.60540444 58 acl-2011-Beam-Width Prediction for Efficient Context-Free Parsing

10 0.60413724 190 acl-2011-Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations

11 0.60326576 244 acl-2011-Peeling Back the Layers: Detecting Event Role Fillers in Secondary Contexts

12 0.60277581 11 acl-2011-A Fast and Accurate Method for Approximate String Search

13 0.60265952 40 acl-2011-An Error Analysis of Relation Extraction in Social Media Documents

14 0.60078269 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing

15 0.60000956 209 acl-2011-Lexically-Triggered Hidden Markov Models for Clinical Document Coding

16 0.59931839 15 acl-2011-A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction

17 0.59867513 137 acl-2011-Fine-Grained Class Label Markup of Search Queries

18 0.59819114 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment

19 0.59790146 36 acl-2011-An Efficient Indexer for Large N-Gram Corpora

20 0.59728479 145 acl-2011-Good Seed Makes a Good Crop: Accelerating Active Learning Using Language Modeling