134 acl-2011-Extracting and Classifying Urdu Multiword Expressions


Source: pdf

Author: Annette Hautli ; Sebastian Sulger

Abstract: This paper describes a method for automatically extracting and classifying multiword expressions (MWEs) for Urdu on the basis of a relatively small unannotated corpus (around 8.12 million tokens). The MWEs are extracted by an unsupervised method and classified into two distinct classes, namely locations and person names. The classification is based on simple heuristics that take the co-occurrence of MWEs with distinct postpositions into account. The resulting classes are evaluated against a hand-annotated gold standard and achieve an f-score of 0.5 and 0.746 for locations and persons, respectively. A target application is the Urdu ParGram grammar, where MWEs are needed to generate a more precise syntactic and semantic analysis.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Extracting and Classifying Urdu Multiword Expressions Annette Hautli Department of Linguistics University of Konstanz, Germany annette . [sent-1, score-0.067]

2 Abstract This paper describes a method for automatically extracting and classifying multiword expressions (MWEs) for Urdu on the basis of a relatively small unannotated corpus (around 8. [sent-3, score-0.365]

3 The MWEs are extracted by an unsupervised method and classified into two distinct classes, namely locations and person names. [sent-5, score-0.164]

4 The classification is based on simple heuristics that take the co-occurrence of MWEs with distinct postpositions into account. [sent-6, score-0.214]

5 The resulting classes are evaluated against a hand-annotated gold standard and achieve an f-score of 0. [sent-7, score-0.042]

6 A target application is the Urdu ParGram grammar, where MWEs are needed to generate a more precise syntactic and semantic analysis. [sent-10, score-0.03]

7 1 Introduction Multiword expressions (MWEs) are expressions which can be semantically and syntactically idiosyncratic in nature; acting as a single unit, their meaning is not always predictable from their components. [sent-11, score-0.16]

8 There is a vast amount of literature on extracting and classifying MWEs automatically; many approaches rely on already available resources that aid during the acquisition process. [sent-13, score-0.108]

9 de sources such as annotated corpora or lexical knowledge bases impedes the task of detecting and classifying MWEs. [sent-16, score-0.064]

10 Nevertheless, statistical measures and language-specific syntactic information can be employed to extract and classify MWEs. [sent-17, score-0.03]

11 Therefore, the method described in this paper can partly overcome the bottleneck of resource sparsity, despite the relatively small size of the available corpus and the simplistic approach taken. [sent-18, score-0.024]

12 With the help of heuristics as to the occurrence of Urdu MWEs with characteristic postpositions and other cues, it is possible to cluster the MWEs into two groups: locations and person names. [sent-19, score-0.351]

13 The classification is then evaluated against a hand-annotated gold standard of Urdu MWEs. [sent-21, score-0.042]

14 An NLP tool where the MWEs can be employed is the Urdu ParGram grammar (Butt and King, 2007; Bögel et al. [sent-22, score-0.06]

15 For this task, different types of MWEs need to be distinguished as they are treated differently in the syntactic analysis. [sent-25, score-0.03]

16 The paper is structured as follows: Section 2 provides a brief review of related work, in particular on MWE extraction in Indo-Aryan languages. [sent-26, score-0.025]

17 2 Related Work MWE extraction and classification has been the focus of a large amount of research. [sent-29, score-0.025]

18 , 2010), parallel data (Zarrieß and Kuhn, 2009) and NLP tools such as taggers or dependency parsers (Martens and Vandeghinste (2010), among others) and lexical resources (Pearce, 2001). [sent-33, score-0.023]

19 Related work on Indo-Aryan languages has mostly focused on the extraction of complex predicates, with the focus on Hindi (Mukerjee et al. [sent-34, score-0.062]

20 While complex predicates also make up a large part of the verbal inventory in Urdu (Butt, 1993), for the scope of this paper, we restrict ourselves to classifying MWEs as locations or person names and filter out junk bigrams. [sent-38, score-0.554]

21 For classification, we use simple heuristics by taking the postpositions of the MWEs into account. [sent-40, score-0.214]

22 These can provide hints as to the nature of the MWE. [sent-41, score-0.07]

23 1 Extraction and Identification of MWE Candidates The bigram extraction was carried out on a corpus of around 8. [sent-43, score-0.049]

24 Due to the relatively small size of our corpus, the frequency cut-off for bigrams was set to 5, i. [sent-46, score-0.062]

25 all bigrams that occurred five times or more in the corpus were considered. [sent-48, score-0.062]
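The frequency cut-off described above can be sketched in a few lines of Python. This is a minimal illustration only, not the authors' pipeline: the function name `frequent_bigrams`, the parameter `min_count` and the toy token stream are assumptions introduced here, and the transliterated tokens are placeholders rather than corpus data.

```python
from collections import Counter


def frequent_bigrams(tokens, min_count=5):
    """Count adjacent token pairs and keep those seen at least min_count times."""
    counts = Counter(zip(tokens, tokens[1:]))
    return {bigram: n for bigram, n in counts.items() if n >= min_count}


# Toy, made-up token stream (placeholders, not data from the paper's corpus).
tokens = ["t3ul", "AbEb", "par"] * 5 + ["nAdiyah", "gAyI"]

# Only pairs occurring five times or more survive the cut-off.
candidates = frequent_bigrams(tokens, min_count=5)
```

With a cut-off of 5, the pair that only occurs four times in the toy stream is dropped, which mirrors the "five times or more" criterion in the text.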

26 This rendered a list of 172,847 bigrams which were then ranked with the χ² association measure, using the UCS toolkit. [sent-49, score-0.062]
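For readers unfamiliar with chi-squared as a bigram association measure, the following sketch computes the standard Pearson statistic for a 2x2 contingency table. It is an assumption-laden illustration and not the UCS toolkit: the function names, argument names and the way the table is derived from counts are introduced here for the example.

```python
def chi_squared(o11, o12, o21, o22):
    """Pearson chi-squared for a 2x2 contingency table of observed counts."""
    n = o11 + o12 + o21 + o22
    row1, row2 = o11 + o12, o21 + o22
    col1, col2 = o11 + o21, o12 + o22
    denom = row1 * row2 * col1 * col2
    if denom == 0:
        # A zero marginal makes the statistic undefined; treat as no association.
        return 0.0
    return n * (o11 * o22 - o12 * o21) ** 2 / denom


def bigram_chi_squared(pair_count, first_count, second_count, total_bigrams):
    """Build the 2x2 table for a bigram (w1, w2) from corpus counts.

    pair_count:    bigrams equal to (w1, w2)
    first_count:   bigrams whose first word is w1
    second_count:  bigrams whose second word is w2
    total_bigrams: all bigram tokens in the corpus
    """
    o11 = pair_count
    o12 = first_count - pair_count
    o21 = second_count - pair_count
    o22 = total_bigrams - first_count - second_count + pair_count
    return chi_squared(o11, o12, o21, o22)
```

Candidates would then be sorted by this score in descending order; a pair whose words only ever occur together receives the maximum score for its frequency.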

27 First, papers using comparatively sized corpora reported encouraging results for similar experiments (Ramisch et al. [sent-51, score-0.027]

28 For the time being, we focus on bigram MWE extraction. [sent-59, score-0.024]

29 2 Syntactic Cues The clustering approach taken in this paper is based on Urdu-specific syntactic information that can be gathered straightforwardly from the corpus. [sent-63, score-0.055]

30 Urdu has a number of postpositions that can be used to identify the nature of an MWE. [sent-64, score-0.226]

31 Typographical cues such as initial capital letters do not exist in the Urdu script. [sent-65, score-0.061]

32 (mEN) expresses location in or at a point in space or time, whereas [sent-87, score-0.05]

33 (sE) shows movement away from a certain point in [sent-92, score-0.024]

34 These postpositions mostly occur with locations and are thus syntactic indicators for this type of [sent-94, score-0.352]

35 However, in special cases, they can also occur with other nouns, in which case we predict wrong results during classification. [sent-98, score-0.026]

36 To classify an MWE as a person, we consider syntactic cues that occur with such MWEs. [sent-102, score-0.182]

37 The ergative marker (nE) describes an agentive subject in transitive [sent-103, score-0.03]

38 sentences; therefore, it forms part of our heuristic for finding person MWEs. (Table 1: Heuristics for clustering Urdu MWEs by different postpositions.) [sent-126, score-0.267]

39 The accusative and dative case marker (kO) is also a possible indicator that the preceding MWE is a person. [sent-146, score-0.03]

40 These cues can also appear with common nouns, but the combination of MWE and syntactic cue hints to a person MWE. [sent-147, score-0.177]

41 , where New Delhi is treated as an agent with nE attached to it, providing a wrong clue as to the nature of the MWE. [sent-149, score-0.039]

42 3 Classifying Urdu MWEs The classification of the extracted bigrams is solely based on syntactic information as described in the previous section. [sent-151, score-0.092]

43 For every bigram, the postpositions that it occurs with are extracted from the corpus, together with the frequency of the cooccurrence. [sent-152, score-0.187]

44 Table 1 shows which postpositions are expected to occur with which type of MWE. [sent-153, score-0.213]

45 The first stipulation is that only bigrams that occur with one of the locative postpositions plus the ablative/instrumental marker (sE) one or more times are considered [sent-154, score-0.373]

46 In contrast, bigrams are judged as persons (PERS) when they co-occur with all postpositions apart from the locative postpositions one or more times. [sent-159, score-0.56]

47 If a bigram occurs with none of the postpositions, it is judged as being junk (JUNK). [sent-160, score-0.2]
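The three rules above can be sketched as a small decision function. This is one possible reading, stated under several assumptions: Table 1 is not recoverable here, so the postposition sets below contain only the markers named in the text (mEN, sE, nE, kO); the PERS rule is interpreted as "all person cues seen and no locative seen"; and the label `OTHER` is a hypothetical fallback that does not come from the paper.

```python
# Assumed postposition inventory, limited to markers named in the text.
LOCATIVE = {"mEN"}           # assumption: the paper lists further locatives
ABLATIVE = {"sE"}            # ablative/instrumental marker
PERSON_CUES = {"nE", "kO"}   # ergative; accusative/dative


def classify(postposition_counts):
    """Assign LOC, PERS or JUNK from a bigram's postposition co-occurrence counts."""
    seen = {p for p, n in postposition_counts.items() if n > 0}
    if not seen:
        return "JUNK"    # never followed by any postposition
    if seen & LOCATIVE and seen & ABLATIVE:
        return "LOC"     # a locative postposition plus sE
    if PERSON_CUES <= seen and not (seen & LOCATIVE):
        return "PERS"    # all person cues, no locative
    return "OTHER"       # hypothetical label for cases the rules leave open
```

For example, `classify({"mEN": 3, "sE": 1})` falls under the location rule, while a bigram with an empty count table is treated as junk, which is also why complex predicates end up misclassified.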

48 As a consequence this means that theoretically valid MWEs such as complex predicates, which never occur with a postposition, are misclassified as being JUNK. [sent-161, score-0.063]

49 Without any further processing, the resulting clusters are then evaluated against a hand-annotated gold standard, as described in the following section. [sent-162, score-0.042]

50 1 Gold Standard Our gold standard comprises the 1300 highest ranked Urdu multiword candidates extracted from the CRULP corpus, using the χ² association measure. [sent-164, score-0.262]

51 The bigrams are then hand-annotated by a native speaker of Urdu and clustered into the following classes: locations, person names, companies, miscellaneous MWEs and junk. [sent-165, score-0.182]

52 For the scope of this paper, we restrict ourselves to classifying MWEs as either locations or person names. [sent-166, score-0.228]

53 This also lies in the nature of the corpus: companies can usually be detected by endings such as “Corp. [sent-167, score-0.098]

54 The class of miscellaneous MWEs contains complex predicates that we do not attempt to deal with here. [sent-172, score-0.2]

55 In total, the gold standard comprises 30 companies, 95 locations, 411 person names, 512 miscellaneous MWEs (mostly complex predicates) and 252 junk bigrams. [sent-173, score-0.314]

56 We have not analyzed the gold standard any further, and restricting it to n < 1300 might improve the evaluation results. [sent-174, score-0.064]

57 2 Results The bigrams are classified according to the heuristics outlined in Section 3. [sent-176, score-0.089]

58 Evaluating against the hand-annotated gold standard yields the results in Table 2. [sent-178, score-0.042]

59 While the results are encouraging for persons with an f-score of 0. [sent-179, score-0.057]

60 746, there is still room for improvement for locative MWEs. [sent-180, score-0.068]
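As a reminder of how an f-score for one class is obtained from a predicted set and a gold set, here is a minimal sketch. The function name and the toy item labels are assumptions made for illustration; the numbers are not results from the paper.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall and balanced f-score for one class, given two item sets."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example with made-up item identifiers.
p, r, f = precision_recall_f1({"a", "b", "c", "d"}, {"a", "b", "e"})
```

Because the f-score is the harmonic mean, a class with low recall (as reported for locations) is pulled down even when precision is reasonable.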

61 (Table 2: Results for MWE clustering.) person names is that Urdu names are generally longer than two words, and as we have not considered trigrams yet, it is impossible to find a postposition after an incomplete though generally valid name. [sent-185, score-0.157]

62 Locations tend to have the same problem, however the reasons for missing out on a large part of the locative MWEs are not quite clear and are currently being investigated. [sent-186, score-0.068]

63 Junk bigrams can be detected with an f-score of 0. [sent-187, score-0.062]

64 Due to the heterogeneous nature of the miscellaneous MWEs (e. [sent-189, score-0.104]

65 , complex predicates), many of them are judged as being junk because they never occur with a postposition. [sent-191, score-0.239]

66 If one could detect complex predicate and, possibly, other subgroups from the miscellaneous class, then classifying the junk MWEs would become easier. [sent-192, score-0.316]

67 5 Integration into the Urdu ParGram Grammar The extracted MWEs are integrated into the Urdu ParGram grammar (Butt and King, 2007; Bögel et al. [sent-193, score-0.06]

68 , 2009), a computational grammar for Urdu running with XLE (Crouch et al. [sent-195, score-0.06]

69 This makes grammar development a very conscious task and it is imperative to deal with MWEs in order to achieve a linguistically valid and deep syntactic analysis that can be used for an additional semantic analysis. [sent-198, score-0.09]

70 MWEs that are correctly classified according to the gold standard are automatically integrated into the multiword lexicon of the grammar, accompanied by information about their nature (see example (3)). [sent-199, score-0.281]

71 In general, grammar input is first tokenized by a standard tokenizer that separates the input string into single tokens and replaces the white spaces with a special token boundary symbol. [sent-200, score-0.06]

72 Each token is then passed through a cascade of finite-state morphological analyzers (Beesley and Karttunen, 2003). [sent-201, score-0.024]

73 Apart from the meaning preservation, integrating MWEs into the grammar reduces parsing ambiguity and parsing time, while the perspicuity of the syntactic analyses is increased (Butt et al. [sent-203, score-0.09]

74 In order to prevent the MWEs from being independently analyzed by the finite-state morphology, a look-up is performed in a transducer which only contains MWEs with their morphological information. [sent-205, score-0.046]

75 So instead of analyzing t3ul and AbEb separately, for example, they are analyzed as a single item carrying the morphological information +Noun+Location. [sent-206, score-0.046]
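The look-up step can be illustrated with a plain dictionary that is consulted before any per-token analysis. This is only a sketch of the idea: the grammar uses a finite-state transducer, not a Python dictionary, and the names `MWE_LEXICON` and `analyze` as well as the `None` placeholder for "no MWE analysis" are assumptions made for this example.

```python
# Hypothetical MWE lexicon: bigram -> morphological tags (tag string from the text).
MWE_LEXICON = {("t3ul", "AbEb"): "+Noun+Location"}


def analyze(tokens, lexicon=MWE_LEXICON):
    """Group known bigram MWEs into single items before per-token analysis."""
    analyses = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in lexicon:
            # Treat the two tokens as one unit carrying the MWE's tags.
            analyses.append((" ".join(pair), lexicon[pair]))
            i += 2
        else:
            # None marks a token left to the regular morphological analyzers.
            analyses.append((tokens[i], None))
            i += 1
    return analyses


result = analyze("nAdiyah t3ul AbEb par gAyI".split())
```

Checking the MWE lexicon first is what keeps the two parts of the name from being analyzed independently.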

76 See (4) for an example and Figures 1 and 2 for the corresponding c- and f-structure; the +Location tag in (3) is used to produce the location analysis in the f-structure. [sent-208, score-0.022]

77 Note also that t3ul AbEb is displayed as a multiword under the N node in the c-structure. [sent-209, score-0.2]

78 Figure 1: C-structure for (4) (tree over nAdiyah t3ul AbEb par gAyI; nodes ROOT, Sadj, S, KP, NP, N, K, VCmain, V). Footnote 4: The ` symbol is an escape character, yielding a literal white space. [sent-225, score-0.057]

79 Figure 2: F-structure for (4) ("nAdiyah t3ul AbEb par gAyI"). 6 Discussion, Summary and Future Work Despite the simplistic approach for extracting and clustering Urdu MWEs taken in this paper, the results are encouraging with f-scores of 0. [sent-226, score-0.154]

80 We are well aware that this paper does not present a complete approach to classifying Urdu multiwords, but considering the targeted tool, the Urdu ParGram grammar, this methodology provides us with a set of MWEs that can be implemented to improve the syntactic analyses. [sent-229, score-0.12]

81 The methodology provided here can also guide MWE work in other languages facing the same resource sparsity as Urdu, given that distinctive syntactic cues are available in the language. [sent-230, score-0.117]

82 For Urdu, the syntactic cues are good indications of the nature of the MWE; future work on this subtopic might prove beneficial to the clustering regarding companies, complex predicates and junk MWEs. [sent-231, score-0.44]

83 Another area for future work is to extend the extraction and classification to trigrams to improve the results especially for locations and person names. [sent-232, score-0.189]

84 We also consider harvesting data sources from the web such as lists of cities, common names and companies in Pakistan and India. [sent-233, score-0.1]

85 Acknowledgments We would like to thank Samreen Khan for annotating the gold standard, as well as the anonymous reviewers for their valuable comments. [sent-235, score-0.042]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('mwes', 0.549), ('urdu', 0.437), ('mwe', 0.316), ('multiword', 0.2), ('postpositions', 0.187), ('abeb', 0.151), ('butt', 0.151), ('junk', 0.15), ('nadya', 0.113), ('pargram', 0.113), ('locations', 0.109), ('predicates', 0.098), ('miriam', 0.092), ('expressions', 0.08), ('gayi', 0.076), ('nadiyah', 0.076), ('locative', 0.068), ('annette', 0.067), ('gel', 0.067), ('locat', 0.067), ('miscellaneous', 0.065), ('classifying', 0.064), ('bigrams', 0.062), ('cues', 0.061), ('grammar', 0.06), ('companies', 0.059), ('tel', 0.058), ('par', 0.057), ('chakraborty', 0.057), ('dalrymple', 0.057), ('hautli', 0.057), ('ogel', 0.057), ('sulger', 0.057), ('tina', 0.057), ('ucs', 0.057), ('hindi', 0.056), ('person', 0.055), ('postposition', 0.05), ('holloway', 0.05), ('xle', 0.05), ('king', 0.044), ('tracy', 0.043), ('gold', 0.042), ('names', 0.041), ('lfg', 0.039), ('nature', 0.039), ('beesley', 0.038), ('crulp', 0.038), ('delhi', 0.038), ('hussain', 0.038), ('kizito', 0.038), ('kxak', 0.038), ('malik', 0.038), ('mukerjee', 0.038), ('ramisch', 0.038), ('sarmad', 0.038), ('tanmoy', 0.038), ('yasin', 0.038), ('complex', 0.037), ('sebastian', 0.033), ('attia', 0.033), ('sivaji', 0.033), ('konstanz', 0.033), ('martens', 0.033), ('tak', 0.033), ('zarrie', 0.033), ('chakrabarti', 0.031), ('bengali', 0.031), ('aviv', 0.031), ('hints', 0.031), ('marker', 0.03), ('syntactic', 0.03), ('persons', 0.03), ('ko', 0.029), ('expresses', 0.028), ('crouch', 0.027), ('heuristics', 0.027), ('encouraging', 0.027), ('occur', 0.026), ('judged', 0.026), ('methodology', 0.026), ('extraction', 0.025), ('ki', 0.025), ('clustering', 0.025), ('morphological', 0.024), ('simplistic', 0.024), ('kp', 0.024), ('bigram', 0.024), ('csli', 0.024), ('movement', 0.024), ('resources', 0.023), ('went', 0.023), ('loc', 0.023), ('banerjee', 0.022), ('analyzed', 0.022), ('location', 0.022), ('ne', 0.021), ('extracting', 0.021), ('ka', 0.021), ('comprises', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999994 134 acl-2011-Extracting and Classifying Urdu Multiword Expressions

2 0.44611287 6 acl-2011-A Comprehensive Dictionary of Multiword Expressions

Author: Kosho Shudo ; Akira Kurahone ; Toshifumi Tanabe

Abstract: It has been widely recognized that one of the most difficult and intriguing problems in natural language processing (NLP) is how to cope with idiosyncratic multiword expressions. This paper presents an overview of the comprehensive dictionary (JDMWE) of Japanese multiword expressions. The JDMWE is characterized by a large notational, syntactic, and semantic diversity of contained expressions as well as a detailed description of their syntactic functions, structures, and flexibilities. The dictionary contains about 104,000 expressions, potentially 750,000 expressions. This paper shows that the JDMWE’s validity can be supported by comparing the dictionary with a large-scale Japanese N-gram frequency dataset, namely the LDC2009T08, generated by Google Inc. (Kudo et al. 2009).

3 0.11276019 90 acl-2011-Crowdsourcing Translation: Professional Quality from Non-Professionals

Author: Omar F. Zaidan ; Chris Callison-Burch

Abstract: Naively collecting translations by crowdsourcing the task to non-professional translators yields disfluent, low-quality results if no quality control is exercised. We demonstrate a variety of mechanisms that increase the translation quality to near professional levels. Specifically, we solicit redundant translations and edits to them, and automatically select the best output among them. We propose a set of features that model both the translations and the translators, such as country of residence, LM perplexity of the translation, edit rate from the other translations, and (optionally) calibration against professional translators. Using these features to score the collected translations, we are able to discriminate between acceptable and unacceptable translations. We recreate the NIST 2009 Urdu-to-English evaluation set with Mechanical Turk, and quantitatively show that our models are able to select translations within the range of quality that we expect from professional translators. The total cost is more than an order of magnitude lower than professional translation.

4 0.041345797 144 acl-2011-Global Learning of Typed Entailment Rules

Author: Jonathan Berant ; Ido Dagan ; Jacob Goldberger

Abstract: Extensive knowledge bases of entailment rules between predicates are crucial for applied semantic inference. In this paper we propose an algorithm that utilizes transitivity constraints to learn a globally-optimal set of entailment rules for typed predicates. We model the task as a graph learning problem and suggest methods that scale the algorithm to larger graphs. We apply the algorithm over a large data set of extracted predicate instances, from which a resource of typed entailment rules has been recently released (Schoenmackers et al., 2010). Our results show that using global transitivity information substantially improves performance over this resource and several baselines, and that our scaling methods allow us to increase the scope of global learning of entailment-rule graphs.

5 0.038901035 151 acl-2011-Hindi to Punjabi Machine Translation System

Author: Vishal Goyal ; Gurpreet Singh Lehal

Abstract: Hindi-Punjabi being a closely related language pair (Goyal V. and Lehal G.S., 2008), a hybrid machine translation approach has been used for developing the Hindi to Punjabi Machine Translation System. Non-availability of lexical resources, spelling variations in the source language text, ambiguous words in the source text, named entity recognition and collocations are the major challenges faced while developing this system. The key activities involved in the translation process are preprocessing, the translation engine and post-processing. Lookup algorithms, pattern matching algorithms etc. formed the basis for solving these issues. The system accuracy has been evaluated using an intelligibility test, an accuracy test and BLEU score. The hybrid system is found to perform better than the constituent systems. Keywords: Machine Translation, Computational Linguistics, Natural Language Processing, Hindi, Punjabi, Translate Hindi to Punjabi, Closely related languages. 1 Introduction A Machine Translation system is software that takes a text in one language (called the source language) and translates it into another language (called the target language). There are a number of approaches to MT, such as direct, transform-based, interlingua-based and statistical. The choice of approach depends upon the available resources and the kind of languages involved. In general, if the two languages are structurally similar, in particular as regards lexical correspondences, morphology and word order, the case for abstract syntactic analysis seems less convincing. Since the present research work deals with a pair of closely related languages, i.e. Hindi-Punjabi, the direct word-to-word translation approach is the obvious choice. As some rule-based processing has also been used, a hybrid approach has been adopted for developing the system. (Author affiliation: Gurpreet Singh Lehal, Department of Computer Science, Punjabi University, Patiala, India.)
An exhaustive survey of the existing machine translation systems developed so far, mentioning their accuracies and limitations, has already been given (Goyal V. and Lehal G.S., 2009). 2 System Architecture 2.1 Pre Processing Phase The preprocessing stage is a collection of operations that are applied to the input data to make it processable by the translation engine. In this first phase of the Machine Translation system, the activities incorporated include text normalization, replacing collocations and replacing proper nouns. 2.2 Text Normalization The variety in the alphabet, different dialects and the influence of foreign languages have resulted in spelling variations of the same word. Such variations can sometimes be treated as errors in writing (Goyal V. and Lehal G.S., 2010). 2.3 Replacing Collocations After text normalization, the input text passes through the collocation replacement sub-phase of the preprocessing phase. A collocation is two or more consecutive words with a special behavior (Choueka, 1988). For example, the collocation उत्तर प्रदेश (uttar pradēsh), if translated word to word, will be translated as ਜਵਾਬ ਰਾਜ (javāb rāj), but it must be translated as ਉੱਤਰ ਪ੍ਰਦੇਸ਼ (uttar pradēsh). The results of collocation extraction using the t-test are not accurate and include a number of bigrams and trigrams that are not actually collocations. Thus, such entries were removed manually and the actual collocations were further extracted. The correct corresponding Punjabi translation for each extracted collocation is stored in the collocation table of the database. The collocation table of the database consists of 5000 such entries. In this sub-phase, the normalized input text is analyzed. (Proceedings of the ACL-HLT 2011 System Demonstrations, pages 1–6, Portland, Oregon, USA, 21 June 2011. © 2011 Association for Computational Linguistics.)
Each collocation in the database found in the input text will be replaced with the Punjabi translation of the corresponding collocation. It is found that when tested on a corpus containing about 1,00,000 words, only 0.001 % collocations were found and replaced during the translation. (Figure 1: Overview of Hindi-Punjabi Machine Translation System.) 2.4 Replacing Proper Nouns A great proportion of unseen words includes proper nouns like personal names, days of month, days of week, country names, city names, bank names, organization names, ocean names, river names, university names etc., and if translated word to word, their meaning is changed. Even if the meaning is not affected, this step speeds up the translation process. Once these words are recognized and stored in the proper noun database, there is no need to decide about their translation or transliteration every time such words are present in the input text for translation. This gazetteer makes the translation accurate and fast. This list is self-growing during each translation. Thus, to process this sub-phase, the system requires a proper noun gazetteer that has been compiled offline. For this task, we have developed an offline module to extract proper nouns from the corpus based on some rules. Also, a Named Entity recognition module has been developed based on the CRF approach (Sharma R. and Goyal V., 2011b). 2.5 Tokenizer Tokenizers (also known as lexical analyzers or word segmenters) segment a stream of characters into meaningful units called tokens. The tokenizer takes the text generated by the preprocessing phase as input. Individual words or tokens are extracted and processed to generate their equivalents in the target language. This module, using a space or a punctuation mark as delimiter, extracts tokens (words) one by one from the text and gives them to the translation engine for analysis until the complete input text is read and processed.
2.6 Translation Engine The translation engine is the main component of our Machine Translation system. It takes the tokens generated by the tokenizer as input and outputs the translated tokens in the target language. These translated tokens are concatenated one after another along with the delimiter. The modules included in this phase are explained below one by one. 2.6.1 Identifying Titles and Surnames A title may be defined as a formal appellation attached to the name of a person or family by virtue of office, rank, hereditary privilege, noble birth, or attainment, or used as a mark of respect. Thus the word next to a title and the word previous to a surname is usually a proper noun. Sometimes, a word used as the proper name of a person has its own meaning in the target language. Similarly, a surname may be defined as a name shared in common to identify the members of a family, as distinguished from each member's given name. It is also called the family name or last name. When either a title or a surname is passed through the translation engine, it is translated by the system. This causes the system to fail, as these proper names should be transliterated instead of translated. For example, consider the Hindi sentence श्रीमान हर्ष जी हमारे यहाँ पधारे। (shrīmān harsh jī hamārē yahāṃ padhārē). In this sentence, हर्ष (harsh) has the meaning “joy”. The equivalent translation of हर्ष (harsh) in the target language is ਖੁਸ਼ੀ (khushī). Similarly, consider the Hindi sentence प्रकाश सिंह हमारे यहाँ पधारे। (prakāsh siṃh hamārē yahāṃ padhārē). Here, the word प्रकाश (prakāsh) is acting as a proper noun and it must be transliterated and not translated, because सिंह (siṃh) is a surname and the word previous to it is a proper noun. Thus, a small module has been developed for locating such proper nouns by identifying titles and surnames. There is one special character ‘॰’ in the Devanagari script to mark symbols like डा॰, प्रो॰.
If this module finds this symbol to be a title or surname, the word next to or previous to this token, as the case may be for title or surname respectively, will be transliterated, not translated. The title and surname databases consist of 14 and 654 entries respectively. These databases can be extended at any time to allow new titles and surnames to be added. This module was tested on a large Hindi corpus and showed that about 2-5 % of the input text, depending upon its domain, is proper nouns. Thus, this module plays an important role in translation. 2.6.2 Hindi Morphological analyzer This module finds the root word for the token and its morphological features. The morphological analyzer developed by IIT-H has been ported to the Windows platform to make it usable for this system (Goyal V. and Lehal G.S., 2008a). 2.6.3 Word-to-Word translation using lexicon lookup If a token is not a title or a surname, it is looked up in the HPDictionary database containing Hindi to Punjabi direct word to word translations. If it is found, it is used for translation. If no entry is found in the HPDictionary database, it is sent to the next sub-phase for processing. The HPDictionary database consists of 54,127 entries. This database can be extended at any time to allow new entries in the dictionary to be added. 2.6.4 Resolving Ambiguity Among the number of approaches for disambiguation, the most appropriate approach to determine the correct meaning of a Hindi word in a particular usage for our Machine Translation system is to examine its context using the N-gram approach. After analyzing the past experiences of various authors, we have chosen the value of n to be 3 and 2, i.e. trigram and bigram approaches respectively, for our system. Trigrams are further categorized into three different types. The first category of trigram consists of the context of one word previous to and one word next to the ambiguous word. The second category of trigram consists of the context of two adjacent previous words to the ambiguous word.
The third category of trigram consists of the context of two adjacent next words to the ambiguous word. Bigrams are also categorized into two categories. The first category of bigrams consists of the context of one previous word to the ambiguous word and the second category of bigrams consists of one context word next to the ambiguous word. For this purpose, a Hindi corpus consisting of about 2 million words was collected from different sources like online newspaper daily news, blogs, Prem Chand stories, Yashwant jain stories, articles etc. The most common ambiguous words were identified. We have found a list of 75 ambiguous words, out of which the most frequent are से (sē) and और (aur) (Goyal V. and Lehal G.S., 2011). 2.6.5 Handling Unknown Words 2.6.5.1 Word Inflectional Analysis and generation In linguistics, a suffix (also sometimes called a postfix or ending) is an affix which is placed after the stem of a word. Common examples are case endings, which indicate the grammatical case of nouns or adjectives, and verb endings. Hindi is a (relatively) free word-order and highly inflectional language. Because of their common origin, both languages have very similar structure and grammar. The difference is only in words and in pronunciation, e.g. the word for boy is लड़का in Hindi and ਮੁੰਡਾ in Punjabi, and sometimes even that difference is not there, as in घर (ghar) and ਘਰ (ghar). The inflected forms of both these words in Hindi and Punjabi are also similar. In this activity, inflectional analysis without using morphology has been performed for all those tokens that are not processed by the morphological analysis module. For performing inflectional analysis, a rule-based approach has been followed. When a token is passed to this sub-phase for inflectional analysis, if any pattern of a regular expression (inflection rule) matches this token, that rule is applied to the token and its equivalent translation in Punjabi is generated based on the matched rule(s).
The generated word is also checked for correctness, using a database of correct Punjabi words.

2.6.5.2 Transliteration

This module is beneficial for handling out-of-vocabulary words. For example, the word विशाल is transliterated as ਵਿਸ਼ਾਲ (vishāl) whereas it is translated as ਵੱਡਾ. Every Machine Translation system needs some method for words such as technical terms and proper names of persons, places, objects etc. that cannot be found in translation resources such as the Hindi-Punjabi bilingual dictionary, the surnames database, the titles database etc., and transliteration is an obvious choice for such words. (Goyal V. and Lehal G.S., 2009a)

2.7 Post-Processing

2.7.1 Agreement Corrections

In spite of the great similarity between Hindi and Punjabi, there are still a number of important agreement divergences in gender and number. The output generated by the translation engine phase becomes the input for the post-processing phase. This phase corrects the agreement errors using rules implemented as regular expressions. (Goyal V. and Lehal G.S., 2011)

3 Evaluation and Results

The evaluation document set consisted of documents from various online newspapers, articles, blogs, biographies etc. This test bed consisted of 35500 words and was translated using our Machine Translation system.

3.1 Test Document

For the evaluation of our Machine Translation system, we have used the benchmark sampling method for selecting the set of sentences. Input sentences were drawn from randomly selected news (sports, politics, world, regional, entertainment, travel etc.), articles (published by various writers, philosophers etc.), literature (stories by Prem Chand, Yashwant Jain etc.), official-language office letters (the language officially used in files in government offices) and blogs (posted by the general public in forums etc.). Care has been taken to ensure that sentences use a variety of constructs.
All possible constructs, simple as well as complex, are included in the set. The sentence set also contains all types of sentences: declarative, interrogative, imperative and exclamatory. Sentence length is not restricted, although care has been taken that single sentences do not become too long. The following table shows the test data set:

Table 1: Test data set for the evaluation of Hindi to Punjabi Machine Translation
[Table body lost in extraction.]

3.2 Experiments

It is also important to choose appropriate evaluators for our experiments. Thus, depending on the requirements of the above-mentioned tests, 50 people from different professions were selected to perform the experiments. 20 persons were from villages and knew only Punjabi, not Hindi, and 30 persons were from different professions and knew both Hindi and Punjabi. Average ratings for the sentences of the individual translations were then summed up (separately for intelligibility and accuracy) to get the average scores. The percentages of accurate sentences and intelligible sentences were also calculated separately by counting the number of sentences.

3.2.1 Intelligibility Evaluation

The evaluators do not have any clue about the source language, i.e. Hindi. They judge each sentence (in the target language, i.e. Punjabi) on the basis of its comprehensibility. The target user is a layman who is interested only in the comprehensibility of translations. Intelligibility is affected by grammatical errors, mistranslations, and untranslated words.

3.2.1.1 Results

The responses of the evaluators were analysed, with the following results:
• 70.3 % sentences got the score 3 i.e. they were perfectly clear and intelligible.
• 25.1 % sentences got the score 2 i.e. they were generally clear and intelligible.
• 3.5 % sentences got the score 1 i.e. they were hard to understand.
• 1.1 % sentences got the score 0 i.e. they were not understandable.
So we can say that about 95.40 % of the sentences are intelligible, namely those with a score of 2 or above. Thus, we can say that the direct approach can translate Hindi text to Punjabi text with considerably good accuracy.

3.2.2 Accuracy Evaluation / Fidelity Measure

The evaluators are provided with the source text along with the translated text. A highly intelligible output sentence need not be a correct translation of the source sentence. It is important to check whether the meaning of the source language sentence is preserved in the translation. This property is called accuracy.

3.2.2.1 Results

Initially, the null hypothesis is assumed, i.e. that the system's performance is null. The author assumes that the system is dumb and does not produce any valuable output. The intelligibility analysis and the accuracy analysis have proved this wrong. The accuracy percentage for the system is found to be 87.60%. Further investigations reveal that out of 13.40%:
• 80.6 % sentences achieve a match between 50 to 99%
• 17.2 % of remaining sentences were marked with less than 50% match against the correct sentences.
• Only 2.2 % sentences are those which are found unfaithful.
A match below 50% does not mean that the sentences are not usable. After some post-editing, they can fit properly in the translated text. (Goyal, V., Lehal, G.S., 2009b)

3.2.2 BLEU Score

As no Hindi parallel corpus was available, we generated a Hindi parallel corpus of about 10K sentences in order to test the system automatically. The BLEU score comes out to be 0.7801.

5 Conclusion

In this paper, a hybrid translation approach for translating text from Hindi to Punjabi has been presented. The proposed architecture has shown extremely good results and is found to be appropriate for MT systems between closely related language pairs.
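The intelligibility figure in Section 3.2.1.1 follows directly from the reported score distribution. A minimal sketch of that tally, using exact decimal arithmetic; the names `score_share` and `share_at_or_above` are illustrative and not taken from the paper:

```python
from decimal import Decimal

# Share of sentences (in percent) per intelligibility score, as reported in Section 3.2.1.1.
score_share = {
    3: Decimal("70.3"),  # perfectly clear and intelligible
    2: Decimal("25.1"),  # generally clear and intelligible
    1: Decimal("3.5"),   # hard to understand
    0: Decimal("1.1"),   # not understandable
}


def share_at_or_above(distribution, threshold):
    """Sum the percentage of sentences whose score is at least `threshold`."""
    return sum(
        (share for score, share in distribution.items() if score >= threshold),
        Decimal("0"),
    )


intelligible = share_at_or_above(score_share, 2)
total = share_at_or_above(score_share, 0)
print(intelligible)  # 95.4
print(total)         # 100.0
```

Counting a sentence as intelligible when its score is 2 or above reproduces the 95.40 % stated in the text, and the four shares add up to 100 %.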
Copyright

The developed system has already been copyrighted with The Registrar, Punjabi University, Patiala, with the same authors as this publication.

Acknowledgement

We are thankful to Dr. Amba Kulkarni, University of Hyderabad, for her support in providing technical assistance for developing this system.

References

Bharati, Akshar, Chaitanya, Vineet, Kulkarni, Amba P., Sangal, Rajeev. 1997. Anusaaraka: Machine Translation in stages. Vivek, A Quarterly in Artificial Intelligence, Vol. 10, No. 3, NCST, Bangalore, India, pp. 22-25.
Goyal V., Lehal G.S. 2008. Comparative Study of Hindi and Punjabi Language Scripts. Nepalese Linguistics, Journal of the Linguistics Society of Nepal, Volume 23, November Issue, pp. 67-82.
Goyal V., Lehal G.S. 2008a. Hindi Morphological Analyzer and Generator. In Proc.: 1st International Conference on Emerging Trends in Engineering and Technology, G.H. Raisoni College of Engineering, Nagpur, July 16-19, 2008, pp. 1156-1159, IEEE Computer Society Press, California, USA.
Goyal V., Lehal G.S. 2009. Advances in Machine Translation Systems. Language In India, Volume 9, November Issue, pp. 138-150.
Goyal V., Lehal G.S. 2009a. A Machine Transliteration System for Machine Translation System: An Application on Hindi-Punjabi Language Pair. Atti Della Fondazione Giorgio Ronchi (Italy), Volume LXIV, No. 1, pp. 27-35.
Goyal V., Lehal G.S. 2009b. Evaluation of Hindi to Punjabi Machine Translation System. International Journal of Computer Science Issues, France, Vol. 4, No. 1, pp. 36-39.
Goyal V., Lehal G.S. 2010. Automatic Spelling Standardization for Hindi Text. In: 1st International Conference on Computer & Communication Technology, Moti Lal Nehru National Institute of Technology, Allahabad, September 17-19, 2010, pp. 764-767, IEEE Computer Society Press, California.
Goyal V., Lehal G.S. 2011. N-Grams Based Word Sense Disambiguation: A Case Study of Hindi to Punjabi Machine Translation System. International Journal of Translation. (Accepted, In Print).
Goyal V., Lehal G.S. 2011a. Hindi to Punjabi Machine Translation System. In Proc.: International Conference for Information Systems for Indian Languages, Department of Computer Science, Punjabi University, Patiala, March 9-11, 2011, pp. 236-241, Springer CCIS 139, Germany.
Sharma R., Goyal V. 2011b. Named Entity Recognition Systems for Hindi using CRF Approach. In Proc.: International Conference for Information Systems for Indian Languages, Department of Computer Science, Punjabi University, Patiala, March 9-11, 2011, pp. 31-35, Springer CCIS 139, Germany.

6 0.038339309 219 acl-2011-Metagrammar engineering: Towards systematic exploration of implemented grammars

7 0.03681989 29 acl-2011-A Word-Class Approach to Labeling PSCFG Rules for Machine Translation

8 0.033629041 157 acl-2011-I Thou Thee, Thou Traitor: Predicting Formal vs. Informal Address in English Literature

9 0.033335127 317 acl-2011-Underspecifying and Predicting Voice for Surface Realisation Ranking

10 0.032741822 75 acl-2011-Combining Morpheme-based Machine Translation with Post-processing Morpheme Prediction

11 0.031659927 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering

12 0.031153549 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing

13 0.030701347 269 acl-2011-Scaling up Automatic Cross-Lingual Semantic Role Annotation

14 0.030163154 312 acl-2011-Turn-Taking Cues in a Human Tutoring Corpus

15 0.029674986 50 acl-2011-Automatic Extraction of Lexico-Syntactic Patterns for Detection of Negation and Speculation Scopes

16 0.029585242 298 acl-2011-The ACL Anthology Searchbench

17 0.029507099 242 acl-2011-Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments

18 0.028448207 284 acl-2011-Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models

19 0.028215783 48 acl-2011-Automatic Detection and Correction of Errors in Dependency Treebanks

20 0.028013894 285 acl-2011-Simple supervised document geolocation with geodesic grids


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.087), (1, -0.001), (2, -0.021), (3, -0.017), (4, -0.021), (5, 0.016), (6, 0.05), (7, -0.011), (8, -0.003), (9, -0.012), (10, -0.042), (11, -0.026), (12, -0.024), (13, 0.023), (14, 0.003), (15, -0.062), (16, 0.001), (17, -0.001), (18, 0.047), (19, -0.044), (20, 0.027), (21, 0.054), (22, -0.027), (23, -0.12), (24, -0.045), (25, -0.015), (26, -0.006), (27, -0.03), (28, -0.112), (29, -0.001), (30, 0.116), (31, 0.064), (32, 0.146), (33, -0.071), (34, -0.107), (35, 0.256), (36, -0.037), (37, -0.098), (38, -0.217), (39, -0.079), (40, 0.056), (41, -0.061), (42, 0.347), (43, 0.295), (44, -0.2), (45, -0.151), (46, 0.209), (47, 0.136), (48, -0.096), (49, -0.06)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95822436 134 acl-2011-Extracting and Classifying Urdu Multiword Expressions

Author: Annette Hautli ; Sebastian Sulger

Abstract: This paper describes a method for automatically extracting and classifying multiword expressions (MWEs) for Urdu on the basis of a relatively small unannotated corpus (around 8.12 million tokens). The MWEs are extracted by an unsupervised method and classified into two distinct classes, namely locations and person names. The classification is based on simple heuristics that take the co-occurrence of MWEs with distinct postpositions into account. The resulting classes are evaluated against a hand-annotated gold standard and achieve an f-score of 0.5 and 0.746 for locations and persons, respectively. A target application is the Urdu ParGram grammar, where MWEs are needed to generate a more precise syntactic and semantic analysis.

2 0.92382264 6 acl-2011-A Comprehensive Dictionary of Multiword Expressions

Author: Kosho Shudo ; Akira Kurahone ; Toshifumi Tanabe

Abstract: It has been widely recognized that one of the most difficult and intriguing problems in natural language processing (NLP) is how to cope with idiosyncratic multiword expressions. This paper presents an overview of the comprehensive dictionary (JDMWE) of Japanese multiword expressions. The JDMWE is characterized by a large notational, syntactic, and semantic diversity of contained expressions as well as a detailed description of their syntactic functions, structures, and flexibilities. The dictionary contains about 104,000 expressions, potentially 750,000 expressions. This paper shows that the JDMWE’s validity can be supported by comparing the dictionary with a large-scale Japanese N-gram frequency dataset, namely the LDC2009T08, generated by Google Inc. (Kudo et al. 2009).

3 0.2016176 102 acl-2011-Does Size Matter - How Much Data is Required to Train a REG Algorithm?

Author: Mariet Theune ; Ruud Koolen ; Emiel Krahmer ; Sander Wubben

Abstract: In this paper we investigate how much data is required to train an algorithm for attribute selection, a subtask of Referring Expressions Generation (REG). To enable comparison between different-sized training sets, a systematic training method was developed. The results show that depending on the complexity of the domain, training on 10 to 20 items may already lead to a good performance.

4 0.20085226 193 acl-2011-Language-independent compound splitting with morphological operations

Author: Klaus Macherey ; Andrew Dai ; David Talbot ; Ashok Popat ; Franz Och

Abstract: Translating compounds is an important problem in machine translation. Since many compounds have not been observed during training, they pose a challenge for translation systems. Previous decompounding methods have often been restricted to a small set of languages as they cannot deal with more complex compound forming processes. We present a novel and unsupervised method to learn the compound parts and morphological operations needed to split compounds into their compound parts. The method uses a bilingual corpus to learn the morphological operations required to split a compound into its parts. Furthermore, monolingual corpora are used to learn and filter the set of compound part candidates. We evaluate our method within a machine translation task and show significant improvements for various languages to show the versatility of the approach.

5 0.19857109 238 acl-2011-P11-2093 k2opt.pdf

Author: empty-author

Abstract: We present a pointwise approach to Japanese morphological analysis (MA) that ignores structure information during learning and tagging. Despite the lack of structure, it is able to outperform the current state-of-the-art structured approach for Japanese MA, and achieves accuracy similar to that of structured predictors using the same feature set. We also find that the method is both robust to out-of-domain data, and can be easily adapted through the use of a combination of partial annotation and active learning.

6 0.1833448 291 acl-2011-SystemT: A Declarative Information Extraction System

7 0.1789304 219 acl-2011-Metagrammar engineering: Towards systematic exploration of implemented grammars

8 0.17186575 239 acl-2011-P11-5002 k2opt.pdf

9 0.15674943 68 acl-2011-Classifying arguments by scheme

10 0.15620016 36 acl-2011-An Efficient Indexer for Large N-Gram Corpora

11 0.15511659 209 acl-2011-Lexically-Triggered Hidden Markov Models for Clinical Document Coding

12 0.15121599 90 acl-2011-Crowdsourcing Translation: Professional Quality from Non-Professionals

13 0.14852005 121 acl-2011-Event Discovery in Social Media Feeds

14 0.14283863 252 acl-2011-Prototyping virtual instructors from human-human corpora

15 0.14089987 165 acl-2011-Improving Classification of Medical Assertions in Clinical Notes

16 0.14025323 249 acl-2011-Predicting Relative Prominence in Noun-Noun Compounds

17 0.13734445 99 acl-2011-Discrete vs. Continuous Rating Scales for Language Evaluation in NLP

18 0.13595591 78 acl-2011-Confidence-Weighted Learning of Factored Discriminative Language Models

19 0.13454458 338 acl-2011-Wikulu: An Extensible Architecture for Integrating Natural Language Processing Techniques with Wikis

20 0.13327393 297 acl-2011-That's What She Said: Double Entendre Identification


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.425), (5, 0.026), (17, 0.04), (26, 0.026), (31, 0.039), (37, 0.049), (39, 0.04), (41, 0.042), (55, 0.032), (59, 0.042), (72, 0.027), (91, 0.018), (96, 0.093)]
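The page does not state how simValue is derived from these topic weights. A minimal sketch, assuming it is a cosine similarity over sparse (topicId, topicWeight) pairs, which is a common choice but an assumption here. The function name `cosine_similarity` and the vector `other_paper` are made up for illustration; only `this_paper` is copied from the listing above.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors given as (topicId, weight) pairs."""
    weights_a, weights_b = dict(a), dict(b)
    # Only topics present in both vectors contribute to the dot product.
    dot = sum(w * weights_b[t] for t, w in weights_a.items() if t in weights_b)
    norm_a = math.sqrt(sum(w * w for w in weights_a.values()))
    norm_b = math.sqrt(sum(w * w for w in weights_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


# Topic weights of this paper, copied from the listing above.
this_paper = [(0, 0.425), (5, 0.026), (17, 0.04), (26, 0.026), (31, 0.039),
              (37, 0.049), (39, 0.04), (41, 0.042), (55, 0.032), (59, 0.042),
              (72, 0.027), (91, 0.018), (96, 0.093)]

# Hypothetical second paper, invented purely for illustration.
other_paper = [(0, 0.31), (17, 0.12), (96, 0.05)]

print(round(cosine_similarity(this_paper, other_paper), 3))
```

Under this assumption a paper compared with itself scores 1.0; the same-paper entries in the lists on this page are below 1, so the actual pipeline may use a different measure or an approximate index.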

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.78317797 134 acl-2011-Extracting and Classifying Urdu Multiword Expressions

Author: Annette Hautli ; Sebastian Sulger

Abstract: This paper describes a method for automatically extracting and classifying multiword expressions (MWEs) for Urdu on the basis of a relatively small unannotated corpus (around 8.12 million tokens). The MWEs are extracted by an unsupervised method and classified into two distinct classes, namely locations and person names. The classification is based on simple heuristics that take the co-occurrence of MWEs with distinct postpositions into account. The resulting classes are evaluated against a hand-annotated gold standard and achieve an f-score of 0.5 and 0.746 for locations and persons, respectively. A target application is the Urdu ParGram grammar, where MWEs are needed to generate a more precise syntactic and semantic analysis.

2 0.77792096 321 acl-2011-Unsupervised Discovery of Rhyme Schemes

Author: Sravana Reddy ; Kevin Knight

Abstract: This paper describes an unsupervised, language-independent model for finding rhyme schemes in poetry, using no prior knowledge about rhyme or pronunciation.

3 0.68993372 214 acl-2011-Lost in Translation: Authorship Attribution using Frame Semantics

Author: Steffen Hedegaard ; Jakob Grue Simonsen

Abstract: We investigate authorship attribution using classifiers based on frame semantics. The purpose is to discover whether adding semantic information to lexical and syntactic methods for authorship attribution will improve them, specifically to address the difficult problem of authorship attribution of translated texts. Our results suggest (i) that frame-based classifiers are usable for author attribution of both translated and untranslated texts; (ii) that frame-based classifiers generally perform worse than the baseline classifiers for untranslated texts, but (iii) perform as well as, or superior to the baseline classifiers on translated texts; (iv) that, contrary to current belief, naïve classifiers based on lexical markers may perform tolerably on translated texts if the combination of author and translator is present in the training set of a classifier.

4 0.60076481 249 acl-2011-Predicting Relative Prominence in Noun-Noun Compounds

Author: Taniya Mishra ; Srinivas Bangalore

Abstract: There are several theories regarding what influences prominence assignment in English noun-noun compounds. We have developed corpus-driven models for automatically predicting prominence assignment in noun-noun compounds using feature sets based on two such theories: the informativeness theory and the semantic composition theory. The evaluation of the prediction models indicates that though both of these theories are relevant, they account for different types of variability in prominence assignment.

5 0.58217603 112 acl-2011-Efficient CCG Parsing: A* versus Adaptive Supertagging

Author: Michael Auli ; Adam Lopez

Abstract: We present a systematic comparison and combination of two orthogonal techniques for efficient parsing of Combinatory Categorial Grammar (CCG). First we consider adaptive supertagging, a widely used approximate search technique that prunes most lexical categories from the parser’s search space using a separate sequence model. Next we consider several variants on A*, a classic exact search technique which to our knowledge has not been applied to more expressive grammar formalisms like CCG. In addition to standard hardware-independent measures of parser effort we also present what we believe is the first evaluation of A* parsing on the more realistic but more stringent metric of CPU time. By itself, A* substantially reduces parser effort as measured by the number of edges considered during parsing, but we show that for CCG this does not always correspond to improvements in CPU time over a CKY baseline. Combining A* with adaptive supertagging decreases CPU time by 15% for our best model.

6 0.57067215 31 acl-2011-Age Prediction in Blogs: A Study of Style, Content, and Online Behavior in Pre- and Post-Social Media Generations

7 0.38597685 5 acl-2011-A Comparison of Loopy Belief Propagation and Dual Decomposition for Integrated CCG Supertagging and Parsing

8 0.32604352 242 acl-2011-Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments

9 0.32232487 212 acl-2011-Local Histograms of Character N-grams for Authorship Attribution

10 0.32010114 319 acl-2011-Unsupervised Decomposition of a Document into Authorial Components

11 0.31374779 195 acl-2011-Language of Vandalism: Improving Wikipedia Vandalism Detection via Stylometric Analysis

12 0.31050992 306 acl-2011-Towards Style Transformation from Written-Style to Audio-Style

13 0.30930808 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering

14 0.30886129 137 acl-2011-Fine-Grained Class Label Markup of Search Queries

15 0.30817503 269 acl-2011-Scaling up Automatic Cross-Lingual Semantic Role Annotation

16 0.30814105 301 acl-2011-The impact of language models and loss functions on repair disfluency detection

17 0.30771506 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing

18 0.30735931 32 acl-2011-Algorithm Selection and Model Adaptation for ESL Correction Tasks

19 0.306169 170 acl-2011-In-domain Relation Discovery with Meta-constraints via Posterior Regularization

20 0.30598575 274 acl-2011-Semi-Supervised Frame-Semantic Parsing for Unknown Predicates