acl acl2012 acl2012-202 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Emad Mohamed ; Behrang Mohit ; Kemal Oflazer
Abstract: We present a method for generating Colloquial Egyptian Arabic (CEA) from morphologically disambiguated Modern Standard Arabic (MSA). When used in POS tagging, this process improves the accuracy from 73.24% to 86.84% on unseen CEA text, and reduces the percentage of out-of-vocabulary words from 28.98% to 16.66%. The process holds promise for any NLP task targeting the dialectal varieties of Arabic; e.g., this approach may provide a cheap way to leverage MSA data and morphological resources to create resources for colloquial Arabic to English machine translation. It can also considerably speed up the annotation of Arabic dialects.
Reference: text
sentIndex sentText sentNum sentScore
1 We present a method for generating Colloquial Egyptian Arabic (CEA) from morphologically disambiguated Modern Standard Arabic (MSA). [sent-6, score-0.056]
2 When used in POS tagging, this process improves the accuracy from 73.24% to 86.84% on unseen CEA text, and reduces the percentage of out-of-vocabulary words from 28.98% to 16.66%. [sent-9, score-0.048]
3 The process holds promise for any NLP task targeting the dialectal varieties of Arabic; e.g., [sent-12, score-0.185]
4 this approach may provide a cheap way to leverage MSA data and morphological resources to create resources for colloquial Arabic to English machine translation. [sent-14, score-0.382]
5 It can also considerably speed up the annotation of Arabic dialects. [sent-15, score-0.024]
6 Dialectal varieties have not received much attention due to the lack of dialectal tools and annotated texts (Duh and Kirchhoff, 2005). [sent-18, score-0.185]
7 The transformation process relies on the observation that dialectal varieties of Arabic differ mainly in the use of affixes and function words while the word stem mostly remains unchanged. [sent-20, score-0.401]
8 For example, given the Buckwalter-encoded MSA sentence “AlAxwAn Almslmwn lm yfwzwA fy AlAntxAbAt”, the rules produce “AlAxwAn Almslmyn mfAzw$ f AlAntxAbAt” (الاخوان المسلمين مفازوش ف الانتخابات, The Muslim Brotherhood did not win the elections). [sent-21, score-0.106]
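The examples in this digest are written in Buckwalter transliteration. As a rough illustration of how such a string maps back to Arabic script, here is a minimal sketch. The table is deliberately partial (only the letters that occur in the CEA example sentence), and the function name `buckwalter_to_arabic` is a hypothetical helper, not something taken from the paper.

```python
# Partial Buckwalter-to-Arabic table. Only the letters needed for the
# example sentence are listed; the full scheme has many more entries.
BUCKWALTER = {
    "A": "\u0627",  # alef
    "b": "\u0628",  # beh
    "t": "\u062A",  # teh
    "x": "\u062E",  # khah
    "z": "\u0632",  # zain
    "s": "\u0633",  # seen
    "$": "\u0634",  # sheen
    "f": "\u0641",  # feh
    "l": "\u0644",  # lam
    "m": "\u0645",  # meem
    "n": "\u0646",  # noon
    "w": "\u0648",  # waw
    "y": "\u064A",  # yeh
}


def buckwalter_to_arabic(text: str) -> str:
    """Map each known Buckwalter character to Arabic; keep anything else as-is."""
    return "".join(BUCKWALTER.get(ch, ch) for ch in text)


cea_sentence = "AlAxwAn Almslmyn mfAzw$ f AlAntxAbAt"
print(buckwalter_to_arabic(cea_sentence))
```

Characters outside the table (including spaces) pass through unchanged, so the sketch degrades gracefully on input it does not cover.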
9 The availability of segment-based part-of-speech tags is essential since many of the affixes in MSA are ambiguous. [sent-22, score-0.119]
10 For example, lm could be either a negative particle or a question word, and the word AlAxwAn could be either made of two segments (Al+ xw+An, the two brothers). [sent-23, score-0.242]
11 We first introduce the transformation rules, and show that in many cases it is feasible to transform MSA to CEA, although there are cases that require much more than POS tags. [sent-24, score-0.053]
12 We then provide a typical case in which we utilize the transformed text of the Arabic Treebank (Bies and Maamouri, 2003) to build a part-of-speech tagger for CEA. [sent-25, score-0.047]
13 The tagger improves the accuracy of POS tagging on authentic Egyptian Arabic by 13% absolute (from 73.24% to 86.84%) [sent-26, score-0.158]
14 and reduces the percentage of out-of-vocabulary words from 28.98% to 16.66%. [sent-28, score-0.048]
15 Both can be translated into: “We did not write it for them.” [sent-33, score-0.034]
16 MSA has three words while CEA is more synthetic as the preposition and the negative particle turn into clitics. [sent-34, score-0.206]
17 Table 1 illustrates the end product of one of the Imperfect transformation rules, namely the case where the Imperfect Verb is preceded by the negative particle lm. [sent-35, score-0.187]
18 The rules also cover certain lexical items as 400 words in MSA have been converted to their com… (page footer: Proceedings of the 50th Annual Meeting of the A…, Jeju, Republic of Korea, 8-14 July 2012) [sent-37, score-0.087]
19 Examples of lexical conversions include ZlAm and Dlmp (darkness), rjl and rAjl (man), rjAl and rjAlp (men), and kvyr and ktyr (many), where the first word is the MSA version and the second is the CEA version. [sent-40, score-0.102]
20 For example, the word rjl can either mean man or leg. [sent-42, score-0.119]
21 When it means man, the CEA form is rAjl, but the word for leg is the same in both MSA and CEA. [sent-43, score-0.022]
22 While they have different vowel patterns (rajul and rijol respectively), the vowel information is harder to get correctly than POS tags. [sent-44, score-0.064]
23 The problem may arise especially when dealing with raw data for which we need to provide POS tags (and vowels) so we may be able to convert it to the colloquial form. [sent-45, score-0.305]
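The rjl example suggests that a lexical rule needs more than the surface string to fire correctly. The sketch below illustrates one way to express that, assuming a hypothetical lexicon keyed on the MSA word plus a sense label. The four entries come from the examples above; the keying scheme and the name `convert_lexical` are assumptions made for illustration only.

```python
# Hypothetical lexicon keyed on (MSA word, sense). The entries are the
# examples listed in the text; the sense labels are an assumption.
LEXICAL_RULES = {
    ("ZlAm", "darkness"): "Dlmp",
    ("rjl", "man"): "rAjl",
    ("rjAl", "men"): "rjAlp",
    ("kvyr", "many"): "ktyr",
}


def convert_lexical(word: str, sense: str) -> str:
    """Return the CEA form if a rule matches, otherwise the word unchanged."""
    return LEXICAL_RULES.get((word, sense), word)


print(convert_lexical("rjl", "man"))  # rAjl
print(convert_lexical("rjl", "leg"))  # rjl
```

Falling back to the unchanged word mirrors the observation that most stems are shared between MSA and CEA.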
24 Below, we provide two sample rules: The imperfect verb is used, inter alia, to express the negated past, for which CEA uses the perfect verb. [sent-46, score-0.113]
25 What makes things more complicated is that CEA treats negative particles and prepositional phrases as clitics. [sent-47, score-0.082]
26 An example of this is the word mktbthlhm$ (I did not write it for them) in Table 1 above. [sent-48, score-0.056]
27 It is made of the negative particle m, the stem ktb (to write), the object pronoun h, the preposition l, the pronoun hm (them) and the negative particle $. [sent-49, score-0.502]
28 Figure 1 and the following steps show the conversion of lm nktbhA lhm to mktbnhAlhm$: [sent-50, score-0.095]
29 1. Replace the negative word lm with one of the prefixes m, mA or the word mA. [sent-51, score-0.147]
30 For example, the IV first person singular subject prefix > turns into t in the PV. [sent-54, score-0.023]
31 If the verb is followed by a prepositional phrase headed by the preposition l that contains a pronominal object, convert the preposition to a prepositional clitic. [sent-56, score-0.25]
32 Transform the dual to plural and the plural feminine to plural masculine. [sent-58, score-0.434]
33 Add the negative suffix $ (or the variant $y, which is less probable). As alluded to in 1) above, given that colloquial orthography is not standardized, many affixes and clitics can be written in different ways. [sent-60, score-0.371]
34 For example, the word mktbnhlhm$ can be written in 24 ways. [sent-61, score-0.022]
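The count of written variants is a product of independent spelling choices. The following sketch enumerates such combinations. Only two of the choice points are recoverable from the steps listed here (three ways to write the negation, two negative suffixes), so the example yields 6 forms rather than the full 24; the fixed core string and the function name `enumerate_variants` are assumptions for illustration.

```python
from itertools import product


def enumerate_variants(slots):
    """Return every combination of one option per slot, concatenated in order."""
    return ["".join(choice) for choice in product(*slots)]


# Only the negation and suffix choices are spelled out in the text. The core
# "ktbnhAlhm" is held fixed here purely for illustration.
slots = [
    ["m", "mA", "mA "],  # negation as prefix m, prefix mA, or separate word mA
    ["ktbnhAlhm"],       # verb with its clitics (assumed fixed in this sketch)
    ["$", "$y"],         # negative suffix and its less probable variant
]

variants = enumerate_variants(slots)
print(len(variants))  # 6
print(variants[0])    # mktbnhAlhm$
```

With two further binary choice points the same function would produce 3 x 2 x 2 x 2 = 24 forms.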
35 Figure 1: One negated IV form in MSA can generate 24 (3x2x2x2) possible forms in CEA. MSA possessive pronouns inflect for gender, number (singular, dual, and plural), and person. [sent-64, score-0.038]
36 In CEA, there is no distinction between the dual and the plural, and a single pronoun is used for the plural feminine and masculine. [sent-65, score-0.25]
37 The three MSA forms ktAbhm, ktAbhmA and ktAbhn (their book for the masculine plural, the dual, and the feminine plural respectively) all collapse to ktAbhm. [sent-66, score-0.165]
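The pronoun collapse can be pictured as a small suffix rewrite. This is a minimal sketch based only on the ktAb example above; matching on raw string endings is an assumption made to keep the example short, whereas the method described here relies on segment-level POS tags to know that a suffix really is a possessive pronoun.

```python
# Third-person possessive suffixes from the ktAb example, mapped to the
# single CEA form. A real system would apply this only to segments tagged
# as possessive pronouns, not to arbitrary word endings.
PRONOUN_COLLAPSE = {
    "hmA": "hm",  # dual
    "hn": "hm",   # feminine plural
    "hm": "hm",   # masculine plural (unchanged)
}


def collapse_possessive(word: str) -> str:
    """Rewrite a dual or feminine plural possessive suffix to the CEA plural."""
    for msa_suffix, cea_suffix in PRONOUN_COLLAPSE.items():
        if word.endswith(msa_suffix):
            return word[: -len(msa_suffix)] + cea_suffix
    return word


for form in ["ktAbhm", "ktAbhmA", "ktAbhn"]:
    print(form, "->", collapse_possessive(form))
```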
38 Table 2 has examples of some other rules we have applied. [sent-67, score-0.044]
39 We note that the stem, in bold, hardly changes, and that the changes mainly affect function segments. [sent-68, score-0.036]
40 The last example is a lexical rule in which the stem has to change. [sent-69, score-0.058]
41 POS Tagging Egyptian Arabic. We use the conversion above to build a POS tagger for Egyptian Arabic. [sent-71, score-0.104]
42 We follow Mohamed and Kuebler (2010) in using whole word tagging, i.e. [sent-72, score-0.022]
43 For example, the word wHnktblhm (and we will write to them, وحنكتبلهم) receives the tag PRT+PRT+VRB+PRT+NOM. [sent-76, score-0.056]
44 This results in 58 composite tags, 9 of which occur 5 times or fewer in the converted CEA training set. [sent-77, score-0.076]
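A composite tag of this kind is the segment tags joined with a plus sign. The sketch below shows that construction for the wHnktblhm example. The segmentation into w, H, nktb, l and hm is an assumption chosen to match the five tags given; only the final tag string comes from the text.

```python
def composite_tag(segment_tags):
    """Join per-segment tags into one whole-word tag."""
    return "+".join(segment_tags)


# Assumed segmentation of wHnktblhm; only the resulting composite tag
# (PRT+PRT+VRB+PRT+NOM) is stated in the text.
segments = [
    ("w", "PRT"),
    ("H", "PRT"),
    ("nktb", "VRB"),
    ("l", "PRT"),
    ("hm", "NOM"),
]

word = "".join(segment for segment, _ in segments)
tag = composite_tag(tag for _, tag in segments)
print(word, tag)  # wHnktblhm PRT+PRT+VRB+PRT+NOM
```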
45 We converted two sections of the Arabic Treebank (ATB): p2v3 and p3v2. [sent-78, score-0.043]
46 For all the POS tagging experiments, we use the memory-based POS tagger (MBT) (Daelemans et al., 1996). [sent-79, score-0.117]
47 The best results, tuned on a dev set, were obtained in a non-exhaustive search with the Modified Value Difference Metric as a distance metric and with k (the number of nearest neighbors) = 25. [sent-80, score-0.055]
48 For known words, we use the IGTree algorithm and 2 words to the left, their POS tags, the focus word and its list of possible tags, 1 right context word and its list of possible tags as features. [sent-81, score-0.092]
49 For unknown words, we use the IB1 algorithm and the word itself, its first 5 and last 3 characters, 1 left context word and its POS tag, and 1 right context word and its list of possible tags as features. [sent-82, score-0.163]
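The two feature sets can be written out as plain dictionaries to make the context windows concrete. This is an illustrative sketch in ordinary Python, not MBT's own feature specification: the padding symbol, the lexicon format, the feature names, and the choice to represent "first 5 and last 3 characters" as one prefix and one suffix string are all assumptions.

```python
PAD = "_"  # assumed padding symbol for positions outside the sentence


def _at(seq, i):
    """Return seq[i], or the padding symbol when i is out of range."""
    return seq[i] if 0 <= i < len(seq) else PAD


def _possible_tags(lexicon, word):
    """Join the tags a (hypothetical) lexicon lists for a word into one symbol."""
    return "|".join(sorted(lexicon.get(word, []))) or PAD


def known_word_features(words, left_tags, lexicon, i):
    """Two left words and their tags, the focus word, and one right word."""
    return {
        "w-2": _at(words, i - 2), "t-2": _at(left_tags, i - 2),
        "w-1": _at(words, i - 1), "t-1": _at(left_tags, i - 1),
        "focus": words[i],
        "focus_tags": _possible_tags(lexicon, words[i]),
        "w+1": _at(words, i + 1),
        "w+1_tags": _possible_tags(lexicon, _at(words, i + 1)),
    }


def unknown_word_features(words, left_tags, lexicon, i):
    """The word, its first 5 and last 3 characters, and one word of context each side."""
    word = words[i]
    return {
        "focus": word,
        "prefix5": word[:5],
        "suffix3": word[-3:],
        "w-1": _at(words, i - 1), "t-1": _at(left_tags, i - 1),
        "w+1": _at(words, i + 1),
        "w+1_tags": _possible_tags(lexicon, _at(words, i + 1)),
    }


words = ["hw", "ktb", "ktAbhm"]
lexicon = {"ktb": ["VRB", "NOM"], "ktAbhm": ["NOM"]}
print(known_word_features(words, ["NOM"], lexicon, 1))
```

Here `left_tags` holds the tags already assigned to the words before position `i`, which is why the right context only contributes a list of possible tags.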
50 Development and Test Data. As a development set, we use 100 user-contributed comments (2757 words) from the website masrawy. [sent-85, score-0.054]
51 The test set contains 192 comments (7092 words) from the same website with the same criterion. [sent-87, score-0.054]
52 The development and test sets were hand-annotated with composite tags as illustrated above by two native Arabic-speaking students. [sent-88, score-0.104]
53 The test and development sets contained spelling errors (mostly run-on words). [sent-89, score-0.023]
54 The most common of these is the vocative particle yA, which is usually attached to the following word (e. [sent-90, score-0.139]
55 The same holds true for the variation between the letters * and z (ذ and ز in Arabic), which are pronounced exactly the same way in CEA, to the extent that the substitution may not be considered a spelling error. [sent-94, score-0.023]
56 Experiments and Results. We ran five experiments to test the effect of MSA to CEA conversion on POS tagging: (a) Standard, where we train the tagger on the ATB MSA data, (b) 3-gram LM, where for each MSA sentence we generate all transformed sentences (see Section 2. [sent-97, score-0.104]
57 1 This corpus is probable sentence model built from user contributed highly dialectal (footnote 1: Available from http://www. [sent-100, score-0.168]
58 Hybridization is necessary since most Arabic data in blogs and comments are a mix of MSA and CEA, and (e) Hybrid + dev, where we enrich the Hybrid training set with the dev data. [sent-105, score-0.083]
59 We use the following metrics for evaluation: KWA: Known Word Accuracy (%), UWA: Unknown Word Accuracy (%), TA: Total Accuracy (%), and UW: unknown words (%) in the respective set in the respective experiment. [sent-106, score-0.049]
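The four metrics follow directly from splitting the evaluated tokens by whether they occur in the training vocabulary. The sketch below computes them on a toy example; the function name `evaluate`, its argument layout, and the toy data are assumptions, and the numbers it prints are not results from the paper.

```python
def evaluate(words, gold, predicted, training_vocab):
    """Compute KWA, UWA, TA and UW (all percentages) over parallel token lists."""
    known_total = known_correct = 0
    unknown_total = unknown_correct = 0
    for word, gold_tag, predicted_tag in zip(words, gold, predicted):
        correct = int(gold_tag == predicted_tag)
        if word in training_vocab:
            known_total += 1
            known_correct += correct
        else:
            unknown_total += 1
            unknown_correct += correct

    def pct(numerator, denominator):
        # Report 0.0 for an empty category instead of dividing by zero
        return 100.0 * numerator / denominator if denominator else 0.0

    total = known_total + unknown_total
    return {
        "KWA": pct(known_correct, known_total),
        "UWA": pct(unknown_correct, unknown_total),
        "TA": pct(known_correct + unknown_correct, total),
        "UW": pct(unknown_total, total),
    }


# Toy data only, to show the bookkeeping.
scores = evaluate(
    words=["a", "b", "c", "d"],
    gold=["X", "Y", "X", "Y"],
    predicted=["X", "X", "X", "Y"],
    training_vocab={"a", "b"},
)
print(scores)  # {'KWA': 50.0, 'UWA': 100.0, 'TA': 75.0, 'UW': 50.0}
```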
60 We notice that randomly selecting a sentence from the correctly generated sentences yields better results than choosing the most probable sentence according to a language model. [sent-112, score-0.036]
61 This drop in the percentage of unknown words may indicate that generating all possible variations of CEA may be more useful than using a language model in general. [sent-118, score-0.097]
62 Even in a CEA corpus of 35 million words, one third of the words generated by the rules are not in the corpus, while many of these are in both the test set and the development set. [sent-119, score-0.044]
63 We also notice that the conversion alone improves tagging accuracy from 75. [sent-120, score-0.155]
64 Combining the original MSA and the best scoring converted data (Random) raises the accuracies to 84. [sent-125, score-0.043]
65 The fact that the percentage of unknown words drops further to 16.66% [sent-131, score-0.12]
66 in the Hybrid+dev experiment indicates that the authentic colloquial data contains elements that have not been captured using conversion alone. [sent-132, score-0.325]
67 Related Work. To the best of our knowledge, ours is the first work that generates CEA automatically from morphologically disambiguated MSA, but [sent-134, score-0.056]
68 Habash et al. (2005) discussed root and pattern morphological analysis and generation of Arabic dialects within the MAGED morphological analyzer. [sent-135, score-0.215]
69 MAGED incorporates the morphology, phonology, and orthography of several Arabic dialects. [sent-136, score-0.032]
70 (2010) worked on the annotation of dialectal Arabic through the COLABA project, and they used the (manually) annotated resources to facilitate the incorporation of the dialects in Arabic information retrieval. [sent-138, score-0.215]
71 Duh and Kirchhoff (2005) successfully designed a POS tagger for CEA that used an MSA morphological analyzer and information gleaned from the intersection of several Arabic dialects. [sent-139, score-0.153]
72 This is different from our approach for which POS tagging is only an application. [sent-140, score-0.07]
73 Our focus is to use any existing MSA data to generate colloquial Arabic resources that can be used in virtually any NLP task. [sent-141, score-0.253]
74 Conclusions and Future Work. We have presented a method to convert Modern Standard Arabic to Egyptian Colloquial Arabic, with an example application to the POS tagging task. [sent-145, score-0.1]
75 This approach may provide a cheap way to leverage MSA data and morphological resources to create resources for colloquial Arabic to English machine translation, for example. [sent-146, score-0.382]
76 While the rules of conversion were mainly morphological in nature, they have proved useful in handling colloquial data. [sent-147, score-0.443]
77 However, morphology alone is not enough for handling key points of difference between CEA and MSA. [sent-148, score-0.053]
78 While CEA is mainly an SVO language, MSA is mainly VSO, and while demonstratives are pre-nominal in MSA, they are post-nominal in CEA. [sent-149, score-0.072]
79 These phenomena can be handled only through syntactic conversion. [sent-150, score-0.025]
80 When no gold standard segment-based POS tags are available, tools that produce segment-based annotation can be used, e.g., [sent-152, score-0.048]
81 segment-based POS tagging (Mohamed and Kuebler, 2010) or MADA (Habash et al., 2009), although these are not expected to yield the same results as gold standard part-of-speech tags. [sent-154, score-0.07]
82 We thank the two native speaker annotators and the anonymous reviewers for their instructive and enriching feedback. [sent-157, score-0.023]
wordName wordTfidf (topN-words)
[('msa', 0.494), ('cea', 0.441), ('arabic', 0.438), ('colloquial', 0.227), ('egyptian', 0.145), ('dialectal', 0.132), ('particle', 0.117), ('plural', 0.112), ('habash', 0.102), ('mohamed', 0.1), ('pos', 0.082), ('morphological', 0.079), ('prt', 0.076), ('alaxwan', 0.071), ('affixes', 0.071), ('imperfect', 0.071), ('tagging', 0.07), ('qatar', 0.067), ('lm', 0.062), ('kuebler', 0.062), ('nizar', 0.061), ('stem', 0.058), ('conversion', 0.057), ('daelemans', 0.057), ('dialects', 0.057), ('dev', 0.055), ('gender', 0.054), ('feminine', 0.053), ('varieties', 0.053), ('man', 0.05), ('unknown', 0.049), ('percentage', 0.048), ('preposition', 0.048), ('tags', 0.048), ('colaba', 0.047), ('nprp', 0.047), ('rajl', 0.047), ('rjl', 0.047), ('vrb', 0.047), ('tagger', 0.047), ('hybrid', 0.047), ('dual', 0.045), ('rules', 0.044), ('converted', 0.043), ('duh', 0.042), ('verb', 0.042), ('maamouri', 0.041), ('authentic', 0.041), ('behrang', 0.041), ('kundu', 0.041), ('maged', 0.041), ('mbt', 0.041), ('mada', 0.041), ('modern', 0.041), ('negative', 0.041), ('prepositional', 0.041), ('rambow', 0.04), ('pronoun', 0.04), ('roth', 0.039), ('pronouns', 0.038), ('bies', 0.038), ('june', 0.037), ('probable', 0.036), ('mainly', 0.036), ('atb', 0.035), ('write', 0.034), ('composite', 0.033), ('kirchhoff', 0.033), ('conversions', 0.033), ('treebank', 0.033), ('vowel', 0.032), ('semitic', 0.032), ('orthography', 0.032), ('convert', 0.03), ('walter', 0.03), ('transformation', 0.029), ('morphologically', 0.029), ('comments', 0.028), ('alone', 0.028), ('diab', 0.027), ('analyzer', 0.027), ('disambiguated', 0.027), ('owen', 0.026), ('resources', 0.026), ('website', 0.026), ('morphology', 0.025), ('arbor', 0.025), ('phenomena', 0.025), ('srilm', 0.024), ('stolcke', 0.024), ('cheap', 0.024), ('considerably', 0.024), ('ann', 0.024), ('feasible', 0.024), ('al', 0.023), ('spelling', 0.023), ('drops', 0.023), ('singular', 0.023), ('native', 0.023), ('word', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 202 acl-2012-Transforming Standard Arabic to Colloquial Arabic
Author: Emad Mohamed ; Behrang Mohit ; Kemal Oflazer
Abstract: We present a method for generating Colloquial Egyptian Arabic (CEA) from morphologically disambiguated Modern Standard Arabic (MSA). When used in POS tagging, this process improves the accuracy from 73.24% to 86.84% on unseen CEA text, and reduces the percentage of out-of-vocabulary words from 28.98% to 16.66%. The process holds promise for any NLP task targeting the dialectal varieties of Arabic; e.g., this approach may provide a cheap way to leverage MSA data and morphological resources to create resources for colloquial Arabic to English machine translation. It can also considerably speed up the annotation of Arabic dialects.
2 0.38247341 207 acl-2012-Unsupervised Morphology Rivals Supervised Morphology for Arabic MT
Author: David Stallard ; Jacob Devlin ; Michael Kayser ; Yoong Keok Lee ; Regina Barzilay
Abstract: If unsupervised morphological analyzers could approach the effectiveness of supervised ones, they would be a very attractive choice for improving MT performance on low-resource inflected languages. In this paper, we compare performance gains for state-of-the-art supervised vs. unsupervised morphological analyzers, using a state-of-the-art Arabic-to-English MT system. We apply maximum marginal decoding to the unsupervised analyzer, and show that this yields the best published segmentation accuracy for Arabic, while also making segmentation output more stable. Our approach gives an 18% relative BLEU gain for Levantine dialectal Arabic. Furthermore, it gives higher gains for Modern Standard Arabic (MSA), as measured on NIST MT-08, than does MADA (Habash and Rambow, 2005), a leading supervised MSA segmenter.
3 0.29516953 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations
Author: Spence Green ; John DeNero
Abstract: When automatically translating from a weakly inflected source language like English to a target language with richer grammatical features such as gender and dual number, the output commonly contains morpho-syntactic agreement errors. To address this issue, we present a target-side, class-based agreement model. Agreement is promoted by scoring a sequence of fine-grained morpho-syntactic classes that are predicted during decoding for each translation hypothesis. For English-to-Arabic translation, our model yields a +1.04 BLEU average improvement over a state-of-the-art baseline. The model does not require bitext or phrase table annotations and can be easily implemented as a feature in many phrase-based decoders.
4 0.25300935 27 acl-2012-Arabic Retrieval Revisited: Morphological Hole Filling
Author: Kareem Darwish ; Ahmed Ali
Abstract: Due to Arabic’s morphological complexity, Arabic retrieval benefits greatly from morphological analysis – particularly stemming. However, the best known stemming does not handle linguistic phenomena such as broken plurals and malformed stems. In this paper we propose a model of character-level morphological transformation that is trained using Wikipedia hypertext to page title links. The use of our model yields statistically significant improvements in Arabic retrieval over the use of the best statistical stemming technique. The technique can potentially be applied to other languages.
5 0.11640296 49 acl-2012-Coarse Lexical Semantic Annotation with Supersenses: An Arabic Case Study
Author: Nathan Schneider ; Behrang Mohit ; Kemal Oflazer ; Noah A. Smith
Abstract: “Lightweight” semantic annotation of text calls for a simple representation, ideally without requiring a semantic lexicon to achieve good coverage in the language and domain. In this paper, we repurpose WordNet’s supersense tags for annotation, developing specific guidelines for nominal expressions and applying them to Arabic Wikipedia articles in four topical domains. The resulting corpus has high coverage and was completed quickly with reasonable inter-annotator agreement.
6 0.085451961 148 acl-2012-Modified Distortion Matrices for Phrase-Based Statistical Machine Translation
7 0.079866186 122 acl-2012-Joint Evaluation of Morphological Segmentation and Syntactic Parsing
8 0.071723491 211 acl-2012-Using Rejuvenation to Improve Particle Filtering for Bayesian Word Segmentation
9 0.070208564 9 acl-2012-A Cost Sensitive Part-of-Speech Tagging: Differentiating Serious Errors from Minor Errors
10 0.064125903 45 acl-2012-Capturing Paradigmatic and Syntagmatic Lexical Relations: Towards Accurate Chinese Part-of-Speech Tagging
11 0.06291303 137 acl-2012-Lemmatisation as a Tagging Task
12 0.056872383 96 acl-2012-Fast and Robust Part-of-Speech Tagging Using Dynamic Model Selection
13 0.056731258 103 acl-2012-Grammar Error Correction Using Pseudo-Error Sentences and Domain Adaptation
14 0.053765055 97 acl-2012-Fast and Scalable Decoding with Language Model Look-Ahead for Phrase-based Statistical Machine Translation
15 0.052265007 168 acl-2012-Reducing Approximation and Estimation Errors for Chinese Lexical Processing with Heterogeneous Annotations
17 0.048811007 119 acl-2012-Incremental Joint Approach to Word Segmentation, POS Tagging, and Dependency Parsing in Chinese
18 0.046441905 87 acl-2012-Exploiting Multiple Treebanks for Parsing with Quasi-synchronous Grammars
19 0.045592159 189 acl-2012-Syntactic Annotations for the Google Books NGram Corpus
20 0.042813949 100 acl-2012-Fine Granular Aspect Analysis using Latent Structural Models
topicId topicWeight
[(0, -0.141), (1, -0.046), (2, -0.05), (3, -0.044), (4, 0.071), (5, 0.194), (6, 0.107), (7, -0.223), (8, -0.013), (9, -0.021), (10, -0.224), (11, -0.076), (12, 0.195), (13, -0.197), (14, 0.036), (15, -0.164), (16, -0.328), (17, -0.19), (18, -0.203), (19, -0.106), (20, 0.151), (21, -0.009), (22, -0.012), (23, -0.05), (24, 0.053), (25, -0.07), (26, 0.039), (27, -0.014), (28, -0.12), (29, 0.069), (30, -0.001), (31, -0.022), (32, 0.001), (33, 0.066), (34, -0.118), (35, 0.001), (36, 0.021), (37, -0.077), (38, -0.033), (39, 0.062), (40, 0.076), (41, -0.056), (42, 0.014), (43, 0.048), (44, 0.013), (45, 0.029), (46, -0.048), (47, -0.019), (48, 0.015), (49, -0.052)]
simIndex simValue paperId paperTitle
same-paper 1 0.95173401 202 acl-2012-Transforming Standard Arabic to Colloquial Arabic
Author: Emad Mohamed ; Behrang Mohit ; Kemal Oflazer
Abstract: We present a method for generating Colloquial Egyptian Arabic (CEA) from morphologically disambiguated Modern Standard Arabic (MSA). When used in POS tagging, this process improves the accuracy from 73.24% to 86.84% on unseen CEA text, and reduces the percentage of out-of-vocabulary words from 28.98% to 16.66%. The process holds promise for any NLP task targeting the dialectal varieties of Arabic; e.g., this approach may provide a cheap way to leverage MSA data and morphological resources to create resources for colloquial Arabic to English machine translation. It can also considerably speed up the annotation of Arabic dialects.
2 0.82051015 207 acl-2012-Unsupervised Morphology Rivals Supervised Morphology for Arabic MT
Author: David Stallard ; Jacob Devlin ; Michael Kayser ; Yoong Keok Lee ; Regina Barzilay
Abstract: If unsupervised morphological analyzers could approach the effectiveness of supervised ones, they would be a very attractive choice for improving MT performance on low-resource inflected languages. In this paper, we compare performance gains for state-of-the-art supervised vs. unsupervised morphological analyzers, using a state-of-the-art Arabic-to-English MT system. We apply maximum marginal decoding to the unsupervised analyzer, and show that this yields the best published segmentation accuracy for Arabic, while also making segmentation output more stable. Our approach gives an 18% relative BLEU gain for Levantine dialectal Arabic. Furthermore, it gives higher gains for Modern Standard Arabic (MSA), as measured on NIST MT-08, than does MADA (Habash and Rambow, 2005), a leading supervised MSA segmenter.
3 0.77464789 27 acl-2012-Arabic Retrieval Revisited: Morphological Hole Filling
Author: Kareem Darwish ; Ahmed Ali
Abstract: Due to Arabic’s morphological complexity, Arabic retrieval benefits greatly from morphological analysis – particularly stemming. However, the best known stemming does not handle linguistic phenomena such as broken plurals and malformed stems. In this paper we propose a model of character-level morphological transformation that is trained using Wikipedia hypertext to page title links. The use of our model yields statistically significant improvements in Arabic retrieval over the use of the best statistical stemming technique. The technique can potentially be applied to other languages.
4 0.54955435 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations
Author: Spence Green ; John DeNero
Abstract: When automatically translating from a weakly inflected source language like English to a target language with richer grammatical features such as gender and dual number, the output commonly contains morpho-syntactic agreement errors. To address this issue, we present a target-side, class-based agreement model. Agreement is promoted by scoring a sequence of fine-grained morpho-syntactic classes that are predicted during decoding for each translation hypothesis. For English-to-Arabic translation, our model yields a +1.04 BLEU average improvement over a state-of-the-art baseline. The model does not require bitext or phrase table annotations and can be easily implemented as a feature in many phrase-based decoders.
5 0.39741889 49 acl-2012-Coarse Lexical Semantic Annotation with Supersenses: An Arabic Case Study
Author: Nathan Schneider ; Behrang Mohit ; Kemal Oflazer ; Noah A. Smith
Abstract: “Lightweight” semantic annotation of text calls for a simple representation, ideally without requiring a semantic lexicon to achieve good coverage in the language and domain. In this paper, we repurpose WordNet’s supersense tags for annotation, developing specific guidelines for nominal expressions and applying them to Arabic Wikipedia articles in four topical domains. The resulting corpus has high coverage and was completed quickly with reasonable inter-annotator agreement.
6 0.38917834 122 acl-2012-Joint Evaluation of Morphological Segmentation and Syntactic Parsing
7 0.32275176 137 acl-2012-Lemmatisation as a Tagging Task
8 0.22718723 211 acl-2012-Using Rejuvenation to Improve Particle Filtering for Bayesian Word Segmentation
9 0.21369164 148 acl-2012-Modified Distortion Matrices for Phrase-Based Statistical Machine Translation
10 0.20382957 189 acl-2012-Syntactic Annotations for the Google Books NGram Corpus
12 0.17528576 39 acl-2012-Beefmoves: Dissemination, Diversity, and Dynamics of English Borrowings in a German Hip Hop Forum
13 0.174492 9 acl-2012-A Cost Sensitive Part-of-Speech Tagging: Differentiating Serious Errors from Minor Errors
14 0.17297684 127 acl-2012-Large-Scale Syntactic Language Modeling with Treelets
15 0.17242646 45 acl-2012-Capturing Paradigmatic and Syntagmatic Lexical Relations: Towards Accurate Chinese Part-of-Speech Tagging
16 0.17208484 210 acl-2012-Unsupervized Word Segmentation: the Case for Mandarin Chinese
17 0.1709352 119 acl-2012-Incremental Joint Approach to Word Segmentation, POS Tagging, and Dependency Parsing in Chinese
18 0.16823165 75 acl-2012-Discriminative Strategies to Integrate Multiword Expression Recognition and Parsing
19 0.15964922 168 acl-2012-Reducing Approximation and Estimation Errors for Chinese Lexical Processing with Heterogeneous Annotations
20 0.15008116 96 acl-2012-Fast and Robust Part-of-Speech Tagging Using Dynamic Model Selection
topicId topicWeight
[(6, 0.272), (25, 0.044), (26, 0.029), (28, 0.034), (30, 0.022), (37, 0.025), (39, 0.03), (57, 0.09), (74, 0.038), (82, 0.018), (83, 0.014), (84, 0.04), (85, 0.046), (90, 0.092), (92, 0.043), (94, 0.017), (99, 0.055)]
simIndex simValue paperId paperTitle
same-paper 1 0.7197848 202 acl-2012-Transforming Standard Arabic to Colloquial Arabic
Author: Emad Mohamed ; Behrang Mohit ; Kemal Oflazer
Abstract: We present a method for generating Colloquial Egyptian Arabic (CEA) from morphologically disambiguated Modern Standard Arabic (MSA). When used in POS tagging, this process improves the accuracy from 73.24% to 86.84% on unseen CEA text, and reduces the percentage of out-of-vocabulary words from 28.98% to 16.66%. The process holds promise for any NLP task targeting the dialectal varieties of Arabic; e.g., this approach may provide a cheap way to leverage MSA data and morphological resources to create resources for colloquial Arabic to English machine translation. It can also considerably speed up the annotation of Arabic dialects.
2 0.70299941 47 acl-2012-Chinese Comma Disambiguation for Discourse Analysis
Author: Yaqin Yang ; Nianwen Xue
Abstract: The Chinese comma signals the boundary of discourse units and also anchors discourse relations between adjacent text spans. In this work, we propose a discourse structure-oriented classification of the comma that can be automatically extracted from the Chinese Treebank based on syntactic patterns. We then experimented with two supervised learning methods that automatically disambiguate the Chinese comma based on this classification. The first method integrates comma classification into parsing, and the second method adopts a “post-processing” approach that extracts features from automatic parses to train a classifier. The experimental results show that the second approach compares favorably against the first approach.
3 0.49838883 110 acl-2012-Historical Analysis of Legal Opinions with a Sparse Mixed-Effects Latent Variable Model
Author: William Yang Wang ; Elijah Mayfield ; Suresh Naidu ; Jeremiah Dittmar
Abstract: We propose a latent variable model to enhance historical analysis of large corpora. This work extends prior work in topic modelling by incorporating metadata, and the interactions between the components in metadata, in a general way. To test this, we collect a corpus of slavery-related United States property law judgements sampled from the years 1730 to 1866. We study the language use in these legal cases, with a special focus on shifts in opinions on controversial topics across different regions. Because this is a longitudinal data set, we are also interested in understanding how these opinions change over the course of decades. We show that the joint learning scheme of our sparse mixed-effects model improves on other state-of-the-art generative and discriminative models on the region and time period identification tasks. Experiments show that our sparse mixed-effects model is more accurate quantitatively and qualitatively interesting, and that these improvements are robust across different parameter settings.
4 0.4981035 27 acl-2012-Arabic Retrieval Revisited: Morphological Hole Filling
Author: Kareem Darwish ; Ahmed Ali
Abstract: Due to Arabic’s morphological complexity, Arabic retrieval benefits greatly from morphological analysis – particularly stemming. However, the best known stemming does not handle linguistic phenomena such as broken plurals and malformed stems. In this paper we propose a model of character-level morphological transformation that is trained using Wikipedia hypertext to page title links. The use of our model yields statistically significant improvements in Arabic retrieval over the use of the best statistical stemming technique. The technique can potentially be applied to other languages.
5 0.4913829 83 acl-2012-Error Mining on Dependency Trees
Author: Claire Gardent ; Shashi Narayan
Abstract: In recent years, error mining approaches were developed to help identify the most likely sources of parsing failures in parsing systems using handcrafted grammars and lexicons. However, the techniques they use to enumerate and count n-grams build on the sequential nature of a text corpus and do not easily extend to structured data. In this paper, we propose an algorithm for mining trees and apply it to detect the most likely sources of generation failure. We show that this tree mining algorithm permits identifying not only errors in the generation system (grammar, lexicon) but also mismatches between the structures contained in the input and the input structures expected by our generator as well as a few idiosyncrasies/errors in the input data.
6 0.48727003 155 acl-2012-NiuTrans: An Open Source Toolkit for Phrase-based and Syntax-based Machine Translation
7 0.46211314 61 acl-2012-Cross-Domain Co-Extraction of Sentiment and Topic Lexicons
9 0.44600889 148 acl-2012-Modified Distortion Matrices for Phrase-Based Statistical Machine Translation
10 0.44543314 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations
11 0.44361457 72 acl-2012-Detecting Semantic Equivalence and Information Disparity in Cross-lingual Documents
12 0.44113129 136 acl-2012-Learning to Translate with Multiple Objectives
13 0.43819866 138 acl-2012-LetsMT!: Cloud-Based Platform for Do-It-Yourself Machine Translation
15 0.43398672 140 acl-2012-Machine Translation without Words through Substring Alignment
16 0.43391258 206 acl-2012-UWN: A Large Multilingual Lexical Knowledge Base
17 0.43348429 127 acl-2012-Large-Scale Syntactic Language Modeling with Treelets
18 0.43267348 214 acl-2012-Verb Classification using Distributional Similarity in Syntactic and Semantic Structures
19 0.43251249 63 acl-2012-Cross-lingual Parse Disambiguation based on Semantic Correspondence
20 0.43223634 219 acl-2012-langid.py: An Off-the-shelf Language Identification Tool