
27 acl-2012-Arabic Retrieval Revisited: Morphological Hole Filling


Source: pdf

Author: Kareem Darwish ; Ahmed Ali

Abstract: Due to Arabic’s morphological complexity, Arabic retrieval benefits greatly from morphological analysis – particularly stemming. However, the best known stemming does not handle linguistic phenomena such as broken plurals and malformed stems. In this paper we propose a model of character-level morphological transformation that is trained using Wikipedia hypertext to page title links. The use of our model yields statistically significant improvements in Arabic retrieval over the use of the best statistical stemming technique. The technique can potentially be applied to other languages.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Ali Qatar Computing Research Institute, Qatar Foundation, Doha, Qatar kdarwish@qf . [sent-2, score-0.039]

2 Due to Arabic’s morphological complexity, Arabic retrieval benefits greatly from morphological analysis – particularly stemming. [sent-6, score-0.47]

3 However, the best known stemming does not handle linguistic phenomena such as broken plurals and malformed stems. [sent-7, score-0.498]

4 In this paper we propose a model of character-level morphological transformation that is trained using Wikipedia hypertext to page title links. [sent-8, score-0.413]

5 The use of our model yields statistically significant improvements in Arabic retrieval over the use of the best statistical stemming technique. [sent-9, score-0.554]

6 The technique can potentially be applied to other languages. [sent-10, score-0.048]

7 Introduction Arabic exhibits rich morphological phenomena that complicate retrieval. [sent-12, score-0.147]

8 Arabic nouns and verbs are typically derived from a set of 10,000 roots that are cast into stems using templates that may add infixes, double letters, or remove letters. [sent-13, score-0.455]

9 Stems can accept the attachment of clitics, in the form of prefixes or suffixes, such as prepositions, determiners, pronouns, etc. [sent-14, score-0.084]

10 Orthographic rules can cause the addition, deletion, or substitution of letters during suffix and prefix attachment. [sent-15, score-0.104]

11 Further, stems can be inflected to obtain plural forms via the addition of suffixes or through using a different stem form altogether, producing so-called broken¹ (aka irregular) plurals. [sent-16, score-0.659]

12 For retrieval, we would ideally like to match “related” stem forms regardless of inflected form or attached clitic. [sent-17, score-0.13]

13 Tolerating some form of derivational morphology where nouns are transformed into adjectives via the attachment of … (Footnote 1: “Broken” is a direct translation of the Arabic word “takseer”, which refers to this kind of plural.) [sent-18, score-0.175]

14 Matching all stems that are cast from the same root would introduce undesired ambiguity, because a single root can produce up to 1,000 stems. [sent-23, score-0.451]

15 The first approach involves stemming, which removes clitics, plural and gender markers, and suffixes such as ي (y). [sent-25, score-0.139]

16 Statistical stemming was reported to be the most effective for Arabic retrieval (Darwish et al. [sent-26, score-0.473]

17 Stemming does not handle infixes and hence cannot conflate singular and broken plural word forms. [sent-29, score-0.225]

18 For example, the plural of the Arabic word for book “كتاب” (ktAb) is “كتب” (ktb). [sent-30, score-0.063]

19 Stemming of some named entities, which are important for retrieval, and their inflected forms may produce different stems as word endings may change with the attachment of suffixes. [sent-32, score-0.521]

20 Consider the Arabic words for America أمريكا (>mrykA) and American أمريكي (>mryky), where the final letter is transformed from “A” to “y”. [sent-33, score-0.082]

21 The second approach involves using character 3- or 4-grams (as opposed to words) (Mayfield et al. [sent-34, score-0.101]

22 Though this approach has been shown to improve retrieval effectiveness, it has the following drawbacks: 1. [sent-37, score-0.147]

23 It cannot handle broken plurals, though it would handle words where stemming would produce different stems for different inflected forms. [sent-38, score-0.986]

24 For example, using a 6 letter word would produce 4 trigram chunks, which would have 12 letters. [sent-41, score-0.083]

25 Longer words would yield more character n-gram chunks than shorter ones, leading to skewed weights for query words. [sent-43, score-0.233]
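
The drawbacks listed above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `char_ngrams`, the 6-letter token `ktAbhm`, and the choice to keep short words whole are assumptions made here, not details from the paper.

```python
def char_ngrams(word, n=3):
    """Return the overlapping character n-grams of a word, left to right."""
    if len(word) < n:
        return [word]  # assumption: words shorter than n are kept whole
    return [word[i:i + n] for i in range(len(word) - n + 1)]


# Hypothetical 6-letter token in Buckwalter transliteration.
grams = char_ngrams("ktAbhm")
total_letters = sum(len(g) for g in grams)
print(grams, total_letters)

# Singular / broken-plural pair from the text (ktAb, ktb).
shared = set(char_ngrams("ktAb")) & set(char_ngrams("ktb"))
print(shared)
```

The 6-letter token yields 4 trigrams holding 12 letters in total, and the pair ktAb/ktb shares no trigram, which is why n-grams alone cannot conflate a singular with its broken plural.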

26 Footnote 2: We use Buckwalter transliteration in the paper. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 218–222, Jeju, Republic of Korea, 8-14 July 2012. © 2012 Association for Computational Linguistics [sent-44, score-0.049]

27 To address this problem, we propose the use of a character-level transformation model that can generate tokens that are morphologically related to query tokens. [sent-46, score-0.371]

28 We train the model using morphologically related stems that are extracted from hypertext/page title pairs from Wikipedia. [sent-47, score-0.692]

29 Such pairs are good for the task at hand, because they show different ways to refer to the same concept. [sent-48, score-0.03]

30 We show that expanding stems in a query with related stems using our model outperforms the use of state-of-the-art statistical Arabic stemming. [sent-49, score-0.916]

31 Further, the expansion can be applied to words directly to perform at par with statistical stemming. [sent-50, score-0.092]

32 Laterally, the model can help produce spelling variants of transliterated names. [sent-51, score-0.128]

33 The contributions of this paper are as follows: • We propose an automatic method for learning character-level morphological transformations from Wikipedia hypertext/page title pairs. [sent-52, score-0.318]

34 • When applied to stems, we show that the method overcomes some morphological problems that are associated with stemming, statistically significantly outperforming Arabic retrieval using statistical stemming and character n-grams. [sent-53, score-0.841]

35 • When applied to words, we show that the method yields retrieval effectiveness at par with statistical stemming. [sent-54, score-0.21]

36 Related Work Most studies are based on a single large collection from the TREC-2001/2002 cross-language retrieval track (Gey and Oard, 2001; Oard and Gey, 2002). [sent-56, score-0.179]

37 , stems and roots (Darwish and Oard, 2002), light stemming (Aljlayl et al. [sent-60, score-0.803]

38 , 2002), and character n-grams of various lengths (Darwish and Oard, 2002; Mayfield et al. [sent-62, score-0.101]

39 The effects of normalizing alternative characters, removal of diacritics and stop-word removal have also been explored (Xu et al. [sent-64, score-0.062]

40 These studies suggest that light stemming, character n-grams, and statistical stemming are the better index terms. [sent-66, score-0.509]

41 Morphological approaches assume an Arabic word is constituted from prefixes-stem-suffixes and aim to remove prefixes and suffixes. [sent-67, score-0.048]

42 Since Arabic morphology is ambiguous, statistical stemming attempts to find the most likely segmentation of words. [sent-68, score-0.494]

43 (2003) used a trigram language model with a minimal set of manually crafted rules to achieve a stemming accuracy of 97. [sent-71, score-0.326]

44 (2005) to lead to statistical improvements over using light stemming. [sent-74, score-0.082]

45 Although consistency is more important for IR applications than linguistic correctness, perhaps improved correctness would naturally yield great consistency. [sent-79, score-0.063]

46 In this paper, we used a reimplementation of the system proposed by Diab (2009) with the same training set as a baseline. [sent-80, score-0.042]

47 Concerning the automatic induction of morphologically related word-forms, Hammarström (2009) surveyed fairly comprehensively many unsupervised morphology learning approaches. [sent-81, score-0.33]

48 The MDL-based approach was improved by Goldsmith (2001), who applied the EM algorithm to improve the precision of pairing stems prior to suffix induction, and by Schone and Jurafsky (2001), who applied latent semantic analysis to determine if two words are semantically related. [sent-84, score-0.454]

49 Baroni (2002) extended his work by incorporating semantic similarity features, via mutual information, and orthographic features, via edit distance. [sent-88, score-0.071]

50 Chen and Gey (2002) utilized a bilingual dictionary to find Arabic words with a common stem that map to the same English stem. [sent-89, score-0.124]

51 Also in the cross-language spirit, Snyder and Barzilay (2008) used cross-language mappings to learn morpheme patterns and consequently automatically segment words. [sent-90, score-0.123]

52 Creutz and Lagus (2007) proposed a probabilistic model for automatic word segment discovery. [sent-92, score-0.035]

53 Most of these approaches can discover suffixes and prefixes without human intervention. [sent-93, score-0.124]

54 However, they may not be able to handle infixation and spelling variations. [sent-94, score-0.076]

55 (2006) used approximate string matching to automatically map morphologically similar words in noisy dictionary data. [sent-96, score-0.222]

56 They used the mappings to learn affixation, including infixation, from noisy data. [sent-97, score-0.054]

57 In this paper, we propose a new technique for finding morphologically related word-forms based on learning character-level mappings. [sent-98, score-0.211]

58 1 Training Data In our experiments, we extracted Wikipedia hypertext to page title pairs as in Figure 1. [sent-102, score-0.296]

59 From them, we attempted to find word pairs that were morphologically related. [sent-106, score-0.193]

60 From the example in Figure 1, given the hypertext بالبرتغالية (bAlbrtgAlyp – in Portuguese) and the page title that it points to لغة برتغالية (lgp brtgAlyp – Portuguese language), we needed to extract the pair بالبرتغالية (bAlbrtgAlyp) and برتغالية (brtgAlyp). [sent-107, score-0.296]

61 We assumed that a word in the hypertext and another in Wikipedia title were morphologically related using the following criteria: • The words share the first 2 letters or the last 2 letters. [sent-108, score-0.492]

62 • The edit distance between the two words must be <= 3. [sent-110, score-0.061]

63 The choice of 3 was motivated by the fact that Arabic prefixes and suffixes are typically 1, 2, or 3 letters long. [sent-111, score-0.193]

64 • The edit distance was less than 50% of the length of the shorter of the two words. [sent-112, score-0.032]

65 This was important to ensure that short words that share common letters but are in fact different are filtered out. [sent-113, score-0.098]
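
The pairing criteria that survive in this extract can be sketched as a simple filter. This is a rough illustration, not the authors' code: the function names are hypothetical, plain Levenshtein distance is assumed for "edit distance", and any criterion missing from this extract is not modelled.

```python
def edit_distance(a, b):
    """Levenshtein distance computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def plausibly_related(w1, w2):
    """Apply the three criteria listed in the text to a word pair."""
    shares_edge = w1[:2] == w2[:2] or w1[-2:] == w2[-2:]
    d = edit_distance(w1, w2)
    short_enough = d < 0.5 * min(len(w1), len(w2))
    return shares_edge and d <= 3 and short_enough


print(plausibly_related("bAlbrtgAlyp", "brtgAlyp"))  # pair from the text
print(plausibly_related("mn", "mA"))                 # hypothetical short pair
```

The 50% rule is what rejects the short pair: a single edit is already half the length of a two-letter word.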

66 All words in the word pairs were stemmed using a reimplementation of the stemmer of Diab (2009). [sent-115, score-0.143]

67 In the first, we aligned the stems of the word pairs at character level. [sent-118, score-0.521]

68 In the second, we aligned the words of the word pairs at character level without stemming. [sent-119, score-0.16]

69 To apply a machine translation analogy, we treated words as sentences and the letters from which they were constructed as tokens. [sent-122, score-0.098]
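
The machine translation analogy amounts to a small preprocessing step, sketched below. The helper name `to_char_sentence` and the sample pairs are assumptions for illustration. The alignment tool itself is not named in this extract, so none is invoked here.

```python
def to_char_sentence(word):
    """Turn a word into a 'sentence' whose tokens are its letters."""
    return " ".join(word)


# Hypothetical training pairs in Buckwalter transliteration.
pairs = [("bAlbrtgAlyp", "brtgAlyp"), ("ktAb", "ktb")]

# Two parallel lists of lines, in the layout a word aligner typically expects.
source_lines = [to_char_sentence(src) for src, _ in pairs]
target_lines = [to_char_sentence(tgt) for _, tgt in pairs]

for src, tgt in zip(source_lines, target_lines):
    print(src, "|||", tgt)
```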

70 Source character sequence lengths were restricted to 3 letters. [sent-124, score-0.101]

71 Generating related stems/words: We treated the problem of generating morphologically related stems (or words) like a transliteration mining problem akin to that in Udupa et al. [sent-125, score-0.602]

72 Briefly, the miner used character segment mappings to generate all possible transformations while constraining generation to the existing tokens (either stems or words) in a list of unique tokens in the retrieval test collection. [sent-127, score-0.847]

73 Basically, given a query token, all possible segmentations, where each segment has a maximum length of 3 characters, were produced along with their associated mappings. [sent-128, score-0.105]

74 Given all mapping combinations, combinations producing valid target tokens were retained and sorted according to the product of their mapping probabilities. [sent-129, score-0.037]

75 To illustrate how this works, consider the following example: Given a query word “min”, target words in the word list {moon, men, man, min} , and the possible mappings for the segments and their probabilities: m = {(m, 0. [sent-130, score-0.153]
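
A minimal sketch of this generation step follows, under stated assumptions. The probabilities in the paper's example are cut off in this extract, so the mapping table below is an invented placeholder. The function names are hypothetical, and keeping only the best-scoring derivation per target token is an assumption.

```python
def segmentations(token, max_len=3):
    """Yield every split of token into segments of at most max_len characters."""
    if not token:
        yield []
        return
    for k in range(1, min(max_len, len(token)) + 1):
        for rest in segmentations(token[k:], max_len):
            yield [token[:k]] + rest


def related_tokens(token, mappings, vocabulary, max_len=3):
    """Rank vocabulary tokens reachable from token via segment mappings."""
    best = {}
    for segs in segmentations(token, max_len):
        partial = [("", 1.0)]  # (generated prefix, product of probabilities)
        for seg in segs:
            options = mappings.get(seg, [])
            partial = [(out + tgt, p * q)
                       for out, p in partial for tgt, q in options]
            if not partial:
                break  # this segmentation has an unmapped segment
        for target, score in partial:
            # Constrain generation to tokens that exist in the collection.
            if target in vocabulary and score > best.get(target, 0.0):
                best[target] = score
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)


# Placeholder probabilities, NOT the values from the paper.
mappings = {
    "m": [("m", 1.0)],
    "i": [("i", 0.5), ("e", 0.3), ("a", 0.15), ("oo", 0.05)],
    "n": [("n", 1.0)],
}
vocabulary = {"moon", "men", "man", "min"}
ranking = related_tokens("min", mappings, vocabulary)
print(ranking)
```

Enumerating every segmentation is exponential in the token length, which is tolerable for short stems but would call for dynamic programming on longer strings.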

76 1 Experimental Setup We used extrinsic IR evaluation to determine the quality of the related stems that were generated. [sent-156, score-0.39]

77 We performed experiments on the TREC 2001/2002 cross language track collection, which contains 383,872 Arabic newswire articles and 75 topics with their relevance judgments (Oard and Gey, 2002). [sent-157, score-0.032]

78 This is presently the best available large Arabic information retrieval test collection. [sent-158, score-0.147]

79 We used Mean Average Precision (MAP) as the measure of goodness for this retrieval task. [sent-159, score-0.147]
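
For reference, Mean Average Precision can be computed as in the sketch below. The function names, document identifiers, and the choice to score a query without a run as zero are illustrative assumptions. In practice the figure normally comes from a standard evaluation tool rather than hand-written code.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision@k over the ranks k where a relevant doc appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, 1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


def mean_average_precision(runs, qrels):
    """runs: query -> ranked doc ids; qrels: query -> relevant doc ids."""
    return sum(average_precision(runs.get(q, []), rel)
               for q, rel in qrels.items()) / len(qrels)


# Hypothetical two-topic run.
runs = {"t1": ["d1", "d2", "d3", "d4"], "t2": ["d5", "d6"]}
qrels = {"t1": {"d1", "d3"}, "t2": {"d6"}}
print(mean_average_precision(runs, qrels))
```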

80 All experiments were performed using the Indri retrieval toolkit, which uses a retrieval model that combines inference networks and language modeling and implements advanced query operators (Metzler and Croft, 2004). [sent-162, score-0.364]

81 05 to determine if a set of retrieval results was better than another. [sent-164, score-0.147]

82 We replaced each query token with all the related stems that were generated using a weighted synonym operator (Wang and Oard, 2006), where the weights correspond to the product of the mapping probabilities for each related word. [sent-165, score-0.555]

83 With the weighted synonym operator, we did not need to threshold the generated related stems as ones with low probabilities were demoted. [sent-166, score-0.419]

84 Probabilities were normalized by the score of the original query word. [sent-167, score-0.07]

85 For example, given the stem صناع (SnAE) it was replaced with: #wsyn(1 . [sent-168, score-0.065]
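
The weighting and normalization just described can be sketched as a small helper that builds the operator string. The helper name `build_wsyn`, the three-decimal formatting, and the toy tokens and scores are assumptions. The paper's own example is truncated in this extract, so it is not reproduced here.

```python
def build_wsyn(original, scored_tokens):
    """Build a weighted-synonym expression for one query token.

    scored_tokens: (token, score) pairs that include the original token;
    each weight is the token's score divided by the original token's score.
    """
    scores = dict(scored_tokens)
    base = scores[original]
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    body = " ".join(f"{score / base:.3f} {token}" for token, score in ranked)
    return f"#wsyn({body})"


# Toy scores; real input would be related Arabic stems and their scores.
query_part = build_wsyn("min", [("min", 0.5), ("men", 0.3)])
print(query_part)
```

Because low-probability expansions receive proportionally small weights, no hard cut-off on the expansion list is needed, which matches the remark about thresholding in the text.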

86 namely: using raw words, using statistical stemming (Diab, 2009), and character 4-grams. [sent-176, score-0.456]

87 For all runs, we performed letter normalization, where we conflated: variants of “alef”, “ta marbouta” and “ha”, “alef maqsoura” and “ya”, and the different forms of “hamza”. [sent-177, score-0.053]
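
The letter normalization can be approximated with a character translation table, as sketched below. The text says which letter groups were conflated but not which form each group was mapped to, so the target forms chosen here are assumptions, as is treating alef with hamza as an alef variant.

```python
# Assumed conflation targets; the paper only names the groups.
NORMALIZATION = str.maketrans({
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0629": "\u0647",  # ta marbouta           -> ha
    "\u0649": "\u064A",  # alef maqsoura         -> ya
    "\u0624": "\u0621",  # hamza on waw          -> bare hamza
    "\u0626": "\u0621",  # hamza on ya           -> bare hamza
})


def normalize(text):
    """Conflate orthographic variants character by character."""
    return text.translate(NORMALIZATION)


# "America" written with an initial hamza-bearing alef.
word = "\u0623\u0645\u0631\u064A\u0643\u0627"
print(normalize(word))
```

The same table has to be applied to both the indexed documents and the queries, otherwise normalized query terms will not match the index.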

88 2 Experimental Results Table 1 reports retrieval results. [sent-179, score-0.147]

89 Expanding stems using morphologically related stems yielded statistically significant improvements over using words, stems, and character 4-grams. [sent-180, score-1.096]

90 Expanding words yielded results that were statistically significantly better than using words, and statistically indistinguishable from using 4-grams and stems. [sent-181, score-0.172]

91 As the results show, the proposed technique improves upon statistical stemming by overcoming the shortfalls of stemming. [sent-182, score-0.44]

92 Another phenomenon that was addressed implicitly by the proposed technique had to do with detecting variant spellings of transliterated names. [sent-183, score-0.153]

93 This draws from the fact that differences in spelling variations and the construction of broken plurals are typically due to the insertion or deletion of long vowels. [sent-184, score-0.174]

94 Conclusion In this paper, we presented a method for generating morphologically related tokens from Wikipedia hypertext to page title pairs. [sent-187, score-0.466]

95 We showed that the method overcomes some of the problems of statistical stemming to yield statistically significant improvements in Arabic retrieval over using statistical stemming. [sent-188, score-0.655]

96 The technique can also be applied to words to yield results that are statistically indistinguishable from statistical stemming. [sent-189, score-0.23]

97 The technique had the added advantage of detecting variable spellings of transliterated named entities. [sent-190, score-0.153]

98 For future work, we would like to try the proposed technique on other languages, because it would likely be effective in automatically learning character-level morphological transformations as well as overcoming some of the problems associated with stemming. [sent-191, score-0.278]

99 It is worthwhile to devise models that concurrently generate morphological and phonologically related tokens. [sent-192, score-0.147]

100 Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. [sent-216, score-0.231]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('arabic', 0.425), ('stems', 0.39), ('stemming', 0.326), ('oard', 0.231), ('darwish', 0.211), ('gey', 0.186), ('morphologically', 0.163), ('retrieval', 0.147), ('morphological', 0.147), ('morphology', 0.139), ('trec', 0.139), ('title', 0.125), ('hypertext', 0.106), ('character', 0.101), ('larkey', 0.08), ('diab', 0.076), ('suffixes', 0.076), ('broken', 0.072), ('query', 0.07), ('mayfield', 0.069), ('letters', 0.069), ('stem', 0.065), ('inflected', 0.065), ('plurals', 0.063), ('plural', 0.063), ('wikipedia', 0.061), ('transliterated', 0.059), ('gaithersburg', 0.056), ('qatar', 0.056), ('mappings', 0.054), ('alef', 0.053), ('aljlayl', 0.053), ('balbrtgalyp', 0.053), ('brtgalyp', 0.053), ('goldsmith', 0.053), ('hammarstr', 0.053), ('infixes', 0.053), ('jacquemin', 0.053), ('snae', 0.053), ('udupa', 0.053), ('light', 0.053), ('letter', 0.053), ('statistically', 0.052), ('min', 0.051), ('transliteration', 0.049), ('prefixes', 0.048), ('technique', 0.048), ('cairo', 0.046), ('clitics', 0.046), ('emam', 0.046), ('schone', 0.046), ('spellings', 0.046), ('transformations', 0.046), ('creutz', 0.042), ('lagus', 0.042), ('mdl', 0.042), ('portuguese', 0.042), ('reimplementation', 0.042), ('stemmer', 0.042), ('metzler', 0.039), ('indistinguishable', 0.039), ('overcomes', 0.039), ('baroni', 0.039), ('qf', 0.039), ('orthographic', 0.039), ('spelling', 0.039), ('tokens', 0.037), ('men', 0.037), ('overcoming', 0.037), ('brent', 0.037), ('expanding', 0.037), ('handle', 0.037), ('attachment', 0.036), ('semitic', 0.035), ('suffix', 0.035), ('segment', 0.035), ('page', 0.035), ('morpheme', 0.034), ('roots', 0.034), ('par', 0.034), ('drawbacks', 0.034), ('croft', 0.034), ('ahmed', 0.033), ('snyder', 0.033), ('yield', 0.033), ('edit', 0.032), ('track', 0.032), ('cast', 0.031), ('removal', 0.031), ('pairs', 0.03), ('correctness', 0.03), ('map', 0.03), ('produce', 0.03), ('operator', 0.029), ('statistical', 0.029), ('mi', 0.029), ('words', 0.029), ('synonym', 0.029), ('qa', 0.029), 
('unsupervised', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999887 27 acl-2012-Arabic Retrieval Revisited: Morphological Hole Filling

Author: Kareem Darwish ; Ahmed Ali

Abstract: Due to Arabic’s morphological complexity, Arabic retrieval benefits greatly from morphological analysis – particularly stemming. However, the best known stemming does not handle linguistic phenomena such as broken plurals and malformed stems. In this paper we propose a model of character-level morphological transformation that is trained using Wikipedia hypertext to page title links. The use of our model yields statistically significant improvements in Arabic retrieval over the use of the best statistical stemming technique. The technique can potentially be applied to other languages.

2 0.25300935 202 acl-2012-Transforming Standard Arabic to Colloquial Arabic

Author: Emad Mohamed ; Behrang Mohit ; Kemal Oflazer

Abstract: We present a method for generating Colloquial Egyptian Arabic (CEA) from morphologically disambiguated Modern Standard Arabic (MSA). When used in POS tagging, this process improves the accuracy from 73.24% to 86.84% on unseen CEA text, and reduces the percentage of out-of-vocabulary words from 28.98% to 16.66%. The process holds promise for any NLP task targeting the dialectal varieties of Arabic; e.g., this approach may provide a cheap way to leverage MSA data and morphological resources to create resources for colloquial Arabic to English machine translation. It can also considerably speed up the annotation of Arabic dialects.

3 0.24667746 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations

Author: Spence Green ; John DeNero

Abstract: When automatically translating from a weakly inflected source language like English to a target language with richer grammatical features such as gender and dual number, the output commonly contains morpho-syntactic agreement errors. To address this issue, we present a target-side, class-based agreement model. Agreement is promoted by scoring a sequence of fine-grained morpho-syntactic classes that are predicted during decoding for each translation hypothesis. For English-to-Arabic translation, our model yields a +1.04 BLEU average improvement over a state-of-the-art baseline. The model does not require bitext or phrase table annotations and can be easily implemented as a feature in many phrase-based decoders.

4 0.1561216 207 acl-2012-Unsupervised Morphology Rivals Supervised Morphology for Arabic MT

Author: David Stallard ; Jacob Devlin ; Michael Kayser ; Yoong Keok Lee ; Regina Barzilay

Abstract: If unsupervised morphological analyzers could approach the effectiveness of supervised ones, they would be a very attractive choice for improving MT performance on low-resource inflected languages. In this paper, we compare performance gains for state-of-the-art supervised vs. unsupervised morphological analyzers, using a state-of-the-art Arabic-to-English MT system. We apply maximum marginal decoding to the unsupervised analyzer, and show that this yields the best published segmentation accuracy for Arabic, while also making segmentation output more stable. Our approach gives an 18% relative BLEU gain for Levantine dialectal Arabic. Furthermore, it gives higher gains for Modern Standard Arabic (MSA), as measured on NIST MT-08, than does MADA (Habash and Rambow, 2005), a leading supervised MSA segmenter.

5 0.11174625 122 acl-2012-Joint Evaluation of Morphological Segmentation and Syntactic Parsing

Author: Reut Tsarfaty ; Joakim Nivre ; Evelina Andersson

Abstract: We present novel metrics for parse evaluation in joint segmentation and parsing scenarios where the gold sequence of terminals is not known in advance. The protocol uses distance-based metrics defined for the space of trees over lattices. Our metrics allow us to precisely quantify the performance gap between non-realistic parsing scenarios (assuming gold segmented and tagged input) and realistic ones (not assuming gold segmentation and tags). Our evaluation of segmentation and parsing for Modern Hebrew sheds new light on the performance of the best parsing systems to date in the different scenarios.

6 0.1013151 140 acl-2012-Machine Translation without Words through Substring Alignment

7 0.099074125 217 acl-2012-Word Sense Disambiguation Improves Information Retrieval

8 0.094120294 20 acl-2012-A Statistical Model for Unsupervised and Semi-supervised Transliteration Mining

9 0.091587268 49 acl-2012-Coarse Lexical Semantic Annotation with Supersenses: An Arabic Case Study

10 0.081631318 148 acl-2012-Modified Distortion Matrices for Phrase-Based Statistical Machine Translation

11 0.077066682 137 acl-2012-Lemmatisation as a Tagging Task

12 0.076580696 134 acl-2012-Learning to Find Translations and Transliterations on the Web

13 0.075802132 54 acl-2012-Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages

14 0.06846381 194 acl-2012-Text Segmentation by Language Using Minimum Description Length

15 0.063111857 81 acl-2012-Enhancing Statistical Machine Translation with Character Alignment

16 0.062766381 2 acl-2012-A Broad-Coverage Normalization System for Social Media Language

17 0.061783034 210 acl-2012-Unsupervized Word Segmentation: the Case for Mandarin Chinese

18 0.057655435 89 acl-2012-Exploring Deterministic Constraints: from a Constrained English POS Tagger to an Efficient ILP Solution to Chinese Word Segmentation

19 0.057387196 150 acl-2012-Multilingual Named Entity Recognition using Parallel Data and Metadata from Wikipedia

20 0.055254392 39 acl-2012-Beefmoves: Dissemination, Diversity, and Dynamics of English Borrowings in a German Hip Hop Forum


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.165), (1, -0.052), (2, -0.017), (3, 0.015), (4, 0.094), (5, 0.201), (6, 0.081), (7, -0.146), (8, -0.015), (9, -0.027), (10, -0.136), (11, -0.072), (12, 0.16), (13, -0.119), (14, 0.039), (15, -0.211), (16, -0.248), (17, -0.118), (18, -0.123), (19, 0.004), (20, 0.068), (21, 0.007), (22, 0.091), (23, -0.077), (24, -0.084), (25, 0.013), (26, 0.018), (27, -0.003), (28, -0.059), (29, 0.034), (30, -0.088), (31, -0.039), (32, -0.07), (33, -0.018), (34, -0.038), (35, 0.065), (36, 0.059), (37, -0.021), (38, -0.019), (39, 0.003), (40, 0.073), (41, -0.082), (42, -0.032), (43, -0.027), (44, -0.038), (45, -0.016), (46, -0.001), (47, -0.034), (48, 0.014), (49, -0.055)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95030391 27 acl-2012-Arabic Retrieval Revisited: Morphological Hole Filling

Author: Kareem Darwish ; Ahmed Ali

Abstract: Due to Arabic’s morphological complexity, Arabic retrieval benefits greatly from morphological analysis – particularly stemming. However, the best known stemming does not handle linguistic phenomena such as broken plurals and malformed stems. In this paper we propose a model of character-level morphological transformation that is trained using Wikipedia hypertext to page title links. The use of our model yields statistically significant improvements in Arabic retrieval over the use of the best statistical stemming technique. The technique can potentially be applied to other languages.

2 0.82494837 202 acl-2012-Transforming Standard Arabic to Colloquial Arabic

Author: Emad Mohamed ; Behrang Mohit ; Kemal Oflazer

Abstract: We present a method for generating Colloquial Egyptian Arabic (CEA) from morphologically disambiguated Modern Standard Arabic (MSA). When used in POS tagging, this process improves the accuracy from 73.24% to 86.84% on unseen CEA text, and reduces the percentage of out-of-vocabulary words from 28.98% to 16.66%. The process holds promise for any NLP task targeting the dialectal varieties of Arabic; e.g., this approach may provide a cheap way to leverage MSA data and morphological resources to create resources for colloquial Arabic to English machine translation. It can also considerably speed up the annotation of Arabic dialects.

3 0.76567829 207 acl-2012-Unsupervised Morphology Rivals Supervised Morphology for Arabic MT

Author: David Stallard ; Jacob Devlin ; Michael Kayser ; Yoong Keok Lee ; Regina Barzilay

Abstract: If unsupervised morphological analyzers could approach the effectiveness of supervised ones, they would be a very attractive choice for improving MT performance on low-resource inflected languages. In this paper, we compare performance gains for state-of-the-art supervised vs. unsupervised morphological analyzers, using a state-of-the-art Arabic-to-English MT system. We apply maximum marginal decoding to the unsupervised analyzer, and show that this yields the best published segmentation accuracy for Arabic, while also making segmentation output more stable. Our approach gives an 18% relative BLEU gain for Levantine dialectal Arabic. Furthermore, it gives higher gains for Modern Standard Arabic (MSA), as measured on NIST MT-08, than does MADA (Habash and Rambow, 2005), a leading supervised MSA segmenter.

4 0.54572427 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations

Author: Spence Green ; John DeNero

Abstract: When automatically translating from a weakly inflected source language like English to a target language with richer grammatical features such as gender and dual number, the output commonly contains morpho-syntactic agreement errors. To address this issue, we present a target-side, class-based agreement model. Agreement is promoted by scoring a sequence of fine-grained morpho-syntactic classes that are predicted during decoding for each translation hypothesis. For English-to-Arabic translation, our model yields a +1.04 BLEU average improvement over a state-of-the-art baseline. The model does not require bitext or phrase table annotations and can be easily implemented as a feature in many phrase-based decoders.

5 0.4446972 122 acl-2012-Joint Evaluation of Morphological Segmentation and Syntactic Parsing

Author: Reut Tsarfaty ; Joakim Nivre ; Evelina Andersson

Abstract: We present novel metrics for parse evaluation in joint segmentation and parsing scenarios where the gold sequence of terminals is not known in advance. The protocol uses distance-based metrics defined for the space of trees over lattices. Our metrics allow us to precisely quantify the performance gap between non-realistic parsing scenarios (assuming gold segmented and tagged input) and realistic ones (not assuming gold segmentation and tags). Our evaluation of segmentation and parsing for Modern Hebrew sheds new light on the performance of the best parsing systems to date in the different scenarios.

6 0.434598 49 acl-2012-Coarse Lexical Semantic Annotation with Supersenses: An Arabic Case Study

7 0.38695779 20 acl-2012-A Statistical Model for Unsupervised and Semi-supervised Transliteration Mining

8 0.37599164 137 acl-2012-Lemmatisation as a Tagging Task

9 0.30498773 134 acl-2012-Learning to Find Translations and Transliterations on the Web

10 0.30251333 2 acl-2012-A Broad-Coverage Normalization System for Social Media Language

11 0.29782104 210 acl-2012-Unsupervized Word Segmentation: the Case for Mandarin Chinese

12 0.29645994 39 acl-2012-Beefmoves: Dissemination, Diversity, and Dynamics of English Borrowings in a German Hip Hop Forum

13 0.27963689 54 acl-2012-Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages

14 0.27825361 194 acl-2012-Text Segmentation by Language Using Minimum Description Length

15 0.27510703 140 acl-2012-Machine Translation without Words through Substring Alignment

16 0.24938378 7 acl-2012-A Computational Approach to the Automation of Creative Naming

17 0.24870783 81 acl-2012-Enhancing Statistical Machine Translation with Character Alignment

18 0.24773058 156 acl-2012-Online Plagiarized Detection Through Exploiting Lexical, Syntax, and Semantic Information

19 0.2387428 148 acl-2012-Modified Distortion Matrices for Phrase-Based Statistical Machine Translation

20 0.23565556 211 acl-2012-Using Rejuvenation to Improve Particle Filtering for Bayesian Word Segmentation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(25, 0.023), (26, 0.048), (28, 0.041), (30, 0.016), (37, 0.013), (39, 0.056), (57, 0.434), (74, 0.014), (82, 0.02), (84, 0.037), (85, 0.028), (90, 0.102), (92, 0.049), (94, 0.021), (99, 0.033)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.78680217 27 acl-2012-Arabic Retrieval Revisited: Morphological Hole Filling

2 0.65559298 110 acl-2012-Historical Analysis of Legal Opinions with a Sparse Mixed-Effects Latent Variable Model

Author: William Yang Wang ; Elijah Mayfield ; Suresh Naidu ; Jeremiah Dittmar

Abstract: We propose a latent variable model to enhance historical analysis of large corpora. This work extends prior work in topic modelling by incorporating metadata, and the interactions between the components in metadata, in a general way. To test this, we collect a corpus of slavery-related United States property law judgements sampled from the years 1730 to 1866. We study the language use in these legal cases, with a special focus on shifts in opinions on controversial topics across different regions. Because this is a longitudinal data set, we are also interested in understanding how these opinions change over the course of decades. We show that the joint learning scheme of our sparse mixed-effects model improves on other state-of-the-art generative and discriminative models on the region and time period identification tasks. Experiments show that our sparse mixed-effects model is quantitatively more accurate and qualitatively interesting, and that these improvements are robust across different parameter settings.

3 0.65191472 155 acl-2012-NiuTrans: An Open Source Toolkit for Phrase-based and Syntax-based Machine Translation

Author: Tong Xiao ; Jingbo Zhu ; Hao Zhang ; Qiang Li

Abstract: We present a new open source toolkit for phrase-based and syntax-based machine translation. The toolkit supports several state-of-the-art models developed in statistical machine translation, including the phrase-based model, the hierarchical phrase-based model, and various syntax-based models. The key innovation provided by the toolkit is that the decoder can work with various grammars and offers different choices of decoding algorithms, such as phrase-based decoding, decoding as parsing/tree-parsing and forest-based decoding. Moreover, several useful utilities are distributed with the toolkit, including a discriminative reordering model, a simple and fast language model, and an implementation of minimum error rate training for weight tuning.

4 0.59369946 83 acl-2012-Error Mining on Dependency Trees

Author: Claire Gardent ; Shashi Narayan

Abstract: In recent years, error mining approaches were developed to help identify the most likely sources of parsing failures in parsing systems using handcrafted grammars and lexicons. However, the techniques they use to enumerate and count n-grams build on the sequential nature of a text corpus and do not easily extend to structured data. In this paper, we propose an algorithm for mining trees and apply it to detect the most likely sources of generation failure. We show that this tree mining algorithm permits identifying not only errors in the generation system (grammar, lexicon) but also mismatches between the structures contained in the input and the input structures expected by our generator, as well as a few idiosyncrasies/errors in the input data.

5 0.49708742 61 acl-2012-Cross-Domain Co-Extraction of Sentiment and Topic Lexicons

Author: Fangtao Li ; Sinno Jialin Pan ; Ou Jin ; Qiang Yang ; Xiaoyan Zhu

Abstract: Extracting sentiment and topic lexicons is important for opinion mining. Previous work has shown that supervised learning methods are superior for this task. However, the performance of supervised methods relies heavily on manually labeled training data. In this paper, we propose a domain adaptation framework for sentiment- and topic-lexicon co-extraction in a domain of interest where we do not require any labeled data, but have lots of labeled data in another related domain. The framework is twofold. In the first step, we generate a few high-confidence sentiment and topic seeds in the target domain. In the second step, we propose a novel Relational Adaptive bootstraPping (RAP) algorithm to expand the seeds in the target domain by exploiting the labeled source domain data and the relationships between topic and sentiment words. Experimental results show that our domain adaptation framework can extract precise lexicons in the target domain without any annotation.

6 0.45271826 202 acl-2012-Transforming Standard Arabic to Colloquial Arabic

7 0.42601705 97 acl-2012-Fast and Scalable Decoding with Language Model Look-Ahead for Phrase-based Statistical Machine Translation

8 0.40061897 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations

9 0.38803807 148 acl-2012-Modified Distortion Matrices for Phrase-Based Statistical Machine Translation

10 0.3711952 127 acl-2012-Large-Scale Syntactic Language Modeling with Treelets

11 0.37026924 138 acl-2012-LetsMT!: Cloud-Based Platform for Do-It-Yourself Machine Translation

12 0.36390975 207 acl-2012-Unsupervised Morphology Rivals Supervised Morphology for Arabic MT

13 0.36229125 120 acl-2012-Information-theoretic Multi-view Domain Adaptation

14 0.36151838 178 acl-2012-Sentence Simplification by Monolingual Machine Translation

15 0.35990858 140 acl-2012-Machine Translation without Words through Substring Alignment

16 0.35987192 25 acl-2012-An Exploration of Forest-to-String Translation: Does Translation Help or Hurt Parsing?

17 0.35772491 123 acl-2012-Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT

18 0.3548578 54 acl-2012-Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages

19 0.35359469 136 acl-2012-Learning to Translate with Multiple Objectives

20 0.35198373 4 acl-2012-A Comparative Study of Target Dependency Structures for Statistical Machine Translation