emnlp emnlp2010 emnlp2010-44 knowledge-graph by maker-knowledge-mining

44 emnlp-2010-Enhancing Mention Detection Using Projection via Aligned Corpora


Source: pdf

Author: Yassine Benajiba ; Imed Zitouni

Abstract: The research question treated in this paper is centered on the idea of exploiting rich resources of one language to enhance the performance of a mention detection system of another one. We successfully achieve this goal by projecting information from one language to another via a parallel corpus. We examine the potential improvement using various degrees of linguistic information in a statistical framework and we show that the proposed technique is effective even when the target language model has access to a significantly rich feature set. Experimental results show up to 2.4F improvement in performance when the system has access to information obtained by projecting mentions from a resource-rich language mention detection system via a parallel corpus.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract The research question treated in this paper is centered on the idea of exploiting rich resources of one language to enhance the performance of a mention detection system of another one. [sent-3, score-0.458]

2 We successfully achieve this goal by projecting information from one language to another via a parallel corpus. [sent-4, score-0.215]

3 We examine the potential improvement using various degrees of linguistic information in a statistical framework and we show that the proposed technique is effective even when the target language model has access to a significantly rich feature set. [sent-5, score-0.135]

4 2.4F improvement in performance when the system has access to information obtained by projecting mentions from a resource-rich language mention detection system via a parallel corpus. [sent-7, score-0.898]

5 Similarly to the Automatic Content Extraction (ACE) nomenclature, we consider that a mention can be either named (e. [sent-12, score-0.33]

6 we find the mentions ‘Michael Bloomberg’, ‘Mayor’ and ‘his’ of the same person entity. [sent-31, score-0.253]

7 ‘NYC’ and ‘city’, on the other hand, are mentions of the same geopolitical (GPE) entity of type named and nominal, respectively. [sent-33, score-0.363]

8 (Zitouni and Florian, 2009) when it resorts to a rich set of features extracted from diverse resources, namely: part-of-speech, chunk information, syntactic parse trees, word sense information, WordNet information and information from the output of other mention detection classifiers. [sent-41, score-0.398]

9 The linguistic resources available for the Arabic language allow a simulation of different TL richness levels. [sent-47, score-0.12]

10 Our hypothesis might be expressed as follows: using an MD system resorting to a rich feature set (i. [sent-49, score-0.159]

11 To test this hypothesis, we have projected MD tags from RRL to TL via a parallel corpus, and then extracted several linguistic features about the automatically tagged words. [sent-52, score-0.379]

12 In order to have a complete picture of the impact of these new features, we have used TL baseline systems with varying amounts of features, from a case employing only lexical information to a case where we use all the resources we could gather for the TL. [sent-54, score-0.103]

13 It also assigns to every non-outside mention a class to specify its type: e. [sent-59, score-0.251]

14 The features used by our MD systems can be divided into the following categories: 1- Lexical: these are token n-grams directly neighboring the current token on both sides, i. [sent-65, score-0.126]
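
To make the lexical feature category concrete, here is a minimal sketch of generating neighboring token n-gram features; the window size, padding symbol, and feature-name encoding are illustrative assumptions, not the authors' exact implementation.

```python
def lexical_features(tokens, i, window=2):
    """Token n-gram features around position i (a sketch; the paper's
    exact window size and feature encoding may differ)."""
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        tok = tokens[j] if 0 <= j < len(tokens) else "<PAD>"
        feats["tok[%+d]=%s" % (offset, tok)] = 1.0
    # bigrams spanning the current token on both sides
    left = tokens[i - 1] if i > 0 else "<PAD>"
    right = tokens[i + 1] if i + 1 < len(tokens) else "<PAD>"
    feats["bi[-1,0]=%s_%s" % (left, tokens[i])] = 1.0
    feats["bi[0,+1]=%s_%s" % (tokens[i], right)] = 1.0
    return feats
```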

15 In order to provide the MD system with complementary information, these classifiers are trained on different datasets annotated for different mention types, e. [sent-73, score-0.316]

16 3 Annotation, Projection and Feature Extraction We remind the reader that our main goal is to use an RRL MD system to enhance the performance of an MD system in another language, i. [sent-76, score-0.113]

17 In order to achieve this goal, we propose an approach that uses an RRL-to-TL parallel corpus to bridge between these two languages. [sent-79, score-0.175]

18 This approach performs in three main steps, namely: annotation, projection and feature extraction. [sent-80, score-0.108]

19 1 Annotation This first step consists of MD tagging of the RRL side of the parallel corpus. [sent-83, score-0.251]

20 2 Projection Once the RRL side of the parallel corpus is accurately augmented with MD tags, the projection step transfers those tags to the TL side, Arabic in our case study, using the word alignment information. [sent-88, score-0.366]

21 Let us consider the following MD-tagged English sentence: Bill/B-PER-NAM Clinton/I-PER-NAM is visiting North/B-GPE-NAM Korea/I-GPE-NAM today, where “Bill Clinton” is a named person mention and “North Korea” is a named geopolitical entity (GPE) one. [sent-90, score-0.623]

22 In real world translation (both human and automatic), one should expect to see 1-to-n, n-to-1 mappings as well as unmapped words on both sides of the parallel corpus rather frequently. [sent-113, score-0.21]

23 As stated by (Klementiev and Roth, 2006), the projection of NER tags is easier in comparison to projecting other types of annotations such as POS-tags and BPC2, mainly because: 1. [sent-114, score-0.155]

24 Not all the words are mentions: once we have projected the tags of the mentions from the RRL to the TL side, the rest of the tokens are simply considered as outside any mentions. [sent-115, score-0.271]

25 In case of a 1-to-n mapping, the target n words are assigned the same class: for instance, let us consider the English GPE named mention “North Korea”. [sent-118, score-0.33]

26 In case of n-to-1 mapping, the TL side word is simply assigned the class propagated from the RRL side. [sent-135, score-0.121]

27 For instance, if on the English side we have the named person multi-word mention “Ben Moussa”, translated into the one-word mention [Arabic script] [sent-136, score-0.739]

28 consists of simply assigning the person named tag to the Arabic word. [sent-142, score-0.13]
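
A minimal sketch of the projection rules described above, assuming the word alignment is given as (source index, target index) pairs. The first-tag-wins handling of n-to-1 mappings and the default outside tag for unaligned target tokens are assumptions; a real system would also re-normalize B/I prefixes afterwards.

```python
def project_tags(src_tags, alignment, tgt_len):
    """Project mention tags from the RRL (source) side to the TL (target)
    side. `alignment` is a list of (src_i, tgt_j) word-alignment pairs.
    1-to-n: each aligned target word inherits the source class;
    n-to-1: the single target word keeps the first propagated class."""
    tgt_tags = ["O"] * tgt_len
    for src_i, tgt_j in alignment:
        tag = src_tags[src_i]
        if tag != "O" and tgt_tags[tgt_j] == "O":
            tgt_tags[tgt_j] = tag
    return tgt_tags

src_tags = ["B-PER-NAM", "I-PER-NAM", "O", "O", "B-GPE-NAM", "I-GPE-NAM", "O"]
# hypothetical alignment to a 5-token target sentence
align = [(0, 1), (1, 1), (3, 0), (4, 2), (5, 3), (6, 4)]
print(project_tags(src_tags, align, 5))
# ['O', 'B-PER-NAM', 'B-GPE-NAM', 'I-GPE-NAM', 'O']
```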

29 use mention “splits” to filter annotation errors: We assume that when a sequence of tokens is tagged as a mention on the RRL side, its TL counterpart should be an uninterrupted sequence of tokens as well. [sent-148, score-0.717]

30 the RRL MD system might mistakenly tag “Dona Karan international” as an organization mention instead of tagging “Dona Karan” as a person mention. [sent-154, score-0.424]

31 We use this “split” in the mentions as a signal not to use these mentions in the feature extraction step (see Subsection 3. [sent-158, score-0.431]
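
The "split" heuristic reduces to a contiguity check over the target-side token indices that received a mention's tags; representing a projected mention as a set of indices is an assumption for illustration.

```python
def is_split(target_indices):
    """True if the TL tokens receiving a mention's tags are not one
    uninterrupted span, which the paper treats as a symptom of an
    annotation/alignment error and excludes from feature extraction."""
    idx = sorted(set(target_indices))
    return idx[-1] - idx[0] + 1 != len(idx)

assert not is_split([3, 4, 5])  # uninterrupted span: keep
assert is_split([3, 5, 6])      # gap at index 4: filter out
```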

32 do not use the projected mentions directly for training: Instead, we use these tags as additional features to our TL baseline model and allow our MEMM classifier to weigh them according to their relevance to each mention type. [sent-161, score-0.521]

33 3 Feature Extraction At this point, the parallel corpus should be annotated with mentions on both of its sides. [sent-163, score-0.406]

34 Where the RRL side is tagged using the English MD system during the annotation step (c. [sent-164, score-0.222]

35 1) while the TL side is annotated by the propagation of these MD tags via the parallel corpus in the projection step (c. [sent-166, score-0.428]

36 In this third step, the goal is to extract pertinent linguistic features of the automatically tagged TL corpus to enhance MD model in the TL. [sent-170, score-0.183]

37 Gazetteers: we group mentions by class in different dictionaries. [sent-172, score-0.202]

38 During both training and decoding, when we encounter a token or a sequence of tokens that is part of a dictionary, we fire its corresponding class; the feature is fired only when we find a complete match between sequence of tokens in the text and in the dictionary. [sent-173, score-0.143]
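
One plausible reading of this complete-match rule as code is a longest-match gazetteer lookup that fires a class only when an entire dictionary entry matches a token sequence; the dictionary layout (class mapped to a set of token tuples) is an assumed convenience.

```python
def fire_gazetteer(tokens, gazetteers, max_len=5):
    """For each position, fire the class of the longest dictionary entry
    that completely matches a token sequence starting there; partial
    matches never fire (a sketch)."""
    feats = [None] * len(tokens)
    for i in range(len(tokens)):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = tuple(tokens[i:i + n])
            for cls, entries in gazetteers.items():
                if span in entries:
                    for j in range(i, i + n):
                        feats[j] = cls
                    break
            if feats[i] is not None:
                break
    return feats

gaz = {"GPE-NAM": {("North", "Korea")}, "PER-NAM": {("Bill", "Clinton")}}
print(fire_gazetteer(["Bill", "Clinton", "visited", "North", "Korea"], gaz))
# ['PER-NAM', 'PER-NAM', None, 'GPE-NAM', 'GPE-NAM']
```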

39 Model-based features: it consists of building a model on the automatically tagged TL side of the parallel corpus. [sent-175, score-0.359]

40 We organize those contexts by mention type and we use them to tag tokens which appear in the same context in both the training and decoding sets. [sent-181, score-0.286]

41 which might be transliterated as “SrH Ams An SdAm Hsyn ytrAs nZAmA fA$lA” and translated to English as “declared yesterday that Sadam Husein governs a failed system”; the context n-grams that would be extracted are: Left n-grams: (An - that), . [sent-215, score-0.26]

42 For both training and test data we create a new feature stream where we indicate that a token sequence is a mention if it appears in the same n-gram context. [sent-244, score-0.355]
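
The context n-gram feature stream might be built in two passes, sketched below: first collect the left and right n-gram contexts of each mention type from the projected TL annotations, then fire the type for any span seen in the same context. BIO-style tags and the data layout are assumptions.

```python
from collections import defaultdict

def collect_contexts(sentences, n=1):
    """Map mention_type -> set of (left, right) context n-grams, gathered
    from the automatically tagged TL side of the parallel corpus."""
    ctx = defaultdict(set)
    for tokens, tags in sentences:  # tags are BIO-style, e.g. B-PER-NAM
        for i, tag in enumerate(tags):
            if tag.startswith("B-"):
                j = i + 1
                while j < len(tags) and tags[j].startswith("I-"):
                    j += 1  # find the end of the mention
                left = tuple(tokens[max(0, i - n):i])
                right = tuple(tokens[j:j + n])
                ctx[tag[2:]].add((left, right))
    return ctx

def context_feature(tokens, i, j, ctx, n=1):
    """Fire every mention type whose stored contexts match the span [i, j)."""
    left = tuple(tokens[max(0, i - n):i])
    right = tuple(tokens[j:j + n])
    return [t for t, pairs in ctx.items() if (left, right) in pairs]
```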

43 Head-word based features: it considers that the lexical context in which the mention appeared is the sequence of the parent sub-trees' head words in a parse-tree. [sent-246, score-0.295]

44 Similarly to the other features, in both training and decoding sets, we create a new feature stream where we tag those token sequences which appear with the same n first parent sub-tree head words as a person mention in the annotated TL data. [sent-252, score-0.479]

45 Parser-based features: it attempts to use the syntactic environment in which a mention might appear. [sent-254, score-0.282]

46 In order to do so, for each mention in the target language corpus we consider only labels of the parent non-terminals. [sent-255, score-0.295]

47 Similarly to the features described above, we create during both training and test a new feature stream where we indicate the token sequences which appear in the same parent non-terminal labels. [sent-257, score-0.182]
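
As a sketch of this parser-based feature, the function below collects the labels of the non-terminals dominating a token, from the closest parent up to the root. nltk is used only as an assumed convenience; the paper does not name a toolkit. The head-word feature described earlier could be computed from the same tree positions by reading off head words instead of labels.

```python
from nltk import Tree  # assumed convenience, not the authors' toolkit

def parent_labels(tree, leaf_index, n=3):
    """Labels of the first n constituents dominating the leaf at
    leaf_index, ordered from closest to farthest (a sketch)."""
    path = tree.leaf_treeposition(leaf_index)
    labels = [tree[path[:d]].label() for d in range(len(path) - 1, 0, -1)]
    labels.append(tree.label())  # include the root label
    return labels[:n]

t = Tree.fromstring(
    "(S (NP (NNP Bill) (NNP Clinton)) (VP (VBZ visits) (NP (NNP Korea))))")
print(parent_labels(t, 0))  # ['NNP', 'NP', 'S']
```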

48 Gazetteers and model-based features are the most natural and expected kind of features that one would extract from the automatically MD tagged version of the TL text. [sent-258, score-0.176]

49 If the current token xi is a stem, stem n-gram features contain the previous n − 1 stems and the following n − 1 stems. [sent-286, score-0.178]
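
Read literally, that rule can be sketched as follows; the padding symbol for sentence boundaries is an assumption.

```python
def stem_ngram_features(stems, i, n=3):
    """Stem n-gram features for the stem at position i: the previous n-1
    and following n-1 stems, padded at sentence boundaries (a sketch)."""
    pad = ["<PAD>"]
    left = (pad * (n - 1) + stems)[i:i + n - 1]    # previous n-1 stems
    right = (stems + pad * (n - 1))[i + 1:i + n]   # following n-1 stems
    return {"prev_stems": tuple(left), "next_stems": tuple(right)}

print(stem_ngram_features(["ktb", "drs", "qrA"], 1, n=2))
# {'prev_stems': ('ktb',), 'next_stems': ('qrA',)}
```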

50 As we describe with more details in the experiments section (see Section 6), once we have extracted the new features from the parallel corpus, we contrast their impact with the level of richness in features of the TL MD system, i. [sent-290, score-0.374]

51 we measure the impact of each feature fi when the TL MD system uses: (i) only lexical features; (ii) both lexical and stem features; and (iii) lexical, stem and syntactic features. [sent-292, score-0.287]

52 This results in 17,634 mentions (7,816 named, 8,831 nominal and 987 pronominal) for training and 3,566 for test (1,673 named, 1,682 nominal and 211 pronominal). [sent-298, score-0.308]

53 However, given that we are interested in the mention detection task only, we decided to use the more intuitive and popular (un-weighted) F-measure, the harmonic mean of precision and recall. [sent-303, score-0.301]
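
For reference, this is the usual unweighted F1, the harmonic mean of precision and recall:

```python
def f_measure(precision, recall):
    """Unweighted F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(round(f_measure(0.80, 0.72), 4))  # 0.7579
```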

54 6 Experiments and Results As we have stated earlier, our main goal is to investigate how an MD model of a TL might benefit from additional information about the mentions obtained by propagation from an RRL. [sent-304, score-0.301]

55 For each of these baseline systems, we study the impact of features extracted from the parallel corpus (c. [sent-315, score-0.265]

56 Table 1: Obtained results when the features were extracted from a hand-aligned parallel corpus. 3- n-Head: Base. [sent-326, score-0.237]

57 + parser-related features; 5- Gazet: Base. + automatically extracted gazetteers from the parallel corpus; 6- Model: Base. [sent-332, score-0.215]

58 + output of model trained on the Arabic part of the parallel corpus; 7- Comb. [sent-333, score-0.175]

59 In the rest of the paper, to measure whether the improvement in performance of a system using features from parallel data over baseline is statistically significant or not, we use the stratified bootstrap resampling significance test (Noreen, 1989) used in the NER shared task of CoNLL-2002. [sent-335, score-0.284]
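
A sketch in the spirit of that test: resample test sentences with replacement, recompute both systems' F-measures on each resample, and report how often the improvement vanishes. The per-sentence count representation is an assumption, and this is not the exact CoNLL-2002 script.

```python
import random

def bootstrap_f_diff(counts_a, counts_b, iters=1000, seed=0):
    """Estimate p(system A is not better than B) by resampling sentences.
    counts_a and counts_b hold one (tp, fp, fn) triple per test sentence,
    aligned on the same sentences (a sketch)."""
    rng = random.Random(seed)

    def f1(counts):
        tp = sum(c[0] for c in counts)
        fp = sum(c[1] for c in counts)
        fn = sum(c[2] for c in counts)
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0

    n, worse = len(counts_a), 0
    for _ in range(iters):
        idx = [rng.randrange(n) for _ in range(n)]
        if f1([counts_a[i] for i in idx]) <= f1([counts_b[i] for i in idx]):
            worse += 1
    return worse / iters  # small value -> improvement is significant
```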

60 1 Hand-aligned Data In our first experiment-set, we use a hand-aligned English-to-Arabic parallel corpus of approximately one million words. [sent-339, score-0.211]

61 After tagging the Arabic side by projection we obtain 86. [sent-340, score-0.157]

62 As we have previously mentioned, in order to generate the model-based feature, Model, we have trained a model on the Arabic side of the parallel corpus. [sent-342, score-0.251]

63 Results in Table 1 show that a significant improvement is obtained when the TL is poor in resources; for instance an improvement of ∼1 . [sent-346, score-0.113]

64 According to our error-analysis, the significant amount of Arabic mentions observed in the parallel corpus, many of which do not appear in the training corpus, has significantly helped the Lex. [sent-364, score-0.421]

65 , Stem and Syntac MD models to capture new mentions and/or correct the type assigned. [sent-365, score-0.202]

66 Examples in our data are: (i) the facility mention [Arabic script]. [sent-368, score-0.281]

67 (mbnY blfwr - Belvoir Building); (ii) the GPE mention [Arabic script]. [sent-374, score-0.35]

68 only been tagged correctly when we have added the new extracted features to our model. [sent-384, score-0.143]

69 The second parameter can be, indirectly, increased by increasing the size of the parallel data. [sent-386, score-0.175]

70 Getting 10 or 20 times more hand-aligned parallel data is expensive and requires several months of human work. [sent-387, score-0.175]

71 For this reason we opted for using an unsupervised approach by selecting a parallel corpus that is automatically aligned as we discuss in the next section. [sent-388, score-0.261]

72 2 Automatically-aligned Data We have used for this experiment-set Arabic-to-English parallel data of 22 million words. [sent-390, score-0.211]

73 Such filtering consists in keeping, from the parallel corpus, only sentences which have all tokens tagged with a confidence greater than α. [sent-394, score-0.291]
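
That filtering rule reads directly as code; the (tokens, confidences) representation of an aligned sentence is an assumption.

```python
def filter_by_confidence(sentences, alpha):
    """Keep only sentences in which every token's alignment confidence
    exceeds alpha (a sketch of the filtering described above)."""
    return [s for s in sentences if all(c > alpha for c in s[1])]

sents = [(["w1", "w2"], [0.9, 0.8]), (["w3"], [0.4])]
print(filter_by_confidence(sents, alpha=0.5))  # keeps only the first pair
```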

74 Table 2: Obtained results when the features were extracted from an automatically-aligned parallel corpus (parallel-data-based features using the 17M subset). [sent-404, score-0.271]

75 the one obtained when using the 1M hand-aligned parallel data (see Table 1), i. [sent-409, score-0.21]

76 (i) the greatest improvement has been obtained when the TL uses a poor feature-set; and (ii) when the TL baseline model is rich in resources, we still obtain 0. [sent-411, score-0.109]

77 On the other hand, the features extracted from the automatically-aligned data, in comparison with the ones extracted from the hand-aligned data, have helped the MD model to correct many of the TL baseline model's false negatives. [sent-414, score-0.131]

78 [Arabic script], which might be transliterated as “Edm AlsmAH lmstHDrAt AxrY” and translated to English as “not to allow other preparations”, has been tagged as an organization mention because it has been mistakenly aligned, in the parallel corpus, with the word [Arabic script]. [sent-435, score-0.713]

79 Table 3: Distribution over the classes of the blind test mentions. [sent-460, score-0.248]

80 Table 3 shows the distribution of these mentions over the different classes. [sent-467, score-0.202]

81 Table 5: Obtained results when the features were extracted from both hand-aligned and automatically-aligned parallel corpora; improvement of using Comb. [sent-479, score-0.272]

82 We notice again that when the TL baseline MD model uses a richer feature set, the obtained improvement from using RRL becomes smaller. [sent-481, score-0.101]

83 We also observed that automatically aligned data helped capture most of the unseen mentions whereas the hand-aligned features helped decrease the number of false-alarms. [sent-482, score-0.41]

84 3 higher than the baseline model which uses lexical, stem and syntactic features Syntac (75. [sent-486, score-0.132]

85 The type of errors which occurs most often and has not been fixed by using hand-aligned data, automatically aligned data, or the combination of both is the nominal mentions whose class dep [sent-488, score-0.314]

86 (mwZf - employee) which was considered as O by the MD model because it has not been seen in any of the parallel data in a context such as the following: [Arabic script] [sent-494, score-0.175]

87 , 2003) report a research study which uses an English-Chinese parallel corpus in order to extract sense-tagged training data. [sent-527, score-0.175]

88 One of the significant differences between these works and the one we present in this paper is that instead of using the propagated annotation directly as training data we use it as an additional feature and thus allow the MEMM model to weigh each one of them. [sent-531, score-0.101]

89 The approach in (Zitouni and Florian, 2008) requires an MT system that needs more effort and resources to build when compared to a parallel corpus (used in our experiments); not all institutions may have access to MT and MD systems in plenty of language pairs. [sent-535, score-0.29]

90 We use different Arabic baseline MD models which employ different feature sets representing different levels of richness in resources. [sent-539, score-0.102]

91 We also use both a one million word hand-aligned parallel corpus and a 22 million word automatically aligned one in order to study size vs. [sent-540, score-0.333]

92 When we use the hand-aligned parallel corpus, we obtain up to 2.4F improvement. [sent-543, score-0.175]

93 The results also show that a greater improvement is achieved when using a small hand-aligned corpus than when using automatically aligned data that is 20 times bigger. [sent-549, score-0.125]

94 Automatic tagging of Arabic text: from raw text to base phrase chunks. [sent-562, score-0.321]

95 Weakly supervised named entity transliteration and discovery from multilingual comparable corpora. [sent-583, score-0.121]

96 Exploiting parallel texts for word sense disambiguation: An empirical study. [sent-593, score-0.175]

97 Unsupervised learning of Arabic stemming using a parallel corpus. [sent-614, score-0.496]

98 Introduction to the CoNLL-2002 shared task: language-independent named entity recognition. [sent-620, score-0.121]

99 Inducing multilingual text analysis tools via robust projection across aligned corpora. [sent-625, score-0.14]

100 The impact of morphological stemming on Arabic mention detection and coreference resolution. [sent-638, score-0.65]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('md', 0.46), ('tl', 0.395), ('arabic', 0.321), ('rrl', 0.297), ('mention', 0.251), ('mentions', 0.202), ('zitouni', 0.179), ('parallel', 0.175), ('gpe', 0.105), ('stem', 0.098), ('karan', 0.087), ('projection', 0.081), ('tagged', 0.081), ('named', 0.079), ('side', 0.076), ('richness', 0.075), ('dona', 0.07), ('hsyn', 0.07), ('sdam', 0.07), ('florian', 0.06), ('aligned', 0.059), ('ace', 0.055), ('transliterated', 0.054), ('imed', 0.054), ('nominal', 0.053), ('ams', 0.052), ('bpc', 0.052), ('syntac', 0.052), ('person', 0.051), ('detection', 0.05), ('blind', 0.046), ('ner', 0.046), ('pronominal', 0.046), ('token', 0.046), ('resources', 0.045), ('governs', 0.045), ('benajiba', 0.045), ('propagated', 0.045), ('parent', 0.044), ('helped', 0.044), ('english', 0.043), ('entity', 0.042), ('enhance', 0.041), ('memm', 0.04), ('gazetteers', 0.04), ('geopolitical', 0.04), ('klementiev', 0.04), ('projecting', 0.04), ('yarowsky', 0.04), ('improvement', 0.039), ('diab', 0.037), ('system', 0.036), ('million', 0.036), ('obtained', 0.035), ('rich', 0.035), ('alywm', 0.035), ('automaticallyaligned', 0.035), ('bloomberg', 0.035), ('declared', 0.035), ('kp', 0.035), ('kwrya', 0.035), ('malyp', 0.035), ('nyc', 0.035), ('nzama', 0.035), ('preparations', 0.035), ('thereafter', 0.035), ('uninterrupted', 0.035), ('unmapped', 0.035), ('yassine', 0.035), ('ytras', 0.035), ('yzwr', 0.035), ('ii', 0.035), ('tokens', 0.035), ('access', 0.034), ('tags', 0.034), ('features', 0.034), ('propagation', 0.033), ('al', 0.032), ('translated', 0.031), ('stream', 0.031), ('radu', 0.031), ('might', 0.031), ('resorting', 0.03), ('ncb', 0.03), ('employee', 0.03), ('facility', 0.03), ('ittycheriah', 0.03), ('mayor', 0.03), ('rogati', 0.03), ('jj', 0.03), ('annotated', 0.029), ('annotation', 0.029), ('extracted', 0.028), ('impact', 0.028), ('organization', 0.028), ('sparseness', 0.027), ('mona', 0.027), ('mistakenly', 0.027), ('automatically', 0.027), ('feature', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999917 44 emnlp-2010-Enhancing Mention Detection Using Projection via Aligned Corpora

Author: Yassine Benajiba ; Imed Zitouni

Abstract: The research question treated in this paper is centered on the idea of exploiting rich resources of one language to enhance the performance of a mention detection system of another one. We successfully achieve this goal by projecting information from one language to another via a parallel corpus. We examine the potential improvement using various degrees of linguistic information in a statistical framework and we show that the proposed technique is effective even when the target language model has access to a significantly rich feature set. Experimental results show up to 2.4F improvement in performance when the system has access to information obtained by projecting mentions from a resource-richlanguage mention detection system via a parallel corpus.

2 0.36451462 62 emnlp-2010-Improving Mention Detection Robustness to Noisy Input

Author: Radu Florian ; John Pitrelli ; Salim Roukos ; Imed Zitouni

Abstract: Information-extraction (IE) research typically focuses on clean-text inputs. However, an IE engine serving real applications yields many false alarms due to less-well-formed input. For example, IE in a multilingual broadcast processing system has to deal with inaccurate automatic transcription and translation. The resulting presence of non-target-language text in this case, and non-language material interspersed in data from other applications, raise the research problem of making IE robust to such noisy input text. We address one such IE task: entity-mention detection. We describe augmenting a statistical mention-detection system in order to reduce false alarms from spurious passages. The diverse nature of input noise leads us to pursue a multi-faceted approach to robustness. For our English-language system, at various miss rates we eliminate 97% of false alarms on inputs from other Latin-alphabet languages. In another experiment, representing scenarios in which genre-specific training is infeasible, we process real financial-transactions text containing mixed languages and data-set codes. On these data, because we do not train on data like it, we achieve a smaller but significant improvement. These gains come with virtually no loss in accuracy on clean English text.

3 0.19864999 8 emnlp-2010-A Multi-Pass Sieve for Coreference Resolution

Author: Karthik Raghunathan ; Heeyoung Lee ; Sudarshan Rangarajan ; Nate Chambers ; Mihai Surdeanu ; Dan Jurafsky ; Christopher Manning

Abstract: Most coreference resolution models determine if two mentions are coreferent using a single function over a set of constraints or features. This approach can lead to incorrect decisions as lower precision features often overwhelm the smaller number of high precision ones. To overcome this problem, we propose a simple coreference architecture based on a sieve that applies tiers of deterministic coreference models one at a time from highest to lowest precision. Each tier builds on the previous tier’s entity cluster output. Further, our model propagates global information by sharing attributes (e.g., gender and number) across mentions in the same cluster. This cautious sieve guarantees that stronger features are given precedence over weaker ones and that each decision is made using all of the information available at the time. The framework is highly modular: new coreference modules can be plugged in without any change to the other modules. In spite of its simplicity, our approach outperforms many state-of-the-art supervised and unsupervised models on several standard corpora. This suggests that sievebased approaches could be applied to other NLP tasks.

4 0.14287946 9 emnlp-2010-A New Approach to Lexical Disambiguation of Arabic Text

Author: Rushin Shah ; Paramveer S. Dhillon ; Mark Liberman ; Dean Foster ; Mohamed Maamouri ; Lyle Ungar

Abstract: We describe a model for the lexical analysis of Arabic text, using the lists of alternatives supplied by a broad-coverage morphological analyzer, SAMA, which include stable lemma IDs that correspond to combinations of broad word sense categories and POS tags. We break down each of the hundreds of thousands of possible lexical labels into its constituent elements, including lemma ID and part-of-speech. Features are computed for each lexical token based on its local and document-level context and used in a novel, simple, and highly efficient two-stage supervised machine learning algorithm that overcomes the extreme sparsity of label distribution in the training data. The resulting system achieves accuracy of 90.6% for its first choice, and 96.2% for its top two choices, in selecting among the alternatives provided by the SAMA lexical analyzer. We have successfully used this system in applications such as an online reading helper for intermediate learners of the Arabic language, and a tool for improving the productivity of Arabic Treebank annotators.

5 0.09414643 28 emnlp-2010-Collective Cross-Document Relation Extraction Without Labelled Data

Author: Limin Yao ; Sebastian Riedel ; Andrew McCallum

Abstract: We present a novel approach to relation extraction that integrates information across documents, performs global inference and requires no labelled text. In particular, we tackle relation extraction and entity identification jointly. We use distant supervision to train a factor graph model for relation extraction based on an existing knowledge base (Freebase, derived in parts from Wikipedia). For inference we run an efficient Gibbs sampler that leads to linear time joint inference. We evaluate our approach both for an in-domain (Wikipedia) and a more realistic out-of-domain (New York Times Corpus) setting. For the in-domain setting, our joint model leads to 4% higher precision than an isolated local approach, but has no advantage over a pipeline. For the out-of-domain data, we benefit strongly from joint modelling, and observe improvements in precision of 13% over the pipeline, and 15% over the isolated baseline.
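The distant-supervision step in this abstract can be pictured with a small sketch. Everything below is illustrative: the toy KB dict stands in for Freebase and SENTENCES for preprocessed text, and the paper's actual factor-graph model and Gibbs sampler are not shown.

# Minimal sketch of distant supervision: an entity pair that a knowledge
# base relates turns every sentence mentioning that pair into a (noisy)
# positive training example. KB and SENTENCES are invented toy data.

KB = {("Barack Obama", "Honolulu"): "born_in"}   # stand-in for Freebase

SENTENCES = [
    ("Barack Obama", "Honolulu", "Barack Obama was born in Honolulu ."),
    ("Barack Obama", "Chicago",  "Barack Obama spoke in Chicago ."),
]

def weak_labels(sentences, kb):
    """Attach a relation label to each entity-pair mention via KB lookup."""
    return [(text, e1, e2, kb.get((e1, e2), "NO_RELATION"))
            for e1, e2, text in sentences]

for example in weak_labels(SENTENCES, KB):
    print(example)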

6 0.086332381 10 emnlp-2010-A Probabilistic Morphological Analyzer for Syriac

7 0.078349538 29 emnlp-2010-Combining Unsupervised and Supervised Alignments for MT: An Empirical Study

8 0.07143978 20 emnlp-2010-Automatic Detection and Classification of Social Events

9 0.067931943 16 emnlp-2010-An Approach of Generating Personalized Views from Normalized Electronic Dictionaries : A Practical Experiment on Arabic Language

10 0.067692436 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning

11 0.061870381 98 emnlp-2010-Soft Syntactic Constraints for Hierarchical Phrase-Based Translation Using Latent Syntactic Distributions

12 0.056157216 92 emnlp-2010-Predicting the Semantic Compositionality of Prefix Verbs

13 0.055932276 50 emnlp-2010-Facilitating Translation Using Source Language Paraphrase Lattices

14 0.05480472 57 emnlp-2010-Hierarchical Phrase-Based Translation Grammars Extracted from Alignment Posterior Probabilities

15 0.052641839 37 emnlp-2010-Domain Adaptation of Rule-Based Annotators for Named-Entity Recognition Tasks

16 0.050387174 5 emnlp-2010-A Hybrid Morpheme-Word Representation for Machine Translation of Morphologically Rich Languages

17 0.049461443 108 emnlp-2010-Training Continuous Space Language Models: Some Practical Issues

18 0.048671175 86 emnlp-2010-Non-Isomorphic Forest Pair Translation

19 0.047811534 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation

20 0.043995507 119 emnlp-2010-We're Not in Kansas Anymore: Detecting Domain Changes in Streams


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.202), (1, 0.052), (2, -0.024), (3, 0.154), (4, -0.206), (5, -0.386), (6, -0.009), (7, 0.032), (8, -0.055), (9, -0.399), (10, 0.148), (11, 0.155), (12, 0.018), (13, 0.158), (14, -0.073), (15, -0.026), (16, -0.108), (17, -0.089), (18, 0.069), (19, 0.019), (20, -0.11), (21, 0.105), (22, -0.017), (23, -0.091), (24, 0.072), (25, 0.011), (26, -0.09), (27, -0.029), (28, 0.003), (29, -0.004), (30, -0.064), (31, 0.086), (32, -0.092), (33, 0.04), (34, -0.009), (35, -0.021), (36, -0.043), (37, -0.057), (38, 0.025), (39, -0.07), (40, 0.047), (41, 0.025), (42, 0.014), (43, 0.056), (44, 0.004), (45, 0.037), (46, -0.022), (47, -0.013), (48, 0.026), (49, -0.009)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96541846 44 emnlp-2010-Enhancing Mention Detection Using Projection via Aligned Corpora

Author: Yassine Benajiba ; Imed Zitouni

Abstract: The research question treated in this paper is centered on the idea of exploiting rich resources of one language to enhance the performance of a mention detection system of another one. We successfully achieve this goal by projecting information from one language to another via a parallel corpus. We examine the potential improvement using various degrees of linguistic information in a statistical framework and we show that the proposed technique is effective even when the target language model has access to a significantly rich feature set. Experimental results show up to 2.4F improvement in performance when the system has access to information obtained by projecting mentions from a resource-richlanguage mention detection system via a parallel corpus.
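As a rough illustration of the projection idea in this abstract, the sketch below copies mention tags across word-alignment links from a tagged source sentence to the target side of a parallel sentence. The B-/I- tag scheme and the alignment format are assumptions for the example, not the paper's exact representation.

# Hedged sketch: project mention tags from a resource-rich source
# language to the target side of a parallel sentence via alignment links.

def project_mentions(src_tags, alignment, tgt_len):
    """Copy non-O tags across (src_index, tgt_index) alignment links."""
    tgt_tags = ["O"] * tgt_len
    for s, t in alignment:
        if src_tags[s] != "O":
            tgt_tags[t] = src_tags[s]
    return tgt_tags

# Source "Mayor Michael Bloomberg" with a PER mention; target order differs.
src_tags = ["O", "B-PER", "I-PER"]
alignment = [(0, 2), (1, 0), (2, 1)]
print(project_mentions(src_tags, alignment, 3))
# -> ['B-PER', 'I-PER', 'O']; the projected tags can then be supplied as
# features to the target-language mention detection model.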

2 0.88148361 62 emnlp-2010-Improving Mention Detection Robustness to Noisy Input

Author: Radu Florian ; John Pitrelli ; Salim Roukos ; Imed Zitouni

Abstract: Information-extraction (IE) research typically focuses on clean-text inputs. However, an IE engine serving real applications yields many false alarms due to less-well-formed input. For example, IE in a multilingual broadcast processing system has to deal with inaccurate automatic transcription and translation. The resulting presence of non-target-language text in this case, and non-language material interspersed in data from other applications, raise the research problem of making IE robust to such noisy input text. We address one such IE task: entity-mention detection. We describe augmenting a statistical mention-detection system in order to reduce false alarms from spurious passages. The diverse nature of input noise leads us to pursue a multi-faceted approach to robustness. For our English-language system, at various miss rates we eliminate 97% of false alarms on inputs from other Latin-alphabet languages. In another experiment, representing scenarios in which genre-specific training is infeasible, we process real financial-transactions text containing mixed languages and data-set codes. On these data, because we do not train on data like it, we achieve a smaller but significant improvement. These gains come with virtually no loss in accuracy on clean English text.
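One simple way to picture filtering non-target-language input, purely as an assumption-laden illustration (the paper's detector is a trained statistical model with much richer features), is a stop-word-ratio test:

# Illustrative only: a crude stop-word-ratio check for flagging passages
# that are unlikely to be English before running information extraction.

ENGLISH_STOPWORDS = {"the", "and", "of", "to", "in", "is", "that", "a"}

def stopword_ratio(tokens):
    """Fraction of tokens that are very common English words."""
    return sum(t.lower() in ENGLISH_STOPWORDS for t in tokens) / max(len(tokens), 1)

def looks_like_english(text, threshold=0.1):
    return stopword_ratio(text.split()) >= threshold

print(looks_like_english("The mayor of the city said that taxes will rise ."))  # True
print(looks_like_english("El alcalde de la ciudad dijo que ..."))               # False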

3 0.66133559 8 emnlp-2010-A Multi-Pass Sieve for Coreference Resolution

Author: Karthik Raghunathan ; Heeyoung Lee ; Sudarshan Rangarajan ; Nate Chambers ; Mihai Surdeanu ; Dan Jurafsky ; Christopher Manning

Abstract: Most coreference resolution models determine if two mentions are coreferent using a single function over a set of constraints or features. This approach can lead to incorrect decisions as lower precision features often overwhelm the smaller number of high precision ones. To overcome this problem, we propose a simple coreference architecture based on a sieve that applies tiers of deterministic coreference models one at a time from highest to lowest precision. Each tier builds on the previous tier’s entity cluster output. Further, our model propagates global information by sharing attributes (e.g., gender and number) across mentions in the same cluster. This cautious sieve guarantees that stronger features are given precedence over weaker ones and that each decision is made using all of the information available at the time. The framework is highly modular: new coreference modules can be plugged in without any change to the other modules. In spite of its simplicity, our approach outperforms many state-of-the-art supervised and unsupervised models on several standard corpora. This suggests that sievebased approaches could be applied to other NLP tasks.
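A minimal sketch of the sieve architecture described above: deterministic passes ordered from highest to lowest precision, each merging mentions into the clusters produced by earlier passes. The two pass functions here are simplified stand-ins, not the paper's actual tiers.

def exact_match_pass(mention, cluster):
    """High precision: link mentions with identical surface strings."""
    return any(m["text"] == mention["text"] for m in cluster)

def head_match_pass(mention, cluster):
    """Lower precision: link mentions sharing a head word."""
    return any(m["head"] == mention["head"] for m in cluster)

def sieve(mentions, passes):
    clusters = [[m] for m in mentions]          # start as singletons
    for tier in passes:                         # highest precision first
        for m in mentions:
            src = next(c for c in clusters if m in c)
            for dst in clusters:
                if dst is not src and tier(m, dst):
                    dst.extend(src)             # merged cluster now shares
                    clusters.remove(src)        # attributes (gender, number, ...)
                    break
    return clusters

mentions = [{"text": "Michael Bloomberg", "head": "Bloomberg"},
            {"text": "Bloomberg", "head": "Bloomberg"},
            {"text": "the city", "head": "city"}]
print(sieve(mentions, [exact_match_pass, head_match_pass]))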

4 0.44052997 9 emnlp-2010-A New Approach to Lexical Disambiguation of Arabic Text

Author: Rushin Shah ; Paramveer S. Dhillon ; Mark Liberman ; Dean Foster ; Mohamed Maamouri ; Lyle Ungar

Abstract: We describe a model for the lexical analysis of Arabic text, using the lists of alternatives supplied by a broad-coverage morphological analyzer, SAMA, which include stable lemma IDs that correspond to combinations of broad word sense categories and POS tags. We break down each of the hundreds of thousands of possible lexical labels into its constituent elements, including lemma ID and part-of-speech. Features are computed for each lexical token based on its local and document-level context and used in a novel, simple, and highly efficient two-stage supervised machine learning algorithm that overcomes the extreme sparsity of label distribution in the training data. The resulting system achieves accuracy of 90.6% for its first choice, and 96.2% for its top two choices, in selecting among the alternatives provided by the SAMA lexical analyzer. We have successfully used this system in applications such as an online reading helper for intermediate learners of the Arabic language, and a tool for improving the productivity of Arabic Treebank annotators.

1 Background and Motivation

This paper presents a methodology for generating high quality lexical analysis of highly inflected languages, and demonstrates excellent performance applying our approach to Arabic. Lexical analysis of the written form of a language involves resolving, explicitly or implicitly, several different kinds of ambiguities. Unfortunately, the usual ways of talking about this process are also ambiguous, and our general approach to the problem, though not unprecedented, has uncommon aspects. Therefore, in order to avoid confusion, we begin by describing how we define the problem.

In an inflected language with an alphabetic writing system, a central issue is how to interpret strings of characters as forms of words. For example, the English letter-string 'winds' will normally be interpreted in one of four different ways, all four of which involve the sequence of two formatives wind+s. The stem 'wind' might be analyzed as (1) a noun meaning something like "air in motion", pronounced [wInd], which we can associate with an arbitrary but stable identifier like wind n1; (2) a verb wind v1 derived from that noun, and pronounced the same way; (3) a verb wind v2 meaning something like "(cause to) twist", pronounced [waInd]; or (4) a noun wind n2 derived from that verb, and pronounced the same way. Each of these "lemmas", or dictionary entries, will have several distinguishable senses, which we may also wish to associate with stable identifiers. The affix '-s' might be analyzed as the plural inflection, if the stem is a noun; or as the third-person singular inflection, if the stem is a verb.

We see this analysis as conceptually divided into four parts: 1) Morphological analysis, which recognizes that the letter-string 'winds' might be (perhaps among other things) wind/N+s/PLURAL or wind/V+s/3SING; 2) Morphological disambiguation, which involves deciding, for example, that in the phrase "the four winds", 'winds' is probably a plural noun, i.e.
wind/N+s/PLURAL; 3) Lemma analysis, which involves recognizing that the stem wind in 'winds' might be any of the four lemmas listed above, perhaps with a further listing of senses or other sub-entries for each of them; and 4) Lemma disambiguation, deciding, for example, that the phrase "the four winds" probably involves the lemma wind n1.

Confusingly, the standard word-analysis tasks in computational linguistics involve various combinations of pieces of these logically-distinguished operations. Thus, "part of speech (POS) tagging" is mainly what we've called "morphological disambiguation", except that it doesn't necessarily require identifying the specific stems and affixes involved. In some cases, it also may require a small amount of "lemma disambiguation", for example to distinguish a proper noun from a common noun. "Sense disambiguation" is basically a form of what we've called "lemma disambiguation", except that the sense disambiguation task may assume that the part of speech is known, and may break down lexical identity more finely than our system happens to do. "Lemmatization" generally refers to a radically simplified form of "lemma analysis" and "lemma disambiguation", where the goal is simply to collapse different inflected forms of any similarly-spelled stems, so that the strings 'wind', 'winds', 'winded', 'winding' will all be treated as instances of the same thing, without in fact making any attempt to determine the identity of "lemmas" in the traditional sense of dictionary entries.

Linguists use the term morphology to include all aspects of lexical analysis under discussion here. But in most computational applications, "morphological analysis" does not include the disambiguation of lemmas, because most morphological analyzers do not reference a set of stable lemma IDs. So for the purposes of this paper, we will continue to discuss lemma analysis and disambiguation as conceptually distinct from morphological analysis and disambiguation, although, in fact, our system disambiguates both of these aspects of lexical analysis at the same time.

The lexical analysis of textual character-strings is a more complex and consequential problem in Arabic than it is in English, for several reasons. First, Arabic inflectional morphology is more complex than English inflectional morphology is. Where an English verb has five basic forms, for example, an Arabic verb in principle may have dozens. Second, the Arabic orthographic system writes elements such as prepositions, articles, and possessive pronouns without setting them off by spaces, roughly as if the English phrase "in a way" were written "inaway". This leads to an enormous increase in the number of distinct "orthographic words", and a substantial increase in ambiguity. Third, short vowels are normally omitted in Arabic text, roughly as if English "in a way" were written "nway". As a result, a whitespace/punctuation-delimited letter-string in Arabic text typically has many more alternative analyses than a comparable English letter-string does, and these analyses have many more parts, drawn from a much larger vocabulary of form-classes. While an English "tagger" can specify the morphosyntactic status of a word by choosing from a few dozen tags, an equivalent level of detail in Arabic would require thousands of alternatives.
Similarly, the number of lemmas that might play a role in a given letter-sequence is generally much larger in Arabic than in English.

We start our labeling of Arabic text with the alternative analyses provided by SAMA v. 3.1, the Standard Arabic Morphological Analyzer (Maamouri et al., 2009). SAMA is an updated version of the earlier Buckwalter analyzers (Buckwalter, 2004), with a number of significant differences in analysis to make it compatible with the LDC Arabic Treebank 3-v3.2 (Maamouri et al., 2004). The input to SAMA is an Arabic orthographic word (a string of letters delimited by whitespace or punctuation), and the output of SAMA is a set of alternative analyses, as shown in Table 1. For a typical word, SAMA produces approximately a dozen alternative analyses, but for certain highly ambiguous words it can produce hundreds of alternatives.

The SAMA analyzer has good coverage; for typical texts, the correct analysis of an orthographic word can be found somewhere in SAMA's list of alternatives about 95% of the time. However, this broad coverage comes at a cost; the list of analytic alternatives must include a long Zipfian tail of rare or contextually-implausible analyses, which collectively are correct often enough to make a large contribution to the coverage statistics. Furthermore, SAMA's long lists of alternative analyses are not evaluated or ordered in terms of overall or contextual plausibility. This makes the results less useful in most practical applications. Our goal is to rank these alternative analyses so that the correct answer is as near to the top of the list as possible.

Despite some risk of confusion, we'll refer to SAMA's list of alternative analyses for an orthographic word as potential labels for that word. And despite a greater risk of confusion, we'll refer to the assignment of probabilities to the set of SAMA labels for a particular Arabic word in a particular textual context as tagging, by analogy to the operation of a stochastic part-of-speech tagger, which similarly assigns probabilities to the set of labels available for a word in textual context. Although our algorithms have been developed for the particular case of Arabic and the particular set of lexical-analysis labels produced by SAMA, they should be applicable without modification to the sets of labels produced by any broad-coverage lexical analyzer for the orthographic words of any highly inflected language.

In choosing our approach, we have been motivated by two specific applications. One application aims to help learners of Arabic in reading text, by offering a choice of English glosses with associated Arabic morphological analyses and vocalizations. SAMA's excellent coverage is an important basis for this help; but SAMA's long, unranked list of alternative analyses for a particular letter-string, where many analyses may involve rare words or alternatives that are completely implausible in the context, will be confusing at best for a learner. It is much more helpful for the list to be ranked so that the correct answer is almost always near the top, and is usually one of the top two or three alternatives. In our second application, this same sort of ranking is also helpful for the linguistically expert native speakers who do Arabic Treebank analysis. These annotators understand the text without difficulty, but find it time-consuming and fatiguing to scan a long list of rare or contextually-implausible alternatives for the correct SAMA output. Their work is faster and more accurate if they start with a list that is ranked accurately in order of contextual plausibility.
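To make the "potential labels" framing concrete, here is an assumed data shape for SAMA-style output, using the two qbl analyses discussed later in the text; any real interface to the analyzer will differ, and the toy scorer is purely a placeholder.

# Assumed data shape only: each orthographic word maps to a list of SAMA
# candidate analyses, and "tagging" means scoring that list so the correct
# analysis lands near the top for learners and annotators.

candidates = {
    "qbl": [
        {"lemma": "qabil-a 1", "pos": "verb", "gloss": "accept/receive/approve + he/it"},
        {"lemma": "qabol 1",   "pos": "noun", "gloss": "before"},
    ],
}

def rank_candidates(word, score_fn):
    """Sort a word's candidate labels by a model score, best first."""
    return sorted(candidates[word], key=score_fn, reverse=True)

# A stand-in scorer; the paper's actual scorer is the two-stage model below.
toy_score = lambda analysis: 1.0 if analysis["pos"] == "noun" else 0.5
print(rank_candidates("qbl", toy_score))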
Other applications are also possible, such as vocalization of Arabic text for text-to-speech synthesis, or lexical analysis for Arabic parsing. However, our initial goals have been to rank the list of SAMA outputs for human users.

We note in passing that the existence of a set of stable "lemma IDs" is an unusual feature of SAMA, which in our opinion ought to be emulated by approaches to lexical analysis in other languages. The lack of such stable lemma IDs has helped to disguise the fact that without lemma analysis and disambiguation, morphological analysis and disambiguation is only a partial solution to the problem of lexical analysis.

In principle, it is obvious that lemma disambiguation and morphological disambiguation are mutually beneficial. If we know the answer to one of the questions, the other one is easier to answer. However, these two tasks require rather different sets of contextual features. Lemma disambiguation is similar to the problem of word-sense disambiguation (on some definitions, they are identical), and as a result, it benefits from paragraph-level and document-level bag-of-words attributes that help to characterize what the text is "about" and therefore which lemmas are more likely to play a role in it. In contrast, morphological disambiguation mainly depends on features of nearby words, which help to characterize how inflected forms of these lemmas might fit into local phrasal structures.

2 Problem and Methodology

Consider a collection of tokens (observations), t_i, referred to by index i ∈ {1, ..., n}, where each token is associated with a set of p features, x_ij for the jth feature, and a label, l_i, which is a combination of a lemma and a morphological analysis. We use indicator functions y_ik to indicate whether or not the kth label for the ith token is present. We represent the complete set of features and labels for the entire training data using matrix notation as X and Y, respectively. Our goal is to predict the label l (or equivalently, the vector y) for a given feature vector x. A standard linear regression model of this problem would be

    y = x\beta + \epsilon    (1)

The standard linear regression estimate of \beta (ignoring, for simplicity, the fact that the ys are 0/1) is:

    \hat{\beta} = (X_{train}^T X_{train})^{-1} X_{train}^T Y_{train}    (2)

where Y_train is an n × h matrix containing 0s and 1s indicating whether or not each of the h possible labels is the correct label (l_i) for each of the n tokens t_i, X_train is an n × p matrix of context features for each of the n tokens, and the matrix of coefficients \beta is p × h. However, this is a large, sparse, multiple-label problem, and the above formulation is neither statistically nor computationally efficient. Each observation (x, y) consists of thousands of features associated with thousands of potential labels, almost all of which are zero. Worse, the matrix of coefficients \beta to be estimated is large (p × h), and one should thus use some sort of transfer learning to share strength across the different labels. We present a novel, principled, and highly computationally efficient method of estimating this multilabel model.
We use a two-stage procedure, first using a subset (X_{train1}, Y_{train1}) of training data to give a fast approximate estimate of \beta; we then use a second smaller subset of the training data (X_{train2}, Y_{train2}) to "correct" these estimates \hat{\beta} in a way that we will show can be viewed as a specialized shrinkage. Our first-stage estimation approximates \beta, but avoids the expensive computation of (X_{train}^T X_{train})^{-1}. Our second stage corrects (shrinks) these initial estimates in a manner specialized to this problem. The second stage takes advantage of the fact that we only need to consider those candidate labels produced by SAMA. Thus, only dozens of the thousands of possible labels are considered for each token.

We now present our algorithm. We start with a corpus D of documents d of labeled Arabic text. As described above, each token t_i is associated with a set of features characterizing its context, computed from the other words in the same document, and a label, l_i = (lemma_i, morphology_i), which is a combination of a lemma and a morphological analysis. As described below, we introduce a novel factorization of the morphology into 15 different components.

Our estimation algorithm, shown in Algorithm 1, has two stages. We partition the training corpus into two subsets, one of which (X_{train1}) is used to estimate the coefficients \beta and the other of which (X_{train2}) is used to optimally "shrink" these coefficient estimates to reduce variance and prevent overfitting due to data sparsity.

For the first stage of our estimation procedure, we simplify the estimate of the \beta matrix (Equation 2) to avoid the inversion of the very high dimensional (p × p) matrix X^T X by approximating X^T X by its diagonal, Var(X), the inverse of which is trivial to compute; i.e., we estimate \beta using

    \hat{\beta} = Var(X_{train1})^{-1} X_{train1}^T Y_{train1}    (3)

For the second stage, we assume that the coefficients for each feature can be shrunk differently, but that coefficients for each feature should be shrunk the same regardless of what label they are predicting. Thus, for a given observation we predict:

    \hat{g}_{ik} = \sum_{j=1}^{p} w_j \hat{\beta}_{jk} x_{ij}    (4)

where the weights w_j indicate how much to shrink each of the p features. In practice, we fold the variance of each of the j features into the weight, giving a slightly modified equation:

    \hat{g}_{ik} = \sum_{j=1}^{p} \alpha_j \beta^*_{jk} x_{ij}    (5)

where \beta^* = X_{train1}^T Y_{train1} is just a matrix of the counts of how often each context feature shows up with each label in the first training set. The vector \alpha, which we will estimate by regression, is just the shrinkage weights w rescaled by the feature variance.

Note that the formulation here is different from the first stage. Instead of having each observation be a token, we now let each observation be a (token, label) pair, but only include those labels that were output by SAMA. For a given token t_i and potential label l_k, our goal is to approximate the indicator function g(i, k), which is 1 if the kth label of token t_i is present, and 0 otherwise. We find candidate labels using a morphological analyzer (namely SAMA), which returns a set of possible candidate labels, say C(t), for each Arabic token t. Our predicted label for t_i is then argmax_{k ∈ C(t_i)} g(i, k). The regression model for learning the weights \alpha_j in the second stage thus has a row for each label g(i, k) associated with a SAMA candidate for each token i = n_{train1}+1, ..., n_{train2} in the second training set. The value of g(i, k) is predicted as a function of the feature vector z_{ijk} = \beta^*_{jk} x_{ij}.
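A compact NumPy rendering of Equations (3) and (5), under the simplifying assumption of small dense arrays (the real feature and label spaces are huge and sparse, so an actual implementation would use sparse counts):

# Sketch of the two-stage estimate. X1 (n1 x p) and Y1 (n1 x h) come from
# the first training subset; alpha (length p) is fit later on D2.
import numpy as np

def stage_one(X1, Y1, eps=1e-8):
    """Eq. (3): beta_hat = Var(X1)^{-1} X1^T Y1; no (X^T X) inversion."""
    beta_star = X1.T @ Y1                  # p x h co-occurrence counts, beta*
    var = X1.var(axis=0) + eps             # diagonal approximation of X^T X
    beta_hat = beta_star / var[:, None]    # divide row j by Var(x_j)
    return beta_hat, beta_star

def stage_two_scores(alpha, beta_star, x):
    """Eq. (5): g_hat_k = sum_j alpha_j * beta*_jk * x_j, for all labels k."""
    return (alpha * x) @ beta_star         # length-h vector of label scores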
The shrinkage coefficients \alpha_j could be estimated from theory, using a version of James-Stein shrinkage (James and Stein, 1961), but in practice, superior results are obtained by estimating them empirically. Since there are only p of them (unlike the p × h \betas), a relatively small training set is sufficient. We found that regression-SVMs work slightly better than linear regression and significantly better than standard classification SVMs for this problem.

Prediction is then done in the obvious way by taking the tokens in a test corpus D_test, generating context features and candidate SAMA labels for each token t_i, and selecting the candidate label with the highest score \hat{g}(i, k) that we set out to learn. More formally, the model parameters \beta^* and \alpha produced by the algorithm allow one to estimate the most likely label for a new token t_i out of a set of candidate labels C(t_i) using

    k_{pred} = argmax_{k ∈ C(t_i)} \sum_{j=1}^{p} \alpha_j \beta^*_{jk} x_{ij}    (6)

Algorithm 1 Training algorithm.
Input: A training corpus D_train of n observations (X_train, Y_train)
Partition D_train into two sets, D1 and D2, of sizes n_train1 and n_train2 = n − n_train1 observations
// Using D1, estimate \beta^*
\beta^*_{jk} = \sum_{i=1}^{n_train1} x_{ij} y_{ik} for the jth feature and kth label
// Using D2, estimate \alpha_j
// Generate new "features" Z and the true labels g(i, k) for each of the SAMA candidate labels for each of the tokens in D2
z_{ijk} = \beta^*_{jk} x_{ij} for i = n_train1 + 1, ..., n_train2
Estimate \alpha_j for the above (feature, label) pairs (z_{ijk}, g(i, k)) using regression SVMs
Output: \alpha and \beta^*

The most expensive part of the procedure is estimating \beta^*, which requires, for each token in corpus D1 (a subset of D), finding the co-occurrence frequencies of each label element (a lemma, or a part of the morphological segmentation) with the target token, and jointly with the token and with other tokens or characters in the context of the token of interest. For example, given an Arabic token "yHlm", we count what fraction of the time it is associated with each lemma (e.g. Halam-u 1), count(lemma=Halam-u 1, token=yHlm), and each segment (e.g. "ya"), count(segment=ya, token=yHlm). (Of course, most tokens never show up with most lemmas or segments; this is not a problem.) We also find the base rates of the components of the labels (e.g., count(lemma=Halam-u 1)), and what fraction of the time the label shows up in various contexts, e.g. count(lemma=Halam-u 1, previous token = yHlm). We describe these features in more detail below.
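Rounding out the sketch of this method: prediction (Eq. 6) only scores the SAMA candidates for each token, and the shrinkage weights alpha would be fit by a regression SVM on D2. The scikit-learn SVR shown in the comment is an assumed stand-in for the paper's regression-SVM, not a confirmed detail of their implementation.

# Sketch of Eq. (6): restrict the argmax to the candidate set C(t).
import numpy as np

def predict_label(x, candidate_ids, alpha, beta_star):
    """Return the candidate label index with the highest shrunk score."""
    scores = (alpha * x) @ beta_star[:, candidate_ids]
    return candidate_ids[int(np.argmax(scores))]

# Fitting alpha (assumed wiring): rows of Z are the p-vectors with entries
# z_ijk = beta*_jk * x_ij for each (token, candidate) pair in D2, and g
# holds the 0/1 gold indicators g(i, k).
# from sklearn.svm import SVR
# alpha = SVR(kernel="linear").fit(Z, g).coef_.ravel()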
3 Features and Labels used for Training

Our approach to tagging Arabic differs from conventional approaches in the two-part shrinkage-based method used, and in the choice of both features and labels used in our model. For features, we study both local context variables, as described above, and document-level word frequencies. For the labels, the key question is what labels are included and how they are factored. Standard "taggers" work by doing an n-way classification of all the alternatives, which is not feasible here due to the thousands of possible labels. Standard approaches such as Conditional Random Fields (CRFs) are intractable with so many labels. Moreover, few if any taggers do any lemma disambiguation; that is partly because one must start with some standard inventory of lemmas, which are not available for most languages, perhaps because the importance of lemma disambiguation has been underestimated.

We make a couple of innovations to deal with these issues. First, we perform lemma disambiguation in addition to "tagging". As mentioned above, lemmas and morphological information are not independent; the choice of lemma often influences morphology and vice versa. For example, Table 1 contains two analyses for the word qbl. For the first analysis, where the lemma is qabil-a 1 and the gloss is accept/receive/approve + he/it [verb], the word is a verb. However, for the second analysis, where the lemma is qabol 1 and the gloss is before, the word is a noun.

Simultaneous lemma disambiguation and tagging introduces additional complexity: an analysis of ATB and SAMA shows that there are approximately 2,200 possible morphological analyses ("tags") and 40,000 possible lemmas; even accounting for the fact that most combinations of lemmas and morphological analyses don't occur, the size of the label space is still in the order of tens of thousands. To deal with data sparsity, our second innovation is to factor the labels. We factor each label l into a set of 16 label elements (LEs). These include lemmas, as well as morphological elements such as basic part-of-speech, suffix, gender, number, mood, etc. These are explained in detail below. Thus, since each label l is a set of 16 categorical variables, each y in the first learning stage is actually a vector with 16 nonzero components and thousands of zeros. Since we do simultaneous estimation of the entire set of label elements, the value g(i, k) being predicted in the second learning phase is 1 if the entire label set is correct, and zero otherwise. We do not learn separate models for each label.

3.1 Label Elements (LEs)

The fact that there are tens of thousands of possible labels presents the problem of extreme sparsity of label distribution in the training data. We find that a model that estimates coefficients \beta^* to predict a single label (a label being in the Cartesian product of the set of label elements) yields poor performance. Therefore, as just mentioned, we factor each label l into a set of label elements (LEs), and learn the correlations \beta^* between features and label elements, rather than features and entire label sets. This reduces, but does not come close to eliminating, the sparsity problem. A complete list of these LEs and their possible values is detailed in Table 2. (Table 2 also notes that the data on basic POS include whether a noun is proper or common, whether a verb is transitive or not, etc., and that both the basic POS and its suffix may have person, gender and number data.)

3.2 Features

3.2.1 Local Context Features

We take (t, l) pairs from D2, and for each such pair generate features Z based on co-occurrence statistics \beta^* in D1, as mentioned in Algorithm 2. These statistics include unigram co-occurrence frequencies of each label with the target token and bigram co-occurrence of the label with the token and with other tokens or characters in the context of the target token. We define them formally in Table 3 (the statistics used to generate feature sets for our regression SVMs). Let Zbaseline denote the set of all such basic features based on the local context statistics of the target token, namely the words and letters preceding and following it. We will use this set to create a baseline model. For each label element (LE) e, we define a set of features Ze similar to Zbaseline; these features are based on co-occurrence frequencies of the particular LE e, not the entire label l. Finally, we define an aggregate feature set Zaggr as follows:

    Zaggr = Zbaseline ∪ {Ze}    (7)

where e ∈ {lemma, pre1, pre2, det, pos, dpos, suf, perpos, numpos, genpos, persuf, numsuf, gensuf, mood, pron}.
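Pulling the factoring idea above into code, a label can be pictured as one huge categorical value split into its LEs; the morphology values below are invented placeholders for illustration.

# Illustrative factoring of a (lemma, morphology) label into label elements.
def factor_label(lemma, morphology):
    """Return the LE dictionary whose values index the nonzero y entries."""
    les = {"lemma": lemma}
    les.update(morphology)                 # pos, det, suf, mood, pron, ...
    return les

label_les = factor_label("qabol 1", {"pos": "noun", "det": "no_det",
                                     "suf": "none", "mood": "na"})
print(label_les)
# Statistics are now shared across every full label that contains a given
# LE value, which is what reduces the sparsity of the label distribution.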
3.2.2 Document Level Features

When trying to predict the lemma, it is useful to include not just the words and characters immediately adjacent to the target token, but also all the words in the document. These words capture the "topic" of the document, and help to disambiguate different lemmas, which tend to be used or not used based on the topic being discussed, similarly to the way that word sense disambiguation systems in English sometimes use the "bag of words" in the document to disambiguate, for example, a "bank" for depositing money from a "bank" of a river. More precisely, we augment the features for each target token with the counts of each word in the document (the "term frequency" tf) in which the token occurs with a given label.

    Zfull = Zaggr ∪ Ztf    (8)

This set Zfull is our final feature set. We use Zfull to train an SVM model Mfull; this is our final predictive model.

3.3 Corpora used for Training and Testing

We use three modules of the Penn Arabic Treebank (ATB) (Maamouri et al., 2004), namely ATB1, ATB2 and ATB3, as our corpus of labeled Arabic text, D. Each ATB module is a collection of newswire data from a particular agency. ATB1 uses the Associated Press as a source, ATB2 uses Ummah, and ATB3 uses Annahar. D contains a total of 1,835 documents, accounting for approximately 350,000 words. We construct the training and testing sets Dtrain and Dtest from D using 10-fold cross validation, and we construct D1 and D2 from Dtrain by randomly performing a 9:1 split.

As mentioned earlier, we use the SAMA morphological analyzer to obtain candidate labels C(t) for each token t while training and testing an SVM model on D2 and Dtest respectively. A sample output of SAMA is shown in Table 1. To improve coverage, we also add to C(t) all the labels l seen for t in D1. We find that doing so improves coverage to 98%. This is an upper bound on the accuracy of our model.

    C(t) = SAMA(t) ∪ {l | (t, l) ∈ D1}    (9)

4 Results

We use two metrics of accuracy: A1, which measures the percentage of tokens for which the model assigns the highest score to the correct label or LE value (or E1 = 100 − A1, the corresponding percentage error), and A2, which measures the percentage of tokens for which the correct label or LE value is one of the two highest ranked choices returned by the model (or E2 = 100 − A2). We test our model Mfull on Dtest and achieve A1 and A2 scores of 90.6% and 96.2% respectively. The accuracy achieved by our Mfull model is, to the best of our knowledge, higher than prior approaches have been able to achieve so far for the problem of combined morphological and lemma disambiguation. This is all the more impressive considering that the upper bound on accuracy for our model is 98% because, as described above, our set of candidate labels is incomplete.

In order to analyze how well different LEs can be predicted, we train an SVM model Me for each LE e using the feature set Ze, and test all such models on Dtest. The results for all the LEs are reported in the form of error percentages E1 and E2 in Table 4. (The results reported are 10-fold cross-validation test accuracies and no parameters have been tuned on them.)
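The two metrics reduce to a few lines; `ranked` is assumed to map each test token to its candidate labels sorted best-first, and the label strings are invented for the example.

# Sketch of the A1/A2 accuracies (E1 = 100 - A1, E2 = 100 - A2).
def a1_a2(ranked, gold):
    n = len(gold)
    a1 = 100.0 * sum(r[0] == g for r, g in zip(ranked, gold)) / n
    a2 = 100.0 * sum(g in r[:2] for r, g in zip(ranked, gold)) / n
    return a1, a2

ranked = [["label_7", "label_3"], ["label_1", "label_9"]]
gold = ["label_7", "label_9"]
print(a1_a2(ranked, gold))   # (50.0, 100.0)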
A comparison of the results for Mfull with the results for Mlemma and Mpos is particularly informative. We see that Mfull is able to achieve a substantially lower E1 error score (9.4%) than Mlemma (11.1%) and Mpos (23.4%); in other words, we find that our full model is able to predict lemmas and basic parts-of-speech more accurately than the individual models for each of these elements. We examine the effect of varying the size of D2, i.e. the number of SVM training instances, on the performance of Mfull on Dtest, and find that with increasing sizes of D2, E1 reduces only slightly from 9.5% to 9.4%, and shows no improvement thereafter. We also find that the use of document-level features in Mlemma reduces E1 and E2 percentages for Mlemma by 5.7% and 3.2% respectively.

4.1 Comparison to Alternate Approaches

4.1.1 Structured Prediction Models

Preliminary experiments showed that knowing the predicted labels (lemma + morphology) of the surrounding words can slightly improve the predictive accuracy of our model. To further investigate this effect, we tried running experiments using different structured models, namely CRF (Conditional Random Fields) (Lafferty et al., 2001), (Structured) MIRA (Margin Infused Relaxation Algorithm) (Crammer et al., 2006) and Structured Perceptron (Collins, 2002). We used linear chain CRFs as implemented in the MALLET Toolbox (McCallum, 2001), and for Structured MIRA and Perceptron we used their implementations from the EDLIN Toolbox (Ganchev and Georgiev, 2009). However, given the vast label space of our problem, running these methods proved infeasible. The time complexity of these methods scales badly with the number of labels; it took a week to train a linear chain CRF for only ∼50 labels, and though MIRA and Perceptron are online algorithms, they also become intractable beyond a few hundred labels. Since our label space contains combinations of lemmas and morphologies, even after factoring, the dimension of the label space is in the order of thousands.

We also tried a naïve version (two-pass approximation) of these structured models. In addition to the features in Zfull, we include the predicted labels for the tokens preceding and following the target token as features. This new model is not only slow to train, but also achieves only slightly lower error rates (1.2% lower E1 and 1.0% lower E2) than Mfull. This provides an upper bound on the benefit of using the more complex structured models, and suggests that given their computational demands our (unstructured) model Mfull is a better choice.

4.1.2 MADA

(Habash and Rambow, 2005) perform morphological disambiguation using a morphological analyzer. (Roth et al., 2008) augment this with lemma disambiguation; they call their system MADA. Our work differs from theirs in a number of respects. Firstly, they don't use the two-step regression procedure that we use. Secondly, they use only "unigram" features. Also, they do not learn a single model from a feature set based on labels and LEs; instead, they combine models for individual elements by using weighted agreement. We trained and tested MADA v2.32 using its full feature set on the same Dtrain and Dtest. We should point out that this is not an exact comparison, since MADA uses the older Buckwalter morphological analyzer. (A new version of MADA was released very close to the submission deadline for this conference.)
4.1.3 Other Alternatives

Unfactored Labels: To illustrate the benefit obtained by breaking down each label l into LEs, we contrast the performance of our Mfull model to an SVM model Mbaseline trained using only the feature set Zbaseline, which contains only features based on entire labels, not those based on individual LEs.

Independent lemma and morphology prediction: Another alternative approach is to predict lemmas and morphological analyses separately. We construct a feature set Zlemma0 = Zfull − Zlemma and train an SVM model Mlemma0 using this feature set. Labels are then predicted by simply combining the results predicted independently by Mlemma and Mlemma0. Let Mind denote this approach.

Unigram Features: Finally, we also consider a context-less approach, i.e. using only "unigram" features for labels as well as LEs. We call this feature set Zuni, and the corresponding SVM model Muni.

The results of these various models, along with those of Mfull, are summarized in Table 5. We see that Mfull has roughly half the error rate of the state-of-the-art MADA system. (Note: the results reported are 10-fold cross-validation test accuracies and no parameters have been tuned on them. We used the same train-test splits for all the datasets.)

5 Related Work

(Hajic, 2000) shows that for highly inflectional languages, the use of a morphological analyzer improves accuracy of disambiguation. (Diab et al., 2004) perform tokenization, POS tagging and base phrase chunking using an SVM-based learner. (Ahmed and Nürnberger, 2008) perform word-sense disambiguation using a Naive Bayesian model and rely on parallel corpora and matching schemes instead of a morphological analyzer. (Kulick, 2010) performs simultaneous tokenization and part-of-speech tagging for Arabic by separating closed and open-class items and focusing on the likelihood of possible stems of open-class words. (Mohamed and Kübler, 2010) present a hybrid method between word-based and segment-based POS tagging for Arabic and report good results. (Toutanova and Cherry, 2009) perform joint lemmatization and part-of-speech tagging for English, Bulgarian, Czech and Slovene, but they do not use the two-step estimation-shrinkage model described in this paper; nor do they factor labels. The idea of joint lemmatization and part-of-speech tagging has also been discussed in the context of Hungarian in (Kornai, 1994).

A substantial amount of relevant work has been done previously for Hebrew. (Adler and Elhadad, 2006) perform Hebrew morphological disambiguation using an unsupervised morpheme-based HMM, but they report lower scores than those achieved by our model. Moreover, their analysis doesn't include lemma IDs, which is a novelty of our model. (Goldberg et al., 2008) extend the work of (Adler and Elhadad, 2006) by using an EM algorithm, and achieve an accuracy of 88% for full morphological analysis, but again, this does not include lemma IDs. To the best of our knowledge, there is no existing research for Hebrew that does what we did for Arabic, namely to use simultaneous lemma and morphological disambiguation to improve both. (Dinur et al., 2009) show that prepositions and function words can be accurately segmented using unsupervised methods. However, by using this method as a preprocessing step, we would lose the power of a simultaneous solution for these problems. Our method is closer in style to a CRF, giving much of the accuracy gains of a simultaneous solution, while being about 4 orders of magnitude easier to train.
We believe that our use of factored labels is novel for the problem of simultaneous lemma and morphological disambiguation; however, (Smith et al., 2005) and (Hatori et al., 2008) have previously made use of features based on parts of labels in CRF models for morphological disambiguation and word-sense disambiguation respectively. Also, we note that there is a similarity between our two-stage machine learning approach and log-linear models in machine translation that break the data into two parts, estimating log-probabilities of generative models from one part, and discriminatively re-weighting the models using the second part.

6 Conclusions

We introduced a new approach to accurately predict labels consisting of both lemmas and morphological analyses for Arabic text. We obtained an accuracy of over 90%, substantially higher than current state-of-the-art systems. Key to our success is the factoring of labels into a lemma and a large set of morphosyntactic elements, and the use of an algorithm that computes a simple initial estimate of the coefficient relating each contextual feature to each label element (simply by counting co-occurrence) and then regularizes these features by shrinking each of the coefficients for each feature by an amount determined by supervised learning using only the candidate label sets produced by SAMA. We also showed that using features of word n-grams is preferable to using features of only individual tokens of data. Finally, we showed that a model using a full feature set based on labels as well as factored components of labels, which we call label elements (LEs), works better than a model created by combining individual models for each LE.

We believe that the approach we have used to create our model can be successfully applied not just to Arabic but also to other languages such as Turkish, Hungarian and Finnish that have highly inflectional morphology. The current accuracy of our model, which gets the correct answer among the top two choices 96.2% of the time, is high enough to be highly useful for tasks such as aiding the manual annotation of Arabic text; a more complete automation would require that accuracy for the single top choice.

Acknowledgments

We would like to thank everyone at the Linguistic Data Consortium, especially Christopher Cieri, David Graff, Seth Kulick, Ann Bies, Wajdi Zaghouani and Basma Bouziri for their help. We also wish to thank the anonymous reviewers for their comments and suggestions.

References

Meni Adler and Michael Elhadad. 2006. An Unsupervised Morpheme-Based HMM for Hebrew Morphological Disambiguation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics.

Farag Ahmed and Andreas Nürnberger. 2008. Arabic/English Word Translation Disambiguation using Parallel Corpora and Matching Schemes. In Proceedings of EAMT'08, Hamburg, Germany.

Tim Buckwalter. 2004. Buckwalter Arabic Morphological Analyzer version 2.0.

Michael Collins. 2002. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. In Proceedings of EMNLP'02.

Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. 2006. Online Passive-Aggressive Algorithms. Journal of Machine Learning Research, 7:551–585.

Mona Diab, Kadri Hacioglu, and Daniel Jurafsky. 2004. Automatic Tagging of Arabic Text: From Raw Text to Base Phrase Chunks.
In Proceedings of the 5th Meeting of the North American Chapter of the Association for Computational Linguistics/Human Language Technologies Conference (HLT-NAACL'04).

Elad Dinur, Dmitry Davidov, and Ari Rappoport. 2009. Unsupervised Concept Discovery in Hebrew Using Simple Unsupervised Word Prefix Segmentation for Hebrew and Arabic. In Proceedings of the EACL 2009 Workshop on Computational Approaches to Semitic Languages.

Kuzman Ganchev and Georgi Georgiev. 2009. Edlin: An Easy to Read Linear Learning Framework. In Proceedings of RANLP'09.

Yoav Goldberg, Meni Adler, and Michael Elhadad. 2008. EM Can Find Pretty Good HMM POS-Taggers (When Given a Good Start). In Proceedings of ACL'08.

Nizar Habash and Owen Rambow. 2005. Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop. In Proceedings of ACL'05, Ann Arbor, MI, USA.

Jan Hajic. 2000. Morphological Tagging: Data vs. Dictionaries. In Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL'00).

Jun Hatori, Yusuke Miyao, and Jun'ichi Tsujii. 2008. Word Sense Disambiguation for All Words using Tree-Structured Conditional Random Fields. In Proceedings of COLING'08.

W. James and Charles Stein. 1961. Estimation with Quadratic Loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1.

András Kornai. 1994. On Hungarian Morphology (Linguistica, Series A: Studia et Dissertationes 14). Linguistics Institute of the Hungarian Academy of Sciences, Budapest.

Seth Kulick. 2010. Simultaneous Tokenization and Part-of-Speech Tagging for Arabic without a Morphological Analyzer. In Proceedings of ACL'10.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of ICML'01, pages 282–289.

Mohamed Maamouri, Ann Bies, and Tim Buckwalter. 2004. The Penn Arabic Treebank: Building a Large Scale Annotated Arabic Corpus. In Proceedings of the NEMLAR Conference on Arabic Language Resources and Tools.

Mohamed Maamouri, David Graff, Basma Bouziri, Sondos Krouna, and Seth Kulick. 2009. LDC Standard Arabic Morphological Analyzer (SAMA) v. 3.0.

Andrew McCallum. 2001. MALLET: A Machine Learning for Language Toolkit. Software available at http://mallet.cs.umass.edu.

Emad Mohamed and Sandra Kübler. 2010. Arabic Part of Speech Tagging. In Proceedings of LREC'10.

Ryan Roth, Owen Rambow, Nizar Habash, Mona Diab, and Cynthia Rudin. 2008. Arabic Morphological Tagging, Diacritization, and Lemmatization Using Lexeme Models and Feature Ranking. In Proceedings of ACL'08, Columbus, Ohio, USA.

Noah A. Smith, David A. Smith, and Roy W. Tromble. 2005. Context-Based Morphological Disambiguation with Random Fields. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP).

Kristina Toutanova and Colin Cherry. 2009. A Global Model for Joint Lemmatization and Part-of-Speech Prediction. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing, pages 486–494.

5 0.30631414 28 emnlp-2010-Collective Cross-Document Relation Extraction Without Labelled Data

Author: Limin Yao ; Sebastian Riedel ; Andrew McCallum

Abstract: We present a novel approach to relation extraction that integrates information across documents, performs global inference and requires no labelled text. In particular, we tackle relation extraction and entity identification jointly. We use distant supervision to train a factor graph model for relation extraction based on an existing knowledge base (Freebase, derived in parts from Wikipedia). For inference we run an efficient Gibbs sampler that leads to linear time joint inference. We evaluate our approach both for an in-domain (Wikipedia) and a more realistic out-of-domain (New York Times Corpus) setting. For the in-domain setting, our joint model leads to 4% higher precision than an isolated local approach, but has no advantage over a pipeline. For the out-of-domain data, we benefit strongly from joint modelling, and observe improvements in precision of 13% over the pipeline, and 15% over the isolated baseline.

6 0.29130849 10 emnlp-2010-A Probabilistic Morphological Analyzer for Syriac

7 0.28269765 16 emnlp-2010-An Approach of Generating Personalized Views from Normalized Electronic Dictionaries : A Practical Experiment on Arabic Language

8 0.26536942 55 emnlp-2010-Handling Noisy Queries in Cross Language FAQ Retrieval

9 0.24534202 29 emnlp-2010-Combining Unsupervised and Supervised Alignments for MT: An Empirical Study

10 0.20609312 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors

11 0.19651914 5 emnlp-2010-A Hybrid Morpheme-Word Representation for Machine Translation of Morphologically Rich Languages

12 0.18961309 98 emnlp-2010-Soft Syntactic Constraints for Hierarchical Phrase-Based Translation Using Latent Syntactic Distributions

13 0.18101878 20 emnlp-2010-Automatic Detection and Classification of Social Events

14 0.17933095 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning

15 0.17550579 108 emnlp-2010-Training Continuous Space Language Models: Some Practical Issues

16 0.1750408 92 emnlp-2010-Predicting the Semantic Compositionality of Prefix Verbs

17 0.16716345 111 emnlp-2010-Two Decades of Unsupervised POS Induction: How Far Have We Come?

18 0.1668938 75 emnlp-2010-Lessons Learned in Part-of-Speech Tagging of Conversational Speech

19 0.16650391 114 emnlp-2010-Unsupervised Parse Selection for HPSG

20 0.16316716 50 emnlp-2010-Facilitating Translation Using Source Language Paraphrase Lattices


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.061), (10, 0.015), (12, 0.038), (29, 0.081), (30, 0.02), (32, 0.022), (43, 0.345), (52, 0.02), (56, 0.057), (66, 0.128), (72, 0.061), (76, 0.023), (87, 0.019), (89, 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.70907301 44 emnlp-2010-Enhancing Mention Detection Using Projection via Aligned Corpora

Author: Yassine Benajiba ; Imed Zitouni

Abstract: The research question treated in this paper is centered on the idea of exploiting rich resources of one language to enhance the performance of a mention detection system of another one. We successfully achieve this goal by projecting information from one language to another via a parallel corpus. We examine the potential improvement using various degrees of linguistic information in a statistical framework and we show that the proposed technique is effective even when the target language model has access to a significantly rich feature set. Experimental results show up to 2.4F improvement in performance when the system has access to information obtained by projecting mentions from a resource-rich-language mention detection system via a parallel corpus.
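Since the projection step is central here, a minimal sketch may help: it maps mention spans through word alignments from the resource-rich side to the target side. The (src, tgt) index-pair alignment format, the (start, end, label) mention encoding, and the example data are assumptions for illustration; the paper's actual projection pipeline is not reproduced.

```python
# Minimal sketch of projecting mention spans through word alignments
# from the resource-rich side to the target side. Alignment format,
# mention encoding, and example data are assumptions.

def project_mentions(mentions, alignments):
    """Map each (start, end, label) source mention to the target span
    covering all aligned tokens; mentions with no aligned target
    tokens are dropped."""
    projected = []
    for start, end, label in mentions:
        targets = [t for s, t in alignments if start <= s <= end]
        if targets:
            projected.append((min(targets), max(targets), label))
    return projected

# Hypothetical source mentions: token 1 = "mayor", token 3 = "NYC".
src_mentions = [(1, 1, "PER-NOM"), (3, 3, "GPE-NAM")]
# Word alignments as (source_index, target_index) pairs; the target
# sentence has a different word order in this made-up example.
alignments = [(0, 0), (1, 2), (2, 1), (3, 3)]
print(project_mentions(src_mentions, alignments))
# -> [(2, 2, 'PER-NOM'), (3, 3, 'GPE-NAM')]
```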

2 0.69566554 75 emnlp-2010-Lessons Learned in Part-of-Speech Tagging of Conversational Speech

Author: Vladimir Eidelman ; Zhongqiang Huang ; Mary Harper

Abstract: This paper examines tagging models for spontaneous English speech transcripts. We analyze the performance of state-of-the-art tagging models, either generative or discriminative, left-to-right or bidirectional, with or without latent annotations, together with the use of ToBI break indexes and several methods for segmenting the speech transcripts (i.e., conversation side, speaker turn, or human-annotated sentence). Based on these studies, we observe that: (1) bidirectional models tend to achieve better accuracy levels than left-to-right models, (2) generative models seem to perform somewhat better than discriminative models on this task, and (3) prosody improves tagging performance of models on conversation sides, but has much less impact on smaller segments. We conclude that, although the use of break indexes can indeed significantly improve performance over baseline models without them on conversation sides, tagging accuracy improves more by using smaller segments, for which the impact of the break indexes is marginal.
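The three segmentation granularities the abstract compares can be illustrated with a short sketch; the utterance representation and example data below are assumptions, not the paper's data format.

```python
# Minimal sketch of the three segmentation granularities compared in
# the abstract, applied to one conversation side. The (speaker,
# sentence) layout and example data are illustrative assumptions.

from itertools import groupby

utterances = [  # (speaker, sentence), in temporal order
    ("A", "yeah I think so"),
    ("A", "it was last week"),
    ("B", "oh really"),
    ("A", "pretty sure"),
]

# (1) Conversation side: all of one speaker's sentences as one segment.
by_side = {}
for spk, sent in utterances:
    by_side.setdefault(spk, []).append(sent)

# (2) Speaker turn: maximal runs of consecutive same-speaker sentences.
by_turn = [[sent for _, sent in grp]
           for _, grp in groupby(utterances, key=lambda u: u[0])]

# (3) Human-annotated sentence: each sentence is its own segment.
by_sentence = [[sent] for _, sent in utterances]

print(by_side)
print(by_turn)
print(by_sentence)
```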

3 0.51065218 62 emnlp-2010-Improving Mention Detection Robustness to Noisy Input

Author: Radu Florian ; John Pitrelli ; Salim Roukos ; Imed Zitouni

Abstract: Information-extraction (IE) research typically focuses on clean-text inputs. However, an IE engine serving real applications yields many false alarms due to less-well-formed input. For example, IE in a multilingual broadcast processing system has to deal with inaccurate automatic transcription and translation. The resulting presence of non-target-language text in this case, and non-language material interspersed in data from other applications, raise the research problem of making IE robust to such noisy input text. We address one such IE task: entity-mention detection. We describe augmenting a statistical mention-detection system in order to reduce false alarms from spurious passages. The diverse nature of input noise leads us to pursue a multi-faceted approach to robustness. For our English-language system, at various miss rates we eliminate 97% of false alarms on inputs from other Latin-alphabet languages. In another experiment, representing scenarios in which genre-specific training is infeasible, we process real financial-transactions text containing mixed languages and data-set codes. On these data, because we do not train on data like it, we achieve a smaller but significant improvement. These gains come with virtually no loss in accuracy on clean English text.
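As one illustration of gating a mention detector on input quality, here is a minimal sketch using a crude ASCII-letter-ratio heuristic; the heuristic, the 0.8 threshold, and the detector stub are assumptions for exposition and do not reproduce the paper's multi-faceted robustness features.

```python
# Illustrative pre-filter for a mention-detection pipeline: score how
# "English-like" a passage is and suppress mention output below a
# threshold. Heuristic, threshold, and detector stub are assumptions.

import string

LETTERS_AND_SPACE = set(string.ascii_letters + " ")

def english_likeness(text):
    """Fraction of characters that are ASCII letters or spaces --
    a crude proxy for well-formed English input."""
    if not text:
        return 0.0
    return sum(ch in LETTERS_AND_SPACE for ch in text) / len(text)

def filtered_mention_detect(passages, detect_fn, threshold=0.8):
    """Run the detector only on passages that pass the filter;
    suppressed passages yield an empty mention list."""
    return [detect_fn(p) if english_likeness(p) >= threshold else []
            for p in passages]

# Hypothetical detector stub: capitalized tokens stand in for mentions.
detect = lambda p: [w for w in p.split() if w.istitle()]
print(filtered_mention_detect(
    ["Michael Bloomberg visited Boston.", "@@##//:: 1234 %%%"], detect))
# -> [['Michael', 'Bloomberg', 'Boston.'], []]
```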

4 0.46341145 61 emnlp-2010-Improving Gender Classification of Blog Authors

Author: Arjun Mukherjee ; Bing Liu

Abstract: The problem of automatically classifying the gender of a blog author has important applications in many commercial domains. Existing systems mainly use features such as words, word classes, and POS (part-of-speech) n-grams, for classification learning. In this paper, we propose two new techniques to improve the current result. The first technique introduces a new class of features which are variable length POS sequence patterns mined from the training data using a sequence pattern mining algorithm. The second technique is a new feature selection method which is based on an ensemble of several feature selection criteria and approaches. Empirical evaluation using a real-life blog data set shows that these two techniques improve the classification accuracy of the current state-of-the-art methods significantly.
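A minimal sketch of the first technique: harvesting variable-length POS n-gram patterns above a document-frequency threshold for use as features. The simple frequency cutoff below only approximates a true sequence-pattern-mining algorithm, and the data and thresholds are illustrative.

```python
# Minimal sketch of deriving variable-length POS-sequence features:
# enumerate all POS n-grams up to a maximum length and keep those
# occurring in enough documents. This cutoff only approximates the
# sequence-pattern-mining algorithm used in the paper.

from collections import Counter

def mine_pos_patterns(tagged_docs, max_len=4, min_support=2):
    """Return POS n-grams (length 1..max_len) occurring in at least
    `min_support` documents, to be used as binary features."""
    doc_patterns = []
    for tags in tagged_docs:
        seen = set()
        for n in range(1, max_len + 1):
            for i in range(len(tags) - n + 1):
                seen.add(tuple(tags[i:i + n]))
        doc_patterns.append(seen)
    counts = Counter(p for seen in doc_patterns for p in seen)
    return {p for p, c in counts.items() if c >= min_support}

docs = [["PRP", "VBP", "DT", "NN"],
        ["PRP", "VBP", "JJ"],
        ["DT", "NN", "VBZ"]]
patterns = mine_pos_patterns(docs, max_len=3, min_support=2)
print(sorted(patterns))  # e.g. ('PRP', 'VBP') appears in two documents
```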

5 0.45129684 69 emnlp-2010-Joint Training and Decoding Using Virtual Nodes for Cascaded Segmentation and Tagging Tasks

Author: Xian Qian ; Qi Zhang ; Yaqian Zhou ; Xuanjing Huang ; Lide Wu

Abstract: Many sequence labeling tasks in NLP require solving a cascade of segmentation and tagging subtasks, such as Chinese POS tagging, named entity recognition, and so on. Traditional pipeline approaches usually suffer from error propagation. Joint training/decoding in the cross-product state space can lead to an excessive number of parameters and high inference complexity. In this paper, we present a novel method which integrates the graph structures of the two subtasks into one using virtual nodes, and performs joint training and decoding in the factorized state space. Experimental evaluations on the CoNLL 2000 shallow parsing data set and the Fourth SIGHAN Bakeoff CTB POS tagging data set demonstrate the superiority of our method over cross-product, pipeline and candidate reranking approaches.
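To see why the cross-product state space blows up, here is a minimal sketch that enumerates joint (boundary, tag) labels and counts first-order transition parameters; the label sets are illustrative samples, and the paper's virtual-node factorization itself is not reproduced.

```python
# Minimal sketch of the cross-product state space for cascaded
# segmentation + tagging: each label is a (boundary, tag) pair such
# as B-NN, so the label set grows multiplicatively. The label sets
# below are illustrative samples, not the paper's inventories.

from itertools import product

boundary_labels = ["B", "I"]                 # segmentation subtask
pos_tags = ["NN", "VV", "JJ", "AD", "PU"]    # tagging subtask (sample)

cross_product = [f"{b}-{t}" for b, t in product(boundary_labels, pos_tags)]
print(len(cross_product), "joint labels:", cross_product)

# A first-order sequence model over L labels has O(L^2) transition
# parameters, so the cross-product space is much costlier:
print("transitions (joint):", len(cross_product) ** 2)
print("transitions (separate subtasks):",
      len(boundary_labels) ** 2 + len(pos_tags) ** 2)
```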

6 0.45088449 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation

7 0.44732472 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors

8 0.44533849 11 emnlp-2010-A Semi-Supervised Approach to Improve Classification of Infrequent Discourse Relations Using Feature Vector Extension

9 0.44510967 86 emnlp-2010-Non-Isomorphic Forest Pair Translation

10 0.44485775 84 emnlp-2010-NLP on Spoken Documents Without ASR

11 0.44183332 49 emnlp-2010-Extracting Opinion Targets in a Single and Cross-Domain Setting with Conditional Random Fields

12 0.44012335 119 emnlp-2010-We're Not in Kansas Anymore: Detecting Domain Changes in Streams

13 0.43937483 43 emnlp-2010-Enhancing Domain Portability of Chinese Segmentation Model Using Chi-Square Statistics and Bootstrapping

14 0.43919107 63 emnlp-2010-Improving Translation via Targeted Paraphrasing

15 0.43854016 82 emnlp-2010-Multi-Document Summarization Using A* Search and Discriminative Learning

16 0.43850395 120 emnlp-2010-What's with the Attitude? Identifying Sentences with Attitude in Online Discussions

17 0.43776372 67 emnlp-2010-It Depends on the Translation: Unsupervised Dependency Parsing via Word Alignment

18 0.43753663 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice

19 0.43659881 20 emnlp-2010-Automatic Detection and Classification of Social Events

20 0.43612307 25 emnlp-2010-Better Punctuation Prediction with Dynamic Conditional Random Fields