emnlp emnlp2010 emnlp2010-16 knowledge-graph by maker-knowledge-mining

16 emnlp-2010-An Approach of Generating Personalized Views from Normalized Electronic Dictionaries : A Practical Experiment on Arabic Language


Source: pdf

Author: Aida Khemakhem ; Bilel Gargouri ; Abdelmajid Ben Hamadou

Abstract: Electronic dictionaries covering all natural language levels are very relevant for the human use as well as for the automatic processing use, namely those constructed with respect to international standards. Such dictionaries are characterized by a complex structure and an important access time when using a querying system. However, the need of a user is generally limited to a part of such a dictionary according to his domain and expertise level which corresponds to a specialized dictionary. Given the importance of managing a unified dictionary and considering the personalized needs of users, we propose an approach for generating personalized views starting from a normalized dictionary with respect to Lexical Markup Framework LMF-ISO 24613 norm. This approach provides the re-use of already defined views for a community of users by managing their profiles information and promoting the materialization of the generated views. It is composed of four main steps: (i) the projection of data categories controlled by a set of constraints (related to the user‟s profiles), (ii) the selection of values with consistency checking, (iii) the automatic generation of the query‟s model and finally, (iv) the refinement of the view. The proposed approach was con- solidated by carrying out an experiment on an LMF normalized Arabic dictionary. 1

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 An approach for generating personalized views from normalized electronic dictionaries : A practical experiment on Aida Khemakhem MIRACL Laboratory FSEGS, B. [sent-1, score-0.88]

2 com Abstract Electronic dictionaries covering all natural language levels are very relevant for the human use as well as for the automatic processing use, namely those constructed with respect to international standards. [sent-5, score-0.197]

3 Such dictionaries are characterized by a complex structure and an important access time when using a querying system. [sent-6, score-0.234]

4 However, the need of a user is generally limited to a part of such a dictionary according to his domain and expertise level which corresponds to a specialized dictionary. [sent-7, score-0.577]

5 Given the importance of managing a unified dictionary and considering the personalized needs of users, we propose an approach for generating personalized views starting from a normalized dictionary with respect to Lexical Markup Framework LMF-ISO 24613 norm. [sent-8, score-1.384]

6 This approach provides the re-use of already defined views for a community of users by managing their profiles information and promoting the materialization of the generated views. [sent-9, score-0.462]

7 It is composed of four main steps: (i) the projection of data categories controlled by a set of constraints (related to the user‟s profiles), (ii) the selection of values with consistency checking, (iii) the automatic generation of the query‟s model and finally, (iv) the refinement of the view. [sent-10, score-0.205]

8 The proposed approach was con- solidated by carrying out an experiment on an LMF normalized Arabic dictionary. [sent-11, score-0.143]

9 1 Introduction Electronic dictionaries are very useful in nowadays society, with the globalization and the in953 Arabic language Bilel Gargouri MIRACL Laboratory FSEGS, B. [sent-12, score-0.197]

10 There are clearly identified needs of dictionaries for human use as well as for automatic processing use. [sent-22, score-0.237]

11 A normalized dictionary covers wide areas that include all lexical information of a given language and which are useful both for human use and for Natural Language Processing use in accordance with the kind of the user (linguist, lexicographer, developer, etc. [sent-26, score-0.728]

12 These dictionaries are characterized by a complex structure that supports the richness of natural languages. [sent-30, score-0.274]

13 Therefore, dealing with a unique and complete dictionary is well for the manage task. [sent-31, score-0.275]

14 However, such dictionaries are large and can be time consuming when querying their contents especially on the web. [sent-32, score-0.197]

15 Moreover, displaying all details when some of which are not useful for the field of the query research is a ProceMedITin,g Ms oasfs thaceh 2u0se1t0ts C,o UnSfAer,e n9c-e1 on O Ectmobpeir ic 2a0l1 M0. [sent-33, score-0.182]

16 So, it will be interesting to reduce the displayed details according to the domain or to the expertise level of the user by generating personalized views (virtual or materialized) in order to appropriate the use of such dictionaries to the user needs. [sent-36, score-1.174]

17 The idea of creating document views is not a new concept but applying it on LMF normalized dictionaries is a new one. [sent-37, score-0.555]

18 Indeed, it has been some attempts for dictionary creating in accordance to the TEI consortium (Véronis and Ide 1996) but the problem was the fact that created textual views (corresponding to the surface structure) or data base views (corresponding to the deep structure) were not customized. [sent-38, score-0.725]

19 Others propositions are very interesting but they concern the ontology domain when dealing with the concept of point of view (Corby and al 2005). [sent-39, score-0.185]

20 In this paper, we propose an approach that favors the use of normalized dictionaries by generating virtual/materialized personalized views. [sent-40, score-0.58]

21 This approach is based on the profiles information associated to user‟s community which helps to retrieve already defined views stored in a library of views. [sent-41, score-0.355]

22 For the illustration, we use an Arabic LMF normalized dictionary (Baccar and al. [sent-42, score-0.378]

23 We will start by giving an overview of projects that use LMF notably for the construction and the exploitation of electronic dictionaries. [sent-48, score-0.135]

24 Then, we will present the foundation of the proposed approach related to the profile and the view concepts. [sent-49, score-0.363]

25 Finally, we will bring back the experimentation that we carried out on a normalized Arabic dictionary using the proposed approach. [sent-51, score-0.378]

26 All these labors have been recourse to Web service technology that favors to invoke, locally or afar, appropriate services for the management or the exploitation of a normalized dictionary. [sent-56, score-0.336]

27 LEXUS offers an interface permitting to the user to define formats of lexical bases with a perspective to enable the construction of lexical bases according to the LMF model. [sent-57, score-0.355]

28 These servic- es offer the exploitation of a normalized lexical base without having any piece of information about its content and its structure. [sent-63, score-0.303]

29 The results of these services may be personalized dictionaries given in an XML format. [sent-64, score-0.479]

30 Concerning the construction of personalized dictionaries using the works mentioned above, we can notice that the user must have an idea about the content of the desired dictionary and its structure. [sent-66, score-0.937]

31 Finally, we note an absence of works dealing with the generation of views starting from LMF standardized dictionary. [sent-68, score-0.306]

32 3 Foundation of the approach An electronic dictionary can be used by many users who have different requirements. [sent-69, score-0.391]

33 The making of a view of the dictionary might be difficult for some kinds of users, so the recourse to user profiles may facilitate this task. [sent-72, score-0.805]

34 One can note that the user profile is very important to guide the user through the retrieval and the reuse of existent views corresponding to his profile. [sent-73, score-0.936]

35 1 A user profile definition Generally, all features characterizing a user or a group of users can be grouped under the term of a user profile. [sent-75, score-1.007]

36 For electronic dictionaries, a user profile is a dataset that concerns a community of users of the dictionary. [sent-76, score-0.643]

37 Every profile is characterized by a category, a level of expertise and a field. [sent-77, score-0.327]

38 Indeed, we classify the views of the dictionary according to a profile that is based on a selection of these three criteria. [sent-78, score-0.668]

39 The formal representation of a profile is the following: PP :: << KK,, LL,, FF>> KK:: the kind of user: lexicographer, linguist, lingware system developer, etc. [sent-79, score-0.265]

40 2 A View definition A view of a dictionary is a synthesis of a dictio- nary interrogation query. [sent-83, score-0.423]

41 A dictionary view allows to filter some lexicographic information and to hide others that are useless for some users. [sent-85, score-0.424]

42 The formal representation of a view: VViieeww :: << DD,, PP,, C > DD:: dictionary: each view is specific to a normalized dictionary. [sent-86, score-0.288]

43 If the weight equals to 1, then this property is mandatory for each lexical entry of this view. [sent-97, score-0.178]

44 Different types of views There are two types of views:  Virtual view: the results of this view are calculated upon request. [sent-98, score-0.36]

45 In this case, interrogating queries of this view might generate a composed query that interrogates directly the principal dictionary (underlying). [sent-99, score-0.509]

46  Materialized view (physical): a physical copy faithful to the view definition which is stored and maintained. [sent-100, score-0.333]

47     Projection of the view model: includes the specification of data categories (linguistic information), controlled by constraints in order to build a suitable normalized and valid model. [sent-103, score-0.482]

48 Selection and checkout of the coherence: concerns the specification of some values for data categories (DC) already specified. [sent-105, score-0.27]

49 Refinement of the view: involves the validation of a new view of dictionary, by adding the elements that are related to the lexical entries of this view. [sent-109, score-0.297]

50 1 Projection of the view model The UML model of the dictionary is difficult to understand. [sent-113, score-0.38]

51 The choice can be started by already existing profiles in order to avoid views redundancy by the reuse of the previous ones and help users. [sent-115, score-0.448]

52 The specification of a profile A user profile is a description that corresponds to a user community. [sent-117, score-1.042]

53 We use the features of profile to filter the DCs. [sent-118, score-0.218]

54 Indeed, we offer to the user only DCs that correspond to its field and its level based on the weight assigned to each DC. [sent-119, score-0.323]

55 We assigned these weights according to a study on the needs of each user level. [sent-120, score-0.265]

56 This study is based on the specific documents and dictionaries for each user level. [sent-121, score-0.422]

57 It also facilitates and accelerates the task of needs specification and permits to avoid views redundancy. [sent-123, score-0.489]

58 The projection phase is started by the specification of the user profile that requires the choice of its category, its level and its field. [sent-124, score-0.73]

59 Then, if the user wants to consult the previous views of his profile, we display all the views associated to this profile. [sent-125, score-0.655]

60 Indeed, during their specification there is a risk of having views with a non valid 956 model. [sent-129, score-0.371]

61 So, the DC specification must be controlled by constraints rules such as:  If we select the field, the semantic class or the nature then the definition must be selected. [sent-130, score-0.317]

62 Values specification of some DCs might influence the presence or the absence of the other DCs. [sent-134, score-0.156]

63 For example, if the user has chosen DCs: pos, root, scheme, gender and number; then, he has fixed the value du pos =”particle” (it means that he needs only particles) and the value of the number =”singular”. [sent-135, score-0.34]

64 In this case, we must request the user to rectify the specifications. [sent-137, score-0.265]

65 The selection must contain a checkout phase of coherence of DCs specified with the already existing data in the dictionary. [sent-138, score-0.211]

66 3 Automatic generation of the DTD and the query We use the Document Type Definition (DTD) of the LMF norm and DC specifications of the model to automatically generate the DTD of the view. [sent-141, score-0.235]

67 using XQuery) involves two steps: the first one concern the specification of conditions for the selection of a lexical entry and its related information (i. [sent-145, score-0.287]

68 The second step permits the definition of an XML representation of a lexical entry. [sent-148, score-0.15]

69 This step is based on the projection specified by the user and the priority order of DCs. [sent-149, score-0.271]

70 There are DCs that influence the presence of the lexical entry and others that only influence the presence of the sense. [sent-150, score-0.175]

71 4 Refinement of the view The steps of this phase depend on the type of the view. [sent-152, score-0.19]

72 If it‟s a virtual one, then it‟s necessary to save the query already generated. [sent-153, score-0.223]

73 If it‟s a materialized one, then query results are a part of the dictionary and must be saved for the operation of the view. [sent-155, score-0.504]

74 The second case may give us a non valid XML base, especially when there are two lexical entries in relation and our query will select one of them that have an identifier of the other entry that is not selected. [sent-156, score-0.347]

75 We recall that the lexical entries may have morphological links with other lexical entries and semantic links with other senses. [sent-157, score-0.346]

76 This step consists to valid the new personalized dictionary, adding lexical entries, senses and syntactic frame in relation with these results. [sent-159, score-0.305]

77 1 Experimentation The normalized Arabic dictionary In order to experiment our approach, we are going to use the normalized Arabic interactive dictionary containing more than 38000 lexical entries and developed in the framework of an Arabic project 2 supervised by the ALECSO. [sent-161, score-0.908]

78 This dictionary is modeled according to the meta-model proposed by LMF3 (ISO 24613) and uses data categories generated by the DCR 4 norm (ISO 12620). [sent-162, score-0.286]

79 The dictionary pattern is composed of classes selected from the kernel or from one of its extensions (morphological, semantic, syntactic, MRD) in order to see a dictionary covering most of new dictionary‟s needs (Baccar and al 2008). [sent-163, score-0.51]

80 In addition, this dictionary is rich with semantic information (i. [sent-176, score-0.235]

81 Example of a lexical entry of the normalized Arabic dictionary. [sent-184, score-0.274]

82 In this Figure, we highlight some properties of the used dictionary such as:  The diversity of information: morphology, semantics, syntax, image, video, etc. [sent-185, score-0.235]

83 , synonym) link two senses and not two lexical entries  The precision of the syntactic behavior. [sent-188, score-0.152]

84 2 Experiment of the approach We illustrate in this section the generation process of a personalized dictionary view. [sent-191, score-0.53]

85 We have setup a computing system online, that allows the user to make his view of the dictionary in the format of an interactive Web page (so independent of all material or software owner) in which he will be able to define, create and enhance his personal view. [sent-192, score-0.605]

86 Before starting the generation of the view, the user must specify his profile. [sent-193, score-0.32]

87 If the views are associated with this profile, we will display their description to reuse the existing views and to avoid redundancy. [sent-194, score-0.483]

88 We give an example of view that includes only the schema, the derivational relations, sense, examples of all the Arabic verbs (pos = verb). [sent-202, score-0.181]

89 Then from 38000 lexical entries, our view has 7000 verbs and 3000 roots i. [sent-203, score-0.246]

90 In the Figure 5, the user has selected the lemma, pos, schema, the derivational relations, etymology, sense, definition and example. [sent-206, score-0.304]

91 According to the specification of requirements, the user clicks the save button. [sent-208, score-0.419]

92 Example of a generated query If the user chooses a materialized view, the system must save the query result in the user's computer after refinement (add missing information to validate the XML document). [sent-211, score-0.727]

93 For the verification of results, the user must choose the view before starting the search in the dictionary. [sent-212, score-0.457]

94 In the following Figure 7, we present the results of research in the view already speci- fied in Figure 5. [sent-213, score-0.145]

95 6 Conclusion The construction of specialized dictionaries is an old concept. [sent-215, score-0.242]

96 However, it has not been used after the publication of LMF standard in spite of the complexity and the richness of normalized dictionaries. [sent-216, score-0.183]

97 In this paper, we proposed an approach allowing the generation of specialized and personalized views of dictionaries accord- ing to users‟ profiles in order to benefit from the management of a unique dictionary and give appropriate services. [sent-217, score-1.168]

98 A Practical experiment was carried out on a normalized Arabic dictionary using an appropriate tool that permits to manage users‟ profiles and views‟ generation. [sent-218, score-0.647]

99 We successfully performed some empirical illustrations starting from the normalized dictionary. [sent-219, score-0.143]

100 In the future, we will consider the experimentation of the developed tool in the generation of various personalized views both in virtual and materialized versions. [sent-220, score-0.713]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('lmf', 0.421), ('personalized', 0.24), ('dictionary', 0.235), ('user', 0.225), ('profile', 0.218), ('views', 0.215), ('dictionaries', 0.197), ('dcs', 0.18), ('specification', 0.156), ('view', 0.145), ('normalized', 0.143), ('profiles', 0.14), ('arabic', 0.133), ('query', 0.129), ('lexus', 0.117), ('materialized', 0.1), ('dtd', 0.094), ('gargouri', 0.094), ('khemakhem', 0.094), ('lirics', 0.094), ('entries', 0.087), ('electronic', 0.085), ('expertise', 0.072), ('users', 0.071), ('baccar', 0.07), ('checkout', 0.07), ('fsegs', 0.07), ('miracl', 0.07), ('sfax', 0.07), ('tunisia', 0.07), ('iso', 0.067), ('xml', 0.067), ('refinement', 0.066), ('entry', 0.066), ('lexical', 0.065), ('accordance', 0.06), ('recourse', 0.06), ('virtual', 0.056), ('coherence', 0.056), ('generation', 0.055), ('particles', 0.054), ('schema', 0.054), ('dc', 0.053), ('reuse', 0.053), ('field', 0.053), ('norm', 0.051), ('exploitation', 0.05), ('tool', 0.047), ('abderrahmen', 0.047), ('aida', 0.047), ('apis', 0.047), ('beginner', 0.047), ('corby', 0.047), ('francopoulo', 0.047), ('hamadou', 0.047), ('kataba', 0.047), ('kirsch', 0.047), ('lexicographer', 0.047), ('lingware', 0.047), ('mandatory', 0.047), ('mantique', 0.047), ('mrd', 0.047), ('relatedform', 0.047), ('rnu', 0.047), ('ronis', 0.047), ('verification', 0.047), ('xquery', 0.047), ('projection', 0.046), ('offer', 0.045), ('phase', 0.045), ('specialized', 0.045), ('others', 0.044), ('concerns', 0.044), ('definition', 0.043), ('indeed', 0.043), ('morphological', 0.042), ('permits', 0.042), ('markup', 0.042), ('services', 0.042), ('management', 0.041), ('needs', 0.04), ('editorial', 0.04), ('linguist', 0.04), ('manage', 0.04), ('propositions', 0.04), ('richness', 0.04), ('started', 0.04), ('must', 0.04), ('pos', 0.038), ('controlled', 0.038), ('save', 0.038), ('du', 0.037), ('characterized', 0.037), ('managing', 0.036), ('roots', 0.036), ('standardized', 0.036), ('biology', 0.036), ('derivational', 0.036), ('developer', 0.036), ('facilitates', 0.036)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999946 16 emnlp-2010-An Approach of Generating Personalized Views from Normalized Electronic Dictionaries : A Practical Experiment on Arabic Language

Author: Aida Khemakhem ; Bilel Gargouri ; Abdelmajid Ben Hamadou

Abstract: Electronic dictionaries covering all natural language levels are very relevant for the human use as well as for the automatic processing use, namely those constructed with respect to international standards. Such dictionaries are characterized by a complex structure and an important access time when using a querying system. However, the need of a user is generally limited to a part of such a dictionary according to his domain and expertise level which corresponds to a specialized dictionary. Given the importance of managing a unified dictionary and considering the personalized needs of users, we propose an approach for generating personalized views starting from a normalized dictionary with respect to Lexical Markup Framework LMF-ISO 24613 norm. This approach provides the re-use of already defined views for a community of users by managing their profiles information and promoting the materialization of the generated views. It is composed of four main steps: (i) the projection of data categories controlled by a set of constraints (related to the user‟s profiles), (ii) the selection of values with consistency checking, (iii) the automatic generation of the query‟s model and finally, (iv) the refinement of the view. The proposed approach was con- solidated by carrying out an experiment on an LMF normalized Arabic dictionary. 1

2 0.08257702 9 emnlp-2010-A New Approach to Lexical Disambiguation of Arabic Text

Author: Rushin Shah ; Paramveer S. Dhillon ; Mark Liberman ; Dean Foster ; Mohamed Maamouri ; Lyle Ungar

Abstract: We describe a model for the lexical analysis of Arabic text, using the lists of alternatives supplied by a broad-coverage morphological analyzer, SAMA, which include stable lemma IDs that correspond to combinations of broad word sense categories and POS tags. We break down each of the hundreds of thousands of possible lexical labels into its constituent elements, including lemma ID and part-of-speech. Features are computed for each lexical token based on its local and document-level context and used in a novel, simple, and highly efficient two-stage supervised machine learning algorithm that over- comes the extreme sparsity of label distribution in the training data. The resulting system achieves accuracy of 90.6% for its first choice, and 96.2% for its top two choices, in selecting among the alternatives provided by the SAMA lexical analyzer. We have successfully used this system in applications such as an online reading helper for intermediate learners of the Arabic language, and a tool for improving the productivity of Arabic Treebank annotators. 1 Background and Motivation This paper presents a methodology for generating high quality lexical analysis of highly inflected languages, and demonstrates excellent performance applying our approach to Arabic. Lexical analysis of the written form of a language involves resolving, explicitly or implicitly, several different kinds ofambiguities. Unfortunately, the usual ways of talking about this process are also ambiguous, and our general approach to the problem, though not unprecedented, has uncommon aspects. Therefore, in order 725 Paramveer S. Dhillon, Mark Liberman, Dean Foster, Mohamed Maamouri and Lyle Ungar University of Pennsylvania 345 1Walnut Street Philadelphia, PA 19104, USA {dhi l lon | myl | ungar} @ cis .upenn .edu floo snt|emry@lw|huanrgta on .upenn .eednun maamouri @ ldc .upenn .edu , , to avoid confusion, we begin by describing how we define the problem. In an inflected language with an alphabetic writing system, a central issue is how to interpret strings of characters as forms of words. For example, the English letter-string ‘winds’ will normally be interpreted in one of four different ways, all four of which involve the sequence of two formatives wind+s. The stem ‘wind’ might be analyzed as (1) a noun meaning something like “air in motion”, pronounced [wInd] , which we can associate with an arbitrary but stable identifier like wind n1; (2) a verb wind v1 derived from that noun, and pronounced the same way; (3) a verb wind v2 meaning something like “(cause to) twist”, pronounced [waInd]; or (4) a noun wind n2 derived from that verb, and pro- nounced the same way. Each of these “lemmas”, or dictionary entries, will have several distinguishable senses, which we may also wish to associate with stable identifiers. The affix ‘-s’ might be analyzed as the plural inflection, if the stem is a noun; or as the third-person singular inflection, if the stem is a verb. We see this analysis as conceptually divided into four parts: 1) Morphological analysis, which recognizes that the letter-string ‘winds’ might be (perhaps among other things) wind/N s/PLURAL or wind/V s/3SING; 2) Morphological disambiguation, which involves deciding, for example, that in the phrase “the four winds”, ‘winds’ is probably a plural noun, i.e. wind/N s/PLURAL; 3) Lemma analysis, which involves recognizing that the stem wind in ‘winds’ might be any of the four lemmas listed above – perhaps with a further listing of senses or other sub-entries for each of them; and 4) Lemma disambiguation, deciding, for example, that + + + ProceMedITin,g Ms oasfs thaceh 2u0se1t0ts C,o UnSfAer,e n9c-e11 on O Ectmobpeir ic 2a0l1 M0.e ?tc ho2d0s10 in A Nsastouciraatlio Lnan fogru Cagoem Ppruotcaetisosninagl, L pinaggeusis 7t2ic5s–735, the phrase “the four winds” probably involves the lemma wind n1. Confusingly, the standard word-analysis tasks in computational linguistics involve various combinations of pieces of these logically-distinguished operations. Thus, “part of speech (POS) tagging” is mainly what we’ve called “morphological disambiguation”, except that it doesn’t necessarily require identifying the specific stems and affixes involved. In some cases, it also may require a small amount of “lemma disambiguation”, for example to distinguish a proper noun from a common noun. “Sense disambiguation” is basically a form of what we’ve called “lemma disambiguation”, except that the sense disambiguation task may assume that the part of speech is known, and may break down lexical identity more finely than our system happens to do. “Lemmatization” generally refers to a radically simplified form of “lemma analysis” and “lemma disambiguation”, where the goal is simply to collapse different inflected forms of any similarly-spelled stems, so that the strings ‘wind’, ‘winds’, ‘winded’, ‘winding’ will all be treated as instances of the same thing, without in fact making any attempt to determine the identity of “lemmas” in the traditional sense of dictionary entries. Linguists use the term morphology to include all aspects of lexical analysis under discussion here. But in most computational applications, “morphological analysis” does not include the disambiguation of lemmas, because most morphological analyzers do not reference a set of stable lemma IDs. So for the purposes of this paper, we will continue to discuss lemma analysis and disambiguation as conceptually distinct from morphological analysis and disambiguation, although, in fact, our system disambiguates both of these aspects of lexical analysis at the same time. The lexical analysis of textual character-strings is a more complex and consequential problem in Arabic than it is in English, for several reasons. First, Arabic inflectional morphology is more complex than English inflectional morphology is. Where an English verb has five basic forms, for example, an Arabic verb in principle may have dozens. Second, the Arabic orthographic system writes elements such as prepositions, articles, and possessive pronouns without setting them off by spaces, roughly 726 as if the English phrase “in a way” were written “inaway”. This leads to an enormous increase in the number of distinct “orthographic words”, and a substantial increase in ambiguity. Third, short vowels are normally omitted in Arabic text, roughly as if English “in a way” were written “nway”. As a result, a whitespace/punctuation-delimited letter-string in Arabic text typically has many more alternative analyses than a comparable English letter-string does, and these analyses have many more parts, drawn from a much larger vocabulary of form-classes. While an English “tagger” can specify the morphosyntactic status of a word by choosing from a few dozen tags, an equivalent level of detail in Arabic would require thousands of alternatives. Similarly, the number of lemmas that might play a role in a given letter-sequence is generally much larger in Arabic than in English. We start our labeling of Arabic text with the alternative analyses provided by SAMA v. 3.1, the Standard Arabic Morphological Analyzer (Maamouri et al., 2009). SAMA is an updated version of the earlier Buckwalter analyzers (Buckwalter, 2004), with a number of significant differences in analysis to make it compatible with the LDC Arabic Treebank 3-v3.2 (Maamouri et al., 2004). The input to SAMA is an Arabic orthographic word (a string of letters delimited by whitespace or punctuation), and the output of SAMA is a set of alternative analyses, as shown in Table 1. For a typical word, SAMA produces approximately a dozen alternative analyses, but for certain highly ambiguous words it can produce hundreds of alternatives. The SAMA analyzer has good coverage; for typical texts, the correct analysis of an orthographic word can be found somewhere in SAMA’s list of alternatives about 95% of the time. However, this broad coverage comes at a cost; the list of analytic alternatives must include a long Zipfian tail of rare or contextually-implausible analyses, which collectively are correct often enough to make a large contribution to the coverage statistics. Furthermore, SAMA’s long lists of alternative analyses are not evaluated or ordered in terms of overall or contextual plausibility. This makes the results less useful in most practical applications. Our goal is to rank these alternative analyses so that the correct answer is as near to the top of the list as possible. Despite some risk of confusion, we’ll refer to SAMA’s list of alternative analyses for an orthographic word as potential labels for that word. And despite a greater risk ofconfusion, we’ll refer to the assignment of probabilities to the set of SAMA labels for a particular Arabic word in a particular textual context as tagging, by analogy to the operation of a stochastic part-of-speech tagger, which similarly assigns probabilities to the set of labels available for a word in textual context. Although our algorithms have been developed for the particular case of Arabic and the particular set of lexical-analysis labels produced by SAMA, they should be applicable without modification to the sets of labels produced by any broad-coverage lexical analyzer for the orthographic words of any highlyinflected language. In choosing our approach, we have been moti- vated by two specific applications. One application aims to help learners of Arabic in reading text, by offering a choice of English glosses with associated Arabic morphological analyses and vocalizations. SAMA’s excellent coverage is an important basis for this help; but SAMA’s long, unranked list of alternative analyses for a particular letter-string, where many analyses may involve rare words or alternatives that are completely implausible in the context, will be confusing at best for a learner. It is much more helpful for the list to be ranked so that the correct answer is almost always near the top, and is usually one of the top two or three alternatives. In our second application, this same sort of ranking is also helpful for the linguistically expert native speakers who do Arabic Treebank analysis. These 727 annotators understand the text without difficulty, but find it time-consuming and fatiguing to scan a long list of rare or contextually-implausible alternatives for the correct SAMA output. Their work is faster and more accurate if they start with a list that is ranked accurately in order of contextual plausibility. Other applications are also possible, such as vocalization of Arabic text for text-to-speech synthesis, or lexical analysis for Arabic parsing. However, our initial goals have been to rank the list of SAMA outputs for human users. We note in passing that the existence of set of stable “lemma IDs” is an unusual feature of SAMA, which in our opinion ought to be emulated by approaches to lexical analysis in other languages. The lack of such stable lemma IDs has helped to disguise the fact that without lemma analysis and disambiguation, morphological analyses and disambiguation is only a partial solution to the problem of lexical analysis. In principle, it is obvious that lemma disambiguation and morphological disambiguation are mutually beneficial. If we know the answer to one of the questions, the other one is easier to answer. However, these two tasks require rather different sets of contextual features. Lemma disambiguation is similar to the problem of word-sense disambiguation on some definitions, they are identical and as a result, it benefits from paragraph-level and documentlevel bag-of-words attributes that help to character– – ize what the text is “about” and therefore which lemmas are more likely to play a role in it. In contrast, morphological disambiguation mainly depends on features of nearby words, which help to characterize how inflected forms of these lemmas might fit into local phrasal structures. 2 Problem and Methodology Consider a collection oftokens (observations), ti, referred to by index i∈ {1, . . . , n}, where each token fise raressdo tcoia bteyd i nwdiethx a s∈et { of p features, xij, efaocr hth teo k jethn feature, and a label, li, which is a combination of a lemma and a morphological analysis. We use indicator functions yik to indicate whether or not the kth label for the ith token is present. We represent the complete set of features and labels for the entire training data using matrix notation as X and Y , respectively. Our goal is to predict the label l (or equivalently, the vector y for a given feature vector x. A standard linear regression model of this problem would be y = xβ + ? (1) The standard linear regression estimate of β (ig- ×× × noring, for simplicity the fact that the ys are 0/1) is: βˆ = (XTtrainXtrain)−1XtTrainYtrain (2) where Ytrain is an n h matrix containing 0s and 1s indicating whise tahner n or nho mt aetarcixh coofn tthaien ihn possible labels is the correct label (li) for each of the n tokens ti, Xtrain is an n p matrix of context features for each of thei n tokens, pth mea ctoriexff oifcie cnotnst are p hs .f However, this is a large, sparse, multiple l hab.el problem, and the above formulation is neither statistically nor computationally efficient. Each observation (x, y) consists of thousands of features associated with thousands of potential labels, almost all of which are zero. Worse, the matrix of coefficients β, to be estimated is large (p h) and one should thus use some soatretd do ifs tr laarngsefe (pr learning dto o nshea srheo strength across the different labels. We present a novel principled and highly computationally efficient method of estimating this multilabel model. We use a two stage procedure, first using a subset (Xtrain1 , Ytrain1) of training data to give a fast approximate estimate of β; we then use a second smaller subset of the training data (Xtrain2, Ytrain2,) to “correct” these estimates in a eβˆx way that we will show can be viewed as a specialized shrinkage. Our first stage estimation approximates β, but avoids the expensive computa728 tion of (XTtrainXtrain)−1. Our second stage corrects (shrinks) these initial estimates in a manner specialized to this problem. The second stage takes advantage of the fact that we only need to consider those candidate labels produced by SAMA. Thus, only dozens of the thousands of possible labels are considered for each token. We now present our algorithm. We start with a corpus D of documents d of labeled Arabic text. As described above, each token, ti is associated with a set of features characterizing its context, computed from the other words in the same document, and a label, li = (lemmai, morphologyi), which is a combination of a lemma and a morphological analysis. As described below, we introduce a novel factorization of the morphology into 15 different components. Our estimation algorithm, shown in Algorithm 1, has two stages. We partition the training corpus into × two subsets, one of which (Xtrain1) is used to estimate the coefficients βs and the other of which (Xtrain2) is used to optimally “shrink” these coefficient estimates to reduce variance and prevent overfitting due to data sparsity. For the first stage of our estimation procedure, we simplify the estimate of the (β) matrix (Equation 2) to avoid the inversion of the very high dimensional (p p) matrix (XTX) by approximating (XTX) by (itps diagonal, Var(X), the inverse of which is trivial to compute; i.e. we estimate β using βˆ = Var(Xtrain1)−1XtTrain1Ytrain1 (3) For the second stage, we assume that the coefficients for each feature can be shrunk differently, but that coefficients for each feature should be shrunk the same regardless of what label they are predicting. Thus, for a given observation we predict: ˆgik=Xpwjβˆjkxij (4) Xj=1 where the weights wj indicate how much to shrink each of the p features. In practice, we fold the variance of each of the j features into the weight, giving a slightly modified equation: ˆgik=Xj=p1αjβj∗kxij (5) where β∗ = XtTrain1Ytrain1 is just a matrix of the counts of how often each context feature shows up with each label in the first training set. The vector α, which we will estimate by regression, is just the shrinkage weights w rescaled by the feature variance. Note that the formation here is different from the first stage. Instead of having each observation be a token, we now let each observation be a (token, label) pair, but only include those labels that were output by SAMA. For a given token ti and potential label lk, our goal is to approximate the indicator function g(i, k), which is 1 if the kth label of token ti is present, and 0 otherwise. We find candidate labels using a morphological analyzer (namely SAMA), which returns a set of possible candidate labels, say C(t), for each Arabic token t. Our pre- dicted label for ti is then argmaxk∈C(ti)g(i, k). The regression model for learning tthe weights αj in the second stage thus has a row for each label g(i, k) associated with a SAMA candidate for each token i = ntrain1+1 . . . ntrain2 in the second training set. The value of g(i, k) is predicted as a function of the feature vector zijk = βj∗kxij. The shrinkage coefficients, αj, could be estimated from theory, using a version of James-Stein shrinkage (James and Stein, 1961), but in practice, superior results are obtained by estimating them empirically. Since there are only p of them (unlike the p ∗ h βs), a relatively asmreal oln training sheetm mis ( usunflfi kceie tnhte. Wp ∗e hfou βnsd), that regression-SVMs work slightly better than linear regression and significantly better than standard classification SVMs for this problem. Prediction is then done in the obvious way by taking the tokens in a test corpus Dtest, generating context features and candidate SAMA labels for each token ti, and selected the candidate label with the highest score ˆ g(i, k) that we set out to learn. More formally, The model parameters β∗ and α produced by the algorithm allow one to estimate the most likely label for a new token ti out of a set of can- didate labels C(ti) using kpred= argmaxk∈C(ti)jX=p1αjβj∗kxij (6) The most expensive part of the procedure is estimating β∗, which requires for each token in cor729 Algorithm 1 Training algorithm. Input: A training corpusDtrainof n observations (Xtrain, Ytrain) Partition Dtrain into two sets, D1 and D2, of sizes ntrain1 and ntrain2 = n − ntrain1 observations // Using D1, estimat=e β∗ βj∗k = Pin=tr1ain1 xijyik for the jth feature and kth label // Using D2, estimate αj // Generate new “features” Z and the true labels g(i, k) for each of the SAMA candidate labels for each of the tokens in D2 zijk = βj∗kxij for iin i= ntrain1 + 1...ntrain2 Estimate αj for the above (feature,label) pairs (zijk, g(i, k)) using Regression SVMs Output: α and β∗ pus D1, (a subset of D), finding the co-occurrence frequencies of each label element (a lemma, or a part of the morphological segmentation) with the target token and jointly with the token and with other tokens or characters in the context of the token of interest. For example, given an Arabic token, “yHlm”, we count what fraction of the time it is associated with each lemma (e.g. Halamu 1), count(lemma=Halam-u 1, token=yHlm) and each segment (e.g. “ya”), count(segment=ya, token=yHlm). (Of course, most tokens never show up with most lemmas or segments; this is not a problem.) We also find the base rates of the components of the labels (e.g., count(lemma=Halam-u 1), and what fraction of the time the label shows up in various contexts, e.g. count(lemma=Halam-u 1, previous token = yHlm). We describe these features in more detail below. 3 Features and Labels used for Training Our approach to tagging Arabic differs from conventional approaches in the two-part shrinkage-based method used, and in the choice of both features and labels used in our model. For features, we study both local context variables, as described above, and document-level word frequencies. For the labels, the key question is what labels are included and how they are factored. Standard “taggers” work by doing an n-way classification of all the alternatives, which is not feasible here due to the thousands of possible labels. Standard approaches such as Conditional Random Fields (CRFs) are intractable with so many labels. Moreover, few if any taggers do any lemma disambiguation; that is partly because one must start with some standard inventory of lemmas, which are not available for most languages, perhaps because the importance of lemma disambiguation has been underestimated. We make a couple of innovations to deal with these issues. First, we perform lemma disambiguation in addition to “tagging”. As mentioned above, lemmas and morphological information are not independent; the choice of lemma often influences morphology and vice versa. For example, Table 1 contains two analyses for the word qbl. For the first analysis, where the lemma is qabil-a 1 and the gloss is accept/receive/approve + he/it [verb], the word is a verb. However, for the second analysis, where the lemma is qabol 1 and the gloss is before, the word is a noun. Simultaneous lemma disambiguation and tagging introduces additional complexity: An analysis of ATB and SAMA shows that there are approximately 2,200 possible morphological analyses (“tags”) and 40,000 possible lemmas; even accounting for the fact that most combinations of lemmas and morphological analyses don’t occur, the size of the label space is still in the order of tens of thousands. To deal with data sparsity, our second innovation is to factor the labels. We factor each label linto a set of 16 label elements (LEs). These include lemmas, as well as morphological elements such as basic partof-speech, suffix, gender, number, mood, etc. These are explained in detail below. Thus, since each label l is a set of 15 categorical variables, each y in the first learning stage is actually a vector with 16 nonzero components and thousands of zeros. Since we do simultaneous estimation of the entire set of label elements, the value g(i, k) being predicted in the second learning phase is 1 if the entire label set is correct, and zero otherwise. We do not learn separate models for each label. 3.1 Label Elements (LEs) The fact that there are tens of thousands of possible labels presents the problem of extreme sparsity of label distribution in the training data. We find that a model that estimates coefficients β∗ to predict a sin730 data on basic POS include whether a noun is proper or common, whether a verb is transitive or not, etc. Both the basic POS and its suffix may have person, gender and number data. gle label (a label being in the Cartesian product of the set of label elements) yields poor performance. Therefore, as just mentioned, we factor each label l into a set of label elements (LEs), and learn the correlations β∗ between features and label elements, rather than features and entire label sets. This reduces, but does not come close to eliminating, the problem sparsity. A complete list of these LEs and their possible values is detailed in Table 2. 3.2 Features 3.2.1 Local Context Features We take (t, l) pairs from D2, and for each such pair generate features Z based on co-occurrence statistics β∗ in D1, as mentioned in Algorithm 2. These statistics include unigram co-occurrence frequencies of each label with the target token and bigram co-occurrence of the label with the token and with other tokens or characters in the context of the target token. We define them formally in Table 3. Let Zbaseline denote the set of all such basic features based on the local context statistics of the target token, namely the words and letters preceding and following it. We will use this set to create a baseline model. generate feature sets for our regression SVMs. For each label element (LE) e, we define a set of features Ze similar to Zbaseline; these features are based on co-occurrence frequencies of the particular LE e, not the entire label l. Finally, we define an aggregate feature set Zaggr as follows: Zaggr = Zbaseline [ {Ze} (7) where e ∈ {lemma, pre1, pre2, det, pos, dpos, suf, perpos, numpos, genpos, persuf, numsuf, gensuf, mood, pron}. 3.2.2 Document Level Features When trying to predict the lemma, it is useful to include not just the words and characters immediately adjacent to the target token, but also the all the words in the document. These words capture the “topic” of the document, and help to disambiguate different lemmas, which tend to be used or not used based on the topic being discussed, similarly to the way that word sense disambiguation systems in English sometimes use the “bag of words” the document to disambiguate, for example a “bank” for depositing money from a “bank” of a river. More precisely, we augment the features for each target token with the counts of each word in the document (the “term frequency” tf) in which the token occurs with a given label. Zfull = Zaggr [ Ztf (8) This set Zfull is our final feature set. We use Zfull to train an SVM model Mfull; this is our final predictive model. 731 3.3 Corpora used for Training and Testing We use three modules of the Penn Arabic Treebank (ATB) (Maamouri et al., 2004), namely ATB 1, ATB2 and ATB3 as our corpus of labeled Arabic text, D. Each ATB module is a collection of newswire data from a particular agency. ATB1 uses the Associated Press as a source, ATB2 uses Ummah, and ATB3 uses Annahar. D contains a total of 1,835 documents, accounting for approximately 350,000 words. We construct the training and testing sets Dtrain and Dtest from D using 10-fold cross validation, and we construct D1 and D2 from Dtrain by randomly performing a 9: 1 split. As mentioned earlier, we use the SAMA morphological analyzer to obtain candidate labels C(t) for each token t while training and testing an SVM model on D2 and Dtest respectively. A sample output of SAMA is shown in Table 1. To improve coverage, we also add to C(t) all the labels lseen for t in D1. We find that doing so improves coverage to 98% . This is an upper bound on the accuracy of our model. C(t) = SAMA(t) 4 [ {l|(t, l) ∈ D1} (9) Results We use two metrics of accuracy: A1, which measures the percentage of tokens for which the model assigns the highest score to the correct label or LE value (or E1= 100 A1, the corresponding percentage error), 1a=nd 1 A2, wAh1i,ch th measures tnhdei percentage of tokens for which the correct label or LE value is one of the two highest ranked choices returned by the model (or E2 = 100 A2). We test our bmyod theel Mfull on Dtest a =nd 1 a0c0hi −eve A A2)1. and A2 scores of 90.6% and 96.2% respectively. The accuracy achieved by our Mfull model is, to the best of our knowledge, higher than prior approaches have been able to achieve so far for the problem of combined morphological and lemma disambiguation. This is all the more impressive considering that the upper bound on accuracy for our model is 98% because, as described above, our set of candidate labels is incomplete. In order to analyze how well different LEs can be predicted, we train an SVM model Me for each LE e using the feature set Ze, and test all such models − − on Dtest. The results for all the LEs are reported in the form of error percentages E1 and E2 in Table 4. reported are 10 fold cross validation test accuracies and no parameters have been tuned on them. A comparison of the results for Mfull with the results for Mlemma and Mpos is particularly informative. We see that Mfull is able to achieve a substantially lower E1 error score (9.4%) than Mlemma (11.1%) and Mpos (23.4%); in other words, we find that our full model is able to predict lemmas and basic parts-of-speech more accurately than the individ- ual models for each of these elements. We examine the effect of varying the size of D2, i.e. the number of SVM training instances, on the performance of Mfull on Dtest, and find that with increasing sizes of D2, E1 reduces only slightly from 9.5% to 9.4%, and shows no improvement thereafter. We also find that the use of documentlevel features in Mlemma reduces E1 and E2 percentages for Mlemma by 5.7% and 3.2% respectively. 4.1 Comparison to Alternate Approaches 4.1.1 Structured Prediction Models Preliminary experiments showed that knowing the predicted labels (lemma + morphology) of the surrounding words can slightly improve the predictive accuracy of our model. To further investigate this effect, we tried running experiments using different structured models, namely CRF (Conditional Random Fields) (Lafferty et al., 2001), (Structured) MIRA (Margin Infused Relaxation Algorithm) (Crammer et al., 2006) and Structured Perceptron (Collins, 2002). We used linear chain 732 CRFs as implemented in MALLET Toolbox (McCallum, 2001) and for Structured MIRA and Perceptron we used their implementations from EDLIN Toolbox (Ganchev and Georgiev, 2009). However, given the vast label space of our problem, running these methods proved infeasible. The time complexity of these methods scales badly with the number of labels; It took a week to train a linear chain CRF for only ∼ 50 labels and though MIRA and Perceptron are o 5n0lin leab algorithms, they MalsIoR Abec aonmde P ienr-tractable beyond a few hundred labels. Since our label space contains combinations of lemmas and morphologies, so even after factoring, the dimension of the label space is in the order of thousands. We also tried a na¨ ıve version (two-pass approximation) of these structured models. In addition to the features in Zfull, we include the predicted labels for the tokens preceding and following the target token as features. This new model is not only slow to train, but also achieves only slightly lower error rates (1.2% lower E1 and 1.0% lower E2) than Mfull. This provides an upper bound on the benefit of using the more complex structured models, and suggests that given their computational demands our (unstructured) model Mfull is a better choice. 4.1.2 MADA (Habash and Rambow, 2005) perform morphological disambiguation using a morphological analyzer. (Roth et al., 2008) augment this with lemma disambiguation; they call their system MADA. Our work differs from theirs in a number of respects. Firstly, they don’t use the two step regression procedure that we use. Secondly, they use only “unigram” features. Also, they do not learn a single model from a feature set based on labels and LEs; instead, they combine models for individual elements by using weighted agreement. We trained and tested MADA v2.32 using its full feature set on the same Dtrain and Dtest. We should point out that this is not an exact comparison, since MADA uses the older Buckwalter morphological analyzer.1 4.1.3 Other Alternatives Unfactored Labels: To illustrate the benefit obtained by breaking down each label l into 1A new version of MADA was released very close to the submission deadline for this conference. LEs, we contrast the performance of our Mfull model to an SVM model Mbaseline trained using only the feature set Zbaseline, which only contains features based on entire labels, those based on individual LEs. Independent lemma and morphology prediction: Another alternative approach is to predict lemmas and morphological analyses separately. We construct a feature set Zlemma0 = Zfull − Zlemma and train an SVM model Mlemma0 using this feature set. Labels are then predicted by simply combining the results predicted independently by Mlemma and Mlemma0 . Let Mind denote this approach. Unigram Features: Finally, we also consider a context-less approach, i.e. using only “unigram” features for labels as well as LEs. We call this feature set Zuni, and the corresponding SVM model Muni. The results of these various models, along with those of Mfull are summarized in Table 5. We see that Mfull has roughly half the error rate of the stateof-the-art MADA system. Note: The results reported are 10 fold cross validation test accuracies and no parameters have been tuned on them. We used same train-test splits for all the datasets. 5 Related Work (Hajic, 2000) show that for highly inflectional languages, the use of a morphological analyzer improves accuracy of disambiguation. (Diab et al., 2004) perform tokenization, POS tagging and base phrase chunking using an SVM based learner. (Ahmed and N ¨urnberger, 2008) perform word-sense disambiguation using a Naive Bayesian 733 model and rely on parallel corpora and match- ing schemes instead of a morphological analyzer. (Kulick, 2010) perform simultaneous tokenization and part-of-speech tagging for Arabic by separating closed and open-class items and focusing on the likelihood of possible stems of openclass words. (Mohamed and K ¨ubler, 2010) present a hybrid method between word-based and segmentbased POS tagging for Arabic and report good results. (Toutanova and Cherry, 2009) perform joint lemmatization and part-of-speech tagging for English, Bulgarian, Czech and Slovene, but they do not use the two step estimation-shrinkage model described in this paper; nor do they factor labels. The idea of joint lemmatization and part-of-speech tagging has also been discussed in the context of Hungarian in (Kornai, 1994). A substantial amount of relevant work has been done previously for Hebrew. (Adler and Elhadad, 2006) perform Hebrew morphological disambiguation using an unsupervised morpheme-based HMM, but they report lower scores than those achieved by our model. Moreover, their analysis doesn’t include lemma IDs, which is a novelty of our model. (Goldberg et al., 2008) extend the work of (Adler and El- hadad, 2006) by using an EM algorithm, and achieve an accuracy of 88% for full morphological analysis, but again, this does not include lemma IDs. To the best of our knowledge, there is no existing research for Hebrew that does what we did for Arabic, namely to use simultaneous lemma and morphological disambiguation to improve both. (Dinur et al., 2009) show that prepositions and function words can be accurately segmented using unsupervised methods. However, by using this method as a preprocessing step, we would lose the power of a simultaneous solution for these problems. Our method is closer in style to a CRF, giving much of the accuracy gains of simultaneous solution, while being about 4 orders of magnitude easier to train. We believe that our use of factored labels is novel for the problem of simultaneous lemma and morphological disambiguation; however, (Smith et al., 2005) and (Hatori et al., 2008) have previously made use of features based on parts of labels in CRF models for morphological disambiguation and word-sense disambiguation respectively. Also, we note that there is a similarity between our two-stage machine learning approach and log-linear models in machine translation that break the data in two parts, estimating log-probabilities of generative models from one part, and discriminatively re-weighting the models using the second part. 6 Conclusions We introduced a new approach to accurately predict labels consisting of both lemmas and morphological analyses for Arabic text. We obtained an accuracy of over 90% substantially higher than current state-of-the-art systems. Key to our success is the factoring of labels into lemma and a large set of morphosyntactic elements, and the use of an algorithm that computes a simple initial estimate of the coefficient relating each contextual feature to each label element (simply by counting co-occurrence) and then regularizes these features by shrinking each of the coefficients for each feature by an amount determined by supervised learning using only the candidate label sets produced by SAMA. We also showed that using features of word ngrams is preferable to using features of only individual tokens of data. Finally, we showed that a model using a full feature set based on labels as well as – factored components of labels, which we call label elements (LEs) works better than a model created by combining individual models for each LE. We believe that the approach we have used to create our model can be successfully applied not just to Arabic but also to other languages such as Turkish, Hungarian and Finnish that have highly inflectional morphology. The current accuracy of of our model, getting the correct answer among the top two choices 96.2% of the time is high enough to be highly useful for tasks such as aiding the manual annotation of Arabic text; a more complete automation would require that accuracy for the single top choice. Acknowledgments We woud like to thank everyone at the Linguistic Data Consortium, especially Christopher Cieri, David Graff, Seth Kulick, Ann Bies, Wajdi Zaghouani and Basma Bouziri for their help. We also wish to thank the anonymous reviewers for their comments and suggestions. 734 References Meni Adler and Michael Elhadad. 2006. An Unsupervised Morpheme-Based HMM for Hebrew Morphological Disambiguation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics. Farag Ahmed and Andreas N ¨urnberger. 2008. Arabic/English Word Translation Disambiguation using Parallel Corpora and Matching Schemes. In Proceedings of EAMT’08, Hamburg, Germany. Tim Buckwalter. 2004. Buckwalter Arabic Morphological Analyzer version 2.0. Michael Collins. 2002. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. In Proceedings of EMNLP’02. Koby Crammer, Ofer Dekel, Joseph Keshet, Shai ShalevShwartz, and Yoram Singer. 2006. Online PassiveAggressive Algorithms. Journal of Machine Learning Research, 7:551–585. Mona Diab, Kadri Hacioglu, and Daniel Jurafsky. 2004. Automatic Tagging of Arabic text: From Raw Text to Base Phrase Chunks. In Proceedings of the 5th Meeting of the North American Chapter of the Association for Computational Linguistics/Human Language Technologies Conference (HLT-NAACL’04). Elad Dinur, Dmitry Davidov, and Ari Rappoport. 2009. Unsupervised Concept Discovery in Hebrew Using Simple Unsupervised Word Prefix Segmentation for Hebrew and Arabic. In Proceedings of the EACL 2009 Workshop on Computational Approaches to Semitic Languages. Kuzman Ganchev and Georgi Georgiev. 2009. Edlin: An Easy to Read Linear Learning Framework. In Proceedings of RANLP’09. Yoav Goldberg, Meni Adler, and Michael Elhadad. 2008. EM Can Find Pretty Good HMM POS-Taggers (When Given a Good Start)*. In Proceedings of ACL’08. Nizar Habash and Owen Rambow. 2005. Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop. In Proceedings of ACL’05, Ann Arbor, MI, USA. Jan Hajic. 2000. Morphological Tagging: Data vs. Dictionaries. In Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL’00). Jun Hatori, Yusuke Miyao, and Jun’ichi Tsujii. 2008. Word Sense Disambiguation for All Words using TreeStructured Conditional Random Fields. In Proceedings of COLing’08. W. James and Charles Stein. 1961 . Estimation with Quadratic Loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1. Andr a´s Kornai. 1994. On Hungarian morphology (LinDissertationes 14). Lin- guistica, Series A: Studia et guistics Institute of Hungarian Academy of Sciences, Budapest. Seth Kulick. 2010. Simultaneous Tokenization and Partof-Speech Tagging for Arabic without a Morphological Analyzer. In Proceedings of ACL’10. John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of ICML’01, pages 282–289. Mohamed Maamouri, Ann Bies, and Tim Buckwalter. 2004. The Penn Arabic Treebank: Building a Large Scale Annotated Arabic Corpus. In Proceedings of NEMLAR Conference on Arabic Language Resources and Tools. Mohamed Maamouri, David Graff, Basma Bouziri, Sondos Krouna, and Seth Kulick. 2009. LDC Standard Arabic Morphological Analyzer (SAMA) v. 3.0. Andrew McCallum, 2001. MALLET: A Machine Learning for Language Toolkit. Software available at http : / /mal let .cs .umas s .edu. Emad Mohamed and Sandra K ¨ubler. 2010. Arabic Part of Speech Tagging. In Proceedings of LREC’10. Ryan Roth, Owen Rambow, Nizar Habash, Mona Diab, and Cynthia Rudin. 2008. Arabic Morphological Tagging, Diacritization, and Lemmatization Using Lexeme Models and Feature Ranking. In Proceedings of ACL’08, Columbus, Ohio, USA. Noah A. Smith, David A. Smith, and Roy W. Tromble. 2005. Context-Based Morphological Disambiguation with Random Fields*. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP). Kristina Toutanova and Colin Cherry. 2009. A Global Model for Joint Lemmatization and Part-of-Speech Prediction. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing, pages 486–494. 735

3 0.075905561 73 emnlp-2010-Learning Recurrent Event Queries for Web Search

Author: Ruiqiang Zhang ; Yuki Konda ; Anlei Dong ; Pranam Kolari ; Yi Chang ; Zhaohui Zheng

Abstract: Recurrent event queries (REQ) constitute a special class of search queries occurring at regular, predictable time intervals. The freshness of documents ranked for such queries is generally of critical importance. REQ forms a significant volume, as much as 6% of query traffic received by search engines. In this work, we develop an improved REQ classifier that could provide significant improvements in addressing this problem. We analyze REQ queries, and develop novel features from multiple sources, and evaluate them using machine learning techniques. From historical query logs, we develop features utilizing query frequency, click information, and user intent dynamics within a search session. We also develop temporal features by time series analysis from query frequency. Other generated features include word matching with recurrent event seed words and time sensitivity of search result set. We use Naive Bayes, SVM and decision tree based logistic regres- sion model to train REQ classifier. The results on test data show that our models outperformed baseline approach significantly. Experiments on a commercial Web search engine also show significant gains in overall relevance, and thus overall user experience.

4 0.071566649 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media

Author: Yunliang Jiang ; Cindy Xide Lin ; Qiaozhu Mei

Abstract: In this paper, we conducted a systematic comparative analysis of language in different contexts of bursty topics, including web search, news media, blogging, and social bookmarking. We analyze (1) the content similarity and predictability between contexts, (2) the coverage of search content by each context, and (3) the intrinsic coherence of information in each context. Our experiments show that social bookmarking is a better predictor to the bursty search queries, but news media and social blogging media have a much more compelling coverage. This comparison provides insights on how the search behaviors and social information sharing behaviors of users are correlated to the professional news media in the context of bursty events.

5 0.068496332 66 emnlp-2010-Inducing Word Senses to Improve Web Search Result Clustering

Author: Roberto Navigli ; Giuseppe Crisafulli

Abstract: In this paper, we present a novel approach to Web search result clustering based on the automatic discovery of word senses from raw text, a task referred to as Word Sense Induction (WSI). We first acquire the senses (i.e., meanings) of a query by means of a graphbased clustering algorithm that exploits cycles (triangles and squares) in the co-occurrence graph of the query. Then we cluster the search results based on their semantic similarity to the induced word senses. Our experiments, conducted on datasets of ambiguous queries, show that our approach improves search result clustering in terms of both clustering quality and degree of diversification.

6 0.067931943 44 emnlp-2010-Enhancing Mention Detection Using Projection via Aligned Corpora

7 0.06217951 55 emnlp-2010-Handling Noisy Queries in Cross Language FAQ Retrieval

8 0.061231084 100 emnlp-2010-Staying Informed: Supervised and Semi-Supervised Multi-View Topical Analysis of Ideological Perspective

9 0.050999921 92 emnlp-2010-Predicting the Semantic Compositionality of Prefix Verbs

10 0.049531732 117 emnlp-2010-Using Unknown Word Techniques to Learn Known Words

11 0.047009118 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation

12 0.046620108 10 emnlp-2010-A Probabilistic Morphological Analyzer for Syriac

13 0.043745756 61 emnlp-2010-Improving Gender Classification of Blog Authors

14 0.041925773 51 emnlp-2010-Function-Based Question Classification for General QA

15 0.038477432 56 emnlp-2010-Hashing-Based Approaches to Spelling Correction of Personal Names

16 0.037474934 21 emnlp-2010-Automatic Discovery of Manner Relations and its Applications

17 0.037147705 116 emnlp-2010-Using Universal Linguistic Knowledge to Guide Grammar Induction

18 0.036122166 62 emnlp-2010-Improving Mention Detection Robustness to Noisy Input

19 0.034955293 13 emnlp-2010-A Simple Domain-Independent Probabilistic Approach to Generation

20 0.034507822 118 emnlp-2010-Utilizing Extra-Sentential Context for Parsing


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.137), (1, 0.073), (2, -0.055), (3, 0.093), (4, -0.004), (5, -0.008), (6, -0.133), (7, -0.048), (8, -0.119), (9, -0.051), (10, 0.019), (11, 0.092), (12, 0.126), (13, 0.025), (14, 0.01), (15, 0.011), (16, -0.145), (17, 0.049), (18, -0.163), (19, -0.063), (20, 0.024), (21, -0.003), (22, 0.046), (23, -0.057), (24, -0.047), (25, 0.065), (26, -0.028), (27, -0.05), (28, -0.042), (29, 0.135), (30, -0.027), (31, 0.11), (32, 0.066), (33, -0.039), (34, -0.009), (35, -0.043), (36, -0.165), (37, -0.017), (38, 0.115), (39, -0.089), (40, 0.111), (41, 0.199), (42, -0.295), (43, 0.04), (44, -0.066), (45, 0.04), (46, -0.153), (47, -0.278), (48, -0.004), (49, -0.056)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98004794 16 emnlp-2010-An Approach of Generating Personalized Views from Normalized Electronic Dictionaries : A Practical Experiment on Arabic Language

Author: Aida Khemakhem ; Bilel Gargouri ; Abdelmajid Ben Hamadou

Abstract: Electronic dictionaries covering all natural language levels are very relevant for the human use as well as for the automatic processing use, namely those constructed with respect to international standards. Such dictionaries are characterized by a complex structure and an important access time when using a querying system. However, the need of a user is generally limited to a part of such a dictionary according to his domain and expertise level which corresponds to a specialized dictionary. Given the importance of managing a unified dictionary and considering the personalized needs of users, we propose an approach for generating personalized views starting from a normalized dictionary with respect to Lexical Markup Framework LMF-ISO 24613 norm. This approach provides the re-use of already defined views for a community of users by managing their profiles information and promoting the materialization of the generated views. It is composed of four main steps: (i) the projection of data categories controlled by a set of constraints (related to the user‟s profiles), (ii) the selection of values with consistency checking, (iii) the automatic generation of the query‟s model and finally, (iv) the refinement of the view. The proposed approach was con- solidated by carrying out an experiment on an LMF normalized Arabic dictionary. 1

2 0.42581838 9 emnlp-2010-A New Approach to Lexical Disambiguation of Arabic Text

Author: Rushin Shah ; Paramveer S. Dhillon ; Mark Liberman ; Dean Foster ; Mohamed Maamouri ; Lyle Ungar

Abstract: We describe a model for the lexical analysis of Arabic text, using the lists of alternatives supplied by a broad-coverage morphological analyzer, SAMA, which include stable lemma IDs that correspond to combinations of broad word sense categories and POS tags. We break down each of the hundreds of thousands of possible lexical labels into its constituent elements, including lemma ID and part-of-speech. Features are computed for each lexical token based on its local and document-level context and used in a novel, simple, and highly efficient two-stage supervised machine learning algorithm that over- comes the extreme sparsity of label distribution in the training data. The resulting system achieves accuracy of 90.6% for its first choice, and 96.2% for its top two choices, in selecting among the alternatives provided by the SAMA lexical analyzer. We have successfully used this system in applications such as an online reading helper for intermediate learners of the Arabic language, and a tool for improving the productivity of Arabic Treebank annotators. 1 Background and Motivation This paper presents a methodology for generating high quality lexical analysis of highly inflected languages, and demonstrates excellent performance applying our approach to Arabic. Lexical analysis of the written form of a language involves resolving, explicitly or implicitly, several different kinds ofambiguities. Unfortunately, the usual ways of talking about this process are also ambiguous, and our general approach to the problem, though not unprecedented, has uncommon aspects. Therefore, in order 725 Paramveer S. Dhillon, Mark Liberman, Dean Foster, Mohamed Maamouri and Lyle Ungar University of Pennsylvania 345 1Walnut Street Philadelphia, PA 19104, USA {dhi l lon | myl | ungar} @ cis .upenn .edu floo snt|emry@lw|huanrgta on .upenn .eednun maamouri @ ldc .upenn .edu , , to avoid confusion, we begin by describing how we define the problem. In an inflected language with an alphabetic writing system, a central issue is how to interpret strings of characters as forms of words. For example, the English letter-string ‘winds’ will normally be interpreted in one of four different ways, all four of which involve the sequence of two formatives wind+s. The stem ‘wind’ might be analyzed as (1) a noun meaning something like “air in motion”, pronounced [wInd] , which we can associate with an arbitrary but stable identifier like wind n1; (2) a verb wind v1 derived from that noun, and pronounced the same way; (3) a verb wind v2 meaning something like “(cause to) twist”, pronounced [waInd]; or (4) a noun wind n2 derived from that verb, and pro- nounced the same way. Each of these “lemmas”, or dictionary entries, will have several distinguishable senses, which we may also wish to associate with stable identifiers. The affix ‘-s’ might be analyzed as the plural inflection, if the stem is a noun; or as the third-person singular inflection, if the stem is a verb. We see this analysis as conceptually divided into four parts: 1) Morphological analysis, which recognizes that the letter-string ‘winds’ might be (perhaps among other things) wind/N s/PLURAL or wind/V s/3SING; 2) Morphological disambiguation, which involves deciding, for example, that in the phrase “the four winds”, ‘winds’ is probably a plural noun, i.e. wind/N s/PLURAL; 3) Lemma analysis, which involves recognizing that the stem wind in ‘winds’ might be any of the four lemmas listed above – perhaps with a further listing of senses or other sub-entries for each of them; and 4) Lemma disambiguation, deciding, for example, that + + + ProceMedITin,g Ms oasfs thaceh 2u0se1t0ts C,o UnSfAer,e n9c-e11 on O Ectmobpeir ic 2a0l1 M0.e ?tc ho2d0s10 in A Nsastouciraatlio Lnan fogru Cagoem Ppruotcaetisosninagl, L pinaggeusis 7t2ic5s–735, the phrase “the four winds” probably involves the lemma wind n1. Confusingly, the standard word-analysis tasks in computational linguistics involve various combinations of pieces of these logically-distinguished operations. Thus, “part of speech (POS) tagging” is mainly what we’ve called “morphological disambiguation”, except that it doesn’t necessarily require identifying the specific stems and affixes involved. In some cases, it also may require a small amount of “lemma disambiguation”, for example to distinguish a proper noun from a common noun. “Sense disambiguation” is basically a form of what we’ve called “lemma disambiguation”, except that the sense disambiguation task may assume that the part of speech is known, and may break down lexical identity more finely than our system happens to do. “Lemmatization” generally refers to a radically simplified form of “lemma analysis” and “lemma disambiguation”, where the goal is simply to collapse different inflected forms of any similarly-spelled stems, so that the strings ‘wind’, ‘winds’, ‘winded’, ‘winding’ will all be treated as instances of the same thing, without in fact making any attempt to determine the identity of “lemmas” in the traditional sense of dictionary entries. Linguists use the term morphology to include all aspects of lexical analysis under discussion here. But in most computational applications, “morphological analysis” does not include the disambiguation of lemmas, because most morphological analyzers do not reference a set of stable lemma IDs. So for the purposes of this paper, we will continue to discuss lemma analysis and disambiguation as conceptually distinct from morphological analysis and disambiguation, although, in fact, our system disambiguates both of these aspects of lexical analysis at the same time. The lexical analysis of textual character-strings is a more complex and consequential problem in Arabic than it is in English, for several reasons. First, Arabic inflectional morphology is more complex than English inflectional morphology is. Where an English verb has five basic forms, for example, an Arabic verb in principle may have dozens. Second, the Arabic orthographic system writes elements such as prepositions, articles, and possessive pronouns without setting them off by spaces, roughly 726 as if the English phrase “in a way” were written “inaway”. This leads to an enormous increase in the number of distinct “orthographic words”, and a substantial increase in ambiguity. Third, short vowels are normally omitted in Arabic text, roughly as if English “in a way” were written “nway”. As a result, a whitespace/punctuation-delimited letter-string in Arabic text typically has many more alternative analyses than a comparable English letter-string does, and these analyses have many more parts, drawn from a much larger vocabulary of form-classes. While an English “tagger” can specify the morphosyntactic status of a word by choosing from a few dozen tags, an equivalent level of detail in Arabic would require thousands of alternatives. Similarly, the number of lemmas that might play a role in a given letter-sequence is generally much larger in Arabic than in English. We start our labeling of Arabic text with the alternative analyses provided by SAMA v. 3.1, the Standard Arabic Morphological Analyzer (Maamouri et al., 2009). SAMA is an updated version of the earlier Buckwalter analyzers (Buckwalter, 2004), with a number of significant differences in analysis to make it compatible with the LDC Arabic Treebank 3-v3.2 (Maamouri et al., 2004). The input to SAMA is an Arabic orthographic word (a string of letters delimited by whitespace or punctuation), and the output of SAMA is a set of alternative analyses, as shown in Table 1. For a typical word, SAMA produces approximately a dozen alternative analyses, but for certain highly ambiguous words it can produce hundreds of alternatives. The SAMA analyzer has good coverage; for typical texts, the correct analysis of an orthographic word can be found somewhere in SAMA’s list of alternatives about 95% of the time. However, this broad coverage comes at a cost; the list of analytic alternatives must include a long Zipfian tail of rare or contextually-implausible analyses, which collectively are correct often enough to make a large contribution to the coverage statistics. Furthermore, SAMA’s long lists of alternative analyses are not evaluated or ordered in terms of overall or contextual plausibility. This makes the results less useful in most practical applications. Our goal is to rank these alternative analyses so that the correct answer is as near to the top of the list as possible. Despite some risk of confusion, we’ll refer to SAMA’s list of alternative analyses for an orthographic word as potential labels for that word. And despite a greater risk ofconfusion, we’ll refer to the assignment of probabilities to the set of SAMA labels for a particular Arabic word in a particular textual context as tagging, by analogy to the operation of a stochastic part-of-speech tagger, which similarly assigns probabilities to the set of labels available for a word in textual context. Although our algorithms have been developed for the particular case of Arabic and the particular set of lexical-analysis labels produced by SAMA, they should be applicable without modification to the sets of labels produced by any broad-coverage lexical analyzer for the orthographic words of any highlyinflected language. In choosing our approach, we have been moti- vated by two specific applications. One application aims to help learners of Arabic in reading text, by offering a choice of English glosses with associated Arabic morphological analyses and vocalizations. SAMA’s excellent coverage is an important basis for this help; but SAMA’s long, unranked list of alternative analyses for a particular letter-string, where many analyses may involve rare words or alternatives that are completely implausible in the context, will be confusing at best for a learner. It is much more helpful for the list to be ranked so that the correct answer is almost always near the top, and is usually one of the top two or three alternatives. In our second application, this same sort of ranking is also helpful for the linguistically expert native speakers who do Arabic Treebank analysis. These 727 annotators understand the text without difficulty, but find it time-consuming and fatiguing to scan a long list of rare or contextually-implausible alternatives for the correct SAMA output. Their work is faster and more accurate if they start with a list that is ranked accurately in order of contextual plausibility. Other applications are also possible, such as vocalization of Arabic text for text-to-speech synthesis, or lexical analysis for Arabic parsing. However, our initial goals have been to rank the list of SAMA outputs for human users. We note in passing that the existence of set of stable “lemma IDs” is an unusual feature of SAMA, which in our opinion ought to be emulated by approaches to lexical analysis in other languages. The lack of such stable lemma IDs has helped to disguise the fact that without lemma analysis and disambiguation, morphological analyses and disambiguation is only a partial solution to the problem of lexical analysis. In principle, it is obvious that lemma disambiguation and morphological disambiguation are mutually beneficial. If we know the answer to one of the questions, the other one is easier to answer. However, these two tasks require rather different sets of contextual features. Lemma disambiguation is similar to the problem of word-sense disambiguation on some definitions, they are identical and as a result, it benefits from paragraph-level and documentlevel bag-of-words attributes that help to character– – ize what the text is “about” and therefore which lemmas are more likely to play a role in it. In contrast, morphological disambiguation mainly depends on features of nearby words, which help to characterize how inflected forms of these lemmas might fit into local phrasal structures. 2 Problem and Methodology Consider a collection oftokens (observations), ti, referred to by index i∈ {1, . . . , n}, where each token fise raressdo tcoia bteyd i nwdiethx a s∈et { of p features, xij, efaocr hth teo k jethn feature, and a label, li, which is a combination of a lemma and a morphological analysis. We use indicator functions yik to indicate whether or not the kth label for the ith token is present. We represent the complete set of features and labels for the entire training data using matrix notation as X and Y , respectively. Our goal is to predict the label l (or equivalently, the vector y for a given feature vector x. A standard linear regression model of this problem would be y = xβ + ? (1) The standard linear regression estimate of β (ig- ×× × noring, for simplicity the fact that the ys are 0/1) is: βˆ = (XTtrainXtrain)−1XtTrainYtrain (2) where Ytrain is an n h matrix containing 0s and 1s indicating whise tahner n or nho mt aetarcixh coofn tthaien ihn possible labels is the correct label (li) for each of the n tokens ti, Xtrain is an n p matrix of context features for each of thei n tokens, pth mea ctoriexff oifcie cnotnst are p hs .f However, this is a large, sparse, multiple l hab.el problem, and the above formulation is neither statistically nor computationally efficient. Each observation (x, y) consists of thousands of features associated with thousands of potential labels, almost all of which are zero. Worse, the matrix of coefficients β, to be estimated is large (p h) and one should thus use some soatretd do ifs tr laarngsefe (pr learning dto o nshea srheo strength across the different labels. We present a novel principled and highly computationally efficient method of estimating this multilabel model. We use a two stage procedure, first using a subset (Xtrain1 , Ytrain1) of training data to give a fast approximate estimate of β; we then use a second smaller subset of the training data (Xtrain2, Ytrain2,) to “correct” these estimates in a eβˆx way that we will show can be viewed as a specialized shrinkage. Our first stage estimation approximates β, but avoids the expensive computa728 tion of (XTtrainXtrain)−1. Our second stage corrects (shrinks) these initial estimates in a manner specialized to this problem. The second stage takes advantage of the fact that we only need to consider those candidate labels produced by SAMA. Thus, only dozens of the thousands of possible labels are considered for each token. We now present our algorithm. We start with a corpus D of documents d of labeled Arabic text. As described above, each token, ti is associated with a set of features characterizing its context, computed from the other words in the same document, and a label, li = (lemmai, morphologyi), which is a combination of a lemma and a morphological analysis. As described below, we introduce a novel factorization of the morphology into 15 different components. Our estimation algorithm, shown in Algorithm 1, has two stages. We partition the training corpus into × two subsets, one of which (Xtrain1) is used to estimate the coefficients βs and the other of which (Xtrain2) is used to optimally “shrink” these coefficient estimates to reduce variance and prevent overfitting due to data sparsity. For the first stage of our estimation procedure, we simplify the estimate of the (β) matrix (Equation 2) to avoid the inversion of the very high dimensional (p p) matrix (XTX) by approximating (XTX) by (itps diagonal, Var(X), the inverse of which is trivial to compute; i.e. we estimate β using βˆ = Var(Xtrain1)−1XtTrain1Ytrain1 (3) For the second stage, we assume that the coefficients for each feature can be shrunk differently, but that coefficients for each feature should be shrunk the same regardless of what label they are predicting. Thus, for a given observation we predict: ˆgik=Xpwjβˆjkxij (4) Xj=1 where the weights wj indicate how much to shrink each of the p features. In practice, we fold the variance of each of the j features into the weight, giving a slightly modified equation: ˆgik=Xj=p1αjβj∗kxij (5) where β∗ = XtTrain1Ytrain1 is just a matrix of the counts of how often each context feature shows up with each label in the first training set. The vector α, which we will estimate by regression, is just the shrinkage weights w rescaled by the feature variance. Note that the formation here is different from the first stage. Instead of having each observation be a token, we now let each observation be a (token, label) pair, but only include those labels that were output by SAMA. For a given token ti and potential label lk, our goal is to approximate the indicator function g(i, k), which is 1 if the kth label of token ti is present, and 0 otherwise. We find candidate labels using a morphological analyzer (namely SAMA), which returns a set of possible candidate labels, say C(t), for each Arabic token t. Our pre- dicted label for ti is then argmaxk∈C(ti)g(i, k). The regression model for learning tthe weights αj in the second stage thus has a row for each label g(i, k) associated with a SAMA candidate for each token i = ntrain1+1 . . . ntrain2 in the second training set. The value of g(i, k) is predicted as a function of the feature vector zijk = βj∗kxij. The shrinkage coefficients, αj, could be estimated from theory, using a version of James-Stein shrinkage (James and Stein, 1961), but in practice, superior results are obtained by estimating them empirically. Since there are only p of them (unlike the p ∗ h βs), a relatively asmreal oln training sheetm mis ( usunflfi kceie tnhte. Wp ∗e hfou βnsd), that regression-SVMs work slightly better than linear regression and significantly better than standard classification SVMs for this problem. Prediction is then done in the obvious way by taking the tokens in a test corpus Dtest, generating context features and candidate SAMA labels for each token ti, and selected the candidate label with the highest score ˆ g(i, k) that we set out to learn. More formally, The model parameters β∗ and α produced by the algorithm allow one to estimate the most likely label for a new token ti out of a set of can- didate labels C(ti) using kpred= argmaxk∈C(ti)jX=p1αjβj∗kxij (6) The most expensive part of the procedure is estimating β∗, which requires for each token in cor729 Algorithm 1 Training algorithm. Input: A training corpusDtrainof n observations (Xtrain, Ytrain) Partition Dtrain into two sets, D1 and D2, of sizes ntrain1 and ntrain2 = n − ntrain1 observations // Using D1, estimat=e β∗ βj∗k = Pin=tr1ain1 xijyik for the jth feature and kth label // Using D2, estimate αj // Generate new “features” Z and the true labels g(i, k) for each of the SAMA candidate labels for each of the tokens in D2 zijk = βj∗kxij for iin i= ntrain1 + 1...ntrain2 Estimate αj for the above (feature,label) pairs (zijk, g(i, k)) using Regression SVMs Output: α and β∗ pus D1, (a subset of D), finding the co-occurrence frequencies of each label element (a lemma, or a part of the morphological segmentation) with the target token and jointly with the token and with other tokens or characters in the context of the token of interest. For example, given an Arabic token, “yHlm”, we count what fraction of the time it is associated with each lemma (e.g. Halamu 1), count(lemma=Halam-u 1, token=yHlm) and each segment (e.g. “ya”), count(segment=ya, token=yHlm). (Of course, most tokens never show up with most lemmas or segments; this is not a problem.) We also find the base rates of the components of the labels (e.g., count(lemma=Halam-u 1), and what fraction of the time the label shows up in various contexts, e.g. count(lemma=Halam-u 1, previous token = yHlm). We describe these features in more detail below. 3 Features and Labels used for Training Our approach to tagging Arabic differs from conventional approaches in the two-part shrinkage-based method used, and in the choice of both features and labels used in our model. For features, we study both local context variables, as described above, and document-level word frequencies. For the labels, the key question is what labels are included and how they are factored. Standard “taggers” work by doing an n-way classification of all the alternatives, which is not feasible here due to the thousands of possible labels. Standard approaches such as Conditional Random Fields (CRFs) are intractable with so many labels. Moreover, few if any taggers do any lemma disambiguation; that is partly because one must start with some standard inventory of lemmas, which are not available for most languages, perhaps because the importance of lemma disambiguation has been underestimated. We make a couple of innovations to deal with these issues. First, we perform lemma disambiguation in addition to “tagging”. As mentioned above, lemmas and morphological information are not independent; the choice of lemma often influences morphology and vice versa. For example, Table 1 contains two analyses for the word qbl. For the first analysis, where the lemma is qabil-a 1 and the gloss is accept/receive/approve + he/it [verb], the word is a verb. However, for the second analysis, where the lemma is qabol 1 and the gloss is before, the word is a noun. Simultaneous lemma disambiguation and tagging introduces additional complexity: An analysis of ATB and SAMA shows that there are approximately 2,200 possible morphological analyses (“tags”) and 40,000 possible lemmas; even accounting for the fact that most combinations of lemmas and morphological analyses don’t occur, the size of the label space is still in the order of tens of thousands. To deal with data sparsity, our second innovation is to factor the labels. We factor each label linto a set of 16 label elements (LEs). These include lemmas, as well as morphological elements such as basic partof-speech, suffix, gender, number, mood, etc. These are explained in detail below. Thus, since each label l is a set of 15 categorical variables, each y in the first learning stage is actually a vector with 16 nonzero components and thousands of zeros. Since we do simultaneous estimation of the entire set of label elements, the value g(i, k) being predicted in the second learning phase is 1 if the entire label set is correct, and zero otherwise. We do not learn separate models for each label. 3.1 Label Elements (LEs) The fact that there are tens of thousands of possible labels presents the problem of extreme sparsity of label distribution in the training data. We find that a model that estimates coefficients β∗ to predict a sin730 data on basic POS include whether a noun is proper or common, whether a verb is transitive or not, etc. Both the basic POS and its suffix may have person, gender and number data. gle label (a label being in the Cartesian product of the set of label elements) yields poor performance. Therefore, as just mentioned, we factor each label l into a set of label elements (LEs), and learn the correlations β∗ between features and label elements, rather than features and entire label sets. This reduces, but does not come close to eliminating, the problem sparsity. A complete list of these LEs and their possible values is detailed in Table 2. 3.2 Features 3.2.1 Local Context Features We take (t, l) pairs from D2, and for each such pair generate features Z based on co-occurrence statistics β∗ in D1, as mentioned in Algorithm 2. These statistics include unigram co-occurrence frequencies of each label with the target token and bigram co-occurrence of the label with the token and with other tokens or characters in the context of the target token. We define them formally in Table 3. Let Zbaseline denote the set of all such basic features based on the local context statistics of the target token, namely the words and letters preceding and following it. We will use this set to create a baseline model. generate feature sets for our regression SVMs. For each label element (LE) e, we define a set of features Ze similar to Zbaseline; these features are based on co-occurrence frequencies of the particular LE e, not the entire label l. Finally, we define an aggregate feature set Zaggr as follows: Zaggr = Zbaseline [ {Ze} (7) where e ∈ {lemma, pre1, pre2, det, pos, dpos, suf, perpos, numpos, genpos, persuf, numsuf, gensuf, mood, pron}. 3.2.2 Document Level Features When trying to predict the lemma, it is useful to include not just the words and characters immediately adjacent to the target token, but also the all the words in the document. These words capture the “topic” of the document, and help to disambiguate different lemmas, which tend to be used or not used based on the topic being discussed, similarly to the way that word sense disambiguation systems in English sometimes use the “bag of words” the document to disambiguate, for example a “bank” for depositing money from a “bank” of a river. More precisely, we augment the features for each target token with the counts of each word in the document (the “term frequency” tf) in which the token occurs with a given label. Zfull = Zaggr [ Ztf (8) This set Zfull is our final feature set. We use Zfull to train an SVM model Mfull; this is our final predictive model. 731 3.3 Corpora used for Training and Testing We use three modules of the Penn Arabic Treebank (ATB) (Maamouri et al., 2004), namely ATB 1, ATB2 and ATB3 as our corpus of labeled Arabic text, D. Each ATB module is a collection of newswire data from a particular agency. ATB1 uses the Associated Press as a source, ATB2 uses Ummah, and ATB3 uses Annahar. D contains a total of 1,835 documents, accounting for approximately 350,000 words. We construct the training and testing sets Dtrain and Dtest from D using 10-fold cross validation, and we construct D1 and D2 from Dtrain by randomly performing a 9: 1 split. As mentioned earlier, we use the SAMA morphological analyzer to obtain candidate labels C(t) for each token t while training and testing an SVM model on D2 and Dtest respectively. A sample output of SAMA is shown in Table 1. To improve coverage, we also add to C(t) all the labels lseen for t in D1. We find that doing so improves coverage to 98% . This is an upper bound on the accuracy of our model. C(t) = SAMA(t) 4 [ {l|(t, l) ∈ D1} (9) Results We use two metrics of accuracy: A1, which measures the percentage of tokens for which the model assigns the highest score to the correct label or LE value (or E1= 100 A1, the corresponding percentage error), 1a=nd 1 A2, wAh1i,ch th measures tnhdei percentage of tokens for which the correct label or LE value is one of the two highest ranked choices returned by the model (or E2 = 100 A2). We test our bmyod theel Mfull on Dtest a =nd 1 a0c0hi −eve A A2)1. and A2 scores of 90.6% and 96.2% respectively. The accuracy achieved by our Mfull model is, to the best of our knowledge, higher than prior approaches have been able to achieve so far for the problem of combined morphological and lemma disambiguation. This is all the more impressive considering that the upper bound on accuracy for our model is 98% because, as described above, our set of candidate labels is incomplete. In order to analyze how well different LEs can be predicted, we train an SVM model Me for each LE e using the feature set Ze, and test all such models − − on Dtest. The results for all the LEs are reported in the form of error percentages E1 and E2 in Table 4. reported are 10 fold cross validation test accuracies and no parameters have been tuned on them. A comparison of the results for Mfull with the results for Mlemma and Mpos is particularly informative. We see that Mfull is able to achieve a substantially lower E1 error score (9.4%) than Mlemma (11.1%) and Mpos (23.4%); in other words, we find that our full model is able to predict lemmas and basic parts-of-speech more accurately than the individ- ual models for each of these elements. We examine the effect of varying the size of D2, i.e. the number of SVM training instances, on the performance of Mfull on Dtest, and find that with increasing sizes of D2, E1 reduces only slightly from 9.5% to 9.4%, and shows no improvement thereafter. We also find that the use of documentlevel features in Mlemma reduces E1 and E2 percentages for Mlemma by 5.7% and 3.2% respectively. 4.1 Comparison to Alternate Approaches 4.1.1 Structured Prediction Models Preliminary experiments showed that knowing the predicted labels (lemma + morphology) of the surrounding words can slightly improve the predictive accuracy of our model. To further investigate this effect, we tried running experiments using different structured models, namely CRF (Conditional Random Fields) (Lafferty et al., 2001), (Structured) MIRA (Margin Infused Relaxation Algorithm) (Crammer et al., 2006) and Structured Perceptron (Collins, 2002). We used linear chain 732 CRFs as implemented in MALLET Toolbox (McCallum, 2001) and for Structured MIRA and Perceptron we used their implementations from EDLIN Toolbox (Ganchev and Georgiev, 2009). However, given the vast label space of our problem, running these methods proved infeasible. The time complexity of these methods scales badly with the number of labels; It took a week to train a linear chain CRF for only ∼ 50 labels and though MIRA and Perceptron are o 5n0lin leab algorithms, they MalsIoR Abec aonmde P ienr-tractable beyond a few hundred labels. Since our label space contains combinations of lemmas and morphologies, so even after factoring, the dimension of the label space is in the order of thousands. We also tried a na¨ ıve version (two-pass approximation) of these structured models. In addition to the features in Zfull, we include the predicted labels for the tokens preceding and following the target token as features. This new model is not only slow to train, but also achieves only slightly lower error rates (1.2% lower E1 and 1.0% lower E2) than Mfull. This provides an upper bound on the benefit of using the more complex structured models, and suggests that given their computational demands our (unstructured) model Mfull is a better choice. 4.1.2 MADA (Habash and Rambow, 2005) perform morphological disambiguation using a morphological analyzer. (Roth et al., 2008) augment this with lemma disambiguation; they call their system MADA. Our work differs from theirs in a number of respects. Firstly, they don’t use the two step regression procedure that we use. Secondly, they use only “unigram” features. Also, they do not learn a single model from a feature set based on labels and LEs; instead, they combine models for individual elements by using weighted agreement. We trained and tested MADA v2.32 using its full feature set on the same Dtrain and Dtest. We should point out that this is not an exact comparison, since MADA uses the older Buckwalter morphological analyzer.1 4.1.3 Other Alternatives Unfactored Labels: To illustrate the benefit obtained by breaking down each label l into 1A new version of MADA was released very close to the submission deadline for this conference. LEs, we contrast the performance of our Mfull model to an SVM model Mbaseline trained using only the feature set Zbaseline, which only contains features based on entire labels, those based on individual LEs. Independent lemma and morphology prediction: Another alternative approach is to predict lemmas and morphological analyses separately. We construct a feature set Zlemma0 = Zfull − Zlemma and train an SVM model Mlemma0 using this feature set. Labels are then predicted by simply combining the results predicted independently by Mlemma and Mlemma0 . Let Mind denote this approach. Unigram Features: Finally, we also consider a context-less approach, i.e. using only “unigram” features for labels as well as LEs. We call this feature set Zuni, and the corresponding SVM model Muni. The results of these various models, along with those of Mfull are summarized in Table 5. We see that Mfull has roughly half the error rate of the stateof-the-art MADA system. Note: The results reported are 10 fold cross validation test accuracies and no parameters have been tuned on them. We used same train-test splits for all the datasets. 5 Related Work (Hajic, 2000) show that for highly inflectional languages, the use of a morphological analyzer improves accuracy of disambiguation. (Diab et al., 2004) perform tokenization, POS tagging and base phrase chunking using an SVM based learner. (Ahmed and N ¨urnberger, 2008) perform word-sense disambiguation using a Naive Bayesian 733 model and rely on parallel corpora and match- ing schemes instead of a morphological analyzer. (Kulick, 2010) perform simultaneous tokenization and part-of-speech tagging for Arabic by separating closed and open-class items and focusing on the likelihood of possible stems of openclass words. (Mohamed and K ¨ubler, 2010) present a hybrid method between word-based and segmentbased POS tagging for Arabic and report good results. (Toutanova and Cherry, 2009) perform joint lemmatization and part-of-speech tagging for English, Bulgarian, Czech and Slovene, but they do not use the two step estimation-shrinkage model described in this paper; nor do they factor labels. The idea of joint lemmatization and part-of-speech tagging has also been discussed in the context of Hungarian in (Kornai, 1994). A substantial amount of relevant work has been done previously for Hebrew. (Adler and Elhadad, 2006) perform Hebrew morphological disambiguation using an unsupervised morpheme-based HMM, but they report lower scores than those achieved by our model. Moreover, their analysis doesn’t include lemma IDs, which is a novelty of our model. (Goldberg et al., 2008) extend the work of (Adler and El- hadad, 2006) by using an EM algorithm, and achieve an accuracy of 88% for full morphological analysis, but again, this does not include lemma IDs. To the best of our knowledge, there is no existing research for Hebrew that does what we did for Arabic, namely to use simultaneous lemma and morphological disambiguation to improve both. (Dinur et al., 2009) show that prepositions and function words can be accurately segmented using unsupervised methods. However, by using this method as a preprocessing step, we would lose the power of a simultaneous solution for these problems. Our method is closer in style to a CRF, giving much of the accuracy gains of simultaneous solution, while being about 4 orders of magnitude easier to train. We believe that our use of factored labels is novel for the problem of simultaneous lemma and morphological disambiguation; however, (Smith et al., 2005) and (Hatori et al., 2008) have previously made use of features based on parts of labels in CRF models for morphological disambiguation and word-sense disambiguation respectively. Also, we note that there is a similarity between our two-stage machine learning approach and log-linear models in machine translation that break the data in two parts, estimating log-probabilities of generative models from one part, and discriminatively re-weighting the models using the second part. 6 Conclusions We introduced a new approach to accurately predict labels consisting of both lemmas and morphological analyses for Arabic text. We obtained an accuracy of over 90% substantially higher than current state-of-the-art systems. Key to our success is the factoring of labels into lemma and a large set of morphosyntactic elements, and the use of an algorithm that computes a simple initial estimate of the coefficient relating each contextual feature to each label element (simply by counting co-occurrence) and then regularizes these features by shrinking each of the coefficients for each feature by an amount determined by supervised learning using only the candidate label sets produced by SAMA. We also showed that using features of word ngrams is preferable to using features of only individual tokens of data. Finally, we showed that a model using a full feature set based on labels as well as – factored components of labels, which we call label elements (LEs) works better than a model created by combining individual models for each LE. We believe that the approach we have used to create our model can be successfully applied not just to Arabic but also to other languages such as Turkish, Hungarian and Finnish that have highly inflectional morphology. The current accuracy of of our model, getting the correct answer among the top two choices 96.2% of the time is high enough to be highly useful for tasks such as aiding the manual annotation of Arabic text; a more complete automation would require that accuracy for the single top choice. Acknowledgments We woud like to thank everyone at the Linguistic Data Consortium, especially Christopher Cieri, David Graff, Seth Kulick, Ann Bies, Wajdi Zaghouani and Basma Bouziri for their help. We also wish to thank the anonymous reviewers for their comments and suggestions. 734 References Meni Adler and Michael Elhadad. 2006. An Unsupervised Morpheme-Based HMM for Hebrew Morphological Disambiguation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics. Farag Ahmed and Andreas N ¨urnberger. 2008. Arabic/English Word Translation Disambiguation using Parallel Corpora and Matching Schemes. In Proceedings of EAMT’08, Hamburg, Germany. Tim Buckwalter. 2004. Buckwalter Arabic Morphological Analyzer version 2.0. Michael Collins. 2002. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. In Proceedings of EMNLP’02. Koby Crammer, Ofer Dekel, Joseph Keshet, Shai ShalevShwartz, and Yoram Singer. 2006. Online PassiveAggressive Algorithms. Journal of Machine Learning Research, 7:551–585. Mona Diab, Kadri Hacioglu, and Daniel Jurafsky. 2004. Automatic Tagging of Arabic text: From Raw Text to Base Phrase Chunks. In Proceedings of the 5th Meeting of the North American Chapter of the Association for Computational Linguistics/Human Language Technologies Conference (HLT-NAACL’04). Elad Dinur, Dmitry Davidov, and Ari Rappoport. 2009. Unsupervised Concept Discovery in Hebrew Using Simple Unsupervised Word Prefix Segmentation for Hebrew and Arabic. In Proceedings of the EACL 2009 Workshop on Computational Approaches to Semitic Languages. Kuzman Ganchev and Georgi Georgiev. 2009. Edlin: An Easy to Read Linear Learning Framework. In Proceedings of RANLP’09. Yoav Goldberg, Meni Adler, and Michael Elhadad. 2008. EM Can Find Pretty Good HMM POS-Taggers (When Given a Good Start)*. In Proceedings of ACL’08. Nizar Habash and Owen Rambow. 2005. Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop. In Proceedings of ACL’05, Ann Arbor, MI, USA. Jan Hajic. 2000. Morphological Tagging: Data vs. Dictionaries. In Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL’00). Jun Hatori, Yusuke Miyao, and Jun’ichi Tsujii. 2008. Word Sense Disambiguation for All Words using TreeStructured Conditional Random Fields. In Proceedings of COLing’08. W. James and Charles Stein. 1961 . Estimation with Quadratic Loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1. Andr a´s Kornai. 1994. On Hungarian morphology (LinDissertationes 14). Lin- guistica, Series A: Studia et guistics Institute of Hungarian Academy of Sciences, Budapest. Seth Kulick. 2010. Simultaneous Tokenization and Partof-Speech Tagging for Arabic without a Morphological Analyzer. In Proceedings of ACL’10. John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of ICML’01, pages 282–289. Mohamed Maamouri, Ann Bies, and Tim Buckwalter. 2004. The Penn Arabic Treebank: Building a Large Scale Annotated Arabic Corpus. In Proceedings of NEMLAR Conference on Arabic Language Resources and Tools. Mohamed Maamouri, David Graff, Basma Bouziri, Sondos Krouna, and Seth Kulick. 2009. LDC Standard Arabic Morphological Analyzer (SAMA) v. 3.0. Andrew McCallum, 2001. MALLET: A Machine Learning for Language Toolkit. Software available at http : / /mal let .cs .umas s .edu. Emad Mohamed and Sandra K ¨ubler. 2010. Arabic Part of Speech Tagging. In Proceedings of LREC’10. Ryan Roth, Owen Rambow, Nizar Habash, Mona Diab, and Cynthia Rudin. 2008. Arabic Morphological Tagging, Diacritization, and Lemmatization Using Lexeme Models and Feature Ranking. In Proceedings of ACL’08, Columbus, Ohio, USA. Noah A. Smith, David A. Smith, and Roy W. Tromble. 2005. Context-Based Morphological Disambiguation with Random Fields*. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP). Kristina Toutanova and Colin Cherry. 2009. A Global Model for Joint Lemmatization and Part-of-Speech Prediction. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing, pages 486–494. 735

3 0.39497337 55 emnlp-2010-Handling Noisy Queries in Cross Language FAQ Retrieval

Author: Danish Contractor ; Govind Kothari ; Tanveer Faruquie ; L V Subramaniam ; Sumit Negi

Abstract: Recent times have seen a tremendous growth in mobile based data services that allow people to use Short Message Service (SMS) to access these data services. In a multilingual society it is essential that data services that were developed for a specific language be made accessible through other local languages also. In this paper, we present a service that allows a user to query a FrequentlyAsked-Questions (FAQ) database built in a local language (Hindi) using Noisy SMS English queries. The inherent noise in the SMS queries, along with the language mismatch makes this a challenging problem. We handle these two problems by formulating the query similarity over FAQ questions as a combinatorial search problem where the search space consists of combinations of dictionary variations of the noisy query and its top-N translations. We demonstrate the effectiveness of our approach on a real-life dataset.

4 0.37486404 21 emnlp-2010-Automatic Discovery of Manner Relations and its Applications

Author: Eduardo Blanco ; Dan Moldovan

Abstract: This paper presents a method for the automatic discovery of MANNER relations from text. An extended definition of MANNER is proposed, including restrictions on the sorts of concepts that can be part of its domain and range. The connections with other relations and the lexico-syntactic patterns that encode MANNER are analyzed. A new feature set specialized on MANNER detection is depicted and justified. Experimental results show improvement over previous attempts to extract MANNER. Combinations of MANNER with other semantic relations are also discussed.

5 0.33585313 13 emnlp-2010-A Simple Domain-Independent Probabilistic Approach to Generation

Author: Gabor Angeli ; Percy Liang ; Dan Klein

Abstract: Percy Liang UC Berkeley Berkeley, CA 94720 pliang@cs.berkeley.edu Dan Klein UC Berkeley Berkeley, CA 94720 klein@cs.berkeley.edu We operate in a setting in which we are only given examples consisting of (i) a set of database records We present a simple, robust generation system which performs content selection and surface realization in a unified, domain-independent framework. In our approach, we break up the end-to-end generation process into a sequence of local decisions, arranged hierarchically and each trained discriminatively. We deployed our system in three different domains—Robocup sportscasting, technical weather forecasts, and common weather forecasts, obtaining results comparable to state-ofthe-art domain-specific systems both in terms of BLEU scores and human evaluation.

6 0.30997059 73 emnlp-2010-Learning Recurrent Event Queries for Web Search

7 0.30442911 117 emnlp-2010-Using Unknown Word Techniques to Learn Known Words

8 0.28656563 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media

9 0.28052744 44 emnlp-2010-Enhancing Mention Detection Using Projection via Aligned Corpora

10 0.25054401 91 emnlp-2010-Practical Linguistic Steganography Using Contextual Synonym Substitution and Vertex Colour Coding

11 0.23821439 66 emnlp-2010-Inducing Word Senses to Improve Web Search Result Clustering

12 0.2198692 119 emnlp-2010-We're Not in Kansas Anymore: Detecting Domain Changes in Streams

13 0.2139966 61 emnlp-2010-Improving Gender Classification of Blog Authors

14 0.2024373 100 emnlp-2010-Staying Informed: Supervised and Semi-Supervised Multi-View Topical Analysis of Ideological Perspective

15 0.18918665 84 emnlp-2010-NLP on Spoken Documents Without ASR

16 0.18910956 92 emnlp-2010-Predicting the Semantic Compositionality of Prefix Verbs

17 0.18843129 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation

18 0.18695189 116 emnlp-2010-Using Universal Linguistic Knowledge to Guide Grammar Induction

19 0.18159984 59 emnlp-2010-Identifying Functional Relations in Web Text

20 0.17330347 62 emnlp-2010-Improving Mention Detection Robustness to Noisy Input


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.01), (12, 0.023), (29, 0.042), (30, 0.019), (56, 0.046), (62, 0.014), (66, 0.056), (72, 0.049), (76, 0.625), (77, 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93653685 16 emnlp-2010-An Approach of Generating Personalized Views from Normalized Electronic Dictionaries : A Practical Experiment on Arabic Language

Author: Aida Khemakhem ; Bilel Gargouri ; Abdelmajid Ben Hamadou

Abstract: Electronic dictionaries covering all natural language levels are very relevant for the human use as well as for the automatic processing use, namely those constructed with respect to international standards. Such dictionaries are characterized by a complex structure and an important access time when using a querying system. However, the need of a user is generally limited to a part of such a dictionary according to his domain and expertise level which corresponds to a specialized dictionary. Given the importance of managing a unified dictionary and considering the personalized needs of users, we propose an approach for generating personalized views starting from a normalized dictionary with respect to Lexical Markup Framework LMF-ISO 24613 norm. This approach provides the re-use of already defined views for a community of users by managing their profiles information and promoting the materialization of the generated views. It is composed of four main steps: (i) the projection of data categories controlled by a set of constraints (related to the user‟s profiles), (ii) the selection of values with consistency checking, (iii) the automatic generation of the query‟s model and finally, (iv) the refinement of the view. The proposed approach was con- solidated by carrying out an experiment on an LMF normalized Arabic dictionary. 1

2 0.88897628 121 emnlp-2010-What a Parser Can Learn from a Semantic Role Labeler and Vice Versa

Author: Stephen Boxwell ; Dennis Mehay ; Chris Brew

Abstract: In many NLP systems, there is a unidirectional flow of information in which a parser supplies input to a semantic role labeler. In this paper, we build a system that allows information to flow in both directions. We make use of semantic role predictions in choosing a single-best parse. This process relies on an averaged perceptron model to distinguish likely semantic roles from erroneous ones. Our system penalizes parses that give rise to low-scoring semantic roles. To explore the consequences of this we perform two experiments. First, we use a baseline generative model to produce n-best parses, which are then re-ordered by our semantic model. Second, we use a modified version of our semantic role labeler to predict semantic roles at parse time. The performance of this modified labeler is weaker than that of our best full SRL, because it is restricted to features that can be computed directly from the parser’s packed chart. For both experiments, the resulting semantic predictions are then used to select parses. Finally, we feed the selected parses produced by each experiment to the full version of our semantic role labeler. We find that SRL performance can be improved over this baseline by selecting parses with likely semantic roles.

3 0.6462388 40 emnlp-2010-Effects of Empty Categories on Machine Translation

Author: Tagyoung Chung ; Daniel Gildea

Abstract: We examine effects that empty categories have on machine translation. Empty categories are elements in parse trees that lack corresponding overt surface forms (words) such as dropped pronouns and markers for control constructions. We start by training machine translation systems with manually inserted empty elements. We find that inclusion of some empty categories in training data improves the translation result. We expand the experiment by automatically inserting these elements into a larger data set using various methods and training on the modified corpus. We show that even when automatic prediction of null elements is not highly accurate, it nevertheless improves the end translation result.

4 0.63802326 26 emnlp-2010-Classifying Dialogue Acts in One-on-One Live Chats

Author: Su Nam Kim ; Lawrence Cavedon ; Timothy Baldwin

Abstract: We explore the task of automatically classifying dialogue acts in 1-on-1 online chat forums, an increasingly popular means of providing customer service. In particular, we investigate the effectiveness of various features and machine learners for this task. While a simple bag-of-words approach provides a solid baseline, we find that adding information from dialogue structure and inter-utterance dependency provides some increase in performance; learners that account for sequential dependencies (CRFs) show the best performance. We report our results from testing using a corpus of chat dialogues derived from online shopping customer-feedback data.

5 0.33607081 21 emnlp-2010-Automatic Discovery of Manner Relations and its Applications

Author: Eduardo Blanco ; Dan Moldovan

Abstract: This paper presents a method for the automatic discovery of MANNER relations from text. An extended definition of MANNER is proposed, including restrictions on the sorts of concepts that can be part of its domain and range. The connections with other relations and the lexico-syntactic patterns that encode MANNER are analyzed. A new feature set specialized on MANNER detection is depicted and justified. Experimental results show improvement over previous attempts to extract MANNER. Combinations of MANNER with other semantic relations are also discussed.

6 0.33426601 53 emnlp-2010-Fusing Eye Gaze with Speech Recognition Hypotheses to Resolve Exophoric References in Situated Dialogue

7 0.31567472 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media

8 0.31168893 42 emnlp-2010-Efficient Incremental Decoding for Tree-to-String Translation

9 0.30348378 118 emnlp-2010-Utilizing Extra-Sentential Context for Parsing

10 0.3030999 98 emnlp-2010-Soft Syntactic Constraints for Hierarchical Phrase-Based Translation Using Latent Syntactic Distributions

11 0.2980794 65 emnlp-2010-Inducing Probabilistic CCG Grammars from Logical Form with Higher-Order Unification

12 0.29368585 94 emnlp-2010-SCFG Decoding Without Binarization

13 0.29052186 60 emnlp-2010-Improved Fully Unsupervised Parsing with Zoomed Learning

14 0.28998646 24 emnlp-2010-Automatically Producing Plot Unit Representations for Narrative Text

15 0.28642577 107 emnlp-2010-Towards Conversation Entailment: An Empirical Investigation

16 0.28642201 114 emnlp-2010-Unsupervised Parse Selection for HPSG

17 0.28139773 106 emnlp-2010-Top-Down Nearly-Context-Sensitive Parsing

18 0.2768493 46 emnlp-2010-Evaluating the Impact of Alternative Dependency Graph Encodings on Solving Event Extraction Tasks

19 0.27572441 55 emnlp-2010-Handling Noisy Queries in Cross Language FAQ Retrieval

20 0.26915976 4 emnlp-2010-A Game-Theoretic Approach to Generating Spatial Descriptions