acl acl2012 acl2012-137 knowledge-graph by maker-knowledge-mining

137 acl-2012-Lemmatisation as a Tagging Task


Source: pdf

Author: Andrea Gesmundo ; Tanja Samardzic

Abstract: We present a novel approach to the task of word lemmatisation. We formalise lemmatisation as a category tagging task, by describing how a word-to-lemma transformation rule can be encoded in a single label and how a set of such labels can be inferred for a specific language. In this way, a lemmatisation system can be trained and tested using any supervised tagging model. In contrast to previous approaches, the proposed technique allows us to easily integrate relevant contextual information. We test our approach on eight languages reaching a new state-of-the-art level for the lemmatisation task.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Lemmatisation as a Tagging Task Andrea Gesmundo Department of Computer Science University of Geneva andrea . [sent-1, score-0.047]

2 Abstract We present a novel approach to the task of word lemmatisation. [sent-3, score-0.032]

3 We formalise lemmatisation as a category tagging task, by describing how a word-to-lemma transformation rule can be encoded in a single label and how a set of such labels can be inferred for a specific language. [sent-4, score-1.139]

4 In this way, a lemmatisation system can be trained and tested using any supervised tagging model. [sent-5, score-0.82]

5 In contrast to previous approaches, the proposed technique allows us to easily integrate relevant contextual information. [sent-6, score-0.028]

6 We test our approach on eight languages reaching a new state-of-the-art level for the lemmatisation task. [sent-7, score-0.859]

7 1 Introduction Lemmatisation and part-of-speech (POS) tagging are necessary steps in automatic processing of language corpora. [sent-8, score-0.117]

8 This annotation is a prerequisite for developing systems for more sophisticated automatic processing such as information retrieval, as well as for using language corpora in linguistic research and in the humanities. [sent-9, score-0.027]

9 Lemmatisation is especially important for processing morphologically rich languages, where the number of different word forms is too large to be included in the part-of-speech tag set. [sent-10, score-0.226]

10 The work on morphologically rich languages suggests that using comprehensive morphological dictionaries is necessary for achieving good results (Hajič, 2000; Erjavec and Džeroski, 2004). [sent-11, score-0.444]

11 However, such dictionaries are constructed manually and they cannot be expected to be developed quickly for many languages. [sent-12, score-0.097]

12 In this paper, we present a new general approach to the task of lemmatisation which can be used to overcome the shortage of comprehensive dictionaries for languages for which they have not been developed. [sent-15, score-0.93]

13 Our approach is based on redefining the task of lemmatisation as a category tagging task. [sent-16, score-0.881]

14 Formulating lemmatisation as a tagging task allows the use of advanced tagging techniques, and the efficient integration of contextual information. [sent-17, score-0.965]

15 We show that this approach gives the highest accuracy known on eight European languages having different morphological complexity, including agglutinative (Hungarian, Estonian) and fusional (Slavic) languages. [sent-18, score-0.284]

16 2 Lemmatisation as a Tagging Task Lemmatisation is the task of grouping together word forms that belong to the same inflectional morphological paradigm and assigning to each paradigm its corresponding canonical form called lemma. [sent-19, score-0.354]

17 For example, English word forms go, goes, going, went, gone constitute a single morphological paradigm which is assigned the lemma go. [sent-20, score-0.495]

18 Automatic lemmatisation requires defining a model that can determine the lemma for a given word form. [sent-21, score-0.939]

19 the fact that the transformation from going to go is governed by a general rule that applies to most English verbs). [sent-24, score-0.17]

20 Our method assigns to each word a label encoding [sent-25, score-0.133]

21 the transformation required to obtain the lemma string from the given word string. [sent-27, score-0.361]

[Page footer: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 368–372, Jeju, Republic of Korea, 8-14 July 2012. © 2012 Association for Computational Linguistics]

22 The generic transformation from a word to a lemma is done in four steps: 1) remove a suffix of length Ns; 2) add a new lemma suffix, Ls; 3) remove a prefix of length Np; 4) add a new lemma prefix, Lp. [sent-28, score-0.928]

23 The tuple τ ≡ ⟨Ns, Ls, Np, Lp⟩ defines the word-to-lemma transformation. [sent-29, score-0.033]

24 Each tuple is represented with a label that lists the 4 parameters. [sent-30, score-0.134]

25 For example, the transformation of the word going into its lemma is encoded by the label ⟨3, ∅, 0, ∅⟩. [sent-31, score-0.557]

26 This label can be observed on a specific lemma-word [sent-32, score-0.101]

27 pair in the training set, but it generalizes well to the unseen words that are formed regularly by adding the suffix -ing. [sent-33, score-0.177]

28 The same label applies to any other transformation which requires only removing the last 3 characters of the word string. [sent-34, score-0.315]
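The four-step transformation and its label can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `apply_label` and the tuple layout `(ns, ls, np_, lp)` are assumptions made here, and the empty string stands in for ∅.

```python
def apply_label(word, label):
    """Apply a (ns, ls, np_, lp) label to a word form and return the lemma.

    Hypothetical helper: the empty string plays the role of the empty
    affix (written as the empty-set symbol in the text).
    """
    ns, ls, np_, lp = label
    # Step 1: remove a suffix of length ns.
    stem = word[:len(word) - ns]
    # Step 2: add the new lemma suffix ls.
    stem = stem + ls
    # Step 3: remove a prefix of length np_.
    stem = stem[np_:]
    # Step 4: add the new lemma prefix lp.
    return lp + stem


print(apply_label("going", (3, "", 0, "")))    # go
print(apply_label("walking", (3, "", 0, "")))  # walk
```

The second call shows the generalisation described above: the label induced from going also lemmatises another word formed regularly with -ing.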

29 Suffix transformations are more frequent than prefix transformations (Jongejan and Dalianis, 2009). [sent-35, score-0.288]

30 In some languages, such as English, it is sufficient to define only suffix transformations. [sent-36, score-0.081]

31 In this case, all the labels will have Np set to 0 and Lp set to ∅. [sent-37, score-0.043]

32 However, languages richer in morphology often require encoding prefix transformations too. [sent-38, score-0.354]

33 For example, in assigning the lemma to the negated verb forms in Czech the negation prefix needs to be removed. [sent-39, score-0.379]

34 In this case, the label ⟨1, t, 2, ∅⟩ maps the word nevěděl to the lemma vědět. [sent-40, score-0.186]

35 The set of labels for a specific language is induced from a training set of pairs (word, lemma). [sent-42, score-0.076]

36 Then we set the value of Np to the number of characters in the word that precede the start of the LCS (the longest common substring of the word and the lemma) and Ns to the number of characters in the word that follow the end of the LCS. [sent-44, score-0.223]

37 The value of Lp is the substring preceding LCS in the lemma and the value of Ls is the substring following LCS in the lemma. [sent-45, score-0.298]

38 In the case of the example pair (nevěděl, vědět), the LCS is vědě, 2 characters precede the LCS in the word and 1 follows it. [sent-46, score-0.134]

39 There are no characters preceding the start of the LCS in … [sent-47, score-0.274]

40 [Footnote 1: The transformation rules described in this section are well adapted for a wide range of languages which encode morphological information by means of affixes. Other encodings can be … types (such as Semitic …] [Figure 1: Growth of the label set with the number of training instances; x-axis: word-lemma samples.] [sent-49, score-0.101]

41 The generated label is added to the set of labels. [sent-51, score-0.101]
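The induction step described above can be sketched with the standard library's `difflib`. This is an illustration under stated assumptions, not the authors' code: `induce_label` is a hypothetical name, the empty string stands in for ∅, and when several longest common substrings exist the sketch simply takes the one `difflib` returns (the excerpt does not say how such ties are resolved).

```python
from difflib import SequenceMatcher


def induce_label(word, lemma):
    """Derive a (ns, ls, np_, lp) label from one (word, lemma) pair."""
    # autojunk=False switches off difflib's heuristic that may ignore
    # very frequent characters in long sequences.
    matcher = SequenceMatcher(None, word, lemma, autojunk=False)
    m = matcher.find_longest_match(0, len(word), 0, len(lemma))
    np_ = m.a                        # characters before the LCS in the word
    ns = len(word) - (m.a + m.size)  # characters after the LCS in the word
    lp = lemma[:m.b]                 # lemma substring before the LCS
    ls = lemma[m.b + m.size:]        # lemma substring after the LCS
    return (ns, ls, np_, lp)


print(induce_label("going", "go"))       # (3, '', 0, '')
print(induce_label("nevěděl", "vědět"))  # (1, 't', 2, '')
```

Both outputs match the labels quoted in the text for the English and Czech examples.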

42 3 Label set induction We apply the presented technique to induce the label set from annotated running text. [sent-52, score-0.13]

43 This approach results in a set of labels whose size converges quickly with the increase of training pairs. [sent-53, score-0.077]

44 Figure 1 shows the growth of the label set size with the number of tokens seen in the training set for three representative languages. [sent-54, score-0.166]

45 This behavior is expected on the basis of the known interaction between the frequency and the regularity of word forms that is shared by all languages: infrequent words tend to be formed according to a regular pattern, while irregular word forms tend to occur in frequent words. [sent-55, score-0.237]

46 The described procedure leverages this fact to induce a label set that covers most of the word occurrences in a text: a specialized label is learnt for frequent irregular words, while a generic label is learnt to handle words that follow a regular pattern. [sent-56, score-0.514]
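The growth behaviour can be reproduced on toy data with a few lines of Python. The sketch below is only illustrative: the word-lemma pairs are invented English examples rather than corpus data, and `induce_label` and `label_set_growth` are hypothetical helper names, with label induction following the LCS procedure described earlier.

```python
from difflib import SequenceMatcher


def induce_label(word, lemma):
    """Return the (ns, ls, np_, lp) label for one (word, lemma) pair."""
    m = SequenceMatcher(None, word, lemma, autojunk=False).find_longest_match(
        0, len(word), 0, len(lemma))
    return (len(word) - m.a - m.size, lemma[m.b + m.size:], m.a, lemma[:m.b])


def label_set_growth(pairs):
    """Return the size of the label set after each training pair is seen."""
    labels, sizes = set(), []
    for word, lemma in pairs:
        labels.add(induce_label(word, lemma))
        sizes.append(len(labels))
    return sizes


# Toy pairs (illustrative only): regular -ing forms share one generic label,
# while the irregular form "went" needs its own specialised label.
pairs = [("going", "go"), ("walking", "walk"), ("went", "go"),
         ("talking", "talk"), ("cats", "cat")]
print(label_set_growth(pairs))  # [1, 1, 2, 2, 3]
```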

47 We observe that the non-complete convergence of the label set size is, to a large extent, due to the presence of noise in the corpus (annotation errors, typos or inconsistency). [sent-57, score-0.167]

48 We test the robustness of our method by deciding not to filter out the noise-generated labels in the experimental evaluation. [sent-58, score-0.043]

49 We also observe that encoding the prefix transformation in the label is fundamental for handling the size of the label sets in the languages that frequently use lemma prefixes. [sent-59, score-0.814]

50 For example, the label set generated for Czech doubles in size if only the suffix transformation is encoded in the label. [sent-60, score-0.391]

51 Finally, we observe that the size of the set of induced labels depends on the morphological complexity of languages, as shown in Figure 1. [sent-61, score-0.277]

52 4 Experimental Evaluation The advantage of structuring the lemmatisation task as a tagging task is that it allows us to apply successful tagging techniques and use the context information in assigning transformation labels to the words in a text. [sent-63, score-1.15]

53 We chose this model since it has been shown to be easily adaptable for solving a wide set of tagging and chunking tasks obtaining state-of-the-art performances with short execution time (Gesmundo, 2011). [sent-66, score-0.117]

54 Furthermore, this model has consistently shown good generalisation behaviour reaching significantly higher accuracy in tagging unknown words than other systems. [sent-67, score-0.231]

55 We train and test the tagger on manually annotated G. [sent-68, score-0.057]

56 Orwell’s “1984” and its translations to seven European languages (see Table 2, column 1), included in the Multext-East corpora (Erjavec, 2010). [sent-69, score-0.147]

57 The words in the corpus are annotated with both lemmas and detailed morphosyntactic descriptions including the POS labels. [sent-70, score-0.196]

58 Each setting is defined by the set of features that are used for training and prediction. [sent-74, score-0.03]

59 Table 2 reports the accuracy scores achieved in each setting. [sent-76, score-0.057]

60 We establish the Base Line (BL) setting and performance in the first experiment. [sent-77, score-0.03]

61 This setting involves only features of the current word, [w0], such as the word form, suffixes and prefixes and features that flag the presence of special characters (digits, hyphen, caps). [sent-78, score-0.19]
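A baseline feature extractor of this kind might look like the sketch below. The feature names, the maximum affix length of 3 and the dictionary representation are all assumptions made for illustration; the excerpt only states which kinds of features are used, not how they are encoded.

```python
def baseline_features(word, max_affix=3):
    """Features of the current word only (the [w0] group in the text).

    Hypothetical encoding: each active feature is a key mapped to 1.
    """
    feats = {"form=" + word: 1}
    # Suffixes and prefixes up to max_affix characters (assumed limit).
    for n in range(1, min(max_affix, len(word)) + 1):
        feats["suffix%d=%s" % (n, word[-n:])] = 1
        feats["prefix%d=%s" % (n, word[:n])] = 1
    # Flags for special characters: digits, hyphen, capital letters.
    if any(ch.isdigit() for ch in word):
        feats["has_digit"] = 1
    if "-" in word:
        feats["has_hyphen"] = 1
    if any(ch.isupper() for ch in word):
        feats["has_caps"] = 1
    return feats


print(sorted(baseline_features("Going")))
```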

62 The BL accuracy is reported in the second column of Table 2. [sent-79, score-0.082]

63 In the second experiment, the BL feature set is expanded with features of the surrounding words ([w−1], [w1]) and surrounding predicted lemmas ([lem−1], [lem1]). [sent-80, score-0.25]
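The context features can be sketched in the same style. The assumptions here are the feature names, the `<PAD>` boundary marker, and the convention that `lemmas[j]` is `None` when the lemma of a neighbouring word has not been predicted yet; how the tagger schedules its predictions is not described in this excerpt.

```python
def context_features(words, lemmas, i):
    """Features of the neighbours of position i: words and predicted lemmas.

    lemmas[j] is None when no lemma has been predicted for position j yet
    (an assumed convention for this sketch).
    """
    feats = {}
    for offset in (-1, 1):
        j = i + offset
        if 0 <= j < len(words):
            feats["w[%d]=%s" % (offset, words[j])] = 1
            if lemmas[j] is not None:
                feats["lem[%d]=%s" % (offset, lemmas[j])] = 1
        else:
            # Sentence boundary: mark the missing neighbour explicitly.
            feats["w[%d]=<PAD>" % offset] = 1
    return feats


words = ["she", "was", "going"]
lemmas = ["she", None, None]  # only the first lemma is available so far
print(context_features(words, lemmas, 1))
```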

64 The results of the second experiment are reported in the third column of Table 2. [sent-83, score-0.134]

65 The consistent improvements over the BL scores for all the languages, varying from the lowest relative error reduction (RER) for Czech (5. [sent-84, score-0.065]

66 In the third experiment, we use a feature set in which the BL set is expanded with the predicted POS tag of the current word, [pos0]. [sent-87, score-0.146]

67 The accuracy measured in the third experiment (Table 2, column 4) shows consistent improvement over the BL (the best RER is 34. [sent-88, score-0.161]

68 Furthermore, we observe that the accuracy scores in the third experiment are close to those in the second experiment. [sent-90, score-0.168]

69 This allows us to state that it is possible to design high quality lemmatisation systems which are independent of the POS tagging. [sent-91, score-0.703]

70 Instead of using the POS information, which is currently standard practice for lemmatisation, the task can be performed in a context-wise setting using only the information about surrounding words and lemmas. [sent-92, score-0.069]

71 In the fourth experiment we use a feature set consisting of contextual features of words, predicted lemmas and predicted POS tags. [sent-93, score-0.304]

72 [Footnote 2] The POS tags that we use are extracted from the morphosyntactic descriptions provided in the corpus and learned using the same system that we use for lemmatisation. [sent-94, score-0.132]

73 This setting combines the use of the context with the use of the predicted POS tags. [sent-95, score-0.048]

74 The scores obtained in the fourth experiment are considerably higher than those in the previous experiments (Table 2, column 5). [sent-96, score-0.171]

75 For this setting, we also report accuracies on unseen words only (UWA, column 6 in Table 2) to show the generalisation capacities of the lemmatizer. [sent-100, score-0.146]
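Overall accuracy and unseen-word accuracy could be computed along the following lines. The sketch assumes that a word counts as unseen when its form does not occur in the training vocabulary; the excerpt does not spell out the exact definition, and the function name and toy data are invented for illustration.

```python
def accuracy_report(tokens, gold, predicted, training_vocab):
    """Return (overall accuracy, accuracy on word forms unseen in training)."""
    correct = unseen = unseen_correct = 0
    for tok, g, p in zip(tokens, gold, predicted):
        hit = int(g == p)
        correct += hit
        if tok not in training_vocab:  # assumed definition of "unseen"
            unseen += 1
            unseen_correct += hit
    overall = correct / len(tokens) if tokens else 0.0
    uwa = unseen_correct / unseen if unseen else 0.0
    return overall, uwa


# Toy example (not corpus data): "cats" and "dogs" are unseen in training.
tokens = ["going", "went", "cats", "dogs"]
gold = ["go", "go", "cat", "dog"]
predicted = ["go", "go", "cat", "dogs"]
print(accuracy_report(tokens, gold, predicted, {"going", "went"}))  # (0.75, 0.5)
```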

76 The UWA scores 85% or higher for all the languages except Estonian (78. [sent-101, score-0.122]

77 The results of the fourth experiment show that interesting improvements in the performance are obtained by combining the POS and context information. [sent-103, score-0.086]

78 Current systems typically use only the information on the POS of the target word together with lemmatisation rules acquired separately from a dictionary, which roughly corresponds to the setting of our third experiment. [sent-105, score-0.8]

79 The improvement in the fourth experiment compared to the third experiment (RER varying between 12. [sent-106, score-0.2]

80 All the scores reported in Table 2 represent performance with raw text as input. [sent-108, score-0.03]

81 Juršič et al. (2010) propose a general multilingual lemmatisation tool, LemGen, which is tested on the same corpora that we used in our evaluation. [sent-111, score-0.703]

82 LemGen learns word transformations in the form of ripple-down rules. [sent-112, score-0.137]

83 Disambiguation between multiple possible lemmas for a word form is based on the gold-standard morphosyntactic label of the word. [sent-113, score-0.329]

84 We measure a Relative Error Reduction varying between 81% for Serbian and 86% for English. [sent-115, score-0.035]
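For reference, relative error reduction is commonly computed from two accuracy scores as shown below. The excerpt does not give the formula, so this assumes the usual definition (the reduction in error rate divided by the baseline error rate); the numbers in the usage line are arbitrary and are not results from the paper.

```python
def relative_error_reduction(baseline_accuracy, new_accuracy):
    """RER = (baseline error - new error) / baseline error.

    Accuracies are fractions in [0, 1]; the result is also a fraction.
    """
    baseline_error = 1.0 - baseline_accuracy
    new_error = 1.0 - new_accuracy
    if baseline_error == 0.0:
        raise ValueError("baseline has no errors to reduce")
    return (baseline_error - new_error) / baseline_error


# Arbitrary illustrative values: halving the error rate gives an RER of about 0.5.
print(round(relative_error_reduction(0.90, 0.95), 3))
```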

85 It is worth noting that we do not use manually constructed dictionaries for training, while Juršič et al. [sent-116, score-0.097]

86 (2010) use additional dictionaries for languages for which they are available. [sent-117, score-0.162]

87 Chrupała (2006) proposes a system which, like our system, learns the lemmatisation rules from a corpus, without external dictionaries. [sent-118, score-0.703]

88 The mappings between word forms and lemmas are encoded by means of the shortest edit script. [sent-119, score-0.228]

89 They are learnt using an SVM classifier and the word context features. [sent-121, score-0.088]

90 The most important limitation of this approach is that it cannot deal with both suffixes and prefixes at the same time, which is crucial for efficient processing of morphologically rich languages. [sent-122, score-0.18]

91 Our approach enables encoding transformations on both sides of words. [sent-123, score-0.152]

92 Furthermore, we propose a more straightforward and a more compact way of encoding the lemmatisation rules. [sent-124, score-0.75]

93 Toutanova and Cherry (2009) propose a joint model for assigning the set of possible lemmas and POS tags to out-of-lexicon words which is language independent. [sent-126, score-0.139]

94 The lemmatizer component is a discriminative character transducer that uses a set of within-word features to learn the transformations from input data consisting of a lexicon with full morphological paradigms and unlabelled texts. [sent-127, score-0.24]

95 They show that the joint model outperforms the pipeline model where the POS tag is used as input to the lemmatisation component. [sent-128, score-0.736]

96 6 Conclusion We have shown that redefining the task of lemmatisation as a category tagging task and using an efficient tagger to perform it results in a performance that is at the state-of-the-art level. [sent-129, score-0.911]

97 The adaptive general classification model used in our approach makes use of different sources of information that can be found in a small annotated corpus, with no need for comprehensive, manually constructed morphological dictionaries. [sent-130, score-0.162]

98 For this reason, it can be expected to be easily portable across languages enabling good quality processing of languages with complex morphology and scarce resources. [sent-131, score-0.216]

99 Bidirectional sequence classification for tagging tasks with guided learning. [sent-148, score-0.165]

100 Automatic training of lemmatization rules that handle morphological changes in pre-, in- and suffixes alike. [sent-162, score-0.178]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('lemmatisation', 0.703), ('lemma', 0.204), ('lcs', 0.186), ('bl', 0.143), ('morphological', 0.135), ('transformation', 0.125), ('erjavec', 0.122), ('rer', 0.122), ('tagging', 0.117), ('transformations', 0.105), ('morphosyntactic', 0.102), ('label', 0.101), ('lemmas', 0.094), ('languages', 0.092), ('jur', 0.092), ('lemgen', 0.092), ('toma', 0.092), ('suffix', 0.081), ('prefix', 0.078), ('morphologically', 0.075), ('dictionaries', 0.07), ('pos', 0.067), ('ls', 0.061), ('chrupa', 0.061), ('jongejan', 0.061), ('redefining', 0.061), ('slovene', 0.061), ('characters', 0.057), ('learnt', 0.056), ('czech', 0.055), ('column', 0.055), ('generalisation', 0.053), ('nev', 0.053), ('estonian', 0.053), ('uwa', 0.053), ('forms', 0.052), ('encoded', 0.05), ('gesmundo', 0.049), ('unige', 0.049), ('serbian', 0.049), ('guided', 0.048), ('bidirectional', 0.048), ('predicted', 0.048), ('andrea', 0.047), ('substring', 0.047), ('encoding', 0.047), ('romanian', 0.045), ('precede', 0.045), ('assigning', 0.045), ('paradigm', 0.045), ('going', 0.045), ('experiment', 0.044), ('suffixes', 0.043), ('labels', 0.043), ('hungarian', 0.043), ('lp', 0.043), ('fourth', 0.042), ('np', 0.042), ('surrounding', 0.039), ('unseen', 0.038), ('comprehensive', 0.038), ('irregular', 0.038), ('haji', 0.036), ('geneva', 0.036), ('third', 0.035), ('varying', 0.035), ('reaching', 0.034), ('size', 0.034), ('rich', 0.034), ('induced', 0.033), ('tuple', 0.033), ('tag', 0.033), ('morphology', 0.032), ('observe', 0.032), ('word', 0.032), ('formed', 0.031), ('growth', 0.031), ('suntec', 0.031), ('scores', 0.03), ('setting', 0.03), ('tagger', 0.03), ('ns', 0.03), ('expanded', 0.03), ('eight', 0.03), ('induce', 0.029), ('toutanova', 0.028), ('contextual', 0.028), ('prefixes', 0.028), ('jan', 0.028), ('accuracy', 0.027), ('manually', 0.027), ('procesamiento', 0.027), ('ethle', 0.027), ('prerequisite', 0.027), ('tehle', 0.027), ('shortage', 0.027), ('gone', 0.027), ('chair', 0.027), ('smundo', 0.027), ('formulating', 
0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 137 acl-2012-Lemmatisation as a Tagging Task

Author: Andrea Gesmundo ; Tanja Samardzic

Abstract: We present a novel approach to the task of word lemmatisation. We formalise lemmatisation as a category tagging task, by describing how a word-to-lemma transformation rule can be encoded in a single label and how a set of such labels can be inferred for a specific language. In this way, a lemmatisation system can be trained and tested using any supervised tagging model. In contrast to previous approaches, the proposed technique allows us to easily integrate relevant contextual information. We test our approach on eight languages reaching a new state-of-the-art level for the lemmatisation task.

2 0.095290825 96 acl-2012-Fast and Robust Part-of-Speech Tagging Using Dynamic Model Selection

Author: Jinho D. Choi ; Martha Palmer

Abstract: This paper presents a novel way of improving POS tagging on heterogeneous data. First, two separate models are trained (generalized and domain-specific) from the same data set by controlling lexical items with different document frequencies. During decoding, one of the models is selected dynamically given the cosine similarity between each sentence and the training data. This dynamic model selection approach, coupled with a one-pass, left-to-right POS tagging algorithm, is evaluated on corpora from seven different genres. Even with this simple tagging algorithm, our system shows comparable results against other state-of-the-art systems, and gives higher accuracies when evaluated on a mixture of the data. Furthermore, our system is able to tag about 32K tokens per second. We believe that this model selection approach can be applied to more sophisticated tagging algorithms and improve their robustness even further.

3 0.09078037 45 acl-2012-Capturing Paradigmatic and Syntagmatic Lexical Relations: Towards Accurate Chinese Part-of-Speech Tagging

Author: Weiwei Sun ; Hans Uszkoreit

Abstract: From the perspective of structural linguistics, we explore paradigmatic and syntagmatic lexical relations for Chinese POS tagging, an important and challenging task for Chinese language processing. Paradigmatic lexical relations are explicitly captured by word clustering on large-scale unlabeled data and are used to design new features to enhance a discriminative tagger. Syntagmatic lexical relations are implicitly captured by constituent parsing and are utilized via system combination. Experiments on the Penn Chinese Treebank demonstrate the importance of both paradigmatic and syntagmatic relations. Our linguistically motivated approaches yield a relative error reduction of 18% in total over a state-of-the-art baseline.

4 0.089469068 9 acl-2012-A Cost Sensitive Part-of-Speech Tagging: Differentiating Serious Errors from Minor Errors

Author: Hyun-Je Song ; Jeong-Woo Son ; Tae-Gil Noh ; Seong-Bae Park ; Sang-Jo Lee

Abstract: All types of part-of-speech (POS) tagging errors have been equally treated by existing taggers. However, the errors are not equally important, since some errors affect the performance of subsequent natural language processing (NLP) tasks seriously while others do not. This paper aims to minimize these serious errors while retaining the overall performance of POS tagging. Two gradient loss functions are proposed to reflect the different types of errors. They are designed to assign a larger cost to serious errors and a smaller one to minor errors. Through a set of POS tagging experiments, it is shown that the classifier trained with the proposed loss functions reduces serious errors compared to state-of-the-art POS taggers. In addition, the experimental result on text chunking shows that fewer serious errors help to improve the performance of subsequent NLP tasks.

5 0.083603427 78 acl-2012-Efficient Search for Transformation-based Inference

Author: Asher Stern ; Roni Stern ; Ido Dagan ; Ariel Felner

Abstract: This paper addresses the search problem in textual inference, where systems need to infer one piece of text from another. A prominent approach to this task attempts to transform one text into the other through a sequence of inference-preserving transformations, a.k.a. a proof, while estimating the proof’s validity. This raises a search challenge of finding the best possible proof. We explore this challenge through a comprehensive investigation of prominent search algorithms and propose two novel algorithmic components specifically designed for textual inference: a gradient-style evaluation function, and a local-lookahead node expansion method. Evaluations, using the open-source system, BIUTEE, show the contribution of these ideas to search efficiency and proof quality.

6 0.080748864 119 acl-2012-Incremental Joint Approach to Word Segmentation, POS Tagging, and Dependency Parsing in Chinese

7 0.079253308 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations

8 0.077066682 27 acl-2012-Arabic Retrieval Revisited: Morphological Hole Filling

9 0.072801001 122 acl-2012-Joint Evaluation of Morphological Segmentation and Syntactic Parsing

10 0.069559202 139 acl-2012-MIX Is Not a Tree-Adjoining Language

11 0.067448795 140 acl-2012-Machine Translation without Words through Substring Alignment

12 0.065860584 36 acl-2012-BIUTEE: A Modular Open-Source System for Recognizing Textual Entailment

13 0.06291303 202 acl-2012-Transforming Standard Arabic to Colloquial Arabic

14 0.061003502 89 acl-2012-Exploring Deterministic Constraints: from a Constrained English POS Tagger to an Efficient ILP Solution to Chinese Word Segmentation

15 0.060576461 43 acl-2012-Building Trainable Taggers in a Web-based, UIMA-Supported NLP Workbench

16 0.057939857 168 acl-2012-Reducing Approximation and Estimation Errors for Chinese Lexical Processing with Heterogeneous Annotations

17 0.052660115 172 acl-2012-Selective Sharing for Multilingual Dependency Parsing

18 0.052618347 107 acl-2012-Heuristic Cube Pruning in Linear Time

19 0.052036811 207 acl-2012-Unsupervised Morphology Rivals Supervised Morphology for Arabic MT

20 0.051562395 49 acl-2012-Coarse Lexical Semantic Annotation with Supersenses: An Arabic Case Study


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.152), (1, 0.007), (2, -0.077), (3, -0.035), (4, 0.024), (5, 0.132), (6, 0.063), (7, -0.049), (8, -0.017), (9, -0.019), (10, -0.087), (11, 0.016), (12, 0.063), (13, -0.081), (14, 0.058), (15, -0.02), (16, -0.024), (17, -0.013), (18, 0.026), (19, 0.017), (20, -0.004), (21, 0.035), (22, -0.015), (23, 0.026), (24, 0.05), (25, 0.071), (26, -0.034), (27, -0.0), (28, -0.002), (29, 0.072), (30, 0.095), (31, 0.105), (32, -0.158), (33, 0.035), (34, 0.096), (35, 0.089), (36, -0.105), (37, 0.003), (38, -0.084), (39, -0.108), (40, -0.003), (41, -0.045), (42, -0.03), (43, -0.011), (44, -0.067), (45, -0.024), (46, -0.002), (47, -0.006), (48, 0.043), (49, 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.89036906 137 acl-2012-Lemmatisation as a Tagging Task

Author: Andrea Gesmundo ; Tanja Samardzic

Abstract: We present a novel approach to the task of word lemmatisation. We formalise lemmatisation as a category tagging task, by describing how a word-to-lemma transformation rule can be encoded in a single label and how a set of such labels can be inferred for a specific language. In this way, a lemmatisation system can be trained and tested using any supervised tagging model. In contrast to previous approaches, the proposed technique allows us to easily integrate relevant contextual information. We test our approach on eight languages reaching a new state-of-the-art level for the lemmatisation task.

2 0.56297642 9 acl-2012-A Cost Sensitive Part-of-Speech Tagging: Differentiating Serious Errors from Minor Errors

Author: Hyun-Je Song ; Jeong-Woo Son ; Tae-Gil Noh ; Seong-Bae Park ; Sang-Jo Lee

Abstract: All types of part-of-speech (POS) tagging errors have been equally treated by existing taggers. However, the errors are not equally important, since some errors affect the performance of subsequent natural language processing (NLP) tasks seriously while others do not. This paper aims to minimize these serious errors while retaining the overall performance of POS tagging. Two gradient loss functions are proposed to reflect the different types of errors. They are designed to assign a larger cost to serious errors and a smaller one to minor errors. Through a set of POS tagging experiments, it is shown that the classifier trained with the proposed loss functions reduces serious errors compared to state-of-the-art POS taggers. In addition, the experimental result on text chunking shows that fewer serious errors help to improve the performance of subsequent NLP tasks.

3 0.56064534 96 acl-2012-Fast and Robust Part-of-Speech Tagging Using Dynamic Model Selection

Author: Jinho D. Choi ; Martha Palmer

Abstract: This paper presents a novel way of improving POS tagging on heterogeneous data. First, two separate models are trained (generalized and domain-specific) from the same data set by controlling lexical items with different document frequencies. During decoding, one of the models is selected dynamically given the cosine similarity between each sentence and the training data. This dynamic model selection approach, coupled with a one-pass, left-to-right POS tagging algorithm, is evaluated on corpora from seven different genres. Even with this simple tagging algorithm, our system shows comparable results against other state-of-the-art systems, and gives higher accuracies when evaluated on a mixture of the data. Furthermore, our system is able to tag about 32K tokens per second. We believe that this model selection approach can be applied to more sophisticated tagging algorithms and improve their robustness even further.

4 0.51189244 43 acl-2012-Building Trainable Taggers in a Web-based, UIMA-Supported NLP Workbench

Author: Rafal Rak ; BalaKrishna Kolluru ; Sophia Ananiadou

Abstract: Argo is a web-based NLP and text mining workbench with a convenient graphical user interface for designing and executing processing workflows of various complexity. The workbench is intended for specialists and nontechnical audiences alike, and provides the ever expanding library of analytics compliant with the Unstructured Information Management Architecture, a widely adopted interoperability framework. We explore the flexibility of this framework by demonstrating workflows involving three processing components capable of performing self-contained machine learning-based tagging. The three components are responsible for the three distinct tasks of 1) generating observations or features, 2) training a statistical model based on the generated features, and 3) tagging unlabelled data with the model. The learning and tagging components are based on an implementation of conditional random fields (CRF); whereas the feature generation component is an analytic capable of extending basic token information to a comprehensive set of features. Users define the features of their choice directly from Argo’s graphical interface, without resorting to programming (a commonly used approach to feature engineering). The experimental results performed on two tagging tasks, chunking and named entity recognition, showed that a tagger with a generic set of features built in Argo is capable of competing with task-specific solutions.

5 0.51176709 139 acl-2012-MIX Is Not a Tree-Adjoining Language

Author: Makoto Kanazawa ; Sylvain Salvati

Abstract: The language MIX consists of all strings over the three-letter alphabet {a, b, c} that contain an equal number of occurrences of each letter. We prove Joshi’s (1985) conjecture that MIX is not a tree-adjoining language.

6 0.50037736 185 acl-2012-Strong Lexicalization of Tree Adjoining Grammars

7 0.47604677 89 acl-2012-Exploring Deterministic Constraints: from a Constrained English POS Tagger to an Efficient ILP Solution to Chinese Word Segmentation

8 0.46963444 168 acl-2012-Reducing Approximation and Estimation Errors for Chinese Lexical Processing with Heterogeneous Annotations

9 0.45765403 45 acl-2012-Capturing Paradigmatic and Syntagmatic Lexical Relations: Towards Accurate Chinese Part-of-Speech Tagging

10 0.4566429 202 acl-2012-Transforming Standard Arabic to Colloquial Arabic

11 0.45412108 119 acl-2012-Incremental Joint Approach to Word Segmentation, POS Tagging, and Dependency Parsing in Chinese

12 0.44651526 78 acl-2012-Efficient Search for Transformation-based Inference

13 0.4433516 189 acl-2012-Syntactic Annotations for the Google Books NGram Corpus

14 0.43801734 27 acl-2012-Arabic Retrieval Revisited: Morphological Hole Filling

15 0.43264079 36 acl-2012-BIUTEE: A Modular Open-Source System for Recognizing Textual Entailment

16 0.4226037 122 acl-2012-Joint Evaluation of Morphological Segmentation and Syntactic Parsing

17 0.41766855 49 acl-2012-Coarse Lexical Semantic Annotation with Supersenses: An Arabic Case Study

18 0.40317869 107 acl-2012-Heuristic Cube Pruning in Linear Time

19 0.40274176 83 acl-2012-Error Mining on Dependency Trees

20 0.38841102 75 acl-2012-Discriminative Strategies to Integrate Multiword Expression Recognition and Parsing


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(25, 0.046), (26, 0.067), (28, 0.048), (30, 0.023), (37, 0.021), (39, 0.044), (47, 0.01), (57, 0.026), (74, 0.026), (82, 0.026), (84, 0.033), (85, 0.04), (89, 0.241), (90, 0.135), (92, 0.059), (94, 0.024), (99, 0.051)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.78216058 124 acl-2012-Joint Inference of Named Entity Recognition and Normalization for Tweets

Author: Xiaohua Liu ; Ming Zhou ; Xiangyang Zhou ; Zhongyang Fu ; Furu Wei

Abstract: Tweets represent a critical source of fresh information, in which named entities occur frequently with rich variations. We study the problem of named entity normalization (NEN) for tweets. Two main challenges are the errors propagated from named entity recognition (NER) and the dearth of information in a single tweet. We propose a novel graphical model to simultaneously conduct NER and NEN on multiple tweets to address these challenges. Particularly, our model introduces a binary random variable for each pair of words with the same lemma across similar tweets, whose value indicates whether the two related words are mentions of the same entity. We evaluate our method on a manually annotated data set, and show that our method outperforms the baseline that handles these two tasks separately, boosting the F1 from 80.2% to 83.6% for NER, and the Accuracy from 79.4% to 82.6% for NEN, respectively.

2 0.72648644 20 acl-2012-A Statistical Model for Unsupervised and Semi-supervised Transliteration Mining

Author: Hassan Sajjad ; Alexander Fraser ; Helmut Schmid

Abstract: We propose a novel model to automatically extract transliteration pairs from parallel corpora. Our model is efficient, language pair independent and mines transliteration pairs in a consistent fashion in both unsupervised and semi-supervised settings. We model transliteration mining as an interpolation of transliteration and non-transliteration sub-models. We evaluate on NEWS 2010 shared task data and on parallel corpora with competitive results.

same-paper 3 0.7245419 137 acl-2012-Lemmatisation as a Tagging Task

Author: Andrea Gesmundo ; Tanja Samardzic

Abstract: We present a novel approach to the task of word lemmatisation. We formalise lemmatisation as a category tagging task, by describing how a word-to-lemma transformation rule can be encoded in a single label and how a set of such labels can be inferred for a specific language. In this way, a lemmatisation system can be trained and tested using any supervised tagging model. In contrast to previous approaches, the proposed technique allows us to easily integrate relevant contextual information. We test our approach on eight languages reaching a new state-of-the-art level for the lemmatisation task.

4 0.59557891 123 acl-2012-Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT

Author: Patrick Simianer ; Stefan Riezler ; Chris Dyer

Abstract: With a few exceptions, discriminative training in statistical machine translation (SMT) has been content with tuning weights for large feature sets on small development data. Evidence from machine learning indicates that increasing the training sample size results in better prediction. The goal of this paper is to show that this common wisdom can also be brought to bear upon SMT. We deploy local features for SCFG-based SMT that can be read off from rules at runtime, and present a learning algorithm that applies ℓ1/ℓ2 regularization for joint feature selection over distributed stochastic learning processes. We present experiments on learning on 1.5 million training sentences, and show significant improvements over tuning discriminative models on small development sets.

5 0.59515852 156 acl-2012-Online Plagiarized Detection Through Exploiting Lexical, Syntax, and Semantic Information

Author: Wan-Yu Lin ; Nanyun Peng ; Chun-Chao Yen ; Shou-de Lin

Abstract: In this paper, we introduce a framework that identifies online plagiarism by exploiting lexical, syntactic and semantic features that include duplication-gram, reordering and alignment of words, POS and phrase tags, and semantic similarity of sentences. We establish an ensemble framework to combine the predictions of each model. Results demonstrate that our system can not only find a considerable number of real-world online plagiarism cases but also outperforms several state-of-the-art algorithms and commercial software. Keywords: Plagiarism Detection, Lexical, Syntactic, Semantic

6 0.59393007 148 acl-2012-Modified Distortion Matrices for Phrase-Based Statistical Machine Translation

7 0.59247124 217 acl-2012-Word Sense Disambiguation Improves Information Retrieval

8 0.59098941 167 acl-2012-QuickView: NLP-based Tweet Search

9 0.59062469 187 acl-2012-Subgroup Detection in Ideological Discussions

10 0.59062058 45 acl-2012-Capturing Paradigmatic and Syntagmatic Lexical Relations: Towards Accurate Chinese Part-of-Speech Tagging

11 0.59058386 206 acl-2012-UWN: A Large Multilingual Lexical Knowledge Base

12 0.59033895 214 acl-2012-Verb Classification using Distributional Similarity in Syntactic and Semantic Structures

13 0.59006691 63 acl-2012-Cross-lingual Parse Disambiguation based on Semantic Correspondence

14 0.58925384 174 acl-2012-Semantic Parsing with Bayesian Tree Transducers

15 0.58897471 99 acl-2012-Finding Salient Dates for Building Thematic Timelines

16 0.58888793 72 acl-2012-Detecting Semantic Equivalence and Information Disparity in Cross-lingual Documents

17 0.58886999 41 acl-2012-Bootstrapping a Unified Model of Lexical and Phonetic Acquisition

18 0.58849567 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations

19 0.58846325 11 acl-2012-A Feature-Rich Constituent Context Model for Grammar Induction

20 0.5878728 191 acl-2012-Temporally Anchored Relation Extraction