
227 acl-2013-Learning to lemmatise Polish noun phrases


Source: pdf

Author: Adam Radziszewski

Abstract: We present a novel approach to noun phrase lemmatisation where the main phase is cast as a tagging problem. The idea draws on the observation that the lemmatisation of almost all Polish noun phrases may be decomposed into transformation of singular words (tokens) that make up each phrase. We perform evaluation, which shows results similar to those obtained earlier by a rule-based system, while our approach allows to separate chunking from lemmatisation.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Learning to lemmatise Polish noun phrases Adam Radziszewski Institute of Informatics, Wrocław University of Technology Wybrzeże Wyspiańskiego 27 Wrocław, Poland adam . [sent-1, score-0.256]

2 Abstract We present a novel approach to noun phrase lemmatisation where the main phase is cast as a tagging problem. [sent-4, score-0.611]

3 The idea draws on the observation that the lemmatisation of almost all Polish noun phrases may be decomposed into transformation of singular words (tokens) that make up each phrase. [sent-5, score-0.656]

4 1 Introduction Lemmatisation of word forms is the task of finding base forms (lemmas) for each token in running text. [sent-7, score-0.11]

5 Similar task may be defined for whole noun phrases (Degórski, 2011). [sent-9, score-0.129]

6 By lemmatisation of noun phrases (NPs) we will understand assigning each NP a grammatically correct NP corresponding to the same phrase that could stand as a dictionary entry. [sent-10, score-0.676]

7 The task of NP lemmatisation is rarely considered, although it carries great practical value. [sent-11, score-0.463]

8 For instance, any keyword extraction system that works for a morphologically rich language must deal with lemmatisation of NPs. [sent-12, score-0.463]

9 This is because keywords are often longer phrases (Turney, 2000), while the user would be confused to see inflected forms as system output. [sent-13, score-0.238]

10 In (1) we give an example Polish noun phrase (‘the main city of the municipality’). [sent-15, score-0.11]

11 The orthographic form (1a) appears in instrumental case, singular. [sent-17, score-0.108]

12 Lemmatisation of this phrase consists in reverting case value of the main noun (miasto) as well as its adjective modifier (główne) to nominative (nom). [sent-19, score-0.185]

13 Each form in the example is in singular number (sg), miasto has neuter gender (n), gmina is feminine (f). [sent-20, score-0.189]

14 He also notes that this is not an easy task and lemma of a whole NP is rarely a concatenation of lemmas of phrase components. [sent-25, score-0.342]

15 It is worth stressing that even the task of word-level lemmatisation is non-trivial for inflectional languages due to a large number of inflected forms and even larger number of syncretisms. [sent-26, score-0.691]

16 What is more, several syntactic phenomena typical for Polish complicate NP lemmatisation further. [sent-31, score-0.463]

17 In this paper we present a novel approach to noun phrase lemmatisation where the main phase is cast as a tagging problem and tackled using a method devised for such problems, namely Conditional Random Fields (CRF). [sent-39, score-0.573]

18 2 Related works NP lemmatisation received very little attention. [sent-40, score-0.463]

19 The approach consists in incorporating phrase lemmatisation rules into a shallow grammar developed for Polish. [sent-43, score-0.563]

20 This is implemented by extending the Spejd shallow parsing framework (Buczyński and Przepiórkowski, 2009) with a rule action that is able to generate phrase lemmas. [sent-44, score-0.139]

21 Degórski assumes that lemma of each NP may be obtained by concatenating each token’s orthographic form, lemma or ‘half-lemmatised’ form (e. [sent-45, score-0.348]

22 Degórski notes that the selection was not entirely random: two types of NPs were deliberately omitted, namely foreign names and “a few groups for which the proper lemmatisation seemed very unclear”. [sent-50, score-0.528]

23 The task of phrase lemmatisation bears a close resemblance to a more popular task, namely lemmatisation of named entities. [sent-60, score-0.996]

24 One approach, which is especially suitable for person names, assumes that nominative forms may be found in the same source as the inflected forms. [sent-62, score-0.259]

25 Piskorski (2005) handles the problem of lemmatisation of Polish named entities of various types by combining specialised gazetteers with lemmatisation rules added to a hand-written grammar. [sent-67, score-0.926]

26 Biblioteki Głównej Wyższej Szkoły Handlowej (glosses: library main higher school commercial; tags: gen:sg:f gen:sg:f gen:sg:f gen:sg:f gen:sg:f) b. [sent-72, score-0.778]

27 6 of the detected NEs were lemmatised correctly” (Piskorski, 2005). [sent-74, score-0.107]

28 3 Phrase lemmatisation as a tagging problem The idea presented here is directly inspired by Degórski’s observations. [sent-75, score-0.463]

29 First, we will also assume that lemma of any NP may be obtained by concatenating simple transformations of word forms that make up the phrase. [sent-76, score-0.391]

30 We will argue that there is a small finite set of inflectional transformations that are sufficient to lemmatise nearly every Polish NP. [sent-79, score-0.334]

31 Correct lemmatisation of the phrase may be obtained by applying a series of simple inflectional transformations to each of its words. [sent-81, score-0.793]

32 To show the real setting, this time we give full NCP tags and word-level lemmas assigned as a result of tagging. [sent-84, score-0.123]

33 In the NCP tagset, the first part of each tag denotes grammatical class (adj stands for adjective, subst for noun). [sent-85, score-0.162]

34 The transformation labelled = means that the inflected form is already equal to the desired part of the lemma, hence no transformation is needed. [sent-93, score-0.408]

35 In the NCP tagset each tag may be decomposed into grammatical class and attribute values, where the choice of applicable attributes depends on the grammatical class. [sent-95, score-0.213]

36 This assumption is important for our approach to be able to use simple tag transformations in the form replace the value of attribute A with the new value V (A=V). [sent-97, score-0.299]
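
A minimal sketch of what an A=V tag transformation could look like in code. The attribute order per grammatical class (`ATTRS`), the attribute name `deg`, and the function name `apply_transformation` are assumptions made for illustration; only the classes used in the examples here (adj, subst) are covered, and this is not the author's implementation.

```python
# Assumed attribute order per grammatical class (illustrative subset, not the full NCP tagset).
ATTRS = {
    "adj": ["nmb", "cas", "gnd", "deg"],
    "subst": ["nmb", "cas", "gnd"],
}


def apply_transformation(tag: str, transformation: str) -> str:
    """Apply '=' (leave unchanged) or 'A=V' (set attribute A to value V) to a tag."""
    if transformation == "=":
        return tag
    attr, value = transformation.split("=", 1)
    cls, *values = tag.split(":")
    names = ATTRS.get(cls, [])
    # If the attribute does not apply to this class, leave the tag untouched.
    if attr not in names or len(values) != len(names):
        return tag
    values[names.index(attr)] = value
    return ":".join([cls, *values])


print(apply_transformation("adj:sg:loc:f:pos", "cas=nom"))
```

Because the attribute is addressed by name rather than by position, the same label (for example cas=nom) can be reused across grammatical classes whose tags have different layouts.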

37 Our idea is simple: by expressing phrase lemmatisation in terms of word-level transformations we can reduce the task to a tagging problem and apply well-known machine learning techniques that have been devised for solving such problems (e. [sent-102, score-0.749]

38 Assuming that we have already trained a statistical model, we need to perform the following steps to obtain lemmatisation of a new text: 1. [sent-106, score-0.463]

39 tagging with transformations by applying the trained model, 4. [sent-109, score-0.216]

40 application of transformations to obtain NP lemmas (using a morphological dictionary to generate forms). [sent-110, score-0.403]
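
The last step above, applying predicted transformations with the help of a morphological dictionary, might be sketched as follows. The dictionary here is a two-entry toy (`TOY_DICT`), the helper names are hypothetical, and the sketch assumes that the phrase-initial preposition has already been removed and that only case transformations occur.

```python
# Toy morphological dictionary: (lemma, tag) -> inflected form. Entries are illustrative only.
TOY_DICT = {
    ("ten", "adj:sg:nom:f:pos"): "ta",
    ("droga", "subst:sg:nom:f"): "droga",
}


def set_case(tag: str, value: str) -> str:
    # Assumes case is the third field (class:number:case:...), which holds for the tags used here.
    parts = tag.split(":")
    parts[2] = value
    return ":".join(parts)


def lemmatise_phrase(tokens):
    """tokens: list of (orth, lemma, tag, transformation), preposition already removed."""
    out = []
    for orth, lemma, tag, transf in tokens:
        if transf == "=":
            out.append(orth)  # the inflected form is already the desired one
            continue
        attr, value = transf.split("=", 1)
        if attr != "cas":
            raise ValueError(f"unsupported transformation in this sketch: {transf}")
        new_tag = set_case(tag, value)
        # Generate the form from the dictionary; fall back to the orthographic form.
        out.append(TOY_DICT.get((lemma, new_tag), orth))
    return " ".join(out)


phrase = [
    ("tej", "ten", "adj:sg:loc:f:pos", "cas=nom"),
    ("drodze", "droga", "subst:sg:loc:f", "cas=nom"),
]
print(lemmatise_phrase(phrase))
```

Falling back to the orthographic form when the dictionary has no matching entry is a design choice of this sketch, not something stated in the source.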

41 The annotators were given a simpler task of assigning each NP instance a lemma and a heuristic procedure was used to induce transformations by matching the manually annotated lemmas to phrases’ orthographic forms using a morphological dictionary. [sent-115, score-0.602]

42 What is more, it allowed us to release the data annotated manually with phrase lemmas and under the same licence. [sent-121, score-0.152]

43 One of the assumptions of KPWr annotation is that actual noun phrases and prepositional phrases are labelled collectively as NP chunks. [sent-122, score-0.194]

44 This solution allows to use our lemmatiser directly against chunker output to obtain NP lemmas from both NPs and PPs. [sent-129, score-0.216]

45 For instance, the phrase o przenoszeniu bakterii drogą płciową (about sexual transmission of bacteria) should be lemmatised to przenoszenie bakterii drogą płciową (sexual transmission of bacteria). [sent-130, score-0.293]

46 4 Preparation of training data First, simple lemmatisation guidelines were developed. [sent-131, score-0.463]

47 If the phrase was in fact prepositional, phrase-initial preposition should be removed first. [sent-133, score-0.108]

48 Although an explicit distinction is made between NPs and PPs, NPs are not annotated as separate chunks when belonging to a PP chunk (an assumption which is typical for shallow parsing). [sent-148, score-0.136]

49 For instance, (4b) could be lemmatised as opis tytułu z Wikipedii (description of a Wikipedia title), but it was not obvious if it was better than leaving the whole phrase as is. [sent-158, score-0.212]

50 Both annotators were only told which phrases were lemmatised differently by the other party but they didn’t know the other decision. [sent-164, score-0.161]

51 The development set was enhanced with word-level transformations that were induced automatically in the following manner. [sent-169, score-0.252]

52 The assumption is that, having cut the preposition, all the remaining tokens of the original inflected phrase must be matched 1:1 to corresponding tokens from the human-assigned lemma. [sent-174, score-0.199]

53 The only problems encountered were due to proper names unknown to the dictionary and misspelled phrases (altogether about 10 cases). [sent-176, score-0.133]

54 The task is to find a suitable transformation for the given inflected form from the original phrase, its tag and word-level lemma, but also given the desired form being part of human-assigned lemma. [sent-182, score-0.346]

55 If the inflected form is identical to the desired human-assigned lemma, the ‘=’ transformation is returned without any tag analysis. [sent-183, score-0.311]

56 For instance, the inflected form tej tagged as adj:sg:loc:f:pos should be matched to the human-assigned form ta (the row label H lem). [sent-185, score-0.458]

57 The result is a set of entries with the given lemma and orthographic form, but with different tags attached. [sent-187, score-0.234]

58 For the example considered, two tags may be obtained: adj:sg:nom:f:pos and adj:sg:voc:f:pos (the former is in nominative case, the latter in vocative). [sent-188, score-0.634]

59 Each of the obtained tags is compared to the tag attached to the inflected form (adj:sg:loc:f:pos) and this way candidate transformations are generated (cas=nom and cas=voc here). [sent-189, score-0.682]
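
The comparison step described here can be sketched as a tag diff. The attribute layout (`ATTRS`), the attribute name `deg`, and the restriction to single-attribute differences are assumptions of this sketch; the actual matching procedure may handle more cases.

```python
# Assumed attribute order per grammatical class (illustrative subset only).
ATTRS = {
    "adj": ["nmb", "cas", "gnd", "deg"],
    "subst": ["nmb", "cas", "gnd"],
}


def candidate_transformations(inflected_tag, dictionary_tags):
    """Compare the tagger's tag with each dictionary tag of the desired form.

    Returns 'A=V' labels for dictionary tags that differ from the inflected
    tag in exactly one attribute (a simplification made in this sketch).
    """
    cls, *values = inflected_tag.split(":")
    names = ATTRS.get(cls, [])
    candidates = []
    for tag in dictionary_tags:
        other_cls, *other_values = tag.split(":")
        if other_cls != cls or len(other_values) != len(values):
            continue  # different class or layout: not comparable here
        diffs = [
            f"{name}={new}"
            for name, old, new in zip(names, values, other_values)
            if old != new
        ]
        if len(diffs) == 1:
            candidates.append(diffs[0])
    return candidates


print(candidate_transformations(
    "adj:sg:loc:f:pos",
    ["adj:sg:nom:f:pos", "adj:sg:voc:f:pos"],
))
```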

60 Most importantly, — obtained from http : / / sg jp . [sent-191, score-0.193]

61 Original: przy tej drodze | T tags: prep:loc adj:sg:loc:f:pos subst:sg:loc:f | T lem: przy ten droga | H lem: ta droga | Transf. [sent-197, score-0.251]

62 This ranking was inspired only by intuition obtained from the lemmatisation guidelines and the transformations selected this way may be wrong in a number of cases. [sent-204, score-0.679]

63 While many transformations may lead to obtaining the same lemma for a given form, many of them will still be accidental. [sent-205, score-0.336]

64 On the other hand, manual inspection of some fragments suggests that the transformations inferred are rarely unjustified. [sent-207, score-0.216]

65 The frequencies of all transformations induced from the development set are given in Tab. [sent-208, score-0.216]

66 These findings support our hypothesis that a small finite set of transformations is sufficient to express lemmatisation of nearly every Polish NP. [sent-212, score-0.679]

67 We have also tested an alternative variant of the matching procedure that included additional transformation ‘lem’ with the meaning take the word-level lemma assigned by the tagger as the correct lemmatisation. [sent-213, score-0.253]

68 This transformation could be induced after an unsuccessful attempt to induce the ‘=’ transformation (i. [sent-214, score-0.198]

69 , if the correct human-assigned lemmatisation was not identical to orthographic form). [sent-216, score-0.536]

70 This resulted in replacing a number of tag-level transformations (mostly cas=nom) with the simple ‘lem’. [sent-217, score-0.216]

71 The work describes a feature set proposed for this task, which includes word forms in a local window, values of grammatical class, gender, number and case, tests for agreement on number, gender and case, as well as simple tests for letter case. [sent-222, score-0.106]

72 The most obvious, yet most successful change was to introduce features returning the chunk tag assigned to a token. [sent-226, score-0.108]

73 Evaluation The performed evaluation assumed training of the CRF on the whole development set annotated with the induced transformations and then applying the trained model to tag the evaluation part with transformations. [sent-236, score-0.299]

74 Transformations were then applied and the obtained phrase lemmas were compared to the reference annotation. [sent-237, score-0.152]

75 Degórski (2011) reports separate figures for the performance of the entire system (chunker + NP lemmatiser) on the whole test set and performance of the entire system limiting the test set only to those phrases that the system is able to chunk correctly (i. [sent-240, score-0.149]

76 This is why we decided to report performance of the whole system on the whole test set, but also, performance of the lemmatisation module alone on the whole test set. [sent-246, score-0.609]

77 This seems more appropriate, since the chunker may be improved or completely replaced independently, while discarding the phrases that are too hard to parse is likely to bias the evaluation of the lemmatisation stage (what is hard to chunk is probably also hard to lemmatise). [sent-247, score-0.667]

78 For the setting where chunker was used, we used the CRF-based chunker mentioned in the previous section (Radziszewski and Pawlaczek, 2012). [sent-248, score-0.18]

79 Degórski (2011) uses concatenation of word-level base forms assigned by the tagger as a baseline. [sent-250, score-0.125]

80 Observation of the development set suggests that returning the original inflected NPs may be a better baseline. [sent-251, score-0.129]

81 Similarly, for the ‘take-lemma’ baseline, other transformations were substituted with ‘lem’. [sent-257, score-0.216]
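
The two baselines discussed here can be illustrated with a small sketch: one returns the original inflected forms, the other concatenates the word-level lemmas assigned by the tagger. The function names are hypothetical; the example tokens come from the phrase przy tej drodze with the preposition already removed.

```python
def orth_baseline(tokens):
    """Return the phrase unchanged: concatenate the orthographic forms."""
    return " ".join(orth for orth, _lemma in tokens)


def lem_baseline(tokens):
    """Concatenate the word-level lemmas assigned by the tagger."""
    return " ".join(lemma for _orth, lemma in tokens)


# (orthographic form, tagger lemma) pairs; the human-assigned lemma is "ta droga".
tokens = [("tej", "ten"), ("drodze", "droga")]
print(orth_baseline(tokens))
print(lem_baseline(tokens))
```

For this phrase neither baseline reproduces the human-assigned lemma ta droga, which illustrates why word-level lemmas alone are not sufficient.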

82 Also, it turns out that the variation of the matching procedure using the ‘lem’ transformation (row labelled CRF lem) performs slightly worse than the procedure without this transformation (row CRF nolem). [sent-263, score-0.244]

83 This supports the suspicion that relying on word-level lemmas may reduce the ability to generalise. [sent-264, score-0.118]

84 8% Table 2: Performance of NP lemmatisation including chunking errors. [sent-278, score-0.532]

85 Results corresponding to performance of the lemmatisation module alone are reported in Tab. [sent-279, score-0.463]

86 In this setting recall and precision have the same interpretation, hence we simply refer to the value as accuracy (percentage of chunks that were lemmatised correctly). [sent-282, score-0.153]
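
A small sketch of the accuracy measure as described: the share of reference chunks whose predicted lemma matches the reference lemma exactly. The function name and the toy data are assumptions made for illustration, and exact string match is assumed as the correctness criterion.

```python
def accuracy(predicted, reference):
    """Fraction of chunks whose predicted lemma equals the reference lemma."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must be aligned one-to-one")
    if not reference:
        return 0.0
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)


# Toy example: one of two chunks is lemmatised correctly.
print(accuracy(["ta droga", "tej drodze"], ["ta droga", "ta droga"]))
```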

87 The other important difference stems from phrase definitions used in both corpora; NPs in NCP are generally shorter than the chunks allowed in KPWr. [sent-289, score-0.116]

88 Note that the complex NP definition in KPWr also explains the huge gap between results of lemmatisation alone and lemmatisation including chunking errors. [sent-293, score-0.995]

89 7% | CRF lem: 444 / 564 | orth baseline: 314 / 564 | lem baseline: 290 / 564 | 78. [sent-295, score-0.294]

90 A rudimentary analysis of lemmatiser output indicates that the most common error is the assignment of the orthographic form as phrase lemma where something else was expected. [sent-301, score-0.342]

91 It seems that a part of these cases come from tagging errors (even if the correct transformation is obtained, the results of its application depend on the tag and lemma attached to the inflected form by the tagger). [sent-305, score-0.431]

92 Pod Napięciem was lemmatised to napięcie, which would be correct were it not a title). [sent-308, score-0.136]

93 7 Conclusions and further work We presented a novel approach to lemmatisation of Polish noun phrases. [sent-309, score-0.503]

94 The main advantage of this solution is that it allows the lemmatisation phase to be separated from the chunking phase. [sent-310, score-0.602]

95 Degórski’s rule-based approach (Degórski, 2011) was also built on top of an existing parser but, as he notes, to improve the lemmatisation accuracy, the grammar underlying the parser should actually be rewritten with lemmatisation in mind. [sent-311, score-0.926]

96 Extending existing chunk-annotated corpora with phrase lemmas corresponds to a relatively simple annotation task. [sent-313, score-0.152]

97 Such transformations may be expressed in terms of simple edit scripts, which has already been successfully applied to word-level lemmatisation of Polish and other languages (Chrupała et al. [sent-319, score-0.679]

98 This way, the training data labelled with transformations could be obtained automatically. [sent-321, score-0.262]

99 What is more, application of such transformations also does not depend on the dictionary. [sent-322, score-0.216]

100 Towards the lemmatisation of Polish nominal syntactic groups using a shallow grammar. [sent-352, score-0.493]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('lemmatisation', 0.463), ('nom', 0.312), ('rski', 0.235), ('transformations', 0.216), ('sg', 0.193), ('deg', 0.172), ('polish', 0.158), ('cas', 0.141), ('gnd', 0.132), ('kpwr', 0.132), ('ncp', 0.13), ('inflected', 0.129), ('lem', 0.125), ('lemma', 0.12), ('gen', 0.117), ('przepi', 0.117), ('np', 0.115), ('lemmatised', 0.107), ('nps', 0.103), ('nmb', 0.103), ('radziszewski', 0.103), ('transformation', 0.099), ('chunker', 0.09), ('lemmas', 0.082), ('rkowski', 0.078), ('nominative', 0.075), ('bembenik', 0.074), ('lemmatise', 0.074), ('miasto', 0.074), ('wroc', 0.074), ('orthographic', 0.073), ('crf', 0.073), ('phrase', 0.07), ('chunking', 0.069), ('tagset', 0.067), ('adj', 0.066), ('subst', 0.065), ('chunk', 0.06), ('gminy', 0.059), ('adam', 0.057), ('morphological', 0.056), ('forms', 0.055), ('phrases', 0.054), ('inst', 0.051), ('gender', 0.051), ('grammatical', 0.049), ('dictionary', 0.049), ('tag', 0.048), ('wne', 0.048), ('piskorski', 0.048), ('chunks', 0.046), ('labelled', 0.046), ('jakub', 0.045), ('prepositions', 0.044), ('lemmatiser', 0.044), ('municipality', 0.044), ('orth', 0.044), ('pawlaczek', 0.044), ('pwr', 0.044), ('szko', 0.044), ('ukasz', 0.044), ('wny', 0.044), ('inflectional', 0.044), ('decided', 0.041), ('tags', 0.041), ('noun', 0.04), ('aw', 0.04), ('nski', 0.039), ('preposition', 0.038), ('pl', 0.038), ('wordlevel', 0.036), ('marciniak', 0.036), ('broda', 0.036), ('poland', 0.036), ('notes', 0.035), ('form', 0.035), ('whole', 0.035), ('tagger', 0.034), ('ze', 0.031), ('names', 0.03), ('shallow', 0.03), ('bakterii', 0.029), ('biblioteka', 0.029), ('buczy', 0.029), ('chrupa', 0.029), ('drog', 0.029), ('gmina', 0.029), ('handlowej', 0.029), ('henryk', 0.029), ('koco', 0.029), ('miastem', 0.029), ('napi', 0.029), ('przy', 0.029), ('rybi', 0.029), ('ski', 0.029), ('sza', 0.029), ('tytu', 0.029), ('wikipedii', 0.029), ('wna', 0.029), ('wnym', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999881 227 acl-2013-Learning to lemmatise Polish noun phrases

Author: Adam Radziszewski

Abstract: We present a novel approach to noun phrase lemmatisation where the main phase is cast as a tagging problem. The idea draws on the observation that the lemmatisation of almost all Polish noun phrases may be decomposed into transformation of singular words (tokens) that make up each phrase. We perform evaluation, which shows results similar to those obtained earlier by a rule-based system, while our approach allows to separate chunking from lemmatisation.

2 0.1353156 290 acl-2013-Question Analysis for Polish Question Answering

Author: Piotr Przybyla

Abstract: This study is devoted to the problem of question analysis for a Polish question answering system. The goal of the question analysis is to determine its general structure, type of an expected answer and create a search query for finding relevant documents in a textual knowledge base. The paper contains an overview of available solutions of these problems, description of their implementation and presents an evaluation based on a set of 1137 questions from a Polish quiz TV show. The results help to understand how an environment of a Slavonic language affects the performance of methods created for English.

3 0.11418366 303 acl-2013-Robust multilingual statistical morphological generation models

Author: Ondrej Dusek ; Filip Jurcicek

Abstract: We present a novel method of statistical morphological generation, i.e. the prediction of inflected word forms given lemma, part-of-speech and morphological features, aimed at robustness to unseen inputs. Our system uses a trainable classifier to predict “edit scripts” that are then used to transform lemmas into inflected word forms. Suffixes of lemmas are included as features to achieve robustness. We evaluate our system on 6 languages with a varying degree of morphological richness. The results show that the system is able to learn most morphological phenomena and generalize to unseen inputs, producing significantly better results than a dictionary-based baseline.

4 0.096624047 378 acl-2013-Using subcategorization knowledge to improve case prediction for translation to German

Author: Marion Weller ; Alexander Fraser ; Sabine Schulte im Walde

Abstract: This paper demonstrates the need and impact of subcategorization information for SMT. We combine (i) features on source-side syntactic subcategorization and (ii) an external knowledge base with quantitative, dependency-based information about target-side subcategorization frames. A manual evaluation of an English-to-German translation task shows that the subcategorization information has a positive impact on translation quality through better prediction of case.

5 0.074838057 102 acl-2013-DErivBase: Inducing and Evaluating a Derivational Morphology Resource for German

Author: Britta Zeller ; Jan Snajder ; Sebastian Pado

Abstract: Derivational models are still an under-researched area in computational morphology. Even for German, a rather resource-rich language, there is a lack of large-coverage derivational knowledge. This paper describes a rule-based framework for inducing derivational families (i.e., clusters of lemmas in derivational relationships) and its application to create a high-coverage German resource, DERIVBASE, mapping over 280k lemmas into more than 17k non-singleton clusters. We focus on the rule component and a qualitative and quantitative evaluation. Our approach achieves up to 93% precision and 71% recall. We attribute the high precision to the fact that our rules are based on information from grammar books.

6 0.06958089 343 acl-2013-The Effect of Higher-Order Dependency Features in Discriminative Phrase-Structure Parsing

7 0.055630133 280 acl-2013-Plurality, Negation, and Quantification:Towards Comprehensive Quantifier Scope Disambiguation

8 0.052562606 323 acl-2013-Simpler unsupervised POS tagging with bilingual projections

9 0.051846091 204 acl-2013-Iterative Transformation of Annotation Guidelines for Constituency Parsing

10 0.05178345 205 acl-2013-Joint Apposition Extraction with Syntactic and Semantic Constraints

11 0.051503539 277 acl-2013-Part-of-speech tagging with antagonistic adversaries

12 0.050527871 78 acl-2013-Categorization of Turkish News Documents with Morphological Analysis

13 0.049304195 216 acl-2013-Large tagset labeling using Feed Forward Neural Networks. Case study on Romanian Language

14 0.049292307 76 acl-2013-Building and Evaluating a Distributional Memory for Croatian

15 0.04887813 374 acl-2013-Using Context Vectors in Improving a Machine Translation System with Bridge Language

16 0.046019722 372 acl-2013-Using CCG categories to improve Hindi dependency parsing

17 0.045317825 80 acl-2013-Chinese Parsing Exploiting Characters

18 0.045144916 110 acl-2013-Deepfix: Statistical Post-editing of Statistical Machine Translation Using Deep Syntactic Analysis

19 0.044948079 44 acl-2013-An Empirical Examination of Challenges in Chinese Parsing

20 0.044216398 98 acl-2013-Cross-lingual Transfer of Semantic Role Labeling Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.128), (1, -0.021), (2, -0.018), (3, -0.03), (4, -0.021), (5, -0.018), (6, -0.017), (7, -0.015), (8, 0.067), (9, 0.004), (10, -0.033), (11, 0.02), (12, 0.02), (13, -0.0), (14, -0.113), (15, 0.006), (16, -0.026), (17, -0.044), (18, -0.036), (19, 0.026), (20, -0.112), (21, 0.004), (22, 0.049), (23, 0.018), (24, 0.038), (25, 0.002), (26, -0.038), (27, -0.083), (28, 0.009), (29, -0.024), (30, 0.001), (31, -0.009), (32, -0.012), (33, 0.05), (34, -0.016), (35, -0.006), (36, -0.026), (37, -0.011), (38, 0.027), (39, -0.082), (40, -0.009), (41, 0.012), (42, 0.063), (43, -0.077), (44, -0.033), (45, 0.042), (46, 0.035), (47, 0.024), (48, 0.02), (49, 0.01)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92055476 227 acl-2013-Learning to lemmatise Polish noun phrases

Author: Adam Radziszewski

Abstract: We present a novel approach to noun phrase lemmatisation where the main phase is cast as a tagging problem. The idea draws on the observation that the lemmatisation of almost all Polish noun phrases may be decomposed into transformation of singular words (tokens) that make up each phrase. We perform evaluation, which shows results similar to those obtained earlier by a rule-based system, while our approach allows to separate chunking from lemmatisation.

2 0.76859224 303 acl-2013-Robust multilingual statistical morphological generation models

Author: Ondrej Dusek ; Filip Jurcicek

Abstract: We present a novel method of statistical morphological generation, i.e. the prediction of inflected word forms given lemma, part-of-speech and morphological features, aimed at robustness to unseen inputs. Our system uses a trainable classifier to predict “edit scripts” that are then used to transform lemmas into inflected word forms. Suffixes of lemmas are included as features to achieve robustness. We evaluate our system on 6 languages with a varying degree of morphological richness. The results show that the system is able to learn most morphological phenomena and generalize to unseen inputs, producing significantly better results than a dictionary-based baseline.

3 0.66534126 78 acl-2013-Categorization of Turkish News Documents with Morphological Analysis

Author: Burak Kerim Akkuş ; Ruket Cakici

Abstract: Morphologically rich languages such as Turkish may benefit from morphological analysis in natural language tasks. In this study, we examine the effects of morphological analysis on text categorization task in Turkish. We use stems and word categories that are extracted with morphological analysis as main features and compare them with fixed length stemmers in a bag of words approach with several learning algorithms. We aim to show the effects of using varying degrees of morphological information.

4 0.66330898 378 acl-2013-Using subcategorization knowledge to improve case prediction for translation to German

Author: Marion Weller ; Alexander Fraser ; Sabine Schulte im Walde

Abstract: This paper demonstrates the need and impact of subcategorization information for SMT. We combine (i) features on source-side syntactic subcategorization and (ii) an external knowledge base with quantitative, dependency-based information about target-side subcategorization frames. A manual evaluation of an English-to-German translation task shows that the subcategorization information has a positive impact on translation quality through better prediction of case.

5 0.63563013 295 acl-2013-Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages

Author: Dan Garrette ; Jason Mielens ; Jason Baldridge

Abstract: Developing natural language processing tools for low-resource languages often requires creating resources from scratch. While a variety of semi-supervised methods exist for training from incomplete data, there are open questions regarding what types of training data should be used and how much is necessary. We discuss a series of experiments designed to shed light on such questions in the context of part-of-speech tagging. We obtain timed annotations from linguists for the low-resource languages Kinyarwanda and Malagasy (as well as English) and evaluate how the amounts of various kinds of data affect performance of a trained POS-tagger. Our results show that annotation of word types is the most important, provided a sufficiently capable semi-supervised learning infrastructure is in place to project type information onto a raw corpus. We also show that finite-state morphological analyzers are effective sources of type information when few labeled examples are available.

6 0.62662214 286 acl-2013-Psycholinguistically Motivated Computational Models on the Organization and Processing of Morphologically Complex Words

7 0.56435353 186 acl-2013-Identifying English and Hungarian Light Verb Constructions: A Contrastive Approach

8 0.55415511 327 acl-2013-Sorani Kurdish versus Kurmanji Kurdish: An Empirical Comparison

9 0.54367989 102 acl-2013-DErivBase: Inducing and Evaluating a Derivational Morphology Resource for German

10 0.52759099 110 acl-2013-Deepfix: Statistical Post-editing of Statistical Machine Translation Using Deep Syntactic Analysis

11 0.52171558 299 acl-2013-Reconstructing an Indo-European Family Tree from Non-native English Texts

12 0.51335812 28 acl-2013-A Unified Morpho-Syntactic Scheme of Stanford Dependencies

13 0.51242757 367 acl-2013-Universal Conceptual Cognitive Annotation (UCCA)

14 0.51196289 205 acl-2013-Joint Apposition Extraction with Syntactic and Semantic Constraints

15 0.50728118 302 acl-2013-Robust Automated Natural Language Processing with Multiword Expressions and Collocations

16 0.4992463 364 acl-2013-Typesetting for Improved Readability using Lexical and Syntactic Information

17 0.4850181 76 acl-2013-Building and Evaluating a Distributional Memory for Croatian

18 0.4834069 39 acl-2013-Addressing Ambiguity in Unsupervised Part-of-Speech Induction with Substitute Vectors

19 0.47605509 371 acl-2013-Unsupervised joke generation from big data

20 0.47329608 270 acl-2013-ParGramBank: The ParGram Parallel Treebank


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.049), (6, 0.028), (10, 0.033), (11, 0.033), (13, 0.016), (15, 0.011), (24, 0.038), (26, 0.048), (35, 0.072), (42, 0.063), (48, 0.046), (70, 0.025), (80, 0.328), (88, 0.029), (90, 0.022), (95, 0.077)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.80664784 14 acl-2013-A Novel Classifier Based on Quantum Computation

Author: Ding Liu ; Xiaofang Yang ; Minghu Jiang

Abstract: In this article, we propose a novel classifier based on quantum computation theory. Different from existing methods, we consider the classification as an evolutionary process of a physical system and build the classifier by using the basic quantum mechanics equation. The performance of the experiments on two datasets indicates feasibility and potentiality of the quantum classifier.

same-paper 2 0.77310717 227 acl-2013-Learning to lemmatise Polish noun phrases

Author: Adam Radziszewski

Abstract: We present a novel approach to noun phrase lemmatisation where the main phase is cast as a tagging problem. The idea draws on the observation that the lemmatisation of almost all Polish noun phrases may be decomposed into transformation of singular words (tokens) that make up each phrase. We perform evaluation, which shows results similar to those obtained earlier by a rule-based system, while our approach allows to separate chunking from lemmatisation.

3 0.70041835 91 acl-2013-Connotation Lexicon: A Dash of Sentiment Beneath the Surface Meaning

Author: Song Feng ; Jun Seok Kang ; Polina Kuznetsova ; Yejin Choi

Abstract: Understanding the connotation of words plays an important role in interpreting subtle shades of sentiment beyond denotative or surface meaning of text, as seemingly objective statements often allude nuanced sentiment of the writer, and even purposefully conjure emotion from the readers’ minds. The focus of this paper is drawing nuanced, connotative sentiments from even those words that are objective on the surface, such as “intelligence”, “human”, and “cheesecake”. We propose induction algorithms encoding a diverse set of linguistic insights (semantic prosody, distributional similarity, semantic parallelism of coordination) and prior knowledge drawn from lexical resources, resulting in the first broad-coverage connotation lexicon.

4 0.62964737 135 acl-2013-English-to-Russian MT evaluation campaign

Author: Pavel Braslavski ; Alexander Beloborodov ; Maxim Khalilov ; Serge Sharoff

Abstract: This paper presents the settings and the results of the ROMIP 2013 MT shared task for the English→Russian language direction. The quality of generated translations was assessed using automatic metrics and human evaluation. We also discuss ways to reduce human evaluation efforts using pairwise sentence comparisons by human judges to simulate sort operations.

5 0.61243773 196 acl-2013-Improving pairwise coreference models through feature space hierarchy learning

Author: Emmanuel Lassalle ; Pascal Denis

Abstract: This paper proposes a new method for significantly improving the performance of pairwise coreference models. Given a set of indicators, our method learns how to best separate types of mention pairs into equivalence classes for which we construct distinct classification models. In effect, our approach finds an optimal feature space (derived from a base feature set and indicator set) for discriminating coreferential mention pairs. Although our approach explores a very large space of possible feature spaces, it remains tractable by exploiting the structure of the hierarchies built from the indicators. Our experiments on the CoNLL-2012 Shared Task English datasets (gold mentions) indicate that our method is robust relative to different clustering strategies and evaluation metrics, showing large and consistent improvements over a single pairwise model using the same base features. Our best system obtains a competitive 67.2 of average F1 over MUC, B3, and CEAF which, despite its simplicity, places it above the mean score of other systems on these datasets.

6 0.6111356 281 acl-2013-Post-Retrieval Clustering Using Third-Order Similarity Measures

7 0.43833113 98 acl-2013-Cross-lingual Transfer of Semantic Role Labeling Models

8 0.43358552 172 acl-2013-Graph-based Local Coherence Modeling

9 0.43348625 174 acl-2013-Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation

10 0.43319458 290 acl-2013-Question Analysis for Polish Question Answering

11 0.43173957 316 acl-2013-SenseSpotting: Never let your parallel data tie you to an old domain

12 0.43170226 312 acl-2013-Semantic Parsing as Machine Translation

13 0.4315722 18 acl-2013-A Sentence Compression Based Framework to Query-Focused Multi-Document Summarization

14 0.43076101 264 acl-2013-Online Relative Margin Maximization for Statistical Machine Translation

15 0.43075585 70 acl-2013-Bilingually-Guided Monolingual Dependency Grammar Induction

16 0.43049085 62 acl-2013-Automatic Term Ambiguity Detection

17 0.43026945 187 acl-2013-Identifying Opinion Subgroups in Arabic Online Discussions

18 0.4301506 127 acl-2013-Docent: A Document-Level Decoder for Phrase-Based Statistical Machine Translation

19 0.430089 97 acl-2013-Cross-lingual Projections between Languages from Different Families

20 0.42988831 276 acl-2013-Part-of-Speech Induction in Dependency Trees for Statistical Machine Translation