emnlp emnlp2010 emnlp2010-117 knowledge-graph by maker-knowledge-mining

117 emnlp-2010-Using Unknown Word Techniques to Learn Known Words


Source: pdf

Author: Kostadin Cholakov ; Gertjan van Noord

Abstract: Unknown words are a hindrance to the performance of hand-crafted computational grammars of natural language. However, words with incomplete and incorrect lexical entries pose an even bigger problem because they can be the cause of a parsing failure despite being listed in the lexicon of the grammar. Such lexical entries are hard to detect and even harder to correct. We employ an error miner to pinpoint words with problematic lexical entries. An automated lexical acquisition technique is then used to learn new entries for those words which allows the grammar to parse previously uncovered sentences successfully. We test our method on a large-scale grammar of Dutch and a set of sentences for which this grammar fails to produce a parse. The application of the method enables the grammar to cover 83.76% of those sentences with an accuracy of 86.15%.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 However, words with incomplete and incorrect lexical entries pose an even bigger problem because they can be the cause of a parsing failure despite being listed in the lexicon of the grammar. [sent-4, score-0.531]

2 Such lexical entries are hard to detect and even harder to correct. [sent-5, score-0.267]

3 We employ an error miner to pinpoint words with problematic lexical entries. [sent-6, score-0.479]

4 An automated lexical acquisition technique is then used to learn new entries for those words which allows the grammar to parse previously uncovered sentences successfully. [sent-7, score-0.552]

5 1 Introduction In this paper, we present an automated two-phase method for treating incomplete or incorrect lexical entries in the lexicons of large-scale computational grammars. [sent-12, score-0.311]
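
As a rough sketch of this two-phase idea, the following Python fragment assumes three hypothetical callables (mine_errors for the error miner, predict_entries for the LA classifier, covers for a full-span parse check) and a lexicon mapping words to sets of lexical types; it illustrates the workflow only and is not the actual Alpino tooling.

```python
def treat_problematic_words(sentences, lexicon, mine_errors, predict_entries, covers):
    """Two-phase treatment: error mining, then lexical acquisition (LA)."""
    # Phase 1: collect sentences without a full-span parse and mine them for
    # words that are listed in the lexicon yet keep causing parse failures.
    failed = [s for s in sentences if not covers(s, lexicon)]
    suspects = [w for w in mine_errors(failed) if w in lexicon]

    # Phase 2: treat each suspect as if it were unknown and learn new entries.
    for word in suspects:
        predicted = predict_entries(word, failed)
        # No existing entry is modified; predicted entries are only added.
        lexicon[word] = set(lexicon[word]) | set(predicted)
    return lexicon
```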

6 In the first phase, error mining pinpoints words which are listed in the lexicon of a given grammar but which nevertheless often lead to a parsing failure. [sent-23, score-0.365]

7 This indicates that the current lexical entry for such a word is either wrong or incomplete and that one or more correct entries for this word are missing from the lexicon. [sent-24, score-0.409]

8 In the case study presented here, we employ the iterative error miner of de Kok et al. [sent-26, score-0.343]

9 For example, the word afwater (to drain) is listed as a first person singular present verb in the Alpino lexicon. [sent-33, score-0.35]

10 However, the error miner identifies this word as the reason for the parsing failure of 9 sentences. [sent-34, score-0.343]

11 A manual examination reveals that the word is used as a neuter noun in these cases: het afwater (the drainage). [sent-35, score-0.429]

12 After the error miner identifies afwater as a problematic word, we employ our machine learning based LA method presented in Cholakov and van Noord (2010) to learn new entries for this word. [sent-37, score-0.88]

13 This method has already been successfully applied to the task of learning lexical entries for unknown words and, like the error miner, it can be used ‘out of the box’. [sent-38, score-0.383]

14 A new entry is learnt for afwater, and the addition of this entry to the lexicon enables Alpino to cover the 9 problematic sentences from the Mediargus corpus. [sent-41, score-0.62]

15 It should be noted that since our approach cannot differentiate between incomplete and incorrect entries, no entry in the lexicon is modified. [sent-42, score-0.227]

16 We simply add the lexical entries which, according to the LA method, are most suitable for a given problematic word and assume that, if these entries are correct, the grammar should be able to cover previously unparsable sentences in which the word occurs. [sent-43, score-0.853]

17 Section 4 describes an experiment where error mining is performed on the Mediargus corpus and then, LA is applied to learn new lexical entries for problematic words. [sent-47, score-0.522]

18 Section 5 discusses the effect which the addition of the new entries to the lexicon has on the parsing coverage and accuracy. [sent-48, score-0.406]

19 2 Error Mining The error miner of de Kok et al. [sent-51, score-0.31]

20 (2009) combines the strengths of the error mining methods of van Noord (2004) and Sagot and de la Clergerie (2006). [sent-52, score-0.474]

21 The iterative error mining algorithm of Sagot and de la Clergerie (2006) tackles this problem by taking the following into account: • If a form occurs within parsable sentences, it becomes less likely to be the cause of a parsing failure. [sent-76, score-0.475]

22 • The suspicion of a form depends on the suspicions of the other forms in the unparsable sentences it occurs in. [sent-77, score-0.22]
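
For illustration, a minimal fixed-point version of this suspicion update could look as follows; the actual algorithms of Sagot and de la Clergerie (2006) and de Kok et al. (2009) use a more refined weighting and, in the latter case, n-gram expansion, so this is only a sketch of the idea. Sentences are represented as plain lists of word forms.

```python
from collections import defaultdict

def mine_suspicions(unparsable, parsable, iterations=10):
    """Simplified fixed-point suspicion update, loosely following
    Sagot and de la Clergerie (2006)."""
    # Occurrences in parsable sentences dilute a form's suspicion.
    ok_counts = defaultdict(int)
    for sent in parsable:
        for form in sent:
            ok_counts[form] += 1

    suspicion = defaultdict(lambda: 1.0)   # start with uniform suspicion
    for _ in range(iterations):
        blame = defaultdict(float)
        bad_counts = defaultdict(int)
        for sent in unparsable:
            total = sum(suspicion[f] for f in sent)
            for form in sent:
                # Each unparsable sentence distributes its blame over its
                # forms in proportion to their current suspicions.
                blame[form] += suspicion[form] / total if total else 0.0
                bad_counts[form] += 1
        for form in blame:
            # Average blame per occurrence, counting parsable occurrences too.
            suspicion[form] = blame[form] / (bad_counts[form] + ok_counts[form])
    return dict(suspicion)
```

Iterating this update concentrates suspicion on forms that occur mostly in unparsable sentences whose other forms are themselves unsuspicious.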

23 (2009) uses a preprocessor to the iterative miner of Sagot and de la Clergerie (2006) which iterates through a sentence of unigrams and expands unigrams to longer n-grams when there is evidence that this is useful. [sent-84, score-0.621]

24 The grammar takes a ‘constructional’ approach, with rich lexical representations stored in the lexicon and a large number of detailed, construction specific rules (about 800). [sent-107, score-0.267]

25 Currently, the lexicon contains over 100K lexical entries and a list of about 200K named entities. [sent-108, score-0.38]

26 For example, the verb amuseert (to amuse) is assigned two lexical types, verb(hebben,sg3,intransitive) and verb(hebben,sg3,transitive), because it can be used either transitively or intransitively. [sent-110, score-0.226]

27 The other type features indicate that it is a present third person singular verb and it forms perfect tense with the auxiliary verb hebben. [sent-111, score-0.262]

28 The types considered in the learning process are called universal types. [sent-115, score-0.234]

29 One verb and one noun paradigm are generated for afwater. [sent-130, score-0.24]

30 In these paradigms, afwater is listed as a first person singular present verb form and a singular het noun form, respectively. [sent-131, score-0.581]

31 Next, syntactic features for afwater are obtained by extracting a number of sentences which it occurs in from large corpora or Internet. [sent-133, score-0.288]

32 These sentences are parsed with a different ‘mode’ of Alpino where this word is assigned all universal types, i. [sent-134, score-0.336]

33 Then, the lexical type that has been assigned to afwater in this parse is stored. [sent-138, score-0.348]

34 For example, if a determiner occurs before the unknown word, all verb types are typically not taken into consideration. [sent-140, score-0.253]

35 This heavily reduces the computational overhead and makes parsing with universal types computationally feasible. [sent-141, score-0.287]

36 When a word is assigned a verb or an adjective type by the classifier but there is no verb or adjective paradigm generated for it, all verb or adjective predictions for this word are discarded. [sent-147, score-0.602]
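
A small sketch of this paradigm check, assuming a hypothetical predictions dictionary (word to predicted lexical type strings in the verb(hebben,sg3,transitive) style used above) and a paradigms dictionary (word to the categories for which a paradigm could be generated):

```python
def filter_by_paradigm(predictions, paradigms):
    """Drop verb/adjective predictions for words lacking the matching paradigm.
    Both input dictionaries are illustrative stand-ins, not the real data model."""
    filtered = {}
    for word, types in predictions.items():
        kept = []
        for t in types:
            pos = t.split('(')[0]   # 'verb(hebben,sg3,transitive)' -> 'verb'
            if pos in ('verb', 'adjective') and pos not in paradigms.get(word, set()):
                continue            # no supporting paradigm: discard the prediction
            kept.append(t)
        filtered[word] = kept
    return filtered
```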

37 These sentences are again parsed with the universal types. [sent-153, score-0.302]

38 Then we look up the assigned universal verb types, calculate the MLE for each subcategorization frame and filter out frames with MLE below some empirical threshold. [sent-154, score-0.515]
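
The frame-filtering step can be sketched as a relative-frequency (MLE) computation over the assigned verb types; the type-string format follows the examples above, and the 0.1 threshold is only an illustrative placeholder, not the empirical value used in the paper.

```python
from collections import Counter

def frame_of(lexical_type):
    # 'verb(hebben,sg3,transitive)' -> 'transitive'
    return lexical_type.rstrip(')').split(',')[-1]

def select_frames(assigned_verb_types, threshold=0.1):
    """Keep subcategorization frames whose relative frequency (MLE) among the
    universal verb types assigned during parsing reaches the threshold."""
    counts = Counter(frame_of(t) for t in assigned_verb_types)
    total = sum(counts.values())
    if total == 0:
        return set()
    return {frame for frame, c in counts.items() if c / total >= threshold}
```

For instance, a frame observed only once among many assigned verb types would fall below the threshold and be discarded.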

39 The Mediargus corpus has been parsed with Alpino and the parsing results are fed into the error miner of de Kok et al. [sent-166, score-0.428]

40 When finished, the error miner stores the results in a database containing potentially problematic n-grams. [sent-171, score-0.369]

41 Further, we select from this list only those unigrams which have lexical entries in the Alpino lexicon and occur in more than 5 sentences with no full-span parse. [sent-177, score-0.521]
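
A sketch of this selection step, with a hypothetical failure_counts mapping from a form to the number of sentences without a full-span parse in which it occurs:

```python
def select_problem_unigrams(mined_ngrams, lexicon, failure_counts, min_failures=5):
    """Keep only unigrams that are already in the lexicon and occur in more
    than `min_failures` sentences without a full-span parse."""
    return [
        ng for ng in mined_ngrams
        if len(ng.split()) == 1                  # unigrams only
        and ng in lexicon                        # known to the grammar
        and failure_counts.get(ng, 0) > min_failures
    ]
```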

42 Sometimes, the error miner might be wrong about the exact word which causes the parsing failure for a given sentence. [sent-178, score-0.401]

43 The small number of selected words is due to the fact that most of the problematic 4179 unigrams represent tokenization errors (two or more words written as one) and spelling mistakes which, naturally, are not listed in the Alpino lexicon. [sent-182, score-0.222]

44 Table 2 shows some of the problematic unigrams and their suspicions. [sent-184, score-0.222]

45 The unigram passerde should be written as passeerde, the past singular verb form of the verb ‘to pass’ and toegnag is the misspelled noun toegang (access). [sent-199, score-0.383]

46 The only problematic unigram with a lexical entry in the Alpino lexicon is mistrap (misstep, to misstep). [sent-200, score-0.491]

47 2 Applying Lexical Acquisition Our assumption is that incomplete or incorrect lexical entries prevented the production of full-span parses for the 388 sentences in which the 36 problematic words pinpointed by the error miner occur. [sent-204, score-0.761]

48 They are treated as unknown words, and we employ the LA method presented in the previous section to learn new lexical entries for them offline. [sent-207, score-0.356]

49 The set of universal types consists of 611 types and the ME-based classifier has been trained on the same set of 2000 words as in Cholakov and van Noord (2010). [sent-209, score-0.319]

50 In order to increase the number of observed contexts for a given word when parsing with the universal types, up to 100 additional sentences in which the word occurs are extracted from Internet. [sent-211, score-0.323]

51 However, when predicting new lexical entries for this word, we want to take into account only sentences where it causes a parsing failure. [sent-212, score-0.402]

52 For example, the LA method would be able to predict a noun entry for afwater if it focuses only on contexts where it has a noun reading, i. [sent-214, score-0.432]

53 Although we cannot be sure that the 36 words are the cause of a parsing failure in each of the uncovered sentences, this low coverage indicates once more that Alpino has systematic problems with sentences containing these words. [sent-220, score-0.282]

54 Then, the uncovered sentences from Internet together with the 388 problematic sentences from the Mediargus corpus are parsed with Alpino and the universal types. [sent-221, score-0.56]

55 For example, the list of universal types assigned to afwater in (4) contains mostly noun types, i. [sent-222, score-0.535]

56 Since a verb can have various subcategorization frames, there is one type assigned for each frame. [sent-228, score-0.242]

57 For example, inscheppen (to spoon in(to)) receives 3 types which differ only in the subcategorization frame: verb(hebben,inf,tr. [sent-229, score-0.305]

58 Let us examine the most frequent types of lexicon errors for the 36 problematic words by looking at the current Alpino lexical entries for some of these words and the predictions they receive from the LA method. [sent-235, score-0.592]

59 The original Alpino entries for 19 of the 25 words predicted to be verbs are a product of a specific lexical rule in the grammar. [sent-236, score-0.295]

60 I spoon the soup in the bowl, 'I spoon the soup into the bowl.' [sent-241, score-0.461]

61 dat ik de soep de kom in schep (that I the soup the bowl in spoon, 'that I spoon the soup into the bowl'); dat ik de soep de kom inschep (that I the soup the bowl in spoon, 'that I spoon the soup into the bowl'). We see in (5-b) that the preposition in is used as a postposition in the relative clause. [sent-242, score-1.236]

62 However, in some cases, the entries generated by this lexical rule cannot account for other possible usages of the verbs in question. [sent-248, score-0.267]

63 Now, when the LA method has predicted a transitive verb type for inscheppen, the parser should be able to cover the sentence. [sent-253, score-0.265]

64 This should enable the parser to cover sentences like: (7) Die moet een deel van het afwater vervoeren. [sent-261, score-0.593]

65 Currently, their lexical entries are incomplete because they are assigned only past participle types in the lexicon. [sent-265, score-0.394]

66 5 Results After LA is finished, we restore the original lexical entries for the 36 words but, additionally, each word is also assigned the types which have been predicted for it by the LA method. [sent-269, score-0.378]

67 We also examine how the parsing accuracy of Alpino changes. Table 3 shows that when the Alpino lexicon is extended with the lexical entries we learnt through LA, the parser is able to cover nearly 84% of the sentences, including the ones given in (6) and (7). [sent-273, score-0.555]

68 Since there is no suitable baseline which this result can be compared to, we developed an additional model which indicates what is likely to be the maximum coverage that Alpino can achieve for those sentences by adding new lexical entries only. [sent-274, score-0.369]

69 In this second model, for each of the 36 words, we add to the lexicon all types which were successfully used for the respective word during the parsing with universal types. [sent-275, score-0.4]
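
The two lexicon variants compared in Table 3 can be pictured as follows; the input dictionaries (original entries, LA predictions, universal types actually used) are hypothetical stand-ins mapping words to sets of lexical types.

```python
def build_lexicons(original, la_predictions, universal_used, words):
    """LA model: original entries plus LA-predicted types.
    Upper-bound model: original entries plus every universal type that was
    actually used for the word while parsing with universal types."""
    la_model = dict(original)
    upper_bound = dict(original)
    for w in words:
        la_model[w] = original.get(w, set()) | la_predictions.get(w, set())
        upper_bound[w] = original.get(w, set()) | universal_used.get(w, set())
    return la_model, upper_bound
```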

70 Table 3: Coverage results for the re-parsed 388 problematic sentences. Some of the sentences which cannot be covered by either model are actually not proper sentences but fragments which were wrongly identified as sentences during tokenization. [sent-286, score-0.341]

71 Here is a more interesting case: (9) Als we ons naar de buffettafel begeven, mistrap ik me (when we us to the buffet proceed, misstep I me). [sent-291, score-0.246]

72 The LA method does not predict a reflexive verb type for mistrap, which prevents the production of a full-span analysis because Alpino cannot connect the reflexive pronoun me to mistrap. [sent-293, score-0.265]

73 A reflexive verb type is among the universal types and thus, Alpino is able to use that type to deliver a full-span parse. [sent-295, score-0.421]

74 We should note, though, that LA correctly predicts a noun type for mistrap, which enables Alpino to successfully parse the other 14 sentences in which this word occurs. [sent-296, score-0.28]

75 Clearly, this baseline is expected to perform worse than both our model and the universal types one since those are able to cover most of the sentences and thus, they are likely to produce more correct dependency relations. [sent-302, score-0.366]

76 Our model and the universal types one achieve the same accuracy for most of the sentences. [sent-312, score-0.234]

77 However, the universal types model has an important disadvantage which, in some cases, leads to the production of wrong dependency relations. [sent-313, score-0.262]

78 The model predicts a large number of lexical types which, in turn, leads to large lexical ambiguity. [sent-314, score-0.232]

79 Let us consider the following example where a sentence is covered by both models but the universal types model has lower accuracy: (10) Dat wij het rechttrokken, pleit voor onze that we it straighten. [sent-316, score-0.369]

80 Here, het is the object of the verb rechttrokken. [sent-321, score-0.25]

81 However, although there are transitive verb types among the universal types assigned to rechttrokken, Alpino chooses to use a verb type which subcategorizes for a measure NP. [sent-322, score-0.547]

82 Since it considers sentences containing other forms of the paradigm of rechttrokken when predicting subcategorization frames, the LA method correctly assigns only one transitive and one intransitive verb type to this word. [sent-327, score-0.362]

83 This allows Alpino to recognize het as the object of the verb and to produce the correct dependency relation. [sent-328, score-0.25]

84 The few cases where the universal types model outperforms ours include sentences like the one given in (9) where the application of our model could not enable Alpino to assign a full-span analysis. [sent-329, score-0.286]

85 These types, on the other hand, could be provided by the universal types model and could enable Alpino to cover a given sentence and thus, to produce more correct dependency relations. [sent-331, score-0.283]

86 Then, they employ LA to learn proper lexical entries for these MWEs and add them to the lexicon of a large-scale HPSG grammar of English (ERG; (Copestake and Flickinger, 2000)). [sent-338, score-0.49]

87 The lexicon is used in two grammars– the FRMG (Thomasset and de la Clergerie, 2005), a hybrid Tree Adjoining/Tree Insertion Grammar, and the SxLFGFR LFG grammar (Boullier and Sagot, 2006). [sent-345, score-0.457]

88 The first step in this approach is also the application of an error miner (Sagot and de la Clergerie, 2006) which uses a parsed newspaper corpus (about 4. [sent-346, score-0.568]

89 (2008) assign underspecified lexical entries to a given problematic unigram to allow the grammar to parse the uncovered sentences associated with this unigram. [sent-350, score-0.666]

90 As a consequence of that, the ranked list of lexical entries for each unigram is manually validated to filter out the wrong entries. [sent-353, score-0.325]

91 The ranking of the predictions is done by the classifier and the predicted entries are good enough to improve the parsing coverage and accuracy without any manual work involved. [sent-355, score-0.351]

92 , a verb with a rare subcat frame) at the bottom of the ranked list because of the low number of sentences in which this entry is used. [sent-361, score-0.237]

93 (2008) uses the lexical entries which remain after the manual validation to re-parse the newspaper corpus. [sent-365, score-0.267]

94 However, the authors do not mention how many of the original uncovered sentences they are able to cover and therefore, we cannot compare our coverage result. [sent-369, score-0.255]

95 Although the lexicon contains a verb entry for ‘schampte’, there is no entry handling the case when this verb combines with the particle ‘af’. [sent-393, score-0.517]

96 Our method is currently not able to capture these two cases since they can be identified as problematic on bigram level and not when only unigrams are considered. [sent-397, score-0.281]

97 Further, the definition of what the error miner considers to be a successful parse is a rather crude one. [sent-398, score-0.27]

98 Therefore, it is possible that a word could have a problematic lexical entry even if it only occurs in sentences which are assigned a full-span parse. [sent-400, score-0.399]

99 Lionel Nicolas, Benoît Sagot, Miguel Molinero, Jacques Farré, and Eric de la Clergerie. [sent-457, score-0.267]

100 Benoît Sagot, Lionel Clément, Eric de la Clergerie, and Pierre Boullier. [sent-469, score-0.267]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('alpino', 0.622), ('afwater', 0.203), ('la', 0.193), ('entries', 0.19), ('universal', 0.185), ('miner', 0.176), ('cholakov', 0.162), ('noord', 0.162), ('het', 0.135), ('problematic', 0.133), ('verb', 0.115), ('lexicon', 0.113), ('sagot', 0.104), ('mediargus', 0.095), ('soup', 0.095), ('spoon', 0.095), ('subcategorization', 0.093), ('unigrams', 0.089), ('van', 0.085), ('bowl', 0.081), ('clergerie', 0.081), ('suspicion', 0.081), ('gertjan', 0.081), ('grammar', 0.077), ('lexical', 0.077), ('de', 0.074), ('uncovered', 0.073), ('entry', 0.07), ('inscheppen', 0.068), ('mistrap', 0.068), ('parsed', 0.065), ('noun', 0.064), ('kok', 0.063), ('mining', 0.062), ('paradigm', 0.061), ('error', 0.06), ('unknown', 0.056), ('flemish', 0.054), ('kordoni', 0.054), ('kostadin', 0.054), ('unparsable', 0.054), ('valia', 0.054), ('failure', 0.054), ('nicolas', 0.054), ('dutch', 0.054), ('parsing', 0.053), ('sentences', 0.052), ('coverage', 0.05), ('cover', 0.049), ('types', 0.049), ('acquisition', 0.049), ('beno', 0.046), ('suspicious', 0.046), ('frame', 0.046), ('incomplete', 0.044), ('adjective', 0.044), ('frames', 0.042), ('paradigms', 0.042), ('parser', 0.042), ('expfactor', 0.041), ('frmg', 0.041), ('misstep', 0.041), ('mwes', 0.041), ('rechttrokken', 0.041), ('reflexive', 0.041), ('soep', 0.041), ('af', 0.036), ('ik', 0.036), ('genoa', 0.035), ('villavicencio', 0.035), ('assigned', 0.034), ('parse', 0.034), ('particle', 0.034), ('occurs', 0.033), ('employ', 0.033), ('singular', 0.032), ('able', 0.031), ('mle', 0.031), ('multiword', 0.031), ('deep', 0.031), ('predictions', 0.03), ('unigram', 0.03), ('causes', 0.03), ('parses', 0.029), ('predicts', 0.029), ('wrong', 0.028), ('currently', 0.028), ('predicted', 0.028), ('aline', 0.027), ('boullier', 0.027), ('buffet', 0.027), ('drainage', 0.027), ('groningen', 0.027), ('ispoon', 0.027), ('kom', 0.027), ('lefff', 0.027), ('lionel', 0.027), ('moet', 0.027), ('neuter', 0.027), ('passerde', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999899 117 emnlp-2010-Using Unknown Word Techniques to Learn Known Words

Author: Kostadin Cholakov ; Gertjan van Noord

Abstract: Unknown words are a hindrance to the performance of hand-crafted computational grammars of natural language. However, words with incomplete and incorrect lexical entries pose an even bigger problem because they can be the cause of a parsing failure despite being listed in the lexicon of the grammar. Such lexical entries are hard to detect and even harder to correct. We employ an error miner to pinpoint words with problematic lexical entries. An automated lexical acquisition technique is then used to learn new entries for those words which allows the grammar to parse previously uncovered sentences successfully. We test our method on a large-scale grammar of Dutch and a set of sentences for which this grammar fails to produce a parse. The application of the method enables the grammar to cover 83.76% of those sentences with an accuracy of 86.15%.

2 0.095164105 116 emnlp-2010-Using Universal Linguistic Knowledge to Guide Grammar Induction

Author: Tahira Naseem ; Harr Chen ; Regina Barzilay ; Mark Johnson

Abstract: We present an approach to grammar induction that utilizes syntactic universals to improve dependency parsing across a range of languages. Our method uses a single set of manually-specified language-independent rules that identify syntactic dependencies between pairs of syntactic categories that commonly occur across languages. During inference of the probabilistic model, we use posterior expectation constraints to require that a minimum proportion of the dependencies we infer be instances of these rules. We also automatically refine the syntactic categories given in our coarsely tagged input. Across six languages our approach outperforms state-of-theart unsupervised methods by a significant margin.1

3 0.08619491 114 emnlp-2010-Unsupervised Parse Selection for HPSG

Author: Rebecca Dridan ; Timothy Baldwin

Abstract: Parser disambiguation with precision grammars generally takes place via statistical ranking of the parse yield of the grammar using a supervised parse selection model. In the standard process, the parse selection model is trained over a hand-disambiguated treebank, meaning that without a significant investment of effort to produce the treebank, parse selection is not possible. Furthermore, as treebanking is generally streamlined with parse selection models, creating the initial treebank without a model requires more resources than subsequent treebanks. In this work, we show that, by taking advantage of the constrained nature of these HPSG grammars, we can learn a discriminative parse selection model from raw text in a purely unsupervised fashion. This allows us to bootstrap the treebanking process and provide better parsers faster, and with less resources.

4 0.078285426 65 emnlp-2010-Inducing Probabilistic CCG Grammars from Logical Form with Higher-Order Unification

Author: Tom Kwiatkowksi ; Luke Zettlemoyer ; Sharon Goldwater ; Mark Steedman

Abstract: This paper addresses the problem of learning to map sentences to logical form, given training data consisting of natural language sentences paired with logical representations of their meaning. Previous approaches have been designed for particular natural languages or specific meaning representations; here we present a more general method. The approach induces a probabilistic CCG grammar that represents the meaning of individual words and defines how these meanings can be combined to analyze complete sentences. We use higher-order unification to define a hypothesis space containing all grammars consistent with the training data, and develop an online learning algorithm that efficiently searches this space while simultaneously estimating the parameters of a log-linear parsing model. Experiments demonstrate high accuracy on benchmark data sets in four languages with two different meaning representations.

5 0.066896349 95 emnlp-2010-SRL-Based Verb Selection for ESL

Author: Xiaohua Liu ; Bo Han ; Kuan Li ; Stephan Hyeonjun Stiller ; Ming Zhou

Abstract: In this paper we develop an approach to tackle the problem of verb selection for learners of English as a second language (ESL) by using features from the output of Semantic Role Labeling (SRL). Unlike existing approaches to verb selection that use local features such as n-grams, our approach exploits semantic features which explicitly model the usage context of the verb. The verb choice highly depends on its usage context which is not consistently captured by local features. We then combine these semantic features with other local features under the generalized perceptron learning framework. Experiments on both indomain and out-of-domain corpora show that our approach outperforms the baseline and achieves state-of-the-art performance. 1

6 0.056210607 92 emnlp-2010-Predicting the Semantic Compositionality of Prefix Verbs

7 0.050953381 87 emnlp-2010-Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space

8 0.049531732 16 emnlp-2010-An Approach of Generating Personalized Views from Normalized Electronic Dictionaries : A Practical Experiment on Arabic Language

9 0.047813915 106 emnlp-2010-Top-Down Nearly-Context-Sensitive Parsing

10 0.045295067 118 emnlp-2010-Utilizing Extra-Sentential Context for Parsing

11 0.044443212 17 emnlp-2010-An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL

12 0.043061864 121 emnlp-2010-What a Parser Can Learn from a Semantic Role Labeler and Vice Versa

13 0.041115738 50 emnlp-2010-Facilitating Translation Using Source Language Paraphrase Lattices

14 0.04029819 57 emnlp-2010-Hierarchical Phrase-Based Translation Grammars Extracted from Alignment Posterior Probabilities

15 0.038802717 99 emnlp-2010-Statistical Machine Translation with a Factorized Grammar

16 0.038282178 67 emnlp-2010-It Depends on the Translation: Unsupervised Dependency Parsing via Word Alignment

17 0.038197339 96 emnlp-2010-Self-Training with Products of Latent Variable Grammars

18 0.037648898 115 emnlp-2010-Uptraining for Accurate Deterministic Question Parsing

19 0.037130076 97 emnlp-2010-Simple Type-Level Unsupervised POS Tagging

20 0.036996722 71 emnlp-2010-Latent-Descriptor Clustering for Unsupervised POS Induction


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.141), (1, 0.049), (2, 0.079), (3, 0.052), (4, 0.004), (5, 0.038), (6, -0.038), (7, -0.093), (8, 0.056), (9, -0.032), (10, 0.049), (11, 0.056), (12, 0.08), (13, 0.016), (14, 0.015), (15, -0.06), (16, -0.084), (17, 0.027), (18, -0.145), (19, -0.022), (20, 0.035), (21, -0.122), (22, -0.13), (23, 0.001), (24, -0.045), (25, 0.035), (26, 0.008), (27, -0.202), (28, 0.03), (29, -0.049), (30, -0.022), (31, -0.015), (32, 0.206), (33, -0.146), (34, -0.084), (35, -0.147), (36, -0.05), (37, 0.127), (38, 0.044), (39, 0.113), (40, -0.05), (41, -0.047), (42, -0.058), (43, 0.063), (44, 0.152), (45, 0.417), (46, -0.174), (47, -0.092), (48, 0.211), (49, 0.243)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95040429 117 emnlp-2010-Using Unknown Word Techniques to Learn Known Words

Author: Kostadin Cholakov ; Gertjan van Noord

Abstract: Unknown words are a hindrance to the performance of hand-crafted computational grammars of natural language. However, words with incomplete and incorrect lexical entries pose an even bigger problem because they can be the cause of a parsing failure despite being listed in the lexicon of the grammar. Such lexical entries are hard to detect and even harder to correct. We employ an error miner to pinpoint words with problematic lexical entries. An automated lexical acquisition technique is then used to learn new entries for those words which allows the grammar to parse previously uncovered sentences successfully. We test our method on a large-scale grammar of Dutch and a set of sentences for which this grammar fails to produce a parse. The application of the method enables the grammar to cover 83.76% of those sentences with an accuracy of 86.15%.

2 0.42991164 65 emnlp-2010-Inducing Probabilistic CCG Grammars from Logical Form with Higher-Order Unification

Author: Tom Kwiatkowksi ; Luke Zettlemoyer ; Sharon Goldwater ; Mark Steedman

Abstract: This paper addresses the problem of learning to map sentences to logical form, given training data consisting of natural language sentences paired with logical representations of their meaning. Previous approaches have been designed for particular natural languages or specific meaning representations; here we present a more general method. The approach induces a probabilistic CCG grammar that represents the meaning of individual words and defines how these meanings can be combined to analyze complete sentences. We use higher-order unification to define a hypothesis space containing all grammars consistent with the training data, and develop an online learning algorithm that efficiently searches this space while simultaneously estimating the parameters of a log-linear parsing model. Experiments demonstrate high accuracy on benchmark data sets in four languages with two different meaning representations.

3 0.39946583 116 emnlp-2010-Using Universal Linguistic Knowledge to Guide Grammar Induction

Author: Tahira Naseem ; Harr Chen ; Regina Barzilay ; Mark Johnson

Abstract: We present an approach to grammar induction that utilizes syntactic universals to improve dependency parsing across a range of languages. Our method uses a single set of manually-specified language-independent rules that identify syntactic dependencies between pairs of syntactic categories that commonly occur across languages. During inference of the probabilistic model, we use posterior expectation constraints to require that a minimum proportion of the dependencies we infer be instances of these rules. We also automatically refine the syntactic categories given in our coarsely tagged input. Across six languages our approach outperforms state-of-theart unsupervised methods by a significant margin.1

4 0.31916443 16 emnlp-2010-An Approach of Generating Personalized Views from Normalized Electronic Dictionaries : A Practical Experiment on Arabic Language

Author: Aida Khemakhem ; Bilel Gargouri ; Abdelmajid Ben Hamadou

Abstract: Electronic dictionaries covering all natural language levels are very relevant for the human use as well as for the automatic processing use, namely those constructed with respect to international standards. Such dictionaries are characterized by a complex structure and an important access time when using a querying system. However, the need of a user is generally limited to a part of such a dictionary according to his domain and expertise level which corresponds to a specialized dictionary. Given the importance of managing a unified dictionary and considering the personalized needs of users, we propose an approach for generating personalized views starting from a normalized dictionary with respect to Lexical Markup Framework LMF-ISO 24613 norm. This approach provides the re-use of already defined views for a community of users by managing their profiles information and promoting the materialization of the generated views. It is composed of four main steps: (i) the projection of data categories controlled by a set of constraints (related to the user‟s profiles), (ii) the selection of values with consistency checking, (iii) the automatic generation of the query‟s model and finally, (iv) the refinement of the view. The proposed approach was con- solidated by carrying out an experiment on an LMF normalized Arabic dictionary. 1

5 0.31476131 92 emnlp-2010-Predicting the Semantic Compositionality of Prefix Verbs

Author: Shane Bergsma ; Aditya Bhargava ; Hua He ; Grzegorz Kondrak

Abstract: In many applications, replacing a complex word form by its stem can reduce sparsity, revealing connections in the data that would not otherwise be apparent. In this paper, we focus on prefix verbs: verbs formed by adding a prefix to an existing verb stem. A prefix verb is considered compositional if it can be decomposed into a semantically equivalent expression involving its stem. We develop a classifier to predict compositionality via a range of lexical and distributional features, including novel features derived from web-scale Ngram data. Results on a new annotated corpus show that prefix verb compositionality can be predicted with high accuracy. Our system also performs well when trained and tested on conventional morphological segmentations of prefix verbs.

6 0.2863391 114 emnlp-2010-Unsupervised Parse Selection for HPSG

7 0.24311894 87 emnlp-2010-Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space

8 0.23972005 95 emnlp-2010-SRL-Based Verb Selection for ESL

9 0.21755683 17 emnlp-2010-An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL

10 0.20665786 99 emnlp-2010-Statistical Machine Translation with a Factorized Grammar

11 0.17469937 54 emnlp-2010-Generating Confusion Sets for Context-Sensitive Error Correction

12 0.17158107 11 emnlp-2010-A Semi-Supervised Approach to Improve Classification of Infrequent Discourse Relations Using Feature Vector Extension

13 0.1714325 118 emnlp-2010-Utilizing Extra-Sentential Context for Parsing

14 0.17093299 110 emnlp-2010-Turbo Parsers: Dependency Parsing by Approximate Variational Inference

15 0.15684475 103 emnlp-2010-Tense Sense Disambiguation: A New Syntactic Polysemy Task

16 0.14774843 91 emnlp-2010-Practical Linguistic Steganography Using Contextual Synonym Substitution and Vertex Colour Coding

17 0.1468959 120 emnlp-2010-What's with the Attitude? Identifying Sentences with Attitude in Online Discussions

18 0.13580021 24 emnlp-2010-Automatically Producing Plot Unit Representations for Narrative Text

19 0.13420168 41 emnlp-2010-Efficient Graph-Based Semi-Supervised Learning of Structured Tagging Models

20 0.13418624 62 emnlp-2010-Improving Mention Detection Robustness to Noisy Input


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(12, 0.036), (29, 0.076), (30, 0.011), (52, 0.011), (56, 0.034), (66, 0.082), (72, 0.587), (76, 0.03), (89, 0.01)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.93967855 122 emnlp-2010-WikiWars: A New Corpus for Research on Temporal Expressions

Author: Pawel Mazur ; Robert Dale

Abstract: The reliable extraction of knowledge from text requires an appropriate treatment of the time at which reported events take place. Unfortunately, there are very few annotated data sets that support the development of techniques for event time-stamping and tracking the progression of time through a narrative. In this paper, we present a new corpus of temporally-rich documents sourced from English Wikipedia, which we have annotated with TIMEX2 tags. The corpus contains around 120000 tokens, and 2600 TIMEX2 expressions, thus comparing favourably in size to other existing corpora used in these areas. We describe the prepa- ration of the corpus, and compare the profile of the data with other existing temporally annotated corpora. We also report the results obtained when we use DANTE, our temporal expression tagger, to process this corpus, and point to where further work is required. The corpus is publicly available for research purposes.

2 0.92751533 17 emnlp-2010-An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL

Author: Valentin Zhikov ; Hiroya Takamura ; Manabu Okumura

Abstract: This paper proposes a fast and simple unsupervised word segmentation algorithm that utilizes the local predictability of adjacent character sequences, while searching for a leasteffort representation of the data. The model uses branching entropy as a means of constraining the hypothesis space, in order to efficiently obtain a solution that minimizes the length of a two-part MDL code. An evaluation with corpora in Japanese, Thai, English, and the ”CHILDES” corpus for research in language development reveals that the algorithm achieves an accuracy, comparable to that of the state-of-the-art methods in unsupervised word segmentation, in a significantly reduced . computational time.

same-paper 3 0.88496536 117 emnlp-2010-Using Unknown Word Techniques to Learn Known Words

Author: Kostadin Cholakov ; Gertjan van Noord

Abstract: Unknown words are a hindrance to the performance of hand-crafted computational grammars of natural language. However, words with incomplete and incorrect lexical entries pose an even bigger problem because they can be the cause of a parsing failure despite being listed in the lexicon of the grammar. Such lexical entries are hard to detect and even harder to correct. We employ an error miner to pinpoint words with problematic lexical entries. An automated lexical acquisition technique is then used to learn new entries for those words which allows the grammar to parse previously uncovered sentences successfully. We test our method on a large-scale grammar of Dutch and a set of sentences for which this grammar fails to produce a parse. The application of the method enables the grammar to cover 83.76% of those sentences with an accuracy of 86.15%.

4 0.53635234 11 emnlp-2010-A Semi-Supervised Approach to Improve Classification of Infrequent Discourse Relations Using Feature Vector Extension

Author: Hugo Hernault ; Danushka Bollegala ; Mitsuru Ishizuka

Abstract: Several recent discourse parsers have employed fully-supervised machine learning approaches. These methods require human annotators to beforehand create an extensive training corpus, which is a time-consuming and costly process. On the other hand, unlabeled data is abundant and cheap to collect. In this paper, we propose a novel semi-supervised method for discourse relation classification based on the analysis of cooccurring features in unlabeled data, which is then taken into account for extending the feature vectors given to a classifier. Our experimental results on the RST Discourse Treebank corpus and Penn Discourse Treebank indicate that the proposed method brings a significant improvement in classification accuracy and macro-average F-score when small training datasets are used. For instance, with training sets of c.a. 1000 labeled instances, the proposed method brings improvements in accuracy and macro-average F-score up to 50% compared to a baseline classifier. We believe that the proposed method is a first step towards detecting low-occurrence relations, which is useful for domains with a lack of annotated data.

5 0.48547202 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media

Author: Yunliang Jiang ; Cindy Xide Lin ; Qiaozhu Mei

Abstract: In this paper, we conducted a systematic comparative analysis of language in different contexts of bursty topics, including web search, news media, blogging, and social bookmarking. We analyze (1) the content similarity and predictability between contexts, (2) the coverage of search content by each context, and (3) the intrinsic coherence of information in each context. Our experiments show that social bookmarking is a better predictor to the bursty search queries, but news media and social blogging media have a much more compelling coverage. This comparison provides insights on how the search behaviors and social information sharing behaviors of users are correlated to the professional news media in the context of bursty events.

6 0.4818947 73 emnlp-2010-Learning Recurrent Event Queries for Web Search

7 0.43789551 53 emnlp-2010-Fusing Eye Gaze with Speech Recognition Hypotheses to Resolve Exophoric References in Situated Dialogue

8 0.4266834 2 emnlp-2010-A Fast Decoder for Joint Word Segmentation and POS-Tagging Using a Single Discriminative Model

9 0.41184819 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors

10 0.41122958 20 emnlp-2010-Automatic Detection and Classification of Social Events

11 0.40514123 123 emnlp-2010-Word-Based Dialect Identification with Georeferenced Rules

12 0.40498078 24 emnlp-2010-Automatically Producing Plot Unit Representations for Narrative Text

13 0.40205711 49 emnlp-2010-Extracting Opinion Targets in a Single and Cross-Domain Setting with Conditional Random Fields

14 0.40047121 92 emnlp-2010-Predicting the Semantic Compositionality of Prefix Verbs

15 0.39530608 43 emnlp-2010-Enhancing Domain Portability of Chinese Segmentation Model Using Chi-Square Statistics and Bootstrapping

16 0.39269164 69 emnlp-2010-Joint Training and Decoding Using Virtual Nodes for Cascaded Segmentation and Tagging Tasks

17 0.38621601 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice

18 0.38601449 51 emnlp-2010-Function-Based Question Classification for General QA

19 0.38422409 80 emnlp-2010-Modeling Organization in Student Essays

20 0.38279799 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation