acl acl2013 acl2013-58 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Lis Pereira ; Erlyn Manguilimotan ; Yuji Matsumoto
Abstract: This study addresses issues of Japanese language learning concerning word combinations (collocations). Japanese learners may be able to construct grammatically correct sentences; however, these may sound “unnatural”. In this work, we analyze correct word combinations using different collocation measures and word similarity methods. While other methods use well-formed text, our approach makes use of a large Japanese language learner corpus for generating collocation candidates, in order to build a system that is more sensitive to constructions that are difficult for learners. Our results show improvements over other methods that use only well-formed text. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Japanese learners may be able to construct grammatically correct sentences; however, these may sound “unnatural”. [sent-3, score-0.137]
2 In this work, we analyze correct word combinations using different collocation measures and word similarity methods. [sent-4, score-0.685]
3 While other methods use well-formed text, our approach makes use of a large Japanese language learner corpus for generating collocation candidates, in order to build a system that is more sensitive to constructions that are difficult for learners. [sent-5, score-0.873]
4 1 Introduction Automated grammatical error correction is emerging as an interesting topic of natural language processing (NLP). [sent-7, score-0.2]
5 It is only recently that NLP research has addressed issues of collocation errors. [sent-11, score-0.398]
6 In Japanese, ocha wo ireru “お茶を入れる [to make tea]” and yume wo miru “夢を見る [to have a dream]” are examples of collocations. [sent-13, score-0.568]
7 For instance, the Japanese collocation yume wo miru [lit. [sent-19, score-0.754]
8 A learner might create the unnatural combination yume wo suru, using the verb suru (a general light verb meaning “do” in English) instead of miru “to see”. [sent-21, score-1.171]
9 In this work, we analyze various Japanese corpora using a number of collocation and word similarity measures to deduce and suggest the best collocations for Japanese second language learners. [sent-22, score-0.877]
10 In order to build a system that is more sensitive to constructions that are difficult for learners, we use word similarity measures that generate collocation candidates using a large Japanese language learner corpus. [sent-23, score-1.163]
11 In Section 2, we introduce related work on collocation error correction. [sent-26, score-0.434]
12 Section 3 explains our method, based on word similarity and association measures, for suggesting collocations. [sent-27, score-0.221]
13 In Section 4, we describe different word similarity and association measures, as well as the corpora used in our experiments. [sent-28, score-0.221]
14 2 Related Work Collocation correction currently follows an approach similar to that used in article and preposition correction. [sent-31, score-0.196]
15 The general strategy compares the learner's word choice to a confusion set generated from well-formed text during the training phase. [sent-32, score-0.297]
16 If one or more alternatives are more appropriate to the context, the learner's word is flagged as an error and the alternatives are suggested as corrections. [sent-33, score-0.23]
17 To constrain the size of the confusion set, similarity measures are used. [sent-34, score-0.233]
18 (Page footer: Proceedings of the ACL 2013 Student Research Workshop, pages 52–58, Sofia, Bulgaria, August 4–9, 2013, Association for Computational Linguistics.) [sent-36, score-0.231]
19 To rank the best candidates, the strength of association in the learner’s construction and in each of the generated alternative constructions is measured. [sent-37, score-0.325]
20 (2008) generated synonyms for each candidate string using WordNet and Roget’s Thesaurus and used the rank ratio measure to score them by their semantic similarity. [sent-39, score-0.148]
21 (2009) also used WordNet to generate synonyms, but used Pointwise Mutual Information as association measure to rank the candidates. [sent-41, score-0.153]
22 (2008) used bilingual dictionaries to derive collocation candidates and used the loglikelihood measure to rank them. [sent-43, score-0.571]
23 One drawback of these approaches is that they rely on resources of limited coverage, such as dictionaries, thesauri, or manually constructed databases, to generate the candidates. [sent-44, score-0.155]
24 Another problem is that most research does not actually take learners' tendencies toward collocation errors into account; instead, the systems are trained only on well-formed text corpora. [sent-47, score-0.447]
25 Our work follows the general approach, that is, it uses similarity measures for generating the confusion set and association measures for ranking the best candidates. [sent-48, score-0.639]
26 However, instead of using only wellformed text for generating the confusion set, we use a large learner corpus created by crawling the revision log of a language learning social networking service (SNS), Lang-83. [sent-49, score-0.697]
27 The biggest benefit of using such kind of data is that we can obtain in large scale pairs of learners’ sentences and their corrections assigned by native speakers. [sent-52, score-0.099]
28 3 Combining Word Similarity and Association Measures to Suggest Collocations In our work, we focus on suggestions for noun and verb collocation errors in “noun wo verb (noun- を-verb)” constructions, where noun is the direct object of verb. [sent-53, score-1.352]
29 In our evaluation, we checked if the correction given in the learner corpus matches one of the suggestions given by the system. [sent-57, score-0.533]
30 We considered only the tuples that contain a noun or verb error. [sent-60, score-0.452]
31 1 Word Similarity Similarity measures are used to generate the collocation candidates that are later ranked using association measures. [sent-64, score-0.629]
32 The first two measures generate the collocation candidates by finding words that are analogous to the writer’s choice, a common approach used in the related work on collocation error correction (Liu et al. [sent-66, score-1.179]
33 , 2010) and the third measure generates the candidates based on the corrections given by native speakers in the learner corpus. [sent-69, score-0.502]
34 Two words are considered similar if they are near each other in the thesaurus hierarchy (have a path within a pre-defined threshold length). [sent-71, score-0.126]
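The thesaurus path test described above can be sketched as follows; a minimal, hypothetical example that assumes the hierarchy is given as a child-to-parent map (the paper's actual thesaurus and threshold are not specified here):

```python
def path_length(parents, a, b):
    """Length of the shortest path between two nodes of a tree
    given as a child -> parent map; None if disconnected."""
    up_a, d = {}, 0
    x = a
    while x is not None:          # record every ancestor of a with its depth
        up_a[x] = d
        x = parents.get(x)
        d += 1
    d, x = 0, b
    while x is not None:          # walk up from b to the first common ancestor
        if x in up_a:
            return up_a[x] + d
        x = parents.get(x)
        d += 1
    return None

def thesaurus_similar(parents, a, b, threshold=2):
    """Two words count as similar if their path length in the
    hierarchy is within a pre-defined threshold."""
    d = path_length(parents, a, b)
    return d is not None and d <= threshold
```

For instance, two words sharing a direct hypernym are at path length 2 and count as similar under a threshold of 2.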
35 Distributional Similarity: Thesaurus-based methods produce weak recall since many words, phrases and semantic connections are not covered by hand-built thesauri, especially for verbs and adjectives. [sent-72, score-0.136]
36 As an alternative, distributional similarity models are often used, since they give higher recall. [sent-73, score-0.225]
37 On the other hand, distributional similarity models tend to have lower precision (Jurafsky et al. [sent-74, score-0.225]
38 We are interested in computing similarity of nouns and verbs and hence the context of a particular noun is a vector of verbs that are in an object relation with that noun. [sent-79, score-0.572]
39 The context of a particular verb is a vector of nouns that are in an object relation with that verb (Table 3, a co-occurrence vector for the verb 食べる [eat]: ご飯を [rice] 164, ラーメンを [noodle soup] 53, カレーを [curry] 39). [sent-80, score-0.385]
40 Table 2 and Table 3 show examples of part of co-occurrence vectors for the noun “ 日記 [diary]” and the verb “食べ る [eat]”, respectively. [sent-81, score-0.325]
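Building such co-occurrence vectors from extracted “noun wo verb” tuples is straightforward; a minimal sketch (the function and variable names are ours, not the paper's):

```python
from collections import defaultdict

def build_cooccurrence(tuples):
    """Given (noun, verb) direct-object pairs, return the context of
    each noun as a count vector over verbs, and vice versa."""
    noun_ctx = defaultdict(lambda: defaultdict(int))
    verb_ctx = defaultdict(lambda: defaultdict(int))
    for noun, verb in tuples:
        noun_ctx[noun][verb] += 1
        verb_ctx[verb][noun] += 1
    return noun_ctx, verb_ctx
```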
41 We computed the similarity between co-occurrence vectors using different metrics: Cosine Similarity, the Dice coefficient (Curran, 2004), the Kullback-Leibler (KL) divergence, also known as relative entropy (Kullback and Leibler, 1951), and the Jensen-Shannon divergence (Lee, 1999). [sent-83, score-0.308]
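The listed metrics over sparse count vectors (dicts) can be sketched as follows; this is a simplified reading, and the exact Dice variant and smoothing used in the cited works may differ:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def dice(u, v):
    """Dice coefficient over shared contexts (one common variant)."""
    shared = set(u) & set(v)
    num = 2 * sum(min(u[k], v[k]) for k in shared)
    den = sum(u.values()) + sum(v.values())
    return num / den if den else 0.0

def js_divergence(u, v):
    """Jensen-Shannon divergence between count vectors, after
    normalizing them to probability distributions."""
    def norm(d):
        s = sum(d.values())
        return {k: c / s for k, c in d.items()}
    def kl(p, q):  # KL divergence in bits, over the support of p
        return sum(p[k] * math.log2(p[k] / q[k]) for k in p if p[k] > 0)
    p, q = norm(u), norm(v)
    m = {k: 0.5 * (p.get(k, 0) + q.get(k, 0)) for k in set(p) | set(q)}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Unlike raw KL divergence, the Jensen-Shannon form is symmetric and defined even when the two vectors have different supports.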
42 Confusion Set derived from learner corpus: In order to build a module that can “guess” common construction errors, we created a confusion set using Lang-8 corpus. [sent-84, score-0.584]
43 Instead of generating words that have similar meaning to the learner’s written construction, we extracted all the possible noun and verb corrections for each of the nouns and verbs found in the data. [sent-85, score-0.61]
44 For instance, the confusion set of the verb suru “する [to do]” is composed of verbs such as ukeru “受ける [to accept]”, which does not necessarily have a meaning similar to suru. [sent-87, score-0.675]
45 The confusion set means that in the corpus, suru was corrected to either one of these verbs, i. [sent-88, score-0.428]
46 , when the learner writes the verb suru, he/she might actually mean to write one of the verbs in the confusion set. [sent-90, score-0.795]
47 For the noun biru “ビル [building]”, the learner may have, for example, misspelled the word bīru “ビール [beer]”, or may have confused it with the Japanese translation of the English words bill (“お金 [money]”, “札 [bill]”, “金額 [amount of money]”, “料金 [fee]”) or view (“景色 [scenery]”). [sent-91, score-0.485]
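Mining the confusion sets from revision-log pairs can be sketched as below; a minimal illustration whose names (`build_confusion_sets`, `correction_pairs`) are ours:

```python
from collections import defaultdict

def build_confusion_sets(correction_pairs):
    """correction_pairs: iterable of (written_word, corrected_word)
    mined from a learner corpus revision log. Every observed
    correction target joins the written word's confusion set,
    regardless of semantic similarity."""
    confusion = defaultdict(set)
    for written, corrected in correction_pairs:
        if written != corrected:   # ignore unchanged words
            confusion[written].add(corrected)
    return confusion
```

For example, each time native speakers corrected suru to ukeru, ukeru joins suru's confusion set.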
48 2 Word Association Strength After generating the collocation candidates using word similarity, the next step is to identify the “true collocations” among them. [sent-93, score-0.535]
49 Here, the association strength was measured, in such a way that word pairs generated by chance from the sampling process can be excluded. [sent-94, score-0.179]
50 An association measure assigns an association score to each word pair. [sent-95, score-0.158]
51 We adopted the Weighted Dice coefficient (Kitamura and Matsumoto, 1997) as our association measurement. [sent-97, score-0.085]
52 We also tested using other association measures (results are omitted): Pointwise Mutual Information (Church and Hanks, 1990), log-likelihood ratio (Dunning, 1993) and Dice coefficient (Smadja et al. [sent-98, score-0.171]
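As a rough sketch, PMI and the Weighted Dice coefficient from raw counts; the Weighted Dice formula below (log co-occurrence frequency times the Dice ratio) is our reading of Kitamura and Matsumoto (1997) and should be treated as an assumption:

```python
import math

def pmi(f_xy, f_x, f_y, n):
    """Pointwise mutual information; n is the total number of tuples."""
    if f_xy == 0:
        return float("-inf")
    return math.log2((f_xy / n) / ((f_x / n) * (f_y / n)))

def weighted_dice(f_xy, f_x, f_y):
    """Dice ratio weighted by log co-occurrence frequency,
    as one reading of Kitamura and Matsumoto (1997)."""
    if f_xy == 0:
        return float("-inf")
    return math.log2(f_xy) * (2 * f_xy) / (f_x + f_y)
```

The log-frequency weight counteracts plain PMI's known bias toward rare pairs.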
53 5 Experiment setup We divided our experiments into two parts: verb suggestion and noun suggestion. [sent-100, score-0.561]
54 For verb suggestion, given the learners’ “noun wo verb” construction, our focus is to suggest “noun wo verb” collocations with alternative verbs other than the learner’s written verb. [sent-101, score-0.956]
55 For noun suggestion, given the learners’ “noun wo verb” construction, our focus is to suggest “noun wo verb” collocations with alternative nouns other than the learner’s written noun. [sent-102, score-0.911]
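The two stages can be combined for the verb-suggestion task roughly as follows; a hypothetical sketch in which the confusion-set and count inputs, function names, and the log-frequency-weighted Dice ranking score are our illustration:

```python
import math

def suggest_verbs(noun, written_verb, confusion, counts, verb_freq, noun_freq):
    """Rank alternative verbs for a 'noun wo verb' construction:
    candidates come from the written verb's confusion set and are
    ranked by a log-frequency-weighted Dice score with the noun."""
    def score(v):
        f_nv = counts.get((noun, v), 0)
        if f_nv == 0:
            return float("-inf")   # never co-occurs: rank last
        return math.log2(f_nv) * (2 * f_nv) / (noun_freq[noun] + verb_freq[v])
    candidates = confusion.get(written_verb, set())
    return sorted(candidates, key=score, reverse=True)
```

For a learner's yume wo suru, the candidates would be suru's confusion set and miru should outrank the rest given corpus counts.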
56 This thesaurus was used to compute word similarity, taking the words that are in the same subtree as the candidate word. [sent-105, score-0.19]
57 , 1991): one of the major newspapers in Japan that provides raw text of newspaper articles used as linguistic resource. [sent-108, score-0.092]
58 One year data (1991) were used to extract the “noun wo verb” tuples to compute word similarity (using cosine similarity metric) and collocation scores. [sent-109, score-1.136]
59 We extracted 224,185 tuples composed of 16,781 unique verbs and 37,300 unique nouns. [sent-110, score-0.337]
60 Incorporating a variety of topics and styles in the training data helps minimize the domain gap problem between the learner’s vocabulary and newspaper vocabulary found in the Mainichi Shimbun data. [sent-112, score-0.154]
61 We extracted 194,036 “noun wo verb” tuples composed of 43,243 unique nouns and 18,212 unique verbs. [sent-113, score-0.517]
62 These data are necessary to compute the word similarity (using cosine similarity metric) and collocation scores. [sent-114, score-0.751]
63 We extracted 163,880 “noun wo verb” tuples composed of 38,999 unique nouns and 16,086 unique verbs. [sent-116, score-0.517]
64 ii) Construct the confusion set (explained in Section 4. [sent-117, score-0.233]
65 1): We constructed the confusion set for all the 16,086 verbs and 38,999 nouns that appeared in the data. [sent-118, score-0.385]
66 For the verb suggestion task, we extracted all the “noun wo verb” tuples with incorrect verbs and their correction. [sent-123, score-0.836]
67 From the tuples extracted, we selected the ones where the verbs were corrected to the same verb 5 or more times by the native speakers. [sent-124, score-0.493]
68 Similarly, for the noun suggestion task, we extracted all the “noun wo verb” tuples with incorrect nouns and their correction. [sent-125, score-0.791]
69 There are cases where the learner’s construction sounds more acceptable than its correction, cases where in the corpus, they were corrected due to some contextual information. [sent-126, score-0.1]
70 Although the learner data might contain errors like spelling and grammar, collocation errors are much less frequent compared to spelling and grammar errors, since combining words appropriately is one of the vital competencies of a native speaker of a language. [sent-128, score-0.639]
71 Since only the noun, particle, and verb that the learner wrote are considered, there was a need to filter out such contextually induced corrections. [sent-129, score-0.5]
72 To solve this problem, we used the Weighted Dice coefficient to compute the association strength between the noun and all the verbs, filtering out the pairs where the learner’s construction has a higher score than the correction. [sent-130, score-0.358]
73 After applying those conditions, we obtained 185 tuples for the verb suggestion test set and 85 tuples for the noun suggestion test set. [sent-131, score-1.051]
74 3 Evaluation Metrics We compared the verbs in the confusion set ranked by collocation score suggested by the system with the human correction verb and noun in the Lang-8 data. [sent-133, score-1.254]
75 The metrics we used for the evaluation are: precision, recall and the mean reciprocal rank (MRR). [sent-136, score-0.086]
76 We report precision at rank k, k=1, 5, computing the rank of the correction when a true positive occurs. [sent-137, score-0.275]
77 The MRR was used to assess whether the suggestion list contains the correction and how far up it is in the list. [sent-138, score-0.4]
78 If the system did not return the correction for a test instance, we set 1/rank(i) to zero. [sent-140, score-0.164]
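The three metrics can be sketched together as below; the denominator used for precision at rank k is an assumption (counted over the whole test set), since the exact convention is not fully specified here:

```python
def evaluate(suggestions, gold, k=5):
    """suggestions: one ranked candidate list per test instance;
    gold: the human correction for each instance."""
    n = len(gold)
    rr_sum, hits_at_k, tp = 0.0, 0, 0
    for ranked, ref in zip(suggestions, gold):
        if ref in ranked:
            rank = ranked.index(ref) + 1
            rr_sum += 1.0 / rank      # 1/rank(i); contributes 0 when absent
            tp += 1
            if rank <= k:
                hits_at_k += 1
    mrr = rr_sum / n
    precision_at_k = hits_at_k / n    # assumed whole-test-set denominator
    recall = tp / n                   # tp / (tp + fn)
    return mrr, precision_at_k, recall
```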
79 Recall rate is calculated with the formula: recall = tp / (tp + fn) (2). 6 Results Table 4 shows the ten models derived from combining different word similarity measures and the Weighted Dice measure as association measure, using different corpora. [sent-141, score-0.382]
80 In this table, for instance, we named M1 the model that uses thesaurus for computing word similarity and uses Mainichi Shimbun corpus when computing collocation scores using the association measure adopted, Weighted Dice. [sent-142, score-0.858]
81 M2 uses Mainichi Shimbun corpus for computing both word similarity and collocation scores. [sent-143, score-0.623]
82 M10 computes word similarity using the confusing set from Lang-8 corpus and uses BCCWJ and Lang-8 corpus when computing collocation scores. [sent-144, score-0.648]
83 Table 4 reports the precision of the k-best suggestions, the recall rate and the MRR for verb and noun suggestion. [sent-148, score-0.41]
84 1 Verb Suggestion Table 4 shows that the model using the thesaurus (M1) achieved the highest precision rate among all models; however, it also had the lowest recall. [sent-150, score-0.167]
85 The model could make suggestions for cases where the wrong verb written by the learner and the correction suggested in the Lang-8 data have similar meanings, as they are near each other in the thesaurus hierarchy. [sent-151, score-0.903]
86 However, for cases where the wrong verb written by the learner and the correction suggested in Lang-8 data do not have similar meaning, M1 could not suggest the correction. [sent-152, score-0.777]
87 The recall rate improved significantly but the precision rate decreased. [sent-154, score-0.126]
88 The highest recall and MRR values are achieved when Lang-8 data were used to generate the confusion set (M10). [sent-157, score-0.306]
89 2 Noun Suggestion Similar to the verb suggestion experiments, the best recall and MRR values are achieved when Lang-8 data were used to generate the confusion set (M10). [sent-170, score-0.711]
90 For noun suggestion, our automatically constructed test set includes a number of spelling correction cases, such as cases for the combination eat ice cream, where the learner wrote aisukurimu wo taberu “アイスクリムを食べる” and the correction is aisukurīmu wo taberu “アイスクリームを食べる”. [sent-171, score-1.409]
91 Such phenomena did not occur with the test set for verb suggestion. [sent-172, score-0.169]
92 For those cases, the fact that only spelling correction is necessary in order to have the right collocation may also indicate that the learner is more confident regarding the choice of the noun than the verb. [sent-173, score-1.063]
93 07) when using a thesaurus for generating the candidates. 7 Conclusion and Future Work We analyzed various Japanese corpora using a number of collocation and word similarity measures to deduce and suggest the best collocations for Japanese second language learners. [sent-175, score-1.112]
94 In order to build a system that is more sensitive to constructions that are difficult for learners, we use word similarity measures that generate collocation candidates using a large Japanese language learner corpus, instead of only using well-formed text. [sent-176, score-1.231]
95 By employing this approach, we could obtain better recall and MRR values compared to the thesaurus-based method and the distributional similarity methods. [sent-177, score-0.395]
96 Another straightforward extension is to pursue constructions with other particles, such as “noun ga verb (subject-verb)”, “noun ni verb (dative-verb)”, etc. [sent-179, score-0.42]
97 In our experiments, only limited context information is considered (only the noun, the particle wo (を), and the verb written by the learner). [sent-180, score-0.459]
98 An automatic collocation writing assistant for Taiwanese EFL learners: A case of corpus-based NLP technology. [sent-193, score-0.398]
99 A computational approach to detecting collocation errors in the writing of non-native speakers of English. [sent-230, score-0.447]
100 Using mostly native data to correct errors in learners’ writing: A meta-classifier approach. [sent-235, score-0.104]
wordName wordTfidf (topN-words)
[('collocation', 0.398), ('learner', 0.301), ('suggestion', 0.236), ('confusion', 0.233), ('japanese', 0.23), ('wo', 0.212), ('mainichi', 0.192), ('verb', 0.169), ('shimbun', 0.169), ('correction', 0.164), ('noun', 0.156), ('similarity', 0.145), ('suru', 0.145), ('collocations', 0.138), ('learners', 0.137), ('mrr', 0.133), ('tuples', 0.127), ('thesaurus', 0.126), ('dice', 0.114), ('newspaper', 0.092), ('verbs', 0.092), ('measures', 0.086), ('bccwj', 0.085), ('constructions', 0.082), ('distributional', 0.08), ('miru', 0.072), ('yume', 0.072), ('candidates', 0.068), ('wellformed', 0.068), ('strength', 0.067), ('writer', 0.066), ('alternatives', 0.062), ('nouns', 0.06), ('dream', 0.059), ('native', 0.055), ('suggest', 0.053), ('construction', 0.05), ('corrected', 0.05), ('errors', 0.049), ('particles', 0.049), ('written', 0.048), ('biru', 0.048), ('kitamura', 0.048), ('oyama', 0.048), ('smadja', 0.048), ('stling', 0.048), ('taberu', 0.048), ('association', 0.048), ('year', 0.046), ('corrections', 0.044), ('spelling', 0.044), ('recall', 0.044), ('suggestions', 0.043), ('mizumoto', 0.043), ('rozovskaya', 0.043), ('maekawa', 0.043), ('futagi', 0.043), ('kullback', 0.043), ('rank', 0.042), ('divergence', 0.042), ('suggested', 0.042), ('generating', 0.041), ('rate', 0.041), ('unique', 0.041), ('dahlmeier', 0.039), ('matsumoto', 0.039), ('chang', 0.038), ('sns', 0.037), ('coefficient', 0.037), ('candidate', 0.036), ('error', 0.036), ('generated', 0.036), ('composed', 0.036), ('cosine', 0.035), ('measure', 0.034), ('assisted', 0.034), ('tetreault', 0.034), ('alternative', 0.032), ('preposition', 0.032), ('unnatural', 0.031), ('vocabulary', 0.031), ('automated', 0.031), ('particle', 0.03), ('wrote', 0.03), ('eat', 0.03), ('dictionaries', 0.029), ('deduce', 0.029), ('revision', 0.029), ('generate', 0.029), ('contemporary', 0.028), ('word', 0.028), ('church', 0.028), ('computing', 0.027), ('tea', 0.027), ('sensitive', 0.026), ('money', 0.026), ('weighted', 0.025), ('mutual', 0.025), ('corpus', 0.025), ('suzuki', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 58 acl-2013-Automated Collocation Suggestion for Japanese Second Language Learners
Author: Lis Pereira ; Erlyn Manguilimotan ; Yuji Matsumoto
Abstract: This study addresses issues of Japanese language learning concerning word combinations (collocations). Japanese learners may be able to construct grammatically correct sentences; however, these may sound “unnatural”. In this work, we analyze correct word combinations using different collocation measures and word similarity methods. While other methods use well-formed text, our approach makes use of a large Japanese language learner corpus for generating collocation candidates, in order to build a system that is more sensitive to constructions that are difficult for learners. Our results show improvements over other methods that use only well-formed text. 1
2 0.38264742 8 acl-2013-A Learner Corpus-based Approach to Verb Suggestion for ESL
Author: Yu Sawai ; Mamoru Komachi ; Yuji Matsumoto
Abstract: We propose a verb suggestion method which uses candidate sets and domain adaptation to incorporate error patterns produced by ESL learners. The candidate sets are constructed from a large scale learner corpus to cover various error patterns made by learners. Furthermore, the model is trained using both a native corpus and the learner corpus via a domain adaptation technique. Experiments on two learner corpora show that the candidate sets increase the coverage of error patterns and domain adaptation improves the performance for verb suggestion.
3 0.14257373 185 acl-2013-Identifying Bad Semantic Neighbors for Improving Distributional Thesauri
Author: Olivier Ferret
Abstract: Distributional thesauri are now widely used in a large number of Natural Language Processing tasks. However, they are far from containing only interesting semantic relations. As a consequence, improving such thesaurus is an important issue that is mainly tackled indirectly through the improvement of semantic similarity measures. In this article, we propose a more direct approach focusing on the identification of the neighbors of a thesaurus entry that are not semantically linked to this entry. This identification relies on a discriminative classifier trained from unsupervised selected examples for building a distributional model of the entry in texts. Its bad neighbors are found by applying this classifier to a representative set of occurrences of each of these neighbors. We evaluate the interest of this method for a large set of English nouns with various frequencies.
4 0.12149275 34 acl-2013-Accurate Word Segmentation using Transliteration and Language Model Projection
Author: Masato Hagiwara ; Satoshi Sekine
Abstract: Transliterated compound nouns not separated by whitespaces pose difficulty on word segmentation (WS) . Offline approaches have been proposed to split them using word statistics, but they rely on static lexicon, limiting their use. We propose an online approach, integrating source LM, and/or, back-transliteration and English LM. The experiments on Japanese and Chinese WS have shown that the proposed models achieve significant improvement over state-of-the-art, reducing 16% errors in Japanese.
5 0.1062834 122 acl-2013-Discriminative Approach to Fill-in-the-Blank Quiz Generation for Language Learners
Author: Keisuke Sakaguchi ; Yuki Arase ; Mamoru Komachi
Abstract: We propose discriminative methods to generate semantic distractors of fill-in-theblank quiz for language learners using a large-scale language learners’ corpus. Unlike previous studies, the proposed methods aim at satisfying both reliability and validity of generated distractors; distractors should be exclusive against answers to avoid multiple answers in one quiz, and distractors should discriminate learners’ proficiency. Detailed user evaluation with 3 native and 23 non-native speakers of English shows that our methods achieve better reliability and validity than previous methods.
6 0.097605035 251 acl-2013-Mr. MIRA: Open-Source Large-Margin Structured Learning on MapReduce
7 0.089239024 231 acl-2013-Linggle: a Web-scale Linguistic Search Engine for Words in Context
8 0.087942749 104 acl-2013-DKPro Similarity: An Open Source Framework for Text Similarity
9 0.087070204 186 acl-2013-Identifying English and Hungarian Light Verb Constructions: A Contrastive Approach
10 0.086920761 199 acl-2013-Integrating Multiple Dependency Corpora for Inducing Wide-coverage Japanese CCG Resources
11 0.086305805 238 acl-2013-Measuring semantic content in distributional vectors
12 0.085612372 366 acl-2013-Understanding Verbs based on Overlapping Verbs Senses
13 0.081603087 304 acl-2013-SEMILAR: The Semantic Similarity Toolkit
14 0.081106618 342 acl-2013-Text Classification from Positive and Unlabeled Data using Misclassified Data Correction
15 0.080685548 116 acl-2013-Detecting Metaphor by Contextual Analogy
16 0.07554242 12 acl-2013-A New Set of Norms for Semantic Relatedness Measures
17 0.07446944 43 acl-2013-Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity
18 0.071834177 213 acl-2013-Language Acquisition and Probabilistic Models: keeping it simple
19 0.071210593 174 acl-2013-Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation
20 0.070308805 235 acl-2013-Machine Translation Detection from Monolingual Web-Text
topicId topicWeight
[(0, 0.174), (1, 0.027), (2, 0.032), (3, -0.115), (4, -0.04), (5, -0.096), (6, -0.084), (7, 0.044), (8, 0.026), (9, -0.01), (10, -0.052), (11, 0.074), (12, 0.021), (13, -0.028), (14, -0.056), (15, 0.032), (16, -0.015), (17, 0.0), (18, -0.047), (19, 0.01), (20, 0.154), (21, -0.01), (22, 0.145), (23, -0.022), (24, 0.175), (25, 0.247), (26, -0.01), (27, 0.065), (28, -0.048), (29, -0.009), (30, -0.031), (31, -0.046), (32, -0.159), (33, -0.165), (34, -0.084), (35, 0.165), (36, -0.104), (37, 0.049), (38, 0.091), (39, -0.134), (40, -0.083), (41, -0.186), (42, 0.079), (43, -0.029), (44, 0.075), (45, 0.07), (46, -0.116), (47, -0.044), (48, -0.134), (49, -0.04)]
simIndex simValue paperId paperTitle
same-paper 1 0.94709402 58 acl-2013-Automated Collocation Suggestion for Japanese Second Language Learners
Author: Lis Pereira ; Erlyn Manguilimotan ; Yuji Matsumoto
Abstract: This study addresses issues of Japanese language learning concerning word combinations (collocations). Japanese learners may be able to construct grammatically correct sentences; however, these may sound “unnatural”. In this work, we analyze correct word combinations using different collocation measures and word similarity methods. While other methods use well-formed text, our approach makes use of a large Japanese language learner corpus for generating collocation candidates, in order to build a system that is more sensitive to constructions that are difficult for learners. Our results show improvements over other methods that use only well-formed text. 1
2 0.85144949 8 acl-2013-A Learner Corpus-based Approach to Verb Suggestion for ESL
Author: Yu Sawai ; Mamoru Komachi ; Yuji Matsumoto
Abstract: We propose a verb suggestion method which uses candidate sets and domain adaptation to incorporate error patterns produced by ESL learners. The candidate sets are constructed from a large scale learner corpus to cover various error patterns made by learners. Furthermore, the model is trained using both a native corpus and the learner corpus via a domain adaptation technique. Experiments on two learner corpora show that the candidate sets increase the coverage of error patterns and domain adaptation improves the performance for verb suggestion.
3 0.78639448 122 acl-2013-Discriminative Approach to Fill-in-the-Blank Quiz Generation for Language Learners
Author: Keisuke Sakaguchi ; Yuki Arase ; Mamoru Komachi
Abstract: We propose discriminative methods to generate semantic distractors of fill-in-theblank quiz for language learners using a large-scale language learners’ corpus. Unlike previous studies, the proposed methods aim at satisfying both reliability and validity of generated distractors; distractors should be exclusive against answers to avoid multiple answers in one quiz, and distractors should discriminate learners’ proficiency. Detailed user evaluation with 3 native and 23 non-native speakers of English shows that our methods achieve better reliability and validity than previous methods.
4 0.63536716 186 acl-2013-Identifying English and Hungarian Light Verb Constructions: A Contrastive Approach
Author: Veronika Vincze ; Istvan Nagy T. ; Richard Farkas
Abstract: Here, we introduce a machine learningbased approach that allows us to identify light verb constructions (LVCs) in Hungarian and English free texts. We also present the results of our experiments on the SzegedParalellFX English–Hungarian parallel corpus where LVCs were manually annotated in both languages. With our approach, we were able to contrast the performance of our method and define language-specific features for these typologically different languages. Our presented method proved to be sufficiently robust as it achieved approximately the same scores on the two typologically different languages.
5 0.51769698 213 acl-2013-Language Acquisition and Probabilistic Models: keeping it simple
Author: Aline Villavicencio ; Marco Idiart ; Robert Berwick ; Igor Malioutov
Abstract: Hierarchical Bayesian Models (HBMs) have been used with some success to capture empirically observed patterns of under- and overgeneralization in child language acquisition. However, as is well known, HBMs are “ideal”learningsystems, assumingaccess to unlimited computational resources that may not be available to child language learners. Consequently, it remains crucial to carefully assess the use of HBMs along with alternative, possibly simpler, candidate models. This paper presents such an evaluation for a language acquisi- tion domain where explicit HBMs have been proposed: the acquisition of English dative constructions. In particular, we present a detailed, empiricallygrounded model-selection comparison of HBMs vs. a simpler alternative based on clustering along with maximum likelihood estimation that we call linear competition learning (LCL). Our results demonstrate that LCL can match HBM model performance without incurring on the high computational costs associated with HBMs.
6 0.50292587 302 acl-2013-Robust Automated Natural Language Processing with Multiword Expressions and Collocations
7 0.49877015 185 acl-2013-Identifying Bad Semantic Neighbors for Improving Distributional Thesauri
8 0.48975086 366 acl-2013-Understanding Verbs based on Overlapping Verbs Senses
9 0.48330334 299 acl-2013-Reconstructing an Indo-European Family Tree from Non-native English Texts
10 0.45629713 364 acl-2013-Typesetting for Improved Readability using Lexical and Syntactic Information
11 0.44235027 34 acl-2013-Accurate Word Segmentation using Transliteration and Language Model Projection
12 0.43700621 231 acl-2013-Linggle: a Web-scale Linguistic Search Engine for Words in Context
13 0.40083623 371 acl-2013-Unsupervised joke generation from big data
14 0.39412162 76 acl-2013-Building and Evaluating a Distributional Memory for Croatian
15 0.39392594 235 acl-2013-Machine Translation Detection from Monolingual Web-Text
16 0.39102536 12 acl-2013-A New Set of Norms for Semantic Relatedness Measures
17 0.39026067 119 acl-2013-Diathesis alternation approximation for verb clustering
19 0.38266614 344 acl-2013-The Effects of Lexical Resource Quality on Preference Violation Detection
20 0.37914035 238 acl-2013-Measuring semantic content in distributional vectors
topicId topicWeight
[(0, 0.059), (6, 0.024), (11, 0.114), (24, 0.047), (26, 0.044), (35, 0.196), (42, 0.045), (48, 0.036), (70, 0.025), (88, 0.029), (90, 0.022), (91, 0.166), (93, 0.025), (95, 0.072)]
simIndex simValue paperId paperTitle
same-paper 1 0.87653059 58 acl-2013-Automated Collocation Suggestion for Japanese Second Language Learners
Author: Lis Pereira ; Erlyn Manguilimotan ; Yuji Matsumoto
Abstract: This study addresses issues of Japanese language learning concerning word combinations (collocations). Japanese learners may be able to construct grammatically correct sentences; however, these may sound “unnatural”. In this work, we analyze correct word combinations using different collocation measures and word similarity methods. While other methods use only well-formed text, our approach makes use of a large Japanese language learner corpus for generating collocation candidates, in order to build a system that is more sensitive to constructions that are difficult for learners. Our results show that our approach outperforms methods that use only well-formed text.
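Collocation measures like those mentioned in the abstract score how strongly a noun and verb co-occur; pointwise mutual information (PMI) is one standard such measure. The sketch below is a generic illustration with invented counts, drawing on the paper's own examples (ocha wo ireru vs. the unnatural learner combination ocha wo suru); it is not necessarily the measure or data the paper uses.

```python
import math
from collections import Counter

# Toy noun-verb co-occurrence counts; a real system would extract
# these from a dependency-parsed corpus. Figures are illustrative.
pair_counts = Counter({
    ("ocha", "ireru"): 40,    # "make tea" - natural collocation
    ("ocha", "suru"): 2,      # unnatural learner combination
    ("yume", "miru"): 50,     # "have a dream"
    ("yume", "suru"): 3,      # unnatural learner combination
    ("benkyou", "suru"): 80,  # "study" - suru is natural here
})

total = sum(pair_counts.values())
noun_counts, verb_counts = Counter(), Counter()
for (noun, verb), c in pair_counts.items():
    noun_counts[noun] += c
    verb_counts[verb] += c

def pmi(noun, verb):
    """log2( P(noun, verb) / (P(noun) * P(verb)) )"""
    p_joint = pair_counts[(noun, verb)] / total
    return math.log2(p_joint / ((noun_counts[noun] / total)
                                * (verb_counts[verb] / total)))

# Natural collocations score higher than unnatural combinations:
print(round(pmi("ocha", "ireru"), 2))  # clearly positive
print(round(pmi("ocha", "suru"), 2))   # negative
```

A suggestion system can then rank candidate verbs for a given noun by such a score, preferring miru over suru for yume.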
2 0.85307175 52 acl-2013-Annotating named entities in clinical text by combining pre-annotation and active learning
Author: Maria Skeppstedt
Abstract: For expanding a corpus of clinical text, annotated for named entities, a method that combines pre-tagging with a version of active learning is proposed. In order to facilitate annotation and to avoid bias, two alternative automatic pre-taggings are presented to the annotator, without revealing which of them is given a higher confidence by the pre-tagging system. The task of the annotator is to select the correct version among these two alternatives. To minimise the instances in which none of the presented pre-taggings is correct, the texts presented to the annotator are actively selected from a pool of unlabelled text, with the selection criterion that one of the presented pre-taggings should have a high probability of being correct, while still being useful for improving the result of an automatic classifier.
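The selection criterion described above, choosing texts where one of the two pre-taggings is likely correct yet the example is still informative, can be sketched roughly as follows. The data layout, field names, and the 0.8 threshold are illustrative assumptions, not details from the paper.

```python
# Each pool entry: (text, (tagging_a, confidence_a), (tagging_b, confidence_b)).
# Select texts where the two pre-taggings disagree (so the annotator's
# choice is informative for the classifier) but the stronger tagging is
# still confident (so one of the two alternatives is likely correct).

def select_for_annotation(pool, min_confidence=0.8):
    selected = []
    for text, (tags_a, conf_a), (tags_b, conf_b) in pool:
        if tags_a != tags_b and max(conf_a, conf_b) >= min_confidence:
            selected.append(text)
    return selected

pool = [
    ("note A", (["B-Drug", "O"], 0.95), (["O", "O"], 0.40)),   # disagree, one confident
    ("note B", (["O", "O"], 0.99), (["O", "O"], 0.97)),        # taggings agree: skip
    ("note C", (["B-Dx", "O"], 0.30), (["O", "B-Dx"], 0.25)),  # both weak: skip
]
print(select_for_annotation(pool))  # ['note A']
```

Only the first note is selected: its pre-taggings differ, and one of them is confident enough that presenting the pair as a binary choice is likely to include the correct answer.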
3 0.83571559 169 acl-2013-Generating Synthetic Comparable Questions for News Articles
Author: Oleg Rokhlenko ; Idan Szpektor
Abstract: We introduce the novel task of automatically generating questions that are relevant to a text but do not appear in it. One motivating example of its application is for increasing user engagement around news articles by suggesting relevant comparable questions, such as “is Beyonce a better singer than Madonna?”, for the user to answer. We present the first algorithm for the task, which consists of: (a) offline construction of a comparable question template database; (b) ranking of relevant templates to a given article; and (c) instantiation of templates only with entities in the article whose comparison under the template’s relation makes sense. We tested the suggestions generated by our algorithm via a Mechanical Turk experiment, which showed a significant improvement over the strongest baseline of more than 45% in all metrics.
4 0.80164492 121 acl-2013-Discovering User Interactions in Ideological Discussions
Author: Arjun Mukherjee ; Bing Liu
Abstract: Online discussion forums are a popular platform for people to voice their opinions on any subject matter and to discuss or debate any issue of interest. In forums where users discuss social, political, or religious issues, there are often heated debates among users or participants. Existing research has studied mining of user stances or camps on certain issues, opposing perspectives, and contention points. In this paper, we focus on identifying the nature of interactions among user pairs. The central questions are: How does each pair of users interact with each other? Does the pair of users mostly agree or disagree? What is the lexicon that people often use to express agreement and disagreement? We present a topic model based approach to answer these questions. Since agreement and disagreement expressions are usually multiword phrases, we propose to employ a ranking method to identify highly relevant phrases prior to topic modeling. After modeling, we use the modeling results to classify the nature of interaction of each user pair. Our evaluation results using real-life discussion/debate posts demonstrate the effectiveness of the proposed techniques.
5 0.7991643 122 acl-2013-Discriminative Approach to Fill-in-the-Blank Quiz Generation for Language Learners
Author: Keisuke Sakaguchi ; Yuki Arase ; Mamoru Komachi
Abstract: We propose discriminative methods to generate semantic distractors of fill-in-the-blank quiz for language learners using a large-scale language learners’ corpus. Unlike previous studies, the proposed methods aim at satisfying both reliability and validity of generated distractors; distractors should be exclusive against answers to avoid multiple answers in one quiz, and distractors should discriminate learners’ proficiency. Detailed user evaluation with 3 native and 23 non-native speakers of English shows that our methods achieve better reliability and validity than previous methods.
6 0.79642022 311 acl-2013-Semantic Neighborhoods as Hypergraphs
7 0.79573864 159 acl-2013-Filling Knowledge Base Gaps for Distant Supervision of Relation Extraction
8 0.79539001 238 acl-2013-Measuring semantic content in distributional vectors
9 0.78891385 215 acl-2013-Large-scale Semantic Parsing via Schema Matching and Lexicon Extension
10 0.78815389 160 acl-2013-Fine-grained Semantic Typing of Emerging Entities
11 0.78769141 76 acl-2013-Building and Evaluating a Distributional Memory for Croatian
12 0.78452682 231 acl-2013-Linggle: a Web-scale Linguistic Search Engine for Words in Context
13 0.7824927 283 acl-2013-Probabilistic Domain Modelling With Contextualized Distributional Semantic Vectors
14 0.78168023 60 acl-2013-Automatic Coupling of Answer Extraction and Information Retrieval
15 0.78082174 113 acl-2013-Derivational Smoothing for Syntactic Distributional Semantics
16 0.78007925 158 acl-2013-Feature-Based Selection of Dependency Paths in Ad Hoc Information Retrieval
17 0.77891058 291 acl-2013-Question Answering Using Enhanced Lexical Semantic Models
18 0.77829051 352 acl-2013-Towards Accurate Distant Supervision for Relational Facts Extraction
19 0.77585584 278 acl-2013-Patient Experience in Online Support Forums: Modeling Interpersonal Interactions and Medication Use
20 0.77490151 32 acl-2013-A relatedness benchmark to test the role of determiners in compositional distributional semantics