acl acl2011 acl2011-304 knowledge-graph by maker-knowledge-mining

304 acl-2011-Together We Can: Bilingual Bootstrapping for WSD


Source: pdf

Author: Mitesh M. Khapra ; Salil Joshi ; Arindam Chatterjee ; Pushpak Bhattacharyya

Abstract: Recent work on bilingual Word Sense Disambiguation (WSD) has shown that a resource deprived language (L1) can benefit from the annotation work done in a resource rich language (L2) via parameter projection. However, this method assumes the presence of sufficient annotated data in one resource rich language which may not always be possible. Instead, we focus on the situation where there are two resource deprived languages, both having a very small amount of seed annotated data and a large amount of untagged data. We then use bilingual bootstrapping, wherein, a model trained using the seed annotated data of L1 is used to annotate the untagged data of L2 and vice versa using parameter projection. The untagged instances of L1 and L2 which get annotated with high confidence are then added to the seed data of the respective languages and the above process is repeated. Our experiments show that such a bilingual bootstrapping algorithm when evaluated on two different domains with small seed sizes using Hindi (L1) and Marathi (L2) as the language pair performs better than monolingual bootstrapping and significantly reduces annotation cost.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Recent work on bilingual Word Sense Disambiguation (WSD) has shown that a resource deprived language (L1) can benefit from the annotation work done in a resource rich language (L2) via parameter projection. [sent-6, score-0.634]

2 However, this method assumes the presence of sufficient annotated data in one resource rich language which may not always be possible. [sent-7, score-0.175]

3 Instead, we focus on the situation where there are two resource deprived languages, both having a very small amount of seed annotated data and a large amount of untagged data. [sent-8, score-0.897]

4 We then use bilingual bootstrapping, wherein, a model trained using the seed annotated data of L1 is used to annotate the untagged data of L2 and vice versa using parameter projection. [sent-9, score-0.81]

5 The untagged instances of L1 and L2 which get annotated with high confidence are then added to the seed data of the respective languages and the above process is repeated. [sent-10, score-0.799]

6 Our experiments show that such a bilingual bootstrapping algorithm when evaluated on two different domains with small seed sizes using Hindi (L1) and Marathi (L2) as the language pair performs better than monolingual bootstrapping and significantly reduces annotation cost. [sent-11, score-1.304]

7 1 Introduction The high cost of collecting sense annotated data for supervised approaches (Ng and Lee, 1996; Lee et al. [sent-12, score-0.268]

8 , 2004) has always remained a matter of concern for some of the resource deprived languages of the world. [sent-13, score-0.299]

9 Semi-supervised approaches (Yarowsky, 1995) which use a small amount of annotated data and a large amount of untagged data have shown promise albeit for a limited set of target words. [sent-19, score-0.305]

10 The above situation highlights the need for high accuracy resource conscious approaches to all-words multilingual WSD. [sent-20, score-0.172]

11 The work of Khapra et al. (2010) in this direction has shown that it is possible to perform cost effective WSD in a target language (L2) without compromising much on accuracy by leveraging on the annotation work done in another language (L1). [sent-22, score-0.102]

12 This is achieved with the help of a novel synsetaligned multilingual dictionary which facilitates the projection of parameters learned from the Wordnet and annotated corpus of L1 to L2. [sent-23, score-0.319]

13 This approach thus obviates the need for collecting large amounts of annotated corpora in multiple languages by relying on a sufficient annotated corpus in one resource rich language. [sent-24, score-0.372]

14 However, in many situations such a pivot resource rich language itself may not be available. [sent-25, score-0.193]

15 Instead, we might have two or more languages having a small amount of annotated corpus and a large amount of untagged corpus. [sent-26, score-0.384]

16 Specifically, we address the following question: In the absence of a pivot resource rich language is it possible for two resource deprived languages to mutually benefit from each other’s annotated data? [sent-28, score-0.617]

17 Even though it is hard to obtain large amounts of annotated data in multiple languages, it should be fairly easy to obtain a large amount of untagged data in these languages. [sent-31, score-0.301]

18 We leverage on such untagged data by employing a bootstrapping strategy. [sent-32, score-0.453]

19 The idea is to train an initial model using a small amount of annotated data in both the languages and iteratively expand this seed data by including untagged instances which get tagged with a high confidence in successive iterations. [sent-33, score-0.929]

20 Instead of using monolingual bootstrapping, we use bilingual bootstrapping via parameter projection. [sent-34, score-0.589]

21 In other words, the parameters learned from the annotated data of L1 (and L2 respectively) are projected to L2 (and L1 respectively) and the projected model is used to tag the untagged instances of L2 (and L1 respectively). [sent-35, score-0.46]

22 Such a bilingual bootstrapping strategy when tested on two domains, viz. [sent-36, score-0.437]

23 , Tourism and Health using Hindi (L1) and Marathi (L2) as the language pair, consistently does better than a baseline strategy which uses only seed data for training without performing any bootstrapping. [sent-37, score-0.372]

24 In monolingual bootstrapping a language can benefit only from its own seed data, and hence can tag with high confidence only those instances which it has already seen. [sent-40, score-0.912]

25 On the other hand, in bilingual bootstrapping a language can benefit from the seed data available in the other language which was not previously seen in its self corpus. [sent-41, score-0.887]

26 This is very similar to the process of co-training (Blum and Mitchell, 1998) wherein the annotated data in the two languages can be seen as two different views of the same data. [sent-42, score-0.162]

27 Hence, the classifier trained on one view can be improved by adding those untagged instances which are tagged with a high confidence by the classifier trained on the other view. [sent-43, score-0.358]

28 Section 3 describes the Synset aligned multilingual dictionary which facilitates parameter projection. [sent-46, score-0.229]

29 In section 5 we discuss bilingual bootstrapping which is the main focus of our work followed by a brief discussion on monolingual bootstrapping. [sent-49, score-0.553]

30 Starting with a very small number of seed collocations an initial decision list is created. [sent-54, score-0.372]

31 This decision list is then applied to untagged data and the instances which get tagged with a high confidence are added to the seed data. [sent-55, score-0.73]

32 This algorithm thus proceeds iteratively, increasing the seed size in successive iterations. [sent-56, score-0.446]

33 This monolingual bootstrapping method showed promise when tested on a limited set of target words but was not tried for all-words WSD. [sent-57, score-0.386]
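To make the decision-list idea concrete, here is a minimal Python sketch of the rule-scoring step; the smoothed log-likelihood scoring is one common reading of Yarowsky (1995), and all names are illustrative rather than the paper's exact formulation:

    import math
    from collections import Counter

    def build_decision_list(labeled):
        # labeled: list of (collocation_features, sense) pairs from the seed data.
        # Rank each (feature, sense) rule by a smoothed log-likelihood ratio.
        pair_counts, feat_counts = Counter(), Counter()
        for feats, sense in labeled:
            for f in feats:
                pair_counts[(f, sense)] += 1
                feat_counts[f] += 1
        rules = []
        for (f, sense), c in pair_counts.items():
            p = (c + 0.1) / (feat_counts[f] + 0.2)  # smoothed P(sense | feature)
            rules.append((abs(math.log(p / (1.0 - p))), f, sense))
        return sorted(rules, reverse=True)

Each iteration applies the strongest matching rule to the untagged instances and folds the high-confidence taggings back into the labeled set, which is exactly the loop described above.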

34 The failure of monolingual approaches (Ng and Lee, 1996; Lee et al. [sent-58, score-0.116]

35 , 2004; Mihalcea, 2005) to deliver high accuracies for all-words WSD at low costs created interest in bilingual approaches which aim at reducing the annotation effort. [sent-60, score-0.241]

36 The work of Khapra et al. (2009) aims at reducing the annotation effort in multiple languages by leveraging on existing resources in a pivot language. [sent-62, score-0.196]

37 They showed that it is possible to project the parameters learned from the annotation work of one language to another language provided aligned Wordnets for the two languages are available. [sent-63, score-0.197]

38 However, they do not address situations where two resource deprived languages have aligned Wordnets but neither has sufficient annotated data. [sent-64, score-0.379]

39 In such cases bilingual bootstrapping can be used so that the two languages can mutually benefit from each other’s small annotated data. [sent-65, score-0.641]

40 Li and Li (2004) proposed a bilingual bootstrapping approach for the more specific task of Word Translation Disambiguation (WTD) as opposed to the more general task of WSD. [sent-66, score-0.437]

41 Our work instead focuses on improving the performance of all-words WSD for two resource deprived languages using bilingual bootstrapping. [sent-69, score-0.466]

42 At the heart of our work lies parameter projection facilitated by a synset aligned multilingual dictionary described in the next section. [sent-70, score-0.425]

43 3 Synset Aligned Multilingual Dictionary A novel and effective method of storage and use of a dictionary in a multilingual setting was proposed by Mohanty et al. [sent-71, score-0.13]

44 For the purpose of current discussion, we will refer to this multilingual dictionary framework as MultiDict. [sent-73, score-0.13]

45 One important departure in this framework from the traditional dictionary is that synsets are linked, and after that the words inside the synsets are linked. [sent-74, score-0.217]

46 After the synsets are linked, cross linkages are set up manually from the words of a synset to the words of a linked synset of the pivot language. [sent-80, score-0.512]

47 For example, given the Marathi word मुलगा (mulgaa), “a youthful male person”, the correct lexical substitute from the corresponding Hindi synset is लड़का (ladkaa). [sent-82, score-0.168]

48 The average number of such links per synset per language pair is approximately 3. [sent-83, score-0.131]

49 However, since our work takes place in a semi-supervised setting, we do not assume the presence of these manual cross linkages between synset members. [sent-84, score-0.199]

50 Instead, in the above example, we assume that all the words in the Hindi synset are equally probable translations of every word in the corresponding Marathi synset. [sent-85, score-0.131]

51 Such cross-linkages between synset members facilitate parameter projection as explained in the next section. [sent-86, score-0.226]
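To make the MultiDict arrangement concrete, a minimal sketch follows; the concept id and the member words are illustrative assumptions:

    # Synset-aligned multilingual dictionary (MultiDict): one row per
    # concept, linking the member words of each language's synset.
    multidict = {
        "04321": {                               # illustrative concept id
            "hindi":   ["ladkaa", "baalak"],
            "marathi": ["mulgaa", "porgaa"],
        },
    }

    def translation_probs(concept_id, tgt_lang):
        # Without manual cross-linkages, every word of the linked target
        # synset is treated as an equally probable translation.
        words = multidict[concept_id][tgt_lang]
        return {w: 1.0 / len(words) for w in words}

For the example above, translation_probs("04321", "hindi") assigns probability 0.5 to each of ladkaa and baalak as a translation of mulgaa.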

52 The other component Wij ∗ Vi ∗ Vj captures the influence of the interaction of the candidate sense with the senses of context words, weighted by factors of co-occurrence, conceptual distance and semantic distance. [sent-89, score-0.182]
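For reference, these components feed into a scoring function of roughly the following shape, reconstructed here in the notation above from the description of Khapra et al.; treat the exact form as an assumption:

    S* = argmax_i ( θi ∗ Vi + Σ_{j ∈ J} Wij ∗ Vi ∗ Vj )

where θi ∗ Vi is the corpus-specific sense-distribution component, Wij ∗ Vi ∗ Vj is the interaction component just described, and J ranges over the senses of the context words.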

53 Wordnet-dependent parameters depend on the structure of the Wordnet whereas the Corpus-dependent parameters depend on various statistics learned from sense marked corpora. [sent-90, score-0.212]

54 Both the tasks of (a) constructing a Wordnet from scratch and (b) collecting sense marked corpora for multiple languages are tedious and expensive. [sent-91, score-0.24]

55 At the heart of their work lies the MultiDict described in the previous section, which facilitates parameter projection in the following manner: 1. [sent-94, score-0.171]

56 By linking with the synsets of a pivot resource rich language (Hindi, in our case), the cost of building Wordnets of other languages is partly reduced (semantic relations are inherited). [sent-95, score-0.409]

57 For calculating corpus-specific sense distributions, P(Sense Si | Word W), we need the counts #(Si, W). [sent-98, score-0.126]

58 This parameter projection strategy as explained above lies at the heart of our work and allows us to perform bilingual bootstrapping by projecting the models learned from one language to another. [sent-100, score-0.622]
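As a hedged illustration of that projection, the count #(Si, W) for a Marathi word can be estimated as an expected count over its cross-linked Hindi words; the uniform translation probabilities follow the assumption of Section 3, while the function itself is our sketch rather than the authors' exact estimator:

    def projected_count(si, crosslinked_hindi_words, hindi_sense_counts):
        # Estimate #(Si, W_marathi) from Hindi annotations by spreading
        # the Hindi counts uniformly over the cross-linked synset members.
        p = 1.0 / len(crosslinked_hindi_words)
        return p * sum(hindi_sense_counts.get((si, w), 0)
                       for w in crosslinked_hindi_words)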

59 As shown in Algorithm 1, we start with a small amount of seed data (LD1 and LD2) in the two languages. [sent-104, score-0.407]

60 The parameter projection strategy described in the previous section is then applied to θ1 and θ2 to obtain the projected models θˆ2 and θˆ1 respectively. [sent-107, score-0.162]

61 These projected models are then applied to the untagged data of L1 and L2 and the instances which get labeled with a high confidence are added to the labeled data of the respective languages. [sent-108, score-0.363]

62 We compare our algorithm with monolingual bootstrapping where the self models θ1 and θ2 are directly used to annotate the unlabeled instances in L1 and L2 respectively, instead of using the projected models θˆ1 and θˆ2. [sent-112, score-0.604]
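Putting the pieces together, here is a compact sketch of the loop of Algorithm 1; the function names, the placement of the 0.6 confidence threshold and the convergence test are assumptions based on the description above, and the monolingual variant is obtained by tagging with θ1 and θ2 directly instead of the projected models:

    def bilingual_bootstrap(ld1, ld2, ud1, ud2, train, project, tag,
                            conf=0.6, max_iters=10):
        # ld1/ld2: seed labeled data; ud1/ud2: untagged data of L1/L2.
        # train(data) -> model; project(model) -> model projected to the
        # other language via MultiDict; tag(model, x) -> (sense, prob).
        for _ in range(max_iters):
            theta1, theta2 = train(ld1), train(ld2)
            proj_to_l2, proj_to_l1 = project(theta1), project(theta2)
            grew = False
            for ud, ld, model in ((ud1, ld1, proj_to_l1), (ud2, ld2, proj_to_l2)):
                for x in list(ud):
                    sense, prob = tag(model, x)
                    if prob > conf:              # keep only confident taggings
                        ld.append((x, sense))
                        ud.remove(x)
                        grew = True
            if not grew:                         # converges after 1-2 iterations here
                break
        return train(ld1), train(ld2)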

63 The various statistics pertaining to the total number of words, number of words per POS category and average degree of polysemy are described in Tables 2 to 5. [sent-121, score-0.152]

64-66 [Table residue (Tables 3-5): average degree of Wordnet polysemy for polysemous words, per POS category (Noun, etc.), in the Tourism and Health domains for Hindi and Marathi.]

67 In fact, the documents in the two languages were randomly split into 4 folds without ensuring that the parallel documents remain in the same folds for the two languages. [sent-151, score-0.137]

68 We experimented with different seed sizes varying from 0 to 5000 in steps of 250. [sent-152, score-0.4]

69 We ran both algorithms (viz., monolingual bootstrapping and bilingual bootstrapping) for 10 iterations, but observed that the algorithms converge after 1-2 iterations. [sent-156, score-0.609]

70 However, since our work focuses on resource scarce languages we did not want to incur the additional cost of using a development set. [sent-162, score-0.229]

71 The confidence threshold was set to 0.6 so that in each iteration only those words get moved to the labeled data for which the assigned sense is clearly a majority sense (P > 0.6). [sent-164, score-0.252]

72 [Table 6 residue: columns are Domain-Language pair | Algorithm | F-score (%) | no. of tagged words needed to achieve this F-score | % reduction in annotation cost; first row: Hindi-Health, BiBoot, 57.x] [sent-167, score-0.109]

73 The x-axis represents the amount of seed data used and the y-axis represents the F-scores obtained. [sent-178, score-0.407]

74 BiBoot: This curve represents the F-score obtained after 10 iterations by using bilingual bootstrapping with different amounts of seed data. [sent-180, score-0.919]

75 MonoBoot: This curve represents the F-score obtained after 10 iterations by using monolingual bootstrapping with different amounts of seed data. [sent-182, score-0.868]

76 OnlySeed: This curve represents the F-score obtained by training on the seed data alone without using any bootstrapping. [sent-184, score-0.423]

77 WFS: This curve represents the F-score obtained by simply selecting the first sense from Wordnet, a typically reported baseline. [sent-186, score-0.177]

78 1 Performance of Bilingual bootstrapping For small seed sizes, the F-score of bilingual bootstrapping is consistently better than the F-score obtained by training only on the seed data without using any bootstrapping. [sent-189, score-1.451]

79 Further, bilingual bootstrapping also does better than monolingual bootstrapping for small seed sizes. [sent-191, score-1.195]

80 As explained earlier, this better performance can be attributed to the fact that in monolingual bootstrapping the algorithm can tag with high confidence only those instances which it has already seen in the training data. [sent-192, score-0.499]

81 This is clearly evident from the fact that the curve of monolingual bootstrapping (MonoBoot) is always close to the curve of OnlySeed. [sent-194, score-0.488]

82 2 Effect of seed size The benefit of bilingual bootstrapping is clearly felt for small seed sizes. [sent-196, score-1.291]

83 However, as the seed size increases, the performance of the 3 algorithms, viz., BiBoot, MonoBoot and OnlySeed, becomes comparable. [sent-197, score-0.413]

84 This is intuitive because, as the seed size increases, the algorithm is able to see more and more tagged instances in its self corpora and hence does not need any assistance from the other language. [sent-199, score-0.56]

85 3 Bilingual bootstrapping reduces annotation cost The performance boost obtained at small seed sizes suggests that bilingual bootstrapping helps to reduce the overall annotation costs for both the languages. [sent-202, score-1.256]

86 The rows for Hindi-Health and Marathi-Health in Table 6 show that when BiBoot is employed we need 1250 tagged words in Hindi and 1750 tagged words in Marathi to attain F-scores of 57. [sent-205, score-0.124]

87 On the other hand, in the absence of bilingual bootstrapping, (i. [sent-208, score-0.167]

88 4 Contribution of monosemous words in the performance of BiBoot As mentioned earlier, monosemous words in the test set are not considered while evaluating the performance of our algorithm but, we add monosemous words to the seed data. [sent-217, score-1.068]

89 However, we do not count monosemous words while calculating the seed size as there is no manual annotation cost associated with monosemous words (they can be tagged automatically by fetching their singleton sense id from the Wordnet). [sent-218, score-1.167]
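A sketch of how such zero-cost seeds can be harvested, shown with NLTK's English WordNet interface for concreteness; the authors query the Hindi and Marathi Wordnets instead, so treat this as illustrative:

    from nltk.corpus import wordnet as wn

    def monosemous_seed(words):
        # Words with exactly one synset can be sense-tagged automatically:
        # their singleton sense id is the label, at zero annotation cost.
        seed = []
        for w in words:
            synsets = wn.synsets(w)
            if len(synsets) == 1:
                seed.append((w, synsets[0].name()))
        return seed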

90 We observed that the monosemous words of L1 help in boosting the performance of L2 and vice versa. [sent-219, score-0.232]

91 This is because for a given monosemous word in L2 (or L1 respectively) the corresponding cross-linked word in L1 (or L2 respectively) need not necessarily be monosemous. [sent-220, score-0.232]

92 In such cases, the cross-linked polysemous word in L2 (or L1 respectively) benefits from the projected statistics of a monosemous word in L1 (or L2 respectively). [sent-221, score-0.394]

93 This explains why BiBoot gives an F-score of 35-52% even at zero seed size even though the F-score of OnlySeed is only 2-5% (see Figures 1 to 4). [sent-222, score-0.413]

94 9 Conclusion We presented a bilingual bootstrapping algorithm for Word Sense Disambiguation which allows two resource deprived languages to mutually benefit from each other’s data via parameter projection. [sent-223, score-0.845]

95 It also performs better than using only monolingual seed data without using any bootstrapping. [sent-225, score-0.488]

96 The benefit of bilingual bootstrapping is felt prominently when the seed size in the two languages is very small thus highlighting the usefulness of this algorithm in highly resource constrained scenarios. [sent-226, score-1.093]

97 Value for money: Balancing annotation effort, lexicon building and accuracy for multilingual WSD. [sent-246, score-0.124]

98 Supervised word sense disambiguation with support vector machines and multiple knowledge sources. Proceedings of Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, pages 137–140. [sent-250, score-0.195]

99 Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. [sent-253, score-0.195]

100 Large vocabulary unsupervised word sense disambiguation with graph-based algorithms for sequence data labeling. [sent-269, score-0.195]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('seed', 0.372), ('biboot', 0.284), ('bootstrapping', 0.27), ('marathi', 0.264), ('onlyseed', 0.243), ('monosemous', 0.232), ('hindi', 0.227), ('khapra', 0.197), ('untagged', 0.183), ('monoboot', 0.183), ('bilingual', 0.167), ('synset', 0.131), ('sense', 0.126), ('deprived', 0.125), ('monolingual', 0.116), ('wfs', 0.101), ('tourism', 0.099), ('resource', 0.095), ('polysemous', 0.095), ('mitesh', 0.089), ('wordnet', 0.083), ('synsets', 0.082), ('languages', 0.079), ('multilingual', 0.077), ('pivot', 0.07), ('disambiguation', 0.069), ('polysemy', 0.068), ('projected', 0.067), ('wordnets', 0.066), ('confidence', 0.065), ('wsd', 0.065), ('tagged', 0.062), ('si', 0.06), ('projection', 0.059), ('pushpak', 0.059), ('cost', 0.055), ('category', 0.054), ('dictionary', 0.053), ('annotated', 0.052), ('curve', 0.051), ('projecting', 0.049), ('instances', 0.048), ('adverb', 0.048), ('annotation', 0.047), ('parameters', 0.043), ('health', 0.043), ('benefit', 0.041), ('size', 0.041), ('amsler', 0.041), ('ivi', 0.041), ('lga', 0.041), ('mulgaa', 0.041), ('multidict', 0.041), ('por', 0.041), ('rigau', 0.041), ('tourismhealth', 0.041), ('tourismhealthtourismhealth', 0.041), ('wio', 0.041), ('wordsmonosemous', 0.041), ('wrdor', 0.041), ('wtd', 0.041), ('heart', 0.041), ('mccarthy', 0.039), ('male', 0.037), ('self', 0.037), ('mohanty', 0.036), ('lesk', 0.036), ('arindam', 0.036), ('linkages', 0.036), ('parameter', 0.036), ('collecting', 0.035), ('amount', 0.035), ('agirre', 0.035), ('facilitates', 0.035), ('sj', 0.034), ('walker', 0.034), ('domains', 0.034), ('successive', 0.033), ('respectively', 0.033), ('projectable', 0.033), ('unlabeled', 0.033), ('adjective', 0.032), ('cross', 0.032), ('mutually', 0.032), ('bhattacharyya', 0.031), ('wherein', 0.031), ('amounts', 0.031), ('linked', 0.03), ('degree', 0.03), ('conceptual', 0.029), ('folds', 0.029), ('felt', 0.028), ('rich', 0.028), ('aligned', 0.028), ('vi', 0.028), ('sizes', 0.028), ('iterations', 0.028), ('deliver', 0.027), ('till', 0.027), ('senses', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 304 acl-2011-Together We Can: Bilingual Bootstrapping for WSD

Author: Mitesh M. Khapra ; Salil Joshi ; Arindam Chatterjee ; Pushpak Bhattacharyya

Abstract: Recent work on bilingual Word Sense Disambiguation (WSD) has shown that a resource deprived language (L1) can benefit from the annotation work done in a resource rich language (L2) via parameter projection. However, this method assumes the presence of sufficient annotated data in one resource rich language which may not always be possible. Instead, we focus on the situation where there are two resource deprived languages, both having a very small amount of seed annotated data and a large amount of untagged data. We then use bilingual bootstrapping, wherein, a model trained using the seed annotated data of L1 is used to annotate the untagged data of L2 and vice versa using parameter projection. The untagged instances of L1 and L2 which get annotated with high confidence are then added to the seed data of the respective languages and the above process is repeated. Our experiments show that such a bilingual bootstrapping algorithm when evaluated on two different domains with small seed sizes using Hindi (L1) and Marathi (L2) as the language pair performs better than monolingual bootstrapping and significantly reduces annotation cost.

2 0.19070417 148 acl-2011-HITS-based Seed Selection and Stop List Construction for Bootstrapping

Author: Tetsuo Kiso ; Masashi Shimbo ; Mamoru Komachi ; Yuji Matsumoto

Abstract: In bootstrapping (seed set expansion), selecting good seeds and creating stop lists are two effective ways to reduce semantic drift, but these methods generally need human supervision. In this paper, we propose a graphbased approach to helping editors choose effective seeds and stop list instances, applicable to Pantel and Pennacchiotti’s Espresso bootstrapping algorithm. The idea is to select seeds and create a stop list using the rankings of instances and patterns computed by Kleinberg’s HITS algorithm. Experimental results on a variation of the lexical sample task show the effectiveness of our method.

3 0.13413069 145 acl-2011-Good Seed Makes a Good Crop: Accelerating Active Learning Using Language Modeling

Author: Dmitriy Dligach ; Martha Palmer

Abstract: Active Learning (AL) is typically initialized with a small seed of examples selected randomly. However, when the distribution of classes in the data is skewed, some classes may be missed, resulting in a slow learning progress. Our contribution is twofold: (1) we show that an unsupervised language modeling based technique is effective in selecting rare class examples, and (2) we use this technique for seeding AL and demonstrate that it leads to a higher learning rate. The evaluation is conducted in the context of word sense disambiguation.

4 0.13243501 240 acl-2011-ParaSense or How to Use Parallel Corpora for Word Sense Disambiguation

Author: Els Lefever ; Veronique Hoste ; Martine De Cock

Abstract: This paper describes a set of exploratory experiments for a multilingual classification-based approach to Word Sense Disambiguation. Instead of using a predefined monolingual sense-inventory such as WordNet, we use a language-independent framework where the word senses are derived automatically from word alignments on a parallel corpus. We built five classifiers with English as an input language and translations in the five supported languages (viz. French, Dutch, Italian, Spanish and German) as classification output. The feature vectors incorporate both the more traditional local context features, as well as binary bag-of-words features that are extracted from the aligned translations. Our results show that the ParaSense multilingual WSD system shows very competitive results compared to the best systems that were evaluated on the SemEval-2010 Cross-Lingual Word Sense Disambiguation task for all five target languages.

5 0.12539537 177 acl-2011-Interactive Group Suggesting for Twitter

Author: Zhonghua Qu ; Yang Liu

Abstract: The number of users on Twitter has drastically increased in the past years. However, Twitter does not have an effective user grouping mechanism. Therefore tweets from other users can quickly overrun and become inconvenient to read. In this paper, we propose methods to help users group the people they follow using their provided seeding users. Two sources of information are used to build sub-systems: textual information captured by the tweets sent by users, and social connections among users. We also propose a measure of fitness to determine which subsystem best represents the seed users and use it for target user ranking. Our experiments show that our proposed framework works well and that adaptively choosing the appropriate sub-system for group suggestion results in increased accuracy.

6 0.11529537 183 acl-2011-Joint Bilingual Sentiment Classification with Unlabeled Parallel Corpora

7 0.11471716 167 acl-2011-Improving Dependency Parsing with Semantic Classes

8 0.11007275 162 acl-2011-Identifying the Semantic Orientation of Foreign Words

9 0.10743405 331 acl-2011-Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation

10 0.10419847 158 acl-2011-Identification of Domain-Specific Senses in a Machine-Readable Dictionary

11 0.1026276 151 acl-2011-Hindi to Punjabi Machine Translation System

12 0.092168584 117 acl-2011-Entity Set Expansion using Topic information

13 0.089896433 70 acl-2011-Clustering Comparable Corpora For Bilingual Lexicon Extraction

14 0.088304609 198 acl-2011-Latent Semantic Word Sense Induction and Disambiguation

15 0.087484613 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents

16 0.086559683 243 acl-2011-Partial Parsing from Bitext Projections

17 0.085057795 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning

18 0.083326325 86 acl-2011-Coreference for Learning to Extract Relations: Yes Virginia, Coreference Matters

19 0.078961492 79 acl-2011-Confidence Driven Unsupervised Semantic Parsing

20 0.078244902 174 acl-2011-Insights from Network Structure for Text Mining


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.159), (1, 0.026), (2, -0.024), (3, 0.018), (4, 0.022), (5, -0.015), (6, 0.139), (7, 0.016), (8, -0.032), (9, -0.026), (10, -0.003), (11, -0.103), (12, 0.178), (13, 0.012), (14, 0.007), (15, -0.153), (16, 0.139), (17, 0.032), (18, 0.096), (19, 0.018), (20, -0.015), (21, 0.007), (22, 0.049), (23, 0.124), (24, -0.039), (25, 0.006), (26, 0.03), (27, 0.229), (28, 0.106), (29, -0.042), (30, 0.106), (31, -0.043), (32, 0.047), (33, -0.115), (34, 0.082), (35, 0.035), (36, 0.115), (37, 0.06), (38, 0.019), (39, 0.008), (40, -0.001), (41, -0.096), (42, -0.073), (43, 0.003), (44, -0.006), (45, 0.021), (46, 0.054), (47, -0.028), (48, 0.037), (49, 0.051)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9546172 304 acl-2011-Together We Can: Bilingual Bootstrapping for WSD

Author: Mitesh M. Khapra ; Salil Joshi ; Arindam Chatterjee ; Pushpak Bhattacharyya

Abstract: Recent work on bilingual Word Sense Disambiguation (WSD) has shown that a resource deprived language (L1) can benefit from the annotation work done in a resource rich language (L2) via parameter projection. However, this method assumes the presence of sufficient annotated data in one resource rich language which may not always be possible. Instead, we focus on the situation where there are two resource deprived languages, both having a very small amount of seed annotated data and a large amount of untagged data. We then use bilingual bootstrapping, wherein, a model trained using the seed annotated data of L1 is used to annotate the untagged data of L2 and vice versa using parameter projection. The untagged instances of L1 and L2 which get annotated with high confidence are then added to the seed data of the respective languages and the above process is repeated. Our experiments show that such a bilingual bootstrapping algorithm when evaluated on two different domains with small seed sizes using Hindi (L1) and Marathi (L2) as the language pair performs better than monolingual bootstrapping and significantly reduces annotation cost.

2 0.84305435 148 acl-2011-HITS-based Seed Selection and Stop List Construction for Bootstrapping

Author: Tetsuo Kiso ; Masashi Shimbo ; Mamoru Komachi ; Yuji Matsumoto

Abstract: In bootstrapping (seed set expansion), selecting good seeds and creating stop lists are two effective ways to reduce semantic drift, but these methods generally need human supervision. In this paper, we propose a graphbased approach to helping editors choose effective seeds and stop list instances, applicable to Pantel and Pennacchiotti’s Espresso bootstrapping algorithm. The idea is to select seeds and create a stop list using the rankings of instances and patterns computed by Kleinberg’s HITS algorithm. Experimental results on a variation of the lexical sample task show the effectiveness of our method.

3 0.66881901 145 acl-2011-Good Seed Makes a Good Crop: Accelerating Active Learning Using Language Modeling

Author: Dmitriy Dligach ; Martha Palmer

Abstract: Active Learning (AL) is typically initialized with a small seed of examples selected randomly. However, when the distribution of classes in the data is skewed, some classes may be missed, resulting in a slow learning progress. Our contribution is twofold: (1) we show that an unsupervised language modeling based technique is effective in selecting rare class examples, and (2) we use this technique for seeding AL and demonstrate that it leads to a higher learning rate. The evaluation is conducted in the context of word sense disambiguation.

4 0.66718954 162 acl-2011-Identifying the Semantic Orientation of Foreign Words

Author: Ahmed Hassan ; Amjad AbuJbara ; Rahul Jha ; Dragomir Radev

Abstract: We present a method for identifying the positive or negative semantic orientation of foreign words. Identifying the semantic orientation of words has numerous applications in the areas of text classification, analysis of product review, analysis of responses to surveys, and mining online discussions. Identifying the semantic orientation of English words has been extensively studied in literature. Most of this work assumes the existence of resources (e.g. Wordnet, seeds, etc) that do not exist in foreign languages. In this work, we describe a method based on constructing a multilingual network connecting English and foreign words. We use this network to identify the semantic orientation of foreign words based on connection between words in the same language as well as multilingual connections. The method is experimentally tested using a manually labeled set of positive and negative words and has shown very promising results.

5 0.56921273 222 acl-2011-Model-Portability Experiments for Textual Temporal Analysis

Author: Oleksandr Kolomiyets ; Steven Bethard ; Marie-Francine Moens

Abstract: We explore a semi-supervised approach for improving the portability of time expression recognition to non-newswire domains: we generate additional training examples by substituting temporal expression words with potential synonyms. We explore using synonyms both from WordNet and from the Latent Words Language Model (LWLM), which predicts synonyms in context using an unsupervised approach. We evaluate a state-of-the-art time expression recognition system trained both with and without the additional training examples using data from TempEval 2010, Reuters and Wikipedia. We find that the LWLM provides substantial improvements on the Reuters corpus, and smaller improvements on the Wikipedia corpus. We find that WordNet alone never improves performance, though intersecting the examples from the LWLM and WordNet provides more stable results for Wikipedia.

6 0.55521035 240 acl-2011-ParaSense or How to Use Parallel Corpora for Word Sense Disambiguation

7 0.54248852 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning

8 0.52351284 174 acl-2011-Insights from Network Structure for Text Mining

9 0.52001065 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents

10 0.51092255 70 acl-2011-Clustering Comparable Corpora For Bilingual Lexicon Extraction

11 0.50196207 229 acl-2011-NULEX: An Open-License Broad Coverage Lexicon

12 0.4736518 331 acl-2011-Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation

13 0.45616671 158 acl-2011-Identification of Domain-Specific Senses in a Machine-Readable Dictionary

14 0.44541132 198 acl-2011-Latent Semantic Word Sense Induction and Disambiguation

15 0.4446885 115 acl-2011-Engkoo: Mining the Web for Language Learning

16 0.42687839 262 acl-2011-Relation Guided Bootstrapping of Semantic Lexicons

17 0.41287765 307 acl-2011-Towards Tracking Semantic Change by Visual Analytics

18 0.40483922 323 acl-2011-Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections

19 0.39437193 139 acl-2011-From Bilingual Dictionaries to Interlingual Document Representations

20 0.39175892 231 acl-2011-Nonlinear Evidence Fusion and Propagation for Hyponymy Relation Mining


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.022), (17, 0.03), (26, 0.033), (37, 0.081), (39, 0.022), (41, 0.065), (53, 0.018), (55, 0.027), (59, 0.067), (72, 0.032), (83, 0.273), (91, 0.096), (96, 0.104), (97, 0.056)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.73467445 304 acl-2011-Together We Can: Bilingual Bootstrapping for WSD

Author: Mitesh M. Khapra ; Salil Joshi ; Arindam Chatterjee ; Pushpak Bhattacharyya

Abstract: Recent work on bilingual Word Sense Disambiguation (WSD) has shown that a resource deprived language (L1) can benefit from the annotation work done in a resource rich language (L2) via parameter projection. However, this method assumes the presence of sufficient annotated data in one resource rich language which may not always be possible. Instead, we focus on the situation where there are two resource deprived languages, both having a very small amount of seed annotated data and a large amount of untagged data. We then use bilingual bootstrapping, wherein, a model trained using the seed annotated data of L1 is used to annotate the untagged data of L2 and vice versa using parameter projection. The untagged instances of L1 and L2 which get annotated with high confidence are then added to the seed data of the respective languages and the above process is repeated. Our experiments show that such a bilingual bootstrapping algorithm when evaluated on two different domains with small seed sizes using Hindi (L1) and Marathi (L2) as the language pair performs better than monolingual bootstrapping and significantly reduces annotation cost.

2 0.62882662 192 acl-2011-Language-Independent Parsing with Empty Elements

Author: Shu Cai ; David Chiang ; Yoav Goldberg

Abstract: We present a simple, language-independent method for integrating recovery of empty elements into syntactic parsing. This method outperforms the best published method we are aware of on English and a recently published method on Chinese.

3 0.54451728 108 acl-2011-EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar

Author: Chung-chi Huang ; Mei-hua Chen ; Shih-ting Huang ; Jason S. Chang

Abstract: We introduce a new method for learning to detect grammatical errors in learner’s writing and provide suggestions. The method involves parsing a reference corpus and inferring grammar patterns in the form of a sequence of content words, function words, and parts-of-speech (e.g., “play ~ role in Ving” and “look forward to Ving”). At runtime, the given passage submitted by the learner is matched using an extended Levenshtein algorithm against the set of pattern rules in order to detect errors and provide suggestions. We present a prototype implementation of the proposed method, EdIt, that can handle a broad range of errors. Promising results are illustrated with three common types of errors in nonnative writing.

4 0.53937936 79 acl-2011-Confidence Driven Unsupervised Semantic Parsing

Author: Dan Goldwasser ; Roi Reichart ; James Clarke ; Dan Roth

Abstract: Current approaches for semantic parsing take a supervised approach requiring a considerable amount of training data which is expensive and difficult to obtain. This supervision bottleneck is one of the major difficulties in scaling up semantic parsing. We argue that a semantic parser can be trained effectively without annotated data, and introduce an unsupervised learning algorithm. The algorithm takes a self training approach driven by confidence estimation. Evaluated over Geoquery, a standard dataset for this task, our system achieved 66% accuracy, compared to 80% of its fully supervised counterpart, demonstrating the promise of unsupervised approaches for this task.

5 0.53933144 86 acl-2011-Coreference for Learning to Extract Relations: Yes Virginia, Coreference Matters

Author: Ryan Gabbard ; Marjorie Freedman ; Ralph Weischedel

Abstract: As an alternative to requiring substantial supervised relation training data, many have explored bootstrapping relation extraction from a few seed examples. Most techniques assume that the examples are based on easily spotted anchors, e.g., names or dates. Sentences in a corpus which contain the anchors are then used to induce alternative ways of expressing the relation. We explore whether coreference can improve the learning process. That is, if the algorithm considered examples such as his sister, would accuracy be improved? With coreference, we see on average a 2-fold increase in F-Score. Despite using potentially errorful machine coreference, we see significant increase in recall on all relations. Precision increases in four cases and decreases in six.

6 0.53581101 145 acl-2011-Good Seed Makes a Good Crop: Accelerating Active Learning Using Language Modeling

7 0.53240001 28 acl-2011-A Statistical Tree Annotator and Its Applications

8 0.5306251 148 acl-2011-HITS-based Seed Selection and Stop List Construction for Bootstrapping

9 0.53051382 262 acl-2011-Relation Guided Bootstrapping of Semantic Lexicons

10 0.53019857 164 acl-2011-Improving Arabic Dependency Parsing with Form-based and Functional Morphological Features

11 0.52791429 167 acl-2011-Improving Dependency Parsing with Semantic Classes

12 0.52606624 222 acl-2011-Model-Portability Experiments for Textual Temporal Analysis

13 0.52592456 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering

14 0.52393007 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction

15 0.52281845 65 acl-2011-Can Document Selection Help Semi-supervised Learning? A Case Study On Event Extraction

16 0.52112079 170 acl-2011-In-domain Relation Discovery with Meta-constraints via Posterior Regularization

17 0.52069199 140 acl-2011-Fully Unsupervised Word Segmentation with BVE and MDL

18 0.52017784 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning

19 0.52008665 313 acl-2011-Two Easy Improvements to Lexical Weighting

20 0.51952803 190 acl-2011-Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations