emnlp emnlp2010 emnlp2010-11 knowledge-graph by maker-knowledge-mining

11 emnlp-2010-A Semi-Supervised Approach to Improve Classification of Infrequent Discourse Relations Using Feature Vector Extension


Source: pdf

Author: Hugo Hernault ; Danushka Bollegala ; Mitsuru Ishizuka

Abstract: Several recent discourse parsers have employed fully-supervised machine learning approaches. These methods require human annotators to create an extensive training corpus beforehand, which is a time-consuming and costly process. On the other hand, unlabeled data is abundant and cheap to collect. In this paper, we propose a novel semi-supervised method for discourse relation classification based on the analysis of cooccurring features in unlabeled data, which is then taken into account for extending the feature vectors given to a classifier. Our experimental results on the RST Discourse Treebank corpus and Penn Discourse Treebank indicate that the proposed method brings a significant improvement in classification accuracy and macro-average F-score when small training datasets are used. For instance, with training sets of ca. 1000 labeled instances, the proposed method brings improvements in accuracy and macro-average F-score of up to 50% compared to a baseline classifier. We believe that the proposed method is a first step towards detecting low-occurrence relations, which is useful for domains with a lack of annotated data.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Several recent discourse parsers have employed fully-supervised machine learning approaches. [sent-13, score-0.436]

2 In this paper, we propose a novel semi-supervised method for discourse relation classification based on the analysis of cooccurring features in unlabeled data, which is then taken into account for extending the feature vectors given to a classifier. [sent-16, score-1.055]

3 Our experimental results on the RST Discourse Treebank corpus and Penn Discourse Treebank indicate that the proposed method brings a significant improvement in classification accuracy and macro-average F-score when small training datasets are used. [sent-17, score-0.252]

4 1 Introduction Automatic detection of discourse relations in natural language text is important for numerous tasks in NLP, such as sentiment analysis (Somasundaran et al. [sent-22, score-0.549]

5 However, most of the recent work employing discourse relation classifiers is based on fully-supervised machine learning approaches (duVerle and Prendinger, 2009; Pitler et al. [sent-25, score-0.703]

6 Two of the main corpora with discourse annotations are the RST Discourse Treebank (RSTDT) (Carlson et al. [sent-28, score-0.402]

7 In the RSTDT, annotation is done using 78 fine-grained discourse relations, which are usually grouped into 18 coarser-grained relations. [sent-31, score-0.402]

8 In practice, a classifier trained on these coarse-grained relations must solve a 41-class classification problem. [sent-33, score-0.326]

9 Some of the relations corresponding to these classes are relatively more frequent in the corpus, such as the ELABORATION[N] [S] relation (4441 instances), or the ATTRIBUTION[S] [N] relation (1612 instances). [sent-34, score-0.631]

10 1 However, other relation types occur very rarely, such as TOPIC-COMMENT[S] [N] (2 instances), or EVALUATION[N] [N] (3 instances). [sent-35, score-0.242]

11 Although supervised approaches to discourse relation learning achieve good results on frequent relations, performance is poor on rare relation types (duVerle and Prendinger, 2009). [sent-40, score-0.886]

12 Nonetheless, certain infrequent relation types might be important for specific tasks. [sent-41, score-0.28]

13 For instance, 1We use the notation [N] and [S] respectively to denote the nucleus and satellite in a RST discourse relation. [sent-42, score-0.402]

14 Another situation where detection of low-occurring relations is desirable is the case where we have only a small training set at our disposal, for instance when there is not enough annotated data for all the relation types described in a discourse theory. [sent-46, score-0.855]

15 In this case, all the dataset’s relations can be considered rare, and being able to build an efficient classifier depends on the capacity to deal with this lack of annotated data. [sent-47, score-0.248]

16 • We propose a semi-supervised method that exploits the abundant, freely-available unlabeled data, which is harvested for feature cooccurrence information, and used as a basis to extend feature vectors to help classification for cases where unknown features are found in test vectors. [sent-49, score-0.422]

17 Related Work Since the release in 2001 of the RSTDT corpus, several fully-supervised discourse parsers have been built in the RST framework. [sent-53, score-0.402]

18 In the recent work of duVerle and Prendinger (2009), a discourse parser based on Support Vector Machines (SVM) (Vapnik, 1995) is proposed. [sent-54, score-0.402]

19 SVMs are employed to train two classifiers: One, binary, for determining the presence of a relation, and another, multi-class, for determining the relation label between related text spans. [sent-55, score-0.276]

20 For the discourse relation classifier, shallow lexical, syntactic and structural features, including ‘dominance sets’ (Soricut and Marcu, 2003), are used. [sent-56, score-0.644]

21 For relation classification, they report an accuracy of 0. [sent-57, score-0.289]

22 Discourse relation classifiers have also been trained using PDTB. [sent-65, score-0.27]

23 (2008) performed a corpus study of the PDTB, and found that ‘explicit’ relations can most of the time be distinguished by their discourse connectives. [sent-67, score-0.549]

24 Their discourse relation classifier reported an accuracy of 0. [sent-68, score-0.792]

25 1% over the previous state-of-the-art for implicit relation classification in PDTB. [sent-76, score-0.355]

26 However, for our task, classification of discourse relations, we employ not only words but also other types of features such as parse tree production rules, and thus cannot compute semantic kernels using WordNet. [sent-95, score-0.656]

27 In this paper, we are not aiming at defining novel features for improving performance in RST or PDTB relation classification. [sent-96, score-0.282]

28 Instead we incorporate numerous features that have been shown to be useful for discourse relation learning and explore the possibilities of using unlabeled data for this task. [sent-97, score-0.836]

29 One of our goals is to improve classification accuracy for rare discourse relations. [sent-98, score-0.527]

30 3 Method Given a set of unlabeled instances U and labeled instances L, our objective is to learn an n-class relation classifier H such that, for a given test instance x, it returns the correct relation type H(x). [sent-99, score-1.149]

31 In the case of discourse relation learning we are interested in the situation where |U| >> |L|. [sent-100, score-0.644]

32 Therefore, the classification algorithm does not have sufficient information to correctly predict the relation type of the given test instance. [sent-104, score-0.345]

33 We propose a method that first computes the co-occurrence between features using unlabeled data and then uses that information to extend the feature vectors during training and testing, thereby reducing the sparseness in test feature vectors. [sent-105, score-0.486]

34 1, we introduce the concept of feature co-occurrence matrix and describe how it is computed using unlabeled data. [sent-107, score-0.285]

35 Consequently, it can be used with any multi-class classification algorithm to learn a discourse relation classifier. [sent-113, score-0.722]

36 If both fi and fj appear in a feature vector then we define them to be co-occurring. [sent-121, score-0.384]

37 The number of different feature vectors in which fi and fj co-occur is denoted by the function h(fi, fj). [sent-122, score-0.372]

38 The χ2-measure between two features fi and fj is defined as follows: χ2_{i,j} = Σ_{k=1}^{2} Σ_{l=1}^{2} (O^{k,l}_{i,j} − E^{k,l}_{i,j})^2 / E^{k,l}_{i,j}. [sent-127, score-0.296]
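
The O^{k,l}_{i,j} terms are most naturally read as the observed counts of the 2x2 contingency table (fi present/absent versus fj present/absent over the unlabeled feature vectors), with E^{k,l}_{i,j} the counts expected under independence. The following is a minimal sketch of that computation on binary feature vectors; the function name, the NumPy representation and the toy data are illustrative assumptions, not the authors' implementation.

    import numpy as np

    def chi_square(X, i, j):
        """Chi-square association between binary feature columns i and j.
        X: (n_vectors, n_features) 0/1 matrix built from unlabeled data."""
        n = X.shape[0]
        fi, fj = X[:, i], X[:, j]
        # Observed 2x2 contingency table: rows = fi in {1, 0}, columns = fj in {1, 0}.
        O = np.array([[np.sum((fi == a) & (fj == b)) for b in (1, 0)] for a in (1, 0)],
                     dtype=float)
        # Expected counts under independence: row_total * column_total / n.
        E = np.outer(O.sum(axis=1), O.sum(axis=0)) / n
        return np.sum((O - E) ** 2 / np.where(E == 0, 1.0, E))

    # Toy example: four unlabeled feature vectors over three features.
    X = np.array([[1, 1, 0],
                  [1, 1, 0],
                  [0, 0, 1],
                  [1, 0, 1]])
    print(chi_square(X, 0, 1))  # higher score = stronger co-occurrence of features 0 and 1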

39 In discourse relation learning, the feature space can be extremely large. [sent-140, score-0.715]

40 3), any two words that appear in two adjoining discourse units can form a feature. [sent-142, score-0.453]

41 Because the number of elements in the feature co-occurrence matrix is proportional to the square of the feature dimension, computing co-occurrences for all pairs of features can be computationally costly. [sent-143, score-0.244]

42 Here, p(i, j) is the probability that feature fi co-occurs with feature fj, and is given by p(i, j) = h(fi, fj)/Zi (Eq. 5). [sent-147, score-0.281]
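
This simpler measure only requires the raw co-occurrence counts h(fi, fj). Zi is not spelled out in the extract; a natural reading is a per-feature normalizer such as the total co-occurrence mass of fi, which is the assumption made in the sketch below (the function names and the toy data are likewise illustrative).

    from collections import Counter
    from itertools import combinations

    def cooccurrence_counts(feature_sets):
        """h(fi, fj): number of unlabeled feature vectors in which fi and fj co-occur."""
        h = Counter()
        for feats in feature_sets:
            for fi, fj in combinations(sorted(feats), 2):
                h[(fi, fj)] += 1
                h[(fj, fi)] += 1
        return h

    def cooccurrence_probability(h, fi, fj):
        """p(i, j) = h(fi, fj) / Zi, with Zi assumed to be the sum of h(fi, .)."""
        zi = sum(count for (a, _), count in h.items() if a == fi)
        return h[(fi, fj)] / zi if zi else 0.0

    # Toy unlabeled data: each instance is the set of features that fire for it.
    unlabeled = [{"w_said", "rule_S->NP_VP"}, {"w_said", "w_declined"}, {"w_declined"}]
    h = cooccurrence_counts(unlabeled)
    print(cooccurrence_probability(h, "w_said", "w_declined"))  # 0.5 for this toy data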

43 2 Feature Vector Extension Once the feature co-occurrence matrix is computed using unlabeled data as described in Section 3. [sent-154, score-0.285]

44 The proposed feature vector extension method is inspired by query expansion in the field of Information Retrieval (Salton and Buckley, 1983; Fang, 2008). [sent-156, score-0.29]

45 One of the reasons that a classifier might perform poorly on a test instance is that there are features in the test instance that were not observed during training. [sent-157, score-0.255]

46 The expansion features for each feature fxi are then appended to the original feature vector fx to create an extended feature vector fx', where fx' = (fx1, ..., fxd(x), fx(1,1), ..., fx(i,j), ...), with fx(i,j) denoting the j-th expansion feature selected for the i-th original feature fxi, and d(x) the number of original features in fx. [sent-171, score-0.526]
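
In practice, the extension step only needs, for each original feature, a short ranked list of its most strongly co-occurring features as measured on the unlabeled data. A sketch under that assumption follows; the cut-off k, the representation of an instance as a set of feature strings, and the example values are illustrative choices rather than the paper's actual implementation.

    def extend_vector(features, related, k=3):
        """Append up to k expansion features for each original feature.

        features: set of feature strings observed in the instance.
        related:  dict mapping a feature to a list of co-occurring features,
                  pre-sorted by chi-square or p(i, j) computed on unlabeled data.
        """
        expansion = []
        for f in features:
            for g in related.get(f, [])[:k]:
                if g not in features:           # only add genuinely new features
                    expansion.append(g)
        return list(features) + expansion       # original features plus expansions

    related = {"w_declined": ["w_fell", "w_dropped"], "w_said": ["w_reported"]}
    print(extend_vector({"w_declined", "w_said"}, related))

Applying the same lookup at test time lets a feature that never occurred in the training set pull in co-occurring features that the classifier did observe, which is how the sparseness mentioned above is reduced.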

47 Figure 1 shows the parse tree for a sentence composed of two discourse units, which serve as arguments of a discourse relation from which we want to generate a feature vector. [sent-193, score-1.241]
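
Parse tree production rules, mentioned earlier as one of the feature types, can be read off such subtrees directly. A small sketch using NLTK's Tree class; the bracketed parse string and the feature-name prefix are invented for illustration and are not taken from the paper.

    from nltk import Tree

    # Invented bracketed parse for one discourse unit (one argument of a relation).
    arg = Tree.fromstring("(S (NP (NNP Sales)) (VP (VBD declined) (NP (CD 10) (NN %))))")

    # One binary feature per production rule occurring in the argument's subtree.
    features = {"rule_" + str(production).replace(" ", "_") for production in arg.productions()}
    print(sorted(features))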

48 In the PDTB, certain discourse relations have disjoint arguments. [sent-204, score-0.549]

49 Then, they are segmented into elementary discourse units (EDUs) using our sequential discourse segmenter (Hernault et al. [sent-212, score-0.883]

50 In the unlabeled data, any two consecutive discourse units might not always be connected by a discourse relation. [sent-220, score-1.007]

51 Therefore, we introduce an artificial NONE relation in the training set, in order to handle such cases. [sent-221, score-0.274]

52 Instances of the NONE relation are generated randomly by pairing consecutive discourse units which are not connected by a discourse relation in the training data. [sent-222, score-1.371]
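
A sketch of how such artificial NONE instances could be drawn; the representation of a document as a list of EDUs plus a set of adjacent index pairs holding a gold relation is an assumption made purely for illustration.

    import random

    def make_none_instances(edus, related_pairs, n):
        """Randomly pair consecutive EDUs that are NOT connected by a discourse
        relation in the training data and label each pair NONE."""
        candidates = [(i, i + 1) for i in range(len(edus) - 1)
                      if (i, i + 1) not in related_pairs]
        picked = random.sample(candidates, min(n, len(candidates)))
        return [((edus[i], edus[j]), "NONE") for i, j in picked]

    edus = ["Sales declined 10%", "to $251 million", "analysts said", "the stock fell"]
    gold = {(0, 1)}                       # only EDUs 0 and 1 hold a real relation
    print(make_none_instances(edus, gold, n=2))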

53 NONE is also learnt as a separate discourse relation class by the multi-class classification algorithm. [sent-223, score-0.722]

54 This enables us to detect discourse units between which there exists no discourse relation, thereby improving the classification accuracy for other relation types. [sent-224, score-1.222]

55 We follow the common practice in discourse research for partitioning the discourse corpora into training and test sets. [sent-225, score-0.861]

56 Figure 1: Two arguments of a discourse relation, and the minimum set of subtrees that contain them—lexical heads are indicated between brackets. [sent-228, score-0.443]

57 Because in the PDTB an instance can be annotated with several discourse relations simultaneously—called ‘senses’ in Prasad et al. [sent-234, score-0.581]

58 However, in the RST framework, only one relation is allowed to hold between two EDUs. [sent-236, score-0.242]

59 Consequently, each instance from the RSTDT is labeled with a single discourse relation, from which a single feature vector is created. [sent-237, score-0.613]

60 There are 41 classes (relation types) in the RSTDT relation classification task, and 29 classes in the PDTB task. [sent-240, score-0.32]

61 This inverse relation between the training dataset size and the number of features that only appear in test data can be observed in both RSTDT and PDTB datasets. [sent-248, score-0.369]

62 The number of unseen features is halved for a training set of 1800 instances in the case of RSTDT, and for a training set of 1300 instances in the case of PDTB. [sent-250, score-0.521]

63 In the following experiments, we use macro-averaged F-scores to evaluate the performance of the proposed discourse relation classifier on test data. [sent-252, score-0.803]

64 Macro-averaged F-score is not influenced by the number of instances that exist in each relation type. [sent-253, score-0.394]

65 It equally weights the performance on both frequent relation types and infrequent relation types. [sent-254, score-0.522]
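
Concretely, macro-averaging takes an unweighted mean of per-class F-scores, so a rare relation type contributes exactly as much as a frequent one such as ELABORATION. A small self-contained computation is sketched below; the labels and predictions are toy values, not results from the paper.

    def macro_f1(gold, pred):
        """Unweighted mean of per-class F-scores: every relation type counts equally."""
        scores = []
        for c in set(gold) | set(pred):
            tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
            fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
            fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
            precision = tp / (tp + fp) if tp + fp else 0.0
            recall = tp / (tp + fn) if tp + fn else 0.0
            scores.append(2 * precision * recall / (precision + recall)
                          if precision + recall else 0.0)
        return sum(scores) / len(scores)

    gold = ["ELABORATION"] * 8 + ["EVALUATION"] * 2
    pred = ["ELABORATION"] * 9 + ["EVALUATION"]
    print(macro_f1(gold, pred))  # the two rare EVALUATION instances weigh heavily here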

66 Because we are interested in measuring the overall performance of a discourse relation classifier across all relations. (Figure 2: Number of features seen only in the test set, as a function of the number of training instances used.) [sent-255, score-0.994]

67 This baseline is expected to show the effect of using the proposed feature vector extension approach for the task of discourse relation learning. [sent-258, score-0.863]

68 From these figures, we see that the proposed feature extension method outperforms the baseline for both RSTDT and PDTB datasets for the full range of training dataset sizes. [sent-260, score-0.286]

69 The difference of scores between the two methods then progressively diminishes as the number of training instances is increased, and fades beyond 10000 training instances. [sent-269, score-0.293]

70 In this case, the feature vector extension process is comprehensive, and the score can be increased by the use of unlabeled data. [sent-271, score-0.318]

71 When more training data is progressively used, the number of unseen test features sharply diminishes, which means feature vector extension becomes more limited, and the performance of the proposed method gets progressively closer to the baseline. [sent-272, score-0.435]

72 Note that we plotted PDTB performance up to 25000 training instances, as the number of unseen test features becomes so small past this point that the performances of the proposed method and baseline are identical. [sent-273, score-0.268]

73 Although the distribution of discourse relations in RSTDT and PDTB is not uniform, it is possible to study the performance of the proposed method when all relations are made equally rare. [sent-407, score-0.754]

74 We evaluate performance on artificially-created training sets containing an equal amount of each discourse relation. [sent-408, score-0.434]

75 We observe that, when using respectively one and two instances of each relation, the baseline classifier is unable to detect any relation, and has a macro-average F-score of zero. [sent-411, score-0.253]

76 In contrast, the classifier built with feature vector extension reaches in those cases an F-score of 0. [sent-412, score-0.287]

77 Furthermore, when employing the proposed method, certain relations have relatively high F-scores even with very little labeled data: With one training instance, ATTRIBUTION[S] [N] has an F-score of 0. [sent-414, score-0.294]

78 When the amount of each relation is increased, the baseline classifier starts detecting more relations. [sent-419, score-0.343]

79 Then, a progressive increase of both accuracy and macro-average F-score is observed, as the number of unseen test features is incremented. [sent-548, score-0.268]

80 These values reach a maximum of 119% macro-average F-score increase, and 66% accuracy increase, when 23500 features unseen during training are present in test data. [sent-554, score-0.257]

81 accuracy increase) is negligible for 1000 unseen test features, while this increase is 21% for both macro-average F-score and accuracy in the case of 9700 unseen test features, and 459% (resp. [sent-558, score-0.413]

82 This shows that the proposed method is useful when large numbers of features are missing from the training set, which corresponds in practice to small training sets, with few training instances for each relation type. [sent-560, score-0.613]

83 For large training sets, most features are encountered by the classifier during training, and feature vector extension does not bring useful information. [sent-561, score-0.319]

84 We use respectively 100 and 10000 labeled training instances, create feature cooccurrence matrices with different amounts of unlabeled data, and evaluate the performance in relation classification. [sent-563, score-0.64]

85 However, the benefit of using larger amounts of unlabeled data is more pronounced when only a small number of labeled training instances are employed (ca. [sent-566, score-0.446]

86 The effect of using unlabeled data on PDTB relation classification is illustrated in Figure 6 (bottom). [sent-571, score-0.472]

87 Similarly, we consecutively set the labeled training dataset size to 100 and 10000 instances, and plot the macro-average F-score against the unlabeled dataset size. [sent-572, score-0.295]

88 As in the RSTDT experiment, the benefit of us- (Figure 5: Score change as a function of unseen test features for RSTDT (top) and PDTB (bottom).) [sent-573, score-0.356]

89 ing unlabeled data is more obvious when the number of labeled training instances is small. [sent-574, score-0.387]

90 However, with 10000 labeled training instances the maximum improvement in F-score is 15% (corresponds to 100 unlabeled instances). [sent-576, score-0.387]

91 These results confirm that, on the one hand, performance improvement is more prominent for smaller training sets and, on the other hand, that performance increases when larger amounts of unlabeled data are used. [sent-577, score-0.246]

92 5 Conclusion We presented a semi-supervised method which exploits the co-occurrence of features in unlabeled data, to extend feature vectors during training and testing in a discourse relation classifier. [sent-578, score-1.034]

93 Despite the simplicity of the proposed method, it significantly improved the macro-average F-score in discourse relation classification for small training datasets containing low-occurrence relations. [sent-580, score-0.787]

94 Although the macro-average F-scores of the classifiers described are too low for them to be used directly as discourse analyzers, the gain in F-score and accuracy for small labeled datasets is a promising perspective for improving classification accuracy for infrequent relation types. [sent-583, score-0.97]

95 In particular, the proposed method can be employed in existing discourse classifiers that work well on popular relations, and be expected to improve the overall accuracy. [sent-584, score-0.522]

96 A novel discourse parser based on Support Vector Machine classification. [sent-636, score-0.402]

97 Recognizing implicit discourse relations in the Penn Discourse Treebank. [sent-672, score-0.584]

98 Automatic sense prediction for implicit discourse relations in text. [sent-754, score-0.584]

99 Supervised and unsupervised methods in employing discourse relations for improving opinion polarity classification. [sent-816, score-0.58]

100 Sentence level discourse parsing using syntactic and lexical information. [sent-823, score-0.402]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('pdtb', 0.458), ('rstdt', 0.422), ('discourse', 0.402), ('relation', 0.242), ('rst', 0.176), ('instances', 0.152), ('unlabeled', 0.152), ('relations', 0.147), ('fi', 0.139), ('fj', 0.117), ('unseen', 0.113), ('classifier', 0.101), ('fx', 0.1), ('classification', 0.078), ('pitler', 0.075), ('prasad', 0.075), ('feature', 0.071), ('duverle', 0.07), ('fxi', 0.07), ('hernault', 0.07), ('prendinger', 0.07), ('tr', 0.068), ('nw', 0.063), ('matrix', 0.062), ('attribution', 0.06), ('extension', 0.058), ('marcu', 0.058), ('vector', 0.057), ('units', 0.051), ('labeled', 0.051), ('accuracy', 0.047), ('progressively', 0.047), ('soricut', 0.047), ('expansion', 0.046), ('vectors', 0.045), ('basili', 0.045), ('okyo', 0.045), ('zj', 0.045), ('kernels', 0.044), ('increase', 0.043), ('cooccurrence', 0.042), ('fc', 0.042), ('arguments', 0.041), ('features', 0.04), ('infrequent', 0.038), ('fscore', 0.038), ('datasets', 0.037), ('increased', 0.037), ('penn', 0.035), ('implicit', 0.035), ('abundant', 0.035), ('alch', 0.035), ('bloehdorn', 0.035), ('bollegala', 0.035), ('cammisa', 0.035), ('classias', 0.035), ('danushka', 0.035), ('dinesh', 0.035), ('edus', 0.035), ('enablement', 0.035), ('hugo', 0.035), ('piwek', 0.035), ('robaldo', 0.035), ('siolas', 0.035), ('subfigure', 0.035), ('zsz', 0.035), ('semantic', 0.035), ('employed', 0.034), ('kernel', 0.033), ('treebank', 0.033), ('proposed', 0.033), ('training', 0.032), ('instance', 0.032), ('production', 0.031), ('employing', 0.031), ('cristianini', 0.03), ('declined', 0.03), ('diminishes', 0.03), ('loper', 0.03), ('miltsakaki', 0.03), ('rhetorical', 0.03), ('dataset', 0.03), ('classifiers', 0.028), ('segmenter', 0.028), ('store', 0.028), ('deerwester', 0.027), ('elaboration', 0.027), ('mann', 0.027), ('nltk', 0.027), ('parse', 0.026), ('test', 0.025), ('extend', 0.025), ('tokyo', 0.025), ('lemmatized', 0.025), ('corresponds', 0.025), ('amounts', 0.025), ('method', 0.025), ('matrices', 0.025), ('regression', 0.025), ('comment', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999958 11 emnlp-2010-A Semi-Supervised Approach to Improve Classification of Infrequent Discourse Relations Using Feature Vector Extension

Author: Hugo Hernault ; Danushka Bollegala ; Mitsuru Ishizuka

Abstract: Several recent discourse parsers have employed fully-supervised machine learning approaches. These methods require human annotators to beforehand create an extensive training corpus, which is a time-consuming and costly process. On the other hand, unlabeled data is abundant and cheap to collect. In this paper, we propose a novel semi-supervised method for discourse relation classification based on the analysis of cooccurring features in unlabeled data, which is then taken into account for extending the feature vectors given to a classifier. Our experimental results on the RST Discourse Treebank corpus and Penn Discourse Treebank indicate that the proposed method brings a significant improvement in classification accuracy and macro-average F-score when small training datasets are used. For instance, with training sets of c.a. 1000 labeled instances, the proposed method brings improvements in accuracy and macro-average F-score up to 50% compared to a baseline classifier. We believe that the proposed method is a first step towards detecting low-occurrence relations, which is useful for domains with a lack of annotated data.

2 0.12749352 27 emnlp-2010-Clustering-Based Stratified Seed Sampling for Semi-Supervised Relation Classification

Author: Longhua Qian ; Guodong Zhou

Abstract: Seed sampling is critical in semi-supervised learning. This paper proposes a clustering-based stratified seed sampling approach to semi-supervised learning. First, various clustering algorithms are explored to partition the unlabeled instances into different strata with each stratum represented by a center. Then, diversity-motivated intra-stratum sampling is adopted to choose the center and additional instances from each stratum to form the unlabeled seed set for an oracle to annotate. Finally, the labeled seed set is fed into a bootstrapping procedure as the initial labeled data. We systematically evaluate our stratified bootstrapping approach in the semantic relation classification subtask of the ACE RDC (Relation Detection and Classification) task. In particular, we compare various clustering algorithms on the stratified bootstrapping performance. Experimental results on the ACE RDC 2004 corpus show that our clustering-based stratified bootstrapping approach achieves the best F1-score of 75.9 on the subtask of semantic relation classification, approaching the one with golden clustering.

3 0.1074362 28 emnlp-2010-Collective Cross-Document Relation Extraction Without Labelled Data

Author: Limin Yao ; Sebastian Riedel ; Andrew McCallum

Abstract: We present a novel approach to relation extraction that integrates information across documents, performs global inference and requires no labelled text. In particular, we tackle relation extraction and entity identification jointly. We use distant supervision to train a factor graph model for relation extraction based on an existing knowledge base (Freebase, derived in parts from Wikipedia). For inference we run an efficient Gibbs sampler that leads to linear time joint inference. We evaluate our approach both for an in-domain (Wikipedia) and a more realistic out-of-domain (New York Times Corpus) setting. For the in-domain setting, our joint model leads to 4% higher precision than an isolated local approach, but has no advantage over a pipeline. For the out-of-domain data, we benefit strongly from joint modelling, and observe improvements in precision of 13% over the pipeline, and 15% over the isolated baseline.

4 0.09311638 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning

Author: Lei Shi ; Rada Mihalcea ; Mingjun Tian

Abstract: In this paper, we introduce a method that automatically builds text classifiers in a new language by training on already labeled data in another language. Our method transfers the classification knowledge across languages by translating the model features and by using an Expectation Maximization (EM) algorithm that naturally takes into account the ambiguity associated with the translation of a word. We further exploit the readily available unlabeled data in the target language via semi-supervised learning, and adapt the translated model to better fit the data distribution of the target language.

5 0.091499388 20 emnlp-2010-Automatic Detection and Classification of Social Events

Author: Apoorv Agarwal ; Owen Rambow

Abstract: In this paper we introduce the new task of social event extraction from text. We distinguish two broad types of social events depending on whether only one or both parties are aware of the social contact. We annotate part of Automatic Content Extraction (ACE) data, and perform experiments using Support Vector Machines with Kernel methods. We use a combination of structures derived from phrase structure trees and dependency trees. A characteristic of our events (which distinguishes them from ACE events) is that the participating entities can be spread far across the parse trees. We use syntactic and semantic insights to devise a new structure derived from dependency trees and show that this plays a role in achieving the best performing system for both social event detection and classification tasks. We also use three data sampling approaches to solve the problem of data skewness. Sampling methods improve the F1-measure for the task of relation detection by over 20% absolute over the baseline.

6 0.084200226 31 emnlp-2010-Constraints Based Taxonomic Relation Classification

7 0.083916858 21 emnlp-2010-Automatic Discovery of Manner Relations and its Applications

8 0.078988738 59 emnlp-2010-Identifying Functional Relations in Web Text

9 0.074331164 64 emnlp-2010-Incorporating Content Structure into Text Analysis Applications

10 0.073329233 85 emnlp-2010-Negative Training Data Can be Harmful to Text Classification

11 0.073310092 41 emnlp-2010-Efficient Graph-Based Semi-Supervised Learning of Structured Tagging Models

12 0.063443378 104 emnlp-2010-The Necessity of Combining Adaptation Methods

13 0.058016066 119 emnlp-2010-We're Not in Kansas Anymore: Detecting Domain Changes in Streams

14 0.057721898 115 emnlp-2010-Uptraining for Accurate Deterministic Question Parsing

15 0.05161979 87 emnlp-2010-Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space

16 0.048795983 106 emnlp-2010-Top-Down Nearly-Context-Sensitive Parsing

17 0.046585925 107 emnlp-2010-Towards Conversation Entailment: An Empirical Investigation

18 0.04533226 61 emnlp-2010-Improving Gender Classification of Blog Authors

19 0.045185708 120 emnlp-2010-What's with the Attitude? Identifying Sentences with Attitude in Online Discussions

20 0.044573858 77 emnlp-2010-Measuring Distributional Similarity in Context


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.185), (1, 0.124), (2, -0.032), (3, 0.163), (4, 0.032), (5, -0.069), (6, 0.081), (7, 0.099), (8, 0.07), (9, 0.052), (10, 0.016), (11, -0.09), (12, -0.068), (13, -0.098), (14, -0.004), (15, 0.041), (16, 0.049), (17, 0.133), (18, -0.075), (19, 0.079), (20, 0.147), (21, 0.01), (22, 0.051), (23, 0.02), (24, -0.025), (25, -0.235), (26, -0.087), (27, 0.044), (28, -0.106), (29, 0.241), (30, 0.103), (31, -0.062), (32, 0.074), (33, -0.066), (34, 0.133), (35, 0.217), (36, -0.109), (37, 0.049), (38, 0.019), (39, -0.0), (40, -0.031), (41, -0.019), (42, 0.001), (43, 0.066), (44, -0.032), (45, 0.154), (46, -0.051), (47, -0.016), (48, 0.009), (49, 0.107)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96115863 11 emnlp-2010-A Semi-Supervised Approach to Improve Classification of Infrequent Discourse Relations Using Feature Vector Extension

Author: Hugo Hernault ; Danushka Bollegala ; Mitsuru Ishizuka

Abstract: Several recent discourse parsers have employed fully-supervised machine learning approaches. These methods require human annotators to beforehand create an extensive training corpus, which is a time-consuming and costly process. On the other hand, unlabeled data is abundant and cheap to collect. In this paper, we propose a novel semi-supervised method for discourse relation classification based on the analysis of cooccurring features in unlabeled data, which is then taken into account for extending the feature vectors given to a classifier. Our experimental results on the RST Discourse Treebank corpus and Penn Discourse Treebank indicate that the proposed method brings a significant improvement in classification accuracy and macro-average F-score when small training datasets are used. For instance, with training sets of c.a. 1000 labeled instances, the proposed method brings improvements in accuracy and macro-average F-score up to 50% compared to a baseline classifier. We believe that the proposed method is a first step towards detecting low-occurrence relations, which is useful for domains with a lack of annotated data.

2 0.64740193 28 emnlp-2010-Collective Cross-Document Relation Extraction Without Labelled Data

Author: Limin Yao ; Sebastian Riedel ; Andrew McCallum

Abstract: We present a novel approach to relation extraction that integrates information across documents, performs global inference and requires no labelled text. In particular, we tackle relation extraction and entity identification jointly. We use distant supervision to train a factor graph model for relation extraction based on an existing knowledge base (Freebase, derived in parts from Wikipedia). For inference we run an efficient Gibbs sampler that leads to linear time joint inference. We evaluate our approach both for an indomain (Wikipedia) and a more realistic outof-domain (New York Times Corpus) setting. For the in-domain setting, our joint model leads to 4% higher precision than an isolated local approach, but has no advantage over a pipeline. For the out-of-domain data, we benefit strongly from joint modelling, and observe improvements in precision of 13% over the pipeline, and 15% over the isolated baseline.

3 0.58781618 27 emnlp-2010-Clustering-Based Stratified Seed Sampling for Semi-Supervised Relation Classification

Author: Longhua Qian ; Guodong Zhou

Abstract: Seed sampling is critical in semi-supervised learning. This paper proposes a clusteringbased stratified seed sampling approach to semi-supervised learning. First, various clustering algorithms are explored to partition the unlabeled instances into different strata with each stratum represented by a center. Then, diversity-motivated intra-stratum sampling is adopted to choose the center and additional instances from each stratum to form the unlabeled seed set for an oracle to annotate. Finally, the labeled seed set is fed into a bootstrapping procedure as the initial labeled data. We systematically evaluate our stratified bootstrapping approach in the semantic relation classification subtask of the ACE RDC (Relation Detection and Classification) task. In particular, we compare various clustering algorithms on the stratified bootstrapping performance. Experimental results on the ACE RDC 2004 corpus show that our clusteringbased stratified bootstrapping approach achieves the best F1-score of 75.9 on the subtask of semantic relation classification, approaching the one with golden clustering.

4 0.54177815 59 emnlp-2010-Identifying Functional Relations in Web Text

Author: Thomas Lin ; Mausam ; Oren Etzioni

Abstract: Determining whether a textual phrase denotes a functional relation (i.e., a relation that maps each domain element to a unique range element) is useful for numerous NLP tasks such as synonym resolution and contradiction detection. Previous work on this problem has relied on either counting methods or lexico-syntactic patterns. However, determining whether a relation is functional, by analyzing mentions of the relation in a corpus, is challenging due to ambiguity, synonymy, anaphora, and other linguistic phenomena. We present the LEIBNIZ system that overcomes these challenges by exploiting the synergy between the Web corpus and freely-available knowledge resources such as Freebase. It first computes multiple typed-functionality scores, representing functionality of the relation phrase when its arguments are constrained to specific types. It then aggregates these scores to predict the global functionality for the phrase. LEIBNIZ outperforms previous work, increasing area under the precision-recall curve from 0.61 to 0.88. We utilize LEIBNIZ to generate the first public repository of automatically-identified functional relations.

5 0.44720355 21 emnlp-2010-Automatic Discovery of Manner Relations and its Applications

Author: Eduardo Blanco ; Dan Moldovan

Abstract: This paper presents a method for the automatic discovery of MANNER relations from text. An extended definition of MANNER is proposed, including restrictions on the sorts of concepts that can be part of its domain and range. The connections with other relations and the lexico-syntactic patterns that encode MANNER are analyzed. A new feature set specialized on MANNER detection is depicted and justified. Experimental results show improvement over previous attempts to extract MANNER. Combinations of MANNER with other semantic relations are also discussed.

6 0.38068306 31 emnlp-2010-Constraints Based Taxonomic Relation Classification

7 0.36889997 20 emnlp-2010-Automatic Detection and Classification of Social Events

8 0.32871622 41 emnlp-2010-Efficient Graph-Based Semi-Supervised Learning of Structured Tagging Models

9 0.29675195 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning

10 0.29291409 64 emnlp-2010-Incorporating Content Structure into Text Analysis Applications

11 0.28899318 61 emnlp-2010-Improving Gender Classification of Blog Authors

12 0.2823981 119 emnlp-2010-We're Not in Kansas Anymore: Detecting Domain Changes in Streams

13 0.27665147 104 emnlp-2010-The Necessity of Combining Adaptation Methods

14 0.27324769 87 emnlp-2010-Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space

15 0.27257165 83 emnlp-2010-Multi-Level Structured Models for Document-Level Sentiment Classification

16 0.25084707 85 emnlp-2010-Negative Training Data Can be Harmful to Text Classification

17 0.24552518 115 emnlp-2010-Uptraining for Accurate Deterministic Question Parsing

18 0.23928362 109 emnlp-2010-Translingual Document Representations from Discriminative Projections

19 0.23569986 92 emnlp-2010-Predicting the Semantic Compositionality of Prefix Verbs

20 0.23263152 118 emnlp-2010-Utilizing Extra-Sentential Context for Parsing


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.015), (12, 0.038), (29, 0.095), (30, 0.015), (32, 0.024), (52, 0.027), (56, 0.06), (62, 0.014), (66, 0.136), (72, 0.151), (76, 0.02), (82, 0.041), (87, 0.014), (96, 0.222)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.78099287 11 emnlp-2010-A Semi-Supervised Approach to Improve Classification of Infrequent Discourse Relations Using Feature Vector Extension

Author: Hugo Hernault ; Danushka Bollegala ; Mitsuru Ishizuka

Abstract: Several recent discourse parsers have employed fully-supervised machine learning approaches. These methods require human annotators to beforehand create an extensive training corpus, which is a time-consuming and costly process. On the other hand, unlabeled data is abundant and cheap to collect. In this paper, we propose a novel semi-supervised method for discourse relation classification based on the analysis of cooccurring features in unlabeled data, which is then taken into account for extending the feature vectors given to a classifier. Our experimental results on the RST Discourse Treebank corpus and Penn Discourse Treebank indicate that the proposed method brings a significant improvement in classification accuracy and macro-average F-score when small training datasets are used. For instance, with training sets of c.a. 1000 labeled instances, the proposed method brings improvements in accuracy and macro-average F-score up to 50% compared to a baseline classifier. We believe that the proposed method is a first step towards detecting low-occurrence relations, which is useful for domains with a lack of annotated data.

2 0.74796295 98 emnlp-2010-Soft Syntactic Constraints for Hierarchical Phrase-Based Translation Using Latent Syntactic Distributions

Author: Zhongqiang Huang ; Martin Cmejrek ; Bowen Zhou

Abstract: In this paper, we present a novel approach to enhance hierarchical phrase-based machine translation systems with linguistically motivated syntactic features. Rather than directly using treebank categories as in previous studies, we learn a set of linguistically-guided latent syntactic categories automatically from a source-side parsed, word-aligned parallel corpus, based on the hierarchical structure among phrase pairs as well as the syntactic structure of the source side. In our model, each X nonterminal in a SCFG rule is decorated with a real-valued feature vector computed based on its distribution of latent syntactic categories. These feature vectors are utilized at decoding time to measure the similarity between the syntactic analysis of the source side and the syntax of the SCFG rules that are applied to derive translations. Our approach maintains the advantages of hierarchical phrase-based translation systems while at the same time naturally incorporating soft syntactic constraints.

3 0.70513839 117 emnlp-2010-Using Unknown Word Techniques to Learn Known Words

Author: Kostadin Cholakov ; Gertjan van Noord

Abstract: Unknown words are a hindrance to the performance of hand-crafted computational grammars of natural language. However, words with incomplete and incorrect lexical entries pose an even bigger problem because they can be the cause of a parsing failure despite being listed in the lexicon of the grammar. Such lexical entries are hard to detect and even harder to correct. We employ an error miner to pinpoint words with problematic lexical entries. An automated lexical acquisition technique is then used to learn new entries for those words which allows the grammar to parse previously uncovered sentences successfully. We test our method on a large-scale grammar of Dutch and a set of sentences for which this grammar fails to produce a parse. The application of the method enables the grammar to cover 83.76% of those sentences with an accuracy of 86.15%.

4 0.68805528 17 emnlp-2010-An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL

Author: Valentin Zhikov ; Hiroya Takamura ; Manabu Okumura

Abstract: This paper proposes a fast and simple unsupervised word segmentation algorithm that utilizes the local predictability of adjacent character sequences, while searching for a least-effort representation of the data. The model uses branching entropy as a means of constraining the hypothesis space, in order to efficiently obtain a solution that minimizes the length of a two-part MDL code. An evaluation with corpora in Japanese, Thai, English, and the "CHILDES" corpus for research in language development reveals that the algorithm achieves an accuracy comparable to that of the state-of-the-art methods in unsupervised word segmentation, in a significantly reduced computational time.

5 0.68003494 122 emnlp-2010-WikiWars: A New Corpus for Research on Temporal Expressions

Author: Pawel Mazur ; Robert Dale

Abstract: The reliable extraction of knowledge from text requires an appropriate treatment of the time at which reported events take place. Unfortunately, there are very few annotated data sets that support the development of techniques for event time-stamping and tracking the progression of time through a narrative. In this paper, we present a new corpus of temporally-rich documents sourced from English Wikipedia, which we have annotated with TIMEX2 tags. The corpus contains around 120000 tokens, and 2600 TIMEX2 expressions, thus comparing favourably in size to other existing corpora used in these areas. We describe the preparation of the corpus, and compare the profile of the data with other existing temporally annotated corpora. We also report the results obtained when we use DANTE, our temporal expression tagger, to process this corpus, and point to where further work is required. The corpus is publicly available for research purposes.

6 0.63225305 2 emnlp-2010-A Fast Decoder for Joint Word Segmentation and POS-Tagging Using a Single Discriminative Model

7 0.6310668 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors

8 0.63070279 69 emnlp-2010-Joint Training and Decoding Using Virtual Nodes for Cascaded Segmentation and Tagging Tasks

9 0.6306172 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media

10 0.62940061 55 emnlp-2010-Handling Noisy Queries in Cross Language FAQ Retrieval

11 0.62907946 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation

12 0.62618273 43 emnlp-2010-Enhancing Domain Portability of Chinese Segmentation Model Using Chi-Square Statistics and Bootstrapping

13 0.62593418 49 emnlp-2010-Extracting Opinion Targets in a Single and Cross-Domain Setting with Conditional Random Fields

14 0.62275076 27 emnlp-2010-Clustering-Based Stratified Seed Sampling for Semi-Supervised Relation Classification

15 0.62146616 20 emnlp-2010-Automatic Detection and Classification of Social Events

16 0.61978203 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice

17 0.61448002 92 emnlp-2010-Predicting the Semantic Compositionality of Prefix Verbs

18 0.61437845 84 emnlp-2010-NLP on Spoken Documents Without ASR

19 0.6132533 60 emnlp-2010-Improved Fully Unsupervised Parsing with Zoomed Learning

20 0.6113326 123 emnlp-2010-Word-Based Dialect Identification with Georeferenced Rules