emnlp emnlp2010 emnlp2010-27 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Longhua Qian ; Guodong Zhou
Abstract: Seed sampling is critical in semi-supervised learning. This paper proposes a clustering-based stratified seed sampling approach to semi-supervised learning. First, various clustering algorithms are explored to partition the unlabeled instances into different strata with each stratum represented by a center. Then, diversity-motivated intra-stratum sampling is adopted to choose the center and additional instances from each stratum to form the unlabeled seed set for an oracle to annotate. Finally, the labeled seed set is fed into a bootstrapping procedure as the initial labeled data. We systematically evaluate our stratified bootstrapping approach in the semantic relation classification subtask of the ACE RDC (Relation Detection and Classification) task. In particular, we compare various clustering algorithms on the stratified bootstrapping performance. Experimental results on the ACE RDC 2004 corpus show that our clustering-based stratified bootstrapping approach achieves the best F1-score of 75.9 on the subtask of semantic relation classification, approaching the one with golden clustering.
Reference: text
sentIndex sentText sentNum sentScore
1 This paper proposes a clustering-based stratified seed sampling approach to semi-supervised learning. [sent-4, score-1.004]
2 First, various clustering algorithms are explored to partition the unlabeled instances into different strata with each stratum represented by a center. [sent-5, score-0.874]
3 Then, diversity-motivated intra-stratum sampling is adopted to choose the center and additional instances from each stratum to form the unlabeled seed set for an oracle to annotate. [sent-6, score-1.091]
4 Finally, the labeled seed set is fed into a bootstrapping procedure as the initial labeled data. [sent-7, score-0.558]
5 We systematically evaluate our stratified bootstrapping approach in the semantic relation classification subtask of the ACE RDC (Relation Detection and Classification) task. [sent-8, score-0.848]
6 In particular, we compare various clustering algorithms on the stratified bootstrapping performance. [sent-9, score-0.749]
7 Experimental results on the ACE RDC 2004 corpus show that our clustering-based stratified bootstrapping approach achieves the best F1-score of 75.9 on the subtask of semantic relation classification, approaching the one with golden clustering. [sent-10, score-0.525] [sent-11, score-0.416]
9 1 Introduction Semantic relation extraction aims to detect and classify semantic relationships between a pair of named entities occurring in a natural language text. [sent-12, score-0.37]
10 Current work on relation extraction mainly adopts supervised learning methods, since they achieve much better performance. [sent-29, score-0.349]
11 For example, Abney (2002) proposes a bootstrapping algorithm which chooses the unlabeled instances with the highest probability of being correctly labeled and iteratively adds them to the labeled training data. [sent-35, score-0.346]
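To make the flavor of this procedure concrete, here is a minimal self-training sketch in the spirit of Abney's bootstrapping, assuming numpy arrays and a scikit-learn probabilistic classifier; the function name, n_per_iter, and n_iters are illustrative choices, not the paper's implementation.

```python
import numpy as np
from sklearn.svm import SVC

def bootstrap(X_seed, y_seed, X_pool, n_per_iter=100, n_iters=10):
    """Iteratively move the most confidently labeled pool instances into training."""
    X_train, y_train = X_seed.copy(), y_seed.copy()
    pool = X_pool.copy()
    clf = SVC(kernel="linear", probability=True)
    for _ in range(n_iters):
        if len(pool) == 0:
            break
        clf.fit(X_train, y_train)
        proba = clf.predict_proba(pool)
        conf = proba.max(axis=1)                 # confidence of the best label
        top = np.argsort(-conf)[:n_per_iter]     # most confident unlabeled instances
        y_new = clf.classes_[proba[top].argmax(axis=1)]
        X_train = np.vstack([X_train, pool[top]])
        y_train = np.concatenate([y_train, y_new])
        pool = np.delete(pool, top, axis=0)      # remove them from the pool
    return clf.fit(X_train, y_train)
```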
12 Since the performance of bootstrapping depends much on the quality and quantity of the seed set and researchers tend to employ as few seeds as possible (e.g. [sent-37, score-0.638]
13 Therefore, it is critical for a bootstrapping procedure to select an appropriate seed set, which should be representative and diverse. [sent-43, score-0.527]
14 However, most current semi-supervised relation extraction systems (Zhang, 2004; Chen et al. [sent-44, score-0.363]
15 , 2006) use a random seed sampling strategy, which fails to fully exploit the affinity nature of the training data to derive the seed set. [sent-45, score-0.977]
16 (2009) bootstrap a set of weighted support vectors from both labeled and unlabeled data using SVM and feed these instances into semi-supervised relation extraction. [sent-47, score-0.515]
17 However, their seed set is sequentially generated only to ensure that there are at least 5 instances for each relation class. [sent-48, score-0.756]
18 , 2009) attempts to solve this problem via a simple stratified sampling strategy for selecting the seed set. [sent-50, score-1.01]
19 Experimentation on the ACE RDC 2004 corpus shows that the stratified sampling strategy achieves promising results for semi-supervised learning. [sent-51, score-0.679]
20 Nevertheless, the success of the strategy relies on the assumption that the true distribution of all relation types is already known, which is impractical for real NLP applications. [sent-52, score-0.349]
21 This paper presents a clustering-based stratified seed sampling approach for semi-supervised relation extraction, without the assumption on the true distribution of different relation types. [sent-53, score-1.528]
22 The motivations behind our approach are that the unlabeled data can be partitioned into a number of strata using a clustering algorithm and that representative and diverse seeds can be derived from such strata in the framework of stratified sampling (Neyman, 1934) for an oracle to annotate. [sent-54, score-1.529]
23 Particularly, we employ a diversity-motivated intra-stratum sampling scheme to pick a center and additional instances as seeds from each stratum. [sent-55, score-0.652]
24 Experimental results show the effectiveness of the clustering-based stratified seed sampling for semi-supervised relation classification. [sent-56, score-1.279]
25 Then, Section 3 introduces the stratified bootstrapping framework, including an intra-stratum sampling scheme, while Section 4 describes various clustering algorithms. [sent-59, score-0.982]
26 2 Related Work In semi-supervised learning for relation extraction, most previous work constructs the seed set either randomly (Zhang, 2004; Chen et al. [sent-62, score-0.606]
27 (2009) adopt a stratified sampling strategy to select the seed set. [sent-66, score-1.031]
28 However, their method needs a stratification variable such as the known distribution of the relation types, while our method uses clustering to divide relation instances into different strata. [sent-67, score-1.108]
29 In the literature, clustering techniques have been employed in active learning to sample representative seeds to a certain extent (Nguyen and Smeulders, 2004; Tang et al. [sent-68, score-0.531]
30 The cluster centers are used to construct a classifier, which in turn propagates classification decisions to other examples via a local noise model. [sent-72, score-0.384]
31 Unlike their probabilistic models, we apply various clustering algorithms together with intra-stratum sampling to select a seed set in discriminative models like SVMs. [sent-73, score-0.862]
32 (2002) employ a sampling strategy of “most uncertain per cluster” to select representative examples and weight them using their cluster density, while we pick a few seeds (the number of the sampled seeds is proportional to the cluster density) from a cluster in addition to its center. [sent-75, score-1.413]
33 Unlike our sampling strategy of clustering for representativeness and stratified sampling for diversity, they either select cluster centroids or diverse examples from a prechosen set in terms of some combined metrics. [sent-78, score-1.428]
34 To the best of our knowledge, this is the first work to address the issue of seed selection using clustering techniques for semi-supervised learning with discriminative models. [sent-79, score-0.575]
35 3 Stratified Bootstrapping Framework The stratified bootstrapping framework consists of three major components: an underlying supervised learner and a bootstrapping algorithm on top of it as usual, plus a clustering-based stratified seed sampler. [sent-80, score-1.308]
36 3.1 Underlying Supervised Learner Due to recent success in tree kernel-based relation extraction, this paper adopts a tree kernel-based method as the underlying supervised learner. [sent-82, score-0.396]
37 3.3 Clustering-based Stratified Seed Sampler Stratified sampling is a method of sampling in statistics, in which the members of a population are grouped into relatively homogeneous subgroups (i.e., strata). [sent-105, score-0.52]
38 Previous work has justified, both theoretically and practically, that stratified sampling is more appropriate than random sampling for general use (Neyman, 1934) as well as for relation extraction (Qian et al. [sent-109, score-1.231]
39 However, the difficulty lies in how to find the appropriate stratification variable for complicated tasks, such as relation extraction. [sent-111, score-0.395]
40 The idea of clustering-based stratification circumvents this problem by clustering the unlabeled data into a number of strata without the need to explicitly specify a stratification variable. [sent-112, score-0.703]
41 Figure 2 illustrates the clustering-based stratified seed sampling strategy employed in the bootstrapping procedure, where RSET denotes the whole unlabeled data, SeedSET the seed set to be labeled, and |RSETi| the number of instances in the i-th cluster RSETi. [sent-113, score-1.693]
42 Here, a relation instance is represented using USPT and the similarity between two instances is computed using the standard convolution tree kernel. (Footnote: hereafter, when we refer to clusters from the viewpoint of stratified sampling, they are often called “strata”.) [sent-114, score-0.978]
43 (i.e., both the clustering and the classification adopt the same structural representation, since we want the representative seeds in the clustering space to be also representative in the classification space). [sent-118, score-0.886]
44 After clustering, a certain number of instances from every stratum are sampled using an intra-stratum scheme (cf. [sent-119, score-0.428]
45 Furthermore, to ensure that the total number of instances being sampled equals the prescribed NS, the number of seeds from dominant strata may be slightly adjusted accordingly. [sent-125, score-0.56]
46 Finally, these instances form the unlabeled seed set for an oracle to annotate as the input to the underlying supervised learner in the bootstrapping procedure. [sent-126, score-0.662]
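The following sketch ties the framework together under stated assumptions: k_medoids, allocate_seeds, sample_stratum, and bootstrap refer to the illustrative sketches given elsewhere in this section, oracle_annotate stands in for human annotation, D is a precomputed distance matrix (e.g. one minus a normalized kernel similarity), and X holds the instance features. It is a plausible reading of the pipeline, not the authors' code.

```python
import numpy as np

def stratified_seed_bootstrap(D, X, k, ns, oracle_annotate):
    medoids, labels = k_medoids(D, k)               # strata plus their centers
    sizes = [int((labels == j).sum()) for j in range(k)]
    counts = allocate_seeds(sizes, ns)              # N_i proportional to |RSET_i|
    seed_idx = []
    for j in range(k):
        members = np.where(labels == j)[0]
        sub_sim = 1.0 - D[np.ix_(members, members)] # local similarity sub-matrix
        center_local = int(np.where(members == medoids[j])[0][0])
        picked = sample_stratum(sub_sim, center_local, counts[j])
        seed_idx.extend(int(members[p]) for p in picked)
    y_seed = oracle_annotate(seed_idx)              # oracle labels the seed set
    rest = np.setdiff1d(np.arange(len(X)), seed_idx)
    return bootstrap(X[seed_idx], np.asarray(y_seed), X[rest])
```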
47 3.4 Intra-stratum sampling Given the distribution of clusters, a simple way to select the most representative instances is to choose the center of each cluster with the cluster prior as the weight of the center (Tang et al. [sent-128, score-0.972]
48 Nevertheless, for the complicated task of relation extraction on the ACE RDC corpora, which is highly skewed across different relation classes, only considering the center of each cluster would severely under-represent the high-density data. [sent-130, score-0.866]
49 To overcome this problem, we adopt a sampling approach, in particular stratified sampling, which takes the size of each stratum into consideration. [sent-131, score-0.881]
50 Given the size of the seed set NS and the number of strata K, a natural question arises as to how to select the remaining (NS − K) seeds after we have extracted the K centers from the K strata. [sent-132, score-0.805]
51 We view this problem as intra-stratum sampling, which is required to choose the remaining number of seeds from inside each individual stratum (excluding the centers themselves). [sent-133, score-0.537]
52 At first glance, sampling a certain number of seeds from one particular stratum (e.g. [sent-134, score-0.689]
53 This will naturally lead to another application of a clustering algorithm to the stratification of the stratum RSETi. [sent-137, score-0.598]
54 Require: RSET = {R1, R2, …, RN}, the set of unlabeled relation instances, and K, the number of strata being clustered. Output: SeedSET with the size of NS (100). Procedure: initialize SeedSET = NULL; cluster RSET into K strata using a clustering algorithm and perform stratum pruning if necessary. [sent-138, score-1.293]
55 Calculate the number of instances being sampled for each stratum i = 1, 2, …, K as Ni = (|RSETi| / |RSET|) * NS, and adjust this number if necessary. [sent-139, score-0.428]
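A small sketch of this allocation step, assuming the adjustment is a largest-remainder rounding that keeps at least one seed (the center) per stratum and makes the counts sum exactly to NS; the paper only says the numbers "may be slightly adjusted", so this particular scheme is an assumption.

```python
def allocate_seeds(stratum_sizes, ns):
    """Proportional allocation N_i = |RSET_i| / |RSET| * NS, rounded to sum to NS."""
    total = sum(stratum_sizes)
    raw = [size / total * ns for size in stratum_sizes]
    counts = [max(1, int(r)) for r in raw]          # floor, but keep one seed per stratum
    by_remainder = sorted(range(len(raw)), key=lambda i: raw[i] - int(raw[i]), reverse=True)
    i = 0
    while sum(counts) < ns:                         # hand out leftovers by remainder
        counts[by_remainder[i % len(raw)]] += 1
        i += 1
    while sum(counts) > ns:                         # trim dominant strata if we overshot
        counts[counts.index(max(counts))] -= 1
    return counts

# e.g. allocate_seeds([500, 300, 150, 50], ns=100) -> [50, 30, 15, 5]
```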
56 The motivation is that we prefer seeds with high variance with respect to each other, thus avoiding repetitious seeds from a single stratum. [sent-143, score-0.39]
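One simple way to realize this diversity preference is greedy farthest-first (max-min) selection around the center, sketched below over a precomputed similarity matrix for one stratum. The paper itself applies a clustering algorithm again inside each stratum, so this sketch is an illustrative alternative rather than the authors' exact scheme.

```python
import numpy as np

def sample_stratum(sim, center_idx, n_seeds):
    """Pick n_seeds indices: the center, then repeatedly the least similar candidate."""
    chosen = [center_idx]
    candidates = set(range(sim.shape[0])) - {center_idx}
    while len(chosen) < n_seeds and candidates:
        # similarity of each candidate to its closest already-chosen seed
        closest = {c: max(sim[c, s] for s in chosen) for c in candidates}
        nxt = min(closest, key=closest.get)   # farthest from the chosen set
        chosen.append(nxt)
        candidates.remove(nxt)
    return chosen
```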
57 4 Clustering Algorithms This section describes several typical clustering algorithms in the literature, such as K-means, HAC, spectral clustering and affinity propagation, as well as their application in this paper. [sent-148, score-0.648]
58 4.1 K-medoids (KM) As a simple yet effective clustering method, the K-means algorithm assigns each instance to the cluster whose center (also called centroid) is nearest. [sent-150, score-0.488]
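Because relation instances here live in a kernel space rather than a vector space, medoids (actual instances) are a natural substitute for K-means centroids. Below is a minimal PAM-style K-medoids sketch over a precomputed distance matrix; it is a plain re-implementation for illustration, not the paper's code.

```python
import numpy as np

def k_medoids(D, k, n_iters=20, seed=0):
    """Alternate between assigning points to medoids and re-picking medoids."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=k, replace=False)
    for _ in range(n_iters):
        labels = np.argmin(D[:, medoids], axis=1)       # nearest-medoid assignment
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            costs = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[np.argmin(costs)]  # cheapest member becomes medoid
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, np.argmin(D[:, medoids], axis=1)
```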
59 4.2 Hierarchical Agglomerative Clustering (HAC) Different from K-medoids, hierarchical clustering creates a hierarchy of clusters which can be represented in a tree structure called a dendrogram. [sent-158, score-0.396]
60 In this paper, we generate the final flat cluster structures greedily by maximizing the equal distribution of instances among different clusters. [sent-162, score-0.392]
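A sketch of HAC with a flat cut into K clusters over a precomputed distance matrix; note that fcluster's standard "maxclust" criterion is used here for simplicity, whereas the paper cuts the dendrogram greedily to equalize cluster sizes.

```python
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def hac_clusters(D, k):
    """Average-link agglomerative clustering, cut into k flat clusters."""
    condensed = squareform(D, checks=False)        # condensed distance vector
    Z = linkage(condensed, method="average")       # build the dendrogram
    return fcluster(Z, t=k, criterion="maxclust")  # flat labels in 1..k
```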
61 Spectral clustering has become more and more popular; taking as input a similarity matrix between any two instances, it makes use of the spectrum of the similarity matrix of the data to perform dimensionality reduction for clustering in fewer dimensions. [sent-165, score-0.616]
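Since the tree kernel already yields a similarity matrix, spectral clustering can be driven directly by it; a minimal sketch with scikit-learn's precomputed-affinity mode (the specific spectral variant used in the paper is not reproduced here):

```python
from sklearn.cluster import SpectralClustering

def spectral_clusters(S, k, seed=0):
    """Cluster instances from a precomputed similarity (affinity) matrix S."""
    sc = SpectralClustering(n_clusters=k, affinity="precomputed", random_state=seed)
    return sc.fit_predict(S)   # one cluster label per instance
```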
62 For our application, affinity propagation takes as input a similarity matrix, whose elements represent either the similarity between two different instances or the preference (a real number p) for an instance when two instances are the same. [sent-173, score-0.481]
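A corresponding sketch with scikit-learn's AffinityPropagation in precomputed mode; as the text notes, the number of clusters emerges from the preference p rather than being fixed in advance, and the exemplar indices double as cluster centers.

```python
from sklearn.cluster import AffinityPropagation

def ap_clusters(S, p):
    """Run affinity propagation on a precomputed similarity matrix S with preference p."""
    ap = AffinityPropagation(affinity="precomputed", preference=p, random_state=0)
    labels = ap.fit_predict(S)
    return labels, ap.cluster_centers_indices_   # exemplars serve as centers
```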
63 5 Experimentation This section systematically evaluates the bootstrapping approach using clustering-based stratified seed sampling on the relation classification (i.e., given the relationship already detected) subtask of relation extraction on the ACE RDC 2004 corpus. [sent-175, score-1.122] [sent-177, score-0.351]
65 It contains 451 documents and 5702 positive relation instances of 7 relation types and 23 subtypes between 7 entity types. [sent-180, score-0.748]
66 The corpus is parsed using Charniak’s parser (Charniak, 2001) and relation instances are generated by extracting all pairs of entity mentions occurring in the same sentence with positive relationships. [sent-188, score-0.425]
67 For easy comparison with related work, we only evaluate the relation classification task on the 7 major relation types of the ACE RDC 2004 corpus. [sent-189, score-0.609]
68 For each relation type, P is the ratio of true relation instances among all relation instances identified, R is the ratio of true relation instances identified among all true relation instances in the corpus, and F1 is the harmonic mean of P and R. [sent-194, score-1.975]
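A tiny sketch of the per-type scoring just described, with illustrative count arguments:

```python
def prf1(n_correct, n_identified, n_true):
    """P, R, and their harmonic mean F1 for one relation type."""
    p = n_correct / n_identified if n_identified else 0.0
    r = n_correct / n_true if n_true else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

# e.g. prf1(60, 80, 90) -> (0.75, 0.667, 0.706)
```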
69 Here, the size of the seed set L is set to 100, and the top 100 instances with the highest confidence (cf. [sent-198, score-0.481]
70 Since for these strategies the seed sets sampled from different trials may be quite different, their performance scores vary to a great degree accordingly. [sent-202, score-0.441]
71 Besides, two additional baseline sampling strategies are included for comparison: sequential sampling (SEQ), which selects L sequentially-occurring instances as the seed set, and random sampling (RAND), which randomly selects L instances as the seed set. [sent-204, score-1.786]
72 This is due to the fact that the seed set via RAND may better reflect the distribution of the whole training data than that via SEQ, albeit at the expense of collecting the whole training data in advance. [sent-207, score-0.384]
73 Furthermore, all the four clustering-based seed sampling strategies achieve much smaller performance improvement in F1-score than RAND, among which KM performs worst with performance improvement of only 0. [sent-209, score-0.635]
74 (Table caption: sampling strategies without intra-stratum sampling on the development data.) 3) All the performance improvements from bootstrapping largely come from the improvements in precision. [sent-211, score-0.372]
75 Besides, the final flat cluster structures given a specified number of clusters are generated greedily from the cluster hierarchy by maximizing the equal distribution of instances among different clusters. [sent-223, score-0.696]
76 In other words, when the cluster number reaches a certain threshold, the dense area will get more seeds represented in the seed set. [sent-224, score-0.744]
77 These observations also justify the application of the stratified seed sampling to the bootstrapping procedure, which enforces the number of seeds sampled from a cluster to be proportional to its density, presumably approximated by its size in this paper. [sent-226, score-1.527]
78 For reference, we also list the F-score for golden clustering (GOLD), in which all instances are grouped in terms of their annotated ground-truth relation major types (7), major types considering relation direction (13), subtypes (23), and subtypes considering direction (38). [sent-228, score-1.166]
79 Besides, the performance of clustering-based semi-supervised relation classification is also measured over other typical cluster numbers (i.e. [sent-229, score-0.594]
80 Particularly, when the cluster number equals 1, it means that only diversity, rather than representativeness, is considered in the seed sampling. [sent-232, score-0.614]
81 Among these clustering algorithms, one of the distinct characteristics of the AP algorithm is that the number of clusters cannot be specified in advance; rather, it is determined by the pre-defined preference parameter (cf. [sent-233, score-0.362]
82 Table 3 shows that 1) The performance for all the clustering algorithms varies to some degree with the number of clusters. [sent-240, score-0.357]
83 And this could be further explained by our observation that, as the number of clusters increases, the clusters get smaller and denser while their centers also come closer to each other. [sent-243, score-0.391]
84 (Table caption: cluster numbers with intra-stratum sampling on the development data.) 2) Golden clustering achieves the best performance of 73.9 in F1-score when the cluster number is set to 7, significantly higher than the performance using other cluster numbers. [sent-245, score-0.543] [sent-246, score-0.394]
86 This is reasonable since the instances with the same relation type should be much more similar than those with different relation types and it is easy to discriminate the seed set of one relation type from that of other relation types. [sent-248, score-1.581]
87 3) Among the four clustering algorithms, HAC achieves the best performance over most cluster numbers. [sent-249, score-0.441]
88 That is, as a hierarchical clustering algorithm, HAC can sample seeds that better capture the distribution of the training data. [sent-251, score-0.46]
89 Final comparison of different clustering algorithms on the held-out test data After the optimal cluster numbers are determined for each clustering algorithm, we apply these numbers on the held-out test data and report the performance results (P/R/F1 and their respective improvements) in Table 4. [sent-255, score-0.79]
90 (Table 4 caption: sampling strategies on the held-out test data with the optimal cluster number for each clustering algorithm.) Table 4 shows that 1) Among all the clustering algorithms, HAC achieves the best F1-score of 75.9. [sent-257, score-0.989]
91 This further justifies the merits of HAC as a clustering algorithm for stratified seed sampling in semi-supervised relation classification. [sent-262, score-1.503]
92 Since the distribution over relation types doesn’t always conform to that over instance structures, and for a statistical discriminative classifier the latter is often more important than the former, it will be no surprise if HAC outperforms golden clustering in some real applications, e.g. [sent-266, score-0.624]
93 6 Conclusion and Future Work This paper presents a stratified seed sampling strategy based on clustering algorithms for semi-supervised learning. [sent-269, score-1.326]
94 Further, diversity-motivated intra-stratum sampling is employed to sample additional instances from within each stratum besides its center. [sent-272, score-0.685]
95 We compare the effect of various clustering algorithms on the performance of semi-supervised learning and find that HAC achieves the best performance since the distribution of its seed set better approximates that of the whole training data. [sent-273, score-0.623]
96 Extensive evaluation on the ACE RDC 2004 benchmark corpus shows that our clustering-based stratified seed sampling strategy significantly improves the performance of semi-supervised relation classification. [sent-274, score-1.285]
97 For future work, it is possible to adapt our one-level clustering-based sampling to a multi-level one, where each stratum can be further divided into lower sub-strata for further stratified sampling, in order to make the seeds better represent the true distribution of the data. [sent-276, score-1.359]
98 Exploiting constituent dependencies for tree kernel-based semantic relation extraction. [sent-381, score-0.365]
99 Tree kernel-based semantic relation extraction with rich syntactic and semantic information. [sent-470, score-0.387]
100 Label propagation via bootstrapped support vectors for semantic relation extraction between named entities. [sent-480, score-0.414]
wordName wordTfidf (topN-words)
[('stratified', 0.366), ('seed', 0.331), ('relation', 0.275), ('sampling', 0.26), ('hac', 0.253), ('clustering', 0.244), ('stratum', 0.234), ('cluster', 0.197), ('seeds', 0.195), ('rdc', 0.171), ('strata', 0.171), ('instances', 0.15), ('qian', 0.144), ('stratification', 0.12), ('bootstrapping', 0.112), ('centers', 0.108), ('zhou', 0.103), ('rand', 0.096), ('seq', 0.093), ('ace', 0.089), ('ap', 0.088), ('clusters', 0.086), ('golden', 0.084), ('spectral', 0.078), ('rseti', 0.078), ('seedset', 0.062), ('zhang', 0.06), ('affinity', 0.055), ('tang', 0.053), ('strategy', 0.053), ('representative', 0.053), ('km', 0.048), ('subtypes', 0.048), ('representativeness', 0.048), ('unlabeled', 0.048), ('clusteringbased', 0.047), ('smeulders', 0.047), ('center', 0.047), ('semisupervised', 0.045), ('tree', 0.045), ('sampled', 0.044), ('propagation', 0.044), ('strategies', 0.044), ('extraction', 0.043), ('labeled', 0.042), ('nguyen', 0.042), ('classifier', 0.041), ('besides', 0.041), ('rset', 0.04), ('active', 0.039), ('numbers', 0.039), ('classification', 0.038), ('diversity', 0.038), ('kernel', 0.035), ('subtask', 0.033), ('nevertheless', 0.032), ('preference', 0.032), ('density', 0.031), ('frey', 0.031), ('hasegawa', 0.031), ('neyman', 0.031), ('shizi', 0.031), ('uspt', 0.031), ('convolution', 0.031), ('adopts', 0.031), ('chen', 0.031), ('procedure', 0.031), ('severely', 0.029), ('ns', 0.029), ('named', 0.028), ('algorithms', 0.027), ('suda', 0.027), ('suzhou', 0.027), ('justifies', 0.027), ('von', 0.027), ('similarity', 0.025), ('sc', 0.024), ('files', 0.024), ('su', 0.024), ('semantic', 0.024), ('soochow', 0.024), ('greedily', 0.024), ('margin', 0.024), ('divide', 0.023), ('doesn', 0.023), ('relations', 0.023), ('zelenko', 0.022), ('agichtein', 0.022), ('trials', 0.022), ('proportional', 0.022), ('ji', 0.022), ('major', 0.021), ('distribution', 0.021), ('adopt', 0.021), ('package', 0.021), ('oracle', 0.021), ('hierarchy', 0.021), ('messages', 0.021), ('kernelbased', 0.021), ('dense', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 27 emnlp-2010-Clustering-Based Stratified Seed Sampling for Semi-Supervised Relation Classification
Author: Longhua Qian ; Guodong Zhou
Abstract: Seed sampling is critical in semi-supervised learning. This paper proposes a clustering-based stratified seed sampling approach to semi-supervised learning. First, various clustering algorithms are explored to partition the unlabeled instances into different strata with each stratum represented by a center. Then, diversity-motivated intra-stratum sampling is adopted to choose the center and additional instances from each stratum to form the unlabeled seed set for an oracle to annotate. Finally, the labeled seed set is fed into a bootstrapping procedure as the initial labeled data. We systematically evaluate our stratified bootstrapping approach in the semantic relation classification subtask of the ACE RDC (Relation Detection and Classification) task. In particular, we compare various clustering algorithms on the stratified bootstrapping performance. Experimental results on the ACE RDC 2004 corpus show that our clustering-based stratified bootstrapping approach achieves the best F1-score of 75.9 on the subtask of semantic relation classification, approaching the one with golden clustering.
2 0.13989292 20 emnlp-2010-Automatic Detection and Classification of Social Events
Author: Apoorv Agarwal ; Owen Rambow
Abstract: In this paper we introduce the new task of social event extraction from text. We distinguish two broad types of social events depending on whether only one or both parties are aware of the social contact. We annotate part of Automatic Content Extraction (ACE) data, and perform experiments using Support Vector Machines with Kernel methods. We use a combination of structures derived from phrase structure trees and dependency trees. A characteristic of our events (which distinguishes them from ACE events) is that the participating entities can be spread far across the parse trees. We use syntactic and semantic insights to devise a new structure derived from dependency trees and show that this plays a role in achieving the best performing system for both social event detection and classification tasks. We also use three data sampling approaches to solve the problem of data skewness. Sampling methods improve the F1-measure for the task of relation detection by over 20% absolute over the baseline.
Author: Hugo Hernault ; Danushka Bollegala ; Mitsuru Ishizuka
Abstract: Several recent discourse parsers have employed fully-supervised machine learning approaches. These methods require human annotators to beforehand create an extensive training corpus, which is a time-consuming and costly process. On the other hand, unlabeled data is abundant and cheap to collect. In this paper, we propose a novel semi-supervised method for discourse relation classification based on the analysis of cooccurring features in unlabeled data, which is then taken into account for extending the feature vectors given to a classifier. Our experimental results on the RST Discourse Treebank corpus and Penn Discourse Treebank indicate that the proposed method brings a significant improvement in classification accuracy and macro-average F-score when small training datasets are used. For instance, with training sets of c.a. 1000 labeled instances, the proposed method brings improvements in accuracy and macro-average F-score up to 50% compared to a baseline classifier. We believe that the proposed method is a first step towards detecting low-occurrence relations, which is useful for domains with a lack of annotated data.
4 0.12192332 112 emnlp-2010-Unsupervised Discovery of Negative Categories in Lexicon Bootstrapping
Author: Tara McIntosh
Abstract: Multi-category bootstrapping algorithms were developed to reduce semantic drift. By extracting multiple semantic lexicons simultaneously, a category’s search space may be restricted. The best results have been achieved through reliance on manually crafted negative categories. Unfortunately, identifying these categories is non-trivial, and their use shifts the unsupervised bootstrapping paradigm towards a supervised framework. We present NEG-FINDER, the first approach for discovering negative categories automatically. NEG-FINDER exploits unsupervised term clustering to generate multiple negative categories during bootstrapping. Our algorithm effectively removes the necessity of manual intervention and formulation of negative categories, with performance closely approaching that obtained using negative categories defined by a domain expert.
5 0.11273268 28 emnlp-2010-Collective Cross-Document Relation Extraction Without Labelled Data
Author: Limin Yao ; Sebastian Riedel ; Andrew McCallum
Abstract: We present a novel approach to relation extraction that integrates information across documents, performs global inference and requires no labelled text. In particular, we tackle relation extraction and entity identification jointly. We use distant supervision to train a factor graph model for relation extraction based on an existing knowledge base (Freebase, derived in parts from Wikipedia). For inference we run an efficient Gibbs sampler that leads to linear time joint inference. We evaluate our approach both for an indomain (Wikipedia) and a more realistic outof-domain (New York Times Corpus) setting. For the in-domain setting, our joint model leads to 4% higher precision than an isolated local approach, but has no advantage over a pipeline. For the out-of-domain data, we benefit strongly from joint modelling, and observe improvements in precision of 13% over the pipeline, and 15% over the isolated baseline.
6 0.112677 124 emnlp-2010-Word Sense Induction Disambiguation Using Hierarchical Random Graphs
7 0.11008683 7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics
8 0.10639788 84 emnlp-2010-NLP on Spoken Documents Without ASR
9 0.10325823 66 emnlp-2010-Inducing Word Senses to Improve Web Search Result Clustering
10 0.1008527 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation
11 0.084075056 111 emnlp-2010-Two Decades of Unsupervised POS Induction: How Far Have We Come?
12 0.070806533 31 emnlp-2010-Constraints Based Taxonomic Relation Classification
13 0.062950373 8 emnlp-2010-A Multi-Pass Sieve for Coreference Resolution
14 0.062370345 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice
15 0.061823271 71 emnlp-2010-Latent-Descriptor Clustering for Unsupervised POS Induction
16 0.060519055 59 emnlp-2010-Identifying Functional Relations in Web Text
17 0.057435103 79 emnlp-2010-Mining Name Translations from Entity Graph Mapping
18 0.056937382 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning
19 0.054672696 119 emnlp-2010-We're Not in Kansas Anymore: Detecting Domain Changes in Streams
20 0.054568999 104 emnlp-2010-The Necessity of Combining Adaptation Methods
topicId topicWeight
[(0, 0.184), (1, 0.131), (2, -0.072), (3, 0.205), (4, 0.02), (5, -0.103), (6, -0.09), (7, 0.184), (8, 0.099), (9, 0.124), (10, 0.101), (11, -0.16), (12, -0.139), (13, -0.162), (14, -0.034), (15, -0.142), (16, 0.156), (17, 0.134), (18, 0.039), (19, 0.055), (20, 0.037), (21, 0.068), (22, -0.01), (23, -0.015), (24, 0.137), (25, 0.061), (26, 0.042), (27, 0.095), (28, -0.199), (29, -0.033), (30, 0.054), (31, -0.05), (32, 0.047), (33, -0.125), (34, -0.021), (35, 0.177), (36, -0.053), (37, -0.032), (38, -0.112), (39, 0.06), (40, 0.049), (41, 0.091), (42, -0.097), (43, 0.119), (44, 0.071), (45, 0.053), (46, -0.048), (47, -0.002), (48, -0.007), (49, -0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.98103565 27 emnlp-2010-Clustering-Based Stratified Seed Sampling for Semi-Supervised Relation Classification
Author: Longhua Qian ; Guodong Zhou
Abstract: Seed sampling is critical in semi-supervised learning. This paper proposes a clustering-based stratified seed sampling approach to semi-supervised learning. First, various clustering algorithms are explored to partition the unlabeled instances into different strata with each stratum represented by a center. Then, diversity-motivated intra-stratum sampling is adopted to choose the center and additional instances from each stratum to form the unlabeled seed set for an oracle to annotate. Finally, the labeled seed set is fed into a bootstrapping procedure as the initial labeled data. We systematically evaluate our stratified bootstrapping approach in the semantic relation classification subtask of the ACE RDC (Relation Detection and Classification) task. In particular, we compare various clustering algorithms on the stratified bootstrapping performance. Experimental results on the ACE RDC 2004 corpus show that our clustering-based stratified bootstrapping approach achieves the best F1-score of 75.9 on the subtask of semantic relation classification, approaching the one with golden clustering.
Author: Hugo Hernault ; Danushka Bollegala ; Mitsuru Ishizuka
Abstract: Several recent discourse parsers have employed fully-supervised machine learning approaches. These methods require human annotators to beforehand create an extensive training corpus, which is a time-consuming and costly process. On the other hand, unlabeled data is abundant and cheap to collect. In this paper, we propose a novel semi-supervised method for discourse relation classification based on the analysis of cooccurring features in unlabeled data, which is then taken into account for extending the feature vectors given to a classifier. Our experimental results on the RST Discourse Treebank corpus and Penn Discourse Treebank indicate that the proposed method brings a significant improvement in classification accuracy and macro-average F-score when small training datasets are used. For instance, with training sets of c.a. 1000 labeled instances, the proposed method brings improvements in accuracy and macro-average F-score up to 50% compared to a baseline classifier. We believe that the proposed method is a first step towards detecting low-occurrence relations, which is useful for domains with a lack of annotated data.
3 0.49427763 7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics
Author: Joseph Reisinger ; Raymond Mooney
Abstract: We introduce tiered clustering, a mixture model capable of accounting for varying degrees of shared (context-independent) feature structure, and demonstrate its applicability to inferring distributed representations of word meaning. Common tasks in lexical semantics such as word relatedness or selectional preference can benefit from modeling such structure: Polysemous word usage is often governed by some common background metaphoric usage (e.g. the senses of line or run), and likewise modeling the selectional preference of verbs relies on identifying commonalities shared by their typical arguments. Tiered clustering can also be viewed as a form of soft feature selection, where features that do not contribute meaningfully to the clustering can be excluded. We demonstrate the applicability of tiered clustering, highlighting particular cases where modeling shared structure is beneficial and where it can be detrimental.
4 0.48451778 84 emnlp-2010-NLP on Spoken Documents Without ASR
Author: Mark Dredze ; Aren Jansen ; Glen Coppersmith ; Ken Church
Abstract: There is considerable interest in interdisciplinary combinations of automatic speech recognition (ASR), machine learning, natural language processing, text classification and information retrieval. Many of these boxes, especially ASR, are often based on considerable linguistic resources. We would like to be able to process spoken documents with few (if any) resources. Moreover, connecting black boxes in series tends to multiply errors, especially when the key terms are out-of-vocabulary (OOV). The proposed alternative applies text processing directly to the speech without a dependency on ASR. The method finds long (∼1 sec) repetitions in speech, and clusters them into pseudo-terms (roughly phrases). Document clustering and classification work surprisingly well on pseudo-terms; performance on a Switchboard task approaches a baseline using gold standard manual transcriptions.
5 0.47412661 28 emnlp-2010-Collective Cross-Document Relation Extraction Without Labelled Data
Author: Limin Yao ; Sebastian Riedel ; Andrew McCallum
Abstract: We present a novel approach to relation extraction that integrates information across documents, performs global inference and requires no labelled text. In particular, we tackle relation extraction and entity identification jointly. We use distant supervision to train a factor graph model for relation extraction based on an existing knowledge base (Freebase, derived in parts from Wikipedia). For inference we run an efficient Gibbs sampler that leads to linear time joint inference. We evaluate our approach both for an indomain (Wikipedia) and a more realistic outof-domain (New York Times Corpus) setting. For the in-domain setting, our joint model leads to 4% higher precision than an isolated local approach, but has no advantage over a pipeline. For the out-of-domain data, we benefit strongly from joint modelling, and observe improvements in precision of 13% over the pipeline, and 15% over the isolated baseline.
6 0.42929521 20 emnlp-2010-Automatic Detection and Classification of Social Events
7 0.3969979 66 emnlp-2010-Inducing Word Senses to Improve Web Search Result Clustering
8 0.39036158 124 emnlp-2010-Word Sense Induction Disambiguation Using Hierarchical Random Graphs
9 0.35091537 112 emnlp-2010-Unsupervised Discovery of Negative Categories in Lexicon Bootstrapping
10 0.34603631 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation
11 0.3415978 59 emnlp-2010-Identifying Functional Relations in Web Text
12 0.31813109 111 emnlp-2010-Two Decades of Unsupervised POS Induction: How Far Have We Come?
13 0.2891756 31 emnlp-2010-Constraints Based Taxonomic Relation Classification
14 0.26756486 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors
15 0.26269794 71 emnlp-2010-Latent-Descriptor Clustering for Unsupervised POS Induction
16 0.24758567 119 emnlp-2010-We're Not in Kansas Anymore: Detecting Domain Changes in Streams
17 0.24148901 8 emnlp-2010-A Multi-Pass Sieve for Coreference Resolution
18 0.23998167 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice
19 0.22576925 79 emnlp-2010-Mining Name Translations from Entity Graph Mapping
20 0.22190724 21 emnlp-2010-Automatic Discovery of Manner Relations and its Applications
topicId topicWeight
[(2, 0.01), (3, 0.016), (12, 0.025), (29, 0.102), (30, 0.029), (32, 0.02), (52, 0.02), (56, 0.062), (62, 0.015), (66, 0.138), (72, 0.052), (76, 0.019), (82, 0.365), (87, 0.026), (89, 0.018)]
simIndex simValue paperId paperTitle
1 0.85028332 21 emnlp-2010-Automatic Discovery of Manner Relations and its Applications
Author: Eduardo Blanco ; Dan Moldovan
Abstract: This paper presents a method for the automatic discovery of MANNER relations from text. An extended definition of MANNER is proposed, including restrictions on the sorts of concepts that can be part of its domain and range. The connections with other relations and the lexico-syntactic patterns that encode MANNER are analyzed. A new feature set specialized on MANNER detection is depicted and justified. Experimental results show improvement over previous attempts to extract MANNER. Combinations of MANNER with other semantic relations are also discussed.
same-paper 2 0.79560846 27 emnlp-2010-Clustering-Based Stratified Seed Sampling for Semi-Supervised Relation Classification
Author: Longhua Qian ; Guodong Zhou
Abstract: Seed sampling is critical in semi-supervised learning. This paper proposes a clustering-based stratified seed sampling approach to semi-supervised learning. First, various clustering algorithms are explored to partition the unlabeled instances into different strata with each stratum represented by a center. Then, diversity-motivated intra-stratum sampling is adopted to choose the center and additional instances from each stratum to form the unlabeled seed set for an oracle to annotate. Finally, the labeled seed set is fed into a bootstrapping procedure as the initial labeled data. We systematically evaluate our stratified bootstrapping approach in the semantic relation classification subtask of the ACE RDC (Relation Detection and Classification) task. In particular, we compare various clustering algorithms on the stratified bootstrapping performance. Experimental results on the ACE RDC 2004 corpus show that our clustering-based stratified bootstrapping approach achieves the best F1-score of 75.9 on the subtask of semantic relation classification, approaching the one with golden clustering.
3 0.78507853 55 emnlp-2010-Handling Noisy Queries in Cross Language FAQ Retrieval
Author: Danish Contractor ; Govind Kothari ; Tanveer Faruquie ; L V Subramaniam ; Sumit Negi
Abstract: Recent times have seen a tremendous growth in mobile based data services that allow people to use Short Message Service (SMS) to access these data services. In a multilingual society it is essential that data services that were developed for a specific language be made accessible through other local languages also. In this paper, we present a service that allows a user to query a FrequentlyAsked-Questions (FAQ) database built in a local language (Hindi) using Noisy SMS English queries. The inherent noise in the SMS queries, along with the language mismatch makes this a challenging problem. We handle these two problems by formulating the query similarity over FAQ questions as a combinatorial search problem where the search space consists of combinations of dictionary variations of the noisy query and its top-N translations. We demonstrate the effectiveness of our approach on a real-life dataset.
Author: Hugo Hernault ; Danushka Bollegala ; Mitsuru Ishizuka
Abstract: Several recent discourse parsers have employed fully-supervised machine learning approaches. These methods require human annotators to beforehand create an extensive training corpus, which is a time-consuming and costly process. On the other hand, unlabeled data is abundant and cheap to collect. In this paper, we propose a novel semi-supervised method for discourse relation classification based on the analysis of cooccurring features in unlabeled data, which is then taken into account for extending the feature vectors given to a classifier. Our experimental results on the RST Discourse Treebank corpus and Penn Discourse Treebank indicate that the proposed method brings a significant improvement in classification accuracy and macro-average F-score when small training datasets are used. For instance, with training sets of c.a. 1000 labeled instances, the proposed method brings improvements in accuracy and macro-average F-score up to 50% compared to a baseline classifier. We believe that the proposed method is a first step towards detecting low-occurrence relations, which is useful for domains with a lack of annotated data.
5 0.50216883 31 emnlp-2010-Constraints Based Taxonomic Relation Classification
Author: Quang Do ; Dan Roth
Abstract: Determining whether two terms in text have an ancestor relation (e.g. Toyota and car) or a sibling relation (e.g. Toyota and Honda) is an essential component of textual inference in NLP applications such as Question Answering, Summarization, and Recognizing Textual Entailment. Significant work has been done on developing stationary knowledge sources that could potentially support these tasks, but these resources often suffer from low coverage, noise, and are inflexible when needed to support terms that are not identical to those placed in them, making their use as general purpose background knowledge resources difficult. In this paper, rather than building a stationary hierarchical structure of terms and relations, we describe a system that, given two terms, determines the taxonomic relation between them using a machine learning-based approach that makes use of existing resources. Moreover, we develop a global constraint opti- mization inference process and use it to leverage an existing knowledge base also to enforce relational constraints among terms and thus improve the classifier predictions. Our experimental evaluation shows that our approach significantly outperforms other systems built upon existing well-known knowledge sources.
6 0.49370012 20 emnlp-2010-Automatic Detection and Classification of Social Events
7 0.48345366 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation
8 0.48319134 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors
9 0.46215987 69 emnlp-2010-Joint Training and Decoding Using Virtual Nodes for Cascaded Segmentation and Tagging Tasks
10 0.46181908 86 emnlp-2010-Non-Isomorphic Forest Pair Translation
11 0.46043178 67 emnlp-2010-It Depends on the Translation: Unsupervised Dependency Parsing via Word Alignment
12 0.46005854 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice
13 0.45848688 84 emnlp-2010-NLP on Spoken Documents Without ASR
15 0.4574025 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation
16 0.45634162 63 emnlp-2010-Improving Translation via Targeted Paraphrasing
17 0.45598632 109 emnlp-2010-Translingual Document Representations from Discriminative Projections
18 0.45557505 7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics
19 0.45529479 60 emnlp-2010-Improved Fully Unsupervised Parsing with Zoomed Learning
20 0.45466113 49 emnlp-2010-Extracting Opinion Targets in a Single and Cross-Domain Setting with Conditional Random Fields