acl acl2010 acl2010-89 knowledge-graph by maker-knowledge-mining

89 acl-2010-Distributional Similarity vs. PU Learning for Entity Set Expansion


Source: pdf

Author: Xiao-Li Li ; Lei Zhang ; Bing Liu ; See-Kiong Ng

Abstract: Distributional similarity is a classic technique for entity set expansion, where the system is given a set of seed entities of a particular class, and is asked to expand the set using a corpus to obtain more entities of the same class as represented by the seeds. This paper shows that a machine learning model called positive and unlabeled learning (PU learning) can model the set expansion problem better. Based on the test results of 10 corpora, we show that a PU learning technique outperformed distributional similarity significantly.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Distributional similarity is a classic technique for entity set expansion, where the system is given a set of seed entities of a particular class, and is asked to expand the set using a corpus to obtain more entities of the same class as represented by the seeds. [sent-7, score-1.024]

2 This paper shows that a machine learning model called positive and unlabeled learning (PU learning) can model the set expansion problem better. [sent-8, score-0.559]

3 Based on the test results of 10 corpora, we show that a PU learning technique outperformed distributional similarity significantly. [sent-9, score-0.518]

4 1 Introduction The entity set expansion problem is defined as follows: Given a set S of seed entities of a particular class, and a set D of candidate entities (e.g. [sent-10, score-0.969]

5 , extracted from a text corpus), we wish to determine which of the entities in D belong to S. [sent-12, score-0.184]

6 This is clearly a classification problem which requires arriving at a binary decision for each entity in D (belonging to S or not). [sent-14, score-0.241]

7 However, in practice, the problem is often solved as a ranking problem, i.e. [sent-15, score-0.107]

8 , ranking the entities in D based on their likelihoods of belonging to S. [sent-17, score-0.317]

9 The classic method for solving this problem is based on distributional similarity (Pantel et al. [sent-18, score-0.482]

10 The approach works by comparing the similarity of the surrounding word distributions of each candidate entity with the seed entities, and then ranking the candidate entities using their similarity scores. [sent-20, score-1.316]

11 In machine learning, there is a class of semi-supervised learning algorithms that learns from positive and unlabeled examples (PU learning for short). [sent-26, score-0.576]

12 The key characteristic of PU learning is that there is no negative training example available for learning. [sent-27, score-0.094]

13 This class of algorithms is less known to the natural language processing (NLP) community compared to some other semi-supervised learning models and algorithms. [sent-28, score-0.111]

14 2002): Given a set P of positive examples of a particular class and a set U of unlabeled examples (containing hidden positive and negative cases), a classifier is built using P and U for classifying the data in U or future test cases. [sent-31, score-0.746]

15 The results can be either binary decisions (whether each test case belongs to the positive class or not), or a ranking based on how likely each test case belongs to the positive class represented by P. [sent-32, score-0.675]

16 Clearly, the set expansion problem can be mapped into PU learning exactly, with S and D as P and U respectively. [sent-33, score-0.175]

17 2002) outperforms distributional similarity considerably based on the results from 10 corpora. [sent-35, score-0.417]

18 , product and organization names) of the same type or class as the given seeds. [sent-38, score-0.075]

19 There is another approach used in the Web environment for entity set expansion. [sent-45, score-0.192]

20 1 Distributional Similarity Distributional similarity is a classic technique for the entity set expansion problem. [sent-53, score-0.624]

21 As such, a method based on distributional similarity typically fetches the surrounding contexts for each term (i.e. [sent-55, score-0.317]

22 both seeds and candidates) and represents them as vectors by using TF-IDF or PMI (Pointwise Mutual Information) values (Lin, 1998; Gorman and Curran, 2006; Paşca et al. [sent-57, score-0.275]

23 Similarity measures such as Cosine, Jaccard, Dice, etc., can then be employed to compute the similarities between each candidate vector and the seeds centroid vector (one centroid vector for all seeds). [sent-61, score-0.777]
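To make the centroid comparison concrete, here is a minimal sketch, assuming scikit-learn's TfidfVectorizer and cosine similarity; the entity names and context strings are illustrative placeholders, not data from the paper.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical surrounding-word contexts, one string per entity,
# concatenated over all of that entity's corpus occurrences.
seed_contexts = ["dose mg tablet daily", "mg capsule dose pain"]
candidate_contexts = {
    "Aspirin": "tablet mg dose relief",
    "Chicago": "city flight hotel downtown",
}

vec = TfidfVectorizer()
# Fit on all contexts so seeds and candidates share one feature space.
vec.fit(seed_contexts + list(candidate_contexts.values()))

# One centroid vector for all seeds, as described above.
seed_centroid = np.asarray(vec.transform(seed_contexts).mean(axis=0))

scores = {
    name: float(cosine_similarity(seed_centroid,
                                  vec.transform([ctx]).toarray())[0, 0])
    for name, ctx in candidate_contexts.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)  # most similar first
```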

24 Lee (1998) surveyed and discussed various distributional similarity measures. [sent-62, score-0.163]

25 It learns from positive and unlabeled examples as opposed to the model of learning from a small set of labeled examples of every class and a large set of unlabeled examples, which we call LU learning (L and U stand for labeled and unlabeled respectively) (Blum and Mitchell, 1998; Nigam et al. [sent-65, score-1.005]

26 The main idea of S-EM is to use a spy technique to identify some reliable negatives (RN) from the unlabeled set U, and then use an EM algorithm to learn from P, RN and U–RN. [sent-73, score-0.553]

27 The spy technique in S-EM works as follows (Figure 1): First, a small set of positive examples (denoted by SP) from P is randomly sampled (line 2). [sent-74, score-0.43]

28 Then, an NB classifier is built using the set P – SP as positive and the set U ∪ SP as negative (lines 3, 4, and 5). [sent-77, score-0.249]

29 The NB classifier is applied to classify each u ∈ U ∪ SP, i.e. [sent-78, score-0.078]

30 , to assign a probabilistic class label p(+|u) (+ means positive). [sent-80, score-0.102]

31 The probabilistic labels of the spies are then used to decide reliable negatives (RN). [sent-81, score-0.204]

32 In particular, a probability threshold t is determined using the probabilistic labels of spies in SP and the input parameter l (noise level). [sent-82, score-0.088]

33 Since spy examples are from P and are put into U in building the NB classifier, they should behave similarly to the hidden positive cases in U. [sent-88, score-0.365]

34 Assign each example in P – SP the class label +1. [sent-93, score-0.075]

35 Figure 1: Spy technique for extracting reliable negatives (RN) from U. [sent-101, score-0.181]

36 Given the positive set P, the reliable negative set RN and the remaining unlabeled set U–RN, an Expectation-Maximization (EM) algorithm is run. [sent-102, score-0.462]
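As a rough illustration of the spy step, here is a sketch under stated assumptions: scikit-learn's MultinomialNB stands in for the paper's naïve Bayes, P and U are dense occurrence-level term-frequency matrices, the spy fraction is an illustrative default, and the function name is hypothetical; the EM refinement that S-EM runs afterwards is omitted.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def spy_reliable_negatives(P, U, spy_frac=0.15, noise_level=0.05, seed=0):
    """Extract reliable negatives (RN) from U via the spy technique.

    P, U: occurrence-level term-frequency matrices (rows = vectors).
    noise_level: the parameter l; this fraction of spies is allowed
    to fall below the probability threshold t.
    """
    rng = np.random.default_rng(seed)
    sp = np.zeros(len(P), dtype=bool)
    sp[rng.choice(len(P), size=max(1, int(spy_frac * len(P))),
                  replace=False)] = True

    # Build NB with P - SP labeled positive and U ∪ SP labeled negative.
    X = np.vstack([P[~sp], U, P[sp]])
    y = np.r_[np.ones((~sp).sum()), np.zeros(len(U) + sp.sum())]
    clf = MultinomialNB().fit(X, y)

    p_u = clf.predict_proba(U)[:, 1]         # p(+|u) for each u in U
    p_spies = clf.predict_proba(P[sp])[:, 1]

    # Threshold t chosen from the spies' probabilistic labels.
    t = np.quantile(p_spies, noise_level)

    return U[p_u < t], U[p_u >= t]           # RN, and the remaining U - RN
```

S-EM would then run EM with P as positive, RN as negative, and U – RN as unlabeled.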

37 3 Bayesian Sets Bayesian Sets, as its name suggests, is based on Bayesian inference, and was designed specifically for the set expansion problem (Ghahramani and Heller, 2005). [sent-107, score-0.166]

38 , a positive set P) and an unlabeled candidate set U. [sent-110, score-0.522]

39 Although it was not designed as a PU learning method, it has similar characteristics and produces similar results to PU learning. [sent-111, score-0.091]

40 PU learning is a classification model, while Bayesian Sets is a ranking method. [sent-113, score-0.192]

41 In essence, Bayesian Sets learns a score function using P and U to generate a score for each unlabeled case u ∈ U. [sent-116, score-0.32]

42 The function is as follows: score(u) = p(u|P) / p(u) (1), where p(u|P) represents how likely it is that u belongs to the positive class represented by P. [sent-117, score-0.284]

43 The scores can be used to rank the unlabeled candidates in U to reflect how likely each u ∈ U belongs to P. [sent-120, score-0.341]
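For binary feature vectors under an independent Beta-Bernoulli model per feature, the ratio in equation (1) has a closed form (Ghahramani and Heller, 2005). The sketch below computes it in NumPy; setting alpha and beta proportional to the empirical feature means follows a common convention from that paper and is an assumption here, since this summary does not give the hyperparameter settings.

```python
import numpy as np

def bayesian_sets_log_scores(P, U, c=2.0, eps=1e-9):
    """log score(u) = log p(u|P) - log p(u) for binary matrices P, U.

    Closed form under a per-feature Beta-Bernoulli model
    (Ghahramani and Heller, 2005). P: N x J seed (positive) vectors;
    U: M x J candidate vectors. alpha, beta are set from the feature
    means scaled by c (an assumed convention, not from this paper).
    """
    N = P.shape[0]
    m = P.mean(axis=0)
    alpha = c * m + eps
    beta = c * (1.0 - m) + eps

    n_j = P.sum(axis=0)                  # per-feature counts over P
    alpha_t = alpha + n_j                # posterior alpha~_j
    beta_t = beta + N - n_j              # posterior beta~_j

    const = np.sum(np.log(alpha + beta) - np.log(alpha + beta + N)
                   + np.log(beta_t) - np.log(beta))
    q = np.log(alpha_t) - np.log(alpha) - np.log(beta_t) + np.log(beta)
    return const + U @ q                 # one log score per candidate row
```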

44 3 Data Generation for Distributional Similarity, Bayesian Sets and S-EM Preparing the data for distributional similarity is fairly straightforward. [sent-125, score-0.417]

45 Given the seed set S, a seeds centroid vector is produced using the surrounding word contexts (see below) of all occurrences of all the seeds in the corpus (Pantel et al., 2009). [sent-126, score-0.868]

46 In a similar way, a centroid is also produced for each candidate (or unlabeled) entity. [sent-127, score-0.298]

47 Candidate entities: Since we are interested in named entities, we select single words or phrases as candidate entities based on their corresponding part-of-speech (POS) tags. [sent-128, score-0.399]

48 In particular, we choose the following POS tags as entity indicators — NNP (proper noun), NNPS (plural proper noun), and CD (cardinal number). [sent-129, score-0.192]

49 We regard a phrase (which could be a single word) with a sequence of NNP, NNPS and CD POS tags as one candidate entity (CD cannot be the first word unless it starts with a letter), e.g. [sent-130, score-0.366]

50 Context: For each seed or candidate occurrence, the context is its set of surrounding words within a window of size w, i.e. [sent-133, score-0.388]

51 we use w words right before the seed or the candidate and w words right after it. [sent-135, score-0.27]
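A small sketch of both steps, assuming NLTK's pos_tag (which emits the Penn Treebank NNP/NNPS/CD tags named above); the window helper simply slices w tokens on each side of a mention, and both function names are hypothetical.

```python
import re
import nltk  # assumes the 'averaged_perceptron_tagger' model is downloaded

ENTITY_TAGS = {"NNP", "NNPS", "CD"}

def extract_candidates(tokens):
    """Group maximal runs of NNP/NNPS/CD tokens into candidate phrases.

    Per the rule above, a CD token may not start a candidate unless
    it begins with a letter.
    """
    candidates, phrase = [], []
    for word, tag in nltk.pos_tag(tokens):
        ok = tag in ENTITY_TAGS and (tag != "CD" or phrase
                                     or re.match(r"[A-Za-z]", word))
        if ok:
            phrase.append(word)
        else:
            if phrase:
                candidates.append(" ".join(phrase))
            phrase = []
    if phrase:
        candidates.append(" ".join(phrase))
    return candidates

def context_window(tokens, start, end, w=3):
    """The w tokens before tokens[start:end] and the w tokens after it."""
    return tokens[max(0, start - w):start] + tokens[end:end + w]
```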

52 For S-EM and Bayesian Sets, both the positive set P (based on the seeds set S) and the unlabeled candidate set U are generated differently. [sent-137, score-0.733]

53 Positive and unlabeled sets: For each seed si ∈ S, each occurrence in the corpus forms a vector as a positive example in P. [sent-139, score-0.527]

54 The vector is formed based on the surrounding word context (see above) of the seed mention. [sent-140, score-0.207]

55 Similarly, for each candidate d ∈ D (see above; D denotes the set of all candidates), each occurrence also forms a vector as an unlabeled example in U. [sent-141, score-0.453]

56 Thus, each unique seed or candidate entity may produce multiple feature vectors, depending on the number of times that it appears in the corpus. [sent-142, score-0.462]

57 The components of the feature vectors are term frequencies for S-EM, as it uses naïve Bayesian classification as its base classifier. [sent-143, score-0.113]
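One possible realization of this step, assuming scikit-learn's CountVectorizer for the term-frequency features; the helper below is hypothetical, and it keeps the entity name with each candidate occurrence so the per-occurrence scores can later be combined per unique entity during ranking.

```python
from sklearn.feature_extraction.text import CountVectorizer

def build_pu_matrices(seed_contexts, candidate_occurrences):
    """Build occurrence-level term-frequency matrices P and U.

    seed_contexts: one context string per seed occurrence.
    candidate_occurrences: (entity, context string) pairs, one per
    candidate occurrence; names are returned so per-occurrence
    scores can be aggregated per unique entity afterwards.
    """
    vec = CountVectorizer()
    vec.fit(seed_contexts + [ctx for _, ctx in candidate_occurrences])
    P = vec.transform(seed_contexts).toarray()
    U = vec.transform([ctx for _, ctx in candidate_occurrences]).toarray()
    names = [name for name, _ in candidate_occurrences]
    return P, U, names
```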

58 4 Candidate Ranking For distributional similarity, ranking is done using the similarity value between each candidate's centroid and the seeds' centroid (one centroid vector for all seeds). [sent-145, score-0.944]

59 After it ends, S-EM produces a Bayesian classifier C, which is used to classify each vector u ∈ U and to assign a probability p(+|u) to indicate the likelihood that u belongs to the positive class. [sent-148, score-0.39]

60 Recall that for both S-EM and Bayesian Sets, each unique candidate entity may generate multiple feature vectors, depending on the number of times that the candidate entity occurs in the corpus. [sent-150, score-0.732]

61 As such, the rankings produced by S-EM and Bayesian Sets are not the rankings of the entities, but rather the rankings of the entities’ occurrences. [sent-151, score-0.201]

62 Since different vectors representing the same candidate entity can have very different probabilities (for S-EM) or scores (for Bayesian Sets), we need to combine them and compute a single score for each unique candidate entity for ranking. [sent-152, score-0.836]

63 To this end, we also take the entity frequency into consideration. [sent-153, score-0.192]

64 Typically, it is highly desirable to rank those correct and frequent entities at the top because they are more important than the infrequent ones in applications. [sent-154, score-0.22]

65 Let the probabilities (or scores) of a candidate entity d ∈ D be Vd = {v1, v2, …, vn} for the n feature vectors of the candidate. [sent-156, score-0.43]

66 The final score (fs) for d is defined as: fs(d) = Md × log(1 + n) (3), where Md is the median of Vd. The use of the median of Vd can be justified based on the statistical skewness (Neter et al. [sent-158, score-0.107]

67 If the values in Vd are skewed towards the high side (negative skew), it means that the candidate entity is very likely to be a true entity, and we should take the median as it is also high (higher than the mean). [sent-160, score-0.433]

68 However, if the skew is towards the low side (positive skew), it means that the candidate entity is unlikely to be a true entity and we should again use the median as it is low (lower than the mean) under this condition. [sent-161, score-0.695]

69 Note that here n is the frequency count of candidate entity d in the corpus. [sent-162, score-0.366]

70 The idea is to push the frequent candidate entities up by multiplying by the logarithm of the frequency. [sent-164, score-0.358]

71 The final score fs(d) indicates candidate d’s overall likelihood to be a relevant entity. [sent-166, score-0.214]

72 A high fs(d) implies a high likelihood that d is in the expanded entity set. [sent-167, score-0.192]

73 We can then rank all the candidates based on their fs(d) values. [sent-168, score-0.088]
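Equation (3) maps directly onto code. A minimal sketch, assuming entity names and per-occurrence scores arrive as parallel sequences (as in the data-generation sketch above); the function name is hypothetical.

```python
import numpy as np
from collections import defaultdict

def aggregate_scores(names, occurrence_scores):
    """fs(d) = median(Vd) * log(1 + n) per unique candidate entity d.

    Vd collects the n per-occurrence probabilities (S-EM) or
    scores (Bayesian Sets) of entity d.
    """
    vd = defaultdict(list)
    for name, s in zip(names, occurrence_scores):
        vd[name].append(s)
    fs = {d: np.median(v) * np.log(1 + len(v)) for d, v in vd.items()}
    # Final ranking: highest fs(d) first.
    return sorted(fs, key=fs.get, reverse=True), fs
```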

74 For distributional similarity, we tested TF-IDF and PMI as feature values of vectors, and Cosine and Jaccard as similarity measures. [sent-178, score-0.417]

75 The tagged sentences were used to extract candidate entities and their contexts. [sent-187, score-0.358]

76 Table 1 shows the domains and the number of sentences in each corpus, as well as the three seed entities used in our experiments for each corpus. [sent-188, score-0.28]

77 The three seeds for each corpus were randomly selected from a set of common entities in the application domain. [sent-189, score-0.395]

78 Table 1: Descriptions of the 10 corpora. Evaluation measures commonly used in named entity recognition, such as precision and recall, are not suitable for our purpose as we do not have the complete sets of gold standard entities to compare with. [sent-191, score-0.253]

79 We adopt rank precision, which is commonly used for evaluation of entity set expansion techniques (Pantel et al. [sent-192, score-0.367]

80 , 2009): Precision @ N: The percentage of correct entities among the top N entities in the ranked list. [sent-193, score-0.368]
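A small sketch of this metric, where gold is the (hypothetical) set of entities judged correct:

```python
def precision_at_n(ranked, gold, n):
    """Percentage of correct entities among the top n of the ranked list."""
    return sum(1 for e in ranked[:n] if e in gold) / float(n)
```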

81 Again, we can see the same performance pattern as in Table 2 (w = 3): S-EM performs the best, Bayesian-Sets the second, and the two distributional similarity methods the third and the fourth, with Distr-Sim-freq slightly better than Distr-Sim. [sent-206, score-0.417]

82 Table: Average precisions over the 10 corpora for different window sizes (3 seeds).

83 From the tables, we can see that both S-EM and Bayesian Sets performed better than distributional similarity. [sent-220, score-0.254]

84 We believe that the reason is as follows: Distributional similarity does not use any information in the candidate set (or the unlabeled set U). [sent-222, score-0.533]

85 It tries to rank the candidates solely through similarity comparisons with the given seeds (or positive cases). [sent-223, score-0.614]

86 Bayesian Sets' learning method, in contrast, produces a weight vector for features based on their occurrence differences in the positive set P and the unlabeled set U (Ghahramani and Heller, 2005). [sent-225, score-0.495]

87 In this way, Bayesian Sets is able to exploit the useful information in U that was ignored by distributional similarity. [sent-227, score-0.254]

88 S-EM also considers these differences in its NB classification; in addition, it uses the reliable negative set (RN) to help distinguish negative and positive cases, something that neither Bayesian Sets nor distributional similarity does. [sent-228, score-0.741]

89 We believe this balanced attempt by S-EM to distinguish the positive and negative cases is the reason for its better performance. [sent-229, score-0.21]

90 Since Bayesian Sets is a ranking method and S-EM is a classification method, can we say that, even for ranking (our evaluation is based on ranking), classification methods produce better results than ranking methods? [sent-231, score-0.419]

91 But intuitively, classification, which separates positive and negative cases by pulling them in two opposite directions, should perform better than ranking, which only pulls the data in one direction. [sent-233, score-0.317]

92 6 Conclusions and Future Work Although distributional similarity is a classic technique for entity set expansion, this paper showed that PU learning performs considerably better on our diverse corpora. [sent-235, score-0.775]

93 2007; Elkan and Noto, 2008) on this entity set expansion task, as well as other tasks that were tackled using distributional similarity. [sent-239, score-0.585]

94 A study on similarity and relatedness using distributional and WordNet-based approaches. [sent-251, score-0.417]

95 Named entity recognition: a maximum entropy approach using global information. [sent-281, score-0.192]

96 Learning to classify texts using positive and unlabeled data, IJCAI. [sent-363, score-0.387]

97 Text classification from labeled and unlabeled documents using EM. [sent-399, score-0.245]

98 “More like these ”: growing entity classes from seeds. [sent-418, score-0.192]

99 Iterative set expansion of named entities using the web. [sent-425, score-0.364]

100 PEBL: Positive example based learning for Web page classification using SVM. [sent-432, score-0.085]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('pu', 0.355), ('bayesian', 0.258), ('distributional', 0.254), ('seeds', 0.211), ('unlabeled', 0.196), ('entity', 0.192), ('heller', 0.188), ('entities', 0.184), ('spy', 0.176), ('candidate', 0.174), ('sp', 0.169), ('similarity', 0.163), ('rn', 0.155), ('positive', 0.152), ('liu', 0.14), ('expansion', 0.139), ('nb', 0.124), ('centroid', 0.124), ('ghahramani', 0.123), ('ranking', 0.107), ('seed', 0.096), ('spies', 0.088), ('chicago', 0.086), ('fs', 0.081), ('elkan', 0.077), ('noto', 0.077), ('pantel', 0.076), ('class', 0.075), ('skew', 0.07), ('sets', 0.069), ('median', 0.067), ('rankings', 0.067), ('sem', 0.066), ('technique', 0.065), ('classic', 0.065), ('vectors', 0.064), ('lee', 0.064), ('surrounding', 0.063), ('precisions', 0.062), ('negatives', 0.06), ('fusionopolis', 0.059), ('neter', 0.059), ('sarmento', 0.059), ('negative', 0.058), ('belongs', 0.057), ('reliable', 0.056), ('window', 0.055), ('vd', 0.055), ('candidates', 0.052), ('brill', 0.052), ('gorman', 0.051), ('classification', 0.049), ('li', 0.049), ('vector', 0.048), ('pmi', 0.047), ('connexis', 0.047), ('nnps', 0.047), ('em', 0.045), ('lei', 0.044), ('nigam', 0.044), ('blum', 0.044), ('learns', 0.044), ('cd', 0.043), ('infocomm', 0.042), ('etzioni', 0.041), ('yu', 0.041), ('named', 0.041), ('score', 0.04), ('south', 0.04), ('classifier', 0.039), ('web', 0.039), ('classify', 0.039), ('harris', 0.038), ('examples', 0.037), ('nnp', 0.037), ('learning', 0.036), ('rank', 0.036), ('occurrence', 0.035), ('illinois', 0.034), ('agirre', 0.034), ('ve', 0.034), ('jaccard', 0.033), ('cosine', 0.033), ('na', 0.031), ('google', 0.031), ('bing', 0.031), ('mitchell', 0.031), ('il', 0.03), ('image', 0.029), ('md', 0.029), ('kdd', 0.029), ('curran', 0.029), ('popescu', 0.028), ('produces', 0.028), ('designed', 0.027), ('assign', 0.027), ('ng', 0.026), ('belonging', 0.026), ('message', 0.026), ('allyn', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 89 acl-2010-Distributional Similarity vs. PU Learning for Entity Set Expansion

Author: Xiao-Li Li ; Lei Zhang ; Bing Liu ; See-Kiong Ng

Abstract: Distributional similarity is a classic technique for entity set expansion, where the system is given a set of seed entities of a particular class, and is asked to expand the set using a corpus to obtain more entities of the same class as represented by the seeds. This paper shows that a machine learning model called positive and unlabeled learning (PU learning) can model the set expansion problem better. Based on the test results of 10 corpora, we show that a PU learning technique outperformed distributional similarity significantly.

2 0.21691072 27 acl-2010-An Active Learning Approach to Finding Related Terms

Author: David Vickrey ; Oscar Kipersztok ; Daphne Koller

Abstract: We present a novel system that helps nonexperts find sets of similar words. The user begins by specifying one or more seed words. The system then iteratively suggests a series of candidate words, which the user can either accept or reject. Current techniques for this task typically bootstrap a classifier based on a fixed seed set. In contrast, our system involves the user throughout the labeling process, using active learning to intelligently explore the space of similar words. In particular, our system can take advantage of negative examples provided by the user. Our system combines multiple preexisting sources of similarity data (a standard thesaurus, WordNet, contextual similarity), enabling it to capture many types of similarity groups (“synonyms of crash,” “types of car,” etc.). We evaluate on a hand-labeled evaluation set; our system improves over a strong baseline by 36%.

3 0.15238899 181 acl-2010-On Learning Subtypes of the Part-Whole Relation: Do Not Mix Your Seeds

Author: Ashwin Ittoo ; Gosse Bouma

Abstract: An important relation in information extraction is the part-whole relation. Ontological studies mention several types of this relation. In this paper, we show that the traditional practice of initializing minimally-supervised algorithms with a single set that mixes seeds of different types fails to capture the wide variety of part-whole patterns and tuples. The results obtained with mixed seeds ultimately converge to one of the part-whole relation types. We also demonstrate that all the different types of part-whole relations can still be discovered, regardless of the type characterized by the initializing seeds. We performed our experiments with a state-of-the-art information extraction algorithm.

4 0.14869566 129 acl-2010-Growing Related Words from Seed via User Behaviors: A Re-Ranking Based Approach

Author: Yabin Zheng ; Zhiyuan Liu ; Lixing Xie

Abstract: Motivated by Google Sets, we study the problem of growing related words from a single seed word by leveraging user behaviors hiding in user records of Chinese input method. Our proposed method is motivated by the observation that the more frequently two words cooccur in user records, the more related they are. First, we utilize user behaviors to generate candidate words. Then, we utilize search engine to enrich candidate words with adequate semantic features. Finally, we reorder candidate words according to their semantic relatedness to the seed word. Experimental results on a Chinese input method dataset show that our method gains better performance.

5 0.12858374 150 acl-2010-Inducing Domain-Specific Semantic Class Taggers from (Almost) Nothing

Author: Ruihong Huang ; Ellen Riloff

Abstract: This research explores the idea of inducing domain-specific semantic class taggers using only a domain-specific text collection and seed words. The learning process begins by inducing a classifier that only has access to contextual features, forcing it to generalize beyond the seeds. The contextual classifier then labels new instances, to expand and diversify the training set. Next, a cross-category bootstrapping process simultaneously trains a suite of classifiers for multiple semantic classes. The positive instances for one class are used as negative instances for the others in an iterative bootstrapping cycle. We also explore a one-semantic-class-per-discourse heuristic, and use the classifiers to dynam- ically create semantic features. We evaluate our approach by inducing six semantic taggers from a collection of veterinary medicine message board posts.

6 0.12196562 51 acl-2010-Bilingual Sense Similarity for Statistical Machine Translation

7 0.10551431 25 acl-2010-Adapting Self-Training for Semantic Role Labeling

8 0.10124435 132 acl-2010-Hierarchical Joint Learning: Improving Joint Parsing and Named Entity Recognition with Non-Jointly Labeled Data

9 0.098869525 70 acl-2010-Contextualizing Semantic Representations Using Syntactically Enriched Vector Models

10 0.095438108 218 acl-2010-Structural Semantic Relatedness: A Knowledge-Based Method to Named Entity Disambiguation

11 0.094123505 212 acl-2010-Simple Semi-Supervised Training of Part-Of-Speech Taggers

12 0.09062323 28 acl-2010-An Entity-Level Approach to Information Extraction

13 0.087350629 144 acl-2010-Improved Unsupervised POS Induction through Prototype Discovery

14 0.084959351 141 acl-2010-Identifying Text Polarity Using Random Walks

15 0.0823346 3 acl-2010-A Bayesian Method for Robust Estimation of Distributional Similarities

16 0.082205102 160 acl-2010-Learning Arguments and Supertypes of Semantic Relations Using Recursive Patterns

17 0.082068257 258 acl-2010-Weakly Supervised Learning of Presupposition Relations between Verbs

18 0.078135692 38 acl-2010-Automatic Evaluation of Linguistic Quality in Multi-Document Summarization

19 0.077266954 238 acl-2010-Towards Open-Domain Semantic Role Labeling

20 0.076317176 78 acl-2010-Cross-Language Text Classification Using Structural Correspondence Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.223), (1, 0.09), (2, -0.039), (3, 0.024), (4, 0.062), (5, 0.007), (6, 0.051), (7, -0.005), (8, 0.043), (9, 0.02), (10, -0.106), (11, 0.074), (12, 0.017), (13, -0.219), (14, 0.077), (15, 0.081), (16, 0.12), (17, -0.099), (18, -0.074), (19, -0.026), (20, 0.076), (21, 0.103), (22, -0.1), (23, 0.19), (24, -0.081), (25, -0.07), (26, -0.128), (27, 0.057), (28, 0.216), (29, 0.051), (30, 0.005), (31, -0.01), (32, -0.064), (33, -0.022), (34, 0.025), (35, -0.005), (36, 0.053), (37, 0.031), (38, 0.005), (39, 0.009), (40, 0.026), (41, 0.045), (42, -0.116), (43, -0.026), (44, -0.069), (45, -0.07), (46, 0.064), (47, -0.012), (48, 0.036), (49, 0.061)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97880095 89 acl-2010-Distributional Similarity vs. PU Learning for Entity Set Expansion

Author: Xiao-Li Li ; Lei Zhang ; Bing Liu ; See-Kiong Ng

Abstract: Distributional similarity is a classic technique for entity set expansion, where the system is given a set of seed entities of a particular class, and is asked to expand the set using a corpus to obtain more entities of the same class as represented by the seeds. This paper shows that a machine learning model called positive and unlabeled learning (PU learning) can model the set expansion problem better. Based on the test results of 10 corpora, we show that a PU learning technique outperformed distributional similarity significantly.

2 0.76526082 27 acl-2010-An Active Learning Approach to Finding Related Terms

Author: David Vickrey ; Oscar Kipersztok ; Daphne Koller

Abstract: We present a novel system that helps nonexperts find sets of similar words. The user begins by specifying one or more seed words. The system then iteratively suggests a series of candidate words, which the user can either accept or reject. Current techniques for this task typically bootstrap a classifier based on a fixed seed set. In contrast, our system involves the user throughout the labeling process, using active learning to intelligently explore the space of similar words. In particular, our system can take advantage of negative examples provided by the user. Our system combines multiple preexisting sources of similarity data (a standard thesaurus, WordNet, contextual similarity), enabling it to capture many types of similarity groups (“synonyms of crash,” “types of car,” etc.). We evaluate on a hand-labeled evaluation set; our system improves over a strong baseline by 36%.

3 0.64381081 150 acl-2010-Inducing Domain-Specific Semantic Class Taggers from (Almost) Nothing

Author: Ruihong Huang ; Ellen Riloff

Abstract: This research explores the idea of inducing domain-specific semantic class taggers using only a domain-specific text collection and seed words. The learning process begins by inducing a classifier that only has access to contextual features, forcing it to generalize beyond the seeds. The contextual classifier then labels new instances, to expand and diversify the training set. Next, a cross-category bootstrapping process simultaneously trains a suite of classifiers for multiple semantic classes. The positive instances for one class are used as negative instances for the others in an iterative bootstrapping cycle. We also explore a one-semantic-class-per-discourse heuristic, and use the classifiers to dynam- ically create semantic features. We evaluate our approach by inducing six semantic taggers from a collection of veterinary medicine message board posts.

4 0.63716555 129 acl-2010-Growing Related Words from Seed via User Behaviors: A Re-Ranking Based Approach

Author: Yabin Zheng ; Zhiyuan Liu ; Lixing Xie

Abstract: Motivated by Google Sets, we study the problem of growing related words from a single seed word by leveraging user behaviors hiding in user records of Chinese input method. Our proposed method is motivated by the observation that the more frequently two words cooccur in user records, the more related they are. First, we utilize user behaviors to generate candidate words. Then, we utilize search engine to enrich candidate words with adequate semantic features. Finally, we reorder candidate words according to their semantic relatedness to the seed word. Experimental results on a Chinese input method dataset show that our method gains better performance.

5 0.59128219 141 acl-2010-Identifying Text Polarity Using Random Walks

Author: Ahmed Hassan ; Dragomir Radev

Abstract: Automatically identifying the polarity of words is a very important task in Natural Language Processing. It has applications in text classification, text filtering, analysis of product review, analysis of responses to surveys, and mining online discussions. We propose a method for identifying the polarity of words. We apply a Markov random walk model to a large word relatedness graph, producing a polarity estimate for any given word. A key advantage of the model is its ability to accurately and quickly assign a polarity sign and magnitude to any word. The method could be used both in a semi-supervised setting where a training set of labeled words is used, and in an unsupervised setting where a handful of seeds is used to define the two polarity classes. The method is experimentally tested using a manually labeled set of positive and negative words. It outperforms the state of the art methods in the semi-supervised setting. The results in the unsupervised setting are comparable to the best reported values. However, the proposed method is faster and does not need a large corpus.

6 0.57778525 63 acl-2010-Comparable Entity Mining from Comparative Questions

7 0.53065264 183 acl-2010-Online Generation of Locality Sensitive Hash Signatures

8 0.51339281 218 acl-2010-Structural Semantic Relatedness: A Knowledge-Based Method to Named Entity Disambiguation

9 0.50231677 3 acl-2010-A Bayesian Method for Robust Estimation of Distributional Similarities

10 0.4709042 263 acl-2010-Word Representations: A Simple and General Method for Semi-Supervised Learning

11 0.4658891 51 acl-2010-Bilingual Sense Similarity for Statistical Machine Translation

12 0.46256286 28 acl-2010-An Entity-Level Approach to Information Extraction

13 0.46209785 181 acl-2010-On Learning Subtypes of the Part-Whole Relation: Do Not Mix Your Seeds

14 0.4585205 160 acl-2010-Learning Arguments and Supertypes of Semantic Relations Using Recursive Patterns

15 0.45532539 232 acl-2010-The S-Space Package: An Open Source Package for Word Space Models

16 0.45075938 25 acl-2010-Adapting Self-Training for Semantic Role Labeling

17 0.44518384 109 acl-2010-Experiments in Graph-Based Semi-Supervised Learning Methods for Class-Instance Acquisition

18 0.44146511 111 acl-2010-Extracting Sequences from the Web

19 0.4406755 43 acl-2010-Automatically Generating Term Frequency Induced Taxonomies

20 0.42959946 258 acl-2010-Weakly Supervised Learning of Presupposition Relations between Verbs


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(25, 0.467), (33, 0.012), (42, 0.019), (59, 0.078), (73, 0.03), (78, 0.02), (83, 0.09), (84, 0.018), (98, 0.168)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.99361438 224 acl-2010-Talking NPCs in a Virtual Game World

Author: Tina Kluwer ; Peter Adolphs ; Feiyu Xu ; Hans Uszkoreit ; Xiwen Cheng

Abstract: This paper describes the KomParse system, a natural-language dialog system in the three-dimensional virtual world Twinity. In order to fulfill the various communication demands between nonplayer characters (NPCs) and users in such an online virtual world, the system realizes a flexible and hybrid approach combining knowledge-intensive domainspecific question answering, task-specific and domain-specific dialog with robust chatbot-like chitchat.

2 0.97683316 203 acl-2010-Rebanking CCGbank for Improved NP Interpretation

Author: Matthew Honnibal ; James R. Curran ; Johan Bos

Abstract: Once released, treebanks tend to remain unchanged despite any shortcomings in their depth of linguistic analysis or coverage of specific phenomena. Instead, separate resources are created to address such problems. In this paper we show how to improve the quality of a treebank, by integrating resources and implementing improved analyses for specific constructions. We demonstrate this rebanking process by creating an updated version of CCGbank that includes the predicate-argument structure of both verbs and nouns, baseNP brackets, verb-particle constructions, and restrictive and non-restrictive nominal modifiers; and evaluate the impact of these changes on a statistical parser.

3 0.9765842 28 acl-2010-An Entity-Level Approach to Information Extraction

Author: Aria Haghighi ; Dan Klein

Abstract: We present a generative model of template-filling in which coreference resolution and role assignment are jointly determined. Underlying template roles first generate abstract entities, which in turn generate concrete textual mentions. On the standard corporate acquisitions dataset, joint resolution in our entity-level model reduces error over a mention-level discriminative approach by up to 20%.

4 0.93224788 69 acl-2010-Constituency to Dependency Translation with Forests

Author: Haitao Mi ; Qun Liu

Abstract: Tree-to-string systems (and their forestbased extensions) have gained steady popularity thanks to their simplicity and efficiency, but there is a major limitation: they are unable to guarantee the grammaticality of the output, which is explicitly modeled in string-to-tree systems via targetside syntax. We thus propose to combine the advantages of both, and present a novel constituency-to-dependency translation model, which uses constituency forests on the source side to direct the translation, and dependency trees on the target side (as a language model) to ensure grammaticality. Medium-scale experiments show an absolute and statistically significant improvement of +0.7 BLEU points over a state-of-the-art forest-based tree-to-string system even with fewer rules. This is also the first time that a treeto-tree model can surpass tree-to-string counterparts.

same-paper 5 0.89018452 89 acl-2010-Distributional Similarity vs. PU Learning for Entity Set Expansion

Author: Xiao-Li Li ; Lei Zhang ; Bing Liu ; See-Kiong Ng

Abstract: Distributional similarity is a classic technique for entity set expansion, where the system is given a set of seed entities of a particular class, and is asked to expand the set using a corpus to obtain more entities of the same class as represented by the seeds. This paper shows that a machine learning model called positive and unlabeled learning (PU learning) can model the set expansion problem better. Based on the test results of 10 corpora, we show that a PU learning technique outperformed distributional similarity significantly.

6 0.87512535 237 acl-2010-Topic Models for Word Sense Disambiguation and Token-Based Idiom Detection

7 0.85320377 23 acl-2010-Accurate Context-Free Parsing with Combinatory Categorial Grammar

8 0.72040081 71 acl-2010-Convolution Kernel over Packed Parse Forest

9 0.71428502 53 acl-2010-Blocked Inference in Bayesian Tree Substitution Grammars

10 0.71116376 169 acl-2010-Learning to Translate with Source and Target Syntax

11 0.70968777 257 acl-2010-WSD as a Distributed Constraint Optimization Problem

12 0.70894039 75 acl-2010-Correcting Errors in a Treebank Based on Synchronous Tree Substitution Grammar

13 0.70257455 128 acl-2010-Grammar Prototyping and Testing with the LinGO Grammar Matrix Customization System

14 0.69409955 211 acl-2010-Simple, Accurate Parsing with an All-Fragments Grammar

15 0.69055718 261 acl-2010-Wikipedia as Sense Inventory to Improve Diversity in Web Search Results

16 0.68385702 46 acl-2010-Bayesian Synchronous Tree-Substitution Grammar Induction and Its Application to Sentence Compression

17 0.67376947 191 acl-2010-PCFGs, Topic Models, Adaptor Grammars and Learning Topical Collocations and the Structure of Proper Names

18 0.66752338 118 acl-2010-Fine-Grained Tree-to-String Translation Rule Extraction

19 0.65868926 9 acl-2010-A Joint Rule Selection Model for Hierarchical Phrase-Based Translation

20 0.65598714 84 acl-2010-Detecting Errors in Automatically-Parsed Dependency Relations