acl acl2011 acl2011-276 knowledge-graph by maker-knowledge-mining

276 acl-2011-Semi-Supervised SimHash for Efficient Document Similarity Search


Source: pdf

Author: Qixia Jiang ; Maosong Sun

Abstract: Searching documents that are similar to a query document is an important component in modern information retrieval. Some existing hashing methods can be used for efficient document similarity search. However, unsupervised hashing methods cannot incorporate prior knowledge for better hashing. Although some supervised hashing methods can derive effective hash functions from prior knowledge, they are either computationally expensive or poorly discriminative. This paper proposes a novel (semi-)supervised hashing method named Semi-Supervised SimHash (S3H) for high-dimensional data similarity search. The basic idea of S3H is to learn the optimal feature weights from prior knowledge to relocate the data such that similar data have similar hash codes. We evaluate our method with several state-of-the-art methods on two large datasets. All the results show that our method gets the best performance. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Searching documents that are similar to a query document is an important component in modern information retrieval. [sent-5, score-0.135]

2 Some existing hashing methods can be used for efficient document similarity search. [sent-6, score-0.641]

3 However, unsupervised hashing methods cannot incorporate prior knowledge for better hashing. [sent-7, score-0.667]

4 Although some supervised hashing methods can derive effective hash functions from prior knowledge, they are either computationally expensive or poorly discriminative. [sent-8, score-1.202]

5 This paper proposes a novel (semi-)supervised hashing method named Semi-Supervised SimHash (S3H) for high-dimensional data similarity search. [sent-9, score-0.601]

6 The basic idea of S3H is to learn the optimal feature weights from prior knowledge to relocate the data such that similar data have similar hash codes. [sent-10, score-0.683]

7 All the results show that our method gets the best performance. [sent-12, score-0.027]

8 1 Introduction Document Similarity Search (DSS) aims to find similar documents to a query doc in a text corpus or on the web. [sent-13, score-0.094]

9 It is an important component in modern information retrieval since DSS can improve the traditional search engines and user experience (Wan et al. [sent-14, score-0.151]

10 Traditional search engines accept several terms submitted by a user as a query and return a set of docs that are relevant to the query. [sent-17, score-0.224]

11 However, for those users who are not search experts, it is always difficult to accurately specify some query terms to express their information needs. [sent-18, score-0.119]

12 Unlike short-query based search, DSS queries by a full (long) document, which allows users to directly submit a page or a document to the search engines as the description of their information needs. [sent-21, score-0.139]

13 Meanwhile, the explosion of information has brought great challenges to traditional methods. [sent-22, score-0.026]

14 For example, Inverted List (IL), a primary key-term access method, would return a very large set of docs for a query document, which leads to time-consuming post-processing. [sent-23, score-0.126]

15 Hashing methods can perform highly efficient but approximate similarity search, and have gained great success in many applications such as Content-Based Image Retrieval (CBIR) (Ke et al. [sent-25, score-0.087]

16 Hashing methods project high-dimensional objects to compact binary codes called fingerprints and make similar fingerprints for similar objects. [sent-31, score-0.381]

17 The similarity search in the Hamming space1 is much more efficient than in the original attribute space (Manku et al. [sent-32, score-0.139]
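
To make the efficiency claim concrete, here is a minimal sketch (our own illustration, not the paper's code) of comparing two L-bit fingerprints in Hamming space; the fingerprints are assumed to be packed into Python integers, so the distance reduces to an XOR and a popcount, and all names are illustrative.

    # Minimal sketch: Hamming distance between two packed L-bit fingerprints.
    def hamming_distance(fp_a: int, fp_b: int) -> int:
        # Bits that differ are set in the XOR; counting them gives the distance.
        return bin(fp_a ^ fp_b).count("1")

    # Example: two 8-bit fingerprints differing in two positions.
    assert hamming_distance(0b10110100, 0b10010110) == 2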

18 A kernelized variant of SH, called Kernelized Locality Sensitive Hashing (KLSH) (Kulis et al. [sent-39, score-0.135]

19 These methods are unsupervised thus cannot incorporate prior knowledge for better hashing. [sent-41, score-0.154]

20 Footnote 1: Hamming space is a set of binary strings of length L. [sent-42, score-0.027]

21 Motivated by this, some supervised methods are proposed to derive effective hash functions from prior knowledge, i. [sent-45, score-0.596]

22 Regardless of different objectives, both methods derive hash functions via Principal Component Analysis (PCA) (Jolliffe, 1986). [sent-50, score-0.528]

23 However, PCA is computationally expensive, which limits their usage for high-dimensional data. [sent-51, score-0.087]

24 This paper proposes a novel (semi-)supervised hashing method, Semi-Supervised SimHash (S3H), for high-dimensional data similarity search. [sent-52, score-0.601]

25 Unlike SSH that tries to find a sequence of hash functions, S3H fixes the random projection directions and seeks the optimal feature weights from prior knowledge to relocate the objects such that similar objects have similar fingerprints. [sent-53, score-0.999]

26 This is implemented by maximizing the empirical accuracy on the prior knowledge (labeled data) and the entropy of hash functions (estimated over labeled and unlabeled data). [sent-54, score-0.646]

27 The proposed method avoids using PCA which is computationally expensive especially for high-dimensional data, and leads to an efficient Quasi-Newton based solution. [sent-55, score-0.119]

28 To evaluate our method, we compare it with several state-of-the-art hashing methods on two large datasets, i. [sent-56, score-0.513]

29 All experiments show that S3H gets the best search performance. [sent-60, score-0.079]

30 2 Background and Related Works Suppose we are given a set of N documents, X = {xi | xi ∈ RM, i = 1, ..., N}. [sent-65, score-0.127]

31 For a given query doc q, DSS tries to find its nearest neighbors in X, or a subset X′ ⊂ X in which the distance from the documents to the query doc q is less than a given threshold. [sent-66, score-0.238]

32 However, both tasks are computationally infeasible for large-scale data. [sent-67, score-0.057]

33 Thus, one turns to the approximate similarity search problem (Indyk et al. [sent-68, score-0.113]

34 In this section, we briefly review some related approximate similarity search methods. [sent-70, score-0.113]

35 , h(x) = sign(wT x) = +1 if wT x ≥ 0, and −1 otherwise (1), where w ∈ RM is a randomly generated vector. [sent-77, score-0.029]
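
As an illustration of Equation (1), the following sketch (ours, not the authors' implementation) draws L random hyperplanes and produces an L-dimensional fingerprint by taking the sign of each projection; all variable and function names are assumptions.

    import numpy as np

    def sh_fingerprint(x: np.ndarray, W: np.ndarray) -> np.ndarray:
        # W is an L x M matrix of randomly generated hyperplane normals (Eq. 1).
        # Each bit is +1 if the projection is non-negative, -1 otherwise.
        return np.where(W @ x >= 0, 1, -1)

    rng = np.random.default_rng(0)
    M, L = 1000, 64                      # feature dimension, code length
    W = rng.standard_normal((L, M))      # fixed random hyperplanes
    x = rng.standard_normal(M)           # a document vector
    fingerprint = sh_fingerprint(x, W)   # L values in {+1, -1}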

36 SH specifies the distribution on a family of hash functions H = {h} such that for two objects xi and xj, Pr{h(xi) = h(xj)} = 1 − θ(xi, xj)/π (2), where θ(xi, xj) is the angle between xi and xj. [sent-78, score-0.993]
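
The collision probability in Equation (2) can be checked empirically; the sketch below (an illustration we added, not part of the paper) estimates Pr{h(xi) = h(xj)} over many random hyperplanes and compares it with 1 − θ(xi, xj)/π.

    import numpy as np

    rng = np.random.default_rng(1)
    M, trials = 200, 20000
    xi, xj = rng.standard_normal(M), rng.standard_normal(M)

    # Empirical collision rate over random hyperplanes w.
    W = rng.standard_normal((trials, M))
    same_sign = np.mean(np.sign(W @ xi) == np.sign(W @ xj))

    # Analytical value from Equation (2).
    theta = np.arccos(xi @ xj / (np.linalg.norm(xi) * np.linalg.norm(xj)))
    print(same_sign, 1 - theta / np.pi)   # the two numbers should be close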

37 2 Kernelized Locality Sensitive Hashing A kernelized variant of SH, named Kernelized Locality Sensitive Hashing (KLSH) (Kulis et al. [sent-81, score-0.135]

38 To calculate the value of the hashing function h(·), KLSH projects points onto the eigenvectors of the kernel matrix. [sent-84, score-0.621]

39 In short, the complete procedure of KLSH can be summarized as follows: 1) randomly select P (a small value) points from X and form the kernel matrix, 2) for each hash function h(ϕ(x)), calculate its weight ω ∈ RP just as Kernel PCA (Schölkopf et al. [sent-85, score-0.496]

40 , 1997), and 3) the hash function is defined as: h(ϕ(x)) = sign( Σ_{i=1..P} ωi · κ(x, xi) ) (3), where κ(·, ·) can be any kernel function. [sent-86, score-0.468]
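
For Equation (3), the following sketch shows only the final hashing step, with the P sampled points and the KPCA-style weight vector ω assumed to be already computed as in the cited work; the RBF kernel and all names here are our assumptions, not the paper's code.

    import numpy as np

    def rbf_kernel(x: np.ndarray, y: np.ndarray, gamma: float = 0.1) -> float:
        # One possible choice of kappa(., .); any kernel function can be used.
        return float(np.exp(-gamma * np.sum((x - y) ** 2)))

    def klsh_hash(x: np.ndarray, samples: np.ndarray, omega: np.ndarray) -> int:
        # samples: the P randomly selected points; omega: weight from step 2 (Eq. 3).
        value = sum(w_i * rbf_kernel(x, x_i) for w_i, x_i in zip(omega, samples))
        return 1 if value >= 0 else -1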

41 However, KLSH is unsupervised, thus designing a data-specific kernel remains a big challenge. [sent-89, score-0.047]

42 , 2010a) was recently proposed to incorporate prior knowledge for better hashing. [sent-92, score-0.127]

43 Besides X, prior knowledge in the form of similar and dissimilar object-pairs is also required in SSH. [sent-93, score-0.097]

44 SSH tries to find L optimal hash functions which have maximum empirical accuracy on prior knowledge and maximum entropy by finding the top L eigenvectors of an extended covariance matrix2 via PCA or SVD. [sent-94, score-0.808]

45 However, despite the potential problems of numerical stability, SVD requires massive computational space and O(M^3) computational time, where M is the feature dimension, which limits its usage for high-dimensional data (Trefethen et al. [sent-95, score-0.057]

46 Furthermore, the variance of directions obtained by PCA decreases with the decrease of the rank (Jolliffe, 1986). [sent-97, score-0.03]

47 Thus, hash functions derived from lower-ranked directions tend to have smaller entropy and larger empirical errors. [sent-98, score-0.525]

48 LSH performs a random linear projection to map similar objects to similar hash codes. [sent-103, score-0.591]

49 However, LSH suffers from the efficiency problem that it tends to generate long codes (Salakhutdinov et al. [sent-104, score-0.036]

50 , 2009) considers each hash function as a binary partition problem as in SVMs (Burges, 1998). [sent-107, score-0.448]

51 , 2009) maintains similarity between objects in the reduced Hamming space by minimizing the averaged Hamming distance3 between similar neighbors in the original Euclidean space. [sent-109, score-0.208]

52 However, spectral hashing assumes that the data are distributed uniformly, an assumption that is usually violated in real-world applications. [sent-110, score-0.579]

53 3 Semi-Supervised SimHash In this section, we present our hashing method, named Semi-Supervised SimHash (S3H). [sent-111, score-0.513]

54 Given the labeled data XL, we construct two sets: an attraction set Θa and a repulsion set Θr. [sent-126, score-0.115]

55 Specifically, any pair (xi, xj) ∈ Θa, i, j ≤ u, denotes that xi and xj are in the same class, i. [sent-127, score-0.256]

56 Footnote 2: The extended covariance matrix is composed of two components: one is an unsupervised covariance term and the other is a constraint matrix involving labeled information. [sent-130, score-0.294]

57 Footnote 3: Hamming distance is defined as the number of bits that are different between two binary strings. [sent-131, score-0.027]

58 Unlike previous works that attempt to find L optimal hyperplanes, the basic idea of S3H is to fix L random hyperplanes and to find an optimal feature-weight vector to relocate the objects such that similar objects have similar codes. [sent-132, score-0.566]

59 3.1 Data Representation Since L random hyperplanes are fixed, we can represent an object x ∈ X as its relative position to these random hyperplanes, i. [sent-134, score-0.201]

60 , D = Λ · V (4), where the element Vml ∈ {+1, −1, 0} of V indicates that the object x is above, below, or just in the l-th hyperplane with respect to the m-th feature, and Λ = diag(|x1|, |x2|, . [sent-136, score-0.041]

61 , |xM|) is a diagonal matrix which, to some extent, . [sent-139, score-0.054]

62 3.2 Formulation Hashing maps the data set X to an L-dimensional Hamming space for compact representations. [sent-142, score-0.056]

63 If we represent each object as in Equation (4), the l-th hash function is then defined as: hl(x) = h̃l(D) = sign(wT dl) (5), where w ∈ RM is the feature weight to be determined and dl is the l-th column of the matrix D. [sent-143, score-0.572]
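
Putting Equations (4) and (5) together, the sketch below builds D = Λ · V for one object and evaluates the L hash bits hl(x) = sign(wT dl); it assumes, purely as an illustration, that the fixed random hyperplanes have entries in {+1, −1}, and all variable names are ours.

    import numpy as np

    def s3h_fingerprint(x: np.ndarray, R: np.ndarray, w: np.ndarray) -> np.ndarray:
        # R: M x L matrix of fixed random hyperplanes (assumed +/-1 entries here).
        # V_ml in {+1, -1, 0}: position of x w.r.t. the l-th hyperplane, m-th feature.
        V = np.sign(x[:, None] * R)
        D = np.diag(np.abs(x)) @ V            # Equation (4): D = Lambda * V
        return np.where(w @ D >= 0, 1, -1)    # Equation (5), one bit per column d_l

    rng = np.random.default_rng(2)
    M, L = 500, 32
    R = rng.choice([-1.0, 1.0], size=(M, L))  # fixed random hyperplanes
    x = rng.standard_normal(M)                # a document vector
    w = np.ones(M)                            # feature weights to be learned by S3H
    bits = s3h_fingerprint(x, R, w)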

64 Intuitively, the "contribution" of a specific feature to different classes is different. [sent-144, score-0.027]

65 Therefore, we hope to incorporate this side information in S3H for better hashing. [sent-145, score-0.03]

66 , 2009), we can measure this contribution over XL as in Algorithm 1. [sent-147, score-0.031]

67 Clearly, if the objects are represented as the occurrence numbers of features, the output of Algorithm 1 is just the conditional probability Pr(class|feature). [sent-148, score-0.144]
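
A minimal sketch of what such a contribution measure could look like (our reading of the text, not the paper's Algorithm 1): with count-valued features, normalizing per-class feature counts gives Pr(class|feature), which can then scale D into G as in Equation (6); all names and the smoothing constant are assumptions.

    import numpy as np

    def feature_class_contribution(X_counts: np.ndarray, y: np.ndarray, n_classes: int) -> np.ndarray:
        # X_counts: N x M matrix of feature occurrence counts; y: class labels in [0, n_classes).
        # Returns an n_classes x M matrix approximating Pr(class | feature).
        totals = np.zeros((n_classes, X_counts.shape[1]))
        for c in range(n_classes):
            totals[c] = X_counts[y == c].sum(axis=0)
        return totals / np.maximum(totals.sum(axis=0, keepdims=True), 1e-12)

    # For a labeled object (x, c), Equation (6) then scales each row of its D matrix:
    #   G = np.diag(nu[c]) @ D,   where nu = feature_class_contribution(...)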

68 Finally, each object (x, c) ∈ XL can be represented as an M × L matrix G: G = diag(ν1,c, ν2,c, . [sent-149, score-0.041]

69 , νM,c) · D (6) Note that one pair (xi, xj) in Θa or Θr corresponds to (Gi, Gj), or to (Di, Dj) if we ignore features' contribution to different classes. [sent-152, score-0.031]

70 So, we define the following objective for the h̃l(·)s: J(w) = (1/Np) [ Σ_{(xi,xj)∈Θa} Σ_{l=1..L} h̃l(xi) h̃l(xj) − Σ_{(xi,xj)∈Θr} Σ_{l=1..L} h̃l(xi) h̃l(xj) ] + λ1 Σ_{l=1..L} H(h̃l) (7), where Np = |Θa| + |Θr| is the number of attraction and repulsion pairs and λ1 is a tradeoff between the two terms. [sent-154, score-0.059]
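
Assuming the objective has the form shown above, its empirical-accuracy part simply rewards agreeing bits on attraction pairs and penalizes agreeing bits on repulsion pairs; the sketch below is an illustration under that assumption, with made-up names, and evaluates only that part for given fingerprints.

    import numpy as np

    def empirical_accuracy(codes: dict, attraction: list, repulsion: list) -> float:
        # codes: object id -> L-dimensional vector of {+1, -1} hash values.
        # Rewards agreement on attraction pairs, penalizes it on repulsion pairs.
        n_pairs = max(len(attraction) + len(repulsion), 1)
        score = sum(np.dot(codes[i], codes[j]) for i, j in attraction)
        score -= sum(np.dot(codes[i], codes[j]) for i, j in repulsion)
        return score / n_pairs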

71 It has been proven that hash functions with maximum entropy must maximize the variance of the hash values, and vice-versa (Wang et al. [sent-156, score-1.008]

72 Unfortunately, a direct solution of the above problem is non-trivial since Equation (7) is not differentiable. [sent-161, score-0.029]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('hashing', 0.513), ('hash', 0.421), ('simhash', 0.299), ('klsh', 0.199), ('xl', 0.15), ('pca', 0.146), ('kernelized', 0.135), ('hyperplanes', 0.133), ('xj', 0.129), ('xi', 0.127), ('sh', 0.119), ('objects', 0.115), ('ssh', 0.108), ('hamming', 0.101), ('kulis', 0.1), ('relocate', 0.1), ('gi', 0.096), ('dss', 0.088), ('charikar', 0.088), ('fingerprints', 0.088), ('lsh', 0.081), ('locality', 0.076), ('functions', 0.074), ('prior', 0.068), ('query', 0.067), ('diag', 0.066), ('jolliffe', 0.066), ('manku', 0.066), ('covariance', 0.066), ('spectral', 0.066), ('similarity', 0.061), ('indyk', 0.059), ('repulsion', 0.059), ('docs', 0.059), ('computationally', 0.057), ('separable', 0.054), ('matrix', 0.054), ('search', 0.052), ('weiss', 0.051), ('gj', 0.051), ('tsinghua', 0.051), ('eigenvectors', 0.051), ('kernel', 0.047), ('engines', 0.046), ('ox', 0.046), ('xf', 0.046), ('rm', 0.045), ('sign', 0.045), ('document', 0.041), ('object', 0.041), ('sensitive', 0.04), ('xu', 0.04), ('projections', 0.039), ('optimal', 0.038), ('ke', 0.037), ('codes', 0.036), ('expensive', 0.036), ('derive', 0.033), ('neighbors', 0.032), ('maximize', 0.032), ('contribution', 0.031), ('tries', 0.031), ('variance', 0.03), ('entropy', 0.03), ('limits', 0.03), ('incorporate', 0.03), ('knowledge', 0.029), ('wang', 0.029), ('attraction', 0.029), ('oofn', 0.029), ('ilr', 0.029), ('vin', 0.029), ('awnd', 0.029), ('odp', 0.029), ('dso', 0.029), ('feor', 0.029), ('itfh', 0.029), ('oifth', 0.029), ('txhe', 0.029), ('points', 0.028), ('projection', 0.028), ('random', 0.027), ('labeled', 0.027), ('di', 0.027), ('modern', 0.027), ('bde', 0.027), ('doc', 0.027), ('salakhutdinov', 0.027), ('tropy', 0.027), ('eds', 0.027), ('feature', 0.027), ('binary', 0.027), ('unsupervised', 0.027), ('compact', 0.027), ('gets', 0.027), ('proposes', 0.027), ('efficient', 0.026), ('traditional', 0.026), ('fi', 0.026), ('pr', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 276 acl-2011-Semi-Supervised SimHash for Efficient Document Similarity Search

Author: Qixia Jiang ; Maosong Sun

Abstract: Searching documents that are similar to a query document is an important component in modern information retrieval. Some existing hashing methods can be used for efficient document similarity search. However, unsupervised hashing methods cannot incorporate prior knowledge for better hashing. Although some supervised hashing methods can derive effective hash functions from prior knowledge, they are either computationally expensive or poorly discriminative. This paper proposes a novel (semi-)supervised hashing method named Semi-Supervised SimHash (S3H) for high-dimensional data similarity search. The basic idea of S3H is to learn the optimal feature weights from prior knowledge to relocate the data such that similar data have similar hash codes. We evaluate our method with several state-of-the-art methods on two large datasets. All the results show that our method gets the best performance. 1

2 0.43943584 189 acl-2011-K-means Clustering with Feature Hashing

Author: Hajime Senuma

Abstract: One of the major problems of K-means is that one must use dense vectors for its centroids, and therefore it is infeasible to store such huge vectors in memory when the feature space is high-dimensional. We address this issue by using feature hashing (Weinberger et al., 2009), a dimension-reduction technique, which can reduce the size of dense vectors while retaining sparsity of sparse vectors. Our analysis gives theoretical motivation and justification for applying feature hashing to Kmeans, by showing how much will the objective of K-means be (additively) distorted. Furthermore, to empirically verify our method, we experimented on a document clustering task.

3 0.16688463 135 acl-2011-Faster and Smaller N-Gram Language Models

Author: Adam Pauls ; Dan Klein

Abstract: N-gram language models are a major resource bottleneck in machine translation. In this paper, we present several language model implementations that are both highly compact and fast to query. Our fastest implementation is as fast as the widely used SRILM while requiring only 25% of the storage. Our most compact representation can store all 4 billion n-grams and associated counts for the Google n-gram corpus in 23 bits per n-gram, the most compact lossless representation to date, and even more compact than recent lossy compression techniques. We also discuss techniques for improving query speed during decoding, including a simple but novel language model caching technique that improves the query speed of our language models (and SRILM) by up to 300%.

4 0.10945757 113 acl-2011-Efficient Online Locality Sensitive Hashing via Reservoir Counting

Author: Benjamin Van Durme ; Ashwin Lall

Abstract: We describe a novel mechanism called Reservoir Counting for application in online Locality Sensitive Hashing. This technique allows for significant savings in the streaming setting, allowing for maintaining a larger number of signatures, or an increased level of approximation accuracy at a similar memory footprint.

5 0.092551127 139 acl-2011-From Bilingual Dictionaries to Interlingual Document Representations

Author: Jagadeesh Jagarlamudi ; Hal Daume III ; Raghavendra Udupa

Abstract: Mapping documents into an interlingual representation can help bridge the language barrier of a cross-lingual corpus. Previous approaches use aligned documents as training data to learn an interlingual representation, making them sensitive to the domain of the training data. In this paper, we learn an interlingual representation in an unsupervised manner using only a bilingual dictionary. We first use the bilingual dictionary to find candidate document alignments and then use them to find an interlingual representation. Since the candidate alignments are noisy, we de- velop a robust learning algorithm to learn the interlingual representation. We show that bilingual dictionaries generalize to different domains better: our approach gives better performance than either a word by word translation method or Canonical Correlation Analysis (CCA) trained on a different domain.

6 0.07152722 256 acl-2011-Query Weighting for Ranking Model Adaptation

7 0.065010801 239 acl-2011-P11-5002 k2opt.pdf

8 0.061200462 335 acl-2011-Why Initialization Matters for IBM Model 1: Multiple Optima and Non-Strict Convexity

9 0.054244407 137 acl-2011-Fine-Grained Class Label Markup of Search Queries

10 0.054036647 182 acl-2011-Joint Annotation of Search Queries

11 0.049684819 271 acl-2011-Search in the Lost Sense of "Query": Question Formulation in Web Search Queries and its Temporal Changes

12 0.046203606 238 acl-2011-P11-2093 k2opt.pdf

13 0.045799706 181 acl-2011-Jigs and Lures: Associating Web Queries with Structured Entities

14 0.043909635 169 acl-2011-Improving Question Recommendation by Exploiting Information Need

15 0.04337794 36 acl-2011-An Efficient Indexer for Large N-Gram Corpora

16 0.043310702 258 acl-2011-Ranking Class Labels Using Query Sessions

17 0.041464396 18 acl-2011-A Latent Topic Extracting Method based on Events in a Document and its Application

18 0.039892484 204 acl-2011-Learning Word Vectors for Sentiment Analysis

19 0.038118709 209 acl-2011-Lexically-Triggered Hidden Markov Models for Clinical Document Coding

20 0.037718501 11 acl-2011-A Fast and Accurate Method for Approximate String Search


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.112), (1, 0.032), (2, -0.036), (3, 0.039), (4, -0.045), (5, -0.085), (6, -0.024), (7, -0.042), (8, 0.014), (9, 0.018), (10, 0.067), (11, 0.014), (12, 0.037), (13, 0.051), (14, -0.029), (15, -0.001), (16, -0.073), (17, 0.017), (18, 0.013), (19, -0.049), (20, 0.072), (21, -0.134), (22, 0.124), (23, -0.007), (24, -0.086), (25, -0.124), (26, -0.011), (27, -0.037), (28, 0.106), (29, -0.066), (30, 0.172), (31, 0.129), (32, 0.262), (33, 0.412), (34, -0.056), (35, -0.149), (36, 0.104), (37, -0.182), (38, -0.039), (39, -0.029), (40, -0.116), (41, 0.041), (42, -0.127), (43, 0.075), (44, -0.013), (45, 0.141), (46, 0.07), (47, -0.064), (48, -0.011), (49, -0.11)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95701534 276 acl-2011-Semi-Supervised SimHash for Efficient Document Similarity Search

Author: Qixia Jiang ; Maosong Sun

Abstract: Searching documents that are similar to a query document is an important component in modern information retrieval. Some existing hashing methods can be used for efficient document similarity search. However, unsupervised hashing methods cannot incorporate prior knowledge for better hashing. Although some supervised hashing methods can derive effective hash functions from prior knowledge, they are either computationally expensive or poorly discriminative. This paper proposes a novel (semi-)supervised hashing method named Semi-Supervised SimHash (S3H) for high-dimensional data similarity search. The basic idea of S3H is to learn the optimal feature weights from prior knowledge to relocate the data such that similar data have similar hash codes. We evaluate our method with several state-of-the-art methods on two large datasets. All the results show that our method gets the best performance. 1

2 0.9279995 189 acl-2011-K-means Clustering with Feature Hashing

Author: Hajime Senuma

Abstract: One of the major problems of K-means is that one must use dense vectors for its centroids, and therefore it is infeasible to store such huge vectors in memory when the feature space is high-dimensional. We address this issue by using feature hashing (Weinberger et al., 2009), a dimension-reduction technique, which can reduce the size of dense vectors while retaining sparsity of sparse vectors. Our analysis gives theoretical motivation and justification for applying feature hashing to Kmeans, by showing how much will the objective of K-means be (additively) distorted. Furthermore, to empirically verify our method, we experimented on a document clustering task.

3 0.74684769 113 acl-2011-Efficient Online Locality Sensitive Hashing via Reservoir Counting

Author: Benjamin Van Durme ; Ashwin Lall

Abstract: We describe a novel mechanism called Reservoir Counting for application in online Locality Sensitive Hashing. This technique allows for significant savings in the streaming setting, allowing for maintaining a larger number of signatures, or an increased level of approximation accuracy at a similar memory footprint.

4 0.67617613 135 acl-2011-Faster and Smaller N-Gram Language Models

Author: Adam Pauls ; Dan Klein

Abstract: N-gram language models are a major resource bottleneck in machine translation. In this paper, we present several language model implementations that are both highly compact and fast to query. Our fastest implementation is as fast as the widely used SRILM while requiring only 25% of the storage. Our most compact representation can store all 4 billion n-grams and associated counts for the Google n-gram corpus in 23 bits per n-gram, the most compact lossless representation to date, and even more compact than recent lossy compression techniques. We also discuss techniques for improving query speed during decoding, including a simple but novel language model caching technique that improves the query speed of our language models (and SRILM) by up to 300%.

5 0.38493237 36 acl-2011-An Efficient Indexer for Large N-Gram Corpora

Author: Hakan Ceylan ; Rada Mihalcea

Abstract: We introduce a new publicly available tool that implements efficient indexing and retrieval of large N-gram datasets, such as the Web1T 5-gram corpus. Our tool indexes the entire Web1T dataset with an index size of only 100 MB and performs a retrieval of any N-gram with a single disk access. With an increased index size of 420 MB and duplicate data, it also allows users to issue wild card queries provided that the wild cards in the query are contiguous. Furthermore, we also implement some of the smoothing algorithms that are designed specifically for large datasets and are shown to yield better language models than the traditional ones on the Web1T 5gram corpus (Yuret, 2008). We demonstrate the effectiveness of our tool and the smoothing algorithms on the English Lexical Substi- tution task by a simple implementation that gives considerable improvement over a basic language model.

6 0.28979701 239 acl-2011-P11-5002 k2opt.pdf

7 0.28480494 212 acl-2011-Local Histograms of Character N-grams for Authorship Attribution

8 0.27065605 1 acl-2011-(11-06-spirl)

9 0.25847274 335 acl-2011-Why Initialization Matters for IBM Model 1: Multiple Optima and Non-Strict Convexity

10 0.23381193 278 acl-2011-Semi-supervised condensed nearest neighbor for part-of-speech tagging

11 0.22766529 319 acl-2011-Unsupervised Decomposition of a Document into Authorial Components

12 0.22455814 209 acl-2011-Lexically-Triggered Hidden Markov Models for Clinical Document Coding

13 0.2161274 139 acl-2011-From Bilingual Dictionaries to Interlingual Document Representations

14 0.21339925 231 acl-2011-Nonlinear Evidence Fusion and Propagation for Hyponymy Relation Mining

15 0.20497385 303 acl-2011-Tier-based Strictly Local Constraints for Phonology

16 0.20410106 291 acl-2011-SystemT: A Declarative Information Extraction System

17 0.20032822 11 acl-2011-A Fast and Accurate Method for Approximate String Search

18 0.19019181 233 acl-2011-On-line Language Model Biasing for Statistical Machine Translation

19 0.18205628 342 acl-2011-full-for-print

20 0.18116708 234 acl-2011-Optimal Head-Driven Parsing Complexity for Linear Context-Free Rewriting Systems


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.014), (17, 0.077), (26, 0.029), (37, 0.068), (39, 0.035), (41, 0.119), (44, 0.011), (51, 0.277), (55, 0.016), (59, 0.058), (72, 0.024), (91, 0.033), (96, 0.166)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.78894913 276 acl-2011-Semi-Supervised SimHash for Efficient Document Similarity Search

Author: Qixia Jiang ; Maosong Sun

Abstract: Searching documents that are similar to a query document is an important component in modern information retrieval. Some existing hashing methods can be used for efficient document similarity search. However, unsupervised hashing methods cannot incorporate prior knowledge for better hashing. Although some supervised hashing methods can derive effective hash functions from prior knowledge, they are either computationally expensive or poorly discriminative. This paper proposes a novel (semi-)supervised hashing method named Semi-Supervised SimHash (S3H) for high-dimensional data similarity search. The basic idea of S3H is to learn the optimal feature weights from prior knowledge to relocate the data such that similar data have similar hash codes. We evaluate our method with several state-of-the-art methods on two large datasets. All the results show that our method gets the best performance. 1

2 0.76322049 55 acl-2011-Automatically Predicting Peer-Review Helpfulness

Author: Wenting Xiong ; Diane Litman

Abstract: Identifying peer-review helpfulness is an important task for improving the quality of feedback that students receive from their peers. As a first step towards enhancing existing peerreview systems with new functionality based on helpfulness detection, we examine whether standard product review analysis techniques also apply to our new context of peer reviews. In addition, we investigate the utility of incorporating additional specialized features tailored to peer review. Our preliminary results show that the structural features, review unigrams and meta-data combined are useful in modeling the helpfulness of both peer reviews and product reviews, while peer-review specific auxiliary features can further improve helpfulness prediction.

3 0.73767412 196 acl-2011-Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models

Author: Sameer Singh ; Amarnag Subramanya ; Fernando Pereira ; Andrew McCallum

Abstract: Cross-document coreference, the task of grouping all the mentions of each entity in a document collection, arises in information extraction and automated knowledge base construction. For large collections, it is clearly impractical to consider all possible groupings of mentions into distinct entities. To solve the problem we propose two ideas: (a) a distributed inference technique that uses parallelism to enable large scale processing, and (b) a hierarchical model of coreference that represents uncertainty over multiple granularities of entities to facilitate more effective approximate inference. To evaluate these ideas, we constructed a labeled corpus of 1.5 million disambiguated mentions in Web pages by selecting link anchors referring to Wikipedia entities. We show that the combination of the hierarchical model with distributed inference quickly obtains high accuracy (with error reduction of 38%) on this large dataset, demonstrating the scalability of our approach.

4 0.71148348 141 acl-2011-Gappy Phrasal Alignment By Agreement

Author: Mohit Bansal ; Chris Quirk ; Robert Moore

Abstract: We propose a principled and efficient phraseto-phrase alignment model, useful in machine translation as well as other related natural language processing problems. In a hidden semiMarkov model, word-to-phrase and phraseto-word translations are modeled directly by the system. Agreement between two directional models encourages the selection of parsimonious phrasal alignments, avoiding the overfitting commonly encountered in unsupervised training with multi-word units. Expanding the state space to include “gappy phrases” (such as French ne ? pas) makes the alignment space more symmetric; thus, it allows agreement between discontinuous alignments. The resulting system shows substantial improvements in both alignment quality and translation quality over word-based Hidden Markov Models, while maintaining asymptotically equivalent runtime.

5 0.64147353 65 acl-2011-Can Document Selection Help Semi-supervised Learning? A Case Study On Event Extraction

Author: Shasha Liao ; Ralph Grishman

Abstract: Annotating training data for event extraction is tedious and labor-intensive. Most current event extraction tasks rely on hundreds of annotated documents, but this is often not enough. In this paper, we present a novel self-training strategy, which uses Information Retrieval (IR) to collect a cluster of related documents as the resource for bootstrapping. Also, based on the particular characteristics of this corpus, global inference is applied to provide more confident and informative data selection. We compare this approach to self-training on a normal newswire corpus and show that IR can provide a better corpus for bootstrapping and that global inference can further improve instance selection. We obtain gains of 1.7% in trigger labeling and 2.3% in role labeling through IR and an additional 1.1% in trigger labeling and 1.3% in role labeling by applying global inference. 1

6 0.63739181 244 acl-2011-Peeling Back the Layers: Detecting Event Role Fillers in Secondary Contexts

7 0.63446331 94 acl-2011-Deciphering Foreign Language

8 0.63223171 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering

9 0.63059747 37 acl-2011-An Empirical Evaluation of Data-Driven Paraphrase Generation Techniques

10 0.62893927 11 acl-2011-A Fast and Accurate Method for Approximate String Search

11 0.62823015 40 acl-2011-An Error Analysis of Relation Extraction in Social Media Documents

12 0.62733912 135 acl-2011-Faster and Smaller N-Gram Language Models

13 0.62631476 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing

14 0.62581652 185 acl-2011-Joint Identification and Segmentation of Domain-Specific Dialogue Acts for Conversational Dialogue Systems

15 0.62572491 190 acl-2011-Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations

16 0.62503862 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction

17 0.62363434 15 acl-2011-A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction

18 0.62127435 12 acl-2011-A Generative Entity-Mention Model for Linking Entities with Knowledge Base

19 0.62074769 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment

20 0.62040436 137 acl-2011-Fine-Grained Class Label Markup of Search Queries