acl acl2010 acl2010-183 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Benjamin Van Durme ; Ashwin Lall
Abstract: Motivated by the recent interest in streaming algorithms for processing large text collections, we revisit the work of Ravichandran et al. (2005) on using the Locality Sensitive Hash (LSH) method of Charikar (2002) to enable fast, approximate comparisons of vector cosine similarity. For the common case of feature updates being additive over a data stream, we show that LSH signatures can be maintained online, without additional approximation error, and with lower memory requirements than when using the standard offline technique.
Reference: text
sentIndex sentText sentNum sentScore
1 Online Generation of Locality Sensitive Hash Signatures Benjamin Van Durme HLTCOE Johns Hopkins University Baltimore, MD 21211 USA Abstract Motivated by the recent interest in streaming algorithms for processing large text collections, we revisit the work of Ravichandran et al. [sent-1, score-0.342]
2 (2005) on using the Locality Sensitive Hash (LSH) method of Charikar (2002) to enable fast, approximate comparisons of vector cosine similarity. [sent-2, score-0.28]
3 For the common case of feature updates being additive over a data stream, we show that LSH signatures can be maintained online, without additional approximation error, and with lower memory requirements than when using the standard offline technique. [sent-3, score-0.838]
4 1 Introduction There has been a surge of interest in adapting results from the streaming algorithms community to problems in processing large text collections. [sent-4, score-0.299]
5 The term streaming refers to a model where data is made available sequentially, and it is assumed that resource limitations preclude storing the entirety of the data for offline (batch) processing. [sent-5, score-0.416]
6 Statistics of interest are approximated via online, randomized algorithms. [sent-6, score-0.083]
7 Examples of text applications include: collecting approximate counts (Talbot, 2009; Van Durme and Lall, 2009a), finding top-n elements (Goyal et al., 2008), adaptive language modeling (Levenberg and Osborne, 2009), and building top-k ranklists based on pointwise mutual information (Van Durme and Lall, 2009b). [sent-7, score-0.048]
9 Here we revisit the work of Ravichandran et al. (2005) on building word similarity measures from large text collections by using the Locality Sensitive Hash (LSH) method of Charikar (2002). [sent-11, score-0.075]
10 Recently, Petrovic et al. (2010) made use of LSH signatures generated over individual tweets, for the purpose of first story detection. [sent-15, score-0.362]
11 Streaming LSH should allow for the clustering of Twitter authors, based on the tweets they generate, with signatures continually updated over the Twitter stream. [sent-16, score-0.387]
12 2 Locality Sensitive Hashing We are concerned with computing the cosine similarity of feature vectors, defined for a pair of vectors u and v as the dot product normalized by their lengths: cosine-similarity(u, v) = (u · v) / (|u| |v|). [sent-17, score-0.5]
13 This similarity is the cosine of the angle between these high-dimensional vectors and attains a value of one (i.e., cos(0)) when the vectors are parallel and zero (i.e., cos(π/2)) when they are orthogonal. [sent-18, score-0.421]
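For concreteness, a minimal Python sketch of the exact quantity being approximated, assuming sparse count vectors stored as dictionaries (the representation and names are illustrative, not from the paper):

    import math

    def cosine_similarity(u, v):
        # u, v: sparse feature vectors as {feature: count} dictionaries
        dot = sum(u[f] * v[f] for f in u if f in v)
        norm_u = math.sqrt(sum(x * x for x in u.values()))
        norm_v = math.sqrt(sum(x * x for x in v.values()))
        if norm_u == 0.0 or norm_v == 0.0:
            return 0.0
        return dot / (norm_u * norm_v)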
15 Building on the seminal work of Indyk and Motwani (1998) on locality sensitive hashing (LSH), Charikar (2002) presented an LSH that maps high-dimensional vectors to a much smaller dimensional space while still preserving (cosine) similarity between vectors in the original space. [sent-23, score-0.622]
16 The LSH algorithm computes a succinct signature of the feature set of the words in a corpus by computing d independent dot products of each feature vector v with a random unit vector r, i.e., Σi vi ri, and retaining the sign of the d resulting products. [sent-24, score-0.654]
18 Each entry of r is drawn from the distribution N(0, 1), the normal distribution with zero mean and unit variance. [sent-27, score-0.076]
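A sketch of this offline signature computation, using NumPy; the d-by-|V| random matrix R materialized here is exactly what the streaming method later avoids storing (sizes and names are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    d, num_features = 256, 10_000               # illustrative sizes
    R = rng.standard_normal((d, num_features))  # d random vectors, entries drawn from N(0, 1)

    def lsh_signature(feature_vec):
        # d independent dot products; keep only the sign of each one.
        return (R @ feature_vec) > 0

    v = rng.integers(0, 5, num_features).astype(float)  # a toy feature-count vector
    sig = lsh_signature(v)                              # length-d bit signature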
19 Charikar’s algorithm makes use of the fact (proved by Goemans and Williamson (1995) for an unrelated application) that the angle between any two vectors summarized in this fashion is proportional to the expected Hamming distance of their signature vectors. [sent-28, score-0.032]
21 Hence, we can retain length d bit-signatures in the place of high dimensional feature vectors, while preserving the ability to (quickly) approximate cosine similarity in the original space. [sent-31, score-0.433]
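The approximation can then be read off the signatures alone; a sketch assuming boolean bit arrays like the ones produced above:

    import numpy as np

    def approximate_cosine(sig_a, sig_b):
        # The fraction of differing bits estimates theta / pi, where theta is the
        # angle between the original vectors, so cos(pi * hamming / d) recovers
        # an estimate of the original cosine similarity.
        d = len(sig_a)
        hamming = np.count_nonzero(sig_a != sig_b)
        return float(np.cos(np.pi * hamming / d))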
22 Ravichandran et al. (2005) made use of this algorithm to reduce the computation in searching for similar nouns by first computing signatures for each noun and then computing similarity over the signatures rather than the original feature space. [sent-33, score-0.918]
23 3 Streaming Algorithm In this work, we focus on features that can be maintained additively, such as raw frequencies. [sent-34, score-0.082]
24 Our streaming algorithm for this problem makes use of the simple fact that the dot product of the feature vector with random vectors is a linear operation. [sent-35, score-0.698]
25 This permits us to replace the vi · ri operation by vi individual additions of ri, once for each time the feature is encountered in the stream (where vi is the frequency of a feature and ri is the randomly chosen Gaussian-distributed value associated with this feature). [sent-36, score-0.631]
26 The result of the final computation is identical to the dot products computed by the algorithm of Charikar (2002), but the processing can now be done online. [sent-37, score-0.185]
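The linearity claim is easy to check directly; a toy example with illustrative values (r_i and v_i are placeholders, not numbers from the paper):

    r_i = 0.7341            # pooled N(0, 1) value associated with some feature
    v_i = 5                 # the feature occurs five times in the stream
    offline = v_i * r_i     # single multiplication in the batch computation
    online = 0.0
    for _ in range(v_i):    # one addition per occurrence while reading the stream
        online += r_i
    assert abs(offline - online) < 1e-9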
27 A similar technique, for stable random projections, was independently discussed by Li et al. [sent-38, score-0.062]
28 Since each feature may appear multiple times in the stream, we need a consistent way to retrieve the random values drawn from N(0, 1) associated with it. [sent-40, score-0.248]
29 To avoid the expense of computing and storing these values explicitly, as is the norm, we propose the use of a precomputed pool of random values drawn from this distribution that we can then hash into. [sent-41, score-0.629]
30 Hashing into a fixed pool ensures that the same feature will consistently be associated with the same value drawn from N(0, 1). [sent-42, score-0.331]
31 This introduces some weak dependence in the random vectors, but we will give some analysis showing that this should have very limited impact on the cosine similarity computation, which we further support with experimental evidence (see Table 3). [sent-43, score-0.266]
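A sketch of one way to realize the pooling idea: a fixed array of N(0, 1) draws indexed by a hash of the (feature, bit) pair, so that the same feature always retrieves the same value. The paper leaves the hash functions h1 through hd abstract, so the md5-based mapping below is only an assumption:

    import hashlib
    import random

    def build_pool(m, seed):
        # m pre-drawn samples from N(0, 1), shared by all features.
        rng = random.Random(seed)
        return [rng.gauss(0.0, 1.0) for _ in range(m)]

    def pooled_value(pool, feature, j):
        # Deterministic lookup: the same (feature, bit j) pair always maps to
        # the same pooled value, across runs and machines.
        digest = hashlib.md5(f"{feature}|{j}".encode()).hexdigest()
        return pool[int(digest, 16) % len(pool)]

    pool = build_pool(10_000, seed=42)
    x = pooled_value(pool, "prev_word=the", 0)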
32 Our algorithm traverses a stream of words and maintains a running state for each word, as summarized in Algorithm 1. [sent-44, score-0.335]
33 (Note that Ravichandran et al. (2005) used pointwise mutual information features, which are not additive since they require a global statistic to compute.) [sent-45, score-0.163]
34 Algorithm 1 STREAMINGLSH ALGORITHM [sent-46, score-0.409]
Parameters:
  m : size of pool
  d : number of bits (size of resultant signature)
  s : a random seed
  h1, ..., hd : hash functions mapping features to {0, ..., m − 1}
INITIALIZATION:
  1: Initialize floating point array P[0, ..., m − 1]
  2: Initialize H, a hashtable mapping words to floating point arrays of size d
  3: for i := 0 ... m − 1 do
  4:   P[i] := random sample from N(0, 1), using s as seed
ONLINE:
  1: for each word w in the stream do
  2:   for each feature fi associated with w do
  3:     for j := 1 ... d do
  4:       H[w][j] := H[w][j] + P[hj(fi)]
38 In particular, the state maintained for each word is a vector of floating point numbers of length d. [sent-72, score-0.196]
39 Each element of the vector holds the (partial) dot product of the feature vector of the word with a random unit vector. [sent-73, score-0.354]
40 Updating the state for a feature seen in the stream for a given word simply involves incrementing each position in the word’s vector by the random value associated with the feature, accessed by hash functions h1 through hd. [sent-74, score-0.631]
41 At any point in the stream, the vector for each word can be processed (in time O(d)) to create a signature computed by checking the sign of each component of its vector. [sent-75, score-0.232]
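Putting the pieces together, a compact sketch of the online procedure as a whole; the class and method names are ours, and Python's built-in hash is used only for brevity (it is salted per process, so a stable hash like the one sketched earlier would be needed to share a seed across machines):

    import random
    from collections import defaultdict

    class StreamingLSH:
        def __init__(self, d=256, m=10_000, seed=1):
            self.d = d
            rng = random.Random(seed)
            self.pool = [rng.gauss(0.0, 1.0) for _ in range(m)]  # pooled N(0, 1) draws
            self.state = defaultdict(lambda: [0.0] * d)          # per-word partial dot products

        def _pool_value(self, feature, j):
            return self.pool[hash((feature, j)) % len(self.pool)]

        def update(self, word, features):
            vec = self.state[word]
            for f in features:
                for j in range(self.d):
                    vec[j] += self._pool_value(f, j)  # v_i additions of r_i over the stream

        def signature(self, word):
            # Sign of each running sum gives the d-bit signature at any point.
            return [x > 0.0 for x in self.state[word]]

    lsh = StreamingLSH(d=64, m=1_000)
    lsh.update("dog", ["prev=the", "next=barked"])
    bits = lsh.signature("dog")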
42 3.1 Analysis The update cost of the streaming algorithm, per word in the stream, is O(df), where d is the target signature size and f is the number of features associated with each word in the stream. [sent-77, score-0.476]
43 This results in an overall cost of O(ndf) for the streaming algorithm, where n is the length of the stream. [sent-78, score-0.266]
44 The memory footprint of our algorithm is O(n0d+m), where n0 is the number of distinct words in the stream and m is the size of the pool of normally distributed values. [sent-79, score-0.666]
45 In comparison, the original LSH algorithm computes signatures at a cost of O(nf + n0dF) updates and O(n0F + dF + n0d) memory, where F is the (large) number of unique features. (Note that for the bigram features used in § 4, f = 2.) [sent-80, score-0.44]
46 Our algorithm is superior in terms of memory (because of the pooling trick), and has the benefit of supporting similarity queries online. [sent-82, score-0.221]
47 3.2 Pooling Normally-distributed Values We now discuss why it is possible to use a fixed pool of random values instead of generating unique ones for each feature. [sent-84, score-0.309]
48 Now, if we select for our pool the values g−1(1/m), g−1(2/m), ..., g−1(1 − 1/m), for some sufficiently large m, then this is identical to sampling from N(0, 1) with the caveat that the accuracy of the sample is limited (here g is the cumulative distribution function of N(0, 1), so g−1 maps evenly spaced quantiles back to values). [sent-90, score-0.247]
50 More precisely, a sample drawn from this pool deviates from the corresponding exact sample by at most max_{1 ≤ i ≤ m−2} [ g−1((i+1)/m) − g−1(i/m) ], the largest gap between consecutive pool values. [sent-94, score-0.227]
51 By choosing m to be sufficiently large, we can bound the error of the approximate sample relative to a true sample (e.g., to within 1%). [sent-98, score-0.204]
52 This would result in the same relative error in the computation of the dot product (i.e., 1%), which would almost never affect the sign of the final value. [sent-103, score-0.206]
54 Hence, pooling as above should give results almost identical to the case where all the random values were chosen independently. [sent-106, score-0.172]
55 Finally, we make the observation that, for large m, randomly choosing m values from N(0, 1) results in a set of values that are distributed very similarly to the pool described above. [sent-107, score-0.344]
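A sketch of the quantile-style pool construction, using the inverse normal CDF from Python's standard library (the library choice is ours, not the paper's):

    from statistics import NormalDist

    def quantile_pool(m):
        # Pool values g^-1(1/m), ..., g^-1((m-1)/m), where g is the standard
        # normal CDF; these are evenly spaced quantiles of N(0, 1).
        inv_cdf = NormalDist(0.0, 1.0).inv_cdf
        return [inv_cdf(i / m) for i in range(1, m)]

    pool = quantile_pool(10_000)
    # The largest gap between consecutive pool values bounds how far a pooled
    # sample can deviate from an exact sample falling between them.
    max_gap = max(b - a for a, b in zip(pool, pool[1:]))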
56 3.3 Extensions Decay The algorithm can be extended to support temporal decay in the stream, where recent observations are given higher relative weight, by multiplying the current sums by a decay value (e.g., a constant slightly less than one). [sent-110, score-0.154]
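A sketch of the decay extension, assuming the per-word state mapping from the earlier class; the decay constant and the call interval are illustrative:

    def apply_decay(state, decay=0.95):
        # Called at a fixed interval (e.g., once per day of stream data): scaling
        # every running sum down-weights older observations relative to newer ones.
        for vec in state.values():
            for j in range(len(vec)):
                vec[j] *= decay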
57 Because the per-word state is a linear function of the stream (i.e., it is built up via addition), processing can also be split across multiple machines, and the randomness between machines can be tied based on a shared seed s. [sent-123, score-0.06]
58 At any point in processing the stream(s), current results can be aggregated by summing the d-dimensional vectors for each word, from each machine. [sent-124, score-0.116]
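A sketch of that aggregation step, assuming each machine built its state with the same pool seed so that its hashed random values line up:

    from collections import defaultdict

    def merge_states(machine_states, d=256):
        # machine_states: a list of per-machine {word: length-d float list} maps.
        # Because updates are additive, summing the vectors gives the same state
        # as if a single machine had seen the entire stream.
        merged = defaultdict(lambda: [0.0] * d)
        for state in machine_states:
            for word, vec in state.items():
                acc = merged[word]
                for j in range(d):
                    acc[j] += vec[j]
        return merged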
59 Following Ravichandran et al. (2005), we evaluated the fidelity of signature generation in the context of calculating distributional similarity between words across a large text collection: in our case, articles taken from the NYTimes portion of the Gigaword corpus (Graff, 2003). [sent-126, score-0.221]
60 Table 1: Mean absolute error when using signatures generated online (StreamingLSH), compared to offline (LSH). [sent-132, score-0.626]
61 This gave a stream of 773,185,086 tokens, with 1,138,467 unique types. [sent-134, score-0.303]
62 Given the number of types, this led to a (sparse) feature space with dimension on the order of 2.5 million. [sent-135, score-0.058]
63 After compiling signatures, fifty-thousand ⟨x, y⟩ pairs of types were randomly sampled by selecting x and y each independently, with replacement, from those types with at least 10 tokens in the stream (where 310,327 types satisfied this constraint). [sent-137, score-0.303]
64 The true cosine values between each such x and y were computed based on offline calculation, and compared to the cosine similarity predicted by the Hamming distance between the signatures for x and y. [sent-138, score-1.031]
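A sketch of the comparison implied here: over the sampled pairs, measure the gap between the exact (offline) cosine and the signature-based estimate and report the mean absolute error; the sampling and the two similarity functions are left abstract:

    def mean_absolute_error(pairs, true_cosine, approx_cosine):
        # pairs: sampled (x, y) word pairs; true_cosine and approx_cosine are
        # callables returning the exact and signature-based similarities.
        errors = [abs(true_cosine(x, y) - approx_cosine(x, y)) for x, y in pairs]
        return sum(errors) / len(errors)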
65 Unless otherwise specified, the random pool size was fixed at m = 10,000. [sent-139, score-0.292]
66 Figure 1 visually reaffirms the trade-off in LSH between the number of bits and the accuracy of cosine prediction across the range of cosine values. [sent-140, score-0.447]
67 As the underlying vectors are strictly positive, the true cosine is restricted to [0, 1]. [sent-141, score-0.339]
68 Figure 2 shows the absolute error between truth and prediction for a similar sample, measured using signatures of a variety of bit lengths. [sent-142, score-0.467]
69 Here we see horizontal bands arising from truly orthogonal vectors leading to step-wise absolute error values tracked to Hamming distance. [sent-143, score-0.269]
70 Table 1 compares the online and batch LSH algorithms, giving the mean absolute error between predicted and actual cosine values, computed for the fifty-thousand element sample, using signatures of various lengths. [sent-144, score-0.804]
71 These results confirm that we achieve the same level of accuracy with online updates as compared to the standard method. [sent-145, score-0.146]
72 Figure 3 shows how a pool size as low as m = 100 gives reasonable variation in random values, and that m = 10,000 is sufficient. [sent-146, score-0.292]
73 When using a standard 32 bit floating point representation, this is just 40 KBytes of memory, as compared to, e.g., the roughly 2.5 GBytes required to store 256 random vectors each containing 2.5 million floating point values. [sent-147, score-0.104]
75 Table 2 is based on taking an example for each of three part-of-speech categories, and reporting the resultant top-5 words according to approximated cosine similarity. [sent-152, score-0.266]
76 Depending on the intended application, these results indicate a range ... Figure 2: Absolute error between predicted and true cosine for a sample of pairs, when using signatures of length log2(d) ∈ {4, 5, 6, 7, 8}, drawn with added jitter to avoid overplotting. [sent-153, score-0.731]
77 Figure 3: Error versus pool size, when using d = 256. [sent-154, score-0.196]
78 5 Conclusions We have shown that when updates to a feature vector are additive, it is possible to convert the offline LSH signature generation method into a streaming algorithm. [sent-156, score-0.437]
79 In addition to allowing for online querying of signatures, our approach leads to space efficiencies, as it does not require the explicit representation of either the feature vectors or the random matrix. [sent-157, score-0.192]
80 Possibilities for future work include the pairing of this method with algorithms for dynamic clustering, as well as exploring algorithms for different distances (e.g., ...). [sent-158, score-0.066]
81 Table 2: Top-5 items based on true cosine (bold), then using minimal Hamming distance, given in top-down order when using signatures of length log2(d) ∈ {4, 5, 6, 7, 8}. [sent-197, score-0.557]
82 Asymmetric distance estimation with sketches for similarity search in high-dimensional spaces. [sent-213, score-0.101]
83 Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. [sent-219, score-0.064]
wordName wordTfidf (topN-words)
[('lsh', 0.452), ('signatures', 0.334), ('stream', 0.303), ('streaming', 0.266), ('pool', 0.196), ('cosine', 0.191), ('charikar', 0.181), ('signature', 0.146), ('hash', 0.137), ('ravichandran', 0.137), ('offline', 0.118), ('vectors', 0.116), ('durme', 0.102), ('locality', 0.1), ('dot', 0.095), ('hamming', 0.091), ('hashing', 0.09), ('lall', 0.09), ('petrovic', 0.09), ('additive', 0.086), ('maintained', 0.082), ('similarity', 0.075), ('updates', 0.074), ('floating', 0.073), ('ashwin', 0.073), ('online', 0.072), ('sensitive', 0.064), ('random', 0.062), ('decay', 0.061), ('goemans', 0.06), ('goyal', 0.06), ('indyk', 0.06), ('levenberg', 0.06), ('sasa', 0.06), ('streaminglsh', 0.06), ('twitter', 0.06), ('pooling', 0.059), ('feature', 0.058), ('memory', 0.055), ('randomized', 0.053), ('tweets', 0.053), ('error', 0.052), ('values', 0.051), ('absolute', 0.05), ('dong', 0.048), ('mapreduce', 0.048), ('dean', 0.048), ('approximate', 0.048), ('drawn', 0.047), ('distributed', 0.046), ('pointwise', 0.046), ('osborne', 0.046), ('resultant', 0.045), ('deepak', 0.045), ('van', 0.045), ('sign', 0.045), ('miles', 0.044), ('revisit', 0.043), ('benjamin', 0.043), ('vector', 0.041), ('estimators', 0.041), ('bits', 0.039), ('cos', 0.039), ('angle', 0.039), ('predicted', 0.039), ('vi', 0.038), ('succinct', 0.038), ('asymmetric', 0.037), ('sample', 0.036), ('batch', 0.035), ('ri', 0.034), ('size', 0.034), ('algorithms', 0.033), ('seed', 0.033), ('algorithm', 0.032), ('true', 0.032), ('preserving', 0.032), ('storing', 0.032), ('approximation', 0.031), ('bit', 0.031), ('computation', 0.031), ('actual', 0.031), ('mutual', 0.031), ('church', 0.03), ('associated', 0.03), ('approximated', 0.03), ('df', 0.029), ('dimensional', 0.029), ('unit', 0.029), ('story', 0.028), ('product', 0.028), ('products', 0.027), ('computing', 0.027), ('machines', 0.027), ('tors', 0.026), ('sketches', 0.026), ('reaffirms', 0.026), ('precomputed', 0.026), ('glen', 0.026), ('arrays', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 183 acl-2010-Online Generation of Locality Sensitive Hash Signatures
Author: Benjamin Van Durme ; Ashwin Lall
Abstract: Motivated by the recent interest in streaming algorithms for processing large text collections, we revisit the work of Ravichandran et al. (2005) on using the Locality Sensitive Hash (LSH) method of Charikar (2002) to enable fast, approximate comparisons of vector cosine similarity. For the common case of feature updates being additive over a data stream, we show that LSH signatures can be maintained online, without additional approximation error, and with lower memory requirements than when using the standard offline technique.
2 0.12601231 50 acl-2010-Bilingual Lexicon Generation Using Non-Aligned Signatures
Author: Daphna Shezaf ; Ari Rappoport
Abstract: Bilingual lexicons are fundamental resources. Modern automated lexicon generation methods usually require parallel corpora, which are not available for most language pairs. Lexicons can be generated using non-parallel corpora or a pivot language, but such lexicons are noisy. We present an algorithm for generating a high quality lexicon from a noisy one, which only requires an independent corpus for each language. Our algorithm introduces non-aligned signatures (NAS), a cross-lingual word context similarity score that avoids the over-constrained and inefficient nature of alignment-based methods. We use NAS to eliminate incorrect translations from the generated lexicon. We evaluate our method by improving the quality of noisy Spanish-Hebrew lexicons generated from two pivot English lexicons. Our algorithm substantially outperforms other lexicon generation methods.
3 0.1128642 51 acl-2010-Bilingual Sense Similarity for Statistical Machine Translation
Author: Boxing Chen ; George Foster ; Roland Kuhn
Abstract: This paper proposes new algorithms to compute the sense similarity between two units (words, phrases, rules, etc.) from parallel corpora. The sense similarity scores are computed by using the vector space model. We then apply the algorithms to statistical machine translation by computing the sense similarity between the source and target side of translation rule pairs. Similarity scores are used as additional features of the translation model to improve translation performance. Significant improvements are obtained over a state-of-the-art hierarchical phrase-based machine translation system. 1
4 0.079760268 7 acl-2010-A Generalized-Zero-Preserving Method for Compact Encoding of Concept Lattices
Author: Matthew Skala ; Victoria Krakovna ; Janos Kramar ; Gerald Penn
Abstract: Constructing an encoding of a concept lattice using short bit vectors allows for efficient computation of join operations on the lattice. Join is the central operation any unification-based parser must support. We extend the traditional bit vector encoding, which represents join failure using the zero vector, to count any vector with less than a fixed number of one bits as failure. This allows non-joinable elements to share bits, resulting in a smaller vector size. A constraint solver is used to construct the encoding, and a variety of techniques are employed to find near-optimal solutions and handle timeouts. An evaluation is provided comparing the extended representation of failure with traditional bit vector techniques.
5 0.07844872 70 acl-2010-Contextualizing Semantic Representations Using Syntactically Enriched Vector Models
Author: Stefan Thater ; Hagen Furstenau ; Manfred Pinkal
Abstract: We present a syntactically enriched vector model that supports the computation of contextualized semantic representations in a quasi compositional fashion. It employs a systematic combination of first- and second-order context vectors. We apply our model to two different tasks and show that (i) it substantially outperforms previous work on a paraphrase ranking task, and (ii) achieves promising results on a wordsense similarity task; to our knowledge, it is the first time that an unsupervised method has been applied to this task.
6 0.06264165 27 acl-2010-An Active Learning Approach to Finding Related Terms
7 0.058325097 232 acl-2010-The S-Space Package: An Open Source Package for Word Space Models
8 0.056678638 220 acl-2010-Syntactic and Semantic Factors in Processing Difficulty: An Integrated Measure
9 0.05180788 89 acl-2010-Distributional Similarity vs. PU Learning for Entity Set Expansion
10 0.048006564 144 acl-2010-Improved Unsupervised POS Induction through Prototype Discovery
11 0.045191783 25 acl-2010-Adapting Self-Training for Semantic Role Labeling
12 0.044398561 205 acl-2010-SVD and Clustering for Unsupervised POS Tagging
13 0.0417004 66 acl-2010-Compositional Matrix-Space Models of Language
14 0.037404768 113 acl-2010-Extraction and Approximation of Numerical Attributes from the Web
15 0.037250806 93 acl-2010-Dynamic Programming for Linear-Time Incremental Parsing
16 0.037017744 94 acl-2010-Edit Tree Distance Alignments for Semantic Role Labelling
17 0.03673926 24 acl-2010-Active Learning-Based Elicitation for Semi-Supervised Word Alignment
18 0.036212966 253 acl-2010-Using Smaller Constituents Rather Than Sentences in Active Learning for Japanese Dependency Parsing
19 0.035689622 10 acl-2010-A Latent Dirichlet Allocation Method for Selectional Preferences
20 0.035372857 264 acl-2010-Wrapping up a Summary: From Representation to Generation
topicId topicWeight
[(0, -0.122), (1, 0.023), (2, -0.026), (3, -0.006), (4, 0.034), (5, -0.021), (6, 0.025), (7, -0.021), (8, 0.04), (9, -0.004), (10, -0.064), (11, 0.055), (12, 0.072), (13, -0.066), (14, -0.03), (15, 0.029), (16, 0.01), (17, -0.092), (18, -0.062), (19, 0.002), (20, 0.1), (21, 0.018), (22, 0.04), (23, 0.039), (24, 0.075), (25, 0.03), (26, -0.047), (27, -0.035), (28, 0.099), (29, 0.001), (30, -0.035), (31, -0.075), (32, 0.015), (33, -0.024), (34, -0.042), (35, 0.062), (36, 0.02), (37, -0.108), (38, 0.055), (39, 0.001), (40, -0.038), (41, -0.013), (42, -0.026), (43, 0.06), (44, -0.065), (45, 0.018), (46, -0.151), (47, 0.115), (48, 0.037), (49, 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.92603236 183 acl-2010-Online Generation of Locality Sensitive Hash Signatures
Author: Benjamin Van Durme ; Ashwin Lall
Abstract: Motivated by the recent interest in streaming algorithms for processing large text collections, we revisit the work of Ravichandran et al. (2005) on using the Locality Sensitive Hash (LSH) method of Charikar (2002) to enable fast, approximate comparisons of vector cosine similarity. For the common case of feature updates being additive over a data stream, we show that LSH signatures can be maintained online, without additional approximation error, and with lower memory requirements than when using the standard offline technique.
2 0.57544315 50 acl-2010-Bilingual Lexicon Generation Using Non-Aligned Signatures
Author: Daphna Shezaf ; Ari Rappoport
Abstract: Bilingual lexicons are fundamental resources. Modern automated lexicon generation methods usually require parallel corpora, which are not available for most language pairs. Lexicons can be generated using non-parallel corpora or a pivot language, but such lexicons are noisy. We present an algorithm for generating a high quality lexicon from a noisy one, which only requires an independent corpus for each language. Our algorithm introduces non-aligned signatures (NAS), a cross-lingual word context similarity score that avoids the over-constrained and inefficient nature of alignment-based methods. We use NAS to eliminate incorrect translations from the generated lexicon. We evaluate our method by improving the quality of noisy Spanish-Hebrew lexicons generated from two pivot English lexicons. Our algorithm substantially outperforms other lexicon generation methods.
3 0.56456679 7 acl-2010-A Generalized-Zero-Preserving Method for Compact Encoding of Concept Lattices
Author: Matthew Skala ; Victoria Krakovna ; Janos Kramar ; Gerald Penn
Abstract: Constructing an encoding of a concept lattice using short bit vectors allows for efficient computation of join operations on the lattice. Join is the central operation any unification-based parser must support. We extend the traditional bit vector encoding, which represents join failure using the zero vector, to count any vector with less than a fixed number of one bits as failure. This allows non-joinable elements to share bits, resulting in a smaller vector size. A constraint solver is used to construct the encoding, and a variety of techniques are employed to find near-optimal solutions and handle timeouts. An evaluation is provided comparing the extended representation of failure with traditional bit vector techniques.
4 0.54606515 3 acl-2010-A Bayesian Method for Robust Estimation of Distributional Similarities
Author: Jun'ichi Kazama ; Stijn De Saeger ; Kow Kuroda ; Masaki Murata ; Kentaro Torisawa
Abstract: Existing word similarity measures are not robust to data sparseness since they rely only on the point estimation of words’ context profiles obtained from a limited amount of data. This paper proposes a Bayesian method for robust distributional word similarities. The method uses a distribution of context profiles obtained by Bayesian estimation and takes the expectation of a base similarity measure under that distribution. When the context profiles are multinomial distributions, the priors are Dirichlet, and the base measure is the Bhattacharyya coefficient, we can derive an analytical form that allows efficient calculation. For the task of word similarity estimation using a large amount of Web data in Japanese, we show that the proposed measure gives better accuracies than other well-known similarity measures.
5 0.54109424 27 acl-2010-An Active Learning Approach to Finding Related Terms
Author: David Vickrey ; Oscar Kipersztok ; Daphne Koller
Abstract: We present a novel system that helps nonexperts find sets of similar words. The user begins by specifying one or more seed words. The system then iteratively suggests a series of candidate words, which the user can either accept or reject. Current techniques for this task typically bootstrap a classifier based on a fixed seed set. In contrast, our system involves the user throughout the labeling process, using active learning to intelligently explore the space of similar words. In particular, our system can take advantage of negative examples provided by the user. Our system combines multiple preexisting sources of similarity data (a standard thesaurus, WordNet, contextual similarity), enabling it to capture many types of similarity groups (“synonyms of crash,” “types of car,” etc.). We evaluate on a hand-labeled evaluation set; our system improves over a strong baseline by 36%.
6 0.53365749 232 acl-2010-The S-Space Package: An Open Source Package for Word Space Models
7 0.50930846 66 acl-2010-Compositional Matrix-Space Models of Language
8 0.47883275 70 acl-2010-Contextualizing Semantic Representations Using Syntactically Enriched Vector Models
9 0.47160569 51 acl-2010-Bilingual Sense Similarity for Statistical Machine Translation
10 0.4615182 89 acl-2010-Distributional Similarity vs. PU Learning for Entity Set Expansion
11 0.45247868 107 acl-2010-Exemplar-Based Models for Word Meaning in Context
12 0.38813147 205 acl-2010-SVD and Clustering for Unsupervised POS Tagging
13 0.38730717 117 acl-2010-Fine-Grained Genre Classification Using Structural Learning Algorithms
14 0.38565421 263 acl-2010-Word Representations: A Simple and General Method for Semi-Supervised Learning
15 0.38320407 129 acl-2010-Growing Related Words from Seed via User Behaviors: A Re-Ranking Based Approach
16 0.3582893 220 acl-2010-Syntactic and Semantic Factors in Processing Difficulty: An Integrated Measure
17 0.35145265 29 acl-2010-An Exact A* Method for Deciphering Letter-Substitution Ciphers
18 0.33777487 197 acl-2010-Practical Very Large Scale CRFs
19 0.33508372 174 acl-2010-Modeling Semantic Relevance for Question-Answer Pairs in Web Social Communities
20 0.33433002 196 acl-2010-Plot Induction and Evolutionary Search for Story Generation
topicId topicWeight
[(14, 0.01), (25, 0.067), (33, 0.013), (39, 0.016), (42, 0.017), (59, 0.11), (73, 0.043), (76, 0.011), (78, 0.051), (80, 0.012), (81, 0.314), (83, 0.08), (84, 0.038), (98, 0.115)]
simIndex simValue paperId paperTitle
1 0.82822448 108 acl-2010-Expanding Verb Coverage in Cyc with VerbNet
Author: Clifton McFate
Abstract: A robust dictionary of semantic frames is an essential element of natural language understanding systems that use ontologies. However, creating lexical resources that accurately capture semantic representations en masse is a persistent problem. Where the sheer amount of content makes hand creation inefficient, computerized approaches often suffer from over generality and difficulty with sense disambiguation. This paper describes a semi-automatic method to create verb semantic frames in the Cyc ontology by converting the information contained in VerbNet into a Cyc usable format. This method captures the differences in meaning between types of verbs, and uses existing connections between WordNet, VerbNet, and Cyc to specify distinctions between individual verbs when available. This method provides 27,909 frames to OpenCyc which currently has none and can be used to extend ResearchCyc as well. We show that these frames lead to a 20% increase in sample sentences parsed over the Research Cyc verb lexicon. 1
same-paper 2 0.78047562 183 acl-2010-Online Generation of Locality Sensitive Hash Signatures
Author: Benjamin Van Durme ; Ashwin Lall
Abstract: Motivated by the recent interest in streaming algorithms for processing large text collections, we revisit the work of Ravichandran et al. (2005) on using the Locality Sensitive Hash (LSH) method of Charikar (2002) to enable fast, approximate comparisons of vector cosine similarity. For the common case of feature updates being additive over a data stream, we show that LSH signatures can be maintained online, without additional approximation error, and with lower memory requirements than when using the standard offline technique.
3 0.71485436 252 acl-2010-Using Parse Features for Preposition Selection and Error Detection
Author: Joel Tetreault ; Jennifer Foster ; Martin Chodorow
Abstract: We recreate a state-of-the-art preposition usage system (Tetreault and Chodorow, 2008) and evaluate the effect of adding parse features to a leading model of preposition usage. Results show a significant improvement in the preposition selection task on native speaker text and a modest increment in precision and recall in an ESL error detection task. Analysis of the parser output indicates that it is robust enough in the face of noisy non-native writing to extract useful information.
4 0.53574669 169 acl-2010-Learning to Translate with Source and Target Syntax
Author: David Chiang
Abstract: Statistical translation models that try to capture the recursive structure of language have been widely adopted over the last few years. These models make use of varying amounts of information from linguistic theory: some use none at all, some use information about the grammar of the target language, some use information about the grammar of the source language. But progress has been slower on translation models that are able to learn the relationship between the grammars of both the source and target language. We discuss the reasons why this has been a challenge, review existing attempts to meet this challenge, and show how some old and new ideas can be combined into a sim- ple approach that uses both source and target syntax for significant improvements in translation accuracy.
5 0.5332917 120 acl-2010-Fully Unsupervised Core-Adjunct Argument Classification
Author: Omri Abend ; Ari Rappoport
Abstract: The core-adjunct argument distinction is a basic one in the theory of argument structure. The task of distinguishing between the two has strong relations to various basic NLP tasks such as syntactic parsing, semantic role labeling and subcategorization acquisition. This paper presents a novel unsupervised algorithm for the task that uses no supervised models, utilizing instead state-of-the-art syntactic induction algorithms. This is the first work to tackle this task in a fully unsupervised scenario.
6 0.53313243 158 acl-2010-Latent Variable Models of Selectional Preference
7 0.53293031 211 acl-2010-Simple, Accurate Parsing with an All-Fragments Grammar
8 0.53261399 184 acl-2010-Open-Domain Semantic Role Labeling by Modeling Word Spans
9 0.52980608 172 acl-2010-Minimized Models and Grammar-Informed Initialization for Supertagging with Highly Ambiguous Lexicons
10 0.52950084 130 acl-2010-Hard Constraints for Grammatical Function Labelling
11 0.52853757 162 acl-2010-Learning Common Grammar from Multilingual Corpus
12 0.52776611 70 acl-2010-Contextualizing Semantic Representations Using Syntactically Enriched Vector Models
13 0.52749276 109 acl-2010-Experiments in Graph-Based Semi-Supervised Learning Methods for Class-Instance Acquisition
14 0.52749032 55 acl-2010-Bootstrapping Semantic Analyzers from Non-Contradictory Texts
15 0.52724004 160 acl-2010-Learning Arguments and Supertypes of Semantic Relations Using Recursive Patterns
16 0.52697963 148 acl-2010-Improving the Use of Pseudo-Words for Evaluating Selectional Preferences
17 0.52653611 218 acl-2010-Structural Semantic Relatedness: A Knowledge-Based Method to Named Entity Disambiguation
18 0.52610689 261 acl-2010-Wikipedia as Sense Inventory to Improve Diversity in Web Search Results
19 0.52603745 76 acl-2010-Creating Robust Supervised Classifiers via Web-Scale N-Gram Data
20 0.5259074 153 acl-2010-Joint Syntactic and Semantic Parsing of Chinese