acl acl2011 acl2011-113 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Benjamin Van Durme ; Ashwin Lall
Abstract: We describe a novel mechanism called Reservoir Counting for application in online Locality Sensitive Hashing. This technique allows for significant savings in the streaming setting, allowing for maintaining a larger number of signatures, or an increased level of approximation accuracy at a similar memory footprint.
Reference: text
sentIndex sentText sentNum sentScore
1 Efficient Online Locality Sensitive Hashing via Reservoir Counting Benjamin Van Durme HLTCOE Johns Hopkins University Abstract We describe a novel mechanism called Reservoir Counting for application in online Locality Sensitive Hashing. [sent-1, score-0.113]
2 This technique allows for significant savings in the streaming setting, allowing for maintaining a larger number of signatures, or an increased level of approximation accuracy at a similar memory footprint. [sent-2, score-0.286]
3 1 Introduction Feature vectors based on lexical co-occurrence are often of a high dimension, d. [sent-3, score-0.036]
4 This leads to O(d) operations to calculate cosine similarity, a fundamental tool in distributional semantics. [sent-4, score-0.112]
5 This is improved in practice through the use of data structures that exploit feature sparsity, leading to an expected O(f) operations, where f is the number of unique features we expect to have non-zero entries in a given vector. [sent-5, score-0.066]
6 Such LSH bit signatures are constructed using the following hash function, where ~u ∈ Rd is a vector in the original feature space, and ~v is randomly drawn from N(0, 1)d: h(~u) = 1 if ~v · ~u ≥ 0, and 0 otherwise. [sent-8, score-0.246]
7 Typically b ≪ d, which leads to faster pair-wise comparisons between vectors, and a lower memory footprint. [sent-12, score-0.038] [sent-13, score-0.062]
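For reference, the standard offline construction just described can be sketched as follows. This is an illustration assuming NumPy; names such as lsh_signature and approx_cosine are introduced here for the sketch, not taken from the paper.

```python
# A minimal sketch of standard LSH bit signatures (not the authors' code).
# Each of the b hash functions is a random hyperplane; the signature bit
# records which side of the hyperplane a vector falls on.
import numpy as np

def lsh_signature(u, projections):
    """Return a b-bit signature: bit i is 1 iff projections[i] . u >= 0."""
    return (projections @ u >= 0).astype(np.uint8)

def approx_cosine(sig_a, sig_b):
    """Approximate cos(u, w) from the Hamming distance between signatures."""
    b = len(sig_a)
    hamming = np.count_nonzero(sig_a != sig_b)
    return np.cos(np.pi * hamming / b)

d, b = 10000, 256
rng = np.random.default_rng(0)
projections = rng.standard_normal((b, d))    # rows drawn from N(0, 1)^d
u, w = rng.standard_normal(d), rng.standard_normal(d)
print(approx_cosine(lsh_signature(u, projections), lsh_signature(w, projections)))
```

The Hamming distance between two signatures estimates the angle between the original vectors, which is what makes the b-bit signature a compact proxy for cosine similarity.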
9 Van Durme and Lall (2010) observed1 that if the feature values are additive over a dataset (e. [sent-14, score-0.052]
10 , when collecting word co-occurrence frequencies), then these signatures may be constructed online by unrolling the dot-product into a series of local operations, maintaining for each hash function a running sum ri = Σt ~vi · ~ut, where ~ut represents the features observed locally at time t in a data-stream. [sent-16, score-0.213]
11 Since updates may be done locally, feature vectors do not need to be stored explicitly. [sent-17, score-0.227]
12 This directly leads to significant space savings, as only one counter is needed for each of the b running sums. [sent-18, score-0.214]
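A minimal sketch of this online variant, under the same assumptions (NumPy; illustrative names, not the authors' code): each arriving (feature, value) pair is folded directly into the b running sums, so the full d-dimensional vector is never materialized.

```python
import numpy as np

d, b = 10000, 256
rng = np.random.default_rng(0)
projections = rng.standard_normal((b, d))   # one random vector per hash function
running_sums = np.zeros(b)                  # the only per-signature state kept online

def observe(running_sums, projections, feature_index, value):
    # Fold a locally observed (feature, value) pair into all b running sums:
    # this is the unrolled dot-product, so the feature vector is never stored.
    running_sums += value * projections[:, feature_index]

# e.g. a stream of additive co-occurrence counts for one target word
for f, v in [(3, 2.0), (17, 1.0), (3, 1.0)]:
    observe(running_sums, projections, f, v)

signature = (running_sums >= 0).astype(np.uint8)
```

Only the b running sums persist; taking their signs at any point yields the current signature.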
13 In this work we focus on the following observation: the counters used to store the running sums may themselves be an inefficient use of space, in that they may be amenable to compression through approximation. [sent-19, score-0.039]
14 2 Since the accuracy of this LSH routine is a function of b, if we were able to reduce the online requirements of each counter, we might afford a larger number of projections. [sent-20, score-0.075]
15 Even if a chance of approximation error were introduced for each hash function, this may be justified by greater overall fidelity from the resultant increase in b. [sent-21, score-0.212]
16 We show experimentally that this leads to more accurate approximations at the same memory cost, or similarly accurate approximations at a significantly reduced cost. [sent-28, score-0.146]
17 This result is relevant to work in large-scale distributional semantics (Bhagat and Ravichandran, 2008; Van Durme and Lall, 2009; Pantel et al. [sent-29, score-0.041]
18 We then employ an integer-valued projection matrix in order to work with an integer-valued stream of online updates, which is reduced (implicitly) to a stream of positive and negative unit updates. [sent-35, score-0.77]
19 The sign of the sum of these updates is approximated through a novel twist on Reservoir Sampling. [sent-36, score-0.247]
20 When computed explicitly this leads to an impractical mechanism linear in each feature value update. [sent-37, score-0.146]
21 To ensure our counter can (approximately) add and subtract in constant time, we then derive expressions for the expected value of each step of the update. [sent-38, score-0.232]
22 Unit Projection Rather than construct a projection matrix from N(0, 1), a matrix randomly populated with entries from the set {−1, 0, 1} will suffice, with quality dependent on the relative proportion of these elements. [sent-40, score-0.209]
23 If we let p be the probability mass allocated to zeros, then we create a discrete projection matrix by sampling from the multinomial: ((1−p)/2 : −1, p : 0, (1−p)/2 : +1). [sent-41, score-0.198]
24 Henceforth we assume this discrete projection matrix, with p = 0.5. [sent-44, score-0.093]
25 3 The use of such sparse projections was first proposed by Achlioptas (2003), then extended by Li et al. [sent-46, score-0.092]
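A hedged sketch of drawing such a discrete projection, assuming NumPy (the function name and defaults are illustrative, not the authors' code):

```python
import numpy as np

def unit_projection(b, d, p=0.5, seed=0):
    # Entries drawn i.i.d. from {-1, 0, +1} with probabilities ((1-p)/2, p, (1-p)/2).
    rng = np.random.default_rng(seed)
    return rng.choice([-1, 0, 1], size=(b, d), p=[(1 - p) / 2, p, (1 - p) / 2])
```

With p = 0.5, half the entries are zero in expectation, which is what later allows the stream length n to be updated by half the magnitude of each observed value.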
26 Figure 1: With b = 256, mean absolute error in cosine approximation when using a projection based on N(0, 1), compared to {−1, 0, 1}. [sent-49, score-0.153]
27 Unit Stream Based on a unit projection, we can view an online counter as summing over a stream drawn from {−1, 1}: each projected feature value unrolled into its (positive or negative) unary representation. [sent-50, score-0.604]
28 Reservoir Sampling We can maintain a uniform sample of size k over a stream of unknown length as follows. [sent-52, score-0.263]
29 Accept the first k elements into a reservoir (array) of size k. [sent-53, score-0.836]
30 Each following element at position n is accepted with probability k/n, whereupon an element currently in the reservoir is evicted, and replaced with the just accepted item. [sent-54, score-0.905]
31 Reservoir sampling is a folklore algorithm that was extended by Vitter (1985) to allow for multiple updates. [sent-56, score-0.036]
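For concreteness, the classic single-item-per-step variant just described can be sketched as follows (illustrative Python, not from the paper):

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)               # fill phase: keep everything
        elif rng.random() < k / n:               # accept with probability k/n
            reservoir[rng.randrange(k)] = item   # evict a random element
    return reservoir
```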
32 Reservoir Counting If we are sampling over a stream drawn from just two values, we can implicitly represent the reservoir by counting only the frequency of one or the other element. [sent-57, score-1.133]
33 We maintain a single counter, s, for tracking the number of 1-values currently in the reservoir. [sent-59, score-0.03]
34 5 When a negative value is accepted, we decrement the counter with probability s/k. [sent-60, score-0.306]
35 When a positive update is accepted, we increment the counter with probability (1 − s/k). [sent-61, score-0.311]
36 This reflects an update either evicting an element of the same sign, which has no effect on the makeup of the reservoir, or decreasing/increasing the number of 1’s currently sampled. [sent-62, score-0.096]
37 An approximate sum of all values seen up to position n is then simply: n(2s/k − 1). [sent-63, score-0.058]
38 While this value is potentially interesting in future applications, here we are only concerned with its sign. [sent-64, score-0.073]
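Putting the last few sentences together, one such implicit reservoir over a ±1 stream might be sketched as follows (an illustration; the class name and the fill-phase handling are assumptions of this sketch):

```python
import random

class ReservoirCounter:
    """Implicit size-k reservoir over a +1/-1 stream: only s, the number of
    +1's currently sampled, and the stream position n are stored."""

    def __init__(self, k, seed=0):
        self.k, self.s, self.n = k, 0, 0
        self.rng = random.Random(seed)

    def update(self, x):                  # x is +1 or -1
        self.n += 1
        if self.n <= self.k:              # fill phase: keep everything
            self.s += (x == 1)
        elif self.rng.random() < self.k / self.n:            # accepted with prob k/n
            if x == 1 and self.rng.random() > self.s / self.k:
                self.s += 1               # a -1 was evicted, a +1 inserted
            elif x == -1 and self.rng.random() < self.s / self.k:
                self.s -= 1               # a +1 was evicted, a -1 inserted

    def approx_sum(self):                 # meaningful once n >= k
        return self.n * (2 * self.s / self.k - 1)

    def sign(self):
        return 1 if 2 * self.s >= self.k else -1
```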
39 Parallel Reservoir Counting On its own this counting mechanism hardly appears useful: since it depends on knowing n, we might just as well sum the elements of the stream directly, counting in whatever space we would otherwise use in maintaining the value of n. [sent-65, score-0.695]
40 However, if we have a set of tied streams that we process in parallel,6 then we only need to track n once, across b different streams, each with their own reservoir. [sent-66, score-0.126]
41 When dealing with parallel streams resulting from different random projections of the same vector, we cannot assume these will be strictly tied. [sent-67, score-0.218]
42 Some projections will cancel out heavier elements than others, leading to update streams of different lengths once elements are unrolled into their (positive or negative) unary representation. [sent-68, score-0.431]
43 In practice we have found that tracking the mean value of n across b streams is sufficient. [sent-69, score-0.174]
44 For a p = 0.5 zeroed matrix, we can update n by one half the magnitude of each observed value, as on average half the projections will cancel out any given element. [sent-71, score-0.275]
45 This step can be found in Algorithm 2, lines 8 and 9. [sent-72, score-0.028]
46 Example To make concrete what we have covered to this point, consider a given feature vector of dimensionality d = 3, say: [3, 2, 1]. [sent-73, score-0.028]
47 This might be projected into b = 4 vectors: [3, 0, 0], [0, -2, 1], [0, 0, 1], and [-3, 2, 0]. [sent-74, score-0.036]
48 When viewed as positive/negative, loosely-tied unit streams, they respectively have length n: 3, 3, 1, and 5, with mean length 3. [sent-75, score-0.062]
49 The goal of reservoir counting is to efficiently keep track of an approximation of their sums (here: 3, -1, 1, and -1), while the underlying feature values continue to stream in. [sent-76, score-1.039]
50 5 E.g., a reservoir of size k = 255 requires an 8-bit integer. [sent-78, score-0.785]
51 6 Tied in the sense that each stream is of the same length. [sent-79, score-0.209]
52 A k = 3 reservoir used for the last projected vector, [-3, 2, 0], might reasonably contain two values of -1, and one value of 1. [sent-86, score-0.857]
53 7 Represented explicitly as a vector, the reservoir would thus be in the arrangement: [1, -1, -1], [-1, 1, -1], or [-1, -1, 1]. [sent-87, score-0.755]
54 These are functionally equivalent: we only need to know that one of the k = 3 elements is positive. [sent-88, score-0.051]
55 Expected Number of Samples Traversing m consecutive values of either 1 or −1 in the unit stream should be thought of as seeing positive or negative m as a feature update. [sent-89, score-0.402]
56 For a reservoir of size k, let A(m, n, k) be the number of samples accepted when traversing the stream from position n + 1 to n + m. [sent-90, score-1.115]
57 A is non-deterministic: it represents the results of flipping m consecutive coins, where each coin is increasingly biased towards rejection. [sent-91, score-0.036]
58 Rather than computing A explicitly, which is linear in m, we will instead use the expected number of updates, A′(m, n, k) = E[A(m, n, k)], which can be computed in constant time. [sent-92, score-0.038]
59 For example, consider m = 30, encountered at position n = 100, with a reservoir of k = 10. [sent-94, score-0.755]
60 As the reservoir is a discrete set of bins, fractional portions of a sample are resolved by a coin flip. [sent-97, score-0.755]
61 If a = k loge((n+m)/n), then we accept u = ⌈a⌉ samples with probability (a − ⌊a⌋), and u = ⌊a⌋ samples otherwise. 7 Other options are: three -1’s, or one -1 and two 1’s. [sent-98, score-0.401]
62 8 With x a positive integer, H(x) = Σ_{i=1}^{x} 1/i ≈ loge(x) + γ, where γ is Euler’s constant. [sent-99, score-0.126]
63 These steps are found in lines 3 and 4 of Algorithm 1. [sent-101, score-0.028]
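A sketch of those two steps, with the fractional part resolved by a single coin flip (illustrative code; the function name is introduced here, not taken from the paper):

```python
import math, random

def expected_accepts(m, n, k, rng=random):
    """Constant-time estimate of how many of the m arrivals are accepted."""
    a = k * math.log((n + m) / n)            # expected number of accepts
    frac = a - math.floor(a)                 # fractional part -> one coin flip
    return math.ceil(a) if rng.random() < frac else math.floor(a)

# The worked example from the text: m = 30 arriving at position n = 100 with
# k = 10 gives a = 10 * ln(130/100), roughly 2.62, so usually 2 or 3 accepts.
print(expected_accepts(30, 100, 10))
```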
64 See Table 1 for simulation results using a variety of parameters. [sent-102, score-0.024]
65 Expected Reservoir Change We now discuss how to simulate many independent updates of the same type to the reservoir counter, e. [sent-103, score-0.918]
66 : five updates of 1, or three updates of -1, using a single estimate. [sent-105, score-0.326]
67 Consider a situation in which we have a reservoir of size k with some current value of s, 0 ≤ s ≤ k, and we wish to perform u independent updates. [sent-106, score-0.827]
68 We denote by U′k(s, u) the expected value of the reservoir after these u updates have taken place. [sent-107, score-0.998]
69 Since a single update leads to no change with probability s/k, we can write the following recurrence for U′k: U′k(s, u) = (s/k) U′k(s, u−1) + ((k−s)/k) U′k(s+1, u−1), with the boundary condition: for all s, U′k(s, 0) = s. [sent-108, score-0.192]
70 Solving the above recurrence, we get that the expected value of the reservoir after these updates is: U′k(s, u) = k + (s − k)(1 − 1/k)^u. [sent-109, score-0.998]
71 The case for negative updates follows similarly (see lines 7 and 8 of Algorithm 1). [sent-112, score-0.233]
72 Hence, instead of simulating u independent updates of the same type to the reservoir, we simply update it to this expected value, where fractional updates are handled similarly as when estimating the number of accepts. [sent-113, score-0.492]
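A sketch of this expected-value update, using the closed form above for positive updates and its symmetric counterpart for negative ones, with the same coin-flip rounding (illustrative code, not the paper's exact pseudocode):

```python
import math, random

def expected_reservoir(s, u, k, positive=True, rng=random):
    """Expected reservoir count after u same-sign unit updates, rounded to an
    integer with a coin flip on the fractional part."""
    if positive:
        target = k + (s - k) * (1 - 1 / k) ** u
    else:
        # Negative updates act symmetrically on the count of -1's (k - s),
        # which works out to an expected +1 count of s * (1 - 1/k)**u.
        target = s * (1 - 1 / k) ** u
    frac = target - math.floor(target)
    return math.ceil(target) if rng.random() < frac else math.floor(target)
```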
73 These steps are found in lines 5 through 9 of Algorithm 1, and as seen in Fig. [sent-114, score-0.028]
74 3, which shows the use of reservoir counting in Online Locality Sensitive Hashing (as made explicit in Algorithm 2), as compared to the method described by Van Durme and Lall (2010). [sent-117, score-0.888]
75 The total amount of space required when using this counting scheme is b log2(k + 1) + 32 bits: b reservoirs, and a 32-bit integer to track n. [sent-118, score-0.283]
76 This is compared to b 32-bit floating point values, as is standard. [sent-119, score-0.092]
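As a quick check of that comparison at the parameters used elsewhere in the text (b = 256, k = 255), a few lines of arithmetic: at equal b the reservoir scheme needs roughly a quarter of the bits, and the realized saving narrows once b is enlarged to recover accuracy, as the next sentence notes.

```python
import math

b, k = 256, 255
reservoir_bits = b * int(math.log2(k + 1)) + 32   # 256 * 8 + 32 = 2080 bits
standard_bits = b * 32                            # 256 32-bit counters = 8192 bits
print(reservoir_bits, standard_bits)
```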
77 Note that our scheme comes away with similar levels of accuracy, often at half the memory cost, while requiring larger b to account for the chance of approximation errors in individual reservoir counters. [sent-120, score-0.901]
78 Figure 2: Results of simulating many iterations of U′k, for k = 255, and various values of s and u. [sent-121, score-0.056]
79 Algorithm 1 RESERVOIRUPDATE(n, k, m, σ, s) Parameters: n : size of stream so far; k : size of reservoir, also maximum value of s; m : magnitude of update; σ : sign of update; s : current value of reservoir. 1: if m = 0 or σ = 0 then 2: Return without doing anything [sent-122, score-1.35]
80 Figure 3 (caption, partially recovered): reservoir counting compared to standard counting mechanisms (blue), as measured by the amount of total memory required versus the resultant error. [sent-129, score-0.229]
81 Algorithm 2 COMPUTESIGNATURE(S, k, b, p) Parameters: S : bit array of size b; k : size of each reservoir; b : number of projections; p : percentage of zeros in projection, p ∈ [0, 1]. 1: Initialize b reservoirs R[1, . [sent-130, score-1.065]
82 , b], each represented by a log2(k + 1)-bit unsigned integer. 2: Initialize b hash functions hi(w) that map features w to elements in a vector made up of −1 and 1, each with proportion (1−p)/2, and 0 at proportion p. [sent-133, score-0.273]
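A hedged end-to-end sketch that combines the pieces above (unit projection, expected accepts, expected reservoir change). The function names, the mid-range reservoir initialization, and the lazy per-feature projection draws are assumptions of this illustration; it follows the prose rather than reproducing the paper's exact Algorithms 1 and 2.

```python
import math, random

def reservoir_update(n, k, m, sigma, s, rng=random):
    """Approximately fold an update of magnitude m and sign sigma into a
    size-k implicit reservoir whose current +1 count is s."""
    if m == 0 or sigma == 0:
        return s
    a = k * math.log((n + m) / n)                      # expected accepts
    u = math.ceil(a) if rng.random() < a - math.floor(a) else math.floor(a)
    if sigma > 0:
        target = k + (s - k) * (1 - 1 / k) ** u        # expected reservoir change
    else:
        target = s * (1 - 1 / k) ** u
    frac = target - math.floor(target)
    return math.ceil(target) if rng.random() < frac else math.floor(target)

def compute_signature(stream, b, k, p=0.5, seed=0):
    """stream yields (feature, value) pairs; returns a list of b signature bits."""
    rng = random.Random(seed)
    reservoirs = [k // 2] * b        # start mid-range (an assumption of this sketch)
    n = 1.0                          # mean unrolled stream length across projections
    h = [dict() for _ in range(b)]   # lazily drawn {-1, 0, +1} projection entries

    def project(i, w):
        if w not in h[i]:
            r = rng.random()
            h[i][w] = -1 if r < (1 - p) / 2 else (+1 if r < 1 - p else 0)
        return h[i][w]

    for w, v in stream:
        for i in range(b):
            sigma = project(i, w) * (1 if v >= 0 else -1)
            reservoirs[i] = reservoir_update(n, k, abs(v), sigma, reservoirs[i], rng)
        n += abs(v) * (1 - p)        # on average (1-p) of the b projections are nonzero
    return [1 if 2 * s >= k else 0 for s in reservoirs]

bits = compute_signature([("the", 3.0), ("dog", 2.0), ("the", 1.0)], b=8, k=15)
print(bits)
```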
83 This savings comes at the cost of additional computation time: a high-throughput streaming application that is not concerned with online memory requirements will not have reason to consider the developments in this article. [sent-139, score-0.27]
84 The approach given here is motivated by cases where data is not flooding in at breakneck speed, and resource considerations are dominated by a large number of unique elements for which we are maintaining signatures. [sent-140, score-0.082]
85 Dark or light states correspond to a prediction of a running sum being positive or negative. [sent-143, score-0.071]
86 States are numerically labeled to reflect the similarity to a small bit integer data type, one that never overflows. [sent-144, score-0.131]
87 Assuming a fixed probability of a positive versus negative update, in expectation the state of the chain should correspond to the sign. [sent-146, score-0.139]
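One way to picture such a chain is as a small saturating counter; the sketch below is an illustration of that general idea only, not the authors' construction, and the next sentence explains why it was not adopted here.

```python
class SaturatingCounter:
    """A k-state counter that never overflows: the state drifts up on +1
    updates and down on -1 updates, and the sign prediction is simply which
    half of the state space it currently occupies."""

    def __init__(self, k):
        self.k, self.state = k, k // 2

    def update(self, x):          # x is +1 or -1
        self.state = min(self.k, max(0, self.state + x))

    def sign(self):
        return 1 if 2 * self.state >= self.k else -1
```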
88 However if we are concerned with the global statistic, as we are here, then the assumption of a fixed probability update precludes the analysis of streaming sources that contain local irregularities. [sent-147, score-0.255]
89 9 In distributional semantics, consider a feature stream formed by sequentially reading the n-gram resource of Brants and Franz (2006). [sent-148, score-0.278]
90 The pair: (the dog : 3,502,485), can be viewed as a feature value pair: (leftWord= ’the ’ : 3,502,485), with respect to online signature generation for the word dog. [sent-149, score-0.185]
91 Rather than viewing this feature repeatedly, spread over a large corpus, the update happens just once, with large magnitude. [sent-150, score-0.124]
92 Reservoir Counting, representing an online uniform sample, is agnostic to the ordering of elements in the stream. [sent-153, score-0.126]
93 4 Conclusion We have presented a novel approximation scheme we call Reservoir Counting, motivated here by a desire for greater space efficiency in Online Locality Sensitive Hashing. [sent-154, score-0.084]
94 Going beyond our results provided for synthetic data, future work will explore applications of this technique, such as in experiments with streaming social media like Twitter. [sent-155, score-0.102]
95 E.g., a stream such as (1, 1, ..., 1, 1, -1, -1, -1) is overall positive, but locally negative at the end. [sent-160, score-0.074]
wordName wordTfidf (topN-words)
[('reservoir', 0.755), ('stream', 0.209), ('updates', 0.163), ('counter', 0.152), ('counting', 0.133), ('durme', 0.118), ('hash', 0.118), ('lall', 0.111), ('streaming', 0.102), ('streams', 0.102), ('update', 0.096), ('projection', 0.093), ('projections', 0.092), ('locality', 0.089), ('bac', 0.089), ('loge', 0.089), ('online', 0.075), ('lsh', 0.072), ('van', 0.065), ('unit', 0.062), ('memory', 0.062), ('accepted', 0.062), ('approximation', 0.06), ('ashwin', 0.059), ('bit', 0.058), ('elements', 0.051), ('sign', 0.05), ('sensitive', 0.047), ('integer', 0.044), ('dae', 0.044), ('goemans', 0.044), ('reservoirs', 0.044), ('reservoirupdate', 0.044), ('ks', 0.044), ('matrix', 0.043), ('negative', 0.042), ('signatures', 0.042), ('value', 0.042), ('distributional', 0.041), ('benjamin', 0.04), ('signature', 0.04), ('sums', 0.039), ('bsc', 0.039), ('cancel', 0.039), ('goyal', 0.039), ('indyk', 0.039), ('petrovic', 0.039), ('expected', 0.038), ('leads', 0.038), ('mechanism', 0.038), ('positive', 0.037), ('bergsma', 0.037), ('ravichandran', 0.037), ('projected', 0.036), ('sampling', 0.036), ('coin', 0.036), ('vectors', 0.036), ('kenneth', 0.035), ('accept', 0.035), ('chain', 0.034), ('ping', 0.034), ('bm', 0.034), ('floating', 0.034), ('resultant', 0.034), ('sum', 0.034), ('operations', 0.033), ('simulating', 0.032), ('recurrence', 0.032), ('vt', 0.032), ('locally', 0.032), ('concerned', 0.031), ('maintaining', 0.031), ('zeros', 0.031), ('hb', 0.031), ('savings', 0.031), ('tracking', 0.03), ('size', 0.03), ('traversing', 0.03), ('deepak', 0.03), ('proportion', 0.03), ('pantel', 0.029), ('samples', 0.029), ('hashing', 0.029), ('shane', 0.029), ('bhagat', 0.029), ('similarity', 0.029), ('lines', 0.028), ('feature', 0.028), ('sketch', 0.028), ('cos', 0.027), ('probability', 0.026), ('array', 0.025), ('random', 0.024), ('simulation', 0.024), ('track', 0.024), ('values', 0.024), ('sample', 0.024), ('space', 0.024), ('half', 0.024), ('approximations', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999997 113 acl-2011-Efficient Online Locality Sensitive Hashing via Reservoir Counting
Author: Benjamin Van Durme ; Ashwin Lall
Abstract: We describe a novel mechanism called Reservoir Counting for application in online Locality Sensitive Hashing. This technique allows for significant savings in the streaming setting, allowing for maintaining a larger number of signatures, or an increased level of approximation accuracy at a similar memory footprint.
2 0.10945757 276 acl-2011-Semi-Supervised SimHash for Efficient Document Similarity Search
Author: Qixia Jiang ; Maosong Sun
Abstract: Searching documents that are similar to a query document is an important component in modern information retrieval. Some existing hashing methods can be used for efficient document similarity search. However, unsupervised hashing methods cannot incorporate prior knowledge for better hashing. Although some supervised hashing methods can derive effective hash functions from prior knowledge, they are either computationally expensive or poorly discriminative. This paper proposes a novel (semi-)supervised hashing method named Semi-Supervised SimHash (S3H) for high-dimensional data similarity search. The basic idea of S3H is to learn the optimal feature weights from prior knowledge to relocate the data such that similar data have similar hash codes. We evaluate our method with several state-of-the-art methods on two large datasets. All the results show that our method gets the best performance. 1
3 0.095409237 189 acl-2011-K-means Clustering with Feature Hashing
Author: Hajime Senuma
Abstract: One of the major problems of K-means is that one must use dense vectors for its centroids, and therefore it is infeasible to store such huge vectors in memory when the feature space is high-dimensional. We address this issue by using feature hashing (Weinberger et al., 2009), a dimension-reduction technique, which can reduce the size of dense vectors while retaining sparsity of sparse vectors. Our analysis gives theoretical motivation and justification for applying feature hashing to Kmeans, by showing how much will the objective of K-means be (additively) distorted. Furthermore, to empirically verify our method, we experimented on a document clustering task.
4 0.085847117 135 acl-2011-Faster and Smaller N-Gram Language Models
Author: Adam Pauls ; Dan Klein
Abstract: N-gram language models are a major resource bottleneck in machine translation. In this paper, we present several language model implementations that are both highly compact and fast to query. Our fastest implementation is as fast as the widely used SRILM while requiring only 25% of the storage. Our most compact representation can store all 4 billion n-grams and associated counts for the Google n-gram corpus in 23 bits per n-gram, the most compact lossless representation to date, and even more compact than recent lossy compression techniques. We also discuss techniques for improving query speed during decoding, including a simple but novel language model caching technique that improves the query speed of our language models (and SRILM) by up to 300%.
5 0.041080557 78 acl-2011-Confidence-Weighted Learning of Factored Discriminative Language Models
Author: Viet Ha Thuc ; Nicola Cancedda
Abstract: Language models based on word surface forms only are unable to benefit from available linguistic knowledge, and tend to suffer from poor estimates for rare features. We propose an approach to overcome these two limitations. We use factored features that can flexibly capture linguistic regularities, and we adopt confidence-weighted learning, a form of discriminative online learning that can better take advantage of a heavy tail of rare features. Finally, we extend the confidence-weighted learning to deal with label noise in training data, a common case with discriminative language modeling.
6 0.040870946 117 acl-2011-Entity Set Expansion using Topic information
7 0.039410714 121 acl-2011-Event Discovery in Social Media Feeds
8 0.038017735 198 acl-2011-Latent Semantic Word Sense Induction and Disambiguation
9 0.037878204 332 acl-2011-Using Multiple Sources to Construct a Sentiment Sensitive Thesaurus for Cross-Domain Sentiment Classification
10 0.036918182 37 acl-2011-An Empirical Evaluation of Data-Driven Paraphrase Generation Techniques
11 0.035740055 331 acl-2011-Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation
12 0.035413798 36 acl-2011-An Efficient Indexer for Large N-Gram Corpora
13 0.034942843 104 acl-2011-Domain Adaptation for Machine Translation by Mining Unseen Words
14 0.034507342 103 acl-2011-Domain Adaptation by Constraining Inter-Domain Variability of Latent Feature Representation
15 0.034410339 2 acl-2011-AM-FM: A Semantic Framework for Translation Quality Assessment
16 0.034005128 15 acl-2011-A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction
17 0.033301402 323 acl-2011-Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections
18 0.033283096 231 acl-2011-Nonlinear Evidence Fusion and Propagation for Hyponymy Relation Mining
19 0.032836497 335 acl-2011-Why Initialization Matters for IBM Model 1: Multiple Optima and Non-Strict Convexity
20 0.032543737 243 acl-2011-Partial Parsing from Bitext Projections
topicId topicWeight
[(0, 0.102), (1, 0.016), (2, -0.01), (3, 0.003), (4, -0.014), (5, -0.029), (6, 0.007), (7, 0.007), (8, -0.012), (9, -0.005), (10, 0.009), (11, 0.01), (12, 0.024), (13, 0.03), (14, -0.02), (15, 0.016), (16, -0.054), (17, 0.008), (18, 0.002), (19, -0.028), (20, 0.069), (21, -0.031), (22, 0.047), (23, -0.029), (24, -0.05), (25, -0.08), (26, 0.002), (27, 0.015), (28, 0.062), (29, -0.06), (30, 0.11), (31, 0.033), (32, 0.125), (33, 0.148), (34, 0.008), (35, -0.056), (36, 0.042), (37, -0.028), (38, -0.009), (39, 0.012), (40, -0.027), (41, 0.033), (42, -0.024), (43, -0.012), (44, -0.006), (45, 0.022), (46, 0.055), (47, -0.013), (48, -0.001), (49, -0.1)]
simIndex simValue paperId paperTitle
same-paper 1 0.90831405 113 acl-2011-Efficient Online Locality Sensitive Hashing via Reservoir Counting
Author: Benjamin Van Durme ; Ashwin Lall
Abstract: We describe a novel mechanism called Reservoir Counting for application in online Locality Sensitive Hashing. This technique allows for significant savings in the streaming setting, allowing for maintaining a larger number of signatures, or an increased level of approximation accuracy at a similar memory footprint.
2 0.87858623 189 acl-2011-K-means Clustering with Feature Hashing
Author: Hajime Senuma
Abstract: One of the major problems of K-means is that one must use dense vectors for its centroids, and therefore it is infeasible to store such huge vectors in memory when the feature space is high-dimensional. We address this issue by using feature hashing (Weinberger et al., 2009), a dimension-reduction technique, which can reduce the size of dense vectors while retaining sparsity of sparse vectors. Our analysis gives theoretical motivation and justification for applying feature hashing to Kmeans, by showing how much will the objective of K-means be (additively) distorted. Furthermore, to empirically verify our method, we experimented on a document clustering task.
3 0.87671244 276 acl-2011-Semi-Supervised SimHash for Efficient Document Similarity Search
Author: Qixia Jiang ; Maosong Sun
Abstract: Searching documents that are similar to a query document is an important component in modern information retrieval. Some existing hashing methods can be used for efficient document similarity search. However, unsupervised hashing methods cannot incorporate prior knowledge for better hashing. Although some supervised hashing methods can derive effective hash functions from prior knowledge, they are either computationally expensive or poorly discriminative. This paper proposes a novel (semi-)supervised hashing method named Semi-Supervised SimHash (S3H) for high-dimensional data similarity search. The basic idea of S3H is to learn the optimal feature weights from prior knowledge to relocate the data such that similar data have similar hash codes. We evaluate our method with several state-of-the-art methods on two large datasets. All the results show that our method gets the best performance. 1
4 0.63746876 135 acl-2011-Faster and Smaller N-Gram Language Models
Author: Adam Pauls ; Dan Klein
Abstract: N-gram language models are a major resource bottleneck in machine translation. In this paper, we present several language model implementations that are both highly compact and fast to query. Our fastest implementation is as fast as the widely used SRILM while requiring only 25% of the storage. Our most compact representation can store all 4 billion n-grams and associated counts for the Google n-gram corpus in 23 bits per n-gram, the most compact lossless representation to date, and even more compact than recent lossy compression techniques. We also discuss techniques for improving query speed during decoding, including a simple but novel language model caching technique that improves the query speed of our language models (and SRILM) by up to 300%.
5 0.4639689 36 acl-2011-An Efficient Indexer for Large N-Gram Corpora
Author: Hakan Ceylan ; Rada Mihalcea
Abstract: We introduce a new publicly available tool that implements efficient indexing and retrieval of large N-gram datasets, such as the Web1T 5-gram corpus. Our tool indexes the entire Web1T dataset with an index size of only 100 MB and performs a retrieval of any N-gram with a single disk access. With an increased index size of 420 MB and duplicate data, it also allows users to issue wild card queries provided that the wild cards in the query are contiguous. Furthermore, we also implement some of the smoothing algorithms that are designed specifically for large datasets and are shown to yield better language models than the traditional ones on the Web1T 5gram corpus (Yuret, 2008). We demonstrate the effectiveness of our tool and the smoothing algorithms on the English Lexical Substitution task by a simple implementation that gives considerable improvement over a basic language model.
6 0.40546247 239 acl-2011-P11-5002 k2opt.pdf
7 0.40187997 212 acl-2011-Local Histograms of Character N-grams for Authorship Attribution
8 0.38920346 97 acl-2011-Discovering Sociolinguistic Associations with Structured Sparsity
9 0.3848466 335 acl-2011-Why Initialization Matters for IBM Model 1: Multiple Optima and Non-Strict Convexity
10 0.38080192 319 acl-2011-Unsupervised Decomposition of a Document into Authorial Components
11 0.3656958 74 acl-2011-Combining Indicators of Allophony
12 0.35623017 207 acl-2011-Learning to Win by Reading Manuals in a Monte-Carlo Framework
13 0.34617814 231 acl-2011-Nonlinear Evidence Fusion and Propagation for Hyponymy Relation Mining
14 0.3444559 234 acl-2011-Optimal Head-Driven Parsing Complexity for Linear Context-Free Rewriting Systems
15 0.33547044 303 acl-2011-Tier-based Strictly Local Constraints for Phonology
16 0.33366477 1 acl-2011-(11-06-spirl)
17 0.33239079 342 acl-2011-full-for-print
18 0.33187985 291 acl-2011-SystemT: A Declarative Information Extraction System
19 0.33130097 102 acl-2011-Does Size Matter - How Much Data is Required to Train a REG Algorithm?
20 0.3248989 139 acl-2011-From Bilingual Dictionaries to Interlingual Document Representations
topicId topicWeight
[(0, 0.012), (5, 0.018), (17, 0.066), (26, 0.025), (37, 0.083), (39, 0.036), (41, 0.099), (53, 0.012), (55, 0.06), (59, 0.032), (72, 0.022), (75, 0.287), (91, 0.026), (96, 0.122)]
simIndex simValue paperId paperTitle
1 0.94229549 303 acl-2011-Tier-based Strictly Local Constraints for Phonology
Author: Jeffrey Heinz ; Chetan Rawal ; Herbert G. Tanner
Abstract: Beginning with Goldsmith (1976), the phonological tier has a long history in phonological theory to describe non-local phenomena. This paper defines a class of formal languages, the Tier-based Strictly Local languages, which begin to describe such phenomena. Then this class is located within the Subregular Hierarchy (McNaughton and Papert, 1971). It is found that these languages contain the Strictly Local languages, are star-free, are incomparable with other known sub-star-free classes, and have other interesting properties.
same-paper 2 0.79007632 113 acl-2011-Efficient Online Locality Sensitive Hashing via Reservoir Counting
Author: Benjamin Van Durme ; Ashwin Lall
Abstract: We describe a novel mechanism called Reservoir Counting for application in online Locality Sensitive Hashing. This technique allows for significant savings in the streaming setting, allowing for maintaining a larger number of signatures, or an increased level of approximation accuracy at a similar memory footprint.
Author: Omar F. Zaidan ; Chris Callison-Burch
Abstract: The written form of Arabic, Modern Standard Arabic (MSA), differs quite a bit from the spoken dialects of Arabic, which are the true “native” languages of Arabic speakers used in daily life. However, due to MSA’s prevalence in written form, almost all Arabic datasets have predominantly MSA content. We present the Arabic Online Commentary Dataset, a 52M-word monolingual dataset rich in dialectal content, and we describe our long-term annotation effort to identify the dialect level (and dialect itself) in each sentence of the dataset. So far, we have labeled 108K sentences, 41% of which as having dialectal content. We also present experimental results on the task of automatic dialect identification, using the collected labels for training and evaluation.
4 0.6532501 146 acl-2011-Goodness: A Method for Measuring Machine Translation Confidence
Author: Nguyen Bach ; Fei Huang ; Yaser Al-Onaizan
Abstract: State-of-the-art statistical machine translation (MT) systems have made significant progress towards producing user-acceptable translation output. However, there is still no efficient way for MT systems to inform users which words are likely translated correctly and how confident it is about the whole sentence. We propose a novel framework to predict word-level and sentence-level MT errors with a large number of novel features. Experimental results show that the MT error prediction accuracy is increased from 69.1 to 72.2 in F-score. The Pearson correlation between the proposed confidence measure and the human-targeted translation edit rate (HTER) is 0.6. Improvements between 0.4 and 0.9 TER reduction are obtained with the n-best list reranking task using the proposed confidence measure. Also, we present a visualization prototype of MT errors at the word and sentence levels with the objective to improve post-editor productivity.
5 0.64248848 72 acl-2011-Collecting Highly Parallel Data for Paraphrase Evaluation
Author: David Chen ; William Dolan
Abstract: A lack of standard datasets and evaluation metrics has prevented the field of paraphrasing from making the kind of rapid progress enjoyed by the machine translation community over the last 15 years. We address both problems by presenting a novel data collection framework that produces highly parallel text data relatively inexpensively and on a large scale. The highly parallel nature of this data allows us to use simple n-gram comparisons to measure both the semantic adequacy and lexical dissimilarity of paraphrase candidates. In addition to being simple and efficient to compute, experiments show that these metrics correlate highly with human judgments.
6 0.55909324 65 acl-2011-Can Document Selection Help Semi-supervised Learning? A Case Study On Event Extraction
7 0.55675435 196 acl-2011-Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models
8 0.55542022 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction
9 0.55135787 135 acl-2011-Faster and Smaller N-Gram Language Models
10 0.55048728 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering
11 0.55034125 36 acl-2011-An Efficient Indexer for Large N-Gram Corpora
12 0.54966789 237 acl-2011-Ordering Prenominal Modifiers with a Reranking Approach
13 0.54877651 58 acl-2011-Beam-Width Prediction for Efficient Context-Free Parsing
14 0.54871601 311 acl-2011-Translationese and Its Dialects
15 0.54660076 15 acl-2011-A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction
16 0.54459941 94 acl-2011-Deciphering Foreign Language
17 0.54434389 172 acl-2011-Insertion, Deletion, or Substitution? Normalizing Text Messages without Pre-categorization nor Supervision
18 0.54362798 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment
19 0.54351318 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing
20 0.54307342 244 acl-2011-Peeling Back the Layers: Detecting Event Role Fillers in Secondary Contexts