emnlp emnlp2011 emnlp2011-19 knowledge-graph by maker-knowledge-mining

19 emnlp-2011-Approximate Scalable Bounded Space Sketch for Large Data NLP


Source: pdf

Author: Amit Goyal ; Hal Daume III

Abstract: We exploit sketch techniques, especially the Count-Min sketch, a memory, and time efficient framework which approximates the frequency of a word pair in the corpus without explicitly storing the word pair itself. These methods use hashing to deal with massive amounts of streaming text. We apply CountMin sketch to approximate word pair counts and exhibit their effectiveness on three important NLP tasks. Our experiments demonstrate that on all of the three tasks, we get performance comparable to Exact word pair counts setting and state-of-the-art system. Our method scales to 49 GB of unzipped web data using bounded space of 2 billion counters (8 GB memory).

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 We exploit sketch techniques, especially the Count-Min sketch, a memory, and time efficient framework which approximates the frequency of a word pair in the corpus without explicitly storing the word pair itself. [sent-4, score-0.773]

2 We apply CountMin sketch to approximate word pair counts and exhibit their effectiveness on three important NLP tasks. [sent-6, score-0.803]

3 Our experiments demonstrate that on all of the three tasks, we get performance comparable to Exact word pair counts setting and state-of-the-art system. [sent-7, score-0.225]

4 Our method scales to 49 GB of unzipped web data using bounded space of 2 billion counters (8 GB memory). [sent-8, score-0.583]

5 However, explicitly storing the counts of all word pairs is both computationally expensive and memory intensive (Agirre et al. [sent-38, score-0.364]

6 We explore Count-Min (CM) sketch to address the issue of efficient storage of such data. [sent-42, score-0.607]

7 The CM sketch stores counts of all word pairs within a bounded space. [sent-43, score-0.848]

8 Storage space saving is achieved by approximating the frequency of word pairs in the corpus without explicitly storing the word pairs themselves. [sent-44, score-0.263]

9 Both updating (adding a new word pair or increasing the frequency of an existing word pair) and querying (finding the frequency of a given word pair) are constant time operations, making it an efficient online storage data structure for large data. [sent-45, score-0.275]

10 We use the CM sketch to store counts of word pairs (except word pairs involving stop words) within a window of size 7 over different-sized corpora. [sent-47, score-1.005]

11 We store exact counts of words (except stop words) in a hash table (since the number of unique words is not large; it is quadratically smaller than the number of unique word pairs). [sent-48, score-0.492]

12 The approximate PMI and LLR scores are computed using these approximate counts and are applied to solve our three NLP tasks. [sent-49, score-0.218]
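
For concreteness, both scores can be computed from the exact word counts (hash table), the approximate word pair counts (sketch), and the stream length N alone. The helpers below are a generic sketch under standard definitions (PMI as in Church and Hanks, 1989; LLR as Dunning's G2 over the 2x2 contingency table); the log base and any smoothing used in the paper are not specified in these sentences, so treat those details as assumptions.

```python
import math

def pmi(c_xy, c_x, c_y, n):
    # Pointwise mutual information from raw counts:
    # PMI(x, y) = log( P(x, y) / (P(x) * P(y)) ) = log( N * c(x,y) / (c(x) * c(y)) )
    return math.log2(n * c_xy / (c_x * c_y))

def llr(c_xy, c_x, c_y, n):
    # Dunning-style log-likelihood ratio (G^2) over the 2x2 contingency table
    # of "x present/absent" versus "y present/absent".
    table = [[c_xy,        c_x - c_xy],
             [c_y - c_xy,  n - c_x - c_y + c_xy]]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            observed = table[i][j]
            expected = (table[i][0] + table[i][1]) * (table[0][j] + table[1][j]) / n
            if observed > 0:
                g2 += observed * math.log(observed / expected)
    return 2.0 * g2
```

Here c_xy would come from a sketch query, c_x and c_y from the exact word-count hash table, and n is the total number of word pairs seen (the stream length).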

13 Our experiments demonstrate that on all of the three tasks, we get performance comparable to Exact word pair counts setting and state-of-the-art system. [sent-50, score-0.225]

14 Our method scales to 49 GB of unzipped web data using bounded space of 2 billion counters (8 GB memory). [sent-51, score-0.583]

15 2 Sketch Techniques A sketch is a compact summary data structure to store the frequencies of all items in the input stream. [sent-55, score-0.685]

16 Sketching techniques use hashing to map items in streaming data onto a small sketch vector that can be updated and queried in constant time. [sent-56, score-0.661]

17 There exists an extensive literature on sketch techniques (Charikar et al. [sent-61, score-0.57]

18 However, in practice, researchers have preferred the Count-Min (CM) sketch over other sketch techniques in many application areas, such as Security (Schechter et al. [sent-64, score-1.14]

19 This motivated us to explore CM sketch to solve three important NLP problems. [sent-68, score-0.57]

20 1 Count-Min Sketch The Count-Min sketch (Cormode and Muthukrishnan, 2004) is a compact summary data structure used to store the frequencies of all items in the input stream. [sent-70, score-0.685]

21 The sketch allows fundamental queries on the data stream such as point, range and inner product queries to be approximately answered very quickly. [sent-71, score-0.7]

22 Given an input stream of word pairs of length N and user-chosen parameters δ and ε [sent-76, score-0.221]

23 The algorithm stores the frequencies of all the word pairs with the following guarantees: all reported frequencies are within the true frequencies by at most εN. [sent-77, score-0.195]

24 In the future, in another line of research, we will explore comparing different sketch techniques for NLP problems. [sent-81, score-0.57]

25 1 CM Data Structure A Count-Min sketch (CM) with parameters (ε, δ) is represented by a two-dimensional array of width w and depth d. [sent-85, score-0.57]

26 The depth d denotes the number of pairwise-independent hash functions employed by the algorithm, and there exists a one-to-one correspondence between the rows and the set of hash functions. [sent-95, score-0.404]

27 Each hash function hk : {x1 . . . xN} → {1 . . . w}, 1 ≤ k ≤ d, takes a word pair from the input stream and maps it into a counter indexed by the corresponding hash function. [sent-102, score-0.283]

28 For example, h2 (x) = 10 indicates that the word pair “x” is mapped to the 10th position in the second row of the sketch array. [sent-103, score-0.631]

29 Update Procedure: When a new word pair “x” with count c arrives, one counter in each row (as decided by its corresponding hash function) is updated by c. [sent-105, score-0.318]

30 sketch[k, hk (x)] ← sketch[k, hk (x)] + c, ∀1 ≤ k ≤ d Query Procedure: Since multiple word pairs can get hashed to the same position, the frequency stored by each position is guaranteed to overestimate the true count. [sent-106, score-0.318]

31 Thus, to answer the point query for a given word pair, we return the minimum over all the positions indexed by the d hash functions. [sent-107, score-0.238]

32 The answer to Query(x) := min_k sketch[k, hk(x)]. Both update and query procedures involve evaluating d hash functions and reading all the values in those indices; hence both procedures are linear in the number of hash functions. [sent-108, score-0.551]
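
The update and query rules just described translate almost directly into code. Below is a minimal Python sketch of the data structure; the class name and the use of Python's built-in hash salted per row (standing in for the pairwise-independent hash family) are illustrative assumptions, not the authors' implementation.

```python
import random

class CountMinSketch:
    def __init__(self, width, depth, seed=0):
        self.width, self.depth = width, depth
        rng = random.Random(seed)
        # one random salt per row plays the role of one hash function per row
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, k, item):
        # h_k(item): position of the item in row k
        return hash((self.salts[k], item)) % self.width

    def update(self, item, count=1):
        # sketch[k, h_k(x)] <- sketch[k, h_k(x)] + c for every row k
        for k in range(self.depth):
            self.table[k][self._index(k, item)] += count

    def query(self, item):
        # point query: the minimum over the d counters the item hashes to
        return min(self.table[k][self._index(k, item)]
                   for k in range(self.depth))
```

A query can overestimate the true count (because of collisions) but never underestimates it, which is why taking the minimum over the d rows is safe.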

33 2 Properties Apart from the advantages of being space efficient and having constant update and constant querying time, the Count-Min sketch also has other advantages that make it an attractive choice for NLP applications. [sent-118, score-0.634]

34 • Linearity: Given two sketches s1 and s2 computed (using the same parameters w and d) over different input streams, the sketch of the combined data stream can be easily obtained by adding the individual sketches in O(1/ε · log(1/δ)) time. [sent-119, score-0.862]

35 The linearity is especially attractive because it allows the individual sketches to be computed independently of each other, which means that it is easy to implement in a distributed setting, where each machine computes the sketch over a subset of the corpus. [sent-121, score-0.599]
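
A small illustration of the linearity property, reusing the hypothetical CountMinSketch class sketched above: two sketches built over different shards of the corpus with the same width, depth, and hash functions can be combined by element-wise addition.

```python
def merge(s1, s2):
    # Element-wise addition of two sketches built with identical (w, d) and
    # identical hash functions yields the sketch of the concatenated streams.
    assert (s1.width, s1.depth, s1.salts) == (s2.width, s2.depth, s2.salts)
    merged = CountMinSketch(s1.width, s1.depth)
    merged.salts = list(s1.salts)
    merged.table = [[a + b for a, b in zip(row1, row2)]
                    for row1, row2 in zip(s1.table, s2.table)]
    return merged
```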

36 This can easily be used with CM sketch to further improve the estimate of a point query. [sent-124, score-0.57]

37 In our experiments, we found that employing the conservative update reduces the Average Relative Error (ARE) of these counts approximately by a factor of 1.5. [sent-127, score-0.251]
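
The conservative update rule itself is not spelled out in the extracted sentences; the standard formulation (Estan and Varghese) is sketched below on top of the hypothetical class above. Each update first computes the current estimate and then raises each of the d counters only as far as estimate + c, which is what curbs over-estimation for low-frequency pairs; the query procedure is unchanged.

```python
class ConservativeUpdateSketch(CountMinSketch):
    def update(self, item, count=1):
        # raise each counter only to max(current value, current estimate + c)
        positions = [self._index(k, item) for k in range(self.depth)]
        target = min(self.table[k][positions[k]]
                     for k in range(self.depth)) + count
        for k in range(self.depth):
            if self.table[k][positions[k]] < target:
                self.table[k][positions[k]] = target
```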

38 3 Intrinsic Evaluations To show the effectiveness of the CM sketch and CM sketch with conservative update (CU) in the context of NLP, we perform intrinsic evaluations. [sent-132, score-1.299]

39 First, the intrinsic evaluations are designed to measure the error in the approximate counts returned by CM sketch compared to their true counts. [sent-133, score-0.827]

40 Second, we compare the word pair association rankings obtained using PMI and LLR with sketch and exact counts. [sent-134, score-0.787]

41 We store exact counts of words (except stop words) in a hash table and store approximate counts of word pairs (except word pairs involving stop words) in the sketch. [sent-141, score-0.906]

42 1 Evaluating approximate sketch counts To evaluate the amount of over-estimation error (see Section 2. [sent-143, score-0.742]

43 1) in CM and CU counts compared to the true counts, we first group all word pairs with the same true frequency into a single bucket. [sent-144, score-0.304]

44 Average Relative Error (ARE) is defined as the average of the absolute difference between the predicted and the exact value, divided by the exact value, over all the word pairs in each bucket. [sent-148, score-0.233]

45 ARE = (1/N) Σ_{i=1}^{N} |Exact_i − Predicted_i| / Exact_i, where Exact and Predicted denote the exact and the CM/CU counts respectively; N denotes the number of word pairs with the same counts in a bucket. [sent-149, score-0.414]
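
A direct transcription of this definition (hypothetical helper name; one call per bucket of word pairs sharing the same true frequency):

```python
def average_relative_error(exact, predicted):
    # ARE = (1/N) * sum_i |Exact_i - Predicted_i| / Exact_i over one bucket
    return sum(abs(e - p) / e for e, p in zip(exact, predicted)) / len(exact)
```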

46 In Figure 1(a), we fixed the number of counters to 20 million (20M) with four bytes of memory per counter (thus it only requires 80 MB of main memory). [sent-151, score-0.255]

47 Keeping the total number of counters fixed, we try different values of depth (2, 3, 5 and 7) of the sketch array, and in each case the width is set to 20M/d. [sent-152, score-1.146]

48 The ARE is lower (by a factor of about 1.5) for the runs which use conservative update (CUx runs) compared to the runs that use the direct CM sketch (CMx runs). [sent-158, score-0.695]

49 To be more certain about this behavior with respect to different settings of width and depth, we tried another setting by increasing the number of counters to 50 million. [sent-161, score-0.481]

50 Low-frequency word pairs are more prone to error compared to the frequent ones, and employing conservative update reduces the ARE by a factor of about 1.5. [sent-163, score-0.251]

51 We use CU counts and depth of 5 for the rest of the paper. [sent-166, score-0.2]

52 As depths 3 and 5 have the lowest ARE in different settings, and using 5 hash functions, we get δ = 0. [sent-167, score-0.203]

53 Figure 1(c) studies the effect of the number of counters in the sketch (the size of the two-dimensional sketch array) on the ARE with a fixed depth of 5. [sent-172, score-1.621]

54 This is intuitive because, as the length of each row in the sketch increases, the probability of collision decreases and hence the array is more likely to contain true counts. [sent-174, score-0.682]

55 The notation CMx represents the Count-Min sketch with a depth of 'x', and CUx represents the CM sketch along with conservative update and depth 'x'. [sent-176, score-1.413]

56 Note that the space we save by not storing the exact counts is almost four times the memory that we use here, because on average each word pair is twelve characters long and requires twelve bytes (thrice the size of an integer) plus 4 bytes for storing the integer count (i.e., 16 bytes per stored pair versus 4 bytes per counter). [sent-178, score-0.488]

57 2 Evaluating word pair association ranking In this experiment, we compare the word pair association rankings obtained using PMI and LLR with CU and exact word pair counts. [sent-181, score-0.369]

58 Intuitively, recall captures the number of word pairs that are found in both sets, and Spearman's correlation captures whether the relative order of these common word pairs is preserved in both rankings. [sent-183, score-0.182]
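
A minimal sketch of that comparison, assuming SciPy for the rank correlation; tie handling and the exact value of K used in the paper are not specified in these sentences.

```python
from scipy.stats import spearmanr

def compare_rankings(exact_topk, approx_topk):
    # exact_topk / approx_topk: word pairs ordered by association score
    approx_set = set(approx_topk)
    common = [p for p in exact_topk if p in approx_set]
    recall = len(common) / len(exact_topk)
    # Spearman's rho over the ranks of the shared pairs in the two lists
    exact_rank = {p: r for r, p in enumerate(exact_topk)}
    approx_rank = {p: r for r, p in enumerate(approx_topk)}
    rho, _ = spearmanr([exact_rank[p] for p in common],
                       [approx_rank[p] for p in common])
    return recall, rho
```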

59 The results with respect to different-sized counter models (20 million (20M), 50 million (50M)) are shown in Table 1. [sent-185, score-0.195]

60 If we compare the second and third columns of the table, using PMI and LLR for 20M counters, we get exact rankings for LLR compared to PMI when comparing top-K word pairs. [sent-186, score-0.199]

61 Even with other datasets we found that using counters linear in the size of the stream leads to an ARE close to zero for all counts. [sent-187, score-0.537]

62 The explanation for such a behavior is: since we are not throwing away any infrequent word pairs, PMI will rank pairs with low frequency counts higher (Church and Hanks, 1989). [sent-188, score-1.073]

63 Hence, we are evaluating the PMI values for rare word pairs, and we need counters linear in the size of the stream to get an almost perfect ranking. [sent-189, score-0.666]

64 This is also evident from the fourth column (50M) of Table 1, where the CU PMI ranking gets close to the optimal as the number of counters approaches the stream size. [sent-190, score-0.537]

65 In such cases, even using a number of counters less than linear in the size of the stream would suffice. [sent-192, score-0.407]

66 We store exact counts of words in a hash table and store approximate counts of word pairs in the sketch. [sent-201, score-0.753]

67 Hence, the stream size in our case is the total number of word pairs in a corpus. [sent-202, score-0.221]

68 Figure 2: Evaluating Semantic Orientation using PMI and LLR with different numbers of counters of the CU sketch built using Gigaword. [sent-224, score-0.977]

69 1 Varying sketch size We evaluate SO of words using PMI and LLR on Gigaword (9. [sent-227, score-0.57]

70 We compare approximate SO computed using varying sizes of CU sketches: 50 million (50M), 100M, 200M, 500M, 1 billion (1B) and 2 billion (2B) counters with Exact SO. [sent-229, score-0.56]

71 Note that computing the exact counts of all word pairs on these corpora is computationally expensive and memory intensive, so we consider only those pairs in which one word appears in the prototype list and the other word appears in the test set. [sent-231, score-0.482]
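
The semantic-orientation computation itself is only implied by the extracted sentences (a prototype list of positive and negative seed words). The sketch below follows the common Turney and Littman style formulation, which the setup suggests; it is an assumption, not necessarily the paper's exact recipe.

```python
def semantic_orientation(word, positive_seeds, negative_seeds, assoc):
    # assoc(w, seed) is an association score (e.g. PMI or LLR) computed from
    # the sketch counts; the word is labelled positive when SO(word) > 0.
    pos = sum(assoc(word, seed) for seed in positive_seeds)
    neg = sum(assoc(word, seed) for seed in negative_seeds)
    return pos - neg
```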

72 Second, for both PMI and LLR, having a larger number of counters improves performance. [sent-235, score-0.407]

73 We use a maximum of 2B counters (8 GB main memory), as most of the current desktop machines have at most 8 GB of RAM. [sent-237, score-0.407]

74 Second, we will fix the number of counters to 2B (CU-2B) as it performs the best in Section 4. [sent-245, score-0.407]

75 Hence, it shows that using fewer counters than the stream length does not degrade the performance. [sent-259, score-0.537]

76 Once we have context vectors for each of the terms, the cosine similarity measure returns the distributional similarity between terms. [sent-267, score-0.171]

77 1 Efficient Distributional Similarity We propose an efficient approach for computing distributional similarity between word pairs using CU sketch. [sent-270, score-0.216]

78 In the first step, we traverse the corpus and store counts of all words (except stop words) in a hash table and counts of all word pairs (except word pairs involving stop words) in the sketch. [sent-271, score-0.599]

79 ), and query the sketch for vocabulary number of word pairs, and compute approximate AS between word-context pairs. [sent-274, score-0.689]

80 In the third step, we compute cosine similarity over these approximate top-K context vectors to obtain efficient distributional similarity. [sent-276, score-0.171]
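
Put together, the three steps look roughly like the following, reusing the hypothetical CountMinSketch interface from earlier. The window handling, the canonicalization of word pairs, and the use of PMI as the association score are illustrative assumptions.

```python
import heapq
import math
from collections import Counter

def build_counts(sentences, sketch, window=7, stopwords=frozenset()):
    # Step 1: exact word counts in a hash table (Counter), approximate
    # word pair counts in the sketch.
    word_counts = Counter()
    n_pairs = 0
    for tokens in sentences:
        content = [t for t in tokens if t not in stopwords]
        word_counts.update(content)
        for i, w in enumerate(content):
            for c in content[i + 1:i + window]:
                sketch.update(tuple(sorted((w, c))))
                n_pairs += 1
    return word_counts, n_pairs

def topk_contexts(word, word_counts, sketch, n_pairs, k=1000):
    # Step 2: query the sketch for each candidate context and keep the
    # top-K contexts by approximate PMI as the context vector.
    scores = {}
    for ctx, c_ctx in word_counts.items():
        if ctx == word:
            continue
        c_xy = sketch.query(tuple(sorted((word, ctx))))
        if c_xy > 0:
            scores[ctx] = math.log2(n_pairs * c_xy / (word_counts[word] * c_ctx))
    return dict(heapq.nlargest(k, scores.items(), key=lambda kv: kv[1]))

def cosine(v1, v2):
    # Step 3: cosine similarity between two sparse top-K context vectors.
    dot = sum(v1[c] * v2[c] for c in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(x * x for x in v1.values()))
    norm2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0
```

Distributional similarity between two test words is then cosine(topk_contexts(w1, ...), topk_contexts(w2, ...)).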

81 The efficient distributional similarity using sketches has the following advantages: • It can return semantic similarity between any word pairs that are stored in the sketch. [sent-277, score-0.308]

82 We generate the word pair rankings using efficient distributional similarity. [sent-286, score-0.195]

83 Figure 4: Evaluating Distributional Similarity between word pairs on the WS-353 test set using PMI and LLR with different numbers of counters of the CU sketch built using the Gigaword data-set; panels: (a) Word Similarity PMI, (b) Word Similarity LLR. [sent-301, score-1.068]

84 2 Varying sketch size We evaluate efficient distributional similarity between word pairs on the WS-353 test set using PMI and LLR association scores on Gigaword (9. [sent-304, score-0.786]

85 We compare different sizes of CU sketch (similar to SO evaluation): 50 million (50M), 100M, 200M, 500M, 1 billion (1B) and 2 billion (2B) counters with the Exact word pair counts. [sent-306, score-1.219]

86 Here again, computing the exact counts of all word-context pairs on these corpora is time- and memory-intensive, so we generate context vectors for only those words which are present in the test set. [sent-307, score-0.321]

87 First, if we look at word pair ranking using exact PMI and LLR across Figures 4(a) and 4(b) respectively, it shows that using LLR, we get better ρ of . [sent-308, score-0.17]

88 This is because PMI ranks word-context pairs with low frequency counts higher (Church and Hanks, 1989) compared to frequent ones, which are favored by LLR. [sent-313, score-0.217]

89 In Figure 4(b), having more counters improves performance, and using 2B counters we get ρ close to the Exact. [sent-326, score-0.445]

90 3 Effect of Increasing Corpus Size We evaluate efficient distributional similarity between word pairs using three different-sized corpora: GW (9. [sent-329, score-0.253]

91 Second, we will fix the number of counters to 2B (CU2B) as it performs the best in Section 4. [sent-333, score-0.407]

92 Hence, here again it shows that using fewer counters than the stream length does not degrade the performance. [sent-350, score-0.537]

93 Finally, we report variants of our approach using association scores computed on GWB50 using the CU sketch with 2 billion counters. [sent-404, score-0.644]

94 6 Discussion and Conclusion The advantage of using the sketch, in addition to being memory and time efficient, is that it contains counts for all word pairs and hence can be used to compute association scores like PMI and LLR between any word pairs. [sent-413, score-0.918]

95 We show that using sketch counts in our experiments, on the three tasks, we get performance comparable to Exact word pair counts setting and state-of-the-art system. [sent-414, score-0.921]

96 Our method scales to 49 GB of unzipped web data using bounded space of 2 billion counters (8 GB memory). [sent-415, score-0.583]

97 Moreover, the linearity property of the sketch makes it scalable and usable in a distributed setting. [sent-416, score-0.599]

98 Association scores and counts from sketch can be used for more NLP tasks like small-space randomized language models, word sense disambiguation, spelling correction, relation learning, paraphrasing, and machine translation. [sent-417, score-0.76]

99 An improved data stream summary: The count-min sketch and its applications. [sent-464, score-0.7]

100 One sketch for all: Theory and application of conditional random sampling. [sent-537, score-0.57]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('sketch', 0.57), ('counters', 0.407), ('llr', 0.339), ('pmi', 0.205), ('hash', 0.165), ('stream', 0.13), ('cm', 0.13), ('counts', 0.126), ('cu', 0.124), ('goyal', 0.122), ('cormode', 0.095), ('counter', 0.092), ('sketches', 0.081), ('distributional', 0.079), ('depth', 0.074), ('billion', 0.074), ('exact', 0.071), ('levenberg', 0.068), ('memory', 0.068), ('streaming', 0.066), ('hk', 0.064), ('store', 0.064), ('update', 0.064), ('durme', 0.063), ('bounded', 0.061), ('conservative', 0.061), ('gb', 0.059), ('muthukrishnan', 0.058), ('array', 0.058), ('pairs', 0.056), ('rankings', 0.055), ('agirre', 0.052), ('orientation', 0.051), ('gigaword', 0.049), ('approximate', 0.046), ('similarity', 0.046), ('storing', 0.046), ('amit', 0.043), ('turney', 0.043), ('littman', 0.042), ('ashwin', 0.041), ('estan', 0.041), ('lall', 0.041), ('rusu', 0.041), ('unzipped', 0.041), ('gw', 0.039), ('query', 0.038), ('get', 0.038), ('sized', 0.037), ('storage', 0.037), ('increasing', 0.037), ('width', 0.037), ('nlp', 0.036), ('frequency', 0.035), ('bytes', 0.035), ('jagarlamudi', 0.035), ('word', 0.035), ('intrinsic', 0.034), ('intensive', 0.033), ('million', 0.033), ('inquirer', 0.032), ('jagadeesh', 0.032), ('suresh', 0.032), ('window', 0.032), ('stop', 0.031), ('spearman', 0.03), ('ravichandran', 0.03), ('linearity', 0.029), ('randomized', 0.029), ('log', 0.029), ('hence', 0.028), ('van', 0.028), ('church', 0.027), ('daum', 0.027), ('abby', 0.027), ('aggarwal', 0.027), ('charikar', 0.027), ('cmx', 0.027), ('cux', 0.027), ('dobra', 0.027), ('dwork', 0.027), ('fancy', 0.027), ('manku', 0.027), ('mink', 0.027), ('ofcounters', 0.027), ('schechter', 0.027), ('semisup', 0.027), ('streambased', 0.027), ('streams', 0.027), ('usenix', 0.027), ('varghese', 0.027), ('seven', 0.026), ('pair', 0.026), ('frequencies', 0.026), ('hal', 0.026), ('true', 0.026), ('evaluations', 0.025), ('items', 0.025), ('brants', 0.025), ('naseem', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999946 19 emnlp-2011-Approximate Scalable Bounded Space Sketch for Large Data NLP

Author: Amit Goyal ; Hal Daume III

Abstract: We exploit sketch techniques, especially the Count-Min sketch, a memory, and time efficient framework which approximates the frequency of a word pair in the corpus without explicitly storing the word pair itself. These methods use hashing to deal with massive amounts of streaming text. We apply CountMin sketch to approximate word pair counts and exhibit their effectiveness on three important NLP tasks. Our experiments demonstrate that on all of the three tasks, we get performance comparable to Exact word pair counts setting and state-of-the-art system. Our method scales to 49 GB of unzipped web data using bounded space of 2 billion counters (8 GB memory).

2 0.069262475 112 emnlp-2011-Refining the Notions of Depth and Density in WordNet-based Semantic Similarity Measures

Author: Tong Wang ; Graeme Hirst

Abstract: We re-investigate the rationale for and the effectiveness of adopting the notions of depth and density in WordNet-based semantic similarity measures. We show that the intuition for including these notions in WordNet-based similarity measures does not always stand up to empirical examination. In particular, the traditional definitions of depth and density as ordinal integer values in the hierarchical structure of WordNet does not always correlate with human judgment of lexical semantic similarity, which imposes strong limitations on their contribution to an accurate similarity measure. We thus propose several novel definitions of depth and density, which yield significant improvement in degree of correlation with similarity. When used in WordNet-based semantic similarity measures, the new definitions consistently improve performance on a task of correlating with human judgment.

3 0.053621672 69 emnlp-2011-Identification of Multi-word Expressions by Combining Multiple Linguistic Information Sources

Author: Yulia Tsvetkov ; Shuly Wintner

Abstract: We propose an architecture for expressing various linguistically-motivated features that help identify multi-word expressions in natural language texts. The architecture combines various linguistically-motivated classification features in a Bayesian Network. We introduce novel ways for computing many of these features, and manually define linguistically-motivated interrelationships among them, which the Bayesian network models. Our methodology is almost entirely unsupervised and completely language-independent; it relies on few language resources and is thus suitable for a large number of languages. Furthermore, unlike much recent work, our approach can identify expressions of various types and syntactic constructions. We demonstrate a significant improvement in identification accuracy, compared with less sophisticated baselines.

4 0.053109761 86 emnlp-2011-Lexical Co-occurrence, Statistical Significance, and Word Association

Author: Dipak L. Chaudhari ; Om P. Damani ; Srivatsan Laxman

Abstract: Lexical co-occurrence is an important cue for detecting word associations. We propose a new measure of word association based on a new notion of statistical significance for lexical co-occurrences. Existing measures typically rely on global unigram frequencies to determine expected co-occurrence counts. Instead, we focus only on documents that contain both terms (of a candidate word-pair) and ask if the distribution of the observed spans of the word-pair resembles that under a random null model. This would imply that the words in the pair are not related strongly enough for one word to influence placement of the other. However, if the words are found to occur closer together than explainable by the null model, then we hypothesize a more direct association between the words. Through extensive empirical evaluation on most of the publicly available benchmark data sets, we show the advantages of our measure over existing co-occurrence measures.

5 0.044859216 107 emnlp-2011-Probabilistic models of similarity in syntactic context

Author: Diarmuid O Seaghdha ; Anna Korhonen

Abstract: This paper investigates novel methods for incorporating syntactic information in probabilistic latent variable models of lexical choice and contextual similarity. The resulting models capture the effects of context on the interpretation of a word and in particular its effect on the appropriateness of replacing that word with a potentially related one. Evaluating our techniques on two datasets, we report performance above the prior state of the art for estimating sentence similarity and ranking lexical substitutes.

6 0.043623634 123 emnlp-2011-Soft Dependency Constraints for Reordering in Hierarchical Phrase-Based Translation

7 0.043033324 148 emnlp-2011-Watermarking the Outputs of Structured Prediction with an application in Statistical Machine Translation.

8 0.041252278 137 emnlp-2011-Training dependency parsers by jointly optimizing multiple objectives

9 0.041072577 145 emnlp-2011-Unsupervised Semantic Role Induction with Graph Partitioning

10 0.03898165 53 emnlp-2011-Experimental Support for a Categorical Compositional Distributional Model of Meaning

11 0.038309764 73 emnlp-2011-Improving Bilingual Projections via Sparse Covariance Matrices

12 0.036943771 129 emnlp-2011-Structured Sparsity in Structured Prediction

13 0.036329143 55 emnlp-2011-Exploiting Syntactic and Distributional Information for Spelling Correction with Web-Scale N-gram Models

14 0.034476127 127 emnlp-2011-Structured Lexical Similarity via Convolution Kernels on Dependency Trees

15 0.034241986 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation

16 0.034060586 1 emnlp-2011-A Bayesian Mixture Model for PoS Induction Using Multiple Features

17 0.034047149 125 emnlp-2011-Statistical Machine Translation with Local Language Models

18 0.03184709 4 emnlp-2011-A Fast, Accurate, Non-Projective, Semantically-Enriched Parser

19 0.031066233 100 emnlp-2011-Optimal Search for Minimum Error Rate Training

20 0.03095681 132 emnlp-2011-Syntax-Based Grammaticality Improvement using CCG and Guided Search


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.134), (1, -0.022), (2, -0.021), (3, -0.023), (4, 0.023), (5, 0.019), (6, -0.021), (7, 0.0), (8, -0.011), (9, 0.004), (10, 0.063), (11, -0.067), (12, -0.026), (13, -0.013), (14, -0.019), (15, 0.008), (16, -0.031), (17, 0.085), (18, 0.041), (19, -0.041), (20, -0.011), (21, -0.147), (22, 0.059), (23, 0.099), (24, 0.002), (25, 0.052), (26, -0.014), (27, 0.107), (28, 0.03), (29, -0.031), (30, 0.017), (31, 0.083), (32, -0.132), (33, 0.054), (34, 0.038), (35, -0.121), (36, -0.122), (37, 0.002), (38, -0.052), (39, 0.039), (40, -0.116), (41, 0.089), (42, 0.01), (43, -0.158), (44, -0.142), (45, 0.1), (46, -0.238), (47, 0.01), (48, 0.18), (49, 0.009)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91714782 19 emnlp-2011-Approximate Scalable Bounded Space Sketch for Large Data NLP

Author: Amit Goyal ; Hal Daume III

Abstract: We exploit sketch techniques, especially the Count-Min sketch, a memory, and time efficient framework which approximates the frequency of a word pair in the corpus without explicitly storing the word pair itself. These methods use hashing to deal with massive amounts of streaming text. We apply CountMin sketch to approximate word pair counts and exhibit their effectiveness on three important NLP tasks. Our experiments demonstrate that on all of the three tasks, we get performance comparable to Exact word pair counts setting and state-of-the-art system. Our method scales to 49 GB of unzipped web data using bounded space of 2 billion counters (8 GB memory).

2 0.6686728 112 emnlp-2011-Refining the Notions of Depth and Density in WordNet-based Semantic Similarity Measures

Author: Tong Wang ; Graeme Hirst

Abstract: We re-investigate the rationale for and the effectiveness of adopting the notions of depth and density in WordNet-based semantic similarity measures. We show that the intuition for including these notions in WordNet-based similarity measures does not always stand up to empirical examination. In particular, the traditional definitions of depth and density as ordinal integer values in the hierarchical structure of WordNet does not always correlate with human judgment of lexical semantic similarity, which imposes strong limitations on their contribution to an accurate similarity measure. We thus propose several novel definitions of depth and density, which yield significant improvement in degree of correlation with similarity. When used in WordNet-based semantic similarity measures, the new definitions consistently improve performance on a task of correlating with human judgment.

3 0.50405318 86 emnlp-2011-Lexical Co-occurrence, Statistical Significance, and Word Association

Author: Dipak L. Chaudhari ; Om P. Damani ; Srivatsan Laxman

Abstract: Lexical co-occurrence is an important cue for detecting word associations. We propose a new measure of word association based on a new notion of statistical significance for lexical co-occurrences. Existing measures typically rely on global unigram frequencies to determine expected co-occurrence counts. Instead, we focus only on documents that contain both terms (of a candidate word-pair) and ask if the distribution of the observed spans of the word-pair resembles that under a random null model. This would imply that the words in the pair are not related strongly enough for one word to influence placement of the other. However, if the words are found to occur closer together than explainable by the null model, then we hypothesize a more direct association between the words. Through extensive empirical evaluation on most of the publicly available benchmark data sets, we show the advantages of our measure over existing co-occurrence measures.

4 0.37401977 47 emnlp-2011-Efficient retrieval of tree translation examples for Syntax-Based Machine Translation

Author: Fabien Cromieres ; Sadao Kurohashi

Abstract: We propose an algorithm allowing to efficiently retrieve example treelets in a parsed tree database in order to allow on-the-fly extraction of syntactic translation rules. We also propose improvements of this algorithm allowing several kinds of flexible matchings.

5 0.37019286 73 emnlp-2011-Improving Bilingual Projections via Sparse Covariance Matrices

Author: Jagadeesh Jagarlamudi ; Raghavendra Udupa ; Hal Daume III ; Abhijit Bhole

Abstract: Mapping documents into an interlingual representation can help bridge the language barrier of cross-lingual corpora. Many existing approaches are based on word co-occurrences extracted from aligned training data, represented as a covariance matrix. In theory, such a covariance matrix should represent semantic equivalence, and should be highly sparse. Unfortunately, the presence of noise leads to dense covariance matrices which in turn leads to suboptimal document representations. In this paper, we explore techniques to recover the desired sparsity in covariance matrices in two ways. First, we explore word association measures and bilingual dictionaries to weigh the word pairs. Later, we explore different selection strategies to remove the noisy pairs based on the association scores. Our experimental results on the task of aligning comparable documents shows the efficacy of sparse covariance matrices on two data sets from two different language pairs.

6 0.34681886 91 emnlp-2011-Literal and Metaphorical Sense Identification through Concrete and Abstract Context

7 0.31737831 36 emnlp-2011-Corroborating Text Evaluation Results with Heterogeneous Measures

8 0.30463281 129 emnlp-2011-Structured Sparsity in Structured Prediction

9 0.29913336 106 emnlp-2011-Predicting a Scientific Communitys Response to an Article

10 0.28512046 65 emnlp-2011-Heuristic Search for Non-Bottom-Up Tree Structure Prediction

11 0.26870871 37 emnlp-2011-Cross-Cutting Models of Lexical Semantics

12 0.26857084 148 emnlp-2011-Watermarking the Outputs of Structured Prediction with an application in Statistical Machine Translation.

13 0.26422781 18 emnlp-2011-Analyzing Methods for Improving Precision of Pivot Based Bilingual Dictionaries

14 0.25797418 132 emnlp-2011-Syntax-Based Grammaticality Improvement using CCG and Guided Search

15 0.24587937 55 emnlp-2011-Exploiting Syntactic and Distributional Information for Spelling Correction with Web-Scale N-gram Models

16 0.24083611 67 emnlp-2011-Hierarchical Verb Clustering Using Graph Factorization

17 0.23984639 69 emnlp-2011-Identification of Multi-word Expressions by Combining Multiple Linguistic Information Sources

18 0.22673269 140 emnlp-2011-Universal Morphological Analysis using Structured Nearest Neighbor Prediction

19 0.2244847 104 emnlp-2011-Personalized Recommendation of User Comments via Factor Models

20 0.21350241 110 emnlp-2011-Ranking Human and Machine Summarization Systems


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(23, 0.065), (36, 0.018), (37, 0.017), (45, 0.581), (53, 0.013), (54, 0.018), (57, 0.014), (62, 0.016), (64, 0.016), (66, 0.023), (69, 0.026), (79, 0.033), (82, 0.018), (96, 0.035), (98, 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.98250842 21 emnlp-2011-Bayesian Checking for Topic Models

Author: David Mimno ; David Blei

Abstract: Real document collections do not fit the independence assumptions asserted by most statistical topic models, but how badly do they violate them? We present a Bayesian method for measuring how well a topic model fits a corpus. Our approach is based on posterior predictive checking, a method for diagnosing Bayesian models in user-defined ways. Our method can identify where a topic model fits the data, where it falls short, and in which directions it might be improved.

2 0.97394556 86 emnlp-2011-Lexical Co-occurrence, Statistical Significance, and Word Association

Author: Dipak L. Chaudhari ; Om P. Damani ; Srivatsan Laxman

Abstract: Lexical co-occurrence is an important cue for detecting word associations. We propose a new measure of word association based on a new notion of statistical significance for lexical co-occurrences. Existing measures typically rely on global unigram frequencies to determine expected co-occurrence counts. Instead, we focus only on documents that contain both terms (of a candidate word-pair) and ask if the distribution of the observed spans of the word-pair resembles that under a random null model. This would imply that the words in the pair are not related strongly enough for one word to influence placement of the other. However, if the words are found to occur closer together than explainable by the null model, then we hypothesize a more direct association between the words. Through extensive empirical evaluation on most of the publicly available benchmark data sets, we show the advantages of our measure over existing co-occurrence measures.

same-paper 3 0.95262808 19 emnlp-2011-Approximate Scalable Bounded Space Sketch for Large Data NLP

Author: Amit Goyal ; Hal Daume III

Abstract: We exploit sketch techniques, especially the Count-Min sketch, a memory, and time efficient framework which approximates the frequency of a word pair in the corpus without explicitly storing the word pair itself. These methods use hashing to deal with massive amounts of streaming text. We apply CountMin sketch to approximate word pair counts and exhibit their effectiveness on three important NLP tasks. Our experiments demonstrate that on all of the three tasks, we get performance comparable to Exact word pair counts setting and state-of-the-art system. Our method scales to 49 GB of unzipped web data using bounded space of 2 billion counters (8 GB memory).

4 0.91119605 90 emnlp-2011-Linking Entities to a Knowledge Base with Query Expansion

Author: Swapna Gottipati ; Jing Jiang

Abstract: In this paper we present a novel approach to entity linking based on a statistical language model-based information retrieval with query expansion. We use both local contexts and global world knowledge to expand query language models. We place a strong emphasis on named entities in the local contexts and explore a positional language model to weigh them differently based on their distances to the query. Our experiments on the TAC-KBP 2010 data show that incorporating such contextual information indeed aids in disambiguating the named entities and consistently improves the entity linking performance. Compared with the official results from KBP 2010 participants, our system shows competitive performance.

5 0.87384629 103 emnlp-2011-Parser Evaluation over Local and Non-Local Deep Dependencies in a Large Corpus

Author: Emily M. Bender ; Dan Flickinger ; Stephan Oepen ; Yi Zhang

Abstract: In order to obtain a fine-grained evaluation of parser accuracy over naturally occurring text, we study 100 examples each of ten reasonably frequent linguistic phenomena, randomly selected from a parsed version of the English Wikipedia. We construct a corresponding set of gold-standard target dependencies for these 1000 sentences, operationalize mappings to these targets from seven state-of-theart parsers, and evaluate the parsers against this data to measure their level of success in identifying these dependencies.

6 0.75726384 119 emnlp-2011-Semantic Topic Models: Combining Word Distributional Statistics and Dictionary Definitions

7 0.73397881 101 emnlp-2011-Optimizing Semantic Coherence in Topic Models

8 0.7100361 64 emnlp-2011-Harnessing different knowledge sources to measure semantic relatedness under a uniform model

9 0.69545007 37 emnlp-2011-Cross-Cutting Models of Lexical Semantics

10 0.66192091 33 emnlp-2011-Cooooooooooooooollllllllllllll!!!!!!!!!!!!!! Using Word Lengthening to Detect Sentiment in Microblogs

11 0.65803111 56 emnlp-2011-Exploring Supervised LDA Models for Assigning Attributes to Adjective-Noun Phrases

12 0.64400625 73 emnlp-2011-Improving Bilingual Projections via Sparse Covariance Matrices

13 0.63362259 55 emnlp-2011-Exploiting Syntactic and Distributional Information for Spelling Correction with Web-Scale N-gram Models

14 0.60094124 116 emnlp-2011-Robust Disambiguation of Named Entities in Text

15 0.59692127 81 emnlp-2011-Learning General Connotation of Words using Graph-based Algorithms

16 0.5818873 91 emnlp-2011-Literal and Metaphorical Sense Identification through Concrete and Abstract Context

17 0.58146346 47 emnlp-2011-Efficient retrieval of tree translation examples for Syntax-Based Machine Translation

18 0.57772022 107 emnlp-2011-Probabilistic models of similarity in syntactic context

19 0.57655734 11 emnlp-2011-A Simple Word Trigger Method for Social Tag Suggestion

20 0.57279831 133 emnlp-2011-The Imagination of Crowds: Conversational AAC Language Modeling using Crowdsourcing and Large Data Sources