emnlp emnlp2011 emnlp2011-86 knowledge-graph by maker-knowledge-mining

86 emnlp-2011-Lexical Co-occurrence, Statistical Significance, and Word Association


Source: pdf

Author: Dipak L. Chaudhari ; Om P. Damani ; Srivatsan Laxman

Abstract: (Author affiliations: Om P. Damani, Computer Science and Engg., IIT Bombay, damani@cse.iitb.ac.in; Srivatsan Laxman, Microsoft Research India, Bangalore, slaxman@microsoft.com.) Lexical co-occurrence is an important cue for detecting word associations. We propose a new measure of word association based on a new notion of statistical significance for lexical co-occurrences. Existing measures typically rely on global unigram frequencies to determine expected co-occurrence counts. Instead, we focus only on documents that contain both terms (of a candidate word-pair) and ask if the distribution of the observed spans of the word-pair resembles that under a random null model. Such a resemblance would imply that the words in the pair are not related strongly enough for one word to influence placement of the other. However, if the words are found to occur closer together than explainable by the null model, then we hypothesize a more direct association between the words. Through extensive empirical evaluation on most of the publicly available benchmark data sets, we show the advantages of our measure over existing co-occurrence measures.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We propose a new measure of word association based on a new notion of statistical significance for lexical co-occurrences. [sent-13, score-0.211]

2 Existing measures typically rely on global unigram frequencies to determine expected co-occurrence counts. [sent-14, score-0.262]

3 In- stead, we focus only on documents that contain both terms (of a candidate word-pair) and ask if the distribution of the observed spans of the word-pair resembles that under a random null model. [sent-15, score-0.268]

4 However, if the words are found to occur closer together than explainable by the null model, then we hypothesize a more direct association between the words. [sent-17, score-0.185]

5 1 Introduction Lexical co-occurrence is an important indicator of word association and this has motivated several measures for word association like PMI (Church and Hanks, 1989), LLR (Dunning, 1993), Dice (Dice, 1945), and CWCD (Washtell and Markert, 2009). [sent-19, score-0.212]
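As a concrete reference point for these count-based measures, the simplest of them can be computed directly from corpus counts. The sketch below is our own illustration (not code from any of the cited papers); it computes PMI and the Dice coefficient from a joint count and two marginal counts:

```python
import math

def pmi(n_xy, n_x, n_y, n_total):
    # Pointwise mutual information: log2( p(x,y) / (p(x) * p(y)) ).
    # Real systems typically smooth or discount low counts.
    p_xy = n_xy / n_total
    return math.log2(p_xy / ((n_x / n_total) * (n_y / n_total)))

def dice(n_xy, n_x, n_y):
    # Dice coefficient: 2 * joint count / sum of marginal counts.
    return 2.0 * n_xy / (n_x + n_y)
```

For statistically independent words (p(x,y) = p(x)p(y)), PMI is 0; positive values indicate attraction between the words.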

6 In this paper, we present a new measure of word association based on a new notion of statistical significance for lexical co-occurrences. [sent-20, score-0.211]

7 We formalize these ideas and construct a significance test that allows us to detect different kinds of co-occurrences within a single unified framework (a feature which is absent in current measures for co-occurrence). [sent-23, score-0.213]

8 Another distinguishing feature of our measure is that it is based solely on the co-occurrence counts in the documents containing both words of the pair, unlike all existing measures, which also take global unigram frequencies into account. [sent-24, score-0.483]

9 We need a null hypothesis that can account for an observed co-occurrence as a pure chance event, and this in turn requires a corpus generation model. [sent-25, score-0.185]

10 Existing co-occurrence measures further assume that each document is drawn from a multinomial distribution based on global unigram frequencies. [sent-27, score-0.31]

11 The main concern with such a null model is the overbearing influence of the unigram frequencies on the detection of word associations. [sent-28, score-0.237]

12 Also, under current models, the expected span² of a word pair is very sensitive to the associated unigram frequencies: the expected span of a word pair composed of low frequency unigrams is much larger than that of a pair with high frequency unigrams. [sent-31, score-0.343]

13 This is contrary to how word associations should behave. (Footnote 2: The span of an occurrence of a word-pair is the ‘unsigned distance’ between the positions of the corresponding word occurrences.) [sent-32, score-0.23]

14 Based on these considerations, we employ a null model that represents each document as a bag of words.³ [sent-36, score-0.233]

15 Under this null model, the locations of an unrelated pair of words will likely be randomly distributed in the documents in which they co-occur. [sent-38, score-0.202]

16 If the observed span distribution of a word-pair resembles that under the (random permutation) null model, then the relation between the words is not strong enough for one word to influence the placement of the other. [sent-39, score-0.312]

17 However, if the words are found to occur closer together than explainable by our null model, then we hypothesize a more direct association between the words. [sent-40, score-0.185]

18 Therefore, this null model detects biases in span distributions of word-pairs while being agnostic to variations in global unigram frequencies. [sent-41, score-0.336]

19 In this paper, we propose a new measure of word association based on the statistical significance of the observed span distribution of a word-pair. [sent-42, score-0.38]

20 The ranked list of word associations output by our measure has the best correlation with the corresponding gold-standard in three (out of seven) data sets in our experiments, while remaining in the top three in the other four datasets. [sent-45, score-0.206]

21 While differ- ent measures perform best on different data sets, our measure outperforms other measures by being consistently either the best measure or very close to the best measure on all the data sets. [sent-46, score-0.541]

22 The average deviation of our measure’s correlation with the gold-standard from the best measure’s correlation with the gold-standard (average taken across all the data sets) (Footnote 3: There can be many ways to associate a bag of words with a document.) [sent-47, score-0.251]

23 is 0.02, which is the least average deviation among all the measures, the next best deviations being 0. [sent-52, score-0.137]

24 We present our notion of statistical significance of span distribution in Section 2. [sent-56, score-0.254]

25 The algorithm for computing the proposed word association measure is described in Section 3. [sent-57, score-0.157]

26 First, at the document level, we may find that for a given word-pair, a surprisingly high proportion of its occurrences within a document have smaller spans than they would have by random chance. [sent-61, score-0.359]

27 We now describe how to combine both kinds of evidence to decide whether the nearby occurrences of a word-pair are statistically significant or not. [sent-63, score-0.193]

28 Let the frequency f of a word-pair α in a document D be the maximum number of non-overlapped occurrences of α in D. [sent-64, score-0.268]

29 A set of occurrences of a word-pair is said to be non-overlapped if the words corresponding to one occurrence from the set do not appear in-between the words corresponding to any other occurrence from the set. [sent-65, score-0.255]

30 Let f̂x denote the maximum number of non-overlapped occurrences of α in D with span less than a given threshold x. [sent-66, score-0.774]

31 We refer to f̂x as the span-constrained frequency of α in D. [sent-67, score-0.512]
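Both f and f̂x can be computed in a single left-to-right pass over a document. The greedy pairing below is a simplified sketch of our own reading of these definitions (the paper does not prescribe a matching algorithm at this point), and the names `tokens`, `w1`, `w2` are hypothetical:

```python
def span_constrained_freq(tokens, w1, w2, x):
    # Greedy left-to-right pairing of w1/w2 positions into
    # non-overlapped occurrences (assumes w1 != w2).
    # Returns (f, f_hat_x): total occurrences, and those with span < x.
    # A simplified sketch, not the authors' exact procedure.
    pending = {w1: None, w2: None}
    f = f_hat = 0
    for pos, tok in enumerate(tokens):
        if tok not in pending:
            continue
        other = w2 if tok == w1 else w1
        if pending[other] is not None:
            span = pos - pending[other]   # the 'unsigned distance'
            f += 1
            if span < x:
                f_hat += 1
            pending[w1] = pending[w2] = None
        else:
            pending[tok] = pos  # remember the most recent unmatched position
    return f, f_hat
```

Each closed pair consumes both pending positions, so no occurrence's words can appear in-between the words of another counted occurrence.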

32 1 Document-level significant co-occurrence: To assess the statistical significance of the word-pair α, we ask if the span-constrained frequency f̂x (of α) is more than what we would expect in a document of size ℓ containing f ‘random’ occurrences of α. [sent-70, score-0.557]

33 Our intuition is that if two words are associated in some way, they will often appear close to each other in the document and so the distribution of the spans will typically exhibit a bias toward values less than a suitably chosen threshold x. [sent-71, score-0.169]

34 Definition 1 Consider the null hypothesis that the linear representation of a document is generated by choosing a random permutation of the bag of words associated with the document. [sent-72, score-0.344]

35 Let ℓ be the length of the document and f denote the frequency of a word-pair in the document. [sent-73, score-0.195]

36 For a given span threshold x, we define πx(f̂x, f, ℓ) as the probability under the null that the word-pair will appear in the document with a span-constrained frequency of at least f̂x. [sent-74, score-0.462]
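πx can be approximated by simulating the permutation null directly. The Monte Carlo sketch below models a random embedding of f non-overlapped occurrences as 2f positions drawn uniformly without replacement and paired in sorted order; this embedding model is our own assumption, not a procedure taken from the paper:

```python
import random

def pi_x_monte_carlo(k, f, ell, x, trials=20000, seed=0):
    # Monte Carlo estimate of pi_x(k, f, ell): probability that at
    # least k of f randomly placed non-overlapped occurrences have
    # span < x in a document of length ell.
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Draw 2f distinct positions and pair them in sorted order,
        # giving f non-overlapped occurrences with their spans.
        pos = sorted(rng.sample(range(ell), 2 * f))
        spans = [pos[2 * i + 1] - pos[2 * i] for i in range(f)]
        if sum(s < x for s in spans) >= k:
            hits += 1
    return hits / trials
```

For the example below (at least 3 of 4 occurrences within a span of 20 in a document of length 400), the estimate lands in the rough vicinity of the reported 9% chance, modulo the modeling assumption.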

37 (All f occurrences will always have span less than x for x ≥ ℓ.) [sent-77, score-0.302]

38 For example, consider a document of length 400 with 4 non-overlapped occurrences of α. [sent-84, score-0.213]

39 The probabilities of observing at least 4, 3, 2, 1 and 0 occurrences of α within a span of 20 words are 0. [sent-85, score-0.302]

40 Since this probability is 0.09, even if 3 of the 4 occurrences of α have span less than 20 words, there is a 9% chance that the occurrences were a consequence of a random event. [sent-92, score-0.502]

41 Given an ε (0 < ε < 1) and a span threshold x (≥ 0), the document D is said to support the hypothesis “α is an ε-significant word-pair” if πx(f̂x, f, ℓ) < ε. [sent-96, score-0.351]

42 2 Corpus-level significant co-occurrence We now describe how to aggregate evidence for lexical significance by considering the occurrence of α across multiple documents in the corpus. [sent-102, score-0.301]

43 Let {D1, . . . , DK} denote the set of K documents (out of the entire corpus) that contain at least one occurrence of α. [sent-106, score-0.16]

44 Let ℓi be the length of Di, fi the frequency of α in Di, and f̂ix the span-constrained frequency of α in Di. [sent-107, score-0.269]

45 Let Z = Σ_{i=1}^{K} zi; Z models the number of documents (out of K) that support the hypothesis “α is an ε-significant word-pair”. [sent-116, score-0.146]

46 Definition 2 Given a document of length ℓ in which a word-pair has a frequency of f, and given a span threshold x, we define g_{ε,x}[f, ℓ] as the probability, under the null, that the document supports the hypothesis “α is an ε-significant word-pair”. [sent-121, score-0.359]

47 Let δ denote the upper bound e^(−2Kt²) on the probability that, just due to random chance, more than (E(Z) + Kt) documents out of K will support the hypothesis “α is an ε-significant word-pair”. [sent-133, score-0.178]
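Solving δ = e^(−2Kt²) for t turns the chosen confidence into the margin Kt by which Z must exceed E(Z). A minimal sketch of this Hoeffding-style corpus-level decision (our own illustration; the function name and signature are hypothetical, not from the paper):

```python
import math

def corpus_level_test(z, expected_z, K, delta):
    # delta = exp(-2 * K * t**2)  =>  t = sqrt(ln(1/delta) / (2 * K)).
    # Declare the word-pair significant when the observed support Z
    # exceeds its null expectation E(Z) by more than K * t.
    t = math.sqrt(math.log(1.0 / delta) / (2.0 * K))
    return z > expected_z + K * t
```

For K = 416 documents and δ = 0.01, the margin Kt works out to about 31 documents above E(Z).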

48 The word-pair (canyon, landscape) has ε-significant occurrences in Z = 33 documents (out of 416), while E(Z) = 14. [sent-140, score-0.232]

49 Suppose we want to be 99% sure that the occurrences of (canyon, landscape) in the 33 documents were a consequence of non-random phenomena. [sent-142, score-0.232]

50 …(ε, δ) combinations under a span constraint of 20 words. [sent-147, score-0.169]

51 We now summarize our significance test in the definition below. [sent-157, score-0.135]

52 Definition 3 (Significant lexical co-occurrence) Consider a word-pair α and a set of K documents containing at least one occurrence each of α. [sent-158, score-0.16]

53 Fix a span threshold of x (> 0), a document-level evidence threshold of ε, and a corpus-level evidence threshold of δ. [sent-159, score-0.28]

54 Let Z denote the number of documents (out of K) that support the hypothesis “α is an ε-significant word-pair”. [sent-162, score-0.146]

55 3 Discussion The significance test of Definition 3 gathers both document-level and corpus-level evidence from data in calibrated amounts. [sent-170, score-0.141]

56 ε fixes the strength of the document-level hypothesis in our test, while δ controls the extent of corpus-level evidence we need to declare a word-pair as significant. [sent-172, score-0.134]

57 A small δ demands that there be multiple documents in the corpus, each of which individually has some evidence of relatedness for the pair of words. [sent-173, score-0.232]

58 The semantic notion of word association is an abstract concept, and different kinds of associations (with potentially different statistical characterizations) may be preferred by human judges in different situations. [sent-205, score-0.133]

59 While Section 5 discusses in detail the various datasets used, the evaluation methodology, and the performance of CSR across datasets, we wish to point out here that in 3 out of 5 cross-validation runs for the wordsim dataset, the best performing CSR parameters were x = 50w, ε … [sent-206, score-0.202]

60 1 Computing the histogram hist_{f,ℓ,x}(·): The first step is to compute a histogram for the span-constrained frequency f̂x of a word-pair whose frequency is f in a document of length ℓ, given a chosen span threshold of x (under our null model). [sent-219, score-0.382]

61 Definition 4 Given a document of length ℓ and a span threshold of x, we define hist_{f,ℓ,x}(f̂x) as the number of ways to embed f non-overlapped occurrences of a word-pair in the document such that exactly f̂x occurrences have span less than x. [sent-220, score-1.236]

62 Procedure 1 ComputeHist(f, ℓ, x) [Offline]. Input: f, number of non-overlapped occurrences; ℓ, document length; x, span threshold. Computes hist_{f,ℓ,x}[·] as per Definition 4. 1: Initialize hist_{f,ℓ,x}[f̂x] ← 0 for f̂x = 0, . . . [sent-221, score-0.641]

63 2 Computing the πx(·, f, ℓ) distribution: Procedure 2 ComputePiDist(f, ℓ, x) [Offline]. Input: f, number of non-overlapped occurrences; ℓ, document length; x, span threshold. Computes the distribution πx[f, ℓ, ·] as per Definition 1 and g_{ε,x}[f, ℓ] as per Definition 2. [sent-230, score-0.304]

64 1: N[f, ℓ, x] ← Σ_{k=0}^{f} hist_{f,ℓ,x}[k]
2: for f̂x ← 0 to f do
3: Nx[f̂x, f, ℓ] ← Σ_{k=f̂x}^{f} hist_{f,ℓ,x}[k]
4: πx[f̂x, f, ℓ] ← Nx[f̂x, f, ℓ] / N[f, ℓ, x]
5: g_{ε,x}[f, ℓ] ← … [sent-231, score-0.417]

65 We store the number of ways of embedding f non-overlapped occurrences of a word-pair in a document of length ℓ in the array N[f, ℓ]. [sent-235, score-0.213]

66 Similarly, the array Nx[f̂x, f, ℓ] stores the number of ways of embedding f non-overlapped occurrences of the word-pair in a document of length ℓ, such that at least f̂x of the f occurrences have span less than x. [sent-236, score-0.852]
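For tiny f and ℓ, both arrays can be validated against a brute-force enumeration of embeddings. The sketch below treats an embedding of f non-overlapped occurrences as a choice of 2f positions out of ℓ, paired consecutively; this combinatorial reading is our own assumption about the intended counting, not the authors' procedure:

```python
from itertools import combinations

def hist_bruteforce(f, ell, x):
    # hist[k] = number of embeddings in which exactly k occurrences
    # have span < x.  Exponential in ell; only for validating a
    # faster routine on tiny inputs.
    hist = [0] * (f + 1)
    for pos in combinations(range(ell), 2 * f):
        k = sum(pos[2 * i + 1] - pos[2 * i] < x for i in range(f))
        hist[k] += 1
    return hist

def pi_x_exact(k, f, ell, x):
    # pi_x(k, f, ell): fraction of embeddings with at least k
    # short-span occurrences, under the same embedding model.
    hist = hist_bruteforce(f, ell, x)
    return sum(hist[k:]) / sum(hist)
```

For f = 1, ℓ = 5, x = 2 there are C(5, 2) = 10 embeddings, of which the 4 adjacent pairs have span 1 < 2, so the histogram is [6, 4].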

67 Input: ε, document-level evidence; δ, corpus-level evidence; x, span threshold; a corpus of documents. Computes CSR(α), the Co-occurrence Significance Ratio for α, as per Definition 3. 1: D ← {D1, . . . [sent-246, score-0.268]

68 … then
9: zi ← 1
10: else
11: zi ← 0
12: Z ← Z + zi
13: ZE ← ZE + g_{ε,x}[fi, ℓi] [sent-251, score-0.3]

69 Next we determine how many of the K documents support the hypothesis “α is an ε-significant word-pair”. [sent-256, score-0.146]

70 The expected number of documents supporting the hypothesis is accumulated in ZE (line 13). [sent-258, score-0.146]

71 4 Run-time overhead The computation of Co-occurrence Significance Ratio (CSR) as given in Definition 3 might appear more complex than the simple formulae for other co-occurrence measures given in Table 2. [sent-261, score-0.128]

72 The main data-dependent computations for a spanned measure … (Footnote 5: http://www. …) [sent-265, score-0.144]

73 The main overhead in these procedures is incurred in line 7, where span-constrained frequencies in a given document are computed. [sent-273, score-0.185]

74 …, DK} // set of documents containing at least one occurrence of α. [sent-278, score-0.16]

75 …, 2008), and (iii) knowledge-based measures that use knowledge sources like thesauri, semantic networks, or taxonomies (Milne and Witten, 2008; Hughes and Ramage, 2007; Gabrilovich and Markovitch, 2007; Yeh et al. [sent-283, score-0.128]

76 These measures are used in several domains like ecology, psychology, medicine, and language processing. [sent-287, score-0.128]

77 Table 2 lists several measures chosen from all these domains. [sent-288, score-0.128]

78 Apart from CWCD (Washtell and Markert, 2009), all other measures are well-known in the NLP community (Pecina and Schlesinger, 2006). [sent-290, score-0.128]

79 While presenting our experimental results, we report these pairs of measures together. [sent-303, score-0.128]

80 Figures in brackets against the deviation values denote the ranks of the measures in the corresponding data sets. [sent-337, score-0.246]

81 The methodology for collecting free association data is explained in (ESSLLI, 2008): the degree of free association between a stimulus (S) and response (R) is the percentage of respondents who give R as the first response when presented with stimulus S. [sent-340, score-0.24]

82 This allows us to compare different measures comprehensively under a varying range of circumstances. [sent-342, score-0.128]

83 We evaluate each measure by the Spearman’s rank correlation between the ranking produced by the measure and the gold-standard ranking. [sent-354, score-0.232]
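Spearman's rank correlation is the Pearson correlation of the two rank vectors. A minimal self-contained version (using average ranks for ties) that could serve for this kind of evaluation:

```python
def spearman_rho(scores_a, scores_b):
    # Spearman's rank correlation between two score lists of equal
    # length; ties receive their group's average rank.
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0.0] * len(xs)
        i = 0
        while i < len(order):
            j = i
            # Extend the group while the sorted values stay equal.
            while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
                j += 1
            avg = (i + j) / 2.0 + 1.0   # average 1-based rank of the group
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((a - ma) * (b - mb) for a, b in zip(ra, rb))
    va = sum((a - ma) ** 2 for a in ra) ** 0.5
    vb = sum((b - mb) ** 2 for b in rb) ** 0.5
    return cov / (va * vb)
```

Identical rankings give ρ = 1 and fully reversed rankings give ρ = -1, regardless of the raw score scales.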

84 The span threshold (or window-width) x is a user-defined parameter in all measures. [sent-355, score-0.224]

85 These other measures have not been applied to the free association datasets shown in Table 3. [sent-359, score-0.264]

86 The average rank correlation obtained by each measure over 5 cross-validation runs is reported for each dataset. [sent-369, score-0.148]

87 4 Results For each measure and for each data set, the average correlation over the 5 cross-validation runs is reported in Table 4. [sent-375, score-0.148]

88 While different measures performed best on different data sets, the results in Table 4 show that CSR performs consistently well across all data sets. [sent-378, score-0.128]

89 Although, among all measures, CSR has the best average correlation over all datasets, taking the average of correlations across widely different datasets is not a meaningful way to decide which measure to use. [sent-385, score-0.177]

90 Short of such an oracle, if one were to pick a fixed measure a priori, then one would like to know how much worse off one is compared to the best measure for that dataset. [sent-387, score-0.168]

91 To compare different measures from this perspective, we compute the deviation of the correlation for each measure from the correlation of the best measure for each data set. [sent-388, score-0.497]

92 While the focus of our work is on co-occurrence measures, for completeness, we present all the known results for knowledge-based and distributional similarity-based measures on the datasets under consideration in Table 6. [sent-398, score-0.216]

93 …, 2009), the wordsim data set was partitioned into two sets, namely sim and rel, and in the ESSLLI shared task (ESSLLI, 2008), a 272-word-pair subset of the Edinburgh dataset was chosen. [sent-400, score-0.148]

94 To facilitate comparison, in addition to CSR, we also present results for PMI and Ochiai (Chi-Square), which are the best performing co-occurrence measures on the wordsim and ESSLLI datasets. [sent-401, score-0.161]

95 6 Conclusions In this paper, we introduced a new measure called CSR for word association based on the statistical significance of lexical co-occurrences. [sent-404, score-0.169]

96 Our measure, while being agnostic to global unigram frequencies, detects skews in span distributions of word-pairs in documents containing both words. [sent-405, score-0.332]

97 Novel association measures using web search with double checking. [sent-425, score-0.17]

98 Free association task at lexical semantics workshop esslli 2008. [sent-443, score-0.181]

99 An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. [sent-516, score-0.202]

100 A comparison of windowless and window-based computational association measures as predictors of syntagmatic human associations. [sent-570, score-0.21]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('csr', 0.537), ('fbx', 0.417), ('histf', 0.278), ('span', 0.169), ('esslli', 0.139), ('occurrences', 0.133), ('measures', 0.128), ('kt', 0.12), ('wordsim', 0.119), ('null', 0.103), ('zi', 0.1), ('washtell', 0.099), ('documents', 0.099), ('significance', 0.085), ('measure', 0.084), ('document', 0.08), ('minnesota', 0.08), ('ochiai', 0.08), ('relatedness', 0.077), ('deviation', 0.073), ('frequencies', 0.07), ('correlation', 0.064), ('unigram', 0.064), ('deviations', 0.064), ('occurrence', 0.061), ('canyon', 0.06), ('edinburg', 0.06), ('fbix', 0.06), ('kent', 0.06), ('landscape', 0.06), ('spanned', 0.06), ('wordpair', 0.06), ('fi', 0.059), ('associations', 0.058), ('evidence', 0.056), ('threshold', 0.055), ('frequency', 0.055), ('florida', 0.051), ('agirre', 0.051), ('dk', 0.051), ('definition', 0.05), ('datasets', 0.05), ('bag', 0.05), ('di', 0.05), ('hypothesis', 0.047), ('brackets', 0.045), ('offline', 0.045), ('free', 0.044), ('markert', 0.043), ('association', 0.042), ('wikipedia', 0.041), ('finkelstein', 0.04), ('ze', 0.04), ('pf', 0.04), ('cwcd', 0.04), ('damani', 0.04), ('ecology', 0.04), ('explainable', 0.04), ('hughes', 0.04), ('janson', 0.04), ('laxman', 0.04), ('placement', 0.04), ('shaul', 0.04), ('spanconstrained', 0.04), ('windowless', 0.04), ('dice', 0.038), ('cooccurrence', 0.038), ('gabrilovich', 0.036), ('chance', 0.035), ('procedures', 0.035), ('nx', 0.035), ('kf', 0.034), ('lucene', 0.034), ('stimulus', 0.034), ('liberman', 0.034), ('bombay', 0.034), ('iit', 0.034), ('snake', 0.034), ('markovitch', 0.034), ('itb', 0.034), ('milne', 0.034), ('wandmacher', 0.034), ('spans', 0.034), ('ent', 0.033), ('computes', 0.033), ('performing', 0.033), ('random', 0.032), ('ratio', 0.032), ('permutation', 0.032), ('pmi', 0.031), ('bollegala', 0.031), ('aitor', 0.031), ('rubenstein', 0.031), ('declare', 0.031), ('budanitsky', 0.031), ('computing', 0.031), ('benchmark', 0.03), ('procedure', 0.03), ('dataset', 0.029), ('pecina', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 86 emnlp-2011-Lexical Co-occurrence, Statistical Significance, and Word Association

Author: Dipak L. Chaudhari ; Om P. Damani ; Srivatsan Laxman

Abstract: (Author affiliations: Om P. Damani, Computer Science and Engg., IIT Bombay, damani@cse.iitb.ac.in; Srivatsan Laxman, Microsoft Research India, Bangalore, slaxman@microsoft.com.) Lexical co-occurrence is an important cue for detecting word associations. We propose a new measure of word association based on a new notion of statistical significance for lexical co-occurrences. Existing measures typically rely on global unigram frequencies to determine expected co-occurrence counts. Instead, we focus only on documents that contain both terms (of a candidate word-pair) and ask if the distribution of the observed spans of the word-pair resembles that under a random null model. Such a resemblance would imply that the words in the pair are not related strongly enough for one word to influence placement of the other. However, if the words are found to occur closer together than explainable by the null model, then we hypothesize a more direct association between the words. Through extensive empirical evaluation on most of the publicly available benchmark data sets, we show the advantages of our measure over existing co-occurrence measures.

2 0.07023064 36 emnlp-2011-Corroborating Text Evaluation Results with Heterogeneous Measures

Author: Enrique Amigo ; Julio Gonzalo ; Jesus Gimenez ; Felisa Verdejo

Abstract: Automatically produced texts (e.g. translations or summaries) are usually evaluated with n-gram based measures such as BLEU or ROUGE, while the wide set of more sophisticated measures that have been proposed in the last years remains largely ignored for practical purposes. In this paper we first present an indepth analysis of the state of the art in order to clarify this issue. After this, we formalize and verify empirically a set of properties that every text evaluation measure based on similarity to human-produced references satisfies. These properties imply that corroborating system improvements with additional measures always increases the overall reliability of the evaluation process. In addition, the greater the heterogeneity of the measures (which is measurable) the higher their combined reliability. These results support the use of heterogeneous measures in order to consolidate text evaluation results.

3 0.057859514 38 emnlp-2011-Data-Driven Response Generation in Social Media

Author: Alan Ritter ; Colin Cherry ; William B. Dolan

Abstract: Ottawa, Ontario, K1A 0R6 Co l . Cherry@ nrc-cnrc . gc . ca in Redmond, WA 98052 bi l ldol @mi cro so ft . com large corpus of status-response pairs found on Twitter to create a system that responds to Twitter status We present a data-driven approach to generating responses to Twitter status posts, based on phrase-based Statistical Machine Translation. We find that mapping conversational stimuli onto responses is more difficult than translating between languages, due to the wider range of possible responses, the larger fraction of unaligned words/phrases, and the presence of large phrase pairs whose alignment cannot be further decomposed. After addressing these challenges, we compare approaches based on SMT and Information Retrieval in a human evaluation. We show that SMT outperforms IR on this task, and its output is preferred over actual human responses in 15% of cases. As far as we are aware, this is the first work to investigate the use of phrase-based SMT to directly translate a linguistic stimulus into an appropriate response.

4 0.053731076 148 emnlp-2011-Watermarking the Outputs of Structured Prediction with an application in Statistical Machine Translation.

Author: Ashish Venugopal ; Jakob Uszkoreit ; David Talbot ; Franz Och ; Juri Ganitkevitch

Abstract: We propose a general method to watermark and probabilistically identify the structured outputs of machine learning algorithms. Our method is robust to local editing operations and provides well defined trade-offs between the ability to identify algorithm outputs and the quality of the watermarked output. Unlike previous work in the field, our approach does not rely on controlling the inputs to the algorithm and provides probabilistic guarantees on the ability to identify collections of results from one’s own algorithm. We present an application in statistical machine translation, where machine translated output is watermarked at minimal loss in translation quality and detected with high recall. 1 Motivation Machine learning algorithms provide structured results to input queries by simulating human behavior. Examples include automatic machine translation (Brown et al. , 1993) or automatic text and rich media summarization (Goldstein et al. , 1999) . These algorithms often estimate some portion of their models from publicly available human generated data. As new services that output structured results are made available to the public and the results disseminated on the web, we face a daunting new challenge: Machine generated structured results contaminate the pool of naturally generated human data. For example, machine translated output 1363 2Center for Language and Speech Processing Johns Hopkins University Baltimore, MD 21218, USA juri@cs.jhu.edu and human generated translations are currently both found extensively on the web, with no automatic way of distinguishing between them. Algorithms that mine data from the web (Uszkoreit et al. , 2010) , with the goal of learning to simulate human behavior, will now learn models from this contaminated and potentially selfgenerated data, reinforcing the errors committed by earlier versions of the algorithm. 
It is beneficial to be able to identify a set of encountered structured results as having been generated by one’s own algorithm, with the purpose of filtering such results when building new models. Problem Statement: We define a structured result of a query q as r = {z1 · · · zL} where tthuree odr rdeesru latn dof identity qof a sele rm =en {tzs zi are important to the quality of the result r. The structural aspect of the result implies the existence of alternative results (across both the order of elements and the elements themselves) that might vary in their quality. Given a collection of N results, CN = r1 · · · rN, where each result ri has k rankedC alterna·t·iv·ers Dk (qi) of relatively similar quality and queries q1 · · · qN are arbitrary and not controlled by the watermarking algorithm, we define the watermarking task as: Task. Replace ri with ri0 ∈ Dk (qi) for some subset of results in CN to produce a watermarked sceoltle ocfti orens CN0 slleucchti otnha Ct: • CN0 is probabilistically identifiable as having bCeen generated by one’s own algorithm. Proce dEindgisnb oufr tgh e, 2 S0c1o1tl Canodn,f eUrKen,c Jeuol yn 2 E7m–3p1ir,ic 2a0l1 M1.e ?tc ho2d0s1 in A Nsasotucira tlio Lnan fogru Cagoem Ppruotcaetisosninagl, L pinag uesis 1ti3c6s3–1372, • • • 2 the degradation in quality from CN to the wthaete dremgarrakdeadt CN0 isnho quulda bitye analytically controllable, trading quality for detection performance. CN0 should not be detectable as watermCarked content without access to the generating algorithms. the detection of CN0 should be robust to simple eddeitte operations performed on individual results r ∈ CN0. Impact on Statistical Machine Translation Recent work(Resnik and Smith, 2003; Munteanu and Marcu, 2005; Uszkoreit et al. , 2010) has shown that multilingual parallel documents can be efficiently identified on the web and used as training data to improve the quality of statistical machine translation. 
The availability of free translation services (Google Translate, Bing Translate) and tools (Moses, Joshua) , increase the risk that the content found by parallel data mining is in fact generated by a machine, rather than by humans. In this work, we focus on statistical machine translation as an application for watermarking, with the goal of discarding documents from training if they have been generated by one’s own algorithms. To estimate the magnitude of the problem, we used parallel document mining (Uszkoreit et al. , 2010) to generate a collection of bilingual document pairs across several languages. For each document, we inspected the page content for source code that indicates the use of translation modules/plug-ins that translate and publish the translated content. We computed the proportion of the content within our corpus that uses these modules. We find that a significant proportion of the mined parallel data for some language pairs is generated via one of these translation modules. The top 3 languages pairs, each with parallel translations into English, are Tagalog (50.6%) , Hindi (44.5%) and Galician (41.9%) . While these proportions do not reflect impact on each language’s monolingual web, they are certainly high 1364 enough to affect machine translations systems that train on mined parallel data. In this work, we develop a general approach to watermark structured outputs and apply it to the outputs of a statistical machine translation system with the goal of identifying these same outputs on the web. In the context of the watermarking task defined above, we output selecting alternative translations for input source sentences. These translations often undergo simple edit and formatting operations such as case changes, sentence and word deletion or post editing, prior to publishing on the web. We want to ensure that we can still detect watermarked translations despite these edit operations. 
Given the rapid pace of development within machine translation, it is also important that the watermark be robust to improvements in underlying translation quality. Results from several iterations of the system within a single collection of documents should be identifiable under probabilistic bounds. While we present evaluation results for statistical machine translation, our proposed approach and associated requirements are applicable to any algorithm that produces structured results with several plausible alternatives. The alternative results can arise as a result of inherent task ambiguity (for example, there are multiple correct translations for a given input source sentence) or modeling uncertainty (for example, a model assigning equal probability to two competing results) . 3 Watermark Structured Results Selecting an alternative r0 from the space of alternatives Dk (q) can be stated as: r0= arr∈gDmk(aqx)w(r,Dk(q),h) (1) where w ranks r ∈ Dk (q) based on r’s presentwahtieorne owf a watermarking signal computed by a hashing operation h. In this approach, w and its component operation h are the only secrets held by the watermarker. This selection criterion is applied to all system outputs, ensuring that watermarked and non-watermarked version of a collection will never be available for comparison. A specific implementation of w within our watermarking approach can be evaluated by the following metrics: • • • False Positive Rate: how often nonFwaaltseermarked collections are falsely identified as watermarked. Recall Rate: how often watermarked collRecectiaolnls R are correctly inde wntaitfeierdm as wdat ceorl-marked. Quality Degradation: how significantly dQoueasl CN0 d Dieffegrr fdraotmio CN when evaluated by tdaoseks specific quality Cmetrics. While identification is performed at the collection level, we can scale these metrics based on the size of each collection to provide more task sensitive metrics. 
For example, in machine translation, we count the number of words in the collection towards the false positive and recall rates. In Section 3.1, we define a random hashing operation h and a task-independent implementation of the selector function w. Section 3.2 describes how to classify a collection of watermarked results. Sections 3.3 and 3.4 describe refinements to the selection and classification criteria that mitigate quality degradation. Following a comparison to related work in Section 4, we present experimental results for several languages in Section 5.

3.1 Watermarking: CN → C′N

We define a random hashing operation h that is applied to result r. It consists of two components:

• A hash function applied to a structured result r to generate a bit sequence of a fixed length.
• An optional mapping that maps a single candidate result r to a set of sub-results. Each sub-result is then hashed to generate a concatenated bit sequence for r.

A good hash function produces outputs whose bits are independent. This implies that we can treat the bits for any input structured result as having been generated by a binomial distribution with equal probability of generating 1s vs. 0s. This condition also holds when accumulating the bit sequences over a collection of results, as long as its elements are selected uniformly from the space of possible results. Therefore, the bits generated from a collection of unwatermarked results will follow a binomial distribution with parameter p = 0.5. This result provides a null hypothesis for a statistical test on a given bit sequence, testing whether it is likely to have been generated from a binomial distribution binomial(n, p) where p = 0.5 and n is the length of the bit sequence.
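As a concrete sketch, h can be realized with any strong hash function; SHA-256 below is an illustrative choice, not necessarily the one used by the authors:

```python
import hashlib

def h_bits(result: str, n_bits: int = 64) -> str:
    """Hash a structured result to a fixed-length bit sequence.
    A good hash makes the output bits behave like fair coin flips."""
    digest = hashlib.sha256(result.encode("utf-8")).digest()
    bits = "".join(f"{byte:08b}" for byte in digest)
    return bits[:n_bits]

bits = h_bits("an example translation")
print(len(bits))                # 64
print(set(bits) <= {"0", "1"})  # True
```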
For a collection CN = r1 · · · rN, we can define a watermark ranking function w to systematically select alternatives r′i ∈ Dk(q), such that the resulting C′N is unlikely to produce bit sequences that follow the p = 0.5 binomial distribution. A straightforward biasing criterion would be to select the candidate whose bit sequence exhibits the highest ratio of 1s. w can be defined as:

w(r, Dk(q), h) = #(1, h(r)) / |h(r)| (2)

where h(r) returns the randomized bit sequence for result r, and #(x, y) counts the number of occurrences of x in sequence y. Selecting alternative results to exhibit this bias will result in watermarked collections that exhibit this same bias.

3.2 Detecting the Watermark

To classify a collection CN as watermarked or non-watermarked, we apply the hashing operation h on each element in CN and concatenate the sequences. This sequence is tested against the null hypothesis that it was generated by a binomial distribution with parameter p = 0.5. We can apply a Fisherian test of statistical significance to determine whether the observed distribution of bits is unlikely to have occurred by chance under the null hypothesis (binomial with p = 0.5). We consider a collection of results that rejects the null hypothesis to be watermarked results generated by our own algorithms. The p-value under the null hypothesis is efficiently computed by:

p-value = Pn(X ≥ x) (3)
        = Σ_{i=x}^{n} (n choose i) p^i (1 − p)^(n−i) (4)

where x is the number of 1s observed in the collection, and n is the total number of bits in the sequence. Comparing this p-value against a desired significance level α, we reject the null hypothesis for collections that have Pn(X ≥ x) < α, thus deciding that such collections were generated by our own system. This classification criterion has a fixed false positive rate. Setting α = 0.05, we know that 5% of non-watermarked bit sequences will be falsely labeled as watermarked.
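The binomial test of Equations 3-4 is easy to sketch directly; the function names are ours:

```python
from math import comb

def p_value(ones: int, n: int, p: float = 0.5) -> float:
    """Equations 3-4: P(X >= ones) for X ~ binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(ones, n + 1))

def is_watermarked(bit_seq: str, alpha: float = 0.05) -> bool:
    """Reject the p = 0.5 null hypothesis => classify as watermarked."""
    return p_value(bit_seq.count("1"), len(bit_seq)) < alpha

print(p_value(8, 10))                       # 0.0546875 (= 56/1024)
print(is_watermarked("1" * 40 + "0" * 10))  # True: heavily 1-biased
```

At α = 0.05 a balanced sequence is kept as non-watermarked, while a strongly 1-biased one is rejected, matching the fixed false positive rate discussed above.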
This parameter α can be controlled on an application-specific basis. By biasing the selection of candidate results to produce more 1s than 0s, we have defined a watermarking approach that exhibits a fixed false positive rate, a probabilistically bounded detection rate, and a task-independent hashing and selection criterion. In the next sections, we will deal with the question of robustness to edit operations and quality degradation.

3.3 Robustness and Inherent Bias

We would like the ability to identify watermarked collections to be robust to simple edit operations. Even slight modifications to the elements within an item r would yield (by construction of the hash function) completely different bit sequences that no longer preserve the biases introduced by the watermark selection function. To ensure that the distributional biases introduced by the watermark selector are preserved, we can optionally map individual results into a set of sub-results, each one representing some local structure of r. h is then applied to each sub-result and the results concatenated to represent r. This mapping is defined as a component of the h operation. While a particular edit operation might affect a small number of sub-results, the majority of the bits in the concatenated bit sequence for r would remain untouched, thereby limiting the damage to the biases selected during watermarking. This is of course no defense against edit operations that are applied globally across the result; our expectation is that such edits would either significantly degrade the quality of the result or be straightforward to identify directly. For example, a sequence of words r = z1 · · · zL can be mapped into a set of consecutive n-gram sequences. Operations to edit a word zi in r will only affect events that consider the word zi.
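The n-gram sub-result mapping can be sketched as follows. This is a minimal illustration of Section 3.3's idea (here with the 3-5 gram range used later in the paper); the hash choice and function names are our own assumptions:

```python
import hashlib

def h_bits(s: str, n_bits: int = 16) -> str:
    """Hash a (sub-)result to a fixed-length bit string."""
    digest = hashlib.sha256(s.encode("utf-8")).digest()
    return "".join(f"{b:08b}" for b in digest)[:n_bits]

def sub_results(sentence: str, n_lo: int = 3, n_hi: int = 5):
    """Map a sentence into its consecutive 3-5 grams."""
    words = sentence.split()
    return [" ".join(words[i:i + n])
            for n in range(n_lo, n_hi + 1)
            for i in range(len(words) - n + 1)]

def result_bits(sentence: str) -> str:
    """Concatenate the hash bits of every sub-result; a local edit to
    one word only perturbs the n-grams that contain that word."""
    return "".join(h_bits(s) for s in sub_results(sentence))

subs = sub_results("the cat sat on the mat")
print(len(subs))                                   # 4 + 3 + 2 = 9
print(len(result_bits("the cat sat on the mat")))  # 9 * 16 = 144 bits
```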
To account for the fact that alternatives in Dk(q) might now result in bit sequences of different lengths, we can generalize the biasing criterion to directly reflect the expected contribution to the watermark by defining:

w(r, Dk(q), h) = Pn(X ≥ #(1, h(r))) (5)

where Pn gives probabilities from binomial(n = |h(r)|, p = 0.5).

Inherent collection-level biases: Our null hypothesis is based on the assumption that collections of results draw uniformly from the space of possible results. This assumption might not always hold and depends on the type of the results and collection. For example, considering a text document as a collection of sentences, we can expect that some sentences might repeat more frequently than others. This scenario is even more likely when applying a mapping into sub-results. n-gram sequences follow long-tailed or Zipfian distributions, with a small number of n-grams contributing heavily toward the total number of n-grams in a document. A random hash function guarantees that inputs are distributed uniformly at random over the output range. However, the same input will be assigned the same output deterministically. Therefore, if the distribution of inputs is heavily skewed to certain elements of the input space, the output distribution will not be uniformly distributed. The bit sequences resulting from the high-frequency sub-results have the potential to generate inherently biased distributions when accumulated at the collection level. We want to choose a mapping that tends towards generating uniformly from the space of sub-results. We can empirically measure the quality of a sub-result mapping for a specific task by computing the false positive rate on non-watermarked collections. For a given significance level α, an ideal mapping would result in false positive rates close to α as well. Figure 1 shows false positive rates from 4 alternative mappings, computed on a large corpus of French documents (see Table 1 for statistics).
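The empirical check described here — does a mapping's observed false positive rate stay near α on non-watermarked data? — can be simulated. Under an ideal (uniform) mapping the concatenated bits behave like fair coin flips, which is what this sketch models; the document and bit counts are illustrative, not the paper's:

```python
import random
from math import comb

def binom_tail(x: int, n: int, p: float = 0.5) -> float:
    """P(X >= x) for X ~ binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(x, n + 1))

def empirical_fpr(n_docs: int = 2000, bits_per_doc: int = 200,
                  alpha: float = 0.05, seed: int = 0) -> float:
    """Fraction of unwatermarked 'documents' (fair-coin bit sequences)
    that the binomial test falsely rejects at significance alpha."""
    rng = random.Random(seed)
    fp = 0
    for _ in range(n_docs):
        ones = sum(rng.getrandbits(1) for _ in range(bits_per_doc))
        if binom_tail(ones, bits_per_doc) < alpha:
            fp += 1
    return fp / n_docs

print(empirical_fpr())  # close to (and at most about) alpha = 0.05
```

A skewed mapping would violate the fair-coin assumption and push this rate well above α, which is exactly what the 1-gram and 1-5-gram mappings exhibit in Figure 1.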
Classification decisions are made at the collection level (documents), but the contribution to the false positive rate is based on the number of words in the classified document. We consider mappings from a result (sentence) into its 1-grams, 1−5-grams and 3−5-grams, as well as the non-mapping case, where the full result is hashed. Figure 1 shows that the 1-gram and 1−5-gram mappings generate sub-results that result in heavily biased false positive rates. The 3−5-gram mapping yields false positive rates close to their theoretically expected values. Small deviations are expected since documents make different contributions to the false positive rate as a function of the number of words that they represent. (Footnote 1: In the final version of this paper we will perform sampling to create a more reliable estimate of the false positive rate that is not overly influenced by document length distributions.) For the remainder of this work, we use the 3−5-gram mapping and the full sentence mapping, since the alternatives generate inherently biased distributions with very high false positive rates.

3.4 Considering Quality

The watermarking described in Equations 1 and 2 chooses alternative results on a per-result basis, with the goal of influencing collection-level bit sequences. The selection criteria as described will choose the most biased candidates available in Dk(q). The parameter k determines the extent to which lesser-quality alternatives can be chosen. If all the alternatives in each Dk(q) are of relatively similar quality, we expect minimal degradation due to watermarking. Specific tasks, however, can be particularly sensitive to choosing alternative results. Discriminative approaches that optimize for arg max selection like (Och, 2003; Liang et al., 2006; Chiang et al., 2009) train model parameters such that the top-ranked result is well separated from its competing alternatives.
Different queries also differ in the inherent ambiguity expected from their results; sometimes there really is just one correct result for a query, while for other queries, several alternatives might be equally good. By generalizing the definition of the w function to interpolate the estimated loss in quality and the gain in the watermarking signal, we can trade off the ability to identify the watermarked collections against quality degradation:

w(r, Dk(q), fw) = λ ∗ gain(r, Dk(q), fw) − (1 − λ) ∗ loss(r, Dk(q)) (6)

Loss: The loss(r, Dk(q)) function reflects the quality degradation that results from selecting alternative r as opposed to the best-ranked candidate in Dk(q). We will experiment with two variants:

lossrank(r, Dk(q)) = (rank(r) − 1)/k
losscost(r, Dk(q)) = (cost(r) − cost(r1))/cost(r1)

where:

• rank(r): returns the rank of r within Dk(q).
• cost(r): a weighted sum of features (not normalized over the search space) in a log-linear model such as those mentioned in (Och, 2003).
• r1: the highest-ranked alternative in Dk(q).

lossrank provides a generally applicable criterion to select alternatives, penalizing selection from deep within Dk(q). This estimate of the quality degradation does not reflect the generating model's opinion on relative quality. losscost considers the relative increase in the generating model's cost assigned to the alternative translation.

Gain: The gain(r, Dk(q), fw) function reflects the gain in the watermarking signal by selecting candidate r. We simply define the gain as the Pn(X ≥ #(1, h(r))) from Equation 5.
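Equation 6's interpolation can be sketched as follows; the candidate tuples and the gain and loss functions are toy stand-ins chosen purely for illustration:

```python
def interpolated_w(r, candidates, gain, loss, lam=0.7):
    """Equation 6: trade watermark gain against quality loss.
    lam = 1 recovers pure max-gain selection; lam = 0 always keeps
    the model's top-ranked candidate."""
    return lam * gain(r, candidates) - (1 - lam) * loss(r, candidates)

# Toy candidates: (label, fake hash bits). Gain = fraction of 1s in
# the bits; loss = a rank-style penalty based on list position.
cands = [("best", "0101"), ("alt1", "1101"), ("alt2", "1111")]
def gain_fn(r, D): return r[1].count("1") / len(r[1])
def loss_fn(r, D): return D.index(r) / len(D)

pick = max(cands, key=lambda r: interpolated_w(r, cands, gain_fn, loss_fn))
print(pick[0])  # alt2: its high gain outweighs the rank penalty at lam = 0.7
```

Lowering λ shifts the argmax back toward the model's original top candidate, which is exactly the quality/recall trade-off explored in Section 5.1.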
[Figure 1: Comparison of expected false positive rates against observed false positive rates, at varying p-value thresholds, for different sub-result mappings: (a) 1-gram mapping, (b) 1−5-gram mapping, (c) 3−5-gram mapping, (d) full result hashing.]

4 Related Work

Using watermarks with the goal of transmitting a hidden message within images, video, audio and monolingual text media is common. For structured text content, linguistic approaches like (Chapman et al., 2001; Gupta et al., 2006) use language-specific linguistic and semantic expansions to introduce hidden watermarks. These expansions provide alternative candidates within which messages can be encoded. Recent publications have extended this idea to machine translation, using multiple systems and expansions to generate alternative translations. (Stutsman et al., 2006) uses a hashing function to select alternatives that encode the hidden message in the lower-order bits of the translation. In each of these approaches, the watermarker has control over the collection of results into which the watermark is to be embedded. These approaches seek to embed a hidden message into a collection of results that is selected by the watermarker. In contrast, we address the condition where the input queries are not in the watermarker's control. The goal is therefore to introduce the watermark into all generated results, with the goal of probabilistically identifying such outputs. Our approach is also task-independent, avoiding the need for templates to generate additional alternatives. By addressing the problem directly within the search space of a dynamic programming algorithm, we have access to high-quality alternatives with well-defined models of quality loss. Finally, our approach is robust to local word editing. By using a sub-result mapping, we increase the level of editing required to obscure the watermark signal; at high levels of editing, the quality of the results themselves would be significantly degraded.
5 Experiments

We evaluate our watermarking approach applied to the outputs of statistical machine translation under the following experimental setup. A repository of parallel (aligned source and target language) web documents is sampled to produce a large corpus on which to evaluate the watermarking classification performance. The corpora represent translations into 4 diverse target languages, using English as the source language. Each document in this corpus can be considered a collection of un-watermarked structured results, where source sentences are queries and each target sentence represents a structured result. Using a state-of-the-art phrase-based statistical machine translation system (Och and Ney, 2004) trained on parallel documents identified by (Uszkoreit et al., 2010), we generate a set of 100 alternative translations for each source sentence. We apply the proposed watermarking approach, along with the proposed refinements that address task-specific loss (Section 3.4) and robustness to edit operations (Section 3.3), to generate watermarked corpora. Each method is controlled via a single parameter (like k or λ) which is varied to generate alternative watermarked collections. For each parameter value, we evaluate the Recall Rate and Quality Degradation with the goal of finding a setting that yields a high recall rate with minimal quality degradation. False positive rates are evaluated based on a fixed classification significance level of α = 0.05. The false positive and recall rates are evaluated at the word level; a document that is misclassified or correctly identified contributes its length in words towards the error calculation. In this work, we use α = 0.05 during classification, corresponding to an expected 5% false positive rate. The false positive rate is a function of h and the significance level α and is therefore constant across the parameter values k and λ.
We evaluate quality degradation on human-translated test corpora that are more typical for machine translation evaluation. Each test corpus consists of 5000 source sentences randomly selected from the web and translated into each respective language. We chose to evaluate quality on test corpora to ensure that degradations are not hidden by imperfectly matched web corpora and are consistent with the kind of results often reported for machine translation systems. As with the classification corpora, we create watermarked versions at each parameter value. For a given parameter value, we measure false positive and recall rates on the classification corpora and quality degradation on the evaluation corpora. Table 1 shows corpus statistics for the classification and test corpora and non-watermarked BLEU scores for each target language. All source texts are in English.

[Figure 2: BLEU loss against recall of watermarked content for the baseline approach (max K-best), rank interpolation, and cost interpolation.]

5.1 Loss Interpolated Experiments

Our first set of experiments demonstrates baseline performance using the watermarking criterion in Equation 5 versus the refinements suggested in Section 3.4 to mitigate quality degradation. The h function is computed on the full sentence result r with no sub-event mapping. The following methods are evaluated in Figure 2.

• Baseline method (labeled "max K-best"): selects r′ purely based on gain in watermarking signal (Equation 5) and is parameterized by k: the number of alternatives considered for each result.
• Rank interpolation: incorporates rank into w, varying the interpolation parameter λ.
• Cost interpolation: incorporates cost into w, varying the interpolation parameter λ.

The observed false positive rate on the French classification corpora is 1.9%.
[Table 1: Classification and quality (test) corpus statistics — documents, sentences, and words — and non-watermarked BLEU scores for Arabic, French, Hindi, and Turkish.] BLEU scores are reported for the quality corpora. We consider 0.2% BLEU loss as a threshold for acceptable quality degradation. Each method is judged by its ability to achieve high recall below this quality degradation threshold. Applying cost interpolation yields the best results in Figure 2, achieving a recall of 85% at 0.2% BLEU loss, while rank interpolation achieves a recall of 76%. The baseline approach of selecting the highest-gain candidate within a depth of k candidates does not provide sufficient parameterization to yield low quality degradation. At k = 2, this method yields almost 90% recall, but with approximately 0.4% BLEU loss.

5.2 Robustness Experiments

In Section 3.3, we proposed mapping results into sub-events or features. We considered alternative feature mappings in Figure 1, finding that mapping sentence results into a collection of 3−5-grams yields acceptable false positive rates at varied levels of α. Figure 3 presents results that compare moving from result-level hashing to the 3−5-gram sub-result mapping. We show the impact of the mapping on the baseline max K-best method as well as on cost interpolation. There are substantial reductions in recall rate at the 0.2% BLEU loss level when applying sub-result mappings in both cases. The cost interpolation method's recall drops from 85% to 77% when using the 3−5-gram event mapping. The observed false positive rate of the 3−5-gram mapping is 4.7%. By using the 3−5-gram mapping, we expect to increase robustness against local word edit operations, but we have sacrificed recall rate due to the inherent distributional bias discussed in Section 3.3.
[Figure 3: BLEU loss against recall of watermarked content for the baseline and cost interpolation methods using both result-level and 3−5-gram mapped events.]

5.3 Multilingual Experiments

The watermarking approach proposed here introduces no language-specific watermarking operations and is thus broadly applicable to translating into all languages. In Figure 4, we report results for the baseline and cost interpolation methods, considering both the result-level and 3−5-gram mapping. We set α = 0.05 and measure recall at 0.2% BLEU degradation for translation from English into Arabic, French, Hindi and Turkish. The observed false positive rates for full sentence hashing are: Arabic: 2.4%, French: 1.8%, Hindi: 5.6% and Turkish: 5.5%, while for the 3−5-gram mapping, they are: Arabic: 5.8%, French: 7.5%, Hindi: 3.5% and Turkish: 6.2%. Underlying translation quality plays an important role in translation quality degradation when watermarking. Without a sub-result mapping, French (BLEU: 26.45%) achieves recall of 85% at 0.2% BLEU loss, while the other languages achieve over 90% recall at the same BLEU loss threshold. Using a sub-result mapping degrades quality for each language pair, but changes the relative performance. Turkish experiences the highest relative drop in recall, unlike French and Arabic, where results are relatively more robust to using sub-sentence mappings. This is likely a result of differences in n-gram distributions across these languages. The languages considered here all use space-separated words. For languages that do not, like Chinese or Thai, our approach can be applied at the character level.

[Figure 4: Loss of recall when using the 3−5-gram mapping vs. sentence-level mapping for Arabic, French, Hindi and Turkish translations.]

6 Conclusions

In this work we proposed a general method to watermark and probabilistically identify the structured outputs of machine learning algorithms.
Our method provides probabilistic bounds on detection ability, analytic control of quality degradation, and is robust to local editing operations. Our method is applicable to any task where structured outputs are generated with ambiguities or ties in the results. We applied this method to the outputs of statistical machine translation, evaluating each refinement to our approach with false positive and recall rates against BLEU score quality degradation. Our results show that it is possible, across several language pairs, to achieve high recall rates (over 80%) with low false positive rates (between 5 and 8%) at minimal quality degradation (0.2% BLEU), while still allowing for local edit operations on the translated output. In future work we will continue to investigate methods to mitigate quality loss.

References

Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. 2007. Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.

Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19:263–311.

Mark Chapman, George Davida, and Marc Rennhard. 2001. A practical and effective approach to large-scale automated linguistic steganography. In Proceedings of the Information Security Conference.

David Chiang, Kevin Knight, and Wei Wang. 2009. 11,001 new features for statistical machine translation. In North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL-HLT).

Jade Goldstein, Mark Kantrowitz, Vibhu Mittal, and Jaime Carbonell. 1999. Summarizing text documents: Sentence selection and evaluation metrics. In Research and Development in Information Retrieval, pages 121–128.
Gaurav Gupta, Josef Pieprzyk, and Hua Xiong Wang. 2006. An attack-localizing watermarking scheme for natural language documents. In Proceedings of the 2006 ACM Symposium on Information, Computer and Communications Security, ASIACCS '06, pages 157–165, New York, NY, USA. ACM.

Percy Liang, Alexandre Bouchard-Cote, Dan Klein, and Ben Taskar. 2006. An end-to-end discriminative approach to machine translation. In Proceedings of the Joint International Conference on Computational Linguistics and Association for Computational Linguistics (COLING/ACL), pages 761–768.

Dragos Stefan Munteanu and Daniel Marcu. 2005. Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics.

Franz Josef Och and Hermann Ney. 2004. The alignment template approach to statistical machine translation. Computational Linguistics.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 2003 Meeting of the Association for Computational Linguistics.

Philip Resnik and Noah A. Smith. 2003. The web as a parallel corpus. Computational Linguistics.

Ryan Stutsman, Mikhail Atallah, Christian Grothoff, and Krista Grothoff. 2006. Lost in just the translation. In Proceedings of the 2006 ACM Symposium on Applied Computing.

Jakob Uszkoreit, Jay Ponte, Ashok Popat, and Moshe Dubiner. 2010. Large scale parallel document mining for machine translation. In Proceedings of the 2010 COLING.

5 0.053109761 19 emnlp-2011-Approximate Scalable Bounded Space Sketch for Large Data NLP

Author: Amit Goyal ; Hal Daume III

Abstract: We exploit sketch techniques, especially the Count-Min sketch, a memory, and time efficient framework which approximates the frequency of a word pair in the corpus without explicitly storing the word pair itself. These methods use hashing to deal with massive amounts of streaming text. We apply CountMin sketch to approximate word pair counts and exhibit their effectiveness on three important NLP tasks. Our experiments demonstrate that on all of the three tasks, we get performance comparable to Exact word pair counts setting and state-of-the-art system. Our method scales to 49 GB of unzipped web data using bounded space of 2 billion counters (8 GB memory).

6 0.051786806 119 emnlp-2011-Semantic Topic Models: Combining Word Distributional Statistics and Dictionary Definitions

7 0.051333122 64 emnlp-2011-Harnessing different knowledge sources to measure semantic relatedness under a uniform model

8 0.048529912 50 emnlp-2011-Evaluating Dependency Parsing: Robust and Heuristics-Free Cross-Annotation Evaluation

9 0.047670003 21 emnlp-2011-Bayesian Checking for Topic Models

10 0.047312215 73 emnlp-2011-Improving Bilingual Projections via Sparse Covariance Matrices

11 0.045317192 74 emnlp-2011-Inducing Sentence Structure from Parallel Corpora for Reordering

12 0.044664562 82 emnlp-2011-Learning Local Content Shift Detectors from Document-level Information

13 0.041881908 112 emnlp-2011-Refining the Notions of Depth and Density in WordNet-based Semantic Similarity Measures

14 0.041555423 5 emnlp-2011-A Fast Re-scoring Strategy to Capture Long-Distance Dependencies

15 0.04130128 107 emnlp-2011-Probabilistic models of similarity in syntactic context

16 0.041027457 61 emnlp-2011-Generating Aspect-oriented Multi-Document Summarization with Event-aspect model

17 0.039628938 125 emnlp-2011-Statistical Machine Translation with Local Language Models

18 0.039055891 116 emnlp-2011-Robust Disambiguation of Named Entities in Text

19 0.038651198 65 emnlp-2011-Heuristic Search for Non-Bottom-Up Tree Structure Prediction

20 0.03696293 110 emnlp-2011-Ranking Human and Machine Summarization Systems


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.14), (1, -0.045), (2, -0.031), (3, -0.076), (4, 0.022), (5, 0.03), (6, -0.002), (7, -0.02), (8, -0.001), (9, -0.032), (10, 0.028), (11, -0.095), (12, -0.041), (13, 0.005), (14, -0.006), (15, 0.018), (16, -0.042), (17, 0.056), (18, 0.008), (19, -0.056), (20, -0.012), (21, -0.167), (22, 0.039), (23, 0.045), (24, -0.085), (25, 0.028), (26, 0.007), (27, 0.189), (28, 0.072), (29, -0.122), (30, 0.107), (31, 0.064), (32, -0.143), (33, 0.02), (34, -0.05), (35, -0.137), (36, -0.015), (37, 0.153), (38, -0.066), (39, 0.108), (40, -0.07), (41, -0.032), (42, 0.177), (43, -0.09), (44, 0.105), (45, -0.073), (46, 0.106), (47, -0.058), (48, 0.245), (49, -0.128)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9396944 86 emnlp-2011-Lexical Co-occurrence, Statistical Significance, and Word Association

Author: Dipak L. Chaudhari ; Om P. Damani ; Srivatsan Laxman

Abstract: Lexical co-occurrence is an important cue for detecting word associations. We propose a new measure of word association based on a new notion of statistical significance for lexical co-occurrences. Existing measures typically rely on global unigram frequencies to determine expected co-occurrence counts. Instead, we focus only on documents that contain both terms (of a candidate word-pair) and ask if the distribution of the observed spans of the word-pair resembles that under a random null model. This would imply that the words in the pair are not related strongly enough for one word to influence placement of the other. However, if the words are found to occur closer together than explainable by the null model, then we hypothesize a more direct association between the words. Through extensive empirical evaluation on most of the publicly available benchmark data sets, we show the advantages of our measure over existing co-occurrence measures.

2 0.52385181 19 emnlp-2011-Approximate Scalable Bounded Space Sketch for Large Data NLP

Author: Amit Goyal ; Hal Daume III

Abstract: We exploit sketch techniques, especially the Count-Min sketch, a memory, and time efficient framework which approximates the frequency of a word pair in the corpus without explicitly storing the word pair itself. These methods use hashing to deal with massive amounts of streaming text. We apply CountMin sketch to approximate word pair counts and exhibit their effectiveness on three important NLP tasks. Our experiments demonstrate that on all of the three tasks, we get performance comparable to Exact word pair counts setting and state-of-the-art system. Our method scales to 49 GB of unzipped web data using bounded space of 2 billion counters (8 GB memory).

3 0.47442722 36 emnlp-2011-Corroborating Text Evaluation Results with Heterogeneous Measures

Author: Enrique Amigo ; Julio Gonzalo ; Jesus Gimenez ; Felisa Verdejo

Abstract: Automatically produced texts (e.g. translations or summaries) are usually evaluated with n-gram based measures such as BLEU or ROUGE, while the wide set of more sophisticated measures that have been proposed in the last years remains largely ignored for practical purposes. In this paper we first present an indepth analysis of the state of the art in order to clarify this issue. After this, we formalize and verify empirically a set of properties that every text evaluation measure based on similarity to human-produced references satisfies. These properties imply that corroborating system improvements with additional measures always increases the overall reliability of the evaluation process. In addition, the greater the heterogeneity of the measures (which is measurable) the higher their combined reliability. These results support the use of heterogeneous measures in order to consolidate text evaluation results.

4 0.42931291 112 emnlp-2011-Refining the Notions of Depth and Density in WordNet-based Semantic Similarity Measures

Author: Tong Wang ; Graeme Hirst

Abstract: We re-investigate the rationale for and the effectiveness of adopting the notions of depth and density in WordNet-based semantic similarity measures. We show that the intuition for including these notions in WordNet-based similarity measures does not always stand up to empirical examination. In particular, the traditional definitions of depth and density as ordinal integer values in the hierarchical structure of WordNet does not always correlate with human judgment of lexical semantic similarity, which imposes strong limitations on their contribution to an accurate similarity measure. We thus propose several novel definitions of depth and density, which yield significant improvement in degree of correlation with similarity. When used in WordNet-based semantic similarity measures, the new definitions consistently improve performance on a task of correlating with human judgment.

5 0.40835178 73 emnlp-2011-Improving Bilingual Projections via Sparse Covariance Matrices

Author: Jagadeesh Jagarlamudi ; Raghavendra Udupa ; Hal Daume III ; Abhijit Bhole

Abstract: Mapping documents into an interlingual representation can help bridge the language barrier of cross-lingual corpora. Many existing approaches are based on word co-occurrences extracted from aligned training data, represented as a covariance matrix. In theory, such a covariance matrix should represent semantic equivalence, and should be highly sparse. Unfortunately, the presence of noise leads to dense covariance matrices which in turn leads to suboptimal document representations. In this paper, we explore techniques to recover the desired sparsity in covariance matrices in two ways. First, we explore word association measures and bilingual dictionaries to weigh the word pairs. Later, we explore different selection strategies to remove the noisy pairs based on the association scores. Our experimental results on the task of aligning comparable documents shows the efficacy of sparse covariance matrices on two data sets from two different language pairs.

6 0.38518745 47 emnlp-2011-Efficient retrieval of tree translation examples for Syntax-Based Machine Translation

7 0.38468122 110 emnlp-2011-Ranking Human and Machine Summarization Systems

8 0.36869806 38 emnlp-2011-Data-Driven Response Generation in Social Media

9 0.33841065 148 emnlp-2011-Watermarking the Outputs of Structured Prediction with an application in Statistical Machine Translation.

10 0.31496513 65 emnlp-2011-Heuristic Search for Non-Bottom-Up Tree Structure Prediction

11 0.31161484 2 emnlp-2011-A Cascaded Classification Approach to Semantic Head Recognition

12 0.31006631 82 emnlp-2011-Learning Local Content Shift Detectors from Document-level Information

13 0.30442294 111 emnlp-2011-Reducing Grounded Learning Tasks To Grammatical Inference

14 0.29998785 64 emnlp-2011-Harnessing different knowledge sources to measure semantic relatedness under a uniform model

15 0.27612534 143 emnlp-2011-Unsupervised Information Extraction with Distributional Prior Knowledge

16 0.27227792 5 emnlp-2011-A Fast Re-scoring Strategy to Capture Long-Distance Dependencies

17 0.25868982 50 emnlp-2011-Evaluating Dependency Parsing: Robust and Heuristics-Free Cross-Annotation Evaluation

18 0.24385327 74 emnlp-2011-Inducing Sentence Structure from Parallel Corpora for Reordering

19 0.233256 18 emnlp-2011-Analyzing Methods for Improving Precision of Pivot Based Bilingual Dictionaries

20 0.23026454 37 emnlp-2011-Cross-Cutting Models of Lexical Semantics


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(23, 0.064), (36, 0.023), (37, 0.016), (45, 0.632), (53, 0.01), (54, 0.02), (57, 0.012), (64, 0.01), (66, 0.018), (69, 0.01), (79, 0.038), (82, 0.011), (87, 0.012), (96, 0.019), (98, 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.97938764 21 emnlp-2011-Bayesian Checking for Topic Models

Author: David Mimno ; David Blei

Abstract: Real document collections do not fit the independence assumptions asserted by most statistical topic models, but how badly do they violate them? We present a Bayesian method for measuring how well a topic model fits a corpus. Our approach is based on posterior predictive checking, a method for diagnosing Bayesian models in user-defined ways. Our method can identify where a topic model fits the data, where it falls short, and in which directions it might be improved.
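Posterior predictive checking, as summarized above, asks whether data simulated from the fitted model look like the observed data. A minimal version of the idea: compare an observed word count against replicated counts drawn under the model's word probability, and report the fraction of replications at least as extreme. The function name and the binomial simplification are mine; the paper's discrepancy functions for topic models are more elaborate.

```python
import random

def ppc_pvalue(observed_count, n_tokens, word_prob, trials=2000, seed=0):
    """Toy posterior predictive check.

    Simulates `trials` replicated datasets of `n_tokens` tokens, where each
    token is the word of interest with probability `word_prob` (the model's
    estimate), and returns the fraction of replications whose count is at
    least as large as `observed_count`. Values near 0 or 1 signal misfit.
    """
    rng = random.Random(seed)
    extreme = 0
    for _ in range(trials):
        rep_count = sum(1 for _ in range(n_tokens) if rng.random() < word_prob)
        if rep_count >= observed_count:
            extreme += 1
    return extreme / trials

# A count matching the model's expectation (50 out of 1000 at p=0.05)
# yields a moderate p-value; a far larger count yields a tiny one.
p_ok = ppc_pvalue(50, 1000, 0.05)
p_bad = ppc_pvalue(120, 1000, 0.05)
```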

same-paper 2 0.97197324 86 emnlp-2011-Lexical Co-occurrence, Statistical Significance, and Word Association

Author: Dipak L. Chaudhari ; Om P. Damani ; Srivatsan Laxman

Abstract: Lexical co-occurrence is an important cue for detecting word associations. We propose a new measure of word association based on a new notion of statistical significance for lexical co-occurrences. Existing measures typically rely on global unigram frequencies to determine expected co-occurrence counts. Instead, we focus only on documents that contain both terms (of a candidate word-pair) and ask if the distribution of the observed spans of the word-pair resembles that under a random null model. This would imply that the words in the pair are not related strongly enough for one word to influence placement of the other. However, if the words are found to occur closer together than explainable by the null model, then we hypothesize a more direct association between the words. Through extensive empirical evaluation on most of the publicly available benchmark data sets, we show the advantages of our measure over existing co-occurrence measures.
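The core idea of the abstract — compare the observed span of a word pair within a document against spans under a null model that places the words at random — can be illustrated with a small Monte Carlo test. The function name and the empirical p-value formulation are mine; the paper's actual significance test and null model are more refined than this sketch.

```python
import random

def span_pvalue(positions_w1, positions_w2, doc_len, trials=1000, seed=0):
    """Empirical p-value for the minimum span of a word pair in a document.

    Compares the observed minimum token distance between occurrences of
    two words against a null model that scatters the same numbers of
    occurrences uniformly at random over `doc_len` token positions.
    A small p-value suggests the words co-occur closer together than
    chance placement would explain.
    """
    rng = random.Random(seed)
    observed = min(abs(a - b) for a in positions_w1 for b in positions_w2)
    hits = 0
    for _ in range(trials):
        p1 = rng.sample(range(doc_len), len(positions_w1))
        p2 = rng.sample(range(doc_len), len(positions_w2))
        null_span = min(abs(a - b) for a in p1 for b in p2)
        if null_span <= observed:
            hits += 1
    return hits / trials

# Two words appearing at adjacent positions in a 1000-token document:
# random placement rarely produces a span this small.
p = span_pvalue([100], [101], doc_len=1000)
```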

3 0.94561654 19 emnlp-2011-Approximate Scalable Bounded Space Sketch for Large Data NLP

Author: Amit Goyal ; Hal Daume III

Abstract: We exploit sketch techniques, especially the Count-Min sketch, a memory- and time-efficient framework which approximates the frequency of a word pair in the corpus without explicitly storing the word pair itself. These methods use hashing to deal with massive amounts of streaming text. We apply the Count-Min sketch to approximate word-pair counts and exhibit its effectiveness on three important NLP tasks. Our experiments demonstrate that on all three tasks, we get performance comparable to the exact word-pair counts setting and the state-of-the-art system. Our method scales to 49 GB of unzipped web data using a bounded space of 2 billion counters (8 GB memory).
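The Count-Min sketch the abstract relies on is simple enough to sketch directly: a small 2-D table of counters, several hash functions, increment on add, take the row-wise minimum on query. The class name, dimensions, and use of MD5 as a hash family are my own choices for illustration, not the authors' implementation.

```python
import hashlib

class CountMinSketch:
    """Minimal Count-Min sketch: bounded-space approximate counting.

    Estimates are never below the true count; hash collisions can only
    inflate them, and taking the minimum over rows limits the inflation.
    """
    def __init__(self, width=1000, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        # One hash function per row, derived by salting MD5 with the row index.
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.width

    def add(self, item, count=1):
        for row, col in enumerate(self._columns(item)):
            self.table[row][col] += count

    def query(self, item):
        return min(self.table[row][col]
                   for row, col in enumerate(self._columns(item)))

# Approximate a word-pair count without storing the pair itself.
cms = CountMinSketch()
for _ in range(42):
    cms.add(("statistical", "significance"))
```

Note that only `depth * width` integers are ever stored, regardless of how many distinct word pairs stream past — which is what lets the paper's counters fit massive web-scale data in bounded memory.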

4 0.90371078 90 emnlp-2011-Linking Entities to a Knowledge Base with Query Expansion

Author: Swapna Gottipati ; Jing Jiang

Abstract: In this paper we present a novel approach to entity linking based on a statistical language model-based information retrieval with query expansion. We use both local contexts and global world knowledge to expand query language models. We place a strong emphasis on named entities in the local contexts and explore a positional language model to weigh them differently based on their distances to the query. Our experiments on the TAC-KBP 2010 data show that incorporating such contextual information indeed aids in disambiguating the named entities and consistently improves the entity linking performance. Compared with the official results from KBP 2010 participants, our system shows competitive performance.

5 0.86139756 103 emnlp-2011-Parser Evaluation over Local and Non-Local Deep Dependencies in a Large Corpus

Author: Emily M. Bender ; Dan Flickinger ; Stephan Oepen ; Yi Zhang

Abstract: In order to obtain a fine-grained evaluation of parser accuracy over naturally occurring text, we study 100 examples each of ten reasonably frequent linguistic phenomena, randomly selected from a parsed version of the English Wikipedia. We construct a corresponding set of gold-standard target dependencies for these 1000 sentences, operationalize mappings to these targets from seven state-of-theart parsers, and evaluate the parsers against this data to measure their level of success in identifying these dependencies.

6 0.74369985 119 emnlp-2011-Semantic Topic Models: Combining Word Distributional Statistics and Dictionary Definitions

7 0.70970976 101 emnlp-2011-Optimizing Semantic Coherence in Topic Models

8 0.69437504 64 emnlp-2011-Harnessing different knowledge sources to measure semantic relatedness under a uniform model

9 0.67540216 37 emnlp-2011-Cross-Cutting Models of Lexical Semantics

10 0.64353031 33 emnlp-2011-Cooooooooooooooollllllllllllll!!!!!!!!!!!!!! Using Word Lengthening to Detect Sentiment in Microblogs

11 0.64064956 56 emnlp-2011-Exploring Supervised LDA Models for Assigning Attributes to Adjective-Noun Phrases

12 0.61942583 55 emnlp-2011-Exploiting Syntactic and Distributional Information for Spelling Correction with Web-Scale N-gram Models

13 0.61853105 73 emnlp-2011-Improving Bilingual Projections via Sparse Covariance Matrices

14 0.5780986 116 emnlp-2011-Robust Disambiguation of Named Entities in Text

15 0.5764277 81 emnlp-2011-Learning General Connotation of Words using Graph-based Algorithms

16 0.56478626 47 emnlp-2011-Efficient retrieval of tree translation examples for Syntax-Based Machine Translation

17 0.56436121 91 emnlp-2011-Literal and Metaphorical Sense Identification through Concrete and Abstract Context

18 0.56011975 11 emnlp-2011-A Simple Word Trigger Method for Social Tag Suggestion

19 0.55849844 133 emnlp-2011-The Imagination of Crowds: Conversational AAC Language Modeling using Crowdsourcing and Large Data Sources

20 0.5544154 107 emnlp-2011-Probabilistic models of similarity in syntactic context