emnlp emnlp2010 emnlp2010-101 knowledge-graph by maker-knowledge-mining

101 emnlp-2010-Storing the Web in Memory: Space Efficient Language Models with Constant Time Retrieval


Source: pdf

Author: David Guthrie ; Mark Hepple

Abstract: We present three novel methods of compactly storing very large n-gram language models. These methods use substantially less space than all known approaches and allow n-gram probabilities or counts to be retrieved in constant time, at speeds comparable to modern language modeling toolkits. Our basic approach generates an explicit minimal perfect hash function that maps all n-grams in a model to distinct integers to enable storage of associated values. Extensions of this approach exploit distributional characteristics of n-gram data to reduce storage costs, including variable length coding of values and the use of tiered structures that partition the data for more efficient storage. We apply our approach to storing the full Google Web1T n-gram set and all 1-to-5 grams of the Gigaword newswire corpus. For the 1.5 billion n-grams of Gigaword, for example, we can store full count information at a cost of 1.66 bytes per n-gram (around 30% of the cost when using the current state-of-the-art approach), or quantized counts for 1.41 bytes per n-gram. For applications that are tolerant of a certain class of relatively innocuous errors (where unseen n-grams may be accepted as rare n-grams), we can reduce the latter cost to below 1 byte per n-gram.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Our basic approach generates an explicit minimal perfect hash function that maps all n-grams in a model to distinct integers to enable storage of associated values. [sent-7, score-0.972]

2 Extensions of this approach exploit distributional characteristics of n-gram data to reduce storage costs, including variable length coding of values and the use of tiered structures that partition the data for more efficient storage. [sent-8, score-0.455]

3 For the 1.5 billion n-grams of Gigaword, for example, we can store full count information at a cost of 1.66 bytes per n-gram (around 30% of the cost when using the current state-of-the-art approach), or quantized counts for 1.41 bytes per n-gram. [sent-12, score-0.342]

4 A prevalent approach for language model storage is the use of compact trie structures, but these structures do not scale well and require space proportional to both the number of n-grams and the vocabulary size. [sent-21, score-0.381]

5 These Bloom filter based models exploit the idea that it is not actually necessary to store the n-grams of the model, as long as, when queried with an n-gram, the model returns the correct count or probability for it. [sent-24, score-0.328]

6 The first structure makes use of an explicit perfect hash function that is minimal in that it maps n keys to integers in the range 1 to n. [sent-27, score-0.852]

7 We show that by using a minimal perfect hash function and exploiting the distributional characteristics of the data we produce n-gram models that use less space than all known approaches with no reduction in speed. [sent-28, score-0.694]

8 Our two further models achieve even more compact storage while maintaining constant time access, by using variable length coding to compress the n-gram values and by using tiered hash structures to partition the data. [sent-29, score-0.984]

9 This move allows, for example, the cost of storing a quantized model to be reduced to 1 byte per n-gram or less. [sent-40, score-0.391]

10 Some of the empirical storage results reported later in the paper relate to LMs recording n-gram count values which have been quantized using this uniform binning approach. [sent-55, score-0.494]
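
A minimal Python sketch of the binning idea, to make the quantization step concrete. Interpreting "uniform binning" as equal-frequency bins represented by their means is an assumption for this sketch, and the 8-bit bin count and toy data are illustrative rather than the paper's exact setup.

```python
import math, random

def uniform_bin_quantize(counts, bits=8):
    """Equal-frequency ('uniform') binning: sort the values, cut them into at
    most 2**bits contiguous groups of roughly equal size, and replace every
    value by its bin id; each bin is represented by the mean of its members."""
    n_bins = 2 ** bits
    order = sorted(range(len(counts)), key=lambda i: counts[i])
    group_size = math.ceil(len(order) / n_bins)      # guarantees <= n_bins groups
    bin_ids = [0] * len(counts)
    codebook = []
    for start in range(0, len(order), group_size):
        group = order[start:start + group_size]
        codebook.append(sum(counts[i] for i in group) / len(group))
        for i in group:
            bin_ids[i] = len(codebook) - 1
    return bin_ids, codebook

# Toy usage on a heavy-tailed count distribution.
random.seed(0)
raw_counts = [int(1 / (1.0 - random.random()) ** 2) for _ in range(10000)]
ids, codebook = uniform_bin_quantize(raw_counts, bits=8)
print(ids[:5], [round(codebook[b], 1) for b in ids[:5]])
```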

11 These methods store language models in relatively little space by not actually keeping the n-gram key in the structure and by allowing a small probability of returning a false positive. [sent-72, score-0.366]

12 This is a weaker structure than a dictionary or hash table, which additionally associates a value with each key. [sent-84, score-0.506]

13 Bloom filters use space well below the information-theoretic lower bound required to actually store the keys, and can answer queries in O(1) time. [sent-85, score-0.358]

14 A Bloom filter stores a set S of n elements in a bit array B of size m. [sent-87, score-0.6]

15 To store an item x from S in B we compute k random independent hash functions on x that each return a value in the range [0..m − 1]. [sent-89, score-0.699]

16 We do this for all elements in S, storing to the same bit array. [sent-94, score-0.322]

17 Elements may hash to an index in B that has already been set to 1 and in this case we can think of these elements as “sharing” this bit. [sent-95, score-0.459]

18 To test whether set S contains a key w, we run our k hash functions on w and check if all those locations in B are set to 1. [sent-96, score-0.54]

19 This error rate depends on the number of hash functions k and the ratio m/n. [sent-98, score-0.48]

20 For instance, with k = 3 hash functions and a bit array of size m = 20n, we can expect to get a false positive rate of 0.0027. [sent-99, score-0.968]
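
A minimal Python sketch of the bit-array/k-hash-function scheme just described. The seeded SHA-256 hashing is an illustrative stand-in for the "k random independent hash functions" of the text, and the last line just reproduces the quoted theoretical false-positive rate for k = 3, m = 20n.

```python
import hashlib
import math

class BloomFilter:
    """Minimal Bloom filter: a bit array B of size m and k hash functions.
    Membership queries may return false positives but never false negatives."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)

    def _indices(self, item):
        # Derive k independent-looking positions in [0, m) from a digest;
        # a stand-in for truly random independent hash functions.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.m

    def add(self, item):
        for j in self._indices(item):
            self.bits[j // 8] |= 1 << (j % 8)

    def __contains__(self, item):
        return all((self.bits[j // 8] >> (j % 8)) & 1 for j in self._indices(item))

n = 10_000
bf = BloomFilter(m=20 * n, k=3)
for i in range(n):
    bf.add(f"ngram_{i}")
print(all(f"ngram_{i}" in bf for i in range(n)))        # True: no false negatives
# Theoretical false-positive rate for k = 3, m = 20n: about 0.0027.
print(round((1 - math.exp(-3 * n / (20 * n))) ** 3, 4))
```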

21 Talbot and Osborne (2007b) and Talbot and Osborne (2007a) adapt Bloom filters to the requirement of storing values for n-grams by concatenating the key (n-gram) and value to form a single item that is inserted into the filter. [sent-101, score-0.328]

22 For a quantized value range [1..V], a quantized value v is stored by inserting into the filter all pairings of the n-gram with values from 1 up to v. [sent-104, score-0.387]

23 Talbot and Osborne use a simple logarithmic quantization of counts that produces limited quantized value ranges, where most items will have values that are low in the range, so that the serial lookup process will require quite a low number of steps on average. [sent-107, score-0.326]
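
A sketch of the key-value concatenation and serial lookup just described. A plain Python set stands in for the underlying Bloom filter so that only the encoding logic is visible; swapping in a real Bloom filter (as sketched above) is what buys the space savings and introduces the small false-positive risk.

```python
# Stand-in for the Bloom filter: a set gives the same insert/lookup interface
# without false positives.
filter_contents = set()

def store_count(ngram, quantized_value):
    # Insert one (n-gram, j) pairing for every j from 1 up to the stored value.
    for j in range(1, quantized_value + 1):
        filter_contents.add((ngram, j))

def lookup_count(ngram):
    # Serial lookup: probe (n-gram, j) for j = 1, 2, ... until the filter
    # reports "absent"; the last j still present is the stored value.
    j = 0
    while (ngram, j + 1) in filter_contents:
        j += 1
    return j

store_count("the quick brown", 3)           # e.g. a log-quantized count of 3
print(lookup_count("the quick brown"))      # 3
print(lookup_count("some unseen n-gram"))   # 0
```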

24 Such a filter gives an error rate of 0.0027, as noted above, but in a situation where 3 key-value items were being stored per n-gram on average, this error rate would in fact require a storage cost of 60 bits per original n-gram. [sent-116, score-0.727]

25 To test whether a given key is present in a populated Bloomier filter, we apply k hash functions to the key and use the results as indices for retrieving the data stored at k locations within the filter, similarly to look-up in a Bloom filter. [sent-122, score-0.713]

26 In this case, however, the data retrieved from the filter consists of k bit vectors, which are combined with a fingerprint of the key, using bitwise XOR, to return the stored value. [sent-123, score-0.609]

27 The risk of false positives is managed by incorporating a fingerprint of the n-gram, and by making the bit vectors longer than the minimum length required to store values. [sent-124, score-0.728]

28 Nevertheless, when using v bits to represent values and e bits for error detection, this approach allows a language model to be stored at a cost of 1.23 × (v + e) bits per n-gram. [sent-129, score-0.581]

29 The S-MPHR structure can be divided into 3 parts as shown in Figure 1: Stage 1 is a minimal perfect hash function; Stage 2 is a fingerprint and rank array; and Stage 3 is a unique value array. [sent-133, score-1.053]

30 (Minimal Perfect Hash Function) The first part of the structure is a minimal perfect hash function that maps every n-gram in the training data to a distinct integer in the range 0 to N − 1, where N is the total number of n-grams to store. [sent-143, score-0.741]

31 We use these integers as indices into the array of Stage 2 of our structure. [sent-144, score-0.348]

32 We use the CHD (Belazzougui et al., 2009) algorithm to generate a minimal perfect hash function that requires 2.07 bits per key. [sent-146, score-0.648]

33 The first step is to use a hash function g(x), to map every key to a bucket B in the range 0 to r. [sent-150, score-0.618]

34 (For this step we use a simple hash function like the one used for generating fingerprints in the previous section. [sent-151, score-0.688]

35 To achieve 2.07 bits per key we use r = N/5, so that the average bucket size is 5). [sent-155, score-0.339]

36 This bit array T[0..N − 1] is initialized to zeros and used during construction. [sent-162, score-0.443]

37 It keeps track of which integers in the range 0 to N − 1 the minimal perfect hash has already mapped keys to. [sent-165, score-0.351]

38 Next, we must assume we have access to a family of random and independent hash functions h1, h2, h3, . . . [sent-166, score-0.505]

39 In practice it is sufficient to use functions that behave similarly to fully random independent hash functions, and Belazzougui et al. [sent-170, score-0.501]

40 (2009) demonstrate how such functions can be generated easily by combining two simple hash functions. [sent-171, score-0.48]

41 For each bucket, in the sorted order from largest to smallest, they search for a random hash function that maps all elements of the bucket to values in T that are currently set to 0. [sent-173, score-0.564]

42 So, for each bucket Bi, it is necessary to iteratively try hash functions hℓ for ℓ = 1, 2, 3, . . . [sent-175, score-0.51]

43 to hash every element of Bi to a distinct index j in T that contains a zero. [sent-178, score-0.484]

44 When such a hash function is found (i.e. the size of {hℓ(x) | x ∈ Bi} equals the size of Bi), we need only store the index ℓ of the successful function in an array σ and set T[j] = 1 for all positions j that hℓ hashed to. [sent-180, score-0.478]

45 Notice that the reason the largest buckets are handled first is because they have the most elements to displace and this is easier when the array T contains more empty positions (zeros). [sent-181, score-0.346]
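
A sketch of the bucket-then-displace construction described above. This is a simplified, uncompressed variant in the spirit of CHD: seeded SHA-256 is an illustrative stand-in for the hash family, the bucket ratio of 5 mirrors the text, and the σ array is left uncompressed.

```python
import hashlib

def _h(seed, key, mod):
    # Stand-in hash family: seeded SHA-256 reduced modulo `mod`.
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "little") % mod

def build_mph(keys, bucket_ratio=5):
    """Hash-and-displace construction in the spirit of CHD, without the final
    compression of the sigma array: returns (sigma, N, r)."""
    N = len(keys)
    r = max(1, N // bucket_ratio)                     # number of buckets (avg size ~5)
    buckets = [[] for _ in range(r)]
    for key in keys:
        buckets[_h("g", key, r)].append(key)          # step 1: bucket by g(x)
    T = [0] * N                                       # which target slots are taken
    sigma = [0] * r                                   # index of the successful h_l per bucket
    for b in sorted(range(r), key=lambda i: -len(buckets[i])):   # largest buckets first
        ell = 1
        while True:                                   # try h_1, h_2, h_3, ...
            slots = [_h(("h", ell), key, N) for key in buckets[b]]
            if len(set(slots)) == len(slots) and all(T[s] == 0 for s in slots):
                for s in slots:
                    T[s] = 1
                sigma[b] = ell
                break
            ell += 1
    return sigma, N, r

def mph_lookup(key, sigma, N, r):
    return _h(("h", sigma[_h("g", key, r)]), key, N)

keys = [f"ngram {i}" for i in range(1000)]
sigma, N, r = build_mph(keys)
print(len({mph_lookup(k, sigma, N, r) for k in keys}) == N)   # distinct integer per key
```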

46 The final step in the algorithm is to compress the σ array (which has length equal to the number of buckets |B|), retaining O(1) access. [sent-182, score-0.382]

47 This compression is achieved using a simple variable length encoding with an index array (Fredriksson and Nikitin, 2007). [sent-183, score-0.369]

48 To reduce the probability of these unseen n-grams giving false positive results from our model, we store a fingerprint of each n-gram in Stage 2 of our structure that can be compared against the fingerprints of unseen n-grams when queried. [sent-186, score-0.836]

49 If these fingerprints of the queried n-gram and the stored n-gram do not match then the model will correctly report that the n-gram has not been seen before. [sent-187, score-0.342]

50 We use Austin Appleby’s MurmurHash2 implementation to fingerprint each n-gram and then store the m highest order bits. [sent-191, score-0.414]

51 This rank is an index into the array of Stage 3 of our structure that holds the unique values associated with any n-gram. [sent-195, score-0.476]
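
Putting the three stages together, here is a hedged sketch of the S-MPHR query path. A Python dict stands in for the minimal perfect hash of Stage 1, SHA-256 stands in for MurmurHash2, and the 12-bit fingerprint width is an illustrative choice rather than the paper's setting.

```python
import hashlib

FP_BITS = 12  # illustrative fingerprint width

def fingerprint(ngram, bits=FP_BITS):
    # Stand-in for MurmurHash2: hash the n-gram and keep the top `bits` bits.
    digest = hashlib.sha256(ngram.encode()).digest()
    return int.from_bytes(digest[:8], "big") >> (64 - bits)

def build_smphr(counts_by_ngram):
    # Stage 1 stand-in: a dict plays the role of the minimal perfect hash,
    # giving each n-gram a distinct index in [0, N).
    mph = {ng: i for i, ng in enumerate(counts_by_ngram)}
    # Stage 3: one entry per *distinct* count value.
    unique_values = sorted(set(counts_by_ngram.values()))
    rank_of = {v: rank for rank, v in enumerate(unique_values)}
    # Stage 2: per n-gram, a fingerprint and the rank of its count.
    fingerprints = [0] * len(mph)
    ranks = [0] * len(mph)
    for ng, count in counts_by_ngram.items():
        fingerprints[mph[ng]] = fingerprint(ng)
        ranks[mph[ng]] = rank_of[count]
    return mph, fingerprints, ranks, unique_values

def query(ngram, mph, fingerprints, ranks, unique_values):
    # A real MPH maps *any* key to some slot; the dict stand-in mimics that here.
    i = mph.get(ngram, hash(ngram) % len(fingerprints))
    if fingerprints[i] != fingerprint(ngram):
        return None          # unseen n-gram rejected by the fingerprint test
    return unique_values[ranks[i]]

structure = build_smphr({"the quick brown": 42, "quick brown fox": 7, "brown fox jumps": 42})
print(query("quick brown fox", *structure))   # 7
print(query("never seen this", *structure))   # None with probability 1 - 2**-FP_BITS
```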

52 (Unique Value Array) We describe our storage of the values associated with n-grams in our model assuming we are storing frequency “counts” of n-grams, but it applies equally to storing quantized probabilities. [sent-197, score-0.794]

53 This is similar to quantization in that it reduces the number of bits required for storage, but unlike quantization it does not require a loss of any information. [sent-205, score-0.386]

54 This was motivated by the sparsity of n-gram frequency counts in corpora in the sense that if we take the lowest n-gram frequency count and the highest n-gram frequency count then most of the integers in that range do not occur as a frequency count of any n-grams in the corpus. [sent-206, score-0.417]

55 Because we only store the frequency rank, to keep the precise frequency information we need only ⌈log2 K⌉ bits per n-gram, where K is the number of distinct frequency counts. [sent-209, score-0.324]
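
A small worked example of the ⌈log2 K⌉ saving. The toy count distribution below is invented purely to show that K, the number of distinct counts, is typically far smaller than the maximum count.

```python
import math

def rank_storage_bits(counts):
    """Bits per n-gram needed for exact counts under the ranking scheme
    (ceil(log2 K) with K the number of distinct counts), versus the bits
    needed to store the raw counts directly."""
    k = len(set(counts))
    return math.ceil(math.log2(k)), math.ceil(math.log2(max(counts) + 1))

# Toy heavy-tailed counts: a million "n-grams" but only ~2000 distinct values.
counts = [10**6 // (i + 1) for i in range(1_000_000)]
rank_bits, raw_bits = rank_storage_bits(counts)
print(rank_bits, raw_bits)   # e.g. 11 vs 20: the rank is far cheaper to store
```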

56 (Storage Requirements) We now consider the storage requirements of our S-MPHR approach, and how it compares against the Bloomier filter method of Talbot and Brants (2008). [sent-213, score-0.359]

57 We saw that the storage requirements of the Talbot and Brants (2008) Bloomier filter method are a function of the number of n-grams n, the bits of data d to be stored per n-gram (with d = v + e: v bits for value storage, and e bits for error detection), and a multiplying factor of 1.23. [sent-215, score-1.144]

58 The PHF itself costs 2.07 bits per n-gram, and so the comparable overall cost to store an S-MPHR model is 2.07 + d bits per n-gram. [sent-220, score-0.455]

59 We therefore modify our original architecture to put the ranks and fingerprints into separate arrays, of which the ranks array will be compressed, as shown in Figure 2. [sent-237, score-0.609]

60 Much like the final stage of the CHD minimal perfect hash algorithm we employ a random access compression algorithm of Fredriksson and Nikitin (2007) to reduce the size required by the array of ranks. [sent-238, score-1.065]

61 The first step in the compression is to encode the ranks array using a dense variable length coding. [sent-240, score-0.449]

62 Let s1, s2, . . . , sK be the ranks that occur in the rank array, sorted by their frequency. [sent-245, score-0.477]

63 All of the values from the rank array are coded in this form and concatenated to form a large bit vector retaining their original ordering. [sent-249, score-0.649]

64 The length in bits for the ith number is thus ⌊log2(i + 2)⌋, and so the number of bits required for the whole variable length coded rank array is b = Σ_{i=1..K} f(si) · ⌊log2(i + 2)⌋. [sent-250, score-0.728]

65 We create an additional bit array D of the same size b as the variable length coded array that simply contains ones in all positions that a code begins in the rank array and zeros in all other positions. [sent-253, score-1.308]

66 That is the ith rank in the variable length coded array occurs at position select1 (D, i), where select1 gives the position of the ith one in the array. [sent-254, score-0.513]

67 We can use this structure to identify the ith code in our variable length encoded rank array by querying for its starting position, select1(D, i), and computing its length from its ending position, select1(D, i + 1) − 1. [sent-257, score-0.556]
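
A sketch of the dense variable-length coding plus select1-based decoding described above. The code lengths follow the ⌊log2(i + 2)⌋ rule, the toy rank sequence is invented, and select1 is implemented as a plain linear scan rather than the constant-time index a real structure would use.

```python
def encode(i):
    """Dense variable-length code: write (i + 2) in binary and drop its
    leading 1 bit, giving floor(log2(i + 2)) bits for the i-th number."""
    length = (i + 2).bit_length() - 1
    return format((i + 2) - (1 << length), f"0{length}b")

def decode(code_bits):
    # Inverse mapping, as in the text: rank = code + 2**(length in bits) - 2.
    return int(code_bits, 2) + (1 << len(code_bits)) - 2

ranks = [1, 4, 2, 1, 7, 3, 1]                    # frequency-ordered rank ids to store
payload = "".join(encode(x) for x in ranks)      # concatenated variable-length codes
D = "".join("1" + "0" * (len(encode(x)) - 1) for x in ranks)   # 1 marks each code start

def select1(bits, i):
    """Position of the i-th one (1-indexed); a linear scan keeps the sketch short."""
    pos = -1
    for _ in range(i):
        pos = bits.index("1", pos + 1)
    return pos

def lookup(i):
    start = select1(D, i)
    end = select1(D, i + 1) if i < len(ranks) else len(payload)
    return decode(payload[start:end])

print([lookup(i) for i in range(1, len(ranks) + 1)])   # [1, 4, 2, 1, 7, 3, 1]
```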

68 This code and its bit length are then converted back into the original rank: rank = code + 2^(length in bits) − 2. (Tiered MPHR) In this section we describe an alternative route to extending our basic S-MPHR model to achieve better space efficiency, by using multiple hash stores. [sent-259, score-0.655]

69 for the example just mentioned, with two MPHRs having 8 and 20 bit value storage respectively. [sent-266, score-0.441]

70 (Figure 3: Tiered minimal perfect hash data structure.) We will here explore an alternative approach that we call Tiered MPHR, which avoids this compounding of errors, and which limits the number of look-ups to a maximum of 2, irrespective of how many hashes are used. [sent-288, score-0.727]

71 In addition, space is allocated to store rank values, but with some possible values being reserved to indicate redirection to other secondary hashes where values can be found. [sent-290, score-0.586]

72 Each secondary hash has a minimal perfect hash function that is computed only for the n-grams whose values it stores. [sent-291, score-1.209]

73 Secondary hashes do not need to record fingerprints, as fingerprint testing is done in the top-level hash. [sent-292, score-0.327]

74 For example, we might have a configuration of three hashes, with the top-level MPHR having 8-bit storage, and with secondary hashes having 10 and 20 bit storage respectively. [sent-293, score-0.585]

75 The 10-bit secondary hash can store 1024 different values, which would then represent ranks 255 to 1278, with all ranks above this being represented in the 20-bit hash. [sent-299, score-0.804]

76 To look up the count for an n-gram, we begin with the top-level hash, where fingerprint testing can immediately reject unseen n-grams. [sent-300, score-0.353]

77 For some seen n-grams, the required rank value is provided directly by the top-level hash, but for others a redirection value is returned, indicating precisely the secondary hash in which the rank value will be found by simple look up (with no additional fingerprint testing). [sent-301, score-1.195]
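
A sketch of the two-level look-up just described. Dicts stand in for the MPHRs, and the choice of 254/255 as the reserved redirection codes (with the 10-bit tier covering the next 1024 ranks) is one illustrative reading of the configuration above, not the paper's exact layout.

```python
import hashlib

FP_BITS = 12   # top-level fingerprint width; an illustrative choice

def fingerprint(ngram, bits=FP_BITS):
    digest = hashlib.sha256(ngram.encode()).digest()
    return int.from_bytes(digest[:8], "big") >> (64 - bits)

def build_tiered(rank_by_ngram):
    # Dicts stand in for the three MPHRs. The top level keeps a fingerprint and
    # an 8-bit field: 0..253 hold the rank directly, 254/255 redirect to the
    # 10-bit and 20-bit secondary hashes.
    top, mid, big = {}, {}, {}
    for ng, rank in rank_by_ngram.items():
        if rank < 254:
            top[ng] = (fingerprint(ng), rank)
        elif rank < 254 + 1024:                 # fits the 10-bit secondary hash
            top[ng] = (fingerprint(ng), 254)
            mid[ng] = rank - 254
        else:
            top[ng] = (fingerprint(ng), 255)
            big[ng] = rank
    return top, mid, big

def lookup(ngram, top, mid, big):
    fp, code = top.get(ngram, (None, None))     # a real MPHR maps any key to *some* slot
    if fp != fingerprint(ngram):
        return None                             # unseen n-grams rejected at the top level
    if code == 254:
        return mid[ngram] + 254                 # one extra look-up, no second fingerprint test
    if code == 255:
        return big[ngram]
    return code

tiers = build_tiered({"a b c": 3, "b c d": 500, "c d e": 1_000_000})
print(lookup("b c d", *tiers), lookup("x y z", *tiers))   # 500 None
```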

78 (Results) In this section, we present some results comparing the performance of our new storage methods to some of the existing methods, regarding the costs of storing LMs and the data access speeds that alternative systems allow. [sent-324, score-0.514]

79 (Comparison of memory costs) To test the effectiveness of our models we built models storing n-grams and full frequency counts for both the Gigaword and Google Web1T corpora, storing all 1-, 2-, 3-, 4-, and 5-grams. [sent-334, score-0.525]

80 Using the Bloomier algorithm of Talbot and Brants (2008) with 37-bit values and 12-bit fingerprints would require approximately 7.5 bytes per n-gram (1.23 × (37 + 12) ≈ 60.3 bits). [sent-340, score-0.545]

81 26 bytes per n-gram to store full frequency count information and stores the entire Web1T corpus in just 15. [sent-344, score-0.431]

82 This saving is mostly due to the ranking method allowing values to be stored at a cost of only 20 bits per n-gram. [sent-346, score-0.435]

83 Applying the same rank array optimization to the Bloomier method significantly reduces its memory requirement, but SMPHR still uses only 86% of the space that the Bloomier approach requires. [sent-347, score-0.495]

84 Tables 3, 4 and 5 show results for all methods on both corpora, for storing full counts, and for when 8-bit binning quantization of counts is used. [sent-352, score-0.331]

85 RandLM is a modern language modeling toolkit that uses Bloom filter based structures to store large language models and has been integrated so that it can be used as the language model storage for the Moses statistical machine translation system (Koehn et al., 2007). [sent-356, score-0.562]

86 We use randLM with the BloomMap (Talbot and Talbot, 2008) storage structure option with 8 bit quantized values and an error rate equivalent to using 8 bit fingerprints (as recommended in the Moses documentation). [sent-358, score-0.948]

87 The C-MPHR method tested here is slower than both the S-MPHR and T-MPHR models, due to the extra operations required for access to the variable length encoded array, yet it still performs similarly to SRILM and IRSTLM and is roughly 14 times faster than RandLM. [sent-367, score-0.394]

88 (Variable Length Fingerprints) To conclude our presentation of new methods for space-efficient language model storage, we suggest an additional possibility for reducing storage costs, which involves using different sizes of fingerprint for different n-grams. [sent-369, score-0.511]

89 False positives arise when our perfect hashing method maps an unseen n-gram to a position where the stored n-gram fingerprint happens to coincide with that computed for the unseen n-gram. [sent-374, score-0.623]

90 To achieve a scheme that admits a higher risk of less damaging errors, but enforces a lower risk of more damaging errors, we need only store shorter fingerprints for n-grams in our model that have low counts or probabilities, and longer fingerprints for n-grams with higher values. [sent-376, score-0.776]

91 by storing fingerprints of different lengths contiguously within a bit array, and constructing a ‘selection structure’ of the kind described in Section 4 to allow us to locate a given fingerprint within the bit array. [sent-379, score-0.93]

92 (Figure 4: Variable length fingerprint T-MPHR structure, using j-bit fingerprints for the n-grams which are most rare and m-bit fingerprints for all others.) [sent-398, score-1.013]

93 Recall that for T-MPHR, the top-level MPHR has all n-grams of the model as keys, and stores a fingerprint for each, plus a value that may represent an n-gram’s count or probability, or that may redirect to a second-level hash where that information can be found. [sent-400, score-0.915]

94 Redirection is done for items with higher counts or probabilities, so we can achieve lower error rates for precisely these items by storing additional fingerprint information for them in the second-level hash (see Figure 4). [sent-401, score-0.961]
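
A sketch of the variable-length-fingerprint variant of T-MPHR described here. The 4-bit/12-bit fingerprint split, the rarity threshold, the dict stand-ins for the MPHRs, and storing the full 12-bit fingerprint in the second level (rather than only the extra 8 bits) are all illustrative assumptions.

```python
import hashlib

TOP_BITS = 4      # short fingerprint checked for every query (illustrative)
TOTAL_BITS = 12   # frequent n-grams are "topped up" to this many bits overall

def fp(ngram, bits):
    digest = hashlib.sha256(ngram.encode()).digest()
    return int.from_bytes(digest[:8], "big") >> (64 - bits)

def build(count_by_ngram, rare_threshold=5):
    """Dicts stand in for the two MPHRs. Rare n-grams keep only the short
    fingerprint in the top level; frequent n-grams are redirected, and their
    fingerprint is topped up next to the value in the second level."""
    top, second = {}, {}
    for ng, count in count_by_ngram.items():
        if count <= rare_threshold:
            top[ng] = (fp(ng, TOP_BITS), count)
        else:
            top[ng] = (fp(ng, TOP_BITS), None)       # None = "go to the second level"
            second[ng] = (fp(ng, TOTAL_BITS), count)
    return top, second

def lookup(ngram, top, second):
    entry = top.get(ngram)                           # a real MPHR maps any key to some slot
    if entry is None or entry[0] != fp(ngram, TOP_BITS):
        return None                                  # rejected with probability 1 - 2**-TOP_BITS
    if entry[1] is not None:
        return entry[1]                              # a rare count: the weaker guarantee is tolerable
    long_fp, count = second[ngram]
    if long_fp != fp(ngram, TOTAL_BITS):
        return None                                  # frequent counts keep the stronger 12-bit check
    return count

top, second = build({"of the year": 90_000, "zyx qwv foo": 2})
print(lookup("of the year", top, second), lookup("unseen thing", top, second))  # 90000 None
```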

95 Table 7 applies this idea to storing full and quantized counts of the Gigaword and Web 1T models, when fingerprints in the top-level MPHR have sizes in the range 1 to 6 bits, with the fingerprint information for items stored in secondary hashes being ‘topped up’ to 12 bits. [sent-403, score-1.096]

96 This approach achieves storage costs of around 1 byte per n-gram or less for the quantized models. [sent-404, score-0.479]

97 (Using 1- to 6-bit fingerprints for the rarest n-grams and 12-bit (in total) fingerprints for all other n-grams.) [sent-405, score-0.72]

98 (Conclusion) We have presented novel methods of storing large language models, consisting of billions of n-grams, that allow quantized values or frequency counts to be accessed quickly and which require far less space than all known approaches. [sent-407, score-0.428]

99 We have shown that our models allow n-gram look-up at speeds comparable to modern language modeling toolkits (which have much greater storage costs), and at a rate approximately 15 times faster than a competitor approach for space-efficient storage. [sent-411, score-0.334]

100 How many bits are needed to store probabilities for phrase-based translation? [sent-436, score-0.353]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('hash', 0.459), ('array', 0.291), ('storage', 0.263), ('fingerprint', 0.248), ('talbot', 0.238), ('fingerprints', 0.208), ('mphr', 0.208), ('bits', 0.187), ('storing', 0.17), ('store', 0.166), ('bloom', 0.161), ('bit', 0.152), ('bloomier', 0.149), ('rank', 0.131), ('quantized', 0.119), ('stored', 0.113), ('keys', 0.099), ('filter', 0.096), ('perfect', 0.091), ('quantization', 0.085), ('bytes', 0.079), ('hashes', 0.079), ('minimal', 0.077), ('redirect', 0.076), ('trie', 0.069), ('secondary', 0.069), ('gigaword', 0.068), ('brants', 0.062), ('tiered', 0.061), ('stores', 0.061), ('cost', 0.061), ('key', 0.06), ('unseen', 0.06), ('integers', 0.057), ('costs', 0.056), ('ranks', 0.055), ('compression', 0.054), ('bucket', 0.051), ('redirection', 0.05), ('memory', 0.048), ('false', 0.045), ('count', 0.045), ('grams', 0.042), ('coded', 0.042), ('compress', 0.042), ('billion', 0.042), ('counts', 0.042), ('bi', 0.041), ('per', 0.041), ('code', 0.04), ('chd', 0.04), ('damaging', 0.04), ('compressed', 0.04), ('filters', 0.039), ('frequency', 0.039), ('stage', 0.039), ('osborne', 0.038), ('modern', 0.037), ('risk', 0.036), ('lms', 0.035), ('falsely', 0.034), ('binning', 0.034), ('toolkits', 0.034), ('values', 0.033), ('google', 0.032), ('belazzougui', 0.03), ('construe', 0.03), ('displace', 0.03), ('fredriksson', 0.03), ('irstlm', 0.03), ('mphrs', 0.03), ('codes', 0.03), ('required', 0.029), ('coding', 0.028), ('positives', 0.028), ('range', 0.027), ('returning', 0.026), ('federico', 0.026), ('value', 0.026), ('whittaker', 0.025), ('buckets', 0.025), ('whilst', 0.025), ('access', 0.025), ('space', 0.025), ('distinct', 0.025), ('variable', 0.025), ('length', 0.024), ('compact', 0.024), ('hashing', 0.023), ('ngram', 0.023), ('configuration', 0.022), ('srilm', 0.022), ('functions', 0.021), ('structure', 0.021), ('zeros', 0.021), ('queried', 0.021), ('function', 0.021), ('items', 0.021), ('distributional', 0.021), ('integer', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 101 emnlp-2010-Storing the Web in Memory: Space Efficient Language Models with Constant Time Retrieval

Author: David Guthrie ; Mark Hepple

Abstract: We present three novel methods of compactly storing very large n-gram language models. These methods use substantially less space than all known approaches and allow n-gram probabilities or counts to be retrieved in constant time, at speeds comparable to modern language modeling toolkits. Our basic approach generates an explicit minimal perfect hash function that maps all n-grams in a model to distinct integers to enable storage of associated values. Extensions of this approach exploit distributional characteristics of n-gram data to reduce storage costs, including variable length coding of values and the use of tiered structures that partition the data for more efficient storage. We apply our approach to storing the full Google Web1T n-gram set and all 1-to-5 grams of the Gigaword newswire corpus. For the 1.5 billion n-grams of Gigaword, for example, we can store full count information at a cost of 1.66 bytes per n-gram (around 30% of the cost when using the current state-of-the-art approach), or quantized counts for 1.41 bytes per n-gram. For applications that are tolerant of a certain class of relatively innocuous errors (where unseen n-grams may be accepted as rare n-grams), we can reduce the latter cost to below 1 byte per n-gram.

2 0.18027508 56 emnlp-2010-Hashing-Based Approaches to Spelling Correction of Personal Names

Author: Raghavendra Udupa ; Shaishav Kumar

Abstract: We propose two hashing-based solutions to the problem of fast and effective personal names spelling correction in People Search applications. The key idea behind our methods is to learn hash functions that map similar names to similar (and compact) binary codewords. The two methods differ in the data they use for learning the hash functions - the first method uses a set of names in a given language/script whereas the second uses a set of bilingual names. We show that both methods give excellent retrieval performance in comparison to several baselines on two lists of misspelled personal names. Moreover, the method that uses bilingual data for learning hash functions gives the best performance.

3 0.050602809 17 emnlp-2010-An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL

Author: Valentin Zhikov ; Hiroya Takamura ; Manabu Okumura

Abstract: This paper proposes a fast and simple unsupervised word segmentation algorithm that utilizes the local predictability of adjacent character sequences, while searching for a least-effort representation of the data. The model uses branching entropy as a means of constraining the hypothesis space, in order to efficiently obtain a solution that minimizes the length of a two-part MDL code. An evaluation with corpora in Japanese, Thai, English, and the ”CHILDES” corpus for research in language development reveals that the algorithm achieves an accuracy, comparable to that of the state-of-the-art methods in unsupervised word segmentation, in a significantly reduced computational time.

4 0.048982434 7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics

Author: Joseph Reisinger ; Raymond Mooney

Abstract: We introduce tiered clustering, a mixture model capable of accounting for varying degrees of shared (context-independent) feature structure, and demonstrate its applicability to inferring distributed representations of word meaning. Common tasks in lexical semantics such as word relatedness or selectional preference can benefit from modeling such structure: Polysemous word usage is often governed by some common background metaphoric usage (e.g. the senses of line or run), and likewise modeling the selectional preference of verbs relies on identifying commonalities shared by their typical arguments. Tiered clustering can also be viewed as a form of soft feature selection, where features that do not contribute meaningfully to the clustering can be excluded. We demonstrate the applicability of tiered clustering, highlighting particular cases where modeling shared structure is beneficial and where it can be detrimental.

5 0.039406355 91 emnlp-2010-Practical Linguistic Steganography Using Contextual Synonym Substitution and Vertex Colour Coding

Author: Ching-Yun Chang ; Stephen Clark

Abstract: Linguistic Steganography is concerned with hiding information in natural language text. One of the major transformations used in Linguistic Steganography is synonym substitution. However, few existing studies have studied the practical application of this approach. In this paper we propose two improvements to the use of synonym substitution for encoding hidden bits of information. First, we use the Web 1T Google n-gram corpus for checking the applicability of a synonym in context, and we evaluate this method using data from the SemEval lexical substitution task. Second, we address the problem that arises from words with more than one sense, which creates a potential ambiguity in terms of which bits are encoded by a particular word. We develop a novel method in which words are the vertices in a graph, synonyms are linked by edges, and the bits assigned to a word are determined by a vertex colouring algorithm. This method ensures that each word encodes a unique sequence of bits, without cutting out large number of synonyms, and thus maintaining a reasonable embedding capacity.

6 0.034500401 1 emnlp-2010-"Poetic" Statistical Machine Translation: Rhyme and Meter

7 0.032192655 22 emnlp-2010-Automatic Evaluation of Translation Quality for Distant Language Pairs

8 0.03059056 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation

9 0.030099999 111 emnlp-2010-Two Decades of Unsupervised POS Induction: How Far Have We Come?

10 0.029186269 87 emnlp-2010-Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space

11 0.029181462 63 emnlp-2010-Improving Translation via Targeted Paraphrasing

12 0.027878143 98 emnlp-2010-Soft Syntactic Constraints for Hierarchical Phrase-Based Translation Using Latent Syntactic Distributions

13 0.025356771 39 emnlp-2010-EMNLP 044

14 0.025028281 106 emnlp-2010-Top-Down Nearly-Context-Sensitive Parsing

15 0.025014231 108 emnlp-2010-Training Continuous Space Language Models: Some Practical Issues

16 0.024683246 73 emnlp-2010-Learning Recurrent Event Queries for Web Search

17 0.02460785 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice

18 0.024482045 92 emnlp-2010-Predicting the Semantic Compositionality of Prefix Verbs

19 0.023991704 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding

20 0.023904819 11 emnlp-2010-A Semi-Supervised Approach to Improve Classification of Infrequent Discourse Relations Using Feature Vector Extension


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.1), (1, 0.021), (2, -0.027), (3, 0.05), (4, 0.002), (5, 0.049), (6, -0.121), (7, -0.022), (8, -0.078), (9, 0.036), (10, -0.028), (11, 0.104), (12, 0.055), (13, 0.008), (14, 0.038), (15, -0.129), (16, 0.081), (17, -0.074), (18, -0.009), (19, -0.005), (20, -0.09), (21, -0.137), (22, -0.157), (23, 0.216), (24, -0.021), (25, -0.322), (26, -0.118), (27, 0.154), (28, 0.18), (29, -0.092), (30, 0.119), (31, -0.055), (32, 0.01), (33, 0.014), (34, 0.033), (35, 0.176), (36, 0.052), (37, 0.024), (38, -0.064), (39, -0.199), (40, 0.051), (41, 0.286), (42, 0.062), (43, -0.279), (44, 0.023), (45, 0.093), (46, 0.068), (47, -0.117), (48, -0.099), (49, 0.147)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96977067 101 emnlp-2010-Storing the Web in Memory: Space Efficient Language Models with Constant Time Retrieval

Author: David Guthrie ; Mark Hepple

Abstract: We present three novel methods of compactly storing very large n-gram language models. These methods use substantially less space than all known approaches and allow n-gram probabilities or counts to be retrieved in constant time, at speeds comparable to modern language modeling toolkits. Our basic approach generates an explicit minimal perfect hash function that maps all n-grams in a model to distinct integers to enable storage of associated values. Extensions of this approach exploit distributional characteristics of n-gram data to reduce storage costs, including variable length coding of values and the use of tiered structures that partition the data for more efficient storage. We apply our approach to storing the full Google Web1T n-gram set and all 1-to-5 grams of the Gigaword newswire corpus. For the 1.5 billion n-grams of Gigaword, for example, we can store full count information at a cost of 1.66 bytes per n-gram (around 30% of the cost when using the current state-of-the-art approach), or quantized counts for 1.41 bytes per n-gram. For applications that are tolerant of a certain class of relatively innocuous errors (where unseen n-grams may be accepted as rare n-grams), we can reduce the latter cost to below 1 byte per n-gram.

2 0.59939581 56 emnlp-2010-Hashing-Based Approaches to Spelling Correction of Personal Names

Author: Raghavendra Udupa ; Shaishav Kumar

Abstract: We propose two hashing-based solutions to the problem of fast and effective personal names spelling correction in People Search applications. The key idea behind our methods is to learn hash functions that map similar names to similar (and compact) binary codewords. The two methods differ in the data they use for learning the hash functions - the first method uses a set of names in a given language/script whereas the second uses a set of bilingual names. We show that both methods give excellent retrieval performance in comparison to several baselines on two lists of misspelled personal names. Moreover, the method that uses bilingual data for learning hash functions gives the best performance.

3 0.21834472 7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics

Author: Joseph Reisinger ; Raymond Mooney

Abstract: We introduce tiered clustering, a mixture model capable of accounting for varying degrees of shared (context-independent) feature structure, and demonstrate its applicability to inferring distributed representations of word meaning. Common tasks in lexical semantics such as word relatedness or selectional preference can benefit from modeling such structure: Polysemous word usage is often governed by some common background metaphoric usage (e.g. the senses of line or run), and likewise modeling the selectional preference of verbs relies on identifying commonalities shared by their typical arguments. Tiered clustering can also be viewed as a form of soft feature selection, where features that do not contribute meaningfully to the clustering can be excluded. We demonstrate the applicability of tiered clustering, highlighting particular cases where modeling shared structure is beneficial and where it can be detrimental.

4 0.18968472 94 emnlp-2010-SCFG Decoding Without Binarization

Author: Mark Hopkins ; Greg Langmead

Abstract: Conventional wisdom dictates that synchronous context-free grammars (SCFGs) must be converted to Chomsky Normal Form (CNF) to ensure cubic time decoding. For arbitrary SCFGs, this is typically accomplished via the synchronous binarization technique of (Zhang et al., 2006). A drawback to this approach is that it inflates the constant factors associated with decoding, and thus the practical running time. (DeNero et al., 2009) tackle this problem by defining a superset of CNF called Lexical Normal Form (LNF), which also supports cubic time decoding under certain implicit assumptions. In this paper, we make these assumptions explicit, and in doing so, show that LNF can be further expanded to a broader class of grammars (called “scope3”) that also supports cubic-time decoding. By simply pruning non-scope-3 rules from a GHKM-extracted grammar, we obtain better translation performance than synchronous binarization.

5 0.17760736 17 emnlp-2010-An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL

Author: Valentin Zhikov ; Hiroya Takamura ; Manabu Okumura

Abstract: This paper proposes a fast and simple unsupervised word segmentation algorithm that utilizes the local predictability of adjacent character sequences, while searching for a least-effort representation of the data. The model uses branching entropy as a means of constraining the hypothesis space, in order to efficiently obtain a solution that minimizes the length of a two-part MDL code. An evaluation with corpora in Japanese, Thai, English, and the ”CHILDES” corpus for research in language development reveals that the algorithm achieves an accuracy, comparable to that of the state-of-the-art methods in unsupervised word segmentation, in a significantly reduced computational time.

6 0.14508216 13 emnlp-2010-A Simple Domain-Independent Probabilistic Approach to Generation

7 0.13098383 111 emnlp-2010-Two Decades of Unsupervised POS Induction: How Far Have We Come?

8 0.12721066 11 emnlp-2010-A Semi-Supervised Approach to Improve Classification of Infrequent Discourse Relations Using Feature Vector Extension

9 0.12629458 79 emnlp-2010-Mining Name Translations from Entity Graph Mapping

10 0.12567008 92 emnlp-2010-Predicting the Semantic Compositionality of Prefix Verbs

11 0.12451633 91 emnlp-2010-Practical Linguistic Steganography Using Contextual Synonym Substitution and Vertex Colour Coding

12 0.12262118 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors

13 0.11845922 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation

14 0.11806493 1 emnlp-2010-"Poetic" Statistical Machine Translation: Rhyme and Meter

15 0.11311392 25 emnlp-2010-Better Punctuation Prediction with Dynamic Conditional Random Fields

16 0.10824437 105 emnlp-2010-Title Generation with Quasi-Synchronous Grammar

17 0.10659637 5 emnlp-2010-A Hybrid Morpheme-Word Representation for Machine Translation of Morphologically Rich Languages

18 0.10611638 8 emnlp-2010-A Multi-Pass Sieve for Coreference Resolution

19 0.1034492 87 emnlp-2010-Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space

20 0.096227124 60 emnlp-2010-Improved Fully Unsupervised Parsing with Zoomed Learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.013), (12, 0.56), (29, 0.063), (30, 0.013), (32, 0.011), (52, 0.024), (56, 0.054), (62, 0.019), (66, 0.065), (72, 0.043), (76, 0.016), (89, 0.014), (92, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.91320229 79 emnlp-2010-Mining Name Translations from Entity Graph Mapping

Author: Gae-won You ; Seung-won Hwang ; Young-In Song ; Long Jiang ; Zaiqing Nie

Abstract: This paper studies the problem of mining entity translation, specifically, mining English and Chinese name pairs. Existing efforts can be categorized into (a) a transliteration-based approach leveraging phonetic similarity and (b) a corpus-based approach exploiting bilingual co-occurrences, each of which suffers from inaccuracy and scarcity respectively. In clear contrast, we use unleveraged resources of monolingual entity co-occurrences, crawled from entity search engines, represented as two entity-relationship graphs extracted from two language corpora respectively. Our problem is then abstracted as finding correct mappings across two graphs. To achieve this goal, we propose a holistic approach, of exploiting both transliteration similarity and monolingual co-occurrences. This approach, building upon monolingual corpora, complements existing corpus-based work, requiring scarce resources of parallel or comparable corpus, while significantly boosting the accuracy of transliteration-based work. We validate our proposed system using real-life datasets.

same-paper 2 0.89095008 101 emnlp-2010-Storing the Web in Memory: Space Efficient Language Models with Constant Time Retrieval

Author: David Guthrie ; Mark Hepple

Abstract: We present three novel methods of compactly storing very large n-gram language models. These methods use substantially less space than all known approaches and allow n-gram probabilities or counts to be retrieved in constant time, at speeds comparable to modern language modeling toolkits. Our basic approach generates an explicit minimal perfect hash function that maps all n-grams in a model to distinct integers to enable storage of associated values. Extensions of this approach exploit distributional characteristics of n-gram data to reduce storage costs, including variable length coding of values and the use of tiered structures that partition the data for more efficient storage. We apply our approach to storing the full Google Web1T n-gram set and all 1-to-5 grams of the Gigaword newswire corpus. For the 1.5 billion n-grams of Gigaword, for example, we can store full count information at a cost of 1.66 bytes per n-gram (around 30% of the cost when using the current state-of-the-art approach), or quantized counts for 1.41 bytes per n-gram. For applications that are tolerant of a certain class of relatively innocuous errors (where unseen n-grams may be accepted as rare n-grams), we can reduce the latter cost to below 1 byte per n-gram.

3 0.85302418 95 emnlp-2010-SRL-Based Verb Selection for ESL

Author: Xiaohua Liu ; Bo Han ; Kuan Li ; Stephan Hyeonjun Stiller ; Ming Zhou

Abstract: In this paper we develop an approach to tackle the problem of verb selection for learners of English as a second language (ESL) by using features from the output of Semantic Role Labeling (SRL). Unlike existing approaches to verb selection that use local features such as n-grams, our approach exploits semantic features which explicitly model the usage context of the verb. The verb choice highly depends on its usage context which is not consistently captured by local features. We then combine these semantic features with other local features under the generalized perceptron learning framework. Experiments on both in-domain and out-of-domain corpora show that our approach outperforms the baseline and achieves state-of-the-art performance.

4 0.40684137 56 emnlp-2010-Hashing-Based Approaches to Spelling Correction of Personal Names

Author: Raghavendra Udupa ; Shaishav Kumar

Abstract: We propose two hashing-based solutions to the problem of fast and effective personal names spelling correction in People Search applications. The key idea behind our methods is to learn hash functions that map similar names to similar (and compact) binary codewords. The two methods differ in the data they use for learning the hash functions - the first method uses a set of names in a given language/script whereas the second uses a set of bilingual names. We show that both methods give excellent retrieval performance in comparison to several baselines on two lists of misspelled personal names. Moreover, the method that uses bilingual data for learning hash functions gives the best performance.

5 0.39029729 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation

Author: Sankaranarayanan Ananthakrishnan ; Rohit Prasad ; David Stallard ; Prem Natarajan

Abstract: Production of parallel training corpora for the development of statistical machine translation (SMT) systems for resource-poor languages usually requires extensive manual effort. Active sample selection aims to reduce the labor, time, and expense incurred in producing such resources, attaining a given performance benchmark with the smallest possible training corpus by choosing informative, nonredundant source sentences from an available candidate pool for manual translation. We present a novel, discriminative sample selection strategy that preferentially selects batches of candidate sentences with constructs that lead to erroneous translations on a held-out development set. The proposed strategy supports a built-in diversity mechanism that reduces redundancy in the selected batches. Simulation experiments on English-to-Pashto and Spanish-to-English translation tasks demonstrate the superiority of the proposed approach to a number of competing techniques, such as random selection, dissimilarity-based selection, as well as a recently proposed semisupervised active learning strategy.

6 0.38807094 54 emnlp-2010-Generating Confusion Sets for Context-Sensitive Error Correction

7 0.38207698 92 emnlp-2010-Predicting the Semantic Compositionality of Prefix Verbs

8 0.36586878 20 emnlp-2010-Automatic Detection and Classification of Social Events

9 0.35328931 107 emnlp-2010-Towards Conversation Entailment: An Empirical Investigation

10 0.34373558 8 emnlp-2010-A Multi-Pass Sieve for Coreference Resolution

11 0.33914265 19 emnlp-2010-Automatic Analysis of Rhythmic Poetry with Applications to Generation and Translation

12 0.33779144 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media

13 0.33488613 68 emnlp-2010-Joint Inference for Bilingual Semantic Role Labeling

14 0.33425069 66 emnlp-2010-Inducing Word Senses to Improve Web Search Result Clustering

15 0.3323527 63 emnlp-2010-Improving Translation via Targeted Paraphrasing

16 0.33221641 21 emnlp-2010-Automatic Discovery of Manner Relations and its Applications

17 0.3305319 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding

18 0.32911658 105 emnlp-2010-Title Generation with Quasi-Synchronous Grammar

19 0.32799456 84 emnlp-2010-NLP on Spoken Documents Without ASR

20 0.32702953 46 emnlp-2010-Evaluating the Impact of Alternative Dependency Graph Encodings on Solving Event Extraction Tasks