acl acl2011 acl2011-135 knowledge-graph by maker-knowledge-mining

135 acl-2011-Faster and Smaller N-Gram Language Models

Source: pdf

Author: Adam Pauls ; Dan Klein

Abstract: N-gram language models are a major resource bottleneck in machine translation. In this paper, we present several language model implementations that are both highly compact and fast to query. Our fastest implementation is as fast as the widely used SRILM while requiring only 25% of the storage. Our most compact representation can store all 4 billion n-grams and associated counts for the Google n-gram corpus in 23 bits per n-gram, the most compact lossless representation to date, and even more compact than recent lossy compression techniques. We also discuss techniques for improving query speed during decoding, including a simple but novel language model caching technique that improves the query speed of our language models (and SRILM) by up to 300%.

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 Our most compact representation can store all 4 billion n-grams and associated counts for the Google n-gram corpus in 23 bits per n-gram, the most compact lossless representation to date, and even more compact than recent lossy compression techniques. [sent-6, score-0.881]

2 We also discuss techniques for improving query speed during decoding, including a simple but novel language model caching technique that improves the query speed of our language models (and SRILM) by up to 300%. [sent-7, score-0.485]

3 As in much previous work (Whittaker 258 and Raj, 2001; Hsu and Glass, 2008), our methods are conceptually based on tabular trie encodings wherein each n-gram key is stored as the concatenation of one word (here, the last) and an offset encoding the remaining words (here, the context). [sent-16, score-0.698]

4 Caching itself is certainly not new to language modeling, but because well-tuned LMs are essentially lookup tables to begin with, naive cache designs only speed up slower systems. [sent-23, score-0.339]

5 We present a direct-addressing cache with a fast key identity check that speeds up our systems (or existing fast systems like the widelyused, speed-focused SRILM) by up to 300%. [sent-24, score-0.419]

6 Where classic LMs take word tuples and produce counts or probabilities, we propose an LM that takes a word-and-context encoding (so the context need not be re-looked up) and returns both the probabil- ity and also the context encoding for the suffix of the original query. [sent-26, score-0.604]

7 This setup substantially accelerates the scrolling queries issued by decoders, and also exploits language model state equivalence (Li and Khudanpur, 2008). [sent-27, score-0.338]

8 ” In this section, we will review existing techniques for encoding the keys and values of an n-gram language model, taking care to account for every bit of memory required by each implementation. [sent-39, score-0.497]

9 The additional array is small, requiring only about 3MB, but we save 17 bits per n-gram, reducing value storage from around 16GB to about 9GB for Web1T. [sent-46, score-0.768]

10 In general, the number of bits per value required to encode all value ranks for a given language model will vary we will refer to this variable as v . [sent-50, score-0.513]

11 2 Trie-Based Language Models The data structure of choice for the majority of modern language model implementations is a trie (Fredkin, 1960). [sent-52, score-0.296]

12 For example, the implementation used in SRILM represents a trie node as a C st ruct containing a 32-bit integer representing the word, a 64-bit memory2 pointer to the list of children, and a 32-bit floating point number representing the value stored at a node. [sent-60, score-0.421]

13 The total storage for a node alone is 16 bytes, with additional overhead required to store the list of children. [sent-61, score-0.329]

14 In total, the most compact implementation in SRILM uses 33 bytes per n-gram of storage, which would require around 116 GB of memory to store Web1T. [sent-62, score-0.45]

15 While it is simple to implement a trie node in this (already wasteful) way in programming languages that offer low-level access to memory allocation like C/C++, the situation is even worse in higher level programming languages. [sent-63, score-0.298]

16 In Java, for example, Cstyle st ructs are not available, and records are most naturally implemented as objects that carry an additional 64 bits of overhead. [sent-64, score-0.307]

17 Despite its relatively large storage requirements, the implementation employed by SRILM is still widely in use today, largely because of its speed to our knowledge, SRILM is the fastest freely available language model implementation. [sent-66, score-0.384]

18 3 Implicit Tries A more compact implementation of a trie is described in Whittaker and Raj (2001). [sent-69, score-0.317]

19 Each entry encodes a word with enough bits to index all words in the language model (24 bits for Web1T), a quantized value, and a 32-bit3 offset that encodes the contiguous block of the array containing the children of the node. [sent-71, score-1.22]

20 Note that 32 bits is sufficient to index all n-grams in Web1T; for larger corpora, we can always increase the size of the offset. [sent-72, score-0.307]

21 Effectively, this representation replaces systemlevel memory pointers with offsets that act as logical pointers that can reference other entries in the array, rather than arbitrary bytes in RAM. [sent-73, score-0.426]

22 This representation saves space because offsets require fewer bits than memory pointers, but more importantly, it permits straightforward implementation in any higherlevel language that provides access to arrays of integers. [sent-74, score-0.73]

23 4 Encoding n-grams Hsu and Glass (2008) describe a variant of the implicit tries of Whittaker and Raj (2001) in which each node in the trie stores the prefix (i. [sent-76, score-0.323]

24 This representation has the property that we can refer to each n-gram w1n by its last word wn and the offset of its prefix often called the context. [sent-79, score-0.327]

25 4Typically, programming languages only provide support for arrays of bytes, not bits, but it is of course possible to simulate arrays with arbitrary numbers of bits using byte arrays and bit manipulation. [sent-81, score-0.633]

26 encode We will refer to this encoding as a context encoding. [sent-83, score-0.34]

27 In our implementations, we found that the speed improvement from switching to a first-rest encoding and implementing more efficient queries was modest. [sent-85, score-0.439]

28 2, the last-rest encoding allows us to exploit the scrolling nature of queries issued by decoders, which results in speedups that far outweigh those achieved by reversing the trie. [sent-87, score-0.564]

29 We will also show that our data structures can be very effectively compressed by implicitly encoding the word wn, and further compressed by applying a variablelength encoding on context deltas. [sent-90, score-0.984]

30 1 Sorted Array A standard way to implement a map is to store an array of key/value pairs, sorted according to the key. [sent-92, score-0.541]

31 We construct keys (wn, using the context encoding described in the previous section, where the context offsets c refer to entries in the sorted array of (n − 1)-grams. [sent-95, score-1.01]

32 Because our keys are sorted according to their c(w1n−1)) context-encoded representation, we cannot straightforwardly answer queries about an n-gram w without first determining its context encoding. [sent-98, score-0.432]

33 We can do this efficiently by building up the encoding incrementally: we start with the context offset of the unigram w1, which is simply its integer representation, and use that to form the context encoding of the bigram = (w2, c(w1)). [sent-99, score-0.791]

34 1-grams 64 bits Figure 1: Our SORTED implementation of a trie. [sent-102, score-0.419]

35 Each node in the trie is an entry in an array with 3 parts: w represents the word at the node; val represents the (rank encoded) value; and c is an offset in the array of n − 1 grams that represents the parent (prefix) of a node. [sent-104, score-0.99]

36 Note, however, that if our queries arrive in context-encoded form, queries are faster since they involve only one binary search in the appropriate array. [sent-107, score-0.301]

37 2 This implementation, SORTED, uses 64 bits for the integer-encoded keys and v bits for the values. [sent-109, score-0.71]

38 To enable the use of our context encoding, we require an implementation in which we can refer to entries in the hash table via array offsets. [sent-115, score-0.795]

39 For this reason, we use an open address hash map that uses linear probing for collision resolution. [sent-116, score-0.307]

40 As in the sorted array implementation, in order to 261 insert an n-gram w1n into the hash table, we must form its context encoding incrementally from the offset of w1. [sent-117, score-1.239]

41 However, unlike the sorted array implementation, at query time, we only need to be able to check equality between the query key w1n = c(w1n−1)) w01n c(w01n−1)) (wn, and a key = (w0n, in the table. [sent-118, score-0.832]

42 Equality can easily be checked by first checking if wn = wn0, then recursively checking equality between and though again, equality is even faster if the query is already contextencoded. [sent-119, score-0.329]

43 This HASH data structure also uses 64 bits for integer-encoded keys and v bits for values. [sent-120, score-0.71]

44 However, to avoid excessive hash collisions, we also allocate additional empty space according to a userdefined parameter that trades off speed and time we used about 40% extra space in our experiments. [sent-121, score-0.387]

45 Look up in a hash map is linear in the length of an n-gram and constant with respect to the number of n-grams. [sent-123, score-0.307]

46 Unlike the sorted array implementation, the hash table implementation also permits efficient insertion and deletion, making it suitable for stream-based language models (Levenberg and Osborne, 2009). [sent-124, score-0.87]

47 3 Implicitly Encoding wn The context encoding we have used thus far still wastes space. [sent-126, score-0.348]

48 This is perhaps most evident in the sorted array representation (see Figure 1): all ngrams ending with a particular word wi are stored contiguously. [sent-127, score-0.496]

49 We can exploit this redundancy by storing only the context offsets in the main array, using as many bits as needed to encode all context offsets (32 bits for Web1T). [sent-128, score-1.048]

50 In auxiliary arrays, one for each n-gram order, we store the beginning and end of the range of the trie array in which all (wi, c) keys are stored for each wi. [sent-129, score-0.679]

51 The same trick can be applied in the hash table implementation. [sent-131, score-0.307]

52 We allocate contiguous blocks of the main array for n-grams which all share the same last word wi, and distribute keys within those ranges using the hashing function. [sent-132, score-0.39]

53 This representation reduces memory usage for keys from 64 bits to 32 bits, reducing overall storage for Web1T to 6. [sent-133, score-0.642]

54 It also increases query speed in the sorted array case, since to find (wi, c), we only need to search the range of the – array over which wi applies. [sent-136, score-0.921]

55 Because this implicit encoding reduces memory usage without a performance cost, we will assume its use for the rest of this paper. [sent-137, score-0.331]

56 If we ensure that the value rank array sorts raw values by descending order of frequency, then we expect that small ranks will occur much more frequently than large ones, which we can exploit with a variable-length encoding. [sent-142, score-0.392]

57 To compress n-grams, we can exploit the context encoding of our keys. [sent-143, score-0.305]

58 (c) The number of bits required to encode the context and word deltas as well as the value ranks. [sent-154, score-0.643]

59 Word deltas use variable-length block coding with k = 1, while context deltas and value ranks use k = 2. [sent-155, score-0.56]

60 of the key array used in our sorted array implementation. [sent-158, score-0.809]

61 While we have already exploited the fact that the 24 word bits repeat in the previous section, we note here that consecutive context offsets tend to be quite close together. [sent-159, score-0.462]

62 We found that for 5-grams, the median difference between consecutive offsets was about 50, and 90% of offset deltas were smaller than 10000. [sent-160, score-0.472]

63 By using a variable-length encoding to represent these deltas, we should require far fewer than 32 bits to encode context offsets. [sent-161, score-0.647]

64 We used a very simple variable-length coding to encode offset deltas, word deltas, and value ranks. [sent-162, score-0.343]

65 We found this encoding outperformed other standard prefix codes, including Golomb codes (Golomb, 1966; Church et al. [sent-172, score-0.346]

66 Finally, we found that Huffman codes outperformed our encoding slightly, but came at a much higher computational cost. [sent-176, score-0.304]

67 In particular, we compress the key/value array in blocks of 128 bytes. [sent-182, score-0.327]

68 The remainder of the block is filled with as many compressed key/value pairs as possible. [sent-184, score-0.325]

69 When we encode an offset delta, we store the delta of the word portion of the key separately from the delta of the context offset. [sent-187, score-0.543]

70 To find a key in this compressed array, we first perform binary search over the header blocks (which are predictably located every 128 bytes), followed by a linear search within a compressed block. [sent-189, score-0.581]

71 Using k = 6 for encoding offset deltas and k = 5 for encoding value ranks, this COMPRESSED implementation stores Web1T in less than 3 bytes per n-gram, or about 10. [sent-190, score-1.081]

72 4 Speeding up Decoding In the previous section, we provided compact and efficient implementations of associative arrays that allow us to query a value for an arbitrary n-gram. [sent-195, score-0.419]

73 1 Exploiting Repetitive Queries In a simple experiment, we recorded all of the language model queries issued by the Joshua decoder (Li et al. [sent-199, score-0.307]

74 Therefore, we expect that keeping the results of language model queries in a cache should be effective at reducing overall language model latency. [sent-202, score-0.331]

75 Our cache uses an array of key/value pairs with size fixed to 2b − 1 for some integer b (we used 24). [sent-204, score-0.53]

76 If the key located in the cache array matches the query key, then we return the value stored in the cache. [sent-207, score-0.73]

77 Caching n-grams in this way reduces overall latency for two reasons: first, lookup in the cache is extremely fast, requiring only a single evaluation of the hash function, one memory lookup to find the cache key, and one equality check on the key. [sent-210, score-0.991]

78 In contrast, even our fastest (HASH) implementation may have to perform multiple memory lookups and equality checks in order to resolve collisions. [sent-211, score-0.336]

79 When using a contextencoding, a query from the n-gram “the cat fell” returns the context offset of “cat fell”, which speeds up the query of “cat fell down”. [sent-216, score-0.635]

80 Federico and Cettolo (2007) also employ a cache in their language model implementation, though based on traditional hash table cache with linear probing. [sent-219, score-0.703]

81 We would not expect a large performance increase from such a cache for our faster models since our HASH implementation is already a hash table with linear probing. [sent-221, score-0.652]

82 We found in our experiments that a cache using linear probing provided marginal performance increases of about 40%, largely because of cached back-off computation, while our simpler cache increases performance by about 300% even over our HASH LM implementation. [sent-222, score-0.396]

83 o As discussed in Section 3, our LM implementations can answer queries about context-encoded ngrams faster than explicitly encoded n-grams. [sent-230, score-0.346]

84 With this in mind, we augment the values stored in our language model so that for a key (wn, c(w1n−1)), we store the offset of the suffix c(w2n) as well as the normal counts/probabilities. [sent-231, score-0.439]

85 When we query the language model, we get back both a language model score and context offset c( wˆn1−1), where wˆ 1n−1 is the the longest suffix of w1n−1 contained in the language model. [sent-233, score-0.382]

86 We can then quickly form the context encoding of the next query by simply concatenating the new word with the offset c( wˆn1−1) returned from the previous query. [sent-234, score-0.577]

87 The first two, SRILM-H and SRILM-S, refer to the hash tableand sorted array-based trie implementations provided by SRILM. [sent-271, score-0.76]

88 Because this implementation is not freely available, we use their published memory usage in bytes per n-gram on a language model of similar size and project total usage. [sent-274, score-0.309]

89 The memory usage of all of our models is con- siderably smaller than SRILM our HASH implementation is about 25% the size of SRILM-H, and our SORTED implementation is about 25% the size of SRILM-S. [sent-275, score-0.329]

90 Our COMPRESSED implementation is also smaller than the state-of-the-art compressed TPT implementation. [sent-276, score-0.334]

91 , 2004) and additional variable-length encoding that achieves the best published compression of WEB 1T to date. [sent-280, score-0.314]

92 We quote here the storage required when keys7 are encoded using 12bit hash codes, which gives a false positive rate of about 2−12 =0. [sent-283, score-0.512]

93 For HASH+SCROLL, all queries were issued to the decoder in context-encoded form, which speeds up queries that exhibit scrolling behaviour. [sent-289, score-0.612]

94 Note that memory usage is higher than for HASH because we store suffix offsets along with the values for an n-gram. [sent-290, score-0.335]

95 Our caching also reduces total decoding time by about 20% for our fastest models and speeds up COMPRESSED by a factor of 6. [sent-294, score-0.304]

96 3 Timing Experiments We first measured pure query speed by logging all LM queries issued by a decoder and measuring the time required to query those n-grams in isolation. [sent-297, score-0.614]

97 In HASH+SCROLL, we issued queries to the language model using the context encoding, which speeds up queries substantially. [sent-311, score-0.499]

98 In particular, our COMPRESSED implementation with caching is nearly as fast as SRILM-H without caching, and even the already fast HASH implementation is 300% faster in raw query speed with caching enabled. [sent-314, score-0.781]

99 With caching enabled, overall decoder speed is improved for both HASH and SRILMH, while the COMPRESSED implementation is only about 50% slower that the others. [sent-319, score-0.389]

100 Our experiments have demonstrated improvements in query speed over SRILM and compression rates against state-of-the-art lossy compression. [sent-321, score-0.348]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('hash', 0.307), ('bits', 0.307), ('array', 0.294), ('encoding', 0.226), ('compressed', 0.222), ('offset', 0.209), ('cache', 0.198), ('sorted', 0.157), ('trie', 0.154), ('deltas', 0.154), ('implementations', 0.142), ('storage', 0.134), ('caching', 0.133), ('queries', 0.133), ('srilm', 0.116), ('implementation', 0.112), ('issued', 0.11), ('offsets', 0.109), ('memory', 0.105), ('block', 0.103), ('arrays', 0.097), ('keys', 0.096), ('query', 0.096), ('scrolling', 0.095), ('bytes', 0.092), ('store', 0.09), ('compression', 0.088), ('lm', 0.086), ('lms', 0.084), ('lossy', 0.084), ('speed', 0.08), ('hepple', 0.079), ('lossless', 0.079), ('codes', 0.078), ('speeds', 0.077), ('wn', 0.076), ('header', 0.073), ('guthrie', 0.07), ('encode', 0.068), ('key', 0.064), ('decoder', 0.064), ('equality', 0.061), ('lookup', 0.061), ('cat', 0.061), ('tries', 0.059), ('fastest', 0.058), ('germann', 0.058), ('storing', 0.056), ('decoders', 0.055), ('whittaker', 0.053), ('raj', 0.052), ('billion', 0.051), ('compact', 0.051), ('fell', 0.05), ('hsu', 0.048), ('chazelle', 0.048), ('golomb', 0.048), ('jni', 0.048), ('java', 0.046), ('context', 0.046), ('stored', 0.045), ('pointers', 0.042), ('prefix', 0.042), ('variablelength', 0.042), ('boldi', 0.042), ('fast', 0.04), ('node', 0.039), ('integer', 0.038), ('brants', 0.038), ('khudanpur', 0.037), ('ranks', 0.037), ('encoded', 0.036), ('entries', 0.036), ('decoding', 0.036), ('required', 0.035), ('faster', 0.035), ('joshua', 0.035), ('bit', 0.035), ('value', 0.033), ('glass', 0.033), ('compress', 0.033), ('delta', 0.033), ('coding', 0.033), ('bloomier', 0.032), ('cettolo', 0.032), ('harb', 0.032), ('tpt', 0.032), ('uncompressed', 0.032), ('vigna', 0.032), ('suffix', 0.031), ('overhead', 0.031), ('gb', 0.03), ('requirements', 0.029), ('zhifei', 0.029), ('stores', 0.029), ('counts', 0.029), ('rank', 0.028), ('levenberg', 0.028), ('gz', 0.028), ('scroll', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 135 acl-2011-Faster and Smaller N-Gram Language Models

Author: Adam Pauls ; Dan Klein

2 0.17258322 36 acl-2011-An Efficient Indexer for Large N-Gram Corpora

Author: Hakan Ceylan ; Rada Mihalcea

Abstract: We introduce a new publicly available tool that implements efficient indexing and retrieval of large N-gram datasets, such as the Web1T 5-gram corpus. Our tool indexes the entire Web1T dataset with an index size of only 100 MB and performs a retrieval of any N-gram with a single disk access. With an increased index size of 420 MB and duplicate data, it also allows users to issue wild card queries provided that the wild cards in the query are contiguous. Furthermore, we also implement some of the smoothing algorithms that are designed specifically for large datasets and are shown to yield better language models than the traditional ones on the Web1T 5gram corpus (Yuret, 2008). We demonstrate the effectiveness of our tool and the smoothing algorithms on the English Lexical Substi- tution task by a simple implementation that gives considerable improvement over a basic language model.

3 0.16688463 276 acl-2011-Semi-Supervised SimHash for Efficient Document Similarity Search

Author: Qixia Jiang ; Maosong Sun

Abstract: Searching documents that are similar to a query document is an important component in modern information retrieval. Some existing hashing methods can be used for efficient document similarity search. However, unsupervised hashing methods cannot incorporate prior knowledge for better hashing. Although some supervised hashing methods can derive effective hash functions from prior knowledge, they are either computationally expensive or poorly discriminative. This paper proposes a novel (semi-)supervised hashing method named Semi-Supervised SimHash (S3H) for high-dimensional data similarity search. The basic idea of S3H is to learn the optimal feature weights from prior knowledge to relocate the data such that similar data have similar hash codes. We evaluate our method with several state-of-the-art methods on two large datasets. All the results show that our method gets the best performance. 1

4 0.16260397 189 acl-2011-K-means Clustering with Feature Hashing

Author: Hajime Senuma

Abstract: One of the major problems of K-means is that one must use dense vectors for its centroids, and therefore it is infeasible to store such huge vectors in memory when the feature space is high-dimensional. We address this issue by using feature hashing (Weinberger et al., 2009), a dimension-reduction technique, which can reduce the size of dense vectors while retaining sparsity of sparse vectors. Our analysis gives theoretical motivation and justification for applying feature hashing to Kmeans, by showing how much will the objective of K-means be (additively) distorted. Furthermore, to empirically verify our method, we experimented on a document clustering task.

5 0.10984164 271 acl-2011-Search in the Lost Sense of "Query": Question Formulation in Web Search Queries and its Temporal Changes

Author: Bo Pang ; Ravi Kumar

Abstract: Web search is an information-seeking activity. Often times, this amounts to a user seeking answers to a question. However, queries, which encode user’s information need, are typically not expressed as full-length natural language sentences in particular, as questions. Rather, they consist of one or more text fragments. As humans become more searchengine-savvy, do natural-language questions still have a role to play in web search? Through a systematic, large-scale study, we find to our surprise that as time goes by, web users are more likely to use questions to express their search intent. —

6 0.10685823 182 acl-2011-Joint Annotation of Search Queries

7 0.094356887 137 acl-2011-Fine-Grained Class Label Markup of Search Queries

8 0.093439475 258 acl-2011-Ranking Class Labels Using Query Sessions

9 0.08816544 256 acl-2011-Query Weighting for Ranking Model Adaptation

10 0.085847117 113 acl-2011-Efficient Online Locality Sensitive Hashing via Reservoir Counting

11 0.081923835 233 acl-2011-On-line Language Model Biasing for Statistical Machine Translation

12 0.080941729 181 acl-2011-Jigs and Lures: Associating Web Queries with Structured Entities

13 0.076103091 171 acl-2011-Incremental Syntactic Language Models for Phrase-based Translation

14 0.073228799 333 acl-2011-Web-Scale Features for Full-Scale Parsing

15 0.07251855 209 acl-2011-Lexically-Triggered Hidden Markov Models for Clinical Document Coding

16 0.067765661 11 acl-2011-A Fast and Accurate Method for Approximate String Search

17 0.06345129 246 acl-2011-Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition

18 0.059832543 140 acl-2011-Fully Unsupervised Word Segmentation with BVE and MDL

19 0.054446317 13 acl-2011-A Graph Approach to Spelling Correction in Domain-Centric Search

20 0.052271411 202 acl-2011-Learning Hierarchical Translation Structure with Linguistic Annotations

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.14), (1, -0.013), (2, -0.04), (3, 0.029), (4, -0.064), (5, -0.114), (6, -0.063), (7, -0.131), (8, 0.053), (9, 0.004), (10, 0.051), (11, 0.016), (12, -0.004), (13, 0.037), (14, -0.019), (15, -0.004), (16, -0.044), (17, 0.036), (18, -0.01), (19, -0.013), (20, 0.09), (21, -0.079), (22, 0.126), (23, -0.037), (24, -0.015), (25, -0.075), (26, 0.045), (27, -0.026), (28, -0.006), (29, -0.026), (30, 0.119), (31, 0.053), (32, 0.159), (33, 0.229), (34, -0.051), (35, -0.119), (36, 0.071), (37, -0.109), (38, 0.026), (39, -0.067), (40, 0.004), (41, 0.01), (42, 0.023), (43, -0.035), (44, -0.083), (45, 0.155), (46, 0.018), (47, -0.121), (48, -0.046), (49, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9316178 135 acl-2011-Faster and Smaller N-Gram Language Models

Author: Adam Pauls ; Dan Klein

2 0.81956911 276 acl-2011-Semi-Supervised SimHash for Efficient Document Similarity Search

Author: Qixia Jiang ; Maosong Sun

3 0.76481372 189 acl-2011-K-means Clustering with Feature Hashing

Author: Hajime Senuma

4 0.65873748 113 acl-2011-Efficient Online Locality Sensitive Hashing via Reservoir Counting

Author: Benjamin Van Durme ; Ashwin Lall

Abstract: We describe a novel mechanism called Reservoir Counting for application in online Locality Sensitive Hashing. This technique allows for significant savings in the streaming setting, allowing for maintaining a larger number of signatures, or an increased level of approximation accuracy at a similar memory footprint.

5 0.64431649 36 acl-2011-An Efficient Indexer for Large N-Gram Corpora

Author: Hakan Ceylan ; Rada Mihalcea

6 0.38020217 181 acl-2011-Jigs and Lures: Associating Web Queries with Structured Entities

7 0.37984303 271 acl-2011-Search in the Lost Sense of "Query": Question Formulation in Web Search Queries and its Temporal Changes

8 0.37950239 13 acl-2011-A Graph Approach to Spelling Correction in Domain-Centric Search

9 0.37169766 182 acl-2011-Joint Annotation of Search Queries

10 0.35456696 17 acl-2011-A Large Scale Distributed Syntactic, Semantic and Lexical Language Model for Machine Translation

11 0.34736454 210 acl-2011-Lexicographic Semirings for Exact Automata Encoding of Sequence Models

12 0.34169012 209 acl-2011-Lexically-Triggered Hidden Markov Models for Clinical Document Coding

13 0.34151632 11 acl-2011-A Fast and Accurate Method for Approximate String Search

14 0.34141746 137 acl-2011-Fine-Grained Class Label Markup of Search Queries

15 0.33788419 256 acl-2011-Query Weighting for Ranking Model Adaptation

16 0.33302006 246 acl-2011-Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition

17 0.33011067 291 acl-2011-SystemT: A Declarative Information Extraction System

18 0.32693583 233 acl-2011-On-line Language Model Biasing for Statistical Machine Translation

19 0.32604885 258 acl-2011-Ranking Class Labels Using Query Sessions

20 0.32264313 142 acl-2011-Generalized Interpolation in Decision Tree LM

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.041), (17, 0.055), (18, 0.025), (26, 0.046), (37, 0.068), (39, 0.037), (41, 0.13), (44, 0.196), (55, 0.038), (59, 0.03), (72, 0.031), (91, 0.058), (92, 0.035), (96, 0.106)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.82527673 135 acl-2011-Faster and Smaller N-Gram Language Models

Author: Adam Pauls ; Dan Klein

2 0.78872275 95 acl-2011-Detection of Agreement and Disagreement in Broadcast Conversations

Author: Wen Wang ; Sibel Yaman ; Kristin Precoda ; Colleen Richey ; Geoffrey Raymond

Abstract: We present Conditional Random Fields based approaches for detecting agreement/disagreement between speakers in English broadcast conversation shows. We develop annotation approaches for a variety of linguistic phenomena. Various lexical, structural, durational, and prosodic features are explored. We compare the performance when using features extracted from automatically generated annotations against that when using human annotations. We investigate the efficacy of adding prosodic features on top of lexical, structural, and durational features. Since the training data is highly imbalanced, we explore two sampling approaches, random downsampling and ensemble downsampling. Overall, our approach achieves 79.2% (precision), 50.5% (recall), 61.7% (F1) for agreement detection and 69.2% (precision), 46.9% (recall), and 55.9% (F1) for disagreement detection, on the English broadcast conversation data. 1 ?yIntroduction In ?ythis work, we present models for detecting agre?yement/disagreement (denoted (dis)agreement) betwy?een speakers in English broadcast conversation show?ys. The Broadcast Conversation (BC) genre differs from the Broadcast News (BN) genre in that it is?y more interactive and spontaneous, referring to freey? speech in news-style TV and radio programs and consisting of talk shows, interviews, call-in prog?yrams, live reports, and round-tables. Previous y? y?This work was performed while the author was at ICSI. syaman@us . ibm .com, graymond@ s oc .uc sb . edu work on detecting (dis)agreements has been focused on meeting data. (Hillard et al., 2003), (Galley et al., 2004), (Hahn et al., 2006) used spurt-level agreement annotations from the ICSI meeting corpus (Janin et al., 2003). (Hillard et al., 2003) explored unsupervised machine learning approaches and on manual transcripts, they achieved an overall 3-way agreement/disagreement classification ac- curacy as 82% with keyword features. (Galley et al., 2004) explored Bayesian Networks for the detection of (dis)agreements. They used adjacency pair information to determine the structure of their conditional Markov model and outperformed the results of (Hillard et al., 2003) by improving the 3way classification accuracy into 86.9%. (Hahn et al., 2006) explored semi-supervised learning algorithms and reached a competitive performance of 86.7% 3-way classification accuracy on manual transcriptions with only lexical features. (Germesin and Wilson, 2009) investigated supervised machine learning techniques and yields competitive results on the annotated data from the AMI meeting corpus (McCowan et al., 2005). Our work differs from these previous studies in two major categories. One is that a different definition of (dis)agreement was used. In the current work, a (dis)agreement occurs when a responding speaker agrees with, accepts, or disagrees with or rejects, a statement or proposition by a first speaker. Second, we explored (dis)agreement detection in broadcast conversation. Due to the difference in publicity and intimacy/collegiality between speakers in broadcast conversations vs. meet- ings, (dis)agreement may have different character374 Proceedings ofP thoer t4l9atnhd A, Onrnuegaoln M,e Jeuntineg 19 o-f2 t4h,e 2 A0s1s1o.c?i ac t2io0n11 fo Ar Cssoocmiaptuiotanti foonra Clo Lminpguutiast i ocns:aslh Loirntpgaupisetrics , pages 374–378, istics. Different from the unsupervised approaches in (Hillard et al., 2003) and semi-supervised approaches in (Hahn et al., 2006), we conducted supervised training. Also, different from (Hillard et al., 2003) and (Galley et al., 2004), our classification was carried out on the utterance level, instead of on the spurt-level. Galley et al. extended Hillard et al.’s work by adding features from previous spurts and features from the general dialog context to infer the class of the current spurt, on top of features from the current spurt (local features) used by Hillard et al. Galley et al. used adjacency pairs to describe the interaction between speakers and the relations between consecutive spurts. In this preliminary study on broadcast conversation, we directly modeled (dis)agreement detection without using adjacency pairs. Still, within the conditional random fields (CRF) framework, we explored features from preceding and following utterances to consider context in the discourse structure. We explored a wide variety of features, including lexical, structural, du- rational, and prosodic features. To our knowledge, this is the first work to systematically investigate detection of agreement/disagreement for broadcast conversation data. The remainder of the paper is organized as follows. Section 2 presents our data and automatic annotation modules. Section 3 describes various features and the CRF model we explored. Experimental results and discussion appear in Section 4, as well as conclusions and future directions. 2 Data and Automatic Annotation In this work, we selected English broadcast conversation data from the DARPA GALE program collected data (GALE Phase 1 Release 4, LDC2006E91; GALE Phase 4 Release 2, LDC2009E15). Human transcriptions and manual speaker turn labels are used in this study. Also, since the (dis)agreement detection output will be used to analyze social roles and relations of an interacting group, we first manually marked soundbites and then excluded soundbites during annotation and modeling. We recruited annotators to provide manual annotations of speaker roles and (dis)agreement to use for the supervised training of models. We de- fined a set of speaker roles as follows. Host/chair is a person associated with running the discussions 375 or calling the meeting. Reporting participant is a person reporting from the field, from a subcommittee, etc. Commentator participant/Topic participant is a person providing commentary on some subject, or person who is the subject of the conversation and plays a role, e.g., as a newsmaker. Audience participant is an ordinary person who may call in, ask questions at a microphone at e.g. a large presentation, or be interviewed because of their presence at a news event. Other is any speaker who does not fit in one of the above categories, such as a voice talent, an announcer doing show openings or commercial breaks, or a translator. Agreements and disagreements are composed of different combinations of initiating utterances and responses. We reformulated the (dis)agreement detection task as the sequence tagging of 11 (dis)agreement-related labels for identifying whether a given utterance is initiating a (dis)agreement opportunity, is a (dis)agreement response to such an opportunity, or is neither of these, in the show. For example, a Negative tag question followed by a negation response forms an agreement, that is, A: [Negative tag] This is not black and white, is it? B: [Agreeing Response] No, it isn’t. The data sparsity problem is serious. Among all 27,071 utterances, only 2,589 utterances are involved in (dis)agreement as initiating or response utterances, about 10% only among all data, while 24,482 utterances are not involved. These annotators also labeled shows with a variety of linguistic phenomena (denoted language use constituents, LUC), including discourse markers, disfluencies, person addresses and person mentions, prefaces, extreme case formulations, and dialog act tags (DAT). We categorized dialog acts into statement, question, backchannel, and incomplete. We classified disfluencies (DF) into filled pauses (e.g., uh, um), repetitions, corrections, and false starts. Person address (PA) terms are terms that a speaker uses to address another person. Person mentions (PM) are references to non-participants in the conversation. Discourse markers (DM) are words or phrases that are related to the structure of the discourse and express a relation between two utter- ances, for example, I mean, you know. Prefaces (PR) are sentence-initial lexical tokens serving functions close to discourse markers (e.g., Well, I think that...). Extreme case formulations (ECF) are lexical patterns emphasizing extremeness (e.g., This is the best book I have ever read). In the end, we manually annotated 49 English shows. We preprocessed English manual transcripts by removing transcriber annotation markers and noise, removing punctuation and case information, and conducting text normalization. We also built automatic rule-based and statistical annotation tools for these LUCs. 3 Features and Model We explored lexical, structural, durational, and prosodic features for (dis)agreement detection. We included a set of “lexical” features, including ngrams extracted from all of that speaker’s utterances, denoted ngram features. Other lexical features include the presence of negation and acquiescence, yes/no equivalents, positive and negative tag questions, and other features distinguishing different types of initiating utterances and responses. We also included various lexical features extracted from LUC annotations, denoted LUC features. These additional features include features related to the presence of prefaces, the counts of types and tokens of discourse markers, extreme case formulations, disfluencies, person addressing events, and person mentions, and the normalized values of these counts by sentence length. We also include a set of features related to the DAT of the current utterance and preceding and following utterances. We developed a set of “structural” and “durational” features, inspired by conversation analysis, to quantitatively represent the different participation and interaction patterns of speakers in a show. We extracted features related to pausing and overlaps between consecutive turns, the absolute and relative duration of consecutive turns, and so on. We used a set of prosodic features including pause, duration, and the speech rate of a speaker. We also used pitch and energy of the voice. Prosodic features were computed on words and phonetic alignment of manual transcripts. Features are computed for the beginning and ending words of an utterance. For the duration features, we used the average and maximum vowel duration from forced align- ment, both unnormalized and normalized for vowel identity and phone context. For pitch and energy, we 376 calculated the minimum, maximum,E range, mean, standard deviation, skewnesSs and kurEtosis values. A decision tree model was useSd to comEpute posteriors fFrom prosodic features and Swe used cuEmulative binnFing of posteriors as final feSatures , simEilar to (Liu et aFl., 2006). As ilPlu?stErajtSed?F i?n SectionS 2, we refEormulated the F(dis)agrePe?mEEejnSt? Fdet?ection taSsk as a seqEuence tagging FproblemP. EWEejS u?sFe?d the MalSlet packagEe (McCallum, 2F002) toP i?mEEpjSle?mFe?nt the linSear chain CERF model for FsequencPe ?tEEagjSgi?nFg.? A CRFS is an undEirected graphiFcal modPe?lEE EthjSa?t Fde?fines a glSobal log-lEinear distributFion of Pthe?EE sjtaSt?eF (o?r label) Ssequence E conditioned oFn an oPbs?EeErvjaSt?ioFn? sequencSe, in our case including Fthe sequPe?nEcEej So?fF Fse?ntences S and the corresponding sFequencPe ?oEEf jfSea?Ftur?es for this sequence of sentences F. TheP ?mEEodjSe?l Fis? optimized globally over the entire seqPue?nEEcejS. TFh?e CRF model is trained to maximize theP c?oEEnjdSit?iFon?al log-likelihood of a given training set P?EEjS? F?. During testing, the most likely sequence E is found using the Viterbi algorithm. One of the motivations of choosing conditional random fields was to avoid the label-bias problem found in hidden Markov models. Compared to Maximum Entropy modeling, the CRF model is optimized globally over the entire sequence, whereas the ME model makes a decision at each point individually without considering the context event information. 4 Experiments All (dis)agreement detection results are based on nfold cross-validation. In this procedure, we held out one show as the test set, randomly held out another show as the dev set, trained models on the rest of the data, and tested the model on the heldout show. We iterated through all shows and computed the overall accuracy. Table 1 shows the results of (dis)agreement detection using all features except prosodic features. We compared two conditions: (1) features extracted completely from the automatic LUC annotations and automatically detected speaker roles, and (2) features from manual speaker role labels and manual LUC annotations when man- ual annotations are available. Table 1 showed that running a fully automatic system to generate automatic annotations and automatic speaker roles produced comparable performance to the system using features from manual annotations whenever available. Table 1: Precision (%), recall (%), and F1 (%) of (dis)agreement detection using features extracted from manual speaker role labels and manual LUC annotations when available, denoted Manual Annotation, and automatic LUC annotations and automatically detected speaker roles, denoted Automatic Annotation. AMuatnoumaltAicn Aontaoitantio78P91.5Agr4eR3em.26en5tF671.5 AMuatnoumal tAicn Aontaoitanio76P04D.13isag3rR86e.56emn4F96t.176 We then focused on the condition of using features from manual annotations when available and added prosodic features as described in Section 3. The results are shown in Table 2. Adding prosodic features produced a 0.7% absolute gain on F1 on agreement detection, and 1.5% absolute gain on F1 on disagreement detection. Table 2: Precision (%), recall (%), and F1 (%) of (dis)agreement detection using manual annotations without and with prosodic features. w /itohp ro s o d ic 8 P1 .58Agr4 e34Re.m02en5t F76.125 w i/tohp ro s o d ic 7 0 PD.81isag43r0R8e.15eme5n4F19t.172 Note that only about 10% utterances among all data are involved in (dis)agreement. This indicates a highly imbalanced data set as one class is more heavily represented than the other/others. We suspected that this high imbalance has played a major role in the high precision and low recall results we obtained so far. Various approaches have been studied to handle imbalanced data for classifications, 377 trying to balaNnce the class distribution in the training set by eithNer oversaNmpling the minority class or downsamplinNg the maNjority class. In this preliminary study of NNsamplingN Napproaches for handling imbalanced dataN NNfor CRF Ntraining, we investigated two apprNoaches, rNNandom dNownsampling and ensemble dowNnsamplinNgN. RandoNm downsampling randomly dowNnsamples NNthe majorNity class to equate the number Nof minoritNNy and maNjority class samples. Ensemble NdownsampNNling is a N refinement of random downsamNpling whiNNch doesn’Nt discard any majority class samNples. InstNNead, we pNartitioned the majority class samNples into NN subspaNces with each subspace containiNng the samNe numbNer of samples as the minority clasNs. Then wNe train N CRF models, each based on thNe minoritNy class samples and one disjoint partitionN Nfrom the N subspaces. During testing, the posterioNr probability for one utterance is averaged over the N CRF models. The results from these two sampling approaches as well as the baseline are shown in Table 3. Both sampling approaches achieved significant improvement over the baseline, i.e., train- ing on the original data set, and ensemble downsampling produced better performance than downsampling. We noticed that both sampling approaches degraded slightly in precision but improved significantly in recall, resulting in 4.5% absolute gain on F1 for agreement detection and 4.7% absolute gain on F1 for disagreement detection. Table 3: Precision (%), recall (%), and F1 (%) of (dis)agreement detection without sampling, with random downsampling and ensemble downsampling. Manual annotations and prosodic features are used. BERansedlmoinbedwonsampling78P19D.825Aisagr4e8R0.m7e5n6 tF701. 2 EBRa ns ne dlmoinmbel dodwowns asmamp lin gn 67 09. 8324 046. 8915 351. 892 In conclusion, this paper presents our work on detection of agreements and disagreements in English broadcast conversation data. We explored a variety of features, including lexical, structural, durational, and prosodic features. We experimented these features using a linear-chain conditional random fields model and conducted supervised training. We observed significant improvement from adding prosodic features and employing two sampling approaches, random downsampling and ensemble downsampling. Overall, we achieved 79.2% (precision), 50.5% (recall), 61.7% (F1) for agreement detection and 69.2% (precision), 46.9% (recall), and 55.9% (F1) for disagreement detection, on English broadcast conversation data. In future work, we plan to continue adding and refining features, explore dependencies between features and contextual cues with respect to agreements and disagreements, and investigate the efficacy of other machine learning approaches such as Bayesian networks and Support Vector Machines. Acknowledgments The authors thank Gokhan Tur and Dilek HakkaniT u¨r for valuable insights and suggestions. This work has been supported by the Intelligence Advanced Research Projects Activity (IARPA) via Army Research Laboratory (ARL) contract number W91 1NF-09-C-0089. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, ARL, or the U.S. Government. References M. Galley, K. McKeown, J. Hirschberg, and E. Shriberg. 2004. Identifying agreement and disagreement in conversational speech: Use ofbayesian networks to model pragmatic dependencies. In Proceedings of ACL. S. Germesin and T. Wilson. 2009. Agreement detection in multiparty conversation. In Proceedings of International Conference on Multimodal Interfaces. S. Hahn, R. Ladner, and M. Ostendorf. 2006. Agreement/disagreement classification: Exploiting unlabeled data using constraint classifiers. In Proceedings of HLT/NAACL. 378 D. Hillard, M. Ostendorf, and E. Shriberg. 2003. Detection of agreement vs. disagreement in meetings: Training with unlabeled data. In Proceedings of HLT/NAACL. A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, T. Pfau, E. Shriberg, A. Stolcke, and C. Wooters. 2003. The ICSI Meeting Corpus. In Proc. ICASSP, Hong Kong, April. Yang Liu, Elizabeth Shriberg, Andreas Stolcke, Dustin Hillard, Mari Ostendorf, and Mary Harper. 2006. Enriching speech recognition with automatic detection of sentence boundaries and disfluencies. IEEE Transactions on Audio, Speech, and Language Processing, 14(5): 1526–1540, September. Special Issue on Progress in Rich Transcription. Andrew McCallum. 2002. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu. I. McCowan, J. Carletta, W. Kraaij, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, M. Kronenthal, G. Lathoud, M. Lincoln, A. Lisowska, W. Post, D. Reidsma, and P. Wellner. 2005. The AMI meeting corpus. In Proceedings of Measuring Behavior 2005, the 5th International Conference on Methods and Techniques in Behavioral Research.

3 0.76153845 295 acl-2011-Temporal Restricted Boltzmann Machines for Dependency Parsing

Author: Nikhil Garg ; James Henderson

Abstract: We propose a generative model based on Temporal Restricted Boltzmann Machines for transition based dependency parsing. The parse tree is built incrementally using a shiftreduce parse and an RBM is used to model each decision step. The RBM at the current time step induces latent features with the help of temporal connections to the relevant previous steps which provide context information. Our parser achieves labeled and unlabeled attachment scores of 88.72% and 91.65% respectively, which compare well with similar previous models and the state-of-the-art.

4 0.75581408 150 acl-2011-Hierarchical Text Classification with Latent Concepts

Author: Xipeng Qiu ; Xuanjing Huang ; Zhao Liu ; Jinlong Zhou

Abstract: Recently, hierarchical text classification has become an active research topic. The essential idea is that the descendant classes can share the information of the ancestor classes in a predefined taxonomy. In this paper, we claim that each class has several latent concepts and its subclasses share information with these different concepts respectively. Then, we propose a variant Passive-Aggressive (PA) algorithm for hierarchical text classification with latent concepts. Experimental results show that the performance of our algorithm is competitive with the recently proposed hierarchical classification algorithms.

5 0.75499266 12 acl-2011-A Generative Entity-Mention Model for Linking Entities with Knowledge Base

Author: Xianpei Han ; Le Sun

Abstract: Linking entities with knowledge base (entity linking) is a key issue in bridging the textual data with the structural knowledge base. Due to the name variation problem and the name ambiguity problem, the entity linking decisions are critically depending on the heterogenous knowledge of entities. In this paper, we propose a generative probabilistic model, called entitymention model, which can leverage heterogenous entity knowledge (including popularity knowledge, name knowledge and context knowledge) for the entity linking task. In our model, each name mention to be linked is modeled as a sample generated through a three-step generative story, and the entity knowledge is encoded in the distribution of entities in document P(e), the distribution of possible names of a specific entity P(s|e), and the distribution of possible contexts of a specific entity P(c|e). To find the referent entity of a name mention, our method combines the evidences from all the three distributions P(e), P(s|e) and P(c|e). Experimental results show that our method can significantly outperform the traditional methods. 1

6 0.68112057 185 acl-2011-Joint Identification and Segmentation of Domain-Specific Dialogue Acts for Conversational Dialogue Systems

7 0.67521012 219 acl-2011-Metagrammar engineering: Towards systematic exploration of implemented grammars

8 0.673998 36 acl-2011-An Efficient Indexer for Large N-Gram Corpora

9 0.67280495 65 acl-2011-Can Document Selection Help Semi-supervised Learning? A Case Study On Event Extraction

10 0.67266482 265 acl-2011-Reordering Modeling using Weighted Alignment Matrices

11 0.67044425 33 acl-2011-An Affect-Enriched Dialogue Act Classification Model for Task-Oriented Dialogue

12 0.67018956 56 acl-2011-Bayesian Inference for Zodiac and Other Homophonic Ciphers

13 0.66924995 189 acl-2011-K-means Clustering with Feature Hashing

14 0.66213071 196 acl-2011-Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models

15 0.66131884 328 acl-2011-Using Cross-Entity Inference to Improve Event Extraction

16 0.66015851 232 acl-2011-Nonparametric Bayesian Machine Transliteration with Synchronous Adaptor Grammars

17 0.65994352 139 acl-2011-From Bilingual Dictionaries to Interlingual Document Representations

18 0.65288877 145 acl-2011-Good Seed Makes a Good Crop: Accelerating Active Learning Using Language Modeling

19 0.65286517 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction

20 0.65263909 83 acl-2011-Contrasting Multi-Lingual Prosodic Cues to Predict Verbal Feedback for Rapport