emnlp emnlp2013 emnlp2013-20 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Makoto Yasuhara ; Toru Tanaka ; Jun-ya Norimatsu ; Mikio Yamamoto
Abstract: Ngram language models tend to increase in size with inflating the corpus size, and consume considerable resources. In this paper, we propose an efficient method for implementing ngram models based on double-array structures. First, we propose a method for representing backwards suffix trees using double-array structures and demonstrate its efficiency. Next, we propose two optimization methods for improving the efficiency of data representation in the double-array structures. Embedding probabilities into unused spaces in double-array structures reduces the model size. Moreover, tuning the word IDs in the language model makes the model smaller and faster. We also show that our method can be used for building large language models using the division method. Lastly, we show that our method outperforms methods based on recent related works from the viewpoints of model size and query speed when both optimization methods are used.
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper, we propose an efficient method for implementing ngram models based on double-array structures. [sent-2, score-0.396]
2 First, we propose a method for representing backwards suffix trees using double-array structures and demonstrate its efficiency. [sent-3, score-0.855]
3 Next, we propose two optimization methods for improving the efficiency of data representation in the double-array structures. [sent-4, score-0.17]
4 Embedding probabilities into unused spaces in double-array structures reduces the model size. [sent-5, score-0.287]
5 We also show that our method can be used for building large language models using the division method. [sent-7, score-0.037]
6 Lastly, we show that our method outperforms methods based on recent related works from the viewpoints of model size and query speed when both optimization methods are used. [sent-8, score-0.419]
7 Ngram language models (Jelinek, 1990) are widely used as probabilistic models of sentences in natural language processing. [sent-10, score-0.049]
8 The wide use of the Internet has entailed a dramatic increase in the size of the available corpora, which can be harnessed to obtain a significant improvement in model quality. [sent-11, score-0.169]
9 (2007) have shown that the performance of statistical machine translation systems is monotonically improved with the increasing size of training corpora for the language model. [sent-13, score-0.132]
10 However, models using larger corpora also consume more resources. [sent-18, score-0.153]
11 In recent years, many methods for improving the efficiency of language models have been proposed to tackle this problem (Pauls and Klein, 2011; Heafield, 2011). [sent-19, score-0.073]
12 Such methods not only reduce the required memory size but also increase query speed. [sent-20, score-0.331]
13 In this paper, we propose the double-array language model (DALM), which uses double-array structures (Aoe, 1989). [sent-21, score-0.148]
14 Double-array structures are widely used in text processing, especially for Japanese. [sent-22, score-0.169]
15 They are known to provide a compact representation of tries (Fredkin, 1960) and fast transitions between trie nodes. [sent-23, score-0.368]
16 The ability to store and manipulate tries efficiently is expected to increase the performance of language models (i.e., [sent-24, score-0.455]
17 improving query speed and reducing the model size in terms of memory) because tries are one of the most common representations of data structures in language models. [sent-26, score-0.657]
18 We use double-array structures to implement a language model since we can utilize their speed and compactness when querying the model about an ngram. [sent-27, score-0.474]
19 In order to utilize double-array structures as language models, we modify them to be able to store probabilities and backoff weights. [sent-28, score-0.528]
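As background for the data structure used here: a double-array trie keeps two parallel integer arrays, BASE and CHECK, and an edge from state s labelled with symbol code c exists only if CHECK[BASE[s] + c] == s. The Python sketch below illustrates just this transition scheme; the function names and array layout are illustrative assumptions, not DALM's actual implementation.

```python
# Minimal sketch of double-array trie traversal (Aoe, 1989).
# BASE and CHECK are parallel integer arrays; a transition from state s
# on symbol code c succeeds iff CHECK[BASE[s] + c] == s.
def transition(base, check, state, code):
    """Follow one edge; return the next state, or None if the edge is absent."""
    nxt = base[state] + code
    if 0 <= nxt < len(check) and check[nxt] == state:
        return nxt
    return None

def lookup(base, check, codes, root=0):
    """Walk a sequence of symbol codes from the root; None means no such path."""
    state = root
    for c in codes:
        state = transition(base, check, state, c)
        if state is None:
            return None
    return state
```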
20 We also propose two optimization methods: embedding and ordering. [sent-29, score-0.178]
21 These methods reduce model size and increase query speed. [sent-30, score-0.271]
22 Embedding is an efficient method for storing ngram probabilities and backoff weights, whereby we find vacant spaces in the double-array language model structure and populate them with language model information, such as probabilities and backoff weights. [sent-31, score-1.069]
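One way to picture the embedding idea, purely as a hypothetical sketch and not the paper's actual memory layout: a cell of the double-array that the trie never uses can hold the raw bit pattern of a 32-bit float, so probabilities and backoff weights live inside the existing arrays rather than in a separate table.

```python
import struct

EMPTY = -1  # toy marker for an unused CHECK cell (assumption, not DALM's convention)

def embed_float(base, check, slot, value):
    """Write a 32-bit float's bit pattern into an unused BASE cell (toy sketch)."""
    assert check[slot] == EMPTY, "slot must be vacant"
    base[slot] = struct.unpack("<i", struct.pack("<f", value))[0]

def read_embedded_float(base, slot):
    """Recover the float written by embed_float."""
    return struct.unpack("<f", struct.pack("<i", base[slot]))[0]
```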
23 DALM uses word IDs for all words of the ngram, and ordering assigns a word ID to each word to reduce the model size. [sent-35, score-0.097]
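A hedged sketch of one plausible ordering heuristic is shown below: assign small IDs to frequent words. The paper's actual criterion for choosing IDs (and how it interacts with double-array construction) may differ; this only illustrates the kind of mapping involved.

```python
def assign_word_ids(word_counts):
    """Toy ordering heuristic: more frequent words receive smaller IDs."""
    ranked = sorted(word_counts, key=word_counts.get, reverse=True)
    return {word: i for i, word in enumerate(ranked, start=1)}

# e.g. assign_word_ids({"the": 120, "eat": 7, "soup": 2})
#      -> {"the": 1, "eat": 2, "soup": 3}
```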
24 These two optimization methods can be used simultaneously and are expected to work well together. [sent-36, score-0.069]
25 In our experiments, we use a language model based on corpora of the NTCIR patent retrieval task (Atsushi Fujii et al.). [sent-37, score-0.082]
26 We conducted experiments focusing on query speed and model size. [sent-43, score-0.254]
27 The results indicate that when the above-mentioned optimization methods are used together, DALM outperforms state-of-the-art methods on those points. [sent-44, score-0.119]
28 1 Tries and Backwards Suffix Trees Tries (Fredkin, 1960) are one of the most widely used tree structures in ngram language models since they can reduce memory requirements by sharing common prefixes. [sent-46, score-0.655]
29 Moreover, since the query speed for tries depends only on the number of input words, the query speed remains constant even if the ngram model increases in size. [sent-47, score-0.968]
30 Backwards suffix trees (…, 2009) are among the most efficient representations of tries for language models. [sent-50, score-0.278]
31 They contain ngrams in reverse order of history words. [sent-51, score-0.362]
32 Figure 1 shows an example of a backwards suffix tree representation. [sent-52, score-0.73]
33 In this paper, we denote an ngram of the form w_1, w_2, · · · , w_n as w_1^n. [sent-53, score-0.051]
34 In this example, word lists (represented as rectangular tables) contain target words (here, w_n) of ngrams, and circled words in the tree denote history words (here, w_1^{n-1}) associated with target words. [sent-54, score-0.343]
35 The history words “I eat,” “you eat”, and “do you eat” are stored in reverse order. [sent-55, score-0.333]
36 Querying this trie about an ngram is simple: just trace history words in reverse and then find the target word in a list. [sent-56, score-0.809]
37 For example, consider querying about the trigram “I eat fish”. [sent-57, score-0.465]
38 First, simply trace the history in the trie in reverse order (“eat” → “I”); then, find “fish” in the list <1>. [sent-58, score-0.555]
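The lookup just traced can be mimicked with a toy dictionary-based backwards suffix tree. The tree below is hand-built to roughly mirror Figure 1 (its word lists are assumptions), and DALM of course stores this structure inside a double-array rather than Python dicts.

```python
# Toy backwards suffix tree: each node has a word list ("words") and child
# nodes keyed by history words ("children").  Contents are illustrative.
def node(words=(), **children):
    return {"words": set(words), "children": children}

tree = node(
    eat=node({"fish"},                      # word list for history "eat"
             I=node({"fish"}),              # history "I eat"      -> list <1>
             you=node({"fish", "soup"},     # history "you eat"    -> list <2>
                      do=node({"fish"}))),  # history "do you eat" -> list <3>
)

def contains(tree, ngram):
    """Trace the history words in reverse order, then look for the target word."""
    *history, target = ngram
    cur = tree
    for w in reversed(history):
        cur = cur["children"].get(w)
        if cur is None:
            return False
    return target in cur["words"]

print(contains(tree, ["I", "eat", "fish"]))          # True
print(contains(tree, ["do", "you", "eat", "soup"]))  # False
```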
39 Similarly, querying a backwards suffix tree about unknown ngrams is also efficient, because the backwards suffix tree … (Figure 1: Example of a backwards suffix tree.) [sent-59, score-2.204]
40 There are two branch types in a backwards suffix tree: history words and target words. [sent-60, score-0.869]
41 History words are shown in circles and target words are stored in word lists. [sent-61, score-0.132]
42 For example, in querying about the 4gram “do you eat soup”, we first trace “eat” → “you” → “do” in a manner similar to above. [sent-63, score-0.524]
43 However, searching for the word “soup” in list <3> fails because list <3> does not contain the word “soup”. [sent-64, score-0.058]
44 In this case, we return to the node “you” to search the list <2>, where we find “soup”. [sent-65, score-0.029]
45 This means that the trigram “you eat soup” is contained in the tree while the 4gram “do you eat soup” is not. [sent-66, score-0.715]
46 This behavior can be efficiently used for backoff calculation. [sent-67, score-0.249]
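As an illustration of how that fallback feeds a backoff computation, the sketch below applies the standard backoff recursion over toy probability and backoff-weight tables; it is not code from the paper, and the unseen-unigram floor is an arbitrary placeholder.

```python
def backoff_prob(prob, bow, history, word):
    """p(word | history): use the stored ngram probability if present,
    otherwise multiply the history's backoff weight by the shorter-context
    estimate (standard backoff recursion; tables are toy dictionaries)."""
    ngram = tuple(history) + (word,)
    if ngram in prob:
        return prob[ngram]
    if not history:
        return 1e-7  # placeholder for an unseen unigram
    return bow.get(tuple(history), 1.0) * backoff_prob(prob, bow, history[1:], word)

# e.g. with prob = {("eat", "fish"): 0.2} and bow = {}, the query
# backoff_prob(prob, bow, ("I", "eat"), "fish") backs off once and returns 0.2.
```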
47 SRILM (Stolcke, 2002) is a widely used language model toolkit. [sent-68, score-0.049]
48 It utilizes backwards suffix trees for its data structures. [sent-69, score-0.707]
49 In SRILM, tries are implemented as 64-bit pointer links, which wastes a lot of memory. [sent-70, score-0.294]
50 On the other hand, the access speed for ngram probabilities is relatively high. [sent-71, score-0.456]
51 2 Efficient Language Models In recent years, several methods have been proposed for storing language models efficiently in memory. [sent-73, score-0.138]
52 Talbot and Osborne (2007) have proposed an efficient method based on bloom filters. [sent-74, score-0.214]
53 This method modifies bloom filters to store count information about training sets. [sent-75, score-0.342]
54 In prior work, bloom filters have been used for checking whether certain data are contained in a set. [sent-76, score-0.256]
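For context on the data structure mentioned here: a Bloom filter is a bit array plus k hash functions that answers set-membership queries with false positives but no false negatives. Below is a minimal membership-only sketch; the count-storing variant of Talbot and Osborne (2007) is more involved and is not reproduced here.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch; sizes and the double-hashing trick are
    illustrative choices, not taken from any particular paper."""

    def __init__(self, m=1 << 20, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

# e.g. bf = BloomFilter(); bf.add("do you eat soup"); "do you eat soup" in bf -> True
```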
wordName wordTfidf (topN-words)
[('backwards', 0.413), ('soup', 0.302), ('eat', 0.278), ('ngram', 0.258), ('suffix', 0.238), ('tries', 0.202), ('backoff', 0.192), ('dalm', 0.174), ('history', 0.146), ('atsushi', 0.138), ('bloom', 0.138), ('trie', 0.138), ('querying', 0.136), ('speed', 0.134), ('reverse', 0.123), ('fujii', 0.121), ('query', 0.12), ('structures', 0.12), ('fredkin', 0.116), ('store', 0.111), ('trace', 0.11), ('ngrams', 0.093), ('embedding', 0.081), ('makoto', 0.081), ('storing', 0.081), ('tree', 0.079), ('consume', 0.077), ('efficient', 0.076), ('optimization', 0.069), ('probabilities', 0.064), ('ids', 0.064), ('stored', 0.064), ('memory', 0.062), ('srilm', 0.06), ('reduce', 0.059), ('filters', 0.058), ('efficiently', 0.057), ('trees', 0.056), ('spaces', 0.053), ('stolcke', 0.052), ('trigram', 0.051), ('wn', 0.051), ('size', 0.05), ('abovementioned', 0.05), ('circled', 0.05), ('germann', 0.05), ('inflating', 0.05), ('talbot', 0.05), ('unused', 0.05), ('widely', 0.049), ('pointer', 0.046), ('ddoo', 0.046), ('mikio', 0.046), ('populate', 0.046), ('tanaka', 0.046), ('viewpoints', 0.046), ('wastes', 0.046), ('arpa', 0.043), ('heafield', 0.043), ('compactness', 0.043), ('entailed', 0.043), ('fish', 0.043), ('manipulate', 0.043), ('whereby', 0.043), ('increase', 0.042), ('corpora', 0.042), ('efficiency', 0.042), ('utilize', 0.041), ('monotonically', 0.04), ('raise', 0.04), ('ntcir', 0.04), ('patent', 0.04), ('yamamoto', 0.04), ('osborne', 0.038), ('branch', 0.038), ('itn', 0.038), ('ordering', 0.038), ('division', 0.037), ('gb', 0.037), ('bell', 0.037), ('file', 0.037), ('modifies', 0.035), ('rc', 0.034), ('dramatic', 0.034), ('pauls', 0.034), ('lastly', 0.034), ('circles', 0.034), ('implementing', 0.034), ('jp', 0.034), ('target', 0.034), ('jelinek', 0.033), ('checking', 0.031), ('improving', 0.031), ('years', 0.03), ('internet', 0.03), ('contained', 0.029), ('list', 0.029), ('propose', 0.028), ('requirements', 0.028), ('transitions', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 20 emnlp-2013-An Efficient Language Model Using Double-Array Structures
Author: Makoto Yasuhara ; Toru Tanaka ; Jun-ya Norimatsu ; Mikio Yamamoto
Abstract: Ngram language models tend to increase in size with inflating the corpus size, and consume considerable resources. In this paper, we propose an efficient method for implementing ngram models based on double-array structures. First, we propose a method for representing backwards suffix trees using double-array structures and demonstrate its efficiency. Next, we propose two optimization methods for improving the efficiency of data representation in the double-array structures. Embedding probabilities into unused spaces in double-array structures reduces the model size. Moreover, tuning the word IDs in the language model makes the model smaller and faster. We also show that our method can be used for building large language models using the division method. Lastly, we show that our method outperforms methods based on recent related works from the viewpoints of model size and query speed when both optimization methods are used.
2 0.12899132 176 emnlp-2013-Structured Penalties for Log-Linear Language Models
Author: Anil Kumar Nelakanti ; Cedric Archambeau ; Julien Mairal ; Francis Bach ; Guillaume Bouchard
Abstract: Language models can be formalized as log-linear regression models where the input features represent previously observed contexts up to a certain length m. The complexity of existing algorithms to learn the parameters by maximum likelihood scale linearly in nd, where n is the length of the training corpus and d is the number of observed features. We present a model that grows logarithmically in d, making it possible to efficiently leverage longer contexts. We account for the sequential structure of natural language using tree-structured penalized objectives to avoid overfitting and achieve better generalization.
3 0.080591552 97 emnlp-2013-Identifying Web Search Query Reformulation using Concept based Matching
Author: Ahmed Hassan
Abstract: Web search users frequently modify their queries in hope of receiving better results. This process is referred to as “Query Reformulation”. Previous research has mainly focused on proposing query reformulations in the form of suggested queries for users. Some research has studied the problem of predicting whether the current query is a reformulation of the previous query or not. However, this work has been limited to bag-of-words models where the main signals being used are word overlap, character level edit distance and word level edit distance. In this work, we show that relying solely on surface level text similarity results in many false positives where queries with different intents yet similar topics are mistakenly predicted as query reformulations. We propose a new representation for Web search queries based on identifying the concepts in queries and show that we can significantly improve query reformulation performance using features of query concepts.
4 0.073064484 39 emnlp-2013-Boosting Cross-Language Retrieval by Learning Bilingual Phrase Associations from Relevance Rankings
Author: Artem Sokokov ; Laura Jehl ; Felix Hieber ; Stefan Riezler
Abstract: We present an approach to learning bilingual n-gram correspondences from relevance rankings of English documents for Japanese queries. We show that directly optimizing cross-lingual rankings rivals and complements machine translation-based cross-language information retrieval (CLIR). We propose an efficient boosting algorithm that deals with very large cross-product spaces of word correspondences. We show in an experimental evaluation on patent prior art search that our approach, and in particular a consensus-based combination of boosting and translation-based approaches, yields substantial improvements in CLIR performance. Our training and test data are made publicly available.
5 0.068132624 83 emnlp-2013-Exploring the Utility of Joint Morphological and Syntactic Learning from Child-directed Speech
Author: Stella Frank ; Frank Keller ; Sharon Goldwater
Abstract: Children learn various levels of linguistic structure concurrently, yet most existing models of language acquisition deal with only a single level of structure, implicitly assuming a sequential learning process. Developing models that learn multiple levels simultaneously can provide important insights into how these levels might interact synergistically during learning. Here, we present a model that jointly induces syntactic categories and morphological segmentations by combining two well-known models for the individual tasks. We test on child-directed utterances in English and Spanish and compare to single-task baselines. In the morphologically poorer language (English), the model improves morphological segmentation, while in the morphologically richer language (Spanish), it leads to better syntactic categorization. These results provide further evidence that joint learning is useful, but also suggest that the benefits may be different for typologically different languages.
6 0.067684948 58 emnlp-2013-Dependency Language Models for Sentence Completion
7 0.060704827 60 emnlp-2013-Detecting Compositionality of Multi-Word Expressions using Nearest Neighbours in Vector Space Models
8 0.049185436 105 emnlp-2013-Improving Web Search Ranking by Incorporating Structured Annotation of Queries
9 0.044667393 10 emnlp-2013-A Multi-Teraflop Constituency Parser using GPUs
10 0.039073862 114 emnlp-2013-Joint Learning and Inference for Grammatical Error Correction
11 0.038161203 169 emnlp-2013-Semi-Supervised Representation Learning for Cross-Lingual Text Classification
12 0.036560986 127 emnlp-2013-Max-Margin Synchronous Grammar Induction for Machine Translation
13 0.036202662 107 emnlp-2013-Interactive Machine Translation using Hierarchical Translation Models
14 0.033338405 31 emnlp-2013-Automatic Feature Engineering for Answer Selection and Extraction
15 0.031499192 100 emnlp-2013-Improvements to the Bayesian Topic N-Gram Models
16 0.031410083 173 emnlp-2013-Simulating Early-Termination Search for Verbose Spoken Queries
17 0.029805418 190 emnlp-2013-Ubertagging: Joint Segmentation and Supertagging for English
18 0.029000359 17 emnlp-2013-A Walk-Based Semantically Enriched Tree Kernel Over Distributed Word Representations
19 0.027963055 187 emnlp-2013-Translation with Source Constituency and Dependency Trees
20 0.027852505 52 emnlp-2013-Converting Continuous-Space Language Models into N-Gram Language Models for Statistical Machine Translation
topicId topicWeight
[(0, -0.099), (1, -0.035), (2, -0.008), (3, 0.009), (4, -0.05), (5, 0.024), (6, 0.029), (7, 0.024), (8, -0.011), (9, -0.049), (10, -0.027), (11, 0.052), (12, -0.115), (13, -0.034), (14, 0.035), (15, -0.05), (16, -0.042), (17, -0.086), (18, 0.097), (19, 0.004), (20, 0.059), (21, -0.119), (22, -0.06), (23, 0.1), (24, -0.091), (25, 0.061), (26, -0.002), (27, 0.006), (28, 0.038), (29, -0.052), (30, 0.04), (31, -0.167), (32, 0.013), (33, -0.152), (34, -0.033), (35, -0.054), (36, -0.017), (37, -0.024), (38, -0.233), (39, 0.307), (40, 0.045), (41, -0.093), (42, -0.13), (43, -0.068), (44, 0.168), (45, 0.12), (46, -0.077), (47, 0.036), (48, 0.008), (49, -0.203)]
simIndex simValue paperId paperTitle
same-paper 1 0.9708578 20 emnlp-2013-An Efficient Language Model Using Double-Array Structures
Author: Makoto Yasuhara ; Toru Tanaka ; Jun-ya Norimatsu ; Mikio Yamamoto
Abstract: Ngram language models tend to increase in size with inflating the corpus size, and consume considerable resources. In this paper, we propose an efficient method for implementing ngram models based on double-array structures. First, we propose a method for representing backwards suffix trees using double-array structures and demonstrate its efficiency. Next, we propose two optimization methods for improving the efficiency of data representation in the double-array structures. Embedding probabilities into unused spaces in double-array structures reduces the model size. Moreover, tuning the word IDs in the language model makes the model smaller and faster. We also show that our method can be used for building large language models using the division method. Lastly, we show that our method outperforms methods based on recent related works from the viewpoints of model size and query speed when both optimization methods are used.
2 0.8112185 176 emnlp-2013-Structured Penalties for Log-Linear Language Models
Author: Anil Kumar Nelakanti ; Cedric Archambeau ; Julien Mairal ; Francis Bach ; Guillaume Bouchard
Abstract: Language models can be formalized as log-linear regression models where the input features represent previously observed contexts up to a certain length m. The complexity of existing algorithms to learn the parameters by maximum likelihood scale linearly in nd, where n is the length of the training corpus and d is the number of observed features. We present a model that grows logarithmically in d, making it possible to efficiently leverage longer contexts. We account for the sequential structure of natural language using tree-structured penalized objectives to avoid overfitting and achieve better generalization.
3 0.37186641 39 emnlp-2013-Boosting Cross-Language Retrieval by Learning Bilingual Phrase Associations from Relevance Rankings
Author: Artem Sokokov ; Laura Jehl ; Felix Hieber ; Stefan Riezler
Abstract: We present an approach to learning bilingual n-gram correspondences from relevance rankings of English documents for Japanese queries. We show that directly optimizing cross-lingual rankings rivals and complements machine translation-based cross-language information retrieval (CLIR). We propose an efficient boosting algorithm that deals with very large cross-product spaces of word correspondences. We show in an experimental evaluation on patent prior art search that our approach, and in particular a consensus-based combination of boosting and translation-based approaches, yields substantial improvements in CLIR performance. Our training and test data are made publicly available.
4 0.3572959 58 emnlp-2013-Dependency Language Models for Sentence Completion
Author: Joseph Gubbins ; Andreas Vlachos
Abstract: Sentence completion is a challenging semantic modeling task in which models must choose the most appropriate word from a given set to complete a sentence. Although a variety of language models have been applied to this task in previous work, none of the existing approaches incorporate syntactic information. In this paper we propose to tackle this task using a pair of simple language models in which the probability of a sentence is estimated as the probability of the lexicalisation of a given syntactic dependency tree. We apply our approach to the Microsoft Research Sentence Completion Challenge and show that it improves on n-gram language models by 8.7 percentage points, achieving the highest accuracy reported to date apart from neural language models that are more complex and expensive to train.
5 0.29530716 97 emnlp-2013-Identifying Web Search Query Reformulation using Concept based Matching
Author: Ahmed Hassan
Abstract: Web search users frequently modify their queries in hope of receiving better results. This process is referred to as “Query Reformulation”. Previous research has mainly focused on proposing query reformulations in the form of suggested queries for users. Some research has studied the problem of predicting whether the current query is a reformulation of the previous query or not. However, this work has been limited to bag-of-words models where the main signals being used are word overlap, character level edit distance and word level edit distance. In this work, we show that relying solely on surface level text similarity results in many false positives where queries with different intents yet similar topics are mistakenly predicted as query reformulations. We propose a new representation for Web search queries based on identifying the concepts in queries and show that we can significantly improve query reformulation performance using features of query concepts.
6 0.2830151 107 emnlp-2013-Interactive Machine Translation using Hierarchical Translation Models
8 0.26541099 83 emnlp-2013-Exploring the Utility of Joint Morphological and Syntactic Learning from Child-directed Speech
9 0.26026785 10 emnlp-2013-A Multi-Teraflop Constituency Parser using GPUs
10 0.25961336 173 emnlp-2013-Simulating Early-Termination Search for Verbose Spoken Queries
11 0.23929392 165 emnlp-2013-Scaling to Large3 Data: An Efficient and Effective Method to Compute Distributional Thesauri
12 0.2352151 100 emnlp-2013-Improvements to the Bayesian Topic N-Gram Models
13 0.23023701 103 emnlp-2013-Improving Pivot-Based Statistical Machine Translation Using Random Walk
14 0.21326798 60 emnlp-2013-Detecting Compositionality of Multi-Word Expressions using Nearest Neighbours in Vector Space Models
15 0.20526595 105 emnlp-2013-Improving Web Search Ranking by Incorporating Structured Annotation of Queries
16 0.19458617 55 emnlp-2013-Decoding with Large-Scale Neural Language Models Improves Translation
17 0.18097332 174 emnlp-2013-Single-Document Summarization as a Tree Knapsack Problem
18 0.17585927 114 emnlp-2013-Joint Learning and Inference for Grammatical Error Correction
19 0.17402323 21 emnlp-2013-An Empirical Study Of Semi-Supervised Chinese Word Segmentation Using Co-Training
20 0.17276222 66 emnlp-2013-Dynamic Feature Selection for Dependency Parsing
topicId topicWeight
[(18, 0.026), (22, 0.014), (30, 0.809), (51, 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 0.97673661 20 emnlp-2013-An Efficient Language Model Using Double-Array Structures
Author: Makoto Yasuhara ; Toru Tanaka ; Jun-ya Norimatsu ; Mikio Yamamoto
Abstract: Ngram language models tend to increase in size with inflating the corpus size, and consume considerable resources. In this paper, we propose an efficient method for implementing ngram models based on double-array structures. First, we propose a method for representing backwards suffix trees using double-array structures and demonstrate its efficiency. Next, we propose two optimization methods for improving the efficiency of data representation in the double-array structures. Embedding probabilities into unused spaces in double-array structures reduces the model size. Moreover, tuning the word IDs in the language model makes the model smaller and faster. We also show that our method can be used for building large language models using the division method. Lastly, we show that our method outperforms methods based on recent related works from the viewpoints of model size and query speed when both optimization methods are used.
2 0.91888642 92 emnlp-2013-Growing Multi-Domain Glossaries from a Few Seeds using Probabilistic Topic Models
Author: Stefano Faralli ; Roberto Navigli
Abstract: In this paper we present a minimally-supervised approach to the multi-domain acquisition of wide-coverage glossaries. We start from a small number of hypernymy relation seeds and bootstrap glossaries from the Web for dozens of domains using Probabilistic Topic Models. Our experiments show that we are able to extract high-precision glossaries comprising thousands of terms and definitions.
3 0.90427196 127 emnlp-2013-Max-Margin Synchronous Grammar Induction for Machine Translation
Author: Xinyan Xiao ; Deyi Xiong
Abstract: Traditional synchronous grammar induction estimates parameters by maximizing likelihood, which only has a loose relation to translation quality. Alternatively, we propose a max-margin estimation approach to discriminatively inducing synchronous grammars for machine translation, which directly optimizes translation quality measured by BLEU. In the max-margin estimation of parameters, we only need to calculate Viterbi translations. This further facilitates the incorporation of various non-local features that are defined on the target side. We test the effectiveness of our max-margin estimation framework on a competitive hierarchical phrase-based system. Experiments show that our max-margin method significantly outperforms the traditional twostep pipeline for synchronous rule extraction by 1.3 BLEU points and is also better than previous max-likelihood estimation method.
4 0.90161675 176 emnlp-2013-Structured Penalties for Log-Linear Language Models
Author: Anil Kumar Nelakanti ; Cedric Archambeau ; Julien Mairal ; Francis Bach ; Guillaume Bouchard
Abstract: Language models can be formalized as loglinear regression models where the input features represent previously observed contexts up to a certain length m. The complexity of existing algorithms to learn the parameters by maximum likelihood scale linearly in nd, where n is the length of the training corpus and d is the number of observed features. We present a model that grows logarithmically in d, making it possible to efficiently leverage longer contexts. We account for the sequential structure of natural language using treestructured penalized objectives to avoid overfitting and achieve better generalization.
5 0.81217092 4 emnlp-2013-A Dataset for Research on Short-Text Conversations
Author: Hao Wang ; Zhengdong Lu ; Hang Li ; Enhong Chen
Abstract: Natural language conversation is widely regarded as a highly difficult problem, which is usually attacked with either rule-based or learning-based models. In this paper we propose a retrieval-based automatic response model for short-text conversation, to exploit the vast amount of short conversation instances available on social media. For this purpose we introduce a dataset of short-text conversation based on the real-world instances from Sina Weibo (a popular Chinese microblog service), which will be soon released to public. This dataset provides rich collection of instances for the research on finding natural and relevant short responses to a given short text, and useful for both training and testing of conversation models. This dataset consists of both naturally formed conversations, manually labeled data, and a large repository of candidate responses. Our preliminary experiments demonstrate that the simple retrieval-based conversation model performs reasonably well when combined with the rich instances in our dataset.
6 0.80320179 163 emnlp-2013-Sarcasm as Contrast between a Positive Sentiment and Negative Situation
7 0.61032748 113 emnlp-2013-Joint Language and Translation Modeling with Recurrent Neural Networks
8 0.57234126 3 emnlp-2013-A Corpus Level MIRA Tuning Strategy for Machine Translation
9 0.56155366 146 emnlp-2013-Optimal Incremental Parsing via Best-First Dynamic Programming
10 0.54440308 172 emnlp-2013-Simple Customization of Recursive Neural Networks for Semantic Relation Classification
11 0.54045641 156 emnlp-2013-Recurrent Continuous Translation Models
12 0.5311746 2 emnlp-2013-A Convex Alternative to IBM Model 2
13 0.5283531 14 emnlp-2013-A Synchronous Context Free Grammar for Time Normalization
14 0.52804339 66 emnlp-2013-Dynamic Feature Selection for Dependency Parsing
15 0.52122408 105 emnlp-2013-Improving Web Search Ranking by Incorporating Structured Annotation of Queries
16 0.51059812 122 emnlp-2013-Learning to Freestyle: Hip Hop Challenge-Response Induction via Transduction Rule Segmentation
17 0.50803173 128 emnlp-2013-Max-Violation Perceptron and Forced Decoding for Scalable MT Training
18 0.50109595 52 emnlp-2013-Converting Continuous-Space Language Models into N-Gram Language Models for Statistical Machine Translation
19 0.49989545 88 emnlp-2013-Flexible and Efficient Hypergraph Interactions for Joint Hierarchical and Forest-to-String Decoding
20 0.49156943 40 emnlp-2013-Breaking Out of Local Optima with Count Transforms and Model Recombination: A Study in Grammar Induction