emnlp emnlp2013 emnlp2013-20 emnlp2013-20-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Makoto Yasuhara ; Toru Tanaka ; Jun-ya Norimatsu ; Mikio Yamamoto
Abstract: Ngram language models tend to increase in size with inflating the corpus size, and consume considerable resources. In this paper, we propose an efficient method for implementing ngram models based on doublearray structures. First, we propose a method for representing backwards suffix trees using double-array structures and demonstrate its efficiency. Next, we propose two optimization methods for improving the efficiency of data representation in the double-array structures. Embedding probabilities into unused spaces in double-array structures reduces the model size. Moreover, tuning the word IDs in the language model makes the model smaller and faster. We also show that our method can be used for building large language models using the division method. Lastly, we show that our method outperforms methods based on recent related works from the viewpoints of model size and query speed when both optimization methods are used.