emnlp emnlp2013 emnlp2013-176 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Anil Kumar Nelakanti ; Cedric Archambeau ; Julien Mairal ; Francis Bach ; Guillaume Bouchard
Abstract: Language models can be formalized as log-linear regression models where the input features represent previously observed contexts up to a certain length m. The complexity of existing algorithms to learn the parameters by maximum likelihood scales linearly in nd, where n is the length of the training corpus and d is the number of observed features. We present a model that grows logarithmically in d, making it possible to efficiently leverage longer contexts. We account for the sequential structure of natural language using tree-structured penalized objectives to avoid overfitting and achieve better generalization.
Reference: text
sentIndex sentText sentNum sentScore
1 The complexity of existing algorithms to learn the parameters by maximum likelihood scales linearly in nd, where n is the length of the training corpus and d is the number of observed features. [sent-5, score-0.203]
2 In contrast to language models based on unstructured norms such as ℓ2 (quadratic penalties) or ℓ1 (absolute discounting), we use tree-structured norms (Zhao et al. [sent-22, score-0.158]
3 Structured penalties have been successfully applied to various NLP tasks, including chunking and named entity recognition (Martins et al. [sent-25, score-0.205]
4 Such penalties are particularly well-suited to this problem as they mimic the nested nature of word contexts. [sent-27, score-0.205]
5 In this work, we show that structured tree norms provide an efficient framework for language modelling. [sent-29, score-0.27]
6 For a special case of these tree norms, we obtain a memory-efficient learning algorithm for log-linear language models. [sent-30, score-0.158]
7 Furthermore, we also give the first efficient learning algorithm for structured ℓ∞ tree norms with a complexity nearly linear in the number of training samples. [sent-31, score-0.354]
8 In Section 3, we review unstructured penalties that were proposed earlier. [sent-35, score-0.24]
9 Next, we propose structured penalties and compare their memory and time requirements. [sent-36, score-0.292]
10 8 Figure 1: Example of uncollapsed (trie) and corresponding collapsed (tree) structured vectors and proximal operators applied to them. [sent-41, score-0.884]
11 Subfigure (a) shows the complete trie S and Subfigure (b) shows the corresponding collapsed tree T. [sent-43, score-0.432]
12 Subfigure (c) shows the vector after the proximal projection for the ℓT2-norm (which cannot be collapsed), and Subfigure (d) shows that of the ℓT∞-norm proximal projection, which can be collapsed. [sent-45, score-1.604]
13 The suffix tree data structure encodes all the unique suffixes observed in a sequence up to a maximum given length. [sent-64, score-0.595]
14 The uncollapsed version of the suffix tree T is called a suffix trie, which we denote S. [sent-72, score-0.782]
15 A suffix trie also has a tree structure, but it potentially has a much larger number of nodes. [sent-73, score-0.64]
16 An example of a suffix trie S and the associated suffix tree T are shown in Figures 1(a) and 1(b) respectively. [sent-74, score-0.927]
17 In the case of text, it has been observed that the number of nodes in the tree grows slower than that of the trie with the length of the sequence (Wood et al. [sent-78, score-0.496]
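To make the data structure concrete, here is a minimal Python sketch (not the authors' implementation; all names are hypothetical) that builds a depth-bounded suffix trie over a token sequence and counts its nodes. Collapsing chains of non-branching nodes in this trie would give the corresponding path-compressed suffix tree T discussed above.

```python
def build_suffix_trie(tokens, max_depth):
    """Insert every suffix of `tokens`, truncated to `max_depth` symbols,
    into a nested-dict trie (one dict per node). Returns (root, node_count)."""
    root = {}
    node_count = 0
    n = len(tokens)
    for start in range(n):
        node = root
        for offset in range(min(max_depth, n - start)):
            sym = tokens[start + offset]
            if sym not in node:
                node[sym] = {}
                node_count += 1
            node = node[sym]
    return root, node_count

if __name__ == "__main__":
    corpus = "a b r a c a d a b r a".split()
    trie, n_nodes = build_suffix_trie(corpus, max_depth=4)
    print("suffix trie nodes:", n_nodes)
```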
18 Naively, the feature vector φm(x) corresponds to one path of length m starting at the root of the suffix trie S. [sent-84, score-0.546]
19 We thus have a trie structure S on W (see Figure 1(a)) constraining the number of free parameters. [sent-86, score-0.23]
20 In other words, there is one weight parameter per node in the trie S and the matrix of parameters W is of size |S|. [sent-87, score-0.318]
21 the number of parameters is equal to the size of the suffix tree T, which has much fewer nodes than S. [sent-89, score-0.514]
22 This is achieved by ensuring that all parameters corresponding to suffixes at a node share the same parameter value (see Figure 1(b)). [sent-90, score-0.212]
23 These parameters correspond to paths in the suffix trie that do not branch (i.e., non-branching paths). [sent-91, score-0.571]
24 3 Proximal gradient algorithm The objective function (2) involves a smooth convex loss f and a possibly non-smooth penalty Ω. [sent-95, score-0.196]
25 Instead, we choose proximal methods (Nesterov, 2007), which have fast convergence rates and can deal with a large number of penalties Ω, see (Bach et al. [sent-97, score-0.886]
26 Proximal methods iteratively update the current estimate by making a generalized gradient update at each iteration. [sent-99, score-0.245]
27 Equation (3) can be rewritten to be solved in two independent steps: a gradient update from the smooth part followed by a projection depending only on the non-smooth penalty: W0 = W − (1/L)∇f(W), Wt+1 = argmin_W ½‖W − W0‖² + (1/L)Ω(W). [sent-105, score-0.291]
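A minimal sketch of this two-step update, assuming a gradient function `grad_f` for the smooth loss, a Lipschitz constant `L`, and a proximal operator `prox_omega(w, step)` for the penalty; these names are placeholders, not the paper's code.

```python
import numpy as np

def proximal_gradient(w_init, grad_f, prox_omega, L, n_iters=100):
    """Generic proximal gradient iteration: a gradient step on the smooth
    loss f followed by the proximal (generalized projection) step for the
    non-smooth penalty Omega, with step size 1/L."""
    w = np.array(w_init, dtype=float)
    for _ in range(n_iters):
        w_grad = w - (1.0 / L) * grad_f(w)   # gradient update (smooth part)
        w = prox_omega(w_grad, 1.0 / L)      # proximal step (non-smooth part)
    return w
```

The accelerated and stochastic variants discussed next keep the same two steps but take the gradient at a momentum combination of iterates and estimate it on mini-batches.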
28 4 Stochastic proximal gradient algorithm In language modelling applications, the number of training samples n is typically in the range of 10^5 or larger. [sent-121, score-0.811]
29 A stochastic version of the proximal methods (Hu et al. [sent-122, score-0.681]
30 At every update, the stochastic algorithm estimates the gradient on a mini-batch, that is, a subset of the samples. [sent-124, score-0.159]
31 The acceleration is obtained by making the gradient update at a specific weighted combination of the current and the previous estimates of the parameters. [sent-128, score-0.212]
32 When adopting a log-linear language model, the negative examples are associated with a small negative gradient step in (4), so that the solution is not sparse across multiple categories in general. [sent-133, score-0.193]
33 , the set of feasible solutions K is the positive orthant), the projection step in Algorithm 1 can be done with the same complexity, while maintaining sparse parameters across multiple categories. [sent-136, score-0.273]
3 Unstructured penalties Standard choices for the penalty function Ω(W) include the ℓ1-norm and the squared ℓ2-norm. [sent-139, score-0.339]
In particular, the squared ℓ2 and ℓ1 penalties were used in the context of log-linear language models (Chen and Rosenfeld, 2000; Goodman, 2004), reporting performances competitive with bi-gram and tri-gram interpolated Kneser-Ney smoothing. [sent-141, score-0.273]
1 Proximal step on the suffix trie For squared ℓ2 penalties, the proximal step Πℓ2²(wt, 2κ) is the element-wise rescaling operation: wi(t+1) ← wi(t) (1 + κ)^(−1) (6). For ℓ1 penalties, the proximal step Πℓ1(wt, κ) is the soft-thresholding operator: wi(t+1) ← max(0, wi(t) − κ) (7). [sent-143, score-2.067]
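A sketch of these two element-wise operators in Python (the ℓ1 case below uses the standard signed soft-threshold; the extract writes the non-negative form max(0, wi − κ), which coincides with it for non-negative weights):

```python
import numpy as np

def prox_squared_l2(w, kappa):
    """Proximal step for the squared l2 penalty, update (6):
    element-wise rescaling w_i <- w_i / (1 + kappa)."""
    return w / (1.0 + kappa)

def prox_l1(w, kappa):
    """Proximal step for the l1 penalty, update (7): soft-thresholding."""
    return np.sign(w) * np.maximum(np.abs(w) - kappa, 0.0)
```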
37 2 Proximal step on the suffix tree When feature values are identical, the corresponding proximal (and gradient) steps are identical. [sent-146, score-1.159]
38 This can be seen from the proximal steps (7) and (6), which apply to single weight entries. [sent-147, score-0.681]
39 Hence, we can collapse successive nodes that always have the same values in a suffix trie (as in Figure 1(b)), that is to say we can directly work on the suffix tree. [sent-149, score-0.912]
40 This leads to a proximal step with complexity that scales linearly with the number of symbols seen in the corpus (Ukkonen, 1995) and logarithmically with context length. [sent-150, score-0.934]
4 Structured penalties The ℓ1 and squared ℓ2 penalties do not account for the sequential dependencies in the data, treating suffixes of different lengths equally. [sent-151, score-0.57]
Hence, we propose to use the tree-structured norms. Algorithm 2 w := ΠℓT2(w, κ): Proximal projection step for ℓT2 on grouping G. [sent-154, score-0.217]
1 Input: T suffix tree, w trie-structured vector, κ, grouping G. 2 Initialize: {γi} = 0, {ηi} = 1. 3 η = UpwardPass(η, γ, κ, w). [sent-155, score-0.343]
wx denotes the weights corresponding to suffix x in the weight vector w, and children(x) returns all the immediate children of suffix x in the tree. [sent-157, score-0.62]
45 , 2011), which are based on the suffix trie or tree, where subtrees correspond to contexts of increasing lengths. [sent-160, score-0.616]
46 Group g(w, j) is the subvector of w associated with the subtree rooted at the node j of the suffix trie S(x). [sent-165, score-0.551]
Let G denote the ordered set of nodes of the tree T(x) such that for r < s, g(w, r) ∩ g(w, s) = ∅ or g(w, r) ⊂ g(w, s). [sent-167, score-0.173]
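Written out, the resulting tree-structured penalty is (a sketch of the usual form of such norms; the paper may additionally weight each group):

```latex
\Omega_{\ell^T_p}(w) \;=\; \sum_{j \in \mathcal{G}} \bigl\| g(w, j) \bigr\|_p,
\qquad p \in \{2, \infty\},
```

where each group g(w, j) is the subvector on the subtree rooted at node j, and any two groups are either disjoint or nested, as stated above.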
Algorithm 3 w := ΠℓT∞(w, κ): Proximal projection step for ℓT∞ on grouping G. [sent-172, score-0.217]
3 Proximal step Naively employing the projection on the ℓ2-ball described above leads to an O(d²) algorithm for the ℓT2 proximal step. [sent-180, score-0.913]
50 This could be improved to a linear algorithm by aggregating all necessary scaling factors while making an upward pass of the trie S and applying them in a single downward pass as described in (Jenatton et al. [sent-181, score-0.295]
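As an illustration, the following sketch applies the naive composition of group-wise (block) soft-thresholds bottom-up over the subtree groups; this is the quadratic-in-the-worst-case version mentioned above, whereas the linear-time variant of Jenatton et al. (2011) aggregates the scaling factors in one upward pass and applies them in one downward pass. The argument names are hypothetical.

```python
import numpy as np

def prox_tree_l2(w, kappa, subtree_index_sets_bottom_up):
    """Naive proximal step for the tree-structured l2 penalty: apply the
    block soft-threshold to each subtree group, children before parents
    (bottom-up). Each element of `subtree_index_sets_bottom_up` is an
    integer array with the coordinates of one subtree."""
    w = np.array(w, dtype=float)
    for idx in subtree_index_sets_bottom_up:
        group = w[idx]
        norm = np.linalg.norm(group)
        scale = max(0.0, 1.0 - kappa / norm) if norm > 0.0 else 0.0
        w[idx] = scale * group
    return w
```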
The complexity of the ℓT∞-norm proximal step depends directly on that of the pivot-finding algorithm used within its ℓ1-projection method. [sent-184, score-0.898]
52 Pivot finding can be improved by randomly choosing candidates for the pivot and the best known algorithm due to (Bruckner, 1984) has amortized linear time complexity in the size of the vector. [sent-186, score-0.177]
This leaves us with O(d²) complexity for the ℓT∞-norm proximal step. [sent-187, score-0.73]
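For reference, the building block behind the ℓT∞ proximal step is the Euclidean projection onto an ℓ1-ball; by Moreau decomposition, the proximal operator of κ‖·‖∞ is the residual of that projection. The sketch below uses the simple sort-based rule for the threshold (the pivot can equally be found by random selection or, as later in the text, with a heap); it is a standard construction, not the paper's exact routine.

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of v onto {u : ||u||_1 <= radius}."""
    if radius <= 0.0:
        return np.zeros_like(v)
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]                       # magnitudes, descending
    css = np.cumsum(u)
    k = np.nonzero(u * np.arange(1, len(u) + 1) > css - radius)[0][-1]
    tau = (css[k] - radius) / (k + 1.0)                # threshold set by the pivot
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_linf(v, kappa):
    """prox of kappa * ||.||_inf via Moreau: v minus its l1-ball projection."""
    return v - project_l1_ball(v, kappa)
```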
, 2008) proposes a method that scales linearly with the number s of non-zero entries in the gradient update, but logarithmically in d. [sent-189, score-0.298]
But recursive calls to the ℓ1-projection over subtrees violate the sparsity assumption (with s ≈ d), making the proximal step quadratic. [sent-190, score-0.758]
We next explain how the number of ℓ1-projections can be reduced by switching to the tree T instead of the trie S, which is possible due to the good properties of the ℓT∞-norm. [sent-192, score-0.353]
4 ℓT∞-norm with suffix trees We consider the case where all parameters are initialized with the same value for the optimization procedure, typically with zeros. [sent-195, score-0.373]
58 The condition that the parameters at any given node continue to share the same value requires that both the gradient update (4) and proximal step (5) have this property. [sent-196, score-1.011]
59 We modify the tree structure to ensure that after gradient updates parameters at a given node continue to share a single value. [sent-197, score-0.306]
60 Nodes that do not share a value after gradient update are split into multiple nodes where each node has a single value. [sent-198, score-0.286]
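A small sketch of that splitting step, assuming each collapsed node stores the list of per-position parameter values along its non-branching path (a hypothetical representation, not the paper's data layout):

```python
def split_into_constant_runs(values):
    """Split a collapsed node's parameter values into maximal runs of equal
    value; each run becomes its own node, so every node again holds a
    single shared parameter value."""
    if not values:
        return []
    runs, start = [], 0
    for i in range(1, len(values)):
        if values[i] != values[i - 1]:
            runs.append(values[start:i])
            start = i
    runs.append(values[start:])
    return runs

# e.g. split_into_constant_runs([0.3, 0.3, 0.1, 0.1, 0.1]) -> [[0.3, 0.3], [0.1, 0.1, 0.1]]
```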
61 A constant value non-branching path is a set of nodes P ∈ P(T, w) of a tree structure T w. [sent-200, score-0.284]
62 The nodes of Figure 1(b) correspond to constant value non-branching paths when the values for all parameters at each of the nodes are the same. [sent-207, score-0.264]
Next we show that this tree structure is retained after proximal steps of the ℓT∞-norm. [sent-208, score-0.804]
Constant value non-branching paths P(T, w) of a T-structured vector w are preserved under the proximal projection step ΠℓT∞(w, κ). [sent-210, score-0.913]
Figure 1(d) illustrates this idea, showing the ℓT∞ projection applied to the collapsed tree. [sent-211, score-0.2]
This makes it memory efficient, but the time required for the proximal step remains the same, since we must project each subtree of S onto the ℓ1-ball. [sent-212, score-0.798]
67 The sequence of projections at nodes of S in a non-branching path can be rewritten into a single projection step using the following technique bringing the number of projections from |S| to |T|. [sent-213, score-0.384]
Successive projection steps for subtrees with root in a constant value non-branching path P = {g1, · · · , g|P|} ∈ P(T, w) for ΠℓT∞(w, κ) compose as πg|P| ◦ · · · ◦ πg1(w, κ), applied in the bottom-up order defined by G. [sent-215, score-0.269]
A single projection step with κ scaled by the number of projections |P| gives πg|P|(w, κ|P|) ≡ πg|P| ◦ · · · ◦ πg1(w, κ). [sent-217, score-0.219]
The above propositions show that the ℓT∞-norm can be used with the suffix tree with fewer projection steps. [sent-218, score-0.531]
5 Fast proximal step for ℓT∞-norm Let k be the cardinality of the set of values larger than the pivot in a vector to compute the threshold for the ℓ1-projection, as referred to in (11). [sent-221, score-0.842]
Given the heap of the entries, the cost of finding the pivot is O(k log(d)) if the pivot is the k-th largest entry and there are d features. [sent-226, score-0.407]
73 The heap itself is built on the fly during this upward pass. [sent-228, score-0.217]
At each subtree, the heap is built by merging those of its children in constant time by using Fibonacci heaps. [sent-229, score-0.283]
This leaves us with an O(dk log(d)) complexity for the proximal step. [sent-230, score-0.73]
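A sketch of this heap-driven threshold search for one ℓ1-projection (Python's binary heap stands in for the Fibonacci heaps used in Algorithm 4, so the heap construction here is not constant-time melding; only the scan in decreasing order is the point being illustrated):

```python
import heapq

def l1_projection_threshold(values, radius):
    """Find the soft-threshold tau for projecting `values` onto the l1-ball
    of the given radius by popping entries in decreasing magnitude from a
    max-heap and stopping at the first entry that fails the pivot test.
    Only the k entries above the pivot are popped, costing O(k log d)."""
    heap = [-abs(v) for v in values]
    heapq.heapify(heap)
    cumulative, k, tau = 0.0, 0, 0.0
    while heap:
        v = -heapq.heappop(heap)                       # next largest magnitude
        if cumulative + v - radius < v * (k + 1):      # pivot test still holds
            cumulative += v
            k += 1
            tau = (cumulative - radius) / k
        else:
            break
    return max(tau, 0.0)
```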
76 The suffix tree representation leads to an efficient memory usage. [sent-235, score-0.523]
Furthermore, to make the training algorithm time efficient, the parameters corresponding to contexts which always occur in the same larger… Algorithm 4 w := ΠℓT∞(w, κ): Proximal projection step for ℓT∞ on grouping G using the heap data structure. [sent-236, score-0.526]
Tuples are ordered by decreasing value of v, and Hj refers to the heap with values in the sub-tree rooted at j. [sent-241, score-0.247]
79 OrderedIterator returns values from the heap in decreasing order of v. [sent-243, score-0.215]
80 We will illustrate in the experiments that these penalties do not lead to good predictive performances. [sent-245, score-0.205]
But it can only be applied to the uncollapsed trie, since the constant value non-branching paths are not closed under its proximal step, making it less amenable to larger tree depths. [sent-249, score-1.197]
Memory efficiency is measured by the number of free parameters; note that the suffix tree is much smaller than the trie (uncollapsed). Time complexities reported are those of one proximal projection step. [sent-259, score-1.442]
(b) plot compares model complexity, measured by the number of parameters, for the weighted structured penalties wℓT2 and wℓT∞ with 2-gram through 12-gram features. [sent-263, score-0.362]
This means that the proximal step can be applied directly to the suffix tree. [sent-264, score-1.008]
(b) Iteration time (sec) of random-pivoting and k-best heap on the collapsed tree, plotted against training size × 10^6. [sent-283, score-0.266]
Figure 3: Comparison of different methods for performing the ℓT∞ proximal projection. [sent-284, score-0.681]
The k-best heap is the method described in Algorithm 4. [sent-286, score-0.187]
To investigate this further, we adjust the penalties by choosing an exponential decrease of weights varying as α^m for a feature at depth m in the suffix tree. [sent-291, score-0.528]
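In formula form this amounts to depth-dependent weights in the penalty, roughly (a sketch; the extract does not spell out exactly where the weight enters, so the per-group placement below is an assumption):

```latex
\Omega_{w\ell^T_p}(w) \;=\; \sum_{j \in \mathcal{G}} \alpha^{\operatorname{depth}(j)} \,\bigl\| g(w, j) \bigr\|_p .
```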
89 However, the introduction of weighted features prevents us from using the suffix tree representation, making these models inefficient in terms of memory. [sent-302, score-0.452]
90 Figure 2(c) compares model complexity measured by the number of parameters for weighted models using structured penalties. [sent-306, score-0.211]
However, the number of parameters for the wℓT∞ penalty grows logarithmically with the model order. [sent-309, score-0.208]
92 This is due to the fact that it operates on the suffix tree-structured vectors instead of the suffix trie-structured vectors. [sent-310, score-0.574]
Next, we compare the average time taken per iteration for different implementations of the ℓT∞ proximal step. [sent-312, score-0.681]
Figure 3(a) shows this time against increasing depth of the language model order for the random pivoting method, with and without collapsing the parameters at constant value non-branching paths. [sent-313, score-0.289]
95 This shows that the complexity of the full proximal step is sublinear when accounting for the suffix tree data structure. [sent-315, score-1.18]
Figure 3(b) plots the time per iteration of random pivoting and the k-best heap against varying training sequence size. [sent-316, score-0.241]
The two algorithms operate directly on the suffix tree. [sent-317, score-0.287]
We also developed a proximal projection algorithm for the tree-structured ℓT∞-norm suitable for large trees. [sent-321, score-0.871]
From Ukkonen to McCreight and Weiner: A unifying view of linear-time suffix tree construction. [sent-415, score-0.444]
100 The composite absolute penalties family for grouped and hierarchical variable selection. [sent-574, score-0.205]
wordName wordTfidf (topN-words)
[('proximal', 0.681), ('suffix', 0.287), ('trie', 0.23), ('penalties', 0.205), ('heap', 0.187), ('tree', 0.123), ('projection', 0.121), ('gradient', 0.095), ('pivot', 0.093), ('suffixes', 0.092), ('uncollapsed', 0.085), ('norms', 0.079), ('collapsed', 0.079), ('update', 0.075), ('jenatton', 0.074), ('subfigure', 0.068), ('squared', 0.068), ('hj', 0.068), ('penalty', 0.066), ('vj', 0.065), ('wood', 0.063), ('logarithmically', 0.059), ('projections', 0.058), ('grouping', 0.056), ('pivoting', 0.054), ('parameters', 0.054), ('archambeau', 0.051), ('depths', 0.051), ('kennington', 0.051), ('mairal', 0.051), ('constant', 0.05), ('nodes', 0.05), ('complexity', 0.049), ('memory', 0.048), ('hx', 0.047), ('children', 0.046), ('perplexity', 0.045), ('mnih', 0.045), ('heaps', 0.044), ('postorder', 0.044), ('regularization', 0.042), ('weighted', 0.042), ('unweighted', 0.041), ('rosenfeld', 0.041), ('step', 0.04), ('structured', 0.039), ('cx', 0.038), ('bach', 0.038), ('subtrees', 0.037), ('depth', 0.036), ('observed', 0.036), ('leads', 0.036), ('algorithm', 0.035), ('unstructured', 0.035), ('linearly', 0.035), ('entries', 0.034), ('quadratic', 0.034), ('symbols', 0.034), ('bruckner', 0.034), ('cormen', 0.034), ('depthfirstnodetraversal', 0.034), ('depthfirstsuffixtraversal', 0.034), ('downwardpass', 0.034), ('gasthaus', 0.034), ('giegerich', 0.034), ('lipschitz', 0.034), ('mccullagh', 0.034), ('memoizer', 0.034), ('nonbranching', 0.034), ('orderediterator', 0.034), ('paramupdate', 0.034), ('treestructured', 0.034), ('ukkonen', 0.034), ('upwardpass', 0.034), ('vargas', 0.034), ('node', 0.034), ('contexts', 0.033), ('penalized', 0.032), ('naively', 0.032), ('value', 0.032), ('procedure', 0.03), ('successive', 0.03), ('upward', 0.03), ('accelerated', 0.03), ('accross', 0.03), ('kneser', 0.03), ('grenoble', 0.03), ('increasing', 0.029), ('maximum', 0.029), ('grows', 0.029), ('stochastic', 0.029), ('efficient', 0.029), ('path', 0.029), ('values', 0.028), ('yi', 0.028), ('sparse', 0.028), ('sequence', 0.028), ('wt', 0.028), ('compares', 0.027), ('initialize', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 176 emnlp-2013-Structured Penalties for Log-Linear Language Models
Author: Anil Kumar Nelakanti ; Cedric Archambeau ; Julien Mairal ; Francis Bach ; Guillaume Bouchard
Abstract: Language models can be formalized as log-linear regression models where the input features represent previously observed contexts up to a certain length m. The complexity of existing algorithms to learn the parameters by maximum likelihood scales linearly in nd, where n is the length of the training corpus and d is the number of observed features. We present a model that grows logarithmically in d, making it possible to efficiently leverage longer contexts. We account for the sequential structure of natural language using tree-structured penalized objectives to avoid overfitting and achieve better generalization.
2 0.12899132 20 emnlp-2013-An Efficient Language Model Using Double-Array Structures
Author: Makoto Yasuhara ; Toru Tanaka ; Jun-ya Norimatsu ; Mikio Yamamoto
Abstract: Ngram language models tend to increase in size with inflating the corpus size, and consume considerable resources. In this paper, we propose an efficient method for implementing ngram models based on doublearray structures. First, we propose a method for representing backwards suffix trees using double-array structures and demonstrate its efficiency. Next, we propose two optimization methods for improving the efficiency of data representation in the double-array structures. Embedding probabilities into unused spaces in double-array structures reduces the model size. Moreover, tuning the word IDs in the language model makes the model smaller and faster. We also show that our method can be used for building large language models using the division method. Lastly, we show that our method outperforms methods based on recent related works from the viewpoints of model size and query speed when both optimization methods are used.
3 0.10967216 83 emnlp-2013-Exploring the Utility of Joint Morphological and Syntactic Learning from Child-directed Speech
Author: Stella Frank ; Frank Keller ; Sharon Goldwater
Abstract: Children learn various levels of linguistic structure concurrently, yet most existing models of language acquisition deal with only a single level of structure, implicitly assuming a sequential learning process. Developing models that learn multiple levels simultaneously can provide important insights into how these levels might interact synergistically during learning. Here, we present a model that jointly induces syntactic categories and morphological segmentations by combining two well-known models for the individual tasks. We test on child-directed utterances in English and Spanish and compare to single-task baselines. In the morphologically poorer language (English), the model improves morphological segmentation, while in the morphologically richer language (Spanish), it leads to better syntactic categorization. These results provide further evidence that joint learning is useful, but also suggest that the benefits may be different for typologically different languages.
4 0.065814458 103 emnlp-2013-Improving Pivot-Based Statistical Machine Translation Using Random Walk
Author: Xiaoning Zhu ; Zhongjun He ; Hua Wu ; Haifeng Wang ; Conghui Zhu ; Tiejun Zhao
Abstract: This paper proposes a novel approach that utilizes a machine learning method to improve pivot-based statistical machine translation (SMT). For language pairs with few bilingual data, a possible solution in pivot-based SMT using another language as a
5 0.056977067 58 emnlp-2013-Dependency Language Models for Sentence Completion
Author: Joseph Gubbins ; Andreas Vlachos
Abstract: Sentence completion is a challenging semantic modeling task in which models must choose the most appropriate word from a given set to complete a sentence. Although a variety of language models have been applied to this task in previous work, none of the existing approaches incorporate syntactic information. In this paper we propose to tackle this task using a pair of simple language models in which the probability of a sentence is estimated as the probability of the lexicalisation of a given syntactic dependency tree. We apply our approach to the Microsoft Research Sentence Completion Challenge and show that it improves on n-gram language models by 8.7 percentage points, achieving the highest accuracy reported to date apart from neural language models that are more complex and expensive to train.
6 0.055781808 100 emnlp-2013-Improvements to the Bayesian Topic N-Gram Models
7 0.053913184 10 emnlp-2013-A Multi-Teraflop Constituency Parser using GPUs
8 0.052462131 159 emnlp-2013-Regularized Minimum Error Rate Training
9 0.052444179 2 emnlp-2013-A Convex Alternative to IBM Model 2
10 0.048956804 85 emnlp-2013-Fast Joint Compression and Summarization via Graph Cuts
11 0.048711803 107 emnlp-2013-Interactive Machine Translation using Hierarchical Translation Models
12 0.046599951 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization
13 0.045337781 66 emnlp-2013-Dynamic Feature Selection for Dependency Parsing
14 0.045180075 187 emnlp-2013-Translation with Source Constituency and Dependency Trees
15 0.044494238 134 emnlp-2013-Modeling and Learning Semantic Co-Compositionality through Prototype Projections and Neural Networks
16 0.044258654 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging
17 0.043583311 174 emnlp-2013-Single-Document Summarization as a Tree Knapsack Problem
18 0.041692685 17 emnlp-2013-A Walk-Based Semantically Enriched Tree Kernel Over Distributed Word Representations
19 0.040306095 175 emnlp-2013-Source-Side Classifier Preordering for Machine Translation
20 0.039486945 9 emnlp-2013-A Log-Linear Model for Unsupervised Text Normalization
topicId topicWeight
[(0, -0.144), (1, -0.047), (2, -0.009), (3, 0.003), (4, -0.072), (5, 0.015), (6, 0.046), (7, -0.03), (8, -0.03), (9, 0.004), (10, -0.009), (11, -0.003), (12, -0.09), (13, 0.021), (14, 0.057), (15, -0.065), (16, 0.026), (17, 0.012), (18, 0.035), (19, 0.032), (20, 0.058), (21, -0.083), (22, 0.007), (23, 0.122), (24, -0.108), (25, 0.048), (26, -0.062), (27, -0.045), (28, 0.028), (29, -0.095), (30, 0.058), (31, -0.123), (32, -0.019), (33, -0.159), (34, -0.057), (35, -0.059), (36, -0.003), (37, 0.032), (38, -0.242), (39, 0.354), (40, 0.1), (41, -0.043), (42, -0.185), (43, -0.081), (44, 0.151), (45, 0.025), (46, -0.048), (47, 0.062), (48, 0.04), (49, -0.083)]
simIndex simValue paperId paperTitle
same-paper 1 0.94650525 176 emnlp-2013-Structured Penalties for Log-Linear Language Models
Author: Anil Kumar Nelakanti ; Cedric Archambeau ; Julien Mairal ; Francis Bach ; Guillaume Bouchard
Abstract: Language models can be formalized as log-linear regression models where the input features represent previously observed contexts up to a certain length m. The complexity of existing algorithms to learn the parameters by maximum likelihood scales linearly in nd, where n is the length of the training corpus and d is the number of observed features. We present a model that grows logarithmically in d, making it possible to efficiently leverage longer contexts. We account for the sequential structure of natural language using tree-structured penalized objectives to avoid overfitting and achieve better generalization.
2 0.88175207 20 emnlp-2013-An Efficient Language Model Using Double-Array Structures
Author: Makoto Yasuhara ; Toru Tanaka ; Jun-ya Norimatsu ; Mikio Yamamoto
Abstract: Ngram language models tend to increase in size with inflating the corpus size, and consume considerable resources. In this paper, we propose an efficient method for implementing ngram models based on doublearray structures. First, we propose a method for representing backwards suffix trees using double-array structures and demonstrate its efficiency. Next, we propose two optimization methods for improving the efficiency of data representation in the double-array structures. Embedding probabilities into unused spaces in double-array structures reduces the model size. Moreover, tuning the word IDs in the language model makes the model smaller and faster. We also show that our method can be used for building large language models using the division method. Lastly, we show that our method outperforms methods based on recent related works from the viewpoints of model size and query speed when both optimization methods are used.
3 0.38107428 58 emnlp-2013-Dependency Language Models for Sentence Completion
Author: Joseph Gubbins ; Andreas Vlachos
Abstract: Sentence completion is a challenging semantic modeling task in which models must choose the most appropriate word from a given set to complete a sentence. Although a variety of language models have been applied to this task in previous work, none of the existing approaches incorporate syntactic information. In this paper we propose to tackle this task using a pair of simple language models in which the probability of a sentence is estimated as the probability of the lexicalisation of a given syntactic dependency tree. We apply our approach to the Microsoft Research Sentence Completion Challenge and show that it improves on n-gram language models by 8.7 percentage points, achieving the highest accuracy reported to date apart from neural language models that are more complex and expensive to train.
4 0.3484661 10 emnlp-2013-A Multi-Teraflop Constituency Parser using GPUs
Author: John Canny ; David Hall ; Dan Klein
Abstract: Constituency parsing with rich grammars remains a computational challenge. Graphics Processing Units (GPUs) have previously been used to accelerate CKY chart evaluation, but gains over CPU parsers were modest. In this paper, we describe a collection of new techniques that enable chart evaluation at close to the GPU’s practical maximum speed (a Teraflop), or around a half-trillion rule evaluations per second. Net parser performance on a 4-GPU system is over 1 thousand length30 sentences/second (1 trillion rules/sec), and 400 general sentences/second for the Berkeley Parser Grammar. The techniques we introduce include grammar compilation, recursive symbol blocking, and cache-sharing.
5 0.32423747 83 emnlp-2013-Exploring the Utility of Joint Morphological and Syntactic Learning from Child-directed Speech
Author: Stella Frank ; Frank Keller ; Sharon Goldwater
Abstract: Children learn various levels of linguistic structure concurrently, yet most existing models of language acquisition deal with only a single level of structure, implicitly assuming a sequential learning process. Developing models that learn multiple levels simultaneously can provide important insights into how these levels might interact synergistically during learning. Here, we present a model that jointly induces syntactic categories and morphological segmentations by combining two well-known models for the individual tasks. We test on child-directed utterances in English and Spanish and compare to single-task baselines. In the morphologically poorer language (English), the model improves morphological segmentation, while in the morphologically richer language (Spanish), it leads to better syntactic categorization. These results provide further evidence that joint learning is useful, but also suggest that the benefits may be different for typologically different languages.
6 0.31716445 2 emnlp-2013-A Convex Alternative to IBM Model 2
7 0.31467462 107 emnlp-2013-Interactive Machine Translation using Hierarchical Translation Models
8 0.31066945 100 emnlp-2013-Improvements to the Bayesian Topic N-Gram Models
10 0.29985467 174 emnlp-2013-Single-Document Summarization as a Tree Knapsack Problem
11 0.29732618 195 emnlp-2013-Unsupervised Spectral Learning of WCFG as Low-rank Matrix Completion
12 0.29269502 159 emnlp-2013-Regularized Minimum Error Rate Training
13 0.29163417 86 emnlp-2013-Feature Noising for Log-Linear Structured Prediction
14 0.28605399 172 emnlp-2013-Simple Customization of Recursive Neural Networks for Semantic Relation Classification
15 0.25275552 66 emnlp-2013-Dynamic Feature Selection for Dependency Parsing
16 0.24933638 39 emnlp-2013-Boosting Cross-Language Retrieval by Learning Bilingual Phrase Associations from Relevance Rankings
17 0.2460885 103 emnlp-2013-Improving Pivot-Based Statistical Machine Translation Using Random Walk
18 0.22489752 52 emnlp-2013-Converting Continuous-Space Language Models into N-Gram Language Models for Statistical Machine Translation
19 0.22091772 165 emnlp-2013-Scaling to Large3 Data: An Efficient and Effective Method to Compute Distributional Thesauri
20 0.21105103 17 emnlp-2013-A Walk-Based Semantically Enriched Tree Kernel Over Distributed Word Representations
topicId topicWeight
[(3, 0.016), (9, 0.018), (18, 0.031), (22, 0.038), (30, 0.546), (43, 0.011), (50, 0.015), (51, 0.111), (66, 0.027), (71, 0.015), (75, 0.011), (77, 0.023), (95, 0.016), (96, 0.015)]
simIndex simValue paperId paperTitle
1 0.99493808 20 emnlp-2013-An Efficient Language Model Using Double-Array Structures
Author: Makoto Yasuhara ; Toru Tanaka ; Jun-ya Norimatsu ; Mikio Yamamoto
Abstract: Ngram language models tend to increase in size with inflating the corpus size, and consume considerable resources. In this paper, we propose an efficient method for implementing ngram models based on doublearray structures. First, we propose a method for representing backwards suffix trees using double-array structures and demonstrate its efficiency. Next, we propose two optimization methods for improving the efficiency of data representation in the double-array structures. Embedding probabilities into unused spaces in double-array structures reduces the model size. Moreover, tuning the word IDs in the language model makes the model smaller and faster. We also show that our method can be used for building large language models using the division method. Lastly, we show that our method outperforms methods based on recent related works from the viewpoints of model size and query speed when both optimization methods are used.
2 0.97196704 92 emnlp-2013-Growing Multi-Domain Glossaries from a Few Seeds using Probabilistic Topic Models
Author: Stefano Faralli ; Roberto Navigli
Abstract: In this paper we present a minimally-supervised approach to the multi-domain acquisition of wide-coverage glossaries. We start from a small number of hypernymy relation seeds and bootstrap glossaries from the Web for dozens of domains using Probabilistic Topic Models. Our experiments show that we are able to extract high-precision glossaries comprising thousands of terms and definitions.
3 0.96334934 127 emnlp-2013-Max-Margin Synchronous Grammar Induction for Machine Translation
Author: Xinyan Xiao ; Deyi Xiong
Abstract: Traditional synchronous grammar induction estimates parameters by maximizing likelihood, which only has a loose relation to translation quality. Alternatively, we propose a max-margin estimation approach to discriminatively inducing synchronous grammars for machine translation, which directly optimizes translation quality measured by BLEU. In the max-margin estimation of parameters, we only need to calculate Viterbi translations. This further facilitates the incorporation of various non-local features that are defined on the target side. We test the effectiveness of our max-margin estimation framework on a competitive hierarchical phrase-based system. Experiments show that our max-margin method significantly outperforms the traditional twostep pipeline for synchronous rule extraction by 1.3 BLEU points and is also better than previous max-likelihood estimation method.
same-paper 4 0.96327436 176 emnlp-2013-Structured Penalties for Log-Linear Language Models
Author: Anil Kumar Nelakanti ; Cedric Archambeau ; Julien Mairal ; Francis Bach ; Guillaume Bouchard
Abstract: Language models can be formalized as log-linear regression models where the input features represent previously observed contexts up to a certain length m. The complexity of existing algorithms to learn the parameters by maximum likelihood scales linearly in nd, where n is the length of the training corpus and d is the number of observed features. We present a model that grows logarithmically in d, making it possible to efficiently leverage longer contexts. We account for the sequential structure of natural language using tree-structured penalized objectives to avoid overfitting and achieve better generalization.
5 0.90252781 4 emnlp-2013-A Dataset for Research on Short-Text Conversations
Author: Hao Wang ; Zhengdong Lu ; Hang Li ; Enhong Chen
Abstract: Natural language conversation is widely regarded as a highly difficult problem, which is usually attacked with either rule-based or learning-based models. In this paper we propose a retrieval-based automatic response model for short-text conversation, to exploit the vast amount of short conversation instances available on social media. For this purpose we introduce a dataset of short-text conversation based on the real-world instances from Sina Weibo (a popular Chinese microblog service), which will be soon released to public. This dataset provides rich collection of instances for the research on finding natural and relevant short responses to a given short text, and useful for both training and testing of conversation models. This dataset consists of both naturally formed conversations, manually labeled data, and a large repository of candidate responses. Our preliminary experiments demonstrate that the simple retrieval-based conversation model performs reasonably well when combined with the rich instances in our dataset.
6 0.89324641 163 emnlp-2013-Sarcasm as Contrast between a Positive Sentiment and Negative Situation
7 0.72642016 113 emnlp-2013-Joint Language and Translation Modeling with Recurrent Neural Networks
8 0.7034182 3 emnlp-2013-A Corpus Level MIRA Tuning Strategy for Machine Translation
9 0.6813851 146 emnlp-2013-Optimal Incremental Parsing via Best-First Dynamic Programming
10 0.67708588 172 emnlp-2013-Simple Customization of Recursive Neural Networks for Semantic Relation Classification
11 0.66957474 156 emnlp-2013-Recurrent Continuous Translation Models
12 0.66570103 2 emnlp-2013-A Convex Alternative to IBM Model 2
13 0.66559803 66 emnlp-2013-Dynamic Feature Selection for Dependency Parsing
14 0.65788573 14 emnlp-2013-A Synchronous Context Free Grammar for Time Normalization
15 0.65776676 105 emnlp-2013-Improving Web Search Ranking by Incorporating Structured Annotation of Queries
16 0.64467704 128 emnlp-2013-Max-Violation Perceptron and Forced Decoding for Scalable MT Training
17 0.64113027 52 emnlp-2013-Converting Continuous-Space Language Models into N-Gram Language Models for Statistical Machine Translation
18 0.63870317 122 emnlp-2013-Learning to Freestyle: Hip Hop Challenge-Response Induction via Transduction Rule Segmentation
19 0.63643849 175 emnlp-2013-Source-Side Classifier Preordering for Machine Translation
20 0.63548243 15 emnlp-2013-A Systematic Exploration of Diversity in Machine Translation