emnlp emnlp2011 emnlp2011-5 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Anoop Deoras ; Tomas Mikolov ; Kenneth Church
Abstract: A re-scoring strategy is proposed that makes it feasible to capture more long-distance dependencies in the natural language. Two pass strategies have become popular in a number of recognition tasks such as ASR (automatic speech recognition), MT (machine translation) and OCR (optical character recognition). The first pass typically applies a weak language model (n-grams) to a lattice and the second pass applies a stronger language model to N best lists. The stronger language model is intended to capture more longdistance dependencies. The proposed method uses RNN-LM (recurrent neural network language model), which is a long span LM, to rescore word lattices in the second pass. A hill climbing method (iterative decoding) is proposed to search over islands of confusability in the word lattice. An evaluation based on Broadcast News shows speedups of 20 over basic N best re-scoring, and word error rate reduction of 8% (relative) on a highly competitive setup.
Reference: text
sentIndex sentText sentNum sentScore
1 Two pass strategies have become popular in a number of recognition tasks such as ASR (automatic speech recognition), MT (machine translation) and OCR (optical character recognition). [sent-7, score-0.242]
2 The first pass typically applies a weak language model (n-grams) to a lattice and the second pass applies a stronger language model to N best lists. [sent-8, score-0.552]
3 The proposed method uses RNN-LM (recurrent neural network language model), which is a long span LM, to rescore word lattices in the second pass. [sent-10, score-0.555]
4 A hill climbing method (iterative decoding) is proposed to search over islands of confusability in the word lattice. [sent-11, score-0.294]
5 , 2001), recurrent neural network language models (Mikolov et al. [sent-37, score-0.241]
6 Due to the prohibitive increase in the search space of sentence hypotheses (or of longer word sub-sequences), it becomes challenging to use a long span language model in first pass decoding. [sent-40, score-0.62]
7 A word graph (word lattices for speech recognition systems and hypergraphs for machine translation systems), encoding an exponential number of hypotheses, is hence output by the first pass, and a sophisticated and complex language model is deployed on it for re-scoring. [sent-41, score-0.595]
8 , 2011) how to tackle the problem of incorporating long span information during decoding in speech recognition systems by variationally approximating (Bishop, 2006, pp. [sent-44, score-0.418]
9 462) the long span language model by a tractable substitute such that this substitute model comes closest to the long span model (closest in terms of Kullback Leibler Divergence (Cover and J. [sent-45, score-0.368]
10 The tractable substitute was then used directly in first pass speech recognition systems. [sent-49, score-0.242]
11 In this paper we propose an approach that keeps the model intact but approximates the search space instead (which can become intractable to handle, especially under a long span model), thus enabling the use of the full-blown model for re-scoring. [sent-50, score-0.233]
12 With this approach, we can achieve full lattice re-scoring with a complex model, at a cost more than 20 times less than that of the naive brute force approach that is commonly used today. [sent-51, score-0.352]
13 His work has been followed by Schwenk (2007), who has shown that neural network language models work very well in state-of-the-art speech recognition systems. [sent-64, score-0.288]
14 Due to the requirement of unlimited history, many optimization tricks for rescoring with feedforward-based NNLMs, as presented by Schwenk (2007), cannot be applied during rescoring with an RNN LM. [sent-76, score-0.364]
15 The network has an input layer w, a hidden layer s and an output layer y. [sent-78, score-0.371]
16 The hidden layer s(t) has additional recurrent connections that are delayed by one time step. [sent-84, score-0.215]
17 After the network is trained, the output layer y(t) represents the probability distribution of the current word, given the previous word and the state of the hidden layer from the previous time step. [sent-85, score-0.269]
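To make the recurrent step above concrete, here is a minimal numpy sketch of one time step of such a network. The matrices U (input-to-hidden), W (hidden-to-hidden) and V (hidden-to-output), the sigmoid hidden units and the softmax output follow the usual Elman-style RNN-LM description; all names, shapes and the helper functions are illustrative assumptions, not the authors' code.

import numpy as np

def rnn_lm_step(word_index, s_prev, U, W, V):
    # Input layer w(t) is a 1-of-N vector, so U @ w(t) reduces to a column of U.
    # Hidden layer s(t) mixes the current word with the delayed hidden state s(t-1)
    # through the recurrent weights W and a sigmoid non-linearity.
    s_t = 1.0 / (1.0 + np.exp(-(U[:, word_index] + W @ s_prev)))
    # Output layer y(t): softmax over the vocabulary, i.e. P(next word | word, s(t-1)).
    z = V @ s_t
    y_t = np.exp(z - z.max())
    return s_t, y_t / y_t.sum()

def sentence_logprob(word_indices, U, W, V):
    # The hidden state carries the unbounded history, which is what rules out the
    # n-gram style caching tricks used for feedforward NNLM rescoring.
    s = np.zeros(W.shape[0])
    logp = 0.0
    for prev, cur in zip(word_indices[:-1], word_indices[1:]):
        s, y = rnn_lm_step(prev, s, U, W, V)
        logp += np.log(y[cur])
    return logp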
18 A path, π, in a lattice is an element of E∗ with consecutive transitions. [sent-97, score-0.352]
19 We will denote the origin / previous state of this path by p[π] and the destination / next state of this path by n[π]. [sent-98, score-0.29]
20 A path, π, is called a complete path if p[π] = ns and n[π] ∈ Ne. [sent-99, score-0.222]
21 A path, π, is called a partial path if p[π] = ns but n[π] may or may not belong to Ne. [sent-100, score-0.222]
22 A path, π, is called a trailing path if p[π] may or may not be equal to ns and n[π] ∈ Ne. [sent-101, score-0.222]
23 We will also denote the time stamp at the start of the path by Ts[π] and the time stamp at the end of the path by Te[π]. [sent-102, score-0.435]
24 The acoustic likelihood of the path π ∈ E∗ is then given as: A[π] = ∏_{j=1}^{|π|} P(aj | w[πj]). [sent-108, score-0.224]
25 The speech recognizer, which uses an m-th order Markov LM for first pass recognition, imposes a constraint on the word lattice such that at each state there exists an unambiguous context of m − 1 consecutive words. [sent-124, score-0.52]
26 A first pass output is then a path π∗ having Maximum a Posteriori (MAP) probability. [sent-125, score-0.227]
27 Note that the asterisk symbol here connotes that the path is optimal; it is obtained as: π∗ = arg max_{π : p[π] = ns, n[π] ∈ Ne} A[π]^γ L[π], where γ is the scaling parameter needed to balance the dynamic variability between the distributions of the acoustic and language model (Ogawa et al. [sent-128, score-0.224]
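As a worked illustration of this first pass criterion, the following sketch finds arg max over complete paths of A[π]^γ L[π] with a Viterbi pass in log space over the DAG lattice. The lattice interface (topological_order, outgoing, and the edge fields acoustic, lm, word) is an assumed one, not any particular toolkit's API.

import math

def first_pass_map(lattice, gamma):
    # best[v] = (log score, word sequence) of the best partial path from ns to v
    best = {lattice.start: (0.0, [])}
    for u in lattice.topological_order():      # start node first
        if u not in best:
            continue
        score_u, words_u = best[u]
        for e in lattice.outgoing(u):
            # log(A[pi]^gamma * L[pi]) accumulated edge by edge
            s = score_u + gamma * math.log(e.acoustic) + math.log(e.lm)
            if e.head not in best or s > best[e.head][0]:
                best[e.head] = (s, words_u + [e.word])
    # the MAP complete path must end in one of the end nodes Ne
    end = max(lattice.end_nodes, key=lambda v: best.get(v, (float("-inf"), []))[0])
    return best[end][1], best[end][0]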
28 Under a new n-gram Language Model, rescoring involves replacing the existing language model scores of all paths π. [sent-132, score-0.252]
29 If the new rescoring n-gram LM needs a bigger context than the Markov context of m − 1 words for the task of prediction (i. [sent-139, score-0.227]
30 It should be noted that if the rescoring LM needs a context of the entire past in order to predict the next word, then the lattice has to be expanded by splitting the states many more times. [sent-143, score-0.534]
31 When the task is to do rescoring under a long span LM, such as an RNN-LM, the exact lattice re-scoring option is not feasible. [sent-147, score-0.718]
32 In order to tackle this problem, a suboptimal approach via N best list rescoring is utilized. [sent-148, score-0.265]
33 The N best paths are ordered such that A[π1]^γ L[π1] ≥ A[π2]^γ L[π2] ≥ · · · ≥ A[πN]^γ L[πN]. Efficient algorithms exist for extracting N best paths from word lattices (Chow and Schwartz, 1989; Mohri and Riley, 2002). [sent-160, score-0.349]
34 If we denote the new LM scores by Lnew[·], then under the N best list paradigm, the optimal path π˜ is found out such that: π˜ = arg max_{π ∈ {π1, . . . , πN}} A[π]^η Lnew[π] (2) [sent-162, score-0.307]
35 If N ≪ |L| (where |L| is the total number of complete paths in the word lattice, which are exponentially many), then the path obtained using (2) is not guaranteed to be optimal (under the rescoring model). [sent-167, score-0.435]
36 The short list of hypotheses so used for re-scoring would yield suboptimal output if the best path π∗ (according to the new model) is not present among the top N candidates extracted from the lattice. [sent-168, score-0.338]
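Expressed as code, N best list rescoring under Eq. (2) is just an arg max over the fixed candidate list; the hypothesis tuples and the new_lm.logprob interface below are illustrative assumptions (the re-scoring model could, for example, be the RNN-LM sketched earlier).

def rescore_nbest(nbest, new_lm, eta):
    # nbest: list of (word_string, log_acoustic) pairs extracted from the lattice.
    best_hyp, best_score = None, float("-inf")
    for words, log_acoustic in nbest:
        # log of A[pi]^eta * Lnew[pi] from Eq. (2)
        score = eta * log_acoustic + new_lm.logprob(words)
        if score > best_score:
            best_hyp, best_score = words, score
    return best_hyp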
37 , 2010) gram LM to produce lattices and hence N best lists. [sent-172, score-0.279]
38 , N}) and the new rank which the hypothesis got under the rescoring model). [sent-181, score-0.23]
39 With a bigger and better LM, the WER decreases at the expense of huge re-rankings of the N best lists, suggesting that N best lists generated under a weaker model are not reflective enough of a relatively better model. [sent-189, score-0.263]
40 In the next section, we propose an algorithm which keeps the representation of the search space as simple as that of an N best list, but does not restrict itself to the top N best paths alone and hence does not get biased towards the starting weaker model. [sent-190, score-0.3]
41 4 Proposed Approach for Rescoring A high level idea of our proposed approach is to identify islands of confusability in the word lattice and to replace the problem of global search over the word lattice by a series of local search problems over these islands, in an iterative manner. [sent-191, score-1.045]
42 In this work, we propose a technique which generalizes very well on word lattices and overcomes the limitations posed by a CN or by the limited nature of local neighborhood. [sent-198, score-0.277]
43 Thus the local neighborhoods are in some sense a function of the confusability present in the lattice rather than some predetermined factor. [sent-200, score-0.492]
44 Below we describe the process by virtue of which we can cut the lattice to form many self-contained, smaller-sized sub-lattices. [sent-201, score-0.65]
45 Once these sub lattices are formed, we follow a similar hill climbing procedure as proposed in our previous work (Deoras and Jelinek, 2009). [sent-202, score-0.543]
46 Before we define the procedure for cutting the lattice into many small self-contained lattices, we will define some more terms necessary for ease of understanding of the algorithm. [sent-206, score-0.481]
47 β(v) = Σ_{π ∈ E∗ : p[π] = v, n[π] ∈ Ne} A[π]^γ L[π] (5) If we define the sum of the joint likelihood, under the baseline acoustic and language models, of all paths in the lattice by Z, then it can simply be obtained as: Z = Σ_{u ∈ Ne} α(u) = β(ns). In order to cut the lattice, we want to identify sets of nodes, S1, S2, . [sent-215, score-0.547]
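A sketch of the α / β recursions used above, written in plain probability space for readability (a real implementation would work in log space); the lattice attributes and methods are assumptions rather than a specific toolkit's API.

def forward_backward(lattice, gamma):
    def weight(e):
        # per-edge contribution to A[pi]^gamma * L[pi]
        return (e.acoustic ** gamma) * e.lm

    order = list(lattice.topological_order())
    # alpha(u): total weight of partial paths from the start node ns to u
    alpha = {lattice.start: 1.0}
    for u in order:
        for e in lattice.outgoing(u):
            alpha[e.head] = alpha.get(e.head, 0.0) + alpha.get(u, 0.0) * weight(e)

    # beta(v): total weight of trailing paths from v into an end node in Ne
    beta = {}
    for v in reversed(order):
        beta[v] = 1.0 if v in lattice.end_nodes else 0.0
        for e in lattice.outgoing(v):
            beta[v] += weight(e) * beta[e.head]

    Z = sum(alpha.get(u, 0.0) for u in lattice.end_nodes)   # equals beta[lattice.start]
    return alpha, beta, Z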
48 The first property can be easily checked by first pushing states into a linked list associated with each time marker (this can be done by iterating over all the states of the graph) and then iterating over the unique time markers and retrieving the nodes associated with them. [sent-223, score-0.206]
49 Thus effectively, the time complexity for cutting the lattice is O( |V| + |E| ). [sent-230, score-0.424]
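One way to realize this check in roughly the stated time is a sweep over the time stamps: bucket the states by time (the linked lists mentioned above), then use a difference array to mark every time stamp that some edge strictly spans; the surviving stamps are valid cut points. The node and edge attributes are assumed, and the paper's exact criterion is phrased via the sets Sj, so treat this as an illustrative sketch.

from collections import defaultdict

def find_cut_times(lattice):
    # Bucket states by their time marker (the linked lists of the text): O(|V|).
    states_at = defaultdict(list)
    for v in lattice.nodes:
        states_at[v.time].append(v)
    times = sorted(states_at)
    index = {t: i for i, t in enumerate(times)}

    # Mark every time stamp that some edge strictly spans, via a difference array: O(|E|).
    blocked = [0] * (len(times) + 1)
    for e in lattice.edges:
        lo, hi = index[e.tail.time], index[e.head.time]
        if hi - lo > 1:
            blocked[lo + 1] += 1
            blocked[hi] -= 1

    # Prefix sum: a time stamp crossed by no edge is a valid cut point.
    cuts, open_spans = [], 0
    for i, t in enumerate(times):
        open_spans += blocked[i]
        if open_spans == 0:
            cuts.append(t)
    return cuts, states_at    # states_at[t] at the cut times gives the sets S1, S2, ...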
50 Having formed such sets, we can now cut the lattice at time stamps associated with these sets i. [sent-231, score-0.444]
51 We do not obtain well-defined lattices yet by simply cutting the parent lattice at the cut points. [sent-245, score-0.746]
52 Once we cut the lattice at these cut points, we implicitly introduce many new starting nodes and ending nodes for any sub-lattice. [sent-246, score-0.912]
53 We will refer to these nodes as exposed starting nodes and exposed ending nodes. [sent-247, score-0.481]
54 Thus for some jth sub-lattice, Lj, there will be as many new exposed starting nodes as there are nodes in the set Sj and as many exposed ending nodes as there are nodes in the set Sj+1. [sent-248, score-0.749]
55 In order to make these sub lattices consistent with the definition of a word lattice (see Sec. [sent-249, score-0.79]
56 1), we unify all the exposed starting nodes and exposed ending nodes. [sent-251, score-0.451]
57 To unify the exposed starting nodes, we introduce as many new edges as there are nodes in the set Sj such that they have a common starting node, ns [Lj], (newly created) and distinct ending nodes present in Sj. [sent-252, score-0.597]
58 Similarly, to unify the exposed ending nodes, we introduce as many new edges as there are nodes in the set Sj+1 such that they have distinct starting nodes present in Sj+1 and a common ending node ne[Lj] (newly created). [sent-254, score-0.289]
59 The totality of these new edges and nodes, along with the ones already present in Lj, forms an induced directed acyclic subgraph G[Lj] = (V[Lj], E[Lj], ns[Lj], ne[Lj]). [sent-255, score-0.239]
60 For any path π ∈ E[Lj] such that p[π] = ns[Lj] and n[π] ∈ Sj, we assign the value of α(n[π]) to denote the joint likelihood A[π]^γ L[π] and assign epsilon for the word associated with these edges, i. [sent-256, score-0.258]
61 Similarly, for any path π ∈ E[Lj] such that p[π] ∈ Sj+1 and n[π] = ne[Lj], we assign the value of β(p[π]) to denote the joint likelihood A[π]^γ L[π] and assign epsilon for the word associated with these edges, i. [sent-260, score-0.275]
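Putting the last few steps together, a sub-lattice between two consecutive cut sets can be assembled as below: a fresh ns[Lj] fans out to the exposed starting nodes with epsilon edges weighted by α, and the exposed ending nodes fan into a fresh ne[Lj] with epsilon edges weighted by β. Every constructor and field here is hypothetical; the sketch only mirrors the construction described in the text.

def build_sub_lattice(parent, S_j, S_j1, alpha, beta):
    sub = parent.empty_copy()                  # same lattice class, no nodes/edges yet
    ns_j = sub.add_node("ns[Lj]")              # unified starting node (newly created)
    ne_j = sub.add_node("ne[Lj]")              # unified ending node (newly created)
    for u in S_j:
        # epsilon edge carrying alpha(u), the mass of everything left of the cut
        sub.add_edge(ns_j, u, word="<eps>", weight=alpha[u])
    for v in S_j1:
        # epsilon edge carrying beta(v), the mass of everything right of the cut
        sub.add_edge(v, ne_j, word="<eps>", weight=beta[v])
    for e in parent.edges_between(S_j, S_j1):  # interior edges are copied verbatim
        sub.add_edge(e.tail, e.head, word=e.word, weight=e.weight)
    return sub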
62 , LC, where C is the total number of sub-lattices formed, then the idea is to divide the rescoring problem into many small re-scoring problems, carried out over the sub-lattices one at a time by fixing the single best paths from all the remaining sub-lattices. [sent-269, score-1.116]
63 The inputs to the algorithm are the sub-lattices (produced by cutting the parent lattice generated under some Markov n-gram LM) and a new rescoring LM, which now need not be restricted to the finite state machine family. [sent-270, score-1.077]
64 The output of the algorithm is a word string, W∗, such that it is the concatenation of the final decoded word strings from each sub-lattice. [sent-271, score-0.285]
65 Thus if we denote the final decoded path (under some decoding scheme, which will become apparent next) in the jth sub lattice by πj∗ and the concatenation symbol by ’ ·’, then W∗ = W[π1∗] ·W[π2∗] ·. [sent-272, score-0.897]
66 The algorithm is initialized by setting PrevHypo to null and CurrHypo to the concatenation of the 1-best output from each sub-lattice. [sent-302, score-0.242]
67 During the initialization step, each sub-lattice is analyzed independently of any other sub-lattice and, under the baseline acoustic scores and baseline n-gram LM scores, the 1-best path is found. [sent-303, score-1.318]
68 Once all the sub lattices are re-scored, that constitutes one iteration. [sent-313, score-0.438]
69 At the end of each iteration, CurrHypo is set to the concatenation of 1 best paths from each sub lattice while PrevHypo is set to the old value of CurrHypo. [sent-314, score-0.7]
70 Thus if we are analyzing some ith sub-lattice in some iteration, then the 1-best paths from all but this sub-lattice are kept fixed and a new 1-best path under the re-scoring LM is found. [sent-315, score-0.215]
71 Since the cutting of parent lattices produces many small lattices with a considerably smaller number of nodes, in practice an exhaustive search for the 1-best hypothesis can be carried out via an N best list. [sent-317, score-0.676]
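A compact sketch of the hill climbing loop just described: iterate over the sub-lattices, keep the current 1-best strings of all the other sub-lattices fixed as context, rescore a local candidate list with the long span LM, and stop when a full pass changes nothing. The sub.baseline_one_best / sub.nbest / rescore_lm.logprob interfaces and the eta scale are illustrative assumptions.

def iterative_decoding(sub_lattices, rescore_lm, eta, n=50):
    # Initialization: baseline 1-best string from every sub-lattice (PrevHypo = null).
    curr = [sub.baseline_one_best() for sub in sub_lattices]
    prev = None
    while curr != prev:                        # one pass over all sub-lattices = one iteration
        prev = list(curr)
        for j, sub in enumerate(sub_lattices):
            best_hyp, best_score = curr[j], float("-inf")
            for words, log_acoustic in sub.nbest(n):
                # Fix the 1-best strings of all other sub-lattices and score the
                # whole concatenated sentence under the long span re-scoring LM.
                sentence = " ".join(curr[:j] + [words] + curr[j + 1:])
                score = eta * log_acoustic + rescore_lm.logprob(sentence)
                if score > best_score:
                    best_hyp, best_score = words, score
            curr[j] = best_hyp                 # CurrHypo is updated with the local winner
    return " ".join(curr)                      # W* = W[pi_1*] . W[pi_2*] . ...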
72 Entropy of a lattice reflects the confidence of the recognizer in recognizing the acoustics. [sent-321, score-0.43]
73 Based on the observation that if the N best list / lattice generated under some model has a very low entropy, then the Spearman’s rank correlation factor, ρ (Eqn. [sent-322, score-0.527]
74 3), tends to be higher even when the N best lists / lattice is re-ranked with a bigger and a better model. [sent-323, score-0.491]
75 We can see from Table 2 that the rank correlation values tend to be higher (indicating little re-ranking) when the entropy of the N best list, under the baseline model, is lower. [sent-332, score-0.225]
76 Of course, if the starting LM is much weaker than the rescoring model, then the entropy values need not be reflective of the difficulty of the overall task. [sent-335, score-0.388]
77 This observation then suggests that it is safe to rescore only those N best lists whose entropy under the starting model is higher than some threshold. [sent-336, score-0.255]
78 While computation of entropy for N best list is tractable, for a word lattice, the computation of entropy is intractable if one were to enumerate all the hypotheses. [sent-343, score-0.277]
79 Using efficient semiring techniques introduced by Li and Eisner (2009) or using posterior probabilities on the edges leading to end states, we can compute the entropy of a lattice in one single forward pass using dynamic programming. [sent-345, score-0.567]
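The single forward pass can be written as a first-order expectation-semiring style recursion: alongside the total weight z reaching each node, propagate r = Σ_π w(π) log w(π), and then H = log Z − R/Z. The sketch below works in probability space for clarity (log space, or the semiring machinery of Li and Eisner (2009), would be used in practice) and reuses the same assumed lattice interface as the earlier sketches.

import math

def lattice_entropy(lattice, gamma):
    z = {lattice.start: 1.0}   # total weight w(pi) of partial paths reaching each node
    r = {lattice.start: 0.0}   # sum over those paths of w(pi) * log w(pi)
    for u in lattice.topological_order():
        if u not in z:
            continue
        for e in lattice.outgoing(u):
            w = (e.acoustic ** gamma) * e.lm
            z[e.head] = z.get(e.head, 0.0) + z[u] * w
            # extending a path by e scales its weight by w and adds log(w) to its log-weight
            r[e.head] = r.get(e.head, 0.0) + r[u] * w + z[u] * w * math.log(w)
    Z = sum(z.get(u, 0.0) for u in lattice.end_nodes)
    R = sum(r.get(u, 0.0) for u in lattice.end_nodes)
    # H = -sum_pi p(pi) log p(pi) with p(pi) = w(pi) / Z, i.e. H = log Z - R / Z (in nats)
    return math.log(Z) - R / Z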
80 One has to resort to approximate entropy computation via N best list, if entropy under long span LM is desired. [sent-347, score-0.414]
81 Once we have formed self-contained sub-lattices, we want to prune all but the top few best complete paths (obtained under the baseline / starting model) of those sub-lattices whose entropy is below some threshold. [sent-351, score-1.003]
82 Thus, believing in the original model’s confidence, we want to focus only on those sub lattices which the recognizer found difficult to decode in the first pass. [sent-352, score-0.516]
83 All other parts of the parent lattice will not be analyzed. [sent-353, score-0.385]
84 We followed IBM’s multi-pass decoding recipe using KN:BN-Small in the first pass followed by either N best list re-scoring or word lattice re-scoring using bigger and better models. [sent-366, score-0.636]
85 The N best list search method obtains the same reduction in WER by evaluating as many as 228K sentence hypotheses on average. [sent-374, score-0.276]
86 We used two sets for decoding: rt03+dev04f set was used as a development set while rt04 was used as a blind set for the purpose of evaluating the performance of long span RNN models using the proposed approach. [sent-378, score-0.218]
87 , 2007) for manipulating lattice graphs and generating N best lists. [sent-380, score-0.388]
88 Since the re-scoring LM belonged to the n-gram family, it was possible to compute the optimal word string by rescoring the whole lattice (see Sec. [sent-386, score-0.572]
89 The N best list achieved the best possible reduction by evaluating as many as 228K sentence hypotheses on average. [sent-393, score-0.263]
90 For the purpose of this experiment, entropy-based pruning was carried out when the entropy of the sub-lattice was below 5 nats. [sent-402, score-0.691]
91 Since the re-scoring model was an n-gram LM, it was possible to obtain the optimal performance via lattice update technique (see Sec. [sent-408, score-0.424]
92 We then carried out the re-scoring of the word lattices under KN:BN-Big using our proposed technique and found it to give the same performance yielding the WER of 13. [sent-411, score-0.277]
93 Due to the long span nature of the re-scoring LM, it was not possible to obtain the optimal WER performance. [sent-417, score-0.222]
94 8K sentence hypotheses on average, while the proposed method (with entropy pruning) obtains the same reduction by evaluating a 21 times smaller search space. [sent-427, score-0.29]
95 the size of the search space (in terms of number of sentence hypotheses evaluated by a long span language model). [sent-428, score-0.343]
96 In spite of starting off with a very strong n-gram LM, the N best lists so extracted were still not representative enough of the long span rescoring models. [sent-429, score-0.524]
97 Had we started off with KN:BN-Small, the N best list re-scoring method would have had no chance of finding the optimal hypothesis in a reasonably sized hypothesis search space. [sent-430, score-0.28]
98 Table 4 compares the two search methods for this setup when many other long span LMs were also used for re-scoring. [sent-431, score-0.233]
99 Due to the long span nature of the LM, the optimal WER could not be estimated. [sent-452, score-0.222]
100 As part of future work, we plan to extend this technique to hypergraphs and lattices for re-scoring MT outputs with complex and long span language models. [sent-456, score-0.461]
wordName wordTfidf (topN-words)
[('lattice', 0.352), ('lm', 0.274), ('wer', 0.244), ('lattices', 0.243), ('kn', 0.215), ('sub', 0.195), ('rescoring', 0.182), ('deoras', 0.167), ('mikolov', 0.158), ('lnew', 0.15), ('lj', 0.15), ('path', 0.145), ('span', 0.119), ('recurrent', 0.113), ('sj', 0.112), ('hypotheses', 0.11), ('exposed', 0.104), ('layer', 0.102), ('entropy', 0.097), ('speech', 0.086), ('confusability', 0.083), ('pass', 0.082), ('acoustic', 0.079), ('recognizer', 0.078), ('ns', 0.077), ('recognition', 0.074), ('decoding', 0.074), ('nodes', 0.073), ('cutting', 0.072), ('icassp', 0.072), ('paths', 0.07), ('currenthyp', 0.067), ('rnn', 0.067), ('network', 0.065), ('long', 0.065), ('starting', 0.064), ('neural', 0.063), ('ending', 0.063), ('ieee', 0.063), ('lists', 0.058), ('climbing', 0.057), ('islands', 0.057), ('neighborhood', 0.057), ('self', 0.057), ('acoustics', 0.053), ('ne', 0.053), ('stamp', 0.052), ('anoop', 0.051), ('broadcast', 0.051), ('currhypo', 0.05), ('ernock', 0.05), ('jc', 0.05), ('luk', 0.05), ('prevhyp', 0.05), ('search', 0.049), ('hill', 0.048), ('mohri', 0.048), ('rank', 0.048), ('list', 0.047), ('pruning', 0.047), ('concatenation', 0.047), ('spearman', 0.047), ('formed', 0.046), ('cut', 0.046), ('plot', 0.046), ('iterative', 0.046), ('bigger', 0.045), ('weaker', 0.045), ('correlation', 0.044), ('huge', 0.043), ('decoded', 0.043), ('asru', 0.043), ('burget', 0.043), ('filimonov', 0.043), ('honza', 0.043), ('kombrink', 0.043), ('unify', 0.043), ('iterating', 0.043), ('denote', 0.041), ('axis', 0.041), ('jelinek', 0.041), ('lms', 0.041), ('tom', 0.041), ('chomsky', 0.039), ('optimal', 0.038), ('edges', 0.036), ('bg', 0.036), ('ts', 0.036), ('best', 0.036), ('technique', 0.034), ('aj', 0.034), ('bengio', 0.034), ('frederick', 0.034), ('harper', 0.034), ('evaluating', 0.034), ('parent', 0.033), ('chow', 0.033), ('karafi', 0.033), ('nnes', 0.033), ('ogawa', 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 5 emnlp-2011-A Fast Re-scoring Strategy to Capture Long-Distance Dependencies
Author: Anoop Deoras ; Tomas Mikolov ; Kenneth Church
Abstract: A re-scoring strategy is proposed that makes it feasible to capture more long-distance dependencies in the natural language. Two pass strategies have become popular in a number of recognition tasks such as ASR (automatic speech recognition), MT (machine translation) and OCR (optical character recognition). The first pass typically applies a weak language model (n-grams) to a lattice and the second pass applies a stronger language model to N best lists. The stronger language model is intended to capture more longdistance dependencies. The proposed method uses RNN-LM (recurrent neural network language model), which is a long span LM, to rescore word lattices in the second pass. A hill climbing method (iterative decoding) is proposed to search over islands of confusability in the word lattice. An evaluation based on Broadcast News shows speedups of 20 over basic N best re-scoring, and word error rate reduction of 8% (relative) on a highly competitive setup.
2 0.22144899 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation
Author: Kevin Gimpel ; Noah A. Smith
Abstract: We present a quasi-synchronous dependency grammar (Smith and Eisner, 2006) for machine translation in which the leaves of the tree are phrases rather than words as in previous work (Gimpel and Smith, 2009). This formulation allows us to combine structural components of phrase-based and syntax-based MT in a single model. We describe a method of extracting phrase dependencies from parallel text using a target-side dependency parser. For decoding, we describe a coarse-to-fine approach based on lattice dependency parsing of phrase lattices. We demonstrate performance improvements for Chinese-English and UrduEnglish translation over a phrase-based baseline. We also investigate the use of unsupervised dependency parsers, reporting encouraging preliminary results.
3 0.13686393 66 emnlp-2011-Hierarchical Phrase-based Translation Representations
Author: Gonzalo Iglesias ; Cyril Allauzen ; William Byrne ; Adria de Gispert ; Michael Riley
Abstract: This paper compares several translation representations for a synchronous context-free grammar parse including CFGs/hypergraphs, finite-state automata (FSA), and pushdown automata (PDA). The representation choice is shown to determine the form and complexity of target LM intersection and shortest-path algorithms that follow. Intersection, shortest path, FSA expansion and RTN replacement algorithms are presented for PDAs. Chinese-toEnglish translation experiments using HiFST and HiPDT, FSA and PDA-based decoders, are presented using admissible (or exact) search, possible for HiFST with compact SCFG rulesets and HiPDT with compact LMs. For large rulesets with large LMs, we introduce a two-pass search strategy which we then analyze in terms of search errors and translation performance.
4 0.12399378 131 emnlp-2011-Syntactic Decision Tree LMs: Random Selection or Intelligent Design?
Author: Denis Filimonov ; Mary Harper
Abstract: Decision trees have been applied to a variety of NLP tasks, including language modeling, for their ability to handle a variety of attributes and sparse context space. Moreover, forests (collections of decision trees) have been shown to substantially outperform individual decision trees. In this work, we investigate methods for combining trees in a forest, as well as methods for diversifying trees for the task of syntactic language modeling. We show that our tree interpolation technique outperforms the standard method used in the literature, and that, on this particular task, restricting tree contexts in a principled way produces smaller and better forests, with the best achieving an 8% relative reduction in Word Error Rate over an n-gram baseline.
5 0.1070381 46 emnlp-2011-Efficient Subsampling for Training Complex Language Models
Author: Puyang Xu ; Asela Gunawardana ; Sanjeev Khudanpur
Abstract: We propose an efficient way to train maximum entropy language models (MELM) and neural network language models (NNLM). The advantage of the proposed method comes from a more robust and efficient subsampling technique. The original multi-class language modeling problem is transformed into a set of binary problems where each binary classifier predicts whether or not a particular word will occur. We show that the binarized model is as powerful as the standard model and allows us to aggressively subsample negative training examples without sacrificing predictive performance. Empirical results show that we can train MELM and NNLM at 1% ∼ 5% of the strtaaninda MrdE complexity LwMith a no %los ∼s 5in% performance.
6 0.10518668 65 emnlp-2011-Heuristic Search for Non-Bottom-Up Tree Structure Prediction
7 0.089493334 76 emnlp-2011-Language Models for Machine Translation: Original vs. Translated Texts
8 0.072800554 100 emnlp-2011-Optimal Search for Minimum Error Rate Training
9 0.06971167 68 emnlp-2011-Hypotheses Selection Criteria in a Reranking Framework for Spoken Language Understanding
10 0.066128045 51 emnlp-2011-Exact Decoding of Phrase-Based Translation Models through Lagrangian Relaxation
11 0.063466541 75 emnlp-2011-Joint Models for Chinese POS Tagging and Dependency Parsing
12 0.062995031 93 emnlp-2011-Minimum Imputed-Risk: Unsupervised Discriminative Training for Machine Translation
13 0.059832644 125 emnlp-2011-Statistical Machine Translation with Local Language Models
14 0.057232115 109 emnlp-2011-Random Walk Inference and Learning in A Large Scale Knowledge Base
15 0.055196103 132 emnlp-2011-Syntax-Based Grammaticality Improvement using CCG and Guided Search
16 0.055103425 15 emnlp-2011-A novel dependency-to-string model for statistical machine translation
17 0.054996889 137 emnlp-2011-Training dependency parsers by jointly optimizing multiple objectives
18 0.054283138 50 emnlp-2011-Evaluating Dependency Parsing: Robust and Heuristics-Free Cross-Annotation Evaluation
19 0.053743586 110 emnlp-2011-Ranking Human and Machine Summarization Systems
20 0.053399656 96 emnlp-2011-Multilayer Sequence Labeling
topicId topicWeight
[(0, 0.202), (1, 0.058), (2, 0.014), (3, -0.071), (4, 0.03), (5, -0.027), (6, -0.026), (7, -0.226), (8, 0.029), (9, -0.08), (10, 0.094), (11, 0.048), (12, -0.03), (13, 0.077), (14, -0.07), (15, 0.112), (16, 0.082), (17, 0.043), (18, 0.133), (19, 0.021), (20, 0.265), (21, -0.146), (22, 0.103), (23, -0.201), (24, -0.104), (25, 0.084), (26, -0.095), (27, -0.143), (28, 0.108), (29, 0.075), (30, 0.058), (31, -0.173), (32, -0.073), (33, 0.059), (34, 0.051), (35, -0.104), (36, -0.067), (37, 0.129), (38, 0.133), (39, -0.017), (40, 0.133), (41, -0.097), (42, 0.155), (43, 0.047), (44, -0.133), (45, -0.11), (46, 0.022), (47, -0.002), (48, 0.024), (49, -0.137)]
simIndex simValue paperId paperTitle
same-paper 1 0.96029127 5 emnlp-2011-A Fast Re-scoring Strategy to Capture Long-Distance Dependencies
Author: Anoop Deoras ; Tomas Mikolov ; Kenneth Church
Abstract: A re-scoring strategy is proposed that makes it feasible to capture more long-distance dependencies in the natural language. Two pass strategies have become popular in a number of recognition tasks such as ASR (automatic speech recognition), MT (machine translation) and OCR (optical character recognition). The first pass typically applies a weak language model (n-grams) to a lattice and the second pass applies a stronger language model to N best lists. The stronger language model is intended to capture more longdistance dependencies. The proposed method uses RNN-LM (recurrent neural network language model), which is a long span LM, to rescore word lattices in the second pass. A hill climbing method (iterative decoding) is proposed to search over islands of confusability in the word lattice. An evaluation based on Broadcast News shows speedups of 20 over basic N best re-scoring, and word error rate reduction of 8% (relative) on a highly competitive setup.
2 0.66290766 46 emnlp-2011-Efficient Subsampling for Training Complex Language Models
Author: Puyang Xu ; Asela Gunawardana ; Sanjeev Khudanpur
Abstract: We propose an efficient way to train maximum entropy language models (MELM) and neural network language models (NNLM). The advantage of the proposed method comes from a more robust and efficient subsampling technique. The original multi-class language modeling problem is transformed into a set of binary problems where each binary classifier predicts whether or not a particular word will occur. We show that the binarized model is as powerful as the standard model and allows us to aggressively subsample negative training examples without sacrificing predictive performance. Empirical results show that we can train MELM and NNLM at 1% ∼ 5% of the strtaaninda MrdE complexity LwMith a no %los ∼s 5in% performance.
3 0.57543248 65 emnlp-2011-Heuristic Search for Non-Bottom-Up Tree Structure Prediction
Author: Andrea Gesmundo ; James Henderson
Abstract: State of the art Tree Structures Prediction techniques rely on bottom-up decoding. These approaches allow the use of context-free features and bottom-up features. We discuss the limitations of mainstream techniques in solving common Natural Language Processing tasks. Then we devise a new framework that goes beyond Bottom-up Decoding, and that allows a better integration of contextual features. Furthermore we design a system that addresses these issues and we test it on Hierarchical Machine Translation, a well known tree structure prediction problem. The structure of the proposed system allows the incorporation of non-bottom-up features and relies on a more sophisticated decoding approach. We show that the proposed approach can find bet- ter translations using a smaller portion of the search space.
4 0.53552711 66 emnlp-2011-Hierarchical Phrase-based Translation Representations
Author: Gonzalo Iglesias ; Cyril Allauzen ; William Byrne ; Adria de Gispert ; Michael Riley
Abstract: This paper compares several translation representations for a synchronous context-free grammar parse including CFGs/hypergraphs, finite-state automata (FSA), and pushdown automata (PDA). The representation choice is shown to determine the form and complexity of target LM intersection and shortest-path algorithms that follow. Intersection, shortest path, FSA expansion and RTN replacement algorithms are presented for PDAs. Chinese-toEnglish translation experiments using HiFST and HiPDT, FSA and PDA-based decoders, are presented using admissible (or exact) search, possible for HiFST with compact SCFG rulesets and HiPDT with compact LMs. For large rulesets with large LMs, we introduce a two-pass search strategy which we then analyze in terms of search errors and translation performance.
5 0.4264375 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation
Author: Kevin Gimpel ; Noah A. Smith
Abstract: We present a quasi-synchronous dependency grammar (Smith and Eisner, 2006) for machine translation in which the leaves of the tree are phrases rather than words as in previous work (Gimpel and Smith, 2009). This formulation allows us to combine structural components of phrase-based and syntax-based MT in a single model. We describe a method of extracting phrase dependencies from parallel text using a target-side dependency parser. For decoding, we describe a coarse-to-fine approach based on lattice dependency parsing of phrase lattices. We demonstrate performance improvements for Chinese-English and UrduEnglish translation over a phrase-based baseline. We also investigate the use of unsupervised dependency parsers, reporting encouraging preliminary results.
6 0.36264038 131 emnlp-2011-Syntactic Decision Tree LMs: Random Selection or Intelligent Design?
7 0.31822363 76 emnlp-2011-Language Models for Machine Translation: Original vs. Translated Texts
8 0.29528049 51 emnlp-2011-Exact Decoding of Phrase-Based Translation Models through Lagrangian Relaxation
9 0.27735773 110 emnlp-2011-Ranking Human and Machine Summarization Systems
10 0.27508497 100 emnlp-2011-Optimal Search for Minimum Error Rate Training
11 0.26895285 93 emnlp-2011-Minimum Imputed-Risk: Unsupervised Discriminative Training for Machine Translation
12 0.26218143 86 emnlp-2011-Lexical Co-occurrence, Statistical Significance, and Word Association
13 0.25935251 132 emnlp-2011-Syntax-Based Grammaticality Improvement using CCG and Guided Search
14 0.24413505 85 emnlp-2011-Learning to Simplify Sentences with Quasi-Synchronous Grammar and Integer Programming
15 0.21796554 109 emnlp-2011-Random Walk Inference and Learning in A Large Scale Knowledge Base
16 0.21434632 74 emnlp-2011-Inducing Sentence Structure from Parallel Corpora for Reordering
17 0.21340458 68 emnlp-2011-Hypotheses Selection Criteria in a Reranking Framework for Spoken Language Understanding
18 0.21148971 44 emnlp-2011-Domain Adaptation via Pseudo In-Domain Data Selection
19 0.20621341 120 emnlp-2011-Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions
20 0.20204686 134 emnlp-2011-Third-order Variational Reranking on Packed-Shared Dependency Forests
topicId topicWeight
[(23, 0.092), (36, 0.035), (37, 0.486), (45, 0.052), (53, 0.018), (54, 0.02), (57, 0.026), (62, 0.027), (64, 0.019), (66, 0.021), (69, 0.024), (79, 0.029), (82, 0.023), (96, 0.03), (98, 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.92419791 5 emnlp-2011-A Fast Re-scoring Strategy to Capture Long-Distance Dependencies
Author: Anoop Deoras ; Tomas Mikolov ; Kenneth Church
Abstract: A re-scoring strategy is proposed that makes it feasible to capture more long-distance dependencies in the natural language. Two pass strategies have become popular in a number of recognition tasks such as ASR (automatic speech recognition), MT (machine translation) and OCR (optical character recognition). The first pass typically applies a weak language model (n-grams) to a lattice and the second pass applies a stronger language model to N best lists. The stronger language model is intended to capture more longdistance dependencies. The proposed method uses RNN-LM (recurrent neural network language model), which is a long span LM, to rescore word lattices in the second pass. A hill climbing method (iterative decoding) is proposed to search over islands of confusability in the word lattice. An evaluation based on Broadcast News shows speedups of 20 over basic N best re-scoring, and word error rate reduction of 8% (relative) on a highly competitive setup.
2 0.91681802 74 emnlp-2011-Inducing Sentence Structure from Parallel Corpora for Reordering
Author: John DeNero ; Jakob Uszkoreit
Abstract: When translating among languages that differ substantially in word order, machine translation (MT) systems benefit from syntactic preordering—an approach that uses features from a syntactic parse to permute source words into a target-language-like order. This paper presents a method for inducing parse trees automatically from a parallel corpus, instead of using a supervised parser trained on a treebank. These induced parses are used to preorder source sentences. We demonstrate that our induced parser is effective: it not only improves a state-of-the-art phrase-based system with integrated reordering, but also approaches the performance of a recent preordering method based on a supervised parser. These results show that the syntactic structure which is relevant to MT pre-ordering can be learned automatically from parallel text, thus establishing a new application for unsupervised grammar induction.
3 0.81661361 12 emnlp-2011-A Weakly-supervised Approach to Argumentative Zoning of Scientific Documents
Author: Yufan Guo ; Anna Korhonen ; Thierry Poibeau
Abstract: Documents Anna Korhonen Thierry Poibeau Computer Laboratory LaTTiCe, UMR8094 University of Cambridge, UK CNRS & ENS, France alk2 3 @ cam . ac .uk thierry .po ibeau @ ens . fr tific literature according to categories of information structure (or discourse, rhetorical, argumentative or Argumentative Zoning (AZ) analysis of the argumentative structure of a scientific paper has proved useful for a number of information access tasks. Current approaches to AZ rely on supervised machine learning (ML). – – Requiring large amounts of annotated data, these approaches are expensive to develop and port to different domains and tasks. A potential solution to this problem is to use weaklysupervised ML instead. We investigate the performance of four weakly-supervised classifiers on scientific abstract data annotated for multiple AZ classes. Our best classifier based on the combination of active learning and selftraining outperforms our best supervised classifier, yielding a high accuracy of 81% when using just 10% of the labeled data. This result suggests that weakly-supervised learning could be employed to improve the practical applicability and portability of AZ across different information access tasks.
4 0.50213641 68 emnlp-2011-Hypotheses Selection Criteria in a Reranking Framework for Spoken Language Understanding
Author: Marco Dinarelli ; Sophie Rosset
Abstract: Reranking models have been successfully applied to many tasks of Natural Language Processing. However, there are two aspects of this approach that need a deeper investigation: (i) Assessment of hypotheses generated for reranking at classification phase: baseline models generate a list of hypotheses and these are used for reranking without any assessment; (ii) Detection of cases where reranking models provide a worst result: the best hypothesis provided by the reranking model is assumed to be always the best result. In some cases the reranking model provides an incorrect hypothesis while the baseline best hypothesis is correct, especially when baseline models are accurate. In this paper we propose solutions for these two aspects: (i) a semantic inconsistency metric to select possibly more correct n-best hypotheses, from a large set generated by an SLU basiline model. The selected hypotheses are reranked applying a state-of-the-art model based on Partial Tree Kernels, which encode SLU hypotheses in Support Vector Machines with complex structured features; (ii) finally, we apply a decision strategy, based on confidence values, to select the final hypothesis between the first ranked hypothesis provided by the baseline SLU model and the first ranked hypothesis provided by the re-ranker. We show the effectiveness of these solutions presenting comparative results obtained reranking hypotheses generated by a very accurate Conditional Random Field model. We evaluate our approach on the French MEDIA corpus. The results show significant improvements with respect to current state-of-the-art and previous 1104 Sophie Rosset LIMSI-CNRS B.P. 133, 91403 Orsay Cedex France ro s set @ l ims i fr . re-ranking models.
5 0.49366191 46 emnlp-2011-Efficient Subsampling for Training Complex Language Models
Author: Puyang Xu ; Asela Gunawardana ; Sanjeev Khudanpur
Abstract: We propose an efficient way to train maximum entropy language models (MELM) and neural network language models (NNLM). The advantage of the proposed method comes from a more robust and efficient subsampling technique. The original multi-class language modeling problem is transformed into a set of binary problems where each binary classifier predicts whether or not a particular word will occur. We show that the binarized model is as powerful as the standard model and allows us to aggressively subsample negative training examples without sacrificing predictive performance. Empirical results show that we can train MELM and NNLM at 1% ∼ 5% of the strtaaninda MrdE complexity LwMith a no %los ∼s 5in% performance.
6 0.43249917 15 emnlp-2011-A novel dependency-to-string model for statistical machine translation
7 0.42630413 8 emnlp-2011-A Model of Discourse Predictions in Human Sentence Processing
8 0.41519064 136 emnlp-2011-Training a Parser for Machine Translation Reordering
9 0.41416371 59 emnlp-2011-Fast and Robust Joint Models for Biomedical Event Extraction
10 0.41339117 123 emnlp-2011-Soft Dependency Constraints for Reordering in Hierarchical Phrase-Based Translation
11 0.40938327 66 emnlp-2011-Hierarchical Phrase-based Translation Representations
12 0.40593931 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation
13 0.40227664 134 emnlp-2011-Third-order Variational Reranking on Packed-Shared Dependency Forests
14 0.4008927 13 emnlp-2011-A Word Reordering Model for Improved Machine Translation
15 0.40068296 23 emnlp-2011-Bootstrapped Named Entity Recognition for Product Attribute Extraction
16 0.38531765 1 emnlp-2011-A Bayesian Mixture Model for PoS Induction Using Multiple Features
17 0.3848857 25 emnlp-2011-Cache-based Document-level Statistical Machine Translation
18 0.38297155 131 emnlp-2011-Syntactic Decision Tree LMs: Random Selection or Intelligent Design?
19 0.3800523 53 emnlp-2011-Experimental Support for a Categorical Compositional Distributional Model of Meaning
20 0.37978008 85 emnlp-2011-Learning to Simplify Sentences with Quasi-Synchronous Grammar and Integer Programming