acl acl2012 acl2012-121 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Zhiheng Huang ; Yi Chang ; Bo Long ; Jean-Francois Crespo ; Anlei Dong ; Sathiya Keerthi ; Su-Lin Wu
Abstract: Sequential modeling has been widely used in a variety of important applications including named entity recognition and shallow parsing. However, as more and more real time large-scale tagging applications arise, decoding speed has become a bottleneck for existing sequential tagging algorithms. In this paper we propose 1-best A*, 1-best iterative A*, k-best A* and k-best iterative Viterbi A* algorithms for sequential decoding. We show the efficiency of these proposed algorithms for five NLP tagging tasks. In particular, we show that iterative Viterbi A* decoding can be several times or orders of magnitude faster than the state-of-the-art algorithm for tagging tasks with a large number of labels. This algorithm makes real-time large-scale tagging applications with thousands of labels feasible.
Reference: text
sentIndex sentText sentNum sentScore
1 However, as more and more real time large-scale tagging applications arise, decoding speed has become a bottleneck for existing sequential tagging algorithms. [sent-6, score-0.634]
2 In this paper we propose 1-best A*, 1-best iterative A*, k-best A* and k-best iterative Viterbi A* algorithms for sequential decoding. [sent-7, score-0.721]
3 In particular, we show that iterative Viterbi A* decoding can be several times or orders of magnitude faster than the state-of-the-art algorithm for tagging tasks with a large number of labels. [sent-9, score-0.832]
4 Unfortunately, due to its O(TL²) time complexity, where T is the input token size and L is the label size, the Viterbi decoding can become prohibitively slow when the label size is large (say, larger than 200). [sent-16, score-0.553]
5 Kaji et al. (2010) proposed a staggered decoding algorithm, which proves to be very efficient on datasets with a large number of labels. [sent-21, score-0.447]
6 What the aforementioned literature does not cover is the k-best sequential decoding problem, which is indeed frequently required in practice. [sent-22, score-0.442]
7 K-best parses have been extensively studied in the syntactic parsing context (Huang, 2005; Pauls and Klein, 2009), but they are not well accommodated in the sequential decoding context. [sent-24, score-0.442]
8 To the best of our knowledge, the state-of-the-art k-best sequential decoding algorithm is Viterbi A*. [sent-25, score-0.592]
9 We generalize the iterative decoding process (Kaji et al., 2010) and propose a k-best sequential decoding algorithm, namely iterative Viterbi A*. [sent-27, score-0.711]
10 (1) We apply the A* search framework to the sequential decoding problem. [sent-30, score-0.47]
11 (2) We propose the 1-best A* and 1-best iterative A* decoding algorithms, which are the second and third fastest among the five decoding algorithms compared, although there is a significant gap to the fastest 1-best decoding algorithm. [sent-32, score-1.845]
12 ... k-best decoding algorithm. [sent-40, score-0.325]
13 2 Problem formulation In this section, we formulate the sequential decoding problem in the context of the perceptron algorithm (Collins, 2002) and CRFs (Lafferty et al., 2001). [sent-42, score-0.594]
14 p(y|x) = exp(\sum_{t=1}^{T} \sum_{k=1}^{K} \theta_k f_k(y_t, y_{t-1}, x_t)) / \sum_{y'} exp(\sum_{t=1}^{T} \sum_{k=1}^{K} \theta_k f_k(y'_t, y'_{t-1}, x_t)) (3) If x is given, the decoding is to find the best y which maximizes the score of f(y, x) for the perceptron or the probability of p(y|x) for CRFs. [sent-49, score-0.466]
15 We thus rewrite the decoding problem as \arg\max_{y} \sum_{t=1}^{T} (\sum_{k=1}^{K_1} \theta^1_k f^1_k(y_t, x_t) + \sum_{k=1}^{K_2} \theta^2_k f^2_k(y_t, y_{t-1}, x_t)). [sent-55, score-0.325]
16 So the sequential decoding problem is cast as a max-score pathfinding problem (with the constraint that the path consists of one and only one node at each position). [sent-57, score-0.399]
17 In the discussion hereafter, we assume scores of nodes and edges are pre-computed (denoted as n(yt) and e(yt-1, yt)), and we can thus focus on the analysis of different decoding algorithms. [sent-58, score-0.392]
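As a concrete illustration of this pre-computation, the sketch below derives the node scores n(yt) from sparse unigram features and takes the bigram weights directly as edge scores. The data layout (active_feats, theta1, theta2) is a hypothetical representation, not the authors' code.

    import numpy as np

    def precompute_scores(active_feats, theta1, theta2):
        # active_feats[t][y]: indices of the unigram features firing for label y at position t
        # theta1: 1-D unigram weight vector; theta2: L x L bigram weight matrix
        T = len(active_feats)
        L = len(theta2)
        node = np.zeros((T, L))                    # n(yt)
        for t in range(T):
            for y in range(L):
                node[t, y] = sum(theta1[f] for f in active_feats[t][y])
        edge = np.asarray(theta2, dtype=float)     # e(yt-1, yt), shared across positions
        return node, edge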
18 3 Background We present the existing algorithms for both 1-best and k-best sequential decoding in this section. [sent-59, score-0.508]
19 3.1 1-Best Viterbi The Viterbi algorithm is a classic dynamic-programming-based decoding algorithm. [sent-62, score-0.448]
20 This is called the forward pass since the computation traverses the lattice from left to right. [sent-69, score-0.341]
21 We can use either forward or backward pass to compute the best sequence. [sent-74, score-0.4]
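The forward-pass computation can be sketched as follows; this is a minimal NumPy illustration of standard first-order Viterbi decoding over the pre-computed node and edge scores, not the authors' implementation, and its nested loops make the TL² cost explicit.

    import numpy as np

    def viterbi(node, edge):
        # node: T x L node scores n(yt); edge: L x L edge scores e(yt-1, yt)
        T, L = node.shape
        delta = np.zeros((T, L))               # best score of a partial path ending at (t, y)
        back = np.zeros((T, L), dtype=int)     # back-pointers
        delta[0] = node[0]
        for t in range(1, T):                  # T positions x L labels x L predecessors = TL^2
            for y in range(L):
                cand = delta[t - 1] + edge[:, y] + node[t, y]
                back[t, y] = int(cand.argmax())
                delta[t, y] = float(cand[back[t, y]])
        path = [int(delta[T - 1].argmax())]    # follow back-pointers from the best final label
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return path[::-1], float(delta[T - 1].max())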
22 Table 1 summarizes the computational complexity of all decoding algorithms including Viterbi, which has a complexity of TL² for both best and worst cases. [sent-75, score-0.588]
23 Note that N/A means the decoding algorithms are not applicable (for example, iterative Viterbi is not applicable to k-best decoding). [sent-76, score-0.66]
24 Kaji et al. (2010) presented an efficient sequential decoding algorithm named staggered decoding. [sent-81, score-0.654]
25 We use the name iterative Viterbi to describe this algorithm [sent-82, score-0.372]
26 for the reason that the iterative process plays a central role in this algorithm. [sent-84, score-0.366]
27 Indeed, this iterative process is generalized in this paper to handle k-best sequential decoding (see Section 4. [sent-85, score-0.711]
28 The main idea is to start with a coarse lattice which consists of both active labels and degenerate labels. [sent-87, score-0.661]
29 The new label, which is made by grouping the inactive labels, is referred to as a degenerate label. [sent-95, score-0.367]
30 Figure 1 (a) shows a lattice which consists of active labels only and (b) shows a lattice which consists of both active and degenerate ones. [sent-99, score-0.995]
31 The score of a degenerate label is the max score of inactive labels which are included in the degenerate label. [sent-100, score-0.745]
32 Using the above definitions, the best sequence derived from a degenerate lattice would be the upper bound of the sequence derived from the original lattice. [sent-102, score-0.568]
33 If the best sequence does not include any degenerate labels, it is indeed the best sequence for the original lattice. [sent-103, score-0.456]
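One possible realization of such a coarse lattice keeps, at each position, the scores of the active labels plus a single degenerate node whose score is the maximum over the inactive labels; edge scores can be grouped by maximization in the same way. The sketch below assumes the node array from the earlier sketches and is only illustrative.

    def degenerate_columns(node, active):
        # node: T x L exact node scores; active: list of sets of active labels per position
        T, L = node.shape
        columns = []
        for t in range(T):
            col = {y: float(node[t, y]) for y in sorted(active[t])}
            inactive = [y for y in range(L) if y not in active[t]]
            if inactive:
                # the degenerate label's score upper-bounds every inactive label it stands for
                col['degenerate'] = max(float(node[t, y]) for y in inactive)
            columns.append(col)
        return columns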
34 Figure 1: (a) A lattice consisting of active labels only. [sent-104, score-0.427]
35 (b) A lattice consisting of both active labels and degenerate ones. [sent-105, score-0.633]
36 Each position has one active label (A) and one degenerate label (consisting of B, C, D, E, and F). [sent-106, score-0.53]
37 The lattice is initialized to include one active label and one degenerate label at each position (see Figure 1 (b)). [sent-109, score-0.775]
38 The Viterbi algorithm is applied to the lattice to find the best sequence. [sent-111, score-0.395]
39 Otherwise, the lower bound lb of the active sequence in the lattice is updated and the lattice is expanded. [sent-113, score-0.724]
40 The lower bound can be initialized to the best sequence score using a beam search (with beam size being 1). [sent-114, score-0.387]
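Such an initialization amounts to a single left-to-right greedy pass; the score of the sequence it finds is a valid lower bound lb. A sketch, reusing the node/edge arrays from the earlier illustrations:

    def greedy_lower_bound(node, edge):
        # width-1 beam: at each step keep only the locally best extension
        T, L = node.shape
        y_prev = int(node[0].argmax())
        lb = float(node[0, y_prev])
        for t in range(1, T):
            y = int((edge[y_prev] + node[t]).argmax())
            lb += float(edge[y_prev, y] + node[t, y])
            y_prev = y
        return lb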
41 After either a forward or a backward pass, the lower bound is assigned the maximum score of the active sequences found so far. [sent-115, score-0.512]
42 The expansion of the lattice ensures that the lattice has twice as many active labels as before at a given position. [sent-117, score-0.697]
43 The number of active labels in the column is doubled only if the best sequence of the degenerate lattice passes through the degenerate label of that column. [sent-119, score-1.091]
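The expansion step can therefore be sketched as a per-column check; a column is touched only when the best coarse path goes through its degenerate node, and the labels admitted next are taken in an assumed fixed order (e.g., decreasing unigram score), an illustrative choice rather than the paper's exact rule.

    def expand_columns(active, hits_degenerate, label_order):
        # active[t]: current set of active labels; hits_degenerate[t]: True if the best coarse
        # path passes the degenerate node at position t; label_order[t]: all labels, best first
        for t, hit in enumerate(hits_degenerate):
            if hit:
                target = 2 * max(1, len(active[t]))   # double the active labels in this column
                for y in label_order[t]:
                    if len(active[t]) >= target:
                        break
                    active[t].add(y)
        return active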
44 Algorithm 2 shows the forward pass in which the node pruning is performed. [sent-124, score-0.367]
45 That is, for any node, if the best score of a sequence that passes through such a node is less than the lower bound lb, such a node is removed from the lattice. [sent-125, score-0.452]
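This pruning test is a per-node check: a node survives only if the best full-sequence score through it can still reach the lower bound. A sketch, assuming forward and backward score tables over the current lattice are available:

    def prune_nodes(forward, backward, lb, active):
        # forward[t][y] / backward[t][y]: best partial scores from the start / to the end through (t, y)
        for t in range(len(active)):
            active[t] = {y for y in active[t] if forward[t][y] + backward[t][y] >= lb}
        return active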
46 We do not update the lower bound lb if we cannot find an active sequence. [sent-128, score-0.377]
47 The backward pass is similar to the forward one and it is thus omitted. [sent-130, score-0.347]
48 The alternating calls of forward and backward passes (in Algorithm 1) ensure the alternating updating/lowering of node forward and backward scores, which makes the node pruning in either the forward pass (see Algorithm 2) or the backward pass more efficient. [sent-131, score-1.254]
49 The lower bound lb is updated once in each iteration of the main loop in Algorithm 1. [sent-132, score-0.335]
50 As the forward and backward scores of nodes gradually decrease and the lower bound lb increases, more and more nodes are pruned. [sent-133, score-0.529]
51 The iterative Viterbi algorithm has computational complexity of T and TL² for best and worst cases respectively. [sent-134, score-0.516]
52 At the m-th iteration in Algorithm 1, iterative Viterbi decoding requires on the order of T·4^m time because there are 2^m active labels (plus one degenerate label). [sent-137, score-0.945]
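Summing this per-iteration cost over the at most M+1 iterations needed to activate all L = 2^M labels gives a geometric series, so the total work stays within a constant factor of the last iteration and is consistent with the worst case in Table 1:

    \sum_{m=0}^{M} T \cdot 4^{m} = T \cdot \frac{4^{M+1}-1}{3} \le \frac{4}{3} T \cdot 4^{M} = \frac{4}{3} T L^{2}.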
53 3.3 1-Best Carpediem Esposito and Radicioni (2009) have proposed a novel 1-best sequential decoding algorithm, Carpediem, which attempts to open only the necessary nodes in searching for the best sequence in a given lattice. [sent-144, score-0.634]
54 Carpediem is used as a baseline in our experiments for decoding speed comparison. [sent-147, score-0.399]
55 The k-best sequential decoding gives up this 1-best label memorization in the dynamic programming paradigm. [sent-150, score-0.526]
56 Once we store the k-best labels per node in a lattice, the k-best Viterbi algorithm calls either the forward or the backward pass in the same way as the 1-best Viterbi decoding does. [sent-153, score-0.937]
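The forward recursion of this k-best Viterbi variant can be sketched as keeping the k best partial scores per (position, label) pair; back-pointers are omitted for brevity and the layout is illustrative only.

    import heapq

    def kbest_forward(node, edge, k):
        # node: T x L node scores, edge: L x L edge scores
        # returns, for each (t, y), the k best scores of partial sequences ending in label y at t
        T, L = len(node), len(node[0])
        best = [[[float(node[0][y])] for y in range(L)]]
        for t in range(1, T):
            row = []
            for y in range(L):
                cand = [s + float(edge[yp][y] + node[t][y])
                        for yp in range(L) for s in best[t - 1][yp]]
                row.append(heapq.nlargest(k, cand))
            best.append(row)
        return best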
57 3.5 K-Best Viterbi A* To the best of our knowledge, the most efficient k-best sequence algorithm is the Viterbi A* algorithm, as shown in Algorithm 3. [sent-157, score-0.346]
58 The algorithm consists of one forward pass and an A* backward pass. [sent-158, score-0.444]
59 The forward pass computes and stores the Viterbi forward scores, which are the best scores from the start to the current nodes. [sent-159, score-0.45]
60 In each loop, the best node is popped off from the agenda and is stored in a set r. [sent-180, score-0.327]
61 The idea is that, when an optimal node (or sequence) is popped off, we have to push to the agenda all nodes (sequences) which are slightly worse than the just popped one. [sent-182, score-0.505]
62 Suppose an optimal node 2:B (in red, standing for node B at position 2, representing the sequence 0:A 1:D 2:B 3:C) is popped off; then new nodes 1:A, 1:B, 1:C and 0:B, 0:C, 0:D are pushed to the agenda according to the double nested for loop in Algorithm 3. [sent-186, score-0.689]
63 Each of the pushed nodes represents a sequence; for example, node 1:B represents a sequence which consists of three parts: the Viterbi sequence from the start to 1:B (0:C 1:B), 2:B, and the forward link of 2:B (3:C in this case). [sent-187, score-0.503]
64 The algorithm has a computational complexity of TL² + TL for both best and worst cases, with the first term accounting for the Viterbi forward pass and the second term accounting for the A* backward process. [sent-192, score-0.497]
65 Algorithm 3 K-Best Viterbi A* algorithm 1: forward() 2: push L best nodes to agenda q 3: c = 0 4: r = {} 5: while c < K do 6: Node n = q. [sent-194, score-0.365]
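The overall flow of Algorithm 3 can be sketched as a Viterbi forward pass followed by an agenda-driven A* backward pass. The sketch below is a simplified illustration of that idea, not the authors' code; it omits the lb-based pruning and the k0 batching discussed next. Each agenda item carries the exact score of the best complete sequence compatible with its constraints, so the first K pops are the K best sequences.

    import heapq
    import numpy as np

    def kbest_viterbi_astar(node, edge, K):
        # node: T x L node scores, edge: L x L edge scores; returns up to K (score, labels) pairs
        T, L = node.shape
        delta = np.zeros((T, L))
        back = np.zeros((T, L), dtype=int)
        delta[0] = node[0]
        for t in range(1, T):                                # exact Viterbi forward pass
            for y in range(L):
                cand = delta[t - 1] + edge[:, y] + node[t, y]
                back[t, y] = int(cand.argmax())
                delta[t, y] = float(cand.max())

        def prefix(t, y):                                    # Viterbi-best labels for positions 0..t
            labs = [y]
            for s in range(t, 0, -1):
                y = int(back[s, y])
                labs.append(y)
            return labs[::-1]

        # agenda item (-score, t, y, suffix): labels at t+1..T-1 fixed to suffix, label y at t,
        # and positions before t follow the best Viterbi prefix ending at (t, y)
        agenda = [(-float(delta[T - 1, y]), T - 1, y, ()) for y in range(L)]
        heapq.heapify(agenda)
        results = []
        while agenda and len(results) < K:
            negscore, t, y, suffix = heapq.heappop(agenda)
            seq = prefix(t, y) + list(suffix)
            results.append((-negscore, seq))
            for tp in range(t):                              # the double nested push loop
                tail = -negscore - float(delta[tp, seq[tp]]) # score of everything after position tp
                for yp in range(L):
                    if yp == seq[tp]:
                        continue                             # push only the slightly worse variants
                    s = float(delta[tp, yp]) + tail \
                        - float(edge[seq[tp], seq[tp + 1]]) + float(edge[yp, seq[tp + 1]])
                    heapq.heappush(agenda, (-s, tp, yp, tuple(seq[tp + 1:])))
        return results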
66 The updating of lb using Viterbi A* sequences (line 19) can boost the decoding speed further. [sent-200, score-0.68]
67 We experimented with different k0 values (k0 = nk, where n is an integer) and selected k0 = 2k which results in the largest decoding speed boost. [sent-201, score-0.399]
68 1 Experimental setting We apply 1-best and k-best sequential decoding algorithms to five NLP tagging tasks: Penn TreeBank (PTB) POS tagging, CoNLL2000 joint POS tagging and chunking, CoNLL 2003 joint POS tagging, chunking and named entity tagging, HPSG supertagging (Matsuzaki et al. [sent-204, score-0.731]
69 We note that the selection of training algorithm does not affect the decoding process: the decoding is identical for both CRF and perceptron training algorithms. [sent-229, score-0.802]
70 We note that all decoding algorithms as listed in Section 3 and Section 4 are exact. [sent-242, score-0.391]
71 As this paper is more concerned with decoding speed, feature engineering is beyond its scope. [sent-247, score-0.347]
72 Table 3 shows how many iterations on average are required for the iterative Viterbi and iterative Viterbi A* algorithms. [sent-248, score-0.538]
73 f eDreensptit peo sthiteio lnasrg me nyunm obter e xopfa intedr aa-t tions used in iterative based algorithms (especially iterative Viterbi A* algorithm), the algorithms are still very efficient (see below). [sent-250, score-0.697]
74 Table 3: Iteration numbers of iterative Viterbi and iterative Viterbi A* algorithms for five datasets. [sent-251, score-0.367]
75 Tables 4 and 5 show the decoding speed (sentences per second) of 1-best and 5-best decoding algorithms respectively. [sent-262, score-0.857]
76 The proposed decoding algorithms and the largest decoding speeds across different decoding algorithms (other than beam) are highlighted in bold. [sent-263, score-1.107]
77 The beam search decoding is also shown as a baseline. [sent-265, score-0.416]
78 We note that beam decoding is the only approximate decoding algorithm in this table. [sent-266, score-0.81]
79 All other decoding algorithms produce exactly the same accuracy, which is usually much better than the accuracy of beam decoding. [sent-267, score-0.454]
80 The iterative process successfully boosts the decoding speed of iterative Viterbi compared to Viterbi, but it slows down the decoding speed of iterative A* compared to A*. (The lower accuracy is due to the dynamic nature of queries: many of the test query tokens are unseen in the training set.) [sent-272, score-1.686]
81 This is because in the Viterbi case, the iterative process has a node pruning procedure, while there is no such pruning in the A* (best) algorithm. [sent-274, score-0.438]
82 Taking the CoNLL 2003 data as an example, the removal of the pruning slows down the 1-best iterative Viterbi decoding from 2239 to 604 sentences/second. [sent-275, score-0.667]
83 The classic Viterbi A* can usually obtain a decent decoding speed, for example, 40 sentences per second for the CoNLL 2003 dataset. [sent-283, score-0.377]
84 The decoding speed of iterative Viterbi A* can even be comparable to that of beam search. [sent-289, score-0.731]
85 Figure 4 shows the decoding speed of the k-best decoding algorithms with respect to different k values for the CoNLL 2003 data. [sent-290, score-0.79]
86 The Viterbi A* and iterative Viterbi A* algorithms are significantly faster than the Viterbi and A*(best) algorithms. [sent-291, score-0.359]
87 Although the iterative Viterbi A* significantly outperforms the Viterbi A* for k < 30, the speed of the former converges to the latter when k becomes 90 or larger. [sent-292, score-0.343]
88 This is expected as the k-best sequences span the whole lattice: the earlier iterations in the iterative Viterbi A* algorithm cannot provide the k-best sequences using the degenerate lattice. [sent-293, score-0.732]
89 The overhead of multiple iterations slows down the decoding speed compared to the Viterbi A* algorithm. [sent-294, score-0.439]
90 Figure 4: Decoding speed of k-best decoding algorithms for various k for the CoNLL 2003 dataset. [sent-295, score-0.465]
91 Esposito and Radicioni (2009) proposed an algorithm which opens necessary nodes in a lattice in searching the best sequence. [sent-297, score-0.462]
92 The staggered decoding algorithm (Kaji et al., 2010) forms the basis for our work on iterative-based decoding algorithms. [sent-299, score-0.594]
93 Apart from the exact decoding, approximate decoding algorithms such as beam search are also related to our work. [sent-300, score-0.506]
94 Pauls and Klein (2009) proposed an algorithm which replaces the Viterbi forward pass with an A* search. [sent-306, score-0.328]
95 Their algorithm optimizes the Viterbi pass, while the proposed iterative Viterbi A* algorithm optimizes both Viterbi and A* passes. [sent-307, score-0.463]
96 However, the difference is that coarse-to-fine parsing is an approximate decoding while ours is an exact one. [sent-310, score-0.349]
97 7 Conclusions We have presented and evaluated the A* and iterative A* algorithms for 1-best sequential decoding in this paper. [sent-315, score-0.777]
98 In addition, we proposed the A* and iterative Viterbi A* algorithms for k-best sequential decoding. [sent-316, score-0.483]
99 The k-best iterative Viterbi A* algorithm can be several times or orders of magnitude faster than the state-of-the-art k-best decoding algorithm. [sent-317, score-0.504]
100 Table 4: Decoding speed (sentences per second) of 1-best decoding algorithms for five datasets. [sent-321, score-0.523]
wordName wordTfidf (topN-words)
[('viterbi', 0.507), ('decoding', 0.325), ('yt', 0.273), ('iterative', 0.269), ('lattice', 0.245), ('lb', 0.215), ('degenerate', 0.206), ('kaji', 0.139), ('forward', 0.135), ('active', 0.117), ('sequential', 0.117), ('backward', 0.116), ('carpediem', 0.108), ('conll', 0.104), ('node', 0.103), ('algorithm', 0.097), ('pass', 0.096), ('agenda', 0.094), ('label', 0.084), ('inactive', 0.077), ('popped', 0.077), ('speed', 0.074), ('sequence', 0.072), ('supertag', 0.067), ('nodes', 0.067), ('sequences', 0.066), ('algorithms', 0.066), ('labels', 0.065), ('beam', 0.063), ('esposito', 0.062), ('staggered', 0.062), ('crfs', 0.06), ('tagging', 0.059), ('perceptron', 0.055), ('push', 0.054), ('pushed', 0.054), ('best', 0.053), ('ys', 0.051), ('worst', 0.05), ('loop', 0.047), ('complexity', 0.047), ('kfk', 0.046), ('radicioni', 0.046), ('xt', 0.046), ('bound', 0.045), ('passes', 0.043), ('max', 0.041), ('edge', 0.041), ('query', 0.041), ('decodes', 0.04), ('slows', 0.04), ('end', 0.04), ('position', 0.039), ('terminates', 0.037), ('yi', 0.037), ('fk', 0.035), ('kx', 0.034), ('score', 0.033), ('datasets', 0.033), ('pruning', 0.033), ('optimal', 0.033), ('heuristic', 0.032), ('five', 0.032), ('dataset', 0.032), ('stores', 0.031), ('myt', 0.031), ('nisd', 0.031), ('siddiqi', 0.031), ('zhiheng', 0.031), ('size', 0.03), ('pauls', 0.03), ('tl', 0.03), ('orders', 0.029), ('magnitude', 0.029), ('iteration', 0.028), ('hpsg', 0.028), ('search', 0.028), ('pos', 0.028), ('coarse', 0.028), ('tx', 0.027), ('calls', 0.027), ('felzenszwalb', 0.027), ('efficient', 0.027), ('classic', 0.026), ('named', 0.026), ('per', 0.026), ('expansion', 0.025), ('tags', 0.025), ('jeong', 0.025), ('supertagging', 0.025), ('pkk', 0.025), ('exact', 0.024), ('faster', 0.024), ('bigrams', 0.024), ('xk', 0.023), ('init', 0.023), ('fastest', 0.023), ('entity', 0.022), ('concerned', 0.022), ('huang', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 121 acl-2012-Iterative Viterbi A* Algorithm for K-Best Sequential Decoding
Author: Zhiheng Huang ; Yi Chang ; Bo Long ; Jean-Francois Crespo ; Anlei Dong ; Sathiya Keerthi ; Su-Lin Wu
Abstract: Sequential modeling has been widely used in a variety of important applications including named entity recognition and shallow parsing. However, as more and more real time large-scale tagging applications arise, decoding speed has become a bottleneck for existing sequential tagging algorithms. In this paper we propose 1-best A*, 1-best iterative A*, k-best A* and k-best iterative Viterbi A* algorithms for sequential decoding. We show the efficiency of these proposed algorithms for five NLP tagging tasks. In particular, we show that iterative Viterbi A* decoding can be several times or orders of magnitude faster than the state-of-the-art algorithm for tagging tasks with a large number of labels. This algorithm makes real-time large-scale tagging applications with thousands of labels feasible.
Author: Qiuye Zhao ; Mitch Marcus
Abstract: We show for both English POS tagging and Chinese word segmentation that with proper representation, large number of deterministic constraints can be learned from training examples, and these are useful in constraining probabilistic inference. For tagging, learned constraints are directly used to constrain Viterbi decoding. For segmentation, character-based tagging constraints can be learned with the same templates. However, they are better applied to a word-based model, thus an integer linear programming (ILP) formulation is proposed. For both problems, the corresponding constrained solutions have advantages in both efficiency and accuracy.
3 0.15482 155 acl-2012-NiuTrans: An Open Source Toolkit for Phrase-based and Syntax-based Machine Translation
Author: Tong Xiao ; Jingbo Zhu ; Hao Zhang ; Qiang Li
Abstract: We present a new open source toolkit for phrase-based and syntax-based machine translation. The toolkit supports several state-of-the-art models developed in statistical machine translation, including the phrase-based model, the hierachical phrase-based model, and various syntaxbased models. The key innovation provided by the toolkit is that the decoder can work with various grammars and offers different choices of decoding algrithms, such as phrase-based decoding, decoding as parsing/tree-parsing and forest-based decoding. Moreover, several useful utilities were distributed with the toolkit, including a discriminative reordering model, a simple and fast language model, and an implementation of minimum error rate training for weight tuning. 1
4 0.099822856 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations
Author: Spence Green ; John DeNero
Abstract: When automatically translating from a weakly inflected source language like English to a target language with richer grammatical features such as gender and dual number, the output commonly contains morpho-syntactic agreement errors. To address this issue, we present a target-side, class-based agreement model. Agreement is promoted by scoring a sequence of fine-grained morpho-syntactic classes that are predicted during decoding for each translation hypothesis. For English-to-Arabic translation, our model yields a +1.04 BLEU average improvement over a state-of-the-art baseline. The model does not require bitext or phrase table annotations and can be easily implemented as a feature in many phrase-based decoders. 1
5 0.093455791 131 acl-2012-Learning Translation Consensus with Structured Label Propagation
Author: Shujie Liu ; Chi-Ho Li ; Mu Li ; Ming Zhou
Abstract: In this paper, we address the issue for learning better translation consensus in machine translation (MT) research, and explore the search of translation consensus from similar, rather than the same, source sentences or their spans. Unlike previous work on this topic, we formulate the problem as structured labeling over a much smaller graph, and we propose a novel structured label propagation for the task. We convert such graph-based translation consensus from similar source strings into useful features both for n-best output reranking and for decoding algorithm. Experimental results show that, our method can significantly improve machine translation performance on both IWSLT and NIST data, compared with a state-ofthe-art baseline. 1
6 0.086936429 95 acl-2012-Fast Syntactic Analysis for Statistical Language Modeling via Substructure Sharing and Uptraining
7 0.085102737 68 acl-2012-Decoding Running Key Ciphers
8 0.081446886 213 acl-2012-Utilizing Dependency Language Models for Graph-based Dependency Parsing Models
9 0.080051951 96 acl-2012-Fast and Robust Part-of-Speech Tagging Using Dynamic Model Selection
10 0.079990007 143 acl-2012-Mixing Multiple Translation Models in Statistical Machine Translation
11 0.078239024 106 acl-2012-Head-driven Transition-based Parsing with Top-down Prediction
12 0.071199544 119 acl-2012-Incremental Joint Approach to Word Segmentation, POS Tagging, and Dependency Parsing in Chinese
13 0.069850437 177 acl-2012-Sentence Dependency Tagging in Online Question Answering Forums
14 0.068394914 97 acl-2012-Fast and Scalable Decoding with Language Model Look-Ahead for Phrase-based Statistical Machine Translation
15 0.067645676 94 acl-2012-Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection
16 0.06701079 9 acl-2012-A Cost Sensitive Part-of-Speech Tagging: Differentiating Serious Errors from Minor Errors
17 0.065837771 109 acl-2012-Higher-order Constituent Parsing and Parser Combination
18 0.063680485 57 acl-2012-Concept-to-text Generation via Discriminative Reranking
19 0.063134566 141 acl-2012-Maximum Expected BLEU Training of Phrase and Lexicon Translation Models
20 0.061294161 128 acl-2012-Learning Better Rule Extraction with Translation Span Alignment
topicId topicWeight
[(0, -0.172), (1, -0.054), (2, -0.074), (3, -0.039), (4, 0.003), (5, 0.064), (6, 0.056), (7, -0.038), (8, 0.055), (9, -0.025), (10, -0.039), (11, 0.009), (12, -0.032), (13, -0.063), (14, 0.07), (15, 0.076), (16, 0.04), (17, 0.127), (18, 0.119), (19, -0.081), (20, 0.102), (21, 0.003), (22, -0.085), (23, 0.035), (24, -0.062), (25, 0.037), (26, 0.013), (27, -0.021), (28, -0.187), (29, -0.074), (30, 0.005), (31, -0.098), (32, -0.026), (33, 0.052), (34, -0.023), (35, -0.129), (36, 0.016), (37, 0.27), (38, -0.203), (39, -0.028), (40, -0.122), (41, -0.028), (42, -0.022), (43, -0.051), (44, 0.038), (45, 0.03), (46, 0.124), (47, -0.107), (48, -0.108), (49, 0.008)]
simIndex simValue paperId paperTitle
same-paper 1 0.97804952 121 acl-2012-Iterative Viterbi A* Algorithm for K-Best Sequential Decoding
Author: Zhiheng Huang ; Yi Chang ; Bo Long ; Jean-Francois Crespo ; Anlei Dong ; Sathiya Keerthi ; Su-Lin Wu
Abstract: Sequential modeling has been widely used in a variety of important applications including named entity recognition and shallow parsing. However, as more and more real time large-scale tagging applications arise, decoding speed has become a bottleneck for existing sequential tagging algorithms. In this paper we propose 1-best A*, 1-best iterative A*, k-best A* and k-best iterative Viterbi A* algorithms for sequential decoding. We show the efficiency of these proposed algorithms for five NLP tagging tasks. In particular, we show that iterative Viterbi A* decoding can be several times or orders of magnitude faster than the state-of-the-art algorithm for tagging tasks with a large number of labels. This algorithm makes real-time large-scale tagging applications with thousands of labels feasible.
2 0.7439962 68 acl-2012-Decoding Running Key Ciphers
Author: Sravana Reddy ; Kevin Knight
Abstract: There has been recent interest in the problem of decoding letter substitution ciphers using techniques inspired by natural language processing. We consider a different type of classical encoding scheme known as the running key cipher, and propose a search solution using Gibbs sampling with a word language model. We evaluate our method on synthetic ciphertexts of different lengths, and find that it outperforms previous work that employs Viterbi decoding with character-based models.
Author: Qiuye Zhao ; Mitch Marcus
Abstract: We show for both English POS tagging and Chinese word segmentation that with proper representation, large number of deterministic constraints can be learned from training examples, and these are useful in constraining probabilistic inference. For tagging, learned constraints are directly used to constrain Viterbi decoding. For segmentation, character-based tagging constraints can be learned with the same templates. However, they are better applied to a word-based model, thus an integer linear programming (ILP) formulation is proposed. For both problems, the corresponding constrained solutions have advantages in both efficiency and accuracy.
4 0.60077095 107 acl-2012-Heuristic Cube Pruning in Linear Time
Author: Andrea Gesmundo ; Giorgio Satta ; James Henderson
Abstract: We propose a novel heuristic algorithm for Cube Pruning running in linear time in the beam size. Empirically, we show a gain in running time of a standard machine translation system, at a small loss in accuracy.
5 0.45909524 155 acl-2012-NiuTrans: An Open Source Toolkit for Phrase-based and Syntax-based Machine Translation
Author: Tong Xiao ; Jingbo Zhu ; Hao Zhang ; Qiang Li
Abstract: We present a new open source toolkit for phrase-based and syntax-based machine translation. The toolkit supports several state-of-the-art models developed in statistical machine translation, including the phrase-based model, the hierarchical phrase-based model, and various syntax-based models. The key innovation provided by the toolkit is that the decoder can work with various grammars and offers different choices of decoding algorithms, such as phrase-based decoding, decoding as parsing/tree-parsing and forest-based decoding. Moreover, several useful utilities were distributed with the toolkit, including a discriminative reordering model, a simple and fast language model, and an implementation of minimum error rate training for weight tuning.
6 0.45242107 95 acl-2012-Fast Syntactic Analysis for Statistical Language Modeling via Substructure Sharing and Uptraining
7 0.42980093 42 acl-2012-Bootstrapping via Graph Propagation
9 0.42002034 57 acl-2012-Concept-to-text Generation via Discriminative Reranking
10 0.36604351 165 acl-2012-Probabilistic Integration of Partial Lexical Information for Noise Robust Haptic Voice Recognition
11 0.36563572 131 acl-2012-Learning Translation Consensus with Structured Label Propagation
12 0.35902214 119 acl-2012-Incremental Joint Approach to Word Segmentation, POS Tagging, and Dependency Parsing in Chinese
13 0.33379453 164 acl-2012-Private Access to Phrase Tables for Statistical Machine Translation
14 0.32408366 94 acl-2012-Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection
15 0.32230306 73 acl-2012-Discriminative Learning for Joint Template Filling
16 0.30520999 211 acl-2012-Using Rejuvenation to Improve Particle Filtering for Bayesian Word Segmentation
17 0.2990644 96 acl-2012-Fast and Robust Part-of-Speech Tagging Using Dynamic Model Selection
18 0.2929475 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations
19 0.28444836 213 acl-2012-Utilizing Dependency Language Models for Graph-based Dependency Parsing Models
20 0.27734482 143 acl-2012-Mixing Multiple Translation Models in Statistical Machine Translation
topicId topicWeight
[(26, 0.044), (28, 0.02), (29, 0.259), (30, 0.029), (37, 0.052), (39, 0.041), (57, 0.011), (74, 0.072), (82, 0.032), (84, 0.016), (85, 0.027), (90, 0.151), (92, 0.076), (94, 0.023), (99, 0.041)]
simIndex simValue paperId paperTitle
same-paper 1 0.82233256 121 acl-2012-Iterative Viterbi A* Algorithm for K-Best Sequential Decoding
Author: Zhiheng Huang ; Yi Chang ; Bo Long ; Jean-Francois Crespo ; Anlei Dong ; Sathiya Keerthi ; Su-Lin Wu
Abstract: Sequential modeling has been widely used in a variety of important applications including named entity recognition and shallow parsing. However, as more and more real time large-scale tagging applications arise, decoding speed has become a bottleneck for existing sequential tagging algorithms. In this paper we propose 1-best A*, 1-best iterative A*, k-best A* and k-best iterative Viterbi A* algorithms for sequential decoding. We show the efficiency of these proposed algorithms for five NLP tagging tasks. In particular, we show that iterative Viterbi A* decoding can be several times or orders of magnitude faster than the state-of-the-art algorithm for tagging tasks with a large number of labels. This algorithm makes real-time large-scale tagging applications with thousands of labels feasible.
2 0.82106364 190 acl-2012-Syntactic Stylometry for Deception Detection
Author: Song Feng ; Ritwik Banerjee ; Yejin Choi
Abstract: Most previous studies in computerized deception detection have relied only on shallow lexico-syntactic patterns. This paper investigates syntactic stylometry for deception detection, adding a somewhat unconventional angle to prior literature. Over four different datasets spanning from the product review to the essay domain, we demonstrate that features driven from Context Free Grammar (CFG) parse trees consistently improve the detection performance over several baselines that are based only on shallow lexico-syntactic features. Our results improve the best published result on the hotel review data (Ott et al., 2011) reaching 91.2% accuracy with 14% error reduction.
3 0.59189016 148 acl-2012-Modified Distortion Matrices for Phrase-Based Statistical Machine Translation
Author: Arianna Bisazza ; Marcello Federico
Abstract: This paper presents a novel method to suggest long word reorderings to a phrase-based SMT decoder. We address language pairs where long reordering concentrates on few patterns, and use fuzzy chunk-based rules to predict likely reorderings for these phenomena. Then we use reordered n-gram LMs to rank the resulting permutations and select the n-best for translation. Finally we encode these reorderings by modifying selected entries of the distortion cost matrix, on a per-sentence basis. In this way, we expand the search space by a much finer degree than if we simply raised the distortion limit. The proposed techniques are tested on Arabic-English and German-English using well-known SMT benchmarks.
4 0.59146851 146 acl-2012-Modeling Topic Dependencies in Hierarchical Text Categorization
Author: Alessandro Moschitti ; Qi Ju ; Richard Johansson
Abstract: In this paper, we encode topic dependencies in hierarchical multi-label Text Categorization (TC) by means of rerankers. We represent reranking hypotheses with several innovative kernels considering both the structure of the hierarchy and the probability of nodes. Additionally, to better investigate the role of category relationships, we consider two interesting cases: (i) traditional schemes in which node-fathers include all the documents of their child-categories; and (ii) more general schemes, in which children can include documents not belonging to their fathers. The extensive experimentation on Reuters Corpus Volume 1 shows that our rerankers inject effective structural semantic dependencies in multi-classifiers and significantly outperform the state-of-the-art.
Author: Patrick Simianer ; Stefan Riezler ; Chris Dyer
Abstract: With a few exceptions, discriminative training in statistical machine translation (SMT) has been content with tuning weights for large feature sets on small development data. Evidence from machine learning indicates that increasing the training sample size results in better prediction. The goal of this paper is to show that this common wisdom can also be brought to bear upon SMT. We deploy local features for SCFG-based SMT that can be read off from rules at runtime, and present a learning algorithm that applies ℓ1/ℓ2 regularization for joint feature selection over distributed stochastic learning processes. We present experiments on learning on 1.5 million training sentences, and show significant improvements over tuning discriminative models on small development sets.
6 0.58930051 214 acl-2012-Verb Classification using Distributional Similarity in Syntactic and Semantic Structures
7 0.58913839 167 acl-2012-QuickView: NLP-based Tweet Search
8 0.58699965 10 acl-2012-A Discriminative Hierarchical Model for Fast Coreference at Large Scale
9 0.58618307 175 acl-2012-Semi-supervised Dependency Parsing using Lexical Affinities
10 0.58555573 132 acl-2012-Learning the Latent Semantics of a Concept from its Definition
11 0.58544385 95 acl-2012-Fast Syntactic Analysis for Statistical Language Modeling via Substructure Sharing and Uptraining
12 0.58534867 156 acl-2012-Online Plagiarized Detection Through Exploiting Lexical, Syntax, and Semantic Information
13 0.58473784 45 acl-2012-Capturing Paradigmatic and Syntagmatic Lexical Relations: Towards Accurate Chinese Part-of-Speech Tagging
14 0.58472848 127 acl-2012-Large-Scale Syntactic Language Modeling with Treelets
15 0.58438963 38 acl-2012-Bayesian Symbol-Refined Tree Substitution Grammars for Syntactic Parsing
16 0.58312631 191 acl-2012-Temporally Anchored Relation Extraction
17 0.58312285 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations
18 0.58267987 61 acl-2012-Cross-Domain Co-Extraction of Sentiment and Topic Lexicons
19 0.58237565 97 acl-2012-Fast and Scalable Decoding with Language Model Look-Ahead for Phrase-based Statistical Machine Translation
20 0.58163291 117 acl-2012-Improving Word Representations via Global Context and Multiple Word Prototypes