acl acl2011 acl2011-107 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Marco Kuhlmann ; Carlos Gomez-Rodriguez ; Giorgio Satta
Abstract: We develop a general dynamic programming technique for the tabulation of transition-based dependency parsers, and apply it to obtain novel, polynomial-time algorithms for parsing with the arc-standard and arc-eager models. We also show how to reverse our technique to obtain new transition-based dependency parsers from existing tabular methods. Additionally, we provide a detailed discussion of the conditions under which the feature models commonly used in transition-based parsing can be integrated into our algorithms.
Reference: text
sentIndex sentText sentNum sentScore
1 Dynamic Programming Algorithms for Transition-Based Dependency Parsers Marco Kuhlmann, Department of Linguistics and Philology, Uppsala University, Sweden, marco . [sent-1, score-0.042]
2 Giorgio Satta, Department of Information Engineering, University of Padua, Italy, satta@dei . [sent-6, score-0.028]
3 Abstract We develop a general dynamic programming technique for the tabulation of transition-based dependency parsers, and apply it to obtain novel, polynomial-time algorithms for parsing with the arc-standard and arc-eager models. [sent-9, score-0.509]
4 We also show how to reverse our technique to obtain new transition-based dependency parsers from existing tabular methods. [sent-10, score-0.422]
5 Additionally, we provide a detailed discussion of the conditions under which the feature models commonly used in transition-based parsing can be integrated into our algorithms. [sent-11, score-0.097]
6 1 Introduction Dynamic programming algorithms, also known as tabular or chart-based algorithms, are at the core of many applications in natural language processing. [sent-12, score-0.251]
7 When applied to formalisms such as context-free grammar, they provide polynomial-time parsing algorithms and polynomial-space representations of the resulting parse forests, even in cases where the size of the search space is exponential in the length of the input string. [sent-13, score-0.179]
8 In combination with appropriate semirings, these packed representations can be exploited to compute many values of interest for machine learning, such as best parses and feature expectations (Goodman, 1999; Li and Eisner, 2009). [sent-14, score-0.05]
9 In this paper, we follow the line of investigation started by Huang and Sagae (2010) and apply dynamic programming to (projective) transition-based dependency parsing (Nivre, 2008). [sent-15, score-0.312]
10 This can be effectively implemented through dynamic programming, resulting in a packed representation of the set of all computations. [sent-17, score-0.116]
11 We provide (declarative specifications of) novel, polynomial-time algorithms for two widely-used transition-based parsing models: arc-standard (Nivre, 2004; Huang and Sagae, 2010) and arc-eager (Nivre, 2003; Zhang and Clark, 2008). [sent-19, score-0.216]
12 Our algorithm for the arc-eager model is the first tabular algorithm for this model that runs in polynomial time. [sent-20, score-0.165]
13 Both algorithms are derived using the same general technique; in fact, we show that this technique is applicable to all transition-parsing models whose transitions can be classified into “shift” and “reduce” transitions. [sent-21, score-0.245]
14 We also show how to reverse the tabulation to derive a new transition system from an existing tabular algorithm for dependency parsing, originally developed by Gómez-Rodríguez et al. [sent-22, score-0.629]
15 Finally, we discuss in detail the role of feature information in our algorithms, and in particular the conditions under which the feature models traditionally used in transition-based dependency parsing can be integrated into our framework. [sent-24, score-0.215]
16 While our general approach is the same as the one of Huang and Sagae (2010), we depart from their framework by not representing the computations of a parser as a graph-structured stack in the sense of Tomita (1986). [sent-25, score-0.728]
17 We instead simulate computations as in Lang (1974), which results in simpler algorithm specifications, and also reveals deep similarities between transition-based systems for dependency parsing and existing tabular methods for lexicalized context-free grammars. [sent-26, score-0.741]
18 2 Transition-Based Dependency Parsing We start by briefly introducing the framework of transition-based dependency parsing; for details, we refer to Nivre (2008). [sent-29, score-0.118]
19 A dependency graph for w is a directed graph G = (Vw, A). [sent-40, score-0.2]
20 Each node in Vw encodes the position of a token in w, and each arc in A encodes a dependency relation between two tokens. [sent-47, score-0.251]
21 We write i → j for an arc from node i to node j; here, the node i is the head, and the node j is the dependent. [sent-50, score-0.272]
22 A sample dependency graph is given in the left half of Figure 2. [sent-52, score-0.159]
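To make the representation concrete, a dependency graph over token positions can be coded directly as a set of head-dependent pairs. The sketch below (in Python) is illustrative only: the names Arc and has_single_heads are mine, and the single-head check is an assumption about well-formedness that this excerpt does not state.

# Minimal sketch (not from the paper): a dependency graph over token
# positions 0..n-1, represented as a set of (head, dependent) arcs.
from typing import Set, Tuple

Arc = Tuple[int, int]  # (head position, dependent position)

def has_single_heads(n: int, arcs: Set[Arc]) -> bool:
    """True if no token position receives more than one head."""
    dependents = [d for (_, d) in arcs]
    return all(dependents.count(d) <= 1 for d in range(n))

# Hypothetical 4-token example: token 1 heads tokens 0 and 2, token 2 heads token 3.
arcs = {(1, 0), (1, 2), (2, 3)}
print(has_single_heads(4, arcs))  # True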
23 2.2 Transition Systems A transition system is a structure S = (C, T, I, Ct), where C is a set of configurations. [sent-55, score-0.163]
24 T is a finite set of transitions, which are partial functions t : C ⇀ C, I is a total initialization function mapping each input string to a unique initial configuration, and Ct ⊆ C is a set of terminal configurations. [sent-57, score-0.218]
25 The transition systems that we investigate in this paper differ from each other only with respect to their sets of transitions, and are identical in all other aspects. [sent-59, score-0.163]
26 In each of them, a configuration is defined relative to a string w as above, and is a triple c = (σ, β, A). [sent-60, score-0.316]
27 Here σ and β are disjoint lists of nodes from Vw, called stack and buffer, respectively, and A ⊆ Vw × Vw is a set of arcs. [sent-65, score-0.301]
28 We follow a standard convention and write the stack with its topmost element to the right, and the buffer with its first element to the left; furthermore, we indicate concatenation in the stack and in the buffer by a vertical bar. [sent-75, score-1.319]
29 The initialization function maps each string w to the initial configuration, in which the buffer contains all of w and the arc set is empty. [sent-76, score-0.376]
30 Given an input string w, a parser based on S processes w from left to right, starting in the initial configuration I(w). [sent-86, score-0.403]
31 At each point, it applies one of the transitions, until at the end it reaches a terminal configuration. [sent-88, score-0.084]
32 The dependency graph defined by the arc set associated with that configuration is then returned as the analysis for w. [sent-106, score-0.485]
33 Formally, a computation of S on w is a sequence of configurations c0, ..., cm. [sent-107, score-0.149]
34 Each configuration in this sequence is obtained as the value of the preceding one under some transition. [sent-111, score-0.193]
35 A computation can be uniquely specified by its initial configuration c0 and the sequence of its transitions, understood as a string over T. [sent-115, score-0.402]
36 Complete computations, where c0 is fixed, can be specified by their transition sequences alone. [sent-116, score-0.227]
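Since a computation is fully determined by its initial configuration and its transition sequence, it can be replayed mechanically. The following sketch is a minimal illustration rather than the paper's formalization: the Config record, the initial_config helper (which assumes an empty initial stack; some formulations instead pre-load an artificial root node), and run are my own names, and the concrete transitions are supplied separately (see the arc-standard sketch further down).

from typing import Callable, List, NamedTuple, Optional, Tuple

class Config(NamedTuple):
    stack: Tuple[int, ...]   # topmost element is the last entry
    buffer: Tuple[int, ...]  # first element is the next input token
    arcs: frozenset          # set of (head, dependent) pairs

Transition = Callable[[Config], Optional[Config]]  # partial: None where undefined

def initial_config(n: int) -> Config:
    # Assumes an empty initial stack; some formulations instead pre-load
    # an artificial root node.
    return Config(stack=(), buffer=tuple(range(n)), arcs=frozenset())

def run(c0: Config, sequence: List[Transition]) -> Optional[List[Config]]:
    """Replay a transition sequence from c0, returning the computation c0, ..., cm,
    or None if some transition is undefined in the configuration it is applied to."""
    computation = [c0]
    for t in sequence:
        nxt = t(computation[-1])
        if nxt is None:
            return None
        computation.append(nxt)
    return computation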
37 3 Arc-Standard Model To introduce the core concepts of the paper, we first look at a particularly simple model for transition-based dependency parsing, known as the arc-standard model. [sent-117, score-0.175]
38 This model has been used, in slightly different variants, by a number of parsers (Nivre, 2004; Attardi, 2006; Huang and Sagae, 2010). [sent-118, score-0.063]
39 3.1 Transition System The arc-standard model uses three types of transitions: Shift (sh) removes the first node in the buffer and pushes it to the stack. [sent-120, score-0.429]
40 Left-Arc (la) creates a new arc with the topmost node on the stack as the head and the second-topmost node as the dependent, and removes the second-topmost node from the stack. [sent-121, score-1.057]
41 Right-Arc (ra) is symmetric to Left-Arc in that it creates an arc with the second-topmost node as the head and the topmost node as the dependent, and removes the topmost node. [sent-122, score-0.779]
42 The three transitions can be formally specified as in Figure 1. [sent-123, score-0.226]
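A minimal executable reading of the three transitions, written against a Config record like the one sketched above (repeated here so the snippet stands alone). The preconditions are the obvious ones implied by the prose; the exact side conditions of the paper's Figure 1 may differ.

from typing import NamedTuple, Optional, Tuple

class Config(NamedTuple):
    stack: Tuple[int, ...]   # topmost element last
    buffer: Tuple[int, ...]  # first element next
    arcs: frozenset          # (head, dependent) pairs

def sh(c: Config) -> Optional[Config]:
    """Shift: remove the first node from the buffer and push it onto the stack."""
    if not c.buffer:
        return None
    return Config(c.stack + (c.buffer[0],), c.buffer[1:], c.arcs)

def la(c: Config) -> Optional[Config]:
    """Left-Arc: the topmost stack node becomes the head of the second-topmost
    node, which is removed from the stack."""
    if len(c.stack) < 2:
        return None
    dep, head = c.stack[-2], c.stack[-1]
    return Config(c.stack[:-2] + (head,), c.buffer, c.arcs | {(head, dep)})

def ra(c: Config) -> Optional[Config]:
    """Right-Arc: the second-topmost stack node becomes the head of the topmost
    node, which is removed from the stack."""
    if len(c.stack) < 2:
        return None
    head, dep = c.stack[-2], c.stack[-1]
    return Config(c.stack[:-2] + (head,), c.buffer, c.arcs | {(head, dep)})

For instance, on a three-token input, replaying the sequence sh, sh, la, sh, ra from the empty initial configuration ends with a single node on the stack and an empty buffer, i.e. in a terminal configuration under the usual convention.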
43 The right half of Figure 2 shows a complete computation of the arc-standard transition system, specified by its transition sequence. [sent-124, score-0.57]
44 The picture also shows the contents of the stack over the course of the computation; more specifically, column i shows the stack of the configuration ci. [sent-125, score-0.602]
45 Figure 2: A dependency tree (left) and a computation generating this tree in the arc-standard system (right). [sent-131, score-0.267]
46 3.2 Push Computations The key to the tabulation of transition-based dependency parsers is to find a way to decompose computations into smaller, shareable parts. [sent-133, score-0.684]
47 For the arc-standard model, as well as for the other transition systems that we consider in this paper, we base our decomposition on the concept of push computations. [sent-134, score-0.508]
48 A push computation is a computation c0, ..., cm, with m ≥ 1, on some input string w with the following two properties: [sent-137, score-0.481]
49 (P1) The initial stack σ(c0) is not modified during the computation, and is not even exposed after the first transition: for every 1 ≤ i ≤ m, the stack σ(ci) properly extends σ(c0). [sent-139, score-0.03]
50 (P2) The overall effect of the computation is to push a single node to the stack: the final stack σ(cm) extends σ(c0) by exactly one node. [sent-149, score-0.912]
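Read this way, the two properties can be checked mechanically on the stack contents of a candidate computation. The helper below is my own paraphrase of (P1) and (P2), not the paper's formal statement; stacks are tuples with the topmost element last, as in the sketches above.

from typing import Sequence, Tuple

def is_push_computation(stacks: Sequence[Tuple[int, ...]]) -> bool:
    """Check (P1) and (P2) on the stacks sigma(c0), ..., sigma(cm) of a computation."""
    if len(stacks) < 2:          # m >= 1: at least one transition
        return False
    s0 = stacks[0]
    k = len(s0)
    for s in stacks[1:]:         # (P1): sigma(c0) is untouched and never exposed again
        if len(s) <= k or s[:k] != s0:
            return False
    return len(stacks[-1]) == k + 1   # (P2): the net effect is pushing exactly one node

# A single shift is a push computation (cf. the remark below):
print(is_push_computation([(0,), (0, 1)]))  # True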
51 We can verify that the computation in Figure 2 is a push computation. [sent-156, score-0.534]
52 We can also see that it contains shorter computations that are push computations. [sent-157, score-0.883]
53 One example is the computation c1, ..., c16, whose overall effect is to push the node 3. [sent-158, score-0.345]
54 In Figure 2, this computation is marked by the zig-zag path traced in bold. [sent-159, score-0.18]
55 Every computation that consists of a single sh transition is a push computation. [sent-164, score-0.696]
56 Starting from these atoms, we can build larger push computations by means of two (partial) binary operations fla and fra, defined as follows. [sent-165, score-0.804]
57 Let c1,0, ..., c1,m1 and c2,0, ..., c2,m2 be push computations on the same input string w such that c1,m1 = c2,0. [sent-168, score-0.864]
58 Then fra applied to these two computations yields c1,0, ..., c1,m1, c2,1, ..., c2,m2, c, where c is obtained from c2,m2 by applying the ra transition; fla is defined analogously, using the la transition. [sent-172, score-0.035]
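Spelled out as code, the composition concatenates the two push computations, which share the configuration where they meet, and appends one further configuration obtained by the reducing transition (la for fla, ra for fra). The compose helper below is a sketch under that reading; the name is mine, and the reducing transition is passed in explicitly, for example the la and ra functions from the arc-standard sketch.

def compose(gamma1, gamma2, reduce_transition):
    """f_la / f_ra: compose two push computations gamma1 and gamma2 that satisfy
    gamma1[-1] == gamma2[0], appending one configuration produced by the reducing
    transition (la for f_la, ra for f_ra)."""
    if gamma1[-1] != gamma2[0]:
        return None                    # the operations are partial
    c = reduce_transition(gamma2[-1])
    if c is None:
        return None
    return list(gamma1) + list(gamma2[1:]) + [c]

# e.g. f_ra(g1, g2) == compose(g1, g2, ra), with ra as in the arc-standard sketch.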
59 3.3 Deduction System Building on the compositional structure of push computations, we now construct a deduction system (in the sense of Shieber et al. [sent-188, score-0.439]
60 (1995)) that tabulates the computations of the arc-standard model for a given input string w = w0 ... wn-1. [sent-189, score-0.519]
61 Here β0 denotes the full buffer of the initial configuration I(w), and βn denotes the empty buffer, associated with a terminal configuration c ∈ Ct. [sent-205, score-0.277]
62 The items of our deduction system take the form [i, h, j]. [sent-207, score-0.094]
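To make the construction concrete, one natural reading of these items is that [i, h, j] asserts a push computation that consumes the buffer from position i up to position j and pushes the node h; a single shift then yields the axioms [i, i, i+1], and combining two adjacent items with la or ra mirrors the fla and fra operations above. The closure below is a sketch of that reading, not the paper's actual deduction rules; in particular the goal-item interpretation in the final comment is an assumption.

def arc_standard_items(n):
    """Bottom-up closure of items (i, h, j) for token positions 0..n-1.

    Reading: an item (i, h, j) asserts a push computation that consumes the
    buffer from position i up to (excluding) position j and pushes node h."""
    items = set()
    agenda = [(i, i, i + 1) for i in range(n)]       # axioms: one shift per token
    while agenda:
        item = agenda.pop()
        if item in items:
            continue
        items.add(item)
        i, h, j = item
        new = []
        for (k, h2, j2) in list(items):              # current item as the left partner
            if k == j:
                new.append((i, h2, j2))              # la: h2 heads h, h is removed
                new.append((i, h, j2))               # ra: h heads h2, h2 is removed
        for (i0, h1, k) in list(items):              # current item as the right partner
            if k == i:
                new.append((i0, h, j))               # la: h heads h1
                new.append((i0, h1, j))              # ra: h1 heads h
        agenda.extend(x for x in new if x not in items)
    return items

# On a 3-token input, an item (0, h, 3) corresponds to a computation that consumes
# the whole buffer and leaves the single node h on the stack, i.e. a complete parse
# rooted in h, assuming a terminal configuration is "empty buffer, one stack node".
print(sorted(it for it in arc_standard_items(3) if it[0] == 0 and it[2] == 3))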
wordName wordTfidf (topN-words)
[('computations', 0.389), ('push', 0.345), ('vw', 0.304), ('stack', 0.301), ('buffer', 0.219), ('configuration', 0.193), ('tabular', 0.165), ('transition', 0.163), ('transitions', 0.162), ('computation', 0.149), ('topmost', 0.14), ('arc', 0.133), ('fra', 0.13), ('dependency', 0.118), ('node', 0.117), ('tabulation', 0.114), ('sagae', 0.099), ('string', 0.095), ('deduction', 0.094), ('nivre', 0.093), ('terminal', 0.084), ('lang', 0.083), ('cm', 0.081), ('tomita', 0.076), ('ji', 0.074), ('configurations', 0.07), ('fla', 0.07), ('guez', 0.07), ('kuhlmann', 0.07), ('specifications', 0.07), ('parsing', 0.069), ('dynamic', 0.066), ('specified', 0.064), ('parsers', 0.063), ('removes', 0.063), ('satta', 0.06), ('programming', 0.059), ('ct', 0.056), ('jj', 0.056), ('huang', 0.052), ('initial', 0.05), ('packed', 0.05), ('write', 0.047), ('shift', 0.044), ('algorithms', 0.044), ('marco', 0.042), ('graph', 0.041), ('verify', 0.04), ('technique', 0.039), ('sh', 0.039), ('atoms', 0.038), ('departamento', 0.038), ('niit', 0.038), ('sceht', 0.038), ('billot', 0.038), ('computaci', 0.038), ('dde', 0.038), ('depart', 0.038), ('gorithms', 0.038), ('philology', 0.038), ('tihe', 0.038), ('wn', 0.038), ('initialization', 0.038), ('creates', 0.038), ('exponential', 0.037), ('reverse', 0.037), ('operation', 0.036), ('ra', 0.035), ('itnh', 0.035), ('iit', 0.035), ('semirings', 0.035), ('jh', 0.035), ('padua', 0.035), ('input', 0.035), ('attardi', 0.033), ('uf', 0.033), ('nco', 0.033), ('dei', 0.033), ('jw', 0.033), ('whe', 0.033), ('widelyused', 0.033), ('element', 0.033), ('la', 0.032), ('originally', 0.032), ('head', 0.031), ('ogf', 0.031), ('traced', 0.031), ('ahe', 0.031), ('giorgio', 0.031), ('right', 0.031), ('left', 0.03), ('exposed', 0.03), ('pushes', 0.03), ('transitionbased', 0.03), ('es', 0.028), ('triple', 0.028), ('conditions', 0.028), ('dependent', 0.027), ('core', 0.027), ('convention', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 107 acl-2011-Dynamic Programming Algorithms for Transition-Based Dependency Parsers
Author: Marco Kuhlmann ; Carlos Gomez-Rodriguez ; Giorgio Satta
Abstract: We develop a general dynamic programming technique for the tabulation of transition-based dependency parsers, and apply it to obtain novel, polynomial-time algorithms for parsing with the arc-standard and arc-eager models. We also show how to reverse our technique to obtain new transition-based dependency parsers from existing tabular methods. Additionally, we provide a detailed discussion of the conditions under which the feature models commonly used in transition-based parsing can be integrated into our algorithms.
2 0.17697759 143 acl-2011-Getting the Most out of Transition-based Dependency Parsing
Author: Jinho D. Choi ; Martha Palmer
Abstract: This paper suggests two ways of improving transition-based, non-projective dependency parsing. First, we add a transition to an existing non-projective parsing algorithm, so it can perform either projective or non-projective parsing as needed. Second, we present a bootstrapping technique that narrows down discrepancies between gold-standard and automatic parses used as features. The new addition to the algorithm shows a clear advantage in parsing speed. The bootstrapping technique gives a significant improvement to parsing accuracy, showing near state-of-the-art performance with respect to other parsing approaches evaluated on the same data set.
3 0.16517845 309 acl-2011-Transition-based Dependency Parsing with Rich Non-local Features
Author: Yue Zhang ; Joakim Nivre
Abstract: Transition-based dependency parsers generally use heuristic decoding algorithms but can accommodate arbitrarily rich feature representations. In this paper, we show that we can improve the accuracy of such parsers by considering even richer feature sets than those employed in previous systems. In the standard Penn Treebank setup, our novel features improve attachment score from 91.4% to 92.9%, giving the best results so far for transition-based parsing and rivaling the best results overall. For the Chinese Treebank, they give a significant improvement of the state of the art. An open source release of our parser is freely available.
4 0.11445749 282 acl-2011-Shift-Reduce CCG Parsing
Author: Yue Zhang ; Stephen Clark
Abstract: CCGs are directly compatible with binary-branching bottom-up parsing algorithms, in particular CKY and shift-reduce algorithms. While the chart-based approach has been the dominant approach for CCG, the shift-reduce method has been little explored. In this paper, we develop a shift-reduce CCG parser using a discriminative model and beam search, and compare its strengths and weaknesses with the chart-based C&C parser. We study different errors made by the two parsers, and show that the shift-reduce parser gives competitive accuracies compared to C&C. Considering our use of a small beam, and given the high ambiguity levels in an automatically-extracted grammar and the amount of information in the CCG lexical categories which form the shift actions, this is a surprising result.
5 0.11094499 236 acl-2011-Optimistic Backtracking - A Backtracking Overlay for Deterministic Incremental Parsing
Author: Gisle Ytrestøl
Abstract: This paper describes a backtracking strategy for an incremental deterministic transition-based parser for HPSG. The method could theoretically be implemented on any other transition-based parser with some adjustments. In this paper, the algorithm is evaluated on CuteForce, an efficient deterministic shift-reduce HPSG parser. The backtracking strategy may serve to improve existing parsers, or to assess if a deterministic parser would benefit from backtracking as a strategy to improve parsing.
6 0.097468406 210 acl-2011-Lexicographic Semirings for Exact Automata Encoding of Sequence Models
7 0.091405287 171 acl-2011-Incremental Syntactic Language Models for Phrase-based Translation
8 0.087805398 243 acl-2011-Partial Parsing from Bitext Projections
9 0.08730831 39 acl-2011-An Ensemble Model that Combines Syntactic and Semantic Clustering for Discriminative Dependency Parsing
10 0.080701947 167 acl-2011-Improving Dependency Parsing with Semantic Classes
11 0.079486392 186 acl-2011-Joint Training of Dependency Parsing Filters through Latent Support Vector Machines
12 0.076315135 127 acl-2011-Exploiting Web-Derived Selectional Preference to Improve Statistical Dependency Parsing
13 0.072483696 250 acl-2011-Prefix Probability for Probabilistic Synchronous Context-Free Grammars
14 0.070701234 268 acl-2011-Rule Markov Models for Fast Tree-to-String Translation
15 0.068504632 234 acl-2011-Optimal Head-Driven Parsing Complexity for Linear Context-Free Rewriting Systems
16 0.062840723 59 acl-2011-Better Automatic Treebank Conversion Using A Feature-Based Approach
17 0.058388989 48 acl-2011-Automatic Detection and Correction of Errors in Dependency Treebanks
18 0.058243014 23 acl-2011-A Pronoun Anaphora Resolution System based on Factorial Hidden Markov Models
19 0.05650723 61 acl-2011-Binarized Forest to String Translation
20 0.050695457 110 acl-2011-Effective Use of Function Words for Rule Generalization in Forest-Based Translation
topicId topicWeight
[(0, 0.128), (1, -0.052), (2, -0.046), (3, -0.184), (4, -0.021), (5, -0.033), (6, -0.029), (7, 0.052), (8, 0.018), (9, -0.008), (10, 0.033), (11, 0.041), (12, 0.032), (13, -0.058), (14, -0.006), (15, 0.005), (16, 0.024), (17, 0.004), (18, -0.016), (19, -0.035), (20, -0.035), (21, -0.01), (22, 0.043), (23, 0.044), (24, 0.016), (25, -0.137), (26, -0.026), (27, 0.011), (28, -0.091), (29, 0.005), (30, 0.006), (31, -0.027), (32, 0.019), (33, 0.048), (34, -0.044), (35, -0.009), (36, -0.009), (37, 0.116), (38, 0.018), (39, 0.026), (40, -0.032), (41, -0.121), (42, 0.062), (43, 0.028), (44, -0.092), (45, -0.076), (46, -0.114), (47, -0.076), (48, 0.02), (49, 0.071)]
simIndex simValue paperId paperTitle
same-paper 1 0.95600301 107 acl-2011-Dynamic Programming Algorithms for Transition-Based Dependency Parsers
Author: Marco Kuhlmann ; Carlos Gomez-Rodriguez ; Giorgio Satta
Abstract: We develop a general dynamic programming technique for the tabulation of transition-based dependency parsers, and apply it to obtain novel, polynomial-time algorithms for parsing with the arc-standard and arc-eager models. We also show how to reverse our technique to obtain new transition-based dependency parsers from existing tabular methods. Additionally, we provide a detailed discussion of the conditions under which the feature models commonly used in transition-based parsing can be integrated into our algorithms.
2 0.76533937 143 acl-2011-Getting the Most out of Transition-based Dependency Parsing
Author: Jinho D. Choi ; Martha Palmer
Abstract: This paper suggests two ways of improving transition-based, non-projective dependency parsing. First, we add a transition to an existing non-projective parsing algorithm, so it can perform either projective or non-projective parsing as needed. Second, we present a bootstrapping technique that narrows down discrepancies between gold-standard and automatic parses used as features. The new addition to the algorithm shows a clear advantage in parsing speed. The bootstrapping technique gives a significant improvement to parsing accuracy, showing near state-of-the-art performance with respect to other parsing approaches evaluated on the same data set.
3 0.72389764 243 acl-2011-Partial Parsing from Bitext Projections
Author: Prashanth Mannem ; Aswarth Dara
Abstract: Recent work has shown how a parallel corpus can be leveraged to build a syntactic parser for a target language by projecting automatic source parses onto the target sentence using word alignments. The projected target dependency parses are not always fully connected and hence not directly useful for training traditional dependency parsers. In this paper, we present a greedy non-directional parsing algorithm which doesn't need a fully connected parse and can learn from partial parses by utilizing available structural and syntactic information in them. Our parser achieved statistically significant improvements over a baseline system that trains on only fully connected parses for Bulgarian, Spanish and Hindi. It also gave a significant improvement over previously reported results for Bulgarian and set a benchmark for Hindi.
4 0.69796491 210 acl-2011-Lexicographic Semirings for Exact Automata Encoding of Sequence Models
Author: Brian Roark ; Richard Sproat ; Izhak Shafran
Abstract: In this paper we introduce a novel use of the lexicographic semiring and motivate its use for speech and language processing tasks. We prove that the semiring allows for exact encoding of backoff models with epsilon transitions. This allows for off-line optimization of exact models represented as large weighted finite-state transducers in contrast to implicit (on-line) failure transition representations. We present preliminary empirical results demonstrating that, even in simple intersection scenarios amenable to the use of failure transitions, the use of the more powerful lexicographic semiring is competitive in terms of time of intersection. 1 Introduction and Motivation Representing smoothed n-gram language models as weighted finite-state transducers (WFST) is most naturally done with a failure transition, which reflects the semantics of the “otherwise” formulation of smoothing (Allauzen et al., 2003). For example, the typical backoff formulation of the probability of a word w given a history h is as follows P(w | h) = ?αPh(Pw( |w h )| h0) oift hc(ehrwwis)e > 0 (1) where P is an empirical estimate of the probability that reserves small finite probability for unseen n-grams; αh is a backoff weight that ensures normalization; and h0 is a backoff history typically achieved by excising the earliest word in the history h. The principle benefit of encoding the WFST in this way is that it only requires explicitly storing n-gram transitions for observed n-grams, i.e., count greater than zero, as opposed to all possible n-grams of the given order which would be infeasible in for example large vocabulary speech recognition. This is a massive space savings, and such an approach is also used for non-probabilistic stochastic language 1 models, such as those trained with the perceptron algorithm (Roark et al., 2007), as the means to access all and exactly those features that should fire for a particular sequence in a deterministic automaton. Similar issues hold for other finite-state se- quence processing problems, e.g., tagging, bracketing or segmenting. Failure transitions, however, are an implicit method for representing a much larger explicit automaton in the case of n-gram models, all possible n-grams for that order. During composition with the model, the failure transition must be interpreted on the fly, keeping track of those symbols that have already been found leaving the original state, and only allowing failure transition traversal for symbols that have not been found (the semantics of “otherwise”). This compact implicit representation cannot generally be preserved when composing with other models, e.g., when combining a language model with a pronunciation lexicon as in widelyused FST approaches to speech recognition (Mohri et al., 2002). Moving from implicit to explicit representation when performing such a composition leads to an explosion in the size of the resulting transducer, frequently making the approach intractable. In practice, an off-line approximation to the model is made, typically by treating the failure transitions as epsilon transitions (Mohri et al., 2002; Allauzen et al., 2003), allowing large transducers to be composed and optimized off-line. These complex approximate transducers are then used during first-pass – decoding, and the resulting pruned search graphs (e.g., word lattices) can be rescored with exact language models encoded with failure transitions. 
Similar problems arise when building, say, POStaggers as WFST: not every pos-tag sequence will have been observed during training, hence failure transitions will achieve great savings in the size of models. Yet discriminative models may include complex features that combine both input stream (word) and output stream (tag) sequences in a single feature, yielding complicated transducer topologies for which effective use of failure transitions may not Proceedings Pofo trhtlea 4nd9,th O Arnegnouna,l J Muneeet 1in9g-2 o4f, t 2h0e1 A1s.s ?oc ci2a0t1io1n A fosrso Ccioamtiopnut faotrio Cnoaml Lpiuntgauti osntiacls: Lsihnogrutpisatipcesrs, pages 1–5, be possible. An exact encoding using other mechanisms is required in such cases to allow for off-line representation and optimization. In this paper, we introduce a novel use of a semiring the lexicographic semiring (Golan, 1999) which permits an exact encoding of these sorts of models with the same compact topology as with failure transitions, but using epsilon transitions. Unlike the standard epsilon approximation, this semiring allows for an exact representation, while also allowing (unlike failure transition approaches) for off-line – – composition with other transducers, with all the optimizations that such representations provide. In the next section, we introduce the semiring, followed by a proof that its use yields exact representations. We then conclude with a brief evaluation of the cost of intersection relative to failure transitions in comparable situations. 2 The Lexicographic Semiring Weighted automata are automata in which the transitions carry weight elements of a semiring (Kuich and Salomaa, 1986). A semiring is a ring that may lack negation, with two associative operations ⊕ and ⊗lac akn nde tghaetiiro nre,ws piethcti twveo i dasesnotictiya ievleem oepnertas t 0io annsd ⊕ ⊕ 1. a nAd ⊗com anmdo tnh esierm reirsipnegc tiivn es pideeenchti ayn edl elmanegnutasg 0e processing, and one that we will be using in this paper, is the tropical semiring (R ∪ {∞}, min, +, ∞, 0), i.e., tmhein t riosp tihcea l⊕ s omfi trhineg gs(e mRi∪rin{g∞ (w},imthi nid,e+nt,i∞ty ,∞0)), ia.end., m+ ins tihse t ⊗e o⊕f othfe t hseem seirminigri n(wg i(wth tidhe indteitnyt t0y). ∞Th)i asn ids a+pp isro thpreia ⊗te o ofof rth h pee srfeomrmiri nngg (Vwitietrhb iid seenatricthy u0s).in Tgh negative log probabilities – we add negative logs along a path and take the min between paths. A hW1 , W2 . . . Wni-lexicographic weight is a tupleA o hfW weights wherei- eeaxichco gorfa pthhiec w weeiigghhtt cisla ass teusW1, W2 . . . Wn, must observe the path property (Mohri, 2002). The path property of a semiring K is defined in terms of the natural order on K such that: a <2 ws e3 w&o4;)r The term “lexicographic” is an apt term for this semiring since the comparison for ⊕ is like the lexicseomgriaripnhgic s icnocmep thareis coonm opfa srtisrionngs f,o rco ⊕m ipsa lrikineg t thhee l feixrist- elements, then the second, and so forth. 3 Language model encoding 3.1 Standard encoding For language model encoding, we will differentiate between two classes of transitions: backoff arcs (labeled with a φ for failure, or with ? using our new semiring); and n-gram arcs (everything else, labeled with the word whose probability is assigned). Each state in the automaton represents an n-gram history string h and each n-gram arc is weighted with the (negative log) conditional probability of the word w labeling the arc given the history h. 
For a given history h and n-gram arc labeled with a word w, the destination of the arc is the state associated with the longest suffix of the string hw that is a history in the model. This will depend on the Markov order of the n-gram model. For example, consider the trigram model schematic shown in Figure 1, in which only history sequences of length 2 are kept in the model. Thus, from history hi = wi−2wi−1, the word wi transitions to hi+1 = wi−1wi, w2hii−ch1 is the longest suffix of hiwi in the modie−l1. As detailed in the “otherwise” semantics of equation 1, backoff arcs transition from state h to a state h0, typically the suffix of h of length |h| − 1, with we,i tgyhpti c(a−lllyog th αeh s)u. Wixe o cfa hll othf ele ndgestthin |hat|io −n 1s,ta wtei ah bwaecikgohtff s−taltoe.g αThis recursive backoff topology terminates at the unigram state, i.e., h = ?, no history. Backoff states of order k may be traversed either via φ-arcs from the higher order n-gram of order k + 1or via an n-gram arc from a lower order n-gram of order k −1. This means that no n-gram arc can enter tohred ezre rko−eth1. .o Trhdiesr mstaeaten s(fi tnhaalt bnaoc nk-ogfrfa),m ma andrc f cualln-o enrdteerr states history strings of length n − 1 for a model sotfa toersde —r n h may ihnagvse o n-gram a nrc −s e 1nt feorri nag m forodeml other full-order states as well as from backoff states of history size n − 2. — s—to 3.2 Encoding with lexicographic semiring For an LM machine M on the tropical semiring with failure transitions, which is deterministic and has the wih-2 =i1wφ/-logPwα(hi-1|whiφ)/-logwPhiα(+-1w i=|-1wiφ)/-logPαw(hi+)1 Figure 1: Deterministic finite-state representation of n-gram models with negative log probabilities (tropical semiring). The symbol φ labels backoff transitions. Modified from Roark and Sproat (2007), Figure 6.1. path property, we can simulate φ-arcs in a standard LM topology by a topologically equivalent machine M0 on the lexicographic hT, Ti semiring, where φ has boenen th hreep l eaxciceod gwraitphh eicps hilTo,nT, ais sfeomlloirwinsg. ,F worh every n-gram arc with label w and weight c, source state si and destination state sj, construct an n-gram arc with label w, weight h0, ci, source state si0, and deswtiniathtio lanb estla wte, s0j. gThhte h e0x,citi c, soosut rocfe e satcahte s state is constructed as follows. If the state is non-final, h∞, ∞i . sOttruhectrewdis aes fifo litl ofiwnsa.l Iwf tihthe e sxtiatt ec iosst n co nit- fwinilall ,b he∞ ∞h0,,∞ ∞cii . hLeertw n sbee tfh iet length oithf th exei longest history string iin. the model. For every φ-arc with (backoff) weight c, source state si, and destination state sj representing a history of length k, construct an ?-arc with source state si0, destination state s0j, and weight hΦ⊗(n−k) , ci, where Φ > 0 and Φ⊗(n−k) takes Φ to the (n − k)th power with the ⊗ operation. In the tropical semiring, ⊗ ris w +, so Φe⊗ ⊗(n o−pke) = (n − k)Φ. tFroorp iecxaalm sepmlei,r i nng a, t⊗rigi sra +m, msoo Φdel, if we= =ar (en b −ac kki)nΦg. off from a bigram state h (history length = 1) to a unigram state, n − k = 2 − 0 = 2, so we set the buanicgkroafmf w steaigteh,t nto −h2 kΦ, = =− l2og − α 0h) = =for 2 ,s soome w Φe s>et 0 th. cInk ofrfd were gtoh tco tom hb2iΦn,e −thleo gmαodel with another automaton or transducer, we would need to also convert those models to the hT, Ti semiring. 
For these aveutrotm thaotsae, mwoed seilmsp toly t uese hT a, Tdeif saeumlt rtrinagn.sf Foromra thtieosen such that every transition with weight c is assigned weight h0, ci . For example, given a word lattice wL,e iwghe tco h0n,vceir.t the lattice to L0 in the lexicographic semiring using this default transformation, and then perform the intersection L0 ∩ M0. By removing epsilon transitions and determ∩in Mizing the result, the low cost path for any given string will be retained in the result, which will correspond to the path achieved with Finally we project the second dimension of the hT, Ti weights to produce a lattice dini mtheen strioonpi ocfal t seem hTir,iTngi, wweihgichhts i tso e pqruoidvuacleen at ltaot tichee 3 result of L ∩ M, i.e., φ-arcs. C2(det(eps-rem(L0 ∩ M0))) = L ∩ M where C2 denotes projecting the second-dimension wofh tehree ChT, Ti weights, det(·) denotes determinizatoifon t,h aen hdT e,pTsi-r wemei(g·h) sde,n doette(s· )?- dreenmootveasl. d 4 Proof We wish to prove that for any machine N, ShortestPath(M0 ∩ N0) passes through the equivalent states in M0 to∩ t Nhose passed through in M for ShortestPath(M ∩ N) . Therefore determinization Sofh othrtee rsetsPualttihn(gM Mint ∩er Nse)c.ti Tonh rafefteorr e?- dreemteromvianl yzaiteilodns the same topology as intersection with the equivalent φ machine. Intuitively, since the first dimension of the hT, Ti weights is 0 for n-gram arcs and > 0 foofr t h beac hkTo,ffT arcs, tghhet ss ihsor 0te fostr p na-tghr awmil la rtcrasv aenrdse > >the 0 fewest possible backoff arcs; further, since higherorder backoff arcs cost less in the first dimension of the hT, Ti weights in M0, the shortest path will intchleud heT n-gram iagrhcst sa ti nth Meir earliest possible point. We prove this by induction on the state-sequence of the path p/p0 up to a given state si/si0 in the respective machines M/M0. Base case: If p/p0 is of length 0, and therefore the states si/si0 are the initial states of the respective machines, the proposition clearly holds. Inductive step: Now suppose that p/p0 visits s0...si/s00...si0 and we have therefore reached si/si0 in the respective machines. Suppose the cumulated weights of p/p0 are W and hΨ, Wi, respectively. We wish to show thaarte w Whic anhedv heΨr sj isi n reexspt evcitsiivteedly o. nW p (i.e., the path becomes s0...sisj) the equivalent state s0 is visited on p0 (i.e., the path becomes s00...si0s0j). Let w be the next symbol to be matched leaving states si and si0. There are four cases to consider: (1) there is an n-gram arc leaving states si and si0 labeled with w, but no backoff arc leaving the state; (2) there is no n-gram arc labeled with w leaving the states, but there is a backoff arc; (3) there is no ngram arc labeled with w and no backoff arc leaving the states; and (4) there is both an n-gram arc labeled with w and a backoff arc leaving the states. In cases (1) and (2), there is only one possible transition to take in either M or M0, and based on the algorithm for construction of M0 given in Section 3.2, these transitions will point to sj and s0j respectively. Case (3) leads to failure of intersection with either machine. This leaves case (4) to consider. In M, since there is a transition leaving state si labeled with w, the backoff arc, which is a failure transition, cannot be traversed, hence the destination of the n-gram arc sj will be the next state in p. However, in M0, both the n-gram transition labeled with w and the backoff transition, now labeled with ?, can be traversed. 
What we will now prove is that the shortest path through M0 cannot include taking the backoff arc in this case. In order to emit w by taking the backoff arc out of state si0, one or more backoff (?) transitions must be taken, followed by an n-gram arc labeled with w. Let k be the order of the history represented by state si0, hence the cost of the first backoff arc is h(n − k)Φ, −log(αsi0 )i in our semiring. If we tirsa vhe(rns e− km) Φ b,a−ckloofgf( αarcs) ip irnior o tro eemmiitrtiinngg. the w, the first dimension of our accumulated cost will be m(n −k + m−21)Φ, based on our algorithm for consmtr(unct−ionk +of M0 given in Section 3.2. Let sl0 be the destination state after traversing m backoff arcs followed by an n-gram arc labeled with w. Note that, by definition, m ≤ k, and k − m + 1 is the orbdeyr oeffi nstitaitoen ,sl0 m. B≤ as ked, onnd t khe − c mons +tru 1ct iiosn t ealg oor-rithm, the state sl0 is also reachable by first emitting w from state si0 to reach state s0j followed by some number of backoff transitions. The order of state s0j is either k (if k is the highest order in the model) or k + 1 (by extending the history of state si0 by one word). If it is of order k, then it will require m −1 backoff arcs to reach state sl0, one fewer tqhuainre t mhe− −pa1t hb ctok osftfat aer ss0l oth raeta cbheg sitanste w sith a backoff arc, for a total cost of (m − 1) (n − k + m−21)Φ which is less than m(n − k + m−21)Φ. If state s0j icish o ifs o lerdsser hka n+ m1,( th −er ke +will be m backoff arcs to reach state sl0, but with a total cost of m(n − (k + 1) + m−21)Φ m(n − k + m−23)Φ = which is also less than m(n − km + m−21)Φ. Hence twheh cstha ties asl0ls coa lne asl twhaayns mbe( n re −ac khe +d from si0 with a lower cost through state s0j than by first taking the backoff arc from si0. Therefore the shortest path on M0 must follow s00...si0s0j. 2 This completes the proof. 5 Experimental Comparison of ?, φ and hT, Ti encoded language models For our experiments we used lattices derived from a very large vocabulary continuous speech recognition system, which was built for the 2007 GALE Arabic speech recognition task, and used in the work reported in Lehr and Shafran (201 1). The lexicographic semiring was evaluated on the development 4 set (2.6 hours of broadcast news and conversations; 18K words). The 888 word lattices for the development set were generated using a competitive baseline system with acoustic models trained on about 1000 hrs of Arabic broadcast data and a 4-gram language model. The language model consisting of 122M n-grams was estimated by interpolation of 14 components. The vocabulary is relatively large at 737K and the associated dictionary has only single pronunciations. The language model was converted to the automaton topology described earlier, and represented in three ways: first as an approximation of a failure machine using epsilons instead of failure arcs; second as a correct failure machine; and third using the lexicographic construction derived in this paper. The three versions of the LM were evaluated by intersecting them with the 888 lattices of the development set. The overall error rate for the systems was 24.8%—comparable to the state-of-theart on this task1 . For the shortest paths, the failure and lexicographic machines always produced identical lattices (as determined by FST equivalence); in contrast, 81% of the shortest paths from the epsilon approximation are different, at least in terms of weights, from the shortest paths using the failure LM. 
For full lattices, 42 (4.7%) of the lexicographic outputs differ from the failure LM outputs, due to small floating point rounding issues; 863 (97%) of the epsilon approximation outputs differ. In terms of size, the failure LM, with 5.7 million arcs requires 97 Mb. The equivalent hT, Tillieoxnico argcrasp rheqicu iLreMs r 9e7qu Mireb.s 1 T20h eM ebq,u idvuael eton tth heT ,dToui-bling of the size of the weights.2 To measure speed, we performed the intersections 1000 times for each of our 888 lattices on a 2993 MHz Intel?R Xeon?R CPU, and took the mean times for each of our methods. The 888 lattices were processed with a mean of 1.62 seconds in total (1.8 msec per lattice) using the failure LM; using the hT, Ti-lexicographic iLnMg t rheequ fairieldur 1e.8 L Msec;o unsdinsg g(2 t.h0e m hTse,cT per lxaitctiocger)a, pahnidc is thus about 11% slower. Epsilon approximation, where the failure arcs are approximated with epsilon arcs took 1.17 seconds (1.3 msec per lattice). The 1The error rate is a couple of points higher than in Lehr and Shafran (2011) since we discarded non-lexical words, which are absent in maximum likelihood estimated language model and are typically augmented to the unigram backoff state with an arbitrary cost, fine-tuned to optimize performance for a given task. 2If size became an issue, the first dimension of the hT, TiweigIhft scizane bbee c raemprees aennt iesdsu eby, tah esi fnigrlste bdyimtee. slightly slower speeds for the exact method using the failure LM, and hT, Ti can be related to the overhfaeialdur eof L cMom, apnudtin hgT ,tThei f caailnur bee f urenlcattieodn aot rhuen toivmeer,and determinization, respectively. 6 Conclusion In this paper we have introduced a novel application of the lexicographic semiring, proved that it can be used to provide an exact encoding of language model topologies with failure arcs, and provided experimental results that demonstrate its efficiency. Since the hT, Ti-lexicographic semiring is both lefSt-i nacned hr iegh htT-d,iTstrii-bleuxtiicvoe,g roatphheirc o spetmimiriiznagtions such as minimization are possible. The particular hT, Ti-lexicographic semiring we have used thiceruel airs h bTu,t Toni-el oxifc many h piocss siebmlei ilnexgic woegr haapvheic u esendcodings. We are currently exploring the use of a lexicographic semiring that involves different semirings in the various dimensions, for the integration of part-of-speech taggers into language models. An implementation of the lexicographic semiring by the second author is already available as part of the OpenFst package (Allauzen et al., 2007). The methods described here are part of the NGram language-model-training toolkit, soon to be released at opengrm .org. Acknowledgments This research was supported in part by NSF Grant #IIS-081 1745 and DARPA grant #HR001 1-09-10041. Any opinions, findings, conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the NSF or DARPA. We thank Maider Lehr for help in preparing the test data. We also thank the ACL reviewers for valuable comments. References Cyril Allauzen, Mehryar Mohri, and Brian Roark. 2003. Generalized algorithms for constructing statistical language models. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 40–47. Cyril Allauzen, Michael Riley, Johan Schalkwyk, Wojciech Skut, and Mehryar Mohri. 2007. OpenFst: A general and efficient weighted finite-state transducer library. 
In Proceedings of the Twelfth International Conference on Implementation and Application of Automata (CIAA 2007), Lecture Notes in Computer Sci5 ence, volume 4793, pages 11–23, Prague, Czech Republic. Springer. Jonathan Golan. 1999. Semirings and their Applications. Kluwer Academic Publishers, Dordrecht. Werner Kuich and Arto Salomaa. 1986. Semirings, Automata, Languages. Number 5 in EATCS Monographs on Theoretical Computer Science. SpringerVerlag, Berlin, Germany. Maider Lehr and Izhak Shafran. 2011. Learning a discriminative weighted finite-state transducer for speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, July. Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. 2002. Weighted finite-state transducers in speech recognition. Computer Speech and Language, 16(1):69–88. Mehryar Mohri. 2002. Semiring framework and algorithms for shortest-distance problems. Journal of Automata, Languages and Combinatorics, 7(3):321–350. Brian Roark and Richard Sproat. 2007. Computational Approaches to Morphology and Syntax. Oxford University Press, Oxford. Brian Roark, Murat Saraclar, and Michael Collins. 2007. Discriminative n-gram language modeling. Computer Speech and Language, 21(2):373–392.
5 0.69220233 236 acl-2011-Optimistic Backtracking - A Backtracking Overlay for Deterministic Incremental Parsing
Author: Gisle Ytrestøl
Abstract: This paper describes a backtracking strategy for an incremental deterministic transition-based parser for HPSG. The method could theoretically be implemented on any other transition-based parser with some adjustments. In this paper, the algorithm is evaluated on CuteForce, an efficient deterministic shift-reduce HPSG parser. The backtracking strategy may serve to improve existing parsers, or to assess if a deterministic parser would benefit from backtracking as a strategy to improve parsing.
6 0.67756379 295 acl-2011-Temporal Restricted Boltzmann Machines for Dependency Parsing
7 0.66007406 186 acl-2011-Joint Training of Dependency Parsing Filters through Latent Support Vector Machines
8 0.61604601 309 acl-2011-Transition-based Dependency Parsing with Rich Non-local Features
9 0.57471108 234 acl-2011-Optimal Head-Driven Parsing Complexity for Linear Context-Free Rewriting Systems
10 0.55435234 39 acl-2011-An Ensemble Model that Combines Syntactic and Semantic Clustering for Discriminative Dependency Parsing
11 0.52431756 127 acl-2011-Exploiting Web-Derived Selectional Preference to Improve Statistical Dependency Parsing
12 0.51889759 282 acl-2011-Shift-Reduce CCG Parsing
13 0.51104647 230 acl-2011-Neutralizing Linguistically Problematic Annotations in Unsupervised Dependency Parsing Evaluation
14 0.45028755 167 acl-2011-Improving Dependency Parsing with Semantic Classes
15 0.44112238 184 acl-2011-Joint Hebrew Segmentation and Parsing using a PCFGLA Lattice Parser
16 0.42537108 48 acl-2011-Automatic Detection and Correction of Errors in Dependency Treebanks
18 0.41603285 5 acl-2011-A Comparison of Loopy Belief Propagation and Dual Decomposition for Integrated CCG Supertagging and Parsing
19 0.41194075 111 acl-2011-Effects of Noun Phrase Bracketing in Dependency Parsing and Machine Translation
20 0.40725094 171 acl-2011-Incremental Syntactic Language Models for Phrase-based Translation
topicId topicWeight
[(5, 0.01), (17, 0.573), (37, 0.089), (39, 0.035), (41, 0.072), (59, 0.013), (72, 0.011), (91, 0.024), (96, 0.081)]
simIndex simValue paperId paperTitle
same-paper 1 0.93830067 107 acl-2011-Dynamic Programming Algorithms for Transition-Based Dependency Parsers
Author: Marco Kuhlmann ; Carlos Gomez-Rodriguez ; Giorgio Satta
Abstract: We develop a general dynamic programming technique for the tabulation of transition-based dependency parsers, and apply it to obtain novel, polynomial-time algorithms for parsing with the arc-standard and arc-eager models. We also show how to reverse our technique to obtain new transition-based dependency parsers from existing tabular methods. Additionally, we provide a detailed discussion of the conditions under which the feature models commonly used in transition-based parsing can be integrated into our algorithms.
2 0.88413823 109 acl-2011-Effective Measures of Domain Similarity for Parsing
Author: Barbara Plank ; Gertjan van Noord
Abstract: It is well known that parsing accuracy suffers when a model is applied to out-of-domain data. It is also known that the most beneficial data to parse a given domain is data that matches the domain (Sekine, 1997; Gildea, 2001). Hence, an important task is to select appropriate domains. However, most previous work on domain adaptation relied on the implicit assumption that domains are somehow given. As more and more data becomes available, automatic ways to select data that is beneficial for a new (unknown) target domain are becoming attractive. This paper evaluates various ways to automatically acquire related training data for a given test set. The results show that an unsupervised technique based on topic models is effective – it outperforms random data selection on both languages exam- ined, English and Dutch. Moreover, the technique works better than manually assigned labels gathered from meta-data that is available for English. 1 Introduction and Motivation Previous research on domain adaptation has focused on the task of adapting a system trained on one domain, say newspaper text, to a particular new domain, say biomedical data. Usually, some amount of (labeled or unlabeled) data from the new domain was given which has been determined by a human. However, with the growth of the web, more and more data is becoming available, where each document “is potentially its own domain” (McClosky et al., 2010). It is not straightforward to determine – 1566 Gertjan van Noord University of Groningen The Netherlands G J M van Noord@ rug nl . . . . . which data or model (in case we have several source domain models) will perform best on a new (unknown) target domain. Therefore, an important issue that arises is how to measure domain similarity, i.e. whether we can find a simple yet effective method to determine which model or data is most beneficial for an arbitrary piece of new text. Moreover, if we had such a measure, a related question is whether it can tell us something more about what is actually meant by “domain”. So far, it was mostly arbitrarily used to refer to some kind of coherent unit (related to topic, style or genre), e.g.: newspaper text, biomedical abstracts, questions, fiction. Most previous work on domain adaptation, for instance Hara et al. (2005), McClosky et al. (2006), Blitzer et al. (2006), Daum e´ III (2007), sidestepped this problem of automatic domain selection and adaptation. For parsing, to our knowledge only one recent study has started to examine this issue (McClosky et al., 2010) we will discuss their approach in Section 2. Rather, an implicit assumption of all of these studies is that domains are given, i.e. that they are represented by the respective corpora. Thus, a corpus has been considered a homogeneous unit. As more data is becoming available, it is unlikely that – domains will be ‘given’ . Moreover, a given corpus might not always be as homogeneous as originally thought (Webber, 2009; Lippincott et al., 2010). For instance, recent work has shown that the well-known Penn Treebank (PT) Wall Street Journal (WSJ) actually contains a variety of genres, including letters, wit and short verse (Webber, 2009). In this study we take a different approach. Rather than viewing a given corpus as a monolithic entity, ProceedingPso orftla thned 4,9 Otrhe Agonnn,u Jauln Mee 1e9t-i2ng4, o 2f0 t1h1e. 
A ?c s 2o0ci1a1ti Aonss foocria Ctioomnp fourta Ctioomnaplu Ltaintigouniaslti Lcisn,g puaigsetsic 1s566–1576, we break it down to the article-level and disregard corpora boundaries. Given the resulting set of documents (articles), we evaluate various ways to automatically acquire related training data for a given test set, to find answers to the following questions: • Given a pool of data (a collection of articles fGriovmen nun ak pnooowln o domains) caonldle a test article, eiss there a way to automatically select data that is relevant for the new domain? If so: • Which similarity measure is good for parsing? • How does it compare to human-annotated data? • Is the measure also useful for other languages Iasnd th/oer mtaesakssu?r To this end, we evaluate measures of domain similarity and feature representations and their impact on dependency parsing accuracy. Given a collection of annotated articles, and a new article that we want to parse, we want to select the most similar articles to train the best parser for that new article. In the following, we will first compare automatic measures to human-annotated labels by examining parsing performance within subdomains of the Penn Treebank WSJ. Then, we extend the experiments to the domain adaptation scenario. Experiments were performed on two languages: English and Dutch. The empirical results show that a simple measure based on topic distributions is effective for both languages and works well also for Part-of-Speech tagging. As the approach is based on plain surfacelevel information (words) and it finds related data in a completely unsupervised fashion, it can be easily applied to other tasks or languages for which annotated (or automatically annotated) data is available. 2 Related Work The work most related to ours is McClosky et al. (2010). They try to find the best combination of source models to parse data from a new domain, which is related to Plank and Sima’an (2008). In the latter, unlabeled data was used to create several parsers by weighting trees in the WSJ according to their similarity to the subdomain. McClosky et al. (2010) coined the term multiple source domain adaptation. Inspired by work on parsing accuracy 1567 prediction (Ravi et al., 2008), they train a linear regression model to predict the best (linear interpolation) of source domain models. Similar to us, McClosky et al. (2010) regard a target domain as mixture of source domains, but they focus on phrasestructure parsing. Furthermore, our approach differs from theirs in two respects: we do not treat source corpora as one entity and try to mix models, but rather consider articles as base units and try to find subsets of related articles (the most similar articles); moreover, instead of creating a supervised model (in their case to predict parsing accuracy), our approach is ‘simplistic’ : we apply measures of domain simi- larity directly (in an unsupervised fashion), without the necessity to train a supervised model. Two other related studies are (Lippincott et al., 2010; Van Asch and Daelemans, 2010). Van Asch and Daelemans (2010) explore a measure of domain difference (Renyi divergence) between pairs of domains and its correlation to Part-of-Speech tagging accuracy. Their empirical results show a linear correlation between the measure and the performance loss. Their goal is different, but related: rather than finding related data for a new domain, they want to estimate the loss in accuracy of a PoS tagger when applied to a new domain. 
We will briefly discuss results obtained with the Renyi divergence in Section 5.1. Lippincott et al. (2010) examine subdomain variation in biomedicine corpora and propose awareness of NLP tools to such variation. However, they did not yet evaluate the effect on a practical task, thus our study is somewhat complementary to theirs. The issue of data selection has recently been examined for Language Modeling (Moore and Lewis, 2010). A subset of the available data is automatically selected as training data for a Language Model based on a scoring mechanism that compares cross- entropy scores. Their approach considerably outperformed random selection and two previous proposed approaches both based on perplexity scoring.1 3 Measures of Domain Similarity 3.1 Measuring Similarity Automatically Feature Representations A similarity function may be defined over any set of events that are con1We tested data selection by perplexity scoring, but found the Language Models too small to be useful in our setting. sidered to be relevant for the task at hand. For parsing, these might be words, characters, n-grams (of words or characters), Part-of-Speech (PoS) tags, bilexical dependencies, syntactic rules, etc. However, to obtain more abstract types such as PoS tags or dependency relations, one would first need to gather respective labels. The necessary tools for this are again trained on particular corpora, and will suffer from domain shifts, rendering labels noisy. Therefore, we want to gauge the effect of the simplest representation possible: plain surface characteristics (unlabeled text). This has the advantage that we do not need to rely on additional supervised tools; moreover, it is interesting to know how far we can get with this level of information only. We examine the following feature representations: relative frequencies of words, relative frequencies of character tetragrams, and topic models. Our motivation was as follows. Relative frequencies of words are a simple and effective representation used e.g. in text classification (Manning and Sch u¨tze, 1999), while character n-grams have proven successful in genre classification (Wu et al., 2010). Topic models (Blei et al., 2003; Steyvers and Griffiths, 2007) can be considered an advanced model over word distributions: every article is represented by a topic distribution, which in turn is a distribution over words. Similarity between documents can be measured by comparing topic distributions. Similarity Functions There are many possible similarity (or distance) functions. They fall broadly into two categories: probabilistically-motivated and geometrically-motivated functions. The similarity functions examined in this study will be described in the following. The Kullback-Leibler (KL) divergence D(q| |r) is a cTlahsesic Kaull measure oibfl ‘edri s(KtaLn)ce d’i2v ebregtweneceen D Dtw(oq probability distributions, and is defined as: D(q| |r) = Pyq(y)logrq((yy)). It is a non-negative, additive, aPsymmetric measure, and 0 iff the two distributions are identical. However, the KL-divergence is undefined if there exists an event y such that q(y) > 0 but r(y) = 0, which is a property that “makes it unsuitable for distributions derived via maximumlikelihood estimates” (Lee, 2001). 2It is not a proper distance metric since it is asymmetric. 1568 One option to overcome this limitation is to apply smoothing techniques to gather non-zero estimates for all y. 
The alternative, examined in this paper, is to consider an approximation to the KL divergence, such as the Jensen-Shannon (JS) divergence (Lin, 1991) and the skew divergence (Lee, 2001). The Jensen-Shannon divergence, which is symmetric, computes the KL-divergence between q, r, and the average between the two. We use the JS divergence as defined in Lee (2001): JS(q, r) = [D(q| |avg(q, r)) + D(r| |avg(q, r))] . The asymm[eDtr(icq |s|akvewg( divergence sα, proposed by Lee (2001), mixes one distribution with the other by a degree de- 21 fined by α ∈ [0, 1) : sα (q, r, α) = D(q| |αr + (1 α)q). Ays α α approaches 1, rt,hαe )sk =ew D divergence approximates the KL-divergence. An alternative way to measure similarity is to consider the distributions as vectors and apply geometrically-motivated distance functions. This family of similarity functions includes the cosine cos(q, r) = qq(y) · r(y)/ | |q(y) | | | |r(y) | |, euclidean − euc(q,r) = qPy(q(y) − r(y))2 and variational (also known asq LP1 or MPanhattan) distance function, defined as var(q, r) = Py |q(y) − r(y) |. 3.2 Human-annotatePd data In contrast to the automatic measures devised in the previous section, we might have access to human annotated data. That is, use label information such as topic or genre to define the set of similar articles. Genre For the Penn Treebank (PT) Wall Street Journal (WSJ) section, more specifically, the subset available in the Penn Discourse Treebank, there exists a partition of the data by genre (Webber, 2009). Every article is assigned one of the following genre labels: news, letters, highlights, essays, errata, wit and short verse, quarterly progress reports, notable and quotable. This classification has been made on the basis of meta-data (Webber, 2009). It is wellknown that there is no meta-data directly associated with the individual WSJ files in the Penn Treebank. However, meta-data can be obtained by looking at the articles in the ACL/DCI corpus (LDC99T42), and a mapping file that aligns document numbers of DCI (DOCNO) to WSJ keys (Webber, 2009). An example document is given in Figure 1. The metadata field HL contains headlines, SO source info, and the IN field includes topic markers.
3 0.85325283 19 acl-2011-A Mobile Touchable Application for Online Topic Graph Extraction and Exploration of Web Content
Author: Gunter Neumann ; Sven Schmeier
Abstract: We present a mobile touchable application for online topic graph extraction and exploration of web content. The system has been implemented for operation on an iPad. The topic graph is constructed from N web snippets which are determined by a standard search engine. We consider the extraction of a topic graph as a specific empirical collocation extraction task where collocations are extracted between chunks. Our measure of association strength is based on the pointwise mutual information between chunk pairs which explicitly takes their distance into account. An initial user evaluation shows that this system is especially helpful for finding new interesting information on topics about which the user has only a vague idea or even no idea at all.
4 0.81836289 118 acl-2011-Entrainment in Speech Preceding Backchannels.
Author: Rivka Levitan ; Agustin Gravano ; Julia Hirschberg
Abstract: In conversation, when speech is followed by a backchannel, evidence of continued engagement by one's dialogue partner, that speech displays a combination of cues that appear to signal to one's interlocutor that a backchannel is appropriate. We term these cues backchannel-preceding cues (BPCs), and examine the Columbia Games Corpus for evidence of entrainment on such cues. Entrainment, the phenomenon of dialogue partners becoming more similar to each other, is widely believed to be crucial to conversation quality and success. Our results show that speaking partners entrain on BPCs; that is, they tend to use similar sets of BPCs; this similarity increases over the course of a dialogue; and this similarity is associated with measures of dialogue coordination and task success.
5 0.81598049 268 acl-2011-Rule Markov Models for Fast Tree-to-String Translation
Author: Ashish Vaswani ; Haitao Mi ; Liang Huang ; David Chiang
Abstract: Most statistical machine translation systems rely on composed rules (rules that can be formed out of smaller rules in the grammar). Though this practice improves translation by weakening independence assumptions in the translation model, it nevertheless results in huge, redundant grammars, making both training and decoding inefficient. Here, we take the opposite approach, where we only use minimal rules (those that cannot be formed out of other rules), and instead rely on a rule Markov model of the derivation history to capture dependencies between minimal rules. Large-scale experiments on a state-of-the-art tree-to-string translation system show that our approach leads to a slimmer model, a faster decoder, yet the same translation quality (measured using BLEU) as composed rules.
6 0.7722612 43 acl-2011-An Unsupervised Model for Joint Phrase Alignment and Extraction
7 0.5961588 180 acl-2011-Issues Concerning Decoding with Synchronous Context-free Grammar
8 0.57859528 30 acl-2011-Adjoining Tree-to-String Translation
9 0.53832608 87 acl-2011-Corpus Expansion for Statistical Machine Translation with Semantic Role Label Substitution Rules
10 0.51086766 32 acl-2011-Algorithm Selection and Model Adaptation for ESL Correction Tasks
11 0.50605452 296 acl-2011-Terminal-Aware Synchronous Binarization
12 0.50331748 141 acl-2011-Gappy Phrasal Alignment By Agreement
13 0.49232173 154 acl-2011-How to train your multi bottom-up tree transducer
14 0.48783854 61 acl-2011-Binarized Forest to String Translation
15 0.47942913 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction
16 0.47554129 277 acl-2011-Semi-supervised Relation Extraction with Large-scale Word Clustering
17 0.47348499 250 acl-2011-Prefix Probability for Probabilistic Synchronous Context-Free Grammars
19 0.4629761 50 acl-2011-Automatic Extraction of Lexico-Syntactic Patterns for Detection of Negation and Speculation Scopes
20 0.45973706 110 acl-2011-Effective Use of Function Words for Rule Generalization in Forest-Based Translation