emnlp emnlp2013 emnlp2013-66 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: He He ; Hal Daumé III ; Jason Eisner
Abstract: Feature computation and exhaustive search have significantly restricted the speed of graph-based dependency parsing. We propose a faster framework of dynamic feature selection, where features are added sequentially as needed, edges are pruned early, and decisions are made online for each sentence. We model this as a sequential decision-making problem and solve it by imitation learning techniques. We test our method on 7 languages. Our dynamic parser can achieve accuracies comparable or even superior to parsers using a full set of features, while computing fewer than 30% of the feature templates.
Reference: text
sentIndex sentText
1 We propose a faster framework of dynamic feature selection, where features are added sequentially as needed, edges are pruned early, and decisions are made online for each sentence.
2 Our dynamic parser can achieve accuracies comparable or even superior to parsers using a full set of features, while computing fewer than 30% of the feature templates.
3 In the scoring stage, we score all possible edges (or other small substructures) using a learned function; in the decoding stage, we use combinatorial optimization to find the dependency tree with the highest total score.
4 As a result, parsing time is dominated by the scoring stage: computing edge attributes, using them to instantiate feature templates, and looking up the weights of the resulting features in a hash table.
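To make the dominant cost concrete, here is a minimal sketch of such an edge scorer; the template representation and the `weights` hash table are illustrative assumptions, not the authors' implementation.

```python
def score_edge(h, m, sent, templates, weights):
    """Score a candidate edge h -> m by instantiating each active feature
    template on the edge's attributes and summing the weights found in
    the hash table; unseen features contribute nothing."""
    total = 0.0
    for template in templates:
        feat = template(h, m, sent)       # e.g. "HPOS=VBD|MPOS=NN|DIR=R|LEN=3"
        total += weights.get(feat, 0.0)   # hash-table lookup dominates runtime
    return total
```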
5 For example, McDonald et al. (2005a) used on average about 120 first-order feature templates on each edge, built from attributes such as the edge direction and length, among others.
6 Motivated by recent progress on dynamic feature selection (Benbouzid et al., 2012), we propose to add features one group at a time to the dependency graph, and to use these features together with interactions among edges (as determined by intermediate parsing results) to make hard decisions on some edges before all their features have been seen.
8 However, in place of relatively simple heuristics such as a global relative pruning threshold, we learn a featurized decision-making policy of a more complex form.
9 Previous work has made much progress on the complementary problem: speeding up the decoding stage by pruning the search space of tree structures.
10 In the recent vine pruning approach (Rush and Petrov, 2012), significant speedup is gained by leveraging structured information via a coarse-to-fine projective parsing cascade.
11 Although pruned edges do not require further feature computation, the pruning step must itself compute similar high-dimensional features just to decide which edges to prune.
12 We aim to do feature selection and edge pruning dynamically, balancing speed and accuracy by using only as many features as needed.
13 In this paper, we first explore standard static feature selection methods for dependency parsing, and show that even a few feature templates can give decent accuracy (Section 3).
14 However, our approach applies to other graph-based dependency parsers as well, including non-projective parsing, higher-order parsing, or approximations to higher-order parsing that use stacking (Martins et al., 2008).
15 2 Graph-based Dependency Parsing: In graph-based dependency parsing of an n-word input sentence, we must construct a tree y whose vertices 0, 1, . . . , n correspond to a special root symbol (vertex 0) and the n words of the sentence.
16 Each directed edge of this tree points from a head (parent) to one of its modifiers (child).
17 The first-order model decomposes the tree into edges E of the form ⟨h, m⟩, where h ∈ [0, n] and m ∈ [1, n] (with h ≠ m) are a head token and one of its modifiers.
18 Since scoring the edges independently in this way restricts the parser to a local view of the dependency structure, higher-order models can achieve better accuracy.
19 We can further require the tree to be projective, meaning that edges are not allowed to cross each other.
20 It is well known that dynamic programming can be used to find the best projective dependency tree in O(n³) time, much as in CKY, for first-order models and some higher-order models (Eisner, 1996; McDonald and Pereira, 2006).
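As an illustration of the O(n³) projective decoder cited here, below is a minimal sketch of Eisner's first-order algorithm; the NumPy layout and variable names are our own, edges into the root are assumed to be forbidden by setting their scores to -inf, and multiple root attachments are allowed for simplicity.

```python
import numpy as np

def eisner(score):
    """First-order projective decoding (Eisner, 1996) in O(n^3) time.
    score[h, m] is the score of directed edge h -> m; vertex 0 is the
    artificial root (set score[:, 0] = -inf).  Returns heads of words 1..n."""
    n = score.shape[0] - 1
    NEG = float("-inf")
    # [s, t, d]: best sub-derivation over span s..t with the head on the
    # right (d=0) or left (d=1); *_bp stores split points for backtracking.
    comp = np.full((n + 1, n + 1, 2), NEG)
    inc = np.full((n + 1, n + 1, 2), NEG)
    comp_bp = np.zeros((n + 1, n + 1, 2), int)
    inc_bp = np.zeros((n + 1, n + 1, 2), int)
    for i in range(n + 1):
        comp[i, i, 0] = comp[i, i, 1] = 0.0
    for k in range(1, n + 1):                    # span width
        for s in range(n + 1 - k):
            t = s + k
            # build an incomplete span by joining two complete ones
            r = max(range(s, t), key=lambda r: comp[s, r, 1] + comp[r + 1, t, 0])
            base = comp[s, r, 1] + comp[r + 1, t, 0]
            inc[s, t, 0], inc_bp[s, t, 0] = base + score[t, s], r   # t heads s
            inc[s, t, 1], inc_bp[s, t, 1] = base + score[s, t], r   # s heads t
            # extend an incomplete span into a complete one
            r0 = max(range(s, t), key=lambda r: comp[s, r, 0] + inc[r, t, 0])
            comp[s, t, 0], comp_bp[s, t, 0] = comp[s, r0, 0] + inc[r0, t, 0], r0
            r1 = max(range(s + 1, t + 1), key=lambda r: inc[s, r, 1] + comp[r, t, 1])
            comp[s, t, 1], comp_bp[s, t, 1] = inc[s, r1, 1] + comp[r1, t, 1], r1
    head = [0] * (n + 1)
    def walk(s, t, d, complete):
        if s == t:
            return
        if complete:
            r = comp_bp[s, t, d]
            if d == 1:
                walk(s, r, 1, False); walk(r, t, 1, True)
            else:
                walk(s, r, 0, True); walk(r, t, 0, False)
        else:                                    # record the edge of this span
            if d == 0:
                head[s] = t
            else:
                head[t] = s
            r = inc_bp[s, t, d]
            walk(s, r, 1, True); walk(r + 1, t, 0, True)
    walk(0, n, 1, True)
    return head[1:]
```

Calling `eisner(S)` on an (n+1)×(n+1) score matrix with `S[:, 0] = -inf` returns the projective head sequence for the sentence.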
21 3 Dynamic Feature Selection: Unlike typical feature selection methods that fix a subset of selected features and use it throughout testing, in dynamic feature selection we choose features adaptively for each instance.
22 One approximate solution (McDonald and Pereira, 2006) works by doing projective parsing and then rearranging edges.
23 (Figure 1a) Start with all possible edges except those filtered by the length dictionary.
24 (Figure 1b–e) Add the next group of feature templates and parse using the non-projective parser.
25 Predicted trees are shown as blue and red edges, where red indicates the edges that we then decide to lock.
26 3.1 Sequential Decision Making: Our work is motivated by recent progress on dynamic feature selection (Benbouzid et al., 2012).
27 In the specific case of dynamic feature selection, when the system is in a given state, it decides whether to add some more features or to stop and make a prediction based on the features added so far.
28 Usually the sequential decision-making problem is solved by reinforcement learning (Sutton and Barto, 1998) or imitation learning (Abbeel and Ng, 2004; Ratliff et al., 2006).
29 As there are n² possible edges in a sentence of length n, we wish to avoid the overhead of making many individual decisions about specific features on specific edges, with each decision considering the current scores of all other edges.
30 In Figure 2, we observe that (a) feature computation took more than 80% of the total time; (b) even though non-projective decoding time grows quadratically in the sentence length, in practice it is almost negligible compared to the projective decoding time, with an average of 0.23 ms; (c) the second-order projective model is significantly slower due to higher asymptotic complexity in both the scoring and decoding stages.
32 As making this decision separately for each of the n² possible edges is expensive, we instead propose a version that reduces the number of decisions needed.
33 We use the fast first-order non-projective parser for this purpose, since given observations (b) and (c), we cannot afford to run projective parsing multiple times.
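That fast non-projective step amounts to finding a maximum spanning arborescence over the current edge scores; a sketch using networkx's Chu-Liu/Edmonds implementation (our library choice, not necessarily the authors'):

```python
import networkx as nx

def mst_parse(scores):
    """First-order non-projective decoding: the highest-scoring tree is a
    maximum spanning arborescence of the edge-score graph (Chu-Liu/Edmonds).
    scores maps (h, m) pairs to current partial scores; since vertex 0 never
    appears as a modifier, the arborescence is automatically rooted at 0."""
    G = nx.DiGraph()
    for (h, m), s in scores.items():
        G.add_edge(h, m, weight=s)
    tree = nx.maximum_spanning_arborescence(G, attr="weight")
    return {m: h for (h, m) in tree.edges()}   # head of each word
```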
34 The single resulting tree (blue and red edges in Figure 1) has only n edges (n being the sentence length), and we use a classifier to decide which of these edges are reliable enough that we should "lock" them, i.e., commit to keeping them in the final tree. [Figure 2: Time comparison of scoring time and decoding time on English PTB section 22.]
36 This is the only decision that our policy π must make.
37 Locked (red) edges are definitely in the final tree.
38 We also do constraint propagation: we rule out all edges that conflict with the locked edges, barring them from appearing in the final tree.
39 The 2-dot-3-dash edge time ← firms is removed because it crosses the locked edge (comma) ← were (whereas we ultimately seek a projective parse).
40 (Footnote 3) Constraint propagation also automatically locks an edge when all other edges with the same child have been ruled out.
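A sketch of this constraint propagation, including the auto-locking of footnote 3; the data structures (a `locked_head` map and a `candidates` set of (h, m) pairs) are our own assumptions:

```python
def crosses(e1, e2):
    """Two edges cross iff exactly one endpoint of one lies strictly
    between the endpoints of the other."""
    a, b = sorted(e1)
    c, d = sorted(e2)
    return a < c < b < d or c < a < d < b

def creates_cycle(h, m, locked_head):
    """Adding h -> m creates a cycle iff m is already an ancestor of h
    through the locked edges (each node has at most one locked head)."""
    node = h
    while node in locked_head:
        node = locked_head[node]
        if node == m:
            return True
    return False

def propagate(locked_head, candidates, n):
    """Rule out candidates that compete with a locked edge for the same
    child, would create a cycle, or cross a locked edge (we ultimately
    seek a projective parse); auto-lock any edge left as the sole
    candidate for its child."""
    changed = True
    while changed:
        changed = False
        locked = [(h, m) for m, h in locked_head.items()]
        for (h, m) in list(candidates):
            if m in locked_head or creates_cycle(h, m, locked_head) \
                    or any(crosses((h, m), e) for e in locked):
                candidates.discard((h, m))
                changed = True
        for m in range(1, n + 1):              # auto-lock forced edges
            if m not in locked_head:
                left = [e for e in candidates if e[1] == m]
                if len(left) == 1:
                    locked_head[m] = left[0][0]
                    candidates.discard(left[0])
                    changed = True
```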
41 (Footnote 4) A reviewer asks about the cost of finding edges that cross a locked edge.
42 But at most n edges will be locked during the entire algorithm, for a total O(n³) runtime: the same as one call to projective parsing, and far faster in practice.
43 Once constraint propagation has finished, we visit all edges (gray) whose fate is still unknown, and update their scores in parallel by adding the next group of features.
44 As a result, most edges will be locked in or ruled out without needing to look up all of their features.
45 Some edges may still remain uncertain even after including all features.
46 If so, a final iteration (Figure 1f) uses the slower projective parser to resolve the status of these maximally uncertain edges.
47 Also, at earlier steps we would not prune edges crossing the locked edges.
48 4 Methods: Our goal is to produce a faster dependency parser by reducing the feature computation time.
49 We assume that we are given three increasingly accurate but increasingly slow parsers that can be called as subroutines: a first-order non-projective parser, a first-order projective parser, and a second-order projective parser.
50 In general we will return the output of one of the projective parsers.
51 But at early iterations, the non-projective parser helps us rapidly consider interactions among edges that may be relevant to our dynamic decisions.
52 The 112 second-order feature templates are then ranked by adding them in a similar greedy fashion (given that all first-order features have already been added), evaluating with the second-order projective parser.
53 Our parser adds an entire group of feature templates at each step, since adding one template at a time would require too many decisions and obviate speedups.
54 In this case, earlier stages are fast because they tend to have many fewer feature templates than later stages.
55 For example, for English, we use 7 groups of first-order feature templates and 4 groups of second-order feature templates.
56 4.2 Sequential Feature Selection: Similar to the length dictionary filter of Rush and Petrov (2012), for each test sentence, we first deterministically remove edges longer than the maximum length of edges in the training set that have the same head POS tag, modifier POS tag, and direction.
57 This simple step prunes around 40% of the non-gold edges in our Penn Treebank development set (Section 6).
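A minimal sketch of how such a length dictionary could be built and applied, assuming a simple (POS tag, head index) token representation of our own:

```python
from collections import defaultdict

def build_length_dict(treebank):
    """Record the maximum observed edge length for each
    (head POS, modifier POS, direction) triple in the training set."""
    maxlen = defaultdict(int)
    for sent in treebank:                          # sent: list of (pos, head)
        for m, (pos_m, h) in enumerate(sent, start=1):
            pos_h = "ROOT" if h == 0 else sent[h - 1][0]
            key = (pos_h, pos_m, "R" if h < m else "L")
            maxlen[key] = max(maxlen[key], abs(h - m))
    return maxlen

def dictionary_filter(poss, maxlen):
    """Keep only candidate edges no longer than the maximum length seen
    in training for the same (head POS, modifier POS, direction)."""
    n = len(poss)
    keep = set()
    for m in range(1, n + 1):
        for h in range(0, n + 1):
            if h == m:
                continue
            pos_h = "ROOT" if h == 0 else poss[h - 1]
            key = (pos_h, poss[m - 1], "R" if h < m else "L")
            if abs(h - m) <= maxlen.get(key, 0):
                keep.add((h, m))
    return keep
```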
58 After the length dictionary pruning step, we compute T1 for all remaining edges to obtain a pruned weighted directed graph.
59 The lock action commits an edge ⟨h, m⟩ to the final parse tree by pruning edges that conflict with ⟨h, m⟩.
60 The add action computes the next group of features for ⟨h, m⟩ and all other competing edges with child m.
61 (Since we classify the edges one at a time, decisions on one edge may affect later edges.)
62 To improve efficiency and reduce cascaded error, we sort the edges in the predicted tree and process them as above in descending order of their scores.
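Combining the lock/add actions with the score-sorted processing just described, one iteration's decision phase might look like the following; the `state` object and its helper methods are hypothetical names for the quantities in the text:

```python
def decide_iteration(tree_edges, policy, state):
    """Visit the predicted tree's edges in descending score order.  For
    each still-undecided edge, the policy either LOCKs it (triggering
    constraint propagation) or ADDs the next feature group for it and
    all competing edges with the same child."""
    for (h, m) in sorted(tree_edges, key=lambda e: state.score[e], reverse=True):
        if (h, m) not in state.candidates:       # already ruled out this round
            continue
        if policy(state.features(h, m)) == "lock":   # features = psi (Sec. 5.4)
            state.locked_head[m] = h
            propagate(state.locked_head, state.candidates, state.n)
        else:                                    # "add"
            for e in state.competing_edges(m):   # candidates with child m
                state.add_next_feature_group(e)
```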
63 We run K1 − 1 iterations of non-projective first-order parsing (adding groups T_1, . . . , T_{K1−1}), then 1 iteration of projective first-order parsing (adding group T_{K1}), and finally K2 iterations of projective second-order parsing (adding groups T_{K1+1}, . . . , T_{K1+K2}).
64 Before each iteration, we use the result of the previous iteration (as explained above) to prune some edges and add a new group of features to the rest.
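A sketch of this overall schedule, under hypothetical helper names (`init_state`, `second_order_parse`, and the weight switching discussed below); the projective-failure fallback mentioned later is omitted for brevity:

```python
def dynamic_parse(sent, policy, K1, K2):
    """Rounds 1..K1-1: non-projective first-order parsing.
    Round K1: projective first-order parsing (switching feature weights).
    Rounds K1+1..K1+K2: projective second-order parsing."""
    state = init_state(sent)            # dictionary filter + feature group T_1
    for k in range(1, K1 + K2 + 1):
        if k < K1:
            heads = mst_parse(state.scores)                  # {m: h}, fast
            tree = [(h, m) for m, h in heads.items()]
        elif k == K1:
            state.switch_weights("first-order projective")   # re-weight old features
            heads = eisner(state.score_matrix())             # head of word m at heads[m-1]
            tree = [(h, m) for m, h in enumerate(heads, start=1)]
        else:
            state.switch_weights("second-order projective")
            tree = second_order_parse(state)                 # hypothetical
        decide_iteration(tree, policy, state)
        if state.all_decided():         # stop early once every edge's fate is known
            break
    return state.final_tree()
```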
65 (Footnote 5) If the conflicting edge is in the current predicted parse tree (which can happen because of non-projectivity), we forbid the model to prune it.
66 Each of the three parsers has a different set of feature weights, so when we switch parsers on rounds K1 and K1 + 1, we must also change the weights of the previously added features to those specified by the new parsing model.
67 In practice, we can stop as soon as the fate of all edges is known.
68 Also, if no projective parse tree can be constructed at round K1 using the available unpruned edges, then we immediately fall back to returning the non-projective parse tree from round K1 − 1.
69 In the sequential feature selection framework, it is hard to directly apply standard reinforcement learning algorithms, as we cannot assign credit to certain features until the policy decides to stop and let us evaluate the prediction result.
70 On the other hand, knowing the gold parse tree makes it easy to obtain expert demonstrations, which enables imitation learning.
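One natural expert derived from the gold tree, as a sketch (the paper's oracle may differ in details):

```python
def expert_action(h, m, gold_heads):
    """Expert demonstration: lock an edge of the predicted tree exactly
    when it is a gold edge, otherwise keep adding features."""
    return "lock" if gold_heads[m] == h else "add"
```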
71 If the learned policy cannot mimic the expert perfectly, one wrong step may lead to states never visited by the expert, due to cumulative errors.
72 This problem of insufficient exploration can be alleviated by iteratively learning a policy trained on states visited by both the expert and the learner (Ross et al., 2011).
73 (Algorithm: DAgger policy training)
    Initialize D ← ∅, π_1 ← π*
    for i = 1 to N do
        for G(V, E) in the training set do
            Sample trajectories {(s_{π_i}, π_i(s_{π_i}))} by DynFS(G(V, E), π_i)
            D ← D ∪ {(ψ(s), π*(s))}
        end for
        Train policy π_{i+1} on D
    end for
    return the best π_i evaluated on the development set
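The same loop in Python, as a sketch with hypothetical helpers (`rollout_states`, `train_classifier`, `evaluate`):

```python
def dagger(train_graphs, expert, psi, n_iter, dev):
    """DAgger-style imitation learning (Ross et al., 2011): roll out the
    current policy, label every visited state with the expert's action,
    aggregate, and retrain."""
    D, policy, learned = [], expert, []          # pi_1 = expert
    for i in range(n_iter):
        for g in train_graphs:
            # run DynFS with the current policy and record visited states
            for state in rollout_states(g, policy):
                D.append((psi(state), expert(state)))   # expert-labeled
        policy = train_classifier(D)                    # pi_{i+1}
        learned.append(policy)
    return max(learned, key=lambda p: evaluate(p, dev)) # best on dev set
```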
74 5.4 Policy Features: Our linear edge classifier uses a feature vector ψ that concatenates all previously acquired parsing features together with "meta-features" that reflect confidence in the edge.
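A sketch of what ψ might compute; the particular confidence meta-features below are plausible guesses, not the paper's exact list:

```python
def policy_features(state, h, m):
    """Feature vector psi for the edge classifier: previously computed
    parsing features plus confidence meta-features for edge (h, m)."""
    rivals = sorted((state.score[(h2, m)] for h2 in state.candidate_heads(m)),
                    reverse=True)
    best = rivals[0]
    margin = best - rivals[1] if len(rivals) > 1 else float("inf")
    return state.parsing_features(h, m) + [
        state.score[(h, m)],          # raw score of this edge
        state.score[(h, m)] - best,   # gap to the best competing head
        margin,                       # how decisive the current best is
        len(rivals),                  # how contested child m remains
    ]
```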
75 The expert makes optimal pruning decisions, but overall performance also depends on the pre-trained parser.
76 For all languages, we run DAgger for 20 iterations and select the best policy on the development set. [Table 1: Comparison of speedup and accuracy with the vine pruning cascade approach for six languages.]
77 In the setup, DYNFS means our dynamic feature selection model, VINEP means the vine pruning cascade model, and UAS(D) and UAS(F) refer to the unlabeled attachment score of the dynamic model (D) and the full-feature model (F) respectively.
78 Results for the vine pruning cascade model are taken from Rush and Petrov (2012).
79 The cost is the percentage of feature templates used per sentence on edges that are not pruned by the dictionary filter.
80 6.2 Baseline Models: We use the publicly available implementation of MSTParser (with modifications to the feature computation) and its default settings, so the feature weights of the projective and non-projective parsers are trained by the MIRA algorithm (Crammer and Singer, 2003; Crammer et al., 2006).
81 For PTB, our first-order model has 268 feature templates and 76,287,848 features; the second-order model has 380 feature templates and 95,796,140 features.
82 Red (locked), gray (undecided), and dashed gray (pruned) lines correspond to edges shown in Figure 1.
83 In Table 1, we compare the dynamic parsing models with the full-feature models and the vine pruning cascade models for first-order and second-order parsing. [Figure 5: Pareto curves for the dynamic and static approaches on English PTB section 23.]
84 As speed comparison for parsing largely relies on implementation, we also report the percentage of feature templates chosen for each sentence.
85 The cost column shows the average number of feature templates computed for each sentence, expressed as a percentage of the number of feature templates that would have been computed had we pruned only with the length dictionary filter.
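A minimal sketch of this cost computation, assuming `templates_used[e]` counts the templates actually instantiated on edge e:

```python
def template_cost(templates_used, total_templates, unpruned_edges):
    """Feature templates actually computed, as a percentage of what the
    full-feature model would compute on the edges surviving the
    dictionary filter."""
    computed = sum(templates_used[e] for e in unpruned_edges)
    possible = total_templates * len(unpruned_edges)
    return 100.0 * computed / possible
```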
86 From the table we notice that our first-order model's performance is comparable or superior to the vine pruning model, both in terms of speedup and accuracy.
87 Therefore, to maintain parsing accuracy, the policy must make high-precision pruning decisions and becomes conservative.
88 We could mitigate this by training the original parsing feature weights in conjunction with our policy feature weights.
89 In addition, there is larger overhead when checking non-projective edges and cycles.
90 We show how the runtime, accuracy, and the numbers of locked and undecided edges change over the iterations of our first-order dynamic projective parsing.
91 From iterations 1 to 6, we obtain parsing results from the non-projective parser; in iteration 7, we run the projective parser.
92 The number of remaining edges drops quickly after the second iteration.
93 From Figure 3, however, we notice that even with the first feature group, which contains only one feature template, the non-projective parser can almost achieve 50% accuracy.
94 Thus, ideally, our policy should have locked that many edges after the first iteration.
95 The learned policy does not imitate the expert perfectly, either because our policy features are not discriminative enough, or because a linear classifier is not powerful enough for this task.
96 Finally, to show the advantage of making dynamic decisions that consider the interaction among edges in the given input sentence, we compare our results with a static feature selection approach on PTB section 23.
97 In each iteration, instead of running a fast parser and making decisions online, it simply adds the next group of feature templates to all edges.
98 7 Conclusion: In this paper we present a dynamic feature selection algorithm for graph-based dependency parsing.
99 We show that choosing feature templates adaptively for each edge in the dependency graph greatly reduces feature computation time and in some cases improves parsing accuracy.
100 In future work, we are interested in training parsers that favor the dynamic feature selection setting, for example, parsers that are robust to missing features, or parsers optimized for different stages.