acl acl2010 acl2010-200 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Valentin I. Spitkovsky ; Daniel Jurafsky ; Hiyan Alshawi
Abstract: We show how web mark-up can be used to improve unsupervised dependency parsing. Starting from raw bracketings of four common HTML tags (anchors, bold, italics and underlines), we refine approximate partial phrase boundaries to yield accurate parsing constraints. Conversion procedures fall out of our linguistic analysis of a newly available million-word hyper-text corpus. We demonstrate that derived constraints aid grammar induction by training Klein and Manning’s Dependency Model with Valence (DMV) on this data set: parsing accuracy on Section 23 (all sentences) of the Wall Street Journal corpus jumps to 50.4%, beating previous state-of-the- art by more than 5%. Web-scale experiments show that the DMV, perhaps because it is unlexicalized, does not benefit from orders of magnitude more annotated but noisier data. Our model, trained on a single blog, generalizes to 53.3% accuracy out-of-domain, against the Brown corpus nearly 10% higher than the previous published best. The fact that web mark-up strongly correlates with syntactic structure may have broad applicability in NLP.
Reference: text
sentIndex sentText sentNum sentScore
1 com Abstract We show how web mark-up can be used to improve unsupervised dependency parsing. [sent-7, score-0.231]
2 Starting from raw bracketings of four common HTML tags (anchors, bold, italics and underlines), we refine approximate partial phrase boundaries to yield accurate parsing constraints. [sent-8, score-0.584]
3 We demonstrate that derived constraints aid grammar induction by training Klein and Manning’s Dependency Model with Valence (DMV) on this data set: parsing accuracy on Section 23 (all sentences) of the Wall Street Journal corpus jumps to 50. [sent-10, score-0.117]
4 Web-scale experiments show that the DMV, perhaps because it is unlexicalized, does not benefit from orders of magnitude more annotated but noisier data. [sent-12, score-0.046]
5 The fact that web mark-up strongly correlates with syntactic structure may have broad applicability in NLP. [sent-15, score-0.113]
6 A restricted version of this problem that targets dependencies and assumes partial annotation, namely sentence boundaries and part-of-speech (POS) tagging, has received much attention. [sent-17, score-0.236]
7 Klein and Manning (2004) were the first to beat a simple parsing heuristic, the right-branching baseline; today’s state-of-the-art systems (Headden et al. [sent-18, score-0.055]
8 Pereira and Schabes (1992) outlined three major problems with classic EM, applied to a related problem, constituent parsing. [sent-21, score-0.116]
9 They extended classic inside-outside re-estimation (Baker, 1979) to respect any bracketing constraints included with a training corpus. [sent-22, score-0.362]
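The constrained re-estimation idea reduces to a simple compatibility test: a chart item is allowed only if its span crosses no known bracket. Here is a minimal sketch of that test; the function name and the half-open token-interval convention are our own, not from Pereira and Schabes:

```python
def consistent(i, j, brackets):
    """Return True iff span [i, j) crosses no bracket in `brackets`.

    Two spans cross when they overlap without one containing the other.
    Constrained inside-outside re-estimation simply assigns zero mass
    to any chart item whose span fails this test.
    """
    for (a, b) in brackets:
        if (a < i < b < j) or (i < a < j < b):
            return False  # spans overlap but neither contains the other
    return True
```

For example, with a bracket over tokens 1–3, the span (0, 2) is rejected while (0, 4) and (1, 3) are kept.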
10 Their algorithm sometimes found good solutions from bracketed corpora but not from raw text, supporting the view that purely unsupervised, selforganizing inference methods can miss the trees for the forest of distributional regularities. [sent-24, score-0.155]
11 This was a promising break-through, but the problem of whence to get partial bracketings was left open. [sent-25, score-0.236]
12 We suggest mining partial bracketings from a cheap and abundant natural language resource: the hyper-text mark-up that annotates web-pages. [sent-26, score-0.273]
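As a rough illustration of mining bracketings from mark-up, the sketch below records each anchor, bold, italics, and underline element as a token span over whitespace-tokenized text. This is a hypothetical simplification for illustration, not the paper's actual conversion procedure; real HTML needs proper tokenization and clean-up.

```python
from html.parser import HTMLParser

MARKUP_TAGS = {"a", "b", "i", "u"}  # anchors, bold, italics, underlines

class BracketingMiner(HTMLParser):
    """Turn inline HTML mark-up into raw token-span bracketings."""

    def __init__(self):
        super().__init__()
        self.tokens = []       # whitespace tokens seen so far
        self.stack = []        # token offsets of currently open tags
        self.bracketings = []  # half-open (start, end) token spans

    def handle_starttag(self, tag, attrs):
        if tag in MARKUP_TAGS:
            self.stack.append(len(self.tokens))

    def handle_endtag(self, tag):
        if tag in MARKUP_TAGS and self.stack:
            start = self.stack.pop()
            self.bracketings.append((start, len(self.tokens)))

    def handle_data(self, data):
        self.tokens.extend(data.split())

def mine(html):
    miner = BracketingMiner()
    miner.feed(html)
    return miner.tokens, miner.bracketings
```

Feeding `'the <b>Toronto Star</b> reports'` yields the tokens and one bracketing over tokens 1–3; nested tags yield one span per tag.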
13 To validate this idea, we created a new data set, novel in combining a real blog’s raw HTML with tree-bank-like constituent structure parses, gener- 1278 Proceedings of the 48th Annual Meeting of the ACL, Uppsala, Sweden, 11–16 July 2010. [sent-31, score-0.117]
14 Our linguistic analysis of the most prevalent tags (anchors, bold, italics and underlines) over its 1M+ words reveals a strong connection between syntax and mark-up (all of our examples draw from this corpus), inspiring several simple techniques for automatically deriving parsing constraints. [sent-34, score-0.2]
15 Experiments with both hard and more flexible constraints, as well as with different styles and quantities of annotated training data (the blog, web news and the web itself), confirm that mark-up-induced constraints consistently improve (otherwise unsupervised) dependency parsing. [sent-35, score-0.53]
16 2 Intuition and Motivating Examples: It is natural to expect hidden structure to seep through when a person annotates a sentence. [sent-36, score-0.091]
17 As it happens, a non-trivial fraction of the world’s population routinely annotates text diligently, if only partially and informally. [sent-37, score-0.091]
18 As noted, web annotations can be indicative of phrase boundaries, e. [sent-39, score-0.163]
19 In doing so, mark-up sometimes offers useful cues even for low-level tokenization decisions: [NP [NP Libyan ruler] [NP Mu‘ammar al-Qaddafi] ] referred to . [sent-44, score-0.045]
20 Admittedly, not all boundaries between HTML tags and syntactic constituents match up nicely: . [sent-50, score-0.266]
21 ]]] Combining parsing with mark-up may not be straight-forward, but there is hope: even above, 1 Even when (American) grammar schools lived up to their name, they only taught dependencies. [sent-56, score-0.055]
22 This was back in the days before constituent grammars were invented. [sent-57, score-0.116]
23 one of each nested tag’s boundaries aligns; and Toronto Star’s neglected determiner could be forgiven, certainly within a dependency formulation. [sent-60, score-0.28]
24 3 A High-Level Outline of Our Approach: Our idea is to implement the DMV (Klein and Manning, 2004), a standard unsupervised grammar inducer. [sent-61, score-0.062]
25 But instead of learning from the unannotated test set, we train with text that contains web mark-up, using various ways of converting HTML into parsing constraints. [sent-62, score-0.234]
26 Our parsing constraints come from a blog (a new corpus we created), the web, and news (see Table 1 for the corpora’s sentence and token counts). [sent-65, score-0.463]
27 To facilitate future work, we make the final models and our manually-constructed blog data publicly available. [sent-66, score-0.145]
28 4 Data Sets for Evaluation and Training: The appeal of unsupervised parsing lies in its ability to learn from surface text alone; but (intrinsic) evaluation still requires parsed sentences. [sent-69, score-0.168]
29 sentences having at least one bracketing no shorter than the length cutoff (but shorter than the sentence). [sent-73, score-0.355]
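The length-cutoff filter described here can be sketched in a few lines; the default cutoff value and the half-open span convention are assumptions for illustration, not values taken from the paper:

```python
def keep_sentence(n_tokens, bracketings, cutoff=2):
    """Keep a sentence only if some bracketing spans at least `cutoff`
    tokens but fewer than the whole sentence. Bracketings are
    half-open (start, end) token-index pairs.
    """
    return any(cutoff <= (end - start) < n_tokens
               for (start, end) in bracketings)
```

A 4-token sentence with a bracketing over tokens 1–3 passes; one whose only bracketing covers the entire sentence, or a single token, does not.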
30 We also evaluate on Brown100, similarly derived from the parsed portion of the Brown corpus (Francis and Kucera, 1979). [sent-76, score-0.051]
31 While we use WSJ45 and WSJ15 to train baseline models, the bulk of our experiments is with web data. [sent-77, score-0.113]
32 1 A News-Style Blog: Daniel Pipes. Since there was no corpus overlaying syntactic structure with mark-up, we began constructing a new one by downloading articles4 from a news-style blog. [sent-79, score-0.058]
33 Although limited to a single genre (political opinion), danielpipes . [sent-80, score-0.043]
34 After extracting moderately clean text and mark-up locations, we used MxTerminator (Reynar and Ratnaparkhi, 1997) to detect sentence boundaries. [sent-83, score-0.052]
35 This initial automated pass begot multiple rounds of various semi-automated clean-ups that involved fixing sentence breaking, modifying parser-unfriendly tokens, converting HTML entities and non-ASCII text, correcting typos, and so on. [sent-84, score-0.053]
36 , Sesame Street -like), we broke up all markup that crossed sentence boundaries (i. [sent-89, score-0.104]
37 , 2003),6 and BLOGp, parsed with Charniak’s parser (Charniak, 2001 ; Charniak and Johnson, 2005). [sent-112, score-0.051]
38 The reason for this dichotomy was to use state-of-the-art parses to analyze the relationship between syntax and mark-up, yet to prevent jointly tagged (and non-standard AUX[G]) POS sequences from interfering with our (otherwise unsupervised) training. [sent-113, score-0.063]
39 8 However, since many taggers are themselves trained on manually parsed corpora, such as WSJ, no parser that relies on external POS tags could be considered truly unsupervised; for a fully unsupervised example, see Seginer’s (2007) CCL parser, available at http://www. [sent-133, score-0.252]
40 If web mark-up shared a similar characteristic, it might not provide sufficiently disambiguating cues to syntactic structure: HTML tags could be too short (e. [sent-140, score-0.239]
41 , singletons like “click here ”) or otherwise unhelpful in resolving truly difficult ambiguities (such as PP-attachment). [sent-142, score-0.102]
42 We began simply by counting various basic events in BLOGp. [sent-143, score-0.058]
43 We do not distinguish HTML tags and track only unique bracketing end-points within a sentence. [sent-150, score-0.334]
44 10 A non-trivial fraction of our corpus is older (pre-internet) unannotated articles, so this estimate may be conservative. [sent-153, score-0.066]
45 Mark-up is short, typically under five words, yet (by far) the most frequently marked sequence of POS tags is a pair. [sent-155, score-0.136]
46 2 Common Syntactic Subtrees: For three-quarters of all mark-up, the lowest dominating non-terminal is a noun phrase (see Table 4); there are also non-trace quantities of verb phrases (12. [sent-157, score-0.237]
47 2% of all annotated productions, only one is not a noun phrase (see Table 5, left). [sent-160, score-0.05]
48 Four of the fifteen lowest dominating non-terminals do not match the entire bracketing; all four miss the leading determiner, as we saw earlier. [sent-161, score-0.546]
49 In such cases, we recursively split internal nodes until the bracketing aligned, as follows: [S [NP the Toronto Star] [VP reports [NP this] [PP in the softest possible way] , [S stating . [sent-162, score-0.361]
50 ]]] S → NP VP; NP → DT NNP NNP; VP → VBZ NP PP S. We can summarize productions more compactly by using a dependency framework and clipping off any dependents whose subtrees do not cross a bracketing boundary, relative to the parent. [sent-165, score-0.552]
51 Thus, DT NNP NNP VBZ DT IN DT JJS JJ NN becomes DT NNP VBZ, “the Star reports . . . ” [sent-166, score-0.055]
52 Viewed this way, the top fifteen (now collapsed) productions cover 59. [sent-167, score-0.177]
53 This exposes five cases of inexact matches, three of which involve neglected determiners or adjectives to the left of the head. [sent-169, score-0.128]
54 In fact, the only case that cannot be explained by dropped dependents is #8, where the daughters are marked but the parent is left out. [sent-170, score-0.297]
55 As this example shows, disagreements (as well as agreements) between mark-up and machine-generated parse trees with automatically percolated heads should be taken with a grain of salt. [sent-179, score-0.272]
56 — — 1281 Table 5: Top 15 marked productions, viewed as constituents (left) and as dependencies (right), after recursively expanding any internal nodes that did not align with the bracketing (underlined). [sent-188, score-0.466]
57 Tabulated dependencies were collapsed, dropping any dependents that fell entirely in the same region as their parent (i. [sent-189, score-0.219]
58 , both inside the bracketing, both to its left or both to its right), keeping only crossing attachments. [sent-191, score-0.054]
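The collapsing rule in this caption can be sketched as a filter over dependency arcs. The three-way region split and the (head, dependent) arc representation are our own framing of the rule, not code from the paper:

```python
def region(i, bracket):
    """Classify token index i relative to a half-open bracket (s, e)."""
    s, e = bracket
    if i < s:
        return "left"
    if i < e:
        return "inside"
    return "right"

def crossing_arcs(arcs, bracket):
    """Keep only dependencies that cross the bracketing boundary.

    An arc (head, dependent) is dropped when both endpoints fall in
    the same region: both inside the bracketing, both to its left,
    or both to its right.
    """
    return [(h, d) for (h, d) in arcs
            if region(h, bracket) != region(d, bracket)]
```

With a bracket over tokens 1–3, an arc from token 3 into the bracket survives, while arcs wholly inside or wholly outside are clipped.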
59 3 Proposed Parsing Constraints: The straight-forward approach, forcing mark-up to correspond to constituents, agrees with Charniak’s parse trees only 48. [sent-193, score-0.215]
60 in [NP [NP an analysis] [PP of perhaps the most astonishing PC item I have yet stumbled upon]] . [sent-199, score-0.046]
61 — — This number should be higher, as the vast majority of disagreements are due to tree-bank idiosyncrasies (e. [sent-200, score-0.087]
62 A dependency formulation is less sensitive to such stylistic differences. [sent-208, score-0.056]
63 • loose: same as strict, but allows the bracketing’s head word to have external dependents. [sent-222, score-0.083]
64 5% of the time, catching many (though far from all) dropped dependents, e. [sent-224, score-0.046]
65 • sprawl: same as loose, but now allows all words inside a bracketing to attach external dependents. [sent-232, score-0.253]
66 , where “Toronto Star” is embedded in longer mark-up that includes its own parent, a verb: repo . [sent-236, score-0.048]
67 , a fused phrase like “Fox News in Canada” that detached a preposition from its verb: . [sent-249, score-0.05]
68 Nevertheless, it is possible for mark-up to be torn apart by external heads from both sides. [sent-258, score-0.057]
69 Below, “CSA” modifies “authority” (to its left), appositively, while “AlManar” modifies “television” (to its right): 13 The French broadcasting authority, CSA, banned . [sent-260, score-0.11]
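A hedged sketch of how the strict/loose/sprawl levels might be checked against a dependency tree. The paper's definitions are richer (e.g. the torn-apart cases above), and two details here are assumptions: the head of a bracketing is taken to be the single inside word whose parent lies outside, and trees are encoded as a flat parent array.

```python
def satisfies(heads, bracket, level):
    """Check a dependency tree against one bracketing.

    heads[i] is the parent index of token i (-1 for the root);
    bracket is a half-open (s, e) token span; level is one of
    'strict', 'loose', or 'sprawl'.
    """
    s, e = bracket
    inside = range(s, e)
    # exactly one inside word may take its parent outside: the head
    external = [i for i in inside if not (s <= heads[i] < e)]
    if len(external) != 1:
        return False
    head = external[0]
    if level == "sprawl":   # any inside word may attach external dependents
        return True
    # outside tokens whose parent lies inside the bracketing
    out_deps = [d for d, h in enumerate(heads)
                if s <= h < e and not (s <= d < e)]
    if level == "loose":    # only the head may have external dependents
        return all(heads[d] == head for d in out_deps)
    return not out_deps     # strict: no external dependents at all
```

For "the Toronto Star reports this" with heads [2, 2, 3, -1, 3], the bracket over "the Toronto Star" is strict, while "Toronto Star" fails strict (the neglected determiner attaches to the head from outside) but passes loose and sprawl.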
70 12This view evokes the trapezoids of the O(n3) recognizer for split head automaton grammars (Eisner and Satta, 1999). [sent-266, score-0.047]
71 13But this is a stretch, since the comma after “CSA” renders the marked phrase ungrammatical even out of context. [sent-267, score-0.105]
72 1), but it also admits a trivial implementation of (most of) the dependency constraints we proposed. [sent-272, score-0.118]
73 , 2010a): we chose the Laplace-smoothed model trained at WSJ15 (the “sweet spot” data gradation) but initialized off WSJ8, since that ad-hoc harmonic initializer has the best cross-entropy on WSJ15 (see Figure 1). [sent-280, score-0.068]
74 Overconstrained sentences are re-attempted at successively lower levels until they become possible to parse, if necessary at the lowest (default) level 0. [sent-286, score-0.055]
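The fallback over constraint levels can be sketched as a simple loop. The exact level ordering and the parser interface (a `parse` callable that returns None when a sentence is over-constrained) are assumptions for illustration; in the paper the weakest setting is the unconstrained default, level 0:

```python
LEVELS = ["strict", "loose", "sprawl", None]  # None = unconstrained default

def parse_with_fallback(sentence, brackets, parse):
    """Retry a constrained parser at successively weaker levels.

    `parse(sentence, brackets, level)` is assumed to return None when
    no tree satisfies the constraints at that level; the unconstrained
    call (level None) is assumed to always succeed.
    """
    for level in LEVELS:
        tree = parse(sentence, brackets, level)
        if tree is not None:
            return tree, level
    raise ValueError("unconstrained parse should never fail")
```

With a stub parser that only succeeds at the sprawl level, the loop tries strict and loose first, then settles on sprawl.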
wordName wordTfidf (topN-words)
[('bracketing', 0.253), ('dmv', 0.171), ('html', 0.157), ('star', 0.15), ('blog', 0.145), ('spitkovsky', 0.137), ('nnp', 0.133), ('np', 0.133), ('bracketings', 0.127), ('toronto', 0.121), ('web', 0.113), ('wsj', 0.11), ('boundaries', 0.104), ('csa', 0.102), ('dependents', 0.094), ('stanford', 0.093), ('charniak', 0.091), ('vbz', 0.091), ('agrees', 0.091), ('annotates', 0.091), ('productions', 0.09), ('dt', 0.088), ('news', 0.088), ('disagreements', 0.087), ('fifteen', 0.087), ('dominating', 0.087), ('ammar', 0.085), ('blogt', 0.085), ('libyan', 0.085), ('percolated', 0.085), ('pipes', 0.085), ('ruler', 0.085), ('underlines', 0.085), ('loose', 0.083), ('tags', 0.081), ('constituents', 0.081), ('dependencies', 0.077), ('google', 0.075), ('neglected', 0.074), ('tar', 0.074), ('kucera', 0.074), ('hiyan', 0.074), ('manning', 0.073), ('brown', 0.072), ('constituent', 0.069), ('initializer', 0.068), ('unannotated', 0.066), ('vp', 0.064), ('television', 0.064), ('italics', 0.064), ('miss', 0.064), ('strict', 0.063), ('parses', 0.063), ('constraints', 0.062), ('unsupervised', 0.062), ('anchors', 0.06), ('subtrees', 0.059), ('klein', 0.059), ('truly', 0.058), ('began', 0.058), ('francis', 0.058), ('heads', 0.057), ('dependency', 0.056), ('parsing', 0.055), ('default', 0.055), ('modifies', 0.055), ('marked', 0.055), ('partial', 0.055), ('lowest', 0.055), ('reports', 0.055), ('left', 0.054), ('authority', 0.053), ('stating', 0.053), ('fixing', 0.053), ('styles', 0.053), ('valentin', 0.053), ('fox', 0.053), ('clean', 0.052), ('parsed', 0.051), ('shorter', 0.051), ('phrase', 0.05), ('valence', 0.05), ('pos', 0.049), ('collapsed', 0.048), ('scaled', 0.048), ('raw', 0.048), ('nn', 0.048), ('parent', 0.048), ('street', 0.048), ('classic', 0.047), ('grammars', 0.047), ('dropped', 0.046), ('determiner', 0.046), ('tokens', 0.046), ('perhaps', 0.046), ('quantities', 0.045), ('cues', 0.045), ('ambiguities', 0.044), ('trees', 0.043), ('genre', 0.043)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 200 acl-2010-Profiting from Mark-Up: Hyper-Text Annotations for Guided Parsing
Author: Valentin I. Spitkovsky ; Daniel Jurafsky ; Hiyan Alshawi
Abstract: We show how web mark-up can be used to improve unsupervised dependency parsing. Starting from raw bracketings of four common HTML tags (anchors, bold, italics and underlines), we refine approximate partial phrase boundaries to yield accurate parsing constraints. Conversion procedures fall out of our linguistic analysis of a newly available million-word hyper-text corpus. We demonstrate that derived constraints aid grammar induction by training Klein and Manning’s Dependency Model with Valence (DMV) on this data set: parsing accuracy on Section 23 (all sentences) of the Wall Street Journal corpus jumps to 50.4%, beating previous state-of-the- art by more than 5%. Web-scale experiments show that the DMV, perhaps because it is unlexicalized, does not benefit from orders of magnitude more annotated but noisier data. Our model, trained on a single blog, generalizes to 53.3% accuracy out-of-domain, against the Brown corpus nearly 10% higher than the previous published best. The fact that web mark-up strongly correlates with syntactic structure may have broad applicability in NLP.
2 0.13756198 169 acl-2010-Learning to Translate with Source and Target Syntax
Author: David Chiang
Abstract: Statistical translation models that try to capture the recursive structure of language have been widely adopted over the last few years. These models make use of varying amounts of information from linguistic theory: some use none at all, some use information about the grammar of the target language, some use information about the grammar of the source language. But progress has been slower on translation models that are able to learn the relationship between the grammars of both the source and target language. We discuss the reasons why this has been a challenge, review existing attempts to meet this challenge, and show how some old and new ideas can be combined into a sim- ple approach that uses both source and target syntax for significant improvements in translation accuracy.
3 0.1333901 214 acl-2010-Sparsity in Dependency Grammar Induction
Author: Jennifer Gillenwater ; Kuzman Ganchev ; Joao Graca ; Fernando Pereira ; Ben Taskar
Abstract: A strong inductive bias is essential in unsupervised grammar induction. We explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types. Specifically, we investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Graça et al. (2007). In ex- periments with 12 languages, we achieve substantial gains over the standard expectation maximization (EM) baseline, with average improvement in attachment accuracy of 6.3%. Further, our method outperforms models based on a standard Bayesian sparsity-inducing prior by an average of 4.9%. On English in particular, we show that our approach improves on several other state-of-the-art techniques.
4 0.12304989 195 acl-2010-Phylogenetic Grammar Induction
Author: Taylor Berg-Kirkpatrick ; Dan Klein
Abstract: We present an approach to multilingual grammar induction that exploits a phylogeny-structured model of parameter drift. Our method does not require any translated texts or token-level alignments. Instead, the phylogenetic prior couples languages at a parameter level. Joint induction in the multilingual model substantially outperforms independent learning, with larger gains both from more articulated phylogenies and as well as from increasing numbers of languages. Across eight languages, the multilingual approach gives error reductions over the standard monolingual DMV averaging 21. 1% and reaching as high as 39%.
5 0.12247853 203 acl-2010-Rebanking CCGbank for Improved NP Interpretation
Author: Matthew Honnibal ; James R. Curran ; Johan Bos
Abstract: Once released, treebanks tend to remain unchanged despite any shortcomings in their depth of linguistic analysis or coverage of specific phenomena. Instead, separate resources are created to address such problems. In this paper we show how to improve the quality of a treebank, by integrating resources and implementing improved analyses for specific constructions. We demonstrate this rebanking process by creating an updated version of CCGbank that includes the predicate-argument structure of both verbs and nouns, baseNP brackets, verb-particle constructions, and restrictive and non-restrictive nominal modifiers; and evaluate the impact of these changes on a statistical parser.
6 0.1024526 76 acl-2010-Creating Robust Supervised Classifiers via Web-Scale N-Gram Data
7 0.098828718 233 acl-2010-The Same-Head Heuristic for Coreference
8 0.096615754 211 acl-2010-Simple, Accurate Parsing with an All-Fragments Grammar
9 0.089672059 144 acl-2010-Improved Unsupervised POS Induction through Prototype Discovery
10 0.08900018 252 acl-2010-Using Parse Features for Preposition Selection and Error Detection
11 0.087872751 84 acl-2010-Detecting Errors in Automatically-Parsed Dependency Relations
12 0.086280473 184 acl-2010-Open-Domain Semantic Role Labeling by Modeling Word Spans
13 0.085508689 115 acl-2010-Filtering Syntactic Constraints for Statistical Machine Translation
14 0.085156001 75 acl-2010-Correcting Errors in a Treebank Based on Synchronous Tree Substitution Grammar
15 0.083732992 133 acl-2010-Hierarchical Search for Word Alignment
16 0.082338415 130 acl-2010-Hard Constraints for Grammatical Function Labelling
17 0.082009457 99 acl-2010-Efficient Third-Order Dependency Parsers
18 0.07992135 205 acl-2010-SVD and Clustering for Unsupervised POS Tagging
19 0.078678668 114 acl-2010-Faster Parsing by Supertagger Adaptation
20 0.077493109 132 acl-2010-Hierarchical Joint Learning: Improving Joint Parsing and Named Entity Recognition with Non-Jointly Labeled Data
topicId topicWeight
[(0, -0.245), (1, 0.004), (2, 0.061), (3, 0.01), (4, -0.093), (5, -0.033), (6, 0.107), (7, 0.013), (8, 0.09), (9, 0.087), (10, -0.013), (11, -0.01), (12, -0.045), (13, -0.032), (14, 0.002), (15, -0.006), (16, -0.035), (17, 0.044), (18, 0.128), (19, 0.019), (20, -0.051), (21, 0.054), (22, 0.155), (23, 0.07), (24, -0.007), (25, -0.019), (26, 0.029), (27, 0.043), (28, -0.091), (29, -0.059), (30, -0.037), (31, 0.018), (32, 0.026), (33, 0.035), (34, 0.13), (35, -0.122), (36, -0.094), (37, -0.081), (38, -0.19), (39, -0.051), (40, 0.013), (41, 0.113), (42, 0.053), (43, -0.015), (44, -0.004), (45, -0.123), (46, -0.075), (47, -0.028), (48, 0.072), (49, 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.95400298 200 acl-2010-Profiting from Mark-Up: Hyper-Text Annotations for Guided Parsing
Author: Valentin I. Spitkovsky ; Daniel Jurafsky ; Hiyan Alshawi
Abstract: We show how web mark-up can be used to improve unsupervised dependency parsing. Starting from raw bracketings of four common HTML tags (anchors, bold, italics and underlines), we refine approximate partial phrase boundaries to yield accurate parsing constraints. Conversion procedures fall out of our linguistic analysis of a newly available million-word hyper-text corpus. We demonstrate that derived constraints aid grammar induction by training Klein and Manning’s Dependency Model with Valence (DMV) on this data set: parsing accuracy on Section 23 (all sentences) of the Wall Street Journal corpus jumps to 50.4%, beating previous state-of-the- art by more than 5%. Web-scale experiments show that the DMV, perhaps because it is unlexicalized, does not benefit from orders of magnitude more annotated but noisier data. Our model, trained on a single blog, generalizes to 53.3% accuracy out-of-domain, against the Brown corpus nearly 10% higher than the previous published best. The fact that web mark-up strongly correlates with syntactic structure may have broad applicability in NLP.
2 0.65512025 252 acl-2010-Using Parse Features for Preposition Selection and Error Detection
Author: Joel Tetreault ; Jennifer Foster ; Martin Chodorow
Abstract: Jennifer Foster NCLT Dublin City University Ireland j fo st er@ comput ing . dcu . ie Martin Chodorow Hunter College of CUNY New York, NY, USA martin . chodorow @hunter . cuny . edu We recreate a state-of-the-art preposition usage system (Tetreault and Chodorow (2008), henceWe evaluate the effect of adding parse features to a leading model of preposition us- age. Results show a significant improvement in the preposition selection task on native speaker text and a modest increment in precision and recall in an ESL error detection task. Analysis of the parser output indicates that it is robust enough in the face of noisy non-native writing to extract useful information.
3 0.63063753 76 acl-2010-Creating Robust Supervised Classifiers via Web-Scale N-Gram Data
Author: Shane Bergsma ; Emily Pitler ; Dekang Lin
Abstract: In this paper, we systematically assess the value of using web-scale N-gram data in state-of-the-art supervised NLP classifiers. We compare classifiers that include or exclude features for the counts of various N-grams, where the counts are obtained from a web-scale auxiliary corpus. We show that including N-gram count features can advance the state-of-the-art accuracy on standard data sets for adjective ordering, spelling correction, noun compound bracketing, and verb part-of-speech disambiguation. More importantly, when operating on new domains, or when labeled training data is not plentiful, we show that using web-scale N-gram features is essential for achieving robust performance.
4 0.62142599 130 acl-2010-Hard Constraints for Grammatical Function Labelling
Author: Wolfgang Seeker ; Ines Rehbein ; Jonas Kuhn ; Josef Van Genabith
Abstract: For languages with (semi-) free word order (such as German), labelling grammatical functions on top of phrase-structural constituent analyses is crucial for making them interpretable. Unfortunately, most statistical classifiers consider only local information for function labelling and fail to capture important restrictions on the distribution of core argument functions such as subject, object etc., namely that there is at most one subject (etc.) per clause. We augment a statistical classifier with an integer linear program imposing hard linguistic constraints on the solution space output by the classifier, capturing global distributional restrictions. We show that this improves labelling quality, in particular for argument grammatical functions, in an intrinsic evaluation, and, importantly, grammar coverage for treebankbased (Lexical-Functional) grammar acquisition and parsing, in an extrinsic evaluation.
5 0.54020333 203 acl-2010-Rebanking CCGbank for Improved NP Interpretation
Author: Matthew Honnibal ; James R. Curran ; Johan Bos
Abstract: Once released, treebanks tend to remain unchanged despite any shortcomings in their depth of linguistic analysis or coverage of specific phenomena. Instead, separate resources are created to address such problems. In this paper we show how to improve the quality of a treebank, by integrating resources and implementing improved analyses for specific constructions. We demonstrate this rebanking process by creating an updated version of CCGbank that includes the predicate-argument structure of both verbs and nouns, baseNP brackets, verb-particle constructions, and restrictive and non-restrictive nominal modifiers; and evaluate the impact of these changes on a statistical parser.
6 0.53973246 19 acl-2010-A Taxonomy, Dataset, and Classifier for Automatic Noun Compound Interpretation
7 0.53899008 211 acl-2010-Simple, Accurate Parsing with an All-Fragments Grammar
8 0.53504646 114 acl-2010-Faster Parsing by Supertagger Adaptation
9 0.52242637 12 acl-2010-A Probabilistic Generative Model for an Intermediate Constituency-Dependency Representation
10 0.51256192 139 acl-2010-Identifying Generic Noun Phrases
11 0.50803757 195 acl-2010-Phylogenetic Grammar Induction
12 0.50328815 101 acl-2010-Entity-Based Local Coherence Modelling Using Topological Fields
13 0.50225282 214 acl-2010-Sparsity in Dependency Grammar Induction
14 0.47900274 99 acl-2010-Efficient Third-Order Dependency Parsers
15 0.47578657 39 acl-2010-Automatic Generation of Story Highlights
16 0.47439066 34 acl-2010-Authorship Attribution Using Probabilistic Context-Free Grammars
17 0.46791747 143 acl-2010-Importance of Linguistic Constraints in Statistical Dependency Parsing
18 0.45525247 172 acl-2010-Minimized Models and Grammar-Informed Initialization for Supertagging with Highly Ambiguous Lexicons
19 0.44619066 205 acl-2010-SVD and Clustering for Unsupervised POS Tagging
20 0.44613689 117 acl-2010-Fine-Grained Genre Classification Using Structural Learning Algorithms
topicId topicWeight
[(14, 0.022), (25, 0.085), (33, 0.024), (39, 0.022), (40, 0.278), (42, 0.029), (44, 0.011), (59, 0.098), (73, 0.051), (78, 0.033), (80, 0.017), (83, 0.104), (84, 0.027), (98, 0.109)]
simIndex simValue paperId paperTitle
same-paper 1 0.82963639 200 acl-2010-Profiting from Mark-Up: Hyper-Text Annotations for Guided Parsing
Author: Valentin I. Spitkovsky ; Daniel Jurafsky ; Hiyan Alshawi
Abstract: We show how web mark-up can be used to improve unsupervised dependency parsing. Starting from raw bracketings of four common HTML tags (anchors, bold, italics and underlines), we refine approximate partial phrase boundaries to yield accurate parsing constraints. Conversion procedures fall out of our linguistic analysis of a newly available million-word hyper-text corpus. We demonstrate that derived constraints aid grammar induction by training Klein and Manning’s Dependency Model with Valence (DMV) on this data set: parsing accuracy on Section 23 (all sentences) of the Wall Street Journal corpus jumps to 50.4%, beating previous state-of-the- art by more than 5%. Web-scale experiments show that the DMV, perhaps because it is unlexicalized, does not benefit from orders of magnitude more annotated but noisier data. Our model, trained on a single blog, generalizes to 53.3% accuracy out-of-domain, against the Brown corpus nearly 10% higher than the previous published best. The fact that web mark-up strongly correlates with syntactic structure may have broad applicability in NLP.
2 0.62204748 77 acl-2010-Cross-Language Document Summarization Based on Machine Translation Quality Prediction
Author: Xiaojun Wan ; Huiying Li ; Jianguo Xiao
Abstract: Cross-language document summarization is a task of producing a summary in one language for a document set in a different language. Existing methods simply use machine translation for document translation or summary translation. However, current machine translation services are far from satisfactory, which results in that the quality of the cross-language summary is usually very poor, both in readability and content. In this paper, we propose to consider the translation quality of each sentence in the English-to-Chinese cross-language summarization process. First, the translation quality of each English sentence in the document set is predicted with the SVM regression method, and then the quality score of each sentence is incorporated into the summarization process. Finally, the English sentences with high translation quality and high informativeness are selected and translated to form the Chinese summary. Experimental results demonstrate the effectiveness and usefulness of the proposed approach. 1
3 0.59022999 71 acl-2010-Convolution Kernel over Packed Parse Forest
Author: Min Zhang ; Hui Zhang ; Haizhou Li
Abstract: This paper proposes a convolution forest kernel to effectively explore rich structured features embedded in a packed parse forest. As opposed to the convolution tree kernel, the proposed forest kernel does not have to commit to a single best parse tree, is thus able to explore very large object spaces and much more structured features embedded in a forest. This makes the proposed kernel more robust against parsing errors and data sparseness issues than the convolution tree kernel. The paper presents the formal definition of convolution forest kernel and also illustrates the computing algorithm to fast compute the proposed convolution forest kernel. Experimental results on two NLP applications, relation extraction and semantic role labeling, show that the proposed forest kernel significantly outperforms the baseline of the convolution tree kernel. 1
4 0.58738571 169 acl-2010-Learning to Translate with Source and Target Syntax
Author: David Chiang
Abstract: Statistical translation models that try to capture the recursive structure of language have been widely adopted over the last few years. These models make use of varying amounts of information from linguistic theory: some use none at all, some use information about the grammar of the target language, some use information about the grammar of the source language. But progress has been slower on translation models that are able to learn the relationship between the grammars of both the source and target language. We discuss the reasons why this has been a challenge, review existing attempts to meet this challenge, and show how some old and new ideas can be combined into a sim- ple approach that uses both source and target syntax for significant improvements in translation accuracy.
5 0.58734679 101 acl-2010-Entity-Based Local Coherence Modelling Using Topological Fields
Author: Jackie Chi Kit Cheung ; Gerald Penn
Abstract: One goal of natural language generation is to produce coherent text that presents information in a logical order. In this paper, we show that topological fields, which model high-level clausal structure, are an important component of local coherence in German. First, we show in a sentence ordering experiment that topological field information improves the entity grid model of Barzilay and Lapata (2008) more than grammatical role and simple clausal order information do, particularly when manual annotations of this information are not available. Then, we incorporate the model enhanced with topological fields into a natural language generation system that generates constituent orders for German text, and show that the added coherence component improves performance slightly, though not statistically significantly.
6 0.58668721 211 acl-2010-Simple, Accurate Parsing with an All-Fragments Grammar
7 0.58371449 109 acl-2010-Experiments in Graph-Based Semi-Supervised Learning Methods for Class-Instance Acquisition
8 0.58270717 153 acl-2010-Joint Syntactic and Semantic Parsing of Chinese
9 0.58195865 218 acl-2010-Structural Semantic Relatedness: A Knowledge-Based Method to Named Entity Disambiguation
10 0.5814693 261 acl-2010-Wikipedia as Sense Inventory to Improve Diversity in Web Search Results
11 0.58128101 247 acl-2010-Unsupervised Event Coreference Resolution with Rich Linguistic Features
12 0.58049971 120 acl-2010-Fully Unsupervised Core-Adjunct Argument Classification
13 0.58005857 55 acl-2010-Bootstrapping Semantic Analyzers from Non-Contradictory Texts
14 0.57934415 252 acl-2010-Using Parse Features for Preposition Selection and Error Detection
15 0.57924545 76 acl-2010-Creating Robust Supervised Classifiers via Web-Scale N-Gram Data
16 0.57903582 248 acl-2010-Unsupervised Ontology Induction from Text
17 0.57785231 162 acl-2010-Learning Common Grammar from Multilingual Corpus
18 0.57734871 158 acl-2010-Latent Variable Models of Selectional Preference
19 0.57688689 128 acl-2010-Grammar Prototyping and Testing with the LinGO Grammar Matrix Customization System
20 0.57625031 113 acl-2010-Extraction and Approximation of Numerical Attributes from the Web