emnlp emnlp2011 emnlp2011-115 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Wenbin Jiang ; Qun Liu ; Yajuan Lv
Abstract: We propose a relaxed correspondence assumption for cross-lingual projection of constituent syntax, which allows a supposed constituent of the target sentence to correspond to an unrestricted treelet in the source parse. Such a relaxed assumption fundamentally tolerates the syntactic non-isomorphism between languages, and enables us to learn the target-language-specific syntactic idiosyncrasy rather than a strained grammar directly projected from the source language syntax. Based on this assumption, a novel constituency projection method is also proposed in order to induce a projected constituent treebank from the source-parsed bilingual corpus. Experiments show that the parser trained on the projected treebank dramatically outperforms previous projected and unsupervised parsers.
Reference: text
sentIndex sentText sentNum sentScore
1 Such a relaxed assumption fundamentally tolerates the syntactic non-isomorphism between languages, and enables us to learn the target-language-specific syntactic idiosyncrasy rather than a strained grammar directly projected from the source language syntax. [sent-2, score-1.289]
2 Based on this assumption, a novel constituency projection method is also proposed in order to induce a projected constituent treebank from the source-parsed bilingual corpus. [sent-3, score-1.554]
3 Experiments show that the parser trained on the projected treebank dramatically outperforms previous projected and unsupervised parsers. [sent-4, score-1.576]
4 1 Introduction For languages with treebanks, supervised models give state-of-the-art performance in dependency parsing (McDonald and Pereira, 2006; Nivre et al. [sent-5, score-0.134]
5 , 2010) and constituent parsing (Collins, 2003; Charniak and Johnson, 2005; Petrov et al. [sent-7, score-0.169]
6 To break the restriction of the treebank scale, much work has been devoted to unsupervised methods (Klein and Manning, 2004; Bod, 2006; Seginer, 2007; Cohen and Smith, 2009) and semi-supervised methods (Sarkar, 2001; Steedman et al. [sent-9, score-0.172]
7 conducted many investigations on syntax projection (Hwa et al. [sent-15, score-0.329]
8 Unlike bilingual parsing (Smith and Smith, 2004; Burkett and Klein, 2008; Zhao et al. [sent-19, score-0.147]
9 , 2010) that improves parsing performance with bilingual constraints, and the bilingual grammar induction (Wu, 1997; Kuhn, 2004; Blunsom et al. [sent-22, score-0.424]
10 , 2009) that induces grammar from parallel text, syntax projection aims to project syntactic knowledge from one language to another. [sent-24, score-0.577]
11 This seems especially promising for the languages that have bilingual corpora parallel to resource-rich languages with large treebanks. [sent-25, score-0.254]
12 The dependency relationship between words in the parsed source sentences can be directly projected across the word alignment to words in the target sentences, following the direct correspondence assumption (DCA) (Hwa et al. [sent-27, score-0.972]
13 Due to the syntactic non-isomorphism between languages, the DCA assumption usually leads to conflicting or incomplete projection. [sent-29, score-0.167]
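The direct projection of dependencies across a word alignment, and the way it can yield conflicting or incomplete results, can be illustrated with a minimal Python sketch. The function name `project_dependencies` and the data layout (head indices and alignment pairs) are assumptions made for illustration, not the procedure of Hwa et al.

```python
def project_dependencies(source_heads, alignment):
    """Project source dependency edges onto target words across an alignment.

    source_heads: {dependent_index: head_index} for the source sentence.
    alignment:    iterable of (source_index, target_index) pairs.
    Returns {target_dependent: set_of_projected_target_heads}.
    """
    src_to_tgt = {}
    for s, t in alignment:
        src_to_tgt.setdefault(s, []).append(t)

    projected = {}
    for dep, head in source_heads.items():
        # Unaligned source words contribute nothing (incomplete projection).
        for t_dep in src_to_tgt.get(dep, []):
            for t_head in src_to_tgt.get(head, []):
                if t_dep != t_head:
                    # One-to-many alignments can give several heads (conflict).
                    projected.setdefault(t_dep, set()).add(t_head)
    return projected
```

A target word that receives more than one head is a conflict, and a target word that receives none is left without an attachment; both cases arise as soon as the alignment is not one-to-one.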
14 , 2005), and resorting to the quasi-synchronous grammar (Smith and Eisner, 2009). [sent-31, score-0.158]
15 For constituency projection, however, the lack of isomorphism becomes much more serious, since a constituent grammar describes a language in a more detailed way. [sent-32, score-0.527]
16 In this paper we propose a relaxed correspondence assumption (RCA) for constituency projection. Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Edinburgh, Scotland, UK, July 27–31, 2011. [sent-33, score-0.484]
17 A dash-dot line links a projected constituent to its corresponding … [sent-45, score-0.769]
18 The projection is from English … treelet, which is marked with a gray background; an Arabic numeral relates a directly projected constituent to its counterpart in the source parse. [sent-50, score-0.45]
19 It allows a supposed constituent of the target sentence to correspond to an unrestricted treelet in the source parse. [sent-52, score-0.373]
20 Such a relaxed assumption fundamentally tolerates the syntactic non-isomorphism between languages, and enables us to learn the target-language-specific syntactic idiosyncrasy, rather than induce a strained grammar directly projected from the source language syntax. [sent-53, score-1.31]
21 We also propose a novel cross-lingual projection method for constituent syntax based on the RCA assumption. [sent-54, score-0.445]
22 The projected PCFG grammar is then used to parse each target sentence under the guidance of the corresponding source tree, so as to produce an optimized projected constituent tree. [sent-56, score-1.678]
23 Experiments validate the effectiveness of the RCA assumption and the constituency projection method. [sent-57, score-0.651]
24 We induce a projected Chinese constituent treebank from the FBIS Chinese-English parallel corpus with English sentences parsed by the Charniak parser. [sent-58, score-0.962]
25 The Berkeley Parser trained on the projected treebank dramatically outperforms the previous projected and unsupervised parsers. [sent-59, score-0.835]
26 This provides a promising substitute for unsupervised parsing methods for resource-scarce languages that have bilingual corpora parallel to resource-rich languages with human-annotated treebanks. [sent-60, score-0.386]
27 In the rest of this paper we first present the RCA assumption, and the algorithm used to determine the corresponding treelet in the source parse for a candidate constituent in the target sentence. [sent-61, score-0.336]
28 Then we describe the induction of the projected PCFG grammar and the projected constituent treebank from the word-aligned source-parsed parallel corpus. [sent-62, score-1.76]
29 After giving experimental results and a comparison with previous unsupervised and projected parsers, we conclude our work and point out several aspects to be improved in future work. [sent-63, score-0.701]
30 2 Relaxed Correspondence Assumption The DCA assumption (Hwa et al. [sent-64, score-0.097]
31 A dependency grammar describes a sentence in a compact manner where the syntactic information is carried by the dependency relationships between pairs of words. [sent-66, score-0.286]
32 Input: T_f, the parse tree of source sentence f; e, the target sentence; A, the word alignment of e and f. for i, j s. [sent-68, score-0.201]
33 We also need to calculate the count of binary tree fragments that cover the nodes outside span ⟨i, j⟩. [sent-73, score-0.088]
34 The count of trees containing span ⟨i, j⟩ is α(i, j) · β(i, j). [sent-77, score-0.139]
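The inside and outside counts above can be sketched in Python as follows. This is a minimal illustration that assumes every span of the target sentence is an admissible constituent (the truncated algorithm above may restrict spans further); the names `inside` and `outside` stand in for α and β and are not taken from the paper. Spans are written as fencepost pairs (i, j) covering words i to j-1.

```python
from functools import lru_cache


def span_tree_counts(n):
    """Return counting functions (inside, outside) for binary trees over n words."""

    @lru_cache(maxsize=None)
    def inside(i, j):
        # Number of binary trees that can be built over span (i, j).
        if j - i == 1:
            return 1
        return sum(inside(i, k) * inside(k, j) for k in range(i + 1, j))

    @lru_cache(maxsize=None)
    def outside(i, j):
        # Number of binary tree fragments covering everything outside (i, j).
        if i == 0 and j == n:
            return 1
        total = 0
        for k in range(0, i):          # (i, j) is the right child of (k, j)
            total += outside(k, j) * inside(k, i)
        for k in range(j + 1, n + 1):  # (i, j) is the left child of (i, k)
            total += outside(i, k) * inside(j, k)
        return total

    return inside, outside
```

Under these assumptions, `inside(i, j) * outside(i, j)` is the number of complete binary trees that contain span (i, j), and dividing by `inside(0, n)` gives the fraction of trees containing it when all trees are equally likely.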
35 The counting approach above is based on the assumption that there is a uniform distribution over the projected trees for every target sentence. [sent-79, score-0.827]
36 A binarized projected PCFG grammar can then be easily induced by maximum likelihood estimation. [sent-82, score-0.844]
37 Due to word alignment errors, free translation, and exhaustive enumeration of possible projected productions, such a PCFG grammar may contain too many noisy nonterminals and production rules. [sent-83, score-1.094]
38 A production rule is retained only if its frequency is larger than b_RULE. [sent-85, score-0.134]
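The estimation and frequency filtering described above can be sketched as below. This is a hedged illustration: `estimate_pcfg` and `min_count` are hypothetical names (`min_count` plays the role of b_RULE), and applying the threshold before normalization is an assumption, since the text does not state the order.

```python
from collections import defaultdict


def estimate_pcfg(rule_counts, min_count=0.0):
    """Relative-frequency (maximum likelihood) estimate of rule probabilities.

    rule_counts: {(lhs, rhs): count}; rhs is a tuple of symbols and counts
                 may be fractional.
    min_count:   stand-in for the threshold b_RULE; a rule is kept only if
                 its count is strictly larger than this value.
    """
    kept = {rule: c for rule, c in rule_counts.items() if c > min_count}

    # Total count of all kept rules sharing the same left-hand side.
    lhs_totals = defaultdict(float)
    for (lhs, _rhs), c in kept.items():
        lhs_totals[lhs] += c

    # p(lhs -> rhs) = count(lhs -> rhs) / count(lhs)
    return {rule: c / lhs_totals[rule[0]] for rule, c in kept.items()}
```

With this layout the probabilities of all kept rules for one left-hand side sum to 1, which is the usual property of a maximum likelihood PCFG estimate.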
39 2 Relaxed Tree Projection The projected PCFG grammar is used in the procedure of constituency projection. [sent-87, score-1.064]
40 Such a grammar, as a kind of global syntactic knowledge, can attenuate the negative effect of word alignment error, free translation and syntactic non-isomorphism for the constituency projection between each single sentence pair. [sent-88, score-0.736]
41 To obtain as optimal a projected constituency tree as possible, we have to integrate two kinds of knowledge: the local knowledge in the candidate projected production set of the target sentence, and the global knowledge in the projected PCFG grammar. [sent-89, score-2.463]
42 The integrated projection strategy can be conducted as follows. [sent-90, score-0.301]
43 We parse each target sentence with the projected PCFG grammar G, and use the candidate projected production set D to guide the PCFG parsing. [sent-91, score-1.669]
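One way to picture guided PCFG parsing is a Viterbi CKY pass in which productions supported by the projection receive a bonus. The sketch below is an assumption-laden illustration, not the paper's algorithm: it takes preterminal tags as given, represents the candidate set D as labelled spans `(label, i, j)`, and adds a log-space bonus `lam` for each supported span. All function and parameter names are hypothetical.

```python
import math


def guided_cky(tags, binary_rules, guide, lam=1.0):
    """Viterbi CKY over preterminal tags with a bonus for guided spans.

    tags:         preterminal labels, one per target word (assumed given).
    binary_rules: {(A, B, C): probability} for rules A -> B C.
    guide:        set of (A, i, j) labelled spans proposed by the projection,
                  a simplified stand-in for the candidate production set D.
    lam:          bonus added in log space when a built span is in `guide`.
    """
    n = len(tags)
    best = {}  # (i, j, A) -> best log score
    back = {}  # (i, j, A) -> (split point k, B, C)
    for i, tag in enumerate(tags):
        best[(i, i + 1, tag)] = 0.0
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (a, b, c), p in binary_rules.items():
                    left = best.get((i, k, b))
                    right = best.get((k, j, c))
                    if left is None or right is None:
                        continue
                    bonus = lam if (a, i, j) in guide else 0.0
                    score = left + right + math.log(p) + bonus
                    if score > best.get((i, j, a), -math.inf):
                        best[(i, j, a)] = score
                        back[(i, j, a)] = (k, b, c)
    return best, back


def build_tree(back, i, j, label):
    """Rebuild the best tree as nested tuples; leaves are bare tag strings."""
    if (i, j, label) not in back:
        return label
    k, b, c = back[(i, j, label)]
    return (label, build_tree(back, i, k, b), build_tree(back, k, j, c))
```

Setting `lam` to 0 reduces this to plain Viterbi parsing with the grammar alone, while larger values let the projected spans override the grammar's preferences.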
44 4 Experiments Our work focuses on the constituency projection from English to Chinese. [sent-95, score-0.554]
45 The FBIS Chinese-English parallel corpus is used to obtain a projected constituent treebank. [sent-96, score-0.825]
46 We perform word alignment by running GIZA++ (Och and Ney, 2000), and then use the alignment results for constituency projection. [sent-103, score-0.367]
47 Following the previous works of unsupervised constituent parsing, we evaluate the projected parser on the subsets of CTB 1. [sent-104, score-0.954]
48 The evaluation for unsupervised parsing differs slightly from the standard … Figure 4: Performance curve of the projected PCFG grammars corresponding to different sizes of nonterminal sets (x-axis: number of selected nonterminals). [sent-108, score-0.845]
49 1 Projected PCFG Grammar An initial projected PCFG grammar can be induced from the word-aligned and source-parsed parallel corpus according to section 3. [sent-112, score-0.9]
50 Such an initial grammar is huge and contains a large number of projected nonterminals and production rules, many of which come from free translation and word alignment errors. [sent-114, score-1.069]
51 Figure 3 shows the statistics of the remaining production rules. [sent-118, score-0.111]
52 We sort the projected nonterminals according to their frequencies and select the top 2^N (1 ≤ N ≤ 10) best ones, and then discard the rules that fall out of the selected nonterminal set. [sent-119, score-0.711]
53 The frequency summation of the rule set corresponding to the 32 best nonterminals accounts for nearly 90% of the frequency summation of the whole rule set. [sent-120, score-0.206]
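The nonterminal selection above can be sketched as follows. This is a hedged illustration: it assumes a nonterminal's frequency is the total count of rules it heads, and that symbols listed in `preterminals` are exempt from filtering; neither detail is stated in the text, and the function name is hypothetical.

```python
from collections import Counter


def filter_by_top_nonterminals(rule_counts, n_exp, preterminals=frozenset()):
    """Keep rules that only use the 2**n_exp most frequent nonterminals.

    rule_counts:  {(lhs, rhs): count}, with rhs a tuple of symbols.
    n_exp:        the exponent N; the 2**N most frequent nonterminals are kept.
    preterminals: symbols that are always allowed on a right-hand side.
    Returns (kept_rules, coverage), where coverage is the share of the total
    rule frequency that survives the filter.
    """
    freq = Counter()
    for (lhs, _rhs), c in rule_counts.items():
        freq[lhs] += c
    top = {nt for nt, _ in freq.most_common(2 ** n_exp)}
    allowed = top | set(preterminals)

    kept = {
        (lhs, rhs): c
        for (lhs, rhs), c in rule_counts.items()
        if lhs in top and all(sym in allowed for sym in rhs)
    }
    coverage = sum(kept.values()) / sum(rule_counts.values())
    return kept, coverage
```

The returned coverage corresponds to the kind of statistic quoted above, where the rules over the 32 best nonterminals account for nearly 90% of the total rule frequency.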
54 Figure 4 gives the unlabeled F1 value of each grammar on all trees in the development set. [sent-123, score-0.256]
55 The filtered grammar corresponding to the set of top 32 nonterminals achieves the highest performance. [sent-124, score-0.216]
56 We denote this grammar as G32 and use it … Figure 5: Performance curve of the Berkeley Parser trained on 5 thousand projected trees (x-axis: weight coefficient). [sent-125, score-0.961]
57 2 Projected Treebank and Parser The projected grammar G32 provides global syntactic knowledge for constituency projection. [sent-129, score-1.123]
58 Such global knowledge and the local knowledge carried by the candidate projected production set are integrated in a linear weighted manner as in Formula 7. [sent-130, score-0.818]
59 The weight coefficient λ is tuned to maximize the quality of the projected treebank, which is indirectly measured by evaluating the performance of the parser trained on it. [sent-131, score-0.777]
60 We select the first 5 thousand sentence pairs from the Chinese-English FBIS corpus, and induce a series of projected treebanks using different λ, ranging from 0 to 5. [sent-132, score-0.804]
61 Then we train the Berkeley Parser on each projected treebank, and test it on the development set of CTB 1. [sent-133, score-0.689]
62 Figure 5 gives the performance curve, which reports the unlabeled F1 values of the projected parsers on all sentences of the development set. [sent-135, score-0.744]
63 It can be concluded that the projected PCFG grammar and the candidate projected production set do represent two different kinds of constraints, and we can effectively coordinate them by tuning the weight coefficient. [sent-138, score-1.604]
64 Since different λ values in this range result in slight performance fluctuation of the projected parser, we simply set it to 1 for the constituency projection on the whole FBIS corpus. [sent-139, score-1.273]
65 There are more than 200 thousand projected trees … Figure 6: Performance curve of the Berkeley Parser trained on different amounts of best projected trees (x-axis: scale of treebank). [sent-140, score-0.903]
66 The scale of the selected treebank ranges from 5000 to 160000. [sent-141, score-0.099]
67 It is a heavy burden for a parser to train on so large a treebank. [sent-143, score-0.088]
68 On the other hand, free translation and word alignment errors result in many projected trees of poor quality. [sent-144, score-0.779]
69 We design a criterion to approximate the quality of the projected tree y for the target sentence x: $\tilde{Q}(y) = \sqrt[|x|-1]{\prod_{d \in y} p(d \mid G) \cdot e^{\lambda \cdot \delta(d, D)}}$ (8) and use a number of the best projected trees instead of the whole projected treebank to train the parser. [sent-145, score-2.211]
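The quality criterion of Formula 8 and the selection of the best trees can be sketched as below. This is a minimal illustration under stated assumptions: δ(d, D) is taken to be 1 when production d is in the candidate set and 0 otherwise, the root index |x| - 1 is read off the garbled formula, and the function names are hypothetical.

```python
import math


def tree_quality(productions, rule_prob, guide, lam, sentence_length):
    """Length-normalized quality of a projected tree, in the spirit of Formula 8.

    productions:     the productions d used in the tree y.
    rule_prob:       {d: p(d | G)} from the projected grammar.
    guide:           set standing in for D; delta(d, D) is assumed to be
                     1 if d is in this set and 0 otherwise.
    lam:             the weight coefficient lambda.
    sentence_length: |x|, the number of words in the target sentence.
    """
    log_score = sum(
        math.log(rule_prob[d]) + lam * (1.0 if d in guide else 0.0)
        for d in productions
    )
    # Take the (|x| - 1)-th root in log space; guard one-word sentences.
    return math.exp(log_score / max(sentence_length - 1, 1))


def select_best_trees(scored_trees, k):
    """Return the k (tree, quality) pairs with the highest quality."""
    return sorted(scored_trees, key=lambda pair: pair[1], reverse=True)[:k]
```

Normalizing by sentence length keeps long sentences from being penalized merely for containing more productions, so trees of different sizes can be ranked against each other.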
70 However, treebanks containing more than 40 thousand projected trees cannot bring significant improvement. [sent-148, score-0.803]
71 The parser trained on 160 thousand trees only achieves an F1 increment of 0. [sent-149, score-0.233]
72 This indicates that the newly added trees do not give the parser more information due to their projection quality, and a larger parallel corpus may lead to better parsing performance. [sent-151, score-0.535]
73 The Berkeley Parser trained on 160 thousand best projected trees is used in the final test. [sent-152, score-0.768]
74 Our projected parser significantly … Table 1: … [sent-155, score-0.741]
75 (2010), where they directly adapt the DCA assumption of (Hwa et al. [sent-168, score-0.097]
76 , 2005) from dependency projection to constituency projection and resort to a better word alignment and a more complicated tree projection algorithm. [sent-169, score-1.306]
77 This indicates that the RCA assumption is more suitable for constituency projection than the DCA assumption, and can induce a better grammar that better reflects the language-specific syntactic idiosyncrasy of the target language. [sent-170, score-0.974]
78 Our projected parser also clearly surpasses existing unsupervised parsers. [sent-171, score-0.837]
79 The parser of Seginer (2007) performs slightly better on CTB 5. [sent-172, score-0.088]
80 Figure 7 shows the unlabeled F1 of our parser on a series of subsets of CTB 5. [sent-174, score-0.137]
81 We find that even on the whole treebank, our parser still gives a promising result. [sent-176, score-0.154]
82 Compared with unsupervised parsing, constituency projection can make use of the syntactic information of another language, so it can probably induce a better grammar. [sent-177, score-0.674]
83 5 Conclusion and Future Works This paper describes a relaxed correspondence assumption (RCA) for constituency projection. [sent-179, score-0.484]
84 Under this assumption a supposed constituent in the target sentence can correspond to an unrestricted treelet in the source parse. Figure 7: Performance of the Berkeley Parser on subsets of CTB 5 (x-axis: upper limit of sentence length). [sent-180, score-0.368]
85 Different from the direct correspondence assumption (DCA) widely used in dependency projection, the RCA assumption is more suitable for constituency projection, since it fundamentally tolerates the syntactic non-isomorphism between the source and target languages. [sent-184, score-0.748]
86 According to the RCA assumption we propose a novel constituency projection method. [sent-185, score-0.651]
87 First, a projected PCFG grammar is induced from the word-aligned source-parsed parallel corpus. [sent-186, score-0.9]
88 Then, the tree projection is conducted on each sentence pair by a PCFG parsing procedure, which integrates both the global knowledge in the projected PCFG grammar and the local knowledge in the set of candidate projected productions. [sent-187, score-1.918]
89 Experiments show that the parser trained on the projected treebank significantly outperforms the projected parsers based on the DCA assumption. [sent-188, score-1.523]
90 This validates the effectiveness of the RCA assumption and the constituency projection method, and indicates that the RCA assumption is more suitable for constituency projection than the DCA assumption. [sent-189, score-1.302]
91 The projected parser also clearly surpasses the unsupervised parsers. [sent-190, score-0.837]
92 This suggests that if a resource-poor language has a bilingual corpus parallel to a resource-rich language with a human-annotated treebank, constituency projection based on the RCA assumption is a promising substitute for unsupervised methods. [sent-191, score-0.916]
93 First, word alignment is the fundamental precondition for projected grammar induction and the subsequent constituency projection, so better word alignment strategies could be adopted to improve alignment quality. [sent-193, score-1.283]
94 Second, the PCFG grammar is too weak due to its context-free assumption; we can adopt more complicated grammars such as TAG (Joshi et al. [sent-194, score-0.238]
95 , 1975), in order to provide more powerful global syntactic constraints for the tree projection procedure. [sent-195, score-0.406]
96 Third, the current tree projection algorithm is too simple; more bilingual constraints could lead to better projected trees. [sent-196, score-1.094]
97 Last but not least, constituency projection and unsupervised parsing make use of different kinds of knowledge; therefore, unsupervised methods can be integrated into the constituency projection framework to achieve better projected grammars, treebanks, and parsers. [sent-197, score-1.91]
98 Shared logistic normal distributions for soft parameter tying in unsupervised grammar induction. [sent-225, score-0.206]
99 Corpus-based induction of syntactic structure: Models of dependency and constituency. [sent-263, score-0.106]
100 Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. [sent-338, score-0.228]
wordName wordTfidf (topN-words)
[('projected', 0.653), ('projection', 0.301), ('constituency', 0.253), ('rca', 0.231), ('pcfg', 0.187), ('grammar', 0.158), ('dca', 0.129), ('ctb', 0.119), ('constituent', 0.116), ('production', 0.111), ('treebank', 0.099), ('hwa', 0.098), ('assumption', 0.097), ('bilingual', 0.094), ('treelet', 0.093), ('relaxed', 0.089), ('parser', 0.088), ('hi', 0.08), ('thousand', 0.078), ('fbis', 0.07), ('berkeley', 0.066), ('tolerates', 0.062), ('ji', 0.06), ('nonterminals', 0.058), ('alignment', 0.057), ('parallel', 0.056), ('hk', 0.056), ('parsing', 0.053), ('idiosyncrasy', 0.053), ('smith', 0.05), ('jiang', 0.049), ('unsupervised', 0.048), ('dependency', 0.047), ('tree', 0.046), ('unrestricted', 0.046), ('correspondence', 0.045), ('supposed', 0.045), ('span', 0.042), ('productions', 0.042), ('brule', 0.041), ('wenbin', 0.04), ('fundamentally', 0.04), ('ki', 0.04), ('target', 0.04), ('charniak', 0.039), ('induce', 0.038), ('trees', 0.037), ('koo', 0.037), ('summation', 0.036), ('promising', 0.036), ('developing', 0.036), ('coefficient', 0.036), ('curve', 0.036), ('nonisomorphism', 0.036), ('fluctuation', 0.036), ('strained', 0.036), ('treebanks', 0.035), ('dramatically', 0.035), ('syntactic', 0.034), ('languages', 0.034), ('source', 0.033), ('induced', 0.033), ('free', 0.032), ('qun', 0.032), ('nts', 0.032), ('substitute', 0.031), ('nonterminal', 0.03), ('whole', 0.03), ('increment', 0.03), ('seginer', 0.03), ('parsers', 0.03), ('candidate', 0.029), ('mcclosky', 0.028), ('syntax', 0.028), ('martins', 0.026), ('yajuan', 0.026), ('proceedings', 0.026), ('grammars', 0.025), ('parse', 0.025), ('anoop', 0.025), ('enumeration', 0.025), ('surpasses', 0.025), ('bitext', 0.025), ('burkett', 0.025), ('ganchev', 0.025), ('chinese', 0.025), ('induction', 0.025), ('unlabeled', 0.025), ('global', 0.025), ('works', 0.025), ('xue', 0.024), ('subsets', 0.024), ('adopt', 0.023), ('rule', 0.023), ('brackets', 0.023), ('rebecca', 0.023), ('noah', 0.023), ('obviously', 0.023), ('snyder', 
0.022), ('joshi', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 115 emnlp-2011-Relaxed Cross-lingual Projection of Constituent Syntax
2 0.28352973 95 emnlp-2011-Multi-Source Transfer of Delexicalized Dependency Parsers
Author: Ryan McDonald ; Slav Petrov ; Keith Hall
Abstract: We present a simple method for transferring dependency parsers from source languages with labeled training data to target languages without labeled training data. We first demonstrate that delexicalized parsers can be directly transferred between languages, producing significantly higher accuracies than unsupervised parsers. We then use a constraint driven learning algorithm where constraints are drawn from parallel corpora to project the final parser. Unlike previous work on projecting syntactic resources, we show that simple methods for introducing multiple source languages can significantly improve the overall quality of the resulting parsers. The projected parsers from our system result in state-of-the-art performance when compared to previously studied unsupervised and projected parsing systems across eight different languages.
3 0.11993639 146 emnlp-2011-Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance
Author: Shay B. Cohen ; Dipanjan Das ; Noah A. Smith
Abstract: We describe a method for prediction of linguistic structure in a language for which only unlabeled data is available, using annotated data from a set of one or more helper languages. Our approach is based on a model that locally mixes between supervised models from the helper languages. Parallel data is not used, allowing the technique to be applied even in domains where human-translated texts are unavailable. We obtain state-of-the-art performance for two tasks of structure prediction: unsupervised part-of-speech tagging and unsupervised dependency parsing.
4 0.11763655 118 emnlp-2011-SMT Helps Bitext Dependency Parsing
Author: Wenliang Chen ; Jun'ichi Kazama ; Min Zhang ; Yoshimasa Tsuruoka ; Yujie Zhang ; Yiou Wang ; Kentaro Torisawa ; Haizhou Li
Abstract: We propose a method to improve the accuracy of parsing bilingual texts (bitexts) with the help of statistical machine translation (SMT) systems. Previous bitext parsing methods use human-annotated bilingual treebanks that are hard to obtain. Instead, our approach uses an auto-generated bilingual treebank to produce bilingual constraints. However, because the auto-generated bilingual treebank contains errors, the bilingual constraints are noisy. To overcome this problem, we use large-scale unannotated data to verify the constraints and design a set of effective bilingual features for parsing models based on the verified results. The experimental results show that our new parsers significantly outperform state-of-the-art baselines. Moreover, our approach is still able to provide improvement when we use a larger monolingual treebank that results in a much stronger baseline. Especially notable is that our approach can be used in a purely monolingual setting with the help of SMT.
5 0.11572726 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation
Author: Kevin Gimpel ; Noah A. Smith
Abstract: We present a quasi-synchronous dependency grammar (Smith and Eisner, 2006) for machine translation in which the leaves of the tree are phrases rather than words as in previous work (Gimpel and Smith, 2009). This formulation allows us to combine structural components of phrase-based and syntax-based MT in a single model. We describe a method of extracting phrase dependencies from parallel text using a target-side dependency parser. For decoding, we describe a coarse-to-fine approach based on lattice dependency parsing of phrase lattices. We demonstrate performance improvements for Chinese-English and Urdu-English translation over a phrase-based baseline. We also investigate the use of unsupervised dependency parsers, reporting encouraging preliminary results.
6 0.10125594 16 emnlp-2011-Accurate Parsing with Compact Tree-Substitution Grammars: Double-DOP
7 0.092114598 4 emnlp-2011-A Fast, Accurate, Non-Projective, Semantically-Enriched Parser
8 0.091677628 74 emnlp-2011-Inducing Sentence Structure from Parallel Corpora for Reordering
9 0.089894757 141 emnlp-2011-Unsupervised Dependency Parsing without Gold Part-of-Speech Tags
10 0.081021398 15 emnlp-2011-A novel dependency-to-string model for statistical machine translation
11 0.078424692 75 emnlp-2011-Joint Models for Chinese POS Tagging and Dependency Parsing
12 0.074751087 10 emnlp-2011-A Probabilistic Forest-to-String Model for Language Generation from Typed Lambda Calculus Expressions
13 0.073748127 31 emnlp-2011-Computation of Infix Probabilities for Probabilistic Context-Free Grammars
14 0.072810858 50 emnlp-2011-Evaluating Dependency Parsing: Robust and Heuristics-Free Cross-Annotation Evaluation
15 0.072594598 103 emnlp-2011-Parser Evaluation over Local and Non-Local Deep Dependencies in a Large Corpus
16 0.071611069 111 emnlp-2011-Reducing Grounded Learning Tasks To Grammatical Inference
17 0.068544745 137 emnlp-2011-Training dependency parsers by jointly optimizing multiple objectives
18 0.067730464 3 emnlp-2011-A Correction Model for Word Alignments
19 0.066998221 136 emnlp-2011-Training a Parser for Machine Translation Reordering
20 0.066906832 83 emnlp-2011-Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation
topicId topicWeight
[(0, 0.213), (1, 0.154), (2, 0.001), (3, 0.164), (4, -0.053), (5, 0.113), (6, -0.121), (7, 0.031), (8, -0.053), (9, 0.011), (10, -0.096), (11, 0.08), (12, 0.092), (13, 0.111), (14, 0.196), (15, -0.012), (16, 0.046), (17, -0.015), (18, -0.205), (19, -0.027), (20, -0.003), (21, -0.202), (22, -0.106), (23, 0.076), (24, 0.067), (25, 0.006), (26, 0.074), (27, -0.009), (28, 0.099), (29, 0.021), (30, -0.03), (31, 0.071), (32, 0.136), (33, 0.089), (34, -0.088), (35, -0.106), (36, -0.011), (37, 0.024), (38, 0.038), (39, -0.173), (40, -0.058), (41, -0.157), (42, 0.017), (43, -0.083), (44, 0.009), (45, -0.061), (46, -0.073), (47, 0.008), (48, -0.015), (49, 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.96698904 115 emnlp-2011-Relaxed Cross-lingual Projection of Constituent Syntax
2 0.74661797 95 emnlp-2011-Multi-Source Transfer of Delexicalized Dependency Parsers
Author: Ryan McDonald ; Slav Petrov ; Keith Hall
Abstract: We present a simple method for transferring dependency parsers from source languages with labeled training data to target languages without labeled training data. We first demonstrate that delexicalized parsers can be directly transferred between languages, producing significantly higher accuracies than unsupervised parsers. We then use a constraint driven learning algorithm where constraints are drawn from parallel corpora to project the final parser. Unlike previous work on projecting syntactic resources, we show that simple methods for introducing multiple source languages can significantly improve the overall quality of the resulting parsers. The projected parsers from our system result in state-of-the-art performance when compared to previously studied unsupervised and projected parsing systems across eight different languages.
3 0.50882345 146 emnlp-2011-Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance
Author: Shay B. Cohen ; Dipanjan Das ; Noah A. Smith
Abstract: We describe a method for prediction of linguistic structure in a language for which only unlabeled data is available, using annotated data from a set of one or more helper languages. Our approach is based on a model that locally mixes between supervised models from the helper languages. Parallel data is not used, allowing the technique to be applied even in domains where human-translated texts are unavailable. We obtain state-of-the-art performance for two tasks of structure prediction: unsupervised part-of-speech tagging and unsupervised dependency parsing.
4 0.49573615 118 emnlp-2011-SMT Helps Bitext Dependency Parsing
Author: Wenliang Chen ; Jun'ichi Kazama ; Min Zhang ; Yoshimasa Tsuruoka ; Yujie Zhang ; Yiou Wang ; Kentaro Torisawa ; Haizhou Li
Abstract: We propose a method to improve the accuracy of parsing bilingual texts (bitexts) with the help of statistical machine translation (SMT) systems. Previous bitext parsing methods use human-annotated bilingual treebanks that are hard to obtain. Instead, our approach uses an auto-generated bilingual treebank to produce bilingual constraints. However, because the auto-generated bilingual treebank contains errors, the bilingual constraints are noisy. To overcome this problem, we use large-scale unannotated data to verify the constraints and design a set of effective bilingual features for parsing models based on the verified results. The experimental results show that our new parsers significantly outperform state-of-the-art baselines. Moreover, our approach is still able to provide improvement when we use a larger monolingual treebank that results in a much stronger baseline. Especially notable is that our approach can be used in a purely monolingual setting with the help of SMT.
5 0.48410389 16 emnlp-2011-Accurate Parsing with Compact Tree-Substitution Grammars: Double-DOP
Author: Federico Sangati ; Willem Zuidema
Abstract: We present a novel approach to Data-Oriented Parsing (DOP). Like other DOP models, our parser utilizes syntactic fragments of arbitrary size from a treebank to analyze new sentences, but, crucially, it uses only those which are encountered at least twice. This criterion allows us to work with a relatively small but representative set of fragments, which can be employed as the symbolic backbone of several probabilistic generative models. For parsing we define a transform-backtransform approach that allows us to use standard PCFG technology, making our results easily replicable. According to standard Parseval metrics, our best model is on par with many state-of-the-art parsers, while offering some complementary benefits: a simple generative probability model, and an explicit representation of the larger units of grammar.
6 0.45487222 111 emnlp-2011-Reducing Grounded Learning Tasks To Grammatical Inference
7 0.3781867 73 emnlp-2011-Improving Bilingual Projections via Sparse Covariance Matrices
8 0.36800373 141 emnlp-2011-Unsupervised Dependency Parsing without Gold Part-of-Speech Tags
9 0.36794677 103 emnlp-2011-Parser Evaluation over Local and Non-Local Deep Dependencies in a Large Corpus
10 0.36145037 74 emnlp-2011-Inducing Sentence Structure from Parallel Corpora for Reordering
11 0.36014009 85 emnlp-2011-Learning to Simplify Sentences with Quasi-Synchronous Grammar and Integer Programming
12 0.34922171 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation
13 0.32865575 140 emnlp-2011-Universal Morphological Analysis using Structured Nearest Neighbor Prediction
14 0.31362513 31 emnlp-2011-Computation of Infix Probabilities for Probabilistic Context-Free Grammars
15 0.2972717 1 emnlp-2011-A Bayesian Mixture Model for PoS Induction Using Multiple Features
16 0.29481179 54 emnlp-2011-Exploiting Parse Structures for Native Language Identification
17 0.29171479 20 emnlp-2011-Augmenting String-to-Tree Translation Models with Fuzzy Use of Source-side Syntax
18 0.28638634 4 emnlp-2011-A Fast, Accurate, Non-Projective, Semantically-Enriched Parser
19 0.27969155 137 emnlp-2011-Training dependency parsers by jointly optimizing multiple objectives
20 0.27329829 15 emnlp-2011-A novel dependency-to-string model for statistical machine translation
topicId topicWeight
[(23, 0.075), (35, 0.012), (36, 0.025), (37, 0.015), (45, 0.048), (53, 0.014), (54, 0.014), (62, 0.021), (64, 0.074), (65, 0.018), (66, 0.036), (79, 0.459), (82, 0.012), (87, 0.018), (90, 0.019), (96, 0.041)]
simIndex simValue paperId paperTitle
1 0.97632372 121 emnlp-2011-Semi-supervised CCG Lexicon Extension
Author: Emily Thomforde ; Mark Steedman
Abstract: This paper introduces Chart Inference (CI), an algorithm for deriving a CCG category for an unknown word from a partial parse chart. It is shown to be faster and more precise than a baseline brute-force method, and to achieve wider coverage than a rule-based system. In addition, we show the application of CI to a domain adaptation task for question words, which are largely missing in the Penn Treebank. When used in combination with self-training, CI increases the precision of the baseline StatCCG parser over subject-extraction questions by 50%. An error analysis shows that CI contributes to the increase by expanding the number of category types available to the parser, while self-training adjusts the counts.
same-paper 2 0.9495728 115 emnlp-2011-Relaxed Cross-lingual Projection of Constituent Syntax
Author: Wenbin Jiang ; Qun Liu ; Yajuan Lv
Abstract: We propose a relaxed correspondence assumption for cross-lingual projection of constituent syntax, which allows a supposed constituent of the target sentence to correspond to an unrestricted treelet in the source parse. Such a relaxed assumption fundamentally tolerates the syntactic non-isomorphism between languages, and enables us to learn the target-language-specific syntactic idiosyncrasy rather than a strained grammar directly projected from the source language syntax. Based on this assumption, a novel constituency projection method is also proposed in order to induce a projected constituent treebank from the source-parsed bilingual corpus. Experiments show that the parser trained on the projected treebank dramatically outperforms previous projected and unsupervised parsers.
3 0.88970572 36 emnlp-2011-Corroborating Text Evaluation Results with Heterogeneous Measures
Author: Enrique Amigo ; Julio Gonzalo ; Jesus Gimenez ; Felisa Verdejo
Abstract: Automatically produced texts (e.g. translations or summaries) are usually evaluated with n-gram based measures such as BLEU or ROUGE, while the wide set of more sophisticated measures that have been proposed in recent years remains largely ignored for practical purposes. In this paper we first present an in-depth analysis of the state of the art in order to clarify this issue. After this, we formalize and verify empirically a set of properties that every text evaluation measure based on similarity to human-produced references satisfies. These properties imply that corroborating system improvements with additional measures always increases the overall reliability of the evaluation process. In addition, the greater the heterogeneity of the measures (which is measurable) the higher their combined reliability. These results support the use of heterogeneous measures in order to consolidate text evaluation results.
4 0.88184404 34 emnlp-2011-Corpus-Guided Sentence Generation of Natural Images
Author: Yezhou Yang ; Ching Teo ; Hal Daume III ; Yiannis Aloimonos
Abstract: We propose a sentence generation strategy that describes images by predicting the most likely nouns, verbs, scenes and prepositions that make up the core sentence structure. The input are initial noisy estimates of the objects and scenes detected in the image using state of the art trained detectors. As predicting actions from still images directly is unreliable, we use a language model trained from the English Gigaword corpus to obtain their estimates; together with probabilities of co-located nouns, scenes and prepositions. We use these estimates as parameters on a HMM that models the sentence generation process, with hidden nodes as sentence components and image detections as the emissions. Experimental results show that our strategy of combining vision and language produces readable and descriptive sentences compared to naive strategies that use vision alone.
5 0.59162164 87 emnlp-2011-Lexical Generalization in CCG Grammar Induction for Semantic Parsing
Author: Tom Kwiatkowski ; Luke Zettlemoyer ; Sharon Goldwater ; Mark Steedman
Abstract: We consider the problem of learning factored probabilistic CCG grammars for semantic parsing from data containing sentences paired with logical-form meaning representations. Traditional CCG lexicons list lexical items that pair words and phrases with syntactic and semantic content. Such lexicons can be inefficient when words appear repeatedly with closely related lexical content. In this paper, we introduce factored lexicons, which include both lexemes to model word meaning and templates to model systematic variation in word usage. We also present an algorithm for learning factored CCG lexicons, along with a probabilistic parse-selection model. Evaluations on benchmark datasets demonstrate that the approach learns highly accurate parsers, whose generalization performance benefits greatly from the lexical factoring.
6 0.5529657 8 emnlp-2011-A Model of Discourse Predictions in Human Sentence Processing
7 0.53412437 111 emnlp-2011-Reducing Grounded Learning Tasks To Grammatical Inference
8 0.51889127 35 emnlp-2011-Correcting Semantic Collocation Errors with L1-induced Paraphrases
9 0.5182845 132 emnlp-2011-Syntax-Based Grammaticality Improvement using CCG and Guided Search
10 0.51412368 22 emnlp-2011-Better Evaluation Metrics Lead to Better Machine Translation
11 0.51292378 20 emnlp-2011-Augmenting String-to-Tree Translation Models with Fuzzy Use of Source-side Syntax
12 0.51154548 57 emnlp-2011-Extreme Extraction - Machine Reading in a Week
13 0.51024896 54 emnlp-2011-Exploiting Parse Structures for Native Language Identification
14 0.49709886 85 emnlp-2011-Learning to Simplify Sentences with Quasi-Synchronous Grammar and Integer Programming
15 0.48938894 83 emnlp-2011-Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation
16 0.48900107 136 emnlp-2011-Training a Parser for Machine Translation Reordering
17 0.48356932 95 emnlp-2011-Multi-Source Transfer of Delexicalized Dependency Parsers
18 0.48217323 97 emnlp-2011-Multiword Expression Identification with Tree Substitution Grammars: A Parsing tour de force with French
19 0.48138291 31 emnlp-2011-Computation of Infix Probabilities for Probabilistic Context-Free Grammars
20 0.48138234 147 emnlp-2011-Using Syntactic and Semantic Structural Kernels for Classifying Definition Questions in Jeopardy!