acl acl2011 acl2011-127 knowledge-graph by maker-knowledge-mining

127 acl-2011-Exploiting Web-Derived Selectional Preference to Improve Statistical Dependency Parsing


Source: pdf

Author: Guangyou Zhou ; Jun Zhao ; Kang Liu ; Li Cai

Abstract: In this paper, we present a novel approach which incorporates the web-derived selectional preferences to improve statistical dependency parsing. Conventional selectional preference learning methods have usually focused on word-to-class relations, e.g., a verb selects as its subject a given nominal class. This paper extends previous work to word-to-word selectional preferences by using web-scale data. Experiments show that web-scale data improves statistical dependency parsing, particularly for long dependency relationships. There is no data like more data: performance improves log-linearly with the number of parameters (unique N-grams). More importantly, when operating on new domains, we show that using web-derived selectional preferences is essential for achieving robust performance.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Conventional selectional preference learning methods have usually focused on word-to-class relations, e. [sent-2, score-0.625]

2 This paper extends previous work to word-to-word selectional preferences by using web-scale data. [sent-5, score-0.695]

3 Experiments show that web-scale data improves statistical dependency parsing, particularly for long dependency relationships. [sent-6, score-0.745]

4 More importantly, when operating on new domains, we show that using web-derived selectional preferences is essential for achieving robust performance. [sent-8, score-0.636]

5 1 Introduction Dependency parsing is the task of building dependency links between words in a sentence, which has recently gained a wide interest in the natural language processing community. [sent-9, score-0.449]

6 , 1993), it is easy to train a high-performance dependency parser using supervised learning methods. [sent-11, score-0.375]

7 However, current state-of-the-art statistical dependency parsers (McDonald et al. [sent-12, score-0.432]

8 The length of a dependency from word wi to word wj is simply equal to |i − j|. [sent-21, score-0.353]

9 Figure 1 shows the F1 score1 relative to the dependency length on the development set by using the graph-based dependency parsers (McDonald et al. [sent-24, score-0.785]

10 We note that the parsers provide very good results for adjacent dependencies (96. [sent-26, score-0.16]

11 89% for dependency length = 1), but as the dependency length increases, the accuracies degrade sharply. [sent-27, score-0.755]
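
To make the Figure 1 analysis concrete, the following minimal Python sketch (ours, not the paper's code) buckets arcs by dependency length |i − j| and computes a per-length F1. The data layout (`gold_heads`/`pred_heads` dictionaries mapping each modifier position to its head position) is an assumption for illustration.

```python
from collections import defaultdict

def f1_by_dependency_length(sentences):
    """sentences: iterable of (gold_heads, pred_heads) pairs, each a dict
    mapping modifier position i to head position j. Returns {length: F1}
    where length = |i - j|, mirroring the x-axis of Figure 1."""
    gold_n, pred_n, hit = defaultdict(int), defaultdict(int), defaultdict(int)
    for gold_heads, pred_heads in sentences:
        for mod, head in gold_heads.items():
            gold_n[abs(mod - head)] += 1        # gold arcs of this length
        for mod, head in pred_heads.items():
            pred_n[abs(mod - head)] += 1        # predicted arcs of this length
            if gold_heads.get(mod) == head:     # correctly predicted arc
                hit[abs(mod - head)] += 1
    f1 = {}
    for length in sorted(gold_n):
        p = hit[length] / pred_n[length] if pred_n[length] else 0.0
        r = hit[length] / gold_n[length] if gold_n[length] else 0.0
        f1[length] = 2 * p * r / (p + r) if p + r else 0.0
    return f1
```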

12 These longer dependencies are therefore a major opportunity to improve the overall performance of dependency parsing. [sent-28, score-0.413]

13 (2008) proposed a semi-supervised dependency parsing approach that introduces lexical intermediaries at a coarser level than words themselves via a clustering method. [sent-35, score-0.449]

14 This approach, however, ignores the selectional preference for word-to-word interactions, such as the head-modifier relationship. [sent-36, score-0.625]

15 Figure 1: F score relative to dependency length. [sent-40, score-0.325]

16 Our purpose in this paper is to exploit web-derived selectional preferences to improve supervised statistical dependency parsing. [sent-42, score-0.927]

17 All of our lexical statistics are derived from two kinds of web-scale corpora: one is the web, which is the largest data set that is available for NLP (Keller and Lapata, 2003). [sent-43, score-0.215]

18 By leveraging such auxiliary data, the dependency parsing model can directly utilize the additional information to capture word-to-word relationships. [sent-46, score-0.449]

19 We address two natural and related questions which some previous studies leave open: Question I: Is there a benefit in incorporating web-derived selectional preference features for statistical dependency parsing, especially for longer dependencies? [sent-47, score-1.072]

20 Question II: How well do web-derived selectional preferences perform on new domains? [sent-48, score-0.565]

21 For Question I, we systematically assess the value of using web-scale data in state-of-the-art supervised dependency parsers. [sent-49, score-0.325]

22 We compare dependency parsers that include or exclude selectional preference features obtained from web-scale corpora. [sent-50, score-0.931]

23 To the best of our knowledge, none of the existing studies directly addresses long dependencies in dependency parsing using web-scale data. [sent-51, score-0.566]

24 In this paper we incorporate the web-derived selectional preference features to design our parsers for robust open-domain testing. [sent-59, score-0.824]

25 The results show that web-derived selectional preference can improve statistical dependency parsing, particularly for long dependency relationships. [sent-62, score-1.311]

26 More importantly, when operating on new domains, the web-derived selectional preference features show great potential for achieving robust performance (Section 4. [sent-63, score-0.828]

27 Section 2 gives a brief introduction to dependency parsing. [sent-66, score-0.325]

28 2 Dependency Parsing In dependency parsing, we attempt to build head-modifier (or head-dependent) relations between words in a sentence. [sent-70, score-0.368]

29 The probability of a parse tree is p(y|x; w) = (1/Z(x; w)) exp{ Σ_{ρ∈y} w · Φ(x, ρ) } (1), where Z(x; w) is the partition function and Φ are part-factored feature functions that include head-modifier parts, sibling parts and grandchild parts. [sent-74, score-0.238]
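
As an illustration of how Eq. (1) decomposes, here is a small Python sketch under assumed interfaces (it is not the paper's implementation): `parts(y)` is a hypothetical helper enumerating the head-modifier, sibling, and grandchild parts of tree y, and `phi(x, rho)` returns a part's sparse feature vector as a dict.

```python
import math

def tree_score(w, x, y, parts, phi):
    """Unnormalized log-score of tree y: sum over parts rho of w . Phi(x, rho)."""
    return sum(w.get(f, 0.0) * v
               for rho in parts(y)              # head-modifier, sibling, grandchild parts
               for f, v in phi(x, rho).items())

def tree_probability(w, x, y, candidates, parts, phi):
    """p(y | x; w) = exp(score(y)) / Z(x; w). Z is computed here by brute-force
    enumeration over candidate trees; a real parser uses dynamic programming."""
    z = sum(math.exp(tree_score(w, x, yp, parts, phi)) for yp in candidates)
    return math.exp(tree_score(w, x, y, parts, phi)) / z
```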

30 1 Web-scale resources All of our selectional preference features described in this paper rely on probabilities derived from unlabeled data. [sent-81, score-0.78]

31 Selectional preference tells us which arguments are plausible for a particular predicate; one way to determine the selectional preference is from co-occurrences of predicates and arguments in text (Bergsma et al. [sent-91, score-0.81]

32 In this paper, the selectional preferences have the same meaning as N-grams, which model word-to-word relationships rather than only predicate-argument relationships. [sent-93, score-0.565]

33 Figure 2: An example of a labeled dependency tree. [sent-94, score-0.325]

34 1 PMI Previous work on noun compound bracketing has used the adjacency model (Resnik, 1993) and the dependency model (Lauer, 1995) to compute association statistics between pairs of words. [sent-103, score-0.498]

35 In this paper we generalize the adjacency and dependency models by including the pointwise mutual information (Church and Hanks, 1990) between all pairs of words in the dependency tree: PMI(x, y) = log( p(“x y”) / (p(“x”) p(“y”)) ) (3), where p(“x y”) is the co-occurrence probability. [sent-104, score-0.679]

36 When using the Google V1 corpus, these probabilities can be calculated directly from the N-gram counts; when using Google hits, we send the queries to the search engine Google, and all search queries are performed as exact matches using quotation marks. [sent-105, score-0.245]

37 between the three words in the dependency tree: PMI(x, y, z) = log( p(“x y z”) / (p(“x y”) p(“y z”)) ) (4). These kinds of trigram features, for example in MSTParser, can directly capture the sibling and grandchild features. [sent-116, score-0.487]
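
A minimal sketch of Eqs. (3) and (4) as reconstructed above, assuming a `count` lookup over an N-gram resource (Google V1 counts, or summed hit counts for quoted exact-match queries) and a normalizing `total`; these helper names are illustrative, not the paper's.

```python
import math

def prob(ngram, count, total):
    """Relative frequency of a (quoted, exact-match) N-gram."""
    return count(ngram) / total

def pmi2(x, y, count, total):
    """Eq. (3): PMI(x, y) = log p("x y") / (p("x") p("y"))."""
    return math.log(prob(f"{x} {y}", count, total) /
                    (prob(x, count, total) * prob(y, count, total)))

def pmi3(x, y, z, count, total):
    """Eq. (4): PMI(x, y, z) = log p("x y z") / (p("x y") p("y z"))."""
    return math.log(prob(f"{x} {y} {z}", count, total) /
                    (prob(f"{x} {y}", count, total) *
                     prob(f"{y} {z}", count, total)))
```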

38 We illustrate the PMI features with an example dependency parse tree in Figure 2. [sent-117, score-0.508]

39 In deciding the dependency between the main verb hit and its argument headed by the preposition with, examples of the N-gram PMI features and their conjoined features with the baseline are shown in Table 1. [sent-118, score-0.685]

40 2 PP-attachment Prepositional phrase (PP) attachment is one of the hardest problems in English dependency parsing. [sent-121, score-0.419]

41 Resolving this ambiguity reflects the selectional preference of the verb or the noun for its prepositional phrase. [sent-123, score-0.801]

42 For example, consider the following two examples: (1) John hit the ball with the bat. [sent-124, score-0.269]

43 In sentence (1), the preposition with depends on the main verb hit; but in sentence (2), the prepositional phrase is a noun attribute and the preposition with needs to depend on the word ball. [sent-126, score-0.394]

44 We thus have PP-attachment features that determine the PMI association across the preposition word “IN”: PMI_IN(x, z) = log( p(“x IN z”) / p(“x”) ) (5). Here, the preposition word “IN” (e. [sent-128, score-0.215]

45 For example, “hw, mw” represents a class of indicator features with one feature for each possible combination of head word and modifier word. [sent-134, score-0.159]
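
For illustration, a hedged sketch of how indicator templates such as "hw, mw" expand into feature strings; the template set shown is a small assumed subset, not the paper's full inventory.

```python
def head_modifier_features(head_word, mod_word):
    """Instantiate a few baseline indicator templates for one arc."""
    return [
        f"hw={head_word}",                  # head word alone
        f"mw={mod_word}",                   # modifier word alone
        f"hw,mw={head_word},{mod_word}",    # conjoined head-modifier pair
    ]

# e.g. head_modifier_features("hit", "with")
# -> ["hw=hit", "mw=with", "hw,mw=hit,with"]
```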

46 PMI_IN(y, z) = log( p(“y IN z”) / p(“y”) ) (6), where the words x and y are usually a verb and a noun, and z is a noun that directly depends on the preposition word “IN”. [sent-137, score-0.211]

47 If both PMI features exist and PMI_with(hit, bat) > PMI_with(ball, bat), this indicates to our dependency parsing model that attaching the preposition with to the verb hit is a good choice. [sent-139, score-0.781]
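
The PP-attachment signal of Eqs. (5) and (6) can be sketched as a simple comparison; `count` and `total` follow the same illustrative interface as in the PMI sketch above.

```python
import math

def pmi_in(head, prep, z, count, total):
    """Eqs. (5)/(6): PMI_IN(head, z) = log p("head IN z") / p("head")."""
    return math.log((count(f"{head} {prep} {z}") / total) /
                    (count(head) / total))

def prefers_verb_attachment(verb, noun, prep, z, count, total):
    """True when, e.g., PMI_with(hit, bat) > PMI_with(ball, bat), signalling
    that attaching the preposition to the verb is the better choice."""
    return (pmi_in(verb, prep, z, count, total) >
            pmi_in(noun, prep, z, count, total))
```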

48 The surrounding word N-gram features represent the local context of the selectional preference. [sent-149, score-0.499]

49 In addition, we also present second-order feature templates, including the sibling and grandchild features. [sent-150, score-0.195]

50 2, for sentence (1), the dependency graph path feature ball → with → bat should have a lower weight, since ball is rarely modified by bat but is often seen through them (e. [sent-154, score-0.687]

51 will tell us that the prepositional phrase is much more likely to attach to the noun, since the dependency graph path feature ball → with → stripe should have a high weight due to the high strength of the selectional preference between ball and stripe. [sent-158, score-1.055]

52 Web-derived selectional preference features based on PMI values are trickier to incorporate into the dependency parsing model because they are continuous rather than discrete. [sent-159, score-1.133]

53 A log-linear dependency parsing model is sensitive to inappropriately scaled features. [sent-162, score-0.449]
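
One common way to feed such continuous scores into a log-linear model (a hedged sketch, not necessarily the paper's exact scheme) is to discretize each PMI value into a small number of buckets and emit indicator features, keeping all feature scales comparable.

```python
def pmi_bucket_feature(name, pmi_value, edges=(-2.0, 0.0, 2.0, 4.0)):
    """Map a real-valued PMI score to one of len(edges)+1 indicator features.
    The bucket edges here are arbitrary illustrative choices."""
    for i, edge in enumerate(edges):
        if pmi_value <= edge:
            return f"{name}:pmi_bucket={i}"
    return f"{name}:pmi_bucket={len(edges)}"
```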

54 4 Experiments In order to evaluate the effectiveness of our proposed approach, we conducted dependency parsing experiments in English. [sent-165, score-0.449]

55 , 1993), using a standard set of head-selection rules (Yamada and Matsumoto, 2003) to convert the phrase structure syntax of the Treebank into a dependency tree representation; dependency labels were obtained via the “Malt” hard-coded setting. [sent-167, score-0.65]

56 Web page hits for word pairs and trigrams are obtained using a simple heuristic query to the search engine Google. [sent-170, score-0.314]

57 These forms are then submitted as literal queries, and the resulting hits are summed up. [sent-172, score-0.205]

58 We measured the performance of the parsers using the following metrics: unlabeled attachment score (UAS), labeled attachment score (LAS) and complete match (CM), which were defined by Hall et al. [sent-175, score-0.354]
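
A minimal sketch of the three metrics, assuming each parse is a list of (head, label) pairs, one per token. The definitions follow Hall et al. per the text; treating complete match as all heads correct is an assumption of this sketch.

```python
def evaluate(gold_parses, pred_parses):
    """UAS/LAS over tokens; CM over sentences. Each parse: [(head, label), ...]."""
    tokens = uas_hits = las_hits = complete = 0
    for gold, pred in zip(gold_parses, pred_parses):
        heads_all_correct = True
        for (g_head, g_label), (p_head, p_label) in zip(gold, pred):
            tokens += 1
            if g_head == p_head:
                uas_hits += 1
                if g_label == p_label:
                    las_hits += 1
            else:
                heads_all_correct = False
        complete += heads_all_correct
    return {"UAS": uas_hits / tokens,
            "LAS": las_hits / tokens,
            "CM": complete / len(gold_parses)}
```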

59 First, performance increases with the order of the parser: the edge-factored model (dep1) has the lowest performance, and adding sibling and grandchild relationships (dep2) significantly increases performance. [sent-180, score-0.209]

60 Second, note that the parsers incorporating the N-gram feature sets consistently outperform the models using the baseline features in all test data sets, regardless of model order or label usage. [sent-183, score-0.199]

61 Abbreviations: dep1/dep2 = first-order/second-order parser with the baseline features; +hits = N-gram features derived from Google hits; +V1 = N-gram features derived from Google V1; suffix -L = labeled parser. [sent-215, score-0.292]

62 Unlabeled parsers are scored using unlabeled parent predictions, and labeled parsers are scored using labeled parent predictions. [sent-216, score-0.273]

63 finding is that the N-gram features derived from Google hits are slightly better than those from Google V1 due to the larger N-gram coverage, which we discuss later. [sent-217, score-0.301]

64 Google hits is the largest N-gram data source and shows the best performance. [sent-235, score-0.205]

65 Although Google hits are noisier, they have much larger coverage of bigrams and trigrams. [sent-241, score-0.205]

66 We have shown that this trend continues well for dependency parsing by using web-scale data (NEWS and Google V1). [sent-246, score-0.449]

67 2 Improvement relative to dependency length The experiments in (McDonald and Nivre, 2007) showed that overly long dependencies have a negative impact on dependency parsing performance. [sent-258, score-0.838]

68 For our proposed approach, the improvement relative to dependency length is shown in Figure 4. [sent-259, score-0.353]

69 From the figure, it is seen that our method gives observably better performance when dependency lengths are larger than 3. [sent-260, score-0.325]

70 The results here show that the proposed approach improves the dependency parsing performance, particularly for long dependency relationships. [sent-261, score-0.869]

71 3 Cross-genre testing In this section, we present the experiments to validate the robustness of the web-derived selectional preferences. [sent-263, score-0.44]

72 The intent is to understand how well the web-derived selectional preferences transfer to other sources. [sent-264, score-0.565]

73 WSJ is the performance of our second-order dependency parser trained on sections 2–21; WSJ+N-gram is the performance of our proposed approach trained on sections 2–21; WSJ+BioMed is the performance of the parser trained on WSJ and biomedical data. [sent-272, score-0.489]

74 The results show that incorporating the web-scale N-gram features can significantly improve dependency parsing performance, and the improvement is much larger than in the in-domain testing presented in Section 4. [sent-274, score-0.638]

75 4 Discussion In this paper, we present a novel method to improve dependency parsing by using web-scale data. [sent-277, score-0.449]

76 (1) Google hits are less sparse than Google V1 in modeling word-to-word relationships, but are likely to be noisier. [sent-279, score-0.444]

77 It is very appealing to carry out a correlation analysis (table legend: WSJ: performance of the parser trained only on WSJ; WSJ+N-gram: performance of our proposed approach trained only on WSJ; WSJ+BioMed: parser trained on WSJ and biomedical text; WSJ+BioMed+N-gram: our approach trained on WSJ and biomedical text). [sent-280, score-0.228]

78 This analysis would determine whether Google hits and Google V1 are highly correlated. [sent-281, score-0.205]

79 (2) Veronis (2005) pointed out that there has been debate about the reliability of Google hits due to inconsistencies in page-hit estimates. [sent-283, score-0.41]

80 5 Related Work Our approach is to exploit web-derived selectional preferences to improve dependency parsing. [sent-289, score-0.89]

81 Our research, however, applies web-scale data (Google hits and Google V1) to model word-to-word dependency relationships rather than to compound bracketing disambiguation. [sent-295, score-0.699]

82 Several previous studies have exploited web-scale data for word pair acquisition. [sent-296, score-0.158]

83 Keller and Lapata (2003) evaluated the utility of using web search engine statistics for unseen bigrams. [sent-297, score-0.16]

84 Nakov and Hearst (2005) demonstrated the effectiveness of using search engine statistics to improve noun compound bracketing. [sent-298, score-0.249]

85 (2010) created robust supervised classifiers via web-scale N-gram data for adjective ordering, spelling correction, noun compound bracketing and verb part-of-speech disambiguation. [sent-302, score-0.257]

86 Our approach, however, extends these techniques to dependency parsing, particularly for long dependency relationships, which is a more challenging task than the previous work addressed. [sent-303, score-0.715]

87 Johnson and Riezler (2000) incorporated the lexical selectional preference features derived from the British National Corpus (Graff, 2003) into a stochastic unification-based grammar. [sent-306, score-0.721]

88 Abekawa and Okumura (2006) improved Japanese dependency parsing by using the co-occurrence information derived from the results of automatic dependency parsing of large-scale corpora. [sent-307, score-0.935]

89 In contrast, we explore web-scale data for dependency parsing; performance improves log-linearly with the number of parameters (unique N-grams). [sent-308, score-0.485]

90 To the best of our knowledge, web-derived selectional preference has not been successfully applied to dependency parsing. [sent-309, score-0.95]

91 6 Conclusion In this paper, we present a novel method which incorporates the web-derived selectional preferences to improve statistical dependency parsing. [sent-310, score-0.89]

92 The results show that web-scale data improves the dependency parsing, particularly for long dependency relationships. [sent-311, score-0.745]

93 More importantly, when operating on new domains, the web-derived selectional preferences show great potential for achieving robust performance. [sent-313, score-0.709]

94 Japanese dependency parsing using co-occurrence information and a combination of case elements. [sent-323, score-0.449]

95 Acquiring selectional preferences from untagged text for prepositional phrase attachment disambiguation. [sent-349, score-0.733]

96 DILUCT: An open-source Spanish dependency parser based on rules, heuristics, and selectional preferences. [sent-355, score-0.815]

97 Using co-occurrence statistics as an information source for partial parsing of Chinese. [sent-438, score-0.172]

98 Search engine statistics beyond the n-gram: application to noun compound bracketing. [sent-560, score-0.22]

99 An empirical study of semi-supervised structured conditional models for dependency parsing. [sent-601, score-0.325]

100 A tale of two parsers: investigating and combining graph-based and transitionbased dependency parsing using beam-search. [sent-651, score-0.449]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('selectional', 0.44), ('dependency', 0.325), ('hits', 0.205), ('pmi', 0.205), ('google', 0.192), ('preference', 0.185), ('bat', 0.171), ('wsj', 0.164), ('ball', 0.158), ('pmiwith', 0.146), ('webscale', 0.13), ('preferences', 0.125), ('parsing', 0.124), ('mcdonald', 0.123), ('carreras', 0.123), ('pitler', 0.122), ('hit', 0.111), ('parsers', 0.107), ('biomed', 0.097), ('bergsma', 0.096), ('attachment', 0.094), ('calvo', 0.086), ('grandchild', 0.086), ('preposition', 0.078), ('sibling', 0.076), ('compound', 0.075), ('prepositional', 0.074), ('stripe', 0.073), ('webderived', 0.073), ('biomedical', 0.064), ('unlabeled', 0.059), ('features', 0.059), ('suzuki', 0.058), ('uas', 0.057), ('logp', 0.055), ('verb', 0.053), ('dependencies', 0.053), ('nivre', 0.052), ('koo', 0.052), ('mcclosky', 0.052), ('queries', 0.051), ('parser', 0.05), ('accuracies', 0.049), ('abekawa', 0.049), ('cilibrasi', 0.049), ('drabek', 0.049), ('pmiin', 0.049), ('quota', 0.049), ('veronis', 0.049), ('noun', 0.049), ('statistics', 0.048), ('engine', 0.048), ('bracketing', 0.047), ('relationships', 0.047), ('brants', 0.047), ('mstparser', 0.046), ('keller', 0.043), ('headmodifier', 0.043), ('exponentiated', 0.043), ('unique', 0.042), ('resolve', 0.042), ('xp', 0.04), ('ngram', 0.04), ('collins', 0.039), ('church', 0.038), ('operating', 0.038), ('derived', 0.037), ('gelbukh', 0.037), ('quotation', 0.037), ('modifier', 0.037), ('marcus', 0.037), ('long', 0.036), ('web', 0.035), ('indexed', 0.035), ('longer', 0.035), ('hall', 0.034), ('nakov', 0.034), ('noisier', 0.034), ('lapata', 0.033), ('feature', 0.033), ('robust', 0.033), ('graff', 0.032), ('trigrams', 0.032), ('pereira', 0.032), ('treebank', 0.032), ('isozaki', 0.031), ('genre', 0.031), ('depends', 0.031), ('improves', 0.03), ('head', 0.03), ('adjacency', 0.029), ('cai', 0.029), ('particularly', 0.029), ('search', 0.029), ('charniak', 0.029), ('penn', 0.029), ('studies', 0.028), ('inflected', 0.028), ('ptb', 0.028), ('length', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000008 127 acl-2011-Exploiting Web-Derived Selectional Preference to Improve Statistical Dependency Parsing

Author: Guangyou Zhou ; Jun Zhao ; Kang Liu ; Li Cai

Abstract: In this paper, we present a novel approach which incorporates the web-derived selectional preferences to improve statistical dependency parsing. Conventional selectional preference learning methods have usually focused on word-to-class relations, e.g., a verb selects as its subject a given nominal class. This paper extends previous work to word-to-word selectional preferences by using web-scale data. Experiments show that web-scale data improves statistical dependency parsing, particularly for long dependency relationships. There is no data like more data: performance improves log-linearly with the number of parameters (unique N-grams). More importantly, when operating on new domains, we show that using web-derived selectional preferences is essential for achieving robust performance.

2 0.26637268 333 acl-2011-Web-Scale Features for Full-Scale Parsing

Author: Mohit Bansal ; Dan Klein

Abstract: Counts from large corpora (like the web) can be powerful syntactic cues. Past work has used web counts to help resolve isolated ambiguities, such as binary noun-verb PP attachments and noun compound bracketings. In this work, we first present a method for generating web count features that address the full range of syntactic attachments. These features encode both surface evidence of lexical affinities as well as paraphrase-based cues to syntactic structure. We then integrate our features into full-scale dependency and constituent parsers. We show relative error reductions of 7.0% over the second-order dependency parser of McDonald and Pereira (2006), 9.2% over the constituent parser of Petrov et al. (2006), and 3.4% over a non-local constituent reranker.

3 0.24661343 309 acl-2011-Transition-based Dependency Parsing with Rich Non-local Features

Author: Yue Zhang ; Joakim Nivre

Abstract: Transition-based dependency parsers generally use heuristic decoding algorithms but can accommodate arbitrarily rich feature representations. In this paper, we show that we can improve the accuracy of such parsers by considering even richer feature sets than those employed in previous systems. In the standard Penn Treebank setup, our novel features improve attachment score from 91.4% to 92.9%, giving the best results so far for transition-based parsing and rivaling the best results overall. For the Chinese Treebank, they give a significant improvement of the state of the art. An open source release of our parser is freely available.

4 0.23020588 39 acl-2011-An Ensemble Model that Combines Syntactic and Semantic Clustering for Discriminative Dependency Parsing

Author: Gholamreza Haffari ; Marzieh Razavi ; Anoop Sarkar

Abstract: We combine multiple word representations based on semantic clusters extracted from the (Brown et al., 1992) algorithm and syntactic clusters obtained from the Berkeley parser (Petrov et al., 2006) in order to improve discriminative dependency parsing in the MSTParser framework (McDonald et al., 2005). We also provide an ensemble method for combining diverse cluster-based models. The two contributions together significantly improve unlabeled dependency accuracy from 90.82% to 92.13%.

5 0.18159135 143 acl-2011-Getting the Most out of Transition-based Dependency Parsing

Author: Jinho D. Choi ; Martha Palmer

Abstract: This paper suggests two ways of improving transition-based, non-projective dependency parsing. First, we add a transition to an existing non-projective parsing algorithm, so it can perform either projective or non-projective parsing as needed. Second, we present a bootstrapping technique that narrows down discrepancies between gold-standard and automatic parses used as features. The new addition to the algorithm shows a clear advantage in parsing speed. The bootstrapping technique gives a significant improvement to parsing accuracy, showing near state-of-the-art performance with respect to other parsing approaches evaluated on the same data set.

6 0.18097512 111 acl-2011-Effects of Noun Phrase Bracketing in Dependency Parsing and Machine Translation

7 0.17412752 167 acl-2011-Improving Dependency Parsing with Semantic Classes

8 0.16152008 48 acl-2011-Automatic Detection and Correction of Errors in Dependency Treebanks

9 0.1330542 331 acl-2011-Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation

10 0.12627283 282 acl-2011-Shift-Reduce CCG Parsing

11 0.1209768 148 acl-2011-HITS-based Seed Selection and Stop List Construction for Bootstrapping

12 0.11824341 230 acl-2011-Neutralizing Linguistically Problematic Annotations in Unsupervised Dependency Parsing Evaluation

13 0.11476521 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing

14 0.1089173 275 acl-2011-Semi-Supervised Modeling for Prenominal Modifier Ordering

15 0.1080587 109 acl-2011-Effective Measures of Domain Similarity for Parsing

16 0.10022411 10 acl-2011-A Discriminative Model for Joint Morphological Disambiguation and Dependency Parsing

17 0.099935949 122 acl-2011-Event Extraction as Dependency Parsing

18 0.096851259 186 acl-2011-Joint Training of Dependency Parsing Filters through Latent Support Vector Machines

19 0.096712127 92 acl-2011-Data point selection for cross-language adaptation of dependency parsers

20 0.094983377 243 acl-2011-Partial Parsing from Bitext Projections


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.252), (1, -0.017), (2, -0.101), (3, -0.27), (4, -0.041), (5, -0.097), (6, 0.09), (7, 0.051), (8, 0.11), (9, -0.024), (10, 0.067), (11, 0.01), (12, 0.05), (13, -0.134), (14, 0.008), (15, 0.085), (16, -0.022), (17, 0.073), (18, -0.05), (19, -0.046), (20, -0.136), (21, -0.031), (22, -0.022), (23, 0.04), (24, 0.102), (25, -0.101), (26, 0.021), (27, 0.027), (28, -0.019), (29, 0.025), (30, 0.046), (31, -0.028), (32, 0.061), (33, -0.008), (34, 0.036), (35, 0.057), (36, 0.036), (37, -0.027), (38, 0.041), (39, 0.124), (40, 0.007), (41, 0.018), (42, 0.039), (43, -0.037), (44, 0.032), (45, 0.013), (46, 0.103), (47, -0.022), (48, -0.056), (49, 0.036)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96710223 127 acl-2011-Exploiting Web-Derived Selectional Preference to Improve Statistical Dependency Parsing

Author: Guangyou Zhou ; Jun Zhao ; Kang Liu ; Li Cai

Abstract: In this paper, we present a novel approach which incorporates the web-derived selectional preferences to improve statistical dependency parsing. Conventional selectional preference learning methods have usually focused on word-to-class relations, e.g., a verb selects as its subject a given nominal class. This paper extends previous work to word-to-word selectional preferences by using web-scale data. Experiments show that web-scale data improves statistical dependency parsing, particularly for long dependency relationships. There is no data like more data: performance improves log-linearly with the number of parameters (unique N-grams). More importantly, when operating on new domains, we show that using web-derived selectional preferences is essential for achieving robust performance.

2 0.889296 333 acl-2011-Web-Scale Features for Full-Scale Parsing

Author: Mohit Bansal ; Dan Klein

Abstract: Counts from large corpora (like the web) can be powerful syntactic cues. Past work has used web counts to help resolve isolated ambiguities, such as binary noun-verb PP attachments and noun compound bracketings. In this work, we first present a method for generating web count features that address the full range of syntactic attachments. These features encode both surface evidence of lexical affinities as well as paraphrase-based cues to syntactic structure. We then integrate our features into full-scale dependency and constituent parsers. We show relative error reductions of 7.0% over the second-order dependency parser of McDonald and Pereira (2006), 9.2% over the constituent parser of Petrov et al. (2006), and 3.4% over a non-local constituent reranker.

3 0.88453835 309 acl-2011-Transition-based Dependency Parsing with Rich Non-local Features

Author: Yue Zhang ; Joakim Nivre

Abstract: Transition-based dependency parsers generally use heuristic decoding algorithms but can accommodate arbitrarily rich feature representations. In this paper, we show that we can improve the accuracy of such parsers by considering even richer feature sets than those employed in previous systems. In the standard Penn Treebank setup, our novel features improve attachment score from 91.4% to 92.9%, giving the best results so far for transition-based parsing and rivaling the best results overall. For the Chinese Treebank, they give a significant improvement of the state of the art. An open source release of our parser is freely available.

4 0.86680233 39 acl-2011-An Ensemble Model that Combines Syntactic and Semantic Clustering for Discriminative Dependency Parsing

Author: Gholamreza Haffari ; Marzieh Razavi ; Anoop Sarkar

Abstract: We combine multiple word representations based on semantic clusters extracted from the (Brown et al., 1992) algorithm and syntactic clusters obtained from the Berkeley parser (Petrov et al., 2006) in order to improve discriminative dependency parsing in the MSTParser framework (McDonald et al., 2005). We also provide an ensemble method for combining diverse cluster-based models. The two contributions together significantly improve unlabeled dependency accuracy from 90.82% to 92.13%.

5 0.80726486 143 acl-2011-Getting the Most out of Transition-based Dependency Parsing

Author: Jinho D. Choi ; Martha Palmer

Abstract: This paper suggests two ways of improving transition-based, non-projective dependency parsing. First, we add a transition to an existing non-projective parsing algorithm, so it can perform either projective or non-projective parsing as needed. Second, we present a bootstrapping technique that narrows down discrepancies between gold-standard and automatic parses used as features. The new addition to the algorithm shows a clear advantage in parsing speed. The bootstrapping technique gives a significant improvement to parsing accuracy, showing near state-of-the-art performance with respect to other parsing approaches evaluated on the same data set.

6 0.78006083 230 acl-2011-Neutralizing Linguistically Problematic Annotations in Unsupervised Dependency Parsing Evaluation

7 0.77749032 111 acl-2011-Effects of Noun Phrase Bracketing in Dependency Parsing and Machine Translation

8 0.74101818 48 acl-2011-Automatic Detection and Correction of Errors in Dependency Treebanks

9 0.7331515 167 acl-2011-Improving Dependency Parsing with Semantic Classes

10 0.72750616 243 acl-2011-Partial Parsing from Bitext Projections

11 0.67091936 199 acl-2011-Learning Condensed Feature Representations from Large Unsupervised Data Sets for Supervised Learning

12 0.66138989 107 acl-2011-Dynamic Programming Algorithms for Transition-Based Dependency Parsers

13 0.65857846 59 acl-2011-Better Automatic Treebank Conversion Using A Feature-Based Approach

14 0.62001961 186 acl-2011-Joint Training of Dependency Parsing Filters through Latent Support Vector Machines

15 0.60331601 236 acl-2011-Optimistic Backtracking - A Backtracking Overlay for Deterministic Incremental Parsing

16 0.59413654 92 acl-2011-Data point selection for cross-language adaptation of dependency parsers

17 0.59341514 295 acl-2011-Temporal Restricted Boltzmann Machines for Dependency Parsing

18 0.58630139 331 acl-2011-Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation

19 0.58387148 282 acl-2011-Shift-Reduce CCG Parsing

20 0.54611981 10 acl-2011-A Discriminative Model for Joint Morphological Disambiguation and Dependency Parsing


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.022), (17, 0.047), (26, 0.045), (37, 0.493), (39, 0.045), (41, 0.04), (55, 0.04), (59, 0.027), (72, 0.024), (91, 0.035), (96, 0.101), (97, 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.9741447 179 acl-2011-Is Machine Translation Ripe for Cross-Lingual Sentiment Classification?

Author: Kevin Duh ; Akinori Fujino ; Masaaki Nagata

Abstract: Recent advances in Machine Translation (MT) have brought forth a new paradigm for building NLP applications in low-resource scenarios. To build a sentiment classifier for a language with no labeled resources, one can translate labeled data from another language, then train a classifier on the translated text. This can be viewed as a domain adaptation problem, where labeled translations and test data have some mismatch. Various prior work have achieved positive results using this approach. In this opinion piece, we take a step back and make some general statements about cross-lingual adaptation problems. First, we claim that domain mismatch is not caused by MT errors, and accuracy degradation will occur even in the case of perfect MT. Second, we argue that the cross-lingual adaptation problem is qualitatively different from other (monolingual) adaptation problems in NLP; thus new adaptation algorithms ought to be considered. This paper will describe a series of carefully-designed experiments that led us to these conclusions.

same-paper 2 0.94354486 127 acl-2011-Exploiting Web-Derived Selectional Preference to Improve Statistical Dependency Parsing

Author: Guangyou Zhou ; Jun Zhao ; Kang Liu ; Li Cai

Abstract: In this paper, we present a novel approach which incorporates the web-derived selectional preferences to improve statistical dependency parsing. Conventional selectional preference learning methods have usually focused on word-to-class relations, e.g., a verb selects as its subject a given nominal class. This paper extends previous work to word-to-word selectional preferences by using web-scale data. Experiments show that web-scale data improves statistical dependency parsing, particularly for long dependency relationships. There is no data like more data: performance improves log-linearly with the number of parameters (unique N-grams). More importantly, when operating on new domains, we show that using web-derived selectional preferences is essential for achieving robust performance.

3 0.94216055 230 acl-2011-Neutralizing Linguistically Problematic Annotations in Unsupervised Dependency Parsing Evaluation

Author: Roy Schwartz ; Omri Abend ; Roi Reichart ; Ari Rappoport

Abstract: Dependency parsing is a central NLP task. In this paper we show that the common evaluation for unsupervised dependency parsing is highly sensitive to problematic annotations. We show that for three leading unsupervised parsers (Klein and Manning, 2004; Cohen and Smith, 2009; Spitkovsky et al., 2010a), a small set of parameters can be found whose modification yields a significant improvement in standard evaluation measures. These parameters correspond to local cases where no linguistic consensus exists as to the proper gold annotation. Therefore, the standard evaluation does not provide a true indication of algorithm quality. We present a new measure, Neutral Edge Direction (NED), and show that it greatly reduces this undesired phenomenon.

4 0.94047219 250 acl-2011-Prefix Probability for Probabilistic Synchronous Context-Free Grammars

Author: Mark-Jan Nederhof ; Giorgio Satta

Abstract: We present a method for the computation of prefix probabilities for synchronous context-free grammars. Our framework is fairly general and relies on the combination of a simple, novel grammar transformation and standard techniques to bring grammars into normal forms.

5 0.93070048 100 acl-2011-Discriminative Feature-Tied Mixture Modeling for Statistical Machine Translation

Author: Bing Xiang ; Abraham Ittycheriah

Abstract: In this paper we present a novel discriminative mixture model for statistical machine translation (SMT). We model the feature space with a log-linear combination of multiple mixture components. Each component contains a large set of features trained in a maximum-entropy framework. All features within the same mixture component are tied and share the same mixture weights, where the mixture weights are trained discriminatively to maximize the translation performance. This approach aims at bridging the gap between the maximum-likelihood training and the discriminative training for SMT. It is shown that the feature space can be partitioned in a variety of ways, such as based on feature types, word alignments, or domains, for various applications. The proposed approach improves the translation performance significantly on a large-scale Arabic-to-English MT task.

6 0.92232776 122 acl-2011-Event Extraction as Dependency Parsing

7 0.9212954 334 acl-2011-Which Noun Phrases Denote Which Concepts?

8 0.92041641 204 acl-2011-Learning Word Vectors for Sentiment Analysis

9 0.87516487 332 acl-2011-Using Multiple Sources to Construct a Sentiment Sensitive Thesaurus for Cross-Domain Sentiment Classification

10 0.82149947 256 acl-2011-Query Weighting for Ranking Model Adaptation

11 0.81826317 92 acl-2011-Data point selection for cross-language adaptation of dependency parsers

12 0.81684196 54 acl-2011-Automatically Extracting Polarity-Bearing Topics for Cross-Domain Sentiment Classification

13 0.81548858 186 acl-2011-Joint Training of Dependency Parsing Filters through Latent Support Vector Machines

14 0.80596298 183 acl-2011-Joint Bilingual Sentiment Classification with Unlabeled Parallel Corpora

15 0.80202258 85 acl-2011-Coreference Resolution with World Knowledge

16 0.7842834 292 acl-2011-Target-dependent Twitter Sentiment Classification

17 0.78365827 309 acl-2011-Transition-based Dependency Parsing with Rich Non-local Features

18 0.78199661 103 acl-2011-Domain Adaptation by Constraining Inter-Domain Variability of Latent Feature Representation

19 0.78122056 199 acl-2011-Learning Condensed Feature Representations from Large Unsupervised Data Sets for Supervised Learning

20 0.77601713 39 acl-2011-An Ensemble Model that Combines Syntactic and Semantic Clustering for Discriminative Dependency Parsing