acl acl2011 acl2011-230 knowledge-graph by maker-knowledge-mining

230 acl-2011-Neutralizing Linguistically Problematic Annotations in Unsupervised Dependency Parsing Evaluation


Source: pdf

Author: Roy Schwartz ; Omri Abend ; Roi Reichart ; Ari Rappoport

Abstract: Dependency parsing is a central NLP task. In this paper we show that the common evaluation for unsupervised dependency parsing is highly sensitive to problematic annotations. We show that for three leading unsupervised parsers (Klein and Manning, 2004; Cohen and Smith, 2009; Spitkovsky et al., 2010a), a small set of parameters can be found whose modification yields a significant improvement in standard evaluation measures. These parameters correspond to local cases where no linguistic consensus exists as to the proper gold annotation. Therefore, the standard evaluation does not provide a true indication of algorithm quality. We present a new measure, Neutral Edge Direction (NED), and show that it greatly reduces this undesired phenomenon.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 In this paper we show that the common evaluation for unsupervised dependency parsing is highly sensitive to problematic annotations. [sent-7, score-0.429]

2 These parameters correspond to local cases where no linguistic consensus exists as to the proper gold annotation. [sent-10, score-0.235]

3 Parser quality is usually evaluated by comparing the parser’s output to a gold standard whose annotations are linguistically motivated. [sent-19, score-0.319]

4 For such cases, evaluation measures should not punish the algorithm for deviating from the gold standard. [sent-29, score-0.195]

5 In this paper we show that the evaluation measures reported in current works are highly sensitive to the annotation in problematic cases, and propose a simple new measure that greatly neutralizes the problem. [sent-30, score-0.305]

6 , 2010a), a small set (at most 18 out of a few thousands) of parameters can be found whose modification dramatically improves the standard evaluation measures (the attachment score measure by 9. [sent-32, score-0.305]

7 1%, and the undirected measure by a smaller but still significant 1. [sent-34, score-0.233]

8 We show that these parameter changes can be mapped to edge direction changes in local structures in the dependency graph, and that these correspond to problematic annotations. [sent-38, score-0.721]

9 We explain why the standard undirected evaluation measure is in fact sensitive to such edge direction changes. (Footnote 1: It is also language-independent; we have produced it in five different languages: English, Czech, Japanese, Portuguese, and Turkish.) [sent-40, score-0.502]

10 We present a new evaluation measure, Neutral Edge Direction (NED), which greatly alleviates the problem by ignoring the edge direction in local structures. [sent-44, score-0.295]

11 First, we show the impact of a small number of annotation decisions on the performance of unsupervised dependency parsers. [sent-50, score-0.249]

12 This reveals a problem in the common evaluation of unsupervised dependency parsing. [sent-52, score-0.191]

13 Section 4 discusses the linguistic controversies in annotating problematic dependency structures. [sent-57, score-0.307]

14 For unsupervised dependency parsing, the Dependency Model with Valence (DMV) (Klein and Manning, 2004) was the first to beat the simple right-branching baseline. [sent-62, score-0.191]

15 The controversial nature of some dependency structures was discussed in (Nivre, 2006; Kübler et al. [sent-92, score-0.289]

16 Klein (2005) discussed controversial constituency structures and the evaluation problems stemming from them, stressing the importance of a consistent standard of evaluation. [sent-94, score-0.249]

17 (2006) transformed the dependency annotations of coordinations and verb groups in the Prague TreeBank. [sent-97, score-0.265]

18 Klein and Manning (2004) observed that a large portion of their errors is caused by predicting the wrong direction of the edge between a noun and its determiner. [sent-101, score-0.295]

19 Kübler (2005) compared two different conversion schemes in German supervised constituency parsing and found one to have positive influence on parsing quality. [sent-102, score-0.216]

20 PSTOP(dir, h, adj) determines the probability to stop generating arguments, and is conditioned on 3 arguments: the head h, the direction dir ((L)eft or (R)ight) and adjacency adj (whether the head already has dependents ((Y)es) in direction dir or not ((N)o)). [sent-107, score-0.4]
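
As a rough illustration of how these conditioning events can be represented, the sketch below stores PSTOP and PATTACH as plain dictionaries keyed exactly as described (head, direction, adjacency); the names and toy probabilities are assumptions for illustration, not the paper's trained parameters.

```python
# Minimal sketch of DMV parameter tables; keys and toy values are
# illustrative assumptions, not the authors' trained parameters.

# PSTOP[(head, direction, has_dependents)] = probability of stopping
# argument generation for that head on that side.
PSTOP = {
    ("VB", "L", False): 0.2,   # verb, left side, no dependents yet (adj = N)
    ("VB", "L", True):  0.8,   # verb, left side, already has dependents (adj = Y)
    ("TO", "R", False): 0.4,
}

# PATTACH[(head, direction)] = distribution over argument tags generated
# on that side of the head.
PATTACH = {
    ("VB", "L"): {"PRP": 0.7, "NN": 0.3},
    ("VB", "R"): {"TO": 0.6, "NN": 0.4},
}

def stop_probability(head: str, direction: str, has_dependents: bool) -> float:
    """Look up PSTOP(dir, h, adj), with a neutral default for unseen events."""
    return PSTOP.get((head, direction, has_dependents), 0.5)

if __name__ == "__main__":
    print(stop_probability("VB", "L", False))  # 0.2: keep generating left arguments
```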

21 3 Significant Effects of Edge Flipping. In this section we present recurring error patterns in some of the leading unsupervised dependency parsers. [sent-109, score-0.191]

22 Figure 2: A parse of the sentence “I want to eat”, before (straight line) and after (dashed line) an edge-flip of the edge “to”←“eat”. [sent-115, score-0.228]

23 Edge flipping (henceforth, edge-flip) of the edge w2→w1 is the following modification of a parse tree: (1) setting w2’s parent to w1 (instead of the other way around), and (2) setting w1’s parent to w3 (instead of the edge w3→w2). [sent-118, score-0.735]
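
A minimal sketch of this operation on a parse stored as a parent-index array (a hypothetical helper written for illustration, not the authors' code); it rewires exactly the two attachments described above and leaves the rest of the tree untouched.

```python
# Sketch of an edge-flip on a dependency parse stored as a parent array:
# parents[i] is the index of token i's head, with -1 marking the root.

def edge_flip(parents: list[int], w2: int, w1: int) -> list[int]:
    """Flip the edge w2 -> w1: w1 becomes the head of w2, and w1 is
    attached to w2's former head w3."""
    assert parents[w1] == w2, "expected w1 to be a dependent of w2"
    flipped = list(parents)
    w3 = parents[w2]      # w2's original head
    flipped[w2] = w1      # (1) w2's parent is now w1
    flipped[w1] = w3      # (2) w1's parent is now w3
    return flipped

if __name__ == "__main__":
    # "I want to eat" with indices 0..3; one plausible parse has "want" as
    # root, "I" and "to" under "want", and "eat" under "to".
    before = [1, -1, 1, 2]
    after = edge_flip(before, w2=2, w1=3)  # flip the edge between "to" and "eat"
    print(after)  # [1, -1, 3, 1]: "to" now depends on "eat", which attaches to "want"
```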

24 Setting this parent to be w3 is the minimal modification of the original parse, since it does not change the attachment of the structure [w2, w1] to the rest of the sentence, but only the direction of the internal edge. [sent-121, score-0.442]

25 Figure 2 presents a parse of the sentence “I want to eat”, before and after an edge-flip of the edge “to”←“eat”. [sent-122, score-0.228]

26 Since unsupervised dependency parsers are generally structure prediction models, the predictions of the parse edges are not independent. [sent-123, score-0.27]

27 Therefore, there is no single parameter which completely controls the edge direction, and hence there is no direct way to perform an edge-flip by parameter modification. [sent-124, score-0.27]

28 However, setting extreme values for the parameters controlling the direction of a certain edge type creates a strong preference towards one of the directions, and effectively determines the edge direction. [sent-125, score-0.528]

29 We show how an edge in the dependency graph is encoded using the DMV parameters. [sent-133, score-0.307]

30 When the modifications to PATTACH are insufficient to modify the edge direction, PSTOP(w2, L, N) is set to a very low value and PSTOP(w1, R, N) to a very high value. [sent-141, score-0.188]
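
The sketch below illustrates the kind of extreme-value modification described here, again with hypothetical dictionary-based tables; my reading is that these settings push the model toward making w2 the head of a left dependent w1, and the epsilon constant and all names are assumptions rather than the authors' actual setup.

```python
# Sketch of forcing an edge direction in DMV by setting extreme parameter
# values; table names, the epsilon constant, and the toy tags are
# illustrative assumptions, not the authors' parameter files.

EPS = 1e-6

PATTACH: dict = {}   # PATTACH[(head, direction)][argument] = probability
PSTOP: dict = {}     # PSTOP[(head, direction, has_dependents)] = probability

def bias_edge_direction(w1: str, w2: str) -> None:
    """Create a strong preference for w2 heading w1 as a left dependent."""
    # Push the attachment probability toward the preferred head/argument pair
    # (renormalization of the remaining mass is omitted in this toy sketch).
    PATTACH.setdefault((w2, "L"), {})[w1] = 1.0 - EPS
    # When attachment probabilities alone are insufficient, also bias stopping:
    PSTOP[(w2, "L", False)] = EPS         # PSTOP(w2, L, N): almost never stop,
                                          # so w2 will generate a left argument
    PSTOP[(w1, "R", False)] = 1.0 - EPS   # PSTOP(w1, R, N): almost always stop,
                                          # so w1 will not take an argument on its right

if __name__ == "__main__":
    bias_edge_direction("TO", "VB")       # e.g. prefer the verb to head "to"
    print(PSTOP[("TO", "R", False)])      # 0.999999
```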

31 As the table shows, the modified structures cover a significant portion of the tokens. [sent-150, score-0.184]

32 Following standard practice, we present the attachment score (i. [sent-179, score-0.178]

33 The performance difference between the original and the modified parameter set is considerable for all data sets, where differences exceed 9. [sent-185, score-0.214]

34 4 Linguistically Problematic Annotations. In this section, we discuss the controversial nature of the annotation in the modified structures (Kübler ...). [sent-189, score-0.393]

35 We begin by showing that all the structures modified are indeed linguistically problematic. [sent-210, score-0.268]

36 We then note that these controversies are reflected in the evaluation of this task, resulting in three significantly different gold standards currently in use. [sent-211, score-0.191]

37 This happens because they use different rules for converting constituency annotation to dependency annotation. [sent-235, score-0.22]

38 A probable explanation for this fact is that people have tried to correct linguistically problematic annotations in different ways, which is why we note this issue here. [sent-236, score-0.28]

39 5 The Neutral Edge Direction (NED) Measure. As shown in the previous sections, the annotation of problematic edges can substantially affect performance. [sent-251, score-0.217]

40 This was briefly discussed in (Klein and Manning, 2004), which used undirected evaluation as a measure which is less sensitive to alternative annotations. [sent-252, score-0.307]

41 Indeed, half a dozen flags in the LTH Constituent-to-Dependency Conversion Tool (Johansson and Nugues, 2007) are used to control the conversion in problematic cases. [sent-261, score-0.219]

42 The significant effects of edge flipping were observed with the other two schemes as well. [sent-263, score-0.273]

43 The measure is defined as follows: traverse over the tokens and mark a correct attachment if the token’s induced parent is either (1) its gold parent or (2) its gold child. [sent-266, score-0.879]
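
A small sketch of this undirected scoring rule over parses stored as parent arrays (gold[i] and induced[i] give the head index of token i); the function name and encoding are illustrative assumptions, not the authors' evaluation script.

```python
# Sketch of undirected dependency accuracy; parses are stored as arrays
# where parents[i] is the head index of token i (-1 for the root).

def undirected_accuracy(gold: list[int], induced: list[int]) -> float:
    correct = 0
    for token, induced_parent in enumerate(induced):
        gold_children = [i for i, head in enumerate(gold) if head == token]
        # Correct if the induced parent is the gold parent or a gold child.
        if induced_parent == gold[token] or induced_parent in gold_children:
            correct += 1
    return correct / len(gold)

if __name__ == "__main__":
    # Toy version of Figure 3: gold chain w1 -> w2 -> w3 (indices 0, 1, 2);
    # the induced parse flips the w2/w3 edge and hangs w3 under w1.
    gold, induced = [-1, 0, 1], [-1, 2, 0]
    print(undirected_accuracy(gold, induced))  # ~0.67: w2 is accepted, w3 is not
```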

44 Assume that 3(a) is the gold standard and that 3(b) is the induced parse. [sent-270, score-0.298]

45 Its induced parent (w3) is its gold child, and thus undirected evaluation does not consider it an error. [sent-272, score-0.572]

46 This is considered an error, since w1 is neither w3’s gold parent (which is w2) nor its gold child. [sent-274, score-0.444]

47 Recall the example “I want to eat” and the edgeflip of the edge “to”←“eat” (Figure 2). [sent-276, score-0.221]

48 As the induced parent is neither “to”’s gold parent nor its gold child, the undirected evaluation measure marks it as an error. [sent-278, score-0.677]

49 This is an example where an edge-flip in a problematic edge, which should not be considered an error, was in fact considered an error by undirected evaluation. [sent-279, score-0.349]

50 The NED measure is a simple extension of the undirected evaluation measure. [sent-281, score-0.233]

51 Unlike undirected evaluation, NED ignores all errors directly resulting from an edge-flip. [sent-282, score-0.224]

52 Otherwise, the gold parse would have contained a w1→w2→w3→w1 cycle. [sent-283, score-0.202]

53 NED is defined as follows: traverse over the tokens and mark a correct attachment if the token’s induced parent is either (1) its gold parent, (2) its gold child, or (3) its gold grandparent. [sent-289, score-1.026]
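
Extending the same parent-array sketch, NED additionally accepts the gold grandparent; this is a hedged illustration of the definition above, not the authors' evaluation code.

```python
# Sketch of the NED measure over parent arrays; parents[i] is the head index
# of token i, with -1 marking the root (its grandparent is taken as -1 too).

def ned_accuracy(gold: list[int], induced: list[int]) -> float:
    correct = 0
    for token, induced_parent in enumerate(induced):
        gold_parent = gold[token]
        gold_grandparent = gold[gold_parent] if gold_parent != -1 else -1
        gold_children = [i for i, head in enumerate(gold) if head == token]
        # Correct if the induced parent is (1) the gold parent,
        # (2) a gold child, or (3) the gold grandparent.
        if (induced_parent == gold_parent
                or induced_parent in gold_children
                or induced_parent == gold_grandparent):
            correct += 1
    return correct / len(gold)

if __name__ == "__main__":
    # Same toy version of Figure 3 as above: gold w1 -> w2 -> w3, induced
    # parse with the w2/w3 edge flipped and w3 attached to w1.
    gold, induced = [-1, 0, 1], [-1, 2, 0]
    print(ned_accuracy(gold, induced))  # 1.0: both flipped tokens are accepted
```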

54 Consider again Figure 3, where we assume that 3(a) is the gold standard and that 3(b) is the induced parse. [sent-292, score-0.298]

55 Much like undirected evaluation, NED will mark the attachment of w2 as correct, since its induced parent is its gold child. [sent-293, score-0.714]

56 However, unlike undirected evaluation, w3’s induced attachment will also be marked as correct, as its induced parent is its gold grandparent. [sent-294, score-0.814]

57 Now consider another induced parse in which the direction of the edge between w2 and w3 is switched and w3’s parent is set to be some other word (Figure 3(c)). [sent-295, score-0.555]

58 This should be marked as an error, even if the direction of the edge between w2 and w3 is controversial, since the structure [w2 , w3] is no longer a dependent of w1. [sent-296, score-0.295]

59 Note that undirected evaluation gives the parses in Figure 3(b) and Figure 3(c) the same score, while if the structure [w2 , w3] is problematic, there is a major difference in their correctness. [sent-298, score-0.22]

60 Therefore, even a substantial difference in the attachment between two parsers is not necessarily indicative of a true quality difference. [sent-301, score-0.211]

61 However, an attachment score difference that persists under NED is an indication of a true quality difference, since generally problematic structures are local (i. [sent-302, score-0.438]

62 Reporting NED alone is insufficient, as obviously the edge direction does matter in some cases. [sent-305, score-0.295]

63 , “big house”), the correct edge direction is widely agreed upon (“big”←“house”) (Kübler et al. [sent-308, score-0.383]

64 A possible criticism of NED is that it is only indifferent to alternative annotations in structures of size 2 (e. [sent-312, score-0.26]

65 , “to eat”) and does not necessarily handle larger problematic structures, such as coordinations. (Figure 4: Alternative parses of “John and Mary” and “in the house”.) [sent-314, score-0.419]

66 Assume the parse in Figure 4(a) is the gold parse and that in Figure 4(b) is the induced parse. [sent-320, score-0.342]

67 The word “Mary” is a NED error, since its induced parent (“and”) is neither its gold child nor its gold grandparent. [sent-321, score-0.572]

68 Thus, NED does not accept all possible annotations of structures of size 3. [sent-322, score-0.173]

69 A better solution may be to modify the gold standard annotation, so as to explicitly annotate problematic structures as such. [sent-324, score-0.464]

70 NED is therefore an evaluation measure which is indifferent to edge-flips, and is consequently less sensitive to alternative annotations. [sent-326, score-0.175]

71 We now show that NED is indifferent to the differences between the structures originally learned by the algorithms mentioned in Section 3 and the gold standard annotation in all the problematic cases we consider. [sent-327, score-0.653]

72 The exceptions are coordinations and prepositional phrases which are structures of size 3. [sent-329, score-0.193]

73 Regarding prepositional phrases, Figure 4(c) presents the gold standard of “in the house”, Figure 4(d) the parse induced by km04 and saj10a and Figure 4(e) the parse induced by cs09. [sent-333, score-0.52]

74 In order to further demonstrate NED’s insensitivity to alternative annotations, we took two of the three common gold standard annotations (see Section 4) and evaluated them one against the other. [sent-335, score-0.293]

75 We considered section 23 of WSJ following the scheme of (Yamada and Matsumoto, 2003) as the gold standard and of (Collins, 1999) as the evaluated set. [sent-336, score-0.227]

76 6 Experimenting with NED. In this section we show that NED indeed reduces the performance difference between the original and the modified parameter sets, thus providing empirical evidence for its validity. [sent-342, score-0.211]

77 Table 3 shows the score differences between the parameter sets using attachment score, undirected evaluation and NED. [sent-346, score-0.405]

78 A substantial difference persists under undirected evaluation: a gap of 7. [sent-347, score-0.22]

79 This is consistent with our discussion in Section 5, and shows that undirected evaluation only ignores some of the errors inflicted by edge-flips. [sent-352, score-0.224]

80 For completeness, Table 4 shows a comparison of some of the current state-of-the-art algorithms, using attachment score, undirected evaluation and NED. [sent-361, score-0.332]

81 Table 3: Differences between the modified and original parameter sets when evaluated using attachment score (Attach. [sent-377, score-0.294]

82 Our experiments suggest that it is crucial to deal with such structures if we would like to have a proper evaluation of unsupervised parsing algorithms against a gold standard. [sent-398, score-0.444]

83 The first way was to modify the parameters of the parsing algorithms so that in cases where such problematic decisions are to be made they follow the gold standard annotation. [sent-399, score-0.477]

84 Indeed, this modification leads to a substantial improvement in the attachment score of the algorithms. [sent-400, score-0.181]

85 The NED measure we proposed does not punish for differences between gold and induced structures in the problematic cases. [sent-404, score-0.636]

86 Indeed, in Section 6 (Table 3) we show that the differences between the original and modified models are much smaller when evaluating with NED compared to when evaluating with the traditional attachment score. [sent-405, score-0.285]

87 As Table 3 reveals, however, even when evaluating with NED, there is still some difference between the original and the modified model, for each of the algorithms we consider. [sent-406, score-0.182]

88 Moreover, for two of the algorithms (km04 and saj10a) NED prefers the original model while for one (cs09) it prefers the modified version. [sent-407, score-0.225]

89 The first is annotated with the gold standard annotation. [sent-415, score-0.198]

90 The second is similarly annotated except in the linguistically problematic structures. [sent-416, score-0.214]

91 We replace these structures with the ones that would have been created with the unsupervised version of the algorithm (see Table 1 for the relevant structures for each algorithm). [sent-417, score-0.286]

92 In cases where the structures are comprised of a single edge, the second corpus is obtained from the gold standard by an edge-flip. [sent-418, score-0.305]

93 Their gold standard and the learned structures for each of the algorithms are shown in Figure 4. [sent-420, score-0.346]

94 In this case, the second corpus is obtained from the gold standard by replacing each prepositional phrase in the gold standard with the corresponding structure. Each corpus is divided into a training and a test set. [sent-421, score-0.438]

95 The second line shows the results of the supervised versions of the algorithms using the corpus which agrees with the unsupervised model in the problematic cases (Orig. [sent-444, score-0.272]

96 The table clearly demonstrates that a set of parameters (original or modified) is preferred by NED in the unsupervised experiments reported in Section 6 (top line) if and only if the structures produced by this set are better learned by the supervised version of the algorithm (bottom line). [sent-447, score-0.224]

97 8 Conclusion. In this paper we showed that the standard evaluation of unsupervised dependency parsers is highly sensitive to problematic annotations. [sent-454, score-0.47]

98 We modified a small set of parameters that controls the annotation in such problematic cases in three leading parsers. [sent-455, score-0.339]

99 As the problematic structures are generally local, NED is less sensitive to their alternative annotations. [sent-458, score-0.34]

100 Improving unsupervised dependency parsing with richer contexts and smoothing. [sent-529, score-0.225]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ned', 0.605), ('spitkovsky', 0.242), ('dmv', 0.219), ('undirected', 0.19), ('edge', 0.188), ('gold', 0.162), ('problematic', 0.159), ('attachment', 0.142), ('eat', 0.124), ('parent', 0.12), ('dependency', 0.119), ('pattach', 0.116), ('direction', 0.107), ('structures', 0.107), ('induced', 0.1), ('pstop', 0.099), ('cohen', 0.098), ('ubler', 0.088), ('klein', 0.079), ('modified', 0.077), ('yamada', 0.077), ('house', 0.072), ('unsupervised', 0.072), ('matsumoto', 0.07), ('smith', 0.069), ('headden', 0.067), ('gillenwater', 0.066), ('annotations', 0.066), ('controversial', 0.063), ('conversion', 0.06), ('indifferent', 0.058), ('annotation', 0.058), ('prefers', 0.057), ('nivre', 0.057), ('head', 0.056), ('linguistically', 0.055), ('manning', 0.053), ('collins', 0.052), ('henceforth', 0.052), ('blunsom', 0.051), ('nugues', 0.049), ('wsj', 0.046), ('parameters', 0.045), ('sensitive', 0.045), ('schemes', 0.045), ('mary', 0.044), ('experimenting', 0.044), ('valentin', 0.044), ('bosco', 0.044), ('omri', 0.044), ('coordinations', 0.044), ('doe', 0.044), ('columns', 0.043), ('measure', 0.043), ('nilsson', 0.043), ('johansson', 0.043), ('constituency', 0.043), ('neutral', 0.042), ('prepositional', 0.042), ('parameter', 0.041), ('algorithms', 0.041), ('shay', 0.04), ('flipping', 0.04), ('parse', 0.04), ('modification', 0.039), ('parsers', 0.039), ('grammar', 0.038), ('cohn', 0.038), ('ptb', 0.038), ('alshawi', 0.038), ('hiyan', 0.038), ('joakim', 0.038), ('participate', 0.037), ('dir', 0.037), ('verb', 0.036), ('standard', 0.036), ('eisner', 0.034), ('replication', 0.034), ('ignores', 0.034), ('original', 0.034), ('parsing', 0.034), ('azrieli', 0.033), ('edgeflip', 0.033), ('lightlysupervised', 0.033), ('phylogenetic', 0.033), ('punish', 0.033), ('root', 0.033), ('differences', 0.032), ('difference', 0.03), ('tokens', 0.03), ('indeed', 0.029), ('alternative', 0.029), ('lombardo', 0.029), ('controversies', 0.029), ('scheme', 0.029), ('viterbi', 0.028), ('john', 0.028), ('proper', 0.028), ('child', 0.028), ('sandra', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 230 acl-2011-Neutralizing Linguistically Problematic Annotations in Unsupervised Dependency Parsing Evaluation

Author: Roy Schwartz ; Omri Abend ; Roi Reichart ; Ari Rappoport

Abstract: Dependency parsing is a central NLP task. In this paper we show that the common evaluation for unsupervised dependency parsing is highly sensitive to problematic annotations. We show that for three leading unsupervised parsers (Klein and Manning, 2004; Cohen and Smith, 2009; Spitkovsky et al., 2010a), a small set of parameters can be found whose modification yields a significant improvement in standard evaluation measures. These parameters correspond to local cases where no linguistic consensus exists as to the proper gold annotation. Therefore, the standard evaluation does not provide a true indication of algorithm quality. We present a new measure, Neutral Edge Direction (NED), and show that it greatly reduces this undesired phenomenon.

2 0.14335512 333 acl-2011-Web-Scale Features for Full-Scale Parsing

Author: Mohit Bansal ; Dan Klein

Abstract: Counts from large corpora (like the web) can be powerful syntactic cues. Past work has used web counts to help resolve isolated ambiguities, such as binary noun-verb PP attachments and noun compound bracketings. In this work, we first present a method for generating web count features that address the full range of syntactic attachments. These features encode both surface evidence of lexical affinities as well as paraphrase-based cues to syntactic structure. We then integrate our features into full-scale dependency and constituent parsers. We show relative error reductions of7.0% over the second-order dependency parser of McDonald and Pereira (2006), 9.2% over the constituent parser of Petrov et al. (2006), and 3.4% over a non-local constituent reranker.

3 0.13660119 48 acl-2011-Automatic Detection and Correction of Errors in Dependency Treebanks

Author: Alexander Volokh ; Gunter Neumann

Abstract: Annotated corpora are essential for almost all NLP applications. Whereas they are expected to be of a very high quality because of their importance for the followup developments, they still contain a considerable number of errors. With this work we want to draw attention to this fact. Additionally, we try to estimate the amount of errors and propose a method for their automatic correction. Whereas our approach is able to find only a portion of the errors that we suppose are contained in almost any annotated corpus due to the nature of the process of its creation, it has a very high precision, and thus is in any case beneficial for the quality of the corpus it is applied to. At last, we compare it to a different method for error detection in treebanks and find out that the errors that we are able to detect are mostly different and that our approaches are complementary. 1

4 0.13161957 309 acl-2011-Transition-based Dependency Parsing with Rich Non-local Features

Author: Yue Zhang ; Joakim Nivre

Abstract: Transition-based dependency parsers generally use heuristic decoding algorithms but can accommodate arbitrarily rich feature representations. In this paper, we show that we can improve the accuracy of such parsers by considering even richer feature sets than those employed in previous systems. In the standard Penn Treebank setup, our novel features improve attachment score form 91.4% to 92.9%, giving the best results so far for transitionbased parsing and rivaling the best results overall. For the Chinese Treebank, they give a signficant improvement of the state of the art. An open source release of our parser is freely available.

5 0.13065855 167 acl-2011-Improving Dependency Parsing with Semantic Classes

Author: Eneko Agirre ; Kepa Bengoetxea ; Koldo Gojenola ; Joakim Nivre

Abstract: This paper presents the introduction of WordNet semantic classes in a dependency parser, obtaining improvements on the full Penn Treebank for the first time. We tried different combinations of some basic semantic classes and word sense disambiguation algorithms. Our experiments show that selecting the adequate combination of semantic features on development data is key for success. Given the basic nature of the semantic classes and word sense disambiguation algorithms used, we think there is ample room for future improvements. 1

6 0.12760501 128 acl-2011-Exploring Entity Relations for Named Entity Disambiguation

7 0.11824341 127 acl-2011-Exploiting Web-Derived Selectional Preference to Improve Statistical Dependency Parsing

8 0.10586884 111 acl-2011-Effects of Noun Phrase Bracketing in Dependency Parsing and Machine Translation

9 0.10452716 39 acl-2011-An Ensemble Model that Combines Syntactic and Semantic Clustering for Discriminative Dependency Parsing

10 0.098127507 143 acl-2011-Getting the Most out of Transition-based Dependency Parsing

11 0.081673093 92 acl-2011-Data point selection for cross-language adaptation of dependency parsers

12 0.078860082 59 acl-2011-Better Automatic Treebank Conversion Using A Feature-Based Approach

13 0.074759223 58 acl-2011-Beam-Width Prediction for Efficient Context-Free Parsing

14 0.073604837 284 acl-2011-Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models

15 0.073201172 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation

16 0.073067136 122 acl-2011-Event Extraction as Dependency Parsing

17 0.072061189 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering

18 0.071049385 243 acl-2011-Partial Parsing from Bitext Projections

19 0.068658374 314 acl-2011-Typed Graph Models for Learning Latent Attributes from Names

20 0.068014644 10 acl-2011-A Discriminative Model for Joint Morphological Disambiguation and Dependency Parsing


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.184), (1, -0.03), (2, -0.07), (3, -0.18), (4, -0.022), (5, -0.045), (6, 0.05), (7, 0.04), (8, 0.011), (9, -0.009), (10, 0.036), (11, 0.046), (12, 0.019), (13, -0.057), (14, -0.031), (15, 0.017), (16, 0.038), (17, 0.015), (18, -0.032), (19, -0.038), (20, -0.085), (21, 0.003), (22, -0.021), (23, -0.009), (24, 0.091), (25, -0.029), (26, -0.049), (27, -0.054), (28, -0.041), (29, 0.013), (30, 0.008), (31, 0.038), (32, 0.052), (33, -0.001), (34, 0.067), (35, -0.004), (36, -0.0), (37, 0.034), (38, 0.009), (39, -0.027), (40, 0.032), (41, 0.07), (42, -0.038), (43, -0.015), (44, -0.064), (45, 0.05), (46, -0.002), (47, 0.025), (48, 0.052), (49, 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94574904 230 acl-2011-Neutralizing Linguistically Problematic Annotations in Unsupervised Dependency Parsing Evaluation

Author: Roy Schwartz ; Omri Abend ; Roi Reichart ; Ari Rappoport

Abstract: Dependency parsing is a central NLP task. In this paper we show that the common evaluation for unsupervised dependency parsing is highly sensitive to problematic annotations. We show that for three leading unsupervised parsers (Klein and Manning, 2004; Cohen and Smith, 2009; Spitkovsky et al., 2010a), a small set of parameters can be found whose modification yields a significant improvement in standard evaluation measures. These parameters correspond to local cases where no linguistic consensus exists as to the proper gold annotation. Therefore, the standard evaluation does not provide a true indication of algorithm quality. We present a new measure, Neutral Edge Direction (NED), and show that it greatly reduces this undesired phenomenon.

2 0.76829422 48 acl-2011-Automatic Detection and Correction of Errors in Dependency Treebanks

Author: Alexander Volokh ; Gunter Neumann

Abstract: Annotated corpora are essential for almost all NLP applications. Whereas they are expected to be of a very high quality because of their importance for the followup developments, they still contain a considerable number of errors. With this work we want to draw attention to this fact. Additionally, we try to estimate the amount of errors and propose a method for their automatic correction. Whereas our approach is able to find only a portion of the errors that we suppose are contained in almost any annotated corpus due to the nature of the process of its creation, it has a very high precision, and thus is in any case beneficial for the quality of the corpus it is applied to. At last, we compare it to a different method for error detection in treebanks and find out that the errors that we are able to detect are mostly different and that our approaches are complementary. 1

3 0.7657001 127 acl-2011-Exploiting Web-Derived Selectional Preference to Improve Statistical Dependency Parsing

Author: Guangyou Zhou ; Jun Zhao ; Kang Liu ; Li Cai

Abstract: In this paper, we present a novel approach which incorporates the web-derived selectional preferences to improve statistical dependency parsing. Conventional selectional preference learning methods have usually focused on word-to-class relations, e.g., a verb selects as its subject a given nominal class. This paper extends previous work to wordto-word selectional preferences by using webscale data. Experiments show that web-scale data improves statistical dependency parsing, particularly for long dependency relationships. There is no data like more data, performance improves log-linearly with the number of parameters (unique N-grams). More importantly, when operating on new domains, we show that using web-derived selectional preferences is essential for achieving robust performance.

4 0.76508093 143 acl-2011-Getting the Most out of Transition-based Dependency Parsing

Author: Jinho D. Choi ; Martha Palmer

Abstract: This paper suggests two ways of improving transition-based, non-projective dependency parsing. First, we add a transition to an existing non-projective parsing algorithm, so it can perform either projective or non-projective parsing as needed. Second, we present a bootstrapping technique that narrows down discrepancies between gold-standard and automatic parses used as features. The new addition to the algorithm shows a clear advantage in parsing speed. The bootstrapping technique gives a significant improvement to parsing accuracy, showing near state-of-theart performance with respect to other parsing approaches evaluated on the same data set.

5 0.75370312 309 acl-2011-Transition-based Dependency Parsing with Rich Non-local Features

Author: Yue Zhang ; Joakim Nivre

Abstract: Transition-based dependency parsers generally use heuristic decoding algorithms but can accommodate arbitrarily rich feature representations. In this paper, we show that we can improve the accuracy of such parsers by considering even richer feature sets than those employed in previous systems. In the standard Penn Treebank setup, our novel features improve attachment score form 91.4% to 92.9%, giving the best results so far for transitionbased parsing and rivaling the best results overall. For the Chinese Treebank, they give a signficant improvement of the state of the art. An open source release of our parser is freely available.

6 0.75066096 39 acl-2011-An Ensemble Model that Combines Syntactic and Semantic Clustering for Discriminative Dependency Parsing

7 0.74448091 167 acl-2011-Improving Dependency Parsing with Semantic Classes

8 0.72896183 333 acl-2011-Web-Scale Features for Full-Scale Parsing

9 0.7174502 243 acl-2011-Partial Parsing from Bitext Projections

10 0.71702498 111 acl-2011-Effects of Noun Phrase Bracketing in Dependency Parsing and Machine Translation

11 0.67947859 59 acl-2011-Better Automatic Treebank Conversion Using A Feature-Based Approach

12 0.67595214 267 acl-2011-Reversible Stochastic Attribute-Value Grammars

13 0.66340858 92 acl-2011-Data point selection for cross-language adaptation of dependency parsers

14 0.66291076 236 acl-2011-Optimistic Backtracking - A Backtracking Overlay for Deterministic Incremental Parsing

15 0.63967884 295 acl-2011-Temporal Restricted Boltzmann Machines for Dependency Parsing

16 0.63319349 107 acl-2011-Dynamic Programming Algorithms for Transition-Based Dependency Parsers

17 0.59849143 269 acl-2011-Scaling up Automatic Cross-Lingual Semantic Role Annotation

18 0.56431162 284 acl-2011-Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models

19 0.5611555 282 acl-2011-Shift-Reduce CCG Parsing

20 0.55825204 186 acl-2011-Joint Training of Dependency Parsing Filters through Latent Support Vector Machines


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.021), (17, 0.038), (26, 0.013), (37, 0.53), (39, 0.04), (41, 0.046), (55, 0.019), (59, 0.034), (72, 0.024), (77, 0.015), (91, 0.027), (96, 0.096), (97, 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.97146553 179 acl-2011-Is Machine Translation Ripe for Cross-Lingual Sentiment Classification?

Author: Kevin Duh ; Akinori Fujino ; Masaaki Nagata

Abstract: Recent advances in Machine Translation (MT) have brought forth a new paradigm for building NLP applications in low-resource scenarios. To build a sentiment classifier for a language with no labeled resources, one can translate labeled data from another language, then train a classifier on the translated text. This can be viewed as a domain adaptation problem, where labeled translations and test data have some mismatch. Various prior work have achieved positive results using this approach. In this opinion piece, we take a step back and make some general statements about crosslingual adaptation problems. First, we claim that domain mismatch is not caused by MT errors, and accuracy degradation will occur even in the case of perfect MT. Second, we argue that the cross-lingual adaptation problem is qualitatively different from other (monolingual) adaptation problems in NLP; thus new adaptation algorithms ought to be considered. This paper will describe a series of carefully-designed experiments that led us to these conclusions.

same-paper 2 0.93743932 230 acl-2011-Neutralizing Linguistically Problematic Annotations in Unsupervised Dependency Parsing Evaluation

Author: Roy Schwartz ; Omri Abend ; Roi Reichart ; Ari Rappoport

Abstract: Dependency parsing is a central NLP task. In this paper we show that the common evaluation for unsupervised dependency parsing is highly sensitive to problematic annotations. We show that for three leading unsupervised parsers (Klein and Manning, 2004; Cohen and Smith, 2009; Spitkovsky et al., 2010a), a small set of parameters can be found whose modification yields a significant improvement in standard evaluation measures. These parameters correspond to local cases where no linguistic consensus exists as to the proper gold annotation. Therefore, the standard evaluation does not provide a true indication of algorithm quality. We present a new measure, Neutral Edge Direction (NED), and show that it greatly reduces this undesired phenomenon.

3 0.93310171 127 acl-2011-Exploiting Web-Derived Selectional Preference to Improve Statistical Dependency Parsing

Author: Guangyou Zhou ; Jun Zhao ; Kang Liu ; Li Cai

Abstract: In this paper, we present a novel approach which incorporates the web-derived selectional preferences to improve statistical dependency parsing. Conventional selectional preference learning methods have usually focused on word-to-class relations, e.g., a verb selects as its subject a given nominal class. This paper extends previous work to wordto-word selectional preferences by using webscale data. Experiments show that web-scale data improves statistical dependency parsing, particularly for long dependency relationships. There is no data like more data, performance improves log-linearly with the number of parameters (unique N-grams). More importantly, when operating on new domains, we show that using web-derived selectional preferences is essential for achieving robust performance.

4 0.930246 250 acl-2011-Prefix Probability for Probabilistic Synchronous Context-Free Grammars

Author: Mark-Jan Nederhof ; Giorgio Satta

Abstract: We present a method for the computation of prefix probabilities for synchronous contextfree grammars. Our framework is fairly general and relies on the combination of a simple, novel grammar transformation and standard techniques to bring grammars into normal forms.

5 0.92449737 100 acl-2011-Discriminative Feature-Tied Mixture Modeling for Statistical Machine Translation

Author: Bing Xiang ; Abraham Ittycheriah

Abstract: In this paper we present a novel discriminative mixture model for statistical machine translation (SMT). We model the feature space with a log-linear combination ofmultiple mixture components. Each component contains a large set of features trained in a maximumentropy framework. All features within the same mixture component are tied and share the same mixture weights, where the mixture weights are trained discriminatively to maximize the translation performance. This approach aims at bridging the gap between the maximum-likelihood training and the discriminative training for SMT. It is shown that the feature space can be partitioned in a variety of ways, such as based on feature types, word alignments, or domains, for various applications. The proposed approach improves the translation performance significantly on a large-scale Arabic-to-English MT task.

6 0.9138841 122 acl-2011-Event Extraction as Dependency Parsing

7 0.91228211 204 acl-2011-Learning Word Vectors for Sentiment Analysis

8 0.91041994 334 acl-2011-Which Noun Phrases Denote Which Concepts?

9 0.85939378 332 acl-2011-Using Multiple Sources to Construct a Sentiment Sensitive Thesaurus for Cross-Domain Sentiment Classification

10 0.80087453 54 acl-2011-Automatically Extracting Polarity-Bearing Topics for Cross-Domain Sentiment Classification

11 0.80043072 92 acl-2011-Data point selection for cross-language adaptation of dependency parsers

12 0.79468191 256 acl-2011-Query Weighting for Ranking Model Adaptation

13 0.79101473 186 acl-2011-Joint Training of Dependency Parsing Filters through Latent Support Vector Machines

14 0.78543705 183 acl-2011-Joint Bilingual Sentiment Classification with Unlabeled Parallel Corpora

15 0.78043956 85 acl-2011-Coreference Resolution with World Knowledge

16 0.76574945 309 acl-2011-Transition-based Dependency Parsing with Rich Non-local Features

17 0.76394331 199 acl-2011-Learning Condensed Feature Representations from Large Unsupervised Data Sets for Supervised Learning

18 0.76323879 292 acl-2011-Target-dependent Twitter Sentiment Classification

19 0.76175392 103 acl-2011-Domain Adaptation by Constraining Inter-Domain Variability of Latent Feature Representation

20 0.75751483 39 acl-2011-An Ensemble Model that Combines Syntactic and Semantic Clustering for Discriminative Dependency Parsing