acl acl2010 acl2010-48 knowledge-graph by maker-knowledge-mining

48 acl-2010-Better Filtration and Augmentation for Hierarchical Phrase-Based Translation Rules

Source: pdf

Author: Zhiyang Wang ; Yajuan Lv ; Qun Liu ; Young-Sook Hwang

Abstract: This paper presents a novel filtration criterion to restrict the rule extraction for the hierarchical phrase-based translation model, where a bilingual but relaxed wellformed dependency restriction is used to filter out bad rules. Furthermore, a new feature which describes the regularity that the source/target dependency edge triggers the target/source word is also proposed. Experimental results show that, the new criteria weeds out about 40% rules while with translation performance improvement, and the new feature brings an- other improvement to the baseline system, especially on larger corpus.

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 cn ct Abstract This paper presents a novel filtration criterion to restrict the rule extraction for the hierarchical phrase-based translation model, where a bilingual but relaxed wellformed dependency restriction is used to filter out bad rules. [sent-6, score-1.232]

2 Furthermore, a new feature which describes the regularity that the source/target dependency edge triggers the target/source word is also proposed. [sent-7, score-0.502]

3 Experimental results show that, the new criteria weeds out about 40% rules while with translation performance improvement, and the new feature brings an- other improvement to the baseline system, especially on larger corpus. [sent-8, score-0.361]

4 1 Introduction Hierarchical phrase-based (HPB) model (Chiang, 2005) is the state-of-the-art statistical machine translation (SMT) model. [sent-9, score-0.175]

5 By looking for phrases that contain other phrases and replacing the subphrases with nonterminal symbols, it gets hierarchical rules. [sent-10, score-0.121]

6 Hierarchical rules are more powerful than conventional phrases since they have better generalization capability and could capture long distance reordering. [sent-11, score-0.112]

7 However, when the training corpus becomes larger, the number of rules will grow exponentially, which inevitably results in slow and memory-consuming decoding. [sent-12, score-0.112]

8 In this paper, we address the problem of reducing the hierarchical translation rule table resorting to the dependency information of bilingual languages. [sent-13, score-0.864]

9 We only keep rules that both sides are relaxed-well-formed (RWF) dependency structure (see the definition in Section 3), and discard others which do not satisfy this constraint. [sent-14, score-0.663]

10 In this way, about 40% bad rules are weeded out from the orig- inal rule table. [sent-15, score-0.391]

11 However, the performance is even better than the traditional HPB translation system. [sent-16, score-0.175]

12 com † ff’ e Source Target Figure 1: Solid wire reveals the dependency relation pointing from the child to the parent. [sent-18, score-0.405]

13 Target word e is triggered by the source word f and it’s head word f′, p(e|f → f′). [sent-19, score-0.428]

14 Based on the relaxed-well-formed dependency structure, we also introduce a new linguistic feature to enhance translation performance. [sent-20, score-0.577]

15 In the traditional phrase-based SMT model, there are always lexical translation probabilities based on IBM model 1 (Brown et al. [sent-21, score-0.175]

16 p(e|f), namely, tdheel target rwoowrdn e its a triggered by eth. [sent-24, score-0.259]

17 Intuitively, however, the generation of e is not only involved with f, sometimes may also be triggered by other context words in the source side. [sent-26, score-0.202]

18 Here we assume that the dependency edge (f → f′) of word f generates target word e (we (cfall → →it h fead word trigger in Section 4). [sent-27, score-0.822]

19 Therefore, two words in one language trigger one word in another, which provides a more sophisticated and better choice for the target word, i. [sent-28, score-0.389]

20 Similarly, the dependency feature works well in Chinese-to-English translation task, especially on large corpus. [sent-31, score-0.573]

21 2 Related Work In the past, a significant number of techniques have been presented to reduce the hierarchical rule table. [sent-32, score-0.311]

22 (2009) just used the key phrases of source side to filter the rule table without taking advantage of any linguistic information. [sent-34, score-0.418]

23 (2009) put rules into syntactic classes based on the number of non-terminals and patterns, and applied various filtration strategies to improve the rule table quality. [sent-36, score-0.482]

24 (2008) discarded 142 UppsalaP,r Sowce ed ein ,g 1s1 o-f16 th Jeu AlyC 2L0 210 1. [sent-38, score-0.079]

25 c C2o0n1f0er Aenscseoc Sihatoirotn P faopre Crso,m papguetsat 1io4n2a–l1 L4i6n,guistics Thelovelygirlafoundbeautifulhouse Figure 2: An example of dependency tree. [sent-40, score-0.284]

26 The corresponding plain sentence is The lovely girl found a beautiful house. [sent-41, score-0.193]

27 most entries of the rule table by using the constraint that rules of the target-side are well-formed (WF) dependency structure, but this filtering led to degradation in translation performance. [sent-42, score-1.006]

28 They obtained improvements by adding an additional dependency language model. [sent-43, score-0.284]

29 , 2008) is that we keep rules that both sides should be relaxed-wellformed dependency structure, not just the target side. [sent-45, score-0.629]

30 The feature of head word trigger which we apply to the log-linear model is motivated by the trigger-based approach (Hasan and Ney, 2009). [sent-47, score-0.475]

31 Hasan and Ney (2009) introduced a second word to trigger the target word without considering any linguistic information. [sent-48, score-0.43]

32 Furthermore, since the second word can come from any part of the sentence, there may be a prohibitively large number of parameters involved. [sent-49, score-0.041]

33 (2008) built a maximum entropy model which combines rich context information for selecting translation rules during decoding. [sent-51, score-0.287]

34 , 2009), context language model is proposed for better rule selection. [sent-54, score-0.19]

35 Taking the dependency edge as condition, our approach is very different from previous approaches of exploring context information. [sent-55, score-0.351]

36 It reveals long-distance relation between words and directly models the semantic structure of a sentence without any constituent labels. [sent-60, score-0.165]

37 In this example, the word found is the root of the tree. [sent-62, score-0.041]

38 (2008) propose the well-formed dependency structure to filter the hierarchical rule table. [sent-64, score-0.788]

39 A well-formed dependency structure could be either a single-rooted dependency tree or a set of sibling trees. [sent-65, score-0.648]

40 Although most rules are discarded with the constraint that the target side should be well-formed, this filtration leads to degradation in translation performance. [sent-66, score-0.832]

41 , 2008), we introduce the so-called relaxedwell-formed dependency structure to filter the hierarchical rule table. [sent-68, score-0.788]

42 dn represent the position of parent word for each word. [sent-76, score-0.041]

43 If and only if it satisfies the following hco =ndit −io1n)s. [sent-87, score-0.039]

44 − • dh ∈/ [i, j] • ∀k ∈ [i, j] , dk ∈ [i, j] or dk = h From the definition above, we can see that the relaxed-well-formed structure obviously covers the well-formed one. [sent-88, score-0.238]

45 In this structure, we don’t constrain that all the children of the sub-root should be complete. [sent-89, score-0.076]

46 Let’s review the dependency tree in Figure 2 as an example. [sent-90, score-0.284]

47 Except for the wellformed structure, we could also extract girl found a beautiful house. [sent-91, score-0.184]

48 Therefore, if the modifier The lovely changes to The cute, this rule also works. [sent-92, score-0.258]

49 Source word f aligns with target word e, according to the IBM model 1, the lexical translation probability is p(e|f). [sent-95, score-0.348]

50 However, in the sense of dependency relationship, we e bveelrie,v ien tthhaet sthenes generation odefn ntchey target word e, is not only triggered by the aligned source word f, but also associated with f ’s head word f′. [sent-96, score-0.803]

51 Therefore, the lexical translation probability becomes p(e|f → f′), which of course aalbliolwitys fboerc a more fine-grained lexical choice of 143 the target word. [sent-97, score-0.266]

52 This new feature can be easily integrated into the log-linear model as lexical weighting does. [sent-100, score-0.112]

53 5 Experiments In this section, we describe the experimental setting used in this work, and verify the effect of the relaxed-well-formed structure filtering and the new feature, head word trigger. [sent-101, score-0.277]

54 1 Experimental Setup Experiments are carried out on the NIST1 Chinese-English translation task with two different size of training corpora. [sent-103, score-0.175]

55 We evaluate the translation quality using case-insensitive BLEU metric (Papineni et al. [sent-116, score-0.175]

56 , 2002) without dropping OOV words, and the feature weights are tuned by minimum error rate training (Och, 2003). [sent-117, score-0.074]

57 In order to get the dependency relation of the training corpus, we re-implement a beam-search style monolingual dependency parser according to (Nivre and Scholz, 2004). [sent-118, score-0.722]

58 Then we use the same method suggested in (Chiang, 2005) to extract SCFG grammar rules within dependency constraint on both sides except that unaligned words are allowed at the edge of phrases. [sent-119, score-0.691]

59 Parameters of head word trigger are estimated as described in Section 4. [sent-120, score-0.401]

60 As a default, the maximum initial phrase length is set to 10 and the maximum rule length of the source side is set to 5. [sent-121, score-0.305]

61 In fact, we just exploit the dependency structure during the rule extrac- tion phase. [sent-123, score-0.596]

62 We first parse the bilingual languages with monolingual dependency parser respectively, and then only retain the rules that both sides are in line with the constraint of dependency structure. [sent-127, score-1.041]

63 In Table 1, the relaxed-well-formed structure filtered out 35% of the rule table and the well-formed discarded 74%. [sent-128, score-0.349]

64 RWF extracts additional 39% compared to WF, which can be seen as some kind of evidence that the rules we additional get seem common in the sense of linguistics. [sent-129, score-0.151]

65 , 2008), we just use the dependency structure to constrain rules, not to maintain the tree structures to guide decoding. [sent-131, score-0.44]

66 We can see that the RWF structure constraint can improve translation quality substantially both at development set and different test sets. [sent-133, score-0.341]

67 Although it discard about 74% of the rule table, the 144 SRyHW sPtFeBmR13u79l0,e 164ta52b0,l2e03591si0ze Table 1: Rule table size with different constraint on FBIS. [sent-139, score-0.321]

68 Here HPB refers to the baseline hierarchal phrase-based system, RWF means relaxed-well-formed constraint and WF represents the well-formed structure. [sent-140, score-0.176]

69 Here Tri means the feature of head word trigger on both sides. [sent-145, score-0.475]

70 And we don’t test the new feature on Test04 because of the bad performance on development set. [sent-146, score-0.127]

71 As for the feature of head word trigger, it seems not work on the FBIS corpus. [sent-153, score-0.218]

72 3 Result on GQ Corpus In this part, we increased the size of the training corpus to check whether the feature of head word trigger works on large corpus. [sent-157, score-0.515]

73 We get 152M rule entries from the GQ corpus according to (Chiang, 2007)’s extraction method. [sent-158, score-0.271]

74 If we use the RWF structure to constrain both sides, the number of rules is 87M, about 43% of rule entries are discarded. [sent-159, score-0.5]

75 Compared to the result of the baseline, only using the RWF structure to filter performs the same as the baseline on Test05, and +0. [sent-171, score-0.193]

76 6 Conclusions This paper proposes a simple strategy to filter the hierarchal rule table, and introduces a new feature to enhance the translation performance. [sent-173, score-0.686]

77 We employ the relaxed-well-formed dependency structure to constrain both sides of the rule, and about 40% of rules are discarded with improvement of the translation performance. [sent-174, score-0.948]

78 In order to make full use of the dependency information, we assume that the target word e is triggered by dependency edge of the corresponding source word f. [sent-175, score-1.01]

79 And this feature works well on large parallel training corpus. [sent-176, score-0.114]

80 How to estimate the probability of head word trigger is very important. [sent-177, score-0.401]

81 Here we only get the parameters in a generative way. [sent-178, score-0.039]

82 In the future, we we are plan to exploit some discriminative approach to train parameters of this feature, such as EM algorithm (Hasan et al. [sent-179, score-0.042]

83 As the next step, we will try to exploit bilingual knowledge to improve the monolingual parser, i. [sent-183, score-0.175]

84 In ACL 145 ’05: Proceedings of the 43rd Annual Meeting on Associationfor Computational Linguistics, pages 263– 270. [sent-204, score-0.042]

85 In ACL ’05: Proceedings of the 43rd Annual Meeting on Association for Computa- tional Linguistics, pages 541–548. [sent-214, score-0.042]

86 In NAACL ’09: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, pages 17–20. [sent-218, score-0.042]

87 Saˇ sa Hasan, Juri Ganitkevitch, Hermann Ney, and Jes u´s Andr e´s-Ferrer. [sent-219, score-0.079]

88 In EMNLP ’08: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 372– 381. [sent-222, score-0.042]

89 In COLING ’08: Proceedings of the 22nd International Conference on Computational Linguistics, pages 321–328. [sent-226, score-0.042]

90 In ACL-IJCNLP ’09: Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 121–124. [sent-230, score-0.042]

91 In EMNLP ’09: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 1222–123 1. [sent-234, score-0.042]

92 In EACL ’09: Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 380–388. [sent-239, score-0.042]

93 In NAACL ’03: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 48–54. [sent-243, score-0.042]

94 In COLING ’04: Proceedings of the 20th international conference on Computational Linguistics, pages 64–70. [sent-247, score-0.042]

95 In ACL ’03: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, pages 160–167. [sent-251, score-0.042]

96 In ACL ’02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 3 11–3 18. [sent-255, score-0.042]

97 In ACL ’05: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 271–279. [sent-259, score-0.042]

98 A new string-to-dependency machine translation algorithm with a target dependency language model. [sent-262, score-0.55]

99 In EMNLP ’09: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 72–80. [sent-267, score-0.042]

100 In In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP 2002), pages 901–904. [sent-271, score-0.042]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('rwf', 0.315), ('dependency', 0.284), ('trigger', 0.257), ('gq', 0.225), ('rule', 0.19), ('filtration', 0.18), ('translation', 0.175), ('fbis', 0.175), ('shen', 0.166), ('hasan', 0.16), ('sides', 0.142), ('hpb', 0.135), ('triggered', 0.132), ('hierarchical', 0.121), ('filter', 0.113), ('rules', 0.112), ('head', 0.103), ('wf', 0.1), ('target', 0.091), ('qun', 0.091), ('hierarchal', 0.09), ('constraint', 0.086), ('smt', 0.085), ('structure', 0.08), ('bleu', 0.079), ('discarded', 0.079), ('sa', 0.079), ('zhongjun', 0.079), ('iglesias', 0.079), ('besides', 0.076), ('monolingual', 0.076), ('constrain', 0.076), ('feature', 0.074), ('chiang', 0.073), ('source', 0.07), ('korea', 0.068), ('lovely', 0.068), ('edge', 0.067), ('jinxi', 0.064), ('libin', 0.064), ('beautiful', 0.064), ('degradation', 0.064), ('dk', 0.061), ('girl', 0.061), ('wenbin', 0.061), ('wellformed', 0.059), ('yajuan', 0.059), ('bilingual', 0.057), ('ff', 0.055), ('bad', 0.053), ('filtering', 0.053), ('ef', 0.051), ('ney', 0.048), ('quirk', 0.047), ('reveals', 0.046), ('discard', 0.045), ('side', 0.045), ('enhance', 0.044), ('ding', 0.044), ('entries', 0.042), ('exploit', 0.042), ('nivre', 0.042), ('pages', 0.042), ('word', 0.041), ('ralph', 0.041), ('xu', 0.041), ('works', 0.04), ('jiang', 0.04), ('china', 0.039), ('della', 0.039), ('hco', 0.039), ('adri', 0.039), ('banga', 0.039), ('eduardo', 0.039), ('pce', 0.039), ('scholz', 0.039), ('sco', 0.039), ('seoul', 0.039), ('thke', 0.039), ('triplet', 0.039), ('get', 0.039), ('relation', 0.039), ('papineni', 0.038), ('pietra', 0.038), ('weighting', 0.038), ('ibm', 0.037), ('reducing', 0.037), ('gains', 0.036), ('arul', 0.036), ('matsoukas', 0.036), ('menezes', 0.036), ('spyros', 0.036), ('treelet', 0.036), ('dh', 0.036), ('gispert', 0.036), ('inal', 0.036), ('regularity', 0.036), ('shu', 0.036), ('tdheel', 0.036), ('wire', 0.036)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 48 acl-2010-Better Filtration and Augmentation for Hierarchical Phrase-Based Translation Rules

Author: Zhiyang Wang ; Yajuan Lv ; Qun Liu ; Young-Sook Hwang

2 0.26175568 9 acl-2010-A Joint Rule Selection Model for Hierarchical Phrase-Based Translation

Author: Lei Cui ; Dongdong Zhang ; Mu Li ; Ming Zhou ; Tiejun Zhao

Abstract: In hierarchical phrase-based SMT systems, statistical models are integrated to guide the hierarchical rule selection for better translation performance. Previous work mainly focused on the selection of either the source side of a hierarchical rule or the target side of a hierarchical rule rather than considering both of them simultaneously. This paper presents a joint model to predict the selection of hierarchical rules. The proposed model is estimated based on four sub-models where the rich context knowledge from both source and target sides is leveraged. Our method can be easily incorporated into the practical SMT systems with the log-linear model framework. The experimental results show that our method can yield significant improvements in performance.

3 0.23914061 69 acl-2010-Constituency to Dependency Translation with Forests

Author: Haitao Mi ; Qun Liu

Abstract: Tree-to-string systems (and their forestbased extensions) have gained steady popularity thanks to their simplicity and efficiency, but there is a major limitation: they are unable to guarantee the grammaticality of the output, which is explicitly modeled in string-to-tree systems via targetside syntax. We thus propose to combine the advantages of both, and present a novel constituency-to-dependency translation model, which uses constituency forests on the source side to direct the translation, and dependency trees on the target side (as a language model) to ensure grammaticality. Medium-scale experiments show an absolute and statistically significant improvement of +0.7 BLEU points over a state-of-the-art forest-based tree-to-string system even with fewer rules. This is also the first time that a treeto-tree model can surpass tree-to-string counterparts.

4 0.20738277 51 acl-2010-Bilingual Sense Similarity for Statistical Machine Translation

Author: Boxing Chen ; George Foster ; Roland Kuhn

Abstract: This paper proposes new algorithms to compute the sense similarity between two units (words, phrases, rules, etc.) from parallel corpora. The sense similarity scores are computed by using the vector space model. We then apply the algorithms to statistical machine translation by computing the sense similarity between the source and target side of translation rule pairs. Similarity scores are used as additional features of the translation model to improve translation performance. Significant improvements are obtained over a state-of-the-art hierarchical phrase-based machine translation system. 1

5 0.17982808 54 acl-2010-Boosting-Based System Combination for Machine Translation

Author: Tong Xiao ; Jingbo Zhu ; Muhua Zhu ; Huizhen Wang

Abstract: In this paper, we present a simple and effective method to address the issue of how to generate diversified translation systems from a single Statistical Machine Translation (SMT) engine for system combination. Our method is based on the framework of boosting. First, a sequence of weak translation systems is generated from a baseline system in an iterative manner. Then, a strong translation system is built from the ensemble of these weak translation systems. To adapt boosting to SMT system combination, several key components of the original boosting algorithms are redesigned in this work. We evaluate our method on Chinese-to-English Machine Translation (MT) tasks in three baseline systems, including a phrase-based system, a hierarchical phrasebased system and a syntax-based system. The experimental results on three NIST evaluation test sets show that our method leads to significant improvements in translation accuracy over the baseline systems. 1

6 0.17794001 83 acl-2010-Dependency Parsing and Projection Based on Word-Pair Classification

7 0.17420188 169 acl-2010-Learning to Translate with Source and Target Syntax

8 0.15260126 240 acl-2010-Training Phrase Translation Models with Leaving-One-Out

9 0.1505944 118 acl-2010-Fine-Grained Tree-to-String Translation Rule Extraction

10 0.14292061 52 acl-2010-Bitext Dependency Parsing with Bilingual Subtree Constraints

11 0.14253423 249 acl-2010-Unsupervised Search for the Optimal Segmentation for Statistical Machine Translation

12 0.14065842 84 acl-2010-Detecting Errors in Automatically-Parsed Dependency Relations

13 0.13107865 115 acl-2010-Filtering Syntactic Constraints for Statistical Machine Translation

14 0.12914643 243 acl-2010-Tree-Based and Forest-Based Translation

15 0.12766424 147 acl-2010-Improving Statistical Machine Translation with Monolingual Collocation

16 0.12096195 110 acl-2010-Exploring Syntactic Structural Features for Sub-Tree Alignment Using Bilingual Tree Kernels

17 0.1166152 20 acl-2010-A Transition-Based Parser for 2-Planar Dependency Structures

18 0.11535604 265 acl-2010-cdec: A Decoder, Alignment, and Learning Framework for Finite-State and Context-Free Translation Models

19 0.11270294 102 acl-2010-Error Detection for Statistical Machine Translation Using Linguistic Features

20 0.11075241 133 acl-2010-Hierarchical Search for Word Alignment

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.287), (1, -0.262), (2, -0.006), (3, 0.039), (4, -0.032), (5, -0.015), (6, 0.065), (7, -0.003), (8, -0.144), (9, 0.145), (10, 0.06), (11, 0.076), (12, 0.085), (13, 0.059), (14, 0.197), (15, -0.027), (16, -0.045), (17, 0.019), (18, -0.081), (19, -0.04), (20, -0.007), (21, -0.038), (22, 0.035), (23, 0.014), (24, -0.03), (25, 0.028), (26, 0.02), (27, -0.055), (28, 0.1), (29, -0.083), (30, -0.033), (31, 0.016), (32, 0.037), (33, -0.07), (34, 0.072), (35, -0.004), (36, 0.063), (37, 0.058), (38, -0.052), (39, -0.006), (40, -0.067), (41, 0.019), (42, -0.079), (43, 0.007), (44, 0.102), (45, 0.01), (46, 0.023), (47, -0.132), (48, -0.082), (49, -0.009)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97431487 48 acl-2010-Better Filtration and Augmentation for Hierarchical Phrase-Based Translation Rules

Author: Zhiyang Wang ; Yajuan Lv ; Qun Liu ; Young-Sook Hwang

2 0.87372833 9 acl-2010-A Joint Rule Selection Model for Hierarchical Phrase-Based Translation

Author: Lei Cui ; Dongdong Zhang ; Mu Li ; Ming Zhou ; Tiejun Zhao

3 0.82142586 69 acl-2010-Constituency to Dependency Translation with Forests

Author: Haitao Mi ; Qun Liu

4 0.75132173 84 acl-2010-Detecting Errors in Automatically-Parsed Dependency Relations

Author: Markus Dickinson

Abstract: We outline different methods to detect errors in automatically-parsed dependency corpora, by comparing so-called dependency rules to their representation in the training data and flagging anomalous ones. By comparing each new rule to every relevant rule from training, we can identify parts of parse trees which are likely erroneous. Even the relatively simple methods of comparison we propose show promise for speeding up the annotation process. 1 Introduction and Motivation Given the need for high-quality dependency parses in applications such as statistical machine translation (Xu et al., 2009), natural language generation (Wan et al., 2009), and text summarization evaluation (Owczarzak, 2009), there is a corresponding need for high-quality dependency annotation, for the training and evaluation of dependency parsers (Buchholz and Marsi, 2006). Furthermore, parsing accuracy degrades unless sufficient amounts of labeled training data from the same domain are available (e.g., Gildea, 2001 ; Sekine, 1997), and thus we need larger and more varied annotated treebanks, covering a wide range of domains. However, there is a bottleneck in obtaining annotation, due to the need for manual intervention in annotating a treebank. One approach is to develop automatically-parsed corpora (van Noord and Bouma, 2009), but a natural disadvantage with such data is that it contains parsing errors. Identifying the most problematic parses for human post-processing could combine the benefits of automatic and manual annotation, by allowing a human annotator to efficiently correct automatic errors. We thus set out in this paper to detect errors in automatically-parsed data. If annotated corpora are to grow in scale and retain a high quality, annotation errors which arise from automatic processing must be minimized, as errors have a negative impact on training and eval- uation of NLP technology (see discussion and references in Boyd et al., 2008, sec. 1). There is work on detecting errors in dependency corpus annotation (Boyd et al., 2008), but this is based on finding inconsistencies in annotation for identical recurring strings. This emphasis on identical strings can result in high precision, but many strings do not recur, negatively impacting the recall of error detection. Furthermore, since the same strings often receive the same automatic parse, the types of inconsistencies detected are likely to have resulted from manual annotation. While we can build from the insight that simple methods can provide reliable annotation checks, we need an approach which relies on more general properties of the dependency structures, in order to develop techniques which work for automatically-parsed corpora. Developing techniques to detect errors in parses in a way which is independent of corpus and parser has fairly broad implications. By using only the information available in a training corpus, the methods we explore are applicable to annotation error detection for either hand-annotated or automatically-parsed corpora and can also provide insights for parse reranking (e.g., Hall and Nov a´k, 2005) or parse revision (Attardi and Ciaramita, 2007). Although we focus only on detecting errors in automatically-parsed data, similar techniques have been applied for hand-annotated data (Dickinson, 2008; Dickinson and Foster, 2009). Our general approach is based on extracting a grammar from an annotated corpus and comparing dependency rules in a new (automaticallyannotated) corpus to the grammar. Roughly speaking, if a dependency rule—which represents all the dependents of a head together (see section 3. 1)— does not fit well with the grammar, it is flagged as potentially erroneous. The methods do not have to be retrained for a given parser’s output (e.g., 729 Proce dinUgsp osfa tlhae, 4S8wthed Aen n,u 1a1l-1 M6e Jeutilnyg 2 o0f1 t0h.e ?c As2s0o1c0ia Atisosnoc foiart Cionom fopru Ctaotmiopnuatla Lti on gaulis Lti cnsg,u piasgtiecs 729–738, Campbell and Johnson, 2002), but work by comparing any tree to what is in the training grammar (cf. also approaches stacking hand-written rules on top of other parsers (Bick, 2007)). We propose to flag erroneous parse rules, using information which reflects different grammatical properties: POS lookup, bigram information, and full rule comparisons. We build on a method to detect so-called ad hoc rules, as described in section 2, and then turn to the main approaches in section 3. After a discussion of a simple way to flag POS anomalies in section 4, we evaluate the different methods in section 5, using the outputs from two different parsers. The methodology proposed in this paper is easy to implement and independent of corpus, language, or parser. 2 Approach We take as a starting point two methods for detecting ad hoc rules in constituency annotation (Dickinson, 2008). Ad hoc rules are CFG productions extracted from a treebank which are “used for specific constructions and unlikely to be used again,” indicating annotation errors and rules for ungrammaticalities (see also Dickinson and Foster, 2009). Each method compares a given CFG rule to all the rules in a treebank grammar. Based on the number of similar rules, a score is assigned, and rules with the lowest scores are flagged as potentially ad hoc. This procedure is applicable whether the rules in question are from a new data set—as in this paper, where parses are compared to a training data grammar—or drawn from the treebank grammar itself (i.e., an internal consistency check). The two methods differ in how the comparisons are done. First, the bigram method abstracts a rule to its bigrams. Thus, a rule such as NP → rJJu NeN to provides support fso,r aN rPu → uDcTh aJJs J NJ NN, iJnJ NthNat pitr vshidareess tuhpep oJrJt NfoNr sequence. By contrast, in the other method, which we call the whole rule method,1 a rule is compared in its totality to the grammar rules, using Levenshtein distance. There is no abstraction, meaning all elements are present—e.g., NP → DT JJ JJ NN is very similar to eNsePn → eD.gT. ,J NJ PN N→ b DeTcau JsJe J Jth Ne sequences mdiiflfearr by only one category. While previously used for constituencies, what is at issue is simply the valency of a rule, where by valency we refer to a head and its entire set 1This is referred to whole daughters in Dickinson (2008), but the meaning of “daughters” is less clear for dependencies. of arguments and adjuncts (cf. Przepi´ orkowski, 2006)—that is, a head and all its dependents. The methods work because we expect there to be regularities in valency structure in a treebank grammar; non-conformity to such regularities indicates a potential problem. 3 Ad hoc rule detection 3.1 An appropriate representation To capture valency, consider the dependency tree from the Talbanken05 corpus (Nilsson and Hall, 2005) in figure 1, for the Swedish sentence in (1), which has four dependency pairs.2 (1) Det g a˚r bara inte ihop . it goes just not together ‘It just doesn’t add up.’ SS MA NA PL Det g a˚r bara inte ihop PO VV AB AB AB Figure 1: Dependency graph example On a par with constituency rules, we define a grammar rule as a dependency relation rewriting as a head with its sequence of POS/dependent pairs (cf. Kuhlmann and Satta, 2009), as in figure 2. This representation supports the detection of idiosyncracies in valency.3 1. 12.. 23.. 34.. TOP → root ROOT:VV TROOPOT → → SoSt R:POOO VT:VV MVA:AB NA:AB PL:AB RSSO → P →O :5A. BN AN → ABB P SMSA → → AOB 56.. NPLA → A ABB Figure 2: Rule representation for (1) For example, for the ROOT category, the head is a verb (VV), and it has 4 dependents. The extent to which this rule is odd depends upon whether comparable rules—i.e., other ROOT rules or other VV rules (see section 3.2)—have a similar set of dependents. While many of the other rules seem rather spare, they provide useful information, showing categories which have no dependents. With a TOP rule, we have a rule for every 2Category definitions are in appendix A. 3Valency is difficult to define for coordination and is specific to an annotation scheme. We leave this for the future. 730 head, including the virtual root. Thus, we can find anomalous rules such as TOP → root ROOT:AV ROOT:NN, wulheesre su multiple categories hROavOe T b:AeeVn parsed as ROOT. 3.2 Making appropriate comparisons In comparing rules, we are trying to find evidence that a particular (parsed) rule is valid by examining the evidence from the (training) grammar. Units of comparison To determine similarity, one can compare dependency relations, POS tags, or both. Valency refers to both properties, e.g., verbs which allow verbal (POS) subjects (dependency). Thus, we use the pairs of dependency relations and POS tags as the units of comparison. Flagging individual elements Previous work scored only entire rules, but some dependencies are problematic and others are not. Thus, our methods score individual elements of a rule. Comparable rules We do not want to compare a rule to all grammar rules, only to those which should have the same valents. Comparability could be defined in terms of a rule’s dependency relation (LHS) or in terms of its head. Consider the four different object (OO) rules in (2). These vary a great deal, and much of the variability comes from the fact that they are headed by different POS categories, which tend to have different selectional properties. The head POS thus seems to be predictive of a rule’s valency. (2) a. OO → PO b. OO → DT:EN AT:AJ NN ET:VV c. OO → SS:PO QV VG:VV d. OO → DT:PO AT:AJ VN But we might lose information by ignoring rules with the same left-hand side (LHS). Our approach is thus to take the greater value of scores when comparing to rules either with the same depen- dency relation or with the same head. A rule has multiple chances to prove its value, and low scores will only be for rules without any type of support. Taking these points together, for a given rule of interest r, we assign a score (S) to each element ei in r, where r = e1...em by taking the maximum of scores for rules with the same head (h) or same LHS (lhs), as in (3). For the first element in (2b), for example, S(DT:EN) = max{s(DT:EN, NN), s(DT:EN, OO)}. TTh:eE question ixs now Tho:EwN we dNe)-, fsin(De s(ei, c) fOor)} t.he T comparable sele nmowen hto c. (3) S(ei) = max{s(ei, h) , s(ei, lhs)} 3.3 Whole rule anomalies 3.3.1 Motivation The whole rule method compares a list of a rule’s dependents to rules in a database, and then flags rule elements without much support. By using all dependents as a basis for comparison, this method detects improper dependencies (e.g., an adverb modifying a noun), dependencies in the wrong overall location of a rule (e.g., an adverb before an object), and rules with unnecessarily long ar- gument structures. For example, in (4), we have an improper relation between skall (‘shall’) and sambeskattas (‘be taxed together’), as in figure 3. It is parsed as an adverb (AA), whereas it should be a verb group (VG). The rule for this part of the tree is +F → ++:++ SV AA:VV, and the AA:VV position wF i→ll b +e low-scoring b:VecVau,s aen dth teh ++:++ VSVV context does not support it. (4) Makars o¨vriga inkomster a¨r B-inkomster spouses’ other incomes are B-incomes och skall som tidigare sambeskattas . and shall as previously be taxed togeher . ‘The other incomes of spouses are B-incomes and shall, as previously, be taxed together.’ ++ +F UK KA och skall som tidigare ++ SV UK AJ VG sambeskattas VV ++ +F UK SS och skall som tidigare ++ SV UK AJ AA sambeskattas VV Figure 3: Wrong label (top=gold, bottom=parsed) 3.3.2 Implementation The method we use to determine similarity arises from considering what a rule is like without a problematic element. Consider +F → ++:++ SV pArAob:VleVm afrtiocm e figure 3, Cwohnesried eArA + Fsh →ould + +b:e+ a d SifVferent category (VG). The rule without this error, +F → ++:++ SV, starts several rules in the 731 training data, including some with VG:VV as the next item. The subrule ++:++ SV seems to be reliable, whereas the subrules containing AA:VV (++:++ AA:VV and SV AA:VV) are less reliable. We thus determine reliability by seeing how often each subsequence occurs in the training rule set. Throughout this paper, we use the term subrule to refer to a rule subsequence which is exactly one element shorter than the rule it is a component of. We examine subrules, counting their frequency as subrules, not as complete rules. For example, TOP rules with more than one dependent are problematic, e.g., TOP → root ROOT:AV ROOT:NN. Correspondingly, Pth →ere r are no rOulTe:sA wVith R OthOrTee: NeNle-. ments containing the subrule root ROOT:AV. We formalize this by setting the score s(ei, c) equal to the summation of the frequencies of all comparable subrules containing ei from the training data, as in (5), where B is the set of subrules of r with length one less. (5) s(ei, c) = Psub∈B:ei∈sub C(sub, c) For example, Pwith c = +F, the frequency of +F → ++:++ SV as a subrule is added to the scores f→or ++:++ aVnd a sS aV. s Ibnr tlheis i case, d+ tFo → ++:++ SfoVr VG:BV, +dF S → ++:++ S cVas VG:AV, a +nd+ ++F+ → ++:++ VSV, +VFG →:VV + a:l+l +ad SdV support Vfo,r a n+dF → ++:++ +SV+ being a legitimate dsdub sruuplep.o Thus, ++:++ and SV are less likely to be the sources of any problems. Since +F → SV AA:VV and +F → ++:++ mAsA.:V SVin hcaev +e very l SittVle support i ann tdhe + trFai →ning data, AA:VV receives a low score. Note that the subrule count C(sub, c) is different than counting the number of rules containing a subrule, as can be seen with identical elements. For example, for SS → VN ET:PR ET:PR, C(VN ET:PR, SS) = 2, SinS keeping wE Tith:P tRhe E fTac:Pt Rth,a Ct t(hVerNe are 2 pieces of evidence for its legitimacy. 3.4 Bigram anomalies 3.4.1 Motivation The bigram method examines relationships between adjacent sisters, complementing the whole rule method by focusing on local properties. For (6), for example, we find the gold and parsed trees in figure 4. For the long parsed rule TA → PR HinD f:igIDur HeD 4.:ID F IoRr t:hIeR lAonNg:R pOar JR:IR, ea lTl Aele →men PtRs get low whole rule scores, i.e., are flagged as potentially erroneous. But only the final elements have anomalous bigrams: HD:ID IR:IR, IR:IR AN:RO, and AN:RO JR:IR all never occur. (6) N a¨r det g ¨aller inkomst a˚ret 1971 ( when it concerns the income year 1971 ( taxerings a˚ret 1972 ) skall barnet ... assessment year 1972 ) shall the child . . . ‘Concerning the income year of 1971 (assessment year 1972), the child . . . ’ 3.4.2 Implementation To obtain a bigram score for an element, we simply add together the bigrams which contain the element in question, as in (7). (7) s(ei, c) = C(ei−1ei, c) + C(eiei+1 , c) Consider the rule from figure 4. With c = TA, the bigram HD:ID IR:IR never occurs, so both HD:ID and IR:IR get 0 added to their score. HD:ID HD:ID, however, is a frequent bigram, so it adds weight to HD:ID, i.e., positive evidence comes from the bigram on the left. If we look at IR:IR, on the other hand, IR:IR AN:RO occurs 0 times, and so IR:IR gets a total score of 0. Both scoring methods treat each element independently. Every single element could be given a low score, even though once one is corrected, another would have a higher score. Future work can examine factoring in all elements at once. 4 Additional information The methods presented so far have limited definitions of comparability. As using complementary information has been useful in, e.g., POS error detection (Loftsson, 2009), we explore other simple comparable properties of a dependency grammar. Namely, we include: a) frequency information of an overall dependency rule and b) information on how likely each dependent is to be in a relation with its head, described next. 4.1 Including POS information Consider PA → SS:NN XX:XX HV OO:VN, as iCl ounsstirdaeterd P iAn figure :5N foNr XthXe :sXeXnte HncVe OinO (8). NT,h aiss rule is entirely correct, yet the XX:XX position has low whole rule and bigram scores. (8) Uppgift om vilka orter som information of which neighborhood who har utk o¨rning finner Ni has delivery find ocks a˚ i . . . you also in . . . ‘You can also find information about which neighborhoods have delivery services in . . . ’ 732 AA HD HD DT PA IR DT AN JR ... N a¨r det g ¨aller inkomst a˚ret 1971 ( taxerings a˚ret 1972 ) ... PR ID ID RO IR NN NN RO TAHDHDPAETIRDTANJR. N a¨r det g ¨aller PR ID inkomst a˚ret ID NN 1971 ( RO IR taxerings a˚ret NN 1972 RO IR ... ) ... IR ... Figure 4: A rule with extra dependents (top=gold, bottom=parsed) ET Uppgift NN DT om vilka PR PO SS orter NN XX PA som har XX OO utk o¨rning HV VN Figure 5: Overflagging (gold=parsed) One method which does not have this problem of overflagging uses a “lexicon” of POS tag pairs, examining relations between POS, irrespective of position. We extract POS pairs, note their dependency relation, and add a L/R to the label to indicate which is the head (Boyd et al., 2008). Additionally, we note how often two POS categories occur as a non-depenency, using the label NIL, to help determine whether there should be any attachment. We generate NILs by enumerating all POS pairs in a sentence. For example, from figure 5, the parsed POS pairs include NN PR → ETL, eN 5N, t hPeO p → NIL, eStc. p We convert the frequencies to probabilities. For example, of 4 total occurrences of XX HV in the training data, 2 are XX-R (cf. figure 5). A probability of 0.5 is quite high, given that NILs are often the most frequent label for POS pairs. 5 Evaluation In evaluating the methods, our main question is: how accurate are the dependencies, in terms of both attachment and labeling? We therefore currently examine the scores for elements functioning as dependents in a rule. In figure 5, for example, for har (‘has’), we look at its score within ET → PfoRr hPAar:H (‘Vha asn’)d, not wloohken a itt iftusn scctoiornes w as a head, as in PA → SS:NN XX:XX HV OO:VN. Relatedly, for each method, we are interested in whether elements with scores below a threshold have worse attachment accuracy than scores above, as we predict they do. We can measure this by scoring each testing data position below the threshold as a 1 if it has the correct head and dependency relation and a 0 otherwise. These are simply labeled attachment scores (LAS). Scoring separately for positions above and below a threshold views the task as one of sorting parser output into two bins, those more or less likely to be correctly parsed. For development, we also report unlabeled attachement scores (UAS). Since the goal is to speed up the post-editing of corpus data by flagging erroneous rules, we also report the precision and recall for error detection. We count either attachment or labeling errors as an error, and precision and recall are measured with respect to how many errors are found below the threshold. For development, we use two Fscores to provide a measure of the settings to ex- amine across language, corpus, and parser conditions: the balanced F1 measure and the F0.5 measure, weighing precision twice as much. Precision is likely more important in this context, so as to prevent annotators from sorting through too many false positives. In practice, one way to use these methods is to start with the lowest thresholds and work upwards until there are too many non-errors. To establish a basis for comparison, we compare 733 method performance to a parser on its own.4 By examining the parser output without any automatic assistance, how often does a correction need to be made? 5.1 The data All our data comes from the CoNLL-X Shared Task (Buchholz and Marsi, 2006), specifically the 4 data sets freely available online. We use the Swedish Talbanken data (Nilsson and Hall, 2005) and the transition-based dependency parser MaltParser (Nivre et al., 2007), with the default set- tings, for developing the method. To test across languages and corpora, we use MaltParser on the other 3 corpora: the Danish DDT (Kromann, 2003), Dutch Alpino (van der Beek et al., 2002), and Portuguese Bosque data (Afonso et al., 2002). Then, we present results using the graph-based parser MSTParser (McDonald and Pereira, 2006), again with default settings, to test the methods across parsers. We use the gold standard POS tags for all experiments. 5.2 Development data In the first line of table 1, we report the baseline MaltParser accuracies on the Swedish test data, including baseline error detection precision (=1LASb), recall, and (the best) F-scores. In the rest of table 1, we report the best-performing results for each of the methods,5 providing the number of rules below and above a particular threshold, along with corresponding UAS and LAS values. To get the raw number of identified rules, multiply the number of corpus position below a threshold (b) times the error detection precision (P). For ex- × ample, the bigram method with a threshold of 39 leads to finding 283 errors (455 .622). Dependency e 2le8m3e enrrtos rws (it4h5 frequency below the lowest threshold have lower attachment scores (66.6% vs. 90. 1% LAS), showing that simply using a complete rule helps sort dependencies. However, frequency thresholds have fairly low precision, i.e., 33.4% at their best. The whole rule and bigram methods reveal greater precision in identifying problematic dependencies, isolating elements with lower UAS and LAS scores than with frequency, along with corresponding greater pre4One may also use parser confidence or parser revision methods as a basis of comparison, but we are aware of no systematic evaluation of these approaches for detecting errors. 5Freq=rule frequency, WR=whole rule, Bi=bigram, POS=POS-based (POS scores multiplied by 10,000) cision and F-scores. The bigram method is more fine-grained, identifying small numbers of rule elements at each threshold, resulting in high error detection precision. With a threshold of 39, for example, we find over a quarter of the parser errors with 62% precision, from this one piece of information. For POS information, we flag 23.6% of the cases with over 60% precision (at 81.6). Taking all these results together, we can begin to sort more reliable from less reliable dependency tree elements, using very simple information. Additionally, these methods naturally group cases together by linguistic properties (e.g., adverbialverb dependencies within a particualr context), allowing a human to uncover the principle behind parse failure and ajudicate similar cases at the same time (cf. Wallis, 2003). 5.3 Discussion Examining some of the output from the Talbanken test data by hand, we find that a prominent cause of false positives, i.e., correctly-parsed cases with low scores, stems from low-frequency dependency-POS label pairs. If the dependency rarely occurs in the training data with the particular POS, then it receives a low score, regardless of its context. For example, the parsed rule TA → IG:IG RO has a correct dependency relation (IG) G be:tIwGee RnO Oth hea aPsO aS c tags IcGt d daenpde nitsd e hnecayd RO, yet is assigned a whole rule score of 2 and a bigram score of 20. It turns out that IG:IG only occurs 144 times in the training data, and in 11 of those cases (7.6%) it appears immediately before RO. One might consider normalizing the scores based on overall frequency or adjusting the scores to account for other dependency rules in the sentence: in this case, there may be no better attachment. Other false positives are correctly-parsed elements that are a part of erroneous rules. For instance, in AA → UK:UK SS:PO TA:AJ AV SP:AJ sOtaAn:PceR, +nF A:HAV → +F:HV, Kth SeS fi:rPsOt + TFA:H:AVJ AisV correct, yet given a low score (0 whole rule, 1 bigram). The following and erroneous +F:HV is similarly given a low score. As above, such cases might be handled by looking for attachments in other rules (cf. Attardi and Ciaramita, 2007), but these cases should be relatively unproblematic for handcorrection, given the neighboring error. We also examined false negatives, i.e., errors with high scores. There are many examples of PR PA:NN rules, for instance, with the NN improp734 erly attached, but there are also many correct instances of PR PA:NN. To sort out the errors, one needs to look at lexical knowledge and/or other dependencies in the tree. With so little context, frequent rules with only one dependent are not prime candidates for our methods of error detection. 5.4 Other corpora We now turn to the parsed data from three other corpora. The Alpino and Bosque corpora are approximately the same size as Talbanken, so we use the same thresholds for them. The DDT data is approximately half the size; to adjust, we simply halve the scores. In tables 2, 3, and 4, we present the results, using the best F0.5 and F1 settings from development. At a glance, we observe that the best method differs for each corpus and depending on an emphasis of precision or recall, with the bigram method generally having high precision. For Alpino, error detection is better with frequency than, for example, bigram scores. This is likely due to the fact that Alpino has the smallest label set of any of the corpora, with only 24 dependency labels and 12 POS tags (cf. 64 and 41 in Talbanken, respectively). With a smaller label set, there are less possible bigrams that could be anomalous, but more reliable statistics about a whole rule. Likewise, with fewer possible POS tag pairs, Alpino has lower precision for the lowthreshold POS scores than the other corpora. For the whole rule scores, the DDT data is worse (compare its 46. 1% precision with Bosque’s 45.6%, with vastly different recall values), which could be due to the smaller training data. One might also consider the qualitative differences in the dependency inventory of DDT compared to the others—e.g., appositions, distinctions in names, and more types of modifiers. 5.5 MSTParser Turning to the results of running the methods on the output of MSTParser, we find similar but slightly worse values for the whole rule and bigram methods, as shown in tables 5-8. What is 735 most striking are the differences in the POS-based method for Bosque and DDT (tables 7 and 8), where a large percentage of the test corpus is underneath the threshold. MSTParser is apparently positing fewer distinct head-dependent pairs, as most of them fall under the given thresholds. With the exception of the POS-based method for DDT (where LASb is actually higher than LASa) the different methods seem to be accurate enough to be used as part of corpus post-editing. 6 Summary and Outlook We have proposed different methods for flagging the errors in automatically-parsed corpora, by treating the problem as one of looking for anoma- lous rules with respect to a treebank grammar. The different methods incorporate differing types and amounts of information, notably comparisons among dependency rules and bigrams within such rules. Using these methods, we demonstrated success in sorting well-formed output from erroneous output across language, corpora, and parsers. Given that the rule representations and comparison methods use both POS and dependency information, a next step in evaluating and improving the methods is to examine automatically POStagged data. Our methods should be able to find POS errors in addition to dependency errors. Furthermore, although we have indicated that differences in accuracy can be linked to differences in the granularity and particular distinctions of the annotation scheme, it is still an open question as to which methods work best for which schemes and for which constructions (e.g., coordination). Acknowledgments Thanks to Sandra K ¨ubler and Amber Smith for comments on an earlier draft; Yvonne Samuelsson for help with the Swedish translations; the IU Computational Linguistics discussion group for feedback; and Julia Hockenmaier, Chris Brew, and Rebecca Hwa for discussion on the general topic. A Some Talbanken05 categories Dependencies 736 References Afonso, Susana, Eckhard Bick, Renato Haber and Diana Santos (2002). Floresta Sint a´(c)tica: a treebank for Portuguese. In Proceedings of LREC 2002. Las Palmas, pp. 1698–1703. Attardi, Giuseppe and Massimiliano Ciaramita (2007). Tree Revision Learning for Dependency Parsing. In Proceedings of NAACL-HLT-07. Rochester, NY, pp. 388–395. Bick, Eckhard (2007). Hybrid Ways to Improve Domain Independence in an ML Dependency Parser. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007. Prague, Czech Republic, pp. 1119–1 123. Boyd, Adriane, Markus Dickinson and Detmar Meurers (2008). On Detecting Errors in Dependency Treebanks. Research on Language and Computation 6(2), 113–137. Buchholz, Sabine and Erwin Marsi (2006). CoNLL-X Shared Task on Multilingual Dependency Parsing. In Proceedings of CoNLL-X. New York City, pp. 149–164. Campbell, David and Stephen Johnson (2002). A transformational-based learner for dependency grammars in discharge summaries. In Proceedings of the ACL-02 Workshop on Natural Language Processing in the Biomedical Domain. Phildadelphia, pp. 37–44. Dickinson, Markus (2008). Ad Hoc Treebank Structures. In Proceedings of ACL-08. Columbus, OH. Dickinson, Markus and Jennifer Foster (2009). Similarity Rules! Exploring Methods for AdHoc Rule Detection. In Proceedings of TLT-7. Groningen, The Netherlands. Gildea, Daniel (2001). Corpus Variation and Parser Performance. In Proceedings of EMNLP-01. Pittsburgh, PA. Hall, Keith and V ´aclav Nov a´k (2005). Corrective Modeling for Non-Projective Dependency Parsing. In Proceedings of IWPT-05. Vancouver, pp. 42–52. Kromann, Matthias Trautner (2003). The Danish Dependency Treebank and the underlying linguistic theory. In Proceedings of TLT-03. Kuhlmann, Marco and Giorgio Satta (2009). Treebank Grammar Techniques for Non-Projective Dependency Parsing. In Proceedings of EACL09. Athens, Greece, pp. 478–486. Loftsson, Hrafn (2009). Correcting a POS-Tagged Corpus Using Three Complementary Methods. In Proceedings of EACL-09. Athens, Greece, pp. 523–531. McDonald, Ryan and Fernando Pereira (2006). Online learning of approximate dependency parsing algorithms. In Proceedings of EACL06. Trento. Nilsson, Jens and Johan Hall (2005). Reconstruction of the Swedish Treebank Talbanken. MSI report 05067, V ¨axj¨ o University: School of Mathematics and Systems Engineering. Nivre, Joakim, Johan Hall, Jens Nilsson, Atanas Chanev, Gulsen Eryigit, Sandra K ¨ubler, Svetoslav Marinov and Erwin Marsi (2007). MaltParser: A language-independent system for data-driven dependency parsing. Natural Language Engineering 13(2), 95–135. Owczarzak, Karolina (2009). DEPEVAL(summ): Dependency-based Evaluation for Automatic Summaries. In Proceedings of ACL-AFNLP-09. Suntec, Singapore, pp. 190–198. Przepi´ orkowski, Adam (2006). What to acquire from corpora in automatic valence acquisition. In Violetta Koseska-Toszewa and Roman Roszko (eds.), Semantyka a konfrontacja jezykowa, tom 3, Warsaw: Slawistyczny O ´srodek Wydawniczy PAN, pp. 25–41. Sekine, Satoshi (1997). The Domain Dependence of Parsing. In Proceedings of ANLP-96. Washington, DC. van der Beek, Leonoor, Gosse Bouma, Robert Malouf and Gertjan van Noord (2002). The Alpino Dependency Treebank. In Proceedings of CLIN 2001. Rodopi. van Noord, Gertjan and Gosse Bouma (2009). Parsed Corpora for Linguistics. In Proceedings of the EACL 2009 Workshop on the Interaction between Linguistics and Computational Linguistics: Virtuous, Vicious or Vacuous?. Athens, pp. 33–39. Wallis, Sean (2003). Completing Parsed Corpora. In Anne Abeill´ e (ed.), Treebanks: Building and using syntactically annoted corpora, Dordrecht: Kluwer Academic Publishers, pp. 61–71. Wan, Stephen, Mark Dras, Robert Dale and C ´ecile Paris (2009). Improving Grammaticality in Sta737 tistical Sentence Generation: Introducing a Dependency Spanning Tree Algorithm with an Argument Satisfaction Model. In Proceedings of EACL-09. Athens, Greece, pp. 852–860. Xu, Peng, Jaeho Kang, Michael Ringgaard and Franz Och (2009). Using a Dependency Parser to Improve SMT for Subject-Object-Verb Languages. In Proceedings of NAACL-HLT-09. Boulder, Colorado, pp. 245–253. 738

5 0.70007646 118 acl-2010-Fine-Grained Tree-to-String Translation Rule Extraction

Author: Xianchao Wu ; Takuya Matsuzaki ; Jun'ichi Tsujii

Abstract: Tree-to-string translation rules are widely used in linguistically syntax-based statistical machine translation systems. In this paper, we propose to use deep syntactic information for obtaining fine-grained translation rules. A head-driven phrase structure grammar (HPSG) parser is used to obtain the deep syntactic information, which includes a fine-grained description of the syntactic property and a semantic representation of a sentence. We extract fine-grained rules from aligned HPSG tree/forest-string pairs and use them in our tree-to-string and string-to-tree systems. Extensive experiments on largescale bidirectional Japanese-English trans- lations testified the effectiveness of our approach.

6 0.6767059 51 acl-2010-Bilingual Sense Similarity for Statistical Machine Translation

7 0.66709107 54 acl-2010-Boosting-Based System Combination for Machine Translation

8 0.65781474 169 acl-2010-Learning to Translate with Source and Target Syntax

9 0.62971902 52 acl-2010-Bitext Dependency Parsing with Bilingual Subtree Constraints

10 0.62061381 83 acl-2010-Dependency Parsing and Projection Based on Word-Pair Classification

11 0.60437882 265 acl-2010-cdec: A Decoder, Alignment, and Learning Framework for Finite-State and Context-Free Translation Models

12 0.60071445 249 acl-2010-Unsupervised Search for the Optimal Segmentation for Statistical Machine Translation

13 0.5948332 243 acl-2010-Tree-Based and Forest-Based Translation

14 0.59082222 201 acl-2010-Pseudo-Word for Phrase-Based Machine Translation

15 0.56429082 12 acl-2010-A Probabilistic Generative Model for an Intermediate Constituency-Dependency Representation

16 0.56191337 119 acl-2010-Fixed Length Word Suffix for Factored Statistical Machine Translation

17 0.55991453 143 acl-2010-Importance of Linguistic Constraints in Statistical Dependency Parsing

18 0.53493249 45 acl-2010-Balancing User Effort and Translation Error in Interactive Machine Translation via Confidence Measures

19 0.52690339 135 acl-2010-Hindi-to-Urdu Machine Translation through Transliteration

20 0.52572876 102 acl-2010-Error Detection for Statistical Machine Translation Using Linguistic Features

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(14, 0.011), (16, 0.02), (20, 0.233), (25, 0.083), (44, 0.022), (59, 0.143), (73, 0.045), (78, 0.033), (83, 0.101), (84, 0.033), (98, 0.196)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.87206024 170 acl-2010-Letter-Phoneme Alignment: An Exploration

Author: Sittichai Jiampojamarn ; Grzegorz Kondrak

Abstract: Letter-phoneme alignment is usually generated by a straightforward application of the EM algorithm. We explore several alternative alignment methods that employ phonetics, integer programming, and sets of constraints, and propose a novel approach of refining the EM alignment by aggregation of best alignments. We perform both intrinsic and extrinsic evaluation of the assortment of methods. We show that our proposed EM-Aggregation algorithm leads to the improvement of the state of the art in letter-to-phoneme conversion on several different data sets.

same-paper 2 0.85974181 48 acl-2010-Better Filtration and Augmentation for Hierarchical Phrase-Based Translation Rules

Author: Zhiyang Wang ; Yajuan Lv ; Qun Liu ; Young-Sook Hwang

3 0.7612226 87 acl-2010-Discriminative Modeling of Extraction Sets for Machine Translation

Author: John DeNero ; Dan Klein

Abstract: We present a discriminative model that directly predicts which set ofphrasal translation rules should be extracted from a sentence pair. Our model scores extraction sets: nested collections of all the overlapping phrase pairs consistent with an underlying word alignment. Extraction set models provide two principle advantages over word-factored alignment models. First, we can incorporate features on phrase pairs, in addition to word links. Second, we can optimize for an extraction-based loss function that relates directly to the end task of generating translations. Our model gives improvements in alignment quality relative to state-of-the-art unsupervised and supervised baselines, as well as providing up to a 1.4 improvement in BLEU score in Chinese-to-English translation experiments.

4 0.75686622 169 acl-2010-Learning to Translate with Source and Target Syntax

Author: David Chiang

Abstract: Statistical translation models that try to capture the recursive structure of language have been widely adopted over the last few years. These models make use of varying amounts of information from linguistic theory: some use none at all, some use information about the grammar of the target language, some use information about the grammar of the source language. But progress has been slower on translation models that are able to learn the relationship between the grammars of both the source and target language. We discuss the reasons why this has been a challenge, review existing attempts to meet this challenge, and show how some old and new ideas can be combined into a sim- ple approach that uses both source and target syntax for significant improvements in translation accuracy.

5 0.75670815 261 acl-2010-Wikipedia as Sense Inventory to Improve Diversity in Web Search Results

Author: Celina Santamaria ; Julio Gonzalo ; Javier Artiles

Abstract: Is it possible to use sense inventories to improve Web search results diversity for one word queries? To answer this question, we focus on two broad-coverage lexical resources of a different nature: WordNet, as a de-facto standard used in Word Sense Disambiguation experiments; and Wikipedia, as a large coverage, updated encyclopaedic resource which may have a better coverage of relevant senses in Web pages. Our results indicate that (i) Wikipedia has a much better coverage of search results, (ii) the distribution of senses in search results can be estimated using the internal graph structure of the Wikipedia and the relative number of visits received by each sense in Wikipedia, and (iii) associating Web pages to Wikipedia senses with simple and efficient algorithms, we can produce modified rankings that cover 70% more Wikipedia senses than the original search engine rankings. 1 Motivation The application of Word Sense Disambiguation (WSD) to Information Retrieval (IR) has been subject of a significant research effort in the recent past. The essential idea is that, by indexing and matching word senses (or even meanings) , the retrieval process could better handle polysemy and synonymy problems (Sanderson, 2000). In practice, however, there are two main difficulties: (i) for long queries, IR models implicitly perform disambiguation, and thus there is little room for improvement. This is the case with most standard IR benchmarks, such as TREC (trec.nist.gov) or CLEF (www.clef-campaign.org) ad-hoc collections; (ii) for very short queries, disambiguation j ul io @ l i uned . e s j avart s . @bec . uned . e s may not be possible or even desirable. This is often the case with one word and even two word queries in Web search engines. In Web search, there are at least three ways of coping with ambiguity: • • • Promoting diversity in the search results (Clarke negt al., 2008): given th seea query s”uolatssis”, the search engine may try to include representatives for different senses of the word (such as the Oasis band, the Organization for the Advancement of Structured Information Standards, the online fashion store, etc.) among the top results. Search engines are supposed to handle diversity as one of the multiple factors that influence the ranking. Presenting the results as a set of (labelled) cPlruessteenrtsi nragth tehre eth reansu as a a rsan ake sde lti ostf (Carpineto et al., 2009). Complementing search results with search suggestions (e.g. e”oaracshis band”, ”woitahsis s fashion store”) that serve to refine the query in the intended way (Anick, 2003). All of them rely on the ability of the search engine to cluster search results, detecting topic similarities. In all of them, disambiguation is implicit, a side effect of the process but not its explicit target. Clustering may detect that documents about the Oasis band and the Oasis fashion store deal with unrelated topics, but it may as well detect a group of documents discussing why one of the Oasis band members is leaving the band, and another group of documents about Oasis band lyrics; both are different aspects of the broad topic Oasis band. A perfect hierarchical clustering should distinguish between the different Oasis senses at a first level, and then discover different topics within each of the senses. Is it possible to use sense inventories to improve search results for one word queries? To answer 1357 Proce dingUsp opfs thaela 4, 8Stwhe Adnen u,a 1l1- M16e Jtiunlgy o 2f0 t1h0e. A ?c s 2o0c1ia0ti Aosnso focria Ctio nm fpourta Ctoiomnpault Laitniognuaislt Licisn,g puaigsetisc 1s357–136 , this question, we will focus on two broad-coverage lexical resources of a different nature: WordNet (Miller et al., 1990), as a de-facto standard used in Word Sense Disambiguation experiments and many other Natural Language Processing research fields; and Wikipedia (www.wikipedia.org), as a large coverage and updated encyclopedic resource which may have a better coverage of relevant senses in Web pages. Our hypothesis is that, under appropriate conditions, any of the above mechanisms (clustering, search suggestions, diversity) might benefit from an explicit disambiguation (classification of pages in the top search results) using a wide-coverage sense inventory. Our research is focused on four relevant aspects of the problem: 1. Coverage: Are Wikipedia/Wordnet senses representative of search results? Otherwise, trying to make a disambiguation in terms of a fixed sense inventory would be meaningless. 2. If the answer to (1) is positive, the reverse question is also interesting: can we estimate search results diversity using our sense inven- tories? 3. Sense frequencies: knowing sense frequencies in (search results) Web pages is crucial to have a usable sense inventory. Is it possible to estimate Web sense frequencies from currently available information? 4. Classification: The association of Web pages to word senses must be done with some unsupervised algorithm, because it is not possible to hand-tag training material for every possible query word. Can this classification be done accurately? Can it be effective to promote diversity in search results? In order to provide an initial answer to these questions, we have built a corpus consisting of 40 nouns and 100 Google search results per noun, manually annotated with the most appropriate Wordnet and Wikipedia senses. Section 2 describes how this corpus has been created, and in Section 3 we discuss WordNet and Wikipedia coverage of search results according to our testbed. As this initial results clearly discard Wordnet as a sense inventory for the task, the rest of the paper mainly focuses on Wikipedia. In Section 4 we estimate search results diversity from our testbed, finding that the use of Wikipedia could substantially improve diversity in the top results. In Section 5 we use the Wikipedia internal link structure and the number of visits per page to estimate relative frequencies for Wikipedia senses, obtaining an estimation which is highly correlated with actual data in our testbed. Finally, in Section 6 we discuss a few strategies to classify Web pages into word senses, and apply the best classifier to enhance diversity in search results. The paper concludes with a discussion of related work (Section 7) and an overall discussion of our results in Section 8. 2 Test Set 2.1 Set of Words The most crucial step in building our test set is choosing the set of words to be considered. We are looking for words which are susceptible to form a one-word query for a Web search engine, and therefore we should focus on nouns which are used to denote one or more named entities. At the same time we want to have some degree of comparability with previous research on Word Sense Disambiguation, which points to noun sets used in Senseval/SemEval evaluation campaigns1 . Our budget for corpus annotation was enough for two persons-month, which limited us to handle 40 nouns (usually enough to establish statistically significant differences between WSD algorithms, although obviously limited to reach solid figures about the general behaviour of words in the Web). With these arguments in mind, we decided to choose: (i) 15 nouns from the Senseval-3 lexical sample dataset, which have been previously employed by (Mihalcea, 2007) in a related experiment (see Section 7); (ii) 25 additional words which satisfy two conditions: they are all ambiguous, and they are all names for music bands in one of their senses (not necessarily the most salient). The Senseval set is: {argument, arm, atmosphere, bank, degree, difference, disc, irmm-, age, paper, party, performance, plan, shelter, sort, source}. The bands set is {amazon, apple, camel, cell, columbia, cream, foreigner, fox, genesis, jaguar, oasis, pioneer, police, puma, rainbow, shell, skin, sun, tesla, thunder, total, traffic, trapeze, triumph, yes}. Fpoerz e,a trchiu noun, we looked up all its possible senses in WordNet 3.0 and in Wikipedia (using 1http://senseval.org 1358 Table 1: Coverage of Search Results: Wikipedia vs. WordNet Wikiped#ia documents # senses WordNe#t documents Senseval setava2il4a2b/1le0/u0sedassign8e7d7 to (5 s9o%me) senseavai9la2b/5le2/usedassigne6d96 to (4 s6o%m)e sense # senses BaTnodtsa lset868420//21774421323558 ((5546%%))17780/3/9911529995 (2 (342%%)) Wikipedia disambiguation pages). Wikipedia has an average of 22 senses per noun (25.2 in the Bands set and 16. 1in the Senseval set), and Wordnet a much smaller figure, 4.5 (3. 12 for the Bands set and 6.13 for the Senseval set). For a conventional dictionary, a higher ambiguity might indicate an excess of granularity; for an encyclopaedic resource such as Wikipedia, however, it is just an indication of larger coverage. Wikipedia en- tries for camel which are not in WordNet, for instance, include the Apache Camel routing and mediation engine, the British rock band, the brand of cigarettes, the river in Cornwall, and the World World War I fighter biplane. 2.2 Set of Documents We retrieved the 150 first ranked documents for each noun, by submitting the nouns as queries to a Web search engine (Google). Then, for each document, we stored both the snippet (small description of the contents of retrieved document) and the whole HTML document. This collection of documents contain an implicit new inventory of senses, based on Web search, as documents retrieved by a noun query are associated with some sense of the noun. Given that every document in the top Web search results is supposed to be highly relevant for the query word, we assume a ”one sense per document” scenario, although we allow annotators to assign more than one sense per document. In general this assumption turned out to be correct except in a few exceptional cases (such as Wikipedia disambiguation pages): only nine docu- ments received more than one WordNet sense, and 44 (1. 1% of all annotated pages) received more than one Wikipedia sense. 2.3 Manual Annotation We implemented an annotation interface which stored all documents and a short description for every Wordnet and Wikipedia sense. The annotators had to decide, for every document, whether there was one or more appropriate senses in each of the dictionaries. They were instructed to provide annotations for 100 documents per name; if an URL in the list was corrupt or not available, it had to be discarded. We provided 150 documents per name to ensure that the figure of 100 usable documents per name could be reached without problems. Each judge provided annotations for the 4,000 documents in the final data set. In a second round, they met and discussed their independent annotations together, reaching a consensus judgement for every document. 3 Coverage of Web Search Results: Wikipedia vs Wordnet Table 1 shows how Wikipedia and Wordnet cover the senses in search results. We report each noun subset separately (Senseval and bands subsets) as well as aggregated figures. The most relevant fact is that, unsurprisingly, Wikipedia senses cover much more search results (56%) than Wordnet (32%). If we focus on the top ten results, in the bands subset (which should be more representative of plausible web queries) Wikipedia covers 68% of the top ten documents. This is an indication that it can indeed be useful for promoting diversity or help clustering search results: even if 32% of the top ten documents are not covered by Wikipedia, it is still a representative source of senses in the top search results. We have manually examined all documents in the top ten results that are not covered by Wikipedia: a majority of the missing senses consists of names of (generally not well-known) companies (45%) and products or services (26%); the other frequent type (12%) of non annotated doc- ument is disambiguation pages (from Wikipedia and also from other dictionaries). It is also interesting to examine the degree of overlap between Wikipedia and Wordnet senses. Being two different types of lexical resource, they might have some degree of complementarity. Table 2 shows, however, that this is not the case: most of the (annotated) documents either fit Wikipedia senses (26%) or both Wikipedia and Wordnet (29%), and just 3% fit Wordnet only. 1359 Table 2: Overlap between Wikipedia and Wordnet in Search Results # documents annotated with Senseval setWikipe60di7a ( &40 W%o)rdnetWi2k7ip0e (d1i8a% on)lyWo8r9d (n6e%t o)nly534no (3n6e%) BaTnodtsa slet1517729 ( (2239%%))1708566 (3 (216%%))12176 ( (13%%))11614195 ( (4415%%)) Therefore, Wikipedia seems to extend the coverage of Wordnet rather than providing complementary sense information. If we wanted to extend the coverage of Wikipedia, the best strategy seems to be to consider lists ofcompanies, products and services, rather than complementing Wikipedia with additional sense inventories. 4 Diversity in Google Search Results Once we know that Wikipedia senses are a representative subset of actual Web senses (covering more than half of the documents retrieved by the search engine), we can test how well search results respect diversity in terms of this subset of senses. Table 3 displays the number of different senses found at different depths in the search results rank, and the average proportion of total senses that they represent. These results suggest that diversity is not a major priority for ranking results: the top ten results only cover, in average, 3 Wikipedia senses (while the average number of senses listed in Wikipedia is 22). When considering the first 100 documents, this number grows up to 6.85 senses per noun. Another relevant figure is the frequency of the most frequent sense for each word: in average, 63% of the pages in search results belong to the most frequent sense of the query word. This is roughly comparable with most frequent sense figures in standard annotated corpora such as Semcor (Miller et al., 1993) and the Senseval/Semeval data sets, which suggests that diversity may not play a major role in the current Google ranking algorithm. Of course this result must be taken with care, because variability between words is high and unpredictable, and we are using only 40 nouns for our experiment. But what we have is a positive indication that Wikipedia could be used to improve diversity or cluster search results: potentially the first top ten results could cover 6.15 different senses in average (see Section 6.5), which would be a substantial growth. 5 Sense Frequency Estimators for Wikipedia Wikipedia disambiguation pages contain no systematic information about the relative importance of senses for a given word. Such information, however, is crucial in a lexicon, because sense distributions tend to be skewed, and knowing them can help disambiguation algorithms. We have attempted to use two estimators of expected sense distribution: • • Internal relevance of a word sense, measured as incoming alinnckes o ffo ar wthoer U seRnLs o, fm a given sense in Wikipedia. External relevance of a word sense, measured as ttheren naulm rebleevr aonfc vei osifts a f woro trhde s eUnRsLe, mofe a given sense (as reported in http://stats.grok.se). The number of internal incoming links is expected to be relatively stable for Wikipedia articles. As for the number of visits, we performed a comparison of the number of visits received by the bands noun subset in May, June and July 2009, finding a stable-enough scenario with one notorious exception: the number of visits to the noun Tesla raised dramatically in July, because July 10 was the anniversary of the birth of Nicola Tesla, and a special Google logo directed users to the Wikipedia page for the scientist. We have measured correlation between the relative frequencies derived from these two indicators and the actual relative frequencies in our testbed. Therefore, for each noun w and for each sense wi, we consider three values: (i) proportion of documents retrieved for w which are manually assigned to each sense wi; (ii) inlinks(wi) : relative amount of incoming links to each sense wi; and (iii) visits(wi) : relative number of visits to the URL for each sense wi. We have measured the correlation between these three values using a linear regression correlation coefficient, which gives a correlation value of .54 for the number of visits and of .71 for the number of incoming links. Both estimators seem 1360 Table 3: Diversity in Search Results according to Wikipedia F ir s t 12570 docsBave6n425.rd9854a6 s8get#snSe 65sien43. v68a3s27elarcthesTu6543l.o t5083as5lBvaen.r3d2a73s81gectovrSaegnso. 4f32v615aWlsiketpdaTs.3oe249tn01asle to be positively correlated with real relative frequencies in our testbed, with a strong preference for the number of links. We have experimented with weighted combinations of both indicators, using weights of the form (k, 1 k) , k ∈ {0, 0.1, 0.2 . . . 1}, reaching a maxi(mk,a1l c−okrre),lkati ∈on { 0of, .07.13, f0o.r2 t.h.e. following weights: − freq(wi) = 0.9∗inlinks(wi) +0. 1∗visits(wi) (1) This weighted estimator provides a slight advantage over the use of incoming links only (.73 vs .71). Overall, we have an estimator which has a strong correlation with the distribution of senses in our testbed. In the next section we will test its utility for disambiguation purposes. 6 Association of Wikipedia Senses to Web Pages We want to test whether the information provided by Wikipedia can be used to classify search results accurately. Note that we do not want to consider approaches that involve a manual creation of training material, because they can’t be used in practice. Given a Web page p returned by the search engine for the query w, and the set of senses w1 . . . wn listed in Wikipedia, the task is to assign the best candidate sense to p. We consider two different techniques: • A basic Information Retrieval approach, wAhe breas tche I dfoocrmumateionnts Ranetdr tvhael Wikipedia pages are represented using a Vector Space Model (VSM) and compared with a standard cosine measure. This is a basic approach which, if successful, can be used efficiently to classify search results. An approach based on a state-of-the-art supervised oWacShD b system, extracting training examples automatically from Wikipedia content. We also compute two baselines: • • • A random assignment of senses (precision is computed as itghnem ienvnter osfe oenfs tehse ( pnruemcibsieorn o isf senses, for every test case). A most frequent sense heuristic which uses our eosstitm fraetiqoune otf s sense frequencies acnhd u assigns the same sense (the most frequent) to all documents. Both are naive baselines, but it must be noted that the most frequent sense heuristic is usually hard to beat for unsupervised WSD algorithms in most standard data sets. We now describe each of the two main approaches in detail. 6.1 VSM Approach For each word sense, we represent its Wikipedia page in a (unigram) vector space model, assigning standard tf*idf weights to the words in the document. idf weights are computed in two different ways: 1. Experiment VSM computes inverse document frequencies in the collection of retrieved documents (for the word being considered). 2. Experiment VSM-GT uses the statistics provided by the Google Terabyte collection (Brants and Franz, 2006), i.e. it replaces the collection of documents with statistics from a representative snapshot of the Web. 3. Experiment VSM-mixed combines statistics from the collection and from the Google Terabyte collection, following (Chen et al., 2009). The document p is represented in the same vector space as the Wikipedia senses, and it is compared with each of the candidate senses wi via the cosine similarity metric (we have experimented 1361 with other similarity metrics such as χ2, but differences are irrelevant). The sense with the highest similarity to p is assigned to the document. In case of ties (which are rare), we pick the first sense in the Wikipedia disambiguation page (which in practice is like a random decision, because senses in disambiguation pages do not seem to be ordered according to any clear criteria). We have also tested a variant of this approach which uses the estimation of sense frequencies presented above: once the similarities are computed, we consider those cases where two or more senses have a similar score (in particular, all senses with a score greater or equal than 80% of the highest score). In that cases, instead of using the small similarity differences to select a sense, we pick up the one which has the largest frequency according to our estimator. We have applied this strategy to the best performing system, VSM-GT, resulting in experiment VSM-GT+freq. 6.2 WSD Approach We have used TiMBL (Daelemans et al., 2001), a state-of-the-art supervised WSD system which uses Memory-Based Learning. The key, in this case, is how to extract learning examples from the Wikipedia automatically. For each word sense, we basically have three sources of examples: (i) occurrences of the word in the Wikipedia page for the word sense; (ii) occurrences of the word in Wikipedia pages pointing to the page for the word sense; (iii) occurrences of the word in external pages linked in the Wikipedia page for the word sense. After an initial manual inspection, we decided to discard external pages for being too noisy, and we focused on the first two options. We tried three alternatives: • • • TiMBL-core uses only the examples found Tini MtheB page rfoer u tshees sense being atrmaipneleds. TiMBL-inlinks uses the examples found in Wikipedia pages pointing etxoa mthep sense being trained. TiMBL-all uses both sources of examples. In order to classify a page p with respect to the senses for a word w, we first disambiguate all occurrences of w in the page p. Then we choose the sense which appears most frequently in the page according to TiMBL results. In case of ties we pick up the first sense listed in the Wikipedia disambiguation page. We have also experimented with a variant of the approach that uses our estimation of sense frequencies, similarly to what we did with the VSM approach. In this case, (i) when there is a tie between two or more senses (which is much more likely than in the VSM approach), we pick up the sense with the highest frequency according to our estimator; and (ii) when no sense reaches 30% of the cases in the page to be disambiguated, we also resort to the most frequent sense heuristic (among the candidates for the page). This experiment is called TiMBL-core+freq (we discarded ”inlinks” and ”all” versions because they were clearly worse than ”core”). 6.3 Classification Results Table 4 shows classification results. The accuracy of systems is reported as precision, i.e. the number of pages correctly classified divided by the total number of predictions. This is approximately the same as recall (correctly classified pages divided by total number of pages) for our systems, because the algorithms provide an answer for every page containing text (actual coverage is 94% because some pages only contain text as part of an image file such as photographs and logotypes). Table 4: Classification Results Experiment Precision random most frequent sense (estimation) .19 .46 TiMBL-core TiMBL-inlinks TiMBL-all TiMBL-core+freq .60 .50 .58 .67 VSM VSM-GT VSM-mixed VSM-GT+freq .67 .68 .67 .69 All systems are significantly better than the random and most frequent sense baselines (using p < 0.05 for a standard t-test). Overall, both approaches (using TiMBL WSD machinery and using VSM) lead to similar results (.67 vs. .69), which would make VSM preferable because it is a simpler and more efficient approach. Taking a 1362 Figure 1: Precision/Coverage curves for VSM-GT+freq classification algorithm closer look at the results with TiMBL, there are a couple of interesting facts: • There is a substantial difference between using only examples itaalke dnif fferroemnc tehe b Wikipedia Web page for the sense being trained (TiMBL-core, .60) and using examples from the Wikipedia pages pointing to that page (TiMBL-inlinks, .50). Examples taken from related pages (even if the relationship is close as in this case) seem to be too noisy for the task. This result is compatible with findings in (Santamar ı´a et al., 2003) using the Open Directory Project to extract examples automatically. • Our estimation of sense frequencies turns oOuutr rto e tbiem very helpful sfeor f cases wcihesere t our TiMBL-based algorithm cannot provide an answer: precision rises from .60 (TiMBLcore) to .67 (TiMBL-core+freq). The difference is statistically significant (p < 0.05) according to the t-test. As for the experiments with VSM, the variations tested do not provide substantial improvements to the baseline (which is .67). Using idf frequencies obtained from the Google Terabyte corpus (instead of frequencies obtained from the set of retrieved documents) provides only a small improvement (VSM-GT, .68), and adding the estimation of sense frequencies gives another small improvement (.69). Comparing the baseline VSM with the optimal setting (VSM-GT+freq), the difference is small (.67 vs .69) but relatively robust (p = 0.066 according to the t-test). Remarkably, the use of frequency estimations is very helpful for the WSD approach but not for the SVM one, and they both end up with similar performance figures; this might indicate that using frequency estimations is only helpful up to certain precision ceiling. 6.4 Precision/Coverage Trade-off All the above experiments are done at maximal coverage, i.e., all systems assign a sense for every document in the test collection (at least for every document with textual content). But it is possible to enhance search results diversity without annotating every document (in fact, not every document can be assigned to a Wikipedia sense, as we have discussed in Section 3). Thus, it is useful to investigate which is the precision/coverage trade-off in our dataset. We have experimented with the best performing system (VSM-GT+freq), introducing a similarity threshold: assignment of a document to a sense is only done if the similarity of the document to the Wikipedia page for the sense exceeds the similarity threshold. We have computed precision and coverage for every threshold in the range [0.00 −0.90] (beyond 0e.v9e0ry coverage was null) anngde represented 0th] e(b breeysuolntds in Figure 1 (solid line). The graph shows that we 1363 can classify around 20% of the documents with a precision above .90, and around 60% of the documents with a precision of .80. Note that we are reporting disambiguation results using a conventional WSD test set, i.e., one in which every test case (every document) has been manually assigned to some Wikipedia sense. But in our Web Search scenario, 44% of the documents were not assigned to any Wikipedia sense: in practice, our classification algorithm would have to cope with all this noise as well. Figure 1 (dotted line) shows how the precision/coverage curve is affected when the algorithm attempts to disambiguate all documents retrieved by Google, whether they can in fact be assigned to a Wikipedia sense or not. At a coverage of 20%, precision drops approximately from .90 to .70, and at a coverage of 60% it drops from .80 to .50. We now address the question of whether this performance is good enough to improve search re- sults diversity in practice. 6.5 Using Classification to Promote Diversity We now want to estimate how the reported classification accuracy may perform in practice to enhance diversity in search results. In order to provide an initial answer to this question, we have re-ranked the documents for the 40 nouns in our testbed, using our best classifier (VSM-GT+freq) and making a list of the top-ten documents with the primary criterion of maximising the number of senses represented in the set, and the secondary criterion of maximising the similarity scores of the documents to their assigned senses. The algorithm proceeds as follows: we fill each position in the rank (starting at rank 1), with the document which has the highest similarity to some of the senses which are not yet represented in the rank; once all senses are represented, we start choosing a second representative for each sense, following the same criterion. The process goes on until the first ten documents are selected. We have also produced a number of alternative rankings for comparison purposes: clustering (centroids): this method applies eHriiengrarc (hciecnatlr Agglomerative Clustering which proved to be the most competitive clustering algorithm in a similar task (Artiles et al., 2009) to the set of search results, forcing the algorithm to create ten clusters. The centroid of each cluster is then selected Table 5: Enhancement of Search Results Diversity • – – rank@10 # senses coverage Original rank2.8049% Wikipedia 4.75 77% clustering (centroids) 2.50 42% clustering (top ranked) 2.80 46% random 2.45 43% upper bound6.1597% as one of the top ten documents in the new rank. • clustering (top ranked): Applies the same clustering algorithm, db u)t: tAhpisp lti emse t tehe s top ranked document (in the original Google rank) of each cluster is selected. • • random: Randomly selects ten documents frraonmd otmhe: :se Rt aofn dreomtrielyve sde lreecstuslts te. upper bound: This is the maximal diversity tuhpapt can o beu nodb:tai Tnheids iins our mteasxtbiemda. lN doivteer tshitayt coverage is not 100%, because some words have more than ten meanings in Wikipedia and we are only considering the top ten documents. All experiments have been applied on the full set of documents in the testbed, including those which could not be annotated with any Wikipedia sense. Coverage is computed as the ratio of senses that appear in the top ten results compared to the number of senses that appear in all search results. Results are presented in Table 5. Note that diversity in the top ten documents increases from an average of 2.80 Wikipedia senses represented in the original search engine rank, to 4.75 in the modified rank (being 6.15 the upper bound), with the coverage of senses going from 49% to 77%. With a simple VSM algorithm, the coverage of Wikipedia senses in the top ten results is 70% larger than in the original ranking. Using Wikipedia to enhance diversity seems to work much better than clustering: both strategies to select a representative from each cluster are unable to improve the diversity of the original ranking. Note, however, that our evaluation has a bias towards using Wikipedia, because only Wikipedia senses are considered to estimate diversity. Of course our results do not imply that the Wikipedia modified rank is better than the original 1364 Google rank: there are many other factors that influence the final ranking provided by a search engine. What our results indicate is that, with simple and efficient algorithms, Wikipedia can be used as a reference to improve search results diversity for one-word queries. 7 Related Work Web search results clustering and diversity in search results are topics that receive an increasing attention from the research community. Diversity is used both to represent sub-themes in a broad topic, or to consider alternative interpretations for ambiguous queries (Agrawal et al., 2009), which is our interest here. Standard IR test collections do not usually consider ambiguous queries, and are thus inappropriate to test systems that promote diversity (Sanderson, 2008); it is only recently that appropriate test collections are being built, such as (Paramita et al., 2009) for image search and (Artiles et al., 2009) for person name search. We see our testbed as complementary to these ones, and expect that it can contribute to foster research on search results diversity. To our knowledge, Wikipedia has not explicitly been used before to promote diversity in search results; but in (Gollapudi and Sharma, 2009), it is used as a gold standard to evaluate diversification algorithms: given a query with a Wikipedia disambiguation page, an algorithm is evaluated as promoting diversity when different documents in the search results are semantically similar to different Wikipedia pages (describing the alternative senses of the query). Although semantic similarity is measured automatically in this work, our results confirm that this evaluation strategy is sound, because Wikipedia senses are indeed representative of search results. (Clough et al., 2009) analyses query diversity in a Microsoft Live Search, using click entropy and query reformulation as diversity indicators. It was found that at least 9.5% - 16.2% of queries could benefit from diversification, although no correlation was found between the number of senses of a word in Wikipedia and the indicators used to discover diverse queries. This result does not discard, however, that queries where applying diversity is useful cannot benefit from Wikipedia as a sense inventory. In the context of clustering, (Carmel et al., 2009) successfully employ Wikipedia to enhance automatic cluster labeling, finding that Wikipedia labels agree with manual labels associated by humans to a cluster, much more than with signif- icant terms that are extracted directly from the text. In a similar line, both (Gabrilovich and Markovitch, 2007) and (Syed et al., 2008) provide evidence suggesting that categories of Wikipedia articles can successfully describe common concepts in documents. In the field of Natural Language Processing, there has been successful attempts to connect Wikipedia entries to Wordnet senses: (RuizCasado et al., 2005) reports an algorithm that provides an accuracy of 84%. (Mihalcea, 2007) uses internal Wikipedia hyperlinks to derive sensetagged examples. But instead of using Wikipedia directly as sense inventory, Mihalcea then manually maps Wikipedia senses into Wordnet senses (claiming that, at the time of writing the paper, Wikipedia did not consistently report ambiguity in disambiguation pages) and shows that a WSD system based on acquired sense-tagged examples reaches an accuracy well beyond an (informed) most frequent sense heuristic. 8 Conclusions We have investigated whether generic lexical resources can be used to promote diversity in Web search results for one-word, ambiguous queries. We have compared WordNet and Wikipedia and arrived to a number of conclusions: (i) unsurprisingly, Wikipedia has a much better coverage of senses in search results, and is therefore more appropriate for the task; (ii) the distribution of senses in search results can be estimated using the internal graph structure of the Wikipedia and the relative number of visits received by each sense in Wikipedia, and (iii) associating Web pages to Wikipedia senses with simple and efficient algorithms, we can produce modified rankings that cover 70% more Wikipedia senses than the original search engine rankings. We expect that the testbed created for this research will complement the - currently short - set of benchmarking test sets to explore search results diversity and query ambiguity. Our testbed is publicly available for research purposes at http://nlp.uned.es. Our results endorse further investigation on the use of Wikipedia to organize search results. Some limitations of our research, however, must be 1365 noted: (i) the nature of our testbed (with every search result manually annotated in terms of two sense inventories) makes it too small to extract solid conclusions on Web searches (ii) our work does not involve any study of diversity from the point of view of Web users (i.e. when a Web query addresses many different use needs in practice); research in (Clough et al., 2009) suggests that word ambiguity in Wikipedia might not be related with diversity of search needs; (iii) we have tested our classifiers with a simple re-ordering of search results to test how much diversity can be improved, but a search results ranking depends on many other factors, some of them more crucial than diversity; it remains to be tested how can we use document/Wikipedia associations to improve search results clustering (for instance, providing seeds for the clustering process) and to provide search suggestions. Acknowledgments This work has been partially funded by the Spanish Government (project INES/Text-Mess) and the Xunta de Galicia. References R. Agrawal, S. Gollapudi, A. Halverson, and S. Leong. 2009. Diversifying Search Results. In Proc. of WSDM’09. ACM. P. Anick. 2003. Using Terminological Feedback for Web Search Refinement : a Log-based Study. In Proc. ACM SIGIR 2003, pages 88–95. ACM New York, NY, USA. J. Artiles, J. Gonzalo, and S. Sekine. 2009. WePS 2 Evaluation Campaign: overview of the Web People Search Clustering Task. In 2nd Web People Search Evaluation Workshop (WePS 2009), 18th WWW Conference. 2009. T. Brants and A. Franz. 2006. Web 1T 5-gram, version 1. Philadelphia: Linguistic Data Consortium. D. Carmel, H. Roitman, and N. Zwerdling. 2009. Enhancing Cluster Labeling using Wikipedia. In Pro- ceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, pages 139–146. ACM. C. Carpineto, S. Osinski, G. Romano, and Dawid Weiss. 2009. A Survey of Web Clustering Engines. ACM Computing Surveys, 41(3). Y. Chen, S. Yat Mei Lee, and C. Huang. 2009. PolyUHK: A Robust Information Extraction System for Web Personal Names. In Proc. WWW’09 (WePS2 Workshop). ACM. C. Clarke, M. Kolla, G. Cormack, O. Vechtomova, A. Ashkan, S. B ¨uttcher, and I. MacKinnon. 2008. Novelty and Diversity in Information Retrieval Evaluation. In Proc. SIGIR ’08, pages 659–666. ACM. P. Clough, M. Sanderson, M. Abouammoh, S. Navarro, and M. Paramita. 2009. Multiple Approaches to Analysing Query Diversity. In Proc. of SIGIR 2009. ACM. W. Daelemans, J. Zavrel, K. van der Sloot, and A. van den Bosch. 2001 . TiMBL: Tilburg Memory Based Learner, version 4.0, Reference Guide. Technical report, University of Antwerp. E. Gabrilovich and S. Markovitch. 2007. Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. In Proceedings of The 20th International Joint Conference on Artificial Intelligence (IJCAI), Hyderabad, India. S. Gollapudi and A. Sharma. 2009. An Axiomatic Approach for Result Diversification. In Proc. WWW 2009, pages 381–390. ACM New York, NY, USA. R. Mihalcea. 2007. Using Wikipedia for Automatic Word Sense Disambiguation. In Proceedings of NAACL HLT, volume 2007. G. Miller, C. R. Beckwith, D. Fellbaum, Gross, and K. Miller. 1990. Wordnet: An on-line lexical database. International Journal of Lexicograph, 3(4). G.A Miller, C. Leacock, R. Tengi, and Bunker R. T. 1993. A Semantic Concordance. In Proceedings of the ARPA WorkShop on Human Language Technology. San Francisco, Morgan Kaufman. M. Paramita, M. Sanderson, and P. Clough. 2009. Diversity in Photo Retrieval: Overview of the ImageCLEFPhoto task 2009. CLEF working notes, 2009. M. Ruiz-Casado, E. Alfonseca, and P. Castells. 2005. Automatic Assignment of Wikipedia Encyclopaedic Entries to Wordnet Synsets. Advances in Web Intelligence, 3528:380–386. M. Sanderson. 2000. Retrieving with Good Sense. Information Retrieval, 2(1):49–69. M. Sanderson. 2008. Ambiguous Queries: Test Collections Need More Sense. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 499–506. ACM New York, NY, USA. C. Santamar ı´a, J. Gonzalo, and F. Verdejo. 2003. Automatic Association of Web Directories to Word Senses. Computational Linguistics, 29(3):485–502. Z. S. Syed, T. Finin, and Joshi. A. 2008. Wikipedia as an Ontology for Describing Documents. In Proc. ICWSM’08. 1366

6 0.75631452 218 acl-2010-Structural Semantic Relatedness: A Knowledge-Based Method to Named Entity Disambiguation

7 0.7525202 51 acl-2010-Bilingual Sense Similarity for Statistical Machine Translation

8 0.75211221 145 acl-2010-Improving Arabic-to-English Statistical Machine Translation by Reordering Post-Verbal Subjects for Alignment

9 0.75192463 184 acl-2010-Open-Domain Semantic Role Labeling by Modeling Word Spans

10 0.75165939 144 acl-2010-Improved Unsupervised POS Induction through Prototype Discovery

11 0.75152755 88 acl-2010-Discriminative Pruning for Discriminative ITG Alignment

12 0.75012451 133 acl-2010-Hierarchical Search for Word Alignment

13 0.74982864 15 acl-2010-A Semi-Supervised Key Phrase Extraction Approach: Learning from Title Phrases through a Document Semantic Network

14 0.7491467 55 acl-2010-Bootstrapping Semantic Analyzers from Non-Contradictory Texts

15 0.7490378 79 acl-2010-Cross-Lingual Latent Topic Extraction

16 0.74835908 172 acl-2010-Minimized Models and Grammar-Informed Initialization for Supertagging with Highly Ambiguous Lexicons

17 0.7474066 9 acl-2010-A Joint Rule Selection Model for Hierarchical Phrase-Based Translation

18 0.74634689 52 acl-2010-Bitext Dependency Parsing with Bilingual Subtree Constraints

19 0.74551034 113 acl-2010-Extraction and Approximation of Numerical Attributes from the Web

20 0.74415207 163 acl-2010-Learning Lexicalized Reordering Models from Reordering Graphs