acl acl2013 acl2013-112 knowledge-graph by maker-knowledge-mining

112 acl-2013-Dependency Parser Adaptation with Subtrees from Auto-Parsed Target Domain Data

Source: pdf

Author: Xuezhe Ma ; Fei Xia

Abstract: In this paper, we propose a simple and effective approach to domain adaptation for dependency parsing. This is a feature augmentation approach in which the new features are constructed based on subtree information extracted from the autoparsed target domain data. To demonstrate the effectiveness of the proposed approach, we evaluate it on three pairs of source-target data, compared with several common baseline systems and previous approaches. Our approach achieves significant improvement on all the three pairs of data sets.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Dependency Parser Adaptation with Subtrees from Auto-Parsed Target Domain Data Xuezhe Ma Department of Linguistics University of Washington Seattle, WA 98195, USA x zma @ uw . edu [sent-1, score-0.077]

2 Abstract In this paper, we propose a simple and effective approach to domain adaptation for dependency parsing. [sent-2, score-0.482]

3 This is a feature augmentation approach in which the new features are constructed based on subtree information extracted from the autoparsed target domain data. [sent-3, score-0.846]

4 Our approach achieves significant improvement on all the three pairs of data sets. [sent-5, score-0.049]

5 1 Introduction In recent years, several dependency parsing algorithms (Nivre and Scholz, 2004; McDonald et al. [sent-6, score-0.299]

6 , 2005b; McDonald and Pereira, 2006; Carreras, 2007; Koo and Collins, 2010; Ma and Zhao, 2012) have been proposed and achieved high parsing accuracies on several treebanks of different languages. [sent-8, score-0.212]

7 However, the performance of such parsers declines when training and test data come from different domains. [sent-9, score-0.249]

8 Furthermore, the manually annotated treebanks that these parsers rely on are highly expensive to create. [sent-10, score-0.105]

9 Therefore, developing dependency parsing algorithms that can be easily ported from one domain to another—say, from a resource-rich domain to a resource-poor domain—is of great importance. [sent-11, score-0.699]

10 Several approaches have been proposed for the task of parser adaptation. [sent-12, score-0.16]

11 (2006) successfully applied self-training to domain adaptation for constituency parsing using the reranking parser of Charniak and Johnson (2005). [sent-14, score-0.699]

12 Reichart and Rappoport (2007) explored self-training when the amount of the annotated data is small. (Author block: Fei Xia, Department of Linguistics, University of Washington, Seattle, WA 98195, USA, fxi a @ uw .) [sent-15, score-0.144]

13 Zhang and Wang (2009) enhanced the performance of dependency parser adaptation by utilizing a large-scale hand-crafted HPSG grammar. [sent-17, score-0.442]

14 Plank and van Noord (2011) proposed a data selection method based on effective measures of domain similarity for dependency parsing. [sent-18, score-0.442]

15 There are roughly two varieties of the domain adaptation problem: the fully supervised case, in which there is a small amount of labeled data in the target domain, and the semi-supervised case, in which there is no labeled data in the target domain. [sent-19, score-1.417]

16 In this paper, we present a parsing adaptation approach focused on the fully supervised case. [sent-20, score-0.339]

17 It is a feature augmentation approach in which the new features are constructed based on subtree information extracted from the auto-parsed target domain data. [sent-21, score-0.782]

18 Our approach achieves significant improvement on all these data sets. [sent-23, score-0.049]

19 (2009)’s work on semi-supervised parsing with additional subtree-based features extracted from unlabeled data and by the feature augmentation method proposed by Daume III (2007). [sent-25, score-0.586]

20 ’s work and explain how we extend that for domain adaptation. [sent-27, score-0.2]

21 , subtrees), instead of the entire trees, from the autoparsed data is used to re-train the parsing models. [sent-36, score-0.291]

22 For example, a first-order subtree is a single edge consisting of a head and a dependent, and a second-order sibling subtree is one that consists of a head and two dependents. [sent-38, score-0.381]

23 (2009), they first extract all the subtrees in auto-parsed data and store them in a list Lst. [sent-40, score-0.259]

24 Then they count the frequency of these subtrees and divide them into three groups according to their levels of frequency. [sent-41, score-0.242]

25 Finally, they construct new features for the subtrees based on which groups they belong to and retrain a new parser with feature-augmented training data. [sent-42, score-0.684]
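The subtree extraction and counting described in sentences 22 to 24 can be sketched as follows. This is a minimal illustration, not the authors' code: the sentence representation (a word list plus a `heads` list with -1 for the root), the function names, the tuple encoding of subtrees, and the restriction of sibling subtrees to adjacent dependents on the same side of the head are all assumptions made here.

```python
from collections import Counter

def first_order_subtrees(words, heads):
    """Yield one ("edge", head_word, dep_word, direction) tuple per dependency edge.

    heads[i] is the index of token i's head, or -1 for the root (assumed encoding).
    """
    for dep, head in enumerate(heads):
        if head >= 0:
            direction = "R" if dep > head else "L"
            yield ("edge", words[head], words[dep], direction)

def sibling_subtrees(words, heads):
    """Yield ("sib", head_word, dep1, dep2) for adjacent same-side dependents.

    Restricting to adjacent siblings on one side is an assumption of this sketch.
    """
    children = {}
    for dep, head in enumerate(heads):
        if head >= 0:
            children.setdefault(head, []).append(dep)
    for head, deps in children.items():
        left = [d for d in deps if d < head]
        right = [d for d in deps if d > head]
        for side in (left, right):
            for a, b in zip(side, side[1:]):
                yield ("sib", words[head], words[a], words[b])

def count_subtrees(parsed_sentences):
    """Count every extracted subtree over a list of (words, heads) pairs."""
    counts = Counter()
    for words, heads in parsed_sentences:
        counts.update(first_order_subtrees(words, heads))
        counts.update(sibling_subtrees(words, heads))
    return counts
```

The resulting counter plays the role of the list Lst with frequencies attached; the tag in the first tuple slot keeps edge and sibling subtrees from colliding.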

26 2 Parser adaptation with subtree-based Features Chen et al. [sent-44, score-0.161]

27 (2009)’s work is for semi-supervised learning, where the labeled training data and the test data come from the same domain; the subtree-based features collected from auto-parsed data are added to all the labeled training data to retrain the parsing model. [sent-45, score-1.35]

28 In the supervised setting for domain adaptation, there is a large amount of labeled data in the source domain and a small amount of labeled data in the target domain. [sent-46, score-1.369]

29 One intuitive way of applying Chen’s method to this setting is to simply take the union of the labeled training data from both domains and add subtree-based features to all the data in the union when re-training the parsing model. [sent-47, score-0.875]

30 However, it turns out that adding subtree-based features to only the labeled training data in the target domain works better. [sent-48, score-0.87]

31 Train a baseline parser with the small amount of labeled data in the target domain and use the parser to parse the large amount of unlabeled sentences in the target domain. [sent-50, score-1.394]

32 Extract subtrees from the auto-parsed data and add subtree-based features to the labeled training data in the target domain. [sent-52, score-0.866]

33 Retrain the parser with the union of the labeled training data in the two domains, where the instances from the target domain are augmented with the subtree-based features. [sent-54, score-1.007]
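The three steps above can be sketched as a small driver function. This is only an outline under stated assumptions: `train`, `parse`, and `add_subtree_features` are hypothetical callables supplied by the caller, not functions of any real parser library, and the data sets are plain Python lists.

```python
def adapt_parser(train, parse, add_subtree_features,
                 source_labeled, target_labeled, target_unlabeled):
    """Outline of the adaptation procedure; all callables are hypothetical.

    train(examples) -> model
    parse(model, sentence) -> auto-parsed tree
    add_subtree_features(example, auto_parsed) -> augmented example
    """
    # Step 1: baseline parser from the small target-domain labeled set,
    # then parse the large pool of unlabeled target-domain sentences.
    baseline = train(target_labeled)
    auto_parsed = [parse(baseline, sent) for sent in target_unlabeled]

    # Step 2: add subtree-based features to the target-domain examples only.
    augmented_target = [add_subtree_features(ex, auto_parsed)
                        for ex in target_labeled]

    # Step 3: retrain on the union of source data (unchanged) and
    # the augmented target data.
    return train(list(source_labeled) + augmented_target)
```

Passing the components as arguments keeps the sketch independent of any particular parser; source-domain examples go into the final training call without the new features.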

34 1If a subtree does not appear in Lst, it falls to the fourth group for “unseen subtrees”. [sent-55, score-0.149]

35 To state our feature augmentation approach more formally, we use X to denote the input space, and Ds and Dt to denote the labeled data in the source and target domains, respectively. [sent-56, score-0.754]

36 Let X′ be the augmented input space, and Φs and Φt be the mappings from X to X′ for the instances in the source and target domains respectively. [sent-57, score-0.407]

37 $\Phi_s(x_{org}) = \langle x_{org}, 0 \rangle, \quad \Phi_t(x_{org}) = \langle x_{org}, x_{new} \rangle \quad (1)$ Here, xorg is the original feature vector in X, and xnew is the vector of the subtree-based features extracted from auto-parsed data of the target domain. [sent-62, score-0.814]
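The two mappings Φs and Φt can be illustrated with dense Python lists. This is a sketch only: the function names and the dense-list representation are assumptions, and a real parser would more likely use sparse feature vectors.

```python
def phi_source(x_org, n_new):
    """Source-domain mapping: original features followed by n_new zeros."""
    return list(x_org) + [0.0] * n_new

def phi_target(x_org, x_new):
    """Target-domain mapping: original features followed by the new
    subtree-based features."""
    return list(x_org) + list(x_new)

# Both mappings land in the same augmented space, so one model can be
# trained on the union of the mapped source and target instances.
source_vec = phi_source([1.0, 0.0], n_new=3)
target_vec = phi_target([1.0, 0.0], [0.0, 1.0, 0.0])
```

Because source instances carry zeros in the new block, weights learned for the subtree-based features are driven by target-domain instances alone.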

38 The subtree extraction method used in our approach is the same as in (Chen et al. [sent-63, score-0.149]

39 , 2009) except that we use different thresholds when dividing subtrees into three frequency groups: the threshold for the high-frequency level is TOP 1% of the subtrees, the one for the middle-frequency level is TOP 10%, and the rest of the subtrees belong to the low-frequency level. [sent-64, score-0.483]

40 These thresholds are chosen empirically on some development data set. [sent-65, score-0.082]
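The frequency grouping with these thresholds might look like the sketch below. It assumes that "TOP 1%" and "TOP 10%" refer to rank positions in a frequency-sorted list of distinct subtrees, that ties are broken by input order, and that subtrees absent from the list fall into the fourth "unseen" group; the function names and group labels are illustrative.

```python
def build_group_lookup(counts, high_pct=1, mid_pct=10):
    """Map each subtree to "HIGH", "MID", or "LOW" by frequency rank.

    counts: dict-like mapping subtree -> frequency.
    high_pct / mid_pct: rank cut-offs in percent (1 and 10, as in the text).
    """
    ranked = sorted(counts, key=counts.get, reverse=True)
    n = len(ranked)
    high_cut = n * high_pct // 100   # integer arithmetic avoids float rounding
    mid_cut = n * mid_pct // 100
    groups = {}
    for rank, subtree in enumerate(ranked):
        if rank < high_cut:
            groups[subtree] = "HIGH"
        elif rank < mid_cut:
            groups[subtree] = "MID"
        else:
            groups[subtree] = "LOW"
    return groups

def group_of(subtree, groups):
    """Return the frequency group, or "UNSEEN" for subtrees not in the list."""
    return groups.get(subtree, "UNSEEN")
```

The group label, rather than the raw count, is what would be turned into a feature, which keeps the number of new features small.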

41 The difference between that study and our approach is that our new features are based on subtree information instead of copies of original features. [sent-67, score-0.243]

42 Since the new features are based on the subtree information extracted from the auto-parsed target data, they represent certain properties of the target domain and that explains why adding them to the target data works better than adding them to both the source and target data. [sent-68, score-1.482]

43 3 Experiments For evaluation, we tested our approach on three pairs of source-target data and compared it with [footnote 2: The mapping in Eq 2 looks different from the one proposed in (Daume III, 2007), but it can be proved that the two are equivalent.] [sent-69, score-0.049]

44 In this section, we first describe the data sets and parsing models used in each of the three experiments in section 3. [sent-71, score-0.227]

45 In the first experiment, denoted by “WSJ-to-B”, the WSJ corpus is used as the source domain and the Brown corpus as the target domain. [sent-80, score-0.546]

46 The phrase structures in the treebank are converted into dependencies using the Penn2Malt tool (footnote 3) with the standard head rules (Yamada and Matsumoto, 2003). [sent-82, score-0.067]

47 For the WSJ corpus, we used the standard data split: sections 2-21 for training and section 23 for test. [sent-83, score-0.139]

48 In the experiment of B-to-WSJ, we randomly selected about 2000 sentences from the training portion of WSJ as the labeled data in the target domain. [sent-84, score-0.683]

49 The rest of training data in WSJ is regarded as the unlabeled data of the target domain. [sent-85, score-0.533]

50 The training and test sections consist of sentences from all of the genres that form the corpus. [sent-87, score-0.09]

51 The training portion consists of 90% (9 of each 10 consecutive sentences) of the data, and the test portion is the remaining 10%. [sent-88, score-0.175]
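A split of this kind can be sketched in a few lines. The text does not say which sentence in each run of ten is held out, so holding out the tenth is an assumption of this sketch, as is the function name.

```python
def split_nine_of_ten(sentences):
    """Send 9 of every 10 consecutive sentences to train and 1 to test.

    Which position is held out is not stated in the text; this sketch
    holds out the tenth sentence of each block (assumption).
    """
    train, test = [], []
    for i, sent in enumerate(sentences):
        if i % 10 == 9:
            test.append(sent)
        else:
            train.append(sent)
    return train, test
```

Splitting by position rather than at random keeps every genre represented in both portions when the corpus is ordered by genre.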

52 For the experiment of WSJ-to-B, we randomly selected about 2000 sentences from training portion of Brown and use them as labeled data and the rest as unlabeled data in the target domain. [sent-89, score-0.898]

53 In the third experiment, denoted by “WSJ-to-G”, we used the WSJ corpus as the source domain and the Genia corpus (G) (footnote 4) as the target domain. [sent-90, score-0.546]

54 Following Plank and van Noord (2011), we used the training data in the CoNLL 2008 shared task (Surdeanu et al. [sent-91, score-0.229]

55 , 2008) which are also from WSJ sections 2-21 but converted into dependency structure by the LTH converter (Johansson and Nugues, 2007). [sent-92, score-0.253]

56 The dependency parsing models we used in this study are the graph-based first-order and second-order sibling parsing models (McDonald et al. [sent-106, score-0.592]

57 The feature sets of first-order and second-order sibling parsing models used in our experiments are the same as the ones in (Ma and Zhao, 2012). [sent-110, score-0.261]

58 Parsing accuracy is measured with unlabeled attachment score (UAS) and the percentage of complete matches (CM) for the first and second experiments. [sent-112, score-0.167]

59 For the third experiment, we also report labeled attachment score (LAS) in order to compare with the results in (Plank and van Noord, 2011). [sent-113, score-0.347]
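The three metrics can be computed as in the sketch below. It assumes each sentence is a list of (head index, label) pairs, one per token, and that a complete match is judged on heads only; punctuation handling, which evaluation scripts often treat specially, is left out, and the function name is illustrative.

```python
def attachment_scores(gold, predicted):
    """Return UAS, LAS, and CM as fractions in [0, 1].

    gold, predicted: lists of sentences; each sentence is a list of
    (head_index, label) pairs, aligned token by token (assumed format).
    """
    total = correct_heads = correct_both = complete = 0
    for g_sent, p_sent in zip(gold, predicted):
        heads_ok = [g[0] == p[0] for g, p in zip(g_sent, p_sent)]
        both_ok = [g == p for g, p in zip(g_sent, p_sent)]
        total += len(g_sent)
        correct_heads += sum(heads_ok)
        correct_both += sum(both_ok)
        complete += all(heads_ok)   # complete match judged on heads (assumption)
    return {
        "UAS": correct_heads / total,
        "LAS": correct_both / total,
        "CM": complete / len(gold),
    }
```

LAS can never exceed UAS under this definition, since a token counts for LAS only when both its head and its label are correct.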

60 2 Comparison Systems For comparison, we re-implemented the following well-known baselines and previous approaches, and tested them on the three data sets: SrcOnly: Train a parser with the labeled data from the source domain only. [sent-115, score-0.775]

61 TgtOnly: Train a parser with the labeled data from the target domain only. [sent-116, score-0.863]

62 Src&Tgt: Train a parser with the labeled data from the source and target domains. [sent-117, score-0.736]

63 Self-Training: Following Reichart and Rappoport (2007), we train a parser with the union of the source and target labeled data, parse the unlabeled data in the target domain, [sent-118, score-1.233]

64 add the entire auto-parsed trees to the manually labeled data in a single step without checking their parsing quality, and retrain the parser. [footnote 5: http://sourceforge.net/projects/maxparser/] [sent-119, score-0.649]

65 Co-Training: In the co-training system, we first train two parsers with the labeled data from the source and target domains, respectively. [sent-120, score-0.682]

66 Then we use the parsers to parse unlabeled data in the target domain and select sentences for which the two parsers produce identical trees. [sent-121, score-0.768]

67 Finally, we add the analyses for those sentences to the union of the source and target labeled data to retrain a new parser. [sent-122, score-0.629]

68 This approach is similar to the one used in (Sagae and Tsujii, 2007), which achieved the highest scores in the domain adaptation track of the CoNLL 2007 shared task (Nivre et al. [sent-123, score-0.41]
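The agreement-based selection step of the co-training baseline can be sketched as follows. The two parser arguments are hypothetical callables that return a comparable tree object (here, a list of head indices); nothing in this sketch is tied to a real parser API.

```python
def select_agreed(sentences, parse_a, parse_b):
    """Keep the sentences on which two parsers produce identical trees.

    parse_a, parse_b: hypothetical callables mapping a sentence to a tree
    that supports == comparison (for example, a list of head indices).
    Returns a list of (sentence, tree) pairs to add to the training data.
    """
    selected = []
    for sent in sentences:
        tree_a = parse_a(sent)
        tree_b = parse_b(sent)
        if tree_a == tree_b:
            selected.append((sent, tree_a))
    return selected
```

Agreement between independently trained parsers serves as a cheap proxy for parse quality, in contrast to the self-training baseline, which adds all auto-parsed trees without any check.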

69 We use the union of the labeled data from the source and target domains as the labeled training data. [sent-129, score-1.045]

70 The unlabeled data needed to construct subtree-based features come from the target domain. [sent-130, score-0.587]

71 Plank and van Noord (2011): This system performs data selection on a data pool consisting of a large amount of labeled data to get a training set that is similar to the test domain. [sent-131, score-0.572]

72 Per-corpus: The parser is trained with the large training set from the target domain. [sent-133, score-0.429]

73 For example, for the experiment of WSJ-to-B, all the labeled training data from the Brown corpus is used for training, including the subset of data which are treated as unlabeled in our approach and other comparison systems. [sent-134, score-0.6]

74 The results serve as an upper bound of domain adaptation when there is a large amount of labeled data in the target domain. [sent-135, score-0.914]

75 3 Results Table 2 illustrates the results of our approach with the first-order parsing model in the first and second experiments, together with the results of the comparison systems described in section 3. [sent-137, score-0.178]

76 The superscript indicates the source of labeled data used in training. [sent-140, score-0.411]

77 parsing model in the first and second experiments. [sent-141, score-0.178]

78 results with the second-order sibling parsing model are shown in Table 3. [sent-142, score-0.261]

79 Table 4 shows the results in the third experiment with the first-order parsing model. [sent-144, score-0.241]

80 We also include the result from (Plank and van Noord, 2011), which uses the same parsing model as ours. [sent-145, score-0.25]

81 Note that this result is not comparable with other numbers in the table as it uses a larger set of labeled data, as indicated by the superscript. [sent-146, score-0.244]

82 “Plank (2011)” refers to the approach in Plank and van Noord (2011). [sent-148, score-0.108]

83 6 The improvement of our approach over the feature augmentation approach in Daume III (2007) indicates that adding subtree-based features provides better results than making several copies of the original features. [sent-150, score-0.444]

84 , 2009), implying that adding subtree-based features to only the target labeled data is better than adding them to the labeled data in both the source and target domains. [sent-152, score-1.25]

85 2, the training data used to train the parser in Step 1 can be from the target domain only or from the source and target domains. [sent-154, score-0.996]

86 Similarly, in Step 3 the subtree-based features can be added to the labeled data from the target domain only or from the source and target domains. [sent-155, score-1.031]

87 Our approach is the one that uses the labeled data from the target domain only in both steps, and Chen’s system uses labeled data from the source and target domains in both steps. [sent-157, score-1.36]

88 Table 5 compares the performance of the final parser in the WSJ-to-Genia experiment when the parser is created with one of the four combinations. [sent-158, score-0.383]

89 [footnote 6] The results of Per-corpus are better than ours, but it uses a much larger labeled training set in the target domain. [sent-161, score-0.513]

90 2 Table 5: The performance (UAS/LAS) of the final parser in the WSJ-to-Genia experiment when different training data are used to create the final parser. [sent-170, score-0.331]

91 The column label and row label indicate the choice of the labeled data used in Step 1 and 3 of the process described in Section 2. [sent-171, score-0.329]

92 4 Conclusion In this paper, we propose a feature augmentation approach for dependency parser adaptation which constructs new features based on subtree information extracted from auto-parsed data from the target domain. [sent-173, score-1.073]

93 We distinguish the source and target domains by adding the new features only to the data from the target domain. [sent-174, score-0.731]

94 The experimental results on three source-target domain pairs show that our approach outperforms all the comparison systems. [sent-175, score-0.2]

95 For future work, we will explore the potential benefits of adding other types of features extracted from unlabeled data in the target domain. [sent-176, score-0.503]

96 We will also experiment with various ways of combining our current approach with other domain adaptation methods (such as self-training and co-training) to further improve system performance. [sent-177, score-0.424]

97 Self-training for enhancement and domain adaptation of statistical parsers trained on small datasets. [sent-250, score-0.432]

98 Dependency parsing and domain adaptation with LR models and parser ensembles. [sent-254, score-0.699]

99 The conll2008 shared task on joint parsing of syntactic and semantic dependencies. [sent-258, score-0.227]

100 Statistical dependency analysis with support vector machines. [sent-262, score-0.121]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('xorg', 0.364), ('labeled', 0.244), ('plank', 0.223), ('subtrees', 0.21), ('target', 0.21), ('domain', 0.2), ('daume', 0.19), ('augmentation', 0.178), ('retrain', 0.178), ('parsing', 0.178), ('noord', 0.168), ('adaptation', 0.161), ('parser', 0.16), ('subtree', 0.149), ('unlabeled', 0.136), ('wsj', 0.131), ('mcdonald', 0.122), ('dependency', 0.121), ('genia', 0.119), ('subtreebased', 0.109), ('chen', 0.095), ('conll', 0.088), ('reichart', 0.087), ('union', 0.085), ('sibling', 0.083), ('domains', 0.081), ('rappoport', 0.08), ('source', 0.073), ('computional', 0.073), ('xnew', 0.073), ('xuezhe', 0.073), ('van', 0.072), ('parsers', 0.071), ('iii', 0.07), ('autoparsed', 0.064), ('converter', 0.064), ('ryan', 0.064), ('adding', 0.063), ('experiment', 0.063), ('brown', 0.06), ('johansson', 0.06), ('training', 0.059), ('nivre', 0.058), ('portion', 0.058), ('crammer', 0.054), ('session', 0.051), ('lth', 0.051), ('pereira', 0.05), ('amount', 0.05), ('joakim', 0.05), ('copies', 0.049), ('shared', 0.049), ('data', 0.049), ('charniak', 0.048), ('ma', 0.046), ('features', 0.045), ('superscript', 0.045), ('uw', 0.045), ('sagae', 0.044), ('eq', 0.044), ('mappings', 0.043), ('prague', 0.043), ('czech', 0.041), ('koby', 0.041), ('mcclosky', 0.041), ('koo', 0.038), ('come', 0.038), ('converted', 0.037), ('fernando', 0.037), ('seattle', 0.036), ('surdeanu', 0.036), ('republic', 0.036), ('refers', 0.036), ('choice', 0.036), ('train', 0.035), ('ichi', 0.035), ('treebanks', 0.034), ('yamada', 0.033), ('thresholds', 0.033), ('usa', 0.033), ('penn', 0.032), ('differentiating', 0.032), ('zma', 0.032), ('bw', 0.032), ('declines', 0.032), ('marquez', 0.032), ('scholz', 0.032), ('secondorder', 0.032), ('eugene', 0.032), ('singapore', 0.032), ('groups', 0.032), ('sections', 0.031), ('attachment', 0.031), ('wa', 0.031), ('parse', 0.031), ('meeting', 0.03), ('treebank', 0.03), ('rest', 0.03), ('spanning', 0.03), ('beled', 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 112 acl-2013-Dependency Parser Adaptation with Subtrees from Auto-Parsed Target Domain Data

Author: Xuezhe Ma ; Fei Xia

2 0.19966237 343 acl-2013-The Effect of Higher-Order Dependency Features in Discriminative Phrase-Structure Parsing

Author: Greg Coppola ; Mark Steedman

Abstract: Higher-order dependency features are known to improve dependency parser accuracy. We investigate the incorporation of such features into a cube decoding phrase-structure parser. We find considerable gains in accuracy on the range of standard metrics. What is especially interesting is that we find strong, statistically significant gains on dependency recovery on out-of-domain tests (Brown vs. WSJ). This suggests that higher-order dependency features are not simply overfitting the training material.

3 0.19429217 155 acl-2013-Fast and Accurate Shift-Reduce Constituent Parsing

Author: Muhua Zhu ; Yue Zhang ; Wenliang Chen ; Min Zhang ; Jingbo Zhu

Abstract: Shift-reduce dependency parsers give comparable accuracies to their chart-based counterparts, yet the best shift-reduce constituent parsers still lag behind the state-of-the-art. One important reason is the existence of unary nodes in phrase structure trees, which leads to different numbers of shift-reduce actions between different outputs for the same input. This turns out to have a large empirical impact on the framework of global training and beam search. We propose a simple yet effective extension to the shift-reduce process, which eliminates size differences between action sequences in beam-search. Our parser gives comparable accuracies to the state-of-the-art chart parsers. With linear run-time complexity, our parser is over an order of magnitude faster than the fastest chart parser.

4 0.17697978 358 acl-2013-Transition-based Dependency Parsing with Selectional Branching

Author: Jinho D. Choi ; Andrew McCallum

Abstract: We present a novel approach, called selectional branching, which uses confidence estimates to decide when to employ a beam, providing the accuracy of beam search at speeds close to a greedy transition-based dependency parsing approach. Selectional branching is guaranteed to perform a fewer number of transitions than beam search yet performs as accurately. We also present a new transition-based dependency parsing algorithm that gives a complexity of O(n) for projective parsing and an expected linear time speed for non-projective parsing. With the standard setup, our parser shows an unlabeled attachment score of 92.96% and a parsing speed of 9 milliseconds per sentence, which is faster and more accurate than the current state-of-the-art transition-based parser that uses beam search.

5 0.16614959 368 acl-2013-Universal Dependency Annotation for Multilingual Parsing

Author: Ryan McDonald ; Joakim Nivre ; Yvonne Quirmbach-Brundage ; Yoav Goldberg ; Dipanjan Das ; Kuzman Ganchev ; Keith Hall ; Slav Petrov ; Hao Zhang ; Oscar Tackstrom ; Claudia Bedini ; Nuria Bertomeu Castello ; Jungmee Lee

Abstract: We present a new collection of treebanks with homogeneous syntactic dependency annotation for six languages: German, English, Swedish, Spanish, French and Korean. To show the usefulness of such a resource, we present a case study of crosslingual transfer parsing with more reliable evaluation than has been possible before. This ‘universal’ treebank is made freely available in order to facilitate research on multilingual dependency parsing.

6 0.14762983 208 acl-2013-Joint Inference for Heterogeneous Dependency Parsing

7 0.14594525 134 acl-2013-Embedding Semantic Similarity in Tree Kernels for Domain Adaptation of Relation Extraction

8 0.14321376 332 acl-2013-Subtree Extractive Summarization via Submodular Maximization

9 0.12618878 26 acl-2013-A Transition-Based Dependency Parser Using a Dynamic Parsing Strategy

10 0.12609653 204 acl-2013-Iterative Transformation of Annotation Guidelines for Constituency Parsing

11 0.12205931 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation

12 0.11811589 98 acl-2013-Cross-lingual Transfer of Semantic Role Labeling Models

13 0.11352443 362 acl-2013-Turning on the Turbo: Fast Third-Order Non-Projective Turbo Parsers

14 0.10892228 173 acl-2013-Graph-based Semi-Supervised Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging

15 0.10789752 80 acl-2013-Chinese Parsing Exploiting Characters

16 0.10731135 19 acl-2013-A Shift-Reduce Parsing Algorithm for Phrase-based String-to-Dependency Translation

17 0.10640141 13 acl-2013-A New Syntactic Metric for Evaluation of Machine Translation

18 0.10367037 70 acl-2013-Bilingually-Guided Monolingual Dependency Grammar Induction

19 0.10334709 136 acl-2013-Enhanced and Portable Dependency Projection Algorithms Using Interlinear Glossed Text

20 0.10261991 357 acl-2013-Transfer Learning for Constituency-Based Grammars

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.231), (1, -0.132), (2, -0.174), (3, 0.061), (4, -0.074), (5, -0.036), (6, 0.024), (7, -0.008), (8, 0.007), (9, -0.115), (10, 0.036), (11, 0.022), (12, -0.043), (13, 0.101), (14, 0.042), (15, 0.111), (16, -0.115), (17, 0.012), (18, -0.044), (19, 0.029), (20, 0.1), (21, 0.014), (22, 0.032), (23, 0.01), (24, -0.001), (25, 0.058), (26, 0.044), (27, -0.025), (28, -0.053), (29, 0.011), (30, -0.06), (31, 0.14), (32, -0.054), (33, 0.116), (34, 0.05), (35, 0.08), (36, 0.016), (37, -0.187), (38, 0.006), (39, -0.005), (40, -0.009), (41, 0.0), (42, -0.005), (43, -0.093), (44, 0.078), (45, -0.012), (46, 0.004), (47, -0.008), (48, 0.014), (49, 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96100366 112 acl-2013-Dependency Parser Adaptation with Subtrees from Auto-Parsed Target Domain Data

Author: Xuezhe Ma ; Fei Xia

2 0.79931939 343 acl-2013-The Effect of Higher-Order Dependency Features in Discriminative Phrase-Structure Parsing

Author: Greg Coppola ; Mark Steedman

3 0.73165613 335 acl-2013-Survey on parsing three dependency representations for English

Author: Angelina Ivanova ; Stephan Oepen ; Lilja Øvrelid

Abstract: In this paper we focus on practical issues of data representation for dependency parsing. We carry out an experimental comparison of (a) three syntactic dependency schemes; (b) three data-driven dependency parsers; and (c) the influence of two different approaches to lexical category disambiguation (aka tagging) prior to parsing. Comparing parsing accuracies in various setups, we study the interactions of these three aspects and analyze which configurations are easier to learn for a dependency parser.

4 0.71097934 368 acl-2013-Universal Dependency Annotation for Multilingual Parsing

5 0.7109741 208 acl-2013-Joint Inference for Heterogeneous Dependency Parsing

Author: Guangyou Zhou ; Jun Zhao

Abstract: This paper is concerned with the problem of heterogeneous dependency parsing. In this paper, we present a novel joint inference scheme, which is able to leverage the consensus information between heterogeneous treebanks in the parsing phase. Different from stacked learning methods (Nivre and McDonald, 2008; Martins et al., 2008), which process the dependency parsing in a pipelined way (e.g., a second level uses the first level outputs), in our method, multiple dependency parsing models are coordinated to exchange consensus information. We conduct experiments on Chinese Dependency Treebank (CDT) and Penn Chinese Treebank (CTB), experimental results show that joint inference can bring significant improvements to all state-of-the-art dependency parsers.

6 0.67904127 155 acl-2013-Fast and Accurate Shift-Reduce Constituent Parsing

7 0.67136657 204 acl-2013-Iterative Transformation of Annotation Guidelines for Constituency Parsing

8 0.66944593 94 acl-2013-Coordination Structures in Dependency Treebanks

9 0.6557917 331 acl-2013-Stop-probability estimates computed on a large corpus improve Unsupervised Dependency Parsing

10 0.64987874 362 acl-2013-Turning on the Turbo: Fast Third-Order Non-Projective Turbo Parsers

11 0.64668018 134 acl-2013-Embedding Semantic Similarity in Tree Kernels for Domain Adaptation of Relation Extraction

12 0.63491142 358 acl-2013-Transition-based Dependency Parsing with Selectional Branching

13 0.61157578 26 acl-2013-A Transition-Based Dependency Parser Using a Dynamic Parsing Strategy

14 0.60805404 288 acl-2013-Punctuation Prediction with Transition-based Parsing

15 0.56692255 28 acl-2013-A Unified Morpho-Syntactic Scheme of Stanford Dependencies

16 0.56127697 19 acl-2013-A Shift-Reduce Parsing Algorithm for Phrase-based String-to-Dependency Translation

17 0.55525696 132 acl-2013-Easy-First POS Tagging and Dependency Parsing with Beam Search

18 0.54304206 315 acl-2013-Semi-Supervised Semantic Tagging of Conversational Understanding using Markov Topic Regression

19 0.53022361 277 acl-2013-Part-of-speech tagging with antagonistic adversaries

20 0.52808815 222 acl-2013-Learning Semantic Textual Similarity with Structural Representations

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.061), (6, 0.029), (11, 0.095), (24, 0.027), (26, 0.035), (28, 0.309), (35, 0.044), (42, 0.069), (48, 0.067), (70, 0.036), (88, 0.037), (90, 0.021), (95, 0.089)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.92854112 349 acl-2013-The mathematics of language learning

Author: Andras Kornai ; Gerald Penn ; James Rogers ; Anssi Yli-Jyra

Abstract: unknown-abstract

2 0.78722847 124 acl-2013-Discriminative state tracking for spoken dialog systems

Author: Angeliki Metallinou ; Dan Bohus ; Jason Williams

Abstract: In spoken dialog systems, statistical state tracking aims to improve robustness to speech recognition errors by tracking a posterior distribution over hidden dialog states. Current approaches based on generative or discriminative models have different but important shortcomings that limit their accuracy. In this paper we discuss these limitations and introduce a new approach for discriminative state tracking that overcomes them by leveraging the problem structure. An offline evaluation with dialog data collected from real users shows improvements in both state tracking accuracy and the quality of the posterior probabilities. Features that encode speech recognition error patterns are particularly helpful, and training requires relatively few dialogs.

3 0.77520478 363 acl-2013-Two-Neighbor Orientation Model with Cross-Boundary Global Contexts

Author: Hendra Setiawan ; Bowen Zhou ; Bing Xiang ; Libin Shen

Abstract: Long distance reordering remains one of the greatest challenges in statistical machine translation research as the key contextual information may well be beyond the confine of translation units. In this paper, we propose Two-Neighbor Orientation (TNO) model that jointly models the orientation decisions between anchors and two neighboring multi-unit chunks which may cross phrase or rule boundaries. We explicitly model the longest span of such chunks, referred to as Maximal Orientation Span, to serve as a global parameter that constrains underlying local decisions. We integrate our proposed model into a state-of-the-art string-to-dependency translation system and demonstrate the efficacy of our proposal in a large-scale Chinese-to-English translation task. On NIST MT08 set, our most advanced model brings around +2.0 BLEU and -1.0 TER improvement.

same-paper 4 0.7567935 112 acl-2013-Dependency Parser Adaptation with Subtrees from Auto-Parsed Target Domain Data

Author: Xuezhe Ma ; Fei Xia

5 0.75062954 96 acl-2013-Creating Similarity: Lateral Thinking for Vertical Similarity Judgments

Author: Tony Veale ; Guofu Li

Abstract: Just as observing is more than just seeing, comparing is far more than mere matching. It takes understanding, and even inventiveness, to discern a useful basis for judging two ideas as similar in a particular context, especially when our perspective is shaped by an act of linguistic creativity such as metaphor, simile or analogy. Structured resources such as WordNet offer a convenient hierarchical means for converging on a common ground for comparison, but offer little support for the divergent thinking that is needed to creatively view one concept as another. We describe such a means here, by showing how the web can be used to harvest many divergent views for many familiar ideas. These lateral views complement the vertical views of WordNet, and support a system for idea exploration called Thesaurus Rex. We show also how Thesaurus Rex supports a novel, generative similarity measure for WordNet. 1 Seeing is Believing (and Creating) Similarity is a cognitive phenomenon that is both complex and subjective, yet for practical reasons it is often modeled as if it were simple and objective. This makes sense for the many situations where we want to align our similarity judgments with those of others, and thus focus on the same conventional properties that others are also likely to focus upon. This reliance on the consensus viewpoint explains why WordNet (Fellbaum, 1998) has proven so useful as a basis for computational measures of lexico-semantic similarity (e.g. see Pederson et al. 2004, Budanitsky & Hirst, 2006; Seco et al. 2006). [Author block: Guofu Li, School of Computer Science and Informatics, University College Dublin, Belfield, Dublin D2, Ireland. l .guo fu . l gmai l i @ .com] These measures reduce the similarity of two lexical concepts to a single number, by viewing similarity as an objective estimate of the overlap in their salient qualities. 
This convenient perspective is poorly suited to creative or insightful comparisons, but it is sufficient for the many mundane comparisons we often perform in daily life, such as when we organize books or look for items in a supermarket. So if we do not know in which aisle to locate a given item (such as oatmeal), we may tacitly know how to locate a similar product (such as cornflakes) and orient ourselves accordingly. Yet there are occasions when the recognition of similarities spurs the creation of similarities, when the act of comparison spurs us to invent new ways of looking at an idea. By placing pop tarts in the breakfast aisle, food manufacturers encourage us to view them as a breakfast food that is not dissimilar to oatmeal or cornflakes. When ex-PM Tony Blair published his memoirs, a mischievous activist encouraged others to move his book from Biography to Fiction in bookshops, in the hope that buyers would see it in a new light. Whenever we use a novel metaphor to convey a non-obvious viewpoint on a topic, such as “cigarettes are time bombs”, the comparison may spur us to insight, to see aspects of the topic that make it more similar to the vehicle (see Ortony, 1979; Veale & Hao, 2007). In formal terms, assume agent A has an insight about concept X, and uses the metaphor X is a Y to also provoke this insight in agent B. To arrive at this insight for itself, B must intuit what X and Y have in common. But this commonality is surely more than a standard categorization of X, or else it would not count as an insight about X. To understand the metaphor, B must place X in a new category, so that X can be seen as more similar to Y. Metaphors shape the way we perceive the world by re-shaping the way we make similarity judgments.

Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 660–670, Sofia, Bulgaria, August 4-9 2013. ©2013 Association for Computational Linguistics
So if we want to imbue computers with the ability to make and to understand creative metaphors, we must first give them the ability to look beyond the narrow viewpoints of conventional resources. Any measure that models similarity as an objective function of a conventional worldview employs a convergent thought process. Using WordNet, for instance, a similarity measure can vertically converge on a common superordinate category of both inputs, and generate a single numeric result based on their distance to, and the information content of, this common generalization. So to find the most conventional ways of seeing a lexical concept, one simply ascends a narrowing concept hierarchy, using a process de Bono (1970) calls vertical thinking. To find novel, non-obvious and useful ways of looking at a lexical concept, one must use what Guilford (1967) calls divergent thinking and what de Bono calls lateral thinking. These processes cut across familiar category boundaries, to simultaneously place a concept in many different categories so that we can see it in many different ways. de Bono argues that vertical thinking is selective while lateral thinking is generative. Whereas vertical thinking concerns itself with the “right” way or a single “best” way of looking at things, lateral thinking focuses on producing alternatives to the status quo. To be as useful for creative tasks as they are for conventional tasks, we need to re-imagine our computational similarity measures as generative rather than selective, expansive rather than reductive, divergent as well as convergent and lateral as well as vertical. Though WordNet is ideally structured to support vertical, convergent reasoning, its comprehensive nature means it can also be used as a solid foundation for building a more lateral and divergent model of similarity. Here we will use the web as a source of diverse perspectives on familiar ideas, to complement the conventional and often narrow views codified by WordNet. 
Section 2 provides a brief overview of past work in the area of similarity measurement, before section 3 describes a simple bootstrapping loop for acquiring richly diverse perspectives from the web for a wide variety of familiar ideas. These perspectives are used to enhance a WordNet-based measure of lexico-semantic similarity in section 4, by broadening the range of informative viewpoints the measure can select from. Similarity is thus modeled as a process that is both generative and selective. This lateral-and-vertical approach is evaluated in section 5, on the Miller & Charles (1991) data-set. A web app for the lateral exploration of diverse viewpoints, named Thesaurus Rex, is also presented, before closing remarks are offered in section 6.

2 Related Work and Ideas

WordNet’s taxonomic organization of noun-senses and verb-senses – in which very general categories are successively divided into increasingly informative sub-categories or instance-level ideas – allows us to gauge the overlap in information content, and thus of meaning, of two lexical concepts. We need only identify the deepest point in the taxonomy at which this content starts to diverge. This point of divergence is often called the LCS, or least common subsumer, of two concepts (Pedersen et al., 2004). Since sub-categories add new properties to those they inherit from their parents – Aristotle called these properties the differentia that stop a category system from trivially collapsing into itself – the depth of a lexical concept in a taxonomy is an intuitive proxy for its information content. Wu & Palmer (1994) use the depth of a lexical concept in the WordNet hierarchy as such a proxy, and thereby estimate the similarity of two lexical concepts as twice the depth of their LCS divided by the sum of their individual depths. Leacock and Chodorow (1998) instead use the length of the shortest path between two concepts as a proxy for the conceptual distance between them.
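The Wu & Palmer measure described above can be illustrated with a minimal sketch. The taxonomy below is a hypothetical toy hierarchy built from the supermarket examples in section 1, not WordNet itself, and the function names are chosen for illustration only. Depth is counted in nodes, so the root has depth 1; that counting convention is an assumption.

```python
# Toy is-a taxonomy (hypothetical, not WordNet): child -> parent.
PARENT = {
    "entity": None,
    "food": "entity",
    "cereal": "food",
    "oatmeal": "cereal",
    "cornflakes": "cereal",
    "pastry": "food",
    "pop_tart": "pastry",
}


def ancestors(concept):
    """Return the path from concept up to the root, concept first."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def depth(concept):
    """Depth counted in nodes, so the root has depth 1 (an assumption)."""
    return len(ancestors(concept))


def lcs(a, b):
    """Least common subsumer: the deepest ancestor shared by a and b."""
    seen = set(ancestors(a))
    for node in ancestors(b):  # walks upward, so the first hit is the deepest
        if node in seen:
            return node
    raise ValueError("concepts share no ancestor")


def wu_palmer(a, b):
    """Twice the depth of the LCS divided by the sum of the two depths."""
    return 2 * depth(lcs(a, b)) / (depth(a) + depth(b))


print(wu_palmer("oatmeal", "cornflakes"))  # LCS is "cereal"
print(wu_palmer("oatmeal", "pop_tart"))    # LCS is "food"
```

In this toy hierarchy oatmeal and cornflakes meet at a deeper category than oatmeal and pop tarts do, so the first pair receives the higher score. This is the vertical, convergent behaviour the text attributes to such measures.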
To connect any two ideas in a hierarchical system, one must vertically ascend the hierarchy from one concept, change direction at a potential LCS, and then descend the hierarchy to reach the second concept. (Aristotle was also first to suggest this approach in his Poetics.) Leacock and Chodorow normalize the length of this path by dividing its size (in nodes) by twice the depth of the deepest concept in the hierarchy; the latter is an upper bound on the distance between any two concepts in the hierarchy. Negating the log of this normalized length yields a corresponding similarity score. While the role of an LCS is merely implied in Leacock and Chodorow’s use of a shortest path, the LCS is pivotal nonetheless, and like that of Wu & Palmer, the approach uses an essentially vertical reasoning process to identify a single “best” generalization. Depth is a convenient proxy for information content, but more nuanced proxies can yield more rounded similarity measures. Resnik (1995) draws on information theory to define the information content of a lexical concept as the negative log likelihood of its occurrence in a corpus, either explicitly (via a direct mention) or by presupposition (via a mention of any of its sub-categories or instances). Since the likelihood of a general category occurring in a corpus is higher than that of any of its sub-categories or instances, such categories are more predictable, and less informative, than rarer categories whose occurrences are less predictable and thus more informative. The negative log likelihood of the most informative LCS of two lexical concepts offers a reliable estimate of the amount of information shared by those concepts, and thus a good estimate of their similarity.
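The path-based and corpus-based measures described above can be sketched on a toy example. Everything concrete here is an assumption made for illustration: the small taxonomy, the invented corpus counts, and the function names. A concept's occurrence count includes the mentions of all of its descendants, which is how presupposition is modeled in this sketch.

```python
import math

# Toy is-a taxonomy (hypothetical, not WordNet): child -> parent.
PARENT = {
    "entity": None, "food": "entity", "cereal": "food", "oatmeal": "cereal",
    "cornflakes": "cereal", "pastry": "food", "pop_tart": "pastry",
}

# Invented direct-mention counts for each concept in an imaginary corpus.
COUNTS = {
    "entity": 100, "food": 10, "cereal": 5, "oatmeal": 20,
    "cornflakes": 25, "pastry": 10, "pop_tart": 30,
}


def ancestors(concept):
    """Return the path from concept up to the root, concept first."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def lcs(a, b):
    """Deepest ancestor shared by a and b in this single-parent tree."""
    seen = set(ancestors(a))
    return next(node for node in ancestors(b) if node in seen)


def leacock_chodorow(a, b):
    """Negated log of (path length in nodes) / (2 * maximum depth)."""
    max_depth = max(len(ancestors(c)) for c in PARENT)
    shared = len(ancestors(lcs(a, b)))
    path_nodes = len(ancestors(a)) + len(ancestors(b)) - 2 * shared + 1
    return -math.log(path_nodes / (2 * max_depth))


def information_content(concept):
    """Negative log likelihood of a concept, counting descendant mentions."""
    cumulative = dict.fromkeys(PARENT, 0)
    for node, n in COUNTS.items():
        for anc in ancestors(node):  # a mention also counts for every ancestor
            cumulative[anc] += n
    return -math.log(cumulative[concept] / cumulative["entity"])


def resnik(a, b):
    """Information content of the LCS of a and b."""
    return information_content(lcs(a, b))


print(leacock_chodorow("oatmeal", "cornflakes"), resnik("oatmeal", "cornflakes"))
print(leacock_chodorow("oatmeal", "pop_tart"), resnik("oatmeal", "pop_tart"))
```

Because the toy hierarchy is a tree, each pair has exactly one LCS. In a hierarchy with multiple inheritance, the text's "most informative LCS" would require taking the maximum information content over all shared subsumers.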
Lin (1998) combines the intuitions behind Resnik’s metric and that of Wu and Palmer to estimate the similarity of two lexical concepts as an information ratio: twice the information content of their LCS divided by the sum of their individual information contents. Jiang and Conrath (1997) consider the converse notion of dissimilarity, noting that two lexical concepts are dissimilar to the extent that each contains information that is not shared by the other. So if the information content of their most informative LCS is a good measure of what they do share, then the sum of their individual information contents, minus twice the content of their most informative LCS, is a reliable estimate of their dissimilarity. Seco et al. (2006) present a minor innovation, showing how Resnik’s notion of information content can be calculated without the use of an external corpus. Rather, when using Resnik’s metric (or that of Lin, or Jiang and Conrath) for measuring the similarity of lexical concepts in WordNet, one can use the category structure of WordNet itself to estimate information content. Typically, the more general a concept, the more descendants it will possess. Seco et al. thus estimate the information content of a lexical concept as the log of the sum of all its unique descendants (both direct and indirect), divided by the log of the total number of concepts in the entire hierarchy. This intrinsic view of information content is convenient to use, as it needs no recourse to an external corpus, and Seco et al. also show that it offers a better estimate of information content than its extrinsic, corpus-based alternatives, as measured relative to average human similarity ratings for the 30 word-pairs in the Miller & Charles (1991) test set. A similarity measure can draw on other sources of information besides WordNet’s category structures.
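The information-ratio measures and the intrinsic estimate of information content described above can be sketched together. The taxonomy is a hypothetical toy hierarchy and the function names are illustrative. One further assumption deserves emphasis: the text describes the intrinsic estimate as a ratio of logs, whereas this sketch subtracts that ratio from 1 (and adds 1 to the descendant count to avoid log 0), which is the form in which the Seco et al. measure is usually stated, so that leaf concepts score 1 and the root scores 0.

```python
import math

# Toy is-a taxonomy (hypothetical, not WordNet): child -> parent.
PARENT = {
    "entity": None, "food": "entity", "cereal": "food", "oatmeal": "cereal",
    "cornflakes": "cereal", "pastry": "food", "pop_tart": "pastry",
}


def ancestors(concept):
    """Return the path from concept up to the root, concept first."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def lcs(a, b):
    """Deepest ancestor shared by a and b in this single-parent tree."""
    seen = set(ancestors(a))
    return next(node for node in ancestors(b) if node in seen)


def descendant_count(concept):
    """Number of distinct concepts strictly below `concept`."""
    return sum(1 for node in PARENT if concept in ancestors(node)[1:])


def intrinsic_ic(concept):
    """Assumed form: 1 - log(descendants + 1) / log(total concepts)."""
    return 1 - math.log(descendant_count(concept) + 1) / math.log(len(PARENT))


def lin(a, b):
    """Twice the IC of the LCS over the sum of the individual ICs."""
    return 2 * intrinsic_ic(lcs(a, b)) / (intrinsic_ic(a) + intrinsic_ic(b))


def jiang_conrath_distance(a, b):
    """Sum of the individual ICs minus twice the IC of the LCS."""
    return intrinsic_ic(a) + intrinsic_ic(b) - 2 * intrinsic_ic(lcs(a, b))


for pair in [("oatmeal", "cornflakes"), ("oatmeal", "pop_tart")]:
    print(pair, lin(*pair), jiang_conrath_distance(*pair))
```

Any other estimate of information content, such as a corpus-based one, could be substituted for `intrinsic_ic` without changing the two ratio functions.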
One might eke out additional information from WordNet’s textual glosses, as in Lesk (1986), or use category structures other than those offered by WordNet. Looking beyond WordNet, entries in the online encyclopedia Wikipedia are not only connected by a dense topology of lateral links, they are also organized by a rich hierarchy of overlapping categories. Strube and Ponzetto (2006) show how Wikipedia can support a measure of similarity (and relatedness) that better approximates human judgments than many WordNet-based measures. Nonetheless, WordNet can be a valuable component of a hybrid measure, and Agirre et al. (2009) use an SVM (support vector machine) to combine information from WordNet with information harvested from the web. Their best similarity measure achieves a remarkable 0.93 correlation with human judgments on the Miller & Charles word-pair set. Similarity is not always applied to pairs of concepts; it is sometimes analogically applied to pairs of pairs of concepts, as in proportional analogies of the form A is to B as C is to D (e.g., hacks are to writers as mercenaries are to soldiers, or chisels are to sculptors as scalpels are to surgeons). In such analogies, one is really assessing the similarity of the unstated relationship between each pair of concepts: thus, mercenaries are soldiers whose allegiance is paid for, much as hacks are writers with income-driven loyalties; sculptors use chisels to carve stone, while surgeons use scalpels to cut or carve flesh. Veale (2004) used WordNet to assess the similarity of A:B to C:D as a function of the combined similarity of A to C and of B to D. In contrast, Turney (2005) used the web to pursue a more divergent course, to represent the tacit relationships of A to B and of C to D as points in a high-dimensional space. The dimensions of this space initially correspond to linking phrases on the web, before these dimensions are significantly reduced using singular value decomposition.
In the infamous SAT test, an analogy A:B::C:D has four other pairs of concepts that serve as likely distractors (e.g. singer:songwriter for hack:writer) and the goal is to choose the most appropriate C:D pair for a given A:B pairing. Using variants of Wu and Palmer (1994) on the 374 SAT analogies of Turney (2005), Veale (2004) reports a success rate of 38–44% using only WordNet-based similarity. In contrast, Turney (2005) reports up to 55% success on the same analogies, partly because his approach aims to match implicit relations rather than explicit concepts, and in part because it uses a divergent process to gather from the web as rich a perspective as it can on these latent relationships.

2.1 Clever Comparisons Create Similarity

Each of these approaches to similarity is a user of information, rather than a creator, and each fails to capture how a creative comparison (such as a metaphor) can spur a listener to view a topic from an atypical perspective. Camac & Glucksberg (1984) provide experimental evidence for the claim that “metaphors do not use preexisting associations to achieve their effects […] people use metaphors to create new relations between concepts.” They also offer a salutary reminder of an often overlooked fact: every comparison exploits information, but each is also a source of new information in its own right. Thus, “this cola is acid” reveals a different perspective on cola (e.g. as a corrosive substance or an irritating food) than “this acid is cola” highlights for acid (e.g. as a familiar substance). Veale & Keane (1994) model the role of similarity in realizing the long-term perlocutionary effect of an informative comparison. For example, to compare surgeons to butchers is to encourage one to see all surgeons as more bloody, … crude or careless. The reverse comparison, of butchers to surgeons, encourages one to see butchers as more skilled and precise.
Veale & Keane present a network model of memory, called Sapper, in which activation can spread between related concepts, thus allowing one concept to prime the properties of a neighbor. To interpret an analogy, Sapper lays down new activation-carrying bridges in memory between analogical counterparts, such as between surgeon & butcher, flesh & meat, and scalpel & cleaver. Comparisons can thus have lasting effects on how Sapper sees the world, changing the pattern of activation that arises when it primes a concept. Veale (2003) adopts a similarly dynamic view of similarity in WordNet, showing how an analogical comparison can result in the automatic addition of new categories and relations to WordNet itself. Veale considers the problem of finding an analogical mapping between different parts of WordNet’s noun-sense hierarchy, such as between instances of Greek god and Norse god, or between the letters of different alphabets, such as of Greek and Hebrew. But no structural similarity measure for WordNet exhibits enough discernment to e.g. assign a higher similarity to Zeus & Odin (each is the supreme deity of its pantheon) than to a pairing of Zeus and any other Norse god, just as no structural measure will assign a higher similarity to Alpha & Aleph or to Beta & Beth than to any random letter pairing. A fine-grained category hierarchy permits fine-grained similarity judgments, and though WordNet is useful, its sense hierarchies are not especially fine-grained. However, we can automatically make WordNet subtler and more discerning, by adding new fine-grained categories to unite lexical concepts whose similarity is not reflected by any existing categories. Veale (2003) shows how a property that is found in the glosses of two lexical concepts, of the same depth, can be combined with their LCS to yield a new fine-grained parent category, so e.g. “supreme” + deity = Supreme-deity (for Odin, Zeus, Jupiter, etc.) and “1st” + letter = 1st-letter (for Alpha, Aleph, etc.).
Selected aspects of the textual similarity of two WordNet glosses – the key to similarity in Lesk (1986) – can thus be reified into an explicitly categorical WordNet form.

3 Divergent (Re)Categorization

To tap into a richer source of concept properties than WordNet’s glosses, we can use web n-grams. Consider these descriptions of a cowboy from the Google n-grams (Brants & Franz, 2006). The numbers to the right are Google frequency counts.

a lonesome cowboy      432
a mounted cowboy       122
a grizzled cowboy       74
a swaggering cowboy     68

To find the stable properties that can underpin a meaningful fine-grained category for cowboy, we must seek out the properties that are so often presupposed to be salient of all cowboys that one can use them to anchor a simile, such as

6 0.73557091 107 acl-2013-Deceptive Answer Prediction with User Preference Graph

7 0.65073919 27 acl-2013-A Two Level Model for Context Sensitive Inference Rules

8 0.55014449 155 acl-2013-Fast and Accurate Shift-Reduce Constituent Parsing

9 0.5423311 358 acl-2013-Transition-based Dependency Parsing with Selectional Branching

10 0.53571129 332 acl-2013-Subtree Extractive Summarization via Submodular Maximization

11 0.53397739 343 acl-2013-The Effect of Higher-Order Dependency Features in Discriminative Phrase-Structure Parsing

12 0.53021294 17 acl-2013-A Random Walk Approach to Selectional Preferences Based on Preference Ranking and Propagation

13 0.52811319 16 acl-2013-A Novel Translation Framework Based on Rhetorical Structure Theory

14 0.52753288 280 acl-2013-Plurality, Negation, and Quantification:Towards Comprehensive Quantifier Scope Disambiguation

15 0.5239144 133 acl-2013-Efficient Implementation of Beam-Search Incremental Parsers

16 0.52364844 9 acl-2013-A Lightweight and High Performance Monolingual Word Aligner

17 0.52238166 335 acl-2013-Survey on parsing three dependency representations for English

18 0.52089953 202 acl-2013-Is a 204 cm Man Tall or Small ? Acquisition of Numerical Common Sense from the Web

19 0.51995695 208 acl-2013-Joint Inference for Heterogeneous Dependency Parsing

20 0.51891518 267 acl-2013-PARMA: A Predicate Argument Aligner