acl acl2011 acl2011-59 knowledge-graph by maker-knowledge-mining

59 acl-2011-Better Automatic Treebank Conversion Using A Feature-Based Approach


Source: pdf

Author: Muhua Zhu ; Jingbo Zhu ; Minghan Hu

Abstract: For the task of automatic treebank conversion, this paper presents a feature-based approach which encodes bracketing structures in a treebank into features to guide the conversion of this treebank to a different standard. Experiments on two Chinese treebanks show that our approach improves conversion accuracy by 1.31% over a strong baseline.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 For the task of automatic treebank conversion, this paper presents a feature-based approach which encodes bracketing structures in a treebank into features to guide the conversion of this treebank to a different standard. [sent-6, score-1.856]

2 Experiments on two Chinese treebanks show that our approach improves conversion accuracy by 1.31% over a strong baseline. [sent-7, score-0.591]

3 1 Introduction. In the field of syntactic parsing, research effort has been devoted to the task of automatic conversion of a treebank (the source treebank) to fit a different standard, which is exhibited by another treebank (the target treebank). [sent-9, score-0.912]

4 Treebank conversion is desirable primarily because source-style and target-style annotations exist for non-overlapping text samples so that a larger target-style treebank can be obtained through such conversion. [sent-10, score-0.687]

5 Hereafter, the source and target treebanks are referred to as heterogeneous treebanks due to their different annotation standards. [sent-11, score-0.389]

6 In this paper, we focus on the scenario of conversion between phrase-structure heterogeneous treebanks (Wang et al.). [sent-12, score-0.966]

7 Due to the availability of annotation in a source treebank, it is natural to use such annotation to guide treebank conversion. [sent-14, score-0.363]

8 Fig. 1 depicts a sentence annotated with the standards of the Tsinghua Chinese Treebank (TCT) (Zhou, 1996) and the Penn Chinese Treebank (CTB) (Xue et al.). [sent-16, score-0.049]

9 Suppose that the conversion is in the direction from the TCT-style parse (left side) to the CTB-style parse (right side). [sent-18, score-0.682]

10 The constituents vp: [将/will 投降/surrender], dj: [敌人/enemy 将/will 投降/surrender], and np: [情报/intelligence 专家/experts] [sent-19, score-0.203]

11 in the TCT-style parse strongly suggest that a resulting CTB-style parse also bracket these words as constituents. [sent-22, score-0.22]

12 Zhu and Zhu (2010) show the effectiveness of using bracketing structures in a source treebank (source-side bracketing structures for short) as parsing constraints during the decoding phase of a target treebank-based parser. [sent-23, score-0.825]

13 However, using source-side bracketing structures as parsing constraints is problematic in some cases. [sent-24, score-0.348]

14 As shown in Fig. 1, the TCT-style parse takes “认为/deems” as the right boundary of a constituent, while in the CTB-style parse, “认为” is the left boundary of a constituent. [sent-26, score-0.266]

15 According to the criteria used in Zhu and Zhu (2010), any CTB-style constituent with “认为” as its left boundary is thought to be inconsistent with the bracketing structure of the TCT-style parse and will be pruned. [sent-27, score-0.588]

16 However, if we prune such “inconsistent” constituents, the correct conversion result (right side of Fig. 1) can never be generated. [sent-28, score-0.5]

17 The problem comes from the binary distinctions used in the approach of Zhu and Zhu (2010). [sent-30, score-0.106]

18 With binary distinctions, constituents generated by a target treebank-based parser are judged to be either consistent or inconsistent with source-side bracketing structures. [sent-31, score-0.652]

19 That approach prunes inconsistent constituents which might nonetheless be correct conversion results. [sent-32, score-0.695]
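
To make this binary distinction concrete, the consistency test can be phrased as a crossing-brackets check over word-index spans. A minimal sketch (the span representation and the example spans are ours, not taken from the paper):

```python
def crosses(span, bracket):
    """True iff two half-open word-index spans overlap without nesting."""
    (i, j), (k, l) = span, bracket
    return i < k < j < l or k < i < l < j

def is_consistent(candidate, source_spans):
    """The binary distinction: keep a candidate constituent only if it
    crosses no bracket of the source-side parse; otherwise prune it."""
    return not any(crosses(candidate, b) for b in source_spans)

# Hypothetical spans in the spirit of Fig. 1: the candidate (1, 3) crosses
# the source bracket (0, 2), so it would be pruned even if it were correct.
source_spans = {(0, 2), (2, 5), (0, 5)}
print(is_consistent((1, 3), source_spans))  # False
print(is_consistent((2, 5), source_spans))  # True
```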

20 In this paper, we insist on using source-side bracketing structures as guiding information. [sent-33, score-0.307]

21 To achieve such a goal, we propose to use a feature-based approach to treebank conversion and to encode source-side bracketing structures as a set of features. (Footnote 1: to show how severe this problem might be, Section 3.1 reports statistics on such inconsistent constituents.) [sent-35, score-0.998]

22 The advantage is that inconsistent constituents can be scored with a function based on the features rather than ruled out as impossible. [sent-41, score-0.269]

23 To test the efficacy of our approach, we conduct experiments on conversion from TCT to CTB. [sent-42, score-0.462]

24 Results show a 1.31% absolute improvement in conversion accuracy over the approach used in Zhu and Zhu (2010). [sent-44, score-0.462]

25 2.1 Generic System Architecture. To conduct treebank conversion, our approach, generally speaking, proceeds in the following steps; a code sketch of the whole pipeline follows Step 3 below. [sent-46, score-0.255]

26 Step 1: Build a parser (named source parser) on a source treebank, and use it to parse sentences in the training data of a target treebank. [sent-47, score-0.353]

27 Step 2: Build a parser on pairs of golden target-style and auto-assigned (in Step 1) source-style parses in the training data of the target treebank. [sent-48, score-0.335]

28 Such a parser is named a heterogeneous parser since it incorporates information derived from both the source and target treebanks, which follow different annotation standards. [sent-49, score-0.76]

29 Step 3: In the testing phase, the heterogeneous parser takes golden source-style parses as input and conducts treebank conversion. [sent-50, score-0.921]
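
Read procedurally, Steps 1-3 form a small pipeline. A hedged sketch of the control flow, with all function and object names as placeholders rather than the authors' actual code:

```python
def convert_treebank(source_treebank, target_train, test_source_parses,
                     train_source_parser, train_hetero_parser):
    """Three-step treebank conversion, following Steps 1-3 above."""
    # Step 1: build a source parser, then parse the target-side training sentences.
    source_parser = train_source_parser(source_treebank)
    auto_source = [source_parser.parse(sent) for sent, _ in target_train]

    # Step 2: train a heterogeneous parser on (gold target-style parse,
    # auto-assigned source-style parse) pairs.
    pairs = [(gold, auto) for (_, gold), auto in zip(target_train, auto_source)]
    hetero_parser = train_hetero_parser(pairs)

    # Step 3: at test time, convert gold source-style parses to target style.
    return [hetero_parser.convert(src) for src in test_source_parses]
```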

30 To instantiate the generic framework described above, we need to decide the following three factors: (1) a parsing model for building a source parser, (2) a parsing model for building a heterogeneous parser, and (3) features for building a heterogeneous parser. [sent-53, score-0.987]

31 In principle, any off-the-shelf parser can be used to build a source parser, so we focus only on the latter two factors. [sent-54, score-0.076]

32 To build a heterogeneous parser, we use feature-based parsing algorithms in order to easily incorporate features that encode source-side bracketing structures. [sent-55, score-0.761]

33 Theoretically, any feature-based approach is applicable, such as that of Finkel et al. [sent-56, score-0.027]

34 In this paper, we use the shift-reduce parsing algorithm for its simplicity and competitive performance. [sent-59, score-0.068]

35 2.2 Shift-Reduce-Based Heterogeneous Parser. The heterogeneous parser used in this paper is based on the shift-reduce parsing algorithm described in Sagae and Lavie (2006a) and Wang et al. [sent-61, score-0.57]

36 Shift-reduce parsing is a state transition process, where a state is defined to be a tuple ⟨S, Q⟩, with S a stack of partial parses and Q a queue of the remaining input words. [sent-63, score-0.14]

37 At each state transition, a shift-reduce parser either shifts the top item of Q onto S, or reduces the top one (or two) items on S. [sent-66, score-0.198]
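
A toy rendering of these ⟨S, Q⟩ transitions, with trees as nested tuples; in the real parser the action at each step would come from the trained classifier rather than a fixed script:

```python
def shift(state):
    """Move the front item of the queue onto the stack."""
    stack, queue = state
    return stack + [queue[0]], queue[1:]

def reduce_(state, label, arity):
    """Combine the top `arity` stack items into one labelled subtree."""
    stack, queue = state
    return stack[:-arity] + [(label, *stack[-arity:])], queue

# Scripted run on a two-word queue: shift, shift, then a binary reduce.
state = ([], ["情报", "专家"])
state = shift(state)
state = shift(state)
state = reduce_(state, "NP", 2)
print(state)  # ([('NP', '情报', '专家')], [])
```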

38 A shift-reduce-based heterogeneous parser proceeds similarly to the standard shift-reduce parsing algorithm. [sent-67, score-0.6]

39 In the training phase, each target-style parse tree in the training data is transformed into a binary tree (Charniak et al.). [sent-68, score-0.147]
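
The binary-tree transform mentioned here can be realized with standard left-branching binarization. A sketch under the assumption that trees are (label, child, ...) tuples; the asterisk marking of intermediate nodes is our convention, not necessarily the paper's:

```python
def binarize(tree):
    """Left-branching binarization of an n-ary tree; introduces
    intermediate nodes labelled 'X*' so the transform is reversible."""
    if isinstance(tree, str):          # leaf (a word)
        return tree
    label, *children = tree
    children = [binarize(c) for c in children]
    while len(children) > 2:
        children = [(label + "*", children[0], children[1])] + children[2:]
    return (label, *children)

print(binarize(("VP", "a", "b", "c")))  # ('VP', ('VP*', 'a', 'b'), 'c')
```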

40 A classifier can be trained on the set of action-states, where each state is represented as a feature vector. [sent-70, score-0.078]
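
Training on action-states is then ordinary multiclass classification over feature vectors. A hedged sketch using scikit-learn's logistic regression as a maximum entropy model (Section 3.2 names maximum entropy; the feature dictionaries below are invented for illustration):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# One instance per state: (feature dict, oracle action at that state).
train = [
    ({"s1.tag": "NN", "q1.word": "将", "Fc(s1,ts)": "+"}, "SHIFT"),
    ({"s1.tag": "NN", "s2.tag": "NN", "Fc(s1◦s2,ts)": "+"}, "REDUCE-NP"),
]
vec = DictVectorizer()
X = vec.fit_transform([f for f, _ in train])
y = [a for _, a in train]
clf = LogisticRegression(max_iter=1000).fit(X, y)  # maxent-style classifier

print(clf.predict(vec.transform([{"s1.tag": "NN", "q1.word": "将"}])))
```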

41 In the testing phase, the trained classifier is used to choose actions for state transition. [sent-71, score-0.066]

42 Moreover, beam search strategies can be used to expand the search space of a shift-reduce-based heterogeneous parser (Sagae and Lavie, 2006a). [sent-72, score-0.531]
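
Beam search here simply keeps the k best-scoring parser states after each transition step. A schematic sketch in which the actions, apply_action, and score callables are placeholders:

```python
import heapq

def beam_search(initial_state, actions, apply_action, score, steps, k=8):
    """Keep the k best (score, state) items after each transition step."""
    beam = [(0.0, initial_state)]
    for _ in range(steps):
        candidates = [(s + score(state, a), apply_action(state, a))
                      for s, state in beam for a in actions(state)]
        beam = heapq.nlargest(k, candidates, key=lambda c: c[0])
    return max(beam, key=lambda c: c[0])
```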

43 To incorporate information on source-side bracketing structures, in both the training and testing phases, feature vectors representing states ⟨S, Q⟩ are augmented with features that bridge the current state and the corresponding source-style parse. [sent-73, score-0.319]

44 2.3 Features. This section describes the feature functions used to build a heterogeneous parser on the training data of a target treebank. [sent-75, score-0.628]

45 The first group of features is derived solely from target-style parse trees, so they are referred to as target-side features. [sent-77, score-0.257]

46 This group of features is completely identical to those used in Sagae and Lavie (2006a). [sent-78, score-0.036]

47 In addition, we have features extracted jointly from target-style and source-style parse trees. [sent-79, score-0.146]

48 These features are generated by consulting a source-style parse (referred to as ts) while we decompose a target-style parse into an action-state sequence. [sent-80, score-0.256]

49 Here, si denotes the ith item from the top of the stack, and qi denotes the ith item from the front of the queue. [sent-81, score-0.17]

50 Constituent features Fc(si, ts). This feature schema covers three feature functions: Fc(s1, ts), Fc(s2, ts), and Fc(s1 ◦ s2, ts), which decide whether partial parses on stack S correspond to a constituent in the source-style parse ts. [sent-83, score-0.355]

51 That is, Fc(si, ts) = + if si has a bracketing match (ignoring grammar labels) with some constituent in ts, and − otherwise. [sent-84, score-0.315]
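
In span terms, Fc is a membership test against the (unlabelled) bracketing of ts. A minimal sketch, with ts represented as a set of half-open word spans and the example spans chosen to mimic the Fig. 1 configuration (s2 and s1 ◦ s2 match, s1 does not):

```python
def f_c(span, ts_spans):
    """Fc(si, ts) = '+' iff si's span matches a bracket of ts
    (grammar labels ignored), '-' otherwise."""
    return "+" if span in ts_spans else "-"

def concat(a, b):
    """Span of s1 ◦ s2, assuming the two stack items are adjacent."""
    return (min(a[0], b[0]), max(a[1], b[1]))

ts_spans = {(0, 2), (0, 4), (4, 6)}    # hypothetical source-side brackets
s2, s1 = (0, 2), (2, 4)
print(f_c(s1, ts_spans), f_c(s2, ts_spans), f_c(concat(s1, s2), ts_spans))
# '- + +': s2 and s1 ◦ s2 match brackets of ts, s1 does not
```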

52 Relation feature Fr(Ns(s1), Ns(s2)). We first locate the lowest node Ns(si) in ts that dominates the span of si. [sent-86, score-0.078]

53 Then a feature function Fr(Ns(s1), Ns(s2)) is defined to indicate the relationship between Ns(s1) and Ns(s2). [sent-87, score-0.042]
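
A sketch of Fr under one possible encoding of ts nodes as (label, span, parent-index) triples; the relation values 'identical' and 'sisters' follow the worked example below, while the label-pair fallback is our own choice:

```python
def lowest_dominating(span, nodes):
    """N_s(si): the ts node with the smallest span covering si's span."""
    covering = [n for n in nodes if n[1][0] <= span[0] and span[1] <= n[1][1]]
    return min(covering, key=lambda n: n[1][1] - n[1][0])

def f_r(span1, span2, nodes):
    n1, n2 = lowest_dominating(span1, nodes), lowest_dominating(span2, nodes)
    if n1 == n2:
        return "identical"
    if n1[2] == n2[2]:                 # same parent index
        return "sisters"
    return n1[0] + "/" + n2[0]         # fall back to the label pair

# nodes: (label, (start, end), parent index); a toy ts rooted in 'dj'
nodes = [("dj", (0, 5), None), ("np", (0, 2), 0), ("vp", (3, 5), 0)]
print(f_r((3, 5), (0, 2), nodes))      # 'sisters': vp and np share parent dj
```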

54 Suppose we are considering the sentence depicted in Fig. [sent-90, score-0.028]

55 Frontier-words feature Ff(RF(s1), q1). A feature function which decides whether the right frontier word of s1 and q1 are in the same base phrase in ts. [sent-92, score-0.155]

56 Here, a base phrase is defined to be any phrase which dominates no other phrases. [sent-93, score-0.036]
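
Given that definition, Ff reduces to asking whether two word positions fall inside the same minimal phrase span. A minimal sketch over half-open spans:

```python
def base_phrases(phrase_spans):
    """Base phrases of ts: phrase spans that contain no other phrase span."""
    return [p for p in phrase_spans
            if not any(q != p and p[0] <= q[0] and q[1] <= p[1]
                       for q in phrase_spans)]

def f_f(word_i, word_j, phrase_spans):
    """Ff: '+' iff both word positions lie in the same base phrase of ts."""
    return "+" if any(b[0] <= word_i < b[1] and b[0] <= word_j < b[1]
                      for b in base_phrases(phrase_spans)) else "-"

spans = [(0, 5), (0, 2), (3, 5)]       # (0, 2) and (3, 5) are base phrases
print(f_f(3, 4, spans), f_f(1, 3, spans))   # '+ -'
```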

57 Path feature Fp(RF(s1), q1). Syntactic path features are widely used in the semantic role labeling literature (Gildea and Jurafsky, 2002) to encode information about both structure and grammar labels. [sent-94, score-0.218]

58 We define a string-valued feature function Fp(RF(s1), q1) which encodes the path connecting the right frontier word of s1 to q1 in ts. [sent-95, score-0.113]
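
One way to realize such a path string is to climb from the first word to the lowest common ancestor and then descend to the second, joining labels with direction markers. A simplified sketch that assumes each word's label chain (leaf to root) is given and that labels do not repeat along a chain; real trees would compare node identities, and the '^'/'!' markers are our notation:

```python
def path_feature(chain_a, chain_b):
    """Fp: label path from word a up to the lowest common ancestor,
    then down to word b. Chains run leaf-to-root, e.g. ['VV', 'vp', 'dj']."""
    shared = set(chain_a) & set(chain_b)
    up = []
    for label in chain_a:              # climb until we hit a shared ancestor
        up.append(label)
        if label in shared:
            break
    lca = up[-1]
    down = chain_b[:chain_b.index(lca)]    # labels below the LCA on b's side
    return "^".join(up) + "".join("!" + l for l in reversed(down))

print(path_feature(["VV", "vp", "dj"], ["NN", "np", "dj"]))
# 'VV^vp^dj!np!NN'
```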

59 To better understand the above feature functions, we re-examine the example depicted in Fig. 1. [sent-96, score-0.07]

60 Suppose that we use a shift-reduce-based heterogeneous parser to convert the TCT-style parse to the CTB-style parse, and that stack S currently contains two partial parses: s2: [NP (NN 情报) (NN 专家)] and s1: (VV 认为). [sent-98, score-0.807]

61 In such a state, we can see that the spans of both s2 and s1 ◦ s2 correspond to constituents in ts, but that of s1 does not. [sent-99, score-0.217]

62 Moreover, Ns(s1) is dj and Ns(s2) is np, so Ns(s1) and Ns(s2) are neither identical nor sisters in ts. [sent-100, score-0.036]

63 The values of these features are collected in Table 1. [sent-101, score-0.036]

64 3.1 Data Preparation and Performance Metric. In the experiments, we use two heterogeneous treebanks: CTB 5.1 [sent-103, score-0.375]

65 and the TCT corpus released by the CIPS-SIGHAN-2010 syntactic parsing competition. [sent-104, score-0.068]

66 To evaluate conversion accuracy, we use the same test set (named Sample-TCT) as in Zhu and Zhu (2010), which is a set of 150 sentences with manually assigned CTB-style and TCT-style parse trees. [sent-111, score-0.572]

67 In Sample-TCT, 6.19% (215/3473) of CTB-style constituents are inconsistent with respect to the TCT standard, and [sent-113, score-0.233]

68 8.87% (231/2602) of TCT-style constituents are inconsistent with respect to the CTB standard. [sent-114, score-0.118]

69 For all experiments, bracketing F1 is used as the performance metric, as provided by EVALB. [sent-115, score-0.211]
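
Bracketing F1 compares the bracket multisets of converted and gold trees, which is essentially what EVALB computes. A simplified sketch without EVALB's length cutoffs and label equivalences:

```python
from collections import Counter

def bracketing_f1(gold_spans, test_spans):
    """Bracketing precision/recall/F1 over (label, start, end) brackets,
    in the spirit of EVALB (simplified)."""
    gold, test = Counter(gold_spans), Counter(test_spans)
    matched = sum((gold & test).values())
    p = matched / sum(test.values())
    r = matched / sum(gold.values())
    return 2 * p * r / (p + r) if p + r else 0.0

gold = [("NP", 0, 2), ("VP", 3, 5), ("IP", 0, 5)]
test = [("NP", 0, 2), ("VP", 2, 5), ("IP", 0, 5)]
print(round(bracketing_f1(gold, test), 4))   # 0.6667
```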

70 3.2 Implementation Issues. To implement a heterogeneous parser, we first build a Berkeley parser (Petrov et al., 2006) [sent-117, score-0.542]

71 on the TCT training data and then use it to assign TCT-style parses to sentences in the CTB training data. [sent-118, score-0.071]

72 On the “updated” CTB training data, we build two shift-reduce-based heterogeneous parsers using a maximum entropy classification model, without and with beam search respectively. [sent-119, score-0.444]

73 Hereafter, the two heterogeneous parsers are referred to as Basic-SR and Beam-SR, respectively. [sent-120, score-0.404]

74 In the testing phase, Basic-SR and Beam-SR convert TCT-style parse trees in Sample-TCT to the CTB standard. [sent-121, score-0.175]

75 The conversion results are evaluated against corresponding CTB-style parse trees in Sample-TCT. [sent-122, score-0.572]

76 Before conducting treebank conversion, we apply the POS adaptation method proposed in Jiang et al. [sent-123, score-0.284]

77 (2009) to convert TCT-style POS tags in the input to the CTB standard. [sent-124, score-0.035]

78 3.3 Results. Table 2 shows the results achieved by Basic-SR and Beam-SR with heterogeneous features added incrementally. [sent-128, score-0.411]

79 Here, baseline denotes the systems which use only target-side features. [sent-129, score-0.082]

80 From the table we can see that heterogeneous features improve conversion accuracy significantly. [sent-130, score-0.873]

81 Specifically, adding the constituent (Fc) features to Basic-SR (Beam-SR) achieves a 2.79% (3%) improvement, [sent-131, score-0.09]

82 and adding the relation (Fr) and frontier-word (Ff) features yields a further 0. [sent-132, score-0.036]

83 The path feature is not as effective as expected, although it does yield improvements. [sent-141, score-0.082]

84 Since we use the same training and testing data as in Zhu and Zhu (2010), we can compare our approach directly with the informed decoding approach used in that work. [sent-143, score-0.105]

85 We find that Basic-SR achieves very close conversion results (84.07%) [sent-144, score-0.462]

86 and Beam-SR even outperforms the informed decoding approach (85. [sent-147, score-0.075]

87 4 Related Work. For phrase-structure treebank conversion, Wang et al. [sent-152, score-0.225]

88 (1994) suggest using source-side bracketing structures to select conversion results from k-best lists. [sent-153, score-0.742]

89 The approach is quite generic in the sense that it can be used for conversion between treebanks of different grammar formalisms, such as from a dependency treebank to a constituency treebank (Niu et al.). [sent-154, score-1.1]

90 However, it suffers from limited variations in k-best lists (Huang, 2008). [sent-156, score-0.029]

91 Zhu and Zhu (2010) propose to incorporate bracketing structures as parsing constraints in the decoding phase of a CKY-style parser. [sent-157, score-0.451]

92 However, it suffers from binary distinctions (consistent or inconsistent), as discussed in Section 1. [sent-160, score-0.135]

93 Moreover, it coincides with the stacking method used for dependency parser combination (Martins et al. [sent-163, score-0.157]

94 , 2008; Nivre and McDonald, 2008), the Pred method for domain adaptation (Daumé III and Marcu, 2006), and the method for annotation adaptation of word segmentation and POS tagging (Jiang et al.). [sent-164, score-0.169]

95 As one of the most closely related works, Jiang and Liu (2009) present a similar approach to conversion between dependency treebanks. [sent-166, score-0.492]

96 In contrast to Jiang and Liu (2009), the task studied in this paper, phrase-structure treebank conversion, is relatively complicated, and more effort should be put into feature engineering. [sent-167, score-0.267]

97 5 Conclusion. To avoid the binary distinctions used in previous approaches to automatic treebank conversion, we proposed in this paper a feature-based approach. [sent-168, score-0.331]

98 Experiments on two Chinese treebanks showed that our approach outperformed the baseline system (Zhu and Zhu, 2010) by 1.31%. [sent-169, score-0.129]

99 Acknowledgments. We thank Kenji Sagae for helpful discussions on the implementation of the shift-reduce parser, and the three anonymous reviewers for their comments. [sent-171, score-0.127]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('conversion', 0.462), ('heterogeneous', 0.375), ('zhu', 0.288), ('treebank', 0.225), ('bracketing', 0.211), ('ns', 0.208), ('tct', 0.186), ('ctb', 0.154), ('sagae', 0.142), ('treebanks', 0.129), ('parser', 0.127), ('fc', 0.12), ('inconsistent', 0.118), ('constituents', 0.115), ('parse', 0.11), ('ts', 0.102), ('golden', 0.093), ('jiang', 0.086), ('fp', 0.073), ('parses', 0.071), ('distinctions', 0.069), ('structures', 0.069), ('phase', 0.068), ('parsing', 0.068), ('tctstyle', 0.062), ('rf', 0.062), ('adaptation', 0.059), ('chinese', 0.059), ('lavie', 0.058), ('kenji', 0.056), ('muhua', 0.055), ('constituent', 0.054), ('petrov', 0.052), ('nn', 0.052), ('annotation', 0.051), ('si', 0.05), ('stack', 0.05), ('qi', 0.05), ('fr', 0.05), ('standards', 0.049), ('wang', 0.048), ('tsuruoka', 0.047), ('jingbo', 0.047), ('niu', 0.047), ('neu', 0.045), ('target', 0.044), ('cn', 0.043), ('np', 0.042), ('feature', 0.042), ('wenbin', 0.041), ('build', 0.04), ('path', 0.04), ('ff', 0.04), ('vv', 0.04), ('informed', 0.04), ('blum', 0.039), ('side', 0.038), ('frontier', 0.037), ('binary', 0.037), ('source', 0.036), ('hereafter', 0.036), ('dj', 0.036), ('features', 0.036), ('state', 0.036), ('hs', 0.036), ('dominates', 0.036), ('convert', 0.035), ('item', 0.035), ('vp', 0.035), ('decoding', 0.035), ('boundary', 0.034), ('suppose', 0.034), ('right', 0.034), ('martins', 0.032), ('xue', 0.032), ('encode', 0.031), ('testing', 0.03), ('proceeds', 0.03), ('dependency', 0.03), ('pos', 0.03), ('alon', 0.03), ('finkel', 0.03), ('generic', 0.029), ('suffers', 0.029), ('referred', 0.029), ('daum', 0.029), ('beam', 0.029), ('qun', 0.028), ('depicted', 0.028), ('slav', 0.028), ('evalb', 0.027), ('uptraining', 0.027), ('aclcoling', 0.027), ('deem', 0.027), ('enemy', 0.027), ('featurebased', 0.027), ('ingbo', 0.027), ('insist', 0.027), ('mengqiu', 0.027), ('porceedings', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999934 59 acl-2011-Better Automatic Treebank Conversion Using A Feature-Based Approach

Author: Muhua Zhu ; Jingbo Zhu ; Minghan Hu

Abstract: For the task of automatic treebank conversion, this paper presents a feature-based approach which encodes bracketing structures in a treebank into features to guide the conversion of this treebank to a different standard. Experiments on two Chinese treebanks show that our approach improves conversion accuracy by 1.31% over a strong baseline.

2 0.1645636 309 acl-2011-Transition-based Dependency Parsing with Rich Non-local Features

Author: Yue Zhang ; Joakim Nivre

Abstract: Transition-based dependency parsers generally use heuristic decoding algorithms but can accommodate arbitrarily rich feature representations. In this paper, we show that we can improve the accuracy of such parsers by considering even richer feature sets than those employed in previous systems. In the standard Penn Treebank setup, our novel features improve attachment score form 91.4% to 92.9%, giving the best results so far for transitionbased parsing and rivaling the best results overall. For the Chinese Treebank, they give a signficant improvement of the state of the art. An open source release of our parser is freely available.

3 0.14534263 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation

Author: Zhongguo Li

Abstract: Lots of Chinese characters are very productive in that they can form many structured words either as prefixes or as suffixes. Previous research in Chinese word segmentation mainly focused on identifying only the word boundaries without considering the rich internal structures of many words. In this paper we argue that this is unsatisfying in many ways, both practically and theoretically. Instead, we propose that word structures should be recovered in morphological analysis. An elegant approach for doing this is given and the result is shown to be promising enough for encouraging further effort in this direction. Our probability model is trained with the Penn Chinese Treebank and actually is able to parse both word and phrase structures in a unified way.

4 0.12951437 282 acl-2011-Shift-Reduce CCG Parsing

Author: Yue Zhang ; Stephen Clark

Abstract: CCGs are directly compatible with binarybranching bottom-up parsing algorithms, in particular CKY and shift-reduce algorithms. While the chart-based approach has been the dominant approach for CCG, the shift-reduce method has been little explored. In this paper, we develop a shift-reduce CCG parser using a discriminative model and beam search, and compare its strengths and weaknesses with the chart-based C&C; parser. We study different errors made by the two parsers, and show that the shift-reduce parser gives competitive accuracies compared to C&C.; Considering our use of a small beam, and given the high ambiguity levels in an automatically-extracted grammar and the amount of information in the CCG lexical categories which form the shift actions, this is a surprising result.

5 0.12304804 166 acl-2011-Improving Decoding Generalization for Tree-to-String Translation

Author: Jingbo Zhu ; Tong Xiao

Abstract: To address the parse error issue for tree-tostring translation, this paper proposes a similarity-based decoding generation (SDG) solution by reconstructing similar source parse trees for decoding at the decoding time instead of taking multiple source parse trees as input for decoding. Experiments on Chinese-English translation demonstrated that our approach can achieve a significant improvement over the standard method, and has little impact on decoding speed in practice. Our approach is very easy to implement, and can be applied to other paradigms such as tree-to-tree models. 1

6 0.11865391 111 acl-2011-Effects of Noun Phrase Bracketing in Dependency Parsing and Machine Translation

7 0.11330594 167 acl-2011-Improving Dependency Parsing with Semantic Classes

8 0.10810352 66 acl-2011-Chinese sentence segmentation as comma classification

9 0.10703953 92 acl-2011-Data point selection for cross-language adaptation of dependency parsers

10 0.10055373 58 acl-2011-Beam-Width Prediction for Efficient Context-Free Parsing

11 0.100206 39 acl-2011-An Ensemble Model that Combines Syntactic and Semantic Clustering for Discriminative Dependency Parsing

12 0.098639689 48 acl-2011-Automatic Detection and Correction of Errors in Dependency Treebanks

13 0.093644015 333 acl-2011-Web-Scale Features for Full-Scale Parsing

14 0.092231102 27 acl-2011-A Stacked Sub-Word Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging

15 0.087736309 184 acl-2011-Joint Hebrew Segmentation and Parsing using a PCFGLA Lattice Parser

16 0.087434255 243 acl-2011-Partial Parsing from Bitext Projections

17 0.08675766 171 acl-2011-Incremental Syntactic Language Models for Phrase-based Translation

18 0.086676814 28 acl-2011-A Statistical Tree Annotator and Its Applications

19 0.086445332 316 acl-2011-Unary Constraints for Efficient Context-Free Parsing

20 0.084174961 122 acl-2011-Event Extraction as Dependency Parsing


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.181), (1, -0.073), (2, -0.027), (3, -0.226), (4, -0.033), (5, -0.033), (6, -0.013), (7, 0.03), (8, 0.061), (9, 0.001), (10, 0.009), (11, 0.042), (12, -0.0), (13, -0.12), (14, 0.028), (15, 0.033), (16, 0.011), (17, -0.025), (18, 0.124), (19, 0.03), (20, -0.032), (21, -0.005), (22, -0.046), (23, 0.066), (24, 0.032), (25, -0.008), (26, -0.017), (27, 0.007), (28, 0.07), (29, -0.032), (30, -0.009), (31, 0.048), (32, 0.031), (33, -0.03), (34, 0.039), (35, -0.083), (36, -0.052), (37, -0.067), (38, 0.009), (39, -0.069), (40, 0.045), (41, 0.093), (42, -0.007), (43, -0.016), (44, 0.005), (45, -0.024), (46, 0.041), (47, 0.059), (48, -0.03), (49, -0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96266133 59 acl-2011-Better Automatic Treebank Conversion Using A Feature-Based Approach

Author: Muhua Zhu ; Jingbo Zhu ; Minghan Hu

Abstract: For the task of automatic treebank conversion, this paper presents a feature-based approach which encodes bracketing structures in a treebank into features to guide the conversion of this treebank to a different standard. Experiments on two Chinese treebanks show that our approach improves conversion accuracy by 1.31% over a strong baseline.

2 0.72223198 66 acl-2011-Chinese sentence segmentation as comma classification

Author: Nianwen Xue ; Yaqin Yang

Abstract: We describe a method for disambiguating Chinese commas that is central to Chinese sentence segmentation. Chinese sentence segmentation is viewed as the detection of loosely coordinated clauses separated by commas. Trained and tested on data derived from the Chinese Treebank, our model achieves a classification accuracy of close to 90% overall, which translates to an F1 score of 70% for detecting commas that signal sentence boundaries.

3 0.70341355 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation

Author: Zhongguo Li

Abstract: Lots of Chinese characters are very productive in that they can form many structured words either as prefixes or as suffixes. Previous research in Chinese word segmentation mainly focused on identifying only the word boundaries without considering the rich internal structures of many words. In this paper we argue that this is unsatisfying in many ways, both practically and theoretically. Instead, we propose that word structures should be recovered in morphological analysis. An elegant approach for doing this is given and the result is shown to be promising enough for encouraging further effort in this direction. Our probability model is trained with the Penn Chinese Treebank and actually is able to parse both word and phrase structures in a unified way.

4 0.68701392 309 acl-2011-Transition-based Dependency Parsing with Rich Non-local Features

Author: Yue Zhang ; Joakim Nivre

Abstract: Transition-based dependency parsers generally use heuristic decoding algorithms but can accommodate arbitrarily rich feature representations. In this paper, we show that we can improve the accuracy of such parsers by considering even richer feature sets than those employed in previous systems. In the standard Penn Treebank setup, our novel features improve attachment score form 91.4% to 92.9%, giving the best results so far for transitionbased parsing and rivaling the best results overall. For the Chinese Treebank, they give a signficant improvement of the state of the art. An open source release of our parser is freely available.

5 0.67392248 192 acl-2011-Language-Independent Parsing with Empty Elements

Author: Shu Cai ; David Chiang ; Yoav Goldberg

Abstract: We present a simple, language-independent method for integrating recovery of empty elements into syntactic parsing. This method outperforms the best published method we are aware of on English and a recently published method on Chinese.

6 0.67339104 284 acl-2011-Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models

7 0.66530609 230 acl-2011-Neutralizing Linguistically Problematic Annotations in Unsupervised Dependency Parsing Evaluation

8 0.66033101 243 acl-2011-Partial Parsing from Bitext Projections

9 0.65397394 111 acl-2011-Effects of Noun Phrase Bracketing in Dependency Parsing and Machine Translation

10 0.65014005 28 acl-2011-A Statistical Tree Annotator and Its Applications

11 0.63737899 39 acl-2011-An Ensemble Model that Combines Syntactic and Semantic Clustering for Discriminative Dependency Parsing

12 0.63506103 333 acl-2011-Web-Scale Features for Full-Scale Parsing

13 0.63322896 127 acl-2011-Exploiting Web-Derived Selectional Preference to Improve Statistical Dependency Parsing

14 0.62287569 300 acl-2011-The Surprising Variance in Shortest-Derivation Parsing

15 0.61914361 184 acl-2011-Joint Hebrew Segmentation and Parsing using a PCFGLA Lattice Parser

16 0.61501575 92 acl-2011-Data point selection for cross-language adaptation of dependency parsers

17 0.6083377 58 acl-2011-Beam-Width Prediction for Efficient Context-Free Parsing

18 0.60073793 48 acl-2011-Automatic Detection and Correction of Errors in Dependency Treebanks

19 0.58316797 267 acl-2011-Reversible Stochastic Attribute-Value Grammars

20 0.58277345 143 acl-2011-Getting the Most out of Transition-based Dependency Parsing


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.011), (17, 0.035), (28, 0.011), (37, 0.097), (39, 0.539), (41, 0.035), (55, 0.017), (59, 0.026), (72, 0.015), (91, 0.037), (96, 0.094)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.96803528 1 acl-2011-(11-06-spirl)

Author: (hal)

Abstract: unkown-abstract

2 0.9156872 184 acl-2011-Joint Hebrew Segmentation and Parsing using a PCFGLA Lattice Parser

Author: Yoav Goldberg ; Michael Elhadad

Abstract: We experiment with extending a lattice parsing methodology for parsing Hebrew (Goldberg and Tsarfaty, 2008; Golderg et al., 2009) to make use of a stronger syntactic model: the PCFG-LA Berkeley Parser. We show that the methodology is very effective: using a small training set of about 5500 trees, we construct a parser which parses and segments unsegmented Hebrew text with an F-score of almost 80%, an error reduction of over 20% over the best previous result for this task. This result indicates that lattice parsing with the Berkeley parser is an effective methodology for parsing over uncertain inputs.

same-paper 3 0.89441311 59 acl-2011-Better Automatic Treebank Conversion Using A Feature-Based Approach

Author: Muhua Zhu ; Jingbo Zhu ; Minghan Hu

Abstract: For the task of automatic treebank conversion, this paper presents a feature-based approach which encodes bracketing structures in a treebank into features to guide the conversion of this treebank to a different standard. Experiments on two Chinese treebanks show that our approach improves conversion accuracy by 1.31% over a strong baseline.

4 0.88728487 52 acl-2011-Automatic Labelling of Topic Models

Author: Jey Han Lau ; Karl Grieser ; David Newman ; Timothy Baldwin

Abstract: We propose a method for automatically labelling topics learned via LDA topic models. We generate our label candidate set from the top-ranking topic terms, titles of Wikipedia articles containing the top-ranking topic terms, and sub-phrases extracted from the Wikipedia article titles. We rank the label candidates using a combination of association measures and lexical features, optionally fed into a supervised ranking model. Our method is shown to perform strongly over four independent sets of topics, significantly better than a benchmark method.

5 0.820472 97 acl-2011-Discovering Sociolinguistic Associations with Structured Sparsity

Author: Jacob Eisenstein ; Noah A. Smith ; Eric P. Xing

Abstract: We present a method to discover robust and interpretable sociolinguistic associations from raw geotagged text data. Using aggregate demographic statistics about the authors’ geographic communities, we solve a multi-output regression problem between demographics and lexical frequencies. By imposing a composite ‘1,∞ regularizer, we obtain structured sparsity, driving entire rows of coefficients to zero. We perform two regression studies. First, we use term frequencies to predict demographic attributes; our method identifies a compact set of words that are strongly associated with author demographics. Next, we conjoin demographic attributes into features, which we use to predict term frequencies. The composite regularizer identifies a small number of features, which correspond to communities of authors united by shared demographic and linguistic properties.

6 0.7684325 27 acl-2011-A Stacked Sub-Word Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging

7 0.74323404 29 acl-2011-A Word-Class Approach to Labeling PSCFG Rules for Machine Translation

8 0.70407653 192 acl-2011-Language-Independent Parsing with Empty Elements

9 0.60989857 182 acl-2011-Joint Annotation of Search Queries

10 0.56060791 10 acl-2011-A Discriminative Model for Joint Morphological Disambiguation and Dependency Parsing

11 0.54756618 282 acl-2011-Shift-Reduce CCG Parsing

12 0.53923148 242 acl-2011-Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments

13 0.53456056 316 acl-2011-Unary Constraints for Efficient Context-Free Parsing

14 0.52955645 269 acl-2011-Scaling up Automatic Cross-Lingual Semantic Role Annotation

15 0.52940762 236 acl-2011-Optimistic Backtracking - A Backtracking Overlay for Deterministic Incremental Parsing

16 0.52432936 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation

17 0.52148736 215 acl-2011-MACAON An NLP Tool Suite for Processing Word Lattices

18 0.5213089 202 acl-2011-Learning Hierarchical Translation Structure with Linguistic Annotations

19 0.51356894 238 acl-2011-P11-2093 k2opt.pdf

20 0.51058215 300 acl-2011-The Surprising Variance in Shortest-Derivation Parsing