acl acl2011 acl2011-192 knowledge-graph by maker-knowledge-mining

192 acl-2011-Language-Independent Parsing with Empty Elements


Source: pdf

Author: Shu Cai ; David Chiang ; Yoav Goldberg

Abstract: We present a simple, language-independent method for integrating recovery of empty elements into syntactic parsing. This method outperforms the best published method we are aware of on English and a recently published method on Chinese.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We present a simple, language-independent method for integrating recovery of empty elements into syntactic parsing. [sent-2, score-1.09]

2 This method outperforms the best published method we are aware of on English and a recently published method on Chinese. [sent-3, score-0.102]

3 1 Introduction Empty elements in the syntactic analysis of a sentence are markers that show where a word or phrase might otherwise be expected to appear, but does not. [sent-4, score-0.238]

4 For example, in the tree of Figure 2a, the first empty element (*) marks where John would be if believed were in the active voice (someone believed. [sent-6, score-0.998]

5 ), and the second empty element (*T*) marks where the man would be if who were not fronted (John was believed to admire who? [sent-9, score-0.961]

6 Empty elements exist in many languages and serve different purposes. [sent-11, score-0.238]

7 In languages such as Chinese and Korean, where subjects and objects can be dropped to avoid duplication, empty elements are particularly important, as they indicate the position of dropped arguments. [sent-12, score-1.038]

8 Figure 1 gives an example of a Chinese parse tree with empty elements. [sent-13, score-0.733]

9 The first empty element (*pro*) marks the subject of the whole sentence, a pronoun inferable from context. [sent-14, score-0.915]

10 The second empty element (*PRO*) marks the subject of the dependent VP (shíshī fǎlǜ tiáowén). [sent-15, score-0.915]

11 Yet most parsing work based on these resources has ignored empty elements, with some exceptions. [sent-19, score-0.798]

12 Figure 1: Chinese parse tree with empty elements marked. [sent-121, score-1.006]

13 The meaning of the sentence is, “Implementation of the law is temporarily suspended.” [sent-122, score-0.035]

14 Johnson (2002) studied empty-element recovery in English, followed by several others (Dienes and Dubey, 2003; Campbell, 2004; Gabbard et al. [sent-124, score-0.256]

15 , 2006); the best results we are aware of are due to Schmid (2006). [sent-125, score-0.038]

16 Recently, empty-element recovery for Chinese has begun to receive attention: Yang and Xue (2010) treat it as a classification problem, while Chung and Gildea (2010) pursue several approaches for both Korean and Chinese, and explore applications to machine translation. [sent-126, score-0.156]

17 Our intuition motivating this work is that empty elements are an integral part of syntactic structure, and should be constructed jointly with it, not added in afterwards. [sent-127, score-0.934]

18 Moreover, we expect empty-element recovery to improve as the parsing quality improves. [sent-128, score-0.258]

19 (2006), which we extend to predict empty categories by the use of lattice parsing. [sent-130, score-0.696]

21 The method is language-independent and performs very well on both languages we tested it on: for English, it outperforms the best published method we are aware of (Schmid, 2006), and for Chinese, it outperforms the method of Yang and Xue (2010). [sent-133, score-0.07]

22 We take a state-of-the-art parsing model, the Berkeley parser (Petrov et al. [sent-135, score-0.211]

23 , 2006), train it on data with explicit empty elements, and test it on word lattices that can nondeterministically insert empty elements anywhere. [sent-136, score-1.701]

24 The idea is that the state-splitting of the parsing model will enable it to learn where to expect empty elements to be inserted into the test sentences. [sent-137, score-1.036]

25 Tree transformations: Prior to training, we alter the annotation of empty elements so that the terminal label is a consistent symbol (ϵ), the preterminal label is the type of the empty element, and -NONE- is deleted (see Figure 2b). [sent-138, score-1.755]

26 This simplifies the lattices because there is only one empty symbol, and helps the parsing model to learn dependencies between nonterminal labels and empty-category types because there is no intervening -NONE-. [sent-139, score-1.002]

27 Then, following Schmid (2006), if a constituent contains an empty element that is linked to another node with label X, then we append /X to its label. [sent-140, score-0.9]

28 If there is more than one empty element, we process them bottom-up (see Figure 2b). [sent-141, score-0.696]

29 This helps the parser learn where to expect empty elements. [sent-142, score-0.788]

30 In our experiments, we did this only for elements of type *T*. [sent-143, score-0.238]

31 Finally, we train the Berkeley parser on the preprocessed training data. [sent-144, score-0.063]
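
As a rough illustration of this preprocessing, the sketch below applies the relabeling to a tree given as nested (label, children) tuples in Python. The tree format, the EMPTY terminal symbol, and the omission of the /X slash annotation (which requires coindexation links) are all simplifying assumptions, not the authors' actual implementation.

    # Minimal sketch of the empty-element relabeling described above.
    # Trees are (label, children) tuples; leaves are plain strings.
    EMPTY = "*e*"   # stand-in for the single consistent empty terminal symbol

    def transform(node):
        """Replace (-NONE- type-k) with (type EMPTY): the preterminal becomes the
        empty-element type and the -NONE- tag disappears."""
        label, children = node
        if label == "-NONE-":
            empty_type = children[0].split("-")[0]   # e.g. "*T*-1" -> "*T*"
            return (empty_type, [EMPTY])
        return (label, [transform(c) if isinstance(c, tuple) else c
                        for c in children])

    # Example: (NP (-NONE- *T*-1)) becomes (NP (*T* *e*))
    tree = ("NP", [("-NONE-", ["*T*-1"])])
    print(transform(tree))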

32 Lattice parsing: Unlike the training data, the test data does not mark any empty elements. [sent-145, score-0.798]

33 We allow the parser to produce empty elements by means of lattice-parsing (Chappelier et al. [sent-146, score-0.997]

34 , 1999), a generalization of CKY parsing allowing it to parse a word lattice instead of a predetermined list of terminals. [sent-147, score-0.102]

35 Lattice parsing adds a layer of flexibility to existing parsing technology, and allows parsing in situations where the yield of the tree is not known in advance. [sent-148, score-0.343]

36 Lattice parsing originated in the speech recognition community. Unfortunately, not enough information was available to carry out a comparison with the method of Chung and Gildea (2010). [sent-149, score-0.129]

37 We use a modified version of the Berkeley parser which allows handling lattices as input. [sent-153, score-0.134]

38 The modification is fairly straightforward: each lattice arc corresponds to a lexical item. [sent-154, score-0.181]

39 Lexical items are now indexed by their start and end states rather than by their sentence position, and the initialization procedure of the CKY chart is changed to allow lexical items of spans greater than 1. [sent-155, score-0.152]

40 We then make the necessary adjustments to the parsing algorithm to support this change: trying rules involving preterminals even when the span is greater than 1, and not relying on span size for identifying lexical items. [sent-156, score-0.102]
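
To illustrate this change, here is a small sketch of seeding a CKY-style chart from lattice arcs keyed by start and end states rather than by sentence positions. The Arc tuple and the lexicon dictionary are illustrative assumptions, not the Berkeley parser's internal data structures.

    # Sketch: initializing a CKY chart from word-lattice arcs. Unlike plain CKY,
    # a lexical item may cover a state span greater than 1.
    from collections import defaultdict, namedtuple

    Arc = namedtuple("Arc", ["start", "end", "terminal"])

    def init_chart(arcs, lexicon):
        """lexicon maps a terminal to {preterminal: log_prob} (a stand-in for the
        grammar's tagging scores)."""
        chart = defaultdict(dict)            # chart[(i, j)][label] = best log-prob
        for arc in arcs:
            span = (arc.start, arc.end)
            for preterminal, logprob in lexicon.get(arc.terminal, {}).items():
                best = chart[span].get(preterminal, float("-inf"))
                chart[span][preterminal] = max(best, logprob)
        return chart

    # Binary and unary rules must later be tried over these spans even when
    # end - start > 1, since span size no longer identifies lexical items.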

41 At test time, we first construct a lattice for each test sentence that allows 0, 1, or 2 empty symbols (ϵ) between each pair of words or at the start/end of the sentence. [sent-157, score-0.85]
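
A minimal sketch of this test-time lattice construction follows, assuming arcs are (start_state, end_state, symbol) triples and EMPTY is the epsilon terminal from the transformed training data; the state layout is one illustrative choice, not necessarily the one used in the paper's implementation.

    EMPTY = "*e*"   # assumed epsilon terminal symbol

    def build_lattice(words, max_empties=2):
        """Lattice over `words` allowing 0..max_empties empty symbols before each
        word and at the end of the sentence. Gap i uses states step*i .. step*i + max_empties."""
        arcs = []
        n = len(words)
        step = max_empties + 1                      # states per gap
        for i in range(n + 1):                      # n + 1 gaps, including the final one
            base = step * i
            for k in range(max_empties):            # optional empty symbols in this gap
                arcs.append((base + k, base + k + 1, EMPTY))
            if i < n:                               # the word can follow 0, 1, or 2 empties
                for start in range(base, base + step):
                    arcs.append((start, step * (i + 1), words[i]))
        final_states = set(range(step * n, step * n + step))
        return arcs, final_states

    # build_lattice(["John", "was", "believed"]) yields paths such as
    # "John was believed", "*e* John was believed", "John was *e* *e* believed", ...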

42 Then we feed these lattices through our lattice parser to produce trees with empty elements. [sent-158, score-0.984]

43 Finally, we reverse the transformations that had been applied to the training data. [sent-159, score-0.034]

44 3 Evaluation Measures Evaluation metrics for empty-element recovery are not well established, and previous studies use a variety of metrics. [sent-160, score-0.156]

45 We review several of these here and additionally propose a unified evaluation of parsing and empty-element recovery. [sent-161, score-0.102]

46 If A and B are multisets, let A(x) be the number of occurrences of x in A, let |A| = ∑x A(x), and let A ∩ B be the multiset such that (A ∩ B)(x) = min(A(x), B(x)). [sent-162, score-0.081]

47 If T is the multiset of “items” in the trees being tested and G is the multiset of “items” in the gold-standard trees, then precision = |G ∩ T| / |T|, recall = |G ∩ T| / |G|, and F1 = 2 / (1/precision + 1/recall). The modified parser is available at http://www. [sent-163, score-0.225]
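
The multiset precision, recall, and F1 above can be computed directly with Python's collections.Counter, whose & operator takes the minimum of counts, i.e. exactly the multiset intersection used here; this is a generic illustration rather than the authors' evaluation script.

    from collections import Counter

    def prf1(test_items, gold_items):
        """Multiset precision, recall, and F1 over lists of extracted items."""
        T, G = Counter(test_items), Counter(gold_items)
        overlap = sum((T & G).values())                        # |G ∩ T|
        precision = overlap / sum(T.values()) if T else 0.0    # |G ∩ T| / |T|
        recall = overlap / sum(G.values()) if G else 0.0       # |G ∩ T| / |G|
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1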

48 Figure 2: English parse tree with empty elements marked. [sent-290, score-0.971]

49 Define a nonterminal node, for present purposes, to be a node which is neither a terminal nor a preterminal node. [sent-293, score-0.249]

50 The PARSEVAL metric (Black et al., 1991) counts labeled nonempty brackets: items are (X, i, j) for each nonempty nonterminal node, where X is its label and i and j are the start and end positions of its span. [sent-295, score-0.419]

51 Yang and Xue (2010) simply count unlabeled empty elements: items are (i, i) for each empty element, where i is its position. [sent-296, score-1.5]

52 If multiple empty elements occur at the same position, they only count the last one. [sent-297, score-0.966]

53 The metric originally proposed by Johnson (2002) counts labeled empty brackets: items are (X/t, i, i) for each empty nonterminal node, where X is its label and t is the type of the empty element it dominates, but also (t, i, i) for each empty element not dominated by an empty nonterminal node. [sent-298, score-4.193]

54 The following structure has an empty nonterminal dominating two empty elements. [sent-299, score-1.496]

55 Johnson counts this as (SBAR, i, i), (S/*T*, i, i); Schmid (2006) counts it as a single item. This happens in the Penn Treebank for types *U* and 0, but never in the Penn Chinese Treebank except by mistake. [sent-321, score-0.06]

56 We tried to follow Schmid in a generic way: we collapse any vertical chain of empty nonterminals into a single nonterminal. [sent-441, score-0.696]

57 , items are (t, i, i) for each empty element, and the second, similar in spirit to SParseval (Roark et al. [sent-445, score-0.772]

58 , items are (X, i,j) for each nonterminal node (whether nonempty or empty). [sent-448, score-0.322]
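
To make the item definitions concrete, the following sketch walks a tree (same nested-tuple format assumed earlier) and extracts two of the item types above: unlabeled empty-element items (i, i) and labeled bracket items (X, i, j) over all nonterminal nodes. Handling of the special cases discussed above (co-located empty elements, collapsed chains) is deliberately left out.

    EMPTY = "*e*"   # assumed epsilon terminal, as in the transformed trees

    def extract_items(node, start=0):
        """Return (end_position, empty_items, bracket_items) for the subtree `node`.
        empty_items are (i, i) per empty element; bracket_items are (X, i, j) per
        nonterminal node, whether empty or nonempty."""
        label, children = node
        if len(children) == 1 and isinstance(children[0], str):
            # preterminal: an empty element covers no words, a real word covers one
            if children[0] == EMPTY:
                return start, [(start, start)], []
            return start + 1, [], []
        pos, empties, brackets = start, [], []
        for child in children:
            pos, e, b = extract_items(child, pos)
            empties += e
            brackets += b
        brackets.append((label, start, pos))
        return pos, empties, brackets

    # Feeding matched test/gold item lists to prf1 above gives the corresponding scores.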

59 4 Experiments and Results English As is standard, we trained the parser on sections 02–21 of the Penn Treebank Wall Street Journal corpus, used section 00 for development, and section 23 for testing. [sent-450, score-0.096]

60 We ran 6 cycles of training; then, because we were unable to complete the 7th split-merge cycle with the default setting of merging 50% of splits, we tried increasing merges to 75% and ran 7 cycles of training. [sent-451, score-0.116]

61 We chose the parser settings that gave the best labeled empty elements F1 on the dev set, and used these settings for the test set. [sent-453, score-1.03]

62 We outperform the state of the art at recovering empty elements, while also achieving state-of-the-art accuracy at recovering phrase structure. [sent-454, score-0.954]

63 This difference is not small; scores using Schmid’s metric are lower by roughly 1%. [sent-455, score-0.039]

64 There are other minor differences in Schmid’s metric which we do not detail here. [sent-456, score-0.039]

65 [Table: precision, recall, and F1 for unlabeled empty elements, labeled empty elements, and all labeled brackets, by task (dev/test) and system (split-and-merge settings of 6 cycles at 50% merging and 7 cycles at 75% merging).] [sent-476, score-0.029]

66 For comparability with previous work (Yang and Xue, 2010), we trained the parser on sections 0081–0900, used sections 0041–0080 for development, and sections 0001–0040 and 0901–0931 for testing. [sent-488, score-0.162]

67 We selected the 6th split-merge cycle based on the labeled empty elements F1 measure. [sent-490, score-1.003]

68 The unlabeled empty elements column shows that our system outperforms the baseline system of Yang and Xue (2010). [sent-491, score-0.934]

69 We also analyzed the empty-element recall by type (Table 3). [sent-492, score-0.1]

70 Our system outperformed that of Yang and Xue (2010) especially on *pro*, used for dropped arguments, and *T*, used for relative clauses and topicalization. [sent-493, score-0.052]

71 5 Discussion and Future Work The empty-element recovery method we have presented is simple, highly effective, and fully integrated with state-of-the-art parsing. [sent-494, score-0.202]

72 We hope to exploit cross-lingual information about empty elements in machine translation. [sent-495, score-0.934]

73 Chung and Gildea (2010) have shown that such information indeed helps translation, and we plan to extend this work by handling more empty categories. [sent-496, score-0.725]

74 Table 3: Recall on different types of empty categories. [sent-498, score-0.696]

75 We also plan to extend our work here to recover coindexation information (links between a moved element and the trace which marks the position it was moved from). [sent-501, score-0.367]

76 This work was supported in part by DARPA under contracts HR0011-06-C-0022 (subcontract to BBN Technologies) and DOI-NBC N10AP20031, and by NSF under contract IIS-0908532. [sent-505, score-0.027]

77 Joint Hebrew segmentation and parsing using a PCFG-LA lattice parser. [sent-558, score-0.256]

78 A simple pattern-matching algorithm for recovering empty nodes and their antecedents. [sent-580, score-0.779]

79 Trace prediction and recovery with unlexicalized PCFGs and slash features. [sent-601, score-0.21]

80 Chasing the ghost: recovering empty categories in the Chinese Treebank. [sent-610, score-0.779]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('empty', 0.696), ('elements', 0.238), ('xue', 0.171), ('recovery', 0.156), ('lattice', 0.154), ('element', 0.15), ('schmid', 0.125), ('penn', 0.107), ('nonterminal', 0.104), ('parsing', 0.102), ('emptyelement', 0.1), ('yang', 0.097), ('chinese', 0.094), ('chung', 0.093), ('goldberg', 0.089), ('chappelier', 0.088), ('nonempty', 0.088), ('recovering', 0.083), ('multiset', 0.081), ('items', 0.076), ('trace', 0.076), ('lattices', 0.071), ('marks', 0.069), ('sparseval', 0.066), ('parser', 0.063), ('yoav', 0.06), ('sh', 0.06), ('yaqin', 0.059), ('preterminal', 0.059), ('petrov', 0.055), ('node', 0.054), ('slash', 0.054), ('dienes', 0.054), ('gildea', 0.053), ('brackets', 0.052), ('dropped', 0.052), ('treebank', 0.052), ('gabbard', 0.051), ('johnson', 0.05), ('berkeley', 0.049), ('art', 0.046), ('tagyoung', 0.046), ('believed', 0.046), ('ro', 0.045), ('isi', 0.044), ('nianwen', 0.043), ('pcfgs', 0.04), ('cycles', 0.04), ('hebrew', 0.04), ('pro', 0.04), ('cky', 0.039), ('metric', 0.039), ('korean', 0.038), ('aware', 0.038), ('marcus', 0.038), ('tree', 0.037), ('moved', 0.036), ('roark', 0.036), ('cycle', 0.036), ('vb', 0.036), ('law', 0.035), ('green', 0.035), ('transformations', 0.034), ('sections', 0.033), ('wh', 0.033), ('labeled', 0.033), ('terminal', 0.032), ('count', 0.032), ('published', 0.032), ('santorini', 0.03), ('counts', 0.03), ('slav', 0.03), ('helps', 0.029), ('gurion', 0.029), ('naturel', 0.029), ('pob', 0.029), ('rajman', 0.029), ('sheva', 0.029), ('yoavg', 0.029), ('ghost', 0.029), ('kahn', 0.029), ('multisets', 0.029), ('sap', 0.029), ('black', 0.029), ('arabic', 0.028), ('negev', 0.027), ('campbell', 0.027), ('contracts', 0.027), ('cation', 0.027), ('gdaniec', 0.027), ('ingria', 0.027), ('langage', 0.027), ('stewart', 0.027), ('hale', 0.027), ('nated', 0.027), ('originated', 0.027), ('kulick', 0.027), ('usc', 0.027), ('shu', 0.027), ('dte', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 192 acl-2011-Language-Independent Parsing with Empty Elements

Author: Shu Cai ; David Chiang ; Yoav Goldberg

Abstract: We present a simple, language-independent method for integrating recovery of empty elements into syntactic parsing. This method outperforms the best published method we are aware of on English and a recently published method on Chinese.

2 0.3636671 28 acl-2011-A Statistical Tree Annotator and Its Applications

Author: Xiaoqiang Luo ; Bing Zhao

Abstract: In many natural language applications, there is a need to enrich syntactical parse trees. We present a statistical tree annotator augmenting nodes with additional information. The annotator is generic and can be applied to a variety of applications. We report 3 such applications in this paper: predicting function tags; predicting null elements; and predicting whether a tree constituent is projectable in machine translation. Our function tag prediction system outperforms significantly published results.

3 0.2519289 184 acl-2011-Joint Hebrew Segmentation and Parsing using a PCFGLA Lattice Parser

Author: Yoav Goldberg ; Michael Elhadad

Abstract: We experiment with extending a lattice parsing methodology for parsing Hebrew (Goldberg and Tsarfaty, 2008; Golderg et al., 2009) to make use of a stronger syntactic model: the PCFG-LA Berkeley Parser. We show that the methodology is very effective: using a small training set of about 5500 trees, we construct a parser which parses and segments unsegmented Hebrew text with an F-score of almost 80%, an error reduction of over 20% over the best previous result for this task. This result indicates that lattice parsing with the Berkeley parser is an effective methodology for parsing over uncertain inputs.

4 0.10610307 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation

Author: Zhongguo Li

Abstract: Lots of Chinese characters are very productive in that they can form many structured words either as prefixes or as suffixes. Previous research in Chinese word segmentation mainly focused on identifying only the word boundaries without considering the rich internal structures of many words. In this paper we argue that this is unsatisfying in many ways, both practically and theoretically. Instead, we propose that word structures should be recovered in morphological analysis. An elegant approach for doing this is given and the result is shown to be promising enough for encouraging further effort in this direction. Our probability model is trained with the Penn Chinese Treebank and actually is able to parse both word and phrase structures in a unified way. 1 Why Parse Word Structures? Research in Chinese word segmentation has progressed tremendously in recent years, with state of the art performing at around 97% in precision and recall (Xue, 2003; Gao et al., 2005; Zhang and Clark, 2007; Li and Sun, 2009). However, virtually all these systems focus exclusively on recognizing the word boundaries, giving no consideration to the internal structures of many words. Though it has been the standard practice for many years, we argue that this paradigm is inadequate both in theory and in practice, for at least the following four reasons. The first reason is that if we confine our definition of word segmentation to the identification of word boundaries, then people tend to have divergent 1405 opinions as to whether a linguistic unit is a word or not (Sproat et al., 1996). This has led to many different annotation standards for Chinese word segmentation. Even worse, this could cause inconsistency in the same corpus. For instance, 䉂 擌 奒 ‘vice president’ is considered to be one word in the Penn Chinese Treebank (Xue et al., 2005), but is split into two words by the Peking University corpus in the SIGHAN Bakeoffs (Sproat and Emerson, 2003). Meanwhile, 䉂 䀓 惼 ‘vice director’ and 䉂 䚲䡮 ‘deputy are both segmented into two words in the same Penn Chinese Treebank. In fact, all these words are composed of the prefix 䉂 ‘vice’ and a root word. Thus the structure of 䉂擌奒 ‘vice president’ can be represented with the tree in Figure 1. Without a doubt, there is complete agree- manager’ NN ,,ll JJf NNf 䉂 擌奒 Figure 1: Example of a word with internal structure. ment on the correctness of this structure among native Chinese speakers. So if instead of annotating only word boundaries, we annotate the structures of every word, then the annotation tends to be more 1 1Here it is necessary to add a note on terminology used in this paper. Since there is no universally accepted definition of the “word” concept in linguistics and especially in Chinese, whenever we use the term “word” we might mean a linguistic unit such as 䉂 擌奒 ‘vice president’ whose structure is shown as the tree in Figure 1, or we might mean a smaller unit such as 擌奒 ‘president’ which is a substructure of that tree. Hopefully, ProceedingPso orftla thned 4,9 Otrhe Agonnn,u Jauln Mee 1e9t-i2ng4, o 2f0 t1h1e. A ?c s 2o0ci1a1ti Aonss foocria Ctioomnp fourta Ctioomnaplu Ltaintigouniaslti Lcisn,g puaigsetsic 1s405–1414, consistent and there could be less duplication of efforts in developing the expensive annotated corpus. The second reason is applications have different requirements for granularity of words. Take the personal name 撱 嗤吼 ‘Zhou Shuren’ as an example. 
It’s considered to be one word in the Penn Chinese Treebank, but is segmented into a surname and a given name in the Peking University corpus. For some applications such as information extraction, the former segmentation is adequate, while for others like machine translation, the later finer-grained output is more preferable. If the analyzer can produce a structure as shown in Figure 4(a), then every application can extract what it needs from this tree. A solution with tree output like this is more elegant than approaches which try to meet the needs of different applications in post-processing (Gao et al., 2004). The third reason is that traditional word segmentation has problems in handling many phenomena in Chinese. For example, the telescopic compound 㦌 撥 怂惆 ‘universities, middle schools and primary schools’ is in fact composed ofthree coordinating elements 㦌惆 ‘university’, 撥 惆 ‘middle school’ and 怂惆 ‘primary school’ . Regarding it as one flat word loses this important information. Another example is separable words like 扩 扙 ‘swim’ . With a linear segmentation, the meaning of ‘swimming’ as in 扩 堑 扙 ‘after swimming’ cannot be properly represented, since 扩扙 ‘swim’ will be segmented into discontinuous units. These language usages lie at the boundary between syntax and morphology, and are not uncommon in Chinese. They can be adequately represented with trees (Figure 2). (a) NN (b) ???HHH JJ NNf ???HHH JJf JJf JJf 㦌 撥 怂 惆 VV ???HHH VV NNf ZZ VVf VVf 扩 扙 堑 Figure 2: Example of telescopic compound (a) and separable word (b). The last reason why we should care about word the context will always make it clear what is being referred to with the term “word”. 1406 structures is related to head driven statistical parsers (Collins, 2003). To illustrate this, note that in the Penn Chinese Treebank, the word 戽 䊂䠽 吼 ‘English People’ does not occur at all. Hence constituents headed by such words could cause some difficulty for head driven models in which out-ofvocabulary words need to be treated specially both when they are generated and when they are conditioned upon. But this word is in turn headed by its suffix 吼 ‘people’, and there are 2,233 such words in Penn Chinese Treebank. If we annotate the structure of every compound containing this suffix (e.g. Figure 3), such data sparsity simply goes away.

5 0.088444225 171 acl-2011-Incremental Syntactic Language Models for Phrase-based Translation

Author: Lane Schwartz ; Chris Callison-Burch ; William Schuler ; Stephen Wu

Abstract: This paper describes a novel technique for incorporating syntactic knowledge into phrasebased machine translation through incremental syntactic parsing. Bottom-up and topdown parsers typically require a completed string as input. This requirement makes it difficult to incorporate them into phrase-based translation, which generates partial hypothesized translations from left-to-right. Incremental syntactic language models score sentences in a similar left-to-right fashion, and are therefore a good mechanism for incorporat- ing syntax into phrase-based translation. We give a formal definition of one such lineartime syntactic language model, detail its relation to phrase-based decoding, and integrate the model with the Moses phrase-based translation system. We present empirical results on a constrained Urdu-English translation task that demonstrate a significant BLEU score improvement and a large decrease in perplexity.

6 0.083411425 66 acl-2011-Chinese sentence segmentation as comma classification

7 0.08314874 316 acl-2011-Unary Constraints for Efficient Context-Free Parsing

8 0.081931107 282 acl-2011-Shift-Reduce CCG Parsing

9 0.07857956 29 acl-2011-A Word-Class Approach to Labeling PSCFG Rules for Machine Translation

10 0.076374039 58 acl-2011-Beam-Width Prediction for Efficient Context-Free Parsing

11 0.06955564 111 acl-2011-Effects of Noun Phrase Bracketing in Dependency Parsing and Machine Translation

12 0.069268957 39 acl-2011-An Ensemble Model that Combines Syntactic and Semantic Clustering for Discriminative Dependency Parsing

13 0.068476088 167 acl-2011-Improving Dependency Parsing with Semantic Classes

14 0.067475185 59 acl-2011-Better Automatic Treebank Conversion Using A Feature-Based Approach

15 0.063873336 49 acl-2011-Automatic Evaluation of Chinese Translation Output: Word-Level or Character-Level?

16 0.063834257 332 acl-2011-Using Multiple Sources to Construct a Sentiment Sensitive Thesaurus for Cross-Domain Sentiment Classification

17 0.061587971 309 acl-2011-Transition-based Dependency Parsing with Rich Non-local Features

18 0.059070114 143 acl-2011-Getting the Most out of Transition-based Dependency Parsing

19 0.058538906 206 acl-2011-Learning to Transform and Select Elementary Trees for Improved Syntax-based Machine Translations

20 0.057845291 333 acl-2011-Web-Scale Features for Full-Scale Parsing


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.158), (1, -0.07), (2, -0.004), (3, -0.163), (4, -0.041), (5, -0.016), (6, -0.011), (7, 0.026), (8, 0.033), (9, 0.009), (10, -0.022), (11, 0.039), (12, -0.065), (13, -0.054), (14, 0.034), (15, -0.016), (16, -0.011), (17, -0.027), (18, 0.119), (19, 0.089), (20, 0.104), (21, 0.046), (22, -0.096), (23, 0.068), (24, 0.022), (25, 0.014), (26, -0.044), (27, -0.004), (28, 0.067), (29, -0.077), (30, 0.038), (31, -0.021), (32, -0.04), (33, -0.048), (34, 0.087), (35, -0.116), (36, -0.021), (37, -0.133), (38, 0.045), (39, -0.031), (40, 0.036), (41, -0.168), (42, 0.125), (43, -0.048), (44, -0.01), (45, 0.064), (46, -0.132), (47, 0.085), (48, -0.03), (49, -0.06)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97249448 192 acl-2011-Language-Independent Parsing with Empty Elements

Author: Shu Cai ; David Chiang ; Yoav Goldberg

Abstract: We present a simple, language-independent method for integrating recovery of empty elements into syntactic parsing. This method outperforms the best published method we are aware of on English and a recently published method on Chinese.

2 0.80733711 28 acl-2011-A Statistical Tree Annotator and Its Applications

Author: Xiaoqiang Luo ; Bing Zhao

Abstract: In many natural language applications, there is a need to enrich syntactical parse trees. We present a statistical tree annotator augmenting nodes with additional information. The annotator is generic and can be applied to a variety of applications. We report 3 such applications in this paper: predicting function tags; predicting null elements; and predicting whether a tree constituent is projectable in machine translation. Our function tag prediction system outperforms significantly published results.

3 0.68011028 66 acl-2011-Chinese sentence segmentation as comma classification

Author: Nianwen Xue ; Yaqin Yang

Abstract: We describe a method for disambiguating Chinese commas that is central to Chinese sentence segmentation. Chinese sentence segmentation is viewed as the detection of loosely coordinated clauses separated by commas. Trained and tested on data derived from the Chinese Treebank, our model achieves a classification accuracy of close to 90% overall, which translates to an F1 score of 70% for detecting commas that signal sentence boundaries.

4 0.63515973 184 acl-2011-Joint Hebrew Segmentation and Parsing using a PCFGLA Lattice Parser

Author: Yoav Goldberg ; Michael Elhadad

Abstract: We experiment with extending a lattice parsing methodology for parsing Hebrew (Goldberg and Tsarfaty, 2008; Golderg et al., 2009) to make use of a stronger syntactic model: the PCFG-LA Berkeley Parser. We show that the methodology is very effective: using a small training set of about 5500 trees, we construct a parser which parses and segments unsegmented Hebrew text with an F-score of almost 80%, an error reduction of over 20% over the best previous result for this task. This result indicates that lattice parsing with the Berkeley parser is an effective methodology for parsing over uncertain inputs.

5 0.60533595 284 acl-2011-Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models

Author: Elias Ponvert ; Jason Baldridge ; Katrin Erk

Abstract: We consider a new subproblem of unsupervised parsing from raw text, unsupervised partial parsing—the unsupervised version of text chunking. We show that addressing this task directly, using probabilistic finite-state methods, produces better results than relying on the local predictions of a current best unsupervised parser, Seginer’s (2007) CCL. These finite-state models are combined in a cascade to produce more general (full-sentence) constituent structures; doing so outperforms CCL by a wide margin in unlabeled PARSEVAL scores for English, German and Chinese. Finally, we address the use of phrasal punctuation as a heuristic indicator of phrasal boundaries, both in our system and in CCL.

6 0.60272294 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation

7 0.53428435 59 acl-2011-Better Automatic Treebank Conversion Using A Feature-Based Approach

8 0.47060448 176 acl-2011-Integrating surprisal and uncertain-input models in online sentence comprehension: formal techniques and empirical results

9 0.45801136 300 acl-2011-The Surprising Variance in Shortest-Derivation Parsing

10 0.43432185 173 acl-2011-Insertion Operator for Bayesian Tree Substitution Grammars

11 0.42854786 330 acl-2011-Using Derivation Trees for Treebank Error Detection

12 0.42618048 58 acl-2011-Beam-Width Prediction for Efficient Context-Free Parsing

13 0.41137582 200 acl-2011-Learning Dependency-Based Compositional Semantics

14 0.41103142 206 acl-2011-Learning to Transform and Select Elementary Trees for Improved Syntax-based Machine Translations

15 0.39952463 243 acl-2011-Partial Parsing from Bitext Projections

16 0.39648351 336 acl-2011-Why Press Backspace? Understanding User Input Behaviors in Chinese Pinyin Input Method

17 0.38886148 27 acl-2011-A Stacked Sub-Word Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging

18 0.38670355 171 acl-2011-Incremental Syntactic Language Models for Phrase-based Translation

19 0.38422751 316 acl-2011-Unary Constraints for Efficient Context-Free Parsing

20 0.36910018 49 acl-2011-Automatic Evaluation of Chinese Translation Output: Word-Level or Character-Level?


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.022), (17, 0.087), (26, 0.018), (37, 0.057), (39, 0.2), (41, 0.047), (53, 0.019), (55, 0.042), (59, 0.052), (72, 0.033), (83, 0.194), (91, 0.034), (96, 0.108)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.87050772 192 acl-2011-Language-Independent Parsing with Empty Elements

Author: Shu Cai ; David Chiang ; Yoav Goldberg

Abstract: We present a simple, language-independent method for integrating recovery of empty elements into syntactic parsing. This method outperforms the best published method we are aware of on English and a recently published method on Chinese.

2 0.7706148 184 acl-2011-Joint Hebrew Segmentation and Parsing using a PCFGLA Lattice Parser

Author: Yoav Goldberg ; Michael Elhadad

Abstract: We experiment with extending a lattice parsing methodology for parsing Hebrew (Goldberg and Tsarfaty, 2008; Golderg et al., 2009) to make use of a stronger syntactic model: the PCFG-LA Berkeley Parser. We show that the methodology is very effective: using a small training set of about 5500 trees, we construct a parser which parses and segments unsegmented Hebrew text with an F-score of almost 80%, an error reduction of over 20% over the best previous result for this task. This result indicates that lattice parsing with the Berkeley parser is an effective methodology for parsing over uncertain inputs.

3 0.76630569 97 acl-2011-Discovering Sociolinguistic Associations with Structured Sparsity

Author: Jacob Eisenstein ; Noah A. Smith ; Eric P. Xing

Abstract: We present a method to discover robust and interpretable sociolinguistic associations from raw geotagged text data. Using aggregate demographic statistics about the authors’ geographic communities, we solve a multi-output regression problem between demographics and lexical frequencies. By imposing a composite ‘1,∞ regularizer, we obtain structured sparsity, driving entire rows of coefficients to zero. We perform two regression studies. First, we use term frequencies to predict demographic attributes; our method identifies a compact set of words that are strongly associated with author demographics. Next, we conjoin demographic attributes into features, which we use to predict term frequencies. The composite regularizer identifies a small number of features, which correspond to communities of authors united by shared demographic and linguistic properties.

4 0.76565123 52 acl-2011-Automatic Labelling of Topic Models

Author: Jey Han Lau ; Karl Grieser ; David Newman ; Timothy Baldwin

Abstract: We propose a method for automatically labelling topics learned via LDA topic models. We generate our label candidate set from the top-ranking topic terms, titles of Wikipedia articles containing the top-ranking topic terms, and sub-phrases extracted from the Wikipedia article titles. We rank the label candidates using a combination of association measures and lexical features, optionally fed into a supervised ranking model. Our method is shown to perform strongly over four independent sets of topics, significantly better than a benchmark method.

5 0.7506724 59 acl-2011-Better Automatic Treebank Conversion Using A Feature-Based Approach

Author: Muhua Zhu ; Jingbo Zhu ; Minghan Hu

Abstract: For the task of automatic treebank conversion, this paper presents a feature-based approach which encodes bracketing structures in a treebank into features to guide the conversion of this treebank to a different standard. Experiments on two Chinese treebanks show that our approach improves conversion accuracy by 1.31% over a strong baseline.

6 0.74011379 29 acl-2011-A Word-Class Approach to Labeling PSCFG Rules for Machine Translation

7 0.73167223 27 acl-2011-A Stacked Sub-Word Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging

8 0.66654092 182 acl-2011-Joint Annotation of Search Queries

9 0.66396046 304 acl-2011-Together We Can: Bilingual Bootstrapping for WSD

10 0.64331394 269 acl-2011-Scaling up Automatic Cross-Lingual Semantic Role Annotation

11 0.63812244 28 acl-2011-A Statistical Tree Annotator and Its Applications

12 0.63510972 282 acl-2011-Shift-Reduce CCG Parsing

13 0.63482243 316 acl-2011-Unary Constraints for Efficient Context-Free Parsing

14 0.63456929 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation

15 0.63063526 236 acl-2011-Optimistic Backtracking - A Backtracking Overlay for Deterministic Incremental Parsing

16 0.63030082 202 acl-2011-Learning Hierarchical Translation Structure with Linguistic Annotations

17 0.62812507 10 acl-2011-A Discriminative Model for Joint Morphological Disambiguation and Dependency Parsing

18 0.62714434 137 acl-2011-Fine-Grained Class Label Markup of Search Queries

19 0.6242702 215 acl-2011-MACAON An NLP Tool Suite for Processing Word Lattices

20 0.62407708 5 acl-2011-A Comparison of Loopy Belief Propagation and Dual Decomposition for Integrated CCG Supertagging and Parsing