acl acl2011 acl2011-184 knowledge-graph by maker-knowledge-mining

184 acl-2011-Joint Hebrew Segmentation and Parsing using a PCFGLA Lattice Parser


Source: pdf

Author: Yoav Goldberg ; Michael Elhadad

Abstract: We experiment with extending a lattice parsing methodology for parsing Hebrew (Goldberg and Tsarfaty, 2008; Golderg et al., 2009) to make use of a stronger syntactic model: the PCFG-LA Berkeley Parser. We show that the methodology is very effective: using a small training set of about 5500 trees, we construct a parser which parses and segments unsegmented Hebrew text with an F-score of almost 80%, an error reduction of over 20% over the best previous result for this task. This result indicates that lattice parsing with the Berkeley parser is an effective methodology for parsing over uncertain inputs.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We experiment with extending a lattice parsing methodology for parsing Hebrew (Goldberg and Tsarfaty, 2008; Golderg et al. [sent-4, score-0.99]

2 We show that the methodology is very effective: using a small training set of about 5500 trees, we construct a parser which parses and segments unsegmented Hebrew text with an F-score of almost 80%, an error reduction of over 20% over the best previous result for this task. [sent-6, score-0.393]

3 This result indicates that lattice parsing with the Berkeley parser is an effective methodology for parsing over uncertain inputs. [sent-7, score-1.159]

4 1 Introduction Most work on parsing assumes that the lexical items in the yield of a parse tree are fully observed, and correspond to space delimited tokens, perhaps after a deterministic preprocessing step of tokenization. [sent-8, score-0.399]

5 For example, the Hebrew token bcl can be interpreted as the single noun meaning “onion”, or as a sequence of a preposition and a noun b-cl meaning “in (the) shadow”. [sent-10, score-0.128]

6 , 2001) items corresponding to an input string is ambiguous, and cannot be determined using a deterministic procedure. [sent-12, score-0.127]

7 In this work, we focus on constituency parsing of Modern Hebrew (henceforth Hebrew) from raw unsegmented text. [sent-13, score-0.367]

8 A common method of approaching the discrepancy between input strings and space delimited tokens is using a pipeline process, in which the input string is pre-segmented prior to handing it to a parser. [sent-14, score-0.228]

9 The shortcoming of this method, as noted by (Tsarfaty, 2006), is that many segmentation decisions cannot be resolved based on local context alone. [sent-15, score-0.183]

10 Thus, segmentation decisions should be integrated into the parsing process and not performed as an independent preprocessing step. [sent-17, score-0.401]

11 Goldberg and Tsarfaty (2008) demonstrated the effectiveness of lattice parsing for jointly performing segmentation and parsing of Hebrew text. [sent-18, score-1.158]

12 They experimented with various manual refinements of unlexicalized, treebank-derived grammars, and showed that better grammars contribute to better segmentation accuracies. [sent-19, score-0.218]

13 (2009) showed that segmentation and parsing accuracies can be further improved by extending the lexical coverage of a lattice-parser using an external resource. [sent-21, score-0.495]

14 Recently, Green and Manning (2010) demonstrated the effectiveness of lattice-parsing for parsing Arabic. [sent-22, score-0.252]

15 Here, we report the results of experiments coupling lattice parsing together with the currently best grammar learning method: the Berkeley PCFG-LA parser (Petrov et al. [sent-23, score-0.929]

16 Several such elements may attach together, producing forms such as wfmhfmf (w-f-m-hfmf “and-that-from-the-sun”). [sent-29, score-0.162]

17 The linear order of such segmental elements within a token is fixed (disallowing the reading w-f-m-h-f-mf in the previous example). [sent-31, score-0.136]
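
To make the affixation ambiguity concrete, the sketch below enumerates candidate prefix segmentations of a raw token by peeling off prefixal particles in their fixed linear order. It is a simplified illustration only: the four-particle inventory, the glosses and the toy lexicon check are assumptions made here, and in the paper the actual analyses come from the MILA analyzer rather than from such a procedure.

    # Simplified illustration of Hebrew prefix ambiguity; the particle inventory and
    # toy lexicon are hypothetical (the paper uses the MILA morphological analyzer).
    PREFIX_ORDER = ["w", "f", "m", "h"]  # "and", "that", "from", "the"; fixed linear order

    def prefix_segmentations(token, known_words):
        """Enumerate (prefixes, remainder) splits consistent with the fixed particle order."""
        results = []
        def peel(rest, prefixes, next_slot):
            if rest in known_words:
                results.append((tuple(prefixes), rest))
            for i in range(next_slot, len(PREFIX_ORDER)):
                particle = PREFIX_ORDER[i]
                if rest.startswith(particle) and len(rest) > len(particle):
                    peel(rest[len(particle):], prefixes + [particle], i + 1)
        peel(token, [], 0)
        return results

    # For wfmhfmf with a toy lexicon, both w-f-m-hfmf and w-f-m-h-fmf are produced,
    # while the ill-ordered w-f-m-h-f-mf is not, because peeling follows the fixed order.
    print(prefix_segmentations("wfmhfmf", {"fmf", "hfmf"}))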

18 However, the syntactic relations of these elements with respect to the rest of the sentence are rather free. [sent-32, score-0.068]

19 The relativizer f (“that”), for example, may attach to an arbitrarily long relative clause that goes beyond token boundaries. [sent-33, score-0.179]

20 To further complicate matters, the definite article h (“the”) is not realized in writing when following the particles b (“in”), k (“like”) and l (“to”). [sent-34, score-0.098]

21 In addition, pronominal elements may attach to nouns, verbs, adverbs, prepositions and others as suffixes (e. [sent-36, score-0.139]

22 Rich templatic morphology Hebrew has a very productive morphological structure, which is based on a root+template system. [sent-45, score-0.161]

23 The productive morphology results in many distinct word forms and a high out-of-vocabulary rate which makes it hard to reliably estimate lexical parameters from annotated corpora. [sent-46, score-0.094]

24 The root+template system (combined with the unvocalized writing system and rich affixation) makes it hard to guess the morphological analyses of an unknown word based on its prefix and suffix, as usually done in other languages. [sent-47, score-0.288]

25 Unvocalized writing system Most vowels are not marked in everyday Hebrew text, which results in a very high level of lexical and morphological ambiguity. [sent-48, score-0.197]

26 Agreement Hebrew grammar forces morphological agreement between Adjectives and Nouns (which should agree on Gender and Number and definiteness), and between Subjects and Verbs (which should agree on Gender and Number). [sent-50, score-0.248]

27 3 PCFG-LA Grammar Estimation Klein and Manning (2003) demonstrated that linguistically informed splitting of non-terminal symbols in treebank-derived grammars can result in accurate grammars. [sent-51, score-0.096]

28 Their work triggered investigations in automatic grammar refinement and state-splitting (Matsuzaki et al. [sent-52, score-0.109]

29 , 2006) and its publicly available implementation, the Berkeley parser, works by starting with a bare-bones treebank-derived grammar and automatically refining it in split-merge-smooth cycles. [sent-56, score-0.184]

30 The learning works by iteratively (1) splitting each non-terminal category in two, (2) merging back non-effective splits and (3) smoothing the split non-terminals toward their shared ancestor. [sent-57, score-0.066]

31 This process allows learning tree annotations which capture many latent syntactic interactions. [sent-59, score-0.066]

32 At inference time, the latent annotations are (approximately) marginalized out, resulting in the (approximate) most probable unannotated tree according to the refined grammar. [sent-60, score-0.066]

33 This parsing methodology is very robust, producing state-of-the-art accuracies for English, as well as many other languages including German (Petrov and Klein, 2008), French (Candito et al. [sent-61, score-0.337]

34 The grammar learning process is applied to binarized parse trees, with 1st-order vertical and 0th-order horizontal markovization. [sent-63, score-0.105]
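
A small sketch of the binarization just described: with 0th-order horizontal markovization, the intermediate symbol introduced when binarizing a flat rule records none of the already-generated sisters, which is what lets the refined grammar learn its own ordering preferences (sentence 38). The "@Parent" naming convention is an assumption for illustration, not necessarily the parser's internal format, and the 1st-order vertical (parent) annotation is omitted for brevity.

    # Right-branching binarization with 0th-order horizontal markovization:
    # the single intermediate symbol "@Parent" carries no sibling history.
    def binarize(parent, children):
        """Binarize one n-ary rule; returns a list of (lhs, rhs) rules."""
        if len(children) <= 2:
            return [(parent, tuple(children))]
        inter = "@" + parent  # 0th-order horizontal: same symbol whatever the siblings are
        rules = [(parent, (children[0], inter))]
        for i in range(1, len(children) - 2):
            rules.append((inter, (children[i], inter)))
        rules.append((inter, (children[-2], children[-1])))
        return rules

    # A flat rule S -> NP VP NP PP becomes:  S -> NP @S,  @S -> VP @S,  @S -> NP PP
    print(binarize("S", ["NP", "VP", "NP", "PP"]))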

35 Figure 1: Lattice representation of the sentence bclm hneim. [sent-66, score-0.112]

36 Lattice arcs correspond to different segments of the token, and each lattice path encodes a possible reading of the sentence. [sent-68, score-0.597]

37 Notice how the token bclm has analyses which include segments which are not directly present in the unsegmented form, such as the definite article h (1-3) and the pronominal suffix which is expanded to the sequence fl hm (“of them”, 2-4, 4-5). [sent-69, score-0.512]

38 However, it allows the resulting refined grammar to encode its own set of dependencies between a node and its sisters, as well as ordering preferences in long, flat rules. [sent-72, score-0.1]

39 Our initial experiments on Hebrew confirm that moving to higher order horizontal markovization degrades parsing performance, while producing much larger grammars. [sent-73, score-0.291]

40 4 Lattice Representation and Parsing Following (Goldberg and Tsarfaty, 2008) we deal with the ambiguous affixation patterns in Hebrew by encoding the input sentence as a segmentation lattice. [sent-74, score-0.339]

41 Each token is encoded as a lattice representing its possible analyses, and the token-lattices are then concatenated to form the sentence-lattice. [sent-75, score-0.6]

42 Figure 1 presents the lattice for the two token sentence “bclm hneim”. [sent-76, score-0.6]
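
A minimal sketch of the lattice encoding described above, using the simpler bcl ambiguity from the introduction ("onion" vs. b-cl) rather than the richer bclm lattice of Figure 1. An arc is a (start state, end state, segment) triple, every path through the sentence lattice is one reading, and token lattices are concatenated by offsetting their states; in the paper the per-token arcs come from the MILA analyzer, not from hand-written lists like these.

    # Segmentation lattice as a list of (start_state, end_state, segment) arcs.
    def concatenate(token_lattices):
        """Concatenate per-token lattices into one sentence lattice by offsetting states."""
        sentence_arcs, offset = [], 0
        for arcs in token_lattices:
            n_states = max(end for _, end, _ in arcs)
            sentence_arcs += [(s + offset, e + offset, seg) for s, e, seg in arcs]
            offset += n_states
        return sentence_arcs

    bcl = [(0, 2, "bcl"),                 # single noun reading, "onion"
           (0, 1, "b"), (1, 2, "cl")]     # preposition b + noun cl, "in (the) shadow"
    hneim = [(0, 1, "hneim")]             # second token, treated as unambiguous here
    print(concatenate([bcl, hneim]))
    # -> [(0, 2, 'bcl'), (0, 1, 'b'), (1, 2, 'cl'), (2, 3, 'hneim')]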

43 Lattice Parsing The CKY parsing algorithm can be extended to accept a lattice as its input (Chappelier et al. [sent-78, score-0.793]

44 This works by indexing lexical items by their start and end states in the lattice instead of by their sentence position, and changing the initialization procedure of CKY to allow terminal and preterminal symbols of spans of sizes > 1. [sent-80, score-0.579]

45 It is then relatively straightforward to modify the parsing mechanism to support this change: not giving special treatments for spans of size 1, and distinguishing lexical items from non-terminals by a specified marking instead of by their position in the chart. [sent-81, score-0.349]
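
The chart initialization change described in sentences 44-45 can be sketched as follows: lexical items are entered into chart cells keyed by their arc's (start state, end state) pair rather than by token position, so pre-terminals may span more than one state. The chart layout, the lexical probability table and the numbers below are illustrative placeholders, not the Berkeley parser's actual data structures.

    from collections import defaultdict

    # Lattice-CKY initialization: pre-terminals are indexed by lattice states, not positions.
    def init_chart(lattice_arcs, lexical_probs):
        """lattice_arcs: [(start, end, segment)]; lexical_probs: {(tag, segment): p(t -> w)}."""
        chart = defaultdict(dict)  # chart[(start, end)][symbol] = best score so far
        for start, end, segment in lattice_arcs:
            for (tag, seg), p in lexical_probs.items():
                if seg == segment and p > chart[(start, end)].get(tag, 0.0):
                    chart[(start, end)][tag] = p
        return chart  # the binary CKY steps then combine cells (i, k) and (k, j) as usual

    arcs = [(0, 2, "bcl"), (0, 1, "b"), (1, 2, "cl")]
    probs = {("NN", "bcl"): 0.6, ("IN", "b"): 0.9, ("NN", "cl"): 0.5}  # made-up numbers
    print(dict(init_chart(arcs, probs)))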

46 We modified the PCFG-LA Berkeley parser to accept lattice input at inference time (training is performed as usual on fully observed treebank trees). [sent-82, score-0.783]

47 Lattice Construction We construct the token lattices using MILA, a lexicon-based morphological analyzer which provides a set of possible analyses for each token (Itai and Wintner, 2008). [sent-83, score-0.447]

48 Still, the use of the lexicon for lattice construction rather than relying on forms seen in the treebank is essential to achieve parsing accuracy. [sent-87, score-0.88]

49 Lexical Probabilities Estimation Lexical p(t → w) probabilities are defined over individual segments rather than for complete tokens. [sent-88, score-0.122]

50 We use the default lexical probability estimation of the Berkeley parser. [sent-90, score-0.072]

51 (2009) suggest estimating lexical probabilities for rare and unseen segments using the emission probabilities of an HMM tagger trained with EM on large corpora. [sent-92, score-0.189]

52 Our preliminary experiments with this method with the Berkeley parser (footnote: probabilities for robust segments, i.e. lexical items observed 100 times or more in training, are based on the MLE estimates resulting from the EM procedure). [sent-93, score-0.262]

53 Other segments are assigned smoothed probabilities which combine the p(w|t) MLE estimate with unigram tag probabilities. [sent-94, score-0.122]

54 Crucially, we restrict each segment to appear only with tags which are licensed by a morphological analyzer, as encoded in the lattice. [sent-96, score-0.181]
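
One standard way to realize the interpolation and licensing constraints described in sentences 53-54, shown only as an illustration (the exact smoothing scheme inside the Berkeley lexicon may differ, and the smoothing constant alpha is a placeholder), is to smooth the tagging distribution of a rare segment toward the unigram tag distribution, invert it to obtain the emission probability, and zero out tags not licensed by the analyzer for that segment:

    p(t | w) ≈ (count(w, t) + alpha · p(t)) / (count(w) + alpha)   if t is licensed for w, else 0
    p(t → w) = p(w | t) = p(t | w) · p(w) / p(t)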

55 When analyzing the parsing results on out-of-treebank text, we observed cases where this estimation method indeed fixed mistakes, and others where it hurt. [sent-99, score-0.253]

56 We are still uncertain if the slight drop in performance over the test set is due to overfitting of the treebank vocabulary, or the inadequacy of the method in general. [sent-100, score-0.111]

57 , 2009), which was converted to use the tagset of the MILA morphological analyzer (Golderg et al. [sent-103, score-0.212]

58 Gold Segmentation and Tagging To assess the adequacy of the Berkeley parser for Hebrew, we performed baseline experiments in which either gold segmentation and tagging or just gold segmentation were available to the parser. [sent-108, score-0.683]

59 8% for the gold segmentation and tagging, and about 82. [sent-110, score-0.221]

60 This shows the adequacy of the PCFG-LA methodology for parsing the Hebrew treebank, but also goes to show the highly ambiguous nature of the tagging. [sent-112, score-0.34]

61 Our baseline lattice parsing experiment (without the lexicon) results in an F-score of around 76%. [sent-113, score-0.723]

62 4 Segmentation → Parsing pipeline As another baseline, we experimented with a pipeline system in which the input text is automatically segmented and tagged using a state-of-the-art HMM pos-tagger (Goldberg et al. [sent-114, score-0.264]

63 In the pipeline setting, we either allow the parser to assign all possible POS-tags, or restrict it to POS-tags licensed by the lexicon. [sent-119, score-0.275]
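
The two pipeline variants just described differ only in which POS-tags the parser may hypothesize for each already-segmented token; a minimal sketch, with a made-up tagset and lexicon standing in for the MILA entries:

    # The two pipeline variants: either any tag may be assigned to a segmented token,
    # or only the tags the lexicon licenses for it.  Tagset and entries are made up.
    ALL_TAGS = {"NN", "VB", "IN", "JJ", "RB"}
    LEXICON = {"b": {"IN"}, "cl": {"NN"}, "bcl": {"NN"}}  # hypothetical entries

    def candidate_tags(segment, restrict_to_lexicon):
        if restrict_to_lexicon and segment in LEXICON:
            return LEXICON[segment]
        return ALL_TAGS  # unrestricted, or an out-of-lexicon segment

    print(candidate_tags("bcl", restrict_to_lexicon=True))   # {'NN'}
    print(candidate_tags("bcl", restrict_to_lexicon=False))  # all five tags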

64 Lattice Parsing Experiments Our initial lattice parsing experiments with the Berkeley parser were disappointing. [sent-120, score-0.856]

65 The lattice seemed too permissive, allowing the parser to choose weird analyses. [sent-121, score-0.638]

66 Error analysis suggested the parser failed to distinguish among the various kinds of VPs: finite, non-finite and modals. [sent-122, score-0.133]

67 Once we annotate the treebank verbs into finite, non-finite and modals, results improve a lot. [sent-123, score-0.075]

68 Further improvement was gained by specifically marking the subject-NPs. [sent-124, score-0.057]

69 The parser was not able to correctly learn these splits on its own, but once they were manually provided it did a very good job utilizing this information. [sent-125, score-0.172]

70 In all the experiments, the use of the morphological analyzer in producing the lattice was crucial for parsing accuracy. [sent-128, score-0.976]

71 Results Our final configuration (marking verbal forms and subject-NPs, using the analyzer to construct the lattice and training the parser for 5 iterations) produces remarkable parsing accuracy when parsing from unsegmented text: an F-score of 79. [sent-129, score-1.38]

72 The pipeline systems with the same grammar achieve substantially lower F-scores of 75. [sent-134, score-0.164]

73 For comparison, the previous best results for parsing Hebrew are 84. [sent-137, score-0.218]

74 1%F assuming gold segmentation and tagging (Tsarfaty and Sima’an, 2010), and 73. [sent-138, score-0.296]

75 7%F starting from unsegmented text (Golderg et [sent-139, score-0.119]

76 Footnote 5: The segmentation+tagging accuracy of the HMM tagger on the Treebank data is 91. Footnote 6: This information is available in both the treebank and the morphological analyzer, but we removed it at first. [sent-141, score-0.205]

77 (2009) also report improvements in accuracy when providing the PCFG-LA parser with a few manually devised linguistically-motivated state-splits. [sent-145, score-0.133]

78 While the pipeline system already improves over the previous best results, the lattice-based joint-model improves results even further. [sent-155, score-0.091]

79 Overall, the PCFG-LA+Lattice parser improves results by 6 F-points absolute, an error reduction of about 20%. [sent-156, score-0.133]

80 Tagging accuracies are also remarkable, and constitute state-of-the-art tagging for Hebrew. [sent-157, score-0.104]

81 Running time The lattice representation effectively results in longer inputs to the parser. [sent-159, score-0.531]

82 It is informative to quantify the effect of the lattice representation on the parsing time, which is cubic in sentence length. [sent-160, score-0.749]

83 The pipeline parser parsed the 483 pre-segmented input sentences in 151 seconds (3. [sent-161, score-0.264]

84 2 sentences/second) not including segmentation time, while the lattice parser took 175 seconds (2. [sent-162, score-0.821]
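
The throughput figures above follow directly from the reported times, a useful sanity check on the "cubic in sentence length" worry:

    pipeline: 483 sentences / 151 s ≈ 3.2 sentences/second
    lattice:  483 sentences / 175 s ≈ 2.8 sentences/second

so the lattice parser takes about 175/151 ≈ 1.16 times as long, roughly a 16% slowdown rather than a cubic blow-up.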

85 Parsing with the lattice representation is slower than in the pipeline setup, but not prohibitively so. [sent-164, score-0.622]

86 However, the state-split model exhibits no notion of syntactic agreement on gender and number. [sent-167, score-0.109]

87 This is troubling, as we encountered a fair number of parsing mistakes which would have been solved if the parser were to use agreement information. [sent-168, score-0.449]

88 6 Conclusions and Future Work We demonstrated that the combination of lattice parsing with the PCFG-LA Berkeley parser is highly effective. [sent-169, score-0.89]

89 Lattice parsing allows much needed flexibility in providing input to a parser when the yield of the tree is not known in advance, and the grammar refinement and estimation techniques of the Berkeley parser provide a strong disambiguation component. [sent-170, score-0.668]

90 In this work, we applied the Berkeley+Lattice parser to the challenging task of joint segmentation and parsing of Hebrew text. [sent-171, score-0.534]

91 The result is the first constituency parser which can parse naturally occurring unsegmented Hebrew text with an acceptable accuracy (an F1 score of 80%). [sent-172, score-0.282]

92 These include joint segmentation and parsing of Chinese, empty element prediction (see (Cai et al. [sent-174, score-0.401]

93 The code of the lattice extension to the Berkeley parser is publicly available. [sent-176, score-0.638]

94 Despite its strong performance, we observed that the Berkeley parser did not learn morphological agreement patterns. [sent-177, score-0.308]

95 Unsupervised lexicon-based resolution of unknown words for full morphological analysis. [sent-188, score-0.13]

96 On statistical parsing of French with supervised and semi-supervised strategies. [sent-198, score-0.218]

97 A single generative model for joint morphological segmentation and syntactic parsing. [sent-210, score-0.34]

98 Enhancing unlexicalized parsing performance using a wide coverage lexicon, fuzzy tag-set mapping, and em-hmm-based lexical probabilities. [sent-220, score-0.319]

99 Inducing head-driven PCFGs with latent heads: Refining a tree-bank grammar for parsing. [sent-271, score-0.112]

100 Modeling morphosyntactic agreement in constituency-based parsing of Modern Hebrew. [sent-280, score-0.303]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('lattice', 0.505), ('hebrew', 0.417), ('parsing', 0.218), ('segmentation', 0.183), ('berkeley', 0.16), ('tsarfaty', 0.159), ('goldberg', 0.154), ('parser', 0.133), ('morphological', 0.13), ('unsegmented', 0.119), ('golderg', 0.115), ('yoav', 0.104), ('adler', 0.101), ('petrov', 0.096), ('token', 0.095), ('segments', 0.092), ('pipeline', 0.091), ('bclm', 0.086), ('meni', 0.086), ('reut', 0.083), ('analyzer', 0.082), ('affixation', 0.076), ('tagging', 0.075), ('treebank', 0.075), ('grammar', 0.073), ('sima', 0.072), ('modern', 0.07), ('itai', 0.066), ('delimited', 0.057), ('guthmann', 0.057), ('mila', 0.057), ('unvocalized', 0.057), ('yoad', 0.057), ('marking', 0.057), ('lexicon', 0.056), ('attach', 0.054), ('slav', 0.051), ('licensed', 0.051), ('chappelier', 0.051), ('deterministic', 0.05), ('methodology', 0.049), ('analyses', 0.045), ('agreement', 0.045), ('hmm', 0.044), ('pronominal', 0.044), ('remarkable', 0.044), ('segmented', 0.042), ('definiteness', 0.042), ('elements', 0.041), ('alon', 0.041), ('producing', 0.041), ('ambiguous', 0.04), ('input', 0.04), ('morphosyntactic', 0.04), ('matsuzaki', 0.04), ('candito', 0.04), ('latent', 0.039), ('splits', 0.039), ('gold', 0.038), ('automatique', 0.038), ('traitement', 0.038), ('khalil', 0.038), ('gender', 0.037), ('lexical', 0.037), ('particles', 0.037), ('items', 0.037), ('unlexicalized', 0.036), ('refining', 0.036), ('refinement', 0.036), ('uncertain', 0.036), ('grammars', 0.035), ('estimation', 0.035), ('mle', 0.035), ('cai', 0.035), ('verbal', 0.035), ('cky', 0.034), ('demonstrated', 0.034), ('interpreted', 0.033), ('adequacy', 0.033), ('horizontal', 0.032), ('definite', 0.031), ('house', 0.031), ('productive', 0.031), ('probabilities', 0.03), ('long', 0.03), ('accept', 0.03), ('green', 0.03), ('writing', 0.03), ('constituency', 0.03), ('accuracies', 0.029), ('coverage', 0.028), ('refined', 0.027), ('syntactic', 0.027), ('mistakes', 0.027), ('splitting', 0.027), ('rich', 0.026), ('morphologically', 0.026), ('fair', 0.026), ('forms', 0.026), ('representation', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 184 acl-2011-Joint Hebrew Segmentation and Parsing using a PCFGLA Lattice Parser

Author: Yoav Goldberg ; Michael Elhadad

Abstract: We experiment with extending a lattice parsing methodology for parsing Hebrew (Goldberg and Tsarfaty, 2008; Golderg et al., 2009) to make use of a stronger syntactic model: the PCFG-LA Berkeley Parser. We show that the methodology is very effective: using a small training set of about 5500 trees, we construct a parser which parses and segments unsegmented Hebrew text with an F-score of almost 80%, an error reduction of over 20% over the best previous result for this task. This result indicates that lattice parsing with the Berkeley parser is an effective methodology for parsing over uncertain inputs.

2 0.2519289 192 acl-2011-Language-Independent Parsing with Empty Elements

Author: Shu Cai ; David Chiang ; Yoav Goldberg

Abstract: We present a simple, language-independent method for integrating recovery of empty elements into syntactic parsing. This method outperforms the best published method we are aware of on English and a recently published method on Chinese.

3 0.18806253 10 acl-2011-A Discriminative Model for Joint Morphological Disambiguation and Dependency Parsing

Author: John Lee ; Jason Naradowsky ; David A. Smith

Abstract: Most previous studies of morphological disambiguation and dependency parsing have been pursued independently. Morphological taggers operate on n-grams and do not take into account syntactic relations; parsers use the “pipeline” approach, assuming that morphological information has been separately obtained. However, in morphologically-rich languages, there is often considerable interaction between morphology and syntax, such that neither can be disambiguated without the other. In this paper, we propose a discriminative model that jointly infers morphological properties and syntactic structures. In evaluations on various highly-inflected languages, this joint model outperforms both a baseline tagger in morphological disambiguation, and a pipeline parser in head selection.

4 0.15959799 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation

Author: Zhongguo Li

Abstract: Lots of Chinese characters are very productive in that they can form many structured words either as prefixes or as suffixes. Previous research in Chinese word segmentation mainly focused on identifying only the word boundaries without considering the rich internal structures of many words. In this paper we argue that this is unsatisfying in many ways, both practically and theoretically. Instead, we propose that word structures should be recovered in morphological analysis. An elegant approach for doing this is given and the result is shown to be promising enough for encouraging further effort in this direction. Our probability model is trained with the Penn Chinese Treebank and actually is able to parse both word and phrase structures in a unified way. 1 Why Parse Word Structures? Research in Chinese word segmentation has progressed tremendously in recent years, with state of the art performing at around 97% in precision and recall (Xue, 2003; Gao et al., 2005; Zhang and Clark, 2007; Li and Sun, 2009). However, virtually all these systems focus exclusively on recognizing the word boundaries, giving no consideration to the internal structures of many words. Though it has been the standard practice for many years, we argue that this paradigm is inadequate both in theory and in practice, for at least the following four reasons. The first reason is that if we confine our definition of word segmentation to the identification of word boundaries, then people tend to have divergent 1405 opinions as to whether a linguistic unit is a word or not (Sproat et al., 1996). This has led to many different annotation standards for Chinese word segmentation. Even worse, this could cause inconsistency in the same corpus. For instance, 䉂 擌 奒 ‘vice president’ is considered to be one word in the Penn Chinese Treebank (Xue et al., 2005), but is split into two words by the Peking University corpus in the SIGHAN Bakeoffs (Sproat and Emerson, 2003). Meanwhile, 䉂 䀓 惼 ‘vice director’ and 䉂 䚲䡮 ‘deputy are both segmented into two words in the same Penn Chinese Treebank. In fact, all these words are composed of the prefix 䉂 ‘vice’ and a root word. Thus the structure of 䉂擌奒 ‘vice president’ can be represented with the tree in Figure 1. Without a doubt, there is complete agree- manager’ NN ,,ll JJf NNf 䉂 擌奒 Figure 1: Example of a word with internal structure. ment on the correctness of this structure among native Chinese speakers. So if instead of annotating only word boundaries, we annotate the structures of every word, then the annotation tends to be more 1 1Here it is necessary to add a note on terminology used in this paper. Since there is no universally accepted definition of the “word” concept in linguistics and especially in Chinese, whenever we use the term “word” we might mean a linguistic unit such as 䉂 擌奒 ‘vice president’ whose structure is shown as the tree in Figure 1, or we might mean a smaller unit such as 擌奒 ‘president’ which is a substructure of that tree. Hopefully, ProceedingPso orftla thned 4,9 Otrhe Agonnn,u Jauln Mee 1e9t-i2ng4, o 2f0 t1h1e. A ?c s 2o0ci1a1ti Aonss foocria Ctioomnp fourta Ctioomnaplu Ltaintigouniaslti Lcisn,g puaigsetsic 1s405–1414, consistent and there could be less duplication of efforts in developing the expensive annotated corpus. The second reason is applications have different requirements for granularity of words. Take the personal name 撱 嗤吼 ‘Zhou Shuren’ as an example. 
It’s considered to be one word in the Penn Chinese Treebank, but is segmented into a surname and a given name in the Peking University corpus. For some applications such as information extraction, the former segmentation is adequate, while for others like machine translation, the later finer-grained output is more preferable. If the analyzer can produce a structure as shown in Figure 4(a), then every application can extract what it needs from this tree. A solution with tree output like this is more elegant than approaches which try to meet the needs of different applications in post-processing (Gao et al., 2004). The third reason is that traditional word segmentation has problems in handling many phenomena in Chinese. For example, the telescopic compound 㦌 撥 怂惆 ‘universities, middle schools and primary schools’ is in fact composed ofthree coordinating elements 㦌惆 ‘university’, 撥 惆 ‘middle school’ and 怂惆 ‘primary school’ . Regarding it as one flat word loses this important information. Another example is separable words like 扩 扙 ‘swim’ . With a linear segmentation, the meaning of ‘swimming’ as in 扩 堑 扙 ‘after swimming’ cannot be properly represented, since 扩扙 ‘swim’ will be segmented into discontinuous units. These language usages lie at the boundary between syntax and morphology, and are not uncommon in Chinese. They can be adequately represented with trees (Figure 2). (a) NN (b) ???HHH JJ NNf ???HHH JJf JJf JJf 㦌 撥 怂 惆 VV ???HHH VV NNf ZZ VVf VVf 扩 扙 堑 Figure 2: Example of telescopic compound (a) and separable word (b). The last reason why we should care about word the context will always make it clear what is being referred to with the term “word”. 1406 structures is related to head driven statistical parsers (Collins, 2003). To illustrate this, note that in the Penn Chinese Treebank, the word 戽 䊂䠽 吼 ‘English People’ does not occur at all. Hence constituents headed by such words could cause some difficulty for head driven models in which out-ofvocabulary words need to be treated specially both when they are generated and when they are conditioned upon. But this word is in turn headed by its suffix 吼 ‘people’, and there are 2,233 such words in Penn Chinese Treebank. If we annotate the structure of every compound containing this suffix (e.g. Figure 3), such data sparsity simply goes away.

5 0.15881243 318 acl-2011-Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models

Author: Jason Naradowsky ; Kristina Toutanova

Abstract: This paper describes an unsupervised dynamic graphical model for morphological segmentation and bilingual morpheme alignment for statistical machine translation. The model extends Hidden Semi-Markov chain models by using factored output nodes and special structures for its conditional probability distributions. It relies on morpho-syntactic and lexical source-side information (part-of-speech, morphological segmentation) while learning a morpheme segmentation over the target language. Our model outperforms a competitive word alignment system in alignment quality. Used in a monolingual morphological segmentation setting it substantially improves accuracy over previous state-of-the-art models on three Arabic and Hebrew datasets.

6 0.15034592 171 acl-2011-Incremental Syntactic Language Models for Phrase-based Translation

7 0.13716841 310 acl-2011-Translating from Morphologically Complex Languages: A Paraphrase-Based Approach

8 0.11510645 164 acl-2011-Improving Arabic Dependency Parsing with Form-based and Functional Morphological Features

9 0.11416382 58 acl-2011-Beam-Width Prediction for Efficient Context-Free Parsing

10 0.11364617 27 acl-2011-A Stacked Sub-Word Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging

11 0.1134066 75 acl-2011-Combining Morpheme-based Machine Translation with Post-processing Morpheme Prediction

12 0.11184202 282 acl-2011-Shift-Reduce CCG Parsing

13 0.10442805 143 acl-2011-Getting the Most out of Transition-based Dependency Parsing

14 0.10391963 39 acl-2011-An Ensemble Model that Combines Syntactic and Semantic Clustering for Discriminative Dependency Parsing

15 0.10244944 316 acl-2011-Unary Constraints for Efficient Context-Free Parsing

16 0.10002219 167 acl-2011-Improving Dependency Parsing with Semantic Classes

17 0.097022586 309 acl-2011-Transition-based Dependency Parsing with Rich Non-local Features

18 0.096395902 333 acl-2011-Web-Scale Features for Full-Scale Parsing

19 0.092608869 111 acl-2011-Effects of Noun Phrase Bracketing in Dependency Parsing and Machine Translation

20 0.087736309 59 acl-2011-Better Automatic Treebank Conversion Using A Feature-Based Approach


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.21), (1, -0.091), (2, -0.028), (3, -0.219), (4, -0.065), (5, -0.025), (6, 0.084), (7, 0.049), (8, 0.069), (9, 0.09), (10, -0.014), (11, 0.082), (12, -0.116), (13, -0.037), (14, 0.07), (15, -0.028), (16, 0.03), (17, -0.029), (18, 0.089), (19, 0.159), (20, 0.106), (21, 0.068), (22, -0.053), (23, -0.069), (24, 0.023), (25, 0.034), (26, -0.001), (27, 0.004), (28, -0.008), (29, -0.0), (30, 0.004), (31, 0.008), (32, -0.02), (33, -0.012), (34, 0.02), (35, -0.018), (36, 0.021), (37, -0.033), (38, -0.019), (39, -0.014), (40, -0.036), (41, -0.094), (42, 0.023), (43, -0.056), (44, 0.013), (45, 0.041), (46, -0.065), (47, -0.118), (48, 0.018), (49, -0.042)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95649904 184 acl-2011-Joint Hebrew Segmentation and Parsing using a PCFGLA Lattice Parser

Author: Yoav Goldberg ; Michael Elhadad

Abstract: We experiment with extending a lattice parsing methodology for parsing Hebrew (Goldberg and Tsarfaty, 2008; Golderg et al., 2009) to make use of a stronger syntactic model: the PCFG-LA Berkeley Parser. We show that the methodology is very effective: using a small training set of about 5500 trees, we construct a parser which parses and segments unsegmented Hebrew text with an F-score of almost 80%, an error reduction of over 20% over the best previous result for this task. This result indicates that lattice parsing with the Berkeley parser is an effective methodology for parsing over uncertain inputs.

2 0.77113289 10 acl-2011-A Discriminative Model for Joint Morphological Disambiguation and Dependency Parsing

Author: John Lee ; Jason Naradowsky ; David A. Smith

Abstract: Most previous studies of morphological disambiguation and dependency parsing have been pursued independently. Morphological taggers operate on n-grams and do not take into account syntactic relations; parsers use the “pipeline” approach, assuming that morphological information has been separately obtained. However, in morphologically-rich languages, there is often considerable interaction between morphology and syntax, such that neither can be disambiguated without the other. In this paper, we propose a discriminative model that jointly infers morphological properties and syntactic structures. In evaluations on various highly-inflected languages, this joint model outperforms both a baseline tagger in morphological disambiguation, and a pipeline parser in head selection.

3 0.75871414 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation

Author: Zhongguo Li

Abstract: Lots of Chinese characters are very productive in that they can form many structured words either as prefixes or as suffixes. Previous research in Chinese word segmentation mainly focused on identifying only the word boundaries without considering the rich internal structures of many words. In this paper we argue that this is unsatisfying in many ways, both practically and theoretically. Instead, we propose that word structures should be recovered in morphological analysis. An elegant approach for doing this is given and the result is shown to be promising enough for encouraging further effort in this direction. Our probability model is trained with the Penn Chinese Treebank and actually is able to parse both word and phrase structures in a unified way. 1 Why Parse Word Structures? Research in Chinese word segmentation has progressed tremendously in recent years, with state of the art performing at around 97% in precision and recall (Xue, 2003; Gao et al., 2005; Zhang and Clark, 2007; Li and Sun, 2009). However, virtually all these systems focus exclusively on recognizing the word boundaries, giving no consideration to the internal structures of many words. Though it has been the standard practice for many years, we argue that this paradigm is inadequate both in theory and in practice, for at least the following four reasons. The first reason is that if we confine our definition of word segmentation to the identification of word boundaries, then people tend to have divergent 1405 opinions as to whether a linguistic unit is a word or not (Sproat et al., 1996). This has led to many different annotation standards for Chinese word segmentation. Even worse, this could cause inconsistency in the same corpus. For instance, 䉂 擌 奒 ‘vice president’ is considered to be one word in the Penn Chinese Treebank (Xue et al., 2005), but is split into two words by the Peking University corpus in the SIGHAN Bakeoffs (Sproat and Emerson, 2003). Meanwhile, 䉂 䀓 惼 ‘vice director’ and 䉂 䚲䡮 ‘deputy are both segmented into two words in the same Penn Chinese Treebank. In fact, all these words are composed of the prefix 䉂 ‘vice’ and a root word. Thus the structure of 䉂擌奒 ‘vice president’ can be represented with the tree in Figure 1. Without a doubt, there is complete agree- manager’ NN ,,ll JJf NNf 䉂 擌奒 Figure 1: Example of a word with internal structure. ment on the correctness of this structure among native Chinese speakers. So if instead of annotating only word boundaries, we annotate the structures of every word, then the annotation tends to be more 1 1Here it is necessary to add a note on terminology used in this paper. Since there is no universally accepted definition of the “word” concept in linguistics and especially in Chinese, whenever we use the term “word” we might mean a linguistic unit such as 䉂 擌奒 ‘vice president’ whose structure is shown as the tree in Figure 1, or we might mean a smaller unit such as 擌奒 ‘president’ which is a substructure of that tree. Hopefully, ProceedingPso orftla thned 4,9 Otrhe Agonnn,u Jauln Mee 1e9t-i2ng4, o 2f0 t1h1e. A ?c s 2o0ci1a1ti Aonss foocria Ctioomnp fourta Ctioomnaplu Ltaintigouniaslti Lcisn,g puaigsetsic 1s405–1414, consistent and there could be less duplication of efforts in developing the expensive annotated corpus. The second reason is applications have different requirements for granularity of words. Take the personal name 撱 嗤吼 ‘Zhou Shuren’ as an example. 
It’s considered to be one word in the Penn Chinese Treebank, but is segmented into a surname and a given name in the Peking University corpus. For some applications such as information extraction, the former segmentation is adequate, while for others like machine translation, the later finer-grained output is more preferable. If the analyzer can produce a structure as shown in Figure 4(a), then every application can extract what it needs from this tree. A solution with tree output like this is more elegant than approaches which try to meet the needs of different applications in post-processing (Gao et al., 2004). The third reason is that traditional word segmentation has problems in handling many phenomena in Chinese. For example, the telescopic compound 㦌 撥 怂惆 ‘universities, middle schools and primary schools’ is in fact composed ofthree coordinating elements 㦌惆 ‘university’, 撥 惆 ‘middle school’ and 怂惆 ‘primary school’ . Regarding it as one flat word loses this important information. Another example is separable words like 扩 扙 ‘swim’ . With a linear segmentation, the meaning of ‘swimming’ as in 扩 堑 扙 ‘after swimming’ cannot be properly represented, since 扩扙 ‘swim’ will be segmented into discontinuous units. These language usages lie at the boundary between syntax and morphology, and are not uncommon in Chinese. They can be adequately represented with trees (Figure 2). (a) NN (b) ???HHH JJ NNf ???HHH JJf JJf JJf 㦌 撥 怂 惆 VV ???HHH VV NNf ZZ VVf VVf 扩 扙 堑 Figure 2: Example of telescopic compound (a) and separable word (b). The last reason why we should care about word the context will always make it clear what is being referred to with the term “word”. 1406 structures is related to head driven statistical parsers (Collins, 2003). To illustrate this, note that in the Penn Chinese Treebank, the word 戽 䊂䠽 吼 ‘English People’ does not occur at all. Hence constituents headed by such words could cause some difficulty for head driven models in which out-ofvocabulary words need to be treated specially both when they are generated and when they are conditioned upon. But this word is in turn headed by its suffix 吼 ‘people’, and there are 2,233 such words in Penn Chinese Treebank. If we annotate the structure of every compound containing this suffix (e.g. Figure 3), such data sparsity simply goes away.

4 0.70634073 192 acl-2011-Language-Independent Parsing with Empty Elements

Author: Shu Cai ; David Chiang ; Yoav Goldberg

Abstract: We present a simple, language-independent method for integrating recovery of empty elements into syntactic parsing. This method outperforms the best published method we are aware of on English and a recently published method on Chinese.

5 0.64648849 284 acl-2011-Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models

Author: Elias Ponvert ; Jason Baldridge ; Katrin Erk

Abstract: We consider a new subproblem of unsupervised parsing from raw text, unsupervised partial parsing—the unsupervised version of text chunking. We show that addressing this task directly, using probabilistic finite-state methods, produces better results than relying on the local predictions of a current best unsupervised parser, Seginer’s (2007) CCL. These finite-state models are combined in a cascade to produce more general (full-sentence) constituent structures; doing so outperforms CCL by a wide margin in unlabeled PARSEVAL scores for English, German and Chinese. Finally, we address the use of phrasal punctuation as a heuristic indicator of phrasal boundaries, both in our system and in CCL.

6 0.6215381 27 acl-2011-A Stacked Sub-Word Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging

7 0.60687691 66 acl-2011-Chinese sentence segmentation as comma classification

8 0.60164988 215 acl-2011-MACAON An NLP Tool Suite for Processing Word Lattices

9 0.58252394 300 acl-2011-The Surprising Variance in Shortest-Derivation Parsing

10 0.56784689 75 acl-2011-Combining Morpheme-based Machine Translation with Post-processing Morpheme Prediction

11 0.55806303 140 acl-2011-Fully Unsupervised Word Segmentation with BVE and MDL

12 0.55506498 124 acl-2011-Exploiting Morphology in Turkish Named Entity Recognition System

13 0.544285 318 acl-2011-Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models

14 0.53592163 59 acl-2011-Better Automatic Treebank Conversion Using A Feature-Based Approach

15 0.5246734 310 acl-2011-Translating from Morphologically Complex Languages: A Paraphrase-Based Approach

16 0.52437782 236 acl-2011-Optimistic Backtracking - A Backtracking Overlay for Deterministic Incremental Parsing

17 0.51885736 176 acl-2011-Integrating surprisal and uncertain-input models in online sentence comprehension: formal techniques and empirical results

18 0.5187003 58 acl-2011-Beam-Width Prediction for Efficient Context-Free Parsing

19 0.49767628 316 acl-2011-Unary Constraints for Efficient Context-Free Parsing

20 0.49315742 243 acl-2011-Partial Parsing from Bitext Projections


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.014), (17, 0.03), (26, 0.016), (31, 0.017), (37, 0.071), (39, 0.511), (41, 0.036), (55, 0.023), (59, 0.033), (72, 0.022), (91, 0.031), (96, 0.1)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.96888113 1 acl-2011-(11-06-spirl)

Author: (hal)

Abstract: unkown-abstract

same-paper 2 0.91967827 184 acl-2011-Joint Hebrew Segmentation and Parsing using a PCFGLA Lattice Parser

Author: Yoav Goldberg ; Michael Elhadad

Abstract: We experiment with extending a lattice parsing methodology for parsing Hebrew (Goldberg and Tsarfaty, 2008; Golderg et al., 2009) to make use of a stronger syntactic model: the PCFG-LA Berkeley Parser. We show that the methodology is very effective: using a small training set of about 5500 trees, we construct a parser which parses and segments unsegmented Hebrew text with an F-score of almost 80%, an error reduction of over 20% over the best previous result for this task. This result indicates that lattice parsing with the Berkeley parser is an effective methodology for parsing over uncertain inputs.

3 0.89316618 59 acl-2011-Better Automatic Treebank Conversion Using A Feature-Based Approach

Author: Muhua Zhu ; Jingbo Zhu ; Minghan Hu

Abstract: For the task of automatic treebank conversion, this paper presents a feature-based approach which encodes bracketing structures in a treebank into features to guide the conversion of this treebank to a different standard. Experiments on two Chinese treebanks show that our approach improves conversion accuracy by 1.31% over a strong baseline.

4 0.89315844 52 acl-2011-Automatic Labelling of Topic Models

Author: Jey Han Lau ; Karl Grieser ; David Newman ; Timothy Baldwin

Abstract: We propose a method for automatically labelling topics learned via LDA topic models. We generate our label candidate set from the top-ranking topic terms, titles of Wikipedia articles containing the top-ranking topic terms, and sub-phrases extracted from the Wikipedia article titles. We rank the label candidates using a combination of association measures and lexical features, optionally fed into a supervised ranking model. Our method is shown to perform strongly over four independent sets of topics, significantly better than a benchmark method.

5 0.82620174 97 acl-2011-Discovering Sociolinguistic Associations with Structured Sparsity

Author: Jacob Eisenstein ; Noah A. Smith ; Eric P. Xing

Abstract: We present a method to discover robust and interpretable sociolinguistic associations from raw geotagged text data. Using aggregate demographic statistics about the authors’ geographic communities, we solve a multi-output regression problem between demographics and lexical frequencies. By imposing a composite ‘1,∞ regularizer, we obtain structured sparsity, driving entire rows of coefficients to zero. We perform two regression studies. First, we use term frequencies to predict demographic attributes; our method identifies a compact set of words that are strongly associated with author demographics. Next, we conjoin demographic attributes into features, which we use to predict term frequencies. The composite regularizer identifies a small number of features, which correspond to communities of authors united by shared demographic and linguistic properties.

6 0.77381599 27 acl-2011-A Stacked Sub-Word Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging

7 0.75091153 29 acl-2011-A Word-Class Approach to Labeling PSCFG Rules for Machine Translation

8 0.71431446 192 acl-2011-Language-Independent Parsing with Empty Elements

9 0.62134945 182 acl-2011-Joint Annotation of Search Queries

10 0.56526166 10 acl-2011-A Discriminative Model for Joint Morphological Disambiguation and Dependency Parsing

11 0.54938316 282 acl-2011-Shift-Reduce CCG Parsing

12 0.5440197 242 acl-2011-Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments

13 0.54189241 269 acl-2011-Scaling up Automatic Cross-Lingual Semantic Role Annotation

14 0.54013216 316 acl-2011-Unary Constraints for Efficient Context-Free Parsing

15 0.53614497 215 acl-2011-MACAON An NLP Tool Suite for Processing Word Lattices

16 0.53515476 236 acl-2011-Optimistic Backtracking - A Backtracking Overlay for Deterministic Incremental Parsing

17 0.52905309 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation

18 0.52516389 202 acl-2011-Learning Hierarchical Translation Structure with Linguistic Annotations

19 0.51956749 238 acl-2011-P11-2093 k2opt.pdf

20 0.51574862 300 acl-2011-The Surprising Variance in Shortest-Derivation Parsing