acl acl2012 acl2012-122 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Reut Tsarfaty ; Joakim Nivre ; Evelina Andersson
Abstract: We present novel metrics for parse evaluation in joint segmentation and parsing scenarios where the gold sequence of terminals is not known in advance. The protocol uses distance-based metrics defined for the space of trees over lattices. Our metrics allow us to precisely quantify the performance gap between non-realistic parsing scenarios (assuming gold segmented and tagged input) and realistic ones (not assuming gold segmentation and tags). Our evaluation of segmentation and parsing for Modern Hebrew sheds new light on the performance of the best parsing systems to date in the different scenarios.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We present novel metrics for parse evaluation in joint segmentation and parsing scenarios where the gold sequence of terminals is not known in advance. [sent-8, score-0.638]
2 The protocol uses distance-based metrics defined for the space of trees over lattices. [sent-9, score-0.247]
3 Our metrics allow us to precisely quantify the performance gap between non-realistic parsing scenarios (assuming gold segmented and tagged input) and realistic ones (not assuming gold segmentation and tags). [sent-10, score-0.817]
4 Our evaluation of segmentation and parsing for Modern Hebrew sheds new light on the performance of the best parsing systems to date in the different scenarios. [sent-11, score-0.391]
5 1 Introduction A parser takes a sentence in natural language as input and returns a syntactic parse tree representing the sentence’s human-perceived interpretation. [sent-12, score-0.268]
6 Current state-of-the-art parsers assume that the space-delimited words in the input are the basic units of syntactic analysis. [sent-13, score-0.148]
7 Standard evaluation metrics (Black et al., 1991; Buchholz and Marsi, 2006) accordingly assume that the yield of the parse tree is known in advance. [sent-15, score-0.108]
8 This assumption breaks down when parsing morphologically rich languages (Tsarfaty et al. [sent-16, score-0.317]
9 , 2010), where every space-delimited word may be effectively composed of multiple morphemes, each of which has a distinct role in the syntactic parse tree. [sent-17, score-0.114]
10 In order to parse such input the text needs to undergo morphological segmentation, that is, identifying the morphological segments of each word and assigning the corresponding part-of-speech (PoS) tags to them. [sent-18, score-0.671]
11 The multiple morphological analyses of input words may be represented via a lattice that encodes the different segmentation possibilities of the entire word sequence. [sent-20, score-0.618]
12 One can either select a segmentation path prior to parsing, or, as has been recently argued, one can let the parser pick a segmentation jointly with decoding (Tsarfaty, 2006; Cohen and Smith, 2007; Goldberg and Tsarfaty, 2008; Green and Manning, 2010). [sent-21, score-0.38]
13 If the selected segmentation is different from the gold segmentation, the gold and parse trees are rendered incomparable and standard evaluation metrics break down. [sent-22, score-0.655]
14 Evaluation scenarios restricted to gold input are often used to bypass this problem, but, as shall be seen shortly, they present an overly optimistic upper bound on parser performance. [sent-23, score-0.321]
15 This paper presents a full treatment of evaluation in different parsing scenarios, using distance-based measures defined for trees over a shared common denominator defined in terms of a lattice structure. [sent-24, score-0.365]
16 We demonstrate the informativeness of our metrics by evaluating joint segmentation and parsing performance for the Semitic language Modern Hebrew, using the best performing systems, both constituency-based and dependency-based (Tsarfaty, 2010; Goldberg, 2011a). [sent-25, score-0.409]
17 Our experiments demonstrate that, for all parsers, significant performance gaps between realistic and non-realistic scenarios crucially depend on the kind of information initially provided to the parser. [sent-26, score-0.15]
18 The tool and metrics that we provide are completely general and can straightforwardly apply to other languages, treebanks and different tasks. [sent-27, score-0.09]
19 Erroneous nodes in the parse hypothesis are marked in italics. [sent-31, score-0.122]
20 Missing nodes from the hypothesis are marked in bold. [sent-32, score-0.055]
21 2 The Challenge: Evaluation for MRLs In morphologically rich languages (MRLs) substantial information about the grammatical relations between entities is expressed at word level using inflectional affixes. [sent-33, score-0.162]
22 In particular, in MRLs such as Hebrew, Arabic, Turkish or Maltese, elements such as determiners, definite articles and conjunction markers appear as affixes that are appended to an open-class word. [sent-34, score-0.055]
23 Note that morphological segmentation is not the inverse of concatenation. [sent-37, score-0.397]
24 For instance, the overt definite article H and the possessor FL show up only in the analysis. [sent-38, score-0.097]
25 The correct parse for the Hebrew phrase “BCLM HNEIM” is shown in Figure 1 (tree1), and it presupposes that these segments can be identified and assigned the correct PoS tags. [sent-39, score-0.109]
26 However, morphological segmentation is non-trivial due to massive word-level ambiguity. [sent-40, score-0.397]
27 The multitude of morphological analyses may be encoded in a lattice structure, as illustrated in Figure 2. [sent-42, score-0.42]
28 The complete set of analyses for this word is provided in Goldberg and Tsarfaty (2008). [sent-45, score-0.05]
29 Figure 2: The morphological segmentation possibilities of BCLM HNEIM. [sent-47, score-0.397]
30 In practice, a statistical component is required to decide on the correct morphological segmentation, that is, to pick out the correct path through the lattice. [sent-49, score-0.242]
31 , 2008; Habash and Rambow, 2005), or jointly with parsing (Tsarfaty, 2006; Goldberg and Tsarfaty, 2008; Green and Manning, 2010). [sent-51, score-0.118]
32 Either way, an incorrect morphological segmentation hypothesis introduces errors into the parse hypothesis, ultimately providing a parse tree which spans a different yield than the gold terminals. [sent-52, score-0.786]
33 To understand why, consider the trees in Figure 1. [sent-54, score-0.119]
34 PARSEVAL metrics (Black et al., 1991) calculate the harmonic means of precision and recall on labeled spans ⟨i, label, j⟩, where i, j are terminal boundaries. [sent-56, score-0.063]
35 Now, the NP dominating “shadow of them” has been identified and labeled correctly in tree2, but in tree1 it spans ⟨2, NP, 5⟩ and in tree2 it spans ⟨1, NP, 4⟩. [sent-57, score-0.161]
36 This span ⟨2, NP, 5⟩ will then be counted as an error for tree2, along with its dominated and dominating structure, and PARSEVAL will score 0. [sent-58, score-0.035]
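The span mismatch just described can be made concrete with a minimal sketch of PARSEVAL-style labeled-span scoring. The two singleton span sets below are the illustrative ⟨2, NP, 5⟩ vs. ⟨1, NP, 4⟩ pair from the running example, not the full trees of Figure 1:

```python
# Minimal sketch of PARSEVAL-style scoring: harmonic mean of precision and
# recall over labeled spans (i, label, j). Span sets here are illustrative.

def parseval_f1(gold_spans, test_spans):
    """F1 over labeled spans; spans are (i, label, j) triples."""
    gold, test = set(gold_spans), set(test_spans)
    matched = len(gold & test)
    if matched == 0:
        return 0.0
    precision = matched / len(test)
    recall = matched / len(gold)
    return 2 * precision * recall / (precision + recall)

# A correctly labeled NP whose yield shifted by one segment scores zero:
print(parseval_f1({(2, "NP", 5)}, {(1, "NP", 4)}))  # 0.0
print(parseval_f1({(2, "NP", 5)}, {(2, "NP", 5)}))  # 1.0
```

Even though the constituent is structurally and categorially correct, the shifted terminal boundaries make the span triples unequal, so the match count is zero.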
37 A generalized version of PARSEVAL which considers i,j character-based indices instead of terminal boundaries (Tsarfaty, 2006) will fail here too, since the missing overt definite article H will cause similar misalignments. [sent-59, score-0.139]
38 Metrics for dependency-based evaluation such as ATTACHMENT SCORES (Buchholz and Marsi, 2006) suffer from similar problems, since they assume that both trees have the same nodes, an assumption that breaks down in the case of incorrect morphological segmentation. [sent-60, score-0.492]
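The same-nodes assumption behind attachment scores can be sketched as follows (a hedged illustration: head indices are invented, with 0 standing for the root):

```python
# Sketch of an unlabeled attachment score (UAS): position k of each list
# holds the head index of token k (0 = root). The length check makes the
# identical-yield assumption explicit; it fails outright when segmentation
# differs and the two trees no longer share the same nodes.

def uas(gold_heads, pred_heads):
    if len(gold_heads) != len(pred_heads):
        raise ValueError("attachment scores require identical yields")
    correct = sum(g == p for g, p in zip(gold_heads, pred_heads))
    return correct / len(gold_heads)

print(round(uas([0, 1, 1], [0, 1, 2]), 2))  # 0.67
```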
39 Although great advances have been made in parsing MRLs in recent years, this evaluation challenge remained unsolved. [sent-61, score-0.152]
40 3 The Proposal: Distance-Based Metrics Input and Output Spaces We view the joint task as a structured prediction function h : X → Y from input space X onto output space Y. [sent-64, score-0.043]
41 An input x ∈ X is a sequence x = w1, . . . , wn of space-delimited words from a set W. [sent-68, score-0.039]
42 We assume a lexicon LEX, distinct from W, containing pairs of segments drawn from a set T of terminals and PoS categories drawn from a set N of nonterminals. [sent-69, score-0.042]
43 LEX = {⟨s, p⟩ | s ∈ T, p ∈ N}. Each word wi in the input may admit multiple morphological analyses, constrained by a language-specific morphological analyzer MA. [sent-70, score-0.59]
44 The morphological analysis of an input word MA(wi) can be represented as a lattice Li in which every arc corresponds to a lexicon entry ⟨s, p⟩. [sent-71, score-0.413]
45 The morphological analysis of an input sentence x is then a lattice L obtained through the concatenation of the lattices L1, . . . , Ln. [sent-72, score-0.077]
46 Let x = w1, . . . , wn be a sentence with a morphological analysis lattice MA(x) = L. [sent-82, score-0.409]
47 We define the output space YMA(x)=L for h (abbreviated YL) as the set of linearly-ordered labeled trees such that the yield of LEX entries ⟨s1, p1⟩, . . . [sent-83, score-0.119]
48 , ⟨sk, pk⟩ in each tree (where si ∈ T and pi ∈ N) corresponds to a path through the lattice L. [sent-86, score-0.104]
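The lattice concatenation and the space of candidate yields can be sketched as follows. The per-word analyses below are invented stand-ins for illustration, not the actual Hebrew analyses of Figure 2:

```python
from itertools import product

# Sketch of MA(x) as a concatenation of per-word lattices: each word maps
# to its alternative analyses, each a sequence of (segment, PoS) LEX
# entries. Every path through the sentence lattice is a candidate yield.
# The analyses below are hypothetical stand-ins, not real Hebrew entries.

MA = {
    "BCLM": [
        [("BCLM", "NN")],                          # one unsegmented reading
        [("B", "IN"), ("CL", "NN"), ("M", "POSS")],
    ],
    "HNEIM": [
        [("HNEIM", "VB")],
        [("H", "DT"), ("NEIM", "JJ")],
    ],
}

def segmentation_paths(words, analyzer):
    """Enumerate every segmentation path through the lattice L1 ... Ln."""
    for combo in product(*(analyzer[w] for w in words)):
        yield [entry for analysis in combo for entry in analysis]

paths = list(segmentation_paths(["BCLM", "HNEIM"], MA))
print(len(paths))  # 2 analyses x 2 analyses = 4 candidate yields
```

Each element of `paths` is one possible gold-or-predicted yield; trees in YL must span exactly one such path.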
49 Cohen and Smith (2007) aimed to fix this, but in their implementation syntactic nodes internal to word boundaries may be lost without scoring. [sent-93, score-0.102]
50 The operations in A are properly constrained by the lattice, that is, we can only add and delete lexemes that belong to LEX, and we can only add and delete them where they can occur in the lattice. [sent-95, score-0.039]
51 We define the cost of an edit script ⟨a1, . . . , am⟩ as the sum of the costs of all operations in the sequence, C(⟨a1, . . . , am⟩) = Σi C(ai). [sent-99, score-0.05]
52 An edit script ⟨a1, . . . , am⟩ is a sequence of operations that turns y1 into y2. [sent-106, score-0.039]
53 The tree-edit distance is the minimum cost of any edit script that turns y1 into y2 (Bille, 2005). [sent-107, score-0.069]
54 We would need to delete all lexemes and nodes in p and add all the lexemes and nodes of g, except for roots. [sent-109, score-0.186]
55 An Example Both trees in Figure 1 are contained in YL for the lattice L in Figure 2. [sent-110, score-0.247]
56 If we replace terminal boundaries with lattice indices from Figure 2, we need 6 edit operations to turn tree2 into tree1 (deleting the nodes in italic, adding the nodes in bold) and the evaluation score will be TEDEVAL(tree2, tree1) = 1 − 6/(14 + 16 − 2) ≈ 0.79. [sent-111, score-0.388]
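The normalization behind this score can be sketched directly: the edit distance is scaled by the worst-case script, which deletes every node and lexeme of one tree and adds every node and lexeme of the other, roots excluded (the sizes 14 and 16 and distance 6 mirror the worked example above):

```python
# TEDEVAL normalization sketch: 1 minus the edit distance scaled by the
# worst-case edit script (delete one whole tree, add the other, except the
# two roots).

def tedeval(distance, size1, size2):
    """Normalized tree-edit score in [0, 1]; sizes count nodes and lexemes."""
    worst_case = size1 + size2 - 2
    return 1.0 - distance / worst_case

# 6 edit operations between trees of sizes 14 and 16:
print(round(tedeval(6, 14, 16), 2))  # 0.79
```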
57 Table 1: Constituency parsing results by the Berkeley Parser trained on bare-bone trees (PS) and relational-realizational trees (RR). [sent-122, score-0.295]
58 Table 2: Dependency parsing results by MaltParser (MP) and EasyFirst (EF), trained on the treebank converted into unlabeled dependencies, and parsing the entire dev-set. [sent-133, score-0.236]
59 For constituency-based parsing we use two models trained by the Berkeley parser (Petrov et al. [sent-134, score-0.188]
60 , 2006): one on phrase-structure (PS) trees and one on relational-realizational (RR) trees (Tsarfaty and Sima’an, 2008). [sent-135, score-0.238]
61 In the raw scenario we let a lattice-based parser choose its own segmentation and tags (Goldberg, 2011b). [sent-136, score-0.31]
62 For dependency parsing we use MaltParser (Nivre et al. [sent-137, score-0.157]
63 , 2007b) optimized for Hebrew by Ballesteros and Nivre (2012), and the EasyFirst parser of Goldberg and Elhadad (2010) with the features therein. [sent-138, score-0.07]
64 Since these parsers cannot choose their own tags, automatically predicted segments and tags are provided by Adler and Elhadad (2006). [sent-139, score-0.224]
65 We use PARSEVAL for evaluating phrase-structure trees, ATTACHSCORES for evaluating dependency trees, and TEDEVAL for evaluating all trees in all scenarios. [sent-142, score-0.296]
66 We implement SEGEVAL for evaluating segmentation based on our TEDEVAL implementation, replacing the tree distance and size with string terms. [sent-143, score-0.242]
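The string variant can be sketched with plain Levenshtein distance standing in for the lattice-constrained tree edit distance (an assumption for illustration; the paper derives SEGEVAL from its TEDEVAL implementation):

```python
# SEGEVAL sketch: the same normalized-distance idea applied to segment
# sequences. Standard Levenshtein distance is used here as a stand-in for
# the paper's edit distance; the segment strings are illustrative.

def levenshtein(a, b):
    """Edit distance between two sequences (insert/delete/substitute = 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def segeval(pred_segments, gold_segments):
    dist = levenshtein(pred_segments, gold_segments)
    return 1.0 - dist / (len(pred_segments) + len(gold_segments))

# Two competing segmentations of the same word:
print(segeval(["BCL", "M"], ["B", "CL", "M"]))  # 0.6
```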
67 9 Table 1 shows the constituency-based parsing results for all scenarios. [sent-144, score-0.118]
68 All of our results confirm that gold information leads to much higher scores. [sent-145, score-0.112]
69 TEDEVAL allows us to precisely quantify the drop in accuracy from gold to predicted (as in PARSEVAL) and then from predicted to raw on a single scale. [sent-146, score-0.425]
70 Unlabeled TEDEVAL shows a greater drop when moving from predicted to raw than from gold to predicted, and for labeled TEDEVAL it is the other way round. [sent-148, score-0.345]
71 This demonstrates the great importance of gold tags which provide morphologically disambiguated information for identifying phrase content. [sent-149, score-0.304]
72 Table 2 shows that dependency parsing results confirm the same trends, but we see a much smaller drop when moving from gold to predicted. [sent-150, score-0.363]
73 This is because we train the parsers for the predicted scenario on a treebank containing predicted tags. [sent-151, score-0.236]
74 There is however a great drop when moving from predicted to raw, which confirms that evaluation benchmarks on gold input as in Nivre et al. [sent-152, score-0.372]
75 (2007a) do not provide a realistic indication of parser performance. [sent-153, score-0.124]
76 Cross-framework evaluation may be conducted by combining this metric with the cross-framework protocol of Tsarfaty et al. [sent-158, score-0.038]
77 5 Conclusion We presented distance-based metrics defined for trees over lattices and applied them to evaluating parsers on joint morphological and syntactic disambiguation. [sent-160, score-0.636]
78 Our contribution is both technical, providing an evaluation tool that can be straightforwardly applied for parsing scenarios involving trees over lattices, and methodological, suggesting to evaluate parsers in all possible scenarios in order to get a realistic indication of parser performance. [sent-161, score-0.611]
79 A procedure for quantitatively comparing the syntactic coverage of English grammars. [sent-192, score-0.047]
80 A single framework for joint morphological segmentation and syntactic parsing. [sent-210, score-0.444]
81 Joint morphological segmentation and syntactic parsing using a PCFGLA lattice parser. [sent-220, score-0.69]
82 Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. [sent-229, score-0.278]
83 Morphological disambiguation of Hebrew: A case study in classifier combination. [sent-250, score-0.036]
84 Statistical parsing of morphologically rich languages (SPMRL): What, how and whither. [sent-262, score-0.28]
wordName wordTfidf (topN-words)
[('tsarfaty', 0.387), ('tedeval', 0.286), ('hebrew', 0.276), ('morphological', 0.242), ('reut', 0.199), ('goldberg', 0.191), ('segmentation', 0.155), ('lattice', 0.128), ('sima', 0.127), ('morphologically', 0.123), ('trees', 0.119), ('parsing', 0.118), ('bclm', 0.114), ('mrls', 0.114), ('gold', 0.112), ('yoav', 0.102), ('scenarios', 0.096), ('nivre', 0.096), ('lex', 0.091), ('parseval', 0.091), ('metrics', 0.09), ('joakim', 0.09), ('predicted', 0.089), ('modern', 0.087), ('evelina', 0.086), ('shadow', 0.086), ('green', 0.077), ('hs', 0.077), ('maltparser', 0.076), ('parser', 0.07), ('edit', 0.069), ('khalil', 0.068), ('parse', 0.067), ('rr', 0.065), ('adler', 0.064), ('spans', 0.063), ('pi', 0.063), ('ps', 0.063), ('buchholz', 0.06), ('np', 0.059), ('parsers', 0.058), ('elhadad', 0.057), ('easyfirst', 0.057), ('hneim', 0.057), ('predigcrtaoelwd', 0.057), ('relationalrealizational', 0.057), ('shacham', 0.057), ('sparseval', 0.057), ('spmrl', 0.057), ('tpopp', 0.057), ('yoad', 0.057), ('arabic', 0.055), ('nodes', 0.055), ('definite', 0.055), ('realistic', 0.054), ('drop', 0.051), ('cohen', 0.051), ('raw', 0.05), ('ballesteros', 0.05), ('ami', 0.05), ('mp', 0.05), ('analyses', 0.05), ('sandra', 0.049), ('syntactic', 0.047), ('evaluating', 0.046), ('assuming', 0.046), ('moving', 0.043), ('input', 0.043), ('overt', 0.042), ('spence', 0.042), ('erwin', 0.042), ('segments', 0.042), ('terminal', 0.042), ('tree', 0.041), ('marsi', 0.04), ('yl', 0.04), ('operations', 0.039), ('incorrect', 0.039), ('dependency', 0.039), ('rich', 0.039), ('wn', 0.039), ('protocol', 0.038), ('lexemes', 0.038), ('shay', 0.037), ('breaks', 0.037), ('roark', 0.037), ('es', 0.036), ('disambiguation', 0.036), ('ma', 0.036), ('black', 0.035), ('habash', 0.035), ('alon', 0.035), ('fl', 0.035), ('dominating', 0.035), ('tags', 0.035), ('great', 0.034), ('johan', 0.034), ('lattices', 0.034), ('jens', 0.034), ('quantify', 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000026 122 acl-2012-Joint Evaluation of Morphological Segmentation and Syntactic Parsing
Author: Reut Tsarfaty ; Joakim Nivre ; Evelina Andersson
Abstract: We present novel metrics for parse evaluation in joint segmentation and parsing scenarios where the gold sequence of terminals is not known in advance. The protocol uses distance-based metrics defined for the space of trees over lattices. Our metrics allow us to precisely quantify the performance gap between non-realistic parsing scenarios (assuming gold segmented and tagged input) and realistic ones (not assuming gold segmentation and tags). Our evaluation of segmentation and parsing for Modern Hebrew sheds new light on the performance of the best parsing systems to date in the different scenarios.
2 0.1327873 5 acl-2012-A Comparison of Chinese Parsers for Stanford Dependencies
Author: Wanxiang Che ; Valentin Spitkovsky ; Ting Liu
Abstract: Stanford dependencies are widely used in natural language processing as a semantically-oriented representation, commonly generated either by (i) converting the output of a constituent parser, or (ii) predicting dependencies directly. Previous comparisons of the two approaches for English suggest that starting from constituents yields higher accuracies. In this paper, we re-evaluate both methods for Chinese, using more accurate dependency parsers than in previous work. Our comparison of performance and efficiency across seven popular open source parsers (four constituent and three dependency) shows, by contrast, that recent higher-order graph-based techniques can be more accurate, though somewhat slower, than constituent parsers. We demonstrate also that n-way jackknifing is a useful technique for producing automatic (rather than gold) part-of-speech tags to train Chinese dependency parsers. Finally, we analyze the relations produced by both kinds of parsing and suggest which specific parsers to use in practice.
3 0.12963466 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations
Author: Spence Green ; John DeNero
Abstract: When automatically translating from a weakly inflected source language like English to a target language with richer grammatical features such as gender and dual number, the output commonly contains morpho-syntactic agreement errors. To address this issue, we present a target-side, class-based agreement model. Agreement is promoted by scoring a sequence of fine-grained morpho-syntactic classes that are predicted during decoding for each translation hypothesis. For English-to-Arabic translation, our model yields a +1.04 BLEU average improvement over a state-of-the-art baseline. The model does not require bitext or phrase table annotations and can be easily implemented as a feature in many phrase-based decoders. 1
4 0.12842974 119 acl-2012-Incremental Joint Approach to Word Segmentation, POS Tagging, and Dependency Parsing in Chinese
Author: Jun Hatori ; Takuya Matsuzaki ; Yusuke Miyao ; Jun'ichi Tsujii
Abstract: We propose the first joint model for word segmentation, POS tagging, and dependency parsing for Chinese. Based on an extension of the incremental joint model for POS tagging and dependency parsing (Hatori et al., 2011), we propose an efficient character-based decoding method that can combine features from state-of-the-art segmentation, POS tagging, and dependency parsing models. We also describe our method to align comparable states in the beam, and how we can combine features of different characteristics in our incremental framework. In experiments using the Chinese Treebank (CTB), we show that the accuracies of the three tasks can be improved significantly over the baseline models, particularly by 0.6% for POS tagging and 2.4% for dependency parsing. We also perform comparison experiments with the partially joint models.
5 0.11174625 27 acl-2012-Arabic Retrieval Revisited: Morphological Hole Filling
Author: Kareem Darwish ; Ahmed Ali
Abstract: Due to Arabic’s morphological complexity, Arabic retrieval benefits greatly from morphological analysis – particularly stemming. However, the best known stemming does not handle linguistic phenomena such as broken plurals and malformed stems. In this paper we propose a model of character-level morphological transformation that is trained using Wikipedia hypertext to page title links. The use of our model yields statistically significant improvements in Arabic retrieval over the use of the best statistical stemming technique. The technique can potentially be applied to other languages.
6 0.10706897 109 acl-2012-Higher-order Constituent Parsing and Parser Combination
7 0.1005843 207 acl-2012-Unsupervised Morphology Rivals Supervised Morphology for Arabic MT
8 0.099711768 106 acl-2012-Head-driven Transition-based Parsing with Top-down Prediction
9 0.09740027 213 acl-2012-Utilizing Dependency Language Models for Graph-based Dependency Parsing Models
10 0.097106881 90 acl-2012-Extracting Narrative Timelines as Temporal Dependency Structures
11 0.086777061 95 acl-2012-Fast Syntactic Analysis for Statistical Language Modeling via Substructure Sharing and Uptraining
12 0.084620997 87 acl-2012-Exploiting Multiple Treebanks for Parsing with Quasi-synchronous Grammars
13 0.080353804 4 acl-2012-A Comparative Study of Target Dependency Structures for Statistical Machine Translation
14 0.079866186 202 acl-2012-Transforming Standard Arabic to Colloquial Arabic
15 0.079765484 63 acl-2012-Cross-lingual Parse Disambiguation based on Semantic Correspondence
16 0.077973753 140 acl-2012-Machine Translation without Words through Substring Alignment
17 0.075234339 175 acl-2012-Semi-supervised Dependency Parsing using Lexical Affinities
18 0.072801001 137 acl-2012-Lemmatisation as a Tagging Task
19 0.072301038 25 acl-2012-An Exploration of Forest-to-String Translation: Does Translation Help or Hurt Parsing?
20 0.071550593 172 acl-2012-Selective Sharing for Multilingual Dependency Parsing
topicId topicWeight
[(0, -0.193), (1, -0.036), (2, -0.165), (3, -0.111), (4, -0.032), (5, 0.036), (6, 0.045), (7, -0.123), (8, 0.032), (9, 0.011), (10, -0.033), (11, 0.022), (12, 0.05), (13, -0.072), (14, 0.046), (15, -0.072), (16, -0.133), (17, -0.039), (18, -0.105), (19, 0.061), (20, 0.037), (21, -0.031), (22, 0.017), (23, -0.058), (24, -0.032), (25, -0.002), (26, -0.016), (27, 0.041), (28, 0.039), (29, -0.024), (30, -0.009), (31, 0.003), (32, -0.059), (33, -0.049), (34, -0.039), (35, -0.003), (36, -0.002), (37, 0.061), (38, -0.065), (39, -0.046), (40, -0.016), (41, 0.081), (42, -0.029), (43, -0.069), (44, -0.034), (45, -0.06), (46, -0.067), (47, 0.055), (48, 0.005), (49, 0.051)]
simIndex simValue paperId paperTitle
same-paper 1 0.95308852 122 acl-2012-Joint Evaluation of Morphological Segmentation and Syntactic Parsing
Author: Reut Tsarfaty ; Joakim Nivre ; Evelina Andersson
Abstract: We present novel metrics for parse evaluation in joint segmentation and parsing scenarios where the gold sequence of terminals is not known in advance. The protocol uses distance-based metrics defined for the space of trees over lattices. Our metrics allow us to precisely quantify the performance gap between non-realistic parsing scenarios (assuming gold segmented and tagged input) and realistic ones (not assuming gold segmentation and tags). Our evaluation of segmentation and parsing for Modern Hebrew sheds new light on the performance of the best parsing systems to date in the different scenarios.
2 0.58393914 207 acl-2012-Unsupervised Morphology Rivals Supervised Morphology for Arabic MT
Author: David Stallard ; Jacob Devlin ; Michael Kayser ; Yoong Keok Lee ; Regina Barzilay
Abstract: If unsupervised morphological analyzers could approach the effectiveness of supervised ones, they would be a very attractive choice for improving MT performance on low-resource inflected languages. In this paper, we compare performance gains for state-of-the-art supervised vs. unsupervised morphological analyzers, using a state-of-the-art Arabic-to-English MT system. We apply maximum marginal decoding to the unsupervised analyzer, and show that this yields the best published segmentation accuracy for Arabic, while also making segmentation output more stable. Our approach gives an 18% relative BLEU gain for Levantine dialectal Arabic. Furthermore, it gives higher gains for Modern Standard Arabic (MSA), as measured on NIST MT-08, than does MADA (Habash and Rambow, 2005), a leading supervised MSA segmenter.
3 0.58102989 119 acl-2012-Incremental Joint Approach to Word Segmentation, POS Tagging, and Dependency Parsing in Chinese
Author: Jun Hatori ; Takuya Matsuzaki ; Yusuke Miyao ; Jun'ichi Tsujii
Abstract: We propose the first joint model for word segmentation, POS tagging, and dependency parsing for Chinese. Based on an extension of the incremental joint model for POS tagging and dependency parsing (Hatori et al., 2011), we propose an efficient character-based decoding method that can combine features from state-of-the-art segmentation, POS tagging, and dependency parsing models. We also describe our method to align comparable states in the beam, and how we can combine features of different characteristics in our incremental framework. In experiments using the Chinese Treebank (CTB), we show that the accuracies of the three tasks can be improved significantly over the baseline models, particularly by 0.6% for POS tagging and 2.4% for dependency parsing. We also perform comparison experiments with the partially joint models.
4 0.57873613 27 acl-2012-Arabic Retrieval Revisited: Morphological Hole Filling
Author: Kareem Darwish ; Ahmed Ali
Abstract: Due to Arabic’s morphological complexity, Arabic retrieval benefits greatly from morphological analysis – particularly stemming. However, the best known stemming does not handle linguistic phenomena such as broken plurals and malformed stems. In this paper we propose a model of character-level morphological transformation that is trained using Wikipedia hypertext to page title links. The use of our model yields statistically significant improvements in Arabic retrieval over the use of the best statistical stemming technique. The technique can potentially be applied to other languages.
5 0.57446873 175 acl-2012-Semi-supervised Dependency Parsing using Lexical Affinities
Author: Seyed Abolghasem Mirroshandel ; Alexis Nasr ; Joseph Le Roux
Abstract: Treebanks are not large enough to reliably model precise lexical phenomena. This deficiency provokes attachment errors in the parsers trained on such data. We propose in this paper to compute lexical affinities, on large corpora, for specific lexico-syntactic configurations that are hard to disambiguate and introduce the new information in a parser. Experiments on the French Treebank showed a relative decrease of the error rate of 7.1% Labeled Accuracy Score yielding the best parsing results on this treebank.
6 0.56892228 75 acl-2012-Discriminative Strategies to Integrate Multiword Expression Recognition and Parsing
7 0.55808866 213 acl-2012-Utilizing Dependency Language Models for Graph-based Dependency Parsing Models
8 0.55468112 106 acl-2012-Head-driven Transition-based Parsing with Top-down Prediction
9 0.54481626 202 acl-2012-Transforming Standard Arabic to Colloquial Arabic
10 0.53737205 87 acl-2012-Exploiting Multiple Treebanks for Parsing with Quasi-synchronous Grammars
11 0.53533828 5 acl-2012-A Comparison of Chinese Parsers for Stanford Dependencies
12 0.52756768 210 acl-2012-Unsupervized Word Segmentation: the Case for Mandarin Chinese
13 0.49920803 30 acl-2012-Attacking Parsing Bottlenecks with Unlabeled Data and Relevant Factorizations
14 0.49590141 63 acl-2012-Cross-lingual Parse Disambiguation based on Semantic Correspondence
15 0.49416226 172 acl-2012-Selective Sharing for Multilingual Dependency Parsing
16 0.48483145 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations
17 0.48234704 137 acl-2012-Lemmatisation as a Tagging Task
18 0.48225391 71 acl-2012-Dependency Hashing for n-best CCG Parsing
19 0.47918117 95 acl-2012-Fast Syntactic Analysis for Statistical Language Modeling via Substructure Sharing and Uptraining
20 0.47721684 109 acl-2012-Higher-order Constituent Parsing and Parser Combination
topicId topicWeight
[(7, 0.014), (26, 0.02), (28, 0.027), (30, 0.03), (37, 0.025), (39, 0.02), (57, 0.017), (71, 0.043), (74, 0.036), (82, 0.016), (84, 0.481), (85, 0.029), (90, 0.071), (92, 0.038), (94, 0.016), (99, 0.052)]
simIndex simValue paperId paperTitle
same-paper 1 0.9073863 122 acl-2012-Joint Evaluation of Morphological Segmentation and Syntactic Parsing
Author: Reut Tsarfaty ; Joakim Nivre ; Evelina Andersson
Abstract: We present novel metrics for parse evaluation in joint segmentation and parsing scenarios where the gold sequence of terminals is not known in advance. The protocol uses distance-based metrics defined for the space of trees over lattices. Our metrics allow us to precisely quantify the performance gap between non-realistic parsing scenarios (assuming gold segmented and tagged input) and realistic ones (not assuming gold segmentation and tags). Our evaluation of segmentation and parsing for Modern Hebrew sheds new light on the performance of the best parsing systems to date in the different scenarios.
2 0.85425842 68 acl-2012-Decoding Running Key Ciphers
Author: Sravana Reddy ; Kevin Knight
Abstract: There has been recent interest in the problem of decoding letter substitution ciphers using techniques inspired by natural language processing. We consider a different type of classical encoding scheme known as the running key cipher, and propose a search solution using Gibbs sampling with a word language model. We evaluate our method on synthetic ciphertexts of different lengths, and find that it outperforms previous work that employs Viterbi decoding with character-based models.
3 0.84552431 195 acl-2012-The Creation of a Corpus of English Metalanguage
Author: Shomir Wilson
Abstract: Metalanguage is an essential linguistic mechanism which allows us to communicate explicit information about language itself. However, it has been underexamined in research in language technologies, to the detriment of the performance of systems that could exploit it. This paper describes the creation of the first tagged and delineated corpus of English metalanguage, accompanied by an explicit definition and a rubric for identifying the phenomenon in text. This resource will provide a basis for further studies of metalanguage and enable its utilization in language technologies.
4 0.82988936 135 acl-2012-Learning to Temporally Order Medical Events in Clinical Text
Author: Preethi Raghavan ; Albert Lai ; Eric Fosler-Lussier
Abstract: We investigate the problem of ordering medical events in unstructured clinical narratives by learning to rank them based on their time of occurrence. We represent each medical event as a time duration, with a corresponding start and stop, and learn to rank the starts/stops based on their proximity to the admission date. Such a representation allows us to learn all of Allen’s temporal relations between medical events. Interestingly, we observe that this methodology performs better than a classification-based approach for this domain, but worse on the relationships found in the Timebank corpus. This finding has important implications for styles of data representation and resources used for temporal relation learning: clinical narratives may have different language attributes corresponding to temporal ordering relative to Timebank, implying that the field may need to look at a wider range of domains to fully understand the nature of temporal ordering.
5 0.76580834 93 acl-2012-Fast Online Lexicon Learning for Grounded Language Acquisition
Author: David Chen
Abstract: Learning a semantic lexicon is often an important first step in building a system that learns to interpret the meaning of natural language. It is especially important in language grounding where the training data usually consist of language paired with an ambiguous perceptual context. Recent work by Chen and Mooney (2011) introduced a lexicon learning method that deals with ambiguous relational data by taking intersections of graphs. While the algorithm produced good lexicons for the task of learning to interpret navigation instructions, it only works in batch settings and does not scale well to large datasets. In this paper we introduce a new online algorithm that is an order of magnitude faster and surpasses the state-of-the-art results. We show that by changing the grammar of the formal meaning representation language and training on additional data collected from Amazon’s Mechanical Turk we can further improve the results. We also include experimental results on a Chinese translation of the training data to demonstrate the generality of our approach.
6 0.40084437 219 acl-2012-langid.py: An Off-the-shelf Language Identification Tool
8 0.37548691 34 acl-2012-Automatically Learning Measures of Child Language Development
9 0.37271592 211 acl-2012-Using Rejuvenation to Improve Particle Filtering for Bayesian Word Segmentation
10 0.37117141 194 acl-2012-Text Segmentation by Language Using Minimum Description Length
11 0.37028712 210 acl-2012-Unsupervized Word Segmentation: the Case for Mandarin Chinese
12 0.36777115 139 acl-2012-MIX Is Not a Tree-Adjoining Language
13 0.35599259 88 acl-2012-Exploiting Social Information in Grounded Language Learning via Grammatical Reduction
14 0.35253662 99 acl-2012-Finding Salient Dates for Building Thematic Timelines
15 0.3517971 174 acl-2012-Semantic Parsing with Bayesian Tree Transducers
16 0.34753034 104 acl-2012-Graph-based Semi-Supervised Learning Algorithms for NLP
17 0.34559941 8 acl-2012-A Corpus of Textual Revisions in Second Language Writing
18 0.34394905 63 acl-2012-Cross-lingual Parse Disambiguation based on Semantic Correspondence
19 0.33775339 207 acl-2012-Unsupervised Morphology Rivals Supervised Morphology for Arabic MT
20 0.33387467 11 acl-2012-A Feature-Rich Constituent Context Model for Grammar Induction