acl acl2011 acl2011-224 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Dirk Hovy ; Ashish Vaswani ; Stephen Tratz ; David Chiang ; Eduard Hovy
Abstract: We present a preliminary study on unsupervised preposition sense disambiguation (PSD), comparing different models and training techniques (EM, MAP-EM with L0 norm, Bayesian inference using Gibbs sampling). To our knowledge, this is the first attempt at unsupervised preposition sense disambiguation. Our best accuracy reaches 56%, a significant improvement (at p <.001) of 16% over the most-frequent-sense baseline.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We present a preliminary study on unsupervised preposition sense disambiguation (PSD), comparing different models and training techniques (EM, MAP-EM with L0 norm, Bayesian inference using Gibbs sampling). [sent-2, score-0.909]
2 To our knowledge, this is the first attempt at unsupervised preposition sense disambiguation. [sent-3, score-0.723]
3 Our best accuracy reaches 56%, a significant improvement (at p < .001) of 16% over the most-frequent-sense baseline. [sent-4, score-0.069]
4 1 Introduction Reliable disambiguation of words plays an important role in many NLP applications. [sent-6, score-0.18]
5 76 senses for each of the 34 most frequent English prepositions, while nouns usually have around two (WordNet nouns average about 1. [sent-10, score-0.307]
6 Disambiguating prepositions is thus a challenging and interesting task in itself (as exemplified by the SemEval 2007 task (Litkowski and Hargraves, 2007)), and holds promise for NLP applications such as Information Extraction or Machine Translation. [sent-13, score-0.17]
7 Given a sentence such as the following: "In the morning, he shopped in Rome", we ultimately want to be able to annotate it as follows (see Chan et al.): [sent-14, score-0.236]
8 in/TEMPORAL the morning/TIME he/PERSON shopped/SOCIAL in/LOCATIVE Rome/LOCATION Here, the preposition in has two distinct meanings, namely a temporal and a locative one. [sent-16, score-0.467]
9 Ultimately, we want to disambiguate prepositions not by and for themselves, but in the context of sequential semantic labeling. [sent-18, score-0.27]
10 This should also improve disambiguation of the words linked by the prepositions (here, morning, shopped, and Rome). [sent-19, score-0.316]
11 We propose using unsupervised methods in order to leverage unlabeled data, since, to our knowledge, there are no annotated data sets that include both preposition and argument senses. [sent-20, score-0.607]
12 In this paper, we present our unsupervised framework and show results for preposition disambiguation. [sent-21, score-0.561]
13 We hope to present results for the joint disambiguation of preposition and arguments in a future paper. [sent-22, score-0.663]
14 The results from this work can be incorporated into a number of NLP problems, such as semantic tagging, which tries to assign not only syntactic, but also semantic categories to unlabeled text. [sent-23, score-0.092]
15 Knowledge about semantic constraints of prepositional constructions would not only provide better label accuracy, but also aid in resolving prepositional attachment problems. [sent-24, score-0.579]
16 , 2010) also crucially depend on unsupervised techniques such as the ones described here for textual enrichment. [sent-26, score-0.094]
17 Our contributions are: • we present the first unsupervised preposition sense disambiguation (PSD) system [sent-27, score-0.869]
18 The head word h (a noun, adjective, or verb) governs the preposition. [sent-30, score-0.044]
19 The object of the prepositional phrase (usually a noun) is denoted o, in our example morning and Rome. [sent-32, score-0.367]
20 We will refer to h and o collectively as the prepositional arguments. [sent-33, score-0.203]
21 In our example sentence above, the respective structures would be shopped in morning and shopped in Rome. [sent-36, score-0.467]
22 The senses of each element are denoted by a barred letter, i.e. [sent-37, score-0.227]
23 p̄ denotes the preposition sense, h̄ denotes the sense of the head word, and ō the sense of the object. [sent-39, score-0.835]
24 3 Data We use the data set for the SemEval 2007 PSD task, which consists of a training (16k) and a test set (8k) of sentences with sense-annotated prepositions following the sense inventory of The Preposition Project, TPP (Litkowski and Hargraves, 2005). [sent-40, score-0.332]
25 It defines senses for each of the 34 most frequent prepositions. [sent-41, score-0.227]
26 We used an in-house dependency parser to extract the prepositional constructions from the data (e. [sent-46, score-0.266]
27 In order to constrain the argument senses, we construct a dictionary that lists for each word all the possible lexicographer senses according to WordNet. [sent-50, score-0.395]
28 The set of lexicographer senses (45) is a higher level abstraction which is sufficiently coarse to allow for a good generalization. [sent-51, score-0.3]
29 Unknown words are assumed to have all possible senses applicable to their respective word class (i.e. [sent-52, score-0.281]
30 all noun senses for words labeled as nouns, etc.). [sent-54, score-0.227]
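For illustration, a minimal sketch of how such a lexicographer-sense dictionary could be assembled with NLTK's WordNet interface; the function name, the fallback for unknown words, and the example are ours, not the authors' code.

```python
from nltk.corpus import wordnet as wn

# The 45 WordNet lexicographer file names (e.g. 'noun.time', 'verb.social').
ALL_LEXNAMES = sorted({s.lexname() for s in wn.all_synsets()})

def lexicographer_senses(word, pos=None):
    """Possible lexicographer senses for a word; unknown words fall back
    to every lexicographer sense applicable to their word class."""
    synsets = wn.synsets(word, pos)
    if synsets:
        return sorted({s.lexname() for s in synsets})
    prefix = {wn.NOUN: 'noun.', wn.VERB: 'verb.', wn.ADJ: 'adj.'}.get(pos)
    if prefix:
        return [lex for lex in ALL_LEXNAMES if lex.startswith(prefix)]
    return list(ALL_LEXNAMES)

# e.g. lexicographer_senses('morning', wn.NOUN) typically yields ['noun.time']
```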
31 [Figure 1 caption, part c): incorporates further constraints on variables.] As shown by Hovy et al. [sent-67, score-0.064]
32 (2010), preposition senses can be accurately disambiguated using only the head word and object of the PP. [sent-68, score-0.783]
33 We exploit this property of prepositional constructions to represent the constraints between h, p, and o in a graphical model. [sent-69, score-0.367]
34 The joint distribution over the network can thus be written as P_p(h, o, h̄, p̄, ō) = P(h̄) · P(h|h̄) · P(p̄|h̄) · P(ō|p̄) · P(o|ō) (1). We want to incorporate as much information as possible into the model to constrain the choices. [sent-74, score-0.103]
35 In Figure 1c, we condition p̄ on both h̄ and ō, to reflect the fact that prepositions act as links and determine their sense mainly through context. [sent-75, score-0.373]
36 In order to constrain the object sense ō, we condition on h̄, similar to a second-order HMM. [sent-76, score-0.297]
37 The actual object o is conditioned on both p̄ and ō. [sent-77, score-0.045]
38 The joint distribution is equal to P_p(h, o, h̄, p̄, ō) = P(h̄) · P(h|h̄) · P(ō|h̄) · P(p̄|h̄, ō) · P(o|ō, p̄) (2). Though we would like to also condition the preposition sense p̄ on the head word h (i.e. [sent-78, score-0.714]
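A small sketch of equation (2) as code; h_bar, p_bar, o_bar stand for h̄, p̄, ō, and the nested-dictionary layout of the conditional probability tables is an assumption for illustration, not the authors' implementation.

```python
def joint_prob(model, h, o, h_bar, p_bar, o_bar):
    """P_p(h, o, h̄, p̄, ō) under equation (2); `model` is assumed to hold
    one conditional probability table per factor, keyed by the
    conditioning variables."""
    return (model['p_hbar'][h_bar]                          # P(h̄)
            * model['h_given_hbar'][h_bar][h]               # P(h | h̄)
            * model['obar_given_hbar'][h_bar][o_bar]        # P(ō | h̄)
            * model['pbar_given_hbar_obar'][h_bar, o_bar][p_bar]  # P(p̄ | h̄, ō)
            * model['o_given_obar_pbar'][o_bar, p_bar][o])  # P(o | ō, p̄)
```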
39 Ideally, the sense distribution found by the model matches the real one. [sent-82, score-0.162]
40 Since most linguistic distributions are Zipfian, we want a training method that encourages sparsity in the model. [sent-83, score-0.166]
41 We briefly introduce different unsupervised training methods and discuss their respective advantages and disadvantages. [sent-84, score-0.148]
42 Unless specified otherwise, we initialized all models uniformly, and trained until the perplexity rate stopped increasing or a predefined number of iterations was reached. [sent-85, score-0.112]
43 We ran EM on each model for 100 iterations, or until the decrease in perplexity fell below a threshold of 10⁻⁶. [sent-93, score-0.082]
44 2 EM with Smoothing and Restarts In addition to the baseline, we ran 100 restarts with random initialization and smoothed the fractional counts by adding 0. [sent-95, score-0.253]
45 Repeated random restarts help escape unfavorable initializations that lead to local maxima. [sent-98, score-0.221]
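The restart-and-smoothing recipe could look roughly like the sketch below. The model-specific pieces (random_init, e_step, m_step) are supplied by the caller, the smoothing constant is a placeholder (the exact value is truncated in the excerpt above, "adding 0."), and selecting the restart with the highest training likelihood is an assumed criterion.

```python
def train_em_smoothed(data, random_init, e_step, m_step,
                      smoothing=0.1, restarts=100, iters=100):
    """EM with add-constant smoothing of fractional counts and random restarts.
      random_init() -> model
      e_step(model, data) -> (fractional_counts, log_likelihood)
      m_step(smoothed_counts) -> model (renormalization)
    """
    best_model, best_ll = None, float("-inf")
    for _ in range(restarts):                 # repeated random restarts
        model = random_init()
        ll = float("-inf")
        for _ in range(iters):
            counts, ll = e_step(model, data)
            smoothed = {factor: {k: v + smoothing for k, v in table.items()}
                        for factor, table in counts.items()}
            model = m_step(smoothed)
        if ll > best_ll:                      # keep the best-likelihood restart (assumption)
            best_model, best_ll = model, ll
    return best_model
```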
46 3 MAP-EM with L0 Norm Since we want to encourage sparsity in our models, we use the MDL-inspired technique introduced by Vaswani et al. [sent-101, score-0.103]
47 The authors use a smoothed L0 prior, which encourages probabilities to go down to 0. [sent-104, score-0.127]
48 The prior involves hyperparameters α, which rewards sparsity, and β, which controls how close the approximation is to the true L0 norm. [sent-105, score-0.038]
49 We perform a grid search to tune the hyper-parameters of the smoothed L0 prior for accuracy on the preposition against, since it has a medium number of senses and instances. [sent-106, score-0.833]
50 The subscripts trans and emit denote the transition and emission parameters. [sent-112, score-0.316]
51 The latter resulted in the best accuracy we achieved. [sent-118, score-0.037]
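The hyper-parameter tuning step can be sketched as a plain grid search. train_map_em_l0 and accuracy are hypothetical stand-ins for the MAP-EM trainer of Vaswani et al. (2010) and the evaluation routine, the grid values are illustrative, and for brevity a single (α, β) pair is tuned rather than separate values for the transition and emission parameters.

```python
import itertools

def tune_l0_hyperparameters(train_data, dev_data, train_map_em_l0, accuracy,
                            alphas=(0.1, 1.0, 10.0, 100.0),
                            betas=(0.005, 0.05, 0.5)):
    """Grid search for the sparsity reward alpha and the tightness beta of
    the smoothed L0 approximation, scored by accuracy on a held-out
    preposition (here, the instances of 'against')."""
    best_setting, best_acc = None, float("-inf")
    for alpha, beta in itertools.product(alphas, betas):
        model = train_map_em_l0(train_data, alpha=alpha, beta=beta)
        acc = accuracy(model, dev_data)
        if acc > best_acc:
            best_setting, best_acc = (alpha, beta), acc
    return best_setting, best_acc
```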
52 4 Bayesian Inference Instead of EM, we can use Bayesian inference with Gibbs sampling and Dirichlet priors (also known as the Chinese Restaurant Process, CRP). [sent-120, score-0.109]
53 (2010), running Gibbs sampling for 10,000 iterations, with a burn-in period of 5,000, and carry out automatic run selection over 10 random restarts. [sent-122, score-0.066]
54 Again, we tuned the hyper-parameters of our Dirichlet priors for accuracy via a grid search over the model for the preposition against. [sent-123, score-0.575]
55 This encourages sparsity in the model and allows for a more nuanced explanation of the data by shifting probability mass to the few prominent classes. [sent-127, score-0.112]
56 (Footnote 3) Due to time and space constraints, we did not run the 1000 restarts used in Chiang et al. [sent-130, score-0.159]
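For intuition, a collapsed Gibbs sampler for a deliberately simplified version of the model (a Dirichlet-multinomial mixture over object words only, ignoring head words and their senses); the real sampler operates over the full network of Figure 1, so this only shows the shape of the sampling update.

```python
import random
from collections import defaultdict

def gibbs_sample_senses(objects, num_senses, alpha=0.1, beta=0.1,
                        iters=10000, burn_in=5000, seed=0):
    """objects: one object word per preposition token.
    Returns the most frequent post-burn-in sense assignment per token
    (the paper additionally selects over 10 random restarts)."""
    rng = random.Random(seed)
    vocab_size = len(set(objects))
    z = [rng.randrange(num_senses) for _ in objects]   # initial assignments
    n_k = defaultdict(int)                             # tokens per sense
    n_kw = defaultdict(int)                            # (sense, word) counts
    for o, k in zip(objects, z):
        n_k[k] += 1
        n_kw[k, o] += 1
    tallies = [defaultdict(int) for _ in objects]
    for it in range(iters):
        for i, o in enumerate(objects):
            k_old = z[i]
            n_k[k_old] -= 1                            # remove token i from counts
            n_kw[k_old, o] -= 1
            weights = [(n_k[k] + alpha) *
                       (n_kw[k, o] + beta) / (n_k[k] + vocab_size * beta)
                       for k in range(num_senses)]
            k_new = rng.choices(range(num_senses), weights=weights)[0]
            z[i] = k_new                               # re-add under the new sense
            n_k[k_new] += 1
            n_kw[k_new, o] += 1
        if it >= burn_in:
            for i, k in enumerate(z):
                tallies[i][k] += 1
    return [max(t, key=t.get) for t in tallies]
```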
57 [Table note:] Numbers in brackets include against (used to tune MAP-EM and Bayesian Inference hyper-parameters). 6 Results Given a sequence h, p, o, we want to find the sequence of senses p̄, ō that maximizes the joint probability. [sent-147, score-0.281]
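Since the sense sets are small, decoding can be sketched as a brute-force maximization, reusing the joint_prob sketch given after equation (2); maximizing over the head-word sense h̄ as well is our assumption about how that latent variable is handled.

```python
import itertools

def decode(model, h, o, h_senses, p_senses, o_senses):
    """Return the (p̄, ō) pair maximizing the joint probability for one
    observed construction (h, p, o), maximizing over h̄ as well."""
    best, best_prob = None, float("-inf")
    for h_bar, p_bar, o_bar in itertools.product(h_senses, p_senses, o_senses):
        prob = joint_prob(model, h, o, h_bar, p_bar, o_bar)
        if prob > best_prob:
            best, best_prob = (p_bar, o_bar), prob
    return best
```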
58 Since unsupervised methods use the provided labels indiscriminately, we have to map the resulting predictions to the gold labels. [sent-148, score-0.094]
59 We use many-to-1 mapping as described by Johnson (2007) and used in other unsupervised tasks (Berg-Kirkpatrick et al. [sent-150, score-0.094]
60 , 2010), where each predicted sense is mapped to the gold label it most frequently occurs with in the test data. [sent-151, score-0.162]
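The many-to-1 evaluation described above reduces to a few lines; this is a generic sketch, not the authors' script.

```python
from collections import Counter, defaultdict

def many_to_one_accuracy(predicted, gold):
    """Map each predicted sense to the gold label it most frequently
    co-occurs with in the test data, then score the mapped predictions."""
    cooc = defaultdict(Counter)
    for pred, g in zip(predicted, gold):
        cooc[pred][g] += 1
    mapping = {pred: counts.most_common(1)[0][0]
               for pred, counts in cooc.items()}
    correct = sum(mapping[pred] == g for pred, g in zip(predicted, gold))
    return correct / len(gold)
```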
61 We report results both with and without against, since we tuned the hyperparameters of two training methods on this preposition. [sent-155, score-0.038]
62 , each preposition token is labeled with its respective name. [sent-159, score-0.521]
63 Adding smoothing and random restarts increases the gain considerably, illustrating how important these techniques are for unsupervised training. [sent-162, score-0.347]
64 CRP is, somewhat surprisingly, roughly equivalent to EM with smoothing and random restarts. [sent-164, score-0.094]
65 7k), which allow for a better modeling of the data, L0 normalization helps by zeroing out infrequent ones. [sent-171, score-0.032]
66 However, the difference between our complex model and the best HMM (EM with smoothing and random restarts, 55%) is not significant. [sent-172, score-0.094]
67 The best current supervised system we are aware of (Hovy et al. [sent-174, score-0.034]
68 (2010) make explicit use of the arguments for preposition sense disambiguation, using various features. [sent-180, score-0.679]
69 We differ from these approaches by using unsupervised methods and including argument labeling. [sent-181, score-0.14]
70 The constraints of prepositional constructions have been explored by Rudzicz and Mokhov (2003) and O’Hara and Wiebe (2003) to annotate the semantic role of complete PPs with FrameNet and Penn Treebank categories. [sent-182, score-0.41]
71 Ye and Baldwin (2006) explore the constraints of prepositional phrases for semantic role labeling. [sent-183, score-0.347]
72 We plan to use the constraints for argument disambiguation. [sent-184, score-0.11]
73 8 Conclusion and Future Work We evaluate the influence of two different models (to represent constraints) and three unsupervised training methods (to achieve sparse sense distributions) on PSD. [sent-185, score-0.288]
74 Using MAP-EM with L0 norm on our model, we achieve an accuracy of 56%. [sent-186, score-0.135]
75 We hope to shorten the gap to supervised systems with more unlabeled data. [sent-189, score-0.034]
76 The advantage of our approach is that the models can be used to infer the senses of the prepositional arguments as well as the preposition. [sent-192, score-0.48]
77 We are currently annotating the data to produce a test set with Amazon’s Mechanical Turk, in order to measure label accuracy for the preposition arguments. [sent-193, score-0.504]
78 In Proceedings of the ACL-02 Workshop on Effective tools and methodologies for teaching natural language processing and computational linguistics-Volume 1, pages 10– 18. [sent-225, score-0.051]
79 Towards a heuristic categorization of prepositional phrases in english with wordnet. [sent-268, score-0.203]
80 Efficient optimization of an MDL-inspired objective function for unsupervised part-of-speech tagging. [sent-282, score-0.094]
wordName wordTfidf (topN-words)
[('preposition', 0.467), ('senses', 0.227), ('prepositional', 0.203), ('trans', 0.179), ('litkowski', 0.179), ('prepositions', 0.17), ('sense', 0.162), ('restarts', 0.159), ('em', 0.149), ('hargraves', 0.147), ('shopped', 0.147), ('disambiguation', 0.146), ('semeval', 0.142), ('emit', 0.137), ('vaswani', 0.129), ('morning', 0.119), ('tratz', 0.119), ('hovy', 0.116), ('hara', 0.112), ('carmel', 0.11), ('psd', 0.11), ('norm', 0.098), ('dirk', 0.097), ('unsupervised', 0.094), ('baldwin', 0.084), ('ye', 0.08), ('oo', 0.08), ('lexicographer', 0.073), ('mapem', 0.073), ('orin', 0.073), ('rudzicz', 0.073), ('hh', 0.069), ('bayesian', 0.068), ('constraints', 0.064), ('smoothed', 0.064), ('smoothing', 0.064), ('encourages', 0.063), ('constructions', 0.063), ('rome', 0.06), ('graehl', 0.057), ('crp', 0.056), ('vanilla', 0.056), ('respective', 0.054), ('want', 0.054), ('teaching', 0.051), ('chiang', 0.05), ('arguments', 0.05), ('sparsity', 0.049), ('constrain', 0.049), ('eduard', 0.049), ('gibbs', 0.048), ('pauls', 0.047), ('argument', 0.046), ('stopped', 0.046), ('ashish', 0.046), ('ken', 0.046), ('semantic', 0.046), ('object', 0.045), ('head', 0.044), ('hmm', 0.042), ('condition', 0.041), ('chan', 0.041), ('nouns', 0.04), ('formalisms', 0.04), ('inference', 0.04), ('dempster', 0.039), ('jonathan', 0.039), ('grid', 0.038), ('meanings', 0.038), ('hyperparameters', 0.038), ('graphical', 0.037), ('accuracy', 0.037), ('sampling', 0.036), ('perplexity', 0.036), ('janyce', 0.036), ('ultimately', 0.035), ('role', 0.034), ('supervised', 0.034), ('dimensions', 0.033), ('tom', 0.033), ('priors', 0.033), ('tim', 0.032), ('talip', 0.032), ('oftwo', 0.032), ('unfavorable', 0.032), ('curious', 0.032), ('indiscriminately', 0.032), ('kordoni', 0.032), ('monosemous', 0.032), ('ofvarious', 0.032), ('rkh', 0.032), ('valia', 0.032), ('zeroing', 0.032), ('stephen', 0.032), ('reaches', 0.032), ('pp', 0.031), ('wiebe', 0.031), ('random', 0.03), ('patrick', 0.03), ('iterations', 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 224 acl-2011-Models and Training for Unsupervised Preposition Sense Disambiguation
Author: Dirk Hovy ; Ashish Vaswani ; Stephen Tratz ; David Chiang ; Eduard Hovy
Abstract: We present a preliminary study on unsupervised preposition sense disambiguation (PSD), comparing different models and training techniques (EM, MAP-EM with L0 norm, Bayesian inference using Gibbs sampling). To our knowledge, this is the first attempt at unsupervised preposition sense disambiguation. Our best accuracy reaches 56%, a significant improvement (at p <.001) of 16% over the most-frequent-sense baseline.
2 0.21560994 32 acl-2011-Algorithm Selection and Model Adaptation for ESL Correction Tasks
Author: Alla Rozovskaya ; Dan Roth
Abstract: We consider the problem of correcting errors made by English as a Second Language (ESL) writers and address two issues that are essential to making progress in ESL error correction - algorithm selection and model adaptation to the first language of the ESL learner. A variety of learning algorithms have been applied to correct ESL mistakes, but often comparisons were made between incomparable data sets. We conduct an extensive, fair comparison of four popular learning methods for the task, reversing conclusions from earlier evaluations. Our results hold for different training sets, genres, and feature sets. A second key issue in ESL error correction is the adaptation of a model to the first language ofthe writer. Errors made by non-native speakers exhibit certain regularities and, as we show, models perform much better when they use knowledge about error patterns of the nonnative writers. We propose a novel way to adapt a learned algorithm to the first language of the writer that is both cheaper to implement and performs better than other adaptation methods.
3 0.21363546 198 acl-2011-Latent Semantic Word Sense Induction and Disambiguation
Author: Tim Van de Cruys ; Marianna Apidianaki
Abstract: In this paper, we present a unified model for the automatic induction of word senses from text, and the subsequent disambiguation of particular word instances using the automatically extracted sense inventory. The induction step and the disambiguation step are based on the same principle: words and contexts are mapped to a limited number of topical dimensions in a latent semantic word space. The intuition is that a particular sense is associated with a particular topic, so that different senses can be discriminated through their association with particular topical dimensions; in a similar vein, a particular instance of a word can be disambiguated by determining its most important topical dimensions. The model is evaluated on the SEMEVAL-20 10 word sense induction and disambiguation task, on which it reaches stateof-the-art results.
4 0.21329501 302 acl-2011-They Can Help: Using Crowdsourcing to Improve the Evaluation of Grammatical Error Detection Systems
Author: Nitin Madnani ; Martin Chodorow ; Joel Tetreault ; Alla Rozovskaya
Abstract: Despite the rising interest in developing grammatical error detection systems for non-native speakers of English, progress in the field has been hampered by a lack of informative metrics and an inability to directly compare the performance of systems developed by different researchers. In this paper we address these problems by presenting two evaluation methodologies, both based on a novel use of crowdsourcing. 1 Motivation and Contributions One of the fastest growing areas in need of NLP tools is the field of grammatical error detection for learners of English as a Second Language (ESL). According to Guo and Beckett (2007), “over a billion people speak English as their second or for- eign language.” This high demand has resulted in many NLP research papers on the topic, a Synthesis Series book (Leacock et al., 2010) and a recurring workshop (Tetreault et al., 2010a), all in the last five years. In this year’s ACL conference, there are four long papers devoted to this topic. Despite the growing interest, two major factors encumber the growth of this subfield. First, the lack of consistent and appropriate score reporting is an issue. Most work reports results in the form of precision and recall as measured against the judgment of a single human rater. This is problematic because most usage errors (such as those in article and preposition usage) are a matter of degree rather than simple rule violations such as number agreement. As a consequence, it is common for two native speakers 508 to have different judgments of usage. Therefore, an appropriate evaluation should take this into account by not only enlisting multiple human judges but also aggregating these judgments in a graded manner. Second, systems are hardly ever compared to each other. In fact, to our knowledge, no two systems developed by different groups have been compared directly within the field primarily because there is no common corpus or shared task—both commonly found in other NLP areas such as machine translation.1 For example, Tetreault and Chodorow (2008), Gamon et al. (2008) and Felice and Pulman (2008) developed preposition error detection systems, but evaluated on three different corpora using different evaluation measures. The goal of this paper is to address the above issues by using crowdsourcing, which has been proven effective for collecting multiple, reliable judgments in other NLP tasks: machine translation (Callison-Burch, 2009; Zaidan and CallisonBurch, 2010), speech recognition (Evanini et al., 2010; Novotney and Callison-Burch, 2010), automated paraphrase generation (Madnani, 2010), anaphora resolution (Chamberlain et al., 2009), word sense disambiguation (Akkaya et al., 2010), lexicon construction for less commonly taught languages (Irvine and Klementiev, 2010), fact mining (Wang and Callison-Burch, 2010) and named entity recognition (Finin et al., 2010) among several others. In particular, we make a significant contribution to the field by showing how to leverage crowdsourc1There has been a recent proposal for a related shared task (Dale and Kilgarriff, 2010) that shows promise. Proceedings ofP thoer t4l9atnhd A, Onrnuegaoln M,e Jeuntineg 19 o-f2 t4h,e 2 A0s1s1o.c?i ac t2io0n11 fo Ar Cssoocmiaptuiotanti foonra Clo Lminpguutiast i ocns:aslh Loirntpgaupisetrics , pages 508–513, ing to both address the lack ofappropriate evaluation metrics and to make system comparison easier. 
Our solution is general enough for, in the simplest case, intrinsically evaluating a single system on a single dataset and, more realistically, comparing two different systems (from same or different groups). 2 A Case Study: Extraneous Prepositions We consider the problem of detecting an extraneous preposition error, i.e., incorrectly using a preposition where none is licensed. In the sentence “They came to outside”, the preposition to is an extraneous error whereas in the sentence “They arrived to the town” the preposition to is a confusion error (cf. arrived in the town). Most work on automated correction of preposition errors, with the exception of Gamon (2010), addresses preposition confusion errors e.g., (Felice and Pulman, 2008; Tetreault and Chodorow, 2008; Rozovskaya and Roth, 2010b). One reason is that in addition to the standard context-based features used to detect confusion errors, identifying extraneous prepositions also requires actual knowledge of when a preposition can and cannot be used. Despite this lack of attention, extraneous prepositions account for a significant proportion—as much as 18% in essays by advanced English learners (Rozovskaya and Roth, 2010a)—of all preposition usage errors. 2.1 Data and Systems For the experiments in this paper, we chose a proprietary corpus of about 500,000 essays written by ESL students for Test of English as a Foreign Language (TOEFL?R). Despite being common ESL errors, preposition errors are still infrequent overall, with over 90% of prepositions being used correctly (Leacock et al., 2010; Rozovskaya and Roth, 2010a). Given this fact about error sparsity, we needed an efficient method to extract a good number of error instances (for statistical reliability) from the large essay corpus. We found all trigrams in our essays containing prepositions as the middle word (e.g., marry with her) and then looked up the counts of each tri- gram and the corresponding bigram with the preposition removed (marry her) in the Google Web1T 5-gram Corpus. If the trigram was unattested or had a count much lower than expected based on the bi509 gram count, then we manually inspected the trigram to see whether it was actually an error. If it was, we extracted a sentence from the large essay corpus containing this erroneous trigram. Once we had extracted 500 sentences containing extraneous preposition error instances, we added 500 sentences containing correct instances of preposition usage. This yielded a corpus of 1000 sentences with a 50% error rate. These sentences, with the target preposition highlighted, were presented to 3 expert annotators who are native English speakers. They were asked to annotate the preposition usage instance as one of the following: extraneous (Error), not extraneous (OK) or too hard to decide (Unknown); the last category was needed for cases where the context was too messy to make a decision about the highlighted preposition. On average, the three experts had an agreement of 0.87 and a kappa of 0.75. For subse- quent analysis, we only use the classes Error and OK since Unknown was used extremely rarely and never by all 3 experts for the same sentence. We used two different error detection systems to illustrate our evaluation methodology:2 • • 3 LM: A 4-gram language model trained on tLhMe Google Wme lba1nTg 5-gram Corpus dw oithn SRILM (Stolcke, 2002). 
PERC: An averaged Perceptron (Freund and Schapire, 1999) calgaessdif Pieerr—ce as implemented nind the Learning by Java toolkit (Rizzolo and Roth, 2007)—trained on 7 million examples and using the same features employed by Tetreault and Chodorow (2008). Crowdsourcing Recently,we showed that Amazon Mechanical Turk (AMT) is a cheap and effective alternative to expert raters for annotating preposition errors (Tetreault et al., 2010b). In other current work, we have extended this pilot study to show that CrowdFlower, a crowdsourcing service that allows for stronger quality con- × trol on untrained human raters (henceforth, Turkers), is more reliable than AMT on three different error detection tasks (article errors, confused prepositions 2Any conclusions drawn in this paper pertain only to these specific instantiations of the two systems. & extraneous prepositions). To impose such quality control, one has to provide “gold” instances, i.e., examples with known correct judgments that are then used to root out any Turkers with low performance on these instances. For all three tasks, we obtained 20 Turkers’ judgments via CrowdFlower for each instance and found that, on average, only 3 Turkers were required to match the experts. More specifically, for the extraneous preposition error task, we used 75 sentences as gold and obtained judgments for the remaining 923 non-gold sentences.3 We found that if we used 3 Turker judgments in a majority vote, the agreement with any one of the three expert raters is, on average, 0.87 with a kappa of 0.76. This is on par with the inter-expert agreement and kappa found earlier (0.87 and 0.75 respectively). The extraneous preposition annotation cost only $325 (923 judgments 20 Turkers) and was com- pleted 9in2 a single day. T 2h0e only rres)st arnicdtio wna on tmheTurkers was that they be physically located in the USA. For the analysis in subsequent sections, we use these 923 sentences and the respective 20 judgments obtained via CrowdFlower. The 3 expert judgments are not used any further in this analysis. 4 Revamping System Evaluation In this section, we provide details on how crowdsourcing can help revamp the evaluation of error detection systems: (a) by providing more informative measures for the intrinsic evaluation of a single system (§ 4. 1), and (b) by easily enabling system comparison (§ 4.2). 4.1 Crowd-informed Evaluation Measures When evaluating the performance of grammatical error detection systems against human judgments, the judgments for each instance are generally reduced to the single most frequent category: Error or OK. This reduction is not an accurate reflection of a complex phenomenon. It discards valuable information about the acceptability of usage because it treats all “bad” uses as equal (and all good ones as equal), when they are not. Arguably, it would be fairer to use a continuous scale, such as the proportion of raters who judge an instance as correct or 3We found 2 duplicate sentences and removed them. 510 incorrect. For example, if 90% of raters agree on a rating of Error for an instance of preposition usage, then that is stronger evidence that the usage is an error than if 56% of Turkers classified it as Error and 44% classified it as OK (the sentence “In addition classmates play with some game and enjoy” is an example). The regular measures of precision and recall would be fairer if they reflected this reality. 
Besides fairness, another reason to use a continuous scale is that of stability, particularly with a small number of instances in the evaluation set (quite common in the field). By relying on majority judgments, precision and recall measures tend to be unstable (see below). We modify the measures of precision and recall to incorporate distributions of correctness, obtained via crowdsourcing, in order to make them fairer and more stable indicators of system performance. Given an error detection system that classifies a sentence containing a specific preposition as Error (class 1) if the preposition is extraneous and OK (class 0) otherwise, we propose the following weighted versions of hits (Hw), misses (Mw) and false positives (FPw): XN Hw = X(csiys ∗ picrowd) (1) Xi XN Mw = X((1 − csiys) ∗ picrowd) (2) Xi XN FPw = X(csiys ∗ (1 − picrowd)) (3) Xi In the above equations, N is the total number of instances, csiys is the class (1 or 0) , and picrowd indicates the proportion of the crowd that classified instance i as Error. Note that if we were to revert to the majority crowd judgment as the sole judgment for each instance, instead of proportions, picrowd would always be either 1 or 0 and the above formulae would simply compute the normal hits, misses and false positives. Given these definitions, weighted precision can be defined as Precisionw = Hw/(Hw Hw/(Hw + FPw) and weighted + Mw). recall as Recallw = agreement Figure 1: Histogram of Turker agreements for all 923 instances on whether a preposition is extraneous. UWnwei gihg tede Pr0 e.c9 i5s0i70onR0 .e3 c78al14l Table 1: Comparing commonly used (unweighted) and proposed (weighted) precision/recall measures for LM. To illustrate the utility of these weighted measures, we evaluated the LM and PERC systems on the dataset containing 923 preposition instances, against all 20 Turker judgments. Figure 1 shows a histogram of the Turker agreement for the majority rating over the set. Table 1 shows both the unweighted (discrete majority judgment) and weighted (continuous Turker proportion) versions of precision and recall for this system. The numbers clearly show that in the unweighted case, the performance of the system is overestimated simply because the system is getting as much credit for each contentious case (low agreement) as for each clear one (high agreement). In the weighted measure we propose, the contentious cases are weighted lower and therefore their contribution to the overall performance is reduced. This is a fairer representation since the system should not be expected to perform as well on the less reliable instances as it does on the clear-cut instances. Essentially, if humans cannot consistently decide whether 511 [n=93] [n=1 14] Agreement Bin [n=71 6] Figure 2: Unweighted precision/recall by agreement bins for LM & PERC. a case is an error then a system’s output cannot be considered entirely right or entirely wrong.4 As an added advantage, the weighted measures are more stable. Consider a contentious instance in a small dataset where 7 out of 15 Turkers (a minority) classified it as Error. However, it might easily have happened that 8 Turkers (a majority) classified it as Error instead of 7. In that case, the change in unweighted precision would have been much larger than is warranted by such a small change in the data. However, weighted precision is guaranteed to be more stable. Note that the instability decreases as the size of the dataset increases but still remains a problem. 
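As a side note, the weighted measures defined in equations (1)-(3) of this excerpt reduce to a few lines of code; the only assumption is the input format (a 0/1 system label and a crowd Error-proportion per instance).

```python
def weighted_precision_recall(system_labels, crowd_error_proportions):
    """system_labels[i] is 1 if the system flags instance i as Error, else 0;
    crowd_error_proportions[i] is the fraction of raters labeling it Error."""
    hw = sum(c * p for c, p in zip(system_labels, crowd_error_proportions))
    mw = sum((1 - c) * p for c, p in zip(system_labels, crowd_error_proportions))
    fpw = sum(c * (1 - p) for c, p in zip(system_labels, crowd_error_proportions))
    precision = hw / (hw + fpw) if hw + fpw else 0.0
    recall = hw / (hw + mw) if hw + mw else 0.0
    return precision, recall
```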
4.2 Enabling System Comparison In this section, we show how to easily compare different systems both on the same data (in the ideal case of a shared dataset being available) and, more realistically, on different datasets. Figure 2 shows (unweighted) precision and recall of LM and PERC (computed against the majority Turker judgment) for three agreement bins, where each bin is defined as containing only the instances with Turker agreement in a specific range. We chose the bins shown 4The difference between unweighted and weighted measures can vary depending on the distribution of agreement. since they are sufficiently large and represent a reasonable stratification of the agreement space. Note that we are not weighting the precision and recall in this case since we have already used the agreement proportions to create the bins. This curve enables us to compare the two systems easily on different levels of item contentiousness and, therefore, conveys much more information than what is usually reported (a single number for unweighted precision/recall over the whole corpus). For example, from this graph, PERC is seen to have similar performance as LM for the 75-90% agreement bin. In addition, even though LM precision is perfect (1.0) for the most contentious instances (the 50-75% bin), this turns out to be an artifact of the LM classifier’s decision process. When it must decide between what it views as two equally likely possibilities, it defaults to OK. Therefore, even though LM has higher unweighted precision (0.957) than PERC (0.813), it is only really better on the most clear-cut cases (the 90-100% bin). If one were to report unweighted precision and recall without using any bins—as is the norm—this important qualification would have been harder to discover. While this example uses the same dataset for evaluating two systems, the procedure is general enough to allow two systems to be compared on two different datasets by simply examining the two plots. However, two potential issues arise in that case. The first is that the bin sizes will likely vary across the two plots. However, this should not be a significant problem as long as the bins are sufficiently large. A second, more serious, issue is that the error rates (the proportion of instances that are actually erroneous) in each bin may be different across the two plots. To handle this, we recommend that a kappa-agreement plot be used instead of the precision-agreement plot shown here. 5 Conclusions Our goal is to propose best practices to address the two primary problems in evaluating grammatical error detection systems and we do so by leveraging crowdsourcing. For system development, we rec- ommend that rather than compressing multiple judgments down to the majority, it is better to use agreement proportions to weight precision and recall to 512 yield fairer and more stable indicators of performance. For system comparison, we argue that the best solution is to use a shared dataset and present the precision-agreement plot using a set of agreed-upon bins (possibly in conjunction with the weighted precision and recall measures) for a more informative comparison. However, we recognize that shared datasets are harder to create in this field (as most of the data is proprietary). Therefore, we also provide a way to compare multiple systems across different datasets by using kappa-agreement plots. 
As for agreement bins, we posit that the agreement values used to define them depend on the task and, therefore, should be determined by the community. Note that both of these practices can also be implemented by using 20 experts instead of 20 Turkers. However, we show that crowdsourcing yields judgments that are as good but without the cost. To facilitate the adoption of these practices, we make all our evaluation code and data available to the com- munity.5 Acknowledgments We would first like to thank our expert annotators Sarah Ohls and Waverely VanWinkle for their hours of hard work. We would also like to acknowledge Lei Chen, Keelan Evanini, Jennifer Foster, Derrick Higgins and the three anonymous reviewers for their helpful comments and feedback. References Cem Akkaya, Alexander Conrad, Janyce Wiebe, and Rada Mihalcea. 2010. Amazon Mechanical Turk for Subjectivity Word Sense Disambiguation. In Proceedings of the NAACL Workshop on Creating Speech and Language Data with Amazon ’s Mechanical Turk, pages 195–203. Chris Callison-Burch. 2009. Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon’s Mechanical Turk. In Proceedings of EMNLP, pages 286– 295. Jon Chamberlain, Massimo Poesio, and Udo Kruschwitz. 2009. A Demonstration of Human Computation Using the Phrase Detectives Annotation Game. In ACM SIGKDD Workshop on Human Computation, pages 23–24. 5http : / /bit . ly/ crowdgrammar Robert Dale and Adam Kilgarriff. 2010. Helping Our Own: Text Massaging for Computational Linguistics as a New Shared Task. In Proceedings of INLG. Keelan Evanini, Derrick Higgins, and Klaus Zechner. 2010. Using Amazon Mechanical Turk for Transcription of Non-Native Speech. In Proceedings of the NAACL Workshop on Creating Speech and Language Data with Amazon ’s Mechanical Turk, pages 53–56. Rachele De Felice and Stephen Pulman. 2008. A Classifier-Based Approach to Preposition and Determiner Error Correction in L2 English. In Proceedings of COLING, pages 169–176. Tim Finin, William Murnane, Anand Karandikar, Nicholas Keller, Justin Martineau, and Mark Dredze. 2010. Annotating Named Entities in Twitter Data with Crowdsourcing. In Proceedings of the NAACL Workshop on Creating Speech and Language Data with Amazon ’s Mechanical Turk, pages 80–88. Yoav Freund and Robert E. Schapire. 1999. Large Margin Classification Using the Perceptron Algorithm. Machine Learning, 37(3):277–296. Michael Gamon, Jianfeng Gao, Chris Brockett, Alexander Klementiev, William Dolan, Dmitriy Belenko, and Lucy Vanderwende. 2008. Using Contextual Speller Techniques and Language Modeling for ESL Error Correction. In Proceedings of IJCNLP. Michael Gamon. 2010. Using Mostly Native Data to Correct Errors in Learners’ Writing. In Proceedings of NAACL, pages 163–171 . Y. Guo and Gulbahar Beckett. 2007. The Hegemony of English as a Global Language: Reclaiming Local Knowledge and Culture in China. Convergence: International Journal of Adult Education, 1. Ann Irvine and Alexandre Klementiev. 2010. Using Mechanical Turk to Annotate Lexicons for Less Commonly Used Languages. In Proceedings of the NAACL Workshop on Creating Speech and Language Data with Amazon ’s Mechanical Turk, pages 108–1 13. Claudia Leacock, Martin Chodorow, Michael Gamon, and Joel Tetreault. 2010. Automated Grammatical Error Detection for Language Learners. Synthesis Lectures on Human Language Technologies. Morgan Claypool. Nitin Madnani. 2010. The Circle of Meaning: From Translation to Paraphrasing and Back. Ph.D. 
thesis, Department of Computer Science, University of Maryland College Park. Scott Novotney and Chris Callison-Burch. 2010. Cheap, Fast and Good Enough: Automatic Speech Recognition with Non-Expert Transcription. In Proceedings of NAACL, pages 207–215. Nicholas Rizzolo and Dan Roth. 2007. Modeling Discriminative Global Inference. In Proceedings of 513 the First IEEE International Conference on Semantic Computing (ICSC), pages 597–604, Irvine, California, September. Alla Rozovskaya and D. Roth. 2010a. Annotating ESL errors: Challenges and rewards. In Proceedings of the NAACL Workshop on Innovative Use of NLP for Building Educational Applications. Alla Rozovskaya and D. Roth. 2010b. Generating Confusion Sets for Context-Sensitive Error Correction. In Proceedings of EMNLP. Andreas Stolcke. 2002. SRILM: An Extensible Language Modeling Toolkit. In Proceedings of the International Conference on Spoken Language Processing, pages 257–286. Joel Tetreault and Martin Chodorow. 2008. The Ups and Downs of Preposition Error Detection in ESL Writing. In Proceedings of COLING, pages 865–872. Joel Tetreault, Jill Burstein, and Claudia Leacock, editors. 2010a. Proceedings of the NAACL Workshop on Innovative Use of NLP for Building Educational Applications. Joel Tetreault, Elena Filatova, and Martin Chodorow. 2010b. Rethinking Grammatical Error Annotation and Evaluation with the Amazon Mechanical Turk. In Proceedings of the NAACL Workshop on Innovative Use of NLP for Building Educational Applications, pages 45–48. Rui Wang and Chris Callison-Burch. 2010. Cheap Facts and Counter-Facts. In Proceedings of the NAACL Workshop on Creating Speech and Language Data with Amazon ’s Mechanical Turk, pages 163–167. Omar F. Zaidan and Chris Callison-Burch. 2010. Predicting Human-Targeted Translation Edit Rate via Untrained Human Annotators. In Proceedings of NAACL, pages 369–372.
5 0.18505079 158 acl-2011-Identification of Domain-Specific Senses in a Machine-Readable Dictionary
Author: Fumiyo Fukumoto ; Yoshimi Suzuki
Abstract: This paper focuses on domain-specific senses and presents a method for assigning category/domain label to each sense of words in a dictionary. The method first identifies each sense of a word in the dictionary to its corresponding category. We used a text classification technique to select appropriate senses for each domain. Then, senses were scored by computing the rank scores. We used Markov Random Walk (MRW) model. The method was tested on English and Japanese resources, WordNet 3.0 and EDR Japanese dictionary. For evaluation of the method, we compared English results with the Subject Field Codes (SFC) resources. We also compared each English and Japanese results to the first sense heuristics in the WSD task. These results suggest that identification of domain-specific senses (IDSS) may actually be of benefit.
6 0.16728194 307 acl-2011-Towards Tracking Semantic Change by Visual Analytics
7 0.15065932 147 acl-2011-Grammatical Error Correction with Alternating Structure Optimization
8 0.14312243 57 acl-2011-Bayesian Word Alignment for Statistical Machine Translation
9 0.14197204 240 acl-2011-ParaSense or How to Use Parallel Corpora for Word Sense Disambiguation
10 0.12277687 167 acl-2011-Improving Dependency Parsing with Semantic Classes
11 0.10659286 333 acl-2011-Web-Scale Features for Full-Scale Parsing
12 0.10298906 96 acl-2011-Disambiguating temporal-contrastive connectives for machine translation
13 0.10126331 334 acl-2011-Which Noun Phrases Denote Which Concepts?
14 0.093012765 15 acl-2011-A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction
15 0.090746872 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing
16 0.087919243 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering
17 0.087035239 108 acl-2011-EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar
18 0.085578568 127 acl-2011-Exploiting Web-Derived Selectional Preference to Improve Statistical Dependency Parsing
19 0.081111036 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction
20 0.080659918 46 acl-2011-Automated Whole Sentence Grammar Correction Using a Noisy Channel Model
topicId topicWeight
[(0, 0.199), (1, -0.003), (2, -0.076), (3, -0.046), (4, -0.007), (5, -0.034), (6, 0.152), (7, 0.038), (8, -0.061), (9, 0.008), (10, -0.026), (11, -0.207), (12, 0.174), (13, 0.156), (14, -0.023), (15, 0.089), (16, 0.091), (17, 0.139), (18, -0.134), (19, 0.199), (20, -0.021), (21, 0.001), (22, 0.03), (23, 0.038), (24, 0.053), (25, 0.042), (26, -0.001), (27, -0.043), (28, -0.046), (29, 0.094), (30, -0.022), (31, 0.037), (32, -0.022), (33, -0.016), (34, 0.049), (35, -0.027), (36, 0.071), (37, -0.06), (38, 0.01), (39, -0.091), (40, -0.028), (41, 0.021), (42, 0.092), (43, 0.018), (44, 0.003), (45, 0.014), (46, -0.137), (47, 0.017), (48, -0.126), (49, -0.001)]
simIndex simValue paperId paperTitle
same-paper 1 0.94804978 224 acl-2011-Models and Training for Unsupervised Preposition Sense Disambiguation
Author: Dirk Hovy ; Ashish Vaswani ; Stephen Tratz ; David Chiang ; Eduard Hovy
Abstract: We present a preliminary study on unsupervised preposition sense disambiguation (PSD), comparing different models and training techniques (EM, MAP-EM with L0 norm, Bayesian inference using Gibbs sampling). To our knowledge, this is the first attempt at unsupervised preposition sense disambiguation. Our best accuracy reaches 56%, a significant improvement (at p <.001) of 16% over the most-frequent-sense baseline.
2 0.69929433 307 acl-2011-Towards Tracking Semantic Change by Visual Analytics
Author: Christian Rohrdantz ; Annette Hautli ; Thomas Mayer ; Miriam Butt ; Daniel A. Keim ; Frans Plank
Abstract: This paper presents a new approach to detecting and tracking changes in word meaning by visually modeling and representing diachronic development in word contexts. Previous studies have shown that computational models are capable of clustering and disambiguating senses, a more recent trend investigates whether changes in word meaning can be tracked by automatic methods. The aim of our study is to offer a new instrument for investigating the diachronic development of word senses in a way that allows for a better understanding of the nature of semantic change in general. For this purpose we combine techniques from the field of Visual Analytics with unsupervised methods from Natural Language Processing, allowing for an interactive visual exploration of semantic change.
3 0.66401792 198 acl-2011-Latent Semantic Word Sense Induction and Disambiguation
Author: Tim Van de Cruys ; Marianna Apidianaki
Abstract: In this paper, we present a unified model for the automatic induction of word senses from text, and the subsequent disambiguation of particular word instances using the automatically extracted sense inventory. The induction step and the disambiguation step are based on the same principle: words and contexts are mapped to a limited number of topical dimensions in a latent semantic word space. The intuition is that a particular sense is associated with a particular topic, so that different senses can be discriminated through their association with particular topical dimensions; in a similar vein, a particular instance of a word can be disambiguated by determining its most important topical dimensions. The model is evaluated on the SEMEVAL-20 10 word sense induction and disambiguation task, on which it reaches stateof-the-art results.
4 0.65667009 158 acl-2011-Identification of Domain-Specific Senses in a Machine-Readable Dictionary
Author: Fumiyo Fukumoto ; Yoshimi Suzuki
Abstract: This paper focuses on domain-specific senses and presents a method for assigning category/domain label to each sense of words in a dictionary. The method first identifies each sense of a word in the dictionary to its corresponding category. We used a text classification technique to select appropriate senses for each domain. Then, senses were scored by computing the rank scores. We used Markov Random Walk (MRW) model. The method was tested on English and Japanese resources, WordNet 3.0 and EDR Japanese dictionary. For evaluation of the method, we compared English results with the Subject Field Codes (SFC) resources. We also compared each English and Japanese results to the first sense heuristics in the WSD task. These results suggest that identification of domain-specific senses (IDSS) may actually be of benefit.
5 0.60451341 334 acl-2011-Which Noun Phrases Denote Which Concepts?
Author: Jayant Krishnamurthy ; Tom Mitchell
Abstract: Resolving polysemy and synonymy is required for high-quality information extraction. We present ConceptResolver, a component for the Never-Ending Language Learner (NELL) (Carlson et al., 2010) that handles both phenomena by identifying the latent concepts that noun phrases refer to. ConceptResolver performs both word sense induction and synonym resolution on relations extracted from text using an ontology and a small amount of labeled data. Domain knowledge (the ontology) guides concept creation by defining a set of possible semantic types for concepts. Word sense induction is performed by inferring a set of semantic types for each noun phrase. Synonym detection exploits redundant informa- tion to train several domain-specific synonym classifiers in a semi-supervised fashion. When ConceptResolver is run on NELL’s knowledge base, 87% of the word senses it creates correspond to real-world concepts, and 85% of noun phrases that it suggests refer to the same concept are indeed synonyms.
6 0.60365158 32 acl-2011-Algorithm Selection and Model Adaptation for ESL Correction Tasks
7 0.59250003 302 acl-2011-They Can Help: Using Crowdsourcing to Improve the Evaluation of Grammatical Error Detection Systems
8 0.56463557 96 acl-2011-Disambiguating temporal-contrastive connectives for machine translation
9 0.54290611 147 acl-2011-Grammatical Error Correction with Alternating Structure Optimization
10 0.49777773 15 acl-2011-A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction
11 0.47395158 240 acl-2011-ParaSense or How to Use Parallel Corpora for Word Sense Disambiguation
12 0.43503404 46 acl-2011-Automated Whole Sentence Grammar Correction Using a Noisy Channel Model
13 0.43142366 167 acl-2011-Improving Dependency Parsing with Semantic Classes
14 0.42153558 17 acl-2011-A Large Scale Distributed Syntactic, Semantic and Lexical Language Model for Machine Translation
15 0.42124966 222 acl-2011-Model-Portability Experiments for Textual Temporal Analysis
16 0.41318241 88 acl-2011-Creating a manually error-tagged and shallow-parsed learner corpus
17 0.40501139 284 acl-2011-Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models
18 0.40481421 320 acl-2011-Unsupervised Discovery of Domain-Specific Knowledge from Text
19 0.39325452 321 acl-2011-Unsupervised Discovery of Rhyme Schemes
20 0.38100201 108 acl-2011-EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar
topicId topicWeight
[(5, 0.012), (17, 0.039), (37, 0.083), (39, 0.04), (41, 0.052), (55, 0.018), (59, 0.478), (72, 0.022), (91, 0.026), (96, 0.14), (97, 0.018)]
simIndex simValue paperId paperTitle
1 0.94141316 322 acl-2011-Unsupervised Learning of Semantic Relation Composition
Author: Eduardo Blanco ; Dan Moldovan
Abstract: This paper presents an unsupervised method for deriving inference axioms by composing semantic relations. The method is independent of any particular relation inventory. It relies on describing semantic relations using primitives and manipulating these primitives according to an algebra. The method was tested using a set of eight semantic relations yielding 78 inference axioms which were evaluated over PropBank.
2 0.89692432 102 acl-2011-Does Size Matter - How Much Data is Required to Train a REG Algorithm?
Author: Mariet Theune ; Ruud Koolen ; Emiel Krahmer ; Sander Wubben
Abstract: In this paper we investigate how much data is required to train an algorithm for attribute selection, a subtask of Referring Expressions Generation (REG). To enable comparison between different-sized training sets, a systematic training method was developed. The results show that depending on the complexity of the domain, training on 10 to 20 items may already lead to a good performance.
3 0.89494777 279 acl-2011-Semi-supervised latent variable models for sentence-level sentiment analysis
Author: Oscar Tackstrom ; Ryan McDonald
Abstract: We derive two variants of a semi-supervised model for fine-grained sentiment analysis. Both models leverage abundant natural supervision in the form of review ratings, as well as a small amount of manually crafted sentence labels, to learn sentence-level sentiment classifiers. The proposed model is a fusion of a fully supervised structured conditional model and its partially supervised counterpart. This allows for highly efficient estimation and inference algorithms with rich feature definitions. We describe the two variants as well as their component models and verify experimentally that both variants give significantly improved results for sentence-level sentiment analysis compared to all baselines. 1 Sentence-level sentiment analysis In this paper, we demonstrate how combining coarse-grained and fine-grained supervision benefits sentence-level sentiment analysis an important task in the field of opinion classification and retrieval (Pang and Lee, 2008). Typical supervised learning approaches to sentence-level sentiment analysis rely on sentence-level supervision. While such fine-grained supervision rarely exist naturally, and thus requires labor intensive manual annotation effort (Wiebe et al., 2005), coarse-grained supervision is naturally abundant in the form of online review ratings. This coarse-grained supervision is, of course, less informative compared to fine-grained supervision, however, by combining a small amount of sentence-level supervision with a large amount of document-level supervision, we are able to substantially improve on the sentence-level classification task. Our work combines two strands of research: models for sentiment analysis that take document structure into account; – 569 Ryan McDonald Google, Inc., New York ryanmcd@ google com . and models that use latent variables to learn unobserved phenomena from that which can be observed. Exploiting document structure for sentiment analysis has attracted research attention since the early work of Pang and Lee (2004), who performed minimal cuts in a sentence graph to select subjective sentences. McDonald et al. (2007) later showed that jointly learning fine-grained (sentence) and coarsegrained (document) sentiment improves predictions at both levels. More recently, Yessenalina et al. (2010) described how sentence-level latent variables can be used to improve document-level prediction and Nakagawa et al. (2010) used latent variables over syntactic dependency trees to improve sentence-level prediction, using only labeled sentences for training. In a similar vein, Sauper et al. (2010) integrated generative content structure models with discriminative models for multi-aspect sentiment summarization and ranking. These approaches all rely on the availability of fine-grained annotations, but Ta¨ckstro¨m and McDonald (201 1) showed that latent variables can be used to learn fine-grained sentiment using only coarse-grained supervision. While this model was shown to beat a set of natural baselines with quite a wide margin, it has its shortcomings. Most notably, due to the loose constraints provided by the coarse supervision, it tends to only predict the two dominant fine-grained sentiment categories well for each document sentiment category, so that almost all sentences in positive documents are deemed positive or neutral, and vice versa for negative documents. As a way of overcoming these shortcomings, we propose to fuse a coarsely supervised model with a fully supervised model. 
Below, we describe two ways of achieving such a combined model in the framework of structured conditional latent variable models. Contrary to (generative) topic models (Mei et al., 2007; Titov and Proceedings ofP thoer t4l9atnhd A, Onrnuegaoln M,e Jeuntineg 19 o-f2 t4h,e 2 A0s1s1o.c?i ac t2io0n11 fo Ar Cssoocmiaptuiotanti foonra Clo Lminpguutiast i ocns:aslh Loirntpgaupisetrics , pages 569–574, Figure 1: a) Factor graph of the fully observed graphical model. b) Factor graph of the corresponding latent variable model. During training, shaded nodes are observed, while non-shaded nodes are unobserved. The input sentences si are always observed. Note that there are no factors connecting the document node, yd, with the input nodes, s, so that the sentence-level variables, ys, in effect form a bottleneck between the document sentiment and the input sentences. McDonald, 2008; Lin and He, 2009), structured conditional models can handle rich and overlapping features and allow for exact inference and simple gradient based estimation. The former models are largely orthogonal to the one we propose in this work and combining their merits might be fruitful. As shown by Sauper et al. (2010), it is possible to fuse generative document structure models and task specific structured conditional models. While we do model document structure in terms of sentiment transitions, we do not model topical structure. An interesting avenue for future work would be to extend the model of Sauper et al. (2010) to take coarse-grained taskspecific supervision into account, while modeling fine-grained task-specific aspects with latent variables. Note also that the proposed approach is orthogonal to semi-supervised and unsupervised induction of context independent (prior polarity) lexicons (Turney, 2002; Kim and Hovy, 2004; Esuli and Sebastiani, 2009; Rao and Ravichandran, 2009; Velikovich et al., 2010). The output of such models could readily be incorporated as features in the proposed model. 1.1 Preliminaries Let d be a document consisting of n sentences, s = (si)in=1, with a document–sentence-sequence pair denoted d = (d, s). Let yd = (yd, ys) denote random variables1 the document level sentiment, yd, and the sequence of sentence level sentiment, = (ysi)in=1 . – ys 1We are abusing notation throughout by using the same symbols to refer to random variables and their particular assignments. 570 In what follows, we assume that we have access to two training sets: a small set of fully labeled instances, DF = {(dj, and a large set of ydj)}jm=f1, coarsely labeled instances DC = {(dj, yjd)}jm=fm+fm+c1. Furthermore, we assume that yd and all yis take values in {POS, NEG, NEU}. We focus on structured conditional models in the exponential family, with the standard parametrization pθ(yd,ys|s) = expnhφ(yd,ys,s),θi − Aθ(s)o
4 0.88625503 293 acl-2011-Template-Based Information Extraction without the Templates
Author: Nathanael Chambers ; Dan Jurafsky
Abstract: Standard algorithms for template-based information extraction (IE) require predefined template schemas, and often labeled data, to learn to extract their slot fillers (e.g., an embassy is the Target of a Bombing template). This paper describes an approach to template-based IE that removes this requirement and performs extraction without knowing the template structure in advance. Our algorithm instead learns the template structure automatically from raw text, inducing template schemas as sets of linked events (e.g., bombings include detonate, set off, and destroy events) associated with semantic roles. We also solve the standard IE task, using the induced syntactic patterns to extract role fillers from specific documents. We evaluate on the MUC-4 terrorism dataset and show that we induce template structure very similar to handcreated gold structure, and we extract role fillers with an F1 score of .40, approaching the performance of algorithms that require full knowledge of the templates.
same-paper 5 0.85036719 224 acl-2011-Models and Training for Unsupervised Preposition Sense Disambiguation
Author: Dirk Hovy ; Ashish Vaswani ; Stephen Tratz ; David Chiang ; Eduard Hovy
Abstract: We present a preliminary study on unsupervised preposition sense disambiguation (PSD), comparing different models and training techniques (EM, MAP-EM with L0 norm, Bayesian inference using Gibbs sampling). To our knowledge, this is the first attempt at unsupervised preposition sense disambiguation. Our best accuracy reaches 56%, a significant improvement (at p <.001) of 16% over the most-frequent-sense baseline.
6 0.81652737 329 acl-2011-Using Deep Morphology to Improve Automatic Error Detection in Arabic Handwriting Recognition
7 0.75166082 51 acl-2011-Automatic Headline Generation using Character Cross-Correlation
8 0.64168441 164 acl-2011-Improving Arabic Dependency Parsing with Form-based and Functional Morphological Features
9 0.63967794 262 acl-2011-Relation Guided Bootstrapping of Semantic Lexicons
10 0.60713655 170 acl-2011-In-domain Relation Discovery with Meta-constraints via Posterior Regularization
11 0.60034853 7 acl-2011-A Corpus for Modeling Morpho-Syntactic Agreement in Arabic: Gender, Number and Rationality
12 0.59952611 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering
13 0.5807761 167 acl-2011-Improving Dependency Parsing with Semantic Classes
14 0.57740259 269 acl-2011-Scaling up Automatic Cross-Lingual Semantic Role Annotation
15 0.56924868 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing
16 0.56232232 198 acl-2011-Latent Semantic Word Sense Induction and Disambiguation
17 0.5595746 244 acl-2011-Peeling Back the Layers: Detecting Event Role Fillers in Secondary Contexts
18 0.55841696 190 acl-2011-Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations
19 0.55414748 174 acl-2011-Insights from Network Structure for Text Mining
20 0.55171108 213 acl-2011-Local and Global Algorithms for Disambiguation to Wikipedia