acl acl2013 acl2013-246 knowledge-graph by maker-knowledge-mining

246 acl-2013-Modeling Thesis Clarity in Student Essays


Source: pdf

Author: Isaac Persing ; Vincent Ng

Abstract: Recently, researchers have begun exploring methods of scoring student essays with respect to particular dimensions of quality such as coherence, technical errors, and relevance to prompt, but there is relatively little work on modeling thesis clarity. We present a new annotated corpus and propose a learning-based approach to scoring essays along the thesis clarity dimension. Additionally, in order to provide more valuable feedback on why an essay is scored as it is, we propose a second learning-based approach to identifying what kinds of errors an essay has that may lower its thesis clarity score.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Recently, researchers have begun exploring methods of scoring student essays with respect to particular dimensions of quality such as coherence, technical errors, and relevance to prompt, but there is relatively little work on modeling thesis clarity. [sent-3, score-0.727]

2 We present a new annotated corpus and propose a learning-based approach to scoring essays along the thesis clarity dimension. [sent-4, score-1.074]

3 Additionally, in order to provide more valuable feedback on why an essay is scored as it is, we propose a second learning-based approach to identifying what kinds of errors an essay has that may lower its thesis clarity score. [sent-5, score-1.925]

4 1 Introduction: Automated essay scoring, the task of employing computer technology to evaluate and score written text, is one of the most important educational applications of natural language processing (NLP) (see Shermis and Burstein (2003) and Shermis et al. [sent-6, score-0.642]

5 A weakness of many existing scoring systems is that they adopt a holistic scoring scheme, which summarizes the quality of an essay with a single score and thus provides very limited feedback to the writer. [sent-9, score-0.778]

6 In particular, it is not clear which dimension of an essay (e. [sent-10, score-0.613]

7 Recent work addresses this problem by scoring a particular dimension of essay quality such as coherence (Miltsakaki and Kukich, 2004), technical errors, Relevance to Prompt (Higgins et al. [sent-13, score-0.754]

8 Essay grading software that provides feedback along multiple dimensions of essay quality such as E-rater/Criterion (Attali and Burstein, 2006) has also begun to emerge. [sent-16, score-0.613]

9 Nevertheless, there is an essay scoring dimension for which few computational models have been developed: thesis clarity. [sent-17, score-0.943]

10 Thesis clarity refers to how clearly an author explains the thesis of her essay, i.e., [sent-18, score-0.658]

11 the position she argues for with respect to the topic on which the essay is written. [sent-20, score-0.616]

12 An essay with a high thesis clarity score presents its thesis in a way that is easy for the reader to understand, preferably but not necessarily directly, as in essays with explicit thesis sentences. [sent-21, score-2.073]

13 First, we aim to develop a computational model for scoring the thesis clarity of student essays. [sent-24, score-0.812]

14 Because there are many reasons why an essay may receive a low thesis clarity score, our second goal is to build a system for determining why an essay receives its score. [sent-25, score-1.872]

15 We believe the feedback provided by this system will be more informative to a student than would a thesis clarity score alone, as it will help her understand which aspects of her writing need to be improved in order to better convey her thesis. [sent-26, score-0.82]

16 To this end, we identify five common errors that impact thesis clarity, and our system’s purpose is to determine which of these errors occur in a given essay. [sent-27, score-0.403]

17 We evaluate our thesis clarity scoring model and error identification system on a data set of 830 essays annotated with both thesis clarity scores and errors. [sent-28, score-1.961]

18 First, we develop a scoring model and error identification system for the thesis clarity dimension on student essays. [sent-30, score-1.033]

19 Second, we use features explicitly designed for each of the identified error types. (An essay's thesis is the overall message of the entire essay.) [sent-31, score-0.426]

20 This concept is distinct from the concept of thesis sentences, as even an essay that never explicitly states its thesis in any of its sentences may still have an overall message that can be inferred from the arguments it makes. [sent-32, score-1.083]

21 We use these error-type-specific features to train our scoring model, in contrast to many existing systems for other scoring dimensions, which use more general features developed without the concept of error classes. [sent-35, score-0.401]

22 Third, we make our data set consisting of thesis clarity annotations of 830 essays publicly available in order to stimulate further research on this task. [sent-36, score-0.959]

23 Since progress in thesis clarity modeling is hindered in part by the lack of a publicly annotated corpus, we believe that our data set will be a valuable resource to the NLP community. [sent-37, score-0.68]

24 We select a subset consisting of 830 argumentative essays from the ICLE to annotate and use for training and testing of our models of essay thesis clarity. [sent-42, score-1.15]

25 3 Corpus Annotation: For each of the 830 argumentative essays, we ask two native English speakers to (1) score it along the thesis clarity dimension and (2) determine the subset of the five pre-defined errors that detracts from the clarity of its thesis. [sent-45, score-1.303]

26 Annotators evaluate the clarity of each essay’s thesis using a numerical score from 1 to 4 at half-point increments (see Table 2 for a description of each score). [sent-47, score-0.708]

27 This contrasts with previous work on essay scoring, where the corpus is annotated with a binary decision. [sent-48, score-0.592]

28 Hence, our annotation scheme not only provides a finer-grained distinction of thesis clarity (which can be important in practice), but also makes the prediction task more challenging. [sent-54, score-0.658]

29 Analysis of these essays reveals that, though annotators only exactly agree on the thesis clarity score of an essay 36% of the time, the scores they apply are within 0. [sent-56, score-1.653]

30 Table 3 shows the number of essays that receive each of the seven scores for thesis clarity. [sent-59, score-0.59]

31 To identify what kinds of errors make an essay’s thesis unclear, we ask one of our annotators to write 1–4 sentence critiques of thesis clarity on 527 essays, and obtain our list of five common error classes by categorizing the things he found to criticize. [sent-69, score-1.231]

32 We present our annotators with descriptions of these five error classes (see Table 4), and ask them to assign zero or more of the error types to each essay. [sent-70, score-0.386]

33 It is important to note that we ask our annotators to mark an essay with one of these errors only when the error makes the thesis less clear. [sent-71, score-1.046]

34 So for example, an essay whose thesis is irrelevant to the prompt but is explicitly and otherwise clearly stated would not be marked as having a Relevance to Prompt error. [sent-72, score-1.039]

35 If the irrelevant thesis is stated in such a way that its inapplicability to the prompt causes the reader to be confused about what the essay’s purpose is, however, then the essay would be assigned a Relevance to Prompt error. [sent-73, score-1.039]

36 Both annotators also mark the errors in the same 100 essays that were doubly annotated with thesis clarity scores. [sent-79, score-1.021]

37 Table 5 shows the number of essays assigned to each of the five thesis clarity errors. [sent-82, score-1.002]

38 Table 5: Distribution of thesis clarity errors (number of essays per error class): CP 152, IPR 123, R 142, MD 47, WP 39. [sent-84, score-0.658]

39 More specifically, each training example consists of a target, which we set to the essay's thesis clarity score minus 4.0. [sent-87, score-0.708]

40 We represent the reduction in an essay's thesis clarity score as the thesis clarity score minus 4.0. [sent-89, score-1.416]

41 This allows us to more easily interpret the error and bias weights of the trained system, as under this setup, each error's weight should be a negative number reflecting how many points an essay loses due to the presence of that error. [sent-90, score-0.773]

42 The bias feature allows for the possibility that an essay may lose points from its thesis clarity score for problems not accounted for in our five error classes. [sent-91, score-1.563]
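To make the setup concrete, here is a minimal sketch of the target construction described above; the error-class abbreviations and the least-squares learner are illustrative assumptions, since the extract does not name the paper's actual learner.

```python
import numpy as np

# The five error classes from Tables 4-5: Confusing Phrasing, Incomplete
# Prompt Response, Relevance to Prompt, Missing Details, Writer Position.
ERRORS = ["CP", "IPR", "R", "MD", "WP"]

def make_instance(error_labels, clarity_score):
    # One binary feature per error class, plus an always-on bias feature that
    # absorbs score reductions not covered by the five error classes.
    x = [1.0 if e in error_labels else 0.0 for e in ERRORS] + [1.0]
    # Target: the reduction from a perfect score, so each learned weight should
    # come out negative, giving the points an essay loses per error.
    y = clarity_score - 4.0
    return x, y

def fit(instances):
    # Ordinary least squares stands in for whatever learner the authors used.
    X = np.array([x for x, _ in instances])
    y = np.array([t for _, t in instances])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return dict(zip(ERRORS + ["bias"], w))
```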

43 Inspecting the learned weights confirms that each of the enumerated error classes has a negative impact on the thesis clarity score. [sent-105, score-0.822]

44 Moreover, this set of errors accounts for a large majority of all errors impacting thesis clarity because unenumerated errors cost essays an average of only one-tenth of one point on the four-point thesis clarity scale. [sent-108, score-1.803]

45 4 Error Classification: In this section, we describe in detail our system for identifying thesis clarity errors. [sent-109, score-0.688]

46 4.1 Model Training and Application: We recast the problem of identifying which thesis clarity errors apply to an essay as a multi-label classification problem, wherein each essay may be assigned zero or more of the five pre-defined error types. [sent-111, score-2.098]

47 So in the binary classification problem for identifying error ei, we create one training instance from each essay in the training set, labeling the instance as positive if the essay has ei as one of its labels, and negative otherwise. [sent-113, score-1.42]

48 After creating training instances for error ei, we train a binary classifier, bi, for identifying which test essays contain error ei. [sent-117, score-0.561]
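A sketch of this one-classifier-per-error ("binary relevance") reduction; LinearSVC is a stand-in for the SVMlight setup the extract alludes to, and the feature matrix is assumed to be built elsewhere.

```python
from sklearn.svm import LinearSVC

def train_error_classifiers(X_train, essay_error_sets, error_types):
    # One binary classifier b_i per error type e_i.
    classifiers = {}
    for e in error_types:
        # Positive iff the essay has e_i among its gold labels.
        y = [1 if e in labels else 0 for labels in essay_error_sets]
        clf = LinearSVC()  # hypothetical learner choice
        clf.fit(X_train, y)
        classifiers[e] = clf
    return classifiers

def predict_errors(classifiers, x):
    # An essay receives every error whose classifier fires; possibly none.
    return {e for e, clf in classifiers.items() if clf.predict([x])[0] == 1}
```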

49 First, since labeling essays with thesis clarity errors can be viewed as a text categorization task, we employ lemmatized word unigram, bigram, and trigram features that occur in the essay and have not been removed by the feature selection parameter ni. [sent-133, score-1.714]

50 Because the essays vary greatly in length, we normalize each essay’s set of word features to unit length. [sent-134, score-0.342]
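A sketch of the n-gram extraction and length normalization just described; L2 normalization is an assumption, as the extract says only "unit length".

```python
import math
from collections import Counter

def ngram_features(lemmas, n_max=3):
    # Lemmatized word uni/bi/trigram counts, scaled to unit length so essays
    # of very different lengths become comparable.
    counts = Counter(
        " ".join(lemmas[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(lemmas) - n + 1)
    )
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {g: v / norm for g, v in counts.items()} if norm else {}
```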

51 A threshold of 0.1 classifies as positive only the instances to which bi assigns values in the top tenth of its output range, and so on; X is the default threshold, labeling essays as positive instances of ei only if bi returns for them a value greater than 0. [sent-152, score-0.513]
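A sketch of one plausible reading of this threshold scheme; the exact parameterization of the candidate thresholds is an assumption.

```python
def apply_threshold(scores, t):
    # t = 0.1 marks as positive only scores in the top tenth of the observed
    # output range, t = 0.2 the top two tenths, and so on; t = "X" is the
    # default sign threshold (positive iff b_i returns a value above 0).
    if t == "X":
        return [s > 0.0 for s in scores]
    lo, hi = min(scores), max(scores)
    cutoff = hi - t * (hi - lo)
    return [s >= cutoff for s in scores]
```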

52 We expect that features based on random indexing may be particularly useful for the Incomplete Prompt Response and Relevance to Prompt errors because they may help us find text related to the prompt even if some of its components have been rephrased (e.g., [sent-158, score-0.376]

53 an essay may talk about “jail” rather than “prison”, which is mentioned in one of the prompts). [sent-160, score-0.592]
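A sketch of how random indexing similarity can surface such rephrasings; the dimensionality, sparsity, and the precomputed per-word context vectors are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

DIM = 1000  # illustrative dimensionality

def index_vector(word, nonzeros=10, seed=0):
    # Sparse ternary random index vector, derived deterministically per word.
    rng = np.random.default_rng(abs(hash((word, seed))) % (2**32))
    v = np.zeros(DIM)
    pos = rng.choice(DIM, size=nonzeros, replace=False)
    v[pos] = rng.choice([-1.0, 1.0], size=nonzeros)
    return v

def text_vector(words, context_vectors):
    # context_vectors: per-word vectors accumulated from a background corpus
    # by summing the index vectors of co-occurring words (assumed precomputed).
    vecs = [context_vectors[w] for w in words if w in context_vectors]
    return np.sum(vecs, axis=0) if vecs else np.zeros(DIM)

def similarity(a, b):
    # Cosine similarity; 0.0 when either vector is empty.
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(a @ b / (na * nb)) if na and nb else 0.0
```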

54 Furthermore, we ask our annotators to only annotate an error if it makes the thesis less clear. [sent-171, score-0.415]

55 To use this feature, we first examine each of the 13 essay prompts, splitting it into its component pieces. [sent-180, score-0.592]

56 For our purposes, a component of a prompt is a prompt substring such that, if an essay does not address it, it may be assigned the Incomplete Prompt Response label. [sent-181, score-1.014]

57 To give an example, the lemmatized version of the third component of the second essay in Table 1 is “it should rehabilitate they”. [sent-183, score-0.616]

58 To compute one of our keyword features, we compute the random indexing similarity between the essay and each group of primary keywords taken from components of the essay’s prompt and assign the feature the lowest of these values. [sent-185, score-0.978]

59 If this feature has a low value, that suggests that the essay may have an Incomplete Prompt Response error because the essay probably did not respond to the part of the prompt from which this value came. [sent-186, score-1.587]

60 To compute another of the keyword features, we count the numbers of combined primary and secondary keywords the essay contains from each component of its prompt, and divide each number by the total number of primary and secondary features for that component. [sent-187, score-0.784]
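A sketch of the two keyword features just described, assuming the per-component primary/secondary keyword sets and a random indexing similarity function (such as the one sketched earlier) are given.

```python
def min_component_similarity(essay_vec, primary_keyword_vecs, similarity):
    # Keyword feature 1: the essay's lowest random indexing similarity to any
    # prompt component's primary keywords; a low value suggests the essay did
    # not respond to that component (Incomplete Prompt Response).
    return min(similarity(essay_vec, kv) for kv in primary_keyword_vecs)

def keyword_coverage(essay_words, component_keywords):
    # Keyword feature 2, one value per component: the fraction of that
    # component's combined primary and secondary keywords found in the essay.
    return [
        sum(1 for k in kws if k in essay_words) / len(kws) if kws else 0.0
        for kws in component_keywords
    ]
```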

61 For each essay, Aw+i counts the number of word n-grams we believe indicate that an essay is a positive example of ei, and Aw−i counts the number of word n-grams we believe indicate an essay is not an example of ei. [sent-202, score-1.235]

62 The first category for ei’s list consists of features that indicate an essay may be a positive instance. [sent-210, score-0.681]

63 Each word n-gram from this list that occurs in an essay increases the essay’s Aw+i value by one. [sent-211, score-0.634]
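A sketch of this counting scheme; the hand-built n-gram lists are assumed given.

```python
def aggregated_word_features(essay_ngrams, positive_ngrams, negative_ngrams):
    # Aw+_i: count of listed n-grams suggesting the essay IS a positive
    # instance of error e_i; Aw-_i: count of n-grams suggesting it is NOT.
    aw_plus = sum(1 for g in positive_ngrams if g in essay_ngrams)
    aw_minus = sum(1 for g in negative_ngrams if g in essay_ngrams)
    return aw_plus, aw_minus
```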

64 This feature indicates that the essay it occurs in might be a positive instance of the Writer Position error since it tells us the writer is attributing some statement being made to someone else. [sent-233, score-0.896]

65 Hence, this feature, along with several others like “Awareness-Cognizer-we all”, is useful when constructing the lists of frame features for Writer Position’s aggregated frame features Af+i and Af−i. [sent-234, score-0.34]

66 5 Score Prediction: Because essays containing thesis clarity errors tend to have lower thesis clarity scores than essays with fewer errors, we believe that thesis clarity scores can be predicted for essays by utilizing the same features we use for identifying thesis clarity errors. [sent-236, score-3.718]

67 Because our score prediction system uses the same feature types we use for thesis error identification, each essay’s vector space representation remains unchanged. [sent-237, score-0.485]

68 Only its label changes to one of the values in Table 2 in order to reflect its thesis clarity score. [sent-238, score-0.658]

69 Since the labels are numeric scores (1.0 to 4.0), we cast thesis clarity score prediction as a regression rather than a classification task. [sent-244, score-0.729]

70 After we create training instances, train a regressor on training set essays, and tune parameters on validation set essays, we can use the regressor to obtain thesis clarity scores on test set essays. [sent-247, score-0.799]
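A sketch of this pipeline; SVR and the C grid are stand-ins, since the extract does not name the regressor or its tuned parameters.

```python
from sklearn.svm import SVR

def train_and_score(X_train, y_train, X_val, y_val, X_test):
    # Same feature vectors as error identification; only the label changes to
    # the essay's thesis clarity score.
    best, best_err = None, float("inf")
    for c in (0.01, 0.1, 1.0, 10.0):  # tuned on the validation essays
        reg = SVR(kernel="linear", C=c).fit(X_train, y_train)
        preds = reg.predict(X_val)
        err = sum(abs(p - a) for p, a in zip(preds, y_val)) / len(y_val)
        if err < best_err:
            best, best_err = reg, err
    return best.predict(X_test)  # estimated thesis clarity scores
```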

71 In each experiment, we use 3/5 of our labeled essays for model training, another 1/5 for parameter tuning, and the final 1/5 for testing. [sent-255, score-0.322]

72 To evaluate our thesis clarity error type identification system, we compute precision, recall, micro F-score, and macro F-score, which are calculated as follows. [sent-258, score-1.192]

73 Let tpi be the number of test essays correctly labeled as positive by error ei’s binary classifier bi; pi be the total number of test essays labeled as positive by bi; and gi be the total number of test essays that belong to ei according to the gold standard. [sent-259, score-1.147]
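The formulas themselves did not survive extraction; the standard micro- and macro-averaged definitions consistent with these counts (with sums running over the five error classes) are:

\[
P = \frac{\sum_i tp_i}{\sum_i p_i}, \qquad
R = \frac{\sum_i tp_i}{\sum_i g_i}, \qquad
F = \frac{2PR}{P+R}
\]
\[
\hat{P} = \frac{1}{5}\sum_i \frac{tp_i}{p_i}, \qquad
\hat{R} = \frac{1}{5}\sum_i \frac{tp_i}{g_i}, \qquad
\hat{F} = \frac{2\hat{P}\hat{R}}{\hat{P}+\hat{R}}
\]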

74 Results on error identification, expressed in terms of precision, recall, micro F-score, and macro F-score are shown in the first four columns of Table 6. [sent-265, score-0.473]

75 Our Baseline system, which only uses word n-gram and random indexing features, seems to perform uniformly poorly across both micro and macro F-scores (F and F̂; see row 1). [sent-266, score-0.429]

76 The per-class results show that, since micro F-score places more weight on the correct identification of the most frequent errors, the system’s micro F-score (31. [sent-267, score-0.428]

77 When we add the misspelling feature to the baseline, resulting in the system called Bm (row 2), the micro F-score sees a very small, insignificant improvement. [sent-279, score-0.373]

78 The macro F-score of the overall system would likely have improved more than shown in the table if the addition of keyword features did not simultaneously reduce Missing Details’s score by several points. [sent-291, score-0.342]

79 This feature type does, however, result in major improvements to micro and macro performance on Missing Details and Writer Position, the other two classes this feature was designed to help. [sent-297, score-0.476]

80 The micro F-score improvement can also be partly attributed to a four-point improvement in Incomplete Prompt Response’s micro F-score. [sent-310, score-0.442]

81 The macro F-score improvement of the Missing Details error plays a larger role in the overall system’s macro F-score improvement than Confusing Phrasing’s improvement, however. [sent-312, score-0.501]

82 The improvement we see in micro F-score when we add aggregated frame features (row 6) can be attributed almost solely to improvements in classification of the minority classes. [sent-313, score-0.453]

83 We did not expect this aggregated feature type to be especially useful for Missing Details error identification because very few of these types of features occur in its Af+i list, and there are none in its Af−i list. [sent-320, score-0.352]

84 Some aim to improve precision by telling us when essays are less likely to be positive instances of an error class, such as any of the Aw−i, Ap−i, or Af−i features, and others aim to tell us when an essay is more likely to be a positive instance of an error. [sent-328, score-1.081]

85 We design three evaluation metrics to measure the error of our thesis clarity scoring system. [sent-331, score-0.903]

86 Finally, the S3 metric measures the average square of the distance between a system’s thesis clarity score estimations and the annotator-assigned scores. [sent-337, score-0.743]

87 These three scores are given by
\[
S_1 = \frac{1}{N}\sum_{j:\,A_j \neq E_j'} 1, \qquad
S_2 = \frac{1}{N}\sum_{j=1}^{N} \lvert A_j - E_j \rvert, \qquad
S_3 = \frac{1}{N}\sum_{j=1}^{N} (A_j - E_j)^2,
\]
where Aj, Ej, and Ej′ are the annotator-assigned, system-estimated, and rounded system-estimated scores respectively for essay j, and N is the number of essays. [sent-339, score-0.707]
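A minimal sketch computing the three metrics; rounding estimates to the nearest half point (the annotation scale's increment) to obtain E′j is an assumption, as the footnote defining the rounding did not survive extraction.

```python
def s_metrics(annotator_scores, estimated_scores):
    # S1: error rate of the rounded estimates; S2: mean absolute error;
    # S3: mean squared error, all against the annotator-assigned scores.
    n = len(annotator_scores)
    rounded = [round(e * 2) / 2 for e in estimated_scores]  # E'_j (assumed)
    s1 = sum(a != ep for a, ep in zip(annotator_scores, rounded)) / n
    s2 = sum(abs(a - e) for a, e in zip(annotator_scores, estimated_scores)) / n
    s3 = sum((a - e) ** 2 for a, e in zip(annotator_scores, estimated_scores)) / n
    return s1, s2, s3
```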

88 We see that the score-predicting variant of the Baseline system, which employs as features only word n-grams and random indexing features, predicts the wrong score 65. [sent-342, score-0.861]

89 Adding the misspelling feature to the scoring systems, however, only yields minor, insignificant improvements to their performances under the three scoring metrics. [sent-348, score-0.379]

90 Overall, the scoring model employing the Bmk feature set performs significantly better than the Baseline scoring model with respect to two out of three scoring metrics. [sent-350, score-0.384]

91 The only remaining feature type whose addition yields a significant performance improvement is the aggregated word feature type, which improves system Bmk’s S2 score significantly while having an insignificant impact on the other S metrics. [sent-351, score-0.338]

92 This is a surprising finding since, up until we introduced aggregated part-of-speech tag n-gram features into our regressor, each additional feature that helped with error classification made at least a small but positive contribution to at least two out of the three S scores. [sent-353, score-0.361]

93 This means that 25% of the time, when system Bmkwp (which obtains the best S2 score) is presented with a test essay having a gold standard score of 1.5, [sent-369, score-0.672]

94 it predicts that the essay has a score less than or equal to 2. [sent-370, score-0.642]

95 Nevertheless, no system relies entirely on bias, as evidenced by the fact that each column in the table has a tendency for its scores to ascend as the gold standard score increases, implying that the systems have some success at predicting lower scores for essays with lower gold standard scores. [sent-376, score-0.464]

96 7 Conclusion: We examined the problem of modeling thesis clarity errors and scoring in student essays. [sent-378, score-0.874]

97 In addition to developing these models, we proposed novel features for use in our thesis clarity error model and employed these features, each of which was explicitly designed for one or more of the error types, to train our scoring model. [sent-379, score-1.074]

98 We make our thesis clarity annotations publicly available in order to stimulate further research on this task. [sent-380, score-0.658]

99 Automated scoring and annotation of essays with the Intelligent Essay Assessor. [sent-450, score-0.416]

100 Evaluation of text coherence for electronic essay scoring systems. [sent-456, score-0.733]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('essay', 0.592), ('clarity', 0.422), ('essays', 0.301), ('thesis', 0.236), ('prompt', 0.211), ('micro', 0.194), ('macro', 0.149), ('error', 0.13), ('scoring', 0.115), ('aw', 0.104), ('phrasing', 0.093), ('confusing', 0.086), ('aggregated', 0.081), ('writer', 0.079), ('frame', 0.069), ('indexing', 0.062), ('errors', 0.062), ('misspelling', 0.059), ('higgins', 0.059), ('ei', 0.056), ('regressor', 0.056), ('keyword', 0.053), ('bi', 0.052), ('insignificant', 0.051), ('score', 0.05), ('burstein', 0.046), ('af', 0.046), ('jill', 0.044), ('five', 0.043), ('shermis', 0.043), ('missing', 0.042), ('features', 0.041), ('identification', 0.04), ('student', 0.039), ('feature', 0.039), ('incomplete', 0.038), ('persing', 0.036), ('relevance', 0.036), ('estimations', 0.035), ('classes', 0.034), ('details', 0.033), ('misspellings', 0.032), ('fscore', 0.031), ('response', 0.031), ('system', 0.03), ('ap', 0.03), ('icle', 0.03), ('derrick', 0.03), ('positive', 0.029), ('landauer', 0.029), ('scores', 0.029), ('learner', 0.028), ('secondary', 0.028), ('ej', 0.028), ('statement', 0.027), ('points', 0.027), ('improvement', 0.027), ('coherence', 0.026), ('annotator', 0.026), ('ask', 0.026), ('evidenced', 0.025), ('seven', 0.024), ('aggregative', 0.024), ('bmk', 0.024), ('bmkwp', 0.024), ('fscores', 0.024), ('ipr', 0.024), ('kanerva', 0.024), ('kirkpatrick', 0.024), ('rehabilitate', 0.024), ('row', 0.024), ('bias', 0.024), ('position', 0.024), ('value', 0.023), ('automated', 0.023), ('annotators', 0.023), ('svmlight', 0.022), ('believe', 0.022), ('argumentative', 0.021), ('mahwah', 0.021), ('pers', 0.021), ('attali', 0.021), ('subtable', 0.021), ('utdal', 0.021), ('primary', 0.021), ('type', 0.021), ('aj', 0.021), ('dimension', 0.021), ('feedback', 0.021), ('classification', 0.021), ('parameter', 0.021), ('surprising', 0.02), ('actual', 0.02), ('fairly', 0.02), ('isaac', 0.02), ('semafor', 0.02), ('minority', 0.02), ('prison', 0.02), ('list', 0.019), ('overall', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000006 246 acl-2013-Modeling Thesis Clarity in Student Essays

Author: Isaac Persing ; Vincent Ng

Abstract: Recently, researchers have begun exploring methods of scoring student essays with respect to particular dimensions of quality such as coherence, technical errors, and relevance to prompt, but there is relatively little work on modeling thesis clarity. We present a new annotated corpus and propose a learning-based approach to scoring essays along the thesis clarity dimension. Additionally, in order to provide more valuable feedback on why an essay is scored as it is, we propose a second learning-based approach to identifying what kinds of errors an essay has that may lower its thesis clarity score.

2 0.47386113 389 acl-2013-Word Association Profiles and their Use for Automated Scoring of Essays

Author: Beata Beigman Klebanov ; Michael Flor

Abstract: We describe a new representation of the content vocabulary of a text we call word association profile that captures the proportions of highly associated, mildly associated, unassociated, and dis-associated pairs of words that co-exist in the given text. We illustrate the shape of the distirbution and observe variation with genre and target audience. We present a study of the relationship between quality of writing and word association profiles. For a set of essays written by college graduates on a number of general topics, we show that the higher scoring essays tend to have higher percentages of both highly associated and dis-associated pairs, and lower percentages of mildly associated pairs of words. Finally, we use word association profiles to improve a system for automated scoring of essays.

3 0.069980204 182 acl-2013-High-quality Training Data Selection using Latent Topics for Graph-based Semi-supervised Learning

Author: Akiko Eriguchi ; Ichiro Kobayashi

Abstract: In a multi-class document categorization using graph-based semi-supervised learning (GBSSL), it is essential to construct a proper graph expressing the relation among nodes and to use a reasonable categorization algorithm. Furthermore, it is also important to provide high-quality correct data as training data. In this context, we propose a method to construct a similarity graph by employing both surface information and latent information to express similarity between nodes and a method to select high-quality training data for GBSSL by means of the PageR- ank algorithm. Experimenting on Reuters21578 corpus, we have confirmed that our proposed methods work well for raising the accuracy of a multi-class document categorization.

4 0.068590924 44 acl-2013-An Empirical Examination of Challenges in Chinese Parsing

Author: Jonathan K. Kummerfeld ; Daniel Tse ; James R. Curran ; Dan Klein

Abstract: Aspects of Chinese syntax result in a distinctive mix of parsing challenges. However, the contribution of individual sources of error to overall difficulty is not well understood. We conduct a comprehensive automatic analysis of error types made by Chinese parsers, covering a broad range of error types for large sets of sentences, enabling the first empirical ranking of Chinese error types by their performance impact. We also investigate which error types are resolved by using gold part-of-speech tags, showing that improving Chinese tagging only addresses certain error types, leaving substantial outstanding challenges.

5 0.067338206 299 acl-2013-Reconstructing an Indo-European Family Tree from Non-native English Texts

Author: Ryo Nagata ; Edward Whittaker

Abstract: Mother tongue interference is the phenomenon where linguistic systems of a mother tongue are transferred to another language. Although there has been plenty of work on mother tongue interference, very little is known about how strongly it is transferred to another language and about what relation there is across mother tongues. To address these questions, this paper explores and visualizes mother tongue interference preserved in English texts written by Indo-European language speakers. This paper further explores linguistic features that explain why certain relations are preserved in English writing, and which contribute to related tasks such as native language identification.

6 0.065725408 59 acl-2013-Automated Pyramid Scoring of Summaries using Distributional Semantics

7 0.051016513 265 acl-2013-Outsourcing FrameNet to the Crowd

8 0.048240628 172 acl-2013-Graph-based Local Coherence Modeling

9 0.047085814 248 acl-2013-Modelling Annotator Bias with Multi-task Gaussian Processes: An Application to Machine Translation Quality Estimation

10 0.044980872 55 acl-2013-Are Semantically Coherent Topic Models Useful for Ad Hoc Information Retrieval?

11 0.044723801 310 acl-2013-Semantic Frames to Predict Stock Price Movement

12 0.044401284 8 acl-2013-A Learner Corpus-based Approach to Verb Suggestion for ESL

13 0.044144314 238 acl-2013-Measuring semantic content in distributional vectors

14 0.043784827 58 acl-2013-Automated Collocation Suggestion for Japanese Second Language Learners

15 0.043688245 342 acl-2013-Text Classification from Positive and Unlabeled Data using Misclassified Data Correction

16 0.042165831 298 acl-2013-Recognizing Rare Social Phenomena in Conversation: Empowerment Detection in Support Group Chatrooms

17 0.041836444 263 acl-2013-On the Predictability of Human Assessment: when Matrix Completion Meets NLP Evaluation

18 0.040701021 221 acl-2013-Learning Non-linear Features for Machine Translation Using Gradient Boosting Machines

19 0.039825551 235 acl-2013-Machine Translation Detection from Monolingual Web-Text

20 0.039545082 126 acl-2013-Diverse Keyword Extraction from Conversations


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.133), (1, 0.035), (2, -0.001), (3, -0.052), (4, 0.003), (5, -0.027), (6, 0.026), (7, 0.001), (8, -0.036), (9, 0.023), (10, -0.009), (11, 0.025), (12, -0.056), (13, -0.0), (14, -0.097), (15, 0.004), (16, -0.033), (17, 0.026), (18, 0.001), (19, -0.031), (20, 0.045), (21, 0.021), (22, 0.039), (23, -0.044), (24, 0.004), (25, 0.135), (26, 0.029), (27, 0.007), (28, -0.091), (29, 0.055), (30, -0.137), (31, -0.141), (32, -0.003), (33, -0.097), (34, -0.08), (35, -0.084), (36, -0.088), (37, 0.08), (38, 0.151), (39, -0.006), (40, 0.239), (41, 0.124), (42, -0.313), (43, -0.077), (44, 0.238), (45, 0.206), (46, 0.067), (47, 0.206), (48, 0.091), (49, -0.007)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93938124 246 acl-2013-Modeling Thesis Clarity in Student Essays

Author: Isaac Persing ; Vincent Ng

Abstract: Recently, researchers have begun exploring methods of scoring student essays with respect to particular dimensions of quality such as coherence, technical errors, and relevance to prompt, but there is relatively little work on modeling thesis clarity. We present a new annotated corpus and propose a learning-based approach to scoring essays along the thesis clarity dimension. Additionally, in order to provide more valuable feedback on why an essay is scored as it is, we propose a second learning-based approach to identifying what kinds of errors an essay has that may lower its thesis clarity score.

2 0.85792536 389 acl-2013-Word Association Profiles and their Use for Automated Scoring of Essays

Author: Beata Beigman Klebanov ; Michael Flor

Abstract: We describe a new representation of the content vocabulary of a text we call word association profile that captures the proportions of highly associated, mildly associated, unassociated, and dis-associated pairs of words that co-exist in the given text. We illustrate the shape of the distirbution and observe variation with genre and target audience. We present a study of the relationship between quality of writing and word association profiles. For a set of essays written by college graduates on a number of general topics, we show that the higher scoring essays tend to have higher percentages of both highly associated and dis-associated pairs, and lower percentages of mildly associated pairs of words. Finally, we use word association profiles to improve a system for automated scoring of essays.

3 0.44732189 59 acl-2013-Automated Pyramid Scoring of Summaries using Distributional Semantics

Author: Rebecca J. Passonneau ; Emily Chen ; Weiwei Guo ; Dolores Perin

Abstract: The pyramid method for content evaluation of automated summarizers produces scores that are shown to correlate well with manual scores used in educational assessment of students’ summaries. This motivates the development of a more accurate automated method to compute pyramid scores. Of three methods tested here, the one that performs best relies on latent semantics.

4 0.4328025 299 acl-2013-Reconstructing an Indo-European Family Tree from Non-native English Texts

Author: Ryo Nagata ; Edward Whittaker

Abstract: Mother tongue interference is the phenomenon where linguistic systems of a mother tongue are transferred to another language. Although there has been plenty of work on mother tongue interference, very little is known about how strongly it is transferred to another language and about what relation there is across mother tongues. To address these questions, this paper explores and visualizes mother tongue interference preserved in English texts written by Indo-European language speakers. This paper further explores linguistic features that explain why certain relations are preserved in English writing, and which contribute to related tasks such as native language identification.

5 0.34508273 69 acl-2013-Bilingual Lexical Cohesion Trigger Model for Document-Level Machine Translation

Author: Guosheng Ben ; Deyi Xiong ; Zhiyang Teng ; Yajuan Lu ; Qun Liu

Abstract: In this paper, we propose a bilingual lexical cohesion trigger model to capture lexical cohesion for document-level machine translation. We integrate the model into hierarchical phrase-based machine translation and achieve an absolute improvement of 0.85 BLEU points on average over the baseline on NIST Chinese-English test sets.

6 0.33030251 340 acl-2013-Text-Driven Toponym Resolution using Indirect Supervision

7 0.32884654 172 acl-2013-Graph-based Local Coherence Modeling

8 0.32527575 31 acl-2013-A corpus-based evaluation method for Distributional Semantic Models

9 0.31905624 263 acl-2013-On the Predictability of Human Assessment: when Matrix Completion Meets NLP Evaluation

10 0.31807256 390 acl-2013-Word surprisal predicts N400 amplitude during reading

11 0.3113403 324 acl-2013-Smatch: an Evaluation Metric for Semantic Feature Structures

12 0.3050459 122 acl-2013-Discriminative Approach to Fill-in-the-Blank Quiz Generation for Language Learners

13 0.30306786 1 acl-2013-"Let Everything Turn Well in Your Wife": Generation of Adult Humor Using Lexical Constraints

14 0.29940668 277 acl-2013-Part-of-speech tagging with antagonistic adversaries

15 0.29917029 36 acl-2013-Adapting Discriminative Reranking to Grounded Language Learning

16 0.28331053 5 acl-2013-A Decade of Automatic Content Evaluation of News Summaries: Reassessing the State of the Art

17 0.27903208 225 acl-2013-Learning to Order Natural Language Texts

18 0.27637598 310 acl-2013-Semantic Frames to Predict Stock Price Movement

19 0.2714583 63 acl-2013-Automatic detection of deception in child-produced speech using syntactic complexity features

20 0.27027357 238 acl-2013-Measuring semantic content in distributional vectors


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.051), (6, 0.364), (11, 0.034), (15, 0.013), (24, 0.048), (26, 0.039), (35, 0.074), (42, 0.031), (48, 0.042), (70, 0.038), (71, 0.014), (88, 0.044), (90, 0.053), (95, 0.066)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.93450391 300 acl-2013-Reducing Annotation Effort for Quality Estimation via Active Learning

Author: Daniel Beck ; Lucia Specia ; Trevor Cohn

Abstract: Quality estimation models provide feedback on the quality of machine translated texts. They are usually trained on humanannotated datasets, which are very costly due to its task-specific nature. We investigate active learning techniques to reduce the size of these datasets and thus annotation effort. Experiments on a number of datasets show that with as little as 25% of the training instances it is possible to obtain similar or superior performance compared to that of the complete datasets. In other words, our active learning query strategies can not only reduce annotation effort but can also result in better quality predictors. ,t .

2 0.93123376 143 acl-2013-Exact Maximum Inference for the Fertility Hidden Markov Model

Author: Chris Quirk

Abstract: The notion of fertility in word alignment (the number of words emitted by a single state) is useful but difficult to model. Initial attempts at modeling fertility used heuristic search methods. Recent approaches instead use more principled approximate inference techniques such as Gibbs sampling for parameter estimation. Yet in practice we also need the single best alignment, which is difficult to find using Gibbs. Building on recent advances in dual decomposition, this paper introduces an exact algorithm for finding the single best alignment with a fertility HMM. Finding the best alignment appears important, as this model leads to a substantial improvement in alignment quality.

3 0.92918986 319 acl-2013-Sequential Summarization: A New Application for Timely Updated Twitter Trending Topics

Author: Dehong Gao ; Wenjie Li ; Renxian Zhang

Abstract: The growth of the Web 2.0 technologies has led to an explosion of social networking media sites. Among them, Twitter is the most popular service by far due to its ease for real-time sharing of information. It collects millions of tweets per day and monitors what people are talking about in the trending topics updated timely. Then the question is how users can understand a topic in a short time when they are frustrated with the overwhelming and unorganized tweets. In this paper, this problem is approached by sequential summarization which aims to produce a sequential summary, i.e., a series of chronologically ordered short sub-summaries that collectively provide a full story about topic development. Both the number and the content of sub-summaries are automatically identified by the proposed stream-based and semantic-based approaches. These approaches are evaluated in terms of sequence coverage, sequence novelty and sequence correlation and the effectiveness of their combination is demonstrated.

4 0.91410625 145 acl-2013-Exploiting Qualitative Information from Automatic Word Alignment for Cross-lingual NLP Tasks

Author: Jose G.C. de Souza ; Miquel Espla-Gomis ; Marco Turchi ; Matteo Negri

Abstract: The use of automatic word alignment to capture sentence-level semantic relations is common to a number of cross-lingual NLP applications. Despite its proved usefulness, however, word alignment information is typically considered from a quantitative point of view (e.g. the number of alignments), disregarding qualitative aspects (the importance of aligned terms). In this paper we demonstrate that integrating qualitative information can bring significant performance improvements with negligible impact on system complexity. Focusing on the cross-lingual textual en- tailment task, we contribute with a novel method that: i) significantly outperforms the state of the art, and ii) is portable, with limited loss in performance, to language pairs where training data are not available.

same-paper 5 0.88891834 246 acl-2013-Modeling Thesis Clarity in Student Essays

Author: Isaac Persing ; Vincent Ng

Abstract: Recently, researchers have begun exploring methods of scoring student essays with respect to particular dimensions of quality such as coherence, technical errors, and relevance to prompt, but there is relatively little work on modeling thesis clarity. We present a new annotated corpus and propose a learning-based approach to scoring essays along the thesis clarity dimension. Additionally, in order to provide more valuable feedback on why an essay is scored as it is, we propose a second learning-based approach to identifying what kinds of errors an essay has that may lower its thesis clarity score.

6 0.85636234 210 acl-2013-Joint Word Alignment and Bilingual Named Entity Recognition Using Dual Decomposition

7 0.85030705 36 acl-2013-Adapting Discriminative Reranking to Grounded Language Learning

8 0.65350193 52 acl-2013-Annotating named entities in clinical text by combining pre-annotation and active learning

9 0.64367092 377 acl-2013-Using Supervised Bigram-based ILP for Extractive Summarization

10 0.63354826 259 acl-2013-Non-Monotonic Sentence Alignment via Semisupervised Learning

11 0.61867779 353 acl-2013-Towards Robust Abstractive Multi-Document Summarization: A Caseframe Analysis of Centrality and Domain

12 0.61483622 333 acl-2013-Summarization Through Submodularity and Dispersion

13 0.61321223 59 acl-2013-Automated Pyramid Scoring of Summaries using Distributional Semantics

14 0.60811299 204 acl-2013-Iterative Transformation of Annotation Guidelines for Constituency Parsing

15 0.60768694 157 acl-2013-Fast and Robust Compressive Summarization with Dual Decomposition and Multi-Task Learning

16 0.60627258 83 acl-2013-Collective Annotation of Linguistic Resources: Basic Principles and a Formal Model

17 0.60217738 248 acl-2013-Modelling Annotator Bias with Multi-task Gaussian Processes: An Application to Machine Translation Quality Estimation

18 0.59794658 101 acl-2013-Cut the noise: Mutually reinforcing reordering and alignments for improved machine translation

19 0.59126568 176 acl-2013-Grounded Unsupervised Semantic Parsing

20 0.58169711 129 acl-2013-Domain-Independent Abstract Generation for Focused Meeting Summarization