emnlp emnlp2013 emnlp2013-28 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Hongbo Chen ; Ben He
Abstract: Previous approaches for automated essay scoring (AES) learn a rating model by minimizing either the classification, regression, or pairwise classification loss, depending on the learning algorithm used. In this paper, we argue that the current AES systems can be further improved by taking into account the agreement between human and machine raters. To this end, we propose a rankbased approach that utilizes listwise learning to rank algorithms for learning a rating model, where the agreement between the human and machine raters is directly incorporated into the loss function. Various linguistic and statistical features are utilized to facilitate the learning algorithms. Experiments on the publicly available English essay dataset, Automated Student Assessment Prize (ASAP), show that our proposed approach outperforms the state-of-the-art algorithms, and achieves performance comparable to professional human raters, which suggests the effectiveness of our proposed method for automated essay scoring.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Previous approaches for automated essay scoring (AES) learn a rating model by minimizing either the classification, regression, or pairwise classification loss, depending on the learning algorithm used. [sent-2, score-1.17]
2 To this end, we propose a rankbased approach that utilizes listwise learning to rank algorithms for learning a rating model, where the agreement between the human and machine raters is directly incorporated into the loss function. [sent-4, score-0.863]
3 1 Introduction Automated essay scoring utilizes NLP techniques to automatically rate essays written for given prompts, namely essay topics, in an educational setting (Dikli, 2006). [sent-7, score-1.696]
4 For example, before AES systems entered the picture, essays in the writing assessment of the Graduate Record Examination (GRE) were rated by two human raters. [sent-9, score-0.47]
5 A third human rater is needed when the difference between the scores given by the two human raters is larger than one on the 6-point scale. [sent-14, score-0.336]
6 Currently, GRE essays are rated by one human rater and one AES system. [sent-15, score-0.476]
7 Existing approaches consider essay rating as a classification (Larkey, 1998), regression (Attali and Burstein, 2006) or preference ranking problem (Yannakoudakis et al. [sent-21, score-1.143]
8 In this paper, we argue that the purpose of AES is to predict the essay’s rating that human raters would give. [sent-23, score-0.448]
9 We propose a listwise learning to rank algorithm to address automated essay scoring in the view of directly optimizing the agreement between human raters and the AES system. [sent-30, score-1.086]
10 Unlike (Yannakoudakis et al., 2011), which maximizes the pairwise classification precision (Liu, 2009), our rank-based approach follows the listwise learning paradigm, and the agreement between the machine and human raters is directly integrated into the loss function that is optimized by gradient boosted regression trees. [sent-32, score-0.695]
11 To the best of our knowledge, this work is the first to apply a listwise learning to rank approach to AES, which aims at optimizing the agreement between the human and machine raters. [sent-33, score-0.363]
12 It is widely accepted that the agreement between human raters, measured by either quadratic weighted kappa or Pearson's correlation coefficient, ranges from 0.7 to 0.8. [sent-37, score-0.377]
13 In section 2, we introduce the research background of automated essay scoring and give a brief introduction to learning to rank. [sent-42, score-0.833]
14 In section 3, a detailed description of our listwise learning to rank approach for automated essay scoring is presented. [sent-43, score-1.089]
15 The regression-based approach treats the feature values and the essay score as the independent variables and the dependent variable, respectively, and then learns a regression equation with classical regression algorithms, such as support vector regression (Vapnik et al. [sent-54, score-0.913]
16 It uses features such as the fourth root of essay length, and a regression-based approach to predict the score that human raters would give. [sent-59, score-0.829]
17 The classification-based approach sees essay scores as unordered class labels and uses classical classification algorithms, e.g. [sent-64, score-0.702]
18 the K-nearest neighbor (KNN) and the naive Bayesian model, to predict to which class an essay belongs, where a class is associated with a numeric rating. [sent-66, score-0.646]
19 Another system, also developed in the late 1990s, evaluates essays by measuring semantic features. [sent-68, score-0.646]
20 Yannakoudakis et al. (2011) proposed a preference-ranking-based approach for learning a rating model, where a ranking function or model is learned to construct a global ordering of essays based on writing quality. [sent-73, score-0.799]
21 It is also the first study of a rank-based approach in automated essay scoring. [sent-74, score-0.767]
22 A prompt-specific rating model is built for a specific prompt and designed to be the best rating model for the particular prompt (Williamson, 2009). [sent-79, score-0.696]
23 A generic rating model is trained on essays across a group of prompts and is designed to be the best fit for predicting human scores for all prompts. [sent-82, score-0.703]
24 A generic rating model evaluates essays across all prompts with the same scoring criteria, which is more consistent with the human rubric, usually the same for all prompts, and therefore has validity-related advantages (Attali et al. [sent-84, score-0.745]
25 Classical classification and regression algorithms (Vapnik et al., 1996), which have been widely used in automated essay scoring (Shermis and Burstein, 2002), can be seen as pointwise approaches. [sent-92, score-0.833]
26 Yannakoudakis et al. (2011) apply a pairwise approach, ranking SVM, to automated essay scoring and achieve better performance than support vector regression. [sent-99, score-0.951]
27 In listwise approaches, ranking algorithms process a list of documents at a time, and the loss function aims at measuring the accordance between the predicted ranking list and the ground-truth labels. [sent-100, score-0.426]
28 The listwise approach has not yet been used in automated essay scoring. [sent-105, score-0.767]
29 Firstly, a set of essays rated by professional human raters is gathered for training. [sent-107, score-0.567]
30 A listwise learning to rank algorithm learns a ranking model or function using this set of human-rated essays, represented by vectors of the pre-defined features. [sent-108, score-0.718]
31 Then the learned ranking model or function outputs a model score for each essay, including both rated and unrated essays, from which a global ordering of essays is constructed. [sent-109, score-0.448]
32 1 Listwise Learning to Rank for AES Our choice of the listwise learning to rank algorithm is due to the fact that it takes the entire set of labeled essays associated with a given prompt, instead of individual essays or essay pairs as in (Yannakoudakis et al. [sent-115, score-1.494]
33 For automated essay scoring, LambdaMART is not readily applicable, since its loss function is defined in terms of the gradient of IR evaluation measures. [sent-137, score-0.876]
34 This is because, for AES, the rating prediction of all essays matters equally, no matter what ratings they receive. [sent-139, score-0.598]
35 For a pair of essays, essay i and essay j, λi,j is defined as the derivative of RankNet (Li et al. [sent-152, score-1.292]
36 \lambda_{i,j} = \frac{-\delta}{1 + e^{\delta(s_i - s_j)}} \, |\Delta \mathrm{Kappa}| \quad (1) where s_i and s_j are the model scores for essay i and essay j, respectively. [sent-154, score-1.316]
37 The quadratic weighted kappa is calculated as follows: \kappa = 1 - \frac{\sum_{i,j} \omega_{i,j} O_{i,j}}{\sum_{i,j} \omega_{i,j} E_{i,j}} \quad (2) In matrix O, O_{i,j} corresponds to the number of essays that received a score i by the human rater and a score j by the AES system. [sent-156, score-0.477]
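To make Equation (2) concrete, here is a minimal sketch of the quadratic weighted kappa, assuming integer ratings on a fixed scale; the function name qwk and the NumPy-based implementation are illustrative, not taken from the paper.

```python
import numpy as np

def qwk(human, machine, min_rating, max_rating):
    """Quadratic weighted kappa between two integer rating vectors (Eq. 2)."""
    n = max_rating - min_rating + 1
    # O[i, j]: number of essays rated i by the human rater and j by the AES system.
    O = np.zeros((n, n))
    for h, m in zip(human, machine):
        O[h - min_rating, m - min_rating] += 1
    # E[i, j]: expected counts under rater independence, scaled to sum to N.
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()
    # w[i, j]: quadratic disagreement weights.
    i, j = np.indices((n, n))
    w = (i - j) ** 2 / float((n - 1) ** 2)
    return 1.0 - (w * O).sum() / (w * E).sum()
```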
38 In each iteration, every essay is ranked by its model score and then rated according to its ranking position. [sent-160, score-0.775]
39 For example, for five essays e1, e2, e3, e4, e5 with actual ratings 5, 4, 3, 2, 1, if the ranking (by model score) is e3, e4, e1, e5, e2, we assume that e3, e4, e1, e5, e2 will get ratings of 5, 4, 3, 2, 1, over which quadratic weighted kappa gain can be calculated. [sent-161, score-0.704]
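The rank-then-rate step above can be sketched as follows: ratings_by_rank hands out the sorted actual ratings by rank position, and delta_kappa computes the |ΔKappa| term of Equation (1) by swapping two essays' positions. Both helper names are hypothetical, and the code assumes the qwk sketch given earlier.

```python
import numpy as np

def ratings_by_rank(model_scores, actual_ratings):
    """Rank essays by model score and assign the sorted actual ratings by
    position, e.g. a ranking e3, e4, e1, e5, e2 receives ratings 5, 4, 3, 2, 1."""
    order = np.argsort(model_scores)[::-1]        # highest-scored essay first
    handed_out = sorted(actual_ratings, reverse=True)
    predicted = np.empty(len(model_scores), dtype=int)
    for rank, idx in enumerate(order):
        predicted[idx] = handed_out[rank]
    return predicted

def delta_kappa(model_scores, actual, i, j, lo, hi):
    """|ΔKappa|: the change in quadratic weighted kappa when essays i and j
    swap rank positions (realized here by swapping their model scores)."""
    base = qwk(actual, ratings_by_rank(model_scores, actual), lo, hi)
    swapped = np.array(model_scores, dtype=float)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return abs(qwk(actual, ratings_by_rank(swapped, actual), lo, hi) - base)
```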
40 Let I denote the set of pairs ⟨i, j⟩ in which essay i receives a higher rating than essay j. [sent-164, score-1.557]
41 The lambda gradient of each essay, e.g. essay i, is defined as \lambda_i = \sum_{j:\langle i,j \rangle \in I} \lambda_{i,j} - \sum_{j:\langle j,i \rangle \in I} \lambda_{i,j} \quad (3) The rationale behind the above formulae is as follows. [sent-170, score-0.646]
42 For each of the essays in the whole essay collection associated with the same prompt, e.g. [sent-171, score-0.942]
43 essay i, the gradient λi is incremented by a positive value λi,j when coming across another essay j that has a lower rating. [sent-173, score-1.365]
44 On the contrary, the gradient λi will be incremented by a negative value −λi,j when the other essay has a higher rating. [sent-175, score-0.719]
45 As a result, after each iteration of MART, essays with higher ratings tend to receive a higher model score while essays with lower ratings tend to get a lower model score. [sent-176, score-1.122]
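Putting Equations (1) and (3) together, a didactic, unoptimized sketch of the per-essay lambda gradients could look like the following; it assumes the delta_kappa helper above, treats δ as a shape hyperparameter, and follows the prose's sign convention in which an essay paired with a lower-rated one accumulates a positive lambda.

```python
import numpy as np

def lambda_gradients(scores, actual, lo, hi, delta=1.0):
    """Per-essay lambdas (Eq. 3) from pairwise RankNet-style derivatives
    scaled by |ΔKappa| (Eq. 1), over all pairs <i, j> in I, i.e. pairs
    where essay i has a higher actual rating than essay j."""
    n = len(scores)
    lam = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if actual[i] > actual[j]:             # <i, j> is in I
                dk = delta_kappa(scores, actual, i, j, lo, hi)
                l_ij = delta / (1.0 + np.exp(delta * (scores[i] - scores[j]))) * dk
                lam[i] += l_ij                    # higher-rated essay pushed up
                lam[j] -= l_ij                    # lower-rated essay pushed down
    return lam                                    # targets for the next MART tree
```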
46 To determine the final rating of each given unrated essay, we have to map this unscaled model score to the predefined scale, such as an integer from 1 to 6 on a 6-point scale. [sent-178, score-0.323]
47 To begin with, the learned ranking model also computes an unscaled model score for each essay in the training set. [sent-180, score-0.759]
48 As the model is trained by learning to rank algorithms, essays with higher model scores tend to get higher actual ratings. [sent-181, score-0.395]
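The excerpt does not spell out the exact score-to-rating mapping, so the following is only one plausible realization, assuming a nearest-score lookup against the rated training essays.

```python
import numpy as np

def map_score_to_rating(test_score, train_scores, train_ratings):
    """Hypothetical mapping: give an unrated essay the rating of the
    training essay whose unscaled model score is closest to its own."""
    train_scores = np.asarray(train_scores, dtype=float)
    nearest = int(np.argmin(np.abs(train_scores - test_score)))
    return train_ratings[nearest]
```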
49 2 Pre-defined Features We pre-define four types of features that indicate the essay quality: lexical; syntactical; grammar and fluency; and content and prompt-specific features. [sent-191, score-0.676]
50 – Unique words: The number of unique words appearing in each essay, normalized by the essay length in words. [sent-202, score-0.716]
51 Word bigram and trigram: We evaluate the grammar and fluency of an essay by calculating the mean tf/TF of word bigrams and trigrams (Briscoe et al. [sent-218, score-0.756]
52 , 2010) (tf is the term frequency in a single essay and TF is the term frequency in the whole essay collection). [sent-219, score-1.292]
53 We treat a bigram or trigram with a high tf/TF as a likely grammar error, because a high tf/TF means that this kind of bigram or trigram is not commonly used in the whole essay collection but appears in this specific essay. [sent-220, score-0.676]
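A small sketch of this grammar-and-fluency feature, assuming whitespace-tokenized essays and precomputed collection-level bigram counts; the helper name is illustrative.

```python
from collections import Counter

def mean_tf_over_TF(essay_tokens, collection_counts):
    """Mean tf/TF over the word bigrams of one essay: tf is the bigram's
    frequency in this essay, TF its frequency in the whole essay collection
    (assumed to include this essay, so TF >= tf and each ratio is <= 1)."""
    tf = Counter(zip(essay_tokens, essay_tokens[1:]))
    ratios = [count / max(collection_counts.get(bigram, count), 1)
              for bigram, count in tf.items()]
    return sum(ratios) / len(ratios) if ratios else 0.0
```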
54 The fourth root of essay length in words has been shown to be highly correlated with the essay score (Shermis and Burstein, 2002). [sent-225, score-1.353]
55 It is calculated as the weighted mean of all cosine similarities, where each weight is set to the corresponding essay's score. [sent-227, score-0.729]
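Read this way, the similarity feature is a score-weighted mean of cosine similarities against the rated essays of the same prompt; the vector representations and the function name below are assumptions for illustration.

```python
import numpy as np

def weighted_similarity(essay_vec, rated_vecs, rated_scores):
    """Score-weighted mean cosine similarity between one essay and the
    rated essays of the same prompt."""
    sims = np.array([np.dot(essay_vec, v) /
                     (np.linalg.norm(essay_vec) * np.linalg.norm(v) + 1e-12)
                     for v in rated_vecs])
    weights = np.asarray(rated_scores, dtype=float)   # weight = essay score
    return float(np.sum(weights * sims) / np.sum(weights))
```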
56 The dataset in this competition consists of eight essay sets. [sent-245, score-0.646]
57 Each essay set was generated from a single prompt. [sent-246, score-0.646]
58 The number of essays associated with each prompt ranges from 900 to 1800, and the average length of essays in words in each essay set ranges from 150 to 650. [sent-247, score-1.414]
59 All essays were written by students in different grades and received a resolved score, namely the actual rating, from professional human raters. [sent-248, score-0.37]
60 If there are essays that come from n essay topics, we calculate the agreement degree on each essay topic first and then compute the overall agreement degree in the z-space. [sent-262, score-1.778]
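We read "in the z-space" as Fisher's r-to-z transformation, the usual way to average correlation-like agreement values across groups; a sketch under that assumption:

```python
import numpy as np

def overall_agreement(per_prompt_agreement):
    """Combine per-topic agreement values in the z-space: apply Fisher's
    r-to-z transform, average the z values, and map the mean back to r."""
    z = np.arctanh(np.asarray(per_prompt_agreement, dtype=float))
    return float(np.tanh(z.mean()))
```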
61 Yannakoudakis et al. (2011) utilize SVM for preference ranking, a pairwise learning to rank algorithm, for training a rating model. [sent-281, score-0.422]
62 For the prompt-specific rating model, feature selection is conducted on the essays associated with the same prompt. [sent-299, score-0.588]
63 For the generic rating model, the final feature set used for training is the intersection of the eight feature sets of the prompt-specific rating models. [sent-300, score-0.653]
64 For content and prompt-specific features, essay length in words, word vector and semantic vector similarity with highly rated essays, and text coherence are usually selected for training a prompt-specific rating model. [sent-305, score-0.759]
65 When it comes to the generic rating model, the prompt-specific features, like word vector similarity and semantic vector similarity, are removed. [sent-317, score-0.359]
66 4 Evaluation Methodology We conduct three sets of experiments to evaluate the effectiveness of our listwise learning to rank approach for automated essay scoring. [sent-319, score-1.023]
67 We conduct 5-fold cross-validation, where the essays of each prompt are randomly partitioned into 5 subsets. [sent-321, score-0.379]
68 The objective of the second set of experiments is to test the performance of our listwise learning to rank approach for generic rating models. [sent-331, score-0.59]
69 In 5-fold cross-validation, essays associated with the same prompt are randomly partitioned into 5 subsets. [sent-333, score-0.379]
70 In the third set of experiments, we evaluate the quality of the features used in our rating model by a feature ablation test and a feature unique test. [sent-336, score-0.393]
71 In the ablation test, we evaluate our essay rating model's performance before and after the removal of a subset of features from the whole feature set. [sent-337, score-1.007]
72 In the unique test, only a subset of features is used in the rating model construction, and all other features are removed. [sent-339, score-0.326]
73 The learned rating model's performance indicates to what extent the features are correlated with the actual essay ratings. [sent-340, score-0.935]
74 For the prompt-specific rating model, all of these algorithms achieve good performance comparable to human raters, as the literature has revealed that the agreement between two professional human raters (measured by correlation statistics such as quadratic weighted kappa or Pearson's correlation coefficient) typically ranges from 0.7 to 0.8. [sent-347, score-0.764]
75 The result of the first set of experiments suggests the effectiveness and robustness of our listwise learning to rank approach in building prompt-specific rating models. [sent-355, score-0.521]
76 For the generic rating model, one can conclude from Table 1 that RF bagging LambdaMART performs better than SVM for classification, regression and preference ranking on the ASAP dataset. [sent-356, score-0.6]
77 The dataset used in our experiment consists of essays generated by 8 prompts and each prompt has its own features. [sent-357, score-0.46]
78 With such a training set, neither the classification- nor the regression-based approach produces good results, as it is commonly accepted that a rating model whose performance, measured by inter-rater agreement, is lower than 0.7 is not acceptable. [sent-358, score-0.512]
79 The performance comparison of the generic rating models suggests that the rank-based approaches, SVMp and RF bagging K-LambdaMART, are more effective than the classification-based SVMc and the regression-based SVMr, while our proposed RF bagging K-LambdaMART outperforms the state-of-the-art SVMp. [sent-361, score-0.653]
80 Moreover, we find that there is no obvious performance difference when our proposed method is applied to prompt-specific and generic rating models. [sent-362, score-0.334]
81 Considering the advantages generic rating models have, the result of the second set of experiments suggests the feasibility of building a rating model that is generalizable across different prompts while performing only slightly worse than the prompt-specific rating model. [sent-363, score-0.945]
82 Among the lexical features, the two feature subsets, word level and statistics of word length, are highly correlated with the essay score in both prompt-specific and generic rating models. [sent-369, score-1.066]
83 This observation was expected, since word usage is an important notion of writing quality, regardless of essay topics. [sent-370, score-0.686]
84 Among the syntactical features, the feature subset sentence level, measured by the height and depth of the parse tree, correlates the most with the essay score. [sent-371, score-0.75]
85 What is more, during feature selection, we find that the Pearson's correlation coefficient between this feature's values and the final ratings in each essay prompt is strongly negative, reaching about -0.60. [sent-375, score-0.911]
86 This suggests that our method to estimate the number of grammar errors is applicable, because it is widely accepted that essays with more grammar errors tend to receive lower ratings. [sent-377, score-0.394]
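Feature selection by Pearson's correlation can be sketched as follows; the cut-off value is illustrative, not reported in the excerpt.

```python
import numpy as np

def select_features(X, y, threshold=0.2):
    """Keep feature columns whose |Pearson r| with the essay ratings
    exceeds a (hypothetical) threshold."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    yc = y - y.mean()
    keep = []
    for k in range(X.shape[1]):
        xc = X[:, k] - X[:, k].mean()
        r = np.dot(xc, yc) / (np.linalg.norm(xc) * np.linalg.norm(yc) + 1e-12)
        if abs(r) > threshold:
            keep.append(k)
    return keep
```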
87 Among the content and prompt-specific features, the essay length and word vector similarity features give good results in the feature unique test. [sent-378, score-0.768]
88 The fourth root of essay length in words has been shown to be a highly correlated feature by many works on AES (Shermis and Burstein, 2002). [sent-379, score-0.734]
89 The word vector similarity feature measures prompt-specific vocabulary usage, which is also important for essay evaluation. [sent-380, score-0.698]
90 However, the result of the feature unique test suggests that most features used in our rating model are in fact highly correlated with writing quality. [sent-383, score-0.389]
91 6 Conclusions and Future Work We have proposed a listwise learning to rank approach to automated essay scoring (AES) by directly incorporating the human-machine agreement into the loss function (Table 2: Results of the feature ablation and unique tests). [sent-384, score-1.319]
92 Experiments on the public English dataset ASAP show that our approach outperforms the state-of-the-art algorithms in both prompt-specific and generic rating settings. [sent-385, score-0.399]
93 Moreover, it is widely accepted that the agreement between professional human raters ranges from 0.7 to 0.8. [sent-386, score-0.356]
94 Our approach achieves 0.78 for generic rating, suggesting its potential in automated essay scoring. [sent-392, score-0.836]
95 It is therefore appealing to develop an approach that learns a generic model with acceptable rating accuracy, since it has both validity-related and logistical advantages. [sent-396, score-0.334]
96 In our future work, we plan to continue the research on generic rating models. [sent-397, score-0.334]
97 Because the writing features of essays vary across prompts, a viable approach is to explore more generic writing features that reflect writing quality well. [sent-398, score-0.485]
98 Performance of a generic approach in automated essay scoring. [sent-411, score-0.836]
99 Automatic essay grading using text categorization techniques. [sent-495, score-0.646]
100 Comparing the validity of automated and human essay scoring. [sent-522, score-0.804]
wordName wordTfidf (topN-words)
[('essay', 0.646), ('essays', 0.296), ('aes', 0.279), ('rating', 0.265), ('listwise', 0.181), ('raters', 0.146), ('asap', 0.131), ('lambdamart', 0.131), ('automated', 0.121), ('kappa', 0.105), ('yannakoudakis', 0.104), ('quadratic', 0.099), ('rater', 0.092), ('prompt', 0.083), ('williamson', 0.081), ('prompts', 0.081), ('regression', 0.08), ('ranking', 0.078), ('rank', 0.075), ('agreement', 0.07), ('attali', 0.07), ('generic', 0.069), ('bagging', 0.066), ('scoring', 0.066), ('burstein', 0.06), ('loss', 0.059), ('weighted', 0.052), ('rated', 0.051), ('gradient', 0.05), ('syntactical', 0.05), ('rf', 0.049), ('fluency', 0.049), ('ranknet', 0.046), ('shermis', 0.046), ('subclauses', 0.046), ('assessment', 0.046), ('pearson', 0.045), ('educational', 0.042), ('preference', 0.042), ('ablation', 0.041), ('briscoe', 0.04), ('vapnik', 0.04), ('pairwise', 0.04), ('writing', 0.04), ('accepted', 0.038), ('variance', 0.037), ('length', 0.037), ('ratings', 0.037), ('coefficient', 0.037), ('human', 0.037), ('forests', 0.037), ('professional', 0.037), ('brenner', 0.035), ('larkey', 0.035), ('promptspecific', 0.035), ('unscaled', 0.035), ('mart', 0.034), ('unique', 0.033), ('svm', 0.033), ('classification', 0.032), ('mean', 0.031), ('disagreement', 0.031), ('powers', 0.03), ('grammar', 0.03), ('algorithms', 0.03), ('stands', 0.03), ('subset', 0.028), ('ranges', 0.028), ('ndcg', 0.028), ('prize', 0.028), ('burges', 0.027), ('measured', 0.027), ('subsets', 0.027), ('feature', 0.027), ('gre', 0.026), ('correlation', 0.026), ('similarity', 0.025), ('degree', 0.025), ('college', 0.025), ('scores', 0.024), ('exists', 0.024), ('correlated', 0.024), ('breland', 0.023), ('esol', 0.023), ('gbdt', 0.023), ('hewlett', 0.023), ('hongbo', 0.023), ('incremented', 0.023), ('klambdamart', 0.023), ('kliebsch', 0.023), ('svmc', 0.023), ('svmp', 0.023), ('svmr', 0.023), ('ungraded', 0.023), ('unrated', 0.023), ('zechner', 0.023), ('spelling', 0.023), ('boosting', 0.023), ('validation', 0.022), ('partition', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000013 28 emnlp-2013-Automated Essay Scoring by Maximizing Human-Machine Agreement
Author: Hongbo Chen ; Ben He
Abstract: Previous approaches for automated essay scoring (AES) learn a rating model by minimizing either the classification, regression, or pairwise classification loss, depending on the learning algorithm used. In this paper, we argue that the current AES systems can be further improved by taking into account the agreement between human and machine raters. To this end, we propose a rankbased approach that utilizes listwise learning to rank algorithms for learning a rating model, where the agreement between the human and machine raters is directly incorporated into the loss function. Various linguistic and statistical features are utilized to facilitate the learning algorithms. Experiments on the publicly available English essay dataset, Automated Student Assessment Prize (ASAP), show that our proposed approach outperforms the state-of-the-art algorithms, and achieves performance comparable to professional human raters, which suggests the effectiveness of our proposed method for automated essay scoring.
2 0.094740637 200 emnlp-2013-Well-Argued Recommendation: Adaptive Models Based on Words in Recommender Systems
Author: Julien Gaillard ; Marc El-Beze ; Eitan Altman ; Emmanuel Ethis
Abstract: Recommendation systems (RS) take advantage of products and users' information in order to propose items to consumers. Collaborative, content-based and a few hybrid RS have been developed in the past. In contrast, we propose a new domain-independent semantic RS. By providing textually well-argued recommendations, we aim to give more responsibility to the end user in his decision. The system includes a new similarity measure keeping up both the accuracy of rating predictions and coverage. We propose an innovative way to apply a fast adaptation scheme at a semantic level, providing recommendations and arguments in phase with the very recent past. We have performed several experiments on films data, providing textually well-argued recommendations.
3 0.094604865 123 emnlp-2013-Learning to Rank Lexical Substitutions
Author: Gyorgy Szarvas ; Robert Busa-Fekete ; Eyke Hullermeier
Abstract: The problem to replace a word with a synonym that fits well in its sentential context is known as the lexical substitution task. In this paper, we tackle this task as a supervised ranking problem. Given a dataset of target words, their sentential contexts and the potential substitutions for the target words, the goal is to train a model that accurately ranks the candidate substitutions based on their contextual fitness. As a key contribution, we customize and evaluate several learning-to-rank models to the lexical substitution task, including classification-based and regression-based approaches. On two datasets widely used for lexical substitution, our best models significantly advance the state-of-the-art.
4 0.068646133 199 emnlp-2013-Using Topic Modeling to Improve Prediction of Neuroticism and Depression in College Students
Author: Philip Resnik ; Anderson Garron ; Rebecca Resnik
Abstract: We investigate the value-add of topic modeling in text analysis for depression, and for neuroticism as a strongly associated personality measure. Using Pennebaker's Linguistic Inquiry and Word Count (LIWC) lexicon to provide baseline features, we show that straightforward topic modeling using Latent Dirichlet Allocation (LDA) yields interpretable, psychologically relevant "themes" that add value in prediction of clinical assessments.
5 0.043215353 114 emnlp-2013-Joint Learning and Inference for Grammatical Error Correction
Author: Alla Rozovskaya ; Dan Roth
Abstract: State-of-the-art systems for grammatical error correction are based on a collection of independently-trained models for specific errors. Such models ignore linguistic interactions at the sentence level and thus do poorly on mistakes that involve grammatical dependencies among several words. In this paper, we identify linguistic structures with interacting grammatical properties and propose to address such dependencies via joint inference and joint learning. We show that it is possible to identify interactions well enough to facilitate a joint approach and, consequently, that joint methods correct incoherent predictions that independently-trained classifiers tend to produce. Furthermore, because the joint learning model considers interacting phenomena during training, it is able to identify mistakes that require making multiple changes simultaneously and that standard approaches miss. Overall, our model significantly outperforms the Illinois system that placed first in the CoNLL-2013 shared task on grammatical error correction.
6 0.042421229 39 emnlp-2013-Boosting Cross-Language Retrieval by Learning Bilingual Phrase Associations from Relevance Rankings
7 0.041745901 126 emnlp-2013-MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text
8 0.036184318 178 emnlp-2013-Success with Style: Using Writing Style to Predict the Success of Novels
9 0.03587259 170 emnlp-2013-Sentiment Analysis: How to Derive Prior Polarities from SentiWordNet
10 0.035472579 86 emnlp-2013-Feature Noising for Log-Linear Structured Prediction
11 0.034189619 69 emnlp-2013-Efficient Collective Entity Linking with Stacking
12 0.032455053 77 emnlp-2013-Exploiting Domain Knowledge in Aspect Extraction
13 0.032420628 127 emnlp-2013-Max-Margin Synchronous Grammar Induction for Machine Translation
14 0.032333363 85 emnlp-2013-Fast Joint Compression and Summarization via Graph Cuts
15 0.031863876 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging
16 0.03159257 169 emnlp-2013-Semi-Supervised Representation Learning for Cross-Lingual Text Classification
17 0.030989328 1 emnlp-2013-A Constrained Latent Variable Model for Coreference Resolution
18 0.030967813 202 emnlp-2013-Where Not to Eat? Improving Public Policy by Predicting Hygiene Inspections Using Online Reviews
19 0.030668546 106 emnlp-2013-Inducing Document Plans for Concept-to-Text Generation
20 0.030595368 17 emnlp-2013-A Walk-Based Semantically Enriched Tree Kernel Over Distributed Word Representations
topicId topicWeight
[(0, -0.123), (1, 0.025), (2, -0.025), (3, -0.008), (4, -0.012), (5, 0.018), (6, 0.055), (7, 0.012), (8, -0.013), (9, -0.042), (10, -0.034), (11, 0.024), (12, -0.056), (13, 0.018), (14, 0.061), (15, -0.01), (16, -0.028), (17, 0.039), (18, -0.004), (19, 0.025), (20, 0.008), (21, 0.033), (22, 0.005), (23, 0.043), (24, 0.026), (25, 0.062), (26, -0.061), (27, -0.052), (28, -0.136), (29, -0.028), (30, 0.03), (31, -0.007), (32, -0.079), (33, -0.032), (34, 0.009), (35, 0.15), (36, -0.106), (37, 0.086), (38, -0.019), (39, -0.134), (40, 0.234), (41, -0.278), (42, -0.009), (43, 0.01), (44, -0.168), (45, 0.018), (46, 0.138), (47, 0.064), (48, -0.01), (49, 0.066)]
simIndex simValue paperId paperTitle
same-paper 1 0.93504232 28 emnlp-2013-Automated Essay Scoring by Maximizing Human-Machine Agreement
Author: Hongbo Chen ; Ben He
Abstract: Previous approaches for automated essay scoring (AES) learn a rating model by minimizing either the classification, regression, or pairwise classification loss, depending on the learning algorithm used. In this paper, we argue that the current AES systems can be further improved by taking into account the agreement between human and machine raters. To this end, we propose a rankbased approach that utilizes listwise learning to rank algorithms for learning a rating model, where the agreement between the human and machine raters is directly incorporated into the loss function. Various linguistic and statistical features are utilized to facilitate the learning algorithms. Experiments on the publicly available English essay dataset, Automated Student Assessment Prize (ASAP), show that our proposed approach outperforms the state-of-the-art algorithms, and achieves performance comparable to professional human raters, which suggests the effectiveness of our proposed method for automated essay scoring.
2 0.69800282 200 emnlp-2013-Well-Argued Recommendation: Adaptive Models Based on Words in Recommender Systems
Author: Julien Gaillard ; Marc El-Beze ; Eitan Altman ; Emmanuel Ethis
Abstract: Recommendation systems (RS) take advantage of products and users' information in order to propose items to consumers. Collaborative, content-based and a few hybrid RS have been developed in the past. In contrast, we propose a new domain-independent semantic RS. By providing textually well-argued recommendations, we aim to give more responsibility to the end user in his decision. The system includes a new similarity measure keeping up both the accuracy of rating predictions and coverage. We propose an innovative way to apply a fast adaptation scheme at a semantic level, providing recommendations and arguments in phase with the very recent past. We have performed several experiments on films data, providing textually well-argued recommendations.
3 0.60541481 199 emnlp-2013-Using Topic Modeling to Improve Prediction of Neuroticism and Depression in College Students
Author: Philip Resnik ; Anderson Garron ; Rebecca Resnik
Abstract: We investigate the value-add of topic modeling in text analysis for depression, and for neuroticism as a strongly associated personality measure. Using Pennebaker's Linguistic Inquiry and Word Count (LIWC) lexicon to provide baseline features, we show that straightforward topic modeling using Latent Dirichlet Allocation (LDA) yields interpretable, psychologically relevant "themes" that add value in prediction of clinical assessments.
4 0.58419877 123 emnlp-2013-Learning to Rank Lexical Substitutions
Author: Gyorgy Szarvas ; Robert Busa-Fekete ; Eyke Hullermeier
Abstract: The problem to replace a word with a synonym that fits well in its sentential context is known as the lexical substitution task. In this paper, we tackle this task as a supervised ranking problem. Given a dataset of target words, their sentential contexts and the potential substitutions for the target words, the goal is to train a model that accurately ranks the candidate substitutions based on their contextual fitness. As a key contribution, we customize and evaluate several learning-to-rank models to the lexical substitution task, including classification-based and regression-based approaches. On two datasets widely used for lexical substitution, our best models significantly advance the state-of-the-art.
5 0.34268293 86 emnlp-2013-Feature Noising for Log-Linear Structured Prediction
Author: Sida Wang ; Mengqiu Wang ; Stefan Wager ; Percy Liang ; Christopher D. Manning
Abstract: NLP models have many and sparse features, and regularization is key for balancing model overfitting versus underfitting. A recently repopularized form of regularization is to generate fake training data by repeatedly adding noise to real data. We reinterpret this noising as an explicit regularizer, and approximate it with a second-order formula that can be used during training without actually generating fake data. We show how to apply this method to structured prediction using multinomial logistic regression and linear-chain CRFs. We tackle the key challenge of developing a dynamic program to compute the gradient of the regularizer efficiently. The regularizer is a sum over inputs, so we can estimate it more accurately via a semi-supervised or transductive extension. Applied to text classification and NER, our method provides a > 1% absolute performance gain over use of standard L2 regularization.
6 0.33874536 170 emnlp-2013-Sentiment Analysis: How to Derive Prior Polarities from SentiWordNet
7 0.30200201 202 emnlp-2013-Where Not to Eat? Improving Public Policy by Predicting Hygiene Inspections Using Online Reviews
9 0.29181662 39 emnlp-2013-Boosting Cross-Language Retrieval by Learning Bilingual Phrase Associations from Relevance Rankings
10 0.2624408 178 emnlp-2013-Success with Style: Using Writing Style to Predict the Success of Novels
12 0.25881261 198 emnlp-2013-Using Soft Constraints in Joint Inference for Clinical Concept Recognition
13 0.25601688 173 emnlp-2013-Simulating Early-Termination Search for Verbose Spoken Queries
14 0.25410709 114 emnlp-2013-Joint Learning and Inference for Grammatical Error Correction
15 0.25275359 94 emnlp-2013-Identifying Manipulated Offerings on Review Portals
16 0.25187868 106 emnlp-2013-Inducing Document Plans for Concept-to-Text Generation
17 0.2459317 162 emnlp-2013-Russian Stress Prediction using Maximum Entropy Ranking
18 0.24200638 195 emnlp-2013-Unsupervised Spectral Learning of WCFG as Low-rank Matrix Completion
19 0.24116327 60 emnlp-2013-Detecting Compositionality of Multi-Word Expressions using Nearest Neighbours in Vector Space Models
20 0.23984016 26 emnlp-2013-Assembling the Kazakh Language Corpus
topicId topicWeight
[(3, 0.045), (18, 0.03), (22, 0.052), (30, 0.08), (45, 0.012), (50, 0.023), (51, 0.178), (52, 0.338), (66, 0.027), (71, 0.047), (75, 0.025), (96, 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.76829928 28 emnlp-2013-Automated Essay Scoring by Maximizing Human-Machine Agreement
Author: Hongbo Chen ; Ben He
Abstract: Previous approaches for automated essay scoring (AES) learn a rating model by minimizing either the classification, regression, or pairwise classification loss, depending on the learning algorithm used. In this paper, we argue that the current AES systems can be further improved by taking into account the agreement between human and machine raters. To this end, we propose a rankbased approach that utilizes listwise learning to rank algorithms for learning a rating model, where the agreement between the human and machine raters is directly incorporated into the loss function. Various linguistic and statistical features are utilized to facilitate the learning algorithms. Experiments on the publicly available English essay dataset, Automated Student Assessment Prize (ASAP), show that our proposed approach outperforms the state-of-the-art algorithms, and achieves performance comparable to professional human raters, which suggests the effectiveness of our proposed method for automated essay scoring.
2 0.72899324 83 emnlp-2013-Exploring the Utility of Joint Morphological and Syntactic Learning from Child-directed Speech
Author: Stella Frank ; Frank Keller ; Sharon Goldwater
Abstract: Children learn various levels of linguistic structure concurrently, yet most existing models of language acquisition deal with only a single level of structure, implicitly assuming a sequential learning process. Developing models that learn multiple levels simultaneously can provide important insights into how these levels might interact synergistically during learning. Here, we present a model that jointly induces syntactic categories and morphological segmentations by combining two well-known models for the individual tasks. We test on child-directed utterances in English and Spanish and compare to single-task baselines. In the morphologically poorer language (English), the model improves morphological segmentation, while in the morphologically richer language (Spanish), it leads to better syntactic categorization. These results provide further evidence that joint learning is useful, but also suggest that the benefits may be different for typologically different languages.
Author: Katsuhito Sudoh ; Shinsuke Mori ; Masaaki Nagata
Abstract: This paper proposes a novel noise-aware character alignment method for bootstrapping statistical machine transliteration from automatically extracted phrase pairs. The model is an extension of a Bayesian many-to-many alignment method for distinguishing nontransliteration (noise) parts in phrase pairs. It worked effectively in the experiments of bootstrapping Japanese-to-English statistical machine transliteration in patent domain using patent bilingual corpora.
4 0.52889431 8 emnlp-2013-A Joint Learning Model of Word Segmentation, Lexical Acquisition, and Phonetic Variability
Author: Micha Elsner ; Sharon Goldwater ; Naomi Feldman ; Frank Wood
Abstract: We present a cognitive model of early lexical acquisition which jointly performs word segmentation and learns an explicit model of phonetic variation. We define the model as a Bayesian noisy channel; we sample segmentations and word forms simultaneously from the posterior, using beam sampling to control the size of the search space. Compared to a pipelined approach in which segmentation is performed first, our model is qualitatively more similar to human learners. On data with variable pronunciations, the pipelined approach learns to treat syllables or morphemes as words. In contrast, our joint model, like infant learners, tends to learn multiword collocations. We also conduct analyses of the phonetic variations that the model learns to accept and its patterns of word recognition errors, and relate these to developmental evidence.
5 0.52632433 132 emnlp-2013-Mining Scientific Terms and their Definitions: A Study of the ACL Anthology
Author: Yiping Jin ; Min-Yen Kan ; Jun-Ping Ng ; Xiangnan He
Abstract: This paper presents DefMiner, a supervised sequence labeling system that identifies scientific terms and their accompanying definitions. DefMiner achieves 85% F1 on a Wikipedia benchmark corpus, significantly improving the previous state-of-the-art by 8%. We exploit DefMiner to process the ACL Anthology Reference Corpus (ARC) – a large, real-world digital library of scientific articles in computational linguistics. The resulting automatically-acquired glossary represents the terminology defined over several thousand individual research articles. We highlight several interesting observations: more definitions are introduced for conference and workshop papers over the years, and multiword terms account for slightly less than half of all terms. Obtaining a list of popular, defined terms in a corpus of computational linguistics papers, we find that concepts can often be categorized into one of three categories: resources, methodologies and evaluation metrics.
6 0.52632171 48 emnlp-2013-Collective Personal Profile Summarization with Social Networks
7 0.52506417 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging
9 0.52415597 47 emnlp-2013-Collective Opinion Target Extraction in Chinese Microblogs
10 0.52413434 82 emnlp-2013-Exploring Representations from Unlabeled Data with Co-training for Chinese Word Segmentation
11 0.52269679 69 emnlp-2013-Efficient Collective Entity Linking with Stacking
12 0.52244508 21 emnlp-2013-An Empirical Study Of Semi-Supervised Chinese Word Segmentation Using Co-Training
13 0.52227324 179 emnlp-2013-Summarizing Complex Events: a Cross-Modal Solution of Storylines Extraction and Reconstruction
14 0.52217686 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization
15 0.52205086 79 emnlp-2013-Exploiting Multiple Sources for Open-Domain Hypernym Discovery
16 0.52171874 152 emnlp-2013-Predicting the Presence of Discourse Connectives
17 0.52139908 143 emnlp-2013-Open Domain Targeted Sentiment
18 0.5212419 114 emnlp-2013-Joint Learning and Inference for Grammatical Error Correction
19 0.52102745 154 emnlp-2013-Prior Disambiguation of Word Tensors for Constructing Sentence Vectors
20 0.52004296 77 emnlp-2013-Exploiting Domain Knowledge in Aspect Extraction