ACL 2011, paper 341: Word Maturity: Computational Modeling of Word Knowledge
Source: PDF
Authors: Kirill Kireyev; Thomas K Landauer
Abstract: While computational estimation of the difficulty of words in the lexicon is useful in many educational and assessment applications, the concept of scalar word difficulty and current corpus-based methods for its estimation are inadequate. We propose a new paradigm called word meaning maturity, which tracks the degree of knowledge of each word at different stages of language learning. We present a computational algorithm for estimating word maturity, based on modeling language acquisition with Latent Semantic Analysis. We demonstrate that the resulting metric not only correlates well with external indicators, but captures deeper semantic effects in language.

1 Motivation

It is no surprise that through the stages of language learning, different words are learned at different times and are known to different extents. For example, a common word like “dog” is familiar to even a first-grader, whereas a more advanced word like “focal” does not usually enter learners’ vocabulary until much later. Although individual rates of learning words may vary between high- and low-performing students, it has been observed that “children […] acquire word meanings in roughly the same sequence” (Biemiller, 2008). The aim of this work is to model the degree of knowledge of words at different learning stages. Such a metric would have extremely useful applications in personalized educational technologies, for the purposes of accurate assessment and personalized vocabulary instruction.

2 Rethinking Word Difficulty

Previously, related work in education and psychometrics has been concerned with measuring word difficulty or classifying words into different difficulty categories. Examples of such approaches include the creation of word lists for targeted vocabulary instruction at various grade levels, compiled by educational experts such as Nation (1993) or Biemiller (2008).
Such word difficulty assignments are also implicitly present in some readability formulas that estimate the difficulty of texts, such as Lexiles (Stenner, 1996), which include a lexical difficulty component based on the frequency of occurrence of words in a representative corpus, on the assumption that word difficulty is inversely correlated with corpus frequency. Additionally, research in psycholinguistics has attempted to outline and measure psycholinguistic dimensions of words, such as age-of-acquisition and familiarity, which aim to track when certain words become known and how familiar they appear to an average person. Importantly, all such word difficulty measures can be thought of as functions that assign a single scalar value to each word w:

difficulty(w) ∈ ℝ
Knowing a word is quite often not an either-or situation; some words are known well, some not at all, and some are known to varying degrees. […] How well a particular word is known may condition the connections made between that particular word and the other words in the mental lexicon. […] Thus, instead of modeling when a particular word will become fully known, it makes more sense to model the degree to which a word is known at different levels of language exposure.

Second, word difficulty is inherently perspectival: the degree of word understanding depends not only on the word itself, but also on the sophistication of a given learner. Consider again the difference between “dog” and “focal”: a typical first-grader will have much more difficulty understanding the latter word than the former, whereas a well-educated adult will be able to use both words with equal ease.

Therefore, the degree, or maturity, of word knowledge is inherently a function of two parameters, the word w and the learner level l:

maturity(w, l)

As the learner level l increases (i.e., for more advanced learners), we would expect the degree of understanding of word w to approach its full value, corresponding to perfect knowledge; this will happen at different rates for different words.
Ideally, we would obtain maturity values by testing the word knowledge of learners across different levels (ages or school grades) for all the words in the lexicon. Such a procedure, however, is prohibitively expensive; instead, we would like to estimate word maturity using computational models. To summarize: our aim is to model the development of the meaning of words as a function of increasing exposure to language, and ultimately the degree to which the meaning of words at each stage of exposure resembles their “adult” meaning. We therefore define word meaning maturity to be the degree to which the understanding of a word (expected for the average learner of a particular level) resembles that of an ideal mature learner.
3.1 Latent Semantic Analysis (LSA)

An appealing choice for quantitatively modeling word meanings and their growth over time is Latent Semantic Analysis (LSA), an unsupervised method for representing word and document meaning in a multi-dimensional vector space. A Singular Value Decomposition of the high-dimensional matrix of word/document occurrence counts (A) in the corpus, followed by zeroing all but the largest r elements of the diagonal matrix S, yields a lower-rank word-vector matrix (U). The resulting word vectors in U are positioned in such a way that semantically related word vectors point in similar directions or, equivalently, have higher cosine values between them.
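The truncated-SVD step can be sketched with NumPy on a toy count matrix. The tiny corpus, the choice of r, and the scaling of U by the singular values are illustrative, not the paper's exact configuration:

```python
import numpy as np

# Toy word/document occurrence-count matrix A (rows = words, cols = documents).
A = np.array([
    [2.0, 0.0, 1.0, 0.0],   # "dog"
    [1.0, 1.0, 0.0, 0.0],   # "cat"
    [0.0, 0.0, 1.0, 2.0],   # "focal"
])

# Full SVD: A = U S V^T.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the largest r singular values (zeroing the rest of S).
r = 2
word_vecs = U[:, :r] * s[:r]   # rows are the lower-rank LSA word vectors

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Semantically related words end up with a higher cosine in the reduced space.
sim_dog_cat = cosine(word_vecs[0], word_vecs[1])
sim_dog_focal = cosine(word_vecs[0], word_vecs[2])
```

In this toy matrix, "dog" and "cat" co-occur in the same documents while "focal" does not, so their rank-2 vectors point in more similar directions.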
(In our implementation, 300 dimensions are retained, and UΣ is used to project word vectors into V-space.) LSA models meaning with high accuracy, as attested, for example, by a 90% correlation with human judgments in assessing the quality of student essay content (Landauer, 2002).
3.2 Using LSA to Compute Word Maturity

In this work, the general procedure for computationally estimating the word maturity of a learner at a particular intermediate level (i.e., at a given stage of language exposure) is as follows. First, an LSA model is trained on a corpus corresponding to that level; this corpus approximates the amount and sophistication of language encountered by a learner at the given level. The resulting LSA word vectors model the meaning of each word to the particular intermediate-level learner. Second, the meaning representation of each word (its LSA vector) is compared to the corresponding one in a reference model. The reference model is trained on a much larger corpus and approximates the word meanings of a mature adult learner. The levels may directly correspond to school grades, learner ages or any other arbitrary gradations. A high discrepancy between the vectors suggests that the intermediate model’s meaning of a particular word is quite different from the reference meaning, and thus that the word maturity at the corresponding level is relatively low.
The result of the Procrustes Alignment of the two spaces is effectively a joint LSA space containing two distinct word vectors for each word (e.g., an intermediate-level vector v_lw and an adult vector v_aw). After merging using Procrustes Alignment, the comparison of word meanings becomes a simple problem of comparing word vectors in the joint space using the standard cosine metric.
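The alignment step can be sketched as the standard orthogonal Procrustes solution: the rotation R minimizing ||XR − Y|| over orthogonal matrices is UVᵀ, where UΣVᵀ is the SVD of XᵀY. The toy spaces below are synthetic and the function name is ours, not the authors':

```python
import numpy as np

def procrustes_rotation(X, Y):
    """Orthogonal matrix R minimizing ||X @ R - Y||_F,
    computed from the SVD of X^T Y (orthogonal Procrustes)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(0)

# Toy "adult" vectors for words shared between the two spaces.
adult = rng.normal(size=(50, 10))

# Simulate an intermediate space with the same geometry but a different rotation.
Q, _ = np.linalg.qr(rng.normal(size=(10, 10)))
intermediate = adult @ Q

R = procrustes_rotation(intermediate, adult)
aligned = intermediate @ R

# After alignment, the intermediate vectors land on the adult ones,
# so the two versions of each word live in one joint space.
err = np.linalg.norm(aligned - adult)
```

SciPy offers the same solver as `scipy.linalg.orthogonal_procrustes`; the NumPy version above is shown to keep the dependency footprint small.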
4 Implementation Details

In our experiments we used passages from the MetaMetrics Inc. corpus (2002), largely consisting of educational and literary content representative of the reading material used in American schools at different grade levels. The first-level intermediate corpus was composed of 6,000 text passages intended for school grade 1 or below. The grade level is approximated using the Coleman-Liau readability formula (Coleman, 1975), which estimates the US grade level necessary to comprehend a given text based on its average sentence- and word-length statistics:

CLI = 0.0588 · L − 0.296 · S − 15.8

where L is the average number of letters per 100 words and S is the average number of sentences per 100 words.
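The Coleman-Liau index can be computed directly from letter, word, and sentence counts. The tokenization below is a deliberate simplification (real implementations count letters and sentence boundaries more carefully):

```python
import re

def coleman_liau(text: str) -> float:
    """Coleman-Liau index: estimated US grade level of a text.
    L = letters per 100 words, S = sentences per 100 words."""
    words = re.findall(r"[A-Za-z]+", text)
    letters = sum(len(w) for w in words)
    # Crude sentence count: runs of terminal punctuation.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    L = 100.0 * letters / len(words)
    S = 100.0 * sentences / len(words)
    return 0.0588 * L - 0.296 * S - 15.8

easy = "The dog ran. The cat sat. We play all day."
hard = ("Computational estimation of lexical difficulty presupposes "
        "sophisticated distributional representations of meaning.")
```

Short words and short sentences drive the index down, so `easy` scores well below `hard`.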
Each subsequent intermediate corpus adds 6,000 new passages of the next grade level to the previous corpus, so the corpora are cumulative. The adult corpus is twice as large as, and of the same grade-level range (0–14) as, the largest intermediate corpus. Passages are ordered by difficulty, in order to mimic the way typical human learners encounter progressively more difficult materials at successive school grades.
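The cumulative corpus construction can be sketched as follows; the passage list and sizes are illustrative stand-ins for the 6,000-passage grade bands described above:

```python
# Hypothetical passage list: (grade_level, text) pairs, where the grade
# would in practice come from a readability formula such as Coleman-Liau.
passages = [
    (1, "the dog ran ..."),
    (1, "we play all day ..."),
    (2, "plants need light ..."),
    (3, "the focal point of a lens ..."),
]

def cumulative_corpora(passages, max_level):
    """Corpus for level l = all passages of grade <= l, so each
    intermediate corpus contains every earlier one."""
    corpora = {}
    for l in range(1, max_level + 1):
        corpora[l] = [text for grade, text in passages if grade <= l]
    return corpora

corpora = cumulative_corpora(passages, max_level=3)
```

Because each level's corpus is a superset of the previous one, the model at level l has "read" everything a lower-level learner has.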
The procedure is then:

1. Build LSA spaces on the adult corpus and on each of the intermediate corpora.

2. Merge the intermediate space for level l with the adult space, using Procrustes Alignment. This results in a joint space with two sets of vectors: the versions from the intermediate space {v_lw} and from the adult space {v_aw}.

3. Compute the cosine in the joint space between the two word vectors for the given word w:

maturity_l(w) = cos(v_lw, v_aw)    (5)

In the cases where a word w has not been encountered in a given intermediate space, or in the rare cases where the cosine value falls below 0, the word maturity value is set to 0. Hence, the range of the word maturity function is the closed interval [0, 1]. A higher cosine value means greater similarity in meaning between the reference and intermediate spaces, which implies a more mature meaning of word w at level l, i.e., a meaning closer to the adult one.
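The per-level maturity computation above, with the two special cases (unseen word, negative cosine) mapped to 0, can be sketched as a small function; the function name and vector representation are ours:

```python
import numpy as np

def word_maturity(v_int, v_adult):
    """Word maturity at one level: cosine between the intermediate and
    adult vectors in the joint space, clamped to [0, 1]. A word unseen
    at this level (vector None or all-zero) gets maturity 0."""
    if v_int is None or not np.any(v_int):
        return 0.0
    cos = float(v_int @ v_adult /
                (np.linalg.norm(v_int) * np.linalg.norm(v_adult)))
    return max(0.0, cos)

# A vector pointing the same way as its adult counterpart is fully mature;
# an unseen or opposing vector gets 0.
m_same = word_maturity(np.array([1.0, 0.0]), np.array([1.0, 0.0]))
m_unseen = word_maturity(None, np.array([1.0, 0.0]))
```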
The scores between discrete levels are interpolated, resulting in a continuous word maturity curve for each word. Consistent with intuition, simple words like “dog” approach their adult meaning rather quickly, while “focal” takes much longer to become known to any degree. Closer analysis of the corpus and of the semantic near-neighbor word vectors in each intermediate space shows that, for a homograph such as “turkey”, the earlier meanings deal almost exclusively with the first sense (the bird), while later readings involve the other (the country).
5.1 Time-to-maturity

Evaluating the word maturity metric against external data is not always straightforward because, to the best of our knowledge, data containing word-knowledge statistics at different learner levels does not exist. Instead, we often have to evaluate against external data consisting of scalar difficulty values (see Section 2 for discussion) for each word, such as the age-of-acquisition norms described in the following subsection. There are several ways to do this. One is to compute the word maturity at a particular level, obtaining a single number for each word. Another is to compute time-to-maturity: the minimum level (the value on the x-axis of the word maturity graph) at which the word maturity reaches a particular threshold α:

ttm(w) = min { l : maturity_l(w) ≥ α }

Since the values of word maturity are interpolated, ttm(w) can take on fractional values.
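Time-to-maturity over the interpolated curve can be sketched as follows; the per-level maturity values and the dense-grid interpolation are illustrative (solving for the linear crossing point in closed form would work equally well):

```python
import numpy as np

def time_to_maturity(levels, maturities, alpha):
    """Minimum (possibly fractional) level at which the linearly
    interpolated maturity curve first reaches threshold alpha."""
    fine = np.linspace(levels[0], levels[-1], 10_000)
    curve = np.interp(fine, levels, maturities)
    hits = np.nonzero(curve >= alpha)[0]
    return float(fine[hits[0]]) if hits.size else float("inf")

levels = [1, 2, 3, 4]
dog   = [0.6, 0.9, 0.95, 1.0]   # matures almost immediately
focal = [0.0, 0.1, 0.4, 0.8]    # crosses 0.5 between levels 3 and 4

ttm_dog = time_to_maturity(levels, dog, alpha=0.5)
ttm_focal = time_to_maturity(levels, focal, alpha=0.5)
```

Because the curve is interpolated, "focal" gets a fractional ttm of about 3.25 here, while "dog" is already above threshold at the first level.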
It should be emphasized that such a collapsing of word maturity into a scalar value inherently results in a loss of information; we only perform it in order to allow evaluation against external data sources. As a baseline for these experiments we include word frequency, namely the document frequency of words in the adult corpus. Age of Acquisition approximates the age at which a word is first learned and has been proposed as a significant contributor to language and memory processes. To verify how these word groups correspond to the word maturity metric, we assign the words in the four groups difficulty ratings of 1–4, respectively, and measure the correlation with time-to-maturity. The word maturity metric shows a higher correlation with the instruction word-list norms than does word frequency.
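The group-vs-ttm correlation can be sketched with a self-contained Spearman implementation (Pearson correlation on average ranks); the word groups and ttm values below are invented for illustration:

```python
import numpy as np

def rankdata(x):
    """Average ranks (1-based), with ties sharing the mean rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x))
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and x[order[j + 1]] == x[order[i]]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman rho = Pearson correlation of the rank vectors."""
    ra, rb = rankdata(a), rankdata(b)
    ra -= ra.mean()
    rb -= rb.mean()
    return float(ra @ rb / np.sqrt((ra @ ra) * (rb @ rb)))

# Hypothetical data: difficulty group (1-4) per word, and its ttm.
group = [1, 1, 2, 2, 3, 3, 4, 4]
ttm   = [1.2, 1.5, 2.9, 2.4, 5.0, 4.1, 7.7, 8.2]

rho = spearman(group, ttm)
```

`scipy.stats.spearmanr` computes the same statistic; the explicit version makes the tie-handling visible.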
5.4 Text Complexity

Another way our metric can be evaluated is by examining word maturity in texts that have been leveled, i.e., annotated with the school grade for which they are intended. Thus, the correlation between text difficulty and our word maturity metric can serve as another validation of the metric. The collection consists of 1,220 readings, each annotated with a US school grade level (in the range 3–12) for which the reading is intended. In this experiment we computed the correlation of the grade level with time-to-maturity and two other measures, namely:

• Time-to-maturity: the average time-to-maturity of the unique words in the text (excluding stopwords), at a fixed threshold α
6.1 Frequency-Based Maturity Baseline

We define the frequency-maturity for a particular word at a given level as the ratio of the number of occurrences in the intermediate corpus for that level (l) to the number of occurrences in the reference corpus (a):

freqmaturity_l(w) = count_l(w) / count_a(w)

As with the original LSA-based word maturity metric, this ratio increases from 0 to 1 for each word as the amount of cumulative language exposure increases. The corpora used at each intermediate level are identical to those of the original word maturity model, but instead of creating LSA spaces we simply use the corpora to compute word frequency.
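The frequency baseline can be sketched as a simple count ratio. The clamp to 1.0 is our assumption for words that happen to be more frequent in an intermediate corpus than in the adult one, and the toy counts are invented:

```python
def frequency_maturity(word, counts_by_level, counts_adult):
    """Baseline: ratio of the word's occurrence count in the cumulative
    intermediate corpus for level l to its count in the adult corpus,
    clamped to [0, 1]. Unseen-in-adult words get 0 at every level."""
    adult = counts_adult.get(word, 0)
    if adult == 0:
        return {l: 0.0 for l in counts_by_level}
    return {l: min(1.0, c.get(word, 0) / adult)
            for l, c in counts_by_level.items()}

# Toy cumulative counts per level, and adult-corpus counts.
counts_by_level = {1: {"dog": 40}, 2: {"dog": 80, "focal": 2}}
counts_adult = {"dog": 100, "focal": 20}

fm = frequency_maturity("focal", counts_by_level, counts_adult)
```

"focal" is unseen at level 1 and reaches only 2/20 = 0.1 at level 2, mirroring how the ratio grows with cumulative exposure.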
The following figure shows the Spearman correlations between the external measures used for the experiments in Section 5 and time-to-maturity computed from the two maturity metrics: the new frequency-based maturity and the original LSA-based word maturity. The results indicate that the original LSA-based word maturity correlates better with real-world data than a maturity metric based simply on frequency. Because of the conflation of several unrelated meanings into the same orthographic form, homographs implicitly contain more semantic content in a single word. (This is related to, but distinct from, merely polysemous words, which have several related meanings.) Therefore, one would expect the meaning of homographs to mature more slowly than would be predicted by frequency alone: all things being equal, a learner has to learn the meanings of all of the senses of a homograph before the word can be considered fully known. More specifically, one would expect the time-to-maturity of homographs to have greater values than that of words of similar frequency.
The size (and content) of the corpus used to train the reference model is potentially important, since it affects the word maturity calculations, which are comparisons of the intermediate LSA spaces to the reference LSA space built on this corpus. It is interesting to investigate how the word maturity model would be affected if the adult corpus were made significantly more sophisticated. If the word maturity metric were simply based on word frequency (including the frequency-based maturity baseline described in Section 6.1), one would expect the word maturity of words at each level to decrease significantly if the reference model were made significantly larger, since each intermediate level would have encountered comparatively fewer words. Therefore, if word maturity tracks something similar to real word knowledge, one would expect the word maturity of most words to plateau over time and subsequently not change significantly, no matter how sophisticated the reference model becomes.
To evaluate this, we created a reference corpus that is twice as large as before (four times as large as, and of the same difficulty range as, the corpus for the last intermediate level), containing roughly 329,000 passages. We computed the word maturity model using this larger reference corpus, while keeping all the original intermediate corpora of the same size and content. The results show that the average word maturity of words at the last intermediate level (14) decreases by less than 14% as a result of doubling the adult corpus. This relatively small difference, in spite of the twofold increase of the adult corpus, is consistent with the idea that word knowledge approaches a plateau, after which further exposure to language does little to change most word meanings.
6.4 Integration into Lexicon

Another important consideration with respect to word learning, mentioned in Wolter (2001), is the “connections made between [a] particular word and the other words in the mental lexicon.” One implication of this is that measuring word maturity must take into account the way words in the language are integrated with other words. One way to test this effect is to introduce readings where a large part of the important vocabulary is not well known to learners at a given level. This can be simulated in the word maturity model by rearranging the order of some of the training passages, introducing certain advanced passages at a very early level.
To test this effect, we first collected all passages in the training corpora of the intermediate models containing some advanced words from different topics, namely “chromosome”, “neutron” and “filibuster”, together with their plural variants. We changed the order of inclusion of these 89 passages into the intermediate models in each of the two following ways:

1. All the passages were introduced into the first-level (l=1) intermediate corpus.

2. All the passages were introduced into the last-level (l=14) intermediate corpus.

This resulted in two new variants of the word maturity model, computed in all the same ways as before, except that all 89 advanced passages were introduced either at the very first level or at the very last level. We then computed the word maturity at the levels at which they were introduced.
The hypothesis consistent with a meaning-based maturity method is that less learning (i.e., lower word maturity) of the relevant words will occur when the passages are introduced prematurely (at level 1). Table 5 shows the word maturities measured for each of those cases, at the level (1 or 14) at which all of the passages had been introduced. (Table 5: word maturity values when the relevant passages are introduced early vs. late.) Indeed, the results show lower word maturity values when the advanced passages are introduced too early, and higher ones when the passages are introduced at a later stage, when the rest of the supporting vocabulary is known.
7 Conclusion

We have introduced a new metric for estimating the degree of knowledge of words by learners at different levels. The implementation is based on unsupervised word-meaning acquisition from natural text, from corpora that resemble in volume and complexity the reading materials a typical human learner might encounter. The metric correlates better than word frequency with a range of external measures, including vocabulary word lists, psycholinguistic norms and leveled texts. Furthermore, we have shown that the metric is based on word meaning (to the extent that it can be approximated with LSA), and not merely on shallow measures like word frequency. Many interesting research questions remain, pertaining to the best ways to select and partition the training corpora, align the adult and intermediate LSA models, and correlate the results with real school grade levels, as well as to other free parameters of the model.
wordName wordTfidf (topN-words)
[('maturity', 0.78), ('adult', 0.215), ('intermediate', 0.201), ('lsa', 0.197), ('passages', 0.167), ('grade', 0.153), ('homographs', 0.091), ('meaning', 0.089), ('difficulty', 0.088), ('exposure', 0.088), ('procrustes', 0.088), ('learner', 0.081), ('meanings', 0.07), ('norms', 0.069), ('readings', 0.069), ('metric', 0.065), ('age', 0.061), ('landauer', 0.061), ('aoa', 0.059), ('biemiller', 0.059), ('level', 0.056), ('word', 0.053), ('vectors', 0.052), ('mature', 0.051), ('instruction', 0.051), ('merely', 0.051), ('spaces', 0.05), ('learners', 0.046), ('acquisition', 0.046), ('reference', 0.045), ('scalar', 0.045), ('leveled', 0.044), ('ocd', 0.044), ('rotation', 0.044), ('stenner', 0.044), ('degree', 0.043), ('matrix', 0.043), ('readability', 0.042), ('external', 0.041), ('focal', 0.039), ('multivariate', 0.037), ('sophistication', 0.036), ('plateau', 0.036), ('turkey', 0.036), ('levels', 0.036), ('approximates', 0.036), ('known', 0.035), ('vocabulary', 0.034), ('grades', 0.034), ('frequency', 0.034), ('expect', 0.033), ('children', 0.033), ('tracks', 0.032), ('personalized', 0.031), ('encounters', 0.031), ('school', 0.03), ('educational', 0.03), ('gilhooly', 0.029), ('kireyev', 0.029), ('krzanowski', 0.029), ('lexiles', 0.029), ('maturities', 0.029), ('metametrics', 0.029), ('rearranging', 0.029), ('wolter', 0.029), ('psycholinguistic', 0.029), ('encountered', 0.029), ('advanced', 0.029), ('resemble', 0.028), ('mental', 0.028), ('materials', 0.028), ('dog', 0.028), ('dimensions', 0.027), ('comprehension', 0.026), ('coleman', 0.026), ('emulate', 0.026), ('bristol', 0.026), ('reading', 0.025), ('cosine', 0.025), ('correlates', 0.024), ('introduced', 0.024), ('inherently', 0.024), ('latent', 0.023), ('space', 0.023), ('ages', 0.022), ('mimic', 0.022), ('davis', 0.022), ('schools', 0.022), ('familiarity', 0.022), ('correlation', 0.022), ('corpora', 0.022), ('measuring', 0.022), ('measures', 0.022), ('words', 0.022), ('thomas', 0.021), ('particular', 0.021), ('rates', 
0.02), ('assessment', 0.02), ('correlations', 0.02), ('judgments', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 341 acl-2011-Word Maturity: Computational Modeling of Word Knowledge
Author: Kirill Kireyev ; Thomas K Landauer
Abstract: While computational estimation of difficulty of words in the lexicon is useful in many educational and assessment applications, the concept of scalar word difficulty and current corpus-based methods for its estimation are inadequate. We propose a new paradigm called word meaning maturity which tracks the degree of knowledge of each word at different stages of language learning. We present a computational algorithm for estimating word maturity, based on modeling language acquisition with Latent Semantic Analysis. We demonstrate that the resulting metric not only correlates well with external indicators, but captures deeper semantic effects in language. 1 Motivation It is no surprise that through stages of language learning, different words are learned at different times and are known to different extents. For example, a common word like “dog” is familiar to even a first-grader, whereas a more advanced word like “focal” does not usually enter learners’ vocabulary until much later. Although individual rates of learning words may vary between highand low-performing students, it has been observed that “children [… ] acquire word meanings in roughly the same sequence” (Biemiller, 2008). The aim of this work is to model the degree of knowledge of words at different learning stages. Such a metric would have extremely useful applications in personalized educational technologies, for the purposes of accurate assessment and personalized vocabulary instruction. … 299 .l andaue r } @pear s on .com 2 Rethinking Word Difficulty Previously, related work in education and psychometrics has been concerned with measuring word difficulty or classifying words into different difficulty categories. Examples of such approaches include creation of word lists for targeted vocabulary instruction at various grade levels that were compiled by educational experts, such as Nation (1993) or Biemiller (2008). 
Such word difficulty assignments are also implicitly present in some readability formulas that estimate difficulty of texts, such as Lexiles (Stenner, 1996), which include a lexical difficulty component based on the frequency of occurrence of words in a representative corpus, on the assumption that word difficulty is inversely correlated to corpus frequency. Additionally, research in psycholinguistics has attempted to outline and measure psycholinguistic dimensions of words such as age-of-acquisition and familiarity, which aim to track when certain words become known and how familiar they appear to an average person. Importantly, all such word difficulty measures can be thought of as functions that assign a single scalar value to each word w: !
2 0.089200042 205 acl-2011-Learning to Grade Short Answer Questions using Semantic Similarity Measures and Dependency Graph Alignments
Author: Michael Mohler ; Razvan Bunescu ; Rada Mihalcea
Abstract: In this work we address the task of computerassisted assessment of short student answers. We combine several graph alignment features with lexical semantic similarity measures using machine learning techniques and show that the student answers can be more accurately graded than if the semantic measures were used in isolation. We also present a first attempt to align the dependency graphs of the student and the instructor answers in order to make use of a structural component in the automatic grading of student answers.
3 0.083864771 20 acl-2011-A New Dataset and Method for Automatically Grading ESOL Texts
Author: Helen Yannakoudakis ; Ted Briscoe ; Ben Medlock
Abstract: We demonstrate how supervised discriminative machine learning techniques can be used to automate the assessment of ‘English as a Second or Other Language’ (ESOL) examination scripts. In particular, we use rank preference learning to explicitly model the grade relationships between scripts. A number of different features are extracted and ablation tests are used to investigate their contribution to overall performance. A comparison between regression and rank preference models further supports our method. Experimental results on the first publically available dataset show that our system can achieve levels of performance close to the upper bound for the task, as defined by the agreement between human examiners on the same corpus. Finally, using a set of ‘outlier’ texts, we test the validity of our model and identify cases where the model’s scores diverge from that of a human examiner.
4 0.083790205 88 acl-2011-Creating a manually error-tagged and shallow-parsed learner corpus
Author: Ryo Nagata ; Edward Whittaker ; Vera Sheinman
Abstract: The availability of learner corpora, especially those which have been manually error-tagged or shallow-parsed, is still limited. This means that researchers do not have a common development and test set for natural language processing of learner English such as for grammatical error detection. Given this background, we created a novel learner corpus that was manually error-tagged and shallowparsed. This corpus is available for research and educational purposes on the web. In this paper, we describe it in detail together with its data-collection method and annotation schemes. Another contribution of this paper is that we take the first step toward evaluating the performance of existing POStagging/chunking techniques on learner corpora using the created corpus. These contributions will facilitate further research in related areas such as grammatical error detection and automated essay scoring.
5 0.065265052 248 acl-2011-Predicting Clicks in a Vocabulary Learning System
Author: Aaron Michelony
Abstract: We consider the problem of predicting which words a student will click in a vocabulary learning system. Often a language learner will find value in the ability to look up the meaning of an unknown word while reading an electronic document by clicking the word. Highlighting words likely to be unknown to a readeris attractive due to drawing his orher attention to it and indicating that information is available. However, this option is usually done manually in vocabulary systems and online encyclopedias such as Wikipedia. Furthurmore, it is never on a per-user basis. This paper presents an automated way of highlighting words likely to be unknown to the specific user. We present related work in search engine ranking, a description of the study used to collect click data, the experiment we performed using the random forest machine learning algorithm and finish with a discussion of future work.
6 0.064474456 204 acl-2011-Learning Word Vectors for Sentiment Analysis
7 0.061758704 2 acl-2011-AM-FM: A Semantic Framework for Translation Quality Assessment
8 0.061531745 307 acl-2011-Towards Tracking Semantic Change by Visual Analytics
9 0.056677856 198 acl-2011-Latent Semantic Word Sense Induction and Disambiguation
10 0.053248186 31 acl-2011-Age Prediction in Blogs: A Study of Style, Content, and Online Behavior in Pre- and Post-Social Media Generations
11 0.046720844 216 acl-2011-MEANT: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility based on semantic roles
12 0.041054834 147 acl-2011-Grammatical Error Correction with Alternating Structure Optimization
13 0.037971415 108 acl-2011-EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar
14 0.035531927 77 acl-2011-Computing and Evaluating Syntactic Complexity Features for Automated Scoring of Spontaneous Non-Native Speech
15 0.034522615 161 acl-2011-Identifying Word Translations from Comparable Corpora Using Latent Topic Models
16 0.033540457 254 acl-2011-Putting it Simply: a Context-Aware Approach to Lexical Simplification
17 0.032479785 280 acl-2011-Sentence Ordering Driven by Local and Global Coherence for Summary Generation
18 0.032341201 18 acl-2011-A Latent Topic Extracting Method based on Events in a Document and its Application
19 0.031923868 79 acl-2011-Confidence Driven Unsupervised Semantic Parsing
20 0.031904172 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents
topicId topicWeight
[(0, 0.099), (1, 0.02), (2, -0.014), (3, 0.035), (4, -0.025), (5, -0.005), (6, 0.024), (7, 0.023), (8, -0.001), (9, -0.015), (10, -0.036), (11, -0.04), (12, 0.01), (13, 0.034), (14, -0.034), (15, 0.03), (16, 0.022), (17, 0.003), (18, -0.026), (19, -0.018), (20, 0.078), (21, -0.001), (22, -0.048), (23, 0.002), (24, -0.057), (25, -0.02), (26, -0.021), (27, -0.006), (28, -0.001), (29, -0.006), (30, -0.039), (31, 0.015), (32, -0.025), (33, 0.063), (34, -0.018), (35, -0.012), (36, -0.037), (37, 0.026), (38, 0.008), (39, 0.095), (40, -0.001), (41, -0.05), (42, 0.009), (43, -0.015), (44, 0.098), (45, -0.001), (46, 0.085), (47, 0.078), (48, 0.016), (49, 0.012)]
simIndex simValue paperId paperTitle
same-paper 1 0.84475863 341 acl-2011-Word Maturity: Computational Modeling of Word Knowledge
Author: Kirill Kireyev ; Thomas K Landauer
Abstract: While computational estimation of difficulty of words in the lexicon is useful in many educational and assessment applications, the concept of scalar word difficulty and current corpus-based methods for its estimation are inadequate. We propose a new paradigm called word meaning maturity which tracks the degree of knowledge of each word at different stages of language learning. We present a computational algorithm for estimating word maturity, based on modeling language acquisition with Latent Semantic Analysis. We demonstrate that the resulting metric not only correlates well with external indicators, but captures deeper semantic effects in language. 1 Motivation It is no surprise that through stages of language learning, different words are learned at different times and are known to different extents. For example, a common word like “dog” is familiar to even a first-grader, whereas a more advanced word like “focal” does not usually enter learners’ vocabulary until much later. Although individual rates of learning words may vary between high- and low-performing students, it has been observed that “children […] acquire word meanings in roughly the same sequence” (Biemiller, 2008). The aim of this work is to model the degree of knowledge of words at different learning stages. Such a metric would have extremely useful applications in personalized educational technologies, for the purposes of accurate assessment and personalized vocabulary instruction. 2 Rethinking Word Difficulty Previously, related work in education and psychometrics has been concerned with measuring word difficulty or classifying words into different difficulty categories. Examples of such approaches include creation of word lists for targeted vocabulary instruction at various grade levels that were compiled by educational experts, such as Nation (1993) or Biemiller (2008).
Such word difficulty assignments are also implicitly present in some readability formulas that estimate difficulty of texts, such as Lexiles (Stenner, 1996), which include a lexical difficulty component based on the frequency of occurrence of words in a representative corpus, on the assumption that word difficulty is inversely correlated with corpus frequency. Additionally, research in psycholinguistics has attempted to outline and measure psycholinguistic dimensions of words such as age-of-acquisition and familiarity, which aim to track when certain words become known and how familiar they appear to an average person. Importantly, all such word difficulty measures can be thought of as functions that assign a single scalar value to each word w, i.e. difficulty(w) ∈ ℝ.
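The scalar view of difficulty, with the inverse-frequency assumption used by Lexile-style formulas, can be sketched in a few lines. The corpus counts and the log transform below are hypothetical illustrations, not the actual Lexile formula:

```python
import math

# Hypothetical corpus frequencies (occurrences per million words).
corpus_freq = {"dog": 520.0, "house": 310.0, "focal": 4.2, "ubiquitous": 1.1}

def difficulty(word, freq=corpus_freq):
    """Scalar difficulty score: rarer words score higher.

    Implements only the inverse-frequency assumption from the text;
    the log transform is an illustrative choice, not the Lexile formula.
    """
    f = freq.get(word, 0.1)  # back off to a small count for unseen words
    return -math.log10(f / 1_000_000)
```

On this toy data, `difficulty("dog")` is lower than `difficulty("focal")`, matching the intuition that common words are "easier"; any unseen word scores highest of all via the back-off count.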
2 0.76398855 20 acl-2011-A New Dataset and Method for Automatically Grading ESOL Texts
Author: Helen Yannakoudakis ; Ted Briscoe ; Ben Medlock
Abstract: We demonstrate how supervised discriminative machine learning techniques can be used to automate the assessment of ‘English as a Second or Other Language’ (ESOL) examination scripts. In particular, we use rank preference learning to explicitly model the grade relationships between scripts. A number of different features are extracted and ablation tests are used to investigate their contribution to overall performance. A comparison between regression and rank preference models further supports our method. Experimental results on the first publicly available dataset show that our system can achieve levels of performance close to the upper bound for the task, as defined by the agreement between human examiners on the same corpus. Finally, using a set of ‘outlier’ texts, we test the validity of our model and identify cases where the model’s scores diverge from those of a human examiner.
3 0.70309138 99 acl-2011-Discrete vs. Continuous Rating Scales for Language Evaluation in NLP
Author: Anja Belz ; Eric Kow
Abstract: Studies assessing rating scales are very common in psychology and related fields, but are rare in NLP. In this paper we assess discrete and continuous scales used for measuring quality assessments of computergenerated language. We conducted six separate experiments designed to investigate the validity, reliability, stability, interchangeability and sensitivity of discrete vs. continuous scales. We show that continuous scales are viable for use in language evaluation, and offer distinct advantages over discrete scales. 1 Background and Introduction Rating scales have been used for measuring human perception of various stimuli for a long time, at least since the early 20th century (Freyd, 1923). First used in psychology and psychophysics, they are now also common in a variety of other disciplines, including NLP. Discrete scales are the only type of scale commonly used for qualitative assessments of computer-generated language in NLP (e.g. in the DUC/TAC evaluation competitions). Continuous scales are commonly used in psychology and related fields, but are virtually unknown in NLP. While studies assessing the quality of individual scales and comparing different types of rating scales are common in psychology and related fields, such studies hardly exist in NLP, and so at present little is known about whether discrete scales are a suitable rating tool for NLP evaluation tasks, or whether continuous scales might provide a better alternative. A range of studies from sociology, psychophysiology, biometrics and other fields have compared discrete and continuous scales. Results tend to differ for different types of data. E.g., results from pain measurement show a continuous scale to outperform a discrete scale (ten Klooster et al., 2006). Other results (Svensson, 2000) from measuring students’ ease of following lectures show a discrete scale to outperform a continuous scale. When measuring dyspnea, Lansing et al. 
(2003) found a hybrid scale to perform on a par with a discrete scale. Another consideration is the types of data produced by discrete and continuous scales. Parametric methods of statistical analysis, which are far more sensitive than non-parametric ones, are commonly applied to both discrete and continuous data. However, parametric methods make very strong assumptions about data, including that it is numerical and normally distributed (Siegel, 1957). If these assumptions are violated, then the significance of results is overestimated. Clearly, the numerical assumption does not hold for the categorical data produced by discrete scales, and it is unlikely to be normally distributed. Many researchers are happier to apply parametric methods to data from continuous scales, and some simply take it as read that such data is normally distributed (Lansing et al., 2003). Our aim in the present study was to systematically assess and compare discrete and continuous scales when used for the qualitative assessment of computer-generated language. We start with an overview of assessment scale types (Section 2). We describe the experiments we conducted (Section 4), the data we used in them (Section 3), and the properties we examined in our inter-scale comparisons (Section 5), before presenting our results Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: short papers, pages 230–235, Portland, Oregon, June 19–24, 2011. ©2011 Association for Computational Linguistics. Q1: Grammaticality The summary should have no datelines, system-internal formatting, capitalization errors or obviously ungrammatical sentences (e.g., fragments, missing components) that make the text difficult to read. 1. Very Poor 2. Poor 3. Barely Acceptable 4. Good 5. Very Good Figure 1: Evaluation of Readability in DUC’06, comprising 5 evaluation criteria, including Grammaticality. 
Evaluation task for each summary text: evaluator selects one of the options (1–5) to represent quality of the summary in terms of the criterion. (Section 6), and some conclusions (Section 7). 2 Rating Scales With Verbal Descriptor Scales (VDSs), participants give responses on ordered lists of verbally described and/or numerically labelled response categories, typically varying in number from 2 to 11 (Svensson, 2000). An example of a VDS used in NLP is shown in Figure 1. VDSs are used very widely in contexts where computationally generated language is evaluated, including in dialogue, summarisation, MT and data-to-text generation. Visual analogue scales (VASs) are far less common outside psychology and related areas than VDSs. Responses are given by selecting a point on a typically horizontal line (although vertical lines have also been used (Scott and Huskisson, 2003)), on which the two end points represent the extreme values of the variable to be measured. Such lines can be mono-polar or bi-polar, and the end points are labelled with an image (smiling/frowning face), or a brief verbal descriptor, to indicate which end of the line corresponds to which extreme of the variable. The labels are commonly chosen to represent a point beyond any response actually likely to be chosen by raters. There is only one example of a VAS in NLP system evaluation that we are aware of (Gatt et al., 2009). Hybrid scales, known as graphic rating scales, combine the features of VDSs and VASs, and are also used in psychology. Here, the verbal descriptors are aligned along the line of a VAS and the endpoints are typically unmarked (Svensson, 2000). We are aware of one example in NLP (Williams and Reiter, 2008); Q1: Grammaticality The summary should have no datelines, system-internal formatting, capitalization errors or obviously ungrammatical sentences (e.g., fragments, missing components) that make the text difficult to read. 
[VAS line with end points labelled “extremely bad” and “excellent”] Figure 2: Evaluation of Grammaticality with alternative VAS scale (cf. Figure 1). Evaluation task for each summary text: evaluator selects a place on the line to represent quality of the summary in terms of the criterion. we did not investigate this scale in our study. We used the following two specific scale designs in our experiments: VDS-7: 7 response categories, numbered (7 = best) and verbally described (e.g. 7 = “perfectly fluent” for Fluency, and 7 = “perfectly clear” for Clarity). Response categories were presented in a vertical list, with the best category at the bottom. Each category had a tick-box placed next to it; the rater’s task was to tick the box by their chosen rating. VAS: a horizontal, bi-polar line, with no ticks on it, mapping to 0–100. In the image description tests, statements identified the left end as negative, the right end as positive; in the weather forecast tests, the positive end had a smiling face and the label “statement couldn’t be clearer/read better”; the negative end had a frowning face and the label “statement couldn’t be more unclear/read worse”. The raters’ task was to move a pointer (initially in the middle of the line) to the place corresponding to their rating. 3 Data Weather forecast texts: In one half of our evaluation experiments we used human-written and automatically generated weather forecasts for the same weather data. The data in our evaluations was for 22 different forecast dates and included outputs from 10 generator systems and one set of human forecasts. This data has also been used for comparative system evaluation in previous research (Langner, 2010; Angeli et al., 2010; Belz and Kow, 2009). 
The following are examples of weather forecast texts from the data: 1: SSE 28-32 INCREASING 36-40 BY MID AFTERNOON 2: S’LY 26-32 BACKING SSE 30-35 BY AFTERNOON INCREASING 35-40 GUSTS 50 BY MID EVENING Image descriptions: In the other half of our evaluations, we used human-written and automatically generated image descriptions for the same images. The data in our evaluations was for 112 different image sets and included outputs from 6 generator systems and 2 sets of human-authored descriptions. This data was originally created in the TUNA Project (van Deemter et al., 2006). The following is an example of an item from the corpus, consisting of a set of images and a description for the entity in the red frame: the small blue fan 4 Experimental Set-up 4.1 Evaluation criteria Fluency/Readability: Both the weather forecast and image description evaluation experiments used a quality criterion intended to capture ‘how well a piece of text reads’, called Fluency in the latter, Readability in the former. Adequacy/Clarity: In the image description experiments, the second quality criterion was Adequacy, explained as “how clear the description is”, and “how easy it would be to identify the image from the description”. This criterion was called Clarity in the weather forecast experiments, explained as “how easy is it to understand what is being described”. 4.2 Raters In the image experiments we used 8 raters (native speakers) in each experiment, from cohorts of 3rd-year undergraduate and postgraduate students doing a degree in a linguistics-related subject. They were paid and spent about 1 hour doing the experiment. In the weather forecast experiments, we used 22 raters in each experiment, from among academic staff at our own university. They were not paid and spent about 15 minutes doing the experiment. 
4.3 Summary overview of experiments Weather VDS-7 (A): VDS-7 scale; weather forecast data; criteria: Readability and Clarity; 22 raters (university staff) each assessing 22 forecasts. Weather VDS-7 (B): exact repeat of Weather VDS-7 (A), including same raters. Weather VAS: VAS scale; 22 raters (university staff), no overlap with raters in Weather VDS-7 experiments; other details same as in Weather VDS-7. Image VDS-7: VDS-7 scale; image description data; 8 raters (linguistics students) each rating 112 descriptions; criteria: Fluency and Adequacy. Image VAS (A): VAS scale; 8 raters (linguistics students), no overlap with raters in Image VDS-7; other details same as in Image VDS-7 experiment. Image VAS (B): exact repeat of Image VAS (A), including same raters. 4.4 Design features common to all experiments In all our experiments we used a Repeated Latin Squares design to ensure that each rater sees the same number of outputs from each system and for each text type (forecast date/image set). Following detailed instructions, raters first did a small number of practice examples, followed by the texts to be rated, in an order randomised for each rater. Evaluations were carried out via a web interface. They were allowed to interrupt the experiment, and in the case of the 1-hour-long image description evaluation they were encouraged to take breaks. 5 Comparison and Assessment of Scales Validity is the extent to which an assessment method measures what it is intended to measure (Svensson, 2000). Validity is often impossible to assess objectively, as is the case for all our criteria except Adequacy, the validity of which we can directly test by looking at correlations with the accuracy with which participants in a separate experiment identify the intended images given their descriptions. A standard method for assessing Reliability is Kendall’s W, a coefficient of concordance, measuring the degree to which different raters agree in their ratings. 
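Kendall's W, used here to measure Reliability, can be computed directly from a raters-by-items matrix of scores. A minimal dependency-free sketch (the ratings are invented; no tie-correction term is applied):

```python
# Hypothetical ratings: 3 raters each scoring 5 texts (higher = better).
ratings = [
    [4, 1, 3, 5, 2],   # rater 1
    [5, 2, 3, 4, 1],   # rater 2
    [4, 1, 2, 5, 3],   # rater 3
]

def ranks(scores):
    """Average ranks (1 = lowest score), with ties sharing their mean rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    r = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1            # mean rank of the tied block
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def kendalls_w(ratings):
    """W = 12*S / (m^2 * (n^3 - n)) over rank sums; 1 = perfect agreement."""
    m, n = len(ratings), len(ratings[0])
    rank_rows = [ranks(row) for row in ratings]
    rank_sums = [sum(row[i] for row in rank_rows) for i in range(n)]
    mean = sum(rank_sums) / n
    s = sum((rs - mean) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))
```

On the toy matrix above, `kendalls_w(ratings)` is about 0.84, i.e. the three hypothetical raters rank the texts very similarly.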
We report W for all 6 experiments. Stability refers to the extent to which the results of an experiment run on one occasion agree with the results of the same experiment (with the same raters) run on a different occasion. In the present study, we assess stability in an intra-rater, test-retest design, assessing the agreement between the same participant’s responses in the first and second runs of the test with Pearson’s product-moment correlation coefficient. We report these measures between ratings given in Image VAS (A) vs. those given in Image VAS (B), and between ratings given in Weather VDS-7 (A) vs. those given in Weather VDS-7 (B). We assess Interchangeability, that is, the extent to which our VDS and VAS scales agree, by computing Pearson’s and Spearman’s coefficients between results. We report these measures for all pairs of weather forecast/image description evaluations. We assess the Sensitivity of our scales by determining the number of significant differences between different systems and human authors detected by each scale. We also look at the relative effect of the different experimental factors by computing the F-Ratio for System (the main factor under investigation, so its relative effect should be high), Rater and Text Type (their effect should be low). F-ratios were determined by a one-way ANOVA with the evaluation criterion in question as the dependent variable and System, Rater or Text Type as grouping factors. 6 Results 6.1 Interchangeability and Reliability for system/human authored image descriptions Interchangeability: Pearson’s r between the means per system/human in the three image description evaluation experiments was as follows (Spearman’s ρ shown in brackets): [correlation table garbled in extraction]. For both Adequacy and Fluency, correlations between Image VDS-7 and Image VAS (A) (the main VAS experiment) are extremely high, meaning that they could substitute for each other here. 
Reliability: Inter-rater agreement in terms of Kendall’s W in each of the experiments: [Kendall’s W table garbled in extraction]. W was higher in the VAS data in the case of Fluency, whereas for Adequacy, W was the same for the VDS data and VAS (B), and higher in the VDS data than in the VAS (A) data. 6.2 Interchangeability and Reliability for system/human authored weather forecasts Interchangeability: The correlation coefficients (Pearson’s r with Spearman’s ρ in brackets) between the means per system/human in the weather forecast experiments were as follows: [correlation table garbled in extraction]. For both Clarity and Readability, correlations between Weather VDS-7 (A) (the main VDS-7 experiment) and Weather VAS (A) are again very high, although rank-correlation is somewhat lower. Reliability: Inter-rater agreement in terms of Kendall’s W was as follows: [Kendall’s W table garbled in extraction]. This time the highest agreement for both Clarity and Readability was in the VDS-7 data. 6.3 Stability tests for image and weather data Pearson’s r between ratings given by the same raters first in Image VAS (A) and then in Image VAS (B) was .666 for Adequacy, .593 for Fluency. Between ratings given by the same raters first in Weather VDS-7 (A) and then in Weather VDS-7 (B), Pearson’s r was .656 for Clarity, .704 for Readability. (All significant at p < .01.) Note that these are computed on individual scores (rather than means as in the correlation figures given in previous sections). 6.4 F-ratios and post-hoc analysis for image data The table below shows F-ratios determined by a one-way ANOVA with the evaluation criterion in question (Adequacy/Fluency) as the dependent variable and System/Rater/Text Type as the grouping factor. Note that for System a high F-ratio is desirable, but a low F-ratio is desirable for other factors. 
[F-ratio table garbled in extraction] In terms of significant pairwise differences detected for System, the main factor under investigation, VDS-7 found 8 for Adequacy and 14 for Fluency; VAS (A) found 7 for Adequacy and 15 for Fluency. 6.5 F-ratios and post-hoc analysis for weather data The table below shows F-ratios analogous to the previous section (for Clarity/Readability). [F-ratio table garbled in extraction] For System, VDS-7 (A) found 24 for Clarity, 23 for Readability; VAS found 25 for Clarity, 26 for Readability. 6.6 Scale validity test for image data Our final table of results shows Pearson’s correlation coefficients (calculated on means per system) between the Adequacy data from the three image description evaluation experiments on the one hand, and the data from an extrinsic experiment in which we measured the accuracy with which participants identified the intended image described by a description: [correlation table garbled in extraction]. The correlation between Adequacy and ID Accuracy was strong and highly significant in all three image description evaluation experiments, but strongest in VAS (B), and weakest in VAS (A). For comparison, Pearson’s r between Fluency and ID Accuracy ranged between .3 and .5, whereas Pearson’s r between Adequacy and ID Speed (also measured in the same image identification experiment) ranged between -.35 and -.29. 7 Discussion and Conclusions Our interchangeability results (Sections 6.1 and 6.2) indicate that the VAS and VDS-7 scales we have tested can substitute for each other in our present evaluation tasks in terms of the mean system scores they produce. Where we were able to measure validity (Section 6.6), both scales were shown to be similarly valid, predicting image identification accuracy figures from a separate experiment equally well. Stability (Section 6.3) was marginally better for VDS-7 data, and Reliability (Sections 6.1 and 6.2) was better for VAS data in the image description evaluations, but (mostly) better for VDS-7 data in the weather forecast evaluations. 
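The test-retest stability and interchangeability coefficients reported in the results are plain Pearson product-moment correlations. A dependency-free sketch (the two runs of scores are invented for illustration, not taken from the experiments):

```python
def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Same rater, same items, two runs of the experiment (hypothetical VAS scores).
run_a = [62, 45, 80, 55, 71, 38]
run_b = [58, 49, 84, 50, 69, 41]
stability = pearson_r(run_a, run_b)
```

A value near 1 would indicate that the rater reproduced their judgments closely on the second run; the paper's test-retest figures (.593–.704) were computed this way on individual scores.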
Finally, the VAS experiments found greater numbers of statistically significant differences between systems in 3 out of 4 cases (Section 6.5). Our own raters strongly prefer working with VAS scales over VDSs. This has also long been clear from the psychology literature (Svensson, 2000), where raters are typically found to prefer VAS scales over VDSs, which can be a “constant source of vexation to the conscientious rater when he finds his judgments falling between the defined points” (Champney, 1941). Moreover, if a rater’s judgment falls between two points on a VDS then they must make the false choice between the two points just above and just below their actual judgment. In this case we know that the point they end up selecting is not an accurate measure of their judgment but rather just one of two equally accurate ones (one of which goes unrecorded). Our results establish (for our evaluation tasks) that VAS scales, so far unproven for use in NLP, are at least as good as VDSs, currently virtually the only scale in use in NLP. Combined with the fact that raters strongly prefer VASs and that they are regarded as more amenable to parametric means of statistical analysis, this indicates that VAS scales should be used more widely for NLP evaluation tasks. References Gabor Angeli, Percy Liang, and Dan Klein. 2010. A simple domain-independent probabilistic approach to generation. In Proceedings of the 15th Conference on Empirical Methods in Natural Language Processing (EMNLP’10). Anja Belz and Eric Kow. 2009. System building cost vs. output quality in data-to-text generation. In Proceedings of the 12th European Workshop on Natural Language Generation, pages 16–24. H. Champney. 1941. The measurement of parent behavior. Child Development, 12(2):131. M. Freyd. 1923. The graphic rating scale. Biometrical Journal, 42:83–102. A. Gatt, A. Belz, and E. Kow. 2009. The TUNA Challenge 2009: Overview and evaluation results. 
In Proceedings of the 12th European Workshop on Natural Language Generation (ENLG’09), pages 198–206. Brian Langner. 2010. Data-driven Natural Language Generation: Making Machines Talk Like Humans Using Natural Corpora. Ph.D. thesis, Language Technologies Institute, School of Computer Science, Carnegie Mellon University. Robert W. Lansing, Shakeeb H. Moosavi, and Robert B. Banzett. 2003. Measurement of dyspnea: word labeled visual analog scale vs. verbal ordinal scale. Respiratory Physiology & Neurobiology, 134(2):77–83. J. Scott and E. C. Huskisson. 2003. Vertical or horizontal visual analogue scales. Annals of the Rheumatic Diseases, 38:560. Sidney Siegel. 1957. Non-parametric statistics. The American Statistician, 11(3):13–19. Elisabeth Svensson. 2000. Comparison of the quality of assessments using continuous and discrete ordinal rating scales. Biometrical Journal, 42(4):417–434. P. M. ten Klooster, A. P. Klaar, E. Taal, R. E. Gheith, J. J. Rasker, A. K. El-Garf, and M. A. van de Laar. 2006. The validity and reliability of the graphic rating scale and verbal rating scale for measuring pain across cultures: A study in Egyptian and Dutch women with rheumatoid arthritis. The Clinical Journal of Pain, 22(9):827–30. Kees van Deemter, Ielka van der Sluis, and Albert Gatt. 2006. Building a semantically transparent corpus for the generation of referring expressions. In Proceedings of the 4th International Conference on Natural Language Generation, pages 130–132, Sydney, Australia, July. S. Williams and E. Reiter. 2008. Generating basic skills reports for low-skilled readers. Natural Language Engineering, 14(4):495–525.
4 0.68464899 248 acl-2011-Predicting Clicks in a Vocabulary Learning System
Author: Aaron Michelony
Abstract: We consider the problem of predicting which words a student will click in a vocabulary learning system. Often a language learner will find value in the ability to look up the meaning of an unknown word while reading an electronic document by clicking the word. Highlighting words likely to be unknown to a reader is attractive because it draws his or her attention to them and indicates that information is available. However, such highlighting is usually done manually in vocabulary systems and online encyclopedias such as Wikipedia. Furthermore, it is never on a per-user basis. This paper presents an automated way of highlighting words likely to be unknown to the specific user. We present related work in search engine ranking, a description of the study used to collect click data, and the experiment we performed using the random forest machine learning algorithm, and finish with a discussion of future work.
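The random-forest setup the abstract describes could be sketched with scikit-learn, assuming it is available; the per-word features and click labels below are invented for illustration, not taken from the study:

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical per-word features: [corpus log-frequency, word length,
# times this user clicked the word before]; label 1 = word was clicked.
X = [
    [5.1, 3, 0], [1.2, 9, 2], [4.8, 4, 0], [0.9, 11, 1],
    [5.5, 3, 0], [1.5, 8, 3], [4.2, 5, 0], [0.7, 10, 2],
]
y = [0, 1, 0, 1, 0, 1, 0, 1]

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)

# Rank unseen words by predicted click probability; highlight the top ones.
candidates = [[1.0, 9, 1],   # rare, long word this user has clicked before
              [5.0, 4, 0]]   # common short word
probs = clf.predict_proba(candidates)[:, 1]
```

On this toy data the rare, long word gets a higher click probability than the common one, which is the behavior a highlighting system would act on.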
Author: Michael Mohler ; Razvan Bunescu ; Rada Mihalcea
Abstract: In this work we address the task of computer-assisted assessment of short student answers. We combine several graph alignment features with lexical semantic similarity measures using machine learning techniques and show that the student answers can be more accurately graded than if the semantic measures were used in isolation. We also present a first attempt to align the dependency graphs of the student and the instructor answers in order to make use of a structural component in the automatic grading of student answers.
6 0.6390205 55 acl-2011-Automatically Predicting Peer-Review Helpfulness
7 0.58011782 97 acl-2011-Discovering Sociolinguistic Associations with Structured Sparsity
9 0.56740308 120 acl-2011-Even the Abstract have Color: Consensus in Word-Colour Associations
10 0.55532908 319 acl-2011-Unsupervised Decomposition of a Document into Authorial Components
11 0.55195224 212 acl-2011-Local Histograms of Character N-grams for Authorship Attribution
12 0.4992964 125 acl-2011-Exploiting Readymades in Linguistic Creativity: A System Demonstration of the Jigsaw Bard
13 0.49772504 31 acl-2011-Age Prediction in Blogs: A Study of Style, Content, and Online Behavior in Pre- and Post-Social Media Generations
14 0.49565983 306 acl-2011-Towards Style Transformation from Written-Style to Audio-Style
15 0.47374907 102 acl-2011-Does Size Matter - How Much Data is Required to Train a REG Algorithm?
16 0.47186756 47 acl-2011-Automatic Assessment of Coverage Quality in Intelligence Reports
17 0.46315208 80 acl-2011-ConsentCanvas: Automatic Texturing for Improved Readability in End-User License Agreements
18 0.46041393 74 acl-2011-Combining Indicators of Allophony
19 0.44606417 35 acl-2011-An ERP-based Brain-Computer Interface for text entry using Rapid Serial Visual Presentation and Language Modeling
20 0.43820351 249 acl-2011-Predicting Relative Prominence in Noun-Noun Compounds
topicId topicWeight
[(1, 0.015), (5, 0.032), (17, 0.035), (26, 0.027), (37, 0.063), (39, 0.026), (41, 0.056), (55, 0.016), (59, 0.037), (72, 0.041), (91, 0.046), (96, 0.459), (97, 0.011), (98, 0.028)]
simIndex simValue paperId paperTitle
1 0.99350083 270 acl-2011-SciSumm: A Multi-Document Summarization System for Scientific Articles
Author: Nitin Agarwal ; Ravi Shankar Reddy ; Kiran GVR ; Carolyn Penstein Rose
Abstract: In this demo, we present SciSumm, an interactive multi-document summarization system for scientific articles. The document collection to be summarized is a list of papers cited together within the same source article, otherwise known as a co-citation. At the heart of the approach is a topic based clustering of fragments extracted from each article based on queries generated from the context surrounding the co-cited list of papers. This analysis enables the generation of an overview of common themes from the co-cited papers that relate to the context in which the co-citation was found. SciSumm is currently built over the 2008 ACL Anthology, however the generalizable nature of the summarization techniques and the extensible architecture makes it possible to use the system with other corpora where a citation network is available. Evaluation results on the same corpus demonstrate that our system performs better than an existing widely used multi-document summarization system (MEAD).
same-paper 2 0.99163359 341 acl-2011-Word Maturity: Computational Modeling of Word Knowledge
Author: Kirill Kireyev ; Thomas K Landauer
Abstract: While computational estimation of difficulty of words in the lexicon is useful in many educational and assessment applications, the concept of scalar word difficulty and current corpus-based methods for its estimation are inadequate. We propose a new paradigm called word meaning maturity which tracks the degree of knowledge of each word at different stages of language learning. We present a computational algorithm for estimating word maturity, based on modeling language acquisition with Latent Semantic Analysis. We demonstrate that the resulting metric not only correlates well with external indicators, but captures deeper semantic effects in language. 1 Motivation It is no surprise that through stages of language learning, different words are learned at different times and are known to different extents. For example, a common word like “dog” is familiar to even a first-grader, whereas a more advanced word like “focal” does not usually enter learners’ vocabulary until much later. Although individual rates of learning words may vary between high- and low-performing students, it has been observed that “children […] acquire word meanings in roughly the same sequence” (Biemiller, 2008). The aim of this work is to model the degree of knowledge of words at different learning stages. Such a metric would have extremely useful applications in personalized educational technologies, for the purposes of accurate assessment and personalized vocabulary instruction. 2 Rethinking Word Difficulty Previously, related work in education and psychometrics has been concerned with measuring word difficulty or classifying words into different difficulty categories. Examples of such approaches include creation of word lists for targeted vocabulary instruction at various grade levels that were compiled by educational experts, such as Nation (1993) or Biemiller (2008).
Such word difficulty assignments are also implicitly present in some readability formulas that estimate the difficulty of texts, such as Lexiles (Stenner, 1996), which include a lexical difficulty component based on the frequency of occurrence of words in a representative corpus, on the assumption that word difficulty is inversely correlated with corpus frequency. Additionally, research in psycholinguistics has attempted to outline and measure psycholinguistic dimensions of words such as age-of-acquisition and familiarity, which aim to track when certain words become known and how familiar they appear to an average person. Importantly, all such word difficulty measures can be thought of as functions that assign a single scalar value d(w) to each word w.
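The frequency-based difficulty assumption described above can be sketched as a toy scalar function. This is only an illustration of the inverse-frequency idea, not the actual Lexile formula; the function name and smoothing choice are my own.

```python
import math
from collections import Counter

def frequency_difficulty(word, corpus_tokens):
    """Toy scalar difficulty d(w): negative log of smoothed corpus
    frequency. Higher values mean rarer, presumably harder, words."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    # Add-one smoothing so unseen words get a finite, high difficulty.
    p = (counts[word] + 1) / (total + len(counts))
    return -math.log(p)

corpus = "the dog ran to the dog house near the focal point".split()
# "dog" occurs more often than "focal", so it scores as easier (lower).
```

Note that any such measure collapses word knowledge to a single number per word, which is exactly the limitation the word maturity paradigm is meant to address.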
Author: Vicent Alabau ; Alberto Sanchis ; Francisco Casacuberta
Abstract: In interactive machine translation (IMT), a human expert is integrated into the core of a machine translation (MT) system. The human expert interacts with the IMT system by partially correcting the errors of the system’s output. Then, the system proposes a new solution. This process is repeated until the output meets the desired quality. In this scenario, the interaction is typically performed using the keyboard and the mouse. In this work, we present an alternative modality for interacting with IMT systems by writing on a tactile display or using an electronic pen. An on-line handwritten text recognition (HTR) system has been specifically designed to operate with IMT systems. Our HTR system improves on previous approaches in two main aspects. First, HTR decoding is tightly coupled with the IMT system. Second, the proposed language models are context aware, in the sense that they take into account the partial corrections and the source sentence by using a combination of n-grams and word-based IBM models. The proposed system achieves a substantial performance improvement over previous work.
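The correct-and-repropose cycle described in this abstract can be sketched as a prefix-based loop. The hooks `mt_complete` and `get_user_correction` are hypothetical stand-ins, not the paper's actual system: the first returns a full hypothesis extending a validated prefix, the second returns the user's corrected prefix, or None to accept.

```python
def interactive_translate(source, mt_complete, get_user_correction):
    """Sketch of the IMT interaction loop: the system proposes a
    completion, the user fixes the first error, and the validated
    prefix grows until the user accepts the hypothesis."""
    prefix = ""
    while True:
        hypothesis = mt_complete(source, prefix)
        correction = get_user_correction(hypothesis)
        if correction is None:      # user accepts: interaction ends
            return hypothesis
        prefix = correction         # the validated prefix grows

# Toy session: the first hypothesis is wrong, the user corrects its
# prefix, and the system completes the corrected prefix next round.
corrections = iter(["good ", None])
result = interactive_translate(
    "src",
    lambda s, p: (p + "guess") if p else "bad guess",
    lambda hyp: next(corrections),
)
```

In the handwriting modality the paper proposes, `get_user_correction` would be backed by the HTR decoder rather than keyboard input, which is why tightly coupling HTR with the IMT system matters.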
4 0.99022889 272 acl-2011-Semantic Information and Derivation Rules for Robust Dialogue Act Detection in a Spoken Dialogue System
Author: Wei-Bin Liang ; Chung-Hsien Wu ; Chia-Ping Chen
Abstract: In this study, a novel approach to robust dialogue act detection for error-prone speech recognition in a spoken dialogue system is proposed. First, partial sentence trees are proposed to represent a speech recognition output sentence. Semantic information and the derivation rules of the partial sentence trees are extracted and used to model the relationship between the dialogue acts and the derivation rules. The constructed model is then used to generate a semantic score for dialogue act detection given an input speech utterance. The proposed approach is implemented and evaluated in a Mandarin spoken dialogue system for tour-guiding service. Combined with scores derived from the ASR recognition probability and the dialogue history, the proposed approach achieves 84.3% detection accuracy, an absolute improvement of 34.7% over the baseline of the semantic slot-based method with 49.6% detection accuracy.
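The final score combination this abstract describes (semantic score plus ASR and dialogue-history scores) can be sketched as a weighted vote over candidate dialogue acts. The weights and score values below are made up for illustration; the paper derives its semantic score from partial sentence trees and derivation rules.

```python
def detect_dialogue_act(act_scores, weights=(0.5, 0.3, 0.2)):
    """Illustrative combination: pick the dialogue act maximizing a
    weighted sum of semantic, ASR, and dialogue-history scores."""
    w_sem, w_asr, w_hist = weights
    def combined(scores):
        return (w_sem * scores["semantic"]
                + w_asr * scores["asr"]
                + w_hist * scores["history"])
    return max(act_scores, key=lambda act: combined(act_scores[act]))

scores = {
    "greeting": {"semantic": 0.9, "asr": 0.8, "history": 0.7},
    "request":  {"semantic": 0.4, "asr": 0.9, "history": 0.5},
}
```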
5 0.99012363 290 acl-2011-Syntax-based Statistical Machine Translation using Tree Automata and Tree Transducers
Author: Daniel Emilio Beck
Abstract: In this paper I present a Master’s thesis proposal in syntax-based Statistical Machine Translation. I propose to build discriminative SMT models using both tree-to-string and tree-to-tree approaches. Translation and language models will be represented mainly through the use of Tree Automata and Tree Transducers. These formalisms have important representational properties that make them well-suited for syntax modeling. I also present an experiment plan to evaluate these models through the use of a parallel corpus written in English and Brazilian Portuguese.
6 0.99001372 82 acl-2011-Content Models with Attitude
7 0.98979956 335 acl-2011-Why Initialization Matters for IBM Model 1: Multiple Optima and Non-Strict Convexity
8 0.98930413 41 acl-2011-An Interactive Machine Translation System with Online Learning
9 0.98901045 314 acl-2011-Typed Graph Models for Learning Latent Attributes from Names
10 0.98675126 25 acl-2011-A Simple Measure to Assess Non-response
11 0.98638606 49 acl-2011-Automatic Evaluation of Chinese Translation Output: Word-Level or Character-Level?
12 0.97583866 266 acl-2011-Reordering with Source Language Collocations
13 0.97286284 251 acl-2011-Probabilistic Document Modeling for Syntax Removal in Text Summarization
14 0.97230083 2 acl-2011-AM-FM: A Semantic Framework for Translation Quality Assessment
15 0.97201198 264 acl-2011-Reordering Metrics for MT
16 0.97156543 26 acl-2011-A Speech-based Just-in-Time Retrieval System using Semantic Search
17 0.97080535 169 acl-2011-Improving Question Recommendation by Exploiting Information Need
18 0.96768844 21 acl-2011-A Pilot Study of Opinion Summarization in Conversations
19 0.9658162 46 acl-2011-Automated Whole Sentence Grammar Correction Using a Noisy Channel Model
20 0.96573985 71 acl-2011-Coherent Citation-Based Summarization of Scientific Papers