acl acl2011 acl2011-248 knowledge-graph by maker-knowledge-mining

248 acl-2011-Predicting Clicks in a Vocabulary Learning System


Source: pdf

Author: Aaron Michelony

Abstract: We consider the problem of predicting which words a student will click in a vocabulary learning system. Often a language learner will find value in the ability to look up the meaning of an unknown word while reading an electronic document by clicking the word. Highlighting words likely to be unknown to a reader is attractive because it draws his or her attention to the word and indicates that information is available. However, this option is usually done manually in vocabulary systems and online encyclopedias such as Wikipedia. Furthermore, it is never done on a per-user basis. This paper presents an automated way of highlighting words likely to be unknown to the specific user. We present related work in search engine ranking, describe the study used to collect click data and the experiment we performed using the random forest machine learning algorithm, and finish with a discussion of future work.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We consider the problem of predicting which words a student will click in a vocabulary learning system. [sent-3, score-0.781]

2 Often a language learner will find value in the ability to look up the meaning of an unknown word while reading an electronic document by clicking the word. [sent-4, score-0.33]

3 Highlighting words likely to be unknown to a reader is attractive because it draws his or her attention to the word and indicates that information is available. [sent-5, score-0.096]

4 However, this option is usually done manually in vocabulary systems and online encyclopedias such as Wikipedia. [sent-6, score-0.12]

5 This paper presents an automated way of highlighting words likely to be unknown to the specific user. [sent-8, score-0.196]

6 We present related work in search engine ranking, describe the study used to collect click data and the experiment we performed using the random forest machine learning algorithm, and finish with a discussion of future work. [sent-9, score-0.705]

7 1 Introduction: When reading an article, one occasionally encounters an unknown word for which one would like the definition. [sent-10, score-0.147]

8 For students learning or mastering a language, this can occur frequently. [sent-11, score-0.205]

9 Using a computerized learning system, it is possible to highlight words with which one would expect students to struggle. [sent-12, score-0.204]

10 The highlighting both draws attention to the word and indicates that information about it is available. [sent-13, score-0.134]

11 There are many applications of automatically highlighting unknown words. [sent-14, score-0.168]

12 Traditionally learners of foreign languages have had to look up unknown words in a dictionary. [sent-17, score-0.127]

13 For reading on the computer, unknown words are generally entered into an online dictionary, which can be time-consuming. [sent-18, score-0.189]

14 The automated highlighting of words could also be applied in an online encyclopedia, such as Wikipedia. [sent-19, score-0.128]

15 The proliferation of handheld computer devices for reading is another potential application, as some of these user interfaces may cause difficulty in the copying and pasting of a word into a dictionary. [sent-20, score-0.13]

16 Given a finite amount of resources available to improve definitions for certain words, knowing which words are likely to be clicked will help. [sent-21, score-0.293]

17 In this paper, we explore applying machine learning algorithms to classifying clicks in a vocabulary learning system. [sent-23, score-0.489]

18 The primary contribution of this work is to provide a list of features for machine learning algorithms and their correlation with clicks. [sent-24, score-0.176]

19 We analyze how the different features correlate with different aspects of the vocabulary learning process. [sent-25, score-0.124]

20 2 Related Work: The previous work in this area has mainly been in predicting clicks for web search ranking. [sent-26, score-0.472]

21 For search engine results, there have been several factors identified for why people click on certain results over others. [sent-27, score-0.553]

22 One of the most important is position bias, which says that the presentation order affects the probability of a user clicking on a result. [sent-28, score-0.195]

23 This is considered a “fundamental problem in click data” (Craswell et al. [sent-29, score-0.499]

24 , 2005) have shown that click probability decays faster than examination probability. [sent-33, score-0.563]

25 There have been four hypotheses for how to model position bias: • Baseline Hypothesis: There is no position bias. [sent-34, score-0.14]

26 This hypothesis does not fit with the data for how users click the top results. [sent-36, score-0.543]

27 • Mixture Hypothesis: Users click based on relevance or at random. [sent-37, score-0.499]

28 • Examination Hypothesis: Each result has a probability of being examined based on its position and will be clicked if it is both examined and relevant. [sent-38, score-0.309]

29 • Cascade Model: Users view search results from top to bottom and click on a result with a certain probability. [sent-39, score-0.057]

30 The cascade model has been shown to closely model the top-ranked results and the baseline model closely matches how users click at lower-ranked results (Craswell et al. [sent-40, score-0.612]
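
To make the cascade model above concrete, here is a minimal sketch (not from the paper) of how click probabilities fall off with rank when a user scans results from top to bottom and stops at the first click; the per-result relevance values are invented for illustration.

```python
# Minimal sketch of the cascade model described above (illustrative only;
# the per-result relevance probabilities are invented, not taken from the paper).
def cascade_click_probabilities(relevance):
    """P(click at rank i) = r_i * prod_{j<i} (1 - r_j): the user scans results
    from top to bottom and stops at the first result they click."""
    click_probs = []
    p_reach = 1.0  # probability that the user examines this position at all
    for r in relevance:
        click_probs.append(p_reach * r)
        p_reach *= 1.0 - r  # the user continues only if this result was not clicked
    return click_probs

if __name__ == "__main__":
    relevance = [0.4, 0.3, 0.3, 0.2, 0.2]  # hypothetical values for a 5-item ranking
    for rank, p in enumerate(cascade_click_probabilities(relevance), start=1):
        print(f"rank {rank}: P(click) = {p:.3f}")
```

Under this model the top positions absorb most of the clicks, which is consistent with the observation that it fits the top-ranked results well.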

31 There has also been work done in predicting document keywords (Doğan and Lu, 2010). [sent-42, score-0.214]

32 Our goals are complementary, in that they are trying to predict words that a user would use to search for a document and we are trying to predict words in a document that a user would want more information about. [sent-44, score-0.652]

33 3 Data Description: To obtain click data, a study was conducted involving middle-school students, of which 157 were in the 7th grade and 17 were in the 8th grade. [sent-46, score-0.574]

34 90 students spoke Spanish as their primary language, 75 spoke English as their primary language, 8 spoke other languages and 1 was unknown. [sent-47, score-0.531]

35 There were six documents for which we obtained click data. [sent-48, score-0.54]

36 Each document was either about science or was a fable. [sent-49, score-0.164]

37 The science documents contained more advanced vocabulary whereas the fables were primarily written for English language learners. [sent-50, score-0.262]

38 In the study, the students took a vocabulary test, used the vocabulary system and then took another vocabulary test. [Table 1: the documents with their genre, word counts and student counts.] [sent-51, score-0.362]

39 The highlighted words were chosen by a computer program using latent semantic analysis (Deerwester et al. [sent-53, score-0.124]

40 Importantly, only nouns were highlighted and only nouns were in the vocabulary test. [sent-56, score-0.261]
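
For readers unfamiliar with latent semantic analysis, the sketch below shows a standard LSA pipeline (TF-IDF followed by truncated SVD) that yields a latent vector per word. The study's actual selection criterion for which nouns to highlight is not described in this summary, so the toy documents below and any downstream use of these vectors are purely hypothetical.

```python
# Generic LSA pipeline (TF-IDF + truncated SVD); illustrative only, not the
# study's own word-selection program.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Toy corpus standing in for the study documents (hypothetical).
documents = [
    "the crane was flying over the river near the village",
    "plants use photosynthesis to convert sunlight into energy",
    "the fox told the crow a story about grapes and patience",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)   # document-term matrix
lsa = TruncatedSVD(n_components=2, random_state=0)
lsa.fit(X)                                # latent semantic space
term_vectors = lsa.components_.T          # one latent vector per vocabulary term

for term, vec in zip(vectorizer.get_feature_names_out()[:5], term_vectors[:5]):
    print(term, vec.round(3))
```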

41 When the student clicked on a highlighted word, they were shown definitions for the word along with four images showing the word in context. [sent-57, score-0.573]

42 For example, if a student clicked on the word “crane” which had the word “flying” next to it, one of the images the student would see would be of a flying crane. [sent-58, score-0.622]

43 From Figure 1 we see that there is a relation between the total number of words in a document and the number of clicks students made. [sent-59, score-0.735]

44 For every click in document four, there are about 30 non-clicks. [sent-62, score-0.634]

45 For the second science document there are 100 non-clicks for every click and for the first science document there are nearly 300 non-clicks for every click. [sent-64, score-0.827]

46 There was also no correlation seen between a word being on a quiz and being clicked. [sent-65, score-0.306]

47 This indicates that the students may not have used the system as seriously as possible, which introduced noise into the click data. [sent-66, score-0.675]

48 This is further evidenced by the quizzes, which show that only about 10% of the quiz words that students got wrong on the first test were actually learned. [sent-67, score-0.378]

49 However, we will show that we are able to predict clicks regardless. [sent-68, score-0.441]

50 Figures 2, 3 and 4 show the relationship between the mean age of acquisition of the words clicked on, STAR language scores and the number of clicks for document 2. [sent-69, score-1.136]

51 Age of acquisition scores are abstract: a score of 300 means a word was acquired at age 4-6, 400 at age 6-8 and 500 at age 8-10 (Cortese and Fugett, 2004). [sent-72, score-0.158]

52 (Figure 2: Age of Acquisition vs. Clicks.) 4 Machine Learning Method: The goal of our study is to predict student clicks in a vocabulary learning system. [sent-74, score-0.723]

53 We used the random forest machine learning method, due to its success in the Yahoo! [sent-75, score-0.123]

54 Random forest is an algorithm that classifies data by decision trees voting on a classification (Breiman, 2001). [sent-79, score-0.096]

55 The forest chooses the class with the most votes. (Figure 3: STAR Language score vs. clicks.) [sent-80, score-0.096]

56 Each tree in the forest is trained by first sampling a subset of the data, chosen randomly with replacement, and then removing a large number of features. [sent-83, score-0.096]

57 Random forest has the advantage that it does not overfit the data. [sent-86, score-0.096]

58 To implement this algorithm on our click data, we constructed feature vectors consisting of both student features and word features. [sent-87, score-0.677]

59 Each word is either clicked or not clicked, so we were able to use a binary classifier. [sent-88, score-0.273]
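
As a concrete, hedged illustration of this set-up, the sketch below builds a few synthetic (student, word) feature vectors using three of the features named later in the paper and trains a random forest binary classifier with scikit-learn; the paper's own toolkit, data, and feature values are not reproduced here.

```python
# Minimal sketch of binary click classification with a random forest
# (synthetic data; not the paper's experiment).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Each row is one (student, word) pair:
# [STAR language score, age-of-acquisition rating, word position in document]
X = np.array([
    [350, 420, 3],
    [350, 310, 57],
    [280, 480, 12],
    [280, 300, 90],
    [410, 450, 5],
    [410, 330, 70],
])
y = np.array([1, 0, 1, 0, 1, 0])  # 1 = clicked, 0 = not clicked

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# Predict whether a new (student, word) pair will produce a click,
# and inspect which features the forest found most informative.
print(clf.predict([[300, 460, 8]]))
print(dict(zip(["star", "age_of_acquisition", "position"],
               clf.feature_importances_.round(2))))
```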

60 The features used are of two types: student features and word features. [sent-91, score-0.209]

61 The student features we used in our experiment were the STAR (Standardized Testing and Reporting, a California standardized test) language score and the CELDT (California English Language Development Test) overall score, which correlated highly with each other. [sent-92, score-0.194]

62 There was a correlation of 0.1 between the STAR language score and total clicks across all the documents. [sent-94, score-0.396]

63 Also available were the STAR math score, CELDT reading, writing, speaking and listening scores, grade level and primary language. [sent-95, score-0.093]

64 We used and tested many word features, which were discovered to be more important than the student features. [sent-97, score-0.147]

65 First, we used the part-of-speech as a feature which was useful since only nouns were highlighted in the study. [sent-98, score-0.132]

66 The most useful was age of acquisition, which refers to “the age at which a word was learnt and has been proposed as a significant contributor to language and memory processes” (Stadthagen-Gonzalez and Davis, 2006). [sent-103, score-0.462]

67 This was useful because it was available for the majority of words and is a good proxy for the difficulty of a word. [sent-104, score-0.098]

68 Also useful was imageability, which is “the ease with which the word gives rise to a sensory mental image” (Bird et al. [sent-105, score-0.063]

69 Third, we obtained the Google unigram frequencies which were also a proxy for the difficulty of a word. [sent-108, score-0.07]

70 Fourth, we calculated click percentages for words, for students and words, for words in a document and for specific words in a document. [sent-109, score-0.895]

71 We instead would like to focus on words for which we do not have click data. [sent-111, score-0.527]

72 Fifth, the word position, which indicates the position of the word in the document, was useful because position bias was seen in our data. [sent-112, score-0.269]

73 After seeing a word three or four times, the clicks for that word dropped off dramatically. [sent-117, score-0.464]

74 We gathered etymological data, such as the language of origin and the date the word entered the English language; however, these features did not help. [sent-119, score-0.142]

75 We were also able to categorize the words using WordNet (Fellbaum, 1998), which can determine, for example, that a boat is an artifact and a lion is an animal. [sent-120, score-0.096]

76 We tested for the categories of abstraction, artifact, living thing and animal but found no correlation between clicks and these categories. [sent-121, score-0.523]

77 4.2 Missing Values: Many features were not available for every word in the evaluation, such as age of acquisition. [sent-123, score-0.279]

78 We decided to create reduced-feature models because they have been reported to consistently outperform imputation (Saar-Tsechansky and Provost, 2007). [sent-125, score-0.066]
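
A minimal sketch of the reduced-feature-model strategy, under the simplifying assumption that only the age-of-acquisition feature can be missing: one model is trained on all features, a second on the subset that excludes it, and each test example is routed to the model matching its available features. All data here are synthetic.

```python
# Reduced-feature models vs. imputation: route examples with a missing value
# to a model trained without that feature (synthetic data, illustrative only).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # columns: star, aoa, position (synthetic)
y = (X[:, 1] + 0.5 * X[:, 2] > 0).astype(int)  # synthetic click labels

# One model on all features, one "reduced" model without the aoa column (index 1).
full_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
reduced_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X[:, [0, 2]], y)

def predict(example):
    """example = [star, aoa, position]; aoa may be None when it is unavailable."""
    star, aoa, position = example
    if aoa is None:
        return reduced_model.predict([[star, position]])[0]
    return full_model.predict([[star, aoa, position]])[0]

print(predict([0.2, None, 1.3]), predict([0.2, -0.7, 1.3]))
```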

79 4.3 Experimental Set-up: We ran our evaluation on document four, which had click data for 22 students. [sent-127, score-0.634]

80 We chose this document because it had the highest correlation between a word being a quiz word and being clicked, at 0.06. [sent-128, score-0.475]

81 The correlation between the age of acquisition of a word and that word being a quiz word is also high, at 0. [sent-129, score-0.712]

82 The algorithms were run with the following features: STAR language score, CELDT overall score, word position, word instance, document number, age of acquisition, imageability, Google frequency, stopword, and part-of-speech. [sent-131, score-0.417]

83 The training data for a student consisted of his or her click data for the other fables and all the other students’ click data for all the fables. [sent-133, score-1.21]
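
A sketch of this per-student split, assuming the click observations live in a table with student, document, and clicked columns (the column names and the tiny example log below are hypothetical):

```python
# Per-student training split described above (hypothetical column names and data).
import pandas as pd

def training_rows(df, test_student, test_document):
    """Keep the test student's clicks on *other* documents plus every other
    student's clicks on *all* documents; hold out the test (student, document)."""
    own_other_docs = (df.student == test_student) & (df.document != test_document)
    other_students = df.student != test_student
    return df[own_other_docs | other_students]

# Tiny hypothetical click log.
df = pd.DataFrame({
    "student":  [1, 1, 1, 2, 2, 3],
    "document": ["fable1", "fable2", "fable4", "fable1", "fable4", "fable2"],
    "clicked":  [0, 1, 1, 0, 0, 1],
})
print(training_rows(df, test_student=1, test_document="fable4"))
```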

84 We obtained similar performance with the other documents except document one. [sent-136, score-0.176]

85 First, we are trying to maximize clicks when we should be trying to maximize learning. [sent-140, score-0.528]

86 In the future we would like to identify which clicks are more important than others and incorporate that into our model. [sent-141, score-0.396]

87 Second, across all documents of the study there was no correlation between a word being on the quiz and being clicked. [sent-142, score-0.376]

88 We would like to obtain click data from users actively trying to learn and see how the results would be affected; we speculate that the position bias effect may be reduced in this case. [sent-143, score-0.74]

89 Third, this study involved students who were using the system for the first time. [sent-144, score-0.205]

90 This actually produced a slight negative correlation between age of acquisition and whether the word is a quiz word or not, whereas for the fable documents there is a strong positive correlation between these two variables. [sent-149, score-0.817]

91 It raises the question of how appropriate it is to include click data from a document with only one click out of 100 or 300 non-clicks into the training set for a document with one click out of 30 non-clicks. [sent-150, score-1.767]

92 When the science documents were included in the training set for the fables, there was no difference in performance. [sent-151, score-0.07]

93 The correlation between the word position and clicks is about -0. [sent-152, score-0.598]

94 This shows that position bias affects vocabulary systems as well as search engines and finding a good model to describe this is future work. [sent-154, score-0.291]

95 The cascade model seems most appropriate; however, the students tended to click in a nonlinear order. [sent-155, score-0.744]
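
As a worked illustration of the kind of position-bias correlation discussed above, the snippet below draws synthetic clicks whose probability decays with word position and measures the Pearson correlation; the numbers are made up and are not the paper's.

```python
# Position-bias check on synthetic data: correlation between word position
# and whether the word was clicked (illustrative only).
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
position = np.arange(1, 201)             # word position within the document
click_prob = 0.3 - 0.001 * position      # earlier words clicked more often
clicked = rng.binomial(1, click_prob)    # synthetic click outcomes

r, p = pearsonr(position, clicked)
print(f"correlation between position and clicks: r = {r:.2f} (p = {p:.3g})")
```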

96 Previous work by Doğan and Lu in predicting click-words (Doğan and Lu, 2010) built a learning system to predict click-words for documents in the field of bioinformatics. [sent-157, score-0.184]

97 They claim that “Our results show that a word’s semantic type, location, POS, neighboring words and phrase information together could best determine if a word will be a click-word.” [sent-158, score-0.092]

98 They did report that if a word was in the title or abstract it was more likely to be a click-word, which is similar to our finding that a word at the beginning of the document is more likely to be clicked. [sent-159, score-0.203]

99 Certain features such as neighboring words do not seem applicable to our usage in general, although it is something to be aware of for specialized domains. [sent-161, score-0.089]

100 Their use of semantic types was interesting, though using WordNet we did not find any preference for certain classes of nouns being clicked over others. [sent-162, score-0.275]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('click', 0.499), ('clicks', 0.396), ('clicked', 0.239), ('age', 0.214), ('star', 0.191), ('students', 0.176), ('quiz', 0.174), ('cortese', 0.165), ('document', 0.135), ('imageability', 0.134), ('acquisition', 0.124), ('student', 0.113), ('highlighting', 0.1), ('celdt', 0.099), ('fables', 0.099), ('correlation', 0.098), ('forest', 0.096), ('highlighted', 0.096), ('vocabulary', 0.093), ('craswell', 0.087), ('spoke', 0.087), ('position', 0.07), ('cascade', 0.069), ('unknown', 0.068), ('trying', 0.066), ('fugett', 0.066), ('imputation', 0.066), ('bias', 0.061), ('monosyllabic', 0.058), ('cruz', 0.058), ('flying', 0.058), ('weka', 0.057), ('davis', 0.05), ('standardized', 0.05), ('gan', 0.05), ('predicting', 0.048), ('deerwester', 0.048), ('iwould', 0.048), ('entered', 0.048), ('clicking', 0.048), ('primary', 0.047), ('vs', 0.047), ('grade', 0.046), ('reading', 0.045), ('predict', 0.045), ('users', 0.044), ('santa', 0.044), ('documents', 0.041), ('proxy', 0.039), ('artifact', 0.039), ('affects', 0.039), ('user', 0.038), ('bird', 0.038), ('nouns', 0.036), ('lu', 0.036), ('examination', 0.035), ('word', 0.034), ('missing', 0.034), ('ratings', 0.033), ('interpreting', 0.033), ('difficulty', 0.031), ('keywords', 0.031), ('foreign', 0.031), ('images', 0.031), ('wordnet', 0.031), ('california', 0.031), ('features', 0.031), ('joachims', 0.03), ('neighboring', 0.03), ('toutanova', 0.03), ('behavior', 0.029), ('study', 0.029), ('nonlinearity', 0.029), ('living', 0.029), ('cific', 0.029), ('proficient', 0.029), ('etymological', 0.029), ('ctoa', 0.029), ('decays', 0.029), ('boat', 0.029), ('bristol', 0.029), ('mastering', 0.029), ('sensory', 0.029), ('sfo', 0.029), ('science', 0.029), ('scott', 0.028), ('words', 0.028), ('search', 0.028), ('random', 0.027), ('proliferation', 0.027), ('encyclopedias', 0.027), ('ined', 0.027), ('beach', 0.027), ('judith', 0.027), ('complimentary', 0.027), ('fioorn', 0.027), ('instruments', 0.027), ('definitions', 0.026), ('engine', 0.026), ('revisit', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 248 acl-2011-Predicting Clicks in a Vocabulary Learning System

Author: Aaron Michelony

Abstract: We consider the problem of predicting which words a student will click in a vocabulary learning system. Often a language learner will find value in the ability to look up the meaning of an unknown word while reading an electronic document by clicking the word. Highlighting words likely to be unknown to a reader is attractive because it draws his or her attention to the word and indicates that information is available. However, this option is usually done manually in vocabulary systems and online encyclopedias such as Wikipedia. Furthermore, it is never done on a per-user basis. This paper presents an automated way of highlighting words likely to be unknown to the specific user. We present related work in search engine ranking, describe the study used to collect click data and the experiment we performed using the random forest machine learning algorithm, and finish with a discussion of future work.

2 0.30000111 181 acl-2011-Jigs and Lures: Associating Web Queries with Structured Entities

Author: Patrick Pantel ; Ariel Fuxman

Abstract: We propose methods for estimating the probability that an entity from an entity database is associated with a web search query. Association is modeled using a query entity click graph, blending general query click logs with vertical query click logs. Smoothing techniques are proposed to address the inherent data sparsity in such graphs, including interpolation using a query synonymy model. A large-scale empirical analysis of the smoothing techniques, over a 2-year click graph collected from a commercial search engine, shows significant reductions in modeling error. The association models are then applied to the task of recommending products to web queries, by annotating queries with products from a large catalog and then mining query- product associations through web search session analysis. Experimental analysis shows that our smoothing techniques improve coverage while keeping precision stable, and overall, that our top-performing model affects 9% of general web queries with 94% precision.

3 0.16676258 271 acl-2011-Search in the Lost Sense of "Query": Question Formulation in Web Search Queries and its Temporal Changes

Author: Bo Pang ; Ravi Kumar

Abstract: Web search is an information-seeking activity. Often times, this amounts to a user seeking answers to a question. However, queries, which encode user’s information need, are typically not expressed as full-length natural language sentences in particular, as questions. Rather, they consist of one or more text fragments. As humans become more searchengine-savvy, do natural-language questions still have a role to play in web search? Through a systematic, large-scale study, we find to our surprise that as time goes by, web users are more likely to use questions to express their search intent. —

4 0.16416244 31 acl-2011-Age Prediction in Blogs: A Study of Style, Content, and Online Behavior in Pre- and Post-Social Media Generations

Author: Sara Rosenthal ; Kathleen McKeown

Abstract: We investigate whether wording, stylistic choices, and online behavior can be used to predict the age category of blog authors. Our hypothesis is that significant changes in writing style distinguish pre-social media bloggers from post-social media bloggers. Through experimentation with a range of years, we found that the birth dates of students in college at the time when social media such as AIM, SMS text messaging, MySpace and Facebook first became popular, enable accurate age prediction. We also show that internet writing characteristics are important features for age prediction, but that lexical content is also needed to produce significantly more accurate results. Our best results allow for 81.57% accuracy.

5 0.079630777 205 acl-2011-Learning to Grade Short Answer Questions using Semantic Similarity Measures and Dependency Graph Alignments

Author: Michael Mohler ; Razvan Bunescu ; Rada Mihalcea

Abstract: In this work we address the task of computerassisted assessment of short student answers. We combine several graph alignment features with lexical semantic similarity measures using machine learning techniques and show that the student answers can be more accurately graded than if the semantic measures were used in isolation. We also present a first attempt to align the dependency graphs of the student and the instructor answers in order to make use of a structural component in the automatic grading of student answers.

6 0.072886787 270 acl-2011-SciSumm: A Multi-Document Summarization System for Scientific Articles

7 0.065984793 217 acl-2011-Machine Translation System Combination by Confusion Forest

8 0.065265052 341 acl-2011-Word Maturity: Computational Modeling of Word Knowledge

9 0.058499627 20 acl-2011-A New Dataset and Method for Automatically Grading ESOL Texts

10 0.054267291 77 acl-2011-Computing and Evaluating Syntactic Complexity Features for Automated Scoring of Spontaneous Non-Native Speech

11 0.053653978 18 acl-2011-A Latent Topic Extracting Method based on Events in a Document and its Application

12 0.052965757 139 acl-2011-From Bilingual Dictionaries to Interlingual Document Representations

13 0.052426342 204 acl-2011-Learning Word Vectors for Sentiment Analysis

14 0.051450204 26 acl-2011-A Speech-based Just-in-Time Retrieval System using Semantic Search

15 0.050445151 110 acl-2011-Effective Use of Function Words for Rule Generalization in Forest-Based Translation

16 0.04894872 312 acl-2011-Turn-Taking Cues in a Human Tutoring Corpus

17 0.048789117 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents

18 0.048663504 115 acl-2011-Engkoo: Mining the Web for Language Learning

19 0.048219625 88 acl-2011-Creating a manually error-tagged and shallow-parsed learner corpus

20 0.04819084 251 acl-2011-Probabilistic Document Modeling for Syntax Removal in Text Summarization


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.142), (1, 0.051), (2, -0.034), (3, 0.057), (4, -0.088), (5, -0.058), (6, -0.035), (7, -0.083), (8, 0.055), (9, -0.022), (10, -0.01), (11, -0.031), (12, -0.013), (13, 0.01), (14, -0.018), (15, -0.033), (16, 0.058), (17, 0.004), (18, -0.033), (19, -0.029), (20, 0.098), (21, 0.039), (22, -0.011), (23, 0.031), (24, -0.031), (25, -0.075), (26, -0.015), (27, 0.046), (28, -0.078), (29, -0.024), (30, -0.026), (31, -0.023), (32, -0.072), (33, 0.044), (34, -0.015), (35, -0.057), (36, -0.085), (37, 0.063), (38, -0.031), (39, 0.059), (40, -0.058), (41, 0.039), (42, 0.079), (43, -0.061), (44, 0.119), (45, -0.021), (46, -0.002), (47, 0.1), (48, 0.098), (49, 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.87959099 248 acl-2011-Predicting Clicks in a Vocabulary Learning System

Author: Aaron Michelony

Abstract: We consider the problem of predicting which words a student will click in a vocabulary learning system. Often a language learner will find value in the ability to look up the meaning of an unknown word while reading an electronic document by clicking the word. Highlighting words likely to be unknown to a reader is attractive because it draws his or her attention to the word and indicates that information is available. However, this option is usually done manually in vocabulary systems and online encyclopedias such as Wikipedia. Furthermore, it is never done on a per-user basis. This paper presents an automated way of highlighting words likely to be unknown to the specific user. We present related work in search engine ranking, describe the study used to collect click data and the experiment we performed using the random forest machine learning algorithm, and finish with a discussion of future work.

2 0.58838046 341 acl-2011-Word Maturity: Computational Modeling of Word Knowledge

Author: Kirill Kireyev ; Thomas K Landauer

Abstract: While computational estimation of difficulty of words in the lexicon is useful in many educational and assessment applications, the concept of scalar word difficulty and current corpus-based methods for its estimation are inadequate. We propose a new paradigm called word meaning maturity which tracks the degree of knowledge of each word at different stages of language learning. We present a computational algorithm for estimating word maturity, based on modeling language acquisition with Latent Semantic Analysis. We demonstrate that the resulting metric not only correlates well with external indicators, but captures deeper semantic effects in language. 1 Motivation It is no surprise that through stages of language learning, different words are learned at different times and are known to different extents. For example, a common word like “dog” is familiar to even a first-grader, whereas a more advanced word like “focal” does not usually enter learners’ vocabulary until much later. Although individual rates of learning words may vary between high- and low-performing students, it has been observed that “children [… ] acquire word meanings in roughly the same sequence” (Biemiller, 2008). The aim of this work is to model the degree of knowledge of words at different learning stages. Such a metric would have extremely useful applications in personalized educational technologies, for the purposes of accurate assessment and personalized vocabulary instruction. … 2 Rethinking Word Difficulty: Previously, related work in education and psychometrics has been concerned with measuring word difficulty or classifying words into different difficulty categories. Examples of such approaches include creation of word lists for targeted vocabulary instruction at various grade levels that were compiled by educational experts, such as Nation (1993) or Biemiller (2008). Such word difficulty assignments are also implicitly present in some readability formulas that estimate difficulty of texts, such as Lexiles (Stenner, 1996), which include a lexical difficulty component based on the frequency of occurrence of words in a representative corpus, on the assumption that word difficulty is inversely correlated to corpus frequency. Additionally, research in psycholinguistics has attempted to outline and measure psycholinguistic dimensions of words such as age-of-acquisition and familiarity, which aim to track when certain words become known and how familiar they appear to an average person. Importantly, all such word difficulty measures can be thought of as functions that assign a single scalar value to each word w.

3 0.57499754 271 acl-2011-Search in the Lost Sense of "Query": Question Formulation in Web Search Queries and its Temporal Changes

Author: Bo Pang ; Ravi Kumar

Abstract: Web search is an information-seeking activity. Often times, this amounts to a user seeking answers to a question. However, queries, which encode user’s information need, are typically not expressed as full-length natural language sentences in particular, as questions. Rather, they consist of one or more text fragments. As humans become more searchengine-savvy, do natural-language questions still have a role to play in web search? Through a systematic, large-scale study, we find to our surprise that as time goes by, web users are more likely to use questions to express their search intent. —

4 0.57070005 181 acl-2011-Jigs and Lures: Associating Web Queries with Structured Entities

Author: Patrick Pantel ; Ariel Fuxman

Abstract: We propose methods for estimating the probability that an entity from an entity database is associated with a web search query. Association is modeled using a query entity click graph, blending general query click logs with vertical query click logs. Smoothing techniques are proposed to address the inherent data sparsity in such graphs, including interpolation using a query synonymy model. A large-scale empirical analysis of the smoothing techniques, over a 2-year click graph collected from a commercial search engine, shows significant reductions in modeling error. The association models are then applied to the task of recommending products to web queries, by annotating queries with products from a large catalog and then mining query- product associations through web search session analysis. Experimental analysis shows that our smoothing techniques improve coverage while keeping precision stable, and overall, that our top-performing model affects 9% of general web queries with 94% precision.

5 0.56850487 31 acl-2011-Age Prediction in Blogs: A Study of Style, Content, and Online Behavior in Pre- and Post-Social Media Generations

Author: Sara Rosenthal ; Kathleen McKeown

Abstract: We investigate whether wording, stylistic choices, and online behavior can be used to predict the age category of blog authors. Our hypothesis is that significant changes in writing style distinguish pre-social media bloggers from post-social media bloggers. Through experimentation with a range of years, we found that the birth dates of students in college at the time when social media such as AIM, SMS text messaging, MySpace and Facebook first became popular, enable accurate age prediction. We also show that internet writing characteristics are important features for age prediction, but that lexical content is also needed to produce significantly more accurate results. Our best results allow for 81.57% accuracy.

6 0.56754452 26 acl-2011-A Speech-based Just-in-Time Retrieval System using Semantic Search

7 0.54444796 89 acl-2011-Creative Language Retrieval: A Robust Hybrid of Information Retrieval and Linguistic Creativity

8 0.54297513 36 acl-2011-An Efficient Indexer for Large N-Gram Corpora

9 0.51369649 20 acl-2011-A New Dataset and Method for Automatically Grading ESOL Texts

10 0.5113377 19 acl-2011-A Mobile Touchable Application for Online Topic Graph Extraction and Exploration of Web Content

11 0.50654042 13 acl-2011-A Graph Approach to Spelling Correction in Domain-Centric Search

12 0.50404859 120 acl-2011-Even the Abstract have Color: Consensus in Word-Colour Associations

13 0.49837735 67 acl-2011-Clairlib: A Toolkit for Natural Language Processing, Information Retrieval, and Network Analysis

14 0.49212879 97 acl-2011-Discovering Sociolinguistic Associations with Structured Sparsity

15 0.48831809 306 acl-2011-Towards Style Transformation from Written-Style to Audio-Style

16 0.48301071 55 acl-2011-Automatically Predicting Peer-Review Helpfulness

17 0.48077995 80 acl-2011-ConsentCanvas: Automatic Texturing for Improved Readability in End-User License Agreements

18 0.47419035 212 acl-2011-Local Histograms of Character N-grams for Authorship Attribution

19 0.4614419 125 acl-2011-Exploiting Readymades in Linguistic Creativity: A System Demonstration of the Jigsaw Bard

20 0.45618531 246 acl-2011-Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.035), (17, 0.038), (26, 0.041), (37, 0.057), (39, 0.031), (41, 0.067), (55, 0.02), (58, 0.275), (59, 0.041), (72, 0.037), (88, 0.013), (91, 0.05), (96, 0.176), (97, 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.76718634 319 acl-2011-Unsupervised Decomposition of a Document into Authorial Components

Author: Moshe Koppel ; Navot Akiva ; Idan Dershowitz ; Nachum Dershowitz

Abstract: We propose a novel unsupervised method for separating out distinct authorial components of a document. In particular, we show that, given a book artificially “munged” from two thematically similar biblical books, we can separate out the two constituent books almost perfectly. This allows us to automatically recapitulate many conclusions reached by Bible scholars over centuries of research. One of the key elements of our method is exploitation of differences in synonym choice by different authors. 1

same-paper 2 0.7639153 248 acl-2011-Predicting Clicks in a Vocabulary Learning System

Author: Aaron Michelony

Abstract: We consider the problem of predicting which words a student will click in a vocabulary learning system. Often a language learner will find value in the ability to look up the meaning of an unknown word while reading an electronic document by clicking the word. Highlighting words likely to be unknown to a readeris attractive due to drawing his orher attention to it and indicating that information is available. However, this option is usually done manually in vocabulary systems and online encyclopedias such as Wikipedia. Furthurmore, it is never on a per-user basis. This paper presents an automated way of highlighting words likely to be unknown to the specific user. We present related work in search engine ranking, a description of the study used to collect click data, the experiment we performed using the random forest machine learning algorithm and finish with a discussion of future work.

3 0.69363892 61 acl-2011-Binarized Forest to String Translation

Author: Hao Zhang ; Licheng Fang ; Peng Xu ; Xiaoyun Wu

Abstract: Tree-to-string translation is syntax-aware and efficient but sensitive to parsing errors. Forestto-string translation approaches mitigate the risk of propagating parser errors into translation errors by considering a forest of alternative trees, as generated by a source language parser. We propose an alternative approach to generating forests that is based on combining sub-trees within the first best parse through binarization. Provably, our binarization forest can cover any non-consitituent phrases in a sentence but maintains the desirable property that for each span there is at most one nonterminal so that the grammar constant for decoding is relatively small. For the purpose of reducing search errors, we apply the synchronous binarization technique to forest-tostring decoding. Combining the two techniques, we show that using a fast shift-reduce parser we can achieve significant quality gains in NIST 2008 English-to-Chinese track (1.3 BLEU points over a phrase-based system, 0.8 BLEU points over a hierarchical phrase-based system). Consistent and significant gains are also shown in WMT 2010 in the English to German, French, Spanish and Czech tracks.

4 0.6241262 41 acl-2011-An Interactive Machine Translation System with Online Learning

Author: Daniel Ortiz-Martinez ; Luis A. Leiva ; Vicent Alabau ; Ismael Garcia-Varea ; Francisco Casacuberta

Abstract: State-of-the-art Machine Translation (MT) systems are still far from being perfect. An alternative is the so-called Interactive Machine Translation (IMT) framework, where the knowledge of a human translator is combined with the MT system. We present a statistical IMT system able to learn from user feedback by means of the application of online learning techniques. These techniques allow the MT system to update the parameters of the underlying models in real time. According to empirical results, our system outperforms the results of conventional IMT systems. To the best of our knowledge, this online learning capability has never been provided by previous IMT systems. Our IMT system is implemented in C++, JavaScript, and ActionScript; and is publicly available on the Web.

5 0.61469781 137 acl-2011-Fine-Grained Class Label Markup of Search Queries

Author: Joseph Reisinger ; Marius Pasca

Abstract: We develop a novel approach to the semantic analysis of short text segments and demonstrate its utility on a large corpus of Web search queries. Extracting meaning from short text segments is difficult as there is little semantic redundancy between terms; hence methods based on shallow semantic analysis may fail to accurately estimate meaning. Furthermore search queries lack explicit syntax often used to determine intent in question answering. In this paper we propose a hybrid model of semantic analysis combining explicit class-label extraction with a latent class PCFG. This class-label correlation (CLC) model admits a robust parallel approximation, allowing it to scale to large amounts of query data. We demonstrate its performance in terms of (1) its predicted label accuracy on polysemous queries and (2) its ability to accurately chunk queries into base constituents.

6 0.61449814 177 acl-2011-Interactive Group Suggesting for Twitter

7 0.61179465 145 acl-2011-Good Seed Makes a Good Crop: Accelerating Active Learning Using Language Modeling

8 0.61175376 240 acl-2011-ParaSense or How to Use Parallel Corpora for Word Sense Disambiguation

9 0.6107533 37 acl-2011-An Empirical Evaluation of Data-Driven Paraphrase Generation Techniques

10 0.61029744 67 acl-2011-Clairlib: A Toolkit for Natural Language Processing, Information Retrieval, and Network Analysis

11 0.60977525 11 acl-2011-A Fast and Accurate Method for Approximate String Search

12 0.60747945 207 acl-2011-Learning to Win by Reading Manuals in a Monte-Carlo Framework

13 0.60747737 86 acl-2011-Coreference for Learning to Extract Relations: Yes Virginia, Coreference Matters

14 0.60704482 286 acl-2011-Social Network Extraction from Texts: A Thesis Proposal

15 0.60677814 190 acl-2011-Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations

16 0.6064384 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment

17 0.60598135 244 acl-2011-Peeling Back the Layers: Detecting Event Role Fillers in Secondary Contexts

18 0.60541236 72 acl-2011-Collecting Highly Parallel Data for Paraphrase Evaluation

19 0.60534519 196 acl-2011-Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models

20 0.60509431 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing