
164 acl-2010-Learning Phrase-Based Spelling Error Models from Clickthrough Data


Source: pdf

Author: Xu Sun ; Jianfeng Gao ; Daniel Micol ; Chris Quirk

Abstract: This paper explores the use of clickthrough data for query spelling correction. First, large amounts of query-correction pairs are derived by analyzing users' query reformulation behavior encoded in the clickthrough data. Then, a phrase-based error model that accounts for the transformation probability between multi-term phrases is trained and integrated into a query speller system. Experiments are carried out on a human-labeled data set. Results show that the system using the phrase-based error model significantly outperforms its baseline systems.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 This paper explores the use of clickthrough data for query spelling correction. [sent-7, score-1.223]

2 First, large amounts of query-correction pairs are derived by analyzing users' query reformulation behavior encoded in the clickthrough data. [sent-8, score-0.854]

3 Then, a phrase-based error model that accounts for the transformation probability between multi-term phrases is trained and integrated into a query speller system. [sent-9, score-0.895]

4 Search queries present a particular challenge for traditional spelling correction methods for three main reasons (Ahmad and Kondrak, 2004). [sent-12, score-0.909]

5 First, spelling errors are more common in search queries than in regular written text: roughly 10-15% of queries contain misspelled terms (Cucerzan and Brill, 2004). [sent-13, score-1.101]

6 5% of valid search terms do not occur in their 200K-entry spelling lexicon. [sent-18, score-0.615]

7 Therefore, recent research has focused on the use of Web corpora and query logs, rather than … [sent-19, score-0.404]

8 Another important data source that would be useful for this purpose is clickthrough data. [sent-25, score-0.326]

9 Although it is well-known that clickthrough data contain rich information about users' search behavior, e. [sent-26, score-0.395]

10 , how a user (re-) formulates a query in order to find the relevant document, there has been little research on exploiting the data for the development of a query speller system. [sent-28, score-1.04]

11 In this paper we present a novel method of extracting large amounts of query-correction pairs from the clickthrough data. [sent-29, score-0.36]

12 These pairs, implicitly judged by millions of users, are used to train a set of spelling error models. [sent-30, score-0.659]

13 Results show that the error models learned from clickthrough data lead to significant improvements on the task of query spelling correction. [sent-39, score-1.359]

14 In particular, the speller system incorporating a phrase-based error model significantly outperforms its baseline systems. [sent-40, score-0.421]

15 To the best of our knowledge, this is the first extensive study of learning phrase-based error models from clickthrough data for query spelling correction. [sent-41, score-1.359]

16 Section 3 presents the way query-correction pairs are extracted from the clickthrough data. [sent-44, score-0.36]

17 In non-word error spelling correction, any word that is not found in a pre-compiled lexicon is considered to be misspelled. [sent-51, score-0.659]

18 Then, a list of lexical words that are similar to the misspelled word is proposed as candidate spelling corrections. [sent-52, score-0.746]

19 Real-word spelling correction is also referred to as context sensitive spelling correction (CSSC). [sent-61, score-1.514]

20 When designed to handle regular written text, both CSSC and non-word error speller systems rely on a pre-defined vocabulary (i. [sent-68, score-0.411]

21 However, in query spelling correction, it is impossible to compile such a vocabulary, and the boundary between the non-word and real-word errors is quite vague. [sent-71, score-0.897]

22 Therefore, recent research on query spelling correction has focused on exploiting noisy Web data and query logs to infer knowledge about misspellings and word usage in search queries. [sent-72, score-1.722]

23 Cucerzan and Brill (2004) discuss in detail the challenges of query spelling correction, and suggest the use of query logs. [sent-73, score-1.271]

24 Ahmad and Kondrak (2005) propose a method of estimating an error model from query logs using the EM algorithm. [sent-74, score-0.63]

25 (2006) extend the error model by capturing word-level similarities learned from query logs. [sent-76, score-0.543]

26 (2007) suggest using web search results to improve spelling correction. [sent-78, score-0.647]

27 (2009) present a query speller system in which both the error model and the language model are trained using Web data. [sent-80, score-0.828]

28 Compared to Web corpora and query logs, clickthrough data contain much richer information about users’ search behavior. [sent-81, score-0.769]

29 Although there has been a lot of research on using clickthrough data to improve Web document retrieval (e. [sent-82, score-0.326]

30 , 2009), the data have not been fully explored for query spelling correction. [sent-86, score-0.897]

31 This study tries to learn error models from clickthrough data. [sent-87, score-0.462]

32 To our knowledge, this is the first such attempt using clickthrough data. [sent-88, score-0.326]

33 Most of the speller systems reviewed above are based on the framework of the source channel model. [sent-89, score-0.35]

34 Typically, a language model (source model) is used to capture contextual information, while an error model (channel model) is considered to be context free in that it does not take into account any contextual information in modeling word transformation probabilities. [sent-90, score-0.363]

35 , 2003; Och and Ney, 2004), we propose a phrase-based error model where we assume that query spelling correction is performed at the phrase level. [sent-93, score-1.338]

36 In what follows, before presenting the phrase-based error model, we will first describe the clickthrough data and the query speller system we used in this study. [sent-94, score-1.088]

37 The clickthrough data of the first type has been widely used in previous research and proved to be useful for Web search (Joachims, 2002; Agichtein et al. [sent-97, score-0.395]

38 , 2009) and query reformulation (Wang and Zhai, 2008; Suzuki et al. [sent-99, score-0.494]

39 The data consist of a set of query sessions that were extracted from one year of log files from a commercial Web search engine. [sent-102, score-0.634]

40 A query session contains a query issued by a user and a ranked list of links (i. [sent-103, score-0.824]

41 We then scored each query pair (Q1, Q2) using the edit distance between Q1 and Q2, and retained those with an edit distance score lower than a pre-set threshold as query correction pairs. [sent-109, score-1.208]

42 The clickthrough data of the second type consists of a set of query reformulation sessions extracted from 3 months of log files from a commercial Web browser. [sent-112, score-0.984]

43 A query reformulation session contains a list of URLs that record user behaviors that relate to the query reformulation functions, provided by a Web search engine. [sent-113, score-1.127]

44 For example, almost all commercial search engines offer the "did you mean" function, suggesting a possible alternate interpretation or spelling of a user-issued query. [sent-114, score-0.654]

45 Figure 1 shows a sample of the query reformulation sessions that record the "did you mean" sessions from three of the most popular search engines. [sent-115, score-0.797]

46 These sessions encode the same user behavior: A user first queries for "harrypotter sheme park", and then clicks on the resulting spelling suggestion "harry potter theme park". [sent-116, score-1.193]

47 A sample of query reformulation sessions from three popular search engines. [sent-137, score-0.665]

48 These sessions show that a user first issues the query "harrypotter sheme park", and then clicks on the resulting spell suggestion "harry potter theme park". [sent-138, score-0.88]

49 From these three months of query reformulation sessions, we extracted about 3 million query-correction pairs. [sent-140, score-0.494]

50 Compared to the pairs extracted from the clickthrough data of the first type (query sessions), this data set is much cleaner because all these spelling corrections are actually clicked, and thus judged implicitly, by many users. [sent-141, score-0.991]

51 In addition to the "did you mean" function, recently some search engines have introduced two new spelling suggestion functions. [sent-142, score-0.642]

52 One is the "auto-correction" function, where the search engine is confident enough to automatically apply the spelling correction to the query and execute it to produce search results for the user. [sent-143, score-1.269]

53 Since our extraction approach focuses on user-approved spelling suggestions, we ignore the query reformulation sessions recording either of the two functions. [sent-146, score-1.119]

54 Although by doing so we could miss some basic, obvious spelling corrections, our experiments show that the negative impact on error model training is negligible. [sent-147, score-0.692]

55 One possible reason is that our baseline system, which does not use any error model learned from the clickthrough data, is already able to correct these basic, obvious spelling mistakes. [sent-148, score-1.018]

56 First, we extracted a set of queries from the sessions where no spell suggestion is presented or clicked on. [sent-154, score-0.38]

57 We do so by running a sanity check of the queries against our baseline spelling correction system, which will be described in Section 6. [sent-156, score-0.909]

58 If the system thinks an input query is misspelled, we assumed it was an obvious misspelling, and removed it. [sent-157, score-0.374]

59 4 The Baseline Speller System The spelling correction problem is typically formulated under the framework of the source channel model. [sent-159, score-0.855]

60 The speller system used in our experiments is based on a ranking model (or ranker), which can be viewed as a generalization of the source channel model. [sent-210, score-0.383]

61 In candidate generation, an input query is first tokenized into a sequence of terms. [sent-212, score-0.415]

62 Then we scan the query from left to right, and each query term q is looked up in lexicon to generate a list of spelling suggestions c whose edit distance from q is lower than a preset threshold. [sent-213, score-1.409]

63 The lexicon we used contains around 430,000 entries; these are high frequency query terms collected from one year of search query logs. [sent-214, score-0.844]

64 The set of all the generated spelling suggestions is stored using a lattice data structure, which is a compact representation of exponentially many possible candidate spelling corrections. [sent-216, score-1.112]

65 The language model (the second factor) is a backoff bigram model trained on the tokenized form of one year of query logs, using maximum likelihood estimation with absolute discounting smoothing. [sent-218, score-0.467]

66 Notice that we always include the input query Q in the 20-best candidate list. [sent-233, score-0.415]

67 The core of the second component of the speller system is a ranker, which re-ranks the 20-best candidate spelling corrections. [sent-234, score-0.816]

68 If the top C after re-ranking is different than the original query Q, the system returns C as the correction. [sent-235, score-0.374]

69 Let f be a feature vector extracted from a query and candidate spelling correction pair (Q, C). [sent-236, score-1.172]

70 … accuracy on a set of hu… [Figure 2 content] C: “disney theme park” (correct query); S: [“disney”, “theme park”] (segmentation); T: [“disnee”, “theme part”] (translation); M: (1 → … [sent-245, score-0.472]

71 … 1) (permutation); Q: “theme part disnee” (misspelled query). Figure 2: Example demonstrating the generative procedure behind the phrase-based error model. [sent-247, score-0.726]

72 , the edit distance function) as features, the ranker can be viewed as a more general framework, subsuming the source channel model as a special case. [sent-252, score-0.335]

73 5 A Phrase-Based Error Model The goal of the phrase-based error model is to transform a correctly spelled query C into a misspelled query Q. [sent-254, score-1.159]

74 We assume the following generative story: first the correctly spelled query C is broken into K non-empty word sequences c1, …, cK, then each is replaced with a new non-empty word sequence q1, …, qK, and finally these phrases are permuted and concatenated to form the misspelled Q. [sent-257, score-0.616]

75 For the sole remaining factor P(T|C, S), we make the assumption that a segmented query T = q1 … qK is generated from left to right by transforming each phrase c1 … cK independently: P(T|C, S) = ∏_{k=1..K} P(qk|ck). [Algorithm figure residue: Input: biPhraseLattice “PL” with length = K and height = L; Initialization: biPhrase.] [sent-363, score-0.441]

76 Notice that when we set L=1, the phrase-based error model is reduced to a word-based error model which assumes that words are transformed independently from C to Q, without taking into account any contextual information. [sent-453, score-0.385]

77 Throughout this section, we have approached this model in a noisy channel approach, finding probabilities of the misspelled query given the corrected query. [sent-577, score-0.74]

78 However, the method can be run in both directions, and in practice SMT systems benefit from also including the direct probability of the corrected query given this misspelled query (Och, 2002). [sent-578, score-0.987]

79 3 Phrase-Based Error Model Features To use the phrase-based error model for spelling correction, we derive five features and integrate them into the ranker-based query speller system, described in Section 4. [sent-580, score-1.318]

80 Unaligned word penalty feature: the feature is defined as the ratio between the number of unaligned query words and the total number of query words. [sent-631, score-0.748]

81 Experiments We evaluate the spelling error models on a large scale real world data set containing 24,172 queries sampled from one year’s worth of query logs from a commercial search engine. [sent-632, score-1.38]

82 The spelling of each query is judged and corrected by four annotators. [sent-633, score-0.921]

83 The training data contains 8,515 query-correction pairs, among which 1,743 queries are misspelled (i. [sent-636, score-0.334]

84 • Precision: The number of correct spelling corrections for misspelled queries generated by the system divided by the total number of corrections generated by the system. [sent-644, score-1.073]

85 • Recall: The number of correct spelling corrections for misspelled queries generated by the system divided by the total number of misspelled queries in the test set. [sent-645, score-1.299]

86 Moreover, since we proposed to use clickthrough data for spelling correction, it is interesting to study the impact on spelling performance from the size of clickthrough data used for training. [sent-672, score-1.698]

87 The results show first and foremost that the ranker-based system significantly outperforms the spelling system based solely on the source-channel model, largely due to the richer … In our experiments, all the speller systems are ranker-based. [sent-674, score-0.775]

88 Row 2 is the ranker-based spelling system that uses all 96 ranking features, as described in Section 4. [sent-687, score-0.523]

89 The other is a phonetic model that measures the edit distance between the metaphones (Philips, 1990) of a query word and its aligned correction word. [sent-690, score-0.781]

90 Second, the error model learned from clickthrough data leads to significant improvements (Rows 3 and 4 vs. [sent-693, score-0.495]

91 This paper extends the recent research on using Web data and query logs for query spelling correction in two aspects. [sent-703, score-1.592]

92 query-correction pairs) can be extracted from clickthrough data, focusing on query reformulation sessions. [sent-706, score-0.82]

93 Second, we argue that it is critical to capture contextual information for query spelling correction. [sent-708, score-0.944]

94 To this end, we propose a new phrase-based error model, which leads to significant improvement in our spelling correction experiments. [sent-709, score-0.893]

95 For example, in future work we plan to investigate the combination of the clickthrough data collected from a Web browser with the noisy but large query sessions collected from a commercial search engine. [sent-711, score-0.939]

96 Learning a spelling error model from search query logs. [sent-723, score-1.135]

97 An improved error model for noisy channel spelling correction. [sent-729, score-0.819]

98 A spelling correction program based on a noisy channel model. [sent-779, score-0.884]

99 Exploring distributional similarity based models for query spelling correction. [sent-804, score-0.897]

100 Mining term association patterns from search logs for effective query reformulation. [sent-849, score-0.53]
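The generative story in sentences 73 to 75 and the Figure 2 example can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `generate_misspelling`, the 0-based encoding of the permutation M, and every probability in `PHRASE_TABLE` are assumptions made for this example. Only the independence assumption, under which the probability is a product over phrase pairs, is taken from the text.

```python
from math import prod

# Hypothetical phrase transformation probabilities P(q | c).
# The numbers are illustrative only and do not come from the paper.
PHRASE_TABLE = {
    ("disney", "disnee"): 0.01,
    ("theme park", "theme part"): 0.02,
}


def generate_misspelling(segments, translations, permutation):
    """Apply translation and permutation to a segmented correct query.

    segments:     phrases c_1..c_K of the correct query C (segmentation S)
    translations: phrases q_1..q_K, where q_k replaces c_k (translation T)
    permutation:  permutation[k] is the 0-based output position of q_k (M)

    Returns the misspelled query Q and the product of P(q_k | c_k) over k.
    """
    if not (len(segments) == len(translations) == len(permutation)):
        raise ValueError("S, T and M must all have length K")
    if sorted(permutation) != list(range(len(segments))):
        raise ValueError("M must be a permutation of 0..K-1")

    # Each phrase is transformed independently, so probabilities multiply.
    prob = prod(PHRASE_TABLE[(c, q)] for c, q in zip(segments, translations))

    # Place each translated phrase at its permuted position.
    reordered = [""] * len(translations)
    for k, position in enumerate(permutation):
        reordered[position] = translations[k]
    return " ".join(reordered), prob


# Figure 2 example: M = (1 -> 2, 2 -> 1) becomes [1, 0] in 0-based form.
query, prob = generate_misspelling(
    ["disney", "theme park"],
    ["disnee", "theme part"],
    [1, 0],
)
print(query, prob)
```

With the Figure 2 inputs, the sketch reproduces the misspelled query "theme part disnee". A real error model would estimate the phrase table from the extracted query-correction pairs rather than hard-code it.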


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('spelling', 0.523), ('query', 0.374), ('clickthrough', 0.326), ('speller', 0.252), ('correction', 0.234), ('misspelled', 0.182), ('queries', 0.152), ('error', 0.136), ('reformulation', 0.12), ('corrections', 0.108), ('sessions', 0.102), ('channel', 0.098), ('theme', 0.098), ('ranker', 0.091), ('logs', 0.087), ('brill', 0.085), ('edit', 0.082), ('harrypotter', 0.08), ('potter', 0.08), ('sheme', 0.08), ('ypre', 0.08), ('search', 0.069), ('transformation', 0.067), ('park', 0.064), ('spelled', 0.06), ('row', 0.057), ('urls', 0.056), ('web', 0.055), ('alignment', 0.051), ('harry', 0.05), ('suggestion', 0.05), ('biphrase', 0.048), ('biphrasepre', 0.048), ('clicked', 0.048), ('cssc', 0.048), ('kernighan', 0.048), ('totalprob', 0.048), ('contextual', 0.047), ('gao', 0.047), ('earch', 0.043), ('candidate', 0.041), ('user', 0.04), ('commercial', 0.039), ('equations', 0.039), ('ahmad', 0.039), ('whitelaw', 0.039), ('spe', 0.039), ('phrase', 0.038), ('qk', 0.036), ('agichtein', 0.036), ('issued', 0.036), ('moore', 0.036), ('kondrak', 0.034), ('permutation', 0.034), ('cucerzan', 0.034), ('pairs', 0.034), ('probability', 0.033), ('net', 0.033), ('model', 0.033), ('disnee', 0.032), ('disney', 0.032), ('kukich', 0.032), ('mangu', 0.032), ('maxprobpre', 0.032), ('misspellings', 0.032), ('philips', 0.032), ('probincrs', 0.032), ('xpre', 0.032), ('distance', 0.031), ('och', 0.03), ('suzuki', 0.03), ('cro', 0.03), ('levenshtein', 0.03), ('record', 0.03), ('transforming', 0.029), ('noisy', 0.029), ('ck', 0.028), ('microsoft', 0.028), ('golding', 0.028), ('spell', 0.028), ('clicks', 0.028), ('alignments', 0.027), ('year', 0.027), ('aligned', 0.027), ('suggestions', 0.025), ('mi', 0.025), ('correcting', 0.025), ('estimate', 0.024), ('smt', 0.024), ('rows', 0.024), ('church', 0.024), ('argm', 0.024), ('corrected', 0.024), ('neural', 0.024), ('log', 0.023), ('regular', 0.023), ('valid', 0.023), ('li', 0.023), ('alternate', 0.023), ('okazaki', 0.023)]
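The page does not say how the simValue scores in the lists below are derived from these weights. A common choice for comparing tfidf vectors is cosine similarity, so the sketch below assumes that measure purely for illustration. The function name `cosine_similarity` and the weights in `other_paper` are hypothetical; the three weights in `this_paper` are copied from the list above.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Top three weights for this paper, copied from the list above.
this_paper = {"spelling": 0.523, "query": 0.374, "clickthrough": 0.326}

# Hypothetical weights for another paper, for illustration only.
other_paper = {"query": 0.6, "semantic": 0.5, "intent": 0.4}

print(cosine_similarity(this_paper, other_paper))
```

Only the shared term "query" contributes to the dot product here, which is why two papers about search queries can score as similar even when their other top terms differ.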

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999928 164 acl-2010-Learning Phrase-Based Spelling Error Models from Clickthrough Data

Author: Xu Sun ; Jianfeng Gao ; Daniel Micol ; Chris Quirk

Abstract: This paper explores the use of clickthrough data for query spelling correction. First, large amounts of query-correction pairs are derived by analyzing users' query reformulation behavior encoded in the clickthrough data. Then, a phrase-based error model that accounts for the transformation probability between multi-term phrases is trained and integrated into a query speller system. Experiments are carried out on a human-labeled data set. Results show that the system using the phrase-based error model significantly outperforms its baseline systems.

2 0.21358632 245 acl-2010-Understanding the Semantic Structure of Noun Phrase Queries

Author: Xiao Li

Abstract: Determining the semantic intent of web queries not only involves identifying their semantic class, which is a primary focus of previous works, but also understanding their semantic structure. In this work, we formally define the semantic structure of noun phrase queries as comprised of intent heads and intent modifiers. We present methods that automatically identify these constituents as well as their semantic roles based on Markov and semi-Markov conditional random fields. We show that the use of semantic features and syntactic features significantly contribute to improving the understanding performance.

3 0.13223071 215 acl-2010-Speech-Driven Access to the Deep Web on Mobile Devices

Author: Taniya Mishra ; Srinivas Bangalore

Abstract: The Deep Web is the collection of information repositories that are not indexed by search engines. These repositories are typically accessible through web forms and contain dynamically changing information. In this paper, we present a system that allows users to access such rich repositories of information on mobile devices using spoken language.

4 0.12945917 24 acl-2010-Active Learning-Based Elicitation for Semi-Supervised Word Alignment

Author: Vamshi Ambati ; Stephan Vogel ; Jaime Carbonell

Abstract: Semi-supervised word alignment aims to improve the accuracy of automatic word alignment by incorporating full or partial manual alignments. Motivated by standard active learning query sampling frameworks like uncertainty-, margin- and query-by-committee sampling we propose multiple query strategies for the alignment link selection task. Our experiments show that by active selection of uncertain and informative links, we reduce the overall manual effort involved in elicitation of alignment link data for training a semisupervised word aligner.

5 0.11435448 177 acl-2010-Multilingual Pseudo-Relevance Feedback: Performance Study of Assisting Languages

Author: Manoj Kumar Chinnakotla ; Karthik Raman ; Pushpak Bhattacharyya

Abstract: In our previous work (Chinnakotla et al., 2010), we introduced a novel framework for Pseudo-Relevance Feedback (PRF) called MultiPRF. Given a query in one language called Source, we used English as the Assisting Language to improve the performance of PRF for the source language. MultiPRF showed remarkable improvement over plain Model Based Feedback (MBF) uniformly for 4 languages, viz., French, German, Hungarian and Finnish with English as the assisting language. This fact inspired us to study the effect of any source-assistant pair on MultiPRF performance from out of a set of languages with widely different characteristics, viz., Dutch, English, Finnish, French, German and Spanish. Carrying this further, we looked into the effect of using two assisting languages together on PRF. The present paper is a report of these investigations, their results and conclusions drawn therefrom. While performance improvement on MultiPRF is observed whatever the assisting language and whatever the source, observations are mixed when two assisting languages are used simultaneously. Interestingly, the performance improvement is more pronounced when the source and assisting languages are closely related, e.g., French and Spanish.

6 0.10739067 76 acl-2010-Creating Robust Supervised Classifiers via Web-Scale N-Gram Data

7 0.09190584 240 acl-2010-Training Phrase Translation Models with Leaving-One-Out

8 0.086358249 129 acl-2010-Growing Related Words from Seed via User Behaviors: A Re-Ranking Based Approach

9 0.079479396 102 acl-2010-Error Detection for Statistical Machine Translation Using Linguistic Features

10 0.078600638 106 acl-2010-Event-Based Hyperspace Analogue to Language for Query Expansion

11 0.071335047 261 acl-2010-Wikipedia as Sense Inventory to Improve Diversity in Web Search Results

12 0.068513311 75 acl-2010-Correcting Errors in a Treebank Based on Synchronous Tree Substitution Grammar

13 0.067479186 170 acl-2010-Letter-Phoneme Alignment: An Exploration

14 0.066635966 133 acl-2010-Hierarchical Search for Word Alignment

15 0.061724305 27 acl-2010-An Active Learning Approach to Finding Related Terms

16 0.058164358 90 acl-2010-Diversify and Combine: Improving Word Alignment for Machine Translation on Low-Resource Languages

17 0.055772338 147 acl-2010-Improving Statistical Machine Translation with Monolingual Collocation

18 0.055744346 87 acl-2010-Discriminative Modeling of Extraction Sets for Machine Translation

19 0.052953098 85 acl-2010-Detecting Experiences from Weblogs

20 0.052660014 123 acl-2010-Generating Focused Topic-Specific Sentiment Lexicons


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.168), (1, -0.031), (2, -0.072), (3, 0.008), (4, 0.044), (5, -0.016), (6, -0.058), (7, 0.034), (8, 0.033), (9, -0.026), (10, -0.054), (11, 0.007), (12, -0.065), (13, -0.127), (14, 0.01), (15, 0.093), (16, -0.077), (17, -0.025), (18, -0.021), (19, 0.043), (20, 0.049), (21, -0.144), (22, 0.09), (23, 0.026), (24, 0.124), (25, -0.032), (26, 0.162), (27, -0.065), (28, -0.021), (29, 0.072), (30, 0.056), (31, -0.07), (32, -0.096), (33, -0.219), (34, 0.117), (35, -0.099), (36, -0.14), (37, -0.143), (38, -0.016), (39, 0.09), (40, 0.128), (41, 0.089), (42, 0.003), (43, -0.058), (44, 0.011), (45, 0.173), (46, 0.019), (47, -0.074), (48, 0.042), (49, -0.005)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9522509 164 acl-2010-Learning Phrase-Based Spelling Error Models from Clickthrough Data

Author: Xu Sun ; Jianfeng Gao ; Daniel Micol ; Chris Quirk

Abstract: This paper explores the use of clickthrough data for query spelling correction. First, large amounts of query-correction pairs are derived by analyzing users' query reformulation behavior encoded in the clickthrough data. Then, a phrase-based error model that accounts for the transformation probability between multi-term phrases is trained and integrated into a query speller system. Experiments are carried out on a human-labeled data set. Results show that the system using the phrase-based error model significantly outperforms its baseline systems.

2 0.81440175 245 acl-2010-Understanding the Semantic Structure of Noun Phrase Queries

Author: Xiao Li

Abstract: Determining the semantic intent of web queries not only involves identifying their semantic class, which is a primary focus of previous works, but also understanding their semantic structure. In this work, we formally define the semantic structure of noun phrase queries as comprised of intent heads and intent modifiers. We present methods that automatically identify these constituents as well as their semantic roles based on Markov and semi-Markov conditional random fields. We show that the use of semantic features and syntactic features significantly contribute to improving the understanding performance.

3 0.75751191 177 acl-2010-Multilingual Pseudo-Relevance Feedback: Performance Study of Assisting Languages

Author: Manoj Kumar Chinnakotla ; Karthik Raman ; Pushpak Bhattacharyya

Abstract: In our previous work (Chinnakotla et al., 2010), we introduced a novel framework for Pseudo-Relevance Feedback (PRF) called MultiPRF. Given a query in one language called Source, we used English as the Assisting Language to improve the performance of PRF for the source language. MultiPRF showed remarkable improvement over plain Model Based Feedback (MBF) uniformly for 4 languages, viz., French, German, Hungarian and Finnish with English as the assisting language. This fact inspired us to study the effect of any source-assistant pair on MultiPRF performance from out of a set of languages with widely different characteristics, viz., Dutch, English, Finnish, French, German and Spanish. Carrying this further, we looked into the effect of using two assisting languages together on PRF. The present paper is a report of these investigations, their results and conclusions drawn therefrom. While performance improvement on MultiPRF is observed whatever the assisting language and whatever the source, observations are mixed when two assisting languages are used simultaneously. Interestingly, the performance improvement is more pronounced when the source and assisting languages are closely related, e.g., French and Spanish.

4 0.55859578 106 acl-2010-Event-Based Hyperspace Analogue to Language for Query Expansion

Author: Tingxu Yan ; Tamsin Maxwell ; Dawei Song ; Yuexian Hou ; Peng Zhang

Abstract: Bag-of-words approaches to information retrieval (IR) are effective but assume independence between words. The Hyperspace Analogue to Language (HAL) is a cognitively motivated and validated semantic space model that captures statistical dependencies between words by considering their co-occurrences in a surrounding window of text. HAL has been successfully applied to query expansion in IR, but has several limitations, including high processing cost and use of distributional statistics that do not exploit syntax. In this paper, we pursue two methods for incorporating syntactic-semantic information from textual ‘events’ into HAL. We build the HAL space directly from events to investigate whether processing costs can be reduced through more careful definition of word co-occurrence, and improve the quality of the pseudo-relevance feedback by applying event information as a constraint during HAL construction. Both methods significantly improve performance results in comparison with original HAL, and interpolation of HAL and relevance model expansion outperforms either method alone.

5 0.5161671 215 acl-2010-Speech-Driven Access to the Deep Web on Mobile Devices

Author: Taniya Mishra ; Srinivas Bangalore

Abstract: The Deep Web is the collection of information repositories that are not indexed by search engines. These repositories are typically accessible through web forms and contain dynamically changing information. In this paper, we present a system that allows users to access such rich repositories of information on mobile devices using spoken language.

6 0.40074661 129 acl-2010-Growing Related Words from Seed via User Behaviors: A Re-Ranking Based Approach

7 0.39901182 76 acl-2010-Creating Robust Supervised Classifiers via Web-Scale N-Gram Data

8 0.39565644 252 acl-2010-Using Parse Features for Preposition Selection and Error Detection

9 0.38384843 24 acl-2010-Active Learning-Based Elicitation for Semi-Supervised Word Alignment

10 0.33671576 102 acl-2010-Error Detection for Statistical Machine Translation Using Linguistic Features

11 0.32011196 259 acl-2010-WebLicht: Web-Based LRT Services for German

12 0.3071658 27 acl-2010-An Active Learning Approach to Finding Related Terms

13 0.30440775 151 acl-2010-Intelligent Selection of Language Model Training Data

14 0.30421785 7 acl-2010-A Generalized-Zero-Preserving Method for Compact Encoding of Concept Lattices

15 0.29670024 254 acl-2010-Using Speech to Reply to SMS Messages While Driving: An In-Car Simulator User Study

16 0.2898266 246 acl-2010-Unsupervised Discourse Segmentation of Documents with Inherently Parallel Structure

17 0.28972945 123 acl-2010-Generating Focused Topic-Specific Sentiment Lexicons

18 0.27674517 116 acl-2010-Finding Cognate Groups Using Phylogenies

19 0.27506435 133 acl-2010-Hierarchical Search for Word Alignment

20 0.26920545 204 acl-2010-Recommendation in Internet Forums and Blogs


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(14, 0.01), (25, 0.051), (39, 0.015), (42, 0.03), (44, 0.014), (54, 0.282), (59, 0.114), (72, 0.011), (73, 0.058), (78, 0.022), (83, 0.068), (84, 0.026), (98, 0.189)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.82785261 177 acl-2010-Multilingual Pseudo-Relevance Feedback: Performance Study of Assisting Languages

Author: Manoj Kumar Chinnakotla ; Karthik Raman ; Pushpak Bhattacharyya

Abstract: In a previous work of ours (Chinnakotla et al., 2010), we introduced a novel framework for Pseudo-Relevance Feedback (PRF) called MultiPRF. Given a query in one language called Source, we used English as the Assisting Language to improve the performance of PRF for the source language. MultiPRF showed remarkable improvement over plain Model Based Feedback (MBF) uniformly for 4 languages, viz., French, German, Hungarian and Finnish with English as the assisting language. This fact inspired us to study the effect of any source-assistant pair on MultiPRF performance from out of a set of languages with widely different characteristics, viz., Dutch, English, Finnish, French, German and Spanish. Carrying this further, we looked into the effect of using two assisting languages together on PRF. The present paper is a report of these investigations, their results and conclusions drawn therefrom. While performance improvement on MultiPRF is observed whatever the assisting language and whatever the source, observations are mixed when two assisting languages are used simultaneously. Interestingly, the performance improvement is more pronounced when the source and assisting languages are closely related, e.g., French and Spanish.

same-paper 2 0.81677437 164 acl-2010-Learning Phrase-Based Spelling Error Models from Clickthrough Data

Author: Xu Sun ; Jianfeng Gao ; Daniel Micol ; Chris Quirk

Abstract: This paper explores the use of clickthrough data for query spelling correction. First, large amounts of query-correction pairs are derived by analyzing users' query reformulation behavior encoded in the clickthrough data. Then, a phrase-based error model that accounts for the transformation probability between multi-term phrases is trained and integrated into a query speller system. Experiments are carried out on a human-labeled data set. Results show that the system using the phrase-based error model significantly outperforms its baseline systems.

3 0.80181456 198 acl-2010-Predicate Argument Structure Analysis Using Transformation Based Learning

Author: Hirotoshi Taira ; Sanae Fujita ; Masaaki Nagata

Abstract: Maintaining high annotation consistency in large corpora is crucial for statistical learning; however, such work is hard, especially for tasks containing semantic elements. This paper describes predicate argument structure analysis using transformation-based learning. An advantage of transformation-based learning is the readability of learned rules. A disadvantage is that the rule extraction procedure is time-consuming. We present incremental-based, transformation-based learning for semantic processing tasks. As an example, we deal with Japanese predicate argument analysis and show some tendencies of annotators for constructing a corpus with our method.

4 0.77035499 32 acl-2010-Arabic Named Entity Recognition: Using Features Extracted from Noisy Data

Author: Yassine Benajiba ; Imed Zitouni ; Mona Diab ; Paolo Rosso

Abstract: Building an accurate Named Entity Recognition (NER) system for languages with complex morphology is a challenging task. In this paper, we present research that explores the feature space using both gold and bootstrapped noisy features to build an improved highly accurate Arabic NER system. We bootstrap noisy features by projection from an Arabic-English parallel corpus that is automatically tagged with a baseline NER system. The feature space covers lexical, morphological, and syntactic features. The proposed approach yields an improvement of up to 1.64 F-measure (absolute).

5 0.64460897 79 acl-2010-Cross-Lingual Latent Topic Extraction

Author: Duo Zhang ; Qiaozhu Mei ; ChengXiang Zhai

Abstract: Probabilistic latent topic models have recently enjoyed much success in extracting and analyzing latent topics in text in an unsupervised way. One common deficiency of existing topic models, though, is that they would not work well for extracting cross-lingual latent topics simply because words in different languages generally do not co-occur with each other. In this paper, we propose a way to incorporate a bilingual dictionary into a probabilistic topic model so that we can apply topic models to extract shared latent topics in text data of different languages. Specifically, we propose a new topic model called Probabilistic Cross-Lingual Latent Semantic Analysis (PCLSA) which extends the Probabilistic Latent Semantic Analysis (PLSA) model by regularizing its likelihood function with soft constraints defined based on a bilingual dictionary. Both qualitative and quantitative experimental results show that the PCLSA model can effectively extract cross-lingual latent topics from multilingual text data.

6 0.64209116 170 acl-2010-Letter-Phoneme Alignment: An Exploration

7 0.64206469 87 acl-2010-Discriminative Modeling of Extraction Sets for Machine Translation

8 0.6413343 52 acl-2010-Bitext Dependency Parsing with Bilingual Subtree Constraints

9 0.64104646 83 acl-2010-Dependency Parsing and Projection Based on Word-Pair Classification

10 0.63876528 133 acl-2010-Hierarchical Search for Word Alignment

11 0.6384235 232 acl-2010-The S-Space Package: An Open Source Package for Word Space Models

12 0.63725507 218 acl-2010-Structural Semantic Relatedness: A Knowledge-Based Method to Named Entity Disambiguation

13 0.63690913 88 acl-2010-Discriminative Pruning for Discriminative ITG Alignment

14 0.63591325 48 acl-2010-Better Filtration and Augmentation for Hierarchical Phrase-Based Translation Rules

15 0.63581407 15 acl-2010-A Semi-Supervised Key Phrase Extraction Approach: Learning from Title Phrases through a Document Semantic Network

16 0.63559115 172 acl-2010-Minimized Models and Grammar-Informed Initialization for Supertagging with Highly Ambiguous Lexicons

17 0.63485432 144 acl-2010-Improved Unsupervised POS Induction through Prototype Discovery

18 0.63377827 261 acl-2010-Wikipedia as Sense Inventory to Improve Diversity in Web Search Results

19 0.63370478 146 acl-2010-Improving Chinese Semantic Role Labeling with Rich Syntactic Features

20 0.63341653 80 acl-2010-Cross Lingual Adaptation: An Experiment on Sentiment Classifications