emnlp emnlp2010 emnlp2010-90 knowledge-graph by maker-knowledge-mining

90 emnlp-2010-Positional Language Models for Clinical Information Retrieval


Source: pdf

Author: Florian Boudin ; Jian-Yun Nie ; Martin Dawes

Abstract: The PECO framework is a knowledge representation for formulating clinical questions. Queries are decomposed into four aspects, which are Patient-Problem (P), Exposure (E), Comparison (C) and Outcome (O). However, no test collection is available to evaluate such a framework in information retrieval. In this work, we first present the construction of a large test collection extracted from systematic literature reviews. We then describe an analysis of the distribution of PECO elements throughout the relevant documents and propose a language modeling approach that uses these distributions as a weighting strategy. In our experiments carried out on a collection of 1.5 million documents and 423 queries, our method was found to lead to an improvement of 28% in MAP and 50% in P@5, as compared to the state-of-the-art method.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract. The PECO framework is a knowledge representation for formulating clinical questions. [sent-8, score-0.422]

2 We then describe an analysis of the distribution of PECO elements throughout the relevant documents and propose a language modeling approach that uses these distributions as a weighting strategy. [sent-12, score-0.419]

3 MEDLINE, the authoritative repository of citations from the medical and bio-medical domain, contains more than 18 million citations. [sent-16, score-0.293]

4 Searching for clinically relevant information within this large amount of data is a difficult task that medical professionals are often unable to complete in a timely manner. [sent-17, score-0.209]

5 Better access to clinical evidence represents a high-impact application for physicians. [sent-18, score-0.429]

6 Practicing EBM means integrating individual clinical expertise with the best available external clinical evidence from systematic research. [sent-26, score-0.897]

7 It involves tracking down the best evidence from randomized trials or meta-analyses with which to answer clinical questions. [sent-27, score-0.494]

8 Richardson et al. (1995) identified the following four aspects as the key elements of a well-built clinical question: Patient-problem: what are the patient characteristics (e. [sent-29, score-0.614]

9 Physicians are educated to formulate their clinical questions with respect to this structure. [sent-47, score-0.439]

10 PubMed, the most widely used search interface, does not yet allow users to formulate PECO queries. [sent-53, score-0.192]

11 For the previously mentioned clinical question, a physician would use the query “Treadmill AND Parkinson’s disease”. [sent-54, score-0.491]

12 One can, for example, differentiate two queries in which a disease is either a patient condition or a clinical outcome. [sent-57, score-0.648]

13 This conceptual decomposition of queries is also particularly useful in the sense that it can be used to balance the importance of each element in the search process. [sent-58, score-0.3]

14 Another important factor that has prevented researchers from testing approaches to clinical information retrieval (IR) based on PECO elements is the lack of a test collection, which contains a set of documents, a set of queries and the relevance judgments. [sent-59, score-0.841]

15 In this paper, we take advantage of the systematic reviews about clinical questions from Cochrane. [sent-61, score-0.546]

16 Each Cochrane review examines a clinical question in depth and surveys all the available relevant publications. [sent-62, score-0.494]

17 We transformed them into a TREC-like test collection, which contains 423 queries and 8926 relevant documents extracted from MEDLINE. [sent-64, score-0.344]

18 One can then match the PECO elements in the query to the elements detected in documents. [sent-67, score-0.433]

19 However, as previous studies have shown, it is very difficult to accurately annotate PECO elements in documents automatically. [sent-68, score-0.192]

20 To bypass this issue, we propose an alternative that relies on the observed positional distribution of these elements in documents. [sent-69, score-0.217]

21 2 Related work. The need to answer clinical questions related to patient care using IR systems has been well studied and documented (Hersh et al. [sent-79, score-0.473]

22 There are a limited but growing number of studies trying to use the PECO elements in the retrieval process. [sent-83, score-0.266]

23 (Demner-Fushman and Lin, 2007) is one of the few such studies, in which a series of knowledge extractors is used to detect PECO elements in documents. [sent-84, score-0.174]

24 These elements are later used to re-rank a list of retrieved citations from PubMed. [sent-85, score-0.402]

25 The reported results indicate that their method can bring relevant citations into higher-ranking positions, and from these abstracts generate responses that answer clinicians’ questions. [sent-86, score-0.283]

26 This study demonstrates the value of the PECO framework as a method for structuring clinical questions. [sent-87, score-0.406]

27 However, as the focus was on the post-retrieval step (for question-answering), it is not clear whether PECO elements are useful at the retrieval step. [sent-88, score-0.248]

28 Intuitively, the integration of PECO elements in the retrieval process can also lead to higher retrieval effectiveness. [sent-89, score-0.322]

29 The most obvious scenario for testing this would be to recognize PECO elements in documents prior to indexing. [sent-90, score-0.294]

30 When a PECO-structured query is formulated, it is matched against the PECO elements in the documents (Dawes et al. [sent-91, score-0.379]

31 Nevertheless, the task of automatically identifying PECO elements is a very difficult one. [sent-93, score-0.174]

32 playing a major role in the clinical study) or secondary elements. [sent-101, score-0.406]

33 Notwithstanding, experiments conducted using a collection of documents annotated at the sentence level showed only a small increase in retrieval accuracy (Boudin et al. [sent-110, score-0.26]

34 They show that a large improvement in retrieval effectiveness can be obtained this way and indicate that the weights learned automatically are correlated with the observed distribution of PECO elements in documents. [sent-115, score-0.271]

35 In this work, we propose to go one step further in this direction by analyzing the distribution of PECO elements in a large number of documents and defining the positional probabilities of PECO elements accordingly. [sent-116, score-0.534]

36 3 Construction of the test collection. Despite the increasing use of search engines by medical professionals, there is no standard test collection for evaluating clinical IR. [sent-118, score-0.667]

37 Systematic reviews try to identify, appraise, select and synthesize all high-quality research evidence relevant to a clinical question. [sent-121, score-0.533]

38 In particular, a review contains a reference section listing all the studies relevant to the clinical question. [sent-124, score-0.512]

39 We gathered a subset of Cochrane systematic reviews and asked a group of annotators, one professor and four Master's students in family medicine, to create PECO-structured queries corresponding to the clinical questions. [sent-128, score-0.696]

40 As clinical questions answered in these reviews cover various aspects of one topic, multiple variants of precise PECO queries were generated for each review. [sent-129, score-0.649]

41 Moreover, in order to be able to compare a PECO-based search strategy to a real-world scenario, this group also provided the keyword-based queries that they would have used to search PubMed. [sent-130, score-0.237]

42 Below is an example of queries generated from the systematic review about “Aspirin with or without an antiemetic for acute migraine headaches in adults”: Keyword-based query: [aspirin and migraine]; PECO-structured queries: 1. [sent-131, score-0.596]
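
To make the decomposition concrete, here is a minimal sketch of how such a query pair could be represented. The assignment of terms to P/E/C/O below is illustrative only, since the full PECO queries are truncated in this extract, and all identifiers are hypothetical:

```python
# Hypothetical encoding of the migraine example: the keyword-based query is a
# flat bag of words, while the PECO-structured query maps each element to its
# own bag of terms.
keyword_query = ["aspirin", "migraine"]

peco_query = {
    "P": ["adults", "acute", "migraine", "headache"],  # Patient-problem
    "E": ["aspirin"],                                  # Exposure
    "C": ["placebo"],                                  # Comparison (assumed)
    "O": ["pain", "relief"],                           # Outcome (assumed)
}
```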

43 All the citations included in the “References” section of the systematic review were extracted and selected as relevant documents. [sent-136, score-0.325]

44 These citations were manually mapped to PubMed unique identifiers (PMID). [sent-137, score-0.175]

45 Figure 1: Histogram of the number of queries versus the number of relevant documents. [sent-142, score-0.224]

46 The resulting test collection is composed of 423 queries and 8926 relevant citations (2596 different citations). [sent-145, score-0.484]

47 This number reduces to 8138 citations once we remove the citations without any text in the abstract (i. [sent-146, score-0.35]

48 Figure 1 shows the statistics derived from the number of relevant documents by query. [sent-149, score-0.179]

49 In this test collection, the average number of documents per query is approximately 19 while the average length of a document is 246 words. [sent-150, score-0.249]

50 4 Distribution of PECO elements. The observation that PECO elements are not evenly distributed throughout the documents is not new. [sent-151, score-0.468]

51 These rhetorical categories are highly correlated with the distributions of PECO elements, as some elements are more likely to occur in certain categories (e. [sent-159, score-0.227]

52 clinical outcomes are more likely to appear in the conclusion). [sent-161, score-0.406]

53 To the best of our knowledge, the first analysis of the distribution of PECO elements in documents was described in (Boudin et al. [sent-163, score-0.317]

54 A small collection of manually annotated abstracts was used to compute the probability that a PECO element occurs in a specific part of the documents. [sent-165, score-0.2]

55 The idea is to use the pairs of PECO-structured query and relevant document, assuming that if a document is relevant then it should contain the same elements as the query. [sent-168, score-0.421]

56 Errors can be introduced by synonyms or homonyms and relevant documents may not contain all of the elements described in the query. [sent-170, score-0.353]

57 There are several ways to look at the distribution of PECO elements in documents. [sent-176, score-0.197]

58 Furthermore, most of the citations available in PubMed are devoid of explicitly marked sections. [sent-179, score-0.175]

59 For each PECO element, the distribution of query words among the parts of the documents is not uniform (Figure 2). [sent-185, score-0.257]
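
A minimal sketch of how such a positional distribution could be estimated, assuming documents are split into a fixed number of equal-sized parts and counts are pooled over all (query, relevant document) pairs; this illustrates the idea rather than reproducing the authors' implementation:

```python
from collections import Counter

def positional_distribution(pairs, element, n_parts=10):
    """For one PECO element ("P", "E", "C" or "O"), estimate the probability
    that a query term of that element occurs in each of n_parts equal-sized
    slices of a relevant document, pooled over all pairs."""
    counts = Counter()
    for peco_query, doc_tokens in pairs:  # (element->terms dict, token list)
        terms = set(peco_query.get(element, []))
        part_len = max(1, len(doc_tokens) // n_parts)
        for i, token in enumerate(doc_tokens):
            if token in terms:
                counts[min(i // part_len, n_parts - 1)] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * n_parts
    return [counts[p] / total for p in range(n_parts)]  # normalized histogram
```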

60 Our proposed model will exploit the typical distributions of PECO elements in documents. [sent-188, score-0.194]

61 [Figure 2 plot omitted] Figure 2: Distribution of each PECO element (P elements, E elements) throughout the different parts of the documents. [sent-191, score-0.462]

62 This approach assumes that queries and documents are generated from some probability distribution of text (Ponte and Croft, 1998). [sent-193, score-0.308]

63 Under this assumption, ranking a document D as relevant to a query Q is seen as estimating P(Q|D), the probability that Q was generated by the same distribution as D. [sent-194, score-0.211]

64 A typical way to score a document D as relevant to a query Q is to compute the Kullback-Leibler divergence between their respective language models: score(Q, D) = Σ_{w ∈ Q} P(w|Q) · log P(w|D) (1). Under the traditional bag-of-words assumption, i. [sent-195, score-0.204]
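
As a concrete reading of Equation 1, the sketch below scores a document with a Dirichlet-smoothed document language model (the smoothing scheme reported in the experiments, with μ = 2000) and a maximum-likelihood query model. This is an assumption-laden illustration, not the paper's code, and all identifiers are hypothetical:

```python
import math
from collections import Counter

def score(query_terms, doc_tokens, coll_tf, coll_len, mu=2000.0):
    """Equation 1: score(Q, D) = sum over w in Q of P(w|Q) * log P(w|D),
    where P(w|D) is smoothed with a Dirichlet prior over the collection
    model (coll_tf: collection term frequencies, coll_len: collection size)."""
    q_tf = Counter(query_terms)
    d_tf = Counter(doc_tokens)
    total = 0.0
    for w, qf in q_tf.items():
        p_w_q = qf / len(query_terms)         # maximum-likelihood P(w|Q)
        p_w_c = coll_tf.get(w, 0) / coll_len  # background (collection) model
        p_w_d = (d_tf.get(w, 0) + mu * p_w_c) / (len(doc_tokens) + mu)
        if p_w_d > 0.0:
            total += p_w_q * math.log(p_w_d)
    return total
```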

65 1 Model definition. In our model, we propose to use the distribution of PECO elements observed in documents to emphasize the most informative parts of the documents. [sent-203, score-0.346]

66 The idea is to sidestep the problem of precisely detecting PECO elements by using a positional language model. [sent-204, score-0.217]

67 The idea is to use the PECO structure as a way to balance the importance of each element in the retrieval step. [sent-209, score-0.182]

68 The final scoring function is defined as: score_final(Q, D) = Σ_{e ∈ PECO} δ_e · score(Q_e, D). In our model, there are a total of 7 weighting parameters: 4 corresponding to the PECO elements in queries (δP, δE, δC and δO) and 3 for the document language models (α, β and γ). [sent-210, score-0.406]
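
Building on the `score` sketch above, the final scoring function can be read as a weighted sum over the four element-specific query models; the δ weights below are uniform placeholders rather than the tuned values, and the α, β, γ positional mixture inside the document model is omitted for brevity:

```python
def score_final(peco_query, doc_tokens, coll_tf, coll_len, deltas=None):
    """score_final(Q, D) = sum over e in {P, E, C, O} of delta_e * score(Q_e, D)."""
    deltas = deltas or {"P": 1.0, "E": 1.0, "C": 1.0, "O": 1.0}  # placeholders
    return sum(
        deltas[e] * score(terms, doc_tokens, coll_tf, coll_len)
        for e, terms in peco_query.items()
        if terms  # skip elements left empty in the query
    )
```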

69 We used the following constraints: citations with an abstract, human subjects, and belonging to one of the following publication types: randomized controlled trials, reviews, clinical trials, letters, editorials and meta-analyses. [sent-217, score-0.627]

70 The set of queries and relevance judgments described in Section 3 is used to evaluate our model. [sent-218, score-0.187]

71 Because each query is generated from a systematic literature review completed at a time t, we placed an additional restriction on the publication date of the retrieved documents: only documents published before time t are considered. [sent-220, score-0.366]
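
A minimal sketch of this date restriction, assuming each retrieved document carries a hypothetical `pub_date` field comparable to the review's completion date t:

```python
def restrict_by_date(retrieved_docs, review_date):
    """Keep only documents published before the completion date t of the
    systematic review the query was derived from."""
    return [doc for doc in retrieved_docs if doc["pub_date"] < review_date]
```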

72 All retrieval tasks are performed using an “off-the-shelf” version of the Lemur toolkit. [sent-226, score-0.306]

73 The number of retrieved documents is set to 1000 and the Dirichlet prior smoothing parameter μ to 2000. [sent-230, score-0.173]

74 2 Experiments. We first investigated the impact of using PECO-structured queries on retrieval performance. [sent-235, score-0.239]

75 In our test collection, queries are often composed of multiword phrases such as “low back pain” or “early pregnancy”. [sent-248, score-0.184]

76 However, the number of relevant documents retrieved is decreased. [sent-257, score-0.232]

77 The PECO runs use the PECO-structured queries as a bag of words. [sent-260, score-0.33]

78 The number of relevant documents retrieved is also larger. [sent-262, score-0.232]

79 These results indicate that formulating clinical queries according to the PECO framework enhances retrieval effectiveness. [sent-263, score-0.661]

80 Table 1: Comparing the performance measures of keyword-based and PECO-structured queries in terms of MAP, precision at 5 and number of relevant documents retrieved (#rel. [sent-272, score-0.397]

81 The first variant (named Model-1) uses a global σe distribution fixed according to the average distribution of all PECO elements (i. [sent-279, score-0.22]

82 The idea is to see whether, given that PECO elements have different distributions in documents, using a weight distribution adapted to each element can improve retrieval effectiveness. [sent-283, score-0.376]

83 Previous studies have shown that assigning a different weight to each PECO element in the query leads to better results (Demner-Fushman and Lin, 2007; Boudin et al. [sent-284, score-0.188]

84 The PECO decomposition of queries is particularly useful for balancing the importance of each element in the scoring function. [sent-292, score-0.273]

85 These results support our assumption that the distribution of PECO elements in documents can be used to weight words in the document language model. [sent-295, score-0.361]

86 7 Conclusion. This paper first presented the construction of a test collection for evaluating clinical information retrieval. [sent-301, score-0.472]

87 From a set of systematic reviews, a group of annotators were asked to generate structured clinical queries and collect relevance judgments. [sent-302, score-0.673]

88 The resulting test collection is composed of 423 queries and 8926 relevant documents. [sent-303, score-0.309]

89 This test collection provides a basis for researchers to experiment with PECO-structured queries in clinical IR. [sent-304, score-0.637]

90 In a second step, this paper addressed the problem of using the PECO framework in clinical IR. [sent-306, score-0.406]

91 A straightforward idea is to identify PECO elements in documents and use the elements in the retrieval process. [sent-307, score-0.542]

92 Instead, we proposed a less demanding approach that uses the distribution of PECO elements in documents to re-weight terms in the document model. [sent-338, score-0.187]

93 On a collection of 1.5 million citations extracted from PubMed, our best model obtains an increase of 28% for MAP and nearly 50% for P@5 over the classical language modeling approach. [sent-342, score-0.191]

94 In future work, we intend to expand our analysis of the distribution of PECO elements to a larger number of citations. [sent-343, score-0.197]

95 One way to do that would be to automatically extract PubMed citations that contain structural markers associated with PECO categories (Chung, 2009). [sent-344, score-0.175]

96 Utilization of the PICO framework to improve searching PubMed for clinical questions. [sent-360, score-0.423]

97 The identification of clinically important elements within medical journal abstracts: Patient-Population-Problem, Exposure-Intervention, Comparison, Outcome, Duration and Results (PECODR). [sent-364, score-0.302]

98 Factors associated with successful answering of clinical questions using an information retrieval system. [sent-377, score-0.513]

99 Impact of clinical information-retrieval technology on physicians: a literature review of quantitative, qualitative and mixed methods studies. [sent-396, score-0.435]

100 The well-built clinical question: a key to evidence-based decisions. [sent-409, score-0.406]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('peco', 0.743), ('clinical', 0.406), ('citations', 0.175), ('elements', 0.174), ('queries', 0.165), ('documents', 0.12), ('boudin', 0.102), ('medical', 0.102), ('cochrane', 0.09), ('pubmed', 0.09), ('element', 0.085), ('query', 0.085), ('retrieval', 0.074), ('collection', 0.066), ('aspirin', 0.064), ('migraine', 0.064), ('systematic', 0.062), ('relevant', 0.059), ('eal', 0.055), ('retrieved', 0.053), ('adults', 0.051), ('treadmill', 0.051), ('montr', 0.049), ('abstracts', 0.049), ('reviews', 0.045), ('document', 0.044), ('positional', 0.043), ('disease', 0.043), ('dawes', 0.038), ('ebm', 0.038), ('exposure', 0.038), ('parkinson', 0.038), ('physicians', 0.038), ('pico', 0.038), ('placebo', 0.038), ('pluye', 0.038), ('medicine', 0.036), ('trials', 0.036), ('patient', 0.034), ('questions', 0.033), ('rhetorical', 0.033), ('pain', 0.033), ('ir', 0.032), ('map', 0.029), ('review', 0.029), ('randomized', 0.029), ('parts', 0.029), ('informatics', 0.027), ('search', 0.027), ('ages', 0.026), ('antiemetic', 0.026), ('clinically', 0.026), ('grad', 0.026), ('indri', 0.026), ('journals', 0.026), ('mcknight', 0.026), ('ponte', 0.026), ('sackett', 0.026), ('schardt', 0.026), ('specialists', 0.026), ('nie', 0.026), ('distribution', 0.023), ('weighting', 0.023), ('balance', 0.023), ('evidence', 0.023), ('relevance', 0.022), ('diro', 0.022), ('patients', 0.022), ('roland', 0.022), ('metzler', 0.022), ('drug', 0.022), ('hersh', 0.022), ('professionals', 0.022), ('title', 0.022), ('florian', 0.022), ('pierre', 0.02), ('distributions', 0.02), ('named', 0.019), ('composed', 0.019), ('bmc', 0.018), ('walking', 0.018), ('niu', 0.018), ('studies', 0.018), ('outcome', 0.018), ('group', 0.018), ('canada', 0.017), ('grid', 0.017), ('heart', 0.017), ('bruce', 0.017), ('publication', 0.017), ('failure', 0.017), ('duration', 0.017), ('searching', 0.017), ('divergence', 0.016), ('moderate', 0.016), ('formulating', 0.016), ('interface', 0.016), ('richardson', 0.016), ('croft', 0.016), ('million', 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 90 emnlp-2010-Positional Language Models for Clinical Information Retrieval

Author: Florian Boudin ; Jian-Yun Nie ; Martin Dawes

Abstract: The PECO framework is a knowledge representation for formulating clinical questions. Queries are decomposed into four aspects, which are Patient-Problem (P), Exposure (E), Comparison (C) and Outcome (O). However, no test collection is available to evaluate such a framework in information retrieval. In this work, we first present the construction of a large test collection extracted from systematic literature reviews. We then describe an analysis of the distribution of PECO elements throughout the relevant documents and propose a language modeling approach that uses these distributions as a weighting strategy. In our experiments carried out on a collection of 1.5 million documents and 423 queries, our method was found to lead to an improvement of 28% in MAP and 50% in P@5, as compared to the state-of-the-art method.

2 0.091124207 73 emnlp-2010-Learning Recurrent Event Queries for Web Search

Author: Ruiqiang Zhang ; Yuki Konda ; Anlei Dong ; Pranam Kolari ; Yi Chang ; Zhaohui Zheng

Abstract: Recurrent event queries (REQ) constitute a special class of search queries occurring at regular, predictable time intervals. The freshness of documents ranked for such queries is generally of critical importance. REQ forms a significant volume, as much as 6% of query traffic received by search engines. In this work, we develop an improved REQ classifier that could provide significant improvements in addressing this problem. We analyze REQ queries, and develop novel features from multiple sources, and evaluate them using machine learning techniques. From historical query logs, we develop features utilizing query frequency, click information, and user intent dynamics within a search session. We also develop temporal features by time series analysis from query frequency. Other generated features include word matching with recurrent event seed words and time sensitivity of search result set. We use Naive Bayes, SVM and decision tree based logistic regression model to train REQ classifier. The results on test data show that our models outperformed baseline approach significantly. Experiments on a commercial Web search engine also show significant gains in overall relevance, and thus overall user experience.

3 0.076598614 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media

Author: Yunliang Jiang ; Cindy Xide Lin ; Qiaozhu Mei

Abstract: In this paper, we conducted a systematic comparative analysis of language in different contexts of bursty topics, including web search, news media, blogging, and social bookmarking. We analyze (1) the content similarity and predictability between contexts, (2) the coverage of search content by each context, and (3) the intrinsic coherence of information in each context. Our experiments show that social bookmarking is a better predictor to the bursty search queries, but news media and social blogging media have a much more compelling coverage. This comparison provides insights on how the search behaviors and social information sharing behaviors of users are correlated to the professional news media in the context of bursty events.

4 0.065225951 15 emnlp-2010-A Unified Framework for Scope Learning via Simplified Shallow Semantic Parsing

Author: Qiaoming Zhu ; Junhui Li ; Hongling Wang ; Guodong Zhou

Abstract: This paper approaches the scope learning problem via simplified shallow semantic parsing. This is done by regarding the cue as the predicate and mapping its scope into several constituents as the arguments of the cue. Evaluation on the BioScope corpus shows that the structural information plays a critical role in capturing the relationship between a cue and its dominated arguments. It also shows that our parsing approach significantly outperforms the state-of-the-art chunking ones. Although our parsing approach is only evaluated on negation and speculation scope learning here, it is portable to other kinds of scope learning.

5 0.058667626 66 emnlp-2010-Inducing Word Senses to Improve Web Search Result Clustering

Author: Roberto Navigli ; Giuseppe Crisafulli

Abstract: In this paper, we present a novel approach to Web search result clustering based on the automatic discovery of word senses from raw text, a task referred to as Word Sense Induction (WSI). We first acquire the senses (i.e., meanings) of a query by means of a graphbased clustering algorithm that exploits cycles (triangles and squares) in the co-occurrence graph of the query. Then we cluster the search results based on their semantic similarity to the induced word senses. Our experiments, conducted on datasets of ambiguous queries, show that our approach improves search result clustering in terms of both clustering quality and degree of diversification.

6 0.054309178 109 emnlp-2010-Translingual Document Representations from Discriminative Projections

7 0.044285763 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning

8 0.043641184 55 emnlp-2010-Handling Noisy Queries in Cross Language FAQ Retrieval

9 0.043264262 56 emnlp-2010-Hashing-Based Approaches to Spelling Correction of Personal Names

10 0.040677749 64 emnlp-2010-Incorporating Content Structure into Text Analysis Applications

11 0.039834078 51 emnlp-2010-Function-Based Question Classification for General QA

12 0.037435677 85 emnlp-2010-Negative Training Data Can be Harmful to Text Classification

13 0.036552701 39 emnlp-2010-EMNLP 044

14 0.035426069 100 emnlp-2010-Staying Informed: Supervised and Semi-Supervised Multi-View Topical Analysis of Ideological Perspective

15 0.034003787 106 emnlp-2010-Top-Down Nearly-Context-Sensitive Parsing

16 0.033067487 84 emnlp-2010-NLP on Spoken Documents Without ASR

17 0.030364774 16 emnlp-2010-An Approach of Generating Personalized Views from Normalized Electronic Dictionaries : A Practical Experiment on Arabic Language

18 0.030346423 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation

19 0.028578583 74 emnlp-2010-Learning the Relative Usefulness of Questions in Community QA

20 0.028232818 82 emnlp-2010-Multi-Document Summarization Using A* Search and Discriminative Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.101), (1, 0.064), (2, -0.081), (3, 0.055), (4, 0.069), (5, 0.045), (6, -0.08), (7, 0.007), (8, -0.125), (9, 0.043), (10, -0.017), (11, 0.094), (12, 0.092), (13, 0.069), (14, -0.04), (15, 0.052), (16, 0.02), (17, 0.027), (18, -0.057), (19, 0.002), (20, -0.081), (21, 0.061), (22, 0.109), (23, -0.021), (24, -0.057), (25, 0.051), (26, -0.018), (27, -0.062), (28, -0.064), (29, 0.149), (30, 0.278), (31, -0.049), (32, 0.022), (33, 0.265), (34, -0.028), (35, -0.113), (36, 0.258), (37, 0.015), (38, -0.016), (39, 0.1), (40, -0.077), (41, 0.13), (42, 0.229), (43, 0.084), (44, 0.106), (45, -0.012), (46, -0.067), (47, -0.072), (48, 0.132), (49, -0.044)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95024818 90 emnlp-2010-Positional Language Models for Clinical Information Retrieval

Author: Florian Boudin ; Jian-Yun Nie ; Martin Dawes

Abstract: The PECO framework is a knowledge representation for formulating clinical questions. Queries are decomposed into four aspects, which are Patient-Problem (P), Exposure (E), Comparison (C) and Outcome (O). However, no test collection is available to evaluate such framework in information retrieval. In this work, we first present the construction of a large test collection extracted from systematic literature reviews. We then describe an analysis of the distribution of PECO elements throughout the relevant documents and propose a language modeling approach that uses these distributions as a weighting strategy. In our experiments carried out on a collection of 1.5 million documents and 423 queries, our method was found to lead to an improvement of 28% in MAP and 50% in P@5, as com- pared to the state-of-the-art method.

2 0.70610929 15 emnlp-2010-A Unified Framework for Scope Learning via Simplified Shallow Semantic Parsing

Author: Qiaoming Zhu ; Junhui Li ; Hongling Wang ; Guodong Zhou

Abstract: This paper approaches the scope learning problem via simplified shallow semantic parsing. This is done by regarding the cue as the predicate and mapping its scope into several constituents as the arguments of the cue. Evaluation on the BioScope corpus shows that the structural information plays a critical role in capturing the relationship between a cue and its dominated arguments. It also shows that our parsing approach significantly outperforms the state-of-the-art chunking ones. Although our parsing approach is only evaluated on negation and speculation scope learning here, it is portable to other kinds of scope learning.

3 0.37627116 73 emnlp-2010-Learning Recurrent Event Queries for Web Search

Author: Ruiqiang Zhang ; Yuki Konda ; Anlei Dong ; Pranam Kolari ; Yi Chang ; Zhaohui Zheng

Abstract: Recurrent event queries (REQ) constitute a special class of search queries occurring at regular, predictable time intervals. The freshness of documents ranked for such queries is generally of critical importance. REQ forms a significant volume, as much as 6% of query traffic received by search engines. In this work, we develop an improved REQ classifier that could provide significant improvements in addressing this problem. We analyze REQ queries, and develop novel features from multiple sources, and evaluate them using machine learning techniques. From historical query logs, we develop features utilizing query frequency, click information, and user intent dynamics within a search session. We also develop temporal features by time series analysis from query frequency. Other generated features include word matching with recurrent event seed words and time sensitivity of search result set. We use Naive Bayes, SVM and decision tree based logistic regres- sion model to train REQ classifier. The results on test data show that our models outperformed baseline approach significantly. Experiments on a commercial Web search engine also show significant gains in overall relevance, and thus overall user experience.

4 0.30352968 109 emnlp-2010-Translingual Document Representations from Discriminative Projections

Author: John Platt ; Kristina Toutanova ; Wen-tau Yih

Abstract: Representing documents by vectors that are independent of language enhances machine translation and multilingual text categorization. We use discriminative training to create a projection of documents from multiple languages into a single translingual vector space. We explore two variants to create these projections: Oriented Principal Component Analysis (OPCA) and Coupled Probabilistic Latent Semantic Analysis (CPLSA). Both of these variants start with a basic model of documents (PCA and PLSA). Each model is then made discriminative by encouraging comparable document pairs to have similar vector representations. We evaluate these algorithms on two tasks: parallel document retrieval for Wikipedia and Europarl documents, and cross-lingual text classification on Reuters. The two discriminative variants, OPCA and CPLSA, significantly outperform their corresponding baselines. The largest differences in performance are observed on the task of retrieval when the documents are only comparable and not parallel. The OPCA method is shown to perform best.

5 0.29693896 55 emnlp-2010-Handling Noisy Queries in Cross Language FAQ Retrieval

Author: Danish Contractor ; Govind Kothari ; Tanveer Faruquie ; L V Subramaniam ; Sumit Negi

Abstract: Recent times have seen a tremendous growth in mobile based data services that allow people to use Short Message Service (SMS) to access these data services. In a multilingual society it is essential that data services that were developed for a specific language be made accessible through other local languages also. In this paper, we present a service that allows a user to query a Frequently-Asked-Questions (FAQ) database built in a local language (Hindi) using noisy SMS English queries. The inherent noise in the SMS queries, along with the language mismatch makes this a challenging problem. We handle these two problems by formulating the query similarity over FAQ questions as a combinatorial search problem where the search space consists of combinations of dictionary variations of the noisy query and its top-N translations. We demonstrate the effectiveness of our approach on a real-life dataset.

6 0.2753585 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media

7 0.25768733 66 emnlp-2010-Inducing Word Senses to Improve Web Search Result Clustering

8 0.17528617 56 emnlp-2010-Hashing-Based Approaches to Spelling Correction of Personal Names

9 0.17142093 85 emnlp-2010-Negative Training Data Can be Harmful to Text Classification

10 0.16102621 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning

11 0.14124243 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors

12 0.14093108 64 emnlp-2010-Incorporating Content Structure into Text Analysis Applications

13 0.13817099 81 emnlp-2010-Modeling Perspective Using Adaptor Grammars

14 0.13631298 100 emnlp-2010-Staying Informed: Supervised and Semi-Supervised Multi-View Topical Analysis of Ideological Perspective

15 0.13420326 40 emnlp-2010-Effects of Empty Categories on Machine Translation

16 0.13112792 84 emnlp-2010-NLP on Spoken Documents Without ASR

17 0.1303284 23 emnlp-2010-Automatic Keyphrase Extraction via Topic Decomposition

18 0.12944235 51 emnlp-2010-Function-Based Question Classification for General QA

19 0.12110201 83 emnlp-2010-Multi-Level Structured Models for Document-Level Sentiment Classification

20 0.11804102 72 emnlp-2010-Learning First-Order Horn Clauses from Web Text


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.02), (10, 0.015), (12, 0.026), (29, 0.063), (30, 0.021), (52, 0.029), (56, 0.06), (62, 0.443), (66, 0.097), (72, 0.059), (76, 0.032), (79, 0.013), (87, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.9028008 88 emnlp-2010-On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Author: Alexander M Rush ; David Sontag ; Michael Collins ; Tommi Jaakkola

Abstract: This paper introduces dual decomposition as a framework for deriving inference algorithms for NLP problems. The approach relies on standard dynamic-programming algorithms as oracle solvers for sub-problems, together with a simple method for forcing agreement between the different oracles. The approach provably solves a linear programming (LP) relaxation of the global inference problem. It leads to algorithms that are simple, in that they use existing decoding algorithms; efficient, in that they avoid exact algorithms for the full model; and often exact, in that empirically they often recover the correct solution in spite of using an LP relaxation. We give experimental results on two problems: 1) the combination of two lexicalized parsing models; and 2) the combination of a lexicalized parsing model and a trigram part-of-speech tagger.

same-paper 2 0.80006045 90 emnlp-2010-Positional Language Models for Clinical Information Retrieval

Author: Florian Boudin ; Jian-Yun Nie ; Martin Dawes

Abstract: The PECO framework is a knowledge representation for formulating clinical questions. Queries are decomposed into four aspects, which are Patient-Problem (P), Exposure (E), Comparison (C) and Outcome (O). However, no test collection is available to evaluate such a framework in information retrieval. In this work, we first present the construction of a large test collection extracted from systematic literature reviews. We then describe an analysis of the distribution of PECO elements throughout the relevant documents and propose a language modeling approach that uses these distributions as a weighting strategy. In our experiments carried out on a collection of 1.5 million documents and 423 queries, our method was found to lead to an improvement of 28% in MAP and 50% in P@5, as compared to the state-of-the-art method.

3 0.75395739 72 emnlp-2010-Learning First-Order Horn Clauses from Web Text

Author: Stefan Schoenmackers ; Jesse Davis ; Oren Etzioni ; Daniel Weld

Abstract: Even the entire Web corpus does not explicitly answer all questions, yet inference can uncover many implicit answers. But where do inference rules come from? This paper investigates the problem of learning inference rules from Web text in an unsupervised, domain-independent manner. The SHERLOCK system, described herein, is a first-order learner that acquires over 30,000 Horn clauses from Web text. SHERLOCK embodies several innovations, including a novel rule scoring function based on Statistical Relevance (Salmon et al., 1971) which is effective on ambiguous, noisy and incomplete Web extractions. Our experiments show that inference over the learned rules discovers three times as many facts (at precision 0.8) as the TEXTRUNNER system which merely extracts facts explicitly stated in Web text.

4 0.67994189 38 emnlp-2010-Dual Decomposition for Parsing with Non-Projective Head Automata

Author: Terry Koo ; Alexander M. Rush ; Michael Collins ; Tommi Jaakkola ; David Sontag

Abstract: This paper introduces algorithms for nonprojective parsing based on dual decomposition. We focus on parsing algorithms for nonprojective head automata, a generalization of head-automata models to non-projective structures. The dual decomposition algorithms are simple and efficient, relying on standard dynamic programming and minimum spanning tree algorithms. They provably solve an LP relaxation of the non-projective parsing problem. Empirically the LP relaxation is very often tight: for many languages, exact solutions are achieved on over 98% of test sentences. The accuracy of our models is higher than previous work on a broad range of datasets.

5 0.48218188 110 emnlp-2010-Turbo Parsers: Dependency Parsing by Approximate Variational Inference

Author: Andre Martins ; Noah Smith ; Eric Xing ; Pedro Aguiar ; Mario Figueiredo

Abstract: We present a unified view of two state-of-the-art non-projective dependency parsers, both approximate: the loopy belief propagation parser of Smith and Eisner (2008) and the relaxed linear program of Martins et al. (2009). By representing the model assumptions with a factor graph, we shed light on the optimization problems tackled in each method. We also propose a new aggressive online algorithm to learn the model parameters, which makes use of the underlying variational representation. The algorithm does not require a learning rate parameter and provides a single framework for a wide family of convex loss functions, including CRFs and structured SVMs. Experiments show state-of-the-art performance for 14 languages.

6 0.41504228 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media

7 0.40699366 67 emnlp-2010-It Depends on the Translation: Unsupervised Dependency Parsing via Word Alignment

8 0.40485361 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding

9 0.39881206 107 emnlp-2010-Towards Conversation Entailment: An Empirical Investigation

10 0.39790359 46 emnlp-2010-Evaluating the Impact of Alternative Dependency Graph Encodings on Solving Event Extraction Tasks

11 0.39157212 31 emnlp-2010-Constraints Based Taxonomic Relation Classification

12 0.39126456 55 emnlp-2010-Handling Noisy Queries in Cross Language FAQ Retrieval

13 0.37810838 105 emnlp-2010-Title Generation with Quasi-Synchronous Grammar

14 0.37525496 82 emnlp-2010-Multi-Document Summarization Using A* Search and Discriminative Learning

15 0.3678202 113 emnlp-2010-Unsupervised Induction of Tree Substitution Grammars for Dependency Parsing

16 0.36746687 8 emnlp-2010-A Multi-Pass Sieve for Coreference Resolution

17 0.36712182 115 emnlp-2010-Uptraining for Accurate Deterministic Question Parsing

18 0.36607417 104 emnlp-2010-The Necessity of Combining Adaptation Methods

19 0.36585084 80 emnlp-2010-Modeling Organization in Student Essays

20 0.36353323 73 emnlp-2010-Learning Recurrent Event Queries for Web Search