73 emnlp-2010-Learning Recurrent Event Queries for Web Search


Source: pdf

Author: Ruiqiang Zhang ; Yuki Konda ; Anlei Dong ; Pranam Kolari ; Yi Chang ; Zhaohui Zheng

Abstract: Recurrent event queries (REQ) constitute a special class of search queries occurring at regular, predictable time intervals. The freshness of documents ranked for such queries is generally of critical importance. REQ forms a significant volume, as much as 6% of query traffic received by search engines. In this work, we develop an improved REQ classifier that could provide significant improvements in addressing this problem. We analyze REQ queries, and develop novel features from multiple sources, and evaluate them using machine learning techniques. From historical query logs, we develop features utilizing query frequency, click information, and user intent dynamics within a search session. We also develop temporal features by time series analysis from query frequency. Other generated features include word matching with recurrent event seed words and time sensitivity of search result set. We use Naive Bayes, SVM and decision tree based logistic regression model to train REQ classifier. The results on test data show that our models outperformed baseline approach significantly. Experiments on a commercial Web search engine also show significant gains in overall relevance, and thus overall user experience.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Recurrent event queries (REQ) constitute a special class of search queries occurring at regular, predictable time intervals. [sent-2, score-0.64]

2 The freshness of documents ranked for such queries is generally of critical importance. [sent-3, score-0.293]

3 REQ forms a significant volume, as much as 6% of query traffic received by search engines. [sent-4, score-0.53]

4 From historical query logs, we develop features utilizing query frequency, click information, and user intent dynamics within a search session. [sent-7, score-1.148]

5 We also develop temporal features by time series analysis from query frequency. [sent-8, score-0.469]

6 Other generated features include word matching with recurrent event seed words and time sensitivity of search result set. [sent-9, score-0.499]

7 Existing search engines adopt a fairly involved ranking algorithm to order Web search results by considering many factors. [sent-28, score-0.298]

8 Typically, a recurrent event is associated with a root, and spawns a large set of queries. [sent-38, score-0.343]

9 Oscar, for instance, is a recurrent event about the annual Academy Award. [sent-39, score-0.343]

10 Based on this, queries like “oscar best actress”, “oscar best dress”, “oscar best movie award”, are all recurrent event queries. [sent-40, score-0.573]

11 As such, REQ is a highly frequent category of query in Web search. [sent-41, score-0.428]

12 From Web search query log analysis, we observe that about 5-6% of total query volume belongs to this category. [sent-42, score-1.25]

13 In this work, we learn if a query is in the REQ class, by effectively combining multiple features. [sent-43, score-0.428]

14 Our features are developed through analysis of historical query logs. [sent-44, score-0.428]

15 2 Related Work We found that our work is related to two other problems: general query classification and time-sensitive query classification. [sent-50, score-0.894]

16 For general query classification, the task is to assign a Web search query to one or more predefined categories based on its topics. [sent-51, score-0.958]

17 In the query classification contest in KDDCUP 2005 (Li et al. [sent-52, score-0.491]

18 The difficulties for query classification are from short queries, lack of labeled data, and query sense ambiguity. [sent-56, score-0.894]

19 Most popular studies use query logs, web search results, and unlabeled data to enrich query classification (Shen et al. [sent-57, score-1.064]

20 , 2005), or use document classification to predict query classification (Broder et al. [sent-59, score-0.504]

21 General query classification is also studied for query intent detection by (Li et al. [sent-61, score-0.894]

22 , 2010a) proposed a machine-learned framework to improve ranking result freshness, in which novel features, modeling algorithms and editorial guideline are used to deal with time sensitivities of queries and documents. [sent-81, score-0.324]

23 Novel and effective features are also extracted for fresh URLs so that ranking recency in web search is improved. [sent-86, score-0.302]

24 Perhaps the most related work to this paper is the query classification approach used in (Zhang et al. [sent-87, score-0.466]

25 , 2009), in which year qualified queries (YQQs) are detected based on heuristic rules. [sent-89, score-0.464]

26 For example, a query containing a year stamp is an explicit YQQ; if the year stamp is removed from this YQQ, the remaining part of this query is also a YQQ, which is called implicit YQQ. [sent-90, score-1.444]

27 , 2010a) proposed a breaking-news query classifier with high accuracy and reasonable coverage, which works not by modeling each individual topic and tracking it over time, but by modeling each discrete time slot, and comparing the models representing different time slots. [sent-101, score-0.467]

28 The buzziness of a query is computed as the language model likelihood difference between different time slots. [sent-102, score-0.453]

29 In this approach, both query log and news contents are exploited to compute language model likelihood. [sent-103, score-0.49]

30 Diaz (Diaz, 2009) determined the newsworthiness of a query by predicting the probability of a user clicks on the news display of a query. [sent-104, score-0.457]

31 In this framework, the data sources of both query log and news corpus are leveraged to compute contextual features. [sent-105, score-0.49]

32 Furthermore, the online click feedback also plays a critical role for future click prediction. [sent-106, score-0.302]

33 We subdivide all raw queries in query log into three categories: Explicit Timestamp, Implicit Timestamp, and No Timestamp. [sent-115, score-0.72]

34 An Explicit Timestamp query contains at least one token being a time indicator. [sent-116, score-0.428]

35 These queries are considered to contain time indicators, because we can regard {2010, 2007, 2009} as year indicators, december as a month indicator, and {summer, Q1 (first quarter)} as seasonal indicators. [sent-118, score-0.464]

36 Any query containing at least one year indicator is an Explicit Timestamp query. [sent-122, score-0.691]

37 Implicit Timestamp queries result from removing all year indicators from the corresponding Explicit Timestamp queries. [sent-125, score-0.504]

38 For example, the Implicit Timestamp query of emnlp 2010 is emnlp. [sent-126, score-0.428]

39 All other queries are No Timestamp queries because they have never been found together with a year indicator. [sent-127, score-0.694]
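
The three-way split described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the year pattern `YEAR_RE`, the helper `strip_years`, and the toy log are all assumptions introduced here, and only year indicators are handled.

```python
import re

# Assumed pattern for a year indicator: a standalone 4-digit token 19xx or 20xx.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")

def strip_years(query: str) -> str:
    """Remove year tokens and collapse the remaining whitespace."""
    return " ".join(YEAR_RE.sub(" ", query).split())

def categorize(queries):
    """Label each distinct query as 'explicit', 'implicit', or 'none'."""
    explicit = {q for q in queries if YEAR_RE.search(q)}
    # An Implicit Timestamp query is what remains after removing the years
    # from some Explicit Timestamp query seen in the same log.
    implicit = {strip_years(q) for q in explicit} - {""}
    labels = {}
    for q in set(queries):
        if q in explicit:
            labels[q] = "explicit"
        elif q in implicit:
            labels[q] = "implicit"
        else:
            labels[q] = "none"
    return labels

# Toy log (hypothetical data).
log = ["emnlp 2010", "emnlp", "google 2009", "google", "weather"]
labels = categorize(log)
```

As the text notes, the labels depend on the log: "google" only becomes implicit here because "google 2009" appears in the same toy log.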

40 Classifying queries into the above three categories depends on the used query log. [sent-128, score-0.658]

41 A search engine company partner provided us a query log from 08/01/2009 to 02/29/2010 for this research. [sent-129, score-0.649]

42 We found the proportions of the three categories in this query log are 13. [sent-130, score-0.49]

43 These numbers could be slightly different depending on the source of query logs. [sent-134, score-0.428]

44 1% of Implicit Timestamp queries in the query log is a significant number. [sent-136, score-0.72]

45 They belong to Implicit Timestamp just because users issued the query with a year indicator through varied intents. [sent-139, score-0.793]

46 For example, “google” is found to be an Implicit Timestamp query since there were many “google 2008” or “google 2009” in the query log. [sent-140, score-0.856]

47 The next few sections introduce our work in recognizing recurrent event time sense for Implicit Timestamp queries. [sent-141, score-0.343]

48 We extract these features from query log, query session log, click log, search results, time series and NLP morphological analysis. [sent-144, score-1.135]

49 1 Query log analysis The following features are extracted from query log analysis: QueryDailyFrequency: the total count of the query divided by the number of days in the period. [sent-146, score-0.98]

50 ExplicitQueryRatio: Ratio of the number of times the query was issued with a year to the number of times it was issued with or without a year. [sent-147, score-1.198]
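
The two features defined so far reduce to simple ratios over log counts. The sketch below assumes the log has already been aggregated into a `counts` dictionary mapping query strings to issue counts; that layout, the year pattern, and the numbers are illustrative assumptions.

```python
import re

YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")  # assumed year-indicator pattern

def strip_years(query: str) -> str:
    return " ".join(YEAR_RE.sub(" ", query).split())

def query_daily_frequency(total_count: int, num_days: int) -> float:
    """Total count of the query divided by the number of days in the period."""
    return total_count / num_days

def explicit_query_ratio(counts: dict, implicit_query: str) -> float:
    """Share of issues of `implicit_query` that carried a year indicator."""
    with_year = sum(
        c for q, c in counts.items()
        if YEAR_RE.search(q) and strip_years(q) == implicit_query
    )
    total = with_year + counts.get(implicit_query, 0)
    return with_year / total if total else 0.0

# Hypothetical aggregated counts over a 30-day period.
counts = {"oscar": 60, "oscar 2010": 30, "oscar 2009": 10, "google": 500}
daily = query_daily_frequency(counts["oscar"], 30)
ratio = explicit_query_ratio(counts, "oscar")
```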

51 For example, if a query was issued with query+2009 and query+2008, this feature’s value is two. [sent-151, score-0.482]

52 (af10/sum1, af09/sum1, af01/sum1) The expected distribution for this query is E_q = (sum2*af10/sum1, sum2*af09/sum1, ... [sent-170, score-0.428]
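
The extracted sentence is incomplete, so the following is only one plausible reading of how a chi-square feature over years (ChiSquareYearDist, named later in the text) could be computed: expected per-year counts for a query are obtained by scaling the aggregate per-year share of all queries by the query's own total. The argument names and the toy counts are assumptions made for this sketch.

```python
def chi_square_year_dist(query_year_counts: dict, all_year_counts: dict) -> float:
    """Chi-square statistic between a query's per-year counts and the
    counts expected from the aggregate per-year distribution (assumed reading)."""
    years = sorted(all_year_counts)
    total_all = sum(all_year_counts.values())                      # aggregate total
    total_query = sum(query_year_counts.get(y, 0) for y in years)  # this query's total
    chi2 = 0.0
    for y in years:
        expected = total_query * all_year_counts[y] / total_all
        observed = query_year_counts.get(y, 0)
        if expected > 0:
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Hypothetical counts of queries issued together with each year.
all_counts = {2008: 100, 2009: 100, 2010: 200}
same_shape = chi_square_year_dist({2008: 25, 2009: 25, 2010: 50}, all_counts)
skewed = chi_square_year_dist({2010: 100}, all_counts)
```

A query whose year usage mirrors the aggregate scores zero, while one concentrated on a single year scores high.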

53 2 Query reformulation If users cannot find the newest page by issuing an Implicit Timestamp query, they may re-issue the query as an Explicit Timestamp query. [sent-181, score-0.615]

54 YearSwitch: Number of unique year-like tokens switched by users in a query session. [sent-185, score-0.476]
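
A rough sketch of how YearSwitch might be counted for one session follows. It assumes a session is an ordered list of query strings and treats a "switch" as a consecutive pair whose year-stripped text is identical but whose year tokens differ; that interpretation and the helper names are assumptions, not the paper's definition.

```python
import re

YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")  # assumed year-like token pattern

def strip_years(query: str) -> str:
    return " ".join(YEAR_RE.sub(" ", query).split())

def year_switch(session) -> int:
    """Number of unique year-like tokens introduced by reformulations."""
    switched = set()
    for prev, cur in zip(session, session[1:]):
        if strip_years(prev) != strip_years(cur):
            continue  # a different query, not a year reformulation
        new_years = set(YEAR_RE.findall(cur)) - set(YEAR_RE.findall(prev))
        switched |= new_years
    return len(switched)

# Hypothetical session: the user adds a year, then changes it.
session = ["oscar", "oscar 2010", "oscar 2009", "weather"]
n_switch = year_switch(session)
```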

55 3 Click log analysis If a query is time sensitive, users may click a page that displays the year indicator on title or url. [sent-188, score-1.028]

56 An example that shows year indicator on url is www. [sent-189, score-0.328]

57 Search engine click log saves all users’ click information. [sent-194, score-0.421]

58 YearUrlTop5CTR: Aggregated click through rate (CTR) of all top five URLs containing a year indicator. [sent-196, score-0.416]

59 YearUrlFPCTR: Aggregated click through rate (CTR) of all first page URLs containing a year indicator. [sent-198, score-0.461]
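
The two click features can be illustrated as one function with a position cutoff. The sketch assumes the click log has been aggregated per URL into impressions and clicks, that "aggregated CTR" means total clicks over total impressions, and that the first page holds ten results; the `UrlStat` record and the example URLs are hypothetical.

```python
import re
from dataclasses import dataclass

# Digit lookarounds instead of \b so that years embedded in paths still match.
YEAR_RE = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)")

@dataclass
class UrlStat:
    url: str
    position: int      # 1-based rank on the result page
    impressions: int
    clicks: int

def year_url_ctr(stats, max_position: int) -> float:
    """Aggregated CTR over URLs that contain a year and rank within max_position."""
    selected = [s for s in stats
                if s.position <= max_position and YEAR_RE.search(s.url)]
    impressions = sum(s.impressions for s in selected)
    clicks = sum(s.clicks for s in selected)
    return clicks / impressions if impressions else 0.0

stats = [
    UrlStat("https://example.com/awards/2010/", 1, 100, 30),
    UrlStat("https://example.com/awards/history", 3, 100, 10),
    UrlStat("https://example.com/2009-winners", 7, 50, 5),
]
top5_ctr = year_url_ctr(stats, 5)         # YearUrlTop5CTR
first_page_ctr = year_url_ctr(stats, 10)  # YearUrlFPCTR, assuming 10 results per page
```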

60 4 Search engine result set For each Implicit Timestamp query, we can scrape the search engine to get search results. [sent-200, score-0.318]

61 We count the number of titles and urls that contain year indicator. [sent-201, score-0.319]

62 TitleYearTop5: the number of titles containing a year indication on the top 5 results. [sent-203, score-0.342]

63 TitleYearTop10: the number of titles containing a year indication on the top 10 results. [sent-206, score-0.342]

64 TitleYearTop30: the number of titles containing a year indication on the top 30 results. [sent-209, score-0.342]

65 UrlYearTop5: the number of urls containing a year indication on the top 5 results. [sent-210, score-0.347]

66 UrlYearTop10: the number of urls containing a year indication on the top 10 results. [sent-213, score-0.302]

67 UrlYearTop30: the number of urls containing a year indication on the top 30 results. [sent-214, score-0.342]
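
The six result-set features share one computation at different cutoffs, which the sketch below makes explicit. It assumes the scraped results are available as an ordered list of `(title, url)` pairs; the pattern and the sample results are made up for illustration.

```python
import re

YEAR_RE = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)")  # assumed year pattern

def result_set_features(results, cutoffs=(5, 10, 30)) -> dict:
    """Count titles and urls containing a year among the top-N results."""
    feats = {}
    for n in cutoffs:
        top = results[:n]
        feats[f"TitleYearTop{n}"] = sum(bool(YEAR_RE.search(t)) for t, _ in top)
        feats[f"UrlYearTop{n}"] = sum(bool(YEAR_RE.search(u)) for _, u in top)
    return feats

# Hypothetical ranked results for an Implicit Timestamp query.
results = [
    ("Oscars 2010 winners", "https://example.com/a"),
    ("Oscar history", "https://example.com/b"),
    ("2010 Academy Awards", "https://example.com/c"),
    ("Red carpet", "https://example.com/d"),
    ("Best picture", "https://example.com/e"),
    ("Nominees", "https://example.com/2010/nominees"),
]
feats = result_set_features(results)
```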

68 5 Time series analysis Recurrent event queries have a periodic occurrence pattern in their time series. [sent-216, score-0.506]

69 The annual event usually runs from the Oscar nominations, as early as December of the previous year, to the award announcement in February of the current year. [sent-218, score-0.312]

70 By making use of recurrent event queries’ periodic properties, we calculated the query period as a new feature. [sent-223, score-0.804]

71 2 shows the autocorrelation function plot for the query Oscar. [sent-233, score-0.504]
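
To make the period feature concrete, here is a small pure-Python sketch that estimates a period as the lag with the highest sample autocorrelation. The extract does not say how the authors picked the period from the autocorrelation curve, so the peak-picking rule, the lag range, and the synthetic series are assumptions.

```python
def autocorrelation(series, lag: int) -> float:
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0:
        return 0.0
    cov = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag))
    return cov / var

def estimate_period(series, min_lag: int = 2) -> int:
    """Lag in [min_lag, len/2] with the largest autocorrelation."""
    max_lag = len(series) // 2
    return max(range(min_lag, max_lag + 1),
               key=lambda k: autocorrelation(series, k))

# Synthetic frequency series with a spike every 7 time steps.
series = [10 if t % 7 == 0 else 1 for t in range(70)]
period = estimate_period(series)
```

With real query logs the series would be daily or weekly counts, and an annual event needs more than one year of data before a one-year lag can be observed.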

72 6 Recurrent event seed word list Many recurrent event queries share some common words that have recurrent time sense. [sent-235, score-0.97]

73 Those seeds are likely combined with other words to form new recurrent event queries. [sent-237, score-0.376]

74 To generate the seed list, we tokenized all the queries from Implicit Timestamp queries and split all the tokens. [sent-239, score-0.514]

75 Some top tokens were removed if they are not qualified to form recurrent event queries. [sent-241, score-0.374]

76 The editors took about four days to do the judgment according to the token’s time sense and examples of recurrent event queries. [sent-242, score-0.343]

77 A token will be in the seed if there are many recurrent event examples formed by this token, by editors’ judgment. [sent-244, score-0.397]
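
The seed-list workflow (tokenize, rank tokens, have editors prune, then match) can be sketched as below. The editorial judgment step cannot be automated here, so the final seed set is simply given; the function names, the whitespace tokenizer, and the example seeds are illustrative assumptions.

```python
from collections import Counter

def candidate_seeds(queries, top_k: int):
    """Most frequent tokens across queries, as candidates for editorial review."""
    counts = Counter(tok for q in queries for tok in q.lower().split())
    return [tok for tok, _ in counts.most_common(top_k)]

def seed_match_count(query: str, seeds) -> int:
    """Number of tokens in the query that appear in the seed list."""
    return sum(tok in seeds for tok in query.lower().split())

# Hypothetical Implicit Timestamp queries.
queries = ["oscar awards", "grammy awards", "world cup", "awards show"]
candidates = candidate_seeds(queries, top_k=1)

# Hypothetical seed set after editorial pruning.
seeds = {"awards", "schedule"}
matches = seed_match_count("emmy awards schedule", seeds)
```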

78 Figure 2: Frequency waveform (top) and corresponding autocorrelation curve (bottom) for query Oscar. [sent-255, score-0.504]

79 The 6,000 queries were sampled from Implicit Timestamp queries according to frequency distribution to be representative. [sent-279, score-0.492]

80 The recall is a measure of correctly classified REQ queries divided by all REQ queries in test data. [sent-298, score-0.46]
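
For reference, recall as defined above, together with the usual companion metric precision, can be computed from binary labels as follows. The label encoding (1 for REQ, 0 otherwise) and the toy vectors are assumptions.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive (REQ) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical gold labels and classifier decisions.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
precision, recall = precision_recall(y_true, y_pred)
```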

81 The second and seventh features are from search session analysis by counting users who changed queries from Implicit Timestamp to Explicit Timestamp. [sent-312, score-0.406]

82 Calculating this feature would be much more effective with two years of query logs, but we did not have that much data for many queries. [sent-316, score-0.49]

83 One of the features from recurrent event seed list is ranked No. [sent-317, score-0.422]

84 The ChiSquareYearDist feature is ranked 5th, which shows that recurrent event query frequency has a statistical distribution pattern over years. [sent-320, score-0.828]

85 Table 4: Probabilities of example queries by GBDT tree classifier. Some query examples and their scores from our model are listed in Table 4. [sent-392, score-0.727]

86 In their approach, search ranking is altered by boosting pages with most recent year if the query is a REQ. [sent-399, score-0.892]

87 The year indicator Table 5: REQ learner improves search engine organic results. [sent-400, score-0.51]

88 If the query q is not a REQ, boosting is set to zero. [sent-406, score-0.462]

89 If the newest page has a lower ranking score than the oldest page, then the difference is added to the newest page to promote the ranking of the newest page. [sent-409, score-0.529]
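
The boosting rule in the sentences above can be sketched as follows. Only the two stated conditions come from the text (no boost for non-REQ queries; add the score difference when the newest page scores below the oldest). How a page's year is determined is not shown in this extract, so the `year` field, the `Result` record, and the scores are assumptions.

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class Result:
    url: str
    score: float
    year: Optional[int]  # assumed: year indicator detected for the page, if any

def rerank(results, is_req: bool):
    """Promote the newest dated page when the query is a REQ."""
    boosted = list(results)
    dated = [r for r in results if r.year is not None]
    if is_req and len(dated) >= 2:
        newest = max(dated, key=lambda r: r.year)
        oldest = min(dated, key=lambda r: r.year)
        if newest.year != oldest.year and newest.score < oldest.score:
            diff = oldest.score - newest.score
            boosted = [replace(r, score=r.score + diff) if r is newest else r
                       for r in results]
    return sorted(boosted, key=lambda r: r.score, reverse=True)

# Hypothetical ranking scores.
results = [Result("a", 9.0, 2008), Result("c", 7.0, None), Result("b", 5.0, 2010)]
req_ranking = rerank(results, is_req=True)
plain_ranking = rerank(results, is_req=False)
```

Adding exactly the difference only ties the newest page with the oldest one; how ties are broken in the production ranker is not stated in this extract.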

90 We extracted the top five search results for each query under three configurations: organic search engine results, (Zhang et al. [sent-422, score-0.808]

91 Editors assign five grades according to relevance between query and articles: Perfect, Excellent, Good, Fair, and Bad. [sent-425, score-0.461]

92 For example, a “Perfect” grade means the content of the url matches the query intent exactly. [sent-426, score-0.518]

93 We divided 800 test queries into 10 buckets according to the classifier probability. [sent-437, score-0.323]

94 1], contains the query with a classifier probability greater than 0 but less than 0. [sent-440, score-0.467]
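
Splitting test queries into ten buckets by classifier probability might look like the sketch below. The half-open boundary convention (0.1 falls into the second bucket, 1.0 into the last) is an assumption, since the extract does not fully specify the interval endpoints, and the scored queries are invented.

```python
from collections import defaultdict

def bucket_index(probability: float, n_buckets: int = 10) -> int:
    """Index of the equal-width bucket that contains `probability`."""
    return min(int(probability * n_buckets), n_buckets - 1)

def bucketize(scored_queries, n_buckets: int = 10) -> dict:
    """Group (query, probability) pairs by bucket index."""
    buckets = defaultdict(list)
    for query, prob in scored_queries:
        buckets[bucket_index(prob, n_buckets)].append(query)
    return dict(buckets)

# Hypothetical classifier outputs.
scored = [("a", 0.05), ("b", 0.07), ("c", 0.95)]
buckets = bucketize(scored)
```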

95 Our results are compared with organic search results, but we also show the improvements over search organic by (Zhang et al. [sent-442, score-0.38]

96 This type of query cannot be handled by traditional ranking methods. [sent-464, score-0.324]

97 Our proposed methods are novel compared with traditional query classification methods. [sent-467, score-0.466]

98 We identified and developed features from query log, search session, click and time series analysis. [sent-468, score-0.681]

99 Automatic web query classification using labeled and unlabeled training data. [sent-484, score-0.534]

100 Q2c@ust: our winning solution to query classification in kddcup 2005. [sent-613, score-0.466]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('req', 0.505), ('query', 0.428), ('timestamp', 0.281), ('recurrent', 0.265), ('year', 0.234), ('queries', 0.23), ('click', 0.151), ('search', 0.102), ('dcg', 0.097), ('ranking', 0.094), ('organic', 0.088), ('oscar', 0.087), ('implicit', 0.081), ('dong', 0.078), ('event', 0.078), ('page', 0.076), ('autocorrelation', 0.076), ('zhang', 0.069), ('web', 0.068), ('url', 0.065), ('chisquareyeardist', 0.063), ('gbdt', 0.063), ('newest', 0.063), ('ruiqiang', 0.063), ('userswitch', 0.063), ('log', 0.062), ('engine', 0.057), ('issued', 0.054), ('buckets', 0.054), ('metzler', 0.054), ('seed', 0.054), ('normalizeduserswitch', 0.051), ('users', 0.048), ('urls', 0.045), ('pagerank', 0.045), ('diaz', 0.043), ('regression', 0.041), ('temporal', 0.041), ('indicators', 0.04), ('titles', 0.04), ('classifier', 0.039), ('explicit', 0.039), ('cho', 0.039), ('awards', 0.039), ('dynamics', 0.039), ('pan', 0.039), ('yi', 0.039), ('anlei', 0.038), ('ctr', 0.038), ('elsas', 0.038), ('explicitqueryratio', 0.038), ('freshness', 0.038), ('iphone', 0.038), ('pandey', 0.038), ('recency', 0.038), ('yearurlfpctr', 0.038), ('yqq', 0.038), ('zhaohui', 0.038), ('classification', 0.038), ('sigir', 0.038), ('indication', 0.037), ('boosting', 0.034), ('bayes', 0.033), ('relevance', 0.033), ('miss', 0.033), ('period', 0.033), ('seeds', 0.033), ('google', 0.033), ('boosted', 0.032), ('wsdm', 0.032), ('apple', 0.032), ('calendar', 0.032), ('frequency', 0.032), ('academy', 0.032), ('zheng', 0.032), ('top', 0.031), ('tree', 0.03), ('naive', 0.03), ('basketball', 0.029), ('ford', 0.029), ('clicks', 0.029), ('decision', 0.029), ('indicator', 0.029), ('aggregated', 0.027), ('dn', 0.027), ('libsvm', 0.027), ('reviews', 0.027), ('chang', 0.027), ('session', 0.026), ('gradient', 0.026), ('ranked', 0.025), ('adidas', 0.025), ('avenumbertokenseeds', 0.025), ('bai', 0.025), ('berberich', 0.025), ('buzziness', 0.025), ('contest', 0.025), ('cup', 0.025), ('grade', 0.025), ('ipo', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 73 emnlp-2010-Learning Recurrent Event Queries for Web Search

2 0.24386953 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media

Author: Yunliang Jiang ; Cindy Xide Lin ; Qiaozhu Mei

Abstract: In this paper, we conducted a systematic comparative analysis of language in different contexts of bursty topics, including web search, news media, blogging, and social bookmarking. We analyze (1) the content similarity and predictability between contexts, (2) the coverage of search content by each context, and (3) the intrinsic coherence of information in each context. Our experiments show that social bookmarking is a better predictor to the bursty search queries, but news media and social blogging media have a much more compelling coverage. This comparison provides insights on how the search behaviors and social information sharing behaviors of users are correlated to the professional news media in the context of bursty events.

3 0.18063277 66 emnlp-2010-Inducing Word Senses to Improve Web Search Result Clustering

Author: Roberto Navigli ; Giuseppe Crisafulli

Abstract: In this paper, we present a novel approach to Web search result clustering based on the automatic discovery of word senses from raw text, a task referred to as Word Sense Induction (WSI). We first acquire the senses (i.e., meanings) of a query by means of a graph-based clustering algorithm that exploits cycles (triangles and squares) in the co-occurrence graph of the query. Then we cluster the search results based on their semantic similarity to the induced word senses. Our experiments, conducted on datasets of ambiguous queries, show that our approach improves search result clustering in terms of both clustering quality and degree of diversification.

4 0.11356907 56 emnlp-2010-Hashing-Based Approaches to Spelling Correction of Personal Names

Author: Raghavendra Udupa ; Shaishav Kumar

Abstract: We propose two hashing-based solutions to the problem of fast and effective personal names spelling correction in People Search applications. The key idea behind our methods is to learn hash functions that map similar names to similar (and compact) binary codewords. The two methods differ in the data they use for learning the hash functions - the first method uses a set of names in a given language/script whereas the second uses a set of bilingual names. We show that both methods give excellent retrieval performance in comparison to several baselines on two lists of misspelled personal names. Moreover, the method that uses bilingual data for learning hash functions gives the best performance.

5 0.091124207 90 emnlp-2010-Positional Language Models for Clinical Information Retrieval

Author: Florian Boudin ; Jian-Yun Nie ; Martin Dawes

Abstract: The PECO framework is a knowledge representation for formulating clinical questions. Queries are decomposed into four aspects, which are Patient-Problem (P), Exposure (E), Comparison (C) and Outcome (O). However, no test collection is available to evaluate such framework in information retrieval. In this work, we first present the construction of a large test collection extracted from systematic literature reviews. We then describe an analysis of the distribution of PECO elements throughout the relevant documents and propose a language modeling approach that uses these distributions as a weighting strategy. In our experiments carried out on a collection of 1.5 million documents and 423 queries, our method was found to lead to an improvement of 28% in MAP and 50% in P@5, as compared to the state-of-the-art method.

6 0.085520163 55 emnlp-2010-Handling Noisy Queries in Cross Language FAQ Retrieval

7 0.075905561 16 emnlp-2010-An Approach of Generating Personalized Views from Normalized Electronic Dictionaries : A Practical Experiment on Arabic Language

8 0.066931032 122 emnlp-2010-WikiWars: A New Corpus for Research on Temporal Expressions

9 0.053698312 106 emnlp-2010-Top-Down Nearly-Context-Sensitive Parsing

10 0.052943211 20 emnlp-2010-Automatic Detection and Classification of Social Events

11 0.051622517 51 emnlp-2010-Function-Based Question Classification for General QA

12 0.045697417 82 emnlp-2010-Multi-Document Summarization Using A* Search and Discriminative Learning

13 0.044195402 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning

14 0.042979937 27 emnlp-2010-Clustering-Based Stratified Seed Sampling for Semi-Supervised Relation Classification

15 0.041442666 92 emnlp-2010-Predicting the Semantic Compositionality of Prefix Verbs

16 0.038921215 64 emnlp-2010-Incorporating Content Structure into Text Analysis Applications

17 0.037713788 31 emnlp-2010-Constraints Based Taxonomic Relation Classification

18 0.037406728 100 emnlp-2010-Staying Informed: Supervised and Semi-Supervised Multi-View Topical Analysis of Ideological Perspective

19 0.036882009 79 emnlp-2010-Mining Name Translations from Entity Graph Mapping

20 0.035981197 46 emnlp-2010-Evaluating the Impact of Alternative Dependency Graph Encodings on Solving Event Extraction Tasks


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.139), (1, 0.099), (2, -0.117), (3, 0.173), (4, 0.119), (5, 0.079), (6, -0.23), (7, 0.04), (8, -0.278), (9, 0.122), (10, -0.081), (11, 0.162), (12, 0.234), (13, 0.078), (14, -0.092), (15, -0.041), (16, 0.108), (17, 0.131), (18, -0.092), (19, 0.02), (20, -0.057), (21, -0.023), (22, 0.221), (23, -0.119), (24, -0.075), (25, 0.103), (26, -0.01), (27, -0.057), (28, 0.004), (29, 0.051), (30, 0.039), (31, 0.049), (32, -0.016), (33, -0.009), (34, 0.027), (35, 0.057), (36, -0.02), (37, 0.059), (38, -0.074), (39, -0.006), (40, -0.046), (41, -0.116), (42, -0.012), (43, 0.022), (44, 0.006), (45, -0.014), (46, 0.032), (47, 0.049), (48, -0.084), (49, 0.001)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97873694 73 emnlp-2010-Learning Recurrent Event Queries for Web Search

2 0.91789007 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media

Author: Yunliang Jiang ; Cindy Xide Lin ; Qiaozhu Mei

Abstract: In this paper, we conducted a systematic comparative analysis of language in different contexts of bursty topics, including web search, news media, blogging, and social bookmarking. We analyze (1) the content similarity and predictability between contexts, (2) the coverage of search content by each context, and (3) the intrinsic coherence of information in each context. Our experiments show that social bookmarking is a better predictor to the bursty search queries, but news media and social blogging media have a much more compelling coverage. This comparison provides insights on how the search behaviors and social information sharing behaviors of users are correlated to the professional news media in the context of bursty events.

3 0.67894745 66 emnlp-2010-Inducing Word Senses to Improve Web Search Result Clustering

Author: Roberto Navigli ; Giuseppe Crisafulli

Abstract: In this paper, we present a novel approach to Web search result clustering based on the automatic discovery of word senses from raw text, a task referred to as Word Sense Induction (WSI). We first acquire the senses (i.e., meanings) of a query by means of a graph-based clustering algorithm that exploits cycles (triangles and squares) in the co-occurrence graph of the query. Then we cluster the search results based on their semantic similarity to the induced word senses. Our experiments, conducted on datasets of ambiguous queries, show that our approach improves search result clustering in terms of both clustering quality and degree of diversification.

4 0.46622333 90 emnlp-2010-Positional Language Models for Clinical Information Retrieval

Author: Florian Boudin ; Jian-Yun Nie ; Martin Dawes

Abstract: The PECO framework is a knowledge representation for formulating clinical questions. Queries are decomposed into four aspects, which are Patient-Problem (P), Exposure (E), Comparison (C) and Outcome (O). However, no test collection is available to evaluate such framework in information retrieval. In this work, we first present the construction of a large test collection extracted from systematic literature reviews. We then describe an analysis of the distribution of PECO elements throughout the relevant documents and propose a language modeling approach that uses these distributions as a weighting strategy. In our experiments carried out on a collection of 1.5 million documents and 423 queries, our method was found to lead to an improvement of 28% in MAP and 50% in P@5, as compared to the state-of-the-art method.

5 0.42584386 56 emnlp-2010-Hashing-Based Approaches to Spelling Correction of Personal Names

Author: Raghavendra Udupa ; Shaishav Kumar

Abstract: We propose two hashing-based solutions to the problem of fast and effective personal names spelling correction in People Search applications. The key idea behind our methods is to learn hash functions that map similar names to similar (and compact) binary codewords. The two methods differ in the data they use for learning the hash functions - the first method uses a set of names in a given language/script whereas the second uses a set of bilingual names. We show that both methods give excellent retrieval performance in comparison to several baselines on two lists of misspelled personal names. Moreover, the method that uses bilingual data for learning hash functions gives the best performance.

6 0.42470574 55 emnlp-2010-Handling Noisy Queries in Cross Language FAQ Retrieval

7 0.31344062 16 emnlp-2010-An Approach of Generating Personalized Views from Normalized Electronic Dictionaries : A Practical Experiment on Arabic Language

8 0.21502799 20 emnlp-2010-Automatic Detection and Classification of Social Events

9 0.21445987 122 emnlp-2010-WikiWars: A New Corpus for Research on Temporal Expressions

10 0.17807315 82 emnlp-2010-Multi-Document Summarization Using A* Search and Discriminative Learning

11 0.16561282 92 emnlp-2010-Predicting the Semantic Compositionality of Prefix Verbs

12 0.16514517 106 emnlp-2010-Top-Down Nearly-Context-Sensitive Parsing

13 0.15395784 100 emnlp-2010-Staying Informed: Supervised and Semi-Supervised Multi-View Topical Analysis of Ideological Perspective

14 0.14358188 23 emnlp-2010-Automatic Keyphrase Extraction via Topic Decomposition

15 0.1405836 17 emnlp-2010-An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL

16 0.14024711 120 emnlp-2010-What's with the Attitude? Identifying Sentences with Attitude in Online Discussions

17 0.13528652 31 emnlp-2010-Constraints Based Taxonomic Relation Classification

18 0.12556316 41 emnlp-2010-Efficient Graph-Based Semi-Supervised Learning of Structured Tagging Models

19 0.11501784 61 emnlp-2010-Improving Gender Classification of Blog Authors

20 0.11418851 79 emnlp-2010-Mining Name Translations from Entity Graph Mapping


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.023), (10, 0.013), (12, 0.034), (18, 0.423), (29, 0.07), (30, 0.024), (32, 0.014), (52, 0.016), (56, 0.047), (62, 0.02), (66, 0.081), (72, 0.088), (76, 0.015), (82, 0.014), (87, 0.025), (89, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.74494332 73 emnlp-2010-Learning Recurrent Event Queries for Web Search

2 0.50293475 120 emnlp-2010-What's with the Attitude? Identifying Sentences with Attitude in Online Discussions

Author: Ahmed Hassan ; Vahed Qazvinian ; Dragomir Radev

Abstract: Mining sentiment from user generated content is a very important task in Natural Language Processing. An example of such content is threaded discussions which act as a very important tool for communication and collaboration in the Web. Threaded discussions include e-mails, e-mail lists, bulletin boards, newsgroups, and Internet forums. Most of the work on sentiment analysis has been centered around finding the sentiment toward products or topics. In this work, we present a method to identify the attitude of participants in an online discussion toward one another. This would enable us to build a signed network representation of participant interaction where every edge has a sign that indicates whether the interaction is positive or negative. This is different from most of the research on social networks that has focused almost exclusively on positive links. The method is experimentally tested using a manually labeled set of discussion posts. The results show that the proposed method is capable of identifying attitudinal sentences, and their signs, with high accuracy and that it outperforms several other baselines.

3 0.35003191 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media

Author: Yunliang Jiang ; Cindy Xide Lin ; Qiaozhu Mei

Abstract: In this paper, we conducted a systematic comparative analysis of language in different contexts of bursty topics, including web search, news media, blogging, and social bookmarking. We analyze (1) the content similarity and predictability between contexts, (2) the coverage of search content by each context, and (3) the intrinsic coherence of information in each context. Our experiments show that social bookmarking is a better predictor to the bursty search queries, but news media and social blogging media have a much more compelling coverage. This comparison provides insights on how the search behaviors and social information sharing behaviors of users are correlated to the professional news media in the context of bursty events.

4 0.32048139 117 emnlp-2010-Using Unknown Word Techniques to Learn Known Words

Author: Kostadin Cholakov ; Gertjan van Noord

Abstract: Unknown words are a hindrance to the performance of hand-crafted computational grammars of natural language. However, words with incomplete and incorrect lexical entries pose an even bigger problem because they can be the cause of a parsing failure despite being listed in the lexicon of the grammar. Such lexical entries are hard to detect and even harder to correct. We employ an error miner to pinpoint words with problematic lexical entries. An automated lexical acquisition technique is then used to learn new entries for those words which allows the grammar to parse previously uncovered sentences successfully. We test our method on a large-scale grammar of Dutch and a set of sentences for which this grammar fails to produce a parse. The application of the method enables the grammar to cover 83.76% of those sentences with an accuracy of 86.15%.

5 0.31341073 11 emnlp-2010-A Semi-Supervised Approach to Improve Classification of Infrequent Discourse Relations Using Feature Vector Extension

Author: Hugo Hernault ; Danushka Bollegala ; Mitsuru Ishizuka

Abstract: Several recent discourse parsers have employed fully-supervised machine learning approaches. These methods require human annotators to create an extensive training corpus beforehand, which is a time-consuming and costly process. On the other hand, unlabeled data is abundant and cheap to collect. In this paper, we propose a novel semi-supervised method for discourse relation classification based on the analysis of co-occurring features in unlabeled data, which is then taken into account for extending the feature vectors given to a classifier. Our experimental results on the RST Discourse Treebank corpus and Penn Discourse Treebank indicate that the proposed method brings a significant improvement in classification accuracy and macro-average F-score when small training datasets are used. For instance, with training sets of ca. 1000 labeled instances, the proposed method brings improvements in accuracy and macro-average F-score up to 50% compared to a baseline classifier. We believe that the proposed method is a first step towards detecting low-occurrence relations, which is useful for domains with a lack of annotated data.

6 0.31040454 17 emnlp-2010-An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL

7 0.30810356 122 emnlp-2010-WikiWars: A New Corpus for Research on Temporal Expressions

8 0.30601969 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation

9 0.30566338 69 emnlp-2010-Joint Training and Decoding Using Virtual Nodes for Cascaded Segmentation and Tagging Tasks

10 0.30521929 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice

11 0.30448952 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors

12 0.30210006 20 emnlp-2010-Automatic Detection and Classification of Social Events

13 0.30111292 49 emnlp-2010-Extracting Opinion Targets in a Single and Cross-Domain Setting with Conditional Random Fields

14 0.30104595 2 emnlp-2010-A Fast Decoder for Joint Word Segmentation and POS-Tagging Using a Single Discriminative Model

15 0.30056411 82 emnlp-2010-Multi-Document Summarization Using A* Search and Discriminative Learning

16 0.30046961 84 emnlp-2010-NLP on Spoken Documents Without ASR

17 0.29953095 65 emnlp-2010-Inducing Probabilistic CCG Grammars from Logical Form with Higher-Order Unification

18 0.29867625 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding

19 0.29835677 87 emnlp-2010-Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space

20 0.29800409 86 emnlp-2010-Non-Isomorphic Forest Pair Translation