emnlp emnlp2010 emnlp2010-32 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yunliang Jiang ; Cindy Xide Lin ; Qiaozhu Mei
Abstract: In this paper, we conducted a systematic comparative analysis of language in different contexts of bursty topics, including web search, news media, blogging, and social bookmarking. We analyze (1) the content similarity and predictability between contexts, (2) the coverage of search content by each context, and (3) the intrinsic coherence of information in each context. Our experiments show that social bookmarking is a better predictor to the bursty search queries, but news media and social blogging media have a much more compelling coverage. This comparison provides insights on how the search behaviors and social information sharing behaviors of users are correlated to the professional news media in the context of bursty events.
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper, we conducted a systematic comparative analysis of language in different contexts of bursty topics, including web search, news media, blogging, and social bookmarking. [sent-4, score-1.135]
2 Our experiments show that social bookmarking is a better predictor to the bursty search queries, but news media and social blogging media have a much more compelling coverage. [sent-6, score-1.517]
3 This comparison provides insights on how the search behaviors and social information sharing behaviors of users are correlated to the professional news media in the context of bursty events. [sent-7, score-1.467]
4 When I search for “msr” I always try to access Microsoft research; and even if I misspelled it, a smart search engine could suggest a correct query based on my query history, the current session of queries, and/or the queries that other people have been using. [sent-11, score-1.186]
5 When a new event happens, the burst of new contents and new interests make it hard to predict what people would search and to suggest what queries they should use. [sent-17, score-0.483]
6 Indeed, there is already considerable effort in seeking help from these sources, by the integration of news and blogs into search results or the use of social bookmarks to enhance search. [sent-23, score-0.727]
7 How does the bursting content in web search, news media, social media, and social bookmarks correlate, and how does it differ across these sources? [sent-28, score-0.767]
8 enhancing search results, query suggestion, keyword bidding on advertisement, etc) by integrating the information from all these sources, it is appealing to have an investigation of feasibility. [sent-31, score-0.445]
9 [Page footer: Methods in Natural Language Processing, pages 1077–1087. © 2010 Association for Computational Linguistics] we write, and what we tag in the scenarios of bursty events. [sent-34, score-0.626]
10 Specifically, we analyze the language used in different contexts of bursty events, including two different query log contexts, two news media contexts, two blog contexts, and an additional context of social bookmarks. [sent-35, score-1.804]
11 A variety of experiments have been conducted, including the content similarity and cross-entropy between sources, the coverage of search queries in online media, and an in-depth semantic comparison of sources based on language networks. [sent-36, score-0.6]
12 These methods are all related to the preprocessing step of our analysis: detecting bursty queries from the query log effectively. [sent-45, score-1.182]
13 Our work can lead to many useful search applications, such as query suggestion which takes as input a specific query and returns as output one or several suggested queries. [sent-67, score-0.804]
14 3 Analysis Setup Tasks of web information retrieval such as web search generally perform very well on frequent and navigational queries (Broder, 2002) such as “chicago” or “yahoo movies.” [sent-71, score-0.504]
15 A considerable challenge in web search remains in how to handle informational queries, especially queries that reflect new information needs and suddenly changed information needs of users. [sent-72, score-0.467]
16 Many such scenarios are caused by the emergence of bursty events (e. [sent-73, score-0.686]
17 The focus of this paper is to analyze how other online media sources react to those bursty events and how those reactions compare to the reaction in web search. [sent-77, score-1.086]
18 This analysis thus serves as a preliminary investigation of the feasibility of leveraging other sources to enhance the search of bursty topics. [sent-78, score-0.846]
19 Therefore, we focus on the “event-related” topics which present as bursty queries submitted to a search engine. [sent-79, score-1.139]
20 These queries not only reflect the suddenly changed information need of users, but also trigger the correlated reactions in other online sources, such as news media, blog media, social bookmarks, etc. [sent-80, score-0.93]
21 We begin with the extraction of bursty topics from the query log. [sent-81, score-1.047]
22 1 Bursty Topic Extraction Search engine logs (or query logs) store the history of users’ search behaviors, which reflect users’ interests and information need. [sent-83, score-0.519]
23 It is common practice to segment query log into search sessions, each of which represents one user’s searching activities in a short period of time. [sent-85, score-0.492]
24 1 Find bursty queries from query log How to extract the queries that represent bursty events? [sent-90, score-2.014]
25 We believe that a bursty query exhibits a significant spike in its day-by-day search volume; that is, the frequency with which users submit the query should suddenly increase at one specific time and drop after a while. [sent-91, score-1.348]
26 This assumption is consistent with existing work of finding bursty patterns in emails, scientific literature (Kleinberg, 2003), and blogs (Gruhl et al. [sent-92, score-0.738]
27 This is reasonable since a bursty event usually causes a large volume of search activities. [sent-97, score-0.795]
28 • Let fmax (q) be the maximum search volume of a query q in one day (i. [sent-98, score-0.507]
29 We further balance these two ratios by ranking the bursty 1Now known as Bing: www. [sent-108, score-0.626]
30 8 (based on several tests), we select the top 130 bigram queries which form the pool of bursty topics for our analysis. [sent-112, score-1.008]
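The extracted sentences describe the burst criterion only partially: a query's daily volume should spike, two ratios are balanced, and the top 130 bigram queries are kept. The two ratios themselves are not recoverable from this page, so the following is only a minimal sketch under assumptions. The score used here (peak daily volume divided by mean daily volume) and the names `burst_score`, `top_bursty_bigrams`, and `min_peak` are hypothetical and are not the authors' formula.

```python
from collections import Counter, defaultdict


def daily_volumes(log):
    """Map each query to a Counter of {day: number of submissions}.

    `log` is an iterable of (day, query) pairs (an assumed input format).
    """
    volumes = defaultdict(Counter)
    for day, query in log:
        volumes[query][day] += 1
    return volumes


def burst_score(day_counts, num_days):
    """Hypothetical spike score: peak daily volume over mean daily volume."""
    f_max = max(day_counts.values())               # f_max(q) in the text
    mean = sum(day_counts.values()) / num_days     # average volume per day
    return f_max / mean


def top_bursty_bigrams(log, num_days, k=130, min_peak=2):
    """Rank two-word queries by burst score and keep the top k."""
    volumes = daily_volumes(log)
    scored = [
        (burst_score(counts, num_days), query)
        for query, counts in volumes.items()
        if len(query.split()) == 2 and max(counts.values()) >= min_peak
    ]
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [query for _, query in scored[:k]]
```

A query submitted evenly every day scores 1.0 under this stand-in measure, while a query concentrated on a single day scores close to `num_days`, which is the spike shape the text describes.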
31 2 Context extraction from multiple sources Once we select the pool of bursty topics, we gather the contexts of each topic from multiple sources: query log, news media, blog media, and social bookmarks. [sent-115, score-1.726]
32 We assume that the language in these contexts will reflect the reactions of the bursty events in corresponding online media. [sent-116, score-0.876]
33 1 Super query context The most straightforward context of bursty events in web search is the query string, which directly reflects the users’ interests and perspectives in the topic. [sent-119, score-1.665]
34 We therefore define the first type of context of a bursty topic in query log as the set of surrounding terms of that bursty bigram in the (longer) queries. [sent-120, score-1.795]
35 For example, the word aftermath in the query “haiti earthquake aftermath” is a term in the context of the bursty topic haiti earthquake. [sent-121, score-1.088]
36 Formally, we define a Super Query of a bursty topic t, sq(t), as the query which contains the bigram query t lexically as a substring. [sent-122, score-1.372]
37 For each bursty topic t, we scan the whole query log Q and retrieve all the super queries of t to form the context which is represented by SQ(t). [sent-123, score-1.486]
38 SQ(t) = {q|q ∈ Q and q = sq(t) } SQ(t) is defined as the super query context of t. [sent-124, score-0.501]
39 For example, the super query context of ”kentucky election” contains terms such as “2006,” “results,” “christian county,” etc. [sent-125, score-0.501]
40 The super query context is widely explored by search engines to provide query expansion and query completion (Jones et al. [sent-127, score-1.249]
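The definition of SQ(t) above can be sketched in a few lines. This is an illustration under assumptions, not the authors' code: the topic is matched as a contiguous token sequence rather than a raw character substring (to avoid partial-word matches), the topic keywords are dropped from the context as described in the cleaning step, and the function names are hypothetical.

```python
from collections import Counter


def is_super_query(topic, query):
    """True if `query` is longer than `topic` and contains it as a
    contiguous run of tokens (assumed reading of 'lexically as a substring')."""
    t, q = topic.split(), query.split()
    if len(q) <= len(t):
        return False
    return any(q[i:i + len(t)] == t for i in range(len(q) - len(t) + 1))


def super_query_context(topic, query_log):
    """Count the surrounding terms of `topic` over all of its super queries."""
    topic_terms = set(topic.split())
    context = Counter()
    for query in query_log:
        if is_super_query(topic, query):
            # Keep only the terms that surround the bursty bigram.
            context.update(tok for tok in query.split() if tok not in topic_terms)
    return context
```

For the example in the text, `super_query_context("haiti earthquake", ["haiti earthquake aftermath"])` yields a context containing the single term `aftermath`.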
41 2 Query session context Another interesting context of a bursty topic in query log is the sequence of queries that a user searches after he submitted the bursty query q. [sent-131, score-2.439]
42 We define a Query Session containing a bursty topic t, qs(t), as the queries which are issued by the same user after he issued t, within 30 minutes. [sent-133, score-0.944]
43 For each bursty topic t, we collect all the qs(t) to form the query session context of t, QS(t) : QS(t) = {q|q ∈ Q and q ∈ qs(t)} In web search, the query session context is usually utilized to provide query suggestion and query reformulation (Radlinski and Joachims, 2005). [sent-134, score-2.286]
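The query session context QS(t) can be sketched as follows. The log format (user id, timestamp in seconds, query string), the exact-match test for the topic, and the function name are assumptions made for illustration; only the 30-minute window and the "same user, after issuing t" rule come from the text.

```python
from collections import defaultdict


def query_session_context(topic, log, window=30 * 60):
    """Collect queries issued by the same user within `window` seconds
    after that user submitted `topic`.

    `log` is an iterable of (user, timestamp_seconds, query) triples.
    """
    by_user = defaultdict(list)
    for user, ts, query in log:
        by_user[user].append((ts, query))

    context = []
    for entries in by_user.values():
        entries.sort()
        # Times at which this user issued the bursty topic itself.
        starts = [ts for ts, q in entries if q == topic]
        for ts, q in entries:
            if q != topic and any(0 < ts - s <= window for s in starts):
                context.append(q)
    return context
```

Each follow-up query is counted once even if it falls inside several overlapping windows, which is one possible design choice among others.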
44 3 News contexts News articles written by critics and journalists reflect the reactions and perspectives of such professional group of people to a bursty event. [sent-137, score-0.898]
45 We collect news articles about these 130 bursty topics from Google News2, by finding the most relevant news articles which (1) match the bursty topic t, (2) were published in May, 2006, and (3) were published by any of five major news media outlets: CNN, NBC, ABC, New York Times and Washington Post. [sent-138, score-2.161]
46 This provides us two contexts of each bursty topic t: the set of relevant news titles, NT(t), and the set of relevant news bodies, NB(t). [sent-140, score-1.15]
47 4 Blog contexts Compared with news articles, blog articles are written by common users in the online communities, which are supposed to reflect the reactions and 2http://news. [sent-143, score-0.768]
48 opinions of the public to the bursty events. [sent-145, score-0.626]
49 We collect blog articles about these 130 topics from Google Blog3, by finding the most relevant blog articles which (1) match the bursty topic t, (2) were published in May, 2006, and (3) were published in the most popular blog community, Blogspot4. [sent-146, score-1.709]
50 We then retrieve the title and body of each relevant blog post respectively. [sent-147, score-0.41]
51 This provides another two contexts: the set of relevant blog titles, BT(t), and the set of relevant blog bodies, BB(t). [sent-148, score-0.474]
52 5 Social bookmarking context Social bookmarks form a new source of social media that allows the users to tag the webpages they are interested in and share their tags with others. [sent-151, score-0.607]
53 We observe that the bursty bigram queries are also frequently used as tags in Delicious. [sent-155, score-0.89]
54 We thus construct another context of bursty events by collecting all the tags that are used to tag the same URLs as the bursty topic. [sent-156, score-1.365]
55 3 Context Statistics Now we have constructed the set of 130 bursty topics and 7 corresponding contexts from various sources. [sent-161, score-0.828]
56 For each context, we then clean the data by removing stopwords and the bursty topic keywords themselves. [sent-163, score-0.734]
57 Table 2 shows the basic statistics of each context: From Table 2 we observe the following facts: • The query session context covers more terms (both unigrams and bigrams) than the super query 3http://blogsearch. [sent-165, score-0.975]
58 • News articles and blog articles cover most of the bursty topics and contain a rich set of unigrams and bigrams in the corresponding contexts. [sent-179, score-1.281]
59 • The Delicious context only covers less than 60% of bursty topics. [sent-180, score-0.679]
60 In Section 4, we present a comprehensive analysis of these different contexts of bursty topics, with three different types of comparison. [sent-182, score-0.74]
61 4 Experiment In this section, we present a comprehensive comparative analysis of the different contexts, which represent the reactions to the bursty topics in corresponding sources. [sent-183, score-0.878]
62 By representing each context of a bursty topic as a vector space model of unigrams/bigrams, we first compute and compare the average cosine similarity between contexts. [sent-188, score-0.796]
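The average cosine similarity between two contexts can be sketched as below. Raw term counts are used as vector weights, and the average is taken over topics present in both contexts; both choices are assumptions, since the extracted text does not state the weighting scheme or how missing topics are handled.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def average_cosine(context_a, context_b):
    """Average per-topic cosine similarity.

    `context_a` and `context_b` map topic -> Counter of terms, for example
    the super query context and the news title context of each topic.
    """
    shared = context_a.keys() & context_b.keys()
    if not shared:
        return 0.0
    return sum(cosine(context_a[t], context_b[t]) for t in shared) / len(shared)
```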
63 To investigate how well one source can predict the content of another, we also represent each context of a bursty topic as a unigram/bigram language model and compute the Cross Entropy (Kullback and Leibler, 1951) between every pair of contexts. [sent-192, score-0.829]
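Cross entropy between two unigram language models, H(p, q) = -Σ p(w) log q(w), can be sketched as follows. The additive smoothing over the joint vocabulary and the parameter `alpha` are assumptions introduced so that unseen words do not yield infinite values; the smoothing method actually used in the paper is not stated in the extracted text.

```python
import math
from collections import Counter


def cross_entropy(p_counts, q_counts, alpha=0.01):
    """Cross entropy (in bits) of the model estimated from `q_counts`
    when predicting the empirical distribution of `p_counts`.

    Lower values mean q is a better predictor of p.
    """
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values())
    # Additive smoothing (assumed): every vocabulary word gets alpha extra mass.
    q_total = sum(q_counts.values()) + alpha * len(vocab)
    entropy = 0.0
    for word, count in p_counts.items():
        p = count / p_total
        q = (q_counts.get(word, 0) + alpha) / q_total
        entropy -= p * math.log2(q)
    return entropy
```

The measure is asymmetric: `cross_entropy(sq, news)` asks how well news predicts the super query context, while `cross_entropy(news, sq)` asks the reverse.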
64 1 Results From the results shown in Tables A-D, or more visually in Figures 1-4, some interesting phenomena can be observed: • Compared with other contexts, query session is much more similar to the super query. [sent-200, score-0.535]
65 This makes sense because many super queries would be included in the query session. [sent-201, score-0.654]
66 • Compared with news and blog, the Delicious context is closer to the query log context. [sent-202, score-0.591]
67 This means social tags could be an effective source to enhance bursty topics in web search in terms of query suggestion. [sent-204, score-1.404]
68 • In the news and blog contexts, the title contexts are more similar to the query contexts than the body contexts. [sent-207, score-0.976]
69 [Table 3: Cross-entropy among three sources; table body not recoverable, column label HCE(m||SQ)] • News would be a better predictor of the query than blog in general. [sent-233, score-0.644]
70 • News and blogs are much more similar to each other than query logs. [sent-235, score-0.415]
71 We hypothesize that this result reflects how people write blogs about bursty events: typically they may have read several news articles before writing their own blog. [sent-236, score-1.093]
72 From the upper table, we can observe that queries are more likely to be generated by news articles, rather than blog articles. [sent-239, score-0.622]
73 From the lower table, we can observe that queries are more likely to generate blog articles(body), rather than news articles(body). [sent-240, score-0.622]
74 This result is quite interesting and indicates the users’ actual behavior: when a bursty event happens, users search for it on the web after they have read about it in news articles. [sent-241, score-1.178]
75 • From Table 3 we also find that queries are more likely to generate news titles, rather than blog titles. [sent-243, score-0.622]
76 If one is a good predictor of bursty queries, the other one also tends to be. [sent-251, score-0.652]
77 For these unfamiliar topics, users possibly search the web “after” they read the news articles and express their diverse opinions in the blog. [sent-262, score-0.586]
78 For these daily-life-related queries, users would express the similar opinions when they search or write blogs, while news articles typically report such “professional” viewpoints. [sent-264, score-0.513]
79 2 Coverage analysis Are social bookmarks the best source to predict bursty content in search? [sent-266, score-0.988]
80 In this experiment, we analyze the coverage of query contexts in other contexts in a systematic way. [sent-268, score-0.561]
81 If the majority of terms in the super query context would be covered by a small proportion of top words from another source, this source has the potential. [sent-269, score-0.501]
82 1 Unigram coverage We first analyze the coverage of unigrams from the super query context in four other contexts: QS, DT, News (the combination of NT and NB) and Blog (the combination of BT and BB) to compare with SQ. [sent-272, score-0.762]
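The coverage idea, how much of the super query vocabulary is found among the top words of another source, can be sketched as below. The exact definitions of "coverage rate" and "size ratio" are not recoverable from the extracted sentences, so this sketch assumes coverage is the fraction of distinct super query terms that appear among the `k` most frequent terms of the other source; the function names are hypothetical.

```python
from collections import Counter


def coverage_at_k(sq_counts, source_counts, k):
    """Fraction of distinct super-query terms found in the top-k source terms."""
    if not sq_counts:
        return 0.0
    top_terms = {word for word, _ in source_counts.most_common(k)}
    return len(top_terms & set(sq_counts)) / len(sq_counts)


def smallest_k_for_coverage(sq_counts, source_counts, target=0.5):
    """Smallest k whose top-k source terms reach the target coverage,
    or None if the source never reaches it."""
    for k in range(1, len(source_counts) + 1):
        if coverage_at_k(sq_counts, source_counts, k) >= target:
            return k
    return None
```

Under this reading, a source that reaches a given coverage with a smaller `k` is the more compact predictor, which is one way to interpret comparisons such as news versus blog at the 50% coverage level.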
83 We can observe that: • Query Session naturally covers most of the super query yte Srmesss (over 7tu0r%all)y. [sent-278, score-0.448]
84 • Though delicious tags are more similar to queries tohuagnh news iaonuds blog, as wee mll as a relatively higher coverage rate than the other two while size ratio is small, the overall coverage rate is quite low: only 21. [sent-279, score-0.597]
85 Note that this is contradict to existing comparative studies between social bookmarks and search logs (Bischoff et al. [sent-281, score-0.467]
86 Clearly, when considering bursty queries, the coverage and effectiveness of social bookmarks is much lower than considering all queries. [sent-283, score-0.983]
87 Handling bursty queries is much more difficult; only using social bookmarks to predict queries is not a good choice. [sent-284, score-1.357]
88 80%), which means news and blogs have a higher potential to predict the bursty topics in search. [sent-289, score-1.06]
89 Moreover, in most cases, news still ranks ahead of blog, not only in the overall rate but also in the size ratio comparison when the coverage rate reaches 50% (news: 109 < blog: 183). [sent-290, score-0.503]
90 Different from the unigram coverage, except that the query session can naturally keep a high coverage rate (66. [sent-297, score-0.429]
91 Therefore, except for some proper nouns such as a person’s name, many bigrams in the query log are formed in an ad hoc way. [sent-302, score-0.42]
92 In this section we discuss the inner relations within each particular context: for a particular bursty topic, how coherent is the information in each context? [sent-306, score-0.679]
93 It is because the queries in one user session can easily shift to other (irrelevant) topics even in a short time. [sent-325, score-0.417]
94 It can be explained by the roles of the title and the body in one article: the title contains a series of words which briefly summarize a topic while the body part would describe and discuss the title in detail. [sent-327, score-0.456]
95 5 Conclusion and Future work In this paper, we have studied and compared how the web content reacts to bursty events in multiple contexts of web search and online media. [sent-369, score-1.144]
96 After a series of comprehensive experiments including content similarity and predictability, the coverage of search content, and semantic diversity, we found that social bookmarks are not enough to predict the queries because of a low coverage. [sent-370, score-0.838]
97 Furthermore, news can be seen as a consistent source which would not only trigger the discussion of bursty events in blogs but also in search queries. [sent-372, score-1.119]
98 When the target is to diversify the search results and query suggestions, blogs and social bookmarks are potentially useful accessory sources because of the high diversity of content. [sent-373, score-0.929]
99 Our work serves as a feasibility investigation of query suggestion for bursty events. [sent-374, score-0.985]
100 Future work would address how to systematically predict and recommend bursty queries using online media, as well as reasonable evaluation metrics for this task. [sent-375, score-0.89]
wordName wordTfidf (topN-words)
[('bursty', 0.626), ('query', 0.303), ('blog', 0.237), ('queries', 0.206), ('news', 0.179), ('bookmarks', 0.157), ('super', 0.145), ('search', 0.142), ('social', 0.137), ('topics', 0.118), ('blogs', 0.112), ('media', 0.111), ('density', 0.109), ('unigrams', 0.108), ('users', 0.101), ('sq', 0.095), ('contexts', 0.084), ('topic', 0.082), ('web', 0.078), ('sources', 0.078), ('title', 0.076), ('body', 0.073), ('reactions', 0.073), ('bigrams', 0.07), ('qs', 0.065), ('coverage', 0.063), ('session', 0.063), ('delicious', 0.062), ('url', 0.062), ('articles', 0.061), ('hce', 0.06), ('heat', 0.06), ('events', 0.06), ('bigram', 0.058), ('bb', 0.056), ('suggestion', 0.056), ('context', 0.053), ('bookmarking', 0.048), ('interests', 0.047), ('submitted', 0.047), ('log', 0.047), ('vol', 0.043), ('content', 0.043), ('gruhl', 0.041), ('suddenly', 0.041), ('adar', 0.036), ('bischoff', 0.036), ('bodies', 0.036), ('burst', 0.036), ('bursting', 0.036), ('coference', 0.036), ('disperse', 0.036), ('fmax', 0.036), ('icwsm', 0.036), ('radlinski', 0.036), ('similarity', 0.035), ('online', 0.033), ('behaviors', 0.032), ('stands', 0.032), ('predictability', 0.031), ('denser', 0.031), ('hot', 0.031), ('comparative', 0.031), ('professional', 0.03), ('comprehensive', 0.03), ('write', 0.03), ('user', 0.03), ('weblogs', 0.029), ('gradually', 0.028), ('event', 0.027), ('engine', 0.027), ('analyze', 0.027), ('day', 0.026), ('nb', 0.026), ('predictor', 0.026), ('keywords', 0.026), ('bt', 0.026), ('nt', 0.026), ('published', 0.025), ('predict', 0.025), ('read', 0.025), ('ratio', 0.024), ('correlated', 0.024), ('tendency', 0.024), ('retrieve', 0.024), ('dt', 0.024), ('aftermath', 0.024), ('choen', 0.024), ('cointet', 0.024), ('heymann', 0.024), ('jasmine', 0.024), ('journalists', 0.024), ('kentucky', 0.024), ('krause', 0.024), ('kullback', 0.024), ('leibler', 0.024), ('lloyd', 0.024), ('novak', 0.024), ('prospectives', 0.024), ('sigmod', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media
2 0.24386953 73 emnlp-2010-Learning Recurrent Event Queries for Web Search
Author: Ruiqiang Zhang ; Yuki Konda ; Anlei Dong ; Pranam Kolari ; Yi Chang ; Zhaohui Zheng
Abstract: Recurrent event queries (REQ) constitute a special class of search queries occurring at regular, predictable time intervals. The freshness of documents ranked for such queries is generally of critical importance. REQ forms a significant volume, as much as 6% of query traffic received by search engines. In this work, we develop an improved REQ classifier that could provide significant improvements in addressing this problem. We analyze REQ queries, and develop novel features from multiple sources, and evaluate them using machine learning techniques. From historical query logs, we develop features utilizing query frequency, click information, and user intent dynamics within a search session. We also develop temporal features by time series analysis from query frequency. Other generated features include word matching with recurrent event seed words and time sensitivity of search result set. We use Naive Bayes, SVM and decision tree based logistic regression model to train REQ classifier. The results on test data show that our models outperformed baseline approach significantly. Experiments on a commercial Web search engine also show significant gains in overall relevance, and thus overall user experience.
3 0.15757091 66 emnlp-2010-Inducing Word Senses to Improve Web Search Result Clustering
Author: Roberto Navigli ; Giuseppe Crisafulli
Abstract: In this paper, we present a novel approach to Web search result clustering based on the automatic discovery of word senses from raw text, a task referred to as Word Sense Induction (WSI). We first acquire the senses (i.e., meanings) of a query by means of a graph-based clustering algorithm that exploits cycles (triangles and squares) in the co-occurrence graph of the query. Then we cluster the search results based on their semantic similarity to the induced word senses. Our experiments, conducted on datasets of ambiguous queries, show that our approach improves search result clustering in terms of both clustering quality and degree of diversification.
4 0.10902929 56 emnlp-2010-Hashing-Based Approaches to Spelling Correction of Personal Names
Author: Raghavendra Udupa ; Shaishav Kumar
Abstract: We propose two hashing-based solutions to the problem of fast and effective personal names spelling correction in People Search applications. The key idea behind our methods is to learn hash functions that map similar names to similar (and compact) binary codewords. The two methods differ in the data they use for learning the hash functions - the first method uses a set of names in a given language/script whereas the second uses a set of bilingual names. We show that both methods give excellent retrieval performance in comparison to several baselines on two lists of misspelled personal names. Moreover, the method that uses bilingual data for learning hash functions gives the best performance.
5 0.096464925 100 emnlp-2010-Staying Informed: Supervised and Semi-Supervised Multi-View Topical Analysis of Ideological Perspective
Author: Amr Ahmed ; Eric Xing
Abstract: With the proliferation of user-generated articles over the web, it becomes imperative to develop automated methods that are aware of the ideological-bias implicit in a document collection. While there exist methods that can classify the ideological bias of a given document, little has been done toward understanding the nature of this bias on a topical-level. In this paper we address the problem of modeling ideological perspective on a topical level using a factored topic model. We develop efficient inference algorithms using Collapsed Gibbs sampling for posterior inference, and give various evaluations and illustrations of the utility of our model on various document collections with promising results. Finally we give a Metropolis-Hasting inference algorithm for a semi-supervised extension with decent results.
6 0.090466544 20 emnlp-2010-Automatic Detection and Classification of Social Events
7 0.084001467 6 emnlp-2010-A Latent Variable Model for Geographic Lexical Variation
8 0.079997122 61 emnlp-2010-Improving Gender Classification of Blog Authors
9 0.07820297 55 emnlp-2010-Handling Noisy Queries in Cross Language FAQ Retrieval
10 0.076598614 90 emnlp-2010-Positional Language Models for Clinical Information Retrieval
11 0.071917593 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation
13 0.054800205 51 emnlp-2010-Function-Based Question Classification for General QA
14 0.052239835 48 emnlp-2010-Exploiting Conversation Structure in Unsupervised Topic Segmentation for Emails
15 0.050304994 64 emnlp-2010-Incorporating Content Structure into Text Analysis Applications
16 0.046067059 23 emnlp-2010-Automatic Keyphrase Extraction via Topic Decomposition
17 0.044311926 79 emnlp-2010-Mining Name Translations from Entity Graph Mapping
18 0.042483836 39 emnlp-2010-EMNLP 044
19 0.038741183 84 emnlp-2010-NLP on Spoken Documents Without ASR
20 0.038451385 82 emnlp-2010-Multi-Document Summarization Using A* Search and Discriminative Learning
topicId topicWeight
[(0, 0.151), (1, 0.12), (2, -0.173), (3, 0.112), (4, 0.157), (5, 0.083), (6, -0.233), (7, 0.007), (8, -0.261), (9, 0.076), (10, -0.079), (11, 0.153), (12, 0.194), (13, 0.115), (14, -0.081), (15, -0.002), (16, 0.079), (17, 0.142), (18, -0.041), (19, 0.023), (20, -0.015), (21, -0.056), (22, 0.249), (23, -0.075), (24, -0.048), (25, 0.147), (26, 0.033), (27, -0.024), (28, 0.058), (29, 0.032), (30, 0.012), (31, 0.061), (32, 0.003), (33, -0.073), (34, 0.019), (35, 0.028), (36, -0.016), (37, 0.085), (38, -0.034), (39, -0.085), (40, 0.072), (41, -0.138), (42, -0.005), (43, -0.006), (44, -0.061), (45, -0.0), (46, 0.06), (47, 0.104), (48, -0.005), (49, 0.007)]
simIndex simValue paperId paperTitle
same-paper 1 0.98359179 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media
2 0.91478121 73 emnlp-2010-Learning Recurrent Event Queries for Web Search
Author: Ruiqiang Zhang ; Yuki Konda ; Anlei Dong ; Pranam Kolari ; Yi Chang ; Zhaohui Zheng
Abstract: Recurrent event queries (REQ) constitute a special class of search queries occurring at regular, predictable time intervals. The freshness of documents ranked for such queries is generally of critical importance. REQ forms a significant volume, as much as 6% of query traffic received by search engines. In this work, we develop an improved REQ classifier that could provide significant improvements in addressing this problem. We analyze REQ queries, and develop novel features from multiple sources, and evaluate them using machine learning techniques. From historical query logs, we develop features utilizing query frequency, click information, and user intent dynamics within a search session. We also develop temporal features by time series analysis from query frequency. Other generated features include word matching with recurrent event seed words and time sensitivity of search result set. We use Naive Bayes, SVM and decision tree based logistic regres- sion model to train REQ classifier. The results on test data show that our models outperformed baseline approach significantly. Experiments on a commercial Web search engine also show significant gains in overall relevance, and thus overall user experience.
3 0.61471617 66 emnlp-2010-Inducing Word Senses to Improve Web Search Result Clustering
Author: Roberto Navigli ; Giuseppe Crisafulli
Abstract: In this paper, we present a novel approach to Web search result clustering based on the automatic discovery of word senses from raw text, a task referred to as Word Sense Induction (WSI). We first acquire the senses (i.e., meanings) of a query by means of a graph-based clustering algorithm that exploits cycles (triangles and squares) in the co-occurrence graph of the query. Then we cluster the search results based on their semantic similarity to the induced word senses. Our experiments, conducted on datasets of ambiguous queries, show that our approach improves search result clustering in terms of both clustering quality and degree of diversification.
4 0.37065089 90 emnlp-2010-Positional Language Models for Clinical Information Retrieval
Author: Florian Boudin ; Jian-Yun Nie ; Martin Dawes
Abstract: The PECO framework is a knowledge representation for formulating clinical questions. Queries are decomposed into four aspects, which are Patient-Problem (P), Exposure (E), Comparison (C) and Outcome (O). However, no test collection is available to evaluate such framework in information retrieval. In this work, we first present the construction of a large test collection extracted from systematic literature reviews. We then describe an analysis of the distribution of PECO elements throughout the relevant documents and propose a language modeling approach that uses these distributions as a weighting strategy. In our experiments carried out on a collection of 1.5 million documents and 423 queries, our method was found to lead to an improvement of 28% in MAP and 50% in P@5, as compared to the state-of-the-art method.
5 0.37012574 56 emnlp-2010-Hashing-Based Approaches to Spelling Correction of Personal Names
Author: Raghavendra Udupa ; Shaishav Kumar
Abstract: We propose two hashing-based solutions to the problem of fast and effective personal names spelling correction in People Search applications. The key idea behind our methods is to learn hash functions that map similar names to similar (and compact) binary codewords. The two methods differ in the data they use for learning the hash functions - the first method uses a set of names in a given language/script whereas the second uses a set of bilingual names. We show that both methods give excellent retrieval performance in comparison to several baselines on two lists of misspelled personal names. More over, the method that uses bilingual data for learning hash functions gives the best performance.
6 0.3512117 55 emnlp-2010-Handling Noisy Queries in Cross Language FAQ Retrieval
8 0.31418279 6 emnlp-2010-A Latent Variable Model for Geographic Lexical Variation
10 0.2783246 23 emnlp-2010-Automatic Keyphrase Extraction via Topic Decomposition
11 0.27147958 61 emnlp-2010-Improving Gender Classification of Blog Authors
12 0.24185899 20 emnlp-2010-Automatic Detection and Classification of Social Events
13 0.19064239 48 emnlp-2010-Exploiting Conversation Structure in Unsupervised Topic Segmentation for Emails
14 0.17977133 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation
15 0.17033675 82 emnlp-2010-Multi-Document Summarization Using A* Search and Discriminative Learning
16 0.16167605 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors
17 0.15136065 92 emnlp-2010-Predicting the Semantic Compositionality of Prefix Verbs
18 0.14731325 81 emnlp-2010-Modeling Perspective Using Adaptor Grammars
19 0.14215133 120 emnlp-2010-What's with the Attitude? Identifying Sentences with Attitude in Online Discussions
20 0.13904719 37 emnlp-2010-Domain Adaptation of Rule-Based Annotators for Named-Entity Recognition Tasks
topicId topicWeight
[(3, 0.013), (10, 0.012), (12, 0.04), (14, 0.286), (18, 0.029), (29, 0.082), (30, 0.048), (32, 0.019), (52, 0.021), (56, 0.052), (62, 0.05), (66, 0.09), (72, 0.084), (76, 0.059), (89, 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.77804869 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media
Author: Yunliang Jiang ; Cindy Xide Lin ; Qiaozhu Mei
Abstract: In this paper, we conducted a systematic comparative analysis of language in different contexts of bursty topics, including web search, news media, blogging, and social bookmarking. We analyze (1) the content similarity and predictability between contexts, (2) the coverage of search content by each context, and (3) the intrinsic coherence of information in each context. Our experiments show that social bookmarking is a better predictor to the bursty search queries, but news media and social blogging media have a much more compelling coverage. This comparison provides insights on how the search behaviors and social information sharing behaviors of users are correlated to the professional news media in the context of bursty events.
2 0.71797335 38 emnlp-2010-Dual Decomposition for Parsing with Non-Projective Head Automata
Author: Terry Koo ; Alexander M. Rush ; Michael Collins ; Tommi Jaakkola ; David Sontag
Abstract: This paper introduces algorithms for non-projective parsing based on dual decomposition. We focus on parsing algorithms for non-projective head automata, a generalization of head-automata models to non-projective structures. The dual decomposition algorithms are simple and efficient, relying on standard dynamic programming and minimum spanning tree algorithms. They provably solve an LP relaxation of the non-projective parsing problem. Empirically the LP relaxation is very often tight: for many languages, exact solutions are achieved on over 98% of test sentences. The accuracy of our models is higher than previous work on a broad range of datasets.
3 0.47867328 73 emnlp-2010-Learning Recurrent Event Queries for Web Search
Author: Ruiqiang Zhang ; Yuki Konda ; Anlei Dong ; Pranam Kolari ; Yi Chang ; Zhaohui Zheng
Abstract: Recurrent event queries (REQ) constitute a special class of search queries occurring at regular, predictable time intervals. The freshness of documents ranked for such queries is generally of critical importance. REQ forms a significant volume, as much as 6% of query traffic received by search engines. In this work, we develop an improved REQ classifier that could provide significant improvements in addressing this problem. We analyze REQ queries, and develop novel features from multiple sources, and evaluate them using machine learning techniques. From historical query logs, we develop features utilizing query frequency, click information, and user intent dynamics within a search session. We also develop temporal features by time series analysis from query frequency. Other generated features include word matching with recurrent event seed words and time sensitivity of search result set. We use Naive Bayes, SVM and decision tree based logistic regression model to train REQ classifier. The results on test data show that our models outperformed baseline approach significantly. Experiments on a commercial Web search engine also show significant gains in overall relevance, and thus overall user experience.
4 0.47464207 72 emnlp-2010-Learning First-Order Horn Clauses from Web Text
Author: Stefan Schoenmackers ; Jesse Davis ; Oren Etzioni ; Daniel Weld
Abstract: Even the entire Web corpus does not explicitly answer all questions, yet inference can uncover many implicit answers. But where do inference rules come from? This paper investigates the problem of learning inference rules from Web text in an unsupervised, domain-independent manner. The SHERLOCK system, described herein, is a first-order learner that acquires over 30,000 Horn clauses from Web text. SHERLOCK embodies several innovations, including a novel rule scoring function based on Statistical Relevance (Salmon et al., 1971) which is effective on ambiguous, noisy and incomplete Web extractions. Our experiments show that inference over the learned rules discovers three times as many facts (at precision 0.8) as the TEXTRUNNER system which merely extracts facts explicitly stated in Web text.
5 0.46801457 120 emnlp-2010-What's with the Attitude? Identifying Sentences with Attitude in Online Discussions
Author: Ahmed Hassan ; Vahed Qazvinian ; Dragomir Radev
Abstract: Mining sentiment from user generated content is a very important task in Natural Language Processing. An example of such content is threaded discussions which act as a very important tool for communication and collaboration in the Web. Threaded discussions include e-mails, e-mail lists, bulletin boards, newsgroups, and Internet forums. Most of the work on sentiment analysis has been centered around finding the sentiment toward products or topics. In this work, we present a method to identify the attitude of participants in an online discussion toward one another. This would enable us to build a signed network representation of participant interaction where every edge has a sign that indicates whether the interaction is positive or negative. This is different from most of the research on social networks that has focused almost exclusively on positive links. The method is experimentally tested using a manually labeled set of discussion posts. The results show that the proposed method is capable of identifying attitudinal sentences, and their signs, with high accuracy and that it outperforms several other baselines.
6 0.46787524 65 emnlp-2010-Inducing Probabilistic CCG Grammars from Logical Form with Higher-Order Unification
7 0.46660727 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice
8 0.46623623 40 emnlp-2010-Effects of Empty Categories on Machine Translation
9 0.4654291 26 emnlp-2010-Classifying Dialogue Acts in One-on-One Live Chats
10 0.46326986 31 emnlp-2010-Constraints Based Taxonomic Relation Classification
11 0.46140793 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding
12 0.45936614 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation
13 0.45809931 67 emnlp-2010-It Depends on the Translation: Unsupervised Dependency Parsing via Word Alignment
14 0.45797235 90 emnlp-2010-Positional Language Models for Clinical Information Retrieval
15 0.45681846 117 emnlp-2010-Using Unknown Word Techniques to Learn Known Words
16 0.45653474 105 emnlp-2010-Title Generation with Quasi-Synchronous Grammar
17 0.45614243 63 emnlp-2010-Improving Translation via Targeted Paraphrasing
18 0.45573458 6 emnlp-2010-A Latent Variable Model for Geographic Lexical Variation
19 0.45563403 87 emnlp-2010-Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space