emnlp emnlp2010 emnlp2010-66 knowledge-graph by maker-knowledge-mining

66 emnlp-2010-Inducing Word Senses to Improve Web Search Result Clustering


Source: pdf

Author: Roberto Navigli ; Giuseppe Crisafulli

Abstract: In this paper, we present a novel approach to Web search result clustering based on the automatic discovery of word senses from raw text, a task referred to as Word Sense Induction (WSI). We first acquire the senses (i.e., meanings) of a query by means of a graph-based clustering algorithm that exploits cycles (triangles and squares) in the co-occurrence graph of the query. Then we cluster the search results based on their semantic similarity to the induced word senses. Our experiments, conducted on datasets of ambiguous queries, show that our approach improves search result clustering in terms of both clustering quality and degree of diversification.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Inducing Word Senses to Improve Web Search Result Clustering Roberto Navigli and Giuseppe Crisafulli Dipartimento di Informatica Sapienza Università di Roma. [sent-1, score-0.11]

2 navigli@di.uniroma1.it Abstract In this paper, we present a novel approach to Web search result clustering based on the automatic discovery of word senses from raw text, a task referred to as Word Sense Induction (WSI). [sent-3, score-0.781]

3 We first acquire the senses (i.e., meanings) of a query by means of a graph-based clustering algorithm that exploits cycles (triangles and squares) in the co-occurrence graph of the query. [sent-6, score-0.747]

4 Then we cluster the search results based on their semantic similarity to the induced word senses. [sent-7, score-0.339]

5 Our experiments, conducted on datasets of ambiguous queries, show that our approach improves search result clustering in terms of both clustering quality and degree of diversification. [sent-8, score-0.727]

6 Search engines such as Yahoo! and Google usually do a good job at retrieving a small number of relevant results from such an enormous collection of Web pages. [sent-11, score-0.04]

7 However, current search engines still face the lexical ambiguity issue (Furnas et al., 1987). [sent-14, score-0.379]

8 Using WordNet (Miller et al., 1990) and Wikipedia as sources of ambiguous words, it was reported that around 3% of Web queries and 23% of the most frequent queries are ambiguous. [sent-19, score-0.153]

9 Ambiguity is often the consequence of the low number of query words entered on average by Web users (Kamvar and Baluja, 2006). [sent-32, score-0.303]

10 In the past few years, Web clustering engines (Carpineto et al., 2009) have been proposed. [sent-34, score-0.326]

11 These engines offer a solution to the lexical ambiguity issue in Web Information Retrieval. [sent-35, score-0.102]

12 These systems group search results by providing a cluster for each specific aspect (i.e., meaning) of the input query. [sent-36, score-0.322]

13 However, many Web clustering engines group search results on the basis of their lexical similarity. [sent-40, score-0.594]

14 For instance, consider the snippets returned for the beagle search query. [sent-41, score-0.549]

15 While snippets 1 and 3 both concern the Linux search tool, they do not have any content word in common except our query words. [sent-56, score-0.297]

16 [Page footer: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 116–126.] [sent-60, score-0.267]

17 In this paper we present a novel approach to Web search result clustering which is based on the automatic discovery of word senses from raw text, a task referred to as Word Sense Induction (WSI). [sent-62, score-0.663]

18 At the core of our approach is a graph-based algorithm that exploits cycles in the co-occurrence graph of the input query to detect the query’s meanings. [sent-63, score-0.513]

19 Our experiments on two datasets of ambiguous queries show that our WSI approach boosts search result clustering in terms of both clustering quality and degree of diversification. [sent-64, score-0.787]

20 A first, historical solution to query ambiguity is that of Web directories, that is, taxonomies providing categories to which Web pages are manually assigned. [sent-66, score-0.332]

21 Given a query, search results are organized by category. [sent-70, score-0.185]

22 This latter feature of Web directories makes it difficult to distinguish between instances of the same kind. [sent-72, score-0.054]

23 For example, pages about artists with the same surname may be classified under the same category, such as Arts:Music:Bands and Artists. [sent-74, score-0.095]

24 Methods for the automatic classification of Web documents have also been proposed. [sent-75, score-0.037]

25 A different solution, known as Semantic Information Retrieval (SIR), is to associate explicit semantics (i.e., word senses or concepts) with queries and documents, that is, to perform Word Sense Disambiguation (WSD, see Navigli (2009)). [sent-83, score-0.263]

26 SIR is performed by indexing and/or searching concepts rather than terms, thus potentially coping with two linguistic phenomena: expressing a single meaning with different words (synonymy) and using the same word to express various different meanings (polysemy). [sent-84, score-0.139]

27 However, contrasting results have been reported on the benefits of these techniques: it has been shown that WSD has to be very accurate to benefit Information Retrieval (Sanderson, 1994), a result that was later debated (Gonzalo et al., 1998). [sent-90, score-0.041]

28 Also, it has been reported that WSD has to be very precise on minority senses and uncommon terms, rather than on frequent words (Krovetz and Croft, 1992; Sanderson, 2000). [sent-93, score-0.203]

29 SIR relies on the existence of a reference dictionary to perform WSD (typically, WordNet) and thus suffers from its static nature and its inherent paucity of most proper nouns. [sent-94, score-0.036]

30 This latter problem is particularly important for Web searches, as users tend to retrieve more information about named entities. [sent-95, score-0.068]

31 A more popular approach to query ambiguity is that of search result clustering. [sent-101, score-0.558]

32 Typically, given a query, the system starts from a flat list of text snippets returned from one or more commonly-available search engines and clusters them on the basis of some notion of textual similarity. [sent-102, score-0.478]

33 At the root of the clustering approach lies van Rijsbergen’s (1979) cluster hypothesis: “closely associated documents tend to be relevant to the same requests”, whereas documents concerning different meanings of the input query are expected to belong to different clusters. [sent-103, score-0.734]

34 Approaches to search result clustering can be classified as data-centric or description-centric (Carpineto et al., 2009). [sent-104, score-0.46]

35 The former focus more on the problem of data clustering than on presenting the results to the user. [sent-106, score-0.234]

36 An early data-centric example is Scatter/Gather (Cutting et al., 1992), which divides the dataset into a small number of clusters and, after the selection of a group, performs clustering again and proceeds iteratively. [sent-108, score-0.275]

37 Developments of this approach have been proposed which improve on cluster quality and retrieval performance (Ke et al., 2009). [sent-109, score-0.132]

38 Description-centric approaches are, instead, more focused on the description to produce for each cluster of search results. [sent-115, score-0.287]

39 One popular description-centric approach is suffix tree clustering (Zamir and Etzioni, 1998), including later developments (Crabtree et al., 2005). [sent-117, score-0.049]

40 Other methods in the literature are based on formal concept analysis (Carpineto and Romano, 2004), singular value decomposition (Osinski and Weiss, 2005), and spectral clustering (Cheng et al., 2007). [sent-120, score-0.274]

41 Graph connectivity measures have also been used (Di Giacomo et al., 2007). [sent-123, score-0.111]

42 Search result clustering has also been viewed as a supervised salient phrase ranking task (Zeng et al., 2004). [sent-125, score-0.315]

43 Another recent research topic dealing with the query ambiguity issue is diversification, which aims to rerank top search results based on criteria that maximize their diversity. [sent-128, score-0.554]

44 One of the first examples of diversification algorithms is based on the use of similarity functions to measure the diversity among documents and between document and query (Carbonell and Goldstein, 1998). [sent-129, score-0.367]

45 Other techniques use conditional probabilities to determine which document is most different from higher-ranking ones (Chen and Karger, 2006) or use affinity ranking (Zhang et al., 2005). [sent-130, score-0.04]

46 In contrast to the above approaches, we perform WSI to dynamically acquire an inventory of senses of the input query. [sent-135, score-0.291]

47 Instead of performing clustering on the basis of the surface similarity of Web snippets, we use our induced word senses to group snippets. [sent-136, score-0.572]

48 Previous studies have provided interesting insights into the use of WSI for Web search result clustering. [sent-139, score-0.226]

49 A more recent attempt at automatically identifying query meanings is based on the use of hidden topics (Nguyen et al.). [sent-140, score-0.324]

50 However, in this approach the topics are estimated from a universal dataset and are thus query-independent, so their number needs to be established beforehand. [sent-142, score-0.03]

51 In contrast, we aim to cluster snippets based on a dynamic and finer-grained notion of sense. [sent-143, score-0.214]

52 3 Approach. Web search result clustering is usually performed in three main steps: [sent-144, score-0.46]

53 1. The query q is used to retrieve a list of results R = (r1, . . . , rn); [sent-148, score-0.032]

54 2. A clustering C = (C1, . . . , Cm) of the results in R is obtained by means of a clustering algorithm; [sent-155, score-0.234]

55 3. The clusters in C are optionally labeled with an appropriate algorithm. [sent-156, score-0.041]

56 Our key idea is to improve step 2 by means of a Word Sense Induction algorithm: given a query q, we first dynamically induce, from a text corpus, the set of word senses of q (Section 3.1). [sent-160, score-0.522]

57 Next, we cluster the Web results on the basis of the word senses previously induced (Section 3.2). [sent-161, score-0.405]

58 3.1 Word Sense Induction. Word Sense Induction algorithms are unsupervised techniques aimed at automatically identifying the set of senses denoted by a word. [sent-164, score-0.203]

59 These methods induce word senses from text by clustering word occurrences based on the idea that a given word used in a specific sense tends to co-occur with the same neighbouring words (Harris, 1954). [sent-165, score-0.512]

60 Several approaches to WSI have been proposed in the literature (see Navigli (2009) for a survey), ranging from clustering based on context vectors (e.g., Schütze, 1998) to methods based on co-occurrence graphs. [sent-166, score-0.234]

61 Successful approaches such as HyperLex (Véronis, 2004), a graph algorithm based on the identification of hubs in co-occurrence graphs, have to cope with a high number of parameters to be tuned (Agirre et al., 2006). [sent-173, score-0.176]

62 To deal with this issue, we propose two variants of a simple, yet effective, graph-based algorithm for WSI, which we describe hereafter. [sent-175, score-0.037]

63 The algorithm consists of two steps: graph construction and identification of word senses. [sent-176, score-0.147]

64 3.1.1 Graph construction. Given a target query q, we build a co-occurrence graph Gq = (V, E) such that V is a set of context words related to q and E is the set of undirected edges, each denoting a co-occurrence between a pair of words in V. [sent-179, score-0.378]

65 Each pair of words w, w' is scored with the Dice coefficient: Dice(w, w') = 2 c(w, w') / (c(w) + c(w')) (1), where c(w, w') is the co-occurrence count of the pair and c(w), c(w') are the total counts of the two words. The rationale behind Dice is that dividing by the sum of the total counts of the two words drastically decreases the ranking of words that tend to co-occur frequently with many other words. [sent-187, score-0.04]

66 The graph Gq = (V, E) is built as follows: • Our initial vertex set V(0) contains all the content words from the snippet results of query q (excluding stopwords); then, we add to V(0) the highest-ranking words co-occurring with q in the Web1T corpus, [sent-191, score-0.459]

67 i.e., those words w for which Dice(q, w) ≥ δ (the threshold δ is established experimentally; see Section 4). [sent-193, score-0.03]

68 • For each word w ∈ V(0), we select the highest-ranking words co-occurring with w in Web1T, that is, those words w' for which Dice(w, w') ≥ δ. [sent-196, score-0.04]

69 We add each of these words to V (note that some w' might already be in V(0)) and the corresponding edge {w, w'} to E with weight Dice(w, w'). [sent-197, score-0.088]
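To make the construction concrete, here is a minimal Python sketch. It is an illustration, not the authors' implementation: the count/cocount dictionaries stand in for Web1T unigram and pair counts, and a brute-force scan of the vocabulary replaces whatever indexing the real system uses.

```python
def dice(w1, w2, count, cocount):
    """Dice coefficient of Eq. (1): 2*c(w1, w2) / (c(w1) + c(w2))."""
    denom = count.get(w1, 0) + count.get(w2, 0)
    return 2.0 * cocount.get(frozenset((w1, w2)), 0) / denom if denom else 0.0

def build_cooccurrence_graph(query, snippet_words, count, cocount, delta):
    """Build Gq = (V, E) as an adjacency map: graph[w][w2] = Dice(w, w2).

    snippet_words : content words from the snippets of `query` (no stopwords);
    count, cocount: unigram and pair counts from a large corpus (Web1T in the paper).
    """
    vocab = set(count)
    # Initial vertex set V(0): snippet content words, plus every word w
    # with Dice(query, w) >= delta.
    v0 = set(snippet_words)
    v0 |= {w for w in vocab if w != query and dice(query, w, count, cocount) >= delta}
    # For each w in V(0), connect it to its high-Dice co-occurring words w2
    # (which may or may not already be in V(0)).
    graph = {w: {} for w in v0}
    for w in v0:
        for w2 in vocab - {w, query}:
            weight = dice(w, w2, count, cocount)
            if weight >= delta:
                graph[w][w2] = weight
                graph.setdefault(w2, {})[w] = weight
    return graph
```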

70 3.1.2 Identification of word senses. The main idea behind our approach is that edges in the co-occurrence graph participating in cycles are likely to connect vertices (i.e., words) related to the same meaning of the query. [sent-201, score-0.572]

71 Specifically, we focus on cycles of length 3 and 4, called respectively triangles and squares in graph theory. [sent-204, score-0.478]

72 Similarly to the triangle measure Tri(e), we define a measure Sqr(e) as the ratio of squares (i.e., cycles of length 4) that an edge e participates in, out of the potential squares it could participate in. [sent-206, score-0.135]

73 In order to disconnect the graph and determine the meaning components, we remove all the edges whose Tri (or Sqr) value is below a threshold σ. [sent-210, score-0.274]

74 The resulting connected components represent the word senses induced for the query q. [sent-211, score-0.522]

75 Notice that the number of senses is dynamically chosen based on the co-occurrence graph and the algorithm's thresholds. [sent-212, score-0.366]
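A sketch of the cycle-based pruning follows. The extract does not give the exact normalizations of Tri and Sqr, so the "potential cycle" denominators below are one plausible reading (neighbour pairs that could close the triangle or square); treat them as assumptions rather than the authors' formulas.

```python
def tri(graph, w1, w2):
    """Ratio of triangles through edge {w1, w2}: common neighbours close a
    triangle; any neighbour of either endpoint is a potential closer (assumed)."""
    n1, n2 = set(graph[w1]) - {w2}, set(graph[w2]) - {w1}
    potential = n1 | n2
    return len(n1 & n2) / len(potential) if potential else 0.0

def sqr(graph, w1, w2):
    """Ratio of squares (4-cycles) through edge {w1, w2}: a neighbour pair
    (a, b) with a != b is a potential square, an actual one when {a, b} is in E."""
    n1, n2 = set(graph[w1]) - {w2}, set(graph[w2]) - {w1}
    pairs = [(a, b) for a in n1 for b in n2 if a != b]
    closed = sum(1 for a, b in pairs if b in graph.get(a, {}))
    return closed / len(pairs) if pairs else 0.0

def induce_senses(graph, sigma, measure=sqr):
    """Drop edges whose measure is below sigma; the surviving connected
    components are the induced senses of the query."""
    pruned = {w: {v for v in nbrs if measure(graph, w, v) >= sigma}
              for w, nbrs in graph.items()}
    senses, seen = [], set()
    for start in pruned:
        if start in seen:
            continue
        component, stack = set(), [start]   # DFS on the pruned graph
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(pruned[node] - component)
        seen |= component
        senses.append(component)
    return senses
```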

76 Our triangular measure is the edge counterpart of the clustering coefficient (or curvature) for vertices, previously used to perform WSI (Widdows and Dorow, 2002). [sent-213, score-0.322]

77 However, it is our hunch that measuring the ratio of squares an edge participates in provides a stronger clue of how important that edge is within a meaning component. [sent-214, score-0.51]

78 We build the co-occurrence graph Gbeagle = (V, E), an excerpt of which is shown in Figure 1(a). [sent-221, score-0.111]

79 We calculate the Sqr value of each edge in the graph. [sent-224, score-0.088]

80 The edges e whose Sqr(e) < σ are removed (we assume σ = 0.5 in this example). [sent-225, score-0.09]

81 For instance, Sqr({ dog, breed }) = 1/2, as the edge participates in the square dog–breed–puppy–canine–dog, but it could also have participated in the potential square dog–breed–puppy–search–dog. [sent-227, score-1.646]

82 In fact, the other neighbours of dog are canine, puppy and search, and the other neighbour of breed is puppy, thus the square can only be closed by connecting puppy to either canine or search. [sent-228, score-0.891]

83 In our example, the only edges whose Sqr is below σ are: { dog, puppy }, { dog, search } and { linux, mission } (they participate in no square). [sent-229, score-0.336]

84 We then remove these edges and select the resulting connected components as the senses of the query beagle (shown in Figure 1(b)). [sent-230, score-0.844]

85 Note that, if we selected triangles as our pruning measure, we should also remove the following edges: { search, index }, { index, linux }, { linux, system } and { system, search }. [sent-231, score-0.219]

86 In fact, these edges do not participate in any triangle. [sent-232, score-0.174]

87 As a result, we would miss the computer science sense of the query. [sent-234, score-0.075]
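Running the sketch above on just the edges named in this example reproduces the behaviour described (toy graph with uniform placeholder weights; in the full excerpt of Figure 1, mission connects to further spacecraft-related words rather than being isolated):

```python
edges = [("dog", "breed"), ("breed", "puppy"), ("puppy", "canine"),
         ("canine", "dog"), ("dog", "puppy"), ("dog", "search"),
         ("search", "index"), ("index", "linux"), ("linux", "system"),
         ("system", "search"), ("linux", "mission")]
graph = {}
for a, b in edges:
    graph.setdefault(a, {})[b] = 1.0   # uniform stand-in for real Dice weights
    graph.setdefault(b, {})[a] = 1.0

print(sqr(graph, "dog", "breed"))      # 0.5: one actual square out of two potential
print(sqr(graph, "dog", "puppy"))      # 0.0: participates in no square
for sense in induce_senses(graph, sigma=0.5):
    print(sorted(sense))
# -> ['breed', 'canine', 'dog', 'puppy'] and ['index', 'linux', 'search', 'system'];
#    'mission' ends up alone because this toy graph omits its other neighbours.
```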

88 3.2 Clustering of Web results. Given our query q, we submit it to a search engine, which returns a list of relevant search results R = (r1, . . . , rn). [sent-236, score-0.637]

89 We process each result ri by considering the corresponding text snippet and transforming it to a bag of words bi (we apply tokenization, stopword and target word removal, and lemmatization). [sent-240, score-0.304]

90 For instance, given the snippet: “the beagle is a breed of medium-sized dog”, we produce the following bag of words: { breed, medium, size, dog }. [sent-241, score-0.619]

91 As a result of the above processing, we obtain a list of bags of words B = (b1, . . . , bn). [sent-242, score-0.086]
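The snippet-to-bag-of-words step might look like the sketch below; the stopword list and the tiny lemma table are illustrative stand-ins (the extract does not say which tools the authors used):

```python
import re

STOPWORDS = {"the", "is", "a", "of", "and", "to", "in"}   # illustrative subset
LEMMAS = {"sized": "size", "dogs": "dog"}                  # toy lemma table

def to_bag_of_words(snippet, target):
    tokens = re.findall(r"[a-z]+", snippet.lower())        # tokenization
    return {LEMMAS.get(t, t) for t in tokens               # lemmatization
            if t not in STOPWORDS and t != target}         # stopword/target removal

print(to_bag_of_words("the beagle is a breed of medium-sized dog", "beagle"))
# -> {'breed', 'medium', 'size', 'dog'} (set order may vary)
```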

92 Figure 1: The beagle example: (a) graph construction, "weak" edges (according to Sqr) drawn in bold; (b) the word senses induced after edge removal. [sent-250, score-1.3]

93 Rather than clustering the bags of words in B by considering the interrelationships between them (as is done in traditional search result clustering), we intersect each bag of words bi ∈ B with the sense clusters {S1, . . . , Sm} induced for the query. [sent-251, score-0.445]

94 The sense cluster with the largest intersection with bi is selected as the most likely meaning of ri. [sent-257, score-0.267]
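The assignment itself is then a simple argmax over bag/sense overlaps. A minimal sketch (ties and empty intersections are broken arbitrarily here; the extract does not say how the paper handles them):

```python
def assign_to_senses(bags, senses):
    """Return, for each bag of words, the index of the sense cluster with the
    largest word overlap (the most likely meaning of the corresponding result)."""
    return [max(range(len(senses)), key=lambda i: len(bag & senses[i]))
            for bag in bags]

# With the senses induced for "beagle" above:
senses = [{"dog", "breed", "puppy", "canine"},
          {"linux", "search", "index", "system"}]
bags = [{"breed", "medium", "size", "dog"},       # snippet about the dog breed
        {"desktop", "search", "tool", "linux"}]   # snippet about the Linux tool
print(assign_to_senses(bags, senses))             # -> [0, 1]
```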


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('query', 0.267), ('beagle', 0.252), ('clustering', 0.234), ('wsi', 0.225), ('sqr', 0.216), ('senses', 0.203), ('dice', 0.189), ('search', 0.185), ('puppy', 0.162), ('breed', 0.158), ('participates', 0.158), ('dog', 0.155), ('web', 0.151), ('cycles', 0.135), ('squares', 0.135), ('snippets', 0.112), ('graph', 0.111), ('sir', 0.108), ('cluster', 0.102), ('triangles', 0.097), ('artists', 0.095), ('carpineto', 0.095), ('sanderson', 0.095), ('square', 0.092), ('engines', 0.092), ('edges', 0.09), ('edge', 0.088), ('participate', 0.084), ('linux', 0.081), ('snippet', 0.081), ('canine', 0.081), ('neighbours', 0.081), ('wsd', 0.075), ('sense', 0.075), ('navigli', 0.073), ('induction', 0.072), ('ambiguity', 0.065), ('cpaunpi', 0.063), ('diversification', 0.063), ('ees', 0.063), ('eou', 0.063), ('gq', 0.063), ('krovetz', 0.063), ('limnuxmsi', 0.063), ('nyebdreogedsearcihsnydsetx', 0.063), ('pladrt', 0.063), ('psaicomenacrsafl', 0.063), ('roma', 0.063), ('singers', 0.063), ('tnder', 0.063), ('zamir', 0.063), ('nguyen', 0.063), ('queries', 0.06), ('meanings', 0.057), ('di', 0.055), ('directories', 0.054), ('dorow', 0.054), ('tri', 0.054), ('bag', 0.054), ('induced', 0.052), ('dynamically', 0.052), ('bi', 0.049), ('stopwords', 0.049), ('developments', 0.049), ('numerator', 0.049), ('gonzalo', 0.049), ('basis', 0.048), ('widdows', 0.045), ('bags', 0.045), ('concepts', 0.041), ('clusters', 0.041), ('result', 0.041), ('meaning', 0.041), ('ranking', 0.04), ('tze', 0.04), ('spectral', 0.04), ('croft', 0.04), ('retrieving', 0.04), ('tool', 0.038), ('issue', 0.037), ('documents', 0.037), ('identification', 0.036), ('static', 0.036), ('acquire', 0.036), ('users', 0.036), ('group', 0.035), ('rn', 0.034), ('wordnet', 0.034), ('ambiguous', 0.033), ('vertices', 0.033), ('yahoo', 0.033), ('retrieve', 0.032), ('remove', 0.032), ('google', 0.031), ('sch', 0.031), ('retrieval', 0.03), ('ri', 0.03), ('established', 0.03), ('engine', 0.029), ('graphs', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 66 emnlp-2010-Inducing Word Senses to Improve Web Search Result Clustering

Author: Roberto Navigli ; Giuseppe Crisafulli

Abstract: In this paper, we present a novel approach to Web search result clustering based on the automatic discovery of word senses from raw text, a task referred to as Word Sense Induction (WSI). We first acquire the senses (i.e., meanings) of a query by means of a graph-based clustering algorithm that exploits cycles (triangles and squares) in the co-occurrence graph of the query. Then we cluster the search results based on their semantic similarity to the induced word senses. Our experiments, conducted on datasets of ambiguous queries, show that our approach improves search result clustering in terms of both clustering quality and degree of diversification.

2 0.18063277 73 emnlp-2010-Learning Recurrent Event Queries for Web Search

Author: Ruiqiang Zhang ; Yuki Konda ; Anlei Dong ; Pranam Kolari ; Yi Chang ; Zhaohui Zheng

Abstract: Recurrent event queries (REQ) constitute a special class of search queries occurring at regular, predictable time intervals. The freshness of documents ranked for such queries is generally of critical importance. REQ forms a significant volume, as much as 6% of query traffic received by search engines. In this work, we develop an improved REQ classifier that could provide significant improvements in addressing this problem. We analyze REQ queries, and develop novel features from multiple sources, and evaluate them using machine learning techniques. From historical query logs, we develop features utilizing query frequency, click information, and user intent dynamics within a search session. We also develop temporal features by time series analysis from query frequency. Other generated features include word matching with recurrent event seed words and time sensitivity of search result set. We use Naive Bayes, SVM and decision tree based logistic regres- sion model to train REQ classifier. The results on test data show that our models outperformed baseline approach significantly. Experiments on a commercial Web search engine also show significant gains in overall relevance, and thus overall user experience.

3 0.15757091 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media

Author: Yunliang Jiang ; Cindy Xide Lin ; Qiaozhu Mei

Abstract: In this paper, we conducted a systematic comparative analysis of language in different contexts of bursty topics, including web search, news media, blogging, and social bookmarking. We analyze (1) the content similarity and predictability between contexts, (2) the coverage of search content by each context, and (3) the intrinsic coherence of information in each context. Our experiments show that social bookmarking is a better predictor to the bursty search queries, but news media and social blogging media have a much more compelling coverage. This comparison provides insights on how the search behaviors and social information sharing behaviors of users are correlated to the professional news media in the context of bursty events.

4 0.14364544 124 emnlp-2010-Word Sense Induction Disambiguation Using Hierarchical Random Graphs

Author: Ioannis Klapaftis ; Suresh Manandhar

Abstract: Graph-based methods have gained attention in many areas of Natural Language Processing (NLP) including Word Sense Disambiguation (WSD), text summarization, keyword extraction and others. Most of the work in these areas formulate their problem in a graph-based setting and apply unsupervised graph clustering to obtain a set of clusters. Recent studies suggest that graphs often exhibit a hierarchical structure that goes beyond simple flat clustering. This paper presents an unsupervised method for inferring the hierarchical grouping of the senses of a polysemous word. The inferred hierarchical structures are applied to the problem of word sense disambiguation, where we show that our method performs sig- nificantly better than traditional graph-based methods and agglomerative clustering yielding improvements over state-of-the-art WSD systems based on sense induction.

5 0.1102255 7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics

Author: Joseph Reisinger ; Raymond Mooney

Abstract: We introduce tiered clustering, a mixture model capable of accounting for varying degrees of shared (context-independent) feature structure, and demonstrate its applicability to inferring distributed representations of word meaning. Common tasks in lexical semantics such as word relatedness or selectional preference can benefit from modeling such structure: Polysemous word usage is often governed by some common background metaphoric usage (e.g. the senses of line or run), and likewise modeling the selectional preference of verbs relies on identifying commonalities shared by their typical arguments. Tiered clustering can also be viewed as a form of soft feature selection, where features that do not contribute meaningfully to the clustering can be excluded. We demonstrate the applicability of tiered clustering, highlighting particular cases where modeling shared structure is beneficial and where it can be detrimental.

6 0.10459021 84 emnlp-2010-NLP on Spoken Documents Without ASR

7 0.10325823 27 emnlp-2010-Clustering-Based Stratified Seed Sampling for Semi-Supervised Relation Classification

8 0.094314888 56 emnlp-2010-Hashing-Based Approaches to Spelling Correction of Personal Names

9 0.081023864 77 emnlp-2010-Measuring Distributional Similarity in Context

10 0.077358134 111 emnlp-2010-Two Decades of Unsupervised POS Induction: How Far Have We Come?

11 0.071828499 12 emnlp-2010-A Semi-Supervised Method to Learn and Construct Taxonomies Using the Web

12 0.068496332 16 emnlp-2010-An Approach of Generating Personalized Views from Normalized Electronic Dictionaries : A Practical Experiment on Arabic Language

13 0.064788803 41 emnlp-2010-Efficient Graph-Based Semi-Supervised Learning of Structured Tagging Models

14 0.05907663 55 emnlp-2010-Handling Noisy Queries in Cross Language FAQ Retrieval

15 0.058667626 90 emnlp-2010-Positional Language Models for Clinical Information Retrieval

16 0.057771683 71 emnlp-2010-Latent-Descriptor Clustering for Unsupervised POS Induction

17 0.057369377 79 emnlp-2010-Mining Name Translations from Entity Graph Mapping

18 0.051991455 87 emnlp-2010-Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space

19 0.051648874 82 emnlp-2010-Multi-Document Summarization Using A* Search and Discriminative Learning

20 0.049530119 103 emnlp-2010-Tense Sense Disambiguation: A New Syntactic Polysemy Task


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.177), (1, 0.135), (2, -0.137), (3, 0.147), (4, 0.099), (5, 0.079), (6, -0.327), (7, 0.092), (8, -0.121), (9, 0.185), (10, 0.006), (11, 0.04), (12, 0.092), (13, -0.048), (14, -0.104), (15, -0.211), (16, 0.041), (17, -0.015), (18, 0.029), (19, -0.01), (20, -0.015), (21, 0.068), (22, 0.049), (23, -0.171), (24, 0.046), (25, 0.181), (26, -0.088), (27, -0.013), (28, -0.026), (29, 0.031), (30, -0.095), (31, 0.103), (32, 0.042), (33, 0.055), (34, -0.028), (35, 0.002), (36, -0.012), (37, 0.019), (38, -0.05), (39, 0.063), (40, -0.077), (41, 0.007), (42, -0.023), (43, 0.05), (44, -0.007), (45, -0.001), (46, 0.111), (47, 0.05), (48, 0.0), (49, 0.034)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98141539 66 emnlp-2010-Inducing Word Senses to Improve Web Search Result Clustering

Author: Roberto Navigli ; Giuseppe Crisafulli

Abstract: In this paper, we present a novel approach to Web search result clustering based on the automatic discovery of word senses from raw text, a task referred to as Word Sense Induction (WSI). We first acquire the senses (i.e., meanings) of a query by means of a graph-based clustering algorithm that exploits cycles (triangles and squares) in the co-occurrence graph of the query. Then we cluster the search results based on their semantic similarity to the induced word senses. Our experiments, conducted on datasets of ambiguous queries, show that our approach improves search result clustering in terms of both clustering quality and degree of diversification.

2 0.71095514 73 emnlp-2010-Learning Recurrent Event Queries for Web Search

Author: Ruiqiang Zhang ; Yuki Konda ; Anlei Dong ; Pranam Kolari ; Yi Chang ; Zhaohui Zheng

Abstract: Recurrent event queries (REQ) constitute a special class of search queries occurring at regular, predictable time intervals. The freshness of documents ranked for such queries is generally of critical importance. REQ forms a significant volume, as much as 6% of query traffic received by search engines. In this work, we develop an improved REQ classifier that could provide significant improvements in addressing this problem. We analyze REQ queries, and develop novel features from multiple sources, and evaluate them using machine learning techniques. From historical query logs, we develop features utilizing query frequency, click information, and user intent dynamics within a search session. We also develop temporal features by time series analysis from query frequency. Other generated features include word matching with recurrent event seed words and time sensitivity of search result set. We use Naive Bayes, SVM and decision tree based logistic regres- sion model to train REQ classifier. The results on test data show that our models outperformed baseline approach significantly. Experiments on a commercial Web search engine also show significant gains in overall relevance, and thus overall user experience.

3 0.65657514 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media

Author: Yunliang Jiang ; Cindy Xide Lin ; Qiaozhu Mei

Abstract: In this paper, we conducted a systematic comparative analysis of language in different contexts of bursty topics, including web search, news media, blogging, and social bookmarking. We analyze (1) the content similarity and predictability between contexts, (2) the coverage of search content by each context, and (3) the intrinsic coherence of information in each context. Our experiments show that social bookmarking is a better predictor to the bursty search queries, but news media and social blogging media have a much more compelling coverage. This comparison provides insights on how the search behaviors and social information sharing behaviors of users are correlated to the professional news media in the context of bursty events.

4 0.6278953 124 emnlp-2010-Word Sense Induction Disambiguation Using Hierarchical Random Graphs

Author: Ioannis Klapaftis ; Suresh Manandhar

Abstract: Graph-based methods have gained attention in many areas of Natural Language Processing (NLP) including Word Sense Disambiguation (WSD), text summarization, keyword extraction and others. Most of the work in these areas formulate their problem in a graph-based setting and apply unsupervised graph clustering to obtain a set of clusters. Recent studies suggest that graphs often exhibit a hierarchical structure that goes beyond simple flat clustering. This paper presents an unsupervised method for inferring the hierarchical grouping of the senses of a polysemous word. The inferred hierarchical structures are applied to the problem of word sense disambiguation, where we show that our method performs sig- nificantly better than traditional graph-based methods and agglomerative clustering yielding improvements over state-of-the-art WSD systems based on sense induction.

5 0.51428664 7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics

Author: Joseph Reisinger ; Raymond Mooney

Abstract: We introduce tiered clustering, a mixture model capable of accounting for varying degrees of shared (context-independent) feature structure, and demonstrate its applicability to inferring distributed representations of word meaning. Common tasks in lexical semantics such as word relatedness or selectional preference can benefit from modeling such structure: Polysemous word usage is often governed by some common background metaphoric usage (e.g. the senses of line or run), and likewise modeling the selectional preference of verbs relies on identifying commonalities shared by their typical arguments. Tiered clustering can also be viewed as a form of soft feature selection, where features that do not contribute meaningfully to the clustering can be excluded. We demonstrate the applicability of tiered clustering, highlighting particular cases where modeling shared structure is beneficial and where it can be detrimental.

6 0.39871833 84 emnlp-2010-NLP on Spoken Documents Without ASR

7 0.37630996 27 emnlp-2010-Clustering-Based Stratified Seed Sampling for Semi-Supervised Relation Classification

8 0.35032588 55 emnlp-2010-Handling Noisy Queries in Cross Language FAQ Retrieval

9 0.34696889 90 emnlp-2010-Positional Language Models for Clinical Information Retrieval

10 0.32722008 56 emnlp-2010-Hashing-Based Approaches to Spelling Correction of Personal Names

11 0.31872833 77 emnlp-2010-Measuring Distributional Similarity in Context

12 0.31760448 111 emnlp-2010-Two Decades of Unsupervised POS Induction: How Far Have We Come?

13 0.26795965 16 emnlp-2010-An Approach of Generating Personalized Views from Normalized Electronic Dictionaries : A Practical Experiment on Arabic Language

14 0.25841001 41 emnlp-2010-Efficient Graph-Based Semi-Supervised Learning of Structured Tagging Models

15 0.25611416 103 emnlp-2010-Tense Sense Disambiguation: A New Syntactic Polysemy Task

16 0.23301919 12 emnlp-2010-A Semi-Supervised Method to Learn and Construct Taxonomies Using the Web

17 0.23252128 91 emnlp-2010-Practical Linguistic Steganography Using Contextual Synonym Substitution and Vertex Colour Coding

18 0.21142103 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors

19 0.21114731 31 emnlp-2010-Constraints Based Taxonomic Relation Classification

20 0.20919882 82 emnlp-2010-Multi-Document Summarization Using A* Search and Discriminative Learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.012), (10, 0.019), (12, 0.048), (29, 0.085), (30, 0.021), (32, 0.012), (52, 0.023), (56, 0.053), (59, 0.392), (62, 0.027), (66, 0.112), (72, 0.051), (76, 0.025), (79, 0.014), (82, 0.011), (89, 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.7087779 66 emnlp-2010-Inducing Word Senses to Improve Web Search Result Clustering

Author: Roberto Navigli ; Giuseppe Crisafulli

Abstract: In this paper, we present a novel approach to Web search result clustering based on the automatic discovery of word senses from raw text, a task referred to as Word Sense Induction (WSI). We first acquire the senses (i.e., meanings) of a query by means of a graph-based clustering algorithm that exploits cycles (triangles and squares) in the co-occurrence graph of the query. Then we cluster the search results based on their semantic similarity to the induced word senses. Our experiments, conducted on datasets of ambiguous queries, show that our approach improves search result clustering in terms of both clustering quality and degree of diversification.

2 0.38524017 69 emnlp-2010-Joint Training and Decoding Using Virtual Nodes for Cascaded Segmentation and Tagging Tasks

Author: Xian Qian ; Qi Zhang ; Yaqian Zhou ; Xuanjing Huang ; Lide Wu

Abstract: Many sequence labeling tasks in NLP require solving a cascade of segmentation and tagging subtasks, such as Chinese POS tagging, named entity recognition, and so on. Traditional pipeline approaches usually suffer from error propagation. Joint training/decoding in the cross-product state space could cause too many parameters and high inference complexity. In this paper, we present a novel method which integrates graph structures of two subtasks into one using virtual nodes, and performs joint training and decoding in the factorized state space. Experimental evaluations on CoNLL 2000 shallow parsing data set and Fourth SIGHAN Bakeoff CTB POS tagging data set demonstrate the superiority of our method over cross-product, pipeline and candidate reranking approaches.

3 0.38522625 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation

Author: Sankaranarayanan Ananthakrishnan ; Rohit Prasad ; David Stallard ; Prem Natarajan

Abstract: Production of parallel training corpora for the development of statistical machine translation (SMT) systems for resource-poor languages usually requires extensive manual effort. Active sample selection aims to reduce the labor, time, and expense incurred in producing such resources, attaining a given performance benchmark with the smallest possible training corpus by choosing informative, nonredundant source sentences from an available candidate pool for manual translation. We present a novel, discriminative sample selection strategy that preferentially selects batches of candidate sentences with constructs that lead to erroneous translations on a held-out development set. The proposed strategy supports a built-in diversity mechanism that reduces redundancy in the selected batches. Simulation experiments on English-to-Pashto and Spanish-to-English translation tasks demon- strate the superiority of the proposed approach to a number of competing techniques, such as random selection, dissimilarity-based selection, as well as a recently proposed semisupervised active learning strategy.

4 0.38482425 67 emnlp-2010-It Depends on the Translation: Unsupervised Dependency Parsing via Word Alignment

Author: Samuel Brody

Abstract: We reveal a previously unnoticed connection between dependency parsing and statistical machine translation (SMT), by formulating the dependency parsing task as a problem of word alignment. Furthermore, we show that two well known models for these respective tasks (DMV and the IBM models) share common modeling assumptions. This motivates us to develop an alignment-based framework for unsupervised dependency parsing. The framework (which will be made publicly available) is flexible, modular and easy to extend. Using this framework, we implement several algorithms based on the IBM alignment models, which prove surprisingly effective on the dependency parsing task, and demonstrate the potential of the alignment-based approach.

5 0.38106906 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice

Author: Samidh Chatterjee ; Nicola Cancedda

Abstract: Minimum Error Rate Training is the algorithm for log-linear model parameter training most used in state-of-the-art Statistical Machine Translation systems. In its original formulation, the algorithm uses N-best lists output by the decoder to grow the Translation Pool that shapes the surface on which the actual optimization is performed. Recent work has been done to extend the algorithm to use the entire translation lattice built by the decoder, instead of N-best lists. We propose here a third, intermediate way, consisting in growing the translation pool using samples randomly drawn from the translation lattice. We empirically measure a systematic im- provement in the BLEU scores compared to training using N-best lists, without suffering the increase in computational complexity associated with operating with the whole lattice.

6 0.38048789 63 emnlp-2010-Improving Translation via Targeted Paraphrasing

7 0.37974244 49 emnlp-2010-Extracting Opinion Targets in a Single and Cross-Domain Setting with Conditional Random Fields

8 0.3793802 84 emnlp-2010-NLP on Spoken Documents Without ASR

9 0.37853843 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding

10 0.37821236 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors

11 0.37685946 65 emnlp-2010-Inducing Probabilistic CCG Grammars from Logical Form with Higher-Order Unification

12 0.3765505 86 emnlp-2010-Non-Isomorphic Forest Pair Translation

13 0.37631482 82 emnlp-2010-Multi-Document Summarization Using A* Search and Discriminative Learning

14 0.37589958 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media

15 0.37559438 92 emnlp-2010-Predicting the Semantic Compositionality of Prefix Verbs

16 0.37558183 114 emnlp-2010-Unsupervised Parse Selection for HPSG

17 0.37543029 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation

18 0.37519413 11 emnlp-2010-A Semi-Supervised Approach to Improve Classification of Infrequent Discourse Relations Using Feature Vector Extension

19 0.37455508 109 emnlp-2010-Translingual Document Representations from Discriminative Projections

20 0.37441447 103 emnlp-2010-Tense Sense Disambiguation: A New Syntactic Polysemy Task