acl acl2011 acl2011-258 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Marius Pasca
Abstract: The role of search queries, as available within query sessions or in isolation from one another, is examined in the context of ranking the class labels (e.g., brazilian cities, business centers, hilly sites) extracted from Web documents for various instances (e.g., rio de janeiro). The co-occurrence of a class label and an instance, in the same query or within the same query session, is used to reinforce the estimated relevance of the class label for the instance. Experiments over evaluation sets of instances associated with Web search queries illustrate the higher quality of the query-based, re-ranked class labels, relative to ranking baselines using document-based counts.
Reference: text
sentIndex sentText sentNum sentScore
1 The role of search queries, as available within query sessions or in isolation from one another, is examined in the context of ranking the class labels (e. [sent-3, score-1.164]
2 The co-occurrence of a class label and an instance, in the same query or within the same query session, is used to reinforce the estimated relevance of the class label for the instance. [sent-8, score-1.565]
3 Experiments over evaluation sets of instances associated with Web search queries illustrate the higher quality of the query-based, re-ranked class labels, relative to ranking baselines using document-based counts. [sent-9, score-1.144]
4 1 Introduction Motivation: The offline acquisition of instances (rio de janeiro, porsche cayman) and their corresponding class labels (brazilian cities, locations, vehicles, sports cars) from text has been an active area of research. [sent-10, score-0.559]
5 In Web search, the relative ranking of documents returned in response to a query directly affects the outcome of the search. [sent-15, score-0.6]
6 Similarly, the quality of the relative ranking among class labels extracted for a given instance influences any applications (e. [sent-16, score-0.829]
7 But due to noise in Web data and limitations of extraction techniques, class labels acquired for a given instance (e. [sent-19, score-0.632]
8 Inevitably, some of the extracted class labels will be less useful (e. [sent-23, score-0.563]
9 Contributions: This paper explores the role of Web search queries, rather than Web documents, in inducing superior ranking among class labels extracted automatically from documents for various instances. [sent-30, score-0.799]
10 It compares two sources of indirect ranking evidence available within anonymized query logs: a) co-occurrence of an instance and its class label in the same query; and b) co-occurrence of an instance and its class label, as separate queries within the same query session. [sent-31, score-2.274]
11 The former source is a noisy attempt to capture queries that narrow the search results to a particular class of the instance (e. [sent-32, score-0.797]
12 To our knowledge, this is the first study comparing inherently-noisy queries and query sessions for the purpose of ranking of open-domain, labeled class instances. [sent-40, score-1.257]
13 Section 2 introduces intuitions behind an approach using queries for ranking class labels of various instances, and describes associated ranking functions. [sent-44, score-1.266]
14 The results illustrate the higher quality of the query-based, re-ranked lists of class labels, relative to alternative ranking methods using only document-based counts. [sent-46, score-0.613]
15 2 Instance Class Ranking via Query Logs Ranking Hypotheses: We take advantage of anonymized query logs, to induce superior ranking among the class labels associated with various class instances within an IsA repository acquired from Web documents. [sent-47, score-1.833]
16 Given a class instance I, the functions used for the ... [sent-48, score-0.4]
17 • Hypothesis H2: If C is a prominent class of an instance I, and I is ambiguous, then a fraction of the queries about I may also refer to and contain C. [sent-51, score-0.739]
18 • Hypothesis H3: If C is a prominent class of an instance I, then a fraction of the queries about I may be followed by queries about C, and vice-versa. [sent-52, score-0.752]
19 The application of the scoring formula (1) to candidates extracted from the Web produces a ranked list of class labels LH1(I). [sent-69, score-0.725]
20 Examples of such queries are happiness emotion and diderot philosopher. [sent-71, score-0.495]
21 Moreover, queries like happiness positive psychology and diderot enlightenment may be considered to weakly and partially reinforce the relevance of the class labels positive emotions and enlightenment writers of the instances happiness and diderot respectively. [sent-72, score-1.413]
22 In practice, a class label is deemed more relevant if its individual terms occur in popular queries containing the instance. [sent-73, score-0.812]
23 More precisely, for each term within any class label from LH1(I), we compute a score TermQueryScore. [sent-74, score-0.456]
24 The score is the frequency sum of the term within anonymized queries containing the instance I as a prefix, and the term anywhere else in the queries. [sent-75, score-0.561]
25 The class labels are ranked according to the means, resulting in a ranked list LH2(I). [sent-78, score-0.742]
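The per-term score behind LH2(I) can be illustrated with a minimal Python sketch. It assumes a hypothetical toy log mapping each query string to its frequency; the function name `term_query_scores` and the whitespace tokenization are illustrative choices, not taken from the paper.

```python
from collections import Counter


def term_query_scores(instance, query_freqs):
    """Sum query frequencies per term, over queries that start with `instance`.

    `query_freqs` is a hypothetical mapping: query string -> frequency.
    """
    scores = Counter()
    prefix = instance.split()
    for query, freq in query_freqs.items():
        tokens = query.split()
        if tokens[:len(prefix)] != prefix:
            continue  # the instance must be a prefix of the query
        # Count each remaining term once per query, weighted by query frequency.
        for term in set(tokens[len(prefix):]):
            scores[term] += freq
    return scores


# Hypothetical toy log, reusing example queries from the text.
query_freqs = {
    "happiness emotion": 5,
    "happiness positive psychology": 3,
    "diderot philosopher": 4,
    "pursuit of happiness": 7,
}
scores = term_query_scores("happiness", query_freqs)
```

Here only the first two queries contribute, because the other two do not begin with the instance. A class label's score would then be a mean over the scores of its terms.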
26 Examples of such queries are happiness ... [sent-84, score-0.431]
27 In practice, a class label is deemed more relevant if its individual terms occur as part of queries that are in the same query session as a query containing only the instance. [sent-86, score-1.568]
28 Before computing the frequencies, the class label terms are stemmed. [sent-88, score-0.421]
29 Each class label C is assigned the geometric mean of the scores of its N terms, after ignoring stop words: $Score_{H3}(C, I) = \left(\prod_{i=1}^{N} TermSessionScore(T_i)\right)^{1/N}$ (3) The class labels are ranked according to the geometric means, resulting in a ranked list LH3(I). [sent-89, score-1.09]
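A minimal Python sketch of the geometric mean in (3) follows. The stop-word set, the function name `geometric_mean_score`, and the choice to return 0 when any term has a zero score are assumptions made for illustration; the term scores are assumed to be precomputed and already stemmed.

```python
import math

# Illustrative stop-word list; the paper's actual list is not given in the text.
STOP_WORDS = {"of", "the", "and", "in", "for"}


def geometric_mean_score(label, term_scores):
    """Geometric mean of per-term scores for a class label, ignoring stop words."""
    terms = [t for t in label.split() if t not in STOP_WORDS]
    if not terms:
        return 0.0
    values = [term_scores.get(t, 0.0) for t in terms]
    if any(v <= 0 for v in values):
        return 0.0  # assumption: a term never seen in sessions zeroes the product
    # Use logs to avoid overflow when multiplying large frequencies.
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-term session scores for one instance.
term_scores = {"positive": 4.0, "emotion": 9.0}
score = geometric_mean_score("positive emotion", term_scores)
ranked = sorted(["positive emotion", "emotion", "writer"],
                key=lambda c: geometric_mean_score(c, term_scores),
                reverse=True)
```

With these toy values the two-term label scores sqrt(4 * 9) = 6, so it ranks below the single-term label with score 9 and above the unseen label.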
30 Unsupervised Ranking: Given an instance I, the ranking hypotheses and corresponding functions LH1(I), LH2(I) and LH3(I) (or any combination of them) can be used together to generate a merged, ranked list of class labels per instance I. [sent-91, score-0.994]
31 By using only the relative ranks and not the absolute scores of the class labels within the input lists, the outcome of the merging is less sensitive to how class labels of a given instance are numerically scored within the input lists. [sent-96, score-1.301]
32 In case of ties, the scores of the class labels from LH1(I) serve as a secondary ranking criterion. [sent-97, score-0.721]
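The rank-based merging can be sketched as follows. The exact merging formula is not reproduced in the extracted text, so this sketch assumes a simple average of ranks across the input lists, with a hypothetical penalty rank (`missing_rank`) for labels absent from a list, and ties broken by the LH1 score as described above.

```python
def merge_ranked_lists(lists, h1_scores, missing_rank=1000):
    """Merge ranked label lists by average rank; break ties by LH1 score.

    `lists` holds ranked lists of labels (best first). `missing_rank` is an
    assumed penalty for a label that does not appear in a given list.
    """
    labels = {label for lst in lists for label in lst}

    def average_rank(label):
        ranks = [lst.index(label) + 1 if label in lst else missing_rank
                 for lst in lists]
        return sum(ranks) / len(ranks)

    # Lower average rank is better; on ties, higher LH1 score comes first.
    return sorted(labels,
                  key=lambda c: (average_rank(c), -h1_scores.get(c, 0.0)))


# Hypothetical input lists for one instance.
lh1 = ["cities", "locations", "places"]
lh2 = ["locations", "cities", "places"]
lh3 = ["cities", "locations"]
merged = merge_ranked_lists([lh1, lh2, lh3], {"cities": 0.9, "locations": 0.5})
```

Because only ranks enter the merge, rescaling the scores within any input list leaves the outcome unchanged, which is the property the text highlights.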
33 Thus, every instance I is associated with a ranked list of class labels computed according to this ranking formula. [sent-99, score-0.783]
34 Conversely, each class label C from the IsA repository is associated with a ranked list of class instances computed with the earlier scoring formula (1) used to generate lists LH1(I). [sent-100, score-1.395]
35 The queries are fully-anonymized queries in English submitted to Google by Web users in 2009, and are available in two collections. [sent-104, score-0.749]
36 The first collection is a random sample of 50 million unique queries that are independent from one another. [sent-105, score-0.389]
37 Each session has an initial query and a series of subsequent queries. [sent-107, score-0.442]
38 A subsequent query is a query that has been submitted by the same Web user within no longer than a few minutes after the initial query. [sent-108, score-0.776]
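The session definition above can be sketched in Python. The record layout `(user_id, timestamp_seconds, query)`, the function name `build_sessions`, and the 180-second window are all hypothetical; the text only says "a few minutes" and the real logs are anonymized.

```python
def build_sessions(records, window_seconds=180):
    """Group queries into sessions: an initial query plus the same user's
    queries submitted within `window_seconds` of that initial query.

    `records` is an iterable of (user_id, timestamp_seconds, query) tuples.
    """
    sessions = []
    current = {}  # user_id -> (start time of open session, its query list)
    for user, ts, query in sorted(records, key=lambda r: (r[0], r[1])):
        if user in current and ts - current[user][0] <= window_seconds:
            current[user][1].append(query)  # subsequent query
        else:
            session = [query]               # new initial query
            current[user] = (ts, session)
            sessions.append(session)
    return sessions


# Hypothetical toy records.
records = [
    ("u1", 0, "happiness"),
    ("u1", 60, "positive emotions"),
    ("u1", 1000, "diderot"),
    ("u2", 10, "rio de janeiro"),
]
sessions = build_sessions(records)
```

The window is measured from the initial query rather than from the previous query, which follows the wording of the definition.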
39 A more practical alternative is an automatic evaluation procedure for ranked lists of class labels, based on existing resources and systems. [sent-114, score-0.509]
40 Assume that there is a gold standard, containing gold class labels that are each associated with a gold set of their instances. [sent-115, score-1.006]
41 Based on the gold standard, the ranked lists of class labels available within an IsA repository can be automatically evaluated as follows. [sent-117, score-1.033]
42 First, for each gold label, the ranked lists of class labels of individual gold instances are retrieved from the IsA repository. [sent-118, score-1.194]
43 Second, the individual retrieved lists are merged into a ranked list of class labels, associated with the gold label. [sent-119, score-0.822]
44 Intuitively, a ranked list of class labels is a better approximation of a gold label, if class labels situated at better ranks in the list are closer in meaning to the gold label. [sent-124, score-1.538]
45 Evaluation Metric: Given a gold label and a list of class labels, if any, derived from the IsA repository, the rank of the highest class label that matches the gold label determines the score assigned to the gold label, in the form of the reciprocal rank of the match. [sent-125, score-1.543]
46 Thus, if the gold label matches a class label at rank 1, 2 or 3 in the computed list, the gold label receives a score of 1, 0. [sent-126, score-1.028]
47 The score is 0 if the gold label does not match any of the top 20 class labels. [sent-129, score-0.599]
48 The overall score over the entire set of gold labels is the mean reciprocal rank (MRR) score over all gold labels from the set. [sent-130, score-0.747]
49 Two types of MRR scores are automatically computed: • MRRf considers a gold label and a class label to match, if they are identical; • MRRp considers a gold label and a class label to match, if one or more of their tokens that are not stop words are identical. [sent-131, score-1.417]
50 Thus, insurance carriers and insurance companies are considered to not match in MRRf scores, but match in MRRp scores. [sent-135, score-0.763]
51 On the other hand, MRRp scores may give credit to less relevant class labels, such as insurance policies for the gold label insurance carriers. [sent-136, score-0.729]
52 Therefore, MRRp is an optimistic, and MRRf is a pessimistic estimate of the actual usefulness of the computed ranked lists of class labels as approximations of the gold labels. [sent-137, score-0.905]
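The two MRR variants can be sketched in a few lines of Python. The function names, the stop-word set, and the whitespace tokenization are illustrative assumptions; the cutoff of 20 and the matching rules follow the description above.

```python
# Illustrative stop-word list; the paper's actual list is not given in the text.
STOP_WORDS = {"of", "the", "and", "in", "for"}


def full_match(gold, label):
    """MRRf-style match: the two strings are identical."""
    return gold == label


def partial_match(gold, label):
    """MRRp-style match: at least one shared non-stop-word token."""
    gold_tokens = set(gold.split()) - STOP_WORDS
    label_tokens = set(label.split()) - STOP_WORDS
    return bool(gold_tokens & label_tokens)


def reciprocal_rank(gold, ranked_labels, match, cutoff=20):
    """1/rank of the first matching label within the top `cutoff`, else 0."""
    for rank, label in enumerate(ranked_labels[:cutoff], start=1):
        if match(gold, label):
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(gold_to_labels, match):
    """Average reciprocal rank over all gold labels."""
    if not gold_to_labels:
        return 0.0
    total = sum(reciprocal_rank(gold, labels, match)
                for gold, labels in gold_to_labels.items())
    return total / len(gold_to_labels)


# Hypothetical toy evaluation data.
data = {
    "insurance carriers": ["insurance companies", "insurance carriers"],
    "european countries": ["cities"],
}
mrr_f = mean_reciprocal_rank(data, full_match)
mrr_p = mean_reciprocal_rank(data, partial_match)
```

On this toy data the full-match score is lower than the partial-match score, mirroring the pessimistic versus optimistic reading of the two metrics.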
53 The number of class labels available per instance and vice-versa follows a long-tail distribution, indicating that 2. [sent-141, score-0.624]
54 12 million of the instances each have two or more class labels (with an average of 19. [sent-142, score-0.717]
55 The set contains 807 queries, each associated with a ranked list of between 10 and 100 gold instances automatically extracted by Google Squared. [sent-150, score-0.512]
56 Since the gold instances available as input for each query as part of Qe are automatically extracted, they may or may not be true instances of the respective queries. [sent-151, score-0.858]
57 As described in (Paşca, 2010), the second evaluation set Qm is a subset of 40 queries from Qe, such that the gold instances available for each query in Qm are found to be correct after manual inspection. [sent-152, score-1.018]
58 The 40 queries from Qm are associated with between 8 and 33 human-validated instances. [sent-153, score-0.401]
59 As shown in the upper part of Table 2, the queries from Qe are up to 8 tokens in length, with an average of 2 tokens per query. [sent-154, score-0.468]
60 The lower part of Table 2 shows the number of gold instances available as input, which average around 70 and 17 per query, for queries from Qe and Qm respectively. [sent-157, score-0.697]
61 To provide another view on the distribution of the queries from evaluation sets, Table 3 lists tokens that are not stop words, which occur in most queries from Qe. [sent-158, score-0.887]
62 Comparatively, few query tokens occur in more than one query in Qm. [sent-159, score-0.729]
63 Evaluation Procedure: Following the general evaluation procedure, each query from the sets Qe and Qm acts as a gold class label associated with the corresponding set of instances. [sent-160, score-0.977]
64 Given a query and its instances I from the evaluation sets Qe or Qm, a merged, ranked list of class labels is computed out of the ranked lists of class labels available in the [sent-161, score-1.318]
65 underlying IsA repository for each instance I. [sent-162, score-0.982]
66 The evaluation compares the merged lists of class labels, with the corresponding queries from Qe or Qm. [sent-163, score-0.426]
67 Accuracy of Lists of Class Labels: Table 4 summarizes results from comparative experiments, quantifying a) horizontally, the impact of alternative parameter settings on the computed lists of class labels; and b) vertically, the comparative accuracy of the experimental runs over the query sets. [sent-164, score-0.848]
68 The experimental parameters are the number of input instances from the evaluation sets that are used for retrieving class labels, I-per-Q, set to 3, 5, 10; and the number of class labels retrieved per input instance, C-per-I, set to 5, 10, 20. [sent-165, score-1.171]
69 This suggests that useful class labels can be generated even in extreme scenarios, where the number of instances available as input is as small as 3 or 5. [sent-206, score-0.73]
70 Fourth and most importantly, for most combinations of parameter settings and on both query sets, the runs that take advantage of query logs (Rp, Rs, Ru) produce the highest scores. [sent-207, score-0.781]
71 In particular, when I-per-Q is set to 10 and C-per-I to 20, run Ru identifies the original query as an exact match among the top three to four class labels returned (score 0. [sent-208, score-0.968]
72 278); and as a partial match among the top one to two class labels returned (score 0. [sent-209, score-0.589]
73 In all experiments, the higher scores of Rp, Rs and Ru can be attributed to higher-quality lists of class labels, relative to Rd. [sent-213, score-0.469]
74 Thus, between the presence of a class label and an instance either in the same query, or as separate queries within the same query session, it is the latter that provides a more useful signal during the reranking of class labels of each instance. [sent-216, score-1.764]
75 Table 5 illustrates the top class labels from the ranked lists generated in run Rs for various queries from both Qe and Qm. [sent-217, score-1.108]
76 The table suggests that the computed class labels are relatively resistant to noise and variation within the input set of gold instances. [sent-218, score-0.796]
77 For example, the top elements of the lists of class labels ... [sent-219, score-0.421]
78 Similarly, the class labels computed for european countries are almost the same for Qe vs. [sent-224, score-0.575]
79 Qm, although the overlap of the respective lists of 10 gold instances used as input is not large. [sent-225, score-0.445]
80 The table shows at least one query (park slope restaurants) for which the output is less than optimal, either because the class labels (e. [sent-226, score-0.871]
81 , businesses) are quite distant semantically from the query (for Qe), or because no output is produced at all, due to no class labels being found in the IsA repository for any of the 10 input gold instances (for Qm). [sent-228, score-1.362]
82 For many queries, however, the computed class labels arguably capture the meaning of the original query, although not necessarily in the exact same lexical form, and sometimes only partially. [sent-229, score-0.575]
83 For example, for the query endangered animals, only the fourth class label from Qm identifies the query exactly. [sent-230, score-1.154]
84 However, class labels preceding endangered animals already capture the notion of animals or species (first and third labels), or that they are endangered (second label). [sent-231, score-0.695]
85 In the first graph of Figure 1, for Qe, the query matches the automatically-generated class label at ranks 1, 2, 3, 4 and 5 for 18. [sent-235, score-0.813]
86 In particular, the query matches the class label at rank 1 and 2 for 50. [sent-249, score-0.84]
87 Discussion: The quality of lists of items extracted from documents can benefit from query-driven ranking, particularly for the task of ranking class labels of instances within IsA repositories. [sent-255, score-1.059]
88 The use of queries for ranking is generally applicable: it can be seen as a post-processing stage that enhances the ranking of the class labels extracted for various instances by any method into any IsA repository. [sent-256, score-1.428]
89 Open-domain class labels extracted from text and re-ranked as described in this paper are useful in a variety of applications. [sent-257, score-0.563]
90 The labeling of the returned set of instances, using the re-ranked class labels available per instance, allows for the generation of query refinements (e. [sent-261, score-0.925]
91 Our work compares the usefulness of queries and query sessions for ranking class labels in extracted IsA repositories. [sent-267, score-1.494]
92 It shows that query sessions produce better-ranked class labels than isolated queries do. [sent-268, score-1.289]
93 A task complementary to class label ranking is entity ranking (Billerbeck et al. [sent-269, score-0.759]
94 The choice of search queries and query substitutions is often influenced by, and indicative of, various semantic relations holding among full queries or query terms (Jones et al. [sent-272, score-1.448]
95 , by exploring the acquisition of untyped, similarity-based relations from query logs (Baeza-Yates and Tiberi, 2007). [sent-276, score-0.434]
96 In comparison, queries are used here to re-rank class labels capturing a well-defined type of open-domain relations, namely IsA relations. [sent-277, score-0.89]
97 Current work investigates the impact of ambiguous input instances (Vyas and Pantel, 2009) on the quality of the generated class labels. [sent-279, score-0.529]
98 What you seek is what you get: Extraction of class attributes from query logs. [sent-366, score-0.67]
99 The role of queries in ranking labeled instances extracted from text. [sent-371, score-0.732]
100 Weakly-supervised acquisition of labeled class instances using graph random walks. [sent-394, score-0.522]
wordName wordTfidf (topN-words)
[('queries', 0.363), ('qe', 0.36), ('query', 0.344), ('class', 0.326), ('qm', 0.315), ('labels', 0.201), ('ranking', 0.169), ('instances', 0.164), ('isa', 0.163), ('gold', 0.147), ('repository', 0.141), ('mrrf', 0.135), ('mrrp', 0.118), ('label', 0.095), ('lists', 0.095), ('anonymized', 0.089), ('ranked', 0.088), ('instance', 0.074), ('ru', 0.07), ('insurance', 0.068), ('happiness', 0.068), ('session', 0.068), ('diderot', 0.064), ('merged', 0.063), ('pas', 0.063), ('logs', 0.058), ('web', 0.057), ('sessions', 0.055), ('rank', 0.051), ('demartini', 0.051), ('computed', 0.048), ('rs', 0.045), ('endangered', 0.045), ('tokens', 0.041), ('list', 0.039), ('input', 0.039), ('geometric', 0.039), ('animals', 0.039), ('associated', 0.038), ('sca', 0.037), ('mrr', 0.037), ('extracted', 0.036), ('rp', 0.036), ('run', 0.035), ('formula', 0.035), ('within', 0.035), ('runs', 0.035), ('cafarella', 0.035), ('search', 0.034), ('billerbeck', 0.034), ('enlightenment', 0.034), ('eognarpqficeust', 0.034), ('hcyepo', 0.034), ('iofciu', 0.034), ('itiv', 0.034), ('janeiro', 0.034), ('lhi', 0.034), ('shale', 0.034), ('specialize', 0.034), ('documents', 0.033), ('pennacchiotti', 0.033), ('acquisition', 0.032), ('extraction', 0.031), ('returned', 0.031), ('match', 0.031), ('jaguar', 0.03), ('brazilian', 0.03), ('subsequent', 0.03), ('deemed', 0.028), ('banko', 0.028), ('maker', 0.027), ('vyas', 0.027), ('philosophers', 0.027), ('rio', 0.027), ('ties', 0.027), ('emotions', 0.027), ('sets', 0.027), ('google', 0.027), ('retrieved', 0.026), ('etzioni', 0.026), ('prominent', 0.026), ('oil', 0.026), ('million', 0.026), ('stop', 0.025), ('ins', 0.025), ('eo', 0.025), ('scores', 0.025), ('matches', 0.024), ('aonf', 0.024), ('yis', 0.024), ('ranks', 0.024), ('relative', 0.023), ('conversely', 0.023), ('cars', 0.023), ('comparatively', 0.023), ('kozareva', 0.023), ('freq', 0.023), ('oisn', 0.023), ('per', 0.023), ('submitted', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 258 acl-2011-Ranking Class Labels Using Query Sessions
Author: Marius Pasca
Abstract: The role of search queries, as available within query sessions or in isolation from one another, is examined in the context of ranking the class labels (e.g., brazilian cities, business centers, hilly sites) extracted from Web documents for various instances (e.g., rio de janeiro). The co-occurrence of a class label and an instance, in the same query or within the same query session, is used to reinforce the estimated relevance of the class label for the instance. Experiments over evaluation sets of instances associated with Web search queries illustrate the higher quality of the query-based, re-ranked class labels, relative to ranking baselines using document-based counts.
2 0.35715729 137 acl-2011-Fine-Grained Class Label Markup of Search Queries
Author: Joseph Reisinger ; Marius Pasca
Abstract: We develop a novel approach to the semantic analysis of short text segments and demonstrate its utility on a large corpus of Web search queries. Extracting meaning from short text segments is difficult as there is little semantic redundancy between terms; hence methods based on shallow semantic analysis may fail to accurately estimate meaning. Furthermore search queries lack explicit syntax often used to determine intent in question answering. In this paper we propose a hybrid model of semantic analysis combining explicit class-label extraction with a latent class PCFG. This class-label correlation (CLC) model admits a robust parallel approximation, allowing it to scale to large amounts of query data. We demonstrate its performance in terms of (1) its predicted label accuracy on polysemous queries and (2) its ability to accurately chunk queries into base constituents.
3 0.31710052 182 acl-2011-Joint Annotation of Search Queries
Author: Michael Bendersky ; W. Bruce Croft ; David A. Smith
Abstract: Marking up search queries with linguistic annotations such as part-of-speech tags, capitalization, and segmentation, is an important part of query processing and understanding in information retrieval systems. Due to their brevity and idiosyncratic structure, search queries pose a challenge to existing NLP tools. To address this challenge, we propose a probabilistic approach for performing joint query annotation. First, we derive a robust set of unsupervised independent annotations, using queries and pseudo-relevance feedback. Then, we stack additional classifiers on the independent annotations, and exploit the dependencies between them to further improve the accuracy, even with a very limited amount of available training data. We evaluate our method using a range of queries extracted from a web search log. Experimental results verify the effectiveness of our approach for both short keyword queries, and verbose natural language queries.
4 0.30573621 256 acl-2011-Query Weighting for Ranking Model Adaptation
Author: Peng Cai ; Wei Gao ; Aoying Zhou ; Kam-Fai Wong
Abstract: We propose to directly measure the importance of queries in the source domain to the target domain where no rank labels of documents are available, which is referred to as query weighting. Query weighting is a key step in ranking model adaptation. As the learning object of ranking algorithms is divided by query instances, we argue that it’s more reasonable to conduct importance weighting at query level than document level. We present two query weighting schemes. The first compresses the query into a query feature vector, which aggregates all document instances in the same query, and then conducts query weighting based on the query feature vector. This method can efficiently estimate query importance by compressing query data, but the potential risk is information loss resulted from the compression. The second measures the similarity between the source query and each target query, and then combines these fine-grained similarity values for its importance estimation. Adaptation experiments on LETOR3.0 data set demonstrate that query weighting significantly outperforms document instance weighting methods.
5 0.28854024 271 acl-2011-Search in the Lost Sense of "Query": Question Formulation in Web Search Queries and its Temporal Changes
Author: Bo Pang ; Ravi Kumar
Abstract: Web search is an information-seeking activity. Often times, this amounts to a user seeking answers to a question. However, queries, which encode user’s information need, are typically not expressed as full-length natural language sentences, in particular, as questions. Rather, they consist of one or more text fragments. As humans become more search-engine-savvy, do natural-language questions still have a role to play in web search? Through a systematic, large-scale study, we find to our surprise that as time goes by, web users are more likely to use questions to express their search intent.
6 0.25932163 181 acl-2011-Jigs and Lures: Associating Web Queries with Structured Entities
7 0.13146828 13 acl-2011-A Graph Approach to Spelling Correction in Domain-Centric Search
8 0.12621883 246 acl-2011-Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition
9 0.12333924 191 acl-2011-Knowledge Base Population: Successful Approaches and Challenges
10 0.10497263 333 acl-2011-Web-Scale Features for Full-Scale Parsing
11 0.10056322 89 acl-2011-Creative Language Retrieval: A Robust Hybrid of Information Retrieval and Linguistic Creativity
12 0.098384947 11 acl-2011-A Fast and Accurate Method for Approximate String Search
13 0.094384558 36 acl-2011-An Efficient Indexer for Large N-Gram Corpora
14 0.093439475 135 acl-2011-Faster and Smaller N-Gram Language Models
15 0.088940226 128 acl-2011-Exploring Entity Relations for Named Entity Disambiguation
16 0.087218054 231 acl-2011-Nonlinear Evidence Fusion and Propagation for Hyponymy Relation Mining
17 0.085223325 52 acl-2011-Automatic Labelling of Topic Models
18 0.08433231 255 acl-2011-Query Snowball: A Co-occurrence-based Approach to Multi-document Summarization for Question Answering
19 0.074857101 19 acl-2011-A Mobile Touchable Application for Online Topic Graph Extraction and Exploration of Web Content
20 0.072108343 197 acl-2011-Latent Class Transliteration based on Source Language Origin
topicId topicWeight
[(0, 0.187), (1, 0.098), (2, -0.149), (3, 0.095), (4, -0.143), (5, -0.31), (6, -0.093), (7, -0.295), (8, 0.158), (9, -0.05), (10, 0.167), (11, -0.049), (12, -0.019), (13, 0.019), (14, -0.008), (15, -0.054), (16, -0.037), (17, 0.032), (18, -0.038), (19, -0.002), (20, -0.058), (21, 0.059), (22, -0.023), (23, -0.015), (24, 0.001), (25, 0.108), (26, 0.02), (27, 0.032), (28, 0.034), (29, 0.003), (30, -0.038), (31, -0.031), (32, -0.045), (33, -0.022), (34, -0.027), (35, 0.025), (36, 0.06), (37, 0.04), (38, -0.043), (39, -0.047), (40, 0.085), (41, -0.004), (42, -0.01), (43, -0.04), (44, -0.087), (45, 0.008), (46, 0.021), (47, -0.009), (48, 0.062), (49, 0.012)]
simIndex simValue paperId paperTitle
same-paper 1 0.98632568 258 acl-2011-Ranking Class Labels Using Query Sessions
Author: Marius Pasca
Abstract: The role of search queries, as available within query sessions or in isolation from one another, is examined in the context of ranking the class labels (e.g., brazilian cities, business centers, hilly sites) extracted from Web documents for various instances (e.g., rio de janeiro). The co-occurrence of a class label and an instance, in the same query or within the same query session, is used to reinforce the estimated relevance of the class label for the instance. Experiments over evaluation sets of instances associated with Web search queries illustrate the higher quality of the query-based, re-ranked class labels, relative to ranking baselines using document-based counts.
2 0.92212737 137 acl-2011-Fine-Grained Class Label Markup of Search Queries
Author: Joseph Reisinger ; Marius Pasca
Abstract: We develop a novel approach to the semantic analysis of short text segments and demonstrate its utility on a large corpus of Web search queries. Extracting meaning from short text segments is difficult as there is little semantic redundancy between terms; hence methods based on shallow semantic analysis may fail to accurately estimate meaning. Furthermore search queries lack explicit syntax often used to determine intent in question answering. In this paper we propose a hybrid model of semantic analysis combining explicit class-label extraction with a latent class PCFG. This class-label correlation (CLC) model admits a robust parallel approximation, allowing it to scale to large amounts of query data. We demonstrate its performance in terms of (1) its predicted label accuracy on polysemous queries and (2) its ability to accurately chunk queries into base constituents.
3 0.9097591 182 acl-2011-Joint Annotation of Search Queries
Author: Michael Bendersky ; W. Bruce Croft ; David A. Smith
Abstract: Marking up search queries with linguistic annotations such as part-of-speech tags, capitalization, and segmentation, is an important part of query processing and understanding in information retrieval systems. Due to their brevity and idiosyncratic structure, search queries pose a challenge to existing NLP tools. To address this challenge, we propose a probabilistic approach for performing joint query annotation. First, we derive a robust set of unsupervised independent annotations, using queries and pseudo-relevance feedback. Then, we stack additional classifiers on the independent annotations, and exploit the dependencies between them to further improve the accuracy, even with a very limited amount of available training data. We evaluate our method using a range of queries extracted from a web search log. Experimental results verify the effectiveness of our approach for both short keyword queries, and verbose natural language queries.
4 0.86415863 181 acl-2011-Jigs and Lures: Associating Web Queries with Structured Entities
Author: Patrick Pantel ; Ariel Fuxman
Abstract: We propose methods for estimating the probability that an entity from an entity database is associated with a web search query. Association is modeled using a query entity click graph, blending general query click logs with vertical query click logs. Smoothing techniques are proposed to address the inherent data sparsity in such graphs, including interpolation using a query synonymy model. A large-scale empirical analysis of the smoothing techniques, over a 2-year click graph collected from a commercial search engine, shows significant reductions in modeling error. The association models are then applied to the task of recommending products to web queries, by annotating queries with products from a large catalog and then mining query-product associations through web search session analysis. Experimental analysis shows that our smoothing techniques improve coverage while keeping precision stable, and overall, that our top-performing model affects 9% of general web queries with 94% precision.
5 0.85720032 271 acl-2011-Search in the Lost Sense of "Query": Question Formulation in Web Search Queries and its Temporal Changes
Author: Bo Pang ; Ravi Kumar
Abstract: Web search is an information-seeking activity. Oftentimes, this amounts to a user seeking answers to a question. However, queries, which encode the user’s information need, are typically not expressed as full-length natural language sentences and, in particular, not as questions. Rather, they consist of one or more text fragments. As humans become more search-engine-savvy, do natural-language questions still have a role to play in web search? Through a systematic, large-scale study, we find to our surprise that as time goes by, web users are more likely to use questions to express their search intent.
6 0.83160692 256 acl-2011-Query Weighting for Ranking Model Adaptation
7 0.73119092 13 acl-2011-A Graph Approach to Spelling Correction in Domain-Centric Search
9 0.5849762 246 acl-2011-Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition
10 0.57725745 36 acl-2011-An Efficient Indexer for Large N-Gram Corpora
11 0.4817293 191 acl-2011-Knowledge Base Population: Successful Approaches and Challenges
12 0.44798911 135 acl-2011-Faster and Smaller N-Gram Language Models
13 0.4432981 255 acl-2011-Query Snowball: A Co-occurrence-based Approach to Multi-document Summarization for Question Answering
14 0.42022902 11 acl-2011-A Fast and Accurate Method for Approximate String Search
15 0.38911214 42 acl-2011-An Interface for Rapid Natural Language Processing Development in UIMA
16 0.3736468 231 acl-2011-Nonlinear Evidence Fusion and Propagation for Hyponymy Relation Mining
17 0.35975796 26 acl-2011-A Speech-based Just-in-Time Retrieval System using Semantic Search
18 0.35180143 19 acl-2011-A Mobile Touchable Application for Online Topic Graph Extraction and Exploration of Web Content
19 0.33512786 333 acl-2011-Web-Scale Features for Full-Scale Parsing
20 0.32376853 248 acl-2011-Predicting Clicks in a Vocabulary Learning System
topicId topicWeight
[(5, 0.022), (9, 0.018), (17, 0.056), (26, 0.094), (37, 0.071), (39, 0.065), (41, 0.036), (55, 0.047), (59, 0.034), (72, 0.018), (90, 0.23), (91, 0.052), (96, 0.151), (97, 0.013)]
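The listing does not say how the simValue scores are derived from topic vectors such as the one above. As a minimal sketch, assuming cosine similarity over sparse (topicId, topicWeight) pairs (one common choice, not confirmed by this page), two such vectors could be compared as follows. The helper names and the second vector are hypothetical and serve only as illustration.

```python
import ast
import math


def parse_topic_vector(text):
    """Parse a '[(topicId, weight), ...]' string into a {topicId: weight} dict."""
    return {int(topic_id): float(weight) for topic_id, weight in ast.literal_eval(text)}


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as {topicId: weight} dicts."""
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Topic vector copied from the listing above.
this_paper = parse_topic_vector(
    "[(5, 0.022), (9, 0.018), (17, 0.056), (26, 0.094), (37, 0.071), "
    "(39, 0.065), (41, 0.036), (55, 0.047), (59, 0.034), (72, 0.018), "
    "(90, 0.23), (91, 0.052), (96, 0.151), (97, 0.013)]"
)

# Hypothetical vector for another paper; these weights are made up.
other_paper = {12: 0.30, 90: 0.20, 96: 0.10}

score = cosine_similarity(this_paper, other_paper)
```

Storing only the non-zero topics keeps the comparison proportional to the number of shared topics rather than to the full topic count, which matches the sparse form in which the vector is printed here.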
simIndex simValue paperId paperTitle
same-paper 1 0.81321579 258 acl-2011-Ranking Class Labels Using Query Sessions
Author: Marius Pasca
Abstract: The role of search queries, as available within query sessions or in isolation from one another, is examined in the context of ranking the class labels (e.g., brazilian cities, business centers, hilly sites) extracted from Web documents for various instances (e.g., rio de janeiro). The co-occurrence of a class label and an instance, in the same query or within the same query session, is used to reinforce the estimated relevance of the class label for the instance. Experiments over evaluation sets of instances associated with Web search queries illustrate the higher quality of the query-based, re-ranked class labels, relative to ranking baselines using document-based counts.
2 0.77927333 212 acl-2011-Local Histograms of Character N-grams for Authorship Attribution
Author: Hugo Jair Escalante ; Thamar Solorio ; Manuel Montes-y-Gomez
Abstract: This paper proposes the use of local histograms (LH) over character n-grams for authorship attribution (AA). LHs are enriched histogram representations that preserve sequential information in documents; they have been successfully used for text categorization and document visualization using word histograms. In this work we explore the suitability of LHs over n-grams at the character-level for AA. We show that LHs are particularly helpful for AA, because they provide useful information for uncovering, to some extent, the writing style of authors. We report experimental results in AA data sets that confirm that LHs over character n-grams are more helpful for AA than the usual global histograms, yielding results far superior to state-of-the-art approaches. We found that LHs are even more advantageous in challenging conditions, such as having imbalanced and small training sets. Our results motivate further research on the use of LHs for modeling the writing style of authors for related tasks, such as authorship verification and plagiarism detection.
3 0.76997137 222 acl-2011-Model-Portability Experiments for Textual Temporal Analysis
Author: Oleksandr Kolomiyets ; Steven Bethard ; Marie-Francine Moens
Abstract: We explore a semi-supervised approach for improving the portability of time expression recognition to non-newswire domains: we generate additional training examples by substituting temporal expression words with potential synonyms. We explore using synonyms both from WordNet and from the Latent Words Language Model (LWLM), which predicts synonyms in context using an unsupervised approach. We evaluate a state-of-the-art time expression recognition system trained both with and without the additional training examples using data from TempEval 2010, Reuters and Wikipedia. We find that the LWLM provides substantial improvements on the Reuters corpus, and smaller improvements on the Wikipedia corpus. We find that WordNet alone never improves performance, though intersecting the examples from the LWLM and WordNet provides more stable results for Wikipedia.
4 0.73830414 64 acl-2011-C-Feel-It: A Sentiment Analyzer for Micro-blogs
Author: Aditya Joshi ; Balamurali AR ; Pushpak Bhattacharyya ; Rajat Mohanty
Abstract: Social networking and micro-blogging sites are stores of opinion-bearing content created by human users. We describe C-Feel-It, a system which can tap opinion content in posts (called tweets) from the micro-blogging website, Twitter. This web-based system categorizes tweets pertaining to a search string as positive, negative or objective and gives an aggregate sentiment score that represents a sentiment snapshot for a search string. We present a qualitative evaluation of this system based on a human-annotated tweet corpus.
5 0.70394528 221 acl-2011-Model-Based Aligner Combination Using Dual Decomposition
Author: John DeNero ; Klaus Macherey
Abstract: Unsupervised word alignment is most often modeled as a Markov process that generates a sentence f conditioned on its translation e. A similar model generating e from f will make different alignment predictions. Statistical machine translation systems combine the predictions of two directional models, typically using heuristic combination procedures like grow-diag-final. This paper presents a graphical model that embeds two directional aligners into a single model. Inference can be performed via dual decomposition, which reuses the efficient inference algorithms of the directional models. Our bidirectional model enforces a one-to-one phrase constraint while accounting for the uncertainty in the underlying directional models. The resulting alignments improve upon baseline combination heuristics in word-level and phrase-level evaluations.
6 0.67821264 123 acl-2011-Exact Decoding of Syntactic Translation Models through Lagrangian Relaxation
7 0.66656858 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents
8 0.6639877 137 acl-2011-Fine-Grained Class Label Markup of Search Queries
9 0.656353 115 acl-2011-Engkoo: Mining the Web for Language Learning
10 0.65633589 253 acl-2011-PsychoSentiWordNet
11 0.65323269 333 acl-2011-Web-Scale Features for Full-Scale Parsing
12 0.65180635 182 acl-2011-Joint Annotation of Search Queries
13 0.64718479 193 acl-2011-Language-independent compound splitting with morphological operations
14 0.64258218 181 acl-2011-Jigs and Lures: Associating Web Queries with Structured Entities
15 0.64232796 38 acl-2011-An Empirical Investigation of Discounting in Cross-Domain Language Models
16 0.64213407 331 acl-2011-Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation
17 0.64092016 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation
18 0.64074916 70 acl-2011-Clustering Comparable Corpora For Bilingual Lexicon Extraction
19 0.64072168 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment
20 0.63936865 178 acl-2011-Interactive Topic Modeling