acl acl2011 acl2011-137 knowledge-graph by maker-knowledge-mining

137 acl-2011-Fine-Grained Class Label Markup of Search Queries


Source: pdf

Author: Joseph Reisinger ; Marius Pasca

Abstract: We develop a novel approach to the semantic analysis of short text segments and demonstrate its utility on a large corpus of Web search queries. Extracting meaning from short text segments is difficult as there is little semantic redundancy between terms; hence methods based on shallow semantic analysis may fail to accurately estimate meaning. Furthermore search queries lack explicit syntax often used to determine intent in question answering. In this paper we propose a hybrid model of semantic analysis combining explicit class-label extraction with a latent class PCFG. This class-label correlation (CLC) model admits a robust parallel approximation, allowing it to scale to large amounts of query data. We demonstrate its performance in terms of (1) its predicted label accuracy on polysemous queries and (2) its ability to accurately chunk queries into base constituents.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Furthermore search queries lack explicit syntax often used to determine intent in question answering. [sent-5, score-0.296]

2 In this paper we propose a hybrid model of semantic analysis combining explicit class-label extraction with a latent class PCFG. [sent-6, score-0.212]

3 This class-label correlation (CLC) model admits a robust parallel approximation, allowing it to scale to large amounts of query data. [sent-7, score-0.53]

4 We demonstrate its performance in terms of (1) its predicted label accuracy on polysemous queries and (2) its ability to accurately chunk queries into base constituents. [sent-8, score-0.601]

5 1 Introduction Search queries are generally short and rarely contain much explicit syntax, making query understanding a purely semantic endeavor. [sent-9, score-0.746]

6 For example, the query [tropical breeze cleaners] has little to do with island vacations, nor are desert birds relevant to [1970 road runner], which refers to a car model. [sent-12, score-0.554]

7 CLC combines explicit class-label extraction (e.g., road runner is a car model) with a latent variable model for capturing weakly compositional interactions between query constituents. [sent-18, score-0.728]

8 Constituents are tagged with IsA class labels from a large, automatically extracted lexicon, using a probabilistic context free grammar (PCFG). [sent-19, score-0.193]

9 In addition to improving query understanding, potential applications of CLC include: (1) relation extraction (Baeza-Yates and Tiberi, 2007), (2) query substitutions or broad matching (Jones et al. [sent-32, score-0.946]

10 In this paper we introduce CLC and evaluate it on a sample of 500M search queries along two dimensions: (1) query constituent chunking precision (i. [sent-38, score-0.854]

11 , Bergsma and Wang (2007); Tan and Peng (2008)), and (2) class label assignment precision (i. [sent-41, score-0.373]

12 , given the query intent, how relevant are the inferred class labels), paying particular attention to cases where queries contain ambiguous constituents. [sent-43, score-0.846]

13 CLC compares favorably to several simpler submodels, with gains in performance stemming from coarse-graining related class labels and increasing the number of clusters used to capture between-label correlations. [sent-44, score-0.463]

14 Li (2010) defines the semantic structure of noun-phrase queries as intent heads (attributes) coupled with some number of intent modifiers (attribute values), e. [sent-47, score-0.453]

15 , the query [alice in wonderland 2010 cast] is comprised of an intent head cast and two intent modifiers alice in wonderland and 2010. [sent-49, score-0.758]

16 In this work we focus on semantic class markup of query constituents, but our approach could be easily extended to account for query structure as well. [sent-50, score-1.146]

17 (2010) describe a similar classlabel-based approach for query interpretation, explicitly modeling the importance of each label for a given entity. [sent-52, score-0.634]

18 For simplicity, we extract class labels using the seed-based approach proposed by Van Durme and Paşca (2008) (in particular Paşca (2010)) which generalizes Hearst (1992). [sent-54, score-0.193]

19 Semantic class label lexicons derived from any of these approaches can be used as input to CLC. [sent-61, score-0.242]

20 Several authors have studied query clustering in the context of information retrieval (e. [sent-62, score-0.473]

21 Our approach is novel in this regard, as we cluster queries in order to capture correlations between span labels, rather than explicitly for query understanding. [sent-65, score-0.887]

22 Our approach yields similar topical decompositions of noun-phrases in queries and is completely unsupervised. [sent-67, score-0.206]

23 (2006) propose an automatic method for query substitution, i. [sent-69, score-0.473]

24 , replacing a given query with another query with a similar meaning, overcoming issues with poor paraphrase coverage in tail queries. [sent-71, score-0.946]

25 Correlations mined by our approach are readily useful for downstream query substitution. [sent-72, score-0.473]

26 Bergsma and Wang (2007) develop a supervised approach to query chunking using 500 hand-segmented queries from the AOL corpus. [sent-73, score-0.771]

27 Tan and Peng (2008) develop a generative model of query segmentation that makes use of a language model and concepts derived from Wikipedia article titles. [sent-74, score-0.473]

28 CLC differs fundamentally in that it learns concept label markup in addition to segmentation and uses in-domain concepts derived from queries themselves. [sent-75, score-0.446]

29 This work also differs from both of these studies significantly in scope, training on 500M queries instead of just 500. [sent-76, score-0.206]

30 Figure 1: Overview of CLC markup generation for the query [brighton vinyl windows] (label PCFG over query constituents, with label clusters such as seaside towns and building materials). [sent-86, score-1.548]

31 3 Latent Class-Label Correlation Input to CLC consists of raw search queries and a partial grammar mapping class labels to query spans (e. [sent-88, score-0.999]

32 We motivate our use of an HDP latent class model instead of a full PCFG with binary productions by the fact that the space of possible binary rule combinations is prohibitively large (561K base labels; 314B binary rules). [sent-95, score-0.296]
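A quick arithmetic check of the quoted figure (our own calculation, not stated in this form in the paper): with roughly 561K base labels, the number of possible binary label combinations is

    (5.61 \times 10^{5})^{2} \approx 3.15 \times 10^{11} \approx 314\text{B},

which is why CLC restricts itself to unary productions plus a latent class model.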

33 1 IsA Label Extraction IsA class labels (hypernyms) V are extracted from a large corpus of raw Web text using the method proposed by Van Durme and Paşca (2008) and extended by Paşca (2010). [sent-102, score-0.228]

34 Manually specified patterns are used to extract a seed set of class labels and the resulting label lists are reranked using cluster purity measures. [sent-103, score-0.447]

35 Table 1 shows an example set of class labels extracted for several common noun phrases. [sent-105, score-0.193]

36 In addition to extracted rules, the CLC grammar is augmented with a set of null rules, one per unigram, ensuring that every query has a valid parse. [sent-108, score-0.5]
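A minimal sketch of this construction (hypothetical Python; build_grammar and the toy lexicon entries are our own illustration, loosely based on Figure 1, not code from the paper):

    from collections import defaultdict

    def build_grammar(isa_lexicon, queries):
        """isa_lexicon maps a span (tuple of tokens) to the class labels observed
        producing it. Returns span -> set of labels, i.e. the unary rules."""
        rules = defaultdict(set)
        for span, labels in isa_lexicon.items():
            rules[span] |= set(labels)
        # One null rule per unigram, so every query has at least one valid parse.
        for query in queries:
            for token in query.split():
                rules[(token,)].add("<NULL>")
        return rules

    grammar = build_grammar(
        {("brighton",): {"seaside towns"}, ("vinyl", "windows"): {"building materials"}},
        ["brighton vinyl windows"],
    )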

37 2 Class-Label PCFG In addition to the observed class-label production rules, CLC incorporates two sets of latent production rules coupled via an HDP (Figure 1). [sent-110, score-0.396]

38 Class label→query span productions extracted from raw text are clustered into a set of label production clusters L = {l1, . [sent-111, score-0.523]

39 …multinomial distribution over class labels V parametrized by φL_lk. [sent-116, score-0.193]

40 Conceptually, φL_lk captures a set of class labels with similar productions that are found in similar queries, for example the class labels states, northeast states, u. [sent-117, score-0.482]

41 Each query q ∈ Q is assigned to a latent query cluster cq ∈ C = {c1, . [sent-120, score-1.094]

42 Query clusters capture broad correlations between label production clusters and are necessary for performing sense disambiguation and capturing selectional preference. [sent-124, score-0.55]

43 Query clusters and label production clusters are linked using a single HDP, allowing the number of label clusters to vary over the course of Gibbs sampling, based on the variance of the underlying data (Section 3. [sent-125, score-1.135]

44 Viewed as a grammar, CLC only contains unary rules mapping labels to query spans; production correlations are captured directly by the query cluster, unlike in HDP-PCFG (Liang et al. [sent-127, score-1.289]

45 The top section of the model is the standard HDP prior; the middle section is the additional machinery necessary for modeling latent groupings and the bottom section contains the indicators for the latent class model. [sent-129, score-0.263]

46 Given a query q, a query cluster assignment cq and a set of label production clusters L, we define a parse of q to be a sequence of productions tq forming a parse tree consuming all the tokens in q. [sent-132, score-1.661]

47 The probability of a query q is the sum of the probabilities of the parse trees that can generate it, P(q|φL,φC,cq) = Σ_{t : y(t)=q} P(t|φL,φC,cq), where {t | y(t) = q} is the set of trees with q as their yield (i. [sent-134, score-0.473]
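Because the grammar is unary, the sum over parse trees reduces to a sum over segmentations of the query, with each span scored by the total probability of the productions that can yield it. A small illustrative dynamic program under that simplifying reading (span_label_prob is a hypothetical callback encapsulating φL, φC and cq; this is our sketch, not an interface from the paper):

    from functools import lru_cache

    def query_probability(tokens, span_label_prob):
        """Sum P(t) over all parse trees t whose yield is the query, where a tree
        is a left-to-right sequence of unary productions (label -> span) and
        span_label_prob(span) is the probability mass of all productions,
        including the null rule, that generate that span."""
        tokens = tuple(tokens)
        n = len(tokens)

        @lru_cache(maxsize=None)
        def inside(i):
            # Probability of generating tokens[i:] as a sequence of spans.
            if i == n:
                return 1.0
            total = 0.0
            for j in range(i + 1, n + 1):
                p_span = span_label_prob(tokens[i:j])
                if p_span > 0.0:
                    total += p_span * inside(j)
            return total

        return inside(0)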

48 A set of base clusters β ∼ GEM(γ) is drawn from a Dirichlet Process with base measure γ using the stick-breaking construction, and clusters for each group k, φC, φL. [sent-140, score-0.488]
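For reference, a truncated stick-breaking draw of β and the per-group distributions it centers (standard HDP machinery; the truncation level, concentration values and the finite Dirichlet stand-in for the group-level draw are our own assumptions, not values from the paper):

    import numpy as np

    def gem_stick_breaking(gamma, truncation, rng):
        """Truncated beta ~ GEM(gamma): weights over the base label clusters."""
        sticks = rng.beta(1.0, gamma, size=truncation)
        sticks[-1] = 1.0  # absorb the remaining mass at the truncation level
        remaining = np.concatenate(([1.0], np.cumprod(1.0 - sticks[:-1])))
        return sticks * remaining

    def group_distributions(beta, alpha_c, num_groups, rng):
        """Each query cluster draws its own distribution over label clusters,
        centered on the shared base weights beta."""
        return rng.dirichlet(alpha_c * beta + 1e-12, size=num_groups)

    rng = np.random.default_rng(0)
    beta = gem_stick_breaking(gamma=1.0, truncation=40, rng=rng)
    phi_C = group_distributions(beta, alpha_c=1.0, num_groups=100, rng=rng)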

49 αC Query cluster smoothing; higher values lead to more uniform mass over label clusters. [sent-142, score-0.254]

50 αL Label cluster smoothing; higher values lead to more label diversity within clusters. [sent-143, score-0.254]

51 Intuitively, β defines a common “menu” of label clusters, and each query cluster defines a separate distribution over the label clusters. [sent-148, score-0.888]

52 A Hierarchical Dirichlet Process with Latent Groups (HDP-LG) can be used to define a set of query clusters over a set of (potentially infinite) base label clusters (Figure 2). [sent-151, score-1.122]

53 Each query cluster (latent group) assigns weight to different subsets of the available label clusters capturing correlations between them at the query level. [sent-152, score-1.542]

54 Each query q maintains a distribution over query clusters πq, capturing its affinity for each latent group. [sent-153, score-1.303]
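Putting the pieces together, the markup of a single query can be generated roughly as follows (an illustrative sketch with hypothetical variable names, not the paper's inference code): draw a query cluster from πq, then for each span draw a label production cluster from that query cluster and a class label from that production cluster.

    import numpy as np

    def sample_markup(spans, pi_q, phi_C, phi_L, labels, rng):
        """spans: the query's chunks; pi_q: the query's distribution over query
        clusters; phi_C[c]: query cluster c's distribution over label clusters;
        phi_L[k]: label cluster k's distribution over the label vocabulary."""
        c = rng.choice(len(pi_q), p=pi_q)                  # latent query cluster
        markup = []
        for span in spans:
            k = rng.choice(len(phi_C[c]), p=phi_C[c])      # label production cluster
            v = rng.choice(len(phi_L[k]), p=phi_L[k])      # class label index
            markup.append((span, labels[v]))
        return c, markup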

55 CLC-BASE no query clusters, one label per label cluster. [sent-156, score-0.795]

56 CLC-HDP-LG full HDP-LG model with |C| query clusters over a potentially infinite number of label clusters. [sent-160, score-1.21]

57 2 Relevant assignments c, z and l are stored locally with each query and are distributed across compute nodes. [sent-171, score-0.507]
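The parallel approximation is only hinted at here; the following is a hypothetical sketch of the kind of stale-count scheme it suggests, in the spirit of Smola and Narayanamurthy (2010): each shard resamples its queries' local (c, z, l) assignments against a stale copy of the global counts, and only aggregated count deltas are synchronized between sweeps. The function names and data layout are assumptions, not the authors' architecture.

    from collections import Counter

    def gibbs_shard_pass(shard, stale_counts, resample, contribution):
        """One local Gibbs sweep over a shard of queries.

        resample(query, counts) draws new (c, z, l) assignments for the query
        against the (possibly stale) global counts; contribution(assignments)
        returns a Counter of the sufficient statistics those assignments add
        to the global counts. The returned delta is summed across shards and
        applied to the master counts before the next sweep."""
        delta = Counter()
        for query in shard:
            delta.subtract(contribution(query.assignments))
            query.assignments = resample(query, stale_counts)
            delta.update(contribution(query.assignments))
        return delta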

58 1 Query Corpus Our dataset consists of a sample of 450M English queries submitted by anonymous Web users to … (Footnote 2: This approximation and architecture is similar to Smola and Narayanamurthy (2010).) [sent-177, score-0.206]

59 Figure 3: Distribution in the query corpus, broken down by query length (red/solid = all queries; blue/dashed = queries with ambiguous spans); most queries contain between 2-6 tokens. [sent-180, score-0.554]

60 Single token queries are removed as the model is incapable of using context to disambiguate their meaning. [sent-185, score-0.206]

61 During training, we include 10 copies of each query (4. [sent-187, score-0.473]

62 5B queries total), allowing an estimate of the Bayes average posterior from a single Gibbs sample. [sent-188, score-0.206]

63 2) by human raters across two different samples: (1) an unbiased sample from the original corpus, and (2) a biased sample of queries containing ambiguous spans. [sent-192, score-0.259]

64 Two raters scored a total of 10K labels from 800 spans across 300 queries. [sent-193, score-0.204]

65 Chunking precision is measured as the percentage of labels not marked as badspan. [sent-199, score-0.195]

66 We report two sets of precision scores depending on how null labels are handled: Strict evaluation treats null-labeled spans as incorrect, while Normal evaluation removes null-labeled spans from the precision calculation. [sent-200, score-0.489]
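Concretely, the two scores differ only in whether null-labeled spans stay in the denominator. A small illustrative computation (the record format with 'ok', 'badspan' and 'null' flags is our assumption about how the rater judgments might be stored, not the paper's):

    def chunking_precision(judgments):
        """Percentage of judged labels whose span was not marked badspan."""
        return sum(1 for j in judgments if not j["badspan"]) / len(judgments)

    def label_precision(judgments, strict=False):
        """Strict: null-labeled spans count as incorrect.
        Normal: null-labeled spans are dropped from the calculation."""
        scored = judgments if strict else [j for j in judgments if not j["null"]]
        if not scored:
            return 0.0
        correct = sum(1 for j in scored
                      if j["ok"] and not j["badspan"] and not j["null"])
        return correct / len(scored)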

67 MAP estimates are calculated as the single most likely label/cluster assignment across all query copies; all assignments in the sample are averaged … [sent-205, score-0.555]

68 Figure 4: Convergence rates of CLC-BASE (red/solid), CLC-HDP-LG 100C,40L (green/dashed), CLC-HDP-LG 1000C,40L (blue/dotted) in terms of % of query cluster swaps, label cluster swaps and null rule assignments, over Gibbs iterations. [sent-208, score-0.885]

69 Therefore, we will make use of the fact that models with αC = 1 yielded roughly 40 label clusters on average, and models with αC = 0. [sent-212, score-0.391]

70 1 yielded roughly 200 label clusters, naming model variants simply by the number of query and label clusters: (1) CLC-BASE, (2) CLC-DPMM 1C-40L, (3) CLC-HDP-LG 100C-40L, (4) CLC-HDP-LG 1000C-40L, and (5) CLC-HDP-LG 1000C-200L. [sent-213, score-0.795]

71 1 Chunking Precision Chunking precision scores for each model are shown in Table 3 (average % of labels not marked badspan). [sent-217, score-0.195]

72 CLC-BASE performed the worst by a significant margin (∼78%), indicating that label coarse-graining is more important than query clustering for chunking accuracy. [sent-219, score-0.726]

73 No significant differences in label chunking accuracy were found between Bayes and MAP inference. [sent-220, score-0.253]

74 2 Predicting Span Labels The full CLC-HDP-LG model variants obtain higher label precision than the simpler models, with CLC-HDP-LG 1000C-40L achieving the highest precision of the three (∼63% accuracy). [sent-222, score-0.367]

75 However, comparing to CLC-DPMM 1C-40L and CLC-BASE demonstrates that the addition of label clusters and query clusters both lead to gains in label precision. [sent-224, score-0.601]

76 The breakdown over MAP and Bayes posterior estimation is less clear when considering label precision: the simpler models CLC-BASE and CLC-DPMM 1C-40L perform significantly worse than Bayes when using MAP estimation, while in CLC-HDP-LG the reverse holds. [sent-226, score-0.201]

77 There is little evidence for correlation between precision and query length (weak, not statistically significant negative correlation using Spearman’s ρ). [sent-227, score-0.67]

78 This result is interesting as the relative prevalence of natural language queries increases with query length, potentially degrading performance. [sent-228, score-0.713]

79 However, we did find a strong positive correlation between precision and the number of label productions applicable to a query, i. [sent-229, score-0.348]

80 In general, the more precise models tend to have a significantly lower proportion of missing spans. (Table 3 column headers: Model; Chunking Precision; Label Precision (normal, strict); Ambiguous Label Precision (hist, normal, strict); Spearman’s ρ.) [sent-233, score-0.274]

81 Table 3: Chunking and label precision across five models. [sent-541, score-0.244]

82 Spearman’s ρ columns give label precision correlations with query length (weak negative correlation) and the number of applicable labels (weak to strong positive correlation); dots indicate significance. [sent-544, score-0.905]

83 3 High Polysemy Subset We repeat the analysis of label precision on a subset of queries containing one of the manually-selected polysemous spans shown in Table 4. [sent-547, score-0.542]

84 The CLC-HDP-LG-based models still significantly outperform the simpler models, but unlike in the broader setting, CLC-HDP-LG 100C-40L significantly outperforms CLC-HDP-LG 1000C-40L, indicating that lower query cluster granularity helps address polysemy (Table 3). [sent-548, score-0.606]

85 4 Error Analysis Figure 5 gives examples of both high-precision and low-precision query markups inferred by CLC-HDP-LG. [sent-550, score-0.239]

86 A large number of mistakes made by CLC are … (Figure 5 caption: original query; lines indicate potential spans; small text shows potential labels colored and numbered by label cluster; small bar shows percentage of assignments to that label cluster.) [sent-554, score-0.497]

87 Other examples of common errors include interpreting weymouth in [weymouth train time table] as a town in Massachusetts instead of a town in the UK (lack of domain knowledge), and using lower quality semantic labels (e. [sent-560, score-0.244]

88 6 Discussion and Future Work Adding both latent label clusters (DPMM) and latent query clusters (extending to HDP-LG) improves chunking and label precision over the baseline CLC-BASE system. [sent-563, score-1.612]

89 The label clusters are important because they capture intra-group correlations between class labels, while the query clusters are important for capturing inter-group correlations. [sent-564, score-1.287]

90 However, the algorithm is sensitive to the relative number of clusters in each case: Too many labels/label clusters relative to the number of query clusters make it difficult to learn correlations (O(n2) query clusters are required to capture pairwise interactions). [sent-565, score-1.942]

91 Too many query clusters, on the other hand, make the model intractable computationally. [sent-566, score-0.473]

92 (Future Work) Many query slots have weak semantics and hence are misleading for CLC. [sent-568, score-0.542]

93 For example [pacific breeze cleaners] or [dale hartley subaru] should be parsed such that the type of the leading slot is determined not by its direct content, but by its context; seeing subaru or cleaners after a noun-phrase slot is a strong indicator of its type (dealership or shop name). [sent-569, score-0.213]

94 The current CLC model only couples these slots through their correlations in query clusters, not directly through relative position or context. [sent-570, score-0.576]

95 Finally, we did not measure label coverage with respect to a human evaluation set; coverage is useful as it indicates whether our inferred semantics are biased with respect to human norms. [sent-572, score-0.194]

96 CLC captures semantic information in the form of interactions between clusters of automatically extracted class-labels, e. [sent-574, score-0.301]

97 CLC was able to chunk queries into spans more accurately and infer more precise labels than several sub-models even across a highly ambiguous query subset. [sent-578, score-0.936]

98 The role of queries in ranking labeled instances extracted from text. [sent-670, score-0.206]

99 Unsupervised query segmentation using generative language models and Wikipedia. [sent-723, score-0.473]

100 Semi-supervised learning of semantic classes for query understanding: from the Web and for the Web. [sent-754, score-0.513]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('clc', 0.49), ('query', 0.473), ('clusters', 0.23), ('queries', 0.206), ('label', 0.161), ('hdp', 0.156), ('bayes', 0.129), ('production', 0.123), ('labels', 0.112), ('isa', 0.099), ('productions', 0.096), ('pas', 0.096), ('cluster', 0.093), ('chunking', 0.092), ('spans', 0.092), ('latent', 0.091), ('intent', 0.09), ('pcfg', 0.086), ('precision', 0.083), ('class', 0.081), ('markup', 0.079), ('vinyl', 0.077), ('correlations', 0.076), ('mapreduce', 0.073), ('map', 0.07), ('cleaners', 0.068), ('tq', 0.068), ('gibbs', 0.067), ('johnson', 0.058), ('ccq', 0.057), ('cq', 0.057), ('llk', 0.057), ('correlation', 0.057), ('dell', 0.056), ('ambiguous', 0.053), ('durme', 0.051), ('breeze', 0.051), ('dirichlet', 0.049), ('assignment', 0.048), ('windows', 0.048), ('lr', 0.047), ('clothing', 0.047), ('pcfgs', 0.046), ('weak', 0.042), ('bayesian', 0.041), ('adaptor', 0.04), ('semantic', 0.04), ('simpler', 0.04), ('span', 0.039), ('badspan', 0.038), ('beeferman', 0.038), ('clcbase', 0.038), ('dpmm', 0.038), ('hist', 0.038), ('kc', 0.038), ('llr', 0.038), ('runner', 0.038), ('subaru', 0.038), ('swaps', 0.038), ('weymouth', 0.038), ('wonderland', 0.038), ('brighton', 0.038), ('talukdar', 0.038), ('normal', 0.038), ('teh', 0.037), ('materials', 0.036), ('capturing', 0.036), ('raw', 0.035), ('spearman', 0.035), ('strict', 0.034), ('potentially', 0.034), ('assignments', 0.034), ('goods', 0.034), ('jaguar', 0.034), ('tropical', 0.034), ('infinite', 0.033), ('inferred', 0.033), ('jones', 0.032), ('bergsma', 0.032), ('rules', 0.032), ('selectional', 0.031), ('taxonomy', 0.031), ('szpektor', 0.031), ('tratz', 0.031), ('interactions', 0.031), ('car', 0.03), ('alice', 0.029), ('bar', 0.029), ('variable', 0.029), ('tan', 0.028), ('base', 0.028), ('ca', 0.028), ('slot', 0.028), ('understanding', 0.027), ('coupled', 0.027), ('null', 0.027), ('states', 0.027), ('dean', 0.027), ('town', 0.027), ('slots', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 137 acl-2011-Fine-Grained Class Label Markup of Search Queries

Author: Joseph Reisinger ; Marius Pasca

Abstract: We develop a novel approach to the semantic analysis of short text segments and demonstrate its utility on a large corpus of Web search queries. Extracting meaning from short text segments is difficult as there is little semantic redundancy between terms; hence methods based on shallow semantic analysis may fail to accurately estimate meaning. Furthermore search queries lack explicit syntax often used to determine intent in question answering. In this paper we propose a hybrid model of semantic analysis combining explicit class-label extraction with a latent class PCFG. This class-label correlation (CLC) model admits a robust parallel approximation, allowing it to scale to large amounts of query data. We demonstrate its performance in terms of (1) its predicted label accuracy on polysemous queries and (2) its ability to accurately chunk queries into base constituents.

2 0.35715729 258 acl-2011-Ranking Class Labels Using Query Sessions

Author: Marius Pasca

Abstract: The role of search queries, as available within query sessions or in isolation from one another, is examined in the context of ranking the class labels (e.g., brazilian cities, business centers, hilly sites) extracted from Web documents for various instances (e.g., rio de janeiro). The co-occurrence of a class label and an instance, in the same query or within the same query session, is used to reinforce the estimated relevance of the class label for the instance. Experiments over evaluation sets of instances associated with Web search queries illustrate the higher quality of the query-based, re-ranked class labels, relative to ranking baselines using document-based counts.

3 0.34948391 182 acl-2011-Joint Annotation of Search Queries

Author: Michael Bendersky ; W. Bruce Croft ; David A. Smith

Abstract: Marking up search queries with linguistic annotations such as part-of-speech tags, capitalization, and segmentation, is an important part of query processing and understanding in information retrieval systems. Due to their brevity and idiosyncratic structure, search queries pose a challenge to existing NLP tools. To address this challenge, we propose a probabilistic approach for performing joint query annotation. First, we derive a robust set of unsupervised independent annotations, using queries and pseudo-relevance feedback. Then, we stack additional classifiers on the independent annotations, and exploit the dependencies between them to further improve the accuracy, even with a very limited amount of available training data. We evaluate our method using a range of queries extracted from a web search log. Experimental results verify the effectiveness of our approach for both short keyword queries, and verbose natural language queries.

4 0.31463423 256 acl-2011-Query Weighting for Ranking Model Adaptation

Author: Peng Cai ; Wei Gao ; Aoying Zhou ; Kam-Fai Wong

Abstract: We propose to directly measure the importance of queries in the source domain to the target domain where no rank labels of documents are available, which is referred to as query weighting. Query weighting is a key step in ranking model adaptation. As the learning object of ranking algorithms is divided by query instances, we argue that it’s more reasonable to conduct importance weighting at query level than document level. We present two query weighting schemes. The first compresses the query into a query feature vector, which aggregates all document instances in the same query, and then conducts query weighting based on the query feature vector. This method can efficiently estimate query importance by compressing query data, but the potential risk is information loss resulted from the compression. The second measures the similarity between the source query and each target query, and then combines these fine-grained similarity values for its importance estimation. Adaptation experiments on LETOR3.0 data set demonstrate that query weighting significantly outperforms document instance weighting methods.

5 0.26394671 271 acl-2011-Search in the Lost Sense of "Query": Question Formulation in Web Search Queries and its Temporal Changes

Author: Bo Pang ; Ravi Kumar

Abstract: Web search is an information-seeking activity. Often times, this amounts to a user seeking answers to a question. However, queries, which encode the user’s information need, are typically not expressed as full-length natural language sentences, in particular as questions. Rather, they consist of one or more text fragments. As humans become more search-engine-savvy, do natural-language questions still have a role to play in web search? Through a systematic, large-scale study, we find to our surprise that as time goes by, web users are more likely to use questions to express their search intent.

6 0.23683076 181 acl-2011-Jigs and Lures: Associating Web Queries with Structured Entities

7 0.21696123 20 acl-2011-A New Dataset and Method for Automatically Grading ESOL Texts

8 0.13706918 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering

9 0.12715833 191 acl-2011-Knowledge Base Population: Successful Approaches and Challenges

10 0.11987621 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing

11 0.11893678 13 acl-2011-A Graph Approach to Spelling Correction in Domain-Centric Search

12 0.11139664 89 acl-2011-Creative Language Retrieval: A Robust Hybrid of Information Retrieval and Linguistic Creativity

13 0.10859647 255 acl-2011-Query Snowball: A Co-occurrence-based Approach to Multi-document Summarization for Question Answering

14 0.1050447 246 acl-2011-Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition

15 0.10439503 333 acl-2011-Web-Scale Features for Full-Scale Parsing

16 0.10352671 98 acl-2011-Discovery of Topically Coherent Sentences for Extractive Summarization

17 0.10198296 277 acl-2011-Semi-supervised Relation Extraction with Large-scale Word Clustering

18 0.099051788 18 acl-2011-A Latent Topic Extracting Method based on Events in a Document and its Application

19 0.098166913 29 acl-2011-A Word-Class Approach to Labeling PSCFG Rules for Machine Translation

20 0.094356887 135 acl-2011-Faster and Smaller N-Gram Language Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.237), (1, 0.079), (2, -0.153), (3, 0.07), (4, -0.136), (5, -0.319), (6, -0.125), (7, -0.237), (8, 0.141), (9, -0.05), (10, 0.192), (11, -0.035), (12, -0.001), (13, 0.073), (14, -0.019), (15, -0.066), (16, -0.103), (17, 0.03), (18, -0.033), (19, 0.048), (20, -0.076), (21, 0.056), (22, -0.055), (23, -0.062), (24, 0.016), (25, 0.13), (26, 0.04), (27, -0.046), (28, 0.02), (29, -0.02), (30, -0.055), (31, -0.023), (32, -0.083), (33, 0.002), (34, -0.052), (35, 0.005), (36, 0.071), (37, 0.027), (38, -0.047), (39, 0.01), (40, 0.049), (41, 0.026), (42, 0.029), (43, -0.021), (44, -0.017), (45, 0.019), (46, 0.043), (47, 0.01), (48, 0.024), (49, -0.039)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98045772 137 acl-2011-Fine-Grained Class Label Markup of Search Queries

Author: Joseph Reisinger ; Marius Pasca

Abstract: We develop a novel approach to the semantic analysis of short text segments and demonstrate its utility on a large corpus of Web search queries. Extracting meaning from short text segments is difficult as there is little semantic redundancy between terms; hence methods based on shallow semantic analysis may fail to accurately estimate meaning. Furthermore search queries lack explicit syntax often used to determine intent in question answering. In this paper we propose a hybrid model of semantic analysis combining explicit class-label extraction with a latent class PCFG. This class-label correlation (CLC) model admits a robust parallel approximation, allowing it to scale to large amounts of query data. We demonstrate its performance in terms of (1) its predicted label accuracy on polysemous queries and (2) its ability to accurately chunk queries into base constituents.

2 0.939529 258 acl-2011-Ranking Class Labels Using Query Sessions

Author: Marius Pasca

Abstract: The role of search queries, as available within query sessions or in isolation from one another, is examined in the context of ranking the class labels (e.g., brazilian cities, business centers, hilly sites) extracted from Web documents for various instances (e.g., rio de janeiro). The co-occurrence of a class label and an instance, in the same query or within the same query session, is used to reinforce the estimated relevance of the class label for the instance. Experiments over evaluation sets of instances associated with Web search queries illustrate the higher quality of the query-based, re-ranked class labels, relative to ranking baselines using document-based counts.

3 0.87979454 182 acl-2011-Joint Annotation of Search Queries

Author: Michael Bendersky ; W. Bruce Croft ; David A. Smith

Abstract: Marking up search queries with linguistic annotations such as part-of-speech tags, capitalization, and segmentation, is an important part of query processing and understanding in information retrieval systems. Due to their brevity and idiosyncratic structure, search queries pose a challenge to existing NLP tools. To address this challenge, we propose a probabilistic approach for performing joint query annotation. First, we derive a robust set of unsupervised independent annotations, using queries and pseudo-relevance feedback. Then, we stack additional classifiers on the independent annotations, and exploit the dependencies between them to further improve the accuracy, even with a very limited amount of available training data. We evaluate our method using a range of queries extracted from a web search log. Experimental results verify the effectiveness of our approach for both short keyword queries, and verbose natural language queries.

4 0.81807202 256 acl-2011-Query Weighting for Ranking Model Adaptation

Author: Peng Cai ; Wei Gao ; Aoying Zhou ; Kam-Fai Wong

Abstract: We propose to directly measure the importance of queries in the source domain to the target domain where no rank labels of documents are available, which is referred to as query weighting. Query weighting is a key step in ranking model adaptation. As the learning object of ranking algorithms is divided by query instances, we argue that it’s more reasonable to conduct importance weighting at query level than document level. We present two query weighting schemes. The first compresses the query into a query feature vector, which aggregates all document instances in the same query, and then conducts query weighting based on the query feature vector. This method can efficiently estimate query importance by compressing query data, but the potential risk is information loss resulted from the compression. The second measures the similarity between the source query and each target query, and then combines these fine-grained similarity values for its importance estimation. Adaptation experiments on LETOR3.0 data set demonstrate that query weighting significantly outperforms document instance weighting methods.

5 0.81745362 181 acl-2011-Jigs and Lures: Associating Web Queries with Structured Entities

Author: Patrick Pantel ; Ariel Fuxman

Abstract: We propose methods for estimating the probability that an entity from an entity database is associated with a web search query. Association is modeled using a query entity click graph, blending general query click logs with vertical query click logs. Smoothing techniques are proposed to address the inherent data sparsity in such graphs, including interpolation using a query synonymy model. A large-scale empirical analysis of the smoothing techniques, over a 2-year click graph collected from a commercial search engine, shows significant reductions in modeling error. The association models are then applied to the task of recommending products to web queries, by annotating queries with products from a large catalog and then mining query- product associations through web search session analysis. Experimental analysis shows that our smoothing techniques improve coverage while keeping precision stable, and overall, that our top-performing model affects 9% of general web queries with 94% precision.

6 0.81590277 271 acl-2011-Search in the Lost Sense of "Query": Question Formulation in Web Search Queries and its Temporal Changes

7 0.68912572 13 acl-2011-A Graph Approach to Spelling Correction in Domain-Centric Search

8 0.63451636 89 acl-2011-Creative Language Retrieval: A Robust Hybrid of Information Retrieval and Linguistic Creativity

9 0.5733164 246 acl-2011-Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition

10 0.55909574 36 acl-2011-An Efficient Indexer for Large N-Gram Corpora

11 0.48175475 255 acl-2011-Query Snowball: A Co-occurrence-based Approach to Multi-document Summarization for Question Answering

12 0.45432937 135 acl-2011-Faster and Smaller N-Gram Language Models

13 0.43413046 11 acl-2011-A Fast and Accurate Method for Approximate String Search

14 0.42307416 191 acl-2011-Knowledge Base Population: Successful Approaches and Challenges

15 0.41250339 231 acl-2011-Nonlinear Evidence Fusion and Propagation for Hyponymy Relation Mining

16 0.3886382 333 acl-2011-Web-Scale Features for Full-Scale Parsing

17 0.38344648 19 acl-2011-A Mobile Touchable Application for Online Topic Graph Extraction and Exploration of Web Content

18 0.37010315 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering

19 0.3666791 26 acl-2011-A Speech-based Just-in-Time Retrieval System using Semantic Search

20 0.36299688 248 acl-2011-Predicting Clicks in a Vocabulary Learning System


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.025), (13, 0.049), (17, 0.053), (26, 0.069), (31, 0.01), (37, 0.092), (39, 0.073), (41, 0.065), (55, 0.034), (59, 0.061), (72, 0.027), (90, 0.021), (91, 0.053), (92, 0.075), (96, 0.162), (97, 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.93799704 36 acl-2011-An Efficient Indexer for Large N-Gram Corpora

Author: Hakan Ceylan ; Rada Mihalcea

Abstract: We introduce a new publicly available tool that implements efficient indexing and retrieval of large N-gram datasets, such as the Web1T 5-gram corpus. Our tool indexes the entire Web1T dataset with an index size of only 100 MB and performs a retrieval of any N-gram with a single disk access. With an increased index size of 420 MB and duplicate data, it also allows users to issue wild card queries provided that the wild cards in the query are contiguous. Furthermore, we also implement some of the smoothing algorithms that are designed specifically for large datasets and are shown to yield better language models than the traditional ones on the Web1T 5gram corpus (Yuret, 2008). We demonstrate the effectiveness of our tool and the smoothing algorithms on the English Lexical Substi- tution task by a simple implementation that gives considerable improvement over a basic language model.

same-paper 2 0.93459159 137 acl-2011-Fine-Grained Class Label Markup of Search Queries

Author: Joseph Reisinger ; Marius Pasca

Abstract: We develop a novel approach to the semantic analysis of short text segments and demonstrate its utility on a large corpus of Web search queries. Extracting meaning from short text segments is difficult as there is little semantic redundancy between terms; hence methods based on shallow semantic analysis may fail to accurately estimate meaning. Furthermore search queries lack explicit syntax often used to determine intent in question answering. In this paper we propose a hybrid model of semantic analysis combining explicit class-label extraction with a latent class PCFG. This class-label correlation (CLC) model admits a robust parallel approximation, allowing it to scale to large amounts of query data. We demonstrate its performance in terms of (1) its predicted label accuracy on polysemous queries and (2) its ability to accurately chunk queries into base constituents.

3 0.92678541 208 acl-2011-Lexical Normalisation of Short Text Messages: Makn Sens a #twitter

Author: Bo Han ; Timothy Baldwin

Abstract: Twitter provides access to large volumes of data in real time, but is notoriously noisy, hampering its utility for NLP. In this paper, we target out-of-vocabulary words in short text messages and propose a method for identifying and normalising ill-formed words. Our method uses a classifier to detect ill-formed words, and generates correction candidates based on morphophonemic similarity. Both word similarity and context are then exploited to select the most probable correction candidate for the word. The proposed method doesn’t require any annotations, and achieves state-of-the-art performance over an SMS corpus and a novel dataset based on Twitter.

4 0.90470099 33 acl-2011-An Affect-Enriched Dialogue Act Classification Model for Task-Oriented Dialogue

Author: Kristy Boyer ; Joseph Grafsgaard ; Eun Young Ha ; Robert Phillips ; James Lester

Abstract: Dialogue act classification is a central challenge for dialogue systems. Although the importance of emotion in human dialogue is widely recognized, most dialogue act classification models make limited or no use of affective channels in dialogue act classification. This paper presents a novel affect-enriched dialogue act classifier for task-oriented dialogue that models facial expressions of users, in particular, facial expressions related to confusion. The findings indicate that the affect-enriched classifiers perform significantly better for distinguishing user requests for feedback and grounding dialogue acts within textual dialogue. The results point to ways in which dialogue systems can effectively leverage affective channels to improve dialogue act classification.

5 0.90264463 123 acl-2011-Exact Decoding of Syntactic Translation Models through Lagrangian Relaxation

Author: Alexander M. Rush ; Michael Collins

Abstract: We describe an exact decoding algorithm for syntax-based statistical translation. The approach uses Lagrangian relaxation to decompose the decoding problem into tractable subproblems, thereby avoiding exhaustive dynamic programming. The method recovers exact solutions, with certificates of optimality, on over 97% of test examples; it has comparable speed to state-of-the-art decoders.

6 0.90173674 258 acl-2011-Ranking Class Labels Using Query Sessions

7 0.89535189 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering

8 0.89463234 42 acl-2011-An Interface for Rapid Natural Language Processing Development in UIMA

9 0.89403671 11 acl-2011-A Fast and Accurate Method for Approximate String Search

10 0.89245886 269 acl-2011-Scaling up Automatic Cross-Lingual Semantic Role Annotation

11 0.88880038 182 acl-2011-Joint Annotation of Search Queries

12 0.8869698 178 acl-2011-Interactive Topic Modeling

13 0.88547742 170 acl-2011-In-domain Relation Discovery with Meta-constraints via Posterior Regularization

14 0.88509154 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation

15 0.88454586 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment

16 0.88343036 193 acl-2011-Language-independent compound splitting with morphological operations

17 0.88320655 222 acl-2011-Model-Portability Experiments for Textual Temporal Analysis

18 0.882815 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing

19 0.88037944 331 acl-2011-Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation

20 0.88031918 300 acl-2011-The Surprising Variance in Shortest-Derivation Parsing