emnlp emnlp2012 emnlp2012-78 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jianfeng Gao ; Shasha Xie ; Xiaodong He ; Alnur Ali
Abstract: This paper explores log-based query expansion (QE) models for Web search. Three lexicon models are proposed to bridge the lexical gap between Web documents and user queries. These models are trained on pairs of user queries and titles of clicked documents. Evaluations on a real world data set show that the lexicon models, integrated into a ranker-based QE system, not only significantly improve the document retrieval performance but also outperform two state-of-the-art log-based QE methods.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract This paper explores log-based query expansion (QE) models for Web search. [sent-3, score-0.655]
2 These models are trained on pairs of user queries and titles of clicked documents. [sent-5, score-0.3]
3 Query expansion (QE) is an effective strategy to address the problem. [sent-8, score-0.35]
4 It expands a query issued by a user with additional related terms, called expansion terms, so that more relevant documents can be retrieved. [sent-9, score-0.775]
5 We select expansion terms for a query according to how likely it is that the expansion terms occur in the title of a document that is relevant to the query. [sent-11, score-1.227]
6 Assuming that a query is parallel to the titles of documents clicked for that query (Gao et al. [sent-12, score-0.783]
7 The third is a bilingual topic model, which represents a query as a distribution of hidden topics and learns the translation between a query and a title term at the semantic level. [sent-18, score-0.897]
8 (2002; 2003) is to our knowledge the first to explore querydocument relations for direct extraction of expansion terms for Web search. [sent-25, score-0.416]
9 First, unlike traditional QE methods that are based on relevance feedback, log-based QE derives expansion terms from search logs, allowing term correlations to be pre-computed offline. [sent-31, score-0.549]
10 2001) or derived automatically from document collections (Jing and Croft 1994), the log-based method is superior in that it explicitly captures the correlation between query terms and document terms, and thus can bridge the lexical gap between them more effectively. [sent-33, score-0.528]
11 Second, since search logs retain query-document pairs clicked by millions of users, the term correlations reflect the preferences of the majority of users. [sent-34, score-0.297]
12 The SMT-based system can produce cleaner, more relevant expansion terms because rich context information useful for filtering noisy expansions is captured by combining language model and phrase translation model in its decoder. [sent-42, score-0.542]
13 We will show that the proposed lexicon models significantly outperform the term correlation model, and that a simpler QE system that incorporates the lexicon models can beat the sophisticated, black-box SMT system. [sent-47, score-0.263]
14 2 Lexicon Models We view search queries and Web documents as two different languages, and cast QE as a means to bridge the language gap by translating queries to documents, represented by their titles. [sent-48, score-0.39]
15 Let Q be a query and e be an expansion term candidate; the translation probability from Q to e is defined as P(e|Q) = Σ_{w∈Q} P(e|w) P(w|Q) (1). [sent-53, score-0.513]
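A minimal sketch of the word translation model in Equation (1): P(e|Q) is a mixture of word translation probabilities P(e|w) weighted by the unsmoothed unigram probability P(w|Q). The translation table and terms here are invented toy values, not from the paper.

```python
from collections import Counter

def p_w_given_q(query_terms):
    """Unsmoothed unigram probability of each word in the query."""
    counts = Counter(query_terms)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def p_e_given_q(e, query_terms, trans_prob):
    """Eq. (1): P(e|Q) = sum_w P(e|w) * P(w|Q).
    trans_prob[(e, w)] is a word translation table (toy data here)."""
    pwq = p_w_given_q(query_terms)
    return sum(trans_prob.get((e, w), 0.0) * p for w, p in pwq.items())

# toy translation table
trans = {("automobile", "car"): 0.4, ("vehicle", "car"): 0.2}
score = p_e_given_q("automobile", ["cheap", "car"], trans)
# 0.4 * 0.5 = 0.2
```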
16 where P(w|Q) is the unsmoothed unigram probability of word w in query Q. The word translation probabilities P(e|w) are estimated on the query-title pairs derived from the clickthrough data by assuming that the title terms are likely to be the desired expansions of the paired query. [sent-54, score-0.68]
17 Formally, we optimize the model parameters by maximizing the probability of generating document titles from queries over the entire training corpus: ∏_{(Q,T)} P(T|Q) (2), where both the titles and the paired queries are viewed as bags of words. [sent-57, score-0.403]
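The training objective in Equation (2) is typically optimized with EM, in the style of IBM Model 1 over bag-of-words query-title pairs. The sketch below is a hedged, minimal EM loop on an invented two-pair toy corpus; it is not the paper's implementation.

```python
from collections import defaultdict

def train_em(pairs, iterations=10):
    """Estimate P(t|q) from (query, title) bag-of-words pairs via EM."""
    t_vocab = {t for _, title in pairs for t in title}
    # uniform initialization over title-side vocabulary
    prob = defaultdict(lambda: 1.0 / max(len(t_vocab), 1))
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for query, title in pairs:
            for t in title:
                # E-step: distribute each title word over query words
                z = sum(prob[(t, q)] for q in query)
                for q in query:
                    c = prob[(t, q)] / z
                    count[(t, q)] += c
                    total[q] += c
        # M-step: renormalize the fractional counts
        prob = defaultdict(float, {(t, q): count[(t, q)] / total[q]
                                   for (t, q) in count})
    return prob

pairs = [(["cheap", "car"], ["used", "automobile"]),
         (["car", "parts"], ["automobile", "parts"])]
p = train_em(pairs)
# "automobile" should align most strongly with "car"
```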
18 2008), is intended to capture inter-term dependencies for selecting expansion terms. [sent-63, score-0.368]
19 The model is based on lexicalized triplets (e, q, q'), which can be understood as two query terms triggering one expansion term. [sent-64, score-0.774]
20 The translation probability of e given Q for the triplet model is parameterized as P(e|Q) = (1/Z) Σ_q Σ_{q'} P(e|q, q') (4), where Z is a normalization factor based on the corresponding query length, i. [sent-65, score-0.563]
21 e., on |Q|, and P(e|q, q') is the probability of translating q into e given another query word q'. [sent-67, score-0.336]
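A hedged sketch of the triplet scoring rule in Equation (4): the probability of an expansion term is the average of lexicalized triplet probabilities over all ordered pairs of query words. The normalizer here is taken as the number of pairs, and the triplet table is an invented toy example.

```python
def triplet_score(e, query_terms, triplet_prob):
    """Eq. (4) sketch: P(e|Q) = (1/Z) * sum_i sum_j P(e | q_i, q_j),
    with Z set to the number of ordered query-word pairs (an assumption)."""
    n = len(query_terms)
    z = n * n
    total = 0.0
    for qi in query_terms:
        for qj in query_terms:
            total += triplet_prob.get((e, qi, qj), 0.0)
    return total / z

# toy triplet table: two query terms jointly trigger an expansion term
triplets = {("engine", "car", "repair"): 0.3}
s = triplet_score("engine", ["car", "repair"], triplets)
# 0.3 / 4 = 0.075
```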
22 The idea underlying the model is that a search query and its relevant Web documents share a common distribution of (hidden) topics, but use different (probably overlapping) vocabularies to express these topics. [sent-78, score-0.437]
23 First, a query is represented as a vector of topics. [sent-80, score-0.305]
24 Then, all the candidate expansion terms, which are selected from the document, are ranked by how likely it is that these document terms are selected to best describe those topics. [sent-81, score-0.454]
25 In a sense, BLTM is similar to the word model and the triplet model since they all map a query to a document word. [sent-82, score-0.553]
26 In our experiments BLTM is found to often select a different set of expansion terms and is complementary to the word model and the triplet model. [sent-84, score-0.572]
27 First, for each topic z, a pair of different word distributions is selected from a Dirichlet prior with concentration parameter β, where one is a topic-specific query word distribution. (Figure 1 caption fragment: ... expansion candidates and the expanded query generated by the ranker-based QE system.) [sent-86, score-0.73]
28 , expansion term candidate) is generated by first selecting a topic z according to the topic distribution and then drawing a word from the topic's title word distribution. By summing over all possible topics, we end up with the following model form: P(e|Q) = Σ_z P(e|z) P(z|Q) (5). The BLTM training follows the method described in Gao et al. [sent-95, score-0.519]
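The scoring rule in Equation (5) marginalizes over hidden topics: an expansion term is scored by how well it is explained by the topics the query expresses. A minimal sketch with invented toy topic distributions (the paper learns these from query-title pairs):

```python
def bltm_score(e, p_z_given_q, p_w_given_z):
    """Eq. (5) sketch: P(e|Q) = sum_z P(e|z) * P(z|Q).
    p_z_given_q: topic mixture inferred for the query (toy values here).
    p_w_given_z: per-topic title-word distributions (toy values here)."""
    return sum(p_z * p_w_given_z[z].get(e, 0.0)
               for z, p_z in p_z_given_q.items())

p_z_q = {"autos": 0.7, "finance": 0.3}
p_w_z = {"autos": {"engine": 0.2}, "finance": {"loan": 0.4}}
s = bltm_score("engine", p_z_q, p_w_z)
# 0.7 * 0.2 = 0.14
```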
29 In training, we also impose the constraint that the paired query and title have similar fractions of tokens assigned to each topic. [sent-98, score-0.363]
30 The system expands an input query in two distinct stages, candidate generation and ranking, as illustrated by an example in Figure 1. [sent-102, score-0.324]
31 In candidate generation, an input query is first tokenized into a sequence of terms. [sent-103, score-0.305]
32 altered words according to their word translation probabilities. Then, we form a list of expansion candidates, each of which contains all the original words in the query except for one word, which is substituted by one of its altered words. [sent-106, score-0.479]
33 So, for a query with m terms, each with up to k altered words, there are at most m × k candidates. [sent-107, score-0.305]
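The candidate-generation step described above can be sketched as follows: each candidate keeps all original query words except one, which is replaced by one of its altered words, giving at most m × k candidates. The alteration table is an invented toy example.

```python
def generate_candidates(query_terms, altered):
    """altered[w] is a list of altered words (substitutes) for w."""
    candidates = []
    for i, w in enumerate(query_terms):
        for alt in altered.get(w, []):
            # keep every original word except the one being substituted
            candidates.append(query_terms[:i] + [alt] + query_terms[i + 1:])
    return candidates

cands = generate_candidates(["cheap", "car"],
                            {"car": ["automobile", "auto"]})
# [["cheap", "automobile"], ["cheap", "auto"]]
```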
34 In the second stage, all the expansion candidates are ranked using a ranker that is based on the Markov Random Field (MRF) model in which the three lexicon models are incorporated as features. [sent-108, score-0.59]
35 Expansion terms of a query are taken from those terms in the n-best expansion candidates of the query (the value of n is set in our experiments) that have not been seen in the original query string. [sent-109, score-1.366]
36 1 MRF-Based Ranker The ranker is based on the MRF model that models the joint distribution over a set of expansion term random variables E and a query random variable Q. It is constructed from a graph G consisting of a query node and one node for each expansion term. [sent-112, score-1.154]
37 P(E, Q) = (1/Z) ∏_{c∈C(G)} ψ(c; Λ) (6), where C(G) is the set of cliques in G, each ψ(c; Λ) is a non-negative potential function defined over a clique configuration c that measures the compatibility of the configuration, Λ is a set of parameters that are used within the potential function, and Z normalizes the distribution. [sent-116, score-0.345]
38 For ranking expansion candidates, we can drop the expensive computation of Z since it is independent of E, and simply rank each expansion candidate by [sent-117, score-0.733]
39 (Figure 2 caption: The structure of the Markov random field for representing the term dependency among the query and the expansion terms.) its unnormalized joint probability with Q under the MRF. [sent-118, score-0.761]
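Since Z is dropped, candidates can be compared by a log-linear combination of clique features, i.e., a weighted sum of log potentials. A hedged sketch with invented feature names and weights (the paper uses 8 feature classes; the scores below are toy values):

```python
def mrf_score(log_features, weights):
    """Unnormalized log score: weighted sum of log potential values."""
    return sum(weights.get(name, 0.0) * value
               for name, value in log_features.items())

def rank_candidates(candidates, weights):
    """Sort expansion candidates by unnormalized MRF score, best first."""
    return sorted(candidates,
                  key=lambda cand: mrf_score(cand["features"], weights),
                  reverse=True)

cands = [{"terms": ["cheap", "auto"], "features": {"word_model": -2.0}},
         {"terms": ["cheap", "automobile"], "features": {"word_model": -1.0}}]
ranked = rank_candidates(cands, {"word_model": 1.0})
# the "automobile" candidate ranks first (less negative log score)
```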
40 In this paper, the graphical model representation we propose for QE is a fully connected graph, shown in Figure 2, where all expansion terms and the original query are assumed to be dependent on each other. [sent-122, score-0.694]
41 The cliques defined in for MRF can be grouped into two categories. [sent-128, score-0.247]
42 The first includes three types of cliques involving both the query node and one or more expansion terms. [sent-129, score-0.881]
43 The potential functions defined over these cliques attempt to abstract the idea behind the query-to-title translation models. [sent-130, score-0.755]
44 The other three types, belonging to the second category, involve only expansion terms. [sent-131, score-0.35]
45 The first type of cliques involves a single expansion term and the query node. [sent-133, score-0.948]
46 The potential functions for these cliques are defined as (6), where the three feature functions are defined as the log probabilities of translating the query into the expansion term according to the word, triplet, and topic models defined in Equations (1), (4), and (5), respectively. [sent-134, score-0.644]
47 The second type of cliques contains the query node and two expansion terms, e_i and e_{i+1}, which appear in consecutive order in the expansion. [sent-135, score-0.881]
48 The potential functions over these cliques are defined as (7), where the feature is defined as the log probability of generating an expansion bigram given the query. Unlike the language models used for document ranking (e. [sent-136, score-0.836]
49 g., Zhai and Lafferty 2001), we cannot compute the bigram probability by simply counting the relative frequency of the bigram in the query, because the query is usually very short and the bigram is unlikely to occur. [sent-138, score-0.359]
50 We thus have an approximation expressed as a product over query terms, in which the translation probability is computed using a variant of the triplet model described in Section 2. [sent-140, score-0.258]
51 The unigram and bigram probabilities are assigned respectively by the unigram and bigram language models, estimated from the collection of document titles of the clickthrough data, and the unigram probability of the query term is estimated from the collection of queries of the clickthrough data. [sent-148, score-0.882]
52 The third type of cliques contains the query node and two expansion terms, e_i and e_j, which occur unordered within the expansion. [sent-149, score-0.902]
53 The potential functions over these cliques are defined as (8), where the feature is defined as the log probability of generating a pair of expansion terms given the query. [sent-150, score-0.75]
54 Unlike the feature defined in Equation (7), this class of features captures long-span term dependency in the expansion candidate. [sent-151, score-0.438]
55 Similar to the computation in Equation (7), we approximate the pair probability as a product of translation probabilities, each computed using the triplet model described in Section 2. [sent-152, score-0.258]
56 The unigram probability is assigned by a unigram language model estimated from the collection of document titles of the clickthrough data. [sent-154, score-0.268]
57 We now turn to the other three types of cliques that do not contain the query node. [sent-156, score-0.531]
58 The fourth type of cliques contains only one expansion term. [sent-157, score-0.576]
59 The fifth type of cliques contains a pair of terms appearing in consecutive order in the expansion. [sent-159, score-0.265]
60 The potential functions are defined as (10), where the feature is the bigram probability computed using a bigram language model trained on the collection of document titles. [sent-160, score-0.23]
61 The sixth type of cliques contains a pair of terms appearing unordered within the expansion. [sent-161, score-0.286]
62 In this study the effectiveness of a QE method is evaluated by first issuing a set of queries which are expanded using the method to a search engine and then measuring the Web search performance. [sent-176, score-0.33]
63 Better QE methods are supposed to lead to better Web search results using the correspondingly expanded query set. [sent-177, score-0.402]
64 4 Experiments We evaluate the performance of a QE method by first issuing a set of queries which are expanded using the method to a search engine and then measuring the Web search performance. [sent-190, score-0.33]
65 Better QE methods are supposed to lead to better Web search results using the correspondingly expanded query set. [sent-191, score-0.402]
66 On average, each query is associated with 197 Web documents (URLs). [sent-199, score-0.353]
67 The label is human generated and is on a 5-level relevance scale, 0 to 4, with 4 meaning document D is the most relevant to query Q and 0 meaning D is not relevant to Q. [sent-201, score-0.463]
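Retrieval quality over these 5-level labels is reported as NDCG@k. A hedged sketch of the metric, using the common 2^rel − 1 gain formulation (the paper cites Jarvelin and Kekalainen but does not spell out the exact gain, so treat this variant as an assumption):

```python
import math

def dcg(rels, k):
    """Discounted cumulative gain over the top-k relevance labels."""
    return sum((2 ** r - 1) / math.log2(i + 2)
               for i, r in enumerate(rels[:k]))

def ndcg(rels, k):
    """DCG normalized by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(rels, reverse=True), k)
    return dcg(rels, k) / ideal if ideal > 0 else 0.0

perfect = ndcg([4, 2, 0], 3)   # already ideally ordered -> 1.0
worse = ndcg([0, 2, 4], 3)     # misordered -> below 1.0
```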
68 To reflect a natural query distribution, we do not try to control the quality of these queries. [sent-206, score-0.305]
69 For example, in our query sets, there are roughly 20% misspelled queries, 20% navigational queries, and 10% transactional queries. [sent-207, score-0.305]
70 Second, for each query, we collect Web documents to be judged by issuing the query to several popular search engines (e. [sent-208, score-0.429]
71 The query-title pairs used for model training are extracted from one year of query log files using a procedure similar to Gao et al. [sent-216, score-0.328]
72 NoQE (Row 1) is the baseline retrieval system that uses the raw input queries and the BM25 document ranking model. [sent-236, score-0.275]
73 It takes the following steps to expand an input query : 1. [sent-241, score-0.305]
74 Find all documents that have clicks on a query that contains one or more of these query terms. [sent-244, score-0.658]
75 For each title term in these documents, calculate its evidence of being selected as an expansion term according to the whole query via a scoring function. 4. [sent-246, score-0.847]
76 Select n title terms with the highest score (where the value of n is optimized on training data) and formulate the expanded query by adding these terms into 5. [sent-247, score-0.496]
77 (13), where D_t is the set of documents clicked for the queries containing the term t and is collected from search logs, and the weight is a normalized tf-idf [sent-251, score-0.361]
78 of the document term in D_t, and the last factor is the relative occurrence of t among all the documents clicked for the queries containing it. Table 1 shows that TC leads to significant improvement over NoQE in NDCG@10, but not in NDCG@1 and NDCG@3 (Row 2 vs. [sent-252, score-0.384]
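The TC baseline's selection step (steps 3-4 above) reduces to scoring each candidate title term against the whole query and keeping the top n. A minimal sketch in which the scoring function is a placeholder standing in for Equation (13):

```python
def expand_query(query_terms, title_terms, score, n):
    """Rank title terms not already in the query by score(term, query)
    and append the top-n to the query. `score` stands in for Eq. (13)."""
    ranked = sorted(set(title_terms) - set(query_terms),
                    key=lambda t: score(t, query_terms), reverse=True)
    return query_terms + ranked[:n]

# toy placeholder score: prefer longer terms (not the paper's Eq. (13))
expanded = expand_query(["car"], ["automobile", "auto", "car"],
                        lambda t, q: len(t), n=1)
# ["car", "automobile"]
```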
79 To apply the system to QE, expansion terms of a query are taken from those terms in the 10-best translations of the query that have not been seen in the original query string. [sent-265, score-1.362]
80 MRF (Row 4) is the ranker-based QE system described in Section 3, which uses an MRF-based ranker to incorporate all 8 classes of features derived from a variety of lexicon translation models and language models as in Equation (12). [sent-271, score-0.293]
81 , the lexicon models and the language models described in Sections 2 and 3, in ranking expansion candidates for QE. [sent-278, score-0.478]
82 (2008) hypothesize that the statistical translation model is superior to the correlation model because the EM training captures the hidden alignment information when mapping document terms to query terms, leading to a better-smoothed probability distribution. [sent-288, score-0.517]
83 Row 2) mainly because in the former the expansion candidates are generated by a word translation model and are less noisy. [sent-292, score-0.448]
84 Row 7), but also seems to subsume the latter in that combining the features derived from both models in the ranker leads to little improvement over the ranker that uses only the triplet model features (Row 10 vs. [sent-299, score-0.437]
85 The bilingual topic model underperforms the word model and the triplet model (Row 9 vs. [sent-301, score-0.251]
86 However, we found that the bilingual topic model often selects a different set of expansion terms and is complementary to the other two lexicon models. [sent-303, score-0.529]
87 As a result, unlike the case of combining the word model and triplet model features, incorporating the bilingual topic model features in the ranker leads to some visible improvement in NDCG at all positions (Row 11 vs. [sent-304, score-0.378]
88 First, as expected, in comparison with the word model, the triplet translation model is more effective in benefiting long queries, e. [sent-307, score-0.258]
89 , notably queries containing questions and queries containing song lyrics. [sent-309, score-0.248]
90 Second, unlike the two lexicon models, the bilingual topic model tends to generate expansions that are more likely to relate to an entire query rather than individual query terms. [sent-310, score-0.788]
91 Third, the 674 features involving the order of the expansion terms benefitted queries containing named entities. [sent-311, score-0.513]
92 (2008) propose a method of selecting expansion terms to directly optimize average precision. [sent-318, score-0.407]
93 The effectiveness of the statistical translation-based approach to Web search has been demonstrated empirically in recent studies where word-based and phrase-based translation models are trained on large amounts of clickthrough data (e. [sent-328, score-0.233]
94 In addition to QE, search logs have also been used for other Web search tasks, such as document ranking (Joachims 2002; Agichtein et al. [sent-333, score-0.263]
95 2006), search query processing and spelling correction (Huang et al. [sent-334, score-0.347]
96 2010b), image retrieval (Craswell and Szummer 2007), and user query clustering (Baeza-Yates and Tiberi 2007; Wen et al. [sent-336, score-0.39]
97 These models are trained on pairs of user queries and the titles of clicked documents using EM. [sent-342, score-0.348]
98 Query expansion using term relationships in language models for information retrieval. [sent-373, score-0.417]
99 A large scale ranker-based system for query spelling correction. [sent-483, score-0.324]
100 Exploring web scale language models for search query processing. [sent-518, score-0.413]
wordName wordTfidf (topN-words)
[('qe', 0.617), ('expansion', 0.35), ('query', 0.305), ('cliques', 0.226), ('triplet', 0.183), ('mrf', 0.16), ('ranker', 0.127), ('queries', 0.124), ('ndcg', 0.12), ('clickthrough', 0.116), ('gao', 0.115), ('row', 0.097), ('riezler', 0.096), ('bltm', 0.093), ('cui', 0.093), ('smt', 0.091), ('croft', 0.085), ('logs', 0.081), ('clicked', 0.08), ('triplets', 0.08), ('sigir', 0.079), ('translation', 0.075), ('lexicon', 0.072), ('nie', 0.068), ('term', 0.067), ('web', 0.066), ('document', 0.065), ('metzler', 0.062), ('title', 0.058), ('expanded', 0.055), ('noqe', 0.053), ('relevance', 0.051), ('user', 0.051), ('tc', 0.048), ('documents', 0.048), ('wen', 0.046), ('titles', 0.045), ('search', 0.042), ('topic', 0.042), ('feedback', 0.041), ('cao', 0.04), ('mrfbased', 0.04), ('terms', 0.039), ('lafferty', 0.039), ('functions', 0.038), ('expansions', 0.038), ('zhai', 0.037), ('clique', 0.034), ('issuing', 0.034), ('retrieval', 0.034), ('engine', 0.033), ('ranking', 0.033), ('correlation', 0.033), ('potential', 0.032), ('commercial', 0.031), ('translating', 0.031), ('rows', 0.029), ('berger', 0.027), ('altered', 0.027), ('craswell', 0.027), ('lavrenko', 0.027), ('querydocument', 0.027), ('querytitle', 0.027), ('rocchio', 0.027), ('szummer', 0.027), ('tiberi', 0.027), ('bigram', 0.027), ('bilingual', 0.026), ('cikm', 0.026), ('trec', 0.025), ('ir', 0.025), ('judgment', 0.024), ('cro', 0.024), ('candidates', 0.023), ('bendersky', 0.023), ('jarvelin', 0.023), ('kekalainen', 0.023), ('generator', 0.023), ('log', 0.023), ('unigram', 0.022), ('bridge', 0.021), ('relevant', 0.021), ('em', 0.021), ('vocabularies', 0.021), ('unordered', 0.021), ('defined', 0.021), ('collection', 0.02), ('equation', 0.02), ('topics', 0.019), ('cleaner', 0.019), ('lang', 0.019), ('hasan', 0.019), ('virtual', 0.019), ('system', 0.019), ('ft', 0.019), ('selecting', 0.018), ('ibm', 0.018), ('microsoft', 0.018), ('incorporated', 0.018), ('agichtein', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000011 78 emnlp-2012-Learning Lexicon Models from Search Logs for Query Expansion
Author: Jianfeng Gao ; Shasha Xie ; Xiaodong He ; Alnur Ali
Abstract: This paper explores log-based query expansion (QE) models for Web search. Three lexicon models are proposed to bridge the lexical gap between Web documents and user queries. These models are trained on pairs of user queries and titles of clicked documents. Evaluations on a real world data set show that the lexicon models, integrated into a ranker-based QE system, not only significantly improve the document retrieval performance but also outperform two state-of-the-art log-based QE methods.
2 0.21219678 5 emnlp-2012-A Discriminative Model for Query Spelling Correction with Latent Structural SVM
Author: Huizhong Duan ; Yanen Li ; ChengXiang Zhai ; Dan Roth
Abstract: Discriminative training in query spelling correction is difficult due to the complex internal structures of the data. Recent work on query spelling correction suggests a two stage approach a noisy channel model that is used to retrieve a number of candidate corrections, followed by discriminatively trained ranker applied to these candidates. The ranker, however, suffers from the fact the low recall of the first, suboptimal, search stage. This paper proposes to directly optimize the search stage with a discriminative model based on latent structural SVM. In this model, we treat query spelling correction as a multiclass classification problem with structured input and output. The latent structural information is used to model the alignment of words in the spelling correction process. Experiment results show that as a standalone speller, our model outperforms all the baseline systems. It also attains a higher recall compared with the noisy channel model, and can therefore serve as a better filtering stage when combined with a ranker.
3 0.14788443 28 emnlp-2012-Collocation Polarity Disambiguation Using Web-based Pseudo Contexts
Author: Yanyan Zhao ; Bing Qin ; Ting Liu
Abstract: This paper focuses on the task of collocation polarity disambiguation. The collocation refers to a binary tuple of a polarity word and a target (such as ⟨long, battery life⟩ or ⟨long, ast atratrguep⟩t) (, siunc whh aisch ⟨ ltohneg s,en btatitmeernyt l iofrei⟩en otrat ⟨iolonn gof, tshtaer polarity wwohirdch (“long”) changes along owniothf different targets (“battery life” or “startup”). To disambiguate a collocation’s polarity, previous work always turned to investigate the polarities of its surrounding contexts, and then assigned the majority polarity to the collocation. However, these contexts are limited, thus the resulting polarity is insufficient to be reliable. We therefore propose an unsupervised three-component framework to expand some pseudo contexts from web, to help disambiguate a collocation’s polarity.Without using any additional labeled data, experiments , show that our method is effective.
4 0.10332903 64 emnlp-2012-Improved Parsing and POS Tagging Using Inter-Sentence Consistency Constraints
Author: Alexander Rush ; Roi Reichart ; Michael Collins ; Amir Globerson
Abstract: State-of-the-art statistical parsers and POS taggers perform very well when trained with large amounts of in-domain data. When training data is out-of-domain or limited, accuracy degrades. In this paper, we aim to compensate for the lack of available training data by exploiting similarities between test set sentences. We show how to augment sentencelevel models for parsing and POS tagging with inter-sentence consistency constraints. To deal with the resulting global objective, we present an efficient and exact dual decomposition decoding algorithm. In experiments, we add consistency constraints to the MST parser and the Stanford part-of-speech tagger and demonstrate significant error reduction in the domain adaptation and the lightly supervised settings across five languages.
5 0.098470174 97 emnlp-2012-Natural Language Questions for the Web of Data
Author: Mohamed Yahya ; Klaus Berberich ; Shady Elbassuoni ; Maya Ramanath ; Volker Tresp ; Gerhard Weikum
Abstract: The Linked Data initiative comprises structured databases in the Semantic-Web data model RDF. Exploring this heterogeneous data by structured query languages is tedious and error-prone even for skilled users. To ease the task, this paper presents a methodology for translating natural language questions into structured SPARQL queries over linked-data sources. Our method is based on an integer linear program to solve several disambiguation tasks jointly: the segmentation of questions into phrases; the mapping of phrases to semantic entities, classes, and relations; and the construction of SPARQL triple patterns. Our solution harnesses the rich type system provided by knowledge bases in the web of linked data, to constrain our semantic-coherence objective function. We present experiments on both the . in question translation and the resulting query answering.
6 0.093266025 41 emnlp-2012-Entity based QA Retrieval
7 0.075735047 25 emnlp-2012-Bilingual Lexicon Extraction from Comparable Corpora Using Label Propagation
8 0.073121883 110 emnlp-2012-Reading The Web with Learned Syntactic-Semantic Inference Rules
9 0.071144082 123 emnlp-2012-Syntactic Transfer Using a Bilingual Lexicon
10 0.061088555 55 emnlp-2012-Forest Reranking through Subtree Ranking
11 0.059642758 35 emnlp-2012-Document-Wide Decoding for Phrase-Based Statistical Machine Translation
12 0.053705525 128 emnlp-2012-Translation Model Based Cross-Lingual Language Model Adaptation: from Word Models to Phrase Models
13 0.053522088 136 emnlp-2012-Weakly Supervised Training of Semantic Parsers
14 0.050834995 86 emnlp-2012-Locally Training the Log-Linear Model for SMT
15 0.050510414 52 emnlp-2012-Fast Large-Scale Approximate Graph Construction for NLP
16 0.047547277 112 emnlp-2012-Resolving Complex Cases of Definite Pronouns: The Winograd Schema Challenge
17 0.047258839 54 emnlp-2012-Forced Derivation Tree based Model Training to Statistical Machine Translation
18 0.047065672 19 emnlp-2012-An Entity-Topic Model for Entity Linking
19 0.043745622 47 emnlp-2012-Explore Person Specific Evidence in Web Person Name Disambiguation
20 0.043353505 42 emnlp-2012-Entropy-based Pruning for Phrase-based Machine Translation
topicId topicWeight
[(0, 0.169), (1, 0.011), (2, -0.044), (3, 0.108), (4, -0.076), (5, -0.175), (6, 0.187), (7, -0.015), (8, -0.014), (9, 0.005), (10, 0.055), (11, -0.068), (12, 0.132), (13, 0.091), (14, -0.087), (15, -0.035), (16, -0.02), (17, 0.233), (18, 0.084), (19, -0.061), (20, 0.023), (21, 0.051), (22, -0.12), (23, -0.144), (24, 0.049), (25, 0.069), (26, -0.2), (27, -0.13), (28, -0.229), (29, 0.002), (30, -0.009), (31, 0.004), (32, 0.233), (33, -0.108), (34, -0.142), (35, -0.033), (36, -0.147), (37, 0.035), (38, -0.042), (39, 0.025), (40, -0.024), (41, -0.106), (42, 0.129), (43, 0.02), (44, -0.099), (45, 0.052), (46, -0.062), (47, -0.017), (48, -0.142), (49, -0.005)]
simIndex simValue paperId paperTitle
same-paper 1 0.95606846 78 emnlp-2012-Learning Lexicon Models from Search Logs for Query Expansion
2 0.558599 5 emnlp-2012-A Discriminative Model for Query Spelling Correction with Latent Structural SVM
3 0.47563663 28 emnlp-2012-Collocation Polarity Disambiguation Using Web-based Pseudo Contexts
4 0.43763092 41 emnlp-2012-Entity based QA Retrieval
Author: Amit Singh
Abstract: Bridging the lexical gap between the user’s question and the question-answer pairs in the Q&A; archives has been a major challenge for Q&A; retrieval. State-of-the-art approaches address this issue by implicitly expanding the queries with additional words using statistical translation models. While useful, the effectiveness of these models is highly dependant on the availability of quality corpus in the absence of which they are troubled by noise issues. Moreover these models perform word based expansion in a context agnostic manner resulting in translation that might be mixed and fairly general. This results in degraded retrieval performance. In this work we address the above issues by extending the lexical word based translation model to incorporate semantic concepts (entities). We explore strategies to learn the translation probabilities between words and the concepts using the Q&A; archives and a popular entity catalog. Experiments conducted on a large scale real data show that the proposed techniques are promising.
5 0.35443896 97 emnlp-2012-Natural Language Questions for the Web of Data
6 0.34152508 25 emnlp-2012-Bilingual Lexicon Extraction from Comparable Corpora Using Label Propagation
7 0.33462802 110 emnlp-2012-Reading The Web with Learned Syntactic-Semantic Inference Rules
8 0.32460436 55 emnlp-2012-Forest Reranking through Subtree Ranking
9 0.31142086 64 emnlp-2012-Improved Parsing and POS Tagging Using Inter-Sentence Consistency Constraints
10 0.3062433 86 emnlp-2012-Locally Training the Log-Linear Model for SMT
11 0.25177708 123 emnlp-2012-Syntactic Transfer Using a Bilingual Lexicon
12 0.25095272 128 emnlp-2012-Translation Model Based Cross-Lingual Language Model Adaptation: from Word Models to Phrase Models
13 0.24228536 121 emnlp-2012-Supervised Text-based Geolocation Using Language Models on an Adaptive Grid
14 0.2358052 63 emnlp-2012-Identifying Event-related Bursts via Social Media Activities
15 0.23193519 114 emnlp-2012-Revisiting the Predictability of Language: Response Completion in Social Media
16 0.22917747 52 emnlp-2012-Fast Large-Scale Approximate Graph Construction for NLP
17 0.20884867 14 emnlp-2012-A Weakly Supervised Model for Sentence-Level Semantic Orientation Analysis with Multiple Experts
18 0.19886482 13 emnlp-2012-A Unified Approach to Transliteration-based Text Input with Online Spelling Correction
19 0.19779791 60 emnlp-2012-Generative Goal-Driven User Simulation for Dialog Management
20 0.19169796 54 emnlp-2012-Forced Derivation Tree based Model Training to Statistical Machine Translation
topicId topicWeight
[(2, 0.012), (16, 0.028), (25, 0.014), (34, 0.057), (60, 0.089), (63, 0.058), (64, 0.012), (65, 0.033), (70, 0.023), (74, 0.036), (76, 0.022), (80, 0.011), (86, 0.018), (95, 0.485)]
simIndex simValue paperId paperTitle
1 0.90770936 28 emnlp-2012-Collocation Polarity Disambiguation Using Web-based Pseudo Contexts
Author: Yanyan Zhao ; Bing Qin ; Ting Liu
Abstract: This paper focuses on the task of collocation polarity disambiguation. The collocation refers to a binary tuple of a polarity word and a target (such as ⟨long, battery life⟩ or ⟨long, startup⟩), in which the sentiment orientation of the polarity word (“long”) changes along with different targets (“battery life” or “startup”). To disambiguate a collocation’s polarity, previous work always turned to investigating the polarities of its surrounding contexts, and then assigned the majority polarity to the collocation. However, these contexts are limited, so the resulting polarity is insufficiently reliable. We therefore propose an unsupervised three-component framework to expand pseudo contexts from the web to help disambiguate a collocation’s polarity. Without using any additional labeled data, experiments show that our method is effective.
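The majority-vote step this abstract describes (which the expanded pseudo contexts feed into) can be sketched as follows; the polarity lexicons and example contexts are made up for illustration:

```python
# Tiny illustrative polarity lexicons (not from the paper).
POSITIVE = {"great", "fast", "impressive", "excellent"}
NEGATIVE = {"slow", "poor", "annoying", "disappointing"}

def disambiguate(contexts):
    """Assign a polarity to a collocation by majority vote over the
    polarity words found in its (pseudo) contexts."""
    pos = sum(1 for c in contexts for w in c.split() if w in POSITIVE)
    neg = sum(1 for c in contexts for w in c.split() if w in NEGATIVE)
    if pos == neg:
        return "neutral"
    return "positive" if pos > neg else "negative"

# Hypothetical contexts retrieved for <long, battery life> vs. <long, startup>:
battery = ["great phone with long battery life",
           "long battery life is impressive"]
startup = ["long startup is annoying",
           "slow and long startup"]
```

With these toy contexts, "long battery life" is voted positive and "long startup" negative, illustrating how the same polarity word flips with the target.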
2 0.89152622 83 emnlp-2012-Lexical Differences in Autobiographical Narratives from Schizophrenic Patients and Healthy Controls
Author: Kai Hong ; Christian G. Kohler ; Mary E. March ; Amber A. Parker ; Ani Nenkova
Abstract: We present a system for automatic identification of schizophrenic patients and healthy controls based on narratives the subjects recounted about emotional experiences in their own life. The focus of the study is to identify the lexical features that distinguish the two populations. We report the results of feature selection experiments that demonstrate that the classifier can achieve accuracy on patient level prediction as high as 76.9% with only a small set of features. We provide an in-depth discussion of the lexical features that distinguish the two groups and the unexpected relationship between emotion types of the narratives and the accuracy of patient status prediction.
3 0.87984806 49 emnlp-2012-Exploring Topic Coherence over Many Models and Many Topics
Author: Keith Stevens ; Philip Kegelmeyer ; David Andrzejewski ; David Buttler
Abstract: We apply two new automated semantic evaluations to three distinct latent topic models. Both metrics have been shown to align with human evaluations and provide a balance between internal measures of information gain and comparisons to human ratings of coherent topics. We improve upon the measures by introducing new aggregate measures that allow comparing complete topic models. We further compare the automated measures to other evaluation approaches for topic models: manually crafted semantic tests and document classification. Our experiments reveal that LDA and LSA each have different strengths; LDA best learns descriptive topics while LSA is best at creating a compact semantic representation of documents and words in a corpus.
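One widely used automated coherence measure of the kind discussed here, the UMass metric, scores a topic's top words by their document co-occurrence. A hedged sketch, assuming a toy corpus; the function name is ours, not the paper's:

```python
import math
from itertools import combinations

def umass_coherence(topic_words, documents):
    """UMass coherence: sum over ordered word pairs (i < j) of
    log((D(w_i, w_j) + 1) / D(w_j)), where D counts in how many
    documents the words (co-)occur."""
    docs = [set(d.split()) for d in documents]

    def D(*words):
        return sum(1 for d in docs if all(w in d for w in words))

    score = 0.0
    for i, j in combinations(range(len(topic_words)), 2):
        wi, wj = topic_words[i], topic_words[j]
        score += math.log((D(wi, wj) + 1) / D(wj))
    return score

docs = ["apple banana fruit",
        "banana fruit smoothie",
        "apple pie dessert"]
coherent = umass_coherence(["banana", "fruit"], docs)
incoherent = umass_coherence(["banana", "dessert"], docs)
```

Words that co-occur in documents ("banana", "fruit") score higher than words that never do ("banana", "dessert"), which is the intuition behind aggregating such scores over a whole topic model.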
same-paper 4 0.83885312 78 emnlp-2012-Learning Lexicon Models from Search Logs for Query Expansion
Author: Jianfeng Gao ; Shasha Xie ; Xiaodong He ; Alnur Ali
Abstract: This paper explores log-based query expansion (QE) models for Web search. Three lexicon models are proposed to bridge the lexical gap between Web documents and user queries. These models are trained on pairs of user queries and titles of clicked documents. Evaluations on a real world data set show that the lexicon models, integrated into a ranker-based QE system, not only significantly improve the document retrieval performance but also outperform two state-of-the-art log-based QE methods.
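Given term translation probabilities learned from query–title click pairs, expansion terms can be ranked roughly as the abstract suggests, by averaging P(w | q) over the query terms. A minimal sketch with a hypothetical probability table, not the paper's exact ranker:

```python
def expansion_terms(query, trans_prob, k=3):
    """Rank candidate expansion terms for a query by averaging the
    term-level translation probabilities P(w | q) over the query terms,
    keeping the top-k terms not already in the query.

    `trans_prob` maps (expansion_term, query_term) -> probability.
    """
    scores = {}
    for q in query:
        for (w, q2), p in trans_prob.items():
            if q2 == q and w not in query:
                scores[w] = scores.get(w, 0.0) + p / len(query)
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy translation table, e.g. as estimated from query-title pairs.
trans = {("airfare", "cheap"): 0.4,
         ("deals", "cheap"): 0.3,
         ("airline", "flights"): 0.5,
         ("airfare", "flights"): 0.2}
terms = expansion_terms(["cheap", "flights"], trans)
```

Here "airfare" wins because it is supported by both query terms, which mirrors why averaging over the whole query gives more reliable expansions than any single term.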
5 0.4912841 5 emnlp-2012-A Discriminative Model for Query Spelling Correction with Latent Structural SVM
Author: Huizhong Duan ; Yanen Li ; ChengXiang Zhai ; Dan Roth
Abstract: Discriminative training in query spelling correction is difficult due to the complex internal structures of the data. Recent work on query spelling correction suggests a two-stage approach: a noisy channel model is used to retrieve a number of candidate corrections, followed by a discriminatively trained ranker applied to these candidates. The ranker, however, suffers from the low recall of the first, suboptimal search stage. This paper proposes to directly optimize the search stage with a discriminative model based on latent structural SVM. In this model, we treat query spelling correction as a multiclass classification problem with structured input and output. The latent structural information is used to model the alignment of words in the spelling correction process. Experiment results show that as a standalone speller, our model outperforms all the baseline systems. It also attains a higher recall than the noisy channel model, and can therefore serve as a better filtering stage when combined with a ranker.
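The noisy-channel first stage that this abstract takes as its baseline can be sketched in the classic way: generate candidates within one edit and score them with a language model, here simplified to raw word frequencies with a uniform channel model. All names and the toy frequency table are illustrative:

```python
def edits1(word):
    """All strings one edit away (deletes, transposes, replaces, inserts)."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word, freq):
    """Pick the most frequent in-vocabulary candidate within one edit,
    treating the channel probability as uniform over candidates."""
    candidates = ({word} | edits1(word)) & set(freq)
    if not candidates:
        return word
    return max(candidates, key=freq.get)

# Toy unigram frequencies standing in for a query language model.
freq = {"flights": 120, "fights": 3, "lights": 40}
```

The low recall the abstract criticizes is visible here: any intended correction more than one edit away (or outside the vocabulary) is simply unreachable by this stage.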
6 0.46363276 14 emnlp-2012-A Weakly Supervised Model for Sentence-Level Semantic Orientation Analysis with Multiple Experts
7 0.43554777 47 emnlp-2012-Explore Person Specific Evidence in Web Person Name Disambiguation
8 0.43386915 101 emnlp-2012-Opinion Target Extraction Using Word-Based Translation Model
9 0.43322897 52 emnlp-2012-Fast Large-Scale Approximate Graph Construction for NLP
10 0.43072665 137 emnlp-2012-Why Question Answering using Sentiment Analysis and Word Classes
11 0.41787302 20 emnlp-2012-Answering Opinion Questions on Products by Exploiting Hierarchical Organization of Consumer Reviews
12 0.41019657 50 emnlp-2012-Extending Machine Translation Evaluation Metrics with Lexical Cohesion to Document Level
13 0.4037089 120 emnlp-2012-Streaming Analysis of Discourse Participants
14 0.40187103 110 emnlp-2012-Reading The Web with Learned Syntactic-Semantic Inference Rules
15 0.40010911 63 emnlp-2012-Identifying Event-related Bursts via Social Media Activities
16 0.39829344 97 emnlp-2012-Natural Language Questions for the Web of Data
17 0.3908959 8 emnlp-2012-A Phrase-Discovering Topic Model Using Hierarchical Pitman-Yor Processes
18 0.38762736 3 emnlp-2012-A Coherence Model Based on Syntactic Patterns
19 0.38573891 107 emnlp-2012-Polarity Inducing Latent Semantic Analysis
20 0.38025394 19 emnlp-2012-An Entity-Topic Model for Entity Linking