emnlp emnlp2013 emnlp2013-39 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Artem Sokolov ; Laura Jehl ; Felix Hieber ; Stefan Riezler
Abstract: We present an approach to learning bilingual n-gram correspondences from relevance rankings of English documents for Japanese queries. We show that directly optimizing cross-lingual rankings rivals and complements machine translation-based cross-language information retrieval (CLIR). We propose an efficient boosting algorithm that deals with very large cross-product spaces of word correspondences. We show in an experimental evaluation on patent prior art search that our approach, and in particular a consensus-based combination of boosting and translation-based approaches, yields substantial improvements in CLIR performance. Our training and test data are made publicly available.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We present an approach to learning bilingual n-gram correspondences from relevance rankings of English documents for Japanese queries. [sent-3, score-0.253]
2 We show that directly optimizing cross-lingual rankings rivals and complements machine translation-based cross-language information retrieval (CLIR). [sent-4, score-0.18]
3 We propose an efficient boosting algorithm that deals with very large cross-product spaces of word correspondences. [sent-5, score-0.214]
4 We show in an experimental evaluation on patent prior art search that our approach, and in particular a consensus-based combination of boosting and translation-based approaches, yields substantial improvements in CLIR performance. [sent-6, score-0.724]
5 1 Introduction The central problem addressed in Cross-Language Information Retrieval (CLIR) is that of translating or projecting a query into the language of the document repository across which retrieval is performed. [sent-8, score-0.359]
6 There are two main approaches to tackling this problem: the first leverages the standard Statistical Machine Translation (SMT) machinery to produce a single best translation that is used as the search query in the target language. [sent-9, score-0.318]
7 We will henceforth call this the direct translation approach. [sent-10, score-0.186]
8 Alternative approaches avoid solving the hard problem of word reordering, and instead rely on token-to-token translations that are used to project the query terms into the target language with a probabilistic weighting of the standard tf-idf term weighting scheme. [sent-12, score-0.277]
9 Darwish and Oard (2003) termed this method the probabilistic structured query approach. [sent-13, score-0.231]
10 The advantage of this technique is an implicit query expansion effect due to the use of probability distributions over term translations (Xu et al., 2001). [sent-14, score-0.277]
11 Recent research has shown that leveraging query context by extracting term translation probabilities from n-best direct translations of queries instead of using context-free translation tables outperforms both direct translation and context-free projection (Ture et al., 2012a). [sent-16, score-0.868]
12 While direct translation as well as probabilistic structured query approaches use machine learning to optimize the SMT module, retrieval is done by standard search algorithms in both approaches. [sent-19, score-0.533]
13 Our method learns a table of n-gram correspondences by direct optimization of a ranking objective on relevance rankings of English documents for Japanese queries. [sent-27, score-0.397]
14 That is, given labeled data in the form of a set R of tuples (q, d+, d−), where d+ is a relevant (or higher ranked) document and d− an irrelevant (or lower ranked) document for query q, the goal is to find a weight matrix W ∈ IR^{Q×D} such that f(q, d+) > f(q, d−) for all tuples from R. [sent-34, score-0.356]
15 The scoring model learns weights for all possible correspondences of query terms and document terms by directly optimizing the ranking objective at hand. [sent-35, score-0.375]
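To make the ranking objective concrete, here is a minimal sketch of a bilinear scoring model f(q, d) = qᵀWd and the pairwise constraint it must satisfy on a tuple (q, d+, d−); the vocabulary sizes, random initialization, and vectors below are invented for illustration and are not from the paper.

```python
import numpy as np

# Toy vocabulary sizes for query-side and document-side terms (illustrative only).
Q_VOCAB, D_VOCAB = 5, 6
rng = np.random.default_rng(0)

# Weight matrix W over all (query term, document term) correspondences.
W = rng.normal(scale=0.1, size=(Q_VOCAB, D_VOCAB))

def f(q, d, W):
    """Bilinear ranking score f(q, d) = q^T W d over term-pair weights."""
    return q @ W @ d

# One training tuple (q, d+, d-), as bag-of-terms indicator vectors.
q = np.array([1, 0, 1, 0, 0], dtype=float)
d_pos = np.array([0, 1, 0, 1, 0, 0], dtype=float)
d_neg = np.array([1, 0, 0, 0, 0, 1], dtype=float)

# The pairwise objective requires f(q, d+) > f(q, d-) for all tuples in R.
print(f(q, d_pos, W) > f(q, d_neg, W))
```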
16 The main contribution of our paper is the presentation of algorithms that make learning a phrase table by direct rank optimization feasible, and an experimental verification of the benefits of this approach, especially with regard to a combination of the orthogonal information sources of ranking-based and SMT-based CLIR approaches. [sent-42, score-0.243] [sent-46, score-0.195]
(Footnote 1: With bold letters we denote vectors for query q and document d.)
18 Our approach builds upon a boosting framework for pairwise ranking (Freund et al., 2003). [sent-47, score-0.327]
19 Furthermore, we present an implementation of boosting that utilizes parallel estimation on bootstrap samples from the training set for increased efficiency and reduced error (Breiman, 1996). [sent-49, score-0.31]
20 We show in an experimental evaluation on large-scale retrieval on patent abstracts that our boosting approach is comparable in MAP and improves significantly by 13-15 PRES points over very competitive translation-based CLIR systems that are trained on 1. [sent-51, score-0.873]
21 2 Related Work Recent research in CLIR follows the two main paradigms of direct translation and probabilistic structured query approaches. [sent-54, score-0.417]
22 An example for the first approach is the work of Magdy and Jones (2011), who presented an efficient technique to adapt off-the-shelf SMT systems for CLIR by training them on data pre-processed for retrieval (case folding, stopword removal, stemming). [sent-55, score-0.221]
23 Nikoulina et al. (2012) presented an approach to direct translation-based CLIR where the n-best list of an SMT system is re-ranked according to the MAP performance of the translated queries. [sent-57, score-0.147]
24 The probabilistic structured query approach has seen a lot of work on context-aware query expansion across languages, based on various similarity statistics (Ballesteros and Croft, 1998; Gao et al. [sent-58, score-0.428]
25 Since the latter techniques achieved only marginal improvements over the context-sensitive query translation from n-best lists, we did not pursue them in our work. [sent-65, score-0.318]
26 CLIR in the context of patent prior art search was done as extrinsic evaluation at the NTCIR PatentMT2 workshops until 2010, and has been ongoing in the CLEF-IP3 benchmarking workshops since 2009. [sent-66, score-0.472]
27 However, most workshop participants either did not make use of automatic translation at all, or used an off-the-shelf translation tool. [sent-67, score-0.242]
28 This is due to the CLEF-IP data collection where parts of patent documents are provided as manual translations into three languages. [sent-68, score-0.561]
29 In order to evaluate CLIR in a truly cross-lingual scenario, we created a large patent CLIR dataset where queries and documents are Japanese and English patent abstracts, respectively. [sent-69, score-1.059]
30 Ranking approaches to CLIR have been presented by Guo and Gomes (2009) who use pairwise ranking for patent retrieval. [sent-70, score-0.585]
31 Their method is a classical learning-to-rank setup where retrieval scores such as tf-idf or BM25 are combined with domain knowledge on patent class, inventor, date, location, etc. [sent-71, score-0.588]
32 Methods to learn word-based translation correspondences from supervised ranking signals have been presented by Bai et al. [sent-73, score-0.253]
33 A combination of bagging and boosting in the context of retrieval has been presented by Pavlov et al. [sent-88, score-0.425]
34 Parallel boosting where all feature weights are updated simultaneously has been presented by Collins et al. [sent-93, score-0.214]
35 3.1 Direct translation approach For direct translation, we use the SCFG decoder cdec (Dyer et al., 2010). [sent-101, score-0.256]
36 At retrieval time, all queries are translated sentence-wise and subsequently re-joined to form one query per patent. [sent-107, score-0.379]
37 Our baseline retrieval system uses the Okapi BM25 scores for document ranking. [sent-108, score-0.162]
38 3.2 Probabilistic structured query approach Early probabilistic structured query approaches (Xu et al., 2001; Darwish and Oard, 2003) represent translation options by lexical, i.e., context-free, translation tables. [sent-119, score-0.231] [sent-120, score-0.153]
40 Ture et al. (2012a) extract translation options from the decoder's n-best list for translating a particular query. [sent-125, score-0.183]
41 The central idea is to let the language model choose fluent, context-aware translations for each query term during decoding. [sent-126, score-0.277]
42 This retains the desired query-expansion effect of probabilistic structured models, but it reduces query drift by filtering translations with respect to the context of the full query. [sent-127, score-0.271]
43 A projection of source language query terms f ∈ F into the target language is achieved by representing each source token f by its probabilistically weighted translations. [sent-128, score-0.229]
44 We use the same hierarchical phrase-based system that was used for direct translation to calculate n-best translations for the probabilistic structured query approach. [sent-133, score-0.457]
45 Probabilistic structured queries that include context-aware estimates of translation probabilities require preserving sentence-wise context-sensitivity in retrieval as well. [sent-136, score-0.221]
46 Thus, unlike the direct translation approach, we compute weighted term and document frequencies for each sentence s in query F separately. [sent-137, score-0.469]
47 The scoring (3) of a target document for a multiple-sentence query then becomes: score(E|F) = \sum_{s \in F} \sum_{f \in s} BM25(tf(f, E), df(f)). [sent-138, score-0.243]
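A hedged sketch of this sentence-wise scoring: BM25 is applied to probabilistically weighted term and document frequencies, summed over the sentences of the query. The BM25 constants (k1, b), the tiny translation distribution, and the toy statistics are assumptions for illustration, not the paper's settings.

```python
import math

def bm25(tf, df, doc_len, avg_len, N, k1=1.2, b=0.75):
    """Okapi BM25 term weight; k1 and b are common defaults, not the paper's settings."""
    idf = math.log((N - df + 0.5) / (df + 0.5) + 1.0)
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))

def weighted_tf_df(f, probs, doc_tf, dfs):
    """Project a source term f through its translation distribution p(e|f):
    tf(f, E) = sum_e p(e|f) * tf(e, E) and df(f) = sum_e p(e|f) * df(e)."""
    tf = sum(p * doc_tf.get(e, 0) for e, p in probs[f].items())
    df = sum(p * dfs.get(e, 0) for e, p in probs[f].items())
    return tf, df

def score(query_sentences, probs_per_sentence, doc_tf, dfs, doc_len, avg_len, N):
    """score(E|F) = sum over sentences s in F and terms f in s of BM25(tf(f,E), df(f)),
    using a separate, context-aware translation distribution per sentence."""
    total = 0.0
    for s, probs in zip(query_sentences, probs_per_sentence):
        for f in s:
            tf, df = weighted_tf_df(f, probs, doc_tf, dfs)
            if tf > 0:
                total += bm25(tf, max(df, 1.0), doc_len, avg_len, N)
    return total

# Toy one-sentence query with a hypothetical per-sentence distribution p(e|f).
probs = [{"kairo": {"circuit": 0.7, "circuitry": 0.3}}]
print(score([["kairo"]], probs, {"circuit": 3}, {"circuit": 120}, 200, 150.0, 10000))
```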
48 The boosting approach (Freund et al., 2003; Collins and Koo, 2005) defines a scoring function f(q, d) on query q and document d as a weighted linear combination of T weak learners h_t such that f(q, d) = \sum_{t=1}^{T} w_t h_t(q, d). [sent-140, score-0.339]
49 In our experiments, these features indicate the presence of pairs of uni- and bi-grams from the source-side vocabulary of query terms and the target-side vocabulary of document terms, respectively. [sent-142, score-0.197]
50 For training, we are given labeled data in the form of a set R of tuples (q, d+, d−), where d+ is a relevant (or higher ranked) document and d− an irrelevant (or lower ranked) document for query q. [sent-145, score-0.364]
51 Then, using each of the samples as a training set, separate boosting models {w_t^s, h_t^s}, s = 1, ..., S, are trained. [sent-156, score-0.267]
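The following is a simplified RankBoost-style sketch (after Freund et al., 2003) with bagging over bootstrap samples (Breiman, 1996): weak learners are indicators of (query term, document term) pairs, each round picks the pair with the largest weighted margin, and S independently boosted models are averaged. It is written sequentially for brevity; the S bootstrap models are independent and can be trained in parallel. This illustrates the general technique, not the paper's exact algorithm.

```python
import math
import random
from collections import defaultdict

def features(q, d):
    """Weak-learner candidates: indicators of (query term, document term) pairs."""
    return {(f, e) for f in q for e in d}

def train_rankboost(tuples, T):
    """Simplified RankBoost over pairwise tuples (q, d+, d-); returns word-pair weights."""
    D = {i: 1.0 / len(tuples) for i in range(len(tuples))}  # distribution over pairs
    w = defaultdict(float)
    for _ in range(T):
        # Weighted margin r(h) = sum_i D_i * (h(q_i, d+_i) - h(q_i, d-_i)) per feature.
        r = defaultdict(float)
        for i, (q, dp, dn) in enumerate(tuples):
            for ftr in features(q, dp):
                r[ftr] += D[i]
            for ftr in features(q, dn):
                r[ftr] -= D[i]
        best = max(r, key=lambda k: abs(r[k]))
        rb = max(min(r[best], 0.99), -0.99)          # clamp to keep alpha finite
        alpha = 0.5 * math.log((1 + rb) / (1 - rb))  # negative margin -> negative weight
        w[best] += alpha
        # Exponential update and renormalization of the pair distribution.
        for i, (q, dp, dn) in enumerate(tuples):
            h = (best in features(q, dp)) - (best in features(q, dn))
            D[i] *= math.exp(-alpha * h)
        z = sum(D.values())
        D = {i: v / z for i, v in D.items()}
    return w

def bagged(tuples, T=10, S=4, seed=0):
    """Train S boosted models on bootstrap samples and average them (Breiman, 1996)."""
    rng = random.Random(seed)
    models = [train_rankboost([rng.choice(tuples) for _ in range(len(tuples))], T)
              for _ in range(S)]
    avg = defaultdict(float)
    for m in models:
        for k, v in m.items():
            avg[k] += v / S
    return avg

# Toy tuples (invented): query terms, relevant document terms, irrelevant document terms.
tuples = [(["kairo"], ["circuit", "wiring"], ["music", "note"]),
          (["kairo"], ["circuit"], ["piano"])]
print(sorted(bagged(tuples, T=3, S=2).items(), key=lambda kv: -kv[1])[:3])
```

Note that a weak learner with a negative margin receives a negative weight, which is how harmful co-occurrences can be penalized (cf. the analysis below).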
52 4 Model Combination by Borda Counts SMT-based approaches to CLIR and our boosting approach have different strengths. [sent-214, score-0.214]
53 Both baseline SMT-based approaches presented above are agnostic of the ultimate retrieval task and are not specifically adapted for it. [sent-216, score-0.146]
54 The boosting method, on the other hand, learns domain-specific word associations that are useful to discern relevant from irrelevant documents. [sent-217, score-0.281]
55 The aggregate score fagg for two rankings f1(q, d) and f2(q, d) is computed from their Borda counts. (Footnote 8: It is possible to construct separate query and document inverted indices and intersect them on the fly to determine the set of documents that contain some pair of words.) [sent-223, score-0.401]
56 In the terminology of Belkin et al. (1995), combining several systems' scores with Borda Counts can be viewed as the "data fusion" approach to IR, which merges the outputs of the systems, while the PSQ baseline is an example of the "query combination" approach, which extends the query at the input. [sent-230, score-0.197]
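As a sketch of Borda-count fusion: each system's ranking assigns Borda points by position, and the two point vectors are interpolated with a parameter κ before re-sorting (κ matches the x-axis of Figures 1-3; the exact interpolation form here is an assumption).

```python
def borda_points(ranking):
    """Borda points: the document at rank i among n documents gets n - i points."""
    n = len(ranking)
    return {doc: n - i for i, doc in enumerate(ranking)}

def borda_fuse(ranking1, ranking2, kappa=0.5):
    """Consensus ranking of two systems via interpolated Borda counts;
    kappa weights system 1 against system 2 (exact form is an assumption)."""
    b1, b2 = borda_points(ranking1), borda_points(ranking2)
    docs = set(b1) | set(b2)
    agg = {d: kappa * b1.get(d, 0) + (1 - kappa) * b2.get(d, 0) for d in docs}
    return sorted(docs, key=lambda d: -agg[d])

# E.g., fusing a PSQ-style ranking with a Boost-style ranking of the same documents.
print(borda_fuse(["d1", "d2", "d3"], ["d3", "d1", "d2"], kappa=0.6))
```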
57 5.1 Parallel Translation Data For Japanese-to-English patent translation we used data provided by the organizers of the NTCIR-9 workshop for the JP-EN PatentMT subtask. [sent-233, score-0.593]
58 We customized the list of prefixes by adding some abbreviations like “Chem”, “FIG” or “Pat”, which are specific to patent documents. [sent-252, score-0.502]
59 5.2 Ranking Data from Patent Citations Graf and Azzopardi (2008) describe a method to extract relevance judgements for patent retrieval from patent citations. [sent-254, score-1.147]
60 The key idea is to regard patent documents that are cited in a query patent, either by the patent applicant, by the patent examiner, or in a patent office's search report, as relevant for the query patent. [sent-255, score-2.361]
61 Furthermore, patent documents that are related to the query patent via a patent family relationship, i.e., patents granted by different patent authorities but related to the same invention, are regarded as relevant. [sent-256, score-1.696] [sent-258, score-0.592]
63 We assign three integer relevance levels to these three categories of relationships, with highest relevance (3) for family patents, lower relevance for patents cited in search reports by patent examiners (2), and lowest relevance level (1) for applicants’ citations. [sent-259, score-0.974]
64 We also include all patents which are in the same patent family as an applicant or examiner citation to avoid false negatives. [sent-260, score-0.715]
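A small sketch of this graded relevance assignment; the record structures (`citations`, `families`) are hypothetical stand-ins for the MAREC citation graph.

```python
def relevance_level(query_patent, candidate, citations, families):
    """Graded relevance following Graf and Azzopardi (2008):
    3 = same patent family, 2 = examiner (search report) citation,
    1 = applicant citation, 0 = not relevant.
    `citations` maps (query, candidate) -> 'examiner' | 'applicant';
    `families` maps a patent id to its family id (hypothetical structures)."""
    fam_q = families.get(query_patent)
    if fam_q is not None and fam_q == families.get(candidate):
        return 3
    kind = citations.get((query_patent, candidate))
    return {"examiner": 2, "applicant": 1}.get(kind, 0)

families = {"JP1": "F1", "US9": "F1", "US7": "F2"}
citations = {("JP1", "US7"): "examiner"}
print(relevance_level("JP1", "US9", citations, families))  # 3: same family
print(relevance_level("JP1", "US7", citations, families))  # 2: examiner citation
```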
65 This methodology has been used to create patent retrieval data at CLEF-IP11 and proved very useful to automatically create a patent retrieval dataset for our experiments. [sent-261, score-1.176]
66 For the creation of our dataset, we used the MAREC12 citation graph to extract patents in citation or family relation. [sent-262, score-0.212]
67 Since the Japanese portion of the MAREC corpus only contains English abstracts, but not the Japanese full texts, we merged the patent documents in the NTCIR-10 test collection described above with the Japanese (JP) section of MAREC. [sent-263, score-0.521]
68 In order to keep parallel data for SMT training separate from ranking data, we used only data from the years 2003-2005 to extract training data for ranking, and two small datasets of 2,000 queries each from the years 2006-2007 for development and testing. [sent-274, score-0.238]
69 The experiments reported here use only the abstracts of the Japanese and English patents in our training, development and test collection. [sent-278, score-0.17]
70 The direct translation approach (DT) was developed in three configurations: no stopword filtering, a small stopword list (52 words), and a large stopword list (543 words). [sent-325, score-0.561]
71 The probabilistic structured query (PSQ) approach was developed using the lexical translation table and the translation table estimated on the decoder's n-best list, both optionally pruned with a variable lower pL and cumulative pC threshold on the word pair probability in the table (Section 3.2). [sent-328, score-0.473]
72 2698 for PSQ with a translation table estimated on the n-best list (pL = 0. [sent-336, score-0.151]
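For illustration, a sketch of pruning a translation distribution p(e|f) with an absolute threshold pL and a cumulative threshold pC; the threshold values and the toy distribution are invented, not the tuned settings.

```python
def prune(options, p_l=0.005, p_c=0.95):
    """Prune a translation distribution p(e|f): drop entries below the absolute
    threshold p_l, keep the most probable entries until cumulative mass reaches
    p_c, then renormalize. Threshold values here are invented for illustration."""
    kept = sorted(((e, p) for e, p in options.items() if p >= p_l),
                  key=lambda x: -x[1])
    out, mass = {}, 0.0
    for e, p in kept:
        if mass >= p_c:
            break
        out[e] = p
        mass += p
    z = sum(out.values())
    return {e: p / z for e, p in out.items()} if z else {}

print(prune({"circuit": 0.6, "circuitry": 0.3, "network": 0.08, "wire": 0.02}))
```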
73 Training of the boosting approach (Boost) was done in parallel on bootstrap samples from the training data. [sent-342, score-0.31]
74 For each training tuple, we sampled a relevant document d+ (i.e., an English abstract) from the English patents marked relevant for the Japanese patent, and a random document d− from the whole pool of English patent abstracts. [sent-380, score-0.638]
75 If d− had a relevance score greater than or equal to the relevance score of d+, it was resampled. [sent-381, score-0.174]
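A minimal sketch of this tuple sampling with resampling of the negative document; the graded `relevance` map follows Section 5.2, and the loop assumes, as in the paper's setting, that the pool is dominated by irrelevant documents.

```python
import random

def sample_tuples(query, relevant, pool, relevance, n, seed=0):
    """Sample n tuples (q, d+, d-): d+ comes from the documents marked relevant for
    the query; d- is drawn from the whole pool and resampled whenever its relevance
    level is greater than or equal to that of d+."""
    rng = random.Random(seed)
    tuples = []
    for _ in range(n):
        d_pos = rng.choice(relevant)
        d_neg = rng.choice(pool)
        while relevance.get((query, d_neg), 0) >= relevance.get((query, d_pos), 0):
            d_neg = rng.choice(pool)
        tuples.append((query, d_pos, d_neg))
    return tuples

rel = {("JP1", "US9"): 3, ("JP1", "US7"): 2}
print(sample_tuples("JP1", ["US9", "US7"], ["US9", "US7", "US2", "US3"], rel, n=2))
```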
76 Figure 1: MAP rank aggregation for combinations of the bi-gram boosting and the baselines on the dev set (x-axis: κ). [sent-394, score-0.338]
77 Figure 3: PRES rank aggregation for combinations of the bi-gram boosting and the baselines on the dev set (x-axis: κ). [sent-395, score-0.338]
78 Given the fact that the complex SMT system behind the direct translation and PSQ approach is trained and tuned on very large in-domain datasets, the performance of the bare phrase table induced by the Boost method is respectable. [sent-402, score-0.186]
79 Figure 2: MAP rank aggregation for the bi-gram boosting and the "PSQ n-best table" approach on dev and test sets (x-axis: κ). [sent-403, score-0.338]
80 For rank aggregation, we combined the uni- and bi-gram boosting model with the best variants of the DT and PSQ approaches. [sent-407, score-0.214]
81 Table 3 shows the retrieval performance of the best baseline model (PSQ n-best) combined with the best Boost model (bi-gram), with an impressive gain of over 7 MAP points (15 PRES points) over the best individual baseline result from Table 2. [sent-411, score-0.144]
82 6.3 Analysis Table 4 lists some of the top-200 selected features for the boosting approach (the most common translation of the Japanese term is put in subscript). [sent-417, score-0.375]
83 We see that the direct ranking approach is able to penalize uni- and bi-gram cooccurrences that are harmful for retrieval by assigning them a negative weight, e.g. [sent-418, score-0.26]
84 This has a query expansion effect that is not possible in systems that use one translation or a small list of n-best translations. [sent-427, score-0.348]
85 7 Conclusion We presented a boosting approach to induce a table of bilingual n-gram correspondences by direct preference learning on relevance rankings. [sent-433, score-0.419]
86 We compared our boosting approach to very competitive CLIR baselines that use a complex SMT system trained and tuned on large in-domain datasets. [sent-435, score-0.214]
87 Furthermore, our patent retrieval setup gives SMT-based approaches an advantage in that queries consist of several normal-length sentences, as opposed to the short queries common to web search. [sent-436, score-0.72]
88 Overall, we obtained the best results by a model combination using consensus-based voting where the best SMT-based approach was combined with the boosting phrase table (gaining more than 7 MAP or 15 PRES points). [sent-455, score-0.252]
89 We attribute this to the fact that the boosting approach augments SMT approaches with valuable information that is hard to get in approaches that are agnostic about the ranking data and the ranking task at hand. [sent-456, score-0.402]
90 The experimental setup presented in this paper uses relevance links between patent abstracts as ranking data. [sent-457, score-0.681]
91 While this technique is useful for developing patent retrieval systems, it would be interesting to see if our results transfer to patent retrieval scenarios where full patent documents are used instead of only abstracts, or to standard CLIR scenarios that use short search queries in retrieval. [sent-458, score-1.763]
92 Combining the evidence of multiple query representations for information retrieval. [sent-480, score-0.197]
93 Overview of the patent translation task at the NTCIR-7 workshop. [sent-536, score-0.593]
94 Improving query translation for cross-language information retrieval using statistical models. [sent-544, score-0.434]
95 Crosslingual query suggestion using query logs of different languages. [sent-548, score-0.394]
96 A methodology for building a patent test collection for prior art search. [sent-557, score-0.472]
97 Ranking structured documents: A large margin based approach for patent prior art search. [sent-565, score-0.506]
98 In Proceedings of the ACM SIGIR conference on Research and development in information retrieval (SIGIR’10), New York, NY. [sent-596, score-0.166]
99 An efficient method for using machine translation technologies in cross-language patent search. [sent-601, score-0.593]
100 Adaptation of statistical machine translation model for cross-lingual information retrieval in a service context. [sent-605, score-0.237]
wordName wordTfidf (topN-words)
[('patent', 0.472), ('clir', 0.346), ('psq', 0.331), ('pres', 0.241), ('boosting', 0.214), ('query', 0.197), ('translation', 0.121), ('patents', 0.12), ('retrieval', 0.116), ('dt', 0.111), ('map', 0.109), ('wh', 0.107), ('stopword', 0.105), ('smt', 0.095), ('borda', 0.092), ('japanese', 0.089), ('relevance', 0.087), ('ture', 0.085), ('sigir', 0.085), ('ranking', 0.079), ('magdy', 0.075), ('queries', 0.066), ('direct', 0.065), ('rankings', 0.064), ('lexp', 0.06), ('pnbest', 0.06), ('bai', 0.06), ('boost', 0.06), ('ht', 0.058), ('bagging', 0.057), ('orthogonal', 0.056), ('samples', 0.053), ('correspondences', 0.053), ('qwh', 0.052), ('rankboost', 0.052), ('translationbased', 0.052), ('development', 0.05), ('documents', 0.049), ('document', 0.046), ('aslam', 0.045), ('fagg', 0.045), ('gareth', 0.045), ('irq', 0.045), ('dev', 0.045), ('parallel', 0.043), ('abstracts', 0.043), ('aggregation', 0.043), ('translations', 0.04), ('jones', 0.04), ('term', 0.04), ('darwish', 0.039), ('grangier', 0.039), ('schapire', 0.038), ('combination', 0.038), ('tuples', 0.038), ('irrelevant', 0.037), ('pc', 0.037), ('cdec', 0.036), ('rank', 0.036), ('pl', 0.035), ('structured', 0.034), ('pairwise', 0.034), ('decoder', 0.034), ('acm', 0.034), ('family', 0.034), ('ferhan', 0.033), ('koo', 0.032), ('options', 0.032), ('projection', 0.032), ('scfg', 0.031), ('utiyama', 0.031), ('tf', 0.03), ('applicant', 0.03), ('bagged', 0.03), ('billions', 0.03), ('canini', 0.03), ('circuit', 0.03), ('discern', 0.03), ('examiner', 0.03), ('ganjisaffar', 0.03), ('goel', 0.03), ('graf', 0.03), ('japaneseenglish', 0.03), ('maptest', 0.03), ('nikoulina', 0.03), ('parallelization', 0.03), ('pavlov', 0.03), ('smucker', 0.03), ('tuwien', 0.03), ('walid', 0.03), ('yanjun', 0.03), ('agnostic', 0.03), ('freund', 0.03), ('hash', 0.03), ('list', 0.03), ('ef', 0.029), ('citation', 0.029), ('points', 0.028), ('ranked', 0.028), ('yoram', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999917 39 emnlp-2013-Boosting Cross-Language Retrieval by Learning Bilingual Phrase Associations from Relevance Rankings
Author: Artem Sokolov ; Laura Jehl ; Felix Hieber ; Stefan Riezler
Abstract: We present an approach to learning bilingual n-gram correspondences from relevance rankings of English documents for Japanese queries. We show that directly optimizing cross-lingual rankings rivals and complements machine translation-based cross-language information retrieval (CLIR). We propose an efficient boosting algorithm that deals with very large cross-product spaces of word correspondences. We show in an experimental evaluation on patent prior art search that our approach, and in particular a consensus-based combination of boosting and translation-based approaches, yields substantial improvements in CLIR performance. Our training and test data are made publicly available.
2 0.16311027 97 emnlp-2013-Identifying Web Search Query Reformulation using Concept based Matching
Author: Ahmed Hassan
Abstract: Web search users frequently modify their queries in hope of receiving better results. This process is referred to as “Query Reformulation”. Previous research has mainly focused on proposing query reformulations in the form of suggested queries for users. Some research has studied the problem of predicting whether the current query is a reformulation of the previous query or not. However, this work has been limited to bag-of-words models where the main signals being used are word overlap, character level edit distance and word level edit distance. In this work, we show that relying solely on surface level text similarity results in many false positives where queries with different intents yet similar topics are mistakenly predicted as query reformulations. We propose a new representation for Web search queries based on identifying the concepts in queries and show that we can sig- nificantly improve query reformulation performance using features of query concepts.
3 0.11351606 105 emnlp-2013-Improving Web Search Ranking by Incorporating Structured Annotation of Queries
Author: Xiao Ding ; Zhicheng Dou ; Bing Qin ; Ting Liu ; Ji-rong Wen
Abstract: Web users are increasingly looking for structured data, such as lyrics, job, or recipes, using unstructured queries on the web. However, retrieving relevant results from such data is a challenging problem due to the unstructured language of the web queries. In this paper, we propose a method to improve web search ranking by detecting Structured Annotation of queries based on top search results. In a structured annotation, the original query is split into different units that are associated with semantic attributes in the corresponding domain. We evaluate our techniques using real world queries and achieve significant improvement. . 1
4 0.10673691 136 emnlp-2013-Multi-Domain Adaptation for SMT Using Multi-Task Learning
Author: Lei Cui ; Xilun Chen ; Dongdong Zhang ; Shujie Liu ; Mu Li ; Ming Zhou
Abstract: Domain adaptation for SMT usually adapts models to an individual specific domain. However, it often lacks some correlation among different domains where common knowledge could be shared to improve the overall translation quality. In this paper, we propose a novel multi-domain adaptation approach for SMT using Multi-Task Learning (MTL), with in-domain models tailored for each specific domain and a general-domain model shared by different domains. The parameters of these models are tuned jointly via MTL so that they can learn general knowledge more accurately and exploit domain knowledge better. Our experiments on a largescale English-to-Chinese translation task validate that the MTL-based adaptation approach significantly and consistently improves the translation quality compared to a non-adapted baseline. Furthermore, it also outperforms the individual adaptation of each specific domain.
5 0.097620621 123 emnlp-2013-Learning to Rank Lexical Substitutions
Author: Gyorgy Szarvas ; Robert Busa-Fekete ; Eyke Hullermeier
Abstract: The problem to replace a word with a synonym that fits well in its sentential context is known as the lexical substitution task. In this paper, we tackle this task as a supervised ranking problem. Given a dataset of target words, their sentential contexts and the potential substitutions for the target words, the goal is to train a model that accurately ranks the candidate substitutions based on their contextual fitness. As a key contribution, we customize and evaluate several learning-to-rank models to the lexical substitution task, including classification-based and regression-based approaches. On two datasets widely used for lexical substitution, our best models signifi- cantly advance the state-of-the-art.
6 0.095657386 139 emnlp-2013-Noise-Aware Character Alignment for Bootstrapping Statistical Machine Transliteration from Bilingual Corpora
7 0.093275629 127 emnlp-2013-Max-Margin Synchronous Grammar Induction for Machine Translation
8 0.083671711 135 emnlp-2013-Monolingual Marginal Matching for Translation Model Adaptation
9 0.08187443 104 emnlp-2013-Improving Statistical Machine Translation with Word Class Models
10 0.080735572 173 emnlp-2013-Simulating Early-Termination Search for Verbose Spoken Queries
11 0.080690831 96 emnlp-2013-Identifying Phrasal Verbs Using Many Bilingual Corpora
12 0.076317988 52 emnlp-2013-Converting Continuous-Space Language Models into N-Gram Language Models for Statistical Machine Translation
13 0.075348027 15 emnlp-2013-A Systematic Exploration of Diversity in Machine Translation
14 0.074479952 84 emnlp-2013-Factored Soft Source Syntactic Constraints for Hierarchical Machine Translation
15 0.073064484 20 emnlp-2013-An Efficient Language Model Using Double-Array Structures
16 0.072723567 24 emnlp-2013-Application of Localized Similarity for Web Documents
17 0.071485087 103 emnlp-2013-Improving Pivot-Based Statistical Machine Translation Using Random Walk
18 0.070661254 169 emnlp-2013-Semi-Supervised Representation Learning for Cross-Lingual Text Classification
19 0.067843057 171 emnlp-2013-Shift-Reduce Word Reordering for Machine Translation
20 0.063619271 148 emnlp-2013-Orthonormal Explicit Topic Analysis for Cross-Lingual Document Matching
topicId topicWeight
[(0, -0.213), (1, -0.099), (2, 0.006), (3, 0.044), (4, 0.051), (5, 0.007), (6, 0.065), (7, 0.118), (8, 0.048), (9, -0.158), (10, -0.042), (11, 0.094), (12, -0.035), (13, -0.092), (14, 0.074), (15, -0.046), (16, -0.099), (17, -0.175), (18, 0.128), (19, 0.009), (20, 0.039), (21, -0.088), (22, -0.119), (23, 0.114), (24, -0.101), (25, 0.034), (26, -0.005), (27, 0.022), (28, 0.089), (29, 0.035), (30, 0.022), (31, 0.015), (32, -0.029), (33, -0.009), (34, 0.005), (35, 0.027), (36, -0.054), (37, -0.049), (38, -0.088), (39, -0.002), (40, 0.052), (41, -0.082), (42, 0.027), (43, 0.12), (44, 0.019), (45, 0.085), (46, 0.159), (47, 0.031), (48, 0.012), (49, -0.033)]
simIndex simValue paperId paperTitle
same-paper 1 0.92801559 39 emnlp-2013-Boosting Cross-Language Retrieval by Learning Bilingual Phrase Associations from Relevance Rankings
Author: Artem Sokolov ; Laura Jehl ; Felix Hieber ; Stefan Riezler
Abstract: We present an approach to learning bilingual n-gram correspondences from relevance rankings of English documents for Japanese queries. We show that directly optimizing cross-lingual rankings rivals and complements machine translation-based cross-language information retrieval (CLIR). We propose an efficient boosting algorithm that deals with very large cross-product spaces of word correspondences. We show in an experimental evaluation on patent prior art search that our approach, and in particular a consensus-based combination of boosting and translation-based approaches, yields substantial improvements in CLIR performance. Our training and test data are made publicly available.
2 0.70728827 97 emnlp-2013-Identifying Web Search Query Reformulation using Concept based Matching
Author: Ahmed Hassan
Abstract: Web search users frequently modify their queries in hope of receiving better results. This process is referred to as “Query Reformulation”. Previous research has mainly focused on proposing query reformulations in the form of suggested queries for users. Some research has studied the problem of predicting whether the current query is a reformulation of the previous query or not. However, this work has been limited to bag-of-words models where the main signals being used are word overlap, character level edit distance and word level edit distance. In this work, we show that relying solely on surface level text similarity results in many false positives where queries with different intents yet similar topics are mistakenly predicted as query reformulations. We propose a new representation for Web search queries based on identifying the concepts in queries and show that we can sig- nificantly improve query reformulation performance using features of query concepts.
3 0.65066183 173 emnlp-2013-Simulating Early-Termination Search for Verbose Spoken Queries
Author: Jerome White ; Douglas W. Oard ; Nitendra Rajput ; Marion Zalk
Abstract: Building search engines that can respond to spoken queries with spoken content requires that the system not just be able to find useful responses, but also that it know when it has heard enough about what the user wants to be able to do so. This paper describes a simulation study with queries spoken by non-native speakers that suggests that indicates that finding relevant content is often possible within a half minute, and that combining features based on automatically recognized words with features designed for automated prediction of query difficulty can serve as a useful basis for predicting when that useful content has been found.
4 0.59035563 105 emnlp-2013-Improving Web Search Ranking by Incorporating Structured Annotation of Queries
Author: Xiao Ding ; Zhicheng Dou ; Bing Qin ; Ting Liu ; Ji-rong Wen
Abstract: Web users are increasingly looking for structured data, such as lyrics, job, or recipes, using unstructured queries on the web. However, retrieving relevant results from such data is a challenging problem due to the unstructured language of the web queries. In this paper, we propose a method to improve web search ranking by detecting Structured Annotation of queries based on top search results. In a structured annotation, the original query is split into different units that are associated with semantic attributes in the corresponding domain. We evaluate our techniques using real world queries and achieve significant improvement. . 1
5 0.50182021 20 emnlp-2013-An Efficient Language Model Using Double-Array Structures
Author: Makoto Yasuhara ; Toru Tanaka ; Jun-ya Norimatsu ; Mikio Yamamoto
Abstract: Ngram language models tend to increase in size with inflating the corpus size, and consume considerable resources. In this paper, we propose an efficient method for implementing ngram models based on doublearray structures. First, we propose a method for representing backwards suffix trees using double-array structures and demonstrate its efficiency. Next, we propose two optimization methods for improving the efficiency of data representation in the double-array structures. Embedding probabilities into unused spaces in double-array structures reduces the model size. Moreover, tuning the word IDs in the language model makes the model smaller and faster. We also show that our method can be used for building large language models using the division method. Lastly, we show that our method outperforms methods based on recent related works from the viewpoints of model size and query speed when both optimization methods are used.
7 0.49988163 95 emnlp-2013-Identifying Multiple Userids of the Same Author
9 0.47773686 123 emnlp-2013-Learning to Rank Lexical Substitutions
10 0.43427014 24 emnlp-2013-Application of Localized Similarity for Web Documents
11 0.42130116 103 emnlp-2013-Improving Pivot-Based Statistical Machine Translation Using Random Walk
12 0.4181937 104 emnlp-2013-Improving Statistical Machine Translation with Word Class Models
13 0.39209121 136 emnlp-2013-Multi-Domain Adaptation for SMT Using Multi-Task Learning
14 0.38641018 135 emnlp-2013-Monolingual Marginal Matching for Translation Model Adaptation
15 0.37500042 28 emnlp-2013-Automated Essay Scoring by Maximizing Human-Machine Agreement
16 0.36136556 148 emnlp-2013-Orthonormal Explicit Topic Analysis for Cross-Lingual Document Matching
17 0.3597492 15 emnlp-2013-A Systematic Exploration of Diversity in Machine Translation
18 0.35630244 107 emnlp-2013-Interactive Machine Translation using Hierarchical Translation Models
19 0.35481584 13 emnlp-2013-A Study on Bootstrapping Bilingual Vector Spaces from Non-Parallel Data (and Nothing Else)
20 0.35339546 127 emnlp-2013-Max-Margin Synchronous Grammar Induction for Machine Translation
topicId topicWeight
[(3, 0.024), (10, 0.018), (18, 0.031), (22, 0.058), (29, 0.301), (30, 0.092), (50, 0.023), (51, 0.189), (66, 0.03), (71, 0.032), (75, 0.025), (77, 0.027), (90, 0.013), (96, 0.028)]
simIndex simValue paperId paperTitle
Author: Yufang Hou ; Katja Markert ; Michael Strube
Abstract: Recognizing bridging anaphora is difficult due to the wide variation within the phenomenon, the resulting lack of easily identifiable surface markers and their relative rarity. We develop linguistically motivated discourse structure, lexico-semantic and genericity detection features and integrate these into a cascaded minority preference algorithm that models bridging recognition as a subtask of learning finegrained information status (IS). We substantially improve bridging recognition without impairing performance on other IS classes.
2 0.84228635 97 emnlp-2013-Identifying Web Search Query Reformulation using Concept based Matching
Author: Ahmed Hassan
Abstract: Web search users frequently modify their queries in hope of receiving better results. This process is referred to as “Query Reformulation”. Previous research has mainly focused on proposing query reformulations in the form of suggested queries for users. Some research has studied the problem of predicting whether the current query is a reformulation of the previous query or not. However, this work has been limited to bag-of-words models where the main signals being used are word overlap, character level edit distance and word level edit distance. In this work, we show that relying solely on surface level text similarity results in many false positives where queries with different intents yet similar topics are mistakenly predicted as query reformulations. We propose a new representation for Web search queries based on identifying the concepts in queries and show that we can sig- nificantly improve query reformulation performance using features of query concepts.
Author: Mikhail Ageev ; Dmitry Lagun ; Eugene Agichtein
Abstract: Passage retrieval is a crucial first step of automatic Question Answering (QA). While existing passage retrieval algorithms are effective at selecting document passages most similar to the question, or those that contain the expected answer types, they do not take into account which parts of the document the searchers actually found useful. We propose, to the best of our knowledge, the first successful attempt to incorporate searcher examination data into passage retrieval for question answering. Specifically, we exploit detailed examination data, such as mouse cursor movements and scrolling, to infer the parts of the document the searcher found interesting, and then incorporate this signal into passage retrieval for QA. Our extensive experiments and analysis demonstrate that our method significantly improves passage retrieval, compared to using textual features alone. As an additional contribution, we make available to the research community the code and the search behavior data used in this study, with the hope of encouraging further research in this area.
same-paper 4 0.78204483 39 emnlp-2013-Boosting Cross-Language Retrieval by Learning Bilingual Phrase Associations from Relevance Rankings
Author: Artem Sokokov ; Laura Jehl ; Felix Hieber ; Stefan Riezler
Abstract: We present an approach to learning bilingual n-gram correspondences from relevance rankings of English documents for Japanese queries. We show that directly optimizing cross-lingual rankings rivals and complements machine translation-based cross-language information retrieval (CLIR). We propose an efficient boosting algorithm that deals with very large cross-product spaces of word correspondences. We show in an experimental evaluation on patent prior art search that our approach, and in particular a consensus-based combination of boosting and translation-based approaches, yields substantial improvements in CLIR performance. Our training and test data are made publicly available.
5 0.635149 105 emnlp-2013-Improving Web Search Ranking by Incorporating Structured Annotation of Queries
Author: Xiao Ding ; Zhicheng Dou ; Bing Qin ; Ting Liu ; Ji-rong Wen
Abstract: Web users are increasingly looking for structured data, such as lyrics, job, or recipes, using unstructured queries on the web. However, retrieving relevant results from such data is a challenging problem due to the unstructured language of the web queries. In this paper, we propose a method to improve web search ranking by detecting Structured Annotation of queries based on top search results. In a structured annotation, the original query is split into different units that are associated with semantic attributes in the corresponding domain. We evaluate our techniques using real world queries and achieve significant improvement. . 1
6 0.59775728 108 emnlp-2013-Interpreting Anaphoric Shell Nouns using Antecedents of Cataphoric Shell Nouns as Training Data
7 0.59690577 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging
8 0.5958603 95 emnlp-2013-Identifying Multiple Userids of the Same Author
9 0.59469551 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization
10 0.59457302 48 emnlp-2013-Collective Personal Profile Summarization with Social Networks
11 0.59407341 114 emnlp-2013-Joint Learning and Inference for Grammatical Error Correction
12 0.59356099 82 emnlp-2013-Exploring Representations from Unlabeled Data with Co-training for Chinese Word Segmentation
13 0.59276152 194 emnlp-2013-Unsupervised Relation Extraction with General Domain Knowledge
14 0.59234416 13 emnlp-2013-A Study on Bootstrapping Bilingual Vector Spaces from Non-Parallel Data (and Nothing Else)
15 0.59199858 175 emnlp-2013-Source-Side Classifier Preordering for Machine Translation
16 0.59103757 143 emnlp-2013-Open Domain Targeted Sentiment
17 0.59080321 79 emnlp-2013-Exploiting Multiple Sources for Open-Domain Hypernym Discovery
18 0.59045315 89 emnlp-2013-Gender Inference of Twitter Users in Non-English Contexts
19 0.59029794 47 emnlp-2013-Collective Opinion Target Extraction in Chinese Microblogs
20 0.59022343 15 emnlp-2013-A Systematic Exploration of Diversity in Machine Translation