acl acl2012 acl2012-35 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Xiaobing Xue ; Yu Tao ; Daxin Jiang ; Hang Li
Abstract: Natural language questions have become popular in web search. However, various questions can be formulated to convey the same information need, which poses a great challenge to search systems. In this paper, we automatically mined 5w1h question reformulation patterns from large scale search log data. The question reformulations generated from these patterns are further incorporated into the retrieval model. Experiments show that using question reformulation patterns can significantly improve the search performance of natural language questions.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Natural language questions have become popular in web search. [sent-7, score-0.147]
2 However, various questions can be formulated to convey the same information need, which poses a great challenge to search systems. [sent-8, score-0.243]
3 In this paper, we automatically mined 5w1h question reformulation patterns from large scale search log data. [sent-9, score-1.223]
4 The question reformulations generated from these patterns are further incorporated into the retrieval model. [sent-10, score-0.899]
5 Experiments show that using question reformulation patterns can significantly improve the search performance of natural language questions. [sent-11, score-1.116]
6 1 Introduction More and more web users tend to use natural language questions as queries for web search. [sent-12, score-0.292]
7 Some commercial natural language search engines such as InQuira and Ask have also been developed to answer this type of query. [sent-13, score-0.108]
8 One major challenge is that various questions can be formulated for the same information need. [sent-14, score-0.111]
9 Table 1 shows some alternative expressions for the question “how far is it from Boston to Seattle”. [sent-15, score-0.394]
10 It is difficult for search systems to achieve satisfactory retrieval performance without considering these alternative expressions. [sent-16, score-0.164]
11 In this paper, we propose a method of automatically mining 5w1h question reformulation patterns to improve the search relevance of 5w1h questions. [sent-17, score-0.871]
12 Question reformulations represent the alternative expressions for 5w1h questions. [sent-18, score-0.498]
13 A question reformulation pattern generalizes such reformulations by replacing specific words with slots. (∗ Contribution during internship at Microsoft Research Asia. 1 5w1h questions start with “Who”, “What”, “Where”, “When”, “Why” and “How”.) [sent-19, score-0.367]
14 For example, users may ask similar questions “how far is it from X1 to X2” where X1 and X2 represent some other cities besides Boston and Seattle. [sent-21, score-0.156]
15 Then, question reformulations similar to those in Table 1 will be generated with the city names changed. [sent-22, score-0.701]
16 These patterns increase the coverage of the system by handling the queries that did not appear before but share similar structures as previous queries. [sent-23, score-0.227]
17 Using reformulation patterns as the key concept, we propose a question reformulation framework. [sent-24, score-1.67]
18 First, we mine the question reformulation patterns from search logs that record users’ reformulation behavior. [sent-25, score-1.832]
19 Second, given a new question, we use the most relevant reformulation patterns to generate question reformulations and each of the reformulations is associated with its probability. [sent-26, score-1.891]
20 Third, the original question and these question reformulations are then combined for retrieval. [sent-27, score-1.03]
21 First, we propose a simple yet effective approach to automatically mine 5w1h question reformulation patterns. [sent-29, score-0.953]
22 Second, we conduct comprehensive studies in improving the search performance of 5w1h questions using the mined patterns. [sent-30, score-0.189]
23 2 Related Work In the Natural Language Processing (NLP) area, different expressions that convey the same meaning are referred to as paraphrases (Lin and Pantel, 2001; Barzilay and McKeown, 2001; Pang et al. [sent-34, score-0.143]
24 , 2006), question answering (Ravichandran and Hovy, 2002) and document summarization (McKeown et al. [sent-38, score-0.277]
25 Yet, little research has considered improving web search performance using paraphrases. [sent-40, score-0.13]
26 Query logs have become an important resource for many NLP applications such as class and attribute extraction (Paşca and Van Durme, 2008), paraphrasing (Zhao et al. [sent-41, score-0.093]
27 Little research has been conducted to automatically mine 5w1h question reformulation patterns from query logs. [sent-44, score-1.303]
28 Different techniques have been developed for query segmentation (Bergsma and Wang, 2007; Tan and Peng, 2008) and query substitution (Jones et al. [sent-48, score-0.422]
29 Yet, most previous research focused on keyword queries without considering 5w1h questions. [sent-50, score-0.088]
30 Table 2: Question reformulation patterns generated for the query pair (“how far is it from Boston to Seattle”, “distance from Boston to Seattle”). [sent-53, score-1.02]
31 1 Generating Reformulation Patterns From the search log, we extract all successive query pairs issued by the same user within a certain time period where the first query is a 5w1h question. [sent-55, score-0.594]
32 In such a query pair, the second query is considered a question reformulation. [sent-56, score-0.699]
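As a rough illustration of this extraction step, the Python sketch below pairs successive queries from a sorted log and keeps the pairs whose first query is a 5w1h question. The log record layout (user id, timestamp, query), the helper names, and the simple prefix-based 5w1h test are assumptions for the example, not the paper's implementation; the 30-minute window matches the value used later in the experiments.

```python
from datetime import timedelta

FIVE_W_ONE_H = ("who", "what", "where", "when", "why", "how")

def is_5w1h_question(query):
    """True if the query starts with one of the 5w1h question words."""
    words = query.lower().split()
    return bool(words) and words[0] in FIVE_W_ONE_H

def extract_reformulation_pairs(log_records, window=timedelta(minutes=30)):
    """Collect successive query pairs (q, qr) issued by the same user within
    `window`, where the first query q is a 5w1h question.

    `log_records` is assumed to be an iterable of (user_id, timestamp, query)
    tuples, sorted by user and then by time.
    """
    pairs = []
    prev_user = prev_time = prev_query = None
    for user, ts, query in log_records:
        if (user == prev_user
                and prev_query is not None
                and ts - prev_time <= window
                and is_5w1h_question(prev_query)):
            pairs.append((prev_query, query))  # the follow-up query is the reformulation qr
        prev_user, prev_time, prev_query = user, ts, query
    return pairs
```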
33 The pattern generation algorithm takes the query pair set S = {(q, qr)} as input and outputs a pattern base consisting of 5w1h question reformulation patterns, i.e., pairs (p, pr). [sent-59, score-0.952]
34 Specifically, for each query pair (q, qr), we first collect all common words between q and qr except for stopwords ST. [sent-62, score-0.211]
35 That is, CW = {w | w ∈ q, w ∈ qr, w ∉ ST}. [sent-63, score-0.096]
36 The words in Si (a subset of CW) are then replaced with slots in q and qr to construct a reformulation pattern. [sent-65, score-0.696]
37 Finally, the patterns observed in many different query pairs are kept. [sent-67, score-0.35]
38 In other words, we rely on the frequency of a pattern to filter noisy patterns. [sent-68, score-0.048]
39 Generating patterns using more NLP features, such as parsing information, will be studied in future work. [sent-69, score-0.164]
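A minimal sketch of the pattern-generation step is given below, assuming whitespace tokenization, a tiny illustrative stopword list, and a simple frequency threshold. It replaces all common non-stopword words with slots at once; the paper's exact slot-selection (the subsets Si) is not reproduced here.

```python
from collections import Counter

STOPWORDS = {"is", "it", "the", "a", "an", "of", "to", "from", "in", "for"}  # illustrative only

def make_pattern(q, qr):
    """Replace the words shared by q and qr (minus stopwords) with slots X1, X2, ..."""
    q_words, qr_words = q.lower().split(), qr.lower().split()
    common = [w for w in q_words if w in qr_words and w not in STOPWORDS]
    slots = {w: "X%d" % (i + 1) for i, w in enumerate(dict.fromkeys(common))}
    p = " ".join(slots.get(w, w) for w in q_words)
    pr = " ".join(slots.get(w, w) for w in qr_words)
    return p, pr

def mine_patterns(query_pairs, min_freq=5):
    """Keep only the patterns observed in at least `min_freq` query pairs."""
    counts = Counter(make_pattern(q, qr) for q, qr in query_pairs)
    return {pattern: freq for pattern, freq in counts.items() if freq >= min_freq}

# e.g. make_pattern("how far is it from Boston to Seattle",
#                   "distance from Boston to Seattle")
# -> ("how far is it from X1 to X2", "distance from X1 to X2")
```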
40 We select the pattern that has the most prefix words, since this pattern is more likely to have the same information as qnew. [sent-72, score-0.126]
41 If several patterns have the same number of prefix words, we use the total number of words to break the tie. [sent-75, score-0.169]
42 After picking the best question pattern p⋆, we further rank all question reformulation patterns containing p⋆, i.e., patterns of the form (p⋆, pr). [sent-76, score-1.412]
43 The probability P(pr|p⋆) associated with the pattern (p⋆, pr) is assigned to the corresponding question reformulation qrnew. [sent-82, score-1.045]
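The selection and generation steps might be sketched as follows. The pattern base layout ({p: [(pr, P(pr|p)), ...]}), the token-by-token matching routine, and the slot-binding scheme are illustrative assumptions rather than the authors' implementation.

```python
def match(pattern, question):
    """Token-by-token alignment of a slotted pattern with a question.
    Returns a slot binding such as {"X1": "boston"}, or None if any
    non-slot word differs (positional alignment is a simplification)."""
    p_words, q_words = pattern.split(), question.lower().split()
    if len(p_words) != len(q_words):
        return None
    binding = {}
    for pw, qw in zip(p_words, q_words):
        if pw.startswith("X") and pw[1:].isdigit():
            binding[pw] = qw
        elif pw != qw:
            return None
    return binding

def prefix_len(pattern, question):
    """Number of leading literal words the pattern shares with the question."""
    n = 0
    for pw, qw in zip(pattern.split(), question.lower().split()):
        if pw != qw:
            break
        n += 1
    return n

def generate_reformulations(question, pattern_base, k=10):
    """pattern_base maps a question pattern p to [(pr, P(pr|p)), ...].
    Pick the best-matching p* (most prefix words, ties broken by total length),
    then instantiate its top-k reformulation patterns with the slot values
    bound from the question."""
    matches = {p: match(p, question) for p in pattern_base}
    candidates = [p for p, b in matches.items() if b is not None]
    if not candidates:
        return []
    p_star = max(candidates, key=lambda p: (prefix_len(p, question), len(p.split())))
    binding = matches[p_star]
    ranked = sorted(pattern_base[p_star], key=lambda x: -x[1])[:k]
    return [(" ".join(binding.get(w, w) for w in pr.split()), prob) for pr, prob in ranked]
```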
44 3 Retrieval Model Given the original question qnew and k question reformulations {qrnew}, the query distribution model (Xue and Croft, 2010) (denoted as QDist) is adopted to combine qnew and {qrnew} using their associated probabilities. [sent-84, score-1.519]
45 The score of a document D, score(qnew, D), is calculated as follows: score(qnew, D) = λ log P(qnew|D) + (1 − λ) Σ_{i=1}^{k} P(pri|p⋆) log P(qr^i_new|D) (2). In Eq. [sent-87, score-0.055]
46 2, λ is a parameter that indicates the probability assigned to the original query. [sent-88, score-0.052]
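Eq. 2 translates directly into code. In the sketch below, log_p is a stand-in for an existing query-likelihood scorer returning log P(query|document), and the λ value is an arbitrary illustrative setting, not the paper's tuned parameter.

```python
def qdist_score(q_new, reformulations, doc, log_p, lam=0.8):
    """score(q_new, D) = lam * log P(q_new|D)
                       + (1 - lam) * sum_i P(pr_i|p*) * log P(qr_i|D)   (Eq. 2)
    `reformulations` is a list of (qr_i, P(pr_i|p*)) pairs; `log_p(query, doc)`
    stands in for a query-likelihood language model returning log P(query|doc)."""
    score = lam * log_p(q_new, doc)
    score += (1 - lam) * sum(prob * log_p(qr, doc) for qr, prob in reformulations)
    return score
```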
47 4 Experiments A large scale search log from a commercial search engine (2011. [sent-91, score-0.289]
48 From the search log, we extract all successive query pairs issued by the same user within 30 minutes (Boldi et al. [sent-94, score-0.405]
49 , 2008) where the first query is a 5w1h question. [sent-95, score-0.211]
50 For the retrieval experiments, we randomly sample 10,000 natural language questions as queries. (3: In web search, queries issued within 30 minutes are usually considered to have the same information need.) [sent-97, score-0.477]
51 Table 4: Retrieval Performance of using question reformulations. [sent-98, score-0.277]
52 For each question, we generate the top ten question reformulations. [sent-104, score-0.121]
53 A web collection from a commercial search engine is used for retrieval experiments. [sent-106, score-0.251]
54 1 Examples and Performance Table 3 shows examples of the generated question reformulations. [sent-110, score-0.09]
55 Several interesting expressions are generated to reformulate the original question. [sent-111, score-0.094]
56 We compare the retrieval performance of using the question reformulations (QDist) with the performance of using the original question (Orig) in Table 4. [sent-112, score-1.089]
57 Table 4 shows that using the question reformulations can significantly improve the retrieval performance of natural language questions. [sent-115, score-0.76]
58 Note that, considering the scale of experiments (10,000 queries), around 3% improvement with respect to NDCG is a very interesting result for web search. [sent-116, score-0.083]
59 2 Analysis In this subsection, we analyze the results to better understand the effect of question reformulations. [sent-118, score-0.277]
60 First, we report the performance of always picking the best question reformulation for each query (denoted as Upper) in Table 5, which provides an upper bound for the performance of the question reformulation. [sent-119, score-1.159]
61 Table 6: Best reformulation within different positions. [sent-125, score-0.648]
63 Table 5 shows that if we were always able to pick the best question reformulation, the performance of Orig could be improved by around 30% (from 0. [sent-130, score-0.321]
64 It indicates that we do generate some high quality question reformulations. [sent-133, score-0.277]
65 Table 6 further reports the percent of those 10,000 queries where the best question reformulation can be observed in the top 1 position, within the top 2 positions, and within the top 3 positions, respectively. [sent-134, score-1.154]
66 Table 6 shows that for most queries, our method successfully ranks the best reformulation within the top 3 positions. [sent-135, score-0.679]
67 Second, we study the effect of different types of question reformulations. [sent-136, score-0.277]
68 We roughly divide the question reformulations generated by our method into five categories as shown in Table 7. [sent-137, score-0.701]
69 For each category, we report the percentage of reformulations whose performance is better than, worse than, or equal to that of the original question. [sent-138, score-0.503]
70 Table 7 shows that the “more specific” reformulations and the “equivalent” reformulations are more likely to improve over the original question. [sent-139, score-0.9]
71 Reformulations that make a “morphological change” do not have much effect on improving the original question. [sent-140, score-0.052]
72 “More general” and “not relevant” reformulations usually decrease the performance. [sent-141, score-0.424]
73 Third, we conduct an error analysis on the question reformulations that decrease the performance of the original question. [sent-142, score-0.753]
74 First, some important words are removed from the original question. [sent-144, score-0.052]
75 For example, “what is the role of corporate executives” is reformulated as “corporate executives”. [sent-145, score-0.055]
76 For example, “how to effectively organize your classroom” is reformulated as “how to effectively organize your elementary classroom”. [sent-147, score-0.123]
77 Third, some reformulations entirely change the meaning of the original question. Table 7: Analysis of different types of reformulations. [sent-148, score-0.424]
78 For example, “what is the adjective of anxiously” is reformulated as “what is the noun of anxiously”. [sent-154, score-0.055]
79 Fourth, we compare our question reformulation method with two long query processing techniques, i.e., NoStop and DropOne. [sent-155, score-1.115]
80 NoStop removes all stopwords in the query and DropOne learns to drop a single word from the query. [sent-159, score-0.238]
81 Table 8 reports the retrieval performance of different methods. [sent-162, score-0.059]
82 Table 8 shows that both NoStop and DropOne perform worse than using the original question, which indicates that the general techniques developed for long queries are not appropriate for natural language questions. [sent-163, score-0.14]
83 5 Conclusion Improving the search relevance of natural language questions poses a great challenge for search systems. [sent-165, score-0.299]
84 We propose to automatically mine 5w1h question reformulation patterns from search log data. [sent-166, score-1.22]
85 The effectiveness of the extracted patterns has been shown on web search. [sent-167, score-0.196]
86 These patterns are potentially useful for many other applications, which will be studied in future work. [sent-168, score-0.164]
87 How to automatically classify the extracted patterns is also an interesting future issue. [sent-169, score-0.139]
88 In Proceeding of the 33rd international ACM SIGIR conference on Research and development in information retrieval, pages 571–578. [sent-179, score-0.043]
89 In Proceedings of the 43rd Annual Meeting on Association for Compu- tational Linguistics, pages 597–604. [sent-186, score-0.043]
90 In Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, pages 50–57. [sent-194, score-0.043]
91 Large scale acquisition of paraphrases for learning surface patterns. [sent-207, score-0.121]
92 From “Dango” to “Japanese Cakes”: Query reformulation models and patterns. [sent-219, score-0.627]
93 IEEE/WIC/ACM International Joint Conferences on, volume 1, pages 183–190. [sent-222, score-0.043]
94 Exploring web scale language models for search query processing. [sent-241, score-0.367]
95 Syntax-based alignment of multiple translations: Extracting paraphrases and generating new sentences. [sent-302, score-0.095]
96 Weakly-supervised acquisition of open-domain classes and class attributes from web documents and query logs. [sent-315, score-0.29]
97 Learning surface text patterns for a question answering system. [sent-329, score-0.416]
98 Unsupervised query segmentation using generative language models and Wikipedia. [sent-335, score-0.211]
99 Mining term association patterns from search logs for effective query reformulation. [sent-341, score-0.485]
100 Pivot approach for extracting paraphrase patterns from bilingual corpora. [sent-362, score-0.161]
wordName wordTfidf (topN-words)
[('reformulation', 0.627), ('reformulations', 0.424), ('question', 0.277), ('query', 0.211), ('seattle', 0.163), ('qnew', 0.139), ('patterns', 0.139), ('boston', 0.138), ('boldi', 0.093), ('qrnew', 0.093), ('questions', 0.09), ('queries', 0.088), ('search', 0.073), ('paraphrases', 0.073), ('balasubramanian', 0.07), ('dropone', 0.07), ('nostop', 0.07), ('qdist', 0.07), ('qr', 0.069), ('croft', 0.059), ('retrieval', 0.059), ('web', 0.057), ('pr', 0.056), ('ndcg', 0.055), ('reformulated', 0.055), ('log', 0.055), ('paraphrasing', 0.053), ('tan', 0.053), ('issued', 0.052), ('original', 0.052), ('mine', 0.049), ('pattern', 0.048), ('anxiously', 0.046), ('castillo', 0.046), ('classroom', 0.046), ('executives', 0.046), ('orig', 0.046), ('picking', 0.044), ('pages', 0.043), ('far', 0.043), ('pas', 0.043), ('expressions', 0.042), ('barzilay', 0.041), ('bonchi', 0.04), ('huston', 0.04), ('ponte', 0.04), ('pri', 0.04), ('logs', 0.04), ('ravichandran', 0.04), ('kauchak', 0.037), ('mckeown', 0.035), ('zhao', 0.035), ('commercial', 0.035), ('organize', 0.034), ('cw', 0.032), ('bannard', 0.032), ('relevance', 0.032), ('alternative', 0.032), ('jansen', 0.031), ('bergsma', 0.031), ('bhagat', 0.031), ('poses', 0.031), ('top', 0.031), ('xue', 0.03), ('prefix', 0.03), ('wang', 0.029), ('dis', 0.028), ('convey', 0.028), ('stopwords', 0.027), ('engine', 0.027), ('zhai', 0.027), ('percent', 0.027), ('scale', 0.026), ('successive', 0.026), ('mined', 0.026), ('proceeding', 0.026), ('studied', 0.025), ('jones', 0.024), ('cro', 0.024), ('upper', 0.023), ('ask', 0.023), ('minutes', 0.022), ('pang', 0.022), ('acquisition', 0.022), ('generating', 0.022), ('paraphrase', 0.022), ('association', 0.022), ('ca', 0.021), ('formulated', 0.021), ('within', 0.021), ('asia', 0.021), ('sigir', 0.021), ('kuansan', 0.02), ('xiaolong', 0.02), ('opr', 0.02), ('aristides', 0.02), ('booth', 0.02), ('burch', 0.02), ('cbe', 0.02), ('francesco', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 35 acl-2012-Automatically Mining Question Reformulation Patterns from Search Log Data
Author: Xiaobing Xue ; Yu Tao ; Daxin Jiang ; Hang Li
Abstract: Natural language questions have become popular in web search. However, various questions can be formulated to convey the same information need, which poses a great challenge to search systems. In this paper, we automatically mined 5w1h question reformulation patterns from large scale search log data. The question reformulations generated from these patterns are further incorporated into the retrieval model. Experiments show that using question reformulation patterns can significantly improve the search performance of natural language questions.
2 0.13824619 142 acl-2012-Mining Entity Types from Query Logs via User Intent Modeling
Author: Patrick Pantel ; Thomas Lin ; Michael Gamon
Abstract: We predict entity type distributions in Web search queries via probabilistic inference in graphical models that capture how entity-bearing queries are generated. We jointly model the interplay between latent user intents that govern queries and unobserved entity types, leveraging observed signals from query formulations and document clicks. We apply the models to resolve entity types in new queries and to assign prior type distributions over an existing knowledge base. Our models are efficiently trained using maximum likelihood estimation over millions of real-world Web search queries. We show that modeling user intent significantly improves entity type resolution for head queries over the state of the art, on several metrics, without degradation in tail query performance.
3 0.11969363 212 acl-2012-Using Search-Logs to Improve Query Tagging
Author: Kuzman Ganchev ; Keith Hall ; Ryan McDonald ; Slav Petrov
Abstract: Syntactic analysis of search queries is important for a variety of information-retrieval tasks; however, the lack of annotated data makes training query analysis models difficult. We propose a simple, efficient procedure in which part-of-speech tags are transferred from retrieval-result snippets to queries at training time. Unlike previous work, our final model does not require any additional resources at run-time. Compared to a state-ofthe-art approach, we achieve more than 20% relative error reduction. Additionally, we annotate a corpus of search queries with partof-speech tags, providing a resource for future work on syntactic query analysis.
4 0.10709044 55 acl-2012-Community Answer Summarization for Multi-Sentence Question with Group L1 Regularization
Author: Wen Chan ; Xiangdong Zhou ; Wei Wang ; Tat-Seng Chua
Abstract: We present a novel answer summarization method for community Question Answering services (cQAs) to address the problem of “incomplete answer”, i.e., the “best answer” of a complex multi-sentence question misses valuable information that is contained in other answers. In order to automatically generate a novel and non-redundant community answer summary, we segment the complex original multi-sentence question into several sub questions and then propose a general Conditional Random Field (CRF) based answer summary method with group L1 regularization. Various textual and non-textual QA features are explored. Specifically, we explore four different types of contextual factors, namely, the information novelty and non-redundancy modeling for local and non-local sentence interactions under question segmentation. To further unleash the potential of the abundant cQA features, we introduce the group L1 regularization for feature learning. Experimental results on a Yahoo! Answers dataset show that our proposed method significantly outperforms state-of-the-art methods on cQA summarization task.
5 0.10421934 217 acl-2012-Word Sense Disambiguation Improves Information Retrieval
Author: Zhi Zhong ; Hwee Tou Ng
Abstract: Previous research has conflicting conclusions on whether word sense disambiguation (WSD) systems can improve information retrieval (IR) performance. In this paper, we propose a method to estimate sense distributions for short queries. Together with the senses predicted for words in documents, we propose a novel approach to incorporate word senses into the language modeling approach to IR and also exploit the integration of synonym relations. Our experimental results on standard TREC collections show that using the word senses tagged by a supervised WSD system, we obtain significant improvements over a state-of-the-art IR system.
6 0.086253032 177 acl-2012-Sentence Dependency Tagging in Online Question Answering Forums
7 0.083456434 44 acl-2012-CSNIPER - Annotation-by-query for Non-canonical Constructions in Large Corpora
8 0.071664736 125 acl-2012-Joint Learning of a Dual SMT System for Paraphrase Generation
9 0.066657998 159 acl-2012-Pattern Learning for Relation Extraction with a Hierarchical Topic Model
10 0.06638889 134 acl-2012-Learning to Find Translations and Transliterations on the Web
11 0.055110868 66 acl-2012-DOMCAT: A Bilingual Concordancer for Domain-Specific Computer Assisted Translation
12 0.052538343 210 acl-2012-Unsupervized Word Segmentation: the Case for Mandarin Chinese
13 0.051323034 116 acl-2012-Improve SMT Quality with Automatically Extracted Paraphrase Rules
14 0.045036919 144 acl-2012-Modeling Review Comments
15 0.041361567 92 acl-2012-FLOW: A First-Language-Oriented Writing Assistant System
16 0.041213788 14 acl-2012-A Joint Model for Discovery of Aspects in Utterances
17 0.039811589 27 acl-2012-Arabic Retrieval Revisited: Morphological Hole Filling
18 0.039499305 99 acl-2012-Finding Salient Dates for Building Thematic Timelines
19 0.038145263 56 acl-2012-Computational Approaches to Sentence Completion
20 0.037028853 78 acl-2012-Efficient Search for Transformation-based Inference
topicId topicWeight
[(0, -0.119), (1, 0.034), (2, -0.001), (3, 0.043), (4, 0.046), (5, 0.096), (6, 0.016), (7, 0.015), (8, -0.011), (9, -0.031), (10, 0.081), (11, 0.102), (12, 0.003), (13, 0.084), (14, 0.057), (15, -0.025), (16, 0.081), (17, -0.044), (18, 0.021), (19, -0.037), (20, 0.18), (21, 0.186), (22, 0.086), (23, -0.01), (24, -0.121), (25, -0.147), (26, 0.093), (27, -0.018), (28, 0.001), (29, -0.055), (30, -0.084), (31, -0.024), (32, 0.06), (33, -0.037), (34, -0.022), (35, 0.002), (36, 0.096), (37, -0.047), (38, -0.026), (39, -0.084), (40, -0.063), (41, -0.11), (42, -0.036), (43, 0.065), (44, 0.022), (45, -0.013), (46, 0.044), (47, 0.029), (48, -0.13), (49, 0.005)]
simIndex simValue paperId paperTitle
same-paper 1 0.96202195 35 acl-2012-Automatically Mining Question Reformulation Patterns from Search Log Data
Author: Xiaobing Xue ; Yu Tao ; Daxin Jiang ; Hang Li
Abstract: Natural language questions have become popular in web search. However, various questions can be formulated to convey the same information need, which poses a great challenge to search systems. In this paper, we automatically mined 5w1h question reformulation patterns from large scale search log data. The question reformulations generated from these patterns are further incorporated into the retrieval model. Experiments show that using question reformulation patterns can significantly improve the search performance of natural language questions.
2 0.66661978 142 acl-2012-Mining Entity Types from Query Logs via User Intent Modeling
Author: Patrick Pantel ; Thomas Lin ; Michael Gamon
Abstract: We predict entity type distributions in Web search queries via probabilistic inference in graphical models that capture how entity-bearing queries are generated. We jointly model the interplay between latent user intents that govern queries and unobserved entity types, leveraging observed signals from query formulations and document clicks. We apply the models to resolve entity types in new queries and to assign prior type distributions over an existing knowledge base. Our models are efficiently trained using maximum likelihood estimation over millions of real-world Web search queries. We show that modeling user intent significantly improves entity type resolution for head queries over the state of the art, on several metrics, without degradation in tail query performance.
3 0.63827789 44 acl-2012-CSNIPER - Annotation-by-query for Non-canonical Constructions in Large Corpora
Author: Richard Eckart de Castilho ; Sabine Bartsch ; Iryna Gurevych
Abstract: We present CSNIPER (Corpus Sniper), a tool that implements (i) a web-based multiuser scenario for identifying and annotating non-canonical grammatical constructions in large corpora based on linguistic queries and (ii) evaluation of annotation quality by measuring inter-rater agreement. This annotationby-query approach efficiently harnesses expert knowledge to identify instances of linguistic phenomena that are hard to identify by means of existing automatic annotation tools.
4 0.58893436 212 acl-2012-Using Search-Logs to Improve Query Tagging
Author: Kuzman Ganchev ; Keith Hall ; Ryan McDonald ; Slav Petrov
Abstract: Syntactic analysis of search queries is important for a variety of information-retrieval tasks; however, the lack of annotated data makes training query analysis models difficult. We propose a simple, efficient procedure in which part-of-speech tags are transferred from retrieval-result snippets to queries at training time. Unlike previous work, our final model does not require any additional resources at run-time. Compared to a state-ofthe-art approach, we achieve more than 20% relative error reduction. Additionally, we annotate a corpus of search queries with partof-speech tags, providing a resource for future work on syntactic query analysis.
5 0.53927743 55 acl-2012-Community Answer Summarization for Multi-Sentence Question with Group L1 Regularization
Author: Wen Chan ; Xiangdong Zhou ; Wei Wang ; Tat-Seng Chua
Abstract: We present a novel answer summarization method for community Question Answering services (cQAs) to address the problem of “incomplete answer”, i.e., the “best answer” of a complex multi-sentence question misses valuable information that is contained in other answers. In order to automatically generate a novel and non-redundant community answer summary, we segment the complex original multi-sentence question into several sub questions and then propose a general Conditional Random Field (CRF) based answer summary method with group L1 regularization. Various textual and non-textual QA features are explored. Specifically, we explore four different types of contextual factors, namely, the information novelty and non-redundancy modeling for local and non-local sentence interactions under question segmentation. To further unleash the potential of the abundant cQA features, we introduce the group L1 regularization for feature learning. Experimental results on a Yahoo! Answers dataset show that our proposed method significantly outperforms state-of-the-art methods on cQA summarization task.
6 0.47739175 177 acl-2012-Sentence Dependency Tagging in Online Question Answering Forums
7 0.43362573 217 acl-2012-Word Sense Disambiguation Improves Information Retrieval
8 0.38461941 66 acl-2012-DOMCAT: A Bilingual Concordancer for Domain-Specific Computer Assisted Translation
9 0.37384748 156 acl-2012-Online Plagiarized Detection Through Exploiting Lexical, Syntax, and Semantic Information
10 0.35905159 134 acl-2012-Learning to Find Translations and Transliterations on the Web
11 0.32568944 14 acl-2012-A Joint Model for Discovery of Aspects in Utterances
12 0.31567338 112 acl-2012-Humor as Circuits in Semantic Networks
13 0.3140153 125 acl-2012-Joint Learning of a Dual SMT System for Paraphrase Generation
14 0.28770989 116 acl-2012-Improve SMT Quality with Automatically Extracted Paraphrase Rules
15 0.28551182 210 acl-2012-Unsupervized Word Segmentation: the Case for Mandarin Chinese
16 0.28099713 77 acl-2012-Ecological Evaluation of Persuasive Messages Using Google AdWords
17 0.27177832 133 acl-2012-Learning to "Read Between the Lines" using Bayesian Logic Programs
18 0.26457819 56 acl-2012-Computational Approaches to Sentence Completion
19 0.25012222 51 acl-2012-Collective Generation of Natural Image Descriptions
20 0.24846637 144 acl-2012-Modeling Review Comments
topicId topicWeight
[(25, 0.017), (26, 0.055), (28, 0.03), (30, 0.027), (37, 0.023), (39, 0.046), (48, 0.02), (52, 0.332), (74, 0.022), (82, 0.013), (84, 0.022), (85, 0.017), (90, 0.166), (92, 0.069), (99, 0.042)]
simIndex simValue paperId paperTitle
same-paper 1 0.74305862 35 acl-2012-Automatically Mining Question Reformulation Patterns from Search Log Data
Author: Xiaobing Xue ; Yu Tao ; Daxin Jiang ; Hang Li
Abstract: Natural language questions have become popular in web search. However, various questions can be formulated to convey the same information need, which poses a great challenge to search systems. In this paper, we automatically mined 5w1h question reformulation patterns from large scale search log data. The question reformulations generated from these patterns are further incorporated into the retrieval model. Experiments show that using question reformulation patterns can significantly improve the search performance of natural language questions.
2 0.74185205 126 acl-2012-Labeling Documents with Timestamps: Learning from their Time Expressions
Author: Nathanael Chambers
Abstract: Temporal reasoners for document understanding typically assume that a document’s creation date is known. Algorithms to ground relative time expressions and order events often rely on this timestamp to assist the learner. Unfortunately, the timestamp is not always known, particularly on the Web. This paper addresses the task of automatic document timestamping, presenting two new models that incorporate rich linguistic features about time. The first is a discriminative classifier with new features extracted from the text’s time expressions (e.g., ‘since 1999’). This model alone improves on previous generative models by 77%. The second model learns probabilistic constraints between time expressions and the unknown document time. Imposing these learned constraints on the discriminative model further improves its accuracy. Finally, we present a new experiment design that facilitates easier comparison by future work.
Author: Xu Sun ; Houfeng Wang ; Wenjie Li
Abstract: We present a joint model for Chinese word segmentation and new word detection. We present high dimensional new features, including word-based features and enriched edge (label-transition) features, for the joint modeling. As we know, training a word segmentation system on large-scale datasets is already costly. In our case, adding high dimensional new features will further slow down the training speed. To solve this problem, we propose a new training method, adaptive online gradient descent based on feature frequency information, for very fast online training of the parameters, even given large-scale datasets with high dimensional features. Compared with existing training methods, our training method is an order magnitude faster in terms of training time, and can achieve equal or even higher accuracies. The proposed fast training method is a general purpose optimization method, and it is not limited in the specific task discussed in this paper.
4 0.69737256 105 acl-2012-Head-Driven Hierarchical Phrase-based Translation
Author: Junhui Li ; Zhaopeng Tu ; Guodong Zhou ; Josef van Genabith
Abstract: This paper presents an extension of Chiang’s hierarchical phrase-based (HPB) model, called Head-Driven HPB (HD-HPB), which incorporates head information in translation rules to better capture syntax-driven information, as well as improved reordering between any two neighboring non-terminals at any stage of a derivation to explore a larger reordering search space. Experiments on Chinese-English translation on four NIST MT test sets show that the HD-HPB model significantly outperforms Chiang’s model with average gains of 1.91 points absolute in BLEU. 1
5 0.51444918 28 acl-2012-Aspect Extraction through Semi-Supervised Modeling
Author: Arjun Mukherjee ; Bing Liu
Abstract: Aspect extraction is a central problem in sentiment analysis. Current methods either extract aspects without categorizing them, or extract and categorize them using unsupervised topic modeling. By categorizing, we mean the synonymous aspects should be clustered into the same category. In this paper, we solve the problem in a different setting where the user provides some seed words for a few aspect categories and the model extracts and clusters aspect terms into categories simultaneously. This setting is important because categorizing aspects is a subjective task. For different application purposes, different categorizations may be needed. Some form of user guidance is desired. In this paper, we propose two statistical models to solve this seeded problem, which aim to discover exactly what the user wants. Our experimental results show that the two proposed models are indeed able to perform the task effectively. 1
6 0.51331669 167 acl-2012-QuickView: NLP-based Tweet Search
8 0.51176912 217 acl-2012-Word Sense Disambiguation Improves Information Retrieval
9 0.51160705 45 acl-2012-Capturing Paradigmatic and Syntagmatic Lexical Relations: Towards Accurate Chinese Part-of-Speech Tagging
10 0.51035684 142 acl-2012-Mining Entity Types from Query Logs via User Intent Modeling
11 0.50716066 127 acl-2012-Large-Scale Syntactic Language Modeling with Treelets
12 0.50627655 182 acl-2012-Spice it up? Mining Refinements to Online Instructions from User Generated Content
13 0.50549263 98 acl-2012-Finding Bursty Topics from Microblogs
14 0.50512558 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations
15 0.50506777 73 acl-2012-Discriminative Learning for Joint Template Filling
16 0.50459403 123 acl-2012-Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT
17 0.50436068 16 acl-2012-A Nonparametric Bayesian Approach to Acoustic Model Discovery
18 0.50427377 150 acl-2012-Multilingual Named Entity Recognition using Parallel Data and Metadata from Wikipedia
19 0.50412899 148 acl-2012-Modified Distortion Matrices for Phrase-Based Statistical Machine Translation
20 0.50383878 61 acl-2012-Cross-Domain Co-Extraction of Sentiment and Topic Lexicons