acl acl2012 acl2012-156 knowledge-graph by maker-knowledge-mining

156 acl-2012-Online Plagiarized Detection Through Exploiting Lexical, Syntax, and Semantic Information


Source: pdf

Author: Wan-Yu Lin ; Nanyun Peng ; Chun-Chao Yen ; Shou-de Lin

Abstract: In this paper, we introduce a framework that identifies online plagiarism by exploiting lexical, syntactic and semantic features that include duplication-gram, reordering and alignment of words, POS and phrase tags, and semantic similarity of sentences. We establish an ensemble framework to combine the predictions of each model. Results demonstrate that our system not only finds a considerable amount of real-world online plagiarism cases but also outperforms several state-of-the-art algorithms and commercial software. Keywords: Plagiarism Detection, Lexical, Syntactic, Semantic.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract In this paper, we introduce a framework that identifies online plagiarism by exploiting lexical, syntactic and semantic features that include duplication-gram, reordering and alignment of words, POS and phrase tags, and semantic similarity of sentences. [sent-6, score-1.197]

2 We establish an ensemble framework to combine the predictions of each model. [sent-7, score-0.07]

3 Results demonstrate that our system not only finds a considerable amount of real-world online plagiarism cases but also outperforms several state-of-the-art algorithms and commercial software. [sent-8, score-1.049]

4 Introduction Online plagiarism, the act of trying to create a new piece of writing by copying, reorganizing or rewriting others’ work found through search engines, is one of the most common misuses of today’s mature web technologies. [sent-10, score-0.053]

5 As implied by the experiment conducted by (Braumoeller and Gaines, 2001), a powerful plagiarism detection system can effectively discourage people from plagiarizing others’ work. [sent-11, score-1.047]

6 A common strategy people adopt for online-plagiarism detection is as follows. [sent-12, score-0.137]

7 First they identify several suspicious sentences from the write-up and feed them one by one as a query to a search engine to obtain a set of documents. [sent-13, score-0.241]

8 Then human reviewers can manually examine whether these documents are truly the sources of the suspicious sentences. [sent-14, score-0.104]

9 While it is quite straightforward and effective, the limitation of this approach is twofold. [sent-15, score-0.065]

11 First, since the length of a search query is limited, suspicious sentences are usually queried and examined independently. [sent-22, score-0.193]

12 Therefore, it is harder to identify document level plagiarism than sentence level plagiarism. [sent-23, score-0.949]

13 Second, manually checking whether a query sentence plagiarizes certain websites requires specific domain and language knowledge as well as a considerable amount of energy and time. [sent-24, score-0.073]

14 To overcome the above shortcomings, we introduce an online plagiarism detection system using natural language processing techniques to simulate the above reverse-engineering approach. [sent-25, score-1.121]

15 We develop an ensemble framework that integrates lexical, syntactic and semantic features to achieve this goal. [sent-26, score-0.129]

16 Our system is language independent and we have implemented both Chinese and English versions for evaluation. [sent-27, score-0.023]

17 Related Work Plagiarism detection has been widely discussed in the past decades (Zou et al. [sent-29, score-0.112]

18 Compared to those systems, our system exploits more sophisticated syntactic and semantic information to simulate what plagiarists are trying to do. [sent-34, score-0.185]

19 There are several online or downloadable (free or paid) plagiarism detection systems, such as Turnitin, EVE2, Docol©c, and CATPPDS, which mainly detect verbatim copying. [sent-35, score-1.089]

20 Unfortunately, those commercial systems do not reveal the detailed strategies they use; therefore, it is hard to judge and reproduce their results for comparison. [sent-37, score-0.019]

21 1 Query a Search Engine We first break down each article into a series of queries to query a search engine. [sent-41, score-0.146]

22 The main difference between our method and theirs is that we send unquoted queries rather than quoted ones. [sent-44, score-0.036]

23 We do not require the search results to completely match the query sentence. [sent-45, score-0.089]

24 This strategy allows us to identify not only the copy/paste type of plagiarism but also the rewrite/edit type. [sent-46, score-0.937]
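
A minimal sketch of this query-generation step in Python; the sentence splitter and the per-query term limit below are illustrative assumptions, since the paper does not specify them:

```python
import re

MAX_QUERY_TERMS = 32  # illustrative limit; real search engines impose their own caps

def article_to_queries(article: str, max_terms: int = MAX_QUERY_TERMS):
    """Break an article into sentence-level, unquoted search queries."""
    # naive sentence split on ., ! or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", article.strip())
    queries = []
    for sent in sentences:
        terms = sent.split()
        if terms:
            # truncate overly long sentences to respect query-length limits
            queries.append(" ".join(terms[:max_terms]))
    return queries
```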

25 2 Sentence-based Plagiarism Detection Since not all outputs of a search engine contain an exact copy of the query, we need a model to quantify how likely each of them is the source of plagiarism. [sent-48, score-0.1]

26 For better efficiency, our experiment exploits the snippet of a search output to represent the whole document. [sent-49, score-0.087]

27 That is, we want to measure how likely a snippet is the plagiarized source of the query. [sent-50, score-0.09]

28 We designed several models which utilized rich lexical, syntactic and semantic features to pursue this goal, and the details are discussed below. [sent-51, score-0.075]

29 1 Ngram Matching (NM) One straightforward measure is to exploit the ngram similarity between source and target texts. [sent-54, score-0.086]

30 The larger n is, the harder it is for this feature to detect plagiarism involving insertion, replacement, and deletion. [sent-56, score-0.95]
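
A small sketch of such an n-gram overlap measure; the Jaccard-style normalization is an assumption, since the paper does not show its exact formula here:

```python
def word_ngrams(tokens, n):
    """Set of contiguous word n-grams of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_match_score(source: str, target: str, n: int = 3) -> float:
    """Jaccard overlap of word n-grams between two texts (illustrative normalization)."""
    s = word_ngrams(source.lower().split(), n)
    t = word_ngrams(target.lower().split(), n)
    if not s or not t:
        return 0.0
    return len(s & t) / len(s | t)
```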

31 2 Reordering of Words (RW) Plagiarism can come from the reordering of words. [sent-60, score-0.057]

32 We argue that the permutation distance between S1 and S2 is an important indicator for reordered plagiarism. [sent-61, score-0.062]

33 The permutation distance is defined as the minimum number of pairwise exchanges of matched words needed to transform a sentence, S2, to contain the same order of matched words as another sentence, S1. [sent-62, score-0.184]

34 As mentioned in (Sörensen and Sevaux, 2005), the permutation distance can be calculated as d(S1, S2) = Σ_{i<j} z_ij, where z_ij = 1 if the ith and jth matched words appear in the opposite relative order in S2 compared with S1, and 0 otherwise. [sent-63, score-0.062]

35 S1(i) and S2(i) are the indices of the ith matched word in sentences S1 and S2 respectively, and n is the number of matched words between the sentences S1 and S2. [sent-93, score-0.139]

36 Let μ = (n² − n)/2 be the normalizing term, which is the maximum possible distance between S1 and S2; then the reordering score of the two sentences, expressed as s(S1, S2), is s(S1, S2) = 1 − d(S1, S2)/μ. [sent-94, score-0.078]
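
A sketch of this reordering score: count the inverted pairs among the matched words (the permutation distance d), normalize by μ = (n² − n)/2, and return 1 − d/μ. The matched-word extraction below is a simple shared-vocabulary heuristic, not necessarily the paper's exact matching procedure:

```python
def reordering_score(s1_tokens, s2_tokens) -> float:
    """s(S1, S2) = 1 - d(S1, S2)/mu, where d counts pairs of matched words
    whose relative order differs and mu = (n^2 - n)/2 is the maximum distance."""
    # matched words: words occurring in both sentences (first occurrence only)
    matched = [w for w in dict.fromkeys(s1_tokens) if w in s2_tokens]
    pos_in_s2 = [s2_tokens.index(w) for w in matched]  # S2 positions, in S1 order
    n = len(pos_in_s2)
    if n < 2:
        return 1.0  # nothing to reorder
    d = sum(1 for i in range(n) for j in range(i + 1, n) if pos_in_s2[i] > pos_in_s2[j])
    mu = (n * n - n) / 2
    return 1.0 - d / mu

# Example: a fully reversed order of matched words yields a score of 0.0.
# reordering_score("a b c".split(), "c b a".split())  -> 0.0
```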

37 3 Alignment of Words (AW) Besides reordering, plagiarists often insert or delete words in a sentence. [sent-96, score-0.056]

38 We try to model such behavior by finding the alignment of two word sequences. [sent-97, score-0.029]

39 We perform the alignment using a dynamic programming method as mentioned in (Wagner and Fischer, 1975). [sent-98, score-0.029]

40 However, such an alignment score does not reflect the continuity of the matched words, which can be an important cue for identifying plagiarism. [sent-99, score-0.09]

41 M is the list of matched words, and Mi is the ith matched word in M. [sent-126, score-0.139]

42 This implies we prefer fewer unmatched words in between two matched ones. [sent-127, score-0.061]
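
A sketch of this alignment step: a standard dynamic-programming (longest-common-subsequence) alignment of the two word sequences, in the spirit of Wagner and Fischer, followed by a continuity-aware score. The paper's exact continuity formula is not reproduced above, so the gap penalty used here is an illustrative assumption:

```python
def align_words(a, b):
    """LCS-style dynamic-programming alignment; returns matched (i, j) index pairs."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            dp[i][j] = dp[i + 1][j + 1] + 1 if a[i] == b[j] else max(dp[i + 1][j], dp[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            pairs.append((i, j)); i += 1; j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs

def alignment_score(a, b):
    """Rewards aligned words, down-weighting pairs separated by unmatched words."""
    pairs = align_words(a, b)
    if not pairs:
        return 0.0
    score = 1.0  # credit for the first matched word
    for (i1, j1), (i2, j2) in zip(pairs, pairs[1:]):
        gap = max(i2 - i1, j2 - j1)  # 1 means consecutive in both sequences
        score += 1.0 / gap           # prefer fewer unmatched words in between
    return score / max(len(a), len(b))
```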

43 4 POS and Phrase Tags of Words (PT, PP) Exploiting only lexical features can sometimes result in some false positive cases because two sets of matched words can play different roles in the sentences. [sent-130, score-0.127]

44 Therefore, we further explore syntactic features for plagiarism detection. [sent-133, score-0.938]

45 To achieve this goal, we utilize a parser to obtain POS and phrase tags of the words. [sent-134, score-0.035]

46 For example, the original subject can appear in a phrase headed by “by” in the passive form, while the object in a Verb Phrase can become a new subject in a Noun Phrase. [sent-185, score-0.021]

47 Here we utilize the Stanford Dependencies provided by the Stanford Parser to match tags and phrases between active and passive sentences. [sent-186, score-0.021]
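
A sketch of this syntactic check, using NLTK's off-the-shelf POS tagger as a stand-in for the Stanford Parser that the paper actually uses; the agreement ratio below is an illustrative scoring choice:

```python
import nltk  # requires the 'punkt' and 'averaged_perceptron_tagger' data packages

def pos_agreement(query: str, snippet: str) -> float:
    """Fraction of shared words that carry the same POS tag in both texts."""
    q_tags = dict(nltk.pos_tag(nltk.word_tokenize(query)))    # last tag wins for repeats
    s_tags = dict(nltk.pos_tag(nltk.word_tokenize(snippet)))
    shared = set(q_tags) & set(s_tags)
    if not shared:
        return 0.0
    return sum(q_tags[w] == s_tags[w] for w in shared) / len(shared)
```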

48 While previous works often explore semantic similarity using lexical databases such as WordNet to find synonyms, we exploit a topic model, specifically latent Dirichlet allocation (LDA). [sent-192, score-0.144]

49 Given a set of documents represented by their word sequences, and a topic number n, LDA learns the word distribution for each topic and the topic distribution for each document which maximize the likelihood of the word co-occurrence in a document. [sent-196, score-0.093]

50 The topic distribution is often taken as the semantics of a document. [sent-197, score-0.025]

51 We use LDA to obtain the topic distributions of a query and a candidate snippet, and use their cosine similarity as a measure of their semantic similarity. [sent-198, score-0.159]
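
A sketch of this semantic score, using gensim's LDA implementation as one possible toolkit (the paper does not name a specific library); the background corpus, preprocessing, and topic number are placeholders:

```python
import numpy as np
from gensim import corpora, models

def train_lda(tokenized_docs, num_topics=50):
    """Train an LDA model on a background corpus of tokenized documents."""
    dictionary = corpora.Dictionary(tokenized_docs)
    bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    lda = models.LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=num_topics)
    return lda, dictionary

def topic_vector(lda, dictionary, text):
    """Dense topic distribution of a piece of text."""
    bow = dictionary.doc2bow(text.lower().split())
    vec = np.zeros(lda.num_topics)
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        vec[topic_id] = prob
    return vec

def lda_similarity(lda, dictionary, query, snippet):
    """Cosine similarity between the topic distributions of a query and a snippet."""
    q = topic_vector(lda, dictionary, query)
    s = topic_vector(lda, dictionary, snippet)
    denom = np.linalg.norm(q) * np.linalg.norm(s)
    return float(q @ s / denom) if denom else 0.0
```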

52 3 Ensemble Similarity Scores Up to this point, for each snippet the system generates six similarity scores to measure the degree of plagiarism in different aspects. [sent-200, score-1.015]

53 In this stage, we propose two strategies to linearly combine the scores to make better prediction. [sent-201, score-0.019]

54 In the first strategy, we use each individual model's performance (e.g., accuracy) as the weight to linearly combine the scores. [sent-204, score-0.019]

55 In the second strategy we exploit a learning model (in the experiment section we use Liblinear) to learn the weights directly. [sent-206, score-0.049]
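
A sketch of the two combination strategies; scikit-learn's LogisticRegression with the 'liblinear' solver stands in for LIBLINEAR, and the ordering of the six scores (NM, RW, AW, PT, PP, LDA) and the weights are assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def weighted_ensemble(scores: np.ndarray, model_performance: np.ndarray) -> np.ndarray:
    """Strategy 1: combine the six per-snippet scores with fixed weights derived
    from each model's standalone performance (e.g., accuracy)."""
    weights = model_performance / model_performance.sum()
    return scores @ weights  # scores has shape (n_snippets, 6)

def learned_ensemble(train_scores, train_labels, test_scores):
    """Strategy 2: learn the linear combination from labeled snippets."""
    clf = LogisticRegression(solver="liblinear")
    clf.fit(train_scores, train_labels)
    return clf.predict_proba(test_scores)[:, 1]  # degree-of-plagiarism estimate
```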

56 4 Document Level Plagiarism Detection For each query from the input article, our system assigns a degree-of-plagiarism score to some plausible source URLs. [sent-208, score-0.101]

57 Then, for each URL, the system sums up all the scores it obtains as the final score for document-level degree-of-plagiarism. [sent-209, score-0.023]

58 We set up a cutoff threshold to obtain the most plausible URLs. [sent-210, score-0.042]

59 At the end, our system highlights the suspicious areas of plagiarism for display. [sent-211, score-1.039]
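
A sketch of this document-level aggregation: sum the per-query scores for each candidate URL and keep those above a cutoff; the threshold value is an assumption:

```python
from collections import defaultdict

def document_level_plagiarism(query_results, threshold=1.5):
    """query_results: iterable of (url, degree_of_plagiarism) pairs, one per
    query/snippet match. Returns URLs whose summed score passes the cutoff."""
    totals = defaultdict(float)
    for url, score in query_results:
        totals[url] += score
    return {url: s for url, s in totals.items() if s >= threshold}
```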

60 Evaluation We evaluate our system from two different angles. [sent-213, score-0.023]

61 We first evaluate sentence-level plagiarism detection using the PAN corpus in English. [sent-214, score-0.112]

62 We then evaluate the capability of the full system to detect on-line plagiarism cases using annotated results in Chinese. [sent-215, score-0.985]

63 1 Sentence-based Evaluations We want to compare our model with state-of-the-art methods, in particular the winning entries in the PAN plagiarism detection competition. [sent-217, score-1.098]

64 However, the competition in PAN is designed for off-line plagiarism detection; the entries did not exploit an IR system to search the Web like we do. [sent-218, score-1.018]

65 To achieve this goal, we first randomly sampled 370 documents from the PAN-2011 external plagiarism corpus. [sent-220, score-0.912]

66 Then we use the suspicious passages as queries to search the whole dataset using Lucene. [sent-224, score-0.174]

67 Since there is a length limitation in Lucene (as well as in real search engines), we further break the 2882 plagiarism cases into 6477 queries. [sent-225, score-1.024]

68 We then extract the top 30 snippets returned by the search engine as the potential negative candidates for each plagiarism case. [sent-226, score-1.042]

69 Note that for each suspicious passage, there is only one target passage (given by the ground truth) that is considered as a positive plagiarism case in this data, and it can be either among these 30 cases or not. [sent-227, score-1.099]

70 However, we take the union of these 30 cases and the ground truth as a candidate set, and use our models (as well as the competitors') to rank the degree-of-plagiarism for all the candidates. [sent-228, score-0.088]

71 We compared our system with the winning entry of PAN 2011 (Grman and Ravas, 2011) and the stopword n-gram model of Stamatatos (2011), which is claimed to perform better than this winning entry. [sent-230, score-0.215]

72 The results of each individual model and ensemble using 5-fold cross validation are listed in Table 3. [sent-231, score-0.07]
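
A sketch of this ranking evaluation with scikit-learn, assuming a (candidates × 6) score matrix and binary labels marking the ground-truth source; the fold setup mirrors 5-fold cross-validation, but the exact protocol details are assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

def cross_validated_auc(scores: np.ndarray, labels: np.ndarray, n_splits: int = 5) -> float:
    """Mean AUC of the learned ensemble over the folds."""
    aucs = []
    for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(scores):
        clf = LogisticRegression(solver="liblinear")
        clf.fit(scores[train_idx], labels[train_idx])
        pred = clf.predict_proba(scores[test_idx])[:, 1]
        aucs.append(roc_auc_score(labels[test_idx], pred))
    return float(np.mean(aucs))
```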

73 An ensemble of three features outperforms the state-of-the-art by 26%. [sent-234, score-0.07]

74 (a) AUC for each individual model; (b) AUC of our ensemble and other state-of-the-art algorithms. [sent-245, score-0.07]

75 2 Evaluating the Full System To evaluate the overall system, we manually collect 60 real-world review articles from the Internet for books (20), movies (20), and music albums (20). [sent-246, score-0.029]

76 Unfortunately for an online system like ours, there is no ground truth available for recall measure. [sent-247, score-0.126]

77 First, we use the 60 articles as inputs to our system and ask 5 human annotators to check whether the articles returned by our system can be considered plagiarism. [sent-249, score-0.135]

78 Among all 60 review articles, our system identifies a considerably high number of copy/paste articles, 231 in total. [sent-250, score-0.041]

79 However, identifying this type of plagiarism is trivial, and has been done by many similar tools. [sent-251, score-0.912]

80 Instead we focus on the so-called smart-plagiarism which cannot be found through quoting a query in a search engine. [sent-252, score-0.089]

81 shows the precision of the smart-plagiarism articles returned by our system. [sent-254, score-0.064]

82 Then we use each of them as queries to Google and retrieve a total of 5636 snippet candidates. [sent-258, score-0.089]

83 We then ask 63 human beings to annotate whether those snippets represent plagiarism cases of the original review article. [sent-259, score-1.024]

84 Eventually, we obtained an annotated dataset with a total of 502 plagiarized candidates and 4966 innocent ones for evaluation. [sent-260, score-0.037]

85 (a) AUC for each individual model; (b) AUC of our ensemble and other state-of-the-art algorithms. [sent-274, score-0.07]

86 The main reason, we believe, is that the plagiarism cases were created in very different manners. [sent-276, score-0.943]

87 Plagiarism cases in PAN external source are created artificially through word insertions, deletions, reordering and synonym substitutions. [sent-277, score-0.108]

88 As a result, features such as word alignment and reordering do not perform well because they did not consider the existence of synonym word replacement. [sent-278, score-0.106]

89 On the other hand, real-world plagiarism cases returned by Google are those with matching words, and we find better performance for AW. [sent-279, score-0.978]

90 The performance of the syntactic and semantic features, namely PT, PP and LDA, is consistently inferior to that of the other features. [sent-280, score-0.059]

91 This is because they often introduce false positives, as some non-plagiarism cases might have highly overlapping syntactic or semantic tags. [sent-281, score-0.09]

92 We also found that the stopword Ngram model is not applicable universally. [sent-283, score-0.059]

93 For one thing, it is less suitable for on-line plagiarism detection, as the length limitation for queries diminishes the usability of stopword n-grams. [sent-284, score-1.033]

94 Online Demo System We developed an online demo system using Java (JDK 1. [sent-288, score-0.069]

95 The system currently supports the detection of documents in both English and Chinese. [sent-290, score-0.135]

96 Users can either upload the plain text file of a suspicious document, or copy/paste the content onto the text area, as shown below in Figure 2. [sent-291, score-0.104]

97 Then the system will output some URLs and snippets as the potential source of plagiarism. [sent-292, score-0.055]

98 Conclusion Compared with other online plagiarism detection systems, ours exploits more sophisticated features by modeling how human beings plagiarize online sources. [sent-295, score-1.17]

99 We have performed sentence-level plagiarism detection at the lexical, syntactic and semantic levels. [sent-296, score-1.083]

100 Given a parser and a POS tagger of a language, our framework can be extended to support plagiarism detection for that language. [sent-298, score-1.024]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('plagiarism', 0.912), ('detection', 0.112), ('suspicious', 0.104), ('auc', 0.097), ('pan', 0.087), ('ensemble', 0.07), ('matched', 0.061), ('stopword', 0.059), ('reordering', 0.057), ('plagiarists', 0.056), ('query', 0.055), ('snippet', 0.053), ('winning', 0.049), ('online', 0.046), ('lda', 0.044), ('permutation', 0.041), ('multimedia', 0.037), ('copy', 0.037), ('braumoeller', 0.037), ('cede', 0.037), ('copying', 0.037), ('grman', 0.037), ('plagiarized', 0.037), ('potthast', 0.037), ('rensena', 0.037), ('zou', 0.037), ('networking', 0.036), ('queries', 0.036), ('ngram', 0.035), ('returned', 0.035), ('search', 0.034), ('semantic', 0.033), ('stamatatos', 0.032), ('barr', 0.032), ('snippets', 0.032), ('taiwan', 0.032), ('cases', 0.031), ('truth', 0.03), ('beings', 0.03), ('alberto', 0.03), ('lncs', 0.03), ('alignment', 0.029), ('engine', 0.029), ('articles', 0.029), ('simulate', 0.028), ('ground', 0.027), ('similarity', 0.027), ('wagner', 0.026), ('lucene', 0.026), ('graduate', 0.026), ('limitation', 0.026), ('syntactic', 0.026), ('topic', 0.025), ('competition', 0.025), ('passage', 0.025), ('strategy', 0.025), ('exploit', 0.024), ('system', 0.023), ('plausible', 0.023), ('paolo', 0.022), ('distance', 0.021), ('ie', 0.021), ('break', 0.021), ('passive', 0.021), ('synonym', 0.02), ('detect', 0.019), ('linearly', 0.019), ('harder', 0.019), ('obtain', 0.019), ('commercial', 0.019), ('trying', 0.019), ('ask', 0.019), ('engines', 0.019), ('dirichlet', 0.018), ('document', 0.018), ('cs', 0.018), ('allocation', 0.018), ('false', 0.018), ('considerable', 0.018), ('identifies', 0.018), ('lexical', 0.017), ('ith', 0.017), ('institute', 0.017), ('tw', 0.017), ('reviews', 0.017), ('pt', 0.017), ('ir', 0.017), ('phrase', 0.016), ('tool', 0.016), ('checker', 0.016), ('efstathios', 0.016), ('benno', 0.016), ('nanyun', 0.016), ('phan', 0.016), ('gelbukh', 0.016), ('prevention', 0.016), ('pursue', 0.016), ('bandar', 0.016), ('mclean', 0.016), ('shea', 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 156 acl-2012-Online Plagiarized Detection Through Exploiting Lexical, Syntax, and Semantic Information

Author: Wan-Yu Lin ; Nanyun Peng ; Chun-Chao Yen ; Shou-de Lin

Abstract: In this paper, we introduce a framework that identifies online plagiarism by exploiting lexical, syntactic and semantic features that includes duplication-gram, reordering and alignment of words, POS and phrase tags, and semantic similarity of sentences. We establish an ensemble framework to combine the predictions of each model. Results demonstrate that our system can not only find considerable amount of real-world online plagiarism cases but also outperforms several state-of-the-art algorithms and commercial software. Keywords Plagiarism Detection, Lexical, Syntactic, Semantic 1.

2 0.088832989 212 acl-2012-Using Search-Logs to Improve Query Tagging

Author: Kuzman Ganchev ; Keith Hall ; Ryan McDonald ; Slav Petrov

Abstract: Syntactic analysis of search queries is important for a variety of information-retrieval tasks; however, the lack of annotated data makes training query analysis models difficult. We propose a simple, efficient procedure in which part-of-speech tags are transferred from retrieval-result snippets to queries at training time. Unlike previous work, our final model does not require any additional resources at run-time. Compared to a state-ofthe-art approach, we achieve more than 20% relative error reduction. Additionally, we annotate a corpus of search queries with partof-speech tags, providing a resource for future work on syntactic query analysis.

3 0.054513801 134 acl-2012-Learning to Find Translations and Transliterations on the Web

Author: Joseph Z. Chang ; Jason S. Chang ; Roger Jyh-Shing Jang

Abstract: In this paper, we present a new method for learning to find translations and transliterations on the Web for a given term; by identifying such translation counterparts on the Web, we can cope with the OOV problem. The approach involves using a small set of terms and translations to obtain mixed-code snippets from a search engine, and automatically annotating the snippets with tags and features for training a conditional random field model. At runtime, the model is used to extract translation candidates for a given term. Preliminary experiments and evaluation show that our method cleanly combines various features, resulting in a system that outperforms previous work.

4 0.054369621 19 acl-2012-A Ranking-based Approach to Word Reordering for Statistical Machine Translation

Author: Nan Yang ; Mu Li ; Dongdong Zhang ; Nenghai Yu

Abstract: Long distance word reordering is a major challenge in statistical machine translation research. Previous work has shown using source syntactic trees is an effective way to tackle this problem between two languages with substantial word order difference. In this work, we further extend this line of exploration and propose a novel but simple approach, which utilizes a ranking model based on word order precedence in the target language to reposition nodes in the syntactic parse tree of a source sentence. The ranking model is automatically derived from word aligned parallel data with a syntactic parser for source language based on both lexical and syntactical features. We evaluated our approach on largescale Japanese-English and English-Japanese machine translation tasks, and show that it can significantly outperform the baseline phrase- based SMT system.

5 0.0445344 31 acl-2012-Authorship Attribution with Author-aware Topic Models

Author: Yanir Seroussi ; Fabian Bohnert ; Ingrid Zukerman

Abstract: Authorship attribution deals with identifying the authors of anonymous texts. Building on our earlier finding that the Latent Dirichlet Allocation (LDA) topic model can be used to improve authorship attribution accuracy, we show that employing a previously-suggested Author-Topic (AT) model outperforms LDA when applied to scenarios with many authors. In addition, we define a model that combines LDA and AT by representing authors and documents over two disjoint topic sets, and show that our model outperforms LDA, AT and support vector machines on datasets with many authors.

6 0.044414751 148 acl-2012-Modified Distortion Matrices for Phrase-Based Statistical Machine Translation

7 0.042938676 22 acl-2012-A Topic Similarity Model for Hierarchical Phrase-based Translation

8 0.042130917 142 acl-2012-Mining Entity Types from Query Logs via User Intent Modeling

9 0.041131284 143 acl-2012-Mixing Multiple Translation Models in Statistical Machine Translation

10 0.038745224 217 acl-2012-Word Sense Disambiguation Improves Information Retrieval

11 0.038695306 86 acl-2012-Exploiting Latent Information to Predict Diffusions of Novel Topics on Social Networks

12 0.033559699 44 acl-2012-CSNIPER - Annotation-by-query for Non-canonical Constructions in Large Corpora

13 0.033125799 145 acl-2012-Modeling Sentences in the Latent Space

14 0.032647576 155 acl-2012-NiuTrans: An Open Source Toolkit for Phrase-based and Syntax-based Machine Translation

15 0.032529134 147 acl-2012-Modeling the Translation of Predicate-Argument Structure for SMT

16 0.031381376 17 acl-2012-A Novel Burst-based Text Representation Model for Scalable Event Detection

17 0.031261936 199 acl-2012-Topic Models for Dynamic Translation Model Adaptation

18 0.029984811 180 acl-2012-Social Event Radar: A Bilingual Context Mining and Sentiment Analysis Summarization System

19 0.028584566 56 acl-2012-Computational Approaches to Sentence Completion

20 0.028363967 190 acl-2012-Syntactic Stylometry for Deception Detection


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.106), (1, 0.023), (2, 0.024), (3, 0.019), (4, -0.004), (5, 0.027), (6, 0.001), (7, -0.012), (8, -0.005), (9, -0.007), (10, 0.034), (11, 0.051), (12, 0.043), (13, 0.03), (14, -0.015), (15, -0.036), (16, 0.013), (17, -0.003), (18, 0.007), (19, -0.074), (20, 0.022), (21, 0.128), (22, 0.023), (23, 0.048), (24, -0.072), (25, -0.012), (26, 0.033), (27, 0.047), (28, 0.028), (29, 0.016), (30, -0.023), (31, -0.034), (32, -0.023), (33, -0.064), (34, -0.0), (35, -0.017), (36, -0.003), (37, -0.024), (38, 0.045), (39, 0.048), (40, 0.062), (41, 0.096), (42, -0.1), (43, -0.079), (44, 0.001), (45, 0.013), (46, 0.072), (47, 0.001), (48, 0.077), (49, -0.09)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.86828613 156 acl-2012-Online Plagiarized Detection Through Exploiting Lexical, Syntax, and Semantic Information

Author: Wan-Yu Lin ; Nanyun Peng ; Chun-Chao Yen ; Shou-de Lin

Abstract: In this paper, we introduce a framework that identifies online plagiarism by exploiting lexical, syntactic and semantic features that includes duplication-gram, reordering and alignment of words, POS and phrase tags, and semantic similarity of sentences. We establish an ensemble framework to combine the predictions of each model. Results demonstrate that our system can not only find considerable amount of real-world online plagiarism cases but also outperforms several state-of-the-art algorithms and commercial software. Keywords Plagiarism Detection, Lexical, Syntactic, Semantic 1.

2 0.62965661 212 acl-2012-Using Search-Logs to Improve Query Tagging

Author: Kuzman Ganchev ; Keith Hall ; Ryan McDonald ; Slav Petrov

Abstract: Syntactic analysis of search queries is important for a variety of information-retrieval tasks; however, the lack of annotated data makes training query analysis models difficult. We propose a simple, efficient procedure in which part-of-speech tags are transferred from retrieval-result snippets to queries at training time. Unlike previous work, our final model does not require any additional resources at run-time. Compared to a state-ofthe-art approach, we achieve more than 20% relative error reduction. Additionally, we annotate a corpus of search queries with partof-speech tags, providing a resource for future work on syntactic query analysis.

3 0.53839475 190 acl-2012-Syntactic Stylometry for Deception Detection

Author: Song Feng ; Ritwik Banerjee ; Yejin Choi

Abstract: Most previous studies in computerized deception detection have relied only on shallow lexico-syntactic patterns. This paper investigates syntactic stylometry for deception detection, adding a somewhat unconventional angle to prior literature. Over four different datasets spanning from the product review to the essay domain, we demonstrate that features driven from Context Free Grammar (CFG) parse trees consistently improve the detection performance over several baselines that are based only on shallow lexico-syntactic features. Our results improve the best published result on the hotel review data (Ott et al., 2011) reaching 91.2% accuracy with 14% error reduction. ,

4 0.49944979 44 acl-2012-CSNIPER - Annotation-by-query for Non-canonical Constructions in Large Corpora

Author: Richard Eckart de Castilho ; Sabine Bartsch ; Iryna Gurevych

Abstract: We present CSNIPER (Corpus Sniper), a tool that implements (i) a web-based multiuser scenario for identifying and annotating non-canonical grammatical constructions in large corpora based on linguistic queries and (ii) evaluation of annotation quality by measuring inter-rater agreement. This annotationby-query approach efficiently harnesses expert knowledge to identify instances of linguistic phenomena that are hard to identify by means of existing automatic annotation tools.

5 0.48813978 142 acl-2012-Mining Entity Types from Query Logs via User Intent Modeling

Author: Patrick Pantel ; Thomas Lin ; Michael Gamon

Abstract: We predict entity type distributions in Web search queries via probabilistic inference in graphical models that capture how entitybearing queries are generated. We jointly model the interplay between latent user intents that govern queries and unobserved entity types, leveraging observed signals from query formulations and document clicks. We apply the models to resolve entity types in new queries and to assign prior type distributions over an existing knowledge base. Our models are efficiently trained using maximum likelihood estimation over millions of real-world Web search queries. We show that modeling user intent significantly improves entity type resolution for head queries over the state ofthe art, on several metrics, without degradation in tail query performance.

6 0.4513469 148 acl-2012-Modified Distortion Matrices for Phrase-Based Statistical Machine Translation

7 0.44359091 200 acl-2012-Toward Automatically Assembling Hittite-Language Cuneiform Tablet Fragments into Larger Texts

8 0.4010638 134 acl-2012-Learning to Find Translations and Transliterations on the Web

9 0.39822656 219 acl-2012-langid.py: An Off-the-shelf Language Identification Tool

10 0.38586345 217 acl-2012-Word Sense Disambiguation Improves Information Retrieval

11 0.3832832 37 acl-2012-Baselines and Bigrams: Simple, Good Sentiment and Topic Classification

12 0.37734327 35 acl-2012-Automatically Mining Question Reformulation Patterns from Search Log Data

13 0.37685651 31 acl-2012-Authorship Attribution with Author-aware Topic Models

14 0.37429538 77 acl-2012-Ecological Evaluation of Persuasive Messages Using Google AdWords

15 0.37350506 182 acl-2012-Spice it up? Mining Refinements to Online Instructions from User Generated Content

16 0.36275357 14 acl-2012-A Joint Model for Discovery of Aspects in Utterances

17 0.36194491 110 acl-2012-Historical Analysis of Legal Opinions with a Sparse Mixed-Effects Latent Variable Model

18 0.36057711 19 acl-2012-A Ranking-based Approach to Word Reordering for Statistical Machine Translation

19 0.35505748 99 acl-2012-Finding Salient Dates for Building Thematic Timelines

20 0.33363587 144 acl-2012-Modeling Review Comments


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(25, 0.022), (26, 0.055), (28, 0.033), (30, 0.046), (37, 0.023), (39, 0.067), (63, 0.211), (74, 0.028), (82, 0.027), (84, 0.02), (85, 0.043), (90, 0.126), (92, 0.093), (94, 0.025), (99, 0.069)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.78255016 156 acl-2012-Online Plagiarized Detection Through Exploiting Lexical, Syntax, and Semantic Information

Author: Wan-Yu Lin ; Nanyun Peng ; Chun-Chao Yen ; Shou-de Lin

Abstract: In this paper, we introduce a framework that identifies online plagiarism by exploiting lexical, syntactic and semantic features that includes duplication-gram, reordering and alignment of words, POS and phrase tags, and semantic similarity of sentences. We establish an ensemble framework to combine the predictions of each model. Results demonstrate that our system can not only find considerable amount of real-world online plagiarism cases but also outperforms several state-of-the-art algorithms and commercial software. Keywords Plagiarism Detection, Lexical, Syntactic, Semantic 1.

2 0.66815227 84 acl-2012-Estimating Compact Yet Rich Tree Insertion Grammars

Author: Elif Yamangil ; Stuart Shieber

Abstract: We present a Bayesian nonparametric model for estimating tree insertion grammars (TIG), building upon recent work in Bayesian inference of tree substitution grammars (TSG) via Dirichlet processes. Under our general variant of TIG, grammars are estimated via the Metropolis-Hastings algorithm that uses a context free grammar transformation as a proposal, which allows for cubic-time string parsing as well as tree-wide joint sampling of derivations in the spirit of Cohn and Blunsom (2010). We use the Penn treebank for our experiments and find that our proposal Bayesian TIG model not only has competitive parsing performance but also finds compact yet linguistically rich TIG representations of the data.

3 0.66342646 174 acl-2012-Semantic Parsing with Bayesian Tree Transducers

Author: Bevan Jones ; Mark Johnson ; Sharon Goldwater

Abstract: Many semantic parsing models use tree transformations to map between natural language and meaning representation. However, while tree transformations are central to several state-of-the-art approaches, little use has been made of the rich literature on tree automata. This paper makes the connection concrete with a tree transducer based semantic parsing model and suggests that other models can be interpreted in a similar framework, increasing the generality of their contributions. In particular, this paper further introduces a variational Bayesian inference algorithm that is applicable to a wide class of tree transducers, producing state-of-the-art semantic parsing results while remaining applicable to any domain employing probabilistic tree transducers.

4 0.65493208 167 acl-2012-QuickView: NLP-based Tweet Search

Author: Xiaohua Liu ; Furu Wei ; Ming Zhou ; QuickView Team Microsoft

Abstract: Tweets have become a comprehensive repository for real-time information. However, it is often hard for users to quickly get information they are interested in from tweets, owing to the sheer volume of tweets as well as their noisy and informal nature. We present QuickView, an NLP-based tweet search platform to tackle this issue. Specifically, it exploits a series of natural language processing technologies, such as tweet normalization, named entity recognition, semantic role labeling, sentiment analysis, tweet classification, to extract useful information, i.e., named entities, events, opinions, etc., from a large volume of tweets. Then, non-noisy tweets, together with the mined information, are indexed, on top of which two brand new scenarios are enabled, i.e., categorized browsing and advanced search, allowing users to effectively access either the tweets or fine-grained information they are interested in.

5 0.65476608 191 acl-2012-Temporally Anchored Relation Extraction

Author: Guillermo Garrido ; Anselmo Penas ; Bernardo Cabaleiro ; Alvaro Rodrigo

Abstract: Although much work on relation extraction has aimed at obtaining static facts, many of the target relations are actually fluents, as their validity is naturally anchored to a certain time period. This paper proposes a methodological approach to temporally anchored relation extraction. Our proposal performs distant supervised learning to extract a set of relations from a natural language corpus, and anchors each of them to an interval of temporal validity, aggregating evidence from documents supporting the relation. We use a rich graphbased document-level representation to generate novel features for this task. Results show that our implementation for temporal anchoring is able to achieve a 69% of the upper bound performance imposed by the relation extraction step. Compared to the state of the art, the overall system achieves the highest precision reported.

6 0.65461999 132 acl-2012-Learning the Latent Semantics of a Concept from its Definition

7 0.65255344 80 acl-2012-Efficient Tree-based Approximation for Entailment Graph Learning

8 0.65202034 21 acl-2012-A System for Real-time Twitter Sentiment Analysis of 2012 U.S. Presidential Election Cycle

9 0.65153074 31 acl-2012-Authorship Attribution with Author-aware Topic Models

10 0.65126103 36 acl-2012-BIUTEE: A Modular Open-Source System for Recognizing Textual Entailment

11 0.65107322 28 acl-2012-Aspect Extraction through Semi-Supervised Modeling

12 0.65073401 206 acl-2012-UWN: A Large Multilingual Lexical Knowledge Base

13 0.64921159 38 acl-2012-Bayesian Symbol-Refined Tree Substitution Grammars for Syntactic Parsing

14 0.64562857 10 acl-2012-A Discriminative Hierarchical Model for Fast Coreference at Large Scale

15 0.64265704 214 acl-2012-Verb Classification using Distributional Similarity in Syntactic and Semantic Structures

16 0.64253646 187 acl-2012-Subgroup Detection in Ideological Discussions

17 0.64227486 102 acl-2012-Genre Independent Subgroup Detection in Online Discussion Threads: A Study of Implicit Attitude using Textual Latent Semantics

18 0.6419422 159 acl-2012-Pattern Learning for Relation Extraction with a Hierarchical Topic Model

19 0.64156932 139 acl-2012-MIX Is Not a Tree-Adjoining Language

20 0.63849801 29 acl-2012-Assessing the Effect of Inconsistent Assessors on Summarization Evaluation