
35 emnlp-2013-Automatically Detecting and Attributing Indirect Quotations

Source: pdf

Author: Silvia Pareti ; Tim O'Keefe ; Ioannis Konstas ; James R. Curran ; Irena Koprinska

Abstract: Direct quotations are used for opinion mining and information extraction as they have an easy to extract span and they can be attributed to a speaker with high accuracy. However, simply focusing on direct quotations ignores around half of all reported speech, which is in the form of indirect or mixed speech. This work presents the first large-scale experiments in indirect and mixed quotation extraction and attribution. We propose two methods of extracting all quote types from news articles and evaluate them on two large annotated corpora, one of which is a contribution of this work. We further show that direct quotation attribution methods can be successfully applied to indirect and mixed quotation attribution.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Direct quotations are used for opinion mining and information extraction as they have an easy to extract span and they can be attributed to a speaker with high accuracy. [sent-10, score-0.892]

2 However, simply focusing on direct quotations ignores around half of all reported speech, which is in the form of indirect or mixed speech. [sent-11, score-1.149]

3 This work presents the first large-scale experiments in indirect and mixed quotation extraction and attribution. [sent-12, score-0.797]

4 We further show that direct quotation attribution methods can be successfully applied to indirect and mixed quotation attribution. [sent-14, score-1.451]

5 Reported speech is a carrier of evidence and factuality (Bergler, 1992; Saurí and Pustejovsky, 2009), and as such, text mining applications use quotations to summarise, organise and validate information. [sent-17, score-0.772]

6 Extraction of quotations is also relevant to researchers interested in media monitoring. [sent-18, score-0.725]

7 , 2007; Glass and Bangay, 2007; Elson and McKeown, 2010) thus far have limited their scope to direct quotations (Ex. [sent-20, score-0.815]

8 au by quotation marks, which makes them easy to extract. [sent-25, score-0.435]

9 However, annotated resources suggest that direct quotations represent only a limited portion of all quotations, i. [sent-26, score-0.834]

10 Retrieving only direct quotations can miss key content that can change the interpretation of the quotation (Ex. [sent-31, score-1.267]

11 Previous work on extracting indirect and mixed quotations has suffered from a lack of large-scale data, and has instead used hand-crafted lexica of reporting verbs with rule-based approaches. [sent-43, score-1.139]

12 Table 1: Related work on direct, indirect and mixed quotation extraction. [sent-50, score-0.769]

13 1 Figure estimated by the ... 2 Results are for quotation extraction and attribution jointly. [sent-53, score-0.62]

14 ods against both a token-based approach that uses a Conditional Random Field (CRF) to predict IOB labels, and a maximum entropy classifier that predicts whether parse nodes are quotations or not. [sent-54, score-0.767]

15 Finally, we use the direct quotation attribution methods described in O’Keefe et al. [sent-59, score-0.682]

16 (2012) and show that they can be successfully applied to indirect and mixed quotations, albeit with lower accuracy. [sent-60, score-0.334]

17 This leads us to conclude that attributing indirect and mixed quotations to speakers is harder than attributing direct quotations. [sent-61, score-1.225]

18 With this work, we set a new state of the art in quotation extraction. [sent-62, score-0.435]

19 2 Background Pareti (2012) defines an attribution as having a source span, a cue span, and a content span: Source is the span of text that indicates who the content is attributed to, e. [sent-64, score-0.307]

20 Their content corresponds to a quotation span and their source is generally referred to in the literature as the speaker. [sent-86, score-0.527]

21 Direct quotation attribution, with direct quotations being given or extracted heuristically, has been the focus of further studies in both the narrative (Elson and McKeown, 2010) and news (Pouliquen et al. [sent-90, score-1.283]

22 The few studies that have addressed the extraction and attribution of indirect and mixed quotations are discussed below. [sent-93, score-1.244]

23 (2008) developed a quotation extraction and attribution system that combines a lexicon of 53 common reporting verbs and a hand-built grammar to detect constructions that match 6 general lexical patterns. [sent-95, score-0.7]

24 They evaluate their work on 7 articles from the Wall Street Journal, which contain 133 quotations, achieving macro-averaged Precision (P) of 99% and Recall (R) of 74% for quotation span detection. [sent-96, score-0.539]

25 PICTOR yielded 75% P and 86% R in terms of words correctly ascribed to a quotation or speaker, while it achieved 56% P and 52% R when measured in terms of completely correct quotation-speaker pairs. [sent-99, score-0.435]

26 , 2011) extracts quotations from French news, by using a lexicon of reporting verbs and syntactic patterns to extract the complement of a reporting verb as the quotation span and its subject as the source. [sent-101, score-1.39]

27 They evaluated 40 randomly sampled quotations and found that their system made 32 predictions and correctly identified the span in 28 of the 40 cases. [sent-102, score-0.8]

28 Verbatim (Sarmento and Nunes, 2009) extracts quotations from Portuguese news feeds by first finding one of 35 speech verbs and then matching the sentence to one of 19 patterns. [sent-103, score-0.829]

29 9% of the quotations Verbatim finds are errors and that the system identifies approximately one distinct quotation for every 46 news articles. [sent-105, score-1.193]

30 Their work is the closest to ours as they partially apply supervised machine learning to quotation extraction. [sent-108, score-0.435]

31 Their work introduces GloboQuotes, a corpus of 685 news items containing 1,007 quotations of which 802 were used to train an Entropy Guided Transformation Learning (ETL) algorithm (dos Santos and Milidiú, 2009). [sent-109, score-0.758]

32 They treat quotation extraction as an IOB labelling task, where they use ETL with POS and NE features to identify the beginning of a quotation, while the inside and outside labels are found using regular expressions. [sent-110, score-0.481]

33 Finally they use ETL to attribute quotations to their source. [sent-111, score-0.725]

34 We have summarised these approaches in Table 1, Table 2: Comparison of the SMHC and PARC corpora, reporting their document and token size and per-type occurrence of quotations overall and per document (average). [sent-113, score-0.828]

35 Furthermore, the published results do not include any comparisons with previous work, which prevents a quantitative comparison of the approaches, and they do not include results broken down by whether the quotation is direct, indirect, or mixed. [sent-115, score-0.435]

36 For this work we use only the assertions, as they correspond to quotations (direct, indirect and mixed). [sent-122, score-0.898]

37 2 Sydney Morning Herald Corpus (SMHC) We based our second corpus on the existing annotations of direct quotations within Sydney Morning Herald articles presented in O’Keefe et al. [sent-132, score-0.844]

38 In that work we defined direct quotations as any text between quotation marks, which included the directly-quoted portion of mixed quotations, as well as scare quotes. [sent-134, score-1.452]

39 Under that definition direct quotations could be automatically extracted with very high accuracy, so annotations in that work were over the automatically extracted direct quotations. [sent-135, score-0.905]

40 As part of this work one annotator removed scare quotes, updated mixed quotations to include both the directly and indirectly quoted portions, and added whole new indirect quotations. [sent-136, score-1.12]

41 The resulting corpus contains 7,991 quotations taken from 965 articles from the 2009 Sydney Morning Herald (we refer to this corpus as SMHC). [sent-138, score-0.754]

42 3 Comparison Table 2 shows a comparison of the two corpora and the quotations annotated within them. [sent-143, score-0.766]

43 SMHC has a higher density of quotations per document, 8. [sent-144, score-0.725]

44 Excluding null-quotation articles from PARC, the average incidence of annotated quotations per article rises to 7. [sent-155, score-0.773]

45 The corpora also differ in quotation type distribution, with direct quotations being largely predominant in SMHC while indirect are more common in PARC. [sent-157, score-1.445]

46 1 Quotation Extraction Quotation extraction is the task of extracting the content span of all of the direct, indirect, and mixed quotations within a given document. [sent-159, score-1.006]

47 More precisely, we consider quotations to be acts of communication, which correspond to assertions in Pareti (2012). [sent-160, score-0.765]

48 Some quotations have content spans that are split into separate, non-adjacent spans, as in example (1a). [sent-161, score-0.774]

49 Quotation marks were normalised to a single character, as the quotation direction is often incorrect for multi-paragraph quotations. [sent-166, score-0.482]
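
The normalisation step above can be sketched as follows. This is a minimal illustration, not the authors' code: the set of characters in `QUOTE_VARIANTS` and the handling of Penn Treebank style `` and '' tokens are assumptions, since the summary does not say which variants were collapsed.

```python
# Hypothetical mapping of quotation-mark variants to one character.
# The actual variants handled in the paper are not stated in this summary.
QUOTE_VARIANTS = {
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u201e": '"',  # double low-9 quotation mark
    "\u00ab": '"',  # left-pointing double angle quotation mark
    "\u00bb": '"',  # right-pointing double angle quotation mark
}
_TABLE = str.maketrans(QUOTE_VARIANTS)


def normalise_quote_marks(text: str) -> str:
    """Collapse opening and closing quote variants into a single '"'."""
    # Two-character Treebank-style quotes are replaced first (assumption).
    text = text.replace("``", '"').replace("''", '"')
    return text.translate(_TABLE)
```

After this step the direction of a quotation mark carries no information, so a wrongly oriented mark in a multi-paragraph quotation cannot mislead later components.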

50 We used the attributional cues in the PARC corpus to develop a separate component of our system to identify attribution verb-cues. [sent-175, score-0.187]

51 Sentences containing a verb classified as a cue that do not contain a quotation were removed from the training set for the quotation extraction model. [sent-193, score-0.963]

52 4 Evaluation We use two metrics, listed below, for evaluating the quotation spans predicted by our model against the gold spans from the annotation. [sent-195, score-0.544]

53 Strict The first is a strict metric where a predicted span is only considered to be correct if it exactly matches a span from the gold standard. [sent-196, score-0.236]

54 For each of these metrics we report the micro-average, as the number of quotations in each document varies significantly. [sent-201, score-0.725]

55 When reporting P for the typewise results we restrict the set of predicted quotations to only those with the requisite type, while still considering the full set of gold quotations. [sent-202, score-0.806]

56 Similarly, when calculating R we restrict the set of gold quotations to only those with the required type. [sent-203, score-0.747]
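
A minimal sketch of the strict metric with micro-averaging and the typewise restriction described above. The data layout is an assumption made for illustration: spans are `(start, end)` token offsets, documents are keyed by a hypothetical id, and each predicted span is assumed to carry a quotation type. For recall, the gold set is restricted by type and compared against all predictions, which is one reading of the description.

```python
from typing import Dict, Set, Tuple

Span = Tuple[int, int]  # (start_token, end_token), end exclusive (assumption)


def strict_micro_prf(gold: Dict[str, Set[Span]], pred: Dict[str, Set[Span]]):
    """Micro-averaged strict P/R/F: counts are pooled over all documents."""
    tp = n_pred = n_gold = 0
    for doc_id in set(gold) | set(pred):
        g = gold.get(doc_id, set())
        p = pred.get(doc_id, set())
        tp += len(g & p)  # a prediction counts only on an exact span match
        n_pred += len(p)
        n_gold += len(g)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def typewise_strict_pr(gold: Dict[Span, str], pred: Dict[Span, str], qtype: str):
    """Typewise scores: P restricts predictions by type, R restricts gold by type."""
    pred_t = {s for s, t in pred.items() if t == qtype}
    gold_t = {s for s, t in gold.items() if t == qtype}
    precision = len(pred_t & set(gold)) / len(pred_t) if pred_t else 0.0
    recall = len(gold_t & set(pred)) / len(gold_t) if gold_t else 0.0
    return precision, recall
```

Pooling counts before dividing means a document with many quotations contributes more than one with few, which matches the stated reason for micro-averaging.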

57 As direct quotations are not always explicitly introduced by a cue-verb, we defined a separate baseline with a rule-based approach (Brule) that returns text between quotation marks that has at least 3 tokens, and where the non-stopword and non-proper noun tokens are not all title cased. [sent-211, score-1.318]
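
The Brule baseline can be approximated as below. This is a rough sketch under several assumptions: quotation marks are already normalised to `"`, tokens are split on whitespace, `STOPWORDS` is a small illustrative list, and proper nouns are supplied by the caller through the hypothetical `proper_nouns` argument, because the summary does not say how they were identified. A span with no remaining content tokens is kept, which is also a guess.

```python
import re

# Illustrative stopword list; the list used in the paper is not given here.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "for", "with", "at", "by"}


def brule_candidates(text, proper_nouns=frozenset()):
    """Return quoted spans with >= 3 tokens that do not look like titles."""
    results = []
    for match in re.finditer(r'"([^"]+)"', text):
        inner = match.group(1)
        tokens = inner.split()
        if len(tokens) < 3:
            continue  # too short, e.g. a scare quote
        # Strip trailing punctuation, then drop stopwords and proper nouns.
        content = [t.strip(".,;:!?") for t in tokens]
        content = [
            t for t in content
            if t and t.lower() not in STOPWORDS and t not in proper_nouns
        ]
        # Reject spans whose remaining tokens are all title cased (book titles etc.).
        if content and all(t.istitle() for t in content):
            continue
        results.append(inner)
    return results
```

For example, `"The Art of War"` is rejected because its remaining tokens are all title cased, while an ordinary quoted sentence of three or more tokens is returned.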

58 5 Supervised Approaches We present two supervised approaches to quotation extraction, which operate over the tokens and the phrase-structure parse nodes respectively. [sent-213, score-0.472]

59 Sentence: features indicating whether the sentence contains a quotation mark, a NE, a verb-cue, a pronoun, or any combination of these. [sent-215, score-0.435]

60 Other: features for whether the target is within quotation marks, and whether there is a verb-cue near the end of the sentence. [sent-220, score-0.435]

61 1 Token-based Approach The token-based approach treats quotation extraction as analogous to NE tagging, where there are a sequence of tokens that need to be individually labelled. [sent-224, score-0.499]

62 Each token is given either an I, an O, or a B label, where B denotes the first token in a quotation, I denotes the token is inside a quotation, and O indicates that the token is not part of a quotation. [sent-225, score-0.268]
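
The labelling scheme can be illustrated with a small sketch that converts quotation spans to per-token labels and back. The span representation (token offsets with an exclusive end) and both function names are assumptions for illustration; the paper's CRF features and decoder are not reproduced here.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start_token, end_token), end exclusive (assumption)


def spans_to_iob(n_tokens: int, spans: List[Span]) -> List[str]:
    """Encode non-overlapping quotation spans as one B/I/O label per token."""
    labels = ["O"] * n_tokens
    for start, end in spans:
        labels[start] = "B"  # first token of the quotation
        for i in range(start + 1, end):
            labels[i] = "I"  # remaining tokens inside the quotation
    return labels


def iob_to_spans(labels: List[str]) -> List[Span]:
    """Decode a label sequence into spans; a stray I is treated as a B."""
    spans = []
    start = None
    for i, label in enumerate(labels):
        if label == "B" or (label == "I" and start is None):
            if start is not None:
                spans.append((start, i))  # B directly after a quotation closes it
            start = i
        elif label == "O" and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(labels)))  # quotation runs to the end
    return spans
```

Because B marks a new quotation explicitly, two quotations that are directly adjacent remain distinguishable, which a plain inside/outside labelling could not represent.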

63 As such, we treat the entire document as a single sequence, which allows the predicted quotations to span both sentence and paragraph bounds. [sent-228, score-0.823]

64 Syntactic: the label, depth, and token span size of the highest constituent where the current token is the left-most token in the constituent, as well as its parent, and whether either of those contains a verb-cue. [sent-231, score-0.332]

65 1 All reports the results over all quotations (direct, indirect and mixed). [sent-235, score-0.898]

66 2 Constituent-based Approach The constituent approach classifies whole phrase structure nodes as either quotation or not a quotation. [sent-239, score-0.507]

67 Ideally each quotation would match exactly one constituent, however this is not always the case in our data. [sent-240, score-0.435]

68 In cases without an exact match we label every constituent that is a subspan of the quotation as a quotation as long as it has a parent that is not a subspan of the quotation. [sent-241, score-0.966]
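
The labelling rule for inexact matches amounts to picking the maximal constituents that fit inside the quotation. A minimal sketch follows, assuming a hypothetical `Node` class with token offsets. Treating a root that lies inside the quotation as a quotation node is an assumption, since the rule as stated requires a parent.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """Hypothetical phrase-structure node covering tokens [start, end)."""
    label: str
    start: int
    end: int
    children: List["Node"] = field(default_factory=list)


def quotation_nodes(root: Node, quote: Tuple[int, int]) -> List[Node]:
    """Return constituents inside the quotation whose parent is not inside it."""
    q_start, q_end = quote
    labelled: List[Node] = []

    def visit(node: Node) -> None:
        inside = q_start <= node.start and node.end <= q_end
        if inside:
            # We only reach a node when its parent was not inside the
            # quotation, so this node is maximal; do not descend further.
            labelled.append(node)
            return
        for child in node.children:
            visit(child)

    visit(root)
    return labelled
```

When the quotation matches a single constituent exactly, the function returns just that node, so the exact-match case falls out of the same rule.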

69 In these cases multiple nodes will be labelled quotation, so a postprocessing step is introduced that rebuilds quotations by merging predicted spans that are adjacent or overlapping within a sentence. [sent-242, score-0.796]

70 Restricting the merging process this way loses the ability to predict quotations that cover more than a sentence, but without this restriction too many predicted quotations are erroneously merged. [sent-243, score-1.473]
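
The post-processing step can be sketched as a per-sentence interval merge. The representation is an assumption: predicted spans are token offsets with an exclusive end, grouped by a hypothetical sentence index, so "adjacent" means one span starts exactly where the previous one ends.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start_token, end_token), end exclusive (assumption)


def merge_sentence_spans(spans: List[Span]) -> List[Span]:
    """Merge predicted spans that overlap or touch within one sentence."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            # Overlapping or adjacent: extend the previous span.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def rebuild_quotations(spans_by_sentence: Dict[int, List[Span]]) -> Dict[int, List[Span]]:
    """Apply the merge sentence by sentence, never across sentence boundaries."""
    return {sid: merge_sentence_spans(spans) for sid, spans in spans_by_sentence.items()}
```

Keeping the merge inside each sentence mirrors the restriction described above: spans in neighbouring sentences are never joined, at the cost of missing multi-sentence quotations.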

71 In early experiments we found that the constituent-based approach performed poorly when trained on all quotations, so for these experiments the constituent classifier is trained only on indirect and mixed quotations. [sent-245, score-0.416]

72 1 Direct Quotations Table 4 shows the results for predicting direct quotations on PARC and SMHC. [sent-252, score-0.815]

73 Although direct quotations should be trivial to extract, and a simple system that returns the content between quotation marks should be hard to beat, there are two main factors that confound the rule-based system. [sent-254, score-1.314]

74 The first is the presence of mixed quotations, which is most clearly demonstrated in the difference between the strict precision scores and the partial precision scores for Brule. [sent-255, score-0.202]

75 Brule will find all of the directly-quoted portions of mixed quotes, which do not exactly match a quotation, and so will receive a low precision score with the strict metric. [sent-256, score-0.202]

76 1 All reports the results over all quotations (direct, indirect and mixed). [sent-258, score-0.898]

77 Note that the reduced strict score does not occur for the token method, which correctly identifies mixed quotations. [sent-261, score-0.269]

78 The other main issue is the presence of quotation marks around items such as book titles and scare quotes (i. [sent-262, score-0.596]

79 text that is in quotation marks to distance the author from a particular wording or claim). [sent-264, score-0.5]

80 These results demonstrate that although direct quotations can be accurately extracted with rules, the accuracy will be lower than might be anticipated and the returned spans will include a number of mixed quotations, which will be missing some content. [sent-268, score-1.008]

81 2 Indirect and Mixed Quotations The token approach was also the most effective method for extracting indirect and mixed quotations as Tables 5 and 6 show. [sent-270, score-1.126]

82 Indirect quotations were extracted with strict F-scores of 59% and 60% and partial F-scores of 76% and 74% in PARC and SMHC respectively, while mixed quotes were found with strict F-scores of 56% and 85% and partial F-scores of 87% and 86%. [sent-271, score-1.025]

83 The constituent model yielded lower results than the token one, and in particular it greatly lowered the recall of mixed quotations in both corpora. [sent-273, score-1.009]

84 This resulted in an increase in strict P and increased the F-score for mixed quotations to 57%, similarly to the score achieved by the token model. [sent-277, score-0.994]

85 For this score, the baseline models for indirect and mixed quotations are combined with Brule for direct quotations. [sent-281, score-1.149]

86 Qualitatively we found that the token-based approach was making reasonable predictions most of the time, but would often fail when a quotation was attributed to a speaker through a parenthetical clause, as in Example 4. [sent-286, score-0.499]

87 As discussed in Section 2, quotation attribution has been addressed in the literature before, including some work that includes large-scale data (Elson and McKeown, 2010). [sent-296, score-0.592]

88 However, the large-scale evaluations that exist cover only direct quotations, whereas we present results for direct, indirect, and mixed quotations. [sent-297, score-0.251]

89 The second method uses a CRF which is able to choose between up to 15 entities that are in the paragraph containing the quotation or any preceding it. [sent-301, score-0.435]

90 (2012) this model achieved the best results on the direct quotations in SMHC, despite not using the sequence features or decoding methods that were available to other models. [sent-305, score-0.83]

91 This discrepancy is caused by differences in our data compared to theirs, notably that the sequence of quotations is altered in ours by the introduction of indirect quotations, and that some of the direct quotations that they evaluated would be considered mixed quotations in our corpora. [sent-315, score-2.614]

92 The rule based method performs particularly poorly on PARC, which is likely caused by the relative scarcity of direct quotations and the fact that it was designed for direct quotations only. [sent-316, score-1.63]

93 Direct quotations are much more frequent in SMHC, so the rules that rely on the sequence of speakers would likely perform relatively better than on PARC. [sent-317, score-0.764]

94 ), trained without any sequence information, equalled or outperformed the two other non-gold approaches for all quotation types on both corpora. [sent-319, score-0.45]

95 This indicates that the CRF model evaluated here was not able to effec- Table 7: Speaker attribution accuracy results for both corpora over gold standard quotations. [sent-320, score-0.201]

96 8 Conclusion In this work we have presented the first large-scale experiments on the entire quotation extraction and attribution task: evaluating the extraction and attribution of direct, indirect and mixed quotations over two large news corpora. [sent-322, score-1.897]

97 We also show that state-of-the-art quotation attribution methods are less accurate on indirect and mixed quotations than they are on direct quotations. [sent-325, score-1.741]

98 This work provides an accurate and complete quotation extraction and attribution system that can be used for a wide range of tasks in information extraction and opinion mining. [sent-330, score-0.648]

99 A large-scale system for annotating and querying quotations in news feeds. [sent-387, score-0.758]

100 Visualizing topical quotations over time to understand news discourse. [sent-428, score-0.758]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('quotations', 0.725), ('quotation', 0.435), ('parc', 0.203), ('indirect', 0.173), ('mixed', 0.161), ('attribution', 0.157), ('smhc', 0.142), ('keefe', 0.112), ('direct', 0.09), ('span', 0.075), ('pareti', 0.071), ('token', 0.067), ('quotes', 0.057), ('constituent', 0.056), ('brule', 0.051), ('bsyn', 0.051), ('speaker', 0.049), ('marks', 0.047), ('ars', 0.045), ('verbs', 0.044), ('elson', 0.044), ('strict', 0.041), ('attributions', 0.041), ('herald', 0.041), ('scare', 0.041), ('assertions', 0.04), ('verb', 0.039), ('reporting', 0.036), ('bergler', 0.035), ('krestel', 0.035), ('news', 0.033), ('spans', 0.032), ('attributional', 0.03), ('blex', 0.03), ('etl', 0.03), ('eventualities', 0.03), ('abbott', 0.03), ('mr', 0.029), ('articles', 0.029), ('sydney', 0.028), ('extraction', 0.028), ('morning', 0.028), ('unlabelled', 0.028), ('speech', 0.027), ('attributing', 0.026), ('clergerie', 0.026), ('verbnet', 0.026), ('cue', 0.026), ('prasad', 0.026), ('classifier', 0.026), ('speakers', 0.024), ('glass', 0.024), ('merit', 0.024), ('milidi', 0.024), ('pouliquen', 0.024), ('quote', 0.024), ('verbatim', 0.024), ('predicted', 0.023), ('corpora', 0.022), ('gold', 0.022), ('said', 0.022), ('tokens', 0.021), ('bangay', 0.02), ('blist', 0.02), ('bsay', 0.02), ('crc', 0.02), ('factuality', 0.02), ('goldpp', 0.02), ('hollingsworth', 0.02), ('irena', 0.02), ('koprinska', 0.02), ('mamede', 0.02), ('pictor', 0.02), ('quoted', 0.02), ('saur', 0.02), ('skadhauge', 0.02), ('subspan', 0.02), ('mckeown', 0.02), ('alan', 0.019), ('annotated', 0.019), ('beliefs', 0.019), ('labelling', 0.018), ('markets', 0.018), ('organisation', 0.018), ('verlap', 0.018), ('wording', 0.018), ('silvia', 0.017), ('content', 0.017), ('sabine', 0.017), ('titles', 0.016), ('sarmento', 0.016), ('iob', 0.016), ('luiz', 0.016), ('ralf', 0.016), ('ruy', 0.016), ('penn', 0.016), ('nodes', 0.016), ('sequence', 0.015), ('attributed', 0.015), ('ne', 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000011 35 emnlp-2013-Automatically Detecting and Attributing Indirect Quotations

Author: Silvia Pareti ; Tim O'Keefe ; Ioannis Konstas ; James R. Curran ; Irena Koprinska

2 0.06346152 27 emnlp-2013-Authorship Attribution of Micro-Messages

Author: Roy Schwartz ; Oren Tsur ; Ari Rappoport ; Moshe Koppel

Abstract: Work on authorship attribution has traditionally focused on long texts. In this work, we tackle the question of whether the author of a very short text can be successfully identified. We use Twitter as an experimental testbed. We introduce the concept of an author’s unique “signature”, and show that such signatures are typical of many authors when writing very short texts. We also present a new authorship attribution feature (“flexible patterns”) and demonstrate a significant improvement over our baselines. Our results show that the author of a single tweet can be identified with good accuracy in an array of flavors of the authorship attribution task.

3 0.051700979 152 emnlp-2013-Predicting the Presence of Discourse Connectives

Author: Gary Patterson ; Andrew Kehler

Abstract: We present a classification model that predicts the presence or omission of a lexical connective between two clauses, based upon linguistic features of the clauses and the type of discourse relation holding between them. The model is trained on a set of high frequency relations extracted from the Penn Discourse Treebank and achieves an accuracy of 86.6%. Analysis of the results reveals that the most informative features relate to the discourse dependencies between sequences of coherence relations in the text. We also present results of an experiment that provides insight into the nature and difficulty of the task.

4 0.041346174 41 emnlp-2013-Building Event Threads out of Multiple News Articles

Author: Xavier Tannier ; Veronique Moriceau

Abstract: We present an approach for building multi-document event threads from a large corpus of newswire articles. An event thread is basically a succession of events belonging to the same story. It helps the reader to contextualize the information contained in a single article, by navigating backward or forward in the thread from this article. A specific effort is also made on the detection of reactions to a particular event. In order to build these event threads, we use a cascade of classifiers and other modules, taking advantage of the redundancy of information in the newswire corpus. We also share interesting comments concerning our manual annotation procedure for building a training and testing set.

5 0.037458353 121 emnlp-2013-Learning Topics and Positions from Debatepedia

Author: Swapna Gottipati ; Minghui Qiu ; Yanchuan Sim ; Jing Jiang ; Noah A. Smith

Abstract: We explore Debatepedia, a community-authored encyclopedia of sociopolitical debates, as evidence for inferring a low-dimensional, human-interpretable representation in the domain of issues and positions. We introduce a generative model positing latent topics and cross-cutting positions that gives special treatment to person mentions and opinion words. We evaluate the resulting representation’s usefulness in attaching opinionated documents to arguments and its consistency with human judgments about positions.

6 0.031859223 183 emnlp-2013-The VerbCorner Project: Toward an Empirically-Based Semantic Decomposition of Verbs

7 0.031716526 96 emnlp-2013-Identifying Phrasal Verbs Using Many Bilingual Corpora

8 0.031261139 132 emnlp-2013-Mining Scientific Terms and their Definitions: A Study of the ACL Anthology

9 0.030258397 73 emnlp-2013-Error-Driven Analysis of Challenges in Coreference Resolution

10 0.028276129 178 emnlp-2013-Success with Style: Using Writing Style to Predict the Success of Novels

11 0.028166916 167 emnlp-2013-Semi-Markov Phrase-Based Monolingual Alignment

12 0.028133616 188 emnlp-2013-Tree Kernel-based Negation and Speculation Scope Detection with Structured Syntactic Parse Features

13 0.027518433 6 emnlp-2013-A Generative Joint, Additive, Sequential Model of Topics and Speech Acts in Patient-Doctor Communication

14 0.027220424 76 emnlp-2013-Exploiting Discourse Analysis for Article-Wide Temporal Classification

15 0.02689961 67 emnlp-2013-Easy Victories and Uphill Battles in Coreference Resolution

16 0.026700623 127 emnlp-2013-Max-Margin Synchronous Grammar Induction for Machine Translation

17 0.026566086 95 emnlp-2013-Identifying Multiple Userids of the Same Author

18 0.025679424 59 emnlp-2013-Deriving Adjectival Scales from Continuous Space Word Representations

19 0.025547046 71 emnlp-2013-Efficient Left-to-Right Hierarchical Phrase-Based Translation with Improved Reordering

20 0.025243543 89 emnlp-2013-Gender Inference of Twitter Users in Non-English Contexts

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.098), (1, 0.019), (2, -0.009), (3, 0.001), (4, -0.018), (5, -0.008), (6, -0.015), (7, -0.004), (8, -0.011), (9, 0.013), (10, -0.012), (11, 0.057), (12, -0.006), (13, 0.047), (14, 0.034), (15, 0.076), (16, -0.045), (17, -0.016), (18, 0.029), (19, 0.018), (20, -0.036), (21, 0.116), (22, 0.016), (23, -0.04), (24, -0.005), (25, -0.019), (26, 0.046), (27, -0.009), (28, 0.008), (29, -0.017), (30, 0.003), (31, -0.023), (32, -0.02), (33, -0.025), (34, -0.063), (35, -0.031), (36, 0.014), (37, -0.02), (38, -0.029), (39, 0.022), (40, -0.085), (41, -0.009), (42, 0.087), (43, -0.098), (44, -0.026), (45, 0.07), (46, 0.15), (47, -0.034), (48, -0.124), (49, -0.085)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91068691 35 emnlp-2013-Automatically Detecting and Attributing Indirect Quotations

Author: Silvia Pareti ; Tim O'Keefe ; Ioannis Konstas ; James R. Curran ; Irena Koprinska

2 0.53766555 178 emnlp-2013-Success with Style: Using Writing Style to Predict the Success of Novels

Author: Vikas Ganjigunte Ashok ; Song Feng ; Yejin Choi

Abstract: Predicting the success of literary works is a curious question among publishers and aspiring writers alike. We examine the quantitative connection, if any, between writing style and successful literature. Based on novels over several different genres, we probe the predictive power of statistical stylometry in discriminating successful literary works, and identify characteristic stylistic elements that are more prominent in successful writings. Our study reports for the first time that statistical stylometry can be surprisingly effective in discriminating highly successful literature from less successful counterpart, achieving accuracy up to 84%. Closer analyses lead to several new insights into characteristics of the writing style in successful literature, including findings that are contrary to the conventional wisdom with respect to good writing style and readability.

3 0.50398695 26 emnlp-2013-Assembling the Kazakh Language Corpus

Author: Olzhas Makhambetov ; Aibek Makazhanov ; Zhandos Yessenbayev ; Bakhyt Matkarimov ; Islam Sabyrgaliyev ; Anuar Sharafudinov

Abstract: This paper presents the Kazakh Language Corpus (KLC), which is one of the first attempts made within a local research community to assemble a Kazakh corpus. KLC is designed to be a large scale corpus containing over 135 million words and conveying five stylistic genres: literary, publicistic, official, scientific and informal. Along with its primary part KLC comprises such parts as: (i) annotated sub-corpus, containing segmented documents encoded in the eXtensible Markup Language (XML) that marks complete morphological, syntactic, and structural characteristics of texts; (ii) as well as a sub-corpus with the annotated speech data. KLC has a web-based corpus management system that helps to navigate the data and retrieve necessary information. KLC is also open for contributors, who are willing to make suggestions, donate texts and help with annotation of existing materials.

4 0.43624032 183 emnlp-2013-The VerbCorner Project: Toward an Empirically-Based Semantic Decomposition of Verbs

Author: Joshua K. Hartshorne ; Claire Bonial ; Martha Palmer

Abstract: This research describes efforts to use crowdsourcing to improve the validity of the semantic predicates in VerbNet, a lexicon of about 6300 English verbs. The current semantic predicates can be thought of as semantic primitives, into which the concepts denoted by a verb can be decomposed. For example, the verb spray (of the Spray class), involves the predicates MOTION, NOT, and LOCATION, where the event can be decomposed into an AGENT causing a THEME that was originally not in a particular location to now be in that location. Although VerbNet’s predicates are theoretically well-motivated, systematic empirical data is scarce. This paper describes a recently-launched attempt to address this issue with a series of human judgment tasks, posed to subjects in the form of games.

5 0.4285782 129 emnlp-2013-Measuring Ideological Proportions in Political Speeches

Author: Yanchuan Sim ; Brice D. L. Acree ; Justin H. Gross ; Noah A. Smith

Abstract: We seek to measure political candidates’ ideological positioning from their speeches. To accomplish this, we infer ideological cues from a corpus of political writings annotated with known ideologies. We then represent the speeches of U.S. Presidential candidates as sequences of cues and lags (filler distinguished only by its length in words). We apply a domain-informed Bayesian HMM to infer the proportions of ideologies each candidate uses in each campaign. The results are validated against a set of preregistered, domain expert-authored hypotheses.

6 0.40669805 190 emnlp-2013-Ubertagging: Joint Segmentation and Supertagging for English

7 0.40303832 68 emnlp-2013-Effectiveness and Efficiency of Open Relation Extraction

8 0.39880556 27 emnlp-2013-Authorship Attribution of Micro-Messages

9 0.3842347 152 emnlp-2013-Predicting the Presence of Discourse Connectives

10 0.37941384 6 emnlp-2013-A Generative Joint, Additive, Sequential Model of Topics and Speech Acts in Patient-Doctor Communication

11 0.37818161 91 emnlp-2013-Grounding Strategic Conversation: Using Negotiation Dialogues to Predict Trades in a Win-Lose Game

12 0.34751314 10 emnlp-2013-A Multi-Teraflop Constituency Parser using GPUs

13 0.33763328 14 emnlp-2013-A Synchronous Context Free Grammar for Time Normalization

14 0.33348703 33 emnlp-2013-Automatic Knowledge Acquisition for Case Alternation between the Passive and Active Voices in Japanese

15 0.33093286 126 emnlp-2013-MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text

16 0.31673115 161 emnlp-2013-Rule-Based Information Extraction is Dead! Long Live Rule-Based Information Extraction Systems!

17 0.30321115 71 emnlp-2013-Efficient Left-to-Right Hierarchical Phrase-Based Translation with Improved Reordering

18 0.29600012 121 emnlp-2013-Learning Topics and Positions from Debatepedia

19 0.29181597 106 emnlp-2013-Inducing Document Plans for Concept-to-Text Generation

20 0.28661564 72 emnlp-2013-Elephant: Sequence Labeling for Word and Sentence Segmentation

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.041), (18, 0.024), (22, 0.041), (30, 0.044), (50, 0.01), (51, 0.6), (66, 0.022), (71, 0.03), (75, 0.025), (77, 0.018), (96, 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.9987033 178 emnlp-2013-Success with Style: Using Writing Style to Predict the Success of Novels

Author: Vikas Ganjigunte Ashok ; Song Feng ; Yejin Choi

2 0.99803853 24 emnlp-2013-Application of Localized Similarity for Web Documents

Author: Peter Rebersek ; Mateja Verlic

Abstract: In this paper we present a novel approach to automatic creation of anchor texts for hyperlinks in a document pointing to similar documents. Methods used in this approach rank parts of a document based on the similarity to a presumably related document. Ranks are then used to automatically construct the best anchor text for a link inside original document to the compared document. A number of different methods from information retrieval and natural language processing are adapted for this task. Automatically constructed anchor texts are manually evaluated in terms of relatedness to linked documents and compared to baseline consisting of originally inserted anchor texts. Additionally we use crowdsourcing for evaluation of original anchors and automatically constructed anchors. Results show that our best adapted methods rival the precision of the baseline method.

3 0.99729198 91 emnlp-2013-Grounding Strategic Conversation: Using Negotiation Dialogues to Predict Trades in a Win-Lose Game

Author: Anais Cadilhac ; Nicholas Asher ; Farah Benamara ; Alex Lascarides

Abstract: This paper describes a method that predicts which trades players execute during a win-lose game. Our method uses data collected from chat negotiations of the game The Settlers of Catan and exploits the conversation to construct dynamically a partial model of each player’s preferences. This in turn yields equilibrium trading moves via principles from game theory. We compare our method against four baselines and show that tracking how preferences evolve through the dialogue and reasoning about equilibrium moves are both crucial to success.

same-paper 4 0.99659365 35 emnlp-2013-Automatically Detecting and Attributing Indirect Quotations

Author: Silvia Pareti ; Tim O'Keefe ; Ioannis Konstas ; James R. Curran ; Irena Koprinska

5 0.99641341 96 emnlp-2013-Identifying Phrasal Verbs Using Many Bilingual Corpora

Author: Karl Pichotta ; John DeNero

Abstract: We address the problem of identifying multiword expressions in a language, focusing on English phrasal verbs. Our polyglot ranking approach integrates frequency statistics from translated corpora in 50 different languages. Our experimental evaluation demonstrates that combining statistical evidence from many parallel corpora using a novel ranking-oriented boosting algorithm produces a comprehensive set of English phrasal verbs, achieving performance comparable to a human-curated set.

6 0.99616939 32 emnlp-2013-Automatic Idiom Identification in Wiktionary

7 0.99540818 166 emnlp-2013-Semantic Parsing on Freebase from Question-Answer Pairs

8 0.99503028 148 emnlp-2013-Orthonormal Explicit Topic Analysis for Cross-Lingual Document Matching

9 0.94920337 73 emnlp-2013-Error-Driven Analysis of Challenges in Coreference Resolution

10 0.94891804 60 emnlp-2013-Detecting Compositionality of Multi-Word Expressions using Nearest Neighbours in Vector Space Models

11 0.94780302 132 emnlp-2013-Mining Scientific Terms and their Definitions: A Study of the ACL Anthology

12 0.94307667 181 emnlp-2013-The Effects of Syntactic Features in Automatic Prediction of Morphology

13 0.9422996 62 emnlp-2013-Detection of Product Comparisons - How Far Does an Out-of-the-Box Semantic Role Labeling System Take You?

14 0.93966484 126 emnlp-2013-MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text

15 0.93886322 173 emnlp-2013-Simulating Early-Termination Search for Verbose Spoken Queries

16 0.93464541 37 emnlp-2013-Automatically Identifying Pseudepigraphic Texts

17 0.93438888 27 emnlp-2013-Authorship Attribution of Micro-Messages

18 0.9336015 193 emnlp-2013-Unsupervised Induction of Cross-Lingual Semantic Relations

19 0.93084639 26 emnlp-2013-Assembling the Kazakh Language Corpus

20 0.93071967 152 emnlp-2013-Predicting the Presence of Discourse Connectives