acl acl2011 acl2011-214 knowledge-graph by maker-knowledge-mining

214 acl-2011-Lost in Translation: Authorship Attribution using Frame Semantics


Source: pdf

Author: Steffen Hedegaard ; Jakob Grue Simonsen

Abstract: We investigate authorship attribution using classifiers based on frame semantics. The purpose is to discover whether adding semantic information to lexical and syntactic methods for authorship attribution will improve them, specifically to address the difficult problem of authorship attribution of translated texts. Our results suggest (i) that frame-based classifiers are usable for author attribution of both translated and untranslated texts; (ii) that frame-based classifiers generally perform worse than the baseline classifiers for untranslated texts, but (iii) perform as well as, or superior to, the baseline classifiers on translated texts; (iv) that—contrary to current belief—naïve classifiers based on lexical markers may perform tolerably on translated texts if the combination of author and translator is present in the training set of a classifier.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Lost in Translation: Authorship Attribution using Frame Semantics Steffen Hedegaard Department of Computer Science, University of Copenhagen Njalsgade 128, 2300 Copenhagen S, Denmark steffenh@diku.dk [sent-1, score-0.045]

2 Abstract We investigate authorship attribution using classifiers based on frame semantics. [sent-2, score-1.01]

3 The purpose is to discover whether adding semantic information to lexical and syntactic methods for authorship attribution will improve them, specifically to address the difficult problem of authorship attribution of translated texts. [sent-3, score-2.08]

4 1 Introduction Authorship attribution is the following problem: For a given text, determine the author of said text among a list of candidate authors. [sent-5, score-0.651]

5 Determining authorship is difficult, and a host of methods have been proposed: As of 1998 Rudman estimated the number of metrics used in such methods to be at least 1000 (Rudman, 1997). [sent-6, score-0.423]

6 The process of authorship attribution consists of selecting markers (features that provide an indication of the author) and classifying a text by assigning it to an author using some appropriate machine learning technique. [sent-11, score-1.264]
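The two steps described here — counting relative marker frequencies and then assigning the text to an author with a learned model — can be sketched as follows. This is a minimal illustration, not the paper's actual setup: the word markers, the toy nearest-centroid assignment, and all function names are assumptions standing in for the "appropriate machine learning technique".

```python
from collections import Counter

def marker_frequencies(text, markers):
    """Relative frequency of each marker (here: a word) in the text."""
    tokens = text.lower().split()
    counts = Counter(t for t in tokens if t in markers)
    total = len(tokens) or 1
    return [counts[m] / total for m in markers]

def attribute(text, training_texts, markers):
    """Assign the text to the candidate author whose averaged training
    profile is closest in Euclidean distance -- a stand-in for the
    classification step of the attribution process."""
    def profile(texts):
        vecs = [marker_frequencies(t, markers) for t in texts]
        return [sum(col) / len(vecs) for col in zip(*vecs)]

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    v = marker_frequencies(text, markers)
    return min(training_texts,
               key=lambda author: dist(v, profile(training_texts[author])))
```

For example, an unseen text dominated by one author's preferred markers is assigned to that author's centroid.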

7 Jakob Grue Simonsen Department of Computer Science, University of Copenhagen Njalsgade 128, 2300 Copenhagen S, Denmark simonsen@diku.dk [sent-12, score-0.045]

8 1 Attribution of translated texts In contrast to the general authorship attribution problem, the specific problem of attributing translated texts to their original author has received little attention. [sent-14, score-1.759]

9 Conceivably, this is due to the common intuition that the impact of the translator may add enough noise that proper attribution to the original author will be very difficult; for example, in (Arun et al. [sent-15, score-0.848]

10 , 2009) it was found that the imprint of the translator was significantly greater than that of the original author. [sent-16, score-0.197]

11 The volume of resources for natural language processing in English appears to be much larger than for any other language, and it is thus, conceivably, convenient to use the resources at hand for a translated version of the text, rather than the original. [sent-17, score-0.154]

12 To appreciate the difficulty of purely lexical or syntactic characterization of authors based on translation, consider the following excerpts from three different translations of the first few paragraphs of Turgenev’s Dvoryanskoe Gnezdo: Liza "A nest of nobles" Translated by W. [sent-18, score-0.124]

13 Shedden-Ralston A beautiful spring day was drawing to a close. [sent-20, score-0.076]

14 High aloft in the clear sky floated small rosy clouds, which seemed never to drift past, but to be slowly absorbed into the blue depths beyond. [sent-21, score-0.202]

15 At an open window, in a handsome mansion situated in one of the outlying streets of O. [sent-22, score-0.196]

16 , the chief town of the government of that name–it was in the year 1842–there were sitting two ladies, the one about fifty years old, the other an old woman of seventy. [sent-23, score-0.182]

17 Hapgood The brilliant, spring day was inclining toward the [sent-26, score-0.076]

18 evening, tiny rose-tinted cloudlets hung high in the heavens, and seemed not to be floating past, but retreating into the very depths of the azure. [sent-28, score-0.112]

19 In front of the open window of a handsome house, in one of the outlying streets of O * * * the capital of a Government, sat two women; one fifty years of age, the other seventy years old, and already aged. [sent-29, score-0.243]

20 Garnett A bright spring day was fading into evening. [sent-31, score-0.076]

21 High overhead in the clear heavens small rosy clouds seemed hardly to move across the sky but to be sinking into its depths of blue. [sent-32, score-0.287]

22 In a handsome house in one of the outlying streets of the government town of O—- (it was in the year 1842) two women were sitting at an open window; one was about fifty, the other an old lady of seventy. [sent-33, score-0.401]

23 As translators express the same semantic content in different ways, the syntax and style of different translations of the same text will differ greatly due to the footprint of the translators; this footprint may affect the classification process in different ways depending on the features. [sent-34, score-0.619]

24 For markers based on language structure such as grammar or function words it is to be expected that the footprint of the translator has such a high impact on the resulting text that attribution to the author may not be possible. [sent-35, score-1.242]

25 However, it is possible that a specific author/translator combination has its own unique footprint discernible from other author/translator combinations: a specific translator may consistently translate often-used phrases in the same way. [sent-36, score-0.446]

26 Ideally, the footprint of the author is (more or less) unaffected by the process of translation, for example if the languages are very similar or the marker is not based solely on lexical or syntactic features. [sent-37, score-0.492]

27 In contrast to purely lexical or syntactic features, the semantic content is expected to be, roughly, the same in translations and originals. [sent-38, score-0.074]

28 This leads us to hypothesize that a marker based on semantic frames such as found in the FrameNet database (Ruppenhofer et al. [sent-39, score-0.345]

29 , 2006), will be largely unaffected by translations, whereas traditional lexical markers will be severely impacted by the footprint of the translator. [sent-40, score-0.486]

30 The FrameNet project is a database of annotated exemplar frames, their relations to other frames and obligatory as well as optional frame elements for each frame. [sent-41, score-0.252]

31 FrameNet currently numbers approximately 1000 different frames annotated with natural 66 language examples. [sent-42, score-0.222]

32 In this paper, we combine the data from FrameNet with the LTH semantic parser (Johansson and Nugues, 2007), until very recently (Das et al. [sent-43, score-0.038]

33 , 2010) the semantic parser with best experimental performance (note that the performance of LTH on our corpora is unknown and may differ from the numbers reported in (Johansson and Nugues, 2007)). [sent-44, score-0.038]

34 2 Related work The research on authorship attribution is too voluminous to include; see the excellent surveys (Juola, 2006; Koppel et al. [sent-46, score-0.944]

35 , 2008; Stamatatos, 2009) for an overview of the plethora of lexical and syntactic markers used. [sent-47, score-0.19]

36 The literature on the use of semantic markers is much scarcer: Gamon (Gamon, 2004) developed a tool for producing semantic dependency graphs and using the resulting information in conjunction with lexical and syntactic markers to improve the accuracy of classification. [sent-48, score-0.456]

37 , 2006) employed WordNet and latent semantic analysis to lexical features with the purpose of finding semantic similarities between words; it is not clear whether the use of semantic features improved the classification. [sent-51, score-0.114]

38 , 1997), the problem of attributing translated texts appears to be fairly untouched. [sent-56, score-0.388]

39 2 Corpus and resource selection As pointed out in (Luyckx and Daelemans, 2010), the size of the data set and the number of authors may crucially affect the efficiency of author attribution methods, and evaluation of the method on some standard corpus is essential (Stamatatos, 2009). [sent-57, score-0.731]

40 Closest to a standard corpus for author attribution is The Federalist Papers (Juola, 2006), originally used by Mosteller and Wallace (Mosteller and Wallace, 1964), and we employ the subset of this corpus consisting of the 71 undisputed single-author documents as our Corpus I. [sent-58, score-0.715]

41 For translated texts, a mix of authors and translators across authors is needed to ensure that the attribution methods do not attribute to the translator instead of the author. [sent-59, score-1.172]

42 However, there does not appear to be a large corpus of texts publicly available that satisfy this demand. [sent-60, score-0.175]

43 Based on this, we elected to compile a fresh corpus of translated texts; our Corpus II consists of English translations of 19th century Russian romantic literature chosen from Project Gutenberg for which a number of different versions, with different translators, existed. [sent-61, score-0.252]

44 The corpus primarily consists of novels, but is slightly polluted by a few collections of short stories and two nonfiction works by Tolstoy due to the necessity of including a reasonable mix of authors and translators. [sent-62, score-0.118]

45 The corpus consists of 30 texts by 4 different authors and 12 different translators of which some have translated several different authors. [sent-63, score-0.514]

46 The texts range in size from 200 (Turgenev: The Rendezvous) to 33000 (Tolstoy: War and Peace) sentences. [sent-64, score-0.143]

47 3 Experiment design For both corpora, authorship attribution experiments were performed using six classifiers, each employing a distinct feature set. [sent-67, score-0.944]

48 For each feature set the markers were counted in the text and their relative frequencies calculated. [sent-68, score-0.226]

49 Feature selection was based solely on training data in the inner loop of the cross-validation cycle. [sent-69, score-0.028]
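The requirement that feature selection see only training data can be made concrete with a small k-fold cross-validation sketch. The callables and their signatures here are hypothetical scaffolding, not the authors' code; the key point is that `select_features` runs inside the loop, on the training folds only.

```python
def cross_validate(docs, labels, k, select_features, train, classify):
    """k-fold cross-validation where feature selection is performed in the
    inner loop, on training folds only -- never on the held-out fold."""
    correct = 0
    for fold in range(k):
        train_idx = [i for i in range(len(docs)) if i % k != fold]
        test_idx = [i for i in range(len(docs)) if i % k == fold]
        # Feature selection sees only the training portion of this fold.
        feats = select_features([docs[i] for i in train_idx])
        model = train([docs[i] for i in train_idx],
                      [labels[i] for i in train_idx], feats)
        correct += sum(classify(model, docs[i], feats) == labels[i]
                       for i in test_idx)
    return correct / len(docs)
```

Selecting features on the full corpus instead would leak information from the held-out documents and inflate the reported accuracy.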

50 The feature sets were: Frequent Words (FW): Frequencies in the text of the X most frequent words. [sent-71, score-0.046]

51 Character N-grams: The X most frequent N-grams for N = 3, 4, 5. [sent-73, score-0.046]

52 Frames: The relative frequencies of the X most frequently occurring semantic frames. [sent-74, score-0.074]

53 Frequent Words and Frames (FWaF): The X/2 most frequent features; words and frames resp. [sent-75, score-0.268]
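Under the assumption that frames arrive as lists of frame names produced by a semantic parser (the paper uses LTH over FrameNet; here they are simply passed in), the four feature sets can be sketched roughly as below. X is the feature budget; all function names are illustrative, not the authors' implementation.

```python
from collections import Counter

def top_words(texts, x):
    """Frequent Words (FW): the x most frequent words across the corpus."""
    counts = Counter(w for t in texts for w in t.lower().split())
    return [w for w, _ in counts.most_common(x)]

def top_char_ngrams(texts, n, x):
    """Character N-grams: the x most frequent character n-grams."""
    counts = Counter(t[i:i + n] for t in texts for i in range(len(t) - n + 1))
    return [g for g, _ in counts.most_common(x)]

def top_frames(frames_per_text, x):
    """Frames: the x most frequently occurring semantic frame names."""
    counts = Counter(f for frames in frames_per_text for f in frames)
    return [f for f, _ in counts.most_common(x)]

def fwaf(texts, frames_per_text, x):
    """Frequent Words and Frames (FWaF): x/2 words plus x/2 frames."""
    return top_words(texts, x // 2) + top_frames(frames_per_text, x // 2)
```

Each classifier would then use the relative frequencies of its selected feature set, computed per document.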

54 To ascertain how heavily each marker is influenced by translation we also performed translator attribution on a subset of 11 texts [Corpus IIb] with 3 different translators each having translated 3 different authors. [sent-78, score-1.237]

55 If the translator leaves a heavy footprint on the marker, the marker is expected to score better when attributing to translator than to author. [sent-79, score-0.774]

56 Finally, we reduced the corpus to a set of 18 texts [Corpus IIc] that only includes unique author/translator combinations to see if each marker could attribute correctly to an author if the translator/author combination was not present in the training set. [sent-80, score-0.443]

57 The most frequent words are from a list of word frequencies in the BNC compiled by (Leech et al. [sent-87, score-0.082]

58 FWaF performed better than FW for attribution of author on translated texts. [sent-94, score-0.805]

59 For each corpus results are given for experiments with 400 features. [sent-97, score-0.032]

60 For each corpus the bottom row indicates whether each classifier is significantly discernible from a weighted random attribution. [sent-99, score-0.107]

61 Table 2: Authorship attribution results; each author is given equal weight, regardless of the number of documents. [sent-200, score-0.651]

62 1 Corpus I: The Federalist Papers For the Federalist Papers, the traditional authorship attribution markers all lie in the 95%+ accuracy range, as expected. [sent-201, score-1.181]

63 However, the frame-based markers achieved statistically significant results, and can hence be used for authorship attribution on untranslated documents (but perform worse than the baseline). [sent-202, score-1.197]

64 2 Corpus II: Attribution of translated texts For Corpus IIa – the entire corpus of translated texts – all methods achieve results significantly better than random, and FWaF is the best-scoring method, followed by FW. [sent-205, score-0.483]

65 The results for Corpus IIb (three authors, three translators) clearly suggest that the footprint of the translator is evident in the translated texts, and that the FW (frequent word) classifier is particularly sensitive to the footprint. [sent-206, score-0.585]

66 In fact, FW was the only one achieving a significant result over random assignment, giving an indication that this marker may be particularly vulnerable to translator influence when attempting to attribute authors. [sent-207, score-0.311]

67 Some of this can be attributed to a smaller (training) corpus, but we also suspect the lack of several instances of the same author/translator combinations in the corpus. [sent-209, score-0.024]

68 Observe that the FWaF classifier is the only classifier with significantly better performance than weighted random assignment, and outperforms the other methods. [sent-210, score-0.06]

69 Frames alone also outperform traditional markers, albeit not by much. [sent-211, score-0.047]

70 The experiments on the collected corpora strongly suggest the feasibility of using Frames as markers for authorship attribution, in particular in combination with traditional lexical approaches. [sent-212, score-0.66]

71 Our inability to obtain demonstrably significant improvement of FWaF over the approach based on Frequent Words is likely an artifact of the fairly small corpus we employ. [sent-213, score-0.032]

72 5 Conclusions, caveats, and future work We have investigated the use of semantic frames as markers for author attribution and tested their applicability to attribution of translated texts. [sent-216, score-1.776]

73 Our results show that frames are potentially useful, especially so for translated texts, and suggest that a combined method of frequent words and frames can outperform methods based solely on traditional markers on translated texts. [sent-217, score-0.873]

74 For attribution of untranslated texts and attribution to translator, traditional markers such as frequent words and n-grams are still to be preferred. [sent-218, score-1.728]

75 Our test corpora consist of a limited number of authors, from a limited time period, with translators from a similar limited time period and cultural context. [sent-219, score-0.137]

76 Furthermore, our translations are all from a single language. [sent-220, score-0.036]

77 It is well known that effectiveness of authorship markers may be influenced by topics (Stein et al. [sent-222, score-0.613]

78 , 2010); while we have endeavored to design our corpora to minimize such influence, we do not currently know the quantitative impact of topicality on the attribution methods in this paper. [sent-224, score-0.521]

79 Furthermore, traditional investigations of authorship attribution have focused on the case of attributing texts among a small (N < 10) class of authors at a time, albeit with recent, notable exceptions (Luyckx and Daelemans, 2010; Koppel et al. [sent-225, score-1.273]

80 We test our methods on similarly restricted sets of authors; the scalability of the methods to larger numbers of authors is currently unknown. [sent-227, score-0.048]

81 , 2010); it would be interesting to see whether a classifier using frames yields significant improvements in ensemble with other methods. [sent-229, score-0.252]

82 A stylometric analysis of Mormon scripture and related texts. [sent-269, score-0.045]

83 The effect of author set size and data size in authorship attribution. [sent-298, score-0.553]

84 The state of authorship attribution studies: Some problems and solutions. [sent-326, score-0.944]

85 A comparison of statistical significance tests for information retrieval evaluation. [sent-351, score-0.028]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('attribution', 0.521), ('authorship', 0.423), ('frames', 0.222), ('footprint', 0.204), ('translator', 0.197), ('markers', 0.19), ('fwaf', 0.159), ('translated', 0.154), ('texts', 0.143), ('translators', 0.137), ('author', 0.13), ('attributing', 0.091), ('iia', 0.091), ('iib', 0.091), ('iic', 0.091), ('koppel', 0.088), ('marker', 0.085), ('framenet', 0.078), ('depths', 0.068), ('handsome', 0.068), ('streets', 0.068), ('fw', 0.064), ('untranslated', 0.063), ('stamatatos', 0.063), ('copenhagen', 0.06), ('outlying', 0.06), ('luyckx', 0.06), ('juola', 0.06), ('arun', 0.055), ('federalist', 0.055), ('mosteller', 0.055), ('authors', 0.048), ('fifty', 0.047), ('traditional', 0.047), ('frequent', 0.046), ('diku', 0.045), ('discernible', 0.045), ('heavens', 0.045), ('mormon', 0.045), ('rosy', 0.045), ('rudman', 0.045), ('sky', 0.045), ('smucker', 0.045), ('turgenev', 0.045), ('unaffected', 0.045), ('society', 0.044), ('moshe', 0.044), ('shlomo', 0.044), ('mccarthy', 0.044), ('seemed', 0.044), ('old', 0.042), ('spring', 0.041), ('nest', 0.04), ('archer', 0.04), ('tolstoy', 0.04), ('clouds', 0.04), ('mix', 0.038), ('semantic', 0.038), ('house', 0.037), ('njalsgade', 0.037), ('denmark', 0.037), ('ruppenhofer', 0.037), ('schein', 0.037), ('leech', 0.037), ('conceivably', 0.037), ('classifiers', 0.036), ('frequencies', 0.036), ('translations', 0.036), ('day', 0.035), ('johansson', 0.035), ('wallace', 0.035), ('efstathios', 0.035), ('sitting', 0.033), ('women', 0.033), ('stein', 0.033), ('corpus', 0.032), ('novels', 0.031), ('mcnemar', 0.031), ('lth', 0.031), ('conceivable', 0.031), ('schler', 0.031), ('town', 0.031), ('argamon', 0.03), ('century', 0.03), ('frame', 0.03), ('classifier', 0.03), ('raghavan', 0.029), ('government', 0.029), ('attribute', 0.029), ('ii', 0.028), ('daelemans', 0.028), ('duan', 0.028), ('significance', 0.028), ('solely', 0.028), ('nugues', 0.027), ('studies', 0.026), ('yates', 0.025), ('gamon', 0.024), ('literary', 0.024), ('combinations', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999875 214 acl-2011-Lost in Translation: Authorship Attribution using Frame Semantics

Author: Steffen Hedegaard ; Jakob Grue Simonsen

Abstract: We investigate authorship attribution using classifiers based on frame semantics. The purpose is to discover whether adding semantic information to lexical and syntactic methods for authorship attribution will improve them, specifically to address the difficult problem of authorship attribution of translated texts. Our results suggest (i) that frame-based classifiers are usable for author attribution of both translated and untranslated texts; (ii) that frame-based classifiers generally perform worse than the baseline classifiers for untranslated texts, but (iii) perform as well as, or superior to, the baseline classifiers on translated texts; (iv) that—contrary to current belief—naïve classifiers based on lexical markers may perform tolerably on translated texts if the combination of author and translator is present in the training set of a classifier.

2 0.16376846 212 acl-2011-Local Histograms of Character N-grams for Authorship Attribution

Author: Hugo Jair Escalante ; Thamar Solorio ; Manuel Montes-y-Gomez

Abstract: This paper proposes the use of local histograms (LH) over character n-grams for authorship attribution (AA). LHs are enriched histogram representations that preserve sequential information in documents; they have been successfully used for text categorization and document visualization using word histograms. In this work we explore the suitability of LHs over n-grams at the character-level for AA. We show that LHs are particularly helpful for AA, because they provide useful information for uncovering, to some extent, the writing style of authors. We report experimental results in AA data sets that confirm that LHs over character n-grams are more helpful for AA than the usual global histograms, yielding results far superior to state of the art approaches. We found that LHs are even more advantageous in challenging conditions, such as having imbalanced and small training sets. Our results motivate further research on the use of LHs for modeling the writing style of authors for related tasks, such as authorship verification and plagiarism detection.

3 0.13232405 274 acl-2011-Semi-Supervised Frame-Semantic Parsing for Unknown Predicates

Author: Dipanjan Das ; Noah A. Smith

Abstract: We describe a new approach to disambiguating semantic frames evoked by lexical predicates previously unseen in a lexicon or annotated data. Our approach makes use of large amounts of unlabeled data in a graph-based semi-supervised learning framework. We construct a large graph where vertices correspond to potential predicates and use label propagation to learn possible semantic frames for new ones. The label-propagated graph is used within a frame-semantic parser and, for unknown predicates, results in over 15% absolute improvement in frame identification accuracy and over 13% absolute improvement in full frame-semantic parsing F1 score on a blind test set, over a state-of-the-art supervised baseline.

4 0.12252117 306 acl-2011-Towards Style Transformation from Written-Style to Audio-Style

Author: Amjad Abu-Jbara ; Barbara Rosario ; Kent Lyons

Abstract: In this paper, we address the problem of optimizing the style of textual content to make it more suitable to being listened to by a user as opposed to being read. We study the differences between the written style and the audio style by consulting the linguistics andjour- nalism literatures. Guided by this study, we suggest a number of linguistic features to distinguish between the two styles. We show the correctness of our features and the impact of style transformation on the user experience through statistical analysis, a style classification task, and a user study.

5 0.11710786 133 acl-2011-Extracting Social Power Relationships from Natural Language

Author: Philip Bramsen ; Martha Escobar-Molano ; Ami Patel ; Rafael Alonso

Abstract: Sociolinguists have long argued that social context influences language use in all manner of ways, resulting in lects 1. This paper explores a text classification problem we will call lect modeling, an example of what has been termed computational sociolinguistics. In particular, we use machine learning techniques to identify social power relationships between members of a social network, based purely on the content of their interpersonal communication. We rely on statistical methods, as opposed to language-specific engineering, to extract features which represent vocabulary and grammar usage indicative of social power lect. We then apply support vector machines to model the social power lects representing superior-subordinate communication in the Enron email corpus. Our results validate the treatment of lect modeling as a text classification problem – albeit a hard one – and constitute a case for future research in computational sociolinguistics. 1

6 0.10169708 319 acl-2011-Unsupervised Decomposition of a Document into Authorial Components

7 0.092282012 311 acl-2011-Translationese and Its Dialects

8 0.076298907 216 acl-2011-MEANT: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility based on semantic roles

9 0.075234421 195 acl-2011-Language of Vandalism: Improving Wikipedia Vandalism Detection via Stylometric Analysis

10 0.068998717 90 acl-2011-Crowdsourcing Translation: Professional Quality from Non-Professionals

11 0.050591893 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing

12 0.047245629 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents

13 0.046697345 157 acl-2011-I Thou Thee, Thou Traitor: Predicting Formal vs. Informal Address in English Literature

14 0.044079084 286 acl-2011-Social Network Extraction from Texts: A Thesis Proposal

15 0.043575596 194 acl-2011-Language Use: What can it tell us?

16 0.043472357 167 acl-2011-Improving Dependency Parsing with Semantic Classes

17 0.043249495 81 acl-2011-Consistent Translation using Discriminative Learning - A Translation Memory-inspired Approach

18 0.040273808 31 acl-2011-Age Prediction in Blogs: A Study of Style, Content, and Online Behavior in Pre- and Post-Social Media Generations

19 0.039514683 87 acl-2011-Corpus Expansion for Statistical Machine Translation with Semantic Role Label Substitution Rules

20 0.03845492 183 acl-2011-Joint Bilingual Sentiment Classification with Unlabeled Parallel Corpora


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.117), (1, 0.015), (2, -0.01), (3, 0.019), (4, -0.007), (5, 0.025), (6, 0.054), (7, 0.007), (8, 0.001), (9, -0.024), (10, -0.034), (11, -0.062), (12, 0.007), (13, 0.01), (14, -0.019), (15, -0.017), (16, -0.022), (17, -0.026), (18, 0.013), (19, -0.105), (20, 0.082), (21, 0.013), (22, -0.064), (23, -0.016), (24, -0.012), (25, -0.023), (26, -0.018), (27, -0.06), (28, -0.02), (29, -0.071), (30, -0.009), (31, 0.0), (32, -0.062), (33, 0.05), (34, 0.091), (35, -0.005), (36, -0.045), (37, -0.019), (38, -0.099), (39, 0.166), (40, -0.005), (41, 0.188), (42, 0.014), (43, -0.021), (44, 0.013), (45, -0.008), (46, -0.141), (47, -0.039), (48, -0.051), (49, -0.004)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91719341 214 acl-2011-Lost in Translation: Authorship Attribution using Frame Semantics

Author: Steffen Hedegaard ; Jakob Grue Simonsen

Abstract: We investigate authorship attribution using classifiers based on frame semantics. The purpose is to discover whether adding semantic information to lexical and syntactic methods for authorship attribution will improve them, specifically to address the difficult problem of authorship attribution of translated texts. Our results suggest (i) that frame-based classifiers are usable for author attribution of both translated and untranslated texts; (ii) that framebased classifiers generally perform worse than the baseline classifiers for untranslated texts, but (iii) perform as well as, or superior to the baseline classifiers on translated texts; (iv) that—contrary to current belief—naïve clas- sifiers based on lexical markers may perform tolerably on translated texts if the combination of author and translator is present in the training set of a classifier.

2 0.70873815 31 acl-2011-Age Prediction in Blogs: A Study of Style, Content, and Online Behavior in Pre- and Post-Social Media Generations

Author: Sara Rosenthal ; Kathleen McKeown

Abstract: We investigate whether wording, stylistic choices, and online behavior can be used to predict the age category of blog authors. Our hypothesis is that significant changes in writing style distinguish pre-social media bloggers from post-social media bloggers. Through experimentation with a range of years, we found that the birth dates of students in college at the time when social media such as AIM, SMS text messaging, MySpace and Facebook first became popular, enable accurate age prediction. We also show that internet writing characteristics are important features for age prediction, but that lexical content is also needed to produce significantly more accurate results. Our best results allow for 81.57% accuracy.

3 0.67327189 212 acl-2011-Local Histograms of Character N-grams for Authorship Attribution

Author: Hugo Jair Escalante ; Thamar Solorio ; Manuel Montes-y-Gomez

Abstract: This paper proposes the use of local histograms (LH) over character n-grams for authorship attribution (AA). LHs are enriched histogram representations that preserve sequential information in documents; they have been successfully used for text categorization and document visualization using word histograms. In this work we explore the suitability of LHs over n-grams at the character-level for AA. We show that LHs are particularly helpful for AA, because they provide useful information for uncovering, to some extent, the writing style of authors. We report experimental results in AA data sets that confirm that LHs over character n-grams are more helpful for AA than the usual global histograms, yielding results far superior to state of the art approaches. We found that LHs are even more advantageous in challenging conditions, such as having imbalanced and small training sets. Our results motivate further research on the use of LHs for modeling the writing style of authors for related tasks, such as authorship verification and plagiarism detection.

4 0.65272593 306 acl-2011-Towards Style Transformation from Written-Style to Audio-Style

Author: Amjad Abu-Jbara ; Barbara Rosario ; Kent Lyons

Abstract: In this paper, we address the problem of optimizing the style of textual content to make it more suitable to being listened to by a user as opposed to being read. We study the differences between the written style and the audio style by consulting the linguistics andjour- nalism literatures. Guided by this study, we suggest a number of linguistic features to distinguish between the two styles. We show the correctness of our features and the impact of style transformation on the user experience through statistical analysis, a style classification task, and a user study.

5 0.64286661 133 acl-2011-Extracting Social Power Relationships from Natural Language

Author: Philip Bramsen ; Martha Escobar-Molano ; Ami Patel ; Rafael Alonso

Abstract: Sociolinguists have long argued that social context influences language use in all manner of ways, resulting in lects 1. This paper explores a text classification problem we will call lect modeling, an example of what has been termed computational sociolinguistics. In particular, we use machine learning techniques to identify social power relationships between members of a social network, based purely on the content of their interpersonal communication. We rely on statistical methods, as opposed to language-specific engineering, to extract features which represent vocabulary and grammar usage indicative of social power lect. We then apply support vector machines to model the social power lects representing superior-subordinate communication in the Enron email corpus. Our results validate the treatment of lect modeling as a text classification problem – albeit a hard one – and constitute a case for future research in computational sociolinguistics. 1

6 0.63286328 286 acl-2011-Social Network Extraction from Texts: A Thesis Proposal

7 0.5577032 274 acl-2011-Semi-Supervised Frame-Semantic Parsing for Unknown Predicates

8 0.54279989 311 acl-2011-Translationese and Its Dialects

9 0.49517787 319 acl-2011-Unsupervised Decomposition of a Document into Authorial Components

10 0.49118593 194 acl-2011-Language Use: What can it tell us?

11 0.49113771 84 acl-2011-Contrasting Opposing Views of News Articles on Contentious Issues

12 0.47851917 157 acl-2011-I Thou Thee, Thou Traitor: Predicting Formal vs. Informal Address in English Literature

13 0.45670259 97 acl-2011-Discovering Sociolinguistic Associations with Structured Sparsity

14 0.43614259 195 acl-2011-Language of Vandalism: Improving Wikipedia Vandalism Detection via Stylometric Analysis

15 0.42826122 42 acl-2011-An Interface for Rapid Natural Language Processing Development in UIMA

16 0.41078794 269 acl-2011-Scaling up Automatic Cross-Lingual Semantic Role Annotation

17 0.4104518 67 acl-2011-Clairlib: A Toolkit for Natural Language Processing, Information Retrieval, and Network Analysis

18 0.39822334 288 acl-2011-Subjective Natural Language Problems: Motivations, Applications, Characterizations, and Implications

19 0.3943494 80 acl-2011-ConsentCanvas: Automatic Texturing for Improved Readability in End-User License Agreements

20 0.38604024 248 acl-2011-Predicting Clicks in a Vocabulary Learning System


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.389), (5, 0.042), (17, 0.032), (26, 0.026), (37, 0.069), (39, 0.046), (41, 0.049), (55, 0.026), (57, 0.017), (59, 0.036), (72, 0.033), (91, 0.033), (96, 0.095), (97, 0.025)]
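The similarity values (simValue) in the list below could plausibly be produced by comparing per-paper topic distributions like the sparse `(topicId, topicWeight)` list above. The following is a minimal sketch of that idea using cosine similarity; the function names, the truncated topic list, and the second paper's weights are illustrative assumptions, not the actual implementation behind this page.

```python
import math

def to_dense(sparse, num_topics=100):
    """Expand a sparse [(topicId, weight), ...] list into a dense vector."""
    v = [0.0] * num_topics
    for topic_id, weight in sparse:
        v[topic_id] = weight
    return v

def cosine(a, b):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Topic distribution of this paper (truncated from the list above).
this_paper = [(0, 0.389), (5, 0.042), (17, 0.032), (96, 0.095)]
# A hypothetical similar paper with overlapping topics.
other_paper = [(0, 0.35), (5, 0.05), (96, 0.12)]

sim = cosine(to_dense(this_paper), to_dense(other_paper))
```

Ranking all candidate papers by `sim` against this paper's topic vector would yield an ordered list like the one below.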

similar papers list:

simIndex simValue paperId paperTitle

1 0.80976689 321 acl-2011-Unsupervised Discovery of Rhyme Schemes

Author: Sravana Reddy ; Kevin Knight

Abstract: This paper describes an unsupervised, language-independent model for finding rhyme schemes in poetry, using no prior knowledge about rhyme or pronunciation.

2 0.80782521 134 acl-2011-Extracting and Classifying Urdu Multiword Expressions

Author: Annette Hautli ; Sebastian Sulger

Abstract: This paper describes a method for automatically extracting and classifying multiword expressions (MWEs) for Urdu on the basis of a relatively small unannotated corpus (around 8.12 million tokens). The MWEs are extracted by an unsupervised method and classified into two distinct classes, namely locations and person names. The classification is based on simple heuristics that take the co-occurrence of MWEs with distinct postpositions into account. The resulting classes are evaluated against a hand-annotated gold standard and achieve an f-score of 0.5 and 0.746 for locations and persons, respectively. A target application is the Urdu ParGram grammar, where MWEs are needed to generate a more precise syntactic and semantic analysis.
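The f-scores quoted in the abstract above (0.5 for locations, 0.746 for persons) are the standard harmonic mean of precision and recall against the gold standard. A minimal sketch, with purely illustrative counts that are not taken from the paper:

```python
def f_score(true_pos, false_pos, false_neg):
    """F1: harmonic mean of precision and recall from raw counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)

# Example: 50 correct extractions, 30 spurious, 40 missed.
print(round(f_score(50, 30, 40), 3))  # → 0.588
```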

same-paper 3 0.73113519 214 acl-2011-Lost in Translation: Authorship Attribution using Frame Semantics

Author: Steffen Hedegaard ; Jakob Grue Simonsen

Abstract: We investigate authorship attribution using classifiers based on frame semantics. The purpose is to discover whether adding semantic information to lexical and syntactic methods for authorship attribution will improve them, specifically to address the difficult problem of authorship attribution of translated texts. Our results suggest (i) that frame-based classifiers are usable for author attribution of both translated and untranslated texts; (ii) that frame-based classifiers generally perform worse than the baseline classifiers for untranslated texts, but (iii) perform as well as, or superior to, the baseline classifiers on translated texts; (iv) that—contrary to current belief—naïve classifiers based on lexical markers may perform tolerably on translated texts if the combination of author and translator is present in the training set of a classifier.

4 0.64586371 249 acl-2011-Predicting Relative Prominence in Noun-Noun Compounds

Author: Taniya Mishra ; Srinivas Bangalore

Abstract: There are several theories regarding what influences prominence assignment in English noun-noun compounds. We have developed corpus-driven models for automatically predicting prominence assignment in noun-noun compounds using feature sets based on two such theories: the informativeness theory and the semantic composition theory. The evaluation of the prediction models indicate that though both of these theories are relevant, they account for different types of variability in prominence assignment.

5 0.62935334 112 acl-2011-Efficient CCG Parsing: A* versus Adaptive Supertagging

Author: Michael Auli ; Adam Lopez

Abstract: We present a systematic comparison and combination of two orthogonal techniques for efficient parsing of Combinatory Categorial Grammar (CCG). First we consider adaptive supertagging, a widely used approximate search technique that prunes most lexical categories from the parser’s search space using a separate sequence model. Next we consider several variants on A*, a classic exact search technique which to our knowledge has not been applied to more expressive grammar formalisms like CCG. In addition to standard hardware-independent measures of parser effort we also present what we believe is the first evaluation of A* parsing on the more realistic but more stringent metric of CPU time. By itself, A* substantially reduces parser effort as measured by the number of edges considered during parsing, but we show that for CCG this does not always correspond to improvements in CPU time over a CKY baseline. Combining A* with adaptive supertagging decreases CPU time by 15% for our best model.

6 0.6162228 31 acl-2011-Age Prediction in Blogs: A Study of Style, Content, and Online Behavior in Pre- and Post-Social Media Generations

7 0.43907714 5 acl-2011-A Comparison of Loopy Belief Propagation and Dual Decomposition for Integrated CCG Supertagging and Parsing

8 0.38610733 242 acl-2011-Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments

9 0.3785851 195 acl-2011-Language of Vandalism: Improving Wikipedia Vandalism Detection via Stylometric Analysis

10 0.37120536 319 acl-2011-Unsupervised Decomposition of a Document into Authorial Components

11 0.36755171 212 acl-2011-Local Histograms of Character N-grams for Authorship Attribution

12 0.36470914 306 acl-2011-Towards Style Transformation from Written-Style to Audio-Style

13 0.36213523 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning

14 0.36095834 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering

15 0.36060363 58 acl-2011-Beam-Width Prediction for Efficient Context-Free Parsing

16 0.36002731 209 acl-2011-Lexically-Triggered Hidden Markov Models for Clinical Document Coding

17 0.35853666 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction

18 0.3583155 269 acl-2011-Scaling up Automatic Cross-Lingual Semantic Role Annotation

19 0.35822666 246 acl-2011-Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition

20 0.3580544 10 acl-2011-A Discriminative Model for Joint Morphological Disambiguation and Dependency Parsing