acl acl2011 acl2011-212 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Hugo Jair Escalante ; Thamar Solorio ; Manuel Montes-y-Gomez
Abstract: This paper proposes the use of local histograms (LH) over character n-grams for authorship attribution (AA). LHs are enriched histogram representations that preserve sequential information in documents; they have been successfully used for text categorization and document visualization using word histograms. In this work we explore the suitability of LHs over n-grams at the character-level for AA. We show that LHs are particularly helpful for AA, because they provide useful information for uncovering, to some extent, the writing style of authors. We report experimental results in AA data sets that confirm that LHs over character n-grams are more helpful for AA than the usual global histograms, yielding results far superior to state of the art approaches. We found that LHs are even more advantageous in challenging conditions, such as having imbalanced and small training sets. Our results motivate further research on the use of LHs for modeling the writing style of authors for related tasks, such as authorship verification and plagiarism detection.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract This paper proposes the use of local histograms (LH) over character n-grams for authorship attribution (AA). [sent-9, score-1.03]
2 LHs are enriched histogram representations that preserve sequential information in documents; they have been successfully used for text categorization and document visualization using word histograms. [sent-10, score-0.401]
3 We report experimental results in AA data sets that confirm that LHs over character n-grams are more helpful for AA than the usual global histograms, yielding results far superior to state of the art approaches. [sent-13, score-0.194]
4 Our results motivate further research on the use of LHs for modeling the writing style of authors for related tasks, such as authorship verification and plagiarism detection. [sent-15, score-0.271]
5 1 Introduction Authorship attribution (AA) is the task of deciding who, from a set of candidates, is the author of a given document (Houvardas and Stamatatos, 2006; Luyckx and Daelemans, 2010; Stamatatos, 2009b). [sent-16, score-0.23]
6 However, unlike usual text categorization tasks, where the core problem is modeling the thematic content of documents (Sebastiani, 2002), the goal in AA is modeling authors’ writing style (Stamatatos, 2009b). [sent-21, score-0.171]
7 Hence, document representations that reveal information about writing style are required to achieve good accuracy in AA. [sent-22, score-0.202]
8 Word and character based representations have been used in AA with some success so far (Houvardas and Stamatatos, 2006; Luyckx and Daelemans, 2010; Plakias and Stamatatos, 2008b). [sent-23, score-0.198]
9 Such representations can capture style information through word or character usage, but they lack sequential information, which can reveal further stylistic information. [sent-24, score-0.309]
10 In this paper, we study the use of richer document representations for the AA task. [sent-25, score-0.134]
11 In particular, we consider local histograms over n-grams at the character-level obtained via the locally-weighted bag of words (LOWBOW) framework (Lebanon et al. [sent-26, score-0.756]
12 Under LOWBOW, a document is represented by a set of local histograms, computed across the whole document but smoothed by kernels centered on different document locations. [sent-28, score-0.373]
13 These representations preserve both word/character usage and sequential information (i.e., the positions at which terms appear across a document). [sent-31, score-0.174]
14 Results confirm that local histograms of character n-grams are more helpful for AA than the usual global histograms of words or character n-grams (Luyckx and Daelemans, 2010); our results are superior to those reported in related works. [sent-35, score-1.549]
15 We also show that local histograms over character n-grams are more helpful than local histograms over words, as originally proposed by (Lebanon et al. [sent-36, score-1.484]
16 The contributions of this work are as follows: • We show that the LOWBOW framework can be helpful for AA, giving evidence that sequential information encoded in local histograms is useful for modeling the writing style of authors. [sent-42, score-0.857]
17 We propose the use of local histograms over character-level n-grams for AA. [sent-43, score-0.668]
18 We show that character-level representations, which have proved to be very effective for AA (Luyckx and Daelemans, 2010), can be further improved by adopting a local histogram formulation. [sent-44, score-0.232]
19 Also, we empirically show that local histograms at the character-level are more helpful than local histograms at the word-level for AA. [sent-45, score-1.359]
20 We study several kernels for a support vector machine classifier for AA under the local histograms formulation. [sent-46, score-0.646]
21 Our study confirms that the diffusion kernel (Lafferty and Lebanon, 2005) is the most effective among those we tried, although competitive performance can be obtained with simpler kernels. [sent-47, score-0.146]
22 Unfortunately, because of computational limitations, the latter methods cannot discover enough sequential information from documents (e. [sent-65, score-0.138]
23 With respect to character-based features, n-grams at the character level have been widely used in AA as well (Plakias and Stamatatos, 2008b; Peng et al. [sent-74, score-0.125]
24 However, as with word-based features, character n-grams are unable to incorporate sequential information from documents in their original form (in terms of the positions in which the terms appear across a document). [sent-81, score-0.263]
25 We believe that sequential clues can be helpful for AA because different authors are expected to use different character n-grams or words in different parts of the document. [sent-82, score-0.245]
26 Hence, the proposed features preserve sequential information besides capturing character and word usage information. [sent-84, score-0.226]
, 2007), where researchers have used information derived from local histograms for displaying a 2D representation of a document's content. [sent-88, score-0.691]
28 The latter can be due to the fact that local histograms provide little gain over usual global histograms for thematic classification tasks. [sent-95, score-1.252]
29 In this paper we show that LOWBOW representations provide important improvements over global histograms for AA; in particular, local histograms at the character-level achieve the highest performance in our experiments. [sent-96, score-1.303]
30 3 Background This section describes preliminary information on document representations and pattern classification with SVMs. [sent-97, score-0.134]
31 1 Bag of words representations In the bag of words (BOW) representation, documents are represented by histograms over the vocabulary¹ that was used to generate a collection of documents; that is, a document i is represented as: d_i = [x_{i,1}, …, x_{i,|V|}]. [sent-99, score-0.795]
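The following is a minimal sketch (our illustration, not the authors' code) of the BOW construction just described: a fixed vocabulary is built from a collection and each document becomes a normalized term-frequency histogram. The tokenization and default vocabulary size are assumptions.

```python
from collections import Counter

def build_vocabulary(documents, max_terms=2500):
    """Map the most frequent terms in a collection to column indices.
    Each document is a list of terms (words or character n-grams)."""
    counts = Counter(term for doc in documents for term in doc)
    return {term: i for i, (term, _) in enumerate(counts.most_common(max_terms))}

def bow_histogram(doc_terms, vocab):
    """Global BOW histogram d_i = [x_{i,1}, ..., x_{i,|V|}] (relative frequencies)."""
    hist = [0.0] * len(vocab)
    for term in doc_terms:
        idx = vocab.get(term)
        if idx is not None:
            hist[idx] += 1.0
    total = sum(hist)
    return [x / total for x in hist] if total else hist
```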
32 2 Locally-weighted bag-of-words representation Instead of using the BOW framework directly, we adopted the LOWBOW framework for document representation (Lebanon et al. [sent-106, score-0.145]
33 The underlying idea in LOWBOW is to compute several local histograms per document, where these histograms are smoothed by a kernel function (see Figure 1). [sent-108, score-0.373]
34 The parameters of the kernel specify the position of the kernel in the document (i. [sent-109, score-0.185]
35 , where the local histogram is centered) and its scale (i. [sent-111, score-0.232]
36 In this way the sequential information in the document is preserved together with term usage statistics. [sent-114, score-0.158]
37 Given a kernel smoothing function K_{µ,σ} : [0, 1] → ℝ with location parameter µ and scale parameter σ, normalized so that Σ_{j=1}^{N} K_{µ,σ}(t_j) = 1.¹ (¹In the following we will refer to arbitrary vocabularies, which can be formed with terms from either words or character n-grams.) [sent-125, score-0.187]
38 Then, the term position weighting is combined with term frequency weighting to obtain local histograms over the terms in the vocabulary (1, …, |V|). [sent-134, score-0.704]
39 The LOWBOW framework computes a local histogram for each position µ_j ∈ {µ_1, …, µ_k}. [sent-139, score-0.145]
40 Hence, a set of k local histograms is computed for each document i. [sent-152, score-0.729]
41 Each histogram carries information about the distribution of terms at a certain position µj of the document, where σ determines how the nearby terms to µj influence the local histogram j. [sent-153, score-0.358]
42 Thus, sequential information of the document is considered throughout these local histograms. [sent-154, score-0.246]
43 Note that when σ is small, most of the sequential information is preserved, as local histograms are calculated at very local scales; whereas when σ ≥ 1, local histograms resemble the traditional BOW representation. [sent-155, score-1.521]
44 , 2007): as a single histogram d_i^L = c · Σ_{j=1}^{k} d_i^{l_j}, with c a normalizing constant (hereafter LOWBOW histograms), or by the set of local histograms itself, {d_i^{l_1}, …, d_i^{l_k}} (hereafter bags of local histograms). We performed experiments with both forms of representation and considered both words and character n-grams as terms. [sent-157, score-0.879]
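To make the construction above concrete, the sketch below computes k local histograms by weighting each term occurrence with a Gaussian kernel centered at evenly spaced positions µ_j in [0, 1], then forms the single LOWBOW histogram from them. The Gaussian choice, even spacing, and normalization details are our assumptions; the kernel family of Lebanon et al. (2007) may differ in detail.

```python
import math

def lowbow_local_histograms(doc_terms, vocab, k=5, sigma=0.2):
    """k local histograms; a term at normalized position t contributes weight
    K_{mu,sigma}(t) (Gaussian here) to the histogram centered at mu."""
    n = len(doc_terms)
    if n == 0:
        return [[0.0] * len(vocab) for _ in range(k)]
    positions = [j / max(n - 1, 1) for j in range(n)]   # t_j in [0, 1]
    mus = [j / max(k - 1, 1) for j in range(k)]          # kernel locations mu_j
    histograms = []
    for mu in mus:
        w = [math.exp(-((t - mu) ** 2) / (2 * sigma ** 2)) for t in positions]
        z = sum(w)                                       # position weights sum to 1
        hist = [0.0] * len(vocab)
        for term, weight in zip(doc_terms, w):
            idx = vocab.get(term)
            if idx is not None:
                hist[idx] += weight / z
        histograms.append(hist)
    return histograms

def lowbow_histogram(local_hists):
    """Single LOWBOW histogram: normalized sum of the k local histograms."""
    k = len(local_hists)
    return [sum(col) / k for col in zip(*local_hists)]
```

As the text notes, a small sigma keeps each histogram local (preserving sequential information), while a large sigma makes every local histogram approach the global BOW.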
45 4 Authorship Attribution with LOWBOW Representations For AA we represent the training documents of each author using the framework described in Section 3. [sent-178, score-0.156]
46 2, thus each document of each candidate author is either a LOWBOW histogram or a bag of local histograms (BOLH). [sent-179, score-0.973]
47 Recall that LOWBOW histograms are an un-weighted sum of local histograms and hence can be considered a summary of term usage and sequential information; whereas the BOLH can be seen as term occurrence frequencies across different locations of the document. [sent-180, score-1.405]
48 , 2001; Grauman, 2006): K(P, Q) = exp(−D(P, Q)²/γ) (5), where D(P, Q) is the sum of the distances between the elements of the bag of local histograms associated with author P and the elements of the bag of histograms associated with author Q; γ is the scale parameter of K. [sent-203, score-1.466]
49 Let P = {p_1, …, p_k} and Q = {q_1, …, q_k} be the elements of the bags of local histograms for instances P and Q, respectively; Table 1 presents the distance measures we consider for AA using local histograms. [sent-211, score-0.699]
50 Diffusion, Euclidean, and χ2 kernels compare local histograms one to one, which means that the local histograms calculated at the same locations are compared to each other. [sent-214, score-1.48]
51 We believe that for AA this is advantageous as it is expected that an author uses similar terms at similar locations of the document. [sent-215, score-0.174]
52 The Earth mover's distance (EMD), on the other hand, is an estimate of the optimal cost of transporting local histograms from Q to local histograms in P (Rubner et al. [sent-216, score-1.336]
53 , 2001); that is, this measure computes the optimal matching distance between local histograms from different authors that are not necessarily computed at similar locations. [sent-217, score-0.686]
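A hedged sketch of Equation 5: here D(P, Q) sums per-location distances between local histograms computed at matching positions (the one-to-one comparison just described), using the geodesic distance on the multinomial simplex associated with the diffusion kernel of Lafferty and Lebanon (2005). The exact distance definitions in Table 1 may differ from this illustration.

```python
import math

def geodesic_distance(p, q):
    """Geodesic distance between two histograms on the multinomial simplex,
    via the Bhattacharyya coefficient (cf. Lafferty and Lebanon, 2005)."""
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    return 2.0 * math.acos(min(1.0, bc))  # clamp guards float round-off

def bag_kernel(P, Q, gamma=1.0):
    """K(P, Q) = exp(-D(P, Q)^2 / gamma), with D the sum of distances between
    local histograms at matching locations (one-to-one comparison)."""
    D = sum(geodesic_distance(p, q) for p, q in zip(P, Q))
    return math.exp(-(D ** 2) / gamma)
```

A kernel matrix built this way over all training documents can be passed to an SVM with a precomputed kernel (e.g., sklearn.svm.SVC(kernel='precomputed')); pairing it that way is our assumption about the setup, not a detail stated in the excerpt.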
54 There are 50 documents per author for training and 50 documents per author for testing. [sent-223, score-0.314]
55 For our character n-gram experiments, we obtained LOWBOW representations for character 3-grams (only n-grams of size n = 3 were used), considering the 2,500 most common n-grams. [sent-227, score-0.352]
56 Again, this setting was adopted in agreement with previous work on AA with character n-grams (Houvardas and Stamatatos, 2006; Plakias and Stamatatos, 2008b; Plakias and Stamatatos, 2008a; Luyckx and Daelemans, 2010). [sent-228, score-0.143]
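For the character-level setting just described, a small sketch (the preprocessing choices are our assumptions; the excerpt does not specify them) of extracting overlapping character 3-grams, which can then feed the vocabulary and histogram sketches above:

```python
def char_ngrams(text, n=3):
    """Overlapping character n-grams of a document (n = 3 in these experiments)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Usage sketch: keep the 2,500 most common 3-grams as the vocabulary.
# docs = [char_ngrams(t) for t in training_texts]
# vocab = build_vocabulary(docs, max_terms=2500)
```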
57 We perform experiments using all of the training documents per author, that is, a balanced corpus (we call this setting BC). [sent-233, score-0.128]
58 We tried balanced reduced data sets with: 1, 3, 5 and 10 documents per author (we call this configuration RBC). [sent-235, score-0.188]
59 The BC setting represents the AA problem under ideal conditions, whereas the RBC and IRBC settings aim at emulating a more realistic scenario, where only limited sample documents are available and the whole data set is highly imbalanced (Plakias and Stamatatos, 2008b). [sent-237, score-0.149]
60 2 Experimental results in balanced data We first compare the performance of the LOWBOW histogram representation to that of the traditional BOW representation. [sent-245, score-0.18]
61 , percentage of documents in the test set that were associated with their correct author) for the BOW and LOWBOW histogram representations when using word and character n-gram information. [sent-248, score-0.383]
62 Table 2: Authorship attribution accuracy for the BOW and LOWBOW histogram representations. Column 2 shows the parameters we used for the LOWBOW histograms; columns 3 and 4 show results using words and character n-grams, respectively. [sent-265, score-0.27]
63 From Table 2 we can see that the BOW representation is very effective, outperforming most of the LOWBOW histogram configurations. [sent-267, score-0.149]
64 Despite a small difference in performance, BOW is advantageous over LOWBOW histograms because it is simpler to compute and it does not rely on parameter selection. [sent-268, score-0.598]
65 Recall that the LOWBOW histogram representations are obtained by combining several local histograms calculated at different locations of the document; hence, it seems that the raw sum of local histograms results in a loss of useful information for representing documents. [sent-269, score-0.702]
66 The worst performance was obtained when k = 2 local histograms were considered (see row 3 in Table 2). [sent-270, score-0.719]
67 This result is somewhat expected since the larger the number of local histograms, the more LOWBOW histograms approach the BOW formulation (Lebanon et al. [sent-271, score-0.702]
68 Most of the results from this table are superior to those reported in Table 2, showing that bags of local histograms are a better way to exploit the LOWBOW framework for AA. [sent-274, score-0.742]
69 However, the diffusion kernel outperformed most of the results obtained with other kernels, confirming the results obtained by other researchers (Lebanon et al. [sent-276, score-0.175]
70 Table 3: Authorship attribution accuracy when using bags of local histograms and different kernels for word-based and character-based representations. [sent-303, score-0.874]
71 On average, the worst kernel was that based on the earth mover's distance (EMD), suggesting that the comparison of local histograms at different locations is not a fruitful approach (recall that this is the only kernel that compares local histograms at different locations). [sent-306, score-1.544]
72 The best performance across settings and kernels was obtained with the diffusion kernel (in bold, column 3, row 9) (86. [sent-308, score-0.252]
73 Therefore, the considered local histogram representations over character n-grams have proved to be very effective for AA. [sent-312, score-0.43]
74 We believe this can be attributed to the fact that character n-grams provide a representation for the document at a finer granularity, which can be better exploited with local histogram representations. [sent-316, score-0.441]
75 Hence, the local histograms are less sparse when using character-level information, which results in better AA performance. [sent-322, score-0.668]
76 Columns show the true author for test documents and rows show the authors predicted by the SVM. [sent-326, score-0.155]
77 The SVM with the BOW representation of character n-grams achieved recognition rates of 40% and 50% for BL and JM, respectively. [sent-336, score-0.148]
78 Thus, we can state that sequential information was indeed helpful for modeling BL's writing style (an improvement of 28%), although this author proved very difficult to model. [sent-337, score-0.248]
79 On the other hand, local histograms were not very useful for identifying documents written by JM (accuracy decreased by 8%). [sent-338, score-0.727]
80 The largest improvement (38%) of local histograms over the BOW formulation was obtained for author TN (T. [sent-339, score-0.669]
81 For these experiments we compare the performance of the BOW, LOWBOW histogram and BOLH representations; for the latter, we considered the best setting as reported in Table 3 (i. [sent-345, score-0.144]
82 From Tables 5 and 6 we can see that the BOW and LOWBOW histogram representations obtained similar performance to each other across the different training set sizes, which agrees with the results in Table 2 for the BC data sets. [sent-350, score-0.228]
83 The improvements of local histograms over the BOW formulation vary across different settings and when using information at word-level and character-level. [sent-352, score-0.702]
84 Thus, it is evident that local histograms are more beneficial when fewer documents are considered. [sent-359, score-0.727]
85 Here, the lack of information is compensated by the availability of several histograms per author. [sent-360, score-0.582]
86 These are very positive results; for example, we can obtain almost 71% accuracy using local histograms of character n-grams when a single document is available per author (recall that we have used all of the test samples for evaluating the performance of our methods). [sent-375, score-0.952]
87 The same pattern as before can be observed in experimental results for these data sets as well: BOW and LOWBOW histograms obtained comparable performance to each other and the BOLH formulation performed the best. [sent-377, score-0.625]
88 Again, better results were obtained when using character n-grams for the local histograms. [sent-379, score-0.26]
89 Summarizing, the results obtained in RBC and IRBC data sets show that the use of local histograms is advantageous under challenging conditions. [sent-381, score-0.733]
90 Our hypothesis for this behavior is that local histograms can be thought of as expanding the training instances, because for each training instance in the BOW formulation we have k training instances under BOLH. [sent-383, score-0.702]
91 This is more advantageous as the number of available documents per author decreases. [sent-385, score-0.157]
92 We report results for the BOW, LOWBOW histogram and BOLH representations. [sent-395, score-0.126]
93 6 Conclusions We have described the use of local histograms (LH) over character n-grams for AA. [sent-406, score-0.793]
94 LHs are enriched histogram representations that preserve sequential information in documents (in terms of the positions of terms in documents); we explored the suitability of LHs over n-grams at the character-level for AA. [sent-407, score-0.377]
95 We showed evidence supporting our hypothesis that LHs are very helpful for AA; we believe that this is due to the fact that LOWBOW representations can uncover, to some extent, the writing preferences of authors. [sent-408, score-0.132]
96 The improvements were larger in reduced and imbalanced data sets, which is a very positive result as in real AA applications one often faces highly imbalanced and small sample issues. [sent-410, score-0.144]
97 Measuring the usefulness of function words for authorship attribution. [sent-432, score-0.146]
98 The effect of author set size and data size in authorship attribution. [sent-543, score-0.224]
99 Language independent authorship attribution using character level language models. [sent-558, score-0.362]
100 An algorithm for automated authorship attribution using neural networks. [sent-639, score-0.237]
wordName wordTfidf (topN-words)
[('histograms', 0.562), ('lowbow', 0.413), ('stamatatos', 0.322), ('plakias', 0.258), ('aa', 0.236), ('authorship', 0.146), ('lebanon', 0.145), ('histogram', 0.126), ('character', 0.125), ('bow', 0.12), ('bolh', 0.114), ('houvardas', 0.114), ('local', 0.106), ('attribution', 0.091), ('kernels', 0.084), ('rbc', 0.082), ('sequential', 0.079), ('author', 0.078), ('lhs', 0.075), ('luyckx', 0.073), ('representations', 0.073), ('irbc', 0.072), ('imbalanced', 0.072), ('kernel', 0.062), ('document', 0.061), ('locations', 0.06), ('documents', 0.059), ('diffusion', 0.055), ('daelemans', 0.045), ('birmingham', 0.041), ('keselj', 0.041), ('mao', 0.041), ('bag', 0.04), ('plagiarism', 0.039), ('peng', 0.039), ('writing', 0.036), ('advantageous', 0.036), ('bc', 0.034), ('formulation', 0.034), ('svm', 0.033), ('style', 0.032), ('columns', 0.031), ('lncs', 0.031), ('balanced', 0.031), ('bags', 0.031), ('dlji', 0.031), ('lambers', 0.031), ('lowbowk', 0.031), ('mover', 0.031), ('rubner', 0.031), ('obtained', 0.029), ('amida', 0.027), ('cristianini', 0.027), ('emd', 0.027), ('xi', 0.026), ('earth', 0.024), ('jm', 0.024), ('superior', 0.024), ('helpful', 0.023), ('representation', 0.023), ('row', 0.022), ('preserve', 0.022), ('tj', 0.022), ('multiclass', 0.022), ('usual', 0.022), ('categorization', 0.022), ('alabama', 0.021), ('canu', 0.021), ('chasanis', 0.021), ('dli', 0.021), ('ecrime', 0.021), ('exico', 0.021), ('forensics', 0.021), ('itfw', 0.021), ('mez', 0.021), ('pillay', 0.021), ('rifkin', 0.021), ('rouen', 0.021), ('sdbrloe', 0.021), ('shuurmans', 0.021), ('solorio', 0.021), ('tearle', 0.021), ('tensor', 0.021), ('veenman', 0.021), ('vel', 0.021), ('bl', 0.021), ('per', 0.02), ('xj', 0.02), ('koppel', 0.02), ('framework', 0.019), ('authors', 0.018), ('visualization', 0.018), ('suitability', 0.018), ('cis', 0.018), ('uncovering', 0.018), ('chapters', 0.018), ('term', 0.018), ('formulations', 0.018), ('setting', 0.018), ('pjk', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 212 acl-2011-Local Histograms of Character N-grams for Authorship Attribution
Author: Hugo Jair Escalante ; Thamar Solorio ; Manuel Montes-y-Gomez
Abstract: This paper proposes the use of local histograms (LH) over character n-grams for authorship attribution (AA). LHs are enriched histogram representations that preserve sequential information in documents; they have been successfully used for text categorization and document visualization using word histograms. In this work we explore the suitability of LHs over n-grams at the character-level for AA. We show that LHs are particularly helpful for AA, because they provide useful information for uncovering, to some extent, the writing style of authors. We report experimental results in AA data sets that confirm that LHs over character n-grams are more helpful for AA than the usual global histograms, yielding results far superior to state of the art approaches. We found that LHs are even more advantageous in challenging conditions, such as having imbalanced and small training sets. Our results motivate further research on the use of LHs for modeling the writing style of authors for related tasks, such as authorship verification and plagiarism detection.
2 0.16376846 214 acl-2011-Lost in Translation: Authorship Attribution using Frame Semantics
Author: Steffen Hedegaard ; Jakob Grue Simonsen
Abstract: We investigate authorship attribution using classifiers based on frame semantics. The purpose is to discover whether adding semantic information to lexical and syntactic methods for authorship attribution will improve them, specifically to address the difficult problem of authorship attribution of translated texts. Our results suggest (i) that frame-based classifiers are usable for author attribution of both translated and untranslated texts; (ii) that frame-based classifiers generally perform worse than the baseline classifiers for untranslated texts, but (iii) perform as well as, or superior to the baseline classifiers on translated texts; (iv) that—contrary to current belief—naïve classifiers based on lexical markers may perform tolerably on translated texts if the combination of author and translator is present in the training set of a classifier.
3 0.071057305 20 acl-2011-A New Dataset and Method for Automatically Grading ESOL Texts
Author: Helen Yannakoudakis ; Ted Briscoe ; Ben Medlock
Abstract: We demonstrate how supervised discriminative machine learning techniques can be used to automate the assessment of ‘English as a Second or Other Language’ (ESOL) examination scripts. In particular, we use rank preference learning to explicitly model the grade relationships between scripts. A number of different features are extracted and ablation tests are used to investigate their contribution to overall performance. A comparison between regression and rank preference models further supports our method. Experimental results on the first publically available dataset show that our system can achieve levels of performance close to the upper bound for the task, as defined by the agreement between human examiners on the same corpus. Finally, using a set of ‘outlier’ texts, we test the validity of our model and identify cases where the model’s scores diverge from that of a human examiner.
4 0.06829682 306 acl-2011-Towards Style Transformation from Written-Style to Audio-Style
Author: Amjad Abu-Jbara ; Barbara Rosario ; Kent Lyons
Abstract: In this paper, we address the problem of optimizing the style of textual content to make it more suitable to being listened to by a user as opposed to being read. We study the differences between the written style and the audio style by consulting the linguistics and journalism literatures. Guided by this study, we suggest a number of linguistic features to distinguish between the two styles. We show the correctness of our features and the impact of style transformation on the user experience through statistical analysis, a style classification task, and a user study.
5 0.052294832 204 acl-2011-Learning Word Vectors for Sentiment Analysis
Author: Andrew L. Maas ; Raymond E. Daly ; Peter T. Pham ; Dan Huang ; Andrew Y. Ng ; Christopher Potts
Abstract: Unsupervised vector-based approaches to semantics can model rich lexical meanings, but they largely fail to capture sentiment information that is central to many word meanings and important for a wide range of NLP tasks. We present a model that uses a mix of unsupervised and supervised techniques to learn word vectors capturing semantic term–document information as well as rich sentiment content. The proposed model can leverage both continuous and multi-dimensional sentiment information as well as non-sentiment annotations. We instantiate the model to utilize the document-level sentiment polarity annotations present in many online documents (e.g. star ratings). We evaluate the model using small, widely used sentiment and subjectivity corpora and find it out-performs several previously introduced methods for sentiment classification. We also introduce a large dataset of movie reviews to serve as a more robust benchmark for work in this area.
6 0.052026544 44 acl-2011-An exponential translation model for target language morphology
7 0.051319182 133 acl-2011-Extracting Social Power Relationships from Natural Language
8 0.050620414 319 acl-2011-Unsupervised Decomposition of a Document into Authorial Components
9 0.048549149 205 acl-2011-Learning to Grade Short Answer Questions using Semantic Similarity Measures and Dependency Graph Alignments
10 0.042565122 139 acl-2011-From Bilingual Dictionaries to Interlingual Document Representations
11 0.041182183 286 acl-2011-Social Network Extraction from Texts: A Thesis Proposal
12 0.039426781 27 acl-2011-A Stacked Sub-Word Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging
13 0.039258286 114 acl-2011-End-to-End Relation Extraction Using Distant Supervision from External Semantic Repositories
14 0.039067231 285 acl-2011-Simple supervised document geolocation with geodesic grids
15 0.036858249 244 acl-2011-Peeling Back the Layers: Detecting Event Role Fillers in Secondary Contexts
16 0.034194723 15 acl-2011-A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction
17 0.033987626 128 acl-2011-Exploring Entity Relations for Named Entity Disambiguation
18 0.032056287 109 acl-2011-Effective Measures of Domain Similarity for Parsing
19 0.031014044 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents
20 0.030856621 73 acl-2011-Collective Classification of Congressional Floor-Debate Transcripts
topicId topicWeight
[(0, 0.091), (1, 0.029), (2, -0.026), (3, 0.023), (4, -0.015), (5, 0.0), (6, 0.005), (7, 0.007), (8, -0.007), (9, 0.017), (10, -0.013), (11, -0.009), (12, -0.01), (13, 0.036), (14, -0.012), (15, 0.014), (16, 0.005), (17, -0.003), (18, 0.02), (19, -0.034), (20, 0.087), (21, -0.019), (22, -0.02), (23, -0.001), (24, -0.021), (25, -0.053), (26, -0.015), (27, -0.017), (28, 0.009), (29, -0.016), (30, -0.05), (31, 0.057), (32, -0.028), (33, 0.098), (34, 0.052), (35, -0.007), (36, -0.069), (37, -0.021), (38, -0.008), (39, 0.137), (40, 0.017), (41, 0.111), (42, -0.017), (43, -0.002), (44, -0.0), (45, -0.02), (46, -0.045), (47, -0.015), (48, 0.001), (49, 0.007)]
simIndex simValue paperId paperTitle
same-paper 1 0.89149284 212 acl-2011-Local Histograms of Character N-grams for Authorship Attribution
Author: Hugo Jair Escalante ; Thamar Solorio ; Manuel Montes-y-Gomez
Abstract: This paper proposes the use of local histograms (LH) over character n-grams for authorship attribution (AA). LHs are enriched histogram representations that preserve sequential information in documents; they have been successfully used for text categorization and document visualization using word histograms. In this work we explore the suitability of LHs over n-grams at the character-level for AA. We show that LHs are particularly helpful for AA, because they provide useful information for uncovering, to some extent, the writing style of authors. We report experimental results in AA data sets that confirm that LHs over character n-grams are more helpful for AA than the usual global histograms, yielding results far superior to state of the art approaches. We found that LHs are even more advantageous in challenging conditions, such as having imbalanced and small training sets. Our results motivate further research on the use of LHs for modeling the writing style of authors for related tasks, such as authorship verification and plagiarism detection.
2 0.74821955 214 acl-2011-Lost in Translation: Authorship Attribution using Frame Semantics
Author: Steffen Hedegaard ; Jakob Grue Simonsen
Abstract: We investigate authorship attribution using classifiers based on frame semantics. The purpose is to discover whether adding semantic information to lexical and syntactic methods for authorship attribution will improve them, specifically to address the difficult problem of authorship attribution of translated texts. Our results suggest (i) that frame-based classifiers are usable for author attribution of both translated and untranslated texts; (ii) that frame-based classifiers generally perform worse than the baseline classifiers for untranslated texts, but (iii) perform as well as, or superior to the baseline classifiers on translated texts; (iv) that—contrary to current belief—naïve classifiers based on lexical markers may perform tolerably on translated texts if the combination of author and translator is present in the training set of a classifier.
Author: Sara Rosenthal ; Kathleen McKeown
Abstract: We investigate whether wording, stylistic choices, and online behavior can be used to predict the age category of blog authors. Our hypothesis is that significant changes in writing style distinguish pre-social media bloggers from post-social media bloggers. Through experimentation with a range of years, we found that the birth dates of students in college at the time when social media such as AIM, SMS text messaging, MySpace and Facebook first became popular, enable accurate age prediction. We also show that internet writing characteristics are important features for age prediction, but that lexical content is also needed to produce significantly more accurate results. Our best results allow for 81.57% accuracy.
4 0.64747053 306 acl-2011-Towards Style Transformation from Written-Style to Audio-Style
Author: Amjad Abu-Jbara ; Barbara Rosario ; Kent Lyons
Abstract: In this paper, we address the problem of optimizing the style of textual content to make it more suitable to being listened to by a user as opposed to being read. We study the differences between the written style and the audio style by consulting the linguistics and journalism literatures. Guided by this study, we suggest a number of linguistic features to distinguish between the two styles. We show the correctness of our features and the impact of style transformation on the user experience through statistical analysis, a style classification task, and a user study.
5 0.64164883 133 acl-2011-Extracting Social Power Relationships from Natural Language
Author: Philip Bramsen ; Martha Escobar-Molano ; Ami Patel ; Rafael Alonso
Abstract: Sociolinguists have long argued that social context influences language use in all manner of ways, resulting in lects 1. This paper explores a text classification problem we will call lect modeling, an example of what has been termed computational sociolinguistics. In particular, we use machine learning techniques to identify social power relationships between members of a social network, based purely on the content of their interpersonal communication. We rely on statistical methods, as opposed to language-specific engineering, to extract features which represent vocabulary and grammar usage indicative of social power lect. We then apply support vector machines to model the social power lects representing superior-subordinate communication in the Enron email corpus. Our results validate the treatment of lect modeling as a text classification problem – albeit a hard one – and constitute a case for future research in computational sociolinguistics. 1
6 0.55765772 286 acl-2011-Social Network Extraction from Texts: A Thesis Proposal
7 0.54150999 97 acl-2011-Discovering Sociolinguistic Associations with Structured Sparsity
8 0.53792977 319 acl-2011-Unsupervised Decomposition of a Document into Authorial Components
10 0.5126856 248 acl-2011-Predicting Clicks in a Vocabulary Learning System
11 0.4848803 20 acl-2011-A New Dataset and Method for Automatically Grading ESOL Texts
12 0.48029 341 acl-2011-Word Maturity: Computational Modeling of Word Knowledge
13 0.47791228 80 acl-2011-ConsentCanvas: Automatic Texturing for Improved Readability in End-User License Agreements
14 0.47044381 194 acl-2011-Language Use: What can it tell us?
15 0.45763245 102 acl-2011-Does Size Matter - How Much Data is Required to Train a REG Algorithm?
16 0.44369507 150 acl-2011-Hierarchical Text Classification with Latent Concepts
17 0.43793443 288 acl-2011-Subjective Natural Language Problems: Motivations, Applications, Characterizations, and Implications
18 0.43234855 55 acl-2011-Automatically Predicting Peer-Review Helpfulness
19 0.42666316 42 acl-2011-An Interface for Rapid Natural Language Processing Development in UIMA
20 0.42642647 67 acl-2011-Clairlib: A Toolkit for Natural Language Processing, Information Retrieval, and Network Analysis
topicId topicWeight
[(0, 0.04), (5, 0.018), (17, 0.048), (37, 0.084), (39, 0.033), (41, 0.051), (55, 0.02), (59, 0.031), (72, 0.038), (90, 0.339), (91, 0.036), (96, 0.133)]
simIndex simValue paperId paperTitle
same-paper 1 0.68853253 212 acl-2011-Local Histograms of Character N-grams for Authorship Attribution
Author: Hugo Jair Escalante ; Thamar Solorio ; Manuel Montes-y-Gomez
Abstract: This paper proposes the use of local histograms (LH) over character n-grams for authorship attribution (AA). LHs are enriched histogram representations that preserve sequential information in documents; they have been successfully used for text categorization and document visualization using word histograms. In this work we explore the suitability of LHs over n-grams at the character-level for AA. We show that LHs are particularly helpful for AA, because they provide useful information for uncovering, to some extent, the writing style of authors. We report experimental results in AA data sets that confirm that LHs over character n-grams are more helpful for AA than the usual global histograms, yielding results far superior to state of the art approaches. We found that LHs are even more advantageous in challenging conditions, such as having imbalanced and small training sets. Our results motivate further research on the use of LHs for modeling the writing style of authors for related tasks, such as authorship verification and plagiarism detection.
2 0.64804757 222 acl-2011-Model-Portability Experiments for Textual Temporal Analysis
Author: Oleksandr Kolomiyets ; Steven Bethard ; Marie-Francine Moens
Abstract: We explore a semi-supervised approach for improving the portability of time expression recognition to non-newswire domains: we generate additional training examples by substituting temporal expression words with potential synonyms. We explore using synonyms both from WordNet and from the Latent Words Language Model (LWLM), which predicts synonyms in context using an unsupervised approach. We evaluate a state-of-the-art time expression recognition system trained both with and without the additional training examples using data from TempEval 2010, Reuters and Wikipedia. We find that the LWLM provides substantial improvements on the Reuters corpus, and smaller improvements on the Wikipedia corpus. We find that WordNet alone never improves performance, though intersecting the examples from the LWLM and WordNet provides more stable results for Wikipedia. 1
3 0.60821569 64 acl-2011-C-Feel-It: A Sentiment Analyzer for Micro-blogs
Author: Aditya Joshi ; Balamurali AR ; Pushpak Bhattacharyya ; Rajat Mohanty
Abstract: Social networking and micro-blogging sites are stores of opinion-bearing content created by human users. We describe C-Feel-It, a system which can tap opinion content in posts (called tweets) from the micro-blogging website, Twitter. This web-based system categorizes tweets pertaining to a search string as positive, negative or objective and gives an aggregate sentiment score that represents a sentiment snapshot for a search string. We present a qualitative evaluation of this system based on a human-annotated tweet corpus.
4 0.60752213 258 acl-2011-Ranking Class Labels Using Query Sessions
Author: Marius Pasca
Abstract: The role of search queries, as available within query sessions or in isolation from one another, in examined in the context of ranking the class labels (e.g., brazilian cities, business centers, hilly sites) extracted from Web documents for various instances (e.g., rio de janeiro). The co-occurrence of a class label and an instance, in the same query or within the same query session, is used to reinforce the estimated relevance of the class label for the instance. Experiments over evaluation sets of instances associated with Web search queries illustrate the higher quality of the query-based, re-ranked class labels, relative to ranking baselines using documentbased counts.
5 0.58586097 221 acl-2011-Model-Based Aligner Combination Using Dual Decomposition
Author: John DeNero ; Klaus Macherey
Abstract: Unsupervised word alignment is most often modeled as a Markov process that generates a sentence f conditioned on its translation e. A similar model generating e from f will make different alignment predictions. Statistical machine translation systems combine the predictions of two directional models, typically using heuristic combination procedures like grow-diag-final. This paper presents a graphical model that embeds two directional aligners into a single model. Inference can be performed via dual decomposition, which reuses the efficient inference algorithms of the directional models. Our bidirectional model enforces a one-to-one phrase constraint while accounting for the uncertainty in the underlying directional models. The resulting alignments improve upon baseline combination heuristics in word-level and phrase-level evaluations.
6 0.47990009 249 acl-2011-Predicting Relative Prominence in Noun-Noun Compounds
7 0.47755247 112 acl-2011-Efficient CCG Parsing: A* versus Adaptive Supertagging
8 0.47341943 190 acl-2011-Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations
10 0.47160095 5 acl-2011-A Comparison of Loopy Belief Propagation and Dual Decomposition for Integrated CCG Supertagging and Parsing
11 0.47134456 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction
12 0.47037336 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering
13 0.46902081 32 acl-2011-Algorithm Selection and Model Adaptation for ESL Correction Tasks
14 0.46884203 318 acl-2011-Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models
15 0.46841449 28 acl-2011-A Statistical Tree Annotator and Its Applications
16 0.46798903 246 acl-2011-Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition
17 0.46789429 65 acl-2011-Can Document Selection Help Semi-supervised Learning? A Case Study On Event Extraction
18 0.4674809 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation
19 0.46713421 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing
20 0.46695259 277 acl-2011-Semi-supervised Relation Extraction with Large-scale Word Clustering