
319 acl-2011-Unsupervised Decomposition of a Document into Authorial Components


Source: pdf

Author: Moshe Koppel ; Navot Akiva ; Idan Dershowitz ; Nachum Dershowitz

Abstract: We propose a novel unsupervised method for separating out distinct authorial components of a document. In particular, we show that, given a book artificially “munged” from two thematically similar biblical books, we can separate out the two constituent books almost perfectly. This allows us to automatically recapitulate many conclusions reached by Bible scholars over centuries of research. One of the key elements of our method is exploitation of differences in synonym choice by different authors.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We propose a novel unsupervised method for separating out distinct authorial components of a document. [sent-4, score-0.4]

2 In particular, we show that, given a book artificially “munged” from two thematically similar biblical books, we can separate out the two constituent books almost perfectly. [sent-5, score-0.67]

3 1 Introduction We propose a novel unsupervised method for separating out distinct authorial components of a document. [sent-8, score-0.4]

4 i l is only to cluster the units according to author. [sent-19, score-0.31]

5 The obvious approach to our unsupervised version of the problem would be to segment the text (if necessary), represent each of the resulting units of text as a bag-of-words, and then use clustering algorithms to find natural clusters. [sent-35, score-0.325]

6 Synonym choice proves to be far more useful for authorial decomposition than ordinary lexical features. [sent-38, score-0.24]

7 [©2011 Association for Computational Linguistics, pages 1356–1364] sparse and hence, though reliable, they are not comprehensive; that is, they are useful for separating out some units but not all. [sent-41, score-0.266]

8 Thus, we use a two-stage process: first find a reliable partial clustering based on synonym usage and then use these as the basis for supervised learning using a different feature set, such as bag-of-words. [sent-42, score-0.342]

9 First, this testbed is well motivated, since scholars have been doing authorial analysis of biblical literature for centuries. [sent-45, score-0.715]

10 Our main result is that given artificial books constructed by randomly “munging” together actual biblical books, we are able to separate out authorial components with extremely high accuracy, even when the components are thematically similar. [sent-47, score-0.922]

11 Moreover, our automated methods recapitulate many of the results of extensive manual research in authorial analysis of biblical literature. [sent-48, score-0.601]

12 In the next section, we briefly review essential information regarding our biblical testbed. [sent-50, score-0.371]

13 2 The Bible as Testbed While the biblical canon differs across religions and denominations, the common denominator consists of twenty-odd books and several shorter works, ranging in length from tens to thousands of verses. [sent-55, score-0.577]

14 Some of these books are regarded by scholars as largely the product of a single author’s work, while others are thought to be composites in which multiple authors are well-represented, authors who in some cases lived in widely disparate periods. [sent-57, score-0.372]

15 In this paper, we will focus exclusively on the Hebrew books of the Bible, and we will work with the original untranslated texts. [sent-58, score-0.238]

16 The first five books of the Bible, collectively known as the Pentateuch, are the subject of much controversy. [sent-59, score-0.238]

17 According to the predominant Jewish and Christian traditions, the five books were written by a single author, Moses. [sent-60, score-0.274]

18 Some work on biblical authorship problems within a computational framework has been attempted, but does not handle our problem. [sent-63, score-0.44]

19 Much earlier work (for example, Radday 1970; Bee 1971; Holmes 1994) uses multivariate analysis to test whether the clusters in a given clustering of some biblical text are sufficiently distinct to be regarded as probably a composite text. [sent-64, score-0.621]

20 Other computational work on biblical authorship problems (Mealand 1995; Berryman et al. [sent-67, score-0.44]

21 Jeremiah and Ezekiel are two roughly contemporaneous books belonging to the same biblical sub-genre (prophetic works), and each is widely thought to consist primarily of the work of a single distinct author. [sent-76, score-0.607]

22 Compute the similarity of every pair of chapters in the corpus. [sent-84, score-0.241]

23 Use a clustering algorithm to cluster the chapters into two clusters. [sent-86, score-0.446]

24 We use k=2, cosine similarity and ncut clustering (Dhillon et al. [sent-87, score-0.258]
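The similarity step described above can be sketched as follows. This is a minimal illustration, assuming each chapter is already tokenized into a list of words; the function names `cosine` and `similarity_matrix` are hypothetical, and the ncut clustering step itself is not shown.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(chapters):
    """chapters: list of token lists. Returns a symmetric matrix of cosines."""
    bags = [Counter(tokens) for tokens in chapters]
    n = len(bags)
    return [[cosine(bags[i], bags[j]) for j in range(n)] for i in range(n)]
```

The resulting matrix would then be handed to whatever two-way clustering routine is available.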

25 Ideally, 100% of the chapters would lie on the majority diagonal, but in fact only 51% do. [sent-90, score-0.268]

26 NMD is equivalent to maximal macro-averaged recall where the maximum is taken over the (two) possible assignments of books to clusters. [sent-93, score-0.238]
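As a small worked sketch of this definition, the function below computes the measure from a 2x2 table of counts. The function name `nmd` and the row/column layout (rows are books, columns are clusters) are assumptions made for illustration.

```python
def nmd(matrix):
    """Maximal macro-averaged recall over the two book-to-cluster assignments.

    matrix[i][j] is the number of units of book i placed in cluster j.
    """
    (a, b), (c, d) = matrix
    size_0, size_1 = a + b, c + d
    # Assignment 1: book 0 -> cluster 0, book 1 -> cluster 1.
    straight = (a / size_0 + d / size_1) / 2
    # Assignment 2: book 0 -> cluster 1, book 1 -> cluster 0.
    swapped = (b / size_0 + c / size_1) / 2
    return max(straight, swapped)
```

Because the two candidate values always sum to 1, the result can never fall below 0.5, which is why a score near 50% amounts to no separation at all.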

27 This negative result is not especially surprising since there are many ways for the chapters to split (e. [sent-96, score-0.27]

28 Thus, to guide the method in the direction of stylistic elements that might distinguish between Jeremiah and Ezekiel, we define a class of generic biblical words consisting of all 223 words that appear at least five times in each of ten different books of the Bible. [sent-100, score-0.706]
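The selection rule for generic words can be sketched as below. This is an illustrative reading, assuming the corpus is given as a mapping from book name to a token list; the name `generic_words` and its parameters are hypothetical.

```python
from collections import Counter


def generic_words(books, min_count=5, min_books=10):
    """Words occurring at least min_count times in each of at least min_books books.

    books: dict mapping book name -> list of tokens.
    """
    books_meeting_threshold = Counter()
    for tokens in books.values():
        for word, count in Counter(tokens).items():
            if count >= min_count:
                books_meeting_threshold[word] += 1
    return {w for w, k in books_meeting_threshold.items() if k >= min_books}
```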

29 Repeating the experiment above, though limiting our feature set to generic biblical words, we obtain the following matrix: [confusion matrix of books (Jer, Ezek) against Cluster I and Cluster II; cell values garbled in extraction] As can be seen, using generic words yields NMD of 51. [sent-101, score-0.449]

30 4 Exploiting Synonym Usage One of the key features used by Bible scholars to classify different components of biblical literature is synonym choice. [sent-104, score-0.709]

31 The underlying hypothesis is that different authorial components are likely to differ in the proportions with which alternative words from a set of synonyms (synset) are used. [sent-105, score-0.401]

32 More recently, the synonym hypothesis has been used in computational work on authorship attribution of English texts in the work of Clark and Hannon (2007) and Koppel et al. [sent-107, score-0.32]

33 1 (Almost) Automatic Synset Identification One of the advantages of using biblical literature is the availability of a great deal of manual annotation. [sent-113, score-0.378]

34 If none of the synonyms in a synset appear in the unit, all their corresponding entries are 0. [sent-130, score-0.334]

35 If j different synonyms in a synset appear in the unit, then each corresponding entry is 1/j and the rest are 0. [sent-131, score-0.334]

36 Thus, in the typical case where exactly one of the synonyms in a synset appears, its corresponding entry in the vector is 1 and the rest are 0. [sent-132, score-0.302]
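The encoding rules above (0 when no synonym of a synset appears, 1/j when j of them appear) can be sketched as follows. The function name `synset_vector` and the English example words are purely illustrative assumptions.

```python
def synset_vector(unit_tokens, synsets):
    """Flat vector with one entry per synonym, grouped synset by synset.

    synsets: list of lists of synonyms.
    """
    present = set(unit_tokens)
    vector = []
    for synset in synsets:
        used = [syn for syn in synset if syn in present]
        # Each synonym that appears gets 1/j, where j synonyms of the synset appear.
        weight = 1.0 / len(used) if used else 0.0
        vector.extend(weight if syn in present else 0.0 for syn in synset)
    return vector
```

For example, with synsets `[["big", "large"], ["road", "path", "way"]]`, a unit containing "big", "road" and "way" maps to `[1.0, 0.0, 0.5, 0.0, 0.5]`.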

37 If the two units use different members of a synset, cosine is diminished; if they use the same members of a synset, cosine is increased. [sent-135, score-0.26]

38 But suppose one unit uses a particular synonym [Footnote 1: Thanks to Avi Shmidman for his assistance with this.] [sent-137, score-0.307]

39 This should teach us nothing about the similarity of the two units, since it reflects only on the relevance of the synset to the content of that unit; it says nothing about which synonym is chosen when the synset is relevant. [sent-139, score-0.534]

40 The required adaptation is as follows: we first eliminate from the representation any synsets that do not appear in both units (where a synset is said to appear in a unit if any of its constituent synonyms appear in the unit). [sent-141, score-0.916]

41 Formally, for a unit x represented in terms of synonyms, our new similarity measure is cos'(x,y) = cos(x|S(x∩y), y|S(x∩y)), where x|S(x∩y) is the projection of x onto the synsets that appear in both x and y. [sent-143, score-0.388]
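A minimal sketch of this adapted cosine follows, assuming each unit is stored as a list of per-synset sub-vectors (one list of weights per synset) and that a synset "appears" in a unit when its sub-vector has a nonzero entry. The name `adapted_cosine` is hypothetical.

```python
import math


def adapted_cosine(x, y):
    """Cosine restricted to the synsets that appear in both x and y."""
    shared = [i for i in range(len(x)) if any(x[i]) and any(y[i])]
    xs = [v for i in shared for v in x[i]]
    ys = [v for i in shared for v in y[i]]
    dot = sum(a * b for a, b in zip(xs, ys))
    norm_x = math.sqrt(sum(a * a for a in xs))
    norm_y = math.sqrt(sum(b * b for b in ys))
    # With no shared synsets there is no evidence either way; return 0.
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0
```

Returning 0 when no synset is shared is a design choice of this sketch, not something the text specifies.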

42 First, some of the units belong firmly to one cluster or the other. [sent-154, score-0.335]

43 The rest have to be assigned to one cluster or the other because that’s the nature of the clustering algorithm, but in fact are not part of what we might think of as the core of either cluster. [sent-155, score-0.268]

44 Informally, we say that a unit is in the core of its cluster if it is sufficiently similar to the centroid of its cluster and it is sufficiently more similar to the centroid of its cluster than to any other centroid. [sent-156, score-0.72]

45 Formally, let S be a set of synsets, let B be a set of units, and let C be a clustering of B where the units in B are represented in terms of the synsets in S. [sent-157, score-0.603]

46 For a unit x in cluster C(x) with centroid c(x), we say that x is in the core of C(x) if cos'(x,c(x))>θ1 and cos'(x,c(x))-cos'(x,c)>θ2 for every centroid c≠c(x). [sent-158, score-0.398]
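The core-membership test can be sketched as below. The similarity function is passed in as `sim` so the block stays self-contained; the name `in_core` is hypothetical, and any threshold values used with it are assumptions, since this excerpt does not state θ1 or θ2.

```python
def in_core(x, own_centroid, other_centroids, sim, theta1, theta2):
    """True if x is close enough to its own centroid and clearly closer
    to it than to every other centroid."""
    own = sim(x, own_centroid)
    if own <= theta1:
        return False
    return all(own - sim(x, c) > theta2 for c in other_centroids)
```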

47 Second, the clusters that we obtain are based on a subset of the full collection of synsets that does the heavy lifting. [sent-161, score-0.256]

48 Formally, we say that a synonym n in synset s is over-represented in cluster C if p(x∈C|n∈x) > p(x∈C|s∈x) and p(x∈C|n∈x) > p(x∈C). [sent-162, score-0.496]

49 That is, n is over-represented in C if knowing that n appears in a unit increases the likelihood that the unit is in C, relative to knowing only that some member of the synset appears in the unit and relative to knowing nothing. [sent-163, score-0.627]

50 We say that a synset s is a separating synset for a clustering {C1,C2} if some synonym in s is over-represented in C1 and a different synonym in s is over-represented in C2. [sent-164, score-0.904]
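The two definitions above can be sketched with simple counting estimates of the probabilities. This is an illustration under assumptions: each unit is a pair (set of words present, cluster label), and the helper names `p_cluster`, `over_represented` and `is_separating` are hypothetical.

```python
def p_cluster(units, cluster, condition):
    """Estimate p(x in cluster | condition(x)) by counting; None if undefined."""
    labels = [label for words, label in units if condition(words)]
    if not labels:
        return None
    return sum(1 for label in labels if label == cluster) / len(labels)


def over_represented(units, synonym, synset, cluster):
    """synonym is assumed to be a member of synset."""
    p_syn = p_cluster(units, cluster, lambda w: synonym in w)
    if p_syn is None:  # the synonym never occurs
        return False
    p_set = p_cluster(units, cluster, lambda w: any(s in w for s in synset))
    p_prior = p_cluster(units, cluster, lambda w: True)
    return p_syn > p_set and p_syn > p_prior


def is_separating(units, synset, clusters=(1, 2)):
    c1, c2 = clusters
    over_1 = {s for s in synset if over_represented(units, s, synset, c1)}
    over_2 = {s for s in synset if over_represented(units, s, synset, c2)}
    # Some synonym must favour c1 and a different synonym must favour c2.
    return any(a != b for a in over_1 for b in over_2)
```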

51 1 Defining the Core of a Cluster We leverage these two observations to formally define the cores of the respective clusters using the following iterative algorithm. [sent-166, score-0.232]

52 Initially, let S be the collection of all synsets, let B be the set of all units in the corpus represented in terms of S, and let {C1,C2} be an initial clustering of the units in B. [sent-168, score-0.59]

53 Redefine C1 and C2 to be the clusters obtained from clustering the units in the reduced B represented in terms of the synsets in reduced S. [sent-174, score-0.553]

54 At the end of this process, we are left with two well-separated cluster cores and a set of separating synsets. [sent-177, score-0.346]

55 When we compute cores of clusters in our Jeremiah-Ezekiel experiment, 26 of the initial 100 units are eliminated. [sent-178, score-0.389]

56 Of the 154 synsets that appear in the Jeremiah-Ezekiel corpus, 118 are separating synsets for the resulting clustering. [sent-179, score-0.512]

57 The resulting cluster cores split with Jeremiah and Ezekiel as follows: [confusion matrix of books (Jer, Ezek) against Cluster I and Cluster II; cell values garbled in extraction] We find that all but two of the misplaced units are not part of the core. [sent-180, score-0.511]

58 Thus, we use a bag-of-words representation restricted to generic Bible words for the 74 units in our cluster cores and label them according to the cluster to which they were assigned. [sent-188, score-0.633]

59 The resulting split is as follows: [confusion matrix of books (Jer, Ezek) against Cluster I and Cluster II; cell values garbled in extraction] Remarkably, even the two Ezekiel chapters that were in the Jeremiah cluster (and hence were essentially misleading training examples) end up on the Ezekiel side of the SVM boundary. [sent-191, score-0.337]

60 Represent units in cluster cores in terms of generic words. [sent-209, score-0.511]

61 Use units in cluster cores as training for learning an SVM classifier. [sent-211, score-0.456]

62 6 Empirical Results We now test our method on other pairs of biblical books to see if we obtain comparable results to those seen above. [sent-214, score-0.577]

63 We need, therefore, to identify a set of biblical books such that (i) each book is sufficiently long (say, at least 20 chapters), (ii) each is written by one primary author, and (iii) the authors are distinct. [sent-215, score-0.682]

64 Since we wish to use these books as a gold standard, it is important that there be a broad consensus regarding the latter two, potentially controversial, criteria. [sent-216, score-0.36]

65 Our choice is thus limited to the following five books that belong to two biblical sub-genres: Isaiah, Jeremiah, Ezekiel (prophetic literature), Job and Proverbs (wisdom literature). [sent-217, score-0.602]

66 ) Recall that our experiment is as follows: For each pair of books, we are given all the chapters in the union of the two books and are given no information regarding labels. [sent-219, score-0.485]

67 (The fact that there are precisely two constituent books is given. [sent-221, score-0.238]

68 In Figure 1, we see results for the six pairs of books that belong to different sub-genres. [sent-227, score-0.263]

69 In Figure 2, we see results for the four pairs of books that are in the same genre. [sent-228, score-0.238]

70 We note that the synonym method without the second stage is slightly worse than generic words for different-genre pairs (probably because these pairs share relatively few synsets) but is much more consistent for same-genre pairs, giving results in the area of 90% for each such pair. [sent-234, score-0.233]

71 7 Decomposing Unsegmented Documents Up to now, we have considered the case where we are given text that has been pre-segmented into pure authorial units. [sent-236, score-0.27]

72 Choose the first k1 available verses of Jeremiah, where k1 is a random integer drawn from the uniform distribution over the integers 1 to m. [sent-244, score-0.314]

73 Choose the first k2 available verses of Ezekiel, where k2 is a new random integer drawn from the above distribution. [sent-246, score-0.314]

74 Repeat until one of the books is exhausted; then choose the remaining verses of the other book. [sent-248, score-0.552]
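The munging steps above can be sketched as follows. This is a minimal illustration, assuming each book is a list of verses; the function name `munge` is hypothetical, and the value of m is left as a parameter because it is not stated in this excerpt.

```python
import random


def munge(book_a, book_b, m, rng=None):
    """Interleave two books in alternating runs of 1..m verses.

    Returns a list of (source_index, verse) pairs, where source_index is
    0 for book_a and 1 for book_b.
    """
    rng = rng or random.Random()
    sources = [list(book_a), list(book_b)]
    pos = [0, 0]
    merged = []
    turn = 0  # start with book_a, as in the description
    while pos[0] < len(sources[0]) and pos[1] < len(sources[1]):
        k = rng.randint(1, m)  # uniform over the integers 1..m, inclusive
        chunk = sources[turn][pos[turn]:pos[turn] + k]
        merged.extend((turn, verse) for verse in chunk)
        pos[turn] += len(chunk)
        turn = 1 - turn
    # One book is exhausted: append whatever remains of the other.
    for i in (0, 1):
        merged.extend((i, verse) for verse in sources[i][pos[i]:])
    return merged
```

Keeping the source index alongside each verse gives a gold standard for later evaluation without exposing it to the decomposition method.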

75 Furthermore, to simulate the Pentateuch problem, we break Jer-iel into initial units by beginning a new unit whenever we reach the first verse of one of the original chapters of Jeremiah or Ezekiel. [sent-250, score-0.716]

76 (This does not leak any information since there is no inherent connection between these verses and actual crossover points. [sent-251, score-0.314]

77 First, we refine the initial units (each of which might be a mix of verses from Jeremiah and Ezekiel) by splitting them into smaller units that we hope will be pure (wholly from Jeremiah or from Ezekiel). [sent-254, score-0.75]

78 We say that a synset is doubly-represented in a unit if the unit includes two different synonyms of that synset. [sent-255, score-0.591]

79 Doubly-represented synsets are an indication that the unit might include verses from two different books. [sent-256, score-0.644]

80 Formally, let M(x) represent the number of synsets for which more than one synonym appears in x. [sent-258, score-0.446]

81 If for an initial unit, there is some split for which M(x)-max(M(x1),M(x2)) is greater than 0, we split the unit optimally; if there is more than one optimal split, we choose the one closest to the middle verse of the unit. [sent-261, score-0.423]

82 (In principle, we could apply this procedure iteratively; in the experiments reported here, we split only the initial units but not split units. [sent-262, score-0.298]
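The splitting rule can be sketched as below, assuming a unit is a list of verses and each verse is a set of words. The names `doubly_represented` and `best_split` are hypothetical, and "closest to the middle verse" is interpreted here as the split index nearest to half the unit length.

```python
def doubly_represented(verses, synsets):
    """M(x): number of synsets with more than one synonym present in the verses."""
    present = set().union(*verses)
    return sum(1 for s in synsets if len(present.intersection(s)) > 1)


def best_split(verses, synsets):
    """Return i such that verses[:i] / verses[i:] is the chosen split, or None."""
    total = doubly_represented(verses, synsets)
    middle = len(verses) / 2
    best = None  # (gain, -distance_to_middle, index)
    for i in range(1, len(verses)):
        gain = total - max(doubly_represented(verses[:i], synsets),
                           doubly_represented(verses[i:], synsets))
        if gain > 0:
            candidate = (gain, -abs(i - middle), i)
            if best is None or candidate > best:
                best = candidate
    return best[2] if best else None
```

If two candidate splits have the same gain and the same distance to the middle, this sketch arbitrarily prefers the later one.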

83 The problem with classifying individual verses is that verses are short and may contain few or no relevant features. [sent-265, score-0.628]

84 In order to remedy this, and also to take advantage of the stickiness of classes across consecutive verses (if a given verse is from a certain book, there is a good chance that the next verse is from the same book), we use two smoothing tactics. [sent-266, score-0.682]

85 Initially, each verse is assigned a raw score by the SVM classifier, representing its signed distance from the SVM boundary. [sent-267, score-0.246]

86 We smooth these scores by computing for each verse a refined score that is a weighted average of the verse’s raw score and the raw scores of the two verses preceding and succeeding it. [sent-268, score-0.548]
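The smoothing step can be sketched as a five-verse weighted moving average. The weights below are an assumption chosen for illustration, since this excerpt does not give them, and the function name `smooth_scores` is hypothetical; near the ends of the text the average is renormalized over the neighbours that exist.

```python
def smooth_scores(raw, weights=(1.0, 2.0, 4.0, 2.0, 1.0)):
    """Weighted average of each verse's raw score with its two neighbours
    on either side. The weights are illustrative, not taken from the paper."""
    half = len(weights) // 2
    smoothed = []
    for i in range(len(raw)):
        total = 0.0
        weight_sum = 0.0
        for offset, w in zip(range(-half, half + 1), weights):
            j = i + offset
            if 0 <= j < len(raw):  # skip neighbours that fall off the ends
                total += w * raw[j]
                weight_sum += w
        smoothed.append(total / weight_sum)
    return smoothed
```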

87 Rather, we check the class of the last assigned verse before it and the first assigned verse after it. [sent-273, score-0.442]
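One plausible reading of this neighbour check is sketched below: an unassigned verse takes a class only when the nearest assigned verses on both sides agree. That decision rule is an assumption of this sketch, as the excerpt does not spell it out, and the name `fill_gaps` is hypothetical.

```python
def fill_gaps(labels):
    """labels: list of class labels, with None for unassigned verses."""
    n = len(labels)
    # Class of the last assigned verse strictly before each position.
    prev_assigned = [None] * n
    last = None
    for i in range(n):
        prev_assigned[i] = last
        if labels[i] is not None:
            last = labels[i]
    # Class of the first assigned verse strictly after each position.
    next_assigned = [None] * n
    following = None
    for i in range(n - 1, -1, -1):
        next_assigned[i] = following
        if labels[i] is not None:
            following = labels[i]
    filled = list(labels)
    for i in range(n):
        agree = prev_assigned[i] is not None and prev_assigned[i] == next_assigned[i]
        if labels[i] is None and agree:
            filled[i] = prev_assigned[i]
    return filled
```

Under this rule, verses sitting between runs of opposite classes stay unassigned.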

88 Our two cluster cores include 33 and 39 units, respectively; 27 of the former are pure Jeremiah and 30 of the latter are pure Ezekiel; no pure units are in the “wrong” cluster core. [sent-279, score-0.612]

89 Applying the SVM classifier learned on the cluster cores to individual verses, 992 of the 2637 verses in Jer-iel lie outside the SVM margin and are assigned to some class. [sent-280, score-0.645]

90 Of the remaining 459 unassigned verses, most lie along transition points (where smoothing tends to flatten scores and where preceding and succeeding assigned verses tend to belong to opposite classes). [sent-283, score-0.402]

91 3 Empirical Results We randomly generated composite books for each of the book pairs considered above. [sent-285, score-0.353]

92 In Figures 3 and 4, we show for each book pair the percentage of all verses in the munged document that are “correctly” classed (that is, in the majority diagonal), the percentage incorrectly classed (minority diagonal) and the percentage not assigned to either class. [sent-286, score-0.652]

93 As is evident, in each case the vast majority of verses are correctly assigned and only a small fraction are incorrectly assigned. [sent-287, score-0.406]

94 different-genre pair of books that are correctly and incorrectly assigned or remain unassigned. [sent-289, score-0.303]

95 genre pair of books that are correctly and incorrectly assigned or remain unassigned. [sent-290, score-0.303]

96 8 Conclusions and Future Work We have shown that documents can be decomposed into authorial components with very high accuracy by using a two-stage process. [sent-291, score-0.264]

97 First, we establish a reliable partial clustering of units by using synonym choice and then we use these partial clusters as training texts for supervised learning using generic words as features. [sent-292, score-0.669]

98 Despite this limitation, our success on munged biblical books suggests that our method can be fruitfully applied to the Pentateuch, since the broad consensus in the field is that the Pentateuch can be divided into two main authorial categories: Priestly (P) and non-Priestly (Driver 1909). [sent-296, score-0.902]

99 ) We find that our split corresponds to the expert consensus regarding P and non-P for over 90% of the verses in the Pentateuch for which such consensus exists. [sent-298, score-0.491]

100 In this work, we have exploited the availability of tools for identifying synonyms in biblical literature. [sent-301, score-0.476]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('biblical', 0.339), ('verses', 0.314), ('books', 0.238), ('chapters', 0.215), ('authorial', 0.21), ('ezekiel', 0.21), ('synsets', 0.201), ('jeremiah', 0.198), ('units', 0.188), ('bible', 0.185), ('verse', 0.184), ('synonym', 0.178), ('synset', 0.165), ('cores', 0.146), ('pentateuch', 0.14), ('synonyms', 0.137), ('unit', 0.129), ('cluster', 0.122), ('clustering', 0.109), ('authorship', 0.101), ('scholars', 0.099), ('isaiah', 0.087), ('ncut', 0.087), ('separating', 0.078), ('kjv', 0.07), ('munged', 0.07), ('nmd', 0.07), ('book', 0.066), ('pure', 0.06), ('centroid', 0.058), ('svm', 0.056), ('koppel', 0.056), ('split', 0.055), ('clusters', 0.055), ('generic', 0.055), ('components', 0.054), ('hebrew', 0.053), ('recapitulate', 0.052), ('diagonal', 0.051), ('composite', 0.049), ('conveniently', 0.046), ('guthrie', 0.046), ('wish', 0.045), ('consensus', 0.045), ('concordance', 0.043), ('stylistic', 0.042), ('cos', 0.042), ('attribution', 0.041), ('document', 0.04), ('literature', 0.039), ('sufficiently', 0.039), ('assigned', 0.037), ('author', 0.036), ('unsegmented', 0.036), ('cosine', 0.036), ('let', 0.035), ('bejoezorek', 0.035), ('berryman', 0.035), ('carpenter', 0.035), ('centuries', 0.035), ('classed', 0.035), ('composites', 0.035), ('dershowitz', 0.035), ('eisen', 0.035), ('miq', 0.035), ('prophetic', 0.035), ('ramat', 0.035), ('inadequate', 0.034), ('appear', 0.032), ('regarding', 0.032), ('formally', 0.031), ('scholar', 0.031), ('akiva', 0.031), ('navot', 0.031), ('dhillon', 0.031), ('say', 0.031), ('distinct', 0.03), ('decomposition', 0.03), ('israel', 0.029), ('partial', 0.029), ('incorrectly', 0.028), ('testbed', 0.028), ('meyer', 0.028), ('zu', 0.028), ('tease', 0.028), ('aviv', 0.028), ('unsupervised', 0.028), ('ve', 0.028), ('majority', 0.027), ('thematically', 0.027), ('threads', 0.027), ('plagiarism', 0.027), ('nigam', 0.027), ('lie', 0.026), ('supervised', 0.026), ('similarity', 0.026), ('knowing', 0.025), ('na', 0.025), ('belong', 0.025), ('raw', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 319 acl-2011-Unsupervised Decomposition of a Document into Authorial Components

2 0.10169708 214 acl-2011-Lost in Translation: Authorship Attribution using Frame Semantics

Author: Steffen Hedegaard ; Jakob Grue Simonsen

Abstract: We investigate authorship attribution using classifiers based on frame semantics. The purpose is to discover whether adding semantic information to lexical and syntactic methods for authorship attribution will improve them, specifically to address the difficult problem of authorship attribution of translated texts. Our results suggest (i) that frame-based classifiers are usable for author attribution of both translated and untranslated texts; (ii) that frame-based classifiers generally perform worse than the baseline classifiers for untranslated texts, but (iii) perform as well as, or superior to the baseline classifiers on translated texts; (iv) that, contrary to current belief, naïve classifiers based on lexical markers may perform tolerably on translated texts if the combination of author and translator is present in the training set of a classifier.

3 0.087436974 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering

Author: Joel Lang ; Mirella Lapata

Abstract: In this paper we describe an unsupervised method for semantic role induction which holds promise for relieving the data acquisition bottleneck associated with supervised role labelers. We present an algorithm that iteratively splits and merges clusters representing semantic roles, thereby leading from an initial clustering to a final clustering of better quality. The method is simple, surprisingly effective, and allows to integrate linguistic knowledge transparently. By combining role induction with a rule-based component for argument identification we obtain an unsupervised end-to-end semantic role labeling system. Evaluation on the CoNLL 2008 benchmark dataset demonstrates that our method outperforms competitive unsupervised approaches by a wide margin.

4 0.076757975 198 acl-2011-Latent Semantic Word Sense Induction and Disambiguation

Author: Tim Van de Cruys ; Marianna Apidianaki

Abstract: In this paper, we present a unified model for the automatic induction of word senses from text, and the subsequent disambiguation of particular word instances using the automatically extracted sense inventory. The induction step and the disambiguation step are based on the same principle: words and contexts are mapped to a limited number of topical dimensions in a latent semantic word space. The intuition is that a particular sense is associated with a particular topic, so that different senses can be discriminated through their association with particular topical dimensions; in a similar vein, a particular instance of a word can be disambiguated by determining its most important topical dimensions. The model is evaluated on the SEMEVAL-2010 word sense induction and disambiguation task, on which it reaches state-of-the-art results.

5 0.071463503 334 acl-2011-Which Noun Phrases Denote Which Concepts?

Author: Jayant Krishnamurthy ; Tom Mitchell

Abstract: Resolving polysemy and synonymy is required for high-quality information extraction. We present ConceptResolver, a component for the Never-Ending Language Learner (NELL) (Carlson et al., 2010) that handles both phenomena by identifying the latent concepts that noun phrases refer to. ConceptResolver performs both word sense induction and synonym resolution on relations extracted from text using an ontology and a small amount of labeled data. Domain knowledge (the ontology) guides concept creation by defining a set of possible semantic types for concepts. Word sense induction is performed by inferring a set of semantic types for each noun phrase. Synonym detection exploits redundant information to train several domain-specific synonym classifiers in a semi-supervised fashion. When ConceptResolver is run on NELL’s knowledge base, 87% of the word senses it creates correspond to real-world concepts, and 85% of noun phrases that it suggests refer to the same concept are indeed synonyms.

6 0.063387439 277 acl-2011-Semi-supervised Relation Extraction with Large-scale Word Clustering

7 0.063232332 39 acl-2011-An Ensemble Model that Combines Syntactic and Semantic Clustering for Discriminative Dependency Parsing

8 0.062804133 29 acl-2011-A Word-Class Approach to Labeling PSCFG Rules for Machine Translation

9 0.062701069 109 acl-2011-Effective Measures of Domain Similarity for Parsing

10 0.061816018 167 acl-2011-Improving Dependency Parsing with Semantic Classes

11 0.060381468 203 acl-2011-Learning Sub-Word Units for Open Vocabulary Speech Recognition

12 0.059782803 293 acl-2011-Template-Based Information Extraction without the Templates

13 0.056094978 103 acl-2011-Domain Adaptation by Constraining Inter-Domain Variability of Latent Feature Representation

14 0.05331865 222 acl-2011-Model-Portability Experiments for Textual Temporal Analysis

15 0.052811857 304 acl-2011-Together We Can: Bilingual Bootstrapping for WSD

16 0.051756851 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing

17 0.051104158 52 acl-2011-Automatic Labelling of Topic Models

18 0.050620414 212 acl-2011-Local Histograms of Character N-grams for Authorship Attribution

19 0.049290027 163 acl-2011-Improved Modeling of Out-Of-Vocabulary Words Using Morphological Classes

20 0.046927676 82 acl-2011-Content Models with Attitude


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.133), (1, 0.04), (2, -0.04), (3, 0.005), (4, -0.008), (5, -0.008), (6, 0.037), (7, 0.048), (8, -0.009), (9, 0.004), (10, 0.012), (11, -0.069), (12, 0.037), (13, 0.043), (14, -0.015), (15, -0.059), (16, -0.04), (17, 0.011), (18, 0.034), (19, 0.024), (20, 0.058), (21, -0.069), (22, -0.018), (23, -0.052), (24, -0.018), (25, -0.047), (26, 0.015), (27, -0.029), (28, -0.003), (29, -0.059), (30, -0.034), (31, -0.05), (32, 0.01), (33, 0.024), (34, 0.012), (35, -0.051), (36, -0.034), (37, 0.006), (38, -0.024), (39, 0.1), (40, 0.016), (41, 0.057), (42, -0.001), (43, 0.013), (44, 0.104), (45, -0.01), (46, 0.011), (47, 0.021), (48, -0.036), (49, 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93265766 319 acl-2011-Unsupervised Decomposition of a Document into Authorial Components

2 0.66557157 341 acl-2011-Word Maturity: Computational Modeling of Word Knowledge

Author: Kirill Kireyev ; Thomas K Landauer

Abstract: While computational estimation of difficulty of words in the lexicon is useful in many educational and assessment applications, the concept of scalar word difficulty and current corpus-based methods for its estimation are inadequate. We propose a new paradigm called word meaning maturity which tracks the degree of knowledge of each word at different stages of language learning. We present a computational algorithm for estimating word maturity, based on modeling language acquisition with Latent Semantic Analysis. We demonstrate that the resulting metric not only correlates well with external indicators, but captures deeper semantic effects in language. 1 Motivation It is no surprise that through stages of language learning, different words are learned at different times and are known to different extents. For example, a common word like “dog” is familiar to even a first-grader, whereas a more advanced word like “focal” does not usually enter learners’ vocabulary until much later. Although individual rates of learning words may vary between high- and low-performing students, it has been observed that “children […] acquire word meanings in roughly the same sequence” (Biemiller, 2008). The aim of this work is to model the degree of knowledge of words at different learning stages. Such a metric would have extremely useful applications in personalized educational technologies, for the purposes of accurate assessment and personalized vocabulary instruction. 2 Rethinking Word Difficulty Previously, related work in education and psychometrics has been concerned with measuring word difficulty or classifying words into different difficulty categories. Examples of such approaches include creation of word lists for targeted vocabulary instruction at various grade levels that were compiled by educational experts, such as Nation (1993) or Biemiller (2008).
Such word difficulty assignments are also implicitly present in some readability formulas that estimate difficulty of texts, such as Lexiles (Stenner, 1996), which include a lexical difficulty component based on the frequency of occurrence of words in a representative corpus, on the assumption that word difficulty is inversely correlated to corpus frequency. Additionally, research in psycholinguistics has attempted to outline and measure psycholinguistic dimensions of words such as age-of-acquisition and familiarity, which aim to track when certain words become known and how familiar they appear to an average person. Importantly, all such word difficulty measures can be thought of as functions that assign a single scalar value to each word w: !

3 0.66213334 212 acl-2011-Local Histograms of Character N-grams for Authorship Attribution

Author: Hugo Jair Escalante ; Thamar Solorio ; Manuel Montes-y-Gomez

Abstract: This paper proposes the use of local histograms (LH) over character n-grams for authorship attribution (AA). LHs are enriched histogram representations that preserve sequential information in documents; they have been successfully used for text categorization and document visualization using word histograms. In this work we explore the suitability of LHs over n-grams at the character-level for AA. We show that LHs are particularly helpful for AA, because they provide useful information for uncovering, to some extent, the writing style of authors. We report experimental results in AA data sets that confirm that LHs over character n-grams are more helpful for AA than the usual global histograms, yielding results far superior to state of the art approaches. We found that LHs are even more advantageous in challenging conditions, such as having imbalanced and small training sets. Our results motivate further research on the use of LHs for modeling the writing style of authors for related tasks, such as authorship verification and plagiarism detection.

4 0.63831115 214 acl-2011-Lost in Translation: Authorship Attribution using Frame Semantics

Author: Steffen Hedegaard ; Jakob Grue Simonsen

Abstract: We investigate authorship attribution using classifiers based on frame semantics. The purpose is to discover whether adding semantic information to lexical and syntactic methods for authorship attribution will improve them, specifically to address the difficult problem of authorship attribution of translated texts. Our results suggest (i) that frame-based classifiers are usable for author attribution of both translated and untranslated texts; (ii) that frame-based classifiers generally perform worse than the baseline classifiers for untranslated texts, but (iii) perform as well as, or superior to, the baseline classifiers on translated texts; (iv) that, contrary to current belief, naïve classifiers based on lexical markers may perform tolerably on translated texts if the combination of author and translator is present in the training set of a classifier.

5 0.58668268 198 acl-2011-Latent Semantic Word Sense Induction and Disambiguation

Author: Tim Van de Cruys ; Marianna Apidianaki

Abstract: In this paper, we present a unified model for the automatic induction of word senses from text, and the subsequent disambiguation of particular word instances using the automatically extracted sense inventory. The induction step and the disambiguation step are based on the same principle: words and contexts are mapped to a limited number of topical dimensions in a latent semantic word space. The intuition is that a particular sense is associated with a particular topic, so that different senses can be discriminated through their association with particular topical dimensions; in a similar vein, a particular instance of a word can be disambiguated by determining its most important topical dimensions. The model is evaluated on the SEMEVAL-2010 word sense induction and disambiguation task, on which it reaches state-of-the-art results.

6 0.58580774 248 acl-2011-Predicting Clicks in a Vocabulary Learning System

7 0.57372195 97 acl-2011-Discovering Sociolinguistic Associations with Structured Sparsity

8 0.56177855 163 acl-2011-Improved Modeling of Out-Of-Vocabulary Words Using Morphological Classes

9 0.55110914 31 acl-2011-Age Prediction in Blogs: A Study of Style, Content, and Online Behavior in Pre- and Post-Social Media Generations

10 0.54361135 74 acl-2011-Combining Indicators of Allophony

11 0.54356021 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering

12 0.54293311 80 acl-2011-ConsentCanvas: Automatic Texturing for Improved Readability in End-User License Agreements

13 0.54226404 175 acl-2011-Integrating history-length interpolation and classes in language modeling

14 0.53374892 306 acl-2011-Towards Style Transformation from Written-Style to Audio-Style

15 0.52763653 102 acl-2011-Does Size Matter - How Much Data is Required to Train a REG Algorithm?

16 0.52458763 120 acl-2011-Even the Abstract have Color: Consensus in Word-Colour Associations

17 0.52225596 20 acl-2011-A New Dataset and Method for Automatically Grading ESOL Texts

18 0.52155268 278 acl-2011-Semi-supervised condensed nearest neighbor for part-of-speech tagging

19 0.51845527 293 acl-2011-Template-Based Information Extraction without the Templates

20 0.5143621 55 acl-2011-Automatically Predicting Peer-Review Helpfulness


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.026), (5, 0.031), (17, 0.066), (26, 0.017), (37, 0.109), (39, 0.071), (41, 0.043), (55, 0.031), (58, 0.269), (59, 0.038), (72, 0.025), (91, 0.041), (96, 0.131), (97, 0.034)]
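
The listing does not state how the simValue scores below are derived from these topic weights. As an illustration only, the following minimal sketch assumes cosine similarity over sparse (topicId, topicWeight) vectors; the helper name `cosine` and the second vector `other_topics` are hypothetical and not taken from the page.

```python
import math

# Sparse (topicId, topicWeight) pairs copied from the listing above.
paper_topics = [
    (0, 0.026), (5, 0.031), (17, 0.066), (26, 0.017), (37, 0.109),
    (39, 0.071), (41, 0.043), (55, 0.031), (58, 0.269), (59, 0.038),
    (72, 0.025), (91, 0.041), (96, 0.131), (97, 0.034),
]

# Hypothetical topic vector for another paper (made up for illustration).
other_topics = [(37, 0.20), (58, 0.30), (80, 0.10)]


def cosine(a, b):
    """Cosine similarity between two sparse topic vectors.

    Assumption: this measure is chosen for illustration; the page does
    not say which similarity function it uses.
    """
    da, db = dict(a), dict(b)
    # Only topics present in both vectors contribute to the dot product.
    dot = sum(w * db.get(t, 0.0) for t, w in da.items())
    norm_a = math.sqrt(sum(w * w for w in da.values()))
    norm_b = math.sqrt(sum(w * w for w in db.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# The topic with the largest weight for this paper.
dominant_topic = max(paper_topics, key=lambda pair: pair[1])

print("dominant topic:", dominant_topic)
print("similarity to other paper:", cosine(paper_topics, other_topics))
```

Under this assumption, two papers score highly when their weight is concentrated on the same topics, and papers with no topics in common score zero.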

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.76627398 319 acl-2011-Unsupervised Decomposition of a Document into Authorial Components

Author: Moshe Koppel ; Navot Akiva ; Idan Dershowitz ; Nachum Dershowitz

Abstract: We propose a novel unsupervised method for separating out distinct authorial components of a document. In particular, we show that, given a book artificially “munged” from two thematically similar biblical books, we can separate out the two constituent books almost perfectly. This allows us to automatically recapitulate many conclusions reached by Bible scholars over centuries of research. One of the key elements of our method is exploitation of differences in synonym choice by different authors.

2 0.70195037 248 acl-2011-Predicting Clicks in a Vocabulary Learning System

Author: Aaron Michelony

Abstract: We consider the problem of predicting which words a student will click in a vocabulary learning system. Often a language learner will find value in the ability to look up the meaning of an unknown word while reading an electronic document by clicking the word. Highlighting words likely to be unknown to a reader is attractive because it draws his or her attention to them and indicates that information is available. However, this highlighting is usually done manually in vocabulary systems and online encyclopedias such as Wikipedia. Furthermore, it is never done on a per-user basis. This paper presents an automated way of highlighting words likely to be unknown to the specific user. We present related work in search engine ranking, a description of the study used to collect click data, and the experiment we performed using the random forest machine learning algorithm, and we finish with a discussion of future work.

3 0.66822189 61 acl-2011-Binarized Forest to String Translation

Author: Hao Zhang ; Licheng Fang ; Peng Xu ; Xiaoyun Wu

Abstract: Tree-to-string translation is syntax-aware and efficient but sensitive to parsing errors. Forest-to-string translation approaches mitigate the risk of propagating parser errors into translation errors by considering a forest of alternative trees, as generated by a source language parser. We propose an alternative approach to generating forests that is based on combining sub-trees within the first best parse through binarization. Provably, our binarization forest can cover any non-constituent phrases in a sentence but maintains the desirable property that for each span there is at most one nonterminal so that the grammar constant for decoding is relatively small. For the purpose of reducing search errors, we apply the synchronous binarization technique to forest-to-string decoding. Combining the two techniques, we show that using a fast shift-reduce parser we can achieve significant quality gains in NIST 2008 English-to-Chinese track (1.3 BLEU points over a phrase-based system, 0.8 BLEU points over a hierarchical phrase-based system). Consistent and significant gains are also shown in WMT 2010 in the English to German, French, Spanish and Czech tracks.

4 0.5998646 10 acl-2011-A Discriminative Model for Joint Morphological Disambiguation and Dependency Parsing

Author: John Lee ; Jason Naradowsky ; David A. Smith

Abstract: Most previous studies of morphological disambiguation and dependency parsing have been pursued independently. Morphological taggers operate on n-grams and do not take into account syntactic relations; parsers use the “pipeline” approach, assuming that morphological information has been separately obtained. However, in morphologically-rich languages, there is often considerable interaction between morphology and syntax, such that neither can be disambiguated without the other. In this paper, we propose a discriminative model that jointly infers morphological properties and syntactic structures. In evaluations on various highly-inflected languages, this joint model outperforms both a baseline tagger in morphological disambiguation, and a pipeline parser in head selection.

5 0.59833682 5 acl-2011-A Comparison of Loopy Belief Propagation and Dual Decomposition for Integrated CCG Supertagging and Parsing

Author: Michael Auli ; Adam Lopez

Abstract: Via an oracle experiment, we show that the upper bound on accuracy of a CCG parser is significantly lowered when its search space is pruned using a supertagger, though the supertagger also prunes many bad parses. Inspired by this analysis, we design a single model with both supertagging and parsing features, rather than separating them into distinct models chained together in a pipeline. To overcome the resulting increase in complexity, we experiment with both belief propagation and dual decomposition approaches to inference, the first empirical comparison of these algorithms that we are aware of on a structured natural language processing problem. On CCGbank we achieve a labelled dependency F-measure of 88.8% on gold POS tags, and 86.7% on automatic part-of-speech tags, the best reported results for this task.

6 0.59636068 202 acl-2011-Learning Hierarchical Translation Structure with Linguistic Annotations

7 0.5959245 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction

8 0.59488153 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation

9 0.5946402 111 acl-2011-Effects of Noun Phrase Bracketing in Dependency Parsing and Machine Translation

10 0.59379673 309 acl-2011-Transition-based Dependency Parsing with Rich Non-local Features

11 0.59205079 277 acl-2011-Semi-supervised Relation Extraction with Large-scale Word Clustering

12 0.59200907 28 acl-2011-A Statistical Tree Annotator and Its Applications

13 0.59147042 44 acl-2011-An exponential translation model for target language morphology

14 0.59116936 331 acl-2011-Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation

15 0.59088486 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering

16 0.59022224 58 acl-2011-Beam-Width Prediction for Efficient Context-Free Parsing

17 0.59019774 249 acl-2011-Predicting Relative Prominence in Noun-Noun Compounds

18 0.58970338 284 acl-2011-Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models

19 0.58947003 300 acl-2011-The Surprising Variance in Shortest-Derivation Parsing

20 0.58930272 164 acl-2011-Improving Arabic Dependency Parsing with Form-based and Functional Morphological Features