
54 jmlr-2007-Measuring Differentiability: Unmasking Pseudonymous Authors


Source: pdf

Author: Moshe Koppel, Jonathan Schler, Elisheva Bonchek-Dokow

Abstract: In the authorship verification problem, we are given examples of the writing of a single author and are asked to determine if given long texts were or were not written by this author. We present a new learning-based method for adducing the “depth of difference” between two example sets and offer evidence that this method solves the authorship verification problem with very high accuracy. The underlying idea is to test the rate of degradation of the accuracy of learned models as the best features are iteratively dropped from the learning process. Keywords: authorship attribution, one-class learning, unmasking 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 In the authorship verification problem, we are given examples of the writing of a single author and are asked to determine if given long texts were or were not written by this author. [sent-10, score-1.186]

2 We present a new learning-based method for adducing the “depth of difference” between two example sets and offer evidence that this method solves the authorship verification problem with very high accuracy. [sent-11, score-0.725]

3 The underlying idea is to test the rate of degradation of the accuracy of learned models as the best features are iteratively dropped from the learning process. [sent-12, score-0.241]

4 In the authorship attribution problem, we are given examples of the writing of a number of authors and are asked to determine which of them authored given anonymous texts (Mosteller and Wallace 1964; Holmes 1998). [sent-13, score-1.539]

5 If it can be assumed for each test document that one of the specified authors is indeed the actual author, the problem fits the standard paradigm of a text categorization problem (Sebastiani 2002). [sent-14, score-0.208]

6 In the authorship verification problem (Van Halteren 2004), we are given examples of the writing of a single author and are asked to determine if given texts were or were not written by this author. [sent-15, score-1.186]

7 As a categorization problem, verification is significantly more difficult than attribution, and little, if any, work has been performed on it in the learning community. [sent-16, score-0.621]

8 If, for example, all we wished to do was to determine whether a text was written by Shakespeare or Marlowe, it would be sufficient to use their respective known writings, to construct a model distinguishing them, and to test the unknown text against the model. [sent-17, score-0.401]

9 If, on the other hand, we need to determine if a text was written by Shakespeare or not, it is very difficult – if not impossible – to assemble an exhaustive, or even representative, sample of not-Shakespeare. [sent-18, score-0.264]

10 The situation in which we suspect that a given author may have written some text but do not have an exhaustive list of alternative candidates is a common one. [sent-19, score-0.361]

11 The problem is complicated by the fact that a single author may consciously vary his or her style from text to text for many reasons or may unconsciously drift stylistically over time. [sent-20, score-0.455]

12 Thus researchers must learn to somehow distinguish between relatively shallow differences that reflect conscious or unconscious changes in an author’s style and deeper differences that reflect styles of different authors. [sent-21, score-0.245]

13 First, in authorship verification we do not actually lack for negative examples. [sent-27, score-0.67]

14 The world is replete with texts that were, for example, not written by Shakespeare. [sent-28, score-0.129]

15 However, these negative examples are not representative of the entire class. [sent-29, score-0.027]

16 We will consider how to make proper limited use of whatever partial negative information is available for the authorship verification problem. [sent-31, score-0.708]

17 Thus, a better way to think about authorship verification is that we are given two example sets and are asked whether these sets were generated by a single generating process (author) or by two different processes. [sent-33, score-0.776]

18 The central novelty of this paper is a new method for adducing the depth of difference between two example sets, a method that may have far-reaching consequences for determining the reliability of classification models. [sent-34, score-0.091]

19 The underlying idea is to test the rate of degradation of the accuracy of learned models as the best features are iteratively dropped from the learning process. [sent-35, score-0.241]

20 We will show that this method provides an extremely robust solution to the authorship verification problem that is independent of language, period and genre. [sent-36, score-0.67]

21 Authorship Attribution with Two Candidates. Since our methods will build upon the standard methods for handling authorship attribution between two candidate authors, let us begin by briefly reviewing those standard methods. [sent-38, score-0.452]

22 We begin by choosing a feature set consisting of the kinds of features that might be used consistently by a single author over a variety of writings. [sent-39, score-0.254]

23 Typically, these features might include frequencies of function words (Mosteller and Wallace 1964), syntactic structures (Baayen et al. 2002), or syntactic and orthographic idiosyncrasies (Koppel and Schler 2003). [sent-40, score-0.084] [sent-44, score-0.038]

25 Note that these feature types contrast sharply with the content words commonly used in text categorization by topic. [sent-45, score-0.15]

26 Having constructed the appropriate feature vectors, we continue, precisely as in topic-based text categorization, by using a learning algorithm to construct a distinguishing model. [sent-46, score-0.192]

27 Although many learning methods have been applied to the problem, including multivariate analysis, decision trees and neural nets, a good number of studies have shown that linear separators work well for text categorization (Yang 1999). [sent-47, score-0.15]

28 Linear methods that have been used for text categorization include Naïve Bayes (Mosteller and Wallace 1964; Peng et al.). [sent-48, score-0.15]

29 This general framework has been used to convincingly solve a number of real-world authorship attribution problems. [sent-55, score-0.452]
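
To make this standard framework concrete, here is a minimal sketch in Python with scikit-learn; the majority vote over chunks follows the description above, while the specific vectorizer and classifier settings (250 most frequent words, a linear SVM) are illustrative choices rather than the paper's exact configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def attribute(chunks_a, chunks_b, mystery_chunks, n_features=250):
    """Two-candidate attribution sketch: learn a linear separator over
    word-frequency features for author A vs. author B, then take a
    majority vote over the chunks of the disputed text."""
    clf = make_pipeline(CountVectorizer(max_features=n_features),
                        LinearSVC())
    clf.fit(chunks_a + chunks_b,
            ["A"] * len(chunks_a) + ["B"] * len(chunks_b))
    votes = list(clf.predict(mystery_chunks))
    return "A" if votes.count("A") > votes.count("B") else "B"
```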

30 Approach 1: Lining up impostors – One possibility that suggests itself is to assemble a representative collection of works by other authors and to use a two-class learner, such as SVM, to learn a model for A vs. not-A. [sent-62, score-0.238]

31 Then we chunk the mystery work X and run the chunks through the learned model. [sent-64, score-0.212]

32 If the preponderance of chunks of X are classed as A, then X is deemed to have been written by A; otherwise it is deemed to have been written by someone else. [sent-65, score-0.358]

33 While it is indeed reasonable to conclude that A is not the author if most chunks are attributed to not-A, the converse is not true. [sent-67, score-0.35]

34 Any author who is neither A nor represented in the sample not-A, but who happens to have a style more similar to A than to not-A, will be falsely determined by this method to be A. [sent-68, score-0.258]

35 Then we ascribe X to A if a sufficient number of chunks of X lie inside the boundary. [sent-72, score-0.189]

36 If it is easy to distinguish between them, that is, if we obtain high accuracy in cross-validation, then we conclude that A did not write X. [sent-76, score-0.091]

37 We are given known works by three 19th century American novelists, Herman Melville, James Fenimore Cooper and Nathaniel Hawthorne. [sent-81, score-0.098]

38 For each of the three authors, we are asked if that author was or was not also the author of The House of Seven Gables (henceforth: Gables). [sent-82, score-0.522]

39 Using the method just described and using a feature set consisting of the 250 most frequently used words in A and X (details below), we find that we can distinguish Gables from the works of each author with cross-validation accuracy of above 98%. [sent-83, score-0.374]

40 If we were to conclude, therefore, that none of these authors wrote Gables, we would be wrong: Hawthorne wrote it. [sent-84, score-0.134]

41 A New Approach: Unmasking. If we look closely at the models that successfully distinguish Gables from Hawthorne’s other work (in this case, The Scarlet Letter), we find that a small number of features are doing all the work of distinguishing between them. [sent-85, score-0.221]

42 These features include he (more frequent in The Scarlet Letter) and she (more frequent in Gables). [sent-86, score-0.124]

43 The situation in which an author will use a small number of features in a consistently different way between works is typical. [sent-87, score-0.305]

44 These differences might result from thematic differences between the works, from differences in genre or purpose, from chronological stylistic drift, or from deliberate attempts by the author to mask his or her identity (as we shall see below). [sent-88, score-0.345]

45 The main point of this paper is to show how this problem can be overcome by determining not only whether A is distinguishable from X but also how great the depth of difference between A and X is. [sent-89, score-0.036]

46 The intuitive idea of unmasking is to iteratively remove those features that are most useful for distinguishing between A and X and to gauge the speed with which cross-validation accuracy degrades as more features are removed. [sent-91, score-0.69]

47 Our main hypothesis is that if A and X are by the same author, then whatever differences there are between them will be reflected in only a relatively small number of features, despite possible differences in theme, genre and the like. [sent-92, score-0.145]

48 In Figure 1, we show the result of unmasking when comparing Gables to known works of Melville, Cooper and Hawthorne. [sent-93, score-0.436]

49 This graph illustrates our hypothesis: when comparing Gables to works by other authors, the degradation as we remove distinguishing features from consideration is slow and smooth, but when comparing it to another work by Hawthorne, the degradation is sudden and dramatic. [sent-94, score-0.443]

50 This illustrates that once a small number of distinguishing markers are removed, the two works by Hawthorne become increasingly hard to distinguish from each other. [sent-95, score-0.202]

51 In the following section, we will show how the suddenness of the degradation can be quantified in a fashion optimal for this task and we will see that the phenomenon illustrated in the Gables example holds very generally. [sent-96, score-0.089]

52 Thus by taking into account the depth of difference between two works, we can determine if they were authored by the same person or two different people. [sent-97, score-0.14]

53 Ten-fold cross-validation accuracy of models distinguishing House of Seven Gables from each of Hawthorne, Melville and Cooper. [sent-99, score-0.16]

54 The x-axis represents the number of iterations of eliminating the best features from the previous iteration. [sent-100, score-0.046]

55 Experimental Results. In this section we consider an experimental framework for systematic testing of the effectiveness of the unmasking algorithm. [sent-102, score-0.385]

56 Corpus. We consider a collection of 21 nineteenth-century English books written by 10 different authors and spanning a variety of genres. [sent-103, score-0.214]

57 We chose all electronically available books by these authors that were sufficiently large (above 500K) and that presented no special formatting challenges. [sent-104, score-0.121]

58 For the sake of all the experiments that follow, we chunk each book into approximately equal sections of at least 500 words without breaking up paragraphs. [sent-109, score-0.169]
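
A rough sketch of this chunking step follows; the use of blank lines as paragraph boundaries, and the handling of a short final section, are assumptions rather than details given in the paper.

```python
def chunk_book(text, min_words=500):
    """Split a book into sections of at least min_words words,
    accumulating whole paragraphs so that none is broken across chunks."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    chunks, current, count = [], [], 0
    for para in paragraphs:
        current.append(para)
        count += len(para.split())
        if count >= min_words:            # section is long enough: close it
            chunks.append("\n\n".join(current))
            current, count = [], 0
    if current and chunks:                # fold a short tail into the last chunk
        chunks[-1] += "\n\n" + "\n\n".join(current)
    elif current:
        chunks.append("\n\n".join(current))
    return chunks
```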

59 For each author A and each book X, let AX consist of all the works by A in the corpus unless X is in fact written by A, in which case AX consists of all works by A except X. [sent-110, score-0.522]

60 Our objective is to assign to each pair ⟨AX, X⟩ the value same-author if X is by A and the value different-author otherwise. [sent-111, score-0.031]

61 Baseline: One-class SVM. In order to establish a baseline, for each author A in the corpus and each book X, we use a one-class SVM (Chang and Lin 2001) on the 250 most frequent words in AX to build a model for AX. [sent-113, score-0.413]

62 We then test each book X against the model of each AX. [sent-115, score-0.099]

63 We assign the pair ⟨AX, X⟩ the value same-author if more than half the chunks of X are assigned to AX. [sent-116, score-0.198]

64 Of the 20 pairs that should have been assigned the value same-author, only 6 are correctly classified, while 46 of the 189 pairs that should be assigned the value different-author are incorrectly classified. [sent-118, score-0.05]

65 These results hold using an RBF kernel; using other kernels, or using a threshold other than half on the number of chunks assigned to the class, only degrades results. [sent-119, score-0.194]
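
A minimal sketch of this baseline, assuming scikit-learn's OneClassSVM as a stand-in for the LIBSVM implementation of Chang and Lin (2001); the relative-frequency normalization is an illustrative assumption, while the decision rule of "more than half the chunks inside the boundary" follows the description above.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import OneClassSVM

def one_class_same_author(ax_chunks, x_chunks, n_features=250):
    """Baseline sketch: fit a one-class SVM (RBF kernel) on the 250 most
    frequent words of AX, then test the chunks of X against the model.
    Declares same-author if more than half the chunks of X fall inside
    the learned boundary."""
    vec = CountVectorizer(max_features=n_features)
    train = vec.fit_transform(ax_chunks).toarray().astype(float)
    train /= np.maximum(train.sum(axis=1, keepdims=True), 1)  # word frequencies

    model = OneClassSVM(kernel="rbf").fit(train)

    test = vec.transform(x_chunks).toarray().astype(float)
    test /= np.maximum(test.sum(axis=1, keepdims=True), 1)
    inside = (model.predict(test) == 1).sum()  # +1 means inside the boundary
    return inside > len(x_chunks) / 2
```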

66 A second possible baseline is the “lining up impostors” method mentioned in Section 3 above. [sent-120, score-0.03]

67 Unmasking Applied. Now let us introduce the details of our new method based on our observations above regarding iterative elimination of features. [sent-123, score-0.04]

68 We choose as an initial feature set the n words with highest average frequency in AX and X (that is, the average of the frequency in AX and the frequency in X, giving equal weight to AX and X). [sent-124, score-0.032]

69 Using an SVM with a linear kernel, we run the following unmasking scheme: [sent-125, score-0.385]

70 1. Determine the accuracy results of a ten-fold cross-validation experiment for AX against X. [sent-126, score-0.05]

71 (If one of the sets, AX or X, includes more chunks than the other, we randomly discard the surplus. [sent-127, score-0.172]

72 Accuracy results are the average of five runs of ten-fold cross-validation, in which we discard randomly for each run.) [sent-128, score-0.03]

73 2. For the model obtained in each fold, eliminate the k most strongly weighted positive features and the k most strongly weighted negative features. [sent-130, score-0.046]

74 In this way, we construct degradation curves for each pair ⟨AX, X⟩. [sent-133, score-0.147]
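
Putting steps 1 and 2 together, a minimal sketch of the unmasking loop follows. It assumes the two chunk sets have already been balanced and vectorized over a common feature set (e.g., the n = 250 average-frequency words), and it simplifies step 2 by taking the strongest weights from a single model fit on all the data rather than per fold; the five-run averaging is also omitted.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def unmask(F_ax, F_x, n_iters=10, k=3):
    """Unmasking sketch. F_ax and F_x are (chunks x features) word-frequency
    matrices for AX and X over a common feature set. Returns the degradation
    curve: ten-fold CV accuracy after each round of feature elimination."""
    X = np.vstack([F_ax, F_x])
    y = np.array([0] * len(F_ax) + [1] * len(F_x))
    active = np.arange(X.shape[1])          # indices of surviving features
    curve = []
    for _ in range(n_iters):
        clf = LinearSVC()
        curve.append(cross_val_score(clf, X[:, active], y, cv=10).mean())
        # Simplification: take the strongest weights from one model fit on
        # all the data (the paper eliminates per fold).
        w = clf.fit(X[:, active], y).coef_.ravel()
        drop = np.concatenate([np.argsort(w)[-k:],   # k most positive weights
                               np.argsort(w)[:k]])   # k most negative weights
        active = np.delete(active, drop)
    return curve
```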

75 In Figure 2, we show such curves (using n=250 and k=3) for An Ideal Husband against each of ten authors, including Oscar Wilde. [sent-134, score-0.058]

76 Unmasking An Ideal Husband against each of the ten authors (n=250, k=3). [sent-136, score-0.058]

77 The curve lying below all the others is that of Oscar Wilde, the actual author. [sent-137, score-0.058]

78 Meta-learning: Identifying Same-Author Curves. We wish now to quantify the difference between same-author curves and different-author curves. [sent-140, score-0.058]

79 We then apply a meta-learning scheme in which we use learners to determine what role to assign to various features of the curves. [sent-143, score-0.111]

80 (Note that although we have 20 same-author pairs, we really only have 13 distinct same-author curves, since for authors with exactly two works in our corpus, the comparison of AX with X is identical for each of the two books.) [sent-144, score-0.109]

81 To illustrate the ease with which same-author curves can be distinguished from different-author curves, we note that for all same-author curves, it holds that: • accuracy after 6 elimination rounds is lower than 89%, and • the second-highest accuracy drop in two iterations is greater than 16%. [sent-145, score-0.291]
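
These two properties can be checked directly on a degradation curve. The sketch below assumes a list of per-round accuracies in [0, 1], as produced by the unmasking loop sketched earlier, and hard-codes the thresholds quoted for this corpus.

```python
def looks_same_author(curve):
    """Apply the two curve properties quoted above to a degradation curve
    (one ten-fold CV accuracy per elimination round)."""
    deep_enough = curve[6] < 0.89           # accuracy after 6 rounds < 89%
    # drops in accuracy over windows of two iterations
    drops = sorted((curve[i] - curve[i + 2] for i in range(len(curve) - 2)),
                   reverse=True)
    sharp_enough = len(drops) >= 2 and drops[1] > 0.16
    return deep_enough and sharp_enough
```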

82 Accuracy Results: Leave-one-book-out Tests. In order to assess the accuracy of the method, we use the following cross-validation methodology. [sent-148, score-0.05]

83 For each book B in our corpus, we run a trial in which B is completely eliminated from consideration. [sent-149, score-0.099]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('unmasking', 0.385), ('verification', 0.358), ('authorship', 0.312), ('gables', 0.303), ('koppel', 0.248), ('author', 0.208), ('schler', 0.165), ('hawthorne', 0.163), ('ax', 0.148), ('chunks', 0.142), ('attribution', 0.14), ('melville', 0.11), ('mosteller', 0.11), ('pseudonymous', 0.11), ('shakespeare', 0.11), ('distinguishing', 0.11), ('asked', 0.106), ('book', 0.099), ('wallace', 0.093), ('degradation', 0.089), ('texts', 0.083), ('husband', 0.083), ('text', 0.082), ('house', 0.07), ('chunk', 0.07), ('authored', 0.07), ('categorization', 0.068), ('corpus', 0.067), ('books', 0.063), ('deemed', 0.062), ('authors', 0.058), ('curves', 0.058), ('cooper', 0.057), ('differentiability', 0.056), ('adducing', 0.055), ('baayen', 0.055), ('difficult', 0.055), ('elisheva', 0.055), ('emily', 0.055), ('impostors', 0.055), ('lining', 0.055), ('moshe', 0.055), ('novelists', 0.055), ('oscar', 0.055), ('scarlet', 0.055), ('vel', 0.055), ('wilde', 0.055), ('works', 0.051), ('accuracy', 0.05), ('style', 0.05), ('heights', 0.047), ('century', 0.047), ('assemble', 0.047), ('genre', 0.047), ('reflect', 0.047), ('sufficient', 0.047), ('features', 0.046), ('written', 0.046), ('distinguish', 0.041), ('elimination', 0.04), ('na', 0.039), ('writing', 0.039), ('frequent', 0.039), ('whatever', 0.038), ('wrote', 0.038), ('syntactic', 0.038), ('depth', 0.036), ('jonathan', 0.035), ('determine', 0.034), ('seven', 0.034), ('drift', 0.033), ('holmes', 0.033), ('ideal', 0.033), ('highest', 0.032), ('measuring', 0.032), ('rounds', 0.031), ('assign', 0.031), ('differences', 0.03), ('discard', 0.03), ('dropped', 0.03), ('baseline', 0.03), ('drop', 0.03), ('svm', 0.028), ('english', 0.028), ('degrades', 0.027), ('representative', 0.027), ('iteratively', 0.026), ('letter', 0.026), ('assigned', 0.025), ('exhaustive', 0.025), ('chang', 0.024), ('round', 0.024), ('find', 0.024), ('traits', 0.023), ('secret', 0.023), ('spy', 0.023), ('matthews', 0.023), ('sebastiani', 0.023), ('theme', 0.023), ('shaw', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 54 jmlr-2007-Measuring Differentiability: Unmasking Pseudonymous Authors

Author: Moshe Koppel, Jonathan Schler, Elisheva Bonchek-Dokow

Abstract: In the authorship verification problem, we are given examples of the writing of a single author and are asked to determine if given long texts were or were not written by this author. We present a new learning-based method for adducing the “depth of difference” between two example sets and offer evidence that this method solves the authorship verification problem with very high accuracy. The underlying idea is to test the rate of degradation of the accuracy of learned models as the best features are iteratively dropped from the learning process. Keywords: authorship attribution, one-class learning, unmasking 1

2 0.0594116 40 jmlr-2007-Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization

Author: Evgeniy Gabrilovich, Shaul Markovitch

Abstract: Most existing methods for text categorization employ induction algorithms that use the words appearing in the training documents as features. While they perform well in many categorization tasks, these methods are inherently limited when faced with more complicated tasks where external knowledge is essential. Recently, there have been efforts to augment these basic features with external knowledge, including semi-supervised learning and transfer learning. In this work, we present a new framework for automatic acquisition of world knowledge and methods for incorporating it into the text categorization process. Our approach enhances machine learning algorithms with features generated from domain-specific and common-sense knowledge. This knowledge is represented by ontologies that contain hundreds of thousands of concepts, further enriched through controlled Web crawling. Prior to text categorization, a feature generator analyzes the documents and maps them onto appropriate ontology concepts that augment the bag of words used in simple supervised learning. Feature generation is accomplished through contextual analysis of document text, thus implicitly performing word sense disambiguation. Coupled with the ability to generalize concepts using the ontology, this approach addresses two significant problems in natural language processing—synonymy and polysemy. Categorizing documents with the aid of knowledge-based features leverages information that cannot be deduced from the training documents alone. We applied our methodology using the Open Directory Project, the largest existing Web directory built by over 70,000 human editors. Experimental results over a range of data sets confirm improved performance compared to the bag of words document representation. Keywords: feature generation, text classification, background knowledge

3 0.035361312 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data

Author: Amir Globerson, Gal Chechik, Fernando Pereira, Naftali Tishby

Abstract: Embedding algorithms search for a low dimensional continuous representation of data, but most algorithms only handle objects of a single type for which pairwise distances are specified. This paper describes a method for embedding objects of different types, such as images and text, into a single common Euclidean space, based on their co-occurrence statistics. The joint distributions are modeled as exponentials of Euclidean distances in the low-dimensional embedding space, which links the problem to convex optimization over positive semidefinite matrices. The local structure of the embedding corresponds to the statistical correlations via random walks in the Euclidean space. We quantify the performance of our method on two text data sets, and show that it consistently and significantly outperforms standard methods of statistical correspondence modeling, such as multidimensional scaling, IsoMap and correspondence analysis. Keywords: embedding algorithms, manifold learning, exponential families, multidimensional scaling, matrix factorization, semidefinite programming

4 0.026216526 48 jmlr-2007-Learning in Environments with Unknown Dynamics: Towards more Robust Concept Learners

Author: Marlon Núñez, Raúl Fidalgo, Rafael Morales

Abstract: In the process of concept learning, target concepts may have portions with short-term changes, other portions may support long-term changes, and yet others may not change at all. For this reason several local windows need to be handled. We suggest facing this problem, which naturally exists in the field of concept learning, by allocating windows which can adapt their size to portions of the target concept. We propose an incremental decision tree that is updated with incoming examples. Each leaf of the decision tree holds a time window and a local performance measure as the main parameter to be controlled. When the performance of a leaf decreases, the size of its local window is reduced. This learning algorithm, called OnlineTree2, automatically adjusts its internal parameters in order to face the current dynamics of the data stream. Results show that it is comparable to other batch algorithms when facing problems with no concept change, and it is better than evaluated methods in its ability to deal with concept drift when dealing with problems in which: concept change occurs at different speeds, noise may be present and, examples may arrive from different areas of the problem domain (virtual drift). Keywords: incremental algorithms, online learning, concept drift, decision trees, robust learners

5 0.023279952 1 jmlr-2007-"Ideal Parent" Structure Learning for Continuous Variable Bayesian Networks

Author: Gal Elidan, Iftach Nachman, Nir Friedman

Abstract: Bayesian networks in general, and continuous variable networks in particular, have become increasingly popular in recent years, largely due to advances in methods that facilitate automatic learning from data. Yet, despite these advances, the key task of learning the structure of such models remains a computationally intensive procedure, which limits most applications to parameter learning. This problem is even more acute when learning networks in the presence of missing values or hidden variables, a scenario that is part of many real-life problems. In this work we present a general method for speeding structure search for continuous variable networks with common parametric distributions. We efficiently evaluate the approximate merit of candidate structure modifications and apply time consuming (exact) computations only to the most promising ones, thereby achieving significant improvement in the running time of the search algorithm. Our method also naturally and efficiently facilitates the addition of useful new hidden variables into the network structure, a task that is typically considered both conceptually difficult and computationally prohibitive. We demonstrate our method on synthetic and real-life data sets, both for learning structure on fully and partially observable data, and for introducing new hidden variables during structure search. Keywords: Bayesian networks, structure learning, continuous variables, hidden variables

6 0.021871518 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation

7 0.020396976 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition

8 0.019623261 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss

9 0.019532908 67 jmlr-2007-Polynomial Identification in the Limit of Substitutable Context-free Languages

10 0.018464088 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time

11 0.018106654 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study

12 0.017934142 79 jmlr-2007-Structure and Majority Classes in Decision Tree Learning

13 0.016615385 43 jmlr-2007-Integrating Naïve Bayes and FOIL

14 0.016034737 11 jmlr-2007-Anytime Learning of Decision Trees

15 0.015620557 29 jmlr-2007-Dynamic Weighted Majority: An Ensemble Method for Drifting Concepts

16 0.015510559 84 jmlr-2007-The Pyramid Match Kernel: Efficient Learning with Sets of Features

17 0.015170083 72 jmlr-2007-Relational Dependency Networks

18 0.015068459 82 jmlr-2007-The Need for Open Source Software in Machine Learning

19 0.01312678 6 jmlr-2007-A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians

20 0.012994668 27 jmlr-2007-Distances between Data Sets Based on Summary Statistics


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.08), (1, 0.077), (2, -0.01), (3, 0.049), (4, -0.014), (5, 0.068), (6, 0.006), (7, -0.025), (8, -0.004), (9, -0.026), (10, -0.013), (11, 0.08), (12, -0.137), (13, 0.154), (14, -0.03), (15, -0.024), (16, 0.129), (17, 0.043), (18, 0.056), (19, -0.033), (20, 0.053), (21, -0.027), (22, -0.082), (23, -0.052), (24, 0.147), (25, 0.028), (26, 0.008), (27, 0.138), (28, 0.023), (29, 0.2), (30, 0.187), (31, 0.007), (32, 0.076), (33, 0.159), (34, -0.09), (35, -0.158), (36, 0.133), (37, -0.065), (38, 0.148), (39, -0.342), (40, 0.462), (41, 0.316), (42, -0.013), (43, -0.269), (44, -0.193), (45, 0.041), (46, 0.114), (47, 0.08), (48, 0.006), (49, 0.158)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98248619 54 jmlr-2007-Measuring Differentiability: Unmasking Pseudonymous Authors

Author: Moshe Koppel, Jonathan Schler, Elisheva Bonchek-Dokow

Abstract: In the authorship verification problem, we are given examples of the writing of a single author and are asked to determine if given long texts were or were not written by this author. We present a new learning-based method for adducing the “depth of difference” between two example sets and offer evidence that this method solves the authorship verification problem with very high accuracy. The underlying idea is to test the rate of degradation of the accuracy of learned models as the best features are iteratively dropped from the learning process. Keywords: authorship attribution, one-class learning, unmasking 1

2 0.27776164 40 jmlr-2007-Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization

Author: Evgeniy Gabrilovich, Shaul Markovitch

Abstract: Most existing methods for text categorization employ induction algorithms that use the words appearing in the training documents as features. While they perform well in many categorization tasks, these methods are inherently limited when faced with more complicated tasks where external knowledge is essential. Recently, there have been efforts to augment these basic features with external knowledge, including semi-supervised learning and transfer learning. In this work, we present a new framework for automatic acquisition of world knowledge and methods for incorporating it into the text categorization process. Our approach enhances machine learning algorithms with features generated from domain-specific and common-sense knowledge. This knowledge is represented by ontologies that contain hundreds of thousands of concepts, further enriched through controlled Web crawling. Prior to text categorization, a feature generator analyzes the documents and maps them onto appropriate ontology concepts that augment the bag of words used in simple supervised learning. Feature generation is accomplished through contextual analysis of document text, thus implicitly performing word sense disambiguation. Coupled with the ability to generalize concepts using the ontology, this approach addresses two significant problems in natural language processing—synonymy and polysemy. Categorizing documents with the aid of knowledge-based features leverages information that cannot be deduced from the training documents alone. We applied our methodology using the Open Directory Project, the largest existing Web directory built by over 70,000 human editors. Experimental results over a range of data sets confirm improved performance compared to the bag of words document representation. Keywords: feature generation, text classification, background knowledge

3 0.11872574 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data

Author: Amir Globerson, Gal Chechik, Fernando Pereira, Naftali Tishby

Abstract: Embedding algorithms search for a low dimensional continuous representation of data, but most algorithms only handle objects of a single type for which pairwise distances are specified. This paper describes a method for embedding objects of different types, such as images and text, into a single common Euclidean space, based on their co-occurrence statistics. The joint distributions are modeled as exponentials of Euclidean distances in the low-dimensional embedding space, which links the problem to convex optimization over positive semidefinite matrices. The local structure of the embedding corresponds to the statistical correlations via random walks in the Euclidean space. We quantify the performance of our method on two text data sets, and show that it consistently and significantly outperforms standard methods of statistical correspondence modeling, such as multidimensional scaling, IsoMap and correspondence analysis. Keywords: embedding algorithms, manifold learning, exponential families, multidimensional scaling, matrix factorization, semidefinite programming

4 0.11818565 1 jmlr-2007-"Ideal Parent" Structure Learning for Continuous Variable Bayesian Networks

Author: Gal Elidan, Iftach Nachman, Nir Friedman

Abstract: Bayesian networks in general, and continuous variable networks in particular, have become increasingly popular in recent years, largely due to advances in methods that facilitate automatic learning from data. Yet, despite these advances, the key task of learning the structure of such models remains a computationally intensive procedure, which limits most applications to parameter learning. This problem is even more acute when learning networks in the presence of missing values or hidden variables, a scenario that is part of many real-life problems. In this work we present a general method for speeding structure search for continuous variable networks with common parametric distributions. We efficiently evaluate the approximate merit of candidate structure modifications and apply time consuming (exact) computations only to the most promising ones, thereby achieving significant improvement in the running time of the search algorithm. Our method also naturally and efficiently facilitates the addition of useful new hidden variables into the network structure, a task that is typically considered both conceptually difficult and computationally prohibitive. We demonstrate our method on synthetic and real-life data sets, both for learning structure on fully and partially observable data, and for introducing new hidden variables during structure search. Keywords: Bayesian networks, structure learning, continuous variables, hidden variables

5 0.10973381 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition

Author: Sébastien Gadat, Laurent Younes

Abstract: We introduce a new model addressing feature selection from a large dictionary of variables that can be computed from a signal or an image. Features are extracted according to an efficiency criterion, on the basis of specified classification or recognition tasks. This is done by estimating a probability distribution P on the complete dictionary, which distributes its mass over the more efficient, or informative, components. We implement a stochastic gradient descent algorithm, using the probability as a state variable and optimizing a multi-task goodness of fit criterion for classifiers based on variable randomly chosen according to P. We then generate classifiers from the optimal distribution of weights learned on the training set. The method is first tested on several pattern recognition problems including face detection, handwritten digit recognition, spam classification and micro-array analysis. We then compare our approach with other step-wise algorithms like random forests or recursive feature elimination. Keywords: stochastic learning algorithms, Robbins-Monro application, pattern recognition, classification algorithm, feature selection

6 0.1065767 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time

7 0.089361541 48 jmlr-2007-Learning in Environments with Unknown Dynamics: Towards more Robust Concept Learners

8 0.085303739 43 jmlr-2007-Integrating Naïve Bayes and FOIL

9 0.08223629 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels

10 0.081739843 6 jmlr-2007-A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians

11 0.081264578 16 jmlr-2007-Boosted Classification Trees and Class Probability Quantile Estimation

12 0.076091848 10 jmlr-2007-An Interior-Point Method for Large-Scalel1-Regularized Logistic Regression

13 0.073781967 25 jmlr-2007-Covariate Shift Adaptation by Importance Weighted Cross Validation

14 0.070847593 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss

15 0.069978684 21 jmlr-2007-Comments on the "Core Vector Machines: Fast SVM Training on Very Large Data Sets"

16 0.068892114 85 jmlr-2007-Transfer Learning via Inter-Task Mappings for Temporal Difference Learning

17 0.065577053 39 jmlr-2007-Handling Missing Values when Applying Classification Models

18 0.064144194 58 jmlr-2007-Noise Tolerant Variants of the Perceptron Algorithm

19 0.062291145 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning

20 0.061717737 53 jmlr-2007-Maximum Entropy Density Estimation with Generalized Regularization and an Application to Species Distribution Modeling


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.014), (12, 0.016), (15, 0.033), (28, 0.025), (40, 0.016), (48, 0.015), (60, 0.014), (80, 0.699), (85, 0.019), (98, 0.05)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9180392 54 jmlr-2007-Measuring Differentiability: Unmasking Pseudonymous Authors

Author: Moshe Koppel, Jonathan Schler, Elisheva Bonchek-Dokow

Abstract: In the authorship verification problem, we are given examples of the writing of a single author and are asked to determine if given long texts were or were not written by this author. We present a new learning-based method for adducing the “depth of difference” between two example sets and offer evidence that this method solves the authorship verification problem with very high accuracy. The underlying idea is to test the rate of degradation of the accuracy of learned models as the best features are iteratively dropped from the learning process. Keywords: authorship attribution, one-class learning, unmasking 1

2 0.76118934 11 jmlr-2007-Anytime Learning of Decision Trees

Author: Saher Esmeir, Shaul Markovitch

Abstract: The majority of existing algorithms for learning decision trees are greedy—a tree is induced topdown, making locally optimal decisions at each node. In most cases, however, the constructed tree is not globally optimal. Even the few non-greedy learners cannot learn good trees when the concept is difficult. Furthermore, they require a fixed amount of time and are not able to generate a better tree if additional time is available. We introduce a framework for anytime induction of decision trees that overcomes these problems by trading computation speed for better tree quality. Our proposed family of algorithms employs a novel strategy for evaluating candidate splits. A biased sampling of the space of consistent trees rooted at an attribute is used to estimate the size of the minimal tree under that attribute, and an attribute with the smallest expected tree is selected. We present two types of anytime induction algorithms: a contract algorithm that determines the sample size on the basis of a pre-given allocation of time, and an interruptible algorithm that starts with a greedy tree and continuously improves subtrees by additional sampling. Experimental results indicate that, for several hard concepts, our proposed approach exhibits good anytime behavior and yields significantly better decision trees when more time is available. Keywords: anytime algorithms, decision tree induction, lookahead, hard concepts, resource-bounded reasoning

3 0.31737727 39 jmlr-2007-Handling Missing Values when Applying Classification Models

Author: Maytal Saar-Tsechansky, Foster Provost

Abstract: Much work has studied the effect of different treatments of missing values on model induction, but little work has analyzed treatments for the common case of missing values at prediction time. This paper first compares several different methods—predictive value imputation, the distributionbased imputation used by C4.5, and using reduced models—for applying classification trees to instances with missing values (and also shows evidence that the results generalize to bagged trees and to logistic regression). The results show that for the two most popular treatments, each is preferable under different conditions. Strikingly the reduced-models approach, seldom mentioned or used, consistently outperforms the other two methods, sometimes by a large margin. The lack of attention to reduced modeling may be due in part to its (perceived) expense in terms of computation or storage. Therefore, we then introduce and evaluate alternative, hybrid approaches that allow users to balance between more accurate but computationally expensive reduced modeling and the other, less accurate but less computationally expensive treatments. The results show that the hybrid methods can scale gracefully to the amount of investment in computation/storage, and that they outperform imputation even for small investments. Keywords: missing data, classification, classification trees, decision trees, imputation

4 0.27948689 79 jmlr-2007-Structure and Majority Classes in Decision Tree Learning

Author: Ray J. Hickey

Abstract: To provide good classification accuracy on unseen examples, a decision tree, learned by an algorithm such as ID3, must have sufficient structure and also identify the correct majority class in each of its leaves. If there are inadequacies in respect of either of these, the tree will have a percentage classification rate below that of the maximum possible for the domain, namely (100 Bayes error rate). An error decomposition is introduced which enables the relative contributions of deficiencies in structure and in incorrect determination of majority class to be isolated and quantified. A sub-decomposition of majority class error permits separation of the sampling error at the leaves from the possible bias introduced by the attribute selection method of the induction algorithm. It is shown that sampling error can extend to 25% when there are more than two classes. Decompositions are obtained from experiments on several data sets. For ID3, the effect of selection bias is shown to vary from being statistically non-significant to being quite substantial, with the latter appearing to be associated with a simple underlying model. Keywords: decision tree learning, error decomposition, majority classes, sampling error, attribute selection bias 1

5 0.23744699 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection

Author: Wei Pan, Xiaotong Shen

Abstract: Variable selection in clustering analysis is both challenging and important. In the context of modelbased clustering analysis with a common diagonal covariance matrix, which is especially suitable for “high dimension, low sample size” settings, we propose a penalized likelihood approach with an L1 penalty function, automatically realizing variable selection via thresholding and delivering a sparse solution. We derive an EM algorithm to fit our proposed model, and propose a modified BIC as a model selection criterion to choose the number of components and the penalization parameter. A simulation study and an application to gene function prediction with gene expression profiles demonstrate the utility of our method. Keywords: BIC, EM, mixture model, penalized likelihood, soft-thresholding, shrinkage

6 0.21710823 29 jmlr-2007-Dynamic Weighted Majority: An Ensemble Method for Drifting Concepts

7 0.21408425 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation

8 0.20896092 58 jmlr-2007-Noise Tolerant Variants of the Perceptron Algorithm

9 0.20406416 72 jmlr-2007-Relational Dependency Networks

10 0.19685392 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study

11 0.19409065 88 jmlr-2007-Unlabeled Compression Schemes for Maximum Classes

12 0.1927209 28 jmlr-2007-Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data

13 0.18392362 59 jmlr-2007-Nonlinear Boosting Projections for Ensemble Construction

14 0.18061927 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time

15 0.17906708 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data

16 0.1774635 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning

17 0.17280009 47 jmlr-2007-Learning Horn Expressions with LOGAN-H

18 0.16907172 86 jmlr-2007-Truncating the Loop Series Expansion for Belief Propagation

19 0.16764349 69 jmlr-2007-Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes

20 0.16710879 40 jmlr-2007-Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization