jmlr jmlr2007 jmlr2007-79 knowledge-graph by maker-knowledge-mining

79 jmlr-2007-Structure and Majority Classes in Decision Tree Learning


Source: pdf

Author: Ray J. Hickey

Abstract: To provide good classification accuracy on unseen examples, a decision tree, learned by an algorithm such as ID3, must have sufficient structure and also identify the correct majority class in each of its leaves. If there are inadequacies in respect of either of these, the tree will have a percentage classification rate below that of the maximum possible for the domain, namely (100 − Bayes error rate). An error decomposition is introduced which enables the relative contributions of deficiencies in structure and in incorrect determination of majority class to be isolated and quantified. A sub-decomposition of majority class error permits separation of the sampling error at the leaves from the possible bias introduced by the attribute selection method of the induction algorithm. It is shown that sampling error can extend to 25% when there are more than two classes. Decompositions are obtained from experiments on several data sets. For ID3, the effect of selection bias is shown to vary from being statistically non-significant to being quite substantial, with the latter appearing to be associated with a simple underlying model. Keywords: decision tree learning, error decomposition, majority classes, sampling error, attribute selection bias

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 To provide good classification accuracy on unseen examples, a decision tree, learned by an algorithm such as ID3, must have sufficient structure and also identify the correct majority class in each of its leaves. [sent-7, score-0.567]

2 If there are inadequacies in respect of either of these, the tree will have a percentage classification rate below that of the maximum possible for the domain, namely (100 − Bayes error rate). [sent-8, score-0.587]

3 A sub-decomposition of majority class error permits separation of the sampling error at the leaves from the possible bias introduced by the attribute selection method of the induction algorithm. [sent-10, score-0.669]

4 A rule is extracted from the tree by associating a path from the root to a leaf (the rule condition) with the majority class at the leaf (the rule conclusion). [sent-15, score-0.94]
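
The rule-extraction step just described can be made concrete with a short sketch. This is illustrative only and assumes a hypothetical nested-dictionary tree representation, not the paper's own data structures: an internal node holds an attribute and value-indexed children, and a leaf holds its majority class.

```python
# Illustrative sketch (assumed representation, not the paper's code):
# internal node = {"attribute": name, "children": {value: subtree}},
# leaf          = {"majority_class": label}.

def extract_rules(node, condition=()):
    """Return (condition, majority_class) pairs, one rule per leaf."""
    if "majority_class" in node:          # leaf: the path so far is the rule condition
        return [(condition, node["majority_class"])]
    rules = []
    for value, child in node["children"].items():
        rules.extend(extract_rules(child, condition + ((node["attribute"], value),)))
    return rules

# Toy usage:
tree = {"attribute": "outlook",
        "children": {"sunny": {"majority_class": "no"},
                     "rain": {"attribute": "windy",
                              "children": {"true": {"majority_class": "no"},
                                           "false": {"majority_class": "yes"}}}}}
for cond, cls in extract_rules(tree):
    print(" AND ".join(f"{a}={v}" for a, v in cond), "->", cls)
```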

5 For good generalization accuracy, the induced tree must have sufficient structure, that is, depth, to fully extract the conditions of each rule and, in addition, must identify the correct majority class in each leaf. [sent-20, score-0.537]

6 Yet, as is well-known, a major weakness of decision tree induction lies in its progressive sub-division of the training set as the tree develops (divide and conquer). [sent-21, score-0.543]

7 This causes the two requirements to work against each other: deepening the tree to create the necessary structure reduces the sample sizes in the leaves upon which inferences about majority classes are based. [sent-24, score-0.594]

8 There has been comparatively little investigation into whether the class designated as the majority using the leaf sample distribution will be the true majority class. [sent-27, score-0.688]

9 The classification rate of an induced tree on unseen examples is limited by the Bayes rate, BCR = (100 - Bayes error rate), which is the probability (expressed as a percentage) that a correct classification would be obtained if the underlying rules in the domain were used as the classifier. [sent-31, score-0.895]

10 If the classification rate of an induced tree is CR, then the shortfall in accuracy compared to the maximum that can be achieved is BCR - CR. [sent-34, score-0.55]

11 Thus here error is relative to the best performance possible, which differs from the usual practice that defines classification error as complementary to classification rate, that is, 100 - CR. [sent-36, score-0.578]

12 The intention is to assign blame for inductive error partially to inadequacies in tree structure and partially to inadequacies in majority class identification. [sent-37, score-0.635]

13 The component for majority class determination will then be further broken down to allow the sampling behaviour at the leaves and the bias introduced by the induction algorithm’s attribute selection competition to be isolated and quantified. [sent-40, score-0.664]

14 In Section 2, the notion of a classification model and its decision tree representation are discussed. [sent-50, score-0.517]

15 Such a distribution is the theoretical analogue of the class frequency distribution in a leaf of a tree induced from training examples. [sent-76, score-0.536]

16 Using the tree as a classifier, where the assigned class is the majority class in the appropriate leaf, will achieve the Bayes classification rate. [sent-77, score-0.727]

17 With regard to the representation of a class model, a decision tree is said to be a core tree if un-expansion of any set of sibling leaves would result in a reduction in expected information. (Footnote 1: as a property of an attribute, the notion of pure noise as used by Breiman et al.) [sent-84, score-0.777]

18 If, in addition, there is no expansion of the leaves of the core to any depth that would increase the expected information then the tree is said to be a complete core; else it is incomplete. [sent-89, score-0.479]

19 A deflated complete core offers an economical tree representation of a class model: it has no wasteful expansion on pure noise attributes either internal to the core or beyond its edge. [sent-98, score-0.669]

20 Deterministic Classifiers and the Reduced Core: replacing each class distribution in the leaf of a tree with a majority class for that distribution will produce a deterministic classification tree. [sent-100, score-0.966]

21 A reduced core deterministic classification tree which achieves the Bayes rate is called complete. [sent-108, score-0.617]

22 3 A Decomposition of Inductive Error Insufficient tree structure and inaccurate majority class identification both contribute to inductive error in trees. [sent-110, score-0.589]

23 The correct majority classes for any tree can be determined from the class model. [sent-114, score-0.494]

24 Altering an induced tree to label each leaf with the true majority class, as distinct from the leaf sample estimate of this, produces the corrected majority class version of the tree, T(maj). [sent-115, score-1.209]

25 The classification rate of this tree is called the corrected majority class classification rate. [sent-116, score-1.002]
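
A minimal sketch of how the corrected tree T(maj) could be produced, reusing the hypothetical tree representation from the rule-extraction sketch above and assuming the class model is available as a callable that returns the true class distribution for the condition leading to a leaf; this illustrates the definition and is not the paper's implementation.

```python
# Sketch: relabel each leaf with its true majority class taken from the class model.
# model_distribution(condition) is assumed to return {class_label: probability}.

def corrected_majority_tree(node, model_distribution, condition=()):
    if "majority_class" in node:          # leaf: replace the sample label with the true label
        dist = model_distribution(condition)
        return {"majority_class": max(dist, key=dist.get)}
    attr = node["attribute"]
    return {"attribute": attr,
            "children": {value: corrected_majority_tree(child, model_distribution,
                                                        condition + ((attr, value),))
                         for value, child in node["children"].items()}}
```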

26 Correcting majority classes as indicated above removes majority class determination as a source of inductive error. [sent-121, score-0.527]

27 This is called majority class error so that majority class error = CR(T(maj)) – CR . [sent-125, score-0.518]

28 From the definition of a core, it is easy to see that the corrected core and the corrected full tree must have the same classification rate, that is: CR(T(maj)) = CR (Tcore(maj)) . [sent-130, score-0.665]

29 The completeness of a core can be expressed in terms of structure error: the reduced core of a tree is complete if and only if structure error is zero. [sent-133, score-0.559]

30 In theory, though, the effect of this could be to improve majority class estimation: the intelligence in the selection procedure might increase the chance that the majority class in the leaf is the correct one. [sent-137, score-0.72]

31 Ideally, the sample arriving at a leaf should be a random sample from the probability distribution at the leaf as derived from the class model. [sent-139, score-0.542]

32 It must be non-negative since failure to determine one or more leaf majority classes correctly can only reduce the classification rate. [sent-148, score-0.711]

33 That is: majority class error = sampling error + selection bias error. (2) [sent-150, score-0.509]

34 Taken together, the two decompositions in Equations 1 and 2 yield an overall decomposition of inductive error into three components as inductive error = structure error + sampling error + selection bias error. [sent-151, score-0.545]
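
The arithmetic of the three-way decomposition can be sketched as follows, assuming the four classification rates have already been estimated; the variable names are illustrative and the formulas are reconstructed from the definitions quoted above, so this is a reading of Equations 1 and 2 rather than the paper's code.

```python
# Sketch of the decomposition in Equations 1 and 2 (all rates as percentages):
#   bcr       - Bayes classification rate of the domain
#   cr        - classification rate of the induced tree T
#   cr_maj    - rate of T(maj), the tree with true majority classes at its leaves
#   cr_resamp - expected rate of T with each leaf relabelled from a fresh random sample
def decompose(bcr, cr, cr_maj, cr_resamp):
    structure_error = bcr - cr_maj            # blame on insufficient structure
    majority_class_error = cr_maj - cr        # blame on wrong leaf labels
    sampling_error = cr_maj - cr_resamp       # part due to small leaf samples
    selection_bias_error = cr_resamp - cr     # remainder: attribute-selection bias (may be negative)
    inductive_error = bcr - cr
    assert abs(inductive_error -
               (structure_error + sampling_error + selection_bias_error)) < 1e-9
    return {"structure": structure_error, "sampling": sampling_error,
            "selection_bias": selection_bias_error}

print(decompose(bcr=90.0, cr=78.0, cr_maj=85.0, cr_resamp=80.0))
# {'structure': 5.0, 'sampling': 5.0, 'selection_bias': 2.0}
```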

35 For a random sample of size n from a leaf class probability distribution ( p1 , . . . , pk ), the probability, Pcorr , that selection of the majority class will be correct can be calculated from the multinomial distribution as: Pcorr = P(sample majority class = classmaj ), where classmaj is a majority class. [sent-173, score-0.502]

36 Properties of Pcorr: henceforth, pk will be denoted pmaj. [sent-175, score-0.624]

37 Pcorr ≥ pmaj and increases with n (unless pmaj = 1/k in which case, Pcorr = 1/k for all n). [sent-176, score-1.19]

38 Intuitively, for a given n, Pcorr should be greater in situations where pk−1 is much smaller than pmaj, since the majority class has less competition and, conversely, should be small when all the pi are fairly equal. [sent-179, score-0.874]

39 Based on the work of Kesten and Morse (1959), Marshall and Olkin (1979) used majorisation theory to establish that, for fixed and unique pmaj, Pcorr is Schur-concave in the residual probabilities ( p1 , . . . , pk−1 ). [sent-180, score-0.649]

40 For k = 2, pmaj determines the complete distribution (1 − pmaj , pmaj ) . [sent-185, score-1.785]

41 For k > 2 and fixed pmaj , the greatest equalization occurs when all residual probabilities are identical, that is, each is (1 − pmaj ) /(k − 1) . [sent-186, score-1.244]

42 This will be referred to as the equal residue distribution and will be denoted De ( pmaj , k ) . [sent-187, score-0.63]

43 Thus, Pcorr is maximised amongst all distributions on k classes with given pmaj by De ( pmaj , k ) . [sent-188, score-1.242]

44 At the other end of the scale, assuming pmaj > 1/2, concentration of the residue at a single class produces the minimum Pcorr for that n and pmaj. [sent-189, score-0.679]

45 Note that this latter situation is identical to that of a two-class problem with distribution (1 − pmaj , pmaj ) . [sent-190, score-1.19]
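
A small sketch of how Pcorr could be computed exactly for a given leaf class distribution by enumerating all multinomial samples of size n; sample ties are broken uniformly at random, which may differ from the paper's convention. The usage lines illustrate the two extremes just described: the equal residue distribution versus concentration of the residue on a single class.

```python
# Illustrative exact computation of Pcorr for small n and k.
from itertools import product
from math import factorial, prod

def pcorr_exact(p, n):
    """Probability that the sample majority class in a leaf of size n is the true majority class."""
    k = len(p)
    true_maj = max(range(k), key=lambda i: p[i])
    total = 0.0
    for counts in product(range(n + 1), repeat=k):          # all count vectors of length k
        if sum(counts) != n:
            continue
        prob = (factorial(n) / prod(factorial(c) for c in counts)
                * prod(pi ** c for pi, c in zip(p, counts)))
        winners = [i for i, c in enumerate(counts) if c == max(counts)]
        if true_maj in winners:
            total += prob / len(winners)                     # sample ties broken at random
    return total

def equal_residue(pmaj, k):
    """The equal residue distribution De(pmaj, k)."""
    return [pmaj] + [(1 - pmaj) / (k - 1)] * (k - 1)

print(round(pcorr_exact(equal_residue(0.6, 3), n=6), 3))     # equal residue: larger Pcorr
print(round(pcorr_exact([0.6, 0.4, 0.0], n=6), 3))           # residue concentrated: smaller Pcorr
```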

46 Kesten and Morse (1959) showed that, under the constraint pmaj ≥ a * pk−1 , a > 1, Pcorr is minimized by the distribution ( 1/(a + k − 1) , . . . , 1/(a + k − 1) , a/(a + k − 1) ). [sent-195, score-0.61]

47 An alternative proof was provided by Marshall and Olkin (1979) using the Schur-concavity property of Pcorr for fixed pmaj discussed above. [sent-200, score-0.629]

48 For De ( pmaj , k ) , Pcorr increases with k for fixed pmaj and increases with pmaj for fixed k. [sent-201, score-1.853]

49 The first of these results follows from the Schur-concavity of Pcorr in the residual probabilities for fixed pmaj because for k < k′, De ( pmaj , k ) can be viewed as a distribution on k′ classes. [sent-202, score-1.244]

50 The second follows immediately from the Kesten and Morse theorem stated above because when pmaj is increased it will still satisfy the constraint pmaj ≥ apk −1 , a > 1 for the value of a applicable before the increase. [sent-203, score-1.205]

51 The value of pmaj has considerable impact on the value of Pcorr for given n. [sent-204, score-0.595]

52 6 produces Pcorr of approximately 73% whereas for pmaj = 0. [sent-206, score-0.595]

53 Also, when k > 2, there may be tied majority classes in the leaf class probability distribution. [sent-219, score-0.486]

54 Thus it is possible for pmaj to be very small and yet Pcorr be large. [sent-221, score-0.595]

55 Frank (2000) also considered the problem for k = 2 and graphs 1- Pcorr against pmaj . [sent-224, score-0.595]

56 Leaf Classification Rate and Leaf Sampling Error: the sampling error of an induced tree is contributed to by the individual classifications taking place at each leaf of the tree. [sent-226, score-0.623]

57 Inability to determine the correct majority class at a leaf impairs the classification rate locally at the leaf. [sent-227, score-0.727]

58 The best rate that can be obtained at a leaf, that is, its local Bayes rate, is pmaj from its class probability distribution. [sent-228, score-0.643]

59 The expected actual rate from a random sample, that is, the expectation of the probability of the selected class, will be called the (expected) leaf classification rate (LCR). [sent-229, score-0.54]

60 Thus, expressed as a percentage, LCR = 100 * ( pmaj * Pcorr + Eres(P(class maj)) * (1 − Pcorr) ). (4) [sent-232, score-0.809]

61 The Schur-concavity of Pcorr for given pmaj does not extend to LCR. [sent-233, score-0.595]

62 In Equation 4, for given pmaj, Pcorr will increase as the residue probabilities are equalized; however, this may be offset by the decrease in Eres(P(class maj)) as the larger residue probabilities decrease. [sent-234, score-0.896]

63 For De ( pmaj , k ), Equation 4 becomes: LCR = 100 * ( pmaj * Pcorr + (1 − pmaj) * (1 − Pcorr) / (k − 1) ). (5) [sent-260, score-1.785]

64 Using results stated above for De ( pmaj , k ), it is straightforward to show that, for given n and k, LCR in Equation 5 increases with pmaj. [sent-261, score-1.19]

65 The shortfall 100 * pmaj − LCR is the expected loss in classification rate at a leaf due to the determination of majority class from a random sample and thus can be called the leaf sampling error (LSE). [sent-262, score-1.729]
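
Building on the Pcorr sketch above (pcorr_exact and equal_residue are assumed to be in scope), the following shows how LCR from Equation 5 and the corresponding LSE could be tabulated for De(pmaj, k); the printed values are illustrative and are not claimed to reproduce Table 1.

```python
# Sketch: leaf classification rate (Equation 5) and leaf sampling error under De(pmaj, k).
def lcr_lse(pmaj, k, n):
    pc = pcorr_exact(equal_residue(pmaj, k), n)       # helpers from the earlier sketch
    lcr = 100 * (pmaj * pc + (1 - pmaj) * (1 - pc) / (k - 1))
    return lcr, 100 * pmaj - lcr                      # LSE = 100*pmaj - LCR

for pmaj in (0.5, 0.6, 0.7, 0.8, 0.9):
    lcr, lse = lcr_lse(pmaj, k=2, n=4)
    print(f"pmaj={pmaj:.1f}  LCR={lcr:5.1f}  LSE={lse:4.1f}")
```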

66 In Table 1, LCR and LSE for De ( pmaj , k ) are shown for n = 1, 2, 4 and 10 for several values of k and a range of values of pmaj . [sent-265, score-1.19]

67 For given n and k it is seen that, although LCR increases with pmaj , as noted above, LSE increases and decreases again. [sent-266, score-0.595]

68 When pmaj is large, the majority class is very likely to be correctly determined and hence sampling error is low. [sent-267, score-0.917]

69 When pmaj is low, then the consequence of wrongful determination of majority class, although more likely, is cushioned by the complementary probability being only slightly less than pmaj, and so loss of classification rate is again minimal. [sent-268, score-1.697]

70 As k and n increase, the maximum value of LSE tends to occur at smaller pmaj . [sent-269, score-0.595]

71 [Fragment of Table 1 header: columns for 100*pmaj from 10 to 90, rows for k, in blocks for n = 1, 2, 4 and 10.] [sent-270, score-0.595]

72 Table 1: Leaf classification rate (LCR) and leaf sampling error (LSE) for leaf sample size n = 1, 2, 4 and 10 for various k and pmaj under De ( pmaj , k ). [sent-516, score-2.07]

73 For k = 2, LSE reaches a maximum of approximately 12% when n = 1, 2 and pmaj lies between 0. [sent-518, score-0.595]

74 For n = 1, 2, Pcorr = pmaj and, from Equation 5, LCR for De ( pmaj , k ) can be expressed as LCR = 100 * ( pmaj^2 + (1 − pmaj)^2 / (k − 1) ), which decreases with k to 100 * pmaj^2. [sent-522, score-2.975]

75 Thus, as k increases, LSE for De ( pmaj , k ) tends to 100 * pmaj − 100 * pmaj^2 = 100 * pmaj * (1 − pmaj), which has a maximum value of 25% at pmaj = 0.5. [sent-523, score-2.975]
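
For the n = 1, 2 case just described, the limiting argument can be written out explicitly (this only restates the step above):

```latex
\[
\lim_{k\to\infty}\mathrm{LSE}
  \;=\; 100\,p_{maj} - 100\,p_{maj}^{2}
  \;=\; 100\,p_{maj}\,(1 - p_{maj}),
\qquad
\frac{d}{dp_{maj}}\bigl[p_{maj}(1 - p_{maj})\bigr] = 1 - 2p_{maj} = 0
\;\Longrightarrow\; p_{maj} = \tfrac{1}{2},
\]
giving a maximum of \(100 \times \tfrac{1}{4} = 25\%\).
```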

76 For n > 2 and given pmaj , LCR for De ( pmaj , k ) can increase or decrease with k. [sent-526, score-1.207]

77 This is because LCR, in Equation 5, is a convex combination of pmaj and (1 − pmaj) / (k − 1), weighted by Pcorr and 1 − Pcorr respectively. [sent-527, score-1.19]

78 As k increases, Pcorr increases and (1 − pmaj) / (k − 1), which by definition is less than pmaj, decreases, but its weight, 1 − Pcorr, is also decreasing. [sent-528, score-1.222]

79 Since the sampling error of the tree is the average of its leaf sampling errors, the behaviour of leaf sampling error should be reflected in the overall sampling error. [sent-529, score-1.09]

80 As training set size increases, induced trees will tend to have more structure and so the leaf class probability distributions will be more informative with the result that individual pmaj in the leaf distributions will tend to increase. [sent-530, score-1.228]

81 Thus the pattern of increase and decrease in sampling error with increases in pmaj noted above for LSE should be observable in the sampling error for the whole tree. [sent-531, score-0.826]

82 Sampling error was estimated in a tree by replacing each leaf with a freshly drawn random sample of the same size and obtaining the classification rate of the resulting tree. [sent-548, score-0.825]

83 This produces an unbiased estimate of sampling error over replications and is more efficient than calculating the exact sampling error from the leaf class probability distribution, particularly when the sample reaching a leaf is large. [sent-549, score-0.77]
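
A sketch of the resampling estimate described in the two sentences above, under the simplifying assumption that each leaf's true class distribution and the probability of an example reaching it are known (as they are for the paper's artificial domains); here the classification rate of the relabelled tree is computed directly from those distributions rather than on a test set, and the leaf representation is illustrative.

```python
# Illustrative sketch: estimate sampling error by relabelling each leaf from a
# freshly drawn random sample of its original size, averaged over replications.
import random

def resampled_cr(leaves, rng):
    """leaves: list of (p, n, weight) with p = true leaf class distribution,
    n = leaf sample size, weight = probability of an example reaching the leaf."""
    cr = 0.0
    for p, n, weight in leaves:
        sample = rng.choices(range(len(p)), weights=p, k=n)
        counts = [sample.count(c) for c in range(len(p))]
        label = rng.choice([c for c, m in enumerate(counts) if m == max(counts)])
        cr += weight * 100 * p[label]            # rate achieved locally with this label
    return cr

def sampling_error(leaves, replications=2000, seed=0):
    rng = random.Random(seed)
    cr_maj = sum(weight * 100 * max(p) for p, n, weight in leaves)   # CR of T(maj)
    avg = sum(resampled_cr(leaves, rng) for _ in range(replications)) / replications
    return cr_maj - avg

leaves = [([0.7, 0.3], 5, 0.5), ([0.4, 0.35, 0.25], 3, 0.5)]
print(round(sampling_error(leaves), 2))
```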

84 For most of the learning curve, it is dominated by majority class error and the sub-decomposition shows that this is due mostly to sampling error with selection bias being either negative or approximately zero. [sent-552, score-0.465]

85 The attribute selection competition here is aiding the determination of majority class: the sample reaching a leaf is better able to determine majority class than is an independent random sample. [sent-555, score-0.918]

86 Table 2 (b) shows that, with the addition of 17 pure noise attributes, the full trees are now much larger and that total error, structure error and majority class error are considerably larger than for the seven-attribute domain. [sent-556, score-0.552]

87 As tree depth increases, there are fewer attributes available for selection, yet selection bias continues to increase along the learning curve suggesting that it accumulates with depth, that is, a leaf inherits a selection bias from its parent and adds to it. [sent-564, score-0.872]

88 Depth = average depth of the tree), classification rate (CR), inductive error (Err) and error decompositions for tree inductions on examples generated from the LED domain. [sent-844, score-0.779]

89 To create a model, the number of informative attributes, pure noise attributes and classes are specified; the number of values for an attribute is specified as either a range across attributes or as the same fixed value for all attributes. [sent-851, score-0.534]

90 A classification rate (CR) which is less than the default classification rate (DCR) for the model is shown in italics in Tables 4 and 5. [sent-896, score-0.566]

91 Where the ID3 tree has significant selection bias error, this will be almost eliminated in the random tree but this benefit, which may be accompanied anyway by larger sampling error due to smaller leaf samples, is usually not sufficient to compensate for the increased structure error. [sent-1596, score-1.085]

92 The rule set derived from this tree was applied to the test set producing a classification rate of 73. [sent-1650, score-0.52]

93 To determine structure and majority class error, an estimate of the true majority class at a leaf in an induced tree needs to be obtained. [sent-1657, score-0.983]

94 For example, if a tree induced from 10000 examples has a leaf containing two examples then the expectation, from Table 7, is that there would be 2 * 436169/10000 = 87 matching examples in the model set. [sent-1661, score-0.516]

95 These estimates of majority class were also used to estimate the reduced core of each induced tree through same majority class pruning. [sent-1662, score-0.804]
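
Same majority class pruning can be sketched as a bottom-up pass over the hypothetical tree representation used in the earlier sketches: any set of sibling leaves sharing the same (estimated true) majority class is un-expanded. This is only an illustration of the idea.

```python
# Sketch: collapse any internal node whose children are all leaves with one shared majority class.
def prune_same_majority(node):
    if "majority_class" in node:
        return node
    pruned = {"attribute": node["attribute"],
              "children": {v: prune_same_majority(c) for v, c in node["children"].items()}}
    kids = list(pruned["children"].values())
    if all("majority_class" in c for c in kids):
        labels = {c["majority_class"] for c in kids}
        if len(labels) == 1:
            return {"majority_class": labels.pop()}   # un-expand the sibling leaves
    return pruned
```

Because children are pruned before their parent is examined, a single bottom-up pass reaches a fixed point.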

96 Conclusion and Future Work: the contribution of this paper has been the introduction of a method of decomposition of the classification error occurring in decision tree induction. [sent-1777, score-0.604]

97 Instead of comparing tree induction algorithms solely in terms of classification error, it is now possible to provide further insight into how this error arose, specifically whether it is due to a failure to grow sufficient tree structure or a failure to estimate the majority class at the leaves correctly. [sent-1779, score-1.104]

98 It has been shown that majority class error is often quite substantial and that it can be further broken down into sampling error and selection bias error with the extent of these sources being quantified. [sent-1780, score-0.509]

99 In ID3, selection bias error is due to the corruption in the sample reaching a leaf caused by the multiple comparison effect of the competition to select the best attribute with which to expand the tree. [sent-1785, score-0.623]

100 Although regarded here as a source of majority class error, selection bias error could, conceivably, be viewed as part of the error in forming tree structure. [sent-1788, score-0.649]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('pmaj', 0.595), ('pcorr', 0.298), ('tree', 0.247), ('classification', 0.245), ('leaf', 0.239), ('cr', 0.219), ('maj', 0.214), ('majority', 0.195), ('bcr', 0.145), ('lcr', 0.137), ('hickey', 0.129), ('err', 0.119), ('attributes', 0.108), ('attribute', 0.103), ('atts', 0.099), ('lse', 0.099), ('core', 0.097), ('dcr', 0.092), ('artificial', 0.069), ('bias', 0.065), ('sampling', 0.063), ('decomp', 0.061), ('leaves', 0.061), ('autouniv', 0.053), ('inductions', 0.053), ('pure', 0.047), ('defined', 0.046), ('kesten', 0.046), ('samp', 0.046), ('significant', 0.046), ('inductive', 0.046), ('corruption', 0.045), ('specified', 0.045), ('sufficient', 0.045), ('error', 0.044), ('trees', 0.044), ('decomposition', 0.043), ('determination', 0.039), ('morse', 0.038), ('depth', 0.037), ('structure', 0.037), ('rules', 0.037), ('five', 0.037), ('reaching', 0.036), ('decompositions', 0.035), ('competition', 0.035), ('residue', 0.035), ('fixed', 0.034), ('selection', 0.034), ('noise', 0.033), ('led', 0.033), ('classes', 0.032), ('definition', 0.032), ('bechofer', 0.031), ('eres', 0.031), ('sel', 0.031), ('induced', 0.03), ('pk', 0.029), ('seven', 0.029), ('rate', 0.028), ('ran', 0.026), ('struct', 0.026), ('curve', 0.026), ('behaviour', 0.025), ('forest', 0.025), ('decision', 0.025), ('induction', 0.024), ('informative', 0.024), ('virtually', 0.023), ('classifier', 0.023), ('inadequacies', 0.023), ('inflated', 0.023), ('marshall', 0.023), ('overfitting', 0.023), ('tcore', 0.023), ('sample', 0.022), ('corrected', 0.022), ('lift', 0.021), ('class', 0.02), ('default', 0.02), ('expansion', 0.02), ('amongst', 0.02), ('bayes', 0.02), ('residual', 0.02), ('oates', 0.019), ('grown', 0.019), ('domain', 0.019), ('ray', 0.018), ('eliminated', 0.018), ('sub', 0.018), ('table', 0.018), ('comparatively', 0.017), ('increase', 0.017), ('generator', 0.016), ('offered', 0.016), ('permits', 0.016), ('apk', 0.015), ('avoidance', 0.015), ('benefit', 0.015), ('classmaj', 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000006 79 jmlr-2007-Structure and Majority Classes in Decision Tree Learning

Author: Ray J. Hickey

Abstract: To provide good classification accuracy on unseen examples, a decision tree, learned by an algorithm such as ID3, must have sufficient structure and also identify the correct majority class in each of its leaves. If there are inadequacies in respect of either of these, the tree will have a percentage classification rate below that of the maximum possible for the domain, namely (100 − Bayes error rate). An error decomposition is introduced which enables the relative contributions of deficiencies in structure and in incorrect determination of majority class to be isolated and quantified. A sub-decomposition of majority class error permits separation of the sampling error at the leaves from the possible bias introduced by the attribute selection method of the induction algorithm. It is shown that sampling error can extend to 25% when there are more than two classes. Decompositions are obtained from experiments on several data sets. For ID3, the effect of selection bias is shown to vary from being statistically non-significant to being quite substantial, with the latter appearing to be associated with a simple underlying model. Keywords: decision tree learning, error decomposition, majority classes, sampling error, attribute selection bias

2 0.14561334 48 jmlr-2007-Learning in Environments with Unknown Dynamics: Towards more Robust Concept Learners

Author: Marlon Núñez, Raúl Fidalgo, Rafael Morales

Abstract: In the process of concept learning, target concepts may have portions with short-term changes, other portions may support long-term changes, and yet others may not change at all. For this reason several local windows need to be handled. We suggest facing this problem, which naturally exists in the field of concept learning, by allocating windows which can adapt their size to portions of the target concept. We propose an incremental decision tree that is updated with incoming examples. Each leaf of the decision tree holds a time window and a local performance measure as the main parameter to be controlled. When the performance of a leaf decreases, the size of its local window is reduced. This learning algorithm, called OnlineTree2, automatically adjusts its internal parameters in order to face the current dynamics of the data stream. Results show that it is comparable to other batch algorithms when facing problems with no concept change, and it is better than evaluated methods in its ability to deal with concept drift when dealing with problems in which: concept change occurs at different speeds, noise may be present and, examples may arrive from different areas of the problem domain (virtual drift). Keywords: incremental algorithms, online learning, concept drift, decision trees, robust learners

3 0.12213793 11 jmlr-2007-Anytime Learning of Decision Trees

Author: Saher Esmeir, Shaul Markovitch

Abstract: The majority of existing algorithms for learning decision trees are greedy—a tree is induced topdown, making locally optimal decisions at each node. In most cases, however, the constructed tree is not globally optimal. Even the few non-greedy learners cannot learn good trees when the concept is difficult. Furthermore, they require a fixed amount of time and are not able to generate a better tree if additional time is available. We introduce a framework for anytime induction of decision trees that overcomes these problems by trading computation speed for better tree quality. Our proposed family of algorithms employs a novel strategy for evaluating candidate splits. A biased sampling of the space of consistent trees rooted at an attribute is used to estimate the size of the minimal tree under that attribute, and an attribute with the smallest expected tree is selected. We present two types of anytime induction algorithms: a contract algorithm that determines the sample size on the basis of a pre-given allocation of time, and an interruptible algorithm that starts with a greedy tree and continuously improves subtrees by additional sampling. Experimental results indicate that, for several hard concepts, our proposed approach exhibits good anytime behavior and yields significantly better decision trees when more time is available. Keywords: anytime algorithms, decision tree induction, lookahead, hard concepts, resource-bounded reasoning

4 0.096212409 52 jmlr-2007-Margin Trees for High-dimensional Classification

Author: Robert Tibshirani, Trevor Hastie

Abstract: We propose a method for the classification of more than two classes, from high-dimensional features. Our approach is to build a binary decision tree in a top-down manner, using the optimal margin classifier at each split. We implement an exact greedy algorithm for this task, and compare its performance to less greedy procedures based on clustering of the matrix of pairwise margins. We compare the performance of the “margin tree” to the closely related “all-pairs” (one versus one) support vector machine, and nearest centroids on a number of cancer microarray data sets. We also develop a simple method for feature selection. We find that the margin tree has accuracy that is competitive with other methods and offers additional interpretability in its putative grouping of the classes. Keywords: maximum margin classifier, support vector machine, decision tree, CART

5 0.043355145 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection

Author: Wei Pan, Xiaotong Shen

Abstract: Variable selection in clustering analysis is both challenging and important. In the context of modelbased clustering analysis with a common diagonal covariance matrix, which is especially suitable for “high dimension, low sample size” settings, we propose a penalized likelihood approach with an L1 penalty function, automatically realizing variable selection via thresholding and delivering a sparse solution. We derive an EM algorithm to fit our proposed model, and propose a modified BIC as a model selection criterion to choose the number of components and the penalization parameter. A simulation study and an application to gene function prediction with gene expression profiles demonstrate the utility of our method. Keywords: BIC, EM, mixture model, penalized likelihood, soft-thresholding, shrinkage

6 0.036782585 29 jmlr-2007-Dynamic Weighted Majority: An Ensemble Method for Drifting Concepts

7 0.035353128 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning

8 0.029649271 22 jmlr-2007-Compression-Based Averaging of Selective Naive Bayes Classifiers     (Special Topic on Model Selection)

9 0.029466657 72 jmlr-2007-Relational Dependency Networks

10 0.027060971 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method

11 0.025374979 39 jmlr-2007-Handling Missing Values when Applying Classification Models

12 0.02444284 40 jmlr-2007-Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization

13 0.023677833 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study

14 0.023405062 16 jmlr-2007-Boosted Classification Trees and Class Probability Quantile Estimation

15 0.020896854 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR

16 0.02072883 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition

17 0.018690359 59 jmlr-2007-Nonlinear Boosting Projections for Ensemble Construction

18 0.018545048 47 jmlr-2007-Learning Horn Expressions with LOGAN-H

19 0.017934142 54 jmlr-2007-Measuring Differentiability: Unmasking Pseudonymous Authors

20 0.017898766 12 jmlr-2007-Attribute-Efficient and Non-adaptive Learning of Parities and DNF Expressions     (Special Topic on the Conference on Learning Theory 2005)


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.124), (1, 0.209), (2, -0.042), (3, 0.114), (4, 0.163), (5, 0.15), (6, 0.173), (7, 0.144), (8, -0.184), (9, 0.222), (10, 0.073), (11, 0.125), (12, -0.05), (13, -0.11), (14, -0.027), (15, 0.082), (16, 0.03), (17, 0.002), (18, -0.015), (19, -0.064), (20, 0.026), (21, -0.038), (22, 0.102), (23, -0.066), (24, 0.087), (25, -0.154), (26, 0.13), (27, -0.011), (28, -0.147), (29, -0.138), (30, 0.123), (31, -0.043), (32, -0.065), (33, -0.026), (34, -0.052), (35, 0.043), (36, -0.005), (37, 0.08), (38, 0.052), (39, -0.008), (40, 0.014), (41, -0.014), (42, -0.133), (43, -0.016), (44, 0.012), (45, 0.053), (46, -0.043), (47, 0.034), (48, -0.073), (49, -0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96943557 79 jmlr-2007-Structure and Majority Classes in Decision Tree Learning

Author: Ray J. Hickey

Abstract: To provide good classification accuracy on unseen examples, a decision tree, learned by an algorithm such as ID3, must have sufficient structure and also identify the correct majority class in each of its leaves. If there are inadequacies in respect of either of these, the tree will have a percentage classification rate below that of the maximum possible for the domain, namely (100 Bayes error rate). An error decomposition is introduced which enables the relative contributions of deficiencies in structure and in incorrect determination of majority class to be isolated and quantified. A sub-decomposition of majority class error permits separation of the sampling error at the leaves from the possible bias introduced by the attribute selection method of the induction algorithm. It is shown that sampling error can extend to 25% when there are more than two classes. Decompositions are obtained from experiments on several data sets. For ID3, the effect of selection bias is shown to vary from being statistically non-significant to being quite substantial, with the latter appearing to be associated with a simple underlying model. Keywords: decision tree learning, error decomposition, majority classes, sampling error, attribute selection bias 1

2 0.81932038 11 jmlr-2007-Anytime Learning of Decision Trees

Author: Saher Esmeir, Shaul Markovitch

Abstract: The majority of existing algorithms for learning decision trees are greedy—a tree is induced topdown, making locally optimal decisions at each node. In most cases, however, the constructed tree is not globally optimal. Even the few non-greedy learners cannot learn good trees when the concept is difficult. Furthermore, they require a fixed amount of time and are not able to generate a better tree if additional time is available. We introduce a framework for anytime induction of decision trees that overcomes these problems by trading computation speed for better tree quality. Our proposed family of algorithms employs a novel strategy for evaluating candidate splits. A biased sampling of the space of consistent trees rooted at an attribute is used to estimate the size of the minimal tree under that attribute, and an attribute with the smallest expected tree is selected. We present two types of anytime induction algorithms: a contract algorithm that determines the sample size on the basis of a pre-given allocation of time, and an interruptible algorithm that starts with a greedy tree and continuously improves subtrees by additional sampling. Experimental results indicate that, for several hard concepts, our proposed approach exhibits good anytime behavior and yields significantly better decision trees when more time is available. Keywords: anytime algorithms, decision tree induction, lookahead, hard concepts, resource-bounded reasoning

3 0.59263492 48 jmlr-2007-Learning in Environments with Unknown Dynamics: Towards more Robust Concept Learners

Author: Marlon Núñez, Raúl Fidalgo, Rafael Morales

Abstract: In the process of concept learning, target concepts may have portions with short-term changes, other portions may support long-term changes, and yet others may not change at all. For this reason several local windows need to be handled. We suggest facing this problem, which naturally exists in the field of concept learning, by allocating windows which can adapt their size to portions of the target concept. We propose an incremental decision tree that is updated with incoming examples. Each leaf of the decision tree holds a time window and a local performance measure as the main parameter to be controlled. When the performance of a leaf decreases, the size of its local window is reduced. This learning algorithm, called OnlineTree2, automatically adjusts its internal parameters in order to face the current dynamics of the data stream. Results show that it is comparable to other batch algorithms when facing problems with no concept change, and it is better than evaluated methods in its ability to deal with concept drift when dealing with problems in which: concept change occurs at different speeds, noise may be present and, examples may arrive from different areas of the problem domain (virtual drift). Keywords: incremental algorithms, online learning, concept drift, decision trees, robust learners

4 0.5328669 52 jmlr-2007-Margin Trees for High-dimensional Classification

Author: Robert Tibshirani, Trevor Hastie

Abstract: We propose a method for the classification of more than two classes, from high-dimensional features. Our approach is to build a binary decision tree in a top-down manner, using the optimal margin classifier at each split. We implement an exact greedy algorithm for this task, and compare its performance to less greedy procedures based on clustering of the matrix of pairwise margins. We compare the performance of the “margin tree” to the closely related “all-pairs” (one versus one) support vector machine, and nearest centroids on a number of cancer microarray data sets. We also develop a simple method for feature selection. We find that the margin tree has accuracy that is competitive with other methods and offers additional interpretability in its putative grouping of the classes. Keywords: maximum margin classifier, support vector machine, decision tree, CART

5 0.22483426 29 jmlr-2007-Dynamic Weighted Majority: An Ensemble Method for Drifting Concepts

Author: J. Zico Kolter, Marcus A. Maloof

Abstract: We present an ensemble method for concept drift that dynamically creates and removes weighted experts in response to changes in performance. The method, dynamic weighted majority (DWM), uses four mechanisms to cope with concept drift: It trains online learners of the ensemble, it weights those learners based on their performance, it removes them, also based on their performance, and it adds new experts based on the global performance of the ensemble. After an extensive evaluation— consisting of five experiments, eight learners, and thirty data sets that varied in type of target concept, size, presence of noise, and the like—we concluded that DWM outperformed other learners that only incrementally learn concept descriptions, that maintain and use previously encountered examples, and that employ an unweighted, fixed-size ensemble of experts. Keywords: concept learning, online learning, ensemble methods, concept drift

6 0.17123158 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning

7 0.15114465 16 jmlr-2007-Boosted Classification Trees and Class Probability Quantile Estimation

8 0.14226502 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models

9 0.13229172 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR

10 0.12398118 72 jmlr-2007-Relational Dependency Networks

11 0.12207167 70 jmlr-2007-Ranking the Best Instances

12 0.11984425 40 jmlr-2007-Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization

13 0.11682053 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition

14 0.11490974 1 jmlr-2007-"Ideal Parent" Structure Learning for Continuous Variable Bayesian Networks

15 0.11481111 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection

16 0.11172665 76 jmlr-2007-Spherical-Homoscedastic Distributions: The Equivalency of Spherical and Normal Distributions in Classification

17 0.10392649 23 jmlr-2007-Concave Learners for Rankboost

18 0.10276125 43 jmlr-2007-Integrating Naïve Bayes and FOIL

19 0.10274722 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method

20 0.10060713 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(8, 0.01), (10, 0.016), (12, 0.556), (15, 0.022), (28, 0.048), (40, 0.022), (45, 0.016), (48, 0.016), (60, 0.029), (80, 0.085), (85, 0.029), (98, 0.047)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.90841913 79 jmlr-2007-Structure and Majority Classes in Decision Tree Learning

Author: Ray J. Hickey

Abstract: To provide good classification accuracy on unseen examples, a decision tree, learned by an algorithm such as ID3, must have sufficient structure and also identify the correct majority class in each of its leaves. If there are inadequacies in respect of either of these, the tree will have a percentage classification rate below that of the maximum possible for the domain, namely (100 Bayes error rate). An error decomposition is introduced which enables the relative contributions of deficiencies in structure and in incorrect determination of majority class to be isolated and quantified. A sub-decomposition of majority class error permits separation of the sampling error at the leaves from the possible bias introduced by the attribute selection method of the induction algorithm. It is shown that sampling error can extend to 25% when there are more than two classes. Decompositions are obtained from experiments on several data sets. For ID3, the effect of selection bias is shown to vary from being statistically non-significant to being quite substantial, with the latter appearing to be associated with a simple underlying model. Keywords: decision tree learning, error decomposition, majority classes, sampling error, attribute selection bias 1

2 0.85471851 38 jmlr-2007-Graph Laplacians and their Convergence on Random Neighborhood Graphs     (Special Topic on the Conference on Learning Theory 2005)

Author: Matthias Hein, Jean-Yves Audibert, Ulrike von Luxburg

Abstract: Given a sample from a probability measure with support on a submanifold in Euclidean space one can construct a neighborhood graph which can be seen as an approximation of the submanifold. The graph Laplacian of such a graph is used in several machine learning methods like semi-supervised learning, dimensionality reduction and clustering. In this paper we determine the pointwise limit of three different graph Laplacians used in the literature as the sample size increases and the neighborhood size approaches zero. We show that for a uniform measure on the submanifold all graph Laplacians have the same limit up to constants. However in the case of a non-uniform measure on the submanifold only the so called random walk graph Laplacian converges to the weighted Laplace-Beltrami operator. Keywords: graphs, graph Laplacians, semi-supervised learning, spectral clustering, dimensionality reduction

3 0.81579697 30 jmlr-2007-Dynamics and Generalization Ability of LVQ Algorithms

Author: Michael Biehl, Anarta Ghosh, Barbara Hammer

Abstract: Learning vector quantization (LVQ) schemes constitute intuitive, powerful classification heuristics with numerous successful applications but, so far, limited theoretical background. We study LVQ rigorously within a simplifying model situation: two competing prototypes are trained from a sequence of examples drawn from a mixture of Gaussians. Concepts from statistical physics and the theory of on-line learning allow for an exact description of the training dynamics in highdimensional feature space. The analysis yields typical learning curves, convergence properties, and achievable generalization abilities. This is also possible for heuristic training schemes which do not relate to a cost function. We compare the performance of several algorithms, including Kohonen’s LVQ1 and LVQ+/-, a limiting case of LVQ2.1. The former shows close to optimal performance, while LVQ+/- displays divergent behavior. We investigate how early stopping can overcome this difficulty. Furthermore, we study a crisp version of robust soft LVQ, which was recently derived from a statistical formulation. Surprisingly, it exhibits relatively poor generalization. Performance improves if a window for the selection of data is introduced; the resulting algorithm corresponds to cost function based LVQ2. The dependence of these results on the model parameters, for example, prior class probabilities, is investigated systematically, simulations confirm our analytical findings. Keywords: prototype based classification, learning vector quantization, Winner-Takes-All algorithms, on-line learning, competitive learning

4 0.40689957 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning

Author: Rie Johnson, Tong Zhang

Abstract: This paper investigates the effect of Laplacian normalization in graph-based semi-supervised learning. To this end, we consider multi-class transductive learning on graphs with Laplacian regularization. Generalization bounds are derived using geometric properties of the graph. Specifically, by introducing a definition of graph cut from learning theory, we obtain generalization bounds that depend on the Laplacian regularizer. We then use this analysis to better understand the role of graph Laplacian matrix normalization. Under assumptions that the cut is small, we derive near-optimal normalization factors by approximately minimizing the generalization bounds. The analysis reveals the limitations of the standard degree-based normalization method in that the resulting normalization factors can vary significantly within each connected component with the same class label, which may cause inferior generalization performance. Our theory also suggests a remedy that does not suffer from this problem. Experiments confirm the superiority of the normalization scheme motivated by learning theory on artificial and real-world data sets. Keywords: transductive learning, graph learning, Laplacian regularization, normalization of graph Laplacian

5 0.3949675 88 jmlr-2007-Unlabeled Compression Schemes for Maximum Classes

Author: Dima Kuzmin, Manfred K. Warmuth

Abstract: n Maximum concept classes of VC dimension d over n domain points have size ≤d , and this is an upper bound on the size of any concept class of VC dimension d over n points. We give a compression scheme for any maximum class that represents each concept by a subset of up to d unlabeled domain points and has the property that for any sample of a concept in the class, the representative of exactly one of the concepts consistent with the sample is a subset of the domain of the sample. This allows us to compress any sample of a concept in the class to a subset of up to d unlabeled sample points such that this subset represents a concept consistent with the entire original sample. Unlike the previously known compression scheme for maximum classes (Floyd and Warmuth, 1995) which compresses to labeled subsets of the sample of size equal d, our new scheme is tight in the sense that the number of possible unlabeled compression sets of size at most d equals the number of concepts in the class. Keywords: compression schemes, VC dimension, maximum classes, one-inclusion graph

6 0.36597896 69 jmlr-2007-Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes

7 0.36520508 60 jmlr-2007-Nonlinear Estimators and Tail Bounds for Dimension Reduction inl1Using Cauchy Random Projections

8 0.32960257 74 jmlr-2007-Separating Models of Learning from Correlated and Uncorrelated Data     (Special Topic on the Conference on Learning Theory 2005)

9 0.31371742 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation

10 0.30645278 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method

11 0.30309123 51 jmlr-2007-Loop Corrections for Approximate Inference on Factor Graphs

12 0.2903451 31 jmlr-2007-Estimating High-Dimensional Directed Acyclic Graphs with the PC-Algorithm

13 0.29011208 13 jmlr-2007-Bayesian Quadratic Discriminant Analysis

14 0.28804699 11 jmlr-2007-Anytime Learning of Decision Trees

15 0.27907127 78 jmlr-2007-Statistical Consistency of Kernel Canonical Correlation Analysis

16 0.27824813 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models

17 0.27644682 86 jmlr-2007-Truncating the Loop Series Expansion for Belief Propagation

18 0.27236331 44 jmlr-2007-Large Margin Semi-supervised Learning

19 0.27229679 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time

20 0.26847064 72 jmlr-2007-Relational Dependency Networks