jmlr jmlr2007 jmlr2007-11 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Saher Esmeir, Shaul Markovitch
Abstract: The majority of existing algorithms for learning decision trees are greedy—a tree is induced top-down, making locally optimal decisions at each node. In most cases, however, the constructed tree is not globally optimal. Even the few non-greedy learners cannot learn good trees when the concept is difficult. Furthermore, they require a fixed amount of time and are not able to generate a better tree if additional time is available. We introduce a framework for anytime induction of decision trees that overcomes these problems by trading computation speed for better tree quality. Our proposed family of algorithms employs a novel strategy for evaluating candidate splits. A biased sampling of the space of consistent trees rooted at an attribute is used to estimate the size of the minimal tree under that attribute, and an attribute with the smallest expected tree is selected. We present two types of anytime induction algorithms: a contract algorithm that determines the sample size on the basis of a pre-given allocation of time, and an interruptible algorithm that starts with a greedy tree and continuously improves subtrees by additional sampling. Experimental results indicate that, for several hard concepts, our proposed approach exhibits good anytime behavior and yields significantly better decision trees when more time is available. Keywords: anytime algorithms, decision tree induction, lookahead, hard concepts, resource-bounded reasoning
Reference: text
sentIndex sentText sentNum sentScore
1 Department of Computer Science, Technion—Israel Institute of Technology, Haifa 32000, Israel. Editor: Claude Sammut. Abstract: The majority of existing algorithms for learning decision trees are greedy—a tree is induced top-down, making locally optimal decisions at each node. [sent-7, score-0.556]
2 We introduce a framework for anytime induction of decision trees that overcomes these problems by trading computation speed for better tree quality. [sent-11, score-1.056]
3 A biased sampling of the space of consistent trees rooted at an attribute is used to estimate the size of the minimal tree under that attribute, and an attribute with the smallest expected tree is selected. [sent-13, score-1.094]
4 We present two types of anytime induction algorithms: a contract algorithm that determines the sample size on the basis of a pre-given allocation of time, and an interruptible algorithm that starts with a greedy tree and continuously improves subtrees by additional sampling. [sent-14, score-1.211]
5 Experimental results indicate that, for several hard concepts, our proposed approach exhibits good anytime behavior and yields significantly better decision trees when more time is available. [sent-15, score-0.762]
6 Keywords: anytime algorithms, decision tree induction, lookahead, hard concepts, resource-bounded reasoning. [sent-16, score-0.789]
7 The majority of existing methods for decision tree induction build a tree top-down and use local measures in an attempt to produce small trees, which, by Occam’s Razor (Blumer et al., 1987), are expected to generalize better. [sent-39, score-0.682]
8 11) recognized the need for this type of anytime algorithm for decision tree learning: “What is wanted is a resource constrained algorithm that will do the best it can within a specified computational budget and can pick up threads and continue if this budget is increased.” [sent-65, score-0.872]
9 There are two main classes of anytime algorithms, namely contract and interruptible (Russell and Zilberstein, 1996). [sent-67, score-0.707]
10 Therefore, in our work, we are interested both in contract and interruptible decision tree learners. [sent-72, score-0.62]
11 When applied to decision tree induction, lookahead attempts to predict the profitability of a split at a node by estimating its effect on deeper descendants of the node. [sent-76, score-0.654]
12 The reported results vary from “lookahead produces better trees” (Norton, 1989; Ragavan and Rendell, 1993; Dong and Kothari, 2001) to “lookahead does not help and can hurt” (Murthy and Salzberg, 1995). [sent-82, score-0.559]
13 A consistent decision tree is a tree that correctly classifies all training examples. [sent-88, score-0.636]
14 These works, however, are not capable of handling high-dimensional, difficult concepts and are not designed to offer anytime behavior. [sent-96, score-0.459]
15 We first discuss tree size as a desired property of the tree to be learned and then we describe an anytime algorithm that uses sampling methods to obtain smaller trees. [sent-106, score-0.96]
16 Murphy and Pazzani (1994) reported a set of experiments in which all the possible consistent decision trees were produced and showed that, for several tasks, the smallest consistent decision tree had higher predictive error than slightly larger trees. [sent-132, score-0.665]
17 The small number of training examples relative to the size of the tree that perfectly describes the concept might explain why, in these cases, the smallest tree did not generalize best. [sent-135, score-0.532]
18 RTG builds a tree top-down and chooses the splitting attribute at random. [sent-139, score-0.45]
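The following is a minimal Python sketch of a random tree generator in the spirit of RTG as described in the sentence above: the tree is grown top-down, the splitting attribute at each node is drawn uniformly at random from the remaining attributes, and growth stops when the node is pure or no attributes remain. The class names (`Node`, `rtg`, `tree_size`) and the assumption of nominal attribute values are ours, not the paper's.

```python
import random
from collections import Counter

class Node:
    """A decision-tree node: either a leaf with a label or an internal split."""
    def __init__(self, attribute=None, label=None):
        self.attribute = attribute   # splitting attribute (None for a leaf)
        self.label = label           # majority class at a leaf
        self.children = {}           # attribute value -> child Node

def rtg(examples, attributes):
    """Grow a consistent tree top-down, picking the splitting attribute at random.

    `examples` is a list of (feature_dict, class_label) pairs; `attributes` is the
    set of attribute names still available for splitting.
    """
    labels = [y for _, y in examples]
    if len(set(labels)) == 1 or not attributes:
        # Pure node or no attributes left: return a leaf with the majority class.
        return Node(label=Counter(labels).most_common(1)[0][0])
    a = random.choice(sorted(attributes))          # the defining step of RTG
    node = Node(attribute=a)
    for v in {x[a] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[a] == v]
        node.children[v] = rtg(subset, attributes - {a})
    return node

def tree_size(node):
    """Number of nodes, used below as the quantity being minimized."""
    return 1 + sum(tree_size(c) for c in node.children.values())
```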
19 For each attribute a, let Tmin (a, A, E) be the smallest tree rooted at a that is consistent with E and uses attributes from A − {a} for the internal splits. [sent-147, score-0.583]
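Written out, the selection rule implied by this definition would be the following; the notation is ours, and the approximation on the right anticipates the sampling-based estimate described below (r denotes the sample size and T^(i)_SID3 the i-th sampled consistent tree rooted at a):

```latex
a^{*} \;=\; \arg\min_{a \in A} \bigl|T_{\min}(a, A, E)\bigr|
\;\approx\; \arg\min_{a \in A} \; \min_{i=1,\dots,r} \bigl|T^{(i)}_{\mathrm{SID3}}(a, A, E)\bigr|
```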
20 ID3-k can be viewed as a contract anytime algorithm parameterized by k. [sent-169, score-0.528]
21 However, despite its ability to exploit additional resources when available, the anytime behavior of ID3-k is problematic. [sent-170, score-0.578]
22 If two attributes yield the same decrease in entropy, we prefer the one whose associated lookahead tree is shallower. [sent-180, score-0.556]
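For concreteness, here is a small Python helper for the entropy-based quantities referred to throughout: class entropy, the information gain of a split, and the tie-breaking rule stated above (prefer the attribute whose lookahead tree is shallower when gains are equal). The function names and the exact form of the tie-break comparison are our own illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values()) if n else 0.0

def information_gain(examples, attribute):
    """Decrease in entropy obtained by splitting `examples` on `attribute`."""
    labels = [y for _, y in examples]
    before = entropy(labels)
    after = 0.0
    for v in {x[attribute] for x, _ in examples}:
        subset = [y for x, y in examples if x[attribute] == v]
        after += (len(subset) / len(examples)) * entropy(subset)
    return before - after

def prefer(a, b, gain, lookahead_depth):
    """Higher gain wins; on a tie, the attribute with the shallower lookahead tree wins."""
    if abs(gain[a] - gain[b]) > 1e-12:
        return a if gain[a] > gain[b] else b
    return a if lookahead_depth[a] <= lookahead_depth[b] else b
```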
23 3 Estimating Tree Size by Sampling Motivated by the advantages of smaller decision trees, we introduce a novel algorithm that, given an attribute a, evaluates it by estimating the size of the minimal tree under it. [sent-184, score-0.481]
24 Summing up the minimal tree size for each subset gives an estimation of the minimal total tree size under a. [sent-238, score-0.532]
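Putting the last two sentences together, a hedged sketch of the size estimate could look as follows: for every value of the candidate attribute a, sample r consistent trees over the corresponding subset of examples, keep the smallest, and sum the minima (plus one for the split node itself). Here `sampler` is a stand-in for SID3; the paper's SID3 chooses splitting attributes semi-randomly with likelihood related to their information gain, a detail not spelled out in the excerpt, so the plain `rtg` sampler from the earlier sketch would also work as a cruder substitute. The `tree_size` helper is the one defined above.

```python
def estimate_min_size(examples, attribute, attributes, r, sampler):
    """Estimate |T_min(attribute)| by biased sampling.

    `sampler(subset, remaining_attributes)` returns a consistent tree for the
    subset (e.g., SID3 or, more crudely, RTG); `r` is the contract parameter.
    """
    total = 1  # the split node on `attribute` itself
    remaining = attributes - {attribute}
    for v in {x[attribute] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[attribute] == v]
        sizes = [tree_size(sampler(subset, remaining)) for _ in range(r)]
        total += min(sizes)
    return total

def choose_attribute_lsid3(examples, attributes, r, sampler):
    """LSID3-style choice: the attribute with the smallest estimated minimal tree."""
    return min(attributes,
               key=lambda a: estimate_min_size(examples, a, attributes, r, sampler))
```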
25 The minimal size for trees rooted at a4 is 6 and for trees rooted at a3 is 7. [sent-265, score-0.466]
26 For more complicated problems such as 3-XOR, the space of SID3 trees under the relevant attributes includes trees other than the smallest. [sent-268, score-0.488]
27 4 Evaluating Continuous Attributes by Sampling the Candidate Cuts When attributes have a continuous domain, the decision tree learner needs to discretize their values in order to define splitting tests. [sent-271, score-0.491]
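A minimal sketch of the discretization step referred to here, under our own assumptions: candidate cuts are taken as midpoints between consecutive sorted values at which the class label changes (the standard C4.5-style candidate set), and only a random subset of them is sampled for evaluation. The exact sampling policy used by the paper is not given in this excerpt.

```python
import random

def candidate_cuts(values_and_labels):
    """Midpoints between consecutive distinct values where the class changes."""
    pts = sorted(values_and_labels)              # list of (numeric value, class label)
    cuts = []
    for (v1, y1), (v2, y2) in zip(pts, pts[1:]):
        if v1 != v2 and y1 != y2:
            cuts.append((v1 + v2) / 2.0)
    return cuts

def sample_cuts(values_and_labels, k, rng=random):
    """Evaluate only a random sample of k candidate cuts instead of all of them."""
    cuts = candidate_cuts(values_and_labels)
    return cuts if len(cuts) <= k else rng.sample(cuts, k)
```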
28 In our proposed anytime approach, this problem is tackled by allotting additional time for reaching trees unexplored by greedy algorithms with pruning. [sent-335, score-0.699]
29 For this purpose, a post-pruning phase will be applied to the trees that were generated by SID3 to form the lookahead sample, and the size of the pruned trees will be considered. [sent-341, score-0.566]
30 For pruning we use the mechanism of C4.5, namely error-based pruning (EBP), and examine post-pruning of the final trees as well as pruning of the lookahead trees. [sent-346, score-0.499]
31 The space of decision trees over A can be partitioned into two: the space of trees consistent with E and the space of trees inconsistent with E. [sent-349, score-0.677]
32 To handle such cases, we need to devise an interruptible anytime algorithm. [sent-372, score-0.607]
33 1 Interruptible Induction by Sequencing Contract Algorithms By definition, every interruptible algorithm can serve as a contract algorithm because one can stop the interruptible algorithm when all the resources have been consumed. [sent-377, score-0.587]
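The converse conversion—turning a contract algorithm into an interruptible one—is usually done by the sequencing scheme of Russell and Zilberstein (1996): run the contract algorithm repeatedly with exponentially growing budgets and always keep the best result so far. A hedged Python sketch of this generic scheme (our own framing, with `contract_learner` standing in for, e.g., LSID3 parameterized by its contract):

```python
def interruptible_from_contract(contract_learner, initial_budget=1, growth=2,
                                should_stop=lambda: False):
    """General contract-to-interruptible conversion by budget doubling.

    `contract_learner(budget)` returns (model, quality); the best model seen so
    far is always available, so interrupting between runs returns a valid tree.
    """
    best_model, best_quality = None, float("-inf")
    budget = initial_budget
    while not should_stop():
        model, quality = contract_learner(budget)
        if quality > best_quality:
            best_model, best_quality = model, quality
        budget *= growth          # exponential sequence: 1, 2, 4, 8, ...
    return best_model
```

The drawback noted in the surrounding text is that each run rebuilds the tree from scratch, which is exactly what IIDT avoids by improving subtrees instead.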
34 Our interruptible anytime framework, called IIDT, overcomes these problems by iteratively improving the tree rather than trying to rebuild it. [sent-392, score-0.894]
35 It can serve as an interruptible anytime algorithm because, when interrupted, it can immediately return the currently best solution available. [sent-399, score-0.633]
36 The principal difference between the algorithms is that LSID3 uses the available resources to induce a decision tree top-down, where each decision made at a node is final and does not change. [sent-401, score-0.62]
37 Then it iteratively attempts to improve the current tree by choosing a node and rebuilding its subtree with more resources than those used previously. [sent-405, score-0.595]
38 In the first iteration, the subtree rooted at the bolded node is selected for improvement and replaced by a smaller tree (surrounded by a dashed line). [sent-411, score-0.503]
39 Next, the root is selected for improvement and the whole tree is replaced by a tree that perfectly describes the concept. [sent-412, score-0.583]
40 In the first iteration the subtree rooted at the bolded node is selected for improvement and replaced by a smaller tree (surrounded by a dashed line). [sent-416, score-0.503]
41 Next, the root is selected for improvement and the whole tree is replaced by a tree that perfectly describes the concept. [sent-417, score-0.583]
42 Given a tree node y, we can view the task of rebuilding the subtree below y as an independent task. [sent-432, score-0.466]
43 The whole framework of decision tree induction rests on the assumption that smaller consistent trees are better than larger ones. [sent-443, score-0.636]
44 When no such large subtrees exist, our algorithm may repeatedly attempt to improve smaller trees rooted at deep nodes because these trees have low associated costs. [sent-457, score-0.483]
45 To control the tradeoff between efficient resource use and anytime performance flexibility, we add a granularity parameter 0 ≤ g ≤ 1. [sent-461, score-0.477]
46 Thus, to avoid obtaining an induced tree of lower quality, we replace an existing subtree with a newly induced alternative only if the alternative is expected to improve the quality of the complete decision tree. [sent-475, score-0.502]
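Reading the IIDT sentences above together, a heavily hedged sketch of the main loop might look as follows. The subtree-selection policy and the precise role of the granularity parameter g are only partly visible in this excerpt, so `select_node` is left abstract and described only by its contract; the "install the rebuilt subtree only if it is expected to improve the complete tree" test at the end follows the sentence above, using tree size as a proxy for quality. The sketch reuses the `Node` and `tree_size` helpers from the earlier sketches.

```python
def iidt(examples, attributes, grow_greedy, rebuild, select_node, g=0.5,
         should_stop=lambda: False):
    """Interruptible induction: start greedy, then repeatedly rebuild subtrees.

    `grow_greedy(examples, attributes)` returns an initial tree (e.g., ID3);
    `rebuild(node, budget)` returns a candidate replacement subtree;
    `select_node(tree, g)` picks the next internal node to improve, honoring
    the granularity parameter g in [0, 1] (our placeholder for the paper's
    cost/benefit selection).
    """
    tree = grow_greedy(examples, attributes)      # a valid answer from the start
    budgets = {}                                  # per-node record of resources spent
    while not should_stop():
        node = select_node(tree, g)
        if node is None:
            break
        budgets[id(node)] = budgets.get(id(node), 1) * 2   # more resources than before
        candidate = rebuild(node, budgets[id(node)])
        # Install the rebuilt subtree only if it is expected to improve the
        # complete tree -- here approximated by "the replacement is smaller".
        if tree_size(candidate) < tree_size(node):
            node.attribute = candidate.attribute
            node.label = candidate.label
            node.children = candidate.children
    return tree
```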
47 Empirical Evaluation A variety of experiments were conducted to test the performance and behavior of the proposed anytime algorithms. [sent-487, score-0.449]
48 1 Experimental Methodology We start our experimental evaluation by comparing our contract algorithm, given a fixed resource allocation, with the basic decision tree induction algorithms. [sent-491, score-0.565]
49 We then compare the anytime behavior of our contract algorithm to that of fixed lookahead. [sent-492, score-0.549]
50 Next we examine the anytime behavior of our interruptible algorithm. [sent-493, score-0.628]
51 When comparing the algorithms that produce consistent trees, namely ID3, ID3-k and LSID3, the average tree size is the smallest for most data sets when the trees are induced with LSID3. [sent-526, score-0.51]
52 However, for these methods to serve as good anytime algorithms, the quality of their output should improve as their allocated resources increase. [sent-670, score-0.498]
53 For a typical anytime algorithm, this improvement is greater at the beginning and diminishes over time. [sent-671, score-0.455]
54 To test the anytime behavior of LSID3 and ID3-k, we invoked them with successive values of r and k respectively. [sent-672, score-0.449]
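As a usage illustration of this protocol (the names `lsid3` and `id3k` are placeholders for the two learners, and the evaluation function is ours): train with a growing contract and record tree size and accuracy at each point of the anytime curve.

```python
def anytime_curve(train, evaluate, budgets):
    """Run a contract learner at successive budgets and collect (budget, size, accuracy)."""
    curve = []
    for b in budgets:
        tree = train(b)          # e.g., lsid3(train_set, r=b) or id3k(train_set, k=b)
        size, acc = evaluate(tree)
        curve.append((b, size, acc))
    return curve

# e.g., anytime_curve(lambda r: lsid3(train_set, r=r), evaluate, budgets=[1, 2, 4, 8])
```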
55 In the first set of experiments we focused on domains with nominal attributes, while in the second set we tested the anytime behavior for domains with continuous attributes. [sent-673, score-0.477]
56 The graphs indicate that the anytime behavior of LSID3 is better than that of ID3-k. [sent-703, score-0.449]
57 We can easily increase the granularity of the anytime graph by using smaller gaps between the p values. [sent-758, score-0.46]
58 4 Anytime Behavior of IIDT IIDT was presented as an interruptible decision tree learner that does not require advanced knowledge of its resource allocation: it can be stopped at any moment and return a valid decision tree. [sent-778, score-0.67]
59 Figures 23, 24, and 25 show the anytime performance of IIDT in terms of tree size and accuracy for the Glass, XOR-10, and Tic-tac-toe data sets. [sent-781, score-0.739]
60 Unlike the graphs given in the previous section, these are interruptible anytime graphs, that is, for each point, the y coordinate reflects the performance if the algorithm was interrupted at the associated x coordinate. [sent-783, score-0.639]
61 In all cases, the two anytime versions indeed exploit the additional resources and produce both smaller and more accurate trees. [sent-785, score-0.557]
62 Although these learners were not presented and studied as anytime algorithms, some of them can be viewed as such. [sent-824, score-0.456]
63 In what follows we compare our proposed anytime framework to three such algorithms: bagging, skewing and GATree. [sent-825, score-0.72]
64 1 OVERVIEW OF THE C OMPARED M ETHODS Page and Ray (2003) introduced skewing as an alternative to lookahead for addressing problematic concepts such as parity functions. [sent-829, score-0.507]
65 In order to convert skewing into an interruptible algorithm, we apply the general conversion method described in Section 3. [sent-834, score-0.471]
66 Two improvements to the original skewing algorithm were presented, namely sequential skewing (Page and Ray, 2004), which skews one variable at a time instead of all of them simultaneously, and generalized skewing (Ray and Page, 2005), which can handle nominal and continuous attributes. [sent-836, score-0.931]
67 To convert it into an anytime algorithm, we added a parameter k that controls the number of skewing iterations. [sent-841, score-0.72]
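A rough, hedged sketch of what such a parameterization could look like: each skewing iteration reweights the training examples according to randomly chosen favored values and re-evaluates the gain of every attribute under the skewed weights; the attribute that wins most often over the k iterations is selected. This follows the general scheme of Page and Ray (2003) only loosely—the weighting constant, the voting rule, and the absence of thresholds are our simplifications, not the published algorithm—and `weighted_gain` is a stand-in for a gain function that accepts example weights.

```python
import random
from collections import Counter

def skewed_choice(examples, attributes, k, weighted_gain, skew=0.75, rng=random):
    """Pick a splitting attribute by voting over k random skews of the data.

    `weighted_gain(examples, weights, attribute)` computes information gain with
    example weights; the skew constant and voting rule are illustrative only.
    """
    votes = Counter()
    for _ in range(k):                       # k controls how much time skewing may use
        favored = {a: rng.choice(sorted({x[a] for x, _ in examples})) for a in attributes}
        weights = []
        for x, _ in examples:
            w = 1.0
            for a in attributes:
                w *= skew if x[a] == favored[a] else (1.0 - skew)
            weights.append(w)
        best = max(attributes, key=lambda a: weighted_gain(examples, weights, a))
        votes[best] += 1
    return votes.most_common(1)[0][0]
```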
68 For the Tic-tac-toe data set, where the attributes are ternary, we used only the generalized skewing algorithm, parameterized by the number of random orderings by which the nominal attributes are reweighed. [sent-843, score-0.532]
69 GATree can be viewed as an interruptible anytime algorithm that uses additional time to produce more and more generations. [sent-853, score-0.634]
70 Bagging is an ensemble-based method, and as such, it is naturally an interruptible anytime learner. [sent-856, score-0.607]
71 Generating a set of trees rather than a single good tree eliminates one of the greatest advantages of decision trees—their comprehensibility. [sent-878, score-0.532]
72 The skewing and sequential skewing versions were run with linearly increasing parameters. [sent-882, score-0.499]
73 Still, IIDT performed significantly better than bagging-LSID3, indicating that for difficult concepts, it is better to invest more resources for improving a single tree than for adding more trees of lower quality to the committee. [sent-904, score-0.586]
74 The inferior results of the skewing algorithms are more difficult to interpret, since skewing was shown to handle difficult concepts well. [sent-905, score-0.615]
75 We believe that the general question of tradeoff between the resources allocated for each tree and the number of trees forming the ensemble should be addressed by further research with extensive experiments on various data sets. [sent-928, score-0.656]
76 The performance of generalized skewing and IIDT was similar in this case, with a slight advantage for skewing in terms of accuracy and an advantage for IIDT in terms of tree size. [sent-929, score-0.895]
77 Related Work While to the best of our knowledge no other work has tried to design an anytime algorithm for decision tree induction in particular, there are several related works that warrant discussion here. [sent-934, score-0.844]
78 1 Single Tree Inducers The goal of our research was to develop anytime algorithms for inducing a single decision tree. [sent-938, score-0.503]
79 Several other algorithms for single decision-tree induction can either be considered anytime algorithms or can be converted into them with relative ease. [sent-939, score-0.503]
80 Therefore, we focus first on the differences between skewing and our anytime framework, and then we consider other decision tree learners. [sent-941, score-1.061]
81 1 S KEWING The skewing approach was presented as an efficient alternative to lookahead (Page and Ray, 2003). [sent-944, score-0.476]
82 Skewing relies on the hypothesis that, when learning a hard concept, it may be easier for a greedy decision tree inducer to pick a relevant attribute if the distribution over the data is significantly different from the uniform distribution. [sent-948, score-0.554]
83 Hence, skewing iterations can be seen as sampling the space of lookahead sub-paths and considering a few of them at a time. [sent-961, score-0.476]
84 Unlike our approach, GATree is not designed to generate consistent decision trees and searches the space of all possible trees over a given set of attributes. [sent-978, score-0.486]
85 5) shows that, especially for hard concepts, it is much better to invest the resources in careful tuning of a single tree than to perform genetic search over the large population of decision trees. [sent-982, score-0.49]
86 (1997) presented DMTI, an induction algorithm that chooses an attribute by building a single decision tree under each candidate attribute and evaluating it using various measures. [sent-984, score-0.728]
87 DMTI is similar to LSID3(r = 1) but, unlike LSID3, it can only use a fixed amount of additional resources and hence cannot serve as an anytime algorithm. [sent-986, score-0.557]
88 Furthermore, DMTI uses a single greedy lookahead tree for attribute evaluation, while we use a biased sample of the possible lookahead trees. [sent-988, score-0.827]
89 Kim and Loh (2001) introduced CRUISE, a bias-free decision tree learner that attempts to produce more compact trees by (1) using multiway splits—one subnode for each class, and (2) examining pair-wise interactions among the variables. [sent-991, score-0.59]
90 The algorithm retains the structure of the tree but attempts to simultaneously improve all the multivariate decisions of the tree using iterative linear programming. [sent-996, score-0.532]
91 2 Induction of Other Models Ensemble-based methods, such as bagging and boosting (Schapire, 1999), can also be viewed as anytime algorithms. [sent-1001, score-0.487]
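The observation is straightforward to realize in code: an ensemble learner is interruptible because it can always return the committee built so far. A small sketch (the base learner and the voting scheme used afterwards are placeholders):

```python
import random

def interruptible_bagging(examples, base_learner, should_stop, rng=random):
    """Grow a bagged committee until interrupted; each member sees a bootstrap sample."""
    committee = []
    while not should_stop():
        bootstrap = [rng.choice(examples) for _ in range(len(examples))]
        committee.append(base_learner(bootstrap))
    return committee   # valid (if small) at any interruption point
```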
92 Our approach differs from bagged C4.5 in that the latter produces an ensemble of trees that is obviously not as comprehensible as a single decision tree. [sent-1016, score-0.486]
93 In order to allow induction of better decision trees for hard-to-learn concepts, we presented a framework for anytime induction of decision trees that makes use of additional resources for producing better hypotheses. [sent-1039, score-1.239]
94 We showed that for several real-world and synthetic data sets, our proposed anytime algorithms can exploit larger time budgets. [sent-1041, score-0.455]
95 The major contributions of this paper are: • A better understanding of the space of consistent decision trees: Using sampling techniques, we conducted experiments that support the application of Occam’s Razor for decision trees and showed that smaller trees are clearly preferable. [sent-1043, score-0.561]
96 • LSID3: We presented and empirically evaluated LSID3, a contract anytime algorithm for learning decision trees. [sent-1044, score-0.603]
97 On the data sets studied in this paper, LSID3 was shown to exhibit good anytime behavior and to produce better decision trees. [sent-1045, score-0.524]
98 • IIDT: Motivated by the need for interruptible learners, we introduced IIDT, an interruptible anytime algorithm for inducing decision trees. [sent-1046, score-0.861]
99 To the best of our knowledge, this is the first work that proposes anytime algorithms for learning decision trees. [sent-1051, score-0.503]
100 Skewing: An efficient alternative to lookahead for decision tree induction. [sent-2139, score-0.525]
wordName wordTfidf (topN-words)
[('iidt', 0.454), ('anytime', 0.428), ('skewing', 0.292), ('tree', 0.266), ('trees', 0.191), ('lookahead', 0.184), ('interruptible', 0.179), ('attribute', 0.14), ('nytime', 0.134), ('smeir', 0.134), ('resources', 0.129), ('arkovitch', 0.114), ('ecision', 0.114), ('rees', 0.111), ('gatree', 0.109), ('attributes', 0.106), ('rtg', 0.102), ('contract', 0.1), ('subtree', 0.093), ('splits', 0.086), ('decision', 0.075), ('node', 0.075), ('induction', 0.075), ('sec', 0.073), ('allocation', 0.073), ('allocated', 0.07), ('xor', 0.068), ('occam', 0.066), ('tdidt', 0.065), ('pruning', 0.062), ('bagging', 0.059), ('multiway', 0.058), ('onks', 0.058), ('greedy', 0.053), ('resource', 0.049), ('runtime', 0.047), ('hoose', 0.045), ('umeric', 0.045), ('accuracy', 0.045), ('splitting', 0.044), ('rooted', 0.042), ('earning', 0.042), ('foreach', 0.039), ('razor', 0.039), ('dmti', 0.038), ('uci', 0.038), ('subtrees', 0.037), ('gain', 0.037), ('russell', 0.035), ('split', 0.033), ('gaps', 0.032), ('candidate', 0.032), ('anode', 0.032), ('interrupted', 0.032), ('papagelis', 0.032), ('rebuilding', 0.032), ('concepts', 0.031), ('repository', 0.029), ('comprehensible', 0.029), ('sequencing', 0.029), ('consistent', 0.029), ('seconds', 0.029), ('quinlan', 0.028), ('learners', 0.028), ('nominal', 0.028), ('autos', 0.027), ('budget', 0.027), ('markovitch', 0.027), ('zilberstein', 0.027), ('ray', 0.027), ('improvement', 0.027), ('time', 0.027), ('return', 0.026), ('automobile', 0.026), ('esmeir', 0.026), ('gto', 0.026), ('kalles', 0.026), ('multiplexer', 0.026), ('tmin', 0.026), ('zoo', 0.024), ('root', 0.024), ('induced', 0.024), ('ig', 0.023), ('repeatedly', 0.022), ('utgoff', 0.022), ('irrelevant', 0.022), ('afford', 0.022), ('enode', 0.022), ('depth', 0.021), ('behavior', 0.021), ('loh', 0.021), ('glass', 0.021), ('deeper', 0.021), ('committee', 0.021), ('overcomes', 0.021), ('expected', 0.02), ('hard', 0.02), ('obtainable', 0.019), ('ake', 0.019), ('committees', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 11 jmlr-2007-Anytime Learning of Decision Trees
2 0.14632705 52 jmlr-2007-Margin Trees for High-dimensional Classification
Author: Robert Tibshirani, Trevor Hastie
Abstract: We propose a method for the classification of more than two classes, from high-dimensional features. Our approach is to build a binary decision tree in a top-down manner, using the optimal margin classifier at each split. We implement an exact greedy algorithm for this task, and compare its performance to less greedy procedures based on clustering of the matrix of pairwise margins. We compare the performance of the “margin tree” to the closely related “all-pairs” (one versus one) support vector machine, and nearest centroids on a number of cancer microarray data sets. We also develop a simple method for feature selection. We find that the margin tree has accuracy that is competitive with other methods and offers additional interpretability in its putative grouping of the classes. Keywords: maximum margin classifier, support vector machine, decision tree, CART
3 0.12213793 79 jmlr-2007-Structure and Majority Classes in Decision Tree Learning
Author: Ray J. Hickey
Abstract: To provide good classification accuracy on unseen examples, a decision tree, learned by an algorithm such as ID3, must have sufficient structure and also identify the correct majority class in each of its leaves. If there are inadequacies in respect of either of these, the tree will have a percentage classification rate below that of the maximum possible for the domain, namely (100 − Bayes error rate). An error decomposition is introduced which enables the relative contributions of deficiencies in structure and in incorrect determination of majority class to be isolated and quantified. A sub-decomposition of majority class error permits separation of the sampling error at the leaves from the possible bias introduced by the attribute selection method of the induction algorithm. It is shown that sampling error can extend to 25% when there are more than two classes. Decompositions are obtained from experiments on several data sets. For ID3, the effect of selection bias is shown to vary from being statistically non-significant to being quite substantial, with the latter appearing to be associated with a simple underlying model. Keywords: decision tree learning, error decomposition, majority classes, sampling error, attribute selection bias
4 0.088700451 48 jmlr-2007-Learning in Environments with Unknown Dynamics: Towards more Robust Concept Learners
Author: Marlon Núñez, Raúl Fidalgo, Rafael Morales
Abstract: In the process of concept learning, target concepts may have portions with short-term changes, other portions may support long-term changes, and yet others may not change at all. For this reason several local windows need to be handled. We suggest facing this problem, which naturally exists in the field of concept learning, by allocating windows which can adapt their size to portions of the target concept. We propose an incremental decision tree that is updated with incoming examples. Each leaf of the decision tree holds a time window and a local performance measure as the main parameter to be controlled. When the performance of a leaf decreases, the size of its local window is reduced. This learning algorithm, called OnlineTree2, automatically adjusts its internal parameters in order to face the current dynamics of the data stream. Results show that it is comparable to other batch algorithms when facing problems with no concept change, and it is better than evaluated methods in its ability to deal with concept drift when dealing with problems in which: concept change occurs at different speeds, noise may be present and, examples may arrive from different areas of the problem domain (virtual drift). Keywords: incremental algorithms, online learning, concept drift, decision trees, robust learners
Author: Evgeniy Gabrilovich, Shaul Markovitch
Abstract: Most existing methods for text categorization employ induction algorithms that use the words appearing in the training documents as features. While they perform well in many categorization tasks, these methods are inherently limited when faced with more complicated tasks where external knowledge is essential. Recently, there have been efforts to augment these basic features with external knowledge, including semi-supervised learning and transfer learning. In this work, we present a new framework for automatic acquisition of world knowledge and methods for incorporating it into the text categorization process. Our approach enhances machine learning algorithms with features generated from domain-specific and common-sense knowledge. This knowledge is represented by ontologies that contain hundreds of thousands of concepts, further enriched through controlled Web crawling. Prior to text categorization, a feature generator analyzes the documents and maps them onto appropriate ontology concepts that augment the bag of words used in simple supervised learning. Feature generation is accomplished through contextual analysis of document text, thus implicitly performing word sense disambiguation. Coupled with the ability to generalize concepts using the ontology, this approach addresses two significant problems in natural language processing—synonymy and polysemy. Categorizing documents with the aid of knowledge-based features leverages information that cannot be deduced from the training documents alone. We applied our methodology using the Open Directory Project, the largest existing Web directory built by over 70,000 human editors. Experimental results over a range of data sets confirm improved performance compared to the bag of words document representation. Keywords: feature generation, text classification, background knowledge
6 0.047474172 59 jmlr-2007-Nonlinear Boosting Projections for Ensemble Construction
7 0.045024756 29 jmlr-2007-Dynamic Weighted Majority: An Ensemble Method for Drifting Concepts
8 0.044240009 16 jmlr-2007-Boosted Classification Trees and Class Probability Quantile Estimation
9 0.043586999 39 jmlr-2007-Handling Missing Values when Applying Classification Models
10 0.042457614 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection
11 0.038558785 72 jmlr-2007-Relational Dependency Networks
12 0.036677092 47 jmlr-2007-Learning Horn Expressions with LOGAN-H
13 0.036146563 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study
14 0.03207849 69 jmlr-2007-Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes
15 0.031657264 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models
16 0.030178282 1 jmlr-2007-"Ideal Parent" Structure Learning for Continuous Variable Bayesian Networks
17 0.028311588 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
19 0.026848985 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition
20 0.024128744 3 jmlr-2007-A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation
topicId topicWeight
[(0, 0.165), (1, 0.242), (2, -0.054), (3, 0.11), (4, 0.159), (5, 0.139), (6, 0.147), (7, 0.16), (8, -0.155), (9, 0.191), (10, 0.085), (11, 0.062), (12, -0.001), (13, -0.042), (14, -0.017), (15, -0.002), (16, 0.076), (17, -0.071), (18, 0.066), (19, -0.08), (20, 0.005), (21, -0.155), (22, 0.094), (23, -0.104), (24, 0.037), (25, -0.049), (26, 0.116), (27, -0.114), (28, -0.156), (29, -0.148), (30, 0.102), (31, -0.025), (32, -0.031), (33, 0.064), (34, 0.013), (35, 0.047), (36, 0.069), (37, -0.055), (38, -0.078), (39, 0.029), (40, -0.033), (41, -0.024), (42, -0.178), (43, 0.012), (44, 0.026), (45, -0.001), (46, -0.012), (47, 0.003), (48, -0.065), (49, 0.082)]
simIndex simValue paperId paperTitle
same-paper 1 0.95993543 11 jmlr-2007-Anytime Learning of Decision Trees
2 0.87232566 79 jmlr-2007-Structure and Majority Classes in Decision Tree Learning
Author: Ray J. Hickey
Abstract: To provide good classification accuracy on unseen examples, a decision tree, learned by an algorithm such as ID3, must have sufficient structure and also identify the correct majority class in each of its leaves. If there are inadequacies in respect of either of these, the tree will have a percentage classification rate below that of the maximum possible for the domain, namely (100 − Bayes error rate). An error decomposition is introduced which enables the relative contributions of deficiencies in structure and in incorrect determination of majority class to be isolated and quantified. A sub-decomposition of majority class error permits separation of the sampling error at the leaves from the possible bias introduced by the attribute selection method of the induction algorithm. It is shown that sampling error can extend to 25% when there are more than two classes. Decompositions are obtained from experiments on several data sets. For ID3, the effect of selection bias is shown to vary from being statistically non-significant to being quite substantial, with the latter appearing to be associated with a simple underlying model. Keywords: decision tree learning, error decomposition, majority classes, sampling error, attribute selection bias
3 0.6705361 52 jmlr-2007-Margin Trees for High-dimensional Classification
Author: Robert Tibshirani, Trevor Hastie
Abstract: We propose a method for the classification of more than two classes, from high-dimensional features. Our approach is to build a binary decision tree in a top-down manner, using the optimal margin classifier at each split. We implement an exact greedy algorithm for this task, and compare its performance to less greedy procedures based on clustering of the matrix of pairwise margins. We compare the performance of the “margin tree” to the closely related “all-pairs” (one versus one) support vector machine, and nearest centroids on a number of cancer microarray data sets. We also develop a simple method for feature selection. We find that the margin tree has accuracy that is competitive with other methods and offers additional interpretability in its putative grouping of the classes. Keywords: maximum margin classifier, support vector machine, decision tree, CART
4 0.39685044 48 jmlr-2007-Learning in Environments with Unknown Dynamics: Towards more Robust Concept Learners
Author: Marlon Núñez, Raúl Fidalgo, Rafael Morales
Abstract: In the process of concept learning, target concepts may have portions with short-term changes, other portions may support long-term changes, and yet others may not change at all. For this reason several local windows need to be handled. We suggest facing this problem, which naturally exists in the field of concept learning, by allocating windows which can adapt their size to portions of the target concept. We propose an incremental decision tree that is updated with incoming examples. Each leaf of the decision tree holds a time window and a local performance measure as the main parameter to be controlled. When the performance of a leaf decreases, the size of its local window is reduced. This learning algorithm, called OnlineTree2, automatically adjusts its internal parameters in order to face the current dynamics of the data stream. Results show that it is comparable to other batch algorithms when facing problems with no concept change, and it is better than evaluated methods in its ability to deal with concept drift when dealing with problems in which: concept change occurs at different speeds, noise may be present and, examples may arrive from different areas of the problem domain (virtual drift). Keywords: incremental algorithms, online learning, concept drift, decision trees, robust learners
5 0.29978269 39 jmlr-2007-Handling Missing Values when Applying Classification Models
Author: Maytal Saar-Tsechansky, Foster Provost
Abstract: Much work has studied the effect of different treatments of missing values on model induction, but little work has analyzed treatments for the common case of missing values at prediction time. This paper first compares several different methods—predictive value imputation, the distributionbased imputation used by C4.5, and using reduced models—for applying classification trees to instances with missing values (and also shows evidence that the results generalize to bagged trees and to logistic regression). The results show that for the two most popular treatments, each is preferable under different conditions. Strikingly the reduced-models approach, seldom mentioned or used, consistently outperforms the other two methods, sometimes by a large margin. The lack of attention to reduced modeling may be due in part to its (perceived) expense in terms of computation or storage. Therefore, we then introduce and evaluate alternative, hybrid approaches that allow users to balance between more accurate but computationally expensive reduced modeling and the other, less accurate but less computationally expensive treatments. The results show that the hybrid methods can scale gracefully to the amount of investment in computation/storage, and that they outperform imputation even for small investments. Keywords: missing data, classification, classification trees, decision trees, imputation
6 0.28266865 16 jmlr-2007-Boosted Classification Trees and Class Probability Quantile Estimation
8 0.19451781 47 jmlr-2007-Learning Horn Expressions with LOGAN-H
9 0.19402558 59 jmlr-2007-Nonlinear Boosting Projections for Ensemble Construction
10 0.17926276 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study
11 0.17921653 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models
12 0.16979448 28 jmlr-2007-Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data
13 0.16603006 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition
14 0.15221874 72 jmlr-2007-Relational Dependency Networks
15 0.15081351 69 jmlr-2007-Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes
17 0.14882101 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method
18 0.14832468 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection
19 0.14719526 43 jmlr-2007-Integrating Naïve Bayes and FOIL
20 0.14557576 23 jmlr-2007-Concave Learners for Rankboost
topicId topicWeight
[(10, 0.034), (12, 0.026), (15, 0.035), (28, 0.045), (40, 0.027), (45, 0.022), (48, 0.025), (60, 0.022), (63, 0.013), (80, 0.52), (85, 0.046), (98, 0.083)]
simIndex simValue paperId paperTitle
1 0.95719141 54 jmlr-2007-Measuring Differentiability: Unmasking Pseudonymous Authors
Author: Moshe Koppel, Jonathan Schler, Elisheva Bonchek-Dokow
Abstract: In the authorship verification problem, we are given examples of the writing of a single author and are asked to determine if given long texts were or were not written by this author. We present a new learning-based method for adducing the “depth of difference” between two example sets and offer evidence that this method solves the authorship verification problem with very high accuracy. The underlying idea is to test the rate of degradation of the accuracy of learned models as the best features are iteratively dropped from the learning process. Keywords: authorship attribution, one-class learning, unmasking
same-paper 2 0.83692175 11 jmlr-2007-Anytime Learning of Decision Trees
3 0.42658401 39 jmlr-2007-Handling Missing Values when Applying Classification Models
Author: Maytal Saar-Tsechansky, Foster Provost
Abstract: Much work has studied the effect of different treatments of missing values on model induction, but little work has analyzed treatments for the common case of missing values at prediction time. This paper first compares several different methods—predictive value imputation, the distributionbased imputation used by C4.5, and using reduced models—for applying classification trees to instances with missing values (and also shows evidence that the results generalize to bagged trees and to logistic regression). The results show that for the two most popular treatments, each is preferable under different conditions. Strikingly the reduced-models approach, seldom mentioned or used, consistently outperforms the other two methods, sometimes by a large margin. The lack of attention to reduced modeling may be due in part to its (perceived) expense in terms of computation or storage. Therefore, we then introduce and evaluate alternative, hybrid approaches that allow users to balance between more accurate but computationally expensive reduced modeling and the other, less accurate but less computationally expensive treatments. The results show that the hybrid methods can scale gracefully to the amount of investment in computation/storage, and that they outperform imputation even for small investments. Keywords: missing data, classification, classification trees, decision trees, imputation
4 0.364025 79 jmlr-2007-Structure and Majority Classes in Decision Tree Learning
Author: Ray J. Hickey
Abstract: To provide good classification accuracy on unseen examples, a decision tree, learned by an algorithm such as ID3, must have sufficient structure and also identify the correct majority class in each of its leaves. If there are inadequacies in respect of either of these, the tree will have a percentage classification rate below that of the maximum possible for the domain, namely (100 − Bayes error rate). An error decomposition is introduced which enables the relative contributions of deficiencies in structure and in incorrect determination of majority class to be isolated and quantified. A sub-decomposition of majority class error permits separation of the sampling error at the leaves from the possible bias introduced by the attribute selection method of the induction algorithm. It is shown that sampling error can extend to 25% when there are more than two classes. Decompositions are obtained from experiments on several data sets. For ID3, the effect of selection bias is shown to vary from being statistically non-significant to being quite substantial, with the latter appearing to be associated with a simple underlying model. Keywords: decision tree learning, error decomposition, majority classes, sampling error, attribute selection bias
5 0.3561365 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection
Author: Wei Pan, Xiaotong Shen
Abstract: Variable selection in clustering analysis is both challenging and important. In the context of modelbased clustering analysis with a common diagonal covariance matrix, which is especially suitable for “high dimension, low sample size” settings, we propose a penalized likelihood approach with an L1 penalty function, automatically realizing variable selection via thresholding and delivering a sparse solution. We derive an EM algorithm to fit our proposed model, and propose a modified BIC as a model selection criterion to choose the number of components and the penalization parameter. A simulation study and an application to gene function prediction with gene expression profiles demonstrate the utility of our method. Keywords: BIC, EM, mixture model, penalized likelihood, soft-thresholding, shrinkage
6 0.33246332 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation
7 0.33071256 29 jmlr-2007-Dynamic Weighted Majority: An Ensemble Method for Drifting Concepts
8 0.32482213 58 jmlr-2007-Noise Tolerant Variants of the Perceptron Algorithm
9 0.31753197 72 jmlr-2007-Relational Dependency Networks
10 0.31303212 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study
11 0.30538958 28 jmlr-2007-Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data
12 0.30400392 59 jmlr-2007-Nonlinear Boosting Projections for Ensemble Construction
13 0.30200404 88 jmlr-2007-Unlabeled Compression Schemes for Maximum Classes
14 0.29744989 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data
15 0.29471359 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
16 0.29336208 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
17 0.28377038 47 jmlr-2007-Learning Horn Expressions with LOGAN-H
18 0.28240591 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method
19 0.28125298 86 jmlr-2007-Truncating the Loop Series Expansion for Belief Propagation
20 0.2793391 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition