jmlr jmlr2008 jmlr2008-77 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Amit Dhurandhar, Alin Dobra
Abstract: In this paper we use the methodology introduced by Dhurandhar and Dobra (2009) for analyzing the error of classifiers and the model selection measures, to analyze decision tree algorithms. The methodology consists of obtaining parametric expressions for the moments of the generalization error (GE) for the classification model of interest, followed by plotting these expressions for interpretability. The major challenge in applying the methodology to decision trees, the main theme of this work, is customizing the generic expressions for the moments of GE to this particular classification algorithm. The specific contributions we make in this paper are: (a) we primarily characterize a subclass of decision trees namely, Random decision trees, (b) we discuss how the analysis extends to other decision tree algorithms and (c) in order to extend the analysis to certain model selection measures, we generalize the relationships between the moments of GE and moments of the model selection measures given in (Dhurandhar and Dobra, 2009) to randomized classification algorithms. An empirical comparison of the proposed method with Monte Carlo and distribution free bounds obtained using Breiman’s formula, depicts the advantages of the method in terms of running time and accuracy. It thus showcases the use of the deployed methodology as an exploratory tool to study learning algorithms. Keywords: moments, generalization error, decision trees
Reference: text
sentIndex sentText sentNum sentScore
1 The methodology consists of obtaining parametric expressions for the moments of the generalization error (GE) for the classification model of interest, followed by plotting these expressions for interpretability. [sent-6, score-0.497]
2 The major challenge in applying the methodology to decision trees, the main theme of this work, is customizing the generic expressions for the moments of GE to this particular classification algorithm. [sent-7, score-0.392]
3 In the present work we customize the generic expressions to compute moments of the generalization error for a more popular classification algorithm: Random decision trees. [sent-34, score-0.368]
4 In particular, we characterize Random decision trees, which are an interesting variant, with respect to three popular stopping criteria, namely fixed height, purity and scarcity (i. [sent-36, score-0.65]
5 3 is applicable even to other deterministic attribute selection methods based on information gain, gini gain, etc. [sent-41, score-0.332]
6 The methodology for studying classification models consists of analyzing the behavior of the first two central moments of the GE of the classification algorithm under study. [sent-71, score-0.321]
7 If instead he/she is able to derive parametric expressions for the moments of GE, the test results would be more relevant to the particular classification algorithm, since the moments are over all possible data sets of a particular size drawn i. [sent-139, score-0.45]
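Since the moments in question are taken over the random draw of the training set, a brute-force Monte Carlo estimate (the kind of baseline the paper later compares its closed-form expressions against) is easy to sketch. Everything below is illustrative: the joint distribution is an arbitrary multinomial over attribute cells and classes, and a simple cell-majority classifier stands in for the decision tree.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, N = 5, 2, 100                       # binary attributes, classes, sample size
n_cells = 2 ** d
# Hypothetical joint distribution over (cell, class) pairs.
p_joint = rng.dirichlet(np.ones(n_cells * k)).reshape(n_cells, k)

def sample_dataset(n):
    idx = rng.choice(n_cells * k, size=n, p=p_joint.ravel())
    return np.column_stack(np.unravel_index(idx, (n_cells, k)))   # rows: (cell_id, class)

def cell_majority(data):
    """Stand-in learner: predict the majority class observed in each cell."""
    counts = np.zeros((n_cells, k))
    for cell_id, y in data:
        counts[cell_id, y] += 1
    return counts.argmax(axis=1)

def generalization_error(pred):
    # GE(zeta) = P[zeta(x) != y] under the true joint distribution.
    return sum(p_joint[c, y] for c in range(n_cells) for y in range(k) if pred[c] != y)

ges = [generalization_error(cell_majority(sample_dataset(N))) for _ in range(2000)]
print("E[GE] ~", np.mean(ges), " Var[GE] ~", np.var(ges))   # first two moments of GE
```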
8 In particular, to derive expressions for the moments of any classification algorithm we need to characterize PZ (N) [ζ(x) = y] for the first moment and PZ (N)×Z (N) [ζ(x) = y ∧ ζ (x ) = y ] for the second moment. [sent-168, score-0.355]
9 , Ad are the corresponding discrete attributes or continuous attributes with predetermined split points. [sent-179, score-0.455]
10 , ad are the number of attribute values/the number of splits of the attributes A1 , A2 , . [sent-183, score-0.537]
11 Figure 1: The all attribute tree with 3 attributes A1, A2, A3, each having 2 values. [sent-195, score-0.526]
12 Figure 2: Given 3 attributes A1, A2, A3, the path m11 m21 m31 is formed irrespective of the ordering of the attributes. [sent-196, score-0.411]
13 2 All Attribute Decision Trees (ATT) Let us consider a decision tree algorithm whose only stopping criterion is that no attributes remain when building any part of the tree. [sent-199, score-0.433]
14 It can be seen that irrespective of the split attribute selection method (e. [sent-202, score-0.346]
15 Thus although a particular path in one tree has an ordering of attributes that might be different from a corresponding path in other trees, the leaf nodes will represent the same region in space or the same set of datapoints. [sent-206, score-0.67]
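A quick way to see this ordering-independence: represent a root-to-leaf path by the unordered set of (attribute, value) pairs along it, so that two orderings of the same splits denote the same cell. The tiny sketch below is ours, not the paper's notation.

```python
def cell(path):
    """A path is a sequence of (attribute, value) splits; the region it carves out
    is just the unordered set of those assignments."""
    return frozenset(path)

# The path m11 m21 m31 built with two different attribute orderings (cf. Figure 2):
path_a = [("A1", "m11"), ("A2", "m21"), ("A3", "m31")]
path_b = [("A3", "m31"), ("A1", "m11"), ("A2", "m21")]
assert cell(path_a) == cell(path_b)   # same leaf region, same set of datapoints
```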
16 3 Decision Trees with Non-trivial Stopping Criteria We just considered decision trees which are grown until all attributes are exhausted. [sent-251, score-0.371]
17 The main reasons for this could be any of the following: we wish to build small decision trees to save space; certain path counts (i. [sent-253, score-0.379]
18 These stopping measures would lead to paths in the tree that contain a subset of the entire set of attributes. [sent-256, score-0.34]
19 After the summation, the right hand side term above is the probability that the cell path_p Ci has the greatest count, with the path "path_p" being present in the tree. [sent-264, score-0.4]
20 1 Characterizing "Path Exists" for Three Stopping Criteria It follows from above that to compute the moments of the GE for a decision tree algorithm we need to characterize conditions under which particular paths are present. [sent-276, score-0.485]
21 This characterization depends on the stopping criteria and split attribute selection method in a decision tree algorithm. [sent-277, score-0.636]
22 We then choose a split attribute selection method, thereby fully characterizing the above two probabilities and hence the moments. [sent-286, score-0.38]
23 Fixed Height: This stopping criterion is basically that every path in the tree should be of length exactly h, where h ∈ [1, . . . , d]. [sent-288, score-0.421]
24 mlh is present in the tree iff the attributes A1 , A2 , . [sent-297, score-0.318]
25 , Ah are chosen in any order to form the path for a tree construction during the split attribute selection phase. [sent-300, score-0.605]
26 Thus, for any path of length h to be present we bi-conditionally imply that the corresponding attributes are chosen. [sent-301, score-0.387]
27 Purity: This stopping criterion implies that we stop growing the tree from a particular split of a particular attribute if all datapoints lying in that split belong to the same class. [sent-303, score-0.627]
28 In this scenario, we could have paths of length 1 to d depending on when we encounter purity (assuming all datapoints don’t lie in 1 class). [sent-305, score-0.38]
29 This means that if a certain set of d − 1 attributes are present in a path in the tree then we split on the d th attribute iff the current path is not pure, finally resulting in a path of length d. [sent-332, score-1.171]
30 This means that if a certain set of h − 1 attributes are present in a path in the tree then we split on some hth attribute iff the current path is not pure and the resulting path is pure. [sent-367, score-1.194]
31 Scarcity: This stopping criterion implies that we stop growing the tree from a particular split of a certain attribute if its count is less than or equal to some pre-specified pruning bound. [sent-370, score-0.522]
32 This means that we stop growing the tree under a node once we find that the next chosen attribute produces a path with occupancy ≤ pb. [sent-432, score-0.515]
33 mlh ) > pb and ii) those that depend on the split attribute selection method, viz. [sent-441, score-0.383]
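For concreteness, the three stopping rules can be checked per node roughly as below; `pb` is the pruning (occupancy) bound. This is our own minimal sketch of the checks, not the paper's implementation, and a combined criterion (as suggested later in the Discussion) would simply OR several of them.

```python
def stop_fixed_height(depth, h):
    # Fixed height: every path must have length exactly h, so stop at depth h.
    return depth >= h

def stop_purity(labels_in_node):
    # Purity: stop if all datapoints falling into this split share one class.
    return len(set(labels_in_node)) <= 1

def stop_scarcity(count_in_node, pb):
    # Scarcity: stop once the split's occupancy drops to the pruning bound or below.
    return count_in_node <= pb
```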
34 4 Split Attribute Selection In decision tree construction algorithms, at each iteration we have to decide the attribute variable on which the data should be split. [sent-450, score-0.373]
35 As an example, for the deterministic purity based measures mentioned above the split attribute selection is just a function of the sample and thus by appropriately conditioning on the sample we can find the relevant probabilities and hence the moments. [sent-466, score-0.574]
36 5 Random Decision Trees In this subsection we explain the randomized process used for split attribute selection and provide the expression for the probability of choosing an attribute/a set of attributes. [sent-478, score-0.364]
37 We assume a uniform probability distribution in selecting the attribute variables, that is, attributes which have not already been chosen in a particular branch have an equal chance of being chosen for the next level. [sent-480, score-0.431]
38 , d} is, P[h attributes chosen] = 1/(d choose h), since choosing without replacement is equivalent to simultaneously choosing a subset of attributes from the given set. [sent-491, score-0.398]
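This choice probability is easy to sanity-check numerically: draw h attributes for a branch uniformly without replacement and count how often they are exactly a fixed target set. A small sketch (our own, with arbitrary d and h):

```python
import math
import random

d, h = 8, 3
target = set(range(h))                       # any fixed set of h attributes
closed_form = 1 / math.comb(d, h)            # 1 / (d choose h)

trials = 200_000
hits = sum(set(random.sample(range(d), h)) == target for _ in range(trials))
print(closed_form, hits / trials)            # the two estimates should agree closely
```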
39 For the second moment when the tree is the same (required in the finding of variance of GE and HE, that is, for finding E_{Z(N)}[GE(ζ)^2]), the probability of choosing two sets of attributes such that the two distinct paths resulting from them co-exist in a single tree is given by the following. [sent-493, score-0.579]
40 Thus p − v attributes are common to both paths but have different values. [sent-503, score-0.332]
41 At one of these attributes in a given tree the two paths will bifurcate. [sent-504, score-0.427]
42 The probability that the two paths co-exist given our randomized attribute selection method is computed by finding out all possible ways in which the two paths can co-exist in a tree and then multiplying the number of each kind of way by the probability of having that way. [sent-505, score-0.668]
43 The expression for the probability based on the attribute selection method is, P[l1 and l2 length paths co-exist] = ∑_{i=0}^{v} vPri (l1 − i − 1)! [sent-507, score-0.398]
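The closed form is truncated in this extract, but the quantity itself can be checked independently by simulating the random attribute selection: the two target paths share a branch until the tree first splits on an attribute where their values differ, after which the two subtrees draw attributes independently. The sketch below is our own Monte Carlo estimator of the co-existence probability and assumes, as in the paper's setting, that the two paths disagree in value on at least one common attribute.

```python
import random

def coexist_once(d, path1, path2, rng):
    """One realization of the random attribute-selection process; returns True iff
    both target paths (dicts attribute -> value) occur in the same tree."""
    remaining = set(range(d))
    used = set()                         # attributes split on along the shared branch
    while True:
        a = rng.choice(tuple(remaining))
        remaining.discard(a)
        if a not in path1 or a not in path2:
            return False                 # the shared node splits on an attribute one path lacks
        used.add(a)
        if path1[a] != path2[a]:
            break                        # bifurcation: the two paths go down different edges
    # After the bifurcation each subtree picks its remaining attributes independently.
    for path in (path1, path2):
        need = set(path) - used          # attributes of this path not yet split on
        pool = list(remaining)
        rng.shuffle(pool)
        if set(pool[:len(need)]) != need:
            return False
    return True

def coexist_probability(d, path1, path2, trials=100_000, seed=0):
    rng = random.Random(seed)
    return sum(coexist_once(d, path1, path2, rng) for _ in range(trials)) / trials

# Example: d = 8, paths of lengths 3 and 4 sharing attribute 0 with different values.
print(coexist_probability(8, {0: 0, 1: 1, 2: 0}, {0: 1, 3: 1, 4: 0, 5: 1}))
```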
44 be an abbreviation for stopping criteria conditions that are sample independent or conditions that are dependent on the attribute selection method. [sent-543, score-0.445]
45 1 Fixed Height The conditions for "path exists" for fixed height trees depend only on the attribute selection method as seen in Section 3. [sent-547, score-0.514]
46 The probability for the second moment when the trees are different is given by, P_{Z(N)×Z(N)}[ζ(x) = Ci ∧ ζ′(x′) = Cv] = ∑_{p,q} P_{Z(N)×Z(N)}[ct(path_p Ci) > ct(path_p Cj), path_p exists, ct(path_q Cv) > ct(path_q Cw), path_q exists, ∀ j ≠ i, ∀ w ≠ v, i, j, v, w ∈ [1, . . . , k]] [sent-572, score-0.411]
47 The probability for the second moment when the trees are the same is given by, P_{Z(N)}[ζ(x) = Ci ∧ ζ(x′) = Cv] = ∑_{p,q} P_{Z(N)}[ct(path_p Ci) > ct(path_p Cj), path_p exists, ct(path_q Cv) > ct(path_q Cw), path_q exists, ∀ j ≠ i, ∀ w ≠ v, i, j, v, w ∈ [1, . . . [sent-586, score-0.411]
48 , k]] where r is the number of attributes that are common in the 2 paths, b is the number of attributes that have the same value in the 2 paths, h is the length of the paths and probt = 1 . [sent-600, score-0.563]
49 2 Purity and Scarcity The conditions for "path exists" in the case of purity and scarcity depend on both the sample and the attribute selection method as can be seen in 3. [sent-610, score-0.607]
50 The probability for the second moment when the trees are the same is given by, P_{Z(N)}[ζ(x) = Ci ∧ ζ(x′) = Cv] = ∑_{p,q} P_{Z(N)}[ct(path_p Ci) > ct(path_p Cj), path_p exists, ct(path_q Cv) > ct(path_q Cw), path_q exists, ∀ j ≠ i, ∀ w ≠ v, i, j, v, w ∈ [1, . . . [sent-672, score-0.411]
51 , k]] where r is the number of attributes that are common in the 2 paths sparing the attributes chosen as leaves, b is the number of attributes that have the same value, hp and hq are the lengths of the 2 paths and, without loss of generality assuming hp ≤ hq, probt = 1/(d(d−1) · · · ). [sent-695, score-0.973]
52 Using the expressions for the above probabilities the moments of GE can be computed. [sent-710, score-0.313]
53 Experiments To exactly compute the probabilities for each path the time complexity for fixed height trees is O(N 2 ) and for purity and scarcity based trees it is O(N 3 ). [sent-713, score-0.897]
54 Figure 3: Errors of Fixed height trees (top row figures), Purity trees (center row figures) and Scarcity trees (bottom row figures) with N = 100 are shown (x-axis: Correlation). [sent-799, score-0.48]
55 Figure 7: Comparison between AF and MC on three UCI data sets for trees pruned based on fixed height (h = 3), purity and scarcity (pb = N/10). [sent-1078, score-0.549]
56 Table 1: The above table shows the upper bounds on E_{Z(N)}[GE(ζ)] for different levels of correlation (ρ) between the attributes and class labels obtained using Breiman’s formula. [sent-1170, score-0.313]
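For reference, the distribution-free bound used for Table 1 appears to be Breiman's (2001) random-forest bound GE ≤ ρ̄(1 − s²)/s², with s the strength and ρ̄ the mean correlation; a small helper for it (our own, with the caveat that the paper's exact choice of s and ρ̄ per correlation level is not reproduced here):

```python
def breiman_upper_bound(strength, mean_correlation):
    """Breiman's bound on generalization error: GE <= rho_bar * (1 - s**2) / s**2."""
    if strength <= 0:
        return 1.0                # vacuous when the strength is not positive
    return min(1.0, mean_correlation * (1.0 - strength ** 2) / strength ** 2)

print(breiman_upper_bound(0.4, 0.3))   # often loose, as Table 1 suggests
```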
57 We observe the behavior of the error for the three kinds of trees with the number of attributes fixed to d = 5 and each attribute having 2 attribute values. [sent-1198, score-0.853]
58 We also increase the number of attributes to d = 8 to study the effect that increasing the number of attributes has on the performance. [sent-1200, score-0.398]
59 In fact, irrespective of the value of d and the number of attribute values for each attribute the scenario can be modeled by a multinomial distribution. [sent-1210, score-0.488]
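Concretely, a data set of size N over d binary attributes and k classes can be drawn in one shot as multinomial counts over the 2^d × k cells. A sketch of such a generator; the uniform probability vector is a placeholder, not the correlation-controlled distribution used in the paper's synthetic experiments:

```python
import numpy as np

def sample_counts(N, d, k, p=None, seed=0):
    """One synthetic data set as multinomial cell counts (attribute cell x class)."""
    rng = np.random.default_rng(seed)
    n_cells = (2 ** d) * k
    if p is None:
        p = np.full(n_cells, 1.0 / n_cells)   # placeholder: uniform cell probabilities
    return rng.multinomial(N, p).reshape(2 ** d, k)

print(sample_counts(N=100, d=5, k=2).sum())   # -> 100 datapoints in total
```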
60 The behavior of the error for trees with the three aforementioned stopping criteria is seen for different correlation values and for a class prior of 0. [sent-1213, score-0.447]
61 Figure 4 depicts the error of Fixed height trees for different dimensionalities (5 and 8) and for different numbers of splits (binary and ternary). [sent-1231, score-0.313]
62 A similar trend is seen for both purity based trees (Figure 5) as well as scarcity based trees (Figure 6). [sent-1234, score-0.573]
63 Though in the case of purity based trees, the performance of both MC-1 and MC-10 is much superior compared with their performance on the other two kinds of trees, especially at low correlations. [sent-1235, score-0.325]
64 Hence, the trees we obtain with high probability using the purity based stopping criterion are all ATTs. [sent-1237, score-0.463]
65 Since in an ATT all the leaves are identical irrespective of the ordering of the attributes in any path, the randomness in the classifiers produced is only due to the randomness in the data generation process and not because of the random attribute selection method. [sent-1238, score-0.53]
66 At higher correlations and for the other two kinds of trees the probability of smaller trees is reasonable and hence MC has to account for a larger space of classifiers induced by not only the randomness in the data but also by the randomness in the attribute selection method. [sent-1242, score-0.559]
67 The performance of MC-1 and MC-10 for the purity based trees is not as impressive here since the data set sizes are much smaller (in the tens or hundreds) compared to 10000 and hence the probability of having an empty cell is not particularly low. [sent-1244, score-0.349]
68 Another reason why calculating the moments using expressions works better is that the portion of the probabilities for each path that depends on the attribute selection method is computed exactly (i. [sent-1254, score-0.766]
69 Discussion In the previous sections we derived the analytical expressions for the moments of the GE of decision trees and depicted interesting behavior of RDT’s built under the 3 stopping criteria. [sent-1258, score-0.587]
70 1 Extension The conditions presented for the 3 stopping criteria, namely fixed height, purity and scarcity, are applicable irrespective of the attribute selection method. [sent-1262, score-0.769]
71 We then compare the loss in entropy of all attributes not already chosen in the path and choose the attribute for which the loss in entropy is maximum. [sent-1267, score-0.619]
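As a concrete point of reference for this deterministic variant, an information-gain selector over the attributes not yet used on the path might look like the sketch below (our own formulation over label lists rather than the paper's cell counts):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values()) if n else 0.0

def information_gain(rows, labels, attribute):
    """Loss in entropy when splitting the node's data on the given attribute index."""
    n = len(labels)
    split = {}
    for row, y in zip(rows, labels):
        split.setdefault(row[attribute], []).append(y)
    return entropy(labels) - sum(len(part) / n * entropy(part) for part in split.values())

def choose_attribute(rows, labels, unused_attributes):
    # Deterministic IG selection: the attribute with the maximum loss in entropy.
    return max(unused_attributes, key=lambda a: information_gain(rows, labels, a))
```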
72 These conditions account for a particular set of attributes being chosen in the 3 stopping criteria. [sent-1270, score-0.313]
73 In other words, these conditions quantify the conditions in the 3 stopping criteria that are attribute selection method dependent. [sent-1271, score-0.445]
74 Similar conditions can be derived for the other attribute selection methods (attribute with maximum gini gain for GG, attribute with maximum gain ratio for GR) from which the relevant probabilities and hence the moments can be computed. [sent-1272, score-0.816]
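The GG variant swaps entropy for Gini impurity in the same skeleton; sketched below under the same assumptions as the information-gain example above:

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0

def gini_gain(rows, labels, attribute):
    """Drop in Gini impurity when splitting the node's data on the given attribute."""
    n = len(labels)
    split = {}
    for row, y in zip(rows, labels):
        split.setdefault(row[attribute], []).append(y)
    return gini(labels) - sum(len(part) / n * gini(part) for part in split.values())

# choose_attribute from the previous sketch can be reused with gini_gain as the key.
```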
75 Thus, while computing the probabilities given in Equations 1 and 2 the conditions for path exists for these attribute selection methods depend totally on the sample. [sent-1273, score-0.508]
76 This is unlike what we observed for the randomized attribute selection criterion, where the conditions for path exists that depend on this randomized criterion were sample independent, while the other conditions in purity and scarcity were sample dependent. [sent-1274, score-0.9]
77 Characterizing these probabilities enables us to compute the moments of GE for these other attribute selection methods. [sent-1275, score-0.47]
78 In the analysis that we presented, we assumed that the split points for continuous attributes were determined prior to tree construction. [sent-1276, score-0.351]
79 If the split point selection algorithm is dynamic, that is, the split points are selected while building the tree, then in the path exists conditions of the 3 stopping criteria we would have to append an extra condition, namely that the split occurs at "this" particular attribute value. [sent-1277, score-0.783]
80 Using this updated set of conditions for the 3 stopping criteria, the moments of GE can be computed. [sent-1280, score-0.33]
81 To this end, it should be noted that if a stopping criterion is not carefully chosen and applied, then the number of possible trees and hence the number of allowed paths can become exponential in the dimensionality. [sent-1287, score-0.352]
82 For studying larger trees the practitioner should combine stopping criteria (e. [sent-1289, score-0.316]
83 , pruning bound and fixed height or scarcity and fixed height), that is, combine the conditions given for each individual stopping criterion or choose a stopping criterion that limits the number of paths (e. [sent-1291, score-0.609]
84 In particular, we have specifically characterized RDTs for three stopping criteria, namely fixed height, purity and scarcity. [sent-1317, score-0.337]
85 Being able to compute moments of GE allows us to compute the moments of the various validation measures and observe their relative behavior. [sent-1318, score-0.361]
86 The probability that two paths of lengths l1 and l2 (l2 ≥ l1 ) co-exist in a tree based on the randomized attribute selection method is given by, P[l1 and l2 length paths co − exist] = v ∑ vPri (l1 − i − 1)! [sent-1336, score-0.668]
87 (r − v)probi i=0 where r is the number of attributes common in the two paths, v is the number attributes with the v! [sent-1338, score-0.398]
88 Let A1, A2 and A3 be three attributes that are common to both paths and also have the same attribute values. [sent-1353, score-0.564]
89 Let A 4 and A5 be common to both paths but have different attribute values for each of them. [sent-1354, score-0.365]
90 For the two paths to co-exist notice that at least one of A4 or A5 has to be at a lower depth than the non-common attributes A6, A7, A8. [sent-1357, score-0.379]
91 This has to be true since, if a non-common attribute say A 6 is higher than A4 and A5 in a path of the tree then the other path cannot exist. [sent-1358, score-0.703]
92 Hence, in all the possible ways that the two paths can co-exist, one of the attributes A 4 or A5 has to occur at a maximum depth of v + 1, that is, 4 in this example. [sent-1359, score-0.379]
93 In the successive tree structures, that is, Figure 8b, Figure 8c the common attribute with distinct attribute values (A 4 ) rises higher up in the tree (to lower depths) until in Figure 8d it becomes the root. [sent-1361, score-0.654]
94 Thus the probability of choosing the root is 1/d, the next attribute is 1/(d−1) and so on till the subtree splits into two paths at depth 5. [sent-1364, score-0.454]
95 After the split at depth 5 the probability of choosing the respective attributes for the two paths is 1/(d−4)^2, since repetitions are allowed in two separate paths. [sent-1365, score-0.436]
96 Finally, the first path ends at depth 6 and only one attribute has to be chosen at depth 7 for the second path, which is chosen with a probability of 1/(d−6). [sent-1366, score-0.702]
97 We now find the total number of subtrees with such an arrangement where the highest common attribute with different values is at depth of 4. [sent-1367, score-0.34]
98 Similarly, we find the probability of the arrangement in Figure 8b where the common attribute with different values is at depth 3 then at depth 2 and finally at the root. [sent-1381, score-0.365]
99 are the total permutations of the attributes in path 1 and 2 respectively after the split. [sent-1398, score-0.387]
100 Hence the total probability for the two paths to co-exist, which is the sum of the probabilities of the individual arrangements, is given by, P[l1 and l2 length paths co-exist] = ∑_{i=0}^{v} vPri (l1 − i − 1)! [sent-1447, score-0.321]
wordName wordTfidf (topN-words)
[('ge', 0.419), ('ct', 0.377), ('mc', 0.251), ('attribute', 0.232), ('pz', 0.222), ('purity', 0.199), ('attributes', 0.199), ('path', 0.188), ('moments', 0.171), ('pci', 0.155), ('paths', 0.133), ('trees', 0.126), ('scarcity', 0.122), ('dhurandhar', 0.119), ('hurandhar', 0.111), ('obra', 0.111), ('expressions', 0.108), ('pathqcv', 0.103), ('pathqcw', 0.103), ('rees', 0.103), ('robabilistic', 0.103), ('height', 0.102), ('correlation', 0.095), ('tree', 0.095), ('stopping', 0.093), ('ecision', 0.088), ('ternary', 0.088), ('af', 0.082), ('dobra', 0.081), ('haracterization', 0.079), ('ah', 0.078), ('pc', 0.074), ('methodology', 0.067), ('ad', 0.064), ('rdt', 0.064), ('andom', 0.063), ('pb', 0.061), ('moment', 0.057), ('split', 0.057), ('breiman', 0.054), ('classi', 0.05), ('datapoints', 0.048), ('impure', 0.048), ('fixed', 0.048), ('depth', 0.047), ('decision', 0.046), ('criteria', 0.045), ('error', 0.043), ('monte', 0.043), ('splits', 0.042), ('randomized', 0.042), ('ez', 0.041), ('gini', 0.041), ('pathq', 0.04), ('slt', 0.04), ('vpri', 0.04), ('arrangement', 0.039), ('hq', 0.039), ('ers', 0.038), ('ci', 0.038), ('characterization', 0.035), ('probabilities', 0.034), ('golden', 0.034), ('selection', 0.033), ('probt', 0.032), ('carlo', 0.032), ('cv', 0.032), ('studying', 0.031), ('probi', 0.027), ('att', 0.027), ('datapoint', 0.027), ('gain', 0.026), ('iff', 0.024), ('irrespective', 0.024), ('designer', 0.024), ('pim', 0.024), ('vprv', 0.024), ('cell', 0.024), ('characterizing', 0.024), ('pure', 0.023), ('synthetic', 0.023), ('er', 0.022), ('subtrees', 0.022), ('built', 0.022), ('vc', 0.022), ('behavior', 0.021), ('randomness', 0.021), ('practitioner', 0.021), ('arrangements', 0.021), ('gg', 0.021), ('conditions', 0.021), ('analyzing', 0.021), ('strength', 0.019), ('bounds', 0.019), ('measures', 0.019), ('ig', 0.019), ('counts', 0.019), ('characterize', 0.019), ('te', 0.018), ('ce', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 77 jmlr-2008-Probabilistic Characterization of Random Decision Trees
Author: Amit Dhurandhar, Alin Dobra
Abstract: In this paper we use the methodology introduced by Dhurandhar and Dobra (2009) for analyzing the error of classifiers and the model selection measures, to analyze decision tree algorithms. The methodology consists of obtaining parametric expressions for the moments of the generalization error (GE) for the classification model of interest, followed by plotting these expressions for interpretability. The major challenge in applying the methodology to decision trees, the main theme of this work, is customizing the generic expressions for the moments of GE to this particular classification algorithm. The specific contributions we make in this paper are: (a) we primarily characterize a subclass of decision trees namely, Random decision trees, (b) we discuss how the analysis extends to other decision tree algorithms and (c) in order to extend the analysis to certain model selection measures, we generalize the relationships between the moments of GE and moments of the model selection measures given in (Dhurandhar and Dobra, 2009) to randomized classification algorithms. An empirical comparison of the proposed method with Monte Carlo and distribution free bounds obtained using Breiman’s formula, depicts the advantages of the method in terms of running time and accuracy. It thus showcases the use of the deployed methodology as an exploratory tool to study learning algorithms. Keywords: moments, generalization error, decision trees
2 0.10502937 94 jmlr-2008-Value Function Approximation using Multiple Aggregation for Multiattribute Resource Management
Author: Abraham George, Warren B. Powell, Sanjeev R. Kulkarni
Abstract: We consider the problem of estimating the value of a multiattribute resource, where the attributes are categorical or discrete in nature and the number of potential attribute vectors is very large. The problem arises in approximate dynamic programming when we need to estimate the value of a multiattribute resource from estimates based on Monte-Carlo simulation. These problems have been traditionally solved using aggregation, but choosing the right level of aggregation requires resolving the classic tradeoff between aggregation error and sampling error. We propose a method that estimates the value of a resource at different levels of aggregation simultaneously, and then uses a weighted combination of the estimates. Using the optimal weights, which minimizes the variance of the estimate while accounting for correlations between the estimates, is computationally too expensive for practical applications. We have found that a simple inverse variance formula (adjusted for bias), which effectively assumes the estimates are independent, produces near-optimal estimates. We use the setting of two levels of aggregation to explain why this approximation works so well. Keywords: hierarchical statistics, approximate dynamic programming, mixture models, adaptive learning, multiattribute resources
3 0.069945417 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
Author: Jun Zhu, Zaiqing Nie, Bo Zhang, Ji-Rong Wen
Abstract: Existing template-independent web data extraction approaches adopt highly ineffective decoupled strategies—attempting to do data record detection and attribute labeling in two separate phases. In this paper, we propose an integrated web data extraction paradigm with hierarchical models. The proposed model is called Dynamic Hierarchical Markov Random Fields (DHMRFs). DHMRFs take structural uncertainty into consideration and define a joint distribution of both model structure and class labels. The joint distribution is an exponential family distribution. As a conditional model, DHMRFs relax the independence assumption as made in directed models. Since exact inference is intractable, a variational method is developed to learn the model’s parameters and to find the MAP model structure and label assignments. We apply DHMRFs to a real-world web data extraction task. Experimental results show that: (1) integrated web data extraction models can achieve significant improvements on both record detection and attribute labeling compared to decoupled models; (2) in diverse web data extraction DHMRFs can potentially address the blocky artifact issue which is suffered by fixed-structured hierarchical models. Keywords: conditional random fields, dynamic hierarchical Markov random fields, integrated web data extraction, statistical hierarchical modeling, blocky artifact issue
4 0.06794852 25 jmlr-2008-Consistency of Random Forests and Other Averaging Classifiers
Author: Gérard Biau, Luc Devroye, Gábor Lugosi
Abstract: In the last years of his life, Leo Breiman promoted random forests for use in classification. He suggested using averaging as a means of obtaining good discrimination rules. The base classifiers used for averaging are simple and randomized, often based on random samples from the data. He left a few questions unanswered regarding the consistency of such rules. In this paper, we give a number of theorems that establish the universal consistency of averaging rules. We also show that some popular classifiers, including one suggested by Breiman, are not universally consistent. Keywords: random forests, classification trees, consistency, bagging This paper is dedicated to the memory of Leo Breiman.
5 0.066493981 13 jmlr-2008-An Error Bound Based on a Worst Likely Assignment
Author: Eric Bax, Augusto Callejas
Abstract: This paper introduces a new PAC transductive error bound for classification. The method uses information from the training examples and inputs of working examples to develop a set of likely assignments to outputs of the working examples. A likely assignment with maximum error determines the bound. The method is very effective for small data sets. Keywords: error bound, transduction, nearest neighbor, dynamic programming
6 0.058229715 20 jmlr-2008-Causal Reasoning with Ancestral Graphs (Special Topic on Causality)
7 0.053122595 42 jmlr-2008-HPB: A Model for Handling BN Nodes with High Cardinality Parents
8 0.044491101 33 jmlr-2008-Evidence Contrary to the Statistical View of Boosting
9 0.042344309 93 jmlr-2008-Using Markov Blankets for Causal Structure Learning (Special Topic on Causality)
10 0.041692041 78 jmlr-2008-Randomized Online PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension
11 0.040568307 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition
12 0.037470441 79 jmlr-2008-Ranking Categorical Features Using Generalization Properties
13 0.036692008 96 jmlr-2008-Visualizing Data using t-SNE
14 0.036590252 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
15 0.035510924 56 jmlr-2008-Magic Moments for Structured Output Prediction
16 0.033678304 82 jmlr-2008-Responses to Evidence Contrary to the Statistical View of Boosting
17 0.032637343 68 jmlr-2008-Nearly Uniform Validation Improves Compression-Based Error Bounds
18 0.030084837 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks
19 0.02921306 35 jmlr-2008-Finding Optimal Bayesian Network Given a Super-Structure
20 0.028890966 6 jmlr-2008-A Recursive Method for Structural Learning of Directed Acyclic Graphs
topicId topicWeight
[(0, 0.16), (1, 0.029), (2, 0.092), (3, -0.046), (4, -0.098), (5, 0.042), (6, 0.02), (7, -0.073), (8, -0.021), (9, -0.066), (10, 0.002), (11, 0.11), (12, -0.105), (13, -0.025), (14, -0.129), (15, -0.189), (16, 0.23), (17, -0.274), (18, 0.015), (19, -0.097), (20, 0.137), (21, -0.121), (22, -0.01), (23, 0.185), (24, 0.073), (25, 0.12), (26, 0.013), (27, 0.13), (28, 0.18), (29, -0.085), (30, 0.109), (31, -0.045), (32, -0.05), (33, 0.046), (34, 0.123), (35, -0.139), (36, 0.12), (37, 0.038), (38, 0.021), (39, 0.017), (40, 0.016), (41, -0.028), (42, 0.032), (43, -0.015), (44, -0.14), (45, 0.086), (46, 0.03), (47, -0.13), (48, 0.198), (49, -0.057)]
simIndex simValue paperId paperTitle
same-paper 1 0.95564801 77 jmlr-2008-Probabilistic Characterization of Random Decision Trees
Author: Amit Dhurandhar, Alin Dobra
Abstract: In this paper we use the methodology introduced by Dhurandhar and Dobra (2009) for analyzing the error of classifiers and the model selection measures, to analyze decision tree algorithms. The methodology consists of obtaining parametric expressions for the moments of the generalization error (GE) for the classification model of interest, followed by plotting these expressions for interpretability. The major challenge in applying the methodology to decision trees, the main theme of this work, is customizing the generic expressions for the moments of GE to this particular classification algorithm. The specific contributions we make in this paper are: (a) we primarily characterize a subclass of decision trees namely, Random decision trees, (b) we discuss how the analysis extends to other decision tree algorithms and (c) in order to extend the analysis to certain model selection measures, we generalize the relationships between the moments of GE and moments of the model selection measures given in (Dhurandhar and Dobra, 2009) to randomized classification algorithms. An empirical comparison of the proposed method with Monte Carlo and distribution free bounds obtained using Breiman’s formula, depicts the advantages of the method in terms of running time and accuracy. It thus showcases the use of the deployed methodology as an exploratory tool to study learning algorithms. Keywords: moments, generalization error, decision trees
2 0.49767739 94 jmlr-2008-Value Function Approximation using Multiple Aggregation for Multiattribute Resource Management
Author: Abraham George, Warren B. Powell, Sanjeev R. Kulkarni
Abstract: We consider the problem of estimating the value of a multiattribute resource, where the attributes are categorical or discrete in nature and the number of potential attribute vectors is very large. The problem arises in approximate dynamic programming when we need to estimate the value of a multiattribute resource from estimates based on Monte-Carlo simulation. These problems have been traditionally solved using aggregation, but choosing the right level of aggregation requires resolving the classic tradeoff between aggregation error and sampling error. We propose a method that estimates the value of a resource at different levels of aggregation simultaneously, and then uses a weighted combination of the estimates. Using the optimal weights, which minimizes the variance of the estimate while accounting for correlations between the estimates, is computationally too expensive for practical applications. We have found that a simple inverse variance formula (adjusted for bias), which effectively assumes the estimates are independent, produces near-optimal estimates. We use the setting of two levels of aggregation to explain why this approximation works so well. Keywords: hierarchical statistics, approximate dynamic programming, mixture models, adaptive learning, multiattribute resources
3 0.40723032 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
Author: Jun Zhu, Zaiqing Nie, Bo Zhang, Ji-Rong Wen
Abstract: Existing template-independent web data extraction approaches adopt highly ineffective decoupled strategies—attempting to do data record detection and attribute labeling in two separate phases. In this paper, we propose an integrated web data extraction paradigm with hierarchical models. The proposed model is called Dynamic Hierarchical Markov Random Fields (DHMRFs). DHMRFs take structural uncertainty into consideration and define a joint distribution of both model structure and class labels. The joint distribution is an exponential family distribution. As a conditional model, DHMRFs relax the independence assumption as made in directed models. Since exact inference is intractable, a variational method is developed to learn the model’s parameters and to find the MAP model structure and label assignments. We apply DHMRFs to a real-world web data extraction task. Experimental results show that: (1) integrated web data extraction models can achieve significant improvements on both record detection and attribute labeling compared to decoupled models; (2) in diverse web data extraction DHMRFs can potentially address the blocky artifact issue which is suffered by fixed-structured hierarchical models. Keywords: conditional random fields, dynamic hierarchical Markov random fields, integrated web data extraction, statistical hierarchical modeling, blocky artifact issue
4 0.29129806 42 jmlr-2008-HPB: A Model for Handling BN Nodes with High Cardinality Parents
Author: Jorge Jambeiro Filho, Jacques Wainer
Abstract: We replaced the conditional probability tables of Bayesian network nodes whose parents have high cardinality with a multilevel empirical hierarchical Bayesian model called hierarchical pattern Bayes (HPB). The resulting Bayesian networks achieved significant performance improvements over Bayesian networks with the same structure and traditional conditional probability tables, over Bayesian networks with simpler structures like naïve Bayes and tree augmented naïve Bayes, over Bayesian networks where traditional conditional probability tables were substituted by noisy-OR gates, default tables, decision trees and decision graphs and over Bayesian networks constructed after a cardinality reduction preprocessing phase using the agglomerative information bottleneck method. Our main tests took place in important fraud detection domains, which are characterized by the presence of high cardinality attributes and by the existence of relevant interactions among them. Other tests, over UCI data sets, show that HPB may have a quite wide applicability. Keywords: probabilistic reasoning, Bayesian networks, smoothing, hierarchical Bayes, empirical Bayes
5 0.27693605 25 jmlr-2008-Consistency of Random Forests and Other Averaging Classifiers
Author: Gérard Biau, Luc Devroye, Gábor Lugosi
Abstract: In the last years of his life, Leo Breiman promoted random forests for use in classification. He suggested using averaging as a means of obtaining good discrimination rules. The base classifiers used for averaging are simple and randomized, often based on random samples from the data. He left a few questions unanswered regarding the consistency of such rules. In this paper, we give a number of theorems that establish the universal consistency of averaging rules. We also show that some popular classifiers, including one suggested by Breiman, are not universally consistent. Keywords: random forests, classification trees, consistency, bagging This paper is dedicated to the memory of Leo Breiman.
6 0.27601263 49 jmlr-2008-Learning Control Knowledge for Forward Search Planning
7 0.26665527 79 jmlr-2008-Ranking Categorical Features Using Generalization Properties
8 0.25585443 20 jmlr-2008-Causal Reasoning with Ancestral Graphs (Special Topic on Causality)
9 0.23128176 13 jmlr-2008-An Error Bound Based on a Worst Likely Assignment
10 0.22461405 96 jmlr-2008-Visualizing Data using t-SNE
11 0.22300284 93 jmlr-2008-Using Markov Blankets for Causal Structure Learning (Special Topic on Causality)
12 0.20814599 33 jmlr-2008-Evidence Contrary to the Statistical View of Boosting
13 0.20239142 15 jmlr-2008-An Information Criterion for Variable Selection in Support Vector Machines (Special Topic on Model Selection)
14 0.19930753 68 jmlr-2008-Nearly Uniform Validation Improves Compression-Based Error Bounds
15 0.18017833 12 jmlr-2008-Algorithms for Sparse Linear Classifiers in the Massive Data Setting
16 0.17684476 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks
17 0.17543112 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
19 0.15512896 78 jmlr-2008-Randomized Online PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension
20 0.15335375 56 jmlr-2008-Magic Moments for Structured Output Prediction
topicId topicWeight
[(0, 0.019), (5, 0.01), (31, 0.012), (40, 0.028), (54, 0.032), (58, 0.027), (66, 0.043), (76, 0.019), (88, 0.081), (92, 0.589), (94, 0.035), (99, 0.016)]
simIndex simValue paperId paperTitle
1 0.96478617 52 jmlr-2008-Learning from Multiple Sources
Author: Koby Crammer, Michael Kearns, Jennifer Wortman
Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields general results for classification and regression. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. We discuss the related problem of learning parameters of a distribution from multiple data sources. Finally, we illustrate our theory through a series of synthetic simulations. Keywords: error bounds, multi-task learning
same-paper 2 0.95152014 77 jmlr-2008-Probabilistic Characterization of Random Decision Trees
Author: Amit Dhurandhar, Alin Dobra
Abstract: In this paper we use the methodology introduced by Dhurandhar and Dobra (2009) for analyzing the error of classifiers and the model selection measures, to analyze decision tree algorithms. The methodology consists of obtaining parametric expressions for the moments of the generalization error (GE) for the classification model of interest, followed by plotting these expressions for interpretability. The major challenge in applying the methodology to decision trees, the main theme of this work, is customizing the generic expressions for the moments of GE to this particular classification algorithm. The specific contributions we make in this paper are: (a) we primarily characterize a subclass of decision trees namely, Random decision trees, (b) we discuss how the analysis extends to other decision tree algorithms and (c) in order to extend the analysis to certain model selection measures, we generalize the relationships between the moments of GE and moments of the model selection measures given in (Dhurandhar and Dobra, 2009) to randomized classification algorithms. An empirical comparison of the proposed method with Monte Carlo and distribution free bounds obtained using Breiman’s formula, depicts the advantages of the method in terms of running time and accuracy. It thus showcases the use of the deployed methodology as an exploratory tool to study learning algorithms. Keywords: moments, generalization error, decision trees
3 0.66594225 79 jmlr-2008-Ranking Categorical Features Using Generalization Properties
Author: Sivan Sabato, Shai Shalev-Shwartz
Abstract: Feature ranking is a fundamental machine learning task with various applications, including feature selection and decision tree learning. We describe and analyze a new feature ranking method that supports categorical features with a large number of possible values. We show that existing ranking criteria rank a feature according to the training error of a predictor based on the feature. This approach can fail when ranking categorical features with many values. We propose the Ginger ranking criterion, that estimates the generalization error of the predictor associated with the Gini index. We show that for almost all training sets, the Ginger criterion produces an accurate estimation of the true generalization error, regardless of the number of values in a categorical feature. We also address the question of finding the optimal predictor that is based on a single categorical feature. It is shown that the predictor associated with the misclassification error criterion has the minimal expected generalization error. We bound the bias of this predictor with respect to the generalization error of the Bayes optimal predictor, and analyze its concentration properties. We demonstrate the efficiency of our approach for feature selection and for learning decision trees in a series of experiments with synthetic and natural data sets. Keywords: feature ranking, categorical features, generalization bounds, Gini index, decision trees
4 0.62277538 68 jmlr-2008-Nearly Uniform Validation Improves Compression-Based Error Bounds
Author: Eric Bax
Abstract: This paper develops bounds on out-of-sample error rates for support vector machines (SVMs). The bounds are based on the numbers of support vectors in the SVMs rather than on VC dimension. The bounds developed here improve on support vector counting bounds derived using Littlestone and Warmuth’s compression-based bounding technique. Keywords: compression, error bound, support vector machine, nearly uniform
5 0.59520918 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
Author: Bernadetta Tarigan, Sara A. van de Geer
Abstract: The success of support vector machines in binary classification relies on the fact that hinge loss employed in the risk minimization targets the Bayes rule. Recent research explores some extensions of this large margin based method to the multicategory case. We show a moment bound for the socalled multi-hinge loss minimizers based on two kinds of complexity constraints: entropy with bracketing and empirical entropy. Obtaining such a result based on the latter is harder than finding one based on the former. We obtain fast rates of convergence that adapt to the unknown margin. Keywords: multi-hinge classification, all-at-once, moment bound, fast rate, entropy
7 0.55752027 20 jmlr-2008-Causal Reasoning with Ancestral Graphs (Special Topic on Causality)
8 0.53966081 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
9 0.48046267 56 jmlr-2008-Magic Moments for Structured Output Prediction
10 0.47469059 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
11 0.4653444 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss
12 0.44611934 19 jmlr-2008-Bouligand Derivatives and Robustness of Support Vector Machines for Regression
13 0.44464162 16 jmlr-2008-Approximations for Binary Gaussian Process Classification
14 0.44016334 59 jmlr-2008-Maximal Causes for Non-linear Component Extraction
15 0.43975878 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
16 0.43769306 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
17 0.43460461 94 jmlr-2008-Value Function Approximation using Multiple Aggregation for Multiattribute Resource Management
18 0.43137059 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks
19 0.42925945 64 jmlr-2008-Model Selection in Kernel Based Regression using the Influence Function (Special Topic on Model Selection)
20 0.42766172 78 jmlr-2008-Randomized Online PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension