nips nips2013 nips2013-340 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Gilles Louppe, Louis Wehenkel, Antonio Sutera, Pierre Geurts
Abstract: Despite growing interest and practical use in various scientific areas, variable importances derived from tree-based ensemble methods are not well understood from a theoretical point of view. In this work we characterize the Mean Decrease Impurity (MDI) variable importances as measured by an ensemble of totally randomized trees in asymptotic sample and ensemble size conditions. We derive a three-level decomposition of the information jointly provided by all input variables about the output in terms of i) the MDI importance of each input variable, ii) the degree of interaction of a given input variable with the other input variables, iii) the different interaction terms of a given degree. We then show that this MDI importance of a variable is equal to zero if and only if the variable is irrelevant and that the MDI importance of a relevant variable is invariant with respect to the removal or the addition of irrelevant variables. We illustrate these properties on a simple example and discuss how they may change in the case of non-totally randomized trees such as Random Forests and Extra-Trees. 1 Motivation An important task in many scientific fields is the prediction of a response variable based on a set of predictor variables. In many situations though, the aim is not only to make the most accurate predictions of the response but also to identify which predictor variables are the most important to make these predictions, e.g. in order to understand the underlying process. Because of their applicability to a wide range of problems and their capability to both build accurate models and, at the same time, to provide variable importance measures, Random Forests (Breiman, 2001) and variants such as Extra-Trees (Geurts et al., 2006) have become a major data analysis tool used with success in various scientific areas. Despite their extensive use in applied research, only a couple of works have studied the theoretical properties and statistical mechanisms of these algorithms. Zhao (2000), Breiman (2004), Biau et al. (2008); Biau (2012), Meinshausen (2006) and Lin and Jeon (2006) investigated simplified to very realistic variants of these algorithms and proved the consistency of those variants. Little is known however regarding the variable importances computed by Random Forests like algorithms, and – as far as we know – the work of Ishwaran (2007) is indeed the only theoretical study of tree-based variable importance measures. In this work, we aim at filling this gap and present a theoretical analysis of the Mean Decrease Impurity importance derived from ensembles of randomized trees. The rest of the paper is organized as follows: in section 2, we provide the background about ensembles of randomized trees and recall how variable importances can be derived from them; in section 3, we then derive a characterization in asymptotic conditions and show how variable importances derived from totally randomized trees offer a three-level decomposition of the information jointly contained in the input variables about the output; section 4 shows that this characterization only depends on the relevant variables and section 5 1 discusses theses ideas in the context of variants closer to the Random Forest algorithm; section 6 then illustrates all these ideas on an artificial problem; finally, section 7 includes our conclusions and proposes directions of future works. 2 Background In this section, we first describe decision trees, as well as forests of randomized trees. Then, we describe the two major variable importances measures derived from them – including the Mean Decrease Impurity (MDI) importance that we will study in the subsequent sections. 2.1 Single classification and regression trees and random forests A binary classification (resp. regression) tree (Breiman et al., 1984) is an input-output model represented by a tree structure T , from a random input vector (X1 , ..., Xp ) taking its values in X1 × ... × Xp = X to a random output variable Y ∈ Y . Any node t in the tree represents a subset of the space X , with the root node being X itself. Internal nodes t are labeled with a binary test (or split) st = (Xm < c) dividing their subset in two subsets1 corresponding to their two children tL and tR , while the terminal nodes (or leaves) are labeled with a best guess value of the output ˆ variable2 . The predicted output Y for a new instance is the label of the leaf reached by the instance when it is propagated through the tree. A tree is built from a learning sample of size N drawn from P (X1 , ..., Xp , Y ) using a recursive procedure which identifies at each node t the split st = s∗ for which the partition of the Nt node samples into tL and tR maximizes the decrease ∆i(s, t) = i(t) − pL i(tL ) − pR i(tR ) (1) of some impurity measure i(t) (e.g., the Gini index, the Shannon entropy, or the variance of Y ), and where pL = NtL /Nt and pR = NtR /Nt . The construction of the tree stops , e.g., when nodes become pure in terms of Y or when all variables Xi are locally constant. Single trees typically suffer from high variance, which makes them not competitive in terms of accuracy. A very efficient and simple way to address this flaw is to use them in the context of randomization-based ensemble methods. Specifically, the core principle is to introduce random perturbations into the learning procedure in order to produce several different decision trees from a single learning set and to use some aggregation technique to combine the predictions of all these trees. In Bagging (Breiman, 1996), trees are built on random bootstrap copies of the original data, hence producing different decision trees. In Random Forests (Breiman, 2001), Bagging is extended and combined with a randomization of the input variables that are used when considering candidate variables to split internal nodes t. In particular, instead of looking for the best split s∗ among all variables, the Random Forest algorithm selects, at each node, a random subset of K variables and then determines the best split over these latter variables only. 2.2 Variable importances In the context of ensembles of randomized trees, Breiman (2001, 2002) proposed to evaluate the importance of a variable Xm for predicting Y by adding up the weighted impurity decreases p(t)∆i(st , t) for all nodes t where Xm is used, averaged over all NT trees in the forest: Imp(Xm ) = 1 NT p(t)∆i(st , t) T (2) t∈T :v(st )=Xm and where p(t) is the proportion Nt /N of samples reaching t and v(st ) is the variable used in split st . When using the Gini index as impurity function, this measure is known as the Gini importance or Mean Decrease Gini. However, since it can be defined for any impurity measure i(t), we will refer to Equation 2 as the Mean Decrease Impurity importance (MDI), no matter the impurity measure i(t). We will characterize and derive results for this measure in the rest of this text. 1 More generally, splits are defined by a (not necessarily binary) partition of the range Xm of possible values of a single variable Xm . 2 e.g. determined as the majority class j(t) (resp., the average value y (t)) within the subset of the leaf t. ¯ 2 In addition to MDI, Breiman (2001, 2002) also proposed to evaluate the importance of a variable Xm by measuring the Mean Decrease Accuracy (MDA) of the forest when the values of Xm are randomly permuted in the out-of-bag samples. For that reason, this latter measure is also known as the permutation importance. Thanks to popular machine learning softwares (Breiman, 2002; Liaw and Wiener, 2002; Pedregosa et al., 2011), both of these variable importance measures have shown their practical utility in an increasing number of experimental studies. Little is known however regarding their inner workings. Strobl et al. (2007) compare both MDI and MDA and show experimentally that the former is biased towards some predictor variables. As explained by White and Liu (1994) in case of single decision trees, this bias stems from an unfair advantage given by the usual impurity functions i(t) towards predictors with a large number of values. Strobl et al. (2008) later showed that MDA is biased as well, and that it overestimates the importance of correlated variables – although this effect was not confirmed in a later experimental study by Genuer et al. (2010). From a theoretical point of view, Ishwaran (2007) provides a detailed theoretical development of a simplified version of MDA, giving key insights for the understanding of the actual MDA. 3 Variable importances derived from totally randomized tree ensembles Let us now consider the MDI importance as defined by Equation 2, and let us assume a set V = {X1 , ..., Xp } of categorical input variables and a categorical output Y . For the sake of simplicity we will use the Shannon entropy as impurity measure, and focus on totally randomized trees; later on we will discuss other impurity measures and tree construction algorithms. Given a training sample L of N joint observations of X1 , ..., Xp , Y independently drawn from the joint distribution P (X1 , ..., Xp , Y ), let us assume that we infer from it an infinitely large ensemble of totally randomized and fully developed trees. In this setting, a totally randomized and fully developed tree is defined as a decision tree in which each node t is partitioned using a variable Xi picked uniformly at random among those not yet used at the parent nodes of t, and where each t is split into |Xi | sub-trees, i.e., one for each possible value of Xi , and where the recursive construction process halts only when all p variables have been used along the current branch. Hence, in such a tree, leaves are all at the same depth p, and the set of leaves of a fully developed tree is in bijection with the set X of all possible joint configurations of the p input variables. For example, if the input variables are all binary, the resulting tree will have exactly 2p leaves. Theorem 1. The MDI importance of Xm ∈ V for Y as computed with an infinite ensemble of fully developed totally randomized trees and an infinitely large training sample is: p−1 Imp(Xm ) = k=0 1 1 k Cp p − k I(Xm ; Y |B), (3) B∈Pk (V −m ) where V −m denotes the subset V \ {Xm }, Pk (V −m ) is the set of subsets of V −m of cardinality k, and I(Xm ; Y |B) is the conditional mutual information of Xm and Y given the variables in B. Proof. See Appendix B. Theorem 2. For any ensemble of fully developed trees in asymptotic learning sample size conditions (e.g., in the same conditions as those of Theorem 1), we have that p Imp(Xm ) = I(X1 , . . . , Xp ; Y ). (4) m=1 Proof. See Appendix C. Together, theorems 1 and 2 show that variable importances derived from totally randomized trees in asymptotic conditions provide a three-level decomposition of the information I(X1 , . . . , Xp ; Y ) contained in the set of input variables about the output variable. The first level is a decomposition among input variables (see Equation 4 of Theorem 2), the second level is a decomposition along the 3 degrees k of interaction terms of a variable with the other ones (see the outer sum in Equation 3 of Theorem 1), and the third level is a decomposition along the combinations B of interaction terms of fixed size k of possible interacting variables (see the inner sum in Equation 3). We observe that the decomposition includes, for each variable, each and every interaction term of each and every degree weighted in a fashion resulting only from the combinatorics of possible interaction terms. In particular, since all I(Xm ; Y |B) terms are at most equal to H(Y ), the prior entropy of Y , the p terms of the outer sum of Equation 3 are each upper bounded by 1 1 1 1 1 H(Y ) = k C k H(Y ) = H(Y ). k Cp p − k Cp p − k p−1 p −m B∈Pk (V ) As such, the second level decomposition resulting from totally randomized trees makes the p sub1 1 importance terms C k p−k B∈Pk (V −m ) I(Xm ; Y |B) to equally contribute (at most) to the total p importance, even though they each include a combinatorially different number of terms. 4 Importances of relevant and irrelevant variables Following Kohavi and John (1997), let us define as relevant to Y with respect to V a variable Xm for which there exists at least one subset B ⊆ V (possibly empty) such that I(Xm ; Y |B) > 0.3 Thus we define as irrelevant to Y with respect to V a variable Xi for which, for all B ⊆ V , I(Xi ; Y |B) = 0. Remark that if Xi is irrelevant to Y with respect to V , then by definition it is also irrelevant to Y with respect to any subset of V . Theorem 3. Xi ∈ V is irrelevant to Y with respect to V if and only if its infinite sample size importance as computed with an infinite ensemble of fully developed totally randomized trees built on V for Y is 0. Proof. See Appendix D. Lemma 4. Let Xi ∈ V be an irrelevant variable for Y with respect to V . The infinite sample size / importance of Xm ∈ V as computed with an infinite ensemble of fully developed totally randomized trees built on V for Y is the same as the importance derived when using V ∪ {Xi } to build the ensemble of trees for Y . Proof. See Appendix E. Theorem 5. Let VR ⊆ V be the subset of all variables in V that are relevant to Y with respect to V . The infinite sample size importance of any variable Xm ∈ VR as computed with an infinite ensemble of fully developed totally randomized trees built on VR for Y is the same as its importance computed in the same conditions by using all variables in V . That is: p−1 Imp(Xm ) = k=0 r−1 = l=0 1 1 k Cp p − k 1 1 l Cr r − l I(Xm ; Y |B) B∈Pk (V −m ) (5) I(Xm ; Y |B) −m B∈Pl (VR ) where r is the number of relevant variables in VR . Proof. See Appendix F. Theorems 3 and 5 show that the importances computed with an ensemble of totally randomized trees depends only on the relevant variables. Irrelevant variables have a zero importance and do not affect the importance of relevant variables. Practically, we believe that such properties are desirable conditions for a sound criterion assessing the importance of a variable. Indeed, noise should not be credited of any importance and should not make any other variable more (or less) important. 3 Among the relevant variables, we have the marginally relevant ones, for which I(Xm ; Y ) > 0, the strongly relevant ones, for which I(Xm ; Y |V −m ) > 0, and the weakly relevant variables, which are relevant but not strongly relevant. 4 5 Random Forest variants In this section, we consider and discuss variable importances as computed with other types of ensembles of randomized trees. We first show how our results extend to any other impurity measure, and then analyze importances computed by depth-pruned ensemble of randomized trees and those computed by randomized trees built on random subspaces of fixed size. Finally, we discuss the case of non-totally randomized trees. 5.1 Generalization to other impurity measures Although our characterization in sections 3 and 4 uses Shannon entropy as the impurity measure, we show in Appendix I that theorems 1, 3 and 5 hold for other impurity measures, simply substituting conditional mutual information for conditional impurity reduction in the different formulas and in the definition of irrelevant variables. In particular, our results thus hold for the Gini index in classification and can be extended to regression problems using variance as the impurity measure. 5.2 Pruning and random subspaces In sections 3 and 4, we considered totally randomized trees that were fully developed, i.e. until all p variables were used within each branch. When totally randomized trees are developed only up to some smaller depth q ≤ p, we show in Proposition 6 that the variable importances as computed by these trees is limited to the q first terms of Equation 3. We then show in Proposition 7 that these latter importances are actually the same as when each tree of the ensemble is fully developed over a random subspace (Ho, 1998) of q variables drawn prior to its construction. Proposition 6. The importance of Xm ∈ V for Y as computed with an infinite ensemble of pruned totally randomized trees built up to depth q ≤ p and an infinitely large training sample is: q−1 Imp(Xm ) = k=0 1 1 k p−k Cp I(Xm ; Y |B) (6) B∈Pk (V −m ) Proof. See Appendix G. Proposition 7. The importance of Xm ∈ V for Y as computed with an infinite ensemble of pruned totally randomized trees built up to depth q ≤ p and an infinitely large training sample is identical to the importance as computed for Y with an infinite ensemble of fully developed totally randomized trees built on random subspaces of q variables drawn from V . Proof. See Appendix H. As long as q ≥ r (where r denotes the number of relevant variables in V ), it can easily be shown that all relevant variables will still obtain a strictly positive importance, which will however differ in general from the importances computed by fully grown totally randomized trees built over all variables. Also, each irrelevant variable of course keeps an importance equal to zero, which means that, in asymptotic conditions, these pruning and random subspace methods would still allow us identify the relevant variables, as long as we have a good upper bound q on r. 5.3 Non-totally randomized trees In our analysis in the previous sections, trees are built totally at random and hence do not directly relate to those built in Random Forests (Breiman, 2001) or in Extra-Trees (Geurts et al., 2006). To better understand the importances as computed by those algorithms, let us consider a close variant of totally randomized trees: at each node t, let us instead draw uniformly at random 1 ≤ K ≤ p variables and let us choose the one that maximizes ∆i(t). Notice that, for K = 1, this procedure amounts to building ensembles of totally randomized trees as defined before, while, for K = p, it amounts to building classical single trees in a deterministic way. First, the importance of Xm ∈ V as computed with an infinite ensemble of such randomized trees is not the same as Equation 3. For K > 1, masking effects indeed appear: at t, some variables are 5 never selected because there always is some other variable for which ∆i(t) is larger. Such effects tend to pull the best variables at the top of the trees and to push the others at the leaves. As a result, the importance of a variable no longer decomposes into a sum including all I(Xm ; Y |B) terms. The importance of the best variables decomposes into a sum of their mutual information alone or conditioned only with the best others – but not conditioned with all variables since they no longer ever appear at the bottom of trees. By contrast, the importance of the least promising variables now decomposes into a sum of their mutual information conditioned only with all variables – but not alone or conditioned with a couple of others since they no longer ever appear at the top of trees. In other words, because of the guided structure of the trees, the importance of Xm now takes into account only some of the conditioning sets B, which may over- or underestimate its overall relevance. To make things clearer, let us consider a simple example. Let X1 perfectly explains Y and let X2 be a slightly noisy copy of X1 (i.e., I(X1 ; Y ) ≈ I(X2 ; Y ), I(X1 ; Y |X2 ) = and I(X2 ; Y |X1 ) = 0). Using totally randomized trees, the importances of X1 and X2 are nearly equal – the importance of X1 being slightly higher than the importance of X2 : 1 1 1 Imp(X1 ) = I(X1 ; Y ) + I(X1 ; Y |X2 ) = I(X1 ; Y ) + 2 2 2 2 1 1 1 Imp(X2 ) = I(X2 ; Y ) + I(X2 ; Y |X1 ) = I(X2 ; Y ) + 0 2 2 2 In non-totally randomized trees, for K = 2, X1 is always selected at the root node and X2 is always used in its children. Also, since X1 perfectly explains Y , all its children are pure and the reduction of entropy when splitting on X2 is null. As a result, ImpK=2 (X1 ) = I(X1 ; Y ) and ImpK=2 (X2 ) = I(X2 ; Y |X1 ) = 0. Masking effects are here clearly visible: the true importance of X2 is masked by X1 as if X2 were irrelevant, while it is only a bit less informative than X1 . As a direct consequence of the example above, for K > 1, it is no longer true that a variable is irrelevant if and only if its importance is zero. In the same way, it can also be shown that the importances become dependent on the number of irrelevant variables. Let us indeed consider the following counter-example: let us add in the previous example an irrelevant variable Xi with respect to {X1 , X2 } and let us keep K = 2. The probability of selecting X2 at the root node now becomes positive, which means that ImpK=2 (X2 ) now includes I(X2 ; Y ) > 0 and is therefore strictly larger than the importance computed before. For K fixed, adding irrelevant variables dampens masking effects, which thereby makes importances indirectly dependent on the number of irrelevant variables. In conclusion, the importances as computed with totally randomized trees exhibit properties that do not possess, by extension, neither random forests nor extra-trees. With totally randomized trees, the importance of Xm only depends on the relevant variables and is 0 if and only if Xm is irrelevant. As we have shown, it may no longer be the case for K > 1. Asymptotically, the use of totally randomized trees for assessing the importance of a variable may therefore be more appropriate. In a finite setting (i.e., a limited number of samples and a limited number of trees), guiding the choice of the splitting variables remains however a sound strategy. In such a case, I(Xm ; Y |B) cannot be measured neither for all Xm nor for all B. It is therefore pragmatic to promote those that look the most promising – even if the resulting importances may be biased. 6 Illustration on a digit recognition problem In this section, we consider the digit recognition problem of (Breiman et al., 1984) for illustrating variable importances as computed with totally randomized trees. We verify that they match with our theoretical developments and that they decompose as foretold. We also compare these importances with those computed by an ensemble of non-totally randomized trees, as discussed in section 5.3, for increasing values of K. Let us consider a seven-segment indicator displaying numerals using horizontal and vertical lights in on-off combinations, as illustrated in Figure 1. Let Y be a random variable taking its value in {0, 1, ..., 9} with equal probability and let X1 , ..., X7 be binary variables whose values are each determined univocally given the corresponding value of Y in Table 1. Since Table 1 perfectly defines the data distribution, and given the small dimensionality of the problem, it is practicable to directly apply Equation 3 to compute variable importances. To verify our 6 X1 X2 y 0 1 2 3 4 5 6 7 8 9 X3 X4 X5 X6 X7 Eqn. 3 0.412 0.581 0.531 0.542 0.656 0.225 0.372 3.321 K=1 0.414 0.583 0.532 0.543 0.658 0.221 0.368 3.321 x2 1 0 0 0 1 1 1 0 1 1 x3 1 1 1 1 1 0 0 1 1 1 x4 0 0 1 1 1 1 1 0 1 1 x5 1 0 1 0 0 0 1 0 1 0 x6 1 1 0 1 1 1 1 1 1 1 x7 1 0 1 1 0 1 1 0 1 1 Table 1: Values of Y, X1 , ..., X7 Figure 1: 7-segment display X1 X2 X3 X4 X5 X6 X7 x1 1 0 1 1 0 1 1 1 1 1 K=2 0.362 0.663 0.512 0.525 0.731 0.140 0.385 3.321 K=3 0.327 0.715 0.496 0.484 0.778 0.126 0.392 3.321 K=4 0.309 0.757 0.489 0.445 0.810 0.122 0.387 3.321 K=5 0.304 0.787 0.483 0.414 0.827 0.122 0.382 3.321 K=6 0.305 0.801 0.475 0.409 0.831 0.121 0.375 3.321 K=7 0.306 0.799 0.475 0.412 0.835 0.120 0.372 3.321 Table 2: Variable importances as computed with an ensemble of randomized trees, for increasing values of K. Importances at K = 1 follow their theoretical values, as predicted by Equation 3 in Theorem 1. However, as K increases, importances diverge due to masking effects. In accordance with Theorem 2, their sum is also always equal to I(X1 , . . . , X7 ; Y ) = H(Y ) = log2 (10) = 3.321 since inputs allow to perfectly predict the output. theoretical developments, we then compare in Table 2 variable importances as computed by Equation 3 and those yielded by an ensemble of 10000 totally randomized trees (K = 1). Note that given the known structure of the problem, building trees on a sample of finite size that perfectly follows the data distribution amounts to building them on a sample of infinite size. At best, trees can thus be built on a 10-sample dataset, containing exactly one sample for each of the equiprobable outcomes of Y . As the table illustrates, the importances yielded by totally randomized trees match those computed by Equation 3, which confirms Theorem 1. Small differences stem from the fact that a finite number of trees were built in our simulations, but those discrepancies should disappear as the size of the ensemble grows towards infinity. It also shows that importances indeed add up to I(X1 , ...X7 ; Y ), which confirms Theorem 2. Regarding the actual importances, they indicate that X5 is stronger than all others, followed – in that order – by X2 , X4 and X3 which also show large importances. X1 , X7 and X6 appear to be the less informative. The table also reports importances for increasing values of K. As discussed before, we see that a large value of K yields importances that can be either overestimated (e.g., at K = 7, the importances of X2 and X5 are larger than at K = 1) or underestimated due to masking effects (e.g., at K = 7, the importances of X1 , X3 , X4 and X6 are smaller than at K = 1, as if they were less important). It can also be observed that masking effects may even induce changes in the variable rankings (e.g., compare the rankings at K = 1 and at K = 7), which thus confirms that importances are differently affected. To better understand why a variable is important, it is also insightful to look at its decomposition into its p sub-importances terms, as shown in Figure 2. Each row in the plots of the figure corresponds to one the p = 7 variables and each column to a size k of conditioning sets. As such, the value at row m and column k corresponds the importance of Xm when conditioned with k other variables 1 1 (e.g., to the term C k p−k B∈Pk (V −m ) I(Xm ; Y |B) in Equation 3 in the case of totally randomized p trees). In the left plot, for K = 1, the figure first illustrates how importances yielded by totally randomized trees decomposes along the degrees k of interactions terms. We can observe that they each equally contribute (at most) the total importance of a variable. The plot also illustrates why X5 is important: it is informative either alone or conditioned with any combination of the other variables (all of its terms are significantly larger than 0). By contrast, it also clearly shows why 7 K=1 0.5 K=7 X1 X1 X2 X3 X4 X4 X5 X5 X6 X6 X7 0.375 X2 X3 X7 0 1 2 3 4 5 6 0.25 0.125 0 1 2 3 4 5 6 0.0 Figure 2: Decomposition of variable importances along the degrees k of interactions of one variable with the other ones. At K = 1, all I(Xm ; Y |B) are accounted for in the total importance, while at K = 7 only some of them are taken into account due to masking effects. X6 is not important: neither alone nor combined with others X6 seems to be very informative (all of its terms are close to 0). More interestingly, this figure also highlights redundancies: X7 is informative alone or conditioned with a couple of others (the first terms are significantly larger than 0), but becomes uninformative when conditioned with many others (the last terms are closer to 0). The right plot, for K = 7, illustrates the decomposition of importances when variables are chosen in a deterministic way. The first thing to notice is masking effects. Some of the I(Xm ; Y |B) terms are indeed clearly never encountered and their contribution is therefore reduced to 0 in the total importance. For instance, for k = 0, the sub-importances of X2 and X5 are positive, while all others are null, which means that only those two variables are ever selected at the root node, hence masking the others. As a consequence, this also means that the importances of the remaining variables is biased and that it actually only accounts of their relevance when conditioned to X2 or X5 , but not of their relevance in other contexts. At k = 0, masking effects also amplify the contribution of I(X2 ; Y ) (resp. I(X5 ; Y )) since X2 (resp. X5 ) appears more frequently at the root node than in totally randomized trees. In addition, because nodes become pure before reaching depth p, conditioning sets of size k ≥ 4 are never actually encountered, which means that we can no longer know whether variables are still informative when conditioned to many others. All in all, this figure thus indeed confirms that importances as computed with non-totally randomized trees take into account only some of the conditioning sets B, hence biasing the measured importances. 7 Conclusions In this work, we made a first step towards understanding variable importances as computed with a forest of randomized trees. In particular, we derived a theoretical characterization of the Mean Decrease Impurity importances as computed by totally randomized trees in asymptotic conditions. We showed that they offer a three-level decomposition of the information jointly provided by all input variables about the output (Section 3). We then demonstrated (Section 4) that MDI importances as computed by totally randomized trees exhibit desirable properties for assessing the relevance of a variable: it is equal to zero if and only if the variable is irrelevant and it depends only on the relevant variables. We discussed the case of Random Forests and Extra-Trees (Section 5) and finally illustrated our developments on an artificial but insightful example (Section 6). There remain several limitations to our framework that we would like address in the future. First, our results should be adapted to binary splits as used within an actual Random Forest-like algorithm. In this setting, any node t is split in only two subsets, which means that any variable may then appear one or several times within a branch, and thus should make variable importances now dependent on the cardinalities of the input variables. In the same direction, our framework should also be extended to the case of continuous variables. Finally, results presented in this work are valid in an asymptotic setting only. An important direction of future work includes the characterization of the distribution of variable importances in a finite setting. Acknowledgements. Gilles Louppe is a research fellow of the FNRS (Belgium) and acknowledges its financial support. This work is supported by PASCAL2 and the IUAP DYSCO, initiated by the Belgian State, Science Policy Office. 8 References Biau, G. (2012). Analysis of a random forests model. The Journal of Machine Learning Research, 98888:1063–1095. Biau, G., Devroye, L., and Lugosi, G. (2008). Consistency of random forests and other averaging classifiers. The Journal of Machine Learning Research, 9:2015–2033. Breiman, L. (1996). Bagging predictors. Machine learning, 24(2):123–140. Breiman, L. (2001). Random forests. Machine learning, 45(1):5–32. Breiman, L. (2002). Manual on setting up, using, and understanding random forests v3. 1. Statistics Department University of California Berkeley, CA, USA. Breiman, L. (2004). Consistency for a simple model of random forests. Technical report, UC Berkeley. Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and regression trees. Genuer, R., Poggi, J.-M., and Tuleau-Malot, C. (2010). Variable selection using random forests. Pattern Recognition Letters, 31(14):2225–2236. Geurts, P., Ernst, D., and Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63(1):3–42. Ho, T. (1998). The random subspace method for constructing decision forests. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 20(8):832–844. Ishwaran, H. (2007). Variable importance in binary regression trees and forests. Electronic Journal of Statistics, 1:519–537. Kohavi, R. and John, G. H. (1997). Wrappers for feature subset selection. Artificial intelligence, 97(1):273–324. Liaw, A. and Wiener, M. (2002). Classification and regression by randomforest. R news, 2(3):18–22. Lin, Y. and Jeon, Y. (2006). Random forests and adaptive nearest neighbors. Journal of the American Statistical Association, 101(474):578–590. Meinshausen, N. (2006). Quantile regression forests. The Journal of Machine Learning Research, 7:983–999. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: Machine learning in python. The Journal of Machine Learning Research, 12:2825–2830. Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T., and Zeileis, A. (2008). Conditional variable importance for random forests. BMC bioinformatics, 9(1):307. Strobl, C., Boulesteix, A.-L., Zeileis, A., and Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC bioinformatics, 8(1):25. White, A. P. and Liu, W. Z. (1994). Technical note: Bias in information-based measures in decision tree induction. Machine Learning, 15(3):321–329. Zhao, G. (2000). A new perspective on classification. PhD thesis, Utah State University, Department of Mathematics and Statistics. 9
Reference: text
sentIndex sentText sentNum sentScore
1 Understanding variable importances in forests of randomized trees Gilles Louppe, Louis Wehenkel, Antonio Sutera and Pierre Geurts Dept. [sent-1, score-1.315]
2 be Abstract Despite growing interest and practical use in various scientific areas, variable importances derived from tree-based ensemble methods are not well understood from a theoretical point of view. [sent-8, score-0.88]
3 In this work we characterize the Mean Decrease Impurity (MDI) variable importances as measured by an ensemble of totally randomized trees in asymptotic sample and ensemble size conditions. [sent-9, score-1.744]
4 We then show that this MDI importance of a variable is equal to zero if and only if the variable is irrelevant and that the MDI importance of a relevant variable is invariant with respect to the removal or the addition of irrelevant variables. [sent-11, score-0.877]
5 We illustrate these properties on a simple example and discuss how they may change in the case of non-totally randomized trees such as Random Forests and Extra-Trees. [sent-12, score-0.471]
6 Because of their applicability to a wide range of problems and their capability to both build accurate models and, at the same time, to provide variable importance measures, Random Forests (Breiman, 2001) and variants such as Extra-Trees (Geurts et al. [sent-17, score-0.266]
7 Little is known however regarding the variable importances computed by Random Forests like algorithms, and – as far as we know – the work of Ishwaran (2007) is indeed the only theoretical study of tree-based variable importance measures. [sent-22, score-0.992]
8 In this work, we aim at filling this gap and present a theoretical analysis of the Mean Decrease Impurity importance derived from ensembles of randomized trees. [sent-23, score-0.458]
9 2 Background In this section, we first describe decision trees, as well as forests of randomized trees. [sent-25, score-0.359]
10 Then, we describe the two major variable importances measures derived from them – including the Mean Decrease Impurity (MDI) importance that we will study in the subsequent sections. [sent-26, score-0.939]
11 1 Single classification and regression trees and random forests A binary classification (resp. [sent-28, score-0.401]
12 , Xp , Y ) using a recursive procedure which identifies at each node t the split st = s∗ for which the partition of the Nt node samples into tL and tR maximizes the decrease ∆i(s, t) = i(t) − pL i(tL ) − pR i(tR ) (1) of some impurity measure i(t) (e. [sent-43, score-0.439]
13 Single trees typically suffer from high variance, which makes them not competitive in terms of accuracy. [sent-49, score-0.268]
14 Specifically, the core principle is to introduce random perturbations into the learning procedure in order to produce several different decision trees from a single learning set and to use some aggregation technique to combine the predictions of all these trees. [sent-51, score-0.291]
15 In Bagging (Breiman, 1996), trees are built on random bootstrap copies of the original data, hence producing different decision trees. [sent-52, score-0.361]
16 When using the Gini index as impurity function, this measure is known as the Gini importance or Mean Decrease Gini. [sent-57, score-0.441]
17 However, since it can be defined for any impurity measure i(t), we will refer to Equation 2 as the Mean Decrease Impurity importance (MDI), no matter the impurity measure i(t). [sent-58, score-0.701]
18 ¯ 2 In addition to MDI, Breiman (2001, 2002) also proposed to evaluate the importance of a variable Xm by measuring the Mean Decrease Accuracy (MDA) of the forest when the values of Xm are randomly permuted in the out-of-bag samples. [sent-65, score-0.296]
19 , 2011), both of these variable importance measures have shown their practical utility in an increasing number of experimental studies. [sent-68, score-0.272]
20 As explained by White and Liu (1994) in case of single decision trees, this bias stems from an unfair advantage given by the usual impurity functions i(t) towards predictors with a large number of values. [sent-72, score-0.283]
21 (2008) later showed that MDA is biased as well, and that it overestimates the importance of correlated variables – although this effect was not confirmed in a later experimental study by Genuer et al. [sent-74, score-0.237]
22 3 Variable importances derived from totally randomized tree ensembles Let us now consider the MDI importance as defined by Equation 2, and let us assume a set V = {X1 , . [sent-77, score-1.398]
23 For the sake of simplicity we will use the Shannon entropy as impurity measure, and focus on totally randomized trees; later on we will discuss other impurity measures and tree construction algorithms. [sent-81, score-1.064]
24 , Xp , Y ), let us assume that we infer from it an infinitely large ensemble of totally randomized and fully developed trees. [sent-88, score-0.657]
25 In this setting, a totally randomized and fully developed tree is defined as a decision tree in which each node t is partitioned using a variable Xi picked uniformly at random among those not yet used at the parent nodes of t, and where each t is split into |Xi | sub-trees, i. [sent-89, score-0.799]
26 For any ensemble of fully developed trees in asymptotic learning sample size conditions (e. [sent-99, score-0.502]
27 Together, theorems 1 and 2 show that variable importances derived from totally randomized trees in asymptotic conditions provide a three-level decomposition of the information I(X1 , . [sent-107, score-1.515]
28 4 Importances of relevant and irrelevant variables Following Kohavi and John (1997), let us define as relevant to Y with respect to V a variable Xm for which there exists at least one subset B ⊆ V (possibly empty) such that I(Xm ; Y |B) > 0. [sent-115, score-0.356]
29 3 Thus we define as irrelevant to Y with respect to V a variable Xi for which, for all B ⊆ V , I(Xi ; Y |B) = 0. [sent-116, score-0.198]
30 Remark that if Xi is irrelevant to Y with respect to V , then by definition it is also irrelevant to Y with respect to any subset of V . [sent-117, score-0.26]
31 Xi ∈ V is irrelevant to Y with respect to V if and only if its infinite sample size importance as computed with an infinite ensemble of fully developed totally randomized trees built on V for Y is 0. [sent-119, score-1.338]
32 Let Xi ∈ V be an irrelevant variable for Y with respect to V . [sent-123, score-0.198]
33 The infinite sample size / importance of Xm ∈ V as computed with an infinite ensemble of fully developed totally randomized trees built on V for Y is the same as the importance derived when using V ∪ {Xi } to build the ensemble of trees for Y . [sent-124, score-1.826]
34 The infinite sample size importance of any variable Xm ∈ VR as computed with an infinite ensemble of fully developed totally randomized trees built on VR for Y is the same as its importance computed in the same conditions by using all variables in V . [sent-129, score-1.545]
35 Theorems 3 and 5 show that the importances computed with an ensemble of totally randomized trees depends only on the relevant variables. [sent-133, score-1.588]
36 Irrelevant variables have a zero importance and do not affect the importance of relevant variables. [sent-134, score-0.469]
37 Practically, we believe that such properties are desirable conditions for a sound criterion assessing the importance of a variable. [sent-135, score-0.207]
38 Indeed, noise should not be credited of any importance and should not make any other variable more (or less) important. [sent-136, score-0.249]
39 3 Among the relevant variables, we have the marginally relevant ones, for which I(Xm ; Y ) > 0, the strongly relevant ones, for which I(Xm ; Y |V −m ) > 0, and the weakly relevant variables, which are relevant but not strongly relevant. [sent-137, score-0.255]
40 4 5 Random Forest variants In this section, we consider and discuss variable importances as computed with other types of ensembles of randomized trees. [sent-138, score-1.013]
41 We first show how our results extend to any other impurity measure, and then analyze importances computed by depth-pruned ensemble of randomized trees and those computed by randomized trees built on random subspaces of fixed size. [sent-139, score-2.142]
42 Finally, we discuss the case of non-totally randomized trees. [sent-140, score-0.203]
43 In particular, our results thus hold for the Gini index in classification and can be extended to regression problems using variance as the impurity measure. [sent-143, score-0.26]
44 2 Pruning and random subspaces In sections 3 and 4, we considered totally randomized trees that were fully developed, i. [sent-145, score-0.768]
45 When totally randomized trees are developed only up to some smaller depth q ≤ p, we show in Proposition 6 that the variable importances as computed by these trees is limited to the q first terms of Equation 3. [sent-148, score-1.783]
46 We then show in Proposition 7 that these latter importances are actually the same as when each tree of the ensemble is fully developed over a random subspace (Ho, 1998) of q variables drawn prior to its construction. [sent-149, score-0.958]
47 The importance of Xm ∈ V for Y as computed with an infinite ensemble of pruned totally randomized trees built up to depth q ≤ p and an infinitely large training sample is: q−1 Imp(Xm ) = k=0 1 1 k p−k Cp I(Xm ; Y |B) (6) B∈Pk (V −m ) Proof. [sent-151, score-1.187]
48 Also, each irrelevant variable of course keeps an importance equal to zero, which means that, in asymptotic conditions, these pruning and random subspace methods would still allow us identify the relevant variables, as long as we have a good upper bound q on r. [sent-158, score-0.456]
49 3 Non-totally randomized trees In our analysis in the previous sections, trees are built totally at random and hence do not directly relate to those built in Random Forests (Breiman, 2001) or in Extra-Trees (Geurts et al. [sent-160, score-1.125]
50 To better understand the importances as computed by those algorithms, let us consider a close variant of totally randomized trees: at each node t, let us instead draw uniformly at random 1 ≤ K ≤ p variables and let us choose the one that maximizes ∆i(t). [sent-162, score-1.222]
51 Notice that, for K = 1, this procedure amounts to building ensembles of totally randomized trees as defined before, while, for K = p, it amounts to building classical single trees in a deterministic way. [sent-163, score-1.035]
52 First, the importance of Xm ∈ V as computed with an infinite ensemble of such randomized trees is not the same as Equation 3. [sent-164, score-0.829]
53 For K > 1, masking effects indeed appear: at t, some variables are 5 never selected because there always is some other variable for which ∆i(t) is larger. [sent-165, score-0.277]
54 Such effects tend to pull the best variables at the top of the trees and to push the others at the leaves. [sent-166, score-0.384]
55 As a result, the importance of a variable no longer decomposes into a sum including all I(Xm ; Y |B) terms. [sent-167, score-0.299]
56 The importance of the best variables decomposes into a sum of their mutual information alone or conditioned only with the best others – but not conditioned with all variables since they no longer ever appear at the bottom of trees. [sent-168, score-0.516]
57 By contrast, the importance of the least promising variables now decomposes into a sum of their mutual information conditioned only with all variables – but not alone or conditioned with a couple of others since they no longer ever appear at the top of trees. [sent-169, score-0.536]
58 In other words, because of the guided structure of the trees, the importance of Xm now takes into account only some of the conditioning sets B, which may over- or underestimate its overall relevance. [sent-170, score-0.202]
59 Masking effects are here clearly visible: the true importance of X2 is masked by X1 as if X2 were irrelevant, while it is only a bit less informative than X1 . [sent-178, score-0.235]
60 As a direct consequence of the example above, for K > 1, it is no longer true that a variable is irrelevant if and only if its importance is zero. [sent-179, score-0.402]
61 In the same way, it can also be shown that the importances become dependent on the number of irrelevant variables. [sent-180, score-0.773]
62 The probability of selecting X2 at the root node now becomes positive, which means that ImpK=2 (X2 ) now includes I(X2 ; Y ) > 0 and is therefore strictly larger than the importance computed before. [sent-182, score-0.277]
63 For K fixed, adding irrelevant variables dampens masking effects, which thereby makes importances indirectly dependent on the number of irrelevant variables. [sent-183, score-1.08]
64 In conclusion, the importances as computed with totally randomized trees exhibit properties that do not possess, by extension, neither random forests nor extra-trees. [sent-184, score-1.525]
65 With totally randomized trees, the importance of Xm only depends on the relevant variables and is 0 if and only if Xm is irrelevant. [sent-185, score-0.737]
66 Asymptotically, the use of totally randomized trees for assessing the importance of a variable may therefore be more appropriate. [sent-187, score-0.992]
67 It is therefore pragmatic to promote those that look the most promising – even if the resulting importances may be biased. [sent-192, score-0.643]
68 , 1984) for illustrating variable importances as computed with totally randomized trees. [sent-194, score-1.192]
69 We also compare these importances with those computed by an ensemble of non-totally randomized trees, as discussed in section 5. [sent-196, score-1.023]
70 321 Table 2: Variable importances as computed with an ensemble of randomized trees, for increasing values of K. [sent-275, score-1.023]
71 However, as K increases, importances diverge due to masking effects. [sent-277, score-0.764]
72 theoretical developments, we then compare in Table 2 variable importances as computed by Equation 3 and those yielded by an ensemble of 10000 totally randomized trees (K = 1). [sent-283, score-1.625]
73 Note that given the known structure of the problem, building trees on a sample of finite size that perfectly follows the data distribution amounts to building them on a sample of infinite size. [sent-284, score-0.297]
74 At best, trees can thus be built on a 10-sample dataset, containing exactly one sample for each of the equiprobable outcomes of Y . [sent-285, score-0.338]
75 As the table illustrates, the importances yielded by totally randomized trees match those computed by Equation 3, which confirms Theorem 1. [sent-286, score-1.412]
76 Small differences stem from the fact that a finite number of trees were built in our simulations, but those discrepancies should disappear as the size of the ensemble grows towards infinity. [sent-287, score-0.483]
77 It also shows that importances indeed add up to I(X1 , . [sent-288, score-0.643]
78 The table also reports importances for increasing values of K. [sent-294, score-0.643]
79 As discussed before, we see that a large value of K yields importances that can be either overestimated (e. [sent-295, score-0.643]
80 , at K = 7, the importances of X2 and X5 are larger than at K = 1) or underestimated due to masking effects (e. [sent-297, score-0.796]
81 , at K = 7, the importances of X1 , X3 , X4 and X6 are smaller than at K = 1, as if they were less important). [sent-299, score-0.643]
82 It can also be observed that masking effects may even induce changes in the variable rankings (e. [sent-300, score-0.241]
83 , compare the rankings at K = 1 and at K = 7), which thus confirms that importances are differently affected. [sent-302, score-0.663]
84 As such, the value at row m and column k corresponds the importance of Xm when conditioned with k other variables 1 1 (e. [sent-305, score-0.276]
85 , to the term C k p−k B∈Pk (V −m ) I(Xm ; Y |B) in Equation 3 in the case of totally randomized p trees). [sent-307, score-0.449]
86 In the left plot, for K = 1, the figure first illustrates how importances yielded by totally randomized trees decomposes along the degrees k of interactions terms. [sent-308, score-1.429]
87 0 Figure 2: Decomposition of variable importances along the degrees k of interactions of one variable with the other ones. [sent-316, score-0.779]
88 More interestingly, this figure also highlights redundancies: X7 is informative alone or conditioned with a couple of others (the first terms are significantly larger than 0), but becomes uninformative when conditioned with many others (the last terms are closer to 0). [sent-319, score-0.203]
89 The right plot, for K = 7, illustrates the decomposition of importances when variables are chosen in a deterministic way. [sent-320, score-0.758]
90 For instance, for k = 0, the sub-importances of X2 and X5 are positive, while all others are null, which means that only those two variables are ever selected at the root node, hence masking the others. [sent-323, score-0.246]
91 As a consequence, this also means that the importances of the remaining variables is biased and that it actually only accounts of their relevance when conditioned to X2 or X5 , but not of their relevance in other contexts. [sent-324, score-0.776]
92 X5 ) appears more frequently at the root node than in totally randomized trees. [sent-327, score-0.513]
93 In addition, because nodes become pure before reaching depth p, conditioning sets of size k ≥ 4 are never actually encountered, which means that we can no longer know whether variables are still informative when conditioned to many others. [sent-328, score-0.228]
94 All in all, this figure thus indeed confirms that importances as computed with non-totally randomized trees take into account only some of the conditioning sets B, hence biasing the measured importances. [sent-329, score-1.167]
95 7 Conclusions In this work, we made a first step towards understanding variable importances as computed with a forest of randomized trees. [sent-330, score-0.993]
96 In particular, we derived a theoretical characterization of the Mean Decrease Impurity importances as computed by totally randomized trees in asymptotic conditions. [sent-331, score-1.468]
97 We then demonstrated (Section 4) that MDI importances as computed by totally randomized trees exhibit desirable properties for assessing the relevance of a variable: it is equal to zero if and only if the variable is irrelevant and it depends only on the relevant variables. [sent-333, score-1.686]
98 In this setting, any node t is split in only two subsets, which means that any variable may then appear one or several times within a branch, and thus should make variable importances now dependent on the cardinalities of the input variables. [sent-337, score-0.867]
99 An important direction of future work includes the characterization of the distribution of variable importances in a finite setting. [sent-340, score-0.737]
100 Bias in random forest variable importance measures: Illustrations, sources and a solution. [sent-450, score-0.296]
wordName wordTfidf (topN-words)
[('importances', 0.643), ('trees', 0.268), ('xm', 0.26), ('impurity', 0.26), ('totally', 0.246), ('randomized', 0.203), ('importance', 0.181), ('breiman', 0.167), ('mdi', 0.15), ('ensemble', 0.145), ('forests', 0.133), ('irrelevant', 0.13), ('masking', 0.121), ('imp', 0.078), ('built', 0.07), ('variable', 0.068), ('variables', 0.056), ('xp', 0.056), ('geurts', 0.055), ('strobl', 0.055), ('tree', 0.051), ('relevant', 0.051), ('ensembles', 0.05), ('biau', 0.048), ('mda', 0.048), ('gini', 0.048), ('forest', 0.047), ('vr', 0.044), ('node', 0.042), ('pk', 0.042), ('impk', 0.041), ('cp', 0.041), ('conditioned', 0.039), ('decomposition', 0.037), ('ishwaran', 0.033), ('fully', 0.033), ('decrease', 0.033), ('st', 0.032), ('computed', 0.032), ('effects', 0.032), ('bagging', 0.031), ('split', 0.03), ('developed', 0.03), ('perfectly', 0.029), ('equation', 0.029), ('tl', 0.029), ('others', 0.028), ('rms', 0.028), ('alone', 0.027), ('boulesteix', 0.027), ('genuer', 0.027), ('jeon', 0.027), ('liaw', 0.027), ('louppe', 0.027), ('pedregosa', 0.027), ('wehenkel', 0.027), ('zeileis', 0.027), ('appendix', 0.027), ('decomposes', 0.027), ('asymptotic', 0.026), ('assessing', 0.026), ('characterization', 0.026), ('interaction', 0.025), ('depth', 0.025), ('shannon', 0.024), ('derived', 0.024), ('kohavi', 0.024), ('decision', 0.023), ('pl', 0.023), ('longer', 0.023), ('measures', 0.023), ('nt', 0.022), ('nitely', 0.022), ('informative', 0.022), ('root', 0.022), ('illustrates', 0.022), ('nodes', 0.022), ('mutual', 0.021), ('conditioning', 0.021), ('entropy', 0.021), ('yielded', 0.02), ('rankings', 0.02), ('couple', 0.02), ('pure', 0.02), ('developments', 0.019), ('bmc', 0.019), ('ever', 0.019), ('relevance', 0.019), ('belgium', 0.018), ('gilles', 0.018), ('xi', 0.018), ('nite', 0.018), ('subspaces', 0.018), ('wiener', 0.017), ('variants', 0.017), ('pruned', 0.017), ('ho', 0.017), ('insightful', 0.017), ('input', 0.016), ('meinshausen', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 340 nips-2013-Understanding variable importances in forests of randomized trees
Author: Gilles Louppe, Louis Wehenkel, Antonio Sutera, Pierre Geurts
Abstract: Despite growing interest and practical use in various scientific areas, variable importances derived from tree-based ensemble methods are not well understood from a theoretical point of view. In this work we characterize the Mean Decrease Impurity (MDI) variable importances as measured by an ensemble of totally randomized trees in asymptotic sample and ensemble size conditions. We derive a three-level decomposition of the information jointly provided by all input variables about the output in terms of i) the MDI importance of each input variable, ii) the degree of interaction of a given input variable with the other input variables, iii) the different interaction terms of a given degree. We then show that this MDI importance of a variable is equal to zero if and only if the variable is irrelevant and that the MDI importance of a relevant variable is invariant with respect to the removal or the addition of irrelevant variables. We illustrate these properties on a simple example and discuss how they may change in the case of non-totally randomized trees such as Random Forests and Extra-Trees. 1 Motivation An important task in many scientific fields is the prediction of a response variable based on a set of predictor variables. In many situations though, the aim is not only to make the most accurate predictions of the response but also to identify which predictor variables are the most important to make these predictions, e.g. in order to understand the underlying process. Because of their applicability to a wide range of problems and their capability to both build accurate models and, at the same time, to provide variable importance measures, Random Forests (Breiman, 2001) and variants such as Extra-Trees (Geurts et al., 2006) have become a major data analysis tool used with success in various scientific areas. Despite their extensive use in applied research, only a couple of works have studied the theoretical properties and statistical mechanisms of these algorithms. Zhao (2000), Breiman (2004), Biau et al. (2008); Biau (2012), Meinshausen (2006) and Lin and Jeon (2006) investigated simplified to very realistic variants of these algorithms and proved the consistency of those variants. Little is known however regarding the variable importances computed by Random Forests like algorithms, and – as far as we know – the work of Ishwaran (2007) is indeed the only theoretical study of tree-based variable importance measures. In this work, we aim at filling this gap and present a theoretical analysis of the Mean Decrease Impurity importance derived from ensembles of randomized trees. The rest of the paper is organized as follows: in section 2, we provide the background about ensembles of randomized trees and recall how variable importances can be derived from them; in section 3, we then derive a characterization in asymptotic conditions and show how variable importances derived from totally randomized trees offer a three-level decomposition of the information jointly contained in the input variables about the output; section 4 shows that this characterization only depends on the relevant variables and section 5 1 discusses theses ideas in the context of variants closer to the Random Forest algorithm; section 6 then illustrates all these ideas on an artificial problem; finally, section 7 includes our conclusions and proposes directions of future works. 2 Background In this section, we first describe decision trees, as well as forests of randomized trees. Then, we describe the two major variable importances measures derived from them – including the Mean Decrease Impurity (MDI) importance that we will study in the subsequent sections. 2.1 Single classification and regression trees and random forests A binary classification (resp. regression) tree (Breiman et al., 1984) is an input-output model represented by a tree structure T , from a random input vector (X1 , ..., Xp ) taking its values in X1 × ... × Xp = X to a random output variable Y ∈ Y . Any node t in the tree represents a subset of the space X , with the root node being X itself. Internal nodes t are labeled with a binary test (or split) st = (Xm < c) dividing their subset in two subsets1 corresponding to their two children tL and tR , while the terminal nodes (or leaves) are labeled with a best guess value of the output ˆ variable2 . The predicted output Y for a new instance is the label of the leaf reached by the instance when it is propagated through the tree. A tree is built from a learning sample of size N drawn from P (X1 , ..., Xp , Y ) using a recursive procedure which identifies at each node t the split st = s∗ for which the partition of the Nt node samples into tL and tR maximizes the decrease ∆i(s, t) = i(t) − pL i(tL ) − pR i(tR ) (1) of some impurity measure i(t) (e.g., the Gini index, the Shannon entropy, or the variance of Y ), and where pL = NtL /Nt and pR = NtR /Nt . The construction of the tree stops , e.g., when nodes become pure in terms of Y or when all variables Xi are locally constant. Single trees typically suffer from high variance, which makes them not competitive in terms of accuracy. A very efficient and simple way to address this flaw is to use them in the context of randomization-based ensemble methods. Specifically, the core principle is to introduce random perturbations into the learning procedure in order to produce several different decision trees from a single learning set and to use some aggregation technique to combine the predictions of all these trees. In Bagging (Breiman, 1996), trees are built on random bootstrap copies of the original data, hence producing different decision trees. In Random Forests (Breiman, 2001), Bagging is extended and combined with a randomization of the input variables that are used when considering candidate variables to split internal nodes t. In particular, instead of looking for the best split s∗ among all variables, the Random Forest algorithm selects, at each node, a random subset of K variables and then determines the best split over these latter variables only. 2.2 Variable importances In the context of ensembles of randomized trees, Breiman (2001, 2002) proposed to evaluate the importance of a variable Xm for predicting Y by adding up the weighted impurity decreases p(t)∆i(st , t) for all nodes t where Xm is used, averaged over all NT trees in the forest: Imp(Xm ) = 1 NT p(t)∆i(st , t) T (2) t∈T :v(st )=Xm and where p(t) is the proportion Nt /N of samples reaching t and v(st ) is the variable used in split st . When using the Gini index as impurity function, this measure is known as the Gini importance or Mean Decrease Gini. However, since it can be defined for any impurity measure i(t), we will refer to Equation 2 as the Mean Decrease Impurity importance (MDI), no matter the impurity measure i(t). We will characterize and derive results for this measure in the rest of this text. 1 More generally, splits are defined by a (not necessarily binary) partition of the range Xm of possible values of a single variable Xm . 2 e.g. determined as the majority class j(t) (resp., the average value y (t)) within the subset of the leaf t. ¯ 2 In addition to MDI, Breiman (2001, 2002) also proposed to evaluate the importance of a variable Xm by measuring the Mean Decrease Accuracy (MDA) of the forest when the values of Xm are randomly permuted in the out-of-bag samples. For that reason, this latter measure is also known as the permutation importance. Thanks to popular machine learning softwares (Breiman, 2002; Liaw and Wiener, 2002; Pedregosa et al., 2011), both of these variable importance measures have shown their practical utility in an increasing number of experimental studies. Little is known however regarding their inner workings. Strobl et al. (2007) compare both MDI and MDA and show experimentally that the former is biased towards some predictor variables. As explained by White and Liu (1994) in case of single decision trees, this bias stems from an unfair advantage given by the usual impurity functions i(t) towards predictors with a large number of values. Strobl et al. (2008) later showed that MDA is biased as well, and that it overestimates the importance of correlated variables – although this effect was not confirmed in a later experimental study by Genuer et al. (2010). From a theoretical point of view, Ishwaran (2007) provides a detailed theoretical development of a simplified version of MDA, giving key insights for the understanding of the actual MDA. 3 Variable importances derived from totally randomized tree ensembles Let us now consider the MDI importance as defined by Equation 2, and let us assume a set V = {X1 , ..., Xp } of categorical input variables and a categorical output Y . For the sake of simplicity we will use the Shannon entropy as impurity measure, and focus on totally randomized trees; later on we will discuss other impurity measures and tree construction algorithms. Given a training sample L of N joint observations of X1 , ..., Xp , Y independently drawn from the joint distribution P (X1 , ..., Xp , Y ), let us assume that we infer from it an infinitely large ensemble of totally randomized and fully developed trees. In this setting, a totally randomized and fully developed tree is defined as a decision tree in which each node t is partitioned using a variable Xi picked uniformly at random among those not yet used at the parent nodes of t, and where each t is split into |Xi | sub-trees, i.e., one for each possible value of Xi , and where the recursive construction process halts only when all p variables have been used along the current branch. Hence, in such a tree, leaves are all at the same depth p, and the set of leaves of a fully developed tree is in bijection with the set X of all possible joint configurations of the p input variables. For example, if the input variables are all binary, the resulting tree will have exactly 2p leaves. Theorem 1. The MDI importance of Xm ∈ V for Y as computed with an infinite ensemble of fully developed totally randomized trees and an infinitely large training sample is: p−1 Imp(Xm ) = k=0 1 1 k Cp p − k I(Xm ; Y |B), (3) B∈Pk (V −m ) where V −m denotes the subset V \ {Xm }, Pk (V −m ) is the set of subsets of V −m of cardinality k, and I(Xm ; Y |B) is the conditional mutual information of Xm and Y given the variables in B. Proof. See Appendix B. Theorem 2. For any ensemble of fully developed trees in asymptotic learning sample size conditions (e.g., in the same conditions as those of Theorem 1), we have that p Imp(Xm ) = I(X1 , . . . , Xp ; Y ). (4) m=1 Proof. See Appendix C. Together, theorems 1 and 2 show that variable importances derived from totally randomized trees in asymptotic conditions provide a three-level decomposition of the information I(X1 , . . . , Xp ; Y ) contained in the set of input variables about the output variable. The first level is a decomposition among input variables (see Equation 4 of Theorem 2), the second level is a decomposition along the 3 degrees k of interaction terms of a variable with the other ones (see the outer sum in Equation 3 of Theorem 1), and the third level is a decomposition along the combinations B of interaction terms of fixed size k of possible interacting variables (see the inner sum in Equation 3). We observe that the decomposition includes, for each variable, each and every interaction term of each and every degree weighted in a fashion resulting only from the combinatorics of possible interaction terms. In particular, since all I(Xm ; Y |B) terms are at most equal to H(Y ), the prior entropy of Y , the p terms of the outer sum of Equation 3 are each upper bounded by 1 1 1 1 1 H(Y ) = k C k H(Y ) = H(Y ). k Cp p − k Cp p − k p−1 p −m B∈Pk (V ) As such, the second level decomposition resulting from totally randomized trees makes the p sub1 1 importance terms C k p−k B∈Pk (V −m ) I(Xm ; Y |B) to equally contribute (at most) to the total p importance, even though they each include a combinatorially different number of terms. 4 Importances of relevant and irrelevant variables Following Kohavi and John (1997), let us define as relevant to Y with respect to V a variable Xm for which there exists at least one subset B ⊆ V (possibly empty) such that I(Xm ; Y |B) > 0.3 Thus we define as irrelevant to Y with respect to V a variable Xi for which, for all B ⊆ V , I(Xi ; Y |B) = 0. Remark that if Xi is irrelevant to Y with respect to V , then by definition it is also irrelevant to Y with respect to any subset of V . Theorem 3. Xi ∈ V is irrelevant to Y with respect to V if and only if its infinite sample size importance as computed with an infinite ensemble of fully developed totally randomized trees built on V for Y is 0. Proof. See Appendix D. Lemma 4. Let Xi ∈ V be an irrelevant variable for Y with respect to V . The infinite sample size / importance of Xm ∈ V as computed with an infinite ensemble of fully developed totally randomized trees built on V for Y is the same as the importance derived when using V ∪ {Xi } to build the ensemble of trees for Y . Proof. See Appendix E. Theorem 5. Let VR ⊆ V be the subset of all variables in V that are relevant to Y with respect to V . The infinite sample size importance of any variable Xm ∈ VR as computed with an infinite ensemble of fully developed totally randomized trees built on VR for Y is the same as its importance computed in the same conditions by using all variables in V . That is: p−1 Imp(Xm ) = k=0 r−1 = l=0 1 1 k Cp p − k 1 1 l Cr r − l I(Xm ; Y |B) B∈Pk (V −m ) (5) I(Xm ; Y |B) −m B∈Pl (VR ) where r is the number of relevant variables in VR . Proof. See Appendix F. Theorems 3 and 5 show that the importances computed with an ensemble of totally randomized trees depends only on the relevant variables. Irrelevant variables have a zero importance and do not affect the importance of relevant variables. Practically, we believe that such properties are desirable conditions for a sound criterion assessing the importance of a variable. Indeed, noise should not be credited of any importance and should not make any other variable more (or less) important. 3 Among the relevant variables, we have the marginally relevant ones, for which I(Xm ; Y ) > 0, the strongly relevant ones, for which I(Xm ; Y |V −m ) > 0, and the weakly relevant variables, which are relevant but not strongly relevant. 4 5 Random Forest variants In this section, we consider and discuss variable importances as computed with other types of ensembles of randomized trees. We first show how our results extend to any other impurity measure, and then analyze importances computed by depth-pruned ensemble of randomized trees and those computed by randomized trees built on random subspaces of fixed size. Finally, we discuss the case of non-totally randomized trees. 5.1 Generalization to other impurity measures Although our characterization in sections 3 and 4 uses Shannon entropy as the impurity measure, we show in Appendix I that theorems 1, 3 and 5 hold for other impurity measures, simply substituting conditional mutual information for conditional impurity reduction in the different formulas and in the definition of irrelevant variables. In particular, our results thus hold for the Gini index in classification and can be extended to regression problems using variance as the impurity measure. 5.2 Pruning and random subspaces In sections 3 and 4, we considered totally randomized trees that were fully developed, i.e. until all p variables were used within each branch. When totally randomized trees are developed only up to some smaller depth q ≤ p, we show in Proposition 6 that the variable importances as computed by these trees is limited to the q first terms of Equation 3. We then show in Proposition 7 that these latter importances are actually the same as when each tree of the ensemble is fully developed over a random subspace (Ho, 1998) of q variables drawn prior to its construction. Proposition 6. The importance of Xm ∈ V for Y as computed with an infinite ensemble of pruned totally randomized trees built up to depth q ≤ p and an infinitely large training sample is: q−1 Imp(Xm ) = k=0 1 1 k p−k Cp I(Xm ; Y |B) (6) B∈Pk (V −m ) Proof. See Appendix G. Proposition 7. The importance of Xm ∈ V for Y as computed with an infinite ensemble of pruned totally randomized trees built up to depth q ≤ p and an infinitely large training sample is identical to the importance as computed for Y with an infinite ensemble of fully developed totally randomized trees built on random subspaces of q variables drawn from V . Proof. See Appendix H. As long as q ≥ r (where r denotes the number of relevant variables in V ), it can easily be shown that all relevant variables will still obtain a strictly positive importance, which will however differ in general from the importances computed by fully grown totally randomized trees built over all variables. Also, each irrelevant variable of course keeps an importance equal to zero, which means that, in asymptotic conditions, these pruning and random subspace methods would still allow us identify the relevant variables, as long as we have a good upper bound q on r. 5.3 Non-totally randomized trees In our analysis in the previous sections, trees are built totally at random and hence do not directly relate to those built in Random Forests (Breiman, 2001) or in Extra-Trees (Geurts et al., 2006). To better understand the importances as computed by those algorithms, let us consider a close variant of totally randomized trees: at each node t, let us instead draw uniformly at random 1 ≤ K ≤ p variables and let us choose the one that maximizes ∆i(t). Notice that, for K = 1, this procedure amounts to building ensembles of totally randomized trees as defined before, while, for K = p, it amounts to building classical single trees in a deterministic way. First, the importance of Xm ∈ V as computed with an infinite ensemble of such randomized trees is not the same as Equation 3. For K > 1, masking effects indeed appear: at t, some variables are 5 never selected because there always is some other variable for which ∆i(t) is larger. Such effects tend to pull the best variables at the top of the trees and to push the others at the leaves. As a result, the importance of a variable no longer decomposes into a sum including all I(Xm ; Y |B) terms. The importance of the best variables decomposes into a sum of their mutual information alone or conditioned only with the best others – but not conditioned with all variables since they no longer ever appear at the bottom of trees. By contrast, the importance of the least promising variables now decomposes into a sum of their mutual information conditioned only with all variables – but not alone or conditioned with a couple of others since they no longer ever appear at the top of trees. In other words, because of the guided structure of the trees, the importance of Xm now takes into account only some of the conditioning sets B, which may over- or underestimate its overall relevance. To make things clearer, let us consider a simple example. Let X1 perfectly explains Y and let X2 be a slightly noisy copy of X1 (i.e., I(X1 ; Y ) ≈ I(X2 ; Y ), I(X1 ; Y |X2 ) = and I(X2 ; Y |X1 ) = 0). Using totally randomized trees, the importances of X1 and X2 are nearly equal – the importance of X1 being slightly higher than the importance of X2 : 1 1 1 Imp(X1 ) = I(X1 ; Y ) + I(X1 ; Y |X2 ) = I(X1 ; Y ) + 2 2 2 2 1 1 1 Imp(X2 ) = I(X2 ; Y ) + I(X2 ; Y |X1 ) = I(X2 ; Y ) + 0 2 2 2 In non-totally randomized trees, for K = 2, X1 is always selected at the root node and X2 is always used in its children. Also, since X1 perfectly explains Y , all its children are pure and the reduction of entropy when splitting on X2 is null. As a result, ImpK=2 (X1 ) = I(X1 ; Y ) and ImpK=2 (X2 ) = I(X2 ; Y |X1 ) = 0. Masking effects are here clearly visible: the true importance of X2 is masked by X1 as if X2 were irrelevant, while it is only a bit less informative than X1 . As a direct consequence of the example above, for K > 1, it is no longer true that a variable is irrelevant if and only if its importance is zero. In the same way, it can also be shown that the importances become dependent on the number of irrelevant variables. Let us indeed consider the following counter-example: let us add in the previous example an irrelevant variable Xi with respect to {X1 , X2 } and let us keep K = 2. The probability of selecting X2 at the root node now becomes positive, which means that ImpK=2 (X2 ) now includes I(X2 ; Y ) > 0 and is therefore strictly larger than the importance computed before. For K fixed, adding irrelevant variables dampens masking effects, which thereby makes importances indirectly dependent on the number of irrelevant variables. In conclusion, the importances as computed with totally randomized trees exhibit properties that do not possess, by extension, neither random forests nor extra-trees. With totally randomized trees, the importance of Xm only depends on the relevant variables and is 0 if and only if Xm is irrelevant. As we have shown, it may no longer be the case for K > 1. Asymptotically, the use of totally randomized trees for assessing the importance of a variable may therefore be more appropriate. In a finite setting (i.e., a limited number of samples and a limited number of trees), guiding the choice of the splitting variables remains however a sound strategy. In such a case, I(Xm ; Y |B) cannot be measured neither for all Xm nor for all B. It is therefore pragmatic to promote those that look the most promising – even if the resulting importances may be biased. 6 Illustration on a digit recognition problem In this section, we consider the digit recognition problem of (Breiman et al., 1984) for illustrating variable importances as computed with totally randomized trees. We verify that they match with our theoretical developments and that they decompose as foretold. We also compare these importances with those computed by an ensemble of non-totally randomized trees, as discussed in section 5.3, for increasing values of K. Let us consider a seven-segment indicator displaying numerals using horizontal and vertical lights in on-off combinations, as illustrated in Figure 1. Let Y be a random variable taking its value in {0, 1, ..., 9} with equal probability and let X1 , ..., X7 be binary variables whose values are each determined univocally given the corresponding value of Y in Table 1. Since Table 1 perfectly defines the data distribution, and given the small dimensionality of the problem, it is practicable to directly apply Equation 3 to compute variable importances. To verify our 6 X1 X2 y 0 1 2 3 4 5 6 7 8 9 X3 X4 X5 X6 X7 Eqn. 3 0.412 0.581 0.531 0.542 0.656 0.225 0.372 3.321 K=1 0.414 0.583 0.532 0.543 0.658 0.221 0.368 3.321 x2 1 0 0 0 1 1 1 0 1 1 x3 1 1 1 1 1 0 0 1 1 1 x4 0 0 1 1 1 1 1 0 1 1 x5 1 0 1 0 0 0 1 0 1 0 x6 1 1 0 1 1 1 1 1 1 1 x7 1 0 1 1 0 1 1 0 1 1 Table 1: Values of Y, X1 , ..., X7 Figure 1: 7-segment display X1 X2 X3 X4 X5 X6 X7 x1 1 0 1 1 0 1 1 1 1 1 K=2 0.362 0.663 0.512 0.525 0.731 0.140 0.385 3.321 K=3 0.327 0.715 0.496 0.484 0.778 0.126 0.392 3.321 K=4 0.309 0.757 0.489 0.445 0.810 0.122 0.387 3.321 K=5 0.304 0.787 0.483 0.414 0.827 0.122 0.382 3.321 K=6 0.305 0.801 0.475 0.409 0.831 0.121 0.375 3.321 K=7 0.306 0.799 0.475 0.412 0.835 0.120 0.372 3.321 Table 2: Variable importances as computed with an ensemble of randomized trees, for increasing values of K. Importances at K = 1 follow their theoretical values, as predicted by Equation 3 in Theorem 1. However, as K increases, importances diverge due to masking effects. In accordance with Theorem 2, their sum is also always equal to I(X1 , . . . , X7 ; Y ) = H(Y ) = log2 (10) = 3.321 since inputs allow to perfectly predict the output. theoretical developments, we then compare in Table 2 variable importances as computed by Equation 3 and those yielded by an ensemble of 10000 totally randomized trees (K = 1). Note that given the known structure of the problem, building trees on a sample of finite size that perfectly follows the data distribution amounts to building them on a sample of infinite size. At best, trees can thus be built on a 10-sample dataset, containing exactly one sample for each of the equiprobable outcomes of Y . As the table illustrates, the importances yielded by totally randomized trees match those computed by Equation 3, which confirms Theorem 1. Small differences stem from the fact that a finite number of trees were built in our simulations, but those discrepancies should disappear as the size of the ensemble grows towards infinity. It also shows that importances indeed add up to I(X1 , ...X7 ; Y ), which confirms Theorem 2. Regarding the actual importances, they indicate that X5 is stronger than all others, followed – in that order – by X2 , X4 and X3 which also show large importances. X1 , X7 and X6 appear to be the less informative. The table also reports importances for increasing values of K. As discussed before, we see that a large value of K yields importances that can be either overestimated (e.g., at K = 7, the importances of X2 and X5 are larger than at K = 1) or underestimated due to masking effects (e.g., at K = 7, the importances of X1 , X3 , X4 and X6 are smaller than at K = 1, as if they were less important). It can also be observed that masking effects may even induce changes in the variable rankings (e.g., compare the rankings at K = 1 and at K = 7), which thus confirms that importances are differently affected. To better understand why a variable is important, it is also insightful to look at its decomposition into its p sub-importances terms, as shown in Figure 2. Each row in the plots of the figure corresponds to one the p = 7 variables and each column to a size k of conditioning sets. As such, the value at row m and column k corresponds the importance of Xm when conditioned with k other variables 1 1 (e.g., to the term C k p−k B∈Pk (V −m ) I(Xm ; Y |B) in Equation 3 in the case of totally randomized p trees). In the left plot, for K = 1, the figure first illustrates how importances yielded by totally randomized trees decomposes along the degrees k of interactions terms. We can observe that they each equally contribute (at most) the total importance of a variable. The plot also illustrates why X5 is important: it is informative either alone or conditioned with any combination of the other variables (all of its terms are significantly larger than 0). By contrast, it also clearly shows why 7 K=1 0.5 K=7 X1 X1 X2 X3 X4 X4 X5 X5 X6 X6 X7 0.375 X2 X3 X7 0 1 2 3 4 5 6 0.25 0.125 0 1 2 3 4 5 6 0.0 Figure 2: Decomposition of variable importances along the degrees k of interactions of one variable with the other ones. At K = 1, all I(Xm ; Y |B) are accounted for in the total importance, while at K = 7 only some of them are taken into account due to masking effects. X6 is not important: neither alone nor combined with others X6 seems to be very informative (all of its terms are close to 0). More interestingly, this figure also highlights redundancies: X7 is informative alone or conditioned with a couple of others (the first terms are significantly larger than 0), but becomes uninformative when conditioned with many others (the last terms are closer to 0). The right plot, for K = 7, illustrates the decomposition of importances when variables are chosen in a deterministic way. The first thing to notice is masking effects. Some of the I(Xm ; Y |B) terms are indeed clearly never encountered and their contribution is therefore reduced to 0 in the total importance. For instance, for k = 0, the sub-importances of X2 and X5 are positive, while all others are null, which means that only those two variables are ever selected at the root node, hence masking the others. As a consequence, this also means that the importances of the remaining variables is biased and that it actually only accounts of their relevance when conditioned to X2 or X5 , but not of their relevance in other contexts. At k = 0, masking effects also amplify the contribution of I(X2 ; Y ) (resp. I(X5 ; Y )) since X2 (resp. X5 ) appears more frequently at the root node than in totally randomized trees. In addition, because nodes become pure before reaching depth p, conditioning sets of size k ≥ 4 are never actually encountered, which means that we can no longer know whether variables are still informative when conditioned to many others. All in all, this figure thus indeed confirms that importances as computed with non-totally randomized trees take into account only some of the conditioning sets B, hence biasing the measured importances. 7 Conclusions In this work, we made a first step towards understanding variable importances as computed with a forest of randomized trees. In particular, we derived a theoretical characterization of the Mean Decrease Impurity importances as computed by totally randomized trees in asymptotic conditions. We showed that they offer a three-level decomposition of the information jointly provided by all input variables about the output (Section 3). We then demonstrated (Section 4) that MDI importances as computed by totally randomized trees exhibit desirable properties for assessing the relevance of a variable: it is equal to zero if and only if the variable is irrelevant and it depends only on the relevant variables. We discussed the case of Random Forests and Extra-Trees (Section 5) and finally illustrated our developments on an artificial but insightful example (Section 6). There remain several limitations to our framework that we would like address in the future. First, our results should be adapted to binary splits as used within an actual Random Forest-like algorithm. In this setting, any node t is split in only two subsets, which means that any variable may then appear one or several times within a branch, and thus should make variable importances now dependent on the cardinalities of the input variables. In the same direction, our framework should also be extended to the case of continuous variables. Finally, results presented in this work are valid in an asymptotic setting only. An important direction of future work includes the characterization of the distribution of variable importances in a finite setting. Acknowledgements. Gilles Louppe is a research fellow of the FNRS (Belgium) and acknowledges its financial support. This work is supported by PASCAL2 and the IUAP DYSCO, initiated by the Belgian State, Science Policy Office. 8 References Biau, G. (2012). Analysis of a random forests model. The Journal of Machine Learning Research, 98888:1063–1095. Biau, G., Devroye, L., and Lugosi, G. (2008). Consistency of random forests and other averaging classifiers. The Journal of Machine Learning Research, 9:2015–2033. Breiman, L. (1996). Bagging predictors. Machine learning, 24(2):123–140. Breiman, L. (2001). Random forests. Machine learning, 45(1):5–32. Breiman, L. (2002). Manual on setting up, using, and understanding random forests v3. 1. Statistics Department University of California Berkeley, CA, USA. Breiman, L. (2004). Consistency for a simple model of random forests. Technical report, UC Berkeley. Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and regression trees. Genuer, R., Poggi, J.-M., and Tuleau-Malot, C. (2010). Variable selection using random forests. Pattern Recognition Letters, 31(14):2225–2236. Geurts, P., Ernst, D., and Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63(1):3–42. Ho, T. (1998). The random subspace method for constructing decision forests. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 20(8):832–844. Ishwaran, H. (2007). Variable importance in binary regression trees and forests. Electronic Journal of Statistics, 1:519–537. Kohavi, R. and John, G. H. (1997). Wrappers for feature subset selection. Artificial intelligence, 97(1):273–324. Liaw, A. and Wiener, M. (2002). Classification and regression by randomforest. R news, 2(3):18–22. Lin, Y. and Jeon, Y. (2006). Random forests and adaptive nearest neighbors. Journal of the American Statistical Association, 101(474):578–590. Meinshausen, N. (2006). Quantile regression forests. The Journal of Machine Learning Research, 7:983–999. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: Machine learning in python. The Journal of Machine Learning Research, 12:2825–2830. Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T., and Zeileis, A. (2008). Conditional variable importance for random forests. BMC bioinformatics, 9(1):307. Strobl, C., Boulesteix, A.-L., Zeileis, A., and Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC bioinformatics, 8(1):25. White, A. P. and Liu, W. Z. (1994). Technical note: Bias in information-based measures in decision tree induction. Machine Learning, 15(3):321–329. Zhao, G. (2000). A new perspective on classification. PhD thesis, Utah State University, Department of Mathematics and Statistics. 9
2 0.1634362 82 nips-2013-Decision Jungles: Compact and Rich Models for Classification
Author: Jamie Shotton, Toby Sharp, Pushmeet Kohli, Sebastian Nowozin, John Winn, Antonio Criminisi
Abstract: Randomized decision trees and forests have a rich history in machine learning and have seen considerable success in application, perhaps particularly so for computer vision. However, they face a fundamental limitation: given enough data, the number of nodes in decision trees will grow exponentially with depth. For certain applications, for example on mobile or embedded processors, memory is a limited resource, and so the exponential growth of trees limits their depth, and thus their potential accuracy. This paper proposes decision jungles, revisiting the idea of ensembles of rooted decision directed acyclic graphs (DAGs), and shows these to be compact and powerful discriminative models for classification. Unlike conventional decision trees that only allow one path to every node, a DAG in a decision jungle allows multiple paths from the root to each leaf. We present and compare two new node merging algorithms that jointly optimize both the features and the structure of the DAGs efficiently. During training, node splitting and node merging are driven by the minimization of exactly the same objective function, here the weighted sum of entropies at the leaves. Results on varied datasets show that, compared to decision forests and several other baselines, decision jungles require dramatically less memory while considerably improving generalization. 1
3 0.08110562 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes
Author: Gunnar Kedenburg, Raphael Fonteneau, Remi Munos
Abstract: This paper addresses the problem of online planning in Markov decision processes using a randomized simulator, under a budget constraint. We propose a new algorithm which is based on the construction of a forest of planning trees, where each tree corresponds to a random realization of the stochastic environment. The trees are constructed using a “safe” optimistic planning strategy combining the optimistic principle (in order to explore the most promising part of the search space first) with a safety principle (which guarantees a certain amount of uniform exploration). In the decision-making step of the algorithm, the individual trees are aggregated and an immediate action is recommended. We provide a finite-sample analysis and discuss the trade-off between the principles of optimism and safety. We also report numerical results on a benchmark problem. Our algorithm performs as well as state-of-the-art optimistic planning algorithms, and better than a related algorithm which additionally assumes the knowledge of all transition distributions. 1
4 0.074739188 355 nips-2013-Which Space Partitioning Tree to Use for Search?
Author: Parikshit Ram, Alexander Gray
Abstract: We consider the task of nearest-neighbor search with the class of binary-spacepartitioning trees, which includes kd-trees, principal axis trees and random projection trees, and try to rigorously answer the question “which tree to use for nearestneighbor search?” To this end, we present the theoretical results which imply that trees with better vector quantization performance have better search performance guarantees. We also explore another factor affecting the search performance – margins of the partitions in these trees. We demonstrate, both theoretically and empirically, that large margin partitions can improve tree search performance. 1 Nearest-neighbor search Nearest-neighbor search is ubiquitous in computer science. Several techniques exist for nearestneighbor search, but most algorithms can be categorized into two following groups based on the indexing scheme used – (1) search with hierarchical tree indices, or (2) search with hash-based indices. Although multidimensional binary space-partitioning trees (or BSP-trees), such as kd-trees [1], are widely used for nearest-neighbor search, it is believed that their performances degrade with increasing dimensions. Standard worst-case analyses of search with BSP-trees in high dimensions usually lead to trivial guarantees (such as, an Ω(n) search time guarantee for a single nearest-neighbor query in a set of n points). This is generally attributed to the “curse of dimensionality” – in the worst case, the high dimensionality can force the search algorithm to visit every node in the BSP-tree. However, these BSP-trees are very simple and intuitive, and still used in practice with success. The occasional favorable performances of BSP-trees in high dimensions are attributed to the low “intrinsic” dimensionality of real data. However, no clear relationship between the BSP-tree search performance and the intrinsic data properties is known. We present theoretical results which link the search performance of BSP-trees to properties of the data and the tree. This allows us to identify implicit factors influencing BSP-tree search performance — knowing these driving factors allows us to develop successful heuristics for BSP-trees with improved search performance. Each node in a BSP-tree represents a region of the space and each non-leaf node has a left and right child representing a disjoint partition of this region with some separating hyperplane and threshold (w, b). A search query on this tree is usually answered with a depth-first branch-and-bound algorithm. Algorithm 1 presents a simplified version where a search query is answered with a small set of neighbor candidates of any desired size by performing a greedy depth-first tree traversal to a specified depth. This is known as defeatist tree search. We are not aware of any data-dependent analysis of the quality of the results from defeatist BSP-tree search. However, Verma et al. (2009) [2] presented adaptive data-dependent analyses of some BSP-trees for the task of vector quantization. These results show precise connections between the quantization performance of the BSP-trees and certain properties of the data (we will present these data properties in Section 2). 1 Algorithm 1 BSP-tree search Input: BSP-tree T on set S, Query q, Desired depth l Output: Candidate neighbor p current tree depth lc ← 0 current tree node Tc ← T while lc < l do if Tc .w, q + Tc .b ≤ 0 then Tc ← Tc .left child else Tc ← Tc .right child end if Increment depth lc ← lc + 1 end while p ← arg minr∈Tc ∩S q − r . (a) kd-tree (b) RP-tree (c) MM-tree Figure 1: Binary space-partitioning trees. We establish search performance guarantees for BSP-trees by linking their nearest-neighbor performance to their vector quantization performance and utilizing the recent guarantees on the BSP-tree vector quantization. Our results provide theoretical evidence, for the first time, that better quantization performance implies better search performance1 . These results also motivate the use of large margin BSP-trees, trees that hierarchically partition the data with a large (geometric) margin, for better nearest-neighbor search performance. After discussing some existing literature on nearestneighbor search and vector quantization in Section 2, we discuss our following contributions: • We present performance guarantees for Algorithm 1 in Section 3, linking search performance to vector quantization performance. Specifically, we show that for any balanced BSP-tree and a depth l, under some conditions, the worst-case search error incurred by the neighbor candidate returned by Algorithm 1 is proportional to a factor which is 2l/2 exp(−l/2β) , (n/2l )1/O(d) − 2 where β corresponds to the quantization performance of the tree (smaller β implies smaller quantization error) and d is closely related to the doubling dimension of the dataset (as opposed to the ambient dimension D of the dataset). This implies that better quantization produces better worst-case search results. Moreover, this result implies that smaller l produces improved worstcase performance (smaller l does imply more computation, hence it is intuitive to expect less error at the cost of computation). Finally, there is also the expected dependence on the intrinsic dimensionality d – increasing d implies deteriorating worst-case performance. The theoretical results are empirically verified in this section as well. • In Section 3, we also show that the worst-case search error for Algorithm 1 with a BSP-tree T is proportional to (1/γ) where γ is the smallest margin size of all the partitions in T . • We present the quantization performance guarantee of a large margin BSP tree in Section 4. O These results indicate that for a given dataset, the best BSP-tree for search is the one with the best combination of low quantization error and large partition margins. We conclude with this insight and related unanswered questions in Section 5. 2 Search and vector quantization Binary space-partitioning trees (or BSP-trees) are hierarchical data structures providing a multiresolution view of the dataset indexed. There are several space-partitioning heuristics for a BSPtree construction. A tree is constructed by recursively applying a heuristic partition. The most popular kd-tree uses axis-aligned partitions (Figure 1(a)), often employing a median split along the coordinate axis of the data in the tree node with the largest spread. The principal-axis tree (PA-tree) partitions the space at each node at the median along the principal eigenvector of the covariance matrix of the data in that node [3, 4]. Another heuristic partitions the space based on a 2-means clustering of the data in the node to form the two-means tree (2M-tree) [5, 6]. The random-projection tree (RP-tree) partitions the space by projecting the data along a random standard normal direction and choosing an appropriate splitting threshold [7] (Figure 1(b)). The max-margin tree (MM-tree) is built by recursively employing large margin partitions of the data [8] (Figure 1(c)). The unsupervised large margin splits are usually performed using max-margin clustering techniques [9]. Search. Nearest-neighbor search with a BSP-tree usually involves a depth-first branch-and-bound algorithm which guarantees the search approximation (exact search is a special case of approximate search with zero approximation) by a depth-first traversal of the tree followed by a backtrack up the tree as required. This makes the tree traversal unpredictable leading to trivial worst-case runtime 1 This intuitive connection is widely believed but never rigorously established to the best of our knowledge. 2 guarantees. On the other hand, locality-sensitive hashing [10] based methods approach search in a different way. After indexing the dataset into hash tables, a query is answered by selecting candidate points from these hash tables. The candidate set size implies the worst-case search time bound. The hash table construction guarantees the set size and search approximation. Algorithm 1 uses a BSPtree to select a candidate set for a query with defeatist tree search. For a balanced tree on n points, the candidate set size at depth l is n/2l and the search runtime is O(l + n/2l ), with l ≤ log2 n. For any choice of the depth l, we present the first approximation guarantee for this search process. Defeatist BSP-tree search has been explored with the spill tree [11], a binary tree with overlapping sibling nodes unlike the disjoint nodes in the usual BSP-tree. The search involves selecting the candidates in (all) the leaf node(s) which contain the query. The level of overlap guarantees the search approximation, but this search method lacks any rigorous runtime guarantee; it is hard to bound the number of leaf nodes that might contain any given query. Dasgupta & Sinha (2013) [12] show that the probability of finding the exact nearest neighbor with defeatist search on certain randomized partition trees (randomized spill trees and RP-trees being among them) is directly proportional to the relative contrast of the search task [13], a recently proposed quantity which characterizes the difficulty of a search problem (lower relative contrast makes exact search harder). Vector Quantization. Recent work by Verma et al., 2009 [2] has established theoretical guarantees for some of these BSP-trees for the task of vector quantization. Given a set of points S ⊂ RD of n points, the task of vector quantization is to generate a set of points M ⊂ RD of size k n with low average quantization error. The optimal quantizer for any region A is given by the mean µ(A) of the data points lying in that region. The quantization error of the region A is then given by VS (A) = 1 |A ∩ S| x − µ(A) 2 2 , (1) x∈A∩S and the average quantization error of a disjoint partition of region A into Al and Ar is given by: VS ({Al , Ar }) = (|Al ∩ S|VS (Al ) + |Ar ∩ S|VS (Ar )) /|A ∩ S|. (2) Tree-based structured vector quantization is used for efficient vector quantization – a BSP-tree of depth log2 k partitions the space containing S into k disjoint regions to produce a k-quantization of S. The theoretical results for tree-based vector quantization guarantee the improvement in average quantization error obtained by partitioning any single region (with a single quantizer) into two disjoints regions (with two quantizers) in the following form (introduced by Freund et al. (2007) [14]): Definition 2.1. For a set S ⊂ RD , a region A partitioned into two disjoint regions {Al , Ar }, and a data-dependent quantity β > 1, the quantization error improvement is characterized by: VS ({Al , Ar }) < (1 − 1/β) VS (A). (3) Tree PA-tree RP-tree kd-tree 2M-tree MM-tree∗ Definition of β . D O( 2 ) : = i=1 λi /λ1 O(dc ) × optimal (smallest possible) . D 2 O(ρ) : ρ = i=1 λi /γ The quantization performance depends inversely on the data-dependent quantity β – lower β implies bet- Table 1: β for various trees. λ1 , . . . , λD are ter quantization. We present the definition of β for the sorted eigenvalues of the covariance matrix different BSP-trees in Table 1. For the PA-tree, β of A ∩ S in descending order, and dc < D is depends on the ratio of the sum of the eigenval- the covariance dimension of A ∩ S. The results ues of the covariance matrix of data (A ∩ S) to the for PA-tree and 2M-tree are due to Verma et al. principal eigenvalue. The improvement rate β for (2009) [2]. The PA-tree result can be improved to the RP-tree depends on the covariance dimension O( ) from O( 2 ) with an additional assumption of the data in the node A (β = O(dc )) [7], which [2]. The RP-tree result is in Freund et al. (2007) roughly corresponds to the lowest dimensionality of [14], which also has the precise definition of dc . an affine plane that captures most of the data covari- We establish the result for MM-tree in Section 4. ance. The 2M-tree does not have an explicit β but γ is the margin size of the large margin partition. it has the optimal theoretical improvement rate for a No such guarantee for kd-trees is known to us. single partition because the 2-means clustering objective is equal to |Al |V(Al ) + |Ar |V(Ar ) and minimizing this objective maximizes β. The 2means problem is NP-hard and an approximate solution is used in practice. These theoretical results are valid under the condition that there are no outliers in A ∩ S. This is characterized as 2 maxx,y∈A∩S x − y ≤ ηVS (A) for a fixed η > 0. This notion of the absence of outliers was first introduced for the theoretical analysis of the RP-trees [7]. Verma et al. (2009) [2] describe outliers as “points that are much farther away from the mean than the typical distance-from-mean”. In this situation, an alternate type of partition is used to remove these outliers that are farther away 3 from the mean than expected. For η ≥ 8, this alternate partitioning is guaranteed to reduce the data diameter (maxx,y∈A∩S x − y ) of the resulting nodes by a constant fraction [7, Lemma 12], and can be used until a region contain no outliers, at which point, the usual hyperplane partition can be used with their respective theoretical quantization guarantees. The implicit assumption is that the alternate partitioning scheme is employed rarely. These results for BSP-tree quantization performance indicate that different heuristics are adaptive to different properties of the data. However, no existing theoretical result relates this performance of BSP-trees to their search performance. Making the precise connection between the quantization performance and the search performance of these BSP-trees is a contribution of this paper. 3 Approximation guarantees for BSP-tree search In this section, we formally present the data and tree dependent performance guarantees on the search with BSP-trees using Algorithm 1. The quality of nearest-neighbor search can be quantized in two ways – (i) distance error and (ii) rank of the candidate neighbor. We present guarantees for both notions of search error2 . For a query q and a set of points S and a neighbor candidate p ∈ S, q−p distance error (q) = minr∈S q−r − 1, and rank τ (q) = |{r ∈ S : q − r < q − p }| + 1. Algorithm 1 requires the query traversal depth l as an input. The search runtime is O(l + (n/2l )). The depth can be chosen based on the desired runtime. Equivalently, the depth can be chosen based on the desired number of candidates m; for a balanced binary tree on a dataset S of n points with leaf nodes containing a single point, the appropriate depth l = log2 n − log2 m . We will be building on the existing results on vector quantization error [2] to present the worst case error guarantee for Algorithm 1. We need the following definitions to precisely state our results: Definition 3.1. An ω-balanced split partitioning a region A into disjoint regions {A1 , A2 } implies ||A1 ∩ S| − |A2 ∩ S|| ≤ ω|A ∩ S|. For a balanced tree corresponding to recursive median splits, such as the PA-tree and the kd-tree, ω ≈ 0. Non-zero values of ω 1, corresponding to approximately balanced trees, allow us to potentially adapt better to some structure in the data at the cost of slightly losing the tree balance. For the MM-tree (discussed in detail in Section 4), ω-balanced splits are enforced for any specified value of ω. Approximately balanced trees have a depth bound of O(log n) [8, Theorem 3.1]. For l a tree with ω-balanced splits, the worst case runtime of Algorithm 1 is O l + 1+ω n . For the 2 2M-tree, ω-balanced splits are not enforced. Hence the actual value of ω could be high for a 2M-tree. Definition 3.2. Let B 2 (p, ∆) = {r ∈ S : p − r < ∆} denote the points in S contained in a ball of radius ∆ around some p ∈ S with respect to the 2 metric. The expansion constant of (S, 2 ) is defined as the smallest c ≥ 2 such B 2 (p, 2∆) ≤ c B 2 (p, ∆) ∀p ∈ S and ∀∆ > 0. Bounded expansion constants correspond to growth-restricted metrics [15]. The expansion constant characterizes the data distribution, and c ∼ 2O(d) where d is the doubling dimension of the set S with respect to the 2 metric. The relationship is exact for points on a D-dimensional grid (i.e., c = Θ(2D )). Equipped with these definitions, we have the following guarantee for Algorithm 1: 2 1 Theorem 3.1. Consider a dataset S ⊂ RD of n points with ψ = 2n2 x,y∈S x − y , the BSP tree T built on S and a query q ∈ RD with the following conditions : (C1) (C2) (C3) (C4) Let (A ∩ (S ∪ {q}), 2 ) have an expansion constant at most c for any convex set A ⊂ RD . ˜ Let T be complete till a depth L < log2 n /(1 − log2 (1 − ω)) with ω-balanced splits. c ˜ Let β ∗ correspond to the worst quantization error improvement rate over all splits in T . 2 For any node A in the tree T , let maxx,y∈A∩S x − y ≤ ηVS (A) for a fixed η ≥ 8. For α = 1/(1 − ω), the upper bound du on the distance of q to the neighbor candidate p returned by Algorithm 1 with depth l ≤ L is given by √ 2 ηψ · (2α)l/2 · exp(−l/2β ∗ ) q − p ≤ du = . (4) 1/ log2 c ˜ (n/(2α)l ) −2 2 The distance error corresponds to the relative error in terms of the actual distance values. The rank is one more than the number of points in S which are better neighbor candidates than p. The nearest-neighbor of q has rank 1 and distance error 0. The appropriate notion of error depends on the search application. 4 Now η is fixed, and ψ is fixed for a dataset S. Then, for a fixed ω, this result implies that between two types of BSP-trees on the same set and the same query, Algorithm 1 has a better worst-case guarantee on the candidate-neighbor distance for the tree with better quantization performance (smaller β ∗ ). Moreover, for a particular tree with β ∗ ≥ log2 e, du is non-decreasing in l. This is expected because as we traverse down the tree, we can never reduce the candidate neighbor distance. At the root level (l = 0), the candidate neighbor is the nearest-neighbor. As we descend down the tree, the candidate neighbor distance will worsen if a tree split separates the query from its closer neighbors. This behavior is implied in Equation (4). For a chosen depth l in Algorithm 1, the candidate 1/ log2 c ˜ , implying deteriorating bounds du neighbor distance is inversely proportional to n/(2α)l with increasing c. Since log2 c ∼ O(d), larger intrinsic dimensionality implies worse guarantees as ˜ ˜ expected from the curse of dimensionality. To prove Theorem 3.1, we use the following result: Lemma 3.1. Under the conditions of Theorem 3.1, for any node A at a depth l in the BSP-tree T l on S, VS (A) ≤ ψ (2/(1 − ω)) exp(−l/β ∗ ). This result is obtained by recursively applying the quantization error improvement in Definition 2.1 over l levels of the tree (the proof is in Appendix A). Proof of Theorem 3.1. Consider the node A at depth l in the tree containing q, and let m = |A ∩ S|. Let D = maxx,y∈A∩S x − y , let d = minx∈A∩S q − x , and let B 2 (q, ∆) = {x ∈ A ∩ (S ∪ {q}) : q − x < ∆}. Then, by the Definition 3.2 and condition C1, D+d D+d D+2d B (q, D + d) ≤ clog2 d |B (q, d)| = clog2 d ≤ clog2 ( d ) , ˜ ˜ ˜ 2 2 where the equality follows from the fact that B 2 (q, d) = {q}. Now B 2 (q, D + d) ≥ m. Using ˜ ˜ this above gives us m1/ log2 c ≤ (D/d) + 2. By condition C2, m1/ log2 c > 2. Hence we have 1/ log2 c ˜ d ≤ D/(m − 2). By construction and condition C4, D ≤ ηVS (A). Now m ≥ n/(2α)l . Plugging this above and utilizing Lemma 3.1 gives us the statement of Theorem 3.1. Nearest-neighbor search error guarantees. Equipped with the bound on the candidate-neighbor distance, we bound the worst-case nearest-neighbor search errors as follows: Corollary 3.1. Under the conditions of Theorem 3.1, for any query q at a desired depth l ≤ L in Algorithm 1, the distance error (q) is bounded as (q) ≤ (du /d∗ ) − 1, and the rank τ (q) is q u ∗ bounded as τ (q) ≤ c log2 (d /dq ) , where d∗ = minr∈S q − r . ˜ q Proof. The distance error bound follows from the definition of distance error. Let R = {r ∈ S : q − r < du }. By definition, τ (q) ≤ |R| + 1. Let B 2 (q, ∆) = {x ∈ (S ∪ {q}) : q − x < ∆}. Since B 2 (q, du ) contains q and R, and q ∈ S, |B 2 (q, du )| = |R| + 1 ≥ τ (q). From Definition / 3.2 and Condition C1, |B 2 (q, du )| ≤ c log2 (d ˜ |{q}| = 1 gives us the upper bound on τ (q). u /d∗ ) q |B 2 (q, d∗ )|. Using the fact that |B 2 (q, d∗ )| = q q The upper bounds on both forms of search error are directly proportional to du . Hence, the BSPtree with better quantization performance has better search performance guarantees, and increasing traversal depth l implies less computation but worse performance guarantees. Any dependence of this approximation guarantee on the ambient data dimensionality is subsumed by the dependence on β ∗ and c. While our result bounds the worst-case performance of Algorithm 1, an average case ˜ performance guarantee on the distance error is given by Eq (q) ≤ du Eq 1/d∗ −1, and on the rank q u − log d∗ is given by E τ (q) ≤ c log2 d ˜ E c ( 2 q ) , since the expectation is over the queries q and du q q does not depend on q. For the purposes of relative comparison among BSP-trees, the bounds on the expected error depend solely on du since the term within the expectation over q is tree independent. Dependence of the nearest-neighbor search error on the partition margins. The search error bounds in Corollary 3.1 depend on the true nearest-neighbor distance d∗ of any query q of which we q have no prior knowledge. However, if we partition the data with a large margin split, then we can say that either the candidate neighbor is the true nearest-neighbor of q or that d∗ is greater than the q size of the margin. We characterize the influence of the margin size with the following result: Corollary 3.2. Consider the conditions of Theorem 3.1 and a query q at a depth l ≤ L in Algorithm 1. Further assume that γ is the smallest margin size on both sides of any partition in the tree T .uThen the distance error is bounded as (q) ≤ du /γ − 1, and the rank is bounded as τ (q) ≤ c log2 (d /γ) . ˜ This result indicates that if the split margins in a BSP-tree can be increased without adversely affecting its quantization performance, the BSP-tree will have improved nearest-neighbor error guarantees 5 for the Algorithm 1. This motivated us to consider the max-margin tree [8], a BSP-tree that explicitly maximizes the margin of the split for every split in the tree. Explanation of the conditions in Theorem 3.1. Condition C1 implies that for any convex set A ⊂ RD , ((A ∩ (S ∪ {q})), 2 ) has an expansion constant at most c. A bounded c implies that no ˜ ˜ subset of (S ∪ {q}), contained in a convex set, has a very high expansion constant. This condition implies that ((S ∪ {q}), 2 ) also has an expansion constant at most c (since (S ∪ {q}) is contained in ˜ its convex hull). However, if (S ∪ {q}, 2 ) has an expansion constant c, this does not imply that the data lying within any convex set has an expansion constant at most c. Hence a bounded expansion constant assumption for (A∩(S ∪{q}), 2 ) for every convex set A ⊂ RD is stronger than a bounded expansion constant assumption for (S ∪ {q}, 2 )3 . Condition C2 ensures that the tree is complete so that for every query q and a depth l ≤ L, there exists a large enough tree node which contains q. Condition C3 gives us the worst quantization error improvement rate over all the splits in the tree. 2 Condition C4 implies that the squared data diameter of any node A (maxx,y∈A∩S x − y ) is within a constant factor of its quantization error VS (A). This refers to the assumption that the node A contains no outliers as described in Section 3 and only hyperplane partitions are used and their respective quantization improvement guarantees presented in Section 2 (Table 1) hold. By placing condition C4, we ignore the alternate partitioning scheme used to remove outliers for simplicity of analysis. If we allow a small fraction of the partitions in the tree to be this alternate split, a similar result can be obtained since the alternate split is the same for all BSP-tree. For two different kinds of hyperplane splits, if alternate split is invoked the same number of times in the tree, the difference in their worst-case guarantees for both the trees would again be governed by their worstcase quantization performance (β ∗ ). However, for any fixed η, a harder question is whether one type of hyperplane partition violates the inlier condition more often than another type of partition, resulting in more alternate partitions. And we do not yet have a theoretical answer for this4 . Empirical validation. We examine our theoretical results with 4 datasets – O PTDIGITS (D = 64, n = 3823, 1797 queries), T INY I MAGES (D = 384, n = 5000, 1000 queries), MNIST (D = 784, n = 6000, 1000 queries), I MAGES (D = 4096, n = 500, 150 queries). We consider the following BSP-trees: kd-tree, random-projection (RP) tree, principal axis (PA) tree, two-means (2M) tree and max-margin (MM) tree. We only use hyperplane partitions for the tree construction. This is because, firstly, the check for the presence of outliers (∆2 (A) > ηVS (A)) can be computationally S expensive for large n, and, secondly, the alternate partition is mostly for the purposes of obtaining theoretical guarantees. The implementation details for the different tree constructions are presented in Appendix C. The performance of these BSP-trees are presented in Figure 2. Trees with missing data points for higher depth levels (for example, kd-tree in Figure 2(a) and 2M-tree in Figures 2 (b) & (c)) imply that we were unable to grow complete BSP-trees beyond that depth. The quantization performance of the 2M-tree, PA-tree and MM-tree are significantly better than the performance of the kd-tree and RP-tree and, as suggested by Corollary 3.1, this is also reflected in their search performance. The MM-tree has comparable quantization performance to the 2M-tree and PA-tree. However, in the case of search, the MM-tree outperforms PA-tree in all datasets. This can be attributed to the large margin partitions in the MM-tree. The comparison to 2M-tree is not as apparent. The MM-tree and PA-tree have ω-balanced splits for small ω enforced algorithmically, resulting in bounded depth and bounded computation of O(l + n(1 + ω)l /2l ) for any given depth l. No such balance constraint is enforced in the 2-means algorithm, and hence, the 2M-tree can be heavily unbalanced. The absence of complete BSP 2M-tree beyond depth 4 and 6 in Figures 2 (b) & (c) respectively is evidence of the lack of balance in the 2M-tree. This implies possibly more computation and hence lower errors. Under these conditions, the MM-tree with an explicit balance constraint performs comparably to the 2M-tree (slightly outperforming in 3 of the 4 cases) while still maintaining a balanced tree (and hence returning smaller candidate sets on average). 3 A subset of a growth-restricted metric space (S, 2 ) may not be growth-restricted. However, in our case, we are not considering all subsets; we only consider subsets of the form (A ∩ S) where A ⊂ RD is a convex set. So our condition does not imply that all subsets of (S, 2 ) are growth-restricted. 4 We empirically explore the effect of the tree type on the violation of the inlier condition (C4) in Appendix B. The results imply that for any fixed value of η, almost the same number of alternate splits would be invoked for the construction of different types of trees on the same dataset. Moreover, with η ≥ 8, for only one of the datasets would a significant fraction of the partitions in the tree (of any type) need to be the alternate partition. 6 (a) O PTDIGITS (b) T INY I MAGES (c) MNIST (d) I MAGES Figure 2: Performance of BSP-trees with increasing traversal depth. The top row corresponds to quantization performance of existing trees and the bottom row presents the nearest-neighbor error (in terms of mean rank τ of the candidate neighbors (CN)) of Algorithm 1 with these trees. The nearest-neighbor search error graphs are also annotated with the mean distance-error of the CN (please view in color). 4 Large margin BSP-tree We established that the search error depends on the quantization performance and the partition margins of the tree. The MM-tree explicitly maximizes the margin of every partition and empirical results indicate that it has comparable performance to the 2M-tree and PA-tree in terms of the quantization performance. In this section, we establish a theoretical guarantee for the MM-tree quantization performance. The large margin split in the MM-tree is obtained by performing max-margin clustering (MMC) with 2 clusters. The task of MMC is to find the optimal hyperplane (w∗ , b∗ ) from the following optimization problem5 given a set of points S = {x1 , x2 , . . . , xm } ⊂ RD : min w,b,ξi s.t. 1 w 2 m 2 2 ξi +C (5) i=1 | w, xi + b| ≥ 1 − ξi , ξi ≥ 0 ∀i = 1, . . . , m (6) m sgn( w, xi + b) ≤ ωm. −ωm ≤ (7) i=1 MMC finds a soft max-margin split in the data to obtain two clusters separated by a large (soft) margin. The balance constraint (Equation (7)) avoids trivial solutions and enforces an ω-balanced split. The margin constraints (Equation (6)) enforce a robust separation of the data. Given a solution to the MMC, we establish the following quantization error improvement rate for the MM-tree: Theorem 4.1. Given a set of points S ⊂ RD and a region A containing m points, consider an ω-balanced max-margin split (w, b) of the region A into {Al , Ar } with at most αm support vectors and a split margin of size γ = 1/ w . Then the quantization error improvement is given by: γ 2 (1 − α)2 VS ({Al , Ar }) ≤ 1 − D i=1 1−ω 1+ω λi VS (A), (8) where λ1 , . . . , λD are the eigenvalues of the covariance matrix of A ∩ S. The result indicates that larger margin sizes (large γ values) and a smaller number of support vectors (small α) implies better quantization performance. Larger ω implies smaller improvement, but ω is √ generally restricted algorithmically in MMC. If γ = O( λ1 ) then this rate matches the best possible quantization performance of the PA-tree (Table 1). We do assume that we have a feasible solution to the MMC problem to prove this result. We use the following result to prove Theorem 4.1: Proposition 4.1. [7, Lemma 15] Give a set S, for any partition {A1 , A2 } of a set A, VS (A) − VS ({A1 , A2 }) = |A1 ∩ S||A2 ∩ S| µ(A1 ) − µ(A2 ) |A ∩ S|2 2 , (9) where µ(A) is the centroid of the points in the region A. 5 This is an equivalent formulation [16] to the original form of max-margin clustering proposed by Xu et al. (2005) [9]. The original formulation also contains the labels yi s and optimizes over it. We consider this form of the problem since it makes our analysis easier to follow. 7 This result [7] implies that the improvement in the quantization error depends on the distance between the centroids of the two regions in the partition. Proof of Theorem 4.1. For a feasible solution (w, b, ξi |i=1,...,m ) to the MMC problem, m m | w, xi + b| ≥ m − ξi . i=1 i=1 Let xi = w, xi +b and mp = |{i : xi > 0}| and mn = |{i : xi ≤ 0}| and µp = ( ˜ ˜ ˜ ˜ and µn = ( i : xi ≤0 xi )/mn . Then mp µp − mn µn ≥ m − i ξi . ˜ ˜ ˜ ˜ ˜ i : xi >0 ˜ xi )/mp ˜ Without loss of generality, we assume that mp ≥ mn . Then the balance constraint (Equation (7)) 2 tells us that mp ≤ m(1 + ω)/2 and mn ≥ m(1 − ω)/2. Then µp − µn + ω(˜p + µn ) ≥ 2 − m i ξi . ˜ ˜ µ ˜ 2 Since µp > 0 and µn ≤ 0, |˜p + µn | ≤ (˜p − µn ). Hence (1 + ω)(˜p − µn ) ≥ 2 − m i ξi . For ˜ µ ˜ µ ˜ µ ˜ an unsupervised split, the data is always separable since there is no misclassification. This implies ∗ that ξi ≤ 1∀i. Hence, µp − µn ≥ ˜ ˜ 2− 2 |{i : ξi > 0}| /(1 + ω) ≥ 2 m 1−α 1+ω , (10) since the term |{i : ξi > 0}| corresponds to the number of support vectors in the solution. Cauchy-Schwartz implies that µ(Al ) − µ(Ar ) ≥ | w, µ(Al ) − µ(Ar ) |/ w = (˜p − µn )γ, µ ˜ since µn = w, µ(Al ) + b and µp = w, µ(Ar ) + b. From Equation (10), we can say ˜ ˜ 2 2 2 that µ(Al ) − µ(Ar ) ≥ 4γ 2 (1 − α) / (1 + ω) . Also, for ω-balanced splits, |Al ||Ar | ≥ (1 − ω 2 )m2 /4. Combining these into Equation (9) from Proposition 4.1, we have VS (A) − VS ({Al , Ar }) ≥ (1 − ω 2 )γ 2 1−α 1+ω 2 = γ 2 (1 − α)2 1−ω 1+ω . (11) Let Cov(A ∩ S) be the covariance matrix of the data contained in region A and λ1 , . . . , λD be the eigenvalues of Cov(A ∩ S). Then, we have: VS (A) = 1 |A ∩ S| D x − µ(A) 2 = tr (Cov(A ∩ S)) = λi . i=1 x∈A∩S Then dividing Equation (11) by VS (A) gives us the statement of the theorem. 5 Conclusions and future directions Our results theoretically verify that BSP-trees with better vector quantization performance and large partition margins do have better search performance guarantees as one would expect. This means that the best BSP-tree for search on a given dataset is the one with the best combination of good quantization performance (low β ∗ in Corollary 3.1) and large partition margins (large γ in Corollary 3.2). The MM-tree and the 2M-tree appear to have the best empirical performance in terms of the search error. This is because the 2M-tree explicitly minimizes β ∗ while the MM-tree explicitly maximizes γ (which also implies smaller β ∗ by Theorem 4.1). Unlike the 2M-tree, the MM-tree explicitly maintains an approximately balanced tree for better worst-case search time guarantees. However, the general dimensional large margin partitions in the MM-tree construction can be quite expensive. But the idea of large margin partitions can be used to enhance any simpler space partition heuristic – for any chosen direction (such as along a coordinate axis or along the principal eigenvector of the data covariance matrix), a one dimensional large margin split of the projections of the points along the chosen direction can be obtained very efficiently for improved search performance. This analysis of search could be useful beyond BSP-trees. Various heuristics have been developed to improve locality-sensitive hashing (LSH) [10]. The plain-vanilla LSH uses random linear projections and random thresholds for the hash-table construction. The data can instead be projected along the top few eigenvectors of the data covariance matrix. This was (empirically) improved upon by learning an orthogonal rotation of the projected data to minimize the quantization error of each bin in the hash-table [17]. A nonlinear hash function can be learned using a restricted Boltzmann machine [18]. If the similarity graph of the data is based on the Euclidean distance, spectral hashing [19] uses a subset of the eigenvectors of the similarity graph Laplacian. Semi-supervised hashing [20] incorporates given pairwise semantic similarity and dissimilarity constraints. The structural SVM framework has also been used to learn hash functions [21]. Similar to the choice of an appropriate BSP-tree for search, the best hashing scheme for any given dataset can be chosen by considering the quantization performance of the hash functions and the margins between the bins in the hash tables. We plan to explore this intuition theoretically and empirically for LSH based search schemes. 8 References [1] J. H. Friedman, J. L. Bentley, and R. A. Finkel. An Algorithm for Finding Best Matches in Logarithmic Expected Time. ACM Transactions in Mathematical Software, 1977. [2] N. Verma, S. Kpotufe, and S. Dasgupta. Which Spatial Partition Trees are Adaptive to Intrinsic Dimension? In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2009. [3] R.F. Sproull. Refinements to Nearest-Neighbor Searching in k-dimensional Trees. Algorithmica, 1991. [4] J. McNames. A Fast Nearest-Neighbor Algorithm based on a Principal Axis Search Tree. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001. [5] K. Fukunaga and P. M. Nagendra. A Branch-and-Bound Algorithm for Computing k-NearestNeighbors. IEEE Transactions on Computing, 1975. [6] D. Nister and H. Stewenius. Scalable Recognition with a Vocabulary Tree. In IEEE Conference on Computer Vision and Pattern Recognition, 2006. [7] S. Dasgupta and Y. Freund. Random Projection trees and Low Dimensional Manifolds. In Proceedings of ACM Symposium on Theory of Computing, 2008. [8] P. Ram, D. Lee, and A. G. Gray. Nearest-neighbor Search on a Time Budget via Max-Margin Trees. In SIAM International Conference on Data Mining, 2012. [9] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum Margin Clustering. Advances in Neural Information Processing Systems, 2005. [10] P. Indyk and R. Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. In Proceedings of ACM Symposium on Theory of Computing, 1998. [11] T. Liu, A. W. Moore, A. G. Gray, and K. Yang. An Investigation of Practical Approximate Nearest Neighbor Algorithms. Advances in Neural Information Proceedings Systems, 2005. [12] S. Dasgupta and K. Sinha. Randomized Partition Trees for Exact Nearest Neighbor Search. In Proceedings of the Conference on Learning Theory, 2013. [13] J. He, S. Kumar and S. F. Chang. On the Difficulty of Nearest Neighbor Search. In Proceedings of the International Conference on Machine Learning, 2012. [14] Y. Freund, S. Dasgupta, M. Kabra, and N. Verma. Learning the Structure of Manifolds using Random Projections. Advances in Neural Information Processing Systems, 2007. [15] D. R. Karger and M. Ruhl. Finding Nearest Neighbors in Growth-Restricted Metrics. In Proceedings of ACM Symposium on Theory of Computing, 2002. [16] B. Zhao, F. Wang, and C. Zhang. Efficient Maximum Margin Clustering via Cutting Plane Algorithm. In SIAM International Conference on Data Mining, 2008. [17] Y. Gong and S. Lazebnik. Iterative Quantization: A Procrustean Approach to Learning Binary Codes. In IEEE Conference on Computer Vision and Pattern Recognition, 2011. [18] R. Salakhutdinov and G. Hinton. Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure. In Artificial Intelligence and Statistics, 2007. [19] Y. Weiss, A. Torralba, and R. Fergus. Spectral Hashing. Advances of Neural Information Processing Systems, 2008. [20] J. Wang, S. Kumar, and S. Chang. Semi-Supervised Hashing for Scalable Image Retrieval. In IEEE Conference on Computer Vision and Pattern Recognition, 2010. [21] M. Norouzi and D. J. Fleet. Minimal Loss Hashing for Compact Binary Codes. In Proceedings of the International Conference on Machine Learning, 2011. [22] S. Lloyd. Least Squares Quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982. 9
5 0.048228525 151 nips-2013-Learning Chordal Markov Networks by Constraint Satisfaction
Author: Jukka Corander, Tomi Janhunen, Jussi Rintanen, Henrik Nyman, Johan Pensar
Abstract: We investigate the problem of learning the structure of a Markov network from data. It is shown that the structure of such networks can be described in terms of constraints which enables the use of existing solver technology with optimization capabilities to compute optimal networks starting from initial scores computed from the data. To achieve efficient encodings, we develop a novel characterization of Markov network structure using a balancing condition on the separators between cliques forming the network. The resulting translations into propositional satisfiability and its extensions such as maximum satisfiability, satisfiability modulo theories, and answer set programming, enable us to prove optimal certain networks which have been previously found by stochastic search. 1
6 0.047564387 47 nips-2013-Bayesian Hierarchical Community Discovery
7 0.04566462 120 nips-2013-Faster Ridge Regression via the Subsampled Randomized Hadamard Transform
8 0.043096524 318 nips-2013-Structured Learning via Logistic Regression
9 0.042948652 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors
10 0.042723861 211 nips-2013-Non-Linear Domain Adaptation with Boosting
11 0.042607956 291 nips-2013-Sensor Selection in High-Dimensional Gaussian Trees with Nuisances
12 0.041193075 216 nips-2013-On Flat versus Hierarchical Classification in Large-Scale Taxonomies
13 0.038434073 276 nips-2013-Reshaping Visual Datasets for Domain Adaptation
14 0.036568996 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
15 0.035683922 173 nips-2013-Least Informative Dimensions
16 0.035571866 288 nips-2013-Scalable Influence Estimation in Continuous-Time Diffusion Networks
17 0.034934871 168 nips-2013-Learning to Pass Expectation Propagation Messages
18 0.034927718 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent
19 0.034234345 144 nips-2013-Inverse Density as an Inverse Problem: the Fredholm Equation Approach
20 0.032405488 289 nips-2013-Scalable kernels for graphs with continuous attributes
topicId topicWeight
[(0, 0.1), (1, 0.02), (2, -0.023), (3, 0.002), (4, 0.04), (5, 0.03), (6, 0.028), (7, -0.029), (8, -0.029), (9, 0.016), (10, 0.008), (11, -0.012), (12, 0.061), (13, -0.027), (14, 0.035), (15, 0.002), (16, -0.005), (17, 0.051), (18, 0.018), (19, -0.046), (20, 0.028), (21, 0.126), (22, -0.024), (23, -0.014), (24, 0.04), (25, 0.023), (26, 0.032), (27, -0.007), (28, 0.017), (29, -0.027), (30, 0.029), (31, -0.008), (32, -0.039), (33, -0.041), (34, 0.096), (35, 0.061), (36, -0.042), (37, -0.052), (38, 0.08), (39, -0.073), (40, -0.09), (41, -0.012), (42, -0.042), (43, -0.036), (44, -0.016), (45, 0.084), (46, -0.068), (47, -0.041), (48, 0.031), (49, 0.087)]
simIndex simValue paperId paperTitle
same-paper 1 0.93844277 340 nips-2013-Understanding variable importances in forests of randomized trees
Author: Gilles Louppe, Louis Wehenkel, Antonio Sutera, Pierre Geurts
Abstract: Despite growing interest and practical use in various scientific areas, variable importances derived from tree-based ensemble methods are not well understood from a theoretical point of view. In this work we characterize the Mean Decrease Impurity (MDI) variable importances as measured by an ensemble of totally randomized trees in asymptotic sample and ensemble size conditions. We derive a three-level decomposition of the information jointly provided by all input variables about the output in terms of i) the MDI importance of each input variable, ii) the degree of interaction of a given input variable with the other input variables, iii) the different interaction terms of a given degree. We then show that this MDI importance of a variable is equal to zero if and only if the variable is irrelevant and that the MDI importance of a relevant variable is invariant with respect to the removal or the addition of irrelevant variables. We illustrate these properties on a simple example and discuss how they may change in the case of non-totally randomized trees such as Random Forests and Extra-Trees. 1 Motivation An important task in many scientific fields is the prediction of a response variable based on a set of predictor variables. In many situations though, the aim is not only to make the most accurate predictions of the response but also to identify which predictor variables are the most important to make these predictions, e.g. in order to understand the underlying process. Because of their applicability to a wide range of problems and their capability to both build accurate models and, at the same time, to provide variable importance measures, Random Forests (Breiman, 2001) and variants such as Extra-Trees (Geurts et al., 2006) have become a major data analysis tool used with success in various scientific areas. Despite their extensive use in applied research, only a couple of works have studied the theoretical properties and statistical mechanisms of these algorithms. Zhao (2000), Breiman (2004), Biau et al. (2008); Biau (2012), Meinshausen (2006) and Lin and Jeon (2006) investigated simplified to very realistic variants of these algorithms and proved the consistency of those variants. Little is known however regarding the variable importances computed by Random Forests like algorithms, and – as far as we know – the work of Ishwaran (2007) is indeed the only theoretical study of tree-based variable importance measures. In this work, we aim at filling this gap and present a theoretical analysis of the Mean Decrease Impurity importance derived from ensembles of randomized trees. The rest of the paper is organized as follows: in section 2, we provide the background about ensembles of randomized trees and recall how variable importances can be derived from them; in section 3, we then derive a characterization in asymptotic conditions and show how variable importances derived from totally randomized trees offer a three-level decomposition of the information jointly contained in the input variables about the output; section 4 shows that this characterization only depends on the relevant variables and section 5 1 discusses theses ideas in the context of variants closer to the Random Forest algorithm; section 6 then illustrates all these ideas on an artificial problem; finally, section 7 includes our conclusions and proposes directions of future works. 2 Background In this section, we first describe decision trees, as well as forests of randomized trees. Then, we describe the two major variable importances measures derived from them – including the Mean Decrease Impurity (MDI) importance that we will study in the subsequent sections. 2.1 Single classification and regression trees and random forests A binary classification (resp. regression) tree (Breiman et al., 1984) is an input-output model represented by a tree structure T , from a random input vector (X1 , ..., Xp ) taking its values in X1 × ... × Xp = X to a random output variable Y ∈ Y . Any node t in the tree represents a subset of the space X , with the root node being X itself. Internal nodes t are labeled with a binary test (or split) st = (Xm < c) dividing their subset in two subsets1 corresponding to their two children tL and tR , while the terminal nodes (or leaves) are labeled with a best guess value of the output ˆ variable2 . The predicted output Y for a new instance is the label of the leaf reached by the instance when it is propagated through the tree. A tree is built from a learning sample of size N drawn from P (X1 , ..., Xp , Y ) using a recursive procedure which identifies at each node t the split st = s∗ for which the partition of the Nt node samples into tL and tR maximizes the decrease ∆i(s, t) = i(t) − pL i(tL ) − pR i(tR ) (1) of some impurity measure i(t) (e.g., the Gini index, the Shannon entropy, or the variance of Y ), and where pL = NtL /Nt and pR = NtR /Nt . The construction of the tree stops , e.g., when nodes become pure in terms of Y or when all variables Xi are locally constant. Single trees typically suffer from high variance, which makes them not competitive in terms of accuracy. A very efficient and simple way to address this flaw is to use them in the context of randomization-based ensemble methods. Specifically, the core principle is to introduce random perturbations into the learning procedure in order to produce several different decision trees from a single learning set and to use some aggregation technique to combine the predictions of all these trees. In Bagging (Breiman, 1996), trees are built on random bootstrap copies of the original data, hence producing different decision trees. In Random Forests (Breiman, 2001), Bagging is extended and combined with a randomization of the input variables that are used when considering candidate variables to split internal nodes t. In particular, instead of looking for the best split s∗ among all variables, the Random Forest algorithm selects, at each node, a random subset of K variables and then determines the best split over these latter variables only. 2.2 Variable importances In the context of ensembles of randomized trees, Breiman (2001, 2002) proposed to evaluate the importance of a variable Xm for predicting Y by adding up the weighted impurity decreases p(t)∆i(st , t) for all nodes t where Xm is used, averaged over all NT trees in the forest: Imp(Xm ) = 1 NT p(t)∆i(st , t) T (2) t∈T :v(st )=Xm and where p(t) is the proportion Nt /N of samples reaching t and v(st ) is the variable used in split st . When using the Gini index as impurity function, this measure is known as the Gini importance or Mean Decrease Gini. However, since it can be defined for any impurity measure i(t), we will refer to Equation 2 as the Mean Decrease Impurity importance (MDI), no matter the impurity measure i(t). We will characterize and derive results for this measure in the rest of this text. 1 More generally, splits are defined by a (not necessarily binary) partition of the range Xm of possible values of a single variable Xm . 2 e.g. determined as the majority class j(t) (resp., the average value y (t)) within the subset of the leaf t. ¯ 2 In addition to MDI, Breiman (2001, 2002) also proposed to evaluate the importance of a variable Xm by measuring the Mean Decrease Accuracy (MDA) of the forest when the values of Xm are randomly permuted in the out-of-bag samples. For that reason, this latter measure is also known as the permutation importance. Thanks to popular machine learning softwares (Breiman, 2002; Liaw and Wiener, 2002; Pedregosa et al., 2011), both of these variable importance measures have shown their practical utility in an increasing number of experimental studies. Little is known however regarding their inner workings. Strobl et al. (2007) compare both MDI and MDA and show experimentally that the former is biased towards some predictor variables. As explained by White and Liu (1994) in case of single decision trees, this bias stems from an unfair advantage given by the usual impurity functions i(t) towards predictors with a large number of values. Strobl et al. (2008) later showed that MDA is biased as well, and that it overestimates the importance of correlated variables – although this effect was not confirmed in a later experimental study by Genuer et al. (2010). From a theoretical point of view, Ishwaran (2007) provides a detailed theoretical development of a simplified version of MDA, giving key insights for the understanding of the actual MDA. 3 Variable importances derived from totally randomized tree ensembles Let us now consider the MDI importance as defined by Equation 2, and let us assume a set V = {X1 , ..., Xp } of categorical input variables and a categorical output Y . For the sake of simplicity we will use the Shannon entropy as impurity measure, and focus on totally randomized trees; later on we will discuss other impurity measures and tree construction algorithms. Given a training sample L of N joint observations of X1 , ..., Xp , Y independently drawn from the joint distribution P (X1 , ..., Xp , Y ), let us assume that we infer from it an infinitely large ensemble of totally randomized and fully developed trees. In this setting, a totally randomized and fully developed tree is defined as a decision tree in which each node t is partitioned using a variable Xi picked uniformly at random among those not yet used at the parent nodes of t, and where each t is split into |Xi | sub-trees, i.e., one for each possible value of Xi , and where the recursive construction process halts only when all p variables have been used along the current branch. Hence, in such a tree, leaves are all at the same depth p, and the set of leaves of a fully developed tree is in bijection with the set X of all possible joint configurations of the p input variables. For example, if the input variables are all binary, the resulting tree will have exactly 2p leaves. Theorem 1. The MDI importance of Xm ∈ V for Y as computed with an infinite ensemble of fully developed totally randomized trees and an infinitely large training sample is: p−1 Imp(Xm ) = k=0 1 1 k Cp p − k I(Xm ; Y |B), (3) B∈Pk (V −m ) where V −m denotes the subset V \ {Xm }, Pk (V −m ) is the set of subsets of V −m of cardinality k, and I(Xm ; Y |B) is the conditional mutual information of Xm and Y given the variables in B. Proof. See Appendix B. Theorem 2. For any ensemble of fully developed trees in asymptotic learning sample size conditions (e.g., in the same conditions as those of Theorem 1), we have that p Imp(Xm ) = I(X1 , . . . , Xp ; Y ). (4) m=1 Proof. See Appendix C. Together, theorems 1 and 2 show that variable importances derived from totally randomized trees in asymptotic conditions provide a three-level decomposition of the information I(X1 , . . . , Xp ; Y ) contained in the set of input variables about the output variable. The first level is a decomposition among input variables (see Equation 4 of Theorem 2), the second level is a decomposition along the 3 degrees k of interaction terms of a variable with the other ones (see the outer sum in Equation 3 of Theorem 1), and the third level is a decomposition along the combinations B of interaction terms of fixed size k of possible interacting variables (see the inner sum in Equation 3). We observe that the decomposition includes, for each variable, each and every interaction term of each and every degree weighted in a fashion resulting only from the combinatorics of possible interaction terms. In particular, since all I(Xm ; Y |B) terms are at most equal to H(Y ), the prior entropy of Y , the p terms of the outer sum of Equation 3 are each upper bounded by 1 1 1 1 1 H(Y ) = k C k H(Y ) = H(Y ). k Cp p − k Cp p − k p−1 p −m B∈Pk (V ) As such, the second level decomposition resulting from totally randomized trees makes the p sub1 1 importance terms C k p−k B∈Pk (V −m ) I(Xm ; Y |B) to equally contribute (at most) to the total p importance, even though they each include a combinatorially different number of terms. 4 Importances of relevant and irrelevant variables Following Kohavi and John (1997), let us define as relevant to Y with respect to V a variable Xm for which there exists at least one subset B ⊆ V (possibly empty) such that I(Xm ; Y |B) > 0.3 Thus we define as irrelevant to Y with respect to V a variable Xi for which, for all B ⊆ V , I(Xi ; Y |B) = 0. Remark that if Xi is irrelevant to Y with respect to V , then by definition it is also irrelevant to Y with respect to any subset of V . Theorem 3. Xi ∈ V is irrelevant to Y with respect to V if and only if its infinite sample size importance as computed with an infinite ensemble of fully developed totally randomized trees built on V for Y is 0. Proof. See Appendix D. Lemma 4. Let Xi ∈ V be an irrelevant variable for Y with respect to V . The infinite sample size / importance of Xm ∈ V as computed with an infinite ensemble of fully developed totally randomized trees built on V for Y is the same as the importance derived when using V ∪ {Xi } to build the ensemble of trees for Y . Proof. See Appendix E. Theorem 5. Let VR ⊆ V be the subset of all variables in V that are relevant to Y with respect to V . The infinite sample size importance of any variable Xm ∈ VR as computed with an infinite ensemble of fully developed totally randomized trees built on VR for Y is the same as its importance computed in the same conditions by using all variables in V . That is: p−1 Imp(Xm ) = k=0 r−1 = l=0 1 1 k Cp p − k 1 1 l Cr r − l I(Xm ; Y |B) B∈Pk (V −m ) (5) I(Xm ; Y |B) −m B∈Pl (VR ) where r is the number of relevant variables in VR . Proof. See Appendix F. Theorems 3 and 5 show that the importances computed with an ensemble of totally randomized trees depends only on the relevant variables. Irrelevant variables have a zero importance and do not affect the importance of relevant variables. Practically, we believe that such properties are desirable conditions for a sound criterion assessing the importance of a variable. Indeed, noise should not be credited of any importance and should not make any other variable more (or less) important. 3 Among the relevant variables, we have the marginally relevant ones, for which I(Xm ; Y ) > 0, the strongly relevant ones, for which I(Xm ; Y |V −m ) > 0, and the weakly relevant variables, which are relevant but not strongly relevant. 4 5 Random Forest variants In this section, we consider and discuss variable importances as computed with other types of ensembles of randomized trees. We first show how our results extend to any other impurity measure, and then analyze importances computed by depth-pruned ensemble of randomized trees and those computed by randomized trees built on random subspaces of fixed size. Finally, we discuss the case of non-totally randomized trees. 5.1 Generalization to other impurity measures Although our characterization in sections 3 and 4 uses Shannon entropy as the impurity measure, we show in Appendix I that theorems 1, 3 and 5 hold for other impurity measures, simply substituting conditional mutual information for conditional impurity reduction in the different formulas and in the definition of irrelevant variables. In particular, our results thus hold for the Gini index in classification and can be extended to regression problems using variance as the impurity measure. 5.2 Pruning and random subspaces In sections 3 and 4, we considered totally randomized trees that were fully developed, i.e. until all p variables were used within each branch. When totally randomized trees are developed only up to some smaller depth q ≤ p, we show in Proposition 6 that the variable importances as computed by these trees is limited to the q first terms of Equation 3. We then show in Proposition 7 that these latter importances are actually the same as when each tree of the ensemble is fully developed over a random subspace (Ho, 1998) of q variables drawn prior to its construction. Proposition 6. The importance of Xm ∈ V for Y as computed with an infinite ensemble of pruned totally randomized trees built up to depth q ≤ p and an infinitely large training sample is: q−1 Imp(Xm ) = k=0 1 1 k p−k Cp I(Xm ; Y |B) (6) B∈Pk (V −m ) Proof. See Appendix G. Proposition 7. The importance of Xm ∈ V for Y as computed with an infinite ensemble of pruned totally randomized trees built up to depth q ≤ p and an infinitely large training sample is identical to the importance as computed for Y with an infinite ensemble of fully developed totally randomized trees built on random subspaces of q variables drawn from V . Proof. See Appendix H. As long as q ≥ r (where r denotes the number of relevant variables in V ), it can easily be shown that all relevant variables will still obtain a strictly positive importance, which will however differ in general from the importances computed by fully grown totally randomized trees built over all variables. Also, each irrelevant variable of course keeps an importance equal to zero, which means that, in asymptotic conditions, these pruning and random subspace methods would still allow us identify the relevant variables, as long as we have a good upper bound q on r. 5.3 Non-totally randomized trees In our analysis in the previous sections, trees are built totally at random and hence do not directly relate to those built in Random Forests (Breiman, 2001) or in Extra-Trees (Geurts et al., 2006). To better understand the importances as computed by those algorithms, let us consider a close variant of totally randomized trees: at each node t, let us instead draw uniformly at random 1 ≤ K ≤ p variables and let us choose the one that maximizes ∆i(t). Notice that, for K = 1, this procedure amounts to building ensembles of totally randomized trees as defined before, while, for K = p, it amounts to building classical single trees in a deterministic way. First, the importance of Xm ∈ V as computed with an infinite ensemble of such randomized trees is not the same as Equation 3. For K > 1, masking effects indeed appear: at t, some variables are 5 never selected because there always is some other variable for which ∆i(t) is larger. Such effects tend to pull the best variables at the top of the trees and to push the others at the leaves. As a result, the importance of a variable no longer decomposes into a sum including all I(Xm ; Y |B) terms. The importance of the best variables decomposes into a sum of their mutual information alone or conditioned only with the best others – but not conditioned with all variables since they no longer ever appear at the bottom of trees. By contrast, the importance of the least promising variables now decomposes into a sum of their mutual information conditioned only with all variables – but not alone or conditioned with a couple of others since they no longer ever appear at the top of trees. In other words, because of the guided structure of the trees, the importance of Xm now takes into account only some of the conditioning sets B, which may over- or underestimate its overall relevance. To make things clearer, let us consider a simple example. Let X1 perfectly explains Y and let X2 be a slightly noisy copy of X1 (i.e., I(X1 ; Y ) ≈ I(X2 ; Y ), I(X1 ; Y |X2 ) = and I(X2 ; Y |X1 ) = 0). Using totally randomized trees, the importances of X1 and X2 are nearly equal – the importance of X1 being slightly higher than the importance of X2 : 1 1 1 Imp(X1 ) = I(X1 ; Y ) + I(X1 ; Y |X2 ) = I(X1 ; Y ) + 2 2 2 2 1 1 1 Imp(X2 ) = I(X2 ; Y ) + I(X2 ; Y |X1 ) = I(X2 ; Y ) + 0 2 2 2 In non-totally randomized trees, for K = 2, X1 is always selected at the root node and X2 is always used in its children. Also, since X1 perfectly explains Y , all its children are pure and the reduction of entropy when splitting on X2 is null. As a result, ImpK=2 (X1 ) = I(X1 ; Y ) and ImpK=2 (X2 ) = I(X2 ; Y |X1 ) = 0. Masking effects are here clearly visible: the true importance of X2 is masked by X1 as if X2 were irrelevant, while it is only a bit less informative than X1 . As a direct consequence of the example above, for K > 1, it is no longer true that a variable is irrelevant if and only if its importance is zero. In the same way, it can also be shown that the importances become dependent on the number of irrelevant variables. Let us indeed consider the following counter-example: let us add in the previous example an irrelevant variable Xi with respect to {X1 , X2 } and let us keep K = 2. The probability of selecting X2 at the root node now becomes positive, which means that ImpK=2 (X2 ) now includes I(X2 ; Y ) > 0 and is therefore strictly larger than the importance computed before. For K fixed, adding irrelevant variables dampens masking effects, which thereby makes importances indirectly dependent on the number of irrelevant variables. In conclusion, the importances as computed with totally randomized trees exhibit properties that do not possess, by extension, neither random forests nor extra-trees. With totally randomized trees, the importance of Xm only depends on the relevant variables and is 0 if and only if Xm is irrelevant. As we have shown, it may no longer be the case for K > 1. Asymptotically, the use of totally randomized trees for assessing the importance of a variable may therefore be more appropriate. In a finite setting (i.e., a limited number of samples and a limited number of trees), guiding the choice of the splitting variables remains however a sound strategy. In such a case, I(Xm ; Y |B) cannot be measured neither for all Xm nor for all B. It is therefore pragmatic to promote those that look the most promising – even if the resulting importances may be biased. 6 Illustration on a digit recognition problem In this section, we consider the digit recognition problem of (Breiman et al., 1984) for illustrating variable importances as computed with totally randomized trees. We verify that they match with our theoretical developments and that they decompose as foretold. We also compare these importances with those computed by an ensemble of non-totally randomized trees, as discussed in section 5.3, for increasing values of K. Let us consider a seven-segment indicator displaying numerals using horizontal and vertical lights in on-off combinations, as illustrated in Figure 1. Let Y be a random variable taking its value in {0, 1, ..., 9} with equal probability and let X1 , ..., X7 be binary variables whose values are each determined univocally given the corresponding value of Y in Table 1. Since Table 1 perfectly defines the data distribution, and given the small dimensionality of the problem, it is practicable to directly apply Equation 3 to compute variable importances. To verify our 6 X1 X2 y 0 1 2 3 4 5 6 7 8 9 X3 X4 X5 X6 X7 Eqn. 3 0.412 0.581 0.531 0.542 0.656 0.225 0.372 3.321 K=1 0.414 0.583 0.532 0.543 0.658 0.221 0.368 3.321 x2 1 0 0 0 1 1 1 0 1 1 x3 1 1 1 1 1 0 0 1 1 1 x4 0 0 1 1 1 1 1 0 1 1 x5 1 0 1 0 0 0 1 0 1 0 x6 1 1 0 1 1 1 1 1 1 1 x7 1 0 1 1 0 1 1 0 1 1 Table 1: Values of Y, X1 , ..., X7 Figure 1: 7-segment display X1 X2 X3 X4 X5 X6 X7 x1 1 0 1 1 0 1 1 1 1 1 K=2 0.362 0.663 0.512 0.525 0.731 0.140 0.385 3.321 K=3 0.327 0.715 0.496 0.484 0.778 0.126 0.392 3.321 K=4 0.309 0.757 0.489 0.445 0.810 0.122 0.387 3.321 K=5 0.304 0.787 0.483 0.414 0.827 0.122 0.382 3.321 K=6 0.305 0.801 0.475 0.409 0.831 0.121 0.375 3.321 K=7 0.306 0.799 0.475 0.412 0.835 0.120 0.372 3.321 Table 2: Variable importances as computed with an ensemble of randomized trees, for increasing values of K. Importances at K = 1 follow their theoretical values, as predicted by Equation 3 in Theorem 1. However, as K increases, importances diverge due to masking effects. In accordance with Theorem 2, their sum is also always equal to I(X1 , . . . , X7 ; Y ) = H(Y ) = log2 (10) = 3.321 since inputs allow to perfectly predict the output. theoretical developments, we then compare in Table 2 variable importances as computed by Equation 3 and those yielded by an ensemble of 10000 totally randomized trees (K = 1). Note that given the known structure of the problem, building trees on a sample of finite size that perfectly follows the data distribution amounts to building them on a sample of infinite size. At best, trees can thus be built on a 10-sample dataset, containing exactly one sample for each of the equiprobable outcomes of Y . As the table illustrates, the importances yielded by totally randomized trees match those computed by Equation 3, which confirms Theorem 1. Small differences stem from the fact that a finite number of trees were built in our simulations, but those discrepancies should disappear as the size of the ensemble grows towards infinity. It also shows that importances indeed add up to I(X1 , ...X7 ; Y ), which confirms Theorem 2. Regarding the actual importances, they indicate that X5 is stronger than all others, followed – in that order – by X2 , X4 and X3 which also show large importances. X1 , X7 and X6 appear to be the less informative. The table also reports importances for increasing values of K. As discussed before, we see that a large value of K yields importances that can be either overestimated (e.g., at K = 7, the importances of X2 and X5 are larger than at K = 1) or underestimated due to masking effects (e.g., at K = 7, the importances of X1 , X3 , X4 and X6 are smaller than at K = 1, as if they were less important). It can also be observed that masking effects may even induce changes in the variable rankings (e.g., compare the rankings at K = 1 and at K = 7), which thus confirms that importances are differently affected. To better understand why a variable is important, it is also insightful to look at its decomposition into its p sub-importances terms, as shown in Figure 2. Each row in the plots of the figure corresponds to one the p = 7 variables and each column to a size k of conditioning sets. As such, the value at row m and column k corresponds the importance of Xm when conditioned with k other variables 1 1 (e.g., to the term C k p−k B∈Pk (V −m ) I(Xm ; Y |B) in Equation 3 in the case of totally randomized p trees). In the left plot, for K = 1, the figure first illustrates how importances yielded by totally randomized trees decomposes along the degrees k of interactions terms. We can observe that they each equally contribute (at most) the total importance of a variable. The plot also illustrates why X5 is important: it is informative either alone or conditioned with any combination of the other variables (all of its terms are significantly larger than 0). By contrast, it also clearly shows why 7 K=1 0.5 K=7 X1 X1 X2 X3 X4 X4 X5 X5 X6 X6 X7 0.375 X2 X3 X7 0 1 2 3 4 5 6 0.25 0.125 0 1 2 3 4 5 6 0.0 Figure 2: Decomposition of variable importances along the degrees k of interactions of one variable with the other ones. At K = 1, all I(Xm ; Y |B) are accounted for in the total importance, while at K = 7 only some of them are taken into account due to masking effects. X6 is not important: neither alone nor combined with others X6 seems to be very informative (all of its terms are close to 0). More interestingly, this figure also highlights redundancies: X7 is informative alone or conditioned with a couple of others (the first terms are significantly larger than 0), but becomes uninformative when conditioned with many others (the last terms are closer to 0). The right plot, for K = 7, illustrates the decomposition of importances when variables are chosen in a deterministic way. The first thing to notice is masking effects. Some of the I(Xm ; Y |B) terms are indeed clearly never encountered and their contribution is therefore reduced to 0 in the total importance. For instance, for k = 0, the sub-importances of X2 and X5 are positive, while all others are null, which means that only those two variables are ever selected at the root node, hence masking the others. As a consequence, this also means that the importances of the remaining variables is biased and that it actually only accounts of their relevance when conditioned to X2 or X5 , but not of their relevance in other contexts. At k = 0, masking effects also amplify the contribution of I(X2 ; Y ) (resp. I(X5 ; Y )) since X2 (resp. X5 ) appears more frequently at the root node than in totally randomized trees. In addition, because nodes become pure before reaching depth p, conditioning sets of size k ≥ 4 are never actually encountered, which means that we can no longer know whether variables are still informative when conditioned to many others. All in all, this figure thus indeed confirms that importances as computed with non-totally randomized trees take into account only some of the conditioning sets B, hence biasing the measured importances. 7 Conclusions In this work, we made a first step towards understanding variable importances as computed with a forest of randomized trees. In particular, we derived a theoretical characterization of the Mean Decrease Impurity importances as computed by totally randomized trees in asymptotic conditions. We showed that they offer a three-level decomposition of the information jointly provided by all input variables about the output (Section 3). We then demonstrated (Section 4) that MDI importances as computed by totally randomized trees exhibit desirable properties for assessing the relevance of a variable: it is equal to zero if and only if the variable is irrelevant and it depends only on the relevant variables. We discussed the case of Random Forests and Extra-Trees (Section 5) and finally illustrated our developments on an artificial but insightful example (Section 6). There remain several limitations to our framework that we would like address in the future. First, our results should be adapted to binary splits as used within an actual Random Forest-like algorithm. In this setting, any node t is split in only two subsets, which means that any variable may then appear one or several times within a branch, and thus should make variable importances now dependent on the cardinalities of the input variables. In the same direction, our framework should also be extended to the case of continuous variables. Finally, results presented in this work are valid in an asymptotic setting only. An important direction of future work includes the characterization of the distribution of variable importances in a finite setting. Acknowledgements. Gilles Louppe is a research fellow of the FNRS (Belgium) and acknowledges its financial support. This work is supported by PASCAL2 and the IUAP DYSCO, initiated by the Belgian State, Science Policy Office. 8 References Biau, G. (2012). Analysis of a random forests model. The Journal of Machine Learning Research, 98888:1063–1095. Biau, G., Devroye, L., and Lugosi, G. (2008). Consistency of random forests and other averaging classifiers. The Journal of Machine Learning Research, 9:2015–2033. Breiman, L. (1996). Bagging predictors. Machine learning, 24(2):123–140. Breiman, L. (2001). Random forests. Machine learning, 45(1):5–32. Breiman, L. (2002). Manual on setting up, using, and understanding random forests v3. 1. Statistics Department University of California Berkeley, CA, USA. Breiman, L. (2004). Consistency for a simple model of random forests. Technical report, UC Berkeley. Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and regression trees. Genuer, R., Poggi, J.-M., and Tuleau-Malot, C. (2010). Variable selection using random forests. Pattern Recognition Letters, 31(14):2225–2236. Geurts, P., Ernst, D., and Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63(1):3–42. Ho, T. (1998). The random subspace method for constructing decision forests. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 20(8):832–844. Ishwaran, H. (2007). Variable importance in binary regression trees and forests. Electronic Journal of Statistics, 1:519–537. Kohavi, R. and John, G. H. (1997). Wrappers for feature subset selection. Artificial intelligence, 97(1):273–324. Liaw, A. and Wiener, M. (2002). Classification and regression by randomforest. R news, 2(3):18–22. Lin, Y. and Jeon, Y. (2006). Random forests and adaptive nearest neighbors. Journal of the American Statistical Association, 101(474):578–590. Meinshausen, N. (2006). Quantile regression forests. The Journal of Machine Learning Research, 7:983–999. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: Machine learning in python. The Journal of Machine Learning Research, 12:2825–2830. Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T., and Zeileis, A. (2008). Conditional variable importance for random forests. BMC bioinformatics, 9(1):307. Strobl, C., Boulesteix, A.-L., Zeileis, A., and Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC bioinformatics, 8(1):25. White, A. P. and Liu, W. Z. (1994). Technical note: Bias in information-based measures in decision tree induction. Machine Learning, 15(3):321–329. Zhao, G. (2000). A new perspective on classification. PhD thesis, Utah State University, Department of Mathematics and Statistics. 9
2 0.72123426 82 nips-2013-Decision Jungles: Compact and Rich Models for Classification
Author: Jamie Shotton, Toby Sharp, Pushmeet Kohli, Sebastian Nowozin, John Winn, Antonio Criminisi
Abstract: Randomized decision trees and forests have a rich history in machine learning and have seen considerable success in application, perhaps particularly so for computer vision. However, they face a fundamental limitation: given enough data, the number of nodes in decision trees will grow exponentially with depth. For certain applications, for example on mobile or embedded processors, memory is a limited resource, and so the exponential growth of trees limits their depth, and thus their potential accuracy. This paper proposes decision jungles, revisiting the idea of ensembles of rooted decision directed acyclic graphs (DAGs), and shows these to be compact and powerful discriminative models for classification. Unlike conventional decision trees that only allow one path to every node, a DAG in a decision jungle allows multiple paths from the root to each leaf. We present and compare two new node merging algorithms that jointly optimize both the features and the structure of the DAGs efficiently. During training, node splitting and node merging are driven by the minimization of exactly the same objective function, here the weighted sum of entropies at the leaves. Results on varied datasets show that, compared to decision forests and several other baselines, decision jungles require dramatically less memory while considerably improving generalization. 1
3 0.59931868 355 nips-2013-Which Space Partitioning Tree to Use for Search?
Author: Parikshit Ram, Alexander Gray
Abstract: We consider the task of nearest-neighbor search with the class of binary-spacepartitioning trees, which includes kd-trees, principal axis trees and random projection trees, and try to rigorously answer the question “which tree to use for nearestneighbor search?” To this end, we present the theoretical results which imply that trees with better vector quantization performance have better search performance guarantees. We also explore another factor affecting the search performance – margins of the partitions in these trees. We demonstrate, both theoretically and empirically, that large margin partitions can improve tree search performance. 1 Nearest-neighbor search Nearest-neighbor search is ubiquitous in computer science. Several techniques exist for nearestneighbor search, but most algorithms can be categorized into two following groups based on the indexing scheme used – (1) search with hierarchical tree indices, or (2) search with hash-based indices. Although multidimensional binary space-partitioning trees (or BSP-trees), such as kd-trees [1], are widely used for nearest-neighbor search, it is believed that their performances degrade with increasing dimensions. Standard worst-case analyses of search with BSP-trees in high dimensions usually lead to trivial guarantees (such as, an Ω(n) search time guarantee for a single nearest-neighbor query in a set of n points). This is generally attributed to the “curse of dimensionality” – in the worst case, the high dimensionality can force the search algorithm to visit every node in the BSP-tree. However, these BSP-trees are very simple and intuitive, and still used in practice with success. The occasional favorable performances of BSP-trees in high dimensions are attributed to the low “intrinsic” dimensionality of real data. However, no clear relationship between the BSP-tree search performance and the intrinsic data properties is known. We present theoretical results which link the search performance of BSP-trees to properties of the data and the tree. This allows us to identify implicit factors influencing BSP-tree search performance — knowing these driving factors allows us to develop successful heuristics for BSP-trees with improved search performance. Each node in a BSP-tree represents a region of the space and each non-leaf node has a left and right child representing a disjoint partition of this region with some separating hyperplane and threshold (w, b). A search query on this tree is usually answered with a depth-first branch-and-bound algorithm. Algorithm 1 presents a simplified version where a search query is answered with a small set of neighbor candidates of any desired size by performing a greedy depth-first tree traversal to a specified depth. This is known as defeatist tree search. We are not aware of any data-dependent analysis of the quality of the results from defeatist BSP-tree search. However, Verma et al. (2009) [2] presented adaptive data-dependent analyses of some BSP-trees for the task of vector quantization. These results show precise connections between the quantization performance of the BSP-trees and certain properties of the data (we will present these data properties in Section 2). 1 Algorithm 1 BSP-tree search Input: BSP-tree T on set S, Query q, Desired depth l Output: Candidate neighbor p current tree depth lc ← 0 current tree node Tc ← T while lc < l do if Tc .w, q + Tc .b ≤ 0 then Tc ← Tc .left child else Tc ← Tc .right child end if Increment depth lc ← lc + 1 end while p ← arg minr∈Tc ∩S q − r . (a) kd-tree (b) RP-tree (c) MM-tree Figure 1: Binary space-partitioning trees. We establish search performance guarantees for BSP-trees by linking their nearest-neighbor performance to their vector quantization performance and utilizing the recent guarantees on the BSP-tree vector quantization. Our results provide theoretical evidence, for the first time, that better quantization performance implies better search performance1 . These results also motivate the use of large margin BSP-trees, trees that hierarchically partition the data with a large (geometric) margin, for better nearest-neighbor search performance. After discussing some existing literature on nearestneighbor search and vector quantization in Section 2, we discuss our following contributions: • We present performance guarantees for Algorithm 1 in Section 3, linking search performance to vector quantization performance. Specifically, we show that for any balanced BSP-tree and a depth l, under some conditions, the worst-case search error incurred by the neighbor candidate returned by Algorithm 1 is proportional to a factor which is 2l/2 exp(−l/2β) , (n/2l )1/O(d) − 2 where β corresponds to the quantization performance of the tree (smaller β implies smaller quantization error) and d is closely related to the doubling dimension of the dataset (as opposed to the ambient dimension D of the dataset). This implies that better quantization produces better worst-case search results. Moreover, this result implies that smaller l produces improved worstcase performance (smaller l does imply more computation, hence it is intuitive to expect less error at the cost of computation). Finally, there is also the expected dependence on the intrinsic dimensionality d – increasing d implies deteriorating worst-case performance. The theoretical results are empirically verified in this section as well. • In Section 3, we also show that the worst-case search error for Algorithm 1 with a BSP-tree T is proportional to (1/γ) where γ is the smallest margin size of all the partitions in T . • We present the quantization performance guarantee of a large margin BSP tree in Section 4. O These results indicate that for a given dataset, the best BSP-tree for search is the one with the best combination of low quantization error and large partition margins. We conclude with this insight and related unanswered questions in Section 5. 2 Search and vector quantization Binary space-partitioning trees (or BSP-trees) are hierarchical data structures providing a multiresolution view of the dataset indexed. There are several space-partitioning heuristics for a BSPtree construction. A tree is constructed by recursively applying a heuristic partition. The most popular kd-tree uses axis-aligned partitions (Figure 1(a)), often employing a median split along the coordinate axis of the data in the tree node with the largest spread. The principal-axis tree (PA-tree) partitions the space at each node at the median along the principal eigenvector of the covariance matrix of the data in that node [3, 4]. Another heuristic partitions the space based on a 2-means clustering of the data in the node to form the two-means tree (2M-tree) [5, 6]. The random-projection tree (RP-tree) partitions the space by projecting the data along a random standard normal direction and choosing an appropriate splitting threshold [7] (Figure 1(b)). The max-margin tree (MM-tree) is built by recursively employing large margin partitions of the data [8] (Figure 1(c)). The unsupervised large margin splits are usually performed using max-margin clustering techniques [9]. Search. Nearest-neighbor search with a BSP-tree usually involves a depth-first branch-and-bound algorithm which guarantees the search approximation (exact search is a special case of approximate search with zero approximation) by a depth-first traversal of the tree followed by a backtrack up the tree as required. This makes the tree traversal unpredictable leading to trivial worst-case runtime 1 This intuitive connection is widely believed but never rigorously established to the best of our knowledge. 2 guarantees. On the other hand, locality-sensitive hashing [10] based methods approach search in a different way. After indexing the dataset into hash tables, a query is answered by selecting candidate points from these hash tables. The candidate set size implies the worst-case search time bound. The hash table construction guarantees the set size and search approximation. Algorithm 1 uses a BSPtree to select a candidate set for a query with defeatist tree search. For a balanced tree on n points, the candidate set size at depth l is n/2l and the search runtime is O(l + n/2l ), with l ≤ log2 n. For any choice of the depth l, we present the first approximation guarantee for this search process. Defeatist BSP-tree search has been explored with the spill tree [11], a binary tree with overlapping sibling nodes unlike the disjoint nodes in the usual BSP-tree. The search involves selecting the candidates in (all) the leaf node(s) which contain the query. The level of overlap guarantees the search approximation, but this search method lacks any rigorous runtime guarantee; it is hard to bound the number of leaf nodes that might contain any given query. Dasgupta & Sinha (2013) [12] show that the probability of finding the exact nearest neighbor with defeatist search on certain randomized partition trees (randomized spill trees and RP-trees being among them) is directly proportional to the relative contrast of the search task [13], a recently proposed quantity which characterizes the difficulty of a search problem (lower relative contrast makes exact search harder). Vector Quantization. Recent work by Verma et al., 2009 [2] has established theoretical guarantees for some of these BSP-trees for the task of vector quantization. Given a set of points S ⊂ RD of n points, the task of vector quantization is to generate a set of points M ⊂ RD of size k n with low average quantization error. The optimal quantizer for any region A is given by the mean µ(A) of the data points lying in that region. The quantization error of the region A is then given by VS (A) = 1 |A ∩ S| x − µ(A) 2 2 , (1) x∈A∩S and the average quantization error of a disjoint partition of region A into Al and Ar is given by: VS ({Al , Ar }) = (|Al ∩ S|VS (Al ) + |Ar ∩ S|VS (Ar )) /|A ∩ S|. (2) Tree-based structured vector quantization is used for efficient vector quantization – a BSP-tree of depth log2 k partitions the space containing S into k disjoint regions to produce a k-quantization of S. The theoretical results for tree-based vector quantization guarantee the improvement in average quantization error obtained by partitioning any single region (with a single quantizer) into two disjoints regions (with two quantizers) in the following form (introduced by Freund et al. (2007) [14]): Definition 2.1. For a set S ⊂ RD , a region A partitioned into two disjoint regions {Al , Ar }, and a data-dependent quantity β > 1, the quantization error improvement is characterized by: VS ({Al , Ar }) < (1 − 1/β) VS (A). (3) Tree PA-tree RP-tree kd-tree 2M-tree MM-tree∗ Definition of β . D O( 2 ) : = i=1 λi /λ1 O(dc ) × optimal (smallest possible) . D 2 O(ρ) : ρ = i=1 λi /γ The quantization performance depends inversely on the data-dependent quantity β – lower β implies bet- Table 1: β for various trees. λ1 , . . . , λD are ter quantization. We present the definition of β for the sorted eigenvalues of the covariance matrix different BSP-trees in Table 1. For the PA-tree, β of A ∩ S in descending order, and dc < D is depends on the ratio of the sum of the eigenval- the covariance dimension of A ∩ S. The results ues of the covariance matrix of data (A ∩ S) to the for PA-tree and 2M-tree are due to Verma et al. principal eigenvalue. The improvement rate β for (2009) [2]. The PA-tree result can be improved to the RP-tree depends on the covariance dimension O( ) from O( 2 ) with an additional assumption of the data in the node A (β = O(dc )) [7], which [2]. The RP-tree result is in Freund et al. (2007) roughly corresponds to the lowest dimensionality of [14], which also has the precise definition of dc . an affine plane that captures most of the data covari- We establish the result for MM-tree in Section 4. ance. The 2M-tree does not have an explicit β but γ is the margin size of the large margin partition. it has the optimal theoretical improvement rate for a No such guarantee for kd-trees is known to us. single partition because the 2-means clustering objective is equal to |Al |V(Al ) + |Ar |V(Ar ) and minimizing this objective maximizes β. The 2means problem is NP-hard and an approximate solution is used in practice. These theoretical results are valid under the condition that there are no outliers in A ∩ S. This is characterized as 2 maxx,y∈A∩S x − y ≤ ηVS (A) for a fixed η > 0. This notion of the absence of outliers was first introduced for the theoretical analysis of the RP-trees [7]. Verma et al. (2009) [2] describe outliers as “points that are much farther away from the mean than the typical distance-from-mean”. In this situation, an alternate type of partition is used to remove these outliers that are farther away 3 from the mean than expected. For η ≥ 8, this alternate partitioning is guaranteed to reduce the data diameter (maxx,y∈A∩S x − y ) of the resulting nodes by a constant fraction [7, Lemma 12], and can be used until a region contain no outliers, at which point, the usual hyperplane partition can be used with their respective theoretical quantization guarantees. The implicit assumption is that the alternate partitioning scheme is employed rarely. These results for BSP-tree quantization performance indicate that different heuristics are adaptive to different properties of the data. However, no existing theoretical result relates this performance of BSP-trees to their search performance. Making the precise connection between the quantization performance and the search performance of these BSP-trees is a contribution of this paper. 3 Approximation guarantees for BSP-tree search In this section, we formally present the data and tree dependent performance guarantees on the search with BSP-trees using Algorithm 1. The quality of nearest-neighbor search can be quantized in two ways – (i) distance error and (ii) rank of the candidate neighbor. We present guarantees for both notions of search error2 . For a query q and a set of points S and a neighbor candidate p ∈ S, q−p distance error (q) = minr∈S q−r − 1, and rank τ (q) = |{r ∈ S : q − r < q − p }| + 1. Algorithm 1 requires the query traversal depth l as an input. The search runtime is O(l + (n/2l )). The depth can be chosen based on the desired runtime. Equivalently, the depth can be chosen based on the desired number of candidates m; for a balanced binary tree on a dataset S of n points with leaf nodes containing a single point, the appropriate depth l = log2 n − log2 m . We will be building on the existing results on vector quantization error [2] to present the worst case error guarantee for Algorithm 1. We need the following definitions to precisely state our results: Definition 3.1. An ω-balanced split partitioning a region A into disjoint regions {A1 , A2 } implies ||A1 ∩ S| − |A2 ∩ S|| ≤ ω|A ∩ S|. For a balanced tree corresponding to recursive median splits, such as the PA-tree and the kd-tree, ω ≈ 0. Non-zero values of ω 1, corresponding to approximately balanced trees, allow us to potentially adapt better to some structure in the data at the cost of slightly losing the tree balance. For the MM-tree (discussed in detail in Section 4), ω-balanced splits are enforced for any specified value of ω. Approximately balanced trees have a depth bound of O(log n) [8, Theorem 3.1]. For l a tree with ω-balanced splits, the worst case runtime of Algorithm 1 is O l + 1+ω n . For the 2 2M-tree, ω-balanced splits are not enforced. Hence the actual value of ω could be high for a 2M-tree. Definition 3.2. Let B 2 (p, ∆) = {r ∈ S : p − r < ∆} denote the points in S contained in a ball of radius ∆ around some p ∈ S with respect to the 2 metric. The expansion constant of (S, 2 ) is defined as the smallest c ≥ 2 such B 2 (p, 2∆) ≤ c B 2 (p, ∆) ∀p ∈ S and ∀∆ > 0. Bounded expansion constants correspond to growth-restricted metrics [15]. The expansion constant characterizes the data distribution, and c ∼ 2O(d) where d is the doubling dimension of the set S with respect to the 2 metric. The relationship is exact for points on a D-dimensional grid (i.e., c = Θ(2D )). Equipped with these definitions, we have the following guarantee for Algorithm 1: 2 1 Theorem 3.1. Consider a dataset S ⊂ RD of n points with ψ = 2n2 x,y∈S x − y , the BSP tree T built on S and a query q ∈ RD with the following conditions : (C1) (C2) (C3) (C4) Let (A ∩ (S ∪ {q}), 2 ) have an expansion constant at most c for any convex set A ⊂ RD . ˜ Let T be complete till a depth L < log2 n /(1 − log2 (1 − ω)) with ω-balanced splits. c ˜ Let β ∗ correspond to the worst quantization error improvement rate over all splits in T . 2 For any node A in the tree T , let maxx,y∈A∩S x − y ≤ ηVS (A) for a fixed η ≥ 8. For α = 1/(1 − ω), the upper bound du on the distance of q to the neighbor candidate p returned by Algorithm 1 with depth l ≤ L is given by √ 2 ηψ · (2α)l/2 · exp(−l/2β ∗ ) q − p ≤ du = . (4) 1/ log2 c ˜ (n/(2α)l ) −2 2 The distance error corresponds to the relative error in terms of the actual distance values. The rank is one more than the number of points in S which are better neighbor candidates than p. The nearest-neighbor of q has rank 1 and distance error 0. The appropriate notion of error depends on the search application. 4 Now η is fixed, and ψ is fixed for a dataset S. Then, for a fixed ω, this result implies that between two types of BSP-trees on the same set and the same query, Algorithm 1 has a better worst-case guarantee on the candidate-neighbor distance for the tree with better quantization performance (smaller β ∗ ). Moreover, for a particular tree with β ∗ ≥ log2 e, du is non-decreasing in l. This is expected because as we traverse down the tree, we can never reduce the candidate neighbor distance. At the root level (l = 0), the candidate neighbor is the nearest-neighbor. As we descend down the tree, the candidate neighbor distance will worsen if a tree split separates the query from its closer neighbors. This behavior is implied in Equation (4). For a chosen depth l in Algorithm 1, the candidate 1/ log2 c ˜ , implying deteriorating bounds du neighbor distance is inversely proportional to n/(2α)l with increasing c. Since log2 c ∼ O(d), larger intrinsic dimensionality implies worse guarantees as ˜ ˜ expected from the curse of dimensionality. To prove Theorem 3.1, we use the following result: Lemma 3.1. Under the conditions of Theorem 3.1, for any node A at a depth l in the BSP-tree T l on S, VS (A) ≤ ψ (2/(1 − ω)) exp(−l/β ∗ ). This result is obtained by recursively applying the quantization error improvement in Definition 2.1 over l levels of the tree (the proof is in Appendix A). Proof of Theorem 3.1. Consider the node A at depth l in the tree containing q, and let m = |A ∩ S|. Let D = maxx,y∈A∩S x − y , let d = minx∈A∩S q − x , and let B 2 (q, ∆) = {x ∈ A ∩ (S ∪ {q}) : q − x < ∆}. Then, by the Definition 3.2 and condition C1, D+d D+d D+2d B (q, D + d) ≤ clog2 d |B (q, d)| = clog2 d ≤ clog2 ( d ) , ˜ ˜ ˜ 2 2 where the equality follows from the fact that B 2 (q, d) = {q}. Now B 2 (q, D + d) ≥ m. Using ˜ ˜ this above gives us m1/ log2 c ≤ (D/d) + 2. By condition C2, m1/ log2 c > 2. Hence we have 1/ log2 c ˜ d ≤ D/(m − 2). By construction and condition C4, D ≤ ηVS (A). Now m ≥ n/(2α)l . Plugging this above and utilizing Lemma 3.1 gives us the statement of Theorem 3.1. Nearest-neighbor search error guarantees. Equipped with the bound on the candidate-neighbor distance, we bound the worst-case nearest-neighbor search errors as follows: Corollary 3.1. Under the conditions of Theorem 3.1, for any query q at a desired depth l ≤ L in Algorithm 1, the distance error (q) is bounded as (q) ≤ (du /d∗ ) − 1, and the rank τ (q) is q u ∗ bounded as τ (q) ≤ c log2 (d /dq ) , where d∗ = minr∈S q − r . ˜ q Proof. The distance error bound follows from the definition of distance error. Let R = {r ∈ S : q − r < du }. By definition, τ (q) ≤ |R| + 1. Let B 2 (q, ∆) = {x ∈ (S ∪ {q}) : q − x < ∆}. Since B 2 (q, du ) contains q and R, and q ∈ S, |B 2 (q, du )| = |R| + 1 ≥ τ (q). From Definition / 3.2 and Condition C1, |B 2 (q, du )| ≤ c log2 (d ˜ |{q}| = 1 gives us the upper bound on τ (q). u /d∗ ) q |B 2 (q, d∗ )|. Using the fact that |B 2 (q, d∗ )| = q q The upper bounds on both forms of search error are directly proportional to du . Hence, the BSPtree with better quantization performance has better search performance guarantees, and increasing traversal depth l implies less computation but worse performance guarantees. Any dependence of this approximation guarantee on the ambient data dimensionality is subsumed by the dependence on β ∗ and c. While our result bounds the worst-case performance of Algorithm 1, an average case ˜ performance guarantee on the distance error is given by Eq (q) ≤ du Eq 1/d∗ −1, and on the rank q u − log d∗ is given by E τ (q) ≤ c log2 d ˜ E c ( 2 q ) , since the expectation is over the queries q and du q q does not depend on q. For the purposes of relative comparison among BSP-trees, the bounds on the expected error depend solely on du since the term within the expectation over q is tree independent. Dependence of the nearest-neighbor search error on the partition margins. The search error bounds in Corollary 3.1 depend on the true nearest-neighbor distance d∗ of any query q of which we q have no prior knowledge. However, if we partition the data with a large margin split, then we can say that either the candidate neighbor is the true nearest-neighbor of q or that d∗ is greater than the q size of the margin. We characterize the influence of the margin size with the following result: Corollary 3.2. Consider the conditions of Theorem 3.1 and a query q at a depth l ≤ L in Algorithm 1. Further assume that γ is the smallest margin size on both sides of any partition in the tree T .uThen the distance error is bounded as (q) ≤ du /γ − 1, and the rank is bounded as τ (q) ≤ c log2 (d /γ) . ˜ This result indicates that if the split margins in a BSP-tree can be increased without adversely affecting its quantization performance, the BSP-tree will have improved nearest-neighbor error guarantees 5 for the Algorithm 1. This motivated us to consider the max-margin tree [8], a BSP-tree that explicitly maximizes the margin of the split for every split in the tree. Explanation of the conditions in Theorem 3.1. Condition C1 implies that for any convex set A ⊂ RD , ((A ∩ (S ∪ {q})), 2 ) has an expansion constant at most c. A bounded c implies that no ˜ ˜ subset of (S ∪ {q}), contained in a convex set, has a very high expansion constant. This condition implies that ((S ∪ {q}), 2 ) also has an expansion constant at most c (since (S ∪ {q}) is contained in ˜ its convex hull). However, if (S ∪ {q}, 2 ) has an expansion constant c, this does not imply that the data lying within any convex set has an expansion constant at most c. Hence a bounded expansion constant assumption for (A∩(S ∪{q}), 2 ) for every convex set A ⊂ RD is stronger than a bounded expansion constant assumption for (S ∪ {q}, 2 )3 . Condition C2 ensures that the tree is complete so that for every query q and a depth l ≤ L, there exists a large enough tree node which contains q. Condition C3 gives us the worst quantization error improvement rate over all the splits in the tree. 2 Condition C4 implies that the squared data diameter of any node A (maxx,y∈A∩S x − y ) is within a constant factor of its quantization error VS (A). This refers to the assumption that the node A contains no outliers as described in Section 3 and only hyperplane partitions are used and their respective quantization improvement guarantees presented in Section 2 (Table 1) hold. By placing condition C4, we ignore the alternate partitioning scheme used to remove outliers for simplicity of analysis. If we allow a small fraction of the partitions in the tree to be this alternate split, a similar result can be obtained since the alternate split is the same for all BSP-tree. For two different kinds of hyperplane splits, if alternate split is invoked the same number of times in the tree, the difference in their worst-case guarantees for both the trees would again be governed by their worstcase quantization performance (β ∗ ). However, for any fixed η, a harder question is whether one type of hyperplane partition violates the inlier condition more often than another type of partition, resulting in more alternate partitions. And we do not yet have a theoretical answer for this4 . Empirical validation. We examine our theoretical results with 4 datasets – O PTDIGITS (D = 64, n = 3823, 1797 queries), T INY I MAGES (D = 384, n = 5000, 1000 queries), MNIST (D = 784, n = 6000, 1000 queries), I MAGES (D = 4096, n = 500, 150 queries). We consider the following BSP-trees: kd-tree, random-projection (RP) tree, principal axis (PA) tree, two-means (2M) tree and max-margin (MM) tree. We only use hyperplane partitions for the tree construction. This is because, firstly, the check for the presence of outliers (∆2 (A) > ηVS (A)) can be computationally S expensive for large n, and, secondly, the alternate partition is mostly for the purposes of obtaining theoretical guarantees. The implementation details for the different tree constructions are presented in Appendix C. The performance of these BSP-trees are presented in Figure 2. Trees with missing data points for higher depth levels (for example, kd-tree in Figure 2(a) and 2M-tree in Figures 2 (b) & (c)) imply that we were unable to grow complete BSP-trees beyond that depth. The quantization performance of the 2M-tree, PA-tree and MM-tree are significantly better than the performance of the kd-tree and RP-tree and, as suggested by Corollary 3.1, this is also reflected in their search performance. The MM-tree has comparable quantization performance to the 2M-tree and PA-tree. However, in the case of search, the MM-tree outperforms PA-tree in all datasets. This can be attributed to the large margin partitions in the MM-tree. The comparison to 2M-tree is not as apparent. The MM-tree and PA-tree have ω-balanced splits for small ω enforced algorithmically, resulting in bounded depth and bounded computation of O(l + n(1 + ω)l /2l ) for any given depth l. No such balance constraint is enforced in the 2-means algorithm, and hence, the 2M-tree can be heavily unbalanced. The absence of complete BSP 2M-tree beyond depth 4 and 6 in Figures 2 (b) & (c) respectively is evidence of the lack of balance in the 2M-tree. This implies possibly more computation and hence lower errors. Under these conditions, the MM-tree with an explicit balance constraint performs comparably to the 2M-tree (slightly outperforming in 3 of the 4 cases) while still maintaining a balanced tree (and hence returning smaller candidate sets on average). 3 A subset of a growth-restricted metric space (S, 2 ) may not be growth-restricted. However, in our case, we are not considering all subsets; we only consider subsets of the form (A ∩ S) where A ⊂ RD is a convex set. So our condition does not imply that all subsets of (S, 2 ) are growth-restricted. 4 We empirically explore the effect of the tree type on the violation of the inlier condition (C4) in Appendix B. The results imply that for any fixed value of η, almost the same number of alternate splits would be invoked for the construction of different types of trees on the same dataset. Moreover, with η ≥ 8, for only one of the datasets would a significant fraction of the partitions in the tree (of any type) need to be the alternate partition. 6 (a) O PTDIGITS (b) T INY I MAGES (c) MNIST (d) I MAGES Figure 2: Performance of BSP-trees with increasing traversal depth. The top row corresponds to quantization performance of existing trees and the bottom row presents the nearest-neighbor error (in terms of mean rank τ of the candidate neighbors (CN)) of Algorithm 1 with these trees. The nearest-neighbor search error graphs are also annotated with the mean distance-error of the CN (please view in color). 4 Large margin BSP-tree We established that the search error depends on the quantization performance and the partition margins of the tree. The MM-tree explicitly maximizes the margin of every partition and empirical results indicate that it has comparable performance to the 2M-tree and PA-tree in terms of the quantization performance. In this section, we establish a theoretical guarantee for the MM-tree quantization performance. The large margin split in the MM-tree is obtained by performing max-margin clustering (MMC) with 2 clusters. The task of MMC is to find the optimal hyperplane (w∗ , b∗ ) from the following optimization problem5 given a set of points S = {x1 , x2 , . . . , xm } ⊂ RD : min w,b,ξi s.t. 1 w 2 m 2 2 ξi +C (5) i=1 | w, xi + b| ≥ 1 − ξi , ξi ≥ 0 ∀i = 1, . . . , m (6) m sgn( w, xi + b) ≤ ωm. −ωm ≤ (7) i=1 MMC finds a soft max-margin split in the data to obtain two clusters separated by a large (soft) margin. The balance constraint (Equation (7)) avoids trivial solutions and enforces an ω-balanced split. The margin constraints (Equation (6)) enforce a robust separation of the data. Given a solution to the MMC, we establish the following quantization error improvement rate for the MM-tree: Theorem 4.1. Given a set of points S ⊂ RD and a region A containing m points, consider an ω-balanced max-margin split (w, b) of the region A into {Al , Ar } with at most αm support vectors and a split margin of size γ = 1/ w . Then the quantization error improvement is given by: γ 2 (1 − α)2 VS ({Al , Ar }) ≤ 1 − D i=1 1−ω 1+ω λi VS (A), (8) where λ1 , . . . , λD are the eigenvalues of the covariance matrix of A ∩ S. The result indicates that larger margin sizes (large γ values) and a smaller number of support vectors (small α) implies better quantization performance. Larger ω implies smaller improvement, but ω is √ generally restricted algorithmically in MMC. If γ = O( λ1 ) then this rate matches the best possible quantization performance of the PA-tree (Table 1). We do assume that we have a feasible solution to the MMC problem to prove this result. We use the following result to prove Theorem 4.1: Proposition 4.1. [7, Lemma 15] Give a set S, for any partition {A1 , A2 } of a set A, VS (A) − VS ({A1 , A2 }) = |A1 ∩ S||A2 ∩ S| µ(A1 ) − µ(A2 ) |A ∩ S|2 2 , (9) where µ(A) is the centroid of the points in the region A. 5 This is an equivalent formulation [16] to the original form of max-margin clustering proposed by Xu et al. (2005) [9]. The original formulation also contains the labels yi s and optimizes over it. We consider this form of the problem since it makes our analysis easier to follow. 7 This result [7] implies that the improvement in the quantization error depends on the distance between the centroids of the two regions in the partition. Proof of Theorem 4.1. For a feasible solution (w, b, ξi |i=1,...,m ) to the MMC problem, m m | w, xi + b| ≥ m − ξi . i=1 i=1 Let xi = w, xi +b and mp = |{i : xi > 0}| and mn = |{i : xi ≤ 0}| and µp = ( ˜ ˜ ˜ ˜ and µn = ( i : xi ≤0 xi )/mn . Then mp µp − mn µn ≥ m − i ξi . ˜ ˜ ˜ ˜ ˜ i : xi >0 ˜ xi )/mp ˜ Without loss of generality, we assume that mp ≥ mn . Then the balance constraint (Equation (7)) 2 tells us that mp ≤ m(1 + ω)/2 and mn ≥ m(1 − ω)/2. Then µp − µn + ω(˜p + µn ) ≥ 2 − m i ξi . ˜ ˜ µ ˜ 2 Since µp > 0 and µn ≤ 0, |˜p + µn | ≤ (˜p − µn ). Hence (1 + ω)(˜p − µn ) ≥ 2 − m i ξi . For ˜ µ ˜ µ ˜ µ ˜ an unsupervised split, the data is always separable since there is no misclassification. This implies ∗ that ξi ≤ 1∀i. Hence, µp − µn ≥ ˜ ˜ 2− 2 |{i : ξi > 0}| /(1 + ω) ≥ 2 m 1−α 1+ω , (10) since the term |{i : ξi > 0}| corresponds to the number of support vectors in the solution. Cauchy-Schwartz implies that µ(Al ) − µ(Ar ) ≥ | w, µ(Al ) − µ(Ar ) |/ w = (˜p − µn )γ, µ ˜ since µn = w, µ(Al ) + b and µp = w, µ(Ar ) + b. From Equation (10), we can say ˜ ˜ 2 2 2 that µ(Al ) − µ(Ar ) ≥ 4γ 2 (1 − α) / (1 + ω) . Also, for ω-balanced splits, |Al ||Ar | ≥ (1 − ω 2 )m2 /4. Combining these into Equation (9) from Proposition 4.1, we have VS (A) − VS ({Al , Ar }) ≥ (1 − ω 2 )γ 2 1−α 1+ω 2 = γ 2 (1 − α)2 1−ω 1+ω . (11) Let Cov(A ∩ S) be the covariance matrix of the data contained in region A and λ1 , . . . , λD be the eigenvalues of Cov(A ∩ S). Then, we have: VS (A) = 1 |A ∩ S| D x − µ(A) 2 = tr (Cov(A ∩ S)) = λi . i=1 x∈A∩S Then dividing Equation (11) by VS (A) gives us the statement of the theorem. 5 Conclusions and future directions Our results theoretically verify that BSP-trees with better vector quantization performance and large partition margins do have better search performance guarantees as one would expect. This means that the best BSP-tree for search on a given dataset is the one with the best combination of good quantization performance (low β ∗ in Corollary 3.1) and large partition margins (large γ in Corollary 3.2). The MM-tree and the 2M-tree appear to have the best empirical performance in terms of the search error. This is because the 2M-tree explicitly minimizes β ∗ while the MM-tree explicitly maximizes γ (which also implies smaller β ∗ by Theorem 4.1). Unlike the 2M-tree, the MM-tree explicitly maintains an approximately balanced tree for better worst-case search time guarantees. However, the general dimensional large margin partitions in the MM-tree construction can be quite expensive. But the idea of large margin partitions can be used to enhance any simpler space partition heuristic – for any chosen direction (such as along a coordinate axis or along the principal eigenvector of the data covariance matrix), a one dimensional large margin split of the projections of the points along the chosen direction can be obtained very efficiently for improved search performance. This analysis of search could be useful beyond BSP-trees. Various heuristics have been developed to improve locality-sensitive hashing (LSH) [10]. The plain-vanilla LSH uses random linear projections and random thresholds for the hash-table construction. The data can instead be projected along the top few eigenvectors of the data covariance matrix. This was (empirically) improved upon by learning an orthogonal rotation of the projected data to minimize the quantization error of each bin in the hash-table [17]. A nonlinear hash function can be learned using a restricted Boltzmann machine [18]. If the similarity graph of the data is based on the Euclidean distance, spectral hashing [19] uses a subset of the eigenvectors of the similarity graph Laplacian. Semi-supervised hashing [20] incorporates given pairwise semantic similarity and dissimilarity constraints. The structural SVM framework has also been used to learn hash functions [21]. Similar to the choice of an appropriate BSP-tree for search, the best hashing scheme for any given dataset can be chosen by considering the quantization performance of the hash functions and the margins between the bins in the hash tables. We plan to explore this intuition theoretically and empirically for LSH based search schemes. 8 References [1] J. H. Friedman, J. L. Bentley, and R. A. Finkel. An Algorithm for Finding Best Matches in Logarithmic Expected Time. ACM Transactions in Mathematical Software, 1977. [2] N. Verma, S. Kpotufe, and S. Dasgupta. Which Spatial Partition Trees are Adaptive to Intrinsic Dimension? In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2009. [3] R.F. Sproull. Refinements to Nearest-Neighbor Searching in k-dimensional Trees. Algorithmica, 1991. [4] J. McNames. A Fast Nearest-Neighbor Algorithm based on a Principal Axis Search Tree. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001. [5] K. Fukunaga and P. M. Nagendra. A Branch-and-Bound Algorithm for Computing k-NearestNeighbors. IEEE Transactions on Computing, 1975. [6] D. Nister and H. Stewenius. Scalable Recognition with a Vocabulary Tree. In IEEE Conference on Computer Vision and Pattern Recognition, 2006. [7] S. Dasgupta and Y. Freund. Random Projection trees and Low Dimensional Manifolds. In Proceedings of ACM Symposium on Theory of Computing, 2008. [8] P. Ram, D. Lee, and A. G. Gray. Nearest-neighbor Search on a Time Budget via Max-Margin Trees. In SIAM International Conference on Data Mining, 2012. [9] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum Margin Clustering. Advances in Neural Information Processing Systems, 2005. [10] P. Indyk and R. Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. In Proceedings of ACM Symposium on Theory of Computing, 1998. [11] T. Liu, A. W. Moore, A. G. Gray, and K. Yang. An Investigation of Practical Approximate Nearest Neighbor Algorithms. Advances in Neural Information Proceedings Systems, 2005. [12] S. Dasgupta and K. Sinha. Randomized Partition Trees for Exact Nearest Neighbor Search. In Proceedings of the Conference on Learning Theory, 2013. [13] J. He, S. Kumar and S. F. Chang. On the Difficulty of Nearest Neighbor Search. In Proceedings of the International Conference on Machine Learning, 2012. [14] Y. Freund, S. Dasgupta, M. Kabra, and N. Verma. Learning the Structure of Manifolds using Random Projections. Advances in Neural Information Processing Systems, 2007. [15] D. R. Karger and M. Ruhl. Finding Nearest Neighbors in Growth-Restricted Metrics. In Proceedings of ACM Symposium on Theory of Computing, 2002. [16] B. Zhao, F. Wang, and C. Zhang. Efficient Maximum Margin Clustering via Cutting Plane Algorithm. In SIAM International Conference on Data Mining, 2008. [17] Y. Gong and S. Lazebnik. Iterative Quantization: A Procrustean Approach to Learning Binary Codes. In IEEE Conference on Computer Vision and Pattern Recognition, 2011. [18] R. Salakhutdinov and G. Hinton. Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure. In Artificial Intelligence and Statistics, 2007. [19] Y. Weiss, A. Torralba, and R. Fergus. Spectral Hashing. Advances of Neural Information Processing Systems, 2008. [20] J. Wang, S. Kumar, and S. Chang. Semi-Supervised Hashing for Scalable Image Retrieval. In IEEE Conference on Computer Vision and Pattern Recognition, 2010. [21] M. Norouzi and D. J. Fleet. Minimal Loss Hashing for Compact Binary Codes. In Proceedings of the International Conference on Machine Learning, 2011. [22] S. Lloyd. Least Squares Quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982. 9
4 0.58175993 47 nips-2013-Bayesian Hierarchical Community Discovery
Author: Charles Blundell, Yee Whye Teh
Abstract: We propose an efficient Bayesian nonparametric model for discovering hierarchical community structure in social networks. Our model is a tree-structured mixture of potentially exponentially many stochastic blockmodels. We describe a family of greedy agglomerative model selection algorithms that take just one pass through the data to learn a fully probabilistic, hierarchical community model. In the worst case, Our algorithms scale quadratically in the number of vertices of the network, but independent of the number of nested communities. In practice, the run time of our algorithms are two orders of magnitude faster than the Infinite Relational Model, achieving comparable or better accuracy. 1
5 0.55966806 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors
Author: Nitish Srivastava, Ruslan Salakhutdinov
Abstract: High capacity classifiers, such as deep neural networks, often struggle on classes that have very few training examples. We propose a method for improving classification performance for such classes by discovering similar classes and transferring knowledge among them. Our method learns to organize the classes into a tree hierarchy. This tree structure imposes a prior over the classifier’s parameters. We show that the performance of deep neural networks can be improved by applying these priors to the weights in the last layer. Our method combines the strength of discriminatively trained deep neural networks, which typically require large amounts of training data, with tree-based priors, making deep neural networks work well on infrequent classes as well. We also propose an algorithm for learning the underlying tree structure. Starting from an initial pre-specified tree, this algorithm modifies the tree to make it more pertinent to the task being solved, for example, removing semantic relationships in favour of visual ones for an image classification task. Our method achieves state-of-the-art classification results on the CIFAR-100 image data set and the MIR Flickr image-text data set. 1
6 0.53398389 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent
7 0.52946389 151 nips-2013-Learning Chordal Markov Networks by Constraint Satisfaction
8 0.4943893 318 nips-2013-Structured Learning via Logistic Regression
9 0.48057273 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes
10 0.47307861 154 nips-2013-Learning Gaussian Graphical Models with Observed or Latent FVSs
11 0.44569388 291 nips-2013-Sensor Selection in High-Dimensional Gaussian Trees with Nuisances
12 0.42643967 90 nips-2013-Direct 0-1 Loss Minimization and Margin Maximization with Boosting
13 0.4191955 168 nips-2013-Learning to Pass Expectation Propagation Messages
14 0.4176496 189 nips-2013-Message Passing Inference with Chemical Reaction Networks
15 0.41548237 216 nips-2013-On Flat versus Hierarchical Classification in Large-Scale Taxonomies
16 0.41448781 275 nips-2013-Reservoir Boosting : Between Online and Offline Ensemble Learning
17 0.41216975 225 nips-2013-One-shot learning and big data with n=2
18 0.41027209 63 nips-2013-Cluster Trees on Manifolds
19 0.40657061 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions
20 0.39628366 181 nips-2013-Machine Teaching for Bayesian Learners in the Exponential Family
topicId topicWeight
[(2, 0.013), (12, 0.263), (16, 0.056), (33, 0.121), (34, 0.075), (36, 0.01), (41, 0.024), (49, 0.028), (56, 0.086), (70, 0.021), (85, 0.04), (89, 0.041), (93, 0.114), (95, 0.012)]
simIndex simValue paperId paperTitle
same-paper 1 0.75357258 340 nips-2013-Understanding variable importances in forests of randomized trees
Author: Gilles Louppe, Louis Wehenkel, Antonio Sutera, Pierre Geurts
Abstract: Despite growing interest and practical use in various scientific areas, variable importances derived from tree-based ensemble methods are not well understood from a theoretical point of view. In this work we characterize the Mean Decrease Impurity (MDI) variable importances as measured by an ensemble of totally randomized trees in asymptotic sample and ensemble size conditions. We derive a three-level decomposition of the information jointly provided by all input variables about the output in terms of i) the MDI importance of each input variable, ii) the degree of interaction of a given input variable with the other input variables, iii) the different interaction terms of a given degree. We then show that this MDI importance of a variable is equal to zero if and only if the variable is irrelevant and that the MDI importance of a relevant variable is invariant with respect to the removal or the addition of irrelevant variables. We illustrate these properties on a simple example and discuss how they may change in the case of non-totally randomized trees such as Random Forests and Extra-Trees. 1 Motivation An important task in many scientific fields is the prediction of a response variable based on a set of predictor variables. In many situations though, the aim is not only to make the most accurate predictions of the response but also to identify which predictor variables are the most important to make these predictions, e.g. in order to understand the underlying process. Because of their applicability to a wide range of problems and their capability to both build accurate models and, at the same time, to provide variable importance measures, Random Forests (Breiman, 2001) and variants such as Extra-Trees (Geurts et al., 2006) have become a major data analysis tool used with success in various scientific areas. Despite their extensive use in applied research, only a couple of works have studied the theoretical properties and statistical mechanisms of these algorithms. Zhao (2000), Breiman (2004), Biau et al. (2008); Biau (2012), Meinshausen (2006) and Lin and Jeon (2006) investigated simplified to very realistic variants of these algorithms and proved the consistency of those variants. Little is known however regarding the variable importances computed by Random Forests like algorithms, and – as far as we know – the work of Ishwaran (2007) is indeed the only theoretical study of tree-based variable importance measures. In this work, we aim at filling this gap and present a theoretical analysis of the Mean Decrease Impurity importance derived from ensembles of randomized trees. The rest of the paper is organized as follows: in section 2, we provide the background about ensembles of randomized trees and recall how variable importances can be derived from them; in section 3, we then derive a characterization in asymptotic conditions and show how variable importances derived from totally randomized trees offer a three-level decomposition of the information jointly contained in the input variables about the output; section 4 shows that this characterization only depends on the relevant variables and section 5 1 discusses theses ideas in the context of variants closer to the Random Forest algorithm; section 6 then illustrates all these ideas on an artificial problem; finally, section 7 includes our conclusions and proposes directions of future works. 2 Background In this section, we first describe decision trees, as well as forests of randomized trees. Then, we describe the two major variable importances measures derived from them – including the Mean Decrease Impurity (MDI) importance that we will study in the subsequent sections. 2.1 Single classification and regression trees and random forests A binary classification (resp. regression) tree (Breiman et al., 1984) is an input-output model represented by a tree structure T , from a random input vector (X1 , ..., Xp ) taking its values in X1 × ... × Xp = X to a random output variable Y ∈ Y . Any node t in the tree represents a subset of the space X , with the root node being X itself. Internal nodes t are labeled with a binary test (or split) st = (Xm < c) dividing their subset in two subsets1 corresponding to their two children tL and tR , while the terminal nodes (or leaves) are labeled with a best guess value of the output ˆ variable2 . The predicted output Y for a new instance is the label of the leaf reached by the instance when it is propagated through the tree. A tree is built from a learning sample of size N drawn from P (X1 , ..., Xp , Y ) using a recursive procedure which identifies at each node t the split st = s∗ for which the partition of the Nt node samples into tL and tR maximizes the decrease ∆i(s, t) = i(t) − pL i(tL ) − pR i(tR ) (1) of some impurity measure i(t) (e.g., the Gini index, the Shannon entropy, or the variance of Y ), and where pL = NtL /Nt and pR = NtR /Nt . The construction of the tree stops , e.g., when nodes become pure in terms of Y or when all variables Xi are locally constant. Single trees typically suffer from high variance, which makes them not competitive in terms of accuracy. A very efficient and simple way to address this flaw is to use them in the context of randomization-based ensemble methods. Specifically, the core principle is to introduce random perturbations into the learning procedure in order to produce several different decision trees from a single learning set and to use some aggregation technique to combine the predictions of all these trees. In Bagging (Breiman, 1996), trees are built on random bootstrap copies of the original data, hence producing different decision trees. In Random Forests (Breiman, 2001), Bagging is extended and combined with a randomization of the input variables that are used when considering candidate variables to split internal nodes t. In particular, instead of looking for the best split s∗ among all variables, the Random Forest algorithm selects, at each node, a random subset of K variables and then determines the best split over these latter variables only. 2.2 Variable importances In the context of ensembles of randomized trees, Breiman (2001, 2002) proposed to evaluate the importance of a variable Xm for predicting Y by adding up the weighted impurity decreases p(t)∆i(st , t) for all nodes t where Xm is used, averaged over all NT trees in the forest: Imp(Xm ) = 1 NT p(t)∆i(st , t) T (2) t∈T :v(st )=Xm and where p(t) is the proportion Nt /N of samples reaching t and v(st ) is the variable used in split st . When using the Gini index as impurity function, this measure is known as the Gini importance or Mean Decrease Gini. However, since it can be defined for any impurity measure i(t), we will refer to Equation 2 as the Mean Decrease Impurity importance (MDI), no matter the impurity measure i(t). We will characterize and derive results for this measure in the rest of this text. 1 More generally, splits are defined by a (not necessarily binary) partition of the range Xm of possible values of a single variable Xm . 2 e.g. determined as the majority class j(t) (resp., the average value y (t)) within the subset of the leaf t. ¯ 2 In addition to MDI, Breiman (2001, 2002) also proposed to evaluate the importance of a variable Xm by measuring the Mean Decrease Accuracy (MDA) of the forest when the values of Xm are randomly permuted in the out-of-bag samples. For that reason, this latter measure is also known as the permutation importance. Thanks to popular machine learning softwares (Breiman, 2002; Liaw and Wiener, 2002; Pedregosa et al., 2011), both of these variable importance measures have shown their practical utility in an increasing number of experimental studies. Little is known however regarding their inner workings. Strobl et al. (2007) compare both MDI and MDA and show experimentally that the former is biased towards some predictor variables. As explained by White and Liu (1994) in case of single decision trees, this bias stems from an unfair advantage given by the usual impurity functions i(t) towards predictors with a large number of values. Strobl et al. (2008) later showed that MDA is biased as well, and that it overestimates the importance of correlated variables – although this effect was not confirmed in a later experimental study by Genuer et al. (2010). From a theoretical point of view, Ishwaran (2007) provides a detailed theoretical development of a simplified version of MDA, giving key insights for the understanding of the actual MDA. 3 Variable importances derived from totally randomized tree ensembles Let us now consider the MDI importance as defined by Equation 2, and let us assume a set V = {X1 , ..., Xp } of categorical input variables and a categorical output Y . For the sake of simplicity we will use the Shannon entropy as impurity measure, and focus on totally randomized trees; later on we will discuss other impurity measures and tree construction algorithms. Given a training sample L of N joint observations of X1 , ..., Xp , Y independently drawn from the joint distribution P (X1 , ..., Xp , Y ), let us assume that we infer from it an infinitely large ensemble of totally randomized and fully developed trees. In this setting, a totally randomized and fully developed tree is defined as a decision tree in which each node t is partitioned using a variable Xi picked uniformly at random among those not yet used at the parent nodes of t, and where each t is split into |Xi | sub-trees, i.e., one for each possible value of Xi , and where the recursive construction process halts only when all p variables have been used along the current branch. Hence, in such a tree, leaves are all at the same depth p, and the set of leaves of a fully developed tree is in bijection with the set X of all possible joint configurations of the p input variables. For example, if the input variables are all binary, the resulting tree will have exactly 2p leaves. Theorem 1. The MDI importance of Xm ∈ V for Y as computed with an infinite ensemble of fully developed totally randomized trees and an infinitely large training sample is: p−1 Imp(Xm ) = k=0 1 1 k Cp p − k I(Xm ; Y |B), (3) B∈Pk (V −m ) where V −m denotes the subset V \ {Xm }, Pk (V −m ) is the set of subsets of V −m of cardinality k, and I(Xm ; Y |B) is the conditional mutual information of Xm and Y given the variables in B. Proof. See Appendix B. Theorem 2. For any ensemble of fully developed trees in asymptotic learning sample size conditions (e.g., in the same conditions as those of Theorem 1), we have that p Imp(Xm ) = I(X1 , . . . , Xp ; Y ). (4) m=1 Proof. See Appendix C. Together, theorems 1 and 2 show that variable importances derived from totally randomized trees in asymptotic conditions provide a three-level decomposition of the information I(X1 , . . . , Xp ; Y ) contained in the set of input variables about the output variable. The first level is a decomposition among input variables (see Equation 4 of Theorem 2), the second level is a decomposition along the 3 degrees k of interaction terms of a variable with the other ones (see the outer sum in Equation 3 of Theorem 1), and the third level is a decomposition along the combinations B of interaction terms of fixed size k of possible interacting variables (see the inner sum in Equation 3). We observe that the decomposition includes, for each variable, each and every interaction term of each and every degree weighted in a fashion resulting only from the combinatorics of possible interaction terms. In particular, since all I(Xm ; Y |B) terms are at most equal to H(Y ), the prior entropy of Y , the p terms of the outer sum of Equation 3 are each upper bounded by 1 1 1 1 1 H(Y ) = k C k H(Y ) = H(Y ). k Cp p − k Cp p − k p−1 p −m B∈Pk (V ) As such, the second level decomposition resulting from totally randomized trees makes the p sub1 1 importance terms C k p−k B∈Pk (V −m ) I(Xm ; Y |B) to equally contribute (at most) to the total p importance, even though they each include a combinatorially different number of terms. 4 Importances of relevant and irrelevant variables Following Kohavi and John (1997), let us define as relevant to Y with respect to V a variable Xm for which there exists at least one subset B ⊆ V (possibly empty) such that I(Xm ; Y |B) > 0.3 Thus we define as irrelevant to Y with respect to V a variable Xi for which, for all B ⊆ V , I(Xi ; Y |B) = 0. Remark that if Xi is irrelevant to Y with respect to V , then by definition it is also irrelevant to Y with respect to any subset of V . Theorem 3. Xi ∈ V is irrelevant to Y with respect to V if and only if its infinite sample size importance as computed with an infinite ensemble of fully developed totally randomized trees built on V for Y is 0. Proof. See Appendix D. Lemma 4. Let Xi ∈ V be an irrelevant variable for Y with respect to V . The infinite sample size / importance of Xm ∈ V as computed with an infinite ensemble of fully developed totally randomized trees built on V for Y is the same as the importance derived when using V ∪ {Xi } to build the ensemble of trees for Y . Proof. See Appendix E. Theorem 5. Let VR ⊆ V be the subset of all variables in V that are relevant to Y with respect to V . The infinite sample size importance of any variable Xm ∈ VR as computed with an infinite ensemble of fully developed totally randomized trees built on VR for Y is the same as its importance computed in the same conditions by using all variables in V . That is: p−1 Imp(Xm ) = k=0 r−1 = l=0 1 1 k Cp p − k 1 1 l Cr r − l I(Xm ; Y |B) B∈Pk (V −m ) (5) I(Xm ; Y |B) −m B∈Pl (VR ) where r is the number of relevant variables in VR . Proof. See Appendix F. Theorems 3 and 5 show that the importances computed with an ensemble of totally randomized trees depends only on the relevant variables. Irrelevant variables have a zero importance and do not affect the importance of relevant variables. Practically, we believe that such properties are desirable conditions for a sound criterion assessing the importance of a variable. Indeed, noise should not be credited of any importance and should not make any other variable more (or less) important. 3 Among the relevant variables, we have the marginally relevant ones, for which I(Xm ; Y ) > 0, the strongly relevant ones, for which I(Xm ; Y |V −m ) > 0, and the weakly relevant variables, which are relevant but not strongly relevant. 4 5 Random Forest variants In this section, we consider and discuss variable importances as computed with other types of ensembles of randomized trees. We first show how our results extend to any other impurity measure, and then analyze importances computed by depth-pruned ensemble of randomized trees and those computed by randomized trees built on random subspaces of fixed size. Finally, we discuss the case of non-totally randomized trees. 5.1 Generalization to other impurity measures Although our characterization in sections 3 and 4 uses Shannon entropy as the impurity measure, we show in Appendix I that theorems 1, 3 and 5 hold for other impurity measures, simply substituting conditional mutual information for conditional impurity reduction in the different formulas and in the definition of irrelevant variables. In particular, our results thus hold for the Gini index in classification and can be extended to regression problems using variance as the impurity measure. 5.2 Pruning and random subspaces In sections 3 and 4, we considered totally randomized trees that were fully developed, i.e. until all p variables were used within each branch. When totally randomized trees are developed only up to some smaller depth q ≤ p, we show in Proposition 6 that the variable importances as computed by these trees is limited to the q first terms of Equation 3. We then show in Proposition 7 that these latter importances are actually the same as when each tree of the ensemble is fully developed over a random subspace (Ho, 1998) of q variables drawn prior to its construction. Proposition 6. The importance of Xm ∈ V for Y as computed with an infinite ensemble of pruned totally randomized trees built up to depth q ≤ p and an infinitely large training sample is: q−1 Imp(Xm ) = k=0 1 1 k p−k Cp I(Xm ; Y |B) (6) B∈Pk (V −m ) Proof. See Appendix G. Proposition 7. The importance of Xm ∈ V for Y as computed with an infinite ensemble of pruned totally randomized trees built up to depth q ≤ p and an infinitely large training sample is identical to the importance as computed for Y with an infinite ensemble of fully developed totally randomized trees built on random subspaces of q variables drawn from V . Proof. See Appendix H. As long as q ≥ r (where r denotes the number of relevant variables in V ), it can easily be shown that all relevant variables will still obtain a strictly positive importance, which will however differ in general from the importances computed by fully grown totally randomized trees built over all variables. Also, each irrelevant variable of course keeps an importance equal to zero, which means that, in asymptotic conditions, these pruning and random subspace methods would still allow us identify the relevant variables, as long as we have a good upper bound q on r. 5.3 Non-totally randomized trees In our analysis in the previous sections, trees are built totally at random and hence do not directly relate to those built in Random Forests (Breiman, 2001) or in Extra-Trees (Geurts et al., 2006). To better understand the importances as computed by those algorithms, let us consider a close variant of totally randomized trees: at each node t, let us instead draw uniformly at random 1 ≤ K ≤ p variables and let us choose the one that maximizes ∆i(t). Notice that, for K = 1, this procedure amounts to building ensembles of totally randomized trees as defined before, while, for K = p, it amounts to building classical single trees in a deterministic way. First, the importance of Xm ∈ V as computed with an infinite ensemble of such randomized trees is not the same as Equation 3. For K > 1, masking effects indeed appear: at t, some variables are 5 never selected because there always is some other variable for which ∆i(t) is larger. Such effects tend to pull the best variables at the top of the trees and to push the others at the leaves. As a result, the importance of a variable no longer decomposes into a sum including all I(Xm ; Y |B) terms. The importance of the best variables decomposes into a sum of their mutual information alone or conditioned only with the best others – but not conditioned with all variables since they no longer ever appear at the bottom of trees. By contrast, the importance of the least promising variables now decomposes into a sum of their mutual information conditioned only with all variables – but not alone or conditioned with a couple of others since they no longer ever appear at the top of trees. In other words, because of the guided structure of the trees, the importance of Xm now takes into account only some of the conditioning sets B, which may over- or underestimate its overall relevance. To make things clearer, let us consider a simple example. Let X1 perfectly explains Y and let X2 be a slightly noisy copy of X1 (i.e., I(X1 ; Y ) ≈ I(X2 ; Y ), I(X1 ; Y |X2 ) = and I(X2 ; Y |X1 ) = 0). Using totally randomized trees, the importances of X1 and X2 are nearly equal – the importance of X1 being slightly higher than the importance of X2 : 1 1 1 Imp(X1 ) = I(X1 ; Y ) + I(X1 ; Y |X2 ) = I(X1 ; Y ) + 2 2 2 2 1 1 1 Imp(X2 ) = I(X2 ; Y ) + I(X2 ; Y |X1 ) = I(X2 ; Y ) + 0 2 2 2 In non-totally randomized trees, for K = 2, X1 is always selected at the root node and X2 is always used in its children. Also, since X1 perfectly explains Y , all its children are pure and the reduction of entropy when splitting on X2 is null. As a result, ImpK=2 (X1 ) = I(X1 ; Y ) and ImpK=2 (X2 ) = I(X2 ; Y |X1 ) = 0. Masking effects are here clearly visible: the true importance of X2 is masked by X1 as if X2 were irrelevant, while it is only a bit less informative than X1 . As a direct consequence of the example above, for K > 1, it is no longer true that a variable is irrelevant if and only if its importance is zero. In the same way, it can also be shown that the importances become dependent on the number of irrelevant variables. Let us indeed consider the following counter-example: let us add in the previous example an irrelevant variable Xi with respect to {X1 , X2 } and let us keep K = 2. The probability of selecting X2 at the root node now becomes positive, which means that ImpK=2 (X2 ) now includes I(X2 ; Y ) > 0 and is therefore strictly larger than the importance computed before. For K fixed, adding irrelevant variables dampens masking effects, which thereby makes importances indirectly dependent on the number of irrelevant variables. In conclusion, the importances as computed with totally randomized trees exhibit properties that do not possess, by extension, neither random forests nor extra-trees. With totally randomized trees, the importance of Xm only depends on the relevant variables and is 0 if and only if Xm is irrelevant. As we have shown, it may no longer be the case for K > 1. Asymptotically, the use of totally randomized trees for assessing the importance of a variable may therefore be more appropriate. In a finite setting (i.e., a limited number of samples and a limited number of trees), guiding the choice of the splitting variables remains however a sound strategy. In such a case, I(Xm ; Y |B) cannot be measured neither for all Xm nor for all B. It is therefore pragmatic to promote those that look the most promising – even if the resulting importances may be biased. 6 Illustration on a digit recognition problem In this section, we consider the digit recognition problem of (Breiman et al., 1984) for illustrating variable importances as computed with totally randomized trees. We verify that they match with our theoretical developments and that they decompose as foretold. We also compare these importances with those computed by an ensemble of non-totally randomized trees, as discussed in section 5.3, for increasing values of K. Let us consider a seven-segment indicator displaying numerals using horizontal and vertical lights in on-off combinations, as illustrated in Figure 1. Let Y be a random variable taking its value in {0, 1, ..., 9} with equal probability and let X1 , ..., X7 be binary variables whose values are each determined univocally given the corresponding value of Y in Table 1. Since Table 1 perfectly defines the data distribution, and given the small dimensionality of the problem, it is practicable to directly apply Equation 3 to compute variable importances. To verify our 6 X1 X2 y 0 1 2 3 4 5 6 7 8 9 X3 X4 X5 X6 X7 Eqn. 3 0.412 0.581 0.531 0.542 0.656 0.225 0.372 3.321 K=1 0.414 0.583 0.532 0.543 0.658 0.221 0.368 3.321 x2 1 0 0 0 1 1 1 0 1 1 x3 1 1 1 1 1 0 0 1 1 1 x4 0 0 1 1 1 1 1 0 1 1 x5 1 0 1 0 0 0 1 0 1 0 x6 1 1 0 1 1 1 1 1 1 1 x7 1 0 1 1 0 1 1 0 1 1 Table 1: Values of Y, X1 , ..., X7 Figure 1: 7-segment display X1 X2 X3 X4 X5 X6 X7 x1 1 0 1 1 0 1 1 1 1 1 K=2 0.362 0.663 0.512 0.525 0.731 0.140 0.385 3.321 K=3 0.327 0.715 0.496 0.484 0.778 0.126 0.392 3.321 K=4 0.309 0.757 0.489 0.445 0.810 0.122 0.387 3.321 K=5 0.304 0.787 0.483 0.414 0.827 0.122 0.382 3.321 K=6 0.305 0.801 0.475 0.409 0.831 0.121 0.375 3.321 K=7 0.306 0.799 0.475 0.412 0.835 0.120 0.372 3.321 Table 2: Variable importances as computed with an ensemble of randomized trees, for increasing values of K. Importances at K = 1 follow their theoretical values, as predicted by Equation 3 in Theorem 1. However, as K increases, importances diverge due to masking effects. In accordance with Theorem 2, their sum is also always equal to I(X1 , . . . , X7 ; Y ) = H(Y ) = log2 (10) = 3.321 since inputs allow to perfectly predict the output. theoretical developments, we then compare in Table 2 variable importances as computed by Equation 3 and those yielded by an ensemble of 10000 totally randomized trees (K = 1). Note that given the known structure of the problem, building trees on a sample of finite size that perfectly follows the data distribution amounts to building them on a sample of infinite size. At best, trees can thus be built on a 10-sample dataset, containing exactly one sample for each of the equiprobable outcomes of Y . As the table illustrates, the importances yielded by totally randomized trees match those computed by Equation 3, which confirms Theorem 1. Small differences stem from the fact that a finite number of trees were built in our simulations, but those discrepancies should disappear as the size of the ensemble grows towards infinity. It also shows that importances indeed add up to I(X1 , ...X7 ; Y ), which confirms Theorem 2. Regarding the actual importances, they indicate that X5 is stronger than all others, followed – in that order – by X2 , X4 and X3 which also show large importances. X1 , X7 and X6 appear to be the less informative. The table also reports importances for increasing values of K. As discussed before, we see that a large value of K yields importances that can be either overestimated (e.g., at K = 7, the importances of X2 and X5 are larger than at K = 1) or underestimated due to masking effects (e.g., at K = 7, the importances of X1 , X3 , X4 and X6 are smaller than at K = 1, as if they were less important). It can also be observed that masking effects may even induce changes in the variable rankings (e.g., compare the rankings at K = 1 and at K = 7), which thus confirms that importances are differently affected. To better understand why a variable is important, it is also insightful to look at its decomposition into its p sub-importances terms, as shown in Figure 2. Each row in the plots of the figure corresponds to one the p = 7 variables and each column to a size k of conditioning sets. As such, the value at row m and column k corresponds the importance of Xm when conditioned with k other variables 1 1 (e.g., to the term C k p−k B∈Pk (V −m ) I(Xm ; Y |B) in Equation 3 in the case of totally randomized p trees). In the left plot, for K = 1, the figure first illustrates how importances yielded by totally randomized trees decomposes along the degrees k of interactions terms. We can observe that they each equally contribute (at most) the total importance of a variable. The plot also illustrates why X5 is important: it is informative either alone or conditioned with any combination of the other variables (all of its terms are significantly larger than 0). By contrast, it also clearly shows why 7 K=1 0.5 K=7 X1 X1 X2 X3 X4 X4 X5 X5 X6 X6 X7 0.375 X2 X3 X7 0 1 2 3 4 5 6 0.25 0.125 0 1 2 3 4 5 6 0.0 Figure 2: Decomposition of variable importances along the degrees k of interactions of one variable with the other ones. At K = 1, all I(Xm ; Y |B) are accounted for in the total importance, while at K = 7 only some of them are taken into account due to masking effects. X6 is not important: neither alone nor combined with others X6 seems to be very informative (all of its terms are close to 0). More interestingly, this figure also highlights redundancies: X7 is informative alone or conditioned with a couple of others (the first terms are significantly larger than 0), but becomes uninformative when conditioned with many others (the last terms are closer to 0). The right plot, for K = 7, illustrates the decomposition of importances when variables are chosen in a deterministic way. The first thing to notice is masking effects. Some of the I(Xm ; Y |B) terms are indeed clearly never encountered and their contribution is therefore reduced to 0 in the total importance. For instance, for k = 0, the sub-importances of X2 and X5 are positive, while all others are null, which means that only those two variables are ever selected at the root node, hence masking the others. As a consequence, this also means that the importances of the remaining variables is biased and that it actually only accounts of their relevance when conditioned to X2 or X5 , but not of their relevance in other contexts. At k = 0, masking effects also amplify the contribution of I(X2 ; Y ) (resp. I(X5 ; Y )) since X2 (resp. X5 ) appears more frequently at the root node than in totally randomized trees. In addition, because nodes become pure before reaching depth p, conditioning sets of size k ≥ 4 are never actually encountered, which means that we can no longer know whether variables are still informative when conditioned to many others. All in all, this figure thus indeed confirms that importances as computed with non-totally randomized trees take into account only some of the conditioning sets B, hence biasing the measured importances. 7 Conclusions In this work, we made a first step towards understanding variable importances as computed with a forest of randomized trees. In particular, we derived a theoretical characterization of the Mean Decrease Impurity importances as computed by totally randomized trees in asymptotic conditions. We showed that they offer a three-level decomposition of the information jointly provided by all input variables about the output (Section 3). We then demonstrated (Section 4) that MDI importances as computed by totally randomized trees exhibit desirable properties for assessing the relevance of a variable: it is equal to zero if and only if the variable is irrelevant and it depends only on the relevant variables. We discussed the case of Random Forests and Extra-Trees (Section 5) and finally illustrated our developments on an artificial but insightful example (Section 6). There remain several limitations to our framework that we would like address in the future. First, our results should be adapted to binary splits as used within an actual Random Forest-like algorithm. In this setting, any node t is split in only two subsets, which means that any variable may then appear one or several times within a branch, and thus should make variable importances now dependent on the cardinalities of the input variables. In the same direction, our framework should also be extended to the case of continuous variables. Finally, results presented in this work are valid in an asymptotic setting only. An important direction of future work includes the characterization of the distribution of variable importances in a finite setting. Acknowledgements. Gilles Louppe is a research fellow of the FNRS (Belgium) and acknowledges its financial support. This work is supported by PASCAL2 and the IUAP DYSCO, initiated by the Belgian State, Science Policy Office. 8 References Biau, G. (2012). Analysis of a random forests model. The Journal of Machine Learning Research, 98888:1063–1095. Biau, G., Devroye, L., and Lugosi, G. (2008). Consistency of random forests and other averaging classifiers. The Journal of Machine Learning Research, 9:2015–2033. Breiman, L. (1996). Bagging predictors. Machine learning, 24(2):123–140. Breiman, L. (2001). Random forests. Machine learning, 45(1):5–32. Breiman, L. (2002). Manual on setting up, using, and understanding random forests v3. 1. Statistics Department University of California Berkeley, CA, USA. Breiman, L. (2004). Consistency for a simple model of random forests. Technical report, UC Berkeley. Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and regression trees. Genuer, R., Poggi, J.-M., and Tuleau-Malot, C. (2010). Variable selection using random forests. Pattern Recognition Letters, 31(14):2225–2236. Geurts, P., Ernst, D., and Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63(1):3–42. Ho, T. (1998). The random subspace method for constructing decision forests. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 20(8):832–844. Ishwaran, H. (2007). Variable importance in binary regression trees and forests. Electronic Journal of Statistics, 1:519–537. Kohavi, R. and John, G. H. (1997). Wrappers for feature subset selection. Artificial intelligence, 97(1):273–324. Liaw, A. and Wiener, M. (2002). Classification and regression by randomforest. R news, 2(3):18–22. Lin, Y. and Jeon, Y. (2006). Random forests and adaptive nearest neighbors. Journal of the American Statistical Association, 101(474):578–590. Meinshausen, N. (2006). Quantile regression forests. The Journal of Machine Learning Research, 7:983–999. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: Machine learning in python. The Journal of Machine Learning Research, 12:2825–2830. Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T., and Zeileis, A. (2008). Conditional variable importance for random forests. BMC bioinformatics, 9(1):307. Strobl, C., Boulesteix, A.-L., Zeileis, A., and Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC bioinformatics, 8(1):25. White, A. P. and Liu, W. Z. (1994). Technical note: Bias in information-based measures in decision tree induction. Machine Learning, 15(3):321–329. Zhao, G. (2000). A new perspective on classification. PhD thesis, Utah State University, Department of Mathematics and Statistics. 9
2 0.74299169 159 nips-2013-Learning Prices for Repeated Auctions with Strategic Buyers
Author: Kareem Amin, Afshin Rostamizadeh, Umar Syed
Abstract: Inspired by real-time ad exchanges for online display advertising, we consider the problem of inferring a buyer’s value distribution for a good when the buyer is repeatedly interacting with a seller through a posted-price mechanism. We model the buyer as a strategic agent, whose goal is to maximize her long-term surplus, and we are interested in mechanisms that maximize the seller’s long-term revenue. We define the natural notion of strategic regret — the lost revenue as measured against a truthful (non-strategic) buyer. We present seller algorithms that are no(strategic)-regret when the buyer discounts her future surplus — i.e. the buyer prefers showing advertisements to users sooner rather than later. We also give a lower bound on strategic regret that increases as the buyer’s discounting weakens and shows, in particular, that any seller algorithm will suffer linear strategic regret if there is no discounting. 1
3 0.71099669 350 nips-2013-Wavelets on Graphs via Deep Learning
Author: Raif Rustamov, Leonidas Guibas
Abstract: An increasing number of applications require processing of signals defined on weighted graphs. While wavelets provide a flexible tool for signal processing in the classical setting of regular domains, the existing graph wavelet constructions are less flexible – they are guided solely by the structure of the underlying graph and do not take directly into consideration the particular class of signals to be processed. This paper introduces a machine learning framework for constructing graph wavelets that can sparsely represent a given class of signals. Our construction uses the lifting scheme, and is based on the observation that the recurrent nature of the lifting scheme gives rise to a structure resembling a deep auto-encoder network. Particular properties that the resulting wavelets must satisfy determine the training objective and the structure of the involved neural networks. The training is unsupervised, and is conducted similarly to the greedy pre-training of a stack of auto-encoders. After training is completed, we obtain a linear wavelet transform that can be applied to any graph signal in time and memory linear in the size of the graph. Improved sparsity of our wavelet transform for the test signals is confirmed via experiments both on synthetic and real data. 1
4 0.69160497 304 nips-2013-Sparse nonnegative deconvolution for compressive calcium imaging: algorithms and phase transitions
Author: Eftychios A. Pnevmatikakis, Liam Paninski
Abstract: We propose a compressed sensing (CS) calcium imaging framework for monitoring large neuronal populations, where we image randomized projections of the spatial calcium concentration at each timestep, instead of measuring the concentration at individual locations. We develop scalable nonnegative deconvolution methods for extracting the neuronal spike time series from such observations. We also address the problem of demixing the spatial locations of the neurons using rank-penalized matrix factorization methods. By exploiting the sparsity of neural spiking we demonstrate that the number of measurements needed per timestep is significantly smaller than the total number of neurons, a result that can potentially enable imaging of larger populations at considerably faster rates compared to traditional raster-scanning techniques. Unlike traditional CS setups, our problem involves a block-diagonal sensing matrix and a non-orthogonal sparse basis that spans multiple timesteps. We provide tight approximations to the number of measurements needed for perfect deconvolution for certain classes of spiking processes, and show that this number undergoes a “phase transition,” which we characterize using modern tools relating conic geometry to compressed sensing. 1
5 0.67515075 43 nips-2013-Auxiliary-variable Exact Hamiltonian Monte Carlo Samplers for Binary Distributions
Author: Ari Pakman, Liam Paninski
Abstract: We present a new approach to sample from generic binary distributions, based on an exact Hamiltonian Monte Carlo algorithm applied to a piecewise continuous augmentation of the binary distribution of interest. An extension of this idea to distributions over mixtures of binary and possibly-truncated Gaussian or exponential variables allows us to sample from posteriors of linear and probit regression models with spike-and-slab priors and truncated parameters. We illustrate the advantages of these algorithms in several examples in which they outperform the Metropolis or Gibbs samplers. 1
6 0.61888033 146 nips-2013-Large Scale Distributed Sparse Precision Estimation
7 0.61672354 215 nips-2013-On Decomposing the Proximal Map
8 0.61032015 99 nips-2013-Dropout Training as Adaptive Regularization
9 0.60740751 65 nips-2013-Compressive Feature Learning
10 0.60187835 82 nips-2013-Decision Jungles: Compact and Rich Models for Classification
11 0.5998652 172 nips-2013-Learning word embeddings efficiently with noise-contrastive estimation
12 0.59748405 211 nips-2013-Non-Linear Domain Adaptation with Boosting
13 0.59447479 251 nips-2013-Predicting Parameters in Deep Learning
14 0.59165466 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables
15 0.59082448 5 nips-2013-A Deep Architecture for Matching Short Texts
16 0.59057719 94 nips-2013-Distributed $k$-means and $k$-median Clustering on General Topologies
17 0.59042156 30 nips-2013-Adaptive dropout for training deep neural networks
18 0.58824408 12 nips-2013-A Novel Two-Step Method for Cross Language Representation Learning
19 0.58519626 201 nips-2013-Multi-Task Bayesian Optimization
20 0.5851934 339 nips-2013-Understanding Dropout