jmlr jmlr2012 jmlr2012-20 knowledge-graph by maker-knowledge-mining

20 jmlr-2012-Analysis of a Random Forests Model


Source: pdf

Author: Gérard Biau

Abstract: Random forests are a scheme proposed by Leo Breiman in the 2000’s for building a predictor ensemble with a set of decision trees that grow in randomly selected subspaces of data. Despite growing interest and practical use, there has been little exploration of the statistical properties of random forests, and little is known about the mathematical forces driving the algorithm. In this paper, we offer an in-depth analysis of a random forests model suggested by Breiman (2004), which is very close to the original algorithm. We show in particular that the procedure is consistent and adapts to sparsity, in the sense that its rate of convergence depends only on the number of strong features and not on how many noise variables are present. Keywords: random forests, randomization, sparsity, dimension reduction, consistency, rate of convergence

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 In this paper, we offer an in-depth analysis of a random forests model suggested by Breiman (2004), which is very close to the original algorithm. [sent-5, score-0.505]

2 Introduction In a series of papers and technical reports, Breiman (1996, 2000, 2001, 2004) demonstrated that substantial gains in classification and regression accuracy can be achieved by using ensembles of trees, where each tree in the ensemble is grown in accordance with a random parameter. [sent-8, score-0.322]

3 As the base constituents of the ensemble are tree-structured predictors, and since each of these trees is constructed using an injection of randomness, these procedures are called “random forests. [sent-10, score-0.169]

4 1 Random Forests Breiman’s ideas were decisively influenced by the early work of Amit and Geman (1997) on geometric feature selection, the random subspace method of Ho (1998) and the random split selection approach of Dietterich (2000). [sent-12, score-0.16]

5 , 2008, 2010), random forests have emerged as serious competitors to state-of-the-art methods such as boosting (Freund and Schapire, 1996) and support vector machines (Shawe-Taylor and Cristianini, 2004). [sent-15, score-0.505]

6 In Breiman’s approach, each tree in the collection is formed by first selecting at random, at each node, a small group of input coordinates (also called features or variables hereafter) to split on and, secondly, by calculating the best split based on these features in the training set. [sent-23, score-0.373]

7 The tree is grown using CART methodology (Breiman et al. [sent-24, score-0.164]

8 , 2010) to resample, with replacement, the training data set each time a new individual tree is grown. [sent-27, score-0.166]

9 (2008), who offer consistency theorems for various simplified versions of random forests and other randomized ensemble predictors. [sent-31, score-0.536]

10 Nevertheless, the statistical mechanism of “true” random forests is not yet fully understood and is still under active investigation. [sent-32, score-0.549]

11 In the present paper, we go one step further into random forests by working out and solidifying the properties of a model suggested by Breiman (2004). [sent-33, score-0.505]

12 2 The Model Formally, a random forest is a predictor consisting of a collection of randomized base regression trees {r_n(x, Θ_m, D_n), m ≥ 1}, where Θ_1, Θ_2, . . . are i.i.d. outputs of a randomizing variable Θ. [sent-50, score-0.249]

13 These random trees are combined to form the aggregated regression estimate r̄_n(X, D_n) = E_Θ[r_n(X, Θ, D_n)], where E_Θ denotes expectation with respect to the random parameter, conditionally on X and the data set D_n. [sent-57, score-0.383]

14 In the following, to lighten notation a little, we will omit the dependency of the estimates on the sample and write, for example, r_n(X) instead of r_n(X, D_n). [sent-58, score-0.21]

15 The randomizing variable Θ is used to determine how the successive cuts are performed when building the individual trees, such as selection of the coordinate to split and position of the split. [sent-60, score-0.287]

16 With these warnings in mind, we will assume that each individual random tree is constructed in the following way. [sent-66, score-0.206]

17 All nodes of the tree are associated with rectangular cells such that at each step of the construction of the tree, the collection of cells associated with the leaves of the tree (i.e., the external nodes) forms a partition of [0, 1]^d. [sent-67, score-0.424]

18 The following procedure is then repeated ⌈log2 k_n⌉ times, where log2 is the base-2 logarithm and ⌈·⌉ the ceiling function. [sent-71, score-0.469]

19 Here k_n ≥ 2 is a deterministic parameter, fixed beforehand by the user, and possibly depending on n. [sent-72, score-0.509]

20 At each node, a coordinate of X = (X^(1), . . . , X^(d)) is selected, with the j-th feature having a probability p_nj ∈ (0, 1) of being selected. [sent-77, score-0.212]

21 At each node, once the coordinate is selected, the split is at the midpoint of the chosen side. [sent-79, score-0.172]

22 Each randomized tree rn (X, Θ) outputs the average over all Yi for which the corresponding vectors Xi fall in the same cell of the random partition as X. [sent-80, score-0.334]

23 In other words, letting A_n(X, Θ) be the rectangular cell of the random partition containing X, r_n(X, Θ) = (∑_{i=1}^n Y_i 1_[X_i ∈ A_n(X,Θ)] / ∑_{i=1}^n 1_[X_i ∈ A_n(X,Θ)]) 1_{E_n(X,Θ)}, where the event E_n(X, Θ) is defined by E_n(X, Θ) = [∑_{i=1}^n 1_[X_i ∈ A_n(X,Θ)] ≠ 0]. [sent-81, score-0.272]

24 Taking finally the expectation with respect to the parameter Θ, the random forests regression estimate takes the form r̄_n(X) = E_Θ[r_n(X, Θ)] = E_Θ[ (∑_{i=1}^n Y_i 1_[X_i ∈ A_n(X,Θ)] / ∑_{i=1}^n 1_[X_i ∈ A_n(X,Θ)]) 1_{E_n(X,Θ)} ]. [sent-83, score-0.272]

25 Let us now make some general remarks about this random forests model. [sent-84, score-0.505]
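
For readers who prefer to see the construction in code, here is a minimal Python sketch (our own illustration, not the author's implementation) of one random tree partition and of the Monte Carlo aggregation over Θ; the helper names (build_partition, tree_estimate, forest_estimate) are ours, and p is assumed to be a probability vector (p_n1, . . . , p_nd) summing to one.

    import numpy as np

    def build_partition(rng, kn, d, p):
        """One draw of Theta: repeat ceil(log2(kn)) splitting rounds; at each leaf a
        coordinate j is chosen with probability p[j] and the cell is cut at its midpoint."""
        rounds = int(np.ceil(np.log2(kn)))
        cells = [np.array([[0.0, 1.0]] * d)]              # each cell: d rows of [low, high]
        for _ in range(rounds):
            new_cells = []
            for cell in cells:
                j = rng.choice(d, p=p)                    # coordinate selected with prob. p[j]
                mid = 0.5 * (cell[j, 0] + cell[j, 1])     # split at the midpoint of the chosen side
                left, right = cell.copy(), cell.copy()
                left[j, 1], right[j, 0] = mid, mid
                new_cells += [left, right]
            cells = new_cells
        return cells                                      # 2^ceil(log2 kn) rectangular leaves

    def tree_estimate(x, X, Y, cells):
        """r_n(x, Theta): average of the Y_i whose X_i fall in the cell containing x,
        and 0 on the complement of E_n(x, Theta), i.e., when that cell is empty.
        Boundary ties have probability zero under a continuous design and are ignored."""
        for cell in cells:
            if np.all((x >= cell[:, 0]) & (x <= cell[:, 1])):
                in_cell = np.all((X >= cell[:, 0]) & (X <= cell[:, 1]), axis=1)
                return Y[in_cell].mean() if in_cell.any() else 0.0
        return 0.0

    def forest_estimate(x, X, Y, kn, p, n_trees=200, seed=0):
        """Monte Carlo approximation of E_Theta[r_n(x, Theta)], i.e., the forest estimate."""
        rng = np.random.default_rng(seed)
        return np.mean([tree_estimate(x, X, Y, build_partition(rng, kn, X.shape[1], p))
                        for _ in range(n_trees)])

For instance, with X drawn uniformly on [0, 1]^5 and p = np.full(5, 0.2), forest_estimate(np.full(5, 0.3), X, Y, kn=25, p=np.full(5, 0.2)) approximates r̄_n at the query point x = (0.3, . . . , 0.3).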

26 First of all, we note that, by construction, each individual tree has exactly 2^{⌈log2 k_n⌉} (≈ k_n) terminal nodes, and each leaf has Lebesgue measure 2^{−⌈log2 k_n⌉} (≈ 1/k_n). [sent-85, score-1.715]
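
As a quick numerical check of these counts (our own illustration, with the assumed values k_n = 100 and n = 6400):

    import math
    kn, n = 100, 6400
    rounds = math.ceil(math.log2(kn))    # 7 splitting rounds
    print(2 ** rounds)                   # 128 terminal nodes, i.e., roughly kn
    print(2.0 ** (-rounds))              # Lebesgue measure 1/128 per leaf, i.e., roughly 1/kn
    print(n / 2 ** rounds)               # 50 observations per leaf on average, of the order n/kn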

27 Thus, if X has uniform distribution on [0, 1]d , there will be on average about n/kn observations per terminal node. [sent-86, score-0.142]

28 In particular, the choice kn = n induces a very small number of cases in the final leaves, in accordance with the idea that the single trees should not be pruned. [sent-87, score-0.661]

29 Next, we see that, during the construction of the tree, at each node, each candidate coordinate X ( j) may be chosen with probability pn j ∈ (0, 1). [sent-88, score-0.264]

30 This scheme is close to what the original random forests algorithm does, the essential difference being that the latter algorithm uses the actual data set to calculate the best splits. [sent-92, score-0.564]

31 In Section 2, we prove that the random forests regression estimate rn is consistent and discuss its rate of convergence. [sent-96, score-0.66]

32 As a striking result, we show under a sparsity framework that the rate of convergence depends only on the number of active (or strong) variables and not on the dimension of the ambient space. [sent-97, score-0.21]

33 This feature is particularly desirable in high-dimensional regression, when the number of variables can be much larger than the sample size, and may explain why random forests are able to handle a very large number of input variables without overfitting. [sent-98, score-0.561]

34 We start the analysis with the following simple theorem, which shows that the random forests estimate r̄_n is consistent. [sent-103, score-0.596]

35 Then the random forests estimate r̄_n is consistent whenever p_nj log k_n → ∞ for all j = 1, . . . , d and k_n/n → 0 as n → ∞. [sent-105, score-1.318]

36 Theorem 1 mainly serves as an illustration of how the consistency problem of random forests predictors may be attacked. [sent-109, score-0.531]

37 It encompasses, in particular, the situation where, at each node, the coordinate to split is chosen uniformly at random over the d candidates. [sent-110, score-0.199]

38 In this “purely random” model, pn j = 1/d, independently of n and j, and consistency is ensured as long as kn → ∞ and kn /n → 0. [sent-111, score-1.15]
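
A small numerical sanity check of these two conditions (ours), taking k_n = ⌊√n⌋ and p_nj = 1/d with d = 50:

    import math
    d = 50
    for n in (10**3, 10**6, 10**9):
        kn = math.isqrt(n)                      # kn -> infinity with n
        print(n, kn / n, math.log(kn) / d)      # kn/n -> 0, p_nj * log(kn) -> infinity (slowly)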

39 This is however a radically simplified version of the random forests used in practice, which does not explain the good performance of the algorithm. [sent-112, score-0.505]

40 In the dimension reduction scenario we have in mind, the ambient dimension d can be very large, much larger than the sample size n, but we believe that the representation is sparse, that is, that very few coordinates of r are non-zero, with indices corresponding to the set S . [sent-137, score-0.186]

41 Within this sparsity framework, it is intuitively clear that the coordinate-sampling probabilities should ideally satisfy the constraints pn j = 1/S for j ∈ S (and, consequently, pn j = 0 otherwise). [sent-140, score-0.507]

42 Thus, to stick to reality, we will rather require in the following that pn j = (1/S)(1 + ξn j ) for j ∈ S (and pn j = ξn j otherwise), where pn j ∈ (0, 1) and each ξn j tends to 0 as n tends to infinity. [sent-142, score-0.744]

43 We have now enough material for a deeper understanding of the random forests algorithm. [sent-146, score-0.505]

44 Let us start with the variance/bias decomposition E[r̄_n(X) − r(X)]² = E[r̄_n(X) − r̃_n(X)]² + E[r̃_n(X) − r(X)]² (2), where we set r̃_n(X) = ∑_{i=1}^n E_Θ[W_ni(X, Θ)] r(X_i). [sent-148, score-0.182]

45 Then, if p_nj = (1/S)(1 + ξ_nj) for j ∈ S, E[r̄_n(X) − r̃_n(X)]² ≤ Cσ² (S²/(S−1))^{S/(2d)} (1 + ξ_n) k_n/(n (log k_n)^{S/(2d)}), where C = (288/π) (π log 2/16)^{S/(2d)}. [sent-152, score-1.282]

46 In particular, if a < p_nj < b for some constants a, b ∈ (0, 1), then 1 + ξ_n ≤ ((S−1)/(S² a(1−b)))^{S/(2d)}. [sent-155, score-0.212]

47 The main message of Proposition 2 is that the variance of the forests estimate is O(k_n/(n (log k_n)^{S/(2d)})). [sent-156, score-0.97]
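
To get a feel for the size of this variance bound (a numerical illustration with assumed values, not a computation from the paper), the constant C and the bound of Proposition 2 can be evaluated directly:

    import math

    def prop2_bound(n, kn, S, d, sigma2, xi_n=0.0):
        # C * sigma^2 * (S^2/(S-1))^(S/2d) * (1 + xi_n) * kn / (n * (log kn)^(S/2d))
        e = S / (2.0 * d)
        C = (288.0 / math.pi) * (math.pi * math.log(2.0) / 16.0) ** e
        return C * sigma2 * (S * S / (S - 1.0)) ** e * (1.0 + xi_n) * kn / (n * math.log(kn) ** e)

    # ~ 0.54 here; scales like kn/n up to constants and a slowly varying log factor
    print(prop2_bound(n=10**5, kn=10**3, S=5, d=10, sigma2=1.0))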

48 To understand this remark, recall that individual (random or not) trees are proved to be consistent by letting the number of cases in each terminal node become large (see Devroye et al. [sent-158, score-0.454]

49 , 1996, Chapter 20), with a typical variance of the order kn /n. [sent-159, score-0.505]

50 Thus, the extreme choice k_n = n (about one observation on average in each terminal node) is clearly not suitable and leads to serious overfitting and variance explosion. [sent-162, score-0.178]

51 On the other hand, the variance of the forest is of the order kn /(n(log kn )S/2d ). [sent-163, score-1.012]

52 Therefore, letting k_n = n, the variance is of the order 1/(log n)^{S/(2d)}, a quantity which still goes to 0 as n grows! [sent-164, score-0.538]
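
To see how slowly this quantity decays (our own illustration, with S = 5 and d = 10):

    import math
    S, d = 5, 10
    for n in (10**3, 10**6, 10**12):
        print(n, 1.0 / math.log(n) ** (S / (2.0 * d)))   # 0.62, 0.52, 0.44: slow but decreasing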

53 We believe that it provides an interesting perspective on why random forests are still able to do a good job, despite the fact that individual trees are not pruned. [sent-166, score-0.676]

54 Then, if p_nj = (1/S)(1 + ξ_nj) for j ∈ S, E[r̃_n(X) − r(X)]² ≤ 2SL² k_n^{−(0.75/(S log 2))(1+γ_n)} + sup_{x∈[0,1]^d} r²(x) e^{−n/(2k_n)}. [sent-173, score-0.212]

55 Here γ_n = min_{j∈S} ξ_nj tends to 0 as n tends to infinity. [sent-174, score-0.647]

56 The rate k_n^{−(0.75/(S log 2))(1+γ_n)} should be compared with the ordinary partitioning estimate bias, which is of the order k_n^{−2/d} under the smoothness conditions of Proposition 4 (see for instance Györfi et al. [sent-177, score-0.51]

57 In this respect, it is easy to see that k_n^{−(0.75/(S log 2))(1+γ_n)} = o(k_n^{−2/d}) as soon as S ≤ ⌊0.54 d⌋. [sent-179, score-0.469]

58 In other words, when the number of active variables is less than (roughly) half of the ambient dimension, the bias of the random forests regression estimate decreases to 0 much faster than the usual rate. [sent-183, score-0.708]
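
The "roughly half" statement can be made explicit by comparing the two exponents (a small check of ours): 0.75/(S log 2) > 2/d exactly when S < 0.75 d/(2 log 2) ≈ 0.54 d.

    import math
    d = 20
    threshold = 0.75 * d / (2.0 * math.log(2.0))                       # ~ 0.54 * d = 10.8
    for S in (2, 5, 10, 12):
        print(S, 0.75 / (S * math.log(2.0)), 2.0 / d, S < threshold)   # forest exponent vs. 2/d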

59 Note at last that, contrary to Proposition 2, the term e^{−n/(2k_n)} prevents the extreme choice k_n = n (about one observation on average in each terminal node). [sent-186, score-0.611]

60 Then, if p_nj = (1/S)(1 + ξ_nj) for j ∈ S, letting γ_n = min_{j∈S} ξ_nj, we have E[r̄_n(X) − r(X)]² ≤ Ξ_n k_n/n + 2SL² k_n^{−(0.75/(S log 2))(1+γ_n)}. [sent-190, score-0.714]

61 Here Ξ_n = Cσ² (S²/(S−1))^{S/(2d)} (1 + ξ_n) + 2e^{−1} sup_{x∈[0,1]^d} r²(x), where C = (288/π) (π log 2/16)^{S/(2d)}. [sent-191, score-0.58]

62 Then, if p_nj = (1/S)(1 + ξ_nj) for j ∈ S, with ξ_nj log n → 0 as n → ∞, for the choice k_n ∝ (L²/Ξ_n)^{1/(1+0.75/(S log 2))} n^{1/(1+0.75/(S log 2))} the estimate satisfies E[r̄_n(X) − r(X)]² = O(n^{−0.75/(S log 2+0.75)}). [sent-201, score-0.722]

63 This result reveals the fact that the L²-rate of convergence of r̄_n(X) to r(X) depends only on the number S of strong variables, and not on the ambient dimension d. [sent-209, score-0.225]

64 The main message of Corollary 6 is that if we are able to properly tune the probability sequences (p_nj)_{n≥1} and make them sufficiently fast to track the informative features, then the rate of convergence of the random forests estimate is of order n^{−0.75/(S log 2+0.75)}. [sent-210, score-0.536]

65 However, in our setting, the intrinsic dimension of the regression problem is S, not d, and the random forests estimate cleverly adapts to the sparsity of the problem. [sent-216, score-0.649]
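
As a rough numerical illustration of this adaptation (ours), compare the sparsity-aware rate n^{−0.75/(S log 2+0.75)} with the usual ambient-dimension rate n^{−2/(d+2)} for a Lipschitz regression function:

    import math
    n, S, d = 10**6, 3, 100
    print(n ** (-0.75 / (S * math.log(2.0) + 0.75)))   # ~ 0.026, driven by S only
    print(n ** (-2.0 / (d + 2.0)))                     # ~ 0.76, nearly useless for d = 100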

66 It is noteworthy that the rate of convergence of the ξn j to 0 (and, consequently, the rate at which the probabilities pn j approach 1/S for j ∈ S ) will eventually depend on the ambient dimension d through the ratio S/d. [sent-220, score-0.463]

67 Remark 8 The optimal parameter kn of Corollary 6 depends on the unknown distribution of (X,Y ), especially on the smoothness of the regression function and the effective dimension S. [sent-237, score-0.529]

68 Adaptive (i.e., data-dependent) choices of k_n, such as data-splitting or cross-validation, should preserve the rate of convergence of the estimate. [sent-240, score-0.5]

69 Another route we may follow is to analyse the effect of bootstrapping the sample before growing the individual trees (i. [sent-241, score-0.199]

70 It is our belief that this procedure should also preserve the rate of convergence, even for overfitted trees (k_n ≈ n), in the spirit of Biau et al. [sent-244, score-0.169]

71 In this case, the variance term is 0 and we have r̄_n(X) = r̃_n(X) = ∑_{i=1}^n E_Θ[W_ni(X, Θ)] Y_i. [sent-251, score-0.218]

72 Closer attention shows that this last term is indeed equal to E[E_{Θ,Θ′} Cov(r_n(X, Θ), r_n(X, Θ′) | Z_n)]. [sent-258, score-0.182]

73 The key observation is that if trees have strong predictive power, then they can be unconditionally strongly correlated while being conditionally weakly correlated. [sent-259, score-0.179]

74 We recall that these sequences should be in (0, 1) and obey the constraints pn j = (1/S)(1 + ξn j ) for j ∈ S (and pn j = ξn j otherwise), where the (ξn j )n≥1 tend to 0 as n tends to infinity. [sent-266, score-0.478]

75 A positive integer Mn —possibly depending on n—is fixed beforehand and the following splitting scheme is iteratively repeated at each node of the tree: 1. [sent-271, score-0.178]

76 (The best split at each node is computed on a second sample of the same size as D_n, in order to mimic the ideal split probability p_nj.) To illustrate this mechanism, suppose—to keep things simple—that the model is linear, that is, Y = ∑_{j∈S} a_j X^(j) + ε, where X = (X^(1), . [sent-283, score-0.324]

77 It is a simple exercise to prove that if X is uniform and j ∈ S, then the split on the j-th side which most decreases the weighted conditional variance is at the midpoint of the node, with a variance decrease equal to a_j²/16 > 0. [sent-293, score-0.227]
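
This a_j²/16 figure is easy to verify by a small Monte Carlo experiment (our own sketch, for a single informative coordinate with X uniform on [0, 1]):

    import numpy as np
    rng = np.random.default_rng(0)
    a, sigma = 2.0, 0.5
    X = rng.random(200_000)
    Y = a * X + sigma * rng.standard_normal(X.size)
    left, right = Y[X < 0.5], Y[X >= 0.5]                          # midpoint cut
    print(Y.var() - 0.5 * (left.var() + right.var()), a**2 / 16)   # both close to 0.25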

78 For each of the M_n elected coordinates, calculate the best split, that is, the split which most decreases the within-node sum of squares on the second sample D′_n. [sent-299, score-0.149]

79 This procedure is indeed close to what the random forests algorithm does. [sent-302, score-0.505]
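
The splitting step of this scheme can be sketched as follows (a minimal Python illustration with our own function names, not the authors' implementation; the candidate coordinates are assumed to be drawn with replacement according to the probabilities p):

    import numpy as np

    def best_cut(x2, y2):
        """Best cut along one coordinate: the threshold (midpoint of consecutive sorted
        values of the second sample) minimizing the within-node sum of squares."""
        order = np.argsort(x2)
        xs, ys = x2[order], y2[order]
        best_sse, best_pos = np.inf, None
        for i in range(1, len(xs)):
            left, right = ys[:i], ys[i:]
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best_sse:
                best_sse, best_pos = sse, 0.5 * (xs[i - 1] + xs[i])
        return best_sse, best_pos

    def select_split(X2, Y2, Mn, p, rng):
        """Draw Mn candidate coordinates with probabilities p, then keep the coordinate/cut
        pair with the smallest within-node sum of squares on the second sample (X2, Y2)."""
        candidates = np.unique(rng.choice(X2.shape[1], size=Mn, p=p))
        scored = {j: best_cut(X2[:, j], Y2) for j in candidates}
        j_star = min(scored, key=lambda j: scored[j][0])
        return j_star, scored[j_star][1]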

80 In fact, depending on the context and the actual cut selection procedure, the informative probabilities pn j ( j ∈ S ) may obey the constraints pn j → p j as n → ∞ (thus, p j is not necessarily equal to 1/S), where the p j are positive and satisfy ∑ j∈S p j = 1. [sent-308, score-0.546]

81 This empirical randomization scheme leads to complicated probabilities of cuts which, this time, vary at each node of each tree and are not easily amenable to analysis. [sent-310, score-0.463]

82 Put differently, for j ∈ S, p_nj ≈ (1/S)(1 + ξ_nj), where ξ_nj goes to 0 and satisfies the constraint ξ_nj log n → 0 as n tends to infinity, provided k_n log n/n → 0, M_n → ∞ and M_n/log n → ∞. [sent-312, score-0.858]

83 It is also noteworthy that random forests use the so-called out-of-bag samples (i. [sent-315, score-0.548]

84 Our aim is not to provide a thorough practical study of the 1074 A NALYSIS OF A R ANDOM F ORESTS M ODEL random forests method, but rather to illustrate the main ideas of the article. [sent-328, score-0.505]

85 Figure 2: The tree used as regression function in the model Tree. [sent-340, score-0.166]

86 Observe also that, in our context, the model Tree should be considered as a “no-bias” model, on which the random forests algorithm is expected to perform well. [sent-346, score-0.505]

87 The random forests algorithm was performed with the parameter mtry automatically tuned by the R package RandomForests, 1000 random trees and the minimum node size set to 5 (which is the default value for regression). [sent-361, score-0.871]

88 For the sake of coherence, since the minimum node size is set to 5 in the RandomForests package, the number of terminal nodes in the custom algorithm was calibrated to ⌈n/5⌉. [sent-368, score-0.284]
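
For readers who want a comparable setup in Python rather than R (a rough analogue, not the authors' experiment), scikit-learn exposes the corresponding knobs: n_estimators for the number of trees, min_samples_leaf for the minimum node size, and max_features for mtry.

    from sklearn.ensemble import RandomForestRegressor

    # Analogous, not identical, configuration: 1000 trees and minimum node size 5;
    # max_features is left at its regression default, whereas mtry is auto-tuned in the R run.
    rf = RandomForestRegressor(n_estimators=1000, min_samples_leaf=5, random_state=0)
    # rf.fit(X_train, y_train); rf.predict(X_test)   # X_train, y_train, X_test assumed given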

89 It must be stressed that the essential difference between the standard random forests algorithm and the alternative one is that the number of cases in the final leaves is fixed in the former, whereas the latter assumes a fixed number of terminal nodes. [sent-369, score-0.647]

90 Next, we see that for a sufficiently large n, the capabilities of the forests are nearly independent of d, in accordance with the idea that the (asymptotic) rate of convergence of the method should only depend on the “true” dimensionality S (Theorem 5). [sent-375, score-0.55]

91 Fact 1 Let Kn j (X, Θ) be the number of times the terminal node An (X, Θ) is split on the j-th coordinate ( j = 1, . [sent-380, score-0.382]

92 ⌈log2 kn ⌉ and pn j (by independence of X and Θ). [sent-385, score-0.681]

93 Moreover, by construction, d ∑ Kn j (X, Θ) = ⌈log2 kn ⌉. [sent-386, score-0.469]

94 Fact 2 By construction, λ (An (X, Θ)) = 2−⌈log2 kn ⌉ . [sent-389, score-0.469]

95 In particular, if X is uniformly distributed on [0, 1]d , then the distribution of Nn (X, Θ) conditionally on X and Θ is binomial with parameters n and 2−⌈log2 kn ⌉ (by independence of the random variables X, X1 , . [sent-390, score-0.67]
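
Both facts are straightforward to check by simulation (a small sketch of ours; for Fact 2 we only use that, for uniform X_i, the leaf count depends on the cell through its Lebesgue measure 2^{−⌈log2 k_n⌉}):

    import math
    import numpy as np

    rng = np.random.default_rng(0)
    d, kn, n = 5, 64, 10_000
    p = np.full(d, 1.0 / d)
    rounds = math.ceil(math.log2(kn))           # splits on the path to A_n(X, Theta)

    # Fact 1: coordinate j is chosen independently at each of the `rounds` splits,
    # so K_nj(X, Theta) ~ Binomial(rounds, p_j); compare empirical and theoretical means.
    K = np.array([(rng.choice(d, size=rounds, p=p) == 0).sum() for _ in range(20_000)])
    print(K.mean(), rounds * p[0])

    # Fact 2: a cell of measure 2^-rounds contains, for uniform X_i, a Binomial(n, 2^-rounds)
    # number of points; its mean n * 2^-rounds is about n/kn.
    X = rng.random(n)
    print((X < 2.0 ** (-rounds)).sum(), n * 2.0 ** (-rounds))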

96 If f is bounded from above and from below, this probability is of the order λ (An (X, Θ)) = 2−⌈log2 kn ⌉ , and the whole approach can be carried out without difficulty. [sent-396, score-0.469]

97 To see this, consider the random tree partition defined by Θ, which has by construction exactly 2⌈log2 kn ⌉ rectangular cells, say A1 , . [sent-404, score-0.68]

98 , N2⌈log2 kn ⌉ denote the number of observations among X, X1 , . [sent-411, score-0.469]

99 , Xn falling in these 2⌈log2 kn ⌉ cells, and let C = {X, X1 , . [sent-414, score-0.529]

100 Thus, for every fixed M ≥ 0, P(N_n(X, Θ) < M) = E[P(N_n(X, Θ) < M | C, Θ)] = E[ ∑_{ℓ=1,...,2^{⌈log2 k_n⌉}} N_ℓ/(n+1) · · · ].


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('kn', 0.469), ('forests', 0.465), ('pn', 0.212), ('iau', 0.2), ('orests', 0.18), ('dn', 0.167), ('breiman', 0.148), ('terminal', 0.142), ('trees', 0.138), ('tree', 0.133), ('andom', 0.128), ('biau', 0.12), ('nn', 0.111), ('node', 0.108), ('mn', 0.103), ('odel', 0.097), ('rn', 0.091), ('cuts', 0.091), ('nalysis', 0.09), ('cart', 0.086), ('wni', 0.086), ('split', 0.08), ('mtry', 0.08), ('ambient', 0.08), ('cell', 0.07), ('mse', 0.068), ('proposition', 0.063), ('randomization', 0.062), ('falling', 0.06), ('genuer', 0.06), ('randomforests', 0.06), ('sinus', 0.06), ('zn', 0.059), ('accordance', 0.054), ('cut', 0.054), ('tends', 0.054), ('boxplots', 0.053), ('coordinate', 0.052), ('coordinates', 0.052), ('replacement', 0.05), ('sparsity', 0.044), ('mechanism', 0.044), ('cells', 0.043), ('noteworthy', 0.043), ('paris', 0.043), ('lipschitz', 0.042), ('conditionally', 0.041), ('log', 0.041), ('midpoint', 0.04), ('rard', 0.04), ('random', 0.04), ('devroye', 0.04), ('beforehand', 0.04), ('adapts', 0.04), ('probabilities', 0.039), ('remark', 0.039), ('binomial', 0.038), ('forest', 0.038), ('mind', 0.038), ('rectangular', 0.038), ('evolution', 0.037), ('variance', 0.036), ('xs', 0.036), ('amit', 0.036), ('rs', 0.036), ('decreases', 0.035), ('elected', 0.034), ('safely', 0.034), ('nodes', 0.034), ('document', 0.034), ('individual', 0.033), ('regression', 0.033), ('letting', 0.033), ('things', 0.032), ('lebesgue', 0.032), ('rate', 0.031), ('grown', 0.031), ('cedex', 0.031), ('geman', 0.031), ('randomizing', 0.031), ('ensemble', 0.031), ('scheme', 0.03), ('ranging', 0.03), ('actual', 0.029), ('sup', 0.029), ('fs', 0.028), ('route', 0.028), ('driving', 0.028), ('lighten', 0.028), ('variables', 0.028), ('uniformly', 0.027), ('bn', 0.027), ('reveals', 0.027), ('friedman', 0.027), ('bias', 0.027), ('distributed', 0.027), ('dimension', 0.027), ('wavelet', 0.027), ('bagging', 0.027), ('predictors', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 20 jmlr-2012-Analysis of a Random Forests Model

Author: Gérard Biau

Abstract: Random forests are a scheme proposed by Leo Breiman in the 2000’s for building a predictor ensemble with a set of decision trees that grow in randomly selected subspaces of data. Despite growing interest and practical use, there has been little exploration of the statistical properties of random forests, and little is known about the mathematical forces driving the algorithm. In this paper, we offer an in-depth analysis of a random forests model suggested by Breiman (2004), which is very close to the original algorithm. We show in particular that the procedure is consistent and adapts to sparsity, in the sense that its rate of convergence depends only on the number of strong features and not on how many noise variables are present. Keywords: random forests, randomization, sparsity, dimension reduction, consistency, rate of convergence

2 0.12875324 29 jmlr-2012-Consistent Model Selection Criteria on High Dimensions

Author: Yongdai Kim, Sunghoon Kwon, Hosik Choi

Abstract: Asymptotic properties of model selection criteria for high-dimensional regression models are studied where the dimension of covariates is much larger than the sample size. Several sufficient conditions for model selection consistency are provided. Non-Gaussian error distributions are considered and it is shown that the maximal number of covariates for model selection consistency depends on the tail behavior of the error distribution. Also, sufficient conditions for model selection consistency are given when the variance of the noise is neither known nor estimated consistently. Results of simulation studies as well as real data analysis are given to illustrate that finite sample performances of consistent model selection criteria can be quite different. Keywords: model selection consistency, general information criteria, high dimension, regression

3 0.092762746 72 jmlr-2012-Multi-Target Regression with Rule Ensembles

Author: Timo Aho, Bernard Ženko, Sašo Džeroski, Tapio Elomaa

Abstract: Methods for learning decision rules are being successfully applied to many problem domains, in particular when understanding and interpretation of the learned model is necessary. In many real life problems, we would like to predict multiple related (nominal or numeric) target attributes simultaneously. While several methods for learning rules that predict multiple targets at once exist, they are all based on the covering algorithm, which does not work well for regression problems. A better solution for regression is the rule ensemble approach that transcribes an ensemble of decision trees into a large collection of rules. An optimization procedure is then used to select the best (and much smaller) subset of these rules and to determine their respective weights. We introduce the F IRE algorithm for solving multi-target regression problems, which employs the rule ensembles approach. We improve the accuracy of the algorithm by adding simple linear functions to the ensemble. We also extensively evaluate the algorithm with and without linear functions. The results show that the accuracy of multi-target regression rule ensembles is high. They are more accurate than, for instance, multi-target regression trees, but not quite as accurate as multi-target random forests. The rule ensembles are significantly more concise than random forests, and it is also possible to create compact rule sets that are smaller than a single regression tree but still comparable in accuracy. Keywords: multi-target prediction, rule learning, rule ensembles, regression ∗. Also in Microtask, Tampere, Finland. †. Also in the Centre of Excellence for Integrated Approaches in Chemistry and Biology of Proteins, Ljubljana, Slovenia. ‡. Also in the Centre of Excellence for Integrated Approaches in Chemistry and Biology of Proteins, Ljubljana, Slovenia and the Joˇ ef Stefan International Postgraduate School, Ljubljana, Slovenia. z ˇ c 2012 Timo Aho, Bernard Zenko, Saˇo Dˇ eroski and Tapio Elomaa. s z ˇ ˇ

4 0.09007699 51 jmlr-2012-Integrating a Partial Model into Model Free Reinforcement Learning

Author: Aviv Tamar, Dotan Di Castro, Ron Meir

Abstract: In reinforcement learning an agent uses online feedback from the environment in order to adaptively select an effective policy. Model free approaches address this task by directly mapping environmental states to actions, while model based methods attempt to construct a model of the environment, followed by a selection of optimal actions based on that model. Given the complementary advantages of both approaches, we suggest a novel procedure which augments a model free algorithm with a partial model. The resulting hybrid algorithm switches between a model based and a model free mode, depending on the current state and the agent’s knowledge. Our method relies on a novel definition for a partially known model, and an estimator that incorporates such knowledge in order to reduce uncertainty in stochastic approximation iterations. We prove that such an approach leads to improved policy evaluation whenever environmental knowledge is available, without compromising performance when such knowledge is absent. Numerical simulations demonstrate the effectiveness of the approach on policy gradient and Q-learning algorithms, and its usefulness in solving a call admission control problem. Keywords: reinforcement learning, temporal difference, stochastic approximation, markov decision processes, hybrid model based model free algorithms

5 0.08903899 42 jmlr-2012-Facilitating Score and Causal Inference Trees for Large Observational Studies

Author: Xiaogang Su, Joseph Kang, Juanjuan Fan, Richard A. Levine, Xin Yan

Abstract: Assessing treatment effects in observational studies is a multifaceted problem that not only involves heterogeneous mechanisms of how the treatment or cause is exposed to subjects, known as propensity, but also differential causal effects across sub-populations. We introduce a concept termed the facilitating score to account for both the confounding and interacting impacts of covariates on the treatment effect. Several approaches for estimating the facilitating score are discussed. In particular, we put forward a machine learning method, called causal inference tree (CIT), to provide a piecewise constant approximation of the facilitating score. With interpretable rules, CIT splits data in such a way that both the propensity and the treatment effect become more homogeneous within each resultant partition. Causal inference at different levels can be made on the basis of CIT. Together with an aggregated grouping procedure, CIT stratifies data into strata where causal effects can be conveniently assessed within each. Besides, a feasible way of predicting individual causal effects (ICE) is made available by aggregating ensemble CIT models. Both the stratified results and the estimated ICE provide an assessment of heterogeneity of causal effects and can be integrated for estimating the average causal effect (ACE). Mean square consistency of CIT is also established. We evaluate the performance of proposed methods with simulations and illustrate their use with the NSW data in Dehejia and Wahba (1999) where the objective is to assess the impact of a labor training program, the National Supported Work (NSW) demonstration, on post-intervention earnings. Keywords: CART, causal inference, confounding, interaction, observational study, personalized medicine, recursive partitioning

6 0.082967922 76 jmlr-2012-Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics

7 0.07568635 2 jmlr-2012-A Comparison of the Lasso and Marginal Regression

8 0.067265585 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality

9 0.066926882 80 jmlr-2012-On Ranking and Generalization Bounds

10 0.066526726 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming

11 0.06325604 59 jmlr-2012-Linear Regression With Random Projections

12 0.05779282 48 jmlr-2012-High-Dimensional Gaussian Graphical Model Selection: Walk Summability and Local Separation Criterion

13 0.055220716 69 jmlr-2012-Mixability is Bayes Risk Curvature Relative to Log Loss

14 0.052621707 5 jmlr-2012-A Local Spectral Method for Graphs: With Applications to Improving Graph Partitions and Exploring Data Graphs Locally

15 0.046602231 68 jmlr-2012-Minimax Manifold Estimation

16 0.043438092 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods

17 0.042873502 99 jmlr-2012-Restricted Strong Convexity and Weighted Matrix Completion: Optimal Bounds with Noise

18 0.042721782 91 jmlr-2012-Plug-in Approach to Active Learning

19 0.041960247 96 jmlr-2012-Refinement of Operator-valued Reproducing Kernels

20 0.041335318 82 jmlr-2012-On the Necessity of Irrelevant Variables


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.209), (1, 0.144), (2, -0.059), (3, -0.104), (4, 0.016), (5, -0.013), (6, -0.095), (7, 0.021), (8, 0.055), (9, -0.025), (10, 0.116), (11, 0.134), (12, 0.195), (13, 0.086), (14, -0.127), (15, 0.233), (16, -0.059), (17, -0.122), (18, -0.014), (19, -0.007), (20, -0.075), (21, -0.152), (22, -0.15), (23, -0.071), (24, 0.071), (25, 0.005), (26, -0.061), (27, 0.165), (28, 0.081), (29, -0.132), (30, 0.061), (31, 0.169), (32, 0.019), (33, 0.023), (34, -0.021), (35, -0.002), (36, -0.066), (37, -0.032), (38, -0.175), (39, -0.071), (40, -0.059), (41, -0.074), (42, 0.016), (43, -0.051), (44, 0.026), (45, -0.018), (46, 0.137), (47, -0.06), (48, -0.031), (49, 0.046)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94891703 20 jmlr-2012-Analysis of a Random Forests Model

Author: Gérard Biau

Abstract: Random forests are a scheme proposed by Leo Breiman in the 2000’s for building a predictor ensemble with a set of decision trees that grow in randomly selected subspaces of data. Despite growing interest and practical use, there has been little exploration of the statistical properties of random forests, and little is known about the mathematical forces driving the algorithm. In this paper, we offer an in-depth analysis of a random forests model suggested by Breiman (2004), which is very close to the original algorithm. We show in particular that the procedure is consistent and adapts to sparsity, in the sense that its rate of convergence depends only on the number of strong features and not on how many noise variables are present. Keywords: random forests, randomization, sparsity, dimension reduction, consistency, rate of convergence

2 0.68435395 72 jmlr-2012-Multi-Target Regression with Rule Ensembles

Author: Timo Aho, Bernard Ženko, Sašo Džeroski, Tapio Elomaa

Abstract: Methods for learning decision rules are being successfully applied to many problem domains, in particular when understanding and interpretation of the learned model is necessary. In many real life problems, we would like to predict multiple related (nominal or numeric) target attributes simultaneously. While several methods for learning rules that predict multiple targets at once exist, they are all based on the covering algorithm, which does not work well for regression problems. A better solution for regression is the rule ensemble approach that transcribes an ensemble of decision trees into a large collection of rules. An optimization procedure is then used to select the best (and much smaller) subset of these rules and to determine their respective weights. We introduce the F IRE algorithm for solving multi-target regression problems, which employs the rule ensembles approach. We improve the accuracy of the algorithm by adding simple linear functions to the ensemble. We also extensively evaluate the algorithm with and without linear functions. The results show that the accuracy of multi-target regression rule ensembles is high. They are more accurate than, for instance, multi-target regression trees, but not quite as accurate as multi-target random forests. The rule ensembles are significantly more concise than random forests, and it is also possible to create compact rule sets that are smaller than a single regression tree but still comparable in accuracy. Keywords: multi-target prediction, rule learning, rule ensembles, regression ∗. Also in Microtask, Tampere, Finland. †. Also in the Centre of Excellence for Integrated Approaches in Chemistry and Biology of Proteins, Ljubljana, Slovenia. ‡. Also in the Centre of Excellence for Integrated Approaches in Chemistry and Biology of Proteins, Ljubljana, Slovenia and the Joˇ ef Stefan International Postgraduate School, Ljubljana, Slovenia. z ˇ c 2012 Timo Aho, Bernard Zenko, Saˇo Dˇ eroski and Tapio Elomaa. s z ˇ ˇ

3 0.55807012 29 jmlr-2012-Consistent Model Selection Criteria on High Dimensions

Author: Yongdai Kim, Sunghoon Kwon, Hosik Choi

Abstract: Asymptotic properties of model selection criteria for high-dimensional regression models are studied where the dimension of covariates is much larger than the sample size. Several sufficient conditions for model selection consistency are provided. Non-Gaussian error distributions are considered and it is shown that the maximal number of covariates for model selection consistency depends on the tail behavior of the error distribution. Also, sufficient conditions for model selection consistency are given when the variance of the noise is neither known nor estimated consistently. Results of simulation studies as well as real data analysis are given to illustrate that finite sample performances of consistent model selection criteria can be quite different. Keywords: model selection consistency, general information criteria, high dimension, regression

4 0.40348023 42 jmlr-2012-Facilitating Score and Causal Inference Trees for Large Observational Studies

Author: Xiaogang Su, Joseph Kang, Juanjuan Fan, Richard A. Levine, Xin Yan

Abstract: Assessing treatment effects in observational studies is a multifaceted problem that not only involves heterogeneous mechanisms of how the treatment or cause is exposed to subjects, known as propensity, but also differential causal effects across sub-populations. We introduce a concept termed the facilitating score to account for both the confounding and interacting impacts of covariates on the treatment effect. Several approaches for estimating the facilitating score are discussed. In particular, we put forward a machine learning method, called causal inference tree (CIT), to provide a piecewise constant approximation of the facilitating score. With interpretable rules, CIT splits data in such a way that both the propensity and the treatment effect become more homogeneous within each resultant partition. Causal inference at different levels can be made on the basis of CIT. Together with an aggregated grouping procedure, CIT stratifies data into strata where causal effects can be conveniently assessed within each. Besides, a feasible way of predicting individual causal effects (ICE) is made available by aggregating ensemble CIT models. Both the stratified results and the estimated ICE provide an assessment of heterogeneity of causal effects and can be integrated for estimating the average causal effect (ACE). Mean square consistency of CIT is also established. We evaluate the performance of proposed methods with simulations and illustrate their use with the NSW data in Dehejia and Wahba (1999) where the objective is to assess the impact of a labor training program, the National Supported Work (NSW) demonstration, on post-intervention earnings. Keywords: CART, causal inference, confounding, interaction, observational study, personalized medicine, recursive partitioning

5 0.35537654 76 jmlr-2012-Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics

Author: Michael U. Gutmann, Aapo Hyvärinen

Abstract: We consider the task of estimating, from observed data, a probabilistic model that is parameterized by a finite number of parameters. In particular, we are considering the situation where the model probability density function is unnormalized. That is, the model is only specified up to the partition function. The partition function normalizes a model so that it integrates to one for any choice of the parameters. However, it is often impossible to obtain it in closed form. Gibbs distributions, Markov and multi-layer networks are examples of models where analytical normalization is often impossible. Maximum likelihood estimation can then not be used without resorting to numerical approximations which are often computationally expensive. We propose here a new objective function for the estimation of both normalized and unnormalized models. The basic idea is to perform nonlinear logistic regression to discriminate between the observed data and some artificially generated noise. With this approach, the normalizing partition function can be estimated like any other parameter. We prove that the new estimation method leads to a consistent (convergent) estimator of the parameters. For large noise sample sizes, the new estimator is furthermore shown to behave like the maximum likelihood estimator. In the estimation of unnormalized models, there is a trade-off between statistical and computational performance. We show that the new method strikes a competitive trade-off in comparison to other estimation methods for unnormalized models. As an application to real data, we estimate novel two-layer models of natural image statistics with spline nonlinearities. Keywords: statistics unnormalized models, partition function, computation, estimation, natural image

6 0.33841962 51 jmlr-2012-Integrating a Partial Model into Model Free Reinforcement Learning

7 0.33558473 5 jmlr-2012-A Local Spectral Method for Graphs: With Applications to Improving Graph Partitions and Exploring Data Graphs Locally

8 0.32316703 2 jmlr-2012-A Comparison of the Lasso and Marginal Regression

9 0.30675945 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality

10 0.30662555 69 jmlr-2012-Mixability is Bayes Risk Curvature Relative to Log Loss

11 0.29891938 59 jmlr-2012-Linear Regression With Random Projections

12 0.26967365 80 jmlr-2012-On Ranking and Generalization Bounds

13 0.26021615 63 jmlr-2012-Mal-ID: Automatic Malware Detection Using Common Segment Analysis and Meta-Features

14 0.25050399 48 jmlr-2012-High-Dimensional Gaussian Graphical Model Selection: Walk Summability and Local Separation Criterion

15 0.24257748 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming

16 0.23952451 114 jmlr-2012-Towards Integrative Causal Analysis of Heterogeneous Data Sets and Studies

17 0.23153186 68 jmlr-2012-Minimax Manifold Estimation

18 0.2269883 61 jmlr-2012-ML-Flex: A Flexible Toolbox for Performing Classification Analyses In Parallel

19 0.21701349 19 jmlr-2012-An Introduction to Artificial Prediction Markets for Classification

20 0.21531643 99 jmlr-2012-Restricted Strong Convexity and Weighted Matrix Completion: Optimal Bounds with Noise


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.014), (21, 0.531), (26, 0.03), (29, 0.037), (35, 0.013), (56, 0.015), (75, 0.035), (77, 0.024), (79, 0.013), (92, 0.123), (96, 0.061)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.96826506 31 jmlr-2012-DEAP: Evolutionary Algorithms Made Easy

Author: Félix-Antoine Fortin, François-Michel De Rainville, Marc-André Gardner, Marc Parizeau, Christian Gagné

Abstract: DEAP is a novel evolutionary computation framework for rapid prototyping and testing of ideas. Its design departs from most other existing frameworks in that it seeks to make algorithms explicit and data structures transparent, as opposed to the more common black-box frameworks. Freely available with extensive documentation at http://deap.gel.ulaval.ca, DEAP is an open source project under an LGPL license. Keywords: distributed evolutionary algorithms, software tools

2 0.91017556 39 jmlr-2012-Estimation and Selection via Absolute Penalized Convex Minimization And Its Multistage Adaptive Applications

Author: Jian Huang, Cun-Hui Zhang

Abstract: The ℓ1 -penalized method, or the Lasso, has emerged as an important tool for the analysis of large data sets. Many important results have been obtained for the Lasso in linear regression which have led to a deeper understanding of high-dimensional statistical problems. In this article, we consider a class of weighted ℓ1 -penalized estimators for convex loss functions of a general form, including the generalized linear models. We study the estimation, prediction, selection and sparsity properties of the weighted ℓ1 -penalized estimator in sparse, high-dimensional settings where the number of predictors p can be much larger than the sample size n. Adaptive Lasso is considered as a special case. A multistage method is developed to approximate concave regularized estimation by applying an adaptive Lasso recursively. We provide prediction and estimation oracle inequalities for single- and multi-stage estimators, a general selection consistency theorem, and an upper bound for the dimension of the Lasso estimator. Important models including the linear regression, logistic regression and log-linear models are used throughout to illustrate the applications of the general results. Keywords: variable selection, penalized estimation, oracle inequality, generalized linear models, selection consistency, sparsity

same-paper 3 0.88682139 20 jmlr-2012-Analysis of a Random Forests Model

Author: Gérard Biau

Abstract: Random forests are a scheme proposed by Leo Breiman in the 2000’s for building a predictor ensemble with a set of decision trees that grow in randomly selected subspaces of data. Despite growing interest and practical use, there has been little exploration of the statistical properties of random forests, and little is known about the mathematical forces driving the algorithm. In this paper, we offer an in-depth analysis of a random forests model suggested by Breiman (2004), which is very close to the original algorithm. We show in particular that the procedure is consistent and adapts to sparsity, in the sense that its rate of convergence depends only on the number of strong features and not on how many noise variables are present. Keywords: random forests, randomization, sparsity, dimension reduction, consistency, rate of convergence

4 0.5433045 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality

Author: Lan Xue, Annie Qu

Abstract: The varying-coefficient model is flexible and powerful for modeling the dynamic changes of regression coefficients. It is important to identify significant covariates associated with response variables, especially for high-dimensional settings where the number of covariates can be larger than the sample size. We consider model selection in the high-dimensional setting and adopt difference convex programming to approximate the L0 penalty, and we investigate the global optimality properties of the varying-coefficient estimator. The challenge of the variable selection problem here is that the dimension of the nonparametric form for the varying-coefficient modeling could be infinite, in addition to dealing with the high-dimensional linear covariates. We show that the proposed varying-coefficient estimator is consistent, enjoys the oracle property and achieves an optimal convergence rate for the non-zero nonparametric components for high-dimensional data. Our simulations and numerical examples indicate that the difference convex algorithm is efficient using the coordinate decent algorithm, and is able to select the true model at a higher frequency than the least absolute shrinkage and selection operator (LASSO), the adaptive LASSO and the smoothly clipped absolute deviation (SCAD) approaches. Keywords: coordinate decent algorithm, difference convex programming, L0 - regularization, large-p small-n, model selection, nonparametric function, oracle property, truncated L1 penalty

5 0.50839961 2 jmlr-2012-A Comparison of the Lasso and Marginal Regression

Author: Christopher R. Genovese, Jiashun Jin, Larry Wasserman, Zhigang Yao

Abstract: The lasso is an important method for sparse, high-dimensional regression problems, with efficient algorithms available, a long history of practical success, and a large body of theoretical results supporting and explaining its performance. But even with the best available algorithms, finding the lasso solutions remains a computationally challenging task in cases where the number of covariates vastly exceeds the number of data points. Marginal regression, where each dependent variable is regressed separately on each covariate, offers a promising alternative in this case because the estimates can be computed roughly two orders faster than the lasso solutions. The question that remains is how the statistical performance of the method compares to that of the lasso in these cases. In this paper, we study the relative statistical performance of the lasso and marginal regression for sparse, high-dimensional regression problems. We consider the problem of learning which coefficients are non-zero. Our main results are as follows: (i) we compare the conditions under which the lasso and marginal regression guarantee exact recovery in the fixed design, noise free case; (ii) we establish conditions under which marginal regression provides exact recovery with high probability in the fixed design, noise free, random coefficients case; and (iii) we derive rates of convergence for both procedures, where performance is measured by the number of coefficients with incorrect sign, and characterize the regions in the parameter space recovery is and is not possible under this metric. In light of the computational advantages of marginal regression in very high dimensional problems, our theoretical and simulations results suggest that the procedure merits further study. Keywords: high-dimensional regression, lasso, phase diagram, regularization

6 0.50838798 29 jmlr-2012-Consistent Model Selection Criteria on High Dimensions

7 0.50734657 7 jmlr-2012-A Multi-Stage Framework for Dantzig Selector and LASSO

8 0.48680696 42 jmlr-2012-Facilitating Score and Causal Inference Trees for Large Observational Studies

9 0.45623463 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting

10 0.4418132 69 jmlr-2012-Mixability is Bayes Risk Curvature Relative to Log Loss

11 0.44012547 73 jmlr-2012-Multi-task Regression using Minimal Penalties

12 0.43905056 111 jmlr-2012-Structured Sparsity and Generalization

13 0.40966347 96 jmlr-2012-Refinement of Operator-valued Reproducing Kernels

14 0.40692687 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning

15 0.39788705 119 jmlr-2012-glm-ie: Generalised Linear Models Inference & Estimation Toolbox

16 0.39437881 72 jmlr-2012-Multi-Target Regression with Rule Ensembles

17 0.39434761 57 jmlr-2012-Learning Symbolic Representations of Hybrid Dynamical Systems

18 0.38880873 103 jmlr-2012-Sampling Methods for the Nyström Method

19 0.3834497 18 jmlr-2012-An Improved GLMNET for L1-regularized Logistic Regression

20 0.3823787 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches