jmlr jmlr2012 jmlr2012-72 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Timo Aho, Bernard Ženko, Sašo Džeroski, Tapio Elomaa
Abstract: Methods for learning decision rules are being successfully applied to many problem domains, in particular when understanding and interpretation of the learned model is necessary. In many real life problems, we would like to predict multiple related (nominal or numeric) target attributes simultaneously. While several methods for learning rules that predict multiple targets at once exist, they are all based on the covering algorithm, which does not work well for regression problems. A better solution for regression is the rule ensemble approach that transcribes an ensemble of decision trees into a large collection of rules. An optimization procedure is then used to select the best (and much smaller) subset of these rules and to determine their respective weights. We introduce the FIRE algorithm for solving multi-target regression problems, which employs the rule ensembles approach. We improve the accuracy of the algorithm by adding simple linear functions to the ensemble. We also extensively evaluate the algorithm with and without linear functions. The results show that the accuracy of multi-target regression rule ensembles is high. They are more accurate than, for instance, multi-target regression trees, but not quite as accurate as multi-target random forests. The rule ensembles are significantly more concise than random forests, and it is also possible to create compact rule sets that are smaller than a single regression tree but still comparable in accuracy. Keywords: multi-target prediction, rule learning, rule ensembles, regression ∗. Also in Microtask, Tampere, Finland. †. Also in the Centre of Excellence for Integrated Approaches in Chemistry and Biology of Proteins, Ljubljana, Slovenia. ‡. Also in the Centre of Excellence for Integrated Approaches in Chemistry and Biology of Proteins, Ljubljana, Slovenia and the Jožef Stefan International Postgraduate School, Ljubljana, Slovenia. © 2012 Timo Aho, Bernard Ženko, Sašo Džeroski and Tapio Elomaa.
Reference: text
sentIndex sentText sentNum sentScore
1 While several methods for learning rules that predict multiple targets at once exist, they are all based on the covering algorithm, which does not work well for regression problems. [sent-15, score-0.248]
2 A better solution for regression is the rule ensemble approach that transcribes an ensemble of decision trees into a large collection of rules. [sent-16, score-0.403]
3 We introduce the FIRE algorithm for solving multi-target regression problems, which employs the rule ensembles approach. [sent-18, score-0.282]
4 The results show that the accuracy of multi-target regression rule ensembles is high. [sent-21, score-0.282]
5 The rule ensembles are significantly more concise than random forests, and it is also possible to create compact rule sets that are smaller than a single regression tree but still comparable in accuracy. [sent-23, score-0.508]
6 Keywords: multi-target prediction, rule learning, rule ensembles, regression. [sent-24, score-0.337]
7 © 2012 Timo Aho, Bernard Ženko, Sašo Džeroski and Tapio Elomaa. [sent-30, score-0.194]
8 The majority of rule learning methods are based on the sequential covering algorithm (Michalski, 1969), originally designed for learning ordered rule lists for binary classification domains. [sent-55, score-0.327]
9 An alternative rule learning method that also performs well on (single-target) regression problems is the approach of rule ensembles (Friedman and Popescu, 2005, 2008; Dembczyński et al. [sent-58, score-0.463]
10 In this paper, we introduce FIRE, an algorithm for multi-target regression based on the rule ensembles approach. [sent-62, score-0.282]
11 Finally, the present paper includes a significantly extended empirical evaluation: We consider a larger collection of data sets and methods with which our algorithm is compared, including the latest rule ensemble methods (Dembczyński et al. [sent-75, score-0.246]
12 Section 2 presents related work on multitarget prediction, rule learning, and rule ensembles. [sent-80, score-0.302]
13 The FIRE algorithm for learning multi-target regression rule ensembles is introduced in Section 3. [sent-81, score-0.282]
14 Several standard learning methods such as neural networks, decision trees, model trees, classification rules and random forests have been extended towards multi-target prediction (Caruana, 1997; Blockeel et al. [sent-93, score-0.193]
15 Since our method learns regression rules, it is closely related to rule learning (Flach and Lavrač, 2003). [sent-124, score-0.186]
16 It employs the standard covering approach (Michalski, 1969) and can learn ordered or unordered rule sets for classification and regression domains. [sent-126, score-0.211]
17 An alternative approach to rule learning is called rule ensembles (Friedman and Popescu, 2005, 2008; Dembczyński et al. [sent-129, score-0.428]
18 Strictly speaking, any set of (unordered) rules can be called a rule ensemble, as for example, in Indurkhya and Weiss (2001). [sent-131, score-0.249]
19 In this paper, however, a rule ensemble is understood to be a set of unordered rules whose predictions are combined through weighted voting, which is the approach introduced by the RULEFIT (Friedman and Popescu, 2005, 2008) and REGENDER methods (Dembczyński et al. [sent-132, score-0.37]
20 The RULEFIT algorithm starts by generating a set of decision trees in much the same way as ensembles are generated in methods like bagging (Breiman, 1996) and random forests (Breiman, 2001). [sent-134, score-0.278]
21 Because such large ensembles are hard or even impossible to interpret, all the trees are transcribed into a collection of rules, and an optimization procedure is used to select a small subset of the rules and to determine their weights. [sent-135, score-0.302]
22 In addition to rules, we can also use descriptive attributes in the linear combination if we add them to the initial set of rules, and likewise determine their weights in the optimization step. [sent-137, score-0.195]
23 Learning Rule Based Ensembles for Multi-Target Regression Our algorithm for learning rule based ensembles for multi-target regression problems (which we call FIRE: Fitted rule ensembles) is greatly influenced by the RULEFIT method. [sent-145, score-0.433]
24 Because linear dependencies are known to be difficult to approximate with rules, we optionally add linear terms (simple linear functions) of all numeric descriptive attributes to the collection. [sent-149, score-0.195]
25 Algorithm 1 The FIRE algorithm for learning rule ensembles for multi-target regression. [sent-153, score-0.247]
26 The first sum is the contribution of the M rules: each rule ri is a vector function that gives a constant prediction (for each of the targets), if it covers the example x, or returns a zero vector otherwise. [sent-159, score-0.192]
27 There is a term for each combination of a target and a numeric descriptive attribute, thus the total number of linear terms is the number of numeric descriptive attributes K times the number of target attributes T . [sent-161, score-0.54]
28 A linear term x(t, j) is a vector that corresponds to the influence of the j-th numerical descriptive attribute x j on the t-th target attribute; its t-th component is equal to x j , while all other components are zero: x(t, j) = (0, . [sent-162, score-0.236]
29 A hypothetic rule ensemble that predicts all the target values of this domain simultaneously could be: y = f (x) = 0. [sent-174, score-0.302]
30 It comprises a constant vector, two rules and four linear terms (of attributes x2 and x5 ), but can also be simplified to a sum of a vector of linear equations and two rules. [sent-210, score-0.196]
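To make the model form above concrete, here is a minimal sketch of such a multi-target rule ensemble prediction function: a weighted sum of a constant vector, rules that contribute only when their condition covers the example, and linear terms that pass one numeric attribute through to one target. The Rule class, the helper names, and the example numbers are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np

class Rule:
    """A rule: if condition(x) holds, predict a constant target vector, else zeros."""
    def __init__(self, condition, prediction):
        self.condition = condition                    # callable: x -> bool
        self.prediction = np.asarray(prediction, dtype=float)

    def __call__(self, x):
        return self.prediction if self.condition(x) else np.zeros_like(self.prediction)

def linear_term(t, j, n_targets):
    """Linear term x_(t,j): a vector whose t-th component is x_j, all others zero."""
    def term(x):
        v = np.zeros(n_targets)
        v[t] = x[j]
        return v
    return term

def predict(x, w0, avg, rules, rule_weights, terms, term_weights):
    """f(x) = w0*avg + sum_i w_i r_i(x) + sum_(t,j) w_(t,j) x_(t,j)."""
    y = w0 * avg
    for w, r in zip(rule_weights, rules):
        y = y + w * r(x)
    for w, term in zip(term_weights, terms):
        y = y + w * term(x)
    return y

# Hypothetical two-target example: one rule and two linear terms on attribute x[1].
avg = np.array([0.5, 1.0])
rules = [Rule(lambda x: x[0] > 2.0, [0.3, -0.2])]
terms = [linear_term(0, 1, 2), linear_term(1, 1, 2)]
print(predict(np.array([3.0, 0.4]), 1.0, avg, rules, [0.8], terms, [0.1, 0.05]))
```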
31 A set of diverse trees is generated with the multi-target implementation of the random forest ensemble method (Kocev et al. [sent-217, score-0.202]
32 , 2007), modified to stop tree building when a given tree depth limit is reached. [sent-218, score-0.196]
33 In order to increase the tree (and, thus, rule) variability, we limit the depth of a particular tree in a randomized fashion as suggested by Friedman and Popescu (2008). [sent-220, score-0.196]
34 It should be emphasized that the parameter L̄ only affects the average of all tree depth limits dm and thus trees with larger depths can still be generated. [sent-225, score-0.213]
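Friedman and Popescu (2008) draw a per-tree size limit from an exponential distribution so that only the average is fixed; the sketch below assumes that scheme and converts the drawn number of terminal nodes into a depth limit. The conversion and parameter names are illustrative guesses, not the exact procedure used in FIRE.

```python
import numpy as np

def draw_depth_limits(n_trees, avg_terminal_nodes, rng=None):
    """Draw a per-tree depth limit in a randomized fashion.

    Assumption (following Friedman & Popescu, 2008): the number of terminal
    nodes of tree m is t_m = 2 + floor(g), with g drawn from an exponential
    distribution whose mean is avg_terminal_nodes - 2.  The depth limit is
    then the smallest depth of a binary tree with at least t_m leaves.
    """
    rng = rng if rng is not None else np.random.default_rng()
    gammas = rng.exponential(scale=avg_terminal_nodes - 2, size=n_trees)
    terminal_nodes = 2 + np.floor(gammas)
    return np.ceil(np.log2(terminal_nodes)).astype(int)

print(draw_depth_limits(10, avg_terminal_nodes=4))
```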
35 All regression trees generated with the above procedure are transcribed into rules with the ConvertTreesToRules procedure. [sent-226, score-0.22]
36 In order to equalize the importance of different rules and different targets we proceed in three separate steps: First, we simply zero-center all the targets. [sent-231, score-0.233]
37 In the first step, we zero-center all the rule target predictions by subtracting the average avg from each of the original rule predictions r ′′ : r ′ = r ′′ − avg. [sent-240, score-0.551]
38 The average avg contains the average values of the target attributes on the learning set. [sent-241, score-0.298]
39 In the second, more complex, step, we scale the predicted values rt′ of each target attribute t by dividing them by a factor χ: rt = rt′ / χ. (3) [sent-242, score-0.2]
40 We choose χ so that it is related to both the largest predicted value of the rule r′ and to the standard deviation σt of a target attribute t. [sent-243, score-0.314]
41 In this step, we equalize the scaling differences between different target attribute spaces. [sent-257, score-0.208]
42 As usual, we do this simply by dividing the target attribute prediction values rt by twice their standard deviations 2σt: rt∗ = rt/(2σt) = rt′/(2σt χ) = (rt′′ − avgt)/(2σt χ). [sent-260, score-0.237]
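A small sketch of the three normalization steps as read from the description above: zero-centering by avg, scaling by χ so that the target with the largest value relative to its standard deviation gets prediction 1, and the temporary division by 2σt used only during optimization. Variable names and the exact form of χ follow this reading and are not taken from the authors' code.

```python
import numpy as np

def normalize_rule_prediction(r_pp, avg, sigma):
    """Normalize one rule's target prediction vector r'' in three steps."""
    # Step 1: zero-center the predictions.
    r_p = r_pp - avg                               # r' = r'' - avg
    # Step 2: scale by chi so that the target with the largest value
    # relative to its standard deviation gets prediction exactly 1.
    scaled = r_p / sigma
    m = np.argmax(np.abs(scaled))                  # index of the maximum target value
    chi_per_target = (r_p[m] / sigma[m]) * sigma   # chi for each target t
    r = r_p / chi_per_target                       # now r[m] == 1 and |r[t]| <= 1
    # Step 3 (temporary, optimization only): equalize target scales by 2*sigma_t.
    r_star = r / (2.0 * sigma)
    return r_p, r, r_star

avg = np.array([10.0, 0.2])
sigma = np.array([4.0, 0.05])
print(normalize_rule_prediction(np.array([14.0, 0.25]), avg, sigma))
```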
43 We again shift the linear terms by the average x̄′′j of the j-th descriptive attribute and scale them by its standard deviation: x′(t, j) = (0, . . . , (x′′j − x̄′′j)/(2σj), . . . , 0). [sent-272, score-0.197]
44 Linear terms normalized like this appear in the final rule ensemble model. [sent-280, score-0.238]
45 Analogously to the third stage of rule normalization we also scale the target dimension space out temporarily: x∗(t, j) = x′(t, j)/(2σt) = x(t, j)/(2σt · 2σj). [sent-281, score-0.288]
46 This is, again, only intended to equalize the terms referring to different target attributes during the optimization procedure. [sent-282, score-0.25]
47 The purpose of the regularization part is to make as many weights equal to zero as possible, which means that the resulting rule ensemble will be as small as possible. [sent-287, score-0.241]
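FIRE selects the weights with a gradient-directed optimization under a sparsity-inducing regularization; that exact procedure is not reproduced here. Purely to illustrate the idea that such regularization drives most weights to exactly zero, the sketch below fits an L1-regularized (Lasso) model over a matrix whose columns stand in for rule and linear-term predictions.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Columns of F hold the prediction of each rule/linear term on each training
# example (single-target view; the multi-target case needs one such system per
# target or a joint formulation).  Random data stands in for a real ensemble.
rng = np.random.default_rng(0)
F = rng.normal(size=(200, 50))            # 200 examples, 50 rules/terms
y = F[:, :5] @ np.array([1.0, -0.5, 0.3, 0.8, -1.2]) + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.05).fit(F, y)
weights = lasso.coef_
print("non-zero weights:", np.count_nonzero(weights), "of", weights.size)
```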
48 From the RRMSE diagram (Figure 1a), we can see that random forests are the most accurate method, followed by all the rule ensemble methods, MTSMOTI, regression trees and finally DIRTY, which seems to perform poorly on single-target data. [sent-300, score-0.433]
49 Our unlimited FIRE versions and limited FIRE with linear terms are in the middle class. [sent-305, score-0.186]
50 In this case, linear terms seem to be surprisingly effective, as they bring the limited FIRE of 30 terms to the same accuracy level as the much larger unlimited FIRE set of rules. [sent-310, score-0.206]
51 Both unlimited FIRE versions seem to generate models that are larger than the reference rule ensembles (RULEFIT and REGENDER), but the differences are not significant. [sent-312, score-0.435]
52 In a pairwise comparison, the algorithms tie in wins over data sets (12 wins each) and the unlimited version has a slightly lower average accuracy. [sent-320, score-0.203]
53 In sum, the results of this first part of the evaluation show that our rule ensemble method FIRE performs well on single-target regression problems. [sent-323, score-0.251]
54 The average ranks and results of the Nemenyi test are given in Figure 2 for RRMSE evaluated on separate targets (a), RRMSE evaluated on target averages within data sets (b), and model size (c). [sent-336, score-0.226]
55 Looking at Figure 2(a), the general picture of algorithm ranking is similar to the one for single-target prediction: random forests and both unlimited FIRE versions are more accurate than DIRTY, regression trees, the limited versions of FIRE, and MTSMOTI. [sent-337, score-0.32]
56 This is especially clear in the per data set average evaluation and occurs in the pairwise comparison with the limited FIRE (in per target evaluation 7 out of 11 wins and per data set average 34–36 out of 63 wins for regression trees). [sent-352, score-0.211]
57 The detailed results in Appendix D show that regression trees tend to win only when the size of the tree produced is much larger than the FIRE limit of 30. [sent-353, score-0.217]
58 When both size and accuracy are taken into account, the unlimited version of FIRE with linear terms seems to perform very well on multi-target regression problems. [sent-367, score-0.195]
59 We also notice that the average proportion of linear terms in the unlimited model is 24%, but drops to only 5% if we limit the model size to 30. [sent-368, score-0.205]
60 As mentioned in Section 3, the total number of added linear terms is the number of numeric descriptive attributes times the number of target attributes. [sent-419, score-0.281]
61 This is probably a reason for the long running time of FIRE with linear terms and why the difference between the two unlimited FIRE versions is greater for multi-target data sets than for single-target ones. [sent-420, score-0.186]
62 The limited FIRE versions use on average about a tenth of the time used by the unlimited ones. [sent-422, score-0.189]
63 In the unlimited FIRE we do not have such a contest between linear terms and rules during the optimization and, thus, the negative effect of adding a linear term is negligible. [sent-833, score-0.279]
64 However, for single-target rule ensembles the linear terms seem to help all the time as seen in Figure 3. [sent-848, score-0.293]
65 It is an interesting question whether this would benefit the multi-target rule ensembles in the same way as the (single dimension) linear terms benefit the single-target rule ensembles. [sent-852, score-0.42]
66 6 Summary of the Experimental Results In our experiments, we first evaluated FIRE on single-target domains in order to show that our implementation of rule ensembles also works on standard regression problems. [sent-862, score-0.282]
67 The results are somewhat similar to the ones obtained on single-target domains: random forests and the unlimited FIRE versions are more accurate than the limited FIRE, regression trees, and MTSMOTI. [sent-866, score-0.294]
68 Even though the difference in size between random forests and the unlimited FIRE is not statistically significant, the average size of a random forest is more than 300 times larger than the average size of FIRE models (with and without linear terms). [sent-869, score-0.333]
69 Therefore, we believe that the unlimited FIRE with linear terms is a very good choice for modeling multi-target regression problems. [sent-874, score-0.195]
70 In the case of linear terms, we can conclude that for larger (50 terms or more) multi-target rule ensembles adding linear terms should in general improve the accuracy. [sent-878, score-0.291]
71 We have adopted the rule ensemble approach and generalized it to multi-target regression domains. [sent-883, score-0.251]
72 , 2009) was already able to learn multi-target regression rule ensembles. [sent-885, score-0.186]
73 We compared it with two other existing rule ensemble approaches, RULEFIT and REGENDER, and to other similar multi-target prediction algorithms, namely regression trees, random forests, and model trees (MTSMOTI). [sent-892, score-0.369]
74 Džeroski was supported by the Slovenian Research Agency (Grants P2-0103 and J2-2285), the European Commission (Grants ICT-2010-266722 and ICT-2011-287713), and Operation no. [sent-919, score-0.194]
75 Here J is the number of targets and each P j represents the predictions for the target dimension j: P j = avg j , r1 (x) j , r2 (x) j , . . . [sent-932, score-0.288]
76 We remember that our rules are transformed from a tree ensemble of the form: f (x) = (1/|D|) ∑ di (x), with the sum running over i = 1, . . . , |D|, where |D| is the number of trees in the ensemble. [sent-963, score-0.325]
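Rules are obtained by transcribing the trees of this ensemble: each root-to-leaf path becomes one rule (the ConvertTreesToRules step mentioned earlier). FIRE works on multi-target trees from the CLUS system; as an analogy only, the sketch below extracts such path rules from a scikit-learn multi-output regression tree.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def tree_to_rules(tree):
    """Return one (conditions, leaf_prediction) pair per root-to-leaf path."""
    t = tree.tree_
    rules = []

    def walk(node, conditions):
        if t.children_left[node] == -1:            # leaf node
            rules.append((list(conditions), t.value[node].ravel().copy()))
            return
        f, thr = t.feature[node], t.threshold[node]
        walk(t.children_left[node], conditions + [(f, "<=", thr)])
        walk(t.children_right[node], conditions + [(f, ">", thr)])

    walk(0, [])
    return rules

X = np.random.rand(100, 3)
y = np.column_stack([X[:, 0] + X[:, 1], X[:, 2]])   # two targets
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
for conds, pred in tree_to_rules(tree):
    print(conds, "->", pred)
```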
77 Here each tree prediction di in the ensemble is global in contrast to rule predictions ri being local. [sent-964, score-0.358]
78 This consists of a rule from each of the trees in the tree ensemble. [sent-969, score-0.313]
79 Now the rule predictions are used in the same scale in which they were created during the tree ensemble training. [sent-971, score-0.317]
80 That is, we replace each of the initial rules r ′′ with a zero-centered rule r ′, which is defined as r ′(x) = r ′′(x) − avg if x is covered, and 0 otherwise. [sent-975, score-0.335]
81 f (x) = (1/nr. of covering rules) ∑ (avg + ri′ (x)) = avg + (1/nr. of covering rules) ∑ ri′ (x). [sent-978, score-0.295]
82 of covering rules i=1 i=1 This new form allows us to do the weight optimization freely without caring about the number of covering rules: M f (x) = w0 avg + ∑ wi ri′ (x). [sent-980, score-0.276]
83 The optimization problem is not invariant to the scaling of rule predictions: If we scale the rule predictions as r = b r ′; b > 1, the corresponding weight will not be simply decreased as w = w′ /b, because the regularization part on the right only includes weights and not rule predictions. [sent-982, score-0.525]
84 We would like all the rules and targets to have equal initial importance. [sent-984, score-0.188]
85 It is clear that setting the rule predictions to 1 for all targets would discard all the discovered information on relations between the targets stored in the rules. [sent-988, score-0.357]
86 If σt were the standard deviation of a normally distributed target attribute rt′ , dividing a zero-centered attribute by 2σt should put 95% of all values within the [−1, 1] interval. [sent-994, score-0.24]
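A quick numerical check of this 2σt statement on simulated normal data; purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 3.7
x = rng.normal(loc=0.0, scale=sigma, size=100_000)    # zero-centered target attribute
inside = np.mean(np.abs(x / (2 * sigma)) <= 1.0)
print(f"fraction within [-1, 1]: {inside:.3f}")        # roughly 0.954
```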
87 , T } of the maximum target value rm of the rule r ′ is defined by: m = arg maxt |rt′ /σt |. [sent-999, score-0.258]
88 We can now also give a strict bound to the predictions rt : By the definition of m, after this second stage of normalization it holds that rm = 1 and |rt | = |(σm /rm′ )(rt′ /σt )| ≤ 1 for all other targets t. [sent-1001, score-0.204]
89 To sum up, at the second stage of normalization the target predictions rt in a certain rule r are scaled by a factor χ that represents the initial size of the predictive values of the particular rule. [sent-1003, score-0.376]
90 Thus, this stage roughly equalizes the rules before the optimization phase and affects the rules of the final model. [sent-1004, score-0.217]
91 After the first two normalization steps our rule predictions are of the form: r = r ′ /χ = (r ′′ − avg)/χ. [sent-1005, score-0.314]
92 Otherwise, targets with large scales would dominate the rule selection. [sent-1007, score-0.241]
93 As usual, we do this simply by dividing the target attribute prediction values rt by twice their standard deviations 2σt: rt∗ = rt /(2σt) = rt′ /(2σt χ) = (rt′′ − avgt)/(2σt χ). [sent-1011, score-0.237]
94 . . . , 0, x j , 0, . . . , 0), with x j as the t-th component, which depicts the influence of the descriptive attribute x j on the target attribute xt . [sent-1020, score-0.313]
95 We add linear terms for all possible combinations of numeric descriptive attributes and target attributes. [sent-1021, score-0.281]
96 Unlike rules, linear terms are affected by two attributes and, thus, two attribute space scales: that of the j-th descriptive attribute space and that of the t-th target space. [sent-1022, score-0.411]
97 Linear terms normalized like this appear in the final rule ensemble model. [sent-1036, score-0.238]
98 However, analogously to the third stage of rule normalization we also scale the target dimension space out temporarily: x∗(t, j) = x′(t, j)/(2σt) = x(t, j)/(2σt · 2σj). [sent-1037, score-0.288]
99 This is, again, only intended to equalize the terms of different target attributes during the optimization procedure. [sent-1038, score-0.25]
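A sketch of building the K × T linear-term columns with the normalization just described (shift by the attribute mean, scale by 2σj, plus the temporary division by 2σt used only during optimization); the helper name and the exact scaling constants follow my reading of the text, not the authors' code.

```python
import numpy as np

def build_normalized_linear_terms(X_numeric, target_sigma):
    """One linear-term column per (target t, numeric attribute j) pair.

    Each column is shifted by the attribute mean and scaled by 2*sigma_j
    (kept in the final model), then divided by 2*sigma_t of its target
    (the temporary scaling used only while optimizing the weights).
    """
    K = X_numeric.shape[1]
    T = len(target_sigma)
    attr_mean = X_numeric.mean(axis=0)
    attr_sigma = X_numeric.std(axis=0)
    terms = {}
    for t in range(T):
        for j in range(K):
            col = (X_numeric[:, j] - attr_mean[j]) / (2.0 * attr_sigma[j])
            terms[(t, j)] = col / (2.0 * target_sigma[t])
    return terms                                    # K * T columns in total

X_numeric = np.random.rand(100, 4)                  # K = 4 numeric attributes
terms = build_normalized_linear_terms(X_numeric, target_sigma=np.array([1.5, 0.2]))
print(len(terms), "linear terms")                   # 8 = 4 * 2
```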
100 We notice that the form on the last row can be expressed in the form of internal normalization f (x) = w0 avg + ∑ wi ri′ (x) (sum over i = 1, . . . , M) only if avg = 0, that is, the data is originally zero-centered. [sent-1048, score-0.244]
wordName wordTfidf (topN-words)
[('ire', 0.588), ('fire', 0.402), ('rrmse', 0.337), ('eroski', 0.194), ('irty', 0.162), ('rule', 0.151), ('unlimited', 0.138), ('moti', 0.11), ('enko', 0.104), ('lomaa', 0.104), ('rules', 0.098), ('ensembles', 0.096), ('forests', 0.095), ('nsembles', 0.094), ('zenko', 0.091), ('targets', 0.09), ('trees', 0.087), ('target', 0.086), ('avg', 0.086), ('attribute', 0.077), ('attributes', 0.076), ('tree', 0.075), ('descriptive', 0.073), ('aho', 0.065), ('ensemble', 0.065), ('ulti', 0.062), ('ho', 0.056), ('blockeel', 0.055), ('egression', 0.053), ('sa', 0.053), ('tampere', 0.052), ('vpv', 0.052), ('normalization', 0.051), ('forest', 0.05), ('equalize', 0.045), ('jalali', 0.045), ('struyf', 0.045), ('popescu', 0.044), ('ri', 0.041), ('ljubljana', 0.04), ('appice', 0.039), ('dembczy', 0.039), ('rt', 0.037), ('cd', 0.036), ('regression', 0.035), ('dirty', 0.035), ('ree', 0.035), ('nemenyi', 0.033), ('nder', 0.033), ('friedman', 0.033), ('auto', 0.032), ('kocev', 0.032), ('mtsmoti', 0.032), ('orest', 0.032), ('timo', 0.032), ('lncs', 0.032), ('terminal', 0.032), ('jt', 0.03), ('bernard', 0.03), ('ski', 0.03), ('err', 0.028), ('ax', 0.027), ('predictions', 0.026), ('bogdan', 0.026), ('errmin', 0.026), ('igmea', 0.026), ('jo', 0.026), ('ousing', 0.026), ('rulefit', 0.026), ('gk', 0.026), ('dem', 0.026), ('eg', 0.026), ('depth', 0.026), ('versions', 0.026), ('eta', 0.025), ('hendrik', 0.025), ('slovenia', 0.025), ('predictive', 0.025), ('ranks', 0.025), ('covering', 0.025), ('weights', 0.025), ('average', 0.025), ('optional', 0.025), ('diagrams', 0.025), ('jerome', 0.024), ('numeric', 0.024), ('seem', 0.024), ('sp', 0.023), ('andom', 0.023), ('species', 0.022), ('terms', 0.022), ('editors', 0.022), ('suzuki', 0.022), ('max', 0.021), ('optimization', 0.021), ('wi', 0.021), ('wk', 0.02), ('stanford', 0.02), ('wins', 0.02), ('limit', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000008 72 jmlr-2012-Multi-Target Regression with Rule Ensembles
Author: Timo Aho, Bernard Ženko, Sašo Džeroski, Tapio Elomaa
Abstract: Methods for learning decision rules are being successfully applied to many problem domains, in particular when understanding and interpretation of the learned model is necessary. In many real life problems, we would like to predict multiple related (nominal or numeric) target attributes simultaneously. While several methods for learning rules that predict multiple targets at once exist, they are all based on the covering algorithm, which does not work well for regression problems. A better solution for regression is the rule ensemble approach that transcribes an ensemble of decision trees into a large collection of rules. An optimization procedure is then used to select the best (and much smaller) subset of these rules and to determine their respective weights. We introduce the F IRE algorithm for solving multi-target regression problems, which employs the rule ensembles approach. We improve the accuracy of the algorithm by adding simple linear functions to the ensemble. We also extensively evaluate the algorithm with and without linear functions. The results show that the accuracy of multi-target regression rule ensembles is high. They are more accurate than, for instance, multi-target regression trees, but not quite as accurate as multi-target random forests. The rule ensembles are significantly more concise than random forests, and it is also possible to create compact rule sets that are smaller than a single regression tree but still comparable in accuracy. Keywords: multi-target prediction, rule learning, rule ensembles, regression ∗. Also in Microtask, Tampere, Finland. †. Also in the Centre of Excellence for Integrated Approaches in Chemistry and Biology of Proteins, Ljubljana, Slovenia. ‡. Also in the Centre of Excellence for Integrated Approaches in Chemistry and Biology of Proteins, Ljubljana, Slovenia and the Joˇ ef Stefan International Postgraduate School, Ljubljana, Slovenia. z ˇ c 2012 Timo Aho, Bernard Zenko, Saˇo Dˇ eroski and Tapio Elomaa. s z ˇ ˇ
2 0.092762746 20 jmlr-2012-Analysis of a Random Forests Model
Author: Gérard Biau
Abstract: Random forests are a scheme proposed by Leo Breiman in the 2000’s for building a predictor ensemble with a set of decision trees that grow in randomly selected subspaces of data. Despite growing interest and practical use, there has been little exploration of the statistical properties of random forests, and little is known about the mathematical forces driving the algorithm. In this paper, we offer an in-depth analysis of a random forests model suggested by Breiman (2004), which is very close to the original algorithm. We show in particular that the procedure is consistent and adapts to sparsity, in the sense that its rate of convergence depends only on the number of strong features and not on how many noise variables are present. Keywords: random forests, randomization, sparsity, dimension reduction, consistency, rate of convergence
3 0.033557147 41 jmlr-2012-Exploration in Relational Domains for Model-based Reinforcement Learning
Author: Tobias Lang, Marc Toussaint, Kristian Kersting
Abstract: A fundamental problem in reinforcement learning is balancing exploration and exploitation. We address this problem in the context of model-based reinforcement learning in large stochastic relational domains by developing relational extensions of the concepts of the E 3 and R- MAX algorithms. Efficient exploration in exponentially large state spaces needs to exploit the generalization of the learned model: what in a propositional setting would be considered a novel situation and worth exploration may in the relational setting be a well-known context in which exploitation is promising. To address this we introduce relational count functions which generalize the classical notion of state and action visitation counts. We provide guarantees on the exploration efficiency of our framework using count functions under the assumption that we had a relational KWIK learner and a near-optimal planner. We propose a concrete exploration algorithm which integrates a practically efficient probabilistic rule learner and a relational planner (for which there are no guarantees, however) and employs the contexts of learned relational rules as features to model the novelty of states and actions. Our results in noisy 3D simulated robot manipulation problems and in domains of the international planning competition demonstrate that our approach is more effective than existing propositional and factored exploration techniques. Keywords: reinforcement learning, statistical relational learning, exploration, relational transition models, robotics
4 0.030164892 62 jmlr-2012-MULTIBOOST: A Multi-purpose Boosting Package
Author: Djalel Benbouzid, Róbert Busa-Fekete, Norman Casagrande, François-David Collin, Balázs Kégl
Abstract: The M ULTI B OOST package provides a fast C++ implementation of multi-class/multi-label/multitask boosting algorithms. It is based on A DA B OOST.MH but it also implements popular cascade classifiers and F ILTER B OOST. The package contains common multi-class base learners (stumps, trees, products, Haar filters). Further base learners and strong learners following the boosting paradigm can be easily implemented in a flexible framework. Keywords: boosting, A DA B OOST.MH, F ILTER B OOST, cascade classifier
5 0.029680476 42 jmlr-2012-Facilitating Score and Causal Inference Trees for Large Observational Studies
Author: Xiaogang Su, Joseph Kang, Juanjuan Fan, Richard A. Levine, Xin Yan
Abstract: Assessing treatment effects in observational studies is a multifaceted problem that not only involves heterogeneous mechanisms of how the treatment or cause is exposed to subjects, known as propensity, but also differential causal effects across sub-populations. We introduce a concept termed the facilitating score to account for both the confounding and interacting impacts of covariates on the treatment effect. Several approaches for estimating the facilitating score are discussed. In particular, we put forward a machine learning method, called causal inference tree (CIT), to provide a piecewise constant approximation of the facilitating score. With interpretable rules, CIT splits data in such a way that both the propensity and the treatment effect become more homogeneous within each resultant partition. Causal inference at different levels can be made on the basis of CIT. Together with an aggregated grouping procedure, CIT stratifies data into strata where causal effects can be conveniently assessed within each. Besides, a feasible way of predicting individual causal effects (ICE) is made available by aggregating ensemble CIT models. Both the stratified results and the estimated ICE provide an assessment of heterogeneity of causal effects and can be integrated for estimating the average causal effect (ACE). Mean square consistency of CIT is also established. We evaluate the performance of proposed methods with simulations and illustrate their use with the NSW data in Dehejia and Wahba (1999) where the objective is to assess the impact of c 2012 Xiaogang Su, Joseph Kang, Juanjuan Fan, Richard A. Levine and Xin Yan. S U , K ANG , FAN , L EVINE AND YAN a labor training program, the National Supported Work (NSW) demonstration, on post-intervention earnings. Keywords: CART, causal inference, confounding, interaction, observational study, personalized medicine, recursive partitioning
6 0.029668402 114 jmlr-2012-Towards Integrative Causal Analysis of Heterogeneous Data Sets and Studies
7 0.028298 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
8 0.027029248 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach
9 0.025257744 59 jmlr-2012-Linear Regression With Random Projections
10 0.024975745 61 jmlr-2012-ML-Flex: A Flexible Toolbox for Performing Classification Analyses In Parallel
11 0.023887152 48 jmlr-2012-High-Dimensional Gaussian Graphical Model Selection: Walk Summability and Local Separation Criterion
12 0.023735961 73 jmlr-2012-Multi-task Regression using Minimal Penalties
13 0.023554027 80 jmlr-2012-On Ranking and Generalization Bounds
14 0.023174796 106 jmlr-2012-Sign Language Recognition using Sub-Units
15 0.022828573 49 jmlr-2012-Hope and Fear for Discriminative Training of Statistical Translation Models
16 0.021681411 51 jmlr-2012-Integrating a Partial Model into Model Free Reinforcement Learning
17 0.020880209 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization
18 0.020541571 4 jmlr-2012-A Kernel Two-Sample Test
19 0.020414781 76 jmlr-2012-Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics
20 0.020289211 103 jmlr-2012-Sampling Methods for the Nyström Method
topicId topicWeight
[(0, -0.107), (1, 0.031), (2, 0.06), (3, -0.051), (4, 0.004), (5, 0.012), (6, -0.003), (7, 0.023), (8, 0.021), (9, 0.053), (10, -0.012), (11, 0.052), (12, 0.066), (13, 0.008), (14, -0.096), (15, 0.094), (16, 0.021), (17, -0.098), (18, -0.045), (19, 0.012), (20, -0.082), (21, -0.043), (22, -0.091), (23, -0.013), (24, 0.087), (25, 0.058), (26, -0.005), (27, 0.342), (28, 0.104), (29, -0.139), (30, 0.051), (31, 0.19), (32, 0.08), (33, 0.043), (34, -0.021), (35, 0.107), (36, -0.072), (37, -0.103), (38, -0.238), (39, -0.105), (40, -0.016), (41, -0.041), (42, -0.189), (43, -0.06), (44, 0.245), (45, -0.122), (46, 0.282), (47, 0.014), (48, 0.092), (49, -0.047)]
simIndex simValue paperId paperTitle
same-paper 1 0.95281911 72 jmlr-2012-Multi-Target Regression with Rule Ensembles
Author: Timo Aho, Bernard Ženko, Sašo Džeroski, Tapio Elomaa
Abstract: Methods for learning decision rules are being successfully applied to many problem domains, in particular when understanding and interpretation of the learned model is necessary. In many real life problems, we would like to predict multiple related (nominal or numeric) target attributes simultaneously. While several methods for learning rules that predict multiple targets at once exist, they are all based on the covering algorithm, which does not work well for regression problems. A better solution for regression is the rule ensemble approach that transcribes an ensemble of decision trees into a large collection of rules. An optimization procedure is then used to select the best (and much smaller) subset of these rules and to determine their respective weights. We introduce the F IRE algorithm for solving multi-target regression problems, which employs the rule ensembles approach. We improve the accuracy of the algorithm by adding simple linear functions to the ensemble. We also extensively evaluate the algorithm with and without linear functions. The results show that the accuracy of multi-target regression rule ensembles is high. They are more accurate than, for instance, multi-target regression trees, but not quite as accurate as multi-target random forests. The rule ensembles are significantly more concise than random forests, and it is also possible to create compact rule sets that are smaller than a single regression tree but still comparable in accuracy. Keywords: multi-target prediction, rule learning, rule ensembles, regression ∗. Also in Microtask, Tampere, Finland. †. Also in the Centre of Excellence for Integrated Approaches in Chemistry and Biology of Proteins, Ljubljana, Slovenia. ‡. Also in the Centre of Excellence for Integrated Approaches in Chemistry and Biology of Proteins, Ljubljana, Slovenia and the Joˇ ef Stefan International Postgraduate School, Ljubljana, Slovenia. z ˇ c 2012 Timo Aho, Bernard Zenko, Saˇo Dˇ eroski and Tapio Elomaa. s z ˇ ˇ
2 0.59415829 20 jmlr-2012-Analysis of a Random Forests Model
Author: Gérard Biau
Abstract: Random forests are a scheme proposed by Leo Breiman in the 2000’s for building a predictor ensemble with a set of decision trees that grow in randomly selected subspaces of data. Despite growing interest and practical use, there has been little exploration of the statistical properties of random forests, and little is known about the mathematical forces driving the algorithm. In this paper, we offer an in-depth analysis of a random forests model suggested by Breiman (2004), which is very close to the original algorithm. We show in particular that the procedure is consistent and adapts to sparsity, in the sense that its rate of convergence depends only on the number of strong features and not on how many noise variables are present. Keywords: random forests, randomization, sparsity, dimension reduction, consistency, rate of convergence
3 0.25243506 42 jmlr-2012-Facilitating Score and Causal Inference Trees for Large Observational Studies
Author: Xiaogang Su, Joseph Kang, Juanjuan Fan, Richard A. Levine, Xin Yan
Abstract: Assessing treatment effects in observational studies is a multifaceted problem that not only involves heterogeneous mechanisms of how the treatment or cause is exposed to subjects, known as propensity, but also differential causal effects across sub-populations. We introduce a concept termed the facilitating score to account for both the confounding and interacting impacts of covariates on the treatment effect. Several approaches for estimating the facilitating score are discussed. In particular, we put forward a machine learning method, called causal inference tree (CIT), to provide a piecewise constant approximation of the facilitating score. With interpretable rules, CIT splits data in such a way that both the propensity and the treatment effect become more homogeneous within each resultant partition. Causal inference at different levels can be made on the basis of CIT. Together with an aggregated grouping procedure, CIT stratifies data into strata where causal effects can be conveniently assessed within each. Besides, a feasible way of predicting individual causal effects (ICE) is made available by aggregating ensemble CIT models. Both the stratified results and the estimated ICE provide an assessment of heterogeneity of causal effects and can be integrated for estimating the average causal effect (ACE). Mean square consistency of CIT is also established. We evaluate the performance of proposed methods with simulations and illustrate their use with the NSW data in Dehejia and Wahba (1999) where the objective is to assess the impact of c 2012 Xiaogang Su, Joseph Kang, Juanjuan Fan, Richard A. Levine and Xin Yan. S U , K ANG , FAN , L EVINE AND YAN a labor training program, the National Supported Work (NSW) demonstration, on post-intervention earnings. Keywords: CART, causal inference, confounding, interaction, observational study, personalized medicine, recursive partitioning
4 0.24213602 61 jmlr-2012-ML-Flex: A Flexible Toolbox for Performing Classification Analyses In Parallel
Author: Stephen R. Piccolo, Lewis J. Frey
Abstract: Motivated by a need to classify high-dimensional, heterogeneous data from the bioinformatics domain, we developed ML-Flex, a machine-learning toolbox that enables users to perform two-class and multi-class classification analyses in a systematic yet flexible manner. ML-Flex was written in Java but is capable of interfacing with third-party packages written in other programming languages. It can handle multiple input-data formats and supports a variety of customizations. MLFlex provides implementations of various validation strategies, which can be executed in parallel across multiple computing cores, processors, and nodes. Additionally, ML-Flex supports aggregating evidence across multiple algorithms and data sets via ensemble learning. This open-source software package is freely available from http://mlflex.sourceforge.net. Keywords: toolbox, classification, parallel, ensemble, reproducible research
5 0.24025968 19 jmlr-2012-An Introduction to Artificial Prediction Markets for Classification
Author: Adrian Barbu, Nathan Lay
Abstract: Prediction markets are used in real life to predict outcomes of interest such as presidential elections. This paper presents a mathematical theory of artificial prediction markets for supervised learning of conditional probability estimators. The artificial prediction market is a novel method for fusing the prediction information of features or trained classifiers, where the fusion result is the contract price on the possible outcomes. The market can be trained online by updating the participants’ budgets using training examples. Inspired by the real prediction markets, the equations that govern the market are derived from simple and reasonable assumptions. Efficient numerical algorithms are presented for solving these equations. The obtained artificial prediction market is shown to be a maximum likelihood estimator. It generalizes linear aggregation, existent in boosting and random forest, as well as logistic regression and some kernel methods. Furthermore, the market mechanism allows the aggregation of specialized classifiers that participate only on specific instances. Experimental comparisons show that the artificial prediction markets often outperform random forest and implicit online learning on synthetic data and real UCI data sets. Moreover, an extensive evaluation for pelvic and abdominal lymph node detection in CT data shows that the prediction market improves adaboost’s detection rate from 79.6% to 81.2% at 3 false positives/volume. Keywords: online learning, ensemble methods, supervised learning, random forest, implicit online learning
6 0.22499689 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
7 0.20355549 18 jmlr-2012-An Improved GLMNET for L1-regularized Logistic Regression
8 0.19584011 114 jmlr-2012-Towards Integrative Causal Analysis of Heterogeneous Data Sets and Studies
9 0.19107257 63 jmlr-2012-Mal-ID: Automatic Malware Detection Using Common Segment Analysis and Meta-Features
10 0.17572232 95 jmlr-2012-Random Search for Hyper-Parameter Optimization
11 0.17079636 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach
12 0.1662928 41 jmlr-2012-Exploration in Relational Domains for Model-based Reinforcement Learning
13 0.16583106 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class
14 0.15128623 16 jmlr-2012-Algorithms for Learning Kernels Based on Centered Alignment
15 0.14273758 17 jmlr-2012-An Active Learning Algorithm for Ranking from Pairwise Preferences with an Almost Optimal Query Complexity
16 0.14087184 37 jmlr-2012-Eliminating Spammers and Ranking Annotators for Crowdsourced Labeling Tasks
17 0.13694417 70 jmlr-2012-Multi-Assignment Clustering for Boolean Data
18 0.13598388 49 jmlr-2012-Hope and Fear for Discriminative Training of Statistical Translation Models
19 0.13554041 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine
20 0.13488492 112 jmlr-2012-Structured Sparsity via Alternating Direction Methods
topicId topicWeight
[(6, 0.406), (7, 0.013), (21, 0.05), (26, 0.041), (27, 0.014), (29, 0.044), (35, 0.013), (49, 0.021), (56, 0.016), (57, 0.032), (69, 0.013), (75, 0.04), (77, 0.017), (79, 0.021), (81, 0.012), (92, 0.077), (96, 0.076)]
simIndex simValue paperId paperTitle
same-paper 1 0.68790692 72 jmlr-2012-Multi-Target Regression with Rule Ensembles
Author: Timo Aho, Bernard Ženko, Sašo Džeroski, Tapio Elomaa
Abstract: Methods for learning decision rules are being successfully applied to many problem domains, in particular when understanding and interpretation of the learned model is necessary. In many real life problems, we would like to predict multiple related (nominal or numeric) target attributes simultaneously. While several methods for learning rules that predict multiple targets at once exist, they are all based on the covering algorithm, which does not work well for regression problems. A better solution for regression is the rule ensemble approach that transcribes an ensemble of decision trees into a large collection of rules. An optimization procedure is then used to select the best (and much smaller) subset of these rules and to determine their respective weights. We introduce the F IRE algorithm for solving multi-target regression problems, which employs the rule ensembles approach. We improve the accuracy of the algorithm by adding simple linear functions to the ensemble. We also extensively evaluate the algorithm with and without linear functions. The results show that the accuracy of multi-target regression rule ensembles is high. They are more accurate than, for instance, multi-target regression trees, but not quite as accurate as multi-target random forests. The rule ensembles are significantly more concise than random forests, and it is also possible to create compact rule sets that are smaller than a single regression tree but still comparable in accuracy. Keywords: multi-target prediction, rule learning, rule ensembles, regression ∗. Also in Microtask, Tampere, Finland. †. Also in the Centre of Excellence for Integrated Approaches in Chemistry and Biology of Proteins, Ljubljana, Slovenia. ‡. Also in the Centre of Excellence for Integrated Approaches in Chemistry and Biology of Proteins, Ljubljana, Slovenia and the Joˇ ef Stefan International Postgraduate School, Ljubljana, Slovenia. z ˇ c 2012 Timo Aho, Bernard Zenko, Saˇo Dˇ eroski and Tapio Elomaa. s z ˇ ˇ
2 0.62755156 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach
Author: Grigorios Skolidis, Guido Sanguinetti
Abstract: We propose a novel model for meta-generalisation, that is, performing prediction on novel tasks based on information from multiple different but related tasks. The model is based on two coupled Gaussian processes with structured covariance function; one model performs predictions by learning a constrained covariance function encapsulating the relations between the various training tasks, while the second model determines the similarity of new tasks to previously seen tasks. We demonstrate empirically on several real and synthetic data sets both the strengths of the approach and its limitations due to the distributional assumptions underpinning it. Keywords: transfer learning, meta-generalising, multi-task learning, Gaussian processes, mixture of experts
3 0.31542778 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
Author: Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, Lin Xiao
Abstract: Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed environment. We show how our method can be used to solve the closely-related distributed stochastic optimization problem, achieving an asymptotically linear speed-up over multiple processors. Finally, we demonstrate the merits of our approach on a web-scale online prediction problem. Keywords: distributed computing, online learning, stochastic optimization, regret bounds, convex optimization
4 0.31485885 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
Author: Matus Telgarsky
Abstract: Boosting combines weak learners into a predictor with low empirical risk. Its dual constructs a high entropy distribution upon which weak learners and training labels are uncorrelated. This manuscript studies this primal-dual relationship under a broad family of losses, including the exponential loss of AdaBoost and the logistic loss, revealing: • Weak learnability aids the whole loss family: for any ε > 0, O (ln(1/ε)) iterations suffice to produce a predictor with empirical risk ε-close to the infimum; • The circumstances granting the existence of an empirical risk minimizer may be characterized in terms of the primal and dual problems, yielding a new proof of the known rate O (ln(1/ε)); • Arbitrary instances may be decomposed into the above two, granting rate O (1/ε), with a matching lower bound provided for the logistic loss. Keywords: boosting, convex analysis, weak learnability, coordinate descent, maximum entropy
5 0.31362584 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality
Author: Lan Xue, Annie Qu
Abstract: The varying-coefficient model is flexible and powerful for modeling the dynamic changes of regression coefficients. It is important to identify significant covariates associated with response variables, especially for high-dimensional settings where the number of covariates can be larger than the sample size. We consider model selection in the high-dimensional setting and adopt difference convex programming to approximate the L0 penalty, and we investigate the global optimality properties of the varying-coefficient estimator. The challenge of the variable selection problem here is that the dimension of the nonparametric form for the varying-coefficient modeling could be infinite, in addition to dealing with the high-dimensional linear covariates. We show that the proposed varying-coefficient estimator is consistent, enjoys the oracle property and achieves an optimal convergence rate for the non-zero nonparametric components for high-dimensional data. Our simulations and numerical examples indicate that the difference convex algorithm is efficient using the coordinate decent algorithm, and is able to select the true model at a higher frequency than the least absolute shrinkage and selection operator (LASSO), the adaptive LASSO and the smoothly clipped absolute deviation (SCAD) approaches. Keywords: coordinate decent algorithm, difference convex programming, L0 - regularization, large-p small-n, model selection, nonparametric function, oracle property, truncated L1 penalty
6 0.31232578 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
7 0.30982032 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
8 0.3077108 70 jmlr-2012-Multi-Assignment Clustering for Boolean Data
9 0.30737114 73 jmlr-2012-Multi-task Regression using Minimal Penalties
10 0.30722043 80 jmlr-2012-On Ranking and Generalization Bounds
11 0.30671942 42 jmlr-2012-Facilitating Score and Causal Inference Trees for Large Observational Studies
12 0.30585849 103 jmlr-2012-Sampling Methods for the Nyström Method
13 0.30493394 7 jmlr-2012-A Multi-Stage Framework for Dantzig Selector and LASSO
14 0.30333483 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
15 0.3029601 111 jmlr-2012-Structured Sparsity and Generalization
16 0.30276769 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class
17 0.30274922 82 jmlr-2012-On the Necessity of Irrelevant Variables
18 0.30176151 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
19 0.2999562 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
20 0.29963195 13 jmlr-2012-Active Learning via Perfect Selective Classification