jmlr jmlr2012 jmlr2012-72 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Timo Aho, Bernard Ženko, Sašo Džeroski, Tapio Elomaa
Abstract: Methods for learning decision rules are being successfully applied to many problem domains, in particular when understanding and interpretation of the learned model is necessary. In many real life problems, we would like to predict multiple related (nominal or numeric) target attributes simultaneously. While several methods for learning rules that predict multiple targets at once exist, they are all based on the covering algorithm, which does not work well for regression problems. A better solution for regression is the rule ensemble approach that transcribes an ensemble of decision trees into a large collection of rules. An optimization procedure is then used to select the best (and much smaller) subset of these rules and to determine their respective weights. We introduce the FIRE algorithm for solving multi-target regression problems, which employs the rule ensembles approach. We improve the accuracy of the algorithm by adding simple linear functions to the ensemble. We also extensively evaluate the algorithm with and without linear functions. The results show that the accuracy of multi-target regression rule ensembles is high. They are more accurate than, for instance, multi-target regression trees, but not quite as accurate as multi-target random forests. The rule ensembles are significantly more concise than random forests, and it is also possible to create compact rule sets that are smaller than a single regression tree but still comparable in accuracy. Keywords: multi-target prediction, rule learning, rule ensembles, regression ∗. Also in Microtask, Tampere, Finland. †. Also in the Centre of Excellence for Integrated Approaches in Chemistry and Biology of Proteins, Ljubljana, Slovenia. ‡. Also in the Centre of Excellence for Integrated Approaches in Chemistry and Biology of Proteins, Ljubljana, Slovenia and the Jožef Stefan International Postgraduate School, Ljubljana, Slovenia. © 2012 Timo Aho, Bernard Ženko, Sašo Džeroski and Tapio Elomaa.
Reference: text
sentIndex sentText sentNum sentScore
1 While several methods for learning rules that predict multiple targets at once exist, they are all based on the covering algorithm, which does not work well for regression problems. [sent-15, score-0.248]
2 A better solution for regression is the rule ensemble approach that transcribes an ensemble of decision trees into a large collection of rules. [sent-16, score-0.403]
3 We introduce the FIRE algorithm for solving multi-target regression problems, which employs the rule ensembles approach. [sent-18, score-0.282]
4 The results show that the accuracy of multi-target regression rule ensembles is high. [sent-21, score-0.282]
5 The rule ensembles are significantly more concise than random forests, and it is also possible to create compact rule sets that are smaller than a single regression tree but still comparable in accuracy. [sent-23, score-0.508]
6 Keywords: multi-target prediction, rule learning, rule ensembles, regression. [sent-24, score-0.337]
7 © 2012 Timo Aho, Bernard Ženko, Sašo Džeroski and Tapio Elomaa. [sent-30, score-0.194]
8 The majority of rule learning methods are based on the sequential covering algorithm (Michalski, 1969), originally designed for learning ordered rule lists for binary classification domains. [sent-55, score-0.327]
9 An alternative rule learning method that also performs well on (single-target) regression problems is the approach of rule ensembles (Friedman and Popescu, 2005, 2008; Dembczyński et al. [sent-58, score-0.463]
10 In this paper, we introduce FIRE, an algorithm for multi-target regression based on the rule ensembles approach. [sent-62, score-0.282]
11 Finally, the present paper includes a significantly extended empirical evaluation: We consider a larger collection of data sets and methods with which our algorithm is compared, including the latest rule ensemble methods (Dembczyński et al. [sent-75, score-0.246]
12 Section 2 presents related work on multitarget prediction, rule learning, and rule ensembles. [sent-80, score-0.302]
13 The FIRE algorithm for learning multi-target regression rule ensembles is introduced in Section 3. [sent-81, score-0.282]
14 Several standard learning methods such as neural networks, decision trees, model trees, classification rules and random forests have been extended towards multi-target prediction (Caruana, 1997; Blockeel et al. [sent-93, score-0.193]
15 Since our method learns regression rules, it is closely related to rule learning (Flach and Lavrač, 2003). [sent-124, score-0.186]
16 It employs the standard covering approach (Michalski, 1969) and can learn ordered or unordered rule sets for classification and regression domains. [sent-126, score-0.211]
17 An alternative approach to rule learning is called rule ensembles (Friedman and Popescu, 2005, 2008; Dembczyński et al. [sent-129, score-0.428]
18 Strictly speaking, any set of (unordered) rules can be called a rule ensemble, as for example, in Indurkhya and Weiss (2001). [sent-131, score-0.249]
19 In this paper, however, a rule ensemble is understood to be a set of unordered rules whose predictions are combined through weighted voting, which is the approach introduced by the RULEFIT (Friedman and Popescu, 2005, 2008) and REGENDER methods (Dembczyński et al. [sent-132, score-0.37]
20 The RULEFIT algorithm starts by generating a set of decision trees in much the same way as ensembles are generated in methods like bagging (Breiman, 1996) and random forests (Breiman, 2001). [sent-134, score-0.278]
21 Because such large ensembles are hard or even impossible to interpret, all the trees are transcribed into a collection of rules, and an optimization procedure is used to select a small subset of the rules and to determine their weights. [sent-135, score-0.302]
22 In addition to rules, we can also use descriptive attributes in the linear combination if we add them to the initial set of rules, and likewise determine their weights in the optimization step. [sent-137, score-0.195]
23 Learning Rule Based Ensembles for Multi-Target Regression Our algorithm for learning rule based ensembles for multi-target regression problems (which we call FIRE: Fitted rule ensembles) is greatly influenced by the RULEFIT method. [sent-145, score-0.433]
24 Because linear dependencies are known to be difficult to approximate with rules, we optionally add linear terms (simple linear functions) of all numeric descriptive attributes to the collection. [sent-149, score-0.195]
25 Algorithm 1 The FIRE algorithm for learning rule ensembles for multi-target regression. [sent-153, score-0.247]
26 The first sum is the contribution of the M rules: each rule ri is a vector function that gives a constant prediction (for each of the targets), if it covers the example x, or returns a zero vector otherwise. [sent-159, score-0.192]
27 There is a term for each combination of a target and a numeric descriptive attribute, thus the total number of linear terms is the number of numeric descriptive attributes K times the number of target attributes T . [sent-161, score-0.54]
28 A linear term x(t, j) is a vector that corresponds to the influence of the j-th numerical descriptive attribute x j on the t-th target attribute; its t-th component is equal to x j , while all other components are zero: x(t, j) = (0, . [sent-162, score-0.236]
29 A hypothetic rule ensemble that predicts all the target values of this domain simultaneously could be: y = f (x) = 0. [sent-174, score-0.302]
30 It comprises a constant vector, two rules and four linear terms (of attributes x2 and x5 ), but can also be simplified to a sum of a vector of linear equations and two rules. [sent-210, score-0.196]
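To make the model form above concrete, here is a minimal sketch of such a multi-target rule ensemble prediction function: a weighted sum of a constant vector, rules that contribute only when their condition covers the example, and linear terms that pass one numeric attribute through to one target. The Rule class, the helper names, and the example numbers are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np

class Rule:
    """A rule: if condition(x) holds, predict a constant target vector, else zeros."""
    def __init__(self, condition, prediction):
        self.condition = condition                    # callable: x -> bool
        self.prediction = np.asarray(prediction, dtype=float)

    def __call__(self, x):
        return self.prediction if self.condition(x) else np.zeros_like(self.prediction)

def linear_term(t, j, n_targets):
    """Linear term x_(t,j): a vector whose t-th component is x_j, all others zero."""
    def term(x):
        v = np.zeros(n_targets)
        v[t] = x[j]
        return v
    return term

def predict(x, w0, avg, rules, rule_weights, terms, term_weights):
    """f(x) = w0*avg + sum_i w_i r_i(x) + sum_(t,j) w_(t,j) x_(t,j)."""
    y = w0 * avg
    for w, r in zip(rule_weights, rules):
        y = y + w * r(x)
    for w, term in zip(term_weights, terms):
        y = y + w * term(x)
    return y

# Hypothetical two-target example: one rule and two linear terms on attribute x[1].
avg = np.array([0.5, 1.0])
rules = [Rule(lambda x: x[0] > 2.0, [0.3, -0.2])]
terms = [linear_term(0, 1, 2), linear_term(1, 1, 2)]
print(predict(np.array([3.0, 0.4]), 1.0, avg, rules, [0.8], terms, [0.1, 0.05]))
```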
31 A set of diverse trees is generated with the multi-target implementation of the random forest ensemble method (Kocev et al. [sent-217, score-0.202]
32 , 2007), modified to stop tree building when a given tree depth limit is reached. [sent-218, score-0.196]
33 In order to increase the tree (and, thus, rule) variability, we limit the depth of a particular tree in a randomized fashion as suggested by Friedman and Popescu (2008). [sent-220, score-0.196]
34 It should be emphasized that the parameter L̄ only affects the average of all tree depth limits dm and thus trees with larger depths can still be generated. [sent-225, score-0.213]
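Friedman and Popescu (2008) draw a per-tree size limit from an exponential distribution so that only the average is fixed; the sketch below assumes that scheme and converts the drawn number of terminal nodes into a depth limit. The conversion and parameter names are illustrative guesses, not the exact procedure used in FIRE.

```python
import numpy as np

def draw_depth_limits(n_trees, avg_terminal_nodes, rng=None):
    """Draw a per-tree depth limit in a randomized fashion.

    Assumption (following Friedman & Popescu, 2008): the number of terminal
    nodes of tree m is t_m = 2 + floor(g), with g drawn from an exponential
    distribution whose mean is avg_terminal_nodes - 2.  The depth limit is
    then the smallest depth of a binary tree with at least t_m leaves.
    """
    rng = rng if rng is not None else np.random.default_rng()
    gammas = rng.exponential(scale=avg_terminal_nodes - 2, size=n_trees)
    terminal_nodes = 2 + np.floor(gammas)
    return np.ceil(np.log2(terminal_nodes)).astype(int)

print(draw_depth_limits(10, avg_terminal_nodes=4))
```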
35 All regression trees generated with the above procedure are transcribed into rules with the ConvertTreesToRules procedure. [sent-226, score-0.22]
36 In order to equalize the importance of different rules and different targets we proceed in three separate steps: First, we simply zero-center all the targets. [sent-231, score-0.233]
37 In the first step, we zero-center all the rule target predictions by subtracting the average avg from each of the original rule predictions r ′′ : r ′ = r ′′ − avg. [sent-240, score-0.551]
38 The average avg contains the average values of the target attributes on the learning set. [sent-241, score-0.298]
39 In the second, more complex, step, we scale the predicted values rt′ of each target attribute t by dividing them by a factor χ: rt = rt′ / χ. (3) [sent-242, score-0.2]
40 We choose χ so that it is related to both the largest predicted value of the rule r′ and to the standard deviation σt of a target attribute t. [sent-243, score-0.314]
41 In this step, we equalize the scaling differences between different target attribute spaces. [sent-257, score-0.208]
42 As usual, we do this simply by dividing the target attribute prediction values rt by twice their standard deviations 2σt: rt∗ = rt/(2σt) = rt′/(2σt χ) = (rt′′ − avgt)/(2σt χ). [sent-260, score-0.237]
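A small sketch of the three normalization steps as read from the description above: zero-centering by avg, scaling by χ so that the target with the largest value relative to its standard deviation gets prediction 1, and the temporary division by 2σt used only during optimization. Variable names and the exact form of χ follow this reading and are not taken from the authors' code.

```python
import numpy as np

def normalize_rule_prediction(r_pp, avg, sigma):
    """Normalize one rule's target prediction vector r'' in three steps."""
    # Step 1: zero-center the predictions.
    r_p = r_pp - avg                               # r' = r'' - avg
    # Step 2: scale by chi so that the target with the largest value
    # relative to its standard deviation gets prediction exactly 1.
    scaled = r_p / sigma
    m = np.argmax(np.abs(scaled))                  # index of the maximum target value
    chi_per_target = (r_p[m] / sigma[m]) * sigma   # chi for each target t
    r = r_p / chi_per_target                       # now r[m] == 1 and |r[t]| <= 1
    # Step 3 (temporary, optimization only): equalize target scales by 2*sigma_t.
    r_star = r / (2.0 * sigma)
    return r_p, r, r_star

avg = np.array([10.0, 0.2])
sigma = np.array([4.0, 0.05])
print(normalize_rule_prediction(np.array([14.0, 0.25]), avg, sigma))
```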
43 We again shift the linear terms by the average x̄′′j of the j-th descriptive attribute and scale them by its standard deviation: x′(t, j) = (0, . . . , (x′′j − x̄′′j)/(2σj), . . . , 0). [sent-272, score-0.197]
44 Linear terms normalized like this appear in the final rule ensemble model. [sent-280, score-0.238]
45 Analogously to the third stage of rule normalization we also scale the target dimension space out temporarily: x∗(t, j) = x′(t, j)/(2σt) = x(t, j)/(2σt · 2σj). [sent-281, score-0.288]
46 This is, again, only intended to equalize the terms referring to different target attributes during the optimization procedure. [sent-282, score-0.25]
47 The purpose of the regularization part is to make as many weights equal to zero as possible, which means that the resulting rule ensemble will be as small as possible. [sent-287, score-0.241]
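FIRE selects the weights with a gradient-directed optimization under a sparsity-inducing regularization; that exact procedure is not reproduced here. Purely to illustrate the idea that such regularization drives most weights to exactly zero, the sketch below fits an L1-regularized (Lasso) model over a matrix whose columns stand in for rule and linear-term predictions.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Columns of F hold the prediction of each rule/linear term on each training
# example (single-target view; the multi-target case needs one such system per
# target or a joint formulation).  Random data stands in for a real ensemble.
rng = np.random.default_rng(0)
F = rng.normal(size=(200, 50))            # 200 examples, 50 rules/terms
y = F[:, :5] @ np.array([1.0, -0.5, 0.3, 0.8, -1.2]) + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.05).fit(F, y)
weights = lasso.coef_
print("non-zero weights:", np.count_nonzero(weights), "of", weights.size)
```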
48 From the RRMSE diagram (Figure 1a), we can see that random forests are the most accurate method, followed by all the rule ensemble methods, MTSMOTI, regression trees and finally DIRTY, which seems to perform poorly on single-target data. [sent-300, score-0.433]
49 Our unlimited FIRE versions and limited FIRE with linear terms are in the middle class. [sent-305, score-0.186]
50 In this case, linear terms seem to be surprisingly effective, as they bring the limited FIRE of 30 terms to the same accuracy level as the much larger unlimited FIRE set of rules. [sent-310, score-0.206]
51 Both unlimited FIRE versions seem to generate models that are larger than the reference rule ensembles (RULEFIT and REGENDER), but the differences are not significant. [sent-312, score-0.435]
52 In a pairwise comparison, the algorithms tie in wins over data sets (12 wins each) and the unlimited version has a slightly lower average accuracy. [sent-320, score-0.203]
53 In sum, the results of this first part of the evaluation show that our rule ensemble method FIRE performs well on single-target regression problems. [sent-323, score-0.251]
54 The average ranks and results of the Nemenyi test are given in Figure 2 for RRMSE evaluated on separate targets (a), RRMSE evaluated on target averages within data sets (b), and model size (c). [sent-336, score-0.226]
55 Looking at Figure 2(a), the general picture of algorithm ranking is similar to the one for single-target prediction: random forests and both unlimited FIRE versions are more accurate than DIRTY, regression trees, the limited versions of FIRE, and MTSMOTI. [sent-337, score-0.32]
56 This is especially clear in the per data set average evaluation and occurs in the pairwise comparison with the limited FIRE (in per target evaluation 7 out of 11 wins and per data set average 34–36 out of 63 wins for regression trees). [sent-352, score-0.211]
57 The detailed results in Appendix D show that regression trees tend to win only when the size of the tree produced is much larger than the FIRE limit of 30. [sent-353, score-0.217]
58 When both size and accuracy are taken into account, the unlimited version of FIRE with linear terms seems to perform very well on multi-target regression problems. [sent-367, score-0.195]
59 We also notice that the average proportion of linear terms in the unlimited model is 24%, but drops to only 5% if we limit the model size to 30. [sent-368, score-0.205]
60 As mentioned in Section 3, the total number of added linear terms is the number of numeric descriptive attributes times the number of target attributes. [sent-419, score-0.281]
61 This is probably a reason for the long running time of FIRE with linear terms and why the difference between the two unlimited FIRE versions is greater for multi-target data sets than for single-target ones. [sent-420, score-0.186]
62 The limited FIRE versions use on average about a tenth of the time used by the unlimited ones. [sent-422, score-0.189]
63 In the unlimited FIRE we do not have such a contest between linear terms and rules during the optimization and, thus, the negative effect of adding a linear term is negligible. [sent-833, score-0.279]
64 However, for single-target rule ensembles the linear terms seem to help all the time as seen in Figure 3. [sent-848, score-0.293]
65 It is an interesting question whether this would benefit the multi-target rule ensembles in the same way as the (single dimension) linear terms benefit the single-target rule ensembles. [sent-852, score-0.42]
66 6 Summary of the Experimental Results In our experiments, we first evaluated FIRE on single-target domains in order to show that our implementation of rule ensembles also works on standard regression problems. [sent-862, score-0.282]
67 The results are somewhat similar to the ones obtained on single-target domains: random forests and the unlimited FIRE versions are more accurate than the limited FIRE, regression trees, and MTSMOTI. [sent-866, score-0.294]
68 Even though the difference in size between random forests and the unlimited FIRE is not statistically significant, the average size of a random forest is more than 300 times larger than the average size of FIRE models (with and without linear terms). [sent-869, score-0.333]
69 Therefore, we believe that the unlimited FIRE with linear terms is a very good choice for modeling multi-target regression problems. [sent-874, score-0.195]
70 In the case of linear terms, we can conclude that for larger (50 terms or more) multi-target rule ensembles adding linear terms should in general improve the accuracy. [sent-878, score-0.291]
71 We have adopted the rule ensemble approach and generalized it to multi-target regression domains. [sent-883, score-0.251]
72 , 2009) was already able to learn multi-target regression rule ensembles. [sent-885, score-0.186]
73 We compared it with two other existing rule ensemble approaches, RULEFIT and REGENDER, and to other similar multi-target prediction algorithms, namely regression trees, random forests, and model trees (MTSMOTI). [sent-892, score-0.369]
74 Džeroski was supported by the Slovenian Research Agency (Grants P2-0103 and J2-2285), the European Commission (Grants ICT-2010-266722 and ICT-2011-287713), and Operation no. [sent-919, score-0.194]
75 Here J is the number of targets and each P j represents the predictions for the target dimension j: P j = avg j , r1 (x) j , r2 (x) j , . . . [sent-932, score-0.288]
76 We remember that our rules are transformed from a tree ensemble of the form: f (x) = (1/|D|) ∑ di (x), with the sum running over i = 1, . . . , |D|, where |D| is the number of trees in the ensemble. [sent-963, score-0.325]
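Rules are obtained by transcribing the trees of this ensemble: each root-to-leaf path becomes one rule (the ConvertTreesToRules step mentioned earlier). FIRE works on multi-target trees from the CLUS system; as an analogy only, the sketch below extracts such path rules from a scikit-learn multi-output regression tree.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def tree_to_rules(tree):
    """Return one (conditions, leaf_prediction) pair per root-to-leaf path."""
    t = tree.tree_
    rules = []

    def walk(node, conditions):
        if t.children_left[node] == -1:            # leaf node
            rules.append((list(conditions), t.value[node].ravel().copy()))
            return
        f, thr = t.feature[node], t.threshold[node]
        walk(t.children_left[node], conditions + [(f, "<=", thr)])
        walk(t.children_right[node], conditions + [(f, ">", thr)])

    walk(0, [])
    return rules

X = np.random.rand(100, 3)
y = np.column_stack([X[:, 0] + X[:, 1], X[:, 2]])   # two targets
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
for conds, pred in tree_to_rules(tree):
    print(conds, "->", pred)
```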
77 Here each tree prediction di in the ensemble is global in contrast to rule predictions ri being local. [sent-964, score-0.358]
78 This consists of a rule from each of the trees in the tree ensemble. [sent-969, score-0.313]
79 Now the rule predictions are used in the same scale in which they were created during the tree ensemble training. [sent-971, score-0.317]
80 That is, we replace each of the initial rules r ′′ with a zero-centered rule r ′, which is defined as r ′(x) = r ′′(x) − avg if x is covered, and 0 otherwise. [sent-975, score-0.335]
81 f (x) = (1/nr. of covering rules) ∑ (avg + ri′ (x)) = avg + (1/nr. of covering rules) ∑ ri′ (x). [sent-978, score-0.295]
82 of covering rules i=1 i=1 This new form allows us to do the weight optimization freely without caring about the number of covering rules: M f (x) = w0 avg + ∑ wi ri′ (x). [sent-980, score-0.276]
83 The optimization problem is not invariant to the scaling of rule predictions: If we scale the rule predictions as r = b r ′; b > 1, the corresponding weight will not be simply decreased as w = w′ /b, because the regularization part on the right only includes weights and not rule predictions. [sent-982, score-0.525]
84 We would like all the rules and targets to have equal initial importance. [sent-984, score-0.188]
85 It is clear that setting the rule predictions to 1 for all targets would discard all the discovered information on relations between the targets stored in the rules. [sent-988, score-0.357]
86 If σt were the standard deviation of a normally distributed target attribute rt′ , dividing a zero-centered attribute by 2σt should put 95% of all values within the [−1, 1] interval. [sent-994, score-0.24]
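A quick numerical check of this 2σt statement on simulated normal data; purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 3.7
x = rng.normal(loc=0.0, scale=sigma, size=100_000)    # zero-centered target attribute
inside = np.mean(np.abs(x / (2 * sigma)) <= 1.0)
print(f"fraction within [-1, 1]: {inside:.3f}")        # roughly 0.954
```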
87 , T } of the maximum target value rm of the rule r ′ is defined by: m = arg maxt |rt′ /σt |. [sent-999, score-0.258]
88 We can now also give a strict bound to the predictions rt : By the definition of m, after this second stage of normalization it holds that rm = 1 and |rt | = |(σm /rm′ )(rt′ /σt )| ≤ 1 for all other targets t. [sent-1001, score-0.204]
89 To sum up, at the second stage of normalization the target predictions rt in a certain rule r are scaled by a factor χ that represents the initial size of the predictive values of the particular rule. [sent-1003, score-0.376]
90 Thus, this stage roughly equalizes the rules before the optimization phase and affects the rules of the final model. [sent-1004, score-0.217]
91 After the first two normalization steps our rule predictions are of the form: r = r ′ /χ = (r ′′ − avg)/χ. [sent-1005, score-0.314]
92 Otherwise, targets with large scales would dominate the rule selection. [sent-1007, score-0.241]
93 As usual, we do this simply by dividing the target attribute prediction values rt by twice their standard deviations 2σt: rt∗ = rt /(2σt) = rt′ /(2σt χ) = (rt′′ − avgt)/(2σt χ). [sent-1011, score-0.237]
94 . . . , 0, x j , 0, . . . , 0), with x j as the t-th component, which depicts the influence of the descriptive attribute x j on the target attribute xt . [sent-1020, score-0.313]
95 We add linear terms for all possible combinations of numeric descriptive attributes and target attributes. [sent-1021, score-0.281]
96 Unlike rules, linear terms are affected by two attributes and, thus, two attribute space scales: that of the j-th descriptive attribute space and that of the t-th target space. [sent-1022, score-0.411]
97 Linear terms normalized like this appear in the final rule ensemble model. [sent-1036, score-0.238]
98 However, analogously to the third stage of rule normalization we also scale the target dimension space out temporarily: x∗(t, j) = x′(t, j)/(2σt) = x(t, j)/(2σt · 2σj). [sent-1037, score-0.288]
99 This is, again, only intended to equalize the terms of different target attributes during the optimization procedure. [sent-1038, score-0.25]
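A sketch of building the K × T linear-term columns with the normalization just described (shift by the attribute mean, scale by 2σj, plus the temporary division by 2σt used only during optimization); the helper name and the exact scaling constants follow my reading of the text, not the authors' code.

```python
import numpy as np

def build_normalized_linear_terms(X_numeric, target_sigma):
    """One linear-term column per (target t, numeric attribute j) pair.

    Each column is shifted by the attribute mean and scaled by 2*sigma_j
    (kept in the final model), then divided by 2*sigma_t of its target
    (the temporary scaling used only while optimizing the weights).
    """
    K = X_numeric.shape[1]
    T = len(target_sigma)
    attr_mean = X_numeric.mean(axis=0)
    attr_sigma = X_numeric.std(axis=0)
    terms = {}
    for t in range(T):
        for j in range(K):
            col = (X_numeric[:, j] - attr_mean[j]) / (2.0 * attr_sigma[j])
            terms[(t, j)] = col / (2.0 * target_sigma[t])
    return terms                                    # K * T columns in total

X_numeric = np.random.rand(100, 4)                  # K = 4 numeric attributes
terms = build_normalized_linear_terms(X_numeric, target_sigma=np.array([1.5, 0.2]))
print(len(terms), "linear terms")                   # 8 = 4 * 2
```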
100 We notice that the form on the last row can be expressed in the form of internal normalization f (x) = w0 avg + ∑ wi ri′ (x) (sum over i = 1, . . . , M) only if avg = 0, that is, the data is originally zero-centered. [sent-1048, score-0.244]
wordName wordTfidf (topN-words)
[('ire', 0.588), ('fire', 0.402), ('rrmse', 0.337), ('eroski', 0.194), ('irty', 0.162), ('rule', 0.151), ('unlimited', 0.138), ('moti', 0.11), ('enko', 0.104), ('lomaa', 0.104), ('rules', 0.098), ('ensembles', 0.096), ('forests', 0.095), ('nsembles', 0.094), ('zenko', 0.091), ('targets', 0.09), ('trees', 0.087), ('target', 0.086), ('avg', 0.086), ('attribute', 0.077), ('attributes', 0.076), ('tree', 0.075), ('descriptive', 0.073), ('aho', 0.065), ('ensemble', 0.065), ('ulti', 0.062), ('ho', 0.056), ('blockeel', 0.055), ('egression', 0.053), ('sa', 0.053), ('tampere', 0.052), ('vpv', 0.052), ('normalization', 0.051), ('forest', 0.05), ('equalize', 0.045), ('jalali', 0.045), ('struyf', 0.045), ('popescu', 0.044), ('ri', 0.041), ('ljubljana', 0.04), ('appice', 0.039), ('dembczy', 0.039), ('rt', 0.037), ('cd', 0.036), ('regression', 0.035), ('dirty', 0.035), ('ree', 0.035), ('nemenyi', 0.033), ('nder', 0.033), ('friedman', 0.033), ('auto', 0.032), ('kocev', 0.032), ('mtsmoti', 0.032), ('orest', 0.032), ('timo', 0.032), ('lncs', 0.032), ('terminal', 0.032), ('jt', 0.03), ('bernard', 0.03), ('ski', 0.03), ('err', 0.028), ('ax', 0.027), ('predictions', 0.026), ('bogdan', 0.026), ('errmin', 0.026), ('igmea', 0.026), ('jo', 0.026), ('ousing', 0.026), ('rulefit', 0.026), ('gk', 0.026), ('dem', 0.026), ('eg', 0.026), ('depth', 0.026), ('versions', 0.026), ('eta', 0.025), ('hendrik', 0.025), ('slovenia', 0.025), ('predictive', 0.025), ('ranks', 0.025), ('covering', 0.025), ('weights', 0.025), ('average', 0.025), ('optional', 0.025), ('diagrams', 0.025), ('jerome', 0.024), ('numeric', 0.024), ('seem', 0.024), ('sp', 0.023), ('andom', 0.023), ('species', 0.022), ('terms', 0.022), ('editors', 0.022), ('suzuki', 0.022), ('max', 0.021), ('optimization', 0.021), ('wi', 0.021), ('wk', 0.02), ('stanford', 0.02), ('wins', 0.02), ('limit', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000008 72 jmlr-2012-Multi-Target Regression with Rule Ensembles
Author: Timo Aho, Bernard Ženko, Sašo Džeroski, Tapio Elomaa
Abstract: Methods for learning decision rules are being successfully applied to many problem domains, in particular when understanding and interpretation of the learned model is necessary. In many real life problems, we would like to predict multiple related (nominal or numeric) target attributes simultaneously. While several methods for learning rules that predict multiple targets at once exist, they are all based on the covering algorithm, which does not work well for regression problems. A better solution for regression is the rule ensemble approach that transcribes an ensemble of decision trees into a large collection of rules. An optimization procedure is then used to select the best (and much smaller) subset of these rules and to determine their respective weights. We introduce the F IRE algorithm for solving multi-target regression problems, which employs the rule ensembles approach. We improve the accuracy of the algorithm by adding simple linear functions to the ensemble. We also extensively evaluate the algorithm with and without linear functions. The results show that the accuracy of multi-target regression rule ensembles is high. They are more accurate than, for instance, multi-target regression trees, but not quite as accurate as multi-target random forests. The rule ensembles are significantly more concise than random forests, and it is also possible to create compact rule sets that are smaller than a single regression tree but still comparable in accuracy. Keywords: multi-target prediction, rule learning, rule ensembles, regression ∗. Also in Microtask, Tampere, Finland. †. Also in the Centre of Excellence for Integrated Approaches in Chemistry and Biology of Proteins, Ljubljana, Slovenia. ‡. Also in the Centre of Excellence for Integrated Approaches in Chemistry and Biology of Proteins, Ljubljana, Slovenia and the Joˇ ef Stefan International Postgraduate School, Ljubljana, Slovenia. z ˇ c 2012 Timo Aho, Bernard Zenko, Saˇo Dˇ eroski and Tapio Elomaa. s z ˇ ˇ
2 0.092762746 20 jmlr-2012-Analysis of a Random Forests Model
Author: Gérard Biau
Abstract: Random forests are a scheme proposed by Leo Breiman in the 2000’s for building a predictor ensemble with a set of decision trees that grow in randomly selected subspaces of data. Despite growing interest and practical use, there has been little exploration of the statistical properties of random forests, and little is known about the mathematical forces driving the algorithm. In this paper, we offer an in-depth analysis of a random forests model suggested by Breiman (2004), which is very close to the original algorithm. We show in particular that the procedure is consistent and adapts to sparsity, in the sense that its rate of convergence depends only on the number of strong features and not on how many noise variables are present. Keywords: random forests, randomization, sparsity, dimension reduction, consistency, rate of convergence
3 0.033557147 41 jmlr-2012-Exploration in Relational Domains for Model-based Reinforcement Learning
Author: Tobias Lang, Marc Toussaint, Kristian Kersting
Abstract: A fundamental problem in reinforcement learning is balancing exploration and exploitation. We address this problem in the context of model-based reinforcement learning in large stochastic relational domains by developing relational extensions of the concepts of the E 3 and R- MAX algorithms. Efficient exploration in exponentially large state spaces needs to exploit the generalization of the learned model: what in a propositional setting would be considered a novel situation and worth exploration may in the relational setting be a well-known context in which exploitation is promising. To address this we introduce relational count functions which generalize the classical notion of state and action visitation counts. We provide guarantees on the exploration efficiency of our framework using count functions under the assumption that we had a relational KWIK learner and a near-optimal planner. We propose a concrete exploration algorithm which integrates a practically efficient probabilistic rule learner and a relational planner (for which there are no guarantees, however) and employs the contexts of learned relational rules as features to model the novelty of states and actions. Our results in noisy 3D simulated robot manipulation problems and in domains of the international planning competition demonstrate that our approach is more effective than existing propositional and factored exploration techniques. Keywords: reinforcement learning, statistical relational learning, exploration, relational transition models, robotics
4 0.030164892 62 jmlr-2012-MULTIBOOST: A Multi-purpose Boosting Package
Author: Djalel Benbouzid, Róbert Busa-Fekete, Norman Casagrande, François-David Collin, Balázs Kégl
Abstract: The M ULTI B OOST package provides a fast C++ implementation of multi-class/multi-label/multitask boosting algorithms. It is based on A DA B OOST.MH but it also implements popular cascade classifiers and F ILTER B OOST. The package contains common multi-class base learners (stumps, trees, products, Haar filters). Further base learners and strong learners following the boosting paradigm can be easily implemented in a flexible framework. Keywords: boosting, A DA B OOST.MH, F ILTER B OOST, cascade classifier
5 0.029680476 42 jmlr-2012-Facilitating Score and Causal Inference Trees for Large Observational Studies
Author: Xiaogang Su, Joseph Kang, Juanjuan Fan, Richard A. Levine, Xin Yan
Abstract: Assessing treatment effects in observational studies is a multifaceted problem that not only involves heterogeneous mechanisms of how the treatment or cause is exposed to subjects, known as propensity, but also differential causal effects across sub-populations. We introduce a concept termed the facilitating score to account for both the confounding and interacting impacts of covariates on the treatment effect. Several approaches for estimating the facilitating score are discussed. In particular, we put forward a machine learning method, called causal inference tree (CIT), to provide a piecewise constant approximation of the facilitating score. With interpretable rules, CIT splits data in such a way that both the propensity and the treatment effect become more homogeneous within each resultant partition. Causal inference at different levels can be made on the basis of CIT. Together with an aggregated grouping procedure, CIT stratifies data into strata where causal effects can be conveniently assessed within each. Besides, a feasible way of predicting individual causal effects (ICE) is made available by aggregating ensemble CIT models. Both the stratified results and the estimated ICE provide an assessment of heterogeneity of causal effects and can be integrated for estimating the average causal effect (ACE). Mean square consistency of CIT is also established. We evaluate the performance of proposed methods with simulations and illustrate their use with the NSW data in Dehejia and Wahba (1999) where the objective is to assess the impact of c 2012 Xiaogang Su, Joseph Kang, Juanjuan Fan, Richard A. Levine and Xin Yan. S U , K ANG , FAN , L EVINE AND YAN a labor training program, the National Supported Work (NSW) demonstration, on post-intervention earnings. Keywords: CART, causal inference, confounding, interaction, observational study, personalized medicine, recursive partitioning
6 0.029668402 114 jmlr-2012-Towards Integrative Causal Analysis of Heterogeneous Data Sets and Studies
7 0.028298 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
8 0.027029248 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach
9 0.025257744 59 jmlr-2012-Linear Regression With Random Projections
10 0.024975745 61 jmlr-2012-ML-Flex: A Flexible Toolbox for Performing Classification Analyses In Parallel
11 0.023887152 48 jmlr-2012-High-Dimensional Gaussian Graphical Model Selection: Walk Summability and Local Separation Criterion
12 0.023735961 73 jmlr-2012-Multi-task Regression using Minimal Penalties
13 0.023554027 80 jmlr-2012-On Ranking and Generalization Bounds
14 0.023174796 106 jmlr-2012-Sign Language Recognition using Sub-Units
15 0.022828573 49 jmlr-2012-Hope and Fear for Discriminative Training of Statistical Translation Models
16 0.021681411 51 jmlr-2012-Integrating a Partial Model into Model Free Reinforcement Learning
17 0.020880209 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization
18 0.020541571 4 jmlr-2012-A Kernel Two-Sample Test
19 0.020414781 76 jmlr-2012-Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics
20 0.020289211 103 jmlr-2012-Sampling Methods for the Nyström Method
topicId topicWeight
[(0, -0.107), (1, 0.031), (2, 0.06), (3, -0.051), (4, 0.004), (5, 0.012), (6, -0.003), (7, 0.023), (8, 0.021), (9, 0.053), (10, -0.012), (11, 0.052), (12, 0.066), (13, 0.008), (14, -0.096), (15, 0.094), (16, 0.021), (17, -0.098), (18, -0.045), (19, 0.012), (20, -0.082), (21, -0.043), (22, -0.091), (23, -0.013), (24, 0.087), (25, 0.058), (26, -0.005), (27, 0.342), (28, 0.104), (29, -0.139), (30, 0.051), (31, 0.19), (32, 0.08), (33, 0.043), (34, -0.021), (35, 0.107), (36, -0.072), (37, -0.103), (38, -0.238), (39, -0.105), (40, -0.016), (41, -0.041), (42, -0.189), (43, -0.06), (44, 0.245), (45, -0.122), (46, 0.282), (47, 0.014), (48, 0.092), (49, -0.047)]
simIndex simValue paperId paperTitle
same-paper 1 0.95281911 72 jmlr-2012-Multi-Target Regression with Rule Ensembles
Author: Timo Aho, Bernard Ženko, Sašo Džeroski, Tapio Elomaa
Abstract: Methods for learning decision rules are being successfully applied to many problem domains, in particular when understanding and interpretation of the learned model is necessary. In many real life problems, we would like to predict multiple related (nominal or numeric) target attributes simultaneously. While several methods for learning rules that predict multiple targets at once exist, they are all based on the covering algorithm, which does not work well for regression problems. A better solution for regression is the rule ensemble approach that transcribes an ensemble of decision trees into a large collection of rules. An optimization procedure is then used to select the best (and much smaller) subset of these rules and to determine their respective weights. We introduce the F IRE algorithm for solving multi-target regression problems, which employs the rule ensembles approach. We improve the accuracy of the algorithm by adding simple linear functions to the ensemble. We also extensively evaluate the algorithm with and without linear functions. The results show that the accuracy of multi-target regression rule ensembles is high. They are more accurate than, for instance, multi-target regression trees, but not quite as accurate as multi-target random forests. The rule ensembles are significantly more concise than random forests, and it is also possible to create compact rule sets that are smaller than a single regression tree but still comparable in accuracy. Keywords: multi-target prediction, rule learning, rule ensembles, regression ∗. Also in Microtask, Tampere, Finland. †. Also in the Centre of Excellence for Integrated Approaches in Chemistry and Biology of Proteins, Ljubljana, Slovenia. ‡. Also in the Centre of Excellence for Integrated Approaches in Chemistry and Biology of Proteins, Ljubljana, Slovenia and the Joˇ ef Stefan International Postgraduate School, Ljubljana, Slovenia. z ˇ c 2012 Timo Aho, Bernard Zenko, Saˇo Dˇ eroski and Tapio Elomaa. s z ˇ ˇ
2 0.59415829 20 jmlr-2012-Analysis of a Random Forests Model
Author: Gérard Biau
Abstract: Random forests are a scheme proposed by Leo Breiman in the 2000’s for building a predictor ensemble with a set of decision trees that grow in randomly selected subspaces of data. Despite growing interest and practical use, there has been little exploration of the statistical properties of random forests, and little is known about the mathematical forces driving the algorithm. In this paper, we offer an in-depth analysis of a random forests model suggested by Breiman (2004), which is very close to the original algorithm. We show in particular that the procedure is consistent and adapts to sparsity, in the sense that its rate of convergence depends only on the number of strong features and not on how many noise variables are present. Keywords: random forests, randomization, sparsity, dimension reduction, consistency, rate of convergence
3 0.25243506 42 jmlr-2012-Facilitating Score and Causal Inference Trees for Large Observational Studies
Author: Xiaogang Su, Joseph Kang, Juanjuan Fan, Richard A. Levine, Xin Yan
Abstract: Assessing treatment effects in observational studies is a multifaceted problem that not only involves heterogeneous mechanisms of how the treatment or cause is exposed to subjects, known as propensity, but also differential causal effects across sub-populations. We introduce a concept termed the facilitating score to account for both the confounding and interacting impacts of covariates on the treatment effect. Several approaches for estimating the facilitating score are discussed. In particular, we put forward a machine learning method, called causal inference tree (CIT), to provide a piecewise constant approximation of the facilitating score. With interpretable rules, CIT splits data in such a way that both the propensity and the treatment effect become more homogeneous within each resultant partition. Causal inference at different levels can be made on the basis of CIT. Together with an aggregated grouping procedure, CIT stratifies data into strata where causal effects can be conveniently assessed within each. Besides, a feasible way of predicting individual causal effects (ICE) is made available by aggregating ensemble CIT models. Both the stratified results and the estimated ICE provide an assessment of heterogeneity of causal effects and can be integrated for estimating the average causal effect (ACE). Mean square consistency of CIT is also established. We evaluate the performance of proposed methods with simulations and illustrate their use with the NSW data in Dehejia and Wahba (1999) where the objective is to assess the impact of c 2012 Xiaogang Su, Joseph Kang, Juanjuan Fan, Richard A. Levine and Xin Yan. S U , K ANG , FAN , L EVINE AND YAN a labor training program, the National Supported Work (NSW) demonstration, on post-intervention earnings. Keywords: CART, causal inference, confounding, interaction, observational study, personalized medicine, recursive partitioning
4 0.24213602 61 jmlr-2012-ML-Flex: A Flexible Toolbox for Performing Classification Analyses In Parallel
Author: Stephen R. Piccolo, Lewis J. Frey
Abstract: Motivated by a need to classify high-dimensional, heterogeneous data from the bioinformatics domain, we developed ML-Flex, a machine-learning toolbox that enables users to perform two-class and multi-class classification analyses in a systematic yet flexible manner. ML-Flex was written in Java but is capable of interfacing with third-party packages written in other programming languages. It can handle multiple input-data formats and supports a variety of customizations. MLFlex provides implementations of various validation strategies, which can be executed in parallel across multiple computing cores, processors, and nodes. Additionally, ML-Flex supports aggregating evidence across multiple algorithms and data sets via ensemble learning. This open-source software package is freely available from http://mlflex.sourceforge.net. Keywords: toolbox, classification, parallel, ensemble, reproducible research
5 0.24025968 19 jmlr-2012-An Introduction to Artificial Prediction Markets for Classification
Author: Adrian Barbu, Nathan Lay
Abstract: Prediction markets are used in real life to predict outcomes of interest such as presidential elections. This paper presents a mathematical theory of artificial prediction markets for supervised learning of conditional probability estimators. The artificial prediction market is a novel method for fusing the prediction information of features or trained classifiers, where the fusion result is the contract price on the possible outcomes. The market can be trained online by updating the participants’ budgets using training examples. Inspired by the real prediction markets, the equations that govern the market are derived from simple and reasonable assumptions. Efficient numerical algorithms are presented for solving these equations. The obtained artificial prediction market is shown to be a maximum likelihood estimator. It generalizes linear aggregation, existent in boosting and random forest, as well as logistic regression and some kernel methods. Furthermore, the market mechanism allows the aggregation of specialized classifiers that participate only on specific instances. Experimental comparisons show that the artificial prediction markets often outperform random forest and implicit online learning on synthetic data and real UCI data sets. Moreover, an extensive evaluation for pelvic and abdominal lymph node detection in CT data shows that the prediction market improves adaboost’s detection rate from 79.6% to 81.2% at 3 false positives/volume. Keywords: online learning, ensemble methods, supervised learning, random forest, implicit online learning
6 0.22499689 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
7 0.20355549 18 jmlr-2012-An Improved GLMNET for L1-regularized Logistic Regression
8 0.19584011 114 jmlr-2012-Towards Integrative Causal Analysis of Heterogeneous Data Sets and Studies
9 0.19107257 63 jmlr-2012-Mal-ID: Automatic Malware Detection Using Common Segment Analysis and Meta-Features
10 0.17572232 95 jmlr-2012-Random Search for Hyper-Parameter Optimization
11 0.17079636 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach
12 0.1662928 41 jmlr-2012-Exploration in Relational Domains for Model-based Reinforcement Learning
13 0.16583106 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class
14 0.15128623 16 jmlr-2012-Algorithms for Learning Kernels Based on Centered Alignment
15 0.14273758 17 jmlr-2012-An Active Learning Algorithm for Ranking from Pairwise Preferences with an Almost Optimal Query Complexity
16 0.14087184 37 jmlr-2012-Eliminating Spammers and Ranking Annotators for Crowdsourced Labeling Tasks
17 0.13694417 70 jmlr-2012-Multi-Assignment Clustering for Boolean Data
18 0.13598388 49 jmlr-2012-Hope and Fear for Discriminative Training of Statistical Translation Models
19 0.13554041 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine
20 0.13488492 112 jmlr-2012-Structured Sparsity via Alternating Direction Methods
topicId topicWeight
[(6, 0.406), (7, 0.013), (21, 0.05), (26, 0.041), (27, 0.014), (29, 0.044), (35, 0.013), (49, 0.021), (56, 0.016), (57, 0.032), (69, 0.013), (75, 0.04), (77, 0.017), (79, 0.021), (81, 0.012), (92, 0.077), (96, 0.076)]
simIndex simValue paperId paperTitle
same-paper 1 0.68790692 72 jmlr-2012-Multi-Target Regression with Rule Ensembles
Author: Timo Aho, Bernard Ženko, Sašo Džeroski, Tapio Elomaa
Abstract: Methods for learning decision rules are being successfully applied to many problem domains, in particular when understanding and interpretation of the learned model is necessary. In many real life problems, we would like to predict multiple related (nominal or numeric) target attributes simultaneously. While several methods for learning rules that predict multiple targets at once exist, they are all based on the covering algorithm, which does not work well for regression problems. A better solution for regression is the rule ensemble approach that transcribes an ensemble of decision trees into a large collection of rules. An optimization procedure is then used to select the best (and much smaller) subset of these rules and to determine their respective weights. We introduce the F IRE algorithm for solving multi-target regression problems, which employs the rule ensembles approach. We improve the accuracy of the algorithm by adding simple linear functions to the ensemble. We also extensively evaluate the algorithm with and without linear functions. The results show that the accuracy of multi-target regression rule ensembles is high. They are more accurate than, for instance, multi-target regression trees, but not quite as accurate as multi-target random forests. The rule ensembles are significantly more concise than random forests, and it is also possible to create compact rule sets that are smaller than a single regression tree but still comparable in accuracy. Keywords: multi-target prediction, rule learning, rule ensembles, regression ∗. Also in Microtask, Tampere, Finland. †. Also in the Centre of Excellence for Integrated Approaches in Chemistry and Biology of Proteins, Ljubljana, Slovenia. ‡. Also in the Centre of Excellence for Integrated Approaches in Chemistry and Biology of Proteins, Ljubljana, Slovenia and the Joˇ ef Stefan International Postgraduate School, Ljubljana, Slovenia. z ˇ c 2012 Timo Aho, Bernard Zenko, Saˇo Dˇ eroski and Tapio Elomaa. s z ˇ ˇ
2 0.62755156 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach
Author: Grigorios Skolidis, Guido Sanguinetti
Abstract: We propose a novel model for meta-generalisation, that is, performing prediction on novel tasks based on information from multiple different but related tasks. The model is based on two coupled Gaussian processes with structured covariance function; one model performs predictions by learning a constrained covariance function encapsulating the relations between the various training tasks, while the second model determines the similarity of new tasks to previously seen tasks. We demonstrate empirically on several real and synthetic data sets both the strengths of the approach and its limitations due to the distributional assumptions underpinning it. Keywords: transfer learning, meta-generalising, multi-task learning, Gaussian processes, mixture of experts
3 0.31542778 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
Author: Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, Lin Xiao
Abstract: Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed environment. We show how our method can be used to solve the closely-related distributed stochastic optimization problem, achieving an asymptotically linear speed-up over multiple processors. Finally, we demonstrate the merits of our approach on a web-scale online prediction problem. Keywords: distributed computing, online learning, stochastic optimization, regret bounds, convex optimization
4 0.31485885 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
Author: Matus Telgarsky
Abstract: Boosting combines weak learners into a predictor with low empirical risk. Its dual constructs a high entropy distribution upon which weak learners and training labels are uncorrelated. This manuscript studies this primal-dual relationship under a broad family of losses, including the exponential loss of AdaBoost and the logistic loss, revealing: • Weak learnability aids the whole loss family: for any ε > 0, O (ln(1/ε)) iterations suffice to produce a predictor with empirical risk ε-close to the infimum; • The circumstances granting the existence of an empirical risk minimizer may be characterized in terms of the primal and dual problems, yielding a new proof of the known rate O (ln(1/ε)); • Arbitrary instances may be decomposed into the above two, granting rate O (1/ε), with a matching lower bound provided for the logistic loss. Keywords: boosting, convex analysis, weak learnability, coordinate descent, maximum entropy
5 0.31362584 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality
Author: Lan Xue, Annie Qu
Abstract: The varying-coefficient model is flexible and powerful for modeling the dynamic changes of regression coefficients. It is important to identify significant covariates associated with response variables, especially for high-dimensional settings where the number of covariates can be larger than the sample size. We consider model selection in the high-dimensional setting and adopt difference convex programming to approximate the L0 penalty, and we investigate the global optimality properties of the varying-coefficient estimator. The challenge of the variable selection problem here is that the dimension of the nonparametric form for the varying-coefficient modeling could be infinite, in addition to dealing with the high-dimensional linear covariates. We show that the proposed varying-coefficient estimator is consistent, enjoys the oracle property and achieves an optimal convergence rate for the non-zero nonparametric components for high-dimensional data. Our simulations and numerical examples indicate that the difference convex algorithm is efficient using the coordinate decent algorithm, and is able to select the true model at a higher frequency than the least absolute shrinkage and selection operator (LASSO), the adaptive LASSO and the smoothly clipped absolute deviation (SCAD) approaches. Keywords: coordinate decent algorithm, difference convex programming, L0 - regularization, large-p small-n, model selection, nonparametric function, oracle property, truncated L1 penalty
6 0.31232578 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
7 0.30982032 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
8 0.3077108 70 jmlr-2012-Multi-Assignment Clustering for Boolean Data
9 0.30737114 73 jmlr-2012-Multi-task Regression using Minimal Penalties
10 0.30722043 80 jmlr-2012-On Ranking and Generalization Bounds
11 0.30671942 42 jmlr-2012-Facilitating Score and Causal Inference Trees for Large Observational Studies
12 0.30585849 103 jmlr-2012-Sampling Methods for the Nyström Method
13 0.30493394 7 jmlr-2012-A Multi-Stage Framework for Dantzig Selector and LASSO
14 0.30333483 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
15 0.3029601 111 jmlr-2012-Structured Sparsity and Generalization
16 0.30276769 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class
17 0.30274922 82 jmlr-2012-On the Necessity of Irrelevant Variables
18 0.30176151 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
19 0.2999562 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
20 0.29963195 13 jmlr-2012-Active Learning via Perfect Selective Classification