jmlr jmlr2013 jmlr2013-95 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Stéphan Clémençon, Marine Depecker, Nicolas Vayatis
Abstract: The present paper examines how the aggregation and feature randomization principles underlying the algorithm RANDOM FOREST (Breiman, 2001) can be adapted to bipartite ranking. The approach taken here is based on nonparametric scoring and ROC curve optimization in the sense of the AUC criterion. In this problem, aggregation is used to increase the performance of scoring rules produced by ranking trees, as those developed in Clémençon and Vayatis (2009c). The present work describes the principles for building median scoring rules based on concepts from rank aggregation. Consistency results are derived for these aggregated scoring rules and an algorithm called RANKING FOREST is presented. Furthermore, various strategies for feature randomization are explored through a series of numerical experiments on artificial data sets. Keywords: bipartite ranking, nonparametric scoring, classification data, ROC optimization, AUC criterion, tree-based ranking rules, bootstrap, bagging, rank aggregation, median ranking, feature randomization
Reference: text
sentIndex sentText sentNum sentScore
1 Journal of Machine Learning Research 14 (2013) 39-73 Submitted 12/10; Revised 2/12; Published 1/13 Ranking Forests Stéphan Clémençon Marine Depecker STEPHAN . [sent-1, score-0.303]
2 8536 ENS Cachan, 61 avenue du Président Wilson, Cachan, 94230, France Editor: Tong Zhang Abstract The present paper examines how the aggregation and feature randomization principles underlying the algorithm RANDOM FOREST (Breiman, 2001) can be adapted to bipartite ranking. [sent-9, score-0.355]
3 The approach taken here is based on nonparametric scoring and ROC curve optimization in the sense of the AUC criterion. [sent-10, score-0.708]
4 In this problem, aggregation is used to increase the performance of scoring rules produced by ranking trees, as those developed in Clémençon and Vayatis (2009c). [sent-11, score-1.394]
5 The present work describes the principles for building median scoring rules based on concepts from rank aggregation. [sent-12, score-0.827]
6 Consistency results are derived for these aggregated scoring rules and an algorithm called RANKING FOREST is presented. [sent-13, score-0.757]
7 Keywords: bipartite ranking, nonparametric scoring, classification data, ROC optimization, AUC criterion, tree-based ranking rules, bootstrap, bagging, rank aggregation, median ranking, feature randomization 1. [sent-15, score-0.472]
8 Introduction Aggregating decision rules or function estimators has now become a folk concept in machine learning and nonparametric statistics. [sent-16, score-0.179]
9 Indeed, the idea of combining decision rules with an additional randomization ingredient brings a dramatic improvement in performance in various contexts. [sent-17, score-0.208]
10 In the present paper, we propose to go one step further in the program of boosting performance by aggregation and randomization for this problem. [sent-20, score-0.272]
11 This case is also known as the bipartite ranking problem, see Freund et al. [sent-22, score-0.306]
12 The setup of bipartite ranking is useful when considering real-life applications such as credit-risk or medical screening, spam filtering, or recommender systems. [sent-26, score-0.332]
13 There are two major approaches to bipartite ranking: the preference-based approach (see Cohen et al. [sent-27, score-0.1]
14 The idea of combining ranking [sent-31, score-0.244]
15 rules to learn preferences was introduced in Freund et al. [sent-32, score-0.16]
16 (2003) with a boosting algorithm, and the consistency of this type of method was proved in Clémençon et al. [sent-33, score-0.282]
17 (2008) by reducing the bipartite ranking problem to a classification problem over pairs of observations (see also Agarwal et al. [sent-34, score-0.206]
18 Here, we will cast bipartite ranking in the context of nonparametric scoring and we will consider the issue of combining randomized scoring rules. [sent-36, score-1.534]
19 Scoring rules are real-valued functions mapping the observation space to the real line, thus conveying an order relation between high-dimensional observation vectors. [sent-37, score-0.162]
20 Nonparametric scoring has received increasing attention in the machine learning literature as part of the growing interest in ROC analysis. [sent-38, score-0.588]
21 The scoring problem can be seen as a learning problem where one observes input observation vectors X in a high-dimensional space X and receives only binary feedback through an output variable Y ∈ {−1, +1}. [sent-39, score-0.621]
22 Whereas classification only focuses on predicting the label Y of a new observation X, scoring algorithms aim at recovering an order relation on X in order to predict the ordering over a new sample of observation vectors X′1 , . . . [sent-40, score-0.606]
23 From a statistical perspective, the scoring problem is more difficult than classification but easier than regression. [sent-44, score-0.588]
24 In previous work, we developed a tree-based procedure for nonparametric scoring called TREERANK; see Clémençon and Vayatis (2009c), Clémençon et al. [sent-46, score-1.153]
25 The TREERANK algorithm and its variants produce scoring rules expressed as partitions of the input space coupled with a permutation over the cells of the partition. [sent-48, score-0.836]
26 These scoring rules have the appealing property that they can be stored in an oriented binary tree structure, called a ranking tree. [sent-49, score-0.955]
27 Moreover, their very construction actually implements the optimization of the ROC curve, which is precisely the quality measure of the scoring rule that matters to the end-user. [sent-50, score-0.747]
28 The use of resampling in this context was first considered in Clémençon et al. [sent-51, score-0.265]
29 A more thorough analysis is developed throughout this paper, and we show how to combine feature randomization and bootstrap aggregation techniques based on the ranking trees produced by the TREERANK algorithm in order to increase ranking performance in the sense of the ROC curve. [sent-53, score-0.686]
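To fix ideas, here is a highly simplified schematic, in Python, of the bagging-plus-feature-randomization loop just described. It is not the RANKING FOREST algorithm of Section 5: the ranking-tree learner and the rank-aggregation step are left as abstract placeholders (`grow_ranking_tree`, `aggregate_scoring_rules` are our own hypothetical names), so this only sketches where resampling and feature randomization enter.

```python
import numpy as np

def ranking_forest_schematic(X, y, grow_ranking_tree, aggregate_scoring_rules,
                             n_trees=50, n_features=None, seed=0):
    """Schematic only: bootstrap resampling plus feature randomization around an
    abstract ranking-tree learner; the two callables are placeholders, not the
    paper's API."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    n_features = n_features or d
    scoring_rules = []
    for _ in range(n_trees):
        rows = rng.integers(0, n, size=n)                      # bootstrap sample
        cols = rng.choice(d, size=n_features, replace=False)   # random feature subset
        tree = grow_ranking_tree(X[np.ix_(rows, cols)], y[rows])
        scoring_rules.append((cols, tree))
    # the forest's scoring rule is obtained by aggregating the B tree-based rules
    return aggregate_scoring_rules(scoring_rules)
```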
30 In the classification setup, theoretical evidence has been recently provided for the aggregation of randomized classifiers in the spirit of random forests (see Biau et al. [sent-54, score-0.237]
31 However, in the context of ROC optimization, combining scoring rules through naive aggregation does not necessarily make sense. [sent-56, score-0.923]
32 Our approach builds on advances in the rank aggregation problem. [sent-57, score-0.238]
33 Rank aggregation was originally introduced in social choice theory (see Barthélémy and Montjardet 1981 and the references therein) and recently “rediscovered” in the context of internet applications (see Pennock et al. [sent-58, score-0.243]
34 For our needs, we shall focus on metric-based consensus methods (see Hudry 2004 or Fagin et al. [sent-60, score-0.107]
35 2006, and the references therein), which provide the key to the aggregation of ranking trees. [sent-61, score-0.397]
36 In the paper, we also discuss several aspects of feature randomization, which can be incorporated at various levels in ranking trees. [sent-62, score-0.27]
37 A novel ranking methodology, called RANKING FOREST, is also introduced. [sent-63, score-0.206]
38 Section 2 sets out the notation and briefly describes the main notions of the bipartite ranking problem. [sent-65, score-0.306]
39 Section 3 describes the elements from the theory of rank aggregation and measures of consensus leading to the aggregation of scoring rules defined over finite partitions of the input space. [sent-66, score-1.276]
40 The next section presents the main theoretical results of the paper, which are consistency results for scoring rules based on the aggregation of randomized piecewise constant scoring rules. [sent-67, score-1.624]
41 Section 5 presents RANKING FOREST, a new algorithm for nonparametric scoring which implements the theoretical concepts developed so far. [sent-68, score-0.667]
42 Probabilistic Setup for Bipartite Ranking ROC analysis is a popular way of evaluating the capacity of a given scoring rule to discriminate between two populations, see Egan (1975). [sent-73, score-0.683]
43 ROC curves and related performance measures such as the AUC have now become standard tools for assessing the quality of ranking methods in a bipartite framework. [sent-74, score-0.324]
44 Throughout this section, we recall basic concepts related to bipartite ranking from the angle of ROC analysis. [sent-75, score-0.334]
45 An informal way of considering the ranking task under this model is as follows. [sent-85, score-0.222]
46 A natural way of defining a total order on the multidimensional space X is to map it to the real line, equipped with its natural order, by means of a scoring rule, that is, a measurable mapping s : X → R. [sent-90, score-0.588]
47 The capacity of a candidate s to discriminate between the positive and negative populations is generally evaluated by means of its ROC curve (standing for “Receiver Operating Characteristic” curve), a widely used functional performance measure which we recall here. [sent-93, score-0.149]
48 Definition 1 (TRUE ROC CURVE) Let s be a scoring rule. [sent-94, score-0.588]
49 The true ROC curve of s is the “probability-probability” plot given by: t ∈ R → (P {s(X) > t | Y = −1} , P {s(X) > t | Y = 1}) ∈ [0, 1]^2 . [sent-95, score-0.085]
50 By convention, when a jump occurs in the plot of the ROC curve, the corresponding extremities of the curve are connected by a line segment, so that the ROC curve of s can be viewed as the graph of a continuous mapping α ∈ [0, 1] → ROC(s, α). [sent-96, score-0.186]
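As a concrete illustration of Definition 1, the sketch below computes an empirical ROC curve of a scoring rule on a labeled sample; the helper is our own illustrative choice (NumPy assumed), not code from the paper.

```python
import numpy as np

def empirical_roc(scores, labels):
    """Empirical ROC curve of a scoring rule s: for each threshold t, the point
    (P{s(X) > t | Y=-1}, P{s(X) > t | Y=+1}); jumps are joined by segments when
    plotted, as in the convention above."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    neg, pos = labels == -1, labels == +1
    # thresholds from above the largest score (curve starts at (0,0)) down to -inf
    thresholds = np.concatenate(([np.inf], np.sort(np.unique(scores))[::-1], [-np.inf]))
    fpr = np.array([(scores[neg] > t).mean() for t in thresholds])
    tpr = np.array([(scores[pos] > t).mean() for t in thresholds])
    return fpr, tpr

# toy usage: an informative but imperfect scoring rule
rng = np.random.default_rng(0)
y = rng.choice([-1, +1], size=200)
s = 0.8 * y + rng.normal(size=200)
fpr, tpr = empirical_roc(s, y)
```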
51 We refer to Clémençon and Vayatis (2009c) for a list of properties of ROC curves (see the Appendix section therein). [sent-97, score-0.265]
52 The ROC curve offers a visual tool for assessing ranking performance (see Figure 1): the closer the curve ROC(s, ·) is to the upper left corner of the unit square [0, 1]^2, the better the scoring rule s. [sent-98, score-0.394]
53 Therefore, the ROC curve conveys a partial order on the set of all [sent-100, score-0.123]
54 A preorder is a binary relation which is reflexive and transitive. [sent-101, score-0.092]
55 scoring rules: for all pairs of scoring rules s1 and s2 , we say that s2 is more accurate than s1 when ROC(s1 , α) ≤ ROC(s2 , α) for all α ∈ [0, 1]. [sent-103, score-1.32]
56 By a standard Neyman-Pearson argument, one may establish that the most accurate scoring rules are increasing transforms of the regression function which is equal to the conditional probability function η up to an affine transformation. [sent-104, score-0.753]
57 Definition 2 (OPTIMAL SCORING RULES) We call optimal scoring rules the elements of the set S ∗ of scoring functions s∗ such that ∀(x, x′ ) ∈ X ^2 , η(x) < η(x′ ) ⇒ s∗ (x) < s∗ (x′ ). [sent-105, score-1.32]
58 The fact that the elements of S ∗ are optimizers of the ROC curve is shown in Clémençon and Vayatis (2009c) (see Proposition 4 therein). [sent-106, score-0.35]
59 The performance of a candidate scoring rule s is often summarized by a scalar quantity called the Area Under the ROC Curve (AUC), which condenses the ROC curve into a single number. [sent-108, score-0.646]
60 The AUC is the functional defined, for any scoring function s, as: AUC(s) = P{s(X1 ) < s(X2 ) | (Y1 , Y2 ) = (−1, +1)} + (1/2) P{s(X1 ) = s(X2 ) | (Y1 , Y2 ) = (−1, +1)}, where (X1 , Y1 ) and (X2 , Y2 ) denote two independent copies of the pair (X, Y ). [sent-111, score-0.588]
61 This functional provides a total order on the set of scoring rules and, equipped with the convention introduced in Definition 1, AUC(s) coincides with ∫₀¹ ROC(s, α) dα (see, for instance, Proposition 1 in Clémençon et al. [sent-112, score-1.048]
62 We shall denote the optimal curve and the corresponding (maximum) value for the AUC criterion by ROC∗ = ROC(s∗ , ·) and AUC∗ = AUC(s∗ ), for s∗ ∈ S ∗ . [sent-114, score-0.127]
63 In the paper, we will focus on a particular subclass of scoring rules. [sent-121, score-0.588]
64 Definition 4 (PIECEWISE CONSTANT SCORING RULE) A scoring rule s is piecewise constant if there exists a finite partition P of X such that for all C ∈ P , there exists a constant kC ∈ R such that ∀x ∈ C , s(x) = kC . [sent-122, score-0.769]
65 The scoring rule conveys an ordering on the cells of the minimal partition. [sent-125, score-0.738]
66 Definition 5 (RANK OF A CELL) Let s be a scoring rule and P the associated minimal partition. [sent-126, score-0.646]
67 The scoring rule induces a ranking s over the cells of the partition. [sent-127, score-0.906]
68 For a given cell C ∈ P , we define its rank Rs (C ) ∈ {1, . . . [sent-128, score-0.068]
69 , |P |} as the rank assigned to C by the ranking that s induces over the elements of the partition. [sent-131, score-0.253]
70 The advantage of the class of piecewise constant scoring rules is that they provide finite rankings on the elements of X and they are the key to applying the aggregation procedure. [sent-133, score-1.14]
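Definitions 4-5 can be made concrete by storing a piecewise constant scoring rule as a cell-assignment function together with one constant per cell; the cells' ranking is then just the ordering of these constants. The representation and the rank convention (rank 1 for the highest-scored cell) below are our own illustrative choices.

```python
import numpy as np

class PiecewiseConstantScore:
    """A scoring rule s(x) = k_C for the cell C containing x."""

    def __init__(self, cell_of, cell_constants):
        self.cell_of = cell_of                       # maps x -> cell index in {0, ..., |P|-1}
        self.k = np.asarray(cell_constants, float)   # one constant k_C per cell

    def score(self, x):
        return self.k[self.cell_of(x)]

    def cell_ranks(self):
        """Rank of each cell: 1 for the highest-scored cell (illustrative convention)."""
        order = np.argsort(-self.k)
        ranks = np.empty_like(order)
        ranks[order] = np.arange(1, len(self.k) + 1)
        return ranks

# toy usage on X = [0, 1): three cells given by thirds of the unit interval
s = PiecewiseConstantScore(cell_of=lambda x: int(3 * x), cell_constants=[0.2, 0.9, 0.5])
print(s.score(0.5), s.cell_ranks())   # 0.9 and [3 1 2]
```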
71 Aggregation of Scoring Rules In recent years, the issue of summarizing or aggregating various rankings has been a topic of growing interest in the machine-learning community. [sent-135, score-0.145]
72 Such problems have led to a variety of results, ranging from the generalization of the mathematical concepts introduced in social choice theory (see Barthélémy and Montjardet 1981 and the references therein) for defining relevant notions of consensus between rankings (Fagin et al. [sent-142, score-0.23]
73 , 2007) through the study of probabilistic models over sets of rankings (Fligner and Verducci , Eds. [sent-145, score-0.144]
74 Here we consider rank aggregation methods with a view to extending the bagging approach to ranking trees. [sent-147, score-0.488]
75 3.1 The Case of Piecewise Constant Scoring Rules The ranking rules considered in this paper result from the aggregation of a collection of piecewise constant scoring rules. [sent-149, score-1.262]
76 Since each of these scoring rules is related to a possibly different partition, we are led to consider a collection of partitions of X . [sent-150, score-0.819]
77 Hence, the aggregated rule needs to be defined on the least fine subpartition of this collection of partitions. [sent-151, score-0.215]
78 Definition 6 (SUBPARTITION OF A COLLECTION OF PARTITIONS) Consider a collection of B partitions of X denoted by Pb , b = 1, . . . , B. [sent-152, score-0.087]
79 A subpartition of this collection is a partition PB made of nonempty subsets C ⊂ X which satisfy the following constraint: for all C ∈ PB , there exists (C1 , . [sent-156, score-0.159]
80 One may easily see that PB∗ is a subpartition of any of the Pb ’s, and the largest one in the sense that any partition P which is a subpartition of Pb for all b ∈ {1, . . . [sent-161, score-0.217]
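The largest subpartition PB∗ of Definition 6 can be built by intersecting cells: two points fall in the same cell of PB∗ exactly when they share the same cell in every Pb. A small sketch on a finite sample, with each partition represented by a cell-labelling function of our own choosing:

```python
def common_subpartition(points, partitions):
    """Largest common subpartition of a collection of partitions of a finite sample.

    `partitions` is a list of functions, each mapping a point to the label of its
    cell in the corresponding partition P_b.  Two points land in the same cell of
    P*_B iff they share the same cell in every P_b."""
    cells = {}
    for x in points:
        signature = tuple(p(x) for p in partitions)   # cell of x in each P_b
        cells.setdefault(signature, []).append(x)
    return list(cells.values())

# toy usage on X = [0, 1): halves crossed with thirds
pts = [i / 12 for i in range(12)]
halves = lambda x: int(2 * x)
thirds = lambda x: int(3 * x)
print(len(common_subpartition(pts, [halves, thirds])))   # 4 non-empty intersection cells
```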
81 The case where the partitions are obtained from a binary tree structure is of particular interest as we shall consider tree-based piecewise constant scoring rules later on. [sent-165, score-0.937]
82 Now consider a collection of piecewise constant scoring rules sb , b = 1, . . . , B. [sent-168, score-0.888]
83 Each scoring rule sb naturally induces a ranking (or a preorder) on the partition PB∗ . [sent-172, score-0.902]
84 The collection of scoring rules leads to a collection of B rankings on PB∗ . [sent-174, score-0.927]
85 Whereas the mean, or the median, naturally provides such a summary when considering scalar data, various meanings can be given to this notion for rankings (see Appendix B). [sent-177, score-0.137]
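One classical meaning of "median" for rankings is metric-based consensus: among candidate rankings, pick the one minimizing the total Kendall distance to the B input rankings. The brute-force sketch below (our own, only feasible for very small partitions) illustrates the idea; the paper's own consensus measures (its Appendix B) may differ in detail.

```python
from itertools import permutations

def kendall_distance(r1, r2):
    """Number of discordant pairs between two rankings of the same K cells,
    each given as a tuple of ranks (rank of cell 0, rank of cell 1, ...)."""
    K = len(r1)
    return sum(
        1
        for i in range(K)
        for j in range(i + 1, K)
        if (r1[i] - r1[j]) * (r2[i] - r2[j]) < 0
    )

def median_ranking(rankings):
    """Brute-force metric median: the permutation of ranks minimizing the total
    Kendall distance to the input rankings (exponential in K, illustration only)."""
    K = len(rankings[0])
    candidates = (tuple(p) for p in permutations(range(1, K + 1)))
    return min(candidates, key=lambda c: sum(kendall_distance(c, r) for r in rankings))

# three rankings of K = 3 cells
print(median_ranking([(1, 2, 3), (1, 3, 2), (2, 1, 3)]))   # (1, 2, 3)
```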
86 3.2 Probabilistic Measures of Scoring Agreement The purpose of this subsection is to extend the concept of measures of agreement for rankings to scoring rules defined over a general space X which is not necessarily finite. [sent-179, score-0.876]
87 In practice, however, we will only consider the case of piecewise constant scoring rules and we shall rely on the definition of the probabilistic Kendall tau. [sent-180, score-0.893]
88 We already introduced a notation for the preorder relation over the cells of a partition P induced by a piecewise constant scoring rule s. [sent-182, score-0.914]
89 We shall use a ‘curly’ symbol for the corresponding preorder relation on X , described through the following condition: for all C , C ′ ∈ P , every x ∈ C precedes every x′ ∈ C ′ for the relation on X if and only if C precedes C ′ for the relation on the cells. [sent-183, score-0.117]
90 We now introduce a measure of similarity for preorders on X induced by scoring rules s1 and s2 . [sent-185, score-0.786]
91 For any real-valued scoring rule s, we have: (1/2)(1 − τ(s(X), Y )) = 2p(1 − p)(1 − AUC(s)) + (1/2) P{s(X) ≠ s(X ′ ), Y = Y ′ } . [sent-192, score-0.646]
92 For given scoring rules s1 and s2 and considering the probabilistic Kendall tau for random variables s1 (X) and s2 (X), we can set: dX (s1 , s2 ) = dτ (s1 (X), s2 (X)). [sent-193, score-0.804]
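Assuming the standard tie convention in which a tie in exactly one of the two variables counts as half a discordance (the paper's exact convention may differ slightly), dX(s1, s2) can be estimated over all pairs of sample points as in the sketch below (helper names are ours).

```python
import numpy as np

def kendall_tau_distance(u, v):
    """Empirical probabilistic Kendall tau distance d_tau between two real-valued
    variables observed on the same sample:
        P{(U-U')(V-V') < 0} + 1/2 P{U=U', V!=V'} + 1/2 P{U!=U', V=V'},
    estimated over all unordered pairs of distinct sample points."""
    u = np.asarray(u, float)
    v = np.asarray(v, float)
    iu, iv = np.triu_indices(len(u), k=1)
    du, dv = u[iu] - u[iv], v[iu] - v[iv]
    discordant = (du * dv < 0).mean()
    half_ties = 0.5 * (((du == 0) & (dv != 0)) | ((du != 0) & (dv == 0))).mean()
    return discordant + half_ties

def d_X(s1, s2, X):
    """d_X(s1, s2) = d_tau(s1(X), s2(X)) on a sample X."""
    return kendall_tau_distance([s1(x) for x in X], [s2(x) for x in X])
```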
93 The following proposition shows that the deviation between scoring rules in terms of AUC is controlled by a quantity involving the probabilistic agreement based on Kendall tau. [sent-195, score-0.8]
94 For any scoring rules s1 and s2 on X , we have: |AUC(s1 ) − AUC(s2 )| ≤ dX (s1 , s2 ) / (2p(1 − p)) = (1 − τX (s1 , s2 )) / (4p(1 − p)) . [sent-197, score-0.732]
95 Indeed, scoring rules with the same AUC may yield different rankings. [sent-199, score-0.732]
96 However, the following result guarantees that a scoring rule with a nearly optimal AUC is close to the optimal scoring rules in the sense of Kendall tau, under the additional assumption that the noise condition introduced in Clémençon et al. [sent-200, score-1.643]
97 (1) Then, we have, for any scoring rule s and any optimal scoring rule s∗ ∈ S ∗ : 1 − τX (s∗ , s) ≤ C · (AUC∗ − AUC(s))^{a/(1+a)} , with C = 3 · c^{1/(1+a)} · (2p(1 − p))^{a/(1+a)} . [sent-203, score-1.292]
98 Indeed, it is fulfilled for any a ∈ (0, 1) as soon as the probability density function of η(X) is bounded (see Corollary 8 in Clémençon et al. [sent-205, score-0.265]
99 The next result shows the connection between the Kendall tau distance between preorders on X induced by piecewise constant scoring rules s1 and s2 and a specific notion of distance between the rankings they induce on P . [sent-207, score-1.052]
100 Lemma 12 Let s1 , s2 be two piecewise constant scoring rules. [sent-208, score-0.684]
wordName wordTfidf (topN-words)
[('scoring', 0.588), ('roc', 0.356), ('mencon', 0.265), ('auc', 0.258), ('ranking', 0.206), ('aggregation', 0.191), ('vayatis', 0.162), ('cl', 0.161), ('pb', 0.161), ('rules', 0.144), ('rankings', 0.121), ('kendall', 0.108), ('bipartite', 0.1), ('anking', 0.097), ('piecewise', 0.096), ('orest', 0.095), ('subpartition', 0.095), ('telecom', 0.095), ('curve', 0.085), ('consensus', 0.065), ('randomization', 0.064), ('ank', 0.058), ('rule', 0.058), ('depecker', 0.057), ('endall', 0.057), ('epecker', 0.057), ('marine', 0.057), ('menc', 0.057), ('orests', 0.057), ('paristech', 0.057), ('preorder', 0.057), ('kc', 0.054), ('cells', 0.054), ('partitions', 0.05), ('tau', 0.049), ('fagin', 0.049), ('cachan', 0.049), ('rank', 0.047), ('cb', 0.045), ('ens', 0.044), ('shall', 0.042), ('barth', 0.038), ('cmla', 0.038), ('conveys', 0.038), ('montjardet', 0.038), ('pennock', 0.038), ('phan', 0.038), ('preorders', 0.038), ('ree', 0.038), ('collection', 0.037), ('nonparametric', 0.035), ('nicolas', 0.034), ('therein', 0.032), ('forests', 0.029), ('convention', 0.028), ('concepts', 0.028), ('meila', 0.027), ('bagging', 0.027), ('populations', 0.027), ('umr', 0.027), ('partition', 0.027), ('setup', 0.026), ('aggregated', 0.025), ('aggregating', 0.024), ('probabilistic', 0.023), ('agreement', 0.023), ('coincides', 0.023), ('dx', 0.023), ('breiman', 0.023), ('sb', 0.023), ('proposition', 0.022), ('discriminate', 0.021), ('fr', 0.021), ('cell', 0.021), ('transforms', 0.021), ('median', 0.02), ('rue', 0.02), ('agarwal', 0.02), ('internet', 0.02), ('bootstrap', 0.019), ('relation', 0.018), ('assessing', 0.018), ('freund', 0.018), ('boosting', 0.017), ('randomized', 0.017), ('perspective', 0.017), ('binary', 0.017), ('implements', 0.016), ('capacity', 0.016), ('feedback', 0.016), ('social', 0.016), ('induced', 0.016), ('informal', 0.016), ('preferences', 0.016), ('clemencon', 0.016), ('exive', 0.016), ('meanings', 0.016), ('extremities', 0.016), ('referee', 0.016), ('france', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 95 jmlr-2013-Ranking Forests
Author: Stéphan Clémençon, Marine Depecker, Nicolas Vayatis
Abstract: The present paper examines how the aggregation and feature randomization principles underlying the algorithm RANDOM FOREST (Breiman, 2001) can be adapted to bipartite ranking. The approach taken here is based on nonparametric scoring and ROC curve optimization in the sense of the AUC criterion. In this problem, aggregation is used to increase the performance of scoring rules produced by ranking trees, as those developed in Clémençon and Vayatis (2009c). The present work describes the principles for building median scoring rules based on concepts from rank aggregation. Consistency results are derived for these aggregated scoring rules and an algorithm called RANKING FOREST is presented. Furthermore, various strategies for feature randomization are explored through a series of numerical experiments on artificial data sets. Keywords: bipartite ranking, nonparametric scoring, classification data, ROC optimization, AUC criterion, tree-based ranking rules, bootstrap, bagging, rank aggregation, median ranking, feature randomization
2 0.051899377 44 jmlr-2013-Finding Optimal Bayesian Networks Using Precedence Constraints
Author: Pekka Parviainen, Mikko Koivisto
Abstract: We consider the problem of finding a directed acyclic graph (DAG) that optimizes a decomposable Bayesian network score. While in a favorable case an optimal DAG can be found in polynomial time, in the worst case the fastest known algorithms rely on dynamic programming across the node subsets, taking time and space 2n , to within a factor polynomial in the number of nodes n. In practice, these algorithms are feasible to networks of at most around 30 nodes, mainly due to the large space requirement. Here, we generalize the dynamic programming approach to enhance its feasibility in three dimensions: first, the user may trade space against time; second, the proposed algorithms easily and efficiently parallelize onto thousands of processors; third, the algorithms can exploit any prior knowledge about the precedence relation on the nodes. Underlying all these results is the key observation that, given a partial order P on the nodes, an optimal DAG compatible with P can be found in time and space roughly proportional to the number of ideals of P , which can be significantly less than 2n . Considering sufficiently many carefully chosen partial orders guarantees that a globally optimal DAG will be found. Aside from the generic scheme, we present and analyze concrete tradeoff schemes based on parallel bucket orders. Keywords: exact algorithm, parallelization, partial order, space-time tradeoff, structure learning
3 0.048736587 11 jmlr-2013-Algorithms for Discovery of Multiple Markov Boundaries
Author: Alexander Statnikov, Nikita I. Lytkin, Jan Lemeire, Constantin F. Aliferis
Abstract: Algorithms for Markov boundary discovery from data constitute an important recent development in machine learning, primarily because they offer a principled solution to the variable/feature selection problem and give insight on local causal structure. Over the last decade many sound algorithms have been proposed to identify a single Markov boundary of the response variable. Even though faithful distributions and, more broadly, distributions that satisfy the intersection property always have a single Markov boundary, other distributions/data sets may have multiple Markov boundaries of the response variable. The latter distributions/data sets are common in practical data-analytic applications, and there are several reasons why it is important to induce multiple Markov boundaries from such data. However, there are currently no sound and efficient algorithms that can accomplish this task. This paper describes a family of algorithms TIE* that can discover all Markov boundaries in a distribution. The broad applicability as well as efficiency of the new algorithmic family is demonstrated in an extensive benchmarking study that involved comparison with 26 state-of-the-art algorithms/variants in 15 data sets from a diversity of application domains. Keywords: Markov boundary discovery, variable/feature selection, information equivalence, violations of faithfulness
4 0.042558178 61 jmlr-2013-Learning Theory Analysis for Association Rules and Sequential Event Prediction
Author: Cynthia Rudin, Benjamin Letham, David Madigan
Abstract: We present a theoretical analysis for prediction algorithms based on association rules. As part of this analysis, we introduce a problem for which rules are particularly natural, called “sequential event prediction.” In sequential event prediction, events in a sequence are revealed one by one, and the goal is to determine which event will next be revealed. The training set is a collection of past sequences of events. An example application is to predict which item will next be placed into a customer’s online shopping cart, given his/her past purchases. In the context of this problem, algorithms based on association rules have distinct advantages over classical statistical and machine learning methods: they look at correlations based on subsets of co-occurring past events (items a and b imply item c), they can be applied to the sequential event prediction problem in a natural way, they can potentially handle the “cold start” problem where the training set is small, and they yield interpretable predictions. In this work, we present two algorithms that incorporate association rules. These algorithms can be used both for sequential event prediction and for supervised classification, and they are simple enough that they can possibly be understood by users, customers, patients, managers, etc. We provide generalization guarantees on these algorithms based on algorithmic stability analysis from statistical learning theory. We include a discussion of the strict minimum support threshold often used in association rule mining, and introduce an “adjusted confidence” measure that provides a weaker minimum support condition that has advantages over the strict minimum support. The paper brings together ideas from statistical learning theory, association rule mining and Bayesian analysis. Keywords: statistical learning theory, algorithmic stability, association rules, sequence prediction, associative classification c 2013 Cynthia Rudin, Benjamin Letham and David Madigan. RUDIN , L E
5 0.038048323 96 jmlr-2013-Regularization-Free Principal Curve Estimation
Author: Samuel Gerber, Ross Whitaker
Abstract: Principal curves and manifolds provide a framework to formulate manifold learning within a statistical context. Principal curves define the notion of a curve passing through the middle of a distribution. While the intuition is clear, the formal definition leads to some technical and practical difficulties. In particular, principal curves are saddle points of the mean-squared projection distance, which poses severe challenges for estimation and model selection. This paper demonstrates that the difficulties in model selection associated with the saddle point property of principal curves are intrinsically tied to the minimization of the mean-squared projection distance. We introduce a new objective function, facilitated through a modification of the principal curve estimation approach, for which all critical points are principal curves and minima. Thus, the new formulation removes the fundamental issue for model selection in principal curve estimation. A gradient-descent-based estimator demonstrates the effectiveness of the new formulation for controlling model complexity on numerical experiments with synthetic and real data. Keywords: principal curve, manifold estimation, unsupervised learning, model complexity, model selection
6 0.036322482 42 jmlr-2013-Fast Generalized Subset Scan for Anomalous Pattern Detection
7 0.035329118 84 jmlr-2013-PC Algorithm for Nonparanormal Graphical Models
8 0.035219815 83 jmlr-2013-Orange: Data Mining Toolbox in Python
9 0.034244105 89 jmlr-2013-QuantMiner for Mining Quantitative Association Rules
10 0.032035973 20 jmlr-2013-CODA: High Dimensional Copula Discriminant Analysis
11 0.031645108 110 jmlr-2013-Sub-Local Constraint-Based Learning of Bayesian Networks Using A Joint Dependence Criterion
12 0.030822353 8 jmlr-2013-A Theory of Multiclass Boosting
13 0.027956167 106 jmlr-2013-Stationary-Sparse Causality Network Learning
14 0.023345841 30 jmlr-2013-Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising
15 0.021304911 104 jmlr-2013-Sparse Single-Index Model
16 0.020709828 94 jmlr-2013-Ranked Bandits in Metric Spaces: Learning Diverse Rankings over Large Document Collections
17 0.019689282 93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs
18 0.019635001 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting
19 0.01828973 18 jmlr-2013-Beyond Fano's Inequality: Bounds on the Optimal F-Score, BER, and Cost-Sensitive Risk and Their Implications
20 0.018041752 74 jmlr-2013-Multivariate Convex Regression with Adaptive Partitioning
topicId topicWeight
[(0, -0.091), (1, 0.017), (2, 0.001), (3, 0.029), (4, 0.037), (5, 0.049), (6, 0.026), (7, 0.033), (8, -0.044), (9, 0.054), (10, -0.039), (11, -0.127), (12, -0.004), (13, -0.252), (14, -0.049), (15, -0.013), (16, -0.043), (17, -0.042), (18, 0.013), (19, -0.002), (20, -0.019), (21, -0.039), (22, -0.034), (23, -0.106), (24, 0.295), (25, 0.043), (26, 0.108), (27, -0.02), (28, 0.066), (29, -0.029), (30, -0.015), (31, 0.024), (32, 0.165), (33, -0.154), (34, 0.107), (35, 0.055), (36, -0.11), (37, -0.04), (38, -0.044), (39, 0.211), (40, 0.063), (41, 0.081), (42, 0.142), (43, -0.117), (44, 0.129), (45, 0.179), (46, 0.01), (47, 0.263), (48, 0.056), (49, -0.134)]
simIndex simValue paperId paperTitle
same-paper 1 0.98709226 95 jmlr-2013-Ranking Forests
Author: Stéphan Clémençon, Marine Depecker, Nicolas Vayatis
Abstract: The present paper examines how the aggregation and feature randomization principles underlying the algorithm RANDOM FOREST (Breiman, 2001) can be adapted to bipartite ranking. The approach taken here is based on nonparametric scoring and ROC curve optimization in the sense of the AUC criterion. In this problem, aggregation is used to increase the performance of scoring rules produced by ranking trees, as those developed in Clémençon and Vayatis (2009c). The present work describes the principles for building median scoring rules based on concepts from rank aggregation. Consistency results are derived for these aggregated scoring rules and an algorithm called RANKING FOREST is presented. Furthermore, various strategies for feature randomization are explored through a series of numerical experiments on artificial data sets. Keywords: bipartite ranking, nonparametric scoring, classification data, ROC optimization, AUC criterion, tree-based ranking rules, bootstrap, bagging, rank aggregation, median ranking, feature randomization
2 0.37419596 11 jmlr-2013-Algorithms for Discovery of Multiple Markov Boundaries
Author: Alexander Statnikov, Nikita I. Lytkin, Jan Lemeire, Constantin F. Aliferis
Abstract: Algorithms for Markov boundary discovery from data constitute an important recent development in machine learning, primarily because they offer a principled solution to the variable/feature selection problem and give insight on local causal structure. Over the last decade many sound algorithms have been proposed to identify a single Markov boundary of the response variable. Even though faithful distributions and, more broadly, distributions that satisfy the intersection property always have a single Markov boundary, other distributions/data sets may have multiple Markov boundaries of the response variable. The latter distributions/data sets are common in practical data-analytic applications, and there are several reasons why it is important to induce multiple Markov boundaries from such data. However, there are currently no sound and efficient algorithms that can accomplish this task. This paper describes a family of algorithms TIE* that can discover all Markov boundaries in a distribution. The broad applicability as well as efficiency of the new algorithmic family is demonstrated in an extensive benchmarking study that involved comparison with 26 state-of-the-art algorithms/variants in 15 data sets from a diversity of application domains. Keywords: Markov boundary discovery, variable/feature selection, information equivalence, violations of faithfulness
3 0.35538915 83 jmlr-2013-Orange: Data Mining Toolbox in Python
Author: Janez Demšar, Tomaž Curk, Aleš Erjavec, Črt Gorup, Tomaž Hočevar, Mitar Milutinovič, Martin Možina, Matija Polajnar, Marko Toplak, Anže Starič, Miha Štajdohar, Lan Umek, Lan Žagar, Jure Žbontar, Marinka Žitnik, Blaž Zupan
Abstract: Orange is a machine learning and data mining suite for data analysis through Python scripting and visual programming. Here we report on the scripting part, which features interactive data analysis and component-based assembly of data mining procedures. In the selection and design of components, we focus on the flexibility of their reuse: our principal intention is to let the user write simple and clear scripts in Python, which build upon C++ implementations of computationallyintensive tasks. Orange is intended both for experienced users and programmers, as well as for students of data mining. Keywords: Python, data mining, machine learning, toolbox, scripting
4 0.33630443 61 jmlr-2013-Learning Theory Analysis for Association Rules and Sequential Event Prediction
Author: Cynthia Rudin, Benjamin Letham, David Madigan
Abstract: We present a theoretical analysis for prediction algorithms based on association rules. As part of this analysis, we introduce a problem for which rules are particularly natural, called “sequential event prediction.” In sequential event prediction, events in a sequence are revealed one by one, and the goal is to determine which event will next be revealed. The training set is a collection of past sequences of events. An example application is to predict which item will next be placed into a customer’s online shopping cart, given his/her past purchases. In the context of this problem, algorithms based on association rules have distinct advantages over classical statistical and machine learning methods: they look at correlations based on subsets of co-occurring past events (items a and b imply item c), they can be applied to the sequential event prediction problem in a natural way, they can potentially handle the “cold start” problem where the training set is small, and they yield interpretable predictions. In this work, we present two algorithms that incorporate association rules. These algorithms can be used both for sequential event prediction and for supervised classification, and they are simple enough that they can possibly be understood by users, customers, patients, managers, etc. We provide generalization guarantees on these algorithms based on algorithmic stability analysis from statistical learning theory. We include a discussion of the strict minimum support threshold often used in association rule mining, and introduce an “adjusted confidence” measure that provides a weaker minimum support condition that has advantages over the strict minimum support. The paper brings together ideas from statistical learning theory, association rule mining and Bayesian analysis. Keywords: statistical learning theory, algorithmic stability, association rules, sequence prediction, associative classification c 2013 Cynthia Rudin, Benjamin Letham and David Madigan. RUDIN , L E
5 0.33477825 106 jmlr-2013-Stationary-Sparse Causality Network Learning
Author: Yuejia He, Yiyuan She, Dapeng Wu
Abstract: Recently, researchers have proposed penalized maximum likelihood to identify network topology underlying a dynamical system modeled by multivariate time series. The time series of interest are assumed to be stationary, but this restriction is never taken into consideration by existing estimation methods. Moreover, practical problems of interest may have ultra-high dimensionality and obvious node collinearity. In addition, none of the available algorithms provides a probabilistic measure of the uncertainty for the obtained network topology which is informative in reliable network identification. The main purpose of this paper is to tackle these challenging issues. We propose the S2 learning framework, which stands for stationary-sparse network learning. We propose a novel algorithm referred to as the Berhu iterative sparsity pursuit with stationarity (BISPS), where the Berhu regularization can improve the Lasso in detection and estimation. The algorithm is extremely easy to implement, efficient in computation and has a theoretical guarantee to converge to a global optimum. We also incorporate a screening technique into BISPS to tackle ultra-high dimensional problems and enhance computational efficiency. Furthermore, a stationary bootstrap technique is applied to provide connection occurring frequency for reliable topology learning. Experiments show that our method can achieve stationary and sparse causality network learning and is scalable for high-dimensional problems. Keywords: stationarity, sparsity, Berhu, screening, bootstrap
6 0.29697853 89 jmlr-2013-QuantMiner for Mining Quantitative Association Rules
7 0.26358622 44 jmlr-2013-Finding Optimal Bayesian Networks Using Precedence Constraints
8 0.25962368 96 jmlr-2013-Regularization-Free Principal Curve Estimation
9 0.22917189 42 jmlr-2013-Fast Generalized Subset Scan for Anomalous Pattern Detection
10 0.21887602 60 jmlr-2013-Learning Bilinear Model for Matching Queries and Documents
11 0.20923828 110 jmlr-2013-Sub-Local Constraint-Based Learning of Bayesian Networks Using A Joint Dependence Criterion
13 0.14565392 116 jmlr-2013-Truncated Power Method for Sparse Eigenvalue Problems
14 0.14550008 18 jmlr-2013-Beyond Fano's Inequality: Bounds on the Optimal F-Score, BER, and Cost-Sensitive Risk and Their Implications
15 0.13440551 84 jmlr-2013-PC Algorithm for Nonparanormal Graphical Models
16 0.1327381 93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs
17 0.12754299 30 jmlr-2013-Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising
18 0.12605165 51 jmlr-2013-Greedy Sparsity-Constrained Optimization
19 0.1193405 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting
20 0.1179579 8 jmlr-2013-A Theory of Multiclass Boosting
topicId topicWeight
[(0, 0.023), (5, 0.096), (6, 0.022), (10, 0.053), (20, 0.028), (23, 0.025), (68, 0.019), (70, 0.03), (75, 0.033), (85, 0.017), (87, 0.011), (95, 0.53)]
simIndex simValue paperId paperTitle
same-paper 1 0.76283789 95 jmlr-2013-Ranking Forests
Author: Stéphan Clémençon, Marine Depecker, Nicolas Vayatis
Abstract: The present paper examines how the aggregation and feature randomization principles underlying the algorithm RANDOM FOREST (Breiman, 2001) can be adapted to bipartite ranking. The approach taken here is based on nonparametric scoring and ROC curve optimization in the sense of the AUC criterion. In this problem, aggregation is used to increase the performance of scoring rules produced by ranking trees, as those developed in Clémençon and Vayatis (2009c). The present work describes the principles for building median scoring rules based on concepts from rank aggregation. Consistency results are derived for these aggregated scoring rules and an algorithm called RANKING FOREST is presented. Furthermore, various strategies for feature randomization are explored through a series of numerical experiments on artificial data sets. Keywords: bipartite ranking, nonparametric scoring, classification data, ROC optimization, AUC criterion, tree-based ranking rules, bootstrap, bagging, rank aggregation, median ranking, feature randomization
2 0.7090323 79 jmlr-2013-On the Mutual Nearest Neighbors Estimate in Regression
Author: Arnaud Guyader, Nick Hengartner
Abstract: Motivated by promising experimental results, this paper investigates the theoretical properties of a recently proposed nonparametric estimator, called the Mutual Nearest Neighbors rule, which estimates the regression function m(x) = E[Y |X = x] as follows: first identify the k nearest neighbors of x in the sample Dn , then keep only those for which x is itself one of the k nearest neighbors, and finally take the average over the corresponding response variables. We prove that this estimator is consistent and that its rate of convergence is optimal. Since the estimate with the optimal rate of convergence depends on the unknown distribution of the observations, we also present adaptation results by data-splitting. Keywords: nonparametric estimation, nearest neighbor methods, mathematical statistics
3 0.22872646 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
Author: Yuchen Zhang, John C. Duchi, Martin J. Wainwright
Abstract: We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves √ mean-squared error (MSE) that decays as O (N −1 + (N/m)−2 ). Whenever m ≤ N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O (N −1 + (N/m)−3 ), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O (N −1 + (N/m)−3/2 ), easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 108 samples and d ≈ 740,000 covariates. Keywords: distributed learning, stochastic optimization, averaging, subsampling
4 0.22628808 51 jmlr-2013-Greedy Sparsity-Constrained Optimization
Author: Sohail Bahmani, Bhiksha Raj, Petros T. Boufounos
Abstract: Sparsity-constrained optimization has wide applicability in machine learning, statistics, and signal processing problems such as feature selection and Compressed Sensing. A vast body of work has studied the sparsity-constrained optimization from theoretical, algorithmic, and application aspects in the context of sparse estimation in linear models where the fidelity of the estimate is measured by the squared error. In contrast, relatively less effort has been made in the study of sparsityconstrained optimization in cases where nonlinear models are involved or the cost function is not quadratic. In this paper we propose a greedy algorithm, Gradient Support Pursuit (GraSP), to approximate sparse minima of cost functions of arbitrary form. Should a cost function have a Stable Restricted Hessian (SRH) or a Stable Restricted Linearization (SRL), both of which are introduced in this paper, our algorithm is guaranteed to produce a sparse vector within a bounded distance from the true sparse optimum. Our approach generalizes known results for quadratic cost functions that arise in sparse linear regression and Compressed Sensing. We also evaluate the performance of GraSP through numerical simulations on synthetic and real data, where the algorithm is employed for sparse logistic regression with and without ℓ2 -regularization. Keywords: sparsity, optimization, compressed sensing, greedy algorithm
Author: Daniil Ryabko, Jérémie Mary
Abstract: A metric between time-series distributions is proposed that can be evaluated using binary classification methods, which were originally developed to work on i.i.d. data. It is shown how this metric can be used for solving statistical problems that are seemingly unrelated to classification and concern highly dependent time series. Specifically, the problems of time-series clustering, homogeneity testing and the three-sample problem are addressed. Universal consistency of the resulting algorithms is proven under most general assumptions. The theoretical results are illustrated with experiments on synthetic and real-world data. Keywords: time series, reductions, stationary ergodic, clustering, metrics between probability distributions
6 0.22436212 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation
7 0.22392035 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
9 0.22318326 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
10 0.2226485 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
11 0.2206645 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
12 0.22024508 73 jmlr-2013-Multicategory Large-Margin Unified Machines
13 0.21995018 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
14 0.21985953 68 jmlr-2013-Machine Learning with Operational Costs
15 0.21955118 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut
16 0.21875755 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
17 0.21843506 18 jmlr-2013-Beyond Fano's Inequality: Bounds on the Optimal F-Score, BER, and Cost-Sensitive Risk and Their Implications
18 0.21829782 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning
19 0.21829465 29 jmlr-2013-Convex and Scalable Weakly Labeled SVMs
20 0.2180146 59 jmlr-2013-Large-scale SVD and Manifold Learning