nips nips2009 nips2009-248 nips2009-248-reference knowledge-graph by maker-knowledge-mining

248 nips-2009-Toward Provably Correct Feature Selection in Arbitrary Domains

Source: pdf

Author: Dimitris Margaritis

Abstract: In this paper we address the problem of provably correct feature selection in arbitrary domains. An optimal solution to the problem is a Markov boundary, which is a minimal set of features that make the probability distribution of a target variable conditionally invariant to the state of all other features in the domain. While numerous algorithms for this problem have been proposed, their theoretical correctness and practical behavior under arbitrary probability distributions is unclear. We address this by introducing the Markov Boundary Theorem that precisely characterizes the properties of an ideal Markov boundary, and use it to develop algorithms that learn a more general boundary that can capture complex interactions that only appear when the values of multiple features are considered together. We introduce two algorithms: an exact, provably correct one as well a more practical randomized anytime version, and show that they perform well on artiﬁcial as well as benchmark and real-world data sets. Throughout the paper we make minimal assumptions that consist of only a general set of axioms that hold for every probability distribution, which gives these algorithms universal applicability. 1 Introduction and Motivation The problem of feature selection has a long history due to its signiﬁcance in a wide range of important problems, from early ones like pattern recognition to recent ones such as text categorization, gene expression analysis and others. In such domains, using all available features may be prohibitively expensive, unnecessarily wasteful, and may lead to poor generalization performance, especially in the presence of irrelevant or redundant features. Thus, selecting a subset of features of the domain for use in subsequent application of machine learning algorithms has become a standard preprocessing step. A typical task of these algorithms is learning a classiﬁer: Given a number of input features and a quantity of interest, called the target variable, choose a member of a family of classiﬁers that can predict the target variable’s value as well as possible. Another task is understanding the domain and the quantities that interact with the target quantity. Many algorithms have been proposed for feature selection. Unfortunately, little attention has been paid to the issue of their behavior under a variety of application domains that can be encountered in practice. In particular, it is known that many can fail under certain probability distributions such as ones that contain a (near) parity function [1], which contain interactions that only appear when the values of multiple features are considered together. There is therefore an acute need for algorithms that are widely applicable and can be theoretically proven to work under any probability distribution. In this paper we present two such algorithms, an exact and a more practical randomized approximate one. We use the observation (ﬁrst made in Koller and Sahami [2]) that an optimal solution to the problem is a Markov boundary, deﬁned to be a minimal set of features that make the probability distribution of a target variable conditionally invariant to the state of all other features in the domain (a more precise deﬁnition is given later in Section 3) and present a family of algorithms for learning 1 the Markov boundary of a target variable in arbitrary domains. We ﬁrst introduce a theorem that exactly characterizes the minimal set of features necessary for probabilistically isolating a variable, and then relax this deﬁnition to derive a family of algorithms that learn a parameterized approximation of the ideal boundary that are provably correct under a minimal set of assumptions, including a set of axioms that hold for any probability distribution. In the following section we present work on feature selection, followed by notation and deﬁnitions in Section 3. We subsequently introduce an important theorem and the aforementioned parameterized family of algorithms in Sections 4 and 5 respectively, including a practical anytime version. We evaluate these algorithms in Section 6 and conclude in Section 7. 2 Related Work Numerous algorithms have been proposed for feature selection. At the highest level algorithms can be classiﬁed as ﬁlter, wrapper, or embedded methods. Filter methods work without consulting the classiﬁer (if any) that will make use of their output i.e., the resulting set of selected features. They therefore have typically wider applicability since they are not tied to any particular classiﬁer family. In contrast, wrappers make the classiﬁer an integral part of their operation, repeatedly invoking it to evaluate each of a sequence of feature subsets, and selecting the subset that results in minimum estimated classiﬁcation error (for that particular classiﬁer). Finally, embedded algorithms are classiﬁer-learning algorithms that perform feature selection implicitly during their operation e.g., decision tree learners. Early work was motivated by the problem of pattern recognition which inherently contains a large number of features (pixels, regions, signal responses at multiple frequencies etc.). Narendra and Fukunaga [3] ﬁrst cast feature selection as a problem of maximization of an objective function over the set of features to use, and proposed a number of search approaches including forward selection and backward elimination. Later work by machine learning researchers includes the FOCUS algorithm of Almuallim and Dietterich [4], which is a ﬁlter method for deterministic, noise-free domains. The RELIEF algorithm [5] instead uses a randomized selection of data points to update a weight assigned to each feature, selecting the features whose weight exceeds a given threshold. A large number of additional algorithms have appeared in the literature, too many to list here—good surveys are included in Dash and Liu [6]; Guyon and Elisseeff [1]; Liu and Motoda [7]. An important concept for feature subset selection is relevance. Several notions of relevance are discussed in a number of important papers such as Blum and Langley [8]; Kohavi and John [9]. The argument that the problem of feature selection can be cast as the problem of Markov blanket discovery was ﬁrst made convincingly in Koller and Sahami [2], who also presented an algorithm for learning an approximate Markov blanket using mutual information. Other algorithms include the GS algorithm [10], originally developed for learning of the structure of a Bayesian network of a domain, and extensions to it [11] including the recent MMMB algorithm [12]. Meinshausen and B¨ hlmann [13] u recently proposed an optimal theoretical solution to the problem of learning the neighborhood of a Markov network when the distribution of the domain can be assumed to be a multidimensional Gaussian i.e., linear relations among features with Gaussian noise. This assumption implies that the Composition axiom holds in the domain (see Pearl [14] for a deﬁnition of Composition); the difference with our work is that we address here the problem in general domains where it may not necessarily hold. 3 Notation and Preliminaries In this section we present notation, fundamental deﬁnitions and axioms that will be subsequently used in the rest of the paper. We use the term “feature” and “variable” interchangeably, and denote variables by capital letters (X, Y etc.) and sets of variables by bold letters (S, T etc.). We denote the set of all variables/features in the domain (the “universe”) by U. All algorithms presented are independence-based, learning the Markov boundary of a given target variable using the truth value of a number of conditional independence statements. The use of conditional independence for feature selection subsumes many other criteria proposed in the literature. In particular, the use of classiﬁcation accuracy of the target variable can be seen as a special case of testing for its conditional independence with some of its predictor variables (conditional on the subset selected at any given moment). A beneﬁt of using conditional independence is that, while classiﬁcation error estimates depend on the classiﬁer family used, conditional independence does not. In addition, algorithms utilizing conditional independence for feature selection are applicable to all domain types, 2 e.g., discrete, ordinal, continuous with non-linear or arbitrary non-degenerate associations or mixed domains, as long as a reliable estimate of probabilistic independence is available. We denote probabilistic independence by the symbol “ ⊥ ” i.e., (X⊥ Y | Z) denotes the fact ⊥ ⊥ that the variables in set X are (jointly) conditionally independent from those in set Y given the values of the variables in set Z; (X⊥ Y | Z) denotes their conditional dependence. We assume ⊥ the existence of a probabilistic independence query oracle that is available to answer any query of the form (X, Y | Z), corresponding to the question “Is the set of variables in X independent of the variables in Y given the value of the variables in Z?” (This is similar to the approach of learning from statistical queries of Kearns and Vazirani [15].) In practice however, such an oracle does not exist, but can be approximated by a statistical independence test on a data set. Many tests of independence have appeared and studied extensively in the statistical literature over the last century; in this work we use the χ2 (chi-square) test of independence [16]. A Markov blanket of variable X is a set of variables such that, after ﬁxing (by “knowing”) the value of all of its members, the set of remaining variables in the domain, taken together as a single setvalued variable, are statistically independent of X. More precisely, we have the following deﬁnition. Deﬁnition 1. A set of variables S ⊆ U is called a Markov blanket of variable X if and only if (X⊥ U − S − {X} | S). ⊥ Intuitively, a Markov blanket S of X captures all the information in the remaining domain variables U − S − {X} that can affect the probability distribution of X, making their value redundant as far as X is concerned (given S). The blanket therefore captures the essence of the feature selection problem for target variable X: By completely “shielding” X, a Markov blanket precludes the existence of any possible information about X that can come from variables not in the blanket, making it an ideal solution to the feature selection problem. A minimal Markov blanket is called a Markov boundary. Deﬁnition 2. A set of variables S ⊆ U − {X} is called a Markov boundary of variable X if it is a minimal Markov blanket of X i.e., none of its proper subsets is a Markov blanket. Pearl [14] proved that that the axioms of Symmetry, Decomposition, Weak Union, and Intersection are sufﬁcient to guarantee a unique Markov boundary. These are shown below together with the axiom of Contraction. (Symmetry) (Decomposition) (Weak Union) (Contraction) (Intersection) (X⊥ Y | Z) =⇒ (Y⊥ X | Z) ⊥ ⊥ (X⊥ Y ∪ W | Z) =⇒ (X⊥ Y | Z) ∧ (X⊥ W | Z) ⊥ ⊥ ⊥ (X⊥ Y ∪ W | Z) =⇒ (X⊥ Y | Z ∪ W) ⊥ ⊥ (X⊥ Y | Z) ∧ (X⊥ W | Y ∪ Z) =⇒ (X⊥ Y ∪ W | Z) ⊥ ⊥ ⊥ (X⊥ Y | Z ∪ W) ∧ (X⊥ W | Z ∪ Y) =⇒ (X⊥ Y ∪ W | Z) ⊥ ⊥ ⊥ (1) The Symmetry, Decomposition, Contraction and Weak Union axioms are very general: they are necessary axioms for the probabilistic deﬁnition of independence i.e., they hold in every probability distribution, as their proofs are based on the axioms of probability theory. Intersection is not universal but it holds in distributions that are positive, i.e., any value combination of the domain variables has a non-zero probability of occurring. 4 The Markov Boundary Theorem According to Deﬁnition 2, a Markov boundary is a minimal Markov blanket. We ﬁrst introduce a theorem that provides an alternative, equivalent deﬁnition of the concept of Markov boundary that we will relax later in the paper to produce a more general boundary deﬁnition. Theorem 1 (Markov Boundary Theorem). Assuming that the Decomposition and Contraction axioms hold, S ⊆ U − {X} is a Markov boundary of variable X ∈ U if and only if ∀ T ⊆ U − {X}, T ⊆ U − S ⇐⇒ (X⊥ T | S − T) . ⊥ (2) A detailed proof cannot be included here due to space constraints but a proof sketch appears in Appendix A. According to the above theorem, a Markov boundary S partitions the powerset of U − {X} into two parts: (a) set P1 that contains all subsets of U − S, and (b) set P2 containing the remaining subsets. All sets in P1 are conditionally independent of X, and all sets in P2 are conditionally dependent with X. Intuitively, the two directions of the logical equivalence relation of Eq. (2) correspond to the concept of Markov blanket and its minimality i.e., the equation ∀ T ⊆ U − {X}, T ⊆ U − S =⇒ (X⊥ T | S − T) ⊥ 3 Algorithm 1 The abstract GS(m) (X) algorithm. Returns an m-Markov boundary of X. 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: S←∅ /* Growing phase. */ for all Y ⊆ U − S − {X} such that 1 ≤ |Y| ≤ m do if (X ⊥ Y | S) then ⊥ S←S∪Y goto line 3 /* Restart loop. */ /* Shrinking phase. */ for all Y ∈ S do if (X⊥ Y | S − {Y }) then ⊥ S ← S − {Y } goto line 8 /* Restart loop. */ return S or, equivalently, (∀ T ⊆ U − S − {X}, (X⊥ T | S)) (as T and S are disjoint) corresponds to ⊥ the deﬁnition of Markov blanket, as it includes T = U − S − {X}. In the opposite direction, the contrapositive form is ∀ T ⊆ U − {X}, T ⊆ U − S =⇒ (X ⊥ T | S − T) . ⊥ This corresponds to the concept of minimality of the Markov boundary: It states that all sets that contain a part of S cannot be independent of X given the remainder of S. Informally, this is because if there existed some set T that contained a non-empty subset T′ of S such that (X⊥ T | S − T), ⊥ then one would be able to shrink S by T′ (by the property of Contraction) and therefore S would not be minimal (more details in Appendix A). 5 A Family of Algorithms for Arbitrary Domains Theorem 1 deﬁnes conditions that precisely characterize a Markov boundary and thus can be thought of as an alternative deﬁnition of a boundary. By relaxing these conditions we can produce a more general deﬁnition. In particular, an m-Markov boundary is deﬁned as follows. Deﬁnition 3. A set of variables S ⊆ U − {X} of a domain U is called an m-Markov boundary of variable X ∈ U if and only if ∀ T ⊆ U − {X} such that |T| ≤ m, T ⊆ U − S ⇐⇒ (X⊥ T | S − T) . ⊥ We call the parameter m of an m-Markov boundary the Markov boundary margin. Intuitively, an m-boundary S guarantees that (a) all subsets of its complement (excluding X) of size m or smaller are independent of X given S, and (b) all sets T of size m or smaller that are not subsets of its complement are dependent of X given the part of S that is not contained in T. This deﬁnition is a special case of the properties of a boundary stated in Theorem 1, with each set T mentioned in the theorem now restricted to having size m or smaller. For m = n − 1, where n = |U |, the condition |T| ≤ m is always satisﬁed and can be omitted; in this case the deﬁnition of an (n − 1)-Markov boundary results in exactly Eq. (2) of Theorem 1. We now present an algorithm called GS(m) , shown in Algorithm 1, that provably correctly learns an m-boundary of a target variable X. GS(m) operates in two phases, a growing and a shrinking phase (hence the acronym). During the growing phase it examines sets of variables of size up to m, where m is a user-speciﬁed parameter. During the shrinking phase, single variables are examined for conditional independence and possible removal from S (examining sets in the shrinking phase is not necessary for provably correct operation—see Appendix B). The orders of examination of the sets for possible addition and deletion from the candidate boundary are left intentionally unspeciﬁed in Algorithm 1—one can therefore view it as an abstract representative of a family of algorithms, with each member specifying one such ordering. All members of this family are m-correct, as the proof of correctness does not depend on the ordering. In practice numerous choices for the ordering exist; one possibility is to examine the sets in the growing phase in order of increasing set size and, for each such size, in order of decreasing conditional mutual information I(X, Y, S) between X and Y given S. The rationale for this heuristic choice is that (usually) tests with smaller conditional sets tend to be more reliable, and sorting by mutual information tends to lessen the chance of adding false members of the Markov boundary. We used this implementation in all our experiments, presented later in Section 6. Intuitively, the margin represents a trade-off between sample and computational complexity and completeness: For m = n − 1 = |U| − 1, the algorithm returns a Markov boundary in unrestricted 4 Algorithm 2 The RGS(m,k) (X) algorithm, a randomized anytime version of the GS(m) algorithm, utilizing k random subsets for the growing phase. 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: S←∅ /* Growing phase. */ repeat Schanged ← false Y ← subset of U − S − {X} of size 1 ≤ |Y| ≤ m of maximum dependence out of k random subsets if (X ⊥ Y | S) then ⊥ S←S∪Y Schanged ← true until Schanged = false /* Shrinking phase. */ for all Y ∈ S do if (X⊥ Y | S − {Y }) then ⊥ S ← S − {Y } goto line 11 /* Restart loop. */ return S (arbitrary) domains. For 1 ≤ m < n − 1, GS(m) may recover the correct boundary depending on characteristics of the domain. For example, it will recover the correct boundary in domains containing embedded parity functions such that the number of variables involved in every k-bit parity function is m + 1 or less i.e., if k ≤ m + 1 (parity functions are corner cases in the space of probability distributions that are known to be hard to learn [17]). The proof of m-correctness of GS(m) is included in Appendix B. Note that it is based on Theorem 1 and the universal axioms of Eqs. (1) only i.e., Intersection is not needed, and thus it is widely applicable (to any domain). A Practical Randomized Anytime Version While GS(m) is provably correct even in difﬁcult domains such as those that contain parity functions, it may be impractical with a large number of features as its asymptotic complexity is O(nm ). We therefore also we here provide a more practical randomized version called RGS(m,k) (Randomized GS(m) ), shown in Algorithm 2. The RGS(m,k) algorithm has an additional parameter k that limits its computational requirements: instead of exhaustively examining all possible subsets of (U −S−{X}) (as GS(m) does), it instead samples k subsets from the set of all possible subsets of (U − S − {X}), where k is user-speciﬁed. It is therefore a randomized algorithm that becomes equivalent to GS(m) given a large enough k. Many possibilities for the method of random selection of the subsets exist; in our experiments we select a subset Y = {Yi } (1 ≤ |Y| ≤ m) with probability proportional |Y| to i=1 (1/p(X, Yi | S)), where p(X, Yi | S) is the p-value of the corresponding (univariate) test between X and Yi given S, which has a low computational cost. The RGS(m,k) algorithm is useful in situations where the amount of time to produce an answer may be limited and/or the limit unknown beforehand: it is easy to show that the growing phase of GS(m) produces an an upper-bound of the m-boundary of X. Therefore, the RGS(m,k) algorithm, if interrupted, will return an approximation of this upper bound. Moreover, if there exists time for the shrinking phase to be executed (which conducts a number of tests linear in n and is thus fast), extraneous variables will be removed and a minimal blanket (boundary) approximation will be returned. These features make it an anytime algorithm, which is a more appropriate choice for situations where critical events may occur that require the interruption of computation, e.g., during the planning phase of a robot, which may be interrupted at any time due to an urgent external event that requires a decision to be made based on the present state’s feature values. 6 Experiments We evaluated the GS(m) and the RGS(m,k) algorithms on synthetic as well as real-world and benchmark data sets. We ﬁrst systematically examined the performance on the task of recovering near-parity functions, which are known to be hard to learn [17]. We compared GS(m) and RGS(m,k) with respect to accuracy of recovery of the original boundary as well as computational cost. We generated domains of sizes ranging from 10 to 100 variables, of which 4 variables (X1 to X4 ) were related through a near-parity relation with bit probability 0.60 and various degrees of noise. The remaining independent variables (X5 to Xn ) act as “dis5 GS(1) GS(3) RGS(1, 1000) 0 0.05 RGS(3, 1000) Relieved, threshold = 0.001 Relieved, threshold = 0.03 0.1 0.15 0.2 0.25 Noise probability 0.3 0.35 0.4 Probabilistic isolation performance of GS(m) and RELIEVED Probabilistic isolation performance of GS(m) and RGS(m ,k) Real-world and benchmark data sets 1 Data set Balance scale Balloons Car evaluation Credit screening Monks Nursery Tic-tac-toe Breast cancer Chess Audiology 0.8 0.6 0.4 0.2 0 0 0.2 0.4 (m = 3) GS 0.6 0.8 1 Real-world and benchmark data sets RGS(m = 3, k = 300) average isolation measure F1 measure 50 variables, true Markov boundary size = 3 Bernoulli probability = 0.6, 1000 data points RELIEVED(threshold = 0.03) average isolation measure F1 measure of GS(m ), RGS(m, k ) and RELIEVED vs. noise level 1.3 1.2 1.1 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 1 Data set Balance scale Balloons Car evaluation Credit screening Monks Nursery Tic-tac-toe Breast cancer Chess Audiology 0.8 0.6 0.4 0.2 0 0 0.2 0.4 (m = 3) average isolation measure GS 0.6 0.8 1 average isolation measure Figure 2: Left: F1 measure of GS(m) , RGS(m,k) and RELIEVED under increasing amounts of noise. Middle: Probabilistic isolation performance comparison between GS(3) and RELIEVED on real-world and benchmark data sets. Right: Same for GS(3) and RGS(3,1000) . tractors” and had randomly assigned probabilities i.e., the correct boundary of X1 is B1 = {X2 , X3 , X4 }. In such domains, learning the boundary of X1 is difﬁcult because of the large number of distractors and because each Xi ∈ B1 is independent of X1 given any proper subset of B1 − {Xi } (they only become dependent when including all of them in the conditioning set). To measure an algorithm’s feature selection performance, acF -measure of GS and RGS vs. domain size curacy (fraction of variables correctly included or excluded) is inappropriate as the accuracy of trivial algorithms such as returning the empty set will tend to 1 as n increases. Precision and recall are therefore more appropriate, with precision deﬁned as the fraction of features returned that are in the correct boundary (3 features for X1 ), and recall as the fraction of the features present in the correct boundary that are returned by the algorithm. A convenient and frequently used Number of domain variables measure that combines precision and recall is the F1 meaRunning time of GS and RGS vs. domain size sure, deﬁned as the harmonic mean of precision and recall [18]. In Fig. 1 (top) we report 95% conﬁdence intervals for the F1 measure and execution time of GS(m) (margins m = 1 to 3) and RGS(m,k) (margins 1 to 3 and k = 1000 random subsets), using 20 data sets containing 10 to 100 variables, with the target variable X1 was perturbed (inverted) by noise with 10% probability. As can be seen, the RGS(m,k) and GS(m) using the same value for margin perform comparably Number of domain variables with respect to F1 , up to their 95% conﬁdence intervals. With Figure 1: GS(m) and RGS(m,k) per(m,k) respect to execution time however RGS exhibits much formance with respect to domain greater scalability (Fig. 1 bottom, log scale); for example, it size (number of variables). Top: F1 executes in about 10 seconds on average in domains containmeasure, reﬂecting accuracy. Bot(m) ing 100 variables, while GS executes in 1,000 seconds on tom: Execution time in seconds (log average for this domain size. scale). We also compared GS(m) and RGS(m,k) to RELIEF [5], a well-known algorithm for feature selection that is known to be able to recover parity functions in certain cases [5]. RELIEF learns a weight for each variable and compares it to a threshold τ to decide on its inclusion in the set of relevant variables. As it has been reported [9] that RELIEF can exhibit large variance due to randomization that is necessary only for very large data sets, we instead used a deterministic variant called RELIEVED [9], whose behavior corresponds to RELIEF at the limit of inﬁnite execution time. We calculated the F1 measure for GS(m) , RGS(m,k) and RELIEVED in the presence of varying amounts of noise, with noise probability ranging from 0 (no noise) to 0.4. We used domains containing 50 variables, as GS(m) becomes computationally demanding in larger domains. In Figure 2 (left) we show the performance of GS(m) and RGS(m,k) for m equal to 1 and 3, k = 1000 and RELIEVED for thresholds τ = 0.01 and 0.03 for various amounts of noise on the target variable. Again, each experiment was repeated 20 times to generate 95% conﬁdence intervals. We can observe that even though m = 1 (equivalent to the GS algorithm) performs poorly, increasing the margin m makes it more likely to recover the correct Markov boundary, and GS(3) (m = 3) recovers the exact blanket even with few (1,000) data points. RELIEVED does comparably to GS(3) for little noise and for a large threshold, (m ) (m, k ) 1 True Markov boundary size = 3, 1000 data points Bernoulli probability = 0.6, noise probability = 0.1 1 0.9 GS(1) GS(2) GS(3) RGS(1, 1000) (2, 1000) RGS (3, 1000) RGS 0.8 F1-measure 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 10 20 30 40 50 60 (m ) 70 80 90 100 90 100 (m, k ) True Markov boundary size = 3, 1000 data points Bernoulli probability = 0.6, noise probability = 0.1 10000 GS(1) GS(2) GS(3) Execution time (sec) 1000 RGS(1, 1000) RGS(2, 1000) RGS(3, 1000) 100 10 1 0.1 0.01 10 6 20 30 40 50 60 70 80 but appears to deteriorate for more noisy domains. As we can see it is difﬁcult to choose the “right” threshold for RELIEVED—better performing τ at low noise can become worse in noisy environments; in particular, small τ tend to include irrelevant variables while large τ tend to miss actual members. We also evaluated GS(m) , RGS(m,k) , and RELIEVED on benchmark and real-world data sets from the UCI Machine Learning repository. As the true Markov boundary for these is impossible to know, we used as performance measure a measure of probabilistic isolation by the Markov boundary returned of subsets outside the boundary. For each domain variable X, we measured the independence of subsets Y of size 1, 2 and 3 given the blanket S of X returned by GS(3) and RELIEVED for τ = 0.03 (as this value seemed to do better in the previous set of experiments), as measured by the average p-value of the χ2 test between X and Y given S (with p-values of 0 and 1 indicating ideal dependence and independence, respectively). Due to the large number of subsets outside the boundary when the boundary is small, we limited the estimation of isolation performance to 2,000 subsets per variable. We plot the results in Figure 2 (middle and right). Each point represents a variable in the corresponding data set. Points under the diagonal indicate better probabilistic isolation performance for that variable for GS(3) compared to RELIEVED (middle plot) or to RGS(3,1000) (right plot). To obtain a statistically signiﬁcant comparison, we used the non-parametric Wilcoxon paired signed-rank test, which indicated that GS(3) RGS(3,1000) are statistically equivalent to each other, while both outperformed RELIEVED at the 99.99% signiﬁcance level (α < 10−7 ). 7 Conclusion In this paper we presented algorithms for the problem of feature selection in unrestricted (arbitrary distribution) domains that may contain complex interactions that only appear when the values of multiple features are considered together. We introduced two algorithms: an exact, provably correct one as well a more practical randomized anytime version, and evaluated them on on artiﬁcial, benchmark and real-world data, demonstrating that they perform well, even in the presence of noise. We also introduced the Markov Boundary Theorem that precisely characterizes the properties of a boundary, and used it to prove m-correctness of the exact family of algorithms presented. We made minimal assumptions that consist of only a general set of axioms that hold for every probability distribution, giving our algorithms universal applicability. Appendix A: Proof sketch of the Markov Boundary Theorem Proof sketch. (=⇒ direction) We need to prove that if S is a Markov boundary of X then (a) for every set T ⊆ U − S − {X}, (X⊥ T | S − T), and (b) for every set T′ ⊆ U − S that does not ⊥ contain X, (X ⊥ T′ | S − T′ ). Case (a) is immediate from the deﬁnition of the boundary and the ⊥ Decomposition theorem. Case (b) can be proven by contradiction: Assuming the independence of T′ that contains a non-empty part T′ in S and a part T′ in U − S, we get (from Decomposition) 1 2 (X⊥ T′ | S − T′ ). We can then use Contraction to show that the set S − T′ satisﬁes the inde⊥ 1 1 1 pendence property of a Markov boundary, i.e., that (X⊥ U − (S − T′ ) − {X} | S − T′ ), which ⊥ 1 1 contradicts the assumption that S is a boundary (and thus minimal). (⇐= direction) We need to prove that if Eq. (2) holds, then S is a minimal Markov blanket. The proof that S is a blanket is immediate. We can prove minimality by contradiction: Assume S = S1 ∪ S2 with S1 a blanket and S2 = ∅ i.e., S1 is a blanket strictly smaller than S. Then (X⊥ S2 | ⊥ S1 ) = (X⊥ S2 | S − S2 ). However, since S2 ⊆ U − S, from Eq. (2) we get (X ⊥ S2 | S − S2 ), ⊥ ⊥ which is a contradiction. Appendix B: Proof of m-Correctness of GS(m) Let the value of the set S at the end of the growing phase be SG , its value at the end of the shrinking phase SS , and their difference S∆ = SG − SS . The following two observations are immediate. Observation 1. For every Y ⊆ U − SG − {X} such that 1 ≤ |Y| ≤ m, (X⊥ Y | SG ). ⊥ Observation 2. For every Y ∈ SS , (X ⊥ Y | SS − {Y }). ⊥ Lemma 2. Consider variables Y1 , Y2 , . . . , Yt for some t ≥ 1 and let Y = {Yj }t . Assuming that j=1 Contraction holds, if (X⊥ Yi | S − {Yj }i ) for all i = 1, . . . , t, then (X⊥ Y | S − Y). ⊥ ⊥ j=1 Proof. By induction on Yj , j = 1, 2, . . . , t, using Contraction to decrease the conditioning set S t down to S − {Yj }i j=1 for all i = 1, 2, . . . , t. Since Y = {Yj }j=1 , we immediately obtain the desired relation (X⊥ Y | S − Y). ⊥ 7 Lemma 2 can be used to show that the variables found individually independent of X during the shrinking phase are actually jointly independent of X, given the ﬁnal set SS . Let S∆ = {Y1 , Y2 , . . . , Yt } be the set of variables removed (in that order) from SG to form the ﬁnal set SS i.e., S∆ = SG − SS . Using the above lemma, the following is immediate. Corollary 3. Assuming that the Contraction axiom holds, (X⊥ S∆ | SS ). ⊥ Lemma 4. If the Contraction, Decomposition and Weak Union axioms hold, then for every set T ⊆ U − SG − {X} such that (X⊥ T | SG ), ⊥ (X⊥ T ∪ (SG − SS ) | SS ). ⊥ (3) Furthermore SS is minimal i.e., there does not exist a subset of SS for which Eq. (3) is true. Proof. From Corollary 3, (X⊥ S∆ | SS ). Also, by the hypothesis, (X⊥ T | SG ) = (X⊥ T | ⊥ ⊥ ⊥ SS ∪ S∆ ), where S∆ = SG − SS as usual. From these two relations and Contraction we obtain (X⊥ T ∪ S∆ | SS ). ⊥ To prove minimality, let us assume that SS = ∅ (if SS = ∅ then it is already minimal). We prove by contradiction: Assume that there exists a set S′ ⊂ SS such that (X⊥ T ∪ (SG − S′ ) | S′ ). Let ⊥ W = SS − S′ = ∅. Note that W and S′ are disjoint. We have that SS ⊆ SS ∪ S∆ =⇒ SS − S′ ⊆ SS ∪ S∆ − S′ ⊆ T ∪ (SS ∪ S∆ − S′ ) =⇒ W ⊆ T ∪ (SS ∪ S∆ − S′ ) = T ∪ (SG − S′ ) • Since (X⊥ T ∪ (SG − S′ ) | S′ ) and W ⊆ T ∪ (SS ∪ S∆ − S′ ), from Decomposition we ⊥ get (X⊥ W | S′ ). ⊥ • From (X⊥ W | S′ ) and Weak Union we have that for every Y ∈ W, (X⊥ Y | S′ ∪ ⊥ ⊥ (W − {Y })). • Since S′ and W are disjoint and since Y ∈ W, Y ∈ S′ . Applying the set equality (A − B) ∪ C = (A ∪ B) − (A − C) to S′ ∪ (W − {Y }) we obtain S′ ∪ W − ({Y } − S′ ) = SS − {Y }. • Therefore, ∀ Y ∈ W, (X⊥ Y | SS − {Y }). ⊥ However, at the end of the shrinking phase, all variables Y in SS (and therefore in W, as W ⊆ SS ) have been evaluated for independence and found dependent (Observation 2). Thus, since W = ∅, there exists at least one Y such that (X ⊥ Y | SS − {Y }), producing a contradiction. ⊥ Theorem 5. Assuming that the Contraction, Decomposition, and Weak Union axioms hold, Algorithm 1 is m-correct with respect to X. Proof. We use the Markov Boundary Theorem. We ﬁrst prove that ∀ T ⊆ U − {X} such that |T| ≤ m, T ⊆ U − SS =⇒ (X⊥ T | SS − T) ⊥ or, equivalently, ∀ T ⊆ U − SS − {X} such that |T| ≤ m, (X⊥ T | SS ). ⊥ Since U − SS − {X} = S∆ ∪ (U − SG − {X}), S∆ and U − SG − {X} are disjoint, there are three kinds of sets of size m or less to consider: (i) all sets T ⊆ S∆ , (ii) all sets T ⊆ U − SG − {X}, and (iii) all sets (if any) T = T′ ∪ T′′ , T′ ∩ T′′ = ∅, that have a non-empty part T′ ⊆ S∆ and a non-empty part T′′ ⊆ U − SG − {X}. (i) From Corollary 3, (X⊥ S∆ | SS ). Therefore, from Decomposition, for any set T ⊆ S∆ , ⊥ (X⊥ T | SS ). ⊥ (ii) By Observation 1, for every set T ⊆ U − SG − {X} such that |T| ≤ m, (X⊥ T | SG ). ⊥ By Lemma 4 we get (X⊥ T ∪ S∆ | SS ), from which we obtain (X⊥ T | SS ) by ⊥ ⊥ Decomposition. (iii) Since |T| ≤ m, we have that |T′′ | ≤ m. Since T′′ ⊆ U − SG − {X}, by Observation 1, (X⊥ T′′ | SG ). Therefore, by Lemma 4, (X⊥ T′′ ∪ S∆ | SS ). Since T′ ⊆ S∆ ⇒ ⊥ ⊥ T′′ ∪ T′ ⊆ T′′ ∪ S∆ , by Decomposition to obtain (X⊥ T′′ ∪ T′ | SS ) = (X⊥ T | SS ). ⊥ ⊥ To complete the proof we need to prove that ∀ T ⊆ U − {X} such that |T| ≤ m, T ⊆ U − SS =⇒ (X ⊥ T | SS − T) . ⊥ Let T = T1 ∪ T2 , with T1 ⊆ SS and T2 ⊆ U − SS . Since T ⊆ U − SS , T1 contains at least one variable Y ∈ SS . From Observation 2, (X ⊥ Y | SS − {Y }). From this and (the contrapositive of) ⊥ Weak Union, we get (X ⊥ {Y } ∪ (T1 − {Y }) | SS − {Y } − (T1 − {Y })) = (X ⊥ T1 | SS − T1 ). ⊥ ⊥ From (the contrapositive of) Decomposition we get (X ⊥ T1 ∪ T2 | SS − T1 ) = (X ⊥ T | ⊥ ⊥ SS − T1 ), which is equal to (X ⊥ T | SS − T1 − T2 ) = (X ⊥ T | SS − T) as SS and T2 are ⊥ ⊥ disjoint. 8 References [1] Isabelle Guyon and Andr´ Elisseeff. An introduction to variable and feature selection. Journal e of Machine Learning Research, 3:1157–1182, 2003. [2] Daphne Koller and Mehran Sahami. Toward optimal feature selection. In Proceedings of the Tenth International Conference on Machine Learning (ICML), pages 284–292, 1996. [3] P. M. Narendra and K. Fukunaga. A branch and bound algorithm for feature subset selection. IEEE Transactions on Computers, C-26(9):917–922, 1977. [4] H. Almuallim and T. G. Dietterich. Learning with many irrelevant features. In Proceedings of the National Conference on the Americal Association for Artiﬁcal Intelligence (AAAI), 1991. [5] K. Kira and L. A. Rendell. The feature selection problem: Traditional methods and a new algorithm. In Proceedings of the National Conference on the Americal Association for Artiﬁcal Intelligence (AAAI), pages 129–134, 1992. [6] M. Dash and H. Liu. Feature selection for classiﬁcation. Intelligent Data Analysis, 1(3): 131–156, 1997. [7] Huan Liu and Hiroshi Motoda, editors. Feature Extraction, Construction and Selection: A Data Mining Perspective, volume 453 of The Springer International Series in Engineering and Computer Science. 1998. [8] Avrim Blum and Pat Langley. Selection of relevant features and examples in machine learning. Artiﬁcial Intelligence, 97(1-2):245–271, 1997. [9] R. Kohavi and G. H. John. Wrappers for feature subset selection. Artiﬁcial Intelligence, 97 (1-2):273–324, 1997. [10] Dimitris Margaritis and Sebastian Thrun. Bayesian network induction via local neighborhoods. In Advances in Neural Information Processing Systems 12 (NIPS), 2000. [11] I. Tsamardinos, C. Aliferis, and A. Statnikov. Algorithms for large scale Markov blanket discovery. In Proceedings of the 16th International FLAIRS Conference, 2003. [12] I. Tsamardinos, C. Aliferis, and A. Statnikov. Time and sample efﬁcient discovery of Markov blankets and direct causal relations. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 673–678, 2003. [13] N. Meinshausen and P. B¨ hlmann. High-dimensional graphs and variable selection with the u Lasso. Annals of Statistics, 34:1436–1462, 2006. [14] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. 1988. [15] Michael Kearns and Umesh V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994. [16] A. Agresti. Categorical Data Analysis. John Wiley and Sons, 1990. [17] M. Kearns. Efﬁcient noise-tolerant learning from statistical queries. J. ACM, 45(6):983–1006, 1998. [18] C. J. van Rijsbergen. Information Retrieval. Butterworth-Heinemann, London, 1979. 9

reference text

[1] Isabelle Guyon and Andr´ Elisseeff. An introduction to variable and feature selection. Journal e of Machine Learning Research, 3:1157–1182, 2003.

[2] Daphne Koller and Mehran Sahami. Toward optimal feature selection. In Proceedings of the Tenth International Conference on Machine Learning (ICML), pages 284–292, 1996.

[3] P. M. Narendra and K. Fukunaga. A branch and bound algorithm for feature subset selection. IEEE Transactions on Computers, C-26(9):917–922, 1977.

[4] H. Almuallim and T. G. Dietterich. Learning with many irrelevant features. In Proceedings of the National Conference on the Americal Association for Artiﬁcal Intelligence (AAAI), 1991.

[5] K. Kira and L. A. Rendell. The feature selection problem: Traditional methods and a new algorithm. In Proceedings of the National Conference on the Americal Association for Artiﬁcal Intelligence (AAAI), pages 129–134, 1992.

[6] M. Dash and H. Liu. Feature selection for classiﬁcation. Intelligent Data Analysis, 1(3): 131–156, 1997.

[7] Huan Liu and Hiroshi Motoda, editors. Feature Extraction, Construction and Selection: A Data Mining Perspective, volume 453 of The Springer International Series in Engineering and Computer Science. 1998.

[8] Avrim Blum and Pat Langley. Selection of relevant features and examples in machine learning. Artiﬁcial Intelligence, 97(1-2):245–271, 1997.

[9] R. Kohavi and G. H. John. Wrappers for feature subset selection. Artiﬁcial Intelligence, 97 (1-2):273–324, 1997.

[10] Dimitris Margaritis and Sebastian Thrun. Bayesian network induction via local neighborhoods. In Advances in Neural Information Processing Systems 12 (NIPS), 2000.

[11] I. Tsamardinos, C. Aliferis, and A. Statnikov. Algorithms for large scale Markov blanket discovery. In Proceedings of the 16th International FLAIRS Conference, 2003.

[12] I. Tsamardinos, C. Aliferis, and A. Statnikov. Time and sample efﬁcient discovery of Markov blankets and direct causal relations. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 673–678, 2003.

[13] N. Meinshausen and P. B¨ hlmann. High-dimensional graphs and variable selection with the u Lasso. Annals of Statistics, 34:1436–1462, 2006.

[14] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. 1988.

[15] Michael Kearns and Umesh V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.

[16] A. Agresti. Categorical Data Analysis. John Wiley and Sons, 1990.

[17] M. Kearns. Efﬁcient noise-tolerant learning from statistical queries. J. ACM, 45(6):983–1006, 1998.

[18] C. J. van Rijsbergen. Information Retrieval. Butterworth-Heinemann, London, 1979. 9