
Toward Provably Correct Feature Selection in Arbitrary Domains



Author: Dimitris Margaritis

Abstract: In this paper we address the problem of provably correct feature selection in arbitrary domains. An optimal solution to the problem is a Markov boundary, which is a minimal set of features that make the probability distribution of a target variable conditionally invariant to the state of all other features in the domain. While numerous algorithms for this problem have been proposed, their theoretical correctness and practical behavior under arbitrary probability distributions is unclear. We address this by introducing the Markov Boundary Theorem, which precisely characterizes the properties of an ideal Markov boundary, and use it to develop algorithms that learn a more general boundary that can capture complex interactions that only appear when the values of multiple features are considered together. We introduce two algorithms: an exact, provably correct one as well as a more practical randomized anytime version, and show that they perform well on artificial as well as benchmark and real-world data sets. Throughout the paper we make minimal assumptions that consist of only a general set of axioms that hold for every probability distribution, which gives these algorithms universal applicability.

1 Introduction and Motivation

The problem of feature selection has a long history due to its significance in a wide range of important problems, from early ones like pattern recognition to recent ones such as text categorization, gene expression analysis and others. In such domains, using all available features may be prohibitively expensive, unnecessarily wasteful, and may lead to poor generalization performance, especially in the presence of irrelevant or redundant features. Thus, selecting a subset of the features of the domain for use in subsequent application of machine learning algorithms has become a standard preprocessing step. A typical task of these algorithms is learning a classifier: given a number of input features and a quantity of interest, called the target variable, choose a member of a family of classifiers that can predict the target variable's value as well as possible. Another task is understanding the domain and the quantities that interact with the target quantity.

Many algorithms have been proposed for feature selection. Unfortunately, little attention has been paid to the issue of their behavior under the variety of application domains that can be encountered in practice. In particular, it is known that many can fail under certain probability distributions, such as ones that contain a (near) parity function [1], which contain interactions that only appear when the values of multiple features are considered together. There is therefore an acute need for algorithms that are widely applicable and can be theoretically proven to work under any probability distribution. In this paper we present two such algorithms, an exact one and a more practical randomized approximate one. We use the observation (first made in Koller and Sahami [2]) that an optimal solution to the problem is a Markov boundary, defined to be a minimal set of features that make the probability distribution of a target variable conditionally invariant to the state of all other features in the domain (a more precise definition is given later in Section 3), and present a family of algorithms for learning the Markov boundary of a target variable in arbitrary domains.
We first introduce a theorem that exactly characterizes the minimal set of features necessary for probabilistically isolating a variable, and then relax this definition to derive a family of algorithms that learn a parameterized approximation of the ideal boundary and that are provably correct under a minimal set of assumptions, including a set of axioms that hold for any probability distribution. In the following section we present related work on feature selection, followed by notation and definitions in Section 3. We subsequently introduce an important theorem and the aforementioned parameterized family of algorithms in Sections 4 and 5 respectively, including a practical anytime version. We evaluate these algorithms in Section 6 and conclude in Section 7.

2 Related Work

Numerous algorithms have been proposed for feature selection. At the highest level, algorithms can be classified as filter, wrapper, or embedded methods. Filter methods work without consulting the classifier (if any) that will make use of their output, i.e., the resulting set of selected features. They therefore typically have wider applicability, since they are not tied to any particular classifier family. In contrast, wrappers make the classifier an integral part of their operation, repeatedly invoking it to evaluate each of a sequence of feature subsets and selecting the subset that results in minimum estimated classification error (for that particular classifier). Finally, embedded algorithms are classifier-learning algorithms that perform feature selection implicitly during their operation, e.g., decision tree learners.

Early work was motivated by the problem of pattern recognition, which inherently involves a large number of features (pixels, regions, signal responses at multiple frequencies etc.). Narendra and Fukunaga [3] first cast feature selection as a problem of maximization of an objective function over the set of features to use, and proposed a number of search approaches including forward selection and backward elimination. Later work by machine learning researchers includes the FOCUS algorithm of Almuallim and Dietterich [4], which is a filter method for deterministic, noise-free domains. The RELIEF algorithm [5] instead uses a randomized selection of data points to update a weight assigned to each feature, selecting the features whose weight exceeds a given threshold. A large number of additional algorithms have appeared in the literature, too many to list here; good surveys are included in Dash and Liu [6], Guyon and Elisseeff [1], and Liu and Motoda [7]. An important concept for feature subset selection is relevance; several notions of relevance are discussed in a number of important papers such as Blum and Langley [8] and Kohavi and John [9]. The argument that the problem of feature selection can be cast as the problem of Markov blanket discovery was first made convincingly in Koller and Sahami [2], who also presented an algorithm for learning an approximate Markov blanket using mutual information. Other algorithms include the GS algorithm [10], originally developed for learning the structure of the Bayesian network of a domain, and extensions to it [11], including the recent MMMB algorithm [12]. Meinshausen and Bühlmann [13] recently proposed an optimal theoretical solution to the problem of learning the neighborhood of a Markov network when the distribution of the domain can be assumed to be a multidimensional Gaussian, i.e., linear relations among features with Gaussian noise.
This assumption implies that the Composition axiom holds in the domain (see Pearl [14] for a definition of Composition); the difference with our work is that we address here the problem in general domains, where it may not necessarily hold.

3 Notation and Preliminaries

In this section we present the notation, fundamental definitions and axioms that will be used in the rest of the paper. We use the terms "feature" and "variable" interchangeably, and denote variables by capital letters (X, Y etc.) and sets of variables by bold letters (S, T etc.). We denote the set of all variables/features in the domain (the "universe") by U.

All algorithms presented are independence-based, learning the Markov boundary of a given target variable using the truth value of a number of conditional independence statements. The use of conditional independence for feature selection subsumes many other criteria proposed in the literature. In particular, the use of classification accuracy of the target variable can be seen as a special case of testing for its conditional independence with some of its predictor variables (conditional on the subset selected at any given moment). A benefit of using conditional independence is that, while classification error estimates depend on the classifier family used, conditional independence does not. In addition, algorithms utilizing conditional independence for feature selection are applicable to all domain types, e.g., discrete, ordinal, continuous with non-linear or arbitrary non-degenerate associations, or mixed domains, as long as a reliable estimate of probabilistic independence is available.

We denote probabilistic independence by the symbol "⊥⊥", i.e., (X ⊥⊥ Y | Z) denotes the fact that the variables in set X are (jointly) conditionally independent of those in set Y given the values of the variables in set Z; (X ⊥̸⊥ Y | Z) denotes their conditional dependence. We assume the existence of a probabilistic independence query oracle that is available to answer any query of the form (X, Y | Z), corresponding to the question "Is the set of variables in X independent of the variables in Y given the value of the variables in Z?" (This is similar to the approach of learning from statistical queries of Kearns and Vazirani [15].) In practice, however, such an oracle does not exist, but it can be approximated by a statistical independence test on a data set. Many tests of independence have appeared and been studied extensively in the statistical literature over the last century; in this work we use the χ² (chi-square) test of independence [16].
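As a concrete illustration of how such an oracle can be approximated in practice, the following is a minimal sketch of a stratified chi-square conditional independence test on discrete data. It is not the implementation used in this paper; the pandas/scipy helpers and the fixed significance level alpha are assumptions for illustration only.

    import pandas as pd
    from scipy.stats import chi2, chi2_contingency

    def is_independent(data: pd.DataFrame, x: str, ys: list, zs: list,
                       alpha: float = 0.05) -> bool:
        """Approximate the oracle query (X ⊥⊥ Y | Z) on discrete data.

        The set Y is treated jointly by forming one compound variable from
        the columns in `ys`; the test is stratified over value combinations
        of Z, summing chi-square statistics and degrees of freedom."""
        df = data.copy()
        df["_y"] = df[ys].astype(str).agg(",".join, axis=1) if ys else ""
        groups = [df] if not zs else [g for _, g in df.groupby(zs)]
        stat, dof = 0.0, 0
        for g in groups:
            table = pd.crosstab(g[x], g["_y"])
            if table.shape[0] < 2 or table.shape[1] < 2:
                continue  # this stratum carries no information about dependence
            s, _, d, _ = chi2_contingency(table)
            stat, dof = stat + s, dof + d
        if dof == 0:
            return True  # no evidence of dependence in any stratum
        p_value = chi2.sf(stat, dof)
        return p_value > alpha  # fail to reject independence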
A Markov blanket of variable X is a set of variables such that, after fixing (by "knowing") the value of all of its members, the set of remaining variables in the domain, taken together as a single set-valued variable, is statistically independent of X. More precisely, we have the following definition.

Definition 1. A set of variables S ⊆ U is called a Markov blanket of variable X if and only if (X ⊥⊥ U − S − {X} | S).

Intuitively, a Markov blanket S of X captures all the information in the remaining domain variables U − S − {X} that can affect the probability distribution of X, making their value redundant as far as X is concerned (given S). The blanket therefore captures the essence of the feature selection problem for target variable X: by completely "shielding" X, a Markov blanket precludes the existence of any possible information about X that can come from variables not in the blanket, making it an ideal solution to the feature selection problem. A minimal Markov blanket is called a Markov boundary.

Definition 2. A set of variables S ⊆ U − {X} is called a Markov boundary of variable X if it is a minimal Markov blanket of X, i.e., none of its proper subsets is a Markov blanket.

Pearl [14] proved that the axioms of Symmetry, Decomposition, Weak Union, and Intersection are sufficient to guarantee a unique Markov boundary. These are shown below together with the axiom of Contraction.

    (Symmetry)       (X ⊥⊥ Y | Z) =⇒ (Y ⊥⊥ X | Z)
    (Decomposition)  (X ⊥⊥ Y ∪ W | Z) =⇒ (X ⊥⊥ Y | Z) ∧ (X ⊥⊥ W | Z)
    (Weak Union)     (X ⊥⊥ Y ∪ W | Z) =⇒ (X ⊥⊥ Y | Z ∪ W)
    (Contraction)    (X ⊥⊥ Y | Z) ∧ (X ⊥⊥ W | Y ∪ Z) =⇒ (X ⊥⊥ Y ∪ W | Z)
    (Intersection)   (X ⊥⊥ Y | Z ∪ W) ∧ (X ⊥⊥ W | Z ∪ Y) =⇒ (X ⊥⊥ Y ∪ W | Z)     (1)

The Symmetry, Decomposition, Contraction and Weak Union axioms are very general: they are necessary axioms for the probabilistic definition of independence, i.e., they hold in every probability distribution, as their proofs are based on the axioms of probability theory. Intersection is not universal but holds in distributions that are positive, i.e., in which any value combination of the domain variables has a non-zero probability of occurring.

4 The Markov Boundary Theorem

According to Definition 2, a Markov boundary is a minimal Markov blanket. We first introduce a theorem that provides an alternative, equivalent definition of the concept of Markov boundary, which we will relax later in the paper to produce a more general boundary definition.

Theorem 1 (Markov Boundary Theorem). Assuming that the Decomposition and Contraction axioms hold, S ⊆ U − {X} is a Markov boundary of variable X ∈ U if and only if

    ∀ T ⊆ U − {X},  T ⊆ U − S ⇐⇒ (X ⊥⊥ T | S − T).     (2)

A detailed proof cannot be included here due to space constraints, but a proof sketch appears in Appendix A. According to the above theorem, a Markov boundary S partitions the powerset of U − {X} into two parts: (a) a set P1 that contains all subsets of U − S, and (b) a set P2 containing the remaining subsets. All sets in P1 are conditionally independent of X, and all sets in P2 are conditionally dependent on X.

Intuitively, the two directions of the logical equivalence of Eq. (2) correspond to the concept of Markov blanket and its minimality. The implication

    ∀ T ⊆ U − {X},  T ⊆ U − S =⇒ (X ⊥⊥ T | S − T)

or, equivalently, ∀ T ⊆ U − S − {X}, (X ⊥⊥ T | S) (as T and S are disjoint), corresponds to the definition of Markov blanket, as it includes T = U − S − {X}. In the opposite direction, the contrapositive form is

    ∀ T ⊆ U − {X},  T ⊄ U − S =⇒ (X ⊥̸⊥ T | S − T).

This corresponds to the concept of minimality of the Markov boundary: it states that all sets that contain a part of S cannot be independent of X given the remainder of S. Informally, this is because if there existed some set T that contained a non-empty subset T′ of S such that (X ⊥⊥ T | S − T), then one would be able to shrink S by T′ (by the property of Contraction), and therefore S would not be minimal (more details in Appendix A).
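Theorem 1 also yields a direct, if exponential-time, procedure for verifying that a candidate set S is a Markov boundary: enumerate every set T ⊆ U − {X} and check that independence holds exactly when T avoids S. The following hypothetical sketch assumes an oracle callable such as the chi-square approximation above; it is an illustration of the theorem, not one of the algorithms of this paper.

    from itertools import chain, combinations

    def is_markov_boundary(universe: set, x: str, s: set, oracle) -> bool:
        """Brute-force check of Eq. (2): for every T ⊆ U − {X},
        T ⊆ U − S  ⇐⇒  (X ⊥⊥ T | S − T).
        `oracle(x, ys, zs)` stands in for the independence query oracle."""
        rest = universe - {x}
        subsets = chain.from_iterable(
            combinations(sorted(rest), r) for r in range(1, len(rest) + 1))
        for t in map(set, subsets):
            independent = oracle(x, sorted(t), sorted(s - t))
            # T ⊆ U − S holds exactly when T shares no variable with S.
            if independent != t.isdisjoint(s):
                return False
        return True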
5 A Family of Algorithms for Arbitrary Domains

Theorem 1 defines conditions that precisely characterize a Markov boundary and can thus be thought of as an alternative definition of a boundary. By relaxing these conditions we can produce a more general definition. In particular, an m-Markov boundary is defined as follows.

Definition 3. A set of variables S ⊆ U − {X} of a domain U is called an m-Markov boundary of variable X ∈ U if and only if

    ∀ T ⊆ U − {X} such that |T| ≤ m,  T ⊆ U − S ⇐⇒ (X ⊥⊥ T | S − T).

We call the parameter m of an m-Markov boundary the Markov boundary margin. Intuitively, an m-boundary S guarantees that (a) all subsets of its complement (excluding X) of size m or smaller are independent of X given S, and (b) all sets T of size m or smaller that are not subsets of its complement are dependent on X given the part of S that is not contained in T. This definition is a special case of the properties of a boundary stated in Theorem 1, with each set T mentioned in the theorem now restricted to having size m or smaller. For m = n − 1, where n = |U|, the condition |T| ≤ m is always satisfied and can be omitted; in this case the definition of an (n − 1)-Markov boundary coincides exactly with Eq. (2) of Theorem 1.

We now present an algorithm called GS(m), shown in Algorithm 1, that provably correctly learns an m-boundary of a target variable X.

Algorithm 1 The abstract GS(m)(X) algorithm. Returns an m-Markov boundary of X.
     1: S ← ∅
     2: /* Growing phase. */
     3: for all Y ⊆ U − S − {X} such that 1 ≤ |Y| ≤ m do
     4:     if (X ⊥̸⊥ Y | S) then
     5:         S ← S ∪ Y
     6:         goto line 3    /* Restart loop. */
     7: /* Shrinking phase. */
     8: for all Y ∈ S do
     9:     if (X ⊥⊥ Y | S − {Y}) then
    10:         S ← S − {Y}
    11:         goto line 8    /* Restart loop. */
    12: return S

GS(m) operates in two phases, a growing and a shrinking phase (hence the acronym). During the growing phase it examines sets of variables of size up to m, where m is a user-specified parameter. During the shrinking phase, single variables are examined for conditional independence and possible removal from S (examining sets in the shrinking phase is not necessary for provably correct operation; see Appendix B). The orders of examination of the sets for possible addition to and deletion from the candidate boundary are left intentionally unspecified in Algorithm 1; one can therefore view it as an abstract representative of a family of algorithms, with each member specifying one such ordering. All members of this family are m-correct, as the proof of correctness does not depend on the ordering. In practice numerous choices for the ordering exist; one possibility is to examine the sets in the growing phase in order of increasing set size and, for each such size, in order of decreasing conditional mutual information I(X, Y, S) between X and Y given S. The rationale for this heuristic choice is that (usually) tests with smaller conditioning sets tend to be more reliable, and sorting by mutual information tends to lessen the chance of adding false members of the Markov boundary. We used this implementation in all our experiments, presented later in Section 6.

Intuitively, the margin represents a trade-off between sample and computational complexity on one hand and completeness on the other: for m = n − 1 = |U| − 1, the algorithm returns a Markov boundary in unrestricted (arbitrary) domains. For 1 ≤ m < n − 1, GS(m) may recover the correct boundary depending on characteristics of the domain. For example, it will recover the correct boundary in domains containing embedded parity functions such that the number of variables involved in every k-bit parity function is m + 1 or less, i.e., if k ≤ m + 1 (parity functions are corner cases in the space of probability distributions that are known to be hard to learn [17]). The proof of m-correctness of GS(m) is included in Appendix B. Note that it is based on Theorem 1 and the universal axioms of Eqs. (1) only, i.e., Intersection is not needed, and thus it is widely applicable (to any domain).
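To make the control flow of Algorithm 1 concrete, here is a minimal Python sketch of the growing/shrinking structure. It is an illustration under assumptions rather than the authors' implementation: the is_dependent oracle stands in for the statistical test, and candidates are scanned in a fixed order instead of the conditional-mutual-information ordering described above.

    from itertools import combinations

    def gs_m(universe: set, x: str, m: int, is_dependent) -> set:
        """Sketch of the abstract GS(m) algorithm (Algorithm 1).
        `is_dependent(x, ys, zs)` approximates the query (X ⊥̸⊥ Y | Z)."""
        s: set = set()
        # Growing phase: add any set Y with |Y| <= m that is dependent on X
        # given S, restarting the scan whenever S changes.
        changed = True
        while changed:
            changed = False
            candidates = sorted(universe - s - {x})
            for size in range(1, m + 1):
                for y in combinations(candidates, size):
                    if is_dependent(x, list(y), sorted(s)):
                        s |= set(y)
                        changed = True
                        break
                if changed:
                    break
        # Shrinking phase: remove single variables independent of X given the
        # rest of S, restarting after each removal.
        changed = True
        while changed:
            changed = False
            for y in sorted(s):
                if not is_dependent(x, [y], sorted(s - {y})):
                    s.remove(y)
                    changed = True
                    break
        return s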
A Practical Randomized Anytime Version

While GS(m) is provably correct even in difficult domains such as those that contain parity functions, it may be impractical with a large number of features, as its asymptotic complexity is O(n^m). We therefore also provide a more practical randomized version called RGS(m,k) (Randomized GS(m)), shown in Algorithm 2.

Algorithm 2 The RGS(m,k)(X) algorithm, a randomized anytime version of the GS(m) algorithm, utilizing k random subsets for the growing phase.
     1: S ← ∅
     2: /* Growing phase. */
     3: repeat
     4:     Schanged ← false
     5:     Y ← subset of U − S − {X} of size 1 ≤ |Y| ≤ m of maximum dependence out of k random subsets
     6:     if (X ⊥̸⊥ Y | S) then
     7:         S ← S ∪ Y
     8:         Schanged ← true
     9: until Schanged = false
    10: /* Shrinking phase. */
    11: for all Y ∈ S do
    12:     if (X ⊥⊥ Y | S − {Y}) then
    13:         S ← S − {Y}
    14:         goto line 11    /* Restart loop. */
    15: return S

The RGS(m,k) algorithm has an additional parameter k that limits its computational requirements: instead of exhaustively examining all possible subsets of U − S − {X} (as GS(m) does), it samples k subsets from the set of all possible subsets of U − S − {X}, where k is user-specified. It is therefore a randomized algorithm that becomes equivalent to GS(m) given a large enough k. Many possibilities for the method of random selection of the subsets exist; in our experiments we select a subset Y = {Yi} (1 ≤ |Y| ≤ m) with probability proportional to ∏_{i=1}^{|Y|} (1/p(X, Yi | S)), where p(X, Yi | S) is the p-value of the corresponding (univariate) test between X and Yi given S, which has a low computational cost.

The RGS(m,k) algorithm is useful in situations where the amount of time available to produce an answer may be limited and/or the limit unknown beforehand: it is easy to show that the growing phase of GS(m) produces an upper bound of the m-boundary of X. Therefore the RGS(m,k) algorithm, if interrupted, will return an approximation of this upper bound. Moreover, if there exists time for the shrinking phase to be executed (which conducts a number of tests linear in n and is thus fast), extraneous variables will be removed and a minimal blanket (boundary) approximation will be returned. These features make it an anytime algorithm, which is a more appropriate choice for situations where critical events may occur that require the interruption of computation, e.g., during the planning phase of a robot, which may be interrupted at any time due to an urgent external event that requires a decision to be made based on the present state's feature values.
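One possible reading of the randomized selection in line 5 of Algorithm 2 is sketched below. The per-variable weighting and the use of a joint p-value to rank the k candidates are assumptions for illustration, since the paper specifies only that a subset is drawn with probability proportional to the product of the inverse univariate p-values.

    import random

    def sample_candidate(universe, s, x, m, k, p_value):
        """Draw k random subsets Y ⊆ U − S − {X}, 1 ≤ |Y| ≤ m, and return the
        one of maximum dependence (minimum joint p-value), per line 5.
        `p_value(x, ys, zs)` is the p-value of the test between X and Y given Z."""
        pool = sorted(universe - s - {x})
        if not pool:
            return None, 1.0
        # Weight each variable by 1 / p(X, Yi | S): a small univariate p-value
        # (strong marginal dependence) makes the variable more likely to be drawn.
        weights = [1.0 / max(p_value(x, [v], s), 1e-12) for v in pool]
        best, best_p = None, float("inf")
        for _ in range(k):
            size = random.randint(1, min(m, len(pool)))
            # random.choices samples with replacement; duplicates collapse in
            # the set, so |Y| may come out smaller than `size`.
            y = set(random.choices(pool, weights=weights, k=size))
            p = p_value(x, sorted(y), s)
            if p < best_p:
                best, best_p = y, p
        return best, best_p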
6 Experiments

We evaluated the GS(m) and RGS(m,k) algorithms on synthetic as well as real-world and benchmark data sets. We first systematically examined performance on the task of recovering near-parity functions, which are known to be hard to learn [17], comparing GS(m) and RGS(m,k) with respect to accuracy of recovery of the original boundary as well as computational cost. We generated domains of sizes ranging from 10 to 100 variables, of which 4 variables (X1 to X4) were related through a near-parity relation with bit probability 0.60 and various degrees of noise. The remaining independent variables (X5 to Xn) act as "distractors" and had randomly assigned probabilities, i.e., the correct boundary of X1 is B1 = {X2, X3, X4}. In such domains, learning the boundary of X1 is difficult because of the large number of distractors and because each Xi ∈ B1 is independent of X1 given any proper subset of B1 − {Xi} (they only become dependent when all of them are included in the conditioning set).

To measure an algorithm's feature selection performance, accuracy (fraction of variables correctly included or excluded) is inappropriate, as the accuracy of trivial algorithms such as returning the empty set will tend to 1 as n increases. Precision and recall are therefore more appropriate, with precision defined as the fraction of features returned that are in the correct boundary (3 features for X1), and recall as the fraction of the features present in the correct boundary that are returned by the algorithm. A convenient and frequently used measure that combines precision and recall is the F1 measure, defined as the harmonic mean of precision and recall [18].

In Fig. 1 (top) we report 95% confidence intervals for the F1 measure and execution time of GS(m) (margins m = 1 to 3) and RGS(m,k) (margins 1 to 3 and k = 1000 random subsets), using 20 data sets containing 10 to 100 variables, with the target variable X1 perturbed (inverted) by noise with 10% probability. As can be seen, RGS(m,k) and GS(m) using the same value for the margin perform comparably with respect to F1, up to their 95% confidence intervals. With respect to execution time, however, RGS(m,k) exhibits much greater scalability (Fig. 1 bottom, log scale); for example, it executes in about 10 seconds on average in domains containing 100 variables, while GS(m) executes in 1,000 seconds on average for this domain size.

[Figure 1: GS(m) and RGS(m,k) performance with respect to domain size (number of variables), for true Markov boundary size 3, 1,000 data points, Bernoulli probability 0.6 and noise probability 0.1. Top: F1 measure, reflecting accuracy. Bottom: Execution time in seconds (log scale).]
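For concreteness, the precision/recall/F1 computation described above amounts to a few lines; this helper is illustrative and not taken from the experimental code.

    def f1_score(returned: set, true_boundary: set) -> float:
        """Harmonic mean of precision and recall of a returned feature set."""
        if not returned or not true_boundary:
            return 0.0
        hits = len(returned & true_boundary)
        precision = hits / len(returned)
        recall = hits / len(true_boundary)
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    # e.g., f1_score({"X2", "X3", "X5"}, {"X2", "X3", "X4"}) -> 0.666...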
We also compared GS(m) and RGS(m,k) to RELIEF [5], a well-known algorithm for feature selection that is known to be able to recover parity functions in certain cases [5]. RELIEF learns a weight for each variable and compares it to a threshold τ to decide on its inclusion in the set of relevant variables. As it has been reported [9] that RELIEF can exhibit large variance due to randomization that is necessary only for very large data sets, we instead used a deterministic variant called RELIEVED [9], whose behavior corresponds to RELIEF at the limit of infinite execution time. We calculated the F1 measure for GS(m), RGS(m,k) and RELIEVED in the presence of varying amounts of noise, with noise probability ranging from 0 (no noise) to 0.4. We used domains containing 50 variables, as GS(m) becomes computationally demanding in larger domains. In Figure 2 (left) we show the performance of GS(m) and RGS(m,k) for m equal to 1 and 3, k = 1000, and RELIEVED for thresholds τ = 0.01 and 0.03, for various amounts of noise on the target variable. Again, each experiment was repeated 20 times to generate 95% confidence intervals. We can observe that even though m = 1 (equivalent to the GS algorithm) performs poorly, increasing the margin m makes it more likely to recover the correct Markov boundary, and GS(3) (m = 3) recovers the exact blanket even with few (1,000) data points. RELIEVED does comparably to GS(3) for little noise and for a large threshold, but appears to deteriorate for more noisy domains. As we can see, it is difficult to choose the "right" threshold for RELIEVED: a better-performing τ at low noise can become worse in noisy environments; in particular, small τ tends to include irrelevant variables while large τ tends to miss actual members.

We also evaluated GS(m), RGS(m,k) and RELIEVED on benchmark and real-world data sets from the UCI Machine Learning repository (Balance scale, Balloons, Car evaluation, Credit screening, Monks, Nursery, Tic-tac-toe, Breast cancer, Chess, and Audiology). As the true Markov boundary for these is impossible to know, we used as performance measure a measure of the probabilistic isolation, by the returned Markov boundary, of subsets outside the boundary. For each domain variable X, we measured the independence of subsets Y of size 1, 2 and 3 given the blanket S of X returned by GS(3) and by RELIEVED for τ = 0.03 (as this value seemed to do better in the previous set of experiments), as measured by the average p-value of the χ² test between X and Y given S (with p-values of 0 and 1 indicating ideal dependence and independence, respectively). Due to the large number of subsets outside the boundary when the boundary is small, we limited the estimation of isolation performance to 2,000 subsets per variable. We plot the results in Figure 2 (middle and right); each point represents a variable in the corresponding data set. Points under the diagonal indicate better probabilistic isolation performance for that variable for GS(3) compared to RELIEVED (middle plot) or to RGS(3,1000) (right plot).

[Figure 2: Left: F1 measure of GS(m), RGS(m,k) and RELIEVED under increasing amounts of noise. Middle: Probabilistic isolation performance comparison between GS(3) and RELIEVED on real-world and benchmark data sets. Right: Same for GS(3) and RGS(3,1000).]

To obtain a statistically significant comparison we used the non-parametric Wilcoxon paired signed-rank test, which indicated that GS(3) and RGS(3,1000) are statistically equivalent to each other, while both outperformed RELIEVED at the 99.99% significance level (α < 10⁻⁷).
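A comparison of this kind can be reproduced in outline with scipy's Wilcoxon signed-rank test; the arrays below are placeholder values, not the paper's measurements.

    from scipy.stats import wilcoxon

    # Paired per-variable isolation measures (average p-values) for two methods;
    # illustrative placeholders, one entry per domain variable.
    gs3_isolation = [0.91, 0.88, 0.95, 0.80, 0.97]
    relieved_isolation = [0.85, 0.70, 0.96, 0.62, 0.90]

    # One-sided test: does GS(3) yield larger isolation (closer to independence)?
    stat, p = wilcoxon(gs3_isolation, relieved_isolation, alternative="greater")
    print(f"Wilcoxon statistic = {stat}, p-value = {p:.4f}")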
7 Conclusion

In this paper we presented algorithms for the problem of feature selection in unrestricted (arbitrary distribution) domains that may contain complex interactions that only appear when the values of multiple features are considered together. We introduced two algorithms: an exact, provably correct one as well as a more practical randomized anytime version, and evaluated them on artificial, benchmark and real-world data, demonstrating that they perform well, even in the presence of noise. We also introduced the Markov Boundary Theorem that precisely characterizes the properties of a boundary, and used it to prove m-correctness of the exact family of algorithms presented. We made minimal assumptions that consist of only a general set of axioms that hold for every probability distribution, giving our algorithms universal applicability.

Appendix A: Proof sketch of the Markov Boundary Theorem

Proof sketch. (=⇒ direction) We need to prove that if S is a Markov boundary of X then (a) for every set T ⊆ U − S − {X}, (X ⊥⊥ T | S − T), and (b) for every set T′ ⊄ U − S that does not contain X, (X ⊥̸⊥ T′ | S − T′). Case (a) is immediate from the definition of the boundary and Decomposition. Case (b) can be proven by contradiction: assuming the independence of a set T′ that contains a non-empty part T′1 in S and a part T′2 in U − S, we get (from Decomposition) (X ⊥⊥ T′1 | S − T′1). We can then use Contraction to show that the set S − T′1 satisfies the independence property of a Markov boundary, i.e., that (X ⊥⊥ U − (S − T′1) − {X} | S − T′1), which contradicts the assumption that S is a boundary (and thus minimal).

(⇐= direction) We need to prove that if Eq. (2) holds, then S is a minimal Markov blanket. The proof that S is a blanket is immediate. We can prove minimality by contradiction: assume S = S1 ∪ S2 with S1 a blanket and S2 ≠ ∅, i.e., S1 is a blanket strictly smaller than S. Then (X ⊥⊥ S2 | S1) = (X ⊥⊥ S2 | S − S2). However, since S2 ⊄ U − S, from Eq. (2) we get (X ⊥̸⊥ S2 | S − S2), which is a contradiction.

Appendix B: Proof of m-Correctness of GS(m)

Let the value of the set S at the end of the growing phase be SG, its value at the end of the shrinking phase SS, and their difference S∆ = SG − SS. The following two observations are immediate.

Observation 1. For every Y ⊆ U − SG − {X} such that 1 ≤ |Y| ≤ m, (X ⊥⊥ Y | SG).

Observation 2. For every Y ∈ SS, (X ⊥̸⊥ Y | SS − {Y}).

Lemma 2. Consider variables Y1, Y2, . . . , Yt for some t ≥ 1 and let Y = {Y1, . . . , Yt}. Assuming that Contraction holds, if (X ⊥⊥ Yi | S − {Y1, . . . , Yi}) for all i = 1, . . . , t, then (X ⊥⊥ Y | S − Y).

Proof. By induction on Yj, j = 1, 2, . . . , t, using Contraction to decrease the conditioning set S down to S − {Y1, . . . , Yi} for all i = 1, 2, . . . , t. Since Y = {Y1, . . . , Yt}, we immediately obtain the desired relation (X ⊥⊥ Y | S − Y).

Lemma 2 can be used to show that the variables found individually independent of X during the shrinking phase are actually jointly independent of X, given the final set SS.
Let S∆ = {Y1, Y2, . . . , Yt} be the set of variables removed (in that order) from SG to form the final set SS, i.e., S∆ = SG − SS. Using the above lemma, the following is immediate.

Corollary 3. Assuming that the Contraction axiom holds, (X ⊥⊥ S∆ | SS).

Lemma 4. If the Contraction, Decomposition and Weak Union axioms hold, then for every set T ⊆ U − SG − {X} such that (X ⊥⊥ T | SG),

    (X ⊥⊥ T ∪ (SG − SS) | SS).     (3)

Furthermore, SS is minimal, i.e., there does not exist a proper subset of SS for which Eq. (3) is true.

Proof. From Corollary 3, (X ⊥⊥ S∆ | SS). Also, by the hypothesis, (X ⊥⊥ T | SG) = (X ⊥⊥ T | SS ∪ S∆), where S∆ = SG − SS as usual. From these two relations and Contraction we obtain (X ⊥⊥ T ∪ S∆ | SS).

To prove minimality, let us assume that SS ≠ ∅ (if SS = ∅ then it is already minimal). We prove by contradiction: assume that there exists a set S′ ⊂ SS such that (X ⊥⊥ T ∪ (SG − S′) | S′). Let W = SS − S′ ≠ ∅. Note that W and S′ are disjoint. We have that

    SS ⊆ SS ∪ S∆
    =⇒ SS − S′ ⊆ SS ∪ S∆ − S′ ⊆ T ∪ (SS ∪ S∆ − S′)
    =⇒ W ⊆ T ∪ (SS ∪ S∆ − S′) = T ∪ (SG − S′).

• Since (X ⊥⊥ T ∪ (SG − S′) | S′) and W ⊆ T ∪ (SS ∪ S∆ − S′), from Decomposition we get (X ⊥⊥ W | S′).
• From (X ⊥⊥ W | S′) and Weak Union we have that for every Y ∈ W, (X ⊥⊥ Y | S′ ∪ (W − {Y})).
• Since S′ and W are disjoint and Y ∈ W, Y ∉ S′. Applying the set identity A ∪ (B − C) = (A ∪ B) − (C − A) to S′ ∪ (W − {Y}), we obtain S′ ∪ (W − {Y}) = (S′ ∪ W) − ({Y} − S′) = SS − {Y}.
• Therefore, ∀ Y ∈ W, (X ⊥⊥ Y | SS − {Y}).

However, at the end of the shrinking phase, all variables Y in SS (and therefore in W, as W ⊆ SS) have been evaluated for independence and found dependent (Observation 2). Thus, since W ≠ ∅, there exists at least one Y such that (X ⊥̸⊥ Y | SS − {Y}), producing a contradiction.

Theorem 5. Assuming that the Contraction, Decomposition, and Weak Union axioms hold, Algorithm 1 is m-correct with respect to X.

Proof. We use the Markov Boundary Theorem. We first prove that

    ∀ T ⊆ U − {X} such that |T| ≤ m,  T ⊆ U − SS =⇒ (X ⊥⊥ T | SS − T)

or, equivalently, ∀ T ⊆ U − SS − {X} such that |T| ≤ m, (X ⊥⊥ T | SS). Since U − SS − {X} = S∆ ∪ (U − SG − {X}), and S∆ and U − SG − {X} are disjoint, there are three kinds of sets of size m or less to consider: (i) all sets T ⊆ S∆, (ii) all sets T ⊆ U − SG − {X}, and (iii) all sets (if any) T = T′ ∪ T′′, T′ ∩ T′′ = ∅, that have a non-empty part T′ ⊆ S∆ and a non-empty part T′′ ⊆ U − SG − {X}.

(i) From Corollary 3, (X ⊥⊥ S∆ | SS). Therefore, from Decomposition, for any set T ⊆ S∆, (X ⊥⊥ T | SS).

(ii) By Observation 1, for every set T ⊆ U − SG − {X} such that |T| ≤ m, (X ⊥⊥ T | SG). By Lemma 4 we get (X ⊥⊥ T ∪ S∆ | SS), from which we obtain (X ⊥⊥ T | SS) by Decomposition.

(iii) Since |T| ≤ m, we have |T′′| ≤ m. Since T′′ ⊆ U − SG − {X}, by Observation 1, (X ⊥⊥ T′′ | SG). Therefore, by Lemma 4, (X ⊥⊥ T′′ ∪ S∆ | SS). Since T′ ⊆ S∆ implies T′′ ∪ T′ ⊆ T′′ ∪ S∆, we apply Decomposition to obtain (X ⊥⊥ T′′ ∪ T′ | SS) = (X ⊥⊥ T | SS).

To complete the proof we need to prove that

    ∀ T ⊆ U − {X} such that |T| ≤ m,  T ⊄ U − SS =⇒ (X ⊥̸⊥ T | SS − T).

Let T = T1 ∪ T2, with T1 ⊆ SS and T2 ⊆ U − SS. Since T ⊄ U − SS, T1 contains at least one variable Y ∈ SS. From Observation 2, (X ⊥̸⊥ Y | SS − {Y}). From this and (the contrapositive of) Weak Union, we get (X ⊥̸⊥ {Y} ∪ (T1 − {Y}) | SS − {Y} − (T1 − {Y})) = (X ⊥̸⊥ T1 | SS − T1).
From (the contrapositive of) Decomposition we get (X ⊥̸⊥ T1 ∪ T2 | SS − T1) = (X ⊥̸⊥ T | SS − T1), which is equal to (X ⊥̸⊥ T | SS − T1 − T2) = (X ⊥̸⊥ T | SS − T), as SS and T2 are disjoint.

References

[1] Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.
[2] Daphne Koller and Mehran Sahami. Toward optimal feature selection. In Proceedings of the Tenth International Conference on Machine Learning (ICML), pages 284–292, 1996.
[3] P. M. Narendra and K. Fukunaga. A branch and bound algorithm for feature subset selection. IEEE Transactions on Computers, C-26(9):917–922, 1977.
[4] H. Almuallim and T. G. Dietterich. Learning with many irrelevant features. In Proceedings of the National Conference of the American Association for Artificial Intelligence (AAAI), 1991.
[5] K. Kira and L. A. Rendell. The feature selection problem: Traditional methods and a new algorithm. In Proceedings of the National Conference of the American Association for Artificial Intelligence (AAAI), pages 129–134, 1992.
[6] M. Dash and H. Liu. Feature selection for classification. Intelligent Data Analysis, 1(3):131–156, 1997.
[7] Huan Liu and Hiroshi Motoda, editors. Feature Extraction, Construction and Selection: A Data Mining Perspective, volume 453 of The Springer International Series in Engineering and Computer Science. 1998.
[8] Avrim Blum and Pat Langley. Selection of relevant features and examples in machine learning. Artificial Intelligence, 97(1-2):245–271, 1997.
[9] R. Kohavi and G. H. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273–324, 1997.
[10] Dimitris Margaritis and Sebastian Thrun. Bayesian network induction via local neighborhoods. In Advances in Neural Information Processing Systems 12 (NIPS), 2000.
[11] I. Tsamardinos, C. Aliferis, and A. Statnikov. Algorithms for large scale Markov blanket discovery. In Proceedings of the 16th International FLAIRS Conference, 2003.
[12] I. Tsamardinos, C. Aliferis, and A. Statnikov. Time and sample efficient discovery of Markov blankets and direct causal relations. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 673–678, 2003.
[13] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34:1436–1462, 2006.
[14] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
[15] Michael Kearns and Umesh V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.
[16] A. Agresti. Categorical Data Analysis. John Wiley and Sons, 1990.
[17] M. Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM, 45(6):983–1006, 1998.
[18] C. J. van Rijsbergen. Information Retrieval. Butterworth-Heinemann, London, 1979.

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract In this paper we address the problem of provably correct feature selection in arbitrary domains. [sent-3, score-0.2]

2 An optimal solution to the problem is a Markov boundary, which is a minimal set of features that make the probability distribution of a target variable conditionally invariant to the state of all other features in the domain. [sent-4, score-0.23]

3 We introduce two algorithms: an exact, provably correct one as well a more practical randomized anytime version, and show that they perform well on artificial as well as benchmark and real-world data sets. [sent-7, score-0.226]

4 Throughout the paper we make minimal assumptions that consist of only a general set of axioms that hold for every probability distribution, which gives these algorithms universal applicability. [sent-8, score-0.299]

5 1 Introduction and Motivation The problem of feature selection has a long history due to its significance in a wide range of important problems, from early ones like pattern recognition to recent ones such as text categorization, gene expression analysis and others. [sent-9, score-0.092]

6 Thus, selecting a subset of features of the domain for use in subsequent application of machine learning algorithms has become a standard preprocessing step. [sent-11, score-0.133]

7 A typical task of these algorithms is learning a classifier: Given a number of input features and a quantity of interest, called the target variable, choose a member of a family of classifiers that can predict the target variable’s value as well as possible. [sent-12, score-0.168]

8 Another task is understanding the domain and the quantities that interact with the target quantity. [sent-13, score-0.092]

9 In particular, it is known that many can fail under certain probability distributions such as ones that contain a (near) parity function [1], which contain interactions that only appear when the values of multiple features are considered together. [sent-16, score-0.098]

10 We subsequently introduce an important theorem and the aforementioned parameterized family of algorithms in Sections 4 and 5 respectively, including a practical anytime version. [sent-22, score-0.141]

11 Finally, embedded algorithms are classifier-learning algorithms that perform feature selection implicitly during their operation e. [sent-31, score-0.167]

12 Narendra and Fukunaga [3] first cast feature selection as a problem of maximization of an objective function over the set of features to use, and proposed a number of search approaches including forward selection and backward elimination. [sent-36, score-0.179]

13 The RELIEF algorithm [5] instead uses a randomized selection of data points to update a weight assigned to each feature, selecting the features whose weight exceeds a given threshold. [sent-38, score-0.135]

14 An important concept for feature subset selection is relevance. [sent-40, score-0.11]

15 The argument that the problem of feature selection can be cast as the problem of Markov blanket discovery was first made convincingly in Koller and Sahami [2], who also presented an algorithm for learning an approximate Markov blanket using mutual information. [sent-42, score-0.506]

16 This assumption implies that the Composition axiom holds in the domain (see Pearl [14] for a definition of Composition); the difference with our work is that we address here the problem in general domains where it may not necessarily hold. [sent-47, score-0.141]

17 3 Notation and Preliminaries In this section we present notation, fundamental definitions and axioms that will be subsequently used in the rest of the paper. [sent-48, score-0.163]

18 All algorithms presented are independence-based, learning the Markov boundary of a given target variable using the truth value of a number of conditional independence statements. [sent-53, score-0.472]

19 The use of conditional independence for feature selection subsumes many other criteria proposed in the literature. [sent-54, score-0.197]

20 In particular, the use of classification accuracy of the target variable can be seen as a special case of testing for its conditional independence with some of its predictor variables (conditional on the subset selected at any given moment). [sent-55, score-0.238]

21 A benefit of using conditional independence is that, while classification error estimates depend on the classifier family used, conditional independence does not. [sent-56, score-0.238]

22 In addition, algorithms utilizing conditional independence for feature selection are applicable to all domain types, 2 e. [sent-57, score-0.279]

23 , discrete, ordinal, continuous with non-linear or arbitrary non-degenerate associations or mixed domains, as long as a reliable estimate of probabilistic independence is available. [sent-59, score-0.118]

24 We denote probabilistic independence by the symbol “ ⊥ ” i. [sent-60, score-0.101]

25 , (X⊥ Y | Z) denotes the fact ⊥ ⊥ that the variables in set X are (jointly) conditionally independent from those in set Y given the values of the variables in set Z; (X⊥ Y | Z) denotes their conditional dependence. [sent-62, score-0.147]

26 We assume ⊥ the existence of a probabilistic independence query oracle that is available to answer any query of the form (X, Y | Z), corresponding to the question “Is the set of variables in X independent of the variables in Y given the value of the variables in Z? [sent-63, score-0.239]

27 Many tests of independence have appeared and studied extensively in the statistical literature over the last century; in this work we use the χ2 (chi-square) test of independence [16]. [sent-66, score-0.16]

28 A Markov blanket of variable X is a set of variables such that, after fixing (by “knowing”) the value of all of its members, the set of remaining variables in the domain, taken together as a single setvalued variable, are statistically independent of X. [sent-67, score-0.329]

29 A set of variables S ⊆ U is called a Markov blanket of variable X if and only if (X⊥ U − S − {X} | S). [sent-70, score-0.283]

30 ⊥ Intuitively, a Markov blanket S of X captures all the information in the remaining domain variables U − S − {X} that can affect the probability distribution of X, making their value redundant as far as X is concerned (given S). [sent-71, score-0.306]

31 A minimal Markov blanket is called a Markov boundary. [sent-73, score-0.272]

32 A set of variables S ⊆ U − {X} is called a Markov boundary of variable X if it is a minimal Markov blanket of X i. [sent-75, score-0.617]

33 Pearl [14] proved that that the axioms of Symmetry, Decomposition, Weak Union, and Intersection are sufficient to guarantee a unique Markov boundary. [sent-78, score-0.163]

34 , they hold in every probability distribution, as their proofs are based on the axioms of probability theory. [sent-82, score-0.184]

35 , any value combination of the domain variables has a non-zero probability of occurring. [sent-85, score-0.099]

36 4 The Markov Boundary Theorem According to Definition 2, a Markov boundary is a minimal Markov blanket. [sent-86, score-0.334]

37 We first introduce a theorem that provides an alternative, equivalent definition of the concept of Markov boundary that we will relax later in the paper to produce a more general boundary definition. [sent-87, score-0.567]

38 Assuming that the Decomposition and Contraction axioms hold, S ⊆ U − {X} is a Markov boundary of variable X ∈ U if and only if ∀ T ⊆ U − {X}, T ⊆ U − S ⇐⇒ (X⊥ T | S − T) . [sent-89, score-0.462]

39 According to the above theorem, a Markov boundary S partitions the powerset of U − {X} into two parts: (a) set P1 that contains all subsets of U − S, and (b) set P2 containing the remaining subsets. [sent-91, score-0.326]

40 All sets in P1 are conditionally independent of X, and all sets in P2 are conditionally dependent with X. [sent-92, score-0.1]

41 (2) correspond to the concept of Markov blanket and its minimality i. [sent-94, score-0.261]

42 5 A Family of Algorithms for Arbitrary Domains Theorem 1 defines conditions that precisely characterize a Markov boundary and thus can be thought of as an alternative definition of a boundary. [sent-106, score-0.269]

43 In particular, an m-Markov boundary is defined as follows. [sent-108, score-0.269]

44 A set of variables S ⊆ U − {X} of a domain U is called an m-Markov boundary of variable X ∈ U if and only if ∀ T ⊆ U − {X} such that |T| ≤ m, T ⊆ U − S ⇐⇒ (X⊥ T | S − T) . [sent-110, score-0.398]

45 ⊥ We call the parameter m of an m-Markov boundary the Markov boundary margin. [sent-111, score-0.538]

46 Intuitively, an m-boundary S guarantees that (a) all subsets of its complement (excluding X) of size m or smaller are independent of X given S, and (b) all sets T of size m or smaller that are not subsets of its complement are dependent of X given the part of S that is not contained in T. [sent-112, score-0.134]

47 This definition is a special case of the properties of a boundary stated in Theorem 1, with each set T mentioned in the theorem now restricted to having size m or smaller. [sent-113, score-0.298]

48 For m = n − 1, where n = |U |, the condition |T| ≤ m is always satisfied and can be omitted; in this case the definition of an (n − 1)-Markov boundary results in exactly Eq. [sent-114, score-0.269]

49 We now present an algorithm called GS(m) , shown in Algorithm 1, that provably correctly learns an m-boundary of a target variable X. [sent-116, score-0.124]

50 GS(m) operates in two phases, a growing and a shrinking phase (hence the acronym). [sent-117, score-0.191]

51 During the growing phase it examines sets of variables of size up to m, where m is a user-specified parameter. [sent-118, score-0.165]

52 During the shrinking phase, single variables are examined for conditional independence and possible removal from S (examining sets in the shrinking phase is not necessary for provably correct operation—see Appendix B). [sent-119, score-0.494]

53 The orders of examination of the sets for possible addition and deletion from the candidate boundary are left intentionally unspecified in Algorithm 1—one can therefore view it as an abstract representative of a family of algorithms, with each member specifying one such ordering. [sent-120, score-0.317]

54 In practice numerous choices for the ordering exist; one possibility is to examine the sets in the growing phase in order of increasing set size and, for each such size, in order of decreasing conditional mutual information I(X, Y, S) between X and Y given S. [sent-122, score-0.162]

55 For 1 ≤ m < n − 1, GS(m) may recover the correct boundary depending on characteristics of the domain. [sent-130, score-0.305]

56 For example, it will recover the correct boundary in domains containing embedded parity functions such that the number of variables involved in every k-bit parity function is m + 1 or less i. [sent-131, score-0.55]

57 Note that it is based on Theorem 1 and the universal axioms of Eqs. [sent-135, score-0.184]

58 A Practical Randomized Anytime Version While GS(m) is provably correct even in difficult domains such as those that contain parity functions, it may be impractical with a large number of features as its asymptotic complexity is O(nm ). [sent-139, score-0.241]

59 The RGS(m,k) algorithm has an additional parameter k that limits its computational requirements: instead of exhaustively examining all possible subsets of (U −S−{X}) (as GS(m) does), it instead samples k subsets from the set of all possible subsets of (U − S − {X}), where k is user-specified. [sent-141, score-0.171]

60 The RGS(m,k) algorithm is useful in situations where the amount of time to produce an answer may be limited and/or the limit unknown beforehand: it is easy to show that the growing phase of GS(m) produces an an upper-bound of the m-boundary of X. [sent-144, score-0.099]

61 Moreover, if there exists time for the shrinking phase to be executed (which conducts a number of tests linear in n and is thus fast), extraneous variables will be removed and a minimal blanket (boundary) approximation will be returned. [sent-146, score-0.458]

62 These features make it an anytime algorithm, which is a more appropriate choice for situations where critical events may occur that require the interruption of computation, e. [sent-147, score-0.088]

63 , during the planning phase of a robot, which may be interrupted at any time due to an urgent external event that requires a decision to be made based on the present state’s feature values. [sent-149, score-0.113]

64 We compared GS(m) and RGS(m,k) with respect to accuracy of recovery of the original boundary as well as computational cost. [sent-152, score-0.269]

65 We generated domains of sizes ranging from 10 to 100 variables, of which 4 variables (X1 to X4 ) were related through a near-parity relation with bit probability 0. [sent-153, score-0.098]

66 4 Probabilistic isolation performance of GS(m) and RELIEVED Probabilistic isolation performance of GS(m) and RGS(m ,k) Real-world and benchmark data sets 1 Data set Balance scale Balloons Car evaluation Credit screening Monks Nursery Tic-tac-toe Breast cancer Chess Audiology 0. [sent-165, score-0.281]

67 8 1 Real-world and benchmark data sets RGS(m = 3, k = 300) average isolation measure F1 measure 50 variables, true Markov boundary size = 3 Bernoulli probability = 0. [sent-173, score-0.451]

68 03) average isolation measure F1 measure of GS(m ), RGS(m, k ) and RELIEVED vs. [sent-175, score-0.13]

69 8 1 average isolation measure Figure 2: Left: F1 measure of GS(m) , RGS(m,k) and RELIEVED under increasing amounts of noise. [sent-196, score-0.13]

70 Middle: Probabilistic isolation performance comparison between GS(3) and RELIEVED on real-world and benchmark data sets. [sent-197, score-0.124]

71 , the correct boundary of X1 is B1 = {X2 , X3 , X4 }. [sent-201, score-0.305]

72 In such domains, learning the boundary of X1 is difficult because of the large number of distractors and because each Xi ∈ B1 is independent of X1 given any proper subset of B1 − {Xi } (they only become dependent when including all of them in the conditioning set). [sent-202, score-0.287]

73 To measure an algorithm’s feature selection performance, acF -measure of GS and RGS vs. [sent-203, score-0.111]

74 domain size curacy (fraction of variables correctly included or excluded) is inappropriate as the accuracy of trivial algorithms such as returning the empty set will tend to 1 as n increases. [sent-204, score-0.128]

75 Precision and recall are therefore more appropriate, with precision defined as the fraction of features returned that are in the correct boundary (3 features for X1 ), and recall as the fraction of the features present in the correct boundary that are returned by the algorithm. [sent-205, score-0.751]

76 A convenient and frequently used Number of domain variables measure that combines precision and recall is the F1 meaRunning time of GS and RGS vs. [sent-206, score-0.118]

77 As can be seen, the RGS(m,k) and GS(m) using the same value for margin perform comparably Number of domain variables with respect to F1 , up to their 95% confidence intervals. [sent-210, score-0.099]

78 We also compared GS(m) and RGS(m,k) to RELIEF [5], a well-known algorithm for feature selection that is known to be able to recover parity functions in certain cases [5]. [sent-216, score-0.157]

79 We can observe that even though m = 1 (equivalent to the GS algorithm) performs poorly, increasing the margin m makes it more likely to recover the correct Markov boundary, and GS(3) (m = 3) recovers the exact blanket even with few (1,000) data points. [sent-226, score-0.243]

80 RELIEVED does comparably to GS(3) for little noise and for a large threshold, (m ) (m, k ) 1 True Markov boundary size = 3, 1000 data points Bernoulli probability = 0. [sent-227, score-0.293]

81 1 0 10 20 30 40 50 60 (m ) 70 80 90 100 90 100 (m, k ) True Markov boundary size = 3, 1000 data points Bernoulli probability = 0. [sent-238, score-0.269]

82 As we can see it is difficult to choose the “right” threshold for RELIEVED—better performing τ at low noise can become worse in noisy environments; in particular, small τ tend to include irrelevant variables while large τ tend to miss actual members. [sent-243, score-0.091]

83 As the true Markov boundary for these is impossible to know, we used as performance measure a measure of probabilistic isolation by the Markov boundary returned of subsets outside the boundary. [sent-245, score-0.767]

84 For each domain variable X, we measured the independence of subsets Y of size 1, 2 and 3 given the blanket S of X returned by GS(3) and RELIEVED for τ = 0. [sent-246, score-0.448]

85 Due to the large number of subsets outside the boundary when the boundary is small, we limited the estimation of isolation performance to 2,000 subsets per variable. [sent-248, score-0.744]

86 Points under the diagonal indicate better probabilistic isolation performance for that variable for GS(3) compared to RELIEVED (middle plot) or to RGS(3,1000) (right plot). [sent-251, score-0.143]

87 7 Conclusion In this paper we presented algorithms for the problem of feature selection in unrestricted (arbitrary distribution) domains that may contain complex interactions that only appear when the values of multiple features are considered together. [sent-254, score-0.228]

88 We introduced two algorithms: an exact, provably correct one as well a more practical randomized anytime version, and evaluated them on on artificial, benchmark and real-world data, demonstrating that they perform well, even in the presence of noise. [sent-255, score-0.226]

89 We also introduced the Markov Boundary Theorem that precisely characterizes the properties of a boundary, and used it to prove m-correctness of the exact family of algorithms presented. [sent-256, score-0.102]

90 We made minimal assumptions that consist of only a general set of axioms that hold for every probability distribution, giving our algorithms universal applicability. [sent-257, score-0.299]

91 (=⇒ direction) We need to prove that if S is a Markov boundary of X then (a) for every set T ⊆ U − S − {X}, (X⊥ T | S − T), and (b) for every set T′ ⊆ U − S that does not ⊥ contain X, (X ⊥ T′ | S − T′ ). [sent-259, score-0.293]

92 Case (a) is immediate from the definition of the boundary and the ⊥ Decomposition theorem. [sent-260, score-0.269]

93 , that (X⊥ U − (S − T′ ) − {X} | S − T′ ), which ⊥ 1 1 contradicts the assumption that S is a boundary (and thus minimal). [sent-264, score-0.269]

94 We can prove minimality by contradiction: Assume S = S1 ∪ S2 with S1 a blanket and S2 = ∅ i. [sent-268, score-0.285]

95 Appendix B: Proof of m-Correctness of GS(m) Let the value of the set S at the end of the growing phase be SG , its value at the end of the shrinking phase SS , and their difference S∆ = SG − SS . [sent-274, score-0.239]

96 ⊥ 7 Lemma 2 can be used to show that the variables found individually independent of X during the shrinking phase are actually jointly independent of X, given the final set SS . [sent-298, score-0.186]

97 If the Contraction, Decomposition and Weak Union axioms hold, then for every set T ⊆ U − SG − {X} such that (X⊥ T | SG ), ⊥ (X⊥ T ∪ (SG − SS ) | SS ). [sent-309, score-0.163]

98 ⊥ However, at the end of the shrinking phase, all variables Y in SS (and therefore in W, as W ⊆ SS ) have been evaluated for independence and found dependent (Observation 2). [sent-327, score-0.218]

Theorem 5. Assuming that the Contraction, Decomposition, and Weak Union axioms hold, Algorithm 1 is m-correct with respect to X.

[5] K. Kira and L. A. Rendell. The feature selection problem: Traditional methods and a new algorithm. In Proceedings of the National Conference of the American Association for Artificial Intelligence (AAAI), pages 129–134, 1992.



2 0.18276678 256 nips-2009-Which graphical models are difficult to learn?

Author: Andrea Montanari, Jose A. Pereira

Abstract: We consider the problem of learning the structure of Ising models (pairwise binary Markov random fields) from i.i.d. samples. While several methods have been proposed to accomplish this task, their relative merits and limitations remain somewhat obscure. By analyzing a number of concrete examples, we show that low-complexity algorithms systematically fail when the Markov random field develops long-range correlations. More precisely, this phenomenon appears to be related to the Ising model phase transition (although it does not coincide with it). 1 Introduction and main results Given a graph G = (V = [p], E), and a positive parameter θ > 0 the ferromagnetic Ising model on G is the pairwise Markov random field µG,θ (x) = 1 ZG,θ eθxi xj (1) (i,j)∈E over binary variables x = (x1 , x2 , . . . , xp ). Apart from being one of the most studied models in statistical mechanics, the Ising model is a prototypical undirected graphical model, with applications in computer vision, clustering and spatial statistics. Its obvious generalization to edge-dependent parameters θij , (i, j) ∈ E is of interest as well, and will be introduced in Section 1.2.2. (Let us stress that we follow the statistical mechanics convention of calling (1) an Ising model for any graph G.) In this paper we study the following structural learning problem: Given n i.i.d. samples x(1) , x(2) ,. . . , x(n) with distribution µG,θ ( · ), reconstruct the graph G. For the sake of simplicity, we assume that the parameter θ is known, and that G has no double edges (it is a ‘simple’ graph). The graph learning problem is solvable with unbounded sample complexity, and computational resources [1]. The question we address is: for which classes of graphs and values of the parameter θ is the problem solvable under appropriate complexity constraints? More precisely, given an algorithm Alg, a graph G, a value θ of the model parameter, and a small δ > 0, the sample complexity is defined as nAlg (G, θ) ≡ inf n ∈ N : Pn,G,θ {Alg(x(1) , . . . , x(n) ) = G} ≥ 1 − δ , (2) where Pn,G,θ denotes probability with respect to n i.i.d. samples with distribution µG,θ . Further, we let χAlg (G, θ) denote the number of operations of the algorithm Alg, when run on nAlg (G, θ) samples.1 1 For the algorithms analyzed in this paper, the behavior of nAlg and χAlg does not change significantly if we require only ‘approximate’ reconstruction (e.g. in graph distance). 1 The general problem is therefore to characterize the functions nAlg (G, θ) and χAlg (G, θ), in particular for an optimal choice of the algorithm. General bounds on nAlg (G, θ) have been given in [2, 3], under the assumption of unbounded computational resources. A general charactrization of how well low complexity algorithms can perform is therefore lacking. Although we cannot prove such a general characterization, in this paper we estimate nAlg and χAlg for a number of graph models, as a function of θ, and unveil a fascinating universal pattern: when the model (1) develops long range correlations, low-complexity algorithms fail. Under the Ising model, the variables {xi }i∈V become strongly correlated for θ large. For a large class of graphs with degree bounded by ∆, this phenomenon corresponds to a phase transition beyond some critical value of θ uniformly bounded in p, with typically θcrit ≤ const./∆. In the examples discussed below, the failure of low-complexity algorithms appears to be related to this phase transition (although it does not coincide with it). 
1.1 A toy example: the thresholding algorithm In order to illustrate the interplay between graph structure, sample complexity and interaction strength θ, it is instructive to consider a warmup example. The thresholding algorithm reconstructs G by thresholding the empirical correlations Cij ≡ 1 n n (ℓ) (ℓ) xi xj for i, j ∈ V . ℓ=1 (3) T HRESHOLDING( samples {x(ℓ) }, threshold τ ) 1: Compute the empirical correlations {Cij }(i,j)∈V ×V ; 2: For each (i, j) ∈ V × V 3: If Cij ≥ τ , set (i, j) ∈ E; We will denote this algorithm by Thr(τ ). Notice that its complexity is dominated by the computation of the empirical correlations, i.e. χThr(τ ) = O(p2 n). The sample complexity nThr(τ ) can be bounded for specific classes of graphs as follows (the proofs are straightforward and omitted from this paper). Theorem 1.1. If G has maximum degree ∆ > 1 and if θ < atanh(1/(2∆)) then there exists τ = τ (θ) such that 2p 8 log nThr(τ ) (G, θ) ≤ . (4) 1 δ (tanh θ − 2∆ )2 Further, the choice τ (θ) = (tanh θ + (1/2∆))/2 achieves this bound. Theorem 1.2. There exists a numerical constant K such that the following is true. If ∆ > 3 and θ > K/∆, there are graphs of bounded degree ∆ such that for any τ , nThr(τ ) = ∞, i.e. the thresholding algorithm always fails with high probability. These results confirm the idea that the failure of low-complexity algorithms is related to long-range correlations in the underlying graphical model. If the graph G is a tree, then correlations between far apart variables xi , xj decay exponentially with the distance between vertices i, j. The same happens on bounded-degree graphs if θ ≤ const./∆. However, for θ > const./∆, there exists families of bounded degree graphs with long-range correlations. 1.2 More sophisticated algorithms In this section we characterize χAlg (G, θ) and nAlg (G, θ) for more advanced algorithms. We again obtain very distinct behaviors of these algorithms depending on long range correlations. Due to space limitations, we focus on two type of algorithms and only outline the proof of our most challenging result, namely Theorem 1.6. In the following we denote by ∂i the neighborhood of a node i ∈ G (i ∈ ∂i), and assume the degree / to be bounded: |∂i| ≤ ∆. 1.2.1 Local Independence Test A recurring approach to structural learning consists in exploiting the conditional independence structure encoded by the graph [1, 4, 5, 6]. 2 Let us consider, to be definite, the approach of [4], specializing it to the model (1). Fix a vertex r, whose neighborhood we want to reconstruct, and consider the conditional distribution of xr given its neighbors2 : µG,θ (xr |x∂r ). Any change of xi , i ∈ ∂r, produces a change in this distribution which is bounded away from 0. Let U be a candidate neighborhood, and assume U ⊆ ∂r. Then changing the value of xj , j ∈ U will produce a noticeable change in the marginal of Xr , even if we condition on the remaining values in U and in any W , |W | ≤ ∆. On the other hand, if U ∂r, then it is possible to find W (with |W | ≤ ∆) and a node i ∈ U such that, changing its value after fixing all other values in U ∪ W will produce no noticeable change in the conditional marginal. (Just choose i ∈ U \∂r and W = ∂r\U ). This procedure allows us to distinguish subsets of ∂r from other sets of vertices, thus motivating the following algorithm. 
L OCAL I NDEPENDENCE T EST( samples {x(ℓ) }, thresholds (ǫ, γ) ) 1: Select a node r ∈ V ; 2: Set as its neighborhood the largest candidate neighbor U of size at most ∆ for which the score function S CORE(U ) > ǫ/2; 3: Repeat for all nodes r ∈ V ; The score function S CORE( · ) depends on ({x(ℓ) }, ∆, γ) and is defined as follows, min max W,j xi ,xW ,xU ,xj |Pn,G,θ {Xi = xi |X W = xW , X U = xU }− Pn,G,θ {Xi = xi |X W = xW , X U \j = xU \j , Xj = xj }| . (5) In the minimum, |W | ≤ ∆ and j ∈ U . In the maximum, the values must be such that Pn,G,θ {X W = xW , X U = xU } > γ/2, Pn,G,θ {X W = xW , X U \j = xU \j , Xj = xj } > γ/2 Pn,G,θ is the empirical distribution calculated from the samples {x(ℓ) }. We denote this algorithm by Ind(ǫ, γ). The search over candidate neighbors U , the search for minima and maxima in the computation of the S CORE(U ) and the computation of Pn,G,θ all contribute for χInd (G, θ). Both theorems that follow are consequences of the analysis of [4]. Theorem 1.3. Let G be a graph of bounded degree ∆ ≥ 1. For every θ there exists (ǫ, γ), and a numerical constant K, such that 2p 100∆ , χInd(ǫ,γ) (G, θ) ≤ K (2p)2∆+1 log p . nInd(ǫ,γ) (G, θ) ≤ 2 4 log ǫ γ δ More specifically, one can take ǫ = 1 4 sinh(2θ), γ = e−4∆θ 2−2∆ . This first result implies in particular that G can be reconstructed with polynomial complexity for any bounded ∆. However, the degree of such polynomial is pretty high and non-uniform in ∆. This makes the above approach impractical. A way out was proposed in [4]. The idea is to identify a set of ‘potential neighbors’ of vertex r via thresholding: B(r) = {i ∈ V : Cri > κ/2} , (6) For each node r ∈ V , we evaluate S CORE(U ) by restricting the minimum in Eq. (5) over W ⊆ B(r), and search only over U ⊆ B(r). We call this algorithm IndD(ǫ, γ, κ). The basic intuition here is that Cri decreases rapidly with the graph distance between vertices r and i. As mentioned above, this is true at small θ. Theorem 1.4. Let G be a graph of bounded degree ∆ ≥ 1. Assume that θ < K/∆ for some small enough constant K. Then there exists ǫ, γ, κ such that nIndD(ǫ,γ,κ) (G, θ) ≤ 8(κ2 + 8∆ ) log 4p , δ χIndD(ǫ,γ,κ) (G, θ) ≤ K ′ p∆∆ More specifically, we can take κ = tanh θ, ǫ = 1 4 log(4/κ) α + K ′ ∆p2 log p . sinh(2θ) and γ = e−4∆θ 2−2∆ . 2 If a is a vector and R is a set of indices then we denote by aR the vector formed by the components of a with index in R. 3 1.2.2 Regularized Pseudo-Likelihoods A different approach to the learning problem consists in maximizing an appropriate empirical likelihood function [7, 8, 9, 10, 13]. To control the fluctuations caused by the limited number of samples, and select sparse graphs a regularization term is often added [7, 8, 9, 10, 11, 12, 13]. As a specific low complexity implementation of this idea, we consider the ℓ1 -regularized pseudolikelihood method of [7]. For each node r, the following likelihood function is considered L(θ; {x(ℓ) }) = − 1 n n (ℓ) ℓ=1 log Pn,G,θ (x(ℓ) |x\r ) r (7) where x\r = xV \r = {xi : i ∈ V \ r} is the vector of all variables except xr and Pn,G,θ is defined from the following extension of (1), µG,θ (x) = 1 ZG,θ eθij xi xj (8) i,j∈V / where θ = {θij }i,j∈V is a vector of real parameters. Model (1) corresponds to θij = 0, ∀(i, j) ∈ E and θij = θ, ∀(i, j) ∈ E. 
The function L(θ; {x(ℓ) }) depends only on θr,· = {θrj , j ∈ ∂r} and is used to estimate the neighborhood of each node by the following algorithm, Rlr(λ), R EGULARIZED L OGISTIC R EGRESSION( samples {x(ℓ) }, regularization (λ)) 1: Select a node r ∈ V ; 2: Calculate ˆr,· = arg min {L(θr,· ; {x(ℓ) }) + λ||θr,· ||1 }; θ θ r,· ∈Rp−1 3: ˆ If θrj > 0, set (r, j) ∈ E; Our first result shows that Rlr(λ) indeed reconstructs G if θ is sufficiently small. Theorem 1.5. There exists numerical constants K1 , K2 , K3 , such that the following is true. Let G be a graph with degree bounded by ∆ ≥ 3. If θ ≤ K1 /∆, then there exist λ such that nRlr(λ) (G, θ) ≤ K2 θ−2 ∆ log 8p2 . δ (9) Further, the above holds with λ = K3 θ ∆−1/2 . This theorem is proved by noting that for θ ≤ K1 /∆ correlations decay exponentially, which makes all conditions in Theorem 1 of [7] (denoted there by A1 and A2) hold, and then computing the probability of success as a function of n, while strenghtening the error bounds of [7]. In order to prove a converse to the above result, we need to make some assumptions on λ. Given θ > 0, we say that λ is ‘reasonable for that value of θ if the following conditions old: (i) Rlr(λ) is successful with probability larger than 1/2 on any star graph (a graph composed by a vertex r connected to ∆ neighbors, plus isolated vertices); (ii) λ ≤ δ(n) for some sequence δ(n) ↓ 0. Theorem 1.6. There exists a numerical constant K such that the following happens. If ∆ > 3, θ > K/∆, then there exists graphs G of degree bounded by ∆ such that for all reasonable λ, nRlr(λ) (G) = ∞, i.e. regularized logistic regression fails with high probability. The graphs for which regularized logistic regression fails are not contrived examples. Indeed we will prove that the claim in the last theorem holds with high probability when G is a uniformly random graph of regular degree ∆. The proof Theorem 1.6 is based on showing that an appropriate incoherence condition is necessary for Rlr to successfully reconstruct G. The analogous result was proven in [14] for model selection using the Lasso. In this paper we show that such a condition is also necessary when the underlying model is an Ising model. Notice that, given the graph G, checking the incoherence condition is NP-hard for general (non-ferromagnetic) Ising model, and requires significant computational effort 4 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 20 15 λ0 10 5 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.2 1 0.8 0.6 Psucc 0.4 0.2 1 θ 0 0 0.2 0.4 0.6 θ θcrit 0.8 1 Figure 1: Learning random subgraphs of a 7 × 7 (p = 49) two-dimensional grid from n = 4500 Ising models samples, using regularized logistic regression. Left: success probability as a function of the model parameter θ and of the regularization parameter λ0 (darker corresponds to highest probability). Right: the same data plotted for several choices of λ versus θ. The vertical line corresponds to the model critical temperature. The thick line is an envelope of the curves obtained for different λ, and should correspond to optimal regularization. even in the ferromagnetic case. Hence the incoherence condition does not provide, by itself, a clear picture of which graph structure are difficult to learn. We will instead show how to evaluate it on specific graph families. Under the restriction λ → 0 the solutions given by Rlr converge to θ∗ with n [7]. Thus, for large n we can expand L around θ∗ to second order in (θ − θ∗ ). 
When we add the regularization term to L we obtain a quadratic model analogous the Lasso plus the error term due to the quadratic approximation. It is thus not surprising that, when λ → 0 the incoherence condition introduced for the Lasso in [14] is also relevant for the Ising model. 2 Numerical experiments In order to explore the practical relevance of the above results, we carried out extensive numerical simulations using the regularized logistic regression algorithm Rlr(λ). Among other learning algorithms, Rlr(λ) strikes a good balance of complexity and performance. Samples from the Ising model (1) where generated using Gibbs sampling (a.k.a. Glauber dynamics). Mixing time can be very large for θ ≥ θcrit , and was estimated using the time required for the overall bias to change sign (this is a quite conservative estimate at low temperature). Generating the samples {x(ℓ) } was indeed the bulk of our computational effort and took about 50 days CPU time on Pentium Dual Core processors (we show here only part of these data). Notice that Rlr(λ) had been tested in [7] only on tree graphs G, or in the weakly coupled regime θ < θcrit . In these cases sampling from the Ising model is easy, but structural learning is also intrinsically easier. Figure reports the success probability of Rlr(λ) when applied to random subgraphs of a 7 × 7 two-dimensional grid. Each such graphs was obtained by removing each edge independently with probability ρ = 0.3. Success probability was estimated by applying Rlr(λ) to each vertex of 8 graphs (thus averaging over 392 runs of Rlr(λ)), using n = 4500 samples. We scaled the regularization parameter as λ = 2λ0 θ(log p/n)1/2 (this choice is motivated by the algorithm analysis and is empirically the most satisfactory), and searched over λ0 . The data clearly illustrate the phenomenon discussed. Despite the large number of samples n ≫ log p, when θ crosses a threshold, the algorithm starts performing poorly irrespective of λ. Intriguingly, this threshold is not far from the critical point of the Ising model on a randomly diluted grid θcrit (ρ = 0.3) ≈ 0.7 [15, 16]. 5 1.2 1.2 θ = 0.35, 0.40 1 1 θ = 0.25 θ = 0.20 0.8 0.8 θ = 0.45 θ = 0.10 0.6 0.6 Psucc Psucc 0.4 0.4 θ = 0.50 0.2 0.2 θthr θ = 0.65, 0.60, 0.55 0 0 0 2000 4000 6000 8000 10000 0 0.1 0.2 n 0.3 0.4 0.5 0.6 0.7 0.8 θ Figure 2: Learning uniformly random graphs of degree 4 from Ising models samples, using Rlr. Left: success probability as a function of the number of samples n for several values of θ. Right: the same data plotted for several choices of λ versus θ as in Fig. 1, right panel. Figure 2 presents similar data when G is a uniformly random graph of degree ∆ = 4, over p = 50 vertices. The evolution of the success probability with n clearly shows a dichotomy. When θ is below a threshold, a small number of samples is sufficient to reconstruct G with high probability. Above the threshold even n = 104 samples are to few. In this case we can predict the threshold analytically, cf. Lemma 3.3 below, and get θthr (∆ = 4) ≈ 0.4203, which compares favorably with the data. 3 Proofs In order to prove Theorem 1.6, we need a few auxiliary results. It is convenient to introduce some notations. If M is a matrix and R, P are index sets then MR P denotes the submatrix with row indices in R and column indices in P . As above, we let r be the vertex whose neighborhood we are trying to reconstruct and define S = ∂r, S c = V \ ∂r ∪ r. 
3 Proofs

In order to prove Theorem 1.6, we need a few auxiliary results. It is convenient to introduce some notation. If M is a matrix and R, P are index sets, then M_{R P} denotes the submatrix with row indices in R and column indices in P. As above, we let r be the vertex whose neighborhood we are trying to reconstruct, and define S = ∂r, S^c = V \ (∂r ∪ {r}).

Since the cost function L(θ; {x(ℓ)}) + λ||θ||₁ only depends on θ through its components θr,· = {θrj}, we will hereafter neglect all the other parameters and write θ as a shorthand for θr,·.

Let ẑ∗ be a subgradient of ||θ||₁ evaluated at the true parameter values, θ∗ = {θ∗rj : θ∗rj = 0 ∀j ∉ ∂r, θ∗rj = θ ∀j ∈ ∂r}. Let θ̂ⁿ be the parameter estimate returned by Rlr(λ) when the number of samples is n. Note that, since we assumed θ∗ ≥ 0, ẑ∗_S = 1_S (the all-ones vector on S). Define Qⁿ(θ; {x(ℓ)}) to be the Hessian of L(θ; {x(ℓ)}) and Q(θ) = lim_{n→∞} Qⁿ(θ; {x(ℓ)}). By the law of large numbers, Q(θ) is the Hessian of E_{G,θ} log P_{G,θ}(Xr | X_{\r}), where E_{G,θ} is the expectation with respect to (8) and X is a random variable distributed according to (8). We will denote the maximum and minimum eigenvalues of a symmetric matrix M by σmax(M) and σmin(M) respectively.

We will omit arguments whenever clear from the context. Any quantity evaluated at the true parameter values will be represented with a ∗, e.g. Q∗ = Q(θ∗). Quantities under a ∧ depend on n. Throughout this section G is a graph of maximum degree ∆.

3.1 Proof of Theorem 1.6

Our first auxiliary result establishes that, if λ is small, then ||Q∗_{S^c S} (Q∗_{SS})^(−1) ẑ∗_S||_∞ > 1 is a sufficient condition for the failure of Rlr(λ).

Lemma 3.1. Assume [Q∗_{S^c S} (Q∗_{SS})^(−1) ẑ∗_S]_i ≥ 1 + ǫ for some ǫ > 0 and some row i ∈ V, σmin(Q∗_{SS}) ≥ Cmin > 0, and λ < C³min ǫ/(2⁹ ∆⁴). Then the success probability of Rlr(λ) is upper bounded as

Psucc ≤ 4∆² e^(−n δ_A²) + 2∆ e^(−n λ² δ_B²) (10)

where δ_A = (C²min/100∆²) ǫ and δ_B = (Cmin/8∆) ǫ.

The next lemma implies that, for λ to be 'reasonable' (in the sense introduced in Section 1.2.2), nλ² must be unbounded.

Lemma 3.2. There exists M = M(K, θ) > 0 for θ > 0 such that the following is true: if G is the graph with only one edge between nodes r and i, and nλ² ≤ K, then

Psucc ≤ e^(−M(K,θ) p) + e^(−n(1 − tanh θ)²/32). (11)

Finally, our key result shows that the condition ||Q∗_{S^c S} (Q∗_{SS})^(−1) ẑ∗_S||_∞ ≤ 1 is violated with high probability for large random graphs. The proof of this result relies on a local weak convergence result for ferromagnetic Ising models on random graphs proved in [17].

Lemma 3.3. Let G be a uniformly random regular graph of degree ∆ > 3, and ǫ > 0 be sufficiently small. Then there exists θthr(∆, ǫ) such that, for θ > θthr(∆, ǫ), ||Q∗_{S^c S} (Q∗_{SS})^(−1) ẑ∗_S||_∞ ≥ 1 + ǫ with probability converging to 1 as p → ∞.

Furthermore, for large ∆, θthr(∆, 0+) = θ̃ ∆^(−1) (1 + o(1)). The constant θ̃ is given by θ̃ = (tanh h̄)/h̄, where h̄ is the unique positive solution of h̄ tanh h̄ = (1 − tanh² h̄)². Finally, there exists Cmin > 0, dependent only on ∆ and θ, such that σmin(Q∗_{SS}) ≥ Cmin with probability converging to 1 as p → ∞.

The proofs of Lemmas 3.1 and 3.3 are sketched in the next subsection. Lemma 3.2 is more straightforward and we omit its proof for space reasons.
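The incoherence quantity at the heart of Lemmas 3.1 and 3.3 is easy to estimate numerically by replacing Q∗ with the empirical Hessian Qⁿ(θ∗). The sketch below spells this out under our own assumptions: ±1-coded samples, the ferromagnetic choice ẑ∗_S = 1_S, and S given as index positions within the p − 1 reduced coordinates.

```python
import numpy as np

def empirical_hessian(X, r, theta_star):
    """Q^n(theta*): Hessian of the logistic loss L at theta*, for the
    conditional of x_r given the rest. X: (n, p) array of +/-1 samples;
    theta_star: length p-1 vector in the reduced coordinates."""
    Z = np.delete(X, r, axis=1)
    a = Z @ theta_star                # <theta*, x> for each sample
    w = 1.0 - np.tanh(a) ** 2         # per-sample curvature weight
    return (Z * w[:, None]).T @ Z / X.shape[0]

def incoherence(Q, S):
    """||Q_{S^c S} Q_{SS}^{-1} 1_S||_inf; values above 1 signal that
    Rlr(lambda) with small lambda is expected to fail (Lemma 3.1)."""
    Sc = [j for j in range(Q.shape[0]) if j not in set(S)]
    v = Q[np.ix_(Sc, S)] @ np.linalg.solve(Q[np.ix_(S, S)],
                                           np.ones(len(S)))
    return np.abs(v).max()
```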
Proof. (Theorem 1.6) Fix ∆ > 3, θ > K/∆ (where K is a large enough constant independent of ∆), and ǫ, Cmin > 0 both small enough. By Lemma 3.3, for any p large enough we can choose a ∆-regular graph Gp = (V = [p], Ep) and a vertex r ∈ V such that |[Q∗_{S^c S} (Q∗_{SS})^(−1) 1_S]_i| > 1 + ǫ for some i ∈ V \ r. By Theorem 1 in [4] we can assume, without loss of generality, n > K′∆ log p for some small constant K′. Further, by Lemma 3.2, nλ² ≥ F(p) for some F(p) ↑ ∞ as p → ∞, and the condition of Lemma 3.1 on λ is satisfied since, by the 'reasonable' assumption, λ → 0 with n. Using these results in Eq. (10) of Lemma 3.1, we get the following upper bound on the success probability:

Psucc(Gp) ≤ 4∆² p^(−K′ ∆ δ_A²) + 2∆ e^(−F(p) δ_B²). (12)

In particular, Psucc(Gp) → 0 as p → ∞.

3.2 Proofs of auxiliary lemmas

Proof. (Lemma 3.1) We will show that, under the assumptions of the lemma, if θ̂ = (θ̂_S, θ̂_{S^c}) = (θ̂_S, 0), then the probability that the i-th component of any subgradient of L(θ; {x(ℓ)}) + λ||θ||₁ vanishes for any θ̂_S > 0 (component-wise) is upper bounded as in Eq. (10). To simplify notation we will omit {x(ℓ)} in all the expressions derived from L.

Let ẑ be a subgradient of ||θ||₁ at θ̂ and assume ∇L(θ̂) + λẑ = 0. An application of the mean value theorem yields

∇²L(θ∗)[θ̂ − θ∗] = Wⁿ − λẑ + Rⁿ, (13)

where Wⁿ = −∇L(θ∗) and [Rⁿ]_j = [∇²L(θ̄^(j)) − ∇²L(θ∗)]ᵀ_j (θ̂ − θ∗), with θ̄^(j) a point on the line from θ̂ to θ∗. Notice that by definition ∇²L(θ∗) = Qⁿ∗ = Qⁿ(θ∗). To simplify notation we will omit the ∗ in all Qⁿ∗; all Qⁿ in this proof are thus evaluated at θ∗.

Breaking this expression into its S and S^c components, and using θ̂_{S^c} = θ∗_{S^c} = 0, we can eliminate θ̂_S − θ∗_S from the two resulting equations and write

[Wⁿ_{S^c} − Rⁿ_{S^c}] − Qⁿ_{S^c S} (Qⁿ_{SS})^(−1) [Wⁿ_S − Rⁿ_S] + λ Qⁿ_{S^c S} (Qⁿ_{SS})^(−1) ẑ_S = λ ẑ_{S^c}.

Now notice that Qⁿ_{S^c S} (Qⁿ_{SS})^(−1) = T₁ + T₂ + T₃ + T₄, where

T₁ = Q∗_{S^c S} [(Qⁿ_{SS})^(−1) − (Q∗_{SS})^(−1)],  T₂ = [Qⁿ_{S^c S} − Q∗_{S^c S}] (Q∗_{SS})^(−1),
T₃ = [Qⁿ_{S^c S} − Q∗_{S^c S}] [(Qⁿ_{SS})^(−1) − (Q∗_{SS})^(−1)],  T₄ = Q∗_{S^c S} (Q∗_{SS})^(−1). (14)

We will assume that the samples {x(ℓ)} are such that the following event holds:

E ≡ { ||Qⁿ_{SS} − Q∗_{SS}||_∞ < ξ_A, ||Qⁿ_{S^c S} − Q∗_{S^c S}||_∞ < ξ_B, ||Wⁿ_S/λ||_∞ < ξ_C }, (15)

where ξ_A ≡ C²min ǫ/(16∆), ξ_B ≡ Cmin ǫ/(8√∆) and ξ_C ≡ Cmin ǫ/(8∆). Since E_{G,θ}(Qⁿ) = Q∗ and E_{G,θ}(Wⁿ) = 0, and noticing that both Qⁿ and Wⁿ are sums of bounded i.i.d. random variables, a simple application of the Azuma–Hoeffding inequality shows that the probability that E fails is upper bounded as in (10). From E it follows that σmin(Qⁿ_{SS}) > σmin(Q∗_{SS}) − Cmin/2 > Cmin/2. We can therefore lower bound the absolute value of the i-th component of ẑ_{S^c} by

|ẑ_i| ≥ |[Q∗_{S^c S} (Q∗_{SS})^(−1) 1_S]_i| − ||T₁,i||_∞ − ||T₂,i||_∞ − ||T₃,i||_∞ − |Wⁿ_i/λ| − |Rⁿ_i/λ| − (∆/Cmin)( ||Wⁿ_S/λ||_∞ + ||Rⁿ_S/λ||_∞ ),

where the subscript i denotes the i-th row of a matrix. The proof is completed by showing that the event E and the assumptions of the theorem imply that each of the last 7 terms in this expression is smaller than ǫ/8. Since |[Q∗_{S^c S} (Q∗_{SS})^(−1) ẑ∗_S]_i| ≥ 1 + ǫ by assumption, this implies |ẑ_i| ≥ 1 + ǫ/8 > 1, which cannot be, since any subgradient of the 1-norm has components of magnitude at most 1.

The last condition in E immediately bounds all terms involving W by ǫ/8. Some straightforward manipulations imply (see Lemma 7 of [7])

||T₁,i||_∞ ≤ (∆/C²min) ||Qⁿ_{SS} − Q∗_{SS}||_∞,
||T₂,i||_∞ ≤ (√∆/Cmin) ||[Qⁿ_{S^c S} − Q∗_{S^c S}]_i||_∞,
||T₃,i||_∞ ≤ (2∆/C²min) ||Qⁿ_{SS} − Q∗_{SS}||_∞ ||[Qⁿ_{S^c S} − Q∗_{S^c S}]_i||_∞,

and thus all will be bounded by ǫ/8 when E holds. The upper bound on Rⁿ follows along similar lines via a mean value theorem, and is deferred to a longer version of this paper.

Proof. (Lemma 3.3) Let us state explicitly the local weak convergence result mentioned in Sec. 3.1. For t ∈ N, let T(t) = (V_T, E_T) be the regular rooted tree of t generations and define the associated Ising measure as

μ⁺_{T,θ}(x) = (1/Z_{T,θ}) ∏_{(i,j)∈E_T} e^(θ x_i x_j) ∏_{i∈∂T(t)} e^(h∗ x_i). (16)

Here ∂T(t) is the set of leaves of T(t), and h∗ is the unique positive solution of h = (∆ − 1) atanh{ tanh θ tanh h }.
It can be proved using [17], together with uniform continuity with respect to the 'external field', that non-trivial local expectations with respect to μ_{G,θ}(x) converge to local expectations with respect to μ⁺_{T,θ}(x) as p → ∞.

More precisely, let B_r(t) denote a ball of radius t around node r ∈ G (the node whose neighborhood we are trying to reconstruct). For any fixed t, the probability that B_r(t) is not isomorphic to T(t) goes to 0 as p → ∞. Let g(x_{B_r(t)}) be any function of the variables in B_r(t) such that g(x_{B_r(t)}) = g(−x_{B_r(t)}). Then, almost surely over graph sequences Gp of uniformly random regular graphs with p nodes (expectations here are taken with respect to the measures (1) and (16)),

lim_{p→∞} E_{G,θ}{ g(X_{B_r(t)}) } = E_{T(t),θ,+}{ g(X_{T(t)}) }. (17)

The proof consists in considering [Q∗_{S^c S} (Q∗_{SS})^(−1) ẑ∗_S]_i for t = dist(r, i) finite. We then write (Q∗_{SS})_{lk} = E{g_{l,k}(X_{B_r(t)})} and (Q∗_{S^c S})_{il} = E{g_{i,l}(X_{B_r(t)})} for suitable functions g_{·,·}(X_{B_r(t)}), and apply the weak convergence result (17) to these expectations. We have thus reduced the calculation of [Q∗_{S^c S} (Q∗_{SS})^(−1) ẑ∗_S]_i to the calculation of expectations with respect to the tree measure (16). The latter can be implemented explicitly through a recursive procedure, with simplifications arising thanks to the tree symmetry and by taking t ≫ 1. The actual calculations consist in a (very) long exercise in calculus and we omit them from this outline. The lower bound on σmin(Q∗_{SS}) is proved by a similar calculation.

Acknowledgments

This work was partially supported by a Terman fellowship, the NSF CAREER award CCF-0743978, the NSF grant DMS-0806211, and a Portuguese Doctoral FCT fellowship.

References

[1] P. Abbeel, D. Koller and A. Ng, "Learning factor graphs in polynomial time and sample complexity", Journal of Machine Learning Research, 2006, Vol. 7, 1743–1788.
[2] M. Wainwright, "Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting", arXiv:math/0702301v2 [math.ST], 2007.
[3] N. Santhanam and M. Wainwright, "Information-theoretic limits of selecting binary graphical models in high dimensions", arXiv:0905.2639v1 [cs.IT], 2009.
[4] G. Bresler, E. Mossel and A. Sly, "Reconstruction of Markov Random Fields from Samples: Some Observations and Algorithms", Proceedings of the 11th international workshop APPROX 2008 and 12th international workshop RANDOM 2008, 2008, 343–356.
[5] I. Csiszár and Z. Talata, "Consistent estimation of the basic neighborhood structure of Markov random fields", The Annals of Statistics, 2006, Vol. 34, No. 1, 123–145.
[6] N. Friedman, I. Nachman and D. Pe'er, "Learning Bayesian network structure from massive datasets: The sparse candidate algorithm", in UAI, 1999.
[7] P. Ravikumar, M. Wainwright and J. Lafferty, "High-Dimensional Ising Model Selection Using l1-Regularized Logistic Regression", arXiv:0804.4202v1 [math.ST], 2008.
[8] M. Wainwright, P. Ravikumar and J. Lafferty, "Inferring graphical model structure using l1-regularized pseudolikelihood", in NIPS, 2006.
[9] H. Höfling and R. Tibshirani, "Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods", Journal of Machine Learning Research, 2009, Vol. 10, 883–906.
[10] O. Banerjee, L. El Ghaoui and A. d'Aspremont, "Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data", Journal of Machine Learning Research, March 2008, Vol. 9, 485–516.
[11] M. Yuan and Y. Lin, "Model Selection and Estimation in Regression with Grouped Variables", Journal of the Royal Statistical Society, Series B, 2006, Vol. 68, No. 1, 49–67.
[12] N. Meinshausen and P. Bühlmann, "High-dimensional graphs and variable selection with the lasso", Annals of Statistics, 2006, Vol. 34, No. 3.
[13] R. Tibshirani, "Regression shrinkage and selection via the lasso", Journal of the Royal Statistical Society, Series B, 1996, Vol. 58, 267–288.
[14] P. Zhao and B. Yu, "On model selection consistency of Lasso", Journal of Machine Learning Research, 2006, Vol. 7, 2541–2563.
[15] D. Zobin, "Critical behavior of the bond-dilute two-dimensional Ising model", Phys. Rev. B, 1978, Vol. 18, No. 5, 2387–2390.
[16] M. Fisher, "Critical Temperatures of Anisotropic Ising Lattices. II. General Upper Bounds", Phys. Rev., Oct. 1967, Vol. 162, No. 2, 480–485.
[17] A. Dembo and A. Montanari, "Ising Models on Locally Tree-Like Graphs", Ann. Appl. Prob. (2008), to appear, arXiv:0804.4726v2 [math.PR].

3 0.11571376 205 nips-2009-Rethinking LDA: Why Priors Matter

Author: Andrew McCallum, David M. Mimno, Hanna M. Wallach

Abstract: Implementations of topic models typically use symmetric Dirichlet priors with fixed concentration parameters, with the implicit assumption that such “smoothing parameters” have little practical effect. In this paper, we explore several classes of structured priors for topic models. We find that an asymmetric Dirichlet prior over the document–topic distributions has substantial advantages over a symmetric prior, while an asymmetric prior over the topic–word distributions provides no real benefit. Approximation of this prior structure through simple, efficient hyperparameter optimization steps is sufficient to achieve these performance gains. The prior structure we advocate substantially increases the robustness of topic models to variations in the number of topics and to the highly skewed word frequency distributions common in natural language. Since this prior structure can be implemented using efficient algorithms that add negligible cost beyond standard inference techniques, we recommend it as a new standard for topic modeling.

4 0.1041151 211 nips-2009-Segmenting Scenes by Matching Image Composites

Author: Bryan Russell, Alyosha Efros, Josef Sivic, Bill Freeman, Andrew Zisserman

Abstract: In this paper, we investigate how, given an image, similar images sharing the same global description can help with unsupervised scene segmentation. In contrast to recent work in semantic alignment of scenes, we allow an input image to be explained by partial matches of similar scenes. This allows for a better explanation of the input scenes. We perform MRF-based segmentation that optimizes over matches, while respecting boundary information. The recovered segments are then used to re-query a large database of images to retrieve better matches for the target regions. We show improved performance in detecting the principal occluding and contact boundaries for the scene over previous methods on data gathered from the LabelMe database.

5 0.077434756 23 nips-2009-Accelerating Bayesian Structural Inference for Non-Decomposable Gaussian Graphical Models

Author: Baback Moghaddam, Emtiyaz Khan, Kevin P. Murphy, Benjamin M. Marlin

Abstract: We make several contributions in accelerating approximate Bayesian structural inference for non-decomposable GGMs. Our first contribution is to show how to efficiently compute a BIC or Laplace approximation to the marginal likelihood of non-decomposable graphs using convex methods for precision matrix estimation. This optimization technique can be used as a fast scoring function inside standard Stochastic Local Search (SLS) for generating posterior samples. Our second contribution is a novel framework for efficiently generating large sets of high-quality graph topologies without performing local search. This graph proposal method, which we call “Neighborhood Fusion” (NF), samples candidate Markov blankets at each node using sparse regression techniques. Our third contribution is a hybrid method combining the complementary strengths of NF and SLS. Experimental results in structural recovery and prediction tasks demonstrate that NF and hybrid NF/SLS out-perform state-of-the-art local search methods, on both synthetic and real-world datasets, when realistic computational limits are imposed.

6 0.067375608 3 nips-2009-AUC optimization and the two-sample problem

7 0.05517317 201 nips-2009-Region-based Segmentation and Object Detection

8 0.05251452 90 nips-2009-Factor Modeling for Advertisement Targeting

9 0.051642656 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models

10 0.042222172 240 nips-2009-Sufficient Conditions for Agnostic Active Learnable

11 0.041532762 54 nips-2009-Compositionality of optimal control laws

12 0.041086782 196 nips-2009-Quantification and the language of thought

13 0.03538388 42 nips-2009-Bayesian Sparse Factor Models and DAGs Inference and Comparison

14 0.034649394 141 nips-2009-Local Rules for Global MAP: When Do They Work ?

15 0.034548648 122 nips-2009-Label Selection on Graphs

16 0.03447412 225 nips-2009-Sparsistent Learning of Varying-coefficient Models with Structural Changes

17 0.034164656 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties

18 0.032725614 57 nips-2009-Conditional Random Fields with High-Order Features for Sequence Labeling

19 0.032158889 144 nips-2009-Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness

20 0.032078709 245 nips-2009-Thresholding Procedures for High Dimensional Variable Selection and Statistical Estimation


similar papers computed by lsi model


similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93160707 248 nips-2009-Toward Provably Correct Feature Selection in Arbitrary Domains

Author: Dimitris Margaritis

We first introduce a theorem that exactly characterizes the minimal set of features necessary for probabilistically isolating a variable, and then relax this definition to derive a family of algorithms that learn a parameterized approximation of the ideal boundary that are provably correct under a minimal set of assumptions, including a set of axioms that hold for any probability distribution. In the following section we present work on feature selection, followed by notation and definitions in Section 3. We subsequently introduce an important theorem and the aforementioned parameterized family of algorithms in Sections 4 and 5 respectively, including a practical anytime version. We evaluate these algorithms in Section 6 and conclude in Section 7.

2 Related Work

Numerous algorithms have been proposed for feature selection. At the highest level algorithms can be classified as filter, wrapper, or embedded methods. Filter methods work without consulting the classifier (if any) that will make use of their output, i.e., the resulting set of selected features. They therefore typically have wider applicability since they are not tied to any particular classifier family. In contrast, wrappers make the classifier an integral part of their operation, repeatedly invoking it to evaluate each of a sequence of feature subsets, and selecting the subset that results in minimum estimated classification error (for that particular classifier). Finally, embedded algorithms are classifier-learning algorithms that perform feature selection implicitly during their operation, e.g., decision tree learners.

Early work was motivated by the problem of pattern recognition, which inherently contains a large number of features (pixels, regions, signal responses at multiple frequencies etc.). Narendra and Fukunaga [3] first cast feature selection as a problem of maximization of an objective function over the set of features to use, and proposed a number of search approaches including forward selection and backward elimination. Later work by machine learning researchers includes the FOCUS algorithm of Almuallim and Dietterich [4], which is a filter method for deterministic, noise-free domains. The RELIEF algorithm [5] instead uses a randomized selection of data points to update a weight assigned to each feature, selecting the features whose weight exceeds a given threshold. A large number of additional algorithms have appeared in the literature, too many to list here; good surveys are included in Dash and Liu [6], Guyon and Elisseeff [1], and Liu and Motoda [7].

An important concept for feature subset selection is relevance. Several notions of relevance are discussed in a number of important papers such as Blum and Langley [8] and Kohavi and John [9]. The argument that the problem of feature selection can be cast as the problem of Markov blanket discovery was first made convincingly in Koller and Sahami [2], who also presented an algorithm for learning an approximate Markov blanket using mutual information. Other algorithms include the GS algorithm [10], originally developed for learning the structure of a Bayesian network of a domain, and extensions to it [11], including the recent MMMB algorithm [12]. Meinshausen and Bühlmann [13] recently proposed an optimal theoretical solution to the problem of learning the neighborhood of a Markov network when the distribution of the domain can be assumed to be a multidimensional Gaussian, i.e., linear relations among features with Gaussian noise.
This assumption implies that the Composition axiom holds in the domain (see Pearl [14] for a definition of Composition); the difference with our work is that we address here the problem in general domains where it may not necessarily hold.

3 Notation and Preliminaries

In this section we present notation, fundamental definitions and axioms that will be subsequently used in the rest of the paper. We use the terms "feature" and "variable" interchangeably, and denote variables by capital letters (X, Y etc.) and sets of variables by bold letters (S, T etc.). We denote the set of all variables/features in the domain (the "universe") by U.

All algorithms presented are independence-based, learning the Markov boundary of a given target variable using the truth value of a number of conditional independence statements. The use of conditional independence for feature selection subsumes many other criteria proposed in the literature. In particular, the use of classification accuracy of the target variable can be seen as a special case of testing for its conditional independence with some of its predictor variables (conditional on the subset selected at any given moment). A benefit of using conditional independence is that, while classification error estimates depend on the classifier family used, conditional independence does not. In addition, algorithms utilizing conditional independence for feature selection are applicable to all domain types, e.g., discrete, ordinal, continuous with non-linear or arbitrary non-degenerate associations, or mixed domains, as long as a reliable estimate of probabilistic independence is available.

We denote probabilistic independence by the symbol "⊥", i.e., (X ⊥ Y | Z) denotes the fact that the variables in set X are (jointly) conditionally independent of those in set Y given the values of the variables in set Z; we write ¬(X ⊥ Y | Z) for their conditional dependence. We assume the existence of a probabilistic independence query oracle that is available to answer any query of the form (X, Y | Z), corresponding to the question "Is the set of variables in X independent of the variables in Y given the value of the variables in Z?" (This is similar to the approach of learning from statistical queries of Kearns and Vazirani [15].) In practice however, such an oracle does not exist, but can be approximated by a statistical independence test on a data set. Many tests of independence have appeared and been studied extensively in the statistical literature over the last century; in this work we use the χ² (chi-square) test of independence [16].
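For concreteness, one minimal way to approximate the oracle with the χ² test on discrete data is sketched below: the statistic is accumulated over every configuration (stratum) of the conditioning set. The stratification scheme, the handling of degenerate strata, and the significance level α are our assumptions, not prescriptions of the paper.

```python
import numpy as np
from scipy.stats import chi2
from collections import defaultdict

def chi2_ci_test(data, x, y, z, alpha=0.05):
    """Approximate test of X independent of Y given Z on discrete data,
    summing the chi-square statistic over every configuration of Z.
    data: (n, p) integer array; x, y: column indices; z: list of columns."""
    strata = defaultdict(list)
    for row in data:
        strata[tuple(row[z])].append((row[x], row[y]))
    stat, dof = 0.0, 0
    for pairs in strata.values():
        pairs = np.asarray(pairs)
        xs, ys = np.unique(pairs[:, 0]), np.unique(pairs[:, 1])
        if len(xs) < 2 or len(ys) < 2:
            continue  # degenerate stratum carries no evidence
        # contingency table of X vs Y within this stratum
        table = np.zeros((len(xs), len(ys)))
        for a, b in pairs:
            table[np.searchsorted(xs, a), np.searchsorted(ys, b)] += 1
        expected = table.sum(1, keepdims=True) * table.sum(0) / table.sum()
        with np.errstate(divide="ignore", invalid="ignore"):
            contrib = np.where(expected > 0,
                               (table - expected) ** 2 / expected, 0.0)
        stat += contrib.sum()
        dof += (len(xs) - 1) * (len(ys) - 1)
    if dof == 0:
        return True  # no usable evidence of dependence
    return chi2.sf(stat, dof) > alpha  # True = judged independent
```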
A Markov blanket of variable X is a set of variables such that, after fixing (by "knowing") the value of all of its members, the set of remaining variables in the domain, taken together as a single set-valued variable, is statistically independent of X. More precisely, we have the following definition.

Definition 1. A set of variables S ⊆ U is called a Markov blanket of variable X if and only if (X ⊥ U − S − {X} | S).

Intuitively, a Markov blanket S of X captures all the information in the remaining domain variables U − S − {X} that can affect the probability distribution of X, making their value redundant as far as X is concerned (given S). The blanket therefore captures the essence of the feature selection problem for target variable X: by completely "shielding" X, a Markov blanket precludes the existence of any possible information about X that can come from variables not in the blanket, making it an ideal solution to the feature selection problem. A minimal Markov blanket is called a Markov boundary.

Definition 2. A set of variables S ⊆ U − {X} is called a Markov boundary of variable X if it is a minimal Markov blanket of X, i.e., none of its proper subsets is a Markov blanket.

Pearl [14] proved that the axioms of Symmetry, Decomposition, Weak Union, and Intersection are sufficient to guarantee a unique Markov boundary. These are shown below together with the axiom of Contraction.

(Symmetry)      (X ⊥ Y | Z) =⇒ (Y ⊥ X | Z)
(Decomposition) (X ⊥ Y ∪ W | Z) =⇒ (X ⊥ Y | Z) ∧ (X ⊥ W | Z)
(Weak Union)    (X ⊥ Y ∪ W | Z) =⇒ (X ⊥ Y | Z ∪ W)
(Contraction)   (X ⊥ Y | Z) ∧ (X ⊥ W | Y ∪ Z) =⇒ (X ⊥ Y ∪ W | Z)
(Intersection)  (X ⊥ Y | Z ∪ W) ∧ (X ⊥ W | Z ∪ Y) =⇒ (X ⊥ Y ∪ W | Z)   (1)

The Symmetry, Decomposition, Contraction and Weak Union axioms are very general: they are necessary axioms for the probabilistic definition of independence, i.e., they hold in every probability distribution, as their proofs are based on the axioms of probability theory. Intersection is not universal, but it holds in distributions that are positive, i.e., where any value combination of the domain variables has a non-zero probability of occurring.

4 The Markov Boundary Theorem

According to Definition 2, a Markov boundary is a minimal Markov blanket. We first introduce a theorem that provides an alternative, equivalent definition of the concept of Markov boundary that we will relax later in the paper to produce a more general boundary definition.

Theorem 1 (Markov Boundary Theorem). Assuming that the Decomposition and Contraction axioms hold, S ⊆ U − {X} is a Markov boundary of variable X ∈ U if and only if

∀ T ⊆ U − {X}:  T ⊆ U − S ⇐⇒ (X ⊥ T | S − T). (2)

A detailed proof cannot be included here due to space constraints, but a proof sketch appears in Appendix A. According to the above theorem, a Markov boundary S partitions the powerset of U − {X} into two parts: (a) a set P1 that contains all subsets of U − S, and (b) a set P2 containing the remaining subsets. All sets in P1 are conditionally independent of X, and all sets in P2 are conditionally dependent with X.

Intuitively, the two directions of the logical equivalence of Eq. (2) correspond to the concept of Markov blanket and its minimality. The forward direction,

∀ T ⊆ U − {X}:  T ⊆ U − S =⇒ (X ⊥ T | S − T),

or, equivalently, ∀ T ⊆ U − S − {X}: (X ⊥ T | S) (as T and S are disjoint), corresponds to the definition of Markov blanket, as it includes T = U − S − {X}. In the opposite direction, the contrapositive form is

∀ T ⊆ U − {X}:  T ⊈ U − S =⇒ ¬(X ⊥ T | S − T).

This corresponds to the concept of minimality of the Markov boundary: it states that all sets that contain a part of S cannot be independent of X given the remainder of S. Informally, this is because if there existed some set T that contained a non-empty subset T′ of S such that (X ⊥ T | S − T), then one would be able to shrink S by T′ (by the property of Contraction), and therefore S would not be minimal (more details in Appendix A).

5 A Family of Algorithms for Arbitrary Domains

Theorem 1 defines conditions that precisely characterize a Markov boundary and thus can be thought of as an alternative definition of a boundary. By relaxing these conditions we can produce a more general definition. In particular, an m-Markov boundary is defined as follows.

Definition 3. A set of variables S ⊆ U − {X} of a domain U is called an m-Markov boundary of variable X ∈ U if and only if

∀ T ⊆ U − {X} such that |T| ≤ m:  T ⊆ U − S ⇐⇒ (X ⊥ T | S − T).

We call the parameter m of an m-Markov boundary the Markov boundary margin. Intuitively, an m-boundary S guarantees that (a) all subsets of its complement (excluding X) of size m or smaller are independent of X given S, and (b) all sets T of size m or smaller that are not subsets of its complement are dependent on X given the part of S that is not contained in T. This definition is a special case of the properties of a boundary stated in Theorem 1, with each set T mentioned in the theorem now restricted to having size m or smaller. For m = n − 1, where n = |U|, the condition |T| ≤ m is always satisfied and can be omitted; in this case the definition of an (n − 1)-Markov boundary coincides exactly with Eq. (2) of Theorem 1.

We now present an algorithm called GS(m), shown in Algorithm 1, that provably correctly learns an m-boundary of a target variable X. GS(m) operates in two phases, a growing and a shrinking phase (hence the acronym). During the growing phase it examines sets of variables of size up to m, where m is a user-specified parameter. During the shrinking phase, single variables are examined for conditional independence and possible removal from S (examining sets in the shrinking phase is not necessary for provably correct operation; see Appendix B). The orders of examination of the sets for possible addition to and deletion from the candidate boundary are left intentionally unspecified in Algorithm 1; one can therefore view it as an abstract representative of a family of algorithms, with each member specifying one such ordering. All members of this family are m-correct, as the proof of correctness does not depend on the ordering. In practice numerous choices for the ordering exist; one possibility is to examine the sets in the growing phase in order of increasing set size and, for each such size, in order of decreasing conditional mutual information I(X, Y, S) between X and Y given S. The rationale for this heuristic choice is that (usually) tests with smaller conditioning sets tend to be more reliable, and sorting by mutual information tends to lessen the chance of adding false members of the Markov boundary. We used this implementation in all our experiments, presented later in Section 6.

Algorithm 1 The abstract GS(m)(X) algorithm. Returns an m-Markov boundary of X.
1: S ← ∅
2: /* Growing phase. */
3: for all Y ⊆ U − S − {X} such that 1 ≤ |Y| ≤ m do
4:   if ¬(X ⊥ Y | S) then
5:     S ← S ∪ Y
6:     goto line 3 /* Restart loop. */
7: /* Shrinking phase. */
8: for all Y ∈ S do
9:   if (X ⊥ Y | S − {Y}) then
10:    S ← S − {Y}
11:    goto line 8 /* Restart loop. */
12: return S
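A direct transcription of Algorithm 1 is shown below as a sketch. The independence oracle is passed in as a callback (for instance, the χ² test sketched earlier); the specific iteration order over candidate subsets is one of the choices Algorithm 1 deliberately leaves open.

```python
from itertools import combinations

def gs_m(X_target, variables, indep, m):
    """Abstract GS(m): grow with subsets of size <= m, then shrink.
    indep(X, Y, S) -> True iff X is judged independent of set Y given S."""
    S = set()
    # Growing phase: add any dependent subset of size up to m, then restart.
    grew = True
    while grew:
        grew = False
        rest = [v for v in variables if v != X_target and v not in S]
        for size in range(1, m + 1):
            for Y in combinations(rest, size):
                if not indep(X_target, set(Y), S):
                    S |= set(Y)
                    grew = True
                    break
            if grew:
                break
    # Shrinking phase: remove single variables that became independent.
    shrunk = True
    while shrunk:
        shrunk = False
        for Y in list(S):
            if indep(X_target, {Y}, S - {Y}):
                S.remove(Y)
                shrunk = True
                break
    return S
```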
Intuitively, the margin represents a trade-off between sample and computational complexity and completeness: for m = n − 1 = |U| − 1, the algorithm returns a Markov boundary in unrestricted (arbitrary) domains. For 1 ≤ m < n − 1, GS(m) may recover the correct boundary depending on characteristics of the domain. For example, it will recover the correct boundary in domains containing embedded parity functions such that the number of variables involved in every k-bit parity function is m + 1 or less, i.e., if k ≤ m + 1 (parity functions are corner cases in the space of probability distributions that are known to be hard to learn [17]). The proof of m-correctness of GS(m) is included in Appendix B. Note that it is based on Theorem 1 and the universal axioms of Eqs. (1) only, i.e., Intersection is not needed, and thus it is widely applicable (to any domain).

A Practical Randomized Anytime Version

While GS(m) is provably correct even in difficult domains such as those that contain parity functions, it may be impractical with a large number of features, as its asymptotic complexity is O(n^m). We therefore also provide here a more practical randomized version called RGS(m,k) (Randomized GS(m)), shown in Algorithm 2.

Algorithm 2 The RGS(m,k)(X) algorithm, a randomized anytime version of the GS(m) algorithm, utilizing k random subsets for the growing phase.
1: S ← ∅
2: /* Growing phase. */
3: repeat
4:   Schanged ← false
5:   Y ← subset of U − S − {X} of size 1 ≤ |Y| ≤ m of maximum dependence out of k random subsets
6:   if ¬(X ⊥ Y | S) then
7:     S ← S ∪ Y
8:     Schanged ← true
9: until Schanged = false
10: /* Shrinking phase. */
11: for all Y ∈ S do
12:   if (X ⊥ Y | S − {Y}) then
13:     S ← S − {Y}
14:     goto line 11 /* Restart loop. */
15: return S

The RGS(m,k) algorithm has an additional parameter k that limits its computational requirements: instead of exhaustively examining all possible subsets of U − S − {X} (as GS(m) does), it samples k subsets from the set of all possible subsets of U − S − {X}, where k is user-specified. It is therefore a randomized algorithm that becomes equivalent to GS(m) given a large enough k. Many possibilities for the method of random selection of the subsets exist; in our experiments we select a subset Y = {Yi} (1 ≤ |Y| ≤ m) with probability proportional to the product over i = 1, ..., |Y| of 1/p(X, Yi | S), where p(X, Yi | S) is the p-value of the corresponding (univariate) test between X and Yi given S, which has a low computational cost (a sketch of this weighted selection appears below).

The RGS(m,k) algorithm is useful in situations where the amount of time to produce an answer may be limited and/or the limit unknown beforehand: it is easy to show that the growing phase of GS(m) produces an upper bound of the m-boundary of X. Therefore, the RGS(m,k) algorithm, if interrupted, will return an approximation of this upper bound. Moreover, if there exists time for the shrinking phase to be executed (which conducts a number of tests linear in n and is thus fast), extraneous variables will be removed and a minimal blanket (boundary) approximation will be returned. These features make it an anytime algorithm, which is a more appropriate choice for situations where critical events may occur that require the interruption of computation, e.g., during the planning phase of a robot, which may be interrupted at any time due to an urgent external event that requires a decision to be made based on the present state's feature values.
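The p-value-weighted subset selection used in the growing phase of RGS(m,k) can be sketched as follows. Enumerating the candidate pool is for illustration only (a practical implementation would sample subsets without materializing them all), and the numerical floor on p-values is our own guard.

```python
import numpy as np
from itertools import combinations

def sample_candidate_subsets(rest, m, k, pval, rng):
    """Draw k candidate subsets Y (1 <= |Y| <= m) of the remaining
    variables, weighting each Y proportionally to prod_i 1/p(X, Y_i | S).
    pval: dict mapping each single variable to its univariate p-value."""
    pool = [set(c) for size in range(1, m + 1)
            for c in combinations(rest, size)]
    weights = np.array([np.prod([1.0 / max(pval[v], 1e-12) for v in Y])
                        for Y in pool])
    probs = weights / weights.sum()
    idx = rng.choice(len(pool), size=min(k, len(pool)),
                     replace=False, p=probs)
    return [pool[i] for i in idx]
```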
6 Experiments

We evaluated the GS(m) and the RGS(m,k) algorithms on synthetic as well as real-world and benchmark data sets. We first systematically examined the performance on the task of recovering near-parity functions, which are known to be hard to learn [17], comparing GS(m) and RGS(m,k) with respect to accuracy of recovery of the original boundary as well as computational cost. We generated domains of sizes ranging from 10 to 100 variables, of which 4 variables (X1 to X4) were related through a near-parity relation with bit probability 0.60 and various degrees of noise. The remaining independent variables (X5 to Xn) act as "distractors" and had randomly assigned probabilities, i.e., the correct boundary of X1 is B1 = {X2, X3, X4}. In such domains, learning the boundary of X1 is difficult because of the large number of distractors and because each Xi ∈ B1 is independent of X1 given any proper subset of B1 − {Xi} (they only become dependent when including all of them in the conditioning set).

To measure an algorithm's feature selection performance, accuracy (fraction of variables correctly included or excluded) is inappropriate, as the accuracy of trivial algorithms such as returning the empty set will tend to 1 as n increases. Precision and recall are therefore more appropriate, with precision defined as the fraction of features returned that are in the correct boundary (3 features for X1), and recall as the fraction of the features present in the correct boundary that are returned by the algorithm. A convenient and frequently used measure that combines precision and recall is the F1 measure, defined as the harmonic mean of precision and recall [18]; a minimal computation is sketched below.
In Fig. 1 (top) we report 95% confidence intervals for the F1 measure and execution time of GS(m) (margins m = 1 to 3) and RGS(m,k) (margins 1 to 3 and k = 1000 random subsets), using 20 data sets containing 10 to 100 variables, with the target variable X1 perturbed (inverted) by noise with 10% probability. As can be seen, RGS(m,k) and GS(m) using the same value for the margin perform comparably with respect to F1, up to their 95% confidence intervals. With respect to execution time, however, RGS(m,k) exhibits much greater scalability (Fig. 1 bottom, log scale); for example, it executes in about 10 seconds on average in domains containing 100 variables, while GS(m) executes in 1,000 seconds on average for this domain size.

Figure 1: GS(m) and RGS(m,k) performance with respect to domain size (number of variables), for true Markov boundary size 3, 1,000 data points, Bernoulli probability 0.6 and noise probability 0.1. Top: F1 measure, reflecting accuracy. Bottom: execution time in seconds (log scale).

We also compared GS(m) and RGS(m,k) to RELIEF [5], a well-known algorithm for feature selection that is known to be able to recover parity functions in certain cases [5]. RELIEF learns a weight for each variable and compares it to a threshold τ to decide on its inclusion in the set of relevant variables. As it has been reported [9] that RELIEF can exhibit large variance due to randomization that is necessary only for very large data sets, we instead used a deterministic variant called RELIEVED [9], whose behavior corresponds to RELIEF at the limit of infinite execution time. We calculated the F1 measure for GS(m), RGS(m,k) and RELIEVED in the presence of varying amounts of noise, with noise probability ranging from 0 (no noise) to 0.4. We used domains containing 50 variables, as GS(m) becomes computationally demanding in larger domains. In Figure 2 (left) we show the performance of GS(m) and RGS(m,k) for m equal to 1 and 3, k = 1000, and RELIEVED for thresholds τ = 0.01 and 0.03, for various amounts of noise on the target variable. Again, each experiment was repeated 20 times to generate 95% confidence intervals. We can observe that even though m = 1 (equivalent to the GS algorithm) performs poorly, increasing the margin m makes it more likely to recover the correct Markov boundary, and GS(3) (m = 3) recovers the exact blanket even with few (1,000) data points. RELIEVED does comparably to GS(3) for little noise and for a large threshold, but appears to deteriorate for more noisy domains. As we can see, it is difficult to choose the "right" threshold for RELIEVED: a τ that performs better at low noise can become worse in noisy environments; in particular, small τ tend to include irrelevant variables while large τ tend to miss actual members.

We also evaluated GS(m), RGS(m,k), and RELIEVED on benchmark and real-world data sets from the UCI Machine Learning repository. As the true Markov boundary for these is impossible to know, we used as performance measure a measure of probabilistic isolation, by the Markov boundary returned, of subsets outside the boundary. For each domain variable X, we measured the independence of subsets Y of size 1, 2 and 3 given the blanket S of X returned by GS(3) and RELIEVED for τ = 0.03 (as this value seemed to do better in the previous set of experiments), as measured by the average p-value of the χ² test between X and Y given S (with p-values of 0 and 1 indicating ideal dependence and independence, respectively). Due to the large number of subsets outside the boundary when the boundary is small, we limited the estimation of isolation performance to 2,000 subsets per variable. We plot the results in Figure 2 (middle and right). Each point represents a variable in the corresponding data set.

Figure 2: Left: F1 measure of GS(m), RGS(m,k) and RELIEVED under increasing amounts of noise (50 variables, true Markov boundary size 3, Bernoulli probability 0.6, 1,000 data points). Middle: probabilistic isolation performance comparison between GS(3) and RELIEVED on real-world and benchmark data sets (Balance scale, Balloons, Car evaluation, Credit screening, Monks, Nursery, Tic-tac-toe, Breast cancer, Chess, Audiology). Right: same for GS(3) and RGS(3,1000).

Points under the diagonal indicate better probabilistic isolation performance for that variable for GS(3) compared to RELIEVED (middle plot) or to RGS(3,1000) (right plot). To obtain a statistically significant comparison, we used the non-parametric Wilcoxon paired signed-rank test, which indicated that GS(3) and RGS(3,1000) are statistically equivalent to each other, while both outperformed RELIEVED at the 99.99% significance level (α < 10⁻⁷).

7 Conclusion

In this paper we presented algorithms for the problem of feature selection in unrestricted (arbitrary distribution) domains that may contain complex interactions that only appear when the values of multiple features are considered together. We introduced two algorithms: an exact, provably correct one, as well as a more practical randomized anytime version, and evaluated them on artificial, benchmark and real-world data, demonstrating that they perform well, even in the presence of noise. We also introduced the Markov Boundary Theorem that precisely characterizes the properties of a boundary, and used it to prove m-correctness of the exact family of algorithms presented. We made minimal assumptions that consist of only a general set of axioms that hold for every probability distribution, giving our algorithms universal applicability.

Appendix A: Proof sketch of the Markov Boundary Theorem

Proof sketch. (⇒ direction) We need to prove that if S is a Markov boundary of X then (a) for every set T ⊆ U − S − {X}, (X ⊥ T | S − T), and (b) for every set T′ ⊈ U − S that does not contain X, ¬(X ⊥ T′ | S − T′). Case (a) is immediate from the definition of the boundary and the Decomposition axiom. Case (b) can be proven by contradiction: assuming the independence of a T′ that contains a non-empty part T′₁ in S and a part T′₂ in U − S, we get (from Decomposition) (X ⊥ T′₁ | S − T′₁). We can then use Contraction to show that the set S − T′₁ satisfies the independence property of a Markov boundary, i.e., that (X ⊥ U − (S − T′₁) − {X} | S − T′₁), which contradicts the assumption that S is a boundary (and thus minimal).

(⇐ direction) We need to prove that if Eq. (2) holds, then S is a minimal Markov blanket. The proof that S is a blanket is immediate. We can prove minimality by contradiction: assume S = S1 ∪ S2 with S1 a blanket and S2 ≠ ∅, i.e., S1 is a blanket strictly smaller than S. Then (X ⊥ S2 | S1) = (X ⊥ S2 | S − S2). However, since S2 ⊈ U − S, from Eq. (2) we get ¬(X ⊥ S2 | S − S2), which is a contradiction.

Appendix B: Proof of m-Correctness of GS(m)

Let the value of the set S at the end of the growing phase be SG, its value at the end of the shrinking phase SS, and their difference S∆ = SG − SS. The following two observations are immediate.

Observation 1. For every Y ⊆ U − SG − {X} such that 1 ≤ |Y| ≤ m, (X ⊥ Y | SG).

Observation 2. For every Y ∈ SS, ¬(X ⊥ Y | SS − {Y}).

Lemma 2. Consider variables Y1, Y2, . . . , Yt for some t ≥ 1 and let Y = {Y1, . . . , Yt}. Assuming that Contraction holds, if (X ⊥ Yi | S − {Y1, . . . , Yi}) for all i = 1, . . . , t, then (X ⊥ Y | S − Y).

Proof. By induction on Yj, j = 1, 2, . . . , t, using Contraction to decrease the conditioning set S down to S − {Y1, . . . , Yi} for all i = 1, 2, . . . , t. Since Y = {Y1, . . . , Yt}, we immediately obtain the desired relation (X ⊥ Y | S − Y).

Lemma 2 can be used to show that the variables found individually independent of X during the shrinking phase are actually jointly independent of X, given the final set SS.
Let S∆ = {Y1, Y2, . . . , Yt} be the set of variables removed (in that order) from SG to form the final set SS, i.e., S∆ = SG − SS. Using the above lemma, the following is immediate.

Corollary 3. Assuming that the Contraction axiom holds, (X ⊥ S∆ | SS).

Lemma 4. If the Contraction, Decomposition and Weak Union axioms hold, then for every set T ⊆ U − SG − {X} such that (X ⊥ T | SG),

(X ⊥ T ∪ (SG − SS) | SS). (3)

Furthermore SS is minimal, i.e., there does not exist a subset of SS for which Eq. (3) is true.

Proof. From Corollary 3, (X ⊥ S∆ | SS). Also, by the hypothesis, (X ⊥ T | SG) = (X ⊥ T | SS ∪ S∆), where S∆ = SG − SS as usual. From these two relations and Contraction we obtain (X ⊥ T ∪ S∆ | SS).

To prove minimality, let us assume that SS ≠ ∅ (if SS = ∅ then it is already minimal). We prove by contradiction: assume that there exists a set S′ ⊂ SS such that (X ⊥ T ∪ (SG − S′) | S′). Let W = SS − S′ ≠ ∅. Note that W and S′ are disjoint. We have that

SS ⊆ SS ∪ S∆ =⇒ SS − S′ ⊆ SS ∪ S∆ − S′ ⊆ T ∪ (SS ∪ S∆ − S′) =⇒ W ⊆ T ∪ (SS ∪ S∆ − S′) = T ∪ (SG − S′).

• Since (X ⊥ T ∪ (SG − S′) | S′) and W ⊆ T ∪ (SS ∪ S∆ − S′), from Decomposition we get (X ⊥ W | S′).
• From (X ⊥ W | S′) and Weak Union we have that for every Y ∈ W, (X ⊥ Y | S′ ∪ (W − {Y})).
• Since S′ and W are disjoint and Y ∈ W (so Y ∉ S′), we have S′ ∪ (W − {Y}) = (S′ ∪ W) − {Y} = SS − {Y}.
• Therefore, ∀ Y ∈ W, (X ⊥ Y | SS − {Y}).

However, at the end of the shrinking phase, all variables Y in SS (and therefore in W, as W ⊆ SS) have been evaluated for independence and found dependent (Observation 2). Thus, since W ≠ ∅, there exists at least one Y such that ¬(X ⊥ Y | SS − {Y}), producing a contradiction.

Theorem 5. Assuming that the Contraction, Decomposition, and Weak Union axioms hold, Algorithm 1 is m-correct with respect to X.

Proof. We use the Markov Boundary Theorem. We first prove that

∀ T ⊆ U − {X} such that |T| ≤ m:  T ⊆ U − SS =⇒ (X ⊥ T | SS − T)

or, equivalently, ∀ T ⊆ U − SS − {X} such that |T| ≤ m: (X ⊥ T | SS). Since U − SS − {X} = S∆ ∪ (U − SG − {X}), and S∆ and U − SG − {X} are disjoint, there are three kinds of sets of size m or less to consider: (i) all sets T ⊆ S∆, (ii) all sets T ⊆ U − SG − {X}, and (iii) all sets (if any) T = T′ ∪ T″, T′ ∩ T″ = ∅, that have a non-empty part T′ ⊆ S∆ and a non-empty part T″ ⊆ U − SG − {X}.

(i) From Corollary 3, (X ⊥ S∆ | SS). Therefore, from Decomposition, for any set T ⊆ S∆, (X ⊥ T | SS).

(ii) By Observation 1, for every set T ⊆ U − SG − {X} such that |T| ≤ m, (X ⊥ T | SG). By Lemma 4 we get (X ⊥ T ∪ S∆ | SS), from which we obtain (X ⊥ T | SS) by Decomposition.

(iii) Since |T| ≤ m, we have that |T″| ≤ m. Since T″ ⊆ U − SG − {X}, by Observation 1, (X ⊥ T″ | SG). Therefore, by Lemma 4, (X ⊥ T″ ∪ S∆ | SS). Since T′ ⊆ S∆ implies T″ ∪ T′ ⊆ T″ ∪ S∆, we apply Decomposition to obtain (X ⊥ T″ ∪ T′ | SS) = (X ⊥ T | SS).

To complete the proof we need to prove that

∀ T ⊆ U − {X} such that |T| ≤ m:  T ⊈ U − SS =⇒ ¬(X ⊥ T | SS − T).

Let T = T1 ∪ T2, with T1 ⊆ SS and T2 ⊆ U − SS. Since T ⊈ U − SS, T1 contains at least one variable Y ∈ SS. From Observation 2, ¬(X ⊥ Y | SS − {Y}). From this and (the contrapositive of) Weak Union, we get ¬(X ⊥ {Y} ∪ (T1 − {Y}) | SS − {Y} − (T1 − {Y})) = ¬(X ⊥ T1 | SS − T1). From (the contrapositive of) Decomposition we get ¬(X ⊥ T1 ∪ T2 | SS − T1) = ¬(X ⊥ T | SS − T1), which is equal to ¬(X ⊥ T | SS − T1 − T2) = ¬(X ⊥ T | SS − T), as SS and T2 are disjoint.

References

[1] Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.
[2] Daphne Koller and Mehran Sahami. Toward optimal feature selection. In Proceedings of the Thirteenth International Conference on Machine Learning (ICML), pages 284–292, 1996.
[3] P. M. Narendra and K. Fukunaga. A branch and bound algorithm for feature subset selection. IEEE Transactions on Computers, C-26(9):917–922, 1977.
[4] H. Almuallim and T. G. Dietterich. Learning with many irrelevant features. In Proceedings of the National Conference of the American Association for Artificial Intelligence (AAAI), 1991.
[5] K. Kira and L. A. Rendell. The feature selection problem: Traditional methods and a new algorithm. In Proceedings of the National Conference of the American Association for Artificial Intelligence (AAAI), pages 129–134, 1992.
[6] M. Dash and H. Liu. Feature selection for classification. Intelligent Data Analysis, 1(3):131–156, 1997.
[7] Huan Liu and Hiroshi Motoda, editors. Feature Extraction, Construction and Selection: A Data Mining Perspective, volume 453 of The Springer International Series in Engineering and Computer Science. 1998.
[8] Avrim Blum and Pat Langley. Selection of relevant features and examples in machine learning. Artificial Intelligence, 97(1-2):245–271, 1997.
[9] R. Kohavi and G. H. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273–324, 1997.
[10] Dimitris Margaritis and Sebastian Thrun. Bayesian network induction via local neighborhoods. In Advances in Neural Information Processing Systems 12 (NIPS), 2000.
[11] I. Tsamardinos, C. Aliferis, and A. Statnikov. Algorithms for large scale Markov blanket discovery. In Proceedings of the 16th International FLAIRS Conference, 2003.
[12] I. Tsamardinos, C. Aliferis, and A. Statnikov. Time and sample efficient discovery of Markov blankets and direct causal relations. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 673–678, 2003.
[13] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34:1436–1462, 2006.
[14] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. 1988.
[15] Michael Kearns and Umesh V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.
[16] A. Agresti. Categorical Data Analysis. John Wiley and Sons, 1990.
[17] M. Kearns. Efficient noise-tolerant learning from statistical queries. J. ACM, 45(6):983–1006, 1998.
[18] C. J. van Rijsbergen. Information Retrieval. Butterworth-Heinemann, London, 1979.

2 0.71916538 256 nips-2009-Which graphical models are difficult to learn?

Author: Andrea Montanari, Jose A. Pereira

Abstract: We consider the problem of learning the structure of Ising models (pairwise binary Markov random fields) from i.i.d. samples. While several methods have been proposed to accomplish this task, their relative merits and limitations remain somewhat obscure. By analyzing a number of concrete examples, we show that low-complexity algorithms systematically fail when the Markov random field develops long-range correlations. More precisely, this phenomenon appears to be related to the Ising model phase transition (although it does not coincide with it).

1 Introduction and main results

Given a graph G = (V = [p], E) and a positive parameter θ > 0, the ferromagnetic Ising model on G is the pairwise Markov random field

μ_{G,θ}(x) = (1/Z_{G,θ}) ∏_{(i,j)∈E} e^(θ x_i x_j) (1)

over binary variables x = (x1, x2, . . . , xp). Apart from being one of the most studied models in statistical mechanics, the Ising model is a prototypical undirected graphical model, with applications in computer vision, clustering and spatial statistics. Its obvious generalization to edge-dependent parameters θij, (i, j) ∈ E, is of interest as well, and will be introduced in Section 1.2.2. (Let us stress that we follow the statistical mechanics convention of calling (1) an Ising model for any graph G.)

In this paper we study the following structural learning problem: given n i.i.d. samples x(1), x(2), . . . , x(n) with distribution μ_{G,θ}(·), reconstruct the graph G. For the sake of simplicity, we assume that the parameter θ is known, and that G has no double edges (it is a 'simple' graph). The graph learning problem is solvable with unbounded sample complexity and computational resources [1]. The question we address is: for which classes of graphs and values of the parameter θ is the problem solvable under appropriate complexity constraints? More precisely, given an algorithm Alg, a graph G, a value θ of the model parameter, and a small δ > 0, the sample complexity is defined as

n_Alg(G, θ) ≡ inf{ n ∈ N : P_{n,G,θ}{Alg(x(1), . . . , x(n)) = G} ≥ 1 − δ }, (2)

where P_{n,G,θ} denotes probability with respect to n i.i.d. samples with distribution μ_{G,θ}. Further, we let χ_Alg(G, θ) denote the number of operations of the algorithm Alg when run on n_Alg(G, θ) samples.¹

¹ For the algorithms analyzed in this paper, the behavior of n_Alg and χ_Alg does not change significantly if we require only 'approximate' reconstruction (e.g. in graph distance).

The general problem is therefore to characterize the functions n_Alg(G, θ) and χ_Alg(G, θ), in particular for an optimal choice of the algorithm. General bounds on n_Alg(G, θ) have been given in [2, 3], under the assumption of unbounded computational resources. A general characterization of how well low-complexity algorithms can perform is therefore lacking. Although we cannot prove such a general characterization, in this paper we estimate n_Alg and χ_Alg for a number of graph models, as a function of θ, and unveil a fascinating universal pattern: when the model (1) develops long-range correlations, low-complexity algorithms fail. Under the Ising model, the variables {xi}_{i∈V} become strongly correlated for θ large. For a large class of graphs with degree bounded by ∆, this phenomenon corresponds to a phase transition beyond some critical value of θ uniformly bounded in p, with typically θcrit ≤ const./∆. In the examples discussed below, the failure of low-complexity algorithms appears to be related to this phase transition (although it does not coincide with it).
1.1 A toy example: the thresholding algorithm

In order to illustrate the interplay between graph structure, sample complexity and interaction strength θ, it is instructive to consider a warmup example. The thresholding algorithm reconstructs G by thresholding the empirical correlations

C_{ij} ≡ (1/n) ∑_{ℓ=1}^{n} x_i^{(ℓ)} x_j^{(ℓ)}  for i, j ∈ V.    (3)

THRESHOLDING( samples {x^{(ℓ)}}, threshold τ )
1: Compute the empirical correlations {C_{ij}}_{(i,j)∈V×V};
2: For each (i,j) ∈ V × V
3:   If C_{ij} ≥ τ, set (i,j) ∈ E;

We will denote this algorithm by Thr(τ). Notice that its complexity is dominated by the computation of the empirical correlations, i.e. χ_{Thr(τ)} = O(p² n). The sample complexity n_{Thr(τ)} can be bounded for specific classes of graphs as follows (the proofs are straightforward and omitted from this paper).

Theorem 1.1. If G has maximum degree ∆ > 1 and if θ < atanh(1/(2∆)), then there exists τ = τ(θ) such that

n_{Thr(τ)}(G, θ) ≤ (8 / (tanh θ − 1/(2∆))²) log(2p/δ).    (4)

Further, the choice τ(θ) = (tanh θ + 1/(2∆))/2 achieves this bound.

Theorem 1.2. There exists a numerical constant K such that the following is true. If ∆ > 3 and θ > K/∆, there are graphs of bounded degree ∆ such that for any τ, n_{Thr(τ)} = ∞, i.e. the thresholding algorithm always fails with high probability.

These results confirm the idea that the failure of low-complexity algorithms is related to long-range correlations in the underlying graphical model. If the graph G is a tree, then correlations between far-apart variables x_i, x_j decay exponentially with the distance between vertices i, j. The same happens on bounded-degree graphs if θ ≤ const./∆. However, for θ > const./∆, there exist families of bounded-degree graphs with long-range correlations.
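A direct implementation of Thr(τ) takes only a few lines. The sketch below is our own illustrative code, assuming the samples are an n × p array over {−1, +1}; it computes the correlations of Eq. (3) and thresholds them.

```python
import numpy as np

def thresholding(samples, tau):
    """Thr(tau) of Section 1.1: reconstruct E by thresholding empirical correlations.

    samples: (n, p) array with entries in {-1, +1}; tau: threshold in (0, 1).
    The cost is dominated by the correlation computation, O(p^2 n).
    """
    n, p = samples.shape
    C = samples.T @ samples / n  # C_ij = (1/n) * sum_l x_i^(l) x_j^(l), Eq. (3)
    return {(i, j) for i in range(p) for j in range(i + 1, p) if C[i, j] >= tau}

# Threshold suggested by Theorem 1.1 (delta_max is the assumed degree bound):
# tau = (np.tanh(theta) + 1.0 / (2 * delta_max)) / 2
```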
1.2 More sophisticated algorithms

In this section we characterize χ_{Alg}(G, θ) and n_{Alg}(G, θ) for more advanced algorithms. We again obtain very distinct behaviors of these algorithms depending on long-range correlations. Due to space limitations, we focus on two types of algorithms and only outline the proof of our most challenging result, namely Theorem 1.6. In the following we denote by ∂i the neighborhood of a node i ∈ G (i ∉ ∂i), and assume the degree to be bounded: |∂i| ≤ ∆.

1.2.1 Local Independence Test

A recurring approach to structural learning consists in exploiting the conditional independence structure encoded by the graph [1, 4, 5, 6]. Let us consider, to be definite, the approach of [4], specializing it to the model (1). Fix a vertex r whose neighborhood we want to reconstruct, and consider the conditional distribution of x_r given its neighbors: μ_{G,θ}(x_r | x_{∂r}). (If a is a vector and R is a set of indices, we denote by a_R the vector formed by the components of a with index in R.) Any change of x_i, i ∈ ∂r, produces a change in this distribution which is bounded away from 0. Let U be a candidate neighborhood, and assume U ⊆ ∂r. Then changing the value of x_j, j ∈ U, will produce a noticeable change in the marginal of X_r, even if we condition on the remaining values in U and in any W, |W| ≤ ∆. On the other hand, if U ⊄ ∂r, then it is possible to find W (with |W| ≤ ∆) and a node i ∈ U such that changing its value, after fixing all other values in U ∪ W, will produce no noticeable change in the conditional marginal. (Just choose i ∈ U\∂r and W = ∂r\U.) This procedure allows us to distinguish subsets of ∂r from other sets of vertices, thus motivating the following algorithm.

LOCAL INDEPENDENCE TEST( samples {x^{(ℓ)}}, thresholds (ǫ, γ) )
1: Select a node r ∈ V;
2: Set as its neighborhood the largest candidate neighborhood U of size at most ∆ for which the score function SCORE(U) > ǫ/2;
3: Repeat for all nodes r ∈ V;

The score function SCORE(·) depends on ({x^{(ℓ)}}, ∆, γ) and is defined as follows:

SCORE(U) = min_{W,j} max_{x_i, x_W, x_U, x_j} | P_{n,G,θ}{X_i = x_i | X_W = x_W, X_U = x_U} − P_{n,G,θ}{X_i = x_i | X_W = x_W, X_{U\j} = x_{U\j}, X_j = x_j} |.    (5)

In the minimum, |W| ≤ ∆ and j ∈ U. In the maximum, the values must be such that

P_{n,G,θ}{X_W = x_W, X_U = x_U} > γ/2,  P_{n,G,θ}{X_W = x_W, X_{U\j} = x_{U\j}, X_j = x_j} > γ/2,

where P_{n,G,θ} is the empirical distribution calculated from the samples {x^{(ℓ)}}. We denote this algorithm by Ind(ǫ, γ). The search over candidate neighborhoods U, the search for minima and maxima in the computation of SCORE(U), and the computation of P_{n,G,θ} all contribute to χ_{Ind}(G, θ). Both theorems that follow are consequences of the analysis of [4].

Theorem 1.3. Let G be a graph of bounded degree ∆ ≥ 1. For every θ there exist (ǫ, γ), and a numerical constant K, such that

n_{Ind(ǫ,γ)}(G, θ) ≤ (100∆ / (ǫ² γ⁴)) log(2p/δ),   χ_{Ind(ǫ,γ)}(G, θ) ≤ K (2p)^{2∆+1} log p.

More specifically, one can take ǫ = (1/4) sinh(2θ), γ = e^{−4∆θ} 2^{−2∆}.

This first result implies in particular that G can be reconstructed with polynomial complexity for any bounded ∆. However, the degree of such a polynomial is pretty high and non-uniform in ∆. This makes the above approach impractical.

A way out was proposed in [4]. The idea is to identify a set of 'potential neighbors' of vertex r via thresholding:

B(r) = {i ∈ V : C_{ri} > κ/2}.    (6)

For each node r ∈ V, we evaluate SCORE(U) by restricting the minimum in Eq. (5) to W ⊆ B(r), and search only over U ⊆ B(r). We call this algorithm IndD(ǫ, γ, κ). The basic intuition here is that C_{ri} decreases rapidly with the graph distance between vertices r and i. As mentioned above, this is true at small θ.

Theorem 1.4. Let G be a graph of bounded degree ∆ ≥ 1. Assume that θ < K/∆ for some small enough constant K. Then there exist ǫ, γ, κ such that

n_{IndD(ǫ,γ,κ)}(G, θ) ≤ 8(κ^{−2} + 8^∆) log(4p/δ),   χ_{IndD(ǫ,γ,κ)}(G, θ) ≤ K′ p ∆^{∆ log(4/κ)/α} + K′ ∆ p² log p.

More specifically, we can take κ = tanh θ, ǫ = (1/4) sinh(2θ) and γ = e^{−4∆θ} 2^{−2∆}.
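The score of Eq. (5) can be estimated empirically along the following lines. This is our own simplified sketch, not the exact procedure of [4]: it restricts the maximum to value configurations actually observed in the data, applies the γ/2 mass check only to the unperturbed configuration, and assumes binary ±1 variables so that "changing x_j" amounts to a sign flip.

```python
import itertools
import numpy as np

def cond_prob(samples, target, t_val, cond, c_vals):
    """Empirical P{X_target = t_val | X_cond = c_vals}; None when the
    conditioning event has no support in the samples."""
    mask = (np.all(samples[:, cond] == c_vals, axis=1)
            if cond else np.ones(len(samples), dtype=bool))
    return None if mask.sum() == 0 else float(np.mean(samples[mask, target] == t_val))

def score(samples, r, U, delta, gamma):
    """Sketch of SCORE(U), Eq. (5): min over (W, j) of the max perturbation of
    the conditional of X_r when the value of x_j is flipped."""
    n, p = samples.shape
    rest = [v for v in range(p) if v != r and v not in U]
    best = np.inf
    for k in range(delta + 1):
        for W in itertools.combinations(rest, k):
            for j in U:
                cond = list(W) + list(U)
                jpos = cond.index(j)
                worst = 0.0
                for vals in set(map(tuple, samples[:, cond].tolist())):
                    vals = np.array(vals)
                    if np.mean(np.all(samples[:, cond] == vals, axis=1)) <= gamma / 2:
                        continue  # configuration too rare to trust
                    flipped = vals.copy()
                    flipped[jpos] = -flipped[jpos]  # perturb x_j, keep x_W, x_{U\j}
                    a = cond_prob(samples, r, 1, cond, vals)
                    b = cond_prob(samples, r, 1, cond, flipped)
                    if a is not None and b is not None:
                        worst = max(worst, abs(a - b))
                best = min(best, worst)
    return best
```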
1.2.2 Regularized Pseudo-Likelihoods

A different approach to the learning problem consists in maximizing an appropriate empirical likelihood function [7, 8, 9, 10, 13]. To control the fluctuations caused by the limited number of samples, and to select sparse graphs, a regularization term is often added [7, 8, 9, 10, 11, 12, 13]. As a specific low-complexity implementation of this idea, we consider the ℓ1-regularized pseudo-likelihood method of [7]. For each node r, the following likelihood function is considered:

L(θ; {x^{(ℓ)}}) = −(1/n) ∑_{ℓ=1}^{n} log P_{n,G,θ}(x_r^{(ℓ)} | x_{\r}^{(ℓ)}),    (7)

where x_{\r} = x_{V\r} = {x_i : i ∈ V \ r} is the vector of all variables except x_r, and P_{n,G,θ} is defined from the following extension of (1):

μ_{G,θ}(x) = (1/Z_{G,θ}) e^{∑_{i,j∈V} θ_{ij} x_i x_j},    (8)

where θ = {θ_{ij}}_{i,j∈V} is a vector of real parameters. Model (1) corresponds to θ_{ij} = 0 for all (i,j) ∉ E and θ_{ij} = θ for all (i,j) ∈ E.

The function L(θ; {x^{(ℓ)}}) depends only on θ_{r,·} = {θ_{rj}, j ∈ ∂r} and is used to estimate the neighborhood of each node by the following algorithm, Rlr(λ):

REGULARIZED LOGISTIC REGRESSION( samples {x^{(ℓ)}}, regularization λ )
1: Select a node r ∈ V;
2: Calculate θ̂_{r,·} = argmin_{θ_{r,·} ∈ ℝ^{p−1}} { L(θ_{r,·}; {x^{(ℓ)}}) + λ ||θ_{r,·}||₁ };
3: If θ̂_{rj} > 0, set (r,j) ∈ E;

Our first result shows that Rlr(λ) indeed reconstructs G if θ is sufficiently small.

Theorem 1.5. There exist numerical constants K₁, K₂, K₃ such that the following is true. Let G be a graph with degree bounded by ∆ ≥ 3. If θ ≤ K₁/∆, then there exists λ such that

n_{Rlr(λ)}(G, θ) ≤ K₂ θ^{−2} ∆ log(8p²/δ).    (9)

Further, the above holds with λ = K₃ θ ∆^{−1/2}.

This theorem is proved by noting that for θ ≤ K₁/∆ correlations decay exponentially, which makes all conditions in Theorem 1 of [7] (denoted there by A1 and A2) hold, and then computing the probability of success as a function of n, while strengthening the error bounds of [7].

In order to prove a converse to the above result, we need to make some assumptions on λ. Given θ > 0, we say that λ is 'reasonable' for that value of θ if the following conditions hold: (i) Rlr(λ) is successful with probability larger than 1/2 on any star graph (a graph composed of a vertex r connected to ∆ neighbors, plus isolated vertices); (ii) λ ≤ δ(n) for some sequence δ(n) ↓ 0.

Theorem 1.6. There exists a numerical constant K such that the following happens. If ∆ > 3 and θ > K/∆, then there exist graphs G of degree bounded by ∆ such that for all reasonable λ, n_{Rlr(λ)}(G) = ∞, i.e. regularized logistic regression fails with high probability.

The graphs for which regularized logistic regression fails are not contrived examples. Indeed we will prove that the claim in the last theorem holds with high probability when G is a uniformly random graph of regular degree ∆. The proof of Theorem 1.6 is based on showing that an appropriate incoherence condition is necessary for Rlr to successfully reconstruct G. The analogous result was proven in [14] for model selection using the Lasso. In this paper we show that such a condition is also necessary when the underlying model is an Ising model. Notice that, given the graph G, checking the incoherence condition is NP-hard for general (non-ferromagnetic) Ising models, and requires significant computational effort even in the ferromagnetic case. Hence the incoherence condition does not provide, by itself, a clear picture of which graph structures are difficult to learn. We will instead show how to evaluate it on specific graph families.

[Figure 1: Learning random subgraphs of a 7 × 7 (p = 49) two-dimensional grid from n = 4500 Ising model samples, using regularized logistic regression. Left: success probability as a function of the model parameter θ and of the regularization parameter λ₀ (darker corresponds to higher probability). Right: the same data plotted for several choices of λ versus θ. The vertical line corresponds to the model critical temperature. The thick line is an envelope of the curves obtained for different λ, and should correspond to optimal regularization.]
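In practice, step 2 of Rlr(λ) is an ℓ1-regularized logistic regression of x_r on the remaining variables. The sketch below is our own stand-in for the pseudo-likelihood minimization of Eq. (7), using scikit-learn; the mapping from λ to the parameter C reflects that scikit-learn scales the data-fit term, not the penalty, by C.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def rlr_neighborhood(samples, r, lam):
    """One pass of Rlr(lambda) for node r: l1-penalized logistic regression of
    x_r on the other variables, then keep coordinates with positive coupling.

    Objective solved: (1/n) sum log-loss + lam * ||theta||_1, obtained from
    sklearn's C * sum log-loss + ||theta||_1 by setting C = 1 / (n * lam).
    """
    n, p = samples.shape
    others = [j for j in range(p) if j != r]
    clf = LogisticRegression(penalty="l1", C=1.0 / (n * lam), solver="liblinear")
    clf.fit(samples[:, others], samples[:, r])
    theta_hat = clf.coef_.ravel()
    # Step 3 of the algorithm: an edge for every positive estimated coupling.
    return {others[k] for k in range(p - 1) if theta_hat[k] > 0}
```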
Under the restriction λ → 0, the solutions given by Rlr converge to θ* with n [7]. Thus, for large n we can expand L around θ* to second order in (θ − θ*). When we add the regularization term to L, we obtain a quadratic model analogous to the Lasso, plus an error term due to the quadratic approximation. It is thus not surprising that, when λ → 0, the incoherence condition introduced for the Lasso in [14] is also relevant for the Ising model.

2 Numerical experiments

In order to explore the practical relevance of the above results, we carried out extensive numerical simulations using the regularized logistic regression algorithm Rlr(λ). Among other learning algorithms, Rlr(λ) strikes a good balance of complexity and performance. Samples from the Ising model (1) were generated using Gibbs sampling (a.k.a. Glauber dynamics). The mixing time can be very large for θ ≥ θ_crit, and was estimated using the time required for the overall bias to change sign (this is a quite conservative estimate at low temperature). Generating the samples {x^{(ℓ)}} was indeed the bulk of our computational effort and took about 50 days of CPU time on Pentium Dual Core processors (we show here only part of these data). Notice that Rlr(λ) had been tested in [7] only on tree graphs G, or in the weakly coupled regime θ < θ_crit. In these cases sampling from the Ising model is easy, but structural learning is also intrinsically easier.

Figure 1 reports the success probability of Rlr(λ) when applied to random subgraphs of a 7 × 7 two-dimensional grid. Each such graph was obtained by removing each edge independently with probability ρ = 0.3. Success probability was estimated by applying Rlr(λ) to each vertex of 8 graphs (thus averaging over 392 runs of Rlr(λ)), using n = 4500 samples. We scaled the regularization parameter as λ = 2λ₀ θ (log p / n)^{1/2} (this choice is motivated by the algorithm analysis and is empirically the most satisfactory), and searched over λ₀. The data clearly illustrate the phenomenon discussed. Despite the large number of samples, n ≫ log p, when θ crosses a threshold the algorithm starts performing poorly irrespective of λ. Intriguingly, this threshold is not far from the critical point of the Ising model on a randomly diluted grid, θ_crit(ρ = 0.3) ≈ 0.7 [15, 16].

[Figure 2: Learning uniformly random graphs of degree 4 from Ising model samples, using Rlr. Left: success probability as a function of the number of samples n for several values of θ. Right: the same data plotted for several choices of λ versus θ, as in Fig. 1, right panel.]

Figure 2 presents similar data when G is a uniformly random graph of degree ∆ = 4, over p = 50 vertices. The evolution of the success probability with n clearly shows a dichotomy. When θ is below a threshold, a small number of samples is sufficient to reconstruct G with high probability. Above the threshold, even n = 10⁴ samples are too few. In this case we can predict the threshold analytically, cf. Lemma 3.3 below, and get θ_thr(∆ = 4) ≈ 0.4203, which compares favorably with the data.
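For completeness, a minimal Glauber-dynamics sampler of the kind used to generate the samples {x^{(ℓ)}} might look as follows. This is our own sketch: the burn-in and thinning values are illustrative defaults and, as noted above, must grow sharply near θ_crit.

```python
import numpy as np

def gibbs_sample_ising(edges, p, theta, n_samples, burn_in=1000, thin=10, rng=None):
    """Glauber dynamics for the pairwise model of Eq. (1).

    Sweeps all sites in order; for the ferromagnetic Ising model,
    P{x_i = +1 | rest} = 1 / (1 + exp(-2 * theta * sum of neighbor spins)).
    """
    rng = rng or np.random.default_rng()
    nbrs = [[] for _ in range(p)]
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    x = rng.choice([-1, 1], size=p)
    out = []
    for sweep in range(burn_in + n_samples * thin):
        for i in range(p):
            field = theta * sum(x[j] for j in nbrs[i])
            x[i] = 1 if rng.random() < 1.0 / (1.0 + np.exp(-2.0 * field)) else -1
        if sweep >= burn_in and (sweep - burn_in) % thin == 0:
            out.append(x.copy())
    return np.array(out)
```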
3 Proofs

In order to prove Theorem 1.6, we need a few auxiliary results. It is convenient to introduce some notation. If M is a matrix and R, P are index sets, then M_{RP} denotes the submatrix with row indices in R and column indices in P. As above, we let r be the vertex whose neighborhood we are trying to reconstruct and define S = ∂r, S^c = V \ (∂r ∪ r).

Since the cost function L(θ; {x^{(ℓ)}}) + λ||θ||₁ only depends on θ through its components θ_{r,·} = {θ_{rj}}, we will hereafter neglect all the other parameters and write θ as a shorthand for θ_{r,·}. Let ẑ* be a subgradient of ||θ||₁ evaluated at the true parameter values θ* = {θ*_{rj} : θ*_{rj} = 0 ∀j ∉ ∂r, θ*_{rj} = θ ∀j ∈ ∂r}. Let θ̂^n be the parameter estimate returned by Rlr(λ) when the number of samples is n. Note that, since we assumed θ* ≥ 0, ẑ*_S = 𝟙. Define Q^n(θ; {x^{(ℓ)}}) to be the Hessian of L(θ; {x^{(ℓ)}}) and Q(θ) = lim_{n→∞} Q^n(θ; {x^{(ℓ)}}). By the law of large numbers, Q(θ) is the Hessian of E_{G,θ} log P_{G,θ}(X_r | X_{\r}), where E_{G,θ} is the expectation with respect to (8) and X is a random variable distributed according to (8). We will denote the maximum and minimum eigenvalues of a symmetric matrix M by σ_max(M) and σ_min(M) respectively.

We will omit arguments whenever clear from the context. Any quantity evaluated at the true parameter values will be represented with a *, e.g. Q* = Q(θ*). Quantities under a ^ depend on n. Throughout this section, G is a graph of maximum degree ∆.

3.1 Proof of Theorem 1.6

Our first auxiliary result establishes that, if λ is small, then ||Q*_{S^c S} (Q*_{SS})^{−1} ẑ*_S||_∞ > 1 is a sufficient condition for the failure of Rlr(λ).

Lemma 3.1. Assume [Q*_{S^c S} (Q*_{SS})^{−1} ẑ*_S]_i ≥ 1 + ǫ for some ǫ > 0 and some row i ∈ V, σ_min(Q*_{SS}) ≥ C_min > 0, and λ < C_min³ ǫ / (2⁹ ∆⁴). Then the success probability of Rlr(λ) is upper bounded as

P_succ ≤ 4∆² e^{−n δ_A²} + 2∆ e^{−n λ² δ_B²},    (10)

where δ_A = (C_min / 100∆²) ǫ and δ_B = (C_min / 8∆) ǫ.

The next lemma implies that, for λ to be 'reasonable' (in the sense introduced in Section 1.2.2), nλ² must be unbounded.

Lemma 3.2. There exists M = M(K, θ) > 0 for θ > 0 such that the following is true: if G is the graph with only one edge between nodes r and i, and nλ² ≤ K, then

P_succ ≤ e^{−M(K,θ) p} + e^{−n(1−tanh θ)²/32}.    (11)

Finally, our key result shows that the condition ||Q*_{S^c S} (Q*_{SS})^{−1} ẑ*_S||_∞ ≤ 1 is violated with high probability for large random graphs. The proof of this result relies on a local weak convergence result for ferromagnetic Ising models on random graphs proved in [17].

Lemma 3.3. Let G be a uniformly random regular graph of degree ∆ > 3, and ǫ > 0 be sufficiently small. Then there exists θ_thr(∆, ǫ) such that, for θ > θ_thr(∆, ǫ), ||Q*_{S^c S} (Q*_{SS})^{−1} ẑ*_S||_∞ ≥ 1 + ǫ with probability converging to 1 as p → ∞. Furthermore, for large ∆, θ_thr(∆, 0+) = θ̃ ∆^{−1} (1 + o(1)). The constant θ̃ is given by θ̃ = (tanh h̄)/h̄, where h̄ is the unique positive solution of h̄ tanh h̄ = (1 − tanh² h̄)². Finally, there exists C_min > 0, dependent only on ∆ and θ, such that σ_min(Q*_{SS}) ≥ C_min with probability converging to 1 as p → ∞.

The proofs of Lemmas 3.1 and 3.3 are sketched in the next subsection. Lemma 3.2 is more straightforward and we omit its proof for space reasons.

Proof. (Theorem 1.6) Fix ∆ > 3, θ > K/∆ (where K is a large enough constant independent of ∆), and ǫ, C_min > 0 both small enough. By Lemma 3.3, for any p large enough we can choose a ∆-regular graph G_p = (V = [p], E_p) and a vertex r ∈ V such that |[Q*_{S^c S} (Q*_{SS})^{−1} 𝟙_S]_i| > 1 + ǫ for some i ∈ V \ r. By Theorem 1 in [4] we can assume, without loss of generality, n > K′ ∆ log p for some small constant K′. Further, by Lemma 3.2, nλ² ≥ F(p) for some F(p) ↑ ∞ as p → ∞, and the condition of Lemma 3.1 on λ is satisfied since, by the 'reasonable' assumption, λ → 0 with n.
Using these results in Eq. (10) of Lemma 3.1, we get the following upper bound on the success probability:

P_succ(G_p) ≤ 4∆² p^{−δ_A² K′ ∆} + 2∆ e^{−n F(p) δ_B²}.    (12)

In particular, P_succ(G_p) → 0 as p → ∞. □

3.2 Proofs of auxiliary lemmas

Proof. (Lemma 3.1) We will show that, under the assumptions of the lemma and if θ̂ = (θ̂_S, θ̂_{S^c}) = (θ̂_S, 0), then the probability that the i-th component of any subgradient of L(θ; {x^{(ℓ)}}) + λ||θ||₁ vanishes for any θ̂_S > 0 (component-wise) is upper bounded as in Eq. (10). To simplify notation, we will omit {x^{(ℓ)}} in all the expressions derived from L.

Let ẑ be a subgradient of ||θ||₁ at θ̂ and assume ∇L(θ̂) + λẑ = 0. An application of the mean value theorem yields

∇²L(θ*)[θ̂ − θ*] = W^n − λẑ + R^n,    (13)

where W^n = −∇L(θ*) and [R^n]_j = [∇²L(θ̄^{(j)}) − ∇²L(θ*)]^T_j (θ̂ − θ*), with θ̄^{(j)} a point on the line from θ̂ to θ*. Notice that by definition ∇²L(θ*) = Q^{n*} = Q^n(θ*). To simplify notation we will omit the * in all Q^{n*}; all Q^n in this proof are thus evaluated at θ*.

Breaking this expression into its S and S^c components, and since θ̂_{S^c} = θ*_{S^c} = 0, we can eliminate θ̂_S − θ*_S from the two expressions obtained and write

[W^n_{S^c} − R^n_{S^c}] − Q^n_{S^c S} (Q^n_{SS})^{−1} [W^n_S − R^n_S] + λ Q^n_{S^c S} (Q^n_{SS})^{−1} ẑ_S = λ ẑ_{S^c}.

Now notice that Q^n_{S^c S} (Q^n_{SS})^{−1} = T₁ + T₂ + T₃ + T₄, where

T₁ = Q*_{S^c S} [(Q^n_{SS})^{−1} − (Q*_{SS})^{−1}],   T₂ = [Q^n_{S^c S} − Q*_{S^c S}] (Q*_{SS})^{−1},
T₃ = [Q^n_{S^c S} − Q*_{S^c S}] [(Q^n_{SS})^{−1} − (Q*_{SS})^{−1}],   T₄ = Q*_{S^c S} (Q*_{SS})^{−1}.    (14)

We will assume that the samples {x^{(ℓ)}} are such that the following event holds:

E ≡ { ||Q^n_{SS} − Q*_{SS}||_∞ < ξ_A,  ||Q^n_{S^c S} − Q*_{S^c S}||_∞ < ξ_B,  ||W^n_S / λ||_∞ < ξ_C },    (15)

where ξ_A ≡ C_min² ǫ / (16∆), ξ_B ≡ C_min ǫ / (8√∆) and ξ_C ≡ C_min ǫ / (8∆). Since E_{G,θ}(Q^n) = Q* and E_{G,θ}(W^n) = 0, and noticing that both Q^n and W^n are sums of bounded i.i.d. random variables, a simple application of the Azuma-Hoeffding inequality upper bounds the probability of the complement of E as in (10).

From E it follows that σ_min(Q^n_{SS}) > σ_min(Q*_{SS}) − C_min/2 > C_min/2. We can therefore lower bound the absolute value of the i-th component of ẑ_{S^c} by

|[Q*_{S^c S} (Q*_{SS})^{−1} 𝟙_S]_i| − ||T_{1,i}||_∞ − ||T_{2,i}||_∞ − ||T_{3,i}||_∞ − |W^n_i/λ| − |R^n_i/λ| − (∆/C_min)( ||W^n_S/λ||_∞ + ||R^n_S/λ||_∞ ),

where the subscript i denotes the i-th row of a matrix. The proof is completed by showing that the event E and the assumptions of the theorem imply that each of the last 7 terms in this expression is smaller than ǫ/8. Since |[Q*_{S^c S} (Q*_{SS})^{−1}]^T_i ẑ_S| ≥ 1 + ǫ by assumption, this implies |ẑ_i| ≥ 1 + ǫ/8 > 1, which cannot be, since any subgradient of the 1-norm has components of magnitude at most 1.

The last condition on E immediately bounds all terms involving W by ǫ/8. Some straightforward manipulations imply (see Lemma 7 from [7])

||T_{1,i}||_∞ ≤ (∆/C_min²) ||Q^n_{SS} − Q*_{SS}||_∞,   ||T_{2,i}||_∞ ≤ (√∆/C_min) ||[Q^n_{S^c S} − Q*_{S^c S}]_i||_∞,
||T_{3,i}||_∞ ≤ (2∆/C_min²) ||Q^n_{SS} − Q*_{SS}||_∞ ||[Q^n_{S^c S} − Q*_{S^c S}]_i||_∞,

and thus all will be bounded by ǫ/8 when E holds. The upper bound on R^n follows along similar lines via a mean value theorem, and is deferred to a longer version of this paper. □

Proof. (Lemma 3.3) Let us state explicitly the local weak convergence result mentioned in Sec. 3.1. For t ∈ ℕ, let T(t) = (V_T, E_T) be the regular rooted tree of t generations and define the associated Ising measure as

μ⁺_{T,θ}(x) = (1/Z_{T,θ}) ∏_{(i,j)∈E_T} e^{θ x_i x_j} ∏_{i∈∂T(t)} e^{h* x_i}.    (16)

Here ∂T(t) is the set of leaves of T(t), and h* is the unique positive solution of h = (∆ − 1) atanh{ tanh θ tanh h }.
It can be proved, using [17] and uniform continuity with respect to the 'external field', that non-trivial local expectations with respect to μ_{G,θ}(x) converge to local expectations with respect to μ⁺_{T,θ}(x) as p → ∞. More precisely, let B_r(t) denote a ball of radius t around node r ∈ G (the node whose neighborhood we are trying to reconstruct). For any fixed t, the probability that B_r(t) is not isomorphic to T(t) goes to 0 as p → ∞. Let g(x_{B_r(t)}) be any function of the variables in B_r(t) such that g(x_{B_r(t)}) = g(−x_{B_r(t)}). Then, almost surely over graph sequences G_p of uniformly random regular graphs with p nodes (expectations here are taken with respect to the measures (1) and (16)),

lim_{p→∞} E_{G,θ}{ g(X_{B_r(t)}) } = E_{T(t),θ,+}{ g(X_{T(t)}) }.    (17)

The proof consists in considering [Q*_{S^c S} (Q*_{SS})^{−1} ẑ*_S]_i for t = dist(r, i) finite. We then write (Q*_{SS})_{lk} = E{g_{l,k}(X_{B_r(t)})} and (Q*_{S^c S})_{il} = E{g_{i,l}(X_{B_r(t)})} for some functions g_{·,·}(X_{B_r(t)}), and apply the weak convergence result (17) to these expectations. We thus reduce the calculation of [Q*_{S^c S} (Q*_{SS})^{−1} ẑ*_S]_i to the calculation of expectations with respect to the tree measure (16). The latter can be implemented explicitly through a recursive procedure, with simplifications arising thanks to the tree symmetry and by taking t ≫ 1. The actual calculations consist in a (very) long exercise in calculus and we omit them from this outline. The lower bound on σ_min(Q*_{SS}) is proved by a similar calculation. □

Acknowledgments

This work was partially supported by a Terman fellowship, the NSF CAREER award CCF-0743978, the NSF grant DMS-0806211, and a Portuguese Doctoral FCT fellowship.

References

[1] P. Abbeel, D. Koller and A. Ng, "Learning factor graphs in polynomial time and sample complexity", Journal of Machine Learning Research, 2006, Vol. 7, 1743-1788.
[2] M. Wainwright, "Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting", arXiv:math/0702301v2 [math.ST], 2007.
[3] N. Santhanam, M. Wainwright, "Information-theoretic limits of selecting binary graphical models in high dimensions", arXiv:0905.2639v1 [cs.IT], 2009.
[4] G. Bresler, E. Mossel and A. Sly, "Reconstruction of Markov Random Fields from Samples: Some Observations and Algorithms", Proceedings of the 11th international workshop APPROX 2008 and 12th international workshop RANDOM 2008, 2008, 343-356.
[5] I. Csiszár and Z. Talata, "Consistent estimation of the basic neighborhood structure of Markov random fields", The Annals of Statistics, 2006, Vol. 34, No. 1, 123-145.
[6] N. Friedman, I. Nachman, and D. Peer, "Learning Bayesian network structure from massive datasets: The sparse candidate algorithm", In UAI, 1999.
[7] P. Ravikumar, M. Wainwright and J. Lafferty, "High-Dimensional Ising Model Selection Using l1-Regularized Logistic Regression", arXiv:0804.4202v1 [math.ST], 2008.
[8] M. Wainwright, P. Ravikumar, and J. Lafferty, "Inferring graphical model structure using l1-regularized pseudolikelihood", In NIPS, 2006.
[9] H. Höfling and R. Tibshirani, "Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods", Journal of Machine Learning Research, 2009, Vol. 10, 883-906.
[10] O. Banerjee, L. El Ghaoui and A. d'Aspremont, "Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data", Journal of Machine Learning Research, March 2008, Vol. 9, 485-516.
[11] M. Yuan and Y. Lin, "Model Selection and Estimation in Regression with Grouped Variables", Journal of the Royal
Statistical Society, Series B, 2006, Vol. 68, 49-67.
[12] N. Meinshausen and P. Bühlmann, "High-dimensional graphs and variable selection with the lasso", Annals of Statistics, 2006, Vol. 34, No. 3.
[13] R. Tibshirani, "Regression shrinkage and selection via the lasso", Journal of the Royal Statistical Society, Series B, 1996, Vol. 58, 267-288.
[14] P. Zhao, B. Yu, "On model selection consistency of Lasso", Journal of Machine Learning Research, 2006, Vol. 7, 2541-2563.
[15] D. Zobin, "Critical behavior of the bond-dilute two-dimensional Ising model", Phys. Rev. B, 1978, Vol. 18, No. 5, 2387-2390.
[16] M. Fisher, "Critical Temperatures of Anisotropic Ising Lattices. II. General Upper Bounds", Phys. Rev., Oct. 1967, Vol. 162, No. 2, 480-485.
[17] A. Dembo and A. Montanari, "Ising Models on Locally Tree-Like Graphs", Ann. Appl. Prob. (2008), to appear, arXiv:0804.4726v2 [math.PR].

3 0.51379794 90 nips-2009-Factor Modeling for Advertisement Targeting

Author: Ye Chen, Michael Kapralov, John Canny, Dmitry Y. Pavlov

Abstract: We adapt a probabilistic latent variable model, namely GaP (Gamma-Poisson) [6], to ad targeting in the contexts of sponsored search (SS) and behaviorally targeted (BT) display advertising. We also approach the important problem of ad positional bias by formulating a one-latent-dimension GaP factorization. Learning from click-through data is intrinsically large scale, even more so for ads. We scale up the algorithm to terabytes of real-world SS and BT data that contains hundreds of millions of users and hundreds of thousands of features, by leveraging the scalability characteristics of the algorithm and the inherent structure of the problem including data sparsity and locality. Specifically, we demonstrate two somewhat orthogonal philosophies of scaling algorithms to large-scale problems, through the SS and BT implementations, respectively. Finally, we report the experimental results using Yahoo's vast datasets, and show that our approach substantially outperforms the state-of-the-art methods in prediction accuracy. For BT in particular, the ROC area achieved by GaP exceeds 0.95, while one prior approach using Poisson regression [11] yielded 0.83. For computational performance, we compare a single-node sparse implementation with a parallel implementation using Hadoop MapReduce; the results are counterintuitive yet quite interesting. We therefore provide insights into the underlying principles of large-scale learning. 1

4 0.37856126 3 nips-2009-AUC optimization and the two-sample problem

Author: Nicolas Vayatis, Marine Depecker, Stéphan J. Clémençon

Abstract: The purpose of the paper is to explore the connection between multivariate homogeneity tests and AUC optimization. The latter problem has recently received much attention in the statistical learning literature. From the elementary observation that, in the two-sample problem setup, the null assumption corresponds to the situation where the area under the optimal ROC curve is equal to 1/2, we propose a two-stage testing method based on data splitting. A nearly optimal scoring function in the AUC sense is first learnt from one of the two half-samples. Data from the remaining half-sample are then projected onto the real line and eventually ranked according to the scoring function computed at the first stage. The last step amounts to performing a standard Mann-Whitney Wilcoxon test in the onedimensional framework. We show that the learning step of the procedure does not affect the consistency of the test as well as its properties in terms of power, provided the ranking produced is accurate enough in the AUC sense. The results of a numerical experiment are eventually displayed in order to show the efficiency of the method. 1

5 0.36446777 34 nips-2009-Anomaly Detection with Score functions based on Nearest Neighbor Graphs

Author: Manqi Zhao, Venkatesh Saligrama

Abstract: We propose a novel non-parametric adaptive anomaly detection algorithm for high dimensional data based on score functions derived from nearest neighbor graphs on n-point nominal data. Anomalies are declared whenever the score of a test sample falls below α, which is supposed to be the desired false alarm level. The resulting anomaly detector is shown to be asymptotically optimal in that it is uniformly most powerful for the specified false alarm level, α, for the case when the anomaly density is a mixture of the nominal and a known density. Our algorithm is computationally efficient, being linear in dimension and quadratic in data size. It does not require choosing complicated tuning parameters or function approximation classes and it can adapt to local structure such as local change in dimensionality. We demonstrate the algorithm on both artificial and real data sets in high dimensional feature spaces.

6 0.34524891 141 nips-2009-Local Rules for Global MAP: When Do They Work ?

7 0.33654344 205 nips-2009-Rethinking LDA: Why Priors Matter

8 0.33350444 206 nips-2009-Riffled Independence for Ranked Data

9 0.30919829 143 nips-2009-Localizing Bugs in Program Executions with Graphical Models

10 0.30888093 152 nips-2009-Measuring model complexity with the prior predictive

11 0.30193728 56 nips-2009-Conditional Neural Fields

12 0.29342705 23 nips-2009-Accelerating Bayesian Structural Inference for Non-Decomposable Gaussian Graphical Models

13 0.2901397 54 nips-2009-Compositionality of optimal control laws

14 0.28121138 69 nips-2009-Discrete MDL Predicts in Total Variation

15 0.27396384 42 nips-2009-Bayesian Sparse Factor Models and DAGs Inference and Comparison

16 0.26932672 252 nips-2009-Unsupervised Feature Selection for the $k$-means Clustering Problem

17 0.26407167 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models

18 0.25717804 197 nips-2009-Randomized Pruning: Efficiently Calculating Expectations in Large Dynamic Programs

19 0.25402898 193 nips-2009-Potential-Based Agnostic Boosting

20 0.24686009 240 nips-2009-Sufficient Conditions for Agnostic Active Learnable


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.011), (11, 0.269), (24, 0.044), (25, 0.097), (35, 0.05), (36, 0.107), (39, 0.032), (58, 0.05), (61, 0.021), (62, 0.05), (71, 0.072), (81, 0.018), (86, 0.069), (91, 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.83017325 149 nips-2009-Maximin affinity learning of image segmentation

Author: Kevin Briggman, Winfried Denk, Sebastian Seung, Moritz N. Helmstaedter, Srinivas C. Turaga

Abstract: Images can be segmented by first using a classifier to predict an affinity graph that reflects the degree to which image pixels must be grouped together and then partitioning the graph to yield a segmentation. Machine learning has been applied to the affinity classifier to produce affinity graphs that are good in the sense of minimizing edge misclassification rates. However, this error measure is only indirectly related to the quality of segmentations produced by ultimately partitioning the affinity graph. We present the first machine learning algorithm for training a classifier to produce affinity graphs that are good in the sense of producing segmentations that directly minimize the Rand index, a well known segmentation performance measure. The Rand index measures segmentation performance by quantifying the classification of the connectivity of image pixel pairs after segmentation. By using the simple graph partitioning algorithm of finding the connected components of the thresholded affinity graph, we are able to train an affinity classifier to directly minimize the Rand index of segmentations resulting from the graph partitioning. Our learning algorithm corresponds to the learning of maximin affinities between image pixel pairs, which are predictive of the pixel-pair connectivity. 1

same-paper 2 0.79557943 248 nips-2009-Toward Provably Correct Feature Selection in Arbitrary Domains

Author: Dimitris Margaritis

Abstract: In this paper we address the problem of provably correct feature selection in arbitrary domains. An optimal solution to the problem is a Markov boundary, which is a minimal set of features that make the probability distribution of a target variable conditionally invariant to the state of all other features in the domain. While numerous algorithms for this problem have been proposed, their theoretical correctness and practical behavior under arbitrary probability distributions is unclear. We address this by introducing the Markov Boundary Theorem that precisely characterizes the properties of an ideal Markov boundary, and use it to develop algorithms that learn a more general boundary that can capture complex interactions that only appear when the values of multiple features are considered together. We introduce two algorithms: an exact, provably correct one as well a more practical randomized anytime version, and show that they perform well on artificial as well as benchmark and real-world data sets. Throughout the paper we make minimal assumptions that consist of only a general set of axioms that hold for every probability distribution, which gives these algorithms universal applicability. 1 Introduction and Motivation The problem of feature selection has a long history due to its significance in a wide range of important problems, from early ones like pattern recognition to recent ones such as text categorization, gene expression analysis and others. In such domains, using all available features may be prohibitively expensive, unnecessarily wasteful, and may lead to poor generalization performance, especially in the presence of irrelevant or redundant features. Thus, selecting a subset of features of the domain for use in subsequent application of machine learning algorithms has become a standard preprocessing step. A typical task of these algorithms is learning a classifier: Given a number of input features and a quantity of interest, called the target variable, choose a member of a family of classifiers that can predict the target variable’s value as well as possible. Another task is understanding the domain and the quantities that interact with the target quantity. Many algorithms have been proposed for feature selection. Unfortunately, little attention has been paid to the issue of their behavior under a variety of application domains that can be encountered in practice. In particular, it is known that many can fail under certain probability distributions such as ones that contain a (near) parity function [1], which contain interactions that only appear when the values of multiple features are considered together. There is therefore an acute need for algorithms that are widely applicable and can be theoretically proven to work under any probability distribution. In this paper we present two such algorithms, an exact and a more practical randomized approximate one. We use the observation (first made in Koller and Sahami [2]) that an optimal solution to the problem is a Markov boundary, defined to be a minimal set of features that make the probability distribution of a target variable conditionally invariant to the state of all other features in the domain (a more precise definition is given later in Section 3) and present a family of algorithms for learning 1 the Markov boundary of a target variable in arbitrary domains. 
We first introduce a theorem that exactly characterizes the minimal set of features necessary for probabilistically isolating a variable, and then relax this definition to derive a family of algorithms that learn a parameterized approximation of the ideal boundary that are provably correct under a minimal set of assumptions, including a set of axioms that hold for any probability distribution. In the following section we present work on feature selection, followed by notation and definitions in Section 3. We subsequently introduce an important theorem and the aforementioned parameterized family of algorithms in Sections 4 and 5 respectively, including a practical anytime version. We evaluate these algorithms in Section 6 and conclude in Section 7.

2 Related Work

Numerous algorithms have been proposed for feature selection. At the highest level, algorithms can be classified as filter, wrapper, or embedded methods. Filter methods work without consulting the classifier (if any) that will make use of their output, i.e., the resulting set of selected features. They therefore typically have wider applicability, since they are not tied to any particular classifier family. In contrast, wrappers make the classifier an integral part of their operation, repeatedly invoking it to evaluate each of a sequence of feature subsets, and selecting the subset that results in minimum estimated classification error (for that particular classifier). Finally, embedded algorithms are classifier-learning algorithms that perform feature selection implicitly during their operation, e.g., decision tree learners.

Early work was motivated by the problem of pattern recognition, which inherently contains a large number of features (pixels, regions, signal responses at multiple frequencies etc.). Narendra and Fukunaga [3] first cast feature selection as a problem of maximization of an objective function over the set of features to use, and proposed a number of search approaches including forward selection and backward elimination. Later work by machine learning researchers includes the FOCUS algorithm of Almuallim and Dietterich [4], which is a filter method for deterministic, noise-free domains. The RELIEF algorithm [5] instead uses a randomized selection of data points to update a weight assigned to each feature, selecting the features whose weight exceeds a given threshold. A large number of additional algorithms have appeared in the literature, too many to list here; good surveys are included in Dash and Liu [6]; Guyon and Elisseeff [1]; Liu and Motoda [7]. An important concept for feature subset selection is relevance. Several notions of relevance are discussed in a number of important papers such as Blum and Langley [8]; Kohavi and John [9]. The argument that the problem of feature selection can be cast as the problem of Markov blanket discovery was first made convincingly in Koller and Sahami [2], who also presented an algorithm for learning an approximate Markov blanket using mutual information. Other algorithms include the GS algorithm [10], originally developed for learning the structure of a Bayesian network of a domain, and extensions to it [11] including the recent MMMB algorithm [12]. Meinshausen and Bühlmann [13] recently proposed an optimal theoretical solution to the problem of learning the neighborhood of a Markov network when the distribution of the domain can be assumed to be a multidimensional Gaussian, i.e., linear relations among features with Gaussian noise.
This assumption implies that the Composition axiom holds in the domain (see Pearl [14] for a definition of Composition); the difference with our work is that we address here the problem in general domains where it may not necessarily hold.

3 Notation and Preliminaries

In this section we present notation, fundamental definitions and axioms that will be subsequently used in the rest of the paper. We use the terms "feature" and "variable" interchangeably, and denote variables by capital letters (X, Y etc.) and sets of variables by bold letters (S, T etc.). We denote the set of all variables/features in the domain (the "universe") by U. All algorithms presented are independence-based, learning the Markov boundary of a given target variable using the truth value of a number of conditional independence statements. The use of conditional independence for feature selection subsumes many other criteria proposed in the literature. In particular, the use of classification accuracy of the target variable can be seen as a special case of testing for its conditional independence with some of its predictor variables (conditional on the subset selected at any given moment). A benefit of using conditional independence is that, while classification error estimates depend on the classifier family used, conditional independence does not. In addition, algorithms utilizing conditional independence for feature selection are applicable to all domain types, e.g., discrete, ordinal, continuous with non-linear or arbitrary non-degenerate associations, or mixed domains, as long as a reliable estimate of probabilistic independence is available.

We denote probabilistic independence by the symbol "⊥⊥", i.e., (X ⊥⊥ Y | Z) denotes the fact that the variables in set X are (jointly) conditionally independent of those in set Y given the values of the variables in set Z; (X ⊥̸⊥ Y | Z) denotes their conditional dependence. We assume the existence of a probabilistic independence query oracle that is available to answer any query of the form (X, Y | Z), corresponding to the question "Is the set of variables in X independent of the variables in Y given the value of the variables in Z?" (This is similar to the approach of learning from statistical queries of Kearns and Vazirani [15].) In practice, however, such an oracle does not exist, but it can be approximated by a statistical independence test on a data set. Many tests of independence have appeared and been studied extensively in the statistical literature over the last century; in this work we use the χ² (chi-square) test of independence [16].
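As an illustration, the independence oracle can be approximated for discrete data along the following lines. This is our own sketch of a stratified chi-square test, summing the statistic and degrees of freedom over the configurations of the conditioning set; it is not necessarily the exact test variant used in the experiments.

```python
import numpy as np
from scipy.stats import chi2

def ci_test(data, x, y, S, alpha=0.05):
    """Approximate the oracle (X indep Y | S) on discrete data.

    Builds one X-by-Y contingency table per observed configuration of the
    variables in S, pools the chi-square statistics and degrees of freedom,
    and returns True when independence is not rejected at level alpha.
    """
    stat, dof = 0.0, 0
    strata = [None] if len(S) == 0 else list(set(map(tuple, data[:, S].tolist())))
    for s in strata:
        rows = data if s is None else data[np.all(data[:, S] == np.array(s), axis=1)]
        xv, xi = np.unique(rows[:, x], return_inverse=True)
        yv, yi = np.unique(rows[:, y], return_inverse=True)
        if len(xv) < 2 or len(yv) < 2:
            continue  # degenerate stratum carries no evidence
        table = np.zeros((len(xv), len(yv)))
        np.add.at(table, (xi, yi), 1)
        expected = np.outer(table.sum(1), table.sum(0)) / table.sum()
        stat += ((table - expected) ** 2 / np.maximum(expected, 1e-12)).sum()
        dof += (len(xv) - 1) * (len(yv) - 1)
    p_value = 1.0 if dof == 0 else float(chi2.sf(stat, dof))
    return p_value >= alpha
```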
A Markov blanket of variable X is a set of variables such that, after fixing (by "knowing") the value of all of its members, the set of remaining variables in the domain, taken together as a single set-valued variable, is statistically independent of X. More precisely, we have the following definition.

Definition 1. A set of variables S ⊆ U is called a Markov blanket of variable X if and only if (X ⊥⊥ U − S − {X} | S).

Intuitively, a Markov blanket S of X captures all the information in the remaining domain variables U − S − {X} that can affect the probability distribution of X, making their value redundant as far as X is concerned (given S). The blanket therefore captures the essence of the feature selection problem for target variable X: by completely "shielding" X, a Markov blanket precludes the existence of any possible information about X that can come from variables not in the blanket, making it an ideal solution to the feature selection problem. A minimal Markov blanket is called a Markov boundary.

Definition 2. A set of variables S ⊆ U − {X} is called a Markov boundary of variable X if it is a minimal Markov blanket of X, i.e., none of its proper subsets is a Markov blanket.

Pearl [14] proved that the axioms of Symmetry, Decomposition, Weak Union, and Intersection are sufficient to guarantee a unique Markov boundary. These are shown below together with the axiom of Contraction.

(Symmetry)      (X ⊥⊥ Y | Z) ⟹ (Y ⊥⊥ X | Z)
(Decomposition) (X ⊥⊥ Y ∪ W | Z) ⟹ (X ⊥⊥ Y | Z) ∧ (X ⊥⊥ W | Z)
(Weak Union)    (X ⊥⊥ Y ∪ W | Z) ⟹ (X ⊥⊥ Y | Z ∪ W)
(Contraction)   (X ⊥⊥ Y | Z) ∧ (X ⊥⊥ W | Y ∪ Z) ⟹ (X ⊥⊥ Y ∪ W | Z)
(Intersection)  (X ⊥⊥ Y | Z ∪ W) ∧ (X ⊥⊥ W | Z ∪ Y) ⟹ (X ⊥⊥ Y ∪ W | Z)    (1)

The Symmetry, Decomposition, Contraction and Weak Union axioms are very general: they are necessary axioms for the probabilistic definition of independence, i.e., they hold in every probability distribution, as their proofs are based on the axioms of probability theory. Intersection is not universal but holds in distributions that are positive, i.e., where any value combination of the domain variables has a non-zero probability of occurring.

4 The Markov Boundary Theorem

According to Definition 2, a Markov boundary is a minimal Markov blanket. We first introduce a theorem that provides an alternative, equivalent definition of the concept of Markov boundary, which we will relax later in the paper to produce a more general boundary definition.

Theorem 1 (Markov Boundary Theorem). Assuming that the Decomposition and Contraction axioms hold, S ⊆ U − {X} is a Markov boundary of variable X ∈ U if and only if

∀ T ⊆ U − {X}, T ⊆ U − S ⟺ (X ⊥⊥ T | S − T).    (2)

A detailed proof cannot be included here due to space constraints, but a proof sketch appears in Appendix A. According to the above theorem, a Markov boundary S partitions the powerset of U − {X} into two parts: (a) a set P₁ that contains all subsets of U − S, and (b) a set P₂ containing the remaining subsets. All sets in P₁ are conditionally independent of X, and all sets in P₂ are conditionally dependent on X.

Intuitively, the two directions of the logical equivalence relation of Eq. (2) correspond to the concept of Markov blanket and its minimality, i.e., the equation

∀ T ⊆ U − {X}, T ⊆ U − S ⟹ (X ⊥⊥ T | S − T)
Informally, this is because if there existed some set T that contained a non-empty subset T′ of S such that (X⊥ T | S − T), ⊥ then one would be able to shrink S by T′ (by the property of Contraction) and therefore S would not be minimal (more details in Appendix A). 5 A Family of Algorithms for Arbitrary Domains Theorem 1 defines conditions that precisely characterize a Markov boundary and thus can be thought of as an alternative definition of a boundary. By relaxing these conditions we can produce a more general definition. In particular, an m-Markov boundary is defined as follows. Definition 3. A set of variables S ⊆ U − {X} of a domain U is called an m-Markov boundary of variable X ∈ U if and only if ∀ T ⊆ U − {X} such that |T| ≤ m, T ⊆ U − S ⇐⇒ (X⊥ T | S − T) . ⊥ We call the parameter m of an m-Markov boundary the Markov boundary margin. Intuitively, an m-boundary S guarantees that (a) all subsets of its complement (excluding X) of size m or smaller are independent of X given S, and (b) all sets T of size m or smaller that are not subsets of its complement are dependent of X given the part of S that is not contained in T. This definition is a special case of the properties of a boundary stated in Theorem 1, with each set T mentioned in the theorem now restricted to having size m or smaller. For m = n − 1, where n = |U |, the condition |T| ≤ m is always satisfied and can be omitted; in this case the definition of an (n − 1)-Markov boundary results in exactly Eq. (2) of Theorem 1. We now present an algorithm called GS(m) , shown in Algorithm 1, that provably correctly learns an m-boundary of a target variable X. GS(m) operates in two phases, a growing and a shrinking phase (hence the acronym). During the growing phase it examines sets of variables of size up to m, where m is a user-specified parameter. During the shrinking phase, single variables are examined for conditional independence and possible removal from S (examining sets in the shrinking phase is not necessary for provably correct operation—see Appendix B). The orders of examination of the sets for possible addition and deletion from the candidate boundary are left intentionally unspecified in Algorithm 1—one can therefore view it as an abstract representative of a family of algorithms, with each member specifying one such ordering. All members of this family are m-correct, as the proof of correctness does not depend on the ordering. In practice numerous choices for the ordering exist; one possibility is to examine the sets in the growing phase in order of increasing set size and, for each such size, in order of decreasing conditional mutual information I(X, Y, S) between X and Y given S. The rationale for this heuristic choice is that (usually) tests with smaller conditional sets tend to be more reliable, and sorting by mutual information tends to lessen the chance of adding false members of the Markov boundary. We used this implementation in all our experiments, presented later in Section 6. Intuitively, the margin represents a trade-off between sample and computational complexity and completeness: For m = n − 1 = |U| − 1, the algorithm returns a Markov boundary in unrestricted 4 Algorithm 2 The RGS(m,k) (X) algorithm, a randomized anytime version of the GS(m) algorithm, utilizing k random subsets for the growing phase. 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: S←∅ /* Growing phase. 
Intuitively, the margin represents a trade-off between sample and computational complexity and completeness: for m = n − 1 = |U| − 1, the algorithm returns a Markov boundary in unrestricted (arbitrary) domains. For 1 ≤ m < n − 1, GS(m) may recover the correct boundary depending on characteristics of the domain. For example, it will recover the correct boundary in domains containing embedded parity functions such that the number of variables involved in every k-bit parity function is m + 1 or less, i.e., if k ≤ m + 1 (parity functions are corner cases in the space of probability distributions that are known to be hard to learn [17]). The proof of m-correctness of GS(m) is included in Appendix B. Note that it is based on Theorem 1 and the universal axioms of Eqs. (1) only, i.e., Intersection is not needed, and thus it is widely applicable (to any domain).

A Practical Randomized Anytime Version

While GS(m) is provably correct even in difficult domains such as those that contain parity functions, it may be impractical with a large number of features, as its asymptotic complexity is O(n^m). We therefore also provide a more practical randomized version called RGS(m,k) (Randomized GS(m)), shown in Algorithm 2.

Algorithm 2 The RGS(m,k)(X) algorithm, a randomized anytime version of the GS(m) algorithm, utilizing k random subsets for the growing phase.
1: S ← ∅
2: /* Growing phase. */
3: repeat
4:   Schanged ← false
5:   Y ← subset of U − S − {X} of size 1 ≤ |Y| ≤ m of maximum dependence out of k random subsets
6:   if (X ⊥̸⊥ Y | S) then
7:     S ← S ∪ Y
8:     Schanged ← true
9: until Schanged = false
10: /* Shrinking phase. */
11: for all Y ∈ S do
12:   if (X ⊥⊥ Y | S − {Y}) then
13:     S ← S − {Y}
14:     goto line 11 /* Restart loop. */
15: return S

The RGS(m,k) algorithm has an additional parameter k that limits its computational requirements: instead of exhaustively examining all possible subsets of U − S − {X} (as GS(m) does), it samples k subsets from the set of all possible subsets of U − S − {X}, where k is user-specified. It is therefore a randomized algorithm that becomes equivalent to GS(m) given a large enough k. Many possibilities for the method of random selection of the subsets exist; in our experiments we select a subset Y = {Y_i} (1 ≤ |Y| ≤ m) with probability proportional to ∏_{i=1}^{|Y|} (1/p(X, Y_i | S)), where p(X, Y_i | S) is the p-value of the corresponding (univariate) test between X and Y_i given S, which has a low computational cost. A sketch of this randomized growing step appears after this section.

The RGS(m,k) algorithm is useful in situations where the amount of time to produce an answer may be limited and/or the limit unknown beforehand: it is easy to show that the growing phase of GS(m) produces an upper bound of the m-boundary of X. Therefore the RGS(m,k) algorithm, if interrupted, will return an approximation of this upper bound. Moreover, if there is time for the shrinking phase to be executed (which conducts a number of tests linear in n and is thus fast), extraneous variables will be removed and a minimal blanket (boundary) approximation will be returned. These features make it an anytime algorithm, which is a more appropriate choice for situations where critical events may occur that require the interruption of computation, e.g., during the planning phase of a robot, which may be interrupted at any time due to an urgent external event that requires a decision to be made based on the present state's feature values.
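The randomized growing step can be sketched as follows. This is our own code, assuming the indep oracle from before plus a hypothetical univariate p_value(X, Yi, S) routine; for brevity, the candidate of maximum dependence is chosen deterministically (smallest p-value product) rather than sampled proportionally, and the shrinking phase is a single pass instead of a restarting loop.

```python
import random

def rgs_mk(X, U, m, k, indep, p_value):
    """Sketch of Algorithm 2, RGS(m,k): examine only k random candidate
    subsets per growing step, scored by the product of inverse univariate
    p-values p(X, Y_i | S)."""
    S = set()
    changed = True
    while changed:
        changed = False
        rest = list(U - S - {X})
        if not rest:
            break
        candidates = [set(random.sample(rest, random.randint(1, min(m, len(rest)))))
                      for _ in range(k)]
        def weight(Y):  # small weight = strong dependence
            prod = 1.0
            for Yi in Y:
                prod *= max(p_value(X, Yi, S), 1e-12)
            return prod
        Y = min(candidates, key=weight)
        if not indep(X, Y, S):
            S |= Y
            changed = True
    for Yv in sorted(S):  # shrinking phase, simplified to one pass
        if indep(X, {Yv}, S - {Yv}):
            S.discard(Yv)
    return S
```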
We generated domains of sizes ranging from 10 to 100 variables, of which 4 variables (X1 to X4 ) were related through a near-parity relation with bit probability 0.60 and various degrees of noise. The remaining independent variables (X5 to Xn ) act as “dis5 GS(1) GS(3) RGS(1, 1000) 0 0.05 RGS(3, 1000) Relieved, threshold = 0.001 Relieved, threshold = 0.03 0.1 0.15 0.2 0.25 Noise probability 0.3 0.35 0.4 Probabilistic isolation performance of GS(m) and RELIEVED Probabilistic isolation performance of GS(m) and RGS(m ,k) Real-world and benchmark data sets 1 Data set Balance scale Balloons Car evaluation Credit screening Monks Nursery Tic-tac-toe Breast cancer Chess Audiology 0.8 0.6 0.4 0.2 0 0 0.2 0.4 (m = 3) GS 0.6 0.8 1 Real-world and benchmark data sets RGS(m = 3, k = 300) average isolation measure F1 measure 50 variables, true Markov boundary size = 3 Bernoulli probability = 0.6, 1000 data points RELIEVED(threshold = 0.03) average isolation measure F1 measure of GS(m ), RGS(m, k ) and RELIEVED vs. noise level 1.3 1.2 1.1 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 1 Data set Balance scale Balloons Car evaluation Credit screening Monks Nursery Tic-tac-toe Breast cancer Chess Audiology 0.8 0.6 0.4 0.2 0 0 0.2 0.4 (m = 3) average isolation measure GS 0.6 0.8 1 average isolation measure Figure 2: Left: F1 measure of GS(m) , RGS(m,k) and RELIEVED under increasing amounts of noise. Middle: Probabilistic isolation performance comparison between GS(3) and RELIEVED on real-world and benchmark data sets. Right: Same for GS(3) and RGS(3,1000) . tractors” and had randomly assigned probabilities i.e., the correct boundary of X1 is B1 = {X2 , X3 , X4 }. In such domains, learning the boundary of X1 is difficult because of the large number of distractors and because each Xi ∈ B1 is independent of X1 given any proper subset of B1 − {Xi } (they only become dependent when including all of them in the conditioning set). To measure an algorithm’s feature selection performance, acF -measure of GS and RGS vs. domain size curacy (fraction of variables correctly included or excluded) is inappropriate as the accuracy of trivial algorithms such as returning the empty set will tend to 1 as n increases. Precision and recall are therefore more appropriate, with precision defined as the fraction of features returned that are in the correct boundary (3 features for X1 ), and recall as the fraction of the features present in the correct boundary that are returned by the algorithm. A convenient and frequently used Number of domain variables measure that combines precision and recall is the F1 meaRunning time of GS and RGS vs. domain size sure, defined as the harmonic mean of precision and recall [18]. In Fig. 1 (top) we report 95% confidence intervals for the F1 measure and execution time of GS(m) (margins m = 1 to 3) and RGS(m,k) (margins 1 to 3 and k = 1000 random subsets), using 20 data sets containing 10 to 100 variables, with the target variable X1 was perturbed (inverted) by noise with 10% probability. As can be seen, the RGS(m,k) and GS(m) using the same value for margin perform comparably Number of domain variables with respect to F1 , up to their 95% confidence intervals. With Figure 1: GS(m) and RGS(m,k) per(m,k) respect to execution time however RGS exhibits much formance with respect to domain greater scalability (Fig. 1 bottom, log scale); for example, it size (number of variables). Top: F1 executes in about 10 seconds on average in domains containmeasure, reflecting accuracy. 
Bot(m) ing 100 variables, while GS executes in 1,000 seconds on tom: Execution time in seconds (log average for this domain size. scale). We also compared GS(m) and RGS(m,k) to RELIEF [5], a well-known algorithm for feature selection that is known to be able to recover parity functions in certain cases [5]. RELIEF learns a weight for each variable and compares it to a threshold τ to decide on its inclusion in the set of relevant variables. As it has been reported [9] that RELIEF can exhibit large variance due to randomization that is necessary only for very large data sets, we instead used a deterministic variant called RELIEVED [9], whose behavior corresponds to RELIEF at the limit of infinite execution time. We calculated the F1 measure for GS(m) , RGS(m,k) and RELIEVED in the presence of varying amounts of noise, with noise probability ranging from 0 (no noise) to 0.4. We used domains containing 50 variables, as GS(m) becomes computationally demanding in larger domains. In Figure 2 (left) we show the performance of GS(m) and RGS(m,k) for m equal to 1 and 3, k = 1000 and RELIEVED for thresholds τ = 0.01 and 0.03 for various amounts of noise on the target variable. Again, each experiment was repeated 20 times to generate 95% confidence intervals. We can observe that even though m = 1 (equivalent to the GS algorithm) performs poorly, increasing the margin m makes it more likely to recover the correct Markov boundary, and GS(3) (m = 3) recovers the exact blanket even with few (1,000) data points. RELIEVED does comparably to GS(3) for little noise and for a large threshold, (m ) (m, k ) 1 True Markov boundary size = 3, 1000 data points Bernoulli probability = 0.6, noise probability = 0.1 1 0.9 GS(1) GS(2) GS(3) RGS(1, 1000) (2, 1000) RGS (3, 1000) RGS 0.8 F1-measure 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 10 20 30 40 50 60 (m ) 70 80 90 100 90 100 (m, k ) True Markov boundary size = 3, 1000 data points Bernoulli probability = 0.6, noise probability = 0.1 10000 GS(1) GS(2) GS(3) Execution time (sec) 1000 RGS(1, 1000) RGS(2, 1000) RGS(3, 1000) 100 10 1 0.1 0.01 10 6 20 30 40 50 60 70 80 but appears to deteriorate for more noisy domains. As we can see it is difficult to choose the “right” threshold for RELIEVED—better performing τ at low noise can become worse in noisy environments; in particular, small τ tend to include irrelevant variables while large τ tend to miss actual members. We also evaluated GS(m) , RGS(m,k) , and RELIEVED on benchmark and real-world data sets from the UCI Machine Learning repository. As the true Markov boundary for these is impossible to know, we used as performance measure a measure of probabilistic isolation by the Markov boundary returned of subsets outside the boundary. For each domain variable X, we measured the independence of subsets Y of size 1, 2 and 3 given the blanket S of X returned by GS(3) and RELIEVED for τ = 0.03 (as this value seemed to do better in the previous set of experiments), as measured by the average p-value of the χ2 test between X and Y given S (with p-values of 0 and 1 indicating ideal dependence and independence, respectively). Due to the large number of subsets outside the boundary when the boundary is small, we limited the estimation of isolation performance to 2,000 subsets per variable. We plot the results in Figure 2 (middle and right). Each point represents a variable in the corresponding data set. 
To obtain a statistically significant comparison, we used the non-parametric Wilcoxon paired signed-rank test, which indicated that GS(3) and RGS(3,1000) are statistically equivalent to each other, while both outperform RELIEVED at the 99.99% significance level (α < 10⁻⁷).

7 Conclusion

In this paper we presented algorithms for the problem of feature selection in unrestricted (arbitrary-distribution) domains that may contain complex interactions that appear only when the values of multiple features are considered together. We introduced two algorithms: an exact, provably correct one, as well as a more practical randomized anytime version, and evaluated them on artificial, benchmark and real-world data, demonstrating that they perform well even in the presence of noise. We also introduced the Markov Boundary Theorem, which precisely characterizes the properties of a boundary, and used it to prove m-correctness of the exact family of algorithms presented. We made minimal assumptions, consisting only of a general set of axioms that hold for every probability distribution, which gives our algorithms universal applicability.

Appendix A: Proof sketch of the Markov Boundary Theorem

Proof sketch. (⟹ direction) We need to prove that if S is a Markov boundary of X then (a) for every set T ⊆ U − S − {X}, (X ⊥ T | S − T), and (b) for every set T′ ⊆ U − S that does not contain X, (X ⊥̸ T′ | S − T′). Case (a) is immediate from the definition of the boundary and the Decomposition axiom. Case (b) can be proven by contradiction: assume the independence of a set T′ that contains a non-empty part T′1 in S and a part T′2 in U − S. Since S − T′ = S − T′1, from Decomposition we get (X ⊥ T′1 | S − T′1). We can then use Contraction to show that the set S − T′1 satisfies the independence property of a Markov boundary, i.e., that (X ⊥ U − (S − T′1) − {X} | S − T′1), which contradicts the assumption that S is a boundary (and thus minimal).

(⟸ direction) We need to prove that if Eq. (2) holds, then S is a minimal Markov blanket. The proof that S is a blanket is immediate. We can prove minimality by contradiction: assume S = S1 ∪ S2 with S1 a blanket and S2 ≠ ∅, i.e., S1 is a blanket strictly smaller than S. Then (X ⊥ S2 | S1) = (X ⊥ S2 | S − S2). However, since S2 ⊈ U − S, from Eq. (2) we get (X ⊥̸ S2 | S − S2), which is a contradiction.

Appendix B: Proof of m-Correctness of GS(m)

Let the value of the set S at the end of the growing phase be SG, its value at the end of the shrinking phase SS, and their difference S∆ = SG − SS. The following two observations are immediate.

Observation 1. For every Y ⊆ U − SG − {X} such that 1 ≤ |Y| ≤ m, (X ⊥ Y | SG).

Observation 2. For every Y ∈ SS, (X ⊥̸ Y | SS − {Y}).

Lemma 2. Consider variables Y1, Y2, ..., Yt for some t ≥ 1 and let Y = {Y1, ..., Yt}. Assuming that Contraction holds, if (X ⊥ Yi | S − {Y1, ..., Yi}) for all i = 1, ..., t, then (X ⊥ Y | S − Y).

Proof. By induction on j = 1, 2, ..., t, using Contraction to decrease the conditioning set from S down to S − {Y1, ..., Yi} for each i = 1, 2, ..., t. Since Y = {Y1, ..., Yt}, we immediately obtain the desired relation (X ⊥ Y | S − Y).

Lemma 2 can be used to show that the variables found individually independent of X during the shrinking phase are actually jointly independent of X, given the final set SS.
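The independence relations manipulated in these proofs can be sanity-checked numerically on small enumerated distributions. The helper below is illustrative only (it is not part of the paper's algorithms); it tests X ⊥ Y | Z exactly by checking the factorization P(x, y, z)·P(z) = P(x, z)·P(y, z) over a joint probability table.

```python
def cond_indep(joint, xs, ys, zs, tol=1e-12):
    """Exact test of X ⊥ Y | Z on a small enumerated distribution.
    `joint` maps full assignments (tuples) to probabilities; xs, ys, zs
    are disjoint tuples of variable indices."""
    def marg(idx):
        m = {}
        for a, pr in joint.items():
            key = tuple(a[i] for i in idx)
            m[key] = m.get(key, 0.0) + pr
        return m
    pxyz, pxz, pyz, pz = marg(xs + ys + zs), marg(xs + zs), marg(ys + zs), marg(zs)
    for a in joint:
        x = tuple(a[i] for i in xs)
        y = tuple(a[i] for i in ys)
        z = tuple(a[i] for i in zs)
        if abs(pxyz[x + y + z] * pz[z] - pxz[x + z] * pyz[y + z]) > tol:
            return False
    return True
```

For instance, enumerating random joint distributions and confirming that (X ⊥ W | Z) together with (X ⊥ Y | Z ∪ W) always yields (X ⊥ Y ∪ W | Z) exercises exactly the Contraction step used in the induction of Lemma 2.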
Let S∆ = {Y1, Y2, ..., Yt} be the set of variables removed (in that order) from SG to form the final set SS, i.e., S∆ = SG − SS. Using the above lemma, the following is immediate.

Corollary 3. Assuming that the Contraction axiom holds, (X ⊥ S∆ | SS).

Lemma 4. If the Contraction, Decomposition and Weak Union axioms hold, then for every set T ⊆ U − SG − {X} such that (X ⊥ T | SG),

    (X ⊥ T ∪ (SG − SS) | SS).    (3)

Furthermore, SS is minimal, i.e., there does not exist a proper subset of SS for which Eq. (3) is true.

Proof. From Corollary 3, (X ⊥ S∆ | SS). Also, by the hypothesis, (X ⊥ T | SG) = (X ⊥ T | SS ∪ S∆), where S∆ = SG − SS as usual. From these two relations and Contraction we obtain (X ⊥ T ∪ S∆ | SS).

To prove minimality, let us assume that SS ≠ ∅ (if SS = ∅ then it is already minimal). We prove by contradiction: assume that there exists a set S′ ⊂ SS such that (X ⊥ T ∪ (SG − S′) | S′). Let W = SS − S′ ≠ ∅, and note that W and S′ are disjoint. We have that

    SS ⊆ SS ∪ S∆ ⟹ SS − S′ ⊆ SS ∪ S∆ − S′ ⊆ T ∪ (SS ∪ S∆ − S′) ⟹ W ⊆ T ∪ (SS ∪ S∆ − S′) = T ∪ (SG − S′).

• Since (X ⊥ T ∪ (SG − S′) | S′) and W ⊆ T ∪ (SS ∪ S∆ − S′), from Decomposition we get (X ⊥ W | S′).
• From (X ⊥ W | S′) and Weak Union we have that for every Y ∈ W, (X ⊥ Y | S′ ∪ (W − {Y})).
• Since S′ and W are disjoint and Y ∈ W, we have Y ∉ S′. Applying the set identity A ∪ (B − C) = (A ∪ B) − (C − A) to S′ ∪ (W − {Y}), we obtain (S′ ∪ W) − ({Y} − S′) = SS − {Y}.
• Therefore, for every Y ∈ W, (X ⊥ Y | SS − {Y}).

However, at the end of the shrinking phase, all variables Y in SS (and therefore in W, as W ⊆ SS) have been evaluated for independence and found dependent (Observation 2). Thus, since W ≠ ∅, there exists at least one Y such that (X ⊥̸ Y | SS − {Y}), producing a contradiction.

Theorem 5. Assuming that the Contraction, Decomposition, and Weak Union axioms hold, Algorithm 1 is m-correct with respect to X.

Proof. We use the Markov Boundary Theorem. We first prove that for all T ⊆ U − {X} such that |T| ≤ m,

    T ⊆ U − SS ⟹ (X ⊥ T | SS − T)

or, equivalently, that for all T ⊆ U − SS − {X} such that |T| ≤ m, (X ⊥ T | SS). Since U − SS − {X} = S∆ ∪ (U − SG − {X}), and S∆ and U − SG − {X} are disjoint, there are three kinds of sets of size m or less to consider: (i) all sets T ⊆ S∆; (ii) all sets T ⊆ U − SG − {X}; and (iii) all sets (if any) T = T′ ∪ T′′, T′ ∩ T′′ = ∅, that have a non-empty part T′ ⊆ S∆ and a non-empty part T′′ ⊆ U − SG − {X}.

(i) From Corollary 3, (X ⊥ S∆ | SS). Therefore, from Decomposition, for any set T ⊆ S∆, (X ⊥ T | SS).

(ii) By Observation 1, for every set T ⊆ U − SG − {X} such that |T| ≤ m, (X ⊥ T | SG). By Lemma 4 we get (X ⊥ T ∪ S∆ | SS), from which we obtain (X ⊥ T | SS) by Decomposition.

(iii) Since |T| ≤ m, we have |T′′| ≤ m. Since T′′ ⊆ U − SG − {X}, by Observation 1, (X ⊥ T′′ | SG). Therefore, by Lemma 4, (X ⊥ T′′ ∪ S∆ | SS). Since T′ ⊆ S∆ implies T′′ ∪ T′ ⊆ T′′ ∪ S∆, we apply Decomposition to obtain (X ⊥ T′′ ∪ T′ | SS) = (X ⊥ T | SS).

To complete the proof we need to show that for all T ⊆ U − {X} such that |T| ≤ m,

    T ⊈ U − SS ⟹ (X ⊥̸ T | SS − T).

Let T = T1 ∪ T2, with T1 ⊆ SS and T2 ⊆ U − SS. Since T ⊈ U − SS, T1 contains at least one variable Y ∈ SS. From Observation 2, (X ⊥̸ Y | SS − {Y}). From this and (the contrapositive of) Weak Union, we get (X ⊥̸ {Y} ∪ (T1 − {Y}) | SS − {Y} − (T1 − {Y})) = (X ⊥̸ T1 | SS − T1). From (the contrapositive of) Decomposition we get (X ⊥̸ T1 ∪ T2 | SS − T1) = (X ⊥̸ T | SS − T1), which is equal to (X ⊥̸ T | SS − T1 − T2) = (X ⊥̸ T | SS − T), as SS and T2 are disjoint.
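As a reading aid for the proof above, the grow/shrink structure that Observations 1 and 2 refer to can be sketched as follows. This is a schematic reconstruction, not the paper's exact Algorithm 1; indep(X, Y, S) stands for the conditional-independence oracle (for example, the cond_indep checker above on an enumerated distribution, or the χ² test on data).

```python
from itertools import combinations

def gs_m(X, U, indep, m):
    """Schematic grow-shrink skeleton with margin m (a reconstruction, not
    the authors' exact pseudocode). `U` is the set of all variables and
    `indep(X, Y, S)` an assumed conditional-independence oracle over sets."""
    S = set()
    # Growing phase: add any subset Y with |Y| <= m found dependent on X
    # given S. On termination, Observation 1 holds for S_G = S.
    grew = True
    while grew:
        grew = False
        for size in range(1, m + 1):
            for Y in combinations(U - S - {X}, size):
                if not indep(X, set(Y), S):
                    S |= set(Y)
                    grew = True
                    break
            if grew:
                break
    # Shrinking phase: remove variables individually independent of X given
    # the rest of S. On termination, Observation 2 holds for S_S = S.
    shrank = True
    while shrank:
        shrank = False
        for Y in list(S):
            if indep(X, {Y}, S - {Y}):
                S.remove(Y)
                shrank = True
    return S
```

On termination of the growing loop, no subset of size at most m outside S is dependent on X given S (Observation 1), and on termination of the shrinking loop every remaining variable is individually dependent (Observation 2); these are precisely the two facts that the proof of Theorem 5 combines.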
References

[1] Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.
[2] Daphne Koller and Mehran Sahami. Toward optimal feature selection. In Proceedings of the Thirteenth International Conference on Machine Learning (ICML), pages 284–292, 1996.
[3] P. M. Narendra and K. Fukunaga. A branch and bound algorithm for feature subset selection. IEEE Transactions on Computers, C-26(9):917–922, 1977.
[4] H. Almuallim and T. G. Dietterich. Learning with many irrelevant features. In Proceedings of the National Conference on Artificial Intelligence (AAAI), 1991.
[5] K. Kira and L. A. Rendell. The feature selection problem: Traditional methods and a new algorithm. In Proceedings of the National Conference on Artificial Intelligence (AAAI), pages 129–134, 1992.
[6] M. Dash and H. Liu. Feature selection for classification. Intelligent Data Analysis, 1(3):131–156, 1997.
[7] Huan Liu and Hiroshi Motoda, editors. Feature Extraction, Construction and Selection: A Data Mining Perspective, volume 453 of The Springer International Series in Engineering and Computer Science. Springer, 1998.
[8] Avrim Blum and Pat Langley. Selection of relevant features and examples in machine learning. Artificial Intelligence, 97(1–2):245–271, 1997.
[9] R. Kohavi and G. H. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1–2):273–324, 1997.
[10] Dimitris Margaritis and Sebastian Thrun. Bayesian network induction via local neighborhoods. In Advances in Neural Information Processing Systems 12 (NIPS), 2000.
[11] I. Tsamardinos, C. Aliferis, and A. Statnikov. Algorithms for large scale Markov blanket discovery. In Proceedings of the 16th International FLAIRS Conference, 2003.
[12] I. Tsamardinos, C. Aliferis, and A. Statnikov. Time and sample efficient discovery of Markov blankets and direct causal relations. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 673–678, 2003.
[13] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34:1436–1462, 2006.
[14] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
[15] Michael Kearns and Umesh V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.
[16] A. Agresti. Categorical Data Analysis. John Wiley and Sons, 1990.
[17] M. Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM, 45(6):983–1006, 1998.
[18] C. J. van Rijsbergen. Information Retrieval. Butterworth-Heinemann, London, 1979.