nips nips2009 nips2009-256 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Andrea Montanari, Jose A. Pereira
Abstract: We consider the problem of learning the structure of Ising models (pairwise binary Markov random fields) from i.i.d. samples. While several methods have been proposed to accomplish this task, their relative merits and limitations remain somewhat obscure. By analyzing a number of concrete examples, we show that low-complexity algorithms systematically fail when the Markov random field develops long-range correlations. More precisely, this phenomenon appears to be related to the Ising model phase transition (although it does not coincide with it). 1 Introduction and main results Given a graph G = (V = [p], E), and a positive parameter θ > 0 the ferromagnetic Ising model on G is the pairwise Markov random field µG,θ (x) = 1 ZG,θ eθxi xj (1) (i,j)∈E over binary variables x = (x1 , x2 , . . . , xp ). Apart from being one of the most studied models in statistical mechanics, the Ising model is a prototypical undirected graphical model, with applications in computer vision, clustering and spatial statistics. Its obvious generalization to edge-dependent parameters θij , (i, j) ∈ E is of interest as well, and will be introduced in Section 1.2.2. (Let us stress that we follow the statistical mechanics convention of calling (1) an Ising model for any graph G.) In this paper we study the following structural learning problem: Given n i.i.d. samples x(1) , x(2) ,. . . , x(n) with distribution µG,θ ( · ), reconstruct the graph G. For the sake of simplicity, we assume that the parameter θ is known, and that G has no double edges (it is a ‘simple’ graph). The graph learning problem is solvable with unbounded sample complexity, and computational resources [1]. The question we address is: for which classes of graphs and values of the parameter θ is the problem solvable under appropriate complexity constraints? More precisely, given an algorithm Alg, a graph G, a value θ of the model parameter, and a small δ > 0, the sample complexity is defined as nAlg (G, θ) ≡ inf n ∈ N : Pn,G,θ {Alg(x(1) , . . . , x(n) ) = G} ≥ 1 − δ , (2) where Pn,G,θ denotes probability with respect to n i.i.d. samples with distribution µG,θ . Further, we let χAlg (G, θ) denote the number of operations of the algorithm Alg, when run on nAlg (G, θ) samples.1 1 For the algorithms analyzed in this paper, the behavior of nAlg and χAlg does not change significantly if we require only ‘approximate’ reconstruction (e.g. in graph distance). 1 The general problem is therefore to characterize the functions nAlg (G, θ) and χAlg (G, θ), in particular for an optimal choice of the algorithm. General bounds on nAlg (G, θ) have been given in [2, 3], under the assumption of unbounded computational resources. A general charactrization of how well low complexity algorithms can perform is therefore lacking. Although we cannot prove such a general characterization, in this paper we estimate nAlg and χAlg for a number of graph models, as a function of θ, and unveil a fascinating universal pattern: when the model (1) develops long range correlations, low-complexity algorithms fail. Under the Ising model, the variables {xi }i∈V become strongly correlated for θ large. For a large class of graphs with degree bounded by ∆, this phenomenon corresponds to a phase transition beyond some critical value of θ uniformly bounded in p, with typically θcrit ≤ const./∆. In the examples discussed below, the failure of low-complexity algorithms appears to be related to this phase transition (although it does not coincide with it). 1.1 A toy example: the thresholding algorithm In order to illustrate the interplay between graph structure, sample complexity and interaction strength θ, it is instructive to consider a warmup example. The thresholding algorithm reconstructs G by thresholding the empirical correlations Cij ≡ 1 n n (ℓ) (ℓ) xi xj for i, j ∈ V . ℓ=1 (3) T HRESHOLDING( samples {x(ℓ) }, threshold τ ) 1: Compute the empirical correlations {Cij }(i,j)∈V ×V ; 2: For each (i, j) ∈ V × V 3: If Cij ≥ τ , set (i, j) ∈ E; We will denote this algorithm by Thr(τ ). Notice that its complexity is dominated by the computation of the empirical correlations, i.e. χThr(τ ) = O(p2 n). The sample complexity nThr(τ ) can be bounded for specific classes of graphs as follows (the proofs are straightforward and omitted from this paper). Theorem 1.1. If G has maximum degree ∆ > 1 and if θ < atanh(1/(2∆)) then there exists τ = τ (θ) such that 2p 8 log nThr(τ ) (G, θ) ≤ . (4) 1 δ (tanh θ − 2∆ )2 Further, the choice τ (θ) = (tanh θ + (1/2∆))/2 achieves this bound. Theorem 1.2. There exists a numerical constant K such that the following is true. If ∆ > 3 and θ > K/∆, there are graphs of bounded degree ∆ such that for any τ , nThr(τ ) = ∞, i.e. the thresholding algorithm always fails with high probability. These results confirm the idea that the failure of low-complexity algorithms is related to long-range correlations in the underlying graphical model. If the graph G is a tree, then correlations between far apart variables xi , xj decay exponentially with the distance between vertices i, j. The same happens on bounded-degree graphs if θ ≤ const./∆. However, for θ > const./∆, there exists families of bounded degree graphs with long-range correlations. 1.2 More sophisticated algorithms In this section we characterize χAlg (G, θ) and nAlg (G, θ) for more advanced algorithms. We again obtain very distinct behaviors of these algorithms depending on long range correlations. Due to space limitations, we focus on two type of algorithms and only outline the proof of our most challenging result, namely Theorem 1.6. In the following we denote by ∂i the neighborhood of a node i ∈ G (i ∈ ∂i), and assume the degree / to be bounded: |∂i| ≤ ∆. 1.2.1 Local Independence Test A recurring approach to structural learning consists in exploiting the conditional independence structure encoded by the graph [1, 4, 5, 6]. 2 Let us consider, to be definite, the approach of [4], specializing it to the model (1). Fix a vertex r, whose neighborhood we want to reconstruct, and consider the conditional distribution of xr given its neighbors2 : µG,θ (xr |x∂r ). Any change of xi , i ∈ ∂r, produces a change in this distribution which is bounded away from 0. Let U be a candidate neighborhood, and assume U ⊆ ∂r. Then changing the value of xj , j ∈ U will produce a noticeable change in the marginal of Xr , even if we condition on the remaining values in U and in any W , |W | ≤ ∆. On the other hand, if U ∂r, then it is possible to find W (with |W | ≤ ∆) and a node i ∈ U such that, changing its value after fixing all other values in U ∪ W will produce no noticeable change in the conditional marginal. (Just choose i ∈ U \∂r and W = ∂r\U ). This procedure allows us to distinguish subsets of ∂r from other sets of vertices, thus motivating the following algorithm. L OCAL I NDEPENDENCE T EST( samples {x(ℓ) }, thresholds (ǫ, γ) ) 1: Select a node r ∈ V ; 2: Set as its neighborhood the largest candidate neighbor U of size at most ∆ for which the score function S CORE(U ) > ǫ/2; 3: Repeat for all nodes r ∈ V ; The score function S CORE( · ) depends on ({x(ℓ) }, ∆, γ) and is defined as follows, min max W,j xi ,xW ,xU ,xj |Pn,G,θ {Xi = xi |X W = xW , X U = xU }− Pn,G,θ {Xi = xi |X W = xW , X U \j = xU \j , Xj = xj }| . (5) In the minimum, |W | ≤ ∆ and j ∈ U . In the maximum, the values must be such that Pn,G,θ {X W = xW , X U = xU } > γ/2, Pn,G,θ {X W = xW , X U \j = xU \j , Xj = xj } > γ/2 Pn,G,θ is the empirical distribution calculated from the samples {x(ℓ) }. We denote this algorithm by Ind(ǫ, γ). The search over candidate neighbors U , the search for minima and maxima in the computation of the S CORE(U ) and the computation of Pn,G,θ all contribute for χInd (G, θ). Both theorems that follow are consequences of the analysis of [4]. Theorem 1.3. Let G be a graph of bounded degree ∆ ≥ 1. For every θ there exists (ǫ, γ), and a numerical constant K, such that 2p 100∆ , χInd(ǫ,γ) (G, θ) ≤ K (2p)2∆+1 log p . nInd(ǫ,γ) (G, θ) ≤ 2 4 log ǫ γ δ More specifically, one can take ǫ = 1 4 sinh(2θ), γ = e−4∆θ 2−2∆ . This first result implies in particular that G can be reconstructed with polynomial complexity for any bounded ∆. However, the degree of such polynomial is pretty high and non-uniform in ∆. This makes the above approach impractical. A way out was proposed in [4]. The idea is to identify a set of ‘potential neighbors’ of vertex r via thresholding: B(r) = {i ∈ V : Cri > κ/2} , (6) For each node r ∈ V , we evaluate S CORE(U ) by restricting the minimum in Eq. (5) over W ⊆ B(r), and search only over U ⊆ B(r). We call this algorithm IndD(ǫ, γ, κ). The basic intuition here is that Cri decreases rapidly with the graph distance between vertices r and i. As mentioned above, this is true at small θ. Theorem 1.4. Let G be a graph of bounded degree ∆ ≥ 1. Assume that θ < K/∆ for some small enough constant K. Then there exists ǫ, γ, κ such that nIndD(ǫ,γ,κ) (G, θ) ≤ 8(κ2 + 8∆ ) log 4p , δ χIndD(ǫ,γ,κ) (G, θ) ≤ K ′ p∆∆ More specifically, we can take κ = tanh θ, ǫ = 1 4 log(4/κ) α + K ′ ∆p2 log p . sinh(2θ) and γ = e−4∆θ 2−2∆ . 2 If a is a vector and R is a set of indices then we denote by aR the vector formed by the components of a with index in R. 3 1.2.2 Regularized Pseudo-Likelihoods A different approach to the learning problem consists in maximizing an appropriate empirical likelihood function [7, 8, 9, 10, 13]. To control the fluctuations caused by the limited number of samples, and select sparse graphs a regularization term is often added [7, 8, 9, 10, 11, 12, 13]. As a specific low complexity implementation of this idea, we consider the ℓ1 -regularized pseudolikelihood method of [7]. For each node r, the following likelihood function is considered L(θ; {x(ℓ) }) = − 1 n n (ℓ) ℓ=1 log Pn,G,θ (x(ℓ) |x\r ) r (7) where x\r = xV \r = {xi : i ∈ V \ r} is the vector of all variables except xr and Pn,G,θ is defined from the following extension of (1), µG,θ (x) = 1 ZG,θ eθij xi xj (8) i,j∈V / where θ = {θij }i,j∈V is a vector of real parameters. Model (1) corresponds to θij = 0, ∀(i, j) ∈ E and θij = θ, ∀(i, j) ∈ E. The function L(θ; {x(ℓ) }) depends only on θr,· = {θrj , j ∈ ∂r} and is used to estimate the neighborhood of each node by the following algorithm, Rlr(λ), R EGULARIZED L OGISTIC R EGRESSION( samples {x(ℓ) }, regularization (λ)) 1: Select a node r ∈ V ; 2: Calculate ˆr,· = arg min {L(θr,· ; {x(ℓ) }) + λ||θr,· ||1 }; θ θ r,· ∈Rp−1 3: ˆ If θrj > 0, set (r, j) ∈ E; Our first result shows that Rlr(λ) indeed reconstructs G if θ is sufficiently small. Theorem 1.5. There exists numerical constants K1 , K2 , K3 , such that the following is true. Let G be a graph with degree bounded by ∆ ≥ 3. If θ ≤ K1 /∆, then there exist λ such that nRlr(λ) (G, θ) ≤ K2 θ−2 ∆ log 8p2 . δ (9) Further, the above holds with λ = K3 θ ∆−1/2 . This theorem is proved by noting that for θ ≤ K1 /∆ correlations decay exponentially, which makes all conditions in Theorem 1 of [7] (denoted there by A1 and A2) hold, and then computing the probability of success as a function of n, while strenghtening the error bounds of [7]. In order to prove a converse to the above result, we need to make some assumptions on λ. Given θ > 0, we say that λ is ‘reasonable for that value of θ if the following conditions old: (i) Rlr(λ) is successful with probability larger than 1/2 on any star graph (a graph composed by a vertex r connected to ∆ neighbors, plus isolated vertices); (ii) λ ≤ δ(n) for some sequence δ(n) ↓ 0. Theorem 1.6. There exists a numerical constant K such that the following happens. If ∆ > 3, θ > K/∆, then there exists graphs G of degree bounded by ∆ such that for all reasonable λ, nRlr(λ) (G) = ∞, i.e. regularized logistic regression fails with high probability. The graphs for which regularized logistic regression fails are not contrived examples. Indeed we will prove that the claim in the last theorem holds with high probability when G is a uniformly random graph of regular degree ∆. The proof Theorem 1.6 is based on showing that an appropriate incoherence condition is necessary for Rlr to successfully reconstruct G. The analogous result was proven in [14] for model selection using the Lasso. In this paper we show that such a condition is also necessary when the underlying model is an Ising model. Notice that, given the graph G, checking the incoherence condition is NP-hard for general (non-ferromagnetic) Ising model, and requires significant computational effort 4 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 20 15 λ0 10 5 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.2 1 0.8 0.6 Psucc 0.4 0.2 1 θ 0 0 0.2 0.4 0.6 θ θcrit 0.8 1 Figure 1: Learning random subgraphs of a 7 × 7 (p = 49) two-dimensional grid from n = 4500 Ising models samples, using regularized logistic regression. Left: success probability as a function of the model parameter θ and of the regularization parameter λ0 (darker corresponds to highest probability). Right: the same data plotted for several choices of λ versus θ. The vertical line corresponds to the model critical temperature. The thick line is an envelope of the curves obtained for different λ, and should correspond to optimal regularization. even in the ferromagnetic case. Hence the incoherence condition does not provide, by itself, a clear picture of which graph structure are difficult to learn. We will instead show how to evaluate it on specific graph families. Under the restriction λ → 0 the solutions given by Rlr converge to θ∗ with n [7]. Thus, for large n we can expand L around θ∗ to second order in (θ − θ∗ ). When we add the regularization term to L we obtain a quadratic model analogous the Lasso plus the error term due to the quadratic approximation. It is thus not surprising that, when λ → 0 the incoherence condition introduced for the Lasso in [14] is also relevant for the Ising model. 2 Numerical experiments In order to explore the practical relevance of the above results, we carried out extensive numerical simulations using the regularized logistic regression algorithm Rlr(λ). Among other learning algorithms, Rlr(λ) strikes a good balance of complexity and performance. Samples from the Ising model (1) where generated using Gibbs sampling (a.k.a. Glauber dynamics). Mixing time can be very large for θ ≥ θcrit , and was estimated using the time required for the overall bias to change sign (this is a quite conservative estimate at low temperature). Generating the samples {x(ℓ) } was indeed the bulk of our computational effort and took about 50 days CPU time on Pentium Dual Core processors (we show here only part of these data). Notice that Rlr(λ) had been tested in [7] only on tree graphs G, or in the weakly coupled regime θ < θcrit . In these cases sampling from the Ising model is easy, but structural learning is also intrinsically easier. Figure reports the success probability of Rlr(λ) when applied to random subgraphs of a 7 × 7 two-dimensional grid. Each such graphs was obtained by removing each edge independently with probability ρ = 0.3. Success probability was estimated by applying Rlr(λ) to each vertex of 8 graphs (thus averaging over 392 runs of Rlr(λ)), using n = 4500 samples. We scaled the regularization parameter as λ = 2λ0 θ(log p/n)1/2 (this choice is motivated by the algorithm analysis and is empirically the most satisfactory), and searched over λ0 . The data clearly illustrate the phenomenon discussed. Despite the large number of samples n ≫ log p, when θ crosses a threshold, the algorithm starts performing poorly irrespective of λ. Intriguingly, this threshold is not far from the critical point of the Ising model on a randomly diluted grid θcrit (ρ = 0.3) ≈ 0.7 [15, 16]. 5 1.2 1.2 θ = 0.35, 0.40 1 1 θ = 0.25 θ = 0.20 0.8 0.8 θ = 0.45 θ = 0.10 0.6 0.6 Psucc Psucc 0.4 0.4 θ = 0.50 0.2 0.2 θthr θ = 0.65, 0.60, 0.55 0 0 0 2000 4000 6000 8000 10000 0 0.1 0.2 n 0.3 0.4 0.5 0.6 0.7 0.8 θ Figure 2: Learning uniformly random graphs of degree 4 from Ising models samples, using Rlr. Left: success probability as a function of the number of samples n for several values of θ. Right: the same data plotted for several choices of λ versus θ as in Fig. 1, right panel. Figure 2 presents similar data when G is a uniformly random graph of degree ∆ = 4, over p = 50 vertices. The evolution of the success probability with n clearly shows a dichotomy. When θ is below a threshold, a small number of samples is sufficient to reconstruct G with high probability. Above the threshold even n = 104 samples are to few. In this case we can predict the threshold analytically, cf. Lemma 3.3 below, and get θthr (∆ = 4) ≈ 0.4203, which compares favorably with the data. 3 Proofs In order to prove Theorem 1.6, we need a few auxiliary results. It is convenient to introduce some notations. If M is a matrix and R, P are index sets then MR P denotes the submatrix with row indices in R and column indices in P . As above, we let r be the vertex whose neighborhood we are trying to reconstruct and define S = ∂r, S c = V \ ∂r ∪ r. Since the cost function L(θ; {x(ℓ) }) + λ||θ||1 only depend on θ through its components θr,· = {θrj }, we will hereafter neglect all the other parameters and write θ as a shorthand of θr,· . Let z ∗ be a subgradient of ||θ||1 evaluated at the true parameters values, θ∗ = {θrj : θij = 0, ∀j ∈ ˆ / ˆn be the parameter estimate returned by Rlr(λ) when the number ∂r, θrj = θ, ∀j ∈ ∂r}. Let θ of samples is n. Note that, since we assumed θ∗ ≥ 0, zS = ½. Define Qn (θ, ; {x(ℓ) }) to be the ˆ∗ (ℓ) (ℓ) n Hessian of L(θ; {x }) and Q(θ) = limn→∞ Q (θ, ; {x }). By the law of large numbers Q(θ) is the Hessian of EG,θ log PG,θ (Xr |X\r ) where EG,θ is the expectation with respect to (8) and X is a random variable distributed according to (8). We will denote the maximum and minimum eigenvalue of a symmetric matrix M by σmax (M ) and σmin (M ) respectively. We will omit arguments whenever clear from the context. Any quantity evaluated at the true parameter values will be represented with a ∗ , e.g. Q∗ = Q(θ∗ ). Quantities under a ∧ depend on n. Throughout this section G is a graph of maximum degree ∆. 3.1 Proof of Theorem 1.6 Our first auxiliary results establishes that, if λ is small, then ||Q∗ c S Q∗ −1 zS ||∞ > 1 is a sufficient ˆ∗ S SS condition for the failure of Rlr(λ). Lemma 3.1. Assume [Q∗ c S Q∗ −1 zS ]i ≥ 1 + ǫ for some ǫ > 0 and some row i ∈ V , σmin (Q∗ ) ≥ ˆ∗ S SS SS 3 ǫ/29 ∆4 . Then the success probability of Rlr(λ) is upper bounded as Cmin > 0, and λ < Cmin 2 2 2 δB Psucc ≤ 4∆2 e−nδA + 2∆ e−nλ 2 where δA = (Cmin /100∆2 )ǫ and δB = (Cmin /8∆)ǫ. 6 (10) The next Lemma implies that, for λ to be ‘reasonable’ (in the sense introduced in Section 1.2.2), nλ2 must be unbounded. Lemma 3.2. There exist M = M (K, θ) > 0 for θ > 0 such that the following is true: If G is the graph with only one edge between nodes r and i and nλ2 ≤ K, then Psucc ≤ e−M (K,θ)p + e−n(1−tanh θ) 2 /32 . (11) Finally, our key result shows that the condition ||Q∗ c S Q∗ −1 zS ||∞ ≤ 1 is violated with high ˆ∗ S SS probability for large random graphs. The proof of this result relies on a local weak convergence result for ferromagnetic Ising models on random graphs proved in [17]. Lemma 3.3. Let G be a uniformly random regular graph of degree ∆ > 3, and ǫ > 0 be sufficiently small. Then, there exists θthr (∆, ǫ) such that, for θ > θthr (∆, ǫ), ||Q∗ c S Q∗ −1 zS ||∞ ≥ 1 + ǫ with ˆ∗ S SS probability converging to 1 as p → ∞. ˜ ˜ ˜ Furthermore, for large ∆, θthr (∆, 0+) = θ ∆−1 (1 + o(1)). The constant θ is given by θ = ¯ ¯ ¯ ¯ ¯ ¯ tanh h)/h and h is the unique positive solution of h tanh h = (1 − tanh2 h)2 . Finally, there exist Cmin > 0 dependent only on ∆ and θ such that σmin (Q∗ ) ≥ Cmin with probability converging to SS 1 as p → ∞. The proofs of Lemmas 3.1 and 3.3 are sketched in the next subsection. Lemma 3.2 is more straightforward and we omit its proof for space reasons. Proof. (Theorem 1.6) Fix ∆ > 3, θ > K/∆ (where K is a large enough constant independent of ∆), and ǫ, Cmin > 0 and both small enough. By Lemma 3.3, for any p large enough we can choose a ∆-regular graph Gp = (V = [p], Ep ) and a vertex r ∈ V such that |Q∗ c S Q∗ −1 ½S |i > 1 + ǫ for S SS some i ∈ V \ r. By Theorem 1 in [4] we can assume, without loss of generality n > K ′ ∆ log p for some small constant K ′ . Further by Lemma 3.2, nλ2 ≥ F (p) for some F (p) ↑ ∞ as p → ∞ and the condition of Lemma 3.1 on λ is satisfied since by the ”reasonable” assumption λ → 0 with n. Using these results in Eq. (10) of Lemma 3.1 we get the following upper bound on the success probability 2 Psucc (Gp ) ≤ 4∆2 p−δA K In particular Psucc (Gp ) → 0 as p → ∞. 3.2 ′ ∆ 2 + 2∆ e−nF (p)δB . (12) Proofs of auxiliary lemmas θ θ Proof. (Lemma 3.1) We will show that under the assumptions of the lemma and if ˆ = (ˆS , ˆS C ) = θ (ˆS , 0) then the probability that the i component of any subgradient of L(θ; {x(ℓ) })+λ||θ||1 vanishes θ for any ˆS > 0 (component wise) is upper bounded as in Eq. (10). To simplify notation we will omit θ {x(ℓ) } in all the expression derived from L. θ θ) z Let z be a subgradient of ||θ|| at ˆ and assume ∇L(ˆ + λˆ = 0. An application of the mean value ˆ theorem yields ∇2 L(θ∗ )[ˆ − θ∗ ] = W n − λˆ + Rn , θ z (13) ∗ n n 2 ¯(j) ) − ∇2 L(θ∗ )]T (ˆ − θ∗ ) with ¯(j) a point in the line where W = −∇L(θ ) and [R ]j = [∇ L(θ θ j θ ˆ to θ∗ . Notice that by definition ∇2 L(θ∗ ) = Qn ∗ = Qn (θ∗ ). To simplify notation we will from θ omit the ∗ in all Qn ∗ . All Qn in this proof are thus evaluated at θ∗ . Breaking this expression into its S and S c components and since ˆS C = θ∗ C = 0 we can eliminate θ S ˆ − θ∗ from the two expressions obtained and write θS S n n n n ˆ z [WS C − RS C ] − Qn C S (Qn )−1 [WS − RS ] + λQn C S (Qn )−1 zS = λˆS C . SS SS S S Now notice that Qn C S (Qn )−1 = T1 + T2 + T3 + T4 where SS S T1 = Q∗ C S [(Qn )−1 − (Q∗ )−1 ] , SS SS S T3 = [Qn C S − Q∗ C S ][(Qn )−1 − (Q∗ )−1 ] , SS SS S S 7 T2 = [Qn C S − Q∗ C S ]Q∗ −1 , SS S S T4 = Q∗ C S Q∗ −1 . SS S (14) We will assume that the samples {x(ℓ) } are such that the following event holds n E ≡ {||Qn − Q∗ ||∞ < ξA , ||Qn C S − Q∗ C S ||∞ < ξB , ||WS /λ||∞ < ξC } , (15) SS SS S S √ 2 n where ξA ≡ Cmin ǫ/(16∆), ξB ≡ Cmin ǫ/(8 ∆) and ξC ≡ Cmin ǫ/(8∆). Since EG,θ (Q ) = Q∗ and EG,θ (W n ) = 0 and noticing that both Qn and W n are sums of bounded i.i.d. random variables, a simple application of Azuma-Hoeffding inequality upper bounds the probability of E as in (10). From E it follows that σmin (Qn ) > σmin (Q∗ ) − Cmin /2 > Cmin /2. We can therefore lower SS SS bound the absolute value of the ith component of zS C by ˆ n n ∆ Rn WS RS Wn + |[Q∗ C S Q∗ −1 ½S ]i |−||T1,i ||∞ −||T2,i ||∞ −||T3,i ||∞ − i − i − SS S λ λ Cmin λ ∞ λ ∞ where the subscript i denotes the i-th row of a matrix. The proof is completed by showing that the event E and the assumptions of the theorem imply that n each of last 7 terms in this expression is smaller than ǫ/8. Since |[Q∗ C S Q∗ −1 ]T zS | ≥ 1 + ǫ by i ˆ SS S assumption, this implies |ˆi | ≥ 1 + ǫ/8 > 1 which cannot be since any subgradient of the 1-norm z has components of magnitude at most 1. The last condition on E immediately bounds all terms involving W by ǫ/8. Some straightforward manipulations imply (See Lemma 7 from [7]) √ ∆ ∆ n ∗ ||T2,i ||∞ ≤ ||[Qn C S − Q∗ C S ]i ||∞ , ||T1,i ||∞ ≤ 2 ||QSS − QSS ||∞ , S S Cmin Cmin 2∆ ||T3,i ||∞ ≤ 2 ||Qn − Q∗ ||∞ ||[Qn C S − Q∗ C S ]i ||∞ , SS SS S S Cmin and thus all will be bounded by ǫ/8 when E holds. The upper bound of Rn follows along similar lines via an mean value theorem, and is deferred to a longer version of this paper. Proof. (Lemma 3.3.) Let us state explicitly the local weak convergence result mentioned in Sec. 3.1. For t ∈ N, let T(t) = (VT , ET ) be the regular rooted tree of t generations and define the associated Ising measure as ∗ 1 eθxi xj (16) eh xi . µ+ (x) = T,θ ZT,θ (i,j)∈ET i∈∂T(t) Here ∂T(t) is the set of leaves of T(t) and h∗ is the unique positive solution of h = (∆ − 1) atanh {tanh θ tanh h}. It can be proved using [17] and uniform continuity with respect to the ‘external field’ that non-trivial local expectations with respect to µG,θ (x) converge to local expectations with respect to µ+ (x), as p → ∞. T,θ More precisely, let Br (t) denote a ball of radius t around node r ∈ G (the node whose neighborhood we are trying to reconstruct). For any fixed t, the probability that Br (t) is not isomorphic to T(t) goes to 0 as p → ∞. Let g(xBr (t) ) be any function of the variables in Br (t) such that g(xBr (t) ) = g(−xBr (t) ). Then almost surely over graph sequences Gp of uniformly random regular graphs with p nodes (expectations here are taken with respect to the measures (1) and (16)) lim EG,θ {g(X Br (t) )} = ET(t),θ,+ {g(X T(t) )} . (17) p→∞ The proof consists in considering [Q∗ c S Q∗ −1 zS ]i for t = dist(r, i) finite. We then write ˆ∗ S SS (Q∗ )lk = E{gl,k (X Br (t) )} and (Q∗ c S )il = E{gi,l (X Br (t) )} for some functions g·,· (X Br (t) ) and S SS apply the weak convergence result (17) to these expectations. We thus reduced the calculation of [Q∗ c S Q∗ −1 zS ]i to the calculation of expectations with respect to the tree measure (16). The latter ˆ∗ S SS can be implemented explicitly through a recursive procedure, with simplifications arising thanks to the tree symmetry and by taking t ≫ 1. The actual calculations consist in a (very) long exercise in calculus and we omit them from this outline. The lower bound on σmin (Q∗ ) is proved by a similar calculation. SS Acknowledgments This work was partially supported by a Terman fellowship, the NSF CAREER award CCF-0743978 and the NSF grant DMS-0806211 and by a Portuguese Doctoral FCT fellowship. 8 , References [1] P. Abbeel, D. Koller and A. Ng, “Learning factor graphs in polynomial time and sample complexity”. Journal of Machine Learning Research., 2006, Vol. 7, 1743–1788. [2] M. Wainwright, “Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting”, arXiv:math/0702301v2 [math.ST], 2007. [3] N. Santhanam, M. Wainwright, “Information-theoretic limits of selecting binary graphical models in high dimensions”, arXiv:0905.2639v1 [cs.IT], 2009. [4] G. Bresler, E. Mossel and A. Sly, “Reconstruction of Markov Random Fields from Samples: Some Observations and Algorithms”,Proceedings of the 11th international workshop, APPROX 2008, and 12th international workshop RANDOM 2008, 2008 ,343–356. [5] Csiszar and Z. Talata, “Consistent estimation of the basic neighborhood structure of Markov random fields”, The Annals of Statistics, 2006, 34, Vol. 1, 123-145. [6] N. Friedman, I. Nachman, and D. Peer, “Learning Bayesian network structure from massive datasets: The sparse candidate algorithm”. In UAI, 1999. [7] P. Ravikumar, M. Wainwright and J. Lafferty, “High-Dimensional Ising Model Selection Using l1-Regularized Logistic Regression”, arXiv:0804.4202v1 [math.ST], 2008. [8] M.Wainwright, P. Ravikumar, and J. Lafferty, “Inferring graphical model structure using l1regularized pseudolikelihood“, In NIPS, 2006. [9] H. H¨ fling and R. Tibshirani, “Estimation of Sparse Binary Pairwise Markov Networks using o Pseudo-likelihoods” , Journal of Machine Learning Research, 2009, Vol. 10, 883–906. [10] O.Banerjee, L. El Ghaoui and A. d’Aspremont, “Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data”, Journal of Machine Learning Research, March 2008, Vol. 9, 485–516. [11] M. Yuan and Y. Lin, “Model Selection and Estimation in Regression with Grouped Variables”, J. Royal. Statist. Soc B, 2006, 68, Vol. 19,49–67. [12] N. Meinshausen and P. B¨ uhlmann, “High dimensional graphs and variable selection with the u lasso”, Annals of Statistics, 2006, 34, Vol. 3. [13] R. Tibshirani, “Regression shrinkage and selection via the lasso”, Journal of the Royal Statistical Society, Series B, 1994, Vol. 58, 267–288. [14] P. Zhao, B. Yu, “On model selection consistency of Lasso”, Journal of Machine. Learning Research 7, 25412563, 2006. [15] D. Zobin, ”Critical behavior of the bond-dilute two-dimensional Ising model“, Phys. Rev., 1978 ,5, Vol. 18, 2387 – 2390. [16] M. Fisher, ”Critical Temperatures of Anisotropic Ising Lattices. II. General Upper Bounds”, Phys. Rev. 162 ,Oct. 1967, Vol. 2, 480–485. [17] A. Dembo and A. Montanari, “Ising Models on Locally Tree Like Graphs”, Ann. Appl. Prob. (2008), to appear, arXiv:0804.4726v2 [math.PR] 9
Reference: text
sentIndex sentText sentNum sentScore
1 More precisely, this phenomenon appears to be related to the Ising model phase transition (although it does not coincide with it). [sent-10, score-0.067]
2 1 Introduction and main results Given a graph G = (V = [p], E), and a positive parameter θ > 0 the ferromagnetic Ising model on G is the pairwise Markov random field µG,θ (x) = 1 ZG,θ eθxi xj (1) (i,j)∈E over binary variables x = (x1 , x2 , . [sent-11, score-0.252]
3 (Let us stress that we follow the statistical mechanics convention of calling (1) an Ising model for any graph G. [sent-19, score-0.154]
4 , x(n) with distribution µG,θ ( · ), reconstruct the graph G. [sent-26, score-0.181]
5 The graph learning problem is solvable with unbounded sample complexity, and computational resources [1]. [sent-28, score-0.174]
6 The question we address is: for which classes of graphs and values of the parameter θ is the problem solvable under appropriate complexity constraints? [sent-29, score-0.181]
7 More precisely, given an algorithm Alg, a graph G, a value θ of the model parameter, and a small δ > 0, the sample complexity is defined as nAlg (G, θ) ≡ inf n ∈ N : Pn,G,θ {Alg(x(1) , . [sent-30, score-0.156]
8 Although we cannot prove such a general characterization, in this paper we estimate nAlg and χAlg for a number of graph models, as a function of θ, and unveil a fascinating universal pattern: when the model (1) develops long range correlations, low-complexity algorithms fail. [sent-44, score-0.155]
9 For a large class of graphs with degree bounded by ∆, this phenomenon corresponds to a phase transition beyond some critical value of θ uniformly bounded in p, with typically θcrit ≤ const. [sent-46, score-0.447]
10 In the examples discussed below, the failure of low-complexity algorithms appears to be related to this phase transition (although it does not coincide with it). [sent-48, score-0.064]
11 1 A toy example: the thresholding algorithm In order to illustrate the interplay between graph structure, sample complexity and interaction strength θ, it is instructive to consider a warmup example. [sent-50, score-0.205]
12 The thresholding algorithm reconstructs G by thresholding the empirical correlations Cij ≡ 1 n n (ℓ) (ℓ) xi xj for i, j ∈ V . [sent-51, score-0.302]
13 ℓ=1 (3) T HRESHOLDING( samples {x(ℓ) }, threshold τ ) 1: Compute the empirical correlations {Cij }(i,j)∈V ×V ; 2: For each (i, j) ∈ V × V 3: If Cij ≥ τ , set (i, j) ∈ E; We will denote this algorithm by Thr(τ ). [sent-52, score-0.141]
14 The sample complexity nThr(τ ) can be bounded for specific classes of graphs as follows (the proofs are straightforward and omitted from this paper). [sent-56, score-0.256]
15 If G has maximum degree ∆ > 1 and if θ < atanh(1/(2∆)) then there exists τ = τ (θ) such that 2p 8 log nThr(τ ) (G, θ) ≤ . [sent-59, score-0.147]
16 There exists a numerical constant K such that the following is true. [sent-63, score-0.075]
17 If ∆ > 3 and θ > K/∆, there are graphs of bounded degree ∆ such that for any τ , nThr(τ ) = ∞, i. [sent-64, score-0.273]
18 These results confirm the idea that the failure of low-complexity algorithms is related to long-range correlations in the underlying graphical model. [sent-67, score-0.084]
19 If the graph G is a tree, then correlations between far apart variables xi , xj decay exponentially with the distance between vertices i, j. [sent-68, score-0.315]
20 The same happens on bounded-degree graphs if θ ≤ const. [sent-69, score-0.12]
21 /∆, there exists families of bounded degree graphs with long-range correlations. [sent-72, score-0.31]
22 In the following we denote by ∂i the neighborhood of a node i ∈ G (i ∈ ∂i), and assume the degree / to be bounded: |∂i| ≤ ∆. [sent-78, score-0.214]
23 1 Local Independence Test A recurring approach to structural learning consists in exploiting the conditional independence structure encoded by the graph [1, 4, 5, 6]. [sent-81, score-0.146]
24 Fix a vertex r, whose neighborhood we want to reconstruct, and consider the conditional distribution of xr given its neighbors2 : µG,θ (xr |x∂r ). [sent-83, score-0.207]
25 Any change of xi , i ∈ ∂r, produces a change in this distribution which is bounded away from 0. [sent-84, score-0.164]
26 Then changing the value of xj , j ∈ U will produce a noticeable change in the marginal of Xr , even if we condition on the remaining values in U and in any W , |W | ≤ ∆. [sent-86, score-0.151]
27 On the other hand, if U ∂r, then it is possible to find W (with |W | ≤ ∆) and a node i ∈ U such that, changing its value after fixing all other values in U ∪ W will produce no noticeable change in the conditional marginal. [sent-87, score-0.117]
28 In the maximum, the values must be such that Pn,G,θ {X W = xW , X U = xU } > γ/2, Pn,G,θ {X W = xW , X U \j = xU \j , Xj = xj } > γ/2 Pn,G,θ is the empirical distribution calculated from the samples {x(ℓ) }. [sent-92, score-0.112]
29 For every θ there exists (ǫ, γ), and a numerical constant K, such that 2p 100∆ , χInd(ǫ,γ) (G, θ) ≤ K (2p)2∆+1 log p . [sent-99, score-0.102]
30 This first result implies in particular that G can be reconstructed with polynomial complexity for any bounded ∆. [sent-101, score-0.104]
31 However, the degree of such polynomial is pretty high and non-uniform in ∆. [sent-102, score-0.083]
32 The idea is to identify a set of ‘potential neighbors’ of vertex r via thresholding: B(r) = {i ∈ V : Cri > κ/2} , (6) For each node r ∈ V , we evaluate S CORE(U ) by restricting the minimum in Eq. [sent-105, score-0.123]
33 The basic intuition here is that Cri decreases rapidly with the graph distance between vertices r and i. [sent-108, score-0.158]
34 Then there exists ǫ, γ, κ such that nIndD(ǫ,γ,κ) (G, θ) ≤ 8(κ2 + 8∆ ) log 4p , δ χIndD(ǫ,γ,κ) (G, θ) ≤ K ′ p∆∆ More specifically, we can take κ = tanh θ, ǫ = 1 4 log(4/κ) α + K ′ ∆p2 log p . [sent-114, score-0.224]
35 To control the fluctuations caused by the limited number of samples, and select sparse graphs a regularization term is often added [7, 8, 9, 10, 11, 12, 13]. [sent-120, score-0.12]
36 As a specific low complexity implementation of this idea, we consider the ℓ1 -regularized pseudolikelihood method of [7]. [sent-121, score-0.071]
37 There exists numerical constants K1 , K2 , K3 , such that the following is true. [sent-127, score-0.075]
38 This theorem is proved by noting that for θ ≤ K1 /∆ correlations decay exponentially, which makes all conditions in Theorem 1 of [7] (denoted there by A1 and A2) hold, and then computing the probability of success as a function of n, while strenghtening the error bounds of [7]. [sent-131, score-0.193]
39 Given θ > 0, we say that λ is ‘reasonable for that value of θ if the following conditions old: (i) Rlr(λ) is successful with probability larger than 1/2 on any star graph (a graph composed by a vertex r connected to ∆ neighbors, plus isolated vertices); (ii) λ ≤ δ(n) for some sequence δ(n) ↓ 0. [sent-133, score-0.304]
40 There exists a numerical constant K such that the following happens. [sent-136, score-0.075]
41 If ∆ > 3, θ > K/∆, then there exists graphs G of degree bounded by ∆ such that for all reasonable λ, nRlr(λ) (G) = ∞, i. [sent-137, score-0.334]
42 The graphs for which regularized logistic regression fails are not contrived examples. [sent-140, score-0.218]
43 Indeed we will prove that the claim in the last theorem holds with high probability when G is a uniformly random graph of regular degree ∆. [sent-141, score-0.33]
44 6 is based on showing that an appropriate incoherence condition is necessary for Rlr to successfully reconstruct G. [sent-143, score-0.166]
45 Notice that, given the graph G, checking the incoherence condition is NP-hard for general (non-ferromagnetic) Ising model, and requires significant computational effort 4 1 0. [sent-146, score-0.229]
46 8 1 Figure 1: Learning random subgraphs of a 7 × 7 (p = 49) two-dimensional grid from n = 4500 Ising models samples, using regularized logistic regression. [sent-173, score-0.103]
47 Left: success probability as a function of the model parameter θ and of the regularization parameter λ0 (darker corresponds to highest probability). [sent-174, score-0.058]
48 Hence the incoherence condition does not provide, by itself, a clear picture of which graph structure are difficult to learn. [sent-179, score-0.229]
49 We will instead show how to evaluate it on specific graph families. [sent-180, score-0.122]
50 It is thus not surprising that, when λ → 0 the incoherence condition introduced for the Lasso in [14] is also relevant for the Ising model. [sent-184, score-0.107]
51 2 Numerical experiments In order to explore the practical relevance of the above results, we carried out extensive numerical simulations using the regularized logistic regression algorithm Rlr(λ). [sent-185, score-0.108]
52 Generating the samples {x(ℓ) } was indeed the bulk of our computational effort and took about 50 days CPU time on Pentium Dual Core processors (we show here only part of these data). [sent-192, score-0.052]
53 Notice that Rlr(λ) had been tested in [7] only on tree graphs G, or in the weakly coupled regime θ < θcrit . [sent-193, score-0.159]
54 Figure reports the success probability of Rlr(λ) when applied to random subgraphs of a 7 × 7 two-dimensional grid. [sent-195, score-0.091]
55 Each such graphs was obtained by removing each edge independently with probability ρ = 0. [sent-196, score-0.12]
56 Success probability was estimated by applying Rlr(λ) to each vertex of 8 graphs (thus averaging over 392 runs of Rlr(λ)), using n = 4500 samples. [sent-198, score-0.18]
57 Despite the large number of samples n ≫ log p, when θ crosses a threshold, the algorithm starts performing poorly irrespective of λ. [sent-201, score-0.079]
58 Intriguingly, this threshold is not far from the critical point of the Ising model on a randomly diluted grid θcrit (ρ = 0. [sent-202, score-0.072]
59 8 θ Figure 2: Learning uniformly random graphs of degree 4 from Ising models samples, using Rlr. [sent-233, score-0.237]
60 Left: success probability as a function of the number of samples n for several values of θ. [sent-234, score-0.11]
61 Figure 2 presents similar data when G is a uniformly random graph of degree ∆ = 4, over p = 50 vertices. [sent-237, score-0.239]
62 The evolution of the success probability with n clearly shows a dichotomy. [sent-238, score-0.058]
63 When θ is below a threshold, a small number of samples is sufficient to reconstruct G with high probability. [sent-239, score-0.111]
64 Above the threshold even n = 104 samples are to few. [sent-240, score-0.088]
65 If M is a matrix and R, P are index sets then MR P denotes the submatrix with row indices in R and column indices in P . [sent-248, score-0.05]
66 As above, we let r be the vertex whose neighborhood we are trying to reconstruct and define S = ∂r, S c = V \ ∂r ∪ r. [sent-249, score-0.187]
67 Let z ∗ be a subgradient of ||θ||1 evaluated at the true parameters values, θ∗ = {θrj : θij = 0, ∀j ∈ ˆ / ˆn be the parameter estimate returned by Rlr(λ) when the number ∂r, θrj = θ, ∀j ∈ ∂r}. [sent-251, score-0.048]
68 We will omit arguments whenever clear from the context. [sent-257, score-0.045]
69 Throughout this section G is a graph of maximum degree ∆. [sent-262, score-0.205]
70 6 Our first auxiliary results establishes that, if λ is small, then ||Q∗ c S Q∗ −1 zS ||∞ > 1 is a sufficient ˆ∗ S SS condition for the failure of Rlr(λ). [sent-265, score-0.094]
71 Then the success probability of Rlr(λ) is upper bounded as Cmin > 0, and λ < Cmin 2 2 2 δB Psucc ≤ 4∆2 e−nδA + 2∆ e−nλ 2 where δA = (Cmin /100∆2 )ǫ and δB = (Cmin /8∆)ǫ. [sent-269, score-0.152]
72 There exist M = M (K, θ) > 0 for θ > 0 such that the following is true: If G is the graph with only one edge between nodes r and i and nλ2 ≤ K, then Psucc ≤ e−M (K,θ)p + e−n(1−tanh θ) 2 /32 . [sent-275, score-0.122]
73 The proof of this result relies on a local weak convergence result for ferromagnetic Ising models on random graphs proved in [17]. [sent-277, score-0.247]
74 Let G be a uniformly random regular graph of degree ∆ > 3, and ǫ > 0 be sufficiently small. [sent-280, score-0.274]
75 Then, there exists θthr (∆, ǫ) such that, for θ > θthr (∆, ǫ), ||Q∗ c S Q∗ −1 zS ||∞ ≥ 1 + ǫ with ˆ∗ S SS probability converging to 1 as p → ∞. [sent-281, score-0.069]
76 The constant θ is given by θ = ¯ ¯ ¯ ¯ ¯ ¯ tanh h)/h and h is the unique positive solution of h tanh h = (1 − tanh2 h)2 . [sent-283, score-0.266]
77 2 is more straightforward and we omit its proof for space reasons. [sent-289, score-0.076]
78 3, for any p large enough we can choose a ∆-regular graph Gp = (V = [p], Ep ) and a vertex r ∈ V such that |Q∗ c S Q∗ −1 ½S |i > 1 + ǫ for S SS some i ∈ V \ r. [sent-294, score-0.182]
79 1 we get the following upper bound on the success probability 2 Psucc (Gp ) ≤ 4∆2 p−δA K In particular Psucc (Gp ) → 0 as p → ∞. [sent-301, score-0.082]
80 1) We will show that under the assumptions of the lemma and if ˆ = (ˆS , ˆS C ) = θ (ˆS , 0) then the probability that the i component of any subgradient of L(θ; {x(ℓ) })+λ||θ||1 vanishes θ for any ˆS > 0 (component wise) is upper bounded as in Eq. [sent-306, score-0.218]
81 To simplify notation we will omit θ {x(ℓ) } in all the expression derived from L. [sent-308, score-0.045]
82 θ θ) z Let z be a subgradient of ||θ|| at ˆ and assume ∇L(ˆ + λˆ = 0. [sent-309, score-0.048]
83 An application of the mean value ˆ theorem yields ∇2 L(θ∗ )[ˆ − θ∗ ] = W n − λˆ + Rn , θ z (13) ∗ n n 2 ¯(j) ) − ∇2 L(θ∗ )]T (ˆ − θ∗ ) with ¯(j) a point in the line where W = −∇L(θ ) and [R ]j = [∇ L(θ θ j θ ˆ to θ∗ . [sent-310, score-0.056]
84 To simplify notation we will from θ omit the ∗ in all Qn ∗ . [sent-312, score-0.045]
85 SS S (14) We will assume that the samples {x(ℓ) } are such that the following event holds n E ≡ {||Qn − Q∗ ||∞ < ξA , ||Qn C S − Q∗ C S ||∞ < ξB , ||WS /λ||∞ < ξC } , (15) SS SS S S √ 2 n where ξA ≡ Cmin ǫ/(16∆), ξB ≡ Cmin ǫ/(8 ∆) and ξC ≡ Cmin ǫ/(8∆). [sent-316, score-0.052]
86 Since EG,θ (Q ) = Q∗ and EG,θ (W n ) = 0 and noticing that both Qn and W n are sums of bounded i. [sent-317, score-0.07]
87 The proof is completed by showing that the event E and the assumptions of the theorem imply that n each of last 7 terms in this expression is smaller than ǫ/8. [sent-323, score-0.087]
88 Since |[Q∗ C S Q∗ −1 ]T zS | ≥ 1 + ǫ by i ˆ SS S assumption, this implies |ˆi | ≥ 1 + ǫ/8 > 1 which cannot be since any subgradient of the 1-norm z has components of magnitude at most 1. [sent-324, score-0.048]
89 Some straightforward manipulations imply (See Lemma 7 from [7]) √ ∆ ∆ n ∗ ||T2,i ||∞ ≤ ||[Qn C S − Q∗ C S ]i ||∞ , ||T1,i ||∞ ≤ 2 ||QSS − QSS ||∞ , S S Cmin Cmin 2∆ ||T3,i ||∞ ≤ 2 ||Qn − Q∗ ||∞ ||[Qn C S − Q∗ C S ]i ||∞ , SS SS S S Cmin and thus all will be bounded by ǫ/8 when E holds. [sent-326, score-0.07]
90 For t ∈ N, let T(t) = (VT , ET ) be the regular rooted tree of t generations and define the associated Ising measure as ∗ 1 eθxi xj (16) eh xi . [sent-334, score-0.178]
91 µ+ (x) = T,θ ZT,θ (i,j)∈ET i∈∂T(t) Here ∂T(t) is the set of leaves of T(t) and h∗ is the unique positive solution of h = (∆ − 1) atanh {tanh θ tanh h}. [sent-335, score-0.18]
92 It can be proved using [17] and uniform continuity with respect to the ‘external field’ that non-trivial local expectations with respect to µG,θ (x) converge to local expectations with respect to µ+ (x), as p → ∞. [sent-336, score-0.102]
93 T,θ More precisely, let Br (t) denote a ball of radius t around node r ∈ G (the node whose neighborhood we are trying to reconstruct). [sent-337, score-0.194]
94 Then almost surely over graph sequences Gp of uniformly random regular graphs with p nodes (expectations here are taken with respect to the measures (1) and (16)) lim EG,θ {g(X Br (t) )} = ET(t),θ,+ {g(X T(t) )} . [sent-340, score-0.311]
95 We thus reduced the calculation of [Q∗ c S Q∗ −1 zS ]i to the calculation of expectations with respect to the tree measure (16). [sent-343, score-0.077]
96 The latter ˆ∗ S SS can be implemented explicitly through a recursive procedure, with simplifications arising thanks to the tree symmetry and by taking t ≫ 1. [sent-344, score-0.039]
97 The actual calculations consist in a (very) long exercise in calculus and we omit them from this outline. [sent-345, score-0.045]
98 Ng, “Learning factor graphs in polynomial time and sample complexity”. [sent-351, score-0.12]
99 Talata, “Consistent estimation of the basic neighborhood structure of Markov random fields”, The Annals of Statistics, 2006, 34, Vol. [sent-368, score-0.068]
100 B¨ uhlmann, “High dimensional graphs and variable selection with the u lasso”, Annals of Statistics, 2006, 34, Vol. [sent-403, score-0.148]
wordName wordTfidf (topN-words)
[('cmin', 0.373), ('ss', 0.362), ('rlr', 0.327), ('qn', 0.292), ('ising', 0.274), ('alg', 0.187), ('nalg', 0.163), ('psucc', 0.163), ('thr', 0.163), ('zs', 0.141), ('tanh', 0.133), ('br', 0.123), ('graph', 0.122), ('graphs', 0.12), ('crit', 0.102), ('eg', 0.083), ('degree', 0.083), ('xr', 0.079), ('lemma', 0.076), ('rj', 0.071), ('incoherence', 0.07), ('ferromagnetic', 0.07), ('nthr', 0.07), ('xbr', 0.07), ('bounded', 0.07), ('neighborhood', 0.068), ('xw', 0.066), ('ws', 0.063), ('node', 0.063), ('xj', 0.06), ('vertex', 0.06), ('reconstruct', 0.059), ('success', 0.058), ('ind', 0.056), ('theorem', 0.056), ('correlations', 0.053), ('core', 0.053), ('montanari', 0.053), ('samples', 0.052), ('thresholding', 0.049), ('gp', 0.049), ('subgradient', 0.048), ('atanh', 0.047), ('cri', 0.047), ('indd', 0.047), ('nrlr', 0.047), ('qss', 0.047), ('reconstructs', 0.047), ('zg', 0.047), ('lasso', 0.046), ('omit', 0.045), ('xi', 0.044), ('cij', 0.042), ('rs', 0.042), ('arxiv', 0.042), ('sinh', 0.041), ('tree', 0.039), ('expectations', 0.038), ('ij', 0.038), ('numerical', 0.038), ('pseudolikelihood', 0.037), ('condition', 0.037), ('exists', 0.037), ('critical', 0.036), ('threshold', 0.036), ('vertices', 0.036), ('logistic', 0.036), ('regular', 0.035), ('xu', 0.035), ('phenomenon', 0.034), ('complexity', 0.034), ('uniformly', 0.034), ('regularized', 0.034), ('develops', 0.033), ('coincide', 0.033), ('subgraphs', 0.033), ('proofs', 0.032), ('candidate', 0.032), ('notice', 0.032), ('converging', 0.032), ('mechanics', 0.032), ('wainwright', 0.031), ('failure', 0.031), ('proof', 0.031), ('noticeable', 0.029), ('selection', 0.028), ('ravikumar', 0.028), ('fix', 0.028), ('fails', 0.028), ('solvable', 0.027), ('log', 0.027), ('proved', 0.026), ('auxiliary', 0.026), ('change', 0.025), ('indices', 0.025), ('markov', 0.025), ('unbounded', 0.025), ('structural', 0.024), ('upper', 0.024), ('reasonable', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999946 256 nips-2009-Which graphical models are difficult to learn?
Author: Andrea Montanari, Jose A. Pereira
Abstract: We consider the problem of learning the structure of Ising models (pairwise binary Markov random fields) from i.i.d. samples. While several methods have been proposed to accomplish this task, their relative merits and limitations remain somewhat obscure. By analyzing a number of concrete examples, we show that low-complexity algorithms systematically fail when the Markov random field develops long-range correlations. More precisely, this phenomenon appears to be related to the Ising model phase transition (although it does not coincide with it). 1 Introduction and main results Given a graph G = (V = [p], E), and a positive parameter θ > 0 the ferromagnetic Ising model on G is the pairwise Markov random field µG,θ (x) = 1 ZG,θ eθxi xj (1) (i,j)∈E over binary variables x = (x1 , x2 , . . . , xp ). Apart from being one of the most studied models in statistical mechanics, the Ising model is a prototypical undirected graphical model, with applications in computer vision, clustering and spatial statistics. Its obvious generalization to edge-dependent parameters θij , (i, j) ∈ E is of interest as well, and will be introduced in Section 1.2.2. (Let us stress that we follow the statistical mechanics convention of calling (1) an Ising model for any graph G.) In this paper we study the following structural learning problem: Given n i.i.d. samples x(1) , x(2) ,. . . , x(n) with distribution µG,θ ( · ), reconstruct the graph G. For the sake of simplicity, we assume that the parameter θ is known, and that G has no double edges (it is a ‘simple’ graph). The graph learning problem is solvable with unbounded sample complexity, and computational resources [1]. The question we address is: for which classes of graphs and values of the parameter θ is the problem solvable under appropriate complexity constraints? More precisely, given an algorithm Alg, a graph G, a value θ of the model parameter, and a small δ > 0, the sample complexity is defined as nAlg (G, θ) ≡ inf n ∈ N : Pn,G,θ {Alg(x(1) , . . . , x(n) ) = G} ≥ 1 − δ , (2) where Pn,G,θ denotes probability with respect to n i.i.d. samples with distribution µG,θ . Further, we let χAlg (G, θ) denote the number of operations of the algorithm Alg, when run on nAlg (G, θ) samples.1 1 For the algorithms analyzed in this paper, the behavior of nAlg and χAlg does not change significantly if we require only ‘approximate’ reconstruction (e.g. in graph distance). 1 The general problem is therefore to characterize the functions nAlg (G, θ) and χAlg (G, θ), in particular for an optimal choice of the algorithm. General bounds on nAlg (G, θ) have been given in [2, 3], under the assumption of unbounded computational resources. A general charactrization of how well low complexity algorithms can perform is therefore lacking. Although we cannot prove such a general characterization, in this paper we estimate nAlg and χAlg for a number of graph models, as a function of θ, and unveil a fascinating universal pattern: when the model (1) develops long range correlations, low-complexity algorithms fail. Under the Ising model, the variables {xi }i∈V become strongly correlated for θ large. For a large class of graphs with degree bounded by ∆, this phenomenon corresponds to a phase transition beyond some critical value of θ uniformly bounded in p, with typically θcrit ≤ const./∆. In the examples discussed below, the failure of low-complexity algorithms appears to be related to this phase transition (although it does not coincide with it). 1.1 A toy example: the thresholding algorithm In order to illustrate the interplay between graph structure, sample complexity and interaction strength θ, it is instructive to consider a warmup example. The thresholding algorithm reconstructs G by thresholding the empirical correlations Cij ≡ 1 n n (ℓ) (ℓ) xi xj for i, j ∈ V . ℓ=1 (3) T HRESHOLDING( samples {x(ℓ) }, threshold τ ) 1: Compute the empirical correlations {Cij }(i,j)∈V ×V ; 2: For each (i, j) ∈ V × V 3: If Cij ≥ τ , set (i, j) ∈ E; We will denote this algorithm by Thr(τ ). Notice that its complexity is dominated by the computation of the empirical correlations, i.e. χThr(τ ) = O(p2 n). The sample complexity nThr(τ ) can be bounded for specific classes of graphs as follows (the proofs are straightforward and omitted from this paper). Theorem 1.1. If G has maximum degree ∆ > 1 and if θ < atanh(1/(2∆)) then there exists τ = τ (θ) such that 2p 8 log nThr(τ ) (G, θ) ≤ . (4) 1 δ (tanh θ − 2∆ )2 Further, the choice τ (θ) = (tanh θ + (1/2∆))/2 achieves this bound. Theorem 1.2. There exists a numerical constant K such that the following is true. If ∆ > 3 and θ > K/∆, there are graphs of bounded degree ∆ such that for any τ , nThr(τ ) = ∞, i.e. the thresholding algorithm always fails with high probability. These results confirm the idea that the failure of low-complexity algorithms is related to long-range correlations in the underlying graphical model. If the graph G is a tree, then correlations between far apart variables xi , xj decay exponentially with the distance between vertices i, j. The same happens on bounded-degree graphs if θ ≤ const./∆. However, for θ > const./∆, there exists families of bounded degree graphs with long-range correlations. 1.2 More sophisticated algorithms In this section we characterize χAlg (G, θ) and nAlg (G, θ) for more advanced algorithms. We again obtain very distinct behaviors of these algorithms depending on long range correlations. Due to space limitations, we focus on two type of algorithms and only outline the proof of our most challenging result, namely Theorem 1.6. In the following we denote by ∂i the neighborhood of a node i ∈ G (i ∈ ∂i), and assume the degree / to be bounded: |∂i| ≤ ∆. 1.2.1 Local Independence Test A recurring approach to structural learning consists in exploiting the conditional independence structure encoded by the graph [1, 4, 5, 6]. 2 Let us consider, to be definite, the approach of [4], specializing it to the model (1). Fix a vertex r, whose neighborhood we want to reconstruct, and consider the conditional distribution of xr given its neighbors2 : µG,θ (xr |x∂r ). Any change of xi , i ∈ ∂r, produces a change in this distribution which is bounded away from 0. Let U be a candidate neighborhood, and assume U ⊆ ∂r. Then changing the value of xj , j ∈ U will produce a noticeable change in the marginal of Xr , even if we condition on the remaining values in U and in any W , |W | ≤ ∆. On the other hand, if U ∂r, then it is possible to find W (with |W | ≤ ∆) and a node i ∈ U such that, changing its value after fixing all other values in U ∪ W will produce no noticeable change in the conditional marginal. (Just choose i ∈ U \∂r and W = ∂r\U ). This procedure allows us to distinguish subsets of ∂r from other sets of vertices, thus motivating the following algorithm. L OCAL I NDEPENDENCE T EST( samples {x(ℓ) }, thresholds (ǫ, γ) ) 1: Select a node r ∈ V ; 2: Set as its neighborhood the largest candidate neighbor U of size at most ∆ for which the score function S CORE(U ) > ǫ/2; 3: Repeat for all nodes r ∈ V ; The score function S CORE( · ) depends on ({x(ℓ) }, ∆, γ) and is defined as follows, min max W,j xi ,xW ,xU ,xj |Pn,G,θ {Xi = xi |X W = xW , X U = xU }− Pn,G,θ {Xi = xi |X W = xW , X U \j = xU \j , Xj = xj }| . (5) In the minimum, |W | ≤ ∆ and j ∈ U . In the maximum, the values must be such that Pn,G,θ {X W = xW , X U = xU } > γ/2, Pn,G,θ {X W = xW , X U \j = xU \j , Xj = xj } > γ/2 Pn,G,θ is the empirical distribution calculated from the samples {x(ℓ) }. We denote this algorithm by Ind(ǫ, γ). The search over candidate neighbors U , the search for minima and maxima in the computation of the S CORE(U ) and the computation of Pn,G,θ all contribute for χInd (G, θ). Both theorems that follow are consequences of the analysis of [4]. Theorem 1.3. Let G be a graph of bounded degree ∆ ≥ 1. For every θ there exists (ǫ, γ), and a numerical constant K, such that 2p 100∆ , χInd(ǫ,γ) (G, θ) ≤ K (2p)2∆+1 log p . nInd(ǫ,γ) (G, θ) ≤ 2 4 log ǫ γ δ More specifically, one can take ǫ = 1 4 sinh(2θ), γ = e−4∆θ 2−2∆ . This first result implies in particular that G can be reconstructed with polynomial complexity for any bounded ∆. However, the degree of such polynomial is pretty high and non-uniform in ∆. This makes the above approach impractical. A way out was proposed in [4]. The idea is to identify a set of ‘potential neighbors’ of vertex r via thresholding: B(r) = {i ∈ V : Cri > κ/2} , (6) For each node r ∈ V , we evaluate S CORE(U ) by restricting the minimum in Eq. (5) over W ⊆ B(r), and search only over U ⊆ B(r). We call this algorithm IndD(ǫ, γ, κ). The basic intuition here is that Cri decreases rapidly with the graph distance between vertices r and i. As mentioned above, this is true at small θ. Theorem 1.4. Let G be a graph of bounded degree ∆ ≥ 1. Assume that θ < K/∆ for some small enough constant K. Then there exists ǫ, γ, κ such that nIndD(ǫ,γ,κ) (G, θ) ≤ 8(κ2 + 8∆ ) log 4p , δ χIndD(ǫ,γ,κ) (G, θ) ≤ K ′ p∆∆ More specifically, we can take κ = tanh θ, ǫ = 1 4 log(4/κ) α + K ′ ∆p2 log p . sinh(2θ) and γ = e−4∆θ 2−2∆ . 2 If a is a vector and R is a set of indices then we denote by aR the vector formed by the components of a with index in R. 3 1.2.2 Regularized Pseudo-Likelihoods A different approach to the learning problem consists in maximizing an appropriate empirical likelihood function [7, 8, 9, 10, 13]. To control the fluctuations caused by the limited number of samples, and select sparse graphs a regularization term is often added [7, 8, 9, 10, 11, 12, 13]. As a specific low complexity implementation of this idea, we consider the ℓ1 -regularized pseudolikelihood method of [7]. For each node r, the following likelihood function is considered L(θ; {x(ℓ) }) = − 1 n n (ℓ) ℓ=1 log Pn,G,θ (x(ℓ) |x\r ) r (7) where x\r = xV \r = {xi : i ∈ V \ r} is the vector of all variables except xr and Pn,G,θ is defined from the following extension of (1), µG,θ (x) = 1 ZG,θ eθij xi xj (8) i,j∈V / where θ = {θij }i,j∈V is a vector of real parameters. Model (1) corresponds to θij = 0, ∀(i, j) ∈ E and θij = θ, ∀(i, j) ∈ E. The function L(θ; {x(ℓ) }) depends only on θr,· = {θrj , j ∈ ∂r} and is used to estimate the neighborhood of each node by the following algorithm, Rlr(λ), R EGULARIZED L OGISTIC R EGRESSION( samples {x(ℓ) }, regularization (λ)) 1: Select a node r ∈ V ; 2: Calculate ˆr,· = arg min {L(θr,· ; {x(ℓ) }) + λ||θr,· ||1 }; θ θ r,· ∈Rp−1 3: ˆ If θrj > 0, set (r, j) ∈ E; Our first result shows that Rlr(λ) indeed reconstructs G if θ is sufficiently small. Theorem 1.5. There exists numerical constants K1 , K2 , K3 , such that the following is true. Let G be a graph with degree bounded by ∆ ≥ 3. If θ ≤ K1 /∆, then there exist λ such that nRlr(λ) (G, θ) ≤ K2 θ−2 ∆ log 8p2 . δ (9) Further, the above holds with λ = K3 θ ∆−1/2 . This theorem is proved by noting that for θ ≤ K1 /∆ correlations decay exponentially, which makes all conditions in Theorem 1 of [7] (denoted there by A1 and A2) hold, and then computing the probability of success as a function of n, while strenghtening the error bounds of [7]. In order to prove a converse to the above result, we need to make some assumptions on λ. Given θ > 0, we say that λ is ‘reasonable for that value of θ if the following conditions old: (i) Rlr(λ) is successful with probability larger than 1/2 on any star graph (a graph composed by a vertex r connected to ∆ neighbors, plus isolated vertices); (ii) λ ≤ δ(n) for some sequence δ(n) ↓ 0. Theorem 1.6. There exists a numerical constant K such that the following happens. If ∆ > 3, θ > K/∆, then there exists graphs G of degree bounded by ∆ such that for all reasonable λ, nRlr(λ) (G) = ∞, i.e. regularized logistic regression fails with high probability. The graphs for which regularized logistic regression fails are not contrived examples. Indeed we will prove that the claim in the last theorem holds with high probability when G is a uniformly random graph of regular degree ∆. The proof Theorem 1.6 is based on showing that an appropriate incoherence condition is necessary for Rlr to successfully reconstruct G. The analogous result was proven in [14] for model selection using the Lasso. In this paper we show that such a condition is also necessary when the underlying model is an Ising model. Notice that, given the graph G, checking the incoherence condition is NP-hard for general (non-ferromagnetic) Ising model, and requires significant computational effort 4 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 20 15 λ0 10 5 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.2 1 0.8 0.6 Psucc 0.4 0.2 1 θ 0 0 0.2 0.4 0.6 θ θcrit 0.8 1 Figure 1: Learning random subgraphs of a 7 × 7 (p = 49) two-dimensional grid from n = 4500 Ising models samples, using regularized logistic regression. Left: success probability as a function of the model parameter θ and of the regularization parameter λ0 (darker corresponds to highest probability). Right: the same data plotted for several choices of λ versus θ. The vertical line corresponds to the model critical temperature. The thick line is an envelope of the curves obtained for different λ, and should correspond to optimal regularization. even in the ferromagnetic case. Hence the incoherence condition does not provide, by itself, a clear picture of which graph structure are difficult to learn. We will instead show how to evaluate it on specific graph families. Under the restriction λ → 0 the solutions given by Rlr converge to θ∗ with n [7]. Thus, for large n we can expand L around θ∗ to second order in (θ − θ∗ ). When we add the regularization term to L we obtain a quadratic model analogous the Lasso plus the error term due to the quadratic approximation. It is thus not surprising that, when λ → 0 the incoherence condition introduced for the Lasso in [14] is also relevant for the Ising model. 2 Numerical experiments In order to explore the practical relevance of the above results, we carried out extensive numerical simulations using the regularized logistic regression algorithm Rlr(λ). Among other learning algorithms, Rlr(λ) strikes a good balance of complexity and performance. Samples from the Ising model (1) where generated using Gibbs sampling (a.k.a. Glauber dynamics). Mixing time can be very large for θ ≥ θcrit , and was estimated using the time required for the overall bias to change sign (this is a quite conservative estimate at low temperature). Generating the samples {x(ℓ) } was indeed the bulk of our computational effort and took about 50 days CPU time on Pentium Dual Core processors (we show here only part of these data). Notice that Rlr(λ) had been tested in [7] only on tree graphs G, or in the weakly coupled regime θ < θcrit . In these cases sampling from the Ising model is easy, but structural learning is also intrinsically easier. Figure reports the success probability of Rlr(λ) when applied to random subgraphs of a 7 × 7 two-dimensional grid. Each such graphs was obtained by removing each edge independently with probability ρ = 0.3. Success probability was estimated by applying Rlr(λ) to each vertex of 8 graphs (thus averaging over 392 runs of Rlr(λ)), using n = 4500 samples. We scaled the regularization parameter as λ = 2λ0 θ(log p/n)1/2 (this choice is motivated by the algorithm analysis and is empirically the most satisfactory), and searched over λ0 . The data clearly illustrate the phenomenon discussed. Despite the large number of samples n ≫ log p, when θ crosses a threshold, the algorithm starts performing poorly irrespective of λ. Intriguingly, this threshold is not far from the critical point of the Ising model on a randomly diluted grid θcrit (ρ = 0.3) ≈ 0.7 [15, 16]. 5 1.2 1.2 θ = 0.35, 0.40 1 1 θ = 0.25 θ = 0.20 0.8 0.8 θ = 0.45 θ = 0.10 0.6 0.6 Psucc Psucc 0.4 0.4 θ = 0.50 0.2 0.2 θthr θ = 0.65, 0.60, 0.55 0 0 0 2000 4000 6000 8000 10000 0 0.1 0.2 n 0.3 0.4 0.5 0.6 0.7 0.8 θ Figure 2: Learning uniformly random graphs of degree 4 from Ising models samples, using Rlr. Left: success probability as a function of the number of samples n for several values of θ. Right: the same data plotted for several choices of λ versus θ as in Fig. 1, right panel. Figure 2 presents similar data when G is a uniformly random graph of degree ∆ = 4, over p = 50 vertices. The evolution of the success probability with n clearly shows a dichotomy. When θ is below a threshold, a small number of samples is sufficient to reconstruct G with high probability. Above the threshold even n = 104 samples are to few. In this case we can predict the threshold analytically, cf. Lemma 3.3 below, and get θthr (∆ = 4) ≈ 0.4203, which compares favorably with the data. 3 Proofs In order to prove Theorem 1.6, we need a few auxiliary results. It is convenient to introduce some notations. If M is a matrix and R, P are index sets then MR P denotes the submatrix with row indices in R and column indices in P . As above, we let r be the vertex whose neighborhood we are trying to reconstruct and define S = ∂r, S c = V \ ∂r ∪ r. Since the cost function L(θ; {x(ℓ) }) + λ||θ||1 only depend on θ through its components θr,· = {θrj }, we will hereafter neglect all the other parameters and write θ as a shorthand of θr,· . Let z ∗ be a subgradient of ||θ||1 evaluated at the true parameters values, θ∗ = {θrj : θij = 0, ∀j ∈ ˆ / ˆn be the parameter estimate returned by Rlr(λ) when the number ∂r, θrj = θ, ∀j ∈ ∂r}. Let θ of samples is n. Note that, since we assumed θ∗ ≥ 0, zS = ½. Define Qn (θ, ; {x(ℓ) }) to be the ˆ∗ (ℓ) (ℓ) n Hessian of L(θ; {x }) and Q(θ) = limn→∞ Q (θ, ; {x }). By the law of large numbers Q(θ) is the Hessian of EG,θ log PG,θ (Xr |X\r ) where EG,θ is the expectation with respect to (8) and X is a random variable distributed according to (8). We will denote the maximum and minimum eigenvalue of a symmetric matrix M by σmax (M ) and σmin (M ) respectively. We will omit arguments whenever clear from the context. Any quantity evaluated at the true parameter values will be represented with a ∗ , e.g. Q∗ = Q(θ∗ ). Quantities under a ∧ depend on n. Throughout this section G is a graph of maximum degree ∆. 3.1 Proof of Theorem 1.6 Our first auxiliary results establishes that, if λ is small, then ||Q∗ c S Q∗ −1 zS ||∞ > 1 is a sufficient ˆ∗ S SS condition for the failure of Rlr(λ). Lemma 3.1. Assume [Q∗ c S Q∗ −1 zS ]i ≥ 1 + ǫ for some ǫ > 0 and some row i ∈ V , σmin (Q∗ ) ≥ ˆ∗ S SS SS 3 ǫ/29 ∆4 . Then the success probability of Rlr(λ) is upper bounded as Cmin > 0, and λ < Cmin 2 2 2 δB Psucc ≤ 4∆2 e−nδA + 2∆ e−nλ 2 where δA = (Cmin /100∆2 )ǫ and δB = (Cmin /8∆)ǫ. 6 (10) The next Lemma implies that, for λ to be ‘reasonable’ (in the sense introduced in Section 1.2.2), nλ2 must be unbounded. Lemma 3.2. There exist M = M (K, θ) > 0 for θ > 0 such that the following is true: If G is the graph with only one edge between nodes r and i and nλ2 ≤ K, then Psucc ≤ e−M (K,θ)p + e−n(1−tanh θ) 2 /32 . (11) Finally, our key result shows that the condition ||Q∗ c S Q∗ −1 zS ||∞ ≤ 1 is violated with high ˆ∗ S SS probability for large random graphs. The proof of this result relies on a local weak convergence result for ferromagnetic Ising models on random graphs proved in [17]. Lemma 3.3. Let G be a uniformly random regular graph of degree ∆ > 3, and ǫ > 0 be sufficiently small. Then, there exists θthr (∆, ǫ) such that, for θ > θthr (∆, ǫ), ||Q∗ c S Q∗ −1 zS ||∞ ≥ 1 + ǫ with ˆ∗ S SS probability converging to 1 as p → ∞. ˜ ˜ ˜ Furthermore, for large ∆, θthr (∆, 0+) = θ ∆−1 (1 + o(1)). The constant θ is given by θ = ¯ ¯ ¯ ¯ ¯ ¯ tanh h)/h and h is the unique positive solution of h tanh h = (1 − tanh2 h)2 . Finally, there exist Cmin > 0 dependent only on ∆ and θ such that σmin (Q∗ ) ≥ Cmin with probability converging to SS 1 as p → ∞. The proofs of Lemmas 3.1 and 3.3 are sketched in the next subsection. Lemma 3.2 is more straightforward and we omit its proof for space reasons. Proof. (Theorem 1.6) Fix ∆ > 3, θ > K/∆ (where K is a large enough constant independent of ∆), and ǫ, Cmin > 0 and both small enough. By Lemma 3.3, for any p large enough we can choose a ∆-regular graph Gp = (V = [p], Ep ) and a vertex r ∈ V such that |Q∗ c S Q∗ −1 ½S |i > 1 + ǫ for S SS some i ∈ V \ r. By Theorem 1 in [4] we can assume, without loss of generality n > K ′ ∆ log p for some small constant K ′ . Further by Lemma 3.2, nλ2 ≥ F (p) for some F (p) ↑ ∞ as p → ∞ and the condition of Lemma 3.1 on λ is satisfied since by the ”reasonable” assumption λ → 0 with n. Using these results in Eq. (10) of Lemma 3.1 we get the following upper bound on the success probability 2 Psucc (Gp ) ≤ 4∆2 p−δA K In particular Psucc (Gp ) → 0 as p → ∞. 3.2 ′ ∆ 2 + 2∆ e−nF (p)δB . (12) Proofs of auxiliary lemmas θ θ Proof. (Lemma 3.1) We will show that under the assumptions of the lemma and if ˆ = (ˆS , ˆS C ) = θ (ˆS , 0) then the probability that the i component of any subgradient of L(θ; {x(ℓ) })+λ||θ||1 vanishes θ for any ˆS > 0 (component wise) is upper bounded as in Eq. (10). To simplify notation we will omit θ {x(ℓ) } in all the expression derived from L. θ θ) z Let z be a subgradient of ||θ|| at ˆ and assume ∇L(ˆ + λˆ = 0. An application of the mean value ˆ theorem yields ∇2 L(θ∗ )[ˆ − θ∗ ] = W n − λˆ + Rn , θ z (13) ∗ n n 2 ¯(j) ) − ∇2 L(θ∗ )]T (ˆ − θ∗ ) with ¯(j) a point in the line where W = −∇L(θ ) and [R ]j = [∇ L(θ θ j θ ˆ to θ∗ . Notice that by definition ∇2 L(θ∗ ) = Qn ∗ = Qn (θ∗ ). To simplify notation we will from θ omit the ∗ in all Qn ∗ . All Qn in this proof are thus evaluated at θ∗ . Breaking this expression into its S and S c components and since ˆS C = θ∗ C = 0 we can eliminate θ S ˆ − θ∗ from the two expressions obtained and write θS S n n n n ˆ z [WS C − RS C ] − Qn C S (Qn )−1 [WS − RS ] + λQn C S (Qn )−1 zS = λˆS C . SS SS S S Now notice that Qn C S (Qn )−1 = T1 + T2 + T3 + T4 where SS S T1 = Q∗ C S [(Qn )−1 − (Q∗ )−1 ] , SS SS S T3 = [Qn C S − Q∗ C S ][(Qn )−1 − (Q∗ )−1 ] , SS SS S S 7 T2 = [Qn C S − Q∗ C S ]Q∗ −1 , SS S S T4 = Q∗ C S Q∗ −1 . SS S (14) We will assume that the samples {x(ℓ) } are such that the following event holds n E ≡ {||Qn − Q∗ ||∞ < ξA , ||Qn C S − Q∗ C S ||∞ < ξB , ||WS /λ||∞ < ξC } , (15) SS SS S S √ 2 n where ξA ≡ Cmin ǫ/(16∆), ξB ≡ Cmin ǫ/(8 ∆) and ξC ≡ Cmin ǫ/(8∆). Since EG,θ (Q ) = Q∗ and EG,θ (W n ) = 0 and noticing that both Qn and W n are sums of bounded i.i.d. random variables, a simple application of Azuma-Hoeffding inequality upper bounds the probability of E as in (10). From E it follows that σmin (Qn ) > σmin (Q∗ ) − Cmin /2 > Cmin /2. We can therefore lower SS SS bound the absolute value of the ith component of zS C by ˆ n n ∆ Rn WS RS Wn + |[Q∗ C S Q∗ −1 ½S ]i |−||T1,i ||∞ −||T2,i ||∞ −||T3,i ||∞ − i − i − SS S λ λ Cmin λ ∞ λ ∞ where the subscript i denotes the i-th row of a matrix. The proof is completed by showing that the event E and the assumptions of the theorem imply that n each of last 7 terms in this expression is smaller than ǫ/8. Since |[Q∗ C S Q∗ −1 ]T zS | ≥ 1 + ǫ by i ˆ SS S assumption, this implies |ˆi | ≥ 1 + ǫ/8 > 1 which cannot be since any subgradient of the 1-norm z has components of magnitude at most 1. The last condition on E immediately bounds all terms involving W by ǫ/8. Some straightforward manipulations imply (See Lemma 7 from [7]) √ ∆ ∆ n ∗ ||T2,i ||∞ ≤ ||[Qn C S − Q∗ C S ]i ||∞ , ||T1,i ||∞ ≤ 2 ||QSS − QSS ||∞ , S S Cmin Cmin 2∆ ||T3,i ||∞ ≤ 2 ||Qn − Q∗ ||∞ ||[Qn C S − Q∗ C S ]i ||∞ , SS SS S S Cmin and thus all will be bounded by ǫ/8 when E holds. The upper bound of Rn follows along similar lines via an mean value theorem, and is deferred to a longer version of this paper. Proof. (Lemma 3.3.) Let us state explicitly the local weak convergence result mentioned in Sec. 3.1. For t ∈ N, let T(t) = (VT , ET ) be the regular rooted tree of t generations and define the associated Ising measure as ∗ 1 eθxi xj (16) eh xi . µ+ (x) = T,θ ZT,θ (i,j)∈ET i∈∂T(t) Here ∂T(t) is the set of leaves of T(t) and h∗ is the unique positive solution of h = (∆ − 1) atanh {tanh θ tanh h}. It can be proved using [17] and uniform continuity with respect to the ‘external field’ that non-trivial local expectations with respect to µG,θ (x) converge to local expectations with respect to µ+ (x), as p → ∞. T,θ More precisely, let Br (t) denote a ball of radius t around node r ∈ G (the node whose neighborhood we are trying to reconstruct). For any fixed t, the probability that Br (t) is not isomorphic to T(t) goes to 0 as p → ∞. Let g(xBr (t) ) be any function of the variables in Br (t) such that g(xBr (t) ) = g(−xBr (t) ). Then almost surely over graph sequences Gp of uniformly random regular graphs with p nodes (expectations here are taken with respect to the measures (1) and (16)) lim EG,θ {g(X Br (t) )} = ET(t),θ,+ {g(X T(t) )} . (17) p→∞ The proof consists in considering [Q∗ c S Q∗ −1 zS ]i for t = dist(r, i) finite. We then write ˆ∗ S SS (Q∗ )lk = E{gl,k (X Br (t) )} and (Q∗ c S )il = E{gi,l (X Br (t) )} for some functions g·,· (X Br (t) ) and S SS apply the weak convergence result (17) to these expectations. We thus reduced the calculation of [Q∗ c S Q∗ −1 zS ]i to the calculation of expectations with respect to the tree measure (16). The latter ˆ∗ S SS can be implemented explicitly through a recursive procedure, with simplifications arising thanks to the tree symmetry and by taking t ≫ 1. The actual calculations consist in a (very) long exercise in calculus and we omit them from this outline. The lower bound on σmin (Q∗ ) is proved by a similar calculation. SS Acknowledgments This work was partially supported by a Terman fellowship, the NSF CAREER award CCF-0743978 and the NSF grant DMS-0806211 and by a Portuguese Doctoral FCT fellowship. 8 , References [1] P. Abbeel, D. Koller and A. Ng, “Learning factor graphs in polynomial time and sample complexity”. Journal of Machine Learning Research., 2006, Vol. 7, 1743–1788. [2] M. Wainwright, “Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting”, arXiv:math/0702301v2 [math.ST], 2007. [3] N. Santhanam, M. Wainwright, “Information-theoretic limits of selecting binary graphical models in high dimensions”, arXiv:0905.2639v1 [cs.IT], 2009. [4] G. Bresler, E. Mossel and A. Sly, “Reconstruction of Markov Random Fields from Samples: Some Observations and Algorithms”,Proceedings of the 11th international workshop, APPROX 2008, and 12th international workshop RANDOM 2008, 2008 ,343–356. [5] Csiszar and Z. Talata, “Consistent estimation of the basic neighborhood structure of Markov random fields”, The Annals of Statistics, 2006, 34, Vol. 1, 123-145. [6] N. Friedman, I. Nachman, and D. Peer, “Learning Bayesian network structure from massive datasets: The sparse candidate algorithm”. In UAI, 1999. [7] P. Ravikumar, M. Wainwright and J. Lafferty, “High-Dimensional Ising Model Selection Using l1-Regularized Logistic Regression”, arXiv:0804.4202v1 [math.ST], 2008. [8] M.Wainwright, P. Ravikumar, and J. Lafferty, “Inferring graphical model structure using l1regularized pseudolikelihood“, In NIPS, 2006. [9] H. H¨ fling and R. Tibshirani, “Estimation of Sparse Binary Pairwise Markov Networks using o Pseudo-likelihoods” , Journal of Machine Learning Research, 2009, Vol. 10, 883–906. [10] O.Banerjee, L. El Ghaoui and A. d’Aspremont, “Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data”, Journal of Machine Learning Research, March 2008, Vol. 9, 485–516. [11] M. Yuan and Y. Lin, “Model Selection and Estimation in Regression with Grouped Variables”, J. Royal. Statist. Soc B, 2006, 68, Vol. 19,49–67. [12] N. Meinshausen and P. B¨ uhlmann, “High dimensional graphs and variable selection with the u lasso”, Annals of Statistics, 2006, 34, Vol. 3. [13] R. Tibshirani, “Regression shrinkage and selection via the lasso”, Journal of the Royal Statistical Society, Series B, 1994, Vol. 58, 267–288. [14] P. Zhao, B. Yu, “On model selection consistency of Lasso”, Journal of Machine. Learning Research 7, 25412563, 2006. [15] D. Zobin, ”Critical behavior of the bond-dilute two-dimensional Ising model“, Phys. Rev., 1978 ,5, Vol. 18, 2387 – 2390. [16] M. Fisher, ”Critical Temperatures of Anisotropic Ising Lattices. II. General Upper Bounds”, Phys. Rev. 162 ,Oct. 1967, Vol. 2, 480–485. [17] A. Dembo and A. Montanari, “Ising Models on Locally Tree Like Graphs”, Ann. Appl. Prob. (2008), to appear, arXiv:0804.4726v2 [math.PR] 9
2 0.18276678 248 nips-2009-Toward Provably Correct Feature Selection in Arbitrary Domains
Author: Dimitris Margaritis
Abstract: In this paper we address the problem of provably correct feature selection in arbitrary domains. An optimal solution to the problem is a Markov boundary, which is a minimal set of features that make the probability distribution of a target variable conditionally invariant to the state of all other features in the domain. While numerous algorithms for this problem have been proposed, their theoretical correctness and practical behavior under arbitrary probability distributions is unclear. We address this by introducing the Markov Boundary Theorem that precisely characterizes the properties of an ideal Markov boundary, and use it to develop algorithms that learn a more general boundary that can capture complex interactions that only appear when the values of multiple features are considered together. We introduce two algorithms: an exact, provably correct one as well a more practical randomized anytime version, and show that they perform well on artificial as well as benchmark and real-world data sets. Throughout the paper we make minimal assumptions that consist of only a general set of axioms that hold for every probability distribution, which gives these algorithms universal applicability. 1 Introduction and Motivation The problem of feature selection has a long history due to its significance in a wide range of important problems, from early ones like pattern recognition to recent ones such as text categorization, gene expression analysis and others. In such domains, using all available features may be prohibitively expensive, unnecessarily wasteful, and may lead to poor generalization performance, especially in the presence of irrelevant or redundant features. Thus, selecting a subset of features of the domain for use in subsequent application of machine learning algorithms has become a standard preprocessing step. A typical task of these algorithms is learning a classifier: Given a number of input features and a quantity of interest, called the target variable, choose a member of a family of classifiers that can predict the target variable’s value as well as possible. Another task is understanding the domain and the quantities that interact with the target quantity. Many algorithms have been proposed for feature selection. Unfortunately, little attention has been paid to the issue of their behavior under a variety of application domains that can be encountered in practice. In particular, it is known that many can fail under certain probability distributions such as ones that contain a (near) parity function [1], which contain interactions that only appear when the values of multiple features are considered together. There is therefore an acute need for algorithms that are widely applicable and can be theoretically proven to work under any probability distribution. In this paper we present two such algorithms, an exact and a more practical randomized approximate one. We use the observation (first made in Koller and Sahami [2]) that an optimal solution to the problem is a Markov boundary, defined to be a minimal set of features that make the probability distribution of a target variable conditionally invariant to the state of all other features in the domain (a more precise definition is given later in Section 3) and present a family of algorithms for learning 1 the Markov boundary of a target variable in arbitrary domains. We first introduce a theorem that exactly characterizes the minimal set of features necessary for probabilistically isolating a variable, and then relax this definition to derive a family of algorithms that learn a parameterized approximation of the ideal boundary that are provably correct under a minimal set of assumptions, including a set of axioms that hold for any probability distribution. In the following section we present work on feature selection, followed by notation and definitions in Section 3. We subsequently introduce an important theorem and the aforementioned parameterized family of algorithms in Sections 4 and 5 respectively, including a practical anytime version. We evaluate these algorithms in Section 6 and conclude in Section 7. 2 Related Work Numerous algorithms have been proposed for feature selection. At the highest level algorithms can be classified as filter, wrapper, or embedded methods. Filter methods work without consulting the classifier (if any) that will make use of their output i.e., the resulting set of selected features. They therefore have typically wider applicability since they are not tied to any particular classifier family. In contrast, wrappers make the classifier an integral part of their operation, repeatedly invoking it to evaluate each of a sequence of feature subsets, and selecting the subset that results in minimum estimated classification error (for that particular classifier). Finally, embedded algorithms are classifier-learning algorithms that perform feature selection implicitly during their operation e.g., decision tree learners. Early work was motivated by the problem of pattern recognition which inherently contains a large number of features (pixels, regions, signal responses at multiple frequencies etc.). Narendra and Fukunaga [3] first cast feature selection as a problem of maximization of an objective function over the set of features to use, and proposed a number of search approaches including forward selection and backward elimination. Later work by machine learning researchers includes the FOCUS algorithm of Almuallim and Dietterich [4], which is a filter method for deterministic, noise-free domains. The RELIEF algorithm [5] instead uses a randomized selection of data points to update a weight assigned to each feature, selecting the features whose weight exceeds a given threshold. A large number of additional algorithms have appeared in the literature, too many to list here—good surveys are included in Dash and Liu [6]; Guyon and Elisseeff [1]; Liu and Motoda [7]. An important concept for feature subset selection is relevance. Several notions of relevance are discussed in a number of important papers such as Blum and Langley [8]; Kohavi and John [9]. The argument that the problem of feature selection can be cast as the problem of Markov blanket discovery was first made convincingly in Koller and Sahami [2], who also presented an algorithm for learning an approximate Markov blanket using mutual information. Other algorithms include the GS algorithm [10], originally developed for learning of the structure of a Bayesian network of a domain, and extensions to it [11] including the recent MMMB algorithm [12]. Meinshausen and B¨ hlmann [13] u recently proposed an optimal theoretical solution to the problem of learning the neighborhood of a Markov network when the distribution of the domain can be assumed to be a multidimensional Gaussian i.e., linear relations among features with Gaussian noise. This assumption implies that the Composition axiom holds in the domain (see Pearl [14] for a definition of Composition); the difference with our work is that we address here the problem in general domains where it may not necessarily hold. 3 Notation and Preliminaries In this section we present notation, fundamental definitions and axioms that will be subsequently used in the rest of the paper. We use the term “feature” and “variable” interchangeably, and denote variables by capital letters (X, Y etc.) and sets of variables by bold letters (S, T etc.). We denote the set of all variables/features in the domain (the “universe”) by U. All algorithms presented are independence-based, learning the Markov boundary of a given target variable using the truth value of a number of conditional independence statements. The use of conditional independence for feature selection subsumes many other criteria proposed in the literature. In particular, the use of classification accuracy of the target variable can be seen as a special case of testing for its conditional independence with some of its predictor variables (conditional on the subset selected at any given moment). A benefit of using conditional independence is that, while classification error estimates depend on the classifier family used, conditional independence does not. In addition, algorithms utilizing conditional independence for feature selection are applicable to all domain types, 2 e.g., discrete, ordinal, continuous with non-linear or arbitrary non-degenerate associations or mixed domains, as long as a reliable estimate of probabilistic independence is available. We denote probabilistic independence by the symbol “ ⊥ ” i.e., (X⊥ Y | Z) denotes the fact ⊥ ⊥ that the variables in set X are (jointly) conditionally independent from those in set Y given the values of the variables in set Z; (X⊥ Y | Z) denotes their conditional dependence. We assume ⊥ the existence of a probabilistic independence query oracle that is available to answer any query of the form (X, Y | Z), corresponding to the question “Is the set of variables in X independent of the variables in Y given the value of the variables in Z?” (This is similar to the approach of learning from statistical queries of Kearns and Vazirani [15].) In practice however, such an oracle does not exist, but can be approximated by a statistical independence test on a data set. Many tests of independence have appeared and studied extensively in the statistical literature over the last century; in this work we use the χ2 (chi-square) test of independence [16]. A Markov blanket of variable X is a set of variables such that, after fixing (by “knowing”) the value of all of its members, the set of remaining variables in the domain, taken together as a single setvalued variable, are statistically independent of X. More precisely, we have the following definition. Definition 1. A set of variables S ⊆ U is called a Markov blanket of variable X if and only if (X⊥ U − S − {X} | S). ⊥ Intuitively, a Markov blanket S of X captures all the information in the remaining domain variables U − S − {X} that can affect the probability distribution of X, making their value redundant as far as X is concerned (given S). The blanket therefore captures the essence of the feature selection problem for target variable X: By completely “shielding” X, a Markov blanket precludes the existence of any possible information about X that can come from variables not in the blanket, making it an ideal solution to the feature selection problem. A minimal Markov blanket is called a Markov boundary. Definition 2. A set of variables S ⊆ U − {X} is called a Markov boundary of variable X if it is a minimal Markov blanket of X i.e., none of its proper subsets is a Markov blanket. Pearl [14] proved that that the axioms of Symmetry, Decomposition, Weak Union, and Intersection are sufficient to guarantee a unique Markov boundary. These are shown below together with the axiom of Contraction. (Symmetry) (Decomposition) (Weak Union) (Contraction) (Intersection) (X⊥ Y | Z) =⇒ (Y⊥ X | Z) ⊥ ⊥ (X⊥ Y ∪ W | Z) =⇒ (X⊥ Y | Z) ∧ (X⊥ W | Z) ⊥ ⊥ ⊥ (X⊥ Y ∪ W | Z) =⇒ (X⊥ Y | Z ∪ W) ⊥ ⊥ (X⊥ Y | Z) ∧ (X⊥ W | Y ∪ Z) =⇒ (X⊥ Y ∪ W | Z) ⊥ ⊥ ⊥ (X⊥ Y | Z ∪ W) ∧ (X⊥ W | Z ∪ Y) =⇒ (X⊥ Y ∪ W | Z) ⊥ ⊥ ⊥ (1) The Symmetry, Decomposition, Contraction and Weak Union axioms are very general: they are necessary axioms for the probabilistic definition of independence i.e., they hold in every probability distribution, as their proofs are based on the axioms of probability theory. Intersection is not universal but it holds in distributions that are positive, i.e., any value combination of the domain variables has a non-zero probability of occurring. 4 The Markov Boundary Theorem According to Definition 2, a Markov boundary is a minimal Markov blanket. We first introduce a theorem that provides an alternative, equivalent definition of the concept of Markov boundary that we will relax later in the paper to produce a more general boundary definition. Theorem 1 (Markov Boundary Theorem). Assuming that the Decomposition and Contraction axioms hold, S ⊆ U − {X} is a Markov boundary of variable X ∈ U if and only if ∀ T ⊆ U − {X}, T ⊆ U − S ⇐⇒ (X⊥ T | S − T) . ⊥ (2) A detailed proof cannot be included here due to space constraints but a proof sketch appears in Appendix A. According to the above theorem, a Markov boundary S partitions the powerset of U − {X} into two parts: (a) set P1 that contains all subsets of U − S, and (b) set P2 containing the remaining subsets. All sets in P1 are conditionally independent of X, and all sets in P2 are conditionally dependent with X. Intuitively, the two directions of the logical equivalence relation of Eq. (2) correspond to the concept of Markov blanket and its minimality i.e., the equation ∀ T ⊆ U − {X}, T ⊆ U − S =⇒ (X⊥ T | S − T) ⊥ 3 Algorithm 1 The abstract GS(m) (X) algorithm. Returns an m-Markov boundary of X. 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: S←∅ /* Growing phase. */ for all Y ⊆ U − S − {X} such that 1 ≤ |Y| ≤ m do if (X ⊥ Y | S) then ⊥ S←S∪Y goto line 3 /* Restart loop. */ /* Shrinking phase. */ for all Y ∈ S do if (X⊥ Y | S − {Y }) then ⊥ S ← S − {Y } goto line 8 /* Restart loop. */ return S or, equivalently, (∀ T ⊆ U − S − {X}, (X⊥ T | S)) (as T and S are disjoint) corresponds to ⊥ the definition of Markov blanket, as it includes T = U − S − {X}. In the opposite direction, the contrapositive form is ∀ T ⊆ U − {X}, T ⊆ U − S =⇒ (X ⊥ T | S − T) . ⊥ This corresponds to the concept of minimality of the Markov boundary: It states that all sets that contain a part of S cannot be independent of X given the remainder of S. Informally, this is because if there existed some set T that contained a non-empty subset T′ of S such that (X⊥ T | S − T), ⊥ then one would be able to shrink S by T′ (by the property of Contraction) and therefore S would not be minimal (more details in Appendix A). 5 A Family of Algorithms for Arbitrary Domains Theorem 1 defines conditions that precisely characterize a Markov boundary and thus can be thought of as an alternative definition of a boundary. By relaxing these conditions we can produce a more general definition. In particular, an m-Markov boundary is defined as follows. Definition 3. A set of variables S ⊆ U − {X} of a domain U is called an m-Markov boundary of variable X ∈ U if and only if ∀ T ⊆ U − {X} such that |T| ≤ m, T ⊆ U − S ⇐⇒ (X⊥ T | S − T) . ⊥ We call the parameter m of an m-Markov boundary the Markov boundary margin. Intuitively, an m-boundary S guarantees that (a) all subsets of its complement (excluding X) of size m or smaller are independent of X given S, and (b) all sets T of size m or smaller that are not subsets of its complement are dependent of X given the part of S that is not contained in T. This definition is a special case of the properties of a boundary stated in Theorem 1, with each set T mentioned in the theorem now restricted to having size m or smaller. For m = n − 1, where n = |U |, the condition |T| ≤ m is always satisfied and can be omitted; in this case the definition of an (n − 1)-Markov boundary results in exactly Eq. (2) of Theorem 1. We now present an algorithm called GS(m) , shown in Algorithm 1, that provably correctly learns an m-boundary of a target variable X. GS(m) operates in two phases, a growing and a shrinking phase (hence the acronym). During the growing phase it examines sets of variables of size up to m, where m is a user-specified parameter. During the shrinking phase, single variables are examined for conditional independence and possible removal from S (examining sets in the shrinking phase is not necessary for provably correct operation—see Appendix B). The orders of examination of the sets for possible addition and deletion from the candidate boundary are left intentionally unspecified in Algorithm 1—one can therefore view it as an abstract representative of a family of algorithms, with each member specifying one such ordering. All members of this family are m-correct, as the proof of correctness does not depend on the ordering. In practice numerous choices for the ordering exist; one possibility is to examine the sets in the growing phase in order of increasing set size and, for each such size, in order of decreasing conditional mutual information I(X, Y, S) between X and Y given S. The rationale for this heuristic choice is that (usually) tests with smaller conditional sets tend to be more reliable, and sorting by mutual information tends to lessen the chance of adding false members of the Markov boundary. We used this implementation in all our experiments, presented later in Section 6. Intuitively, the margin represents a trade-off between sample and computational complexity and completeness: For m = n − 1 = |U| − 1, the algorithm returns a Markov boundary in unrestricted 4 Algorithm 2 The RGS(m,k) (X) algorithm, a randomized anytime version of the GS(m) algorithm, utilizing k random subsets for the growing phase. 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: S←∅ /* Growing phase. */ repeat Schanged ← false Y ← subset of U − S − {X} of size 1 ≤ |Y| ≤ m of maximum dependence out of k random subsets if (X ⊥ Y | S) then ⊥ S←S∪Y Schanged ← true until Schanged = false /* Shrinking phase. */ for all Y ∈ S do if (X⊥ Y | S − {Y }) then ⊥ S ← S − {Y } goto line 11 /* Restart loop. */ return S (arbitrary) domains. For 1 ≤ m < n − 1, GS(m) may recover the correct boundary depending on characteristics of the domain. For example, it will recover the correct boundary in domains containing embedded parity functions such that the number of variables involved in every k-bit parity function is m + 1 or less i.e., if k ≤ m + 1 (parity functions are corner cases in the space of probability distributions that are known to be hard to learn [17]). The proof of m-correctness of GS(m) is included in Appendix B. Note that it is based on Theorem 1 and the universal axioms of Eqs. (1) only i.e., Intersection is not needed, and thus it is widely applicable (to any domain). A Practical Randomized Anytime Version While GS(m) is provably correct even in difficult domains such as those that contain parity functions, it may be impractical with a large number of features as its asymptotic complexity is O(nm ). We therefore also we here provide a more practical randomized version called RGS(m,k) (Randomized GS(m) ), shown in Algorithm 2. The RGS(m,k) algorithm has an additional parameter k that limits its computational requirements: instead of exhaustively examining all possible subsets of (U −S−{X}) (as GS(m) does), it instead samples k subsets from the set of all possible subsets of (U − S − {X}), where k is user-specified. It is therefore a randomized algorithm that becomes equivalent to GS(m) given a large enough k. Many possibilities for the method of random selection of the subsets exist; in our experiments we select a subset Y = {Yi } (1 ≤ |Y| ≤ m) with probability proportional |Y| to i=1 (1/p(X, Yi | S)), where p(X, Yi | S) is the p-value of the corresponding (univariate) test between X and Yi given S, which has a low computational cost. The RGS(m,k) algorithm is useful in situations where the amount of time to produce an answer may be limited and/or the limit unknown beforehand: it is easy to show that the growing phase of GS(m) produces an an upper-bound of the m-boundary of X. Therefore, the RGS(m,k) algorithm, if interrupted, will return an approximation of this upper bound. Moreover, if there exists time for the shrinking phase to be executed (which conducts a number of tests linear in n and is thus fast), extraneous variables will be removed and a minimal blanket (boundary) approximation will be returned. These features make it an anytime algorithm, which is a more appropriate choice for situations where critical events may occur that require the interruption of computation, e.g., during the planning phase of a robot, which may be interrupted at any time due to an urgent external event that requires a decision to be made based on the present state’s feature values. 6 Experiments We evaluated the GS(m) and the RGS(m,k) algorithms on synthetic as well as real-world and benchmark data sets. We first systematically examined the performance on the task of recovering near-parity functions, which are known to be hard to learn [17]. We compared GS(m) and RGS(m,k) with respect to accuracy of recovery of the original boundary as well as computational cost. We generated domains of sizes ranging from 10 to 100 variables, of which 4 variables (X1 to X4 ) were related through a near-parity relation with bit probability 0.60 and various degrees of noise. The remaining independent variables (X5 to Xn ) act as “dis5 GS(1) GS(3) RGS(1, 1000) 0 0.05 RGS(3, 1000) Relieved, threshold = 0.001 Relieved, threshold = 0.03 0.1 0.15 0.2 0.25 Noise probability 0.3 0.35 0.4 Probabilistic isolation performance of GS(m) and RELIEVED Probabilistic isolation performance of GS(m) and RGS(m ,k) Real-world and benchmark data sets 1 Data set Balance scale Balloons Car evaluation Credit screening Monks Nursery Tic-tac-toe Breast cancer Chess Audiology 0.8 0.6 0.4 0.2 0 0 0.2 0.4 (m = 3) GS 0.6 0.8 1 Real-world and benchmark data sets RGS(m = 3, k = 300) average isolation measure F1 measure 50 variables, true Markov boundary size = 3 Bernoulli probability = 0.6, 1000 data points RELIEVED(threshold = 0.03) average isolation measure F1 measure of GS(m ), RGS(m, k ) and RELIEVED vs. noise level 1.3 1.2 1.1 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 1 Data set Balance scale Balloons Car evaluation Credit screening Monks Nursery Tic-tac-toe Breast cancer Chess Audiology 0.8 0.6 0.4 0.2 0 0 0.2 0.4 (m = 3) average isolation measure GS 0.6 0.8 1 average isolation measure Figure 2: Left: F1 measure of GS(m) , RGS(m,k) and RELIEVED under increasing amounts of noise. Middle: Probabilistic isolation performance comparison between GS(3) and RELIEVED on real-world and benchmark data sets. Right: Same for GS(3) and RGS(3,1000) . tractors” and had randomly assigned probabilities i.e., the correct boundary of X1 is B1 = {X2 , X3 , X4 }. In such domains, learning the boundary of X1 is difficult because of the large number of distractors and because each Xi ∈ B1 is independent of X1 given any proper subset of B1 − {Xi } (they only become dependent when including all of them in the conditioning set). To measure an algorithm’s feature selection performance, acF -measure of GS and RGS vs. domain size curacy (fraction of variables correctly included or excluded) is inappropriate as the accuracy of trivial algorithms such as returning the empty set will tend to 1 as n increases. Precision and recall are therefore more appropriate, with precision defined as the fraction of features returned that are in the correct boundary (3 features for X1 ), and recall as the fraction of the features present in the correct boundary that are returned by the algorithm. A convenient and frequently used Number of domain variables measure that combines precision and recall is the F1 meaRunning time of GS and RGS vs. domain size sure, defined as the harmonic mean of precision and recall [18]. In Fig. 1 (top) we report 95% confidence intervals for the F1 measure and execution time of GS(m) (margins m = 1 to 3) and RGS(m,k) (margins 1 to 3 and k = 1000 random subsets), using 20 data sets containing 10 to 100 variables, with the target variable X1 was perturbed (inverted) by noise with 10% probability. As can be seen, the RGS(m,k) and GS(m) using the same value for margin perform comparably Number of domain variables with respect to F1 , up to their 95% confidence intervals. With Figure 1: GS(m) and RGS(m,k) per(m,k) respect to execution time however RGS exhibits much formance with respect to domain greater scalability (Fig. 1 bottom, log scale); for example, it size (number of variables). Top: F1 executes in about 10 seconds on average in domains containmeasure, reflecting accuracy. Bot(m) ing 100 variables, while GS executes in 1,000 seconds on tom: Execution time in seconds (log average for this domain size. scale). We also compared GS(m) and RGS(m,k) to RELIEF [5], a well-known algorithm for feature selection that is known to be able to recover parity functions in certain cases [5]. RELIEF learns a weight for each variable and compares it to a threshold τ to decide on its inclusion in the set of relevant variables. As it has been reported [9] that RELIEF can exhibit large variance due to randomization that is necessary only for very large data sets, we instead used a deterministic variant called RELIEVED [9], whose behavior corresponds to RELIEF at the limit of infinite execution time. We calculated the F1 measure for GS(m) , RGS(m,k) and RELIEVED in the presence of varying amounts of noise, with noise probability ranging from 0 (no noise) to 0.4. We used domains containing 50 variables, as GS(m) becomes computationally demanding in larger domains. In Figure 2 (left) we show the performance of GS(m) and RGS(m,k) for m equal to 1 and 3, k = 1000 and RELIEVED for thresholds τ = 0.01 and 0.03 for various amounts of noise on the target variable. Again, each experiment was repeated 20 times to generate 95% confidence intervals. We can observe that even though m = 1 (equivalent to the GS algorithm) performs poorly, increasing the margin m makes it more likely to recover the correct Markov boundary, and GS(3) (m = 3) recovers the exact blanket even with few (1,000) data points. RELIEVED does comparably to GS(3) for little noise and for a large threshold, (m ) (m, k ) 1 True Markov boundary size = 3, 1000 data points Bernoulli probability = 0.6, noise probability = 0.1 1 0.9 GS(1) GS(2) GS(3) RGS(1, 1000) (2, 1000) RGS (3, 1000) RGS 0.8 F1-measure 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 10 20 30 40 50 60 (m ) 70 80 90 100 90 100 (m, k ) True Markov boundary size = 3, 1000 data points Bernoulli probability = 0.6, noise probability = 0.1 10000 GS(1) GS(2) GS(3) Execution time (sec) 1000 RGS(1, 1000) RGS(2, 1000) RGS(3, 1000) 100 10 1 0.1 0.01 10 6 20 30 40 50 60 70 80 but appears to deteriorate for more noisy domains. As we can see it is difficult to choose the “right” threshold for RELIEVED—better performing τ at low noise can become worse in noisy environments; in particular, small τ tend to include irrelevant variables while large τ tend to miss actual members. We also evaluated GS(m) , RGS(m,k) , and RELIEVED on benchmark and real-world data sets from the UCI Machine Learning repository. As the true Markov boundary for these is impossible to know, we used as performance measure a measure of probabilistic isolation by the Markov boundary returned of subsets outside the boundary. For each domain variable X, we measured the independence of subsets Y of size 1, 2 and 3 given the blanket S of X returned by GS(3) and RELIEVED for τ = 0.03 (as this value seemed to do better in the previous set of experiments), as measured by the average p-value of the χ2 test between X and Y given S (with p-values of 0 and 1 indicating ideal dependence and independence, respectively). Due to the large number of subsets outside the boundary when the boundary is small, we limited the estimation of isolation performance to 2,000 subsets per variable. We plot the results in Figure 2 (middle and right). Each point represents a variable in the corresponding data set. Points under the diagonal indicate better probabilistic isolation performance for that variable for GS(3) compared to RELIEVED (middle plot) or to RGS(3,1000) (right plot). To obtain a statistically significant comparison, we used the non-parametric Wilcoxon paired signed-rank test, which indicated that GS(3) RGS(3,1000) are statistically equivalent to each other, while both outperformed RELIEVED at the 99.99% significance level (α < 10−7 ). 7 Conclusion In this paper we presented algorithms for the problem of feature selection in unrestricted (arbitrary distribution) domains that may contain complex interactions that only appear when the values of multiple features are considered together. We introduced two algorithms: an exact, provably correct one as well a more practical randomized anytime version, and evaluated them on on artificial, benchmark and real-world data, demonstrating that they perform well, even in the presence of noise. We also introduced the Markov Boundary Theorem that precisely characterizes the properties of a boundary, and used it to prove m-correctness of the exact family of algorithms presented. We made minimal assumptions that consist of only a general set of axioms that hold for every probability distribution, giving our algorithms universal applicability. Appendix A: Proof sketch of the Markov Boundary Theorem Proof sketch. (=⇒ direction) We need to prove that if S is a Markov boundary of X then (a) for every set T ⊆ U − S − {X}, (X⊥ T | S − T), and (b) for every set T′ ⊆ U − S that does not ⊥ contain X, (X ⊥ T′ | S − T′ ). Case (a) is immediate from the definition of the boundary and the ⊥ Decomposition theorem. Case (b) can be proven by contradiction: Assuming the independence of T′ that contains a non-empty part T′ in S and a part T′ in U − S, we get (from Decomposition) 1 2 (X⊥ T′ | S − T′ ). We can then use Contraction to show that the set S − T′ satisfies the inde⊥ 1 1 1 pendence property of a Markov boundary, i.e., that (X⊥ U − (S − T′ ) − {X} | S − T′ ), which ⊥ 1 1 contradicts the assumption that S is a boundary (and thus minimal). (⇐= direction) We need to prove that if Eq. (2) holds, then S is a minimal Markov blanket. The proof that S is a blanket is immediate. We can prove minimality by contradiction: Assume S = S1 ∪ S2 with S1 a blanket and S2 = ∅ i.e., S1 is a blanket strictly smaller than S. Then (X⊥ S2 | ⊥ S1 ) = (X⊥ S2 | S − S2 ). However, since S2 ⊆ U − S, from Eq. (2) we get (X ⊥ S2 | S − S2 ), ⊥ ⊥ which is a contradiction. Appendix B: Proof of m-Correctness of GS(m) Let the value of the set S at the end of the growing phase be SG , its value at the end of the shrinking phase SS , and their difference S∆ = SG − SS . The following two observations are immediate. Observation 1. For every Y ⊆ U − SG − {X} such that 1 ≤ |Y| ≤ m, (X⊥ Y | SG ). ⊥ Observation 2. For every Y ∈ SS , (X ⊥ Y | SS − {Y }). ⊥ Lemma 2. Consider variables Y1 , Y2 , . . . , Yt for some t ≥ 1 and let Y = {Yj }t . Assuming that j=1 Contraction holds, if (X⊥ Yi | S − {Yj }i ) for all i = 1, . . . , t, then (X⊥ Y | S − Y). ⊥ ⊥ j=1 Proof. By induction on Yj , j = 1, 2, . . . , t, using Contraction to decrease the conditioning set S t down to S − {Yj }i j=1 for all i = 1, 2, . . . , t. Since Y = {Yj }j=1 , we immediately obtain the desired relation (X⊥ Y | S − Y). ⊥ 7 Lemma 2 can be used to show that the variables found individually independent of X during the shrinking phase are actually jointly independent of X, given the final set SS . Let S∆ = {Y1 , Y2 , . . . , Yt } be the set of variables removed (in that order) from SG to form the final set SS i.e., S∆ = SG − SS . Using the above lemma, the following is immediate. Corollary 3. Assuming that the Contraction axiom holds, (X⊥ S∆ | SS ). ⊥ Lemma 4. If the Contraction, Decomposition and Weak Union axioms hold, then for every set T ⊆ U − SG − {X} such that (X⊥ T | SG ), ⊥ (X⊥ T ∪ (SG − SS ) | SS ). ⊥ (3) Furthermore SS is minimal i.e., there does not exist a subset of SS for which Eq. (3) is true. Proof. From Corollary 3, (X⊥ S∆ | SS ). Also, by the hypothesis, (X⊥ T | SG ) = (X⊥ T | ⊥ ⊥ ⊥ SS ∪ S∆ ), where S∆ = SG − SS as usual. From these two relations and Contraction we obtain (X⊥ T ∪ S∆ | SS ). ⊥ To prove minimality, let us assume that SS = ∅ (if SS = ∅ then it is already minimal). We prove by contradiction: Assume that there exists a set S′ ⊂ SS such that (X⊥ T ∪ (SG − S′ ) | S′ ). Let ⊥ W = SS − S′ = ∅. Note that W and S′ are disjoint. We have that SS ⊆ SS ∪ S∆ =⇒ SS − S′ ⊆ SS ∪ S∆ − S′ ⊆ T ∪ (SS ∪ S∆ − S′ ) =⇒ W ⊆ T ∪ (SS ∪ S∆ − S′ ) = T ∪ (SG − S′ ) • Since (X⊥ T ∪ (SG − S′ ) | S′ ) and W ⊆ T ∪ (SS ∪ S∆ − S′ ), from Decomposition we ⊥ get (X⊥ W | S′ ). ⊥ • From (X⊥ W | S′ ) and Weak Union we have that for every Y ∈ W, (X⊥ Y | S′ ∪ ⊥ ⊥ (W − {Y })). • Since S′ and W are disjoint and since Y ∈ W, Y ∈ S′ . Applying the set equality (A − B) ∪ C = (A ∪ B) − (A − C) to S′ ∪ (W − {Y }) we obtain S′ ∪ W − ({Y } − S′ ) = SS − {Y }. • Therefore, ∀ Y ∈ W, (X⊥ Y | SS − {Y }). ⊥ However, at the end of the shrinking phase, all variables Y in SS (and therefore in W, as W ⊆ SS ) have been evaluated for independence and found dependent (Observation 2). Thus, since W = ∅, there exists at least one Y such that (X ⊥ Y | SS − {Y }), producing a contradiction. ⊥ Theorem 5. Assuming that the Contraction, Decomposition, and Weak Union axioms hold, Algorithm 1 is m-correct with respect to X. Proof. We use the Markov Boundary Theorem. We first prove that ∀ T ⊆ U − {X} such that |T| ≤ m, T ⊆ U − SS =⇒ (X⊥ T | SS − T) ⊥ or, equivalently, ∀ T ⊆ U − SS − {X} such that |T| ≤ m, (X⊥ T | SS ). ⊥ Since U − SS − {X} = S∆ ∪ (U − SG − {X}), S∆ and U − SG − {X} are disjoint, there are three kinds of sets of size m or less to consider: (i) all sets T ⊆ S∆ , (ii) all sets T ⊆ U − SG − {X}, and (iii) all sets (if any) T = T′ ∪ T′′ , T′ ∩ T′′ = ∅, that have a non-empty part T′ ⊆ S∆ and a non-empty part T′′ ⊆ U − SG − {X}. (i) From Corollary 3, (X⊥ S∆ | SS ). Therefore, from Decomposition, for any set T ⊆ S∆ , ⊥ (X⊥ T | SS ). ⊥ (ii) By Observation 1, for every set T ⊆ U − SG − {X} such that |T| ≤ m, (X⊥ T | SG ). ⊥ By Lemma 4 we get (X⊥ T ∪ S∆ | SS ), from which we obtain (X⊥ T | SS ) by ⊥ ⊥ Decomposition. (iii) Since |T| ≤ m, we have that |T′′ | ≤ m. Since T′′ ⊆ U − SG − {X}, by Observation 1, (X⊥ T′′ | SG ). Therefore, by Lemma 4, (X⊥ T′′ ∪ S∆ | SS ). Since T′ ⊆ S∆ ⇒ ⊥ ⊥ T′′ ∪ T′ ⊆ T′′ ∪ S∆ , by Decomposition to obtain (X⊥ T′′ ∪ T′ | SS ) = (X⊥ T | SS ). ⊥ ⊥ To complete the proof we need to prove that ∀ T ⊆ U − {X} such that |T| ≤ m, T ⊆ U − SS =⇒ (X ⊥ T | SS − T) . ⊥ Let T = T1 ∪ T2 , with T1 ⊆ SS and T2 ⊆ U − SS . Since T ⊆ U − SS , T1 contains at least one variable Y ∈ SS . From Observation 2, (X ⊥ Y | SS − {Y }). From this and (the contrapositive of) ⊥ Weak Union, we get (X ⊥ {Y } ∪ (T1 − {Y }) | SS − {Y } − (T1 − {Y })) = (X ⊥ T1 | SS − T1 ). ⊥ ⊥ From (the contrapositive of) Decomposition we get (X ⊥ T1 ∪ T2 | SS − T1 ) = (X ⊥ T | ⊥ ⊥ SS − T1 ), which is equal to (X ⊥ T | SS − T1 − T2 ) = (X ⊥ T | SS − T) as SS and T2 are ⊥ ⊥ disjoint. 8 References [1] Isabelle Guyon and Andr´ Elisseeff. An introduction to variable and feature selection. Journal e of Machine Learning Research, 3:1157–1182, 2003. [2] Daphne Koller and Mehran Sahami. Toward optimal feature selection. In Proceedings of the Tenth International Conference on Machine Learning (ICML), pages 284–292, 1996. [3] P. M. Narendra and K. Fukunaga. A branch and bound algorithm for feature subset selection. IEEE Transactions on Computers, C-26(9):917–922, 1977. [4] H. Almuallim and T. G. Dietterich. Learning with many irrelevant features. In Proceedings of the National Conference on the Americal Association for Artifical Intelligence (AAAI), 1991. [5] K. Kira and L. A. Rendell. The feature selection problem: Traditional methods and a new algorithm. In Proceedings of the National Conference on the Americal Association for Artifical Intelligence (AAAI), pages 129–134, 1992. [6] M. Dash and H. Liu. Feature selection for classification. Intelligent Data Analysis, 1(3): 131–156, 1997. [7] Huan Liu and Hiroshi Motoda, editors. Feature Extraction, Construction and Selection: A Data Mining Perspective, volume 453 of The Springer International Series in Engineering and Computer Science. 1998. [8] Avrim Blum and Pat Langley. Selection of relevant features and examples in machine learning. Artificial Intelligence, 97(1-2):245–271, 1997. [9] R. Kohavi and G. H. John. Wrappers for feature subset selection. Artificial Intelligence, 97 (1-2):273–324, 1997. [10] Dimitris Margaritis and Sebastian Thrun. Bayesian network induction via local neighborhoods. In Advances in Neural Information Processing Systems 12 (NIPS), 2000. [11] I. Tsamardinos, C. Aliferis, and A. Statnikov. Algorithms for large scale Markov blanket discovery. In Proceedings of the 16th International FLAIRS Conference, 2003. [12] I. Tsamardinos, C. Aliferis, and A. Statnikov. Time and sample efficient discovery of Markov blankets and direct causal relations. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 673–678, 2003. [13] N. Meinshausen and P. B¨ hlmann. High-dimensional graphs and variable selection with the u Lasso. Annals of Statistics, 34:1436–1462, 2006. [14] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. 1988. [15] Michael Kearns and Umesh V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994. [16] A. Agresti. Categorical Data Analysis. John Wiley and Sons, 1990. [17] M. Kearns. Efficient noise-tolerant learning from statistical queries. J. ACM, 45(6):983–1006, 1998. [18] C. J. van Rijsbergen. Information Retrieval. Butterworth-Heinemann, London, 1979. 9
3 0.11813272 141 nips-2009-Local Rules for Global MAP: When Do They Work ?
Author: Kyomin Jung, Pushmeet Kohli, Devavrat Shah
Abstract: We consider the question of computing Maximum A Posteriori (MAP) assignment in an arbitrary pair-wise Markov Random Field (MRF). We present a randomized iterative algorithm based on simple local updates. The algorithm, starting with an arbitrary initial assignment, updates it in each iteration by first, picking a random node, then selecting an (appropriately chosen) random local neighborhood and optimizing over this local neighborhood. Somewhat surprisingly, we show that this algorithm finds a near optimal assignment within n log 2 n iterations with high probability for any n node pair-wise MRF with geometry (i.e. MRF graph with polynomial growth) with the approximation error depending on (in a reasonable manner) the geometric growth rate of the graph and the average radius of the local neighborhood – this allows for a graceful tradeoff between the complexity of the algorithm and the approximation error. Through extensive simulations, we show that our algorithm finds extremely good approximate solutions for various kinds of MRFs with geometry.
4 0.11262318 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions
Author: Kenji Fukumizu, Arthur Gretton, Gert R. Lanckriet, Bernhard Schölkopf, Bharath K. Sriperumbudur
Abstract: Embeddings of probability measures into reproducing kernel Hilbert spaces have been proposed as a straightforward and practical means of representing and comparing probabilities. In particular, the distance between embeddings (the maximum mean discrepancy, or MMD) has several key advantages over many classical metrics on distributions, namely easy computability, fast convergence and low bias of finite sample estimates. An important requirement of the embedding RKHS is that it be characteristic: in this case, the MMD between two distributions is zero if and only if the distributions coincide. Three new results on the MMD are introduced in the present study. First, it is established that MMD corresponds to the optimal risk of a kernel classifier, thus forming a natural link between the distance between distributions and their ease of classification. An important consequence is that a kernel must be characteristic to guarantee classifiability between distributions in the RKHS. Second, the class of characteristic kernels is broadened to incorporate all strictly positive definite kernels: these include non-translation invariant kernels and kernels on non-compact domains. Third, a generalization of the MMD is proposed for families of kernels, as the supremum over MMDs on a class of kernels (for instance the Gaussian kernels with different bandwidths). This extension is necessary to obtain a single distance measure if a large selection or class of characteristic kernels is potentially appropriate. This generalization is reasonable, given that it corresponds to the problem of learning the kernel by minimizing the risk of the corresponding kernel classifier. The generalized MMD is shown to have consistent finite sample estimates, and its performance is demonstrated on a homogeneity testing example. 1
5 0.10810739 120 nips-2009-Kernels and learning curves for Gaussian process regression on random graphs
Author: Peter Sollich, Matthew Urry, Camille Coti
Abstract: We investigate how well Gaussian process regression can learn functions defined on graphs, using large regular random graphs as a paradigmatic example. Random-walk based kernels are shown to have some non-trivial properties: within the standard approximation of a locally tree-like graph structure, the kernel does not become constant, i.e. neighbouring function values do not become fully correlated, when the lengthscale σ of the kernel is made large. Instead the kernel attains a non-trivial limiting form, which we calculate. The fully correlated limit is reached only once loops become relevant, and we estimate where the crossover to this regime occurs. Our main subject are learning curves of Bayes error versus training set size. We show that these are qualitatively well predicted by a simple approximation using only the spectrum of a large tree as input, and generically scale with n/V , the number of training examples per vertex. We also explore how this behaviour changes for kernel lengthscales that are large enough for loops to become important. 1 Motivation and Outline Gaussian processes (GPs) have become a standard part of the machine learning toolbox [1]. Learning curves are a convenient way of characterizing their capabilities: they give the generalization error as a function of the number of training examples n, averaged over all datasets of size n under appropriate assumptions about the process generating the data. We focus here on the case of GP regression, where a real-valued output function f (x) is to be learned. The general behaviour of GP learning curves is then relatively well understood for the scenario where the inputs x come from a continuous space, typically Rn [2, 3, 4, 5, 6, 7, 8, 9, 10]. For large n, the learning curves then typically decay as a power law ∝ n−α with an exponent α ≤ 1 that depends on the dimensionality n of the space as well as the smoothness properties of the function f (x) as encoded in the covariance function. But there are many interesting application domains that involve discrete input spaces, where x could be a string, an amino acid sequence (with f (x) some measure of secondary structure or biological function), a research paper (with f (x) related to impact), a web page (with f (x) giving a score used to rank pages), etc. In many such situations, similarity between different inputs – which will govern our prior beliefs about how closely related the corresponding function values are – can be represented by edges in a graph. One would then like to know how well GP regression can work in such problem domains; see also [11] for a related online regression algorithm. We study this 1 problem here theoretically by focussing on the paradigmatic example of random regular graphs, where every node has the same connectivity. Sec. 2 discusses the properties of random-walk inspired kernels [12] on such random graphs. These are analogous to the standard radial basis function kernels exp[−(x − x )2 /(2σ 2 )], but we find that they have surprising properties on large graphs. In particular, while loops in large random graphs are long and can be neglected for many purposes, by approximating the graph structure as locally tree-like, here this leads to a non-trivial limiting form of the kernel for σ → ∞ that is not constant. The fully correlated limit, where the kernel is constant, is obtained only because of the presence of loops, and we estimate when the crossover to this regime takes place. In Sec. 3 we move on to the learning curves themselves. A simple approximation based on the graph eigenvalues, using only the known spectrum of a large tree as input, works well qualitatively and predicts the exact asymptotics for large numbers of training examples. When the kernel lengthscale is not too large, below the crossover discussed in Sec. 2 for the covariance kernel, the learning curves depend on the number of examples per vertex. We also explore how this behaviour changes as the kernel lengthscale is made larger. Sec. 4 summarizes the results and discusses some open questions. 2 Kernels on graphs and trees We assume that we are trying to learn a function defined on the vertices of a graph. Vertices are labelled by i = 1 . . . V , instead of the generic input label x we used in the introduction, and the associated function values are denoted fi ∈ R. By taking the prior P (f ) over these functions f = (f1 , . . . , fV ) as a (zero mean) Gaussian process we are saying that P (f ) ∝ exp(− 1 f T C −1 f ). 2 The covariance function or kernel C is then, in our graph setting, just a positive definite V × V matrix. The graph structure is characterized by a V × V adjacency matrix, with Aij = 1 if nodes i and j are connected by an edge, and 0 otherwise. All links are assumed to be undirected, so that Aij = Aji , V and there are no self-loops (Aii = 0). The degree of each node is then defined as di = j=1 Aij . The covariance kernels we discuss in this paper are the natural generalizations of the squaredexponential kernel in Euclidean space [12]. They can be expressed in terms of the normalized graph Laplacian, defined as L = 1 − D −1/2 AD −1/2 , where D is a diagonal matrix with entries d1 , . . . , dV and 1 is the V × V identity matrix. An advantage of L over the unnormalized Laplacian D − A, which was used in the earlier paper [13], is that the eigenvalues of L (again a V × V matrix) lie in the interval [0,2] (see e.g. [14]). From the graph Laplacian, the covariance kernels we consider here are constructed as follows. The p-step random walk kernel is (for a ≥ 2) C ∝ (1 − a−1 L)p = 1 − a−1 1 + a−1 D −1/2 AD −1/2 p (1) while the diffusion kernel is given by 1 C ∝ exp − 2 σ 2 L ∝ exp 1 2 −1/2 AD −1/2 2σ D (2) We will always normalize these so that (1/V ) i Cii = 1, which corresponds to setting the average (over vertices) prior variance of the function to be learned to unity. To see the connection of the above kernels to random walks, assume we have a walker on the graph who at each time step selects randomly one of the neighbouring vertices and moves to it. The probability for a move from vertex j to i is then Aij /dj . The transition matrix after s steps follows as (AD −1 )s : its ij-element gives the probability of being on vertex i, having started at j. We can now compare this with the p-step kernel by expanding the p-th power in (1): p p ( p )a−s (1−a−1 )p−s (D −1/2 AD −1/2 )s = D −1/2 s C∝ s=0 ( p )a−s (1−a−1 )p−s (AD −1 )s D 1/2 s s=0 (3) Thus C is essentially a random walk transition matrix, averaged over the number of steps s with s ∼ Binomial(p, 1/a) 2 (4) a=2, d=3 K1 1 1 Cl,p 0.9 p=1 p=2 p=3 p=4 p=5 p=10 p=20 p=50 p=100 p=200 p=500 p=infty 0.8 0.6 0.4 d=3 0.8 0.7 0.6 a=2, V=infty a=2, V=500 a=4, V=infty a=4, V=500 0.5 0.4 0.3 0.2 0.2 ln V / ln(d-1) 0.1 0 0 5 10 l 0 15 1 10 p/a 100 1000 Figure 1: (Left) Random walk kernel C ,p plotted vs distance along graph, for increasing number of steps p and a = 2, d = 3. Note the convergence to a limiting shape for large p that is not the naive fully correlated limit C ,p→∞ = 1. (Right) Numerical results for average covariance K1 between neighbouring nodes, averaged over neighbours and over randomly generated regular graphs. This shows that 1/a can be interpreted as the probability of actually taking a step at each of p “attempts”. To obtain the actual C the resulting averaged transition matrix is premultiplied by D −1/2 and postmultiplied by D 1/2 , which ensures that the kernel C is symmetric. For the diffusion kernel, one finds an analogous result but the number of random walk steps is now distributed as s ∼ Poisson(σ 2 /2). This implies in particular that the diffusion kernel is the limit of the p-step kernel for p, a → ∞ at constant p/a = σ 2 /2. Accordingly, we discuss mainly the p-step kernel in this paper because results for the diffusion kernel can be retrieved as limiting cases. In the limit of a large number of steps s, the random walk on a graph will reach its stationary distribution p∞ ∝ De where e = (1, . . . , 1). (This form of p∞ can be verified by checking that it remains unchanged after multiplication with the transition matrix AD −1 .) The s-step transition matrix for large s is then p∞ eT = DeeT because we converge from any starting vertex to the stationary distribution. It follows that for large p or σ 2 the covariance kernel becomes C ∝ D 1/2 eeT D 1/2 , i.e. Cij ∝ (di dj )1/2 . This is consistent with the interpretation of σ or (p/a)1/2 as a lengthscale over which the random walk can diffuse along the graph: once this lengthscale becomes large, the covariance kernel Cij is essentially independent of the distance (along the graph) between the vertices i and j, and the function f becomes fully correlated across the graph. (Explicitly f = vD 1/2 e under the prior, with v a single Gaussian random variable.) As we next show, however, the approach to this fully correlated limit as p or σ are increased is non-trivial. We focus in this paper on kernels on random regular graphs. This means we consider adjacency matrices A which are regular in the sense that they give for each vertex the same degree, di = d. A uniform probability distribution is then taken across all A that obey this constraint [15]. What will the above kernels look like on typical samples drawn from this distribution? Such random regular graphs will have long loops, of length of order ln(V ) or larger if V is large. Their local structure is then that of a regular tree of degree d, which suggests that it should be possible to calculate the kernel accurately within a tree approximation. In a regular tree all nodes are equivalent, so the kernel can only depend on the distance between two nodes i and j. Denoting this kernel value C ,p for a p-step random walk kernel, one has then C ,p=0 = δ ,0 and γp+1 C0,p+1 γp+1 C ,p+1 = = 1− 1 ad C 1 a C0,p + −1,p 1 a + 1− C1,p 1 a C (5) ,p + d−1 ad C +1,p for ≥1 (6) where γp is chosen to achieve the desired normalization C0,p = 1 of the prior variance for every p. Fig. 1(left) shows results obtained by iterating this recursion numerically, for a regular graph (in the tree approximation) with degree d = 3, and a = 2. As expected the kernel becomes more longranged initially as p increases, but eventually it is seen to approach a non-trivial limiting form. This can be calculated as C ,p→∞ = [1 + (d − 1)/d](d − 1)− /2 (7) 3 and is also plotted in the figure, showing good agreement with the numerical iteration. There are (at least) two ways of obtaining the result (7). One is to take the limit σ → ∞ of the integral representation of the diffusion kernel on regular trees given in [16] (which is also quoted in [13] but with a typographical error that effectively removes the factor (d − 1)− /2 ). Another route is to find the steady state of the recursion for C ,p . This is easy to do but requires as input the unknown steady state value of γp . To determine this, one can map from C ,p to the total random walk probability S ,p in each “shell” of vertices at distance from the starting vertex, changing variables to S0,p = C0,p and S ,p = d(d − 1) −1 C ,p ( ≥ 1). Omitting the factors γp , this results in a recursion for S ,p that simply describes a biased random walk on = 0, 1, 2, . . ., with a probability of 1 − 1/a of remaining at the current , probability 1/(ad) of moving to the left and probability (d − 1)/(ad) of moving to the right. The point = 0 is a reflecting barrier where only moves to the right are allowed, with probability 1/a. The time evolution of this random walk starting from = 0 can now be analysed as in [17]. As expected from the balance of moves to the left and right, S ,p for large p is peaked around the average position of the walk, = p(d − 2)/(ad). For smaller than this S ,p has a tail behaving as ∝ (d − 1) /2 , and converting back to C ,p gives the large- scaling of C ,p→∞ ∝ (d − 1)− /2 ; this in turn fixes the value of γp→∞ and so eventually gives (7). The above analysis shows that for large p the random walk kernel, calculated in the absence of loops, does not approach the expected fully correlated limit; given that all vertices have the same degree, the latter would correspond to C ,p→∞ = 1. This implies, conversely, that the fully correlated limit is reached only because of the presence of loops in the graph. It is then interesting to ask at what point, as p is increased, the tree approximation for the kernel breaks down. To estimate this, we note that a regular tree of depth has V = 1 + d(d − 1) −1 nodes. So a regular graph can be tree-like at most out to ≈ ln(V )/ ln(d − 1). Comparing with the typical number of steps our random walk takes, which is p/a from (4), we then expect loop effects to appear in the covariance kernel when p/a ≈ ln(V )/ ln(d − 1) (8) To check this prediction, we measure the analogue of C1,p on randomly generated [15] regular graphs. Because of the presence of loops, the local kernel values are not all identical, so the appropriate estimate of what would be C1,p on a tree is K1 = Cij / Cii Cjj for neighbouring nodes i and j. Averaging over all pairs of such neighbours, and then over a number of randomly generated graphs we find the results in Fig. 1(right). The results for K1 (symbols) accurately track the tree predictions (lines) for small p/a, and start to deviate just around the values of p/a expected from (8), as marked by the arrow. The deviations manifest themselves in larger values of K1 , which eventually – now that p/a is large enough for the kernel to “notice” the loops - approach the fully correlated limit K1 = 1. 3 Learning curves We now turn to the analysis of learning curves for GP regression on random regular graphs. We assume that the target function f ∗ is drawn from a GP prior with a p-step random walk covariance kernel C. Training examples are input-output pairs (iµ , fi∗ + ξµ ) where ξµ is i.i.d. Gaussian noise µ of variance σ 2 ; the distribution of training inputs iµ is taken to be uniform across vertices. Inference from a data set D of n such examples µ = 1, . . . , n takes place using the prior defined by C and a Gaussian likelihood with noise variance σ 2 . We thus assume an inference model that is matched to the data generating process. This is obviously an over-simplification but is appropriate for the present first exploration of learning curves on random graphs. We emphasize that as n is increased we see more and more function values from the same graph, which is fixed by the problem domain; the graph does not grow. ˆ The generalization error is the squared difference between the estimated function fi and the target fi∗ , averaged across the (uniform) input distribution, the posterior distribution of f ∗ given D, the distribution of datasets D, and finally – in our non-Euclidean setting – the random graph ensemble. Given the assumption of a matched inference model, this is just the average Bayes error, or the average posterior variance, which can be expressed explicitly as [1] (n) = V −1 Cii − k(i)T Kk−1 (i) i 4 D,graphs (9) where the average is over data sets and over graphs, K is an n × n matrix with elements Kµµ = Ciµ ,iµ + σ 2 δµµ and k(i) is a vector with entries kµ (i) = Ci,iµ . The resulting learning curve depends, in addition to n, on the graph structure as determined by V and d, and the kernel and noise level as specified by p, a and σ 2 . We fix d = 3 throughout to avoid having too many parameters to vary, although similar results are obtained for larger d. Exact prediction of learning curves by analytical calculation is very difficult due to the complicated way in which the random selection of training inputs enters the matrix K and vector k in (9). However, by first expressing these quantities in terms of kernel eigenvalues (see below) and then approximating the average over datasets, one can derive the approximation [3, 6] =g n + σ2 V , g(h) = (λ−1 + h)−1 α (10) α=1 This equation for has to be solved self-consistently because also appears on the r.h.s. In the Euclidean case the resulting predictions approximate the true learning curves quite reliably. The derivation of (10) for inputs on a fixed graph is unchanged from [3], provided the kernel eigenvalues λα appearing in the function g(h) are defined appropriately, by the eigenfunction condition Cij φj = λφi ; the average here is over the input distribution, i.e. . . . = V −1 j . . . From the definition (1) of the p-step kernel, we see that then λα = κV −1 (1 − λL /a)p in terms of the corα responding eigenvalue of the graph Laplacian L. The constant κ has to be chosen to enforce our normalization convention α λα = Cjj = 1. Fortunately, for large V the spectrum of the Laplacian of a random regular graph can be approximated by that of the corresponding large regular tree, which has spectral density [14] L ρ(λ ) = 4(d−1) − (λL − 1)2 d2 2πdλL (2 − λL ) (11) in the range λL ∈ [λL , λL ], λL = 1 + 2d−1 (d − 1)1/2 , where the term under the square root is ± + − positive. (There are also two isolated eigenvalues λL = 0, 2 but these have weight 1/V each and so can be ignored for large V .) Rewriting (10) as = V −1 α [(V λα )−1 + (n/V )( + σ 2 )−1 ]−1 and then replacing the average over kernel eigenvalues by an integral over the spectral density leads to the following prediction for the learning curve: = dλL ρ(λL )[κ−1 (1 − λL /a)−p + ν/( + σ 2 )]−1 (12) with κ determined from κ dλL ρ(λL )(1 − λL /a)p = 1. A general consequence of the form of this result is that the learning curve depends on n and V only through the ratio ν = n/V , i.e. the number of training examples per vertex. The approximation (12) also predicts that the learning curve will have two regimes, one for small ν where σ 2 and the generalization error will be essentially 2 independent of σ ; and another for large ν where σ 2 so that can be neglected on the r.h.s. and one has a fully explicit expression for . We compare the above prediction in Fig. 2(left) to the results of numerical simulations of the learning curves, averaged over datasets and random regular graphs. The two regimes predicted by the approximation are clearly visible; the approximation works well inside each regime but less well in the crossover between the two. One striking observation is that the approximation seems to predict the asymptotic large-n behaviour exactly; this is distinct to the Euclidean case, where generally only the power-law of the n-dependence but not its prefactor come out accurately. To see why, we exploit that for large n (where σ 2 ) the approximation (9) effectively neglects fluctuations in the training input “density” of a randomly drawn set of training inputs [3, 6]. This is justified in the graph case for large ν = n/V , because the number of training inputs each vertex receives, Binomial(n, 1/V ), has negligible relative fluctuations away from its mean ν. In the Euclidean case there is no similar result, because all training inputs are different with probability one even for large n. Fig. 2(right) illustrates that for larger a the difference in the crossover region between the true (numerically simulated) learning curves and our approximation becomes larger. This is because the average number of steps p/a of the random walk kernel then decreases: we get closer to the limit of uncorrelated function values (a → ∞, Cij = δij ). In that limit and for low σ 2 and large V the 5 V=500 (filled) & 1000 (empty), d=3, a=2, p=10 V=500, d=3, a=4, p=10 0 0 10 10 ε ε -1 -1 10 10 -2 10 -2 10 2 σ = 0.1 2 σ = 0.1 2 -3 10 σ = 0.01 2 σ = 0.01 -3 10 2 σ = 0.001 2 σ = 0.001 2 -4 10 2 σ = 0.0001 σ = 0.0001 -4 10 2 σ =0 -5 2 σ =0 -5 10 0.1 1 ν=n/V 10 10 0.1 1 ν=n/V 10 Figure 2: (Left) Learning curves for GP regression on random regular graphs with degree d = 3 and V = 500 (small filled circles) and V = 1000 (empty circles) vertices. Plotting generalization error versus ν = n/V superimposes the results for both values of V , as expected from the approximation (12). The lines are the quantitative predictions of this approximation. Noise level as shown, kernel parameters a = 2, p = 10. (Right) As on the left but with V = 500 only and for larger a = 4. 2 V=500, d=3, a=2, p=20 0 0 V=500, d=3, a=2, p=200, σ =0.1 10 10 ε ε simulation -1 2 10 1/(1+n/σ ) theory (tree) theory (eigenv.) -1 10 -2 10 2 σ = 0.1 -3 10 -4 10 -2 10 2 σ = 0.01 2 σ = 0.001 2 σ = 0.0001 -3 10 2 σ =0 -5 10 -4 0.1 1 ν=n/V 10 10 1 10 100 n 1000 10000 Figure 3: (Left) Learning curves for GP regression on random regular graphs with degree d = 3 and V = 500, and kernel parameters a = 2, p = 20; noise level σ 2 as shown. Circles: numerical simulations; lines: approximation (12). (Right) As on the left but for much larger p = 200 and for a single random graph, with σ 2 = 0.1. Dotted line: naive estimate = 1/(1 + n/σ 2 ). Dashed line: approximation (10) using the tree spectrum and the large p-limit, see (17). Solid line: (10) with numerically determined graph eigenvalues λL as input. α true learning curve is = exp(−ν), reflecting the probability of a training input set not containing a particular vertex, while the approximation can be shown to predict = max{1 − ν, 0}, i.e. a decay of the error to zero at ν = 1. Plotting these two curves (not displayed here) indeed shows the same “shape” of disagreement as in Fig. 2(right), with the approximation underestimating the true generalization error. Increasing p has the effect of making the kernel longer ranged, giving an effect opposite to that of increasing a. In line with this, larger values of p improve the accuracy of the approximation (12): see Fig. 3(left). One may ask about the shape of the learning curves for large number of training examples (per vertex) ν. The roughly straight lines on the right of the log-log plots discussed so far suggest that ∝ 1/ν in this regime. This is correct in the mathematical limit ν → ∞ because the graph kernel has a nonzero minimal eigenvalue λ− = κV −1 (1−λL /a)p : for ν σ 2 /(V λ− ), the square bracket + 6 in (12) can then be approximated by ν/( +σ 2 ) and one gets (because also regime) ≈ σ 2 /ν. σ 2 in the asymptotic However, once p becomes reasonably large, V λ− can be shown – by analysing the scaling of κ, see Appendix – to be extremely (exponentially in p) small; for the parameter values in Fig. 3(left) it is around 4 × 10−30 . The “terminal” asymptotic regime ≈ σ 2 /ν is then essentially unreachable. A more detailed analysis of (12) for large p and large (but not exponentially large) ν, as sketched in the Appendix, yields ∝ (cσ 2 /ν) ln3/2 (ν/(cσ 2 )), c ∝ p−3/2 (13) This shows that there are logarithmic corrections to the naive σ 2 /ν scaling that would apply in the true terminal regime. More intriguing is the scaling of the coefficient c with p, which implies that to reach a specified (low) generalization error one needs a number of training examples per vertex of order ν ∝ cσ 2 ∝ p−3/2 σ 2 . Even though the covariance kernel C ,p – in the same tree approximation that also went into (12) – approaches a limiting form for large p as discussed in Sec. 2, generalization performance thus continues to improve with increasing p. The explanation for this must presumably be that C ,p converges to the limit (7) only at fixed , while in the tail ∝ p, it continues to change. For finite graph sizes V we know of course that loops will eventually become important as p increases, around the crossover point estimated in (8). The approximation for the learning curve in (12) should then break down. The most naive estimate beyond this point would be to say that the kernel becomes nearly fully correlated, Cij ∝ (di dj )1/2 which in the regular case simplifies to Cij = 1. With only one function value to learn, and correspondingly only one nonzero kernel eigenvalue λα=1 = 1, one would predict = 1/(1 + n/σ 2 ). Fig. 3(right) shows, however, that this significantly underestimates the actual generalization error, even though for this graph λα=1 = 0.994 is very close to unity so that the other eigenvalues sum to no more than 0.006. An almost perfect prediction is obtained, on the other hand, from the approximation (10) with the numerically calculated values of the Laplacian – and hence kernel – eigenvalues. The presence of the small kernel eigenvalues is again seen to cause logarithmic corrections to the naive ∝ 1/n scaling. Using the tree spectrum as an approximation and exploiting the large-p limit, one finds indeed (see Appendix, Eq. (17)) that ∝ (c σ 2 /n) ln3/2 (n/c σ 2 ) where now n enters rather than ν = n/V , c being a constant dependent only on p and a: informally, the function to be learned only has a finite (rather than ∝ V ) number of degrees of freedom. The approximation (17) in fact provides a qualitatively accurate description of the data Fig. 3(right), as the dashed line in the figure shows. We thus have the somewhat unusual situation that the tree spectrum is enough to give a good description of the learning curves even when loops are important, while (see Sec. 2) this is not so as far as the evaluation of the covariance kernel itself is concerned. 4 Summary and Outlook We have studied theoretically the generalization performance of GP regression on graphs, focussing on the paradigmatic case of random regular graphs where every vertex has the same degree d. Our initial concern was with the behaviour of p-step random walk kernels on such graphs. If these are calculated within the usual approximation of a locally tree-like structure, then they converge to a non-trivial limiting form (7) when p – or the corresponding lengthscale σ in the closely related diffusion kernel – becomes large. The limit of full correlation between all function values on the graph is only reached because of the presence of loops, and we have estimated in (8) the values of p around which the crossover to this loop-dominated regime occurs; numerical data for correlations of function values on neighbouring vertices support this result. In the second part of the paper we concentrated on the learning curves themselves. We assumed that inference is performed with the correct parameters describing the data generating process; the generalization error is then just the Bayes error. The approximation (12) gives a good qualitative description of the learning curve using only the known spectrum of a large regular tree as input. It predicts in particular that the key parameter that determines the generalization error is ν = n/V , the number of training examples per vertex. We demonstrated also that the approximation is in fact more useful than in the Euclidean case because it gives exact asymptotics for the limit ν 1. Quantitatively, we found that the learning curves decay as ∝ σ 2 /ν with non-trivial logarithmic correction terms. Slower power laws ∝ ν −α with α < 1, as in the Euclidean case, do not appear. 7 We attribute this to the fact that on a graph there is no analogue of the local roughness of a target function because there is a minimum distance (one step along the graph) between different input points. Finally we looked at the learning curves for larger p, where loops become important. These can still be predicted quite accurately by using the tree eigenvalue spectrum as an approximation, if one keeps track of the zero graph Laplacian eigenvalue which we were able to ignore previously; the approximation shows that the generalization error scales as σ 2 /n with again logarithmic corrections. In future work we plan to extend our analysis to graphs that are not regular, including ones from application domains as well as artificial ones with power-law tails in the distribution of degrees d, where qualitatively new effects are to be expected. It would also be desirable to improve the predictions for the learning curve in the crossover region ≈ σ 2 , which should be achievable using iterative approaches based on belief propagation that have already been shown to give accurate approximations for graph eigenvalue spectra [18]. These tools could then be further extended to study e.g. the effects of model mismatch in GP regression on random graphs, and how these are mitigated by tuning appropriate hyperparameters. Appendix We sketch here how to derive (13) from (12) for large p. Eq. (12) writes = g(νV /( + σ 2 )) with λL + g(h) = dλL ρ(λL )[κ−1 (1 − λL /a)−p + hV −1 ]−1 (14) λL − and κ determined from the condition g(0) = 1. (This g(h) is the tree spectrum approximation to the g(h) of (10).) Turning first to g(0), the factor (1 − λL /a)p decays quickly to zero as λL increases above λL . One can then approximate this factor according to (1 − λL /a)p [(a − λL )/(a − λL )]p ≈ − − − (1 − λL /a)p exp[−(λL − λL )p/(a − λL )]. In the regime near λL one can also approximate the − − − − spectral density (11) by its leading square-root increase, ρ(λL ) = r(λL − λL )1/2 , with r = (d − − 1)1/4 d5/2 /[π(d − 2)2 ]. Switching then to a new integration variable y = (λL − λL )p/(a − λL ) and − − extending the integration limit to ∞ gives ∞ √ 1 = g(0) = κr(1 − λL /a)p [p/(a − λL )]−3/2 dy y e−y (15) − − 0 and this fixes κ. Proceeding similarly for h > 0 gives ∞ g(h) = κr(1−λL /a)p [p/(a−λL )]−3/2 F (hκV −1 (1−λL /a)p ), − − − F (z) = √ dy y (ey +z)−1 0 (16) Dividing by g(0) = 1 shows that simply g(h) = F (hV −1 c−1 )/F (0), where c = 1/[κ(1 − σ2 λL /a)p ] = rF (0)[p/(a − λL )]−3/2 which scales as p−3/2 . In the asymptotic regime − − 2 2 we then have = g(νV /σ ) = F (ν/(cσ ))/F (0) and the desired result (13) follows from the large-z behaviour of F (z) ≈ z −1 ln3/2 (z). One can proceed similarly for the regime where loops become important. Clearly the zero Laplacian eigenvalue with weight 1/V then has to be taken into account. If we assume that the remainder of the Laplacian spectrum can still be approximated by that of a tree [18], we get (V + hκ)−1 + r(1 − λL /a)p [p/(a − λL )]−3/2 F (hκV −1 (1 − λL /a)p ) − − − g(h) = (17) V −1 + r(1 − λL /a)p [p/(a − λL )]−3/2 F (0) − − The denominator here is κ−1 and the two terms are proportional respectively to the covariance kernel eigenvalue λ1 , corresponding to λL = 0 and the constant eigenfunction, and to 1−λ1 . Dropping the 1 first terms in the numerator and denominator of (17) by taking V → ∞ leads back to the previous analysis as it should. For a situation as in Fig. 3(right), on the other hand, where λ1 is close to unity, we have κ ≈ V and so g(h) ≈ (1 + h)−1 + rV (1 − λL /a)p [p/(a − λL )]−3/2 F (h(1 − λL /a)p ) (18) − − − The second term, coming from the small kernel eigenvalues, is the more slowly decaying because it corresponds to fine detail of the target function that needs many training examples to learn accurately. It will therefore dominate the asymptotic behaviour of the learning curve: = g(n/σ 2 ) ∝ F (n/(c σ 2 )) with c = (1 − λL /a)−p independent of V . The large-n tail of the learning curve in − Fig. 3(right) is consistent with this form. 8 References [1] C E Rasmussen and C K I Williams. Gaussian processes for regression. In D S Touretzky, M C Mozer, and M E Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 514–520, Cambridge, MA, 1996. MIT Press. [2] M Opper. Regression with Gaussian processes: Average case performance. In I K Kwok-Yee, M Wong, I King, and Dit-Yun Yeung, editors, Theoretical Aspects of Neural Computation: A Multidisciplinary Perspective, pages 17–23. Springer, 1997. [3] P Sollich. Learning curves for Gaussian processes. In M S Kearns, S A Solla, and D A Cohn, editors, Advances in Neural Information Processing Systems 11, pages 344–350, Cambridge, MA, 1999. MIT Press. [4] M Opper and F Vivarelli. General bounds on Bayes errors for regression with Gaussian processes. In M Kearns, S A Solla, and D Cohn, editors, Advances in Neural Information Processing Systems 11, pages 302–308, Cambridge, MA, 1999. MIT Press. [5] C K I Williams and F Vivarelli. Upper and lower bounds on the learning curve for Gaussian processes. Mach. Learn., 40(1):77–102, 2000. [6] D Malzahn and M Opper. Learning curves for Gaussian processes regression: A framework for good approximations. In T K Leen, T G Dietterich, and V Tresp, editors, Advances in Neural Information Processing Systems 13, pages 273–279, Cambridge, MA, 2001. MIT Press. [7] D Malzahn and M Opper. A variational approach to learning curves. In T G Dietterich, S Becker, and Z Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 463–469, Cambridge, MA, 2002. MIT Press. [8] P Sollich and A Halees. Learning curves for Gaussian process regression: approximations and bounds. Neural Comput., 14(6):1393–1428, 2002. [9] P Sollich. Gaussian process regression with mismatched models. In T G Dietterich, S Becker, and Z Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 519–526, Cambridge, MA, 2002. MIT Press. [10] P Sollich. Can Gaussian process regression be made robust against model mismatch? In Deterministic and Statistical Methods in Machine Learning, volume 3635 of Lecture Notes in Artificial Intelligence, pages 199–210. 2005. [11] M Herbster, M Pontil, and L Wainer. Online learning over graphs. In ICML ’05: Proceedings of the 22nd international conference on Machine learning, pages 305–312, New York, NY, USA, 2005. ACM. [12] A J Smola and R Kondor. Kernels and regularization on graphs. In M Warmuth and B Sch¨ lkopf, o editors, Proc. Conference on Learning Theory (COLT), Lect. Notes Comp. Sci., pages 144–158. Springer, Heidelberg, 2003. [13] R I Kondor and J D Lafferty. Diffusion kernels on graphs and other discrete input spaces. In ICML ’02: Proceedings of the Nineteenth International Conference on Machine Learning, pages 315–322, San Francisco, CA, USA, 2002. Morgan Kaufmann. [14] F R K Chung. Spectral graph theory. Number 92 in Regional Conference Series in Mathematics. Americal Mathematical Society, 1997. [15] A Steger and N C Wormald. Generating random regular graphs quickly. Combinator. Probab. Comput., 8(4):377–396, 1999. [16] F Chung and S-T Yau. Coverings, heat kernels and spanning trees. The Electronic Journal of Combinatorics, 6(1):R12, 1999. [17] C Monthus and C Texier. Random walk on the Bethe lattice and hyperbolic brownian motion. J. Phys. A, 29(10):2399–2409, 1996. [18] T Rogers, I Perez Castillo, R Kuehn, and K Takeda. Cavity approach to the spectral density of sparse symmetric random matrices. Phys. Rev. E, 78(3):031116, 2008. 9
6 0.10208603 205 nips-2009-Rethinking LDA: Why Priors Matter
7 0.084058441 147 nips-2009-Matrix Completion from Noisy Entries
8 0.082732871 98 nips-2009-From PAC-Bayes Bounds to KL Regularization
9 0.080767758 148 nips-2009-Matrix Completion from Power-Law Distributed Samples
10 0.080031231 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers
11 0.079550184 23 nips-2009-Accelerating Bayesian Structural Inference for Non-Decomposable Gaussian Graphical Models
12 0.078613393 245 nips-2009-Thresholding Procedures for High Dimensional Variable Selection and Statistical Estimation
13 0.073416039 175 nips-2009-Occlusive Components Analysis
14 0.072189167 214 nips-2009-Semi-supervised Regression using Hessian energy with an application to semi-supervised dimensionality reduction
15 0.070862688 187 nips-2009-Particle-based Variational Inference for Continuous Systems
16 0.06964989 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models
17 0.068674207 31 nips-2009-An LP View of the M-best MAP problem
18 0.066738114 95 nips-2009-Fast subtree kernels on graphs
19 0.065666594 122 nips-2009-Label Selection on Graphs
20 0.064723946 225 nips-2009-Sparsistent Learning of Varying-coefficient Models with Structural Changes
topicId topicWeight
[(0, -0.184), (1, 0.093), (2, -0.022), (3, 0.042), (4, -0.024), (5, -0.091), (6, 0.032), (7, -0.0), (8, -0.036), (9, -0.08), (10, -0.098), (11, 0.008), (12, 0.115), (13, 0.023), (14, 0.054), (15, 0.065), (16, -0.048), (17, -0.055), (18, 0.047), (19, -0.071), (20, 0.007), (21, 0.097), (22, 0.11), (23, -0.025), (24, 0.004), (25, 0.034), (26, 0.075), (27, -0.175), (28, -0.099), (29, -0.133), (30, 0.024), (31, -0.115), (32, 0.007), (33, -0.03), (34, -0.005), (35, -0.142), (36, -0.056), (37, -0.104), (38, -0.172), (39, -0.013), (40, 0.072), (41, -0.29), (42, 0.027), (43, 0.052), (44, 0.04), (45, -0.109), (46, -0.029), (47, 0.067), (48, 0.014), (49, 0.065)]
simIndex simValue paperId paperTitle
same-paper 1 0.92554832 256 nips-2009-Which graphical models are difficult to learn?
Author: Andrea Montanari, Jose A. Pereira
Abstract: We consider the problem of learning the structure of Ising models (pairwise binary Markov random fields) from i.i.d. samples. While several methods have been proposed to accomplish this task, their relative merits and limitations remain somewhat obscure. By analyzing a number of concrete examples, we show that low-complexity algorithms systematically fail when the Markov random field develops long-range correlations. More precisely, this phenomenon appears to be related to the Ising model phase transition (although it does not coincide with it). 1 Introduction and main results Given a graph G = (V = [p], E), and a positive parameter θ > 0 the ferromagnetic Ising model on G is the pairwise Markov random field µG,θ (x) = 1 ZG,θ eθxi xj (1) (i,j)∈E over binary variables x = (x1 , x2 , . . . , xp ). Apart from being one of the most studied models in statistical mechanics, the Ising model is a prototypical undirected graphical model, with applications in computer vision, clustering and spatial statistics. Its obvious generalization to edge-dependent parameters θij , (i, j) ∈ E is of interest as well, and will be introduced in Section 1.2.2. (Let us stress that we follow the statistical mechanics convention of calling (1) an Ising model for any graph G.) In this paper we study the following structural learning problem: Given n i.i.d. samples x(1) , x(2) ,. . . , x(n) with distribution µG,θ ( · ), reconstruct the graph G. For the sake of simplicity, we assume that the parameter θ is known, and that G has no double edges (it is a ‘simple’ graph). The graph learning problem is solvable with unbounded sample complexity, and computational resources [1]. The question we address is: for which classes of graphs and values of the parameter θ is the problem solvable under appropriate complexity constraints? More precisely, given an algorithm Alg, a graph G, a value θ of the model parameter, and a small δ > 0, the sample complexity is defined as nAlg (G, θ) ≡ inf n ∈ N : Pn,G,θ {Alg(x(1) , . . . , x(n) ) = G} ≥ 1 − δ , (2) where Pn,G,θ denotes probability with respect to n i.i.d. samples with distribution µG,θ . Further, we let χAlg (G, θ) denote the number of operations of the algorithm Alg, when run on nAlg (G, θ) samples.1 1 For the algorithms analyzed in this paper, the behavior of nAlg and χAlg does not change significantly if we require only ‘approximate’ reconstruction (e.g. in graph distance). 1 The general problem is therefore to characterize the functions nAlg (G, θ) and χAlg (G, θ), in particular for an optimal choice of the algorithm. General bounds on nAlg (G, θ) have been given in [2, 3], under the assumption of unbounded computational resources. A general charactrization of how well low complexity algorithms can perform is therefore lacking. Although we cannot prove such a general characterization, in this paper we estimate nAlg and χAlg for a number of graph models, as a function of θ, and unveil a fascinating universal pattern: when the model (1) develops long range correlations, low-complexity algorithms fail. Under the Ising model, the variables {xi }i∈V become strongly correlated for θ large. For a large class of graphs with degree bounded by ∆, this phenomenon corresponds to a phase transition beyond some critical value of θ uniformly bounded in p, with typically θcrit ≤ const./∆. In the examples discussed below, the failure of low-complexity algorithms appears to be related to this phase transition (although it does not coincide with it). 1.1 A toy example: the thresholding algorithm In order to illustrate the interplay between graph structure, sample complexity and interaction strength θ, it is instructive to consider a warmup example. The thresholding algorithm reconstructs G by thresholding the empirical correlations Cij ≡ 1 n n (ℓ) (ℓ) xi xj for i, j ∈ V . ℓ=1 (3) T HRESHOLDING( samples {x(ℓ) }, threshold τ ) 1: Compute the empirical correlations {Cij }(i,j)∈V ×V ; 2: For each (i, j) ∈ V × V 3: If Cij ≥ τ , set (i, j) ∈ E; We will denote this algorithm by Thr(τ ). Notice that its complexity is dominated by the computation of the empirical correlations, i.e. χThr(τ ) = O(p2 n). The sample complexity nThr(τ ) can be bounded for specific classes of graphs as follows (the proofs are straightforward and omitted from this paper). Theorem 1.1. If G has maximum degree ∆ > 1 and if θ < atanh(1/(2∆)) then there exists τ = τ (θ) such that 2p 8 log nThr(τ ) (G, θ) ≤ . (4) 1 δ (tanh θ − 2∆ )2 Further, the choice τ (θ) = (tanh θ + (1/2∆))/2 achieves this bound. Theorem 1.2. There exists a numerical constant K such that the following is true. If ∆ > 3 and θ > K/∆, there are graphs of bounded degree ∆ such that for any τ , nThr(τ ) = ∞, i.e. the thresholding algorithm always fails with high probability. These results confirm the idea that the failure of low-complexity algorithms is related to long-range correlations in the underlying graphical model. If the graph G is a tree, then correlations between far apart variables xi , xj decay exponentially with the distance between vertices i, j. The same happens on bounded-degree graphs if θ ≤ const./∆. However, for θ > const./∆, there exists families of bounded degree graphs with long-range correlations. 1.2 More sophisticated algorithms In this section we characterize χAlg (G, θ) and nAlg (G, θ) for more advanced algorithms. We again obtain very distinct behaviors of these algorithms depending on long range correlations. Due to space limitations, we focus on two type of algorithms and only outline the proof of our most challenging result, namely Theorem 1.6. In the following we denote by ∂i the neighborhood of a node i ∈ G (i ∈ ∂i), and assume the degree / to be bounded: |∂i| ≤ ∆. 1.2.1 Local Independence Test A recurring approach to structural learning consists in exploiting the conditional independence structure encoded by the graph [1, 4, 5, 6]. 2 Let us consider, to be definite, the approach of [4], specializing it to the model (1). Fix a vertex r, whose neighborhood we want to reconstruct, and consider the conditional distribution of xr given its neighbors2 : µG,θ (xr |x∂r ). Any change of xi , i ∈ ∂r, produces a change in this distribution which is bounded away from 0. Let U be a candidate neighborhood, and assume U ⊆ ∂r. Then changing the value of xj , j ∈ U will produce a noticeable change in the marginal of Xr , even if we condition on the remaining values in U and in any W , |W | ≤ ∆. On the other hand, if U ∂r, then it is possible to find W (with |W | ≤ ∆) and a node i ∈ U such that, changing its value after fixing all other values in U ∪ W will produce no noticeable change in the conditional marginal. (Just choose i ∈ U \∂r and W = ∂r\U ). This procedure allows us to distinguish subsets of ∂r from other sets of vertices, thus motivating the following algorithm. L OCAL I NDEPENDENCE T EST( samples {x(ℓ) }, thresholds (ǫ, γ) ) 1: Select a node r ∈ V ; 2: Set as its neighborhood the largest candidate neighbor U of size at most ∆ for which the score function S CORE(U ) > ǫ/2; 3: Repeat for all nodes r ∈ V ; The score function S CORE( · ) depends on ({x(ℓ) }, ∆, γ) and is defined as follows, min max W,j xi ,xW ,xU ,xj |Pn,G,θ {Xi = xi |X W = xW , X U = xU }− Pn,G,θ {Xi = xi |X W = xW , X U \j = xU \j , Xj = xj }| . (5) In the minimum, |W | ≤ ∆ and j ∈ U . In the maximum, the values must be such that Pn,G,θ {X W = xW , X U = xU } > γ/2, Pn,G,θ {X W = xW , X U \j = xU \j , Xj = xj } > γ/2 Pn,G,θ is the empirical distribution calculated from the samples {x(ℓ) }. We denote this algorithm by Ind(ǫ, γ). The search over candidate neighbors U , the search for minima and maxima in the computation of the S CORE(U ) and the computation of Pn,G,θ all contribute for χInd (G, θ). Both theorems that follow are consequences of the analysis of [4]. Theorem 1.3. Let G be a graph of bounded degree ∆ ≥ 1. For every θ there exists (ǫ, γ), and a numerical constant K, such that 2p 100∆ , χInd(ǫ,γ) (G, θ) ≤ K (2p)2∆+1 log p . nInd(ǫ,γ) (G, θ) ≤ 2 4 log ǫ γ δ More specifically, one can take ǫ = 1 4 sinh(2θ), γ = e−4∆θ 2−2∆ . This first result implies in particular that G can be reconstructed with polynomial complexity for any bounded ∆. However, the degree of such polynomial is pretty high and non-uniform in ∆. This makes the above approach impractical. A way out was proposed in [4]. The idea is to identify a set of ‘potential neighbors’ of vertex r via thresholding: B(r) = {i ∈ V : Cri > κ/2} , (6) For each node r ∈ V , we evaluate S CORE(U ) by restricting the minimum in Eq. (5) over W ⊆ B(r), and search only over U ⊆ B(r). We call this algorithm IndD(ǫ, γ, κ). The basic intuition here is that Cri decreases rapidly with the graph distance between vertices r and i. As mentioned above, this is true at small θ. Theorem 1.4. Let G be a graph of bounded degree ∆ ≥ 1. Assume that θ < K/∆ for some small enough constant K. Then there exists ǫ, γ, κ such that nIndD(ǫ,γ,κ) (G, θ) ≤ 8(κ2 + 8∆ ) log 4p , δ χIndD(ǫ,γ,κ) (G, θ) ≤ K ′ p∆∆ More specifically, we can take κ = tanh θ, ǫ = 1 4 log(4/κ) α + K ′ ∆p2 log p . sinh(2θ) and γ = e−4∆θ 2−2∆ . 2 If a is a vector and R is a set of indices then we denote by aR the vector formed by the components of a with index in R. 3 1.2.2 Regularized Pseudo-Likelihoods A different approach to the learning problem consists in maximizing an appropriate empirical likelihood function [7, 8, 9, 10, 13]. To control the fluctuations caused by the limited number of samples, and select sparse graphs a regularization term is often added [7, 8, 9, 10, 11, 12, 13]. As a specific low complexity implementation of this idea, we consider the ℓ1 -regularized pseudolikelihood method of [7]. For each node r, the following likelihood function is considered L(θ; {x(ℓ) }) = − 1 n n (ℓ) ℓ=1 log Pn,G,θ (x(ℓ) |x\r ) r (7) where x\r = xV \r = {xi : i ∈ V \ r} is the vector of all variables except xr and Pn,G,θ is defined from the following extension of (1), µG,θ (x) = 1 ZG,θ eθij xi xj (8) i,j∈V / where θ = {θij }i,j∈V is a vector of real parameters. Model (1) corresponds to θij = 0, ∀(i, j) ∈ E and θij = θ, ∀(i, j) ∈ E. The function L(θ; {x(ℓ) }) depends only on θr,· = {θrj , j ∈ ∂r} and is used to estimate the neighborhood of each node by the following algorithm, Rlr(λ), R EGULARIZED L OGISTIC R EGRESSION( samples {x(ℓ) }, regularization (λ)) 1: Select a node r ∈ V ; 2: Calculate ˆr,· = arg min {L(θr,· ; {x(ℓ) }) + λ||θr,· ||1 }; θ θ r,· ∈Rp−1 3: ˆ If θrj > 0, set (r, j) ∈ E; Our first result shows that Rlr(λ) indeed reconstructs G if θ is sufficiently small. Theorem 1.5. There exists numerical constants K1 , K2 , K3 , such that the following is true. Let G be a graph with degree bounded by ∆ ≥ 3. If θ ≤ K1 /∆, then there exist λ such that nRlr(λ) (G, θ) ≤ K2 θ−2 ∆ log 8p2 . δ (9) Further, the above holds with λ = K3 θ ∆−1/2 . This theorem is proved by noting that for θ ≤ K1 /∆ correlations decay exponentially, which makes all conditions in Theorem 1 of [7] (denoted there by A1 and A2) hold, and then computing the probability of success as a function of n, while strenghtening the error bounds of [7]. In order to prove a converse to the above result, we need to make some assumptions on λ. Given θ > 0, we say that λ is ‘reasonable for that value of θ if the following conditions old: (i) Rlr(λ) is successful with probability larger than 1/2 on any star graph (a graph composed by a vertex r connected to ∆ neighbors, plus isolated vertices); (ii) λ ≤ δ(n) for some sequence δ(n) ↓ 0. Theorem 1.6. There exists a numerical constant K such that the following happens. If ∆ > 3, θ > K/∆, then there exists graphs G of degree bounded by ∆ such that for all reasonable λ, nRlr(λ) (G) = ∞, i.e. regularized logistic regression fails with high probability. The graphs for which regularized logistic regression fails are not contrived examples. Indeed we will prove that the claim in the last theorem holds with high probability when G is a uniformly random graph of regular degree ∆. The proof Theorem 1.6 is based on showing that an appropriate incoherence condition is necessary for Rlr to successfully reconstruct G. The analogous result was proven in [14] for model selection using the Lasso. In this paper we show that such a condition is also necessary when the underlying model is an Ising model. Notice that, given the graph G, checking the incoherence condition is NP-hard for general (non-ferromagnetic) Ising model, and requires significant computational effort 4 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 20 15 λ0 10 5 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.2 1 0.8 0.6 Psucc 0.4 0.2 1 θ 0 0 0.2 0.4 0.6 θ θcrit 0.8 1 Figure 1: Learning random subgraphs of a 7 × 7 (p = 49) two-dimensional grid from n = 4500 Ising models samples, using regularized logistic regression. Left: success probability as a function of the model parameter θ and of the regularization parameter λ0 (darker corresponds to highest probability). Right: the same data plotted for several choices of λ versus θ. The vertical line corresponds to the model critical temperature. The thick line is an envelope of the curves obtained for different λ, and should correspond to optimal regularization. even in the ferromagnetic case. Hence the incoherence condition does not provide, by itself, a clear picture of which graph structure are difficult to learn. We will instead show how to evaluate it on specific graph families. Under the restriction λ → 0 the solutions given by Rlr converge to θ∗ with n [7]. Thus, for large n we can expand L around θ∗ to second order in (θ − θ∗ ). When we add the regularization term to L we obtain a quadratic model analogous the Lasso plus the error term due to the quadratic approximation. It is thus not surprising that, when λ → 0 the incoherence condition introduced for the Lasso in [14] is also relevant for the Ising model. 2 Numerical experiments In order to explore the practical relevance of the above results, we carried out extensive numerical simulations using the regularized logistic regression algorithm Rlr(λ). Among other learning algorithms, Rlr(λ) strikes a good balance of complexity and performance. Samples from the Ising model (1) where generated using Gibbs sampling (a.k.a. Glauber dynamics). Mixing time can be very large for θ ≥ θcrit , and was estimated using the time required for the overall bias to change sign (this is a quite conservative estimate at low temperature). Generating the samples {x(ℓ) } was indeed the bulk of our computational effort and took about 50 days CPU time on Pentium Dual Core processors (we show here only part of these data). Notice that Rlr(λ) had been tested in [7] only on tree graphs G, or in the weakly coupled regime θ < θcrit . In these cases sampling from the Ising model is easy, but structural learning is also intrinsically easier. Figure reports the success probability of Rlr(λ) when applied to random subgraphs of a 7 × 7 two-dimensional grid. Each such graphs was obtained by removing each edge independently with probability ρ = 0.3. Success probability was estimated by applying Rlr(λ) to each vertex of 8 graphs (thus averaging over 392 runs of Rlr(λ)), using n = 4500 samples. We scaled the regularization parameter as λ = 2λ0 θ(log p/n)1/2 (this choice is motivated by the algorithm analysis and is empirically the most satisfactory), and searched over λ0 . The data clearly illustrate the phenomenon discussed. Despite the large number of samples n ≫ log p, when θ crosses a threshold, the algorithm starts performing poorly irrespective of λ. Intriguingly, this threshold is not far from the critical point of the Ising model on a randomly diluted grid θcrit (ρ = 0.3) ≈ 0.7 [15, 16]. 5 1.2 1.2 θ = 0.35, 0.40 1 1 θ = 0.25 θ = 0.20 0.8 0.8 θ = 0.45 θ = 0.10 0.6 0.6 Psucc Psucc 0.4 0.4 θ = 0.50 0.2 0.2 θthr θ = 0.65, 0.60, 0.55 0 0 0 2000 4000 6000 8000 10000 0 0.1 0.2 n 0.3 0.4 0.5 0.6 0.7 0.8 θ Figure 2: Learning uniformly random graphs of degree 4 from Ising models samples, using Rlr. Left: success probability as a function of the number of samples n for several values of θ. Right: the same data plotted for several choices of λ versus θ as in Fig. 1, right panel. Figure 2 presents similar data when G is a uniformly random graph of degree ∆ = 4, over p = 50 vertices. The evolution of the success probability with n clearly shows a dichotomy. When θ is below a threshold, a small number of samples is sufficient to reconstruct G with high probability. Above the threshold even n = 104 samples are to few. In this case we can predict the threshold analytically, cf. Lemma 3.3 below, and get θthr (∆ = 4) ≈ 0.4203, which compares favorably with the data. 3 Proofs In order to prove Theorem 1.6, we need a few auxiliary results. It is convenient to introduce some notations. If M is a matrix and R, P are index sets then MR P denotes the submatrix with row indices in R and column indices in P . As above, we let r be the vertex whose neighborhood we are trying to reconstruct and define S = ∂r, S c = V \ ∂r ∪ r. Since the cost function L(θ; {x(ℓ) }) + λ||θ||1 only depend on θ through its components θr,· = {θrj }, we will hereafter neglect all the other parameters and write θ as a shorthand of θr,· . Let z ∗ be a subgradient of ||θ||1 evaluated at the true parameters values, θ∗ = {θrj : θij = 0, ∀j ∈ ˆ / ˆn be the parameter estimate returned by Rlr(λ) when the number ∂r, θrj = θ, ∀j ∈ ∂r}. Let θ of samples is n. Note that, since we assumed θ∗ ≥ 0, zS = ½. Define Qn (θ, ; {x(ℓ) }) to be the ˆ∗ (ℓ) (ℓ) n Hessian of L(θ; {x }) and Q(θ) = limn→∞ Q (θ, ; {x }). By the law of large numbers Q(θ) is the Hessian of EG,θ log PG,θ (Xr |X\r ) where EG,θ is the expectation with respect to (8) and X is a random variable distributed according to (8). We will denote the maximum and minimum eigenvalue of a symmetric matrix M by σmax (M ) and σmin (M ) respectively. We will omit arguments whenever clear from the context. Any quantity evaluated at the true parameter values will be represented with a ∗ , e.g. Q∗ = Q(θ∗ ). Quantities under a ∧ depend on n. Throughout this section G is a graph of maximum degree ∆. 3.1 Proof of Theorem 1.6 Our first auxiliary results establishes that, if λ is small, then ||Q∗ c S Q∗ −1 zS ||∞ > 1 is a sufficient ˆ∗ S SS condition for the failure of Rlr(λ). Lemma 3.1. Assume [Q∗ c S Q∗ −1 zS ]i ≥ 1 + ǫ for some ǫ > 0 and some row i ∈ V , σmin (Q∗ ) ≥ ˆ∗ S SS SS 3 ǫ/29 ∆4 . Then the success probability of Rlr(λ) is upper bounded as Cmin > 0, and λ < Cmin 2 2 2 δB Psucc ≤ 4∆2 e−nδA + 2∆ e−nλ 2 where δA = (Cmin /100∆2 )ǫ and δB = (Cmin /8∆)ǫ. 6 (10) The next Lemma implies that, for λ to be ‘reasonable’ (in the sense introduced in Section 1.2.2), nλ2 must be unbounded. Lemma 3.2. There exist M = M (K, θ) > 0 for θ > 0 such that the following is true: If G is the graph with only one edge between nodes r and i and nλ2 ≤ K, then Psucc ≤ e−M (K,θ)p + e−n(1−tanh θ) 2 /32 . (11) Finally, our key result shows that the condition ||Q∗ c S Q∗ −1 zS ||∞ ≤ 1 is violated with high ˆ∗ S SS probability for large random graphs. The proof of this result relies on a local weak convergence result for ferromagnetic Ising models on random graphs proved in [17]. Lemma 3.3. Let G be a uniformly random regular graph of degree ∆ > 3, and ǫ > 0 be sufficiently small. Then, there exists θthr (∆, ǫ) such that, for θ > θthr (∆, ǫ), ||Q∗ c S Q∗ −1 zS ||∞ ≥ 1 + ǫ with ˆ∗ S SS probability converging to 1 as p → ∞. ˜ ˜ ˜ Furthermore, for large ∆, θthr (∆, 0+) = θ ∆−1 (1 + o(1)). The constant θ is given by θ = ¯ ¯ ¯ ¯ ¯ ¯ tanh h)/h and h is the unique positive solution of h tanh h = (1 − tanh2 h)2 . Finally, there exist Cmin > 0 dependent only on ∆ and θ such that σmin (Q∗ ) ≥ Cmin with probability converging to SS 1 as p → ∞. The proofs of Lemmas 3.1 and 3.3 are sketched in the next subsection. Lemma 3.2 is more straightforward and we omit its proof for space reasons. Proof. (Theorem 1.6) Fix ∆ > 3, θ > K/∆ (where K is a large enough constant independent of ∆), and ǫ, Cmin > 0 and both small enough. By Lemma 3.3, for any p large enough we can choose a ∆-regular graph Gp = (V = [p], Ep ) and a vertex r ∈ V such that |Q∗ c S Q∗ −1 ½S |i > 1 + ǫ for S SS some i ∈ V \ r. By Theorem 1 in [4] we can assume, without loss of generality n > K ′ ∆ log p for some small constant K ′ . Further by Lemma 3.2, nλ2 ≥ F (p) for some F (p) ↑ ∞ as p → ∞ and the condition of Lemma 3.1 on λ is satisfied since by the ”reasonable” assumption λ → 0 with n. Using these results in Eq. (10) of Lemma 3.1 we get the following upper bound on the success probability 2 Psucc (Gp ) ≤ 4∆2 p−δA K In particular Psucc (Gp ) → 0 as p → ∞. 3.2 ′ ∆ 2 + 2∆ e−nF (p)δB . (12) Proofs of auxiliary lemmas θ θ Proof. (Lemma 3.1) We will show that under the assumptions of the lemma and if ˆ = (ˆS , ˆS C ) = θ (ˆS , 0) then the probability that the i component of any subgradient of L(θ; {x(ℓ) })+λ||θ||1 vanishes θ for any ˆS > 0 (component wise) is upper bounded as in Eq. (10). To simplify notation we will omit θ {x(ℓ) } in all the expression derived from L. θ θ) z Let z be a subgradient of ||θ|| at ˆ and assume ∇L(ˆ + λˆ = 0. An application of the mean value ˆ theorem yields ∇2 L(θ∗ )[ˆ − θ∗ ] = W n − λˆ + Rn , θ z (13) ∗ n n 2 ¯(j) ) − ∇2 L(θ∗ )]T (ˆ − θ∗ ) with ¯(j) a point in the line where W = −∇L(θ ) and [R ]j = [∇ L(θ θ j θ ˆ to θ∗ . Notice that by definition ∇2 L(θ∗ ) = Qn ∗ = Qn (θ∗ ). To simplify notation we will from θ omit the ∗ in all Qn ∗ . All Qn in this proof are thus evaluated at θ∗ . Breaking this expression into its S and S c components and since ˆS C = θ∗ C = 0 we can eliminate θ S ˆ − θ∗ from the two expressions obtained and write θS S n n n n ˆ z [WS C − RS C ] − Qn C S (Qn )−1 [WS − RS ] + λQn C S (Qn )−1 zS = λˆS C . SS SS S S Now notice that Qn C S (Qn )−1 = T1 + T2 + T3 + T4 where SS S T1 = Q∗ C S [(Qn )−1 − (Q∗ )−1 ] , SS SS S T3 = [Qn C S − Q∗ C S ][(Qn )−1 − (Q∗ )−1 ] , SS SS S S 7 T2 = [Qn C S − Q∗ C S ]Q∗ −1 , SS S S T4 = Q∗ C S Q∗ −1 . SS S (14) We will assume that the samples {x(ℓ) } are such that the following event holds n E ≡ {||Qn − Q∗ ||∞ < ξA , ||Qn C S − Q∗ C S ||∞ < ξB , ||WS /λ||∞ < ξC } , (15) SS SS S S √ 2 n where ξA ≡ Cmin ǫ/(16∆), ξB ≡ Cmin ǫ/(8 ∆) and ξC ≡ Cmin ǫ/(8∆). Since EG,θ (Q ) = Q∗ and EG,θ (W n ) = 0 and noticing that both Qn and W n are sums of bounded i.i.d. random variables, a simple application of Azuma-Hoeffding inequality upper bounds the probability of E as in (10). From E it follows that σmin (Qn ) > σmin (Q∗ ) − Cmin /2 > Cmin /2. We can therefore lower SS SS bound the absolute value of the ith component of zS C by ˆ n n ∆ Rn WS RS Wn + |[Q∗ C S Q∗ −1 ½S ]i |−||T1,i ||∞ −||T2,i ||∞ −||T3,i ||∞ − i − i − SS S λ λ Cmin λ ∞ λ ∞ where the subscript i denotes the i-th row of a matrix. The proof is completed by showing that the event E and the assumptions of the theorem imply that n each of last 7 terms in this expression is smaller than ǫ/8. Since |[Q∗ C S Q∗ −1 ]T zS | ≥ 1 + ǫ by i ˆ SS S assumption, this implies |ˆi | ≥ 1 + ǫ/8 > 1 which cannot be since any subgradient of the 1-norm z has components of magnitude at most 1. The last condition on E immediately bounds all terms involving W by ǫ/8. Some straightforward manipulations imply (See Lemma 7 from [7]) √ ∆ ∆ n ∗ ||T2,i ||∞ ≤ ||[Qn C S − Q∗ C S ]i ||∞ , ||T1,i ||∞ ≤ 2 ||QSS − QSS ||∞ , S S Cmin Cmin 2∆ ||T3,i ||∞ ≤ 2 ||Qn − Q∗ ||∞ ||[Qn C S − Q∗ C S ]i ||∞ , SS SS S S Cmin and thus all will be bounded by ǫ/8 when E holds. The upper bound of Rn follows along similar lines via an mean value theorem, and is deferred to a longer version of this paper. Proof. (Lemma 3.3.) Let us state explicitly the local weak convergence result mentioned in Sec. 3.1. For t ∈ N, let T(t) = (VT , ET ) be the regular rooted tree of t generations and define the associated Ising measure as ∗ 1 eθxi xj (16) eh xi . µ+ (x) = T,θ ZT,θ (i,j)∈ET i∈∂T(t) Here ∂T(t) is the set of leaves of T(t) and h∗ is the unique positive solution of h = (∆ − 1) atanh {tanh θ tanh h}. It can be proved using [17] and uniform continuity with respect to the ‘external field’ that non-trivial local expectations with respect to µG,θ (x) converge to local expectations with respect to µ+ (x), as p → ∞. T,θ More precisely, let Br (t) denote a ball of radius t around node r ∈ G (the node whose neighborhood we are trying to reconstruct). For any fixed t, the probability that Br (t) is not isomorphic to T(t) goes to 0 as p → ∞. Let g(xBr (t) ) be any function of the variables in Br (t) such that g(xBr (t) ) = g(−xBr (t) ). Then almost surely over graph sequences Gp of uniformly random regular graphs with p nodes (expectations here are taken with respect to the measures (1) and (16)) lim EG,θ {g(X Br (t) )} = ET(t),θ,+ {g(X T(t) )} . (17) p→∞ The proof consists in considering [Q∗ c S Q∗ −1 zS ]i for t = dist(r, i) finite. We then write ˆ∗ S SS (Q∗ )lk = E{gl,k (X Br (t) )} and (Q∗ c S )il = E{gi,l (X Br (t) )} for some functions g·,· (X Br (t) ) and S SS apply the weak convergence result (17) to these expectations. We thus reduced the calculation of [Q∗ c S Q∗ −1 zS ]i to the calculation of expectations with respect to the tree measure (16). The latter ˆ∗ S SS can be implemented explicitly through a recursive procedure, with simplifications arising thanks to the tree symmetry and by taking t ≫ 1. The actual calculations consist in a (very) long exercise in calculus and we omit them from this outline. The lower bound on σmin (Q∗ ) is proved by a similar calculation. SS Acknowledgments This work was partially supported by a Terman fellowship, the NSF CAREER award CCF-0743978 and the NSF grant DMS-0806211 and by a Portuguese Doctoral FCT fellowship. 8 , References [1] P. Abbeel, D. Koller and A. Ng, “Learning factor graphs in polynomial time and sample complexity”. Journal of Machine Learning Research., 2006, Vol. 7, 1743–1788. [2] M. Wainwright, “Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting”, arXiv:math/0702301v2 [math.ST], 2007. [3] N. Santhanam, M. Wainwright, “Information-theoretic limits of selecting binary graphical models in high dimensions”, arXiv:0905.2639v1 [cs.IT], 2009. [4] G. Bresler, E. Mossel and A. Sly, “Reconstruction of Markov Random Fields from Samples: Some Observations and Algorithms”,Proceedings of the 11th international workshop, APPROX 2008, and 12th international workshop RANDOM 2008, 2008 ,343–356. [5] Csiszar and Z. Talata, “Consistent estimation of the basic neighborhood structure of Markov random fields”, The Annals of Statistics, 2006, 34, Vol. 1, 123-145. [6] N. Friedman, I. Nachman, and D. Peer, “Learning Bayesian network structure from massive datasets: The sparse candidate algorithm”. In UAI, 1999. [7] P. Ravikumar, M. Wainwright and J. Lafferty, “High-Dimensional Ising Model Selection Using l1-Regularized Logistic Regression”, arXiv:0804.4202v1 [math.ST], 2008. [8] M.Wainwright, P. Ravikumar, and J. Lafferty, “Inferring graphical model structure using l1regularized pseudolikelihood“, In NIPS, 2006. [9] H. H¨ fling and R. Tibshirani, “Estimation of Sparse Binary Pairwise Markov Networks using o Pseudo-likelihoods” , Journal of Machine Learning Research, 2009, Vol. 10, 883–906. [10] O.Banerjee, L. El Ghaoui and A. d’Aspremont, “Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data”, Journal of Machine Learning Research, March 2008, Vol. 9, 485–516. [11] M. Yuan and Y. Lin, “Model Selection and Estimation in Regression with Grouped Variables”, J. Royal. Statist. Soc B, 2006, 68, Vol. 19,49–67. [12] N. Meinshausen and P. B¨ uhlmann, “High dimensional graphs and variable selection with the u lasso”, Annals of Statistics, 2006, 34, Vol. 3. [13] R. Tibshirani, “Regression shrinkage and selection via the lasso”, Journal of the Royal Statistical Society, Series B, 1994, Vol. 58, 267–288. [14] P. Zhao, B. Yu, “On model selection consistency of Lasso”, Journal of Machine. Learning Research 7, 25412563, 2006. [15] D. Zobin, ”Critical behavior of the bond-dilute two-dimensional Ising model“, Phys. Rev., 1978 ,5, Vol. 18, 2387 – 2390. [16] M. Fisher, ”Critical Temperatures of Anisotropic Ising Lattices. II. General Upper Bounds”, Phys. Rev. 162 ,Oct. 1967, Vol. 2, 480–485. [17] A. Dembo and A. Montanari, “Ising Models on Locally Tree Like Graphs”, Ann. Appl. Prob. (2008), to appear, arXiv:0804.4726v2 [math.PR] 9
2 0.8045463 248 nips-2009-Toward Provably Correct Feature Selection in Arbitrary Domains
Author: Dimitris Margaritis
Abstract: In this paper we address the problem of provably correct feature selection in arbitrary domains. An optimal solution to the problem is a Markov boundary, which is a minimal set of features that make the probability distribution of a target variable conditionally invariant to the state of all other features in the domain. While numerous algorithms for this problem have been proposed, their theoretical correctness and practical behavior under arbitrary probability distributions is unclear. We address this by introducing the Markov Boundary Theorem that precisely characterizes the properties of an ideal Markov boundary, and use it to develop algorithms that learn a more general boundary that can capture complex interactions that only appear when the values of multiple features are considered together. We introduce two algorithms: an exact, provably correct one as well a more practical randomized anytime version, and show that they perform well on artificial as well as benchmark and real-world data sets. Throughout the paper we make minimal assumptions that consist of only a general set of axioms that hold for every probability distribution, which gives these algorithms universal applicability. 1 Introduction and Motivation The problem of feature selection has a long history due to its significance in a wide range of important problems, from early ones like pattern recognition to recent ones such as text categorization, gene expression analysis and others. In such domains, using all available features may be prohibitively expensive, unnecessarily wasteful, and may lead to poor generalization performance, especially in the presence of irrelevant or redundant features. Thus, selecting a subset of features of the domain for use in subsequent application of machine learning algorithms has become a standard preprocessing step. A typical task of these algorithms is learning a classifier: Given a number of input features and a quantity of interest, called the target variable, choose a member of a family of classifiers that can predict the target variable’s value as well as possible. Another task is understanding the domain and the quantities that interact with the target quantity. Many algorithms have been proposed for feature selection. Unfortunately, little attention has been paid to the issue of their behavior under a variety of application domains that can be encountered in practice. In particular, it is known that many can fail under certain probability distributions such as ones that contain a (near) parity function [1], which contain interactions that only appear when the values of multiple features are considered together. There is therefore an acute need for algorithms that are widely applicable and can be theoretically proven to work under any probability distribution. In this paper we present two such algorithms, an exact and a more practical randomized approximate one. We use the observation (first made in Koller and Sahami [2]) that an optimal solution to the problem is a Markov boundary, defined to be a minimal set of features that make the probability distribution of a target variable conditionally invariant to the state of all other features in the domain (a more precise definition is given later in Section 3) and present a family of algorithms for learning 1 the Markov boundary of a target variable in arbitrary domains. We first introduce a theorem that exactly characterizes the minimal set of features necessary for probabilistically isolating a variable, and then relax this definition to derive a family of algorithms that learn a parameterized approximation of the ideal boundary that are provably correct under a minimal set of assumptions, including a set of axioms that hold for any probability distribution. In the following section we present work on feature selection, followed by notation and definitions in Section 3. We subsequently introduce an important theorem and the aforementioned parameterized family of algorithms in Sections 4 and 5 respectively, including a practical anytime version. We evaluate these algorithms in Section 6 and conclude in Section 7. 2 Related Work Numerous algorithms have been proposed for feature selection. At the highest level algorithms can be classified as filter, wrapper, or embedded methods. Filter methods work without consulting the classifier (if any) that will make use of their output i.e., the resulting set of selected features. They therefore have typically wider applicability since they are not tied to any particular classifier family. In contrast, wrappers make the classifier an integral part of their operation, repeatedly invoking it to evaluate each of a sequence of feature subsets, and selecting the subset that results in minimum estimated classification error (for that particular classifier). Finally, embedded algorithms are classifier-learning algorithms that perform feature selection implicitly during their operation e.g., decision tree learners. Early work was motivated by the problem of pattern recognition which inherently contains a large number of features (pixels, regions, signal responses at multiple frequencies etc.). Narendra and Fukunaga [3] first cast feature selection as a problem of maximization of an objective function over the set of features to use, and proposed a number of search approaches including forward selection and backward elimination. Later work by machine learning researchers includes the FOCUS algorithm of Almuallim and Dietterich [4], which is a filter method for deterministic, noise-free domains. The RELIEF algorithm [5] instead uses a randomized selection of data points to update a weight assigned to each feature, selecting the features whose weight exceeds a given threshold. A large number of additional algorithms have appeared in the literature, too many to list here—good surveys are included in Dash and Liu [6]; Guyon and Elisseeff [1]; Liu and Motoda [7]. An important concept for feature subset selection is relevance. Several notions of relevance are discussed in a number of important papers such as Blum and Langley [8]; Kohavi and John [9]. The argument that the problem of feature selection can be cast as the problem of Markov blanket discovery was first made convincingly in Koller and Sahami [2], who also presented an algorithm for learning an approximate Markov blanket using mutual information. Other algorithms include the GS algorithm [10], originally developed for learning of the structure of a Bayesian network of a domain, and extensions to it [11] including the recent MMMB algorithm [12]. Meinshausen and B¨ hlmann [13] u recently proposed an optimal theoretical solution to the problem of learning the neighborhood of a Markov network when the distribution of the domain can be assumed to be a multidimensional Gaussian i.e., linear relations among features with Gaussian noise. This assumption implies that the Composition axiom holds in the domain (see Pearl [14] for a definition of Composition); the difference with our work is that we address here the problem in general domains where it may not necessarily hold. 3 Notation and Preliminaries In this section we present notation, fundamental definitions and axioms that will be subsequently used in the rest of the paper. We use the term “feature” and “variable” interchangeably, and denote variables by capital letters (X, Y etc.) and sets of variables by bold letters (S, T etc.). We denote the set of all variables/features in the domain (the “universe”) by U. All algorithms presented are independence-based, learning the Markov boundary of a given target variable using the truth value of a number of conditional independence statements. The use of conditional independence for feature selection subsumes many other criteria proposed in the literature. In particular, the use of classification accuracy of the target variable can be seen as a special case of testing for its conditional independence with some of its predictor variables (conditional on the subset selected at any given moment). A benefit of using conditional independence is that, while classification error estimates depend on the classifier family used, conditional independence does not. In addition, algorithms utilizing conditional independence for feature selection are applicable to all domain types, 2 e.g., discrete, ordinal, continuous with non-linear or arbitrary non-degenerate associations or mixed domains, as long as a reliable estimate of probabilistic independence is available. We denote probabilistic independence by the symbol “ ⊥ ” i.e., (X⊥ Y | Z) denotes the fact ⊥ ⊥ that the variables in set X are (jointly) conditionally independent from those in set Y given the values of the variables in set Z; (X⊥ Y | Z) denotes their conditional dependence. We assume ⊥ the existence of a probabilistic independence query oracle that is available to answer any query of the form (X, Y | Z), corresponding to the question “Is the set of variables in X independent of the variables in Y given the value of the variables in Z?” (This is similar to the approach of learning from statistical queries of Kearns and Vazirani [15].) In practice however, such an oracle does not exist, but can be approximated by a statistical independence test on a data set. Many tests of independence have appeared and studied extensively in the statistical literature over the last century; in this work we use the χ2 (chi-square) test of independence [16]. A Markov blanket of variable X is a set of variables such that, after fixing (by “knowing”) the value of all of its members, the set of remaining variables in the domain, taken together as a single setvalued variable, are statistically independent of X. More precisely, we have the following definition. Definition 1. A set of variables S ⊆ U is called a Markov blanket of variable X if and only if (X⊥ U − S − {X} | S). ⊥ Intuitively, a Markov blanket S of X captures all the information in the remaining domain variables U − S − {X} that can affect the probability distribution of X, making their value redundant as far as X is concerned (given S). The blanket therefore captures the essence of the feature selection problem for target variable X: By completely “shielding” X, a Markov blanket precludes the existence of any possible information about X that can come from variables not in the blanket, making it an ideal solution to the feature selection problem. A minimal Markov blanket is called a Markov boundary. Definition 2. A set of variables S ⊆ U − {X} is called a Markov boundary of variable X if it is a minimal Markov blanket of X i.e., none of its proper subsets is a Markov blanket. Pearl [14] proved that that the axioms of Symmetry, Decomposition, Weak Union, and Intersection are sufficient to guarantee a unique Markov boundary. These are shown below together with the axiom of Contraction. (Symmetry) (Decomposition) (Weak Union) (Contraction) (Intersection) (X⊥ Y | Z) =⇒ (Y⊥ X | Z) ⊥ ⊥ (X⊥ Y ∪ W | Z) =⇒ (X⊥ Y | Z) ∧ (X⊥ W | Z) ⊥ ⊥ ⊥ (X⊥ Y ∪ W | Z) =⇒ (X⊥ Y | Z ∪ W) ⊥ ⊥ (X⊥ Y | Z) ∧ (X⊥ W | Y ∪ Z) =⇒ (X⊥ Y ∪ W | Z) ⊥ ⊥ ⊥ (X⊥ Y | Z ∪ W) ∧ (X⊥ W | Z ∪ Y) =⇒ (X⊥ Y ∪ W | Z) ⊥ ⊥ ⊥ (1) The Symmetry, Decomposition, Contraction and Weak Union axioms are very general: they are necessary axioms for the probabilistic definition of independence i.e., they hold in every probability distribution, as their proofs are based on the axioms of probability theory. Intersection is not universal but it holds in distributions that are positive, i.e., any value combination of the domain variables has a non-zero probability of occurring. 4 The Markov Boundary Theorem According to Definition 2, a Markov boundary is a minimal Markov blanket. We first introduce a theorem that provides an alternative, equivalent definition of the concept of Markov boundary that we will relax later in the paper to produce a more general boundary definition. Theorem 1 (Markov Boundary Theorem). Assuming that the Decomposition and Contraction axioms hold, S ⊆ U − {X} is a Markov boundary of variable X ∈ U if and only if ∀ T ⊆ U − {X}, T ⊆ U − S ⇐⇒ (X⊥ T | S − T) . ⊥ (2) A detailed proof cannot be included here due to space constraints but a proof sketch appears in Appendix A. According to the above theorem, a Markov boundary S partitions the powerset of U − {X} into two parts: (a) set P1 that contains all subsets of U − S, and (b) set P2 containing the remaining subsets. All sets in P1 are conditionally independent of X, and all sets in P2 are conditionally dependent with X. Intuitively, the two directions of the logical equivalence relation of Eq. (2) correspond to the concept of Markov blanket and its minimality i.e., the equation ∀ T ⊆ U − {X}, T ⊆ U − S =⇒ (X⊥ T | S − T) ⊥ 3 Algorithm 1 The abstract GS(m) (X) algorithm. Returns an m-Markov boundary of X. 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: S←∅ /* Growing phase. */ for all Y ⊆ U − S − {X} such that 1 ≤ |Y| ≤ m do if (X ⊥ Y | S) then ⊥ S←S∪Y goto line 3 /* Restart loop. */ /* Shrinking phase. */ for all Y ∈ S do if (X⊥ Y | S − {Y }) then ⊥ S ← S − {Y } goto line 8 /* Restart loop. */ return S or, equivalently, (∀ T ⊆ U − S − {X}, (X⊥ T | S)) (as T and S are disjoint) corresponds to ⊥ the definition of Markov blanket, as it includes T = U − S − {X}. In the opposite direction, the contrapositive form is ∀ T ⊆ U − {X}, T ⊆ U − S =⇒ (X ⊥ T | S − T) . ⊥ This corresponds to the concept of minimality of the Markov boundary: It states that all sets that contain a part of S cannot be independent of X given the remainder of S. Informally, this is because if there existed some set T that contained a non-empty subset T′ of S such that (X⊥ T | S − T), ⊥ then one would be able to shrink S by T′ (by the property of Contraction) and therefore S would not be minimal (more details in Appendix A). 5 A Family of Algorithms for Arbitrary Domains Theorem 1 defines conditions that precisely characterize a Markov boundary and thus can be thought of as an alternative definition of a boundary. By relaxing these conditions we can produce a more general definition. In particular, an m-Markov boundary is defined as follows. Definition 3. A set of variables S ⊆ U − {X} of a domain U is called an m-Markov boundary of variable X ∈ U if and only if ∀ T ⊆ U − {X} such that |T| ≤ m, T ⊆ U − S ⇐⇒ (X⊥ T | S − T) . ⊥ We call the parameter m of an m-Markov boundary the Markov boundary margin. Intuitively, an m-boundary S guarantees that (a) all subsets of its complement (excluding X) of size m or smaller are independent of X given S, and (b) all sets T of size m or smaller that are not subsets of its complement are dependent of X given the part of S that is not contained in T. This definition is a special case of the properties of a boundary stated in Theorem 1, with each set T mentioned in the theorem now restricted to having size m or smaller. For m = n − 1, where n = |U |, the condition |T| ≤ m is always satisfied and can be omitted; in this case the definition of an (n − 1)-Markov boundary results in exactly Eq. (2) of Theorem 1. We now present an algorithm called GS(m) , shown in Algorithm 1, that provably correctly learns an m-boundary of a target variable X. GS(m) operates in two phases, a growing and a shrinking phase (hence the acronym). During the growing phase it examines sets of variables of size up to m, where m is a user-specified parameter. During the shrinking phase, single variables are examined for conditional independence and possible removal from S (examining sets in the shrinking phase is not necessary for provably correct operation—see Appendix B). The orders of examination of the sets for possible addition and deletion from the candidate boundary are left intentionally unspecified in Algorithm 1—one can therefore view it as an abstract representative of a family of algorithms, with each member specifying one such ordering. All members of this family are m-correct, as the proof of correctness does not depend on the ordering. In practice numerous choices for the ordering exist; one possibility is to examine the sets in the growing phase in order of increasing set size and, for each such size, in order of decreasing conditional mutual information I(X, Y, S) between X and Y given S. The rationale for this heuristic choice is that (usually) tests with smaller conditional sets tend to be more reliable, and sorting by mutual information tends to lessen the chance of adding false members of the Markov boundary. We used this implementation in all our experiments, presented later in Section 6. Intuitively, the margin represents a trade-off between sample and computational complexity and completeness: For m = n − 1 = |U| − 1, the algorithm returns a Markov boundary in unrestricted 4 Algorithm 2 The RGS(m,k) (X) algorithm, a randomized anytime version of the GS(m) algorithm, utilizing k random subsets for the growing phase. 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: S←∅ /* Growing phase. */ repeat Schanged ← false Y ← subset of U − S − {X} of size 1 ≤ |Y| ≤ m of maximum dependence out of k random subsets if (X ⊥ Y | S) then ⊥ S←S∪Y Schanged ← true until Schanged = false /* Shrinking phase. */ for all Y ∈ S do if (X⊥ Y | S − {Y }) then ⊥ S ← S − {Y } goto line 11 /* Restart loop. */ return S (arbitrary) domains. For 1 ≤ m < n − 1, GS(m) may recover the correct boundary depending on characteristics of the domain. For example, it will recover the correct boundary in domains containing embedded parity functions such that the number of variables involved in every k-bit parity function is m + 1 or less i.e., if k ≤ m + 1 (parity functions are corner cases in the space of probability distributions that are known to be hard to learn [17]). The proof of m-correctness of GS(m) is included in Appendix B. Note that it is based on Theorem 1 and the universal axioms of Eqs. (1) only i.e., Intersection is not needed, and thus it is widely applicable (to any domain). A Practical Randomized Anytime Version While GS(m) is provably correct even in difficult domains such as those that contain parity functions, it may be impractical with a large number of features as its asymptotic complexity is O(nm ). We therefore also we here provide a more practical randomized version called RGS(m,k) (Randomized GS(m) ), shown in Algorithm 2. The RGS(m,k) algorithm has an additional parameter k that limits its computational requirements: instead of exhaustively examining all possible subsets of (U −S−{X}) (as GS(m) does), it instead samples k subsets from the set of all possible subsets of (U − S − {X}), where k is user-specified. It is therefore a randomized algorithm that becomes equivalent to GS(m) given a large enough k. Many possibilities for the method of random selection of the subsets exist; in our experiments we select a subset Y = {Yi } (1 ≤ |Y| ≤ m) with probability proportional |Y| to i=1 (1/p(X, Yi | S)), where p(X, Yi | S) is the p-value of the corresponding (univariate) test between X and Yi given S, which has a low computational cost. The RGS(m,k) algorithm is useful in situations where the amount of time to produce an answer may be limited and/or the limit unknown beforehand: it is easy to show that the growing phase of GS(m) produces an an upper-bound of the m-boundary of X. Therefore, the RGS(m,k) algorithm, if interrupted, will return an approximation of this upper bound. Moreover, if there exists time for the shrinking phase to be executed (which conducts a number of tests linear in n and is thus fast), extraneous variables will be removed and a minimal blanket (boundary) approximation will be returned. These features make it an anytime algorithm, which is a more appropriate choice for situations where critical events may occur that require the interruption of computation, e.g., during the planning phase of a robot, which may be interrupted at any time due to an urgent external event that requires a decision to be made based on the present state’s feature values. 6 Experiments We evaluated the GS(m) and the RGS(m,k) algorithms on synthetic as well as real-world and benchmark data sets. We first systematically examined the performance on the task of recovering near-parity functions, which are known to be hard to learn [17]. We compared GS(m) and RGS(m,k) with respect to accuracy of recovery of the original boundary as well as computational cost. We generated domains of sizes ranging from 10 to 100 variables, of which 4 variables (X1 to X4 ) were related through a near-parity relation with bit probability 0.60 and various degrees of noise. The remaining independent variables (X5 to Xn ) act as “dis5 GS(1) GS(3) RGS(1, 1000) 0 0.05 RGS(3, 1000) Relieved, threshold = 0.001 Relieved, threshold = 0.03 0.1 0.15 0.2 0.25 Noise probability 0.3 0.35 0.4 Probabilistic isolation performance of GS(m) and RELIEVED Probabilistic isolation performance of GS(m) and RGS(m ,k) Real-world and benchmark data sets 1 Data set Balance scale Balloons Car evaluation Credit screening Monks Nursery Tic-tac-toe Breast cancer Chess Audiology 0.8 0.6 0.4 0.2 0 0 0.2 0.4 (m = 3) GS 0.6 0.8 1 Real-world and benchmark data sets RGS(m = 3, k = 300) average isolation measure F1 measure 50 variables, true Markov boundary size = 3 Bernoulli probability = 0.6, 1000 data points RELIEVED(threshold = 0.03) average isolation measure F1 measure of GS(m ), RGS(m, k ) and RELIEVED vs. noise level 1.3 1.2 1.1 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 1 Data set Balance scale Balloons Car evaluation Credit screening Monks Nursery Tic-tac-toe Breast cancer Chess Audiology 0.8 0.6 0.4 0.2 0 0 0.2 0.4 (m = 3) average isolation measure GS 0.6 0.8 1 average isolation measure Figure 2: Left: F1 measure of GS(m) , RGS(m,k) and RELIEVED under increasing amounts of noise. Middle: Probabilistic isolation performance comparison between GS(3) and RELIEVED on real-world and benchmark data sets. Right: Same for GS(3) and RGS(3,1000) . tractors” and had randomly assigned probabilities i.e., the correct boundary of X1 is B1 = {X2 , X3 , X4 }. In such domains, learning the boundary of X1 is difficult because of the large number of distractors and because each Xi ∈ B1 is independent of X1 given any proper subset of B1 − {Xi } (they only become dependent when including all of them in the conditioning set). To measure an algorithm’s feature selection performance, acF -measure of GS and RGS vs. domain size curacy (fraction of variables correctly included or excluded) is inappropriate as the accuracy of trivial algorithms such as returning the empty set will tend to 1 as n increases. Precision and recall are therefore more appropriate, with precision defined as the fraction of features returned that are in the correct boundary (3 features for X1 ), and recall as the fraction of the features present in the correct boundary that are returned by the algorithm. A convenient and frequently used Number of domain variables measure that combines precision and recall is the F1 meaRunning time of GS and RGS vs. domain size sure, defined as the harmonic mean of precision and recall [18]. In Fig. 1 (top) we report 95% confidence intervals for the F1 measure and execution time of GS(m) (margins m = 1 to 3) and RGS(m,k) (margins 1 to 3 and k = 1000 random subsets), using 20 data sets containing 10 to 100 variables, with the target variable X1 was perturbed (inverted) by noise with 10% probability. As can be seen, the RGS(m,k) and GS(m) using the same value for margin perform comparably Number of domain variables with respect to F1 , up to their 95% confidence intervals. With Figure 1: GS(m) and RGS(m,k) per(m,k) respect to execution time however RGS exhibits much formance with respect to domain greater scalability (Fig. 1 bottom, log scale); for example, it size (number of variables). Top: F1 executes in about 10 seconds on average in domains containmeasure, reflecting accuracy. Bot(m) ing 100 variables, while GS executes in 1,000 seconds on tom: Execution time in seconds (log average for this domain size. scale). We also compared GS(m) and RGS(m,k) to RELIEF [5], a well-known algorithm for feature selection that is known to be able to recover parity functions in certain cases [5]. RELIEF learns a weight for each variable and compares it to a threshold τ to decide on its inclusion in the set of relevant variables. As it has been reported [9] that RELIEF can exhibit large variance due to randomization that is necessary only for very large data sets, we instead used a deterministic variant called RELIEVED [9], whose behavior corresponds to RELIEF at the limit of infinite execution time. We calculated the F1 measure for GS(m) , RGS(m,k) and RELIEVED in the presence of varying amounts of noise, with noise probability ranging from 0 (no noise) to 0.4. We used domains containing 50 variables, as GS(m) becomes computationally demanding in larger domains. In Figure 2 (left) we show the performance of GS(m) and RGS(m,k) for m equal to 1 and 3, k = 1000 and RELIEVED for thresholds τ = 0.01 and 0.03 for various amounts of noise on the target variable. Again, each experiment was repeated 20 times to generate 95% confidence intervals. We can observe that even though m = 1 (equivalent to the GS algorithm) performs poorly, increasing the margin m makes it more likely to recover the correct Markov boundary, and GS(3) (m = 3) recovers the exact blanket even with few (1,000) data points. RELIEVED does comparably to GS(3) for little noise and for a large threshold, (m ) (m, k ) 1 True Markov boundary size = 3, 1000 data points Bernoulli probability = 0.6, noise probability = 0.1 1 0.9 GS(1) GS(2) GS(3) RGS(1, 1000) (2, 1000) RGS (3, 1000) RGS 0.8 F1-measure 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 10 20 30 40 50 60 (m ) 70 80 90 100 90 100 (m, k ) True Markov boundary size = 3, 1000 data points Bernoulli probability = 0.6, noise probability = 0.1 10000 GS(1) GS(2) GS(3) Execution time (sec) 1000 RGS(1, 1000) RGS(2, 1000) RGS(3, 1000) 100 10 1 0.1 0.01 10 6 20 30 40 50 60 70 80 but appears to deteriorate for more noisy domains. As we can see it is difficult to choose the “right” threshold for RELIEVED—better performing τ at low noise can become worse in noisy environments; in particular, small τ tend to include irrelevant variables while large τ tend to miss actual members. We also evaluated GS(m) , RGS(m,k) , and RELIEVED on benchmark and real-world data sets from the UCI Machine Learning repository. As the true Markov boundary for these is impossible to know, we used as performance measure a measure of probabilistic isolation by the Markov boundary returned of subsets outside the boundary. For each domain variable X, we measured the independence of subsets Y of size 1, 2 and 3 given the blanket S of X returned by GS(3) and RELIEVED for τ = 0.03 (as this value seemed to do better in the previous set of experiments), as measured by the average p-value of the χ2 test between X and Y given S (with p-values of 0 and 1 indicating ideal dependence and independence, respectively). Due to the large number of subsets outside the boundary when the boundary is small, we limited the estimation of isolation performance to 2,000 subsets per variable. We plot the results in Figure 2 (middle and right). Each point represents a variable in the corresponding data set. Points under the diagonal indicate better probabilistic isolation performance for that variable for GS(3) compared to RELIEVED (middle plot) or to RGS(3,1000) (right plot). To obtain a statistically significant comparison, we used the non-parametric Wilcoxon paired signed-rank test, which indicated that GS(3) RGS(3,1000) are statistically equivalent to each other, while both outperformed RELIEVED at the 99.99% significance level (α < 10−7 ). 7 Conclusion In this paper we presented algorithms for the problem of feature selection in unrestricted (arbitrary distribution) domains that may contain complex interactions that only appear when the values of multiple features are considered together. We introduced two algorithms: an exact, provably correct one as well a more practical randomized anytime version, and evaluated them on on artificial, benchmark and real-world data, demonstrating that they perform well, even in the presence of noise. We also introduced the Markov Boundary Theorem that precisely characterizes the properties of a boundary, and used it to prove m-correctness of the exact family of algorithms presented. We made minimal assumptions that consist of only a general set of axioms that hold for every probability distribution, giving our algorithms universal applicability. Appendix A: Proof sketch of the Markov Boundary Theorem Proof sketch. (=⇒ direction) We need to prove that if S is a Markov boundary of X then (a) for every set T ⊆ U − S − {X}, (X⊥ T | S − T), and (b) for every set T′ ⊆ U − S that does not ⊥ contain X, (X ⊥ T′ | S − T′ ). Case (a) is immediate from the definition of the boundary and the ⊥ Decomposition theorem. Case (b) can be proven by contradiction: Assuming the independence of T′ that contains a non-empty part T′ in S and a part T′ in U − S, we get (from Decomposition) 1 2 (X⊥ T′ | S − T′ ). We can then use Contraction to show that the set S − T′ satisfies the inde⊥ 1 1 1 pendence property of a Markov boundary, i.e., that (X⊥ U − (S − T′ ) − {X} | S − T′ ), which ⊥ 1 1 contradicts the assumption that S is a boundary (and thus minimal). (⇐= direction) We need to prove that if Eq. (2) holds, then S is a minimal Markov blanket. The proof that S is a blanket is immediate. We can prove minimality by contradiction: Assume S = S1 ∪ S2 with S1 a blanket and S2 = ∅ i.e., S1 is a blanket strictly smaller than S. Then (X⊥ S2 | ⊥ S1 ) = (X⊥ S2 | S − S2 ). However, since S2 ⊆ U − S, from Eq. (2) we get (X ⊥ S2 | S − S2 ), ⊥ ⊥ which is a contradiction. Appendix B: Proof of m-Correctness of GS(m) Let the value of the set S at the end of the growing phase be SG , its value at the end of the shrinking phase SS , and their difference S∆ = SG − SS . The following two observations are immediate. Observation 1. For every Y ⊆ U − SG − {X} such that 1 ≤ |Y| ≤ m, (X⊥ Y | SG ). ⊥ Observation 2. For every Y ∈ SS , (X ⊥ Y | SS − {Y }). ⊥ Lemma 2. Consider variables Y1 , Y2 , . . . , Yt for some t ≥ 1 and let Y = {Yj }t . Assuming that j=1 Contraction holds, if (X⊥ Yi | S − {Yj }i ) for all i = 1, . . . , t, then (X⊥ Y | S − Y). ⊥ ⊥ j=1 Proof. By induction on Yj , j = 1, 2, . . . , t, using Contraction to decrease the conditioning set S t down to S − {Yj }i j=1 for all i = 1, 2, . . . , t. Since Y = {Yj }j=1 , we immediately obtain the desired relation (X⊥ Y | S − Y). ⊥ 7 Lemma 2 can be used to show that the variables found individually independent of X during the shrinking phase are actually jointly independent of X, given the final set SS . Let S∆ = {Y1 , Y2 , . . . , Yt } be the set of variables removed (in that order) from SG to form the final set SS i.e., S∆ = SG − SS . Using the above lemma, the following is immediate. Corollary 3. Assuming that the Contraction axiom holds, (X⊥ S∆ | SS ). ⊥ Lemma 4. If the Contraction, Decomposition and Weak Union axioms hold, then for every set T ⊆ U − SG − {X} such that (X⊥ T | SG ), ⊥ (X⊥ T ∪ (SG − SS ) | SS ). ⊥ (3) Furthermore SS is minimal i.e., there does not exist a subset of SS for which Eq. (3) is true. Proof. From Corollary 3, (X⊥ S∆ | SS ). Also, by the hypothesis, (X⊥ T | SG ) = (X⊥ T | ⊥ ⊥ ⊥ SS ∪ S∆ ), where S∆ = SG − SS as usual. From these two relations and Contraction we obtain (X⊥ T ∪ S∆ | SS ). ⊥ To prove minimality, let us assume that SS = ∅ (if SS = ∅ then it is already minimal). We prove by contradiction: Assume that there exists a set S′ ⊂ SS such that (X⊥ T ∪ (SG − S′ ) | S′ ). Let ⊥ W = SS − S′ = ∅. Note that W and S′ are disjoint. We have that SS ⊆ SS ∪ S∆ =⇒ SS − S′ ⊆ SS ∪ S∆ − S′ ⊆ T ∪ (SS ∪ S∆ − S′ ) =⇒ W ⊆ T ∪ (SS ∪ S∆ − S′ ) = T ∪ (SG − S′ ) • Since (X⊥ T ∪ (SG − S′ ) | S′ ) and W ⊆ T ∪ (SS ∪ S∆ − S′ ), from Decomposition we ⊥ get (X⊥ W | S′ ). ⊥ • From (X⊥ W | S′ ) and Weak Union we have that for every Y ∈ W, (X⊥ Y | S′ ∪ ⊥ ⊥ (W − {Y })). • Since S′ and W are disjoint and since Y ∈ W, Y ∈ S′ . Applying the set equality (A − B) ∪ C = (A ∪ B) − (A − C) to S′ ∪ (W − {Y }) we obtain S′ ∪ W − ({Y } − S′ ) = SS − {Y }. • Therefore, ∀ Y ∈ W, (X⊥ Y | SS − {Y }). ⊥ However, at the end of the shrinking phase, all variables Y in SS (and therefore in W, as W ⊆ SS ) have been evaluated for independence and found dependent (Observation 2). Thus, since W = ∅, there exists at least one Y such that (X ⊥ Y | SS − {Y }), producing a contradiction. ⊥ Theorem 5. Assuming that the Contraction, Decomposition, and Weak Union axioms hold, Algorithm 1 is m-correct with respect to X. Proof. We use the Markov Boundary Theorem. We first prove that ∀ T ⊆ U − {X} such that |T| ≤ m, T ⊆ U − SS =⇒ (X⊥ T | SS − T) ⊥ or, equivalently, ∀ T ⊆ U − SS − {X} such that |T| ≤ m, (X⊥ T | SS ). ⊥ Since U − SS − {X} = S∆ ∪ (U − SG − {X}), S∆ and U − SG − {X} are disjoint, there are three kinds of sets of size m or less to consider: (i) all sets T ⊆ S∆ , (ii) all sets T ⊆ U − SG − {X}, and (iii) all sets (if any) T = T′ ∪ T′′ , T′ ∩ T′′ = ∅, that have a non-empty part T′ ⊆ S∆ and a non-empty part T′′ ⊆ U − SG − {X}. (i) From Corollary 3, (X⊥ S∆ | SS ). Therefore, from Decomposition, for any set T ⊆ S∆ , ⊥ (X⊥ T | SS ). ⊥ (ii) By Observation 1, for every set T ⊆ U − SG − {X} such that |T| ≤ m, (X⊥ T | SG ). ⊥ By Lemma 4 we get (X⊥ T ∪ S∆ | SS ), from which we obtain (X⊥ T | SS ) by ⊥ ⊥ Decomposition. (iii) Since |T| ≤ m, we have that |T′′ | ≤ m. Since T′′ ⊆ U − SG − {X}, by Observation 1, (X⊥ T′′ | SG ). Therefore, by Lemma 4, (X⊥ T′′ ∪ S∆ | SS ). Since T′ ⊆ S∆ ⇒ ⊥ ⊥ T′′ ∪ T′ ⊆ T′′ ∪ S∆ , by Decomposition to obtain (X⊥ T′′ ∪ T′ | SS ) = (X⊥ T | SS ). ⊥ ⊥ To complete the proof we need to prove that ∀ T ⊆ U − {X} such that |T| ≤ m, T ⊆ U − SS =⇒ (X ⊥ T | SS − T) . ⊥ Let T = T1 ∪ T2 , with T1 ⊆ SS and T2 ⊆ U − SS . Since T ⊆ U − SS , T1 contains at least one variable Y ∈ SS . From Observation 2, (X ⊥ Y | SS − {Y }). From this and (the contrapositive of) ⊥ Weak Union, we get (X ⊥ {Y } ∪ (T1 − {Y }) | SS − {Y } − (T1 − {Y })) = (X ⊥ T1 | SS − T1 ). ⊥ ⊥ From (the contrapositive of) Decomposition we get (X ⊥ T1 ∪ T2 | SS − T1 ) = (X ⊥ T | ⊥ ⊥ SS − T1 ), which is equal to (X ⊥ T | SS − T1 − T2 ) = (X ⊥ T | SS − T) as SS and T2 are ⊥ ⊥ disjoint. 8 References [1] Isabelle Guyon and Andr´ Elisseeff. An introduction to variable and feature selection. Journal e of Machine Learning Research, 3:1157–1182, 2003. [2] Daphne Koller and Mehran Sahami. Toward optimal feature selection. In Proceedings of the Tenth International Conference on Machine Learning (ICML), pages 284–292, 1996. [3] P. M. Narendra and K. Fukunaga. A branch and bound algorithm for feature subset selection. IEEE Transactions on Computers, C-26(9):917–922, 1977. [4] H. Almuallim and T. G. Dietterich. Learning with many irrelevant features. In Proceedings of the National Conference on the Americal Association for Artifical Intelligence (AAAI), 1991. [5] K. Kira and L. A. Rendell. The feature selection problem: Traditional methods and a new algorithm. In Proceedings of the National Conference on the Americal Association for Artifical Intelligence (AAAI), pages 129–134, 1992. [6] M. Dash and H. Liu. Feature selection for classification. Intelligent Data Analysis, 1(3): 131–156, 1997. [7] Huan Liu and Hiroshi Motoda, editors. Feature Extraction, Construction and Selection: A Data Mining Perspective, volume 453 of The Springer International Series in Engineering and Computer Science. 1998. [8] Avrim Blum and Pat Langley. Selection of relevant features and examples in machine learning. Artificial Intelligence, 97(1-2):245–271, 1997. [9] R. Kohavi and G. H. John. Wrappers for feature subset selection. Artificial Intelligence, 97 (1-2):273–324, 1997. [10] Dimitris Margaritis and Sebastian Thrun. Bayesian network induction via local neighborhoods. In Advances in Neural Information Processing Systems 12 (NIPS), 2000. [11] I. Tsamardinos, C. Aliferis, and A. Statnikov. Algorithms for large scale Markov blanket discovery. In Proceedings of the 16th International FLAIRS Conference, 2003. [12] I. Tsamardinos, C. Aliferis, and A. Statnikov. Time and sample efficient discovery of Markov blankets and direct causal relations. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 673–678, 2003. [13] N. Meinshausen and P. B¨ hlmann. High-dimensional graphs and variable selection with the u Lasso. Annals of Statistics, 34:1436–1462, 2006. [14] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. 1988. [15] Michael Kearns and Umesh V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994. [16] A. Agresti. Categorical Data Analysis. John Wiley and Sons, 1990. [17] M. Kearns. Efficient noise-tolerant learning from statistical queries. J. ACM, 45(6):983–1006, 1998. [18] C. J. van Rijsbergen. Information Retrieval. Butterworth-Heinemann, London, 1979. 9
3 0.59862137 141 nips-2009-Local Rules for Global MAP: When Do They Work ?
Author: Kyomin Jung, Pushmeet Kohli, Devavrat Shah
Abstract: We consider the question of computing Maximum A Posteriori (MAP) assignment in an arbitrary pair-wise Markov Random Field (MRF). We present a randomized iterative algorithm based on simple local updates. The algorithm, starting with an arbitrary initial assignment, updates it in each iteration by first, picking a random node, then selecting an (appropriately chosen) random local neighborhood and optimizing over this local neighborhood. Somewhat surprisingly, we show that this algorithm finds a near optimal assignment within n log 2 n iterations with high probability for any n node pair-wise MRF with geometry (i.e. MRF graph with polynomial growth) with the approximation error depending on (in a reasonable manner) the geometric growth rate of the graph and the average radius of the local neighborhood – this allows for a graceful tradeoff between the complexity of the algorithm and the approximation error. Through extensive simulations, we show that our algorithm finds extremely good approximate solutions for various kinds of MRFs with geometry.
4 0.51907593 143 nips-2009-Localizing Bugs in Program Executions with Graphical Models
Author: Laura Dietz, Valentin Dallmeier, Andreas Zeller, Tobias Scheffer
Abstract: We devise a graphical model that supports the process of debugging software by guiding developers to code that is likely to contain defects. The model is trained using execution traces of passing test runs; it reflects the distribution over transitional patterns of code positions. Given a failing test case, the model determines the least likely transitional pattern in the execution trace. The model is designed such that Bayesian inference has a closed-form solution. We evaluate the Bernoulli graph model on data of the software projects AspectJ and Rhino. 1
5 0.51123631 23 nips-2009-Accelerating Bayesian Structural Inference for Non-Decomposable Gaussian Graphical Models
Author: Baback Moghaddam, Emtiyaz Khan, Kevin P. Murphy, Benjamin M. Marlin
Abstract: We make several contributions in accelerating approximate Bayesian structural inference for non-decomposable GGMs. Our first contribution is to show how to efficiently compute a BIC or Laplace approximation to the marginal likelihood of non-decomposable graphs using convex methods for precision matrix estimation. This optimization technique can be used as a fast scoring function inside standard Stochastic Local Search (SLS) for generating posterior samples. Our second contribution is a novel framework for efficiently generating large sets of high-quality graph topologies without performing local search. This graph proposal method, which we call “Neighborhood Fusion” (NF), samples candidate Markov blankets at each node using sparse regression techniques. Our third contribution is a hybrid method combining the complementary strengths of NF and SLS. Experimental results in structural recovery and prediction tasks demonstrate that NF and hybrid NF/SLS out-perform state-of-the-art local search methods, on both synthetic and real-world datasets, when realistic computational limits are imposed.
6 0.49094096 90 nips-2009-Factor Modeling for Advertisement Targeting
7 0.46398616 148 nips-2009-Matrix Completion from Power-Law Distributed Samples
8 0.42620796 120 nips-2009-Kernels and learning curves for Gaussian process regression on random graphs
9 0.37779886 122 nips-2009-Label Selection on Graphs
10 0.37001759 103 nips-2009-Graph Zeta Function in the Bethe Free Energy and Loopy Belief Propagation
11 0.36121416 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models
12 0.35888839 245 nips-2009-Thresholding Procedures for High Dimensional Variable Selection and Statistical Estimation
13 0.35277879 82 nips-2009-Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification
14 0.3487629 206 nips-2009-Riffled Independence for Ranked Data
15 0.34648827 95 nips-2009-Fast subtree kernels on graphs
16 0.34156886 34 nips-2009-Anomaly Detection with Score functions based on Nearest Neighbor Graphs
17 0.34130877 69 nips-2009-Discrete MDL Predicts in Total Variation
18 0.32536927 180 nips-2009-On the Convergence of the Concave-Convex Procedure
19 0.31697771 205 nips-2009-Rethinking LDA: Why Priors Matter
20 0.31636673 56 nips-2009-Conditional Neural Fields
topicId topicWeight
[(24, 0.059), (25, 0.077), (31, 0.313), (35, 0.048), (36, 0.091), (39, 0.05), (57, 0.015), (58, 0.085), (61, 0.014), (71, 0.044), (81, 0.023), (86, 0.092), (91, 0.013)]
simIndex simValue paperId paperTitle
1 0.9169271 183 nips-2009-Optimal context separation of spiking haptic signals by second-order somatosensory neurons
Author: Romain Brasselet, Roland Johansson, Angelo Arleo
Abstract: We study an encoding/decoding mechanism accounting for the relative spike timing of the signals propagating from peripheral nerve fibers to second-order somatosensory neurons in the cuneate nucleus (CN). The CN is modeled as a population of spiking neurons receiving as inputs the spatiotemporal responses of real mechanoreceptors obtained via microneurography recordings in humans. The efficiency of the haptic discrimination process is quantified by a novel definition of entropy that takes into full account the metrical properties of the spike train space. This measure proves to be a suitable decoding scheme for generalizing the classical Shannon entropy to spike-based neural codes. It permits an assessment of neurotransmission in the presence of a large output space (i.e. hundreds of spike trains) with 1 ms temporal precision. It is shown that the CN population code performs a complete discrimination of 81 distinct stimuli already within 35 ms of the first afferent spike, whereas a partial discrimination (80% of the maximum information transmission) is possible as rapidly as 15 ms. This study suggests that the CN may not constitute a mere synaptic relay along the somatosensory pathway but, rather, it may convey optimal contextual accounts (in terms of fast and reliable information transfer) of peripheral tactile inputs to downstream structures of the central nervous system. 1
2 0.87646961 261 nips-2009-fMRI-Based Inter-Subject Cortical Alignment Using Functional Connectivity
Author: Bryan Conroy, Ben Singer, James Haxby, Peter J. Ramadge
Abstract: The inter-subject alignment of functional MRI (fMRI) data is important for improving the statistical power of fMRI group analyses. In contrast to existing anatomically-based methods, we propose a novel multi-subject algorithm that derives a functional correspondence by aligning spatial patterns of functional connectivity across a set of subjects. We test our method on fMRI data collected during a movie viewing experiment. By cross-validating the results of our algorithm, we show that the correspondence successfully generalizes to a secondary movie dataset not used to derive the alignment. 1
same-paper 3 0.79192811 256 nips-2009-Which graphical models are difficult to learn?
Author: Andrea Montanari, Jose A. Pereira
Abstract: We consider the problem of learning the structure of Ising models (pairwise binary Markov random fields) from i.i.d. samples. While several methods have been proposed to accomplish this task, their relative merits and limitations remain somewhat obscure. By analyzing a number of concrete examples, we show that low-complexity algorithms systematically fail when the Markov random field develops long-range correlations. More precisely, this phenomenon appears to be related to the Ising model phase transition (although it does not coincide with it). 1 Introduction and main results Given a graph G = (V = [p], E), and a positive parameter θ > 0 the ferromagnetic Ising model on G is the pairwise Markov random field µG,θ (x) = 1 ZG,θ eθxi xj (1) (i,j)∈E over binary variables x = (x1 , x2 , . . . , xp ). Apart from being one of the most studied models in statistical mechanics, the Ising model is a prototypical undirected graphical model, with applications in computer vision, clustering and spatial statistics. Its obvious generalization to edge-dependent parameters θij , (i, j) ∈ E is of interest as well, and will be introduced in Section 1.2.2. (Let us stress that we follow the statistical mechanics convention of calling (1) an Ising model for any graph G.) In this paper we study the following structural learning problem: Given n i.i.d. samples x(1) , x(2) ,. . . , x(n) with distribution µG,θ ( · ), reconstruct the graph G. For the sake of simplicity, we assume that the parameter θ is known, and that G has no double edges (it is a ‘simple’ graph). The graph learning problem is solvable with unbounded sample complexity, and computational resources [1]. The question we address is: for which classes of graphs and values of the parameter θ is the problem solvable under appropriate complexity constraints? More precisely, given an algorithm Alg, a graph G, a value θ of the model parameter, and a small δ > 0, the sample complexity is defined as nAlg (G, θ) ≡ inf n ∈ N : Pn,G,θ {Alg(x(1) , . . . , x(n) ) = G} ≥ 1 − δ , (2) where Pn,G,θ denotes probability with respect to n i.i.d. samples with distribution µG,θ . Further, we let χAlg (G, θ) denote the number of operations of the algorithm Alg, when run on nAlg (G, θ) samples.1 1 For the algorithms analyzed in this paper, the behavior of nAlg and χAlg does not change significantly if we require only ‘approximate’ reconstruction (e.g. in graph distance). 1 The general problem is therefore to characterize the functions nAlg (G, θ) and χAlg (G, θ), in particular for an optimal choice of the algorithm. General bounds on nAlg (G, θ) have been given in [2, 3], under the assumption of unbounded computational resources. A general charactrization of how well low complexity algorithms can perform is therefore lacking. Although we cannot prove such a general characterization, in this paper we estimate nAlg and χAlg for a number of graph models, as a function of θ, and unveil a fascinating universal pattern: when the model (1) develops long range correlations, low-complexity algorithms fail. Under the Ising model, the variables {xi }i∈V become strongly correlated for θ large. For a large class of graphs with degree bounded by ∆, this phenomenon corresponds to a phase transition beyond some critical value of θ uniformly bounded in p, with typically θcrit ≤ const./∆. In the examples discussed below, the failure of low-complexity algorithms appears to be related to this phase transition (although it does not coincide with it). 1.1 A toy example: the thresholding algorithm In order to illustrate the interplay between graph structure, sample complexity and interaction strength θ, it is instructive to consider a warmup example. The thresholding algorithm reconstructs G by thresholding the empirical correlations Cij ≡ 1 n n (ℓ) (ℓ) xi xj for i, j ∈ V . ℓ=1 (3) T HRESHOLDING( samples {x(ℓ) }, threshold τ ) 1: Compute the empirical correlations {Cij }(i,j)∈V ×V ; 2: For each (i, j) ∈ V × V 3: If Cij ≥ τ , set (i, j) ∈ E; We will denote this algorithm by Thr(τ ). Notice that its complexity is dominated by the computation of the empirical correlations, i.e. χThr(τ ) = O(p2 n). The sample complexity nThr(τ ) can be bounded for specific classes of graphs as follows (the proofs are straightforward and omitted from this paper). Theorem 1.1. If G has maximum degree ∆ > 1 and if θ < atanh(1/(2∆)) then there exists τ = τ (θ) such that 2p 8 log nThr(τ ) (G, θ) ≤ . (4) 1 δ (tanh θ − 2∆ )2 Further, the choice τ (θ) = (tanh θ + (1/2∆))/2 achieves this bound. Theorem 1.2. There exists a numerical constant K such that the following is true. If ∆ > 3 and θ > K/∆, there are graphs of bounded degree ∆ such that for any τ , nThr(τ ) = ∞, i.e. the thresholding algorithm always fails with high probability. These results confirm the idea that the failure of low-complexity algorithms is related to long-range correlations in the underlying graphical model. If the graph G is a tree, then correlations between far apart variables xi , xj decay exponentially with the distance between vertices i, j. The same happens on bounded-degree graphs if θ ≤ const./∆. However, for θ > const./∆, there exists families of bounded degree graphs with long-range correlations. 1.2 More sophisticated algorithms In this section we characterize χAlg (G, θ) and nAlg (G, θ) for more advanced algorithms. We again obtain very distinct behaviors of these algorithms depending on long range correlations. Due to space limitations, we focus on two type of algorithms and only outline the proof of our most challenging result, namely Theorem 1.6. In the following we denote by ∂i the neighborhood of a node i ∈ G (i ∈ ∂i), and assume the degree / to be bounded: |∂i| ≤ ∆. 1.2.1 Local Independence Test A recurring approach to structural learning consists in exploiting the conditional independence structure encoded by the graph [1, 4, 5, 6]. 2 Let us consider, to be definite, the approach of [4], specializing it to the model (1). Fix a vertex r, whose neighborhood we want to reconstruct, and consider the conditional distribution of xr given its neighbors2 : µG,θ (xr |x∂r ). Any change of xi , i ∈ ∂r, produces a change in this distribution which is bounded away from 0. Let U be a candidate neighborhood, and assume U ⊆ ∂r. Then changing the value of xj , j ∈ U will produce a noticeable change in the marginal of Xr , even if we condition on the remaining values in U and in any W , |W | ≤ ∆. On the other hand, if U ∂r, then it is possible to find W (with |W | ≤ ∆) and a node i ∈ U such that, changing its value after fixing all other values in U ∪ W will produce no noticeable change in the conditional marginal. (Just choose i ∈ U \∂r and W = ∂r\U ). This procedure allows us to distinguish subsets of ∂r from other sets of vertices, thus motivating the following algorithm. L OCAL I NDEPENDENCE T EST( samples {x(ℓ) }, thresholds (ǫ, γ) ) 1: Select a node r ∈ V ; 2: Set as its neighborhood the largest candidate neighbor U of size at most ∆ for which the score function S CORE(U ) > ǫ/2; 3: Repeat for all nodes r ∈ V ; The score function S CORE( · ) depends on ({x(ℓ) }, ∆, γ) and is defined as follows, min max W,j xi ,xW ,xU ,xj |Pn,G,θ {Xi = xi |X W = xW , X U = xU }− Pn,G,θ {Xi = xi |X W = xW , X U \j = xU \j , Xj = xj }| . (5) In the minimum, |W | ≤ ∆ and j ∈ U . In the maximum, the values must be such that Pn,G,θ {X W = xW , X U = xU } > γ/2, Pn,G,θ {X W = xW , X U \j = xU \j , Xj = xj } > γ/2 Pn,G,θ is the empirical distribution calculated from the samples {x(ℓ) }. We denote this algorithm by Ind(ǫ, γ). The search over candidate neighbors U , the search for minima and maxima in the computation of the S CORE(U ) and the computation of Pn,G,θ all contribute for χInd (G, θ). Both theorems that follow are consequences of the analysis of [4]. Theorem 1.3. Let G be a graph of bounded degree ∆ ≥ 1. For every θ there exists (ǫ, γ), and a numerical constant K, such that 2p 100∆ , χInd(ǫ,γ) (G, θ) ≤ K (2p)2∆+1 log p . nInd(ǫ,γ) (G, θ) ≤ 2 4 log ǫ γ δ More specifically, one can take ǫ = 1 4 sinh(2θ), γ = e−4∆θ 2−2∆ . This first result implies in particular that G can be reconstructed with polynomial complexity for any bounded ∆. However, the degree of such polynomial is pretty high and non-uniform in ∆. This makes the above approach impractical. A way out was proposed in [4]. The idea is to identify a set of ‘potential neighbors’ of vertex r via thresholding: B(r) = {i ∈ V : Cri > κ/2} , (6) For each node r ∈ V , we evaluate S CORE(U ) by restricting the minimum in Eq. (5) over W ⊆ B(r), and search only over U ⊆ B(r). We call this algorithm IndD(ǫ, γ, κ). The basic intuition here is that Cri decreases rapidly with the graph distance between vertices r and i. As mentioned above, this is true at small θ. Theorem 1.4. Let G be a graph of bounded degree ∆ ≥ 1. Assume that θ < K/∆ for some small enough constant K. Then there exists ǫ, γ, κ such that nIndD(ǫ,γ,κ) (G, θ) ≤ 8(κ2 + 8∆ ) log 4p , δ χIndD(ǫ,γ,κ) (G, θ) ≤ K ′ p∆∆ More specifically, we can take κ = tanh θ, ǫ = 1 4 log(4/κ) α + K ′ ∆p2 log p . sinh(2θ) and γ = e−4∆θ 2−2∆ . 2 If a is a vector and R is a set of indices then we denote by aR the vector formed by the components of a with index in R. 3 1.2.2 Regularized Pseudo-Likelihoods A different approach to the learning problem consists in maximizing an appropriate empirical likelihood function [7, 8, 9, 10, 13]. To control the fluctuations caused by the limited number of samples, and select sparse graphs a regularization term is often added [7, 8, 9, 10, 11, 12, 13]. As a specific low complexity implementation of this idea, we consider the ℓ1 -regularized pseudolikelihood method of [7]. For each node r, the following likelihood function is considered L(θ; {x(ℓ) }) = − 1 n n (ℓ) ℓ=1 log Pn,G,θ (x(ℓ) |x\r ) r (7) where x\r = xV \r = {xi : i ∈ V \ r} is the vector of all variables except xr and Pn,G,θ is defined from the following extension of (1), µG,θ (x) = 1 ZG,θ eθij xi xj (8) i,j∈V / where θ = {θij }i,j∈V is a vector of real parameters. Model (1) corresponds to θij = 0, ∀(i, j) ∈ E and θij = θ, ∀(i, j) ∈ E. The function L(θ; {x(ℓ) }) depends only on θr,· = {θrj , j ∈ ∂r} and is used to estimate the neighborhood of each node by the following algorithm, Rlr(λ), R EGULARIZED L OGISTIC R EGRESSION( samples {x(ℓ) }, regularization (λ)) 1: Select a node r ∈ V ; 2: Calculate ˆr,· = arg min {L(θr,· ; {x(ℓ) }) + λ||θr,· ||1 }; θ θ r,· ∈Rp−1 3: ˆ If θrj > 0, set (r, j) ∈ E; Our first result shows that Rlr(λ) indeed reconstructs G if θ is sufficiently small. Theorem 1.5. There exists numerical constants K1 , K2 , K3 , such that the following is true. Let G be a graph with degree bounded by ∆ ≥ 3. If θ ≤ K1 /∆, then there exist λ such that nRlr(λ) (G, θ) ≤ K2 θ−2 ∆ log 8p2 . δ (9) Further, the above holds with λ = K3 θ ∆−1/2 . This theorem is proved by noting that for θ ≤ K1 /∆ correlations decay exponentially, which makes all conditions in Theorem 1 of [7] (denoted there by A1 and A2) hold, and then computing the probability of success as a function of n, while strenghtening the error bounds of [7]. In order to prove a converse to the above result, we need to make some assumptions on λ. Given θ > 0, we say that λ is ‘reasonable for that value of θ if the following conditions old: (i) Rlr(λ) is successful with probability larger than 1/2 on any star graph (a graph composed by a vertex r connected to ∆ neighbors, plus isolated vertices); (ii) λ ≤ δ(n) for some sequence δ(n) ↓ 0. Theorem 1.6. There exists a numerical constant K such that the following happens. If ∆ > 3, θ > K/∆, then there exists graphs G of degree bounded by ∆ such that for all reasonable λ, nRlr(λ) (G) = ∞, i.e. regularized logistic regression fails with high probability. The graphs for which regularized logistic regression fails are not contrived examples. Indeed we will prove that the claim in the last theorem holds with high probability when G is a uniformly random graph of regular degree ∆. The proof Theorem 1.6 is based on showing that an appropriate incoherence condition is necessary for Rlr to successfully reconstruct G. The analogous result was proven in [14] for model selection using the Lasso. In this paper we show that such a condition is also necessary when the underlying model is an Ising model. Notice that, given the graph G, checking the incoherence condition is NP-hard for general (non-ferromagnetic) Ising model, and requires significant computational effort 4 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 20 15 λ0 10 5 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.2 1 0.8 0.6 Psucc 0.4 0.2 1 θ 0 0 0.2 0.4 0.6 θ θcrit 0.8 1 Figure 1: Learning random subgraphs of a 7 × 7 (p = 49) two-dimensional grid from n = 4500 Ising models samples, using regularized logistic regression. Left: success probability as a function of the model parameter θ and of the regularization parameter λ0 (darker corresponds to highest probability). Right: the same data plotted for several choices of λ versus θ. The vertical line corresponds to the model critical temperature. The thick line is an envelope of the curves obtained for different λ, and should correspond to optimal regularization. even in the ferromagnetic case. Hence the incoherence condition does not provide, by itself, a clear picture of which graph structure are difficult to learn. We will instead show how to evaluate it on specific graph families. Under the restriction λ → 0 the solutions given by Rlr converge to θ∗ with n [7]. Thus, for large n we can expand L around θ∗ to second order in (θ − θ∗ ). When we add the regularization term to L we obtain a quadratic model analogous the Lasso plus the error term due to the quadratic approximation. It is thus not surprising that, when λ → 0 the incoherence condition introduced for the Lasso in [14] is also relevant for the Ising model. 2 Numerical experiments In order to explore the practical relevance of the above results, we carried out extensive numerical simulations using the regularized logistic regression algorithm Rlr(λ). Among other learning algorithms, Rlr(λ) strikes a good balance of complexity and performance. Samples from the Ising model (1) where generated using Gibbs sampling (a.k.a. Glauber dynamics). Mixing time can be very large for θ ≥ θcrit , and was estimated using the time required for the overall bias to change sign (this is a quite conservative estimate at low temperature). Generating the samples {x(ℓ) } was indeed the bulk of our computational effort and took about 50 days CPU time on Pentium Dual Core processors (we show here only part of these data). Notice that Rlr(λ) had been tested in [7] only on tree graphs G, or in the weakly coupled regime θ < θcrit . In these cases sampling from the Ising model is easy, but structural learning is also intrinsically easier. Figure reports the success probability of Rlr(λ) when applied to random subgraphs of a 7 × 7 two-dimensional grid. Each such graphs was obtained by removing each edge independently with probability ρ = 0.3. Success probability was estimated by applying Rlr(λ) to each vertex of 8 graphs (thus averaging over 392 runs of Rlr(λ)), using n = 4500 samples. We scaled the regularization parameter as λ = 2λ0 θ(log p/n)1/2 (this choice is motivated by the algorithm analysis and is empirically the most satisfactory), and searched over λ0 . The data clearly illustrate the phenomenon discussed. Despite the large number of samples n ≫ log p, when θ crosses a threshold, the algorithm starts performing poorly irrespective of λ. Intriguingly, this threshold is not far from the critical point of the Ising model on a randomly diluted grid θcrit (ρ = 0.3) ≈ 0.7 [15, 16]. 5 1.2 1.2 θ = 0.35, 0.40 1 1 θ = 0.25 θ = 0.20 0.8 0.8 θ = 0.45 θ = 0.10 0.6 0.6 Psucc Psucc 0.4 0.4 θ = 0.50 0.2 0.2 θthr θ = 0.65, 0.60, 0.55 0 0 0 2000 4000 6000 8000 10000 0 0.1 0.2 n 0.3 0.4 0.5 0.6 0.7 0.8 θ Figure 2: Learning uniformly random graphs of degree 4 from Ising models samples, using Rlr. Left: success probability as a function of the number of samples n for several values of θ. Right: the same data plotted for several choices of λ versus θ as in Fig. 1, right panel. Figure 2 presents similar data when G is a uniformly random graph of degree ∆ = 4, over p = 50 vertices. The evolution of the success probability with n clearly shows a dichotomy. When θ is below a threshold, a small number of samples is sufficient to reconstruct G with high probability. Above the threshold even n = 104 samples are to few. In this case we can predict the threshold analytically, cf. Lemma 3.3 below, and get θthr (∆ = 4) ≈ 0.4203, which compares favorably with the data. 3 Proofs In order to prove Theorem 1.6, we need a few auxiliary results. It is convenient to introduce some notations. If M is a matrix and R, P are index sets then MR P denotes the submatrix with row indices in R and column indices in P . As above, we let r be the vertex whose neighborhood we are trying to reconstruct and define S = ∂r, S c = V \ ∂r ∪ r. Since the cost function L(θ; {x(ℓ) }) + λ||θ||1 only depend on θ through its components θr,· = {θrj }, we will hereafter neglect all the other parameters and write θ as a shorthand of θr,· . Let z ∗ be a subgradient of ||θ||1 evaluated at the true parameters values, θ∗ = {θrj : θij = 0, ∀j ∈ ˆ / ˆn be the parameter estimate returned by Rlr(λ) when the number ∂r, θrj = θ, ∀j ∈ ∂r}. Let θ of samples is n. Note that, since we assumed θ∗ ≥ 0, zS = ½. Define Qn (θ, ; {x(ℓ) }) to be the ˆ∗ (ℓ) (ℓ) n Hessian of L(θ; {x }) and Q(θ) = limn→∞ Q (θ, ; {x }). By the law of large numbers Q(θ) is the Hessian of EG,θ log PG,θ (Xr |X\r ) where EG,θ is the expectation with respect to (8) and X is a random variable distributed according to (8). We will denote the maximum and minimum eigenvalue of a symmetric matrix M by σmax (M ) and σmin (M ) respectively. We will omit arguments whenever clear from the context. Any quantity evaluated at the true parameter values will be represented with a ∗ , e.g. Q∗ = Q(θ∗ ). Quantities under a ∧ depend on n. Throughout this section G is a graph of maximum degree ∆. 3.1 Proof of Theorem 1.6 Our first auxiliary results establishes that, if λ is small, then ||Q∗ c S Q∗ −1 zS ||∞ > 1 is a sufficient ˆ∗ S SS condition for the failure of Rlr(λ). Lemma 3.1. Assume [Q∗ c S Q∗ −1 zS ]i ≥ 1 + ǫ for some ǫ > 0 and some row i ∈ V , σmin (Q∗ ) ≥ ˆ∗ S SS SS 3 ǫ/29 ∆4 . Then the success probability of Rlr(λ) is upper bounded as Cmin > 0, and λ < Cmin 2 2 2 δB Psucc ≤ 4∆2 e−nδA + 2∆ e−nλ 2 where δA = (Cmin /100∆2 )ǫ and δB = (Cmin /8∆)ǫ. 6 (10) The next Lemma implies that, for λ to be ‘reasonable’ (in the sense introduced in Section 1.2.2), nλ2 must be unbounded. Lemma 3.2. There exist M = M (K, θ) > 0 for θ > 0 such that the following is true: If G is the graph with only one edge between nodes r and i and nλ2 ≤ K, then Psucc ≤ e−M (K,θ)p + e−n(1−tanh θ) 2 /32 . (11) Finally, our key result shows that the condition ||Q∗ c S Q∗ −1 zS ||∞ ≤ 1 is violated with high ˆ∗ S SS probability for large random graphs. The proof of this result relies on a local weak convergence result for ferromagnetic Ising models on random graphs proved in [17]. Lemma 3.3. Let G be a uniformly random regular graph of degree ∆ > 3, and ǫ > 0 be sufficiently small. Then, there exists θthr (∆, ǫ) such that, for θ > θthr (∆, ǫ), ||Q∗ c S Q∗ −1 zS ||∞ ≥ 1 + ǫ with ˆ∗ S SS probability converging to 1 as p → ∞. ˜ ˜ ˜ Furthermore, for large ∆, θthr (∆, 0+) = θ ∆−1 (1 + o(1)). The constant θ is given by θ = ¯ ¯ ¯ ¯ ¯ ¯ tanh h)/h and h is the unique positive solution of h tanh h = (1 − tanh2 h)2 . Finally, there exist Cmin > 0 dependent only on ∆ and θ such that σmin (Q∗ ) ≥ Cmin with probability converging to SS 1 as p → ∞. The proofs of Lemmas 3.1 and 3.3 are sketched in the next subsection. Lemma 3.2 is more straightforward and we omit its proof for space reasons. Proof. (Theorem 1.6) Fix ∆ > 3, θ > K/∆ (where K is a large enough constant independent of ∆), and ǫ, Cmin > 0 and both small enough. By Lemma 3.3, for any p large enough we can choose a ∆-regular graph Gp = (V = [p], Ep ) and a vertex r ∈ V such that |Q∗ c S Q∗ −1 ½S |i > 1 + ǫ for S SS some i ∈ V \ r. By Theorem 1 in [4] we can assume, without loss of generality n > K ′ ∆ log p for some small constant K ′ . Further by Lemma 3.2, nλ2 ≥ F (p) for some F (p) ↑ ∞ as p → ∞ and the condition of Lemma 3.1 on λ is satisfied since by the ”reasonable” assumption λ → 0 with n. Using these results in Eq. (10) of Lemma 3.1 we get the following upper bound on the success probability 2 Psucc (Gp ) ≤ 4∆2 p−δA K In particular Psucc (Gp ) → 0 as p → ∞. 3.2 ′ ∆ 2 + 2∆ e−nF (p)δB . (12) Proofs of auxiliary lemmas θ θ Proof. (Lemma 3.1) We will show that under the assumptions of the lemma and if ˆ = (ˆS , ˆS C ) = θ (ˆS , 0) then the probability that the i component of any subgradient of L(θ; {x(ℓ) })+λ||θ||1 vanishes θ for any ˆS > 0 (component wise) is upper bounded as in Eq. (10). To simplify notation we will omit θ {x(ℓ) } in all the expression derived from L. θ θ) z Let z be a subgradient of ||θ|| at ˆ and assume ∇L(ˆ + λˆ = 0. An application of the mean value ˆ theorem yields ∇2 L(θ∗ )[ˆ − θ∗ ] = W n − λˆ + Rn , θ z (13) ∗ n n 2 ¯(j) ) − ∇2 L(θ∗ )]T (ˆ − θ∗ ) with ¯(j) a point in the line where W = −∇L(θ ) and [R ]j = [∇ L(θ θ j θ ˆ to θ∗ . Notice that by definition ∇2 L(θ∗ ) = Qn ∗ = Qn (θ∗ ). To simplify notation we will from θ omit the ∗ in all Qn ∗ . All Qn in this proof are thus evaluated at θ∗ . Breaking this expression into its S and S c components and since ˆS C = θ∗ C = 0 we can eliminate θ S ˆ − θ∗ from the two expressions obtained and write θS S n n n n ˆ z [WS C − RS C ] − Qn C S (Qn )−1 [WS − RS ] + λQn C S (Qn )−1 zS = λˆS C . SS SS S S Now notice that Qn C S (Qn )−1 = T1 + T2 + T3 + T4 where SS S T1 = Q∗ C S [(Qn )−1 − (Q∗ )−1 ] , SS SS S T3 = [Qn C S − Q∗ C S ][(Qn )−1 − (Q∗ )−1 ] , SS SS S S 7 T2 = [Qn C S − Q∗ C S ]Q∗ −1 , SS S S T4 = Q∗ C S Q∗ −1 . SS S (14) We will assume that the samples {x(ℓ) } are such that the following event holds n E ≡ {||Qn − Q∗ ||∞ < ξA , ||Qn C S − Q∗ C S ||∞ < ξB , ||WS /λ||∞ < ξC } , (15) SS SS S S √ 2 n where ξA ≡ Cmin ǫ/(16∆), ξB ≡ Cmin ǫ/(8 ∆) and ξC ≡ Cmin ǫ/(8∆). Since EG,θ (Q ) = Q∗ and EG,θ (W n ) = 0 and noticing that both Qn and W n are sums of bounded i.i.d. random variables, a simple application of Azuma-Hoeffding inequality upper bounds the probability of E as in (10). From E it follows that σmin (Qn ) > σmin (Q∗ ) − Cmin /2 > Cmin /2. We can therefore lower SS SS bound the absolute value of the ith component of zS C by ˆ n n ∆ Rn WS RS Wn + |[Q∗ C S Q∗ −1 ½S ]i |−||T1,i ||∞ −||T2,i ||∞ −||T3,i ||∞ − i − i − SS S λ λ Cmin λ ∞ λ ∞ where the subscript i denotes the i-th row of a matrix. The proof is completed by showing that the event E and the assumptions of the theorem imply that n each of last 7 terms in this expression is smaller than ǫ/8. Since |[Q∗ C S Q∗ −1 ]T zS | ≥ 1 + ǫ by i ˆ SS S assumption, this implies |ˆi | ≥ 1 + ǫ/8 > 1 which cannot be since any subgradient of the 1-norm z has components of magnitude at most 1. The last condition on E immediately bounds all terms involving W by ǫ/8. Some straightforward manipulations imply (See Lemma 7 from [7]) √ ∆ ∆ n ∗ ||T2,i ||∞ ≤ ||[Qn C S − Q∗ C S ]i ||∞ , ||T1,i ||∞ ≤ 2 ||QSS − QSS ||∞ , S S Cmin Cmin 2∆ ||T3,i ||∞ ≤ 2 ||Qn − Q∗ ||∞ ||[Qn C S − Q∗ C S ]i ||∞ , SS SS S S Cmin and thus all will be bounded by ǫ/8 when E holds. The upper bound of Rn follows along similar lines via an mean value theorem, and is deferred to a longer version of this paper. Proof. (Lemma 3.3.) Let us state explicitly the local weak convergence result mentioned in Sec. 3.1. For t ∈ N, let T(t) = (VT , ET ) be the regular rooted tree of t generations and define the associated Ising measure as ∗ 1 eθxi xj (16) eh xi . µ+ (x) = T,θ ZT,θ (i,j)∈ET i∈∂T(t) Here ∂T(t) is the set of leaves of T(t) and h∗ is the unique positive solution of h = (∆ − 1) atanh {tanh θ tanh h}. It can be proved using [17] and uniform continuity with respect to the ‘external field’ that non-trivial local expectations with respect to µG,θ (x) converge to local expectations with respect to µ+ (x), as p → ∞. T,θ More precisely, let Br (t) denote a ball of radius t around node r ∈ G (the node whose neighborhood we are trying to reconstruct). For any fixed t, the probability that Br (t) is not isomorphic to T(t) goes to 0 as p → ∞. Let g(xBr (t) ) be any function of the variables in Br (t) such that g(xBr (t) ) = g(−xBr (t) ). Then almost surely over graph sequences Gp of uniformly random regular graphs with p nodes (expectations here are taken with respect to the measures (1) and (16)) lim EG,θ {g(X Br (t) )} = ET(t),θ,+ {g(X T(t) )} . (17) p→∞ The proof consists in considering [Q∗ c S Q∗ −1 zS ]i for t = dist(r, i) finite. We then write ˆ∗ S SS (Q∗ )lk = E{gl,k (X Br (t) )} and (Q∗ c S )il = E{gi,l (X Br (t) )} for some functions g·,· (X Br (t) ) and S SS apply the weak convergence result (17) to these expectations. We thus reduced the calculation of [Q∗ c S Q∗ −1 zS ]i to the calculation of expectations with respect to the tree measure (16). The latter ˆ∗ S SS can be implemented explicitly through a recursive procedure, with simplifications arising thanks to the tree symmetry and by taking t ≫ 1. The actual calculations consist in a (very) long exercise in calculus and we omit them from this outline. The lower bound on σmin (Q∗ ) is proved by a similar calculation. SS Acknowledgments This work was partially supported by a Terman fellowship, the NSF CAREER award CCF-0743978 and the NSF grant DMS-0806211 and by a Portuguese Doctoral FCT fellowship. 8 , References [1] P. Abbeel, D. Koller and A. Ng, “Learning factor graphs in polynomial time and sample complexity”. Journal of Machine Learning Research., 2006, Vol. 7, 1743–1788. [2] M. Wainwright, “Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting”, arXiv:math/0702301v2 [math.ST], 2007. [3] N. Santhanam, M. Wainwright, “Information-theoretic limits of selecting binary graphical models in high dimensions”, arXiv:0905.2639v1 [cs.IT], 2009. [4] G. Bresler, E. Mossel and A. Sly, “Reconstruction of Markov Random Fields from Samples: Some Observations and Algorithms”,Proceedings of the 11th international workshop, APPROX 2008, and 12th international workshop RANDOM 2008, 2008 ,343–356. [5] Csiszar and Z. Talata, “Consistent estimation of the basic neighborhood structure of Markov random fields”, The Annals of Statistics, 2006, 34, Vol. 1, 123-145. [6] N. Friedman, I. Nachman, and D. Peer, “Learning Bayesian network structure from massive datasets: The sparse candidate algorithm”. In UAI, 1999. [7] P. Ravikumar, M. Wainwright and J. Lafferty, “High-Dimensional Ising Model Selection Using l1-Regularized Logistic Regression”, arXiv:0804.4202v1 [math.ST], 2008. [8] M.Wainwright, P. Ravikumar, and J. Lafferty, “Inferring graphical model structure using l1regularized pseudolikelihood“, In NIPS, 2006. [9] H. H¨ fling and R. Tibshirani, “Estimation of Sparse Binary Pairwise Markov Networks using o Pseudo-likelihoods” , Journal of Machine Learning Research, 2009, Vol. 10, 883–906. [10] O.Banerjee, L. El Ghaoui and A. d’Aspremont, “Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data”, Journal of Machine Learning Research, March 2008, Vol. 9, 485–516. [11] M. Yuan and Y. Lin, “Model Selection and Estimation in Regression with Grouped Variables”, J. Royal. Statist. Soc B, 2006, 68, Vol. 19,49–67. [12] N. Meinshausen and P. B¨ uhlmann, “High dimensional graphs and variable selection with the u lasso”, Annals of Statistics, 2006, 34, Vol. 3. [13] R. Tibshirani, “Regression shrinkage and selection via the lasso”, Journal of the Royal Statistical Society, Series B, 1994, Vol. 58, 267–288. [14] P. Zhao, B. Yu, “On model selection consistency of Lasso”, Journal of Machine. Learning Research 7, 25412563, 2006. [15] D. Zobin, ”Critical behavior of the bond-dilute two-dimensional Ising model“, Phys. Rev., 1978 ,5, Vol. 18, 2387 – 2390. [16] M. Fisher, ”Critical Temperatures of Anisotropic Ising Lattices. II. General Upper Bounds”, Phys. Rev. 162 ,Oct. 1967, Vol. 2, 480–485. [17] A. Dembo and A. Montanari, “Ising Models on Locally Tree Like Graphs”, Ann. Appl. Prob. (2008), to appear, arXiv:0804.4726v2 [math.PR] 9
4 0.75929999 255 nips-2009-Variational Inference for the Nested Chinese Restaurant Process
Author: Chong Wang, David M. Blei
Abstract: The nested Chinese restaurant process (nCRP) is a powerful nonparametric Bayesian model for learning tree-based hierarchies from data. Since its posterior distribution is intractable, current inference methods have all relied on MCMC sampling. In this paper, we develop an alternative inference technique based on variational methods. To employ variational methods, we derive a tree-based stick-breaking construction of the nCRP mixture model, and a novel variational algorithm that efficiently explores a posterior over a large set of combinatorial structures. We demonstrate the use of this approach for text and hand written digits modeling, where we show we can adapt the nCRP to continuous data as well. 1
5 0.69010168 212 nips-2009-Semi-Supervised Learning in Gigantic Image Collections
Author: Rob Fergus, Yair Weiss, Antonio Torralba
Abstract: With the advent of the Internet it is now possible to collect hundreds of millions of images. These images come with varying degrees of label information. “Clean labels” can be manually obtained on a small fraction, “noisy labels” may be extracted automatically from surrounding text, while for most images there are no labels at all. Semi-supervised learning is a principled framework for combining these different label sources. However, it scales polynomially with the number of images, making it impractical for use on gigantic collections with hundreds of millions of images and thousands of classes. In this paper we show how to utilize recent results in machine learning to obtain highly efficient approximations for semi-supervised learning that are linear in the number of images. Specifically, we use the convergence of the eigenvectors of the normalized graph Laplacian to eigenfunctions of weighted Laplace-Beltrami operators. Our algorithm enables us to apply semi-supervised learning to a database of 80 million images gathered from the Internet. 1
6 0.54360396 162 nips-2009-Neural Implementation of Hierarchical Bayesian Inference by Importance Sampling
7 0.53089762 169 nips-2009-Nonlinear Learning using Local Coordinate Coding
8 0.52839786 104 nips-2009-Group Sparse Coding
9 0.52563822 158 nips-2009-Multi-Label Prediction via Sparse Infinite CCA
10 0.52356678 229 nips-2009-Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data
11 0.52309376 155 nips-2009-Modelling Relational Data using Bayesian Clustered Tensor Factorization
12 0.52251256 137 nips-2009-Learning transport operators for image manifolds
13 0.5221411 112 nips-2009-Human Rademacher Complexity
14 0.52155685 174 nips-2009-Nonparametric Latent Feature Models for Link Prediction
15 0.52134132 3 nips-2009-AUC optimization and the two-sample problem
16 0.5209136 97 nips-2009-Free energy score space
17 0.52068406 70 nips-2009-Discriminative Network Models of Schizophrenia
18 0.52060288 254 nips-2009-Variational Gaussian-process factor analysis for modeling spatio-temporal data
19 0.52002889 138 nips-2009-Learning with Compressible Priors
20 0.51997429 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions