jmlr jmlr2007 jmlr2007-31 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Markus Kalisch, Peter Bühlmann
Abstract: We consider the PC-algorithm (Spirtes et al., 2000) for estimating the skeleton and equivalence class of a very high-dimensional directed acyclic graph (DAG) with corresponding Gaussian distribution. The PC-algorithm is computationally feasible and often very fast for sparse problems with many nodes (variables), and it has the attractive property to automatically achieve high computational efficiency as a function of sparseness of the true underlying DAG. We prove uniform consistency of the algorithm for very high-dimensional, sparse DAGs where the number of nodes is allowed to quickly grow with sample size n, as fast as O(n^a) for any 0 < a < ∞. The sparseness assumption is rather minimal, requiring only that the neighborhoods in the DAG are of lower order than sample size n. We also demonstrate the PC-algorithm for simulated data. Keywords: asymptotic consistency, DAG, graphical model, PC-algorithm, skeleton
Reference: text
sentIndex sentText sentNum sentScore
1 We consider the PC-algorithm (Spirtes et al., 2000) for estimating the skeleton and equivalence class of a very high-dimensional directed acyclic graph (DAG) with corresponding Gaussian distribution. [sent-8, score-0.695]
2 The PC-algorithm is computationally feasible and often very fast for sparse problems with many nodes (variables), and it has the attractive property to automatically achieve high computational efficiency as a function of sparseness of the true underlying DAG. [sent-9, score-0.164]
3 We prove uniform consistency of the algorithm for very high-dimensional, sparse DAGs where the number of nodes is allowed to quickly grow with sample size n, as fast as O(n^a) for any 0 < a < ∞. [sent-10, score-0.175]
4 Keywords: asymptotic consistency, DAG, graphical model, PC-algorithm, skeleton 1. Introduction [sent-13, score-0.421]
5 Of particular current interest are directed acyclic graphs (DAGs), containing directed rather than undirected edges, which restrict in a sense the conditional dependence relations. [sent-17, score-0.389]
6 These graphs can be interpreted by applying the directed Markov property (see Lauritzen, 1996). [sent-18, score-0.123]
7 When ignoring the directions of a DAG, we get the skeleton of a DAG. [sent-19, score-0.398]
8 In general, it is different from the conditional independence graph (CIG), see Section 2. [sent-20, score-0.138]
9 (Thus, estimation methods for directed graphs cannot be easily borrowed from approaches for undirected CIGs.) [sent-22, score-0.226]
10 As we will see in Section 2.1, the skeleton can be interpreted easily and thus yields interesting insights into the dependence structure of the data. [sent-24, score-0.398]
11 The greedy DAG search can be improved by exploiting probabilistic equivalence relations, and the search space can be reduced from individual DAGs to equivalence classes, as proposed in GES (Greedy Equivalent Search, see Chickering, 2002a). [sent-30, score-0.232]
12 Although this method seems quite promising when having few or a moderate number of nodes, it is limited by the fact that the space of equivalence classes is conjectured to grow super-exponentially in the nodes as well (Gillispie and Perlman, 2001). [sent-31, score-0.203]
13 It starts from a complete, undirected graph and recursively deletes edges based on conditional independence decisions. [sent-37, score-0.301]
14 This yields an undirected graph which can then be partially directed and further extended to represent the underlying DAG (see later). [sent-38, score-0.24]
15 We focus in this paper on estimating the equivalence class and the skeleton of DAGs (corresponding to multivariate Gaussian distributions) in the high-dimensional context, that is, the number of nodes p may be much larger than sample size n. [sent-44, score-0.65]
16 We prove that the PC-algorithm consistently estimates the equivalence class and the skeleton of an underlying sparse DAG, as sample size n → ∞, even if p = pn = O(n^a) (0 ≤ a < ∞) is allowed to grow very quickly as a function of n. [sent-45, score-0.603]
17 They show that, assuming only faithfulness (explained in Section 2), uniform consistency cannot be achieved, but pointwise consistency can. [sent-53, score-0.16]
18 More importantly, we show that consistency holds even as the number of nodes and neighbors increases and the size of the smallest non-zero partial correlations decreases as a function of the sample size. [sent-55, score-0.319]
19 Stricter assumptions than the faithfulness condition that render uniform consistency possible have also been proposed in Zhang and Spirtes (2003). [sent-56, score-0.122]
20 The problem of finding the equivalence class of a DAG has a substantial overlap with the problem of feature selection: If the equivalence class is found, the Markov Blanket of any variable (node) can be read off easily. [sent-59, score-0.232]
21 Given a set of nodes V, suppose that M is the Markov Blanket of node X; then X is conditionally independent of V \ M given M. [sent-60, score-0.16]
22 2.1 Definitions and Preliminaries. A graph G = (V, E) consists of a set of nodes or vertices V = {1, . . . , p} [sent-68, score-0.138]
23 and a set of edges E ⊆ V × V, that is, the edge set is a subset of ordered pairs of distinct nodes. [sent-71, score-0.164]
24 An edge (i, j) ∈ E is called directed if (i, j) ∈ E but (j, i) ∉ E; we then use the notation i → j. [sent-73, score-0.152]
25 A directed acyclic graph (DAG) is a graph G in which all edges are directed and which does not contain any cycle. [sent-75, score-0.378]
26 If there is a directed edge i → j, node i is said to be a parent of node j. [sent-76, score-0.228]
27 The adjacency set of a node j in graph G, denoted by adj(G, j), is the set of all nodes i that are directly connected to j by an edge (directed or undirected). [sent-78, score-0.363]
28 The elements of adj(G, j) are also called neighbors of, or adjacent to, j. [sent-79, score-0.188]
29 A probability distribution P on R^p is said to be faithful with respect to a graph G if conditional independencies of the distribution can be inferred from so-called d-separation in the graph G and vice-versa. [sent-80, score-0.246]
30 The skeleton of a DAG G is the undirected graph obtained from G by substituting undirected edges for directed edges. [sent-92, score-0.801]
31 A v-structure in a DAG G is an ordered triple of nodes (i, j, k) such that G contains the directed edges i → j and k → j, and i and k are not adjacent in G. [sent-93, score-0.322]
32 Using a result from Verma and Pearl (1990), we can characterize equivalence classes more precisely: Two DAGs are equivalent if and only if they have the same skeleton and the same v-structures. [sent-97, score-0.398]
33 A common tool for visualizing equivalence classes of DAGs is the completed partially directed acyclic graph (CPDAG). [sent-98, score-0.283]
34 A partially directed acyclic graph (PDAG) is a graph where some edges are directed and some are undirected and one cannot trace a cycle by following the direction of directed edges and any direction for undirected edges. [sent-99, score-0.73]
35 A PDAG is completed, if (1) every directed edge exists also in every DAG belonging to the equivalence class of the DAG and (2) for every undirected edge i − j there exists a DAG with i → j and a DAG with i ← j in the equivalence class. [sent-101, score-0.553]
36 CPDAGs encode all independence information contained in the corresponding equivalence class. [sent-102, score-0.17]
37 It was shown in Chickering (2002b) that two CPDAGs are identical if and only if they represent the same equivalence class, that is, they represent an equivalence class uniquely. [sent-103, score-0.232]
38 Although the main goal is to identify the CPDAG, the skeleton itself already contains interesting information. [sent-104, score-0.398]
39 In particular, if P is faithful with respect to a DAG G, there is an edge between nodes i and j in the skeleton of DAG G ⇔ for all s ⊆ V \ {i, j}, X(i) and X(j) are conditionally dependent given {X(r); r ∈ s}, (1) (Spirtes et al., 2000). [sent-105, score-0.655]
40 This implies that if P is faithful with respect to a DAG G, the skeleton of the DAG G is a subset of (or equal to) the conditional independence graph (CIG) corresponding to P. [sent-109, score-0.605]
41 More importantly, every edge in the skeleton indicates some strong dependence which cannot be explained away by accounting for other variables. [sent-111, score-0.464]
42 As we will see later in more detail, estimating the CPDAG consists of two main parts (which will naturally structure our analysis): (1) Estimation of the skeleton and (2) partial orientation of edges. [sent-113, score-0.442]
43 The skeleton itself oftentimes already provides interesting insights, and in a high-dimensional setting it might be interesting to use the undirected skeleton as an alternative target to the CPDAG when finding a useful approximation of the CPDAG seems hopeless. [sent-122, score-0.899]
44 2.2 The PC-algorithm for Finding the Skeleton. A naive strategy for finding the skeleton would be to check conditional independencies given all subsets s ⊆ V \ {i, j} (see Formula 1), that is, all partial correlations in the case of multivariate normal distributions, as first suggested by Verma and J. Pearl. [sent-128, score-0.594]
45 More precisely, we apply the part of the PC-algorithm that identifies the undirected edges of the DAG. [sent-132, score-0.163]
46 2.2.1 Population Version for the Skeleton. In the population version of the PC-algorithm, we assume that perfect knowledge about all necessary conditional independence relations is available. [sent-135, score-0.118]
47 Algorithm 1 The PCpop-algorithm 1: INPUT: Vertex Set V, Conditional Independence Information 2: OUTPUT: Estimated skeleton C, separation sets S (only needed when directing the skeleton afterwards) 3: Form the complete undirected graph C̃ on the vertex set V. [sent-138, score-1.049]
48 4: ℓ = −1; C = C̃ 5: repeat 6: ℓ = ℓ + 1 7: repeat 8: Select a (new) ordered pair of nodes i, j that are adjacent in C such that |adj(C, i) \ {j}| ≥ ℓ 9: repeat 10: Choose (new) k ⊆ adj(C, i) \ {j} with |k| = ℓ. [sent-139, score-0.267]
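To make the loop structure of Algorithm 1 concrete, here is a minimal sketch in R (the language of the pcalg examples later in the paper) of the population skeleton search. It is written against an abstract conditional-independence oracle indep(i, j, k); the function names and the interface are illustrative assumptions, not part of the paper or of the pcalg package.

# Minimal sketch of the PCpop skeleton search (Algorithm 1). `indep(i, j, k)` is a
# user-supplied oracle returning TRUE iff X(i) and X(j) are conditionally independent
# given the variables indexed by the set k. Names and interface are illustrative only.
pop.skeleton <- function(p, indep) {
  C <- matrix(TRUE, p, p); diag(C) <- FALSE      # start from the complete undirected graph
  l <- -1
  repeat {
    l <- l + 1
    any.pair <- FALSE
    for (i in 1:p) for (j in 1:p) {
      if (i == j || !C[i, j]) next
      adj.i <- setdiff(which(C[i, ]), j)         # adj(C, i) \ {j}
      if (length(adj.i) < l) next
      any.pair <- TRUE
      ks <- if (l == 0) list(integer(0))
            else if (length(adj.i) == 1) list(adj.i)   # guard combn()'s scalar quirk
            else combn(adj.i, l, simplify = FALSE)
      for (k in ks) {
        if (indep(i, j, k)) { C[i, j] <- C[j, i] <- FALSE; break }
      }
    }
    if (!any.pair) break        # no ordered pair has >= l remaining neighbors: stop
  }
  C                             # adjacency matrix of the estimated skeleton
}

In the sample version of Section 2.2, the oracle indep is replaced by a test of zero partial correlation; a sketch of that test is given below, after the discussion of partial correlations.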
49 The maximal value of ℓ reached in Algorithm 1 is denoted by mreach = maximal reached value of ℓ. [sent-142, score-0.151]
50 A proof that this algorithm produces the correct skeleton can be easily deduced from Theorem 5. [sent-144, score-0.398]
51 Then, the PCpop-algorithm constructs the true skeleton of the DAG. [sent-150, score-0.497]
52 Furthermore, we assume faithful models, which means that the conditional independence relations correspond to d-separations (and so can be read off the graph) and vice versa; see Section 2. [sent-157, score-0.156]
53 In the Gaussian case, conditional independencies can be inferred from partial correlations. [sent-159, score-0.119]
54 We can thus estimate partial correlations to obtain estimates of conditional independencies. [sent-173, score-0.13]
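For the Gaussian case just described, a small sketch in R of the partial-correlation estimate and of Fisher's z-transform test (the test used in the sample version of the PC-algorithm) is given next; the helper names are assumptions for illustration, and C denotes a correlation matrix estimated from the data.

# Partial correlation of variables i and j given the index set k, obtained by inverting
# the correlation submatrix on {i, j} union k. Helper names are illustrative only.
pcor.from.cor <- function(C, i, j, k) {
  idx <- c(i, j, k)
  P <- solve(C[idx, idx, drop = FALSE])          # precision matrix of the selected block
  -P[1, 2] / sqrt(P[1, 1] * P[2, 2])
}

# Fisher's z-transform test of H0: rho_{i,j|k} = 0 at level alpha: declare conditional
# independence (and hence delete the edge i - j) when
# sqrt(n - |k| - 3) * |z| <= Phi^{-1}(1 - alpha/2).
gauss.ci.test <- function(C, n, i, j, k, alpha = 0.01) {
  r <- pcor.from.cor(C, i, j, k)
  z <- 0.5 * log((1 + r) / (1 - r))
  sqrt(n - length(k) - 3) * abs(z) <= qnorm(1 - alpha / 2)
}

Plugging gauss.ci.test (with the empirical correlation matrix of the data) into the skeleton loop sketched after Algorithm 1 yields a sample version of the PC-algorithm of Section 2.2.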
55 2.3 Extending the Skeleton to the Equivalence Class. While finding the skeleton as in Algorithm 1, we recorded the separation sets that made edges drop out in the variable denoted by S. [sent-190, score-0.458]
56 This was not necessary for finding the skeleton itself, but will be essential for extending the skeleton to the equivalence class. [sent-191, score-0.912]
57 50f) to extend the skeleton to a CPDAG belonging to the equivalence class of the underlying DAG. [sent-193, score-0.514]
58 The PC-algorithm from Section 2.2 is asymptotically consistent for the skeleton of a DAG, even if p is much larger than n but the DAG is sparse. [sent-205, score-0.398]
59 (A3) The maximal number of neighbors in the DAG Gn is denoted by qn = max_{1≤j≤pn} |adj(G, j)|, with qn = O(n^{1−b}) for some 0 < b ≤ 1. [sent-217, score-0.21]
60 Recently, for undirected graphs the Lasso has been proposed as a computationally efficient algorithm for estimating high-dimensional conditional independence graphs where the growth in dimensionality is as in (A2) (see Meinshausen and Bühlmann, 2006). [sent-227, score-0.264]
61 Denote by Ĝskel,n(αn) the estimated skeleton from the PC-algorithm in Section 2.2 and by Gskel,n the true skeleton from the DAG Gn. [sent-232, score-0.398]
62 A choice for the value of the significance level is αn = 2(1 − Φ(n^{1/2} cn/2)), which depends on the unknown lower bound of partial correlations in (A4). [sent-235, score-0.148]
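As a quick numerical illustration of this choice (a sketch only: cn is the unknown lower bound from (A4) and is therefore not available in practice; the values below are hypothetical):

# alpha_n = 2 * (1 - Phi(sqrt(n) * c_n / 2)) for an illustrative sample size and bound.
n  <- 1000
cn <- 0.1
alpha.n <- 2 * (1 - pnorm(sqrt(n) * cn / 2))
alpha.n   # roughly 0.11 here; alpha_n tends to 0 as n grows for a fixed or slowly decaying cn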
63 If this part is completed perfectly, that is, if there was no error while testing conditional independencies (it is not enough to assume that the skeleton was estimated correctly), the second part will never fail (see Meek, 1995b). [sent-238, score-0.473]
64 Numerical Examples We analyze the PC-algorithm for finding the skeleton and the CPDAG using various simulated data sets. [sent-251, score-0.398]
65 1 Simulating Data In this section, we analyze the PC-algorithm for the skeleton using simulated data. [sent-256, score-0.398]
66 The corresponding DAG draws a directed edge from node i to node j if i < j and Aji ≠ 0. [sent-269, score-0.228]
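A minimal sketch of this simulation scheme (the function name and the particular choice of edge weights are assumptions for illustration; the experiments in the paper use the pcalg functions randomDAG and rmvDAG instead):

# Simulate n samples from a Gaussian DAG model encoded by a strictly lower-triangular
# weight matrix A: there is an edge i -> j iff i < j and A[j, i] != 0, and
# X_j = sum_{i < j} A[j, i] * X_i + eps_j with eps_j ~ N(0, 1).
sim.dag.data <- function(n, p, s = 0.2) {
  nlow <- p * (p - 1) / 2
  A <- matrix(0, p, p)
  A[lower.tri(A)] <- rbinom(nlow, 1, s) * runif(nlow, 0.1, 1)   # sparse random weights
  X <- matrix(0, n, p)
  for (j in 1:p) {
    eps <- rnorm(n)
    X[, j] <- if (j == 1) eps
              else drop(X[, 1:(j - 1), drop = FALSE] %*% A[j, 1:(j - 1)]) + eps
  }
  list(A = A, X = X)       # weight matrix and the n x p data matrix
}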
67 3 Performance for Different Parameter Settings In this section, we give an overview over the performance in terms of the true positive rate (TPR) and false positive rate (FPR) for the skeleton and the SHD for the CPDAG. [sent-313, score-0.398]
68 As expected, the fit for a dense graph (triangles; E[N] = 5) is worse than the fit for a sparse graph (circles; E[N] = 2). [sent-331, score-0.151]
69 We should note that, while the number of neighbors of a given variable may grow almost as fast as n (so that the number of neighbors increases with sample size), the percentage of true edges among all possible edges goes down with n. [sent-347, score-0.177]
70 Table 1: The number of variables p increases exponentially, the sample size n increases linearly and the expected neighborhood size E[N] increases sub-linearly. [sent-460, score-0.141]
71 Figure 3: While the number of variables p increases exponentially, the sample size n increases linearly and the expected neighborhood size E[N] increases sub-linearly, the TPR increases and the FPR decreases. [sent-474, score-0.167]
72 The average processor time together with its standard deviation for estimating both the skeleton and the CPDAG is given in Table 2. [sent-492, score-0.438]
73 Graphs of p = 1000 nodes and 8 neighbors on average could be estimated in about 25 minutes, while networks with up to p = 100 nodes could be estimated in about a second. [sent-493, score-0.22]
74 The additional time spent for finding the CPDAG from the skeleton is comparable for both neighborhood sizes and varies between a couple of percent and almost 100 percent of the time needed to estimate the skeleton. [sent-494, score-0.436]
75 While the line for the dense graph is very straight, the line for the sparse graphs has a positive curvature. [sent-498, score-0.137]
76 Table 2: Processor time (… GHz, 4 GB) for estimating the skeleton (Ĝskel) and the CPDAG (ĜCPDAG) for different DAGs, in seconds, with standard errors in brackets. [sent-544, score-0.398]
77 R-Package pcalg The R-package pcalg can be used to estimate from data the underlying skeleton or equivalence class of a DAG. [sent-548, score-0.612]
78 In the following, we show an example of how to generate a random DAG, draw samples and infer from data the skeleton and the equivalence class of the original DAG using the R-package pcalg. [sent-563, score-0.514]
79 The line width of the edges in the resulting skeleton and CPDAG can be adjusted to correspond to the reliability of the estimated dependencies. [sent-564, score-0.458]
80 set.seed(42) g <- randomDAG(p,s) # generate a random DAG d <- rmvDAG(n,g) # generate random samples Then we estimate the underlying skeleton by using the function pcAlgo and extend the skeleton to the CPDAG by using the function udag2cpdag. [sent-579, score-0.796]
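Read as a complete script, the workflow described in this paragraph looks roughly as follows. The function names (randomDAG, rmvDAG, pcAlgo, udag2cpdag) are the ones given in the text; the exact argument names and the availability of pcAlgo/udag2cpdag depend on the pcalg version (newer versions expose pc() and skeleton() instead), so this is a sketch rather than version-exact code.

# Sketch of the pcalg workflow from the text; argument names are assumptions and the
# interface may differ between pcalg versions.
library(pcalg)

p <- 10          # number of nodes (illustrative values)
s <- 0.3         # sparseness (expected edge probability)
n <- 1000        # sample size

set.seed(42)
g <- randomDAG(p, s)            # generate a random DAG
d <- rmvDAG(n, g)               # generate random samples from it

skel  <- pcAlgo(d, alpha = 0.05)   # estimate the skeleton
cpdag <- udag2cpdag(skel)          # extend the skeleton to the CPDAG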
81 Then, for any 0 < γ ≤ 2, sup_{i,j,k∈K^m_{i,j,n}} P[|ρ̂n;i,j − ρn;i,j| > γ] ≤ C1 (n − 2) exp((n − 4) log((4 − γ^2)/(4 + γ^2))), for some constant 0 < C1 < ∞ depending on M only. [sent-582, score-0.225]
82 Consider sup_{−1<ρ<1; ρ+γ≤x≤1} gρ(x) = sup_{−1<ρ≤1−γ} √((1 − ρ^2)(1 − (ρ + γ)^2)) / (1 − ρ(ρ + γ)) = (1 − γ^2/4)/(1 − (−γ/2)(γ/2)) = (4 − γ^2)/(4 + γ^2) < 1 for all 0 < γ ≤ 2. [sent-594, score-0.124]
83 Then, for any γ > 0, sup_{i,j,k∈K^m_{i,j,n}} P[|ρ̂n;i,j|k − ρn;i,j|k| > γ] ≤ C1 (n − 2 − mn) exp((n − 4 − mn) log((4 − γ^2)/(4 + γ^2))), for some constant 0 < C1 < ∞ depending on M from (A4) only. [sent-605, score-0.577]
84 Then, for any γ > 0, sup_{i,j,k∈K^m_{i,j,n}} P[|Zn;i,j|k − zn;i,j|k| > γ] ≤ O(n − mn) exp((n − 4 − mn) log((4 − (γ/L)^2)/(4 + (γ/L)^2))) + exp(−C2 (n − mn)) for some constant 0 < C2 < ∞ and L = 1/(1 − (1 + M)^2/4). [sent-610, score-0.727]
85 By applying Corollary 1 with γ = κ = (1 − M)/2 we have sup_{i,j,k∈K^m_{i,j,n}} P[|ρ̃n;i,j|k − ρn;i,j|k| ≤ κ] > 1 − C1 (n − 2 − mn) exp(−C2 (n − mn)). [sent-614, score-0.551]
86 Therefore, since κ = (1 − M)/2 yields 1/(1 − (M + κ)^2) = L, and using Formula (11), we get sup_{i,j,k∈K^m_{i,j,n}} P[|g′(ρ̃n;i,j|k)| ≤ L] ≥ 1 − C1 (n − 2 − mn) exp(−C2 (n − mn)). [sent-616, score-0.551]
87 (12) Since |g′(ρ)| ≥ 1 for all ρ, we obtain with Formula (10): sup_{i,j,k∈K^m_{i,j,n}} P[|Zn;i,j|k − zi,j|k| > γ] (13) ≤ sup_{i,j,k∈K^m_{i,j,n}} P[|g′(ρ̃n;i,j|k)| > L] + sup_{i,j,k∈K^m_{i,j,n}} P[|ρ̂n;i,j|k − ρn;i,j|k| > γ/L]. [sent-617, score-0.46]
88 The PC-algorithm in Section 2.2 is then the same as the PC(m̂reach)-algorithm, where m̂reach is the sample version of Formula (2). [sent-628, score-0.124]
89 The population version PCpop(mn)-algorithm, when stopped at level mn = mreach,n, constructs the true skeleton according to Proposition 1. [sent-629, score-0.704]
90 ℓ = −1; C = C̃ repeat ℓ = ℓ + 1 repeat Select a (new) ordered pair of nodes i, j that are adjacent in C such that |adj(C, i) \ {j}| ≥ ℓ repeat Choose (new) k ⊆ adj(C, i) \ {j} with |k| = ℓ. [sent-633, score-0.267]
91 if i and j are conditionally independent given k then Delete edge i, j Denote this new graph by C. [sent-634, score-0.152]
92 Denote by Ĝskel,n(αn, mn) the estimate from the PC(mn)-algorithm in Section 2.2 [sent-637, score-0.176]
93 and by Gskel,n the true skeleton from the DAG Gn. [sent-639, score-0.398]
94 Then, for mn ≥ mreach,n, mn = O(n^{1−b}) (n → ∞), there exists αn → 0 (n → ∞) such that P[Ĝskel,n(αn, mn) = Gskel,n] = 1 − O(exp(−C n^{1−2d})) → 1 (n → ∞) for some 0 < C < ∞. [sent-641, score-0.528]
95 Then, sup_{i,j,k∈K^m_{i,j,n}} P[E^I_{i,j|k}] = sup_{i,j,k∈K^m_{i,j,n}} P[|Zi,j|k − zi,j|k| > (n/(n − |k| − 3))^{1/2} cn/2] ≤ O(n − mn) exp(−C3 (n − mn) cn^2), (16) for some 0 < C3 < ∞, using Lemma 3 and the fact that log((4 − δ^2)/(4 + δ^2)) ∼ −δ^2/2 as δ → 0. [sent-646, score-0.801]
96 By invoking Lemma 3 we then obtain: sup_{i,j,k∈K^m_{i,j,n}} P[E^{II}_{i,j|k}] ≤ O(n − mn) exp(−C4 (n − mn) cn^2) (17) for some 0 < C4 < ∞. [sent-648, score-0.551]
97 Proof: Consider the population algorithm PCpop(m): the reached stopping level satisfies mreach ∈ {qn − 1, qn}, see Proposition 1. [sent-655, score-0.321]
98 The sample PC(mn)-algorithm with stopping level in the range of mreach ≤ mn = O(n^{1−b}) coincides with the population version on a set A having probability P[A] = 1 − O(exp(−C n^{1−2d})), see the last formula in the proof of Lemma 4. [sent-656, score-0.382]
99 Hence, on the set A, mreach,n = mreach ∈ {qn − 1, qn }. [sent-657, score-0.168]
100 As noted in Section 2.3, due to the result of Meek (1995b), it is sufficient to estimate the correct skeleton and separation sets. [sent-664, score-0.398]
wordName wordTfidf (topN-words)
[('dag', 0.496), ('skeleton', 0.398), ('cpdag', 0.272), ('shd', 0.235), ('dags', 0.178), ('mn', 0.176), ('fpr', 0.173), ('tpr', 0.173), ('jn', 0.137), ('alisch', 0.136), ('uhlmann', 0.126), ('imensional', 0.124), ('pc', 0.12), ('equivalence', 0.116), ('igh', 0.105), ('undirected', 0.103), ('mreach', 0.099), ('pop', 0.099), ('spirtes', 0.099), ('ad', 0.091), ('nodes', 0.087), ('directed', 0.086), ('faithfulness', 0.084), ('ki', 0.078), ('qn', 0.069), ('faithful', 0.069), ('edge', 0.066), ('gcpdag', 0.062), ('gskel', 0.062), ('sup', 0.062), ('edges', 0.06), ('independence', 0.054), ('correlations', 0.053), ('orient', 0.052), ('sparseness', 0.052), ('causal', 0.051), ('cn', 0.051), ('graph', 0.051), ('adjacent', 0.051), ('cance', 0.051), ('pcalg', 0.049), ('verma', 0.049), ('neighbors', 0.046), ('acyclic', 0.044), ('partial', 0.044), ('replicates', 0.042), ('independencies', 0.042), ('inspecting', 0.042), ('processor', 0.04), ('lauritzen', 0.04), ('pn', 0.039), ('neighborhood', 0.038), ('ordered', 0.038), ('node', 0.038), ('consistency', 0.038), ('tsamardinos', 0.037), ('cig', 0.037), ('kimn', 0.037), ('pdag', 0.037), ('pmn', 0.037), ('graphs', 0.037), ('exponentially', 0.036), ('conditionally', 0.035), ('conditional', 0.033), ('ei', 0.033), ('zn', 0.032), ('kalisch', 0.031), ('hotelling', 0.031), ('triangles', 0.031), ('lemma', 0.031), ('population', 0.031), ('chickering', 0.03), ('meek', 0.03), ('adjacency', 0.03), ('gn', 0.029), ('settings', 0.028), ('formula', 0.028), ('zi', 0.027), ('maximal', 0.026), ('exp', 0.026), ('increases', 0.026), ('afterwards', 0.026), ('sample', 0.025), ('cpdags', 0.025), ('ersion', 0.025), ('ethz', 0.025), ('gillispie', 0.025), ('goldenberg', 0.025), ('keleton', 0.025), ('pdags', 0.025), ('robins', 0.025), ('skeletons', 0.025), ('zuk', 0.025), ('sparse', 0.025), ('correlation', 0.024), ('dense', 0.024), ('multivariate', 0.024), ('na', 0.023), ('stopping', 0.023), ('graphical', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 31 jmlr-2007-Estimating High-Dimensional Directed Acyclic Graphs with the PC-Algorithm
Author: Markus Kalisch, Peter Bühlmann
Abstract: We consider the PC-algorithm (Spirtes et al., 2000) for estimating the skeleton and equivalence class of a very high-dimensional directed acyclic graph (DAG) with corresponding Gaussian distribution. The PC-algorithm is computationally feasible and often very fast for sparse problems with many nodes (variables), and it has the attractive property to automatically achieve high computational efficiency as a function of sparseness of the true underlying DAG. We prove uniform consistency of the algorithm for very high-dimensional, sparse DAGs where the number of nodes is allowed to quickly grow with sample size n, as fast as O(na ) for any 0 < a < ∞. The sparseness assumption is rather minimal requiring only that the neighborhoods in the DAG are of lower order than sample size n. We also demonstrate the PC-algorithm for simulated data. Keywords: asymptotic consistency, DAG, graphical model, PC-algorithm, skeleton
2 0.060130626 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds
Author: Jean-Yves Audibert, Olivier Bousquet
Abstract: There exist many different generalization error bounds in statistical learning theory. Each of these bounds contains an improvement over the others for certain situations or algorithms. Our goal is, first, to underline the links between these bounds, and second, to combine the different improvements into a single bound. In particular we combine the PAC-Bayes approach introduced by McAllester (1998), which is interesting for randomized predictions, with the optimal union bound provided by the generic chaining technique developed by Fernique and Talagrand (see Talagrand, 1996), in a way that also takes into account the variance of the combined functions. We also show how this connects to Rademacher based bounds. Keywords: statistical learning theory, PAC-Bayes theorems, generalization error bounds
3 0.059208237 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
Author: Roland Nilsson, José M. Peña, Johan Björkegren, Jesper Tegnér
Abstract: We analyze two different feature selection problems: finding a minimal feature set optimal for classification (MINIMAL - OPTIMAL) vs. finding all features relevant to the target variable (ALL RELEVANT ). The latter problem is motivated by recent applications within bioinformatics, particularly gene expression analysis. For both problems, we identify classes of data distributions for which there exist consistent, polynomial-time algorithms. We also prove that ALL - RELEVANT is much harder than MINIMAL - OPTIMAL and propose two consistent, polynomial-time algorithms. We argue that the distribution classes considered are reasonable in many practical cases, so that our results simplify feature selection in a wide range of machine learning tasks. Keywords: learning theory, relevance, classification, Markov blanket, bioinformatics
4 0.053328995 73 jmlr-2007-Revised Loss Bounds for the Set Covering Machine and Sample-Compression Loss Bounds for Imbalanced Data
Author: Zakria Hussain, François Laviolette, Mario Marchand, John Shawe-Taylor, Spencer Charles Brubaker, Matthew D. Mullin
Abstract: Marchand and Shawe-Taylor (2002) have proposed a loss bound for the set covering machine that has the property to depend on the observed fraction of positive examples and on what the classifier achieves on the positive training examples. We show that this loss bound is incorrect. We then propose a loss bound, valid for any sample-compression learning algorithm (including the set covering machine), that depends on the observed fraction of positive examples and on what the classifier achieves on them. We also compare numerically the loss bound proposed in this paper with the incorrect bound, the original SCM bound and a recently proposed loss bound of Marchand and Sokolova (2005) (which does not depend on the observed fraction of positive examples) and show that the latter loss bounds can be substantially larger than the new bound in the presence of imbalanced misclassifications. Keywords: set covering machines, sample-compression, loss bounds
5 0.053199615 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
Author: Rie Johnson, Tong Zhang
Abstract: This paper investigates the effect of Laplacian normalization in graph-based semi-supervised learning. To this end, we consider multi-class transductive learning on graphs with Laplacian regularization. Generalization bounds are derived using geometric properties of the graph. Specifically, by introducing a definition of graph cut from learning theory, we obtain generalization bounds that depend on the Laplacian regularizer. We then use this analysis to better understand the role of graph Laplacian matrix normalization. Under assumptions that the cut is small, we derive near-optimal normalization factors by approximately minimizing the generalization bounds. The analysis reveals the limitations of the standard degree-based normalization method in that the resulting normalization factors can vary significantly within each connected component with the same class label, which may cause inferior generalization performance. Our theory also suggests a remedy that does not suffer from this problem. Experiments confirm the superiority of the normalization scheme motivated by learning theory on artificial and real-world data sets. Keywords: transductive learning, graph learning, Laplacian regularization, normalization of graph Laplacian
6 0.040807363 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models
7 0.036105596 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study
9 0.033600006 83 jmlr-2007-The On-Line Shortest Path Problem Under Partial Monitoring
10 0.032381259 72 jmlr-2007-Relational Dependency Networks
11 0.032112055 1 jmlr-2007-"Ideal Parent" Structure Learning for Continuous Variable Bayesian Networks
12 0.031001167 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition
13 0.02867325 45 jmlr-2007-Learnability of Gaussians with Flexible Variances
14 0.027263813 9 jmlr-2007-AdaBoost is Consistent
15 0.025313178 69 jmlr-2007-Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes
16 0.024731815 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss
17 0.023928637 42 jmlr-2007-Infinitely Imbalanced Logistic Regression
18 0.023656435 48 jmlr-2007-Learning in Environments with Unknown Dynamics: Towards more Robust Concept Learners
19 0.023057578 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
20 0.022968072 75 jmlr-2007-Sparseness vs Estimating Conditional Probabilities: Some Asymptotic Results
topicId topicWeight
[(0, 0.154), (1, 0.003), (2, 0.002), (3, -0.047), (4, 0.004), (5, 0.052), (6, 0.07), (7, -0.051), (8, 0.161), (9, 0.017), (10, 0.082), (11, 0.113), (12, 0.067), (13, -0.101), (14, 0.024), (15, 0.004), (16, 0.018), (17, 0.023), (18, -0.059), (19, 0.065), (20, 0.04), (21, -0.086), (22, -0.043), (23, -0.052), (24, 0.039), (25, 0.151), (26, 0.083), (27, 0.181), (28, -0.037), (29, 0.034), (30, -0.079), (31, 0.191), (32, -0.358), (33, 0.078), (34, -0.1), (35, 0.101), (36, -0.125), (37, -0.165), (38, -0.216), (39, 0.202), (40, 0.014), (41, 0.211), (42, 0.144), (43, 0.073), (44, -0.043), (45, -0.459), (46, -0.064), (47, 0.153), (48, 0.08), (49, 0.127)]
simIndex simValue paperId paperTitle
same-paper 1 0.9678703 31 jmlr-2007-Estimating High-Dimensional Directed Acyclic Graphs with the PC-Algorithm
Author: Markus Kalisch, Peter Bühlmann
Abstract: We consider the PC-algorithm (Spirtes et al., 2000) for estimating the skeleton and equivalence class of a very high-dimensional directed acyclic graph (DAG) with corresponding Gaussian distribution. The PC-algorithm is computationally feasible and often very fast for sparse problems with many nodes (variables), and it has the attractive property to automatically achieve high computational efficiency as a function of sparseness of the true underlying DAG. We prove uniform consistency of the algorithm for very high-dimensional, sparse DAGs where the number of nodes is allowed to quickly grow with sample size n, as fast as O(na ) for any 0 < a < ∞. The sparseness assumption is rather minimal requiring only that the neighborhoods in the DAG are of lower order than sample size n. We also demonstrate the PC-algorithm for simulated data. Keywords: asymptotic consistency, DAG, graphical model, PC-algorithm, skeleton
2 0.24904458 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models
Author: Tapani Raiko, Harri Valpola, Markus Harva, Juha Karhunen
Abstract: We introduce standardised building blocks designed to be used with variational Bayesian learning. The blocks include Gaussian variables, summation, multiplication, nonlinearity, and delay. A large variety of latent variable models can be constructed from these blocks, including nonlinear and variance models, which are lacking from most existing variational systems. The introduced blocks are designed to fit together and to yield efficient update rules. Practical implementation of various models is easy thanks to an associated software package which derives the learning formulas automatically once a specific model structure has been fixed. Variational Bayesian learning provides a cost function which is used both for updating the variables of the model and for optimising the model structure. All the computations can be carried out locally, resulting in linear computational complexity. We present experimental results on several structures, including a new hierarchical nonlinear model for variances and means. The test results demonstrate the good performance and usefulness of the introduced method. Keywords: latent variable models, variational Bayesian learning, graphical models, building blocks, Bayesian modelling, local computation
3 0.2482609 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
Author: Roland Nilsson, José M. Peña, Johan Björkegren, Jesper Tegnér
Abstract: We analyze two different feature selection problems: finding a minimal feature set optimal for classification (MINIMAL - OPTIMAL) vs. finding all features relevant to the target variable (ALL RELEVANT ). The latter problem is motivated by recent applications within bioinformatics, particularly gene expression analysis. For both problems, we identify classes of data distributions for which there exist consistent, polynomial-time algorithms. We also prove that ALL - RELEVANT is much harder than MINIMAL - OPTIMAL and propose two consistent, polynomial-time algorithms. We argue that the distribution classes considered are reasonable in many practical cases, so that our results simplify feature selection in a wide range of machine learning tasks. Keywords: learning theory, relevance, classification, Markov blanket, bioinformatics
4 0.2223796 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
Author: Rie Johnson, Tong Zhang
Abstract: This paper investigates the effect of Laplacian normalization in graph-based semi-supervised learning. To this end, we consider multi-class transductive learning on graphs with Laplacian regularization. Generalization bounds are derived using geometric properties of the graph. Specifically, by introducing a definition of graph cut from learning theory, we obtain generalization bounds that depend on the Laplacian regularizer. We then use this analysis to better understand the role of graph Laplacian matrix normalization. Under assumptions that the cut is small, we derive near-optimal normalization factors by approximately minimizing the generalization bounds. The analysis reveals the limitations of the standard degree-based normalization method in that the resulting normalization factors can vary significantly within each connected component with the same class label, which may cause inferior generalization performance. Our theory also suggests a remedy that does not suffer from this problem. Experiments confirm the superiority of the normalization scheme motivated by learning theory on artificial and real-world data sets. Keywords: transductive learning, graph learning, Laplacian regularization, normalization of graph Laplacian
5 0.21596865 73 jmlr-2007-Revised Loss Bounds for the Set Covering Machine and Sample-Compression Loss Bounds for Imbalanced Data
Author: Zakria Hussain, François Laviolette, Mario Marchand, John Shawe-Taylor, Spencer Charles Brubaker, Matthew D. Mullin
Abstract: Marchand and Shawe-Taylor (2002) have proposed a loss bound for the set covering machine that has the property to depend on the observed fraction of positive examples and on what the classifier achieves on the positive training examples. We show that this loss bound is incorrect. We then propose a loss bound, valid for any sample-compression learning algorithm (including the set covering machine), that depends on the observed fraction of positive examples and on what the classifier achieves on them. We also compare numerically the loss bound proposed in this paper with the incorrect bound, the original SCM bound and a recently proposed loss bound of Marchand and Sokolova (2005) (which does not depend on the observed fraction of positive examples) and show that the latter loss bounds can be substantially larger than the new bound in the presence of imbalanced misclassifications. Keywords: set covering machines, sample-compression, loss bounds
6 0.19763343 1 jmlr-2007-"Ideal Parent" Structure Learning for Continuous Variable Bayesian Networks
7 0.15783274 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds
8 0.13191064 48 jmlr-2007-Learning in Environments with Unknown Dynamics: Towards more Robust Concept Learners
10 0.12372884 16 jmlr-2007-Boosted Classification Trees and Class Probability Quantile Estimation
11 0.11135779 88 jmlr-2007-Unlabeled Compression Schemes for Maximum Classes
12 0.10948122 10 jmlr-2007-An Interior-Point Method for Large-Scalel1-Regularized Logistic Regression
13 0.10539904 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis
14 0.1034266 45 jmlr-2007-Learnability of Gaussians with Flexible Variances
15 0.10234568 72 jmlr-2007-Relational Dependency Networks
16 0.10067939 21 jmlr-2007-Comments on the "Core Vector Machines: Fast SVM Training on Very Large Data Sets"
17 0.10059455 75 jmlr-2007-Sparseness vs Estimating Conditional Probabilities: Some Asymptotic Results
18 0.10017926 9 jmlr-2007-AdaBoost is Consistent
20 0.09766914 3 jmlr-2007-A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation
topicId topicWeight
[(4, 0.011), (8, 0.018), (10, 0.017), (12, 0.035), (14, 0.477), (28, 0.056), (40, 0.038), (45, 0.022), (48, 0.028), (60, 0.056), (80, 0.012), (85, 0.054), (98, 0.084)]
simIndex simValue paperId paperTitle
same-paper 1 0.75225407 31 jmlr-2007-Estimating High-Dimensional Directed Acyclic Graphs with the PC-Algorithm
Author: Markus Kalisch, Peter Bühlmann
Abstract: We consider the PC-algorithm (Spirtes et al., 2000) for estimating the skeleton and equivalence class of a very high-dimensional directed acyclic graph (DAG) with corresponding Gaussian distribution. The PC-algorithm is computationally feasible and often very fast for sparse problems with many nodes (variables), and it has the attractive property to automatically achieve high computational efficiency as a function of sparseness of the true underlying DAG. We prove uniform consistency of the algorithm for very high-dimensional, sparse DAGs where the number of nodes is allowed to quickly grow with sample size n, as fast as O(na ) for any 0 < a < ∞. The sparseness assumption is rather minimal requiring only that the neighborhoods in the DAG are of lower order than sample size n. We also demonstrate the PC-algorithm for simulated data. Keywords: asymptotic consistency, DAG, graphical model, PC-algorithm, skeleton
2 0.35141525 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
Author: Roland Nilsson, José M. Peña, Johan Björkegren, Jesper Tegnér
Abstract: We analyze two different feature selection problems: finding a minimal feature set optimal for classification (MINIMAL - OPTIMAL) vs. finding all features relevant to the target variable (ALL RELEVANT ). The latter problem is motivated by recent applications within bioinformatics, particularly gene expression analysis. For both problems, we identify classes of data distributions for which there exist consistent, polynomial-time algorithms. We also prove that ALL - RELEVANT is much harder than MINIMAL - OPTIMAL and propose two consistent, polynomial-time algorithms. We argue that the distribution classes considered are reasonable in many practical cases, so that our results simplify feature selection in a wide range of machine learning tasks. Keywords: learning theory, relevance, classification, Markov blanket, bioinformatics
3 0.27903074 72 jmlr-2007-Relational Dependency Networks
Author: Jennifer Neville, David Jensen
Abstract: Recent work on graphical models for relational data has demonstrated significant improvements in classification and inference when models represent the dependencies among instances. Despite its use in conventional statistical models, the assumption of instance independence is contradicted by most relational data sets. For example, in citation data there are dependencies among the topics of a paper’s references, and in genomic data there are dependencies among the functions of interacting proteins. In this paper, we present relational dependency networks (RDNs), graphical models that are capable of expressing and reasoning with such dependencies in a relational setting. We discuss RDNs in the context of relational Bayes networks and relational Markov networks and outline the relative strengths of RDNs—namely, the ability to represent cyclic dependencies, simple methods for parameter estimation, and efficient structure learning techniques. The strengths of RDNs are due to the use of pseudolikelihood learning techniques, which estimate an efficient approximation of the full joint distribution. We present learned RDNs for a number of real-world data sets and evaluate the models in a prediction context, showing that RDNs identify and exploit cyclic relational dependencies to achieve significant performance gains over conventional conditional models. In addition, we use synthetic data to explore model performance under various relational data characteristics, showing that RDN learning and inference techniques are accurate over a wide range of conditions. Keywords: relational learning, probabilistic relational models, knowledge discovery, graphical models, dependency networks, pseudolikelihood estimation
Author: Miroslav Dudík, Steven J. Phillips, Robert E. Schapire
Abstract: We present a unified and complete account of maximum entropy density estimation subject to constraints represented by convex potential functions or, alternatively, by convex regularization. We provide fully general performance guarantees and an algorithm with a complete convergence proof. As special cases, we easily derive performance guarantees for many known regularization types, including 1 , 2 , 2 , and 1 + 2 style regularization. We propose an algorithm solving a large and 2 2 general subclass of generalized maximum entropy problems, including all discussed in the paper, and prove its convergence. Our approach generalizes and unifies techniques based on information geometry and Bregman divergences as well as those based more directly on compactness. Our work is motivated by a novel application of maximum entropy to species distribution modeling, an important problem in conservation biology and ecology. In a set of experiments on real-world data, we demonstrate the utility of maximum entropy in this setting. We explore effects of different feature types, sample sizes, and regularization levels on the performance of maxent, and discuss interpretability of the resulting models. Keywords: maximum entropy, density estimation, regularization, iterative scaling, species distribution modeling
5 0.26107234 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
Author: Rie Johnson, Tong Zhang
Abstract: This paper investigates the effect of Laplacian normalization in graph-based semi-supervised learning. To this end, we consider multi-class transductive learning on graphs with Laplacian regularization. Generalization bounds are derived using geometric properties of the graph. Specifically, by introducing a definition of graph cut from learning theory, we obtain generalization bounds that depend on the Laplacian regularizer. We then use this analysis to better understand the role of graph Laplacian matrix normalization. Under assumptions that the cut is small, we derive near-optimal normalization factors by approximately minimizing the generalization bounds. The analysis reveals the limitations of the standard degree-based normalization method in that the resulting normalization factors can vary significantly within each connected component with the same class label, which may cause inferior generalization performance. Our theory also suggests a remedy that does not suffer from this problem. Experiments confirm the superiority of the normalization scheme motivated by learning theory on artificial and real-world data sets. Keywords: transductive learning, graph learning, Laplacian regularization, normalization of graph Laplacian
6 0.26043987 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds
7 0.25894341 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method
8 0.25798553 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
9 0.25704175 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression
11 0.25470793 43 jmlr-2007-Integrating Naïve Bayes and FOIL
12 0.25437626 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification
13 0.25279367 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption
14 0.25277704 77 jmlr-2007-Stagewise Lasso
15 0.25259075 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels
16 0.25243387 76 jmlr-2007-Spherical-Homoscedastic Distributions: The Equivalency of Spherical and Normal Distributions in Classification
17 0.25180432 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data
18 0.25146881 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss
19 0.25143251 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
20 0.25049376 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection