jmlr jmlr2013 jmlr2013-84 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Naftali Harris, Mathias Drton
Abstract: The PC algorithm uses conditional independence tests for model selection in graphical modeling with acyclic directed graphs. In Gaussian models, tests of conditional independence are typically based on Pearson correlations, and high-dimensional consistency results have been obtained for the PC algorithm in this setting. Analyzing the error propagation from marginal to partial correlations, we prove that high-dimensional consistency carries over to a broader class of Gaussian copula or nonparanormal models when using rank-based measures of correlation. For graph sequences with bounded degree, our consistency result is as strong as prior Gaussian results. In simulations, the ‘Rank PC’ algorithm works as well as the ‘Pearson PC’ algorithm for normal data and considerably better for non-normal data, all the while incurring a negligible increase of computation time. While our interest is in the PC algorithm, the presented analysis of error propagation could be applied to other algorithms that test the vanishing of low-order partial correlations. Keywords: Gaussian copula, graphical model, model selection, multivariate normal distribution, nonparanormal distribution
Reference: text
sentIndex sentText sentNum sentScore
1 Department of Statistics, University of Washington, Seattle, WA, USA 98195. Editor: Chris Meek. Abstract: The PC algorithm uses conditional independence tests for model selection in graphical modeling with acyclic directed graphs. [sent-3, score-0.41]
2 In Gaussian models, tests of conditional independence are typically based on Pearson correlations, and high-dimensional consistency results have been obtained for the PC algorithm in this setting. [sent-4, score-0.221]
3 Analyzing the error propagation from marginal to partial correlations, we prove that high-dimensional consistency carries over to a broader class of Gaussian copula or nonparanormal models when using rank-based measures of correlation. [sent-5, score-0.537]
4 Keywords: Gaussian copula, graphical model, model selection, multivariate normal distribution, nonparanormal distribution. [sent-9, score-0.407]
5 Introduction. Let G = (V, E) be an acyclic digraph with finite vertex set, and let X = (Xv)v∈V be a random vector whose entries are in correspondence with the graph’s vertices. [sent-10, score-0.2]
6 Then the graph G determines a statistical model for the joint distribution of X by imposing conditional independences that can be read off from G using the concept of d-separation. [sent-11, score-0.238]
7 As is common in the field, we use the abbreviation DAG (for ‘directed acyclic graph’) when referring to acyclic digraphs. [sent-17, score-0.192]
8 This algorithm uses conditional independence tests to infer a DAG from data. [sent-20, score-0.165]
9 Hence, the goal becomes estimation of the Markov equivalence class of an acyclic digraph G. [sent-29, score-0.203]
10 For representation of the equivalence class, prior work considers a particular partially directed graph C(G), for which it holds that C(G) = C(H) if and only if the two DAGs G and H are Markov equivalent; see Andersson et al. [sent-30, score-0.227]
11 The graph C(G) may contain both directed and undirected edges, and it is acyclic in the sense of its directed subgraph having no directed cycles. [sent-32, score-0.504]
12 We will refer to C(G) as the completed partially directed acyclic graph (CPDAG), but other terminology such as the essential graph is in use. [sent-33, score-0.387]
13 The PC algorithm uses conditional independence tests to infer a CPDAG from data (Spirtes et al. [sent-34, score-0.165]
14 Moreover, the pairwise conditional independence Xu ⊥⊥ Xv | XS is equivalent to the vanishing of the partial correlation ρuv|S, that is, the correlation obtained from the bivariate normal conditional distribution of (Xu, Xv) given XS. [sent-45, score-0.612]
15 Our later theoretical analysis will use the fact that
$$\rho_{uv|S} = -\frac{\Psi^{-1}_{uv}}{\sqrt{\Psi^{-1}_{uu}\,\Psi^{-1}_{vv}}}, \qquad (2)$$
where $\Psi = \Sigma_{(u,v,S),(u,v,S)}$ is the concerned principal submatrix of Σ. [sent-47, score-0.306]
16 A natural estimate of ρuv|S is the sample partial correlation obtained by replacing Σ with the empirical covariance matrix of available observations. [sent-48, score-0.209]
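To make (2) concrete, here is a minimal NumPy sketch of the computation; the function name and the example matrix are illustrative choices of ours, not from the paper. Replacing Σ by the empirical correlation matrix of the observations yields the sample version described above.

```python
# Minimal sketch of Equation (2): the partial correlation rho_{uv|S} from
# the inverse of the principal submatrix Psi = Sigma[(u,v,S),(u,v,S)].
import numpy as np

def partial_correlation(Sigma, u, v, S):
    # Order the submatrix so that positions 0 and 1 correspond to u and v.
    idx = [u, v] + list(S)
    P = np.linalg.inv(Sigma[np.ix_(idx, idx)])
    return -P[0, 1] / np.sqrt(P[0, 0] * P[1, 1])

# Illustrative example: an equicorrelated 4x4 correlation matrix.
Sigma = 0.5 * np.ones((4, 4)) + 0.5 * np.eye(4)
print(partial_correlation(Sigma, 0, 1, [2, 3]))
```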
17 The main results in Kalisch and Bühlmann (2007) show high-dimensional consistency of the PC algorithm when the observations form a sample of independent normal random vectors that are faithful to a suitably sparse DAG. [sent-50, score-0.278]
18 The purpose of this paper is to show that the PC algorithm has high-dimensional consistency properties for a broader class of distributions, when standard Pearson-type empirical correlations are replaced by rank-based measures of correlations in tests of conditional independence. [sent-51, score-0.544]
19 Definition 2 Let f = (fv)v∈V be a collection of strictly increasing functions fv : R → R, and let Σ ∈ R^{V×V} be a positive definite correlation matrix. [sent-56, score-0.242]
20 The nonparanormal distribution NPN( f , Σ) is the distribution of the random vector ( fv (Zv ))v∈V for (Zv )v∈V ∼ N(0, Σ). [sent-57, score-0.327]
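Definition 2 translates directly into a sampling recipe; the following is a minimal sketch, where the particular transforms fv and the correlation matrix are chosen purely for illustration.

```python
# Sketch: sampling from NPN(f, Sigma) by pushing a latent N(0, Sigma)
# vector through strictly increasing coordinate-wise transforms f_v.
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[1.0, 0.6, 0.3],
                  [0.6, 1.0, 0.5],
                  [0.3, 0.5, 1.0]])
f = [np.exp, np.cbrt, lambda z: z + z**3]  # strictly increasing, illustrative

Z = rng.multivariate_normal(np.zeros(3), Sigma, size=500)  # latent Gaussians
X = np.column_stack([f[v](Z[:, v]) for v in range(3)])     # NPN(f, Sigma) draws
```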
21 Definition 3 The nonparanormal graphical model NPN(G) associated with a DAG G is the set of all distributions NPN( f , Σ) that are Markov with respect to G. [sent-61, score-0.326]
22 Since the marginal transformations fv are deterministic, the dependence structure in a nonparanormal distribution corresponds to that in the underlying latent multivariate normal distribution. [sent-62, score-0.408]
23 In light of this equivalence, we will occasionally speak of a correlation matrix Σ being Markov or faithful to a DAG, meaning that the requirement holds for any distribution NPN( f , Σ). [sent-65, score-0.2]
24 In the remainder of the paper we study the PC algorithm in the nonparanormal context, proposing the use of Spearman’s rank correlation and Kendall’s τ for estimation of the correlation matrix parameter of a nonparanormal distribution. [sent-66, score-0.845]
25 In Section 2, we review how transformations of Spearman’s rank correlation and Kendall’s τ yield accurate estimators of the latent Gaussian correlations. [sent-67, score-0.185]
26 Theorem 8 in Section 4 gives our main result, an error bound for the output of the PC algorithm when correlations are used to determine nonparanormal conditional independence. [sent-70, score-0.508]
27 Our numerical work in Section 5 makes a strong case for the use of rank correlations in the PC algorithm. [sent-73, score-0.248]
28 In the nonparanormal setting, the marginal distributions are continuous so that ties occur with probability zero, making ranks well-defined. [sent-87, score-0.268]
29 It turns out that simple trigonometric transformations of ρ̂S and τ̂ are excellent estimators of the population Pearson correlation for multivariate normal data. [sent-91, score-0.205]
30 Since ranks are preserved under strictly increasing functions, (4) and (5) still hold if (X,Y ) ∼ NPN( f , Σ) with Pearson correlation ρ = Σxy in the underlying latent bivariate normal distribution. [sent-98, score-0.208]
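A minimal sketch of these transformations, assuming (4) and (5) take the classical sine forms 2 sin(π ρ̂S/6) and sin(π τ̂/2), which is the standard choice in this literature; the SciPy calls compute the rank correlations from data. By the rank invariance just noted, the same code applies to nonparanormal observations.

```python
# Sketch: trigonometric transformations of Spearman's rho_S and Kendall's
# tau into estimates of the latent Pearson correlation Sigma_xy.
import numpy as np
from scipy.stats import spearmanr, kendalltau

def latent_corr_spearman(x, y):
    rho_s = spearmanr(x, y)[0]
    return 2.0 * np.sin(np.pi * rho_s / 6.0)

def latent_corr_kendall(x, y):
    tau = kendalltau(x, y)[0]
    return np.sin(np.pi * tau / 2.0)
```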
31 Throughout the rest of this paper, we will assume that we have some estimator ρ̂ of ρ which has the property that, for nonparanormal data,
$$P(|\hat\rho - \rho| > \varepsilon) < A\,\exp(-Bn\varepsilon^2) \qquad (6)$$
for fixed constants 0 < A, B < ∞. [sent-99, score-0.268]
32 When presented with multivariate observations from a distribution NPN(f, Σ), we apply the estimator from (6) to every pair of coordinates to obtain an estimator Σ̂ of the correlation matrix parameter. Plugging Σ̂ into (1), or equivalently into (2), gives the partial correlation estimators that we use. [sent-101, score-0.364]
33 An analysis of the algorithm in the context of nonparanormal distributions thus requires bounds on errors in partial correlations. [sent-105, score-0.353]
34 It provides a uniform bound on errors in partial correlations when a uniform bound on errors in marginal correlations is available. [sent-107, score-0.459]
35 If $\Sigma \in \mathbb{R}^{q\times q}$ satisfies
$$\|\hat\Sigma - \Sigma\|_\infty < \frac{c\,\lambda_{\min}^2}{(2+c)q + \lambda_{\min}\, c\, q}$$
with c > 0, then all partial correlations are well-defined and their differences are bounded as
$$\big|\hat\rho_{uv|\setminus\{u,v\}} - \rho_{uv|\setminus\{u,v\}}\big| := \Bigg|\frac{\hat\Sigma^{-1}_{uv}}{\sqrt{\hat\Sigma^{-1}_{uu}\hat\Sigma^{-1}_{vv}}} - \frac{\Sigma^{-1}_{uv}}{\sqrt{\Sigma^{-1}_{uu}\Sigma^{-1}_{vv}}}\Bigg| < c, \qquad 1 \le u < v \le q. \quad \text{[sent-112, score-0.933]}$$
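The lemma lends itself to a quick numerical sanity check; the sketch below (the matrix, constants, and perturbation are ours, for illustration only) builds an arbitrary perturbation inside the stated ∞-norm radius and confirms that all full-order partial correlations move by less than c.

```python
# Numerical illustration of the error-propagation lemma: a perturbation
# with ||Sigma_hat - Sigma||_inf below c*lam^2/((2+c)q + lam*c*q) should
# move every partial correlation (given all remaining variables) by < c.
import numpy as np

rng = np.random.default_rng(1)
q, c = 4, 0.2
Sigma = 0.4 * np.ones((q, q)) + 0.6 * np.eye(q)
lam = np.linalg.eigvalsh(Sigma).min()
radius = c * lam**2 / ((2 + c) * q + lam * c * q)

E = rng.uniform(-1, 1, (q, q)); E = (E + E.T) / 2; np.fill_diagonal(E, 0)
Sigma_hat = Sigma + 0.99 * radius * E / np.abs(E).max()

def pcorr(S):  # matrix of rho_{uv | V \ {u,v}} via the matrix inverse
    P = np.linalg.inv(S)
    return -P / np.sqrt(np.outer(np.diag(P), np.diag(P)))

diff = np.abs(pcorr(Sigma_hat) - pcorr(Sigma))
print(diff[np.triu_indices(q, 1)].max() < c)  # expected: True
```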
36 Lemma 6 (Diagonal of inverted correlation matrix) If $\Sigma \in \mathbb{R}^{q\times q}$ is a positive definite correlation matrix, then the diagonal entries of $\Sigma^{-1} = (\sigma^{ij})$ satisfy $\sigma^{ii} \ge 1$. [sent-126, score-0.248]
37 The next lemma addresses the error propagation from the inverse of a correlation matrix to partial correlations. [sent-136, score-0.249]
38 Rank PC Algorithm. Based on the equivalence (3), we may use the rank-based partial correlation estimates ρ̂uv|S to test conditional independences. [sent-144, score-0.298]
39 The second part of the algorithm is independent of the type of correlations involved. [sent-151, score-0.187]
40 For an accurate enough estimate of a correlation matrix Σ that is faithful to a DAG G, this second part takes O(p^{deg(G)}) time in the worst case, but it is often much faster; compare Kalisch and Bühlmann (2007). [sent-153, score-0.296]
41 For a correlation matrix $\Sigma \in \mathbb{R}^{V\times V}$, let
$$c_{\min}(\Sigma) := \min\big\{\,|\rho_{uv|S}| : u, v \in V,\ S \subseteq V \setminus \{u, v\},\ \rho_{uv|S} \ne 0 \,\big\} \qquad (10)$$
be the minimal magnitude of any non-zero partial correlation, and let λmin(Σ) be the minimal eigenvalue. [sent-158, score-0.358]
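For intuition, cmin(Σ) can be evaluated by brute force on very small problems; the following sketch is ours, exponential in the dimension, and suitable only for illustration.

```python
# Brute-force c_min(Sigma) from (10): the smallest magnitude of a non-zero
# partial correlation over all pairs u, v and all sets S in V \ {u, v}.
import numpy as np
from itertools import combinations

def c_min(Sigma, tol=1e-10):
    p = Sigma.shape[0]
    vals = []
    for u, v in combinations(range(p), 2):
        rest = [w for w in range(p) if w not in (u, v)]
        for k in range(len(rest) + 1):
            for S in combinations(rest, k):
                idx = [u, v] + list(S)
                P = np.linalg.inv(Sigma[np.ix_(idx, idx)])
                r = -P[0, 1] / np.sqrt(P[0, 0] * P[1, 1])
                if abs(r) > tol:
                    vals.append(abs(r))
    return min(vals)
```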
42 Let X1, . . . , Xn be a sample of independent observations drawn from a nonparanormal distribution NPN(f, Σ) that is faithful to a DAG G with p nodes. [sent-163, score-0.344]
43 By Theorem 1, if all conditional independence tests for conditioning sets of size |S| ≤ deg(G) make correct decisions, then the output of the RPC algorithm Ĉγ(G) is equal to the CPDAG C(G). [sent-168, score-0.194]
44 Hence, the test makes a correct decision if |ρ̂uv|S − ρuv|S| < c/2 because all non-zero partial correlations for |S| ≤ deg(G) are bounded away from zero by c; recall (10) and (11). [sent-170, score-0.272]
45 Therefore, ‖Σ̂ − Σ‖∞ < ε implies that our tests decide all conditional independences correctly in the RPC algorithm. [sent-176, score-0.199]
46 The inequality c ≤ 1 holds trivially because partial correlations are in [−1, 1]. [sent-179, score-0.272]
47 Let pn be the number of nodes of Gn , and let qn = deg(Gn ) + 2. [sent-184, score-0.266]
48 Suppose (Σn ) is a sequence of pn × pn correlation matrices, with Σn faithful to Gn . [sent-185, score-0.282]
49 Suppose further that there are constants 0 ≤ a, b, d, f < 1 that govern the growth of the graphs as $\log p_n = O(n^a)$ and $q_n = O(n^b)$, and the minimal signal strengths and eigenvalues as $c_{\min}(\Sigma_n, q_n) = \Omega(n^{-d})$ and $\lambda_{\min}(\Sigma_n, q_n) = \Omega(n^{-f})$. [sent-186, score-0.749]
50 Suppose X1, . . . , Xn is a sample from a nonparanormal distribution NPN(·, Σn). [sent-190, score-0.268]
51 It is then natural to work with a bound on the partial correlations associated with small conditioning sets. [sent-195, score-0.301]
52 More precisely, Kalisch and Bühlmann (2007) assume that there is a constant 0 ≤ M < 1 such that for any n, the partial correlations ρuv|S of the matrix Σn satisfy |ρuv|S| ≤ M for all u, v ∈ V, S ⊆ V \ {u, v}, |S| ≤ qn. [sent-196, score-0.553]
53 Stated for the case of polynomial growth of pn (with a = 0), their result gives consistency when b + 2d < 1; our constant b corresponds to 1 − b in Kalisch and Bühlmann (2007). [sent-199, score-0.193]
54 In the important special case of bounded degree, however, our nonparanormal result is just as strong as the previously established Gaussian consistency guarantee. [sent-202, score-0.324]
55 Moreover, the set of correlation matrices of size q0 satisfying (14) with qn = q0 is compact. [sent-205, score-0.309]
56 If the observations are multivariate normal and the ρ̂uv|S are sample partial correlations, then α is an asymptotic significance level for the test in (15). [sent-214, score-0.353]
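The classical form of such a test, which we take to be what (15) refers to, is the Fisher z-transform test for partial correlations; a minimal sketch, with names of our choosing:

```python
# Sketch of a Fisher z-transform test for H0: rho_{uv|S} = 0 with |S| = s.
# Under multivariate normality and sample partial correlations, alpha is
# an asymptotic significance level.
import numpy as np
from scipy.stats import norm

def reject_zero_pcorr(rho_hat, n, s, alpha=0.01):
    z = 0.5 * np.log((1 + rho_hat) / (1 - rho_hat))  # Fisher z-transform
    return np.sqrt(n - s - 3) * abs(z) > norm.ppf(1 - alpha / 2)
```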
57 Hence, our Theorem 8 would apply directly when plugging rank correlations into the mentioned software implementations of the PC algorithm. [sent-218, score-0.248]
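Concretely, “plugging in” amounts to handing a PC implementation the transformed rank-correlation matrix in place of the Pearson matrix; below is a sketch of that first stage only, since the downstream PC call and its interface are implementation-specific.

```python
# Sketch: the rank-based correlation matrix Sigma_hat that RPC uses in
# place of the Pearson matrix; only this first stage differs from PC.
import numpy as np
from scipy.stats import spearmanr

def rank_correlation_matrix(X):
    # Assumes X has at least three columns, so that spearmanr returns a
    # p x p matrix (for two columns it returns a scalar). The sine
    # transformation maps the diagonal to 2*sin(pi/6) = 1, as required.
    rho_s = np.asarray(spearmanr(X)[0])
    return 2.0 * np.sin(np.pi * rho_s / 6.0)
```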
58 Following Kalisch and Bühlmann (2007), we simulate random DAGs and sample from probability distributions faithful to them. [sent-229, score-0.172]
59 The contaminated distributions in (iii) do not belong to the nonparanormal class. [sent-246, score-0.312]
60 For the simulations we sample from two graph distributions: A small graph on ten vertices with an expected vertex degree of three, and a larger graph on one hundred vertices with an expected vertex degree of six. [sent-247, score-0.464]
61 For each n ∈ {50, 1000} and each of the three types of data listed above, we sample 201 random graphs from both the small and large graph distributions, and then sample n observations from the graph with the given data distribution. [sent-248, score-0.236]
62 We consider the RPC algorithm in the version that uses Spearman correlations as in (4); the results for Kendall’s τ were similar. [sent-251, score-0.187]
63 The skeleton of a graph G is the undirected graph with edges between nodes that are adjacent in G. [sent-253, score-0.332]
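In code, the skeleton comparison used in the simulations reduces to symmetrizing the adjacency matrix; a tiny sketch with an illustrative DAG of our choosing:

```python
# Sketch: skeleton of a DAG as a symmetric 0/1 adjacency matrix.
import numpy as np

def skeleton(A):
    return ((A + A.T) > 0).astype(int)

A = np.array([[0, 1, 0],
              [0, 0, 1],
              [0, 0, 0]])  # DAG 0 -> 1 -> 2
print(skeleton(A))
```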
64 As expected, the performance of RPC on nonparanormal data is the same as on normal data, while that of Pearson-PC and Qn-PC deteriorates. [sent-308, score-0.318]
65 Gene Expression Data. While Kendall’s τ and Spearman’s rank correlation give similar results for continuous observations from a distribution with Gaussian copula, the two measures of correlation can give quite different results in applications. [sent-312, score-0.309]
66 Multimodal marginals can arise under nonparanormal distributions, which thus have the potential to alleviate the effects of such [...]. [Flattened table residue: column headers ‘Small graph, n = 50’, ‘Small graph, n = 1000’, ‘Large graph, n = 50’, ‘Large graph, n = 1000’ and a ‘Pearson-PC’ row whose values are truncated.] [sent-317, score-0.268]
67 We ran the PC algorithm using Pearson correlations, Spearman correlations as well as Kendall’s τ. [sent-344, score-0.187]
68 Briefly put, a bidirected edge arises when this edge appears in the skeleton inferred in the first stage of the PC algorithm but the edge orientation rules in the second stage of the algorithm yield arrowheads at both endpoints of the edge. [sent-364, score-0.25]
69 The PC algorithm of Spirtes et al. (2000) addresses the problem of model selection in graphical modeling with directed graphs via a clever scheme of testing conditional independences. [sent-371, score-0.268]
70 For multivariate normal observations, the algorithm is known to have high-dimensional consistency properties when conditional independence is tested using sample partial correlations (Kalisch and Bühlmann, 2007). [sent-372, score-0.513]
71 We showed that the PC algorithm retains these consistency properties when observations follow a Gaussian copula model and rank-based measures of correlation are used to assess conditional independence. [sent-373, score-0.321]
72 When the degree grows, our assumptions are slightly more restrictive, as our proof requires control of the conditioning of principal submatrices of correlation matrices that are inverted to estimate partial correlations in the rank-based PC (RPC) algorithm. [sent-375, score-0.459]
73 Since rank correlations take only marginally longer to compute than sample correlations, the simulations suggest that there are hardly any downsides associated with making RPC the standard version of the PC algorithm for continuous data. [sent-378, score-0.278]
74 In fact, our results make the stronger assumption that non-zero partial correlations are sufficiently far from zero. [sent-380, score-0.272]
75 First, for graphs with suitably bounded degree the population version of the PC algorithm only needs to check conditional independences with small conditioning sets. [sent-384, score-0.237]
76 Second, the low-order partial correlations whose vanishing corresponds to these conditional independences can be estimated accurately. [sent-385, score-0.414]
77 Lemma 4, which provides the error propagation from marginal to partial correlations, could similarly be used to analyze other algorithms that test the vanishing of low-order partial correlations. [sent-386, score-0.248]
78 Sample Size Adjustment. We now show that the consistency result in Corollary 9 still holds when using the conditional independence tests from (15). [sent-399, score-0.221]
79 We claim that the RPC algorithm using the tests from (17) is consistent when choosing the quantile sequence
$$z_n = \sqrt{n-3}\cdot\log\frac{1 + c_n/3}{1 - c_n/3}, \qquad (18)$$
where we use the abbreviation $c_n := c_{\min}(\Sigma_n, q_n)$. [sent-404, score-0.672]
80 We will show that as the sample size n tends to infinity, with probability tending to one, |ρ̂uv|S − ρuv|S| < cn/3 for every u, v ∈ V and |S| ≤ qn. [sent-405, score-0.276]
81 Furthermore, we will show that for the above choice of zn and all sufficiently large n, we have cn /3 ≤ γ(n, |S|, zn ) ≤ 2cn /3 for each relevant set S with 0 ≤ |S| ≤ qn . [sent-406, score-0.412]
82 When substituting pn , qn , cn and λmin (Σn , qn ) for p, q, c and λ, respectively, the scaling assumptions in Corollary 9 imply that the probability bound in (19) tends to one as n → ∞, and we obtain the first part of our claim. [sent-410, score-0.502]
83 For the second part of our claim, note that our choice of zn in (18) gives γ(n, 0, zn ) = cn /3. [sent-411, score-0.227]
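This identity is easy to verify numerically under the definitions used in this appendix, namely γ(n, s, z) = f(z/√(n − s − 3)) with f(x) = (exp(x) − 1)/(exp(x) + 1); a sketch with an illustrative choice of n and cn:

```python
# Numerical check that the quantile choice (18) yields gamma(n,0,z_n) = c_n/3.
import numpy as np

def f(x):
    return (np.exp(x) - 1.0) / (np.exp(x) + 1.0)

def gamma(n, s, z):
    return f(z / np.sqrt(n - s - 3))

n, c_n = 1000, 0.1
z_n = np.sqrt(n - 3) * np.log((1 + c_n / 3) / (1 - c_n / 3))
print(gamma(n, 0, z_n), c_n / 3)  # both approximately 0.0333
```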
84 Since γ(n, |S|, z) is monotonically increasing in |S|, we need only show that for sufficiently large n, γ(n, qn , zn ) − γ(n, 0, zn ) ≤ cn /3. [sent-412, score-0.412]
85 For x ≥ 0, the function $f(x) = \frac{\exp(x)-1}{\exp(x)+1}$ is concave and, thus, for any qn ≥ 0,
$$\gamma(n, q_n, z_n) - \gamma(n, 0, z_n) = f\Big(\frac{z_n}{\sqrt{n-q_n-3}}\Big) - f\Big(\frac{z_n}{\sqrt{n-3}}\Big) \le f'\Big(\frac{z_n}{\sqrt{n-3}}\Big)\Big(\frac{z_n}{\sqrt{n-q_n-3}} - \frac{z_n}{\sqrt{n-3}}\Big). \qquad (20)$$
[sent-413, score-0.691]
86 The derivative of f is $f'(x) = \frac{2\exp(x)}{(\exp(x)+1)^2}$. [sent-414, score-0.185]
87 Evaluating the right-hand side of (20), we obtain that
$$\gamma(n, q_n, z_n) - \gamma(n, 0, z_n) \le \frac{1}{2}\Big(1 - \frac{c_n^2}{9}\Big)\log\frac{1+c_n/3}{1-c_n/3}\Big(\frac{\sqrt{n-3}}{\sqrt{n-q_n-3}} - 1\Big). \qquad (21)$$
[sent-415, score-0.458]
88 Being derived from absolute values of partial correlations, the sequence cn is in [0, 1]. [sent-416, score-0.773]
89 Therefore,
$$\frac{1}{2}\log\frac{1+c_n/3}{1-c_n/3} \le \frac{1}{2}\log(2)\cdot c_n, \qquad c_n \in [0, 1].$$
[sent-418, score-0.091]
90 This shows that the bound in (21) is $o(c_n)$ because, by assumption, $q_n = o(\sqrt{n})$. [sent-419, score-0.276]
91 Background on Graphical Models. Let G = (V, E) be an acyclic digraph with finite vertex set. [sent-422, score-0.2]
92 As mentioned in the introduction, the conditional independences associated with the graph G may be determined using d-separation; compare, for example, page 48 in Lauritzen (1996). [sent-424, score-0.238]
93 A path from u to v is a sequence of nodes (v0, . . . , vn) such that v0 = u, vn = v, and for all 1 ≤ k ≤ n, either vk−1 → vk ∈ E or vk−1 ← vk ∈ E. [sent-429, score-0.26]
94 (ii) The three nodes form a collider vk−1 → vk ← vk+1 , and neither vk nor any of its descendants is in S. [sent-431, score-0.3]
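The blocking conditions translate into a short path checker; the sketch below is ours, with condition (i) (a non-collider on the path lying in S) filled in from the standard d-separation definition, since the extraction retains only condition (ii).

```python
# Sketch: is a given path blocked by S? 'edges' is a set of directed pairs
# (a, b) meaning a -> b; colliders on the path are v_{k-1} -> v_k <- v_{k+1}.
def descendants(v, edges):
    out, stack = set(), [v]
    while stack:
        w = stack.pop()
        for a, b in edges:
            if a == w and b not in out:
                out.add(b)
                stack.append(b)
    return out

def path_blocked(path, S, edges):
    for k in range(1, len(path) - 1):
        a, m, b = path[k - 1], path[k], path[k + 1]
        if (a, m) in edges and (b, m) in edges:          # collider at m
            if m not in S and not (descendants(m, edges) & set(S)):
                return True                              # condition (ii)
        elif m in S:
            return True                                  # condition (i)
    return False

edges = {("u", "m"), ("v", "m")}                         # u -> m <- v
print(path_blocked(["u", "m", "v"], set(), edges))       # True: blocked by S = {}
```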
95 The skeleton of a digraph G is the undirected graph obtained by converting each directed edge into an undirected edge. [sent-437, score-0.444]
96 Let [G] be the Markov equivalence class of an acyclic digraph G = (V, E). [sent-439, score-0.203]
97 The graph C(G) = (V, [E]) is known as the completed partially directed acyclic graph (CPDAG) for G or also as the essential graph. [sent-443, score-0.387]
98 Learning high-dimensional directed acyclic graphs with latent and selection variables. [sent-481, score-0.223]
99 Estimating high-dimensional directed acyclic graphs with the PC-algorithm. [sent-502, score-0.223]
100 Robustification of the PC-algorithm for directed acyclic graphs. [sent-509, score-0.187]
wordName wordTfidf (topN-words)
[('pc', 0.331), ('rpc', 0.316), ('uv', 0.306), ('nonparanormal', 0.268), ('kalisch', 0.231), ('correlations', 0.187), ('qn', 0.185), ('dag', 0.152), ('kendall', 0.132), ('vk', 0.13), ('arris', 0.128), ('onparanormal', 0.128), ('rton', 0.128), ('correlation', 0.124), ('npn', 0.122), ('xv', 0.121), ('deg', 0.11), ('raphical', 0.11), ('graph', 0.1), ('hlmann', 0.096), ('acyclic', 0.096), ('cn', 0.091), ('directed', 0.091), ('copula', 0.088), ('cmin', 0.085), ('cpdag', 0.085), ('independences', 0.085), ('partial', 0.085), ('xs', 0.081), ('spearman', 0.081), ('gn', 0.081), ('faithful', 0.076), ('pearson', 0.074), ('markus', 0.073), ('dags', 0.071), ('digraph', 0.071), ('odels', 0.069), ('zn', 0.068), ('lgorithm', 0.067), ('rank', 0.061), ('drton', 0.061), ('tests', 0.061), ('fv', 0.059), ('rq', 0.059), ('graphical', 0.058), ('colombo', 0.057), ('maathuis', 0.057), ('skeleton', 0.057), ('consistency', 0.056), ('edge', 0.055), ('mathias', 0.055), ('bic', 0.053), ('conditional', 0.053), ('independence', 0.051), ('normal', 0.05), ('cq', 0.049), ('qq', 0.049), ('spirtes', 0.048), ('han', 0.045), ('contaminated', 0.044), ('andersson', 0.043), ('marloes', 0.043), ('pathway', 0.043), ('tetrad', 0.043), ('horn', 0.042), ('liu', 0.041), ('pn', 0.041), ('propagation', 0.04), ('auc', 0.04), ('nodes', 0.04), ('adjustment', 0.038), ('vanishing', 0.038), ('xu', 0.037), ('causal', 0.037), ('pcalg', 0.037), ('graphs', 0.036), ('markov', 0.036), ('equivalence', 0.036), ('undirected', 0.035), ('xa', 0.034), ('bivariate', 0.034), ('xb', 0.034), ('degree', 0.034), ('vertex', 0.033), ('naftali', 0.033), ('minimal', 0.032), ('peter', 0.032), ('gene', 0.032), ('multivariate', 0.031), ('clever', 0.03), ('simulations', 0.03), ('conditioning', 0.029), ('bidirected', 0.028), ('brem', 0.028), ('mapk', 0.028), ('mtn', 0.028), ('sadeghi', 0.028), ('schmidberger', 0.028), ('uhler', 0.028), ('unshielded', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 84 jmlr-2013-PC Algorithm for Nonparanormal Graphical Models
Author: Naftali Harris, Mathias Drton
Abstract: The PC algorithm uses conditional independence tests for model selection in graphical modeling with acyclic directed graphs. In Gaussian models, tests of conditional independence are typically based on Pearson correlations, and high-dimensional consistency results have been obtained for the PC algorithm in this setting. Analyzing the error propagation from marginal to partial correlations, we prove that high-dimensional consistency carries over to a broader class of Gaussian copula or nonparanormal models when using rank-based measures of correlation. For graph sequences with bounded degree, our consistency result is as strong as prior Gaussian results. In simulations, the ‘Rank PC’ algorithm works as well as the ‘Pearson PC’ algorithm for normal data and considerably better for non-normal data, all the while incurring a negligible increase of computation time. While our interest is in the PC algorithm, the presented analysis of error propagation could be applied to other algorithms that test the vanishing of low-order partial correlations. Keywords: Gaussian copula, graphical model, model selection, multivariate normal distribution, nonparanormal distribution
2 0.19606315 20 jmlr-2013-CODA: High Dimensional Copula Discriminant Analysis
Author: Fang Han, Tuo Zhao, Han Liu
Abstract: We propose a high dimensional classification method, named the Copula Discriminant Analysis (CODA). The CODA generalizes the normal-based linear discriminant analysis to the larger Gaussian Copula models (or the nonparanormal) as proposed by Liu et al. (2009). To simultaneously achieve estimation efficiency and robustness, the nonparametric rank-based methods including the Spearman’s rho and Kendall’s tau are exploited in estimating the covariance matrix. In high dimensional settings, we prove that the sparsity pattern of the discriminant features can be consistently recovered with the parametric rate, and the expected misclassification error is consistent to the Bayes risk. Our theory is backed up by careful numerical experiments, which show that the extra flexibility gained by the CODA method incurs little efficiency loss even when the data are truly Gaussian. These results suggest that the CODA method can be an alternative choice besides the normal-based high dimensional linear discriminant analysis. Keywords: high dimensional statistics, sparse nonlinear discriminant analysis, Gaussian copula, nonparanormal distribution, rank-based statistics
Author: Nima Noorshams, Martin J. Wainwright
Abstract: The sum-product or belief propagation (BP) algorithm is a widely used message-passing technique for computing approximate marginals in graphical models. We introduce a new technique, called stochastic orthogonal series message-passing (SOSMP), for computing the BP fixed point in models with continuous random variables. It is based on a deterministic approximation of the messages via orthogonal series basis expansion, and a stochastic estimation of the basis coefficients via Monte Carlo techniques and damped updates. We prove that the SOSMP iterates converge to a δ-neighborhood of the unique BP fixed point for any tree-structured graph, and for any graphs with cycles in which the BP updates satisfy a contractivity condition. In addition, we demonstrate how to choose the number of basis coefficients as a function of the desired approximation accuracy δ and smoothness of the compatibility functions. We illustrate our theory with both simulated examples and in application to optical flow estimation. Keywords: graphical models, sum-product for continuous state spaces, low-complexity belief propagation, stochastic approximation, Monte Carlo methods, orthogonal basis expansion
4 0.13585733 110 jmlr-2013-Sub-Local Constraint-Based Learning of Bayesian Networks Using A Joint Dependence Criterion
Author: Rami Mahdi, Jason Mezey
Abstract: Constraint-based learning of Bayesian networks (BN) from limited data can lead to multiple testing problems when recovering dense areas of the skeleton and to conflicting results in the orientation of edges. In this paper, we present a new constraint-based algorithm, light mutual min (LMM) for improved accuracy of BN learning from small sample data. LMM improves the assessment of candidate edges by using a ranking criterion that considers conditional independence on neighboring variables at both sides of an edge simultaneously. The algorithm also employs an adaptive relaxation of constraints that, selectively, allows some nodes not to condition on some neighbors. This relaxation aims at reducing the incorrect rejection of true edges connecting high degree nodes due to multiple testing. LMM additionally incorporates a new criterion for ranking v-structures that is used to recover the completed partially directed acyclic graph (CPDAG) and to resolve conflicting v-structures, a common problem in small sample constraint-based learning. Using simulated data, each of these components of LMM is shown to significantly improve network inference compared to commonly applied methods when learning from limited data, including more accurate recovery of skeletons and CPDAGs compared to the PC, MaxMin, and MaxMin hill climbing algorithms. A proof of asymptotic correctness is also provided for LMM for recovering the correct skeleton and CPDAG. Keywords: Bayesian networks, skeleton, constraint-based learning, mutual min
5 0.11511786 44 jmlr-2013-Finding Optimal Bayesian Networks Using Precedence Constraints
Author: Pekka Parviainen, Mikko Koivisto
Abstract: We consider the problem of finding a directed acyclic graph (DAG) that optimizes a decomposable Bayesian network score. While in a favorable case an optimal DAG can be found in polynomial time, in the worst case the fastest known algorithms rely on dynamic programming across the node subsets, taking time and space 2n , to within a factor polynomial in the number of nodes n. In practice, these algorithms are feasible to networks of at most around 30 nodes, mainly due to the large space requirement. Here, we generalize the dynamic programming approach to enhance its feasibility in three dimensions: first, the user may trade space against time; second, the proposed algorithms easily and efficiently parallelize onto thousands of processors; third, the algorithms can exploit any prior knowledge about the precedence relation on the nodes. Underlying all these results is the key observation that, given a partial order P on the nodes, an optimal DAG compatible with P can be found in time and space roughly proportional to the number of ideals of P , which can be significantly less than 2n . Considering sufficiently many carefully chosen partial orders guarantees that a globally optimal DAG will be found. Aside from the generic scheme, we present and analyze concrete tradeoff schemes based on parallel bucket orders. Keywords: exact algorithm, parallelization, partial order, space-time tradeoff, structure learning
6 0.070911132 111 jmlr-2013-Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows
8 0.056978125 41 jmlr-2013-Experiment Selection for Causal Discovery
9 0.056163073 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
10 0.054623879 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres
11 0.054401025 120 jmlr-2013-Variational Algorithms for Marginal MAP
12 0.054011911 93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs
13 0.05183699 64 jmlr-2013-Lovasz theta function, SVMs and Finding Dense Subgraphs
14 0.051155835 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut
15 0.051010713 11 jmlr-2013-Algorithms for Discovery of Multiple Markov Boundaries
16 0.050358549 87 jmlr-2013-Performance Bounds for λ Policy Iteration and Application to the Game of Tetris
17 0.048507582 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning
18 0.048470661 88 jmlr-2013-Perturbative Corrections for Approximate Inference in Gaussian Latent Variable Models
19 0.048157211 68 jmlr-2013-Machine Learning with Operational Costs
20 0.044874325 71 jmlr-2013-Message-Passing Algorithms for Quadratic Minimization
topicId topicWeight
[(0, -0.232), (1, 0.013), (2, 0.032), (3, 0.117), (4, 0.191), (5, 0.251), (6, 0.055), (7, 0.116), (8, 0.011), (9, 0.227), (10, -0.14), (11, 0.096), (12, 0.041), (13, -0.228), (14, -0.139), (15, -0.078), (16, 0.375), (17, -0.15), (18, -0.097), (19, -0.146), (20, -0.045), (21, 0.045), (22, 0.052), (23, 0.013), (24, 0.018), (25, 0.057), (26, -0.088), (27, -0.074), (28, 0.016), (29, 0.069), (30, 0.006), (31, -0.008), (32, -0.041), (33, -0.051), (34, -0.025), (35, 0.023), (36, -0.05), (37, 0.012), (38, 0.006), (39, -0.094), (40, 0.015), (41, -0.059), (42, -0.002), (43, -0.011), (44, 0.013), (45, -0.045), (46, -0.005), (47, -0.097), (48, 0.023), (49, -0.069)]
simIndex simValue paperId paperTitle
same-paper 1 0.94567651 84 jmlr-2013-PC Algorithm for Nonparanormal Graphical Models
Author: Naftali Harris, Mathias Drton
Abstract: The PC algorithm uses conditional independence tests for model selection in graphical modeling with acyclic directed graphs. In Gaussian models, tests of conditional independence are typically based on Pearson correlations, and high-dimensional consistency results have been obtained for the PC algorithm in this setting. Analyzing the error propagation from marginal to partial correlations, we prove that high-dimensional consistency carries over to a broader class of Gaussian copula or nonparanormal models when using rank-based measures of correlation. For graph sequences with bounded degree, our consistency result is as strong as prior Gaussian results. In simulations, the ‘Rank PC’ algorithm works as well as the ‘Pearson PC’ algorithm for normal data and considerably better for non-normal data, all the while incurring a negligible increase of computation time. While our interest is in the PC algorithm, the presented analysis of error propagation could be applied to other algorithms that test the vanishing of low-order partial correlations. Keywords: Gaussian copula, graphical model, model selection, multivariate normal distribution, nonparanormal distribution
2 0.63961405 20 jmlr-2013-CODA: High Dimensional Copula Discriminant Analysis
Author: Fang Han, Tuo Zhao, Han Liu
Abstract: We propose a high dimensional classification method, named the Copula Discriminant Analysis (CODA). The CODA generalizes the normal-based linear discriminant analysis to the larger Gaussian Copula models (or the nonparanormal) as proposed by Liu et al. (2009). To simultaneously achieve estimation efficiency and robustness, the nonparametric rank-based methods including the Spearman’s rho and Kendall’s tau are exploited in estimating the covariance matrix. In high dimensional settings, we prove that the sparsity pattern of the discriminant features can be consistently recovered with the parametric rate, and the expected misclassification error is consistent to the Bayes risk. Our theory is backed up by careful numerical experiments, which show that the extra flexibility gained by the CODA method incurs little efficiency loss even when the data are truly Gaussian. These results suggest that the CODA method can be an alternative choice besides the normal-based high dimensional linear discriminant analysis. Keywords: high dimensional statistics, sparse nonlinear discriminant analysis, Gaussian copula, nonparanormal distribution, rank-based statistics
3 0.57504386 110 jmlr-2013-Sub-Local Constraint-Based Learning of Bayesian Networks Using A Joint Dependence Criterion
Author: Rami Mahdi, Jason Mezey
Abstract: Constraint-based learning of Bayesian networks (BN) from limited data can lead to multiple testing problems when recovering dense areas of the skeleton and to conflicting results in the orientation of edges. In this paper, we present a new constraint-based algorithm, light mutual min (LMM) for improved accuracy of BN learning from small sample data. LMM improves the assessment of candidate edges by using a ranking criterion that considers conditional independence on neighboring variables at both sides of an edge simultaneously. The algorithm also employs an adaptive relaxation of constraints that, selectively, allows some nodes not to condition on some neighbors. This relaxation aims at reducing the incorrect rejection of true edges connecting high degree nodes due to multiple testing. LMM additionally incorporates a new criterion for ranking v-structures that is used to recover the completed partially directed acyclic graph (CPDAG) and to resolve conflicting v-structures, a common problem in small sample constraint-based learning. Using simulated data, each of these components of LMM is shown to significantly improve network inference compared to commonly applied methods when learning from limited data, including more accurate recovery of skeletons and CPDAGs compared to the PC, MaxMin, and MaxMin hill climbing algorithms. A proof of asymptotic correctness is also provided for LMM for recovering the correct skeleton and CPDAG. Keywords: Bayesian networks, skeleton, constraint-based learning, mutual min
4 0.48212141 44 jmlr-2013-Finding Optimal Bayesian Networks Using Precedence Constraints
Author: Pekka Parviainen, Mikko Koivisto
Abstract: We consider the problem of finding a directed acyclic graph (DAG) that optimizes a decomposable Bayesian network score. While in a favorable case an optimal DAG can be found in polynomial time, in the worst case the fastest known algorithms rely on dynamic programming across the node subsets, taking time and space 2n , to within a factor polynomial in the number of nodes n. In practice, these algorithms are feasible to networks of at most around 30 nodes, mainly due to the large space requirement. Here, we generalize the dynamic programming approach to enhance its feasibility in three dimensions: first, the user may trade space against time; second, the proposed algorithms easily and efficiently parallelize onto thousands of processors; third, the algorithms can exploit any prior knowledge about the precedence relation on the nodes. Underlying all these results is the key observation that, given a partial order P on the nodes, an optimal DAG compatible with P can be found in time and space roughly proportional to the number of ideals of P , which can be significantly less than 2n . Considering sufficiently many carefully chosen partial orders guarantees that a globally optimal DAG will be found. Aside from the generic scheme, we present and analyze concrete tradeoff schemes based on parallel bucket orders. Keywords: exact algorithm, parallelization, partial order, space-time tradeoff, structure learning
5 0.47591969 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
Author: Nima Noorshams, Martin J. Wainwright
Abstract: The sum-product or belief propagation (BP) algorithm is a widely used message-passing technique for computing approximate marginals in graphical models. We introduce a new technique, called stochastic orthogonal series message-passing (SOSMP), for computing the BP fixed point in models with continuous random variables. It is based on a deterministic approximation of the messages via orthogonal series basis expansion, and a stochastic estimation of the basis coefficients via Monte Carlo techniques and damped updates. We prove that the SOSMP iterates converge to a δ-neighborhood of the unique BP fixed point for any tree-structured graph, and for any graphs with cycles in which the BP updates satisfy a contractivity condition. In addition, we demonstrate how to choose the number of basis coefficients as a function of the desired approximation accuracy δ and smoothness of the compatibility functions. We illustrate our theory with both simulated examples and in application to optical flow estimation. Keywords: graphical models, sum-product for continuous state spaces, low-complexity belief propagation, stochastic approximation, Monte Carlo methods, orthogonal basis expansion
6 0.27018002 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
7 0.25978592 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres
8 0.23936044 41 jmlr-2013-Experiment Selection for Causal Discovery
10 0.22649558 15 jmlr-2013-Bayesian Canonical Correlation Analysis
11 0.21643311 64 jmlr-2013-Lovasz theta function, SVMs and Finding Dense Subgraphs
12 0.20992321 111 jmlr-2013-Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows
13 0.19950016 85 jmlr-2013-Pairwise Likelihood Ratios for Estimation of Non-Gaussian Structural Equation Models
14 0.197924 93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs
15 0.19402243 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut
16 0.19361739 88 jmlr-2013-Perturbative Corrections for Approximate Inference in Gaussian Latent Variable Models
17 0.19326586 92 jmlr-2013-Random Spanning Trees and the Prediction of Weighted Graphs
18 0.19021176 11 jmlr-2013-Algorithms for Discovery of Multiple Markov Boundaries
19 0.18656249 81 jmlr-2013-Optimal Discovery with Probabilistic Expert Advice: Finite Time Analysis and Macroscopic Optimality
20 0.18108702 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
topicId topicWeight
[(0, 0.028), (5, 0.675), (6, 0.028), (10, 0.04), (23, 0.022), (30, 0.018), (68, 0.02), (70, 0.02), (75, 0.02), (85, 0.021)]
simIndex simValue paperId paperTitle
1 0.99725139 113 jmlr-2013-The CAM Software for Nonnegative Blind Source Separation in R-Java
Author: Niya Wang, Fan Meng, Li Chen, Subha Madhavan, Robert Clarke, Eric P. Hoffman, Jianhua Xuan, Yue Wang
Abstract: We describe a R-Java CAM (convex analysis of mixtures) package that provides comprehensive analytic functions and a graphic user interface (GUI) for blindly separating mixed nonnegative sources. This open-source multiplatform software implements recent and classic algorithms in the literature including Chan et al. (2008), Wang et al. (2010), Chen et al. (2011a) and Chen et al. (2011b). The CAM package offers several attractive features: (1) instead of using proprietary MATLAB, its analytic functions are written in R, which makes the codes more portable and easier to modify; (2) besides producing and plotting results in R, it also provides a Java GUI for automatic progress update and convenient visual monitoring; (3) multi-thread interactions between the R and Java modules are driven and integrated by a Java GUI, assuring that the whole CAM software runs responsively; (4) the package offers a simple mechanism to allow others to plug-in additional R-functions. Keywords: convex analysis of mixtures, blind source separation, affinity propagation clustering, compartment modeling, information-based model selection
2 0.99627727 8 jmlr-2013-A Theory of Multiclass Boosting
Author: Indraneel Mukherjee, Robert E. Schapire
Abstract: Boosting combines weak classifiers to form highly accurate predictors. Although the case of binary classification is well understood, in the multiclass setting, the “correct” requirements on the weak classifier, or the notion of the most efficient boosting algorithms are missing. In this paper, we create a broad and general framework, within which we make precise and identify the optimal requirements on the weak-classifier, as well as design the most effective, in a certain sense, boosting algorithms that assume such requirements. Keywords: multiclass, boosting, weak learning condition, drifting games
Author: Yahya Forghani, Hadi Sadoghi
Abstract: This paper comments on the published work dealing with robustness and regularization of support vector machines (Journal of Machine Learning Research, Vol. 10, pp. 1485-1510, 2009) by H. Xu et al. They proposed a theorem to show that it is possible to relate robustness in the feature space and robustness in the sample space directly. In this paper, we propose a counter example that rejects their theorem. Keywords: kernel, robustness, support vector machine
4 0.99361277 87 jmlr-2013-Performance Bounds for λ Policy Iteration and Application to the Game of Tetris
Author: Bruno Scherrer
Abstract: We consider the discrete-time infinite-horizon optimal control problem formalized by Markov decision processes (Puterman, 1994; Bertsekas and Tsitsiklis, 1996). We revisit the work of Bertsekas and Ioffe (1996), that introduced λ policy iteration—a family of algorithms parametrized by a parameter λ—that generalizes the standard algorithms value and policy iteration, and has some deep connections with the temporal-difference algorithms described by Sutton and Barto (1998). We deepen the original theory developed by the authors by providing convergence rate bounds which generalize standard bounds for value iteration described for instance by Puterman (1994). Then, the main contribution of this paper is to develop the theory of this algorithm when it is used in an approximate form. We extend and unify the separate analyzes developed by Munos for approximate value iteration (Munos, 2007) and approximate policy iteration (Munos, 2003), and provide performance bounds in the discounted and the undiscounted situations. Finally, we revisit the use of this algorithm in the training of a Tetris playing controller as originally done by Bertsekas and Ioffe (1996). Our empirical results are different from those of Bertsekas and Ioffe (which were originally qualified as “paradoxical” and “intriguing”). We track down the reason to be a minor implementation error of the algorithm, which suggests that, in practice, λ policy iteration may be more stable than previously thought. Keywords: stochastic optimal control, reinforcement learning, Markov decision processes, analysis of algorithms
same-paper 5 0.99223101 84 jmlr-2013-PC Algorithm for Nonparanormal Graphical Models
Author: Naftali Harris, Mathias Drton
Abstract: The PC algorithm uses conditional independence tests for model selection in graphical modeling with acyclic directed graphs. In Gaussian models, tests of conditional independence are typically based on Pearson correlations, and high-dimensional consistency results have been obtained for the PC algorithm in this setting. Analyzing the error propagation from marginal to partial correlations, we prove that high-dimensional consistency carries over to a broader class of Gaussian copula or nonparanormal models when using rank-based measures of correlation. For graph sequences with bounded degree, our consistency result is as strong as prior Gaussian results. In simulations, the ‘Rank PC’ algorithm works as well as the ‘Pearson PC’ algorithm for normal data and considerably better for non-normal data, all the while incurring a negligible increase of computation time. While our interest is in the PC algorithm, the presented analysis of error propagation could be applied to other algorithms that test the vanishing of low-order partial correlations. Keywords: Gaussian copula, graphical model, model selection, multivariate normal distribution, nonparanormal distribution
6 0.97434175 15 jmlr-2013-Bayesian Canonical Correlation Analysis
7 0.96014053 114 jmlr-2013-The Rate of Convergence of AdaBoost
8 0.90257788 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning
9 0.89802122 20 jmlr-2013-CODA: High Dimensional Copula Discriminant Analysis
10 0.89464301 119 jmlr-2013-Variable Selection in High-Dimension with Random Designs and Orthogonal Matching Pursuit
11 0.88709718 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
12 0.87469625 73 jmlr-2013-Multicategory Large-Margin Unified Machines
13 0.87424564 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
14 0.86987382 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
15 0.86820829 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres
16 0.85888815 64 jmlr-2013-Lovasz theta function, SVMs and Finding Dense Subgraphs
17 0.85351694 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach
18 0.85087663 92 jmlr-2013-Random Spanning Trees and the Prediction of Weighted Graphs
19 0.84875047 53 jmlr-2013-Improving CUR Matrix Decomposition and the Nystrom Approximation via Adaptive Sampling
20 0.84650576 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion