nips nips2011 nips2011-195 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ali Jalali, Christopher C. Johnson, Pradeep K. Ravikumar
Abstract: In this paper, we address the problem of learning the structure of a pairwise graphical model from samples in a high-dimensional setting. Our first main result studies the sparsistency, or consistency in sparsity pattern recovery, properties of a forward-backward greedy algorithm as applied to general statistical models. As a special case, we then apply this algorithm to learn the structure of a discrete graphical model via neighborhood estimation. As a corollary of our general result, we derive sufficient conditions on the number of samples n, the maximum node-degree d and the problem size p, as well as other conditions on the model parameters, so that the algorithm recovers all the edges with high probability. Our result guarantees graph selection for samples scaling as n = Ω(d^2 log(p)), in contrast to existing convex-optimization-based algorithms that require a sample complexity of Ω(d^3 log(p)). Further, the greedy algorithm only requires a restricted strong convexity condition which is typically milder than irrepresentability assumptions. We corroborate these results using numerical simulations at the end.
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper, we address the problem of learning the structure of a pairwise graphical model from samples in a high-dimensional setting. [sent-8, score-0.304]
2 Our first main result studies the sparsistency, or consistency in sparsity pattern recovery, properties of a forward-backward greedy algorithm as applied to general statistical models. [sent-9, score-0.458]
3 As a special case, we then apply this algorithm to learn the structure of a discrete graphical model via neighborhood estimation. [sent-10, score-0.372]
4 As a corollary of our general result, we derive sufficient conditions on the number of samples n, the maximum node-degree d and the problem size p, as well as other conditions on the model parameters, so that the algorithm recovers all the edges with high probability. [sent-11, score-0.137]
5 Our result guarantees graph selection for samples scaling as n = Ω(d^2 log(p)), in contrast to existing convex-optimization-based algorithms that require a sample complexity of Ω(d^3 log(p)). [sent-12, score-0.252]
6 Further, the greedy algorithm only requires a restricted strong convexity condition which is typically milder than irrepresentability assumptions. [sent-13, score-0.589]
7 1 Introduction. Undirected graphical models, also known as Markov random fields, are used in a variety of domains, including statistical physics, natural language processing and image analysis, among others. [sent-15, score-0.164]
8 In this paper we are concerned with the task of estimating the graph structure G of a Markov random field (MRF) over a discrete random vector X = (X1 , X2 , . . . , Xp ). [sent-16, score-0.234]
9 This underlying graph structure encodes conditional independence assumptions among subsets of the variables, and thus plays an important role in a broad range of applications of MRFs. [sent-23, score-0.24]
10 Methods for estimating such graph structure include those based on constraint and hypothesis testing [22], and those that estimate restricted classes of graph structures such as trees [8], polytrees [11], and hypertrees [23]. [sent-25, score-0.419]
11 A recent class of successful approaches for graphical model structure learning is based on estimating the local neighborhood of each node. [sent-26, score-0.29]
12 One subclass of these, for the special case of bounded-degree graphs, involves the use of exhaustive search, so that the computational complexity grows at least as quickly as O(p^d), where d is the maximum neighborhood size in the graphical model [1, 4, 9]. [sent-27, score-0.396]
13 Another subclass uses convex programs to learn the neighborhood structure: for instance, [20, 17, 16] estimate the neighborhood set for each vertex r ∈ V by optimizing its ℓ1 -regularized conditional likelihood; [15, 10] use ℓ1 /ℓ2 -regularized conditional likelihood. [sent-28, score-0.419]
14 Another popular class of approaches is based on using a score metric and searching for the best scoring structure from a candidate set of graph structures. [sent-30, score-0.189]
15 Question: Can one use local procedures that are as inexpensive as the heuristic greedy approaches, and yet come with the strong statistical guarantees of the regularized convex program based approaches? [sent-33, score-0.467]
16 Of relevance to graphical model structure learning is the structure of sparsity, where a sparse set of non-zero parameters entails a sparse set of edges. [sent-37, score-0.291]
17 A related line of recent work on learning sparse models has focused on “stagewise” greedy algorithms. [sent-41, score-0.388]
18 These perform simple forward steps (adding parameters greedily), and possibly also backward steps (removing parameters greedily), and yet provide strong statistical guarantees for the estimate after a finite number of greedy steps. [sent-42, score-0.864]
19 The forward greedy variant which performs just the forward step has appeared in various guises in multiple communities: in machine learning as boosting [13], in function approximation [24], and in signal processing as basis pursuit [6]. [sent-43, score-0.743]
20 Zhang [27] analyzes a more general greedy algorithm for sparse linear regression that performs forward and backward steps, and shows that it is sparsistent under a weaker restricted eigenvalue condition. [sent-45, score-1.134]
21 Here we ask the question: Can we provide an analysis of a general forward-backward algorithm for parameter estimation in general statistical models? [sent-46, score-0.533]
22 In the first part, we analyze the forward-backward greedy algorithm [28] for general statistical models. [sent-51, score-0.85]
23 In the second part, we use this to show that when combined with neighborhood estimation, the forward-backward variant applied to local conditional log-likelihoods provides a simple, computationally tractable method that adds and deletes edges, but comes with strong sparsistency guarantees. [sent-53, score-0.945]
24 We reiterate that our first result on the sparsistency of the forward-backward greedy algorithm for general objectives is of independent interest even outside the context of graphical models. [sent-54, score-1.209]
25 As we show, the greedy method outperforms the ℓ1 -regularized counterpart in [20] both theoretically and experimentally. [sent-55, score-0.349]
26 The sufficient condition on the parameters imposed by the greedy algorithm is a restricted strong convexity condition [19], which is weaker than the irrepresentable condition required by [20]. [sent-56, score-0.814]
27 Further, the number of samples required for sparsistent graph recovery scales as O(d^2 log p), where d is the maximum node degree, in contrast to O(d^3 log p) for the ℓ1 -regularized counterpart. [sent-57, score-0.55]
28 We corroborate this in our simulations, where we find that the greedy algorithm requires fewer observations than [20] for sparsistent graph recovery. [sent-58, score-0.772]
29 Let G = (V, E) denote a graph with p nodes, corresponding to the p variables {X1 , . . . , Xp }. [sent-64, score-0.149]
30 A pairwise Markov random field over X = (X1 , . . . , Xp ) is then specified by nodewise and pairwise functions θr : X → R for all r ∈ V , and θrt : X × X → R for all (r, t) ∈ E: P(x) ∝ exp( Σ_{r∈V} θr (xr ) + Σ_{(r,t)∈E} θrt (xr , xt ) ).   (1) [sent-71, score-0.182]
31 In this paper, we largely focus on the case where the variables are binary with X = {−1, +1}, where we can rewrite (1) in the Ising model form [14], for some set of parameters {θr } and {θrt }, as P(x) ∝ exp( Σ_{r∈V} θr xr + Σ_{(r,t)∈E} θrt xr xt ).   (2) [sent-72, score-0.402]
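To make the Ising form (2) concrete, the following sketch evaluates its unnormalized log-probability for a given spin configuration; the function name, the dense matrix representation of {θrt}, and the toy chain example are illustrative choices of ours, not part of the paper.

```python
import numpy as np

def ising_unnormalized_logprob(x, theta_node, theta_edge):
    """Unnormalized log-probability of the Ising model in (2).

    x          : length-p array with entries in {-1, +1}
    theta_node : length-p array of node parameters theta_r
    theta_edge : symmetric p x p array of edge parameters theta_rt,
                 with zeros on the diagonal and on non-edges
    """
    # sum_r theta_r x_r  +  sum_{(r,t) in E} theta_rt x_r x_t
    pairwise = 0.5 * x @ theta_edge @ x  # 0.5 corrects for counting (r,t) and (t,r)
    return theta_node @ x + pairwise

# Example: a 3-node chain 0 - 1 - 2 with coupling 0.5
theta_node = np.zeros(3)
theta_edge = np.zeros((3, 3))
theta_edge[0, 1] = theta_edge[1, 0] = 0.5
theta_edge[1, 2] = theta_edge[2, 1] = 0.5
print(ising_unnormalized_logprob(np.array([1, 1, -1]), theta_node, theta_edge))
```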
32 from a distribution Pθ∗ of the form (1), for parameters θ∗ and graph G = (V, E ∗ ) over the p variables. [sent-83, score-0.149]
33 The graphical model selection task consists of inferring this edge set E∗ from the samples D. [sent-85, score-0.27]
34 Then the graphical model selection problem is equivalent to that of estimating the neighborhoods N̂n (r) ⊂ V , so that P[N̂n (r) = N ∗ (r); ∀r ∈ V ] → 1 as n → ∞. [sent-88, score-0.245]
35 Writing Θr := (θr1 , . . . , θrp ) for the parameters associated with node r, our goal is to use the conditional likelihood of Xr conditioned on XV \r to estimate Θr and hence its neighborhood N (r). [sent-93, score-0.205]
36 This conditional distribution of Xr conditioned on XV \r generated by (2) is given by the logistic model P[Xr = xr | XV \r = xV \r ] = exp( θr xr + Σ_{t∈V \r} θrt xr xt ) / ( 1 + exp( θr xr + Σ_{t∈V \r} θrt xr xt ) ). [sent-94, score-0.29]
37 Given the n samples D, the corresponding conditional log-likelihood is given by L(Θr ; D) = (1/n) Σ_{i=1}^{n} [ log( 1 + exp( θr x_r^{(i)} + Σ_{t∈V \r} θrt x_r^{(i)} x_t^{(i)} ) ) − θr x_r^{(i)} − Σ_{t∈V \r} θrt x_r^{(i)} x_t^{(i)} ].   (4) [sent-95, score-0.914]
38 In Section 4, we study a greedy algorithm (Algorithm 2) that finds these node neighborhoods N̂n (r) = Supp(Θ̂r ) of each random variable Xr separately by a greedy stagewise optimization of the conditional log-likelihood of Xr conditioned on XV \r . [sent-96, score-0.994]
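A minimal sketch of the node-wise loss (4) that the greedy stagewise optimization works with, assuming samples in {−1, +1} stored as rows of a matrix; the interface (a scalar node parameter plus a dense vector of edge parameters for node r) is a simplification of ours, not code from the paper.

```python
import numpy as np

def conditional_loglik_loss(theta_r, theta_rt, X, r):
    """Conditional log-likelihood loss (4) for node r.

    theta_r  : scalar node parameter
    theta_rt : length-p array of edge parameters for node r (entry r ignored)
    X        : n x p array of samples with entries in {-1, +1}
    """
    xr = X[:, r]                                   # x_r^{(i)}
    rest = X @ theta_rt - X[:, r] * theta_rt[r]    # sum_{t != r} theta_rt x_t^{(i)}
    a = theta_r * xr + xr * rest                   # theta_r x_r + sum_t theta_rt x_r x_t
    # log(1 + exp(a)) - a, computed stably via logaddexp
    return np.mean(np.logaddexp(0.0, a) - a)
```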
39 The algorithm then combines these neighborhoods to obtain a graph estimate Ên using an “OR” rule: Ên = ∪r {(r, t) : t ∈ N̂n (r)}. [sent-97, score-0.256]
40 Other rules, such as the “AND” rule that adds an edge only if it occurs in each of the respective node neighborhoods, could be used to combine the node neighborhoods into a graph estimate. [sent-98, score-0.234]
41 We show in Theorem 2 that the neighborhood selection by the greedy algorithm succeeds in recovering the exact node-neighborhoods with high probability, so that by a union bound, the graph estimates using either the AND or OR rules would be exact with high probability as well. [sent-99, score-0.725]
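The “OR” and “AND” rules for combining the estimated node neighborhoods into a single edge set can be written in a few lines; the dictionary-of-sets representation below is our own choice for illustration.

```python
def combine_neighborhoods(neighborhoods, rule="OR"):
    """Combine estimated node neighborhoods N_hat(r) into an edge set.

    neighborhoods : dict mapping each node r to a set of estimated neighbors
    rule          : "OR" keeps (r, t) if t is in N_hat(r) or r is in N_hat(t);
                    "AND" requires both.
    """
    edges = set()
    for r in neighborhoods:
        for t in neighborhoods[r]:
            in_both = r in neighborhoods.get(t, set())
            if rule == "OR" or in_both:
                edges.add((min(r, t), max(r, t)))
    return edges

# Example: node 0 claims 1 as a neighbor, but node 1 claims nobody
nbhd = {0: {1}, 1: set(), 2: {1}}
print(combine_neighborhoods(nbhd, rule="OR"))   # {(0, 1), (1, 2)}
print(combine_neighborhoods(nbhd, rule="AND"))  # set()
```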
42 Before we describe this greedy algorithm and its analysis in Section 4, however, we first consider the general statistical model case in the next section. [sent-100, score-0.417]
43 We first describe the forward-backward greedy algorithm of Zhang [28] as applied to general statistical models, followed by a sparsistency analysis for this general case. [sent-101, score-1.107]
44 The algorithm starts with an empty set of active variables S(0) and gradually adds (and removes) variables to the active set until it meets the stopping criterion. [sent-119, score-0.303]
45 This algorithm has two major steps: the forward step and the backward step. [sent-120, score-0.47]
46 In the forward step, the algorithm finds the best next candidate and adds it to the active set as long as it improves the loss function by at least ǫS ; otherwise, the stopping criterion is met and the algorithm terminates. [sent-121, score-0.608]
47 Then, in the backward step, the algorithm checks the influence of all variables in the presence of the newly added variable. [sent-122, score-0.273]
48 If one or more of the previously added variables do not contribute at least νǫS to the loss function, then the algorithm removes them from the active set. [sent-123, score-0.168]
49 This procedure ensures that at each round, the loss function is improved by at least (1 − ν)ǫS and hence it terminates within a finite number of steps. [sent-124, score-0.149]
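The forward and backward steps described above can be summarized in the following sketch of a generic forward-backward greedy loop (in the spirit of Algorithm 1 and Zhang's FoBa); the refit oracle, the default ν, and the step cap are assumptions of this sketch rather than details taken from the paper.

```python
def forward_backward_greedy(refit, p, eps_s, nu=0.5, max_steps=100):
    """Generic forward-backward greedy selection over p coordinates.

    refit(support) must return (theta, loss), where theta minimizes the loss
    restricted to the coordinates in `support`; this oracle interface is an
    assumption of this sketch.
    """
    support = set()
    theta, loss = refit(support)
    for _ in range(max_steps):
        # Forward step: add the single coordinate giving the largest improvement.
        best_j, best_theta, best_loss = None, None, loss
        for j in range(p):
            if j in support:
                continue
            cand_theta, cand_loss = refit(support | {j})
            if cand_loss < best_loss:
                best_j, best_theta, best_loss = j, cand_theta, cand_loss
        if best_j is None or loss - best_loss < eps_s:
            break                       # improvement below eps_S: stop
        support.add(best_j)
        theta, loss = best_theta, best_loss

        # Backward step: drop coordinates whose removal costs less than nu * eps_S.
        while True:
            worst_j, worst_theta, worst_loss = None, None, None
            for j in list(support):
                cand_theta, cand_loss = refit(support - {j})
                if cand_loss - loss < nu * eps_s and (
                        worst_loss is None or cand_loss < worst_loss):
                    worst_j, worst_theta, worst_loss = j, cand_theta, cand_loss
            if worst_j is None:
                break
            support.remove(worst_j)
            theta, loss = worst_theta, worst_loss
    return theta, support
```

For the graphical model setting, the refit oracle would minimize the node conditional log-likelihood (4) restricted to the given support, for example with a few Newton steps.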
50 We now state the assumptions on the loss function under which sparsistency is guaranteed. [sent-125, score-0.327]
51 Let us first recall the definition of restricted strong convexity from Negahban et al. [sent-126, score-0.167]
52 Specifically, for a given set S, the loss function is said to satisfy restricted strong convexity (RSC) with parameter κl with respect to the set S if L(θ + ∆; Z1^n ) − L(θ; Z1^n ) − ⟨∇L(θ; Z1^n ), ∆⟩ ≥ (κl /2) ‖∆‖_2^2 for all ∆ ∈ S.   (5) [sent-128, score-0.269]
53 We can now define sparsity restricted strong convexity as follows. [sent-129, score-0.208]
54 Specifically, we say that the loss function L satisfies RSC(k) with parameter κl if it satisfies RSC with parameter κl for the set {∆ ∈ R^p : ‖∆‖_0 ≤ k}. [sent-130, score-0.134]
55 In contrast, we say that the loss function satisfies restricted strong smoothness (RSS) with parameter κu with respect to a set S if L(θ + ∆; Z1^n ) − L(θ; Z1^n ) − ⟨∇L(θ; Z1^n ), ∆⟩ ≤ (κu /2) ‖∆‖_2^2 for all ∆ ∈ S. [sent-131, score-0.204]
56 The loss function L satisfies RSS(k) with parameter κu if it satisfies RSS with parameter κu for the set {∆ ∈ R^p : ‖∆‖_0 ≤ k}. [sent-133, score-0.134]
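For intuition, in the special case of the squared loss L(θ) = ‖y − Xθ‖_2^2 /(2n), the RSC(k) and RSS(k) constants reduce to the smallest and largest eigenvalues of k × k submatrices of X^T X/n; the brute-force check below (only feasible for tiny p and k) is purely illustrative and not taken from the paper.

```python
import itertools
import numpy as np

def restricted_eigs(X, k):
    """Smallest and largest eigenvalues of (X^T X)/n over all k-column submatrices.

    For the squared loss L(theta) = ||y - X theta||^2 / (2n), these play the
    roles of the RSC(k) constant kappa_l and the RSS(k) constant kappa_u.
    Brute force over supports: only feasible for small p and k.
    """
    n, p = X.shape
    gram = X.T @ X / n
    kl, ku = np.inf, 0.0
    for support in itertools.combinations(range(p), k):
        sub = gram[np.ix_(support, support)]
        eigs = np.linalg.eigvalsh(sub)          # sorted ascending
        kl, ku = min(kl, eigs[0]), max(ku, eigs[-1])
    return kl, ku

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8))
print(restricted_eigs(X, k=3))
```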
57 Given any constants κl and κu , and a sample based loss function L, we can typically use concentration based arguments to obtain bounds on the sample size required so that the RSS and RSC conditions hold with high probability. [sent-134, score-0.171]
58 Another property of the loss function that we require is an upper bound λn on the ℓ∞ norm of the gradient of the loss at the true parameter θ∗ , i.e., λn ≥ ‖∇L(θ∗ ; Z1^n )‖∞ . [sent-135, score-0.172]
59 Then if we run Algorithm 1 with stopping threshold ǫS ≥ (8ρη/κl ) s∗ λn^2 , the output θ with support S satisfies: (a) Error Bound: ‖θ − θ∗ ‖_2 ≤ (2/κl ) √s∗ ( λn √η + √(2κu ǫS ) ). [sent-143, score-0.222]
60 The proof of the theorem hinges on three main lemmas: Lemmas 1 and 2, which are simple consequences of the forward and backward steps failing when the greedy algorithm stops, and Lemma 3, which uses these two lemmas and extends techniques from [21] and [19] to obtain an ℓ2 bound on the error. [sent-147, score-1.021]
61 Provided these lemmas hold, we then show below that the greedy algorithm is sparsistent. [sent-148, score-0.491]
62 However, these lemmas require a priori that the RSC and RSS conditions hold for sparsity size |S ∗ ∪ S|. [sent-149, score-0.175]
63 In this Lemma, we show that the upper bound holds by drawing from fixed point techniques in [21] and [19], and by using a simple consequence of the forward step failing when the greedy algorithm stops. [sent-153, score-0.642]
64 Substituting |{j ∈ S ∗ − S : |θj∗ |^2 > τ }| = |S ∗ − S| − |{j ∈ S ∗ − S : |θj∗ |^2 ≤ τ }|, we get |S ∗ − S| ≤ |{j ∈ S ∗ − S : |θj∗ |^2 ≤ τ }| + ηs∗ λn^2 /(2κu ǫS ) ≤ |{j ∈ S ∗ − S : |θj∗ |^2 ≤ τ }| + 1/2, due to the setting of the stopping threshold ǫS . [sent-157, score-0.222]
65 (c) From Lemma 2, which provides a simple consequence of the backward step failing when the greedy algorithm stops, for ∆ = θ − θ∗ we have (ǫS /κu ) |S − S ∗ | ≤ ‖∆_{S−S ∗} ‖_2^2 ≤ ‖∆‖_2^2 , so that using Lemma 3 and the fact that |S ∗ − S| = 0, we obtain that |S − S ∗ | ≤ 4ηs∗ λn^2 κu /(ǫS κl^2 ) ≤ 1/2, due to the setting of the stopping threshold ǫS . [sent-159, score-0.903]
66 Algorithm 2: Greedy forward-backward algorithm for pairwise discrete graphical model learning. Input: Data D := {x(1) , . . . , x(n) }. [sent-160, score-0.285]
67 Lemmas for Theorem 1. We list the simple lemmas that characterize the solution obtained when the algorithm terminates, and on which the proof of Theorem 1 hinges. [sent-164, score-0.142]
68 When Algorithm 1 stops with parameter θ supported on S, we have L(θ) − L(θ∗ ) < √( 2 |S ∗ − S| κu ǫS ) · ‖θ − θ∗ ‖_2 . [sent-166, score-0.171]
69 When Algorithm 1 stops with parameter θ supported on S, we have ‖∆_{S−S ∗} ‖_2^2 ≥ (ǫS /κu ) |S − S ∗ |, where ∆ = θ − θ∗ . [sent-168, score-0.171]
70 When Algorithm 1 stops with parameter θ supported on S, we have ‖θ − θ∗ ‖_2 ≤ (2/κl ) ( λn √(|S ∪ S ∗ |) + √( 2κu ǫS |S ∗ − S| ) ). Lemma 4 (Stopping Size). [sent-170, score-0.171]
71 and RSC(ηs∗ ) holds for some η ≥ 2 + , then Algorithm 1 stops with k ≤ (η − 1)s∗ . [sent-172, score-0.139]
72 Suppose we are given n samples D := {x(1) , . . . , x(n) }, drawn from a pairwise Ising model as in (2), with parameters θ∗ , and graph G = (V, E ∗ ). [sent-181, score-0.219]
73 It will be useful to denote the maximum node-degree in the graph E ∗ by d. [sent-182, score-0.149]
74 We propose Algorithm 2 for estimating the underlying graphical model from the n samples D. [sent-184, score-0.194]
75 We first note that the conditional log-likelihood loss function in (4) corresponds to a logistic likelihood. [sent-193, score-0.267]
76 [19, 2] analyze the RSC and RSS properties of generalized linear models, of which logistic models are an instance, and show that the following result holds if the covariates are sub-Gaussian. [sent-195, score-0.177]
77 We can now verify that under the assumptions in the corollary, the conditions on the stopping threshold ǫS and the minimum absolute value of the non-zero parameters min_{j∈S ∗} |θj∗ | are satisfied. [sent-211, score-0.17]
78 Thus, Theorem 1 yields that each node neighborhood is recovered with no false exclusions or inclusions with probability at least 1 − c′ exp(−c′′ n). [sent-213, score-0.402]
79 The sufficient condition on the parameters imposed by the greedy algorithm is a restricted strong convexity condition [19], which is weaker than the irrepresentable condition required by [20]. [sent-216, score-0.814]
80 Further, the number of samples required for sparsistent graph recovery scales as O(d^2 log p), where d is the maximum node degree, in contrast to O(d^3 log p) for the ℓ1 -regularized counterpart. [sent-217, score-0.586]
81 We corroborate this in our simulations, where we find that the greedy algorithm requires fewer observations than [20] for sparsistent graph recovery. [sent-218, score-0.772]
82 We also note that the result can be extended to the general pairwise graphical model case, where each random variable takes values in a general discrete range {1, . . . , m}. [sent-219, score-0.203]
83 Our analysis, however, naturally extends to such a group greedy setting as well. [sent-224, score-0.349]
84 Fig 1: Plots of success probability P[N̂± (r) = N ∗ (r), ∀r ∈ V ] versus the control parameter β(n, p, d) = n/[20d log(p)] for the Ising model on (a) chain (d = 2), (b) 4-nearest neighbor (d = 4) and (c) star (d = 0.1p) graphs, for both greedy and node-wise ℓ1 -regularized logistic regression methods. [sent-252, score-0.343] [sent-255, score-0.545]
86 As our theorem suggests and these figures show, the greedy algorithm requires fewer samples to recover the exact structure of the graphical model. [sent-256, score-0.658]
87 We simulated structure learning for different graph structures and compared the learning rates of our method to those of node-wise ℓ1 -regularized logistic regression as outlined in [20]. [sent-258, score-0.415]
88 We performed experiments using 3 different graph structures: (a) chain (line graph), (b) 4-nearest neighbor (grid graph) and (c) star graph. [sent-259, score-0.312]
89 We then attempted to learn the structure of the model using both Algorithm 2 and node-wise ℓ1 -regularized logistic regression. [sent-268, score-0.186]
90 We then compared the actual graph structure with the empirically learned graph structures. [sent-269, score-0.338]
91 If the graph structures matched completely, we declared the result a success; otherwise, we declared it a failure. [sent-270, score-0.311]
92 For all greedy experiments we set the stopping threshold ǫS = c log(np)/n, where c is a tuning constant, as suggested by Theorem 2, and set the backward step threshold ν = 0.5. [sent-272, score-0.623]
93 For all logistic regression experiments we set the regularization parameter λn = c′ log(p)/n, where c′ was set via cross-validation. [sent-274, score-0.228]
94 We performed structure learning on chain (d = 2), 4-nearest neighbor (d = 4) and star (d = 0.1p) graphs using both Algorithm 2 and node-wise ℓ1 -regularized logistic regression for three different graph sizes p ∈ {36, 64, 100} with mixed (random sign) couplings. [sent-276, score-0.384]
95 For each sample size, we generated a batch of 10 different graphical models and averaged the probability of success (complete structure learned) over the batch. [sent-277, score-0.227]
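A sketch of the bookkeeping behind these experiments: each trial counts as a success only on exact recovery of the edge set, successes are averaged over a batch, and results are plotted against the control parameter β(n, p, d) = n/[20 d log(p)]; the estimate_graph callback is an assumed interface, not code from the paper.

```python
import numpy as np

def control_parameter(n, p, d):
    # beta(n, p, d) = n / [20 d log(p)], the rescaled sample size on the x-axis of Fig 1
    return n / (20.0 * d * np.log(p))

def success_probability(estimate_graph, true_edges, n, batch_size=10):
    """Fraction of trials in which the estimated edge set exactly matches the truth.

    estimate_graph(n) must return an edge set (a set of (r, t) pairs) estimated
    from n fresh samples; this callback interface is an assumption of this sketch.
    """
    successes = 0
    for _ in range(batch_size):
        learned = estimate_graph(n)
        successes += int(learned == true_edges)
    return successes / batch_size
```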
96 These results support our theoretical claims and demonstrate the efficiency of the greedy method in comparison to node-wise ℓ1 -regularized logistic regression [20]. [sent-279, score-0.545]
97 Consistent estimation of the basic neighborhood structure of Markov random fields. [sent-336, score-0.157]
98 High-dimensional Ising model selection using ℓ1 -regularized logistic regression. [sent-409, score-0.305]
99 Adaptive forward-backward greedy algorithm for sparse learning with linear models. [sent-448, score-0.425]
100 On the consistency of feature selection using greedy least squares regression. [sent-452, score-0.422]
1 0.90772372 71 nips-2011-Directed Graph Embedding: an Algorithm based on Continuous Limits of Laplacian-type Operators
Author: Dominique C. Perrault-joncas, Marina Meila
Abstract: This paper considers the problem of embedding directed graphs in Euclidean space while retaining directional information. We model the observed graph as a sample from a manifold endowed with a vector field, and we design an algorithm that separates and recovers the features of this process: the geometry of the manifold, the data density and the vector field. The algorithm is motivated by our analysis of Laplacian-type operators and their continuous limit as generators of diffusions on a manifold. We illustrate the recovery algorithm on both artificially constructed and real data. 1 Motivation Recent advances in graph embedding and visualization have focused on undirected graphs, for which the graph Laplacian properties make the analysis particularly elegant [1, 2]. However, there is an important number of graph data, such as social networks, alignment scores between biological sequences, and citation data, which are naturally asymmetric. A commonly used approach for this type of data is to disregard the asymmetry by studying the spectral properties of W +W T or W T W , where W is the affinity matrix of the graph. Some approaches have been offered to preserve the asymmetry information contained in data: [3], [4], [5] or to define directed Laplacian operators [6]. Although quite successful, these works adopt a purely graph-theoretical point of view. Thus, they are not concerned with the generative process that produces the graph, nor with the interpretability and statistical properties of their algorithms. In contrast, we view the nodes of a directed graph as a finite sample from a manifold in Euclidean space, and the edges as macroscopic observations of a diffusion kernel between neighboring points on the manifold. We explore how this diffusion kernel determines the overall connectivity and asymmetry of the resulting graph and demonstrate how Laplacian-type operators of this graph can offer insights into the underlying generative process. Based on the analysis of the Laplacian-type operators, we derive an algorithm that, in the limit of infinite sample and vanishing bandwidth, recovers the key features of the sampling process: manifold geometry, sampling distribution, and local directionality, up to their intrinsic indeterminacies. 2 Model The first premise here is that we observe a directed graph G, with n nodes, having weights W = [Wij ] for the edge from node i to node j. In following with common Laplacian-based embedding approaches, we assume that G is a geometric random graph constructed from n points sampled according to distribution p = e−U on an unobserved compact smooth manifold M ⊆ Rl of known intrinsic dimension d ≤ l. The edge weight Wij is then determined by a directed similarity kernel k (xi , xj ) with bandwidth . The directional component of k (xi , xj ) will be taken to be derived 1 from a vector field r on M, which assigns a preferred direction between weights Wij and Wji . The choice of a vector field r to characterize the directional component of G might seem restrictive at first. In the asymptotic limit of → 0 and n → ∞ however, kernels are characterized by their diffusion, drift, and source components [7]. As such, r is sufficient to characterize any directionality associated with a drift component and as it turns out, the component of r normal M in Rl can also be use to characterize any source component. As for the diffusion component, it is not possible to uniquely identify it from G alone [8]. 
Some absolute knownledge of M is needed to say anything about it. Hence, without loss of generality, we will construct k (xi , xj ) so that the diffusion component ends being isotropic and constant, i.e. equal to Laplace-Beltrami operator ∆ on M. The schematic of this generative process is shown in the top left of Figure 1 below. From left to right: the graph generative process mapping the sample on M to geometric random graph G via the kernel k (x, y), then the subsequent embedding (α) Ψn of G by operators Haa,n , (α) Hss,n (defined in section 3.1). As these operators converge to (α) their respective limits, Haa and (α) Hss , so will Ψn → Ψ, pn → p, and rn → r. We design an algorithm that, given G, produces the top right embedding (Ψn , pn , and rn ). Figure 1: Schematic of our framework. The question is then as follows: can the generative process’ geometry M, distribution p = e−U , and directionality r, be recovered from G? In other words, is there an embedding of G in Rm , m ≥ d that approximates all three components of the process and that is also consistent as sample size increases and the bandwidth vanishes? In the case of undirected graphs, the theory of Laplacian eigenmaps [1] and Diffusion maps [9] answers this question in the affirmative, in that the geometry of M and p = e−U can be inferred using spectral graph theory. The aim here is to build on the undirected problem and recover all three components of the generative process from a directed graph G. The spectral approach to undirected graph embedding relies on the fact that eigenfunctions of the Laplace-Beltrami operator are known to preserve the local geometry of M [1]. With a consistent empirical Laplace-Beltrami operator based on G, its eigenvectors also recover the geometry of M and converge to the corresponding eigenfunctions on M. For a directed graph G, an additional operator is needed to recover the local directional component r, but the principle remains the same. (α) The schematic for this is shown in Figure 1 where two operators - Hss,n , introduced in [9] for (α) undirected embeddings, and Haa,n , a new operator defined in section 3.1 - are used to obtain the (α) (α) (α) embedding Ψn , distribution pn , and vector field rn . As Haa,n and Hss,n converge to Haa and (α) Hss , Ψn , pn , and rn also converge to Ψ, p, and r, where Ψ is the local geometry preserving the embedding of M into Rm . (α) The algorithm we propose in Section 4 will calculate the matrices corresponding to H·,n from the graph G, and with their eigenvectors, will find estimates for the node coordinates Ψ, the directional component r, and the sampling distribution p. In the next section we briefly describe the mathematical models of the diffusion processes that our model relies on. 2 2.1 Problem Setting The similarity kernel k (x, y) can be used to define transport operators on M. The natural transport operator is defined by normalizing k (x, y) as T [f ](x) = M k (x, y) f (y)p(y)dy , where p (x) = p (x) k (x, y)p(y)dy . (1) M T [f ](x) represents the diffusion of a distribution f (y) by the transition density k (x, y)p(y)/ k (x, y )p(y )dy . The eigenfunctions of this infinitesimal operator are the continuous limit of the eigenvectors of the transition probability matrix P = D−1 W given by normalizing the affinity matrix W of G by D = diag(W 1) [10]. Meanwhile, the infinitesimal transition ∂f (T − I)f = lim (2) →0 ∂t defines the backward equation for this diffusion process over M based on kernel k . 
Obtaining the explicit expression for transport operators like (2) is then the main technical challenge. 2.2 Choice of Kernel In order for T [f ] to have the correct asymptotic form, some hypotheses about the similarity kernel k (x, y) are required. The hypotheses are best presented by considering the decomposition of k (x, y) into symmetric h (x, y) = h (y, x) and anti-symmetric a (x, y) = −a (y, x) components: k (x, y) = h (x, y) + a (x, y) . (3) The symmetric component h (x, y) is assumed to satisfy the following properties: 1. h (||y − 2 x||2 ) = h(||y−x|| / ) , and 2. h ≥ 0 and h is exponentially decreasing as ||y − x|| → ∞. This form d/2 of symmetric kernel was used in [9] to analyze the diffusion map. For the asymmetric part of the similarity kernel, we assume the form a (x, y) = r(x, y) h(||y − x||2 / ) · (y − x) , d/2 2 (4) with r(x, y) = r(y, x) so that a (x, y) = −a (y, x). Here r(x, y) is a smooth vector field on the manifold that gives an orientation to the asymmetry of the kernel k (x, y). It is worth noting that the dependence of r(x, y) on both x and y implies that r : M × M → Rl with Rl the ambient space of M; however in the asymptotic limit, the dependence in y is only important “locally” (x = y), and as such it is appropriate to think of r(x, x) being a vector field on M. As a side note, it is worth pointing out that even though the form of a (x, y) might seem restrictive at first, it is sufficiently rich to describe any vector field . This can be seen by taking r(x, y) = (w(x) + w(y))/2 so that at x = y the resulting vector field is given by r(x, x) = w(x) for an arbitrary vector field w(x). 3 Continuous Limit of Laplacian Type Operators We are now ready to state the main asymptotic result. Proposition 3.1 Let M be a compact, closed, smooth manifold of dimension d and k (x, y) an asymmetric similarity kernel satisfying the conditions of section 2.2, then for any function f ∈ C 2 (M), the integral operator based on k has the asymptotic expansion k (x, y)f (y)dy = m0 f (x) + g(f (x), x) + o( ) , (5) M where g(f (x), x) = and m0 = Rd m2 (ω(x)f (x) + ∆f (x) + r · 2 h(||u||2 )du, m2 = Rd u2 h(||u||2 )du. i 3 f (x) + f (x) · r + c(x)f (x)) (6) The proof can be found in [8] along with the definition of ω(x) and c(x) in (6). For now, it suffices to say that ω(x) corresponds to an interaction between the symmetric kernel h and the curvature of M and was first derived in [9]. Meanwhile, c(x) is a new term that originates from the interaction between h and the component of r that is normal to M in the ambient space Rl . Proposition 3.1 foreshadows a general fact about spectral embedding algorithms: in most cases, Laplacian operators confound the effects of spatial proximity, sampling density and directional flow due to the presence of the various terms above. 3.1 Anisotropic Limit Operators Proposition 3.1 above can be used to derive the limits of a variety of Laplacian type operators associated with spectral embedding algorithms like [5, 6, 3]. Although we will focus primarily on a few operators that give the most insight into the generative process and enable us to recover the model defined in Figure 1, we first present four distinct families of operators for completeness. These operator families are inspired by the anisotropic family of operators that [9] introduced for undirected graphs, which make use of anisotropic kernels of the form: k (α) (x, y) = k (x, y) pα (x)pα (y) , (7) with α ∈ [0, 1] where α = 0 is the isotropic limit. 
To normalize the anisotropic kernels, we need (α) (α) (α) as p (x) = M k (x, y)p(y)dy. From (7), four to redefine the outdegrees distribution of k families of diffusion processes of the form ft = H (α) [f ](x) can be derived depending on which kernel is normalized and which outdegree distribution is used for the normalization. Specifically, (α) (α) we define transport operators by normalizing the asymmetric k or symmetric h kernels with the 1 asymmetric p or symmetric q = M h (x, y)p(y)dy outdegree distribution . To keep track of all options, we introduce the following notation: the operators will be indexed by the type of kernel and outdegree distribution they correspond to (symmetric or asymmetric), with the first index identifying the kernel and the second index identifying the outdegree distribution. For example, the family of anisotropic limit operators introduced by [9] is defined by normalizing the symmetric kernel by (α) the symmetric outdegree distribution, hence they will be denoted as Hss , with the superscript corresponding to the anisotropic power α. Proposition 3.2 With the above notation, (α) Haa [f ] = ∆f − 2 (1 − α) U · f + r· f (α) Has [f ] (α) Hsa [f ] (α) Hss [f ] = ∆f − 2 (1 − α) U · = ∆f − 2 (1 − α) U · = ∆f − 2(1 − α) U · (8) f − cf + (α − 1)(r · U )f − ( · r + (α − 1)r · f + (c + f. U )f · r)f + r · f (9) (10) (11) The proof of this proposition, which can be found in [8], follows from repeated application of Proposition 3.1 to p(y) or q(y) and then to k α (x, y) or hα (x, y), as well as the fact that p1 = α 1 p−α [1 − α (ω + ∆p p + 2r · p p +2 · r + c)] + o( ). (α) Thus, if we use the asymmetric k and p , we get Haa , defined by the advected diffusion equa(α) tion (8). In general, Haa is not hermitian, so it commonly has complex eigenvectors. This makes (1) embedding directed graphs with this operator problematic. Nevertheless, Haa will play an important role in extracting the directionality of the sampling process. If we use the symmetric kernel h but the asymmetric outdegree distribution p , we get the family (α) of operators Hsa , of which the WCut of [3] is a special case (α = 0). If we reverse the above, i.e. (α) (α) (α) use k and q , we obtain Has . This turns out to be merely a combination of Haa and Hsa . 1 The reader may notice that there are in fact eight possible combinations of kernel and degree distribution, since the anisotripic kernel (7) could also be defined using a symmetric or asymmetric outdegree distribution. However, there are only four distinct asymptotic results and they are all covered by using one kernel (symmetric or asymmetric) and one degree distribution (symmetric or asymmetric) throughout. 4 Algorithm 1 Directed Embedding Input: Affinity matrix Wi,j and embedding dimension m, (m ≥ d) 1. S ← (W + W T )/2 (Steps 1–6 estimate the coordinates as in [11]) n 2. qi ← j=1 Si,j , Q = diag(q) 3. V ← Q−1 SQ−1 (1) n 4. qi ← j=1 Vi,j , Q(1) = diag(q (1) ) (1) −1 5. Hss,n ← Q(1) V 6. Compute the Ψ the n × (m + 1) matrix with orthonormal columns containing the m + 1 largest (1) right eigenvector (by eigenvalue) of Hss,n as well as the Λ the (m + 1) × (m + 1) diagonal matrix of eigenvalues. Eigenvectors 2 to m + 1 from Ψ are the m coordinates of the embedding. (1) 7. Compute π the left eigenvector of Hss,n with eigenvalue 1. (Steps 7–8 estimate the density) n 8. π ← π/ i=1 πi is the density distribution over the embedding. n 9. pi ← j=1 Wi,j , P = diag(p) (Steps 9–13 estimate the vector field r) 10. T ← P −1 W P −1 (1) n 11. 
pi ← j=1 Ti,j , P (1) = diag(p(1) ) (1) −1 12. Haa,n ← P (1) T (1) (1) 13. R ← (Haa,n − Hss,n )Ψ/2. Columns 2 to m + 1 of R are the vector field components in the direction of the corresponding coordinates of the embedding. (α) Finally, if we only consider the symmetric kernel h and degree distribution q , we recover Hss , the anisotropic kernels of [9] for symmetric graphs. This operator for α = 1 is shown to separate the manifold from the probability distribution [11] and will be used as part of our recovery algorithm. Isolating the Vector Field r 4 Our aim is to esimate the manifold M, the density distribution p = e−U , and the vector field r. The (1) first two components of the data can be recovered from Hss as shown in [11] and summarized in Algorithm 1. At this juncture, one feature of generative process is missing: the vector field r. The natural approach (α) (α) for recovering r is to isolate the linear operator r · from Haa by substracting Hss : (α) (α) Haa − Hss = r · . (12) The advantage of recovering r in operator form as in (12) is that r · is coordinate free. In other words, as long as the chosen embedding of M is diffeomorphic to M2 , (12) can be used to express the component of r that lies in the tangent space T M, which we denote by r|| . Specifically, let Ψ be a diffeomorphic embedding of M ; the component of r along coordinate ψk is then given by r · ψk = rk , and so, in general, r|| = r · Ψ. (13) The subtle point that only r|| is recovered from (13) follows from the fact that the operator r · only defined along M and hence any directional derivative is necessarily along T M. is Equation (13) and the previous observations are the basis for Algorithm 1, which recovers the three important features of the generative process for an asymmetric graph with affinity matrix W . A similar approach can be employed to recover c + · r, or simply · r if r has no component perpendicular to the tangent space T M (meaning that c ≡ 0). Recovering c + · r is achieved by taking advantage of the fact that (1) (1) (Hsa − Hss ) = (c + 2 · r) , (14) (1) A diffeomorphic embedding is guaranteed by using the eigendecomposition of Hss . 5 (1) (1) which is a diagonal operator. Taking into account that for finite n (Hsa,n − Hss,n ) is not perfectly (1) (1) diagonal, using ψn ≡ 1n (vector of ones), i.e. (Hsa,n − Hss,n )[1n ] = (cn + · rn ), has been found (1) (1) empirically to be more stable than simply extracting the diagonal of (Hsa,n − Hss,n ). 5 Experiments Artificial Data For illustrative purposes, we begin by applying our method to an artificial example. We use the planet Earth as a manifold with a topographic density distribution, where sampling probability is proportional to elevation. We also consider two vector fields: the first is parallel to the line of constant latitude and purely tangential to the sphere, while the second is parallel to the line of constant longitude with a component of the vector field perpendicular to the manifold. The true model with constant latitude vector field is shown in Figure 2, along with the estimated density and vector field projected on the true manifold (sphere). Model Recovered Latitudinal (a) Longitudinal (b) Figure 2: (a): Sphere with latitudinal vector field, i.e East-West asymmetry, with Wew > Wwe if node w lies to the West of node e. The graph nodes are sampled non-uniformly, with the topographic map of the world as sampling density. We sample n = 5000 nodes, and observe only the resulting W matrix, but not the node locations. 
Both the estimated density and vector field agree with the true model, demonstrating that, for artificial data, recovery Algorithm 1 performs quite well. We note that the estimated density does not recover all the details of the original density, even for a large sample size (here n = 5000 with ε = 0.07). Meanwhile, the estimated vector field performs quite well even when the sampling is reduced to n = 500 with ε = 0.1. This can be seen in Figure 2(b), where the true and estimated vector fields are superimposed. Figure 2 also demonstrates how r·∇ only recovers the tangential component of r. The estimated geometry is not shown in any of these figures, since the success of the diffusion map in recovering the geometry for such a simple manifold is already well established [2, 9].

Real Data. The National Longitudinal Survey of Youth (NLSY) 1979 Cohort is a representative sample of young men and women in the United States who were followed from 1979 to 2000 [12, 13]. The aim here is to use this survey to obtain a representation of the job market as a diffusion process over a manifold. The data set consists of a sample of 7,816 individual career sequences of length 64, listing the jobs a particular individual held every quarter between the ages of 20 and 36. Each token in the sequence identifies a job, and each job corresponds to an industry × occupation pair. There are 25 unique industry and 20 unique occupation indices. Out of the 500 possible pairings, approximately 450 occur in the data, with only 213 occurring with sufficient frequency to be included here. Thus, our graph G has 213 nodes (the jobs), and our observations consist of 7,816 walks between the graph nodes.

We convert these walks to a directed graph with affinity matrix W. Specifically, W_ij represents the number of times a transition from job i to job j was observed (note that this matrix is asymmetric, i.e. W_ij ≠ W_ji). Normalizing each row i of W by its outdegree d_i gives P = diag(d_i)^{−1} W, the non-parametric maximum likelihood estimator of the Markov chain over G for the progression of career sequences. This Markov chain has H_aa^(0) as its limit operator, as the granularity of the job market increases along with the number of observations. Thus, in trying to recover the geometry, distribution and vector field, we are actually interested in estimating the full advective effect of the diffusion process generated by H_aa^(0); that is, we want to estimate r·∇ − 2∇U·∇, where we can use −2∇U·∇ = H_ss^(0) − H_ss^(1) to complement Algorithm 1.

Figure 3: Embedding of the job market along with the field r − 2∇U over the first two non-constant eigenvectors. The color map corresponds to the mean monthly wage in dollars (a) and to the female proportion (b) for each job.
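To make the preceding construction concrete, the sketch below (our own illustration, not code from the paper) builds the transition-count affinity W from career walks, forms the Markov estimator P = diag(d_i)^{−1} W, and combines finite-sample versions of H_aa^(0), H_ss^(0) and H_ss^(1) to estimate r·∇ − 2∇U·∇ along given embedding coordinates. The plain row normalizations used for the α = 0 operators are one natural finite-sample choice; the paper itself only spells out the α = 1 constructions in Algorithm 1, and overall scale constants (the bandwidth ε and the factor 1/2) are ignored here.

import numpy as np

def walks_to_affinity(walks, n_jobs):
    # W[i, j] = number of observed transitions from job i to job j
    W = np.zeros((n_jobs, n_jobs))
    for seq in walks:                          # each walk is a sequence of job indices
        for i, j in zip(seq[:-1], seq[1:]):
            W[i, j] += 1.0
    return W

def advective_field(W, Psi):
    # Estimate (r.grad - 2 grad(U).grad) applied to embedding coordinates Psi, up to scale.
    S = (W + W.T) / 2.0
    d = W.sum(axis=1)
    Haa0 = W / d[:, None]                      # P = diag(d_i)^{-1} W, the Markov chain
    q = S.sum(axis=1)
    Hss0 = S / q[:, None]                      # alpha = 0: no density renormalization
    V = S / np.outer(q, q)                     # alpha = 1 renormalization of the symmetric kernel
    Hss1 = V / V.sum(axis=1)[:, None]
    advection = (Haa0 - Hss0) @ Psi            # ~ r . grad(Psi), using (12) at alpha = 0
    drift = (Hss0 - Hss1) @ Psi                # ~ -2 grad(U) . grad(Psi)
    return advection + drift

# Hypothetical usage with the NLSY walks (job indices 0..212):
#   W = walks_to_affinity(walks, n_jobs=213)
#   Psi, pi, R = directed_embedding(W, m=2)    # sketch given after Figure 2
#   F = advective_field(W, Psi)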
We obtain an embedding of the job market that describes the relative position of jobs, their distribution, and the natural time progression from each job. Of these, the relative position and the natural time progression are the most interesting. Together, they summarize the job-market dynamics by describing which jobs are naturally “close”, as well as where they can lead in the future. From a public-policy perspective, this can potentially improve the focus on certain jobs for helping individuals attain better upward mobility.

The job market was found to be a high-dimensional manifold. We present only the first two dimensions, that is, the second and third eigenvectors of H_ss^(0), since the first eigenvector is uninformative (constant) by construction. The eigenvectors showed correlation with important demographic data, such as wages and gender. Figure 3 displays this two-dimensional sub-embedding along with the directional information r − 2∇U for each dimension. The plot shows very little net progression toward regions of increasing mean salary.³ This is somewhat surprising, but it is easy to overstate this observation: diffusion alone would be enough to move individuals towards higher salary. What Figure 3(a) suggests is that there appear to be no “external forces” advecting individuals towards higher salary. Nevertheless, there appear to be other external forces at play in the job market: Figure 3(b), which is analogous to Figure 3(a) but with gender replacing the salary color scheme, suggests that these forces push individuals towards greater gender differentiation. This is especially true amongst male-dominated jobs, which appear to be advected toward the left edge of the embedding. Hence, this simple analysis of the job market can be seen as an indication that males and females tend to move away from each other over time, while neither seems to have a monopoly on high- or low-paying jobs.

³ It is worth noting that in the NLSY data set, high-paying jobs are teacher, nurse and mechanic. This is due to the fact that the career paths observed stop at age 36, which is relatively early in an individual's career.

6 Discussion

This paper makes three contributions: (1) it introduces a manifold-based generative model for directed graphs with weighted edges, (2) it obtains asymptotic results for operators constructed from the directed graphs, and (3) these asymptotic results lead to a natural algorithm for estimating the model.

Generative Models that assume that data are sampled from a manifold are standard for undirected graphs, but to our knowledge, none have yet been proposed for directed graphs. When W is symmetric, it is natural to assume that it depends on the points' proximity. For asymmetric affinities W, one must include an additional component to explain the asymmetry. In the asymptotic limit, this is tantamount to defining a vector field on the manifold.
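As a small, self-contained illustration of this point (ours, not from the paper): an observed affinity matrix splits uniquely into a symmetric part, which the model attributes to proximity on M, and an antisymmetric part, which is the directional signal that the vector field r has to explain; in the asymptotic limit, the latter is what the parameterization a_ε(x, y) = (r/2)·(y − x) h_ε(x, y) in (4) accounts for.

import numpy as np

def split_affinity(W):
    # Unique decomposition W = S + A with S = S^T (proximity) and A = -A^T (direction)
    S = (W + W.T) / 2.0
    A = (W - W.T) / 2.0
    return S, A

# Relative size of the directional component, e.g. for the job-market W:
#   S, A = split_affinity(W)
#   print(np.linalg.norm(A) / np.linalg.norm(S))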
Algorithm. We have used from [9] the idea of defining anisotropic kernels (indexed by α) in order to separate the density p and the manifold geometry M. Also, we adopted their general assumptions about the symmetric part of the kernel. As a consequence, the recovery algorithm for p and M is identical to theirs. However, insofar as the asymmetric part of the kernel is concerned, everything, starting from the definition and the introduction of the vector field r as a way to model the asymmetry, through the derivation of the asymptotic expression for the symmetric plus asymmetric kernel, is new. We go significantly beyond the elegant idea of [9] regarding the use of anisotropic kernels by analyzing the four distinct renormalizations possible for a given α, each of them combining different aspects of M, p and r. Only the successful (and novel) combination of two different anisotropic operators is able to recover the directional flow r. Algorithm 1 is natural, but we do not claim it is the only possible one in the context of our model. For instance, we can also use H_sa^(α) to recover the operator ∇·r (which empirically seems to have worse numerical properties than r·∇). In the National Longitudinal Survey of Youth study, we were interested in the whole advective term, so we estimated it from a different combination of operators. Depending on the specific question, other features of the model could be obtained.

Limit Results. Proposition 3.1 is a general result on the asymptotics of asymmetric kernels. Recovering the manifold and r is just one, albeit the most useful, of the many ways of exploiting these results. For instance, H_sa^(0) is the limit operator of the operators used in [3] and [5]. The limit analysis could be extended to other digraph embedding algorithms such as [4, 6].

How general is our model? Any kernel can be decomposed into a symmetric and an asymmetric part, as we have done. The assumptions on the symmetric part h are standard. The paper of [7] goes one step beyond these assumptions; we discuss it in relation to our work shortly. The more interesting question is how limiting our assumptions are regarding the choice of kernel, especially the asymmetric part, which we parameterized as a_ε(x, y) = (r/2)·(y − x) h_ε(x, y) in (4). In the asymptotic limit, this choice turns out to be fully general, at least up to the identifiable aspects of the model. For a more detailed discussion of this issue, see [8].

In [7], Ting, Huang and Jordan presented asymptotic results for a general family of kernels that includes asymmetric and random kernels. Our kernel can be expressed in the notation of [7] by taking w_x(y) ← 1 + r(x, y)·(y − x), r_x(y) ← 1, K_0 ← h, h ← ε. Their assumptions are more general than the assumptions we make here, yet our model is general up to what can be identified from G alone. The distinction arises because [7] focuses on graph construction methods from an observed sample of M, while we focus on explaining an observed directed graph G through a manifold generative process. Moreover, while the results of [7] can be used to analyze data from directed graphs, they differ from our Proposition 3.1. Specifically, with respect to the limit in Theorem 3 of [7], we obtain the additional source terms f(x)∇·r and c(x)f(x) that follow from not enforcing conservation of mass while defining the operators H_sa^(α) and H_as^(α).

We applied our theory of directed graph embedding to the analysis of the career sequences in Section 5, but asymmetric affinity data abound in other social contexts, and in the physical and life sciences. Indeed, any “similarity” score that is obtained from a likelihood of the form W_vu = likelihood(u|v) is generally asymmetric. Hence our methods can be applied to study not only social networks, but also patterns of human movement, road traffic, and trade relations, as well as alignment scores in molecular biology. Finally, the physical interpretation of our model also makes it naturally applicable to physical models of flows.

Acknowledgments. This research was partially supported by NSF awards IIS-0313339 and IIS-0535100.
References

[1] Belkin and Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15:1373–1396, 2002.
[2] Nadler, Lafon, and Coifman. Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators. In Neural Information Processing Systems Conference, 2006.
[3] Meila and Pentney. Clustering by weighted cuts in directed graphs. In SIAM Data Mining Conference, 2007.
[4] Zhou, Huang, and Schölkopf. Learning from labeled and unlabeled data on a directed graph. In International Conference on Machine Learning, pages 1041–1048, 2005.
[5] Zhou, Schölkopf, and Hofmann. Semi-supervised learning on directed graphs. In Advances in Neural Information Processing Systems, volume 17, pages 1633–1640, 2005.
[6] Fan R. K. Chung. The diameter and Laplacian eigenvalues of directed graphs. Electronic Journal of Combinatorics, 13, 2006.
[7] Ting, Huang, and Jordan. An analysis of the convergence of graph Laplacians. In International Conference on Machine Learning, 2010.
[8] Dominique Perrault-Joncas and Marina Meilă. Directed graph embedding: an algorithm based on continuous limits of Laplacian-type operators. Technical Report TR 587, University of Washington, Department of Statistics, November 2011.
[9] Coifman and Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21:6–30, 2006.
[10] Mikhail Belkin and Partha Niyogi. Convergence of Laplacian eigenmaps. Preprint (short version in NIPS 2008), 2008.
[11] Coifman, Lafon, Lee, Maggioni, Warner, and Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. In Proceedings of the National Academy of Sciences, pages 7426–7431, 2005.
[12] United States Department of Labor. National Longitudinal Survey of Youth 1979 cohort. http://www.bls.gov/nls/, retrieved October 2011.
[13] Marc A. Scott. Affinity models for career sequences. Journal of the Royal Statistical Society: Series C (Applied Statistics), 60(3):417–436, 2011.
2 0.88953412 306 nips-2011-t-divergence Based Approximate Inference
Author: Nan Ding, Yuan Qi, S.v.n. Vishwanathan
Abstract: Approximate inference is an important technique for dealing with large, intractable graphical models based on the exponential family of distributions. We extend the idea of approximate inference to the t-exponential family by defining a new t-divergence. This divergence measure is obtained via convex duality between the log-partition function of the t-exponential family and a new t-entropy. We illustrate our approach on the Bayes Point Machine with a Student’s t-prior. 1
3 0.88929665 166 nips-2011-Maximal Cliques that Satisfy Hard Constraints with Application to Deformable Object Model Learning
Author: Xinggang Wang, Xiang Bai, Xingwei Yang, Wenyu Liu, Longin J. Latecki
Abstract: We propose a novel inference framework for finding maximal cliques in a weighted graph that satisfy hard constraints. The constraints specify the graph nodes that must belong to the solution as well as mutual exclusions of graph nodes, i.e., sets of nodes that cannot belong to the same solution. The proposed inference is based on a novel particle filter algorithm with state permeations. We apply the inference framework to a challenging problem of learning part-based, deformable object models. Two core problems in the learning framework, matching of image patches and finding salient parts, are formulated as two instances of the problem of finding maximal cliques with hard constraints. Our learning framework yields discriminative part based object models that achieve very good detection rate, and outperform other methods on object classes with large deformation. 1
same-paper 4 0.85943139 195 nips-2011-On Learning Discrete Graphical Models using Greedy Methods
Author: Ali Jalali, Christopher C. Johnson, Pradeep K. Ravikumar
Abstract: In this paper, we address the problem of learning the structure of a pairwise graphical model from samples in a high-dimensional setting. Our first main result studies the sparsistency, or consistency in sparsity pattern recovery, properties of a forward-backward greedy algorithm as applied to general statistical models. As a special case, we then apply this algorithm to learn the structure of a discrete graphical model via neighborhood estimation. As a corollary of our general result, we derive sufficient conditions on the number of samples n, the maximum nodedegree d and the problem size p, as well as other conditions on the model parameters, so that the algorithm recovers all the edges with high probability. Our result guarantees graph selection for samples scaling as n = Ω(d2 log(p)), in contrast to existing convex-optimization based algorithms that require a sample complexity of Ω(d3 log(p)). Further, the greedy algorithm only requires a restricted strong convexity condition which is typically milder than irrepresentability assumptions. We corroborate these results using numerical simulations at the end.
5 0.79299766 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
Author: Po-ling Loh, Martin J. Wainwright
Abstract: Although the standard formulations of prediction problems involve fully-observed and noiseless data drawn in an i.i.d. manner, many applications involve noisy and/or missing data, possibly involving dependencies. We study these issues in the context of high-dimensional sparse linear regression, and propose novel estimators for the cases of noisy, missing, and/or dependent data. Many standard approaches to noisy or missing data, such as those using the EM algorithm, lead to optimization problems that are inherently non-convex, and it is difficult to establish theoretical guarantees on practical algorithms. While our approach also involves optimizing non-convex programs, we are able to both analyze the statistical error associated with any global optimum, and prove that a simple projected gradient descent algorithm will converge in polynomial time to a small neighborhood of the set of global minimizers. On the statistical side, we provide non-asymptotic bounds that hold with high probability for the cases of noisy, missing, and/or dependent data. On the computational side, we prove that under the same types of conditions required for statistical consistency, the projected gradient descent algorithm will converge at geometric rates to a near-global minimizer. We illustrate these theoretical predictions with simulations, showing agreement with the predicted scalings. 1
6 0.77863365 258 nips-2011-Sparse Bayesian Multi-Task Learning
7 0.77273804 135 nips-2011-Information Rates and Optimal Decoding in Large Neural Populations
8 0.77085161 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
9 0.77046716 9 nips-2011-A More Powerful Two-Sample Test in High Dimensions using Random Projection
10 0.76996338 239 nips-2011-Robust Lasso with missing and grossly corrupted observations
11 0.76796877 29 nips-2011-Algorithms and hardness results for parallel large margin learning
12 0.76779163 80 nips-2011-Efficient Online Learning via Randomized Rounding
13 0.76711017 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
14 0.76701194 186 nips-2011-Noise Thresholds for Spectral Clustering
15 0.76541495 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning
16 0.76526594 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data
17 0.76430571 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
18 0.7631672 5 nips-2011-A Denoising View of Matrix Completion
19 0.76298815 276 nips-2011-Structured sparse coding via lateral inhibition
20 0.76200783 265 nips-2011-Sparse recovery by thresholded non-negative least squares