nips nips2006 nips2006-92 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Martin J. Wainwright, John D. Lafferty, Pradeep K. Ravikumar
Abstract: We focus on the problem of estimating the graph structure associated with a discrete Markov random field. We describe a method based on ℓ1-regularized logistic regression, in which the neighborhood of any given node is estimated by performing logistic regression subject to an ℓ1-constraint. Our framework applies to the high-dimensional setting, in which both the number of nodes p and maximum neighborhood sizes d are allowed to grow as a function of the number of observations n. Our main result is to establish sufficient conditions on the triple (n, p, d) for the method to succeed in consistently estimating the neighborhood of every node in the graph simultaneously. Under certain mutual incoherence conditions analogous to those imposed in previous work on linear regression, we prove that consistent neighborhood selection can be obtained as long as the number of observations n grows more quickly than 6d^6 log d + 2d^5 log p, thereby establishing that logarithmic growth in the number of samples n relative to graph size p is sufficient to achieve neighborhood consistency. Keywords: Graphical models; Markov random fields; structure learning; ℓ1-regularization; model selection; convex risk minimization; high-dimensional asymptotics; concentration.
Reference: text
sentIndex sentText sentNum sentScore
1 We focus on the problem of estimating the graph structure associated with a discrete Markov random field. [sent-9, score-0.217]
2 We describe a method based on ℓ1-regularized logistic regression, in which the neighborhood of any given node is estimated by performing logistic regression subject to an ℓ1-constraint. [sent-10, score-0.7]
3 Our framework applies to the high-dimensional setting, in which both the number of nodes p and maximum neighborhood sizes d are allowed to grow as a function of the number of observations n. [sent-11, score-0.365]
4 Our main result is to establish sufficient conditions on the triple (n, p, d) for the method to succeed in consistently estimating the neighborhood of every node in the graph simultaneously. [sent-12, score-0.789]
5 Keywords: Graphical models; Markov random fields; structure learning; ℓ1-regularization; model selection; convex risk minimization; high-dimensional asymptotics; concentration. [sent-14, score-0.066]
6 1 Introduction Consider a p-dimensional discrete random variable X = (X1 , X2 , . [sent-15, score-0.041]
7 , Xp ) where the distribution of X is governed by an unknown undirected graphical model. [sent-18, score-0.147]
8 In this paper, we investigate the problem of estimating the graph structure from an i.i.d. sample. [sent-19, score-0.176]
9 This structure learning problem plays an important role in a broad range of applications where graphical models are used as a probabilistic representation tool, including image processing, document analysis and medical diagnosis. [sent-26, score-0.111]
10 Our approach is to perform an ℓ1-regularized logistic regression of each variable on the remaining variables, and to use the sparsity pattern of the regression vector to infer the underlying neighborhood structure. [sent-27, score-0.573]
11 The main contribution of the paper is a theoretical analysis showing that, under suitable conditions, this procedure recovers the true graph structure with probability one, in the high-dimensional setting in which both the sample size n and graph size p = p(n) increase to infinity. [sent-28, score-0.53]
12 The problem of structure learning for discrete graphical models—due to both its importance and difficulty—has attracted considerable attention. [sent-29, score-0.152]
13 Constraint based approaches use hypothesis testing to estimate the set of conditional independencies in the data, and then determine a graph that most closely represents those independencies [8]. [sent-30, score-0.262]
14 An alternative approach is to view the problem as estimation of a stochastic model, combining a scoring metric on candidate graph structures with a goodness of fit measure to the data. [sent-31, score-0.286]
15 The scoring metric approach must be used together with a search procedure that generates candidate graph structures to be scored. [sent-32, score-0.286]
16 The combinatorial space of graph structures is super-exponential, however, and Chickering [1] shows that this problem is in general NP-hard. [sent-33, score-0.211]
17 Estimation of graph structures in undirected models has thus largely been restricted to simple graph classes such as trees [2], polytrees [3] and hypertrees [9]. [sent-35, score-0.423]
18 In this paper, we adapt the technique of ℓ1-regularized logistic regression to the problem of inferring graph structure. [sent-40, score-0.399]
19 The technique is computationally efficient and thus well-suited to high dimensional problems, since it involves the solution only of standard convex programs. [sent-41, score-0.066]
20 Our main result establishes conditions on the sample size n, graph size p and maximum neighborhood size d under which the true neighborhood structure can be inferred with probability one as (n, p, d) increase. [sent-42, score-0.887]
21 Our analysis, though asymptotic in nature, leads to growth conditions that are sufficiently weak so as to require only that the number of observations n grow logarithmically in terms of the graph size. [sent-43, score-0.481]
22 Consequently, our results establish that graphical structure can be learned from relatively sparse data. [sent-44, score-0.187]
23 Our analysis and results are similar in spirit to the recent work of Meinshausen and Bühlmann [5] on covariance selection in Gaussian graphical models, but focusing rather on the case of discrete models. [sent-45, score-0.23]
24 The remainder of this paper is organized as follows. [sent-46, score-0.045]
25 In Section 2, we formulate the problem and establish notation, before moving on to a precise statement of our main result, and a high-level proof outline in Section 3. [sent-47, score-0.25]
26 Sections 4 and 5 detail the proof, with some technical details deferred to the full-length version. [sent-48, score-0.047]
27 2 Problem Formulation and Notation Let G = (V, E) denote a graph with vertex set V of size |V | = p and edge set E. [sent-50, score-0.214]
28 A pairwise graphical model with graph G is a family of probability distributions for a random variable X = (X1 , X2 , . [sent-52, score-0.287]
29 In this paper, we restrict our attention to the case where each xs ∈ {0, 1} is binary, and the family of probability distributions is given by the Ising model p(x; θ) = exp{ Σ_{s∈V} θ_s x_s + Σ_{(s,t)∈E} θ_st x_s x_t − Ψ(θ) }. (1) [sent-56, score-0.329]
30 Given such an exponential family in a minimal representation, the log partition function Ψ(θ) is strictly convex, which ensures that the parameter matrix θ is identifiable. [sent-57, score-0.052]
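To make the Ising family in equation (1) concrete, here is a small sketch (our own illustration, not code from the paper) that evaluates the unnormalized log-probability of a binary configuration; the names theta_node, theta_edge, and edges are hypothetical.

```python
import numpy as np

def ising_unnormalized_logprob(x, theta_node, theta_edge, edges):
    """Unnormalized log-probability of the Ising model in equation (1):
    sum_s theta_s x_s + sum_{(s,t) in E} theta_st x_s x_t (Psi(theta) omitted)."""
    x = np.asarray(x, dtype=float)  # binary configuration in {0, 1}^p
    node_term = float(theta_node @ x)
    edge_term = sum(theta_edge[(s, t)] * x[s] * x[t] for (s, t) in edges)
    return node_term + edge_term

# Tiny example: a 3-node chain 0 - 1 - 2.
edges = [(0, 1), (1, 2)]
theta_node = np.array([0.1, -0.2, 0.3])
theta_edge = {(0, 1): 0.5, (1, 2): -0.7}
print(ising_unnormalized_logprob([1, 0, 1], theta_node, theta_edge, edges))
```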
31 Our set-up includes the important situation in which the number of variables p may be large relative to the sample size n. [sent-60, score-0.072]
32 In particular, we allow the graph Gn = (Vn , En ) to vary with n, so that the number of variables p = |Vn | and the sizes of the neighborhoods ds := |N (s)| may vary with sample size. [sent-61, score-0.317]
33 (For notational clarity we will sometimes omit subscripts indicating a dependence on n.) [sent-62, score-0.042]
34 Equivalently, we consider the problem of estimating neighborhoods N̂n(s) ⊂ Vn so that P[ N̂n(s) = N(s), ∀ s ∈ Vn ] → 1. [sent-64, score-0.069]
35 For many problems of interest, the graphical model provides a compact representation where the sizes of the neighborhoods are typically small—say ds ≪ p for all s ∈ Vn. [sent-65, score-0.218]
36 Our goal is to use ℓ1-regularized logistic regression to estimate these neighborhoods; for this paper, the actual values of the parameters θij are a secondary concern. [sent-66, score-0.223]
37 Given input data {(z^(i), y^(i))}, where z^(i) is a p-dimensional covariate and y^(i) ∈ {0, 1} is a binary response, logistic regression involves minimizing the negative log likelihood f(θ; x) = (1/n) Σ_{i=1}^{n} [ log(1 + exp(θ^T z^(i))) − y^(i) θ^T z^(i) ]. (2) [sent-67, score-0.355]
38 We focus on a regularized version of this regression problem, involving an ℓ1 constraint on (a subset of) the parameter vector θ. [sent-68, score-0.158]
39 For the graph learning task, we regress each variable Xs onto the remaining variables, sharing the same data x(i) across problems. [sent-70, score-0.176]
40 This leads to the following collection of optimization problems (p in total, one for each graph node): θ^{s,λ} = arg min_{θ∈Rp} { (1/n) Σ_{i=1}^{n} [ log(1 + exp(θ^T z^(i,s))) − x_s^(i) θ^T z^(i,s) ] + λn ‖θ_{\s}‖_1 }. (3) [sent-71, score-0.176]
41 where s ∈ V, and z^(i,s) ∈ {0, 1}^p denotes the vector with z_t^(i,s) = x_t^(i) for t ≠ s and z_s^(i,s) = 1. [sent-72, score-0.505]
42 Our estimate of the neighborhood N(s) is then given by N̂n(s) = { t ∈ V, t ≠ s : θ_t^{s,λ} ≠ 0 }. [sent-75, score-0.193]
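The following sketch mirrors the estimator in equations (3)–(4): one ℓ1-penalized logistic regression per node, with the neighborhood read off from the nonzero coefficients. It is our own illustration; scikit-learn's solver stands in for the authors' customized primal-dual algorithm, and the correspondence C = 1/(n·λn) between sklearn's C and the penalty λn, as well as the zero-threshold tol, are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_neighborhoods(X, lam, tol=1e-6):
    """X: n x p matrix of binary samples; lam: penalty level lambda_n.
    For each node s, regress X[:, s] on the remaining variables with an l1
    penalty and report the indices with nonzero coefficients (eq. (4))."""
    n, p = X.shape
    neighborhoods = {}
    for s in range(p):
        y = X[:, s]
        Z = np.delete(X, s, axis=1)  # covariates z^{(i,s)}; the intercept plays the role of z_s = 1
        clf = LogisticRegression(penalty="l1", solver="liblinear",
                                 C=1.0 / (n * lam))  # assumed correspondence C = 1/(n * lambda_n)
        clf.fit(Z, y)
        coef = clf.coef_.ravel()
        others = [t for t in range(p) if t != s]
        neighborhoods[s] = {others[j] for j in range(p - 1) if abs(coef[j]) > tol}
    return neighborhoods
```

An AND or OR rule applied to the two estimates for each pair (s, t) then yields an edge set, as in the experiments reported below.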
43 Our goal is to provide conditions on the graphical model—in particular, relations among the number of nodes p, number of observations n and maximum node degree d—that ensure that the collection of neighborhood estimates (2), one for each node s of the graph, is consistent with high probability. [sent-76, score-0.767]
44 We conclude this section with some additional notation that is used throughout the sequel. [sent-77, score-0.037]
45 Finally, for ease of notation, we make frequent use of the shorthand Qs(θ) = ∇² fs(θ; x) for the Hessian of the logistic loss fs (4b). [sent-79, score-0.111]
46 3 Main Result and Outline of Analysis In this section, we begin with a precise statement of our main result, and then provide a high-level overview of the key steps involved in its proof. [sent-80, score-0.1]
47 3.1 Statement of main result We begin by stating the assumptions that underlie our main result. [sent-82, score-0.124]
48 A subset of the assumptions involves the Fisher information matrix associated with the logistic regression model, defined for each node s ∈ V as Q*_s = E[ ps(Z; θ*) {1 − ps(Z; θ*)} Z Z^T ]. (5) Note that Q*_s is the population average of the Hessian Qs(θ*). [sent-83, score-0.491]
49 For ease of notation we use S to denote the neighborhood N(s), and S^c to denote the complement V − N(s). [sent-84, score-0.261]
50 Our first two assumptions (A1 and A2) place restrictions on the dependency and coherence structure of this Fisher information matrix. [sent-85, score-0.048]
51 We note that these first two assumptions are analogous to conditions imposed in previous work [5, 10, 11, 12] on linear regression. [sent-86, score-0.235]
52 Our third assumption is a growth rate condition on the triple (n, p, d). [sent-87, score-0.253]
53 [A1] Dependency condition: We require that the subset of the Fisher information matrix corresponding to the relevant covariates has bounded eigenvalues: namely, there exist constants Cmin > 0 and Cmax < +∞ such that Cmin ≤ Λmin(Q*_SS) and Λmax(Q*_SS) ≤ Cmax. (6) [sent-88, score-0.076]
54 These conditions ensure that the relevant covariates do not become overly dependent, and can be guaranteed (for instance) by assuming that θ^{s,λ} lies within a compact set. [sent-89, score-0.212]
55 [A2] Incoherence condition: Our next assumption captures the intuition that the large number of irrelevant covariates (i. [sent-90, score-0.076]
56 , non-neighbors of node s) cannot exert an overly strong effect on the subset of relevant covariates (i. [sent-92, score-0.216]
57 (7) Analogous conditions are required for the success of the Lasso in the case of linear regression [5, 10, 11, 12]. [sent-96, score-0.197]
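Since the display for equation (7) did not survive extraction, the exact incoherence inequality is not reproduced above; the sketch below (our own, with the standard Lasso-style statistic assumed) computes the quantities that A1 and A2 constrain, given a Fisher information matrix Q and neighborhood index set S.

```python
import numpy as np

def dependency_and_incoherence(Q, S):
    """Q: p x p Fisher information matrix Q*_s; S: indices of the neighborhood N(s).
    Returns the extreme eigenvalues of Q_SS (assumption A1) and the row-wise
    l1 norm of Q_{S^c S} Q_{SS}^{-1}, the usual incoherence statistic (assumption A2)."""
    p = Q.shape[0]
    S = np.asarray(S)
    Sc = np.setdiff1d(np.arange(p), S)
    Q_SS = Q[np.ix_(S, S)]
    Q_ScS = Q[np.ix_(Sc, S)]
    eigvals = np.linalg.eigvalsh(Q_SS)  # Q_SS is symmetric
    incoherence = np.abs(Q_ScS @ np.linalg.inv(Q_SS)).sum(axis=1).max() if Sc.size else 0.0
    return eigvals.min(), eigvals.max(), incoherence
```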
58 [A3] Growth rates: Our second set of assumptions involves the growth rates of the number of observations n, the graph size p, and the maximum node degree d. [sent-97, score-0.54]
59 We require that the number of observations n grows more quickly than 6d^6 log d + 2d^5 log p. (8) Note that this condition allows the graph size p to grow exponentially with the number of observations (i. [sent-99, score-0.386]
60 Moreover, it is worthwhile noting that for model selection in graphical models, one is typically interested in node degrees d that remain bounded (e. [sent-102, score-0.271]
61 , d = O(1)), or grow only weakly with graph size (say d = o(log p)). [sent-104, score-0.275]
62 With these assumptions, we now state our main result: Theorem 1. [sent-105, score-0.038]
63 Given a graphical model and triple (n, p, d) such that conditions A1 through A3 are satisfied, suppose that the regularization parameter λn is chosen such that (a) nλn² − 2 log(p) → +∞, and (b) dλn → 0. Then the neighborhood estimates are simultaneously consistent: P[ N̂n(s) = N(s), ∀ s ∈ Vn ] → 1. [sent-106, score-0.305]
64 3.2 Outline of analysis We now provide a high-level roadmap of the main steps involved in our proof of Theorem 1. [sent-109, score-0.072]
65 We then show that this construction succeeds with probability converging to one under the stated conditions. [sent-111, score-0.057]
66 A key fact is that the convergence rate is sufficiently fast that a simple union bound over all graph nodes shows that we achieve consistent neighborhood estimation for all nodes simultaneously. [sent-112, score-0.463]
67 To provide some insight into the nature of our construction, the analysis in Section 4 shows that the neighborhood N(s) is correctly recovered if and only if the pair (θ, z) satisfies the following four conditions: (a) θ_{S^c} = 0; (b) |θt| > 0 for all t ∈ S; (c) z_S = sgn(θ*_S); and (d) ‖z_{S^c}‖_∞ < 1. [sent-113, score-0.233]
68 The first step in our construction is to choose the pair (θ, z) such that both conditions (a) and (c) hold. [sent-114, score-0.148]
69 The remainder of the analysis is then devoted to establishing that properties (b) and (d) hold with high probability. [sent-115, score-0.141]
70 In the first part of our analysis, we assume that the dependence (A1) and mutual incoherence (A2) conditions hold for the sample Fisher information matrices Qs(θ*) defined below equation (4b). [sent-116, score-0.409]
71 Under this assumption, we then show that the conditions on λn in the theorem statement suffice to guarantee that properties (b) and (d) hold for the constructed pair (θ, z). [sent-117, score-0.245]
72 While it follows immediately from the law of large numbers that the empirical Fisher information Qn_AA(θ*) converges to the population version Q*_AA for any fixed subset A, the delicacy is that we require controlling this convergence over subsets of increasing size. [sent-119, score-0.073]
73 4 Primal-Dual Relations for ℓ1-Regularized Logistic Regression Basic convexity theory can be used to characterize the solutions of ℓ1-regularized logistic regression. [sent-121, score-0.196]
74 We assume in this section that θ1 corresponds to the unregularized bias term, and omit the dependence on sample size n in the notation. [sent-122, score-0.114]
75 If p ≤ n then f (θ; x) is a strictly convex function of θ. [sent-125, score-0.066]
76 Since the ℓ1-norm is convex, it follows that L(θ, λ) is convex in θ and strictly convex in θ for p ≤ n. [sent-126, score-0.132]
77 The subgradient ∂‖θ_{\1}‖_1 ⊂ Rp is the collection of all vectors z satisfying z_1 = 0, |zt| ≤ 1, and zt = sign(θt) if θt ≠ 0. [sent-135, score-0.108]
78 The analysis in the following sections shows that, with high probability, a primal-dual pair (θ, z) can be constructed so that |zt| < 1 and therefore θt = 0 in case θ*_t = 0 in the true model θ* from which the data are generated. [sent-137, score-0.04]
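As a numerical companion to this primal-dual characterization, the sketch below (our own, ignoring the unregularized bias term for simplicity) recovers the dual vector z = −∇f(θ)/λn from a candidate solution and checks the conditions |zt| ≤ 1 and zt = sign(θt) on the support.

```python
import numpy as np

def logistic_grad(theta, Z, y):
    """Gradient of f(theta; x) = (1/n) sum_i [log(1 + exp(theta^T z_i)) - y_i theta^T z_i]."""
    n = Z.shape[0]
    mu = 1.0 / (1.0 + np.exp(-Z @ theta))  # fitted probabilities
    return Z.T @ (mu - y) / n

def check_subgradient_optimality(theta, Z, y, lam, tol=1e-6):
    """Return the dual vector z = -grad f(theta)/lam and whether it lies in the
    subdifferential of the l1 norm at theta (the conditions of this section)."""
    z = -logistic_grad(theta, Z, y) / lam
    support = np.abs(theta) > tol
    on_support_ok = np.allclose(z[support], np.sign(theta[support]), atol=1e-3)
    off_support_ok = bool(np.all(np.abs(z[~support]) <= 1.0 + 1e-3))
    return z, on_support_ok and off_support_ok
```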
79 5 Constructing a Primal-Dual Pair We now fix a variable Xs for the logistic regression, denoting the set of variables in its neighborhood by S. [sent-138, score-0.327]
80 Our proof proceeds by showing the existence (with high probability) of a primal-dual pair (θ, z) that satisfies these conditions. [sent-140, score-0.074]
81 We first establish a consistency result when incoherence conditions are imposed on the sample Fisher information Qn . [sent-142, score-0.449]
82 The remaining analysis, deferred to the full-length version, establishes that the incoherence assumption (A2) on the population version ensures that the sample version also obeys the property with probability converging to one exponentially fast. [sent-143, score-0.463]
83 Let us introduce the notation Wn := (1/n) Σ_{i=1}^{n} z^(i,s) [ x_s^(i) − exp(θ*^T z^(i,s)) / (1 + exp(θ*^T z^(i,s))) ]. Substituting into the subgradient optimality condition (10) yields the equivalent condition ∇f(θ; x) − ∇f(θ*; x) − Wn + λn z = 0. (13) [sent-149, score-0.231]
84 We write the zero-subgradient condition (13) in block form as Qn_{S^c S}[θ_S − θ*_S] = Wn_{S^c} − λn z_{S^c} + Rn_{S^c}, (14a) and Qn_SS[θ_S − θ*_S] = Wn_S − λn z_S + Rn_S. (14b) [sent-151, score-0.071]
85 one, so that these conditions can be rewritten as Qn_{S^c S}(Qn_SS)^{-1}[Wn_S − λn z_S + Rn_S] = Wn_{S^c} − λn z_{S^c} + Rn_{S^c}. [sent-154, score-0.108]
86 We apply these two lemmas to the bound (17) to obtain that, with probability converging to one at rate O(exp(−(nλn² − log p))), we have ‖z_{S^c}‖_∞ ≤ ε/4 + ε/4 + (1 − ε) = 1 − ε/2. [sent-162, score-0.17]
87 5.2 Analysis of condition (b): We next show that condition (b) can be satisfied, so that sgn(θ_S) = sgn(θ*_S). [sent-163, score-0.142]
88 Therefore, in order to establish that |θ_i^{s,λ}| > 0 for all i ∈ S, and moreover that sign(θ_S^{s,λ}) = sign(θ*_S), it suffices to show that ‖(Qn_SS)^{-1}[Wn_S − λn z_S + Rn_S]‖_∞ ≤ ρn. [sent-166, score-0.076]
89 We solved the ℓ1-regularized logistic regression by setting the ℓ1 penalty as λn = O((log p)³/√n), and solved the convex program using a customized primal-dual algorithm (described in more detail in the full-length version of this paper). [sent-177, score-0.32]
90 We considered various sparsity regimes, including constant (d = Ω(1)), logarithmic (d = α log(p)), or linear (d = αp). [sent-178, score-0.098]
91 In each case, we evaluate a given method in terms of its average precision (one minus the fraction of falsely included edges), and its recall (one minus the fraction of falsely excluded edges). [sent-179, score-0.267]
92 Figure 1 shows results for the case of constant degrees (d ≤ 4), and graph sizes p ∈ {100, 200, 400}, for the AND (respectively OR) method, in which an edge (s, t) is included if and only if it is included in the local regressions at both node s and node t (respectively, at either node). [sent-180, score-0.438]
93 Note that both the precision and recall tend to one as the number of samples n is increased. [sent-181, score-0.137]
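For concreteness, here is an illustrative sketch (ours, not the authors' evaluation code) of the AND/OR edge rules and the precision/recall measures used in Figure 1, starting from per-node neighborhood estimates such as those returned by the earlier sketch.

```python
def edges_from_neighborhoods(neighborhoods, rule="AND"):
    """Combine per-node neighborhood estimates into an undirected edge set.
    AND: keep (s, t) only if t in N(s) and s in N(t); OR: keep it if either holds."""
    p = len(neighborhoods)
    edges = set()
    for s in range(p):
        for t in range(s + 1, p):
            in_s, in_t = t in neighborhoods[s], s in neighborhoods[t]
            if (rule == "AND" and (in_s and in_t)) or (rule == "OR" and (in_s or in_t)):
                edges.add((s, t))
    return edges

def precision_recall(estimated_edges, true_edges):
    """Precision: one minus the fraction of falsely included edges;
    recall: one minus the fraction of falsely excluded edges."""
    tp = len(estimated_edges & true_edges)
    precision = tp / len(estimated_edges) if estimated_edges else 1.0
    recall = tp / len(true_edges) if true_edges else 1.0
    return precision, recall
```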
94 7 Conclusion We have shown that a technique based on ℓ1-regularization, in which the neighborhood of any given node is estimated by performing logistic regression subject to an ℓ1-constraint, can be used for consistent model selection in discrete graphical models. [sent-182, score-0.756]
95 Our analysis applies to the high-dimensional setting, in which both the number of nodes p and maximum neighborhood sizes d are allowed to grow as a function of the number of observations n. [sent-183, score-0.365]
96 Whereas the current analysis provides sufficient conditions on the triple (n, p, d) that ensure consistent neighborhood selection, it remains to establish necessary conditions as well [11]. [sent-184, score-0.599]
97 Finally, the ideas described here, while specialized in this paper to the binary case, should be more broadly applicable to discrete graphical models. [sent-185, score-0.152]
98 Each panel shows precision/recall versus n, for graph sizes p ∈ {100, 200, 400}. [sent-220, score-0.214]
99 High dimensional graphs and variable selection with the lasso. [sent-248, score-0.048]
100 Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programs. [sent-280, score-0.068]
wordName wordTfidf (topN-words)
[('zs', 0.449), ('qn', 0.381), ('ws', 0.259), ('ss', 0.223), ('neighborhood', 0.193), ('rs', 0.191), ('incoherence', 0.19), ('graph', 0.176), ('logistic', 0.134), ('sgn', 0.12), ('node', 0.112), ('su', 0.111), ('graphical', 0.111), ('conditions', 0.108), ('growth', 0.096), ('xs', 0.092), ('regression', 0.089), ('vn', 0.087), ('triple', 0.086), ('cmin', 0.081), ('fs', 0.08), ('qs', 0.08), ('fisher', 0.08), ('en', 0.08), ('establish', 0.076), ('covariates', 0.076), ('condition', 0.071), ('neighborhoods', 0.069), ('sparsity', 0.068), ('convex', 0.066), ('rp', 0.064), ('statement', 0.062), ('precision', 0.061), ('grow', 0.061), ('ising', 0.06), ('lemmas', 0.06), ('nn', 0.058), ('converging', 0.057), ('zt', 0.056), ('cmax', 0.054), ('falsely', 0.054), ('meinshausen', 0.054), ('xp', 0.054), ('exp', 0.053), ('log', 0.052), ('subgradient', 0.052), ('assumptions', 0.048), ('wn', 0.048), ('selection', 0.048), ('deferred', 0.047), ('rn', 0.045), ('march', 0.045), ('remainder', 0.045), ('scoring', 0.043), ('independencies', 0.043), ('population', 0.042), ('recall', 0.042), ('dependence', 0.042), ('discrete', 0.041), ('imposed', 0.041), ('observations', 0.04), ('outline', 0.04), ('pair', 0.04), ('regularized', 0.038), ('main', 0.038), ('sizes', 0.038), ('size', 0.038), ('analogous', 0.038), ('notation', 0.037), ('undirected', 0.036), ('structures', 0.035), ('hold', 0.035), ('proof', 0.034), ('sample', 0.034), ('samples', 0.034), ('berkeley', 0.033), ('devoted', 0.033), ('ps', 0.033), ('nodes', 0.033), ('candidate', 0.032), ('solutions', 0.031), ('version', 0.031), ('st', 0.031), ('establishes', 0.031), ('aa', 0.031), ('convexity', 0.031), ('ease', 0.031), ('degree', 0.03), ('logarithmic', 0.03), ('recovers', 0.03), ('focusing', 0.03), ('carnegie', 0.029), ('mellon', 0.029), ('pittsburgh', 0.029), ('primal', 0.029), ('overly', 0.028), ('minus', 0.028), ('establishing', 0.028), ('edges', 0.028), ('consistent', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 92 nips-2006-High-Dimensional Graphical Model Selection Using $\ell 1$-Regularized Logistic Regression
Author: Martin J. Wainwright, John D. Lafferty, Pradeep K. Ravikumar
Abstract: We focus on the problem of estimating the graph structure associated with a discrete Markov random field. We describe a method based on 1 regularized logistic regression, in which the neighborhood of any given node is estimated by performing logistic regression subject to an 1 -constraint. Our framework applies to the high-dimensional setting, in which both the number of nodes p and maximum neighborhood sizes d are allowed to grow as a function of the number of observations n. Our main result is to establish sufficient conditions on the triple (n, p, d) for the method to succeed in consistently estimating the neighborhood of every node in the graph simultaneously. Under certain mutual incoherence conditions analogous to those imposed in previous work on linear regression, we prove that consistent neighborhood selection can be obtained as long as the number of observations n grows more quickly than 6d6 log d + 2d5 log p, thereby establishing that logarithmic growth in the number of samples n relative to graph size p is sufficient to achieve neighborhood consistency. Keywords: Graphical models; Markov random fields; structure learning; 1 -regularization; model selection; convex risk minimization; high-dimensional asymptotics; concentration. 1
2 0.18815026 20 nips-2006-Active learning for misspecified generalized linear models
Author: Francis R. Bach
Abstract: Active learning refers to algorithmic frameworks aimed at selecting training data points in order to reduce the number of required training data points and/or improve the generalization performance of a learning method. In this paper, we present an asymptotic analysis of active learning for generalized linear models. Our analysis holds under the common practical situation of model misspecification, and is based on realistic assumptions regarding the nature of the sampling distributions, which are usually neither independent nor identical. We derive unbiased estimators of generalization performance, as well as estimators of expected reduction in generalization error after adding a new training data point, that allow us to optimize its sampling distribution through a convex optimization problem. Our analysis naturally leads to an algorithm for sequential active learning which is applicable for all tasks supported by generalized linear models (e.g., binary classification, multi-class classification, regression) and can be applied in non-linear settings through the use of Mercer kernels. 1
3 0.16171485 173 nips-2006-Shifting, One-Inclusion Mistake Bounds and Tight Multiclass Expected Risk Bounds
Author: Benjamin I. Rubinstein, Peter L. Bartlett, J. H. Rubinstein
Abstract: Under the prediction model of learning, a prediction strategy is presented with an i.i.d. sample of n − 1 points in X and corresponding labels from a concept f ∈ F, and aims to minimize the worst-case probability of erring on an nth point. By exploiting the structure of F, Haussler et al. achieved a VC(F)/n bound for the natural one-inclusion prediction strategy, improving on bounds implied by PAC-type results by a O(log n) factor. The key data structure in their result is the natural subgraph of the hypercube—the one-inclusion graph; the key step is a d = VC(F) bound on one-inclusion graph density. The first main result of this n n−1 paper is a density bound of n ≤d−1 / ( ≤d ) < d, which positively resolves a conjecture of Kuzmin & Warmuth relating to their unlabeled Peeling compression scheme and also leads to an improved mistake bound for the randomized (deterministic) one-inclusion strategy for all d (for d ≈ Θ(n)). The proof uses a new form of VC-invariant shifting and a group-theoretic symmetrization. Our second main result is a k-class analogue of the d/n mistake bound, replacing the VC-dimension by the Pollard pseudo-dimension and the one-inclusion strategy by its natural hypergraph generalization. This bound on expected risk improves on known PAC-based results by a factor of O(log n) and is shown to be optimal up to a O(log k) factor. The combinatorial technique of shifting takes a central role in understanding the one-inclusion (hyper)graph and is a running theme throughout. 1
4 0.12858744 69 nips-2006-Distributed Inference in Dynamical Systems
Author: Stanislav Funiak, Carlos Guestrin, Rahul Sukthankar, Mark A. Paskin
Abstract: We present a robust distributed algorithm for approximate probabilistic inference in dynamical systems, such as sensor networks and teams of mobile robots. Using assumed density filtering, the network nodes maintain a tractable representation of the belief state in a distributed fashion. At each time step, the nodes coordinate to condition this distribution on the observations made throughout the network, and to advance this estimate to the next time step. In addition, we identify a significant challenge for probabilistic inference in dynamical systems: message losses or network partitions can cause nodes to have inconsistent beliefs about the current state of the system. We address this problem by developing distributed algorithms that guarantee that nodes will reach an informative consistent distribution when communication is re-established. We present a suite of experimental results on real-world sensor data for two real sensor network deployments: one with 25 cameras and another with 54 temperature sensors. 1
5 0.10859866 188 nips-2006-Temporal and Cross-Subject Probabilistic Models for fMRI Prediction Tasks
Author: Alexis Battle, Gal Chechik, Daphne Koller
Abstract: We present a probabilistic model applied to the fMRI video rating prediction task of the Pittsburgh Brain Activity Interpretation Competition (PBAIC) [2]. Our goal is to predict a time series of subjective, semantic ratings of a movie given functional MRI data acquired during viewing by three subjects. Our method uses conditionally trained Gaussian Markov random fields, which model both the relationships between the subjects’ fMRI voxel measurements and the ratings, as well as the dependencies of the ratings across time steps and between subjects. We also employed non-traditional methods for feature selection and regularization that exploit the spatial structure of voxel activity in the brain. The model displayed good performance in predicting the scored ratings for the three subjects in test data sets, and a variant of this model was the third place entrant to the 2006 PBAIC. 1
6 0.10480902 14 nips-2006-A Small World Threshold for Economic Network Formation
7 0.080322884 35 nips-2006-Approximate inference using planar graph decomposition
8 0.080052949 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization
9 0.079071023 128 nips-2006-Manifold Denoising
10 0.076557145 57 nips-2006-Conditional mean field
11 0.075716548 163 nips-2006-Prediction on a Graph with a Perceptron
12 0.07506986 39 nips-2006-Balanced Graph Matching
13 0.074038848 191 nips-2006-The Robustness-Performance Tradeoff in Markov Decision Processes
14 0.069602266 126 nips-2006-Logistic Regression for Single Trial EEG Classification
15 0.069451451 77 nips-2006-Fast Computation of Graph Kernels
16 0.06674882 169 nips-2006-Relational Learning with Gaussian Processes
17 0.065919451 75 nips-2006-Efficient sparse coding algorithms
18 0.06547299 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation
19 0.065246142 116 nips-2006-Learning from Multiple Sources
20 0.059194319 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming
topicId topicWeight
[(0, -0.212), (1, 0.054), (2, -0.078), (3, -0.021), (4, 0.038), (5, 0.049), (6, 0.049), (7, 0.115), (8, -0.057), (9, -0.071), (10, 0.118), (11, -0.195), (12, 0.119), (13, 0.046), (14, -0.093), (15, -0.057), (16, 0.026), (17, 0.068), (18, -0.004), (19, -0.011), (20, -0.013), (21, 0.019), (22, 0.121), (23, -0.086), (24, -0.029), (25, -0.053), (26, 0.062), (27, 0.065), (28, 0.064), (29, 0.045), (30, 0.052), (31, -0.001), (32, -0.042), (33, -0.125), (34, -0.01), (35, 0.152), (36, -0.032), (37, 0.056), (38, 0.137), (39, -0.247), (40, 0.067), (41, 0.013), (42, 0.047), (43, -0.002), (44, 0.036), (45, -0.014), (46, 0.057), (47, -0.008), (48, -0.074), (49, -0.202)]
simIndex simValue paperId paperTitle
same-paper 1 0.94800013 92 nips-2006-High-Dimensional Graphical Model Selection Using $\ell 1$-Regularized Logistic Regression
Author: Martin J. Wainwright, John D. Lafferty, Pradeep K. Ravikumar
Abstract: We focus on the problem of estimating the graph structure associated with a discrete Markov random field. We describe a method based on 1 regularized logistic regression, in which the neighborhood of any given node is estimated by performing logistic regression subject to an 1 -constraint. Our framework applies to the high-dimensional setting, in which both the number of nodes p and maximum neighborhood sizes d are allowed to grow as a function of the number of observations n. Our main result is to establish sufficient conditions on the triple (n, p, d) for the method to succeed in consistently estimating the neighborhood of every node in the graph simultaneously. Under certain mutual incoherence conditions analogous to those imposed in previous work on linear regression, we prove that consistent neighborhood selection can be obtained as long as the number of observations n grows more quickly than 6d6 log d + 2d5 log p, thereby establishing that logarithmic growth in the number of samples n relative to graph size p is sufficient to achieve neighborhood consistency. Keywords: Graphical models; Markov random fields; structure learning; 1 -regularization; model selection; convex risk minimization; high-dimensional asymptotics; concentration. 1
2 0.67672628 173 nips-2006-Shifting, One-Inclusion Mistake Bounds and Tight Multiclass Expected Risk Bounds
Author: Benjamin I. Rubinstein, Peter L. Bartlett, J. H. Rubinstein
Abstract: Under the prediction model of learning, a prediction strategy is presented with an i.i.d. sample of n − 1 points in X and corresponding labels from a concept f ∈ F, and aims to minimize the worst-case probability of erring on an nth point. By exploiting the structure of F, Haussler et al. achieved a VC(F)/n bound for the natural one-inclusion prediction strategy, improving on bounds implied by PAC-type results by a O(log n) factor. The key data structure in their result is the natural subgraph of the hypercube—the one-inclusion graph; the key step is a d = VC(F) bound on one-inclusion graph density. The first main result of this n n−1 paper is a density bound of n ≤d−1 / ( ≤d ) < d, which positively resolves a conjecture of Kuzmin & Warmuth relating to their unlabeled Peeling compression scheme and also leads to an improved mistake bound for the randomized (deterministic) one-inclusion strategy for all d (for d ≈ Θ(n)). The proof uses a new form of VC-invariant shifting and a group-theoretic symmetrization. Our second main result is a k-class analogue of the d/n mistake bound, replacing the VC-dimension by the Pollard pseudo-dimension and the one-inclusion strategy by its natural hypergraph generalization. This bound on expected risk improves on known PAC-based results by a factor of O(log n) and is shown to be optimal up to a O(log k) factor. The combinatorial technique of shifting takes a central role in understanding the one-inclusion (hyper)graph and is a running theme throughout. 1
3 0.51335704 69 nips-2006-Distributed Inference in Dynamical Systems
Author: Stanislav Funiak, Carlos Guestrin, Rahul Sukthankar, Mark A. Paskin
Abstract: We present a robust distributed algorithm for approximate probabilistic inference in dynamical systems, such as sensor networks and teams of mobile robots. Using assumed density filtering, the network nodes maintain a tractable representation of the belief state in a distributed fashion. At each time step, the nodes coordinate to condition this distribution on the observations made throughout the network, and to advance this estimate to the next time step. In addition, we identify a significant challenge for probabilistic inference in dynamical systems: message losses or network partitions can cause nodes to have inconsistent beliefs about the current state of the system. We address this problem by developing distributed algorithms that guarantee that nodes will reach an informative consistent distribution when communication is re-established. We present a suite of experimental results on real-world sensor data for two real sensor network deployments: one with 25 cameras and another with 54 temperature sensors. 1
4 0.44827762 14 nips-2006-A Small World Threshold for Economic Network Formation
Author: Eyal Even-dar, Michael Kearns
Abstract: We introduce a game-theoretic model for network formation inspired by earlier stochastic models that mix localized and long-distance connectivity. In this model, players may purchase edges at distance d at a cost of dα , and wish to minimize the sum of their edge purchases and their average distance to other players. In this model, we show there is a striking “small world” threshold phenomenon: in two dimensions, if α < 2 then every Nash equilibrium results in a network of constant diameter (independent of network size), and if α > 2 then every Nash equilibrium results in a network whose diameter grows as a root of the network size, and thus is unbounded. We contrast our results with those of Kleinberg [8] in a stochastic model, and empirically investigate the “navigability” of equilibrium networks. Our theoretical results all generalize to higher dimensions. 1
5 0.44268438 20 nips-2006-Active learning for misspecified generalized linear models
Author: Francis R. Bach
Abstract: Active learning refers to algorithmic frameworks aimed at selecting training data points in order to reduce the number of required training data points and/or improve the generalization performance of a learning method. In this paper, we present an asymptotic analysis of active learning for generalized linear models. Our analysis holds under the common practical situation of model misspecification, and is based on realistic assumptions regarding the nature of the sampling distributions, which are usually neither independent nor identical. We derive unbiased estimators of generalization performance, as well as estimators of expected reduction in generalization error after adding a new training data point, that allow us to optimize its sampling distribution through a convex optimization problem. Our analysis naturally leads to an algorithm for sequential active learning which is applicable for all tasks supported by generalized linear models (e.g., binary classification, multi-class classification, regression) and can be applied in non-linear settings through the use of Mercer kernels. 1
6 0.43635803 98 nips-2006-Inferring Network Structure from Co-Occurrences
7 0.4324894 144 nips-2006-Near-Uniform Sampling of Combinatorial Spaces Using XOR Constraints
8 0.43039271 128 nips-2006-Manifold Denoising
9 0.42572147 188 nips-2006-Temporal and Cross-Subject Probabilistic Models for fMRI Prediction Tasks
10 0.41362929 75 nips-2006-Efficient sparse coding algorithms
11 0.39193776 163 nips-2006-Prediction on a Graph with a Perceptron
12 0.38938358 35 nips-2006-Approximate inference using planar graph decomposition
13 0.38909119 123 nips-2006-Learning with Hypergraphs: Clustering, Classification, and Embedding
14 0.38677734 139 nips-2006-Multi-dynamic Bayesian Networks
15 0.3770318 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization
16 0.368559 180 nips-2006-Speakers optimize information density through syntactic reduction
17 0.35488221 77 nips-2006-Fast Computation of Graph Kernels
18 0.34832227 57 nips-2006-Conditional mean field
19 0.33954918 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming
20 0.33548254 201 nips-2006-Using Combinatorial Optimization within Max-Product Belief Propagation
topicId topicWeight
[(1, 0.065), (3, 0.011), (7, 0.066), (9, 0.052), (22, 0.519), (44, 0.08), (57, 0.061), (65, 0.026), (69, 0.015), (71, 0.017)]
simIndex simValue paperId paperTitle
1 0.99262005 164 nips-2006-Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension
Author: Manfred K. Warmuth, Dima Kuzmin
Abstract: We design an on-line algorithm for Principal Component Analysis. In each trial the current instance is projected onto a probabilistically chosen low dimensional subspace. The total expected quadratic approximation error equals the total quadratic approximation error of the best subspace chosen in hindsight plus some additional term that grows linearly in dimension of the subspace but logarithmically in the dimension of the instances. 1
same-paper 2 0.96099693 92 nips-2006-High-Dimensional Graphical Model Selection Using $\ell 1$-Regularized Logistic Regression
Author: Martin J. Wainwright, John D. Lafferty, Pradeep K. Ravikumar
Abstract: We focus on the problem of estimating the graph structure associated with a discrete Markov random field. We describe a method based on 1 regularized logistic regression, in which the neighborhood of any given node is estimated by performing logistic regression subject to an 1 -constraint. Our framework applies to the high-dimensional setting, in which both the number of nodes p and maximum neighborhood sizes d are allowed to grow as a function of the number of observations n. Our main result is to establish sufficient conditions on the triple (n, p, d) for the method to succeed in consistently estimating the neighborhood of every node in the graph simultaneously. Under certain mutual incoherence conditions analogous to those imposed in previous work on linear regression, we prove that consistent neighborhood selection can be obtained as long as the number of observations n grows more quickly than 6d6 log d + 2d5 log p, thereby establishing that logarithmic growth in the number of samples n relative to graph size p is sufficient to achieve neighborhood consistency. Keywords: Graphical models; Markov random fields; structure learning; 1 -regularization; model selection; convex risk minimization; high-dimensional asymptotics; concentration. 1
3 0.92278916 172 nips-2006-Scalable Discriminative Learning for Natural Language Parsing and Translation
Author: Joseph Turian, Benjamin Wellington, I. D. Melamed
Abstract: Parsing and translating natural languages can be viewed as problems of predicting tree structures. For machine learning approaches to these predictions, the diversity and high dimensionality of the structures involved mandate very large training sets. This paper presents a purely discriminative learning method that scales up well to problems of this size. Its accuracy was at least as good as other comparable methods on a standard parsing task. To our knowledge, it is the first purely discriminative learning algorithm for translation with treestructured models. Unlike other popular methods, this method does not require a great deal of feature engineering a priori, because it performs feature selection over a compound feature space as it learns. Experiments demonstrate the method’s versatility, accuracy, and efficiency. Relevant software is freely available at http://nlp.cs.nyu.edu/parser and http://nlp.cs.nyu.edu/GenPar. 1
4 0.9187057 62 nips-2006-Correcting Sample Selection Bias by Unlabeled Data
Author: Jiayuan Huang, Arthur Gretton, Karsten M. Borgwardt, Bernhard Schölkopf, Alex J. Smola
Abstract: We consider the scenario where training and test data are drawn from different distributions, commonly referred to as sample selection bias. Most algorithms for this setting try to first recover sampling distributions and then make appropriate corrections based on the distribution estimate. We present a nonparametric method which directly produces resampling weights without distribution estimation. Our method works by matching distributions between training and testing sets in feature space. Experimental results demonstrate that our method works well in practice.
5 0.7135452 203 nips-2006-implicit Online Learning with Kernels
Author: Li Cheng, Dale Schuurmans, Shaojun Wang, Terry Caelli, S.v.n. Vishwanathan
Abstract: We present two new algorithms for online learning in reproducing kernel Hilbert spaces. Our first algorithm, ILK (implicit online learning with kernels), employs a new, implicit update technique that can be applied to a wide variety of convex loss functions. We then introduce a bounded memory version, SILK (sparse ILK), that maintains a compact representation of the predictor without compromising solution quality, even in non-stationary environments. We prove loss bounds and analyze the convergence rate of both. Experimental evidence shows that our proposed algorithms outperform current methods on synthetic and real data. 1
6 0.69319743 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments
7 0.6749317 126 nips-2006-Logistic Regression for Single Trial EEG Classification
8 0.67128748 61 nips-2006-Convex Repeated Games and Fenchel Duality
9 0.65652657 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy
10 0.64863282 68 nips-2006-Dirichlet-Enhanced Spam Filtering based on Biased Samples
11 0.62919891 6 nips-2006-A Kernel Subspace Method by Stochastic Realization for Learning Nonlinear Dynamical Systems
12 0.62677455 67 nips-2006-Differential Entropic Clustering of Multivariate Gaussians
13 0.62471485 11 nips-2006-A PAC-Bayes Risk Bound for General Loss Functions
14 0.62469155 194 nips-2006-Towards a general independent subspace analysis
15 0.61689746 124 nips-2006-Linearly-solvable Markov decision problems
16 0.61599916 131 nips-2006-Mixture Regression for Covariate Shift
17 0.60873747 5 nips-2006-A Kernel Method for the Two-Sample-Problem
18 0.60638469 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning
19 0.60074162 98 nips-2006-Inferring Network Structure from Co-Occurrences
20 0.60070872 20 nips-2006-Active learning for misspecified generalized linear models