nips nips2011 nips2011-118 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Po-ling Loh, Martin J. Wainwright
Abstract: Although the standard formulations of prediction problems involve fully-observed and noiseless data drawn in an i.i.d. manner, many applications involve noisy and/or missing data, possibly involving dependencies. We study these issues in the context of high-dimensional sparse linear regression, and propose novel estimators for the cases of noisy, missing, and/or dependent data. Many standard approaches to noisy or missing data, such as those using the EM algorithm, lead to optimization problems that are inherently non-convex, and it is difficult to establish theoretical guarantees on practical algorithms. While our approach also involves optimizing non-convex programs, we are able to both analyze the statistical error associated with any global optimum, and prove that a simple projected gradient descent algorithm will converge in polynomial time to a small neighborhood of the set of global minimizers. On the statistical side, we provide non-asymptotic bounds that hold with high probability for the cases of noisy, missing, and/or dependent data. On the computational side, we prove that under the same types of conditions required for statistical consistency, the projected gradient descent algorithm will converge at geometric rates to a near-global minimizer. We illustrate these theoretical predictions with simulations, showing agreement with the predicted scalings. 1
Reference: text
sentIndex sentText sentNum sentScore
1 High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity Martin J. Wainwright [sent-1, score-0.613]
2 Although the standard formulations of prediction problems involve fully-observed and noiseless data drawn in an i.i.d. manner, many applications involve noisy and/or missing data, possibly involving dependencies. [sent-8, score-0.501]
3 We study these issues in the context of high-dimensional sparse linear regression, and propose novel estimators for the cases of noisy, missing, and/or dependent data. [sent-9, score-0.265]
4 Many standard approaches to noisy or missing data, such as those using the EM algorithm, lead to optimization problems that are inherently non-convex, and it is difficult to establish theoretical guarantees on practical algorithms. [sent-10, score-0.543]
5 On the statistical side, we provide non-asymptotic bounds that hold with high probability for the cases of noisy, missing, and/or dependent data. [sent-12, score-0.222]
6 On the computational side, we prove that under the same types of conditions required for statistical consistency, the projected gradient descent algorithm will converge at geometric rates to a near-global minimizer. [sent-13, score-0.524]
7 Consider the problem of modeling the voting behavior of politicians: in this setting, votes may be missing due to abstentions, and temporally dependent due to collusion or “tit-for-tat” behavior. [sent-17, score-0.46]
8 Similarly, surveys often suffer from the missing data problem, since users fail to respond to all questions. [sent-18, score-0.399]
9 Sensor network data also tends to be both noisy due to measurement error, and partially missing due to failures or drop-outs of sensors. [sent-19, score-0.539]
10 There are a variety of methods for dealing with noisy and/or missing data, including various heuristic methods, as well as likelihood-based methods involving the expectation-maximization (EM) algorithm. [sent-20, score-0.501]
11 For instance, in applications of EM, problems in which the negative likelihood is a convex function often become non-convex with missing or noisy data. [sent-24, score-0.501]
12 Consequently, although the EM algorithm will converge to a local minimum, it is difficult to guarantee that the local optimum is close to a global minimum. [sent-25, score-0.244]
13 In this paper, we study these issues in the context of high-dimensional sparse linear regression, in the case when the predictors or covariates are noisy, missing, and/or dependent. [sent-26, score-0.248]
14 The resulting estimators allow us to solve the problem of high-dimensional Gaussian graphical model selection with missing data. [sent-31, score-0.561]
15 There is a large body of work on the problem of corrupted covariates or errors-in-variables for regression problems (see the papers and books [2, 3, 4, 5] and references therein). [sent-32, score-0.325]
16 Most relevant to this paper is more recent work that has examined issues of corrupted and/or missing data in the context of high-dimensional sparse linear models, allowing for n ≪ p. [sent-34, score-0.6]
17 Städler and Bühlmann [6] developed an EM-based method for sparse inverse covariance matrix estimation in the missing-data regime, and used this result to derive an algorithm for sparse linear regression with missing data. [sent-35, score-1.359]
18 As mentioned above, however, it is difficult to guarantee that EM will converge to a point close to a global optimum of the likelihood, in contrast to the methods studied here. [sent-36, score-0.244]
19 Rosenbaum and Tsybakov [7] studied the sparse linear model when the covariates are corrupted by noise, and proposed a modified form of the Dantzig selector, involving a convex program. [sent-37, score-0.305]
20 This convexity produces a computationally attractive method, but the statistical error bounds that they establish scale proportionally with the size of the additive perturbation, hence are often weaker than the bounds that can be proved using our methods. [sent-38, score-0.431]
21 We then introduce the class of estimators we will consider and the form of the projected gradient descent algorithm. [sent-41, score-0.535]
22 Section 3 is devoted to a description of our main results, including a pair of general theorems on the statistical and optimization error, and then a series of corollaries applying our results to the cases of noisy, missing, and dependent data. [sent-42, score-0.231]
23 We then describe a class of projected gradient descent algorithms to be used in the sequel. [sent-50, score-0.428]
24 Rather than directly observing each x_i ∈ R^p, we observe a vector z_i ∈ R^p linked to x_i via some conditional distribution: z_i ∼ Q(· | x_i), for i = 1, 2, . . . , n. (2) [sent-57, score-0.329]
25 This setup allows us to model various types of disturbances to the covariates, including (a) Additive noise: We observe z_i = x_i + w_i, where w_i ∈ R^p is a random vector independent of x_i, say zero-mean with known covariance matrix Σ_w. [sent-61, score-0.427]
26 (b) Missing data: For a fraction ρ ∈ [0, 1), we observe a random vector z_i ∈ R^p such that independently for each component j, we observe z_ij = x_ij with probability 1 − ρ, and z_ij = ∗ with probability ρ. [sent-62, score-0.378]
27 This model can also be generalized to allow for different missing probabilities for each covariate. [sent-63, score-0.399]
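As a concrete illustration, here is a minimal sketch (with hypothetical parameter values; this is not code from the paper) of the two corruption mechanisms described above:

```python
import numpy as np

def corrupt_covariates(X, sigma_w=0.2, rho=0.2, mode="additive", rng=None):
    """Return a corrupted version Z of the covariate matrix X.

    mode="additive": Z = X + W with W having i.i.d. N(0, sigma_w^2) entries,
                     drawn independently of X.
    mode="missing":  each entry of X is zeroed out independently with
                     probability rho (zeros standing in for the '*' symbol).
    The defaults sigma_w and rho are placeholders, not values from the paper."""
    rng = np.random.default_rng() if rng is None else rng
    if mode == "additive":
        return X + sigma_w * rng.standard_normal(X.shape)
    mask = rng.random(X.shape) >= rho   # True = entry observed
    return X * mask
```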
28 2 M-estimators for noisy and missing covariates. We begin by examining a simple deterministic problem. [sent-73, score-0.69]
29 In this paper, we focus on more general instantiations of the programs (4) and (5), involving different choices of the pair (Γ, γ) that are adapted to the cases of noisy and/or missing data. [sent-82, score-0.666]
30 Note that the matrix ΓLas is positive semidefinite, so the Lasso program is convex. [sent-83, score-0.223]
31 In sharp contrast, for the cases of noisy or missing data, the most natural choice of the matrix Γ is not positive semidefinite, hence the loss functions appearing in the problems (4) and (5) are non-convex. [sent-84, score-0.615]
32 Remarkably, we prove that a simple projected gradient descent algorithm still converges with high probability to a vector close to any global optimum in our setting. [sent-86, score-0.627]
33 Suppose we observe the n × p matrix Z = X + W, where W is a random matrix independent of X, with rows w_i drawn i.i.d. from a zero-mean distribution with known covariance matrix Σ_w. [sent-88, score-0.362]
34 Indeed, since the matrix (1/n) Z^T Z has rank at most n, the subtracted matrix Σ_w may cause Γ_add to have a large number of negative eigenvalues. [sent-95, score-0.268]
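The display defining these additive-noise surrogates is not reproduced in this extraction; a reconstruction consistent with the surrounding discussion (an assumption, not a verbatim quote of the paper's equation) is

$$\widehat{\Gamma}_{\mathrm{add}} \;=\; \frac{1}{n}Z^{T}Z \;-\; \Sigma_w, \qquad \widehat{\gamma}_{\mathrm{add}} \;=\; \frac{1}{n}Z^{T}y,$$

which removes the bias E[(1/n)Z^T Z] = Σ_x + Σ_w introduced by the zero-mean measurement noise.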
35 Suppose each entry of X is missing independently with probability ρ ∈ [0, 1), and we observe the matrix Z ∈ R^{n×p} with entries Z_ij = X_ij with probability 1 − ρ, and Z_ij = 0 otherwise. [sent-97, score-0.609]
36 It is easy to see that the pair (Γmis , γmis ) reduces to the pair (ΓLas , γLas ) for the standard Lasso when ρ = 0, corresponding to no missing data. [sent-99, score-0.477]
37 In the more interesting case when ρ ∈ (0, 1), the matrix Z̃^T Z̃ / n in equation (8) has rank at most n, so the subtracted diagonal matrix may cause the matrix Γ_mis to have a large number of negative eigenvalues when n ≪ p, and the associated quadratic function is not convex. [sent-100, score-0.382]
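Likewise, a plausible reconstruction of the missing-data surrogates referenced as equation (8) (again an assumption based on the surrounding text, with Z̃ denoting the observed matrix rescaled by 1/(1 − ρ)) is

$$\widetilde{Z} := \frac{Z}{1-\rho}, \qquad \widehat{\Gamma}_{\mathrm{mis}} \;=\; \frac{\widetilde{Z}^{T}\widetilde{Z}}{n} \;-\; \rho\,\mathrm{diag}\!\left(\frac{\widetilde{Z}^{T}\widetilde{Z}}{n}\right), \qquad \widehat{\gamma}_{\mathrm{mis}} \;=\; \frac{1}{n}\widetilde{Z}^{T}y.$$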
38 When the covariate matrix X is fully observed (so that the Lasso can be applied), it is well understood that a sufficient condition for ℓ2-recovery is that the matrix Γ_Las = (1/n) X^T X satisfy a restricted eigenvalue (RE) condition. [sent-104, score-0.565]
39 The matrix Γ satisfies a lower restricted eigenvalue condition with curvature α_ℓ > 0 and tolerance τ_ℓ(n, p) > 0 if θ^T Γ θ ≥ α_ℓ ‖θ‖_2^2 − τ_ℓ(n, p) ‖θ‖_1^2 for all θ ∈ R^p. (9) [sent-108, score-0.364]
40 Moreover, it is known that for various random choices of the design matrix X, the Lasso matrix Γ_Las will satisfy such an RE condition with high probability. [sent-110, score-0.312]
41 The matrix Γ satisfies an upper restricted eigenvalue condition with smoothness α_u > 0 and tolerance τ_u(n, p) > 0 if θ^T Γ θ ≤ α_u ‖θ‖_2^2 + τ_u(n, p) ‖θ‖_1^2 for all θ ∈ R^p. (10) [sent-114, score-0.364]
42 In recent work on high-dimensional projected gradient descent, Agarwal et al. [sent-115, score-0.289]
43 4 Projected gradient descent. In addition to proving results about the global minima of programs (4) and (5), we are also interested in polynomial-time procedures for approximating such optima. [sent-118, score-0.425]
44 We show that the simple projected gradient descent algorithm can be used to solve the program (4). [sent-119, score-0.537]
45 Our analysis shows that under a reasonable set of conditions, the iterates for the family of programs (4) converge to a point extremely close to any global optimum in both ℓ1-norm and ℓ2-norm, even for the non-convex program. [sent-123, score-0.353]
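To make the procedure concrete, here is a minimal sketch (not the authors' implementation) of projected gradient descent on the quadratic objective β ↦ ½ β^T Γ β − γ^T β over an ℓ1-ball. The surrogate pair (Gamma, gamma), the ball radius (of the order b_0 √k for the constrained program), the stepsize (e.g., 1/(2α_u)), and the iteration count are all assumptions to be set as the theory dictates.

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of v onto the l1-ball of the given radius,
    via the standard sort-and-threshold construction."""
    if np.abs(v).sum() <= radius:
        return v
    u = np.sort(np.abs(v))[::-1]                  # sorted magnitudes, descending
    css = np.cumsum(u)
    k = np.nonzero(u * np.arange(1, len(v) + 1) > css - radius)[0][-1]
    tau = (css[k] - radius) / (k + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def projected_gradient(Gamma, gamma, radius, stepsize, n_iter=500):
    """Iterate beta <- Proj_{l1-ball}( beta - stepsize * (Gamma beta - gamma) ).
    Gamma may be indefinite, as with the noisy or missing-data surrogates."""
    beta = np.zeros(Gamma.shape[0])
    for _ in range(n_iter):
        grad = Gamma @ beta - gamma               # gradient of the quadratic loss
        beta = project_l1_ball(beta - stepsize * grad, radius)
    return beta
```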
46 1 Statistical error. In controlling the statistical error, we assume that the matrix Γ satisfies a lower-RE condition with curvature α_ℓ and tolerance τ_ℓ(n, p), as previously defined in (9). [sent-126, score-0.379]
47 In addition, recall that the matrix Γ and vector γ serve as surrogates to the deterministic quantities Σx ∈ Rp×p and Σx β ∗ ∈ Rp , respectively. [sent-127, score-0.253]
48 The following result applies to any global optimum β̂ of the program (12) with λ_n ≥ 4 ϕ(Q, σ_ε) √((log p)/n): Theorem 1 (Statistical error). [sent-129, score-0.399]
49 Suppose the surrogates (Γ, γ) satisfy the deviation bounds (13), and the matrix Γ satisfies the lower-RE condition (9) with parameters (α_ℓ, τ_ℓ) such that √k · τ_ℓ(n, p) ≤ min{ (ϕ(Q, σ_ε)/(2 b_0)) √((log p)/n), α_ℓ/(128 √k) }. (14) [sent-130, score-0.499]
50 Then for any vector β* with sparsity at most k, there is a universal positive constant c_0 such that any global optimum β̂ satisfies the bounds ‖β̂ − β*‖_2 ≤ (c_0 √k / α_ℓ) max{ ϕ(Q, σ_ε) √((log p)/n), λ_n } (15a) and ‖β̂ − β*‖_1 ≤ (8 c_0 k / α_ℓ) max{ ϕ(Q, σ_ε) √((log p)/n), λ_n } (15b). [sent-131, score-0.479]
51 2 Optimization error. Although Theorem 1 provides guarantees that hold uniformly for any choice of global minimizer, it does not provide any guidance on how to approximate such a global minimizer using a polynomial-time algorithm. [sent-138, score-0.355]
52 Nonetheless, we are able to show that for the family of programs (4), under reasonable conditions on Γ satisfied in various settings, a simple projected gradient method will converge geometrically fast to a very good approximation of any global optimum. [sent-139, score-0.511]
53 Suppose that the surrogate matrix Γ satisfies the lower-RE (9) and upper-RE (10) conditions with τ_u, τ_ℓ of order (log p)/n, and that we apply projected gradient descent (11) with constant stepsize η = 1/(2α_u). [sent-142, score-0.683]
54 Note that the bound (16) controls the ℓ2-distance between the iterate β^t at time t, which is easily computed in polynomial time, and any global optimum β̂ of the program (4), which may be difficult to compute. [sent-151, score-0.308]
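The displays (16) and (17) themselves are not reproduced in this extraction; a generic shape for a guarantee of this kind (an assumed form given only for orientation, not a verbatim restatement of the theorem) is

$$\|\beta^{t}-\widehat{\beta}\|_{2}^{2} \;\le\; \gamma^{t}\,\|\beta^{0}-\widehat{\beta}\|_{2}^{2} \;+\; \frac{c}{1-\gamma}\,\frac{\log p}{n}\,\|\widehat{\beta}-\beta^{*}\|_{1}^{2}, \qquad \gamma\in(0,1),$$

so the iterates contract geometrically up to a term of the order of the statistical precision.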
55 The ℓ2- and ℓ1-optimization errors are bounded as O(√((k log p)/n)) and O(k √((log p)/n)), respectively. [sent-154, score-0.266]
56 We say that a random matrix X ∈ R^{n×p} is sub-Gaussian with parameters (Σ, σ²) if each row x_i^T ∈ R^p is sampled independently from a zero-mean distribution with covariance Σ, and for any unit vector u ∈ R^p, the random variable u^T x_i is sub-Gaussian with parameter at most σ. [sent-158, score-0.26]
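One standard way to make "sub-Gaussian with parameter at most σ" precise (a common convention assumed here, not a quotation from the paper) is through the moment generating function of the zero-mean projection u^T x_i:

$$\mathbb{E}\big[\exp\big(\lambda\,u^{T}x_{i}\big)\big] \;\le\; \exp\!\big(\lambda^{2}\sigma^{2}/2\big) \qquad \text{for all } \lambda\in\mathbb{R} \text{ and all unit vectors } u\in\mathbb{R}^{p}.$$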
57 Then for the M-estimator based on the surrogates (Γ_add, γ_add), the results of Theorems 1 and 2 hold with parameters α_ℓ = (1/2) λ_min(Σ_x) and ϕ(Q, σ_ε) = c_0 ( σ_x² + σ_w² + σ_ε √(σ_x² + σ_w²) ), with probability at least 1 − c_1 exp(−c_2 log p). [sent-165, score-0.243]
58 For i.i.d. samples with missing data, we have the following: Corollary 2. [sent-169, score-0.399]
59 Suppose X ∈ R^{n×p} is a sub-Gaussian matrix with parameters (Σ_x, σ_x²), and Z is the missing-data matrix with parameter ρ. [sent-170, score-0.627]
60 If n ≳ max{ σ_x⁴ / ((1−ρ)⁴ λ_min²(Σ_x)), 1 } · k log p, then Theorems 1 and 2 hold with probability at least 1 − c_1 exp(−c_2 log p) for α_ℓ = (1/2) λ_min(Σ_x) and ϕ(Q, σ_ε) = c_0 (σ_x/(1−ρ)) ( σ_x/(1−ρ) + σ_ε ). [sent-171, score-0.283]
61 The rows of X follow the recursion x_{i+1} = A x_i + v_i, for i = 1, . . . , n − 1 (18), where v_i ∈ R^p is a zero-mean noise vector with covariance matrix Σ_v, and A ∈ R^{p×p} is a driving matrix with spectral norm |||A|||_2 < 1. [sent-175, score-0.551]
62 Corollary 3 corresponds to the case of additive noise for a Gaussian VAR process. [sent-177, score-0.289]
63 A similar result can be derived in the missing data setting. [sent-178, score-0.399]
64 Suppose the rows of X are drawn according to a Gaussian VAR process with driving matrix A. [sent-180, score-0.251]
65 If n ≳ max{ ζ⁴/λ_min²(Σ_x), 1 } · k log p, where ζ² = |||Σ_w|||_op + 2|||Σ_x|||_op / (1 − |||A|||_op), then Theorems 1 and 2 hold with probability at least 1 − c_1 exp(−c_2 log p) for α_ℓ = (1/2) λ_min(Σ_x) and ϕ(Q, σ_ε) = c_0 (σ_ε ζ + ζ²). [sent-185, score-0.283]
66 4 Application to graphical model inverse covariance estimation. The problem of inverse covariance estimation for a Gaussian graphical model is closely related to the Lasso. [sent-187, score-0.44]
67 Meinshausen and Bühlmann [20] prescribed a way to recover the support of the precision matrix Θ when each column of Θ is k-sparse, via linear regression and the Lasso. [sent-188, score-0.234]
68 Defining a_j := −(Σ_jj − Σ_j,−j θ^j)^{−1}, we have Θ_j,−j = a_j θ^j. [sent-202, score-0.252]
69 Our algorithm estimates θj and aj for each j and combines the estimates to obtain Θj,−j = aj θj . [sent-203, score-0.252]
70 In the additive noise case, we observe Z = X + W . [sent-204, score-0.336]
71 Hence, our results on covariates with additive noise produce an estimate of θ^j by solving the program (4) or (12) with the pair (Γ^(j), γ^(j)) = ( Σ̂_{−j,−j}, (1/n) (Z^{−j})^T Z^j ), where Σ̂ = (1/n) Z^T Z − Σ_w. [sent-210, score-0.588]
72 (2) Estimate the scalars a_j using â_j := −(Σ̂_jj − Σ̂_j,−j θ̂^j)^{−1}. [sent-215, score-0.252]
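A minimal sketch of the combination step follows, assuming nodewise estimates theta_hat[j] and a covariance surrogate Sigma_hat have already been computed; the diagonal rule Θ_jj = −a_j and the final symmetrization are standard choices assumed here, not necessarily the paper's exact recipe.

```python
import numpy as np

def combine_neighborhood_estimates(Sigma_hat, theta_hat):
    """Assemble a precision-matrix estimate Theta from nodewise regression
    vectors theta_hat[j] (each of length p-1) and a covariance surrogate
    Sigma_hat, using Theta_{j,-j} = a_j * theta^j."""
    p = Sigma_hat.shape[0]
    Theta = np.zeros((p, p))
    for j in range(p):
        idx = [i for i in range(p) if i != j]
        a_j = -1.0 / (Sigma_hat[j, j] - Sigma_hat[j, idx] @ theta_hat[j])
        Theta[j, j] = -a_j                 # diagonal entry (Gaussian case: Theta_jj = -a_j)
        Theta[j, idx] = a_j * theta_hat[j]
    return 0.5 * (Theta + Theta.T)         # symmetrize; one simple choice among several
```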
73 Suppose the columns of the matrix Θ are k-sparse, and suppose the condition number κ(Θ) is nonzero and finite. [sent-220, score-0.283]
74 Suppose the deviation conditions ‖γ^(j) − Σ_{−j,−j} θ^j‖_∞ ≤ ϕ(Q, σ_ε) √((log p)/n) and ‖(Γ^(j) − Σ_{−j,−j}) θ^j‖_∞ ≤ ϕ(Q, σ_ε) √((log p)/n) (20) hold for all j, and suppose we have the following additional deviation condition on Σ̂: ‖Σ̂ − Σ‖_max ≤ ϕ(Q, σ_ε) √((log p)/n). [sent-221, score-0.493]
75 In Figure 1, we plot the results of simulations under the additive noise model described in Example 1, using Σ_x = I and Σ_w = σ_w² I with σ_w = 0. [sent-226, score-0.415]
76 If we plot the ℓ2-error versus the rescaled sample size n/(k log p), as depicted in panel (b), the curves roughly align for different values of p, agreeing with Theorem 1. [sent-230, score-0.524]
77 Panel (c) shows analogous curves for VAR data with additive noise, using a driving matrix A with |||A|||op = 0. [sent-231, score-0.342]
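Putting the pieces together, here is a hypothetical reproduction sketch of an Example-1 style simulation; all dimensions and noise levels are placeholders, and projected_gradient refers to the sketch given earlier rather than to the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 512                                 # placeholder problem sizes
k = int(np.sqrt(p))                             # sparsity k ~ sqrt(p), as in the plots
sigma_w, sigma_eps = 0.2, 0.5                   # placeholder noise levels

beta_star = np.zeros(p)
beta_star[:k] = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)

X = rng.standard_normal((n, p))                 # Sigma_x = I
Z = X + sigma_w * rng.standard_normal((n, p))   # additive covariate noise
y = X @ beta_star + sigma_eps * rng.standard_normal(n)

# Additive-noise surrogates (see the reconstruction above).
Gamma_add = Z.T @ Z / n - sigma_w**2 * np.eye(p)
gamma_add = Z.T @ y / n

beta_hat = projected_gradient(Gamma_add, gamma_add,
                              radius=np.abs(beta_star).sum(),  # oracle radius, for illustration only
                              stepsize=0.05, n_iter=1000)      # placeholder stepsize
print("l2 error:", np.linalg.norm(beta_hat - beta_star))
```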
78 Plots of the error ‖β̂ − β*‖_2 after running projected gradient descent on the non-convex objective, with sparsity k ≈ √p. [sent-267, score-0.512]
79 Plot (a) shows results for i.i.d. data with additive noise, and plot (b) shows ℓ2-error versus the rescaled sample size n/(k log p). [sent-271, score-0.585]
80 Plot (c) depicts a similar (rescaled) plot for VAR data with additive noise. [sent-272, score-0.262]
81 As predicted by Theorem 1, the curves align for different values of p in the rescaled plot. [sent-273, score-0.293]
82 Figure 2 shows the results of applying projected gradient descent to solve the optimization problem (4) in the cases of additive noise and missing data. [sent-275, score-1.116]
83 We first applied projected gradient descent to obtain an initial estimate β̂, then reapplied projected gradient descent 10 times, tracking the optimization error ‖β^t − β̂‖_2 (in blue) and the statistical error ‖β^t − β*‖_2 (in red). [sent-276, score-0.936]
84 Finally, we simulated the inverse covariance matrix estimation algorithm on three types of graphical models: (a) Chain-structured graphs. [sent-278, score-0.334]
85 Then δ is chosen so Θ = B + δI has condition number p, and Θ is rescaled so |||Θ|||op = 1. [sent-294, score-0.277]
86 [Figure 2 panel titles: log error plot, additive noise case; log error plot, missing data case.] [sent-295, score-0.856]
87 Plots of the optimization error log(‖β^t − β̂‖_2) and statistical error log(‖β^t − β*‖_2) versus iteration number t, generated by running projected gradient descent on the non-convex objective. [sent-306, score-0.686]
88 Given i.i.d. samples from the appropriate graphical model, with covariance matrix Σ_x = Θ^{−1}, we generated the corrupted matrix Z = X + W with Σ_w = (0. [sent-311, score-0.491]
89 Figure 3 shows the rescaled error (1/√k) |||Θ̂ − Θ|||_op plotted against the sample size n for a chain-structured graph, with panel (a) showing the original plot and panel (b) plotting against the rescaled sample size. [sent-313, score-0.596]
90 We obtained qualitatively similar results for the star and Erdős–Rényi graphs, in the presence of missing and/or dependent data. [sent-314, score-0.46]
91 [Figure 3 axis label: 1/sqrt(k) times the l2 operator norm error, on both panels.] [sent-328, score-0.284]
92 [Figure 3 panels: (a) error plot for chain graph, additive noise; (b) rescaled plot, x-axis n/(k log p).] Figure 3. [sent-329, score-0.825]
93 (a) Plots of the rescaled error (1/√k) |||Θ̂ − Θ|||_op after running projected gradient descent for a chain-structured Gaussian graphical model with additive noise. [sent-330, score-0.938]
94 As predicted by Theorems 1 and 2, all curves align when the rescaled error is plotted against the ratio n/(k log p), as shown in (b). [sent-331, score-0.468]
95 5 Discussion. In this paper, we formulated an ℓ1-constrained minimization problem for sparse linear regression on corrupted data. [sent-333, score-0.224]
96 The source of corruption may be additive noise or missing data, and although the resulting objective is not generally convex, we showed that projected gradient descent is guaranteed to converge to a point within statistical precision of the optimum. [sent-334, score-1.259]
97 Finally, we used our results on linear regression to perform sparse inverse covariance estimation for a Gaussian graphical model, based on corrupted data. [sent-339, score-0.444]
98 The bounds we obtain for the spectral norm of the error are of the same order as existing bounds for inverse covariance matrix estimation with uncorrupted, i.i.d. data. [sent-340, score-0.539]
99 High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. [sent-396, score-0.613]
100 Fast global convergence of gradient methods for high-dimensional statistical recovery. [sent-441, score-0.249]
wordName wordTfidf (topN-words)
[('missing', 0.399), ('las', 0.252), ('rescaled', 0.193), ('projected', 0.18), ('rp', 0.18), ('additive', 0.178), ('op', 0.17), ('mis', 0.156), ('covariates', 0.151), ('descent', 0.139), ('var', 0.132), ('lasso', 0.131), ('aj', 0.126), ('matrix', 0.114), ('noise', 0.111), ('optimum', 0.11), ('program', 0.109), ('gradient', 0.109), ('estimators', 0.107), ('corrupted', 0.104), ('covariance', 0.104), ('zij', 0.103), ('noisy', 0.102), ('surrogates', 0.101), ('log', 0.091), ('global', 0.089), ('programs', 0.088), ('suppose', 0.085), ('condition', 0.084), ('error', 0.084), ('plot', 0.084), ('theorems', 0.08), ('zi', 0.078), ('berkeley', 0.075), ('eigenvalue', 0.07), ('regression', 0.07), ('iterates', 0.066), ('autoregressive', 0.064), ('panel', 0.063), ('dler', 0.062), ('loh', 0.062), ('rosenbaum', 0.062), ('dependent', 0.061), ('inverse', 0.061), ('jj', 0.06), ('meinshausen', 0.06), ('dantzig', 0.06), ('bounds', 0.059), ('norm', 0.058), ('rn', 0.056), ('negahban', 0.056), ('graphical', 0.055), ('rothman', 0.055), ('align', 0.054), ('duchi', 0.054), ('corollary', 0.053), ('hold', 0.051), ('statistical', 0.051), ('driving', 0.05), ('stepsize', 0.05), ('hlmann', 0.05), ('sparse', 0.05), ('min', 0.05), ('radius', 0.05), ('restricted', 0.05), ('em', 0.049), ('covariate', 0.049), ('entries', 0.049), ('add', 0.049), ('rows', 0.047), ('corruption', 0.047), ('carroll', 0.047), ('observe', 0.047), ('issues', 0.047), ('tolerance', 0.046), ('predicted', 0.046), ('consequences', 0.045), ('zt', 0.045), ('converge', 0.045), ('stat', 0.045), ('gaussian', 0.044), ('satis', 0.044), ('agarwal', 0.043), ('erd', 0.043), ('simulations', 0.042), ('plots', 0.042), ('xi', 0.042), ('guarantees', 0.042), ('selector', 0.041), ('drawn', 0.04), ('subtracted', 0.04), ('pl', 0.04), ('semide', 0.04), ('wainwright', 0.039), ('universal', 0.039), ('versus', 0.039), ('pair', 0.039), ('instantiations', 0.038), ('measurement', 0.038), ('deterministic', 0.038)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
Author: Po-ling Loh, Martin J. Wainwright
Abstract: Although the standard formulations of prediction problems involve fully-observed and noiseless data drawn in an i.i.d. manner, many applications involve noisy and/or missing data, possibly involving dependencies. We study these issues in the context of high-dimensional sparse linear regression, and propose novel estimators for the cases of noisy, missing, and/or dependent data. Many standard approaches to noisy or missing data, such as those using the EM algorithm, lead to optimization problems that are inherently non-convex, and it is difficult to establish theoretical guarantees on practical algorithms. While our approach also involves optimizing non-convex programs, we are able to both analyze the statistical error associated with any global optimum, and prove that a simple projected gradient descent algorithm will converge in polynomial time to a small neighborhood of the set of global minimizers. On the statistical side, we provide non-asymptotic bounds that hold with high probability for the cases of noisy, missing, and/or dependent data. On the computational side, we prove that under the same types of conditions required for statistical consistency, the projected gradient descent algorithm will converge at geometric rates to a near-global minimizer. We illustrate these theoretical predictions with simulations, showing agreement with the predicted scalings. 1
2 0.26884288 239 nips-2011-Robust Lasso with missing and grossly corrupted observations
Author: Nasser M. Nasrabadi, Trac D. Tran, Nam Nguyen
Abstract: This paper studies the problem of accurately recovering a sparse vector β from highly corrupted linear measurements y = Xβ + e + w where e is a sparse error vector whose nonzero entries may be unbounded and w is a bounded noise. We propose a so-called extended Lasso optimization which takes into consideration sparse prior information of both β and e. Our first result shows that the extended Lasso can faithfully recover both the regression and the corruption vectors. Our analysis relies on a notion of extended restricted eigenvalue for the design matrix X. Our second set of results applies to a general class of Gaussian design matrices X with i.i.d. rows N (0, Σ), for which we provide a surprising phenomenon: the extended Lasso can recover exact signed supports of both β and e from only Ω(k log p log n) observations, even when the fraction of corruption is arbitrarily close to one. Our analysis also shows that this number of observations required to achieve exact signed support is optimal. 1
3 0.18344879 289 nips-2011-Trace Lasso: a trace norm regularization for correlated designs
Author: Edouard Grave, Guillaume R. Obozinski, Francis R. Bach
Abstract: Using the ℓ1-norm to regularize the estimation of the parameter vector of a linear model leads to an unstable estimator when covariates are highly correlated. In this paper, we introduce a new penalty function which takes into account the correlation of the design matrix to stabilize the estimation. This norm, called the trace Lasso, uses the trace norm of the selected covariates, which is a convex surrogate of their rank, as the criterion of model complexity. We analyze the properties of our norm, describe an optimization algorithm based on reweighted least-squares, and illustrate the behavior of this norm on synthetic data, showing that it is more adapted to strong correlations than competing methods such as the elastic net. 1
4 0.14118513 258 nips-2011-Sparse Bayesian Multi-Task Learning
Author: Shengbo Guo, Onno Zoeter, Cédric Archambeau
Abstract: We propose a new sparse Bayesian model for multi-task regression and classification. The model is able to capture correlations between tasks, or more specifically a low-rank approximation of the covariance matrix, while being sparse in the features. We introduce a general family of group sparsity inducing priors based on matrix-variate Gaussian scale mixtures. We show the amount of sparsity can be learnt from the data by combining an approximate inference approach with type II maximum likelihood estimation of the hyperparameters. Empirical evaluations on data sets from biology and vision demonstrate the applicability of the model, where on both regression and classification tasks it achieves competitive predictive performance compared to previously proposed methods. 1
5 0.13479918 195 nips-2011-On Learning Discrete Graphical Models using Greedy Methods
Author: Ali Jalali, Christopher C. Johnson, Pradeep K. Ravikumar
Abstract: In this paper, we address the problem of learning the structure of a pairwise graphical model from samples in a high-dimensional setting. Our first main result studies the sparsistency, or consistency in sparsity pattern recovery, properties of a forward-backward greedy algorithm as applied to general statistical models. As a special case, we then apply this algorithm to learn the structure of a discrete graphical model via neighborhood estimation. As a corollary of our general result, we derive sufficient conditions on the number of samples n, the maximum nodedegree d and the problem size p, as well as other conditions on the model parameters, so that the algorithm recovers all the edges with high probability. Our result guarantees graph selection for samples scaling as n = Ω(d2 log(p)), in contrast to existing convex-optimization based algorithms that require a sample complexity of Ω(d3 log(p)). Further, the greedy algorithm only requires a restricted strong convexity condition which is typically milder than irrepresentability assumptions. We corroborate these results using numerical simulations at the end.
6 0.13094494 262 nips-2011-Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation
7 0.12439222 213 nips-2011-Phase transition in the family of p-resistances
8 0.11223302 5 nips-2011-A Denoising View of Matrix Completion
9 0.10819006 182 nips-2011-Nearest Neighbor based Greedy Coordinate Descent
10 0.10719454 65 nips-2011-Convergent Fitted Value Iteration with Linear Function Approximation
11 0.10684612 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data
12 0.10551783 251 nips-2011-Shaping Level Sets with Submodular Functions
13 0.10384889 271 nips-2011-Statistical Tests for Optimization Efficiency
14 0.10056291 39 nips-2011-Approximating Semidefinite Programs in Sublinear Time
15 0.098979294 259 nips-2011-Sparse Estimation with Structured Dictionaries
16 0.096945219 165 nips-2011-Matrix Completion for Multi-label Image Classification
17 0.09682183 203 nips-2011-On the accuracy of l1-filtering of signals with block-sparse structure
18 0.096799582 9 nips-2011-A More Powerful Two-Sample Test in High Dimensions using Random Projection
19 0.096302956 186 nips-2011-Noise Thresholds for Spectral Clustering
20 0.096126862 265 nips-2011-Sparse recovery by thresholded non-negative least squares
topicId topicWeight
[(0, 0.283), (1, -0.03), (2, -0.053), (3, -0.252), (4, -0.14), (5, 0.056), (6, -0.0), (7, 0.039), (8, 0.092), (9, 0.101), (10, 0.115), (11, -0.049), (12, -0.107), (13, -0.021), (14, 0.011), (15, 0.084), (16, 0.098), (17, -0.024), (18, 0.003), (19, -0.054), (20, -0.018), (21, 0.074), (22, -0.023), (23, 0.083), (24, -0.029), (25, -0.03), (26, -0.001), (27, -0.001), (28, 0.059), (29, -0.024), (30, 0.075), (31, -0.022), (32, 0.13), (33, 0.046), (34, 0.035), (35, -0.047), (36, 0.113), (37, 0.129), (38, -0.056), (39, -0.014), (40, -0.083), (41, 0.021), (42, -0.078), (43, 0.023), (44, 0.006), (45, 0.01), (46, -0.005), (47, -0.131), (48, -0.036), (49, -0.064)]
simIndex simValue paperId paperTitle
same-paper 1 0.96674848 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
Author: Po-ling Loh, Martin J. Wainwright
Abstract: Although the standard formulations of prediction problems involve fully-observed and noiseless data drawn in an i.i.d. manner, many applications involve noisy and/or missing data, possibly involving dependencies. We study these issues in the context of high-dimensional sparse linear regression, and propose novel estimators for the cases of noisy, missing, and/or dependent data. Many standard approaches to noisy or missing data, such as those using the EM algorithm, lead to optimization problems that are inherently non-convex, and it is difficult to establish theoretical guarantees on practical algorithms. While our approach also involves optimizing non-convex programs, we are able to both analyze the statistical error associated with any global optimum, and prove that a simple projected gradient descent algorithm will converge in polynomial time to a small neighborhood of the set of global minimizers. On the statistical side, we provide non-asymptotic bounds that hold with high probability for the cases of noisy, missing, and/or dependent data. On the computational side, we prove that under the same types of conditions required for statistical consistency, the projected gradient descent algorithm will converge at geometric rates to a near-global minimizer. We illustrate these theoretical predictions with simulations, showing agreement with the predicted scalings. 1
2 0.85010833 239 nips-2011-Robust Lasso with missing and grossly corrupted observations
Author: Nasser M. Nasrabadi, Trac D. Tran, Nam Nguyen
Abstract: This paper studies the problem of accurately recovering a sparse vector β from highly corrupted linear measurements y = Xβ + e + w where e is a sparse error vector whose nonzero entries may be unbounded and w is a bounded noise. We propose a so-called extended Lasso optimization which takes into consideration sparse prior information of both β and e. Our first result shows that the extended Lasso can faithfully recover both the regression and the corruption vectors. Our analysis relies on a notion of extended restricted eigenvalue for the design matrix X. Our second set of results applies to a general class of Gaussian design matrices X with i.i.d. rows N (0, Σ), for which we provide a surprising phenomenon: the extended Lasso can recover exact signed supports of both β and e from only Ω(k log p log n) observations, even when the fraction of corruption is arbitrarily close to one. Our analysis also shows that this number of observations required to achieve exact signed support is optimal. 1
3 0.6830824 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data
Author: Tzu-kuo Huang, Jeff G. Schneider
Abstract: Vector Auto-regressive models (VAR) are useful tools for analyzing time series data. In quite a few modern time series modelling tasks, the collection of reliable time series turns out to be a major challenge, either due to the slow progression of the dynamic process of interest, or inaccessibility of repetitive measurements of the same dynamic process over time. In those situations, however, we observe that it is often easier to collect a large amount of non-sequence samples, or snapshots of the dynamic process of interest. In this work, we assume a small amount of time series data are available, and propose methods to incorporate non-sequence data into penalized least-square estimation of VAR models. We consider non-sequence data as samples drawn from the stationary distribution of the underlying VAR model, and devise a novel penalization scheme based on the Lyapunov equation concerning the covariance of the stationary distribution. Experiments on synthetic and video data demonstrate the effectiveness of the proposed methods. 1
4 0.67397678 172 nips-2011-Minimax Localization of Structural Information in Large Noisy Matrices
Author: Mladen Kolar, Sivaraman Balakrishnan, Alessandro Rinaldo, Aarti Singh
Abstract: We consider the problem of identifying a sparse set of relevant columns and rows in a large data matrix with highly corrupted entries. This problem of identifying groups from a collection of bipartite variables such as proteins and drugs, biological species and gene sequences, malware and signatures, etc is commonly referred to as biclustering or co-clustering. Despite its great practical relevance, and although several ad-hoc methods are available for biclustering, theoretical analysis of the problem is largely non-existent. The problem we consider is also closely related to structured multiple hypothesis testing, an area of statistics that has recently witnessed a flurry of activity. We make the following contributions 1. We prove lower bounds on the minimum signal strength needed for successful recovery of a bicluster as a function of the noise variance, size of the matrix and bicluster of interest. 2. We show that a combinatorial procedure based on the scan statistic achieves this optimal limit. 3. We characterize the SNR required by several computationally tractable procedures for biclustering including element-wise thresholding, column/row average thresholding and a convex relaxation approach to sparse singular vector decomposition. 1
5 0.66855049 69 nips-2011-Differentially Private M-Estimators
Author: Jing Lei
Abstract: This paper studies privacy preserving M-estimators using perturbed histograms. The proposed approach allows the release of a wide class of M-estimators with both differential privacy and statistical utility without knowing a priori the particular inference procedure. The performance of the proposed method is demonstrated through a careful study of the convergence rates. A practical algorithm is given and applied on a real world data set containing both continuous and categorical variables. 1
6 0.66850919 65 nips-2011-Convergent Fitted Value Iteration with Linear Function Approximation
7 0.6643182 262 nips-2011-Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation
8 0.65479863 83 nips-2011-Efficient inference in matrix-variate Gaussian models with \iid observation noise
9 0.6477387 289 nips-2011-Trace Lasso: a trace norm regularization for correlated designs
10 0.63299364 265 nips-2011-Sparse recovery by thresholded non-negative least squares
11 0.61793661 195 nips-2011-On Learning Discrete Graphical Models using Greedy Methods
12 0.61092579 236 nips-2011-Regularized Laplacian Estimation and Fast Eigenvector Approximation
13 0.5821901 9 nips-2011-A More Powerful Two-Sample Test in High Dimensions using Random Projection
14 0.56987733 73 nips-2011-Divide-and-Conquer Matrix Factorization
15 0.56459963 203 nips-2011-On the accuracy of l1-filtering of signals with block-sparse structure
16 0.56298399 213 nips-2011-Phase transition in the family of p-resistances
17 0.53361928 5 nips-2011-A Denoising View of Matrix Completion
18 0.53238434 108 nips-2011-Greedy Algorithms for Structurally Constrained High Dimensional Problems
19 0.52945703 297 nips-2011-Universal low-rank matrix recovery from Pauli measurements
20 0.52852678 258 nips-2011-Sparse Bayesian Multi-Task Learning
topicId topicWeight
[(0, 0.041), (4, 0.066), (20, 0.034), (26, 0.061), (31, 0.088), (33, 0.016), (39, 0.013), (43, 0.123), (45, 0.151), (57, 0.05), (74, 0.058), (81, 0.082), (83, 0.071), (85, 0.011), (99, 0.044)]
simIndex simValue paperId paperTitle
same-paper 1 0.95118451 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
Author: Po-ling Loh, Martin J. Wainwright
Abstract: Although the standard formulations of prediction problems involve fully-observed and noiseless data drawn in an i.i.d. manner, many applications involve noisy and/or missing data, possibly involving dependencies. We study these issues in the context of high-dimensional sparse linear regression, and propose novel estimators for the cases of noisy, missing, and/or dependent data. Many standard approaches to noisy or missing data, such as those using the EM algorithm, lead to optimization problems that are inherently non-convex, and it is difficult to establish theoretical guarantees on practical algorithms. While our approach also involves optimizing non-convex programs, we are able to both analyze the statistical error associated with any global optimum, and prove that a simple projected gradient descent algorithm will converge in polynomial time to a small neighborhood of the set of global minimizers. On the statistical side, we provide non-asymptotic bounds that hold with high probability for the cases of noisy, missing, and/or dependent data. On the computational side, we prove that under the same types of conditions required for statistical consistency, the projected gradient descent algorithm will converge at geometric rates to a near-global minimizer. We illustrate these theoretical predictions with simulations, showing agreement with the predicted scalings. 1
2 0.92051876 258 nips-2011-Sparse Bayesian Multi-Task Learning
Author: Shengbo Guo, Onno Zoeter, Cédric Archambeau
Abstract: We propose a new sparse Bayesian model for multi-task regression and classification. The model is able to capture correlations between tasks, or more specifically a low-rank approximation of the covariance matrix, while being sparse in the features. We introduce a general family of group sparsity inducing priors based on matrix-variate Gaussian scale mixtures. We show the amount of sparsity can be learnt from the data by combining an approximate inference approach with type II maximum likelihood estimation of the hyperparameters. Empirical evaluations on data sets from biology and vision demonstrate the applicability of the model, where on both regression and classification tasks it achieves competitive predictive performance compared to previously proposed methods. 1
3 0.91995794 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning
Author: Feng Yan, Yuan Qi
Abstract: For many real-world applications, we often need to select correlated variables— such as genetic variations and imaging features associated with Alzheimer’s disease—in a high dimensional space. The correlation between variables presents a challenge to classical variable selection methods. To address this challenge, the elastic net has been developed and successfully applied to many applications. Despite its great success, the elastic net does not exploit the correlation information embedded in the data to select correlated variables. To overcome this limitation, we present a novel hybrid model, EigenNet, that uses the eigenstructures of data to guide variable selection. Specifically, it integrates a sparse conditional classification model with a generative model capturing variable correlations in a principled Bayesian framework. We develop an efficient active-set algorithm to estimate the model via evidence maximization. Experimental results on synthetic data and imaging genetics data demonstrate the superior predictive performance of the EigenNet over the lasso, the elastic net, and the automatic relevance determination. 1
4 0.91888607 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
Author: Rina Foygel, Ohad Shamir, Nati Srebro, Ruslan Salakhutdinov
Abstract: We provide rigorous guarantees on learning with the weighted trace-norm under arbitrary sampling distributions. We show that the standard weighted-trace norm might fail when the sampling distribution is not a product distribution (i.e. when row and column indexes are not selected independently), present a corrected variant for which we establish strong learning guarantees, and demonstrate that it works better in practice. We provide guarantees when weighting by either the true or empirical sampling distribution, and suggest that even if the true distribution is known (or is uniform), weighting by the empirical distribution may be beneficial. 1
5 0.91728497 195 nips-2011-On Learning Discrete Graphical Models using Greedy Methods
Author: Ali Jalali, Christopher C. Johnson, Pradeep K. Ravikumar
Abstract: In this paper, we address the problem of learning the structure of a pairwise graphical model from samples in a high-dimensional setting. Our first main result studies the sparsistency, or consistency in sparsity pattern recovery, properties of a forward-backward greedy algorithm as applied to general statistical models. As a special case, we then apply this algorithm to learn the structure of a discrete graphical model via neighborhood estimation. As a corollary of our general result, we derive sufficient conditions on the number of samples n, the maximum nodedegree d and the problem size p, as well as other conditions on the model parameters, so that the algorithm recovers all the edges with high probability. Our result guarantees graph selection for samples scaling as n = Ω(d2 log(p)), in contrast to existing convex-optimization based algorithms that require a sample complexity of Ω(d3 log(p)). Further, the greedy algorithm only requires a restricted strong convexity condition which is typically milder than irrepresentability assumptions. We corroborate these results using numerical simulations at the end.
6 0.91598594 239 nips-2011-Robust Lasso with missing and grossly corrupted observations
7 0.9155401 22 nips-2011-Active Ranking using Pairwise Comparisons
8 0.91473985 29 nips-2011-Algorithms and hardness results for parallel large margin learning
9 0.91214114 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
10 0.91071838 265 nips-2011-Sparse recovery by thresholded non-negative least squares
11 0.91065907 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data
12 0.90975171 231 nips-2011-Randomized Algorithms for Comparison-based Search
13 0.9085871 9 nips-2011-A More Powerful Two-Sample Test in High Dimensions using Random Projection
14 0.90852642 80 nips-2011-Efficient Online Learning via Randomized Rounding
15 0.90656596 186 nips-2011-Noise Thresholds for Spectral Clustering
16 0.90513164 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
17 0.90298802 135 nips-2011-Information Rates and Optimal Decoding in Large Neural Populations
18 0.9026112 267 nips-2011-Spectral Methods for Learning Multivariate Latent Tree Structure
19 0.9015342 276 nips-2011-Structured sparse coding via lateral inhibition
20 0.90121865 199 nips-2011-On fast approximate submodular minimization