nips nips2013 nips2013-91 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Eunho Yang, Pradeep Ravikumar
Abstract: We provide a unified framework for the high-dimensional analysis of “superposition-structured” or “dirty” statistical models: where the model parameters are a superposition of structurally constrained parameters. We allow for any number and types of structures, and any statistical model. We consider the general class of M -estimators that minimize the sum of any loss function, and an instance of what we call a “hybrid” regularization, that is the infimal convolution of weighted regularization functions, one for each structural component. We provide corollaries showcasing our unified framework for varied statistical models such as linear regression, multiple regression and principal component analysis, over varied superposition structures. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We provide a unified framework for the high-dimensional analysis of “superposition-structured” or “dirty” statistical models: where the model parameters are a superposition of structurally constrained parameters. [sent-5, score-0.399]
2 We consider the general class of M -estimators that minimize the sum of any loss function, and an instance of what we call a “hybrid” regularization, that is the infimal convolution of weighted regularization functions, one for each structural component. [sent-7, score-0.647]
3 We provide corollaries showcasing our unified framework for varied statistical models such as linear regression, multiple regression and principal component analysis, over varied superposition structures. [sent-8, score-1.065]
4 It has now become well understood that even in this type of high-dimensional p ≫ n scaling, it is possible to obtain statistically consistent estimators provided one imposes structural constraints on the statistical models. [sent-11, score-0.355]
5 Examples of such structural constraints include sparsity constraints (e.g., sparse high-dimensional regression vectors), low-rank matrix structure, and so on. [sent-12, score-0.293]
6 For each of these structural constraints, a large body of work has proposed and analyzed statistically consistent estimators. [sent-15, score-0.203]
7 For instance, a key subclass leverages such structural constraints via specific regularization functions. [sent-16, score-0.475]
8 Examples include ℓ1-regularization for sparse models, nuclear-norm regularization for low-rank matrix-structured models, and so on. [sent-17, score-0.467]
9 A caveat to this strong line of work is that imposing such “clean” structural constraints, such as sparsity or low-rank structure, is typically too stringent for real-world messy data. [sent-18, score-0.346]
10 [6] also apply this matrix decomposition estimation to the learning of latent-variable Gaussian graphical models, where they estimate an inverse covariance matrix that is the sum of sparse and low-rank matrices. [sent-24, score-0.397]
11 A number of papers have applied such decomposition estimation to robust principal component analysis: Candès et al. [sent-25, score-0.234]
12 [3] learn a covariance matrix that is the sum of a low-rank factored matrix and a sparse “error/outlier” matrix, while [9, 15] learn a covariance matrix that is the sum of a low-rank matrix and a column-sparse error matrix. [sent-26, score-0.785]
13 [7] analyze the estimation of the sum of a low-rank and an elementwise-sparse matrix in the noisy setting, while Agarwal et al. [sent-28, score-0.331]
14 [1] extend this to the sum of a low-rank matrix and a matrix with general structure. [sent-29, score-0.27]
15 Another application is multi-task learning, where [8] learn a multiple-linear-regression coefficient matrix that is the sum of a sparse and a block-sparse matrix. [sent-30, score-0.239]
16 This strong line of work can be seen to follow the recipe of estimating a superposition of two structures; and indeed their results show that this simple extension provides a vast increase in the practical applicability of structurally constrained models. [sent-31, score-0.37]
17 This long line of work above on M-estimators and analyses for specific pairs of superposition structures and specific statistical models leads to the question: is there a unified framework for studying any general tuple of such structures, for any statistical model? [sent-33, score-0.212]
18 By such “superposition-structure,” we mean the constraint that the parameter be a superposition of “clean” structurally constrained parameters. [sent-37, score-0.393]
19 Our unified analysis allows the following very general class of M -estimators, which are the sum of any loss function, and an instance of what we call a “hybrid” regularization function, that is the infimal convolution of any weighted regularization functions, one for each structural component. [sent-39, score-0.874]
20 As we show, this is equivalent to an M -estimator that is the sum of (a) a loss function applied to the sum of the multiple parameter vectors, one corresponding to each structural component; and (b) a weighted sum of regularization functions, one for each of the parameter vectors. [sent-40, score-0.876]
21 We stress that our analysis allows for general loss functions, and general component regularization functions. [sent-41, score-0.418]
22 We provide corollaries showcasing our unified framework for varied statistical models such as linear regression, multiple regression and principal component analysis, over varied superposition structures. [sent-42, score-1.065]
23 We assume that the statistical model parameter θ* is “superposition-structured,” so that it is the sum of parameter components, each of which is constrained by a specific structure. [sent-52, score-0.266]
24 There, they use subspace pairs (M, M̄), where M ⊆ M̄, to capture any structured parameter. [sent-55, score-0.366]
25 M̄⊥ is the perturbation subspace of parameters that represents perturbations away from the model subspace. [sent-58, score-0.245]
26 They also define the property of decomposability of a regularization function, which captures the suitability of a regularization function R for a particular structure. [sent-59, score-0.503]
27 R is said to be decomposable with respect to a subspace pair (M, M̄⊥) if R(u + v) = R(u) + R(v) for all u ∈ M and v ∈ M̄⊥. [sent-61, score-0.418]
28 For common structures such as sparse vectors and low-rank matrices, we can define the corresponding low-dimensional model subspaces, as well as regularization functions that are decomposable with respect to the corresponding subspace pairs. [sent-64, score-0.645]
29 For any subset S ⊆ {1, . . . , p} of the coordinates, let M(S) be the subspace of vectors in R^p that have support contained in S. [sent-70, score-0.245]
30 As shown in [11], the ℓ1 norm R(θ) = ‖θ‖₁, commonly used as a sparsity-encouraging regularization [sent-75, score-0.353]
31 function, is decomposable with respect to the subspace pairs (M(S), M̄⊥(S)). [sent-76, score-0.473]
32 For a given pair of r-dimensional subspaces U ⊆ R^k and V ⊆ R^m, we define the subspace pairs as follows: M(U, V) := {Θ ∈ R^{k×m} | row(Θ) ⊆ V, col(Θ) ⊆ U}, and analogously M̄⊥(U, V) := {Θ ∈ R^{k×m} | row(Θ) ⊆ V⊥, col(Θ) ⊆ U⊥}. [sent-81, score-0.416]
33 The nuclear norm R(Θ) = |||Θ|||₁ is decomposable with respect to the subspace pairs (M(U, V), M̄⊥(U, V)). [sent-86, score-0.473]
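As a quick sanity check of the decomposability property (our own illustration, not taken from the paper), the following sketch verifies numerically that the ℓ1 norm is additive over a support subspace M(S) and the complementary perturbation subspace, whereas the ℓ2 norm is not:

```python
import numpy as np

# Decomposability check for the l1 norm: if u is supported on S and v on the
# complement of S, then ||u + v||_1 = ||u||_1 + ||v||_1.
rng = np.random.default_rng(0)
p, S = 10, [0, 3, 7]                              # ambient dimension and support set (illustrative)
u = np.zeros(p); u[S] = rng.normal(size=len(S))   # u in the model subspace M(S)
v = rng.normal(size=p); v[S] = 0.0                # v in the perturbation subspace
assert np.isclose(np.abs(u + v).sum(), np.abs(u).sum() + np.abs(v).sum())

# The analogous identity fails for a norm that is not decomposable over this
# pair, e.g. the l2 norm (strict inequality whenever both parts are nonzero):
print(np.linalg.norm(u + v), np.linalg.norm(u) + np.linalg.norm(v))
```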
34 In our dirty statistical model setting, we do not have just one structure but a set of structures; suppose we index them by the set I. [sent-87, score-0.377]
35 Our key structural constraint can then be stated as θ* = Σ_{α∈I} θ*_α, where [sent-88, score-0.203]
36 θ*_α is a “clean” structured parameter with respect to a subspace pair (M_α, M̄_α), for M_α ⊆ M̄_α. [sent-89, score-0.371]
37 We also assume we are given a set of regularization functions R_α(·), for α ∈ I, that are suited to the respective structures, in the sense that they are decomposable with respect to the subspace pairs (M_α, M̄⊥_α). [sent-90, score-0.7]
38 We are interested in the following “superposition” estimator:
min_{(θ_α)_{α∈I}}  L( Σ_{α∈I} θ_α ) + Σ_{α∈I} λ_α R_α(θ_α),        (1)
where (λ_α)_{α∈I} are the regularization penalties. [sent-95, score-0.227]
39 This optimization problem involves not just one parameter vector, but multiple parameter vectors, one for each structural component: while the loss function applies only to the sum of these, separate regularization functions are applied to the corresponding parameter vectors. [sent-96, score-0.776]
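Since the loss couples the components only through their sum, each θ_α can be updated with a proximal step on its own regularizer; the following is a minimal block proximal-gradient sketch for (1) with the squared loss. This is our own illustration, not an algorithm taken from the paper, and the function names and step-size choice are ours:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def dirty_prox_gradient(X, y, proxes, lams, n_iter=500, step=None):
    """Block proximal gradient for the superposition estimator (1) with the
    squared loss (1/n)||y - X sum_a theta_a||_2^2 + sum_a lam_a R_a(theta_a).
    proxes[a](v, t) must return the proximal operator of t * R_a at v."""
    n, p = X.shape
    if step is None:
        step = n / (2.0 * np.linalg.norm(X, 2) ** 2)   # 1 / (Lipschitz constant of the gradient)
    thetas = [np.zeros(p) for _ in proxes]
    for _ in range(n_iter):
        for a, (prox, lam) in enumerate(zip(proxes, lams)):
            grad = 2.0 * X.T @ (X @ sum(thetas) - y) / n   # gradient w.r.t. theta_a, others fixed
            thetas[a] = prox(thetas[a] - step * grad, step * lam)
    return thetas
```

The corollaries below instantiate this template with ℓ1, group-ℓ1/ℓ2, and nuclear-norm proximal maps.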
40 We will now see that this can be rewritten as a standard M-estimation problem which minimizes, over a single parameter vector, the sum of a loss function and a special “dirty” regularization function. [sent-97, score-0.453]
41 Given a vector c := (c_α)_{α∈I} of convex-combination weights, suppose we define the following “dirty” regularization function, that is the infimal convolution of a set of regularization functions:
R(θ; c) = inf { Σ_{α∈I} c_α R_α(θ_α) : Σ_{α∈I} θ_α = θ }.        (2) [sent-98, score-0.603]
42 It can be shown that, provided the individual regularization functions R_α(·), for α ∈ I, are norms, R(·; c) is a norm as well. [sent-99, score-0.353]
43 We discuss this and other properties of this hybrid regularization function R(·; c) in Appendix A. [sent-100, score-0.379]
44 While the tuning parameters in (1) correspond to the regularization penalties (λ_α)_{α∈I}, the tuning parameters in (3) correspond to the weights (c_α)_{α∈I} specifying the “dirty” regularization function. [sent-106, score-0.454]
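Spelling out the equivalence alluded to above (the display of (3) is not reproduced in this extract, so the single-parameter form below is our reading): with λ := Σ_{α∈I} λ_α and c_α = λ_α/λ, so that λ c_α = λ_α,

min_θ { L(θ) + λ R(θ; c) }
  = min_θ { L(θ) + inf_{Σ_{α∈I} θ_α = θ} Σ_{α∈I} λ_α R_α(θ_α) }
  = min_{(θ_α)_{α∈I}} { L( Σ_{α∈I} θ_α ) + Σ_{α∈I} λ_α R_α(θ_α) },

which is exactly (1); the two programs therefore share their optimal values and minimizers (up to how the optimum is split across components).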
45 3 Error Bounds for Convex M-estimators. Our goal is to provide error bounds ‖θ̂ − θ*‖ between the target parameter θ* (the minimizer of the population risk) and our M-estimate θ̂ from (1), for any error norm ‖·‖. [sent-108, score-0.344]
46 A common example of an error norm is the ℓ2 norm ‖·‖₂. [sent-109, score-0.307]
47 We now turn to the properties of the loss function and regularization function that underlie our analysis. [sent-110, score-0.313]
48 We first restate some natural assumptions on the loss and regularization functions. [sent-111, score-0.313]
49 (C2) The regularizers R_α are norms, and are decomposable with respect to the subspace pairs (M_α, M̄⊥_α). [sent-113, score-0.512]
50 Our next assumption is a restricted strong convexity assumption [11]. [sent-115, score-0.22]
51 Note that these conditions (C1)-(C3) are imposed even when the model has a single clean structural constraint; see [11]. [sent-117, score-0.484]
52 Note that for a model with a single clean structural constraint, with |I| = 1, the condition (C4) is trivially satisfied since the LHS becomes 0. [sent-121, score-0.489]
53 We will see in the sequel that for a large collection of loss functions, including all linear loss functions, the condition (C4) simplifies considerably, and moreover holds with high probability, typically with h_α = 0. [sent-122, score-0.329]
54 We note that this condition is much weaker than “incoherence” conditions typically imposed when analyzing specific instances of such superposition-structured models (see, e.g., [sent-123, score-0.205]
55 references in the introduction), where the assumptions typically include (a) assuming that the structured subspaces (M_α)_{α∈I} intersect only at {0}, and (b) that the sizes of these subspaces are extremely small. [sent-125, score-0.298]
56 Finally, we will use the notion of the subspace compatibility constant defined in [11], which captures the relationship between the regularization function R(·) and the error norm ‖·‖ over vectors in a subspace M: Ψ(M, ‖·‖) := sup_{u ∈ M\{0}} R(u)/‖u‖. [sent-126, score-0.985]
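As a concrete instance of this constant (a standard calculation, stated here for orientation rather than quoted from the paper): for the sparse model subspace M(S) with |S| = s, the ℓ1 regularizer, and the ℓ2 error norm,

Ψ(M(S), ‖·‖₂) = sup_{u ∈ M(S)\{0}} ‖u‖₁ / ‖u‖₂ = √s,

so the compatibility constant grows with the support size, which is exactly how the sparsity level enters the error bounds below.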
57 Suppose we solve the M-estimation problem in (3), with hybrid regularization R(·; c), where the convex-combination weights c are set as c_α = λ_α / Σ_{α∈I} λ_α, with λ_α ≥ 2 R*_α(∇_{θ_α} L(θ*; Z_1^n)). [sent-128, score-0.379]
58 Then, the parameter error bounds are given as:
‖θ̂ − θ*‖ ≤ (3|I|/κ̄_L) max_{α∈I} λ_α Ψ_α(M̄_α) + √(|I| τ_L / κ̄_L),
where κ̄_L is an effective restricted-strong-convexity constant determined by κ_L, by ḡ := max_{α∈I}(g_α + h_α), and by |I| max_{α∈I} λ_α Ψ_α(M̄_α), and τ_L is an approximation-error term proportional to Σ_{α∈I} R_α²(Π_{M̄⊥_α}(θ*_α)), which vanishes when each θ*_α lies exactly in its model subspace M_α. [sent-130, score-0.385]
59 Remarks: (R1) It is instructive to compare Theorem 1 to the main theorem in [11], where they derive parameter error bounds for any M-estimator with a decomposable regularizer, for any “clean” structure. [sent-133, score-0.382]
60 We cannot derive our result in turn from their theorem applied to the M -estimator (3) with the hybrid regularization function R(·; c): the “superposition” structure is not captured by a pair of subspaces, nor is the hybrid regularization function decomposable, as is required by their theorem. [sent-135, score-0.832]
61 Our setting, as well as our analysis, is strictly more general, which is why we need the additional structural incoherence assumption (C4) (which is trivially satisfied when |I| = 1). [sent-136, score-0.45]
62 Agarwal et al. [1] provide Frobenius-norm error bounds for the matrix-decomposition problem of recovering the sum of a low-rank matrix and a general structured matrix. [sent-138, score-0.375]
63 First, the proof of their theorem requires the regularization penalty for the second structure to be bounded strongly away from zero: their convergence rate does not approach zero even with an infinite number of samples n. [sent-140, score-0.301]
64 Second, they assume much stronger conditions for their theorem to hold; in Theorem 1, in contrast, we pose much milder “local” RSC conditions (C3) and a structural incoherence condition (C4). [sent-142, score-0.651]
65 The theorem holds for any set of subspace pairs (M_α, M̄_α)_{α∈I} with respect to which the corresponding regularizers are decomposable. [sent-145, score-0.452]
66 As noted earlier, each M̄_α should ideally be set to the structured subspace in which the true parameter component at least approximately lies, and which we want to be as small as possible (note that the bound includes a term that depends on the size of this subspace via the subspace compatibility constant). [sent-146, score-0.941]
67 Suppose we solve the M-estimation problem in (1), with hybrid regularization R(·; c), where the convex-combination weights c are set as c_α = λ_α / Σ_{α∈I} λ_α, with λ_α ≥ 2 R*_α(∇_{θ_α} L(θ*; Z_1^n)), and suppose conditions (C1)–(C4) are satisfied. [sent-152, score-0.543]
68 Then, the parameter error bounds are given as:
‖θ̂ − θ*‖ ≤ (3|I|/κ̄_L) max_{α∈I} λ_α Ψ_α(M̄_α). [sent-154, score-0.257]
69 This first term can be thought of as the “estimation error” component of the error bound, when the parameter has exactly the structure being modeled by the regularizers. [sent-157, score-0.258]
70 The second term can be thought of as the “approximation error” component of the error bound, which is the penalty for the parameter not exactly lying in the structured subspaces modeled by the regularizers. [sent-158, score-0.44]
71 Note that each λ_α is larger than a particular norm of the sample score function (the gradient of the loss at the true parameter): since the expected value of the score function is zero, the magnitude of the sample score function captures the amount of “noise” in the data. [sent-160, score-0.261]
72 This is in turn scaled by Ψ_α(M̄_α), which captures the size of the structured subspace corresponding to the parameter component θ*_α. [sent-161, score-0.525]
73 We now provide corollaries showcasing our unified framework for varied statistical models such as linear regression, multiple regression and principal component analysis, over varied superposition structures. [sent-163, score-1.065]
74 For this class of statistical models, we will consider the instantiation of (1) with the loss function L consisting of the squared loss:
min_{(θ_α)_{α∈I}}  (1/n) ‖Y − X Σ_{α∈I} θ_α‖₂² + Σ_{α∈I} λ_α R_α(θ_α).        (5) [sent-166, score-0.225]
75 The restricted strong convexity condition (C3) reduces to the following. [sent-168, score-0.225]
76 Finally, our structural incoherence condition reduces to the following: noting that L(θ* + Σ_{α∈I} Δ_α) + (|I| − 1) L(θ*) − Σ_{α∈I} L(θ* + Δ_α) = (2/n) Σ_{α<β} ⟨XΔ_α, XΔ_β⟩ in this specific case, (C4) reduces to
(D4)  (2/n) Σ_{α<β} ⟨XΔ_α, XΔ_β⟩ ≤ (κ_L/2) Σ_{α∈I} ‖Δ_α‖² + Σ_{α∈I} h_α R_α²(Δ_α). [sent-170, score-0.445]
77 1 Structural Incoherence with Gaussian Design. We now show that the condition (D4) required for Theorem 1 holds with high probability when the observation matrix is drawn from a so-called Σ-Gaussian ensemble: each row X_i is independently sampled from N(0, Σ). [sent-172, score-0.322]
78 Let P_M denote the matrix corresponding to the projection operator for the subspace M. [sent-174, score-0.34]
79 For any α, β ∈ I, the condition (C-Linear) requires that the spectral norms σ_max(P_{M̄_α} Σ P_{M̄_β}), σ_max(P_{M̄_α} Σ P_{M̄⊥_β}), and σ_max(P_{M̄⊥_α} Σ P_{M̄⊥_β}) be sufficiently small relative to the restricted-strong-convexity curvature κ_L. [sent-176, score-0.282]
80 Suppose each row X_i of the observation matrix X is independently sampled from N(0, Σ), and the condition (C-Linear) in (6) holds. [sent-181, score-0.283]
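A small simulation sketch (our own, with illustrative choices of Σ, supports, and constants) of the cross term that (D4) controls under a Σ-Gaussian design:

```python
import numpy as np

# The cross term controlled by (D4): for a Sigma-Gaussian design and
# perturbations Delta_s (sparse support) and Delta_g (a disjoint group support),
# (2/n) <X Delta_s, X Delta_g> concentrates around 2 * Delta_s^T Sigma Delta_g,
# which is small when the structured directions are weakly correlated under Sigma.
rng = np.random.default_rng(1)
n, p = 500, 200
Sigma = 0.3 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # AR(1)-style covariance
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

delta_s = np.zeros(p); delta_s[:5] = rng.normal(size=5)          # sparse direction
delta_g = np.zeros(p); delta_g[100:110] = rng.normal(size=10)    # group direction

cross = 2.0 / n * (X @ delta_s) @ (X @ delta_g)
rhs = 0.5 * (delta_s @ delta_s + delta_g @ delta_g)   # (kappa_L/2) * sum of squared norms, kappa_L = 1, h_alpha = 0
print(f"cross term {cross:.4f}  vs.  right-hand side {rhs:.4f}")
```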
81 2 Linear Regression with Sparse and Group-sparse Structures. We now consider a superposition structure comprising both sparse and group-sparse components. [sent-188, score-0.417]
82 Suppose that the linear regression parameter θ* is a superposition of a group-sparse component θ*_g with respect to this set of groups G, as well as a sparse component θ*_s supported on the remaining indices. [sent-196, score-0.707]
83 Then, we use the hybrid regularization function Σ_{α∈I} λ_α R_α(θ_α) = λ_s ‖θ_s‖₁ + λ_g ‖θ_g‖_{1,a}, where ‖θ‖_{1,a} := Σ_{t=1}^{q} ‖θ_{G_t}‖_a for a ≥ 2. [sent-200, score-0.379]
84 Consider the linear model (4) where θ* is the sum of an exactly s-sparse θ*_s and an exactly s_g-group-sparse θ*_g. [sent-202, score-0.232]
85 Then, if we solve (5) with
λ_s = 8σ √(log p / n)  and  λ_g = 8σ ( m^{1−1/a}/√n + √(log q / n) ),
then, with probability at least 1 − c₁ exp(−c₂ n λ_s²) − c₃/q², we have the error bound:
‖θ̂ − θ*‖₂ ≤ (24σ/κ_L) max{ √(s log p / n), √(s_g) m^{1−1/a}/√n + √(s_g log q / n) }. [sent-205, score-0.689]
86 On the other hand, the support of θ* can also be interpreted as comprising s_g + s disjoint groups in the worst case, so that “clean” ℓ1/ℓ2 group regularization entails, with high probability, the bound [11]: ‖θ̂_{ℓ1/ℓ2} − θ*‖₂ = O( √((s_g + s)m / n) + √((s_g + s) log q / n) ). [sent-208, score-0.48]
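A synthetic end-to-end sketch of this corollary's setting (ours; the group sizes, noise level, and the dropped absolute constants are illustrative assumptions), using block proximal-gradient updates with the ℓ1 and group-ℓ1/ℓ2 proximal maps and the λ scalings suggested above with a = 2:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def group_soft_threshold(v, t, groups):
    """Prox of t * sum_t ||v_{G_t}||_2 (block soft-thresholding, a = 2)."""
    out = v.copy()
    for g in groups:
        nrm = np.linalg.norm(v[g])
        out[g] = 0.0 if nrm <= t else (1.0 - t / nrm) * v[g]
    return out

rng = np.random.default_rng(0)
n, p, m, q, sigma = 400, 500, 10, 50, 0.1                  # q groups of size m (illustrative)
groups = [np.arange(t * m, (t + 1) * m) for t in range(q)]
theta_g = np.zeros(p); theta_g[groups[0]] = 1.0; theta_g[groups[1]] = -1.0   # group-sparse part
theta_s = np.zeros(p); theta_s[rng.choice(p, 5, replace=False)] = 2.0        # sparse part
X = rng.normal(size=(n, p))
y = X @ (theta_s + theta_g) + sigma * rng.normal(size=n)

# lambda scalings from the corollary (a = 2, so m^(1 - 1/a) = sqrt(m))
lam_s = 8 * sigma * np.sqrt(np.log(p) / n)
lam_g = 8 * sigma * (np.sqrt(m / n) + np.sqrt(np.log(q) / n))

step = n / (2.0 * np.linalg.norm(X, 2) ** 2)
ts, tg = np.zeros(p), np.zeros(p)
for _ in range(300):
    grad = 2.0 * X.T @ (X @ (ts + tg) - y) / n
    ts = soft_threshold(ts - step * grad, step * lam_s)
    grad = 2.0 * X.T @ (X @ (ts + tg) - y) / n
    tg = group_soft_threshold(tg - step * grad, step * lam_g, groups)

print("l2 error of theta_hat:", np.linalg.norm(ts + tg - (theta_s + theta_g)))
```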
87 For this class of statistical models, we will consider the instantiation of (1) with the loss function L consisting of the squared loss:
min_{(Θ_α)_{α∈I}}  (1/n) |||Y − X Σ_{α∈I} Θ_α|||²_F + Σ_{α∈I} λ_α R_α(Θ_α).        (8) [sent-212, score-0.225]
88 Consider an instance of this multiple linear regression model with the superposition structure consisting of row-sparse, column-sparse and elementwise sparse matrices: Θ* = Θ*_r + Θ*_c + Θ*_s. [sent-221, score-0.529]
89 Consider the multiple linear regression model (7) where Θ* is the sum of Θ*_r with s_r nonzero rows, Θ*_c with s_c nonzero columns, and Θ*_s with s nonzero elements. [sent-224, score-0.454]
90 Suppose that the design matrix X is a Σ-Gaussian ensemble with the properties of column normalization and a maximum singular value σ_max(X) of order √n. [sent-225, score-0.244]
91 Further, suppose that (6) holds and W is elementwise sub-Gaussian with parameter σ. [sent-226, score-0.289]
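For reference, the proximal maps that a solver for (8) would need for the three component penalties (a minimal sketch, assuming a = 2 group norms on rows and columns; the operator names are ours):

```python
import numpy as np

def prox_elementwise_l1(M, t):
    """Prox of t * sum_ij |M_ij| (elementwise soft-thresholding)."""
    return np.sign(M) * np.maximum(np.abs(M) - t, 0.0)

def prox_row_sparse(M, t):
    """Prox of t * sum_i ||row_i(M)||_2: zeroes out or shrinks whole rows."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    scale = np.maximum(1.0 - t / np.maximum(norms, 1e-12), 0.0)
    return scale * M

def prox_col_sparse(M, t):
    """Prox of t * sum_j ||col_j(M)||_2: zeroes out or shrinks whole columns."""
    return prox_row_sparse(M.T, t).T
```

Each Θ_α in (8) is then updated with its own map, exactly as in the vector case sketched earlier.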
92 U_i ∼ N(0, Θ*) is the “uncorrupted” set of observations, with a low-rank covariance matrix Θ* = LLᵀ, for some loading matrix L ∈ R^{p×r}. [sent-232, score-0.314]
93 v_i ∈ R^p is a noise/error vector; in standard factor analysis, v_i is a spherical Gaussian noise vector: v_i ∼ N(0, σ² I_{p×p}) (or v_i = 0); and the goal is to recover the loading matrix L from samples. [sent-233, score-0.456]
94 In PCA with sparse noise, v_i ∼ N(0, Γ*), where Γ* is elementwise sparse. [sent-234, score-0.231]
95 where |||·|||₁ denotes the nuclear norm, while ‖·‖₁ denotes the element-wise ℓ1 norm (we will use |||·|||₂ for the spectral norm). [sent-238, score-0.302]
96 In contrast to the previous two examples, (9) involves a trivial design matrix (the identity I_{p×p}), which allows (D4) to hold under the simpler (C-Linear)-type condition that the spectral norms of products of the projection operators for the two structures (such as P_{M̄_Θ} P_{M̄_Γ} and P_{M̄_Θ} P_{M̄⊥_Γ}) be sufficiently small. [sent-240, score-0.376]
97 Consider the principal component analysis model where Θ* has rank at most r and Γ* has s nonzero entries. [sent-246, score-0.244]
98 Then, given the choice of
λ_Γ = 32 ρ(Σ) √(log p / n)  and  λ_Θ = 16 |||Σ|||₂ √(p / n),
where ρ(Σ) = max_j Σ_jj, the optimal error of (9) is bounded by
‖Θ̂ − Θ*‖_F ≤ (48/κ̄) max{ |||Σ|||₂ √(rp / n), 2ρ(Σ) √(s log p / n) },
with probability at least 1 − c₁ exp(−c₂ log p). [sent-248, score-0.456]
99 Under a stricter “global” RSC condition, they compute the error bound ‖Θ̂ − Θ*‖_F ≍ max{ |||Σ|||₂ √(rp / n), ρ(Σ)( √(s log p / n) + an additional term involving α ) }, where α is a parameter between 1 and p. [sent-252, score-0.346]
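A self-contained sketch of the low-rank-plus-sparse covariance decomposition behind this corollary (ours: the displayed objective is our reading of (9), which is not reproduced in this extract, the absolute constants in λ_Θ and λ_Γ are dropped, and the data-generating choices are illustrative):

```python
import numpy as np

def svd_soft_threshold(M, t):
    """Prox of t * nuclear norm: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - t, 0.0)) @ Vt

def soft_threshold(M, t):
    return np.sign(M) * np.maximum(np.abs(M) - t, 0.0)

# Decompose a sample covariance S_hat into low-rank Theta + sparse Gamma by
# alternating exact prox (block-coordinate) updates on our reading of (9):
#   0.5 * ||S_hat - Theta - Gamma||_F^2 + lam_T * nuclear(Theta) + lam_G * ||Gamma||_1.
rng = np.random.default_rng(0)
p, r, n = 60, 3, 2000
L = rng.normal(size=(p, r))
Gamma_true = np.diag(rng.uniform(0.5, 1.5, size=p))       # diagonal, hence elementwise-sparse
Sigma = L @ L.T + Gamma_true
obs = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
S_hat = obs.T @ obs / n                                    # sample covariance

# Rates from the corollary (sqrt(p/n) and sqrt(log p / n)); constants here are illustrative.
lam_T = np.linalg.norm(S_hat, 2) * np.sqrt(p / n)
lam_G = 2.0 * S_hat.diagonal().max() * np.sqrt(np.log(p) / n)

Theta, Gamma = np.zeros((p, p)), np.zeros((p, p))
for _ in range(100):
    Theta = svd_soft_threshold(S_hat - Gamma, lam_T)
    Gamma = soft_threshold(S_hat - Theta, lam_G)

print("rank(Theta_hat):", np.linalg.matrix_rank(Theta, tol=1e-8),
      " nnz(Gamma_hat):", int(np.count_nonzero(np.abs(Gamma) > 1e-8)))
```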
100 Note that our work and [1] derive Frobenius error bounds under restricted strong convexity conditions; other recent works such as [7] also derive such Frobenius error bounds but under stronger conditions (see [1] for details). [sent-257, score-0.416]
wordName wordTfidf (topN-words)
[('superposition', 0.262), ('subspace', 0.245), ('regularization', 0.227), ('dirty', 0.213), ('structural', 0.203), ('decomposable', 0.173), ('pm', 0.173), ('incoherence', 0.161), ('clean', 0.157), ('hybrid', 0.152), ('sg', 0.152), ('corollaries', 0.137), ('rp', 0.13), ('norm', 0.126), ('subspaces', 0.116), ('showcasing', 0.113), ('uni', 0.113), ('regression', 0.111), ('chandrasekaran', 0.109), ('component', 0.105), ('suppose', 0.098), ('corollary', 0.097), ('matrix', 0.095), ('max', 0.094), ('col', 0.092), ('elementwise', 0.092), ('varied', 0.092), ('structures', 0.091), ('principal', 0.087), ('loss', 0.086), ('condition', 0.081), ('sum', 0.08), ('hx', 0.076), ('mal', 0.076), ('hhx', 0.075), ('withp', 0.075), ('vi', 0.075), ('proposition', 0.075), ('theorem', 0.074), ('instantiation', 0.073), ('parrilo', 0.073), ('structurally', 0.071), ('negahban', 0.067), ('collated', 0.067), ('conditions', 0.066), ('statistical', 0.066), ('structured', 0.066), ('sparse', 0.064), ('sc', 0.064), ('covariance', 0.063), ('caveat', 0.061), ('loading', 0.061), ('msg', 0.061), ('rk', 0.061), ('zi', 0.061), ('frobenius', 0.06), ('remarks', 0.06), ('agarwal', 0.06), ('parameter', 0.06), ('log', 0.059), ('eunho', 0.058), ('rsc', 0.058), ('imposed', 0.058), ('convexity', 0.057), ('error', 0.055), ('ensemble', 0.055), ('pairs', 0.055), ('observation', 0.054), ('row', 0.053), ('ravikumar', 0.052), ('cand', 0.052), ('nonzero', 0.052), ('convolution', 0.051), ('nuclear', 0.05), ('restricted', 0.05), ('captures', 0.049), ('trivially', 0.048), ('bounds', 0.048), ('rn', 0.048), ('instructive', 0.046), ('kx', 0.045), ('constraints', 0.045), ('sanghavi', 0.043), ('sr', 0.043), ('bound', 0.042), ('robust', 0.042), ('eigenvalue', 0.041), ('imposes', 0.041), ('regularized', 0.041), ('rates', 0.04), ('regularizers', 0.039), ('pca', 0.039), ('holds', 0.039), ('compatibility', 0.038), ('thought', 0.038), ('assumption', 0.038), ('strong', 0.037), ('sequel', 0.037), ('austin', 0.036), ('texas', 0.036)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 91 nips-2013-Dirty Statistical Models
Author: Eunho Yang, Pradeep Ravikumar
Abstract: We provide a unified framework for the high-dimensional analysis of “superposition-structured” or “dirty” statistical models: where the model parameters are a superposition of structurally constrained parameters. We allow for any number and types of structures, and any statistical model. We consider the general class of M -estimators that minimize the sum of any loss function, and an instance of what we call a “hybrid” regularization, that is the infimal convolution of weighted regularization functions, one for each structural component. We provide corollaries showcasing our unified framework for varied statistical models such as linear regression, multiple regression and principal component analysis, over varied superposition structures. 1
2 0.23696919 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
Author: Vincent Q. Vu, Juhee Cho, Jing Lei, Karl Rohe
Abstract: We propose a novel convex relaxation of sparse principal subspace estimation based on the convex hull of rank-d projection matrices (the Fantope). The convex problem can be solved efficiently using alternating direction method of multipliers (ADMM). We establish a near-optimal convergence rate, in terms of the sparsity, ambient dimension, and sample size, for estimation of the principal subspace of a general covariance matrix without assuming the spiked covariance model. In the special case of d = 1, our result implies the near-optimality of DSPCA (d’Aspremont et al. [1]) even when the solution is not rank 1. We also provide a general theoretical framework for analyzing the statistical properties of the method for arbitrary input matrices that extends the applicability and provable guarantees to a wide array of settings. We demonstrate this with an application to Kendall’s tau correlation matrices and transelliptical component analysis. 1
3 0.22770503 219 nips-2013-On model selection consistency of penalized M-estimators: a geometric theory
Author: Jason Lee, Yuekai Sun, Jonathan E. Taylor
Abstract: Penalized M-estimators are used in diverse areas of science and engineering to fit high-dimensional models with some low-dimensional structure. Often, the penalties are geometrically decomposable, i.e. can be expressed as a sum of support functions over convex sets. We generalize the notion of irrepresentable to geometrically decomposable penalties and develop a general framework for establishing consistency and model selection consistency of M-estimators with such penalties. We then use this framework to derive results for some special cases of interest in bioinformatics and statistical learning. 1
4 0.18504241 259 nips-2013-Provable Subspace Clustering: When LRR meets SSC
Author: Yu-Xiang Wang, Huan Xu, Chenlei Leng
Abstract: Sparse Subspace Clustering (SSC) and Low-Rank Representation (LRR) are both considered as the state-of-the-art methods for subspace clustering. The two methods are fundamentally similar in that both are convex optimizations exploiting the intuition of “Self-Expressiveness”. The main difference is that SSC minimizes the vector 1 norm of the representation matrix to induce sparsity while LRR minimizes nuclear norm (aka trace norm) to promote a low-rank structure. Because the representation matrix is often simultaneously sparse and low-rank, we propose a new algorithm, termed Low-Rank Sparse Subspace Clustering (LRSSC), by combining SSC and LRR, and develops theoretical guarantees of when the algorithm succeeds. The results reveal interesting insights into the strength and weakness of SSC and LRR and demonstrate how LRSSC can take the advantages of both methods in preserving the “Self-Expressiveness Property” and “Graph Connectivity” at the same time. 1
5 0.17586575 222 nips-2013-On the Linear Convergence of the Proximal Gradient Method for Trace Norm Regularization
Author: Ke Hou, Zirui Zhou, Anthony Man-Cho So, Zhi-Quan Luo
Abstract: Motivated by various applications in machine learning, the problem of minimizing a convex smooth loss function with trace norm regularization has received much attention lately. Currently, a popular method for solving such problem is the proximal gradient method (PGM), which is known to have a sublinear rate of convergence. In this paper, we show that for a large class of loss functions, the convergence rate of the PGM is in fact linear. Our result is established without any strong convexity assumption on the loss function. A key ingredient in our proof is a new Lipschitzian error bound for the aforementioned trace norm–regularized problem, which may be of independent interest.
6 0.17525838 179 nips-2013-Low-Rank Matrix and Tensor Completion via Adaptive Sampling
7 0.16323155 68 nips-2013-Confidence Intervals and Hypothesis Testing for High-Dimensional Statistical Models
8 0.15966918 233 nips-2013-Online Robust PCA via Stochastic Optimization
9 0.15778527 286 nips-2013-Robust learning of low-dimensional dynamics from large neural ensembles
10 0.15538742 224 nips-2013-On the Sample Complexity of Subspace Learning
11 0.15386623 137 nips-2013-High-Dimensional Gaussian Process Bandits
12 0.14505826 271 nips-2013-Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima
13 0.13844796 188 nips-2013-Memory Limited, Streaming PCA
14 0.13371667 185 nips-2013-Matrix Completion From any Given Set of Observations
15 0.11906201 171 nips-2013-Learning with Noisy Labels
16 0.11881668 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints
17 0.11778869 158 nips-2013-Learning Multiple Models via Regularized Weighting
18 0.1140312 272 nips-2013-Regularized Spectral Clustering under the Degree-Corrected Stochastic Blockmodel
19 0.10648987 314 nips-2013-Stochastic Optimization of PCA with Capped MSG
20 0.10363677 109 nips-2013-Estimating LASSO Risk and Noise Level
topicId topicWeight
[(0, 0.274), (1, 0.112), (2, 0.196), (3, 0.148), (4, -0.049), (5, 0.023), (6, -0.128), (7, 0.065), (8, -0.209), (9, -0.015), (10, 0.051), (11, 0.119), (12, 0.018), (13, 0.056), (14, -0.049), (15, -0.039), (16, 0.007), (17, -0.018), (18, 0.003), (19, -0.083), (20, 0.013), (21, 0.029), (22, -0.001), (23, -0.027), (24, 0.076), (25, -0.0), (26, 0.005), (27, -0.016), (28, -0.008), (29, 0.063), (30, -0.037), (31, -0.067), (32, 0.018), (33, -0.115), (34, -0.015), (35, 0.085), (36, -0.031), (37, -0.032), (38, 0.082), (39, -0.07), (40, 0.042), (41, 0.048), (42, -0.075), (43, 0.058), (44, 0.043), (45, 0.015), (46, 0.009), (47, 0.096), (48, 0.048), (49, -0.098)]
simIndex simValue paperId paperTitle
same-paper 1 0.95868838 91 nips-2013-Dirty Statistical Models
Author: Eunho Yang, Pradeep Ravikumar
Abstract: We provide a unified framework for the high-dimensional analysis of “superposition-structured” or “dirty” statistical models: where the model parameters are a superposition of structurally constrained parameters. We allow for any number and types of structures, and any statistical model. We consider the general class of M -estimators that minimize the sum of any loss function, and an instance of what we call a “hybrid” regularization, that is the infimal convolution of weighted regularization functions, one for each structural component. We provide corollaries showcasing our unified framework for varied statistical models such as linear regression, multiple regression and principal component analysis, over varied superposition structures. 1
2 0.85283649 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
Author: Vincent Q. Vu, Juhee Cho, Jing Lei, Karl Rohe
Abstract: We propose a novel convex relaxation of sparse principal subspace estimation based on the convex hull of rank-d projection matrices (the Fantope). The convex problem can be solved efficiently using alternating direction method of multipliers (ADMM). We establish a near-optimal convergence rate, in terms of the sparsity, ambient dimension, and sample size, for estimation of the principal subspace of a general covariance matrix without assuming the spiked covariance model. In the special case of d = 1, our result implies the near-optimality of DSPCA (d’Aspremont et al. [1]) even when the solution is not rank 1. We also provide a general theoretical framework for analyzing the statistical properties of the method for arbitrary input matrices that extends the applicability and provable guarantees to a wide array of settings. We demonstrate this with an application to Kendall’s tau correlation matrices and transelliptical component analysis. 1
3 0.83940011 259 nips-2013-Provable Subspace Clustering: When LRR meets SSC
Author: Yu-Xiang Wang, Huan Xu, Chenlei Leng
Abstract: Sparse Subspace Clustering (SSC) and Low-Rank Representation (LRR) are both considered as the state-of-the-art methods for subspace clustering. The two methods are fundamentally similar in that both are convex optimizations exploiting the intuition of “Self-Expressiveness”. The main difference is that SSC minimizes the vector 1 norm of the representation matrix to induce sparsity while LRR minimizes nuclear norm (aka trace norm) to promote a low-rank structure. Because the representation matrix is often simultaneously sparse and low-rank, we propose a new algorithm, termed Low-Rank Sparse Subspace Clustering (LRSSC), by combining SSC and LRR, and develops theoretical guarantees of when the algorithm succeeds. The results reveal interesting insights into the strength and weakness of SSC and LRR and demonstrate how LRSSC can take the advantages of both methods in preserving the “Self-Expressiveness Property” and “Graph Connectivity” at the same time. 1
4 0.73649931 219 nips-2013-On model selection consistency of penalized M-estimators: a geometric theory
Author: Jason Lee, Yuekai Sun, Jonathan E. Taylor
Abstract: Penalized M-estimators are used in diverse areas of science and engineering to fit high-dimensional models with some low-dimensional structure. Often, the penalties are geometrically decomposable, i.e. can be expressed as a sum of support functions over convex sets. We generalize the notion of irrepresentable to geometrically decomposable penalties and develop a general framework for establishing consistency and model selection consistency of M-estimators with such penalties. We then use this framework to derive results for some special cases of interest in bioinformatics and statistical learning. 1
5 0.71150416 224 nips-2013-On the Sample Complexity of Subspace Learning
Author: Alessandro Rudi, Guillermo D. Canas, Lorenzo Rosasco
Abstract: A large number of algorithms in machine learning, from principal component analysis (PCA), and its non-linear (kernel) extensions, to more recent spectral embedding and support estimation methods, rely on estimating a linear subspace from samples. In this paper we introduce a general formulation of this problem and derive novel learning error estimates. Our results rely on natural assumptions on the spectral properties of the covariance operator associated to the data distribution, and hold for a wide class of metrics between subspaces. As special cases, we discuss sharp error estimates for the reconstruction properties of PCA and spectral support estimation. Key to our analysis is an operator theoretic approach that has broad applicability to spectral learning methods. 1
6 0.70134771 297 nips-2013-Sketching Structured Matrices for Faster Nonlinear Regression
7 0.69567972 303 nips-2013-Sparse Overlapping Sets Lasso for Multitask Learning and its Application to fMRI Analysis
8 0.69273853 222 nips-2013-On the Linear Convergence of the Proximal Gradient Method for Trace Norm Regularization
9 0.68379152 137 nips-2013-High-Dimensional Gaussian Process Bandits
10 0.66266364 271 nips-2013-Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima
11 0.64146215 188 nips-2013-Memory Limited, Streaming PCA
12 0.63669765 286 nips-2013-Robust learning of low-dimensional dynamics from large neural ensembles
13 0.6291759 68 nips-2013-Confidence Intervals and Hypothesis Testing for High-Dimensional Statistical Models
14 0.6237908 194 nips-2013-Model Selection for High-Dimensional Regression under the Generalized Irrepresentability Condition
15 0.62344074 233 nips-2013-Online Robust PCA via Stochastic Optimization
16 0.62225974 342 nips-2013-Unsupervised Spectral Learning of Finite State Transducers
17 0.60943478 302 nips-2013-Sparse Inverse Covariance Estimation with Calibration
18 0.60177106 314 nips-2013-Stochastic Optimization of PCA with Capped MSG
19 0.59781331 354 nips-2013-When in Doubt, SWAP: High-Dimensional Sparse Recovery from Correlated Measurements
20 0.57609481 265 nips-2013-Reconciling "priors" & "priors" without prejudice?
topicId topicWeight
[(16, 0.029), (33, 0.152), (34, 0.115), (41, 0.014), (49, 0.025), (56, 0.109), (70, 0.017), (85, 0.04), (89, 0.396), (93, 0.033), (95, 0.014)]
simIndex simValue paperId paperTitle
1 0.91863436 10 nips-2013-A Latent Source Model for Nonparametric Time Series Classification
Author: George H. Chen, Stanislav Nikolov, Devavrat Shah
Abstract: For classifying time series, a nearest-neighbor approach is widely used in practice with performance often competitive with or better than more elaborate methods such as neural networks, decision trees, and support vector machines. We develop theoretical justification for the effectiveness of nearest-neighbor-like classification of time series. Our guiding hypothesis is that in many applications, such as forecasting which topics will become trends on Twitter, there aren’t actually that many prototypical time series to begin with, relative to the number of time series we have access to, e.g., topics become trends on Twitter only in a few distinct manners whereas we can collect massive amounts of Twitter data. To operationalize this hypothesis, we propose a latent source model for time series, which naturally leads to a “weighted majority voting” classification rule that can be approximated by a nearest-neighbor classifier. We establish nonasymptotic performance guarantees of both weighted majority voting and nearest-neighbor classification under our model accounting for how much of the time series we observe and the model complexity. Experimental results on synthetic data show weighted majority voting achieving the same misclassification rate as nearest-neighbor classification while observing less of the time series. We then use weighted majority to forecast which news topics on Twitter become trends, where we are able to detect such “trending topics” in advance of Twitter 79% of the time, with a mean early advantage of 1 hour and 26 minutes, a true positive rate of 95%, and a false positive rate of 4%. 1
2 0.91720819 305 nips-2013-Spectral methods for neural characterization using generalized quadratic models
Author: Il M. Park, Evan W. Archer, Nicholas Priebe, Jonathan W. Pillow
Abstract: We describe a set of fast, tractable methods for characterizing neural responses to high-dimensional sensory stimuli using a model we refer to as the generalized quadratic model (GQM). The GQM consists of a low-rank quadratic function followed by a point nonlinearity and exponential-family noise. The quadratic function characterizes the neuron’s stimulus selectivity in terms of a set linear receptive fields followed by a quadratic combination rule, and the invertible nonlinearity maps this output to the desired response range. Special cases of the GQM include the 2nd-order Volterra model [1, 2] and the elliptical Linear-Nonlinear-Poisson model [3]. Here we show that for “canonical form” GQMs, spectral decomposition of the first two response-weighted moments yields approximate maximumlikelihood estimators via a quantity called the expected log-likelihood. The resulting theory generalizes moment-based estimators such as the spike-triggered covariance, and, in the Gaussian noise case, provides closed-form estimators under a large class of non-Gaussian stimulus distributions. We show that these estimators are fast and provide highly accurate estimates with far lower computational cost than full maximum likelihood. Moreover, the GQM provides a natural framework for combining multi-dimensional stimulus sensitivity and spike-history dependencies within a single model. We show applications to both analog and spiking data using intracellular recordings of V1 membrane potential and extracellular recordings of retinal spike trains. 1
3 0.90531582 109 nips-2013-Estimating LASSO Risk and Noise Level
Author: Mohsen Bayati, Murat A. Erdogdu, Andrea Montanari
Abstract: We study the fundamental problems of variance and risk estimation in high dimensional statistical modeling. In particular, we consider the problem of learning a coefficient vector θ0 ∈ Rp from noisy linear observations y = Xθ0 + w ∈ Rn (p > n) and the popular estimation procedure of solving the 1 -penalized least squares objective known as the LASSO or Basis Pursuit DeNoising (BPDN). In this context, we develop new estimators for the 2 estimation risk θ − θ0 2 and the variance of the noise when distributions of θ0 and w are unknown. These can be used to select the regularization parameter optimally. Our approach combines Stein’s unbiased risk estimate [Ste81] and the recent results of [BM12a][BM12b] on the analysis of approximate message passing and the risk of LASSO. We establish high-dimensional consistency of our estimators for sequences of matrices X of increasing dimensions, with independent Gaussian entries. We establish validity for a broader class of Gaussian designs, conditional on a certain conjecture from statistical physics. To the best of our knowledge, this result is the first that provides an asymptotically consistent risk estimator for the LASSO solely based on data. In addition, we demonstrate through simulations that our variance estimation outperforms several existing methods in the literature. 1
same-paper 4 0.89430577 91 nips-2013-Dirty Statistical Models
Author: Eunho Yang, Pradeep Ravikumar
Abstract: We provide a unified framework for the high-dimensional analysis of “superposition-structured” or “dirty” statistical models: where the model parameters are a superposition of structurally constrained parameters. We allow for any number and types of structures, and any statistical model. We consider the general class of M -estimators that minimize the sum of any loss function, and an instance of what we call a “hybrid” regularization, that is the infimal convolution of weighted regularization functions, one for each structural component. We provide corollaries showcasing our unified framework for varied statistical models such as linear regression, multiple regression and principal component analysis, over varied superposition structures. 1
5 0.81942004 83 nips-2013-Deep Fisher Networks for Large-Scale Image Classification
Author: Karen Simonyan, Andrea Vedaldi, Andrew Zisserman
Abstract: As massively parallel computations have become broadly available with modern GPUs, deep architectures trained on very large datasets have risen in popularity. Discriminatively trained convolutional neural networks, in particular, were recently shown to yield state-of-the-art performance in challenging image classification benchmarks such as ImageNet. However, elements of these architectures are similar to standard hand-crafted representations used in computer vision. In this paper, we explore the extent of this analogy, proposing a version of the stateof-the-art Fisher vector image encoding that can be stacked in multiple layers. This architecture significantly improves on standard Fisher vectors, and obtains competitive results with deep convolutional networks at a smaller computational learning cost. Our hybrid architecture allows us to assess how the performance of a conventional hand-crafted image classification pipeline changes with increased depth. We also show that convolutional networks and Fisher vector encodings are complementary in the sense that their combination further improves the accuracy. 1
7 0.72670734 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes
8 0.69392914 194 nips-2013-Model Selection for High-Dimensional Regression under the Generalized Irrepresentability Condition
9 0.68112755 310 nips-2013-Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators.
10 0.66686147 237 nips-2013-Optimal integration of visual speed across different spatiotemporal frequency channels
11 0.66424733 68 nips-2013-Confidence Intervals and Hypothesis Testing for High-Dimensional Statistical Models
12 0.65069866 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
13 0.64789504 130 nips-2013-Generalizing Analytic Shrinkage for Arbitrary Covariance Structures
14 0.6455701 147 nips-2013-Lasso Screening Rules via Dual Polytope Projection
15 0.64013571 259 nips-2013-Provable Subspace Clustering: When LRR meets SSC
16 0.63292998 302 nips-2013-Sparse Inverse Covariance Estimation with Calibration
17 0.63217956 62 nips-2013-Causal Inference on Time Series using Restricted Structural Equation Models
18 0.63039547 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits
19 0.62915212 163 nips-2013-Learning a Deep Compact Image Representation for Visual Tracking
20 0.62404859 80 nips-2013-Data-driven Distributionally Robust Polynomial Optimization