nips nips2009 nips2009-245 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Shuheng Zhou
Abstract: Given n noisy samples with p dimensions, where n ≪ p, we show that the multistep thresholding procedure can accurately estimate a sparse vector β ∈ Rp in a linear model, under the restricted eigenvalue conditions (Bickel-Ritov-Tsybakov 09). Thus our conditions for model selection consistency are considerably weaker than what has been achieved in previous works. More importantly, this method allows very significant values of s, which is the number of non-zero elements in the true parameter. For example, it works for cases where the ordinary Lasso would have failed. Finally, we show that if X obeys a uniform uncertainty principle and if the true parameter is sufficiently sparse, the Gauss-Dantzig selector (Candès-Tao 07) achieves the ℓ2 loss within a logarithmic factor of the ideal mean square error one would achieve with an oracle which would supply perfect information about which coordinates are non-zero and which are above the noise level, while selecting a sufficiently sparse model. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Thus our conditions for model selection consistency are considerably weaker than what has been achieved in previous works. [sent-2, score-0.137]
2 For example, it works for cases where the ordinary Lasso would have failed. [sent-4, score-0.082]
3 1 Introduction In a typical high dimensional setting, the number of variables p is much larger than the number of observations n. [sent-6, score-0.078]
4 This challenging setting appears in linear regression, signal recovery, covariance selection in graphical modeling, and sparse approximations. [sent-7, score-0.178]
5 We consider the linear model Y = Xβ + ǫ, (1.1) where X is an n × p design matrix, Y is a vector of noisy observations, and ǫ is the noise term. [sent-9, score-0.088]
6 In particular, recovery of the sparsity pattern S = supp(β) := {j : βj = 0}, also known as variable (model) selection, refers to the task of correctly identifying the support set (or a subset of “significant” coefficients in β) based on the noisy observations. [sent-14, score-0.271]
7 One important stream of research, which we also adopt here, requires computational feasibility for the estimation methods, among which the Lasso and the Dantzig selector are both well studied and shown with provable nice statistical properties; see for example [11, 9, 19, 21, 5, 18, 12, 2]. [sent-17, score-0.402]
8 For a chosen penalization parameter λn ≥ 0, regularized estimation with the ℓ1-norm penalty, also known as the Lasso [16] or Basis Pursuit [6], refers to the following convex optimization problem: β̂ = arg min_{β∈Rp} (1/(2n))‖Y − Xβ‖₂² + λn‖β‖₁, (1.2) where the scaling factor 1/(2n) is chosen for convenience. [sent-18, score-0.235]
9 The Dantzig selector [5] is defined as (DS) β̂ = arg min_{β∈Rp} ‖β‖₁ subject to ‖(1/n)Xᵀ(Y − Xβ̂)‖∞ ≤ λn. [sent-19, score-0.566]
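The Lasso objective (1.2) can be minimized by cyclic coordinate descent with soft-thresholding; the sketch below is a minimal illustrative stand-in (the function names are ours, and this is not the solver used in the paper; it assumes no all-zero columns in X).

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator, the scalar Lasso solution."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, Y, lam, n_iter=200):
    """Cyclic coordinate descent for (1/(2n))||Y - X b||_2^2 + lam ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n       # per-coordinate curvature (1/n)||X_j||^2
    resid = Y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            # correlation of X_j with the partial residual (current b_j added back)
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            resid += X[:, j] * (beta[j] - new_bj)   # keep the residual in sync
            beta[j] = new_bj
    return beta
```

With an orthogonal design the coordinates decouple and the solver reduces to componentwise soft-thresholding, which gives a quick sanity check.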
10 For a set T ⊆ {1, ..., p}, let us define XT as the n × |T| submatrix obtained by extracting the columns of X indexed by T; similarly, let βT ∈ R|T| be the subvector of β ∈ Rp confined to T. [sent-26, score-0.081]
11 Formally, we study a Multi-step Procedure: first we obtain an initial estimator βinit using the Lasso as in (1.2) or the Dantzig selector. [sent-27, score-0.257]
12 We then threshold the estimator βinit at the level t0, with the general goal of obtaining a set I1 of cardinality at most 2s; in general, we also have |I1 ∪ S| ≤ 2s, where I1 = {j : |βj,init| > t0}. [sent-31, score-0.294]
13 We then feed (Y, XI) with I = I1 to either the Lasso estimator as in (1.2) or the ordinary least squares (OLS) estimator to obtain β̂, where we set β̂I = (XIᵀXI)⁻¹XIᵀY and β̂Ic = 0. [sent-37, score-0.169]
15 We then possibly threshold β̂I1 at t1 = 4λn√|I1| (to be specified) to obtain I2, repeat step 2 with I = I2 to obtain β̂I, and set all other coordinates to zero; return β̂. [sent-40, score-0.196]
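In code, the threshold-then-refit step above is just support selection followed by OLS restricted to the kept columns; a minimal sketch with our own function name, not a claim about the paper's exact implementation:

```python
import numpy as np

def two_step(X, Y, beta_init, t0):
    """Threshold an initial estimate at t0, then refit by OLS on the kept support."""
    n, p = X.shape
    I = np.flatnonzero(np.abs(beta_init) > t0)   # selected coordinates
    beta = np.zeros(p)
    if I.size:
        # OLS restricted to the selected columns: (X_I^T X_I)^{-1} X_I^T Y
        beta[I] = np.linalg.lstsq(X[:, I], Y, rcond=None)[0]
    return beta, I
```

Iterating this (re-threshold the refit, refit again) yields the multi-step procedure; the experiments in Section 5 use only this single pass.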
Our algorithm is constructive in that it does not rely on the unknown parameters s, βmin := minj∈S |βj|, or those that characterize the incoherence conditions on X; instead, our choices of λn and the thresholding parameters depend only on σ, n, and p. [sent-41, score-0.299]
In our experiments, we apply only the first two steps, which we refer to as a two-step procedure; in particular, the Gauss-Dantzig selector is a two-step procedure with the Dantzig selector as βinit [5]. [sent-42, score-0.797]
Throughout this paper, we assume that n ≥ 2s and Λmin(2s) := min_{υ≠0, υ 2s-sparse} ‖Xυ‖₂² / (n‖υ‖₂²) > 0. (1.4) [sent-47, score-0.196]
It is clear that n ≥ 2s is necessary, as any submatrix with more than n columns must be singular. [sent-49, score-0.081]
As defined in [4], the s-restricted isometry constant δs of X is the smallest quantity such that (1 − δs)‖υ‖₂² ≤ ‖XTυ‖₂²/n ≤ (1 + δs)‖υ‖₂² for all T ⊆ {1, ..., p} with |T| ≤ s and all υ ∈ R|T|. [sent-51, score-0.117]
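For intuition, δs is determined by the extreme eigenvalues of the normalized Gram matrices of all column subsets of size at most s, so it can be computed by brute force for tiny designs; this exhaustive check is purely illustrative and only feasible for small p.

```python
import numpy as np
from itertools import combinations

def restricted_isometry_constant(X, s):
    """Brute-force s-restricted isometry constant of X (tiny p only):
    smallest d with (1-d)||u||^2 <= ||X_T u||^2 / n <= (1+d)||u||^2
    over all column subsets T with |T| <= s."""
    n, p = X.shape
    delta = 0.0
    for k in range(1, s + 1):
        for T in combinations(range(p), k):
            cols = list(T)
            G = X[:, cols].T @ X[:, cols] / n   # normalized Gram matrix of X_T
            eig = np.linalg.eigvalsh(G)         # ascending; extremes give the bound
            delta = max(delta, 1.0 - eig[0], eig[-1] - 1.0)
    return delta
```

A design with orthogonal columns of norm √n has δs = 0, while a duplicated column forces δ2 = 1.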
Consider the least squares estimator β̂I = (XIᵀXI)⁻¹XIᵀY, where |I| ≤ s, and consider the ideal least-squares estimator β⋄ = arg min_{I⊆{1,...,p}} E‖β − β̂I‖₂². [sent-66, score-0.658]
It follows from [5] that for Λmax(s) < ∞, E‖β⋄ − β‖₂² ≥ min(1, 1/Λmax(s)) Σ_{i=1}^p min(βi², σ²/n). (1.8) [sent-71, score-0.196]
These bounds are meaningful since Σ_{i=1}^p min(βi², σ²/n) = min_{I⊆{1,...,p}} (‖β − βI‖₂² + |I|σ²/n) represents the ideal squared bias plus variance. [sent-77, score-0.225]
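The identity behind these bounds — that Σᵢ min(βᵢ², σ²/n) equals the minimum over supports I of the squared bias ‖β − βI‖₂² plus the variance term |I|σ²/n — can be checked numerically; the brute-force version below is illustrative and only feasible for small p.

```python
import numpy as np

def ideal_risk(beta, sigma, n):
    """Closed form: the optimal support keeps exactly the coordinates with
    beta_i^2 > sigma^2/n, so the minimum is sum_i min(beta_i^2, sigma^2/n)."""
    return np.minimum(beta ** 2, sigma ** 2 / n).sum()

def ideal_risk_bruteforce(beta, sigma, n):
    """Minimum over all 2^p supports of squared bias + |I| sigma^2 / n."""
    p = len(beta)
    best = np.inf
    for mask in range(2 ** p):
        I = {i for i in range(p) if (mask >> i) & 1}
        bias = sum(beta[i] ** 2 for i in range(p) if i not in I)
        best = min(best, bias + len(I) * sigma ** 2 / n)
    return best
```

Each coordinate contributes independently, which is why the minimum factorizes into the componentwise min.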
25 We elaborate on conditions on the design, under which we accomplish these goals using the multi-step procedures in the rest of this section. [sent-81, score-0.094]
We now define a constant λσ,a,p for each a > 0, by which we bound the maximum correlation between the noise and the covariates of X; we apply it only to X with column ℓ2 norms bounded by √n. Let Ta := {ǫ : ‖Xᵀǫ/n‖∞ ≤ λσ,a,p}, where λσ,a,p = σ√(1 + a)·√(2 log p / n); hence [sent-82, score-0.278]
P(Ta) ≥ 1 − (√(π log p)·p^a)⁻¹ for a ≥ 0 (1.10); see [5]. [sent-84, score-0.152]
Theorem 1.1 shows that consistent variable selection is possible under the Restricted Eigenvalue conditions, as formalized in [2]. [sent-87, score-0.101]
Assumption 1.1 (Restricted Eigenvalue assumption RE(s, k0, X) [2]). For some integer 1 ≤ s ≤ p and a positive number k0, the following holds: min_{J0⊆{1,...,p}, |J0|≤s} min_{υ≠0, ‖υJ0c‖₁ ≤ k0‖υJ0‖₁} ‖Xυ‖₂ / (√n ‖υJ0‖₂) > 0. [sent-90, score-0.095]
If RE(s, k0, X) is satisfied with k0 ≥ 1, then the square submatrices of size ≤ 2s of XᵀX are necessarily positive definite (see [2]), and hence (1.4) holds. [sent-95, score-0.091]
31 Note that when s > n/2, it is impossible for the restricted eigenvalue assumption to hold as XI for any I such that |I| = 2s becomes singular in this case. [sent-99, score-0.181]
Then with probability at least P(Ta), the multi-step procedure returns β̂ such that S ⊆ I := supp(β̂), where |I \ S| < B2²/2 and ‖β̂ − β‖₂² ≤ 16λ²σ,a,p|I| / Λ²min(|I|) ≤ 32(1 + a)sσ²(1 + B2²/16) log p / (nΛ²min(2s)). [sent-109, score-0.332]
33 3 Our analysis builds upon the rate of convergence bounds for βinit derived in [2]. [sent-112, score-0.095]
The first implication of this work, and one of the motivations for analyzing the thresholding methods, is: under Assumption 1.1, one can obtain consistent variable selection for very significant values of s, if only a few extra variables are allowed to be included in the estimator β̂. [sent-113, score-0.217]
A natural question following Theorem 1.1 is: is there a good thresholding rule that enables us to obtain a sufficiently sparse estimator β̂ when some components of βS (and hence βmin) are well below σ/√n, which also satisfies the oracle inequality as in (1.14)? [sent-119, score-0.686]
Before we answer this question, we define s0 as the smallest integer such that Σ_{i=1}^p min(βi², λ²σ²) ≤ s0λ²σ², where λ = √(2 log p / n). [sent-121, score-0.165]
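The effective sparsity s0 can be computed directly from this definition; a small sketch (function name ours; the tiny tolerance is our own guard against floating-point round-up when the sum is an exact multiple of λ²σ²):

```python
import numpy as np

def effective_sparsity_s0(beta, sigma, n, p):
    """Smallest integer s0 with sum_i min(beta_i^2, lam^2 sigma^2) <= s0 lam^2 sigma^2,
    where lam = sqrt(2 log p / n)."""
    lam2s2 = (2.0 * np.log(p) / n) * sigma ** 2
    total = np.minimum(beta ** 2, lam2s2).sum()
    # small tolerance guards against floating-point round-up at exact multiples
    return int(np.ceil(total / lam2s2 - 1e-9))
```

For a vector with k entries far above the λσ level and the rest zero, s0 is exactly k; tiny nonzero entries still force s0 ≥ 1.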
Note that θs,s′ is non-decreasing in s and s′, and small values of θs,s′ indicate that disjoint subsets of covariates in XT and XT′ span nearly orthogonal subspaces. [sent-127, score-0.076]
39 5), when β is as sparse as required by the Dantzig selector to achieve such an oracle inequality [5]. [sent-130, score-0.584]
Theorem 1.2. Choose τ, a > 0 and set λn = λp,τσ, where λp,τ := (√(1 + a) + τ⁻¹)√(2 log p / n), in (1.3). [sent-135, score-0.079]
Let the threshold t0 be chosen from the range (C1λp,τσ, C4λp,τσ] for some constants C1, C4 to be defined. [sent-138, score-0.116]
Then with probability at least 1 − (√(π log p)·p^a)⁻¹, the Gauss-Dantzig selector β̂ selects a model I := supp(β̂) such that |I| ≤ 2s0, |I \ S| ≤ s0 ≤ s, and ‖β̂ − β‖₂² ≤ 2C3² log p (σ²/n + Σ_{i=1}^p min(βi², σ²/n)). [sent-139, score-0.601]
Finally, we briefly review related work on multi-step procedures and the role of sparsity in high-dimensional statistical inference. [sent-150, score-0.124]
Before this work, the hard thresholding idea had been shown in [5] (via the Gauss-Dantzig selector) as a method to correct the bias of the initial Dantzig selector. [sent-151, score-0.268]
45 The empirical success of the Gauss-Dantzig selector in terms of improving the statistical accuracy is strongly evident in their experimental results. [sent-152, score-0.435]
Our theoretical analysis of the oracle inequalities, which hold for the Gauss-Dantzig selector under a uniform uncertainty principle, is directly inspired by their analysis of the initial Dantzig selector under the same conditions. [sent-153, score-0.997]
47 The sparse recovery problem under arbitrary noise is also well studied, see [3, 15, 14]. [sent-157, score-0.238]
Although, as argued in [3, 14], the best accuracy under arbitrary noise has essentially been achieved in both works, their bounds are worse than those in [5] (and hence the present paper) under the stochastic noise discussed here; see [5] for further discussion. [sent-158, score-0.117]
Moreover, the greedy algorithms in [15, 14] require s as part of their input, while the iterative algorithms in the present paper have no such requirement and hence adapt well to the unknown sparsity level s. [sent-159, score-0.174]
50 A more general framework on multi-step variable selection was studied by [20]. [sent-160, score-0.101]
Finally, under a restricted eigenvalue condition slightly stronger than Assumption 1.1, [22] requires s = O(√(n / log p)) in order to achieve variable selection consistency using the adaptive Lasso [23] as the second-step procedure. [sent-163, score-0.163]
53 A thresholding framework for the general setting is described in Section 3, which also sketches the proof of Theorem 1. [sent-168, score-0.217]
54 Section 4 briefly discusses the relationship between linear sparsity and random design matrices. [sent-170, score-0.137]
55 Section 5 includes simulation results showing that our two-step procedure is consistent with our theoretical analysis on variable selection. [sent-171, score-0.089]
2 Thresholding procedure when βmin is large  We use a penalization parameter λn = Bλσ,a,p and assume βmin > Cλn√s for some constants B, C throughout this section; we first specify the thresholding parameters in this case. [sent-172, score-0.394]
It will be clear from Theorem 2.1 that our algorithm works under any conditions so long as the rate of convergence of the initial estimator obeys the bounds in (2.4). [sent-174, score-0.415]
We prove Theorem 2.1, given the rate of convergence bounds for βinit following derivations in [2]. [sent-179, score-0.095]
59 We obtain an initial estimator βinit using the Lasso or the Dantzig selector. [sent-181, score-0.257]
Assumption 2.1. Let λn ≥ Bλσ,a,p, where B ≥ 1 is a constant suitably chosen such that the initial estimator βinit satisfies on Ta, for υinit = βinit − β and some constants B0, B1: ‖υinit,S‖₂ ≤ B0λn√s and ‖υinit,Sc‖₁ ≤ B1λns. (2.4) [sent-186, score-0.3]
Then ‖β(i) − β‖₂ ≤ λσ,a,p√|Si| / Λmin(|Si|) ≤ λnB2√(2s) for i = 1, 2, where β(i) are the OLS estimators based on I = Si; finally, the Iterative Procedure includes the correct set of variables in S2, in the sense that S ⊆ S2 ⊆ S1 and |S2 \ S| := |supp(β̂) \ S| ≤ B1² / (16Λ²min(|S1|)) ≤ B2²/2. [sent-190, score-0.236]
We also note that in order to obtain S1 such that |S1| ≤ 2s and S1 ⊇ S, we only need to threshold βinit at t0 = B1λn√s; see Section 3. [sent-195, score-0.107]
3 A thresholding framework for the general setting  In this section, we wish to derive a meaningful criterion for consistency in variable selection when βmin is well below the noise level. [sent-197, score-0.37]
Suppose that we are given an initial estimator βinit that achieves the rate of convergence bound as in (1.14), which adapts nearly ideally to the uncertainty in the support set S and the "significant" set. [sent-198, score-0.315]
We show that although we cannot guarantee the presence of variables indexed by {j : |βj| < σ√(2 log p / n)} in the final set I (cf. (1.7)) due to their lack of strength, we wish to include the significant variables from S in I such that the OLS estimator based on I achieves almost the same ideal rate of convergence as βinit does, even though some variables from S are missing in I. [sent-200, score-0.119]
68 Here we pay a price for the missing variables in order to obtain a sparse model I. [sent-203, score-0.185]
First we obtain an initial estimator βinit using the Dantzig selector with λp,τ := (√(1 + a) + τ⁻¹)√(2 log p / n), where τ, a ≥ 0; we then threshold βinit at t0, chosen from the range (C1λp,τσ, C4λp,τσ], to obtain a set I of cardinality at most 2s (we prove a stronger result in Lemma 3.2). [sent-207, score-0.899]
In the second step, given a set I of cardinality at most 2s, we run the OLS regression to obtain β̂I = (XIᵀXI)⁻¹XIᵀY. [sent-214, score-0.092]
71 Theorem 2 in [5] has shown that the Dantzig selector achieves nearly the ideal level of MSE. [sent-216, score-0.519]
Choose τ, a > 0 and set λn = λp,τσ := (√(1 + a) + τ⁻¹)σ√(2 log p / n) in (1.3). [sent-222, score-0.079]
Then if β is s-sparse with δ2s + 2θs,2s < 1 − τ, the Dantzig selector obeys, with probability at least 1 − (√(π log p)·p^a)⁻¹, ‖β̂ − β‖₂² ≤ 2C2²(√(1 + a) + τ⁻¹)² log p (σ²/n + Σ_{i=1}^p min(βi², σ²/n)). [sent-224, score-0.865]
Recall that s0 is the smallest integer such that Σ_{i=1}^p min(βi², λ²σ²) ≤ s0λ²σ², where λ = √(2 log p / n). [sent-234, score-0.165]
Thus by definition of s0, as essentially shown in [5], 0 ≤ s0 ≤ s and s0λ²σ² ≤ λ²σ² + Σ_{i=1}^p min(βi², λ²σ²) ≤ 2 log p (σ²/n + Σ_{i=1}^p min(βi², σ²/n)). [sent-235, score-0.275]
It follows from Lemma 3.2 that thresholding at the level of Cλσ at step 1 selects a set I of at most 2s0 variables, among which at most s0 are from Sc. [sent-241, score-0.217]
Let βinit be the ℓ1-minimizer subject to the constraint ‖(1/n)Xᵀ(Y − Xβinit)‖∞ ≤ λp,τσ, for λ := √(2 log p / n) and λp,τ := (√(1 + a) + τ⁻¹)√(2 log p / n). [sent-244, score-0.158]
Given (3.3), choose a thresholding parameter t0 so that C4λp,τσ ≥ t0 > C1λp,τσ; set I = {j : |βj,init| > t0}. [sent-247, score-0.217]
Next we show that even if we miss some columns of X in S, we can still hope to achieve the rate of convergence required in Theorem 1.2. [sent-256, score-0.108]
We state in Lemma 3.3 a general result on the rate of convergence of the OLS estimator based on a chosen model I, where a subset of the relevant variables is missing. [sent-260, score-0.275]
Lemma 3.3 (OLS estimator with missing variables). Let D := {1, ..., p} \ I, and let SR := S ∩ D denote the relevant variables missing from I. [sent-262, score-0.2]
Then on Ta, for the least squares estimator based on I, β̂I = (XIᵀXI)⁻¹XIᵀY, it holds that ‖β̂I − β‖₂² ≤ ((θ|I|,|SR|‖βD‖₂ + λσ,a,p√|I|) / Λmin(|I|))² + ‖βD‖₂². [sent-267, score-0.218]
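The price paid for the missing variables in this bound comes from the classical omitted-variable bias of restricted OLS: in expectation, β̂I = βI + (XIᵀXI)⁻¹XIᵀXDβD. The sketch below (our own function name, with D taken as the omitted relevant columns) computes this bias and checks it against a noiseless fit.

```python
import numpy as np

def ols_missing_bias(X, beta, I, D):
    """Bias of the OLS fit on columns I when relevant columns D are omitted:
    E[beta_hat_I] - beta_I = (X_I^T X_I)^{-1} X_I^T X_D beta_D."""
    XI, XD = X[:, I], X[:, D]
    return np.linalg.solve(XI.T @ XI, XI.T @ (XD @ beta[D]))
```

When the selected and omitted columns are orthogonal the bias vanishes, which is the θ = 0 case of the lemma.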
It follows from (3.10) that we cannot cut too many "significant" variables; in particular, of those larger than λσ√s0, we can cut at most a constant number. [sent-276, score-0.094]
4 Linear sparsity and random matrices  A special case of design matrices that satisfy the Restricted Eigenvalue assumptions is that of random design matrices. [sent-277, score-0.181]
This is shown in a large body of work, for example [3, 4, 5, 1, 13], which establishes that the uniform uncertainty principle (UUP) holds for "generic" or random design matrices for very significant values of s. [sent-278, score-0.222]
86 In our simulations as shown in Section 5, exact recovery rate of the sparsity pattern is very high for a few types of random matrices using a two-step procedure, once the number of samples passes a certain threshold. [sent-284, score-0.241]
87 In an ongoing work, the author is exploring thresholding algorithms for a broader class of random designs that satisfy the Restricted Eigenvalue assumptions. [sent-289, score-0.217]
The two-step procedure requires far fewer samples than the ordinary Lasso. [sent-315, score-0.139]
(b), (c) show the probability of success of the two-step procedure under different levels of sparsity as n increases, for p = 512 and 1024 respectively; (d) the number of samples n increases almost linearly with s for p = 1024. [sent-316, score-0.215]
90 We show in Figure 1 that the two-step procedure indeed recovers a sparse model using a small number of samples per non-zero component in β when X is a Gaussian Ensemble. [sent-318, score-0.134]
91 We run under three cases of p = 256, 512, 1024; for each p, we increase the sparsity s by roughly equal steps from s = 0. [sent-320, score-0.093]
92 Y and X are then fed to the two-step procedure to obtain β. [sent-329, score-0.094]
93 We compare with the ordinary Lasso, for which we search over the full path of LARS [8] and always choose the β that best matches β in terms of support. [sent-332, score-0.082]
We use λn = 0.69√(2 log p / n) and threshold βinit at t0 = ft·√(s log p / n), where s = |S0| for S0 = {j : |βj,init| ≥ 0. [sent-334, score-0.261]
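A toy version of this experiment can be sketched as follows; note that the initializer here is crude marginal-correlation screening rather than the Lasso or Dantzig initializer actually used in the paper, and all constants (amplitude factor, threshold, seed) are our own choices.

```python
import numpy as np

def two_step_trial(n=400, p=64, s=4, sigma=1.0, f=9.0, seed=0):
    """One synthetic trial: Gaussian ensemble design, s-sparse beta with
    amplitude f*sigma*sqrt(2 log p / n), then screen, threshold, OLS-refit.
    Marginal-correlation screening stands in for the Lasso/Dantzig initializer."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    support = rng.choice(p, size=s, replace=False)
    amp = f * sigma * np.sqrt(2.0 * np.log(p) / n)
    beta[support] = amp
    Y = X @ beta + sigma * rng.standard_normal(n)

    beta_init = X.T @ Y / n                      # crude initial estimator
    I = np.flatnonzero(np.abs(beta_init) > amp / 2.0)
    beta_hat = np.zeros(p)
    if I.size:
        beta_hat[I] = np.linalg.lstsq(X[:, I], Y, rcond=None)[0]
    return beta, beta_hat, set(support.tolist()), set(I.tolist())
```

With this many samples per non-zero coordinate, the selected model contains the true support and the refit error is small, mirroring the qualitative behavior reported in Figure 1.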
The author thanks Larry Wasserman, Sara van de Geer and Peter Bühlmann for helpful discussions, comments, and their kind support throughout this work. [sent-338, score-0.096]
96 A simple proof of the restricted isometry property for random matrices. [sent-347, score-0.104]
97 High dimensional graphs and variable selection with the Lasso. [sent-410, score-0.139]
98 CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. [sent-427, score-0.233]
99 Signal recovery from incomplete and inaccurate measurements via regularized orthogonal matching pursuit. [sent-432, score-0.201]
100 Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1 -constrained quadratic programming. [sent-455, score-0.21]
simIndex simValue paperId paperTitle
same-paper 1 0.9999994 245 nips-2009-Thresholding Procedures for High Dimensional Variable Selection and Statistical Estimation
2 0.22130913 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers
Author: Sahand Negahban, Bin Yu, Martin J. Wainwright, Pradeep K. Ravikumar
Abstract: High-dimensional statistical inference deals with models in which the the number of parameters p is comparable to or larger than the sample size n. Since it is usually impossible to obtain consistent procedures unless p/n → 0, a line of recent work has studied models with various types of structure (e.g., sparse vectors; block-structured matrices; low-rank matrices; Markov assumptions). In such settings, a general approach to estimation is to solve a regularized convex program (known as a regularized M -estimator) which combines a loss function (measuring how well the model fits the data) with some regularization function that encourages the assumed structure. The goal of this paper is to provide a unified framework for establishing consistency and convergence rates for such regularized M estimators under high-dimensional scaling. We state one main theorem and show how it can be used to re-derive several existing results, and also to obtain several new results on consistency and convergence rates. Our analysis also identifies two key properties of loss and regularization functions, referred to as restricted strong convexity and decomposability, that ensure the corresponding regularized M -estimators have fast convergence rates. 1
3 0.15697215 225 nips-2009-Sparsistent Learning of Varying-coefficient Models with Structural Changes
Author: Mladen Kolar, Le Song, Eric P. Xing
Abstract: To estimate the changing structure of a varying-coefficient varying-structure (VCVS) model remains an important and open problem in dynamic system modelling, which includes learning trajectories of stock prices, or uncovering the topology of an evolving gene network. In this paper, we investigate sparsistent learning of a sub-family of this model — piecewise constant VCVS models. We analyze two main issues in this problem: inferring time points where structural changes occur and estimating model structure (i.e., model selection) on each of the constant segments. We propose a two-stage adaptive procedure, which first identifies jump points of structural changes and then identifies relevant covariates to a response on each of the segments. We provide an asymptotic analysis of the procedure, showing that with the increasing sample size, number of structural changes, and number of variables, the true model can be consistently selected. We demonstrate the performance of the method on synthetic data and apply it to the brain computer interface dataset. We also consider how this applies to structure estimation of time-varying probabilistic graphical models. 1
4 0.14440435 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization
Author: Alekh Agarwal, Martin J. Wainwright, Peter L. Bartlett, Pradeep K. Ravikumar
Abstract: Despite a large literature on upper bounds on complexity of convex optimization, relatively less attention has been paid to the fundamental hardness of these problems. Given the extensive use of convex optimization in machine learning and statistics, gaining a understanding of these complexity-theoretic issues is important. In this paper, we study the complexity of stochastic convex optimization in an oracle model of computation. We improve upon known results and obtain tight minimax complexity estimates for various function classes. We also discuss implications of these results for the understanding the inherent complexity of large-scale learning and estimation problems. 1
5 0.12693462 144 nips-2009-Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness
Author: Garvesh Raskutti, Bin Yu, Martin J. Wainwright
Abstract: We study minimax rates for estimating high-dimensional nonparametric regression models with sparse additive structure and smoothness constraints. More precisely, our goal is to estimate a function f ∗ : Rp → R that has an additive decomposition of the form ∗ ∗ f ∗ (X1 , . . . , Xp ) = j∈S hj (Xj ), where each component function hj lies in some class H of “smooth” functions, and S ⊂ {1, . . . , p} is an unknown subset with cardinality s = |S|. Given n i.i.d. observations of f ∗ (X) corrupted with additive white Gaussian noise where the covariate vectors (X1 , X2 , X3 , ..., Xp ) are drawn with i.i.d. components from some distribution P, we determine lower bounds on the minimax rate for estimating the regression function with respect to squared-L2 (P) error. Our main result is a lower bound on the minimax rate that scales as max s log(p/s) , s ǫ2 (H) . The first term reflects the sample size required for n n performing subset selection, and is independent of the function class H. The second term s ǫ2 (H) is an s-dimensional estimation term corresponding to the sample size required for n estimating a sum of s univariate functions, each chosen from the function class H. It depends linearly on the sparsity index s but is independent of the global dimension p. As a special case, if H corresponds to functions that are m-times differentiable (an mth -order Sobolev space), then the s-dimensional estimation term takes the form sǫ2 (H) ≍ s n−2m/(2m+1) . Either of n the two terms may be dominant in different regimes, depending on the relation between the sparsity and smoothness of the additive decomposition.
6 0.12260006 1 nips-2009-$L 1$-Penalized Robust Estimation for a Class of Inverse Problems Arising in Multiview Geometry
7 0.12149597 147 nips-2009-Matrix Completion from Noisy Entries
8 0.12108947 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties
9 0.11871736 185 nips-2009-Orthogonal Matching Pursuit From Noisy Random Measurements: A New Analysis
10 0.11692531 157 nips-2009-Multi-Label Prediction via Compressed Sensing
11 0.11534911 222 nips-2009-Sparse Estimation Using General Likelihoods and Non-Factorial Priors
12 0.11185057 67 nips-2009-Directed Regression
13 0.10596824 55 nips-2009-Compressed Least-Squares Regression
14 0.096829832 173 nips-2009-Nonparametric Greedy Algorithms for the Sparse Learning Problem
15 0.090937361 208 nips-2009-Robust Principal Component Analysis: Exact Recovery of Corrupted Low-Rank Matrices via Convex Optimization
16 0.079870187 213 nips-2009-Semi-supervised Learning using Sparse Eigenfunction Bases
17 0.079293855 36 nips-2009-Asymptotic Analysis of MAP Estimation via the Replica Method and Compressed Sensing
18 0.078613393 256 nips-2009-Which graphical models are difficult to learn?
19 0.078122593 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models
20 0.075681247 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output
simIndex simValue paperId paperTitle
same-paper 1 0.95868683 245 nips-2009-Thresholding Procedures for High Dimensional Variable Selection and Statistical Estimation
2 0.82611465 185 nips-2009-Orthogonal Matching Pursuit From Noisy Random Measurements: A New Analysis
Author: Sundeep Rangan, Alyson K. Fletcher
Abstract: A well-known analysis of Tropp and Gilbert shows that orthogonal matching pursuit (OMP) can recover a k-sparse n-dimensional real vector from m = 4k log(n) noise-free linear measurements obtained through a random Gaussian measurement matrix with a probability that approaches one as n → ∞. This work strengthens this result by showing that a lower number of measurements, m = 2k log(n − k), is in fact sufficient for asymptotic recovery. More generally, when the sparsity level satisfies kmin ≤ k ≤ kmax but is unknown, m = 2kmax log(n − kmin ) measurements is sufficient. Furthermore, this number of measurements is also sufficient for detection of the sparsity pattern (support) of the vector with measurement errors provided the signal-to-noise ratio (SNR) scales to infinity. The scaling m = 2k log(n − k) exactly matches the number of measurements required by the more complex lasso method for signal recovery in a similar SNR scaling.
3 0.82396042 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers
Author: Sahand Negahban, Bin Yu, Martin J. Wainwright, Pradeep K. Ravikumar
Abstract: High-dimensional statistical inference deals with models in which the the number of parameters p is comparable to or larger than the sample size n. Since it is usually impossible to obtain consistent procedures unless p/n → 0, a line of recent work has studied models with various types of structure (e.g., sparse vectors; block-structured matrices; low-rank matrices; Markov assumptions). In such settings, a general approach to estimation is to solve a regularized convex program (known as a regularized M -estimator) which combines a loss function (measuring how well the model fits the data) with some regularization function that encourages the assumed structure. The goal of this paper is to provide a unified framework for establishing consistency and convergence rates for such regularized M estimators under high-dimensional scaling. We state one main theorem and show how it can be used to re-derive several existing results, and also to obtain several new results on consistency and convergence rates. Our analysis also identifies two key properties of loss and regularization functions, referred to as restricted strong convexity and decomposability, that ensure the corresponding regularized M -estimators have fast convergence rates. 1
4 0.80068725 105 nips-2009-Grouped Orthogonal Matching Pursuit for Variable Selection and Prediction
Author: Grzegorz Swirszcz, Naoki Abe, Aurelie C. Lozano
Abstract: We consider the problem of variable group selection for least squares regression, namely, that of selecting groups of variables for best regression performance, leveraging and adhering to a natural grouping structure within the explanatory variables. We show that this problem can be efficiently addressed by using a certain greedy style algorithm. More precisely, we propose the Group Orthogonal Matching Pursuit algorithm (Group-OMP), which extends the standard OMP procedure (also referred to as “forward greedy feature selection algorithm” for least squares regression) to perform stage-wise group variable selection. We prove that under certain conditions Group-OMP can identify the correct (groups of) variables. We also provide an upperbound on the l∞ norm of the difference between the estimated regression coefficients and the true coefficients. Experimental results on simulated and real world datasets indicate that Group-OMP compares favorably to Group Lasso, OMP and Lasso, both in terms of variable selection and prediction accuracy. 1
5 0.72313231 173 nips-2009-Nonparametric Greedy Algorithms for the Sparse Learning Problem
Author: Han Liu, Xi Chen
Abstract: This paper studies the forward greedy strategy in sparse nonparametric regression. For additive models, we propose an algorithm called additive forward regression; for general multivariate models, we propose an algorithm called generalized forward regression. Both algorithms simultaneously conduct estimation and variable selection in nonparametric settings for the high dimensional sparse learning problem. Our main emphasis is empirical: on both simulated and real data, these two simple greedy methods can clearly outperform several state-ofthe-art competitors, including LASSO, a nonparametric version of LASSO called the sparse additive model (SpAM) and a recently proposed adaptive parametric forward-backward algorithm called Foba. We also provide some theoretical justifications of specific versions of the additive forward regression. 1
6 0.68856812 222 nips-2009-Sparse Estimation Using General Likelihoods and Non-Factorial Priors
7 0.67943066 36 nips-2009-Asymptotic Analysis of MAP Estimation via the Replica Method and Compressed Sensing
8 0.67193609 225 nips-2009-Sparsistent Learning of Varying-coefficient Models with Structural Changes
9 0.63731992 144 nips-2009-Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness
10 0.60782802 55 nips-2009-Compressed Least-Squares Regression
11 0.60292763 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization
12 0.5847342 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties
13 0.56517643 1 nips-2009-$L 1$-Penalized Robust Estimation for a Class of Inverse Problems Arising in Multiview Geometry
14 0.54592758 147 nips-2009-Matrix Completion from Noisy Entries
15 0.53295761 138 nips-2009-Learning with Compressible Priors
16 0.52552503 11 nips-2009-A General Projection Property for Distribution Families
17 0.52502233 180 nips-2009-On the Convergence of the Concave-Convex Procedure
18 0.52088898 79 nips-2009-Efficient Recovery of Jointly Sparse Vectors
19 0.50646108 157 nips-2009-Multi-Label Prediction via Compressed Sensing
20 0.49414179 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models
topicId topicWeight
[(24, 0.053), (25, 0.528), (35, 0.033), (36, 0.094), (39, 0.017), (57, 0.011), (58, 0.094), (71, 0.043), (86, 0.039), (91, 0.01)]
simIndex simValue paperId paperTitle
1 0.99131835 152 nips-2009-Measuring model complexity with the prior predictive
Author: Wolf Vanpaemel
Abstract: In the last few decades, model complexity has received a lot of press. While many methods have been proposed that jointly measure a model’s descriptive adequacy and its complexity, few measures exist that measure complexity in itself. Moreover, existing measures ignore the parameter prior, which is an inherent part of the model and affects the complexity. This paper presents a stand-alone measure for model complexity that takes the number of parameters, the functional form, the range of the parameters and the parameter prior into account. This Prior Predictive Complexity (PPC) is an intuitive and easy-to-compute measure. It starts from the observation that model complexity is the property of the model that enables it to fit a wide range of outcomes. The PPC then measures how wide this range exactly is.
keywords: Model Selection & Structure Learning; Model Comparison Methods; Perception
The recent revolution in model selection methods in the cognitive sciences was driven to a large extent by the observation that computational models can differ in their complexity. Differences in complexity put models on unequal footing when their ability to approximate empirical data is assessed. Therefore, models should be penalized for their complexity when their adequacy is measured. The balance between descriptive adequacy and complexity has been termed generalizability [1, 2]. Much attention has been devoted to developing, advocating, and comparing different measures of generalizability (for a recent overview, see [3]). In contrast, measures of complexity have received relatively little attention. The aim of the current paper is to propose and illustrate a stand-alone measure of model complexity, called the Prior Predictive Complexity (PPC). The PPC is based on the intuitive idea that a complex model can predict many outcomes and a simple model can predict a few outcomes only. 
First, I discuss existing approaches to measuring model complexity and note some of their limitations. In particular, I argue that currently existing measures ignore one important aspect of a model: the prior distribution it assumes over the parameters. I then introduce the PPC, which, unlike the existing measures, is sensitive to the parameter prior. Next, the PPC is illustrated by calculating the complexities of two popular models of information integration. 1 Previous approaches to measuring model complexity A first approach to assess the (relative) complexity of models relies on simulated data. Simulation-based methods differ in how these artificial data are generated. A first, atheoretical approach uses random data [4, 5]. In the semi-theoretical approach, the data are generated from some theoretically interesting functions, such as the exponential or the logistic function [4]. Using these approaches, the models under consideration are equally complex if each model provides the best optimal fit to roughly the same number of data sets. A final approach to generating artificial data is a theoretical one, in which the data are generated from the models of interest themselves [6, 7]. The parameter sets used in the generation can either be hand-picked by the researcher, estimated from empirical data or drawn from a previously specified distribution. If the models under consideration are equally complex, each model should provide the best optimal fit to self-generated data more often than the other models under consideration do. One problem with this simulation-based approach is that it is very labor intensive. It requires generating a large number of artificial data sets, and fitting the models to all these data sets. Further, it relies on choices that are often made in an arbitrary fashion that nonetheless bias the results. For example, in the semi-theoretical approach, a crucial choice is which functions to use. (I am grateful to Michael Lee and Liz Bonawitz.)
Similarly, in the theoretical approach, results are heavily influenced by the parameter values used in generating the data. If they are fixed, on what basis? If they are estimated from empirical data, from which data? If they are drawn randomly, from which distribution? Further, a simulation study only gives a rough idea of complexity differences but provides no direct measure reflecting the complexity. A number of proposals have been made to measure model complexity more directly. Consider a model M with k parameters, summarized in the parameter vector θ = (θ1, θ2, . . . , θk), which has a range indicated by Ω. Let d denote the data and p(d|θ, M) the likelihood. The most straightforward measure of model complexity is the parametric complexity (PC), which simply counts the number of parameters: PC = k. (1) PC is attractive as a measure of model complexity since it is very easy to calculate. Further, it has a direct and well-understood relation to complexity: the more parameters, the more complex the model. It is included as the complexity term of several generalizability measures such as AIC [8] and BIC [9], and it is at the heart of the Likelihood Ratio Test. Despite this intuitive appeal, PC is not free from problems. One problem with PC is that it reflects only a single aspect of complexity. Also the parameter range and the functional form (the way the parameters are combined in the model equation) influence a model’s complexity, but these dimensions of complexity are ignored in PC [2, 6]. A complexity measure that takes these three dimensions into account is provided by the geometric complexity (GC) measure, which is inspired by differential geometry [10]. In GC, complexity is conceptualized as the number of distinguishable probability distributions a model can generate. 
It is defined by

GC = (k/2) ln(n/2π) + ln ∫_Ω √(det I(θ|M)) dθ,   (2)

where n indicates the size of the data sample and I(θ) is the Fisher Information Matrix:

Iij(θ|M) = −Eθ [ ∂² ln p(d|θ, M) / ∂θi ∂θj ].   (3)

Note that I(θ|M) is determined by the likelihood function p(d|θ, M), which is in turn determined by the model equation. Hence GC is sensitive to the number of parameters (through k), the functional form (through I), and the range (through Ω). Quite surprisingly, GC turns out to be equal to the complexity term used in one version of Minimum Description Length (MDL), a measure of generalizability developed within the domain of information theory [2, 11, 12, 13]. GC contrasts favorably with PC, in the sense that it takes three dimensions of complexity into account rather than a single one. A major drawback of GC is that, unlike PC, it requires considerable technical sophistication to be computed, as it relies on the second derivative of the likelihood. A more important limitation of both PC and GC is that these measures are insensitive to yet another important dimension contributing to model complexity: the prior distribution over the model parameters. The relation between the parameter prior distribution and model complexity is discussed next. 2 Model complexity and the parameter prior The growing popularity of Bayesian methods in psychology has not only raised awareness that model complexity should be taken into account when testing models [6], it has also drawn attention to the fact that in many occasions, relevant prior information is available [14]. In Bayesian methods, there is room to incorporate this information in two different flavors: as a prior distribution over the models, or as a prior distribution over the parameters. Specifying a model prior is a daunting task, so almost invariably, the model prior is taken to be uniform (but see [15] for an exception). 
In contrast, information regarding the parameter is much easier to include, although still challenging (e.g., [16]). There are two ways to formalize prior information about a model’s parameters: using the parameter prior range (often referred to as simply the range) and using the parameter prior distribution (often referred to as simply the prior). The prior range indicates which parameter values are allowed and which are forbidden. The prior distribution indicates which parameter values are likely and which are unlikely. Models that share the same equation and the same range but differ in the prior distribution can be considered different models (or at least different model versions), just like models that share the same equation but differ in range are different model versions. Like the parameter prior range, the parameter prior distribution influences the model complexity. In general, a model with a vague parameter prior distribution is more complex than a model with a sharply peaked parameter prior distribution, much as a model with a broad-ranged parameter is more complex than the same model where the parameter is heavily restricted. To drive home the point that the parameter prior should be considered when model complexity is assessed, consider the following “fair coin” model Mf and a “biased coin” model Mb. There is a clear intuitive complexity difference between these models: Mb is more complex than Mf. The most straightforward way to formalize these models is as follows, where ph denotes the probability of observing heads:

ph = 1/2   (4)

for model Mf, and

ph = θ, 0 ≤ θ ≤ 1, p(θ) = 1,   (5)

where the triplet of equations in (5) jointly defines model Mb. The range forbids values smaller than 0 or greater than 1 because ph is a proportion. As Mf and Mb have a different number of parameters, both PC and GC, being sensitive to the number of parameters, pick up the difference in model complexity between the models. 
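The intuition behind Equations 4 and 5 can be checked with a small simulation of each model's prior predictive distribution over head counts; this is an illustrative sketch only (the trial count of 20 and the simulation size are arbitrary choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_sims = 20, 50_000

# Fair-coin model Mf: theta is fixed at 1/2 (a Dirac prior, Equation 4).
heads_f = rng.binomial(n_trials, 0.5, size=n_sims)

# Biased-coin model Mb: theta ~ Uniform(0, 1), as in Equation 5.
theta_b = rng.uniform(0.0, 1.0, size=n_sims)
heads_b = rng.binomial(n_trials, theta_b)

# Mb spreads prior predictive mass over every outcome 0..20 (uniformly, in
# fact, by the beta-binomial identity), while Mf concentrates around 10.
print(heads_f.std(), heads_b.std())
```

The much larger spread of `heads_b` is exactly the extra complexity of Mb that PC and GC can miss when both models are written in the shared form of Equation 6.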
Alternatively, model Mf could be defined as follows:

ph = θ, 0 ≤ θ ≤ 1, p(θ) = δ(θ − 1/2),   (6)

where δ(x) is the Dirac delta. Note that the model formalized in Equation 6 is exactly identical to the model formalized in Equation 4. However, relying on the formulation of model Mf in Equation 6, PC and GC now judge Mf and Mb to be equally complex: both models share the same model equation (which implies they have the same number of parameters and the same functional form) and the same range for the parameter. Hence, PC and GC make an incorrect judgement of the complexity difference between both models. This misjudgement is a direct result of the insensitivity of these measures to the parameter prior. As models Mf and Mb have different prior distributions over their parameter, a measure sensitive to the prior would pick up the complexity difference between these models. Such a measure is introduced next. 3 The Prior Predictive Complexity Model complexity refers to the property of the model that enables it to predict a wide range of data patterns [2]. The idea of the PPC is to measure how wide this range exactly is. A complex model can predict many outcomes, and a simple model can predict a few outcomes only. Model simplicity, then, refers to the property of placing restrictions on the possible outcomes: the greater the restrictions, the greater the simplicity. To understand how model complexity is measured in the PPC, it is useful to think about the universal interval (UI) and the predicted interval (PI). The universal interval is the range of outcomes that could potentially be observed, irrespective of any model. For example, in an experiment with n binomial trials, it is impossible to observe less than zero successes, or more than n successes, so the range of possible outcomes is [0, n]. Similarly, the universal interval for a proportion is [0, 1]. The predicted interval is the interval containing all outcomes the model predicts. 
An intuitive way to gauge model complexity is then the cardinality of the predicted interval, relative to the cardinality of the universal interval, averaged over all m conditions or stimuli:

PPC = (1/m) Σ_{i=1}^{m} |PIi| / |UIi|.   (7)

A key aspect of the PPC is deriving the predicted interval. For a parameterized likelihood-based model, prediction takes the form of a distribution over all possible outcomes for some future, yet-to-be-observed data d under some model M. This distribution is called the prior predictive distribution (ppd) and can be calculated using the law of total probability:

p(d|M) = ∫_Ω p(d|θ, M) p(θ|M) dθ.   (8)

Predicting the probability of unseen future data d arising under the assumption that model M is true involves integrating the probability of the data for each of the possible parameter values, p(d|θ, M), as weighted by the prior probability of each of these values, p(θ|M). Note that the ppd relies on the number of parameters (through the number of integrals and the likelihood), the model equation (through the likelihood), and the parameter range (through Ω). Therefore, like GC, the PPC is sensitive to all these aspects. In contrast to GC, however, the ppd, and hence the PPC, also relies on the parameter prior. Since predictions are made probabilistically, virtually all outcomes will be assigned some prior weight. This implies that, in principle, the predicted interval equals the universal interval. However, for some outcomes the assigned weight will be extremely small. Therefore, it seems reasonable to restrict the predicted interval to the smallest interval that includes some predetermined amount of the prior mass. For example, the 95% predictive interval is defined by those outcomes with the highest prior mass that together make up 95% of the prior mass. Analytical solutions to the integral defining the ppd are rarely available. Instead, one should rely on approximations to the ppd by drawing samples from it.
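The sampling-based route to Equations 7 and 8 can be sketched by Monte Carlo for a single binomial condition; the Beta priors and the sizes below are hypothetical choices for illustration, not values from the paper:

```python
import numpy as np

def predicted_interval_size(samples, n_outcomes, mass=0.95):
    """Size of the smallest outcome set holding at least `mass` of the
    sampled prior predictive distribution (highest-mass outcomes first)."""
    probs = np.bincount(samples, minlength=n_outcomes) / len(samples)
    cum = np.cumsum(np.sort(probs)[::-1])          # greedily add likeliest outcomes
    return int(np.searchsorted(cum, mass)) + 1

rng = np.random.default_rng(1)
n_trials, n_sims = 20, 100_000
universal = n_trials + 1                           # outcomes 0..n_trials

def ppc_one_condition(a, b):
    theta = rng.beta(a, b, size=n_sims)            # draw from the parameter prior
    d = rng.binomial(n_trials, theta)              # draws from the ppd (Eq. 8)
    return predicted_interval_size(d, universal) / universal   # |PI| / |UI|

ppc_vague = ppc_one_condition(1, 1)    # Uniform(0, 1) prior: near-maximal complexity
ppc_sharp = ppc_one_condition(50, 50)  # sharply peaked prior: much simpler model
print(ppc_vague, ppc_sharp)
```

The vague-prior model's 95% predicted interval covers almost the whole universal interval, while the sharply peaked prior yields a much narrower one, which is the core PPC intuition.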
In the current study, sampling was performed using WinBUGS [17, 18], a highly versatile, user friendly, and freely available software package. It contains sophisticated and relatively general-purpose Markov Chain Monte Carlo (MCMC) algorithms to sample from any distribution of interest. 4 An application example The PPC is illustrated by comparing the complexity of two popular models of information integration, which attempt to account for how people merge potentially ambiguous or conflicting information from various sensorial sources to create subjective experience. These models either assume that the sources of information are combined additively (the Linear Integration Model; LIM; [19]) or multiplicatively (the Fuzzy Logical Model of Perception; FLMP; [20, 21]). 4.1 Information integration tasks A typical information integration task exposes participants simultaneously to different sources of information and requires this combined experience to be identified in a forced-choice identification task. The presented stimuli are generated from a factorial manipulation of the sources of information by systematically varying the ambiguity of each of the sources. The relevant empirical data consist 4 of, for each of the presented stimuli, the counts km of the number of times the mth stimulus was identified as one of the response alternatives, out of the tm trials on which it was presented. For example, an experiment in phonemic identification could involve two phonemes to be identified, /ba/ and /da/ and two sources of information, auditory and visual. Stimuli are created by crossing different levels of audible speech, varying between /ba/ and /da/, with different levels of visible speech, also varying between these alternatives. The resulting set of stimuli spans a continuum between the two syllables. The participant is then asked to listen and to watch the speaker, and based on this combined audiovisual experience, to identify the syllable as being either /ba/ or /da/. 
In the so-called expanded factorial design, not only bimodal stimuli (containing both auditory and visual information) but also unimodal stimuli (providing only a single source of information) are presented. 4.2 Information integration models In what follows, the formal description of the LIM and the FLMP is outlined for a design with two response alternatives (/da/ or /ba/) and two sources (auditory and visual), with I and J levels, respectively. In such a two-choice identification task, the counts km follow a Binomial distribution:

km ∼ Binomial(pm, tm),   (9)

where pm indicates the probability that the mth stimulus is identified as /da/. 4.2.1 Model equation The probability for the stimulus constructed with the ith level of the first source and the jth level of the second being identified as /da/ is computed according to the choice rule:

pij = s(ij, /da/) / [s(ij, /da/) + s(ij, /ba/)],   (10)

where s(ij, /da/) represents the overall degree of support for the stimulus to be /da/. The sources of information are assumed to be evaluated independently, implying that different parameters are used for the different modalities. In the present example, the degree of auditory support for /da/ is denoted by ai (i = 1, . . . , I) and the degree of visual support for /da/ by bj (j = 1, . . . , J). When a unimodal stimulus is presented, the overall degree of support for each alternative is given by s(i∗, /da/) = ai and s(∗j, /da/) = bj, where the asterisk (∗) indicates the absence of information, implying that Equation 10 reduces to

pi∗ = ai and p∗j = bj.   (11)

When a bimodal stimulus is presented, the overall degree of support for each alternative is based on the integration or blending of both these sources. Hence, for bimodal stimuli, s(ij, /da/) = ai ∘ bj, where the operator ∘ denotes the combination of both sources. Hence, Equation 10 reduces to

pij = (ai ∘ bj) / [(ai ∘ bj) + ((1 − ai) ∘ (1 − bj))].   (12)

The LIM assumes an additive combination, i.e., ∘ = +, so Equation 12 becomes

pij = (ai + bj) / 2.   (13)

The FLMP, in contrast, assumes a multiplicative combination, i.e., ∘ = ×, so Equation 12 becomes

pij = ai bj / [ai bj + (1 − ai)(1 − bj)].   (14)

4.2.2 Parameter prior range and distribution Each level of auditory and visual support for /da/ (i.e., ai and bj, respectively) is associated with a free parameter, which implies that the FLMP and the LIM have an equal number of free parameters, I + J. Each of these parameters is constrained to satisfy 0 ≤ ai, bj ≤ 1. The original formulations of the LIM and FLMP unfortunately left the parameter priors unspecified. However, an implicit assumption that has been commonly used is a uniform prior for each of the parameters. This assumption implicitly underlies classical and widely adopted methods for model evaluation using accounted percentage of variance or maximum likelihood:

ai ∼ Uniform(0, 1) and bj ∼ Uniform(0, 1) for i = 1, . . . , I; j = 1, . . . , J.   (15)

The models relying on this set of uniform priors will be referred to as LIMu and FLMPu. Note that LIMu and FLMPu treat the different parameters as independent. This approach misses important information. In particular, the experimental design is such that the amount of support for each level i + 1 is always higher than for level i. Because parameter ai (or bi) corresponds to the degree of auditory (or visual) support for a unimodal stimulus at the ith level, it seems reasonable to expect the following orderings among the parameters to hold (see also [6]):

aj > ai and bj > bi for j > i.   (16)

The models relying on this set of ordered priors will be referred to as LIMo and FLMPo. 4.3 Complexity and experimental design It is tempting to consider model complexity as an inherent characteristic of a model. For some models and for some measures of complexity this is clearly the case. Consider, for example, model Mb. 
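Equations 13 and 14 amount to two one-line combination rules; a minimal sketch follows, in which the support values are made up for illustration (in the paper they are free parameters, one per level):

```python
import numpy as np

def lim(a, b):
    """Linear Integration Model: additive combination (Equation 13)."""
    return (a + b) / 2.0

def flmp(a, b):
    """Fuzzy Logical Model of Perception: multiplicative combination (Equation 14)."""
    return a * b / (a * b + (1.0 - a) * (1.0 - b))

# Hypothetical support values for an I = 3 by J = 2 design.
a = np.array([0.2, 0.5, 0.8])   # auditory support for /da/, one value per level
b = np.array([0.3, 0.7])        # visual support for /da/

p_lim = lim(a[:, None], b[None, :])     # I x J matrix of P(/da/) per bimodal stimulus
p_flmp = flmp(a[:, None], b[None, :])

# A fully ambiguous source (a = 0.5) leaves FLMP's answer entirely to the
# other source: flmp(0.5, b) == b, whereas lim(0.5, b) == (0.5 + b) / 2.
print(p_lim)
print(p_flmp)
```

A notable qualitative difference: FLMP lets an unambiguous source dominate (support near 0 or 1 pushes pij to the extreme), while LIM always averages.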
In any experimental design (i.e., any number of coin tosses), the PC of Mb equals 1. However, more generally, this is not the case. Focusing on the FLMP and the LIM, it is clear that even a simple measure such as PC depends crucially on (some aspects of) the experimental design. In particular, every level corresponds to a new parameter, so PC = I + J. Similarly, GC is dependent on design choices. The PPC is not different in this respect. The design sensitivity implies that one can only make sensible conclusions about differences in model complexity by using different designs. In an information integration task, the design decisions include the type of design (expanded or not), the number of sources, the number of response alternatives, the number of levels for each source, and the number of observations for each stimulus (sample size). The present study focuses on the expanded factorial designs with two sources and two response alternatives. The additional design features were varied: both a 5 × 5 and an 8 × 2 design were considered, using three different sample sizes (20, 60 and 150, following [2]). 4.4 Results Figure 1 shows the 99% predicted interval in the 8 × 2 design with n = 150. Each panel corresponds to a different model. In each panel, each of the 26 stimuli is displayed on the x-axis. The first eight stimuli correspond to the stimuli with the lowest level of visual support, and are ordered in increasing order of auditory support. The next eight stimuli correspond to the stimuli with the highest level of visual support. The next eight stimuli correspond to the unimodal stimuli where only auditory information is provided (again ranked in increasing order). The final two stimuli are the unimodal visual stimuli. Panel A shows that the predicted interval of LIMu nearly equals the universal interval, ranging between 0 and 1. This indicates that almost all outcomes are given a non-negligible prior mass by LIMu, making it almost maximally complex. FLMPu is even more complex. 
The predicted interval, shown in Panel B, virtually equals the universal interval, indicating that the model predicts virtually every possible outcome. Panels C and D show the dramatic effect of incorporating relevant prior information in the models. The predicted intervals of both LIMo and FLMPo are much smaller than their counterparts using the uniform priors. Focusing on the comparison between LIM and FLMP, the PPC indicates that the latter is more complex than the former. This observation holds irrespective of the model version (assuming uniform vs. ordered priors). The smaller complexity of LIM is in line with previous attempts to measure the relative complexities of LIM and FLMP, such as the atheoretical simulation-based approach ([4] but see [5]), the semi-theoretical simulation-based approach [4], the theoretical simulation-based approach [2, 6, 22], and a direct computation of the GC [2].

[Figure 1: The 99% predicted interval for each of the 26 stimuli (x-axis) according to LIMu (Panel A), FLMPu (Panel B), LIMo (Panel C), and FLMPo (Panel D).]

Table 1: PPC, based on the 99% predicted interval, for four models across six different designs.

         5×5                     8×2
         n=20   n=60   n=150    n=20   n=60   n=150
LIMu     0.97   0.94   0.97     0.95   0.93   0.94
FLMPu    1      1      1        1      0.99   0.99
LIMo     0.75   0.67   0.77     0.69   0.64   0.66
FLMPo    0.83   0.80   0.86     0.82   0.78   0.81

The PPCs for all six designs considered are displayed in Table 1. It shows that the observations made for the 8 × 2, n = 150 design hold across the five remaining designs as well: LIM is simpler than FLMP; and models assuming ordered priors are simpler than models assuming uniform priors. 
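A rough Monte Carlo version of this analysis can be sketched as follows. It departs from the paper's setup in that only the I × J bimodal stimuli are used (no unimodal stimuli) and the simulation sizes are arbitrary, so it will not reproduce Table 1's numbers, only the qualitative ordering:

```python
import numpy as np

rng = np.random.default_rng(2)
I, J, t, sims = 8, 2, 150, 20_000   # 8 x 2 design, 150 trials per stimulus

def ppc(combine, ordered, mass=0.99):
    """Monte Carlo PPC (Eq. 7): average relative size of the 99%
    predicted interval over all I*J bimodal stimuli."""
    a = rng.uniform(0, 1, (sims, I))       # uniform priors (Eq. 15)
    b = rng.uniform(0, 1, (sims, J))
    if ordered:                            # ordered priors: sort each draw (Eq. 16)
        a.sort(axis=1); b.sort(axis=1)
    p = combine(a[:, :, None], b[:, None, :])   # sims x I x J identification probs
    k = rng.binomial(t, p)                      # simulated counts from the ppd
    widths = []
    for i in range(I):
        for j in range(J):
            probs = np.bincount(k[:, i, j], minlength=t + 1) / sims
            cum = np.cumsum(np.sort(probs)[::-1])
            widths.append((np.searchsorted(cum, mass) + 1) / (t + 1))
    return float(np.mean(widths))

lim = lambda a, b: (a + b) / 2                              # Eq. 13
flmp = lambda a, b: a * b / (a * b + (1 - a) * (1 - b))     # Eq. 14

for name, f, o in [("LIMu", lim, False), ("FLMPu", flmp, False),
                   ("LIMo", lim, True), ("FLMPo", flmp, True)]:
    print(name, round(ppc(f, o), 2))
```

Even in this stripped-down form, the ordered-prior variants come out with markedly smaller PPC values than the uniform-prior variants, mirroring the table's second conclusion.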
Note that these conclusions would not have been possible based on PC or GC. For PC, all four models have the same complexity. GC, in contrast, would detect complexity differences between LIM and FLMP (i.e., the first conclusion), but due to its insensitivity to the parameter prior, the complexity differences between LIMu and LIMo on the one hand, and FLMPu and FLMPo on the other hand (i.e., the second conclusion) would have gone unnoticed. 5 Discussion A theorist defining a model should clearly and explicitly specify at least the three following pieces of information: the model equation, the parameter prior range, and the parameter prior distribution. If any of these pieces is missing, the model should be regarded as incomplete, and therefore untestable. Consequently, any measure of generalizability should be sensitive to all three aspects of the model definition. Many currently popular generalizability measures do not satisfy this criterion, including AIC, BIC and MDL. A measure of generalizability that does take these three aspects of a model into account is the marginal likelihood [6, 7, 14, 23]. Often, the marginal likelihood is criticized exactly for its sensitivity to the prior range and distribution (e.g., [24]). However, in the light of the fact that the prior is a part of the model definition, I see the sensitivity of the marginal likelihood to the prior as an asset rather than a nuisance. It is precisely the measures of generalizability that are insensitive to the prior that miss an important aspect of the model. Similarly, any stand alone measure of model complexity should be sensitive to all three aspects of the model definition, as all three aspects contribute to the model’s complexity (with the model equation contributing two factors: the number of parameters and the functional form). Existing measures of complexity do not satisfy this requirement and are therefore incomplete. 
PC takes only part of the model equation into account, whereas GC takes only the model equation and the range into account. In contrast, the PPC currently proposed is sensitive to all these three aspects. It assesses model complexity using the predicted interval which contains all possible outcomes a model can generate. A narrow predicted interval (relative to the universal interval) indicates a simple model; a complex model is characterized by a wide predicted interval. There is a tight coupling between the notions of information, knowledge and uncertainty, and the notion of model complexity. As parameters correspond to unknown variables, having more information available leads to fewer parameters and hence to a simpler model. Similarly, the more information there is available, the sharper the parameter prior, implying a simpler model. To put it differently, the less uncertainty present in a model, the narrower its predicted interval, and the simpler the model. For example, in model Mb , there is maximal uncertainty. Nothing but the range is known about θ, so all values of θ are equally likely. In contrast, in model Mf , there is minimal uncertainty. In fact, ph is known for sure, so only a single value of θ is possible. This difference in uncertainty is translated in a difference in complexity. The same is true for the information integration models. Incorporating the order constraints in the priors reduces the uncertainty compared to the models without these constraints (it tells you, for example, that parameter a1 is smaller than a2 ). This reduction in uncertainty is reflected by a smaller complexity. There are many different sources of prior information that can be translated in a range or distribution. The illustration using the information integration models highlighted that prior information can reflect meaningful information in the design. Alternatively, priors can be informed by previous applications of similar models in similar settings. 
Probably the purest form of priors are those that translate theoretical assumptions made by a model (see [16]). The fact that it is often difficult to formalize this prior information may not be used as an excuse to leave the prior unspecified. Sure it is a challenging task, but so is translating theoretical assumptions into the model equation. Formalizing theory, intuitions, and information is what model building is all about. 8 References [1] Myung, I. J. (2000) The importance of complexity in model selection. Journal of Mathematical Psychology, 44, 190–204. [2] Pitt, M. A., Myung, I. J., and Zhang, S. (2002) Toward a method of selecting among computational models of cognition. Psychological Review, 109, 472–491. [3] Shiffrin, R. M., Lee, M. D., Kim, W., and Wagenmakers, E. J. (2008) A survey of model evaluation approaches with a tutorial on hierarchical Bayesian methods. Cognitive Science, 32, 1248–1284. [4] Cutting, J. E., Bruno, N., Brady, N. P., and Moore, C. (1992) Selectivity, scope, and simplicity of models: A lesson from fitting judgments of perceived depth. Journal of Experimental Psychology: General, 121, 364–381. [5] Dunn, J. (2000) Model complexity: The fit to random data reconsidered. Psychological Research, 63, 174–182. [6] Myung, I. J. and Pitt, M. A. (1997) Applying Occam’s razor in modeling cognition: A Bayesian approach. Psychonomic Bulletin & Review, 4, 79–95. [7] Vanpaemel, W. and Storms, G. (in press) Abstraction and model evaluation in category learning. Behavior Research Methods. [8] Akaike, H. (1973) Information theory and an extension of the maximum likelihood principle. Petrov, B. and Csaki, B. (eds.), Second International Symposium on Information Theory, pp. 267–281, Academiai Kiado. [9] Schwarz, G. (1978) Estimating the dimension of a model. Annals of Statistics, 6, 461–464. [10] Myung, I. J., Balasubramanian, V., and Pitt, M. A. (2000) Counting probability distributions: Differential geometry and model selection. 
Proceedings of the National Academy of Sciences, 97, 11170–11175.
[11] Lee, M. D. (2002) Generating additive clustering models with minimal stochastic complexity. Journal of Classification, 19, 69–85.
[12] Rissanen, J. (1996) Fisher information and stochastic complexity. IEEE Transactions on Information Theory, 42, 40–47.
[13] Grünwald, P. (2000) Model selection based on minimum description length. Journal of Mathematical Psychology, 44, 133–152.
[14] Lee, M. D. and Wagenmakers, E. J. (2005) Bayesian statistical inference in psychology: Comment on Trafimow (2003). Psychological Review, 112, 662–668.
[15] Lee, M. D. and Vanpaemel, W. (2008) Exemplars, prototypes, similarities and rules in category representation: An example of hierarchical Bayesian analysis. Cognitive Science, 32, 1403–1424.
[16] Vanpaemel, W. and Lee, M. D. (submitted) Using priors to formalize theory: Optimal attention and the generalized context model.
[17] Lee, M. D. (2008) Three case studies in the Bayesian analysis of cognitive models. Psychonomic Bulletin & Review, 15, 1–15.
[18] Spiegelhalter, D., Thomas, A., Best, N., and Lunn, D. (2004) WinBUGS User Manual Version 2.0. Medical Research Council Biostatistics Unit, Institute of Public Health, Cambridge.
[19] Anderson, N. H. (1981) Foundations of Information Integration Theory. Academic Press.
[20] Oden, G. C. and Massaro, D. W. (1978) Integration of featural information in speech perception. Psychological Review, 85, 172–191.
[21] Massaro, D. W. (1998) Perceiving Talking Faces: From Speech Perception to a Behavioral Principle. MIT Press.
[22] Massaro, D. W., Cohen, M. M., Campbell, C. S., and Rodriguez, T. (2001) Bayes factor of model selection validates FLMP. Psychonomic Bulletin & Review, 8, 1–17.
[23] Kass, R. E. and Raftery, A. E. (1995) Bayes factors. Journal of the American Statistical Association, 90, 773–795.
[24] Liu, C. C. and Aitkin, M. (2008) Bayes factors: Prior sensitivity and model generalizability. Journal of Mathematical Psychology, 53, 362–375.
2 0.9728238 201 nips-2009-Region-based Segmentation and Object Detection
Author: Stephen Gould, Tianshi Gao, Daphne Koller
Abstract: Object detection and multi-class image segmentation are two closely related tasks that can be greatly improved when solved jointly by feeding information from one task to the other [10, 11]. However, current state-of-the-art models use a separate representation for each task, making joint inference clumsy and leaving the classification of many parts of the scene ambiguous. In this work, we propose a hierarchical region-based approach to joint object detection and image segmentation. Our approach simultaneously reasons about pixels, regions and objects in a coherent probabilistic model. Pixel appearance features allow us to perform well on classifying amorphous background classes, while the explicit representation of regions facilitates the computation of more sophisticated features necessary for object detection. Importantly, our model gives a single unified description of the scene: we explain every pixel in the image and enforce global consistency between all random variables in our model. We run experiments on the challenging Street Scene dataset [2] and show significant improvement over state-of-the-art results for object detection accuracy.
3 0.96861845 160 nips-2009-Multiple Incremental Decremental Learning of Support Vector Machines
Author: Masayuki Karasuyama, Ichiro Takeuchi
Abstract: We propose a multiple incremental decremental algorithm for the Support Vector Machine (SVM). The conventional single incremental decremental SVM can update the trained model efficiently when a single data point is added to or removed from the training set. When we add and/or remove multiple data points, this algorithm is time-consuming because we need to apply it repeatedly to each data point. The proposed algorithm is computationally more efficient when multiple data points are added and/or removed simultaneously. The single incremental decremental algorithm is built on an optimization technique called parametric programming. We extend this idea and introduce multi-parametric programming to develop the proposed algorithm. Experimental results on synthetic and real data sets indicate that the proposed algorithm can significantly reduce the computational cost of multiple incremental decremental operations. Our approach is especially useful for online SVM learning, in which we need to remove old data points and add new data points in a short amount of time.
same-paper 4 0.96338022 245 nips-2009-Thresholding Procedures for High Dimensional Variable Selection and Statistical Estimation
Author: Shuheng Zhou
Abstract: Given n noisy samples with p dimensions, where n ≪ p, we show that the multistep thresholding procedure can accurately estimate a sparse vector β ∈ Rp in a linear model, under the restricted eigenvalue conditions (Bickel-Ritov-Tsybakov 09). Thus our conditions for model selection consistency are considerably weaker than what has been achieved in previous works. More importantly, this method allows very significant values of s, which is the number of non-zero elements in the true parameter. For example, it works for cases where the ordinary Lasso would have failed. Finally, we show that if X obeys a uniform uncertainty principle and if the true parameter is sufficiently sparse, the Gauss-Dantzig selector (Candès-Tao 07) achieves the ℓ2 loss within a logarithmic factor of the ideal mean square error one would achieve with an oracle which would supply perfect information about which coordinates are non-zero and which are above the noise level, while selecting a sufficiently sparse model.
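The multistep idea in the abstract (run a sparse first-stage estimator, hard-threshold the small coefficients, then refit least squares on the selected support) can be sketched generically. This is not the paper's exact procedure or tuning: the plain coordinate-descent Lasso solver, the penalty level lam, and the threshold tau below are assumptions made for the example.

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=100):
    """Plain coordinate-descent Lasso for 0.5*||y - X b||^2 + lam*||b||_1
    (a minimal solver, not the one analyzed in the paper)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    r = y - X @ beta  # residual, kept up to date incrementally
    for _ in range(n_sweeps):
        for j in range(p):
            rho = X[:, j] @ r + col_sq[j] * beta[j]
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta

def threshold_then_refit(X, y, lam, tau):
    """Multistep estimator sketch: Lasso, hard-threshold at tau, then
    ordinary least squares refit on the selected support."""
    beta = lasso_cd(X, y, lam)
    support = np.flatnonzero(np.abs(beta) > tau)
    beta_hat = np.zeros(X.shape[1])
    if support.size:
        beta_hat[support] = np.linalg.lstsq(X[:, support], y, rcond=None)[0]
    return beta_hat, support

# Toy n << p example: s = 5 strong non-zero coefficients out of p = 200
rng = np.random.default_rng(0)
n, p, s = 100, 200, 5
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:s] = 5.0
y = X @ beta_true + 0.1 * rng.standard_normal(n)
beta_hat, support = threshold_then_refit(X, y, lam=5.0, tau=0.5)
```

With a well-separated signal, the thresholding step removes the Lasso's small spurious coefficients and the refit removes its shrinkage bias on the surviving ones.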
5 0.95958894 258 nips-2009-Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise
Author: Jacob Whitehill, Ting-fan Wu, Jacob Bergsma, Javier R. Movellan, Paul L. Ruvolo
Abstract: Modern machine learning-based approaches to computer vision require very large databases of hand labeled images. Some contemporary vision systems already require on the order of millions of images for training (e.g., Omron face detector [9]). New Internet-based services allow for a large number of labelers to collaborate around the world at very low cost. However, using these services brings interesting theoretical and practical challenges: (1) The labelers may have wide ranging levels of expertise which are unknown a priori, and in some cases may be adversarial; (2) images may vary in their level of difficulty; and (3) multiple labels for the same image must be combined to provide an estimate of the actual label of the image. Probabilistic approaches provide a principled way to approach these problems. In this paper we present a probabilistic model and use it to simultaneously infer the label of each image, the expertise of each labeler, and the difficulty of each image. On both simulated and real data, we demonstrate that the model outperforms the commonly used “Majority Vote” heuristic for inferring image labels, and is robust to both noisy and adversarial labelers.
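The idea of jointly inferring labels and labeler expertise can be sketched with a tiny EM loop. This is a simplified Dawid-Skene-style sketch with symmetric labeler accuracies and no per-image difficulty, so it only approximates the richer model described in the abstract; the vote-table format is an assumption for the example.

```python
def em_label_aggregation(votes, n_iter=50):
    """Minimal EM sketch for binary crowd labels: jointly estimate each
    image's label probability and each labeler's accuracy from a
    {image: {labeler: 0/1 vote}} table."""
    labelers = {l for v in votes.values() for l in v}
    acc = {l: 0.7 for l in labelers}  # initial accuracy guess
    post = {}
    for _ in range(n_iter):
        # E-step: posterior probability that each image's true label is 1
        for img, v in votes.items():
            p1 = p0 = 1.0
            for l, z in v.items():
                p1 *= acc[l] if z == 1 else 1 - acc[l]
                p0 *= acc[l] if z == 0 else 1 - acc[l]
            post[img] = p1 / (p1 + p0)
        # M-step: accuracy = expected fraction of agreements with the truth
        for l in labelers:
            num = den = 0.0
            for img, v in votes.items():
                if l in v:
                    q = post[img]
                    num += q if v[l] == 1 else 1 - q
                    den += 1
            acc[l] = min(max(num / den, 1e-3), 1 - 1e-3)
    return post, acc

# Two reliable labelers (A, B) and one adversarial labeler (C)
true_labels = {i: i % 2 for i in range(10)}
votes = {i: {'A': true_labels[i], 'B': true_labels[i], 'C': 1 - true_labels[i]}
         for i in range(10)}
post, acc = em_label_aggregation(votes)
```

After a few iterations the adversarial labeler's estimated accuracy collapses toward zero, so its votes effectively count against the label it reports, which is exactly the behavior plain majority voting cannot achieve.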
7 0.78615367 133 nips-2009-Learning models of object structure
8 0.77758372 211 nips-2009-Segmenting Scenes by Matching Image Composites
9 0.75536203 25 nips-2009-Adaptive Design Optimization in Experiments with People
10 0.71479672 225 nips-2009-Sparsistent Learning of Varying-coefficient Models with Structural Changes
11 0.71439493 1 nips-2009-$L 1$-Penalized Robust Estimation for a Class of Inverse Problems Arising in Multiview Geometry
12 0.70632714 115 nips-2009-Individuation, Identification and Object Discovery
13 0.69646424 5 nips-2009-A Bayesian Model for Simultaneous Image Clustering, Annotation and Object Segmentation
15 0.6886791 175 nips-2009-Occlusive Components Analysis
16 0.68807399 44 nips-2009-Beyond Categories: The Visual Memex Model for Reasoning About Object Relationships
17 0.68137747 229 nips-2009-Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data
18 0.67869246 131 nips-2009-Learning from Neighboring Strokes: Combining Appearance and Context for Multi-Domain Sketch Recognition
19 0.67298335 28 nips-2009-An Additive Latent Feature Model for Transparent Object Recognition
20 0.67194027 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models