nips nips2013 nips2013-225 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Lee H. Dicker, Dean P. Foster
Abstract: We model a “one-shot learning” situation, where very few observations y1, ..., yn ∈ R are available. Associated with each observation yi is a very high-dimensional vector xi ∈ Rd, which provides context for yi and enables us to predict subsequent observations, given their own context. One of the salient features of our analysis is that the problems studied here are easier when the dimension of xi is large; in other words, prediction becomes easier when more context is provided. The proposed methodology is a variant of principal component regression (PCR). Our rigorous analysis sheds new light on PCR. For instance, we show that classical PCR estimators may be inconsistent in the specified setting, unless they are multiplied by a scalar c > 1; that is, unless the classical estimator is expanded. This expansion phenomenon appears to be somewhat novel and contrasts with shrinkage methods (c < 1), which are far more common in big data analyses.
Reference: text
sentIndex sentText sentNum sentScore
1 Associated with each observation yi is a very high-dimensional vector xi ∈ Rd, which provides context for yi and enables us to predict subsequent observations, given their own context. [sent-10, score-0.111]
2 One of the salient features of our analysis is that the problems studied here are easier when the dimension of xi is large; in other words, prediction becomes easier when more context is provided. [sent-11, score-0.151]
3 The proposed methodology is a variant of principal component regression (PCR). [sent-12, score-0.141]
4 For instance, we show that classical PCR estimators may be inconsistent in the specified setting, unless they are multiplied by a scalar c > 1; that is, unless the classical estimator is expanded. [sent-14, score-0.138]
5 This expansion phenomenon appears to be somewhat novel and contrasts with shrinkage methods (c < 1), which are far more common in big data analyses. [sent-15, score-0.059]
6 We propose effective methods for one-shot learning in this setting, and derive risk approximations that are informative in an asymptotic regime where the number of training examples n is fixed (e.g., n = 2) and d → ∞. [sent-27, score-0.196]
7 These approximations provide insight into the significance of various parameters that are relevant for one-shot learning. [sent-30, score-0.09]
8 One important feature of the proposed one-shot setting is that prediction becomes “easier” when d is large – in other words, prediction becomes easier when more context is provided. [sent-31, score-0.159]
9 The methods considered in this paper are variants of principal component regression (PCR) [13]. [sent-35, score-0.126]
10 Related work includes [19], who studied principal component scores in high dimensions, and work by Hall, Jung, Marron and co-authors [10, 11, 18, 21], who have studied “high dimension, low sample size” data, with fixed n and d → ∞, in a variety of contexts, including PCA. [sent-43, score-0.153]
11 We show that the classical PCR estimator is generally inconsistent in the one-shot learning regime, where n is fixed and d → ∞. [sent-48, score-0.098]
12 To remedy this, we propose a bias-corrected PCR estimator, which is obtained by expanding the classical PCR estimator (i.e., multiplying it by a scalar c > 1). [sent-49, score-0.09]
13 Risk approximations obtained in Section 5 imply that the bias-corrected estimator is consistent when n is fixed and d → ∞. [sent-52, score-0.124]
14 These results are supported by a simulation study described in Section 7, where we also consider an “oracle” PCR estimator for comparative purposes. [sent-53, score-0.063]
15 It is noteworthy that the bias-corrected estimator is an expanded version of the classical estimator. [sent-54, score-0.088]
16 Shrinkage, which would correspond to multiplying the classical estimator by a scalar 0 ≤ c < 1, is a far more common phenomenon in high-dimensional data analysis. [sent-55, score-0.109]
17 ([19] argued for bias-correction via expansion in the analysis of principal component scores.) [sent-58, score-0.12]
18 The data are (y_1, x_1), ..., (y_n, x_n), where y_i ∈ R is a scalar outcome and x_i ∈ R^d is an associated d-dimensional “context” vector, for i = 1, ..., n. [sent-62, score-0.081]
19 Suppose that y_i and x_i are related via y_i = h_i θ + ξ_i ∈ R, with h_i ∼ N(0, η²), ξ_i ∼ N(0, σ²), and x_i = h_i γ√d u + ε_i ∈ R^d, with ε_i ∼ N(0, τ²I), for i = 1, ..., n. [sent-66, score-0.271]
20 The h_i, ξ_i, and ε_i = (ε_{i1}, ..., ε_{id})^T, 1 ≤ i ≤ n, are all assumed to be independent; h_i is a latent factor linking the outcome y_i and the vector x_i; ξ_i and ε_i are random noise. [sent-73, score-0.115]
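As a concrete illustration (not from the paper), the single-factor model above is easy to simulate; the NumPy sketch below assumes u is a unit vector and uses arbitrary default parameter values.

```python
import numpy as np

def simulate_one_shot(n=2, d=5000, theta=1.0, eta=1.0, gamma=1.0, tau=1.0,
                      sigma=0.01, seed=0):
    """Draw (y, X) from the single-factor model:
    y_i = h_i*theta + xi_i,   x_i = h_i*gamma*sqrt(d)*u + eps_i."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)                    # unit "signal" direction u in R^d
    h = rng.normal(0.0, eta, size=n)          # latent factors h_i ~ N(0, eta^2)
    xi = rng.normal(0.0, sigma, size=n)       # y-noise xi_i ~ N(0, sigma^2)
    eps = rng.normal(0.0, tau, size=(n, d))   # x-noise rows eps_i ~ N(0, tau^2 I)
    y = h * theta + xi
    X = h[:, None] * (gamma * np.sqrt(d) * u)[None, :] + eps
    return y, X, u, h
```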
21 Our goal is to devise prediction methods ŷ that perform well (i.e., R_V(ŷ) is small) in an asymptotic regime whose key features are (i) n is fixed, (ii) d → ∞, (iii) σ² → 0, and (iv) inf η²γ²/τ² > 0. [sent-97, score-0.111]
22 We suggest that this regime reflects a one-shot learning setting, where n is small and d is large (captured by (i)-(ii) from the previous sentence), and there is abundant contextual information for predicting future outcomes (which is ensured by (iii)-(iv)). [sent-98, score-0.081]
23 In a specified asymptotic regime (not necessarily the one-shot regime), we say that a prediction method ŷ is consistent if R_V(ŷ) → 0. [sent-99, score-0.186]
24 Weak consistency is another type of consistency that is considered below. [sent-100, score-0.118]
25 We say that ŷ is weakly consistent if |ŷ(x_new) − y_new| → 0 in probability. [sent-101, score-0.147]
26 3 Principal component regression By assumption, the data (y_i, x_i) are multivariate normal. [sent-103, score-0.088]
27 This suggests studying linear prediction rules of the form ŷ(x_new) = x_new^T β̂ for some estimator β̂ of β. [sent-105, score-0.151]
28 In this paper, we restrict our attention to linear prediction rules, focusing on estimators related to principal component regression (PCR). [sent-106, score-0.367]
29 In its most basic form, principal component regression involves regressing y on XU_k for some (typically small) k, and taking β̂ = U_k (U_k^T X^T X U_k)^{-1} U_k^T X^T y. [sent-118, score-0.126]
30 Thus, it is natural to restrict our attention to PCR with k = 1; more explicitly, consider β̂_pcr = [û_1^T X^T y/(û_1^T X^T X û_1)] û_1 = (1/l̂_1) û_1 û_1^T X^T y. (5) [sent-120, score-0.583]
31 In the following sections, we study consistency and risk properties of β̂_pcr and related estimators. [sent-121, score-0.684]
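A minimal sketch of the k = 1 PCR estimator in (5), continuing the NumPy example above; the leading eigenpair (l̂_1, û_1) of X^T X is obtained from the SVD of X. This is an illustration under the stated model, not the authors' implementation.

```python
def pcr_estimator(y, X):
    """Classical PCR with k = 1: beta_pcr = (1/l1) * u1 * (u1^T X^T y),
    where (l1, u1) is the top eigenpair of X^T X."""
    _, svals, Vt = np.linalg.svd(X, full_matrices=False)
    l1 = svals[0] ** 2          # largest eigenvalue of X^T X
    u1 = Vt[0]                  # corresponding unit eigenvector (length d)
    return u1 * (u1 @ (X.T @ y)) / l1
```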
32 4 Weak consistency and big data with n = 2 Before turning our attention to risk approximations for PCR in Section 5 below (which contains the paper’s main technical contributions), we discuss weak consistency in the one-shot asymptotic regime, devoting special attention to the case where n = 2. [sent-122, score-0.337]
33 Second, it will become apparent below that the risk of the consistent PCR methods studied in this paper depends on inverse moments of χ2 random variables. [sent-125, score-0.118]
34 For very small n, these inverse moments do not exist and, consequently, the risk of the associated prediction methods may be infinite. [sent-126, score-0.126]
35 On the other hand, the weak consistency results obtained in this section are valid for all n ≥ 2. [sent-128, score-0.115]
36 4.1 Heuristic analysis for n = 2 Recall the PCR estimator (5) and let ŷ_pcr(x) = x^T β̂_pcr be the associated linear prediction rule. [sent-130, score-1.114]
37 These expressions for l̂_1 and û_1 yield an explicit expression for β̂_pcr when n = 2 and facilitate a simple heuristic analysis of PCR, which we undertake in this subsection. [sent-132, score-0.568]
38 This analysis suggests that ŷ_pcr is not consistent when σ² → 0 and d → ∞ (at least for n = 2). [sent-133, score-0.47]
39 However, the analysis also suggests that consistency can be achieved by multiplying β̂_pcr by a scalar c ≥ 1; that is, by expanding β̂_pcr. [sent-134, score-0.684]
40 This observation leads us to consider and rigorously analyze a bias-corrected PCR method, which we ultimately show is consistent in fixed n settings, if σ² → 0 and d → ∞. [sent-135, score-0.615]
41 On the other hand, it will also be shown below that ŷ_pcr is inconsistent in one-shot asymptotic regimes. [sent-136, score-0.486]
42 For large d, the basic approximations ||x_i||² ≈ γ²dh_i² + τ²d and x_1^T x_2 ≈ γ²dh_1h_2 lead to the following approximation for ŷ_pcr(x_new): ŷ_pcr(x_new) = x_new^T β̂_pcr ≈ [γ²(h_1² + h_2²)/{γ²(h_1² + h_2²) + τ²}] h_new θ + e_pcr, (6) where e_pcr is a random error term proportional to û_1^T X^T ξ. [sent-137, score-1.856]
43 Thus, ŷ_pcr(x_new) − y_new ≈ −τ² h_new θ/{γ²(h_1² + h_2²) + τ²} + e_pcr − ξ_new. (7) [sent-138, score-0.193]
44 The second and third terms on the right-hand side in (7), e_pcr − ξ_new, represent a random error that vanishes as d → ∞ and σ² → 0. [sent-139, score-0.086]
45 On the other hand, the first term on the right-hand side in (7), −τ² h_new θ/{γ²(h_1² + h_2²) + τ²}, is a bias term that is, in general, non-zero when d → ∞ and σ² → 0; in other words, ŷ_pcr is inconsistent. [sent-140, score-0.594]
46 This bias is apparent in the expression for ŷ_pcr(x_new) given in (6); in particular, the first term on the right-hand side of (6) is typically smaller than h_new θ. [sent-141, score-0.598]
47 One way to correct for the bias of ŷ_pcr is to multiply β̂_pcr by l̂_1/(l̂_1 − l̂_2) ≈ {γ²(h_1² + h_2²) + τ²}/{γ²(h_1² + h_2²)} ≥ 1, where l̂_2 = ½{||x_1||² + ||x_2||² − √[(||x_1||² − ||x_2||²)² + 4(x_1^T x_2)²]} ≈ τ²d is the second-largest eigenvalue of X^T X. [sent-142, score-1.547]
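Continuing the sketch, the n = 2 quantities can be checked numerically: the two nonzero eigenvalues of X^T X come from the 2×2 matrix XX^T, l̂_2/d should be close to τ², and the expansion factor l̂_1/(l̂_1 − l̂_2) is at least 1 (an illustrative check with arbitrary parameters).

```python
def gram_eigenvalues_n2(X):
    """Closed-form eigenvalues of X^T X when n = 2 (eigenvalues of the 2x2 matrix X X^T)."""
    x1, x2 = X[0], X[1]
    a, b, c = x1 @ x1, x2 @ x2, x1 @ x2
    disc = np.sqrt((a - b) ** 2 + 4.0 * c ** 2)
    return 0.5 * (a + b + disc), 0.5 * (a + b - disc)

y2, X2, u2, h2 = simulate_one_shot(n=2, d=5000, tau=1.0)
l1, l2 = gram_eigenvalues_n2(X2)
print(l2 / X2.shape[1])      # roughly tau^2 = 1, consistent with l2 ~ tau^2 * d
print(l1 / (l1 - l2))        # expansion factor l1/(l1 - l2) >= 1
```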
48 Define the bias-corrected principal component regression estimator β̂_bc = [l̂_1/(l̂_1 − l̂_2)] β̂_pcr = [1/(l̂_1 − l̂_2)] û_1 û_1^T X^T y and let ŷ_bc(x) = x^T β̂_bc be the associated linear prediction rule. [sent-143, score-1.506]
49 Then ŷ_bc(x_new) = x_new^T β̂_bc ≈ h_new θ + e_bc, where e_bc is another random error term proportional to û_1^T X^T ξ. [sent-144, score-0.932]
50 If d → ∞, σ² → 0, and θ, η, τ, γ are contained in a compact subset of (0, ∞), then ŷ_bc(x_new) − y_new ≈ e_bc → 0 in probability; in other words, ŷ_bc is weakly consistent. [sent-147, score-0.995]
51 Indeed, weak consistency of ŷ_bc follows from Theorem 1 below. [sent-148, score-0.497]
52 Thus, when n = 2, ŷ_bc is weakly consistent, but not consistent. [sent-151, score-0.43]
53 Define the bias-corrected PCR estimator β̂_bc = [l̂_1/(l̂_1 − l̂_n)] β̂_pcr = [1/(l̂_1 − l̂_n)] û_1 û_1^T X^T y (8) and the associated linear prediction rule ŷ_bc(x) = x^T β̂_bc. [sent-154, score-1.468]
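A sketch of the bias-corrected estimator in (8): identical to the classical PCR fit except that the normalization l̂_1 is replaced by l̂_1 − l̂_n, which expands the estimate (again an illustration, not the paper's code).

```python
def bias_corrected_pcr(y, X):
    """Bias-corrected PCR, cf. (8): beta_bc = (1/(l1 - ln)) * u1 * (u1^T X^T y)."""
    n = X.shape[0]
    _, svals, Vt = np.linalg.svd(X, full_matrices=False)
    l1 = svals[0] ** 2          # largest eigenvalue of X^T X
    ln = svals[n - 1] ** 2      # n-th eigenvalue of X^T X (assumes d >= n)
    u1 = Vt[0]
    return u1 * (u1 @ (X.T @ y)) / (l1 - ln)
```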
54 The main weak consistency result of the paper is given below. [sent-155, score-0.098]
55 For r > 0: lim_{d→∞, σ²→0} sup_{θ,η,τ,γ∈C, u∈R^d} P_V{|ŷ_bc(x_new) − y_new| > r} = 0 (9); on the other hand, lim inf_{d→∞, σ²→0} inf_{θ,η,τ,γ∈C, u∈R^d} P_V{|ŷ_pcr(x_new) − y_new| > r} > 0. (10) [sent-160, score-0.155]
56 Theorem 1 implies that in the specified fixed n asymptotic setting, bias-corrected PCR is weakly consistent (9) and that the more standard PCR method ŷ_pcr is inconsistent (10). [sent-162, score-0.543]
57 In (8), it is noteworthy that l̂_1/(l̂_1 − l̂_n) ≥ 1: in order to achieve (weak) consistency, the bias-corrected estimator β̂_bc is obtained by expanding β̂_pcr. [sent-164, score-0.864]
58 By contrast, shrinkage is a far more common method for obtaining improved estimators in many regression and prediction settings (the literature on shrinkage estimation is vast, perhaps beginning with [23]). [sent-165, score-0.148]
59 5 Risk approximations and consistency In this section, we present risk approximations for ŷ_pcr and ŷ_bc that are valid when n ≥ 9. [sent-166, score-1.055]
60 A more careful analysis may yield approximations that are valid for smaller n; however, this is not pursued further here. [sent-167, score-0.086]
61 When d is large, σ² is small, and θ, η, τ, γ ∈ C, for some compact subset C ⊆ (0, ∞), Theorem 2 suggests that R_V(ŷ_pcr) ≈ θ²η² E_V[{(u^T û_1)² − 1}²] and R_V(ŷ_bc) ≈ θ²η² E_V[{(l̂_1/(l̂_1 − l̂_n))(u^T û_1)² − 1}²]. [sent-174, score-0.082]
62 Thus, consistency of ŷ_pcr and ŷ_bc in the one-shot regime hinges on asymptotic properties of E_V[{(u^T û_1)² − 1}²] and E_V[{(l̂_1/(l̂_1 − l̂_n))(u^T û_1)² − 1}²]. [sent-175, score-1.02]
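Both expectations can be approximated by simple Monte Carlo under the model; the sketch below treats E_V as an average over independent draws of X, which is a simplification made for the illustration (the paper's E_V is a specific conditional expectation).

```python
def risk_driving_terms(n=10, d=5000, reps=200, **model_kw):
    """Monte Carlo estimates of E[{(u^T u1)^2 - 1}^2] and
    E[{(l1/(l1 - ln))*(u^T u1)^2 - 1}^2], the quantities driving the two risks."""
    pcr_terms, bc_terms = [], []
    for rep in range(reps):
        _, X, u, _ = simulate_one_shot(n=n, d=d, seed=rep, **model_kw)
        _, svals, Vt = np.linalg.svd(X, full_matrices=False)
        l1, ln = svals[0] ** 2, svals[n - 1] ** 2
        align2 = (u @ Vt[0]) ** 2                       # (u^T u1)^2
        pcr_terms.append((align2 - 1.0) ** 2)
        bc_terms.append(((l1 / (l1 - ln)) * align2 - 1.0) ** 2)
    return float(np.mean(pcr_terms)), float(np.mean(bc_terms))
```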
63 Proposition 1 (a) implies that in the one-shot regime, E_V[{(u^T û_1)² − 1}²] → E{[τ²/(η²γ²W_n + τ²)]²} ≠ 0; by Theorem 2 (a), it follows that ŷ_pcr is inconsistent. [sent-181, score-0.427]
64 On the other hand, Proposition 1 (b) implies that E_V[{(l̂_1/(l̂_1 − l̂_n))(u^T û_1)² − 1}²] → 0 in the one-shot regime; thus, by Theorem 2 (b), ŷ_bc is consistent. [sent-182, score-0.443]
65 This suggests that both terms (11)-(12) in Theorem 2 (b) have similar magnitude and, consequently, are both necessary to obtain accurate approximations for R_V(ŷ_bc). [sent-190, score-0.065]
66 (It may be desirable to obtain more accurate approximations for E_V[{(l̂_1/(l̂_1 − l̂_n))(u^T û_1)² − 1}²]; this could potentially be leveraged to obtain better approximations for R_V(ŷ_bc).) [sent-191, score-0.092]
67 Theorem 2 and Proposition 1 give risk approximations that are valid for all d and n ≥ 9. [sent-193, score-0.122]
68 However, as illustrated by Corollary 1, these approximations are most effective in a one-shot asymptotic setting, where n is fixed and d is large. [sent-194, score-0.078]
69 Approximate feature complexity for ŷ_bc is easily computed using Theorem 2 and Proposition 1 (clearly, feature complexity depends heavily on model parameters, such as θ, the y-data noise level σ², and the x-data signal-to-noise ratio η²γ²/τ²). [sent-197, score-0.399]
70 6 An oracle estimator In this section, we discuss a third method related to ŷ_pcr and ŷ_bc, which relies on information that is typically not available in practice. [sent-198, score-0.925]
71 Recall that both ŷ_bc and ŷ_pcr depend on the first principal component û_1, which may be viewed as an estimate of u. [sent-200, score-0.931]
72 If an oracle provides knowledge of u in advance, then it is natural to consider the oracle PCR estimator β̂_or = [u^T X^T y/(u^T X^T X u)] u and the associated linear prediction rule ŷ_or(x) = x^T β̂_or. [sent-201, score-0.34]
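For comparison, the oracle estimator is immediate once u is handed to us (as in the simulations of Section 7); a sketch:

```python
def oracle_pcr(y, X, u):
    """Oracle PCR, assuming the true direction u is known:
    beta_or = (u^T X^T y / u^T X^T X u) * u."""
    Xu = X @ u                   # n-vector of projections x_i^T u
    return u * (Xu @ y) / (Xu @ Xu)
```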
73 Clearly, ŷ_or is consistent in the one-shot regime: if C ⊆ (0, ∞) is compact and n ≥ 3 is fixed, then lim_{d→∞, σ²→0} sup_{θ,η,τ,γ∈C, u∈R^d} R_V(ŷ_or) = 0. [sent-205, score-0.214]
74 7 Numerical results In this section, we describe the results of a simulation study where we compared the performance of ŷ_pcr, ŷ_bc, and ŷ_or. [sent-206, score-0.949]
75 For each simulated dataset, we computed β̂_pcr, β̂_bc, β̂_or and the corresponding conditional prediction error R_V(ŷ | y, X) = E[{ŷ(x_new) − y_new}² | y, X] = (β̂ − β)^T(τ²I + η²γ²d uu^T)(β̂ − β) + σ² + θ²η²/(ψ²d + 1), for ŷ = ŷ_pcr, ŷ_bc, ŷ_or. [sent-213, score-1.854]
76 The empirical prediction error for each method ŷ was then computed by averaging R_V(ŷ | y, X) over all 1000 simulated datasets. [sent-214, score-0.125]
77 We also computed the “theoretical” prediction error for each method, using the results from Sections 5-6, where appropriate. [sent-215, score-0.1]
78 More specifically, for ŷ_pcr and ŷ_bc, we used the leading terms of the approximations in Theorem 2 and Proposition 1 to obtain the theoretical prediction error; for ŷ_or, we used the formula given in Proposition 2 (see Table 1 for more details). [sent-216, score-1.103]
79 Finally, we computed the relative error between the empirical and theoretical prediction errors. Table 1: Formulas for theoretical prediction error used in simulations (derived from Theorem 2 and Propositions 1-2). [sent-217, score-0.327]
80 Expectations in the theoretical prediction error expressions for ŷ_pcr and ŷ_bc were computed empirically. [sent-218, score-0.963]
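Under the model, the conditional prediction error of a fixed linear rule x ↦ x^T β̂ has a simple closed form that follows directly from the distributions of x_new and y_new; the sketch below reuses the helper functions from the earlier sketches (same arbitrary defaults) and compares the three estimators.

```python
def conditional_prediction_error(beta_hat, u, theta=1.0, eta=1.0, gamma=1.0,
                                 tau=1.0, sigma=0.01):
    """E[{x_new^T beta_hat - y_new}^2 | y, X] under the single-factor model:
    eta^2*(gamma*sqrt(d)*u^T beta_hat - theta)^2 + tau^2*||beta_hat||^2 + sigma^2."""
    d = u.shape[0]
    gap = gamma * np.sqrt(d) * (u @ beta_hat) - theta   # error along the latent factor
    return eta ** 2 * gap ** 2 + tau ** 2 * (beta_hat @ beta_hat) + sigma ** 2

y, X, u, h = simulate_one_shot(n=9, d=5000)
for name, b in [("pcr", pcr_estimator(y, X)),
                ("bc", bias_corrected_pcr(y, X)),
                ("oracle", oracle_pcr(y, X, u))]:
    print(name, conditional_prediction_error(b, u))
```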
81 Prediction error for ŷ_pcr (PCR), ŷ_bc (Bias-corrected PCR), and ŷ_or (oracle). [sent-221, score-0.98]
82 Observe that ŷ_bc has smaller empirical prediction error than ŷ_pcr in every setting considered in Tables 2-3, and ŷ_bc substantially outperforms ŷ_pcr in most settings. [sent-251, score-1.777]
83 Indeed, the empirical prediction error for ŷ_bc when n = 9 is smaller than that of ŷ_pcr when n = 20 (for both d = 500 and d = 5000); in other words, ŷ_bc outperforms ŷ_pcr, even when ŷ_pcr has more than twice as much training data. [sent-252, score-2.204]
84 Additionally, the empirical prediction error of ŷ_bc is quite close to that of the oracle method ŷ_or, especially when n is relatively large. [sent-253, score-0.696]
85 These results highlight the effectiveness of the bias-corrected PCR method ŷ_bc in settings where σ² and n are small, η²γ²/τ² is substantially larger than 0, and d is large. [sent-254, score-0.399]
86 For n = 2, 4, theoretical prediction error is unavailable in some instances. [sent-255, score-0.137]
87 Prediction error for ŷ_pcr (PCR), ŷ_bc (Bias-corrected PCR), and ŷ_or (oracle). [sent-257, score-0.98]
88 We have not pursued an expression for R_V(ŷ_pcr) when n = 2 (it appears that R_V(ŷ_pcr) < ∞); furthermore, the approximations in Theorem 2 for R_V(ŷ_pcr) and R_V(ŷ_bc) do not apply when n = 4. [sent-286, score-0.069]
89 In instances where the theoretical prediction error is available, is finite, and d = 500, the relative error between empirical and theoretical prediction error for ŷ_pcr and ŷ_bc ranges from 8. [sent-287, score-1.19]
90 Thus, the accuracy of the theoretical prediction error formulas tends to improve as d increases, as one would expect. [sent-292, score-0.159]
91 Further improved measures of theoretical prediction error for ŷ_pcr and ŷ_bc could potentially be obtained by refining the approximations in Theorem 2 and Proposition 1. [sent-293, score-1.011]
92 The simulations described in the previous section indicate that ŷ_bc outperforms the uncorrected PCR method ŷ_pcr even in settings where twice as much labeled data is available for ŷ_pcr. [sent-306, score-1.253]
93 As opposed to the single-factor model (1)-(2), one could consider a more general k-factor model, where y_i = h_i^T θ + ξ_i and x_i = S h_i + ε_i. [sent-310, score-0.1]
94 Here h_i = (h_{i1}, ..., h_{ik})^T ∈ R^k is a multivariate normal random vector (a k-dimensional factor linking y_i and x_i) and θ = (θ_1, ..., θ_k)^T ∈ R^k. [sent-313, score-0.089]
95 On the distribution of the largest eigenvalue in principal components analysis. [sent-416, score-0.095]
96 Finite sample approximation results for principal component analysis: A matrix perturbation approach. [sent-424, score-0.105]
97 On consistency and sparsity for principal components analysis in high dimensions. [sent-431, score-0.134]
98 Convergence and prediction of principal component scores in high-dimensional settings. [sent-444, score-0.202]
99 Optimal detection of sparse principal components in high dimension. [sent-449, score-0.075]
100 Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. [sent-464, score-0.064]
wordName wordTfidf (topN-words)
[('pcr', 0.568), ('ypcr', 0.427), ('ybc', 0.399), ('pe', 0.239), ('rv', 0.201), ('xnew', 0.157), ('ev', 0.157), ('bc', 0.147), ('hnew', 0.138), ('yor', 0.123), ('wn', 0.109), ('ut', 0.1), ('ynew', 0.09), ('principal', 0.075), ('prediction', 0.069), ('regime', 0.061), ('consistency', 0.059), ('risk', 0.057), ('ebc', 0.055), ('epcr', 0.055), ('proposition', 0.051), ('estimator', 0.05), ('oracle', 0.049), ('approximations', 0.048), ('na', 0.046), ('ln', 0.044), ('duut', 0.041), ('uk', 0.041), ('xt', 0.04), ('hi', 0.04), ('weak', 0.039), ('yi', 0.037), ('theoretical', 0.037), ('rd', 0.037), ('relative', 0.034), ('weakly', 0.031), ('du', 0.031), ('error', 0.031), ('asymptotic', 0.03), ('jung', 0.03), ('component', 0.03), ('shrinkage', 0.029), ('inconsistent', 0.029), ('oneshot', 0.028), ('xuk', 0.028), ('pv', 0.026), ('consistent', 0.026), ('lim', 0.025), ('empirical', 0.025), ('article', 0.024), ('marron', 0.024), ('theorem', 0.024), ('xi', 0.023), ('formulas', 0.022), ('regression', 0.021), ('scalar', 0.021), ('easier', 0.021), ('pursued', 0.021), ('expanding', 0.021), ('compact', 0.021), ('contextual', 0.02), ('hall', 0.02), ('inf', 0.02), ('eigenvalue', 0.02), ('classical', 0.019), ('multiplying', 0.019), ('noteworthy', 0.019), ('sup', 0.019), ('apparent', 0.018), ('royal', 0.018), ('suggests', 0.017), ('sheds', 0.017), ('lee', 0.017), ('asymptotics', 0.017), ('unlabeled', 0.017), ('valid', 0.017), ('studied', 0.017), ('propositions', 0.016), ('tenenbaum', 0.016), ('bias', 0.015), ('expansion', 0.015), ('annals', 0.015), ('un', 0.015), ('attention', 0.015), ('rules', 0.015), ('methodology', 0.015), ('big', 0.015), ('linking', 0.015), ('degrees', 0.014), ('scores', 0.014), ('pca', 0.014), ('words', 0.014), ('relevant', 0.014), ('psychological', 0.014), ('yn', 0.014), ('multivariate', 0.014), ('highdimensional', 0.014), ('devise', 0.014), ('xu', 0.013), ('comparative', 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 225 nips-2013-One-shot learning and big data with n=2
Author: Lee H. Dicker, Dean P. Foster
Abstract: We model a “one-shot learning” situation, where very few observations y1, ..., yn ∈ R are available. Associated with each observation yi is a very high-dimensional vector xi ∈ Rd, which provides context for yi and enables us to predict subsequent observations, given their own context. One of the salient features of our analysis is that the problems studied here are easier when the dimension of xi is large; in other words, prediction becomes easier when more context is provided. The proposed methodology is a variant of principal component regression (PCR). Our rigorous analysis sheds new light on PCR. For instance, we show that classical PCR estimators may be inconsistent in the specified setting, unless they are multiplied by a scalar c > 1; that is, unless the classical estimator is expanded. This expansion phenomenon appears to be somewhat novel and contrasts with shrinkage methods (c < 1), which are far more common in big data analyses.
2 0.093265004 283 nips-2013-Robust Sparse Principal Component Regression under the High Dimensional Elliptical Model
Author: Fang Han, Han Liu
Abstract: In this paper we focus on the principal component regression and its application to high dimension non-Gaussian data. The major contributions are two folds. First, in low dimensions and under the Gaussian model, by borrowing the strength from recent development in minimax optimal principal component estimation, we first time sharply characterize the potential advantage of classical principal component regression over least square estimation. Secondly, we propose and analyze a new robust sparse principal component regression on high dimensional elliptically distributed data. The elliptical distribution is a semiparametric generalization of the Gaussian, including many well known distributions such as multivariate Gaussian, rank-deficient Gaussian, t, Cauchy, and logistic. It allows the random vector to be heavy tailed and have tail dependence. These extra flexibilities make it very suitable for modeling finance and biomedical imaging data. Under the elliptical model, we prove that our method can estimate the regression coefficients in the optimal parametric rate and therefore is a good alternative to the Gaussian based methods. Experiments on synthetic and real world data are conducted to illustrate the empirical usefulness of the proposed method. 1
3 0.092372015 120 nips-2013-Faster Ridge Regression via the Subsampled Randomized Hadamard Transform
Author: Yichao Lu, Paramveer Dhillon, Dean P. Foster, Lyle Ungar
Abstract: We propose a fast algorithm for ridge regression when the number of features is much larger than the number of observations (p n). The standard way to solve ridge regression in this setting works in the dual space and gives a running time of O(n2 p). Our algorithm Subsampled Randomized Hadamard Transform- Dual Ridge Regression (SRHT-DRR) runs in time O(np log(n)) and works by preconditioning the design matrix by a Randomized Walsh-Hadamard Transform with a subsequent subsampling of features. We provide risk bounds for our SRHT-DRR algorithm in the fixed design setting and show experimental results on synthetic and real datasets. 1
4 0.084352829 302 nips-2013-Sparse Inverse Covariance Estimation with Calibration
Author: Tuo Zhao, Han Liu
Abstract: We propose a semiparametric method for estimating sparse precision matrix of high dimensional elliptical distribution. The proposed method calibrates regularizations when estimating each column of the precision matrix. Thus it not only is asymptotically tuning free, but also achieves an improved finite sample performance. Theoretically, we prove that the proposed method achieves the parametric rates of convergence in both parameter estimation and model selection. We present numerical results on both simulated and real datasets to support our theory and illustrate the effectiveness of the proposed estimator. 1
5 0.061634816 313 nips-2013-Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization
Author: Julien Mairal
Abstract: Majorization-minimization algorithms consist of iteratively minimizing a majorizing surrogate of an objective function. Because of its simplicity and its wide applicability, this principle has been very popular in statistics and in signal processing. In this paper, we intend to make this principle scalable. We introduce a stochastic majorization-minimization scheme which is able to deal with largescale or possibly infinite data sets. When applied to convex optimization problems under suitable assumptions, we show that it achieves an expected convergence √ rate of O(1/ n) after n iterations, and of O(1/n) for strongly convex functions. Equally important, our scheme almost surely converges to stationary points for a large class of non-convex problems. We develop several efficient algorithms based on our framework. First, we propose a new stochastic proximal gradient method, which experimentally matches state-of-the-art solvers for large-scale ℓ1 logistic regression. Second, we develop an online DC programming algorithm for non-convex sparse estimation. Finally, we demonstrate the effectiveness of our approach for solving large-scale structured matrix factorization problems. 1
6 0.057222188 348 nips-2013-Variational Policy Search via Trajectory Optimization
7 0.056351926 109 nips-2013-Estimating LASSO Risk and Noise Level
8 0.05581414 130 nips-2013-Generalizing Analytic Shrinkage for Arbitrary Covariance Structures
9 0.053672802 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
10 0.052550491 68 nips-2013-Confidence Intervals and Hypothesis Testing for High-Dimensional Statistical Models
11 0.048289012 188 nips-2013-Memory Limited, Streaming PCA
12 0.043838102 192 nips-2013-Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation
13 0.042077251 359 nips-2013-Σ-Optimality for Active Learning on Gaussian Random Fields
14 0.040316209 242 nips-2013-PAC-Bayes-Empirical-Bernstein Inequality
15 0.040312368 256 nips-2013-Probabilistic Principal Geodesic Analysis
16 0.039521445 98 nips-2013-Documents as multiple overlapping windows into grids of counts
17 0.038760643 280 nips-2013-Robust Data-Driven Dynamic Programming
18 0.038431704 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints
19 0.037501164 149 nips-2013-Latent Structured Active Learning
20 0.036972921 230 nips-2013-Online Learning with Costly Features and Labels
topicId topicWeight
[(0, 0.104), (1, 0.026), (2, 0.036), (3, 0.0), (4, 0.006), (5, 0.049), (6, -0.015), (7, 0.048), (8, -0.059), (9, 0.006), (10, -0.012), (11, 0.004), (12, -0.01), (13, 0.007), (14, -0.019), (15, -0.026), (16, 0.025), (17, 0.057), (18, 0.004), (19, 0.049), (20, 0.032), (21, 0.041), (22, -0.037), (23, 0.058), (24, 0.013), (25, -0.003), (26, -0.004), (27, 0.047), (28, 0.003), (29, -0.01), (30, 0.043), (31, -0.016), (32, -0.055), (33, -0.052), (34, 0.017), (35, -0.067), (36, -0.081), (37, -0.1), (38, 0.059), (39, 0.069), (40, 0.046), (41, -0.049), (42, 0.015), (43, -0.04), (44, -0.064), (45, 0.086), (46, -0.039), (47, 0.012), (48, 0.076), (49, 0.044)]
simIndex simValue paperId paperTitle
same-paper 1 0.89330429 225 nips-2013-One-shot learning and big data with n=2
Author: Lee H. Dicker, Dean P. Foster
Abstract: We model a “one-shot learning” situation, where very few observations y1, ..., yn ∈ R are available. Associated with each observation yi is a very high-dimensional vector xi ∈ Rd, which provides context for yi and enables us to predict subsequent observations, given their own context. One of the salient features of our analysis is that the problems studied here are easier when the dimension of xi is large; in other words, prediction becomes easier when more context is provided. The proposed methodology is a variant of principal component regression (PCR). Our rigorous analysis sheds new light on PCR. For instance, we show that classical PCR estimators may be inconsistent in the specified setting, unless they are multiplied by a scalar c > 1; that is, unless the classical estimator is expanded. This expansion phenomenon appears to be somewhat novel and contrasts with shrinkage methods (c < 1), which are far more common in big data analyses.
2 0.65720648 283 nips-2013-Robust Sparse Principal Component Regression under the High Dimensional Elliptical Model
Author: Fang Han, Han Liu
Abstract: In this paper we focus on the principal component regression and its application to high dimension non-Gaussian data. The major contributions are two folds. First, in low dimensions and under the Gaussian model, by borrowing the strength from recent development in minimax optimal principal component estimation, we first time sharply characterize the potential advantage of classical principal component regression over least square estimation. Secondly, we propose and analyze a new robust sparse principal component regression on high dimensional elliptically distributed data. The elliptical distribution is a semiparametric generalization of the Gaussian, including many well known distributions such as multivariate Gaussian, rank-deficient Gaussian, t, Cauchy, and logistic. It allows the random vector to be heavy tailed and have tail dependence. These extra flexibilities make it very suitable for modeling finance and biomedical imaging data. Under the elliptical model, we prove that our method can estimate the regression coefficients in the optimal parametric rate and therefore is a good alternative to the Gaussian based methods. Experiments on synthetic and real world data are conducted to illustrate the empirical usefulness of the proposed method. 1
3 0.65065807 302 nips-2013-Sparse Inverse Covariance Estimation with Calibration
Author: Tuo Zhao, Han Liu
Abstract: We propose a semiparametric method for estimating sparse precision matrix of high dimensional elliptical distribution. The proposed method calibrates regularizations when estimating each column of the precision matrix. Thus it not only is asymptotically tuning free, but also achieves an improved finite sample performance. Theoretically, we prove that the proposed method achieves the parametric rates of convergence in both parameter estimation and model selection. We present numerical results on both simulated and real datasets to support our theory and illustrate the effectiveness of the proposed estimator. 1
4 0.55882317 120 nips-2013-Faster Ridge Regression via the Subsampled Randomized Hadamard Transform
Author: Yichao Lu, Paramveer Dhillon, Dean P. Foster, Lyle Ungar
Abstract: We propose a fast algorithm for ridge regression when the number of features is much larger than the number of observations (p n). The standard way to solve ridge regression in this setting works in the dual space and gives a running time of O(n2 p). Our algorithm Subsampled Randomized Hadamard Transform- Dual Ridge Regression (SRHT-DRR) runs in time O(np log(n)) and works by preconditioning the design matrix by a Randomized Walsh-Hadamard Transform with a subsequent subsampling of features. We provide risk bounds for our SRHT-DRR algorithm in the fixed design setting and show experimental results on synthetic and real datasets. 1
5 0.5538795 130 nips-2013-Generalizing Analytic Shrinkage for Arbitrary Covariance Structures
Author: Daniel Bartz, Klaus-Robert Müller
Abstract: Analytic shrinkage is a statistical technique that offers a fast alternative to crossvalidation for the regularization of covariance matrices and has appealing consistency properties. We show that the proof of consistency requires bounds on the growth rates of eigenvalues and their dispersion, which are often violated in data. We prove consistency under assumptions which do not restrict the covariance structure and therefore better match real world data. In addition, we propose an extension of analytic shrinkage –orthogonal complement shrinkage– which adapts to the covariance structure. Finally we demonstrate the superior performance of our novel approach on data from the domains of finance, spoken letter and optical character recognition, and neuroscience. 1
6 0.50797284 209 nips-2013-New Subsampling Algorithms for Fast Least Squares Regression
7 0.47590074 131 nips-2013-Geometric optimisation on positive definite matrices for elliptically contoured distributions
8 0.47304946 256 nips-2013-Probabilistic Principal Geodesic Analysis
9 0.46706387 68 nips-2013-Confidence Intervals and Hypothesis Testing for High-Dimensional Statistical Models
10 0.4670628 271 nips-2013-Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima
11 0.45495394 72 nips-2013-Convex Calibrated Surrogates for Low-Rank Loss Matrices with Applications to Subset Ranking Losses
12 0.44767907 76 nips-2013-Correlated random features for fast semi-supervised learning
13 0.42991304 142 nips-2013-Information-theoretic lower bounds for distributed statistical estimation with communication constraints
14 0.4267543 145 nips-2013-It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals
15 0.42484245 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
16 0.4194845 324 nips-2013-The Fast Convergence of Incremental PCA
17 0.41139379 340 nips-2013-Understanding variable importances in forests of randomized trees
18 0.41035494 117 nips-2013-Fast Algorithms for Gaussian Noise Invariant Independent Component Analysis
19 0.41009152 109 nips-2013-Estimating LASSO Risk and Noise Level
20 0.4092769 280 nips-2013-Robust Data-Driven Dynamic Programming
topicId topicWeight
[(2, 0.016), (16, 0.031), (33, 0.121), (34, 0.116), (41, 0.022), (49, 0.032), (56, 0.096), (70, 0.019), (85, 0.037), (89, 0.046), (90, 0.314), (93, 0.032), (95, 0.01)]
simIndex simValue paperId paperTitle
same-paper 1 0.74213338 225 nips-2013-One-shot learning and big data with n=2
Author: Lee H. Dicker, Dean P. Foster
Abstract: We model a “one-shot learning” situation, where very few observations y1, ..., yn ∈ R are available. Associated with each observation yi is a very high-dimensional vector xi ∈ Rd, which provides context for yi and enables us to predict subsequent observations, given their own context. One of the salient features of our analysis is that the problems studied here are easier when the dimension of xi is large; in other words, prediction becomes easier when more context is provided. The proposed methodology is a variant of principal component regression (PCR). Our rigorous analysis sheds new light on PCR. For instance, we show that classical PCR estimators may be inconsistent in the specified setting, unless they are multiplied by a scalar c > 1; that is, unless the classical estimator is expanded. This expansion phenomenon appears to be somewhat novel and contrasts with shrinkage methods (c < 1), which are far more common in big data analyses.
2 0.70464122 54 nips-2013-Bayesian optimization explains human active search
Author: Ali Borji, Laurent Itti
Abstract: Many real-world problems have complicated objective functions. To optimize such functions, humans utilize sophisticated sequential decision-making strategies. Many optimization algorithms have also been developed for this same purpose, but how do they compare to humans in terms of both performance and behavior? We try to unravel the general underlying algorithm people may be using while searching for the maximum of an invisible 1D function. Subjects click on a blank screen and are shown the ordinate of the function at each clicked abscissa location. Their task is to find the function’s maximum in as few clicks as possible. Subjects win if they get close enough to the maximum location. Analysis over 23 non-maths undergraduates, optimizing 25 functions from different families, shows that humans outperform 24 well-known optimization algorithms. Bayesian Optimization based on Gaussian Processes, which exploits all the x values tried and all the f (x) values obtained so far to pick the next x, predicts human performance and searched locations better. In 6 follow-up controlled experiments over 76 subjects, covering interpolation, extrapolation, and optimization tasks, we further confirm that Gaussian Processes provide a general and unified theoretical account to explain passive and active function learning and search in humans. 1
3 0.65830106 27 nips-2013-Adaptive Multi-Column Deep Neural Networks with Application to Robust Image Denoising
Author: Forest Agostinelli, Michael R. Anderson, Honglak Lee
Abstract: Stacked sparse denoising autoencoders (SSDAs) have recently been shown to be successful at removing noise from corrupted images. However, like most denoising techniques, the SSDA is not robust to variation in noise types beyond what it has seen during training. To address this limitation, we present the adaptive multi-column stacked sparse denoising autoencoder (AMC-SSDA), a novel technique of combining multiple SSDAs by (1) computing optimal column weights via solving a nonlinear optimization program and (2) training a separate network to predict the optimal weights. We eliminate the need to determine the type of noise, let alone its statistics, at test time and even show that the system can be robust to noise not seen in the training set. We show that state-of-the-art denoising performance can be achieved with a single system on a variety of different noise types. Additionally, we demonstrate the efficacy of AMC-SSDA as a preprocessing (denoising) algorithm by achieving strong classification performance on corrupted MNIST digits. 1
4 0.61724138 292 nips-2013-Sequential Transfer in Multi-armed Bandit with Finite Set of Models
Author: Mohammad Gheshlaghi azar, Alessandro Lazaric, Emma Brunskill
Abstract: Learning from prior tasks and transferring that experience to improve future performance is critical for building lifelong learning agents. Although results in supervised and reinforcement learning show that transfer may significantly improve the learning performance, most of the literature on transfer is focused on batch learning tasks. In this paper we study the problem of sequential transfer in online learning, notably in the multi–armed bandit framework, where the objective is to minimize the total regret over a sequence of tasks by transferring knowledge from prior tasks. We introduce a novel bandit algorithm based on a method-of-moments approach for estimating the possible tasks and derive regret bounds for it. 1
5 0.54882991 173 nips-2013-Least Informative Dimensions
Author: Fabian Sinz, Anna Stockl, January Grewe, January Benda
Abstract: We present a novel non-parametric method for finding a subspace of stimulus features that contains all information about the response of a system. Our method generalizes similar approaches to this problem such as spike triggered average, spike triggered covariance, or maximally informative dimensions. Instead of maximizing the mutual information between features and responses directly, we use integral probability metrics in kernel Hilbert spaces to minimize the information between uninformative features and the combination of informative features and responses. Since estimators of these metrics access the data via kernels, are easy to compute, and exhibit good theoretical convergence properties, our method can easily be generalized to populations of neurons or spike patterns. By using a particular expansion of the mutual information, we can show that the informative features must contain all information if we can make the uninformative features independent of the rest. 1
6 0.54802775 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
7 0.54372084 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking
8 0.54355305 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result
9 0.54346138 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits
10 0.54321104 310 nips-2013-Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators.
11 0.54278678 101 nips-2013-EDML for Learning Parameters in Directed and Undirected Graphical Models
12 0.54275233 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
14 0.54218918 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
15 0.54158384 350 nips-2013-Wavelets on Graphs via Deep Learning
16 0.54091305 294 nips-2013-Similarity Component Analysis
17 0.54066724 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
18 0.54043418 152 nips-2013-Learning Efficient Random Maximum A-Posteriori Predictors with Non-Decomposable Loss Functions
19 0.54033804 298 nips-2013-Small-Variance Asymptotics for Hidden Markov Models
20 0.54031795 280 nips-2013-Robust Data-Driven Dynamic Programming