jmlr jmlr2008 jmlr2008-5 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Arnak S. Dalalyan, Anatoly Juditsky, Vladimir Spokoiny
Abstract: The statistical problem of estimating the effective dimension-reduction (EDR) subspace in the multi-index regression model with deterministic design and additive noise is considered. A new procedure for recovering the directions of the EDR subspace is proposed. Many methods for estimating the EDR subspace perform principal component analysis on a family of vectors, say β̂_1, . . . , β̂_L, nearly lying in the EDR subspace. This is in particular the case for the structure-adaptive approach proposed by Hristache et al. (2001a). In the present work, we propose to estimate the projector onto the EDR subspace by the solution to the optimization problem: minimize max_{ℓ=1,...,L} β̂_ℓ'(I − A)β̂_ℓ subject to A ∈ A_{m*}, where A_{m*} is the set of all symmetric matrices with eigenvalues in [0, 1] and trace less than or equal to m*, with m* being the true structural dimension. Under mild assumptions, √n-consistency of the proposed procedure is proved (up to a logarithmic factor) in the case when the structural dimension is not larger than 4. Moreover, the stochastic error of the estimator of the projector onto the EDR subspace is shown to depend on L logarithmically. This enables us to use a large number of vectors β̂_ℓ for estimating the EDR subspace. The empirical behavior of the algorithm is studied through numerical simulations. Keywords: dimension-reduction, multi-index regression model, structure-adaptive approach, central subspace
Reference: text
sentIndex sentText sentNum sentScore
1 A new procedure for recovering the directions of the EDR subspace is proposed. [sent-12, score-0.19]
2 Many methods for estimating the EDR subspace perform principal component analysis on a family of vectors, say β̂_1, . [sent-13, score-0.171]
3 In the present work, we propose to estimate the projector onto the EDR subspace by the solution to the optimization problem: minimize max_ℓ β̂_ℓ'(I − A)β̂_ℓ, ℓ = 1,. [sent-19, score-0.265]
4 Moreover, the stochastic error of the estimator of the projector onto the EDR subspace is shown to depend on L logarithmically. [sent-24, score-0.323]
5 Keywords: dimension-reduction, multi-index regression model, structure-adaptive approach, central subspace 1. [sent-27, score-0.17]
6 When the function g is unspecified, only the linear subspace S_ϑ spanned by these vectors may be identified from the sample. [sent-74, score-0.205]
7 This subspace is usually called index space or dimension-reduction (DR) subspace. [sent-75, score-0.145]
8 This smallest DR subspace, which is the intersection of all DR subspaces, is called effective dimension-reduction (EDR) subspace (Li, 1991) or central mean subspace (Cook and Li, 2002). [sent-78, score-0.29]
9 Note that a closely related problem is the estimation of the central subspace (CS), see Cook and Weisberg (1999) for its definition. [sent-87, score-0.178]
10 We refer to Cook and Li (2002) for background on the difference between the CS and the central mean subspace and to Cook and Ni (2005) for a discussion of the relationship between different algorithms estimating these subspaces. [sent-93, score-0.171]
11 There are a number of methods providing an estimator of the EDR subspace in our set-up. [sent-94, score-0.232]
12 Second, there is no theoretical justification guaranteeing that these methods estimate the whole EDR subspace and not just a part thereof, see Cook and Li (2004, Section 3. [sent-103, score-0.145]
13 Moreover, in many cases they provide more accurate estimates of the EDR subspace (Hristache et al. [sent-109, score-0.145]
14 Therefore, a reasonable strategy for estimating the EDR subspace is to execute different procedures and to take a decision after comparing the obtained results. [sent-114, score-0.194]
15 , ∇f(X_n) and then recovering the EDR subspace from the vectors β_ℓ by running a principal component analysis (PCA). [sent-129, score-0.205]
16 Unfortunately, if L is small with respect to n, the subspace spanned by the vectors β1 , . [sent-131, score-0.205]
17 This goal is achieved by introducing a new method of extracting the EDR subspace from the estimators of the vectors β 1 , . [sent-137, score-0.249]
18 The main advantage of SAMM is that it allows us to deal with the case when L increases polynomially in n and yields an estimator of the EDR subspace which is consistent under a very weak identifiability assumption. [sent-143, score-0.232]
19 In addition, SAMM provides a √n-consistent estimator (up to a logarithmic factor) of the EDR subspace when m* ≤ 4. [sent-144, score-0.232]
20 There are many methods for estimating the EDR subspace in this case, see Yin and Cook (2005); Delecroix et al. [sent-146, score-0.171]
21 Note also that the methods for estimating the EDR subspace often have their counterparts in partially linear regression analysis, see for example Samarov et al. [sent-148, score-0.196]
22 Start with the null structural information, then iterate the above-mentioned two steps (estimation of the model and extraction of the structure) several times improving the quality of model estimation and increasing the accuracy of structural information during the iteration. [sent-167, score-0.149]
23 The function f has the same smoothness as g in the directions of the EDR subspace S spanned by the vectors ϑ1 , . [sent-179, score-0.227]
24 The corresponding local-linear estimator can be defined by (f̂(X_i), ∇f̂(X_i)')' = (∑_{j=1}^n (1, X_ij')'(1, X_ij') w*_ij)^{-1} ∑_{j=1}^n Y_j (1, X_ij')' w*_ij, (2) with the weights w*_ij = K(|Π* X_ij|²/h²), where h is some positive real number and Π* is the orthogonal projector onto the EDR subspace S. [sent-184, score-0.481]
25 If only an estimator Â of the orthogonal projector Π* is available, a possible strategy is to replace Π* by Â in the definitions of the weights w*_ij. [sent-186, score-0.191]
26 This is done by setting w_ij = K(X_ij'(I + ρ^{-2}Â)X_ij/h²) and by redefining (f̂(X_i), ∇f̂(X_i)')' = (∑_{j=1}^n (1, X_ij')'(1, X_ij') w_ij)^{-1} ∑_{j=1}^n Y_j (1, X_ij')' w_ij. [sent-189, score-0.474]
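To make this local-linear step concrete, here is a minimal numpy sketch of the weighted least-squares fit behind (2)-(3), with the elliptic weights w_ij = K(X_ij'(I + ρ^{-2}Â)X_ij/h²). The quadratic kernel (1 − t)_+ and the function name are assumptions for illustration; the small ridge I_{d+1}/n is the regularization mentioned later in the text.

```python
import numpy as np

def local_linear_gradients(X, Y, A_hat, rho, h):
    """Local-linear estimates (f_hat(X_i), grad_hat f(X_i)) with structure-adapted
    weights w_ij = K(X_ij'(I + rho^-2 A_hat) X_ij / h^2); here K(t) = (1 - t)_+."""
    n, d = X.shape
    S = np.eye(d) + A_hat / rho ** 2              # shape matrix of the elliptic neighborhood
    out = np.zeros((n, d + 1))                    # row i holds (f_hat(X_i), grad_hat f(X_i))
    for i in range(n):
        Xij = X - X[i]                            # X_ij = X_j - X_i
        t = np.einsum('jk,kl,jl->j', Xij, S, Xij) / h ** 2
        w = np.maximum(1.0 - t, 0.0)              # kernel vanishing on [1, infinity)
        Z = np.hstack([np.ones((n, 1)), Xij])     # local design (1, X_ij')
        G = Z.T @ (w[:, None] * Z) + np.eye(d + 1) / n   # ridge I_{d+1}/n keeps G invertible
        out[i] = np.linalg.solve(G, Z.T @ (w * Y))
    return out[:, 0], out[:, 1:]                  # f_hat values and gradient estimates
```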
27 Then S is the linear subspace of Rd spanned by the vectors ∇ f (Xi ), i = 1, . [sent-194, score-0.205]
28 Recall that the PCA method is based on the orthogonal decomposition of the matrix M = n^{-1} ∑_{i=1}^n ∇f(X_i)∇f(X_i)': M = OΛO' with an orthogonal matrix O and a diagonal matrix Λ with diagonal entries λ_1 ≥ λ_2 ≥ . [sent-199, score-0.15]
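For illustration, this PCA step can be written in a few lines of numpy (the names are mine): build M from gradient estimates and return the projector onto the m* leading eigenvectors.

```python
import numpy as np

def edr_projector_by_pca(grads, m_star):
    """M = n^-1 sum_i grad f(X_i) grad f(X_i)'; return the orthogonal projector
    onto the span of the eigenvectors of the m* largest eigenvalues of M."""
    M = grads.T @ grads / grads.shape[0]
    _, eigvecs = np.linalg.eigh(M)       # eigenvalues come back in ascending order
    top = eigvecs[:, -m_star:]           # leading m* eigenvectors
    return top @ top.T
```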
29 The obvious limitation of this approach is that it recovers the EDR subspace entirely only if the rank of M_L coincides with the rank of M, which is equal to m*. [sent-223, score-0.166]
30 In the case when only an estimator of ∇f is available, the above-described method of recovering the EDR directions from an estimator of M_L has a risk of order L/n (Hristache et al. [sent-226, score-0.219]
31 To this end, we modify the method of extracting the structural information from the estimators β̂_ℓ of the vectors β_ℓ. [sent-231, score-0.162]
32 Observe that the estimator Π̂_m of the projector Π* based on the PCA solves the following quadratic optimization problem: minimize ∑_ℓ β̂_ℓ'(I − Π)β̂_ℓ subject to Π² = Π, tr Π ≤ m, (5) where the minimization is carried over the set of all symmetric (d × d)-matrices. [sent-233, score-0.358]
33 Let A_m be the set of (d × d)-matrices defined as follows: A_m = {A : A = A', 0 ≼ A ≼ I, tr A ≤ m}. [sent-235, score-0.203]
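The proposed estimator replaces the nonconvex constraint Π² = Π in (5) by A ∈ A_{m*} and the sum by a maximum over ℓ. As a sketch only — cvxpy is an assumed dependency, not something the paper prescribes — the resulting semidefinite program can be written as:

```python
import numpy as np
import cvxpy as cp

def estimate_structure_matrix(betas, m_star):
    """Solve: minimize max_l beta_l'(I - A) beta_l over
    A in A_{m*} = {A = A', 0 <= A <= I, tr A <= m*} (a semidefinite program)."""
    d = betas.shape[1]
    A = cp.Variable((d, d), symmetric=True)
    I = np.eye(d)
    losses = [b @ (I - A) @ b for b in betas]        # each term is affine in A
    constraints = [A >> 0, I - A >> 0, cp.trace(A) <= m_star]
    cp.Problem(cp.Minimize(cp.max(cp.hstack(losses))), constraints).solve()
    return A.value                                   # an element of A_{m*}, not yet a projector
```

The final projector estimate is then read off from the m* leading eigenvectors of the returned matrix, exactly as in the PCA sketch above.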
34 Theoretical Features of SAMM Throughout this section the true dimension m∗ of the EDR subspace is assumed to be known. [sent-243, score-0.172]
35 The vectors ϑ_j are assumed to form an orthonormal basis of the EDR subspace, thus entailing the representation Π* = ∑_{j=1}^{m*} ϑ_j ϑ_j'. [sent-254, score-0.223]
36 c) Define the estimators ∇f̂_k(X_i) using the weights w_ij = K(X_ij'(I + ρ_k^{-2}Â_{k-1})X_ij/h_k²). [sent-269, score-0.225]
37 e) Set ρk+1 = aρ · ρk , hk+1 = ah · hk and increase k by one. [sent-278, score-0.179]
38 f) Stop if ρk < ρmin or hk > hmax , otherwise continue with the step c). [sent-279, score-0.135]
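Read together, steps c)-f) suggest the following schematic loop. This is only one reading of the procedure under my own assumptions: the update of Â_k from the gradient estimates (step d), not quoted above) is summarized by the convex step sketched earlier, the β̂_ℓ are taken to be the estimated gradients, and every tuning constant is a placeholder.

```python
import numpy as np

def samm(X, Y, m_star, rho0=1.0, h0=0.5, a_rho=0.5, a_h=1.25,
         rho_min=1e-2, h_max=2.0):
    """Schematic SAMM loop built from the sketches above:
    c) local-linear gradients with weights driven by (rho_k, h_k, A_{k-1});
    d) a new structure matrix A_k from those gradients (here: the convex step);
    e)-f) geometric updates of rho_k, h_k and the stopping rule."""
    d = X.shape[1]
    rho, h = rho0, h0
    A = np.zeros((d, d))                    # start with null structural information
    while rho >= rho_min and h <= h_max:
        _, grads = local_linear_gradients(X, Y, A, rho, h)   # step c)
        A = estimate_structure_matrix(grads, m_star)         # step d), schematic
        rho, h = a_rho * rho, a_h * h                        # step e)
    # Pi_n: projector onto the m* leading eigenvectors of the final A; its image estimates S
    _, eigvecs = np.linalg.eigh(A)
    top = eigvecs[:, -m_star:]
    return top @ top.T
```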
39 The estimator of the EDR subspace is then the image of Πn . [sent-282, score-0.232]
40 Both Â_{k(n)} and Π̂_n are estimators of the projector onto S. [sent-283, score-0.158]
41 Remark 1 Assumption (A2) implies that the subspace S = Im(Π∗ ) is the smallest DR subspace, therefore it is the EDR subspace. [sent-313, score-0.145]
42 Indeed, for any DR subspace S , the gradient ∇ f (Xi ) belongs to S for every i. [sent-314, score-0.145]
43 Next, set Z_ij^(k) = (h_k P_k)^{-1} X_ij and for any d × d matrix U put w_ij^(k)(U) = K((Z_ij^(k))' U Z_ij^(k)), w̄_ij^(k)(U) = K̄((Z_ij^(k))' U Z_ij^(k)), N_i^(k)(U) = ∑_j w_ij^(k)(U) and V_i^(k)(U) = ∑_{j=1}^n (1, (Z_ij^(k))')'(1, (Z_ij^(k))') w_ij^(k)(U). [sent-352, score-0.658]
44 Remark 3 Note that in (A3) we implicitly assumed that the matrices V_i^(k) are invertible, which may be true only if any neighborhood E^(k)(X_i) = {x : |(I + ρ_k^{-2}Π*)^{-1/2}(X_i − x)| ≤ h_k} contains at least d design points different from X_i. [sent-366, score-0.168]
45 #E^(1)(X_i) = d + 1 and add to the matrix ∑_{j=1}^n (1, X_ij')'(1, X_ij') w_ij, involved in the definition (3), a small full-rank matrix to be sure that the resulting matrix is invertible, see Section 4. [sent-372, score-0.236]
46 Proof Easy algebra yields ‖Π̂_n − Π*‖₂² = tr(Π̂_n − Π*)² = tr Π̂_n² − 2 tr Π̂_n Π* + tr Π* ≤ tr Π̂_n + m* − 2 tr Π̂_n Π* ≤ 2m* − 2 tr Π̂_n Π*. [sent-392, score-1.218]
47 The equality tr Π∗ = m∗ and the linearity of the trace operator complete the proof of the first inequality. [sent-393, score-0.225]
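For rank-m* projectors the chain above collapses to the identity ‖Π̂ − Π*‖₂² = 2m* − 2 tr(Π̂Π*); a quick numerical check with made-up projectors:

```python
import numpy as np

d, m_star = 6, 2
Q = np.linalg.qr(np.random.randn(d, d))[0]               # random orthonormal columns
Pi_true = Q[:, :m_star] @ Q[:, :m_star].T
Pi_hat = Q[:, 1:1 + m_star] @ Q[:, 1:1 + m_star].T       # another rank-m* projector
lhs = np.linalg.norm(Pi_hat - Pi_true, 'fro') ** 2
rhs = 2 * m_star - 2 * np.trace(Pi_hat @ Pi_true)
assert np.isclose(lhs, rhs)   # equality since both matrices are rank-m* projectors
```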
48 According to these results, for m* ≤ 4, the estimator of the orthogonal projector onto S provided by the SAMM procedure is √n-consistent up to a logarithmic factor. [sent-395, score-0.214]
49 4 Risk Bound for the Estimator of a Basis of the EDR Subspace The main result of this paper stated in the preceding subsection provides a risk bound for the estimator Πn of Π∗ , the orthogonal projector onto S . [sent-402, score-0.239]
50 As a by-product of this result, we show in this section that a similar risk bound holds also for the estimator of an orthonormal basis of S . [sent-403, score-0.128]
51 This means that for an arbitrarily chosen orthonormal basis of the estimated EDR subspace Ŝ = Im(Π̂_n), there is an orthonormal basis of the true EDR subspace S such that the matrices built from these bases are close in Frobenius norm with a probability tending to one. [sent-404, score-0.372]
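One standard way to exhibit such a pair of close bases — shown here purely as an illustration, not as the construction used in the proof — is to rotate a basis of the true subspace towards the chosen basis of the estimated one by an orthogonal Procrustes step:

```python
import numpy as np

def align_basis(Theta_hat, Theta):
    """Given column-orthonormal bases Theta_hat (of the estimated subspace) and
    Theta (of the true subspace S), return an orthonormal basis of S that is
    closest to Theta_hat in Frobenius norm (orthogonal Procrustes via the SVD)."""
    U, _, Vt = np.linalg.svd(Theta.T @ Theta_hat)
    return Theta @ (U @ Vt)
```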
52 , ϑ̂_{m*} of the estimated EDR subspace Ŝ = Im(Π̂_n) there exists an orthonormal basis ϑ_1, . [sent-409, score-0.186]
53 , ϑ_{m*} of the true EDR subspace S = Im(Π*) such that, for sufficiently large n, it holds P(‖Θ̂_n − Θ‖₂ > C n^{-2/(3∨m*)} t_n + 2(√m* + 1)√(2µ*) z c₀ σ/(√n(1 − ζ_n))) ≤ L z e^{-(z²-1)/2} + (3k(n) − 5)/n, where Θ̂_n (resp. [sent-412, score-0.189]
54 Furthermore, we add the small full-rank matrix I_{d+1}/n to ∑_{j=1}^n (1, X_ij')'(1, X_ij') w_ij in (3). [sent-443, score-0.184]
55 Table 1 contains the average loss for different values of the sample size n for the first step estimator by SAMM, the final estimator provided by SAMM and the estimators based on MAVE, MDA and SAVE. [sent-497, score-0.241]
56 Figure 3: Average loss versus σ for the first step (solid line) and the final (dotted line) estimators provided by SAMM and for the estimator based on MAVE (dashed line) in Example 3. [sent-723, score-0.154]
57 The main idea is to evaluate the accuracy of the first step estimators of β and, given the accuracy of the estimator at the step k, evaluate the accuracy of the estimators at the step k + 1. [sent-795, score-0.221]
58 For this reason, the notation (h_k^{-1} f̄_k(X_i), P_k*∇f̄_k(X_i)')' = h_k^{-1} V_i^(k)(U_k)^{-1} ∑_{j=1}^n f(X_j) (1, (Z_ij^(k))')' w_ij^(k)(U_k) will be useful. [sent-811, score-0.158]
59 , ξ*_{L,k} ∈ R^d such that max_{1≤ℓ≤L} E[|ξ*_{ℓ,k}|²] ≤ c₀²σ² and P(max_{1≤ℓ≤L} |P_k*(β̂_{ℓ,k} − β̄_{ℓ,k}) − ξ*_{ℓ,k}/(√n h_k)| ≥ ϒ_k, Â_{k-1} ∈ P_{k-1}) ≤ 2/n, where we used the notation ϒ_k = √(C_V) C_g (ρ_k + δ_{k-1})² h_k + c₁σα_k t_n/(√n h_k) with t_n = 4 + (3 log(Ln) + 3d² log(√n))^{1/2}, and where c₀ and c₁ are constants depending only on ψ, d, C_V, C_K and C_w. [sent-817, score-0.522]
60 Define v_j = V_i^(k)(U_k)^{-1/2}(1, (Z_ij^(k))')' w_ij^(k)(U_k), λ_j = w_ij^(k)(U_k) and λ = (λ_1, . [sent-824, score-0.316]
61 Therefore, |b(X_i)|² ≤ ‖V_i^(k)(U_k)^{-1}‖ · ∑_{j=1}^n r_ij² w_ij^(k)(U_k) ≤ ‖V_i^(k)(U_k)^{-1}‖ · ∑_{j=1}^n w_ij^(k)(U_k) · max_j r_ij² ≤ C_V h_k^{-2} max_j r_ij². [sent-830, score-0.345]
62 Since the weights w_ij^(k) are defined via the kernel function K vanishing on the interval [1, ∞[, we have max_{j: w_ij^(k)(U_k)≠0} r_ij² = max{r_ij² : |S_k X_ij| ≤ h_k}. [sent-833, score-0.322]
63 By Corollary 19, the inequality |S_k X_ij| ≤ h_k implies |Π* X_ij| ≤ (ρ_k + δ_{k-1}) h_k. [sent-834, score-0.197]
64 On the other hand, |Π∗ Xi j | ≤ |Xi j | ≤ |Sk Xi j | ≤ hk . [sent-835, score-0.135]
65 These estimates yield |b(X_i)| ≤ √(C_V) C_g {(ρ_k + δ_{k-1}) ∧ 1}² h_k, and consequently, max_{ℓ=1,. [sent-836, score-0.164]
66 ,L} |P_k*(β̄_{ℓ,k} − β_ℓ)| ≤ max_i |b(X_i)| ≤ √(C_V) C_g {(ρ_k + δ_{k-1}) ∧ 1}² h_k. [sent-839, score-0.164]
67 Using this notation, we have P_k*(β̂_{ℓ,k} − β̄_{ℓ,k}) = ∑_{j=1}^n c_{j,ℓ}(U_k) ε_j, where c_{j,ℓ}(U_k) = (1/(n h_k)) ∑_{i=1}^n E_1 V_i^(k)(U_k)^{-1} (1, (Z_ij^(k))')' w_ij^(k)(U_k) ψ_{ℓ,i}. [sent-842, score-0.158]
68 , L we have |P_k*(β̂_{ℓ,k} − β̄_{ℓ,k}) − ξ*_{ℓ,k}/(√n h_k)| ≤ sup_U |∑_{j=1}^n (c_{j,ℓ}(U) − c_{j,ℓ}(I)) ε_j|. [sent-848, score-0.206]
69 Setting ε = 2α_k/√n we get that the probability of the event sup_U |∑_{j=1}^n (c_{j,ℓ}(U) − c_{j,ℓ}(I)) ε_j| ≥ c₁σα_k(4 + (3 log(Ln) + 3d² log(√n))^{1/2})/(√n h_k) is less than 2/n. [sent-851, score-0.231]
70 Corollary 10 If nL ≥ 6 and the assumptions of Proposition 9 are fulfilled, then P(max_ℓ |P_k*(β̂_{ℓ,k} − β_ℓ)| ≥ ϒ_k + σc₀z/(√n h_k), Â_{k-1} ∈ P_{k-1}) ≤ L z e^{-(z²-1)/2}. [sent-853, score-0.164]
71 In particular, if nL ≥ 6, the probability of the event {max_ℓ |P_k*(β̂_{ℓ,k} − β_ℓ)| ≥ ϒ_k + 2σc₀ log(Ln)/(√n h_k)} ∩ {Â_{k-1} ∈ P_{k-1}} does not exceed 3/n, where ϒ_k and c₀ are defined in Proposition 9. [sent-854, score-0.189]
72 2 The Accuracy of the First-step Estimator Since at the first step no information about the EDR subspace is available, we use the same bandwidth in all directions, that is the local neighborhoods are balls (and not ellipsoids) of radius h. [sent-862, score-0.178]
73 Recall that for any integer k ∈ [2, k(n)]—where k(n) is the total number of iterations—we use the notation ρ_k = a_ρ ρ_{k-1}, h_k = a_h h_{k-1} and α_k = 2δ_{k-1}² ρ_k^{-2} + 2δ_{k-1} ρ_k^{-1}. [sent-883, score-0.179]
74 Let us introduce the additional notation γ_k = ϒ_k + 2σc₀ log(nL)/(√n h_k) for k < k(n), γ_k = ϒ_k + σc₀ z/(√n h_k) for k = k(n), ζ_k = 2µ*(γ_k² ρ_k^{-2} + 2γ_k ρ_k^{-1} C_g), δ_k = 2γ_k µ*/(1 − ζ_k), Ω_k = {max_ℓ |P_k*(β̂_{ℓ,k} − β_ℓ)| ≤ γ_k}. [sent-884, score-0.405]
75 Let functions a_{j,γ}: R^p → R^d obey the conditions sup_{γ∈Γ} sup_{|u−u*|≤r} ∑_{j=1}^n |a_{j,γ}(u)|² ≤ κ₀² and sup_{γ∈Γ} sup_{|u−u*|≤r} sup_{e∈S^{d-1}} ∑_{j=1}^n |d(e'a_{j,γ})(u)/du|² ≤ κ₁² for some u* ∈ R^p. [sent-897, score-0.41]
76 If the ε_j's are independent N(0, σ²)-distributed random variables, then P(sup_{γ∈Γ} sup_{|u−u*|≤r} |∑_{j=1}^n a_{j,γ}(u) ε_j| > tσκ₀ + 2√n σκ₁ ε) ≤ 2/n, where t = √(3 log(|Γ|(2r/ε)^p n)) and |Γ| is the cardinality of Γ. [sent-898, score-0.142]
77 Proof Let B_r be the ball {u : |u − u*| ≤ r} ⊂ R^p and Σ_{r,ε} be an ε-net on B_r such that for any u ∈ B_r there is an element u_l ∈ Σ_{r,ε} such that |u − u_l| ≤ ε. [sent-899, score-0.162]
78 Thus we get P(sup_{γ∈Γ} sup_{u_l∈Σ_{r,ε}} |η_γ(u_l)| > tσκ₀) ≤ ∑_{γ∈Γ} ∑_{l=1}^{N_{r,ε}} P(|η_γ(u_l)| > tσκ₀) ≤ |Γ| N_{r,ε} t e^{-(t²-1)/2}. [sent-903, score-0.223]
79 On the other hand, for any u, u' ∈ B_r, |η_γ(u) − η_γ(u')|² = sup_{e∈S^{d-1}} |e'(η_γ(u) − η_γ(u'))|² ≤ |u − u'|² · sup_{u∈B_r} sup_{e∈S^{d-1}} |d(e'η_γ)(u)/du|² = |u − u'|² · sup_{u∈B_r} sup_{e∈S^{d-1}} |∑_{j=1}^n (d(e'a_{j,γ})(u)/du) ε_j|². [sent-905, score-0.41]
80 The Cauchy–Schwarz inequality yields |η_γ(u) − η_γ(u')|²/|u − u'|² ≤ sup_{u∈B_r} sup_{e∈S^{d-1}} ∑_{j=1}^n |d(e'a_{j,γ})(u)/du|² ∑_{j=1}^n ε_j² ≤ κ₁² ∑_{j=1}^n ε_j². [sent-906, score-0.284]
81 The vectors β_ℓ are assumed to belong to an m*-dimensional subspace S of R^d, but in this subsection we do not necessarily assume that the β_ℓ's are defined by (4). [sent-914, score-0.207]
82 Proof In view of the relations tr Â² ≤ tr Â ≤ m* and tr(Π*)² = tr Π* = m*, we have tr(Â − Π*)² = tr(Â² − Π*) + 2 tr(I − Â)Π* ≤ 2|tr(I − Â)Π*|. [sent-939, score-0.609]
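A quick numerical sanity check of this trace inequality, with an arbitrary made-up element of A_{m*} and a made-up projector, is:

```python
import numpy as np

# Check tr(A_hat - Pi)^2 <= 2 |tr((I - A_hat) Pi)| for A_hat in A_{m*}.
d, m_star = 6, 2
rng = np.random.default_rng(0)
Q = np.linalg.qr(rng.standard_normal((d, d)))[0]
Pi = Q[:, :m_star] @ Q[:, :m_star].T                 # a rank-m* projector Pi*
lam = rng.uniform(0, 1, d)
lam *= min(1.0, m_star / lam.sum())                  # eigenvalues in [0, 1], trace <= m*
R = np.linalg.qr(rng.standard_normal((d, d)))[0]
A_hat = R @ np.diag(lam) @ R.T                       # an arbitrary element of A_{m*}
lhs = np.trace((A_hat - Pi) @ (A_hat - Pi))
rhs = 2 * abs(np.trace((np.eye(d) - A_hat) @ Pi))
assert lhs <= rhs + 1e-10
```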
83 For every A ∈ A_m, it obviously holds (I − A^{1/2})² = I − 2A^{1/2} + A. [sent-946, score-0.203]
84 Therefore, (I − Â^{1/2})² ≼ I − Â, and hence, tr Π*(I − Â^{1/2})²Π* ≤ tr Π*(I − Â)Π* = tr(I − Â)Π* ≤ δ², yielding |Π*x| ≤ |Π*Â^{1/2}x| + δ|x| ≤ |Â^{1/2}x| + δ|x| as required. [sent-947, score-0.609]
85 Lemma 20 Let tr(I − Â)Π* ≤ δ² for some δ ∈ [0, 1) and let Π̂_{m*} be the orthogonal projection matrix in R^d onto the subspace spanned by the eigenvectors of Â corresponding to its largest m* eigenvalues; then tr(I − Π̂_{m*})Π* ≤ δ²/(1 − δ²). [sent-950, score-0.274]
86 This implies that tr[ÂΠ*] = ∑_{j≤d} λ̂_j tr[ϑ̂_j ϑ̂_j'Π*] ≤ ∑_{j≤m*} (λ̂_j − λ̂_{m*}) tr[ϑ̂_j ϑ̂_j'Π*] + λ̂_{m*} tr[∑_{j=1}^d ϑ̂_j ϑ̂_j'Π*] = ∑_{j≤m*} (λ̂_j − λ̂_{m*}) tr[ϑ̂_j ϑ̂_j'Π*] + m* λ̂_{m*}. [sent-965, score-0.203]
87 Taking into account the relations ∑_{j≤d} λ̂_j ≤ m*, tr Π* = m* and (1 − λ̂_{m*+1})(I − Π̂_{m*}) ≼ I − Â, we get λ̂_{m*+1} ≤ m* − ∑_{j≤m*} λ̂_j ≤ tr[(I − Â)Π*] ≤ δ² and therefore I − Π̂_{m*} ≼ (1 − δ²)^{-1}(I − Â). [sent-967, score-0.203]
88 Since ‖B‖₂² = tr BB' ≤ (tr(BB')^{1/2})² for any matrix B, it holds ‖Π*(Â − Π*)Π*‖₂ = ‖Π*(I − Â)Π*‖₂ ≤ tr Π*(I − Â)Π* = tr(I − Â)Π* = δ². [sent-974, score-0.432]
89 Proof Simple computations yield ∑_{j=1}^n |E_1 V_i^(k)(U)^{-1} (1, (Z_ij^(k))')'|² w_ij^(k)(U) = tr(E_1 V_i^(k)(U)^{-1} E_1') ≤ dC_V. [sent-978, score-0.158]
90 Thus, we will write V_i, w_ij and Z_ij instead of V_i^(k), w_ij^(k) and Z_ij^(k). [sent-986, score-0.316]
91 By definition of c_{j,ℓ} we have |d(e'c_{j,ℓ})(U)/dU|² ≤ 2|(1/(n h_k)) ∑_{i=1}^n ẽ'(dV_i^{-1}(U)/dU)(1, Z_ij')' w_ij(U) ψ_{ℓ,i}|² + 2|(1/(n h_k)) ∑_{i=1}^n ẽ'V_i^{-1}(U)(1, Z_ij')'(dw_ij(U)/dU) ψ_{ℓ,i}|² = ∆₁ + ∆₂, where ẽ = E_1'e satisfies |ẽ| ≤ |e| = 1. [sent-987, score-0.296]
92 One checks that dw_ij(U)/dU = w̄_ij(U) Z_ij Z_ij', where we used w̄_ij(U) = 0 if Z_ij'U Z_ij > 1. [sent-988, score-0.316]
93 Hence, ‖dw_ij(U)/dU‖ = |w̄_ij(U)| · |Z_ij|² ≤ 2|w̄_ij(U)| and we get ∆₂ ≤ (8ψ²/(n²h_k²)) ∑_{i=1}^n ‖V_i^{-1}(U)(1, Z_ij')'‖² |w̄_ij(U)|² ≤ 24ψ²C_V²C_K²/(n²h_k²). [sent-992, score-0.451]
94 Simple computations show that ∂V_i(U)/∂U_pq = ∑_{j=1}^n (1, Z_ij')'(1, Z_ij') ∂w_ij(U)/∂U_pq = ∑_{j=1}^n (1, Z_ij')'(1, Z_ij') w̄_ij(U)(Z_ij)_p(Z_ij)_q. [sent-994, score-0.448]
95 Hence, for any a₁, a₂ ∈ R^{d+1}, d(a₁'V_i^{-1}(U)a₂)/dU = −∑_{j=1}^n a₁'V_i^{-1}(U)(1, Z_ij')' (1, Z_ij')V_i^{-1}(U)a₂ w̄_ij(U) Z_ij Z_ij'. [sent-995, score-0.158]
96 This relation, combined with the estimate |Z_ij|² ≤ 2 for all i, j such that w̄_ij(U) ≠ 0, implies the norm estimate ‖d(a₁'V_i^{-1}(U)a₂)/dU‖ ≤ 2 ∑_{j=1}^n |a₁'V_i^{-1}(U)(1, Z_ij')'| |(1, Z_ij')V_i^{-1}(U)a₂| |w̄_ij(U)| ≤ 6|a₁||a₂| ∑_{j=1}^n ‖V_i^{-1}(U)‖² |w̄_ij(U)| ≤ 6C_w C_V²|a₁||a₂| N_i(U)^{-1}. [sent-996, score-0.346]
97 By our choice of ah and aρ , we have ρ1 h1 ≥ ρk hk ≥ ρk(n) hk(n) . [sent-1012, score-0.179]
98 Therefore, γ_k/ρ_k ≤ 4√(C_V) C_g ρ_k h_k + 4σ(c₀ log(Ln) + c₁t_n)/(√n ρ_k h_k) ≤ 4√(C_V) C_g h_1 + 4σ(c₀ log(Ln) + c₁t_n)/(√n ρ_{k(n)} h_{k(n)}) = s_n. [sent-1013, score-0.27]
99 Estimating the structural dimension of regressions via parametric inverse regression. [sent-1048, score-0.126]
100 Using intraslice covariances for improved estimation of the central subspace in regression. [sent-1116, score-0.178]
wordName wordTfidf (topN-words)
[('samm', 0.43), ('edr', 0.403), ('mave', 0.305), ('tr', 0.203), ('cook', 0.184), ('dalalyan', 0.17), ('hristache', 0.161), ('wi', 0.158), ('pk', 0.151), ('subspace', 0.145), ('mda', 0.143), ('pokoiny', 0.143), ('uditsky', 0.143), ('zi', 0.136), ('hk', 0.135), ('ubspace', 0.134), ('eduction', 0.114), ('imension', 0.114), ('stimation', 0.114), ('estimator', 0.087), ('uk', 0.087), ('ul', 0.081), ('xi', 0.081), ('cv', 0.076), ('cg', 0.075), ('vi', 0.072), ('uzi', 0.072), ('sup', 0.071), ('projector', 0.068), ('estimators', 0.067), ('ak', 0.063), ('bura', 0.063), ('fnl', 0.063), ('lze', 0.063), ('sliced', 0.063), ('xia', 0.061), ('structural', 0.058), ('save', 0.057), ('sd', 0.056), ('du', 0.055), ('nhk', 0.054), ('br', 0.054), ('rd', 0.05), ('ni', 0.049), ('ck', 0.049), ('nl', 0.049), ('assertion', 0.046), ('proposition', 0.045), ('ml', 0.044), ('tn', 0.044), ('ah', 0.044), ('dr', 0.044), ('inverse', 0.041), ('orthonormal', 0.041), ('vectors', 0.037), ('li', 0.037), ('spokoiny', 0.036), ('weisberg', 0.036), ('orthogonal', 0.036), ('bv', 0.034), ('design', 0.033), ('bandwidth', 0.033), ('pq', 0.033), ('estimation', 0.033), ('inequality', 0.032), ('sa', 0.031), ('ti', 0.031), ('ln', 0.031), ('juditsky', 0.03), ('vec', 0.03), ('yin', 0.03), ('ij', 0.03), ('max', 0.029), ('im', 0.029), ('eigenvalues', 0.029), ('dimension', 0.027), ('arnak', 0.027), ('samarov', 0.027), ('lemma', 0.026), ('matrix', 0.026), ('estimating', 0.026), ('predictors', 0.026), ('corollary', 0.025), ('regression', 0.025), ('subsection', 0.025), ('jth', 0.025), ('event', 0.025), ('ful', 0.025), ('id', 0.024), ('recovering', 0.023), ('verifying', 0.023), ('onto', 0.023), ('procedures', 0.023), ('spanned', 0.023), ('proof', 0.022), ('directions', 0.022), ('nr', 0.022), ('yj', 0.022), ('coincides', 0.021), ('eigenvectors', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000008 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace
Author: Arnak S. Dalalyan, Anatoly Juditsky, Vladimir Spokoiny
Abstract: The statistical problem of estimating the effective dimension-reduction (EDR) subspace in the multi-index regression model with deterministic design and additive noise is considered. A new procedure for recovering the directions of the EDR subspace is proposed. Many methods for estimating the EDR subspace perform principal component analysis on a family of vectors, say β̂_1, . . . , β̂_L, nearly lying in the EDR subspace. This is in particular the case for the structure-adaptive approach proposed by Hristache et al. (2001a). In the present work, we propose to estimate the projector onto the EDR subspace by the solution to the optimization problem: minimize max_{ℓ=1,...,L} β̂_ℓ'(I − A)β̂_ℓ subject to A ∈ A_{m*}, where A_{m*} is the set of all symmetric matrices with eigenvalues in [0, 1] and trace less than or equal to m*, with m* being the true structural dimension. Under mild assumptions, √n-consistency of the proposed procedure is proved (up to a logarithmic factor) in the case when the structural dimension is not larger than 4. Moreover, the stochastic error of the estimator of the projector onto the EDR subspace is shown to depend on L logarithmically. This enables us to use a large number of vectors β̂_ℓ for estimating the EDR subspace. The empirical behavior of the algorithm is studied through numerical simulations. Keywords: dimension-reduction, multi-index regression model, structure-adaptive approach, central subspace
2 0.090412237 78 jmlr-2008-Randomized Online PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension
Author: Manfred K. Warmuth, Dima Kuzmin
Abstract: We design an online algorithm for Principal Component Analysis. In each trial the current instance is centered and projected into a probabilistically chosen low dimensional subspace. The regret of our online algorithm, that is, the total expected quadratic compression loss of the online algorithm minus the total quadratic compression loss of the batch algorithm, is bounded by a term whose dependence on the dimension of the instances is only logarithmic. We first develop our methodology in the expert setting of online learning by giving an algorithm for learning as well as the best subset of experts of a certain size. This algorithm is then lifted to the matrix setting where the subsets of experts correspond to subspaces. The algorithm represents the uncertainty over the best subspace as a density matrix whose eigenvalues are bounded. The running time is O(n2 ) per trial, where n is the dimension of the instances. Keywords: principal component analysis, online learning, density matrix, expert setting, quantum Bayes rule
3 0.076154806 52 jmlr-2008-Learning from Multiple Sources
Author: Koby Crammer, Michael Kearns, Jennifer Wortman
Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields general results for classification and regression. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. We discuss the related problem of learning parameters of a distribution from multiple data sources. Finally, we illustrate our theory through a series of synthetic simulations. Keywords: error bounds, multi-task learning
4 0.073779486 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis
Author: Alexandre d'Aspremont, Francis Bach, Laurent El Ghaoui
Abstract: Given a sample covariance matrix, we examine the problem of maximizing the variance explained by a linear combination of the input variables while constraining the number of nonzero coefficients in this combination. This is known as sparse principal component analysis and has a wide array of applications in machine learning and engineering. We formulate a new semidefinite relaxation to this problem and derive a greedy algorithm that computes a full set of good solutions for all target numbers of non zero coefficients, with total complexity O(n3 ), where n is the number of variables. We then use the same relaxation to derive sufficient conditions for global optimality of a solution, which can be tested in O(n3 ) per pattern. We discuss applications in subset selection and sparse recovery and show on artificial examples and biological data that our algorithm does provide globally optimal solutions in many cases. Keywords: PCA, subset selection, sparse eigenvalues, sparse recovery, lasso
5 0.068786368 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
Author: Sébastien Loustau
Abstract: This paper investigates statistical performances of Support Vector Machines (SVM) and considers the problem of adaptation to the margin parameter and to complexity. In particular we provide a classifier with no tuning parameter. It is a combination of SVM classifiers. Our contribution is two-fold: (1) we propose learning rates for SVM using Sobolev spaces and build a numerically realizable aggregate that converges with same rate; (2) we present practical experiments of this method of aggregation for SVM using both Sobolev spaces and Gaussian kernels. Keywords: classification, support vector machines, learning rates, approximation, aggregation of classifiers
6 0.065882824 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
7 0.064420253 26 jmlr-2008-Consistency of Trace Norm Minimization
8 0.063348696 17 jmlr-2008-Automatic PCA Dimension Selection for High Dimensional Data and Small Sample Sizes
9 0.062582016 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
10 0.061741397 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
11 0.059959721 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
12 0.059399735 57 jmlr-2008-Manifold Learning: The Price of Normalization
13 0.057803463 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
14 0.047467891 79 jmlr-2008-Ranking Categorical Features Using Generalization Properties
15 0.047158871 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
16 0.04672635 92 jmlr-2008-Universal Multi-Task Kernels
17 0.043720085 10 jmlr-2008-Active Learning of Causal Networks with Intervention Experiments and Optimal Designs (Special Topic on Causality)
18 0.04044129 64 jmlr-2008-Model Selection in Kernel Based Regression using the Influence Function (Special Topic on Model Selection)
19 0.037598375 91 jmlr-2008-Trust Region Newton Method for Logistic Regression
20 0.035063788 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine
topicId topicWeight
[(0, 0.2), (1, -0.06), (2, -0.109), (3, -0.016), (4, 0.166), (5, -0.043), (6, 0.001), (7, -0.072), (8, -0.091), (9, -0.072), (10, 0.089), (11, -0.014), (12, -0.025), (13, 0.062), (14, 0.141), (15, 0.034), (16, -0.018), (17, -0.112), (18, -0.044), (19, -0.018), (20, 0.138), (21, 0.014), (22, -0.076), (23, 0.102), (24, 0.12), (25, 0.045), (26, 0.137), (27, -0.064), (28, 0.0), (29, 0.107), (30, -0.289), (31, -0.045), (32, -0.088), (33, 0.054), (34, -0.207), (35, 0.04), (36, -0.216), (37, -0.236), (38, 0.115), (39, -0.019), (40, 0.22), (41, -0.168), (42, 0.032), (43, -0.136), (44, 0.149), (45, 0.094), (46, -0.041), (47, -0.096), (48, -0.038), (49, -0.007)]
simIndex simValue paperId paperTitle
same-paper 1 0.94435829 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace
Author: Arnak S. Dalalyan, Anatoly Juditsky, Vladimir Spokoiny
Abstract: The statistical problem of estimating the effective dimension-reduction (EDR) subspace in the multi-index regression model with deterministic design and additive noise is considered. A new procedure for recovering the directions of the EDR subspace is proposed. Many methods for estimating the EDR subspace perform principal component analysis on a family of vectors, say β̂_1, . . . , β̂_L, nearly lying in the EDR subspace. This is in particular the case for the structure-adaptive approach proposed by Hristache et al. (2001a). In the present work, we propose to estimate the projector onto the EDR subspace by the solution to the optimization problem: minimize max_{ℓ=1,...,L} β̂_ℓ'(I − A)β̂_ℓ subject to A ∈ A_{m*}, where A_{m*} is the set of all symmetric matrices with eigenvalues in [0, 1] and trace less than or equal to m*, with m* being the true structural dimension. Under mild assumptions, √n-consistency of the proposed procedure is proved (up to a logarithmic factor) in the case when the structural dimension is not larger than 4. Moreover, the stochastic error of the estimator of the projector onto the EDR subspace is shown to depend on L logarithmically. This enables us to use a large number of vectors β̂_ℓ for estimating the EDR subspace. The empirical behavior of the algorithm is studied through numerical simulations. Keywords: dimension-reduction, multi-index regression model, structure-adaptive approach, central subspace
2 0.44367632 57 jmlr-2008-Manifold Learning: The Price of Normalization
Author: Yair Goldberg, Alon Zakai, Dan Kushnir, Ya'acov Ritov
Abstract: We analyze the performance of a class of manifold-learning algorithms that find their output by minimizing a quadratic form under some normalization constraints. This class consists of Locally Linear Embedding (LLE), Laplacian Eigenmap, Local Tangent Space Alignment (LTSA), Hessian Eigenmaps (HLLE), and Diffusion maps. We present and prove conditions on the manifold that are necessary for the success of the algorithms. Both the finite sample case and the limit case are analyzed. We show that there are simple manifolds in which the necessary conditions are violated, and hence the algorithms cannot recover the underlying manifolds. Finally, we present numerical results that demonstrate our claims. Keywords: dimensionality reduction, manifold learning, Laplacian eigenmap, diffusion maps, locally linear embedding, local tangent space alignment, Hessian eigenmap
3 0.36338478 79 jmlr-2008-Ranking Categorical Features Using Generalization Properties
Author: Sivan Sabato, Shai Shalev-Shwartz
Abstract: Feature ranking is a fundamental machine learning task with various applications, including feature selection and decision tree learning. We describe and analyze a new feature ranking method that supports categorical features with a large number of possible values. We show that existing ranking criteria rank a feature according to the training error of a predictor based on the feature. This approach can fail when ranking categorical features with many values. We propose the Ginger ranking criterion, that estimates the generalization error of the predictor associated with the Gini index. We show that for almost all training sets, the Ginger criterion produces an accurate estimation of the true generalization error, regardless of the number of values in a categorical feature. We also address the question of finding the optimal predictor that is based on a single categorical feature. It is shown that the predictor associated with the misclassification error criterion has the minimal expected generalization error. We bound the bias of this predictor with respect to the generalization error of the Bayes optimal predictor, and analyze its concentration properties. We demonstrate the efficiency of our approach for feature selection and for learning decision trees in a series of experiments with synthetic and natural data sets. Keywords: feature ranking, categorical features, generalization bounds, Gini index, decision trees
4 0.33113402 78 jmlr-2008-Randomized Online PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension
Author: Manfred K. Warmuth, Dima Kuzmin
Abstract: We design an online algorithm for Principal Component Analysis. In each trial the current instance is centered and projected into a probabilistically chosen low dimensional subspace. The regret of our online algorithm, that is, the total expected quadratic compression loss of the online algorithm minus the total quadratic compression loss of the batch algorithm, is bounded by a term whose dependence on the dimension of the instances is only logarithmic. We first develop our methodology in the expert setting of online learning by giving an algorithm for learning as well as the best subset of experts of a certain size. This algorithm is then lifted to the matrix setting where the subsets of experts correspond to subspaces. The algorithm represents the uncertainty over the best subspace as a density matrix whose eigenvalues are bounded. The running time is O(n2 ) per trial, where n is the dimension of the instances. Keywords: principal component analysis, online learning, density matrix, expert setting, quantum Bayes rule
5 0.32050958 26 jmlr-2008-Consistency of Trace Norm Minimization
Author: Francis R. Bach
Abstract: Regularization by the sum of singular values, also referred to as the trace norm, is a popular technique for estimating low rank rectangular matrices. In this paper, we extend some of the consistency results of the Lasso to provide necessary and sufficient conditions for rank consistency of trace norm minimization with the square loss. We also provide an adaptive version that is rank consistent even when the necessary condition for the non adaptive version is not fulfilled. Keywords: convex optimization, singular value decomposition, trace norm, consistency
6 0.29017651 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis
7 0.27068892 52 jmlr-2008-Learning from Multiple Sources
8 0.26685295 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
9 0.25624949 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
10 0.24617895 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
11 0.23456614 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
12 0.2129021 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
13 0.20284346 64 jmlr-2008-Model Selection in Kernel Based Regression using the Influence Function (Special Topic on Model Selection)
14 0.17252353 47 jmlr-2008-Learning Balls of Strings from Edit Corrections
15 0.17002754 80 jmlr-2008-Ranking Individuals by Group Comparisons
16 0.16626073 2 jmlr-2008-A Library for Locally Weighted Projection Regression (Machine Learning Open Source Software Paper)
17 0.15874821 17 jmlr-2008-Automatic PCA Dimension Selection for High Dimensional Data and Small Sample Sizes
18 0.15540089 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
19 0.14761017 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
20 0.14289647 56 jmlr-2008-Magic Moments for Structured Output Prediction
topicId topicWeight
[(0, 0.035), (40, 0.033), (54, 0.022), (58, 0.051), (66, 0.582), (76, 0.022), (78, 0.013), (88, 0.049), (92, 0.037), (94, 0.035), (99, 0.023)]
simIndex simValue paperId paperTitle
1 0.99467695 82 jmlr-2008-Responses to Evidence Contrary to the Statistical View of Boosting
Author: Kristin P. Bennett
Abstract: unkown-abstract
2 0.97677475 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
Author: Onureena Banerjee, Laurent El Ghaoui, Alexandre d'Aspremont
Abstract: We consider the problem of estimating the parameters of a Gaussian or binary distribution in such a way that the resulting undirected graphical model is sparse. Our approach is to solve a maximum likelihood problem with an added 1 -norm penalty term. The problem as formulated is convex but the memory requirements and complexity of existing interior point methods are prohibitive for problems with more than tens of nodes. We present two new algorithms for solving problems with at least a thousand nodes in the Gaussian case. Our first algorithm uses block coordinate descent, and can be interpreted as recursive 1 -norm penalized regression. Our second algorithm, based on Nesterov’s first order method, yields a complexity estimate with a better dependence on problem size than existing interior point methods. Using a log determinant relaxation of the log partition function (Wainwright and Jordan, 2006), we show that these same algorithms can be used to solve an approximate sparse maximum likelihood problem for the binary case. We test our algorithms on synthetic data, as well as on gene expression and senate voting records data. Keywords: model selection, maximum likelihood estimation, convex optimization, Gaussian graphical model, binary data
same-paper 3 0.96704203 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace
Author: Arnak S. Dalalyan, Anatoly Juditsky, Vladimir Spokoiny
Abstract: The statistical problem of estimating the effective dimension-reduction (EDR) subspace in the multi-index regression model with deterministic design and additive noise is considered. A new procedure for recovering the directions of the EDR subspace is proposed. Many methods for estimating the EDR subspace perform principal component analysis on a family of vectors, say β̂_1, . . . , β̂_L, nearly lying in the EDR subspace. This is in particular the case for the structure-adaptive approach proposed by Hristache et al. (2001a). In the present work, we propose to estimate the projector onto the EDR subspace by the solution to the optimization problem: minimize max_{ℓ=1,...,L} β̂_ℓ'(I − A)β̂_ℓ subject to A ∈ A_{m*}, where A_{m*} is the set of all symmetric matrices with eigenvalues in [0, 1] and trace less than or equal to m*, with m* being the true structural dimension. Under mild assumptions, √n-consistency of the proposed procedure is proved (up to a logarithmic factor) in the case when the structural dimension is not larger than 4. Moreover, the stochastic error of the estimator of the projector onto the EDR subspace is shown to depend on L logarithmically. This enables us to use a large number of vectors β̂_ℓ for estimating the EDR subspace. The empirical behavior of the algorithm is studied through numerical simulations. Keywords: dimension-reduction, multi-index regression model, structure-adaptive approach, central subspace
4 0.6744523 27 jmlr-2008-Consistency of the Group Lasso and Multiple Kernel Learning
Author: Francis R. Bach
Abstract: We consider the least-square regression problem with regularization by a block 1 -norm, that is, a sum of Euclidean norms over spaces of dimensions larger than one. This problem, referred to as the group Lasso, extends the usual regularization by the 1 -norm where all spaces have dimension one, where it is commonly referred to as the Lasso. In this paper, we study the asymptotic group selection consistency of the group Lasso. We derive necessary and sufficient conditions for the consistency of group Lasso under practical assumptions, such as model misspecification. When the linear predictors and Euclidean norms are replaced by functions and reproducing kernel Hilbert norms, the problem is usually referred to as multiple kernel learning and is commonly used for learning from heterogeneous data sources and for non linear variable selection. Using tools from functional analysis, and in particular covariance operators, we extend the consistency results to this infinite dimensional case and also propose an adaptive scheme to obtain a consistent model estimate, even when the necessary condition required for the non adaptive scheme is not satisfied. Keywords: sparsity, regularization, consistency, convex optimization, covariance operators
5 0.5690552 26 jmlr-2008-Consistency of Trace Norm Minimization
Author: Francis R. Bach
Abstract: Regularization by the sum of singular values, also referred to as the trace norm, is a popular technique for estimating low rank rectangular matrices. In this paper, we extend some of the consistency results of the Lasso to provide necessary and sufficient conditions for rank consistency of trace norm minimization with the square loss. We also provide an adaptive version that is rank consistent even when the necessary condition for the non adaptive version is not fulfilled. Keywords: convex optimization, singular value decomposition, trace norm, consistency
6 0.56461757 33 jmlr-2008-Evidence Contrary to the Statistical View of Boosting
7 0.52794206 69 jmlr-2008-Non-Parametric Modeling of Partially Ranked Data
8 0.52391577 57 jmlr-2008-Manifold Learning: The Price of Normalization
9 0.51740491 83 jmlr-2008-Robust Submodular Observation Selection
10 0.51204383 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks
11 0.51020545 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition
12 0.50817275 58 jmlr-2008-Max-margin Classification of Data with Absent Features
13 0.50367177 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
14 0.50094205 56 jmlr-2008-Magic Moments for Structured Output Prediction
16 0.48719245 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis
17 0.48462784 64 jmlr-2008-Model Selection in Kernel Based Regression using the Influence Function (Special Topic on Model Selection)
18 0.47540629 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
19 0.46882218 16 jmlr-2008-Approximations for Binary Gaussian Process Classification
20 0.46768594 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers