jmlr jmlr2013 jmlr2013-104 knowledge-graph by maker-knowledge-mining

104 jmlr-2013-Sparse Single-Index Model


Source: pdf

Author: Pierre Alquier, Gérard Biau

Abstract: Let (X,Y) be a random pair taking values in R^p × R. In the so-called single-index model, one has Y = f⋆(θ⋆T X) + W, where f⋆ is an unknown univariate measurable function, θ⋆ is an unknown vector in R^p, and W denotes a random noise satisfying E[W|X] = 0. The single-index model is known to offer a flexible way to model a variety of high-dimensional real-world phenomena. However, despite its relative simplicity, this dimension reduction scheme is faced with severe complications as soon as the underlying dimension becomes larger than the number of observations (“p larger than n” paradigm). To circumvent this difficulty, we consider the single-index model estimation problem from a sparsity perspective using a PAC-Bayesian approach. On the theoretical side, we offer a sharp oracle inequality, which is more powerful than the best known oracle inequalities for other common procedures of single-index recovery. The proposed method is implemented by means of the reversible jump Markov chain Monte Carlo technique and its performance is compared with that of standard procedures. Keywords: single-index model, sparsity, regression estimation, PAC-Bayesian, oracle inequality, reversible jump Markov chain Monte Carlo method

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 However, despite its relative simplicity, this dimension reduction scheme is faced with severe complications as soon as the underlying dimension becomes larger than the number of observations (“p larger than n” paradigm). [sent-8, score-0.133]

2 To circumvent this difficulty, we consider the single-index model estimation problem from a sparsity perspective using a PAC-Bayesian approach. [sent-9, score-0.131]

3 On the theoretical side, we offer a sharp oracle inequality, which is more powerful than the best known oracle inequalities for other common procedures of single-index recovery. [sent-10, score-0.074]

4 The proposed method is implemented by means of the reversible jump Markov chain Monte Carlo technique and its performance is compared with that of standard procedures. [sent-11, score-0.155]

5 Keywords: single-index model, sparsity, regression estimation, PAC-Bayesian, oracle inequality, reversible jump Markov chain Monte Carlo method 1. [sent-12, score-0.221]

6 In the regression function estimation problem, the goal is to use the data Dn in order to construct an estimate r_n : R^p → R of the regression function r(x) = E[Y | X = x]. [sent-18, score-0.252]

7 One of the main advantages of the single-index model is its supposed ability to deal with the problem of high dimension (Bellman, 1961). [sent-37, score-0.076]

8 It is known that estimating the regression function is especially difficult whenever the dimension p of X becomes large. [sent-38, score-0.078]

9 As a matter of fact, the optimal mean square convergence rate n^{-2k/(2k+p)} for the estimation of a k-times differentiable regression function converges to zero dramatically slowly if the dimension p is large compared to k. [sent-39, score-0.143]
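To make the "dramatically slowly" claim concrete, here is a small illustrative comparison (the numbers are ours, not the paper's); it contrasts the full-dimensional rate with the one-dimensional rate that the single-index structure allows one to recover (cf. Corollary 4 quoted in sentence 30 below):

```latex
% Minimax mean-square rate for a k-times differentiable regression function:
%   ambient dimension p :  n^{-2k/(2k+p)}
%   single-index (effective dimension 1) :  n^{-2k/(2k+1)}
\[
  \text{with } k = 2:\qquad
  \underbrace{n^{-4/24} = n^{-1/6}}_{p = 20}
  \qquad\text{versus}\qquad
  \underbrace{n^{-4/5}}_{\text{single-index structure}} .
\]
% For n = 10^4 these rates are roughly 0.215 versus 0.00063, so the ambient
% dimension makes the full nonparametric rate essentially useless in practice.
```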

10 Thus, in particular, if r(x) = f⋆(θ⋆T x) holds for every x ∈ R^p, then the underlying structural dimension of the model is 1 (instead of p) and the estimation of r can hopefully be performed more easily. [sent-41, score-0.106]

11 Nevertheless, practical estimation of the link function f⋆ and the index θ⋆ still requires a degree of statistical smoothing. [sent-43, score-0.101]

12 In fact, this drawback considerably reduces the ability of the single-index model to behave as an effective dimension reduction technique. [sent-52, score-0.076]

13 Sparse estimation is playing an increasingly important role in the statistics and machine learning communities, and several methods have recently been developed in both fields, which rely upon the notion of sparsity (e. [sent-59, score-0.104]

14 In the present document, we consider the single-index model (1) from a sparsity perspective, that is, we assume that θ⋆ has only a few coordinates different from 0. [sent-64, score-0.166]

15 In the dimension reduction scenario we have in mind, the ambient dimension p can be very large, much larger than the sample size n, but we believe that the representation is sparse, that is, that very few coordinates of θ⋆ are nonzero. [sent-65, score-0.191]

16 This assumption is helpful at least for two reasons: If p is large and the number of nonzero coordinates is small enough, then the model is easier to interpret and its efficient estimation becomes possible. [sent-66, score-0.122]

17 This strategy was further investigated for regression by Audibert (2004) and Alquier (2008) and, more recently, worked out in the sparsity framework by Dalalyan and Tsybakov (2008, 2012) and Alquier and Lounici (2011). [sent-71, score-0.103]

18 The main message of Dalalyan and Tsybakov (2008, 2012) and Alquier and Lounici (2011) is that aggregation with a properly chosen prior is able to deal nicely with the sparsity issue. [sent-72, score-0.107]

19 Then we state our main result (Theorem 2), which offers a sparsity oracle inequality more powerful than the best known oracle inequalities for other common procedures of single-index recovery. [sent-77, score-0.195]

20 Section 3 is devoted to the practical implementation of the estimate via a reversible jump Markov chain Monte Carlo (MCMC) algorithm, and to numerical experiments on both simulated and real-life data sets. [sent-78, score-0.155]

21 In order to approximate the link function f⋆, we shall use the vector space F spanned by a given countable dictionary of measurable functions {ϕ_j}_{j=1}^∞. [sent-116, score-0.094]
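Sentence 97 below indicates that {ϕ_j} is taken to be the (non-normalized) trigonometric system. A minimal Python sketch of such a dictionary is given here; the exact normalization and indexing convention used in the paper may differ, so this is only one possible instantiation:

```python
import numpy as np

def trig_dictionary(t, M):
    """Evaluate M functions of a (non-normalized) trigonometric dictionary at
    the points t.  Columns are 1, cos(pi t), sin(pi t), cos(2 pi t), sin(2 pi t), ...
    (one possible indexing; the paper's exact convention may differ)."""
    t = np.asarray(t, dtype=float)
    cols = []
    j = 0
    while len(cols) < M:
        if j == 0:
            cols.append(np.ones_like(t))
        else:
            cols.append(np.cos(j * np.pi * t))
            if len(cols) < M:
                cols.append(np.sin(j * np.pi * t))
        j += 1
    return np.column_stack(cols[:M])

# Example: evaluate the first 5 dictionary functions on a grid.
print(trig_dictionary(np.linspace(-1.0, 1.0, 4), M=5).shape)   # (4, 5)
```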

22 It strongly relies on the choice of a probability measure π on S^p_{1,+} × F, called the prior, which in our framework should enforce the sparsity properties of the target regression function. [sent-127, score-0.152]

23 We see that S^p_{1,+}(I) may be interpreted as the set of “active” coordinates in the single-index regression of Y on X, and note that the prior on S^p_{1,+} is a convex combination of uniform probability measures on the subsets S^p_{1,+}(I). [sent-136, score-0.147]

24 The factor 10−i penalizes models of high dimension, in accordance with the sparsity idea. [sent-139, score-0.099]

25 Next, the integer M should be interpreted as a measure of the “dimension” of the function f—the larger M, the more complex the function—and the prior ν adapts again to the sparsity idea by penalizing large-dimensional functions f. [sent-157, score-0.107]

26 The estimates θ̂_λ and f̂_λ of θ⋆ and f⋆, respectively, are simply obtained by randomly drawing (θ̂_λ, f̂_λ) ∼ ρ̂_λ, where ρ̂_λ is the so-called Gibbs posterior distribution over S^p_{1,+} × F_n(C + 1), defined by the probability density dρ̂_λ/dπ(θ, f) ∝ exp[−λ R_n(θ, f)]. [sent-163, score-0.149]
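For intuition only, here is a minimal Python sketch of a Gibbs posterior over a finite set of candidate pairs; this finite discretization is an illustrative assumption and is not how the paper proceeds (the paper samples from ρ̂_λ by reversible jump MCMC, see sentences 33-36):

```python
import numpy as np

def gibbs_posterior_weights(empirical_risks, prior_weights, lam):
    """Gibbs posterior over a finite candidate set.

    empirical_risks[i] plays the role of R_n(theta_i, f_i), prior_weights[i]
    the role of pi({(theta_i, f_i)}), and lam is the inverse temperature lambda.
    Returns weights proportional to prior * exp(-lam * risk), summing to one.
    """
    log_w = np.log(prior_weights) - lam * np.asarray(empirical_risks, dtype=float)
    log_w -= log_w.max()          # subtract max for numerical stability
    w = np.exp(log_w)
    return w / w.sum()

# Example: three candidates with empirical risks 0.9, 1.0, 1.4 and a uniform prior.
print(gibbs_posterior_weights([0.9, 1.0, 1.4], [1/3, 1/3, 1/3], lam=10.0))
```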

27 Here and everywhere, the wording “with probability 1 − δ” means the probability evaluated with respect to the distribution P⊗n of the data Dn and the conditional probability measure ρ̂_λ. [sent-176, score-0.147]

28 Set λ = n / (w + 2[(2C+1)² + 4σ²]). Then, for all δ ∈ ]0, 1[, with probability at least 1 − δ we have [sent-180, score-0.114]

29 R(θ̂_λ, f̂_λ) − R(θ⋆, f⋆) ≤ Ξ inf_{I ⊂ {1,…,p}, 1 ≤ M ≤ n} { R(θ⋆_{I,M}, f⋆_{I,M}) − R(θ⋆, f⋆) + [M log(Cn) + |I| log(pn) + log(2/δ)] / n },   (4) where Ξ is a positive constant, depending on L, C, σ and ℓ only. [sent-183, score-0.150]

30 Then, for all δ ∈ ]0, 1[, with probability at least 1 − δ we have R(θ̂_λ, f̂_λ) − R(θ⋆, f⋆) ≤ Ξ′ [ (log(Cn)/n)^{2k/(2k+1)} + ‖θ⋆‖_0 log(pn)/n + log(2/δ)/n ],   (5) where Ξ′ is a positive constant, depending on L, C, σ, ℓ and B only. [sent-206, score-0.199]

31 The strength of Corollary 4 is to provide a finite sample bound and to show that our estimate still behaves well in a nonasymptotic situation if the intrinsic dimension (i. [sent-208, score-0.102]

32 (2010), which can be applied to the single-index model to estimate consistently the parameter θ⋆ and perform variable selection in a sparsity framework. [sent-223, score-0.101]

33 The idea is to start from an initial given pair (θ^(0), f^(0)) and then, at each step, to iteratively compute (θ^(t+1), f^(t+1)) from (θ^(t), f^(t)) via the following chain of rules: • Sample a random pair (τ^(t), h^(t)) according to some proposal conditional density k_t(· | (θ^(t), f^(t))). [sent-238, score-0.139]

34 This protocol ensures that the sequence {(θ^(t), f^(t))}_{t=0}^∞ is a Markov chain with invariant probability distribution ρ̂_λ (see, e. [sent-240, score-0.086]
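A minimal Metropolis-Hastings sketch in Python of the propose/accept loop described in sentences 33-36; the paper's actual sampler is a reversible jump MCMC over pairs (θ, f) of varying dimension, so the scalar state and symmetric random-walk proposal used here are simplifying assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis_hastings(neg_log_target, x0, n_steps=5000, step=0.1):
    """Random-walk Metropolis sampler whose invariant density is
    proportional to exp(-neg_log_target(x)).

    In the paper's setting, neg_log_target would be lambda * R_n(theta, f)
    minus the log prior density, and the proposal would alternate between a
    move on theta (kernel k1) and a move on f (kernel k2)."""
    x = np.asarray(x0, dtype=float)
    samples = []
    for _ in range(n_steps):
        proposal = x + step * rng.standard_normal(x.shape)
        # Symmetric proposal: accept with probability min(1, target(prop)/target(x)).
        log_accept = neg_log_target(x) - neg_log_target(proposal)
        if np.log(rng.uniform()) < log_accept:
            x = proposal
        samples.append(x.copy())
    return np.array(samples)

# Toy usage: sample from a standard Gaussian target in dimension 2.
draws = metropolis_hastings(lambda x: 0.5 * np.sum(x**2), x0=np.zeros(2))
print(draws.mean(axis=0), draws.std(axis=0))
```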

35 A usual choice is to take kt ≡ k, so that the Markov chain is homogeneous. [sent-243, score-0.112]

36 However, in our context, it is more convenient to let kt = k1 if t is odd and kt = k2 if t is even. [sent-244, score-0.15]

37 Besides, the time for the Markov chains to converge depends strongly on the ambient dimension p and the starting point of the simulations. [sent-257, score-0.126]

38 When the dimension is small (typically, p ≤ 10), the chains converge fast and any value may be chosen as a starting point. [sent-258, score-0.098]

39 However, using as a starting point for the chains the preliminary estimate θ̂_HHI (see below) significantly reduces the number of steps needed to reach convergence—we let the chains run 5000 steps in this context. [sent-261, score-0.098]

40 Nevertheless, as a general rule, we encourage the users to inspect the convergence of the chains by checking if Rn (θ(t) , f (t) ) is stabilized, and to run several chains starting from different points to avoid their attraction into local minima. [sent-262, score-0.098]

41 The tested competing methods are the Lasso (Tibshirani, 1996), the standard regression kernel estimate (Nadaraya, 1964, 1970; Watson, 1964; Tsybakov, 2009), and the estimation strategy discussed in Härdle et al. [sent-268, score-0.127]

42 Finally, the estimation procedure advocated in Härdle et al. [sent-288, score-0.098]

43 Thus, in [Model 1] and [Model 2], even if the ambient dimension is large, the intrinsic dimension of the model is in fact equal to 2. [sent-316, score-0.153]

44 [Table header residue: simulation results for p = 10 with n = 50 and n = 100; columns F_Linear, F_SI, F_NP; rows median, mean, s.d.] [sent-329, score-0.228]

45 This observation can be easily explained by the fact that the competing estimate does not integrate any sparsity information regarding the parameter θ⋆, whereas F̂_Fourier tries to focus on the dimension of the active coordinates, which is equal to 2 in this simulation. [sent-416, score-0.123]

46 [Table header residue: simulation results for p = 50 with n = 50 and n = 100; columns F_Linear, F_SI, F_NP; rows median, mean, s.d.] [sent-430, score-0.228]

47 The estimation procedure F̂_Fourier … [Page-header and table residue: real data sets with sample size n and ambient dimension p: AIR QUALITY (n = 111, p = 3), AUTO-MPG (n = 392, p = 7), CONCRETE (n = 1030, p = 8), HOUSING (n = 508, p = 11), SLUMP-1/2/3 (n = 51, p = 7 each), WINE-RED (n = 1599, p = 11), WINE-WHITE (n = 4898, p = 11); columns median, mean, s.d.] [sent-530, score-0.258]

48 Specifically, the observations were embedded into a space of dimension p × 4 by letting the new fake coordinates follow independent uniform [0, 1] random variables. [sent-664, score-0.114]
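A minimal Python sketch of this augmentation (illustrative only; the exact experimental protocol in the paper may differ in details such as the random seed or ordering of coordinates):

```python
import numpy as np

def augment_with_fake_coordinates(X, factor=4, rng=None):
    """Embed the n x p design matrix X into dimension p * factor by appending
    p * (factor - 1) irrelevant coordinates drawn i.i.d. from Uniform[0, 1]."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    fake = rng.uniform(0.0, 1.0, size=(n, p * (factor - 1)))
    return np.hstack([X, fake])

# Example: a 111 x 3 data set (as in AIR QUALITY) becomes 111 x 12.
X = np.random.default_rng(1).normal(size=(111, 3))
print(augment_with_fake_coordinates(X).shape)   # (111, 12)
```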

49 In this nonasymptotic framework, the method F̂_HHI—which is not designed … [Figure 1 caption: AUTO-MPG example: estimated link function by the method F̂_Fourier.] [sent-666, score-0.097]

50 Preliminary Results. Throughout this section, we let π be the prior probability measure on R^p × F_n(C + 1) equipped with its canonical Borel σ-field. [sent-676, score-0.082]

51 [Table residue: augmented data sets (fake coordinates added), with sample size n and dimension p: AIR QUALITY (n = 111, p = 12), AUTO-MPG (n = 392, p = 28), CONCRETE (n = 1030, p = 32), HOUSING (n = 508, p = 44), SLUMP-1/2/3 (n = 51, p = 28 each), WINE-RED (n = 1599, p = 44), WINE-WHITE (n = 4898, p = 44); columns median, mean, s.d.] [sent-678, score-0.228]

52 Given a measurable space (E, E) and two probability measures µ1 and µ2 on (E, E), we denote by K(µ1, µ2) the Kullback-Leibler divergence of µ1 with respect to µ2, defined by K(µ1, µ2) = ∫ log(dµ1/dµ2) dµ1 if µ1 ≪ µ2, and K(µ1, µ2) = ∞ otherwise. [sent-819, score-0.316]

53 For any probability measure µ on (E, E) and any measurable function h : E → R such that ∫(exp ∘ h) dµ < ∞, we have log ∫(exp ∘ h) dµ = sup_m { ∫ h dm − K(m, µ) },   (6) where the supremum is taken over all probability measures m on (E, E) and, by convention, ∞ − ∞ = −∞. [sent-823, score-0.328]
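A small numerical sanity check of identity (6) on a finite space, in Python; the base measure, the function h and the candidate measures below are arbitrary choices made only to illustrate that the Gibbs distribution of sentence 54 attains the supremum:

```python
import numpy as np

rng = np.random.default_rng(0)

# Finite space E = {0, ..., 4}, base measure mu, bounded function h.
mu = np.array([0.1, 0.2, 0.3, 0.25, 0.15])
h = np.array([0.3, -1.0, 2.0, 0.0, 1.5])

def kl(m, mu):
    """Kullback-Leibler divergence K(m, mu) between two discrete distributions."""
    mask = m > 0
    return np.sum(m[mask] * np.log(m[mask] / mu[mask]))

lhs = np.log(np.sum(np.exp(h) * mu))            # log of the integral of exp(h) d mu

# Gibbs distribution dg/dmu proportional to exp(h): it attains the supremum in (6).
g = np.exp(h) * mu
g /= g.sum()
print(lhs, np.sum(h * g) - kl(g, mu))           # the two numbers coincide

# Any other probability measure m gives a smaller value of  int h dm - K(m, mu).
for _ in range(3):
    m = rng.dirichlet(np.ones(5))
    assert np.sum(h * m) - kl(m, mu) <= lhs + 1e-12
```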

54 Moreover, as soon as h is bounded from above on the support of µ, the supremum with respect to m on the right-hand side of (6) is reached for the Gibbs distribution g given by dg/dµ(e) = exp[h(e)] / ∫(exp ∘ h) dµ, e ∈ E. [sent-824, score-0.132]

55 [Tail of the Lemma 7 display: + (1/λ)(log(dρ̂/dπ)(θ̂, f̂) + log(1/δ)).] Proof: Fix θ ∈ S^p_{1,+} and f ∈ F_n(C + 1). [sent-828, score-0.300]

56 Thus, for any inverse temperature parameter λ ∈ ]0, n/w[, taking ζ = λ/n, we may write by Lemma 5: E exp[λ(R(θ, f) − R(θ⋆, f⋆) − R_n(θ, f) + R_n(θ⋆, f⋆))] ≤ exp[ vλ² / (2n²(1 − wλ/n)) ]. [sent-842, score-0.174]

57 Therefore, using the definition of v, we obtain E exp[ (λ − λ²[(2C+1)² + 4σ²]/(n(1 − wλ/n))) (R(θ, f) − R(θ⋆, f⋆)) + λ(−R_n(θ, f) + R_n(θ⋆, f⋆)) − log(1/δ) ] ≤ δ. [sent-843, score-0.217]

58 Let us remind the reader that π is a prior probability measure on the set S^p_{1,+} × F_n(C + 1). [sent-845, score-0.082]

59 Recalling that P⊗n stands for the distribution of the sample Dn, the latter inequality can be more conveniently written as E_{Dn∼P⊗n} E_{(θ̂,f̂)∼ρ̂} exp[ (λ − λ²[(2C+1)² + 4σ²]/(n(1 − wλ/n))) (R(θ̂, f̂) − R(θ⋆, f⋆)) + λ(−R_n(θ̂, f̂) + R_n(θ⋆, f⋆)) − log(dρ̂/dπ)(θ̂, f̂) − log(1/δ) ] ≤ δ. [sent-848, score-0.414]

60 Put differently, letting λ ∈ ]0, n/(w + [(2C+1)² + 4σ²])[, we have, with probability at least 1 − δ, R(θ̂, f̂) − R(θ⋆, f⋆) ≤ (1 − λ[(2C+1)² + 4σ²]/(n − wλ))⁻¹ { R_n(θ̂, f̂) − R_n(θ⋆, f⋆) + (1/λ)(log(dρ̂/dπ)(θ̂, f̂) + log(1/δ)) }. This concludes the proof of Lemma 7. [sent-850, score-0.349]

61 Lemma 8: Under the conditions of Lemma 7 we have, with probability at least 1 − δ, ∫R_n(θ, f) dρ̂(θ, f) − R_n(θ⋆, f⋆) ≤ (1 + λ[(2C+1)² + 4σ²]/(n − wλ)) ( ∫R(θ, f) dρ̂(θ, f) − R(θ⋆, f⋆) ) + (1/λ)(K(ρ̂, π) + log(1/δ)). [sent-852, score-0.199]

62 More precisely, we apply Lemma 5 with T_i = (Y_i − f(θᵀX_i))² − (Y_i − f⋆(θ⋆ᵀX_i))² and obtain, for any inverse temperature parameter λ ∈ ]0, n/w[, E exp[λ(R(θ⋆, f⋆) − R(θ, f) − R_n(θ⋆, f⋆) + R_n(θ, f))] ≤ exp[ vλ² / (2n²(1 − wλ/n)) ] (see (7) for the definition of v). [sent-854, score-0.174]

63 Thus, using the definition of v, E exp[ (λ + λ²[(2C+1)² + 4σ²]/(n(1 − wλ/n))) (R(θ⋆, f⋆) − R(θ, f)) + λ(R_n(θ, f) − R_n(θ⋆, f⋆)) − log(1/δ) ] ≤ δ. [sent-855, score-0.217]

64 Thus, for any data-dependent posterior probability measure ρ̂ absolutely continuous with respect to π, E exp[ (λ + λ²[(2C+1)² + 4σ²]/(n(1 − wλ/n))) (R(θ⋆, f⋆) − R(θ, f)) + λ(R_n(θ, f) − R_n(θ⋆, f⋆)) − log(dρ̂/dπ)(θ, f) − log(1/δ) ] ≤ δ. [sent-857, score-0.482]

65 Consequently, by the elementary inequality exp(λx) ≥ 1_{R+}(x), we obtain, with probability at most δ, ∫R_n(θ, f) dρ̂(θ, f) − R_n(θ⋆, f⋆) ≥ (1 + λ[(2C+1)² + 4σ²]/(n − wλ)) ( ∫R(θ, f) dρ̂(θ, f) − R(θ⋆, f⋆) ) + (1/λ)(K(ρ̂, π) + log(1/δ)). [sent-859, score-0.246]

66 Equivalently, with probability at least 1 − δ, ∫R_n(θ, f) dρ̂(θ, f) − R_n(θ⋆, f⋆) ≤ (1 + λ[(2C+1)² + 4σ²]/(n − wλ)) ( ∫R(θ, f) dρ̂(θ, f) − R(θ⋆, f⋆) ) + (1/λ)(K(ρ̂, π) + log(1/δ)). [sent-860, score-0.199]

67 Observe that log(dρ̂_λ/dπ)(θ̂_λ, f̂_λ) = log( exp[−λR_n(θ̂_λ, f̂_λ)] / ∫exp[−λR_n(θ, f)] dπ(θ, f) ) = −λR_n(θ̂_λ, f̂_λ) − log ∫exp[−λR_n(θ, f)] dπ(θ, f). [sent-864, score-0.651]

68 Consequently, with probability at least 1 − δ, R(θ̂_λ, f̂_λ) − R(θ⋆, f⋆) ≤ (1/λ)(1 − λ[(2C+1)² + 4σ²]/(n − wλ))⁻¹ ( log(1/δ) − λR_n(θ⋆, f⋆) − log ∫exp[−λR_n(θ, f)] dπ(θ, f) ). [sent-865, score-0.416]

69 Next, using Lemma 6 we deduce that, with probability at least 1 − δ, R(θ̂_λ, f̂_λ) − R(θ⋆, f⋆) ≤ (1 − λ[(2C+1)² + 4σ²]/(n − wλ))⁻¹ inf_{ρ̂} { ∫R_n(θ, f) dρ̂(θ, f) − R_n(θ⋆, f⋆) + (1/λ)(K(ρ̂, π) + log(1/δ)) }, where the infimum is taken over all probability measures on S^p_{1,+} × F_n(C + 1). [sent-866, score-0.313]

70 In particular, letting M(I, M) be the set of all probability measures on S^p_{1,+}(I) × F_M(C + 1), we have, with probability at least 1 − δ, [sent-867, score-0.163]

71 R(θ̂_λ, f̂_λ) − R(θ⋆, f⋆) ≤ (1 − λ[(2C+1)² + 4σ²]/(n − wλ))⁻¹ inf_{I ⊂ {1,…,p}, 1 ≤ M ≤ n} inf_{ρ̂ ∈ M(I,M)} { ∫R_n(θ, f) dρ̂(θ, f) − R_n(θ⋆, f⋆) + (1/λ)(K(ρ̂, π) + log(1/δ)) }. [sent-870, score-0.215]

72 Next, observe that, for ρ̂ ∈ M(I, M), K(ρ̂, π) = K(ρ̂, µ ⊗ ν) = K(ρ̂, µI ⊗ νM) + log[ 1 / (prior mass of the model indexed by (I, M)) ] ≤ K(ρ̂, µI ⊗ νM) + log[ binom(p, |I|) 10^{|I|+M} ].   (8) [sent-871, score-0.300]

73 Therefore, with probability at least 1 − δ, [sent-872, score-0.163]

74 R(θ̂_λ, f̂_λ) − R(θ⋆, f⋆) ≤ (1 − λ[(2C+1)² + 4σ²]/(n − wλ))⁻¹ inf_{I ⊂ {1,…,p}, 1 ≤ M ≤ n} inf_{ρ̂ ∈ M(I,M)} { ∫R_n(θ, f) dρ̂(θ, f) − R_n(θ⋆, f⋆) + [ K(ρ̂, µI ⊗ νM) + log(binom(p, |I|) 10^{|I|+M}) + log(1/δ) ] / λ }.   (9) [sent-875, score-0.365]

75 By Lemma 8 and inequality (8), for any data-dependent distribution ρ̂ ∈ M(I, M), with probability at least 1 − δ, ∫R_n(θ, f) dρ̂(θ, f) − R_n(θ⋆, f⋆) ≤ (1 + λ[(2C+1)² + 4σ²]/(n − wλ)) ( ∫R(θ, f) dρ̂(θ, f) − R(θ⋆, f⋆) ) + [ K(ρ̂, µI ⊗ νM) + log(binom(p, |I|) 10^{|I|+M}) + log(1/δ) ] / λ.   (10) [sent-876, score-0.396]

76 Thus, combining inequalities (9) and (10), we may write, with probability at least 1 − 2δ, [sent-877, score-0.114]

77 R(θ̂_λ, f̂_λ) − R(θ⋆, f⋆) ≤ (1 − λ[(2C+1)² + 4σ²]/(n − wλ))⁻¹ inf_{I ⊂ {1,…,p}, 1 ≤ M ≤ n} inf_{ρ̂ ∈ M(I,M)} { (1 + λ[(2C+1)² + 4σ²]/(n − wλ)) ( ∫R(θ, f) dρ̂(θ, f) − R(θ⋆, f⋆) ) + 2[ K(ρ̂, µI ⊗ νM) + log(binom(p, |I|) 10^{|I|+M}) + log(1/δ) ] / λ }. [sent-880, score-0.365]

78 j=1 With this notation, inequality (11) leads to ˆ R(θλ , fˆλ ) − R(θ⋆ , f ⋆ ) ≤ 1 1− λ[(2C+1)2 +4σ2 ] n−wλ 1+ +2 inf I ⊂ {1, . [sent-885, score-0.112]

79 , p} 1≤M≤n inf η,γ>0 λ (2C + 1)2 + 4σ2 n − wλ R(θ, f )dρI,M,η,γ (θ, f ) − R(θ⋆ , f ⋆ ) K (ρI,M,η,γ , µI ⊗ νM ) + log p (|I|) 10−|I|−M + log 1 δ . [sent-888, score-0.365]

80 Note first that log binom(p, |I|) ≤ |I| log(pe/|I|) and, consequently, log[ binom(p, |I|) 10^{|I|+M} ] ≤ |I| log(pe) + (|I| + M) log 10. [sent-890, score-0.750]

81 By technical Lemma 9, we know that K(ρ1_{I,M,η}, µI) ≤ (|I| − 1) log max(|I|, 4/η); similarly, by technical Lemma 10, K(ρ2_{I,M,γ}, νM) = M log((C + 1)/γ). [sent-892, score-0.300]

82 Putting all the pieces together, we are led to K(ρ_{I,M,η,γ}, µI ⊗ νM) ≤ (|I| − 1) log max(|I|, 4/η) + M log((C + 1)/γ). [sent-894, score-0.300]

83 Combining this inequality with (12)-(14) yields, with probability larger than 1 − 2δ, ˆ R(θλ , fˆλ ) − R(θ⋆ , f ⋆ ) ≤ 1 2 2] 1 − λ[(2C+1) +4σ n−wλ − R(θ⋆ , f ⋆ ) + Ξ1 n 1+ inf I ⊂ {1, . [sent-920, score-0.161]

84 , p} 1≤M≤n +2 λ (2C + 1)2 + 4σ2 n − wλ ⋆ R(θ⋆ , fI,M ) I,M M log(10(C + 1)n) + |I| log(40epn) + log λ 1 δ . [sent-923, score-0.15]

85 Choosing finally λ = n / (w + 2[(2C+1)² + 4σ²]), we obtain that there exists a positive constant Ξ2, function of L, C, σ and ℓ, such that, with probability at least 1 − 2δ, [sent-924, score-0.114]

86 R(θ̂_λ, f̂_λ) − R(θ⋆, f⋆) ≤ Ξ2 inf_{I ⊂ {1,…,p}, 1 ≤ M ≤ n} { R(θ⋆_{I,M}, f⋆_{I,M}) − R(θ⋆, f⋆) + [M log(10Cn) + |I| log(40epn) + log(2/δ)] / n }. This concludes the proof of Theorem 2. [sent-927, score-0.150]

87 Proof of Corollary 4: We already know, by Theorem 2, that with probability at least 1 − δ, [sent-930, score-0.114]

88 R(θ̂_λ, f̂_λ) − R(θ⋆, f⋆) ≤ Ξ inf_{I ⊂ {1,…,p}, 1 ≤ M ≤ n} { R(θ⋆_{I,M}, f⋆_{I,M}) − R(θ⋆, f⋆) + [M log(Cn) + |I| log(pn) + log(2/δ)] / n }. [sent-933, score-0.150]

89 Therefore, inequality (15) leads to R(θ̂_λ, f̂_λ) − R(θ⋆, f⋆) ≤ Ξ inf_{1 ≤ M ≤ n} { Λ M^{−2k} + [M log(Cn) + |I⋆| log(pn) + log(2/δ)] / n }. [sent-942, score-0.150]

90 Then K(ρ1_{I,M,η}, µI) ≤ (|I| − 1) log max(|I|, 4/η). [sent-950, score-0.150]

91 Clearly,   K (ρ1 , µI ) = log   I,M,η   ≤ log  Thus, 1 1[ θ−θ⋆ I,M 1 ≤η] dµI (θ)   1 1[θ∈F A ] 1[  θ−θ⋆ I,M 1 ≤η] dµI (θ) 2|I|−1 K (ρ1 , µI ) ≤ log   I,M,η 1[  θ−θ⋆ I,M 2|I|−1  ≤ log  dχ(θ)  1 ≤η] 1[θ∈K] dχ(θ)   . [sent-996, score-0.6]

92 Consequently, we obtain K (ρ1 , µI ) ≤ log I,M,η 2 u |I|−1 ≤ (|I| − 1) log max |I|, 4 η . [sent-1000, score-0.3]

93 For any I ⊂ {1,…,p}, any positive integer M ≤ n and any γ ∈ ]0, 1/n], let the probability measure ρ2_{I,M,γ} be defined by dρ2_{I,M,γ}/dνM(f) ∝ 1[‖f − f⋆_{I,M}‖_M ≤ γ], where, for f = Σ_{j=1}^M β_j ϕ_j ∈ F_M(C + 1), we put ‖f‖_M = Σ_{j=1}^M j|β_j|. [sent-1004, score-0.085]

94 Then K(ρ2_{I,M,γ}, νM) = M log((C + 1)/γ). Proof: Observe that K(ρ2_{I,M,γ}, νM) = ∫ log( (dρ2_{I,M,γ}/dνM)(f) ) dρ2_{I,M,γ}(f). [sent-1006, score-0.300]

95 This implies j=1   1[∑M j|β j |≤C+1] (β)dβ j=1  K (ρ2 , νM ) = log   . [sent-1010, score-0.15]

96 We conclude that K(ρ2_{I,M,γ}, νM) = log[ ∫ 1[Σ_{j=1}^M j|β_j| ≤ C+1] dβ / ∫ 1[Σ_{j=1}^M j|β_j − (β⋆_{I,M})_j| ≤ γ] dβ ] = M log((C + 1)/γ). [sent-1013, score-0.150]

97 Notation: In order to provide explicit formulas for the conditional densities k1((τ, h)|(θ, f)) and k2((τ, h)|(θ, f)), we first set f = Σ_{j=1}^{m_f} β_{f,j} ϕ_j and h = Σ_{j=1}^{m_h} β_{h,j} ϕ_j, where it is recalled that {ϕ_j}_{j=1}^∞ denotes the (non-normalized) trigonometric system. [sent-1017, score-0.141]

98 We let I (respectively, J) be the set of nonzero coordinates of the vector θ (respectively, τ), and denote finally by θ_I (respectively, τ_J) the vector of dimension |I| (respectively, |J|) which contains the nonzero coordinates of θ (respectively, τ). [sent-1018, score-0.179]

99 (b̂_1, …, b̂_{m_h}) ∈ arg min_{b ∈ R^{m_h}} Σ_{i=1}^n ( Y_i − Σ_{j=1}^{m_h} b_j ϕ_j(τᵀX_i) )². [sent-1024, score-0.104]

100 Note that simulating with respect to denss(h | τ, m_h) is an easy task, since one just needs to compute a least squares estimate and then draw from a truncated Gaussian distribution. [sent-1027, score-0.157]
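A minimal Python sketch of that simulation step; the function and variable names below, the toy dictionary, and the simple Gaussian perturbation (without the truncation step) are illustrative assumptions rather than the paper's exact procedure:

```python
import numpy as np

def sample_link_coefficients(X, Y, tau, m_h, phi, scale=0.1, rng=None):
    """Draw candidate coefficients for a proposal on the link function:
    fit least squares coefficients of Y on phi_1(tau'X), ..., phi_mh(tau'X),
    then perturb them with Gaussian noise.  A rejection step could be added
    to enforce a constraint such as sum_j j*|b_j| <= C + 1 (truncation)."""
    rng = np.random.default_rng(rng)
    index = X @ tau                                    # projected covariates tau' X_i
    Phi = np.column_stack([phi(index, j) for j in range(1, m_h + 1)])
    b_hat, *_ = np.linalg.lstsq(Phi, Y, rcond=None)    # least squares estimate
    return b_hat + scale * rng.standard_normal(m_h)    # Gaussian draw centered at b_hat

# Example with a toy dictionary phi_j(t) = cos(j * t).
phi = lambda t, j: np.cos(j * t)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
theta = np.array([0.7, 0.3, 0.0])
Y = np.sin(X @ theta) + 0.1 * rng.normal(size=100)
print(sample_link_coefficients(X, Y, theta, m_h=4, phi=phi, rng=1))
```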


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ffourier', 0.298), ('iau', 0.298), ('lquier', 0.298), ('ingle', 0.28), ('ndex', 0.28), ('fm', 0.221), ('fhhi', 0.21), ('median', 0.193), ('odel', 0.167), ('rn', 0.164), ('parse', 0.152), ('log', 0.15), ('fnw', 0.14), ('flasso', 0.123), ('alquier', 0.12), ('mh', 0.104), ('fn', 0.095), ('mcmc', 0.088), ('flinear', 0.088), ('fnp', 0.088), ('fsi', 0.088), ('gh', 0.088), ('hhi', 0.088), ('catoni', 0.081), ('kt', 0.075), ('sparsity', 0.074), ('rdle', 0.068), ('reversible', 0.068), ('exp', 0.067), ('dn', 0.066), ('lasso', 0.066), ('coordinates', 0.065), ('inf', 0.065), ('xi', 0.061), ('ontrol', 0.06), ('characters', 0.06), ('tsybakov', 0.06), ('dalalyan', 0.054), ('denss', 0.053), ('nonasymptotic', 0.053), ('rjmcmc', 0.053), ('pn', 0.05), ('measurable', 0.05), ('jump', 0.05), ('probability', 0.049), ('chains', 0.049), ('dimension', 0.049), ('inequality', 0.047), ('sobolev', 0.047), ('housing', 0.047), ('link', 0.044), ('pierre', 0.042), ('cn', 0.041), ('temperature', 0.04), ('lemma', 0.038), ('lounici', 0.037), ('oracle', 0.037), ('fubini', 0.037), ('kh', 0.037), ('trigonometric', 0.037), ('chain', 0.037), ('put', 0.036), ('annex', 0.035), ('dublin', 0.035), ('marin', 0.035), ('prevision', 0.035), ('mean', 0.035), ('gibbs', 0.035), ('soon', 0.035), ('posterior', 0.033), ('lk', 0.033), ('absolutely', 0.033), ('prior', 0.033), ('concrete', 0.032), ('ti', 0.032), ('consequently', 0.031), ('estimation', 0.03), ('delecroix', 0.03), ('rard', 0.03), ('condi', 0.03), ('antoniadis', 0.03), ('yeh', 0.03), ('bm', 0.03), ('supremum', 0.03), ('carlo', 0.03), ('regression', 0.029), ('monte', 0.029), ('sparse', 0.029), ('ambient', 0.028), ('paris', 0.028), ('surely', 0.028), ('besides', 0.027), ('proposal', 0.027), ('model', 0.027), ('slump', 0.027), ('postponed', 0.027), ('belongs', 0.025), ('audibert', 0.025), ('accordance', 0.025), ('bunea', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 104 jmlr-2013-Sparse Single-Index Model

Author: Pierre Alquier, Gérard Biau

Abstract: Let (X,Y ) be a random pair taking values in R p × R. In the so-called single-index model, one has Y = f ⋆ (θ⋆T X) +W , where f ⋆ is an unknown univariate measurable function, θ⋆ is an unknown vector in Rd , and W denotes a random noise satisfying E[W |X] = 0. The single-index model is known to offer a flexible way to model a variety of high-dimensional real-world phenomena. However, despite its relative simplicity, this dimension reduction scheme is faced with severe complications as soon as the underlying dimension becomes larger than the number of observations (“p larger than n” paradigm). To circumvent this difficulty, we consider the single-index model estimation problem from a sparsity perspective using a PAC-Bayesian approach. On the theoretical side, we offer a sharp oracle inequality, which is more powerful than the best known oracle inequalities for other common procedures of single-index recovery. The proposed method is implemented by means of the reversible jump Markov chain Monte Carlo technique and its performance is compared with that of standard procedures. Keywords: single-index model, sparsity, regression estimation, PAC-Bayesian, oracle inequality, reversible jump Markov chain Monte Carlo method

2 0.10124438 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso

Author: Tingni Sun, Cun-Hui Zhang

Abstract: We propose a new method of learning a sparse nonnegative-definite target matrix. Our primary example of the target matrix is the inverse of a population covariance or correlation matrix. The algorithm first estimates each column of the target matrix by the scaled Lasso and then adjusts the matrix estimator to be symmetric. The penalty level of the scaled Lasso for each column is completely determined by data via convex minimization, without using cross-validation. We prove that this scaled Lasso method guarantees the fastest proven rate of convergence in the spectrum norm under conditions of weaker form than those in the existing analyses of other ℓ1 regularized algorithms, and has faster guaranteed rate of convergence when the ratio of the ℓ1 and spectrum norms of the target inverse matrix diverges to infinity. A simulation study demonstrates the computational feasibility and superb performance of the proposed method. Our analysis also provides new performance bounds for the Lasso and scaled Lasso to guarantee higher concentration of the error at a smaller threshold level than previous analyses, and to allow the use of the union bound in column-by-column applications of the scaled Lasso without an adjustment of the penalty level. In addition, the least squares estimation after the scaled Lasso selection is considered and proven to guarantee performance bounds similar to that of the scaled Lasso. Keywords: precision matrix, concentration matrix, inverse matrix, graphical model, scaled Lasso, linear regression, spectrum norm

3 0.089566328 77 jmlr-2013-On the Convergence of Maximum Variance Unfolding

Author: Ery Arias-Castro, Bruno Pelletier

Abstract: Maximum Variance Unfolding is one of the main methods for (nonlinear) dimensionality reduction. We study its large sample limit, providing specific rates of convergence under standard assumptions. We find that it is consistent when the underlying submanifold is isometric to a convex subset, and we provide some simple examples where it fails to be consistent. Keywords: maximum variance unfolding, isometric embedding, U-processes, empirical processes, proximity graphs.

4 0.073758572 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning

Author: Markus Thom, Günther Palm

Abstract: Sparseness is a useful regularizer for learning in a wide range of applications, in particular in neural networks. This paper proposes a model targeted at classification tasks, where sparse activity and sparse connectivity are used to enhance classification capabilities. The tool for achieving this is a sparseness-enforcing projection operator which finds the closest vector with a pre-defined sparseness for any given vector. In the theoretical part of this paper, a comprehensive theory for such a projection is developed. In conclusion, it is shown that the projection is differentiable almost everywhere and can thus be implemented as a smooth neuronal transfer function. The entire model can hence be tuned end-to-end using gradient-based methods. Experiments on the MNIST database of handwritten digits show that classification performance can be boosted by sparse activity or sparse connectivity. With a combination of both, performance can be significantly better compared to classical non-sparse approaches. Keywords: supervised learning, sparseness projection, sparse activity, sparse connectivity

5 0.068714142 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression

Author: Sébastien Gerchinovitz

Abstract: We consider the problem of online linear regression on arbitrary deterministic sequences when the ambient dimension d can be much larger than the number of time rounds T . We introduce the notion of sparsity regret bound, which is a deterministic online counterpart of recent risk bounds derived in the stochastic setting under a sparsity scenario. We prove such regret bounds for an online-learning algorithm called SeqSEW and based on exponential weighting and data-driven truncation. In a second part we apply a parameter-free version of this algorithm to the stochastic setting (regression model with random design). This yields risk bounds of the same flavor as in Dalalyan and Tsybakov (2012a) but which solve two questions left open therein. In particular our risk bounds are adaptive (up to a logarithmic factor) to the unknown variance of the noise if the latter is Gaussian. We also address the regression model with fixed design. Keywords: sparsity, online linear regression, individual sequences, adaptive regret bounds

6 0.058467742 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification

7 0.05813795 81 jmlr-2013-Optimal Discovery with Probabilistic Expert Advice: Finite Time Analysis and Macroscopic Optimality

8 0.05602368 27 jmlr-2013-Consistent Selection of Tuning Parameters via Variable Selection Stability

9 0.053506568 76 jmlr-2013-Nonparametric Sparsity and Regularization

10 0.051059432 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation

11 0.050939444 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems

12 0.050165597 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres

13 0.0500631 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion

14 0.049647331 65 jmlr-2013-Lower Bounds and Selectivity of Weak-Consistent Policies in Stochastic Multi-Armed Bandit Problem

15 0.048519246 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

16 0.046948008 79 jmlr-2013-On the Mutual Nearest Neighbors Estimate in Regression

17 0.046569157 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting

18 0.044652615 90 jmlr-2013-Quasi-Newton Method: A New Direction

19 0.044563998 43 jmlr-2013-Fast MCMC Sampling for Markov Jump Processes and Extensions

20 0.043649133 117 jmlr-2013-Universal Consistency of Localized Versions of Regularized Kernel Methods


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.229), (1, 0.021), (2, 0.069), (3, 0.057), (4, 0.014), (5, -0.14), (6, -0.005), (7, -0.007), (8, 0.106), (9, 0.019), (10, -0.133), (11, -0.079), (12, -0.031), (13, 0.049), (14, -0.004), (15, -0.037), (16, -0.015), (17, -0.023), (18, 0.17), (19, -0.025), (20, -0.01), (21, 0.035), (22, -0.105), (23, 0.012), (24, -0.002), (25, -0.17), (26, 0.11), (27, -0.015), (28, -0.101), (29, -0.045), (30, -0.238), (31, -0.196), (32, 0.003), (33, 0.088), (34, -0.043), (35, -0.058), (36, -0.148), (37, -0.087), (38, -0.038), (39, 0.038), (40, -0.01), (41, 0.017), (42, -0.079), (43, -0.008), (44, 0.022), (45, 0.045), (46, 0.018), (47, -0.08), (48, 0.115), (49, -0.144)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.90147752 104 jmlr-2013-Sparse Single-Index Model

Author: Pierre Alquier, Gérard Biau

Abstract: Let (X,Y ) be a random pair taking values in R p × R. In the so-called single-index model, one has Y = f ⋆ (θ⋆T X) +W , where f ⋆ is an unknown univariate measurable function, θ⋆ is an unknown vector in Rd , and W denotes a random noise satisfying E[W |X] = 0. The single-index model is known to offer a flexible way to model a variety of high-dimensional real-world phenomena. However, despite its relative simplicity, this dimension reduction scheme is faced with severe complications as soon as the underlying dimension becomes larger than the number of observations (“p larger than n” paradigm). To circumvent this difficulty, we consider the single-index model estimation problem from a sparsity perspective using a PAC-Bayesian approach. On the theoretical side, we offer a sharp oracle inequality, which is more powerful than the best known oracle inequalities for other common procedures of single-index recovery. The proposed method is implemented by means of the reversible jump Markov chain Monte Carlo technique and its performance is compared with that of standard procedures. Keywords: single-index model, sparsity, regression estimation, PAC-Bayesian, oracle inequality, reversible jump Markov chain Monte Carlo method

2 0.63317239 77 jmlr-2013-On the Convergence of Maximum Variance Unfolding

Author: Ery Arias-Castro, Bruno Pelletier

Abstract: Maximum Variance Unfolding is one of the main methods for (nonlinear) dimensionality reduction. We study its large sample limit, providing specific rates of convergence under standard assumptions. We find that it is consistent when the underlying submanifold is isometric to a convex subset, and we provide some simple examples where it fails to be consistent. Keywords: maximum variance unfolding, isometric embedding, U-processes, empirical processes, proximity graphs.

3 0.4971382 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning

Author: Markus Thom, Günther Palm

Abstract: Sparseness is a useful regularizer for learning in a wide range of applications, in particular in neural networks. This paper proposes a model targeted at classification tasks, where sparse activity and sparse connectivity are used to enhance classification capabilities. The tool for achieving this is a sparseness-enforcing projection operator which finds the closest vector with a pre-defined sparseness for any given vector. In the theoretical part of this paper, a comprehensive theory for such a projection is developed. In conclusion, it is shown that the projection is differentiable almost everywhere and can thus be implemented as a smooth neuronal transfer function. The entire model can hence be tuned end-to-end using gradient-based methods. Experiments on the MNIST database of handwritten digits show that classification performance can be boosted by sparse activity or sparse connectivity. With a combination of both, performance can be significantly better compared to classical non-sparse approaches. Keywords: supervised learning, sparseness projection, sparse activity, sparse connectivity

4 0.48474753 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso

Author: Tingni Sun, Cun-Hui Zhang

Abstract: We propose a new method of learning a sparse nonnegative-definite target matrix. Our primary example of the target matrix is the inverse of a population covariance or correlation matrix. The algorithm first estimates each column of the target matrix by the scaled Lasso and then adjusts the matrix estimator to be symmetric. The penalty level of the scaled Lasso for each column is completely determined by data via convex minimization, without using cross-validation. We prove that this scaled Lasso method guarantees the fastest proven rate of convergence in the spectrum norm under conditions of weaker form than those in the existing analyses of other ℓ1 regularized algorithms, and has faster guaranteed rate of convergence when the ratio of the ℓ1 and spectrum norms of the target inverse matrix diverges to infinity. A simulation study demonstrates the computational feasibility and superb performance of the proposed method. Our analysis also provides new performance bounds for the Lasso and scaled Lasso to guarantee higher concentration of the error at a smaller threshold level than previous analyses, and to allow the use of the union bound in column-by-column applications of the scaled Lasso without an adjustment of the penalty level. In addition, the least squares estimation after the scaled Lasso selection is considered and proven to guarantee performance bounds similar to that of the scaled Lasso. Keywords: precision matrix, concentration matrix, inverse matrix, graphical model, scaled Lasso, linear regression, spectrum norm

5 0.47971138 81 jmlr-2013-Optimal Discovery with Probabilistic Expert Advice: Finite Time Analysis and Macroscopic Optimality

Author: Sébastien Bubeck, Damien Ernst, Aurélien Garivier

Abstract: We consider an original problem that arises from the issue of security analysis of a power system and that we name optimal discovery with probabilistic expert advice. We address it with an algorithm based on the optimistic paradigm and on the Good-Turing missing mass estimator. We prove two different regret bounds on the performance of this algorithm under weak assumptions on the probabilistic experts. Under more restrictive hypotheses, we also prove a macroscopic optimality result, comparing the algorithm both with an oracle strategy and with uniform sampling. Finally, we provide numerical experiments illustrating these theoretical findings. Keywords: optimal discovery, probabilistic experts, optimistic algorithm, Good-Turing estimator, UCB

6 0.4731504 21 jmlr-2013-Classifier Selection using the Predicate Depth

7 0.44776282 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion

8 0.43366244 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification

9 0.39206651 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

10 0.3683356 65 jmlr-2013-Lower Bounds and Selectivity of Weak-Consistent Policies in Stochastic Multi-Armed Bandit Problem

11 0.36556262 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems

12 0.36394992 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting

13 0.35383457 27 jmlr-2013-Consistent Selection of Tuning Parameters via Variable Selection Stability

14 0.32056999 76 jmlr-2013-Nonparametric Sparsity and Regularization

15 0.31779796 119 jmlr-2013-Variable Selection in High-Dimension with Random Designs and Orthogonal Matching Pursuit

16 0.30174056 62 jmlr-2013-Learning Theory Approach to Minimum Error Entropy Criterion

17 0.2946637 117 jmlr-2013-Universal Consistency of Localized Versions of Regularized Kernel Methods

18 0.2875559 103 jmlr-2013-Sparse Robust Estimation and Kalman Smoothing with Nonsmooth Log-Concave Densities: Modeling, Computation, and Theory

19 0.2757532 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres

20 0.2741513 43 jmlr-2013-Fast MCMC Sampling for Markov Jump Processes and Extensions


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.022), (5, 0.099), (6, 0.031), (10, 0.064), (20, 0.013), (23, 0.533), (68, 0.018), (70, 0.024), (75, 0.037), (85, 0.035), (87, 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.92513251 56 jmlr-2013-Keep It Simple And Sparse: Real-Time Action Recognition

Author: Sean Ryan Fanello, Ilaria Gori, Giorgio Metta, Francesca Odone

Abstract: Sparsity has been showed to be one of the most important properties for visual recognition purposes. In this paper we show that sparse representation plays a fundamental role in achieving one-shot learning and real-time recognition of actions. We start off from RGBD images, combine motion and appearance cues and extract state-of-the-art features in a computationally efficient way. The proposed method relies on descriptors based on 3D Histograms of Scene Flow (3DHOFs) and Global Histograms of Oriented Gradient (GHOGs); adaptive sparse coding is applied to capture high-level patterns from data. We then propose a simultaneous on-line video segmentation and recognition of actions using linear SVMs. The main contribution of the paper is an effective realtime system for one-shot action modeling and recognition; the paper highlights the effectiveness of sparse coding techniques to represent 3D actions. We obtain very good results on three different data sets: a benchmark data set for one-shot action learning (the ChaLearn Gesture Data Set), an in-house data set acquired by a Kinect sensor including complex actions and gestures differing by small details, and a data set created for human-robot interaction purposes. Finally we demonstrate that our system is effective also in a human-robot interaction setting and propose a memory game, “All Gestures You Can”, to be played against a humanoid robot. Keywords: real-time action recognition, sparse representation, one-shot action learning, human robot interaction

same-paper 2 0.87523037 104 jmlr-2013-Sparse Single-Index Model

Author: Pierre Alquier, Gérard Biau

Abstract: Let (X,Y ) be a random pair taking values in R p × R. In the so-called single-index model, one has Y = f ⋆ (θ⋆T X) +W , where f ⋆ is an unknown univariate measurable function, θ⋆ is an unknown vector in Rd , and W denotes a random noise satisfying E[W |X] = 0. The single-index model is known to offer a flexible way to model a variety of high-dimensional real-world phenomena. However, despite its relative simplicity, this dimension reduction scheme is faced with severe complications as soon as the underlying dimension becomes larger than the number of observations (“p larger than n” paradigm). To circumvent this difficulty, we consider the single-index model estimation problem from a sparsity perspective using a PAC-Bayesian approach. On the theoretical side, we offer a sharp oracle inequality, which is more powerful than the best known oracle inequalities for other common procedures of single-index recovery. The proposed method is implemented by means of the reversible jump Markov chain Monte Carlo technique and its performance is compared with that of standard procedures. Keywords: single-index model, sparsity, regression estimation, PAC-Bayesian, oracle inequality, reversible jump Markov chain Monte Carlo method

3 0.62940103 80 jmlr-2013-One-shot Learning Gesture Recognition from RGB-D Data Using Bag of Features

Author: Jun Wan, Qiuqi Ruan, Wei Li, Shuang Deng

Abstract: For one-shot learning gesture recognition, two important challenges are: how to extract distinctive features and how to learn a discriminative model from only one training sample per gesture class. For feature extraction, a new spatio-temporal feature representation called 3D enhanced motion scale-invariant feature transform (3D EMoSIFT) is proposed, which fuses RGB-D data. Compared with other features, the new feature set is invariant to scale and rotation, and has more compact and richer visual representations. For learning a discriminative model, all features extracted from training samples are clustered with the k-means algorithm to learn a visual codebook. Then, unlike the traditional bag of feature (BoF) models using vector quantization (VQ) to map each feature into a certain visual codeword, a sparse coding method named simulation orthogonal matching pursuit (SOMP) is applied and thus each feature can be represented by some linear combination of a small number of codewords. Compared with VQ, SOMP leads to a much lower reconstruction error and achieves better performance. The proposed approach has been evaluated on ChaLearn gesture database and the result has been ranked amongst the top best performing techniques on ChaLearn gesture challenge (round 2). Keywords: gesture recognition, bag of features (BoF) model, one-shot learning, 3D enhanced motion scale invariant feature transform (3D EMoSIFT), Simulation Orthogonal Matching Pursuit (SOMP)

4 0.47251064 66 jmlr-2013-MAGIC Summoning: Towards Automatic Suggesting and Testing of Gestures With Low Probability of False Positives During Use

Author: Daniel Kyu Hwa Kohlsdorf, Thad E. Starner

Abstract: Gestures for interfaces should be short, pleasing, intuitive, and easily recognized by a computer. However, it is a challenge for interface designers to create gestures easily distinguishable from users’ normal movements. Our tool MAGIC Summoning addresses this problem. Given a specific platform and task, we gather a large database of unlabeled sensor data captured in the environments in which the system will be used (an “Everyday Gesture Library” or EGL). The EGL is quantized and indexed via multi-dimensional Symbolic Aggregate approXimation (SAX) to enable quick searching. MAGIC exploits the SAX representation of the EGL to suggest gestures with a low likelihood of false triggering. Suggested gestures are ordered according to brevity and simplicity, freeing the interface designer to focus on the user experience. Once a gesture is selected, MAGIC can output synthetic examples of the gesture to train a chosen classifier (for example, with a hidden Markov model). If the interface designer suggests his own gesture and provides several examples, MAGIC estimates how accurately that gesture can be recognized and estimates its false positive rate by comparing it against the natural movements in the EGL. We demonstrate MAGIC’s effectiveness in gesture selection and helpfulness in creating accurate gesture recognizers. Keywords: gesture recognition, gesture spotting, false positives, continuous recognition

5 0.4435994 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation

Author: Daniel Hernández-Lobato, José Miguel Hernández-Lobato, Pierre Dupont

Abstract: We describe a Bayesian method for group feature selection in linear regression problems. The method is based on a generalized version of the standard spike-and-slab prior distribution which is often used for individual feature selection. Exact Bayesian inference under the prior considered is infeasible for typical regression problems. However, approximate inference can be carried out efficiently using Expectation Propagation (EP). A detailed analysis of the generalized spike-and-slab prior shows that it is well suited for regression problems that are sparse at the group level. Furthermore, this prior can be used to introduce prior knowledge about specific groups of features that are a priori believed to be more relevant. An experimental evaluation compares the performance of the proposed method with those of group LASSO, Bayesian group LASSO, automatic relevance determination and additional variants used for group feature selection. The results of these experiments show that a model based on the generalized spike-and-slab prior and the EP algorithm has state-of-the-art prediction performance in the problems analyzed. Furthermore, this model is also very useful to carry out sequential experimental design (also known as active learning), where the data instances that are most informative are iteratively included in the training set, reducing the number of instances needed to obtain a particular level of prediction accuracy. Keywords: group feature selection, generalized spike-and-slab priors, expectation propagation, sparse linear model, approximate inference, sequential experimental design, signal reconstruction

6 0.44065517 58 jmlr-2013-Language-Motivated Approaches to Action Recognition

7 0.42900097 38 jmlr-2013-Dynamic Affine-Invariant Shape-Appearance Handshape Features and Classification in Sign Language Videos

8 0.42155147 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso

9 0.42054999 54 jmlr-2013-JKernelMachines: A Simple Framework for Kernel Machines

10 0.40900806 81 jmlr-2013-Optimal Discovery with Probabilistic Expert Advice: Finite Time Analysis and Macroscopic Optimality

11 0.40534315 14 jmlr-2013-Asymptotic Results on Adaptive False Discovery Rate Controlling Procedures Based on Kernel Estimators

12 0.3965067 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion

13 0.39281428 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering

14 0.38740805 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression

15 0.38519898 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning

16 0.38492057 46 jmlr-2013-GURLS: A Least Squares Library for Supervised Learning

17 0.38350704 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees

18 0.38096902 2 jmlr-2013-A Binary-Classification-Based Metric between Time-Series Distributions and Its Use in Statistical and Learning Problems

19 0.38079798 52 jmlr-2013-How to Solve Classification and Regression Problems on High-Dimensional Data with a Supervised Extension of Slow Feature Analysis

20 0.38025811 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference