jmlr jmlr2012 jmlr2012-59 knowledge-graph by maker-knowledge-mining

59 jmlr-2012-Linear Regression With Random Projections


Source: pdf

Author: Odalric-Ambrym Maillard, Rémi Munos

Abstract: We investigate a method for regression that makes use of a randomly generated subspace GP ⊂ F (of finite dimension P) of a given large (possibly infinite) dimensional function space F , for example, L2 ([0, 1]d ; R). GP is defined as the span of P random features that are linear combinations of basis functions of F weighted by random Gaussian i.i.d. coefficients. We show practical motivation for the use of this approach, detail the link that this random projections method shares with RKHS and Gaussian objects theory and prove, both in deterministic and random design, approximation error bounds when searching for the best regression function in GP rather than in F , and derive excess risk bounds for a specific regression algorithm (least squares regression in GP ). This paper stresses the motivation to study such methods, thus the analysis developed is kept simple for explanatory purposes and leaves room for future developments. Keywords: regression, random matrices, dimension reduction
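To make the construction concrete, here is a minimal numpy sketch of the method summarized above: an initial dictionary of features is mixed with i.i.d. Gaussian weights to form P random features, and ordinary least squares is run in the resulting P-dimensional space. The hat-function dictionary, its truncation to F terms, and the N(0, 1/P) scaling of the weights are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def hat_features(x, F):
    """Evaluate F multi-resolution hat functions phi_i at the points x in [0, 1]."""
    feats, j = [], 0
    while len(feats) < F:
        for l in range(2 ** j):
            if len(feats) == F:
                break
            center, width = (l + 0.5) / 2 ** j, 0.5 / 2 ** j
            feats.append(np.maximum(0.0, 1.0 - np.abs(x - center) / width))
        j += 1
    return np.stack(feats, axis=1)               # shape (N, F)

N, F, P = 200, 255, 20
x = rng.uniform(0.0, 1.0, N)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(N)

Phi = hat_features(x, F)                          # initial features phi_i(x_n)
A = rng.normal(0.0, 1.0 / np.sqrt(P), (P, F))     # Gaussian i.i.d. mixing weights
Psi = Phi @ A.T                                   # random features psi_p(x_n)
beta, *_ = np.linalg.lstsq(Psi, y, rcond=None)    # least squares in G_P
print("empirical risk:", np.mean((y - Psi @ beta) ** 2))
```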

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 GP is defined as the span of P random features that are linear combinations of basis functions of F weighted by random Gaussian i.i.d. coefficients. [sent-6, score-0.176]

2 … equipped with the quadratic norm $\|f\|_{P_X} \overset{\text{def}}{=} \big(\int_X f^2(x)\,dP_X(x)\big)^{1/2}$. [sent-25, score-0.399]

3 The goal of the statistician is to build, from the observations only, a regression function f ∈ F that is close to the so-called target function f⋆, in the sense that it has a low excess risk R(f) − R(f⋆), where the risk of any f ∈ L2,PX (X ; R) is defined as $R(f) \overset{\text{def}}{=} \int_{X\times\mathbb{R}} (y - f(x))^2\,dP(x,y)$. [sent-33, score-0.612]

4 Similarly, we introduce the empirical risk of a function f to be $R_N(f) \overset{\text{def}}{=} \frac{1}{N}\sum_{n=1}^{N} [y_n - f(x_n)]^2$, and we define the empirical norm of f as $\|f\|_N \overset{\text{def}}{=} \big(\frac{1}{N}\sum_{n=1}^{N} f(x_n)^2\big)^{1/2}$. [sent-34, score-0.814]
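As a small illustration of these two empirical quantities, the following sketch computes them for placeholder data and functions (nothing here comes from the paper's experiments):

```python
import numpy as np

def empirical_risk(f, x, y):
    """R_N(f) = (1/N) * sum_n (y_n - f(x_n))^2."""
    return np.mean((y - f(x)) ** 2)

def empirical_norm(f, x):
    """||f||_N = sqrt((1/N) * sum_n f(x_n)^2)."""
    return np.sqrt(np.mean(f(x) ** 2))

x = np.linspace(0.0, 1.0, 100)
y = np.sin(2 * np.pi * x)                 # noiseless placeholder observations
print(empirical_risk(np.cos, x, y))       # risk of the (bad) candidate f = cos
print(empirical_norm(np.sin, x))          # empirical norm of f = sin
```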

5 Examples of initial features include the Fourier basis, multi-resolution bases such as wavelets, and also less explicit features coming from a preliminary dictionary learning process. [sent-37, score-0.194]

6 In the sequel, for the sake of simplicity we focus our attention on the case when the target function f⋆ = fα⋆ belongs to the space F , in which case the excess risk of a function f can be written as $R(f) - R(f^\star) = \|f - f^\star\|_{P_X}^2$. [sent-38, score-0.193]

7 Our goal is first to give some intuition about this method by providing approximation error and simple excess risk bounds (which may not be the tightest possible ones as explained in Section 4. [sent-48, score-0.195]

8 1), which enables us to show how to relate the choice of the initial features {ϕi}i≥1 to the construction of standard function spaces via Gaussian objects (Section 3. [sent-53, score-0.199]

9 3, where we provide bounds on approximation error of the random space GP in the framework of regression with deterministic and random designs, and in Section 4. [sent-60, score-0.188]

10 4, where we derive excess risk bounds for a specific estimate. [sent-61, score-0.159]

11 Then let us define GP to be the (random) vector space spanned by those features, that is, $G_P \overset{\text{def}}{=} \big\{\, g_\beta(x) = \sum_{p=1}^{P} \beta_p \psi_p(x),\ \beta \in \mathbb{R}^P \,\big\}$. [sent-71, score-0.772]

12 In the sequel, PG will refer to the law of the Gaussian variables, Pη to the law of the observation noise, and PY to the law of the observations. [sent-72, score-0.174]

13 Example: let us consider the features {ϕi}i≥1 given by rescaling and translation of a mother one-dimensional hat function (illustrated in Figure 1, middle column) and defined precisely in paragraph 3. [sent-77, score-0.279]
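A sketch of one such multi-resolution family, obtained from a single mother hat function by dyadic rescaling and translation (the exact normalization used in the paper is not reproduced here):

```python
import numpy as np

def mother_hat(u):
    """Triangular hat supported on [0, 1], peaking at u = 1/2."""
    return np.maximum(0.0, 1.0 - np.abs(2.0 * u - 1.0))

def phi(j, l, x):
    """Feature obtained by rescaling/translation: phi_{j,l}(x) = hat(2^j x - l)."""
    return mother_hat(2 ** j * x - l)

x = np.linspace(0.0, 1.0, 9)
print(phi(0, 0, x))   # coarsest feature, supported on all of [0, 1]
print(phi(2, 1, x))   # scale j = 2, translation l = 1, supported on [1/4, 1/2]
```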

14 Thus we deduce that the excess risk is $O\big(B\,\|f^\star\|_{H^1}\,\sqrt{\log(N/\delta)}/\sqrt{N}\big)$ for P of the order $\sqrt{N}$. [sent-81, score-0.268]

15 Thus the excess risk in this case is bounded as $\|g_\beta - f^\star\|_N^2 = O\big(B\,\|f^\star\|_{s,2,2}\,\sqrt{\log(N/\delta)}/\sqrt{N}\big)$. [sent-87, score-0.159]

16 Comments: The second term in the bound (2) is a usual estimation error term in regression, while the first term comes from the additional approximation error of the space GP w. [sent-89, score-0.223]

17 • The result does not require any specific smoothness assumptions on the initial features {ϕi}i≥1; by optimizing over P, we get a rate of order $N^{-1/2}$ that corresponds to the minimax rates under such assumptions up to logarithmic factors. [sent-96, score-0.19]

18 On the other hand, when we use the random projection method, the random choice of GP implies that for any f⋆ ∈ F , the approximation error $\inf_{g\in G_P}\|f^\star - g\|_N$ can be controlled (by the first term of the bound (2)) with high probability. [sent-99, score-0.227]

19 However, considering smoothness assumptions on the features would enable us to derive a better approximation error term (first term of the bound (2)); typically, with a Sobolev assumption of order s, we would get a term of order $P^{-2s}$ instead of $P^{-1}$. [sent-112, score-0.291]

20 A more important modification of the method would be to consider, as for data-driven penalization techniques, a data-dependent construction of the random space GP , that is, using a data-driven distribution for the random variables $A_{p,i}$ instead of a Gaussian distribution. [sent-125, score-0.146]

21 In order to illustrate the method, we show in Figure 1 three examples of initial features {ϕi} (top row) and random features {ψp} (bottom row). [sent-128, score-0.194]

22 The first family of features is the Haar wavelet basis. [sent-129, score-0.233]

23 For example, in the case of multi-resolution hat functions (middle column), the corresponding random features are Brownian motions. [sent-133, score-0.148]

24 The linear regression with random projections approach described here simply consists in performing least-squares regression using the set of randomly generated features {ψp}1≤p≤P (e. [sent-134, score-0.217]

25 The initial sets of features are (respectively) Haar functions (left), multi-resolution hat functions (middle) and multi-resolution Gaussian functions (right). [sent-138, score-0.15]

26 Let us recall that the use of random projections is well known in many domains and applications, with different names according to the corresponding field, and that the corresponding random objects are widely studied and used. [sent-141, score-0.186]

27 In their paper, the authors show that with the appropriate choice of the wavelet functions and when using rescaling coefficients of the form $\sigma_{j,l} = 2^{-js}$, with scale index j and position index l (see paragraph 3. [sent-151, score-0.213]

28 1) (left) and then subsampled at 32 × 32 (middle), which provides the data samples for generating a regression function (right) using random features (generated from the symlets as initial features, in the simplest model when s is constant). [sent-158, score-0.176]

29 Then note that the space defined by $S'_N \overset{\text{def}}{=} \overline{I'(S')}$, that is, the closure of the image of S′ by I′ in the sense of $L_2(S;\mathbb{R},\mathcal{N}(0,K))$, is a Hilbert space with inner product inherited from $L_2(S;\mathbb{R},\mathcal{N}(0,K))$. [sent-177, score-0.437]

30 Then the kernel space of $\mathcal{N}(0,K)$ is $\mathcal{K} = J(\mathcal{H})$, endowed with inner product $\langle Jh_1, Jh_2\rangle_{\mathcal{K}} \overset{\text{def}}{=} \langle h_1, h_2\rangle_{\mathcal{H}}$. [sent-187, score-0.476]

31 Let {ϕi}i≥1 be an orthonormal basis of K; then $\sum_{i=1}^{\infty} \xi_i \varphi_i$ is a Gaussian object following the corresponding law. Note that from Lemma 4, one can build such an orthonormal basis {ϕi}i≥1 by defining, for all i, ϕi = Jhi, where {hi}i≥1 is an orthonormal basis of H and J satisfies the conditions of Lemma 4. [sent-195, score-0.343]

32 Since $\sum_i \sigma_i^2 < \infty$, the Gaussian object $W = \sum_i \xi_i \sigma_i e_i$ is well defined and our goal is to identify the kernel of the law of W. [sent-210, score-0.183]

33 For such an f, we deduce by the previous section that the injection mapping is given by $(I'f)(g) = \sum_i c_i (g, e_i)$, and that we also have $\|I'f\|_{S'_N}^2 = \mathbb{E}\,(I'f, W)^2 = \mathbb{E}\big(\sum_{i\ge 1}\sigma_i \xi_i c_i\big)^2 = \sum_i \sigma_i^2 c_i^2$. [sent-213, score-0.164]

34 Therefore, a function in the space K corresponding to f is of the form $\sum_i \sigma_i c_i e_i$, and one can easily check that the kernel space of the law of W is thus given by $\mathcal{K} = \big\{ f_c = \sum_{i\ge 1} c_i e_i \,;\ \sum_{i\ge 1} \frac{c_i^2}{\sigma_i^2} < \infty \big\}$, endowed with inner product $(f_c, f_d)_{\mathcal{K}} = \sum_i \frac{c_i d_i}{\sigma_i^2}$. [sent-215, score-0.297]

35 Note that if we now introduce the functions {ϕi}i≥1 defined by $\varphi_i \overset{\text{def}}{=} \sigma_i e_i \in \mathcal{H}$, then we get $\mathcal{K} = \big\{ f_\alpha = \sum_{i\ge 1} \alpha_i \varphi_i \,;\ \|\alpha\|_{\ell_2} < \infty \big\}$, endowed with inner product $(f_\alpha, f_\beta)_{\mathcal{K}} = \langle\alpha, \beta\rangle_{\ell_2}$. [sent-217, score-0.461]

36 In this paragraph, we now apply the previous construction to the case when the {ei}i≥1 are chosen to be a wavelet basis of functions defined on X = [0, 1], with reference measure µ being the Lebesgue measure. [sent-223, score-0.177]

37 Let e denote the mother wavelet function, and let us write $e_{j,l}$ for the i-th element of the basis, with $j \in \mathbb{N}$ a scale index and l ∈ {0, . [sent-224, score-0.284]

38 Let us define the coefficients {σi}i≥1 to be exponentially decreasing with the scale index: $\sigma_{j,l} \overset{\text{def}}{=} 2^{-js}$ for all j ≥ 0 and l ∈ {0, . [sent-228, score-0.4]
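A tiny sketch of these scale-dependent coefficients; the depth J and smoothness s below are arbitrary example values, and the assumption of 2^j translations per scale (as for a standard wavelet basis on [0, 1]) is mine, not stated in this fragment:

```python
def sigma(j, s):
    """Coefficient sigma_{j,l} = 2^(-j*s), identical for every position l at scale j."""
    return 2.0 ** (-j * s)

s, J = 1.0, 5
print({j: sigma(j, s) for j in range(J + 1)})    # {0: 1.0, 1: 0.5, 2: 0.25, ...}

# With 2^j positions at scale j, sum_i sigma_i^2 = sum_j 2^j * 2^(-2*j*s),
# a geometric series with ratio 2^(1 - 2s), finite whenever s > 1/2.
partial = sum((2 ** j) * sigma(j, s) ** 2 for j in range(60))
print(partial, 1.0 / (1.0 - 2.0 ** (1 - 2 * s)))  # partial sum vs. closed form
```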

39 Now assume that for some $q \in \mathbb{N}\setminus\{0\}$ such that q > s, the mother wavelet function e belongs to $C^q(X)$, the set of q-times continuously differentiable functions on X , and admits q vanishing moments. [sent-232, score-0.235]

40 Moreover, assuming that the mother wavelet is bounded by a constant λ and has compact support [0, 1], then we have the property, useful in view of our main Theorem, that $\sup_{x\in X}\|\varphi(x)\|^2$ is bounded in terms of $\lambda^2$. [sent-235, score-0.3]

41 By application of Lemma 5, we have that $\mathcal{K} = J(\mathcal{H})$, endowed with the inner product $\langle Jh_1, Jh_2\rangle_{\mathcal{K}} \overset{\text{def}}{=} \langle h_1, h_2\rangle_{\mathcal{H}}$, is the kernel space of $\mathcal{N}(0, JJ')$. [sent-246, score-0.476]

42 the Gaussian random variables, $\|f_\alpha - g_{A\alpha}\|_N^2 \le \varepsilon^2\,\|\alpha\|^2\,\frac{1}{N}\sum_{n=1}^{N}\|\varphi(x_n)\|^2$, where we recall that by assumption, for any x, $\varphi(x) \overset{\text{def}}{=} (\varphi_i(x))_{i\ge 1}$ is in ℓ2. [sent-296, score-0.409]

43 On the other hand, the zero-value regressor has an estimation error of $\|y\|_N^2 = \frac{1}{N}\sum_{n=1}^{N}(\alpha^T x_n)^2 = \alpha^T S\alpha$, where $S \overset{\text{def}}{=} \frac{1}{N}\sum_{n=1}^{N} x_n x_n^T \in \mathbb{R}^{D\times D}$. [sent-311, score-0.618]

44 Thus a "smooth" function fα (in the sense of having a low functional norm) has a low norm of the parameter α, and is thus well approximated with a small number of wavelet coefficients. [sent-325, score-0.172]

45 Regression with random subspaces: In this section, we describe the construction of the random subspace GP ⊂ F , defined as the span of the random features {ψp}p≤P generated from the initial features {ϕi}i≥1. [sent-330, score-0.265]

46 4 an algorithm that builds the proposed regression function and provide excess risk bounds for this algorithm. [sent-335, score-0.197]

47 In this paper we assume that the set of features {ϕi}i≥1 are continuous and satisfy the assumption that $\sup_{x\in X}\|\varphi(x)\|^2 < \infty$, where $\|\varphi(x)\|^2 \overset{\text{def}}{=} \sum_i \varphi_i(x)^2$. [sent-338, score-0.49]

48 The random subspace GP is generated by building a set of P random features {ψp}1≤p≤P defined as linear combinations of the initial features {ϕi}i≥1 weighted by random coefficients: $\psi_p(x) \overset{\text{def}}{=} \sum_{i\ge 1} A_{p,i}\,\varphi_i(x)$ for $1 \le p \le P$, where the (infinitely many) coefficients $A_{p,i}$ are drawn i.i.d. Gaussian. [sent-341, score-0.674]
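A sketch of this construction with the infinite sum truncated to the first F initial features; the truncation level, the placeholder dictionary, and the N(0, 1/P) variance (consistent with the covariance stated two sentences below) are all assumptions of the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
P, F = 10, 256                                       # random features / truncation level
A = rng.normal(0.0, 1.0 / np.sqrt(P), size=(P, F))   # A_{p,i} i.i.d. Gaussian

def psi(x, phi_eval):
    """psi_p(x) = sum_i A[p, i] * phi_i(x), vectorized over p = 1..P."""
    return A @ phi_eval(x)      # phi_eval(x) returns the vector (phi_1(x), ..., phi_F(x))

# Placeholder square-summable dictionary: phi_i(x) = cos(pi * i * x) / (i + 1).
phi_eval = lambda x: np.cos(np.pi * np.arange(F) * x) / (np.arange(F) + 1.0)
print(psi(0.3, phi_eval))       # the P random feature values at x = 0.3
```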

49 Such a definition of the features ψp as an infinite sum of random variables is not obvious (this is an expansion of a Gaussian object) and we refer to Section 3 for elements of theory about Gaussian objects and to Lemma 5 for the expansion of a Gaussian object. [sent-346, score-0.261]

50 Actually, they are random samples of a centered Gaussian process indexed by the space X with covariance structure given by $\frac{1}{P}\langle \varphi(x), \varphi(x')\rangle$, where we use the notation $\langle u, v\rangle \overset{\text{def}}{=} \sum_i u_i v_i$ for two square-summable sequences u and v. [sent-348, score-0.443]
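A quick Monte Carlo check of this covariance structure, reusing the placeholder dictionary and the N(0, 1/P) weights assumed in the previous sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
P, F = 5, 200
phi = lambda x: np.cos(np.pi * np.arange(F) * x) / (np.arange(F) + 1.0)

x, xp, trials = 0.2, 0.7, 20000
prods = np.empty(trials)
for t in range(trials):
    a = rng.normal(0.0, 1.0 / np.sqrt(P), F)   # one row A_{p, .}
    prods[t] = (a @ phi(x)) * (a @ phi(xp))    # psi_p(x) * psi_p(x')

print(prods.mean(), phi(x) @ phi(xp) / P)      # empirical vs. (1/P) <phi(x), phi(x')>
```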

51 We finally define GP ⊂ F to be the (random) vector space spanned by those features, that is, $G_P \overset{\text{def}}{=} \big\{\, g_\beta(x) = \sum_{p=1}^{P} \beta_p \psi_p(x),\ \beta \in \mathbb{R}^P \,\big\}$. [sent-352, score-0.772]

52 We now want to compute a high-probability bound on the excess risk of an estimator built using the random space GP . [sent-353, score-0.297]

53 Then we compute a high probability bound on the approximation error of the considered random space w. [sent-355, score-0.174]

54 Finally, we combine both bounds in order to derive a bound on the excess risk of the proposed estimate. [sent-359, score-0.195]

55 We will thus consider in this subsection only a space G that is the span of a deterministic set of P functions {ψp}p≤P, and we will write, for a convex subset Θ ⊂ RP, $G_\Theta \overset{\text{def}}{=} \{ g_\theta \in G \,;\ \theta \in \Theta \}$. [sent-365, score-0.403]

56 Similarly, we write $g^\star \overset{\text{def}}{=} \arg\min_{g\in G} R(g)$ and $g^\star_\Theta \overset{\text{def}}{=} \arg\min_{g\in G_\Theta} R(g)$. [sent-366, score-0.738]

57 Examples of well-studied estimates are: • $g^{\mathrm{ols}} \overset{\text{def}}{=} \arg\min_{g\in G} R_N(g)$, the ordinary least-squares (ols) estimate. [sent-367, score-0.413]

58 • $g^{\mathrm{erm}} \overset{\text{def}}{=} \arg\min_{g\in G_\Theta} R_N(g)$, the empirical risk minimizer (erm), which coincides with the ols when Θ = RP. [sent-368, score-0.459]

59 • $g^{\mathrm{ridge}} \overset{\text{def}}{=} \arg\min_{g\in G} R_N(g) + \lambda\|\theta\|_2^2$, $g^{\mathrm{lasso}} \overset{\text{def}}{=} \arg\min_{g\in G} R_N(g) + \lambda\|\theta\|_1$. [sent-369, score-0.738]
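A sketch of the first two of these estimators on a placeholder design matrix of random-feature evaluations (the ridge closed form below is standard linear algebra, not code from the paper; a lasso estimate would need an iterative solver):

```python
import numpy as np

rng = np.random.default_rng(3)
N, P = 100, 15
Psi = rng.standard_normal((N, P))                   # placeholder psi_p(x_n) design matrix
y = Psi @ rng.standard_normal(P) + 0.1 * rng.standard_normal(N)

# Ordinary least squares: argmin_beta R_N(g_beta).
beta_ols, *_ = np.linalg.lstsq(Psi, y, rcond=None)

# Ridge: argmin_beta R_N(g_beta) + lam * ||beta||_2^2, via its closed form.
lam = 0.1
beta_ridge = np.linalg.solve(Psi.T @ Psi / N + lam * np.eye(P), Psi.T @ y / N)

print(np.linalg.norm(beta_ols), np.linalg.norm(beta_ridge))
```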

60 We also introduce for convenience $g^B$, the truncation at level ±B of some g ∈ G, defined by $g^B(x) \overset{\text{def}}{=} T_B[g(x)]$, where $T_B(u) \overset{\text{def}}{=} u$ if $|u| \le B$ and $T_B(u) = B\,\mathrm{sign}(u)$ otherwise. [sent-370, score-0.738]

61 … < ∞, where the infimum is over all orthonormal bases. • Orthogonality assumptions: (O1) {ψp}p≤P is an orthonormal basis of G w. [sent-383, score-0.158]

62 Under assumptions (N2) and (N3), the truncated estimator $g^L = T_L(g^{\mathrm{ols}})$ satisfies $\mathbb{E}\,R(g^L) - R(f^{(\mathrm{reg})}) \le 8\big[R(g^*) - R(f^{(\mathrm{reg})})\big] + \kappa\,\frac{(\sigma^2 \vee B^2)\,P\log(N)}{N}$, where κ is some numerical constant and $f^{(\mathrm{reg})}(x) \overset{\text{def}}{=} \mathbb{E}(Y|X = x)$. [sent-390, score-0.369]

63 Note that Theorem 10 and Theorem 15 provide a result in expectation only, which is not enough for our purpose, since we need high-probability bounds on the excess risk in order to be able to handle the randomness of the space GP . [sent-414, score-0.249]

64 Assumptions satisfied by the random space GP : We now discuss the assumptions that are satisfied in our setting, where G is a random space GP built from the random features {ψp}p≤P, in terms of assumptions on the underlying initial space F . [sent-415, score-0.412]

65 Theorem 19 (Approximation error with deterministic design) For all P ≥ 1 and all δ ∈ (0, 1), there exists an event of PG-probability higher than 1 − δ such that on this event, $\inf_{g\in G_P} \|f^\star - g\|_N^2 \le \frac{12\log(4N/\delta)}{P}\,\|\alpha^\star\|^2\,\frac{1}{N}\sum_{n=1}^{N}\|\varphi(x_n)\|^2$. [sent-450, score-0.156]
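A rough numerical illustration of the 1/P behaviour of this approximation error; the target, dictionary, and N(0, 1/P) scaling are illustrative assumptions, and the experiment checks the qualitative decay only, not the theorem's constants:

```python
import numpy as np

rng = np.random.default_rng(4)
F, N = 300, 200
x = rng.uniform(0.0, 1.0, N)
Phi = np.cos(np.pi * np.outer(x, np.arange(F))) / (np.arange(F) + 1.0)   # phi_i(x_n)
alpha_star = 1.0 / (np.arange(F) + 1.0) ** 2                             # f* = f_{alpha*}
f_star = Phi @ alpha_star

for P in [5, 20, 80]:
    errs = []
    for _ in range(20):
        A = rng.normal(0.0, 1.0 / np.sqrt(P), (P, F))
        Psi = Phi @ A.T                                    # psi_p(x_n)
        beta, *_ = np.linalg.lstsq(Psi, f_star, rcond=None)
        errs.append(np.mean((f_star - Psi @ beta) ** 2))   # inf_g ||f* - g||_N^2
    print(P, np.mean(errs))   # the theorem bounds this by a term of order 1/P
```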

66 Excess Risk of Random Spaces: In this section, we analyze the excess risk of the random projection method. [sent-456, score-0.235]

67 The final prediction function g(x) is the truncation (at level ±B) of gβ, that is, $g(x) \overset{\text{def}}{=} T_B[g_\beta(x)]$. [sent-461, score-0.369]
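A sketch of this final predictor, reusing a coefficient vector and feature map of the right shapes (the objects below are placeholders, not a fitted model):

```python
import numpy as np

def T(u, B):
    """Truncation at level +/- B: u if |u| <= B, else B * sign(u)."""
    return np.clip(u, -B, B)

def predict(x, A, phi_eval, beta, B):
    """g(x) = T_B[g_beta(x)] with g_beta(x) = sum_p beta_p * psi_p(x)."""
    psi_x = A @ phi_eval(x)          # the P random feature values at x
    return T(beta @ psi_x, B)

rng = np.random.default_rng(5)
P, F = 10, 50
A = rng.normal(0.0, 1.0 / np.sqrt(P), (P, F))
beta = rng.standard_normal(P)
phi_eval = lambda x: np.cos(np.pi * np.arange(F) * x) / (np.arange(F) + 1.0)
print(predict(0.4, A, phi_eval, beta, B=1.0))
```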

68 In the next subsection, we provide excess risk bounds w. [sent-462, score-0.159]

69 Note that from this theorem, we deduce (without further assumptions on the features {ϕi}i≥1) that, for instance, for the choice $P = \sqrt{N/\log(N/\delta)}$, we have $\|f^\star - g_\beta\|_N^2 \le \kappa'\Big(\|\alpha^\star\|^2\,\frac{1}{N}\sum_{n=1}^{N}\|\varphi(x_n)\|^2\,\sqrt{\frac{\log(N/\delta)}{N}} + \frac{\log(1/\delta)}{N}\Big)$ for some positive constant κ′. [sent-469, score-0.211]

70 Regression with Random Design: In the regression problem with random design, the analysis of the excess risk of a given method is not straightforward, since the assumptions needed to apply standard techniques may not be satisfied without further knowledge of the structure of the features. [sent-475, score-0.283]

71 …(Y, G) are the projections of the target function f⋆ and of the observation Y onto the random linear space G with respect to the empirical norm $\|\cdot\|_N$. [sent-488, score-0.149]

72 Approximation error term: The first term, $\|f^\star - \tilde g_\beta\|_N^2$, is an approximation error term in empirical norm; it contains the number of projections as well as the norm of the target function. [sent-495, score-0.189]

73 This term plays the role of the approximation term that exists for regression with penalization by a factor $\lambda\|f\|^2$. [sent-496, score-0.184]

74 This term classically decreases at speed $\frac{D\sigma^2}{N}$, where σ² is the variance of the noise and D is related to the log entropy of the space of functions G considered. [sent-502, score-0.181]

75 This term also depends on the log entropy of the space of functions; thus the same remark applies to the dependency on P as for the noise error term. [sent-508, score-0.181]

76 (2008), the authors provide excess risk bounds for greedy algorithms (i. [sent-523, score-0.159]

77 Adaptivity: Randomization makes it possible to define approximation spaces such that the approximation error, either in expectation or with high probability over the choice of the random space, is controlled, whatever the measure P that is used to assess the performance. [sent-540, score-0.18]

78 On the contrary, the random features {ψp}p≤P, which are functions that contain (random combinations of) all wavelets, will be able to detect correlations between the data and some high-frequency wavelets, and thus discover relevant features of f⋆ on the spot. [sent-549, score-0.152]

79 We consider as initial features {ϕi}i≥1 the set of hat functions defined in Section 3. [sent-552, score-0.15]

80 Least-squares regression using scrambled objects is able to learn the structure of f ⋆ in terms of the measure P , while least-squares regression with the initial features completely fails. [sent-563, score-0.289]

81 Approximation error: Using a finite F introduces an additional approximation (squared) error term in the final excess risk bounds. [sent-578, score-0.234]

82 This additional error, which is due to the numerical approximation, is of order $O(F^{-2s/d})$ for a wavelet basis adapted to $H^s([0,1]^d)$ and can be made arbitrarily small, for example $o(N^{-1/2})$, whenever the depth of the wavelet dyadic tree is bigger than log N. [sent-579, score-0.422]

83 Note that in the specific case of wavelets, we could even think of combining random projections with tools like the fast wavelet transform, which would be even faster (which we do not do here for simplicity and generality purposes). [sent-586, score-0.213]

84 Thus, using $P = O(\sqrt{N})$ random features, we deduce that the complexity of building the matrix Ψ is at most $O(2^d N^{3/2}\log N)$. [sent-597, score-0.221]

85 Proof of Lemma 24: We can bound the noise term $\|g_\beta - \tilde g_\beta\|_N^2$ using a simple Chernoff bound together with a chaining argument. [sent-632, score-0.147]

86 Indeed, by definition of $g_\beta$ and $\tilde g_\beta$, if we introduce the noise vector η defined by η = Y − f, we have $\|g_\beta - \tilde g_\beta\|_N^2 = \langle g_\beta - \tilde g_\beta, \eta\rangle_N = \frac{1}{N}\sum_{i=1}^{N}\eta_i (g_\beta - \tilde g_\beta)(X_i) \le \Big(\sup_{g\in G}\frac{\frac{1}{N}\sum_{i=1}^{N}\eta_i g(X_i)}{\|g\|_N}\Big)\,\|g_\beta - \tilde g_\beta\|_N$. [sent-633, score-0.166]

87 $\rho(t) \overset{\text{def}}{=} P_Y\Big(\exists g \in G:\ \frac{\frac{1}{N}\sum_{i=1}^{N}\eta_i g(X_i)}{\|g\|_N} > t\Big) = P_Y\Big(\exists g \in G,\ \|g\|_N = 1:\ \frac{1}{N}\sum_{i=1}^{N}\eta_i g(X_i) > t\Big)$. [sent-641, score-0.369]

88 Thus, by setting δ1 = δ2 = δ3 = δ/3, we deduce that with P-probability higher than 1 − δ, $\frac{1}{N}\sum_{i=1}^{N}\varepsilon_i g(X_i)$ is bounded by a term of order $\frac{B\sqrt{6}}{\sqrt{N}}\Big(\sqrt{400\log(2)(1+\sqrt{2})^2\,P + 200\log(6/\delta)} + \sqrt{\log(3/\delta)}\Big)$. [sent-666, score-0.146]

89 Thus, we consider $u = N^{-1}$, and deduce that, provided that $N \ge P\log(N)$, then with probability higher than 1 − δ w. [sent-682, score-0.174]

90 , then Hoeffding's inequality applies; we deduce that there exists an event ΩX(ωG) of PX-probability higher than 1 − δX such that on this event, $\|f_{\alpha^*} - T_L(g_{A\alpha^*})\|_X^2 \le \|f_{\alpha^*} - T_L(g_{A\alpha^*})\|_{2m}^2 + (2L)^2\sqrt{\frac{\log(1/\delta)}{2m}}$. [sent-715, score-0.318]

91 Now, by independence between the Gaussian random variables and the sample, the same inequality is valid on the event $\Omega_1 = \{(\omega_X, \omega_G)\,;\ \omega_G \in \Omega_G,\ \omega_X \in \Omega_X(\omega_G)\}$, and this event has PX × PG-probability higher than 1 − δX. [sent-716, score-0.249]

92 Thus, still by independence, the same inequality is valid on the event $\Omega_2 = \{(\omega_X, \omega_G)\,;\ \omega_X \in \Omega_X,\ \omega_G \in \Omega_G(\omega_X)\}$, and this event has PX × PG-probability higher than $1 - 4m\,e^{-P(\varepsilon^2/4 - \varepsilon^3/6)}$. [sent-718, score-0.209]

93 Thus, we deduce by a union bound that for all ε ∈ (0, 1) and m ≥ 1 there exists an event Ω1 ∩ Ω2 of PX × PG-probability higher than $1 - \delta_X - 4m\,e^{-P(\varepsilon^2/4 - \varepsilon^3/6)}$ such that on this event, $\inf_{g\in G_P}\|f^\star - T_L(g)\|_X^2 \le \varepsilon^2\,\|\alpha^\star\|^2\,\sup_x\|\varphi(x)\|^2 + (2L)^2\sqrt{\frac{\log(1/\delta)}{2m}}$. [sent-719, score-0.257]

94 Finally, in order to get a bound in high PG-probability only, we introduce for any ωG ∈ ΩG the event $\Omega'_X(\omega_G) \overset{\text{def}}{=} \{\omega_X \in \Omega_X\,;\ (\omega_X, \omega_G) \in \Omega_1 \times \Omega_2\}$ and then define for all λ > 0 the event $\Lambda \overset{\text{def}}{=} \{\omega_G \in \Omega_G\,;\ P_X(\Omega'_X(\omega_G)) \ge 1 - \lambda\}$. [sent-720, score-0.946]

95 Now, by rewriting the bound using $\delta = \frac{\delta_X + \delta_G}{\lambda}$, we deduce that for all δ, m and λ, there exists an event of PG-probability higher than 1 − δ such that $\inf_{g\in G_P}\|f^\star - T_L(g)\|_X^2 \le \frac{12\log(\frac{8m}{\lambda\delta})}{P}\,\|\alpha^\star\|^2\,\sup_x\|\varphi(x)\|^2 + (2L)^2\sqrt{\frac{\log(\frac{2}{\lambda\delta})}{2m}} + \lambda$. [sent-727, score-0.366]

96 For the first term on the right-hand side, an application of Theorem 19 ensures that there exists an event $\Omega'_G \subset \Omega_G$ of PG-probability higher than 1 − δ such that for all $\omega_G \in \Omega'_G$, $\|f^\star - \tilde g_\beta\|_N^2 \le \frac{12\log(4N/\delta)}{P}\,\|\alpha^\star\|^2\,\frac{1}{N}\sum_{n=1}^{N}\|\varphi(x_n)\|^2$. [sent-736, score-0.162]

97 Since no random variable Y appears in this term, this is also true on the event $\Omega_1 \overset{\text{def}}{=} \{(\omega_Y, \omega_G) \in \Omega_Y \times \Omega_G\,;\ \omega_G \in \Omega'_G\}$, and Ω1 has PY × PG-probability higher than 1 − δ. [sent-737, score-0.532]

98 Thus, by independence between the noise term and the Gaussian variables, we deduce that a similar bound holds on the event $\Omega_2 \overset{\text{def}}{=} \{(\omega_Y, \omega_G) \in \Omega_Y \times \Omega_G\,;\ \omega_Y \in \Omega_Y(\omega_G)\}$, and that Ω2 has PY × PG-probability higher than 1 − δ′. [sent-740, score-0.712]

99 Proof of Theorem 22: Similarly to the proof of Theorem 20, we introduce the sets ΩX, Ωη and ΩG that consist of all possible realizations of the input, noise and Gaussian random variables. [sent-743, score-0.472]

100 On the other hand, by application of Theorem 19, for any given (ωX, ωη), there exists an event ΩG(ωX, ωη) ⊂ ΩG of PG-probability higher than 1 − δ3 such that on this event, $\|f^\star - \tilde g_\beta\|_N^2 \le \frac{12\log(4N/\delta_3)}{P}\,\|\alpha^\star\|^2\,\sup_x\|\varphi(x)\|^2$. [sent-751, score-0.274]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('gp', 0.412), ('def', 0.369), ('tb', 0.302), ('pg', 0.217), ('aillard', 0.207), ('rojections', 0.196), ('unos', 0.177), ('andom', 0.147), ('wavelets', 0.142), ('rp', 0.14), ('wavelet', 0.137), ('px', 0.131), ('brownian', 0.126), ('excess', 0.113), ('egression', 0.11), ('tl', 0.11), ('ga', 0.11), ('deduce', 0.109), ('maillard', 0.098), ('mother', 0.098), ('inear', 0.098), ('event', 0.086), ('gaussian', 0.084), ('xn', 0.083), ('log', 0.072), ('supx', 0.071), ('besov', 0.065), ('carleman', 0.065), ('sup', 0.065), ('py', 0.062), ('objects', 0.061), ('orthonormal', 0.059), ('features', 0.056), ('ei', 0.055), ('scrambled', 0.054), ('expansion', 0.052), ('hat', 0.052), ('gy', 0.052), ('ith', 0.049), ('lemma', 0.047), ('munos', 0.046), ('risk', 0.046), ('law', 0.046), ('object', 0.046), ('assumptions', 0.046), ('paragraph', 0.045), ('projections', 0.045), ('alquier', 0.044), ('argming', 0.044), ('germ', 0.044), ('gols', 0.044), ('peaky', 0.044), ('hilbert', 0.043), ('initial', 0.042), ('sobolev', 0.042), ('random', 0.04), ('spaces', 0.04), ('basis', 0.04), ('texture', 0.039), ('barron', 0.039), ('term', 0.039), ('regression', 0.038), ('infg', 0.037), ('rahimi', 0.037), ('motions', 0.037), ('endowed', 0.037), ('higher', 0.037), ('projection', 0.036), ('reg', 0.036), ('noise', 0.036), ('kernel', 0.036), ('approximation', 0.036), ('bound', 0.036), ('rkhs', 0.035), ('un', 0.034), ('space', 0.034), ('pierre', 0.034), ('haar', 0.034), ('cients', 0.033), ('inf', 0.033), ('aun', 0.033), ('iz', 0.033), ('lifshits', 0.033), ('micha', 0.033), ('penalization', 0.032), ('subspace', 0.031), ('js', 0.031), ('theorem', 0.031), ('norm', 0.03), ('coef', 0.03), ('ds', 0.029), ('dec', 0.029), ('middle', 0.028), ('randomness', 0.028), ('probability', 0.028), ('canu', 0.028), ('caponnetto', 0.028), ('catoni', 0.028), ('realizations', 0.027), ('ew', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999946 59 jmlr-2012-Linear Regression With Random Projections

Author: Odalric-Ambrym Maillard, Rémi Munos

Abstract: We investigate a method for regression that makes use of a randomly generated subspace GP ⊂ F (of finite dimension P) of a given large (possibly infinite) dimensional function space F , for example, L2 ([0, 1]d ; R). GP is defined as the span of P random features that are linear combinations of basis functions of F weighted by random Gaussian i.i.d. coefficients. We show practical motivation for the use of this approach, detail the link that this random projections method shares with RKHS and Gaussian objects theory and prove, both in deterministic and random design, approximation error bounds when searching for the best regression function in GP rather than in F , and derive excess risk bounds for a specific regression algorithm (least squares regression in GP ). This paper stresses the motivation to study such methods, thus the analysis developed is kept simple for explanatory purposes and leaves room for future developments. Keywords: regression, random matrices, dimension reduction

2 0.21494661 118 jmlr-2012-Variational Multinomial Logit Gaussian Process

Author: Kian Ming A. Chai

Abstract: Gaussian process prior with an appropriate likelihood function is a flexible non-parametric model for a variety of learning tasks. One important and standard task is multi-class classification, which is the categorization of an item into one of several fixed classes. A usual likelihood function for this is the multinomial logistic likelihood function. However, exact inference with this model has proved to be difficult because high-dimensional integrations are required. In this paper, we propose a variational approximation to this model, and we describe the optimization of the variational parameters. Experiments have shown our approximation to be tight. In addition, we provide data-independent bounds on the marginal likelihood of the model, one of which is shown to be much tighter than the existing variational mean-field bound in the experiments. We also derive a proper lower bound on the predictive likelihood that involves the Kullback-Leibler divergence between the approximating and the true posterior. We combine our approach with a recently proposed sparse approximation to give a variational sparse approximation to the Gaussian process multi-class model. We also derive criteria which can be used to select the inducing set, and we show the effectiveness of these criteria over random selection in an experiment. Keywords: Gaussian process, probabilistic classification, multinomial logistic, variational approximation, sparse approximation

3 0.13395593 47 jmlr-2012-GPLP: A Local and Parallel Computation Toolbox for Gaussian Process Regression

Author: Chiwoo Park, Jianhua Z. Huang, Yu Ding

Abstract: This paper presents the Getting-started style documentation for the local and parallel computation toolbox for Gaussian process regression (GPLP), an open source software package written in Matlab (but also compatible with Octave). The working environment and the usage of the software package will be presented in this paper. Keywords: Gaussian process regression, domain decomposition method, partial independent conditional, bagging for Gaussian process, local probabilistic regression

4 0.095088296 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming

Author: Garvesh Raskutti, Martin J. Wainwright, Bin Yu

Abstract: Sparse additive models are families of d-variate functions with the additive decomposition f ∗ = ∑ j∈S f j∗ , where S is an unknown subset of cardinality s ≪ d. In this paper, we consider the case where each univariate component function f j∗ lies in a reproducing kernel Hilbert space (RKHS), and analyze a method for estimating the unknown function f ∗ based on kernels combined with ℓ1 -type convex regularization. Working within a high-dimensional framework that allows both the dimension d and sparsity s to increase with n, we derive convergence rates in the L2 (P) and L2 (Pn ) norms over the class Fd,s,H of sparse additive models with each univariate function f j∗ in the unit ball of a univariate RKHS with bounded kernel function. We complement our upper bounds by deriving minimax lower bounds on the L2 (P) error, thereby showing the optimality of our method. Thus, we obtain optimal minimax rates for many interesting classes of sparse additive models, including polynomials, splines, and Sobolev classes. We also show that if, in contrast to our univariate conditions, the d-variate function class is assumed to be globally bounded, then much faster estimation rates are possible for any sparsity s = Ω(√n), showing that global boundedness is a significant restriction in the high-dimensional setting. Keywords: sparsity, kernel, non-parametric, convex, minimax

5 0.085442252 80 jmlr-2012-On Ranking and Generalization Bounds

Author: Wojciech Rejchel

Abstract: The problem of ranking is to predict or to guess the ordering between objects on the basis of their observed features. In this paper we consider ranking estimators that minimize the empirical convex risk. We prove generalization bounds for the excess risk of such estimators with rates that are faster than 1/√n. We apply our results to commonly used ranking algorithms, for instance boosting or support vector machines. Moreover, we study the performance of considered estimators on real data sets. Keywords: convex risk minimization, excess risk, support vector machine, empirical process, U-process

6 0.084137112 91 jmlr-2012-Plug-in Approach to Active Learning

7 0.080089681 78 jmlr-2012-Nonparametric Guidance of Autoencoder Representations using Label Information

8 0.076324902 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization

9 0.0707874 99 jmlr-2012-Restricted Strong Convexity and Weighted Matrix Completion: Optimal Bounds with Noise

10 0.064234689 51 jmlr-2012-Integrating a Partial Model into Model Free Reinforcement Learning

11 0.06325604 20 jmlr-2012-Analysis of a Random Forests Model

12 0.057568669 81 jmlr-2012-On the Convergence Rate oflp-Norm Multiple Kernel Learning

13 0.056655314 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach

14 0.056102011 82 jmlr-2012-On the Necessity of Irrelevant Variables

15 0.055028256 73 jmlr-2012-Multi-task Regression using Minimal Penalties

16 0.053788036 14 jmlr-2012-Activized Learning: Transforming Passive to Active with Improved Label Complexity

17 0.051531199 46 jmlr-2012-Finite-Sample Analysis of Least-Squares Policy Iteration

18 0.050717406 2 jmlr-2012-A Comparison of the Lasso and Marginal Regression

19 0.050649036 111 jmlr-2012-Structured Sparsity and Generalization

20 0.049445122 39 jmlr-2012-Estimation and Selection via Absolute Penalized Convex Minimization And Its Multistage Adaptive Applications


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.241), (1, 0.13), (2, -0.019), (3, -0.047), (4, 0.099), (5, -0.099), (6, -0.042), (7, 0.022), (8, -0.273), (9, -0.045), (10, -0.103), (11, 0.028), (12, -0.231), (13, 0.263), (14, -0.157), (15, -0.054), (16, -0.132), (17, -0.009), (18, -0.258), (19, 0.063), (20, 0.053), (21, -0.109), (22, -0.087), (23, -0.217), (24, 0.027), (25, 0.065), (26, 0.07), (27, -0.024), (28, 0.057), (29, -0.087), (30, 0.073), (31, -0.025), (32, 0.005), (33, -0.021), (34, -0.042), (35, 0.002), (36, -0.023), (37, 0.079), (38, 0.016), (39, -0.008), (40, 0.009), (41, -0.001), (42, 0.065), (43, 0.04), (44, 0.099), (45, -0.01), (46, -0.029), (47, 0.031), (48, -0.048), (49, 0.003)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91095769 59 jmlr-2012-Linear Regression With Random Projections

Author: Odalric-Ambrym Maillard, Rémi Munos

Abstract: We investigate a method for regression that makes use of a randomly generated subspace GP ⊂ F (of finite dimension P) of a given large (possibly infinite) dimensional function space F , for example, L2 ([0, 1]d ; R). GP is defined as the span of P random features that are linear combinations of basis functions of F weighted by random Gaussian i.i.d. coefficients. We show practical motivation for the use of this approach, detail the link that this random projections method shares with RKHS and Gaussian objects theory and prove, both in deterministic and random design, approximation error bounds when searching for the best regression function in GP rather than in F , and derive excess risk bounds for a specific regression algorithm (least squares regression in GP ). This paper stresses the motivation to study such methods, thus the analysis developed is kept simple for explanatory purposes and leaves room for future developments. Keywords: regression, random matrices, dimension reduction

2 0.70675999 118 jmlr-2012-Variational Multinomial Logit Gaussian Process

Author: Kian Ming A. Chai

Abstract: Gaussian process prior with an appropriate likelihood function is a flexible non-parametric model for a variety of learning tasks. One important and standard task is multi-class classification, which is the categorization of an item into one of several fixed classes. A usual likelihood function for this is the multinomial logistic likelihood function. However, exact inference with this model has proved to be difficult because high-dimensional integrations are required. In this paper, we propose a variational approximation to this model, and we describe the optimization of the variational parameters. Experiments have shown our approximation to be tight. In addition, we provide data-independent bounds on the marginal likelihood of the model, one of which is shown to be much tighter than the existing variational mean-field bound in the experiments. We also derive a proper lower bound on the predictive likelihood that involves the Kullback-Leibler divergence between the approximating and the true posterior. We combine our approach with a recently proposed sparse approximation to give a variational sparse approximation to the Gaussian process multi-class model. We also derive criteria which can be used to select the inducing set, and we show the effectiveness of these criteria over random selection in an experiment. Keywords: Gaussian process, probabilistic classification, multinomial logistic, variational approximation, sparse approximation

3 0.67820036 47 jmlr-2012-GPLP: A Local and Parallel Computation Toolbox for Gaussian Process Regression

Author: Chiwoo Park, Jianhua Z. Huang, Yu Ding

Abstract: This paper presents the Getting-started style documentation for the local and parallel computation toolbox for Gaussian process regression (GPLP), an open source software package written in Matlab (but also compatible with Octave). The working environment and the usage of the software package will be presented in this paper. Keywords: Gaussian process regression, domain decomposition method, partial independent conditional, bagging for Gaussian process, local probabilistic regression

4 0.46268535 78 jmlr-2012-Nonparametric Guidance of Autoencoder Representations using Label Information

Author: Jasper Snoek, Ryan P. Adams, Hugo Larochelle

Abstract: While unsupervised learning has long been useful for density modeling, exploratory data analysis and visualization, it has become increasingly important for discovering features that will later be used for discriminative tasks. Discriminative algorithms often work best with highly-informative features; remarkably, such features can often be learned without the labels. One particularly effective way to perform such unsupervised learning has been to use autoencoder neural networks, which find latent representations that are constrained but nevertheless informative for reconstruction. However, pure unsupervised learning with autoencoders can find representations that may or may not be useful for the ultimate discriminative task. It is a continuing challenge to guide the training of an autoencoder so that it finds features which will be useful for predicting labels. Similarly, we often have a priori information regarding what statistical variation will be irrelevant to the ultimate discriminative task, and we would like to be able to use this for guidance as well. Although a typical strategy would be to include a parametric discriminative model as part of the autoencoder training, here we propose a nonparametric approach that uses a Gaussian process to guide the representation. By using a nonparametric model, we can ensure that a useful discriminative function exists for a given set of features, without explicitly instantiating it. We demonstrate the superiority of this guidance mechanism on four data sets, including a real-world application to rehabilitation research. We also show how our proposed approach can learn to explicitly ignore statistically significant covariate information that is label-irrelevant, by evaluating on the small NORB image recognition problem in which pose and lighting labels are available. Keywords: autoencoder, gaussian process, gaussian process latent variable model, representation learning, unsupervised learning

5 0.41285086 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization

Author: Philipp Hennig, Christian J. Schuler

Abstract: Contemporary global optimization algorithms are based on local measures of utility, rather than a probability measure over location and value of the optimum. They thus attempt to collect low function values, not to learn about the optimum. The reason for the absence of probabilistic global optimizers is that the corresponding inference problem is intractable in several ways. This paper develops desiderata for probabilistic optimization algorithms, then presents a concrete algorithm which addresses each of the computational intractabilities with a sequence of approximations and explicitly addresses the decision problem of maximizing information gain from each evaluation. Keywords: optimization, probability, information, Gaussian processes, expectation propagation

6 0.38898751 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming

7 0.33773682 91 jmlr-2012-Plug-in Approach to Active Learning

8 0.32642633 46 jmlr-2012-Finite-Sample Analysis of Least-Squares Policy Iteration

9 0.32548106 99 jmlr-2012-Restricted Strong Convexity and Weighted Matrix Completion: Optimal Bounds with Noise

10 0.3157908 20 jmlr-2012-Analysis of a Random Forests Model

11 0.30896839 81 jmlr-2012-On the Convergence Rate oflp-Norm Multiple Kernel Learning

12 0.30181584 80 jmlr-2012-On Ranking and Generalization Bounds

13 0.29022351 111 jmlr-2012-Structured Sparsity and Generalization

14 0.28999478 109 jmlr-2012-Stability of Density-Based Clustering

15 0.28791597 51 jmlr-2012-Integrating a Partial Model into Model Free Reinforcement Learning

16 0.28321287 2 jmlr-2012-A Comparison of the Lasso and Marginal Regression

17 0.2706562 68 jmlr-2012-Minimax Manifold Estimation

18 0.25822818 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach

19 0.2579039 48 jmlr-2012-High-Dimensional Gaussian Graphical Model Selection: Walk Summability and Local Separation Criterion

20 0.2486926 4 jmlr-2012-A Kernel Two-Sample Test


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.011), (21, 0.032), (26, 0.031), (29, 0.025), (35, 0.011), (49, 0.013), (56, 0.014), (57, 0.011), (69, 0.013), (75, 0.041), (79, 0.014), (92, 0.625), (96, 0.075)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99018544 59 jmlr-2012-Linear Regression With Random Projections

Author: Odalric-Ambrym Maillard, Rémi Munos

Abstract: We investigate a method for regression that makes use of a randomly generated subspace GP ⊂ F (of finite dimension P) of a given large (possibly infinite) dimensional function space F , for example, L2 ([0, 1]d ; R). GP is defined as the span of P random features that are linear combinations of basis functions of F weighted by random Gaussian i.i.d. coefficients. We show practical motivation for the use of this approach, detail the link that this random projections method shares with RKHS and Gaussian objects theory and prove, both in deterministic and random design, approximation error bounds when searching for the best regression function in GP rather than in F , and derive excess risk bounds for a specific regression algorithm (least squares regression in GP ). This paper stresses the motivation to study such methods, thus the analysis developed is kept simple for explanatory purposes and leaves room for future developments. Keywords: regression, random matrices, dimension reduction

2 0.98553956 99 jmlr-2012-Restricted Strong Convexity and Weighted Matrix Completion: Optimal Bounds with Noise

Author: Sahand Negahban, Martin J. Wainwright

Abstract: We consider the matrix completion problem under a form of row/column weighted entrywise sampling, including the case of uniform entrywise sampling as a special case. We analyze the associated random observation operator, and prove that with high probability, it satisfies a form of restricted strong convexity with respect to weighted Frobenius norm. Using this property, we obtain as corollaries a number of error bounds on matrix completion in the weighted Frobenius norm under noisy sampling and for both exact and near low-rank matrices. Our results are based on measures of the “spikiness” and “low-rankness” of matrices that are less restrictive than the incoherence conditions imposed in previous work. Our technique involves an M-estimator that includes controls on both the rank and spikiness of the solution, and we establish non-asymptotic error bounds in weighted Frobenius norm for recovering matrices lying with ℓq -“balls” of bounded spikiness. Using information-theoretic methods, we show that no algorithm can achieve better estimates (up to a logarithmic factor) over these same sets, showing that our conditions on matrices and associated rates are essentially optimal. Keywords: matrix completion, collaborative filtering, convex optimization

3 0.97552711 51 jmlr-2012-Integrating a Partial Model into Model Free Reinforcement Learning

Author: Aviv Tamar, Dotan Di Castro, Ron Meir

Abstract: In reinforcement learning an agent uses online feedback from the environment in order to adaptively select an effective policy. Model free approaches address this task by directly mapping environmental states to actions, while model based methods attempt to construct a model of the environment, followed by a selection of optimal actions based on that model. Given the complementary advantages of both approaches, we suggest a novel procedure which augments a model free algorithm with a partial model. The resulting hybrid algorithm switches between a model based and a model free mode, depending on the current state and the agent’s knowledge. Our method relies on a novel definition for a partially known model, and an estimator that incorporates such knowledge in order to reduce uncertainty in stochastic approximation iterations. We prove that such an approach leads to improved policy evaluation whenever environmental knowledge is available, without compromising performance when such knowledge is absent. Numerical simulations demonstrate the effectiveness of the approach on policy gradient and Q-learning algorithms, and its usefulness in solving a call admission control problem. Keywords: reinforcement learning, temporal difference, stochastic approximation, markov decision processes, hybrid model based model free algorithms

4 0.97209972 25 jmlr-2012-Characterization and Greedy Learning of Interventional Markov Equivalence Classes of Directed Acyclic Graphs

Author: Alain Hauser, Peter Bühlmann

Abstract: The investigation of directed acyclic graphs (DAGs) encoding the same Markov property, that is the same conditional independence relations of multivariate observational distributions, has a long tradition; many algorithms exist for model selection and structure learning in Markov equivalence classes. In this paper, we extend the notion of Markov equivalence of DAGs to the case of interventional distributions arising from multiple intervention experiments. We show that under reasonable assumptions on the intervention experiments, interventional Markov equivalence defines a finer partitioning of DAGs than observational Markov equivalence and hence improves the identifiability of causal models. We give a graph theoretic criterion for two DAGs being Markov equivalent under interventions and show that each interventional Markov equivalence class can, analogously to the observational case, be uniquely represented by a chain graph called interventional essential graph (also known as CPDAG in the observational case). These are key insights for deriving a generalization of the Greedy Equivalence Search algorithm aimed at structure learning from interventional data. This new algorithm is evaluated in a simulation study. Keywords: causal inference, interventions, graphical model, Markov equivalence, greedy equivalence search

5 0.90402806 46 jmlr-2012-Finite-Sample Analysis of Least-Squares Policy Iteration

Author: Alessandro Lazaric, Mohammad Ghavamzadeh, Rémi Munos

Abstract: In this paper, we report a performance bound for the widely used least-squares policy iteration (LSPI) algorithm. We first consider the problem of policy evaluation in reinforcement learning, that is, learning the value function of a fixed policy, using the least-squares temporal-difference (LSTD) learning method, and report finite-sample analysis for this algorithm. To do so, we first derive a bound on the performance of the LSTD solution evaluated at the states generated by the Markov chain and used by the algorithm to learn an estimate of the value function. This result is general in the sense that no assumption is made on the existence of a stationary distribution for the Markov chain. We then derive generalization bounds in the case when the Markov chain possesses a stationary distribution and is β-mixing. Finally, we analyze how the error at each policy evaluation step is propagated through the iterations of a policy iteration method, and derive a performance bound for the LSPI algorithm. Keywords: Markov decision processes, reinforcement learning, least-squares temporal-difference, least-squares policy iteration, generalization bounds, finite-sample analysis

6 0.87792224 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming

7 0.85517973 34 jmlr-2012-Dynamic Policy Programming

8 0.81122696 68 jmlr-2012-Minimax Manifold Estimation

9 0.8062793 111 jmlr-2012-Structured Sparsity and Generalization

10 0.78551084 29 jmlr-2012-Consistent Model Selection Criteria on High Dimensions

11 0.77984369 82 jmlr-2012-On the Necessity of Irrelevant Variables

12 0.7552883 15 jmlr-2012-Algebraic Geometric Comparison of Probability Distributions

13 0.75278479 73 jmlr-2012-Multi-task Regression using Minimal Penalties

14 0.75000346 96 jmlr-2012-Refinement of Operator-valued Reproducing Kernels

15 0.74700785 7 jmlr-2012-A Multi-Stage Framework for Dantzig Selector and LASSO

16 0.7401855 13 jmlr-2012-Active Learning via Perfect Selective Classification

17 0.73615801 80 jmlr-2012-On Ranking and Generalization Bounds

18 0.73389542 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting

19 0.72863472 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class

20 0.72374302 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning