nips nips2010 nips2010-233 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Odalric Maillard, Rémi Munos
Abstract: We consider least-squares regression using a randomly generated subspace G_P ⊂ F of finite dimension P, where F is a function space of infinite dimension, e.g. L²([0, 1]^d). G_P is defined as the span of P random features that are linear combinations of the basis functions of F weighted by random Gaussian i.i.d. coefficients. In particular, we consider multi-resolution random combinations at all scales of a given mother function, such as a hat function or a wavelet. In the latter case, the resulting Gaussian objects are called scrambled wavelets and we show that they enable approximation of functions in Sobolev spaces H^s([0, 1]^d). As a result, given N data, the least-squares estimate g built from P scrambled wavelets has excess risk ||f∗ − g||²_P = O( ||f∗||²_{H^s([0,1]^d)} (log N)/P + P (log N)/N ) for target functions f∗ ∈ H^s([0, 1]^d) of smoothness order s > d/2. An interesting aspect of the resulting bounds is that they do not depend on the distribution P from which the data are generated, which is important in the statistical regression setting considered here. Randomization enables adaptation to any possible distribution. We conclude by describing an efficient numerical implementation using lazy expansions with numerical complexity Õ(2^d N^{3/2} log N + N²), where d is the dimension of the input space.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We consider least-squares regression using a randomly generated subspace G_P ⊂ F of finite dimension P, where F is a function space of infinite dimension, e.g. L²([0, 1]^d). [sent-4, score-0.264]
2 GP is defined as the span of P random features that are linear combinations of the basis functions of F weighted by random Gaussian i. [sent-7, score-0.334]
3 In particular, we consider multi-resolution random combinations at all scales of a given mother function, such as a hat function or a wavelet. [sent-11, score-0.502]
4 In this latter case, the resulting Gaussian objects are called scrambled wavelets and we show that they enable approximation of functions in Sobolev spaces H^s([0, 1]^d). [sent-12, score-1.065]
5 As a result, given N data, the least-squares estimate g built from P scrambled wavelets has excess risk ||f∗ − g||²_P = O( ||f∗||²_{H^s([0,1]^d)} (log N)/P + P (log N)/N ) for target functions f∗ ∈ H^s([0, 1]^d) of smoothness order s > d/2. [sent-13, score-1.294]
6 An interesting aspect of the resulting bounds is that they do not depend on the distribution P from which the data are generated, which is important in a statistical regression setting considered here. [sent-14, score-0.157]
7 We conclude by describing an efficient numerical implementation using lazy expansions with numerical complexity Õ(2^d N^{3/2} log N + N²), where d is the dimension of the input space. [sent-16, score-0.292]
8 1 Introduction We consider ordinary least-squares regression using randomly generated feature spaces. [sent-17, score-0.174]
9 Let us first describe the general regression problem: we observe data DN = ({xn , yn }1≤n≤N ) (with xn ∈ X a compact subset of Rd , and yn ∈ R), assumed to be independently and identically distributed (i. [sent-18, score-0.258]
10 ) with xn ∼ P and yn = f ∗ (xn ) + ηn , where f ∗ is the (unknown) target function, such that ||f ∗ ||∞ ≤ L, and ηn is a centered, independent noise of variance bounded by σ 2 . [sent-21, score-0.145]
11 Now, for a given class of functions F, and f ∈ F, we define the empirical ℓ²-error L_N(f) := (1/N) Σ_{n=1}^{N} [y_n − f(x_n)]², and the generalization error L(f) := E_{X,Y}[(Y − f(X))²]. [sent-23, score-0.468]
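As a rough illustration of these two definitions, here is a minimal numpy sketch; the function names, the Monte-Carlo estimation of L(f), and the data-generation interface are illustrative assumptions, not from the paper.

```python
import numpy as np

def empirical_error(f, x, y):
    # L_N(f) = (1/N) sum_n [y_n - f(x_n)]^2 on the observed sample
    return np.mean((y - f(x)) ** 2)

def generalization_error(f, f_star, sample_x, noise_std, n_mc=100_000, seed=0):
    # Monte-Carlo estimate of L(f) = E_{X,Y}[(Y - f(X))^2], with X ~ P and Y = f*(X) + noise
    rng = np.random.default_rng(seed)
    x = sample_x(n_mc, rng)                      # draws from the input distribution P
    y = f_star(x) + noise_std * rng.standard_normal(n_mc)
    return np.mean((y - f(x)) ** 2)
```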
12 The excess risk L(f) − L(f∗) = ||f∗ − f||²_P (where ||g||²_P = E_{X∼P}[g(X)²]) measures the closeness to optimality. [sent-25, score-0.293]
13 In this paper we consider infinite dimensional spaces F that are generated by a denumerable family of functions {ϕi }i≥1 , called initial features (such as wavelets). [sent-26, score-0.367]
14 In this paper we follow an alternative approach introduced in [10], called Compressed Least Squares Regression, which considers generating randomly a subspace GP (of finite dimension P ) of F, and then returning the empirical risk minimizer in GP , i. [sent-32, score-0.274]
15 Here we consider specific cases of infinite dimensional spaces F and provide a characterization of the resulting approximation spaces. [sent-36, score-0.195]
16 2 Regression with random spaces Let us briefly recall the method described in [10] and extend it to the case of infinite dimensional spaces F. [sent-37, score-0.315]
17 In this paper we assume that the features (ϕ_i)_{i≥1} are continuous and are such that sup_{x∈X} ||ϕ(x)||² < ∞, where ||ϕ(x)||² := Σ_{i≥1} ϕ_i(x)². (1) [sent-38, score-0.332]
18 Examples of feature spaces satisfying this property include rescaled wavelets; these will be described in Section 3. [sent-39, score-0.573]
19 The random subspace G_P is generated by building a set of P random features (ψ_p)_{1≤p≤P} defined as linear combinations of the initial features {ϕ_i}_{i≥1} weighted by random coefficients: ψ_p(x) := Σ_{i≥1} A_{p,i} ϕ_i(x), for 1 ≤ p ≤ P, (2) where the (infinitely many) coefficients A_{p,i} are drawn i. [sent-40, score-0.693]
20 Such a definition of the features ψ_p as an infinite sum of random variables is not obvious (this is called an expansion of a Gaussian object) and we refer to [11] for elements of theory about Gaussian objects and for the expansion of a Gaussian object. [sent-45, score-0.372]
21 The continuity of the initial features (ϕ_i) guarantees that there exists a continuous version of the process ψ_p, which is thus a Gaussian process. [sent-49, score-0.149]
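In a numerical sketch the expansion (2) has to be truncated to finitely many initial features; the snippet below builds such truncated random features (the N(0, 1/P) coefficient variance and the truncation level F are assumptions made for the illustration, not taken from the text above).

```python
import numpy as np

def make_random_features(phi, F, P, seed=0):
    """Build x -> (psi_1(x), ..., psi_P(x)) as Gaussian combinations of F initial features.

    phi : callable returning the vector (phi_1(x), ..., phi_F(x)) for an input x.
    """
    rng = np.random.default_rng(seed)
    A = rng.normal(0.0, 1.0 / np.sqrt(P), size=(P, F))   # assumed variance 1/P per entry
    def psi(x):
        return A @ phi(x)            # psi_p(x) = sum_{i <= F} A_{p,i} phi_i(x)
    return psi
```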
22 β = Ψ†Y ∈ R^P, where Ψ is the N × P matrix composed of the elements Ψ_{n,p} = ψ_p(x_n), and Ψ† is the Moore–Penrose pseudo-inverse of Ψ. [sent-57, score-0.192]
23 ĝ(x) := T_L[g_β(x)], where T_L(u) = u if |u| ≤ L, and T_L(u) = L sign(u) otherwise. [sent-60, score-0.384]
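A minimal sketch of the resulting estimator, combining the pseudo-inverse solution β = Ψ†Y with the truncation operator T_L; the data layout and the `psi` callable (e.g. produced by the previous sketch) are assumptions.

```python
import numpy as np

def fit_truncated_least_squares(psi, X, Y, L):
    # Psi is the N x P design matrix with entries Psi[n, p] = psi_p(x_n)
    Psi = np.stack([psi(x) for x in X])
    beta = np.linalg.pinv(Psi) @ Y                      # beta = Psi^+ Y (Moore-Penrose pseudo-inverse)
    def g_hat(x):
        return float(np.clip(psi(x) @ beta, -L, L))     # g(x) = T_L[g_beta(x)]
    return g_hat
```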
24 Next, we provide bounds on the approximation error of f ∗ in GP and deduce excess risk bounds. [sent-61, score-0.493]
25 1 Approximation error We now extend the result of [10] and derive approximation error bounds both in expectation and in high probability. [sent-63, score-0.222]
26 We restrict the set of target functions to belong to the approximation space K ⊂ F (also identified with the kernel space associated to the expansion of a Gaussian object): K := {f_α ∈ F : ||α||² = Σ_i α_i² < ∞}. [sent-64, score-0.813]
27 This space may be seen from two equivalent points of view: either as a set of functions that are random linear combinations of the initial features, or a set of functions that are the expectation of some random processes (interpretation in terms of kernel space). [sent-66, score-0.471]
28 We will not develop the related theory of Gaussian processes here, but we refer the reader interested in the construction of kernel spaces to [11]. Let f_α = Σ_i α_i ϕ_i ∈ K. [sent-67, score-0.211]
29 ¯ The following result provides bounds for the approximation error ||fα − g ∗ ||P both in expectation ¯ and in high probability. [sent-75, score-0.191]
30 2 Excess risk bounds We now combine the approximation error bound from Theorem 1 with usual estimation error bounds for linear spaces (see e. [sent-98, score-0.53]
31 Remember that our prediction function ĝ is the truncation ĝ = T_L[g_β] of the (ordinary) least-squares estimate g_β (empirical risk minimizer in the random space G_P) defined by (3). [sent-102, score-0.469]
32 We now provide upper bounds (both in expectation and in high probability) on the excess risk for the least-squares estimate using random subspaces (the proof is given in [11]). [sent-103, score-0.474]
33 input data, noise, and the choice of the random features): E_{G_P,X,Y} ||f∗ − ĝ||²_P ≤ c4 [ (σ² + L²) P log N / N + ||α∗||² (log N / P) sup_x ||ϕ(x)||² ]. (5) Now, for any η > 0, whenever P ≥ c5 log(N/η), we have the following bound in high probability (w. [sent-110, score-0.225]
34 the choice of the random features), where c3, c4, c5, c6 are universal constants (see [11]): [sent-113, score-0.169]
35 E_{X,Y} ||f∗ − ĝ||²_P ≤ c6 [ (σ² + L²) P log N / N + ||α∗||² (log(N/η) / P) sup_x ||ϕ(x)||² ]. (6) The results of Theorems 1 and 2 say that if the term ||α∗||² sup_x ||ϕ(x)||² is small, then the least-squares estimate in the random subspace G_P has low excess risk. [sent-114, score-0.324]
36 In the next section we provide two examples of feature spaces and characterize the space of functions for which this term is controlled. [sent-116, score-0.239]
37 3 Regression with Scrambled Objects In the two examples provided below we consider (infinitely many) initial features that are translations and rescaling of a given mother function (which is assumed to be continuous) at all scales. [sent-117, score-0.319]
38 Thus each random feature ψp is a Gaussian object based on a multi-scale scheme built from an object (the mother function), and will be called a “scrambled object”, to refer to the disorderly construction of this multi-resolution random process. [sent-118, score-0.399]
39 We thus propose to solve the regression problem by ordinary Least Squares on the (random) approximation space defined by the span of P such scrambled objects. [sent-119, score-0.665]
40 The first one considers the case when the mother function is a hat function and we show that the corresponding scrambled objects are Brownian motions. [sent-121, score-0.911]
41 Let us choose as object (mother function) the hat function Λ(x) = xI[0,1/2[ + (1 − x)I[1/2,1[ . [sent-126, score-0.275]
42 We define the (infinite) set of initial features as translated and rescaled hat functions: Λ_{j,l}(x) = 2^{−j/2} Λ(2^j x − l) for any scale j ≥ 1 and translation index 0 ≤ l ≤ 2^j − 1. [sent-127, score-0.441]
43 Those functions are indexed by the scale j and translation index l, but all functions may be equivalently indexed by a unique index i ≥ 1. [sent-130, score-0.242]
44 We have the property that the random features ψp (x), defined as linear combinations of those hat functions weighted by Gaussian i. [sent-131, score-0.477]
45 In addition, we can characterize the corresponding kernel space K, which is the Sobolev space H 1 ([0, 1]) of order 1 (space of functions which have a weak derivative in L2 ([0, 1])). [sent-135, score-0.228]
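A minimal one-dimensional sketch of this construction, truncated at a maximal scale J (an implementation assumption): the features are the translated/rescaled hat functions Λ_{j,l}, and a single Gaussian combination of them already resembles a Brownian-motion path (up to coarse-scale terms that the sketch ignores).

```python
import numpy as np

def hat(u):
    # mother hat function: u on [0, 1/2), 1 - u on [1/2, 1), 0 elsewhere
    return np.where((u >= 0) & (u < 0.5), u,
                    np.where((u >= 0.5) & (u < 1.0), 1.0 - u, 0.0))

def hat_features(x, J):
    # all Lambda_{j,l}(x) = 2^{-j/2} hat(2^j x - l), 1 <= j <= J, 0 <= l < 2^j, as one vector
    blocks = []
    for j in range(1, J + 1):
        l = np.arange(2 ** j)
        blocks.append(2.0 ** (-j / 2) * hat(2.0 ** j * x - l))
    return np.concatenate(blocks)

rng = np.random.default_rng(0)
J = 10
n_feats = sum(2 ** j for j in range(1, J + 1))
xi = rng.standard_normal(n_feats)                       # i.i.d. Gaussian coefficients
xs = np.linspace(0.0, 1.0, 512)
path = np.array([xi @ hat_features(x, J) for x in xs])  # one "scrambled hat" sample path
```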
46 Dimension d: For the extension to dimension d, we define the initial features as the tensor product ϕj,l of one-dimensional hat functions (thus j and l are multi-indices). [sent-136, score-0.484]
47 The random features ψ_p(x) are Brownian sheets (extensions of Brownian motions to several dimensions) and the corresponding kernel K is the so-called Cameron–Martin space [9], endowed with the norm ||f||_K = || ∂^d f / (∂x_1 · · · ∂x_d) ||. [sent-137, score-0.622]
48 Note that in dimension d > 1, this space differs from the Sobolev space H 1 . [sent-145, score-0.156]
49 Regression with Brownian Sheets: When one uses Brownian sheets for regression with a target function f∗ = Σ_i α_i∗ ϕ_i that lies in the Cameron–Martin space K defined previously (i. [sent-146, score-0.43]
50 Thus, from Theorem 2, ordinary least-squares performed on random subspaces spanned by P Brownian sheets has an expected excess risk E_{G_P,X,Y} ||f∗ − ĝ||²_P = O( P log N / N + (log N / P) ||f∗||²_K ) (and a similar bound holds in high probability). [sent-149, score-0.803]
51 2 Scrambled Wavelets in [0, 1]^d We now introduce a second example built from a family of orthogonal wavelets (ϕ̃_{ε,j,l}) ∈ C^q([0, 1]^d) (where ε ∈ {0, 1}^d is a multi-index, j is a scale index, l a multi-index, see [2, 12] for details of the notations) with at least q > d/2 vanishing moments. [sent-151, score-0.512]
52 Now for s ∈ (d/2, q), we define the initial features (ϕ_{ε,j,l}) as rescaled versions of the wavelets ϕ̃_{ε,j,l}, where the rescaling factor at scale j is of order 2^{−js}, i. [sent-152, score-0.586]
53 Again, the initial features may equivalently be indexed by a unique index i ≥ 1. [sent-155, score-0.217]
54 Regression with Scrambled Wavelets: Assume that the mother wavelet ϕ̃ has compact support [0, 1]^d and is bounded by λ, and assume that the target function f∗ = Σ_i α_i∗ ϕ_i lies in the Sobolev space H^s([0, 1]^d) with s > d/2 (i. [sent-160, score-0.388]
55 Thus, from Theorem 2, ordinary least-squares performed on random subspaces spanned by P scrambled wavelets has an expected excess risk E_{G_P,X,Y} ||f∗ − ĝ||²_P = O( P log N / N + (log N / P) ||f∗||²_{H^s([0,1]^d)} ), (8) (and a similar bound holds in high probability). [sent-164, score-1.38]
56 In both examples, by choosing P of order √N ||f∗||_K, one deduces the excess risk E||f∗ − ĝ||²_P = O( ||f∗||_K log N / √N ). [sent-165, score-0.332]
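For completeness, the balancing argument behind this choice of P, written out (constants and the sup_x ||ϕ(x)||² factor are omitted; this is just the standard trade-off between the two terms of the bound):

```latex
\[
\frac{P\log N}{N} \;+\; \frac{\log N}{P}\,\|f^*\|_K^2
\;\;\ge\;\; 2\,\frac{\|f^*\|_K \log N}{\sqrt{N}},
\qquad \text{with equality at } P = \sqrt{N}\,\|f^*\|_K,
\]
\[
\text{hence}\qquad
\mathbb{E}\,\|f^* - \hat g\|_P^2 \;=\; O\!\Big(\frac{\|f^*\|_K \log N}{\sqrt{N}}\Big).
\]
```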
57 3 Remark about randomized spaces Note that the bounds on the excess risk obtained in (7), (8), and (9) do not depend on the distribution P under which the data are generated. [sent-167, score-0.496]
58 For example, when P = λ, the Lebesgue measure, and f∗ ∈ H^s([0, 1]^d) (with s > d/2), linear regression using wavelets (with at least d/2 vanishing moments), which form an orthonormal basis of L_{2,λ}([0, 1]^d), achieves a bound similar to (8). [sent-171, score-0.665]
59 Randomization enables us to define approximation spaces such that the approximation error (either in expectation or in high probability over the choice of the random space) is controlled, whatever the measure P used to assess the performance (even when P is unknown). [sent-174, score-0.429]
60 [6]), will most probably miss the specific characteristics of f ∗ at the spot, since the first wavelets have large support. [sent-178, score-0.4]
61 On the contrary, scrambled wavelets, which are functions that contain (random combinations of) all wavelets, will be able to detect correlations between the data and some high frequency wavelets, and thus discover relevant features of f ∗ at the spot. [sent-179, score-0.619]
62 We consider as initial features (ϕi )i≥1 the set of hat functions defined in Section 3. [sent-182, score-0.428]
63 The middle plots represent the least-squares estimate ĝ using P = 40 scrambled objects (ψ_p)_{1≤p≤40} (here Brownian motions). [sent-186, score-0.515]
64 The right plots show the least-squares estimate using the initial features (ϕ_i)_{1≤i≤40}. [sent-187, score-0.188]
65 No method is able to learn f ∗ on the whole space (this is normal since the available data are only generated from a peaky distribution). [sent-189, score-0.161]
66 Least-squares regression using scrambled objects is able to learn the structure of f ∗ in terms of the measure P. [sent-193, score-0.566]
67 Figure 1: LS estimate of f∗ using N = 100 data generated from a peaky distribution P (left plots), using 40 Brownian motions (ψ_p) (middle plots) and 40 hat functions (ϕ_i) (right plots). [sent-277, score-0.459]
68 Indeed, the kernel space K is composed of functions whose order of smoothness may depend on d. [sent-284, score-0.214]
69 For illustration, in the case of scrambled wavelets, the kernel space is the Sobolev space H s ([0, 1]d ) with s > d/2. [sent-285, score-0.586]
70 Notice that if one considers wavelets with q vanishing moments, where q > d/2, then one may choose s (such that q > s > d/2) arbitrarily close to d/2, and deduce that the excess risk rate Õ(N^{−1/2}) deduced from Theorem 2 is arbitrarily close to the minimax lower rate. [sent-287, score-1.045]
71 Thus regression using scrambled wavelets is minimax optimal (up to logarithmic factors). [sent-288, score-0.985]
72 Now, concerning Brownian sheets, we are not aware of minimax lower bounds for Cameron-Martin spaces, thus we do not know whether regression using Brownian sheets is minimax optimal or not. [sent-289, score-0.559]
73 Links with RKHS Theory: There are strong links between the kernel space of Gaussian objects (see eq. [sent-290, score-0.22]
74 We now remind two properties that illustrate those links: • Kernel spaces of Gaussian objects can be built using a Carleman operator, i. [sent-292, score-0.246]
75 This space is thus also the kernel space of the Gaussian object as defined by (4). [sent-301, score-0.224]
76 6 The expansion of a Mercer kernel gives an explicit construction of the functions of the RKHS. [sent-302, score-0.214]
77 The approach described in this paper enables to choose explicitly the initial basis functions, and build the corresponding kernel space. [sent-304, score-0.211]
78 For example we have presented examples of expansions using multiresolution bases (such as hat functions and wavelets), which is not easy to obtain from the Mercer expansion. [sent-305, score-0.279]
79 This is interesting because from the choice of the initial basis, we can characterize the corresponding approximation spaces (e. [sent-306, score-0.193]
80 Another more practical benefit is that by using multi-resolution bases (with compact mother function), we can derive efficient numerical implementations, as described in Section 5. [sent-309, score-0.267]
81 In the work [1], the authors provide excess risk bounds for greedy algorithms (i. [sent-320, score-0.36]
82 5 Efficient implementation using a lazy multi-resolution expansion In practice, in order to build the least-squares estimate, one needs to compute the values of the random features (ψp )1≤p≤P at the data points (xn )1≤n≤N , i. [sent-327, score-0.304]
83 Due to finite memory and precision of computers, numerical implementations can only handle a finite number F of initial features (ϕi )1≤i≤F . [sent-330, score-0.206]
84 However, in the multi-resolution schemes described here, provided that the mother function has compact support (such as the hat functions or the Daubechies wavelets), we can significantly speed up the computation of the matrix Ψ by using a tree-based lazy expansion, i. [sent-332, score-0.572]
85 where the expansion of the random features (ψp )p≤P is built only when needed for the evaluation at the points (xn )n . [sent-334, score-0.266]
86 Now, in dimension d the classical extension of one-dimensional wavelets uses a family of 2^d − 1 wavelets, thus requires 2^d − 1 trees, each one having 2^{dH} nodes. [sent-339, score-0.456]
87 While the resulting number of initial features F is of order 2^{d(H+1)}, thanks to the lazy evaluation (notice that one never computes all the initial features), one needs to expand at most one path of length H per training point, and the resulting complexity to compute Ψ is O(2^d H P N). [sent-340, score-0.289]
88 The main result is that one can reduce significantly the total number of features to F = O(2^H H^d) (while preserving a good approximation for sufficiently smooth functions). [sent-342, score-0.151]
89 Now, using a finite F introduces an additional approximation (squared) error term of order O(F^{−2s/d}) in the final excess risk bounds, for a wavelet basis adapted to H^s([0, 1]^d). [sent-344, score-0.562]
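A one-dimensional sketch of such a lazy evaluation for hat-function features; the dictionary-based caching, the per-node random-coefficient handling and the class interface are implementation assumptions (the paper's implementation uses 2^d − 1 trees in dimension d).

```python
import numpy as np

class LazyScrambledHats:
    """Evaluate P scrambled-hat features psi_p(x) on [0, 1), drawing the Gaussian
    coefficients lazily: the coefficients of a node (j, l) are created only when a
    query point falls in its support, so at most H nodes are expanded per point."""

    def __init__(self, P, H, seed=0):
        self.P, self.H = P, H
        self.rng = np.random.default_rng(seed)
        self.coeffs = {}                       # (j, l) -> vector of P Gaussian coefficients

    def _coef(self, j, l):
        if (j, l) not in self.coeffs:          # lazy expansion of the tree node
            self.coeffs[(j, l)] = self.rng.standard_normal(self.P)
        return self.coeffs[(j, l)]

    def psi(self, x):
        # psi_p(x) = sum over the H active nodes of A_{p,(j,l)} * 2^{-j/2} * hat(2^j x - l)
        out = np.zeros(self.P)
        for j in range(1, self.H + 1):
            l = int(np.floor(2 ** j * x))      # the unique hat at scale j whose support contains x
            u = 2 ** j * x - l
            val = 2.0 ** (-j / 2) * (u if u < 0.5 else 1.0 - u)
            out += val * self._coef(j, l)
        return out
```

With this scheme a training set of N points expands at most O(HN) tree nodes in total, so filling Ψ costs O(HPN) in this one-dimensional sketch, in line with the O(2^d HPN) count quoted above.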
90 6 Conclusion and future works We analyzed least-squares regression using sub-spaces GP that are generated by P random linear combinations of infinitely many initial features. [sent-352, score-0.282]
91 We showed that the approximation space K = {fα , ||α|| < ∞} (which is also the kernel space of the related Gaussian object) provides a characterization of the set of target functions f ∗ for which this random regression works. [sent-353, score-0.476]
92 We derived a general approximation error result from which we deduced excess risk bounds of order O( P log N / N + (log N / P) ||f∗||²_K ). [sent-355, score-0.585]
93 We showed that least-squares regression with scrambled wavelets provides rates that are arbitrarily close to minimax optimality. [sent-356, score-1.016]
94 However in the case of regression with Brownian sheets, we are not aware of minimax lower bounds for Cameron-Martin spaces in dimension d > 1. [sent-357, score-0.433]
95 We discussed a key aspect of randomized approximation spaces which is that the approximation error can be controlled independently of the measure P used to assess the performance. [sent-358, score-0.285]
96 This is essential in a regression setting where P is unknown, and excess risk rates independent of P are obtained. [sent-359, score-0.383]
97 We concluded by mentioning a nice property of using multiscale objects like Brownian sheets and scrambled wavelets (with a compact mother wavelet), namely that they can be implemented efficiently. [sent-360, score-1.32]
98 We described a lazy expansion approach for computing the regression function which has a numerical complexity O(N² + 2^d N^{3/2} log N). [sent-361, score-0.355]
99 A limitation of the current scrambled wavelets is that, so far, we did not consider a refined analysis for spaces H^s with large smoothness s ≫ d/2. [sent-362, score-0.983]
100 Possible directions for better handling such spaces may involve refined covering number bounds which will be the object of future works. [sent-363, score-0.252]
wordName wordTfidf (topN-words)
[('scrambled', 0.411), ('wavelets', 0.4), ('gp', 0.287), ('sheets', 0.234), ('hat', 0.226), ('brownian', 0.222), ('def', 0.192), ('excess', 0.183), ('mother', 0.17), ('sobolev', 0.141), ('spaces', 0.136), ('risk', 0.11), ('features', 0.092), ('regression', 0.09), ('expansion', 0.086), ('minimax', 0.084), ('lazy', 0.083), ('peaky', 0.082), ('tl', 0.077), ('kernel', 0.075), ('wavelet', 0.072), ('motions', 0.069), ('bounds', 0.067), ('vanishing', 0.067), ('objects', 0.065), ('combinations', 0.063), ('egp', 0.062), ('mercer', 0.061), ('ga', 0.059), ('supx', 0.059), ('approximation', 0.059), ('numerical', 0.057), ('initial', 0.057), ('maillard', 0.057), ('deduced', 0.057), ('target', 0.056), ('dimension', 0.056), ('ordinary', 0.055), ('functions', 0.053), ('rkhs', 0.051), ('xn', 0.05), ('space', 0.05), ('object', 0.049), ('sup', 0.048), ('carleman', 0.047), ('eap', 0.047), ('gaussian', 0.046), ('built', 0.045), ('truncation', 0.044), ('random', 0.043), ('lk', 0.043), ('deduce', 0.043), ('numerica', 0.041), ('basis', 0.04), ('compact', 0.04), ('yn', 0.039), ('subspace', 0.039), ('log', 0.039), ('enables', 0.039), ('considers', 0.039), ('plots', 0.039), ('indexed', 0.039), ('ei', 0.038), ('phane', 0.038), ('rescaled', 0.037), ('subspaces', 0.037), ('smoothness', 0.036), ('zoom', 0.035), ('acta', 0.035), ('rahimi', 0.035), ('expectation', 0.034), ('spanned', 0.034), ('nitely', 0.034), ('spot', 0.033), ('ln', 0.033), ('coef', 0.032), ('sam', 0.032), ('eigenfunctions', 0.032), ('arbitrarily', 0.031), ('norm', 0.031), ('error', 0.031), ('nite', 0.031), ('remark', 0.031), ('hp', 0.031), ('yoram', 0.031), ('links', 0.03), ('minimizer', 0.03), ('randomization', 0.03), ('bound', 0.029), ('index', 0.029), ('lebesgue', 0.029), ('generated', 0.029), ('mi', 0.028), ('ali', 0.028), ('whatever', 0.028), ('endowed', 0.028), ('jl', 0.028), ('ming', 0.028), ('hilbert', 0.027), ('whenever', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 233 nips-2010-Scrambled Objects for Least-Squares Regression
Author: Odalric Maillard, Rémi Munos
Abstract: We consider least-squares regression using a randomly generated subspace GP ⊂ F of finite dimension P , where F is a function space of infinite dimension, e.g. L2 ([0, 1]d ). GP is defined as the span of P random features that are linear combinations of the basis functions of F weighted by random Gaussian i.i.d. coefficients. In particular, we consider multi-resolution random combinations at all scales of a given mother function, such as a hat function or a wavelet. In this latter case, the resulting Gaussian objects are called scrambled wavelets and we show that they enable to approximate functions in Sobolev spaces H s ([0, 1]d ). As a result, given N data, the least-squares estimate g built from P scrambled wavelets has excess risk ||f ∗ − g||2 = O(||f ∗ ||2 s ([0,1]d ) (log N )/P + P (log N )/N ) for P H target functions f ∗ ∈ H s ([0, 1]d ) of smoothness order s > d/2. An interesting aspect of the resulting bounds is that they do not depend on the distribution P from which the data are generated, which is important in a statistical regression setting considered here. Randomization enables to adapt to any possible distribution. We conclude by describing an efficient numerical implementation using lazy ex˜ pansions with numerical complexity O(2d N 3/2 log N + N 2 ), where d is the dimension of the input space. 1
2 0.163431 134 nips-2010-LSTD with Random Projections
Author: Mohammad Ghavamzadeh, Alessandro Lazaric, Odalric Maillard, Rémi Munos
Abstract: We consider the problem of reinforcement learning in high-dimensional spaces when the number of features is bigger than the number of samples. In particular, we study the least-squares temporal difference (LSTD) learning algorithm when a space of low dimension is generated with a random projection from a highdimensional space. We provide a thorough theoretical analysis of the LSTD with random projections and derive performance bounds for the resulting algorithm. We also show how the error of LSTD with random projections is propagated through the iterations of a policy iteration algorithm and provide a performance bound for the resulting least-squares policy iteration (LSPI) algorithm. 1
3 0.13946821 243 nips-2010-Smoothness, Low Noise and Fast Rates
Author: Nathan Srebro, Karthik Sridharan, Ambuj Tewari
Abstract: √ ˜ We establish an excess risk bound of O HR2 + HL∗ Rn for ERM with an H-smooth loss n function and a hypothesis class with Rademacher complexity Rn , where L∗ is the best risk achievable by the hypothesis class. For typical hypothesis classes where Rn = R/n, this translates to ˜ ˜ a learning rate of O (RH/n) in the separable (L∗ = 0) case and O RH/n + L∗ RH/n more generally. We also provide similar guarantees for online and stochastic convex optimization of a smooth non-negative objective. 1
4 0.095176026 279 nips-2010-Universal Kernels on Non-Standard Input Spaces
Author: Andreas Christmann, Ingo Steinwart
Abstract: During the last years support vector machines (SVMs) have been successfully applied in situations where the input space X is not necessarily a subset of Rd . Examples include SVMs for the analysis of histograms or colored images, SVMs for text classiÄ?Ĺš cation and web mining, and SVMs for applications from computational biology using, e.g., kernels for trees and graphs. Moreover, SVMs are known to be consistent to the Bayes risk, if either the input space is a complete separable metric space and the reproducing kernel Hilbert space (RKHS) H ⊂ Lp (PX ) is dense, or if the SVM uses a universal kernel k. So far, however, there are no kernels of practical interest known that satisfy these assumptions, if X ⊂ Rd . We close this gap by providing a general technique based on Taylor-type kernels to explicitly construct universal kernels on compact metric spaces which are not subset of Rd . We apply this technique for the following special cases: universal kernels on the set of probability measures, universal kernels based on Fourier transforms, and universal kernels for signal processing. 1
5 0.083268762 33 nips-2010-Approximate inference in continuous time Gaussian-Jump processes
Author: Manfred Opper, Andreas Ruttor, Guido Sanguinetti
Abstract: We present a novel approach to inference in conditionally Gaussian continuous time stochastic processes, where the latent process is a Markovian jump process. We first consider the case of jump-diffusion processes, where the drift of a linear stochastic differential equation can jump at arbitrary time points. We derive partial differential equations for exact inference and present a very efficient mean field approximation. By introducing a novel lower bound on the free energy, we then generalise our approach to Gaussian processes with arbitrary covariance, such as the non-Markovian RBF covariance. We present results on both simulated and real data, showing that the approach is very accurate in capturing latent dynamics and can be useful in a number of real data modelling tasks.
6 0.081317075 250 nips-2010-Spectral Regularization for Support Estimation
7 0.078358233 232 nips-2010-Sample Complexity of Testing the Manifold Hypothesis
8 0.077238992 85 nips-2010-Exact learning curves for Gaussian process regression on large random graphs
9 0.074681625 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
10 0.072628215 148 nips-2010-Learning Networks of Stochastic Differential Equations
11 0.06899368 199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression
12 0.062918417 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
13 0.062297687 133 nips-2010-Kernel Descriptors for Visual Recognition
14 0.062219899 100 nips-2010-Gaussian Process Preference Elicitation
15 0.061262935 145 nips-2010-Learning Kernels with Radiuses of Minimum Enclosing Balls
16 0.061161067 89 nips-2010-Factorized Latent Spaces with Structured Sparsity
17 0.060620811 113 nips-2010-Heavy-Tailed Process Priors for Selective Shrinkage
18 0.058333673 218 nips-2010-Probabilistic latent variable models for distinguishing between cause and effect
19 0.058190323 108 nips-2010-Graph-Valued Regression
20 0.057328418 117 nips-2010-Identifying graph-structured activation patterns in networks
topicId topicWeight
[(0, 0.185), (1, 0.021), (2, 0.073), (3, 0.023), (4, 0.081), (5, 0.022), (6, 0.055), (7, 0.008), (8, -0.044), (9, -0.027), (10, -0.021), (11, -0.051), (12, -0.12), (13, -0.042), (14, -0.056), (15, 0.048), (16, 0.128), (17, 0.032), (18, 0.058), (19, 0.023), (20, -0.037), (21, 0.023), (22, -0.049), (23, -0.067), (24, -0.009), (25, -0.02), (26, -0.042), (27, 0.082), (28, -0.013), (29, 0.041), (30, 0.067), (31, -0.036), (32, -0.004), (33, 0.015), (34, 0.023), (35, 0.029), (36, 0.067), (37, 0.053), (38, 0.019), (39, -0.091), (40, 0.021), (41, 0.085), (42, 0.035), (43, -0.158), (44, 0.024), (45, 0.012), (46, 0.103), (47, -0.014), (48, -0.112), (49, 0.055)]
simIndex simValue paperId paperTitle
same-paper 1 0.94118202 233 nips-2010-Scrambled Objects for Least-Squares Regression
Author: Odalric Maillard, Rémi Munos
Abstract: We consider least-squares regression using a randomly generated subspace GP ⊂ F of finite dimension P , where F is a function space of infinite dimension, e.g. L2 ([0, 1]d ). GP is defined as the span of P random features that are linear combinations of the basis functions of F weighted by random Gaussian i.i.d. coefficients. In particular, we consider multi-resolution random combinations at all scales of a given mother function, such as a hat function or a wavelet. In this latter case, the resulting Gaussian objects are called scrambled wavelets and we show that they enable to approximate functions in Sobolev spaces H s ([0, 1]d ). As a result, given N data, the least-squares estimate g built from P scrambled wavelets has excess risk ||f ∗ − g||2 = O(||f ∗ ||2 s ([0,1]d ) (log N )/P + P (log N )/N ) for P H target functions f ∗ ∈ H s ([0, 1]d ) of smoothness order s > d/2. An interesting aspect of the resulting bounds is that they do not depend on the distribution P from which the data are generated, which is important in a statistical regression setting considered here. Randomization enables to adapt to any possible distribution. We conclude by describing an efficient numerical implementation using lazy ex˜ pansions with numerical complexity O(2d N 3/2 log N + N 2 ), where d is the dimension of the input space. 1
2 0.60342699 134 nips-2010-LSTD with Random Projections
Author: Mohammad Ghavamzadeh, Alessandro Lazaric, Odalric Maillard, Rémi Munos
Abstract: We consider the problem of reinforcement learning in high-dimensional spaces when the number of features is bigger than the number of samples. In particular, we study the least-squares temporal difference (LSTD) learning algorithm when a space of low dimension is generated with a random projection from a highdimensional space. We provide a thorough theoretical analysis of the LSTD with random projections and derive performance bounds for the resulting algorithm. We also show how the error of LSTD with random projections is propagated through the iterations of a policy iteration algorithm and provide a performance bound for the resulting least-squares policy iteration (LSPI) algorithm. 1
3 0.57434958 154 nips-2010-Learning sparse dynamic linear systems using stable spline kernels and exponential hyperpriors
Author: Alessandro Chiuso, Gianluigi Pillonetto
Abstract: We introduce a new Bayesian nonparametric approach to identification of sparse dynamic linear systems. The impulse responses are modeled as Gaussian processes whose autocovariances encode the BIBO stability constraint, as defined by the recently introduced “Stable Spline kernel”. Sparse solutions are obtained by placing exponential hyperpriors on the scale factors of such kernels. Numerical experiments regarding estimation of ARMAX models show that this technique provides a definite advantage over a group LAR algorithm and state-of-the-art parametric identification techniques based on prediction error minimization. 1
4 0.56343871 113 nips-2010-Heavy-Tailed Process Priors for Selective Shrinkage
Author: Fabian L. Wauthier, Michael I. Jordan
Abstract: Heavy-tailed distributions are often used to enhance the robustness of regression and classification methods to outliers in output space. Often, however, we are confronted with “outliers” in input space, which are isolated observations in sparsely populated regions. We show that heavy-tailed stochastic processes (which we construct from Gaussian processes via a copula), can be used to improve robustness of regression and classification estimators to such outliers by selectively shrinking them more strongly in sparse regions than in dense regions. We carry out a theoretical analysis to show that selective shrinkage occurs when the marginals of the heavy-tailed process have sufficiently heavy tails. The analysis is complemented by experiments on biological data which indicate significant improvements of estimates in sparse regions while producing competitive results in dense regions. 1
5 0.55283183 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
Author: Sivan Sabato, Nathan Srebro, Naftali Tishby
Abstract: We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L2 regularization: We introduce the γ-adapted-dimension, which is a simple function of the spectrum of a distribution’s covariance matrix, and show distribution-specific upper and lower bounds on the sample complexity, both governed by the γ-adapted-dimension of the source distribution. We conclude that this new quantity tightly characterizes the true sample complexity of large-margin classification. The bounds hold for a rich family of sub-Gaussian distributions. 1
6 0.54797935 232 nips-2010-Sample Complexity of Testing the Manifold Hypothesis
7 0.53822792 243 nips-2010-Smoothness, Low Noise and Fast Rates
8 0.5344764 279 nips-2010-Universal Kernels on Non-Standard Input Spaces
9 0.51458561 191 nips-2010-On the Theory of Learnining with Privileged Information
10 0.51155949 148 nips-2010-Learning Networks of Stochastic Differential Equations
11 0.51149464 199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression
12 0.50879502 250 nips-2010-Spectral Regularization for Support Estimation
13 0.49512678 280 nips-2010-Unsupervised Kernel Dimension Reduction
14 0.48837614 100 nips-2010-Gaussian Process Preference Elicitation
15 0.48783934 75 nips-2010-Empirical Risk Minimization with Approximations of Probabilistic Grammars
16 0.48558015 85 nips-2010-Exact learning curves for Gaussian process regression on large random graphs
17 0.45661658 212 nips-2010-Predictive State Temporal Difference Learning
18 0.44220302 82 nips-2010-Evaluation of Rarity of Fingerprints in Forensics
19 0.43864909 220 nips-2010-Random Projection Trees Revisited
20 0.4358604 178 nips-2010-Multivariate Dyadic Regression Trees for Sparse Learning Problems
topicId topicWeight
[(13, 0.036), (17, 0.016), (27, 0.047), (30, 0.092), (35, 0.023), (37, 0.231), (45, 0.198), (50, 0.056), (51, 0.012), (52, 0.063), (60, 0.041), (77, 0.035), (78, 0.01), (79, 0.015), (90, 0.056)]
simIndex simValue paperId paperTitle
same-paper 1 0.8195911 233 nips-2010-Scrambled Objects for Least-Squares Regression
Author: Odalric Maillard, Rémi Munos
Abstract: We consider least-squares regression using a randomly generated subspace GP ⊂ F of finite dimension P , where F is a function space of infinite dimension, e.g. L2 ([0, 1]d ). GP is defined as the span of P random features that are linear combinations of the basis functions of F weighted by random Gaussian i.i.d. coefficients. In particular, we consider multi-resolution random combinations at all scales of a given mother function, such as a hat function or a wavelet. In this latter case, the resulting Gaussian objects are called scrambled wavelets and we show that they enable to approximate functions in Sobolev spaces H s ([0, 1]d ). As a result, given N data, the least-squares estimate g built from P scrambled wavelets has excess risk ||f ∗ − g||2 = O(||f ∗ ||2 s ([0,1]d ) (log N )/P + P (log N )/N ) for P H target functions f ∗ ∈ H s ([0, 1]d ) of smoothness order s > d/2. An interesting aspect of the resulting bounds is that they do not depend on the distribution P from which the data are generated, which is important in a statistical regression setting considered here. Randomization enables to adapt to any possible distribution. We conclude by describing an efficient numerical implementation using lazy ex˜ pansions with numerical complexity O(2d N 3/2 log N + N 2 ), where d is the dimension of the input space. 1
Author: Matthias Broecheler, Lise Getoor
Abstract: Continuous Markov random fields are a general formalism to model joint probability distributions over events with continuous outcomes. We prove that marginal computation for constrained continuous MRFs is #P-hard in general and present a polynomial-time approximation scheme under mild assumptions on the structure of the random field. Moreover, we introduce a sampling algorithm to compute marginal distributions and develop novel techniques to increase its efficiency. Continuous MRFs are a general purpose probabilistic modeling tool and we demonstrate how they can be applied to statistical relational learning. On the problem of collective classification, we evaluate our algorithm and show that the standard deviation of marginals serves as a useful measure of confidence. 1
3 0.71404612 117 nips-2010-Identifying graph-structured activation patterns in networks
Author: James Sharpnack, Aarti Singh
Abstract: We consider the problem of identifying an activation pattern in a complex, largescale network that is embedded in very noisy measurements. This problem is relevant to several applications, such as identifying traces of a biochemical spread by a sensor network, expression levels of genes, and anomalous activity or congestion in the Internet. Extracting such patterns is a challenging task specially if the network is large (pattern is very high-dimensional) and the noise is so excessive that it masks the activity at any single node. However, typically there are statistical dependencies in the network activation process that can be leveraged to fuse the measurements of multiple nodes and enable reliable extraction of highdimensional noisy patterns. In this paper, we analyze an estimator based on the graph Laplacian eigenbasis, and establish the limits of mean square error recovery of noisy patterns arising from a probabilistic (Gaussian or Ising) model based on an arbitrary graph structure. We consider both deterministic and probabilistic network evolution models, and our results indicate that by leveraging the network interaction structure, it is possible to consistently recover high-dimensional patterns even when the noise variance increases with network size. 1
4 0.71390152 148 nips-2010-Learning Networks of Stochastic Differential Equations
Author: José Pereira, Morteza Ibrahimi, Andrea Montanari
Abstract: We consider linear models for stochastic dynamics. To any such model can be associated a network (namely a directed graph) describing which degrees of freedom interact under the dynamics. We tackle the problem of learning such a network from observation of the system trajectory over a time interval T . We analyze the ℓ1 -regularized least squares algorithm and, in the setting in which the underlying network is sparse, we prove performance guarantees that are uniform in the sampling rate as long as this is sufficiently high. This result substantiates the notion of a well defined ‘time complexity’ for the network inference problem. keywords: Gaussian processes, model selection and structure learning, graphical models, sparsity and feature selection. 1 Introduction and main results Let G = (V, E) be a directed graph with weight A0 ∈ R associated to the directed edge (j, i) from ij j ∈ V to i ∈ V . To each node i ∈ V in this network is associated an independent standard Brownian motion bi and a variable xi taking values in R and evolving according to A0 xj (t) dt + dbi (t) , ij dxi (t) = j∈∂+ i where ∂+ i = {j ∈ V : (j, i) ∈ E} is the set of ‘parents’ of i. Without loss of generality we shall take V = [p] ≡ {1, . . . , p}. In words, the rate of change of xi is given by a weighted sum of the current values of its neighbors, corrupted by white noise. In matrix notation, the same system is then represented by dx(t) = A0 x(t) dt + db(t) , p (1) 0 p×p with x(t) ∈ R , b(t) a p-dimensional standard Brownian motion and A ∈ R a matrix with entries {A0 }i,j∈[p] whose sparsity pattern is given by the graph G. We assume that the linear system ij x(t) = A0 x(t) is stable (i.e. that the spectrum of A0 is contained in {z ∈ C : Re(z) < 0}). Further, ˙ we assume that x(t = 0) is in its stationary state. More precisely, x(0) is a Gaussian random variable 1 independent of b(t), distributed according to the invariant measure. Under the stability assumption, this a mild restriction, since the system converges exponentially to stationarity. A portion of time length T of the system trajectory {x(t)}t∈[0,T ] is observed and we ask under which conditions these data are sufficient to reconstruct the graph G (i.e., the sparsity pattern of A0 ). We are particularly interested in computationally efficient procedures, and in characterizing the scaling of the learning time for large networks. Can the network structure be learnt in a time scaling linearly with the number of its degrees of freedom? As an example application, chemical reactions can be conveniently modeled by systems of nonlinear stochastic differential equations, whose variables encode the densities of various chemical species [1, 2]. Complex biological networks might involve hundreds of such species [3], and learning stochastic models from data is an important (and challenging) computational task [4]. Considering one such chemical reaction network in proximity of an equilibrium point, the model (1) can be used to trace fluctuations of the species counts with respect to the equilibrium values. The network G would represent in this case the interactions between different chemical factors. Work in this area focused so-far on low-dimensional networks, i.e. on methods that are guaranteed to be correct for fixed p, as T → ∞, while we will tackle here the regime in which both p and T diverge. 
Before stating our results, it is useful to stress a few important differences with respect to classical graphical model learning problems: (i) Samples are not independent. This can (and does) increase the sample complexity. (ii) On the other hand, infinitely many samples are given as data (in fact a collection indexed by the continuous parameter t ∈ [0, T ]). Of course one can select a finite subsample, for instance at regularly spaced times {x(i η)}i=0,1,... . This raises the question as to whether the learning performances depend on the choice of the spacing η. (iii) In particular, one expects that choosing η sufficiently large as to make the configurations in the subsample approximately independent can be harmful. Indeed, the matrix A0 contains more information than the stationary distribution of the above process (1), and only the latter can be learned from independent samples. (iv) On the other hand, letting η → 0, one can produce an arbitrarily large number of distinct samples. However, samples become more dependent, and intuitively one expects that there is limited information to be harnessed from a given time interval T . Our results confirm in a detailed and quantitative way these intuitions. 1.1 Results: Regularized least squares Regularized least squares is an efficient and well-studied method for support recovery. We will discuss relations with existing literature in Section 1.3. In the present case, the algorithm reconstructs independently each row of the matrix A0 . The rth row, A0 , is estimated by solving the following convex optimization problem for Ar ∈ Rp r minimize L(Ar ; {x(t)}t∈[0,T ] ) + λ Ar 1 , (2) where the likelihood function L is defined by L(Ar ; {x(t)}t∈[0,T ] ) = 1 2T T 0 (A∗ x(t))2 dt − r 1 T T 0 (A∗ x(t)) dxr (t) . r (3) (Here and below M ∗ denotes the transpose of matrix/vector M .) To see that this likelihood function is indeed related to least squares, one can formally write xr (t) = dxr (t)/dt and complete the square ˙ for the right hand side of Eq. (3), thus getting the integral (A∗ x(t) − xr (t))2 dt − xr (t)2 dt. ˙ ˙ r The first term is a sum of square residuals, and the second is independent of A. Finally the ℓ1 regularization term in Eq. (2) has the role of shrinking to 0 a subset of the entries Aij thus effectively selecting the structure. Let S 0 be the support of row A0 , and assume |S 0 | ≤ k. We will refer to the vector sign(A0 ) as to r r the signed support of A0 (where sign(0) = 0 by convention). Let λmax (M ) and λmin (M ) stand for r 2 the maximum and minimum eigenvalue of a square matrix M respectively. Further, denote by Amin the smallest absolute value among the non-zero entries of row A0 . r When stable, the diffusion process (1) has a unique stationary measure which is Gaussian with covariance Q0 ∈ Rp×p given by the solution of Lyapunov’s equation [5] A0 Q0 + Q0 (A0 )∗ + I = 0. (4) Our guarantee for regularized least squares is stated in terms of two properties of the covariance Q0 and one assumption on ρmin (A0 ) (given a matrix M , we denote by ML,R its submatrix ML,R ≡ (Mij )i∈L,j∈R ): (a) We denote by Cmin ≡ λmin (Q0 0 ,S 0 ) the minimum eigenvalue of the restriction of Q0 to S the support S 0 and assume Cmin > 0. (b) We define the incoherence parameter α by letting |||Q0 (S 0 )C ,S 0 Q0 S 0 ,S 0 and assume α > 0. (Here ||| · |||∞ is the operator sup norm.) −1 |||∞ = 1 − α, ∗ (c) We define ρmin (A0 ) = −λmax ((A0 + A0 )/2) and assume ρmin (A0 ) > 0. Note this is a stronger form of stability assumption. 
Our main result is to show that there exists a well defined time complexity, i.e. a minimum time interval T such that, observing the system for time T enables us to reconstruct the network with high probability. This result is stated in the following theorem. Theorem 1.1. Consider the problem of learning the support S 0 of row A0 of the matrix A0 from a r sample trajectory {x(t)}t∈[0,T ] distributed according to the model (1). If T > 104 k 2 (k ρmin (A0 )−2 + A−2 ) 4pk min log , 2 α2 ρmin (A0 )Cmin δ (5) then there exists λ such that ℓ1 -regularized least squares recovers the signed support of A0 with r probability larger than 1 − δ. This is achieved by taking λ = 36 log(4p/δ)/(T α2 ρmin (A0 )) . The time complexity is logarithmic in the number of variables and polynomial in the support size. Further, it is roughly inversely proportional to ρmin (A0 ), which is quite satisfying conceptually, since ρmin (A0 )−1 controls the relaxation time of the mixes. 1.2 Overview of other results So far we focused on continuous-time dynamics. While, this is useful in order to obtain elegant statements, much of the paper is in fact devoted to the analysis of the following discrete-time dynamics, with parameter η > 0: x(t) = x(t − 1) + ηA0 x(t − 1) + w(t), t ∈ N0 . (6) Here x(t) ∈ Rp is the vector collecting the dynamical variables, A0 ∈ Rp×p specifies the dynamics as above, and {w(t)}t≥0 is a sequence of i.i.d. normal vectors with covariance η Ip×p (i.e. with independent components of variance η). We assume that consecutive samples {x(t)}0≤t≤n are given and will ask under which conditions regularized least squares reconstructs the support of A0 . The parameter η has the meaning of a time-step size. The continuous-time model (1) is recovered, in a sense made precise below, by letting η → 0. Indeed we will prove reconstruction guarantees that are uniform in this limit as long as the product nη (which corresponds to the time interval T in the previous section) is kept constant. For a formal statement we refer to Theorem 3.1. Theorem 1.1 is indeed proved by carefully controlling this limit. The mathematical challenge in this problem is related to the fundamental fact that the samples {x(t)}0≤t≤n are dependent (and strongly dependent as η → 0). Discrete time models of the form (6) can arise either because the system under study evolves by discrete steps, or because we are subsampling a continuous time system modeled as in Eq. (1). Notice that in the latter case the matrices A0 appearing in Eq. (6) and (1) coincide only to the zeroth order in η. Neglecting this technical complication, the uniformity of our reconstruction guarantees as η → 0 has an appealing interpretation already mentioned above. Whenever the samples spacing is not too large, the time complexity (i.e. the product nη) is roughly independent of the spacing itself. 3 1.3 Related work A substantial amount of work has been devoted to the analysis of ℓ1 regularized least squares, and its variants [6, 7, 8, 9, 10]. The most closely related results are the one concerning high-dimensional consistency for support recovery [11, 12]. Our proof follows indeed the line of work developed in these papers, with two important challenges. First, the design matrix is in our case produced by a stochastic diffusion, and it does not necessarily satisfies the irrepresentability conditions used by these works. Second, the observations are not corrupted by i.i.d. 
noise (since successive configurations are correlated) and therefore elementary concentration inequalities are not sufficient. Learning sparse graphical models via ℓ1 regularization is also a topic with significant literature. In the Gaussian case, the graphical LASSO was proposed to reconstruct the model from i.i.d. samples [13]. In the context of binary pairwise graphical models, Ref. [11] proves high-dimensional consistency of regularized logistic regression for structural learning, under a suitable irrepresentability conditions on a modified covariance. Also this paper focuses on i.i.d. samples. Most of these proofs builds on the technique of [12]. A naive adaptation to the present case allows to prove some performance guarantee for the discrete-time setting. However the resulting bounds are not uniform as η → 0 for nη = T fixed. In particular, they do not allow to prove an analogous of our continuous time result, Theorem 1.1. A large part of our effort is devoted to producing more accurate probability estimates that capture the correct scaling for small η. Similar issues were explored in the study of stochastic differential equations, whereby one is often interested in tracking some slow degrees of freedom while ‘averaging out’ the fast ones [14]. The relevance of this time-scale separation for learning was addressed in [15]. Let us however emphasize that these works focus once more on system with a fixed (small) number of dimensions p. Finally, the related topic of learning graphical models for autoregressive processes was studied recently in [16, 17]. The convex relaxation proposed in these papers is different from the one developed here. Further, no model selection guarantee was proved in [16, 17]. 2 Illustration of the main results It might be difficult to get a clear intuition of Theorem 1.1, mainly because of conditions (a) and (b), which introduce parameters Cmin and α. The same difficulty arises with analogous results on the high-dimensional consistency of the LASSO [11, 12]. In this section we provide concrete illustration both via numerical simulations, and by checking the condition on specific classes of graphs. 2.1 Learning the laplacian of graphs with bounded degree Given a simple graph G = (V, E) on vertex set V = [p], its laplacian ∆G is the symmetric p × p matrix which is equal to the adjacency matrix of G outside the diagonal, and with entries ∆G = ii −deg(i) on the diagonal [18]. (Here deg(i) denotes the degree of vertex i.) It is well known that ∆G is negative semidefinite, with one eigenvalue equal to 0, whose multiplicity is equal to the number of connected components of G. The matrix A0 = −m I + ∆G fits into the setting of Theorem 1.1 for m > 0. The corresponding model (1.1) describes the over-damped dynamics of a network of masses connected by springs of unit strength, and connected by a spring of strength m to the origin. We obtain the following result. Theorem 2.1. Let G be a simple connected graph of maximum vertex degree k and consider the model (1.1) with A0 = −m I + ∆G where ∆G is the laplacian of G and m > 0. If k+m 5 4pk T ≥ 2 · 105 k 2 , (7) (k + m2 ) log m δ then there exists λ such that ℓ1 -regularized least squares recovers the signed support of A0 with r probability larger than 1 − δ. This is achieved by taking λ = 36(k + m)2 log(4p/δ)/(T m3 ). 
In other words, for m bounded away from 0 and ∞, regularized least squares regression correctly reconstructs the graph G from a trajectory of time length which is polynomial in the degree and logarithmic in the system size. Notice that once the graph is known, the laplacian ∆G is uniquely determined. Also, the proof technique used for this example is generalizable to other graphs as well. 4 2800 Min. # of samples for success prob. = 0.9 1 0.9 p = 16 p = 32 0.8 Probability of success p = 64 0.7 p = 128 p = 256 0.6 p = 512 0.5 0.4 0.3 0.2 0.1 0 0 50 100 150 200 250 300 T=nη 350 400 2600 2400 2200 2000 1800 1600 1400 1200 1 10 450 2 3 10 10 p Figure 1: (left) Probability of success vs. length of the observation interval nη. (right) Sample complexity for 90% probability of success vs. p. 2.2 Numerical illustrations In this section we present numerical validation of the proposed method on synthetic data. The results confirm our observations in Theorems 1.1 and 3.1, below, namely that the time complexity scales logarithmically with the number of nodes in the network p, given a constant maximum degree. Also, the time complexity is roughly independent of the sampling rate. In Fig. 1 and 2 we consider the discrete-time setting, generating data as follows. We draw A0 as a random sparse matrix in {0, 1}p×p with elements chosen independently at random with P(A0 = 1) = k/p, k = 5. The ij process xn ≡ {x(t)}0≤t≤n is then generated according to Eq. (6). We solve the regularized least 0 square problem (the cost function is given explicitly in Eq. (8) for the discrete-time case) for different values of n, the number of observations, and record if the correct support is recovered for a random row r using the optimum value of the parameter λ. An estimate of the probability of successful recovery is obtained by repeating this experiment. Note that we are estimating here an average probability of success over randomly generated matrices. The left plot in Fig.1 depicts the probability of success vs. nη for η = 0.1 and different values of p. Each curve is obtained using 211 instances, and each instance is generated using a new random matrix A0 . The right plot in Fig.1 is the corresponding curve of the sample complexity vs. p where sample complexity is defined as the minimum value of nη with probability of success of 90%. As predicted by Theorem 2.1 the curve shows the logarithmic scaling of the sample complexity with p. In Fig. 2 we turn to the continuous-time model (1). Trajectories are generated by discretizing this stochastic differential equation with step δ much smaller than the sampling rate η. We draw random matrices A0 as above and plot the probability of success for p = 16, k = 4 and different values of η, as a function of T . We used 211 instances for each curve. As predicted by Theorem 1.1, for a fixed observation interval T , the probability of success converges to some limiting value as η → 0. 3 Discrete-time model: Statement of the results Consider a system evolving in discrete time according to the model (6), and let xn ≡ {x(t)}0≤t≤n 0 be the observed portion of the trajectory. The rth row A0 is estimated by solving the following r convex optimization problem for Ar ∈ Rp minimize L(Ar ; xn ) + λ Ar 0 where L(Ar ; xn ) ≡ 0 1 2η 2 n 1 , (8) n−1 2 t=0 {xr (t + 1) − xr (t) − η A∗ x(t)} . r (9) Apart from an additive constant, the η → 0 limit of this cost function can be shown to coincide with the cost function in the continuous time case, cf. Eq. (3). 
Apart from an additive constant, the η → 0 limit of this cost function can be shown to coincide with the cost function of the continuous-time case, cf. Eq. (3). Indeed, the proof of Theorem 1.1 amounts to a more precise version of this statement. Furthermore, L(A_r; x_0^n) is easily seen to be proportional, up to an additive constant, to minus the log-likelihood of A_r within the model (6).

Figure 2: (right) Probability of success vs. length of the observation interval nη for different values of η (η between 0.04 and 0.22). (left) Probability of success vs. η for a fixed length of the observation interval (nη = 150). The process is generated for a small value of η and sampled at different rates.

As before, we let S0 be the support of row A0_r, and assume |S0| ≤ k. Under the model (6), x(t) has a Gaussian stationary state distribution with covariance Q0 determined by the following modified Lyapunov equation:

    A0 Q0 + Q0 (A0)^* + η A0 Q0 (A0)^* + I = 0 .    (10)

It will be clear from the context whether A0/Q0 refers to the dynamics/stationary matrix of the continuous- or discrete-time system. We assume conditions (a) and (b) introduced in Section 1.1, and adopt the notations already introduced there. We use the shorthand notation σmax ≡ σmax(I + η A0), where σmax(·) is the maximum singular value. We also define D ≡ (1 − σmax)/η and assume D > 0. As in the previous section, we assume the model (6) is initiated in the stationary state.

Theorem 3.1. Consider the problem of learning the support S0 of row A0_r from the discrete-time trajectory {x(t)}_{0≤t≤n}. If

    nη > (10^4 k^2 (k D^{−2} + A_min^{−2}) / (D C_min^2 α^2)) log(4pk/δ),    (11)

then there exists λ such that ℓ1-regularized least squares recovers the signed support of A0_r with probability larger than 1 − δ. This is achieved by taking λ = √(36 log(4p/δ)/(D α^2 nη)).

In other words, the discrete-time sample complexity n is logarithmic in the model dimension, polynomial in the maximum network degree, and inversely proportional to the time spacing between samples. The last point is particularly important. It enables us to derive the bound on the continuous-time sample complexity as the limit η → 0 of the discrete-time sample complexity. It also confirms our intuition mentioned in the Introduction: although one can produce an arbitrarily large number of samples by sampling the continuous process at finer resolutions, there is a limited amount of information that can be harnessed from a given time interval [0, T].

4 Proofs

In the following we denote by X ∈ R^{p×n} the matrix whose (t+1)-th column corresponds to the configuration x(t), i.e., X = [x(0), x(1), ..., x(n−1)]. Further, ∆X ∈ R^{p×n} is the matrix containing the configuration changes, namely ∆X = [x(1) − x(0), ..., x(n) − x(n−1)]. Finally, we write W = [w(1), ..., w(n)] for the matrix containing the Gaussian noise realization; equivalently, W = ∆X − η A0 X. The r-th row of W is denoted by W_r. In order to lighten the notation, we omit the reference to x_0^n in the likelihood function (9) and simply write L(A_r). We define its normalized gradient and Hessian by

    G = −∇L(A0_r) = (1/(nη)) X W_r^* ,    Q = ∇^2 L(A0_r) = (1/n) X X^* .    (12)
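Both quantities in (12) can be computed directly from a trajectory; for a stationary trajectory their expectations are 0 and Q0 respectively (a fact used in Section 4.1 below), and this can be checked numerically against the solution of the modified Lyapunov equation (10). The sketch below does this by vectorizing (10) with Kronecker products; it reuses traj, A0 and eta from the earlier sketches, and the function names are ours.

```python
# Sketch: empirical gradient/Hessian of (12) and the stationary covariance from (10).
import numpy as np

def stationary_covariance(A0, eta):
    """Solve A0 Q + Q A0^T + eta A0 Q A0^T + I = 0 via vec/Kronecker (Q is symmetric)."""
    p = A0.shape[0]
    I = np.eye(p)
    M = np.kron(I, A0) + np.kron(A0, I) + eta * np.kron(A0, A0)
    return np.linalg.solve(M, -I.flatten()).reshape(p, p)

def empirical_G_Q(traj, A0, r, eta):
    """Normalized gradient and Hessian of Eq. (12), evaluated at the true A0."""
    Xmat = traj[:-1].T                          # columns x(0), ..., x(n-1)
    n = Xmat.shape[1]
    W = traj[1:].T - Xmat - eta * (A0 @ Xmat)   # noise realization, W = Delta X - eta A0 X
    G = Xmat @ W[r] / (n * eta)
    Q = Xmat @ Xmat.T / n
    return G, Q

Q0 = stationary_covariance(A0, eta)
G, Q = empirical_G_Q(traj, A0, r=0, eta=eta)
print(np.abs(G).max(), np.abs(Q - Q0).max())    # both shrink as the trajectory gets longer
```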
4.1 Discrete time

In this section we outline the proof of our main result for discrete-time dynamics, Theorem 3.1. We start by stating a set of sufficient conditions for regularized least squares to work. Then we present a series of concentration lemmas to be used to prove the validity of these conditions, and finally we sketch the outline of the proof. As mentioned, the proof strategy, and in particular the following proposition, which provides a compact set of sufficient conditions for the support to be recovered correctly, is analogous to the one in [12]. A proof of this proposition can be found in the supplementary material.

Proposition 4.1. Let α, C_min > 0 be defined by

    λ_min(Q0_{S0,S0}) ≡ C_min ,    |||Q0_{(S0)^C,S0} (Q0_{S0,S0})^{−1}|||_∞ ≡ 1 − α .    (13)

If the following conditions hold, then the regularized least-squares solution (8) correctly recovers the signed support sign(A0_r):

    ||G||_∞ ≤ λα/3 ,    ||G_{S0}||_∞ ≤ A_min C_min/(4k) − λ ,    (14)

    |||Q_{S0,S0} − Q0_{S0,S0}|||_∞ ≤ α C_min/(12√k) ,    |||Q_{(S0)^C,S0} − Q0_{(S0)^C,S0}|||_∞ ≤ α C_min/(12√k) .    (15)

Further, the same statement holds for the continuous model (1), provided G and Q are the gradient and the Hessian of the likelihood (3).

The proof of Theorem 3.1 consists in checking that, under the hypothesis (11) on the number of consecutive configurations, conditions (14)-(15) hold with high probability. Checking these conditions can in turn be regarded as a concentration-of-measure statement. Indeed, if the expectation is taken with respect to a stationary trajectory, we have E{G} = 0 and E{Q} = Q0.

4.1.1 Technical lemmas

In this section we state the concentration lemmas needed to prove Theorem 3.1. These are non-trivial because G and Q are quadratic functions of dependent random variables, the samples {x(t)}_{0≤t≤n}. The proofs of Propositions 4.2 and 4.3 and of Corollary 4.4 can be found in the supplementary material. Our first proposition implies concentration of G around 0.

Proposition 4.2. Let S ⊆ [p] be any set of vertices and ε < 1/2. If σmax ≡ σmax(I + η A0) < 1, then

    P(||G_S||_∞ > ε) ≤ 2|S| e^{−n(1−σmax) ε^2/4} .    (16)

We furthermore need to bound the matrix norms as per (15) in Proposition 4.1. First we relate bounds on |||Q_{JS} − Q0_{JS}|||_∞ to bounds on |Q_{ij} − Q0_{ij}| (i ∈ J, j ∈ S), where J and S are any subsets of {1, ..., p}. We have

    P(|||Q_{JS} − Q0_{JS}|||_∞ > ε) ≤ |J||S| max_{i∈J, j∈S} P(|Q_{ij} − Q0_{ij}| > ε/|S|) .    (17)

Then, we bound |Q_{ij} − Q0_{ij}| using the following proposition.

Proposition 4.3. Let i, j ∈ {1, ..., p}, σmax ≡ σmax(I + η A0) < 1, T = ηn > 3/D and 0 < ε < 2/D, where D = (1 − σmax)/η. Then

    P(|Q_{ij} − Q0_{ij}| > ε) ≤ 2 e^{−n(1−σmax)^3 ε^2/(32 η^2)} .    (18)

Finally, the next corollary follows from Proposition 4.3 and Eq. (17).

Corollary 4.4. Let J, S (|S| ≤ k) be any two subsets of {1, ..., p}, and assume σmax ≡ σmax(I + η A0) < 1, ε < 2k/D and nη > 3/D (where D = (1 − σmax)/η). Then

    P(|||Q_{JS} − Q0_{JS}|||_∞ > ε) ≤ 2|J| k e^{−n(1−σmax)^3 ε^2/(32 k^2 η^2)} .    (19)

4.1.2 Outline of the proof of Theorem 3.1

With these concentration bounds we can now easily prove Theorem 3.1. All we need to do is to compute the probability that the conditions given by Proposition 4.1 hold. From the statement of the theorem we have that the first two conditions (α, C_min > 0) of Proposition 4.1 hold. In order to make the first condition on G imply the second condition on G, we assume that λα/3 ≤ (A_min C_min)/(4k) − λ, which is guaranteed to hold if

    λ ≤ A_min C_min/(8k) .    (20)

We also combine the two last conditions on Q, obtaining

    |||Q_{[p],S0} − Q0_{[p],S0}|||_∞ ≤ α C_min/(12√k) ,    (21)

since [p] = S0 ∪ (S0)^C. We then impose that both the probability of the condition on Q failing and the probability of the condition on G failing are upper bounded by δ/2, using Proposition 4.2 and Corollary 4.4. It is shown in the supplementary material that this is satisfied if condition (11) holds.
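For a known model, the constants in (13) and the deviation controlled in (21) can be evaluated numerically, which gives a rough feeling for when the sufficient conditions of Proposition 4.1 hold. The sketch below reuses A0, Q0 and Q from the earlier sketches; the choice of row r = 0 and the final threshold check are purely illustrative.

```python
# Sketch: evaluating Cmin and alpha of (13) and the deviation of (21) for a known support.
import numpy as np

def condition_constants(Q0, S0):
    """Cmin = smallest eigenvalue of Q0_{S0,S0}; alpha = 1 - |||Q0_{S0^C,S0} Q0_{S0,S0}^{-1}|||_inf."""
    Sc = np.setdiff1d(np.arange(Q0.shape[0]), S0)
    Q_SS = Q0[np.ix_(S0, S0)]
    Q_ScS = Q0[np.ix_(Sc, S0)]
    Cmin = np.linalg.eigvalsh(Q_SS)[0]
    incoherence = np.abs(Q_ScS @ np.linalg.inv(Q_SS)).sum(axis=1).max()   # |||.|||_inf = max abs row sum
    return Cmin, 1.0 - incoherence

S0 = np.flatnonzero(A0[0])                       # support of row 0 (includes the diagonal entry)
k = len(S0)
Cmin, alpha = condition_constants(Q0, S0)
dev = np.abs(Q - Q0)[:, S0].sum(axis=1).max()    # |||Q_{[p],S0} - Q0_{[p],S0}|||_inf, as in (21)
print(Cmin, alpha, dev <= alpha * Cmin / (12 * np.sqrt(k)))
```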
4.2 Outline of the proof of Theorem 1.1

To prove Theorem 1.1, we recall that Proposition 4.1 holds provided the appropriate continuous-time expressions are used for G and Q, namely

    G = −∇L(A0_r) = (1/T) ∫_0^T x(t) db_r(t) ,    Q = ∇^2 L(A0_r) = (1/T) ∫_0^T x(t) x(t)^* dt .    (22)

These are of course random variables. In order to distinguish them from the discrete-time versions, we adopt the notation G^n, Q^n for the latter. We claim that these random variables can be coupled (i.e., defined on the same probability space) in such a way that G^n → G and Q^n → Q almost surely as n → ∞ for fixed T. Under assumption (5), it is easy to show that (11) holds for all n > n0, with n0 a sufficiently large constant (for a proof see the supplementary material). Therefore, by the proof of Theorem 3.1, the conditions in Proposition 4.1 hold for the gradient G^n and Hessian Q^n for any n ≥ n0, with probability larger than 1 − δ. But by the claimed convergence G^n → G and Q^n → Q, they also hold for G and Q with probability at least 1 − δ, which proves the theorem.

We are left with the task of showing that the discrete- and continuous-time processes can be coupled in such a way that G^n → G and Q^n → Q. With a slight abuse of notation, the state of the discrete-time system (6) will be denoted by x(i), where i ∈ N, and the state of the continuous-time system (1) by x(t), where t ∈ R. We denote by Q0 the solution of (4) and by Q0(η) the solution of (10). It is easy to check that Q0(η) → Q0 as η → 0, by the uniqueness of the stationary state distribution. The initial state of the continuous-time system, x(t = 0), is a N(0, Q0) random variable independent of b(t), and the initial state of the discrete-time system is defined to be x(i = 0) = (Q0(η))^{1/2} (Q0)^{−1/2} x(t = 0). At subsequent times, x(i) and x(t) are generated by the respective dynamical systems using the same matrix A0 and the common randomness provided by the standard Brownian motion {b(t)}_{0≤t≤T} in R^p. In order to couple x(t) and x(i), we construct w(i), the noise driving the discrete-time system, by letting w(i) ≡ b(T i/n) − b(T(i − 1)/n). The almost sure convergence G^n → G and Q^n → Q then follows from the standard convergence of random walks to Brownian motion.
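This coupling can be mimicked numerically: a single Brownian path on a fine grid drives both the (Euler-discretized) continuous system and the discrete-time system, whose noise w(i) is the increment of b over the i-th sampling interval. The drift matrix, the grid sizes and the common zero initial state (instead of the stationary initialization used in the proof) are illustrative assumptions.

```python
# Sketch: coupling the discrete-time chain to a discretized continuous trajectory.
import numpy as np

p, T, n, fine = 4, 10.0, 100, 50                 # 'fine' sub-steps per sampling interval
eta, delta = T / n, T / (n * fine)
rng = np.random.default_rng(1)
A0c = -np.eye(p)                                 # any stable drift matrix works here

db = np.sqrt(delta) * rng.standard_normal((n * fine, p))   # Brownian increments on the fine grid
x_cont = np.zeros(p)
x_disc = np.zeros(p)                             # common initial condition
for i in range(n):
    for j in range(fine):                        # continuous system: Euler steps of size delta
        x_cont = x_cont + delta * (A0c @ x_cont) + db[i * fine + j]
    w_i = db[i * fine:(i + 1) * fine].sum(axis=0)            # increment of b over the i-th interval
    x_disc = x_disc + eta * (A0c @ x_disc) + w_i             # discrete chain, same randomness
print(np.linalg.norm(x_cont - x_disc))           # typically small for small eta
```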
Acknowledgments

This work was partially supported by a Terman fellowship, the NSF CAREER award CCF-0743978, the NSF grant DMS-0806211, and by a Portuguese Doctoral FCT fellowship.

References

[1] D.T. Gillespie. Stochastic simulation of chemical kinetics. Annual Review of Physical Chemistry, 58:35–55, 2007.
[2] D. Higham. Modeling and simulating chemical reactions. SIAM Review, 50:347–368, 2008.
[3] N.D. Lawrence et al., editors. Learning and Inference in Computational Systems Biology. MIT Press, 2010.
[4] T. Toni, D. Welch, N. Strelkova, A. Ipsen, and M.P.H. Stumpf. Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. J. R. Soc. Interface, 6:187–202, 2009.
[5] K. Zhou, J.C. Doyle, and K. Glover. Robust and Optimal Control. Prentice Hall, 1996.
[6] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):267–288, 1996.
[7] D.L. Donoho. For most large underdetermined systems of equations, the minimal ℓ1-norm near-solution approximates the sparsest near-solution. Communications on Pure and Applied Mathematics, 59(7):907–934, 2006.
[8] D.L. Donoho. For most large underdetermined systems of linear equations the minimal ℓ1-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics, 59(6):797–829, 2006.
[9] T. Zhang. Some sharp performance bounds for least squares regression with L1 regularization. Annals of Statistics, 37:2109–2144, 2009.
[10] M.J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55:2183–2202, 2009.
[11] M.J. Wainwright, P. Ravikumar, and J.D. Lafferty. High-dimensional graphical model selection using ℓ1-regularized logistic regression. Advances in Neural Information Processing Systems, 19:1465, 2007.
[12] P. Zhao and B. Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.
[13] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432, 2008.
[14] K. Ball, T.G. Kurtz, L. Popovic, and G. Rempala. Asymptotic analysis of multiscale approximations to reaction networks. Annals of Applied Probability, 16:1925–1961, 2006.
[15] G.A. Pavliotis and A.M. Stuart. Parameter estimation for multiscale diffusions. Journal of Statistical Physics, 127:741–781, 2007.
[16] J. Songsiri, J. Dahl, and L. Vandenberghe. Graphical models of autoregressive processes. Pages 89–116, 2010.
[17] J. Songsiri and L. Vandenberghe. Topology selection in graphical models of autoregressive processes. Journal of Machine Learning Research, 2010. Submitted.
[18] F.R.K. Chung. Spectral Graph Theory. CBMS Regional Conference Series in Mathematics, 1997.
[19] P. Ravikumar, M.J. Wainwright, and J. Lafferty. High-dimensional Ising model selection using ℓ1-regularized logistic regression. Annals of Statistics, 2008.