nips nips2005 nips2005-168 knowledge-graph by maker-knowledge-mining

168 nips-2005-Rodeo: Sparse Nonparametric Regression in High Dimensions


Source: pdf

Author: Larry Wasserman, John D. Lafferty

Abstract: We present a method for nonparametric regression that performs bandwidth selection and variable selection simultaneously. The approach is based on the technique of incrementally decreasing the bandwidth in directions where the gradient of the estimator with respect to bandwidth is large. When the unknown function satisfies a sparsity condition, our approach avoids the curse of dimensionality, achieving the optimal minimax rate of convergence, up to logarithmic factors, as if the relevant variables were known in advance. The method—called rodeo (regularization of derivative expectation operator)—conducts a sequence of hypothesis tests, and is easy to implement. A modified version that replaces hard with soft thresholding effectively solves a sequence of lasso problems. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The approach is based on the technique of incrementally decreasing the bandwidth in directions where the gradient of the estimator with respect to bandwidth is large. [sent-2, score-0.731]

2 When the unknown function satisfies a sparsity condition, our approach avoids the curse of dimensionality, achieving the optimal minimax rate of convergence, up to logarithmic factors, as if the relevant variables were known in advance. [sent-3, score-0.265]

3 The method—called rodeo (regularization of derivative expectation operator)—conducts a sequence of hypothesis tests, and is easy to implement. [sent-4, score-0.484]

4 A modified version that replaces hard with soft thresholding effectively solves a sequence of lasso problems. [sent-5, score-0.323]

5 (1 Introduction) Estimating a high dimensional regression function is notoriously difficult due to the “curse of dimensionality.” [sent-6, score-0.049]

6 Then if m is in W2(c), the d-dimensional Sobolev ball of order two and radius c, it is well known that lim inf_{n→∞} n^{4/(4+d)} inf_{mn} sup_{m∈W2(c)} R(mn, m) > 0, (1) where R(mn, m) = Em ∫ (mn(x) − m(x))² dx is the risk of the estimate mn constructed on a sample of size n (Györfi et al. [sent-15, score-0.203]

7 Under this sparseness assumption we can hope to achieve the better minimax convergence rate of n−4/(4+r) if the r relevant variables can be isolated. [sent-24, score-0.164]

8 Thus, we are faced with the problem of variable selection in nonparametric regression. [sent-25, score-0.187]

9 A large body of previous work has addressed this fundamental problem, which has led to a variety of methods to combat the curse of dimensionality. [sent-26, score-0.057]

10 (2005) use likelihood basis pursuit, essentially the lasso adapted to the spline setting. [sent-31, score-0.106]

11 1984) and MARS (Friedman 1991) effectively perform variable selection as part of their function fitting. [sent-33, score-0.123]

12 (2005) use independence testing for variable selection and Bühlmann and Yu (2005) introduced a boosting approach. [sent-35, score-0.13]

13 Indeed, the theoretical analysis of sparse parametric estimators such as the lasso (Tibshirani 1996) is difficult, and only recently has significant progress been made on this front (Donoho 2004; Fu and Knight 2000). [sent-38, score-0.126]

14 In this paper we present a new approach to sparse nonparametric function estimation that is both computationally simple and amenable to theoretical analysis. [sent-39, score-0.114]

15 It is based on the idea that bandwidth and variable selection can be simultaneously performed by computing the infinitesimal change in a nonparametric estimator as a function of the smoothing parameters, and then thresholding these derivatives to effectively get a sparse estimate. [sent-41, score-0.837]

16 As a simple version of this principle we use hard thresholding, effectively carrying out a sequence of hypothesis tests. [sent-42, score-0.108]

17 A modified version that replaces testing with soft thresholding effectively solves a sequence of lasso problems. [sent-43, score-0.281]

18 The potential appeal of this approach is that it can be based on relatively simple and theoretically well understood nonparametric techniques such as local linear smoothing, leading to methods that are simple to implement and can be used in high dimensional problems. [sent-44, score-0.092]

19 Moreover, we show that the rodeo can achieve near optimal minimax rates of convergence, and therefore circumvents the curse of dimensionality when the true function is indeed sparse. [sent-45, score-0.559]

20 Fix a point x and let mh (x) denote an estimator of m(x) based on a vector of smoothing parameters h = (h1 , . [sent-49, score-0.485]

21 For now, assume that xi is one of the observed data points and that m0 (x) = Yi . [sent-58, score-0.049]

22 If P = (h(t) : 0 ≤ t ≤ 1) is a smooth path through the set of smoothing parameters with h(0) = 0 and h(1) = 1 (or any other fixed, large bandwidth), then m(x) = M(0) = M(1) − ∫₀¹ (dM(h(s))/ds) ds = M(1) − ∫₀¹ ⟨D(s), ḣ(s)⟩ ds, where D(h) = ∇M(h) = (∂M/∂h1, . . . , ∂M/∂hd)⊤ [sent-60, score-0.276]

23 is the gradient of M(h), and ḣ(s) = dh(s)/ds is the derivative of h(s) along the path. [sent-63, score-0.104]

24 A biased, low variance estimator of M (1) is m1 (x). [sent-64, score-0.117]

25 An unbiased estimator of D(h) is Z(h) = (∂mh(x)/∂h1, . . . , ∂mh(x)/∂hd)⊤. (2) [sent-65, score-0.765]

26 The naive estimator m(x) = m1(x) − ∫₀¹ ⟨Z(s), ḣ(s)⟩ ds (3) [Figure 1 caption: The bandwidths for the relevant variables (h2) are shrunk, while the bandwidths for the irrelevant variables (h1) are kept relatively large.] [sent-69, score-0.921]

27 The simplest rodeo algorithm shrinks the bandwidths in discrete steps 1, β, β 2 , . [sent-70, score-0.692]

28 This naive estimator is identically equal to m0(x) = Yi, which has poor risk since the variance of Z(h) is large for small h. [sent-74, score-0.327]

29 However, our sparsity assumption on m suggests that there should be paths for which D(h) is also sparse. [sent-75, score-0.072]

30 Along such a path, we replace Z(h) with an estimator D(h) that makes use of the sparsity assumption. [sent-76, score-0.156]

31 Our estimate of m(x) is then m(x) = m1(x) − ∫₀¹ ⟨D(s), ḣ(s)⟩ ds. (4) [sent-77, score-0.065]

32 To implement this idea we need to do two things: (i) we need to find a sparse path and (ii) we need to take advantage of this sparseness when estimating D along that path. [sent-78, score-0.12]

33 The key observation is that if xj is irrelevant, then we expect that changing the bandwidth hj for that variable should cause only a small change in the estimator mh (x). [sent-79, score-1.274]

34 Conversely, if xj is relevant, then we expect that changing the bandwidth hj for that variable should cause a large change in the estimator. [sent-80, score-0.833]

35 Thus, Zj = ∂ mh (x)/∂hj should discriminate between relevant and irrelevant covariates. [sent-81, score-0.431]
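
As a toy illustration of this discrimination (not taken from the paper), the sketch below approximates Zj by a finite difference of a simple Nadaraya-Watson fit rather than the local linear estimator developed later; the data-generating function, bandwidths, and evaluation point are arbitrary choices made here for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: x1 is relevant, x2 is pure noise.
n, d = 500, 2
X = rng.uniform(0.0, 1.0, size=(n, d))
Y = 2.0 * np.sin(5.0 * X[:, 0]) + 0.5 * rng.standard_normal(n)

def nw_fit(x, h):
    """Nadaraya-Watson estimate at x with a product Gaussian kernel, bandwidths h."""
    w = np.exp(-0.5 * np.sum(((X - x) / h) ** 2, axis=1))
    return np.sum(w * Y) / np.sum(w)

x0 = np.full(d, 0.5)          # evaluation point
h = np.full(d, 0.5)           # common starting bandwidth
eps = 1e-4
for j in range(d):
    h_step = h.copy()
    h_step[j] += eps
    Zj = (nw_fit(x0, h_step) - nw_fit(x0, h)) / eps   # finite-difference proxy for d m_h(x)/d h_j
    print(f"|Z_{j + 1}| ~ {abs(Zj):.4f}")             # usually much larger for x1 than for x2
```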

36 To simplify the procedure, we can replace the continuum of bandwidths with a discrete set where each hj ∈ B = {h0 , βh0 , β 2 h0 , . [sent-82, score-0.614]

37 Moreover, we can proceed in a greedy fashion by estimating D(h) sequentially with hj ∈ B and setting Dj(h) = 0 when hj < ĥj, where ĥj is the first h such that |Zj(h)| < λj(h) for some threshold λj. [sent-86, score-1.523]

38 This greedy version, coupled with the hard threshold estimator, yields m(x) = mĥ(x), the estimator fit with the selected bandwidths ĥ. [sent-87, score-0.097]

39 This idea can be implemented using a greedy algorithm, coupled with the hard threshold estimator, to yield a bandwidth selection procedure based on testing. [sent-89, score-0.474]

40 This approach to bandwidth selection is similar to that of Lepski et al. [sent-90, score-0.39]

41 Our approach is also similar to a method of Ruppert (1997) that uses a sequence of decreasing bandwidths and then estimates the optimal bandwidth by estimating the mean squared error as a function of bandwidth. [sent-92, score-0.565]

42 Our greedy approach tests whether an infinitesimal change in the bandwidth from its current setting leads to a significant change in the estimate, and is more easily extended to a practical method in higher dimensions. [sent-93, score-0.415]

43 (2001) focuses on variable selection in multi-index models rather than on bandwidth estimation. [sent-95, score-0.391]

44 3 Rodeo using Local Linear Regression We now present the multivariate case in detail, using local linear smoothing as the basic method since it is known to have many good properties. [sent-96, score-0.067]

45 Let mH (x) denote the local linear estimator of m(x) using bandwidth matrix H. [sent-101, score-0.413]

46 The estimator mH can be written as mH(x) = Σ_{i=1}^n G(Xi, x, h) Yi, where G(u, x, h) = e1⊤ (Xx⊤ Wx Xx)⁻¹ (1, (u − x)⊤)⊤ KH(u − x) (6) is called the effective kernel. [sent-112, score-0.117]

47 We assume that the covariates are random with sampling density f (x), and make the same assumptions as Ruppert and Wand (1994) in their analysis of the bias and variance of local linear regression. [sent-113, score-0.136]

48 In particular, (i) the kernel K has compact support with zero odd moments and ∫ uu⊤ K(u) du = ν2(K) I, and (ii) the sampling density f(x) is continuously differentiable and strictly positive. [sent-114, score-0.053]

49 In the version of the algorithm that follows, we take K to be a product kernel and H to be diagonal with elements h = (h1 , . [sent-115, score-0.069]

50 Our method is based on the statistic Zj = ∂mh(x)/∂hj = Σ_{i=1}^n Gj(Xi, x, h) Yi, (7) where Gj(u, x, h) = ∂G(u, x, h)/∂hj. [sent-119, score-0.324]

51 Straightforward calculations show that Zj = ∂mh(x)/∂hj = e1⊤ (Xx⊤ Wx Xx)⁻¹ Xx⊤ (∂Wx/∂hj) (Y − Xx α), (8) where α = (Xx⊤ Wx Xx)⁻¹ Xx⊤ Wx Y is the coefficient vector for the local linear fit. [sent-120, score-0.324]

52 Assuming a product kernel we have Wx = diag( Π_{j=1}^d K((X1j − xj)/hj), . . . , Π_{j=1}^d K((Xnj − xj)/hj) ) [sent-122, score-0.143]

53 and ∂Wx/∂hj = Wx Dj, where Dj = diag( ∂ log K((X1j − xj)/hj)/∂hj, . . . , ∂ log K((Xnj − xj)/hj)/∂hj ). [sent-125, score-0.32]

54 For example, with the Gaussian kernel K(u) = exp(−u²/2) we have Dj = (1/hj³) diag( (X1j − xj)², . . . , (Xnj − xj)² ). [sent-129, score-0.143]
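
Equations (7), (8), and the Gaussian form of Dj above translate directly into code. The following numpy sketch is not the authors' implementation: the function name and interface are invented here, and the pseudo-inverse is a numerical safeguard the paper does not mention.

```python
import numpy as np

def local_linear_zj(X, Y, x, h):
    """Local linear fit at x with a product Gaussian kernel and bandwidths h,
    returning the fitted value m_h(x) and the derivatives Z_j of equation (8),
    using dW_x/dh_j = W_x D_j with the Gaussian D_j above.
    X: (n, d) covariates, Y: (n,) responses, x: (d,), h: (d,)."""
    n, d = X.shape
    U = X - x                                         # rows are X_i - x
    w = np.exp(-0.5 * np.sum((U / h) ** 2, axis=1))   # product Gaussian kernel weights
    Xx = np.hstack([np.ones((n, 1)), U])              # design matrix with rows (1, (X_i - x)^T)
    A = Xx.T @ (w[:, None] * Xx)                      # Xx^T W Xx
    A_inv = np.linalg.pinv(A)                         # pseudo-inverse as a numerical safeguard
    alpha = A_inv @ (Xx.T @ (w * Y))                  # local linear coefficients
    m_hat = alpha[0]                                  # e_1^T alpha, the fit at x
    resid = Y - Xx @ alpha
    Z = np.empty(d)
    for j in range(d):
        Dj = (U[:, j] ** 2) / h[j] ** 3               # diagonal of D_j for the Gaussian kernel
        # Z_j = e_1^T (Xx^T W Xx)^{-1} Xx^T W D_j (Y - Xx alpha)
        Z[j] = (A_inv @ (Xx.T @ (w * Dj * resid)))[0]
    return m_hat, Z
```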

55 (12) i=1 Then the hard thresholding version of the rodeo algorithm is given in Figure 2. [sent-140, score-0.611]

56 Then E(σ̂²) = σ² + bias. [Figure 2 box: Rodeo, hard thresholding version; step 1 follows.] [sent-146, score-0.154]

57 Select parameter 0 < β < 1 and initial bandwidth h0 slowly decreasing to zero, √ with h0 = Ω 1/ log log n . [sent-147, score-0.372]

58 Let cn = Ω(1) be a sequence satisfying dcn = Ω(log n). [sent-148, score-0.071]

59 Initialize the bandwidths, and activate all covariates: (a) hj = h0 , j = 1, 2, . [sent-150, score-0.367]

60 (c) If |Zj | ≥ λj , then set hj ← βhj , otherwise remove j from A. [sent-160, score-0.367]

61 Figure 2: The hard thresholding version of the rodeo, which can be applied using the derivatives Zj of any nonparametric smoother. [sent-166, score-0.258]
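
The steps summarized in sentences 56-61 amount to the following loop. This is a sketch, not the authors' code: it is written generically over a user-supplied routine that returns Zj together with an estimate sj of its standard deviation (the extracted summary omits the equations defining sj), and the threshold form λj = sj √(2 log(n·cn)) as well as the max_steps safeguard are assumptions made here. With the local linear sketch above supplying Zj, any reasonable plug-in estimate of sj can be used.

```python
import numpy as np

def rodeo_hard(zj_and_sj, d, n, beta=0.9, h0=1.0, cn=None, max_steps=200):
    """Hard-thresholding rodeo loop (sketch of Figure 2).
    zj_and_sj(j, h) -> (Zj, sj): the derivative statistic and an estimate of its
    standard deviation at the current bandwidth vector h; any smoother that can
    supply these derivatives can be plugged in.  Returns the selected bandwidths."""
    if cn is None:
        cn = max(np.log(n), 1.0)            # c_n = Omega(1); a placeholder choice
    h = np.full(d, h0, dtype=float)         # step 2(a): h_j = h_0 for every covariate
    active = set(range(d))                  # step 2(b): all covariates active
    for _ in range(max_steps):              # safeguard against non-termination (assumption)
        if not active:
            break
        for j in list(active):
            Zj, sj = zj_and_sj(j, h)
            lam = sj * np.sqrt(2.0 * np.log(n * cn))   # threshold lambda_j (assumed form)
            if abs(Zj) >= lam:
                h[j] *= beta                # step 3(c): shrink a seemingly relevant direction
            else:
                active.discard(j)           # otherwise freeze h_j and deactivate j
    return h                                # h*; the final estimate is m_{h*}(x)
```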

62 Note however that the bias is mitigated by sparsity (small r). [sent-169, score-0.069]

63 For convenience of notation we assume that the covariates are numbered such that the relevant variables xj correspond to 1 ≤ j ≤ r, and the irrelevant variables to j > r. [sent-173, score-0.354]

64 Suppose that K is a product kernel with bandwidth vector h = (h1 , . [sent-177, score-0.327]

65 More generally, assuming that r is bounded, we have the following when hj → 0: if j ∈ Rᶜ the derivative of the bias is μj = ∂E[mH(x) − m(x)]/∂hj = −tr(HR HR) ν2 (∇j log f(x))² hj + oP(hj), (13) where the Hessian of m(x) is H = [HR 0; 0 0] and HR = diag(h2, . [sent-182, score-0.83]

66 Let C = σ²R(K)/(4m(x)); then μj = hj ν2 mjj(x) + oP(hj). (15) [sent-188, score-0.509]

67 These lemmas parallel the calculations of Ruppert and Wand (1994), except that the irrelevant variables have different leading terms in the expansions than the relevant variables. [sent-194, score-0.164]

68 (A1) For some constant k > 0, each j > r satisfies ∇j log f(x) = O(log^k n / n^{1/4}); (A2) for each j ≤ r, mjj(x) ≠ 0. [sent-197, score-0.209]

69 It follows that μj = Aj hj + oP(hj) for j ≤ r and μj = Bj hj + oP(hj) for j > r, (18) where Aj = ν2 mjj(x) and Bj = −tr(HH) ν2 (∇j log f(x))². [sent-200, score-0.903]

70 More generally, µj is approximately proportional to (∇j log f (x))2 for j > r which implies that |µj | ≈ 0 for irrelevant variables if f is sufficiently smooth in the variable xj . [sent-204, score-0.243]

71 Hence, assumption (A1) can be interpreted as requiring that f is sufficiently smooth in the irrelevant dimensions. [sent-205, score-0.068]

72 Equation (18) ensures that µj is proportional to hj |mjj (x)| for small hj . [sent-207, score-0.734]

73 Since we take the initial bandwidth h0 to be decreasing slowly with n, (A2) implies that |μj(h)| ≥ c hj |mjj(x)| for some constant c > 0, for sufficiently large n. [sent-208, score-0.296]

74 In the following we write Yn = ÕP(an) to mean that Yn = OP(bn an) where bn is logarithmic in n; similarly, an = Ω̃(bn) if an = Ω(bn cn) where cn is logarithmic in n. [sent-209, score-0.201]

75 Then the number of iterations Tn until the rodeo stops satisfies P( (1/(4+r)) log_{1/β}(n an) ≤ Tn ≤ (1/(4+r)) log_{1/β}(n bn) ) → 1, (20) where an = Ω(1) and bn = O(1). [sent-214, score-0.502]

76 Moreover, the algorithm outputs bandwidths h⋆ that satisfy P( hj⋆ ≥ 1/log^k n for all j > r ) → 1 (21) and P( h0 (n bn)^{−1/(4+r)} ≤ hj⋆ ≤ h0 (n an)^{−1/(4+r)} for all j ≤ r ) → 1. [sent-215, score-0.287]

77 3, the risk R(h⋆) of the rodeo estimator satisfies R(h⋆) = OP(n^{−4/(4+r)}). (23) [sent-219, score-0.593]

78 In the one-dimensional case, this result shows that the algorithm recovers the locally optimal bandwidth, giving an adaptive estimator, and in general attains the optimal (up to logarithmic factors) minimax rate of convergence. [sent-220, score-0.126]

79 5 Some Examples and Extensions Figure 3 illustrates the rodeo on synthetic and real data. [sent-222, score-0.478]

80 The left plot shows the bandwidths obtained on a synthetic dataset with n = 500 points of dimension d = 20. [sent-223, score-0.28]

81 The covariates are generated as xi ∼ Uniform(0, 1), the true function is m(x) = 2(x1 +1)2 +2 sin(10x2 ), and σ = 1. [sent-224, score-0.155]

82 The results are averaged over 50 randomly generated data sets; note that the displayed bandwidth paths are not monotonic because of this averaging. [sent-225, score-0.329]

83 The plot shows how the bandwidths of the relevant variables shrink toward zero, while the bandwidths of the irrelevant variables remain large. [sent-226, score-0.665]

84 While we have focused on estimation of m locally at a point x, the idea can be extended to carry out global bandwidth and variable selection by averaging over multiple evaluation points x1 , . [sent-229, score-0.444]

85 In addition, it is possible to consider more general paths, for example using soft thresholding or changing only the bandwidth corresponding to the largest |Zj |/λj . [sent-234, score-0.449]

86 Such a version of the rodeo can be seen as a nonparametric counterpart to least angle regression (LARS) (Efron et al. [sent-235, score-0.662]

87 2004), a refinement of forward stagewise regression in which one adds the covariate most correlated with the residuals of the current fit, in small, incremental steps. [sent-236, score-0.103]

88 Reducing the bandwidth is like adding in more of that variable. [sent-238, score-0.296]

89 Suppose now that we make the following modifications to the rodeo: (i) change the bandwidths one at a time, based on the largest Zj* = |Zj|/λj, (ii) reduce the bandwidth continuously, rather than in discrete steps, until the largest Zj* is equal to the next largest. [sent-239, score-0.619]
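
A minimal sketch of this greedy variant follows; it is an illustration under stated assumptions rather than the authors' procedure. In particular, it shrinks the chosen bandwidth by a discrete factor β instead of the continuous reduction described in (ii), and the zj_and_lambda interface is invented here. The returned order list is what would be compared against the LARS variable ordering in Figure 3 (right).

```python
import numpy as np

def greedy_rodeo(zj_and_lambda, d, n_steps=100, beta=0.95, h0=1.0):
    """Greedy variant (sketch): at each step shrink only the bandwidth of the
    covariate with the largest Z*_j = |Z_j| / lambda_j.
    zj_and_lambda(h) -> (Z, lam): arrays of statistics and thresholds at bandwidths h.
    Returns the bandwidths and the order in which covariates were first chosen."""
    h = np.full(d, h0, dtype=float)
    order = []                               # first-selection order of the covariates
    for _ in range(n_steps):
        Z, lam = zj_and_lambda(h)
        ratios = np.abs(Z) / lam
        j = int(np.argmax(ratios))           # covariate with the largest Z*_j
        if j not in order:
            order.append(j)
        h[j] *= beta                         # discrete stand-in for the continuous reduction in (ii)
    return h, order
```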

90 Figure 3 (right) shows the result of running this greedy version of the rodeo on the diabetes dataset used to illustrate LARS. [sent-240, score-0.578]

91 The resulting variable ordering is seen to be very similar to, but different from, the ordering obtained from the parametric LARS fit. [sent-242, score-0.062]

92 Figure 3: Left: Average bandwidth output by the rodeo for a function with r = 2 relevant variables in d = 20 dimensions (n = 500, with 50 trials). [sent-272, score-0.812]

93 Covariates are generated as xi ∼ Uniform(0, 1), the true function is m(x) = 2(x1 + 1)³ + 2 sin(10x2), and σ = 1, fit at the test point x = (1/2, . . . , 1/2). [sent-273, score-0.049]

94 The variance is greater for large step sizes since the rodeo runs that long for fewer data sets. [sent-277, score-0.445]

95 Right: Greedy rodeo on the diabetes data, used to illustrate LARS (Efron et al. [sent-278, score-0.523]

96 A set of k = 100 of the total n = 442 points were sampled (d = 10), and the bandwidth for the variable with largest average |Zj |/λj was reduced in each step. [sent-280, score-0.357]

97 The parametric LARS algorithm adds variables in the order 3, 9, 4, 7, 2, 10, 5, 8, 6, 1. [sent-282, score-0.079]

98 Optimal spatial adaptation to inhomogeneous smoothness: An approach based on kernel estimates with variable bandwidth selectors. [sent-334, score-0.366]

99 Empirical-bias bandwidths for local polynomial nonparametric regression and density estimation. [sent-354, score-0.388]

100 Variable selection and model building via likelihood basis pursuit. [sent-374, score-0.056]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('rodeo', 0.445), ('hj', 0.367), ('mh', 0.324), ('bandwidth', 0.296), ('bandwidths', 0.247), ('zj', 0.238), ('wx', 0.184), ('xx', 0.178), ('mjj', 0.142), ('serum', 0.121), ('estimator', 0.117), ('covariates', 0.106), ('op', 0.099), ('nonparametric', 0.092), ('thresholding', 0.086), ('ruppert', 0.081), ('lasso', 0.081), ('hd', 0.08), ('xj', 0.077), ('dj', 0.071), ('irrelevant', 0.068), ('mn', 0.067), ('ds', 0.065), ('lars', 0.064), ('xnj', 0.061), ('bn', 0.057), ('minimax', 0.057), ('curse', 0.057), ('selection', 0.056), ('gj', 0.056), ('greedy', 0.055), ('hr', 0.054), ('annals', 0.05), ('regression', 0.049), ('xi', 0.049), ('kh', 0.048), ('smoothing', 0.044), ('efron', 0.042), ('hard', 0.042), ('logarithmic', 0.041), ('dcn', 0.04), ('diabetes', 0.04), ('fu', 0.04), ('hristache', 0.04), ('lepski', 0.04), ('logk', 0.04), ('nan', 0.04), ('nbn', 0.04), ('wand', 0.04), ('derivative', 0.039), ('sparsity', 0.039), ('relevant', 0.039), ('variable', 0.039), ('version', 0.038), ('et', 0.038), ('path', 0.037), ('sparseness', 0.036), ('breiman', 0.035), ('hlmann', 0.035), ('diag', 0.035), ('paths', 0.033), ('hastie', 0.033), ('yi', 0.033), ('synthetic', 0.033), ('change', 0.032), ('nitesimal', 0.032), ('rc', 0.032), ('variables', 0.032), ('cn', 0.031), ('kernel', 0.031), ('risk', 0.031), ('covariate', 0.03), ('bias', 0.03), ('xr', 0.028), ('tn', 0.028), ('effectively', 0.028), ('locally', 0.028), ('satis', 0.028), ('gy', 0.027), ('age', 0.027), ('log', 0.027), ('idea', 0.025), ('suppose', 0.025), ('lemmas', 0.025), ('spline', 0.025), ('replaces', 0.025), ('friedman', 0.025), ('tibshirani', 0.025), ('adds', 0.024), ('xn', 0.024), ('soft', 0.023), ('parametric', 0.023), ('uniform', 0.023), ('multivariate', 0.023), ('largest', 0.022), ('decreasing', 0.022), ('changing', 0.022), ('fix', 0.022), ('continuously', 0.022), ('sparse', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 168 nips-2005-Rodeo: Sparse Nonparametric Regression in High Dimensions

Author: Larry Wasserman, John D. Lafferty

Abstract: We present a method for nonparametric regression that performs bandwidth selection and variable selection simultaneously. The approach is based on the technique of incrementally decreasing the bandwidth in directions where the gradient of the estimator with respect to bandwidth is large. When the unknown function satisfies a sparsity condition, our approach avoids the curse of dimensionality, achieving the optimal minimax rate of convergence, up to logarithmic factors, as if the relevant variables were known in advance. The method—called rodeo (regularization of derivative expectation operator)—conducts a sequence of hypothesis tests, and is easy to implement. A modified version that replaces hard with soft thresholding effectively solves a sequence of lasso problems. 1

2 0.10794453 50 nips-2005-Convex Neural Networks

Author: Yoshua Bengio, Nicolas L. Roux, Pascal Vincent, Olivier Delalleau, Patrice Marcotte

Abstract: Convexity has recently received a lot of attention in the machine learning community, and the lack of convexity has been seen as a major disadvantage of many learning algorithms, such as multi-layer artificial neural networks. We show that training multi-layer neural networks in which the number of hidden units is learned can be viewed as a convex optimization problem. This problem involves an infinite number of variables, but can be solved by incrementally inserting a hidden unit at a time, each time finding a linear classifier that minimizes a weighted sum of errors. 1

3 0.1016801 38 nips-2005-Beyond Gaussian Processes: On the Distributions of Infinite Networks

Author: Ricky Der, Daniel D. Lee

Abstract: A general analysis of the limiting distribution of neural network functions is performed, with emphasis on non-Gaussian limits. We show that with i.i.d. symmetric stable output weights, and more generally with weights distributed from the normal domain of attraction of a stable variable, that the neural functions converge in distribution to stable processes. Conditions are also investigated under which Gaussian limits do occur when the weights are independent but not identically distributed. Some particularly tractable classes of stable distributions are examined, and the possibility of learning with such processes.

4 0.084476151 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation

Author: Alan L. Yuille

Abstract: We show that linear generalizations of Rescorla-Wagner can perform Maximum Likelihood estimation of the parameters of all generative models for causal reasoning. Our approach involves augmenting variables to deal with conjunctions of causes, similar to the agumented model of Rescorla. Our results involve genericity assumptions on the distributions of causes. If these assumptions are violated, for example for the Cheng causal power theory, then we show that a linear Rescorla-Wagner can estimate the parameters of the model up to a nonlinear transformtion. Moreover, a nonlinear Rescorla-Wagner is able to estimate the parameters directly to within arbitrary accuracy. Previous results can be used to determine convergence and to estimate convergence rates. 1

5 0.08435934 172 nips-2005-Selecting Landmark Points for Sparse Manifold Learning

Author: Jorge Silva, Jorge Marques, João Lemos

Abstract: There has been a surge of interest in learning non-linear manifold models to approximate high-dimensional data. Both for computational complexity reasons and for generalization capability, sparsity is a desired feature in such models. This usually means dimensionality reduction, which naturally implies estimating the intrinsic dimension, but it can also mean selecting a subset of the data to use as landmarks, which is especially important because many existing algorithms have quadratic complexity in the number of observations. This paper presents an algorithm for selecting landmarks, based on LASSO regression, which is well known to favor sparse approximations because it uses regularization with an l1 norm. As an added benefit, a continuous manifold parameterization, based on the landmarks, is also found. Experimental results with synthetic and real data illustrate the algorithm. 1

6 0.083479151 77 nips-2005-From Lasso regression to Feature vector machine

7 0.083041415 182 nips-2005-Statistical Convergence of Kernel CCA

8 0.082024664 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines

9 0.08077658 47 nips-2005-Consistency of one-class SVM and related algorithms

10 0.080097482 79 nips-2005-Fusion of Similarity Data in Clustering

11 0.077562816 49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations

12 0.077278405 31 nips-2005-Asymptotics of Gaussian Regularized Least Squares

13 0.068924844 163 nips-2005-Recovery of Jointly Sparse Signals from Few Random Projections

14 0.066102177 82 nips-2005-Generalization Error Bounds for Aggregation by Mirror Descent with Averaging

15 0.061525781 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning

16 0.058224056 74 nips-2005-Faster Rates in Regression via Active Learning

17 0.055283524 44 nips-2005-Computing the Solution Path for the Regularized Support Vector Regression

18 0.051771712 138 nips-2005-Non-Local Manifold Parzen Windows

19 0.04756964 51 nips-2005-Correcting sample selection bias in maximum entropy density estimation

20 0.046576191 16 nips-2005-A matching pursuit approach to sparse Gaussian process regression


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.161), (1, 0.072), (2, -0.05), (3, -0.079), (4, 0.077), (5, 0.052), (6, -0.038), (7, -0.081), (8, 0.043), (9, -0.038), (10, -0.003), (11, 0.105), (12, 0.011), (13, 0.059), (14, -0.006), (15, 0.219), (16, -0.058), (17, 0.168), (18, 0.059), (19, -0.04), (20, -0.043), (21, 0.027), (22, -0.064), (23, -0.144), (24, -0.037), (25, -0.093), (26, 0.004), (27, -0.039), (28, -0.065), (29, -0.076), (30, 0.041), (31, -0.084), (32, 0.025), (33, -0.012), (34, -0.105), (35, -0.094), (36, -0.066), (37, 0.002), (38, -0.083), (39, 0.074), (40, -0.024), (41, 0.047), (42, 0.008), (43, 0.074), (44, -0.072), (45, 0.155), (46, 0.048), (47, -0.033), (48, 0.009), (49, 0.05)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94373381 168 nips-2005-Rodeo: Sparse Nonparametric Regression in High Dimensions

Author: Larry Wasserman, John D. Lafferty

Abstract: We present a method for nonparametric regression that performs bandwidth selection and variable selection simultaneously. The approach is based on the technique of incrementally decreasing the bandwidth in directions where the gradient of the estimator with respect to bandwidth is large. When the unknown function satisfies a sparsity condition, our approach avoids the curse of dimensionality, achieving the optimal minimax rate of convergence, up to logarithmic factors, as if the relevant variables were known in advance. The method—called rodeo (regularization of derivative expectation operator)—conducts a sequence of hypothesis tests, and is easy to implement. A modified version that replaces hard with soft thresholding effectively solves a sequence of lasso problems. 1

2 0.65051299 182 nips-2005-Statistical Convergence of Kernel CCA

Author: Kenji Fukumizu, Arthur Gretton, Francis R. Bach

Abstract: While kernel canonical correlation analysis (kernel CCA) has been applied in many problems, the asymptotic convergence of the functions estimated from a finite sample to the true functions has not yet been established. This paper gives a rigorous proof of the statistical convergence of kernel CCA and a related method (NOCCO), which provides a theoretical justification for these methods. The result also gives a sufficient condition on the decay of the regularization coefficient in the methods to ensure convergence. 1

3 0.58450067 38 nips-2005-Beyond Gaussian Processes: On the Distributions of Infinite Networks

Author: Ricky Der, Daniel D. Lee

Abstract: A general analysis of the limiting distribution of neural network functions is performed, with emphasis on non-Gaussian limits. We show that with i.i.d. symmetric stable output weights, and more generally with weights distributed from the normal domain of attraction of a stable variable, that the neural functions converge in distribution to stable processes. Conditions are also investigated under which Gaussian limits do occur when the weights are independent but not identically distributed. Some particularly tractable classes of stable distributions are examined, and the possibility of learning with such processes.

4 0.52153373 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation

Author: Alan L. Yuille

Abstract: We show that linear generalizations of Rescorla-Wagner can perform Maximum Likelihood estimation of the parameters of all generative models for causal reasoning. Our approach involves augmenting variables to deal with conjunctions of causes, similar to the agumented model of Rescorla. Our results involve genericity assumptions on the distributions of causes. If these assumptions are violated, for example for the Cheng causal power theory, then we show that a linear Rescorla-Wagner can estimate the parameters of the model up to a nonlinear transformtion. Moreover, a nonlinear Rescorla-Wagner is able to estimate the parameters directly to within arbitrary accuracy. Previous results can be used to determine convergence and to estimate convergence rates. 1

5 0.49360973 77 nips-2005-From Lasso regression to Feature vector machine

Author: Fan Li, Yiming Yang, Eric P. Xing

Abstract: Lasso regression tends to assign zero weights to most irrelevant or redundant features, and hence is a promising technique for feature selection. Its limitation, however, is that it only offers solutions to linear models. Kernel machines with feature scaling techniques have been studied for feature selection with non-linear models. However, such approaches require to solve hard non-convex optimization problems. This paper proposes a new approach named the Feature Vector Machine (FVM). It reformulates the standard Lasso regression into a form isomorphic to SVM, and this form can be easily extended for feature selection with non-linear models by introducing kernels defined on feature vectors. FVM generates sparse solutions in the nonlinear feature space and it is much more tractable compared to feature scaling kernel machines. Our experiments with FVM on simulated data show encouraging results in identifying the small number of dominating features that are non-linearly correlated to the response, a task the standard Lasso fails to complete.

6 0.47554243 172 nips-2005-Selecting Landmark Points for Sparse Manifold Learning

7 0.4718762 31 nips-2005-Asymptotics of Gaussian Regularized Least Squares

8 0.4706839 44 nips-2005-Computing the Solution Path for the Regularized Support Vector Regression

9 0.4650273 49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations

10 0.43179008 50 nips-2005-Convex Neural Networks

11 0.39435583 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations

12 0.35756314 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction

13 0.35663903 47 nips-2005-Consistency of one-class SVM and related algorithms

14 0.35363883 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm

15 0.35352123 79 nips-2005-Fusion of Similarity Data in Clustering

16 0.3526502 205 nips-2005-Worst-Case Bounds for Gaussian Process Models

17 0.34570751 86 nips-2005-Generalized Nonnegative Matrix Approximations with Bregman Divergences

18 0.33523551 155 nips-2005-Predicting EMG Data from M1 Neurons with Variational Bayesian Least Squares

19 0.32975388 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines

20 0.32740429 163 nips-2005-Recovery of Jointly Sparse Signals from Few Random Projections


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.044), (10, 0.036), (27, 0.026), (31, 0.046), (34, 0.078), (39, 0.012), (41, 0.014), (55, 0.035), (65, 0.029), (69, 0.059), (73, 0.029), (77, 0.017), (88, 0.077), (90, 0.319), (91, 0.085)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.86725724 182 nips-2005-Statistical Convergence of Kernel CCA

Author: Kenji Fukumizu, Arthur Gretton, Francis R. Bach

Abstract: While kernel canonical correlation analysis (kernel CCA) has been applied in many problems, the asymptotic convergence of the functions estimated from a finite sample to the true functions has not yet been established. This paper gives a rigorous proof of the statistical convergence of kernel CCA and a related method (NOCCO), which provides a theoretical justification for these methods. The result also gives a sufficient condition on the decay of the regularization coefficient in the methods to ensure convergence. 1

same-paper 2 0.75713372 168 nips-2005-Rodeo: Sparse Nonparametric Regression in High Dimensions

Author: Larry Wasserman, John D. Lafferty

Abstract: We present a method for nonparametric regression that performs bandwidth selection and variable selection simultaneously. The approach is based on the technique of incrementally decreasing the bandwidth in directions where the gradient of the estimator with respect to bandwidth is large. When the unknown function satisfies a sparsity condition, our approach avoids the curse of dimensionality, achieving the optimal minimax rate of convergence, up to logarithmic factors, as if the relevant variables were known in advance. The method—called rodeo (regularization of derivative expectation operator)—conducts a sequence of hypothesis tests, and is easy to implement. A modified version that replaces hard with soft thresholding effectively solves a sequence of lasso problems. 1

3 0.6482482 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning

Author: Zhenyue Zhang, Hongyuan Zha

Abstract: We propose a fast manifold learning algorithm based on the methodology of domain decomposition. Starting with the set of sample points partitioned into two subdomains, we develop the solution of the interface problem that can glue the embeddings on the two subdomains into an embedding on the whole domain. We provide a detailed analysis to assess the errors produced by the gluing process using matrix perturbation theory. Numerical examples are given to illustrate the efficiency and effectiveness of the proposed methods. 1

4 0.46220851 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach

Author: O. P. Kreidl, Alan S. Willsky

Abstract: Given a directed graphical model with binary-valued hidden nodes and real-valued noisy observations, consider deciding upon the maximum a-posteriori (MAP) or the maximum posterior-marginal (MPM) assignment under the restriction that each node broadcasts only to its children exactly one single-bit message. We present a variational formulation, viewing the processing rules local to all nodes as degrees-of-freedom, that minimizes the loss in expected (MAP or MPM) performance subject to such online communication constraints. The approach leads to a novel message-passing algorithm to be executed offline, or before observations are realized, which mitigates the performance loss by iteratively coupling all rules in a manner implicitly driven by global statistics. We also provide (i) illustrative examples, (ii) assumptions that guarantee convergence and efficiency and (iii) connections to active research areas. 1

5 0.4539516 90 nips-2005-Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs

Author: Firas Hamze, Nando de Freitas

Abstract: This paper presents a new sampling algorithm for approximating functions of variables representable as undirected graphical models of arbitrary connectivity with pairwise potentials, as well as for estimating the notoriously difficult partition function of the graph. The algorithm fits into the framework of sequential Monte Carlo methods rather than the more widely used MCMC, and relies on constructing a sequence of intermediate distributions which get closer to the desired one. While the idea of using “tempered” proposals is known, we construct a novel sequence of target distributions where, rather than dropping a global temperature parameter, we sequentially couple individual pairs of variables that are, initially, sampled exactly from a spanning tree of the variables. We present experimental results on inference and estimation of the partition function for sparse and densely-connected graphs.

6 0.45377666 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation

7 0.45173314 30 nips-2005-Assessing Approximations for Gaussian Process Classification

8 0.44855645 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations

9 0.4477911 201 nips-2005-Variational Bayesian Stochastic Complexity of Mixture Models

10 0.44078118 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification

11 0.44023129 78 nips-2005-From Weighted Classification to Policy Search

12 0.438898 105 nips-2005-Large-Scale Multiclass Transduction

13 0.43709001 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models

14 0.43693846 197 nips-2005-Unbiased Estimator of Shape Parameter for Spiking Irregularities under Changing Environments

15 0.43678826 66 nips-2005-Estimation of Intrinsic Dimensionality Using High-Rate Vector Quantization

16 0.43677679 74 nips-2005-Faster Rates in Regression via Active Learning

17 0.4360508 54 nips-2005-Data-Driven Online to Batch Conversions

18 0.43576404 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction

19 0.43407145 133 nips-2005-Nested sampling for Potts models

20 0.43403316 98 nips-2005-Infinite latent feature models and the Indian buffet process