nips nips2009 nips2009-81 knowledge-graph by maker-knowledge-mining

81 nips-2009-Ensemble Nystrom Method


Source: pdf

Author: Sanjiv Kumar, Mehryar Mohri, Ameet Talwalkar

Abstract: A crucial technique for scaling kernel methods to very large data sets reaching or exceeding millions of instances is based on low-rank approximation of kernel matrices. We introduce a new family of algorithms based on mixtures of Nyström approximations, ensemble Nyström algorithms, that yield more accurate low-rank approximations than the standard Nyström method. We give a detailed study of variants of these algorithms based on simple averaging, an exponential weight method, or regression-based methods. We also present a theoretical analysis of these algorithms, including novel error bounds guaranteeing a better convergence rate than the standard Nyström method. Finally, we report results of extensive experiments with several data sets containing up to 1M points demonstrating the significant improvement over the standard Nyström approximation.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: A crucial technique for scaling kernel methods to very large data sets reaching or exceeding millions of instances is based on low-rank approximation of kernel matrices. [sent-6, score-0.117]

2 We introduce a new family of algorithms based on mixtures of Nyström approximations, ensemble Nyström algorithms, that yield more accurate low-rank approximations than the standard Nyström method. [sent-7, score-0.36]

3 We give a detailed study of variants of these algorithms based on simple averaging, an exponential weight method, or regression-based methods. [sent-8, score-0.056]

4 We also present a theoretical analysis of these algorithms, including novel error bounds guaranteeing a better convergence rate than the standard Nyström method. [sent-9, score-0.08]

5 But, several standard learning algorithms such as support vector machines (SVMs) [2, 4], kernel ridge regression (KRR) [14], kernel principal component analysis (KPCA) [15], manifold learning [13], or other kernel-based algorithms do not scale to such orders of magnitude. [sent-12, score-0.4]

6 One solution to deal with such large data sets is to use an approximation of the kernel matrix. [sent-14, score-0.054]

7 As shown by [18], and later by [6, 17, 19], low-rank approximations of the kernel matrix using the Nyström method can provide an effective technique for tackling large-scale data sets with no significant decrease in performance. [sent-15, score-0.123]

8 This motivates our search for further improved low-rank approximations that can scale to such orders of magnitude and generate accurate approximations. [sent-17, score-0.054]

9 We show that a new family of algorithms based on mixtures of Nyström approximations, ensemble Nyström algorithms, yields more accurate low-rank approximations than the standard Nyström method. [sent-18, score-0.36]

10 Moreover, these ensemble algorithms naturally fit within distributed computing environments, where their computational cost is roughly the same as that of the standard Nyström method. [sent-19, score-0.29]

11 Section 2 gives an overview of the Nyström low-rank approximation method and describes our ensemble Nyström algorithms. [sent-22, score-0.335]

12 We describe several variants of these algorithms, including one based on simple averaging of p Nyström solutions, an exponential weight method, and a regression method which consists of estimating the mixture parameters of the ensemble using a few columns sampled from the matrix. [sent-23, score-0.506]

13 In Section 3, we present a theoretical analysis of ensemble Nyström algorithms, namely bounds on the reconstruction error for both the Frobenius norm and the spectral norm. [sent-24, score-0.455]

14 Section 4 reports the results of extensive experiments with these algorithms on several data sets containing up to 1M points, comparing different variants of our ensemble Nyström algorithms and demonstrating the performance improvements gained over the standard Nyström method. [sent-26, score-0.337]

15 2 Algorithm: We first give a brief overview of the Nyström low-rank approximation method, introduce the notation used in the following sections, and then describe our ensemble Nyström algorithms. [sent-27, score-0.315]

16 The Nyström approximation of a symmetric positive semidefinite (SPSD) matrix K is based on a sample of m ≪ n columns of K [5, 18]. [sent-30, score-0.137]

17 Let C denote the n × m matrix formed by these columns and W the m × m matrix consisting of the intersection of these m columns with the corresponding m rows of K. [sent-31, score-0.207]

18 The columns and rows of K can be rearranged based on this sampling so that K and C can be written in block form as K = [W K21⊤; K21 K22] and C = [W; K21]. [sent-32, score-0.091]
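To make the block structure above concrete, here is a minimal numpy sketch (an illustration, not code from the paper; the function and variable names are our own) that extracts C and W from a set of sampled column indices of an SPSD matrix K:

```python
import numpy as np

def nystrom_blocks(K, idx):
    """For sampled column indices idx, return the Nystrom blocks:
    C (n x m, the sampled columns of K) and W (m x m, the intersection
    of those columns with the corresponding rows of K)."""
    C = K[:, idx]            # n x m
    W = K[np.ix_(idx, idx)]  # m x m
    return C, W
```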

19 We assume a fixed kernel function K : X ×X → R that can be used to generate the entries of a kernel matrix K. [sent-44, score-0.073]

20 The learner receives a sample S of mp columns randomly selected from matrix K uniformly without replacement. [sent-45, score-0.134]

21 Each subsample Sr, r ∈ [1, p], contains m columns and is used to define a rank-k Nyström approximation Kr. [sent-50, score-0.119]

22 Dropping the rank subscript k in favor of the sample index r, Kr can be written as Kr = Cr Wr^+ Cr⊤, where Cr and Wr denote the matrices formed from the columns of Sr, and Wr^+ is the pseudo-inverse of the rank-k approximation of Wr. [sent-51, score-0.145]
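A minimal sketch of the rank-k Nyström approximation just described, assuming numpy and the nystrom_blocks helper from the previous snippet (names are illustrative, not from the paper); the eigendecomposition route relies on W being SPSD:

```python
import numpy as np

def nystrom_rank_k(K, idx, k):
    """Rank-k Nystrom approximation K_r = C W_k^+ C^T, where W_k^+ is the
    pseudo-inverse of the best rank-k approximation of W."""
    C, W = nystrom_blocks(K, idx)
    vals, vecs = np.linalg.eigh(W)              # W is SPSD, so eigh applies
    top = np.argsort(vals)[::-1][:k]            # indices of the k largest eigenvalues
    inv_vals = np.where(vals[top] > 1e-12, 1.0 / vals[top], 0.0)
    W_k_pinv = (vecs[:, top] * inv_vals) @ vecs[:, top].T
    return C @ W_k_pinv @ C.T
```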

23 The learner further receives a sample V of s columns used to determine the weight µr ∈ R attributed to each expert Kr . [sent-52, score-0.149]

24 Thus, the general form of the approximation of K generated by the ensemble Nyström algorithm is Kens = Σ_{r=1}^p µr Kr. (3) [sent-53, score-0.315]
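The combination in (3) can be sketched as follows (a hedged illustration, not the authors' code; nystrom_rank_k is the helper above, and uniform weights recover simple averaging):

```python
import numpy as np

def ensemble_nystrom(K, p, m, k, weights=None, rng=None):
    """K_ens = sum_r mu_r * K_r, where each expert K_r is a rank-k Nystrom
    approximation built from its own subsample S_r of m columns."""
    rng = np.random.default_rng(rng)
    cols = rng.choice(K.shape[0], size=p * m, replace=False)  # p*m columns, no replacement
    experts = [nystrom_rank_k(K, cols[r * m:(r + 1) * m], k) for r in range(p)]
    if weights is None:
        weights = np.full(p, 1.0 / p)  # uniform mixture: simple averaging
    K_ens = sum(mu * K_r for mu, K_r in zip(weights, experts))
    return K_ens, experts
```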

25 The mixture weights µr can be defined in many ways. [sent-54, score-0.058]

26 Nevertheless, this simple uniform method already generates a solution superior to any one of the approximations Kr used in the combination, as we shall see in the experimental section. [sent-57, score-0.107]

27 The choice of the mixture weights here is similar to those used in the weighted-majority algorithm [11]. [sent-62, score-0.058]

28 Let K_V denote the matrix formed by using the samples from V as its columns, and let K_V^r denote the submatrix of Kr containing the columns corresponding to the columns in V. [sent-63, score-0.269]

29 The reconstruction error ǫr = ||K_V^r − K_V|| can be directly computed from these matrices. [sent-64, score-0.055]
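One way to turn these reconstruction errors into weights, in the spirit of the exponential weight method, is a weighted-majority-style rule µr ∝ exp(−η ǫr). The exact error norm and normalization used in the paper may differ, so the following numpy sketch (with illustrative names) is only an assumption-laden example:

```python
import numpy as np

def exponential_weights(K, experts, v_idx, eta):
    """mu_r proportional to exp(-eta * err_r), where err_r is the Frobenius
    reconstruction error of expert r on the validation columns V (assumed
    choice of norm; weights normalized to sum to one)."""
    K_V = K[:, v_idx]
    errs = np.array([np.linalg.norm(Kr[:, v_idx] - K_V, 'fro') for Kr in experts])
    w = np.exp(-eta * errs)
    return w / w.sum()
```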

30 This can be viewed as a ridge regression objective function and admits a closed form solution. [sent-66, score-0.298]

31 We will refer to this method as the ridge regression method. [sent-67, score-0.318]
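A sketch of this ridge regression step, under the assumption that the objective is min over µ of λ||µ||^2 + ||Σr µr K_V^r − K_V||_F^2 (one "feature" per expert, namely the vectorized K_V^r); the names and exact regularization form are illustrative rather than taken from the paper:

```python
import numpy as np

def ridge_weights(K, experts, v_idx, lam):
    """Closed-form ridge regression for the mixture weights mu."""
    target = K[:, v_idx].ravel()                                    # vec(K_V)
    A = np.stack([Kr[:, v_idx].ravel() for Kr in experts], axis=1)  # columns: vec(K_V^r)
    p = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(p), A.T @ target)
```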

32 The total complexity of the ensemble Nyström algorithm is O(pm^3 + pmkn + Cµ), where Cµ is the cost of computing the mixture weights µ used to combine the p Nyström approximations. [sent-68, score-0.32]

33 In general, the cubic term dominates the complexity, since the mixture weights can be computed in constant time for the uniform method, in O(psn) for the exponential weight method, or in O(p^3 + pms) for the ridge regression method. [sent-69, score-0.394]

34 Furthermore, although the ensemble Nyström algorithm requires p times more space and CPU cycles than the standard Nyström method, these additional requirements are quite reasonable in practice. [sent-70, score-0.306]

35 3 Theoretical analysis: We now present a theoretical analysis of the ensemble Nyström method, for which we use as tools some results previously shown by [5] and [9]. [sent-74, score-0.321]

36 As in [9], we shall use the following generalization of McDiarmid’s concentration bound to sampling without replacement [3]. [sent-75, score-0.078]

37 Let Z1, . . . , Zm be a sequence of random variables sampled uniformly without replacement from a fixed set of m + u elements Z, and let φ : Z^m → R be a symmetric function such that for all i ∈ [1, m] and for all z1, . . . [sent-80, score-0.05]

38 We define the selection matrix corresponding to a sample of m columns as the matrix S ∈ R^{n×m} defined by Sii = 1 if the ith column of K is among those sampled, Sij = 0 otherwise. [sent-97, score-0.127]

39 Thus, C = KS is the matrix formed by the columns sampled. [sent-98, score-0.115]

40 Note that these bounds are similar to the bounds in Theorem 3 in [9], though in this work we give new results for the spectral norm and present a tighter Lipschitz condition (9), the latter of which is needed to derive tighter bounds in Section 3. [sent-103, score-0.185]

41 Let K̃ denote the rank-k Nyström approximation of K based on m columns sampled uniformly at random without replacement from K, and Kk the best rank-k approximation of K. [sent-106, score-0.195]

42 To bound the norm-2 error of the Nyström method in the scenario of sampling without replacement, we start with the following general inequality given by [5][proof of Lemma 4]: ||K − K̃||_2 ≤ ||K − Kk||_2 + 2 ||XX⊤ − ZZ⊤||_2, (6) where Z = √(n/m) XS. [sent-109, score-0.138]

43 Let S′ be a sampling matrix selecting the same columns as S except for one, and let Z′ denote √(n/m) XS′. [sent-111, score-0.106]

44 Let z and z′ denote the only differing columns of Z and Z′; then |φ(S′) − φ(S)| ≤ ||z′z′⊤ − zz⊤||_2 = ||(z′ − z)z′⊤ + z(z′ − z)⊤||_2 ≤ 2 ||z′ − z||_2 max{||z||_2, ||z′||_2}. (7) [sent-112, score-0.093]

45 The norm of the difference of two columns of X can be viewed as the norm of the difference of two feature vectors associated to K and thus can be bounded by dK. [sent-114, score-0.139]

46 The following general inequality holds for the Frobenius error of the Nyström method [5]: ||K − K̃||_F^2 ≤ ||K − Kk||_F^2 + √(64k) ||XX⊤ − ZZ⊤||_F nKmax. [sent-119, score-0.081]

47 2 Error bounds for the ensemble Nyström method: The following error bounds hold for ensemble Nyström methods based on a convex combination of Nyström approximations. [sent-122, score-0.696]

48 Let S be a sample of pm columns drawn uniformly at random without replacement from K, decomposed into p subsamples of size m, S1 , . [sent-124, score-0.215]

49 For r ∈ [1, p], let Kr denote the rank-k Nyström approximation of K based on the sample Sr, and let Kk denote the best rank-k approximation of K. [sent-128, score-0.088]

50 Plugging in this upper bound and the Lipschitz bound (15) in Theorem 1 yields our norm-2 bound for the ensemble Nyström method. [sent-137, score-0.356]

51 For the Frobenius error bound, using the convexity of the Frobenius norm squared ||·||_F^2 and the general inequality (11), we can write ||K − Kens||_F^2 = ||Σ_{r=1}^p µr (K − Kr)||_F^2 ≤ Σ_{r=1}^p µr ||K − Kr||_F^2 ≤ ||K − Kk||_F^2 + √(64k) Σ_{r=1}^p µr ||XX⊤ − Zr Zr⊤||_F nKmax. (16) [sent-138, score-0.092]

52 However, the bounds for the ensemble Nyström are tighter than those for any Nyström expert based on a single sample of size m, even for a uniform weighting. [sent-142, score-0.394]

53 In particular, for µr = 1/p, the last term of the ensemble bound for the norm-2 is smaller by a factor µmax p^{1/2} = 1/√p. [sent-143, score-0.312]

54 4 Experiments: In this section, we present experimental results that illustrate the performance of the ensemble Nyström method. [sent-144, score-0.29]

55 In Section 4.1, we compare the performance of various methods for calculating the mixture weights (µr). [sent-147, score-0.058]

56 Throughout our experiments, we measure the accuracy of a low-rank approximation K̃ by calculating the relative error in Frobenius and spectral norms; that is, if we let ξ ∈ {2, F}, then we calculate the following quantity: % error = ||K − K̃||_ξ / ||K||_ξ × 100. [sent-150, score-0.163]
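As a quick reference, this metric can be computed with a few lines of numpy (an illustrative sketch; norm='fro' gives the Frobenius variant and norm=2 the spectral one):

```python
import numpy as np

def percent_error(K, K_approx, norm='fro'):
    """Relative approximation error, in percent, in the chosen matrix norm."""
    return 100.0 * np.linalg.norm(K - K_approx, norm) / np.linalg.norm(K, norm)
```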

57 1 Ensemble Nyström with various mixture weights: In this set of experiments, we show results for our ensemble Nyström method using different techniques to choose the mixture weights, as discussed in Section 2. [sent-154, score-0.426]

58 For each dataset, we fixed the reduced rank to k = 50, and set the number of sampled columns to m = 3% n. [sent-157, score-0.09]

59 Furthermore, for the exponential and the ridge regression variants, we sampled an additional set of s = 20 columns and used an additional 20 columns (s′) as a hold-out set for selecting the optimal values of η and λ. [sent-158, score-0.494]

60 As a baseline, we also measured the minimal and mean percent error across the p Nyström approximations used to construct Kens. [sent-160, score-0.23]

61 For the Frobenius norm, we also calculated the performance when using the optimal µ; that is, we used least-squares regression to find the best possible choice of combination weights for a fixed set of p approximations by setting s = n. [sent-161, score-0.14]

62 The results of these experiments are presented in Figure 1 for the Frobenius norm and in Figure 2 for the spectral norm. [sent-162, score-0.072]

63 These results clearly show that the ensemble Nyström performance is significantly better than any of the individual Nyström approximations. [sent-163, score-0.29]

64 Furthermore, the ridge regression technique is the best of the proposed techniques and generates nearly the optimal solution in terms of the percent error in Frobenius norm. [sent-164, score-0.541]

65 We also observed that when s is increased to approximately 5% to 10% of n, linear regression without any regularization performs about as well as ridge regression for both the Frobenius and spectral norm. [sent-165, score-0.375]

66 Figure 3 shows this comparison between linear regression and ridge regression for varying values of s using a fixed number of experts (p = 10). [sent-166, score-0.356]

67 Finally, we note that the ensemble Nyström method tends to converge very quickly, and the most significant gain in performance occurs as p increases from 2 to 10. [sent-167, score-0.31]

68 2 Large-scale experiments: Next, we present an empirical study of the effectiveness of the ensemble Nyström method on the SIFT-1M dataset in Table 1, containing 1 million data points. [sent-169, score-0.31]

69 We present results comparing the performance of the ensemble Nyström method, using both uniform and ridge regression mixture weights, with that of the best and mean performance across the p Nyström approximations used to construct Kens. [sent-171, score-0.706]

70 We also make comparisons with a recently proposed k-means based sampling technique for the Nyström method [19]. [sent-172, score-0.052]

71 Although the k-means technique is quite effective at generating informative columns by exploiting the data distribution, the cost of performing k-means becomes expensive for even moderately sized datasets, making it difficult to use in large-scale settings. [sent-173, score-0.095]

72 Nevertheless, in this work, we include the k-means method in our comparison, and we present results for various subsamples of the SIFT-1M dataset, with n ranging from 5K to 1M. [sent-174, score-0.049]

73 To do this, we first searched for an appropriate m such that the percent error for the ensemble Nyström method with ridge weights was approximately 10%, and measured the time required by the cluster to construct this approximation. [sent-176, score-0.773]

74–82 [Figure residue from Figures 1 and 2: 'Ensemble Method' panels for the different datasets (e.g., AB−S, DEXT), plotting Percent Error (Frobenius in Figure 1, Spectral in Figure 2) against the number of base learners (p), with curves for the mean and best ('b.') base learner and for the 'uni', 'exp', 'ridge', and 'optimal' mixture weights; a surviving caption fragment refers to the performance of the p base learners used to create the ensemble approximation.]

83 The results of this experiment, presented in Figure 4, clearly show that the ensemble Nyström method is the most effective technique given a fixed amount of time. [sent-254, score-0.328]

84 Furthermore, even with the small values of s and s′, ensemble Nyström with ridge-regression weighting outperforms the uniform ensemble Nyström method. [sent-255, score-0.593]

85 It generates an approximation that is worse than the mean standard Nyström approximation, and its performance increasingly deteriorates as n approaches 1M. [sent-257, score-0.058]

86–90 [Figure residue from Figure 3: 'Effect of Ridge' panels for the different datasets, plotting Percent Error (Frobenius) against the relative size of the validation set for 'no-ridge', 'ridge', and 'optimal' weights.] Figure 3: Comparison of percent error in Frobenius norm for the ensemble Nyström method with p = 10 experts, with weights derived from linear regression ('no-ridge') and ridge regression ('ridge'). [sent-278, score-0.952]

91 [Figure residue: curves 'uni', 'ridge', and 'kmeans' plotted against the size of dataset (n).] Figure 4: Large-scale performance comparison with SIFT-1M dataset. [sent-285, score-0.313]

92 Given fixed computational time, ensemble Nyström with ridge weights tends to outperform other techniques. [sent-286, score-0.052]

93 Though the space requirements are 10 times greater for ensemble Nyström in comparison to standard Nyström (since p = 10 in this experiment), the space constraints are nonetheless quite reasonable. [sent-287, score-0.306]

94 For instance, when working with the full 1M points, the ensemble Nyström method with ridge regression weights only required approximately 1% of the columns of K to achieve a percent error of 10%. [sent-288, score-0.871]

95 5 Conclusion: We presented a novel family of algorithms, ensemble Nyström algorithms, for accurate low-rank approximations in large-scale applications. [sent-289, score-0.303]

96 The consistent and significant performance improvement across a number of different data sets, along with the fact that these algorithms can be easily parallelized, suggests that these algorithms can benefit a variety of applications where kernel methods are used. [sent-290, score-0.061]

97 Interestingly, the algorithmic solution we have proposed for scaling these kernel learning algorithms to larger scales is itself derived from the machine learning idea of ensemble methods. [sent-291, score-0.335]

98 We expect that finer error bounds and theoretical guarantees will further guide the design of the ensemble algorithms and help us gain a better insight about the convergence properties of our algorithms. [sent-293, score-0.386]

99 Using the Nyström method to speed up kernel machines. [sent-402, score-0.049]

100 Improved Nyström low-rank approximation and error analysis. [sent-408, score-0.067]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('nystr', 0.81), ('ensemble', 0.29), ('ridge', 0.262), ('frobenius', 0.137), ('percent', 0.131), ('kr', 0.121), ('kk', 0.108), ('zr', 0.105), ('xx', 0.102), ('kmax', 0.095), ('kens', 0.095), ('columns', 0.077), ('zz', 0.076), ('learners', 0.074), ('nkmax', 0.059), ('kv', 0.059), ('uni', 0.051), ('dext', 0.047), ('sr', 0.046), ('error', 0.042), ('dk', 0.042), ('approximations', 0.041), ('spectral', 0.041), ('pm', 0.04), ('wk', 0.04), ('pie', 0.038), ('base', 0.037), ('regression', 0.036), ('spsd', 0.036), ('zm', 0.035), ('wr', 0.034), ('ess', 0.032), ('norm', 0.031), ('mixture', 0.03), ('mohri', 0.03), ('kernel', 0.029), ('subsamples', 0.029), ('expert', 0.028), ('weights', 0.028), ('bounds', 0.027), ('validation', 0.027), ('mnist', 0.027), ('replacement', 0.026), ('approximation', 0.025), ('ameet', 0.024), ('formed', 0.023), ('experts', 0.022), ('bound', 0.022), ('theorem', 0.022), ('exp', 0.022), ('reo', 0.021), ('kii', 0.021), ('nmk', 0.021), ('talwalkar', 0.021), ('kumar', 0.02), ('method', 0.02), ('sample', 0.02), ('rbf', 0.019), ('inequality', 0.019), ('max', 0.019), ('best', 0.018), ('technique', 0.018), ('generates', 0.017), ('subsample', 0.017), ('optimal', 0.017), ('ab', 0.017), ('shall', 0.016), ('millions', 0.016), ('tighter', 0.016), ('cortes', 0.016), ('differing', 0.016), ('algorithms', 0.016), ('requirements', 0.016), ('mean', 0.016), ('inequalities', 0.016), ('courant', 0.015), ('cr', 0.015), ('matrix', 0.015), ('variants', 0.015), ('parallelized', 0.014), ('ui', 0.014), ('sampling', 0.014), ('datasets', 0.014), ('sp', 0.013), ('reconstruction', 0.013), ('relative', 0.013), ('uniform', 0.013), ('sampled', 0.013), ('weight', 0.013), ('accurate', 0.013), ('machines', 0.012), ('zi', 0.012), ('decomposed', 0.012), ('exponential', 0.012), ('uniformly', 0.011), ('lipschitz', 0.011), ('simplex', 0.011), ('learner', 0.011), ('svd', 0.011), ('theoretical', 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999958 81 nips-2009-Ensemble Nystrom Method

Author: Sanjiv Kumar, Mehryar Mohri, Ameet Talwalkar

Abstract: A crucial technique for scaling kernel methods to very large data sets reaching or exceeding millions of instances is based on low-rank approximation of kernel matrices. We introduce a new family of algorithms based on mixtures of Nyström approximations, ensemble Nyström algorithms, that yield more accurate low-rank approximations than the standard Nyström method. We give a detailed study of variants of these algorithms based on simple averaging, an exponential weight method, or regression-based methods. We also present a theoretical analysis of these algorithms, including novel error bounds guaranteeing a better convergence rate than the standard Nyström method. Finally, we report results of extensive experiments with several data sets containing up to 1M points demonstrating the significant improvement over the standard Nyström approximation.

2 0.10543811 237 nips-2009-Subject independent EEG-based BCI decoding

Author: Siamac Fazli, Cristian Grozea, Marton Danoczy, Benjamin Blankertz, Florin Popescu, Klaus-Robert Müller

Abstract: In the quest to make Brain Computer Interfacing (BCI) more usable, dry electrodes have emerged that get rid of the initial 30 minutes required for placing an electrode cap. Another time-consuming step is the required individualized adaptation to the BCI user, which involves another 30 minutes calibration for assessing a subject’s brain signature. In this paper we aim to also remove this calibration procedure from BCI setup time by means of machine learning. In particular, we harvest a large database of EEG BCI motor imagination recordings (83 subjects) for constructing a library of subject-specific spatio-temporal filters and derive a subject independent BCI classifier. Our offline results indicate that BCI-naïve users could start real-time BCI use with no prior calibration at only a very moderate performance loss.

3 0.078100912 128 nips-2009-Learning Non-Linear Combinations of Kernels

Author: Corinna Cortes, Mehryar Mohri, Afshin Rostamizadeh

Abstract: This paper studies the general problem of learning kernels based on a polynomial combination of base kernels. We analyze this problem in the case of regression and the kernel ridge regression algorithm. We examine the corresponding learning kernel optimization problem, show how that minimax problem can be reduced to a simpler minimization problem, and prove that the global solution of this problem always lies on the boundary. We give a projection-based gradient descent algorithm for solving the optimization problem, shown empirically to converge in few iterations. Finally, we report the results of extensive experiments with this algorithm using several publicly available datasets demonstrating the effectiveness of our technique.

4 0.065953732 102 nips-2009-Graph-based Consensus Maximization among Multiple Supervised and Unsupervised Models

Author: Jing Gao, Feng Liang, Wei Fan, Yizhou Sun, Jiawei Han

Abstract: Ensemble classifiers such as bagging, boosting and model averaging are known to have improved accuracy and robustness over a single model. Their potential, however, is limited in applications which have no access to raw data but to the meta-level model output. In this paper, we study ensemble learning with output from multiple supervised and unsupervised models, a topic where little work has been done. Although unsupervised models, such as clustering, do not directly generate label prediction for each individual, they provide useful constraints for the joint prediction of a set of related objects. We propose to consolidate a classification solution by maximizing the consensus among both supervised predictions and unsupervised constraints. We cast this ensemble task as an optimization problem on a bipartite graph, where the objective function favors the smoothness of the prediction over the graph, as well as penalizing deviations from the initial labeling provided by supervised models. We solve this problem through iterative propagation of probability estimates among neighboring nodes. Our method can also be interpreted as conducting a constrained embedding in a transformed space, or a ranking on the graph. Experimental results on three real applications demonstrate the benefits of the proposed method over existing alternatives1 . 1

5 0.063079275 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties

Author: Sylvain Arlot, Francis R. Bach

Abstract: This paper tackles the problem of selecting among several linear estimators in non-parametric regression; this includes model selection for linear regression, the choice of a regularization parameter in kernel ridge regression or spline smoothing, and the choice of a kernel in multiple kernel learning. We propose a new algorithm which first estimates consistently the variance of the noise, based upon the concept of minimal penalty which was previously introduced in the context of model selection. Then, plugging our variance estimate in Mallows’ CL penalty is proved to lead to an algorithm satisfying an oracle inequality. Simulation experiments with kernel ridge regression and multiple kernel learning show that the proposed algorithm often improves significantly existing calibration procedures such as 10-fold cross-validation or generalized cross-validation. 1

6 0.048516516 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms

7 0.043964688 75 nips-2009-Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models

8 0.041547716 147 nips-2009-Matrix Completion from Noisy Entries

9 0.041013043 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

10 0.039615627 245 nips-2009-Thresholding Procedures for High Dimensional Variable Selection and Statistical Estimation

11 0.034820154 98 nips-2009-From PAC-Bayes Bounds to KL Regularization

12 0.030935928 60 nips-2009-Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation

13 0.030606348 108 nips-2009-Heterogeneous multitask learning with joint sparsity constraints

14 0.029853772 19 nips-2009-A joint maximum-entropy model for binary neural population patterns and continuous signals

15 0.027011786 84 nips-2009-Evaluating multi-class learning strategies in a generative hierarchical framework for object detection

16 0.026548795 214 nips-2009-Semi-supervised Regression using Hessian energy with an application to semi-supervised dimensionality reduction

17 0.026411263 252 nips-2009-Unsupervised Feature Selection for the $k$-means Clustering Problem

18 0.026372202 157 nips-2009-Multi-Label Prediction via Compressed Sensing

19 0.026369251 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models

20 0.026281485 120 nips-2009-Kernels and learning curves for Gaussian process regression on random graphs


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.093), (1, 0.039), (2, -0.009), (3, 0.05), (4, -0.016), (5, -0.023), (6, 0.018), (7, -0.015), (8, 0.014), (9, 0.019), (10, 0.013), (11, -0.03), (12, -0.019), (13, 0.014), (14, 0.012), (15, 0.003), (16, -0.003), (17, -0.002), (18, -0.043), (19, -0.001), (20, 0.029), (21, -0.046), (22, -0.002), (23, -0.064), (24, 0.01), (25, 0.011), (26, -0.052), (27, 0.025), (28, -0.0), (29, -0.03), (30, 0.009), (31, -0.08), (32, 0.02), (33, -0.003), (34, 0.114), (35, -0.012), (36, 0.024), (37, 0.139), (38, 0.03), (39, 0.093), (40, -0.07), (41, -0.031), (42, 0.057), (43, -0.112), (44, -0.002), (45, 0.033), (46, -0.06), (47, 0.151), (48, 0.036), (49, 0.06)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.90182728 81 nips-2009-Ensemble Nystrom Method

Author: Sanjiv Kumar, Mehryar Mohri, Ameet Talwalkar

Abstract: A crucial technique for scaling kernel methods to very large data sets reaching or exceeding millions of instances is based on low-rank approximation of kernel matrices. We introduce a new family of algorithms based on mixtures of Nyström approximations, ensemble Nyström algorithms, that yield more accurate low-rank approximations than the standard Nyström method. We give a detailed study of variants of these algorithms based on simple averaging, an exponential weight method, or regression-based methods. We also present a theoretical analysis of these algorithms, including novel error bounds guaranteeing a better convergence rate than the standard Nyström method. Finally, we report results of extensive experiments with several data sets containing up to 1M points demonstrating the significant improvement over the standard Nyström approximation.

2 0.69173568 237 nips-2009-Subject independent EEG-based BCI decoding

Author: Siamac Fazli, Cristian Grozea, Marton Danoczy, Benjamin Blankertz, Florin Popescu, Klaus-Robert Müller

Abstract: In the quest to make Brain Computer Interfacing (BCI) more usable, dry electrodes have emerged that get rid of the initial 30 minutes required for placing an electrode cap. Another time-consuming step is the required individualized adaptation to the BCI user, which involves another 30 minutes calibration for assessing a subject’s brain signature. In this paper we aim to also remove this calibration procedure from BCI setup time by means of machine learning. In particular, we harvest a large database of EEG BCI motor imagination recordings (83 subjects) for constructing a library of subject-specific spatio-temporal filters and derive a subject independent BCI classifier. Our offline results indicate that BCI-naïve users could start real-time BCI use with no prior calibration at only a very moderate performance loss.

3 0.49721226 124 nips-2009-Lattice Regression

Author: Eric Garcia, Maya Gupta

Abstract: We present a new empirical risk minimization framework for approximating functions from training samples for low-dimensional regression applications where a lattice (look-up table) is stored and interpolated at run-time for an efficient implementation. Rather than evaluating a fitted function at the lattice nodes without regard to the fact that samples will be interpolated, the proposed lattice regression approach estimates the lattice to minimize the interpolation error on the given training samples. Experiments show that lattice regression can reduce mean test error by as much as 25% compared to Gaussian process regression (GPR) for digital color management of printers, an application for which linearly interpolating a look-up table is standard. Simulations confirm that lattice regression performs consistently better than the naive approach to learning the lattice. Surprisingly, in some cases the proposed method — although motivated by computational efficiency — performs better than directly applying GPR with no lattice at all. 1

4 0.48283058 108 nips-2009-Heterogeneous multitask learning with joint sparsity constraints

Author: Xiaolin Yang, Seyoung Kim, Eric P. Xing

Abstract: Multitask learning addresses the problem of learning related tasks that presumably share some commonalities on their input-output mapping functions. Previous approaches to multitask learning usually deal with homogeneous tasks, such as purely regression tasks, or entirely classification tasks. In this paper, we consider the problem of learning multiple related tasks of predicting both continuous and discrete outputs from a common set of input variables that lie in a highdimensional feature space. All of the tasks are related in the sense that they share the same set of relevant input variables, but the amount of influence of each input on different outputs may vary. We formulate this problem as a combination of linear regressions and logistic regressions, and model the joint sparsity as L1 /L∞ or L1 /L2 norm of the model parameters. Among several possible applications, our approach addresses an important open problem in genetic association mapping, where the goal is to discover genetic markers that influence multiple correlated traits jointly. In our experiments, we demonstrate our method in this setting, using simulated and clinical asthma datasets, and we show that our method can effectively recover the relevant inputs with respect to all of the tasks. 1

5 0.47259554 182 nips-2009-Optimal Scoring for Unsupervised Learning

Author: Zhihua Zhang, Guang Dai

Abstract: We are often interested in casting classification and clustering problems as a regression framework, because it is feasible to achieve some statistical properties in this framework by imposing some penalty criteria. In this paper we illustrate optimal scoring, which was originally proposed for performing the Fisher linear discriminant analysis by regression, in the application of unsupervised learning. In particular, we devise a novel clustering algorithm that we call optimal discriminant clustering. We associate our algorithm with the existing unsupervised learning algorithms such as spectral clustering, discriminative clustering and sparse principal component analysis. Experimental results on a collection of benchmark datasets validate the effectiveness of the optimal discriminant clustering algorithm.

6 0.42400444 102 nips-2009-Graph-based Consensus Maximization among Multiple Supervised and Unsupervised Models

7 0.41715217 252 nips-2009-Unsupervised Feature Selection for the $k$-means Clustering Problem

8 0.40908337 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties

9 0.40160352 128 nips-2009-Learning Non-Linear Combinations of Kernels

10 0.3989591 92 nips-2009-Fast Graph Laplacian Regularized Kernel Learning via Semidefinite–Quadratic–Linear Programming

11 0.37576893 184 nips-2009-Optimizing Multi-Class Spatio-Spectral Filters via Bayes Error Estimation for EEG Classification

12 0.3673681 219 nips-2009-Slow, Decorrelated Features for Pretraining Complex Cell-like Networks

13 0.35424158 227 nips-2009-Speaker Comparison with Inner Product Discriminant Functions

14 0.32227561 208 nips-2009-Robust Principal Component Analysis: Exact Recovery of Corrupted Low-Rank Matrices via Convex Optimization

15 0.31722286 159 nips-2009-Multi-Step Dyna Planning for Policy Evaluation and Control

16 0.31319323 46 nips-2009-Bilinear classifiers for visual recognition

17 0.30995545 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

18 0.30985099 203 nips-2009-Replacing supervised classification learning by Slow Feature Analysis in spiking neural networks

19 0.30849269 173 nips-2009-Nonparametric Greedy Algorithms for the Sparse Learning Problem

20 0.30504015 165 nips-2009-Noise Characterization, Modeling, and Reduction for In Vivo Neural Recording


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(18, 0.012), (24, 0.042), (25, 0.048), (35, 0.028), (36, 0.073), (39, 0.031), (58, 0.499), (61, 0.018), (71, 0.025), (81, 0.033), (86, 0.068)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.98748362 43 nips-2009-Bayesian estimation of orientation preference maps

Author: Sebastian Gerwinn, Leonard White, Matthias Kaschube, Matthias Bethge, Jakob H. Macke

Abstract: Imaging techniques such as optical imaging of intrinsic signals, 2-photon calcium imaging and voltage sensitive dye imaging can be used to measure the functional organization of visual cortex across different spatial and temporal scales. Here, we present Bayesian methods based on Gaussian processes for extracting topographic maps from functional imaging data. In particular, we focus on the estimation of orientation preference maps (OPMs) from intrinsic signal imaging data. We model the underlying map as a bivariate Gaussian process, with a prior covariance function that reflects known properties of OPMs, and a noise covariance adjusted to the data. The posterior mean can be interpreted as an optimally smoothed estimate of the map, and can be used for model based interpolations of the map from sparse measurements. By sampling from the posterior distribution, we can get error bars on statistical properties such as preferred orientations, pinwheel locations or pinwheel counts. Finally, the use of an explicit probabilistic model facilitates interpretation of parameters and quantitative model comparisons. We demonstrate our model both on simulated data and on intrinsic signaling data from ferret visual cortex. 1

2 0.97857642 208 nips-2009-Robust Principal Component Analysis: Exact Recovery of Corrupted Low-Rank Matrices via Convex Optimization

Author: John Wright, Arvind Ganesh, Shankar Rao, Yigang Peng, Yi Ma

Abstract: Principal component analysis is a fundamental operation in computational data analysis, with myriad applications ranging from web search to bioinformatics to computer vision and image analysis. However, its performance and applicability in real scenarios are limited by a lack of robustness to outlying or corrupted observations. This paper considers the idealized “robust principal component analysis” problem of recovering a low rank matrix A from corrupted observations D = A + E. Here, the corrupted entries E are unknown and the errors can be arbitrarily large (modeling grossly corrupted observations common in visual and bioinformatic data), but are assumed to be sparse. We prove that most matrices A can be efficiently and exactly recovered from most error sign-and-support patterns by solving a simple convex program, for which we give a fast and provably convergent algorithm. Our result holds even when the rank of A grows nearly proportionally (up to a logarithmic factor) to the dimensionality of the observation space and the number of errors E grows in proportion to the total number of entries in the matrix. A by-product of our analysis is the first proportional growth results for the related problem of completing a low-rank matrix from a small fraction of its entries. Simulations and real-data examples corroborate the theoretical results, and suggest potential applications in computer vision. 1

3 0.96655476 257 nips-2009-White Functionals for Anomaly Detection in Dynamical Systems

Author: Marco Cuturi, Jean-philippe Vert, Alexandre D'aspremont

Abstract: We propose new methodologies to detect anomalies in discrete-time processes taking values in a probability space. These methods are based on the inference of functionals whose evaluations on successive states visited by the process are stationary and have low autocorrelations. Deviations from this behavior are used to flag anomalies. The candidate functionals are estimated in a subspace of a reproducing kernel Hilbert space associated with the original probability space considered. We provide experimental results on simulated datasets which show that these techniques compare favorably with other algorithms.

4 0.96196687 67 nips-2009-Directed Regression

Author: Yi-hao Kao, Benjamin V. Roy, Xiang Yan

Abstract: When used to guide decisions, linear regression analysis typically involves estimation of regression coefficients via ordinary least squares and their subsequent use to make decisions. When there are multiple response variables and features do not perfectly capture their relationships, it is beneficial to account for the decision objective when computing regression coefficients. Empirical optimization does so but sacrifices performance when features are well-chosen or training data are insufficient. We propose directed regression, an efficient algorithm that combines merits of ordinary least squares and empirical optimization. We demonstrate through a computational study that directed regression can generate significant performance gains over either alternative. We also develop a theory that motivates the algorithm. 1

5 0.95866513 223 nips-2009-Sparse Metric Learning via Smooth Optimization

Author: Yiming Ying, Kaizhu Huang, Colin Campbell

Abstract: In this paper we study the problem of learning a low-rank (sparse) distance matrix. We propose a novel metric learning model which can simultaneously conduct dimension reduction and learn a distance matrix. The sparse representation involves a mixed-norm regularization which is non-convex. We then show that it can be equivalently formulated as a convex saddle (min-max) problem. From this saddle representation, we develop an efficient smooth optimization approach [17] for sparse metric learning, although the learning model is based on a nondifferentiable loss function. Finally, we run experiments to validate the effectiveness and efficiency of our sparse metric learning model on various datasets.

same-paper 6 0.95056266 81 nips-2009-Ensemble Nystrom Method

7 0.78827482 147 nips-2009-Matrix Completion from Noisy Entries

8 0.7834782 36 nips-2009-Asymptotic Analysis of MAP Estimation via the Replica Method and Compressed Sensing

9 0.77015227 177 nips-2009-On Learning Rotations

10 0.74375427 243 nips-2009-The Ordered Residual Kernel for Robust Motion Subspace Clustering

11 0.73979104 254 nips-2009-Variational Gaussian-process factor analysis for modeling spatio-temporal data

12 0.71580404 62 nips-2009-Correlation Coefficients are Insufficient for Analyzing Spike Count Dependencies

13 0.71284389 228 nips-2009-Speeding up Magnetic Resonance Image Acquisition by Bayesian Multi-Slice Adaptive Compressed Sensing

14 0.71093458 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

15 0.70812637 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models

16 0.70713419 108 nips-2009-Heterogeneous multitask learning with joint sparsity constraints

17 0.70662439 202 nips-2009-Regularized Distance Metric Learning:Theory and Algorithm

18 0.70595646 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms

19 0.7059226 22 nips-2009-Accelerated Gradient Methods for Stochastic Optimization and Online Learning

20 0.70204431 163 nips-2009-Neurometric function analysis of population codes