nips nips2012 nips2012-249 knowledge-graph by maker-knowledge-mining

249 nips-2012-Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison


Source: pdf

Author: Tianbao Yang, Yu-feng Li, Mehrdad Mahdavi, Rong Jin, Zhi-Hua Zhou

Abstract: Both random Fourier features and the Nyström method have been successfully applied to efficient kernel learning. In this work, we investigate the fundamental difference between these two approaches, and how the difference could affect their generalization performances. Unlike approaches based on random Fourier features, where the basis functions (i.e., cosine and sine functions) are sampled from a distribution independent from the training data, basis functions used by the Nyström method are randomly sampled from the training examples and are therefore data dependent. By exploring this difference, we show that when there is a large gap in the eigen-spectrum of the kernel matrix, approaches based on the Nyström method can yield an impressively better generalization error bound than the random Fourier features based approach. We empirically verify our theoretical findings on a wide range of large data sets.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Both random Fourier features and the Nyström method have been successfully applied to efficient kernel learning. [sent-6, score-0.368]

2 In this work, we investigate the fundamental difference between these two approaches, and how the difference could affect their generalization performances. [sent-7, score-0.095]

3 Unlike approaches based on random Fourier features, where the basis functions (i.e., [sent-8, score-0.175]

4 cosine and sine functions) are sampled from a distribution independent from the training data, basis functions used by the Nyström method are randomly sampled from the training examples and are therefore data dependent. [sent-10, score-0.255]

5 By exploring this difference, we show that when there is a large gap in the eigen-spectrum of the kernel matrix, approaches based on the Nyström method can yield an impressively better generalization error bound than the random Fourier features based approach. [sent-11, score-0.529]

6 One limitation of kernel methods is their high computational cost, which is at least quadratic in the number of training examples, due to the calculation of the kernel matrix. [sent-15, score-0.457]

7 Although approaches such as incomplete Cholesky decomposition [3] have been used to alleviate the computational challenge of kernel methods, they still require computing the kernel matrix. [sent-18, score-0.459]

8 Other approaches such as online learning [9] and budget learning [7] have also been developed for large-scale kernel learning, but they tend to yield worse performance than batch learning. [sent-19, score-0.235]

9 To avoid computing the kernel matrix, one common approach is to approximate a kernel learning problem with a linear prediction problem. [sent-20, score-0.467]

10 It is often achieved by generating a vector representation of data that approximates the kernel similarity between any two data points. [sent-21, score-0.239]

11 The most well known approaches in this category are random Fourier features [13, 14] and the Nyström method [20, 8]. [sent-22, score-0.169]

12 The objective of this work is to understand the difference between these two approaches, both theoretically and empirically. The theoretical foundation for the random Fourier transform is that a shift-invariant kernel is the Fourier transform of a non-negative measure [15]. [sent-24, score-0.324]

13 Analysis in [14] shows that the generalization error bound for kernel learning based on random Fourier features is given by O(N^{-1/2} + m^{-1/2}), where N is the number of training examples and m is the number of sampled Fourier components. [sent-26, score-0.517]

14 An alternative approach for large-scale kernel classification is the Nyström method [20, 8] that approximates the kernel matrix by a low-rank matrix. [sent-27, score-0.574]

15 It randomly samples a subset of training examples and computes a kernel matrix K for the random samples. [sent-28, score-0.369]

16 It then represents each data point by a vector based on its kernel similarity to the random samples and the sampled kernel matrix K. [sent-29, score-0.567]

17 Most analysis of the Nyström method follows [8] and bounds the error in approximating the kernel matrix. [sent-30, score-0.302]

18 According to [8], the approximation error of the Nyström method, measured in spectral norm (footnote 1), is O(m^{-1/2}), where m is the number of sampled training examples. [sent-31, score-0.152]

19 Using the arguments in [6], we expect an additional error of O(m^{-1/2}) in the generalization performance caused by the approximation of the Nyström method, similar to random Fourier features. [sent-32, score-0.19]

20 Contributions: In this work, we first establish a unified framework for both methods from the viewpoint of functional approximation. [sent-33, score-0.069]

21 This is important because random Fourier features and the Nyström method address large-scale kernel learning very differently: random Fourier features aim to approximate the kernel function directly, while the Nyström method is designed to approximate the kernel matrix. [sent-34, score-1.019]

22 By exploring this difference, we show that the additional error caused by the Nyström method in the generalization performance can be improved to O(1/m) when there is a large gap in the eigen-spectrum of the kernel matrix. [sent-36, score-0.421]

23 Let D = {(x_1, y_1), …, (x_N, y_N)} be a collection of N training examples, where x_i ∈ X ⊆ R^d and y_i ∈ Y. [sent-41, score-0.084]

24 Let κ(·, ·) be a kernel function, let H_κ denote the endowed reproducing kernel Hilbert space, and let K = [κ(x_i, x_j)]_{N×N} be the kernel matrix for the samples in D. [sent-42, score-0.518]

25 Let (λ_i, v_i), i = 1, …, N, be the eigenvalues and eigenvectors of K, ranked in descending order of the eigenvalues. [sent-47, score-0.201]
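
To make the setup above concrete, here is a minimal NumPy sketch (not from the paper) of building an RBF kernel matrix K and sorting its eigenpairs in descending order; the toy data and the bandwidth sigma = 1 are illustrative assumptions.

```python
import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    """K[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    sq_dists = (np.sum(X**2, axis=1, keepdims=True)
                + np.sum(X**2, axis=1) - 2.0 * X @ X.T)
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))

# Toy data: N points in d dimensions (illustrative values only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
K = rbf_kernel_matrix(X, sigma=1.0)

# Eigen-decomposition of the symmetric kernel matrix, with eigenvalues
# (and the matching eigenvector columns of V) in descending order.
eigvals, eigvecs = np.linalg.eigh(K)
order = np.argsort(eigvals)[::-1]
lam, V = eigvals[order], eigvecs[:, order]
print(lam[:5])  # leading eigenvalues lambda_1 >= lambda_2 >= ...
```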

26 Let {x̂_1, …, x̂_m} denote the m randomly sampled examples, and let K̂ = [κ(x̂_i, x̂_j)]_{m×m} denote the corresponding kernel matrix. [sent-55, score-0.295]

27 According to [18], the eigenvalues of L_N and L_m are λ_i/N, i ∈ [N], and λ̂_i/m, i ∈ [m], respectively, and their corresponding normalized eigenfunctions φ_j, j ∈ [N], and φ̂_j, j ∈ [m], are given by φ_j(·) = (1/√λ_j) Σ_{i=1}^{N} V_{i,j} κ(x_i, ·), j ∈ [N], and φ̂_j(·) = (1/√λ̂_j) Σ_{i=1}^{m} V̂_{i,j} κ(x̂_i, ·), j ∈ [m]. (2) [sent-64, score-0.151]

28 To make our discussion concrete, we focus on the RBF kernel (footnote 2). [sent-65, score-0.217]

29 Our goal is to efficiently learn a kernel prediction function by solving the following optimization problem: min_{f ∈ H_D} (λ/2)‖f‖²_{H_κ} + (1/N) Σ_{i=1}^{N} ℓ(f(x_i), y_i). (3) (Footnote 1: We choose the bound based on the spectral norm according to the discussion in [6].) [sent-68, score-0.309]
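
As a hedged illustration of problem (3): with the squared loss ℓ(z, y) = (z − y)² and the representer theorem, (3) reduces to kernel ridge regression over coefficients α with f(·) = Σ_i α_i κ(x_i, ·). The sketch below solves that special case only and is not the paper's general formulation.

```python
import numpy as np

def kernel_ridge_fit(K, y, lam):
    """Solve min_alpha (lam/2) * alpha^T K alpha + (1/N) * ||K alpha - y||^2.
    Setting the gradient to zero and cancelling one factor of K gives
    (K + (N * lam / 2) * I) alpha = y."""
    N = K.shape[0]
    return np.linalg.solve(K + 0.5 * N * lam * np.eye(N), y)

def kernel_ridge_predict(K_test_train, alpha):
    """f(x) = sum_i alpha_i * kappa(x_i, x) for each test point (rows of K_test_train)."""
    return K_test_train @ alpha
```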

30 The improved bound obtained in the paper for the Nyström method is valid for any kernel matrix that satisfies the eigengap condition. [sent-69, score-0.448]

31 Here H_D = span(κ(x_1, ·), …, κ(x_N, ·)) is the span over all the training examples (footnote 3), and ℓ(z, y) is a convex loss function with respect to z. [sent-73, score-0.08]

32 The high computational cost of kernel learning arises from the fact that we have to search for an optimal classifier f(·) in a large space H_D. [sent-75, score-0.217]

33 Given this observation, to alleviate the computational cost of kernel classification, we can reduce the space H_D to a smaller space H_a, and only search for the solution f(·) ∈ H_a. [sent-76, score-0.242]

34 Below we show that the difference between random Fourier features and the Nyström method lies in the construction of the approximate space H_a. [sent-79, score-0.209]

35 For each method, we begin with a description of a vector representation of data, and then connect the vector representation to the approximate large kernel machine by functional approximation. [sent-80, score-0.321]

36 Random Fourier Features: The random Fourier features are constructed by first sampling Fourier components u_1, …, [sent-81, score-0.117]

37 u_m separately, and then passing them through sine and cosine functions. [sent-87, score-0.101]

38 Given the random Fourier features, we then learn a linear machine f(x) = w^T z_f(x) by solving the following optimization problem: min_{w ∈ R^{2m}} (λ/2)‖w‖²_2 + (1/N) Σ_{i=1}^{N} ℓ(w^T z_f(x_i), y_i). (4) [sent-93, score-0.203]
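
A minimal sketch of the random Fourier feature map z_f for the RBF kernel, in the spirit of [13, 14]; the Gaussian sampling of the Fourier components u_i is standard for the RBF kernel, but the variable names and scaling details here are illustrative assumptions rather than the paper's exact construction.

```python
import numpy as np

def rff_features(X, m, sigma=1.0, seed=0):
    """Map X (N x d) to z_f(X) (N x 2m) using sine/cosine features so that
    z_f(x)^T z_f(x') approximates exp(-||x - x'||^2 / (2 * sigma^2))."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # For the RBF kernel, the Fourier components are Gaussian with std 1/sigma.
    U = rng.normal(scale=1.0 / sigma, size=(m, d))
    proj = X @ U.T                                    # N x m projections u_i^T x
    # Concatenate sine and cosine features; the 1/sqrt(m) scaling makes the
    # inner product an average of cos(u_i^T (x - x')) over the m components.
    return np.hstack([np.sin(proj), np.cos(proj)]) / np.sqrt(m)

# The induced linear machine f(x) = w^T z_f(x) can then be trained with any
# regularized linear solver in place of the kernel machine in (3).
```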

39 To connect the linear machine (4) to the kernel machine in (3) by a functional approximation, we can construct a functional space H_a^f = span(s_1(·), c_1(·), …). [sent-94, score-0.339]

40 The following proposition connects the approximate kernel machine in (5) to the linear machine in (4). [sent-99, score-0.271]

41 Proposition 1: The approximate kernel machine in (5) is equivalent to the following linear machine: min_{w ∈ R^{2m}} (λ/2) w^T(w ∘ γ) + (1/N) Σ_{i=1}^{N} ℓ(w^T z_f(x_i), y_i), (6) where γ = (γ_1^s, γ_1^c, …, γ_m^s, γ_m^c) and γ_i^{s/c} = exp(σ² ‖u_i‖² / 2). [sent-101, score-0.344]

42 Comparing (6) to the linear machine based on random Fourier features in (4), we can see that, other than the weights {γ_i^{s/c}}_{i=1}^{m}, random Fourier features can be viewed as approximating (3) by restricting the solution f(·) to H_a^f. [sent-102, score-0.267]

43 It is straightforward to verify that z_n(x_i)^T z_n(x_j) = [K_r]_{ij}. [sent-114, score-0.142]

44 Given the vector representation z_n(x), we then learn a linear machine f(x) = w^T z_n(x) by solving the following optimization problem: min_{w ∈ R^r} (λ/2)‖w‖²_2 + (1/N) Σ_{i=1}^{N} ℓ(w^T z_n(x_i), y_i). (7) [sent-115, score-0.184]
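
A minimal sketch of the Nyström vector representation z_n used in (7); the construction below is the standard rank-r Nyström feature map (variable names are mine), and it satisfies z_n(x_i)^T z_n(x_j) = [K_r]_{ij} as stated in sentence 43.

```python
import numpy as np

def nystrom_features(K_all_to_samples, K_samples, r):
    """K_all_to_samples: N x m matrix of similarities kappa(x_i, xhat_j).
    K_samples: m x m kernel matrix on the m sampled points.
    Returns z_n(X), an N x r matrix with z_n(x_i)^T z_n(x_j) = [K_r]_{ij}."""
    eigvals, eigvecs = np.linalg.eigh(K_samples)
    order = np.argsort(eigvals)[::-1][:r]          # indices of the top-r eigenpairs
    lam = np.maximum(eigvals[order], 1e-12)        # guard against tiny eigenvalues
    V = eigvecs[:, order]
    # z_n(x) = diag(lam)^(-1/2) V^T (kappa(xhat_1, x), ..., kappa(xhat_m, x))
    return (K_all_to_samples @ V) / np.sqrt(lam)
```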

45 In order to see how the Nyström method can be cast into the unified framework of approximating the large-scale kernel machine by functional approximation, we construct the following functional space: H_a^n = span(φ̂_1, …). [sent-117, score-0.374]

46 The following proposition shows that the linear machine in (7) using the vector representation of the Nyström method is equivalent to the approximate kernel machine in (3) by restricting the solution f(·) to an approximate functional space H_a^n. [sent-124, score-0.389]

47 In particular, the basis functions used by random Fourier features are sampled from a Gaussian distribution that is independent from the training examples. [sent-126, score-0.21]

48 In contrast, the basis functions used by the Nyström method are sampled from the training examples and are therefore data dependent. [sent-127, score-0.153]

49 When the first few eigenvalues of the full kernel matrix are much larger than the remaining eigenvalues, the classification performance is mostly determined by the top eigenvectors. [sent-131, score-0.332]

50 Since the Nyström method uses a data dependent sampling method, it is able to discover the subspace spanned by the top eigenvectors using a small number of samples. [sent-132, score-0.127]

51 In contrast, since random Fourier features are drawn from a distribution independent from the training data, they may require a large number of samples before they can discover this subspace. [sent-133, score-0.175]

52 As a result, we expect a significantly lower generalization error for the Nyström method. [sent-134, score-0.075]

53 The σ value in the RBF kernel is chosen by cross-validation and is set to 6 for the synthetic data. [sent-141, score-0.245]

54 According to the results shown in Figure 1(c), it is clear that the Nyström method performs significantly better than random Fourier features. [sent-144, score-0.074]

55 By using only 100 samples, the Nyström method is able to make perfect predictions, while the decisions made by the random Fourier features based method are close to random guessing. [sent-145, score-0.148]

56 To evaluate the approximation error of the functional space, we plot in Figures 1(e) and 1(f), respectively, the first two eigenvectors of the approximate kernel matrix computed by the Nyström method and by random Fourier features using 100 samples. [sent-146, score-0.61]

57 Compared to the eigenvectors computed from the full kernel matrix (Figure 1(d)), we can see that the Nyström method achieves a significantly better approximation of the first two eigenvectors than random Fourier features. [sent-147, score-0.488]
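
A hedged sketch of the kind of check behind Figures 1(d)-(f): compare the top eigenvectors of the full kernel matrix with those of each approximation via absolute cosine similarity. The alignment measure and the commented usage are my illustrative choices, not the paper's exact protocol.

```python
import numpy as np

def top_eigvecs(M, k):
    """Columns are the eigenvectors of the symmetric matrix M with the k largest eigenvalues."""
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, np.argsort(vals)[::-1][:k]]

def alignment(U, V):
    """Mean absolute cosine similarity between corresponding (unit-norm) columns."""
    return float(np.mean(np.abs(np.sum(U * V, axis=0))))

# With K the full N x N kernel matrix, and Z_nys, Z_rff the Nystrom and random
# Fourier feature maps of the data, compare the recovered top-2 eigenvectors:
# print(alignment(top_eigvecs(K, 2), top_eigvecs(Z_nys @ Z_nys.T, 2)))
# print(alignment(top_eigvecs(K, 2), top_eigvecs(Z_rff @ Z_rff.T, 2)))
```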

58 Finally, we note that although the concept of eigengap has been exploited in many studies of kernel learning [2, 12, 1, 17], to the best of our knowledge, this is the first time it has been incorporated in the analysis for approximate large-scale kernel learning. [sent-148, score-0.578]

59 3 Main Theoretical Result: Let f_m^* be the optimal solution to the approximate kernel learning problem in (8), and let f_N^* be the solution to the full version of kernel learning in (3). [sent-149, score-0.559]

60 [Figure 1(c) residue: classification accuracy vs. the number of random samples.] [sent-175, score-0.128]

61 Let ε be the solution to ε² = ψ(ε), where the existence and uniqueness of ε are determined by the sub-root property of ψ(δ) [4], and let ε̄ = max(ε, √(6 ln N / N)). [sent-194, score-0.172]

62 According to [10], we have ε̄² = O(N^{-1/2}), and when the eigenvalues of the kernel function follow a p-power law, it is improved to ε̄² = O(N^{-p/(p+1)}). [sent-195, score-0.333]

63 Theorem 1: For 16 ε̄² e^{-2N} ≤ λ ≤ 1, λ_{r+1} = O(N/m), and (λ_r − λ_{r+1})/N = Ω(1) ≥ 3√(2 ln(2N³)/m) + 2 ln(2N³)/m, with probability 1 − 3N^{-3} we have Λ(f_m^*) ≤ 3Λ(f_N^*) + (1/λ) O(ε̄² + 1/m), where O(·) suppresses the polynomial term of ln N. [sent-198, score-0.172]

64 Theorem 1 shows that the additional error caused by the approximation of the Nyström method is improved to O(1/m) when there is a large gap between λ_r and λ_{r+1}. [sent-199, score-0.19]

65 Note that the improvement from O(1/√m) to O(1/m) is very significant from the theoretical viewpoint, because it is well known that the generalization error for kernel learning is O(N^{-1/2}) [4] (footnote 5). [sent-200, score-0.292]

66 As a result, to achieve a similar performance as the standard kernel learning, the number of required samples has to be … (Footnote 5: It is possible to achieve a better generalization error bound of O(N^{-p/(p+1)}) by assuming the eigenvalues of the kernel matrix follow a p-power law [10].) [sent-201, score-0.712]

67 However, a large eigengap does not immediately indicate a power-law distribution for the eigenvalues and, consequently, a better generalization error. [sent-202, score-0.267]

68 O(N) if the additional error caused by the kernel approximation is bounded by O(1/√m), leading to a high computational cost. [sent-203, score-0.344]

69 On the other hand, with an O(1/m) bound for the additional error caused by the kernel approximation, the number of required samples is reduced to √N, making it more practical for large-scale kernel learning. [sent-204, score-0.572]
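
The sample-count arithmetic behind sentences 68 and 69, written out under the stated bounds (requiring the extra approximation error not to dominate the O(N^{-1/2}) generalization error):

```latex
\[
\frac{1}{\sqrt{m}} \lesssim \frac{1}{\sqrt{N}} \;\Longrightarrow\; m = \Omega(N)
\qquad\text{vs.}\qquad
\frac{1}{m} \lesssim \frac{1}{\sqrt{N}} \;\Longrightarrow\; m = \Omega(\sqrt{N}).
\]
```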

70 We also note that the improvement made for the Nyström method relies on the property that H_a^n ⊂ H_D and therefore requires data dependent basis functions. [sent-205, score-0.098]

71 We first present a theorem to show that the excess risk bound of f_m^* is related to the matrix approximation error ‖K − K_r‖_2. [sent-209, score-0.247]

72 Using the eigenfunctions of L_m and L_N, we define two linear operators H_r and Ĥ_r as H_r[f](·) = Σ_{i=1}^{r} φ_i(·)⟨φ_i, f⟩_{H_κ} and Ĥ_r[f](·) = Σ_{i=1}^{r} φ̂_i(·)⟨φ̂_i, f⟩_{H_κ}, (10) where f ∈ H_κ. [sent-213, score-0.09]

73 Given the result in Theorem 3, we move to bound the spectral norm of L_N^{1/2} ΔH L_N^{1/2}. [sent-216, score-0.067]

74 To this end, we assume a sufficiently large eigengap Δ = (λ_r − λ_{r+1})/N. [sent-217, score-0.111]

75 The theorem below bounds ‖L_N^{1/2} ΔH L_N^{1/2}‖_2 using matrix perturbation theory [19]. [sent-218, score-0.084]

76 5 Empirical Studies: To verify our theoretical findings, we evaluate the empirical performance of the Nyström method and random Fourier features for large-scale kernel learning. [sent-228, score-0.404]

77 Note that datasets CPU, CENSUS, ADULT and FOREST were originally used in [13] to verify the effectiveness of random Fourier features. [sent-230, score-0.076]

78 An RBF kernel is used for both methods and for all the datasets. [sent-237, score-0.217]

79 For the large datasets (e.g., COVTYPE and FOREST), we restrict the maximum number of random samples to 200 because of the high computational cost. [sent-244, score-0.075]

80 We observed that for all the data sets, the Nyström method outperforms random Fourier features (footnote 6). [sent-245, score-0.151]

81 Moreover, except for COVTYPE with 10 random samples, the Nyström method performs significantly better than random Fourier features, according to t-tests at the 95% significance level. [sent-246, score-0.114]

82 We finally evaluate whether the large eigengap condition, the key assumption for our main theoretical result, holds for the data sets. [sent-247, score-0.111]

83 Due to the large size, except for CPU, we compute the eigenvalues of the kernel matrix based on 10,000 randomly selected examples from each dataset. [sent-248, score-0.358]

84 As shown in Figure 3 (eigenvalues are on a logarithmic scale), we observe that the eigenvalues drop very quickly as the rank increases, leading to a significant gap between the top eigenvalues and the remaining eigenvalues. [sent-249, score-0.274]
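
A short sketch of the eigengap check behind Figure 3; the subsample size and the relative-gap statistic are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def eigengap_profile(K, top=20):
    """Return the leading eigenvalues of K (descending) and the relative gaps
    (lambda_r - lambda_{r+1}) / lambda_1, which indicate where a large eigengap occurs."""
    lam = np.sort(np.linalg.eigvalsh(K))[::-1][:top + 1]
    gaps = (lam[:-1] - lam[1:]) / lam[0]
    return lam[:-1], gaps

# Usage, with K built from a random subsample of up to 10,000 examples as in Section 5:
# lam, gaps = eigengap_profile(K)
# print(np.log10(lam))       # eigenvalues on a logarithmic scale, as in Figure 3
# print(int(gaps.argmax()))  # candidate r where the relative gap is largest
```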

85 6 Conclusion and Discussion: We study two methods for large-scale kernel learning, i.e., [sent-250, score-0.217]

86 the Nyström method and random Fourier features. [sent-252, score-0.074]

87 One key difference between these two approaches is that the Nyström method uses data … (Footnote 6: We note that the classification performance on the ADULT data set reported in Figure 2 does not match the performance reported in [13].) [sent-253, score-0.077]

88–90 [Figure 2 residue: panels plot accuracy (%) or mean square error against the number of random samples for COVTYPE, COD_RNA, ADULT, and CENSUS, comparing the Nyström method and random Fourier features.] Figure 2: Comparison of the Nyström method and random Fourier features. [sent-270, score-0.224]

91 Figure 3: The eigenvalue distributions of kernel matrices. [sent-298, score-0.217]

92 dependent basis functions while random Fourier features introduce data independent basis functions. [sent-300, score-0.221]

93 This difference leads to an improved analysis for kernel learning approaches based on the Nyström method. [sent-301, score-0.289]

94 We show that when there is a large eigengap in the kernel matrix, the approximation error of the Nyström method can be improved to O(1/m), leading to a significantly better generalization performance than random Fourier features. [sent-302, score-0.559]

95 As implied by our study, it is important to develop data dependent basis functions for large-scale kernel learning. [sent-304, score-0.281]

96 One direction we plan to explore is to improve random Fourier features by making the sampling data dependent. [sent-305, score-0.117]

97 This can be achieved by introducing a rejection procedure that rejects the sampled Fourier components when they do not align well with the top eigenfunctions estimated from the sampled data. [sent-306, score-0.094]
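
The rejection procedure is only sketched in the text; purely as a speculative illustration of that idea (the alignment score, threshold, and acceptance rule below are all my assumptions, not the authors' method):

```python
import numpy as np

def sample_aligned_fourier_components(X_sub, top_eigvecs, m, sigma=1.0,
                                      threshold=0.1, seed=0, max_draws=10000):
    """Rejection-style sampling: draw Gaussian Fourier components and keep only
    those whose cosine feature on a data subsample has enough energy in the span
    of the top eigenvectors of the subsample's kernel matrix."""
    rng = np.random.default_rng(seed)
    accepted = []
    for _ in range(max_draws):
        if len(accepted) == m:
            break
        u = rng.normal(scale=1.0 / sigma, size=X_sub.shape[1])
        feat = np.cos(X_sub @ u)
        feat = feat / (np.linalg.norm(feat) + 1e-12)
        score = np.linalg.norm(top_eigvecs.T @ feat) ** 2  # energy in the top span
        if score >= threshold:
            accepted.append(u)
    return np.array(accepted)
```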

98 On the impact of kernel approximation on learning accuracy. [sent-343, score-0.268]

99 On the Nyström method for approximating a Gram matrix for improved kernel-based learning. [sent-355, score-0.347]

100 Using the Nyström method to speed up kernel machines. [sent-420, score-0.486]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('nystr', 0.583), ('fourier', 0.465), ('nystrom', 0.235), ('kernel', 0.217), ('kr', 0.204), ('lm', 0.177), ('ln', 0.172), ('ha', 0.17), ('eigengap', 0.111), ('hd', 0.11), ('hs', 0.107), ('hln', 0.103), ('fm', 0.092), ('hr', 0.09), ('eigenvalues', 0.087), ('features', 0.077), ('eigenvectors', 0.069), ('zf', 0.069), ('eigenfunctions', 0.064), ('eigenvector', 0.062), ('covtype', 0.059), ('fn', 0.057), ('rank', 0.056), ('zn', 0.053), ('um', 0.052), ('census', 0.052), ('functional', 0.051), ('generalization', 0.045), ('adult', 0.045), ('kb', 0.045), ('caused', 0.044), ('vr', 0.043), ('basis', 0.04), ('random', 0.04), ('dult', 0.039), ('orest', 0.039), ('ovtype', 0.039), ('spectral', 0.038), ('theorem', 0.037), ('verify', 0.036), ('xi', 0.036), ('samples', 0.035), ('puted', 0.034), ('method', 0.034), ('uni', 0.034), ('approximate', 0.033), ('rahimi', 0.032), ('span', 0.031), ('approximation', 0.031), ('sin', 0.03), ('error', 0.03), ('rbf', 0.03), ('cos', 0.03), ('forest', 0.03), ('pu', 0.03), ('sampled', 0.03), ('improved', 0.029), ('bound', 0.029), ('sine', 0.028), ('vij', 0.028), ('matrix', 0.028), ('synthetic', 0.028), ('xm', 0.027), ('examples', 0.026), ('descending', 0.026), ('owing', 0.026), ('operators', 0.026), ('cpu', 0.026), ('yi', 0.025), ('difference', 0.025), ('alleviate', 0.025), ('dependent', 0.024), ('law', 0.024), ('mohri', 0.023), ('classi', 0.023), ('training', 0.023), ('gap', 0.022), ('approximates', 0.022), ('leading', 0.022), ('dr', 0.022), ('cosine', 0.021), ('transform', 0.021), ('approximating', 0.021), ('xj', 0.021), ('libsvm', 0.021), ('rademacher', 0.021), ('proposition', 0.021), ('impact', 0.02), ('connect', 0.02), ('perturbation', 0.019), ('ranked', 0.019), ('square', 0.018), ('accuracy', 0.018), ('approaches', 0.018), ('viewpoint', 0.018), ('pling', 0.017), ('nanjing', 0.017), ('kddcup', 0.017), ('oversampling', 0.017), ('impressively', 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 249 nips-2012-Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison

Author: Tianbao Yang, Yu-feng Li, Mehrdad Mahdavi, Rong Jin, Zhi-Hua Zhou

Abstract: Both random Fourier features and the Nyström method have been successfully applied to efficient kernel learning. In this work, we investigate the fundamental difference between these two approaches, and how the difference could affect their generalization performances. Unlike approaches based on random Fourier features, where the basis functions (i.e., cosine and sine functions) are sampled from a distribution independent from the training data, basis functions used by the Nyström method are randomly sampled from the training examples and are therefore data dependent. By exploring this difference, we show that when there is a large gap in the eigen-spectrum of the kernel matrix, approaches based on the Nyström method can yield an impressively better generalization error bound than the random Fourier features based approach. We empirically verify our theoretical findings on a wide range of large data sets.

2 0.10682034 37 nips-2012-Affine Independent Variational Inference

Author: Edward Challis, David Barber

Abstract: We consider inference in a broad class of non-conjugate probabilistic models based on minimising the Kullback-Leibler divergence between the given target density and an approximating ‘variational’ density. In particular, for generalised linear models we describe approximating densities formed from an affine transformation of independently distributed latent variables, this class including many well known densities as special cases. We show how all relevant quantities can be efficiently computed using the fast Fourier transform. This extends the known class of tractable variational approximations and enables the fitting for example of skew variational densities to the target density. 1

3 0.10202589 330 nips-2012-Supervised Learning with Similarity Functions

Author: Purushottam Kar, Prateek Jain

Abstract: We address the problem of general supervised learning when data can only be accessed through an (indefinite) similarity function between data points. Existing work on learning with indefinite kernels has concentrated solely on binary/multiclass classification problems. We propose a model that is generic enough to handle any supervised learning task and also subsumes the model previously proposed for classification. We give a “goodness” criterion for similarity functions w.r.t. a given supervised learning task and then adapt a well-known landmarking technique to provide efficient algorithms for supervised learning using “good” similarity functions. We demonstrate the effectiveness of our model on three important supervised learning problems: a) real-valued regression, b) ordinal regression and c) ranking where we show that our method guarantees bounded generalization error. Furthermore, for the case of real-valued regression, we give a natural goodness definition that, when used in conjunction with a recent result in sparse vector recovery, guarantees a sparse predictor with bounded generalization error. Finally, we report results of our learning algorithms on regression and ordinal regression tasks using non-PSD similarity functions and demonstrate the effectiveness of our algorithms, especially that of the sparse landmark selection algorithm that achieves significantly higher accuracies than the baseline methods while offering reduced computational costs. 1

4 0.096362986 264 nips-2012-Optimal kernel choice for large-scale two-sample tests

Author: Arthur Gretton, Dino Sejdinovic, Heiko Strathmann, Sivaraman Balakrishnan, Massimiliano Pontil, Kenji Fukumizu, Bharath K. Sriperumbudur

Abstract: Given samples from distributions p and q, a two-sample test determines whether to reject the null hypothesis that p = q, based on the value of a test statistic measuring the distance between the samples. One choice of test statistic is the maximum mean discrepancy (MMD), which is a distance between embeddings of the probability distributions in a reproducing kernel Hilbert space. The kernel used in obtaining these embeddings is critical in ensuring the test has high power, and correctly distinguishes unlike distributions with high probability. A means of parameter selection for the two-sample test based on the MMD is proposed. For a given test level (an upper bound on the probability of making a Type I error), the kernel is chosen so as to maximize the test power, and minimize the probability of making a Type II error. The test statistic, test threshold, and optimization over the kernel parameters are obtained with cost linear in the sample size. These properties make the kernel selection and test procedures suited to data streams, where the observations cannot all be stored in memory. In experiments, the new kernel selection approach yields a more powerful test than earlier kernel selection heuristics.

5 0.085774049 234 nips-2012-Multiresolution analysis on the symmetric group

Author: Risi Kondor, Walter Dempsey

Abstract: There is no generally accepted way to define wavelets on permutations. We address this issue by introducing the notion of coset based multiresolution analysis (CMRA) on the symmetric group, find the corresponding wavelet functions, and describe a fast wavelet transform for sparse signals. We discuss potential applications in ranking, sparse approximation, and multi-object tracking. 1

6 0.085468367 188 nips-2012-Learning from Distributions via Support Measure Machines

7 0.085250489 278 nips-2012-Probabilistic n-Choose-k Models for Classification and Ranking

8 0.082270399 324 nips-2012-Stochastic Gradient Descent with Only One Projection

9 0.080824643 231 nips-2012-Multiple Operator-valued Kernel Learning

10 0.078471139 145 nips-2012-Gradient Weights help Nonparametric Regressors

11 0.072743572 363 nips-2012-Wavelet based multi-scale shape features on arbitrary surfaces for cortical thickness discrimination

12 0.069558434 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)

13 0.060894776 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs

14 0.058754481 284 nips-2012-Q-MKL: Matrix-induced Regularization in Multi-Kernel Learning with Applications to Neuroimaging

15 0.058710251 63 nips-2012-CPRL -- An Extension of Compressive Sensing to the Phase Retrieval Problem

16 0.057688553 117 nips-2012-Ensemble weighted kernel estimators for multivariate entropy estimation

17 0.057318266 144 nips-2012-Gradient-based kernel method for feature extraction and variable selection

18 0.05607025 74 nips-2012-Collaborative Gaussian Processes for Preference Learning

19 0.055411801 187 nips-2012-Learning curves for multi-task Gaussian process regression

20 0.055388745 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.145), (1, 0.019), (2, 0.03), (3, -0.034), (4, 0.061), (5, -0.005), (6, 0.001), (7, 0.105), (8, -0.01), (9, -0.084), (10, 0.023), (11, 0.006), (12, 0.015), (13, -0.026), (14, 0.06), (15, -0.073), (16, 0.051), (17, 0.029), (18, 0.026), (19, -0.029), (20, 0.042), (21, -0.03), (22, 0.065), (23, -0.093), (24, 0.0), (25, 0.102), (26, -0.047), (27, 0.046), (28, -0.085), (29, 0.067), (30, -0.02), (31, -0.037), (32, 0.052), (33, -0.083), (34, 0.027), (35, -0.067), (36, -0.123), (37, -0.067), (38, -0.077), (39, 0.056), (40, -0.011), (41, -0.064), (42, 0.061), (43, 0.025), (44, -0.027), (45, 0.054), (46, 0.073), (47, 0.035), (48, 0.086), (49, -0.034)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91140562 249 nips-2012-Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison

Author: Tianbao Yang, Yu-feng Li, Mehrdad Mahdavi, Rong Jin, Zhi-Hua Zhou

Abstract: Both random Fourier features and the Nyström method have been successfully applied to efficient kernel learning. In this work, we investigate the fundamental difference between these two approaches, and how the difference could affect their generalization performances. Unlike approaches based on random Fourier features, where the basis functions (i.e., cosine and sine functions) are sampled from a distribution independent from the training data, basis functions used by the Nyström method are randomly sampled from the training examples and are therefore data dependent. By exploring this difference, we show that when there is a large gap in the eigen-spectrum of the kernel matrix, approaches based on the Nyström method can yield an impressively better generalization error bound than the random Fourier features based approach. We empirically verify our theoretical findings on a wide range of large data sets.

2 0.72075987 231 nips-2012-Multiple Operator-valued Kernel Learning

Author: Hachem Kadri, Alain Rakotomamonjy, Philippe Preux, Francis R. Bach

Abstract: Positive definite operator-valued kernels generalize the well-known notion of reproducing kernels, and are naturally adapted to multi-output learning situations. This paper addresses the problem of learning a finite linear combination of infinite-dimensional operator-valued kernels which are suitable for extending functional data analysis methods to nonlinear contexts. We study this problem in the case of kernel ridge regression for functional responses with an r -norm constraint on the combination coefficients (r ≥ 1). The resulting optimization problem is more involved than those of multiple scalar-valued kernel learning since operator-valued kernels pose more technical and theoretical issues. We propose a multiple operator-valued kernel learning algorithm based on solving a system of linear operator equations by using a block coordinate-descent procedure. We experimentally validate our approach on a functional regression task in the context of finger movement prediction in brain-computer interfaces. 1

3 0.70297551 167 nips-2012-Kernel Hyperalignment

Author: Alexander Lorbert, Peter J. Ramadge

Abstract: We offer a regularized, kernel extension of the multi-set, orthogonal Procrustes problem, or hyperalignment. Our new method, called Kernel Hyperalignment, expands the scope of hyperalignment to include nonlinear measures of similarity and enables the alignment of multiple datasets with a large number of base features. With direct application to fMRI data analysis, kernel hyperalignment is well-suited for multi-subject alignment of large ROIs, including the entire cortex. We report experiments using real-world, multi-subject fMRI data. 1

4 0.67540973 264 nips-2012-Optimal kernel choice for large-scale two-sample tests

Author: Arthur Gretton, Dino Sejdinovic, Heiko Strathmann, Sivaraman Balakrishnan, Massimiliano Pontil, Kenji Fukumizu, Bharath K. Sriperumbudur

Abstract: Given samples from distributions p and q, a two-sample test determines whether to reject the null hypothesis that p = q, based on the value of a test statistic measuring the distance between the samples. One choice of test statistic is the maximum mean discrepancy (MMD), which is a distance between embeddings of the probability distributions in a reproducing kernel Hilbert space. The kernel used in obtaining these embeddings is critical in ensuring the test has high power, and correctly distinguishes unlike distributions with high probability. A means of parameter selection for the two-sample test based on the MMD is proposed. For a given test level (an upper bound on the probability of making a Type I error), the kernel is chosen so as to maximize the test power, and minimize the probability of making a Type II error. The test statistic, test threshold, and optimization over the kernel parameters are obtained with cost linear in the sample size. These properties make the kernel selection and test procedures suited to data streams, where the observations cannot all be stored in memory. In experiments, the new kernel selection approach yields a more powerful test than earlier kernel selection heuristics.

5 0.6506024 284 nips-2012-Q-MKL: Matrix-induced Regularization in Multi-Kernel Learning with Applications to Neuroimaging

Author: Chris Hinrichs, Vikas Singh, Jiming Peng, Sterling Johnson

Abstract: Multiple Kernel Learning (MKL) generalizes SVMs to the setting where one simultaneously trains a linear classifier and chooses an optimal combination of given base kernels. Model complexity is typically controlled using various norm regularizations on the base kernel mixing coefficients. Existing methods neither regularize nor exploit potentially useful information pertaining to how kernels in the input set ‘interact’; that is, higher order kernel-pair relationships that can be easily obtained via unsupervised (similarity, geodesics), supervised (correlation in errors), or domain knowledge driven mechanisms (which features were used to construct the kernel?). We show that by substituting the norm penalty with an arbitrary quadratic function Q 0, one can impose a desired covariance structure on mixing weights, and use this as an inductive bias when learning the concept. This formulation significantly generalizes the widely used 1- and 2-norm MKL objectives. We explore the model’s utility via experiments on a challenging Neuroimaging problem, where the goal is to predict a subject’s conversion to Alzheimer’s Disease (AD) by exploiting aggregate information from many distinct imaging modalities. Here, our new model outperforms the state of the art (p-values 10−3 ). We briefly discuss ramifications in terms of learning bounds (Rademacher complexity). 1

6 0.63303202 330 nips-2012-Supervised Learning with Similarity Functions

7 0.61484802 188 nips-2012-Learning from Distributions via Support Measure Machines

8 0.6008783 144 nips-2012-Gradient-based kernel method for feature extraction and variable selection

9 0.52670282 145 nips-2012-Gradient Weights help Nonparametric Regressors

10 0.51938462 269 nips-2012-Persistent Homology for Learning Densities with Bounded Support

11 0.51769936 234 nips-2012-Multiresolution analysis on the symmetric group

12 0.49425015 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs

13 0.47389662 221 nips-2012-Multi-Stage Multi-Task Feature Learning

14 0.47147715 177 nips-2012-Learning Invariant Representations of Molecules for Atomization Energy Prediction

15 0.46180615 340 nips-2012-The representer theorem for Hilbert spaces: a necessary and sufficient condition

16 0.46063894 37 nips-2012-Affine Independent Variational Inference

17 0.45389384 130 nips-2012-Feature-aware Label Space Dimension Reduction for Multi-label Classification

18 0.44636077 128 nips-2012-Fast Resampling Weighted v-Statistics

19 0.43675658 363 nips-2012-Wavelet based multi-scale shape features on arbitrary surfaces for cortical thickness discrimination

20 0.43236384 243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.029), (21, 0.013), (38, 0.125), (39, 0.363), (42, 0.03), (53, 0.013), (54, 0.013), (55, 0.035), (74, 0.043), (76, 0.141), (80, 0.061), (92, 0.035)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.83735621 351 nips-2012-Transelliptical Component Analysis

Author: Fang Han, Han Liu

Abstract: We propose a high dimensional semiparametric scale-invariant principal component analysis, named TCA, by utilizing the natural connection between the elliptical distribution family and principal component analysis. The elliptical distribution family includes many well-known multivariate distributions, such as the multivariate Gaussian, t and logistic, and it was extended to the meta-elliptical family by Fang et al. (2002) using copula techniques. In this paper we extend the meta-elliptical distribution family to an even larger family, called transelliptical. We prove that TCA can obtain a near-optimal s log d/n estimation consistency rate in recovering the leading eigenvector of the latent generalized correlation matrix under the transelliptical distribution family, even if the distributions are very heavy-tailed, have infinite second moments, do not have densities and possess arbitrarily continuous marginal distributions. A feature selection result with explicit rate is also provided. TCA is further implemented in both numerical simulations and large-scale stock data to illustrate its empirical usefulness. Both theories and experiments confirm that TCA can achieve model flexibility, estimation accuracy and robustness at almost no cost.

2 0.80626786 248 nips-2012-Nonparanormal Belief Propagation (NPNBP)

Author: Gal Elidan, Cobi Cario

Abstract: The empirical success of the belief propagation approximate inference algorithm has inspired numerous theoretical and algorithmic advances. Yet, for continuous non-Gaussian domains performing belief propagation remains a challenging task: recent innovations such as nonparametric or kernel belief propagation, while useful, come with a substantial computational cost and offer little theoretical guarantees, even for tree structured models. In this work we present Nonparanormal BP for performing efficient inference on distributions parameterized by a Gaussian copulas network and any univariate marginals. For tree structured networks, our approach is guaranteed to be exact for this powerful class of non-Gaussian models. Importantly, the method is as efficient as standard Gaussian BP, and its convergence properties do not depend on the complexity of the univariate marginals, even when a nonparametric representation is used. 1

3 0.79897362 106 nips-2012-Dynamical And-Or Graph Learning for Object Shape Modeling and Detection

Author: Xiaolong Wang, Liang Lin

Abstract: This paper studies a novel discriminative part-based model to represent and recognize object shapes with an “And-Or graph”. We define this model consisting of three layers: the leaf-nodes with collaborative edges for localizing local parts, the or-nodes specifying the switch of leaf-nodes, and the root-node encoding the global verification. A discriminative learning algorithm, extended from the CCCP [23], is proposed to train the model in a dynamical manner: the model structure (e.g., the configuration of the leaf-nodes associated with the or-nodes) is automatically determined with optimizing the multi-layer parameters during the iteration. The advantages of our method are two-fold. (i) The And-Or graph model enables us to handle well large intra-class variance and background clutters for object shape detection from images. (ii) The proposed learning algorithm is able to obtain the And-Or graph representation without requiring elaborate supervision and initialization. We validate the proposed method on several challenging databases (e.g., INRIA-Horse, ETHZ-Shape, and UIUC-People), and it outperforms the state-of-the-arts approaches. 1

4 0.78322023 166 nips-2012-Joint Modeling of a Matrix with Associated Text via Latent Binary Features

Author: Xianxing Zhang, Lawrence Carin

Abstract: A new methodology is developed for joint analysis of a matrix and accompanying documents, with the documents associated with the matrix rows/columns. The documents are modeled with a focused topic model, inferring interpretable latent binary features for each document. A new matrix decomposition is developed, with latent binary features associated with the rows/columns, and with imposition of a low-rank constraint. The matrix decomposition and topic model are coupled by sharing the latent binary feature vectors associated with each. The model is applied to roll-call data, with the associated documents defined by the legislation. Advantages of the proposed model are demonstrated for prediction of votes on a new piece of legislation, based only on the observed text of legislation. The coupling of the text and legislation is also shown to yield insight into the properties of the matrix decomposition for roll-call data. 1

same-paper 5 0.74446696 249 nips-2012-Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison

Author: Tianbao Yang, Yu-feng Li, Mehrdad Mahdavi, Rong Jin, Zhi-Hua Zhou

Abstract: Both random Fourier features and the Nyström method have been successfully applied to efficient kernel learning. In this work, we investigate the fundamental difference between these two approaches, and how the difference could affect their generalization performances. Unlike approaches based on random Fourier features, where the basis functions (i.e., cosine and sine functions) are sampled from a distribution independent from the training data, basis functions used by the Nyström method are randomly sampled from the training examples and are therefore data dependent. By exploring this difference, we show that when there is a large gap in the eigen-spectrum of the kernel matrix, approaches based on the Nyström method can yield an impressively better generalization error bound than the random Fourier features based approach. We empirically verify our theoretical findings on a wide range of large data sets.

6 0.72731256 323 nips-2012-Statistical Consistency of Ranking Methods in A Rank-Differentiable Probability Space

7 0.708148 352 nips-2012-Transelliptical Graphical Models

8 0.60475522 310 nips-2012-Semiparametric Principal Component Analysis

9 0.56228316 154 nips-2012-How They Vote: Issue-Adjusted Models of Legislative Behavior

10 0.56100351 47 nips-2012-Augment-and-Conquer Negative Binomial Processes

11 0.55016387 163 nips-2012-Isotropic Hashing

12 0.54420495 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)

13 0.54116118 317 nips-2012-Smooth-projected Neighborhood Pursuit for High-dimensional Nonparanormal Graph Estimation

14 0.53816897 363 nips-2012-Wavelet based multi-scale shape features on arbitrary surfaces for cortical thickness discrimination

15 0.53763562 315 nips-2012-Slice sampling normalized kernel-weighted completely random measure mixture models

16 0.53675133 75 nips-2012-Collaborative Ranking With 17 Parameters

17 0.536111 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes

18 0.5323599 309 nips-2012-Semi-supervised Eigenvectors for Locally-biased Learning

19 0.52969372 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models

20 0.52800584 346 nips-2012-Topology Constraints in Graphical Models