nips nips2013 nips2013-76 knowledge-graph by maker-knowledge-mining

76 nips-2013-Correlated random features for fast semi-supervised learning


Source: pdf

Author: Brian McWilliams, David Balduzzi, Joachim Buhmann

Abstract: This paper presents Correlated Nyström Views (XNV), a fast semi-supervised algorithm for regression and classification. The algorithm draws on two main ideas. First, it generates two views consisting of computationally inexpensive random features. Second, multiview regression, using Canonical Correlation Analysis (CCA) on unlabeled data, biases the regression towards useful features. It has been shown that CCA regression can substantially reduce variance with a minimal increase in bias if the views contain accurate estimators. Recent theoretical and empirical work shows that regression with random features closely approximates kernel regression, implying that the accuracy requirement holds for random views. We show that XNV consistently outperforms a state-of-the-art algorithm for semi-supervised learning: substantially improving predictive performance and reducing the variability of performance on a wide variety of real-world datasets, whilst also reducing runtime by orders of magnitude.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Correlated random features for fast semi-supervised learning Brian McWilliams ETH Zürich, Switzerland brian. [sent-1, score-0.077]

2 Abstract This paper presents Correlated Nyström Views (XNV), a fast semi-supervised algorithm for regression and classification. [sent-10, score-0.058]

3 First, it generates two views consisting of computationally inexpensive random features. [sent-12, score-0.212]

4 Second, multiview regression, using Canonical Correlation Analysis (CCA) on unlabeled data, biases the regression towards useful features. [sent-13, score-0.156]

5 It has been shown that CCA regression can substantially reduce variance with a minimal increase in bias if the views contain accurate estimators. [sent-14, score-0.307]

6 Recent theoretical and empirical work shows that regression with random features closely approximates kernel regression, implying that the accuracy requirement holds for random views. [sent-15, score-0.2]

7 We show that XNV consistently outperforms a state-of-the-art algorithm for semi-supervised learning: substantially improving predictive performance and reducing the variability of performance on a wide variety of real-world datasets, whilst also reducing runtime by orders of magnitude. [sent-16, score-0.071]

8 For learning non-linear relationships, kernel methods achieve excellent performance but naïvely require operations cubic in the number of training points. [sent-18, score-0.092]

9 Random features have been introduced to approximate kernel machines when the number of training examples is very large, rendering exact kernel computation intractable. [sent-20, score-0.234]

10 Among several different approaches, the Nyström method for low-rank kernel approximation [1] exhibits good theoretical properties and empirical performance [3–5]. [sent-21, score-0.102]

11 Semi-supervised learning aims to improve prediction by extracting useful structure from the unlabeled data points and using this in conjunction with a function learned on a small number of labeled points. [sent-23, score-0.212]

12 We investigate two ways of doing so: one based on the Nyström method and another based on random Fourier features (so-called kitchen sinks) [2, 6]. [sent-28, score-0.11]

13 It turns out that the Nyström method almost always outperforms Fourier features by quite a large margin, so we only report these results in the main text. [sent-29, score-0.11]

14 The second step, following [7], uses Canonical Correlation Analysis (CCA, [8, 9]) to bias the optimization procedure towards features that are correlated across the views. [sent-30, score-0.148]

15 Intuitively, if both views contain accurate estimators, then penalizing uncorrelated features reduces variance without increasing the bias by much. [sent-31, score-0.411]

16 Recent theoretical work by Bach [5] shows that Nyström views can be expected to contain accurate estimators. [sent-32, score-0.211]

17 We find that XNV outperforms SSSL by around 10-15% on average, depending on the number of labeled points available, see §3. [sent-34, score-0.127]

18 We also find that the performance of XNV exhibits dramatically less variability than SSSL, with a typical reduction of 30%. [sent-35, score-0.06]

19 However, since SSSL does not scale up to large sets of unlabeled data, we modify SSSL by introducing a Nyström approximation to improve runtime performance. [sent-37, score-0.122]

20 Our approximate version of SSSL outperforms kernel ridge regression (KRR) by > 50% on the 18 datasets on average, in line with the results reported in [10], suggesting that we lose little by replacing the exact SSSL with our approximate implementation. [sent-39, score-0.254]

21 Surprisingly, despite guaranteeing improved prediction performance under a relatively weak assumption on the views, CCA regression has not been widely used since its proposal – to the best of our knowledge this is the first empirical evaluation of multi-view regression’s performance. [sent-43, score-0.108]

22 A possible reason for this is the difficulty in obtaining naturally occurring data equipped with multiple views that can be shown to satisfy the multi-view assumption. [sent-44, score-0.195]

23 We overcome this problem by constructing random views that satisfy the assumption by design. [sent-45, score-0.243]

24 First, given two equally useful but sufficiently different views on a dataset, penalizing regression using the canonical norm (computed via CCA) can substantially improve performance [7]. [sent-48, score-0.46]

25 The second is the Nyström method for constructing random features [1], which we use to construct the views. [sent-49, score-0.107]

26 We observe data $(x_1, y_1), \dots, (x_n, y_n)$ for $x_i \in \mathbb{R}^D$ and $y_i \in \mathbb{R}$, sampled according to joint distribution $P(x, y)$. [sent-54, score-0.096]

27 Further suppose we have two views on the data, $z^{(\nu)} : \mathbb{R}^D \to \dots$ [sent-55, score-0.195]

28 Further let $L(\mathcal{Z})$ denote the space of linear maps from a linear space $\mathcal{Z}$ to the reals, and define: $f^{(\nu)} := \operatorname{argmin}_g \operatorname{loss}(g)$ for $\nu \in \{1, 2\}$ and $f := \operatorname{argmin}_g \operatorname{loss}(g)$. [sent-61, score-0.064]

29 CCA finds bases for the two sets of variables such that the correlation between projections onto the bases is maximized. [sent-66, score-0.126]

30 The first pair of canonical basis vectors, $\big(b_1^{(1)}, b_1^{(2)}\big)$, is found by solving $\operatorname{argmax}\, \operatorname{corr}\big(b^{(1)\top} z^{(1)},\, b^{(2)\top} z^{(2)}\big)$. [sent-67, score-0.152]

31 Orthogonality: $\mathbb{E}\big[\bar z_j^{\top} \bar z_k\big] = \delta_{jk}$, where $\delta_{jk}$ is the Kronecker delta, and 2. $\mathbb{E}\big[\bar z_j^{(1)\top} \bar z_k^{(2)}\big] = \dots$ [sent-73, score-0.091]

32 $\lambda_j$ is referred to as the $j$-th canonical correlation coefficient. [sent-79, score-0.195]
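
To make the CCA step concrete, here is a minimal sketch that computes canonical bases and correlations via an SVD of the whitened cross-covariance. It is an illustration, not the authors' implementation; the function name, the Cholesky-based whitening, and the small stabilising ridge eps are assumptions.

```python
import numpy as np

def cca_bases(Z1, Z2, eps=1e-6):
    """Return canonical bases B1, B2 and canonical correlations lam.

    Z1, Z2 : (N, M1) and (N, M2) matrices holding two views of the same N points.
    eps    : small ridge added to the covariances for numerical stability (an
             assumption of this sketch, not part of the paper's description).
    """
    Z1 = Z1 - Z1.mean(axis=0)
    Z2 = Z2 - Z2.mean(axis=0)
    N = Z1.shape[0]
    C11 = Z1.T @ Z1 / N + eps * np.eye(Z1.shape[1])
    C22 = Z2.T @ Z2 / N + eps * np.eye(Z2.shape[1])
    C12 = Z1.T @ Z2 / N
    # Whitening factors: W1 @ W1.T == inv(C11), W2 @ W2.T == inv(C22).
    W1 = np.linalg.inv(np.linalg.cholesky(C11)).T
    W2 = np.linalg.inv(np.linalg.cholesky(C22)).T
    # SVD of the whitened cross-covariance gives the canonical directions.
    U, lam, Vt = np.linalg.svd(W1.T @ C12 @ W2, full_matrices=False)
    B1 = W1 @ U      # projections Z1 @ B1 have identity covariance
    B2 = W2 @ Vt.T   # and correlate with Z2 @ B2 coordinate-wise by lam
    return B1, B2, lam
```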

33 Given vector $\bar z^{(\nu)}$ in the canonical basis, define its canonical norm as $\|\bar z^{(\nu)}\|_{\mathrm{CCA}} := \sqrt{\sum_{j=1}^{D} \tfrac{1-\lambda_j}{\lambda_j}\big(\bar z_j^{(\nu)}\big)^2}$. [sent-81, score-0.096]

34 Assume we observe $n$ pairs of views coupled with real-valued labels, $\big(z_i^{(1)}, z_i^{(2)}, y_i\big)_{i=1}^{n}$; canonical ridge regression finds coefficients $\hat b^{(\nu)} = \big(\hat b_1^{(\nu)}, \dots$ [sent-83, score-0.561]

35 $\dots, \hat b_M^{(\nu)}\big)^{\top}$ such that $\hat b^{(\nu)} := \operatorname{argmin}_{b}\, \frac{1}{n}\sum_{i=1}^{n}\big(y_i - b^{\top}\bar z_i^{(\nu)}\big)^2 + \|b\|_{\mathrm{CCA}}^2$. (3) [sent-86, score-0.098]

36 The resulting estimator, referred to as the canonical shrinkage estimator, is $\hat b_j^{(\nu)} = \frac{\lambda_j}{n}\sum_{i=1}^{n} \bar z_{i,j}^{(\nu)}\, y_i$. (4) [sent-87, score-0.096]

37 Penalizing with the canonical norm biases the optimization towards features that are highly correlated across the views. [sent-88, score-0.148]
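
Once a view has been expressed in the canonical basis, the canonical shrinkage estimator of Eq. (4) above is a one-line computation. The sketch below assumes that projection has already been applied; the variable names are illustrative.

```python
import numpy as np

def canonical_shrinkage(Z_bar, y, lam):
    """Canonical shrinkage estimator, Eq. (4) above.

    Z_bar : (n, M) labeled view expressed in the canonical (CCA) basis.
    y     : (n,) real-valued labels.
    lam   : (M,) canonical correlation coefficients.

    Coordinate j is the empirical covariance of feature j with the label,
    shrunk by its canonical correlation lambda_j.
    """
    n = Z_bar.shape[0]
    return lam * (Z_bar.T @ y) / n
```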

38 Good regressors exist in both views by Assumption 1. [sent-89, score-0.215]

39 Thus, intuitively, penalizing uncorrelated features significantly reduces variance, without increasing the bias by much. [sent-90, score-0.184]

40 Let $\hat f^{(\nu)}$ denote the estimator constructed with the canonical shrinkage estimator, Eq. (4). [sent-93, score-0.179]

41 The first term, $5\epsilon$, bounds the bias of the canonical estimator, whereas the second, $\frac{1}{n}\sum_j \lambda_j^2$, bounds the variance. [sent-96, score-0.107]

42 The $\sum_j \lambda_j^2$ can be thought of as a measure of the “intrinsic dimensionality” of the unlabeled data, which controls the rate of convergence. [sent-97, score-0.072]

43 If the canonical correlation coefficients decay sufficiently rapidly, then the increase in bias is more than made up for by the decrease in variance. [sent-98, score-0.217]

44 Constructing random views: we construct two views satisfying Assumption 1 in expectation, see Theorem 3 below. [sent-100, score-0.39]

45 To ensure our method scales to large sets of unlabeled data, we use random features generated using the Nyström method [1]. [sent-101, score-0.149]

46 Here, $\phi(x)$ defines a mapping from $\mathbb{R}^D$ to a high-dimensional feature space and $\kappa(\cdot, \cdot)$ is a positive semi-definite kernel function. [sent-104, score-0.065]

47 The idea behind random features is to instead define a lower-dimensional mapping, $z(x_i) : \mathbb{R}^D \to \dots$ [sent-105, score-0.077]

48 Vectors of random features can be constructed as $\hat z(x_i) = \hat D^{-1/2}\, \hat V^{\top} \big[\kappa(x_i, \hat x_1), \dots, \kappa(x_i, \hat x_M)\big]^{\top}$. [sent-119, score-0.077]

49 Constructing features in this way reduces the time complexity of learning a non-linear prediction function from $O(N^3)$ to $O(N)$ [15]. [sent-123, score-0.127]
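
A minimal sketch of the Nyström feature construction just described, assuming a Gaussian kernel and landmark points sampled uniformly from the data; both choices are illustrative, since the construction itself works for any positive semi-definite kernel.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def nystrom_features(X, landmarks, sigma=1.0, eps=1e-10):
    """Nystrom features z(x) = D^{-1/2} V^T [k(x, x_1), ..., k(x, x_M)].

    landmarks : (M, D) points sampled uniformly from the (mostly unlabeled) data.
    Returns an (N, M) matrix; inner products of its rows approximate the kernel.
    """
    K_MM = gaussian_kernel(landmarks, landmarks, sigma)   # sub-sampled kernel matrix
    evals, evecs = np.linalg.eigh(K_MM)                   # K_MM = V D V^T
    evals = np.maximum(evals, eps)                        # guard tiny/negative eigenvalues
    K_NM = gaussian_kernel(X, landmarks, sigma)           # kernel between data and landmarks
    return K_NM @ evecs / np.sqrt(evals)                  # scales each column by 1/sqrt(d_j)
```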

50 $\dots, \hat\varphi_r\}$, where $r$ is the rank of $\hat K$ and the $\hat\varphi_i$ are the first $r$ eigenfunctions of $\hat L_M$. [sent-128, score-0.074]

51 … $\mathcal{H}$ spanned by the eigenfunctions of the linear operator $L_M$ in Eq. [sent-130, score-0.074]

52 Solving $\min_{w \in \mathbb{R}^r} \frac{1}{N}\sum_{i=1}^{N} \ell\big(w^{\top} z(x_i), y_i\big) + \lambda\|w\|_2^2$ (7) is equivalent to solving $\min_{f} \frac{1}{N}\sum_{i=1}^{N} \ell\big(f(x_i), y_i\big) + \lambda\|f\|^2$ (8). [sent-132, score-0.08]

53 The proposed algorithm: Correlated Nyström Views (XNV). Algorithm 1 details our approach to semi-supervised learning based on generating two views consisting of Nyström random features and penalizing features which are weakly correlated across views. [sent-134, score-0.449]

54 The setting is that we have labeled data $\{x_i, y_i\}_{i=1}^{n}$ and a large amount of unlabeled data $\{x_i\}_{i=n+1}^{N}$. [sent-135, score-0.193]

55 The next two steps implement multi-view regression using the randomly generated views $z^{(1)}(x)$ and $z^{(2)}(x)$. [sent-137, score-0.253]

56 Input: Labeled data $\{x_i, y_i\}_{i=1}^{n}$ and unlabeled data $\{x_i\}_{i=n+1}^{N}$. 1: Generate features. [sent-140, score-0.112]

57 Sample $\hat x_1, \dots, \hat x_{2M}$ uniformly from the dataset, compute the eigendecompositions of the sub-sampled kernel matrices $\hat K^{(1)}$ and $\hat K^{(2)}$, which are constructed from the samples $1, \dots$ [sent-144, score-0.065]

58 Compute CCA bases $B^{(1)}, B^{(2)}$ and canonical correlations $\lambda_1, \dots, \lambda_M$ for the two views and set $\bar z_i = \dots$ 3: Labeled data. [sent-155, score-0.413]

59 In this way, weakly correlated features are heavily downweighted in the CCA basis without introducing an additional tuning parameter. [sent-161, score-0.092]

60 The further penalty on the $\ell_2$ norm (in the CCA basis) is introduced as a practical measure to control the variance of the estimator $\hat b^{(1)}$, which can become large if there are many highly correlated features (i.e. [sent-162, score-0.183]
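
Putting the pieces together, the following sketch mirrors the three steps summarized above, reusing the nystrom_features and cca_bases helpers sketched earlier. Splitting the 2M landmarks into two disjoint halves, the default extra penalty gamma, the intercept handling, and the closed-form solve are assumptions made for a compact illustration, not details of the authors' implementation.

```python
import numpy as np

def xnv_fit(X_lab, y_lab, X_unlab, M=200, sigma=1.0, gamma=1e-3, seed=0):
    """Sketch of XNV training; returns a predict(X_new) function."""
    rng = np.random.default_rng(seed)
    X_all = np.vstack([X_lab, X_unlab])
    idx = rng.choice(len(X_all), size=2 * M, replace=False)
    L1, L2 = X_all[idx[:M]], X_all[idx[M:]]          # two sets of landmarks

    # Step 1: two Nystrom views of every point (labeled and unlabeled).
    Z1 = nystrom_features(X_all, L1, sigma)
    Z2 = nystrom_features(X_all, L2, sigma)

    # Step 2: CCA on the unlabeled portion; lam are the canonical correlations.
    n = len(X_lab)
    mu1 = Z1[n:].mean(axis=0)
    B1, B2, lam = cca_bases(Z1[n:], Z2[n:])
    lam = np.clip(lam, 1e-12, 1.0)

    # Step 3: regression on the n labeled points in the CCA basis of view 1,
    # penalized by the canonical norm sum_j (1 - lam_j)/lam_j * b_j^2 plus gamma.
    Z_lab = (Z1[:n] - mu1) @ B1
    y0 = y_lab.mean()
    penalty = (1.0 - lam) / lam + gamma
    b = np.linalg.solve(Z_lab.T @ Z_lab / n + np.diag(penalty),
                        Z_lab.T @ (y_lab - y0) / n)

    def predict(X_new):
        Z_new = (nystrom_features(X_new, L1, sigma) - mu1) @ B1
        return Z_new @ b + y0
    return predict
```

Regressing on a single view in the CCA basis follows the canonical ridge setup above; at prediction time only the first view's features need to be computed.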

61 Nyström sampling, step 1, reduces the $O(N^3)$ operations required for kernel learning to $O(N)$. [sent-170, score-0.083]

62 The quality of the kernel approximation in (5) has been the subject of detailed study in recent years leading to a number of strong empirical and theoretical results [3–5, 15]. [sent-175, score-0.083]

63 $\dots, y_N]^{\top}$, and define the smoothed estimate $\hat y_{\text{kernel}} := (K + N\lambda I)^{-1} K(y + \xi)$ and the smoothed Nyström estimate $\hat y_{\text{Nyström}} := (\tilde K + N\lambda I)^{-1} \tilde K(y + \xi)$, both computed by minimizing the MSE with ridge penalty $\lambda$. [sent-181, score-0.171]

64 For sufficiently large $M$ (depending on $\eta$, see [5]), we have $\mathbb{E}_M\, \mathbb{E}_{\xi}\big[\|y - \hat y_{\text{Nyström}}\|_2^2\big] \le (1 + 4\eta) \cdot \mathbb{E}_{\xi}\big[\|y - \hat y_{\text{kernel}}\|_2^2\big]$, where $\mathbb{E}_M$ refers to the expectation over subsampled columns used to construct $\tilde K$. [sent-183, score-0.08]

65 In short, the best smoothed estimators in the Nyström views are close to the optimal smoothed estimator. [sent-184, score-0.255]

66 Since the kernel estimate is consistent, $\operatorname{loss}(f) \to \dots$ [sent-185, score-0.065]

67 An alternative approach to constructing random views is to use Fourier features instead of Nyström features in Step 1. [sent-190, score-0.379]

68 We therefore do not discuss Fourier features in the main text, see §SI. [sent-193, score-0.077]
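
For reference, the random Fourier feature ("kitchen sinks") construction mentioned above can be sketched as follows for a Gaussian kernel; this is the standard Rahimi and Recht recipe rather than code from the paper.

```python
import numpy as np

def fourier_features(X, M=200, sigma=1.0, seed=0):
    """Random Fourier features whose inner products approximate a Gaussian kernel.

    Frequencies are drawn from N(0, 1/sigma^2) and phases uniformly from [0, 2*pi);
    then z(x) = sqrt(2/M) * cos(W^T x + b).
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / sigma, size=(X.shape[1], M))
    b = rng.uniform(0.0, 2.0 * np.pi, size=M)
    return np.sqrt(2.0 / M) * np.cos(X @ W + b)
```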

69 … (6) and then solves $\operatorname{argmin}_{w \in \mathbb{R}^s} \sum_{i=1}^{n}\Big(\sum_{j=1}^{s} w_j\, \hat\varphi_j(x_i) - y_i\Big)^{2}$, (10) where $s$ is set by the user. [sent-199, score-0.072]

70 First, instead of constructing the full Gram matrix, we construct a Nyström approximation by sampling $M$ points from the labeled and unlabeled training set. [sent-205, score-0.255]

71 Second, instead of thresholding eigenfunctions, we use the easier-to-tune ridge penalty, which penalizes directions proportional to the inverse square of their eigenvalues [18]. [sent-206, score-0.079]

72 As justification, note that Proposition 2 states that the Nyström approximation to kernel regression actually solves a ridge regression problem in the span of the eigenfunctions of $\hat L_M$. [sent-207, score-0.37]

73 We will also refer to the Nyström approximation (in the span of $\hat L_M$) to SSSL using $2M$ features as SSSL$_{2M}$. [sent-209, score-0.113]
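
Under these two modifications, the approximate SSSL baseline amounts to ordinary ridge regression on a single set of Nyström features. A minimal sketch, reusing nystrom_features from above; the ridge value is illustrative.

```python
import numpy as np

def sssl_nystrom(X_lab, y_lab, X_unlab, M=200, sigma=1.0, ridge=1e-3, seed=0):
    """Nystrom-approximated SSSL baseline: ridge regression on Nystrom features
    built from landmarks sampled from the labeled and unlabeled data."""
    rng = np.random.default_rng(seed)
    X_all = np.vstack([X_lab, X_unlab])
    landmarks = X_all[rng.choice(len(X_all), size=M, replace=False)]
    Z = nystrom_features(X_lab, landmarks, sigma)
    n = len(X_lab)
    w = np.linalg.solve(Z.T @ Z / n + ridge * np.eye(M), Z.T @ y_lab / n)
    return lambda X_new: nystrom_features(X_new, landmarks, sigma) @ w
```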

74 The datasets cover a variety of regression (denoted by R) and two-class classification (C) problems. [sent-213, score-0.091]

75 The sarcos dataset involves predicting the joint position of a robot arm; following convention we report results on the 1st, 5th and 7th joint positions. [sent-214, score-0.08]

76 The SSSL algorithm was shown to exhibit state-of-the-art performance over fully and semi-supervised methods in scenarios where few labeled training examples are available [10]. [sent-215, score-0.108]

77 We set the kernel width and the $\ell_2$ regularisation strength, $\lambda$, for each method using 5-fold cross-validation with 1000 labeled training examples. [sent-219, score-0.173]

78 We trained all methods using a squared error loss function, $\ell(f(x_i), y_i) = (f(x_i) - y_i)^2$, with $M = 200$ random features, and $n = 100, 150, 200, \dots$ [sent-220, score-0.116]

79 Runtimes (bank8, cal housing, sylva): SSSL 72s, 2300s, …; SSSL$_{2M}$ 0.3s, 26s, … [sent-243, score-0.108]

80 For the cal housing dataset, XNV exhibits an almost 1800× speed-up over SSSL. [sent-247, score-0.095]

81 For regression tasks we report on the mean squared error (MSE) on the testing set normalized by the variance of the test output. [sent-252, score-0.104]
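
For concreteness, the evaluation metric described here can be computed as below; the function name is illustrative.

```python
import numpy as np

def normalized_mse(y_true, y_pred):
    """Test-set mean squared error normalized by the variance of the test outputs."""
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)
```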

82 XNV vs SSSL$_{M/2M}$. Avg reduction in error: 11% (n = 100), 16% (n = 200), 15% (n = 300), 12% (n = 400), 9% (n = 500); Avg reduction in std err: 15% (n = 100), 30% (n = 200), 31% (n = 300), 33% (n = 400), 30% (n = 500). The reduced variability is to be expected from Theorem 1. [sent-256, score-0.149]

83 (Figure 1: mean prediction error vs. number of labeled training points (100 to 1000) for XNV and SSSL$_M$ on the (a) adult, (b) cal housing, and (c) census datasets.) [sent-279, score-0.27] [sent-288, score-0.135] [sent-289, score-0.124]

86 The plots in Figure 1 show a representative comparison of mean prediction errors for several datasets when $n = 100, \dots$ [sent-306, score-0.065]

87 Observe that XNV almost always improves prediction accuracy and reduces variance compared with SSSLM and SSSL2M when the labeled training set contains between 100 and 500 labeled points. [sent-311, score-0.255]

88 This suggests that when there are few labeled points, obtaining a more accurate estimate of the eigenfunctions of the kernel does not necessarily improve predictive performance. (Footnote 6: computed in Matlab.) [sent-316, score-0.081] [sent-318, score-0.155]

90 Indeed, when more random features are added, stronger regularization is required to reduce the influence of uninformative features; this also has the effect of downweighting informative features. [sent-319, score-0.077]

91 2 compares the performance of SSSLM and XNV to fully supervised kernel ridge regression (KRR). [sent-322, score-0.202]

92 Nyström features significantly outperform Fourier features, in line with observations in [3]. [sent-325, score-0.077]

93 The table below shows the relative improvement of XNV over XKS. Avg reduction in error: 30% (n = 100), 28% (n = 200), 26% (n = 300), 25% (n = 400), 24% (n = 500); Avg reduction in std err: 36% (n = 100), 44% (n = 200), 34% (n = 300), 37% (n = 400), 36% (n = 500). Further results and discussion for XKS are included in the supplementary material. [sent-326, score-0.129]

94 By combining two randomly generated views of Nyström features via an efficient implementation of CCA, XNV outperforms the prior state-of-the-art, SSSL, by 10-15% (depending on the number of labeled points) on average over 18 datasets. [sent-546, score-0.372]

95 An interesting research direction is to investigate using the recently developed deep CCA algorithm, which extracts higher order correlations between views [19], as a preprocessing step. [sent-548, score-0.216]

96 Since CCA gives us a criterion by which to measure the importance of random features, in the future we aim to investigate active sampling schemes based on canonical correlations which may yield better performance by selecting the most informative indices to sample. [sent-550, score-0.158]

97 References: [1] Williams C, Seeger M: Using the Nyström method to speed up kernel machines. [sent-553, score-0.065]

98 [5] Bach F: Sharp analysis of low-rank kernel approximations. [sent-561, score-0.065]

99 [11] Belkin M, Niyogi P, Sindhwani V: Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. [sent-573, score-0.153]

100 [12] Blum A, Mitchell T: Combining labeled and unlabeled data with co-training. [sent-575, score-0.153]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('xnv', 0.583), ('nystr', 0.417), ('sssl', 0.405), ('ssslm', 0.275), ('cca', 0.203), ('views', 0.195), ('canonical', 0.137), ('xks', 0.081), ('labeled', 0.081), ('ridge', 0.079), ('features', 0.077), ('eigenfunctions', 0.074), ('unlabeled', 0.072), ('sarcos', 0.066), ('kernel', 0.065), ('gram', 0.061), ('fourier', 0.059), ('correlation', 0.058), ('regression', 0.058), ('xi', 0.056), ('penalizing', 0.051), ('correlated', 0.049), ('avg', 0.045), ('sinks', 0.043), ('lm', 0.042), ('cal', 0.041), ('yi', 0.04), ('housing', 0.035), ('bases', 0.034), ('datasets', 0.033), ('kitchen', 0.033), ('runtime', 0.032), ('ibn', 0.032), ('krr', 0.032), ('mcwilliams', 0.032), ('sylva', 0.032), ('ykernel', 0.032), ('ynystr', 0.032), ('prediction', 0.032), ('argmin', 0.032), ('eth', 0.031), ('switzerland', 0.031), ('smoothed', 0.03), ('constructing', 0.03), ('xm', 0.029), ('kakade', 0.029), ('training', 0.027), ('points', 0.027), ('mw', 0.026), ('adv', 0.026), ('multiview', 0.026), ('zi', 0.026), ('jk', 0.026), ('vs', 0.026), ('avron', 0.025), ('livescu', 0.025), ('ky', 0.024), ('kcca', 0.024), ('std', 0.024), ('rahimi', 0.024), ('colt', 0.023), ('bm', 0.023), ('estimator', 0.022), ('zj', 0.022), ('bias', 0.022), ('generalization', 0.022), ('correlations', 0.021), ('err', 0.021), ('reduction', 0.021), ('shrinkage', 0.02), ('loss', 0.02), ('mahoney', 0.02), ('variability', 0.02), ('regressors', 0.02), ('foster', 0.019), ('norm', 0.019), ('bach', 0.019), ('exhibits', 0.019), ('sm', 0.019), ('ln', 0.019), ('outperforms', 0.019), ('assumption', 0.018), ('span', 0.018), ('approximation', 0.018), ('reduces', 0.018), ('zk', 0.017), ('jin', 0.017), ('computationally', 0.017), ('randomization', 0.016), ('uncorrelated', 0.016), ('accurate', 0.016), ('error', 0.016), ('variance', 0.016), ('jmlr', 0.016), ('mse', 0.016), ('basis', 0.015), ('laplacian', 0.015), ('nds', 0.015), ('rich', 0.015), ('report', 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 76 nips-2013-Correlated random features for fast semi-supervised learning

Author: Brian McWilliams, David Balduzzi, Joachim Buhmann

Abstract: This paper presents Correlated Nyström Views (XNV), a fast semi-supervised algorithm for regression and classification. The algorithm draws on two main ideas. First, it generates two views consisting of computationally inexpensive random features. Second, multiview regression, using Canonical Correlation Analysis (CCA) on unlabeled data, biases the regression towards useful features. It has been shown that CCA regression can substantially reduce variance with a minimal increase in bias if the views contain accurate estimators. Recent theoretical and empirical work shows that regression with random features closely approximates kernel regression, implying that the accuracy requirement holds for random views. We show that XNV consistently outperforms a state-of-the-art algorithm for semi-supervised learning: substantially improving predictive performance and reducing the variability of performance on a wide variety of real-world datasets, whilst also reducing runtime by orders of magnitude.

2 0.13474727 40 nips-2013-Approximate Inference in Continuous Determinantal Processes

Author: Raja Hafiz Affandi, Emily Fox, Ben Taskar

Abstract: Determinantal point processes (DPPs) are random point processes well-suited for modeling repulsion. In machine learning, the focus of DPP-based models has been on diverse subset selection from a discrete and finite base set. This discrete setting admits an efficient sampling algorithm based on the eigendecomposition of the defining kernel matrix. Recently, there has been growing interest in using DPPs defined on continuous spaces. While the discrete-DPP sampler extends formally to the continuous case, computationally, the steps required are not tractable in general. In this paper, we present two efficient DPP sampling schemes that apply to a wide range of kernel functions: one based on low rank approximations via Nyström and random Fourier feature techniques and another based on Gibbs sampling. We demonstrate the utility of continuous DPPs in repulsive mixture modeling and synthesizing human poses spanning activity spaces.

3 0.06581638 211 nips-2013-Non-Linear Domain Adaptation with Boosting

Author: Carlos J. Becker, Christos M. Christoudias, Pascal Fua

Abstract: A common assumption in machine vision is that the training and test samples are drawn from the same distribution. However, there are many problems when this assumption is grossly violated, as in bio-medical applications where different acquisitions can generate drastic variations in the appearance of the data due to changing experimental conditions. This problem is accentuated with 3D data, for which annotation is very time-consuming, limiting the amount of data that can be labeled in new acquisitions for training. In this paper we present a multitask learning algorithm for domain adaptation based on boosting. Unlike previous approaches that learn task-specific decision boundaries, our method learns a single decision boundary in a shared feature space, common to all tasks. We use the boosting-trick to learn a non-linear mapping of the observations in each task, with no need for specific a-priori knowledge of its global analytical form. This yields a more parameter-free domain adaptation approach that successfully leverages learning on new tasks where labeled data is scarce. We evaluate our approach on two challenging bio-medical datasets and achieve a significant improvement over the state of the art. 1

4 0.061252389 327 nips-2013-The Randomized Dependence Coefficient

Author: David Lopez-Paz, Philipp Hennig, Bernhard Schölkopf

Abstract: We introduce the Randomized Dependence Coefficient (RDC), a measure of nonlinear dependence between random variables of arbitrary dimension based on the Hirschfeld-Gebelein-Rényi Maximum Correlation Coefficient. RDC is defined in terms of correlation of random non-linear copula projections; it is invariant with respect to marginal distribution transformations, has low computational cost and is easy to implement: just five lines of R code, included at the end of the paper.

5 0.053313736 149 nips-2013-Latent Structured Active Learning

Author: Wenjie Luo, Alex Schwing, Raquel Urtasun

Abstract: In this paper we present active learning algorithms in the context of structured prediction problems. To reduce the amount of labeling necessary to learn good models, our algorithms operate with weakly labeled data and we query additional examples based on entropies of local marginals, which are a good surrogate for uncertainty. We demonstrate the effectiveness of our approach in the task of 3D layout prediction from single images, and show that good models are learned when labeling only a handful of random variables. In particular, the same performance as using the full training set can be obtained while only labeling ∼10% of the random variables. 1

6 0.050998785 144 nips-2013-Inverse Density as an Inverse Problem: the Fredholm Equation Approach

7 0.04739387 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction

8 0.046749063 251 nips-2013-Predicting Parameters in Deep Learning

9 0.045941953 170 nips-2013-Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space

10 0.045780495 297 nips-2013-Sketching Structured Matrices for Faster Nonlinear Regression

11 0.04309094 99 nips-2013-Dropout Training as Adaptive Regularization

12 0.041894335 68 nips-2013-Confidence Intervals and Hypothesis Testing for High-Dimensional Statistical Models

13 0.041837625 105 nips-2013-Efficient Optimization for Sparse Gaussian Process Regression

14 0.041129187 281 nips-2013-Robust Low Rank Kernel Embeddings of Multivariate Distributions

15 0.040202651 65 nips-2013-Compressive Feature Learning

16 0.039847519 145 nips-2013-It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals

17 0.039593104 120 nips-2013-Faster Ridge Regression via the Subsampled Randomized Hadamard Transform

18 0.039506756 44 nips-2013-B-test: A Non-parametric, Low Variance Kernel Two-sample Test

19 0.038465902 173 nips-2013-Least Informative Dimensions

20 0.037765581 55 nips-2013-Bellman Error Based Feature Generation using Random Projections on Sparse Spaces


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.11), (1, 0.044), (2, 0.007), (3, 0.004), (4, 0.029), (5, 0.027), (6, -0.005), (7, 0.028), (8, -0.056), (9, 0.024), (10, -0.017), (11, -0.016), (12, -0.064), (13, -0.05), (14, 0.109), (15, 0.033), (16, -0.047), (17, 0.034), (18, 0.062), (19, -0.029), (20, -0.02), (21, -0.022), (22, -0.019), (23, 0.044), (24, -0.011), (25, 0.016), (26, 0.04), (27, -0.023), (28, 0.031), (29, -0.073), (30, -0.007), (31, 0.005), (32, -0.017), (33, 0.007), (34, 0.011), (35, -0.017), (36, -0.026), (37, -0.071), (38, 0.036), (39, 0.029), (40, 0.001), (41, -0.085), (42, 0.003), (43, 0.003), (44, -0.016), (45, 0.015), (46, 0.015), (47, 0.001), (48, 0.024), (49, -0.054)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91045743 76 nips-2013-Correlated random features for fast semi-supervised learning

Author: Brian McWilliams, David Balduzzi, Joachim Buhmann

Abstract: This paper presents Correlated Nyström Views (XNV), a fast semi-supervised algorithm for regression and classification. The algorithm draws on two main ideas. First, it generates two views consisting of computationally inexpensive random features. Second, multiview regression, using Canonical Correlation Analysis (CCA) on unlabeled data, biases the regression towards useful features. It has been shown that CCA regression can substantially reduce variance with a minimal increase in bias if the views contain accurate estimators. Recent theoretical and empirical work shows that regression with random features closely approximates kernel regression, implying that the accuracy requirement holds for random views. We show that XNV consistently outperforms a state-of-the-art algorithm for semi-supervised learning: substantially improving predictive performance and reducing the variability of performance on a wide variety of real-world datasets, whilst also reducing runtime by orders of magnitude.

2 0.73320663 156 nips-2013-Learning Kernels Using Local Rademacher Complexity

Author: Corinna Cortes, Marius Kloft, Mehryar Mohri

Abstract: We use the notion of local Rademacher complexity to design new algorithms for learning kernels. Our algorithms thereby benefit from the sharper learning bounds based on that notion which, under certain general conditions, guarantee a faster convergence rate. We devise two new learning kernel algorithms: one based on a convex optimization problem for which we give an efficient solution using existing learning kernel techniques, and another one that can be formulated as a DC-programming problem for which we describe a solution in detail. We also report the results of experiments with both algorithms in both binary and multi-class classification tasks. 1

3 0.69286937 144 nips-2013-Inverse Density as an Inverse Problem: the Fredholm Equation Approach

Author: Qichao Que, Mikhail Belkin

Abstract: We address the problem of estimating the ratio $\frac{q}{p}$ where $p$ is a density function and $q$ is another density, or, more generally an arbitrary function. Knowing or approximating this ratio is needed in various problems of inference and integration often referred to as importance sampling in statistical inference. It is also closely related to the problem of covariate shift in transfer learning. Our approach is based on reformulating the problem of estimating the ratio as an inverse problem in terms of an integral operator corresponding to a kernel, known as the Fredholm problem of the first kind. This formulation, combined with the techniques of regularization leads to a principled framework for constructing algorithms and for analyzing them theoretically. The resulting family of algorithms (FIRE, for Fredholm Inverse Regularized Estimator) is flexible, simple and easy to implement. We provide detailed theoretical analysis including concentration bounds and convergence rates for the Gaussian kernel for densities defined on $\mathbb{R}^d$ and smooth $d$-dimensional sub-manifolds of the Euclidean space. Model selection for unsupervised or semi-supervised inference is generally a difficult problem. It turns out that in the density ratio estimation setting, when samples from both distributions are available, simple completely unsupervised model selection methods are available. We call this mechanism CD-CV for Cross-Density Cross-Validation. We show encouraging experimental results including applications to classification within the covariate shift framework.

4 0.64513141 40 nips-2013-Approximate Inference in Continuous Determinantal Processes

Author: Raja Hafiz Affandi, Emily Fox, Ben Taskar

Abstract: Determinantal point processes (DPPs) are random point processes well-suited for modeling repulsion. In machine learning, the focus of DPP-based models has been on diverse subset selection from a discrete and finite base set. This discrete setting admits an efficient sampling algorithm based on the eigendecomposition of the defining kernel matrix. Recently, there has been growing interest in using DPPs defined on continuous spaces. While the discrete-DPP sampler extends formally to the continuous case, computationally, the steps required are not tractable in general. In this paper, we present two efficient DPP sampling schemes that apply to a wide range of kernel functions: one based on low rank approximations via Nyström and random Fourier feature techniques and another based on Gibbs sampling. We demonstrate the utility of continuous DPPs in repulsive mixture modeling and synthesizing human poses spanning activity spaces.

5 0.64406282 327 nips-2013-The Randomized Dependence Coefficient

Author: David Lopez-Paz, Philipp Hennig, Bernhard Schölkopf

Abstract: We introduce the Randomized Dependence Coefficient (RDC), a measure of nonlinear dependence between random variables of arbitrary dimension based on the Hirschfeld-Gebelein-Rényi Maximum Correlation Coefficient. RDC is defined in terms of correlation of random non-linear copula projections; it is invariant with respect to marginal distribution transformations, has low computational cost and is easy to implement: just five lines of R code, included at the end of the paper.

6 0.63651472 170 nips-2013-Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space

7 0.60564852 135 nips-2013-Heterogeneous-Neighborhood-based Multi-Task Local Learning Algorithms

8 0.5920819 90 nips-2013-Direct 0-1 Loss Minimization and Margin Maximization with Boosting

9 0.58635306 244 nips-2013-Parametric Task Learning

10 0.57630074 297 nips-2013-Sketching Structured Matrices for Faster Nonlinear Regression

11 0.55319202 44 nips-2013-B-test: A Non-parametric, Low Variance Kernel Two-sample Test

12 0.55108935 171 nips-2013-Learning with Noisy Labels

13 0.54500753 225 nips-2013-One-shot learning and big data with n=2

14 0.53714502 211 nips-2013-Non-Linear Domain Adaptation with Boosting

15 0.53306997 31 nips-2013-Adaptivity to Local Smoothness and Dimension in Kernel Regression

16 0.52404058 145 nips-2013-It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals

17 0.51908463 209 nips-2013-New Subsampling Algorithms for Fast Least Squares Regression

18 0.51764017 223 nips-2013-On the Relationship Between Binary Classification, Bipartite Ranking, and Binary Class Probability Estimation

19 0.51506591 65 nips-2013-Compressive Feature Learning

20 0.51146907 120 nips-2013-Faster Ridge Regression via the Subsampled Randomized Hadamard Transform


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.011), (16, 0.047), (33, 0.105), (34, 0.09), (36, 0.016), (39, 0.315), (41, 0.024), (49, 0.022), (56, 0.1), (70, 0.021), (85, 0.026), (89, 0.022), (93, 0.086), (95, 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.76951802 12 nips-2013-A Novel Two-Step Method for Cross Language Representation Learning

Author: Min Xiao, Yuhong Guo

Abstract: Cross language text classification is an important learning task in natural language processing. A critical challenge of cross language learning arises from the fact that words of different languages are in disjoint feature spaces. In this paper, we propose a two-step representation learning method to bridge the feature spaces of different languages by exploiting a set of parallel bilingual documents. Specifically, we first formulate a matrix completion problem to produce a complete parallel document-term matrix for all documents in two languages, and then induce a low dimensional cross-lingual document representation by applying latent semantic indexing on the obtained matrix. We use a projected gradient descent algorithm to solve the formulated matrix completion problem with convergence guarantees. The proposed method is evaluated by conducting a set of experiments with cross language sentiment classification tasks on Amazon product reviews. The experimental results demonstrate that the proposed learning method outperforms a number of other cross language representation learning methods, especially when the number of parallel bilingual documents is small. 1

same-paper 2 0.70775229 76 nips-2013-Correlated random features for fast semi-supervised learning

Author: Brian McWilliams, David Balduzzi, Joachim Buhmann

Abstract: This paper presents Correlated Nyström Views (XNV), a fast semi-supervised algorithm for regression and classification. The algorithm draws on two main ideas. First, it generates two views consisting of computationally inexpensive random features. Second, multiview regression, using Canonical Correlation Analysis (CCA) on unlabeled data, biases the regression towards useful features. It has been shown that CCA regression can substantially reduce variance with a minimal increase in bias if the views contain accurate estimators. Recent theoretical and empirical work shows that regression with random features closely approximates kernel regression, implying that the accuracy requirement holds for random views. We show that XNV consistently outperforms a state-of-the-art algorithm for semi-supervised learning: substantially improving predictive performance and reducing the variability of performance on a wide variety of real-world datasets, whilst also reducing runtime by orders of magnitude.

3 0.65342063 118 nips-2013-Fast Determinantal Point Process Sampling with Application to Clustering

Author: Byungkon Kang

Abstract: Determinantal Point Process (DPP) has gained much popularity for modeling sets of diverse items. The gist of DPP is that the probability of choosing a particular set of items is proportional to the determinant of a positive definite matrix that defines the similarity of those items. However, computing the determinant requires time cubic in the number of items, and is hence impractical for large sets. In this paper, we address this problem by constructing a rapidly mixing Markov chain, from which we can acquire a sample from the given DPP in sub-cubic time. In addition, we show that this framework can be extended to sampling from cardinality-constrained DPPs. As an application, we show how our sampling algorithm can be used to provide a fast heuristic for determining the number of clusters, resulting in better clustering.

4 0.60235673 341 nips-2013-Universal models for binary spike patterns using centered Dirichlet processes

Author: Il M. Park, Evan W. Archer, Kenneth Latimer, Jonathan W. Pillow

Abstract: Probabilistic models for binary spike patterns provide a powerful tool for understanding the statistical dependencies in large-scale neural recordings. Maximum entropy (or “maxent”) models, which seek to explain dependencies in terms of low-order interactions between neurons, have enjoyed remarkable success in modeling such patterns, particularly for small groups of neurons. However, these models are computationally intractable for large populations, and low-order maxent models have been shown to be inadequate for some datasets. To overcome these limitations, we propose a family of “universal” models for binary spike patterns, where universality refers to the ability to model arbitrary distributions over all 2m binary patterns. We construct universal models using a Dirichlet process centered on a well-behaved parametric base measure, which naturally combines the flexibility of a histogram and the parsimony of a parametric model. We derive computationally efficient inference methods using Bernoulli and cascaded logistic base measures, which scale tractably to large populations. We also establish a condition for equivalence between the cascaded logistic and the 2nd-order maxent or “Ising” model, making cascaded logistic a reasonable choice for base measure in a universal model. We illustrate the performance of these models using neural data. 1

5 0.52870721 215 nips-2013-On Decomposing the Proximal Map

Author: Yao-Liang Yu

Abstract: The proximal map is the key step in gradient-type algorithms, which have become prevalent in large-scale high-dimensional problems. For simple functions this proximal map is available in closed-form while for more complicated functions it can become highly nontrivial. Motivated by the need of combining regularizers to simultaneously induce different types of structures, this paper initiates a systematic investigation of when the proximal map of a sum of functions decomposes into the composition of the proximal maps of the individual summands. We not only unify a few known results scattered in the literature but also discover several new decompositions obtained almost effortlessly from our theory. 1

6 0.5281418 99 nips-2013-Dropout Training as Adaptive Regularization

7 0.52393889 94 nips-2013-Distributed $k$-means and $k$-median Clustering on General Topologies

8 0.52241695 40 nips-2013-Approximate Inference in Continuous Determinantal Processes

9 0.52087218 269 nips-2013-Regression-tree Tuning in a Streaming Setting

10 0.52007598 5 nips-2013-A Deep Architecture for Matching Short Texts

11 0.51872075 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction

12 0.51673454 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables

13 0.51610619 251 nips-2013-Predicting Parameters in Deep Learning

14 0.51584411 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning

15 0.51551813 69 nips-2013-Context-sensitive active sensing in humans

16 0.5146113 201 nips-2013-Multi-Task Bayesian Optimization

17 0.51359421 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents

18 0.5127793 318 nips-2013-Structured Learning via Logistic Regression

19 0.51183528 249 nips-2013-Polar Operators for Structured Sparse Estimation

20 0.51042348 333 nips-2013-Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent