nips nips2012 nips2012-144 knowledge-graph by maker-knowledge-mining

144 nips-2012-Gradient-based kernel method for feature extraction and variable selection


Source: pdf

Author: Kenji Fukumizu, Chenlei Leng

Abstract: We propose a novel kernel approach to dimension reduction for supervised learning: feature extraction and variable selection; the former constructs a small number of features from predictors, and the latter finds a subset of predictors. First, a method of linear feature extraction is proposed using the gradient of the regression function, based on recent developments of the kernel method. In comparison with other existing methods, the proposed one has wide applicability without strong assumptions on the regressor or the type of variables, and uses a computationally simple eigendecomposition, and is thus applicable to large data sets. Second, in combination with a sparse penalty, the method is extended to variable selection, following the approach by Chen et al. [2]. Experimental results show that the proposed methods successfully find effective features and variables without parametric models.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Gradient-based kernel method for feature extraction and variable selection Kenji Fukumizu The Institute of Statistical Mathematics 10-3 Midori-cho, Tachikawa, Tokyo 190-8562 Japan fukumizu@ism. [sent-1, score-0.394]

2 Abstract We propose a novel kernel approach to dimension reduction for supervised learning: feature extraction and variable selection; the former constructs a small number of features from predictors, and the latter finds a subset of predictors. [sent-5, score-0.506]

3 First, a method of linear feature extraction is proposed using the gradient of regression function, based on the recent development of the kernel method. [sent-6, score-0.391]

4 In comparison with other existing methods, the proposed one has wide applicability without strong assumptions on the regressor or the type of variables, and uses a computationally simple eigendecomposition, and is thus applicable to large data sets. [sent-7, score-0.122]

5 Second, in combination with a sparse penalty, the method is extended to variable selection, following the approach by Chen et al. [sent-8, score-0.075]

6 1 Introduction Dimension reduction is involved in most modern data analysis, in which high-dimensional data must be handled. [sent-11, score-0.126]

7 There are two categories of dimension reduction: feature extraction, in which a linear or nonlinear mapping to a low-dimensional space is pursued, and variable selection, in which a subset of variables is selected. [sent-12, score-0.183]

8 The goal of dimension reduction in the supervised setting is to find features or a subset of the variables of X that explain Y as effectively as possible. [sent-19, score-0.232]

9 This paper focuses on linear dimension reduction, in which linear combinations of the components of X are used to make effective features. [sent-20, score-0.08]

10 We first develop a method for linear feature extraction with kernels, and extend it to variable selection with a sparseness penalty. [sent-22, score-0.344]

11 The most significant point of the proposed methods is that we do not assume any parametric models on the conditional probability, or make strong assumptions on the distribution of variables. [sent-23, score-0.117]

12 This differs from many other methods, particularly for variable selection, where a specific parametric model is often assumed. [sent-24, score-0.075]

13 1 First, consider the linear feature extraction based on Eq. [sent-29, score-0.138]

14 The first method using this formulation is the sliced inverse regression (SIR, [13]), which employs the fact that the inverse regression E[X|Y ] lies in the EDR space under some assumptions. [sent-31, score-0.207]

15 Many methods have been proposed in this vein of inverse regression ([4, 12] among others). [sent-32, score-0.075]

16 While the methods are computationally simple, they often need some strong assumptions on the distribution of X such as elliptic symmetry. [sent-33, score-0.062]

17 The first one is the dimension reduction with the gradient of regressor E[Y |X = x] [11, 17]. [sent-35, score-0.27]

18 There are some limitations in this approach, however: the nonparametric gradient estimation in high-dimensional spaces is challenging, and the method may not work unless the noise is additive. [sent-41, score-0.063]

19 The second one is the kernel dimension reduction (KDR, [8, 9, 28]), which uses the kernel method for characterizing the conditional independence to overcome various limitations of existing methods. [sent-42, score-0.494]

20 We propose a kernel method for linear feature extraction using the gradient-based approach, but unlike the existing ones [11, 17], the gradient is estimated based on the recent development of the kernel method [9, 19]. [sent-44, score-0.461]

21 It solves the problems of existing methods: by virtue of the kernel method, Y can be of arbitrary type, and the kernel estimator is stable without careful decrease of bandwidth. [sent-45, score-0.272]

22 It also solves the problem of KDR: the estimator, given by an eigenproblem, needs no numerical optimization. [sent-46, score-0.113]

23 Second, by using the above feature extraction in conjunction with a sparseness penalty, we propose a novel method for variable selection. [sent-48, score-0.278]

24 Recently, extensive studies have been conducted on variable selection with sparseness penalties such as the LASSO [23] and SCAD [6]. [sent-49, score-0.209]

25 These methods, however, use some specific model for regression such as linear regression, which is a limitation of the methods. [sent-51, score-0.075]

26 [2] proposed a novel method for sparse variable selection based on the objective function of linear feature extraction formulated as an eigenproblem such as SIR. [sent-53, score-0.35]

27 We follow this approach to derive our method for variable selection. [sent-54, score-0.075]

28 Unlike the methods used in [2], the proposed one does not require strong assumptions on the regressor or distribution, and thus provides a variable selection method based on the conditional independence irrespective of the regression model. [sent-55, score-0.395]

29 1 Gradient-based kernel dimension reduction Gradient of a regression function and dimension reduction We review the basic idea of the gradient-based method [11, 17] for dimension reduction. [sent-57, score-0.654]

30 In the more recent method [11], a standard local linear least squares with a smoothing kernel (not necessarily positive definite, [5]) is used for estimating the gradient, and the dimensionality of the projection is continuously reduced to the desired one in the iteration. [sent-62, score-0.239]

31 Since gradient estimation for high-dimensional data is difficult in general, the iterative reduction is expected to give more accurate estimation. [sent-63, score-0.13]

32 Kernel method for estimating the gradient of regression For a set Ω, an (R-valued) positive definite kernel k on Ω is a symmetric kernel k : Ω × Ω → R such that Σ_{i,j=1}^n c_i c_j k(x_i, x_j) ≥ 0 for any x_1, . . . , x_n ∈ Ω and c_1, . . . , c_n ∈ R. [sent-66, score-0.368]

33 It is known that a positive definite kernel on Ω uniquely defines a Hilbert space H consisting of functions on Ω, in which the reproducing property ⟨f, k(·, x)⟩_H = f(x) (∀f ∈ H) holds, where ⟨·, ·⟩_H is the inner product of H. [sent-73, score-0.161]

34 The Hilbert space H is called the reproducing kernel Hilbert space (RKHS) associated with k. [sent-74, score-0.161]

35 2 In deriving a kernel method based on the approach in Sec. [sent-76, score-0.145]

36 If a positive definite kernel k(x, y) on an open set in the Euclidean space is continuously differentiable with respect to x and y, then every f in the corresponding RKHS H is continuously differentiable. [sent-83, score-0.179]

37 ⟨f, ∂k(·, x)/∂x⟩_H = ∂f(x)/∂x. (2) This reproducing property, combined with the following kernel estimator of the conditional expectation (see [8, 9, 19] for details), provides a method for dimension reduction. [sent-85, score-0.338]
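The derivative reproducing property in Eq. (2) can be checked numerically for a Gaussian kernel. In the sketch below, the sample points, coefficients, and bandwidth are illustrative assumptions, not values from the paper; the RKHS-side gradient is compared with a central finite difference.

```python
import numpy as np

# Numerical check of <f, dk(.,x)/dx>_H = f'(x) for f = sum_i a_i k(., x_i),
# with a 1-D Gaussian kernel. All sample values are illustrative.
sigma = 1.0

def k(u, v):
    return np.exp(-(u - v) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
xs = rng.normal(size=5)              # expansion points x_i
a = rng.normal(size=5)               # coefficients a_i
f = lambda x: np.sum(a * k(xs, x))   # f lies in the RKHS of k

x0 = 0.3
# <f, dk(.,x0)/dx>_H expands to sum_i a_i * d/dx k(x_i, x) |_{x = x0}
grad_rkhs = np.sum(a * (xs - x0) / sigma ** 2 * k(xs, x0))
grad_fd = (f(x0 + 1e-6) - f(x0 - 1e-6)) / 2e-6   # central difference
assert abs(grad_rkhs - grad_fd) < 1e-5           # the two gradients agree
```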

38 p(y|x) exist, and that a positive definite kernel is measurable and bounded. [sent-93, score-0.115]

39 One of the advantages of the kernel method is that estimation with finite data is straightforward. [sent-98, score-0.145]

40 Note that the above expression is the kernel ridge regression of g(Y ) on X. [sent-110, score-0.19]
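Since the estimator is plain kernel ridge regression, it takes only a few lines of linear algebra to compute. The sketch below (data model, bandwidth σ, and regularization ε are illustrative assumptions) estimates E[Y | X = x] with g taken to be the identity:

```python
import numpy as np

# Kernel ridge regression form of the conditional-expectation estimator:
# E_hat[g(Y)|X=x] = g_vec^T (G_X + n*eps*I)^{-1} k_X(x), here with g = id.
rng = np.random.default_rng(1)
n = 200
X = rng.uniform(-3, 3, n)
Y = np.sin(X) + 0.1 * rng.normal(size=n)

sigma, eps = 0.5, 1e-3
G = np.exp(-(X[:, None] - X[None, :]) ** 2 / (2 * sigma ** 2))  # Gram matrix G_X
w = np.linalg.solve(G + n * eps * np.eye(n), Y)                 # ridge weights

def cond_mean(x):
    kx = np.exp(-(X - x) ** 2 / (2 * sigma ** 2))               # vector k_X(x)
    return kx @ w

# cond_mean should track the true regression function sin(x)
```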

41 With g = k_Y(·, ỹ), we obtain the gradient of the regression of the feature vector Φ_Y(Y) on X as (∂/∂x) E[Φ_Y(Y)|X = x] = C_YX C_XX^{-1} ∂k_X(·, x)/∂x. [sent-115, score-0.142]

42 3 Gradient-based kernel method for linear feature extraction It follows from the same argument as in Sec. [sent-117, score-0.283]

43 1 that (∂/∂x) E[k_Y(·, y)|X = x] = Ξ(x)B with an operator Ξ(x) from R^m to H_Y, where we use a slight abuse of notation by identifying the operator Ξ(x) with a matrix. [sent-119, score-0.058]

44 (6), we have ⟨∂k_X(·, x)/∂x, C_XX^{-1} C_XY C_YX C_XX^{-1} ∂k_X(·, x)/∂x⟩_{H_X} =: M(x), (7) which shows that the eigenvectors for non-zero eigenvalues of the m × m matrix M(x) are contained in the EDR space. [sent-121, score-0.08]

45 1, this method incorporates the high- (or infinite-) dimensional regressor E[Φ_Y(Y)|X = x]. [sent-125, score-0.119]

46 B^T ⟨Ξ(x), Ξ(x)⟩_{H_Y} B. (Footnote 1: Noting ⟨C_XX f, f⟩ = E[f(X)^2], it is easy to see that C_XX is injective if k_X is a continuous kernel on a topological space X and P_X is a Borel probability measure such that P(U) > 0 for any open set U in X.) [sent-126, score-0.115]

47 As the eigenvectors of M(x) are contained in the EDR space for any x, we propose to use the average of M(X_i) over all the data points X_i, and define M̃_n := (1/n) Σ_{i=1}^n M_n(X_i) = (1/n) Σ_{i=1}^n ∇k_X(X_i)^T (G_X + nε_n I_n)^{-1} G_Y (G_X + nε_n I_n)^{-1} ∇k_X(X_i). [sent-135, score-0.055]

48 We call the dimension reduction with the matrix M̃_n the gradient-based kernel dimension reduction (gKDR). [sent-136, score-0.069]

49 For linear feature extraction, the projection matrix B in Eq. [sent-137, score-0.06]

50 (1) is then estimated simply by the top d eigenvectors of M̃_n. [sent-138, score-0.055]
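A minimal end-to-end sketch of this estimator with Gaussian kernels on synthetic data follows; the data model, bandwidths, and regularization below are illustrative assumptions, not the paper's experimental settings.

```python
import numpy as np

# Sketch of gKDR-FEX: form M_tilde_n from Gaussian Gram matrices and take its
# top-d eigenvectors as the estimated EDR directions.
rng = np.random.default_rng(2)
n, m, d = 300, 4, 1
X = rng.normal(size=(n, m))
b_true = np.array([1.0, 0.0, 0.0, 0.0])          # true EDR direction
Y = np.sin(X @ b_true) + 0.1 * rng.normal(size=n)

sx, sy, eps = 2.0, 1.0, 1e-3                     # illustrative parameters
sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
Gx = np.exp(-sq / (2 * sx**2))                   # Gram matrix G_X
Gy = np.exp(-(Y[:, None] - Y[None, :]) ** 2 / (2 * sy**2))  # Gram matrix G_Y

Ainv = np.linalg.inv(Gx + n * eps * np.eye(n))
C = Ainv @ Gy @ Ainv
M = np.zeros((m, m))
for i in range(n):
    # gradient of k_X(X_j, x) at x = X_i: one row per j, one column per coordinate
    D = (X - X[i]) / sx**2 * Gx[:, i][:, None]
    M += D.T @ C @ D
M /= n

B = np.linalg.eigh(M)[1][:, -d:]   # top-d eigenvectors of M_tilde_n
# |B^T b_true| should be close to 1 when the EDR direction is recovered
```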

51 The proposed method applies to a wide class of problems; in contrast to many existing methods, gKDR-FEX can handle any type of data for Y, including multinomial or structured variables, and makes no strong assumptions on the regressor or the distribution of X. [sent-140, score-0.152]

52 Additionally, since the gKDR incorporates the high-dimensional feature vector Φ_Y(Y), it works for any regression relation, including multiplicative noise, for which many existing methods such as SIR and IADE fail. [sent-141, score-0.178]

53 As in all kernel methods, the results of gKDR depend on the choice of kernels. [sent-142, score-0.115]

54 We use cross-validation (CV) for choosing kernels and parameters, combined with some regression or classification method. [sent-143, score-0.075]

55 In this paper, the k-nearest neighbor (kNN) regression / classification is used in CV for its simplicity: for each candidate of a kernel or parameter, we compute the CV error by the kNN method with (B T Xi , Yi ), where B is given by gKDR, and choose the one that gives the least error. [sent-144, score-0.22]
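The CV step can be sketched as follows; the two candidate projections below are synthetic stand-ins for gKDR outputs under different kernel parameters, and the data model and fold scheme are illustrative assumptions.

```python
import numpy as np

# CV with k-NN regression: score each candidate projection by the k-NN
# prediction error of Y from the projected features on held-out folds,
# and keep the candidate with the smallest error.
rng = np.random.default_rng(3)
n, m = 200, 5
X = rng.normal(size=(n, m))
Y = X[:, 0] ** 2 + 0.1 * rng.normal(size=n)

def knn_cv_error(Z, Y, k=7, folds=5):
    idx = np.arange(len(Y))
    err = 0.0
    for f in range(folds):
        test = idx[f::folds]                 # held-out fold
        train = np.setdiff1d(idx, test)
        for t in test:
            nn = train[np.argsort(np.abs(Z[train] - Z[t]))[:k]]
            err += (Y[nn].mean() - Y[t]) ** 2
    return err / len(Y)

informative = X @ np.array([1.0, 0, 0, 0, 0])    # aligned with the true direction
uninformative = X @ np.array([0.0, 1, 0, 0, 0])  # orthogonal to it
assert knn_cv_error(informative, Y) < knn_cv_error(uninformative, Y)
```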

56 The time complexity of the matrix inversions and the eigendecomposition for gKDR is O(n³), which is prohibitive for large data sets. [sent-145, score-0.069]

57 Let G_X ≈ RR^T and G_Y ≈ HH^T be low-rank approximations with r_x = rank(R), r_y = rank(H) (r_x, r_y < n). [sent-149, score-0.158]

58 With the notation F := (G_X + nε_n I_n)^{-1} H and Θ_{as} := (1/σ²) Σ_i X_i^a R_{is}, we have, for 1 ≤ a, b ≤ m, M̃_{n,ab} = Σ_{t=1}^{r_y} Σ_{i=1}^n Γ^t_{ia} Γ^t_{ib}, where each Γ^t_{ia} is a sum over s ≤ r_x and j ≤ n of products of entries of the low-rank factors R, F, and Θ. [sent-150, score-0.158]

59 With this method, the complexity is O(nmr) in space and O(nm²r) in time (r = max{r_x, r_y}), which is much more memory-efficient than the straightforward implementation. [sent-151, score-0.1]
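One common way to obtain low-rank factors such as G_X ≈ RR^T is pivoted partial (incomplete) Cholesky decomposition; the sketch below (rank r, bandwidth, and data are illustrative assumptions) shows the idea.

```python
import numpy as np

# Pivoted partial (incomplete) Cholesky: G ~ R R^T with rank r << n, so that
# later matrix products need O(n r) memory instead of O(n^2).
rng = np.random.default_rng(5)
n, r = 500, 40
X = rng.normal(size=(n, 2))
sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
G = np.exp(-sq / 2.0)                      # Gaussian Gram matrix, sigma = 1

R = np.zeros((n, r))
diag = np.diag(G).copy()                   # residual diagonal of G - R R^T
for j in range(r):
    p = int(np.argmax(diag))               # pivot: largest residual entry
    if diag[p] < 1e-9:                     # numerically converged early
        R = R[:, :j]
        break
    R[:, j] = (G[:, p] - R @ R[p]) / np.sqrt(diag[p])
    diag -= R[:, j] ** 2

rel_err = np.linalg.norm(G - R @ R.T) / np.linalg.norm(G)
# rel_err is small because the Gaussian Gram spectrum decays quickly
```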

60 First, since accurate nonparametric estimation with high-dimensional X is not easy, we propose a method for decreasing the dimensionality iteratively. [sent-153, score-0.066]

61 Using gKDR-FEX, we first find a matrix B_1 of dimensionality d_1 larger than the target d, project the data X_i onto the subspace as Z_i^(1) = B_1^T X_i, find the projection matrix B_2 (a d_1 × d_2 matrix) mapping Z_i^(1) onto a d_2-dimensional subspace (d_2 < d_1), and repeat this process. [sent-154, score-0.126]

62 Note that this problem is shared by many linear dimension reduction methods including CCA and slice-based methods. [sent-158, score-0.177]

63 For a = 1, . . . , T, the projection matrices B_[a], given by the top eigenvectors of M_[a] = Σ_{i∈T_a} M(X_i), are used to define P = (1/T) Σ_{a=1}^T B_[a] B_[a]^T. [sent-166, score-0.081]

64 The estimator of B is then given by the top d eigenvectors of P . [sent-167, score-0.097]
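The variance-reduction effect of averaging the projectors B_[a] B_[a]^T can be seen in isolation. In the sketch below, the noisy unit vectors are synthetic stand-ins for per-partition gKDR estimates; the noise level and dimensions are illustrative assumptions.

```python
import numpy as np

# Averaging projectors: combine noisy unit-vector estimates b_a of a direction b
# via P = mean of b_a b_a^T; the top eigenvector of P is the combined estimate.
rng = np.random.default_rng(6)
m, T = 6, 50
b = np.zeros(m)
b[0] = 1.0
est = [e / np.linalg.norm(e) for e in (b + 0.5 * rng.normal(size=(T, m)))]

P = sum(np.outer(e, e) for e in est) / T
b_hat = np.linalg.eigh(P)[1][:, -1]          # top eigenvector of P

single = np.mean([abs(e @ b) for e in est])  # typical single-estimate alignment
combined = abs(b_hat @ b)                    # alignment of the averaged estimate
assert combined > single                     # averaging improves the estimate
```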

65 A positive definite kernel k on a [table residue omitted: gKDR-FEX error values for data sets (A) n = 100, 200; (B) n = 100, 200; (C) n = 200, 400]. [sent-172, score-0.115]

66 In addition to the above assumptions (i)-(iii), assume that the kernel kY is characteristic. [sent-226, score-0.145]

67 5 Experiments with gKDR-FEX We always use the Gaussian kernel k(x, x̃) = exp(−‖x − x̃‖²/(2σ²)) in the kernel method below. [sent-238, score-0.26]

68 For choosing the parameter σ in the Gaussian kernel and the regularization parameter ε_n, the CV in Sec. [sent-254, score-0.115]

69 m, n_tr, and n_test denote the dimension of X, the training data size, and the test data size, respectively. [sent-265, score-0.206]

70 One way of evaluating dimension reduction methods in supervised learning is to consider the classification or regression accuracy after projecting data onto the estimated subspaces. [sent-309, score-0.283]

71 We next used three data sets for binary classification, heart-disease (H), ionosphere (I), and breast-cancer-Wisconsin (B), from the UCI repository [7], and evaluated the classification rates of gKDR-FEXv with kNN classifiers (k = 7). [sent-310, score-0.071]

72 In comparison with these results, the low-dimensional subspaces found by gKDR-FEX and gKDR-FEXv effectively maintain the information for classification. [sent-328, score-0.055]

73 The second data set is author identification of Amazon commerce reviews with 10,000-dimensional linguistic features. [sent-330, score-0.054]

74 Such variable selection methods based on Pearson correlation are widely used for very high-dimensional data. [sent-338, score-0.14]

75 As we can see from Table 2, gKDR-FEX gives much more effective subspaces for regression than the Pearson correlation method when the number of authors is large. [sent-340, score-0.101]

76 3 Variable selection with gKDR In recent years, extensive studies have been done on variable selection with a sparseness penalty ([6, 16, 18, 23–27, 29, 30] among many others). [sent-343, score-0.275]

77 In the supervised setting, these studies often consider a specific model for the regression, such as least squares or logistic regression. [sent-344, score-0.106]

78 While consistency and the oracle property have also been established for many methods, the assumption that there is a true parameter in the model may not hold in practice, and is thus a strong restriction. [sent-345, score-0.093]

79 It is then important to consider more flexible ways of variable selection without assuming any parametric model on the regression. [sent-346, score-0.141]

80 The gKDR approach is appealing for this problem, since it characterizes conditional independence without strong assumptions on the regression or the distribution of variables. [sent-347, score-0.194]

81 [2] recently proposed the Coordinate-Independent Sparse (CIS) method, which is a semiparametric method for sparse variable selection. [sent-349, score-0.075]

82 In CIS, the linear feature B T X is assumed with some rows of B zero, but no parametric model is specified for regression. [sent-350, score-0.064]

83 This is achieved by imposing the sparseness penalty of the group LASSO [29] in combination with an objective function of linear feature extraction written in the form of eigenproblem such as SIR and PFC [3]. [sent-352, score-0.307]

84 We follow the CIS method for our variable selection with gKDR; since the gKDR is given by the eigenproblem with the matrix M̃_n, the CIS method applies straightforwardly. [sent-353, score-0.242]

85 The significance of our method is that the gKDR formulates the conditional independence of Y and X given B T X, while the existing CIS-based methods in [2] realize only weaker conditions under strong assumptions. [sent-354, score-0.119]

86 1 Sparse variable selection with gKDR Throughout this section, it is assumed that the true probability satisfies Eq. [sent-356, score-0.111]

87 The proposed variable selection method, gKDR-VS, estimates B by B_λ = arg min_{B : B^T B = I_d} { −Tr[B^T M̃_n B] + Σ_{i=1}^m λ_i ‖v^i‖ }, (9) where ‖v^i‖ is the Euclidean norm of the i-th row v^i of B and λ = (λ_1, . [sent-368, score-0.143]

88 To choose the parameter θ, a BIC-based method is often used in sparse variable selection [27, 31] with a theoretical guarantee of model consistency. [sent-382, score-0.141]

89 Note that (B) includes multiplicative noise, which cannot be handled by many dimension reduction methods. [sent-467, score-0.217]

90 4 Conclusions We have proposed a gradient-based kernel approach for dimension reduction in supervised learning. [sent-478, score-0.323]

91 The method is based on the general kernel formulation of conditional independence, and thus has wide applicability without strong restrictions on the model or variables. [sent-479, score-0.202]

92 The linear feature extraction method, gKDR-FEX, finds effective features with a simple eigendecomposition, even when conventional methods are not applicable due to multiplicative noise or high dimensionality. [sent-480, score-0.074]

93 We have extended the method to variable selection (gKDR-VS) with a sparseness penalty, and demonstrated its promising performance with synthetic and real world data. [sent-482, score-0.206]

94 Variable selection via nonconcave penalized likelihood and its oracle properties. [sent-526, score-0.092]

95 Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. [sent-547, score-0.289]

96 On principal Hessian directions for data visualization and dimension reduction: Another application of Stein’s lemma. [sent-598, score-0.08]

97 A simple and efficient algorithm for gene selection using sparse logistic regression. [sent-637, score-0.066]

98 Regression coefficient and autoregressive order shrinkage and selection via the lasso. [sent-694, score-0.066]

99 Shrinkage tuning parameter selection with a diverging number of parameters. [sent-711, score-0.066]

100 Model selection and estimation in regression with grouped variables. [sent-727, score-0.141]


similar papers computed by tfidf model


similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 144 nips-2012-Gradient-based kernel method for feature extraction and variable selection

Author: Kenji Fukumizu, Chenlei Leng


2 0.12660109 231 nips-2012-Multiple Operator-valued Kernel Learning

Author: Hachem Kadri, Alain Rakotomamonjy, Philippe Preux, Francis R. Bach

Abstract: Positive definite operator-valued kernels generalize the well-known notion of reproducing kernels, and are naturally adapted to multi-output learning situations. This paper addresses the problem of learning a finite linear combination of infinite-dimensional operator-valued kernels which are suitable for extending functional data analysis methods to nonlinear contexts. We study this problem in the case of kernel ridge regression for functional responses with an r -norm constraint on the combination coefficients (r ≥ 1). The resulting optimization problem is more involved than those of multiple scalar-valued kernel learning since operator-valued kernels pose more technical and theoretical issues. We propose a multiple operator-valued kernel learning algorithm based on solving a system of linear operator equations by using a block coordinate-descent procedure. We experimentally validate our approach on a functional regression task in the context of finger movement prediction in brain-computer interfaces. 1

3 0.099321142 79 nips-2012-Compressive neural representation of sparse, high-dimensional probabilities

Author: Xaq Pitkow

Abstract: This paper shows how sparse, high-dimensional probability distributions could be represented by neurons with exponential compression. The representation is a novel application of compressive sensing to sparse probability distributions rather than to the usual sparse signals. The compressive measurements correspond to expected values of nonlinear functions of the probabilistically distributed variables. When these expected values are estimated by sampling, the quality of the compressed representation is limited only by the quality of sampling. Since the compression preserves the geometric structure of the space of sparse probability distributions, probabilistic computation can be performed in the compressed domain. Interestingly, functions satisfying the requirements of compressive sensing can be implemented as simple perceptrons. If we use perceptrons as a simple model of feedforward computation by neurons, these results show that the mean activity of a relatively small number of neurons can accurately represent a highdimensional joint distribution implicitly, even without accounting for any noise correlations. This comprises a novel hypothesis for how neurons could encode probabilities in the brain. 1

4 0.09143164 264 nips-2012-Optimal kernel choice for large-scale two-sample tests

Author: Arthur Gretton, Dino Sejdinovic, Heiko Strathmann, Sivaraman Balakrishnan, Massimiliano Pontil, Kenji Fukumizu, Bharath K. Sriperumbudur

Abstract: Given samples from distributions p and q, a two-sample test determines whether to reject the null hypothesis that p = q, based on the value of a test statistic measuring the distance between the samples. One choice of test statistic is the maximum mean discrepancy (MMD), which is a distance between embeddings of the probability distributions in a reproducing kernel Hilbert space. The kernel used in obtaining these embeddings is critical in ensuring the test has high power, and correctly distinguishes unlike distributions with high probability. A means of parameter selection for the two-sample test based on the MMD is proposed. For a given test level (an upper bound on the probability of making a Type I error), the kernel is chosen so as to maximize the test power, and minimize the probability of making a Type II error. The test statistic, test threshold, and optimization over the kernel parameters are obtained with cost linear in the sample size. These properties make the kernel selection and test procedures suited to data streams, where the observations cannot all be stored in memory. In experiments, the new kernel selection approach yields a more powerful test than earlier kernel selection heuristics.

5 0.077123843 227 nips-2012-Multiclass Learning with Simplex Coding

Author: Youssef Mroueh, Tomaso Poggio, Lorenzo Rosasco, Jean-jeacques Slotine

Abstract: In this paper we discuss a novel framework for multiclass learning, defined by a suitable coding/decoding strategy, namely the simplex coding, that allows us to generalize to multiple classes a relaxation approach commonly used in binary classification. In this framework, we develop a relaxation error analysis that avoids constraints on the considered hypotheses class. Moreover, using this setting we derive the first provably consistent regularized method with training/tuning complexity that is independent to the number of classes. We introduce tools from convex analysis that can be used beyond the scope of this paper. 1

6 0.075165153 312 nips-2012-Simultaneously Leveraging Output and Task Structures for Multiple-Output Regression

7 0.073837101 340 nips-2012-The representer theorem for Hilbert spaces: a necessary and sufficient condition

8 0.07299424 188 nips-2012-Learning from Distributions via Support Measure Machines

9 0.066810302 247 nips-2012-Nonparametric Reduced Rank Regression

10 0.065101661 330 nips-2012-Supervised Learning with Similarity Functions

11 0.063792743 16 nips-2012-A Polynomial-time Form of Robust Regression

12 0.062903889 117 nips-2012-Ensemble weighted kernel estimators for multivariate entropy estimation

13 0.06161163 279 nips-2012-Projection Retrieval for Classification

14 0.060680032 284 nips-2012-Q-MKL: Matrix-induced Regularization in Multi-Kernel Learning with Applications to Neuroimaging

15 0.058859434 271 nips-2012-Pointwise Tracking the Optimal Regression Function

16 0.058180187 277 nips-2012-Probabilistic Low-Rank Subspace Clustering

17 0.057318266 249 nips-2012-Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison

18 0.057061665 199 nips-2012-Link Prediction in Graphs with Autoregressive Features

19 0.05599023 197 nips-2012-Learning with Recursive Perceptual Representations

20 0.053732589 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation


similar papers computed by lsi model


similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92883271 144 nips-2012-Gradient-based kernel method for feature extraction and variable selection

Author: Kenji Fukumizu, Chenlei Leng


2 0.84507191 231 nips-2012-Multiple Operator-valued Kernel Learning

Author: Hachem Kadri, Alain Rakotomamonjy, Philippe Preux, Francis R. Bach

Abstract: Positive definite operator-valued kernels generalize the well-known notion of reproducing kernels, and are naturally adapted to multi-output learning situations. This paper addresses the problem of learning a finite linear combination of infinite-dimensional operator-valued kernels which are suitable for extending functional data analysis methods to nonlinear contexts. We study this problem in the case of kernel ridge regression for functional responses with an r -norm constraint on the combination coefficients (r ≥ 1). The resulting optimization problem is more involved than those of multiple scalar-valued kernel learning since operator-valued kernels pose more technical and theoretical issues. We propose a multiple operator-valued kernel learning algorithm based on solving a system of linear operator equations by using a block coordinate-descent procedure. We experimentally validate our approach on a functional regression task in the context of finger movement prediction in brain-computer interfaces. 1

3 0.82277364 264 nips-2012-Optimal kernel choice for large-scale two-sample tests

Author: Arthur Gretton, Dino Sejdinovic, Heiko Strathmann, Sivaraman Balakrishnan, Massimiliano Pontil, Kenji Fukumizu, Bharath K. Sriperumbudur

Abstract: Given samples from distributions p and q, a two-sample test determines whether to reject the null hypothesis that p = q, based on the value of a test statistic measuring the distance between the samples. One choice of test statistic is the maximum mean discrepancy (MMD), which is a distance between embeddings of the probability distributions in a reproducing kernel Hilbert space. The kernel used in obtaining these embeddings is critical in ensuring the test has high power, and correctly distinguishes unlike distributions with high probability. A means of parameter selection for the two-sample test based on the MMD is proposed. For a given test level (an upper bound on the probability of making a Type I error), the kernel is chosen so as to maximize the test power, and minimize the probability of making a Type II error. The test statistic, test threshold, and optimization over the kernel parameters are obtained with cost linear in the sample size. These properties make the kernel selection and test procedures suited to data streams, where the observations cannot all be stored in memory. In experiments, the new kernel selection approach yields a more powerful test than earlier kernel selection heuristics.

4 0.8019681 249 nips-2012-Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison

Author: Tianbao Yang, Yu-feng Li, Mehrdad Mahdavi, Rong Jin, Zhi-Hua Zhou

Abstract: Both random Fourier features and the Nyström method have been successfully applied to efficient kernel learning. In this work, we investigate the fundamental difference between these two approaches, and how the difference could affect their generalization performances. Unlike approaches based on random Fourier features where the basis functions (i.e., cosine and sine functions) are sampled from a distribution independent from the training data, basis functions used by the Nyström method are randomly sampled from the training examples and are therefore data dependent. By exploring this difference, we show that when there is a large gap in the eigen-spectrum of the kernel matrix, approaches based on the Nyström method can yield impressively better generalization error bound than random Fourier features based approach. We empirically verify our theoretical findings on a wide range of large data sets.
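The data-dependent vs. data-independent distinction described in this abstract can be made concrete with a toy sketch (illustrative code, not from the paper; the function names, sample sizes, and feature counts are arbitrary): Nyström features are built from kernel columns at sampled training points, while random Fourier features draw cosine basis functions with frequencies sampled from the kernel's spectral distribution.

```python
import numpy as np

def rbf_kernel(X, Y, s=1.0):
    """Gaussian kernel matrix between row-sample arrays X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * s * s))

def nystrom_features(X, m, s=1.0, seed=0):
    """Basis functions sampled from the training data (data dependent)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=False)
    C = rbf_kernel(X, X[idx], s)                   # n x m kernel columns
    U, S, _ = np.linalg.svd(rbf_kernel(X[idx], X[idx], s))
    return C @ U / np.sqrt(np.maximum(S, 1e-12))   # features Z with K ~ Z Z^T

def rff_features(X, m, s=1.0, seed=0):
    """Basis functions sampled independently of the data (random cosines
    with Gaussian-distributed frequencies)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], m)) / s
    b = rng.uniform(0.0, 2.0 * np.pi, size=m)
    return np.sqrt(2.0 / m) * np.cos(X @ W + b)

# usage: both feature maps approximate the same exact kernel matrix
rng = np.random.default_rng(42)
X = rng.standard_normal((60, 3))
K = rbf_kernel(X, X)
Zn = nystrom_features(X, 30)
Zr = rff_features(X, 500)
```

Both maps yield low-rank approximations K ≈ Z Zᵀ; the paper's point is that the data-dependent Nyström sampling can exploit a fast-decaying eigen-spectrum, while the random Fourier basis cannot.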

5 0.77141666 188 nips-2012-Learning from Distributions via Support Measure Machines

Author: Krikamol Muandet, Kenji Fukumizu, Francesco Dinuzzo, Bernhard Schölkopf

Abstract: This paper presents a kernel-based discriminative learning framework on probability measures. Rather than relying on large collections of vectorial training examples, our framework learns using a collection of probability distributions that have been constructed to meaningfully represent training data. By representing these probability distributions as mean embeddings in the reproducing kernel Hilbert space (RKHS), we are able to apply many standard kernel-based learning techniques in straightforward fashion. To accomplish this, we construct a generalization of the support vector machine (SVM) called a support measure machine (SMM). Our analyses of SMMs provide several insights into their relationship to traditional SVMs. Based on such insights, we propose a flexible SVM (FlexSVM) that places different kernel functions on each training example. Experimental results on both synthetic and real-world data demonstrate the effectiveness of our proposed framework.

6 0.75604856 167 nips-2012-Kernel Hyperalignment

7 0.75191593 284 nips-2012-Q-MKL: Matrix-induced Regularization in Multi-Kernel Learning with Applications to Neuroimaging

8 0.75060409 330 nips-2012-Supervised Learning with Similarity Functions

9 0.69732732 130 nips-2012-Feature-aware Label Space Dimension Reduction for Multi-label Classification

10 0.69638091 269 nips-2012-Persistent Homology for Learning Densities with Bounded Support

11 0.68491924 177 nips-2012-Learning Invariant Representations of Molecules for Atomization Energy Prediction

12 0.66504133 340 nips-2012-The representer theorem for Hilbert spaces: a necessary and sufficient condition

13 0.64373296 221 nips-2012-Multi-Stage Multi-Task Feature Learning

14 0.62224877 16 nips-2012-A Polynomial-time Form of Robust Regression

15 0.60293025 145 nips-2012-Gradient Weights help Nonparametric Regressors

16 0.59635854 95 nips-2012-Density-Difference Estimation

17 0.59012908 247 nips-2012-Nonparametric Reduced Rank Regression

18 0.58823609 306 nips-2012-Semantic Kernel Forests from Multiple Taxonomies

19 0.58390707 312 nips-2012-Simultaneously Leveraging Output and Task Structures for Multiple-Output Regression

20 0.57060754 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.042), (21, 0.022), (32, 0.278), (38, 0.087), (39, 0.026), (42, 0.028), (49, 0.016), (54, 0.028), (55, 0.05), (74, 0.066), (76, 0.131), (80, 0.087), (92, 0.04)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.77814072 23 nips-2012-A lattice filter model of the visual pathway

Author: Karol Gregor, Dmitri B. Chklovskii

Abstract: Early stages of visual processing are thought to decorrelate, or whiten, the incoming temporally varying signals. Motivated by the cascade structure of the visual pathway (retina → lateral geniculate nucleus (LGN) → primary visual cortex, V1) we propose to model its function using lattice filters - signal processing devices for stage-wise decorrelation of temporal signals. Lattice filter models predict neuronal responses consistent with physiological recordings in cats and primates. In particular, they predict temporal receptive fields of two different types resembling so-called lagged and non-lagged cells in the LGN. Moreover, connection weights in the lattice filter can be learned using Hebbian rules in a stage-wise sequential manner reminiscent of the neuro-developmental sequence in mammals. In addition, lattice filters can model visual processing in insects. Therefore, the lattice filter is a useful abstraction that captures temporal aspects of visual processing.

Our sensory organs face an ongoing barrage of stimuli from the world and must transmit as much information about them as possible to the rest of the brain [1]. This is a formidable task because, in sensory modalities such as vision, the dynamic range of natural stimuli (more than three orders of magnitude) greatly exceeds the dynamic range of relay neurons (less than two orders of magnitude) [2]. The reason why high fidelity transmission is possible at all is that the continuity of objects in the physical world leads to correlations in natural stimuli, which imply redundancy. In turn, such redundancy can be eliminated by compression performed by the front end of the visual system, leading to the reduction of the dynamic range [3, 4]. A compression strategy appropriate for redundant natural stimuli is called predictive coding [5, 6, 7]. In predictive coding, a prediction of the incoming signal value is computed from past values delayed in the circuit.
This prediction is subtracted from the actual signal value and only the prediction error is transmitted. In the absence of transmission noise such compression is lossless, as the original signal could be decoded on the receiving end by inverting the encoder. If predictions are accurate, the dynamic range of the error is much smaller than that of the natural stimuli. Therefore, minimizing dynamic range using predictive coding reduces to optimizing prediction. Experimental support for viewing the front end of the visual system as a predictive encoder comes from the measurements of receptive fields [6, 7]. In particular, predictive coding suggests that, for natural stimuli, the temporal receptive fields should be biphasic and the spatial receptive fields center-surround. These predictions are borne out by experimental measurements in retinal ganglion cells [8], lateral geniculate nucleus (LGN) neurons [9] and fly second order visual neurons called large monopolar cells (LMCs) [2]. In addition, the experimentally measured receptive fields vary with signal-to-noise ratio as would be expected from optimal prediction theory [6]. Furthermore, experimentally observed whitening of the transmitted signal [10] is consistent with removing correlated components from the incoming signals [11]. As natural stimuli contain correlations on time scales greater than a hundred milliseconds, experimentally measured receptive fields of LGN neurons are equally long [12]. Decorrelation over such long time scales requires equally long delays. How can such extended receptive fields be produced by biological neurons and synapses whose time constants are typically less than a hundred milliseconds [13]? The field of signal processing offers a solution to this problem in the form of a device called a lattice filter, which decorrelates signals in stages, sequentially adding longer and longer delays [14, 15, 16, 17].
Motivated by the cascade structure of visual systems [18], we propose to model decorrelation in them by lattice filters. Naturally, visual systems are more complex than lattice filters and perform many other operations. However, we show that the lattice filter model explains several existing observations in vertebrate and invertebrate visual systems and makes testable predictions. Therefore, we believe that lattice filters provide a convenient abstraction for modeling temporal aspects of visual processing. This paper is organized as follows. First, we briefly summarize relevant results from linear prediction theory. Second, we explain the operation of the lattice filter in discrete and continuous time. Third, we compare lattice filter predictions with physiological measurements.

1 Linear prediction theory

Despite the non-linear nature of neurons and synapses, the operation of some neural circuits in vertebrates [19] and invertebrates [20] can be described by a linear systems theory. The advantage of linear systems is that optimal circuit parameters may be obtained analytically and the results are often intuitively clear. Perhaps not surprisingly, the field of signal processing relies heavily on the linear prediction theory, offering a convenient framework [15, 16, 17]. Below, we summarize the results from linear prediction that will be used to explain the operation of the lattice filter.

Consider a scalar sequence y = {y_t} where time t = 1, ..., n. Suppose that y_t at each time point depends on side information provided by a vector z_t. Our goal is to generate a series of linear predictions \hat{y}_t from the vector z_t: \hat{y}_t = w · z_t. We define a prediction error as

  e_t = y_t − \hat{y}_t = y_t − w · z_t   (1)

and look for values of w that minimize the mean squared error:

  ⟨e²⟩ = (1/n) Σ_t e_t² = (1/n) Σ_t (y_t − w · z_t)².   (2)

The weight vector w is optimal for prediction of sequence y from sequence z if and only if the prediction error sequence e = y − w · z is orthogonal to each component of vector z:

  ⟨e z⟩ = 0.   (3)

When the whole series y is given in advance, i.e. in the offline setting, these so-called normal equations can be solved for w, for example, by Gaussian elimination [21]. However, in signal processing and neuroscience applications, another setting called online is more relevant: at every time step t, prediction \hat{y}_t must be made using only current values of z_t and w. Furthermore, after a prediction is made, w is updated based on the prediction \hat{y}_t and observed y_t, z_t.

In the online setting, an algorithm called stochastic gradient descent is often used, where, at each time step, w is updated in the direction of the negative gradient of e_t²:

  w → w − η ∇_w (y_t − w · z_t)².   (4)

This leads to the following weight update, known as least mean square (LMS) [15], for predicting sequence y from sequence z:

  w → w + η e_t z_t,   (5)

where η is the learning rate. The value of η represents the relative influence of more recent observations compared to more distant ones. The larger the learning rate, the faster the system adapts to recent observations and the less of the past it remembers. In this paper, we are interested in predicting the current value x_t of a sequence x from its past values x_{t−1}, ..., x_{t−k}, restricted by the prediction order k > 0:

  \hat{x}_t = w^k · (x_{t−1}, ..., x_{t−k})^T.   (6)

This problem is a special case of the online linear prediction framework above, where y_t = x_t, z_t = (x_{t−1}, ..., x_{t−k})^T. Then the gradient update is given by:

  w^k → w^k + η e_t (x_{t−1}, ..., x_{t−k})^T.   (7)

While the LMS algorithm can find the weights that optimize linear prediction (6), the filter w^k has a long temporal extent, making it difficult to implement with neurons and synapses.
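The online LMS prediction scheme above can be sketched in a few lines of Python (a minimal illustration, not the authors' code; `lms_predictor`, the AR(1) test signal, and the learning rate are illustrative choices):

```python
import numpy as np

def lms_predictor(x, k, eta):
    """Online k-th order linear prediction of x_t from (x_{t-1}, ..., x_{t-k})
    with the LMS update w -> w + eta * e_t * z_t  (eqs. (5)-(7))."""
    w = np.zeros(k)
    errors = np.empty(len(x) - k)
    for i, t in enumerate(range(k, len(x))):
        z = x[t - k:t][::-1]        # z_t = (x_{t-1}, ..., x_{t-k})
        e = x[t] - w @ z            # prediction error, eq. (1)
        w += eta * e * z            # LMS weight update, eq. (7)
        errors[i] = e
    return w, errors

# usage: for an AR(1) sequence x_t = 0.9 x_{t-1} + noise, the learned
# first-order weight should approach the true coefficient 0.9
rng = np.random.default_rng(0)
x = np.zeros(2000)
for t in range(1, len(x)):
    x[t] = 0.9 * x[t - 1] + 0.1 * rng.standard_normal()
w, errs = lms_predictor(x, k=1, eta=0.1)
```

After enough samples the prediction errors shrink to roughly the innovation noise level, which is the decorrelation property the lattice filter exploits stage by stage below.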
2 Lattice filters

One way to generate long receptive fields in circuits of biological neurons is to use a cascade architecture, known as the lattice filter, which calculates optimal linear predictions for temporal sequences and transmits prediction errors [14, 15, 16, 17]. In this section, we explain the operation of a discrete-time lattice filter, then adapt it to continuous-time operation.

2.1 Discrete-time implementation

The first stage of the lattice filter, Figure 1, calculates the error of the first order optimal prediction (i.e. only using the preceding element of the sequence), the second stage uses the output of the first stage and calculates the error of the second order optimal prediction (i.e. using only two previous values), etc. To make such stage-wise error computations possible, the lattice filter calculates at every stage not only the error of optimal prediction of x_t from past values x_{t−1}, ..., x_{t−k}, called the forward error,

  f^k_t = x_t − w^k · (x_{t−1}, ..., x_{t−k})^T,   (8)

but, perhaps non-intuitively, also the error of optimal prediction of a past value x_{t−k} from the more recent values x_{t−k+1}, ..., x_t, called the backward error:

  b^k_t = x_{t−k} − w'^k · (x_{t−k+1}, ..., x_t)^T,   (9)

where w^k and w'^k are the weights of the optimal prediction. For example, the first stage of the filter calculates the forward error f^1_t of optimal prediction of x_t from x_{t−1}: f^1_t = x_t − u^1 x_{t−1}, as well as the backward error b^1_t of optimal prediction of x_{t−1} from x_t: b^1_t = x_{t−1} − v^1 x_t, Figure 1. Here, we assume that coefficients u^1 and v^1 that give optimal linear prediction are known and return to learning them below. Each following stage of the lattice filter performs a stereotypic operation on its inputs, Figure 1.
The k-th stage (k > 1) receives forward, f^{k−1}_t, and backward, b^{k−1}_t, errors from the previous stage, delays the backward error by one time step and computes a forward error

  f^k_t = f^{k−1}_t − u^k b^{k−1}_{t−1}   (10)

of the optimal linear prediction of f^{k−1}_t from b^{k−1}_{t−1}. In addition, each stage computes a backward error

  b^k_t = b^{k−1}_{t−1} − v^k f^{k−1}_t   (11)

of the optimal linear prediction of b^{k−1}_{t−1} from f^{k−1}_t.

As can be seen in Figure 1, the lattice filter contains forward prediction error (top) and backward prediction error (bottom) branches, which interact at every stage via cross-links. Operation of the lattice filter can be characterized by the linear filters acting on the input, x, to compute forward or backward errors of consecutive order, so-called prediction-error filters (blue bars in Figure 1). Because of delays in the backward error branch, the temporal extent of the filters grows from stage to stage. In the next section, we will argue that prediction-error filters correspond to the measurements of temporal receptive fields in neurons. For detailed comparison with physiological measurements we will use the result that, for bi-phasic prediction-error filters, such as the ones in Figure 1, the first bar of the forward prediction-error filter has larger weight, by absolute value, than the combined weights of the remaining coefficients of the corresponding filter. Similarly, in backward prediction-error filters, the last bar has greater weight than the rest of them combined. This fact arises from the observation that forward prediction-error filters are minimum phase, while backward prediction-error filters are maximum phase [16, 17].

Figure 1: Discrete-time lattice filter performs stage-wise computation of forward and backward prediction errors. In the first stage, the optimal prediction of x_t from x_{t−1} is computed by delaying the input by one time step and multiplying it by u^1.
The upper summation unit subtracts the predicted x_t from the actual value and outputs prediction error f^1_t. Similarly, the optimal prediction of x_{t−1} from x_t is computed by multiplying the input by v^1. The lower summation unit subtracts the optimal prediction from the actual value and outputs backward error b^1_t. In each following stage k, the optimal prediction of f^{k−1}_t from b^{k−1}_t is computed by delaying b^{k−1}_t by one time step and multiplying it by u^k. The upper summation unit subtracts the prediction from the actual f^{k−1}_t and outputs prediction error f^k_t. Similarly, the optimal prediction of b^{k−1}_{t−1} from f^{k−1}_t is computed by multiplying it by v^k. The lower summation unit subtracts the optimal prediction from the actual value and outputs backward error b^k_t. Black connections have unitary weights and red connections have learnable negative weights. One can view forward and backward error calculations as applications of so-called prediction-error filters (blue) to the input sequence. Note that the temporal extent of the filters gets longer from stage to stage.

Next, we derive a learning rule for finding the optimal coefficients u and v in the online setting. The coefficient u^k is used for predicting f^{k−1}_t from b^{k−1}_{t−1} to obtain the error f^k_t. By substituting y_t = f^{k−1}_t, z_t = b^{k−1}_{t−1} and e_t = f^k_t into (5), the update of u^k becomes

  u^k → u^k + η f^k_t b^{k−1}_{t−1}.   (12)

Similarly, v^k is updated by

  v^k → v^k + η b^k_t f^{k−1}_t.   (13)

Interestingly, the updates of the weights are given by the product of the activities of the outgoing and incoming nodes of the corresponding cross-links. Such updates are known as Hebbian learning rules, thought to be used by biological neurons [22, 23]. Finally, we give a simple proof that, in the offline setting when the entire sequence x is known, f^k and b^k, given by equations (10, 11), are indeed errors of optimal k-th order linear prediction. Let D be the one-step time delay operator, (Dx)_t = x_{t−1}.
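The stage computations (10)-(11) together with the Hebbian updates (12)-(13) can be sketched in a few lines (a minimal illustration, not the authors' code; `lattice_stage`, the AR(1) test signal, and the learning rate are illustrative choices):

```python
import numpy as np

def lattice_stage(f_in, b_in, eta):
    """One lattice filter stage: consumes forward/backward errors from the
    previous stage and emits next-order errors (eqs. (10)-(11)), while
    learning the cross-link weights u, v with Hebbian updates (12)-(13)."""
    u = v = 0.0
    f_out = np.zeros_like(f_in)
    b_out = np.zeros_like(b_in)
    b_delayed = 0.0                          # holds b^{k-1}_{t-1}
    for t in range(len(f_in)):
        f_out[t] = f_in[t] - u * b_delayed   # eq. (10)
        b_out[t] = b_delayed - v * f_in[t]   # eq. (11)
        u += eta * f_out[t] * b_delayed      # eq. (12), Hebbian
        v += eta * b_out[t] * f_in[t]        # eq. (13), Hebbian
        b_delayed = b_in[t]
    return f_out, b_out, u, v

# usage: stages are learned sequentially, as in the text; for an AR(1)
# input the first cross-link weight approaches the lag-1 autocorrelation
rng = np.random.default_rng(1)
x = np.zeros(5000)
for t in range(1, len(x)):
    x[t] = 0.8 * x[t - 1] + 0.1 * rng.standard_normal()
f, b, weights = x.copy(), x.copy(), []
for k in range(2):
    f, b, u, v = lattice_stage(f, b, eta=0.05)
    weights.append((u, v))
```

Training the stages one after another mirrors the sequential-learning property discussed later: earlier stages need not be re-learned when a new stage is appended.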
The induction statement at k is that f^k and b^k are k-th order forward and backward errors of optimal linear prediction, which is equivalent to f^k and b^k being of the form f^k = x − w^k_1 Dx − ... − w^k_k D^k x and b^k = D^k x − w'^k_1 D^{k−1} x − ... − w'^k_k x and, from the normal equations (3), satisfying ⟨f^k D^i x⟩ = 0 and ⟨(D b^k) D^i x⟩ = ⟨b^k D^{i−1} x⟩ = 0 for i = 1, ..., k. That this is true for k = 1 directly follows from the definitions of f^1 and b^1. Now we assume that this is true for k − 1 ≥ 1 and show it is true for k. It is easy to see from the forms of f^{k−1} and b^{k−1} and from f^k = f^{k−1} − u^k D b^{k−1} that f^k has the correct form f^k = x − w^k_1 Dx − ... − w^k_k D^k x. Regarding orthogonality, for i = 1, ..., k − 1 we have ⟨f^k D^i x⟩ = ⟨(f^{k−1} − u^k D b^{k−1}) D^i x⟩ = ⟨f^{k−1} D^i x⟩ − u^k ⟨(D b^{k−1}) D^i x⟩ = 0, using the induction assumptions of orthogonality at k − 1. For the remaining i = k, we note that f^k is the error of the optimal linear prediction of f^{k−1} from D b^{k−1} and therefore 0 = ⟨f^k D b^{k−1}⟩ = ⟨f^k (D^k x − w'^{k−1}_1 D^{k−1} x − ... − w'^{k−1}_{k−1} Dx)⟩ = ⟨f^k D^k x⟩, as desired. The b^k case can be proven similarly.

2.2 Continuous-time implementation

The last hurdle remaining for modeling neuronal circuits which operate in continuous time with a lattice filter is its discrete-time operation. To obtain a continuous-time implementation of the lattice filter we cannot simply take the time step size to zero, as prediction-error filters would become infinitesimally short. Here, we adapt the discrete-time lattice filter to continuous-time operation in two steps. First, we introduce a discrete-time Laguerre lattice filter [24, 17] which uses Laguerre polynomials rather than the shift operator to generate its basis functions, Figure 2. The input signal passes through a leaky integrator whose leakage constant α defines a time-scale distinct from the time step (14).
A delay, D, at every stage is replaced by an all-pass filter, L, (15) with the same constant α, which preserves the magnitude of every Fourier component of the input but shifts its phase in a frequency dependent manner. Such an all-pass filter reduces to a single time-step delay when α = 0. The optimality of a general discrete-time Laguerre lattice filter can be proven similarly to that for the discrete-time filter, simply by replacing the operator D with L in the proof of section 2.1.

Figure 2: Continuous-time lattice filter using Laguerre polynomials. Compared to the discrete-time version, it contains a leaky integrator, L0, (16) and replaces delays with all-pass filters, L, (17).

Second, we obtain a continuous-time formulation of the lattice filter by replacing t − 1 → t − δt, defining the inverse time scale γ = (1 − α)/δt and taking the limit δt → 0 while keeping γ fixed. As a result, L0 and L are given by:

  Discrete time:
    L0(x)_t = α L0(x)_{t−1} + x_t   (14)
    L(x)_t = α (L(x)_{t−1} − x_t) + x_{t−1}   (15)

  Continuous time:
    dL0(x)/dt = −γ L0(x) + x   (16)
    L(x) = x − 2γ L0(x)   (17)

Representative impulse responses of the continuous Laguerre filter are shown in Figure 2. Note that, similarly to the discrete-time case, the area under the first (peak) phase is greater than the area under the second (rebound) phase in the forward branch, and the opposite is true in the backward branch. Moreover, the temporal extent of the rebound is greater than that of the peak not just in the forward branch, as in the basic discrete-time implementation, but also in the backward branch. As will be seen in the next section, these predictions are confirmed by physiological recordings.

3 Experimental evidence for the lattice filter in visual pathways

In this section we demonstrate that physiological measurements from visual pathways in vertebrates and invertebrates are consistent with the predictions of the lattice filter model.
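The all-pass property of the operator L in equations (16)-(17) can be checked numerically. The Euler-integration sketch below is illustrative, not from the paper; `laguerre_allpass`, the time step, and the 5 Hz test signal are arbitrary choices, with 1/γ = 50 ms chosen to match the order of magnitude of the time constants used for the LGN model in Figure 4.

```python
import numpy as np

def laguerre_allpass(x, gamma, dt):
    """Euler simulation of the leaky integrator dL0/dt = -gamma*L0 + x
    (eq. 16) and the all-pass output L(x) = x - 2*gamma*L0(x) (eq. 17)."""
    l0 = 0.0
    out = np.zeros_like(x)
    for t in range(len(x)):
        l0 += dt * (-gamma * l0 + x[t])   # leaky integrator step
        out[t] = x[t] - 2.0 * gamma * l0  # all-pass: unit gain, phase shift
    return out

# usage: a sinusoid should pass with (approximately) unchanged amplitude,
# only phase-shifted, since |H(i*omega)| = |(i*omega - gamma)/(i*omega + gamma)| = 1
dt, gamma = 1e-3, 20.0                    # 1/gamma = 50 ms
tt = np.arange(0.0, 2.0, dt)
y = laguerre_allpass(np.sin(2 * np.pi * 5 * tt), gamma, dt)
```

The small residual gain error comes from the explicit Euler discretization and shrinks with the time step.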
For the purpose of modeling visual pathways, we identify summation units of the lattice filter with neurons and propose that neural activity represents forward and backward errors. In the fly visual pathway neuronal activity is represented by continuously varying graded potentials. In the vertebrate visual system, all neurons starting with ganglion cells are spiking and we identify their firing rate with the activity in the lattice filter.

3.1 Mammalian visual pathway

In mammals, visual processing is performed in stages. In the retina, photoreceptors synapse onto bipolar cells, which in turn synapse onto retinal ganglion cells (RGCs). RGCs send axons to the LGN, where they synapse onto LGN relay neurons projecting to the primary visual cortex, V1. In addition to this feedforward pathway, at each stage there are local circuits involving (usually inhibitory) inter-neurons such as horizontal and amacrine cells in the retina. Neurons of each class come in many types, which differ in their connectivity, morphology and physiological response. The bewildering complexity of these circuits has posed a major challenge to visual neuroscience.
Figure 3: Electrophysiologically measured temporal receptive fields get progressively longer along the cat visual pathway. Left: A cat LGN cell (red) has a longer receptive field than a corresponding RGC cell (blue) (adapted from [12], which also reports population data). Right (A,B): Extent of the temporal receptive fields of simple cells in cat V1 is greater than that of corresponding LGN cells, as quantified by the peak (A) and zero-crossing (B) times. Right (C): In the temporal receptive fields of cat LGN and V1 cells the peak can be stronger or weaker than the rebound (adapted from [25]).
Here, we point out several experimental observations related to temporal processing in the visual system consistent with the lattice filter model. First, measurements of temporal receptive fields demonstrate that they get progressively longer at each consecutive stage: i) LGN neurons have longer receptive fields than corresponding pre-synaptic ganglion cells [12], Figure 3, left; ii) simple cells in V1 have longer receptive fields than corresponding pre-synaptic LGN neurons [25], Figure 3, right (A,B). These observations are consistent with the progressively greater temporal extent of the prediction-error filters (blue plots in Figure 2). Second, the weight of the peak (integrated area under the curve) may be either greater or less than that of the rebound, both in LGN relay cells [26] and simple cells of V1 [25], Figure 3, right (C).
Neurons with peak weight exceeding that of the rebound are often referred to as non-lagged, while the others are known as lagged, found both in cat [27, 28, 29] and monkey [30]. The reason for this becomes clear from the response to a step stimulus, Figure 4 (top). By comparing experimentally measured receptive fields with those of the continuous lattice filter, Figure 4, we identify non-lagged neurons with the forward branch and lagged neurons with the backward branch. Another way to characterize the step-stimulus response is whether the sign of the transient is the same (non-lagged) or different (lagged) relative to the sustained response. Third, measurements of cross-correlation between RGC and LGN cell spikes in lagged and non-lagged neurons reveal a difference in the transfer function indicative of a difference in the underlying circuitry [31]. This is consistent with the backward branch circuit of the Laguerre lattice filter, Figure 2, being different than that of the forward branch (which results in a different transfer function). In particular, a combination of different glutamate receptors such as AMPA and NMDA, as well as GABA receptors, are thought to be responsible for the observed responses in lagged cells [32]. However, further investigation of the corresponding circuitry, perhaps using connectomics technology, is desirable. Fourth, the cross-link weights of the lattice filter can be learned using Hebbian rules (12, 13), which are biologically plausible [22, 23].
Interestingly, if these weights are learned sequentially, starting from the first stage, they do not need to be re-learned when additional stages are added or learned. This property maps naturally onto the fact that, in the course of mammalian development, the visual pathway matures in a stage-wise fashion, starting with the retina, then LGN, then V1, implying that the more peripheral structures do not need to adapt to the maturation of the downstream ones.

Figure 4: Comparison of electrophysiologically measured responses of cat LGN cells with the continuous-time lattice filter model. Top: Experimentally measured temporal receptive fields and step-stimulus responses of LGN cells (adapted from [26]). Bottom: Typical examples of responses in the continuous-time lattice filter model. Lattice filter coefficients were u^1 = v^1 = 0.4, u^2 = v^2 = 0.2 and 1/γ = 50 ms to model the non-lagged cell, and u^1 = v^1 = u^2 = v^2 = 0.2 and 1/γ = 60 ms to model the lagged cell. To model the photoreceptor contribution to the responses, an additional leaky integrator L0 was added to the circuit of Figure 2.

While Hebbian rules are biologically plausible, one may get an impression from Figure 2 that they must apply to inhibitory cross-links. We point out that this circuit is meant to represent only the computation performed rather than the specific implementation in terms of neurons. As the same linear computation can be performed by circuits with a different arrangement of the same components, there are multiple implementations of the lattice filter. For example, activity of non-lagged OFF cells may be seen as representing minus the forward error. Then the cross-links between the non-lagged OFF pathway and the lagged ON pathway would be excitatory. In general, classification of cells into lagged and non-lagged seems independent of their ON/OFF and X/Y classification [31, 28, 29], but see [33].
3.2 Insect visual pathway

In insects, two cell types, L1 and L2, both post-synaptic to photoreceptors, play an important role in visual processing. Physiological responses of L1 and L2 indicate that they decorrelate visual signals by subtracting their predictable parts. In fact, the receptive fields of these neurons were used as the first examples of predictive coding in neuroscience [6]. Yet, as the numbers of synapses from photoreceptors to L1 and L2 are the same [34] and their physiological properties are similar, it has been a mystery why insects have not just one but a pair of such seemingly redundant neurons per facet. Previously, it was suggested that L1 and L2 provide inputs to two pathways that map onto the ON and OFF pathways in the vertebrate retina [35, 36]. Here, we put forward the hypothesis that the role of L1 and L2 in visual processing is similar to that of the two branches of the lattice filter. We do not incorporate the ON/OFF distinction in the effectively linear lattice filter model, but anticipate that such a combined description will materialize in the future. As was argued in Section 2, in forward prediction-error filters the peak has greater weight than the rebound, while in backward prediction-error filters the opposite is true. This difference implies that, in response to a step stimulus, the signs of the sustained responses relative to the initial transients differ between the branches. Indeed, Ca2+ imaging shows that the responses of L1 and L2 to a step stimulus are different, as predicted by the lattice filter model [35], Figure 5b. Interestingly, the activity of L1 seems to represent the minus forward error and that of L2 the plus backward error, suggesting that the lattice filter cross-links are excitatory. To summarize, the predictions of the lattice filter model seem to be consistent with the physiological measurements in the fly visual system and may help understand its operation. 
Figure 5: Response of the lattice filter and fruit fly LMCs to a step-stimulus. Left: responses of the first-order discrete-time lattice filter to a step stimulus (panels: stimulus, −forward error, and backward error versus time). Right: responses of fly L1 and L2 cells to a moving step stimulus (adapted from [35]). The predicted and the experimentally measured responses have qualitatively the same shape: a transient followed by a sustained response, which has the same sign for the forward error and L1 and the opposite sign for the backward error and L2.

4 Discussion

Motivated by the cascade structure of the visual pathway, we propose to model its operation with the lattice filter. We demonstrate that the predictions of the continuous-time lattice filter model are consistent with the course of neural development and with physiological measurements in the LGN and V1 of cat and monkey, as well as in fly LMC neurons. Therefore, lattice filters may offer a useful abstraction for understanding aspects of temporal processing in the visual systems of vertebrates and invertebrates. Previously, [11] proposed that lagged and non-lagged cells could be a result of rectification by spiking neurons. Although we agree with [11] that the LGN performs temporal decorrelation, our explanation does not rely on non-linear processing but rather on the cascade architecture and, hence, is fundamentally different. Our model generates the following predictions that are not obvious in [11]: i) not only are LGN receptive fields longer than RGC receptive fields, but V1 receptive fields are also longer than those of the LGN; ii) even a linear model can generate a difference in the peak/rebound ratio; iii) the circuit from RGC to LGN should be different for lagged and non-lagged cells, consistent with [31]; iv) the lattice filter circuit can self-organize using Hebbian rules, which gives a mechanistic explanation of receptive fields beyond the normative framework of [11]. 
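The sign relations stated in the Figure 5 caption can be checked directly for a first-order discrete-time lattice stage. The sketch below (coefficients are illustrative, not fit to data) computes the forward error f1(t) = x(t) − u1·x(t−1) and backward error b1(t) = x(t−1) − v1·x(t) for a unit step:

```python
def step_response(u1=0.4, v1=0.4, T=10):
    """First-order lattice filter response to a unit step x(t) = 1 for t >= 0."""
    x = [0.0] + [1.0] * T  # prepend x(-1) = 0 as the delay-line state
    fwd = [x[t] - u1 * x[t - 1] for t in range(1, T + 1)]  # forward error f1(t)
    bwd = [x[t - 1] - v1 * x[t] for t in range(1, T + 1)]  # backward error b1(t)
    return fwd, bwd

fwd, bwd = step_response()
```

The forward error jumps to 1 and settles at 1 − u1, so its transient and sustained parts share a sign (the L1-like, non-lagged shape), while the backward error starts at −v1 and settles at 1 − v1, flipping sign between transient and sustained parts (the L2-like, lagged shape).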
In light of the redundancy reduction arguments given in the introduction, we note that, if the only goal of the system were to compress incoming signals using a given number of lattice filter stages, then after the compression is performed only one kind of prediction error, forward or backward, needs to be transmitted. Therefore, having two channels may, in the absence of noise, seem redundant. However, transmitting both forward and backward errors gives one the flexibility to continue decorrelation further by adding stages performing relatively simple operations.

We are grateful to D.A. Butts, E. Callaway, M. Carandini, D.A. Clark, J.A. Hirsch, T. Hu, S.B. Laughlin, D.N. Mastronarde, R.C. Reid, H. Rouault, A. Saul, L. Scheffer, F.T. Sommer, and X. Wang for helpful discussions.

References

[1] F. Rieke, D. Warland, R.R. van Steveninck, and W. Bialek. Spikes: exploring the neural code. MIT Press, 1999. [2] S.B. Laughlin. Matching coding, circuits, cells, and molecules to signals: general principles of retinal design in the fly’s eye. Progress in Retinal and Eye Research, 13(1):165–196, 1994. [3] F. Attneave. Some informational aspects of visual perception. Psychological Review, 61(3):183, 1954. [4] H. Barlow. Redundancy reduction revisited. Network: Computation in Neural Systems, 12(3):241–253, 2001. [5] R.M. Gray. Linear Predictive Coding and the Internet Protocol. Now Publishers, 2010. [6] M.V. Srinivasan, S.B. Laughlin, and A. Dubs. Predictive coding: a fresh view of inhibition in the retina. Proceedings of the Royal Society of London. Series B. Biological Sciences, 216(1205):427–459, 1982. [7] T. Hosoya, S.A. Baccus, and M. Meister. Dynamic predictive coding by the retina. Nature, 436:71, 2005. [8] H.K. Hartline, H.G. Wagner, and E.F. MacNichol Jr. The peripheral origin of nervous activity in the visual system. Studies on excitation and inhibition in the retina: a collection of papers from the laboratories of H. Keffer Hartline, page 99, 1974. [9] N.A. Lesica, J. Jin, C. 
Weng, C.I. Yeh, D.A. Butts, G.B. Stanley, and J.M. Alonso. Adaptation to stimulus contrast and correlations during natural visual stimulation. Neuron, 55(3):479–491, 2007. [10] Y. Dan, J.J. Atick, and R.C. Reid. Efficient coding of natural scenes in the lateral geniculate nucleus: experimental test of a computational theory. The Journal of Neuroscience, 16(10):3351–3362, 1996. [11] D.W. Dong and J.J. Atick. Statistics of natural time-varying images. Network: Computation in Neural Systems, 6(3):345–358, 1995. [12] X. Wang, J.A. Hirsch, and F.T. Sommer. Recoding of sensory information across the retinothalamic synapse. The Journal of Neuroscience, 30(41):13567–13577, 2010. [13] C. Koch. Biophysics of computation: information processing in single neurons. Oxford Univ Press, 2005. [14] F. Itakura and S. Saito. On the optimum quantization of feature parameters in the parcor speech synthesizer. In Conference Record, 1972 International Conference on Speech Communication and Processing, Boston, MA, pages 434–437, 1972. [15] B. Widrow and S.D. Stearns. Adaptive signal processing. Prentice-Hall, Inc. Englewood Cliffs, NJ, 1985. [16] S. Haykin. Adaptive filter theory. Prentice-Hall, Englewood-Cliffs, NJ, 2003. [17] A.H. Sayed. Fundamentals of adaptive filtering. Wiley-IEEE Press, 2003. [18] D.J. Felleman and D.C. Van Essen. Distributed hierarchical processing in the primate cerebral cortex. Cerebral cortex, 1(1):1–47, 1991. [19] X. Wang, F.T. Sommer, and J.A. Hirsch. Inhibitory circuits for visual processing in thalamus. Current Opinion in Neurobiology, 2011. [20] SB Laughlin, J. Howard, and B. Blakeslee. Synaptic limitations to contrast coding in the retina of the blowfly calliphora. Proceedings of the Royal society of London. Series B. Biological sciences, 231(1265):437–467, 1987. [21] D.C. Lay. Linear Algebra and Its Applications. Addison-Wesley/Longman, New York/London, 2000. [22] D.O. Hebb. The organization of behavior: A neuropsychological theory. 
Lawrence Erlbaum, 2002. [23] O. Paulsen and T.J. Sejnowski. Natural patterns of activity and long-term synaptic plasticity. Current opinion in neurobiology, 10(2):172–180, 2000. [24] Z. Fejzo and H. Lev-Ari. Adaptive laguerre-lattice filters. Signal Processing, IEEE Transactions on, 45(12):3006–3016, 1997. [25] J.M. Alonso, W.M. Usrey, and R.C. Reid. Rules of connectivity between geniculate cells and simple cells in cat primary visual cortex. The Journal of Neuroscience, 21(11):4002–4015, 2001. [26] D. Cai, G.C. Deangelis, and R.D. Freeman. Spatiotemporal receptive field organization in the lateral geniculate nucleus of cats and kittens. Journal of Neurophysiology, 78(2):1045–1061, 1997. [27] D.N. Mastronarde. Two classes of single-input x-cells in cat lateral geniculate nucleus. i. receptive-field properties and classification of cells. Journal of Neurophysiology, 57(2):357–380, 1987. [28] J. Wolfe and L.A. Palmer. Temporal diversity in the lateral geniculate nucleus of cat. Visual neuroscience, 15(04):653–675, 1998. [29] AB Saul and AL Humphrey. Spatial and temporal response properties of lagged and nonlagged cells in cat lateral geniculate nucleus. Journal of Neurophysiology, 64(1):206–224, 1990. [30] A.B. Saul. Lagged cells in alert monkey lateral geniculate nucleus. Visual neurosci, 25:647–659, 2008. [31] D.N. Mastronarde. Two classes of single-input x-cells in cat lateral geniculate nucleus. ii. retinal inputs and the generation of receptive-field properties. Journal of Neurophysiology, 57(2):381–413, 1987. [32] P. Heggelund and E. Hartveit. Neurotransmitter receptors mediating excitatory input to cells in the cat lateral geniculate nucleus. i. lagged cells. Journal of neurophysiology, 63(6):1347–1360, 1990. [33] J. Jin, Y. Wang, R. Lashgari, H.A. Swadlow, and J.M. Alonso. Faster thalamocortical processing for dark than light visual targets. The Journal of Neuroscience, 31(48):17471–17479, 2011. [34] M. Rivera-Alba, S.N. Vitaladevuni, Y. Mischenko, Z. Lu, S. 
Takemura, L. Scheffer, I.A. Meinertzhagen, D.B. Chklovskii, and G.G. de Polavieja. Wiring economy and volume exclusion determine neuronal placement in the drosophila brain. Current Biology, 21(23):2000–5, 2011. [35] D.A. Clark, L. Bursztyn, M.A. Horowitz, M.J. Schnitzer, and T.R. Clandinin. Defining the computational structure of the motion detector in drosophila. Neuron, 70(6):1165–1177, 2011. [36] M. Joesch, B. Schnell, S.V. Raghu, D.F. Reiff, and A. Borst. On and off pathways in drosophila motion vision. Nature, 468(7321):300–304, 2010.

2 0.76541001 309 nips-2012-Semi-supervised Eigenvectors for Locally-biased Learning

Author: Toke Hansen, Michael W. Mahoney

Abstract: In many applications, one has side information, e.g., labels that are provided in a semi-supervised manner, about a specific target region of a large data set, and one wants to perform machine learning and data analysis tasks “nearby” that pre-specified target region. Locally-biased problems of this sort are particularly challenging for popular eigenvector-based machine learning and data analysis tools. At root, the reason is that eigenvectors are inherently global quantities. In this paper, we address this issue by providing a methodology to construct semi-supervised eigenvectors of a graph Laplacian, and we illustrate how these locally-biased eigenvectors can be used to perform locally-biased machine learning. These semi-supervised eigenvectors capture successively-orthogonalized directions of maximum variance, conditioned on being well-correlated with an input seed set of nodes that is assumed to be provided in a semi-supervised manner. We also provide several empirical examples demonstrating how these semi-supervised eigenvectors can be used to perform locally-biased learning.

3 0.72537953 129 nips-2012-Fast Variational Inference in the Conjugate Exponential Family

Author: James Hensman, Magnus Rattray, Neil D. Lawrence

Abstract: We present a general method for deriving collapsed variational inference algorithms for probabilistic models in the conjugate exponential family. Our method unifies many existing approaches to collapsed variational inference. Our collapsed variational inference leads to a new lower bound on the marginal likelihood. We exploit the information geometry of the bound to derive much faster optimization methods based on conjugate gradients for these models. Our approach is very general and is easily applied to any model where the mean field update equations have been derived. Empirically we show significant speed-ups for probabilistic inference using our bound.

same-paper 4 0.7192831 144 nips-2012-Gradient-based kernel method for feature extraction and variable selection

Author: Kenji Fukumizu, Chenlei Leng

Abstract: We propose a novel kernel approach to dimension reduction for supervised learning: feature extraction and variable selection; the former constructs a small number of features from predictors, and the latter finds a subset of predictors. First, a method of linear feature extraction is proposed using the gradient of the regression function, based on the recent development of the kernel method. In comparison with other existing methods, the proposed one has wide applicability without strong assumptions on the regressor or the type of variables, and uses a computationally simple eigendecomposition, and is thus applicable to large data sets. Second, in combination with a sparse penalty, the method is extended to variable selection, following the approach by Chen et al. [2]. Experimental results show that the proposed methods successfully find effective features and variables without parametric models.
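As a rough illustration of the recipe this abstract describes (kernel-based gradient estimates of the regression function, averaged and eigendecomposed), here is a minimal sketch with Gaussian kernels. The bandwidths, the regularization constant `eps`, and the exact form of the averaged matrix are illustrative assumptions, not the authors' published estimator:

```python
import numpy as np

def gram(Z, s):
    """Gaussian Gram matrix with bandwidth s."""
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * s ** 2))

def gradient_kdr(X, Y, d, sigma_x=1.0, sigma_y=1.0, eps=1e-3):
    """Estimate a d-dimensional projection for supervised dimension
    reduction from kernel-based gradient estimates of E[Y|X=x]."""
    n = X.shape[0]
    Gx = gram(X, sigma_x)
    Gy = gram(Y.reshape(n, -1), sigma_y)
    Ginv = np.linalg.inv(Gx + n * eps * np.eye(n))  # regularized inverse
    H = Ginv @ Gy @ Ginv
    # gradients of k(x, X_j) w.r.t. x, evaluated at each X_i: shape (n, n, p)
    Dk = -(X[:, None, :] - X[None, :, :]) / sigma_x ** 2 * Gx[:, :, None]
    # average the projected outer products of the gradient estimates
    M = np.einsum('ija,jk,ikb->ab', Dk, H, Dk) / n
    # top-d eigenvectors of M span the estimated feature subspace
    _, V = np.linalg.eigh(M)
    return V[:, ::-1][:, :d]
```

When Y depends on X only through a single direction, the leading eigenvector should roughly align with that direction; no parametric form for the regressor is assumed.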

5 0.6403687 212 nips-2012-Minimax Multi-Task Learning and a Generalized Loss-Compositional Paradigm for MTL

Author: Nishant Mehta, Dongryeol Lee, Alexander G. Gray

Abstract: Since its inception, the modus operandi of multi-task learning (MTL) has been to minimize the task-wise mean of the empirical risks. We introduce a generalized loss-compositional paradigm for MTL that includes a spectrum of formulations as a subfamily. One endpoint of this spectrum is minimax MTL: a new MTL formulation that minimizes the maximum of the tasks’ empirical risks. Via a certain relaxation of minimax MTL, we obtain a continuum of MTL formulations spanning minimax MTL and classical MTL. The full paradigm itself is loss-compositional, operating on the vector of empirical risks. It incorporates minimax MTL, its relaxations, and many new MTL formulations as special cases. We show theoretically that minimax MTL tends to avoid worst case outcomes on newly drawn test tasks in the learning to learn (LTL) test setting. The results of several MTL formulations on synthetic and real problems in the MTL and LTL test settings are encouraging.

6 0.61107248 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization

7 0.58371562 197 nips-2012-Learning with Recursive Perceptual Representations

8 0.58272707 188 nips-2012-Learning from Distributions via Support Measure Machines

9 0.58151543 210 nips-2012-Memorability of Image Regions

10 0.58085448 168 nips-2012-Kernel Latent SVM for Visual Recognition

11 0.57893634 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

12 0.57890487 274 nips-2012-Priors for Diversity in Generative Latent Variable Models

13 0.57833582 193 nips-2012-Learning to Align from Scratch

14 0.57706732 215 nips-2012-Minimizing Uncertainty in Pipelines

15 0.57674086 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines

16 0.57447886 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

17 0.5741353 209 nips-2012-Max-Margin Structured Output Regression for Spatio-Temporal Action Localization

18 0.57389414 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video

19 0.57340384 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes

20 0.57301372 279 nips-2012-Projection Retrieval for Classification