nips nips2001 nips2001-29 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mário Figueiredo
Abstract: In this paper we introduce a new sparseness inducing prior which does not involve any (hyper)parameters that need to be adjusted or estimated. Although other applications are possible, we focus here on supervised learning problems: regression and classification. Experiments with several publicly available benchmark data sets show that the proposed approach yields state-of-the-art performance. In particular, our method outperforms support vector machines and performs competitively with the best alternative techniques, both in terms of error rates and sparseness, although it involves no tuning or adjusting of sparsenesscontrolling hyper-parameters.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract In this paper we introduce a new sparseness inducing prior which does not involve any (hyper)parameters that need to be adjusted or estimated. [sent-6, score-0.47]
2 Although other applications are possible, we focus here on supervised learning problems: regression and classification. [sent-7, score-0.264]
3 Experiments with several publicly available benchmark data sets show that the proposed approach yields state-of-the-art performance. [sent-8, score-0.175]
4 In particular, our method outperforms support vector machines and performs competitively with the best alternative techniques, both in terms of error rates and sparseness, although it involves no tuning or adjusting of sparsenesscontrolling hyper-parameters. [sent-9, score-0.501]
5 a vector of parameters defining it; accordingly we write f(x, β). To achieve good generalization (i.e., to perform well on data not used during learning), it is necessary to control the complexity of the learned function. [sent-22, score-0.058]
6 In Bayesian approaches, complexity is controlled by placing a prior on the function to be learned, i.e., on the parameter vector β. [sent-25, score-0.119]
7 This should not be confused with a generative (informative) Bayesian approach, since it involves no explicit modelling of the joint probability p(x, y). [sent-28, score-0.125]
8 A common choice is a zero-mean Gaussian prior, which appears under different names, like ridge regression [5], or weight decay, in the neural learning literature [6]. [sent-29, score-0.312]
9 Gaussian priors are also used in non-parametric contexts, like the Gaussian processes (GP) approach [2], [7], [8], [9], which has roots in earlier spline models [10] and regularized radial basis functions [11]. [sent-30, score-0.101]
10 Very good performance has been reported for methods based on Gaussian priors [8], [9]. [sent-31, score-0.066]
11 That is, a Gaussian prior does not lead to pruning that parameter, but to shrinking it to some small value. [sent-34, score-0.047]
12 Sparse estimates (i.e., in which irrelevant parameters are set exactly to zero) are desirable because (in addition to other learning-theoretic reasons [4]) they correspond to a structural simplification of the estimated function. [sent-37, score-0.031]
13 Using Laplacian priors (equivalently, ℓ1-penalized regularization) is known to promote sparseness [12] - [15]. [sent-38, score-0.38]
14 Support vector machines (SVM) take a non-Bayesian approach to the goal of sparseness [2], [4]. [sent-39, score-0.358]
15 Interestingly, however, it can be shown that the SVM and ℓ1-penalized regression are closely related [13]. [sent-40, score-0.236]
16 Both in approaches based on Laplacian priors and in SVMs, there are hyper-parameters which control the degree of sparseness of the obtained estimates. [sent-41, score-0.329]
17 These are commonly adjusted using cross-validation methods which do not optimally utilize the available data, and are time consuming. [sent-42, score-0.092]
18 We propose an alternative approach which involves no hyperparameters. [sent-43, score-0.047]
19 Our method is related to the automatic relevance determination (ARD) concept [7], [19], which underlies the recently proposed relevance vector machine (RVM) [20], [21]. [sent-45, score-0.266]
20 The RVM exhibits state-of-the-art performance, beating SVMs both in terms of accuracy and sparseness [20], [21]. [sent-46, score-0.296]
21 However, we do not resort to a type-II maximum likelihood approximation [18] (as in ARD and RVM); rather, our modelling assumptions lead to a marginal a posteriori probability function on β whose mode is located by a very simple EM algorithm. [sent-47, score-0.154]
22 Experimental evaluation of the proposed method, both with synthetic and real data, shows that it performs competitively with (often better than) GP-based methods, RVM, and SVM. [sent-49, score-0.224]
23 We consider functions that are linear with respect to β (whose dimensionality we will denote by k). [sent-55, score-0.033]
24 This includes: (i) classical linear regression, f(x, β) = β0 + β^T x; (ii) nonlinear regression via a set of basis functions, f(x, β) = Σj βj φj(x); (iii) kernel regression, f(x, β) = β0 + Σi βi K(x, xi), where K(x, xi) is some (symmetric) kernel function [2] (as in SVM and RVM regression), not necessarily verifying Mercer's condition. [sent-56, score-0.409]
25 We follow the standard assumption that yi = f(xi, β) + wi, for i = 1, ..., n, where {wi} is a set of independent zero-mean Gaussian variables with variance σ². [sent-67, score-0.033]
26 With y = [y1, ..., yn]^T, the likelihood function is then p(y | β, σ²) = N(y | Hβ, σ²I), where H is the design matrix, which depends on the xi's and on the adopted function representation, and N(v | μ, C) denotes a Gaussian density of mean μ and covariance C, evaluated at v. [sent-68, score-0.063]
27 With a zero-mean Gaussian prior with covariance A, p(β | A) = N(β | 0, A), the posterior is still Gaussian, with mean and mode at (σ²A⁻¹ + H^T H)⁻¹ H^T y; when A is proportional to the identity, this is called ridge regression [5]. [sent-76, score-0.493]
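As a point of reference, the following is a minimal sketch (not from the paper) of the Gaussian-prior MAP estimate just described, taking A = (1/λ)I as in ridge regression; the data and the values of λ and σ² are made up for illustration. Note that the resulting coefficients are shrunk towards zero but, in general, none of them is exactly zero.

```python
import numpy as np

def ridge_map(H, y, sigma2, lam):
    """MAP estimate under a zero-mean Gaussian prior with covariance (1/lam) I:
    beta = (sigma2 * lam * I + H^T H)^{-1} H^T y."""
    k = H.shape[1]
    return np.linalg.solve(sigma2 * lam * np.eye(k) + H.T @ H, H.T @ y)

# toy data: only the first 2 of 10 coefficients are truly non-zero
rng = np.random.default_rng(0)
H = rng.normal(size=(30, 10))
beta_true = np.zeros(10); beta_true[:2] = 3.0
y = H @ beta_true + 0.3 * rng.normal(size=30)
print(ridge_map(H, y, sigma2=0.09, lam=10.0))  # all entries shrunk, none exactly zero
```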
28 With a Laplacian prior for β, p(β | α) ∝ exp(−α‖β‖1), with ‖β‖1 = Σi |βi|, the posterior is not Gaussian. [sent-77, score-0.119]
29 The maximum a posteriori (MAP) estimate is given by arg min over β of { ‖y − Hβ‖² + 2σ²α‖β‖1 } (1), where ‖·‖ is the Euclidean (ℓ2) norm and ‖·‖1 is the ℓ1 norm. [sent-78, score-0.048]
30 In linear regression this is called the LASSO (least absolute shrinkage and selection operator) [14]. [sent-79, score-0.377]
31 The main effect of the ℓ1 penalty is that some of the components of the estimate of β may be exactly zero. [sent-80, score-0.042]
32 Let us consider an alternative model: let each βi have a zero-mean Gaussian prior p(βi | τi) = N(βi | 0, τi), with its own variance τi (like in ARD and RVM). [sent-82, score-0.152]
33 Now adopt an exponential hyper-prior for each variance, p(τi | γ) = (γ/2) exp(−γ τi / 2); integrating out τi yields p(βi | γ) = ∫ N(βi | 0, τi) p(τi | γ) dτi = (√γ / 2) exp(−√γ |βi|). This shows that the Laplacian prior is equivalent to a 2-level hierarchical-Bayes model: zero-mean Gaussian priors with independent, exponentially distributed variances. [sent-86, score-0.185]
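The equivalence above is easy to check numerically; the following sketch (not code from the paper; the values of γ and β are arbitrary) integrates the Gaussian-times-exponential mixture on a grid and compares it with the Laplacian density.

```python
import numpy as np

gamma, beta = 2.0, 1.3
tau = np.linspace(1e-6, 200.0, 2_000_000)                 # grid over the variance tau
gauss = np.exp(-beta**2 / (2.0 * tau)) / np.sqrt(2.0 * np.pi * tau)   # N(beta | 0, tau)
expo = 0.5 * gamma * np.exp(-0.5 * gamma * tau)           # p(tau) = (gamma/2) exp(-gamma tau / 2)
mixture = np.sum(gauss * expo) * (tau[1] - tau[0])        # Riemann sum of the mixture integral
laplace = 0.5 * np.sqrt(gamma) * np.exp(-np.sqrt(gamma) * abs(beta))
print(mixture, laplace)                                   # the two values agree closely
```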
34 This decomposition has been exploited in robust least absolute deviation (LAD) regression [16]. [sent-87, score-0.27]
35 The hierarchical decomposition of the Laplacian prior allows using the EM algorithm to implement the LASSO criterion in (1), by simply regarding the variances τ = [τ1, ..., τk] as hidden/missing data. [sent-88, score-0.19]
36 In fact, the complete log-posterior (with a flat prior for σ², and where D(τ) ≡ diag(1/τ1, ..., 1/τk)) is log p(β, σ², τ | y) ∝ −n log σ² − ‖y − Hβ‖²/σ² − β^T D(τ) β (2). [sent-89, score-0.225]
37 This leads to the E-step V(t) = E[D(τ) | y, β(t)] = diag(√γ/|β1(t)|, ..., √γ/|βk(t)|); the M-step is then defined by the two following update equations: σ²(t+1) = ‖y − Hβ(t)‖²/n and β(t+1) = (σ²(t+1) V(t) + H^T H)⁻¹ H^T y. [sent-104, score-0.106]
38 One question remains: how to adjust γ, which controls the degree of sparseness of the estimates? [sent-109, score-0.263]
39 Our proposal is to remove γ from the model by adopting a non-informative Jeffreys hyper-prior for the τi, p(τi) ∝ 1/τi. This prior expresses ignorance with respect to scale (see [17], [18]) and, most importantly, it is parameter-free. [sent-111, score-0.161]
40 Of course this is no longer equivalent to a Laplacian prior on β, but to some other prior. [sent-112, score-0.119]
41 As will be shown experimentally, this prior strongly induces sparseness and yields state-of-the-art performance. [sent-113, score-0.382]
42 Computationally, this choice leads to a minor modification of the EM algorithm described above: matrix V(t) is now given by V(t) = diag(1/β1(t)², ..., 1/βk(t)²). [sent-114, score-0.106]
43 The β update can then be written as β(t+1) = U(t) (σ²(t+1) I + U(t) H^T H U(t))⁻¹ U(t) H^T y, where U(t) = diag(|β1(t)|, ..., |βk(t)|), thus avoiding the inversion of the elements of β(t). [sent-123, score-0.106]
44 Moreover, it is not necessary to invert the matrix, but simply to solve the corresponding linear system, whose dimension is only the number of non-zero elements in β(t). [sent-124, score-0.033]
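The following is a minimal sketch of the resulting EM iteration under the Jeffreys hyper-prior, as read from the updates above; the least-squares initialisation, the σ² re-estimate, the pruning tolerance, and the fixed iteration count are choices of this sketch rather than prescriptions from the paper.

```python
import numpy as np

def sparse_em(H, y, n_iter=100, prune_tol=1e-8):
    """EM iteration for the Jeffreys-prior sparse regression sketch:
    beta <- U (sigma2 I + U H^T H U)^{-1} U H^T y, with U = diag(|beta_i|)."""
    n, k = H.shape
    beta = np.linalg.lstsq(H, y, rcond=None)[0]          # start from least squares
    sigma2 = np.mean((y - H @ beta) ** 2)
    for _ in range(n_iter):
        U = np.diag(np.abs(beta))                        # U(t) = diag(|beta_1|, ..., |beta_k|)
        A = sigma2 * np.eye(k) + U @ H.T @ H @ U
        beta = U @ np.linalg.solve(A, U @ (H.T @ y))     # avoids dividing by small |beta_i|
        beta = np.where(np.abs(beta) < prune_tol, 0.0, beta)   # prune components driven to zero
        sigma2 = np.mean((y - H @ beta) ** 2)            # simple noise-variance re-estimate
    return beta, sigma2

# toy usage: a 20-dimensional beta with only 3 non-zero components
rng = np.random.default_rng(0)
H = rng.normal(size=(50, 20))
beta_true = np.zeros(20); beta_true[:3] = 3.0
y = H @ beta_true + 0.5 * rng.normal(size=50)
print(np.nonzero(sparse_em(H, y)[0])[0])
```

On the toy problem at the end, the iteration typically drives most of the irrelevant components of β to exactly zero, illustrating the adaptive sparseness described above.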
45 3 Regression experiments Our first example illustrates the use of the proposed method for variable selection in standard linear regression. [sent-125, score-0.186]
46 Consider a sequence of 20 true βs, having from 1 to 20 non-zero components (out of 20): from a β with a single non-zero component to a β with all components non-zero. [sent-126, score-0.042]
47 Fig. 1 (a) shows the mean number of estimated non-zero components, as a function of the true number. [sent-129, score-0.034]
48 Our method exhibits a very good ability to find the correct number of nonzero components in β, in an adaptive manner. [sent-130, score-0.241]
49 (Figure 1 axis labels: true number of nonzero parameters versus estimated number of nonzero parameters.) [sent-143, score-0.097]
50 Figure 1: (a) Mean number of nonzero components in the estimate versus the number of nonzero components in the true β (the dotted line is the identity). [sent-144, score-0.403]
51 In Table 1, we compare the relative modelling-error improvement (with respect to the least squares solution) of our method and of several methods studied in [14]. [sent-154, score-0.231]
52 Our method performs comparably with the best method for each case, although it involves no tuning or adjustment of parameters, and is computationally faster. [sent-155, score-0.274]
53 (Table 1 rows: Proposed method, LASSO (CV), LASSO (GCV), Subset selection; the numerical entries are not recoverable here.) [sent-157, score-0.116]
54 We now study the performance of our method in kernel regression, using Gaussian kernels, i.e., K(x, xi) = exp(−‖x − xi‖²/(2h²)), where h controls the kernel width. [sent-158, score-0.139]
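For concreteness, here is a sketch (not from the paper) of how the kernel design matrix H can be built for this representation; the helper name, the intercept column, and the width value are assumptions of this sketch.

```python
import numpy as np

def gaussian_kernel_design(X, h):
    """Design matrix for kernel regression: an intercept column plus one
    Gaussian kernel K(x, x_i) = exp(-||x - x_i||^2 / (2 h^2)) per training point."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.hstack([np.ones((X.shape[0], 1)), np.exp(-sq / (2.0 * h**2))])

# e.g. noisy 'sinc' data of the kind used in the next example: y_i = sin(x_i)/x_i + noise
rng = np.random.default_rng(1)
x = np.linspace(-10.0, 10.0, 50)
y = np.sinc(x / np.pi) + 0.1 * rng.normal(size=x.size)   # np.sinc(t) = sin(pi t)/(pi t)
H = gaussian_kernel_design(x[:, None], h=2.0)
print(H.shape)                                           # (50, 51): intercept + one kernel per point
```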
55 We begin by considering the synthetic example studied in [20] and [21], where the true function is the 'sinc' function, sin(x)/x. [sent-161, score-0.069]
56 To compare our results to the RVM and the variational RVM (VRVM), we ran the algorithm on 25 generations of the noisy data. [sent-163, score-0.075]
57 Finally, we have also applied our method to the well-known Boston housing data-set (20 random partitions of the full data-set into 481 training samples and 25 test samples); Table 2 shows the results, again versus SVM, RVM, and VRVM regression (as reported in [20]). [sent-165, score-0.415]
58 In these tests, our method performs better than RVM, VRVM, and SVM regression, although it doesn't require any tuning. [sent-166, score-0.12]
59 Table 2: Mean root squared errors and mean number of kernels for the 'sinc' function and the Boston housing examples. [sent-167, score-0.178]
60 (Table 2 layout: one column group for the error and one for the number of kernels, each listing the new method, SVM, RVM, and VRVM; the numerical entries are not recoverable here.) [sent-170, score-0.211]
61 4 Classification In classification the formulation is somewhat more complicated, with the standard approach being generalized linear models [24]. [sent-186, score-0.033]
62 The probit model has a simple interpretation in terms of hidden variables [25], which we will exploit. [sent-190, score-0.076]
63 Consider a hidden variable z = f(x, β) + w, where w is a zero-mean, unit-variance Gaussian. Then, if the classification rule is y = 1 if z ≥ 0, and y = −1 if z < 0, we obtain the probit model: P(y = 1 | x) = Φ(f(x, β)), where Φ is the standard Gaussian distribution function. [sent-192, score-0.076]
64 If we had observed z = [z1, ..., zn]^T, we would have a simple linear regression likelihood p(z | β) = N(z | Hβ, I). [sent-205, score-0.269]
65 To promote sparseness, we will adopt the same hierarchical prior on β that we have used for regression: p(βi | τi) = N(βi | 0, τi) and p(τi) ∝ 1/τi (the Jeffreys prior). [sent-207, score-0.244]
66 The complete log-posterior (with the hidden vectors τ and z) is given by (6), which is similar to (2), except for the noise variance, which is not needed here, and for the fact that now z is missing. [sent-208, score-0.033]
67 The expected value of 1/τi is similar to the regression case; accordingly we define the same diagonal matrix V(t) = diag(1/β1(t)², ..., 1/βk(t)²). [sent-209, score-0.372]
68 In addition, we also need E[zi | yi, β(t)] (notice that the complete log-posterior is linear with respect to z), which can be expressed in closed form, for each i, as si(t) = f(xi, β(t)) + φ(f(xi, β(t)))/Φ(f(xi, β(t))) if yi = 1, and si(t) = f(xi, β(t)) − φ(f(xi, β(t)))/Φ(−f(xi, β(t))) if yi = −1, where φ and Φ are the standard Gaussian density and distribution function. [sent-210, score-0.033]
69 These expressions are easily derived after noticing that p(zi | yi, β(t)) is Gaussian with mean f(xi, β(t)), but left-truncated at zero if yi = 1, and right-truncated at zero if yi = −1. With the zi replaced by these expectations, the M-step is similar to the regression case. [sent-215, score-0.332]
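A sketch of the E-step for the hidden vector z just described, assuming labels coded as ±1 and the same design matrix H as in the regression case (so that (Hβ)i = f(xi, β)); the clipping of the Gaussian CDF is a numerical safeguard added here, not something stated in the paper.

```python
import numpy as np
from scipy.stats import norm

def probit_e_step(H, y, beta):
    """E[z_i | y_i, beta] for z_i ~ N(f(x_i, beta), 1), truncated to z >= 0
    when y_i = +1 and to z < 0 when y_i = -1."""
    mu = H @ beta
    phi, Phi = norm.pdf(mu), norm.cdf(mu)
    pos = mu + phi / np.clip(Phi, 1e-12, None)        # mean of the left-truncated Gaussian
    neg = mu - phi / np.clip(1.0 - Phi, 1e-12, None)  # mean of the right-truncated Gaussian
    return np.where(y == 1, pos, neg)
```

The M-step then reuses the regression update with y replaced by these expectations and the noise variance fixed at 1.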
70 5 Classification experiments In all the experiments we use kernel classifiers, with Gaussian kernels, i.e., [sent-217, score-0.07]
71 K(x, xi) = exp(−‖x − xi‖²/(2h²)), where h is a parameter that controls the kernel width. [sent-219, score-0.07]
72 Our first experiment is mainly illustrative and uses Ripley's synthetic data 1; the optimal error rate for this problem is approximately 8% [3]. [sent-220, score-0.04]
73 Table 3 shows the average test set error (on 1000 test samples) and the final number of kernels, for 20 classifiers learned from 20 random subsets of size 100 from the original 250 training samples. [sent-221, score-0.032]
74 For comparison, we also include results (from [20]) for RVM, variational RVM (VRVM), and SVM classifiers. [sent-222, score-0.045]
75 On this data set, our method performs competitively with RVM and VRVM and much better than SVM (especially in terms of sparseness). [sent-223, score-0.216]
76 Table 3 also reports the numbers of errors achieved by the proposed method and by several state-of-the-art techniques on three well-known benchmark problems: the Pima Indians diabetes 2, the Leptograpsus crabs 2, and the Wisconsin breast cancer 3 (WBC). [sent-226, score-0.187]
77 For the WBC, we report average results over 30 random partitions (300/269 training/testing, as in [26]). [sent-227, score-0.039]
78 All the inputs are normalized to zero mean and unit variance, and the kernel width h was set to one fixed value for the Pima and crabs problems, and to a different fixed value for the WBC. [sent-228, score-0.246]
79 On the Pima and crabs data sets, our algorithm outperforms all the other techniques. [sent-229, score-0.154]
80 On the WBC data set, our method performs nearly as well as the best available alternative. [sent-230, score-0.169]
81 The numbers in square brackets in the “method” column indicate the bibliographic reference from which the results are quoted. [sent-234, score-0.034]
82 The numbers in parentheses indicate the (mean) number of kernels used by the classifiers (when available). [sent-235, score-0.107]
83 Method Ripley’s Pima Crabs WBC Proposed method 94 (4. [sent-236, score-0.069]
84 6 Concluding remarks We have introduced a new sparseness inducing prior related to the Laplacian prior. [sent-250, score-0.427]
85 Its main feature is the absence of any hyper-parameters to be adjusted or estimated. [sent-251, score-0.043]
86 Experiments with several publicly available benchmark data sets, both for regression and classification, have shown state-of-the-art performance. [sent-252, score-0.374]
87 In particular, our approach outperforms support vector machines and Gaussian process classifiers both in terms of error rate and sparseness, although it involves no tuning or adjusting of sparseness-controlling hyper-parameters. [sent-253, score-0.285]
88 One of the weak points of our approach, when used with kernel-based methods, is the need to solve a linear system in the M-step (of dimension equal to the number of training points) whose computational requirements make it impractical to use with very large training data sets. [sent-255, score-0.033]
89 Williams, “Prediction with Gaussian processes: from linear regression to linear prediction and beyond,” in Learning and Inference in Graphical Models, Kluwer, 1998. [sent-285, score-0.302]
90 Wahba, “A correspondence between Bayesian estimation of stochastic processes and smoothing by splines,” Annals of Mathematical Statistics, vol. [sent-295, score-0.035]
91 Saunders, “Atomic decomposition by basis pursuit,” SIAM Journal of Scientific Computation, vol. [sent-306, score-0.034]
92 Girosi, “An equivalence between sparse approximation and support vector machines,” Neural Computation, vol. [sent-311, score-0.058]
93 Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society (B), vol. [sent-315, score-0.108]
94 Williams, “Bayesian regularization and pruning using a Laplace prior,” Neural Computation, vol. [sent-318, score-0.047]
95 MacKay, “Bayesian non-linear modelling for the 1993 energy prediction competition,” in Maximum Entropy and Bayesian Methods, G. [sent-335, score-0.078]
96 Tipping, “Variational relevance vector machines,” in Proceedings of the 16th Conference in Uncertainty in Artificial Intelligence, pp. [sent-341, score-0.094]
97 Tipping, “The relevance vector machine,” in Advances in Neural Information Processing Systems – NIPS 12 (S. [sent-344, score-0.094]
98 Turlach, “A new approach to variable selection in least squares problems,” IMA Journal of Numerical Analysis, vol. [sent-361, score-0.047]
99 Seeger, “Bayesian model selection for support vector machines, Gaussian processes and other kernel classifiers,” in Advances in Neural Information Processing – NIPS 12 (S. [sent-374, score-0.21]
100 Seeger, “Using the Nystrom method to speedup kernel machines,” in NIPS 13, MIT Press, 2001. [sent-383, score-0.139]
wordName wordTfidf (topN-words)
[('rvm', 0.501), ('sparseness', 0.263), ('vrvm', 0.256), ('regression', 0.236), ('laplacian', 0.178), ('wbc', 0.16), ('svm', 0.157), ('classi', 0.145), ('lasso', 0.139), ('ard', 0.128), ('prior', 0.119), ('jeffreys', 0.118), ('pima', 0.111), ('crabs', 0.111), ('diag', 0.106), ('nonzero', 0.097), ('ayv', 0.096), ('competitively', 0.096), ('ripley', 0.083), ('ers', 0.08), ('modelling', 0.078), ('bayesian', 0.077), ('probit', 0.076), ('ridge', 0.076), ('gaussian', 0.075), ('kernels', 0.073), ('housing', 0.071), ('kernel', 0.07), ('method', 0.069), ('machines', 0.067), ('priors', 0.066), ('relevance', 0.066), ('figueiredo', 0.064), ('williams', 0.062), ('ca', 0.062), ('shrinkage', 0.061), ('cation', 0.06), ('em', 0.059), ('donoho', 0.056), ('table', 0.055), ('da', 0.054), ('yt', 0.051), ('promote', 0.051), ('performs', 0.051), ('available', 0.049), ('posteriori', 0.048), ('boston', 0.048), ('pruning', 0.047), ('girosi', 0.047), ('selection', 0.047), ('involves', 0.047), ('benchmark', 0.047), ('variational', 0.045), ('tipping', 0.045), ('mse', 0.045), ('inducing', 0.045), ('vw', 0.045), ('adjusted', 0.043), ('outperforms', 0.043), ('seeger', 0.042), ('expresses', 0.042), ('publicly', 0.042), ('components', 0.042), ('synthetic', 0.04), ('partitions', 0.039), ('tuning', 0.038), ('proposed', 0.037), ('adopt', 0.037), ('hierarchical', 0.037), ('ller', 0.035), ('processes', 0.035), ('numbers', 0.034), ('decomposition', 0.034), ('bishop', 0.034), ('mean', 0.034), ('variance', 0.033), ('exhibits', 0.033), ('linear', 0.033), ('nips', 0.032), ('adjusting', 0.032), ('kluwer', 0.032), ('problems', 0.032), ('zero', 0.031), ('irrelevant', 0.031), ('say', 0.031), ('importantly', 0.03), ('accordingly', 0.03), ('noisy', 0.03), ('support', 0.03), ('journal', 0.03), ('iii', 0.03), ('design', 0.029), ('proposal', 0.029), ('logistic', 0.029), ('includes', 0.029), ('studied', 0.029), ('supervised', 0.028), ('mode', 0.028), ('dotted', 0.028), ('vector', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
Author: Mário Figueiredo
Abstract: In this paper we introduce a new sparseness inducing prior which does not involve any (hyper)parameters that need to be adjusted or estimated. Although other applications are possible, we focus here on supervised learning problems: regression and classification. Experiments with several publicly available benchmark data sets show that the proposed approach yields state-of-the-art performance. In particular, our method outperforms support vector machines and performs competitively with the best alternative techniques, both in terms of error rates and sparseness, although it involves no tuning or adjusting of sparsenesscontrolling hyper-parameters.
2 0.15230688 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine
Author: Hiroshi Shimodaira, Ken-ichi Noma, Mitsuru Nakai, Shigeki Sagayama
Abstract: A new class of Support Vector Machine (SVM) that is applicable to sequential-pattern recognition such as speech recognition is developed by incorporating an idea of non-linear time alignment into the kernel function. Since the time-alignment operation of sequential pattern is embedded in the new kernel function, standard SVM training and classification algorithms can be employed without further modifications. The proposed SVM (DTAK-SVM) is evaluated in speaker-dependent speech recognition experiments of hand-segmented phoneme recognition. Preliminary experimental results show comparable recognition performance with hidden Markov models (HMMs). 1
3 0.15210479 46 nips-2001-Categorization by Learning and Combining Object Parts
Author: Bernd Heisele, Thomas Serre, Massimiliano Pontil, Thomas Vetter, Tomaso Poggio
Abstract: We describe an algorithm for automatically learning discriminative components of objects with SVM classifiers. It is based on growing image parts by minimizing theoretical bounds on the error probability of an SVM. Component-based face classifiers are then combined in a second stage to yield a hierarchical SVM classifier. Experimental results in face classification show considerable robustness against rotations in depth and suggest performance at significantly better level than other face detection systems. Novel aspects of our approach are: a) an algorithm to learn component-based classification experts and their combination, b) the use of 3-D morphable models for training, and c) a maximum operation on the output of each component classifier which may be relevant for biological models of visual recognition.
4 0.14062266 21 nips-2001-A Variational Approach to Learning Curves
Author: Dörthe Malzahn, Manfred Opper
Abstract: We combine the replica approach from statistical physics with a variational approach to analyze learning curves analytically. We apply the method to Gaussian process regression. As a main result we derive approximative relations between empirical error measures, the generalization error and the posterior variance.
5 0.13797837 35 nips-2001-Analysis of Sparse Bayesian Learning
Author: Anita C. Faul, Michael E. Tipping
Abstract: The recent introduction of the 'relevance vector machine' has effectively demonstrated how sparsity may be obtained in generalised linear models within a Bayesian framework. Using a particular form of Gaussian parameter prior, 'learning' is the maximisation, with respect to hyperparameters, of the marginal likelihood of the data. This paper studies the properties of that objective function, and demonstrates that conditioned on an individual hyperparameter, the marginal likelihood has a unique maximum which is computable in closed form. It is further shown that if a derived 'sparsity criterion' is satisfied, this maximum is exactly equivalent to 'pruning' the corresponding parameter from the model. 1
6 0.12752675 129 nips-2001-Multiplicative Updates for Classification by Mixture Models
7 0.12694982 8 nips-2001-A General Greedy Approximation Algorithm with Applications
8 0.12050248 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines
9 0.11996999 139 nips-2001-Online Learning with Kernels
10 0.11547352 58 nips-2001-Covariance Kernels from Bayesian Generative Models
11 0.11189935 20 nips-2001-A Sequence Kernel and its Application to Speaker Recognition
12 0.10976133 43 nips-2001-Bayesian time series classification
13 0.10975176 104 nips-2001-Kernel Logistic Regression and the Import Vector Machine
14 0.10688245 50 nips-2001-Classifying Single Trial EEG: Towards Brain Computer Interfacing
15 0.10202624 164 nips-2001-Sampling Techniques for Kernel Methods
16 0.098747432 137 nips-2001-On the Convergence of Leveraging
17 0.093841806 106 nips-2001-Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering
18 0.093196623 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines
19 0.091746211 16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems
20 0.090862848 62 nips-2001-Duality, Geometry, and Support Vector Regression
topicId topicWeight
[(0, -0.288), (1, 0.154), (2, -0.062), (3, 0.079), (4, -0.035), (5, 0.031), (6, 0.062), (7, -0.023), (8, 0.044), (9, 0.033), (10, 0.068), (11, -0.014), (12, -0.029), (13, -0.136), (14, -0.002), (15, 0.003), (16, 0.034), (17, 0.01), (18, 0.051), (19, -0.039), (20, 0.05), (21, -0.089), (22, 0.106), (23, 0.054), (24, 0.106), (25, 0.024), (26, -0.042), (27, -0.039), (28, 0.029), (29, 0.014), (30, 0.107), (31, 0.009), (32, -0.012), (33, 0.074), (34, 0.123), (35, 0.02), (36, 0.028), (37, -0.006), (38, -0.031), (39, 0.001), (40, -0.104), (41, -0.014), (42, 0.028), (43, -0.035), (44, -0.126), (45, 0.016), (46, -0.119), (47, -0.107), (48, 0.066), (49, 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.94621682 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
Author: Mário Figueiredo
Abstract: In this paper we introduce a new sparseness inducing prior which does not involve any (hyper)parameters that need to be adjusted or estimated. Although other applications are possible, we focus here on supervised learning problems: regression and classification. Experiments with several publicly available benchmark data sets show that the proposed approach yields state-of-the-art performance. In particular, our method outperforms support vector machines and performs competitively with the best alternative techniques, both in terms of error rates and sparseness, although it involves no tuning or adjusting of sparsenesscontrolling hyper-parameters.
2 0.79099303 104 nips-2001-Kernel Logistic Regression and the Import Vector Machine
Author: Ji Zhu, Trevor Hastie
Abstract: The support vector machine (SVM) is known for its good performance in binary classification, but its extension to multi-class classification is still an on-going research issue. In this paper, we propose a new approach for classification, called the import vector machine (IVM), which is built on kernel logistic regression (KLR). We show that the IVM not only performs as well as the SVM in binary classification, but also can naturally be generalized to the multi-class case. Furthermore, the IVM provides an estimate of the underlying probability. Similar to the “support points” of the SVM, the IVM model uses only a fraction of the training data to index kernel basis functions, typically a much smaller fraction than the SVM. This gives the IVM a computational advantage over the SVM, especially when the size of the training data set is large.
3 0.6609053 35 nips-2001-Analysis of Sparse Bayesian Learning
Author: Anita C. Faul, Michael E. Tipping
Abstract: The recent introduction of the 'relevance vector machine' has effectively demonstrated how sparsity may be obtained in generalised linear models within a Bayesian framework. Using a particular form of Gaussian parameter prior, 'learning' is the maximisation, with respect to hyperparameters, of the marginal likelihood of the data. This paper studies the properties of that objective function, and demonstrates that conditioned on an individual hyperparameter, the marginal likelihood has a unique maximum which is computable in closed form. It is further shown that if a derived 'sparsity criterion' is satisfied, this maximum is exactly equivalent to 'pruning' the corresponding parameter from the model. 1
4 0.64586508 178 nips-2001-TAP Gibbs Free Energy, Belief Propagation and Sparsity
Author: Lehel Csató, Manfred Opper, Ole Winther
Abstract: The adaptive TAP Gibbs free energy for a general densely connected probabilistic model with quadratic interactions and arbritary single site constraints is derived. We show how a specific sequential minimization of the free energy leads to a generalization of Minka’s expectation propagation. Lastly, we derive a sparse representation version of the sequential algorithm. The usefulness of the approach is demonstrated on classification and density estimation with Gaussian processes and on an independent component analysis problem.
5 0.63037771 129 nips-2001-Multiplicative Updates for Classification by Mixture Models
Author: Lawrence K. Saul, Daniel D. Lee
Abstract: We investigate a learning algorithm for the classification of nonnegative data by mixture models. Multiplicative update rules are derived that directly optimize the performance of these models as classifiers. The update rules have a simple closed form and an intuitive appeal. Our algorithm retains the main virtues of the Expectation-Maximization (EM) algorithm—its guarantee of monotonic improvement, and its absence of tuning parameters—with the added advantage of optimizing a discriminative objective function. The algorithm reduces as a special case to the method of generalized iterative scaling for log-linear models. The learning rate of the algorithm is controlled by the sparseness of the training data. We use the method of nonnegative matrix factorization (NMF) to discover sparse distributed representations of the data. This form of feature selection greatly accelerates learning and makes the algorithm practical on large problems. Experiments show that discriminatively trained mixture models lead to much better classification than comparably sized models trained by EM. 1
6 0.62184942 21 nips-2001-A Variational Approach to Learning Curves
7 0.60972536 20 nips-2001-A Sequence Kernel and its Application to Speaker Recognition
8 0.60320348 79 nips-2001-Gaussian Process Regression with Mismatched Models
9 0.6014204 159 nips-2001-Reducing multiclass to binary by coupling probability estimates
10 0.54609919 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine
11 0.54190844 60 nips-2001-Discriminative Direction for Kernel Classifiers
12 0.52383417 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines
13 0.51111966 8 nips-2001-A General Greedy Approximation Algorithm with Applications
14 0.50786132 43 nips-2001-Bayesian time series classification
15 0.50715351 95 nips-2001-Infinite Mixtures of Gaussian Process Experts
16 0.50334072 46 nips-2001-Categorization by Learning and Combining Object Parts
17 0.50221616 139 nips-2001-Online Learning with Kernels
18 0.48532566 50 nips-2001-Classifying Single Trial EEG: Towards Brain Computer Interfacing
19 0.48253411 76 nips-2001-Fast Parameter Estimation Using Green's Functions
20 0.47749257 70 nips-2001-Estimating Car Insurance Premia: a Case Study in High-Dimensional Data Inference
topicId topicWeight
[(3, 0.016), (14, 0.048), (17, 0.026), (19, 0.028), (24, 0.029), (27, 0.144), (30, 0.078), (38, 0.161), (49, 0.022), (59, 0.049), (72, 0.118), (79, 0.06), (83, 0.022), (91, 0.104)]
simIndex simValue paperId paperTitle
1 0.93083137 19 nips-2001-A Rotation and Translation Invariant Discrete Saliency Network
Author: Lance R. Williams, John W. Zweck
Abstract: We describe a neural network which enhances and completes salient closed contours. Our work is different from all previous work in three important ways. First, like the input provided to V1 by LGN, the input to our computation is isotropic. That is, the input is composed of spots not edges. Second, our network computes a well defined function of the input based on a distribution of closed contours characterized by a random process. Third, even though our computation is implemented in a discrete network, its output is invariant to continuous rotations and translations of the input pattern.
2 0.92868423 72 nips-2001-Exact differential equation population dynamics for integrate-and-fire neurons
Author: Julian Eggert, Berthold Bäuml
Abstract: Mesoscopical, mathematical descriptions of dynamics of populations of spiking neurons are getting increasingly important for the understanding of large-scale processes in the brain using simulations. In our previous work, integral equation formulations for population dynamics have been derived for a special type of spiking neurons. For Integrate- and- Fire type neurons , these formulations were only approximately correct. Here, we derive a mathematically compact, exact population dynamics formulation for Integrate- and- Fire type neurons. It can be shown quantitatively in simulations that the numerical correspondence with microscopically modeled neuronal populations is excellent. 1 Introduction and motivation The goal of the population dynamics approach is to model the time course of the collective activity of entire populations of functionally and dynamically similar neurons in a compact way, using a higher descriptionallevel than that of single neurons and spikes. The usual observable at the level of neuronal populations is the populationaveraged instantaneous firing rate A(t), with A(t)6.t being the number of neurons in the population that release a spike in an interval [t, t+6.t). Population dynamics are formulated in such a way, that they match quantitatively the time course of a given A(t), either gained experimentally or by microscopical, detailed simulation. At least three main reasons can be formulated which underline the importance of the population dynamics approach for computational neuroscience. First, it enables the simulation of extensive networks involving a massive number of neurons and connections, which is typically the case when dealing with biologically realistic functional models that go beyond the single neuron level. Second, it increases the analytical understanding of large-scale neuronal dynamics , opening the way towards better control and predictive capabilities when dealing with large networks. Third, it enables a systematic embedding of the numerous neuronal models operating at different descriptional scales into a generalized theoretic framework, explaining the relationships, dependencies and derivations of the respective models. Early efforts on population dynamics approaches date back as early as 1972, to the work of Wilson and Cowan [8] and Knight [4], which laid the basis for all current population-averaged graded-response models (see e.g. [6] for modeling work using these models). More recently, population-based approaches for spiking neurons were developed, mainly by Gerstner [3, 2] and Knight [5]. In our own previous work [1], we have developed a theoretical framework which enables to systematize and simulate a wide range of models for population-based dynamics. It was shown that the equations of the framework produce results that agree quantitatively well with detailed simulations using spiking neurons, so that they can be used for realistic simulations involving networks with large numbers of spiking neurons. Nevertheless, for neuronal populations composed of Integrate-and-Fire (I&F;) neurons, this framework was only correct in an approximation. In this paper, we derive the exact population dynamics formulation for I&F; neurons. This is achieved by reducing the I&F; population dynamics to a point process and by taking advantage of the particular properties of I&F; neurons. 
2 2.1 Background: Integrate-and-Fire dynamics Differential form We start with the standard Integrate- and- Fire (I&F;) model in form of the wellknown differential equation [7] (1) which describes the dynamics of the membrane potential Vi of a neuron i that is modeled as a single compartment with RC circuit characteristics. The membrane relaxation time is in this case T = RC with R being the membrane resistance and C the membrane capacitance. The resting potential v R est is the stationary potential that is approached in the no-input case. The input arriving from other neurons is described in form of a current ji. In addition to eq. (1), which describes the integrate part of the I&F; model, the neuronal dynamics are completed by a nonlinear step. Every time the membrane potential Vi reaches a fixed threshold () from below, Vi is lowered by a fixed amount Ll > 0, and from the new value of the membrane potential integration according to eq. (1) starts again. if Vi(t) = () (from below) . (2) At the same time, it is said that the release of a spike occurred (i.e., the neuron fired), and the time ti = t of this singular event is stored. Here ti indicates the time of the most recent spike. Storing all the last firing times , we gain the sequence of spikes {t{} (spike ordering index j, neuronal index i). 2.2 Integral form Now we look at the single neuron in a neuronal compound. We assume that the input current contribution ji from presynaptic spiking neurons can be described using the presynaptic spike times tf, a response-function ~ and a connection weight W¡ . ',J ji(t) = Wi ,j ~(t - tf) (3) l: l: j f Integrating the I&F; equation (1) beginning at the last spiking time tT, which determines the initial condition by Vi(ti) = vi(ti - 0) - 6., where vi(ti - 0) is the membrane potential just before the neuron spikes, we get 1 Vi(t) = v Rest + fj(t - t:) + l: Wi ,j l: a(t - t:; t - tf) , j - Vi(t:)) e- S / T (4) f with the refractory function fj(s) = - (v Rest (5) and the alpha-function r ds
same-paper 3 0.90925646 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
Author: Mário Figueiredo
Abstract: In this paper we introduce a new sparseness inducing prior which does not involve any (hyper)parameters that need to be adjusted or estimated. Although other applications are possible, we focus here on supervised learning problems: regression and classification. Experiments with several publicly available benchmark data sets show that the proposed approach yields state-of-the-art performance. In particular, our method outperforms support vector machines and performs competitively with the best alternative techniques, both in terms of error rates and sparseness, although it involves no tuning or adjusting of sparsenesscontrolling hyper-parameters.
4 0.88429976 27 nips-2001-Activity Driven Adaptive Stochastic Resonance
Author: Gregor Wenning, Klaus Obermayer
Abstract: Cortical neurons might be considered as threshold elements integrating in parallel many excitatory and inhibitory inputs. Due to the apparent variability of cortical spike trains this yields a strongly fluctuating membrane potential, such that threshold crossings are highly irregular. Here we study how a neuron could maximize its sensitivity w.r.t. a relatively small subset of excitatory input. Weak signals embedded in fluctuations is the natural realm of stochastic resonance. The neuron's response is described in a hazard-function approximation applied to an Ornstein-Uhlenbeck process. We analytically derive an optimality criterium and give a learning rule for the adjustment of the membrane fluctuations, such that the sensitivity is maximal exploiting stochastic resonance. We show that adaptation depends only on quantities that could easily be estimated locally (in space and time) by the neuron. The main results are compared with simulations of a biophysically more realistic neuron model. 1
5 0.87237942 80 nips-2001-Generalizable Relational Binding from Coarse-coded Distributed Representations
Author: Randall C. O'Reilly, R. S. Busby
Abstract: We present a model of binding of relationship information in a spatial domain (e.g., square above triangle) that uses low-order coarse-coded conjunctive representations instead of more popular temporal synchrony mechanisms. Supporters of temporal synchrony argue that conjunctive representations lack both efficiency (i.e., combinatorial numbers of units are required) and systematicity (i.e., the resulting representations are overly specific and thus do not support generalization to novel exemplars). To counter these claims, we show that our model: a) uses far fewer hidden units than the number of conjunctions represented, by using coarse-coded, distributed representations where each unit has a broad tuning curve through high-dimensional conjunction space, and b) is capable of considerable generalization to novel inputs.
6 0.84917396 57 nips-2001-Correlation Codes in Neuronal Populations
8 0.84011298 46 nips-2001-Categorization by Learning and Combining Object Parts
9 0.83934385 131 nips-2001-Neural Implementation of Bayesian Inference in Population Codes
10 0.83672464 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines
11 0.82980382 197 nips-2001-Why Neuronal Dynamics Should Control Synaptic Learning Rules
12 0.8257581 153 nips-2001-Product Analysis: Learning to Model Observations as Products of Hidden Variables
13 0.82452226 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines
14 0.81979692 171 nips-2001-Spectral Relaxation for K-means Clustering
15 0.8182286 185 nips-2001-The Method of Quantum Clustering
16 0.81620729 155 nips-2001-Quantizing Density Estimators
17 0.81459761 95 nips-2001-Infinite Mixtures of Gaussian Process Experts
18 0.81422675 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding
19 0.81411058 13 nips-2001-A Natural Policy Gradient
20 0.81405854 182 nips-2001-The Fidelity of Local Ordinal Encoding